
Showing 1–50 of 123 results for author: Zou, C

Searching in archive cs.
  1. arXiv:2507.04716  [pdf, ps, other]

    stat.ML cs.LG stat.ME

    Optimal Model Selection for Conformalized Robust Optimization

    Authors: Yajie Bao, Yang Hu, Haojie Ren, Peng Zhao, Changliang Zou

    Abstract: In decision-making under uncertainty, Contextual Robust Optimization (CRO) provides reliability by minimizing the worst-case decision loss over a prediction set, hedging against label variability. While recent advances use conformal prediction to construct prediction sets for machine learning models, the downstream decisions critically depend on model selection. This paper introduces novel model s…

    Submitted 7 July, 2025; originally announced July 2025.

  2. arXiv:2507.00006  [pdf, ps, other]

    cs.GR cs.LG eess.IV

    MVGBench: Comprehensive Benchmark for Multi-view Generation Models

    Authors: Xianghui Xie, Chuhang Zou, Meher Gitika Karumuri, Jan Eric Lenssen, Gerard Pons-Moll

    Abstract: We propose MVGBench, a comprehensive benchmark for multi-view image generation models (MVGs) that evaluates 3D consistency in geometry and texture, image quality, and semantics (using vision language models). Recently, MVGs have been the main driving force in 3D object creation. However, existing metrics compare generated images against ground truth target views, which is not suitable for generati…

    Submitted 11 June, 2025; originally announced July 2025.

    Comments: 17 pages, 11 figures, 9 tables, project page: https://virtualhumans.mpi-inf.mpg.de/MVGBench/

  3. arXiv:2506.23121  [pdf, ps, other]

    eess.IV cs.AI cs.CV cs.LG

    CRISP-SAM2: SAM2 with Cross-Modal Interaction and Semantic Prompting for Multi-Organ Segmentation

    Authors: Xinlei Yu, Changmiao Wang, Hui Jin, Ahmed Elazab, Gangyong Jia, Xiang Wan, Changqing Zou, Ruiquan Ge

    Abstract: Multi-organ medical segmentation is a crucial component of medical image processing, essential for doctors to make accurate diagnoses and develop effective treatment plans. Despite significant progress in this field, current multi-organ segmentation models often suffer from inaccurate details, dependence on geometric prompts and loss of spatial information. Addressing these challenges, we introduc…

    Submitted 5 July, 2025; v1 submitted 29 June, 2025; originally announced June 2025.

    Comments: Accepted by ACMMM25

  4. arXiv:2506.21270  [pdf, ps, other]

    cs.CV

    Video Virtual Try-on with Conditional Diffusion Transformer Inpainter

    Authors: Cheng Zou, Senlin Cheng, Bolei Xu, Dandan Zheng, Xiaobo Li, Jingdong Chen, Ming Yang

    Abstract: Video virtual try-on aims to naturally fit a garment to a target person in consecutive video frames. It is a challenging task: on the one hand, the output video should have good spatial-temporal consistency; on the other hand, the details of the given garment need to be preserved well in all the frames. Naively using image-based try-on methods frame by frame can get poor results due to severe inc…

    Submitted 26 June, 2025; originally announced June 2025.

    Comments: 10 pages, 6 figures

  5. arXiv:2506.19420  [pdf, other]

    cs.AI

    Commander-GPT: Dividing and Routing for Multimodal Sarcasm Detection

    Authors: Yazhou Zhang, Chunwang Zou, Bo Wang, Jing Qin

    Abstract: Multimodal sarcasm understanding is a high-order cognitive task. Although large language models (LLMs) have shown impressive performance on many downstream NLP tasks, growing evidence suggests that they struggle with sarcasm understanding. In this paper, we propose Commander-GPT, a modular decision routing framework inspired by military command theory. Rather than relying on a single LLM's capabil…

    Submitted 24 June, 2025; originally announced June 2025.

  6. arXiv:2506.16701  [pdf, ps, other]

    cs.CV

    Language-driven Description Generation and Common Sense Reasoning for Video Action Recognition

    Authors: Xiaodan Hu, Chuhang Zou, Suchen Wang, Jaechul Kim, Narendra Ahuja

    Abstract: Recent video action recognition methods have shown excellent performance by adapting large-scale pre-trained language-image models to the video domain. However, language models contain rich common sense priors - the scene contexts that humans use to constitute an understanding of objects, human-object interactions, and activities - that have not been fully exploited. In this paper, we introduce a…

    Submitted 19 June, 2025; originally announced June 2025.

  7. arXiv:2506.10100  [pdf, ps, other]

    cs.CV

    EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models

    Authors: Yantai Yang, Yuhao Wang, Zichen Wen, Luo Zhongwei, Chang Zou, Zhipeng Zhang, Chuan Wen, Linfeng Zhang

    Abstract: Vision-Language-Action (VLA) models, particularly diffusion-based architectures, demonstrate transformative potential for embodied intelligence but are severely hampered by high computational and memory demands stemming from extensive inherent and inference-time redundancies. While existing acceleration efforts often target isolated inefficiencies, such piecemeal solutions typically fail to holist…

    Submitted 11 June, 2025; originally announced June 2025.

  8. arXiv:2506.09344  [pdf, ps, other]

    cs.AI cs.CL cs.CV cs.LG cs.SD eess.AS

    Ming-Omni: A Unified Multimodal Model for Perception and Generation

    Authors: Inclusion AI, Biao Gong, Cheng Zou, Chuanyang Zheng, Chunluan Zhou, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Dandan Zheng, Fudong Wang, Furong Xu, GuangMing Yao, Jun Zhou, Jingdong Chen, Jianxin Sun, Jiajia Liu, Jianjiang Zhu, Jun Peng, Kaixiang Ji, Kaiyou Song, Kaimeng Ren, Libin Wang, Lixiang Ru, Lele Xie, Longhua Tan, et al. (33 additional authors not shown)

    Abstract: We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-Omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single…

    Submitted 10 June, 2025; originally announced June 2025.

    Comments: 18 pages, 8 figures

  9. arXiv:2506.07315  [pdf, ps, other]

    q-fin.ST cs.AI

    Towards Competent AI for Fundamental Analysis in Finance: A Benchmark Dataset and Evaluation

    Authors: Zonghan Wu, Junlin Wang, Congyuan Zou, Chenhan Wang, Yilei Shao

    Abstract: Generative AI, particularly large language models (LLMs), is beginning to transform the financial industry by automating tasks and helping to make sense of complex financial information. One especially promising use case is the automatic creation of fundamental analysis reports, which are essential for making informed investment decisions, evaluating credit risks, guiding corporate mergers, etc. W…

    Submitted 22 May, 2025; originally announced June 2025.

  10. arXiv:2506.07136  [pdf, ps, other]

    cs.CV

    Hi-VAE: Efficient Video Autoencoding with Global and Detailed Motion

    Authors: Huaize Liu, Wenzhang Sun, Qiyuan Zhang, Donglin Di, Biao Gong, Hao Li, Chen Wei, Changqing Zou

    Abstract: Recent breakthroughs in video autoencoders (Video AEs) have advanced video generation, but existing methods fail to efficiently model spatio-temporal redundancies in dynamics, resulting in suboptimal compression factors. This shortfall leads to excessive training costs for downstream tasks. To address this, we introduce Hi-VAE, an efficient video autoencoding framework that hierarchically encodes c…

    Submitted 8 June, 2025; originally announced June 2025.

  11. arXiv:2506.06295  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching

    Authors: Zhiyuan Liu, Yicun Yang, Yaojie Zhang, Junjie Chen, Chang Zou, Qingyuan Wei, Shaobo Wang, Linfeng Zhang

    Abstract: Autoregressive Models (ARMs) have long dominated the landscape of Large Language Models. Recently, a new paradigm has emerged in the form of diffusion-based Large Language Models (dLLMs), which generate text by iteratively denoising masked segments. This approach has shown significant advantages and potential. However, dLLMs suffer from high inference latency. Traditional ARM acceleration techniqu…

    Submitted 17 May, 2025; originally announced June 2025.

  12. arXiv:2506.05762  [pdf, other]

    cs.LG

    BiTrajDiff: Bidirectional Trajectory Generation with Diffusion Models for Offline Reinforcement Learning

    Authors: Yunpeng Qing, Shuo Chen, Yixiao Chi, Shunyu Liu, Sixu Lin, Changqing Zou

    Abstract: Recent advances in offline Reinforcement Learning (RL) have proven that effective policy learning can benefit from imposing conservative constraints on pre-collected datasets. However, such static datasets often exhibit distribution bias, resulting in limited generalizability. To address this limitation, a straightforward solution is data augmentation (DA), which leverages generative models to enr…

    Submitted 6 June, 2025; originally announced June 2025.

  13. arXiv:2505.23272  [pdf, ps, other]

    cs.CV

    Are MLMs Trapped in the Visual Room?

    Authors: Yazhou Zhang, Chunwang Zou, Qimeng Liu, Lu Rong, Ben Yao, Zheng Lian, Qiuchi Li, Peng Zhang, Jing Qin

    Abstract: Can multi-modal large models (MLMs) that can "see" an image be said to "understand" it? Drawing inspiration from Searle's Chinese Room, we propose the Visual Room argument: a system may process and describe every detail of visual inputs by following algorithmic rules, without genuinely comprehending the underlying intention. This dilemma challenges the prevailing assumption that perce…

    Submitted 30 May, 2025; v1 submitted 29 May, 2025; originally announced May 2025.

    Comments: 19 pages

  14. arXiv:2505.21457  [pdf, ps, other]

    cs.CV cs.AI

    Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO

    Authors: Muzhi Zhu, Hao Zhong, Canyu Zhao, Zongze Du, Zheng Huang, Mingyu Liu, Hao Chen, Cheng Zou, Jingdong Chen, Ming Yang, Chunhua Shen

    Abstract: Active vision, also known as active perception, refers to the process of actively selecting where and how to look in order to gather task-relevant information. It is a critical component of efficient perception and decision-making in humans and advanced embodied agents. Recently, the use of Multimodal Large Language Models (MLLMs) as central planning and decision-making modules in robotic systems…

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: Project Page: https://aim-uofa.github.io/ACTIVE-o3

  15. arXiv:2505.19147  [pdf, ps, other]

    cs.CL cs.AI cs.CV

    Shifting AI Efficiency From Model-Centric to Data-Centric Compression

    Authors: Xuyang Liu, Zichen Wen, Shaobo Wang, Junjie Chen, Zhishan Tao, Yubo Wang, Xiangqi Jin, Chang Zou, Yiyu Wang, Chenfei Liao, Xu Zheng, Honggang Chen, Weijia Li, Xuming Hu, Conghui He, Linfeng Zhang

    Abstract: The rapid advancement of large language models (LLMs) and multi-modal LLMs (MLLMs) has historically relied on model-centric scaling through increasing parameter counts from millions to hundreds of billions to drive performance gains. However, as we approach hardware limits on model size, the dominant computational bottleneck has fundamentally shifted to the quadratic cost of self-attention over lo…

    Submitted 25 May, 2025; originally announced May 2025.

    Comments: Project: https://github.com/xuyang-liu16/Awesome-Token-level-Model-Compression

  16. arXiv:2505.18926  [pdf, ps, other]

    cs.LG physics.flu-dyn

    Hybrid Neural-MPM for Interactive Fluid Simulations in Real-Time

    Authors: Jingxuan Xu, Hong Huang, Chuhang Zou, Manolis Savva, Yunchao Wei, Wuyang Chen

    Abstract: We propose a neural physics system for real-time, interactive fluid simulations. Traditional physics-based methods, while accurate, are computationally intensive and suffer from latency issues. Recent machine-learning methods reduce computational costs while preserving fidelity; yet most still fail to satisfy the latency constraints for real-time use and lack support for interactive applications.…

    Submitted 24 May, 2025; originally announced May 2025.

  17. arXiv:2505.11992  [pdf, ps, other]

    cs.CV

    SpatialCrafter: Unleashing the Imagination of Video Diffusion Models for Scene Reconstruction from Limited Observations

    Authors: Songchun Zhang, Huiyao Xu, Sitong Guo, Zhongwei Xie, Pengwei Liu, Hujun Bao, Weiwei Xu, Changqing Zou

    Abstract: Novel view synthesis (NVS) boosts immersive experiences in computer vision and graphics. Existing techniques, though progressed, rely on dense multi-view observations, restricting their application. This work takes on the challenge of reconstructing photorealistic 3D scenes from sparse or single-view inputs. We introduce SpatialCrafter, a framework that leverages the rich knowledge in video diffus…

    Submitted 17 May, 2025; originally announced May 2025.

    Comments: 18 pages, 16 figures

  18. arXiv:2505.04986  [pdf, other]

    stat.ML cs.LG

    Conformal Prediction with Cellwise Outliers: A Detect-then-Impute Approach

    Authors: Qian Peng, Yajie Bao, Haojie Ren, Zhaojun Wang, Changliang Zou

    Abstract: Conformal prediction is a powerful tool for constructing prediction intervals for black-box models, providing a finite sample coverage guarantee for exchangeable data. However, this exchangeability is compromised when some entries of the test feature are contaminated, such as in the case of cellwise outliers. To address this issue, this paper introduces a novel framework called detect-then-impute…

    Submitted 8 May, 2025; originally announced May 2025.

    Comments: 23 pages, 15 figures

  19. arXiv:2505.02471  [pdf, ps, other]

    cs.CV

    Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction

    Authors: Inclusion AI, Biao Gong, Cheng Zou, Dandan Zheng, Hu Yu, Jingdong Chen, Jianxin Sun, Junbo Zhao, Jun Zhou, Kaixiang Ji, Lixiang Ru, Libin Wang, Qingpei Guo, Rui Liu, Weilong Chai, Xinyu Xiao, Ziyuan Huang

    Abstract: We introduce Ming-Lite-Uni, an open-source multimodal framework featuring a newly designed unified visual generator and a native multimodal autoregressive model tailored for unifying vision and language. Specifically, this project provides an open-source implementation of the integrated MetaQueries and M2-omni framework, while introducing the novel multi-scale learnable tokens and multi-scale repr…

    Submitted 12 June, 2025; v1 submitted 5 May, 2025; originally announced May 2025.

    Comments: https://github.com/inclusionAI/Ming/tree/Ming-Lite-Omni-Preview/Ming-unify

  20. arXiv:2504.10331  [pdf, other]

    cs.CV

    LL-Gaussian: Low-Light Scene Reconstruction and Enhancement via Gaussian Splatting for Novel View Synthesis

    Authors: Hao Sun, Fenggen Yu, Huiyao Xu, Tao Zhang, Changqing Zou

    Abstract: Novel view synthesis (NVS) in low-light scenes remains a significant challenge due to degraded inputs characterized by severe noise, low dynamic range (LDR) and unreliable initialization. While recent NeRF-based approaches have shown promising results, most suffer from high computational costs, and some rely on carefully captured or pre-processed data--such as RAW sensor inputs or multi-exposure s…

    Submitted 19 April, 2025; v1 submitted 14 April, 2025; originally announced April 2025.

    Comments: Project page: https://sunhao242.github.io/LL-Gaussian_web.github.io/

  21. arXiv:2503.18681   

    cs.CL cs.AI

    Commander-GPT: Fully Unleashing the Sarcasm Detection Capability of Multi-Modal Large Language Models

    Authors: Yazhou Zhang, Chunwang Zou, Bo Wang, Jing Qin

    Abstract: Sarcasm detection, as a crucial research direction in the field of Natural Language Processing (NLP), has attracted widespread attention. Traditional sarcasm detection tasks have typically focused on single-modal approaches (e.g., text), but due to the implicit and subtle nature of sarcasm, such methods often fail to yield satisfactory results. In recent years, researchers have shifted the focus o…

    Submitted 3 July, 2025; v1 submitted 24 March, 2025; originally announced March 2025.

    Comments: Our original goal was to use Commander-GPT: Dividing and Routing for Multimodal Sarcasm Detection (arXiv:2506.19420) to replace Commander-GPT: Fully Unleashing the Sarcasm Detection Capability of Multi-Modal Large Language Models (arXiv:2503.18681). Due to various reasons, both versions were released, so we would like to withdraw the latter

  22. arXiv:2503.15013  [pdf, other]

    physics.geo-ph cs.LG

    Ambient Noise Full Waveform Inversion with Neural Operators

    Authors: Caifeng Zou, Zachary E. Ross, Robert W. Clayton, Fan-Chi Lin, Kamyar Azizzadenesheli

    Abstract: Numerical simulations of seismic wave propagation are crucial for investigating velocity structures and improving seismic hazard assessment. However, standard methods such as finite difference or finite element are computationally expensive. Recent studies have shown that a new class of machine learning models, called neural operators, can solve the elastodynamic wave equation orders of magnitude…

    Submitted 25 March, 2025; v1 submitted 19 March, 2025; originally announced March 2025.

    Comments: Added references

  23. arXiv:2503.11043  [pdf, other]

    cs.LG

    InverseBench: Benchmarking Plug-and-Play Diffusion Priors for Inverse Problems in Physical Sciences

    Authors: Hongkai Zheng, Wenda Chu, Bingliang Zhang, Zihui Wu, Austin Wang, Berthy T. Feng, Caifeng Zou, Yu Sun, Nikola Kovachki, Zachary E. Ross, Katherine L. Bouman, Yisong Yue

    Abstract: Plug-and-play diffusion priors (PnPDP) have emerged as a promising research direction for solving inverse problems. However, current studies primarily focus on natural image restoration, leaving the performance of these algorithms in scientific inverse problems largely unexplored. To address this gap, we introduce InverseBench, a framework that evaluates diffusion models across five dis…

    Submitted 13 March, 2025; originally announced March 2025.

  24. arXiv:2503.10270  [pdf, other]

    cs.CV

    EEdit: Rethinking the Spatial and Temporal Redundancy for Efficient Image Editing

    Authors: Zexuan Yan, Yue Ma, Chang Zou, Wenteng Chen, Qifeng Chen, Linfeng Zhang

    Abstract: Inversion-based image editing is rapidly gaining momentum while suffering from significant computation overhead, hindering its application in real-time interactive scenarios. In this paper, we observe that the redundancy in inversion-based image editing exists in both the spatial and temporal dimensions, such as the unnecessary computation in unedited regions and the redundancy in the inversion pr…

    Submitted 30 March, 2025; v1 submitted 13 March, 2025; originally announced March 2025.

    Comments: 17 pages; fixed figure mistake (inv/fwd skipping) in Fig. 2

  25. arXiv:2503.10096  [pdf, other]

    cs.CV

    A Self-supervised Motion Representation for Portrait Video Generation

    Authors: Qiyuan Zhang, Chenyu Wu, Wenzhang Sun, Huaize Liu, Donglin Di, Wei Chen, Changqing Zou

    Abstract: Recent advancements in portrait video generation have been noteworthy. However, existing methods rely heavily on human priors and pre-trained generative models. Motion representations based on human priors may introduce unrealistic motion, while methods relying on pre-trained generative models often suffer from inefficient inference. To address these challenges, we propose Semantic Latent Motion (…

    Submitted 13 June, 2025; v1 submitted 13 March, 2025; originally announced March 2025.

  26. arXiv:2503.06923  [pdf, other]

    cs.CV cs.AI

    From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers

    Authors: Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Junjie Chen, Linfeng Zhang

    Abstract: Diffusion Transformers (DiT) have revolutionized high-fidelity image and video synthesis, yet their computational demands remain prohibitive for real-time applications. To solve this problem, feature caching has been proposed to accelerate diffusion models by caching the features in the previous timesteps and then reusing them in the following timesteps. However, at timesteps with significant inte…

    Submitted 10 March, 2025; originally announced March 2025.

    Comments: 13 pages, 14 figures

  27. arXiv:2503.05484  [pdf, other]

    cs.GR cs.CV

    DecoupledGaussian: Object-Scene Decoupling for Physics-Based Interaction

    Authors: Miaowei Wang, Yibo Zhang, Rui Ma, Weiwei Xu, Changqing Zou, Daniel Morris

    Abstract: We present DecoupledGaussian, a novel system that decouples static objects from their contacted surfaces captured in in-the-wild videos, a key prerequisite for realistic Newtonian-based physical simulations. Unlike prior methods focused on synthetic data or elastic jittering along the contact surface, which prevent objects from fully detaching or moving independently, DecoupledGaussian allows for sig…

    Submitted 7 March, 2025; originally announced March 2025.

    Comments: CVPR2025 Accepted

  28. arXiv:2502.06820  [pdf, other]

    cs.LG cs.AI

    LoCA: Location-Aware Cosine Adaptation for Parameter-Efficient Fine-Tuning

    Authors: Zhekai Du, Yinjie Min, Jingjing Li, Ke Lu, Changliang Zou, Liuhua Peng, Tingjin Chu, Mingming Gong

    Abstract: Low-rank adaptation (LoRA) has become a prevalent method for adapting pre-trained large language models to downstream tasks. However, the simple low-rank decomposition form may constrain the hypothesis space. To address this limitation, we introduce Location-aware Cosine Adaptation (LoCA), a novel frequency-domain parameter-efficient fine-tuning method based on inverse Discrete Cosine Transform (i…

    Submitted 29 April, 2025; v1 submitted 4 February, 2025; originally announced February 2025.

  29. arXiv:2502.00818  [pdf, other]

    stat.ML cs.LG

    Error-quantified Conformal Inference for Time Series

    Authors: Junxi Wu, Dongjian Hu, Yajie Bao, Shu-Tao Xia, Changliang Zou

    Abstract: Uncertainty quantification in time series prediction is challenging due to the temporal dependence and distribution shift on sequential data. Conformal inference provides a pivotal and flexible instrument for assessing the uncertainty of machine learning models through prediction sets. Recently, a series of online conformal inference methods updated thresholds of prediction sets by performing onli…

    Submitted 2 February, 2025; originally announced February 2025.

    Comments: ICLR 2025 camera version

  30. arXiv:2502.00253  [pdf, other]

    eess.IV cs.CV

    Patch Triplet Similarity Purification for Guided Real-World Low-Dose CT Image Denoising

    Authors: Junhao Long, Fengwei Yang, Juncheng Yan, Baoping Zhang, Chao Jin, Jian Yang, Changliang Zou, Jun Xu

    Abstract: Image denoising of low-dose computed tomography (LDCT) is an important problem for clinical diagnosis with reduced radiation exposure. Previous methods are mostly trained with pairs of synthetic or misaligned LDCT and normal-dose CT (NDCT) images. However, trained with synthetic noise or misaligned LDCT/NDCT image pairs, the denoising networks would suffer from blurry structure or motion artifacts…

    Submitted 31 January, 2025; originally announced February 2025.

  31. arXiv:2501.14249  [pdf, other]

    cs.LG cs.AI cs.CL

    Humanity's Last Exam

    Authors: Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Dmitry Dodonov, Tung Nguyen, Jaeho Lee, Daron Anderson, Mikhail Doroshenko, Alun Cennyth Stokes, et al. (1084 additional authors not shown)

    Abstract: Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of…

    Submitted 19 April, 2025; v1 submitted 24 January, 2025; originally announced January 2025.

    Comments: 29 pages, 6 figures

  32. arXiv:2501.11430  [pdf, other]

    cs.LG cs.AI

    A Survey on Diffusion Models for Anomaly Detection

    Authors: Jing Liu, Zhenchao Ma, Zepu Wang, Chenxuanyin Zou, Jiayang Ren, Zehua Wang, Liang Song, Bo Hu, Yang Liu, Victor C. M. Leung

    Abstract: Diffusion models (DMs) have emerged as a powerful class of generative AI models, showing remarkable potential in anomaly detection (AD) tasks across various domains, such as cybersecurity, fraud detection, healthcare, and manufacturing. The intersection of these two fields, termed diffusion models for anomaly detection (DMAD), offers promising solutions for identifying deviations in increasingly c…

    Submitted 26 February, 2025; v1 submitted 20 January, 2025; originally announced January 2025.

  33. arXiv:2501.01808  [pdf, other]

    cs.CV

    MoEE: Mixture of Emotion Experts for Audio-Driven Portrait Animation

    Authors: Huaize Liu, Wenzhang Sun, Donglin Di, Shibo Sun, Jiahui Yang, Changqing Zou, Hujun Bao

    Abstract: The generation of talking avatars has achieved significant advancements in precise audio synchronization. However, crafting lifelike talking head videos requires capturing a broad spectrum of emotions and subtle facial expressions. Current methods face fundamental challenges: a) the absence of frameworks for modeling single basic emotional expressions, which restricts the generation of complex emo…

    Submitted 8 January, 2025; v1 submitted 3 January, 2025; originally announced January 2025.

  34. arXiv:2501.00375  [pdf, other]

    cs.CV cs.LG

    Token Pruning for Caching Better: 9 Times Acceleration on Stable Diffusion for Free

    Authors: Evelyn Zhang, Bang Xiao, Jiayi Tang, Qianli Ma, Chang Zou, Xuefei Ning, Xuming Hu, Linfeng Zhang

    Abstract: Stable Diffusion has achieved remarkable success in the field of text-to-image generation, with its powerful generative capabilities and diverse generation results making a lasting impact. However, its iterative denoising introduces high computational costs and slows generation speed, limiting broader adoption. The community has made numerous efforts to reduce this computational burden, with metho…

    Submitted 31 December, 2024; originally announced January 2025.

  35. arXiv:2412.18911  [pdf, other]

    cs.LG cs.AI cs.CV

    Accelerating Diffusion Transformers with Dual Feature Caching

    Authors: Chang Zou, Evelyn Zhang, Runlin Guo, Haohang Xu, Conghui He, Xuming Hu, Linfeng Zhang

    Abstract: Diffusion Transformers (DiT) have become the dominant methods in image and video generation yet still suffer substantial computational costs. As an effective approach for DiT acceleration, feature caching methods are designed to cache the features of DiT in previous timesteps and reuse them in the next timesteps, allowing us to skip the computation in the next timesteps. However, on the one hand,…

    Submitted 25 December, 2024; originally announced December 2024.

  36. Physics-Informed Deep Learning Model for Line-integral Diagnostics Across Fusion Devices

    Authors: Cong Wang, Weizhe Yang, Haiping Wang, Renjie Yang, Jing Li, Zhijun Wang, Yixiong Wei, Xianli Huang, Chenshu Hu, Zhaoyang Liu, Xinyao Yu, Changqing Zou, Zhifeng Zhao

    Abstract: Rapid reconstruction of 2D plasma profiles from line-integral measurements is important in nuclear fusion. This paper introduces a physics-informed model architecture called Onion that can enhance the performance of models and be adapted to various backbone networks. The model under Onion incorporates physical information by a multiplication process and applies the physics-informed loss function…

    Submitted 9 June, 2025; v1 submitted 27 November, 2024; originally announced December 2024.

    Journal ref: Nuclear Fusion (2025)

  37. arXiv:2411.18263  [pdf, other]

    cs.CV

    TSD-SR: One-Step Diffusion with Target Score Distillation for Real-World Image Super-Resolution

    Authors: Linwei Dong, Qingnan Fan, Yihong Guo, Zhonghao Wang, Qi Zhang, Jinwei Chen, Yawei Luo, Changqing Zou

    Abstract: Pre-trained text-to-image diffusion models are increasingly applied to the real-world image super-resolution (Real-ISR) task. Given the iterative refinement nature of diffusion models, most existing approaches are computationally expensive. While methods such as SinSR and OSEDiff have emerged to condense inference steps via distillation, their performance in image restoration or details recovery is no…

    Submitted 31 May, 2025; v1 submitted 27 November, 2024; originally announced November 2024.

  38. arXiv:2411.17335  [pdf, other]

    cs.CV

    VersatileMotion: A Unified Framework for Motion Synthesis and Comprehension

    Authors: Zeyu Ling, Bo Han, Shiyang Li, Jikang Cheng, Hongdeng Shen, Changqing Zou

    Abstract: Large language models (LLMs) are, by design, inherently capable of multi-task learning: through a unified next-token prediction paradigm, they can naturally address a wide variety of downstream tasks. Prior work in the motion domain has demonstrated some generality by adapting LLMs via a Motion Tokenizer coupled with an autoregressive Transformer to generate and understand human motion. However, t…

    Submitted 26 May, 2025; v1 submitted 26 November, 2024; originally announced November 2024.

  39. arXiv:2411.10947  [pdf, other]

    cs.CV

    Direct and Explicit 3D Generation from a Single Image

    Authors: Haoyu Wu, Meher Gitika Karumuri, Chuhang Zou, Seungbae Bang, Yuelong Li, Dimitris Samaras, Sunil Hadap

    Abstract: Current image-to-3D approaches suffer from high computational costs and lack scalability for high-resolution outputs. In contrast, we introduce a novel framework to directly generate explicit surface geometry and texture using multi-view 2D depth and RGB images along with 3D Gaussian features using a repurposed Stable Diffusion model. We introduce a depth branch into U-Net for efficient and high q…

    Submitted 16 November, 2024; originally announced November 2024.

    Comments: 3DV 2025, Project page: https://hao-yu-wu.github.io/gen3d/

  40. arXiv:2411.10187  [pdf, other]

    cs.CV

    Try-On-Adapter: A Simple and Flexible Try-On Paradigm

    Authors: Hanzhong Guo, Jianfeng Zhang, Cheng Zou, Jun Li, Meng Wang, Ruxue Wen, Pingzhong Tang, Jingdong Chen, Ming Yang

    Abstract: Image-based virtual try-on, widely used in online shopping, aims to generate images of a naturally dressed person conditioned on certain garments, providing significant research and commercial potential. A key challenge of try-on is to generate realistic images of the model wearing the garments while preserving the details of the garments. Previous methods focus on masking certain parts of the ori…

    Submitted 15 November, 2024; originally announced November 2024.

    Comments: Image virtual try-on, 7 pages, 3 figures

  41. arXiv:2411.10109  [pdf]

    cs.AI cs.HC cs.LG

    Generative Agent Simulations of 1,000 People

    Authors: Joon Sung Park, Carolyn Q. Zou, Aaron Shaw, Benjamin Mako Hill, Carrie Cai, Meredith Ringel Morris, Robb Willer, Percy Liang, Michael S. Bernstein

    Abstract: The promise of human behavioral simulation--general-purpose computational agents that replicate human behavior across domains--could enable broad applications in policymaking and social science. We present a novel agent architecture that simulates the attitudes and behaviors of 1,052 real individuals--applying large language models to qualitative interviews about their lives, then measuring how we…

    Submitted 15 November, 2024; originally announced November 2024.

  42. arXiv:2411.00836  [pdf, other]

    cs.CV cs.AI cs.CL

    DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models

    Authors: Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, Huan Zhang

    Abstract: The rapid advancements in Vision-Language Models (VLMs) have shown great potential in tackling mathematical reasoning tasks that involve visual context. Unlike humans who can reliably apply solution steps to similar problems with minor modifications, we found that SOTA VLMs like GPT-4o can consistently fail in these scenarios, revealing limitations in their mathematical reasoning capabilities. In…

    Submitted 24 February, 2025; v1 submitted 29 October, 2024; originally announced November 2024.

    Comments: Accepted by ICLR 2025

  43. arXiv:2410.05317  [pdf, other]

    cs.LG cs.AI cs.CV

    Accelerating Diffusion Transformers with Token-wise Feature Caching

    Authors: Chang Zou, Xuyang Liu, Ting Liu, Siteng Huang, Linfeng Zhang

    Abstract: Diffusion transformers have shown significant effectiveness in both image and video synthesis at the expense of huge computation costs. To address this problem, feature caching methods have been introduced to accelerate diffusion transformers by caching the features in previous timesteps and reusing them in the following timesteps. However, previous caching methods ignore that different tokens exh…

    Submitted 19 February, 2025; v1 submitted 4 October, 2024; originally announced October 2024.

    Comments: ToCa is honored to be accepted by ICLR 2025

  44. arXiv:2409.17610  [pdf, other]

    cs.CL cs.CV

    ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical Dialogue

    Authors: Zhangpu Li, Changhong Zou, Suxue Ma, Zhicheng Yang, Chen Du, Youbao Tang, Zhenjie Cao, Ning Zhang, Jui-Hsin Lai, Ruei-Sung Lin, Yuan Ni, Xingzhi Sun, Jing Xiao, Jieke Hou, Kai Zhang, Mei Han

    Abstract: The rocketing prosperity of large language models (LLMs) in recent years has boosted the prevalence of vision-language models (VLMs) in the medical sector. In our online medical consultation scenario, a doctor responds to the texts and images provided by a patient in multiple rounds to diagnose her/his health condition, forming a multi-turn multimodal medical dialogue format. Unlike high-quality i…

    Submitted 29 October, 2024; v1 submitted 26 September, 2024; originally announced September 2024.

  45. arXiv:2409.15689  [pdf, other]

    cs.CV

    Plenoptic PNG: Real-Time Neural Radiance Fields in 150 KB

    Authors: Jae Yong Lee, Yuqun Wu, Chuhang Zou, Derek Hoiem, Shenlong Wang

    Abstract: The goal of this paper is to encode a 3D scene into an extremely compact representation from 2D images and to enable its transmittance, decoding and rendering in real-time across various platforms. Despite the progress in NeRFs and Gaussian Splats, their large model size and specialized renderers make it challenging to distribute free-viewpoint 3D content as easily as images. To address this, we h…

    Submitted 23 September, 2024; originally announced September 2024.

  46. arXiv:2409.02543  [pdf, other]

    cs.CV

    StyleTokenizer: Defining Image Style by a Single Instance for Controlling Diffusion Models

    Authors: Wen Li, Muyuan Fang, Cheng Zou, Biao Gong, Ruobing Zheng, Meng Wang, Jingdong Chen, Ming Yang

    Abstract: Despite the burst of innovative methods for controlling the diffusion process, effectively controlling image styles in text-to-image generation remains a challenging task. Many adapter-based methods impose image representation conditions on the denoising process to accomplish image control. However these conditions are not aligned with the word embedding space, leading to interference between imag…

    Submitted 4 September, 2024; originally announced September 2024.

    Comments: Accepted by ECCV2024

  47. arXiv:2408.17251  [pdf, other]

    cs.CV cs.AI

    Abstracted Gaussian Prototypes for One-Shot Concept Learning

    Authors: Chelsea Zou, Kenneth J. Kurtz

    Abstract: We introduce a cluster-based generative image segmentation framework to encode higher-level representations of visual concepts based on one-shot learning inspired by the Omniglot Challenge. The inferred parameters of each component of a Gaussian Mixture Model (GMM) represent a distinct topological subpart of a visual concept. Sampling new data from these parameters generates augmented subparts to…

    Submitted 30 August, 2024; originally announced August 2024.

  48. arXiv:2408.11319  [pdf, other]

    cs.CL cs.AI

    SarcasmBench: Towards Evaluating Large Language Models on Sarcasm Understanding

    Authors: Yazhou Zhang, Chunwang Zou, Zheng Lian, Prayag Tiwari, Jing Qin

    Abstract: In the era of large language models (LLMs), the task of "System I" - the fast, unconscious, and intuitive tasks, e.g., sentiment analysis, text classification, etc., have been argued to be successfully solved. However, sarcasm, as a subtle linguistic phenomenon, often employs rhetorical devices like hyperbole and figuration to convey true sentiments and intentions, involving a higher level of ab…

    Submitted 23 August, 2024; v1 submitted 20 August, 2024; originally announced August 2024.

  49. arXiv:2407.06305  [pdf, other]

    cs.CV cs.GR

    SweepNet: Unsupervised Learning Shape Abstraction via Neural Sweepers

    Authors: Mingrui Zhao, Yizhi Wang, Fenggen Yu, Changqing Zou, Ali Mahdavi-Amiri

    Abstract: Shape abstraction is an important task for simplifying complex geometric structures while retaining essential features. Sweep surfaces, commonly found in human-made objects, aid in this process by effectively capturing and representing object geometry, thereby facilitating abstraction. In this paper, we introduce SweepNet, a novel approach to shape abstraction through sweep surfaces. We propose…

    Submitted 8 July, 2024; originally announced July 2024.

    Comments: 14 pages, 20 figures, ECCV 2024

  50. arXiv:2405.15305  [pdf, other]

    cs.CV

    Diff3DS: Generating View-Consistent 3D Sketch via Differentiable Curve Rendering

    Authors: Yibo Zhang, Lihong Wang, Changqing Zou, Tieru Wu, Rui Ma

    Abstract: 3D sketches are widely used for visually representing the 3D shape and structure of objects or scenes. However, the creation of 3D sketch often requires users to possess professional artistic skills. Existing research efforts primarily focus on enhancing the ability of interactive sketch generation in 3D virtual systems. In this work, we propose Diff3DS, a novel differentiable rendering framework…

    Submitted 9 March, 2025; v1 submitted 24 May, 2024; originally announced May 2024.

    Comments: ICLR 2025