Showing 1–50 of 91 results for author: Xie, E

  1. arXiv:2503.09641  [pdf, other]

    cs.GR

    SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation

    Authors: Junsong Chen, Shuchen Xue, Yuyang Zhao, Jincheng Yu, Sayak Paul, Junyu Chen, Han Cai, Enze Xie, Song Han

    Abstract: This paper presents SANA-Sprint, an efficient diffusion model for ultra-fast text-to-image (T2I) generation. SANA-Sprint is built on a pre-trained foundation model and augmented with hybrid distillation, dramatically reducing inference steps from 20 to 1-4. We introduce three key innovations: (1) We propose a training-free approach that transforms a pre-trained flow-matching model for continuous-t…

    Submitted 23 March, 2025; v1 submitted 12 March, 2025; originally announced March 2025.

    Comments: 22 pages, 11 figures, 8 tables, In submission

  2. arXiv:2502.07701   

    cs.CV

    Magic 1-For-1: Generating One Minute Video Clips within One Minute

    Authors: Hongwei Yi, Shitong Shao, Tian Ye, Jiantong Zhao, Qingyu Yin, Michael Lingelbach, Li Yuan, Yonghong Tian, Enze Xie, Daquan Zhou

    Abstract: In this technical report, we present Magic 1-For-1 (Magic141), an efficient video generation model with optimized memory consumption and inference latency. The key idea is simple: factorize the text-to-video generation task into two separate easier tasks for diffusion step distillation, namely text-to-image generation and image-to-video generation. We verify that with the same optimization algorit…

    Submitted 16 February, 2025; v1 submitted 11 February, 2025; originally announced February 2025.

    Comments: Serious updates are needed

  3. arXiv:2501.18427  [pdf, other]

    cs.CV

    SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer

    Authors: Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Chengyue Wu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, Han Cai, Bingchen Liu, Daquan Zhou, Song Han

    Abstract: This paper presents SANA-1.5, a linear Diffusion Transformer for efficient scaling in text-to-image generation. Building upon SANA-1.0, we introduce three key innovations: (1) Efficient Training Scaling: A depth-growth paradigm that enables scaling from 1.6B to 4.8B parameters with significantly reduced computational resources, combined with a memory-efficient 8-bit optimizer. (2) Model Depth Prun…

    Submitted 23 March, 2025; v1 submitted 30 January, 2025; originally announced January 2025.

  4. arXiv:2412.19917  [pdf, other]

    cs.CV

    Char-SAM: Turning Segment Anything Model into Scene Text Segmentation Annotator with Character-level Visual Prompts

    Authors: Enze Xie, Jiaho Lyu, Daiqing Wu, Huawen Shen, Yu Zhou

    Abstract: The recent emergence of the Segment Anything Model (SAM) enables various domain-specific segmentation tasks to be tackled cost-effectively by using bounding boxes as prompts. However, in scene text segmentation, SAM cannot achieve desirable performance. The word-level bounding box as prompts is too coarse for characters, while the character-level bounding box as prompts suffers from over-segmenta…

    Submitted 27 December, 2024; originally announced December 2024.

  5. arXiv:2412.09782  [pdf, other]

    cs.RO cs.CV cs.MA

    EI-Drive: A Platform for Cooperative Perception with Realistic Communication Models

    Authors: Hanchu Zhou, Edward Xie, Wei Shao, Dechen Gao, Michelle Dong, Junshan Zhang

    Abstract: The growing interest in autonomous driving calls for realistic simulation platforms capable of accurately simulating cooperative perception process in realistic traffic scenarios. Existing studies for cooperative perception often have not accounted for transmission latency and errors in real-world environments. To address this gap, we introduce EI-Drive, an edge-AI based autonomous driving simulat…

    Submitted 12 December, 2024; originally announced December 2024.

  6. arXiv:2411.05007  [pdf, other]

    cs.CV cs.LG

    SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models

    Authors: Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, Song Han

    Abstract: Diffusion models can effectively generate high-quality images. However, as they scale, rising memory demands and higher latency pose substantial deployment challenges. In this work, we aim to accelerate diffusion models by quantizing their weights and activations to 4 bits. At such an aggressive level, both weights and activations are highly sensitive, where existing post-training quantization met…

    Submitted 3 March, 2025; v1 submitted 7 November, 2024; originally announced November 2024.

    Comments: ICLR 2025 Spotlight. Quantization Library: https://github.com/mit-han-lab/deepcompressor; Inference Engine: https://github.com/mit-han-lab/nunchaku; Website: https://hanlab.mit.edu/projects/svdquant; Demo: https://svdquant.mit.edu; Blog: https://hanlab.mit.edu/blog/svdquant

  7. arXiv:2411.02429  [pdf, other]

    cs.CL cs.AI cs.CE

    IdeaBench: Benchmarking Large Language Models for Research Idea Generation

    Authors: Sikun Guo, Amir Hassan Shariatmadari, Guangzhi Xiong, Albert Huang, Eric Xie, Stefan Bekiranov, Aidong Zhang

    Abstract: Large Language Models (LLMs) have transformed how people interact with artificial intelligence (AI) systems, achieving state-of-the-art results in various tasks, including scientific discovery and hypothesis generation. However, the lack of a comprehensive and systematic evaluation framework for generating research ideas using LLMs poses a significant obstacle to understanding and assessing their…

    Submitted 31 October, 2024; originally announced November 2024.

  8. arXiv:2411.02382  [pdf, other]

    cs.CL cs.AI

    Improving Scientific Hypothesis Generation with Knowledge Grounded Large Language Models

    Authors: Guangzhi Xiong, Eric Xie, Amir Hassan Shariatmadari, Sikun Guo, Stefan Bekiranov, Aidong Zhang

    Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in various scientific domains, from natural language processing to complex problem-solving tasks. Their ability to understand and generate human-like text has opened up new possibilities for advancing scientific research, enabling tasks such as data analysis, literature review, and even experimental design. One of the most prom…

    Submitted 4 November, 2024; originally announced November 2024.

  9. arXiv:2410.10812  [pdf, other]

    cs.CV cs.AI cs.LG

    HART: Efficient Visual Generation with Hybrid Autoregressive Transformer

    Authors: Haotian Tang, Yecheng Wu, Shang Yang, Enze Xie, Junsong Chen, Junyu Chen, Zhuoyang Zhang, Han Cai, Yao Lu, Song Han

    Abstract: We introduce Hybrid Autoregressive Transformer (HART), an autoregressive (AR) visual generation model capable of directly generating 1024x1024 images, rivaling diffusion models in image generation quality. Existing AR models face limitations due to the poor image reconstruction quality of their discrete tokenizers and the prohibitive training costs associated with generating 1024px images. To addr…

    Submitted 14 October, 2024; originally announced October 2024.

    Comments: Demo: https://hart.mit.edu. The first two authors contributed equally to this work

  10. arXiv:2410.10733  [pdf, other]

    cs.CV cs.AI

    Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models

    Authors: Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, Song Han

    Abstract: We present Deep Compression Autoencoder (DC-AE), a new family of autoencoder models for accelerating high-resolution diffusion models. Existing autoencoder models have demonstrated impressive results at a moderate spatial compression ratio (e.g., 8x), but fail to maintain satisfactory reconstruction accuracy for high spatial compression ratios (e.g., 64x). We address this challenge by introducing…

    Submitted 21 April, 2025; v1 submitted 14 October, 2024; originally announced October 2024.

    Comments: ICLR 2025. The first two authors contributed equally to this work

  11. arXiv:2410.10629  [pdf, other]

    cs.CV

    SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

    Authors: Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, Song Han

    Abstract: We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096×4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on laptop GPU. Core designs include: (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we trained an AE that…

    Submitted 20 October, 2024; v1 submitted 14 October, 2024; originally announced October 2024.

    Comments: Technical Report

  12. arXiv:2409.04429  [pdf, other]

    cs.CV cs.LG

    VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

    Authors: Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, Song Han, Yao Lu

    Abstract: VILA-U is a Unified foundation model that integrates Video, Image, Language understanding and generation. Traditional visual language models (VLMs) use separate modules for understanding and generating visual content, which can lead to misalignment and increased complexity. In contrast, VILA-U employs a single autoregressive next-token prediction framework for both tasks, eliminating the need for…

    Submitted 4 March, 2025; v1 submitted 6 September, 2024; originally announced September 2024.

    Comments: Code: https://github.com/mit-han-lab/vila-u. The first two authors contributed equally to this work

  13. arXiv:2408.09320  [pdf, other]

    cs.HC cs.SD eess.AS

    Auptimize: Optimal Placement of Spatial Audio Cues for Extended Reality

    Authors: Hyunsung Cho, Alexander Wang, Divya Kartik, Emily Liying Xie, Yukang Yan, David Lindlbauer

    Abstract: Spatial audio in Extended Reality (XR) provides users with better awareness of where virtual elements are placed, and efficiently guides them to events such as notifications, system alerts from different windows, or approaching avatars. Humans, however, are inaccurate in localizing sound cues, especially with multiple sources due to limitations in human auditory perception such as angular discrimi…

    Submitted 17 August, 2024; originally announced August 2024.

    Comments: UIST 2024

    ACM Class: H.5.1; H.5.2; H.5.5

  14. arXiv:2407.11382  [pdf, other]

    cs.CV cs.AI cs.RO

    Segment, Lift and Fit: Automatic 3D Shape Labeling from 2D Prompts

    Authors: Jianhao Li, Tianyu Sun, Zhongdao Wang, Enze Xie, Bailan Feng, Hongbo Zhang, Ze Yuan, Ke Xu, Jiaheng Liu, Ping Luo

    Abstract: This paper proposes an algorithm for automatically labeling 3D objects from 2D point or box prompts, especially focusing on applications in autonomous driving. Unlike previous arts, our auto-labeler predicts 3D shapes instead of bounding boxes and does not require training on a specific dataset. We propose a Segment, Lift, and Fit (SLF) paradigm to achieve this goal. Firstly, we segment high-quali…

    Submitted 17 July, 2024; v1 submitted 16 July, 2024; originally announced July 2024.

    Comments: Accepted to ECCV 2024

  15. arXiv:2403.16996  [pdf, other]

    cs.CV cs.RO

    DriveCoT: Integrating Chain-of-Thought Reasoning with End-to-End Driving

    Authors: Tianqi Wang, Enze Xie, Ruihang Chu, Zhenguo Li, Ping Luo

    Abstract: End-to-end driving has made significant progress in recent years, demonstrating benefits such as system simplicity and competitive driving performance under both open-loop and closed-loop settings. Nevertheless, the lack of interpretability and controllability in its driving decisions hinders real-world deployment for end-to-end driving systems. In this paper, we collect a comprehensive end-to-end…

    Submitted 25 March, 2024; originally announced March 2024.

  16. arXiv:2403.13807  [pdf, other]

    cs.CV cs.AI cs.LG

    Editing Massive Concepts in Text-to-Image Diffusion Models

    Authors: Tianwei Xiong, Yue Wu, Enze Xie, Yue Wu, Zhenguo Li, Xihui Liu

    Abstract: Text-to-image diffusion models suffer from the risk of generating outdated, copyrighted, incorrect, and biased content. While previous methods have mitigated the issues on a small scale, it is essential to handle them simultaneously in larger-scale real-world scenarios. We propose a two-stage method, Editing Massive Concepts In Diffusion Models (EMCID). The first stage performs memory optimization…

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: Project page: https://silentview.github.io/EMCID/ . Code: https://github.com/SilentView/EMCID

  17. arXiv:2403.10047  [pdf, other]

    cs.CV

    TextBlockV2: Towards Precise-Detection-Free Scene Text Spotting with Pre-trained Language Model

    Authors: Jiahao Lyu, Jin Wei, Gangyan Zeng, Zeng Li, Enze Xie, Wei Wang, Yu Zhou

    Abstract: Existing scene text spotters are designed to locate and transcribe texts from images. However, it is challenging for a spotter to achieve precise detection and recognition of scene texts simultaneously. Inspired by the glimpse-focus spotting pipeline of human beings and impressive performances of Pre-trained Language Models (PLMs) on visual tasks, we ask: 1) "Can machines spot texts without precis…

    Submitted 15 March, 2024; originally announced March 2024.

    Comments: 12 pages, 8 figures

  18. arXiv:2403.04692  [pdf, other]

    cs.CV

    PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation

    Authors: Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, Zhenguo Li

    Abstract: In this paper, we introduce PixArt-Σ, a Diffusion Transformer model (DiT) capable of directly generating images at 4K resolution. PixArt-Σ represents a significant advancement over its predecessor, PixArt-α, offering images of markedly higher fidelity and improved alignment with text prompts. A key feature of PixArt-Σ is its training efficiency. Leveraging the foundational pre-training of PixArt-α,…

    Submitted 17 March, 2024; v1 submitted 7 March, 2024; originally announced March 2024.

    Comments: Project Page: https://pixart-alpha.github.io/PixArt-sigma-project/

  19. arXiv:2402.17376  [pdf, other]

    cs.CV cs.AI cs.LG

    Accelerating Diffusion Sampling with Optimized Time Steps

    Authors: Shuchen Xue, Zhaoqiang Liu, Fei Chen, Shifeng Zhang, Tianyang Hu, Enze Xie, Zhenguo Li

    Abstract: Diffusion probabilistic models (DPMs) have shown remarkable performance in high-resolution image synthesis, but their sampling efficiency is still to be desired due to the typically large number of sampling steps. Recent advancements in high-order numerical ODE solvers for DPMs have enabled the generation of high-quality images with much fewer sampling steps. While this is a significant developmen…

    Submitted 3 July, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

    Comments: CVPR 2024

  20. arXiv:2402.13572  [pdf, other]

    cs.LG cs.AI math.NA

    AlgoFormer: An Efficient Transformer Framework with Algorithmic Structures

    Authors: Yihang Gao, Chuanyang Zheng, Enze Xie, Han Shi, Tianyang Hu, Yu Li, Michael K. Ng, Zhenguo Li, Zhaoqiang Liu

    Abstract: Besides natural language processing, transformers exhibit extraordinary performance in solving broader applications, including scientific computing and computer vision. Previous works try to explain this from the expressive power and capability perspectives that standard transformers are capable of performing some algorithms. To empower transformers with algorithmic capabilities and motivated by t…

    Submitted 10 January, 2025; v1 submitted 21 February, 2024; originally announced February 2024.

    Comments: Published in Transactions on Machine Learning Research (TMLR). The paper provides the insight that Transformer architectures can mimic algorithmic structures in (in-context) algorithm learning and representation. The algorithmic structure incorporated in AlgoFormer shows its potential in (deep learning for) scientific computing, in addition to real language tasks

  21. arXiv:2401.15688  [pdf, other]

    cs.CV

    Divide and Conquer: Language Models can Plan and Self-Correct for Compositional Text-to-Image Generation

    Authors: Zhenyu Wang, Enze Xie, Aoxue Li, Zhongdao Wang, Xihui Liu, Zhenguo Li

    Abstract: Despite significant advancements in text-to-image models for generating high-quality images, these methods still struggle to ensure the controllability of text prompts over images in the context of complex text prompts, especially when it comes to retaining object attributes and relationships. In this paper, we propose CompAgent, a training-free approach for compositional text-to-image generation,…

    Submitted 30 January, 2024; v1 submitted 28 January, 2024; originally announced January 2024.

  22. arXiv:2401.05252  [pdf, other]

    cs.CV

    PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models

    Authors: Junsong Chen, Yue Wu, Simian Luo, Enze Xie, Sayak Paul, Ping Luo, Hang Zhao, Zhenguo Li

    Abstract: This technical report introduces PIXART-δ, a text-to-image synthesis framework that integrates the Latent Consistency Model (LCM) and ControlNet into the advanced PIXART-α model. PIXART-α is recognized for its ability to generate high-quality images of 1024px resolution through a remarkably efficient training process. The integration of LCM in PIXART-δ significantly accelerates the inference speed…

    Submitted 10 January, 2024; originally announced January 2024.

    Comments: Technical Report

  23. arXiv:2312.15856  [pdf, other]

    cs.GR cs.CV

    SplatMesh: Interactive 3D Segmentation and Editing Using Mesh-Based Gaussian Splatting

    Authors: Kaichen Zhou, Lanqing Hong, Xinhai Chang, Yingji Zhong, Enze Xie, Hao Dong, Zhihao Li, Yongxin Yang, Zhenguo Li, Wei Zhang

    Abstract: A key challenge in fine-grained 3D-based interactive editing is the absence of an efficient representation that balances diverse modifications with high-quality view synthesis under a given memory constraint. While 3D meshes provide robustness for various modifications, they often yield lower-quality view synthesis compared to 3D Gaussian Splatting, which, in turn, suffers from instability during…

    Submitted 14 April, 2025; v1 submitted 25 December, 2023; originally announced December 2023.

  24. arXiv:2312.11562  [pdf, other]

    cs.AI cs.CL cs.CV cs.LG

    A Survey of Reasoning with Foundation Models

    Authors: Jiankai Sun, Chuanyang Zheng, Enze Xie, Zhengying Liu, Ruihang Chu, Jianing Qiu, Jiaqi Xu, Mingyu Ding, Hongyang Li, Mengzhe Geng, Yue Wu, Wenhai Wang, Junsong Chen, Zhangyue Yin, Xiaozhe Ren, Jie Fu, Junxian He, Wu Yuan, Qi Liu, Xihui Liu, Yu Li, Hao Dong, Yu Cheng, Ming Zhang, Pheng Ann Heng, et al. (9 additional authors not shown)

    Abstract: Reasoning, a crucial ability for complex problem-solving, plays a pivotal role in various real-world settings such as negotiation, medical diagnosis, and criminal investigation. It serves as a fundamental methodology in the field of Artificial General Intelligence (AGI). With the ongoing development of foundation models, e.g., Large Language Models (LLMs), there is a growing interest in exploring…

    Submitted 25 January, 2024; v1 submitted 17 December, 2023; originally announced December 2023.

    Comments: 20 Figures, 160 Pages, 750+ References, Project Page https://github.com/reasoning-survey/Awesome-Reasoning-Foundation-Models

  25. arXiv:2312.07231  [pdf, other]

    cs.CV cs.AI cs.LG

    Fast Training of Diffusion Transformer with Extreme Masking for 3D Point Clouds Generation

    Authors: Shentong Mo, Enze Xie, Yue Wu, Junsong Chen, Matthias Nießner, Zhenguo Li

    Abstract: Diffusion Transformers have recently shown remarkable effectiveness in generating high-quality 3D point clouds. However, training voxel-based diffusion models for high-resolution 3D voxels remains prohibitively expensive due to the cubic complexity of attention operators, which arises from the additional dimension of voxels. Motivated by the inherent redundancy of 3D compared to 2D, we propose Fas…

    Submitted 12 December, 2023; originally announced December 2023.

    Comments: Project Page: https://dit-3d.github.io/FastDiT-3D/

  26. arXiv:2312.02936  [pdf, other]

    cs.CV

    Drag-A-Video: Non-rigid Video Editing with Point-based Interaction

    Authors: Yao Teng, Enze Xie, Yue Wu, Haoyu Han, Zhenguo Li, Xihui Liu

    Abstract: Video editing is a challenging task that requires manipulating videos on both the spatial and temporal dimensions. Existing methods for video editing mainly focus on changing the appearance or style of the objects in the video, while keeping their structures unchanged. However, there is no existing method that allows users to interactively "drag" any points of instances on the first frame to pre…

    Submitted 5 December, 2023; originally announced December 2023.

  27. arXiv:2311.14603  [pdf, other]

    cs.CV

    Animate124: Animating One Image to 4D Dynamic Scene

    Authors: Yuyang Zhao, Zhiwen Yan, Enze Xie, Lanqing Hong, Zhenguo Li, Gim Hee Lee

    Abstract: We introduce Animate124 (Animate-one-image-to-4D), the first work to animate a single in-the-wild image into 3D video through textual motion descriptions, an underexplored problem with significant applications. Our 4D generation leverages an advanced 4D grid dynamic Neural Radiance Field (NeRF) model, optimized in three distinct stages using multiple diffusion priors. Initially, a static model is…

    Submitted 18 February, 2024; v1 submitted 24 November, 2023; originally announced November 2023.

    Comments: Project Page: https://animate124.github.io

  28. arXiv:2311.14580  [pdf, other]

    cs.CV

    Large Language Models as Automated Aligners for benchmarking Vision-Language Models

    Authors: Yuanfeng Ji, Chongjian Ge, Weikai Kong, Enze Xie, Zhengying Liu, Zhengguo Li, Ping Luo

    Abstract: With the advancements in Large Language Models (LLMs), Vision-Language Models (VLMs) have reached a new level of sophistication, showing notable competence in executing intricate cognition and reasoning tasks. However, existing evaluation benchmarks, primarily relying on rigid, hand-crafted datasets to measure task-specific performance, face significant limitations in assessing the alignment of th…

    Submitted 24 November, 2023; originally announced November 2023.

  29. arXiv:2311.01682  [pdf, other]

    cs.CV

    Flow-Based Feature Fusion for Vehicle-Infrastructure Cooperative 3D Object Detection

    Authors: Haibao Yu, Yingjuan Tang, Enze Xie, Jilei Mao, Ping Luo, Zaiqing Nie

    Abstract: Cooperatively utilizing both ego-vehicle and infrastructure sensor data can significantly enhance autonomous driving perception abilities. However, the uncertain temporal asynchrony and limited communication conditions can lead to fusion misalignment and constrain the exploitation of infrastructure data. To address these issues in vehicle-infrastructure cooperative 3D (VIC3D) object detection, we…

    Submitted 2 November, 2023; originally announced November 2023.

    Comments: Accepted by NeurIPS 2023. arXiv admin note: text overlap with arXiv:2303.10552

  30. arXiv:2310.02954  [pdf, other]

    cs.CL

    DQ-LoRe: Dual Queries with Low Rank Approximation Re-ranking for In-Context Learning

    Authors: Jing Xiong, Zixuan Li, Chuanyang Zheng, Zhijiang Guo, Yichun Yin, Enze Xie, Zhicheng Yang, Qingxing Cao, Haiming Wang, Xiongwei Han, Jing Tang, Chengming Li, Xiaodan Liang

    Abstract: Recent advances in natural language processing, primarily propelled by Large Language Models (LLMs), have showcased their remarkable capabilities grounded in in-context learning. A promising avenue for guiding LLMs in intricate reasoning tasks involves the utilization of intermediate reasoning steps within the Chain-of-Thought (CoT) paradigm. Nevertheless, the central challenge lies in the effecti…

    Submitted 2 March, 2024; v1 submitted 4 October, 2023; originally announced October 2023.

    Comments: Accepted in ICLR 2024

  31. arXiv:2310.02601  [pdf, other]

    cs.CV cs.AI

    MagicDrive: Street View Generation with Diverse 3D Geometry Control

    Authors: Ruiyuan Gao, Kai Chen, Enze Xie, Lanqing Hong, Zhenguo Li, Dit-Yan Yeung, Qiang Xu

    Abstract: Recent advancements in diffusion models have significantly enhanced the data synthesis with 2D control. Yet, precise 3D control in street view generation, crucial for 3D perception tasks, remains elusive. Specifically, utilizing Bird's-Eye View (BEV) as the primary condition often leads to challenges in geometry control (e.g., height), affecting the representation of object shapes, occlusion patte…

    Submitted 3 May, 2024; v1 submitted 4 October, 2023; originally announced October 2023.

    Comments: Project Page: https://flymin.github.io/magicdrive; Figure 7 updated

  32. arXiv:2310.01412  [pdf, other]

    cs.CV cs.RO

    DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model

    Authors: Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K. Wong, Zhenguo Li, Hengshuang Zhao

    Abstract: Multimodal large language models (MLLMs) have emerged as a prominent area of interest within the research community, given their proficiency in handling and reasoning with non-textual data, including images and videos. This study seeks to extend the application of MLLMs to the realm of autonomous driving by introducing DriveGPT4, a novel interpretable end-to-end autonomous driving system based on…

    Submitted 8 November, 2024; v1 submitted 2 October, 2023; originally announced October 2023.

    Comments: Accepted by RA-L. The project page is available at https://tonyxuqaq.github.io/projects/DriveGPT4/

  33. arXiv:2310.00656  [pdf, other]

    cs.AI

    LEGO-Prover: Neural Theorem Proving with Growing Libraries

    Authors: Haiming Wang, Huajian Xin, Chuanyang Zheng, Lin Li, Zhengying Liu, Qingxing Cao, Yinya Huang, Jing Xiong, Han Shi, Enze Xie, Jian Yin, Zhenguo Li, Heng Liao, Xiaodan Liang

    Abstract: Despite the success of large language models (LLMs), the task of theorem proving still remains one of the hardest reasoning tasks that is far from being fully solved. Prior methods using language models have demonstrated promising results, but they still struggle to prove even middle school level theorems. One common limitation of these methods is that they assume a fixed theorem library during th…

    Submitted 27 October, 2023; v1 submitted 1 October, 2023; originally announced October 2023.

  34. arXiv:2310.00426  [pdf, other]

    cs.CV

    PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

    Authors: Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, Zhenguo Li

    Abstract: The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering the fundamental innovation for the AIGC community while increasing CO2 emissions. This paper introduces PIXART-α, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and eve…

    Submitted 29 December, 2023; v1 submitted 30 September, 2023; originally announced October 2023.

    Comments: Project Page: https://pixart-alpha.github.io

  35. arXiv:2309.15806  [pdf, other]

    cs.CL cs.AI

    Lyra: Orchestrating Dual Correction in Automated Theorem Proving

    Authors: Chuanyang Zheng, Haiming Wang, Enze Xie, Zhengying Liu, Jiankai Sun, Huajian Xin, Jianhao Shen, Zhenguo Li, Yu Li

    Abstract: Large Language Models (LLMs) present an intriguing avenue for exploration in the field of formal theorem proving. Nevertheless, their full potential, particularly concerning the mitigation of hallucinations and refinement through prover error messages, remains an area that has yet to be thoroughly investigated. To enhance the effectiveness of LLMs in the field, we introduce the Lyra, a new framewo…

    Submitted 24 August, 2024; v1 submitted 27 September, 2023; originally announced September 2023.

    Comments: Accepted to TMLR: https://openreview.net/forum?id=9Z0yB8rmQ2

  36. arXiv:2308.13853  [pdf, other]

    cs.CV

    Beyond One-to-One: Rethinking the Referring Image Segmentation

    Authors: Yutao Hu, Qixiong Wang, Wenqi Shao, Enze Xie, Zhenguo Li, Jungong Han, Ping Luo

    Abstract: Referring image segmentation aims to segment the target object referred by a natural language expression. However, previous methods rely on the strong assumption that one sentence must describe one target in the image, which is often not the case in real-world applications. As a result, such methods fail when the expressions refer to either no objects or multiple objects. In this paper, we address…

    Submitted 26 August, 2023; originally announced August 2023.

    Comments: ICCV 2023

  37. arXiv:2307.06350  [pdf, other]

    cs.CV

    T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-image Generation

    Authors: Kaiyi Huang, Chengqi Duan, Kaiyue Sun, Enze Xie, Zhenguo Li, Xihui Liu

    Abstract: Despite the impressive advances in text-to-image models, they often struggle to effectively compose complex scenes with multiple objects, displaying various attributes and relationships. To address this challenge, we present T2I-CompBench++, an enhanced benchmark for compositional text-to-image generation. T2I-CompBench++ comprises 8,000 compositional text prompts categorized into four primary gro…

    Submitted 8 March, 2025; v1 submitted 12 July, 2023; originally announced July 2023.

    Comments: This is the journal version. For conference version (T2I-CompBench): arXiv:2307.06350v2. Project page: https://karine-h.github.io/T2I-CompBench-new/

  38. arXiv:2307.04106  [pdf, other]

    cs.CV

    Parametric Depth Based Feature Representation Learning for Object Detection and Segmentation in Bird's Eye View

    Authors: Jiayu Yang, Enze Xie, Miaomiao Liu, Jose M. Alvarez

    Abstract: Recent vision-only perception models for autonomous driving achieved promising results by encoding multi-view image features into Bird's-Eye-View (BEV) space. A critical step and the main bottleneck of these methods is transforming image features into the BEV coordinate frame. This paper focuses on leveraging geometry information, such as depth, to model such feature transformation. Existing works…

    Submitted 11 July, 2023; v1 submitted 9 July, 2023; originally announced July 2023.

  39. arXiv:2307.02159  [pdf, other]

    stat.ML cs.CV cs.LG math.AP

    DiffFlow: A Unified SDE Framework for Score-Based Diffusion Models and Generative Adversarial Networks

    Authors: Jingwei Zhang, Han Shi, Jincheng Yu, Enze Xie, Zhenguo Li

    Abstract: Generative models can be categorized into two types: explicit generative models that define explicit density forms and allow exact likelihood inference, such as score-based diffusion models (SDMs) and normalizing flows; implicit generative models that directly learn a transformation from the prior to the data distribution, such as generative adversarial nets (GANs). While these two types of models…

    Submitted 5 July, 2023; originally announced July 2023.

    Comments: Tech Report

  40. arXiv:2307.01831  [pdf, other]

    cs.CV cs.AI cs.LG

    DiT-3D: Exploring Plain Diffusion Transformers for 3D Shape Generation

    Authors: Shentong Mo, Enze Xie, Ruihang Chu, Lewei Yao, Lanqing Hong, Matthias Nießner, Zhenguo Li

    Abstract: Recent Diffusion Transformers (e.g., DiT) have demonstrated their powerful effectiveness in generating high-quality 2D images. However, it is still being determined whether the Transformer architecture performs equally well in 3D shape generation, as previous 3D diffusion methods mostly adopted the U-Net architecture. To bridge this gap, we propose a novel Diffusion Transformer for 3D shape genera…

    Submitted 4 July, 2023; originally announced July 2023.

    Comments: Project Page: https://dit-3d.github.io/

  41. arXiv:2306.16329  [pdf, other]

    cs.CV

    DiffComplete: Diffusion-based Generative 3D Shape Completion

    Authors: Ruihang Chu, Enze Xie, Shentong Mo, Zhenguo Li, Matthias Nießner, Chi-Wing Fu, Jiaya Jia

    Abstract: We introduce a new diffusion-based approach for shape completion on 3D range scans. Compared with prior deterministic and probabilistic methods, we strike a balance between realism, multi-modality, and high fidelity. We propose DiffComplete by casting shape completion as a generative task conditioned on the incomplete shape. Our key designs are two-fold. First, we devise a hierarchical feature agg…

    Submitted 28 June, 2023; originally announced June 2023.

    Comments: Project Page: https://ruihangchu.com/diffcomplete.html

  42. arXiv:2306.04607  [pdf, other]

    cs.CV cs.AI

    GeoDiffusion: Text-Prompted Geometric Control for Object Detection Data Generation

    Authors: Kai Chen, Enze Xie, Zhe Chen, Yibo Wang, Lanqing Hong, Zhenguo Li, Dit-Yan Yeung

    Abstract: Diffusion models have attracted significant attention due to the remarkable ability to create content and generate data for tasks like image classification. However, the usage of diffusion models to generate the high-quality object detection data remains an underexplored area, where not only image-level perceptual quality but also geometric conditions such as bounding boxes and camera views are es…

    Submitted 16 February, 2024; v1 submitted 7 June, 2023; originally announced June 2023.

    Comments: Accepted by ICLR 2024. Project Page: https://kaichen1998.github.io/projects/geodiffusion/

  43. arXiv:2305.08850  [pdf, other]

    cs.CV

    Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts

    Authors: Yuyang Zhao, Enze Xie, Lanqing Hong, Zhenguo Li, Gim Hee Lee

    Abstract: The text-driven image and video diffusion models have achieved unprecedented success in generating realistic and diverse content. Recently, the editing and variation of existing images and videos in diffusion-based generative models have garnered significant attention. However, previous works are limited to editing content with text or providing coarse personalization using a single visual clue, r…

    Submitted 18 February, 2024; v1 submitted 15 May, 2023; originally announced May 2023.

    Comments: Project page: https://make-a-protagonist.github.io

  44. Periodicity Analysis of the Logistic Map over Ring $\mathbb{Z}_{3^n}$

    Authors: Xiaoxiong Lu, Eric Yong Xie, Chengqing Li

    Abstract: Periodicity analysis of sequences generated by a deterministic system is a long-standing challenge in both theoretical research and engineering applications. To overcome the inevitable degradation of the Logistic map on a finite-precision circuit, its numerical domain is commonly converted from a real number field to a ring or a finite field. This paper studies the period of sequences generated by…

    Submitted 13 April, 2023; originally announced April 2023.

    Comments: 10 pages

    MSC Class: 65P20

    Journal ref: International Journal of Bifurcation and Chaos, vol. 33, no. 5, art. no. 2350063, 2023

  45. arXiv:2304.09801  [pdf, other]

    cs.CV

    MetaBEV: Solving Sensor Failures for BEV Detection and Map Segmentation

    Authors: Chongjian Ge, Junsong Chen, Enze Xie, Zhongdao Wang, Lanqing Hong, Huchuan Lu, Zhenguo Li, Ping Luo

    Abstract: Perception systems in modern autonomous driving vehicles typically take inputs from complementary multi-modal sensors, e.g., LiDAR and cameras. However, in real-world applications, sensor corruptions and failures lead to inferior performances, thus compromising autonomous safety. In this paper, we propose a robust framework, called MetaBEV, to address extreme real-world environments involving over…

    Submitted 19 April, 2023; originally announced April 2023.

    Comments: Project page: https://chongjiange.github.io/metabev.html

  46. arXiv:2304.09797  [pdf, other]

    cs.CL cs.LG

    Progressive-Hint Prompting Improves Reasoning in Large Language Models

    Authors: Chuanyang Zheng, Zhengying Liu, Enze Xie, Zhenguo Li, Yu Li

    Abstract: The performance of Large Language Models (LLMs) in reasoning tasks depends heavily on prompt design, with Chain-of-Thought (CoT) and self-consistency being critical methods that enhance this ability. However, these methods do not fully exploit the answers generated by the LLM to guide subsequent responses. This paper proposes a new prompting method, named Progressive-Hint Prompting (PHP), that ena…

    Submitted 7 October, 2024; v1 submitted 19 April, 2023; originally announced April 2023.

    Comments: Accepted to ICML AI4MATH 2024

  47. arXiv:2304.06648  [pdf, other]

    cs.CV

    DiffFit: Unlocking Transferability of Large Diffusion Models via Simple Parameter-Efficient Fine-Tuning

    Authors: Enze Xie, Lewei Yao, Han Shi, Zhili Liu, Daquan Zhou, Zhaoqiang Liu, Jiawei Li, Zhenguo Li

    Abstract: Diffusion models have proven to be highly effective in generating high-quality images. However, adapting large pre-trained diffusion models to new domains remains an open challenge, which is critical for real-world applications. This paper proposes DiffFit, a parameter-efficient strategy to fine-tune large pre-trained diffusion models that enable fast adaptation to new domains. DiffFit is embarras…

    Submitted 27 July, 2023; v1 submitted 13 April, 2023; originally announced April 2023.

    Comments: Tech Report

  48. arXiv:2304.01168  [pdf, other]

    cs.CV cs.LG cs.RO

    DeepAccident: A Motion and Accident Prediction Benchmark for V2X Autonomous Driving

    Authors: Tianqi Wang, Sukmin Kim, Wenxuan Ji, Enze Xie, Chongjian Ge, Junsong Chen, Zhenguo Li, Ping Luo

    Abstract: Safety is the primary priority of autonomous driving. Nevertheless, no published dataset currently supports the direct and explainable safety evaluation for autonomous driving. In this work, we propose DeepAccident, a large-scale dataset generated via a realistic simulator containing diverse accident scenarios that frequently occur in real-world driving. The proposed DeepAccident dataset includes…

    Submitted 17 December, 2023; v1 submitted 3 April, 2023; originally announced April 2023.

  49. arXiv:2303.17559  [pdf, other]

    cs.CV

    DDP: Diffusion Model for Dense Visual Prediction

    Authors: Yuanfeng Ji, Zhe Chen, Enze Xie, Lanqing Hong, Xihui Liu, Zhaoqiang Liu, Tong Lu, Zhenguo Li, Ping Luo

    Abstract: We propose a simple, efficient, yet powerful framework for dense visual predictions based on the conditional diffusion pipeline. Our approach follows a "noise-to-map" generative paradigm for prediction by progressively removing noise from a random Gaussian distribution, guided by the image. The method, called DDP, efficiently extends the denoising diffusion process into the modern perception pipel…

    Submitted 13 May, 2023; v1 submitted 30 March, 2023; originally announced March 2023.

    Comments: Added ControlNet experiments

  50. arXiv:2303.10552  [pdf, other]

    cs.CV

    Vehicle-Infrastructure Cooperative 3D Object Detection via Feature Flow Prediction

    Authors: Haibao Yu, Yingjuan Tang, Enze Xie, Jilei Mao, Jirui Yuan, Ping Luo, Zaiqing Nie

    Abstract: Cooperatively utilizing both ego-vehicle and infrastructure sensor data can significantly enhance autonomous driving perception abilities. However, temporal asynchrony and limited wireless communication in traffic environments can lead to fusion misalignment and impact detection performance. This paper proposes Feature Flow Net (FFNet), a novel cooperative detection framework that uses a feature f…

    Submitted 18 March, 2023; originally announced March 2023.

    Comments: Under Review