Showing 1–50 of 341 results for author: Gao, P

Searching in archive cs.
  1. arXiv:2505.09161  [pdf, ps, other]

    cond-mat.mtrl-sci cs.LG

    Bridging Theory and Experiment in Materials Discovery: Machine-Learning-Assisted Prediction of Synthesizable Structures

    Authors: Yu Xin, Peng Liu, Zhuohang Xie, Wenhui Mi, Pengyue Gao, Hong Jian Zhao, Jian Lv, Yanchao Wang, Yanming Ma

    Abstract: Even though thermodynamic energy-based crystal structure prediction (CSP) has revolutionized materials discovery, the energy-driven CSP approaches often struggle to identify experimentally realizable metastable materials synthesized through kinetically controlled pathways, creating a critical gap between theoretical predictions and experimental synthesis. Here, we propose a synthesizability-driven…

    Submitted 14 May, 2025; originally announced May 2025.

  2. Towards Adaptive Meta-Gradient Adversarial Examples for Visual Tracking

    Authors: Wei-Long Tian, Peng Gao, Xiao Liu, Long Xu, Hamido Fujita, Hanan Aljuai, Mao-Li Wang

    Abstract: In recent years, visual tracking methods based on convolutional neural networks and Transformers have achieved remarkable performance and have been successfully applied in fields such as autonomous driving. However, the numerous security issues exposed by deep learning models have gradually affected the reliable application of visual tracking methods in real-world scenarios. Therefore, how to reve…

    Submitted 13 May, 2025; originally announced May 2025.

  3. arXiv:2505.05446  [pdf, ps, other]

    cs.CV cs.CL

    Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding

    Authors: Han Xiao, Yina Xie, Guanxin Tan, Yinghao Chen, Rui Hu, Ke Wang, Aojun Zhou, Hao Li, Hao Shao, Xudong Lu, Peng Gao, Yafei Wen, Xiaoxin Chen, Shuai Ren, Hongsheng Li

    Abstract: Visual Document Understanding has become essential with the increase of text-rich visual content. This field poses significant challenges due to the need for effective integration of visual perception and textual comprehension, particularly across diverse document types with complex layouts. Moreover, existing fine-tuning datasets for this domain often fall short in providing the detailed contextu…

    Submitted 8 May, 2025; originally announced May 2025.

    Comments: CVPR2025

  4. arXiv:2505.03203  [pdf, other]

    cs.CV

    PiCo: Enhancing Text-Image Alignment with Improved Noise Selection and Precise Mask Control in Diffusion Models

    Authors: Chang Xie, Chenyi Zhuang, Pan Gao

    Abstract: Advanced diffusion models have made notable progress in text-to-image compositional generation. However, it is still a challenge for existing models to achieve text-image alignment when confronted with complex text prompts. In this work, we highlight two factors that affect this alignment: the quality of the randomly initialized noise and the reliability of the generated controlling mask. We then…

    Submitted 6 May, 2025; originally announced May 2025.

  5. arXiv:2504.21367  [pdf, other]

    cs.CE

    Implementation and Security Analysis of Cryptocurrencies Based on Ethereum

    Authors: Pengfei Gao, Dechao Kong, Xiaoqi Li

    Abstract: Blockchain technology has set off a wave of decentralization in the world since its birth. The trust system constructed by blockchain technology, based on cryptographic algorithms and computing power, provides a practical and powerful solution to the trust problem in human society. In order to make more convenient use of the characteristics of blockchain and build applications on it, smart contr…

    Submitted 6 May, 2025; v1 submitted 30 April, 2025; originally announced April 2025.

  6. arXiv:2504.16080  [pdf, other]

    cs.CV

    From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning

    Authors: Le Zhuo, Liangbing Zhao, Sayak Paul, Yue Liao, Renrui Zhang, Yi Xin, Peng Gao, Mohamed Elhoseiny, Hongsheng Li

    Abstract: Recent text-to-image diffusion models achieve impressive visual quality through extensive scaling of training data and model parameters, yet they often struggle with complex scenes and fine-grained details. Inspired by the self-reflection capabilities emergent in large language models, we propose ReflectionFlow, an inference-time framework enabling diffusion models to iteratively reflect upon and…

    Submitted 22 April, 2025; originally announced April 2025.

    Comments: All code, checkpoints, and datasets are available at https://diffusion-cot.github.io/reflection2perfection

  7. arXiv:2504.15780  [pdf, other]

    cs.AI cs.CL

    TrustGeoGen: Scalable and Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving

    Authors: Daocheng Fu, Zijun Chen, Renqiu Xia, Qi Liu, Yuan Feng, Hongbin Zhou, Renrui Zhang, Shiyang Feng, Peng Gao, Junchi Yan, Botian Shi, Bo Zhang, Yu Qiao

    Abstract: Mathematical geometric problem solving (GPS) often requires effective integration of multimodal information and verifiable logical coherence. Despite the fast development of large language models in general problem solving, the task remains unresolved with regard to both methodology and benchmarks, especially given the fact that existing synthetic GPS benchmarks are often not self-verified and contain no…

    Submitted 22 April, 2025; originally announced April 2025.

  8. arXiv:2504.13608  [pdf, other]

    cs.CV

    Cross-Hierarchical Bidirectional Consistency Learning for Fine-Grained Visual Classification

    Authors: Pengxiang Gao, Yihao Liang, Yanzhi Song, Zhouwang Yang

    Abstract: Fine-Grained Visual Classification (FGVC) aims to categorize closely related subclasses, a task complicated by minimal inter-class differences and significant intra-class variance. Existing methods often rely on additional annotations for image classification, overlooking the valuable information embedded in Tree Hierarchies that depict hierarchical label relationships. To leverage this knowledge…

    Submitted 18 April, 2025; originally announced April 2025.

  9. arXiv:2504.13472  [pdf, other]

    cs.SE cs.AI cs.CL cs.LG

    CodeVisionary: An Agent-based Framework for Evaluating Large Language Models in Code Generation

    Authors: Xinchen Wang, Pengfei Gao, Chao Peng, Ruida Hu, Cuiyun Gao

    Abstract: Large language models (LLMs) have demonstrated strong capabilities in code generation, underscoring the critical need for rigorous and comprehensive evaluation. Existing evaluation approaches fall into three categories, including human-centered, metric-based, and LLM-based. Considering that human-centered approaches are labour-intensive and metric-based ones overly rely on reference answers, LLM-b…

    Submitted 18 April, 2025; originally announced April 2025.

  10. Uncertainty Guided Refinement for Fine-Grained Salient Object Detection

    Authors: Yao Yuan, Pan Gao, Qun Dai, Jie Qin, Wei Xiang

    Abstract: Recently, salient object detection (SOD) methods have achieved impressive performance. However, salient regions predicted by existing methods usually contain unsaturated regions and shadows, which limits the model's ability to make reliable fine-grained predictions. To address this, we introduce the uncertainty guidance learning approach to SOD, intended to enhance the model's perception of uncertain regions. S…

    Submitted 13 April, 2025; originally announced April 2025.

    Comments: IEEE Transactions on Image Processing 2025

  11. arXiv:2504.08628  [pdf, other]

    stat.ML cs.LG

    Gradient Descent Robustly Learns the Intrinsic Dimension of Data in Training Convolutional Neural Networks

    Authors: Chenyang Zhang, Peifeng Gao, Difan Zou, Yuan Cao

    Abstract: Modern neural networks are usually highly over-parameterized. Behind the wide usage of over-parameterized networks is the belief that, if the data are simple, then the trained network will be automatically equivalent to a simple predictor. Following this intuition, many existing works have studied different notions of "ranks" of neural networks and their relation to the rank of data. In this work,…

    Submitted 11 April, 2025; originally announced April 2025.

    Comments: 43 pages, 4 figures

  12. arXiv:2504.07960  [pdf, other]

    cs.CV

    VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning

    Authors: Zhong-Yu Li, Ruoyi Du, Juncheng Yan, Le Zhuo, Zhen Li, Peng Gao, Zhanyu Ma, Ming-Ming Cheng

    Abstract: Recent progress in diffusion models significantly advances various image generation tasks. However, the current mainstream approach remains focused on building task-specific models, which have limited efficiency when supporting a wide range of different needs. While universal models attempt to address this limitation, they face critical challenges, including generalizable task instruction, appropr…

    Submitted 10 April, 2025; originally announced April 2025.

    Comments: Project page: https://visualcloze.github.io/

  13. arXiv:2504.07089  [pdf, other]

    cs.CV cs.CL

    OmniCaptioner: One Captioner to Rule Them All

    Authors: Yiting Lu, Jiakang Yuan, Zhen Li, Shitian Zhao, Qi Qin, Xinyue Li, Le Zhuo, Licheng Wen, Dongyang Liu, Yuewen Cao, Xiangchao Yan, Xin Li, Tianshuo Peng, Shufei Zhang, Botian Shi, Tao Chen, Zhibo Chen, Lei Bai, Bo Zhang, Peng Gao

    Abstract: We propose OmniCaptioner, a versatile visual captioning framework for generating fine-grained textual descriptions across a wide variety of visual domains. Unlike prior methods limited to specific image types (e.g., natural images or geometric visuals), our framework provides a unified solution for captioning natural images, visual text (e.g., posters, UIs, textbooks), and structured visuals (e.g.…

    Submitted 27 April, 2025; v1 submitted 9 April, 2025; originally announced April 2025.

    Comments: More visualizations on Homepage: https://alpha-innovator.github.io/OmniCaptioner-project-page and Official code: https://github.com/Alpha-Innovator/OmniCaptioner

  14. arXiv:2504.04903  [pdf, other]

    cs.CV cs.AI

    Lumina-OmniLV: A Unified Multimodal Framework for General Low-Level Vision

    Authors: Yuandong Pu, Le Zhuo, Kaiwen Zhu, Liangbin Xie, Wenlong Zhang, Xiangyu Chen, Peng Gao, Yu Qiao, Chao Dong, Yihao Liu

    Abstract: We present Lumina-OmniLV (abbreviated as OmniLV), a universal multimodal multi-task framework for low-level vision that addresses over 100 sub-tasks across four major categories: image restoration, image enhancement, weak-semantic dense prediction, and stylization. OmniLV leverages both textual and visual prompts to offer flexible and user-friendly interactions. Built on Diffusion Transformer (DiT…

    Submitted 8 April, 2025; v1 submitted 7 April, 2025; originally announced April 2025.

  15. arXiv:2504.04837  [pdf, other]

    cs.CV

    Uni4D: A Unified Self-Supervised Learning Framework for Point Cloud Videos

    Authors: Zhi Zuo, Chenyi Zhuang, Zhiqiang Shen, Pan Gao, Jie Qin

    Abstract: Point cloud video representation learning is primarily built upon the masking strategy in a self-supervised manner. However, the progress is slow due to several significant challenges: (1) Existing methods learn the motion particularly with hand-crafted designs, leading to unsatisfactory motion patterns during pre-training which are non-transferable to fine-tuning scenarios. (2) Previous Masked Au…

    Submitted 7 April, 2025; originally announced April 2025.

    Comments: 11 pages, 7 figures

  16. arXiv:2504.04339  [pdf, other]

    cs.CV

    NCL-CIR: Noise-aware Contrastive Learning for Composed Image Retrieval

    Authors: Peng Gao, Yujian Lee, Zailong Chen, Hui Zhang, Xubo Liu, Yiyang Hu, Guquang Jing

    Abstract: Composed Image Retrieval (CIR) seeks to find a target image using a multi-modal query, which combines an image with modification text to pinpoint the target. While recent CIR methods have shown promise, they mainly focus on exploring relationships between the query pairs (image and text) through data augmentation or model design. These methods often assume perfect alignment between queries and tar…

    Submitted 27 April, 2025; v1 submitted 5 April, 2025; originally announced April 2025.

    Comments: Has been accepted by ICASSP2025

  17. arXiv:2503.21758  [pdf, other]

    cs.CV

    Lumina-Image 2.0: A Unified and Efficient Image Generative Framework

    Authors: Qi Qin, Le Zhuo, Yi Xin, Ruoyi Du, Zhen Li, Bin Fu, Yiting Lu, Jiakang Yuan, Xinyue Li, Dongyang Liu, Xiangyang Zhu, Manyuan Zhang, Will Beddow, Erwann Millon, Victor Perez, Wenhai Wang, Conghui He, Bo Zhang, Xiaohong Liu, Hongsheng Li, Yu Qiao, Chang Xu, Peng Gao

    Abstract: We introduce Lumina-Image 2.0, an advanced text-to-image generation framework that achieves significant progress compared to previous work, Lumina-Next. Lumina-Image 2.0 is built upon two key principles: (1) Unification - it adopts a unified architecture (Unified Next-DiT) that treats text and image tokens as a joint sequence, enabling natural cross-modal interactions and allowing seamless task ex…

    Submitted 27 March, 2025; originally announced March 2025.

    Comments: Tech Report, 21 pages, 12 figures

  18. arXiv:2503.21749  [pdf, other]

    cs.CV

    LeX-Art: Rethinking Text Generation via Scalable High-Quality Data Synthesis

    Authors: Shitian Zhao, Qilong Wu, Xinyue Li, Bo Zhang, Ming Li, Qi Qin, Dongyang Liu, Kaipeng Zhang, Hongsheng Li, Yu Qiao, Peng Gao, Bin Fu, Zhen Li

    Abstract: We introduce LeX-Art, a comprehensive suite for high-quality text-image synthesis that systematically bridges the gap between prompt expressiveness and text rendering fidelity. Our approach follows a data-centric paradigm, constructing a high-quality data synthesis pipeline based on Deepseek-R1 to curate LeX-10K, a dataset of 10K high-resolution, aesthetically refined 1024×1024 images. Beyo…

    Submitted 27 March, 2025; originally announced March 2025.

    Comments: Project page: https://zhaoshitian.github.io/lexart/

  19. arXiv:2503.13185  [pdf, other]

    cs.CV cs.AI

    3DAxisPrompt: Promoting the 3D Grounding and Reasoning in GPT-4o

    Authors: Dingning Liu, Cheng Wang, Peng Gao, Renrui Zhang, Xinzhu Ma, Yuan Meng, Zhihui Wang

    Abstract: Multimodal Large Language Models (MLLMs) exhibit impressive capabilities across a variety of tasks, especially when equipped with carefully designed visual prompts. However, existing studies primarily focus on logical reasoning and visual understanding, while the capability of MLLMs to operate effectively in 3D vision remains an ongoing area of exploration. In this paper, we introduce a novel visu…

    Submitted 17 March, 2025; originally announced March 2025.

  20. arXiv:2503.07050  [pdf, other]

    cs.CV cs.AI cs.MM

    TIDE: Temporal-Aware Sparse Autoencoders for Interpretable Diffusion Transformers in Image Generation

    Authors: Victor Shea-Jay Huang, Le Zhuo, Yi Xin, Zhaokai Wang, Peng Gao, Hongsheng Li

    Abstract: Diffusion Transformers (DiTs) are a powerful yet underexplored class of generative models compared to U-Net-based diffusion models. To bridge this gap, we introduce TIDE (Temporal-aware Sparse Autoencoders for Interpretable Diffusion transformErs), a novel framework that enhances temporal reconstruction within DiT activation layers across denoising steps. TIDE employs Sparse Autoencoders (SAEs) wi…

    Submitted 10 March, 2025; originally announced March 2025.

  21. arXiv:2502.20127  [pdf, other]

    cs.SE cs.AI cs.CL

    SoRFT: Issue Resolving with Subtask-oriented Reinforced Fine-Tuning

    Authors: Zexiong Ma, Chao Peng, Pengfei Gao, Xiangxin Meng, Yanzhen Zou, Bing Xie

    Abstract: Mainstream issue-resolving frameworks predominantly rely on commercial models, leading to high costs and privacy concerns. Existing training approaches for issue resolving struggle with poor generalization and fail to fully leverage open-source development resources. We propose Subtask-oriented Reinforced Fine-Tuning (SoRFT), a novel training approach to enhance the issue resolving capability of L…

    Submitted 27 February, 2025; originally announced February 2025.

  22. arXiv:2502.17821  [pdf, other]

    cs.RO cs.AI cs.LG

    CAML: Collaborative Auxiliary Modality Learning for Multi-Agent Systems

    Authors: Rui Liu, Yu Shen, Peng Gao, Pratap Tokekar, Ming Lin

    Abstract: Multi-modality learning has become a crucial technique for improving the performance of machine learning applications across domains such as autonomous driving, robotics, and perception systems. While existing frameworks such as Auxiliary Modality Learning (AML) effectively utilize multiple data sources during training and enable inference with reduced modalities, they primarily operate in a singl…

    Submitted 24 February, 2025; originally announced February 2025.

  23. arXiv:2502.16736  [pdf, other]

    cs.LG cs.AI

    AUKT: Adaptive Uncertainty-Guided Knowledge Transfer with Conformal Prediction

    Authors: Rui Liu, Peng Gao, Yu Shen, Ming Lin, Pratap Tokekar

    Abstract: Knowledge transfer between teacher and student models has proven effective across various machine learning applications. However, challenges arise when the teacher's predictions are noisy, or the data domain during student training shifts from the teacher's pretraining data. In such scenarios, blindly relying on the teacher's predictions can lead to suboptimal knowledge transfer. To address these…

    Submitted 24 February, 2025; v1 submitted 23 February, 2025; originally announced February 2025.

  24. arXiv:2502.16286  [pdf, other]

    cs.CR cs.AI cs.LG

    Verification of Bit-Flip Attacks against Quantized Neural Networks

    Authors: Yedi Zhang, Lei Huang, Pengfei Gao, Fu Song, Jun Sun, Jin Song Dong

    Abstract: In the rapidly evolving landscape of neural network security, the resilience of neural networks against bit-flip attacks (i.e., an attacker maliciously flips an extremely small number of bits within the parameter storage memory system to induce harmful behavior) has emerged as a relevant area of research. Existing studies suggest that quantization may serve as a viable defense against such attack…

    Submitted 22 February, 2025; originally announced February 2025.

    Comments: 37 pages, 13 figures, 14 tables

  25. arXiv:2502.12098  [pdf, other]

    cs.RO

    Bandwidth-Adaptive Spatiotemporal Correspondence Identification for Collaborative Perception

    Authors: Peng Gao, Williard Joshua Jose, Hao Zhang

    Abstract: Correspondence identification (CoID) is an essential capability in multi-robot collaborative perception, which enables a group of robots to consistently refer to the same objects within their respective fields of view. In real-world applications, such as connected autonomous driving, vehicles face challenges in directly sharing raw observations due to limited communication bandwidth. In order to a…

    Submitted 17 February, 2025; originally announced February 2025.

  26. arXiv:2502.09621  [pdf, other]

    cs.CV cs.AI cs.CL

    MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency

    Authors: Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanwei Li, Yu Qi, Xinyan Chen, Liuhui Wang, Jianhan Jin, Claire Guo, Shen Yan, Bo Zhang, Chaoyou Fu, Peng Gao, Hongsheng Li

    Abstract: Answering questions with Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs), yet its impact on Large Multimodal Models (LMMs) still lacks a systematic assessment and in-depth investigation. In this paper, we introduce MME-CoT, a specialized benchmark evaluating the CoT reasoning performance of LMMs, spanning six domains: math, science, OCR,…

    Submitted 13 February, 2025; originally announced February 2025.

    Comments: Project Page: https://mmecot.github.io/

  27. arXiv:2502.06782  [pdf, other]

    cs.CV

    Lumina-Video: Efficient and Flexible Video Generation with Multi-scale Next-DiT

    Authors: Dongyang Liu, Shicheng Li, Yutong Liu, Zhen Li, Kai Wang, Xinyue Li, Qi Qin, Yufei Liu, Yi Xin, Zhongyu Li, Bin Fu, Chenyang Si, Yuewen Cao, Conghui He, Ziwei Liu, Yu Qiao, Qibin Hou, Hongsheng Li, Peng Gao

    Abstract: Recent advancements have established Diffusion Transformers (DiTs) as a dominant framework in generative modeling. Building on this success, Lumina-Next achieves exceptional performance in the generation of photorealistic images with Next-DiT. However, its potential for video generation remains largely untapped, with significant challenges in modeling the spatiotemporal complexity inherent to vide…

    Submitted 12 February, 2025; v1 submitted 10 February, 2025; originally announced February 2025.

  28. arXiv:2502.06171  [pdf]

    eess.IV cs.CV

    A Data-Efficient Pan-Tumor Foundation Model for Oncology CT Interpretation

    Authors: Wenhui Lei, Hanyu Chen, Zitian Zhang, Luyang Luo, Qiong Xiao, Yannian Gu, Peng Gao, Yankai Jiang, Ci Wang, Guangtao Wu, Tongjia Xu, Yingjie Zhang, Xiaofan Zhang, Pranav Rajpurkar, Shaoting Zhang, Zhenning Wang

    Abstract: Artificial intelligence-assisted imaging analysis has made substantial strides in tumor diagnosis and management. Here we present PASTA, a pan-tumor CT foundation model that achieves state-of-the-art performance on 45 of 46 representative oncology tasks -- including lesion segmentation, tumor detection in plain CT, tumor staging, survival prediction, structured report generation, and cross-modalit…

    Submitted 10 February, 2025; originally announced February 2025.

    Comments: 57 pages, 7 figures

  29. arXiv:2502.02748  [pdf, other]

    cs.LG cond-mat.mtrl-sci

    ReGNet: Reciprocal Space-Aware Long-Range Modeling and Multi-Property Prediction for Crystals

    Authors: Jianan Nie, Peiyao Xiao, Kaiyi Ji, Peng Gao

    Abstract: Predicting properties of crystals from their structures is a fundamental yet challenging task in materials science. Unlike molecules, crystal structures exhibit infinite periodic arrangements of atoms, requiring methods capable of capturing both local and global information effectively. However, most current works fall short of capturing long-range interactions within periodic structures. To addre…

    Submitted 4 February, 2025; originally announced February 2025.

  30. arXiv:2502.02481  [pdf, other]

    cs.CL

    Multilingual Machine Translation with Open Large Language Models at Practical Scale: An Empirical Study

    Authors: Menglong Cui, Pengzhi Gao, Wei Liu, Jian Luan, Bin Wang

    Abstract: Large language models (LLMs) have shown continuously improving multilingual capabilities, and even small-scale open-source models have demonstrated rapid performance enhancement. In this paper, we systematically explore the abilities of open LLMs with less than ten billion parameters to handle multilingual machine translation (MT) tasks. We conduct comprehensive evaluations on six popular LLMs and…

    Submitted 24 February, 2025; v1 submitted 4 February, 2025; originally announced February 2025.

    Comments: Accepted to NAACL2025 Main Conference

  31. arXiv:2501.13926  [pdf, other]

    cs.CV cs.AI cs.CL

    Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step

    Authors: Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Peng Gao, Hongsheng Li, Pheng-Ann Heng

    Abstract: Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. However, it still remains an open question whether such strategies can be applied to verifying and reinforcing image generation scenarios. In this paper, we provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. W…

    Submitted 23 January, 2025; originally announced January 2025.

    Comments: Journal Version. Code and models are released at https://github.com/ZiyuGuo99/Image-Generation-CoT

  32. arXiv:2501.13920  [pdf, other]

    cs.CV cs.CL cs.LG

    IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models

    Authors: Jiayi Lei, Renrui Zhang, Xiangfei Hu, Weifeng Lin, Zhen Li, Wenjian Sun, Ruoyi Du, Le Zhuo, Zhongyu Li, Xinyue Li, Shitian Zhao, Ziyu Guo, Yiting Lu, Peng Gao, Hongsheng Li

    Abstract: With the rapid development of diffusion models, text-to-image (T2I) models have made significant progress, showcasing impressive abilities in prompt following and image generation. Recently launched models such as FLUX.1 and Ideogram2.0, along with others like Dall-E3 and Stable Diffusion 3, have demonstrated exceptional performance across various complex tasks, raising questions about whether T2I…

    Submitted 23 January, 2025; originally announced January 2025.

    Comments: 75 pages, 73 figures, Evaluation scripts: https://github.com/jylei16/Imagine-e

  33. arXiv:2501.08453  [pdf, other]

    cs.CV cs.LG

    Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models

    Authors: Weichen Fan, Chenyang Si, Junhao Song, Zhenyu Yang, Yinan He, Long Zhuo, Ziqi Huang, Ziyue Dong, Jingwen He, Dongwei Pan, Yi Wang, Yuming Jiang, Yaohui Wang, Peng Gao, Xinyuan Chen, Hengjie Li, Dahua Lin, Yu Qiao, Ziwei Liu

    Abstract: We present Vchitect-2.0, a parallel transformer architecture designed to scale up video diffusion models for large-scale text-to-video generation. The overall Vchitect-2.0 system has several key designs. (1) By introducing a novel Multimodal Diffusion Block, our approach achieves consistent alignment between text descriptions and generated video frames, while maintaining temporal coherence across…

    Submitted 14 January, 2025; originally announced January 2025.

  34. arXiv:2501.03447  [pdf, other]

    cs.SE

    CoReQA: Uncovering Potentials of Language Models in Code Repository Question Answering

    Authors: Jialiang Chen, Kaifa Zhao, Jie Liu, Chao Peng, Jierui Liu, Hang Zhu, Pengfei Gao, Ping Yang, Shuiguang Deng

    Abstract: Large language models that enhance software development tasks, such as code generation, code completion, and code question answering (QA), have been extensively studied in both academia and the industry. The models are integrated into popular intelligent IDEs like JetBrains and Cursor. Current benchmarks for evaluating models' code comprehension capabilities primarily focus on code generation or c…

    Submitted 6 January, 2025; originally announced January 2025.

    Comments: 12 pages, 7 figures, 8 tables

  35. arXiv:2501.01895  [pdf, other]

    cs.RO cs.CV cs.LG

    EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation

    Authors: Siyuan Huang, Liliang Chen, Pengfei Zhou, Shengcong Chen, Zhengkai Jiang, Yue Hu, Yue Liao, Peng Gao, Hongsheng Li, Maoqing Yao, Guanghui Ren

    Abstract: We introduce EnerVerse, a generative robotics foundation model that constructs and interprets embodied spaces. EnerVerse employs an autoregressive video diffusion framework to predict future embodied spaces from instructions, enhanced by a sparse context memory for long-term reasoning. To model the 3D robotics world, we propose Free Anchor Views (FAVs), a multi-view video representation offering f…

    Submitted 10 February, 2025; v1 submitted 3 January, 2025; originally announced January 2025.

    Comments: Website: https://sites.google.com/view/enerverse

  36. arXiv:2412.16919  [pdf, other]

    cs.CV

    TAR3D: Creating High-Quality 3D Assets via Next-Part Prediction

    Authors: Xuying Zhang, Yutong Liu, Yangguang Li, Renrui Zhang, Yufei Liu, Kai Wang, Wanli Ouyang, Zhiwei Xiong, Peng Gao, Qibin Hou, Ming-Ming Cheng

    Abstract: We present TAR3D, a novel framework that consists of a 3D-aware Vector Quantized-Variational AutoEncoder (VQ-VAE) and a Generative Pre-trained Transformer (GPT) to generate high-quality 3D assets. The core insight of this work is to migrate the multimodal unification and promising learning capabilities of the next-token prediction paradigm to conditional 3D object generation. To achieve this, the…

    Submitted 11 March, 2025; v1 submitted 22 December, 2024; originally announced December 2024.

  37. arXiv:2412.14764  [pdf, other]

    cs.SE cs.AI

    CodeRepoQA: A Large-scale Benchmark for Software Engineering Question Answering

    Authors: Ruida Hu, Chao Peng, Jingyi Ren, Bo Jiang, Xiangxin Meng, Qinyun Wu, Pengfei Gao, Xinchen Wang, Cuiyun Gao

    Abstract: In this work, we introduce CodeRepoQA, a large-scale benchmark specifically designed for evaluating repository-level question-answering capabilities in the field of software engineering. CodeRepoQA encompasses five programming languages and covers a wide range of scenarios, enabling comprehensive evaluation of language models. To construct this dataset, we crawl data from 30 well-known repositorie…

    Submitted 19 December, 2024; originally announced December 2024.

  38. arXiv:2411.18019  [pdf, other]

    cs.SE

    A Real-World Benchmark for Evaluating Fine-Grained Issue Solving Capabilities of Large Language Models

    Authors: Ruida Hu, Chao Peng, Jingyi Ren, Bo Jiang, Xiangxin Meng, Qinyun Wu, Pengfei Gao, Xinchen Wang, Cuiyun Gao

    Abstract: Automatically resolving software issues is crucial for software development in practice, impacting the software quality and user experience. The process of resolving real-world issues encompasses tasks such as question-answering (QA), fault localization, and code editing. Existing benchmarks such as HumanEval fall short in their ability to assess LLMs' proficiency in solving issues within a codeba…

    Submitted 26 November, 2024; originally announced November 2024.

  39. arXiv:2411.18015  [pdf, other]

    cs.SE cs.AI

    AEGIS: An Agent-based Framework for General Bug Reproduction from Issue Descriptions

    Authors: Xinchen Wang, Pengfei Gao, Xiangxin Meng, Chao Peng, Ruida Hu, Yun Lin, Cuiyun Gao

    Abstract: In software maintenance, bug reproduction is essential for effective fault localization and repair. Manually writing reproduction scripts is a time-consuming task with high requirements for developers. Hence, automation of bug reproduction has increasingly attracted attention from researchers and practitioners. However, the existing studies on bug reproduction are generally limited to specific bug…

    Submitted 26 November, 2024; originally announced November 2024.

  40. arXiv:2411.17217  [pdf, other]

    cs.CV

    Promptable Anomaly Segmentation with SAM Through Self-Perception Tuning

    Authors: Hui-Yue Yang, Hui Chen, Ao Wang, Kai Chen, Zijia Lin, Yongliang Tang, Pengcheng Gao, Yuming Quan, Jungong Han, Guiguang Ding

    Abstract: Segment Anything Model (SAM) has made great progress in anomaly segmentation tasks due to its impressive generalization ability. However, existing methods that directly apply SAM through prompting often overlook the domain shift issue, where SAM performs well on natural images but struggles in industrial scenarios. Parameter-Efficient Fine-Tuning (PEFT) offers a promising solution, but it may yiel…

    Submitted 25 December, 2024; v1 submitted 26 November, 2024; originally announced November 2024.

    Comments: Accepted by AAAI-25

  41. arXiv:2411.11454  [pdf, other]

    cs.CV

    Relevance-guided Audio Visual Fusion for Video Saliency Prediction

    Authors: Li Yu, Xuanzhe Sun, Pan Gao, Moncef Gabbouj

    Abstract: Audio data, often synchronized with video frames, plays a crucial role in guiding the audience's visual attention. Incorporating audio information into video saliency prediction tasks can enhance the prediction of human visual behavior. However, existing audio-visual saliency prediction methods often directly fuse audio and visual features, which ignore the possibility of inconsistency between the…

    Submitted 18 November, 2024; originally announced November 2024.

  42. arXiv:2411.10213  [pdf, other]

    cs.SE cs.AI

    An Empirical Study on LLM-based Agents for Automated Bug Fixing

    Authors: Xiangxin Meng, Zexiong Ma, Pengfei Gao, Chao Peng

    Abstract: Large language models (LLMs) and LLM-based Agents have been applied to fix bugs automatically, demonstrating the capability in addressing software defects by engaging in development environment interaction, iterative validation and code modification. However, systematic analysis of these agent and non-agent systems remains limited, particularly regarding performance variations among top-performing…

    Submitted 15 November, 2024; originally announced November 2024.

  43. arXiv:2410.21060  [pdf, other]

    cs.CR cs.AI cs.LG

    CTINexus: Automatic Cyber Threat Intelligence Knowledge Graph Construction Using Large Language Models

    Authors: Yutong Cheng, Osama Bajaber, Saimon Amanuel Tsegai, Dawn Song, Peng Gao

    Abstract: Textual descriptions in cyber threat intelligence (CTI) reports, such as security articles and news, are rich sources of knowledge about cyber threats, crucial for organizations to stay informed about the rapidly evolving threat landscape. However, current CTI knowledge extraction methods lack flexibility and generalizability, often resulting in inaccurate and incomplete knowledge extraction. Synt…

    Submitted 21 April, 2025; v1 submitted 28 October, 2024; originally announced October 2024.

    Comments: Accepted at 2025 IEEE European Symposium on Security and Privacy (Euro S&P)

  44. arXiv:2410.17823  [pdf, other]

    cs.LG cs.CV eess.IV

    Att2CPC: Attention-Guided Lossy Attribute Compression of Point Clouds

    Authors: Kai Liu, Kang You, Pan Gao, Manoranjan Paul

    Abstract: With the great progress of 3D sensing and acquisition technology, the volume of point cloud data has grown dramatically, which urges the development of efficient point cloud compression methods. In this paper, we focus on the task of learned lossy point cloud attribute compression (PCAC). We propose an efficient attention-based method for lossy compression of point cloud attributes leveraging on a…

    Submitted 23 October, 2024; originally announced October 2024.

  45. arXiv:2410.15007  [pdf, other]

    cs.CV cs.MM

    DiffuseST: Unleashing the Capability of the Diffusion Model for Style Transfer

    Authors: Ying Hu, Chenyi Zhuang, Pan Gao

    Abstract: Style transfer aims to fuse the artistic representation of a style image with the structural information of a content image. Existing methods train specific networks or utilize pre-trained models to learn content and style features. However, they rely solely on textual or spatial representations that are inadequate to achieve the balance between content and style. In this work, we propose a novel…

    Submitted 19 October, 2024; originally announced October 2024.

    Comments: Accepted to ACMMM Asia 2024. Code is available at https://github.com/I2-Multimedia-Lab/DiffuseST

  46. arXiv:2410.12156  [pdf, other]

    cs.LG cs.AI physics.chem-ph

    FragNet: A Graph Neural Network for Molecular Property Prediction with Four Layers of Interpretability

    Authors: Gihan Panapitiya, Peiyuan Gao, C Mark Maupin, Emily G Saldanha

    Abstract: Molecular property prediction is a crucial step in many modern-day scientific applications including drug discovery and energy storage material design. Despite the availability of numerous machine learning models for this task, we are lacking in models that provide both high accuracies and interpretability of the predictions. We introduce the FragNet architecture, a graph neural network not only c…

    Submitted 15 October, 2024; originally announced October 2024.

  47. arXiv:2410.11772  [pdf, other]

    cs.CL cs.LG

    Layer-wise Importance Matters: Less Memory for Better Performance in Parameter-efficient Fine-tuning of Large Language Models

    Authors: Kai Yao, Penglei Gao, Lichun Li, Yuan Zhao, Xiaofeng Wang, Wei Wang, Jianke Zhu

    Abstract: Parameter-Efficient Fine-Tuning (PEFT) methods have gained significant popularity for adapting pre-trained Large Language Models (LLMs) to downstream tasks, primarily due to their potential to significantly reduce memory and computational overheads. However, a common limitation in most PEFT approaches is their application of a uniform architectural design across all layers. This uniformity involve…

    Submitted 5 November, 2024; v1 submitted 15 October, 2024; originally announced October 2024.

    Comments: EMNLP 2024

  48. arXiv:2410.10511  [pdf, other]

    cs.CV

    Customize Your Visual Autoregressive Recipe with Set Autoregressive Modeling

    Authors: Wenze Liu, Le Zhuo, Yi Xin, Sheng Xia, Peng Gao, Xiangyu Yue

    Abstract: We introduce a new paradigm for AutoRegressive (AR) image generation, termed Set AutoRegressive Modeling (SAR). SAR generalizes the conventional AR to the next-set setting, i.e., splitting the sequence into arbitrary sets containing multiple tokens, rather than outputting each token in a fixed raster order. To accommodate SAR, we develop a straightforward architecture termed Fully Masked Transform…

    Submitted 14 October, 2024; originally announced October 2024.

    Comments: 19 pages, 17 figures, 8 tables, github repo: https://github.com/poppuppy/SAR

  49. arXiv:2410.09962  [pdf, other]

    cs.CV

    LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models

    Authors: Han Qiu, Jiaxing Huang, Peng Gao, Qin Qi, Xiaoqin Zhang, Ling Shao, Shijian Lu

    Abstract: Hallucination, a phenomenon where multimodal large language models (MLLMs) tend to generate textual responses that are plausible but unaligned with the image, has become one major hurdle in various MLLM-related applications. Several benchmarks have been created to gauge the hallucination levels of MLLMs, by either raising discriminative questions about the existence of objects or introducing LLM e…

    Submitted 15 October, 2024; v1 submitted 13 October, 2024; originally announced October 2024.

  50. arXiv:2410.07536  [pdf, other]

    cs.CV

    I-Max: Maximize the Resolution Potential of Pre-trained Rectified Flow Transformers with Projected Flow

    Authors: Ruoyi Du, Dongyang Liu, Le Zhuo, Qin Qi, Hongsheng Li, Zhanyu Ma, Peng Gao

    Abstract: Rectified Flow Transformers (RFTs) offer superior training and inference efficiency, making them likely the most viable direction for scaling up diffusion models. However, progress in generation resolution has been relatively slow due to data quality and training costs. Tuning-free resolution extrapolation presents an alternative, but current methods often reduce generative stability, limiting pra…

    Submitted 14 October, 2024; v1 submitted 9 October, 2024; originally announced October 2024.