
Showing 1–50 of 272 results for author: You, Y

Searching in archive cs.
  1. arXiv:2505.06948  [pdf, other]

    cs.CV cs.LG

    Unsupervised Learning for Class Distribution Mismatch

    Authors: Pan Du, Wangbo Zhao, Xinai Lu, Nian Liu, Zhikai Li, Chaoyu Gong, Suyun Zhao, Hong Chen, Cuiping Li, Kai Wang, Yang You

    Abstract: Class distribution mismatch (CDM) refers to the discrepancy between class distributions in training data and target tasks. Previous methods address this by designing classifiers to categorize classes known during training, while grouping unknown or new classes into an "other" category. However, they focus on semi-supervised scenarios and heavily rely on labeled data, limiting their applicability a…

    Submitted 11 May, 2025; originally announced May 2025.

    Comments: Accepted by ICML 2025

  2. arXiv:2505.00596  [pdf, other]

    cs.RO cs.AI cs.LG

    A Finite-State Controller Based Offline Solver for Deterministic POMDPs

    Authors: Alex Schutz, Yang You, Matias Mattamala, Ipek Caliskanelli, Bruno Lacerda, Nick Hawes

    Abstract: Deterministic partially observable Markov decision processes (DetPOMDPs) often arise in planning problems where the agent is uncertain about its environmental state but can act and observe deterministically. In this paper, we propose DetMCVI, an adaptation of the Monte Carlo Value Iteration (MCVI) algorithm for DetPOMDPs, which builds policies in the form of finite-state controllers (FSCs). DetMCV…

    Submitted 1 May, 2025; originally announced May 2025.

    Comments: 9 pages, 6 figures. Appendix attached. To be published in Proceedings of IJCAI 2025. For code see http://github.com/ori-goals/DetMCVI

    ACM Class: I.2.8; I.2.9

  3. arXiv:2504.13638  [pdf]

    cs.CV

    DenSe-AdViT: A novel Vision Transformer for Dense SAR Object Detection

    Authors: Yang Zhang, Jingyi Cao, Yanan You, Yuanyuan Qiao

    Abstract: Vision Transformer (ViT) has achieved remarkable results in object detection for synthetic aperture radar (SAR) images, owing to its exceptional ability to extract global features. However, it struggles with the extraction of multi-scale local features, leading to limited performance in detecting small targets, especially when they are densely arranged. Therefore, we propose Density-Sensitive Visi…

    Submitted 18 April, 2025; originally announced April 2025.

  4. arXiv:2504.08353  [pdf, other]

    cs.GR cs.CV cs.LG

    Single View Garment Reconstruction Using Diffusion Mapping Via Pattern Coordinates

    Authors: Ren Li, Cong Cao, Corentin Dumery, Yingxuan You, Hao Li, Pascal Fua

    Abstract: Reconstructing 3D clothed humans from images is fundamental to applications like virtual try-on, avatar creation, and mixed reality. While recent advances have enhanced human body recovery, accurate reconstruction of garment geometry -- especially for loose-fitting clothing -- remains an open challenge. We present a novel method for high-fidelity 3D garment reconstruction from single images that b…

    Submitted 15 May, 2025; v1 submitted 11 April, 2025; originally announced April 2025.

    Comments: SIGGRAPH 2025

  5. arXiv:2504.06803  [pdf, other]

    cs.CV

    DyDiT++: Dynamic Diffusion Transformers for Efficient Visual Generation

    Authors: Wangbo Zhao, Yizeng Han, Jiasheng Tang, Kai Wang, Hao Luo, Yibing Song, Gao Huang, Fan Wang, Yang You

    Abstract: Diffusion Transformer (DiT), an emerging diffusion model for visual generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigations reveal that these costs primarily stem from the static inference paradigm, which inevitably introduces redundant computation in certain diffusion timesteps and spatial regions. To overcome thi…

    Submitted 16 April, 2025; v1 submitted 9 April, 2025; originally announced April 2025.

    Comments: Extended journal version for ICLR. arXiv admin note: substantial text overlap with arXiv:2410.03456

  6. arXiv:2504.05782  [pdf, other]

    cs.CV cs.AI

    MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models

    Authors: Pengfei Zhou, Fanrui Zhang, Xiaopeng Peng, Zhaopan Xu, Jiaxin Ai, Yansheng Qiu, Chuanhao Li, Zhen Li, Ming Li, Yukang Feng, Jianwen Sun, Haoquan Zhang, Zizhen Li, Xiaofeng Mao, Wangbo Zhao, Kai Wang, Xiaojun Chang, Wenqi Shao, Yang You, Kaipeng Zhang

    Abstract: Multimodal reasoning, which integrates language and visual cues into problem solving and decision making, is a fundamental aspect of human intelligence and a crucial step toward artificial general intelligence. However, the evaluation of multimodal reasoning capabilities in Multimodal Large Language Models (MLLMs) remains inadequate. Most existing reasoning benchmarks are constrained by limited da…

    Submitted 8 April, 2025; originally announced April 2025.

    Comments: 11 pages, 8 figures

  7. arXiv:2504.04787  [pdf, other]

    cs.CV cs.AI

    Dynamic Vision Mamba

    Authors: Mengxuan Wu, Zekai Li, Zhiyuan Liang, Moyang Li, Xuanlei Zhao, Samir Khaki, Zheng Zhu, Xiaojiang Peng, Konstantinos N. Plataniotis, Kai Wang, Wangbo Zhao, Yang You

    Abstract: Mamba-based vision models have gained extensive attention as a result of being computationally more efficient than attention-based models. However, spatial redundancy still exists in these models, represented by token and block redundancy. For token redundancy, we analytically find that early token pruning methods will result in inconsistency between training and inference or introduce extra compu…

    Submitted 7 April, 2025; originally announced April 2025.

  8. arXiv:2503.12545  [pdf, other]

    cs.CV

    PEBench: A Fictitious Dataset to Benchmark Machine Unlearning for Multimodal Large Language Models

    Authors: Zhaopan Xu, Pengfei Zhou, Weidong Tang, Jiaxin Ai, Wangbo Zhao, Xiaojiang Peng, Kai Wang, Yang You, Wenqi Shao, Hongxun Yao, Kaipeng Zhang

    Abstract: In recent years, Multimodal Large Language Models (MLLMs) have demonstrated remarkable advancements in tasks such as visual question answering, visual understanding, and reasoning. However, this impressive progress relies on vast amounts of data collected from the internet, raising significant concerns about privacy and security. To address these issues, machine unlearning (MU) has emerged as a pr…

    Submitted 16 March, 2025; originally announced March 2025.

  9. arXiv:2503.10700  [pdf, other]

    cs.CV cs.MM

    TA-V2A: Textually Assisted Video-to-Audio Generation

    Authors: Yuhuan You, Xihong Wu, Tianshu Qu

    Abstract: As artificial intelligence-generated content (AIGC) continues to evolve, video-to-audio (V2A) generation has emerged as a key area with promising applications in multimedia editing, augmented reality, and automated content creation. While Transformer and Diffusion models have advanced audio generation, a significant challenge persists in extracting precise semantic information from videos, as curr…

    Submitted 12 March, 2025; originally announced March 2025.

  10. arXiv:2503.09642  [pdf, other]

    cs.GR cs.AI

    Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k

    Authors: Xiangyu Peng, Zangwei Zheng, Chenhui Shen, Tom Young, Xinying Guo, Binluo Wang, Hang Xu, Hongxin Liu, Mingyan Jiang, Wenjun Li, Yuhui Wang, Anbang Ye, Gang Ren, Qianran Ma, Wanying Liang, Xiang Lian, Xiwen Wu, Yuting Zhong, Zhuangyan Li, Chaoyu Gong, Guojun Lei, Leijun Cheng, Limin Zhang, Minghao Li, Ruijie Zhang, et al. (7 additional authors not shown)

    Abstract: Video generation models have achieved remarkable progress in the past year. The quality of AI video continues to improve, but at the cost of larger model size, increased data quantity, and greater demand for training compute. In this report, we present Open-Sora 2.0, a commercial-level video generation model trained for only $200k. With this model, we demonstrate that the cost of training a top-pe…

    Submitted 23 March, 2025; v1 submitted 12 March, 2025; originally announced March 2025.

  11. arXiv:2503.05157  [pdf, other]

    cs.CL

    Ensemble Debiasing Across Class and Sample Levels for Fairer Prompting Accuracy

    Authors: Ruixi Lin, Ziqiao Wang, Yang You

    Abstract: Language models are strong few-shot learners and achieve good overall accuracy in text classification tasks, masking the fact that their results suffer from great class accuracy imbalance. We believe that the pursuit of overall accuracy should not come from enriching the strong classes, but from raising up the weak ones. To address the imbalance, we propose a Heaviside step function based ensemble…

    Submitted 26 March, 2025; v1 submitted 7 March, 2025; originally announced March 2025.

  12. arXiv:2502.13533  [pdf, other]

    cs.LG cs.AI cs.CL

    Train Small, Infer Large: Memory-Efficient LoRA Training for Large Language Models

    Authors: Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Yang You, Guiming Xie, Xuejian Gong, Kunlong Zhou

    Abstract: Large Language Models (LLMs) have significantly advanced natural language processing with exceptional task generalization capabilities. Low-Rank Adaptation (LoRA) offers a cost-effective fine-tuning solution, freezing the original model parameters and training only lightweight, low-rank adapter matrices. However, the memory footprint of LoRA is largely dominated by the original model parameters. To…

    Submitted 15 March, 2025; v1 submitted 19 February, 2025; originally announced February 2025.

    Comments: Accepted at ICLR 2025

  13. arXiv:2502.10389  [pdf, other]

    cs.CV cs.AI

    Region-Adaptive Sampling for Diffusion Transformers

    Authors: Ziming Liu, Yifan Yang, Chengruidong Zhang, Yiqi Zhang, Lili Qiu, Yang You, Yuqing Yang

    Abstract: Diffusion models (DMs) have become the leading choice for generative tasks across diverse domains. However, their reliance on multiple sequential forward passes significantly limits real-time performance. Previous acceleration methods have primarily focused on reducing the number of sampling steps or reusing intermediate results, failing to leverage variations across spatial regions within the ima…

    Submitted 14 February, 2025; originally announced February 2025.

  14. arXiv:2502.07508  [pdf, other]

    cs.CV

    Enhance-A-Video: Better Generated Video for Free

    Authors: Yang Luo, Xuanlei Zhao, Mengzhao Chen, Kaipeng Zhang, Wenqi Shao, Kai Wang, Zhangyang Wang, Yang You

    Abstract: DiT-based video generation has achieved remarkable results, but research into enhancing existing models remains relatively unexplored. In this work, we introduce a training-free approach to enhance the coherence and quality of DiT-based generated videos, named Enhance-A-Video. The core idea is enhancing the cross-frame correlations based on non-diagonal temporal attention distributions. Thanks to…

    Submitted 27 February, 2025; v1 submitted 11 February, 2025; originally announced February 2025.

  15. arXiv:2502.04078  [pdf, other]

    cs.MM

    CDIO: Cross-Domain Inference Optimization with Resource Preference Prediction for Edge-Cloud Collaboration

    Authors: Zheming Yang, Wen Ji, Qi Guo, Dieli Hu, Chang Zhao, Xiaowei Li, Xuanlei Zhao, Yi Zhao, Chaoyu Gong, Yang You

    Abstract: Currently, massive video tasks are processed by edge-cloud collaboration. However, the diversity of task requirements and the dynamics of resources pose great challenges to efficient inference, resulting in many wasted resources. In this paper, we present CDIO, a cross-domain inference optimization framework designed for edge-cloud collaboration. For diverse input tasks, CDIO can predict resource…

    Submitted 6 February, 2025; originally announced February 2025.

    Comments: 10 pages, 9 figures

  16. arXiv:2501.12948  [pdf, other]

    cs.CL cs.AI cs.LG

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Authors: DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, et al. (175 additional authors not shown)

    Abstract: We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters…

    Submitted 22 January, 2025; originally announced January 2025.

  17. arXiv:2501.11587  [pdf, other]

    cs.LG cs.AI

    Recurrent Diffusion for Large-Scale Parameter Generation

    Authors: Kai Wang, Dongwen Tang, Wangbo Zhao, Konstantin Schürholt, Zhangyang Wang, Yang You

    Abstract: Parameter generation has long struggled to match the scale of today's large vision and language models, curbing its broader utility. In this paper, we introduce Recurrent Diffusion for Large Scale Parameter Generation (RPG), a novel framework that generates full neural network parameters up to hundreds of millions on a single GPU. Our approach first partitions a network's parameters into non-overlapp…

    Submitted 10 February, 2025; v1 submitted 20 January, 2025; originally announced January 2025.

    Comments: Generating 200 million parameters in just minutes

  18. arXiv:2501.09316  [pdf, other]

    cs.AI

    SOP-Agent: Empower General Purpose AI Agent with Domain-Specific SOPs

    Authors: Anbang Ye, Qianran Ma, Jia Chen, Muqi Li, Tong Li, Fujiao Liu, Siqi Mai, Meichen Lu, Haitao Bao, Yang You

    Abstract: Despite significant advancements in general-purpose AI agents, several challenges still hinder their practical application in real-world scenarios. First, the limited planning capabilities of Large Language Models (LLMs) restrict AI agents from effectively solving complex tasks that require long-horizon planning. Second, general-purpose AI agents struggle to efficiently utilize domain-specific know…

    Submitted 16 January, 2025; originally announced January 2025.

    Comments: 35 pages, 5 figures

  19. arXiv:2501.00602  [pdf, other]

    cs.CV cs.LG

    STORM: Spatio-Temporal Reconstruction Model for Large-Scale Outdoor Scenes

    Authors: Jiawei Yang, Jiahui Huang, Yuxiao Chen, Yan Wang, Boyi Li, Yurong You, Apoorva Sharma, Maximilian Igl, Peter Karkus, Danfei Xu, Boris Ivanovic, Yue Wang, Marco Pavone

    Abstract: We present STORM, a spatio-temporal reconstruction model designed for reconstructing dynamic outdoor scenes from sparse observations. Existing dynamic reconstruction methods often rely on per-scene optimization, dense observations across space and time, and strong motion supervision, resulting in lengthy optimization times, limited generalization to novel views or scenes, and degenerated quality c…

    Submitted 31 December, 2024; originally announced January 2025.

    Comments: Project page at: https://jiawei-yang.github.io/STORM/

  20. arXiv:2501.00601  [pdf, other]

    cs.CV cs.AI cs.GR

    DreamDrive: Generative 4D Scene Modeling from Street View Images

    Authors: Jiageng Mao, Boyi Li, Boris Ivanovic, Yuxiao Chen, Yan Wang, Yurong You, Chaowei Xiao, Danfei Xu, Marco Pavone, Yue Wang

    Abstract: Synthesizing photo-realistic visual observations from an ego vehicle's driving trajectory is a critical step towards scalable training of self-driving models. Reconstruction-based methods create 3D scenes from driving logs and synthesize geometry-consistent driving videos through neural rendering, but their dependence on costly object annotations limits their ability to generalize to in-the-wild d…

    Submitted 3 January, 2025; v1 submitted 31 December, 2024; originally announced January 2025.

    Comments: Project page: https://pointscoder.github.io/DreamDrive/

  21. arXiv:2412.20404  [pdf, other]

    cs.CV

    Open-Sora: Democratizing Efficient Video Production for All

    Authors: Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, Yang You

    Abstract: Vision and language are the two foundational senses for humans, and they build up our cognitive ability and intelligence. While significant breakthroughs have been made in AI language ability, artificial visual intelligence, especially the ability to generate and simulate the world we see, lags far behind. To facilitate the development and accessibility of artificial visual intelligence, we…

    Submitted 29 December, 2024; originally announced December 2024.

  22. arXiv:2412.19437  [pdf, other]

    cs.CL cs.AI

    DeepSeek-V3 Technical Report

    Authors: DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, et al. (175 additional authors not shown)

    Abstract: We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for loa…

    Submitted 18 February, 2025; v1 submitted 26 December, 2024; originally announced December 2024.

  23. arXiv:2412.19211  [pdf, other]

    cs.LG

    Large Language Models Meet Graph Neural Networks: A Perspective of Graph Mining

    Authors: Yuxin You, Zhen Liu, Xiangchao Wen, Yongtao Zhang, Wei Ai

    Abstract: Graph mining is an important area in data mining and machine learning that involves extracting valuable information from graph-structured data. In recent years, significant progress has been made in this field through the development of graph neural networks (GNNs). However, GNNs are still deficient in generalizing to diverse graph data. To address this issue, Large Language Models (LLMs) could pro…

    Submitted 26 December, 2024; originally announced December 2024.

  24. arXiv:2412.19018  [pdf, other]

    cs.CL

    Let the Fuzzy Rule Speak: Enhancing In-context Learning Debiasing with Interpretability

    Authors: Ruixi Lin, Yang You

    Abstract: Large language models (LLMs) often struggle with balanced class accuracy in text classification tasks using in-context learning (ICL), hindering some practical uses due to user dissatisfaction or safety risks caused by misclassifications. Retraining LLMs to address root causes in data or model priors is neither easy nor cost-effective. This paper delves deeper into the class accuracy imbalance iss…

    Submitted 11 February, 2025; v1 submitted 25 December, 2024; originally announced December 2024.

  25. arXiv:2412.17365  [pdf, other]

    cs.CL cs.AI

    Boosting LLM via Learning from Data Iteratively and Selectively

    Authors: Qi Jia, Siyu Ren, Ziheng Qin, Fuzhao Xue, Jinjie Ni, Yang You

    Abstract: Datasets nowadays are generally constructed from multiple sources and using different synthetic techniques, making data de-noising and de-duplication crucial before being used for post-training. In this work, we propose to perform instruction tuning by iterative data selection (\ApproachName{}). We measure the quality of a sample from complexity and diversity simultaneously. Instead of calculating…

    Submitted 23 December, 2024; originally announced December 2024.

  26. arXiv:2412.12496  [pdf, other]

    cs.CV cs.AI

    Faster Vision Mamba is Rebuilt in Minutes via Merged Token Re-training

    Authors: Mingjia Shi, Yuhao Zhou, Ruiji Yu, Zekai Li, Zhiyuan Liang, Xuanlei Zhao, Xiaojiang Peng, Shanmukha Ramakrishna Vedantam, Wangbo Zhao, Kai Wang, Yang You

    Abstract: Vision Mamba has shown close to state-of-the-art performance on computer vision tasks, drawing much interest in increasing its efficiency. A promising approach is token reduction (which has been successfully implemented in ViTs). Pruning informative tokens in Mamba leads to a high loss of key knowledge and degraded performance. An alternative, merging tokens, preserves more information than prun…

    Submitted 14 April, 2025; v1 submitted 16 December, 2024; originally announced December 2024.

    MSC Class: 68T07 ACM Class: I.2

  27. arXiv:2412.10302  [pdf, other]

    cs.CV cs.AI cs.CL

    DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    Authors: Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, et al. (2 additional authors not shown)

    Abstract: We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL, through two major upgrades. For the vision component, we incorporate a dynamic tiling vision encoding strategy designed for processing high-resolution images with different aspect ratios. For the language component, we leverage Deep…

    Submitted 13 December, 2024; originally announced December 2024.

  28. arXiv:2412.05256  [pdf, other]

    cs.CV cs.AI cs.LG cs.RO

    Extrapolated Urban View Synthesis Benchmark

    Authors: Xiangyu Han, Zhen Jia, Boyi Li, Yan Wang, Boris Ivanovic, Yurong You, Lingjie Liu, Yue Wang, Marco Pavone, Chen Feng, Yiming Li

    Abstract: Photorealistic simulators are essential for the training and evaluation of vision-centric autonomous vehicles (AVs). At their core is Novel View Synthesis (NVS), a crucial capability that generates diverse unseen viewpoints to accommodate the broad and continuous pose distribution of AVs. Recent advances in radiance fields, such as 3D Gaussian Splatting, achieve photorealistic rendering at real-ti…

    Submitted 12 March, 2025; v1 submitted 6 December, 2024; originally announced December 2024.

    Comments: Project page: https://ai4ce.github.io/EUVS-Benchmark/

  29. arXiv:2412.03324  [pdf, other]

    cs.CV

    A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs

    Authors: Wangbo Zhao, Yizeng Han, Jiasheng Tang, Zhikai Li, Yibing Song, Kai Wang, Zhangyang Wang, Yang You

    Abstract: Vision-language models (VLMs) have shown remarkable success across various multi-modal tasks, yet large VLMs encounter significant efficiency challenges due to processing numerous visual tokens. A promising approach to accelerating large VLM inference is using partial information, such as attention maps from specific layers, to assess token importance and prune less essential tokens. However, our…

    Submitted 5 December, 2024; v1 submitted 4 December, 2024; originally announced December 2024.

  30. arXiv:2412.01179  [pdf, other]

    cs.CV

    Dual-Branch Graph Transformer Network for 3D Human Mesh Reconstruction from Video

    Authors: Tao Tang, Hong Liu, Yingxuan You, Ti Wang, Wenhao Li

    Abstract: Human Mesh Reconstruction (HMR) from monocular video plays an important role in human-robot interaction and collaboration. However, existing video-based human mesh reconstruction methods face a trade-off between accurate reconstruction and smooth motion. These methods design networks based on either RNNs or attention mechanisms to extract local temporal correlations or global temporal dependencies…

    Submitted 2 December, 2024; originally announced December 2024.

    Comments: Accepted by IROS 2024. Project page: https://github.com/TangTao-PKU/DGTR

  31. arXiv:2411.19458  [pdf, other]

    cs.CV

    Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning

    Authors: Yang You, Yixin Li, Congyue Deng, Yue Wang, Leonidas Guibas

    Abstract: Vision foundation models, particularly the ViT family, have revolutionized image understanding by providing rich semantic features. However, despite their success in 2D comprehension, their ability to grasp 3D spatial relationships is still unclear. In this work, we evaluate and enhance the 3D awareness of ViT-based models. We begin by systematically assessing their ability to learn 3D equiv…

    Submitted 19 February, 2025; v1 submitted 28 November, 2024; originally announced November 2024.

    Comments: 10 pages; Accepted to ICLR 2025

  32. arXiv:2411.11871  [pdf, other]

    cs.IR cs.LG math.OC

    MultiBalance: Multi-Objective Gradient Balancing in Industrial-Scale Multi-Task Recommendation System

    Authors: Yun He, Xuxing Chen, Jiayi Xu, Renqin Cai, Yiling You, Jennifer Cao, Minhui Huang, Liu Yang, Yiqun Liu, Xiaoyi Liu, Rong Jin, Sem Park, Bo Long, Xue Feng

    Abstract: In industrial recommendation systems, multi-task learning (learning multiple tasks simultaneously on a single model) is a predominant approach to save training/serving resources and improve recommendation performance via knowledge transfer between the joint learning tasks. However, multi-task learning often suffers from negative transfer: one or several tasks are less optimized than training them…

    Submitted 3 November, 2024; originally announced November 2024.

  33. arXiv:2411.03999  [pdf, other]

    cs.DC cs.AI

    ParaGAN: A Scalable Distributed Training Framework for Generative Adversarial Networks

    Authors: Ziji Shi, Jialin Li, Yang You

    Abstract: Recent advances in Generative Artificial Intelligence have fueled numerous applications, particularly those involving Generative Adversarial Networks (GANs), which are essential for synthesizing realistic photos and videos. However, efficiently training GANs remains a critical challenge due to their computationally intensive and numerically unstable nature. Existing methods often require days or e…

    Submitted 6 November, 2024; originally announced November 2024.

    Comments: Accepted at ACM Symposium on Cloud Computing (SoCC) 2024

  34. arXiv:2410.17193  [pdf, other]

    cs.CV cs.AI

    Emphasizing Discriminative Features for Dataset Distillation in Complex Scenarios

    Authors: Kai Wang, Zekai Li, Zhi-Qi Cheng, Samir Khaki, Ahmad Sajedi, Ramakrishna Vedantam, Konstantinos N Plataniotis, Alexander Hauptmann, Yang You

    Abstract: Dataset distillation has demonstrated strong performance on simple datasets like CIFAR, MNIST, and TinyImageNet but struggles to achieve similar results in more complex scenarios. In this paper, we propose EDF (Emphasizing Discriminative Features), a dataset distillation method that enhances key discriminative regions in synthetic images using Grad-CAM activation maps. Our approach is inspired…

    Submitted 31 March, 2025; v1 submitted 22 October, 2024; originally announced October 2024.

    Comments: 24 pages, 13 figures

  35. arXiv:2410.16919  [pdf, other]

    cs.RO cs.AI cs.CL cs.LG

    EnvBridge: Bridging Diverse Environments with Cross-Environment Knowledge Transfer for Embodied AI

    Authors: Tomoyuki Kagaya, Yuxuan Lou, Thong Jing Yuan, Subramanian Lakshmi, Jayashree Karlekar, Sugiri Pranata, Natsuki Murakami, Akira Kinose, Koki Oguri, Felix Wick, Yang You

    Abstract: In recent years, Large Language Models (LLMs) have demonstrated high reasoning capabilities, drawing attention for their applications as agents in various decision-making processes. One notably promising application of LLM agents is robotic manipulation. Recent research has shown that LLMs can generate text planning or control code for robots, providing substantial flexibility and interaction capa…

    Submitted 22 October, 2024; originally announced October 2024.

  36. ARTS: Semi-Analytical Regressor using Disentangled Skeletal Representations for Human Mesh Recovery from Videos

    Authors: Tao Tang, Hong Liu, Yingxuan You, Ti Wang, Wenhao Li

    Abstract: Although existing video-based 3D human mesh recovery methods have made significant progress, simultaneously estimating human pose and shape from low-resolution image features limits their performance. These image features lack sufficient spatial information about the human body and contain various noises (e.g., background, lighting, and clothing), which often results in inaccurate pose and inconsi…

    Submitted 20 October, 2024; originally announced October 2024.

    Comments: Accepted by ACM MM 2024. Project page: https://github.com/TangTao-PKU/ARTS

  37. arXiv:2410.13754  [pdf, other]

    cs.AI cs.LG cs.MM

    MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures

    Authors: Jinjie Ni, Yifan Song, Deepanway Ghosal, Bo Li, David Junhao Zhang, Xiang Yue, Fuzhao Xue, Zian Zheng, Kaichen Zhang, Mahir Shah, Kabir Jain, Yang You, Michael Shieh

    Abstract: Perceiving and generating diverse modalities are crucial for AI models to effectively learn from and engage with real-world signals, necessitating reliable evaluations for their development. We identify two major issues in current evaluations: (1) inconsistent standards, shaped by different communities with varying protocols and maturity levels; and (2) significant query, grading, and generalizati…

    Submitted 18 October, 2024; v1 submitted 17 October, 2024; originally announced October 2024.

  38. arXiv:2410.13496  [pdf, other]

    cs.RO

    State Estimation Transformers for Agile Legged Locomotion

    Authors: Chen Yu, Yichu Yang, Tianlin Liu, Yangwei You, Mingliang Zhou, Diyun Xiang

    Abstract: We propose a state estimation method that can accurately predict the robot's privileged states to push the limits of quadruped robots in executing advanced skills such as jumping in the wild. In particular, we present the State Estimation Transformers (SET), an architecture that casts the state estimation problem as conditional sequence modeling. SET outputs the robot states that are hard to obtai…

    Submitted 17 October, 2024; originally announced October 2024.

    Comments: Accepted by IROS 2024

  39. arXiv:2410.09181  [pdf, other]

    cs.CR cs.AI cs.CL cs.CY cs.LG

    Can a large language model be a gaslighter?

    Authors: Wei Li, Luyao Zhu, Yang Song, Ruixi Lin, Rui Mao, Yang You

    Abstract: Large language models (LLMs) have gained human trust due to their capabilities and helpfulness. However, this in turn may allow LLMs to affect users' mindsets by manipulating language. This is termed gaslighting, a psychological effect. In this work, we aim to investigate the vulnerability of LLMs under prompt-based and fine-tuning-based gaslighting attacks. Therefore, we propose a two-stage fram…

    Submitted 11 October, 2024; originally announced October 2024.

    Comments: 10/26 (Main Body/Total), 8 figures

  40. arXiv:2410.04035  [pdf, other]

    cs.HC

    Gamifying XAI: Enhancing AI Explainability for Non-technical Users through LLM-Powered Narrative Gamifications

    Authors: Yuzhe You, Jian Zhao

    Abstract: Artificial intelligence (AI) has become tightly integrated into modern technology, yet existing exploratory visualizations for explainable AI (XAI) are primarily designed for users with technical expertise. This leaves everyday users, who also regularly interact with AI systems, with limited resources to explore or understand AI technologies they use. We propose a novel framework that enables non-…

    Submitted 5 October, 2024; originally announced October 2024.

  41. arXiv:2410.03456  [pdf, other]

    cs.CV

    Dynamic Diffusion Transformer

    Authors: Wangbo Zhao, Yizeng Han, Jiasheng Tang, Kai Wang, Yibing Song, Gao Huang, Fan Wang, Yang You

    Abstract: Diffusion Transformer (DiT), an emerging diffusion model for image generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigations reveal that these costs stem from the static inference paradigm, which inevitably introduces redundant computation in certain diffusion timesteps and spatial regions. To address this inefficiency, we propose Dynami…

    Submitted 8 October, 2024; v1 submitted 4 October, 2024; originally announced October 2024.

  42. arXiv:2410.01733  [pdf, other]

    cs.CL

    Visual Perception in Text Strings

    Authors: Qi Jia, Xiang Yue, Shanshan Huang, Ziheng Qin, Yizhu Liu, Bill Yuchen Lin, Yang You

    Abstract: Understanding visual semantics embedded in consecutive characters is a crucial capability for both large language models (LLMs) and multi-modal large language models (MLLMs). This type of artifact possesses the unique characteristic that identical information can be readily formulated in both texts and images, making them a significant proxy for analyzing modern LLMs' and MLLMs' capabilities in mo…

    Submitted 2 October, 2024; originally announced October 2024.

  43. arXiv:2409.16451  [pdf, other]

    cs.RO

    Hierarchical Hybrid Learning for Long-Horizon Contact-Rich Robotic Assembly

    Authors: Jiankai Sun, Aidan Curtis, Yang You, Yan Xu, Michael Koehle, Leonidas Guibas, Sachin Chitta, Mac Schwager, Hui Li

    Abstract: Generalizable long-horizon robotic assembly requires reasoning at multiple levels of abstraction. End-to-end imitation learning (IL) has been proven a promising approach, but it requires a large amount of demonstration data for training and often fails to meet the high-precision requirement of assembly tasks. Reinforcement Learning (RL) approaches have succeeded in high-precision assembly tasks, b…

    Submitted 24 September, 2024; originally announced September 2024.

  44. arXiv:2408.12588  [pdf, other]

    cs.CV cs.DC

    Real-Time Video Generation with Pyramid Attention Broadcast

    Authors: Xuanlei Zhao, Xiaolong Jin, Kai Wang, Yang You

    Abstract: We present Pyramid Attention Broadcast (PAB), a real-time, high quality and training-free approach for DiT-based video generation. Our method is founded on the observation that attention difference in the diffusion process exhibits a U-shaped pattern, indicating significant redundancy. We mitigate this by broadcasting attention outputs to subsequent steps in a pyramid style. It applies different b…

    Submitted 27 February, 2025; v1 submitted 22 August, 2024; originally announced August 2024.

    Comments: ICLR 2025

  45. arXiv:2408.03360  [pdf, other]

    cs.LG cs.AI

    Prioritize Alignment in Dataset Distillation

    Authors: Zekai Li, Ziyao Guo, Wangbo Zhao, Tianle Zhang, Zhi-Qi Cheng, Samir Khaki, Kaipeng Zhang, Ahmad Sajedi, Konstantinos N Plataniotis, Kai Wang, Yang You

    Abstract: Dataset Distillation aims to compress a large dataset into a significantly more compact, synthetic one without compromising the performance of the trained models. To achieve this, existing methods use the agent model to extract information from the target dataset and embed it into the distilled dataset. Consequently, the quality of extracted and embedded information determines the quality of the d…

    Submitted 12 October, 2024; v1 submitted 6 August, 2024; originally announced August 2024.

    Comments: 19 pages, 9 figures

  46. arXiv:2408.02214  [pdf, other]

    cs.CV

    More Than Positive and Negative: Communicating Fine Granularity in Medical Diagnosis

    Authors: Xiangyu Peng, Kai Wang, Jianfei Yang, Yingying Zhu, Yang You

    Abstract: With the advance of deep learning, much progress has been made in building powerful artificial intelligence (AI) systems for automatic Chest X-ray (CXR) analysis. Most existing AI models are trained to be a binary classifier with the aim of distinguishing positive and negative cases. However, a large gap exists between the simple binary setting and complicated real-world medical scenarios. In this…

    Submitted 4 August, 2024; originally announced August 2024.

  47. arXiv:2408.01437  [pdf, other]

    cs.CV cs.GR

    Img2CAD: Reverse Engineering 3D CAD Models from Images through VLM-Assisted Conditional Factorization

    Authors: Yang You, Mikaela Angelina Uy, Jiaqi Han, Rahul Thomas, Haotong Zhang, Suya You, Leonidas Guibas

    Abstract: Reverse engineering 3D computer-aided design (CAD) models from images is an important task for many downstream applications including interactive editing, manufacturing, architecture, robotics, etc. The difficulty of the task lies in vast representational disparities between the CAD output and the image input. CAD models are precise, programmatic constructs that involve sequential operations comb…

    Submitted 19 July, 2024; originally announced August 2024.

  48. arXiv:2408.01415  [pdf, other]

    cs.AI cs.LG

    Conditional LoRA Parameter Generation

    Authors: Xiaolong Jin, Kai Wang, Dongwen Tang, Wangbo Zhao, Yukun Zhou, Junshu Tang, Yang You

    Abstract: Generative models have achieved remarkable success in image, video, and text domains. Inspired by this, researchers have explored utilizing generative models to generate neural network parameters. However, these efforts have been limited by the parameter size and the practicality of generating high-performance parameters. In this paper, we propose COND P-DIFF, a novel approach that demonstrates th…

    Submitted 2 August, 2024; originally announced August 2024.

  49. arXiv:2407.14102  [pdf, other]

    cs.RO

    MSSP : A Versatile Multi-Scenario Adaptable Intelligent Robot Simulation Platform Based on LIDAR-Inertial Fusion

    Authors: Qiyan Li, Chang Wu, Yifei Yuan, Yuan You

    Abstract: This letter presents a multi-scenario adaptable intelligent robot simulation platform based on LIDAR-inertial fusion, with three main features: (1) The platform includes a versatile robot model that can be freely controlled through manual control or autonomous tracking. This model is equipped with various types of LIDAR and Inertial Measurement Unit (IMU), providing ground truth information with a…

    Submitted 19 July, 2024; originally announced July 2024.

  50. arXiv:2407.01646  [pdf, other]

    cs.SE cs.AI

    ESALE: Enhancing Code-Summary Alignment Learning for Source Code Summarization

    Authors: Chunrong Fang, Weisong Sun, Yuchen Chen, Xiao Chen, Zhao Wei, Quanjun Zhang, Yudu You, Bin Luo, Yang Liu, Zhenyu Chen

    Abstract: (Source) code summarization aims to automatically generate succinct natural language summaries for given code snippets. Such summaries play a significant role in promoting developers to understand and maintain code. Inspired by neural machine translation, deep learning-based code summarization techniques widely adopt an encoder-decoder framework, where the encoder transforms given code snippets in…

    Submitted 30 June, 2024; originally announced July 2024.

    Comments: Accepted to IEEE Transactions on Software Engineering (TSE)

    MSC Class: 68-04

    ACM Class: D.2.3; I.2.7