Showing 1–50 of 132 results for author: Hou, Q

Searching in archive cs.
  1. arXiv:2504.19442  [pdf, other]

    cs.DC

    Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler

    Authors: Size Zheng, Wenlei Bao, Qi Hou, Xuegui Zheng, Jin Fang, Chenhui Huang, Tianqi Li, Haojie Duanmu, Renze Chen, Ruifan Xu, Yifan Guo, Ningxin Zheng, Ziheng Jiang, Xinyi Di, Dongyang Wang, Jianxi Ye, Haibin Lin, Li-Wen Chang, Liqiang Lu, Yun Liang, Jidong Zhai, Xin Liu

    Abstract: In this report, we propose Triton-distributed, an extension of existing Triton compiler, to overcome the programming challenges in distributed AI systems. Triton-distributed is the first compiler that supports native overlapping optimizations for distributed AI workloads, providing a good coverage of existing optimizations from different frameworks. First, we integrate communication primitives com…

    Submitted 4 May, 2025; v1 submitted 27 April, 2025; originally announced April 2025.

  2. arXiv:2504.15928  [pdf, other]

    cs.CV cs.AI

    A Clinician-Friendly Platform for Ophthalmic Image Analysis Without Technical Barriers

    Authors: Meng Wang, Tian Lin, Qingshan Hou, Aidi Lin, Jingcheng Wang, Qingsheng Peng, Truong X. Nguyen, Danqi Fang, Ke Zou, Ting Xu, Cancan Xue, Ten Cheer Quek, Qinkai Yu, Minxin Liu, Hui Zhou, Zixuan Xiao, Guiqin He, Huiyu Liang, Tingkun Shi, Man Chen, Linna Liu, Yuanyuan Peng, Lianyu Wang, Qiuming Hu, Junhong Chen, et al. (15 additional authors not shown)

    Abstract: Artificial intelligence (AI) shows remarkable potential in medical imaging diagnostics, but current models typically require retraining when deployed across different clinical centers, limiting their widespread adoption. We introduce GlobeReady, a clinician-friendly AI platform that enables ocular disease diagnosis without retraining/fine-tuning or technical expertise. GlobeReady achieves high acc…

    Submitted 22 April, 2025; originally announced April 2025.

  3. arXiv:2504.13914  [pdf, other]

    cs.CL

    Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning

    Authors: ByteDance Seed: Jiaze Chen, Tiantian Fan, Xin Liu, Lingjun Liu, Zhiqi Lin, Mingxuan Wang, Chengyi Wang, Xiangpeng Wei, Wenyuan Xu, Yufeng Yuan, Yu Yue, Lin Yan, Qiying Yu, Xiaochen Zuo, Chi Zhang, Ruofei Zhu, Zhecheng An, Zhihao Bai, Yu Bao, Xingyan Bin, Jiangjie Chen, Feng Chen, Hongmin Chen, et al. (249 additional authors not shown)

    Abstract: We introduce Seed1.5-Thinking, capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks. Seed1.5-Thinking achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA, demonstrating excellent reasoning abilities in STEM and coding. Beyond reasoning tasks, the method demonstrates notable generalization across diverse domains. For in…

    Submitted 29 April, 2025; v1 submitted 10 April, 2025; originally announced April 2025.

  4. arXiv:2504.04701  [pdf, other]

    cs.CV

    DFormerv2: Geometry Self-Attention for RGBD Semantic Segmentation

    Authors: Bo-Wen Yin, Jiao-Long Cao, Ming-Ming Cheng, Qibin Hou

    Abstract: Recent advances in scene understanding benefit a lot from depth maps because of the 3D geometry information, especially in complex conditions (e.g., low light and overexposed). Existing approaches encode depth maps along with RGB images and perform feature fusion between them to enable more robust predictions. Taking into account that depth can be regarded as a geometry supplement for RGB images,…

    Submitted 6 April, 2025; originally announced April 2025.

    Comments: Accepted by CVPR 2025

  5. arXiv:2503.23508  [pdf, other]

    cs.CV

    Re-Aligning Language to Visual Objects with an Agentic Workflow

    Authors: Yuming Chen, Jiangyan Feng, Haodong Zhang, Lijun Gong, Feng Zhu, Rui Zhao, Qibin Hou, Ming-Ming Cheng, Yibing Song

    Abstract: Language-based object detection (LOD) aims to align visual objects with language expressions. A large amount of paired data is utilized to improve LOD model generalizations. During the training process, recent studies leverage vision-language models (VLMs) to automatically generate human-like expressions for visual objects, facilitating training data scaling up. In this process, we observe that VL…

    Submitted 30 March, 2025; originally announced March 2025.

    Comments: 33 pages, 20 figures, 17 tables, ICLR 2025

  6. arXiv:2503.21076  [pdf, other]

    cs.CV cs.LG

    KAC: Kolmogorov-Arnold Classifier for Continual Learning

    Authors: Yusong Hu, Zichen Liang, Fei Yang, Qibin Hou, Xialei Liu, Ming-Ming Cheng

    Abstract: Continual learning requires models to train continuously across consecutive tasks without forgetting. Most existing methods utilize linear classifiers, which struggle to maintain a stable classification space while learning new tasks. Inspired by the success of Kolmogorov-Arnold Networks (KAN) in preserving learning stability during simple continual regression tasks, we set out to explore their po…

    Submitted 26 March, 2025; originally announced March 2025.

    Comments: CVPR 2025

  7. arXiv:2503.20313  [pdf, other]

    cs.DC

    TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives

    Authors: Size Zheng, Jin Fang, Xuegui Zheng, Qi Hou, Wenlei Bao, Ningxin Zheng, Ziheng Jiang, Dongyang Wang, Jianxi Ye, Haibin Lin, Li-Wen Chang, Xin Liu

    Abstract: Large deep learning models have achieved state-of-the-art performance in a wide range of tasks. These models often necessitate distributed systems for efficient training and inference. The fundamental building blocks for distributed model execution are intra-layer parallel operators. The most effective approach to enhancing the performance of intra-layer parallel operators involves overlapping com…

    Submitted 3 April, 2025; v1 submitted 26 March, 2025; originally announced March 2025.

  8. arXiv:2503.17831  [pdf, other]

    eess.IV cs.AI cs.CV

    FundusGAN: A Hierarchical Feature-Aware Generative Framework for High-Fidelity Fundus Image Generation

    Authors: Qingshan Hou, Meng Wang, Peng Cao, Zou Ke, Xiaoli Liu, Huazhu Fu, Osmar R. Zaiane

    Abstract: Recent advancements in ophthalmology foundation models such as RetFound have demonstrated remarkable diagnostic capabilities but require massive datasets for effective pre-training, creating significant barriers for development and deployment. To address this critical challenge, we propose FundusGAN, a novel hierarchical feature-aware generative framework specifically designed for high-fidelity fu…

    Submitted 22 March, 2025; originally announced March 2025.

  9. arXiv:2503.14453  [pdf, other]

    stat.ML cs.LG

    Online Conformal Probabilistic Numerics via Adaptive Edge-Cloud Offloading

    Authors: Qiushuo Hou, Sangwoo Park, Matteo Zecchin, Yunlong Cai, Guanding Yu, Osvaldo Simeone

    Abstract: Consider an edge computing setting in which a user submits queries for the solution of a linear system to an edge processor, which is subject to time-varying computing availability. The edge processor applies a probabilistic linear solver (PLS) so as to be able to respond to the user's query within the allotted time and computing budget. Feedback to the user is in the form of a set of plausible so…

    Submitted 29 April, 2025; v1 submitted 18 March, 2025; originally announced March 2025.

    Comments: This paper has been submitted to a conference

  10. arXiv:2503.12929  [pdf, other]

    cs.CV

    AR-1-to-3: Single Image to Consistent 3D Object Generation via Next-View Prediction

    Authors: Xuying Zhang, Yupeng Zhou, Kai Wang, Yikai Wang, Zhen Li, Shaohui Jiao, Daquan Zhou, Qibin Hou, Ming-Ming Cheng

    Abstract: Novel view synthesis (NVS) is a cornerstone for image-to-3d creation. However, existing works still struggle to maintain consistency between the generated views and the input views, especially when there is a significant camera pose difference, leading to poor-quality 3D geometries and textures. We attribute this issue to their treatment of all target views with equal priority according to our emp…

    Submitted 27 April, 2025; v1 submitted 17 March, 2025; originally announced March 2025.

  11. arXiv:2502.19811  [pdf, other]

    cs.DC cs.AI cs.LG

    Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts

    Authors: Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, Quan Chen, Xin Liu

    Abstract: Mixture-of-experts (MoE) has been extensively employed to scale large language models to trillion-plus parameters while maintaining a fixed computational cost. The development of large MoE models in the distributed scenario encounters the problem of large communication overhead. The inter-device communication of a MoE layer can occupy 47% time of the entire model execution with popular models and…

    Submitted 4 March, 2025; v1 submitted 27 February, 2025; originally announced February 2025.

  12. arXiv:2502.18461  [pdf, other]

    cs.CV

    K-LoRA: Unlocking Training-Free Fusion of Any Subject and Style LoRAs

    Authors: Ziheng Ouyang, Zhen Li, Qibin Hou

    Abstract: Recent studies have explored combining different LoRAs to jointly generate learned style and content. However, existing methods either fail to effectively preserve both the original subject and style simultaneously or require additional training. In this paper, we argue that the intrinsic properties of LoRA can effectively guide diffusion models in merging learned subject and style. Building on th…

    Submitted 2 March, 2025; v1 submitted 25 February, 2025; originally announced February 2025.

    Comments: CVPR 2025, Project page: https://k-lora.github.io/K-LoRA.io/

  13. arXiv:2502.06782  [pdf, other]

    cs.CV

    Lumina-Video: Efficient and Flexible Video Generation with Multi-scale Next-DiT

    Authors: Dongyang Liu, Shicheng Li, Yutong Liu, Zhen Li, Kai Wang, Xinyue Li, Qi Qin, Yufei Liu, Yi Xin, Zhongyu Li, Bin Fu, Chenyang Si, Yuewen Cao, Conghui He, Ziwei Liu, Yu Qiao, Qibin Hou, Hongsheng Li, Peng Gao

    Abstract: Recent advancements have established Diffusion Transformers (DiTs) as a dominant framework in generative modeling. Building on this success, Lumina-Next achieves exceptional performance in the generation of photorealistic images with Next-DiT. However, its potential for video generation remains largely untapped, with significant challenges in modeling the spatiotemporal complexity inherent to vide…

    Submitted 12 February, 2025; v1 submitted 10 February, 2025; originally announced February 2025.

  14. arXiv:2502.06289  [pdf]

    eess.IV cs.AI cs.CV

    Is an Ultra Large Natural Image-Based Foundation Model Superior to a Retina-Specific Model for Detecting Ocular and Systemic Diseases?

    Authors: Qingshan Hou, Yukun Zhou, Jocelyn Hui Lin Goh, Ke Zou, Samantha Min Er Yew, Sahana Srinivasan, Meng Wang, Thaddaeus Lo, Xiaofeng Lei, Siegfried K. Wagner, Mark A. Chia, Dawei Yang, Hongyang Jiang, AnRan Ran, Rui Santos, Gabor Mark Somfai, Juan Helen Zhou, Haoyu Chen, Qingyu Chen, Carol Yim-Lui Cheung, Pearse A. Keane, Yih Chung Tham

    Abstract: The advent of foundation models (FMs) is transforming medical domain. In ophthalmology, RETFound, a retina-specific FM pre-trained sequentially on 1.4 million natural images and 1.6 million retinal images, has demonstrated high adaptability across clinical applications. Conversely, DINOv2, a general-purpose vision FM pre-trained on 142 million natural images, has shown promise in non-medical domai…

    Submitted 10 February, 2025; originally announced February 2025.

  15. arXiv:2502.04393  [pdf, other]

    cs.CV

    UniCP: A Unified Caching and Pruning Framework for Efficient Video Generation

    Authors: Wenzhang Sun, Qirui Hou, Donglin Di, Jiahui Yang, Yongjia Ma, Jianxun Cui

    Abstract: Diffusion Transformers (DiT) excel in video generation but encounter significant computational challenges due to the quadratic complexity of attention. Notably, attention differences between adjacent diffusion steps follow a U-shaped pattern. Current methods leverage this property by caching attention blocks, however, they still struggle with sudden error spikes and large discrepancies. To address…

    Submitted 5 February, 2025; originally announced February 2025.

  16. arXiv:2501.12016  [pdf]

    cs.CV cs.LG

    Are Traditional Deep Learning Model Approaches as Effective as a Retinal-Specific Foundation Model for Ocular and Systemic Disease Detection?

    Authors: Samantha Min Er Yew, Xiaofeng Lei, Jocelyn Hui Lin Goh, Yibing Chen, Sahana Srinivasan, Miao-li Chee, Krithi Pushpanathan, Ke Zou, Qingshan Hou, Zhi Da Soh, Cancan Xue, Marco Chak Yan Yu, Charumathi Sabanayagam, E Shyong Tai, Xueling Sim, Yaxing Wang, Jost B. Jonas, Vinay Nangia, Gabriel Dawei Yang, Emma Anran Ran, Carol Yim-Lui Cheung, Yangqin Feng, Jun Zhou, Rick Siow Mong Goh, Yukun Zhou, et al. (4 additional authors not shown)

    Abstract: Background: RETFound, a self-supervised, retina-specific foundation model (FM), showed potential in downstream applications. However, its comparative performance with traditional deep learning (DL) models remains incompletely understood. This study aimed to evaluate RETFound against three ImageNet-pretrained supervised DL models (ResNet50, ViT-base, SwinV2) in detecting ocular and systemic disease…

    Submitted 21 January, 2025; originally announced January 2025.

  17. arXiv:2501.05067  [pdf, other]

    cs.CV cs.AI

    LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding

    Authors: Jiaxing Zhao, Boyuan Sun, Xiang Chen, Xihan Wei, Qibin Hou

    Abstract: In this paper, we introduce LLaVA-Octopus, a novel video multimodal large language model. LLaVA-Octopus adaptively weights features from different visual projectors based on user instructions, enabling us to leverage the complementary strengths of each projector. We observe that different visual projectors exhibit distinct characteristics when handling specific tasks. For instance, some projectors…

    Submitted 14 March, 2025; v1 submitted 9 January, 2025; originally announced January 2025.

    Comments: 18 pages, 10 figures

  18. arXiv:2501.03775  [pdf, other]

    cs.CV

    Strip R-CNN: Large Strip Convolution for Remote Sensing Object Detection

    Authors: Xinbin Yuan, Zhaohui Zheng, Yuxuan Li, Xialei Liu, Li Liu, Xiang Li, Qibin Hou, Ming-Ming Cheng

    Abstract: While witnessed with rapid development, remote sensing object detection remains challenging for detecting high aspect ratio objects. This paper shows that large strip convolutions are good feature representation learners for remote sensing object detection and can detect objects of various aspect ratios well. Based on large strip convolutions, we build a new network architecture called Strip R-CNN…

    Submitted 1 March, 2025; v1 submitted 7 January, 2025; originally announced January 2025.

  19. arXiv:2412.20665  [pdf, other]

    cs.CV cs.MM

    SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection

    Authors: Yuxuan Li, Xiang Li, Yunheng Li, Yicheng Zhang, Yimian Dai, Qibin Hou, Ming-Ming Cheng, Jian Yang

    Abstract: With the rapid advancement of remote sensing technology, high-resolution multi-modal imagery is now more widely accessible. Conventional Object detection models are trained on a single dataset, often restricted to a specific imaging modality and annotation format. However, such an approach overlooks the valuable shared knowledge across multi-modalities and limits the model's applicability in more…

    Submitted 29 December, 2024; originally announced December 2024.

  20. arXiv:2412.16919  [pdf, other]

    cs.CV

    TAR3D: Creating High-Quality 3D Assets via Next-Part Prediction

    Authors: Xuying Zhang, Yutong Liu, Yangguang Li, Renrui Zhang, Yufei Liu, Kai Wang, Wanli Ouyang, Zhiwei Xiong, Peng Gao, Qibin Hou, Ming-Ming Cheng

    Abstract: We present TAR3D, a novel framework that consists of a 3D-aware Vector Quantized-Variational AutoEncoder (VQ-VAE) and a Generative Pre-trained Transformer (GPT) to generate high-quality 3D assets. The core insight of this work is to migrate the multimodal unification and promising learning capabilities of the next-token prediction paradigm to conditional 3D object generation. To achieve this, the…

    Submitted 11 March, 2025; v1 submitted 22 December, 2024; originally announced December 2024.

  21. arXiv:2412.11464  [pdf, other]

    cs.CV

    High-Quality Mask Tuning Matters for Open-Vocabulary Segmentation

    Authors: Quan-Sheng Zeng, Yunheng Li, Daquan Zhou, Guanbin Li, Qibin Hou, Ming-Ming Cheng

    Abstract: Open-vocabulary image segmentation has been advanced through the synergy between mask generators and vision-language models like Contrastive Language-Image Pre-training (CLIP). Previous approaches focus on generating masks while aligning mask features with text embeddings during training. In this paper, we observe that relying on generated low-quality masks can weaken the alignment of vision and l…

    Submitted 12 March, 2025; v1 submitted 16 December, 2024; originally announced December 2024.

    Comments: Revised version according to comments from reviewers of ICLR2025

  22. arXiv:2412.06244  [pdf, other]

    cs.CV

    Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction

    Authors: Yunheng Li, Yuxuan Li, Quansheng Zeng, Wenhai Wang, Qibin Hou, Ming-Ming Cheng

    Abstract: Pre-trained vision-language models (VLMs), such as CLIP, have demonstrated impressive zero-shot recognition capability, but still underperform in dense prediction tasks. Self-distillation recently is emerging as a promising approach for fine-tuning VLMs to better adapt to local regions without requiring extensive annotations. However, previous state-of-the-art approaches often suffer from signific…

    Submitted 10 March, 2025; v1 submitted 9 December, 2024; originally announced December 2024.

  23. arXiv:2410.18931  [pdf, other]

    cs.CV

    Sort-free Gaussian Splatting via Weighted Sum Rendering

    Authors: Qiqi Hou, Randall Rauwendaal, Zifeng Li, Hoang Le, Farzad Farhadzadeh, Fatih Porikli, Alexei Bourd, Amir Said

    Abstract: Recently, 3D Gaussian Splatting (3DGS) has emerged as a significant advancement in 3D scene reconstruction, attracting considerable attention due to its ability to recover high-fidelity details while maintaining low complexity. Despite the promising results achieved by 3DGS, its rendering performance is constrained by its dependence on costly non-commutative alpha-blending operations. These operat…

    Submitted 8 April, 2025; v1 submitted 24 October, 2024; originally announced October 2024.

    Comments: ICLR 2025

  24. arXiv:2410.14279  [pdf, other]

    cs.CV

    ControlSR: Taming Diffusion Models for Consistent Real-World Image Super Resolution

    Authors: Yuhao Wan, Peng-Tao Jiang, Qibin Hou, Hao Zhang, Jinwei Chen, Ming-Ming Cheng, Bo Li

    Abstract: We present ControlSR, a new method that can tame Diffusion Models for consistent real-world image super-resolution (Real-ISR). Previous Real-ISR models mostly focus on how to activate more generative priors of text-to-image diffusion models to make the output high-resolution (HR) images look better. However, since these methods rely too much on the generative priors, the content of the output imag…

    Submitted 1 April, 2025; v1 submitted 18 October, 2024; originally announced October 2024.

  25. arXiv:2410.06397  [pdf, other]

    cs.LG cs.DS math.ST

    Provable Accuracy Bounds for Hybrid Dynamical Optimization and Sampling

    Authors: Matthew X. Burns, Qingyuan Hou, Michael C. Huang

    Abstract: Analog dynamical accelerators (DXs) are a growing sub-field in computer architecture research, offering order-of-magnitude gains in power efficiency and latency over traditional digital methods in several machine learning, optimization, and sampling tasks. However, limited-capacity accelerators require hybrid analog/digital algorithms to solve real-world problems, commonly using large-neighborhood…

    Submitted 7 May, 2025; v1 submitted 8 October, 2024; originally announced October 2024.

    Comments: 33 pages, 3 figures

    MSC Class: 60J60; ACM Class: F.2.0

  26. arXiv:2410.00150  [pdf, other]

    cs.IT cs.LG cs.NI eess.SP

    What If We Had Used a Different App? Reliable Counterfactual KPI Analysis in Wireless Systems

    Authors: Qiushuo Hou, Sangwoo Park, Matteo Zecchin, Yunlong Cai, Guanding Yu, Osvaldo Simeone

    Abstract: In modern wireless network architectures, such as Open Radio Access Network (O-RAN), the operation of the radio access network (RAN) is managed by applications, or apps for short, deployed at intelligent controllers. These apps are selected from a given catalog based on current contextual information. For instance, a scheduling app may be selected on the basis of current traffic and network condit…

    Submitted 23 January, 2025; v1 submitted 30 September, 2024; originally announced October 2024.

    Comments: This paper has been submitted to a journal

  27. arXiv:2409.15623  [pdf, other]

    eess.AS cs.AI cs.SD

    Safe Guard: an LLM-agent for Real-time Voice-based Hate Speech Detection in Social Virtual Reality

    Authors: Yiwen Xu, Qinyang Hou, Hongyu Wan, Mirjana Prpa

    Abstract: In this paper, we present Safe Guard, an LLM-agent for the detection of hate speech in voice-based interactions in social VR (VRChat). Our system leverages Open AI GPT and audio feature extraction for real-time voice interactions. We contribute a system design and evaluation of the system that demonstrates the capability of our approach in detecting hate speech, and reducing false positives compar…

    Submitted 23 September, 2024; originally announced September 2024.

  28. arXiv:2409.09350  [pdf, other]

    cs.CV

    OPUS: Occupancy Prediction Using a Sparse Set

    Authors: Jiabao Wang, Zhaojiang Liu, Qiang Meng, Liujiang Yan, Ke Wang, Jie Yang, Wei Liu, Qibin Hou, Ming-Ming Cheng

    Abstract: Occupancy prediction, aiming at predicting the occupancy status within voxelized 3D environment, is quickly gaining momentum within the autonomous driving community. Mainstream occupancy prediction works first discretize the 3D environment into voxels, then perform classification on such dense grids. However, inspection on sample data reveals that the vast majority of voxels is unoccupied. Perform…

    Submitted 30 October, 2024; v1 submitted 14 September, 2024; originally announced September 2024.

  29. arXiv:2408.14968  [pdf, other]

    cs.IR cs.CL

    MRSE: An Efficient Multi-modality Retrieval System for Large Scale E-commerce

    Authors: Hao Jiang, Haoxiang Zhang, Qingshan Hou, Chaofeng Chen, Weisi Lin, Jingchang Zhang, Annan Wang

    Abstract: Providing high-quality item recall for text queries is crucial in large-scale e-commerce search systems. Current Embedding-based Retrieval Systems (ERS) embed queries and items into a shared low-dimensional space, but uni-modality ERS rely too heavily on textual features, making them unreliable in complex contexts. While multi-modality ERS incorporate various data sources, they often overlook indi…

    Submitted 27 August, 2024; originally announced August 2024.

  30. arXiv:2408.07595  [pdf, other]

    cs.CV

    Progressive Radiance Distillation for Inverse Rendering with Gaussian Splatting

    Authors: Keyang Ye, Qiming Hou, Kun Zhou

    Abstract: We propose progressive radiance distillation, an inverse rendering method that combines physically-based rendering with Gaussian-based radiance field rendering using a distillation progress map. Taking multi-view images as input, our method starts from a pre-trained radiance field guidance, and distills physically-based light and material parameters from the radiance field using an image-fitting p…

    Submitted 14 August, 2024; originally announced August 2024.

  31. arXiv:2407.04800  [pdf, other]

    cs.CV

    Segmentation-Free Guidance for Text-to-Image Diffusion Models

    Authors: Kambiz Azarian, Debasmit Das, Qiqi Hou, Fatih Porikli

    Abstract: We introduce segmentation-free guidance, a novel method designed for text-to-image diffusion models like Stable Diffusion. Our method does not require retraining of the diffusion model. At no additional compute cost, it uses the diffusion model itself as an implied segmentation network, hence named segmentation-free guidance, to dynamically adjust the negative prompt for each patch of the generate…

    Submitted 3 June, 2024; originally announced July 2024.

  32. arXiv:2407.04305  [pdf, other]

    cs.CV

    Towards Stable 3D Object Detection

    Authors: Jiabao Wang, Qiang Meng, Guochao Liu, Liujiang Yan, Ke Wang, Ming-Ming Cheng, Qibin Hou

    Abstract: In autonomous driving, the temporal stability of 3D object detection greatly impacts the driving safety. However, the detection stability cannot be accessed by existing metrics such as mAP and MOTA, and consequently is less explored by the community. To bridge this gap, this work proposes Stability Index (SI), a new metric that can comprehensively evaluate the stability of 3D detectors in terms of…

    Submitted 5 July, 2024; originally announced July 2024.

  33. arXiv:2407.00021  [pdf, other]

    cs.CV cs.GR eess.IV

    Neural Graphics Texture Compression Supporting Random Access

    Authors: Farzad Farhadzadeh, Qiqi Hou, Hoang Le, Amir Said, Randall Rauwendaal, Alex Bourd, Fatih Porikli

    Abstract: Advances in rendering have led to tremendous growth in texture assets, including resolution, complexity, and novel textures components, but this growth in data volume has not been matched by advances in its compression. Meanwhile Neural Image Compression (NIC) has advanced significantly and shown promising results, but the proposed methods cannot be directly adapted to neural texture compression.…

    Submitted 25 October, 2024; v1 submitted 6 May, 2024; originally announced July 2024.

    Comments: ECCV 2024

  34. arXiv:2406.15819  [pdf, other]

    cs.LG cs.IT cs.NI eess.SP

    Automatic AI Model Selection for Wireless Systems: Online Learning via Digital Twinning

    Authors: Qiushuo Hou, Matteo Zecchin, Sangwoo Park, Yunlong Cai, Guanding Yu, Kaushik Chowdhury, Osvaldo Simeone

    Abstract: In modern wireless network architectures, such as O-RAN, artificial intelligence (AI)-based applications are deployed at intelligent controllers to carry out functionalities like scheduling or power control. The AI "apps" are selected on the basis of contextual information such as network conditions, topology, traffic statistics, and design goals. The mapping between context and AI model parameter…

    Submitted 21 October, 2024; v1 submitted 22 June, 2024; originally announced June 2024.

    Comments: submitted for a journal publication

  35. arXiv:2406.06858  [pdf, other]

    cs.LG cs.DC

    FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion

    Authors: Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Chengji Yao, Ziheng Jiang, Haibin Lin, Xin Jin, Xin Liu

    Abstract: Large deep learning models have demonstrated strong ability to solve many tasks across a wide range of applications. Those large models typically require training and inference to be distributed. Tensor parallelism is a common technique partitioning computation of an operation or layer across devices to overcome the memory capacity limitation of a single processor, and/or to accelerate computation…

    Submitted 23 October, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

  36. arXiv:2406.00670  [pdf, other]

    cs.CV

    Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation

    Authors: Yunheng Li, ZhongYu Li, Quansheng Zeng, Qibin Hou, Ming-Ming Cheng

    Abstract: Pre-trained vision-language models, e.g., CLIP, have been successfully applied to zero-shot semantic segmentation. Existing CLIP-based approaches primarily utilize visual features from the last layer to align with text embeddings, while they neglect the crucial information in intermediate layers that contain rich object details. However, we find that directly aggregating the multi-level visual fea…

    Submitted 6 June, 2024; v1 submitted 2 June, 2024; originally announced June 2024.

    Comments: Accepted by ICML 2024

  37. arXiv:2405.08021  [pdf, other]

    cs.SD eess.AS

    Diff-ETS: Learning a Diffusion Probabilistic Model for Electromyography-to-Speech Conversion

    Authors: Zhao Ren, Kevin Scheck, Qinhan Hou, Stefano van Gogh, Michael Wand, Tanja Schultz

    Abstract: Electromyography-to-Speech (ETS) conversion has demonstrated its potential for silent speech interfaces by generating audible speech from Electromyography (EMG) signals during silent articulations. ETS models usually consist of an EMG encoder which converts EMG signals to acoustic speech features, and a vocoder which then synthesises the speech signals. Due to an inadequate amount of available dat…

    Submitted 11 May, 2024; originally announced May 2024.

    Comments: Accepted by EMBC 2024

  38. arXiv:2405.01434  [pdf, other]

    cs.CV

    StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation

    Authors: Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, Qibin Hou

    Abstract: For recent diffusion-based generative models, maintaining consistent content across a series of generated images, especially those containing subjects and complex details, presents a significant challenge. In this paper, we propose a new way of self-attention calculation, termed Consistent Self-Attention, that significantly boosts the consistency between the generated images and augments prevalent…

    Submitted 2 May, 2024; originally announced May 2024.

  39. 3D Gaussian Splatting with Deferred Reflection

    Authors: Keyang Ye, Qiming Hou, Kun Zhou

    Abstract: The advent of neural and Gaussian-based radiance field methods have achieved great success in the field of novel view synthesis. However, specular reflection remains non-trivial, as the high frequency radiance field is notoriously difficult to fit stably and accurately. We present a deferred shading method to effectively render specular reflection with Gaussian splatting. The key challenge comes f…

    Submitted 4 June, 2024; v1 submitted 29 April, 2024; originally announced April 2024.

  40. Synthesizing Realistic Data for Table Recognition

    Authors: Qiyu Hou, Jun Wang, Meixuan Qiao, Lujun Tian

    Abstract: To overcome the limitations and challenges of current automatic table data annotation methods and random table data synthesis approaches, we propose a novel method for synthesizing annotation data specifically designed for table recognition. This method utilizes the structure and content of existing complex tables, facilitating the efficient creation of tables that closely replicate the authentic…

    Submitted 9 July, 2024; v1 submitted 17 April, 2024; originally announced April 2024.

    Comments: ICDAR 2024

  41. arXiv:2404.04887  [pdf, other]

    cs.CV

    A Clinical-oriented Multi-level Contrastive Learning Method for Disease Diagnosis in Low-quality Medical Images

    Authors: Qingshan Hou, Shuai Cheng, Peng Cao, Jinzhu Yang, Xiaoli Liu, Osmar R. Zaiane, Yih Chung Tham

    Abstract: Representation learning offers a conduit to elucidate distinctive features within the latent space and interpret the deep models. However, the randomness of lesion distribution and the complexity of low-quality factors in medical images pose great challenges for models to extract key lesion features. Disease diagnosis methods guided by contrastive learning (CL) have shown significant advantages in…

    Submitted 7 April, 2024; originally announced April 2024.

  42. arXiv:2403.17879  [pdf, other]

    cs.CV eess.IV

    Low-Latency Neural Stereo Streaming

    Authors: Qiqi Hou, Farzad Farhadzadeh, Amir Said, Guillaume Sautiere, Hoang Le

    Abstract: The rise of new video modalities like virtual reality or autonomous driving has increased the demand for efficient multi-view video compression methods, both in terms of rate-distortion (R-D) performance and in terms of delay and runtime. While most recent stereo video compression approaches have shown promising performance, they compress left and right views sequentially, leading to poor parallel…

    Submitted 26 March, 2024; originally announced March 2024.

    Comments: Accepted by CVPR2024

  43. arXiv:2403.17749  [pdf, other]

    cs.CV

    Multi-Task Dense Prediction via Mixture of Low-Rank Experts

    Authors: Yuqi Yang, Peng-Tao Jiang, Qibin Hou, Hao Zhang, Jinwei Chen, Bo Li

    Abstract: Previous multi-task dense prediction methods based on the Mixture of Experts (MoE) have received great performance but they neglect the importance of explicitly modeling the global relations among all tasks. In this paper, we present a novel decoder-focused method for multi-task dense prediction, called Mixture-of-Low-Rank-Experts (MLoRE). To model the global task relationships, MLoRE adds a gener…

    Submitted 27 May, 2024; v1 submitted 26 March, 2024; originally announced March 2024.

    Comments: Accepted at CVPR 2024

  44. arXiv:2403.11735  [pdf, other]

    cs.CV cs.LG

    LSKNet: A Foundation Lightweight Backbone for Remote Sensing

    Authors: Yuxuan Li, Xiang Li, Yimian Dai, Qibin Hou, Li Liu, Yongxiang Liu, Ming-Ming Cheng, Jian Yang

    Abstract: Remote sensing images pose distinct challenges for downstream tasks due to their inherent complexity. While a considerable amount of research has been dedicated to remote sensing classification, object detection and semantic segmentation, most of these studies have overlooked the valuable prior knowledge embedded within remote sensing scenarios. Such prior knowledge can be useful because remote se…

    Submitted 30 September, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

    Comments: arXiv admin note: substantial text overlap with arXiv:2303.09030

  45. arXiv:2403.06534  [pdf, other]

    cs.CV cs.AI cs.CE cs.LG

    SARDet-100K: Towards Open-Source Benchmark and ToolKit for Large-Scale SAR Object Detection

    Authors: Yuxuan Li, Xiang Li, Weijie Li, Qibin Hou, Li Liu, Ming-Ming Cheng, Jian Yang

    Abstract: Synthetic Aperture Radar (SAR) object detection has gained significant attention recently due to its irreplaceable all-weather imaging capabilities. However, this research field suffers from both limited public datasets (mostly comprising <2K images with only mono-category objects) and inaccessible source code. To tackle these challenges, we establish a new benchmark dataset and an open-source met…

    Submitted 30 September, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

    Comments: 22 Pages, 10 Figures, 9 Tables

  46. arXiv:2402.17403  [pdf, other]

    cs.CV

    Sora Generates Videos with Stunning Geometrical Consistency

    Authors: Xuanyi Li, Daquan Zhou, Chenxu Zhang, Shaodong Wei, Qibin Hou, Ming-Ming Cheng

    Abstract: The recently developed Sora model [1] has exhibited remarkable capabilities in video generation, sparking intense discussions regarding its ability to simulate real-world phenomena. Despite its growing popularity, there is a lack of established metrics to evaluate its fidelity to real-world physics quantitatively. In this paper, we introduce a new benchmark that assesses the quality of the generat…

    Submitted 27 February, 2024; originally announced February 2024.

    Comments: 5 pages, 3 figures

  47. arXiv:2402.15627  [pdf, other]

    cs.LG cs.DC

    MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs

    Authors: Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao , et al. (7 additional authors not shown)

    Abstract: We present the design, implementation and engineering experience in building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. We take a full-stack approach that co-designs the algorithmic and system components across model bl…

    Submitted 23 February, 2024; originally announced February 2024.

  48. arXiv:2402.09270  [pdf, other]

    cs.CV

    Fast Window-Based Event Denoising with Spatiotemporal Correlation Enhancement

    Authors: Huachen Fang, Jinjian Wu, Qibin Hou, Weisheng Dong, Guangming Shi

    Abstract: Previous deep learning-based event denoising methods mostly suffer from poor interpretability and difficulty in real-time processing due to their complex architecture designs. In this paper, we propose window-based event denoising, which simultaneously deals with a stack of events while existing element-based denoising focuses on one event each time. Besides, we give the theoretical analysis based…

    Submitted 14 February, 2024; originally announced February 2024.

  49. arXiv:2402.05375  [pdf, other]

    cs.CV

    Get What You Want, Not What You Don't: Image Content Suppression for Text-to-Image Diffusion Models

    Authors: Senmao Li, Joost van de Weijer, Taihang Hu, Fahad Shahbaz Khan, Qibin Hou, Yaxing Wang, Jian Yang

    Abstract: The success of recent text-to-image diffusion models is largely due to their capacity to be guided by a complex text prompt, which enables users to precisely describe the desired content. However, these models struggle to effectively suppress the generation of undesired content, which is explicitly requested to be omitted from the generated image in the prompt. In this paper, we analyze how to man…

    Submitted 7 February, 2024; originally announced February 2024.

    Comments: ICLR 2024. Our code is available at https://github.com/sen-mao/SuppressEOT

  50. arXiv:2312.08866  [pdf, other]

    eess.IV cs.CV

    MCANet: Medical Image Segmentation with Multi-Scale Cross-Axis Attention

    Authors: Hao Shao, Quansheng Zeng, Qibin Hou, Jufeng Yang

    Abstract: Efficiently capturing multi-scale information and building long-range dependencies among pixels are essential for medical image segmentation because of the various sizes and shapes of the lesion regions or organs. In this paper, we present Multi-scale Cross-axis Attention (MCA) to solve the above challenging issues based on the efficient axial attention. Instead of simply connecting axial attentio…

    Submitted 17 April, 2025; v1 submitted 14 December, 2023; originally announced December 2023.

    Comments: Accepted to Machine Intelligence Research. DOI: 10.1007/s11633-025-1552-6