Showing 1–50 of 208 results for author: Gao, D

Searching in archive cs.
  1. arXiv:2505.10442  [pdf, ps, other]

    cs.RO cs.AI

    IN-RIL: Interleaved Reinforcement and Imitation Learning for Policy Fine-Tuning

    Authors: Dechen Gao, Hang Wang, Hanchu Zhou, Nejib Ammar, Shatadal Mishra, Ahmadreza Moradipari, Iman Soltani, Junshan Zhang

    Abstract: Imitation learning (IL) and reinforcement learning (RL) each offer distinct advantages for robotics policy learning: IL provides stable learning from demonstrations, and RL promotes generalization through exploration. While existing robot learning approaches using IL-based pre-training followed by RL-based fine-tuning are promising, this two-step learning paradigm often suffers from instability an…

    Submitted 15 May, 2025; originally announced May 2025.

  2. arXiv:2505.05752  [pdf, other]

    cs.CV cs.CY cs.LG cs.RO eess.IV

    Automating Infrastructure Surveying: A Framework for Geometric Measurements and Compliance Assessment Using Point Cloud Data

    Authors: Amin Ghafourian, Andrew Lee, Dechen Gao, Tyler Beer, Kin Yen, Iman Soltani

    Abstract: Automation can play a prominent role in improving efficiency, accuracy, and scalability in infrastructure surveying and assessing construction and compliance standards. This paper presents a framework for automation of geometric measurements and compliance assessment using point cloud data. The proposed approach integrates deep learning-based detection and segmentation, in conjunction with geometr…

    Submitted 8 May, 2025; originally announced May 2025.

    Comments: 19 pages, 15 figures, 4 tables

  3. arXiv:2504.01168  [pdf, other]

    cs.DS

    LimTDD: A Compact Decision Diagram Integrating Tensor and Local Invertible Map Representations

    Authors: Xin Hong, Aochu Dai, Dingchao Gao, Sanjiang Li, Zhengfeng Ji, Mingsheng Ying

    Abstract: Tensor Decision Diagrams (TDDs) provide an efficient structure for representing tensors by combining techniques from both tensor networks and decision diagrams, demonstrating competitive performance in quantum circuit simulation and verification. However, existing decision diagrams, including TDDs, fail to exploit isomorphisms within tensors, limiting their compression efficiency. This paper intro…

    Submitted 1 April, 2025; originally announced April 2025.

  4. arXiv:2503.22759  [pdf, other]

    cs.CR cs.AI

    Data Poisoning in Deep Learning: A Survey

    Authors: Pinlong Zhao, Weiyao Zhu, Pengfei Jiao, Di Gao, Ou Wu

    Abstract: Deep learning has become a cornerstone of modern artificial intelligence, enabling transformative applications across a wide range of domains. As the core element of deep learning, the quality and security of training data critically influence model performance and reliability. However, during the training process, deep learning models face the significant threat of data poisoning, where attackers…

    Submitted 27 March, 2025; originally announced March 2025.

  5. arXiv:2503.18556  [pdf, other]

    cs.CV cs.CL

    Instruction-Aligned Visual Attention for Mitigating Hallucinations in Large Vision-Language Models

    Authors: Bin Li, Dehong Gao, Yeyuan Wang, Linbo Jin, Shanqing Yu, Xiaoyan Cai, Libin Yang

    Abstract: Despite the significant success of Large Vision-Language models (LVLMs), these models still suffer hallucinations when describing images, generating answers that include non-existent objects. It is reported that these models tend to over-focus on certain irrelevant image tokens that do not contain critical information for answering the question and distort the output. To address this, we propose an…

    Submitted 24 March, 2025; originally announced March 2025.

    Comments: Accepted by ICME2025

  6. arXiv:2503.13563  [pdf, other]

    cs.CL cs.AI cs.IR

    MES-RAG: Bringing Multi-modal, Entity-Storage, and Secure Enhancements to RAG

    Authors: Pingyu Wu, Daiheng Gao, Jing Tang, Huimin Chen, Wenbo Zhou, Weiming Zhang, Nenghai Yu

    Abstract: Retrieval-Augmented Generation (RAG) improves Large Language Models (LLMs) by using external knowledge, but it struggles with precise entity information retrieval. In this paper, we proposed MES-RAG framework, which enhances entity-specific query handling and provides accurate, secure, and consistent responses. MES-RAG introduces proactive security measures that ensure system integrity by applying…

    Submitted 17 March, 2025; originally announced March 2025.

    Comments: NAACL 2025

  7. arXiv:2503.11290  [pdf, other]

    cs.CV eess.IV

    EmoAgent: Multi-Agent Collaboration of Plan, Edit, and Critic, for Affective Image Manipulation

    Authors: Qi Mao, Haobo Hu, Yujie He, Difei Gao, Haokun Chen, Libiao Jin

    Abstract: Affective Image Manipulation (AIM) aims to alter an image's emotional impact by adjusting multiple visual elements to evoke specific feelings. Effective AIM is inherently complex, necessitating a collaborative approach that involves identifying semantic cues within source images, manipulating these elements to elicit desired emotional responses, and verifying that the combined adjustments successfu…

    Submitted 14 March, 2025; originally announced March 2025.

  8. arXiv:2503.06353  [pdf]

    cs.CY cs.AI

    The AI Pentad, the CHARME$^{2}$D Model, and an Assessment of Current-State AI Regulation

    Authors: Di Kevin Gao, Sudip Mittal, Jiming Wu, Hongwei Du, Jingdao Chen, Shahram Rahimi

    Abstract: Artificial Intelligence (AI) has made remarkable progress in the past few years with AI-enabled applications beginning to permeate every aspect of our society. Despite the widespread consensus on the need to regulate AI, there remains a lack of a unified approach to framing, developing, and assessing AI regulations. Many of the existing methods take a value-based approach, for example, accountabil…

    Submitted 8 March, 2025; originally announced March 2025.

  9. arXiv:2503.04146  [pdf, other]

    cs.DS cs.ET quant-ph

    Image Computation for Quantum Transition Systems

    Authors: Xin Hong, Dingchao Gao, Sanjiang Li, Shenggang Ying, Mingsheng Ying

    Abstract: With the rapid progress in quantum hardware and software, the need for verification of quantum systems becomes increasingly crucial. While model checking is a dominant and very successful technique for verifying classical systems, its application to quantum systems is still an underdeveloped research area. This paper advances the development of model checking quantum systems by providing efficient…

    Submitted 6 March, 2025; originally announced March 2025.

  10. arXiv:2502.18480  [pdf, other]

    cs.IR cs.AI cs.CL

    QExplorer: Large Language Model Based Query Extraction for Toxic Content Exploration

    Authors: Shaola Ren, Li Ke, Longtao Huang, Dehong Gao, Hui Xue

    Abstract: Automatically extracting effective queries is challenging in information retrieval, especially in toxic content exploration, as such content is likely to be disguised. With the recent achievements in generative Large Language Model (LLM), we are able to leverage the capabilities of LLMs to extract effective queries for similar content exploration directly. This study proposes QExplorer, an approac…

    Submitted 6 February, 2025; originally announced February 2025.

  11. arXiv:2502.15281  [pdf, other]

    cs.CR cs.SE

    DITING: A Static Analyzer for Identifying Bad Partitioning Issues in TEE Applications

    Authors: Chengyan Ma, Ruidong Han, Ye Liu, Yuqing Niu, Di Lu, Chuang Tian, Jianfeng Ma, Debin Gao, David Lo

    Abstract: Trusted Execution Environment (TEE) enhances the security of mobile applications and cloud services by isolating sensitive code in the secure world from the non-secure normal world. However, TEE applications are still confronted with vulnerabilities stemming from bad partitioning. Bad partitioning can lead to critical security problems of TEE, such as leaking sensitive data to the normal world or…

    Submitted 21 February, 2025; originally announced February 2025.

  12. SEM-CLIP: Precise Few-Shot Learning for Nanoscale Defect Detection in Scanning Electron Microscope Image

    Authors: Qian Jin, Yuqi Jiang, Xudong Lu, Yumeng Liu, Yining Chen, Dawei Gao, Qi Sun, Cheng Zhuo

    Abstract: In the field of integrated circuit manufacturing, the detection and classification of nanoscale wafer defects are critical for subsequent root cause analysis and yield enhancement. The complex background patterns observed in scanning electron microscope (SEM) images and the diverse textures of the defects pose significant challenges. Traditional methods usually suffer from insufficient data, label…

    Submitted 15 February, 2025; originally announced February 2025.

    Comments: Published in ACM/IEEE International Conference on Computer-Aided Design (ICCAD), 2024

  13. arXiv:2502.14215  [pdf, other]

    cs.SE cs.AI

    Towards Secure Program Partitioning for Smart Contracts with LLM's In-Context Learning

    Authors: Ye Liu, Yuqing Niu, Chengyan Ma, Ruidong Han, Wei Ma, Yi Li, Debin Gao, David Lo

    Abstract: Smart contracts are highly susceptible to manipulation attacks due to the leakage of sensitive information. Addressing manipulation vulnerabilities is particularly challenging because they stem from inherent data confidentiality issues rather than straightforward implementation bugs. To tackle this by preventing sensitive information leakage, we present PartitionGPT, the first LLM-driven approach…

    Submitted 19 February, 2025; originally announced February 2025.

  14. arXiv:2502.13379  [pdf, other]

    cs.CR cs.SE

    AutoTEE: Automated Migration and Protection of Programs in Trusted Execution Environments

    Authors: Ruidong Han, Zhou Yang, Chengyan Ma, Ye Liu, Yuqing Niu, Siqi Ma, Debin Gao, David Lo

    Abstract: Trusted Execution Environments (TEEs) isolate a special space within a device's memory that is not accessible to the normal world (also known as Untrusted Environment), even when the device is compromised. Thus, developers can utilize TEEs to provide strong security guarantees for their programs, making sensitive operations like encrypted data storage, fingerprint verification, and remote attestat…

    Submitted 18 February, 2025; originally announced February 2025.

    Comments: 14 pages

  15. arXiv:2502.11427  [pdf, other]

    cs.CL cs.CV

    Do we Really Need Visual Instructions? Towards Visual Instruction-Free Fine-tuning for Large Vision-Language Models

    Authors: Zikang Liu, Kun Zhou, Wayne Xin Zhao, Dawei Gao, Yaliang Li, Ji-Rong Wen

    Abstract: Visual instruction tuning has become the predominant technology in eliciting the multimodal task-solving capabilities of large vision-language models (LVLMs). Despite the success, as visual instructions require images as the input, it would leave the gap in inheriting the task-solving capabilities from the backbone LLMs, and make it costly to collect a large-scale dataset. To address it, we propos…

    Submitted 16 February, 2025; originally announced February 2025.

    Comments: under review

  16. arXiv:2502.09596  [pdf, other]

    cs.AI cs.MA

    KIMAs: A Configurable Knowledge Integrated Multi-Agent System

    Authors: Zitao Li, Fei Wei, Yuexiang Xie, Dawei Gao, Weirui Kuang, Zhijian Ma, Bingchen Qian, Yaliang Li, Bolin Ding

    Abstract: Knowledge-intensive conversations supported by large language models (LLMs) have become one of the most popular and helpful applications that can assist people in different aspects. Many current knowledge-intensive applications are centered on retrieval-augmented generation (RAG) techniques. While many open-source RAG frameworks facilitate the development of RAG-based applications, they often fall…

    Submitted 13 February, 2025; originally announced February 2025.

  17. arXiv:2502.08047  [pdf, other]

    cs.AI cs.MA

    WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation

    Authors: Henry Hengyuan Zhao, Difei Gao, Mike Zheng Shou

    Abstract: Current GUI agents have achieved outstanding performance in GUI element grounding. However, planning remains highly challenging, especially due to sensitivity to the initial state of the environment. Specifically, slight differences in the initial state (such as the target software not being open or the interface not being in its default state) often lead to planning errors. This issue is widespread…

    Submitted 19 February, 2025; v1 submitted 11 February, 2025; originally announced February 2025.

    Comments: 19 pages, 18 figures

  18. arXiv:2501.04963  [pdf, other]

    cs.CR

    Shelving it rather than Ditching it: Dynamically Debloating DEX and Native Methods of Android Applications without APK Modification

    Authors: Zicheng Zhang, Jiakun Liu, Ferdian Thung, Haoyu Ma, Rui Li, Yan Naing Tun, Wei Minn, Lwin Khin Shar, Shahar Maoz, Eran Toch, David Lo, Joshua Wong, Debin Gao

    Abstract: Today's Android developers tend to include numerous features to accommodate diverse user requirements, which inevitably leads to bloated apps. Yet more often than not, only a fraction of these features are frequently utilized by users, thus a bloated app costs dearly in potential vulnerabilities, expanded attack surfaces, and additional resource consumption. Especially in the event of severe secur…

    Submitted 8 January, 2025; originally announced January 2025.

  19. arXiv:2412.20413  [pdf, other]

    cs.CV

    EraseAnything: Enabling Concept Erasure in Rectified Flow Transformers

    Authors: Daiheng Gao, Shilin Lu, Shaw Walters, Wenbo Zhou, Jiaming Chu, Jie Zhang, Bang Zhang, Mengxi Jia, Jian Zhao, Zhaoxin Fan, Weiming Zhang

    Abstract: Removing unwanted concepts from large-scale text-to-image (T2I) diffusion models while maintaining their overall generative quality remains an open challenge. This difficulty is especially pronounced in emerging paradigms, such as Stable Diffusion (SD) v3 and Flux, which incorporate flow matching and transformer-based architectures. These advancements limit the transferability of existing concept-…

    Submitted 2 January, 2025; v1 submitted 29 December, 2024; originally announced December 2024.

    Comments: 24 pages, 18 figures

  20. arXiv:2412.20062  [pdf, other]

    cs.CV

    MADiff: Text-Guided Fashion Image Editing with Mask Prediction and Attention-Enhanced Diffusion

    Authors: Zechao Zhan, Dehong Gao, Jinxia Zhang, Jiale Huang, Yang Hu, Xin Wang

    Abstract: Text-guided image editing model has achieved great success in general domain. However, directly applying these models to the fashion domain may encounter two issues: (1) Inaccurate localization of editing region; (2) Weak editing magnitude. To address these issues, the MADiff model is proposed. Specifically, to more accurately identify editing region, the MaskNet is proposed, in which the foregrou…

    Submitted 15 January, 2025; v1 submitted 28 December, 2024; originally announced December 2024.

  21. arXiv:2412.19997  [pdf, other]

    cs.CV

    FashionFAE: Fine-grained Attributes Enhanced Fashion Vision-Language Pre-training

    Authors: Jiale Huang, Dehong Gao, Jinxia Zhang, Zechao Zhan, Yang Hu, Xin Wang

    Abstract: Large-scale Vision-Language Pre-training (VLP) has demonstrated remarkable success in the general domain. However, in the fashion domain, items are distinguished by fine-grained attributes like texture and material, which are crucial for tasks such as retrieval. Existing models often fail to leverage these fine-grained attributes from both text and image modalities. To address the above issues, we…

    Submitted 12 January, 2025; v1 submitted 27 December, 2024; originally announced December 2024.

    Comments: 5 pages, Accepted by ICASSP2025, full paper

  22. arXiv:2412.16869  [pdf, other]

    cs.CV

    CoF: Coarse to Fine-Grained Image Understanding for Multi-modal Large Language Models

    Authors: Yeyuan Wang, Dehong Gao, Bin Li, Rujiao Long, Lei Yi, Xiaoyan Cai, Libin Yang, Jinxia Zhang, Shanqing Yu, Qi Xuan

    Abstract: The impressive performance of Large Language Model (LLM) has prompted researchers to develop Multi-modal LLM (MLLM), which has shown great potential for various multi-modal tasks. However, current MLLM often struggles to effectively address fine-grained multi-modal challenges. We argue that this limitation is closely linked to the models' visual grounding capabilities. The restricted spatial aware…

    Submitted 22 December, 2024; originally announced December 2024.

    Comments: 5 pages, Accepted by ICASSP2025, full paper

  23. arXiv:2412.11596  [pdf, other]

    cs.CV cs.GR

    MeshArt: Generating Articulated Meshes with Structure-guided Transformers

    Authors: Daoyi Gao, Yawar Siddiqui, Lei Li, Angela Dai

    Abstract: Articulated 3D object generation is fundamental for creating realistic, functional, and interactable virtual assets which are not simply static. We introduce MeshArt, a hierarchical transformer-based approach to generate articulated 3D meshes with clean, compact geometry, reminiscent of human-crafted 3D models. We approach articulated mesh generation in a part-by-part fashion across two stages. Fi…

    Submitted 16 December, 2024; originally announced December 2024.

    Comments: Project Page: https://daoyig.github.io/Mesh_Art/

  24. arXiv:2412.10029  [pdf, other]

    cs.CV

    Enhancing Fine-Grained Vision-Language Pretraining with Negative Augmented Samples

    Authors: Yeyuan Wang, Dehong Gao, Lei Yi, Linbo Jin, Jinxia Zhang, Libin Yang, Xiaoyan Cai

    Abstract: Existing Vision-Language Pretraining (VLP) methods have achieved remarkable improvements across a variety of vision-language tasks, confirming their effectiveness in capturing coarse-grained semantic correlations. However, their capability for fine-grained understanding, which is critical for many nuanced vision-language applications, remains limited. Prevailing VLP models often overlook the intri…

    Submitted 13 December, 2024; originally announced December 2024.

    Comments: 15 pages, Accepted by AAAI2025, full paper

  25. arXiv:2412.09782  [pdf, other]

    cs.RO cs.CV cs.MA

    EI-Drive: A Platform for Cooperative Perception with Realistic Communication Models

    Authors: Hanchu Zhou, Edward Xie, Wei Shao, Dechen Gao, Michelle Dong, Junshan Zhang

    Abstract: The growing interest in autonomous driving calls for realistic simulation platforms capable of accurately simulating cooperative perception process in realistic traffic scenarios. Existing studies for cooperative perception often have not accounted for transmission latency and errors in real-world environments. To address this gap, we introduce EI-Drive, an edge-AI based autonomous driving simulat…

    Submitted 12 December, 2024; originally announced December 2024.

  26. arXiv:2412.07405  [pdf, other]

    cs.LG cs.AI

    MoDULA: Mixture of Domain-Specific and Universal LoRA for Multi-Task Learning

    Authors: Yufei Ma, Zihan Liang, Huangyu Dai, Ben Chen, Dehong Gao, Zhuoran Ran, Wang Zihan, Linbo Jin, Wen Jiang, Guannan Zhang, Xiaoyan Cai, Libin Yang

    Abstract: The growing demand for larger-scale models in the development of Large Language Models (LLMs) poses challenges for efficient training within limited computational resources. Traditional fine-tuning methods often exhibit instability in multi-task learning and rely heavily on extensive training resources. Here, we propose MoDULA (Mixture of Domai…

    Submitted 10 December, 2024; originally announced December 2024.

  27. arXiv:2411.17465  [pdf, other]

    cs.CV cs.AI cs.CL cs.HC

    ShowUI: One Vision-Language-Action Model for GUI Visual Agent

    Authors: Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou

    Abstract: Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. While most agents are language-based, relying on closed-source API with text-rich meta-information (e.g., HTML or accessibility tree), they show limitations in perceiving UI visuals as humans do, highlighting the need for GUI visual agents. In this work, we develop a vision-langu…

    Submitted 26 November, 2024; originally announced November 2024.

    Comments: Technical Report. Github: https://github.com/showlab/ShowUI

  28. arXiv:2411.10323  [pdf, other]

    cs.AI cs.CL cs.CV

    The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use

    Authors: Siyuan Hu, Mingyu Ouyang, Difei Gao, Mike Zheng Shou

    Abstract: The recently released model, Claude 3.5 Computer Use, stands out as the first frontier AI model to offer computer use in public beta as a graphical user interface (GUI) agent. As an early beta, its capability in the real-world complex environment remains unknown. In this case study to explore Claude 3.5 Computer Use, we curate and organize a collection of carefully designed tasks spanning a variet…

    Submitted 15 November, 2024; originally announced November 2024.

    Comments: 40 pages, 21 figures, preprint

  29. arXiv:2410.14659  [pdf, other]

    cs.LG stat.ML

    Harnessing Causality in Reinforcement Learning With Bagged Decision Times

    Authors: Daiqi Gao, Hsin-Yu Lai, Predrag Klasnja, Susan A. Murphy

    Abstract: We consider reinforcement learning (RL) for a class of problems with bagged decision times. A bag contains a finite sequence of consecutive decision times. The transition dynamics are non-Markovian and non-stationary within a bag. All actions within a bag jointly impact a single reward, observed at the end of the bag. For example, in mobile health, multiple activity suggestions in a day collective…

    Submitted 6 May, 2025; v1 submitted 18 October, 2024; originally announced October 2024.

  30. arXiv:2410.04360  [pdf, other]

    cs.MA cs.AI

    GenSim: A General Social Simulation Platform with Large Language Model based Agents

    Authors: Jiakai Tang, Heyang Gao, Xuchen Pan, Lei Wang, Haoran Tan, Dawei Gao, Yushuo Chen, Xu Chen, Yankai Lin, Yaliang Li, Bolin Ding, Jingren Zhou, Jun Wang, Ji-Rong Wen

    Abstract: With the rapid advancement of large language models (LLMs), recent years have witnessed many promising studies on leveraging LLM-based agents to simulate human social behavior. While prior work has demonstrated significant potential across various domains, much of it has focused on specific scenarios involving a limited number of agents and has lacked the ability to adapt when errors occur during…

    Submitted 9 October, 2024; v1 submitted 6 October, 2024; originally announced October 2024.

  31. arXiv:2409.17435  [pdf, other]

    cs.RO

    Active Vision Might Be All You Need: Exploring Active Vision in Bimanual Robotic Manipulation

    Authors: Ian Chuang, Andrew Lee, Dechen Gao, M-Mahdi Naddaf-Sh, Iman Soltani

    Abstract: Imitation learning has demonstrated significant potential in performing high-precision manipulation tasks using visual feedback. However, it is common practice in imitation learning for cameras to be fixed in place, resulting in issues like occlusion and limited field of view. Furthermore, cameras are often placed in broad, general locations, without an effective viewpoint specific to the robot's…

    Submitted 7 March, 2025; v1 submitted 25 September, 2024; originally announced September 2024.

    Comments: 6 pages, 4 figures

  32. GEVO: Memory-Efficient Monocular Visual Odometry Using Gaussians

    Authors: Dasong Gao, Peter Zhi Xuan Li, Vivienne Sze, Sertac Karaman

    Abstract: Constructing a high-fidelity representation of the 3D scene using a monocular camera can enable a wide range of applications on mobile devices, such as micro-robots, smartphones, and AR/VR headsets. On these devices, memory is often limited in capacity and its access often dominates the consumption of compute energy. Although Gaussian Splatting (GS) allows for high-fidelity reconstruction of 3D sc…

    Submitted 29 January, 2025; v1 submitted 14 September, 2024; originally announced September 2024.

    Comments: 8 pages

  33. arXiv:2409.03185  [pdf, ps, other]

    quant-ph cs.ET

    DasAtom: A Divide-and-Shuttle Atom Approach to Quantum Circuit Transformation

    Authors: Yunqi Huang, Dingchao Gao, Shenggang Ying, Sanjiang Li

    Abstract: Neutral atom (NA) quantum systems are emerging as a leading platform for quantum computation, offering superior or competitive qubit count and gate fidelity compared to superconducting circuits and ion traps. However, the unique features of NA devices, such as long-range interactions, long qubit coherence time, and the ability to physically move qubits, present distinct challenges for quantum circ…

    Submitted 20 January, 2025; v1 submitted 4 September, 2024; originally announced September 2024.

    Comments: This paper is accepted by IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

  34. arXiv:2408.16251  [pdf, other]

    cs.IT eess.SP

    Neural Network-Assisted Hybrid Model Based Message Passing for Parametric Holographic MIMO Near Field Channel Estimation

    Authors: Zhengdao Yuan, Yabo Guo, Dawei Gao, Qinghua Guo, Zhongyong Wang, Chongwen Huang, Ming Jin, Kai-Kit Wong

    Abstract: Holographic multiple-input and multiple-output (HMIMO) is a promising technology with the potential to achieve high energy and spectral efficiencies, enhance system capacity and diversity, etc. In this work, we address the challenge of HMIMO near field (NF) channel estimation, which is complicated by the intricate model introduced by the dyadic Green's function. Despite its complexity, the channel…

    Submitted 29 August, 2024; originally announced August 2024.

  35. arXiv:2408.08913  [pdf, other]

    cs.IR

    MLoRA: Multi-Domain Low-Rank Adaptive Network for CTR Prediction

    Authors: Zhiming Yang, Haining Gao, Dehong Gao, Luwei Yang, Libin Yang, Xiaoyan Cai, Wei Ning, Guannan Zhang

    Abstract: Click-through rate (CTR) prediction is one of the fundamental tasks in the industry, especially in e-commerce, social media, and streaming media. It directly impacts website revenues, user satisfaction, and user retention. However, real-world production platforms often encompass various domains to cater for diverse customer needs. Traditional CTR prediction models struggle in multi-domain recommen…

    Submitted 14 August, 2024; originally announced August 2024.

    Comments: 11 pages. Accepted by RecSys'2024, full paper

  36. arXiv:2407.21757  [pdf, other]

    cs.CV cs.MM

    Learning Video Context as Interleaved Multimodal Sequences

    Authors: Kevin Qinghong Lin, Pengchuan Zhang, Difei Gao, Xide Xia, Joya Chen, Ziteng Gao, Jinheng Xie, Xuhong Xiao, Mike Zheng Shou

    Abstract: Narrative videos, such as movies, pose significant challenges in video understanding due to their rich contexts (characters, dialogues, storylines) and diverse demands (identify who, relationship, and reason). In this paper, we introduce MovieSeq, a multimodal language model developed to address the wide range of challenges in understanding video contexts. Our core idea is to represent videos as i…

    Submitted 12 September, 2024; v1 submitted 31 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV 2024

  37. arXiv:2407.17789  [pdf, other]

    cs.MA cs.AI

    Very Large-Scale Multi-Agent Simulation in AgentScope

    Authors: Xuchen Pan, Dawei Gao, Yuexiang Xie, Yushuo Chen, Zhewei Wei, Yaliang Li, Bolin Ding, Ji-Rong Wen, Jingren Zhou

    Abstract: Recent advances in large language models (LLMs) have opened new avenues for applying multi-agent systems in very large-scale simulations. However, there remain several challenges when conducting multi-agent simulations with existing platforms, such as limited scalability and low efficiency, unsatisfied agent diversity, and effort-intensive management processes. To address these challenges, we deve…

    Submitted 28 October, 2024; v1 submitted 25 July, 2024; originally announced July 2024.

    Comments: We have released code on https://github.com/modelscope/agentscope/tree/main/examples/paper_large_scale_simulation

  38. arXiv:2407.16224  [pdf, other]

    cs.CV

    OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any Person

    Authors: Ke Sun, Jian Cao, Qi Wang, Linrui Tian, Xindi Zhang, Lian Zhuo, Bang Zhang, Liefeng Bo, Wenbo Zhou, Weiming Zhang, Daiheng Gao

    Abstract: Virtual Try-On (VTON) has become a transformative technology, empowering users to experiment with fashion without ever having to physically try on clothing. However, existing methods often struggle with generating high-fidelity and detail-consistent results. While diffusion models, such as Stable Diffusion series, have shown their capability in creating high-quality and photorealistic images, they…

    Submitted 23 July, 2024; originally announced July 2024.

    Comments: 10 pages, 13 figures

  39. arXiv:2406.13719  [pdf, other]

    cs.CV

    GUI Action Narrator: Where and When Did That Action Take Place?

    Authors: Qinchen Wu, Difei Gao, Kevin Qinghong Lin, Zhuoyu Wu, Xiangwu Guo, Peiran Li, Weichen Zhang, Hengxu Wang, Mike Zheng Shou

    Abstract: The advent of Multimodal LLMs has significantly enhanced image OCR recognition capabilities, making GUI automation a viable reality for increasing efficiency in digital tasks. One fundamental aspect of developing a GUI automation system is understanding primitive GUI actions. This comprehension is crucial as it enables agents to learn from user demonstrations, an essential element of automation. T…

    Submitted 19 June, 2024; originally announced June 2024.

  40. arXiv:2406.11816  [pdf, other]

    cs.CV

    VideoLLM-online: Online Video Large Language Model for Streaming Video

    Authors: Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, Mike Zheng Shou

    Abstract: Recent Large Language Models have been enhanced with vision capabilities, enabling them to comprehend images, videos, and interleaved vision-language content. However, the learning methods of these large multimodal models typically treat videos as predetermined clips, making them less effective and efficient at handling streaming video inputs. In this paper, we propose a novel Learning-In-Video-St…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: CVPR 2024. This arxiv version is upgraded with Llama-3

  41. arXiv:2406.10227  [pdf, other]

    cs.CV cs.AI

    VideoGUI: A Benchmark for GUI Automation from Instructional Videos

    Authors: Kevin Qinghong Lin, Linjie Li, Difei Gao, Qinchen WU, Mingyi Yan, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou

    Abstract: Graphical User Interface (GUI) automation holds significant promise for enhancing human productivity by assisting with computer tasks. Existing task formulations primarily focus on simple tasks that can be specified by a single, language-only instruction, such as "Insert a new slide." In this work, we introduce VideoGUI, a novel multi-modal benchmark designed to evaluate GUI assistants on visual-c…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: 24 pages, 16 tables, 17 figures

  42. arXiv:2405.20580  [pdf, other]

    cs.GR

    Topology-Aware Blending Method for Implicit Heterogeneous Porous Model Design

    Authors: Depeng Gao, Yang Gao, Yuanzhi Zhang, Hongwei Lin

    Abstract: Porous structures are materials consisting of minuscule pores, where the microstructure morphology significantly impacts their macroscopic properties. Integrating different porous structures through a blending method is indispensable to cater to diverse functional regions in heterogeneous models. Previous studies on blending methods for porous structures have mainly focused on controlling the…

    Submitted 30 May, 2024; originally announced May 2024.

  43. arXiv:2405.14974  [pdf, other]

    cs.CV cs.AI cs.CL

    LOVA3: Learning to Visual Question Answering, Asking and Assessment

    Authors: Henry Hengyuan Zhao, Pan Zhou, Difei Gao, Zechen Bai, Mike Zheng Shou

    Abstract: Question answering, asking, and assessment are three innate human traits crucial for understanding the world and acquiring knowledge. By enhancing these capabilities, humans can more effectively utilize data, leading to better comprehension and learning outcomes. Current Multimodal Large Language Models (MLLMs) primarily focus on question answering, often neglecting the full potential of questioni…

    Submitted 19 February, 2025; v1 submitted 23 May, 2024; originally announced May 2024.

    Comments: NeurIPS 2024. The code is available at https://github.com/showlab/LOVA3

  44. arXiv:2405.09111  [pdf, other]

    cs.RO cs.AI

    CarDreamer: Open-Source Learning Platform for World Model based Autonomous Driving

    Authors: Dechen Gao, Shuangyu Cai, Hanchu Zhou, Hang Wang, Iman Soltani, Junshan Zhang

    Abstract: To safely navigate intricate real-world scenarios, autonomous vehicles must be able to adapt to diverse road conditions and anticipate future events. World model (WM) based reinforcement learning (RL) has emerged as a promising approach by learning and predicting the complex dynamics of various environments. Nevertheless, to the best of our knowledge, there does not exist an accessible platform fo…

    Submitted 25 July, 2024; v1 submitted 15 May, 2024; originally announced May 2024.

    Comments: Dechen Gao, Shuangyu Cai, Hanchu Zhou, Hang Wang contributed equally

  45. arXiv:2404.18106  [pdf, other]

    cs.CV

    Semi-supervised Text-based Person Search

    Authors: Daming Gao, Yang Bai, Min Cao, Hao Dou, Mang Ye, Min Zhang

    Abstract: Text-based person search (TBPS) aims to retrieve images of a specific person from a large image gallery based on a natural language description. Existing methods rely on massive annotated image-text data to achieve satisfactory performance in fully-supervised learning. It poses a significant challenge in practice, as acquiring person images from surveillance videos is relatively easy, while obtain…

    Submitted 28 April, 2024; originally announced April 2024.

    Comments: 13 pages

  46. arXiv:2404.14676  [pdf, other]

    cs.CV cs.GR

    DreamPBR: Text-driven Generation of High-resolution SVBRDF with Multi-modal Guidance

    Authors: Linxuan Xin, Zheng Zhang, Jinfu Wei, Wei Gao, Duan Gao

    Abstract: Prior material creation methods had limitations in producing diverse results mainly because reconstruction-based methods relied on real-world measurements and generation-based methods were trained on relatively small material datasets. To address these challenges, we propose DreamPBR, a novel diffusion-based generative framework designed to create spatially-varying appearance properties guided by…

    Submitted 1 July, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

    Comments: 16 pages, 17 figures

    ACM Class: I.3.0; I.4.9

  47. AI Ethics: A Bibliometric Analysis, Critical Issues, and Key Gaps

    Authors: Di Kevin Gao, Andrew Haverly, Sudip Mittal, Jiming Wu, Jingdao Chen

    Abstract: Artificial intelligence (AI) ethics has emerged as a burgeoning yet pivotal area of scholarly research. This study conducts a comprehensive bibliometric analysis of the AI ethics literature over the past two decades. The analysis reveals a discernible tripartite progression, characterized by an incubation phase, followed by a subsequent phase focused on imbuing AI with human-like attributes, culmi…

    Submitted 12 March, 2024; originally announced March 2024.

    Journal ref: International Journal of Business Analytics (IJBAN), 2024, 11(1), 1-19

  48. arXiv:2403.11789  [pdf, other]

    cs.CV

    EMIE-MAP: Large-Scale Road Surface Reconstruction Based on Explicit Mesh and Implicit Encoding

    Authors: Wenhua Wu, Qi Wang, Guangming Wang, Junping Wang, Tiankun Zhao, Yang Liu, Dongchao Gao, Zhe Liu, Hesheng Wang

    Abstract: Road surface reconstruction plays a vital role in autonomous driving systems, enabling road lane perception and high-precision mapping. Recently, neural implicit encoding has achieved remarkable results in scene representation, particularly in the realistic rendering of scene textures. However, it faces challenges in directly representing geometric information for large-scale scenes. To address th…

    Submitted 18 March, 2024; originally announced March 2024.

  49. arXiv:2403.10014  [pdf, other]

    cs.NI cs.AI

    NNCTC: Physical Layer Cross-Technology Communication via Neural Networks

    Authors: Haoyu Wang, Jiazhao Wang, Demin Gao, Wenchao Jiang

    Abstract: Cross-technology communication (CTC) enables seamless interactions between diverse wireless technologies. Most existing work is based on reversing the transmission path to identify the appropriate payload to generate the waveform that the target devices can recognize. However, this method suffers from many limitations, including dependency on specific technologies and the necessity for intricate al…

    Submitted 15 March, 2024; originally announced March 2024.

    Comments: 12 pages

    ACM Class: C.2.2

  50. arXiv:2403.09861  [pdf, other]

    cs.ET cs.AI

    NN-Defined Modulator: Reconfigurable and Portable Software Modulator on IoT Gateways

    Authors: Jiazhao Wang, Wenchao Jiang, Ruofeng Liu, Bin Hu, Demin Gao, Shuai Wang

    Abstract: A physical-layer modulator is a vital component for an IoT gateway to map the symbols to signals. However, due to the soldered hardware chipsets on the gateway's motherboards or the diverse toolkits on different platforms for the software radio, the existing solutions either have limited extensibility or are platform-specific. Such limitation is hard to ignore when modulation schemes and hardware…

    Submitted 14 March, 2024; originally announced March 2024.

    Journal ref: NSDI 2024