-
Semantic Scheduling for LLM Inference
Authors:
Wenyue Hua,
Dujian Ding,
Yile Gu,
Yujie Ren,
Kai Mei,
Minghua Ma,
William Yang Wang
Abstract:
Conventional operating system scheduling algorithms are largely content-ignorant, making decisions based on factors such as latency or fairness without considering the actual intents or semantics of processes. Consequently, these algorithms often do not prioritize tasks that require urgent attention or carry higher importance, such as in emergency management scenarios. However, recent advances in…
▽ More
Conventional operating system scheduling algorithms are largely content-ignorant, making decisions based on factors such as latency or fairness without considering the actual intents or semantics of processes. Consequently, these algorithms often do not prioritize tasks that require urgent attention or carry higher importance, such as in emergency management scenarios. However, recent advances in language models enable semantic analysis of processes, allowing for more intelligent and context-aware scheduling decisions. In this paper, we introduce the concept of semantic scheduling in scheduling of requests from large language models (LLM), where the semantics of the process guide the scheduling priorities. We present a novel scheduling algorithm with optimal time complexity, designed to minimize the overall waiting time in LLM-based prompt scheduling. To illustrate its effectiveness, we present a medical emergency management application, underscoring the potential benefits of semantic scheduling for critical, time-sensitive tasks. The code and data are available at https://github.com/Wenyueh/latency_optimization_with_priority_constraints.
△ Less
Submitted 13 June, 2025;
originally announced June 2025.
-
Addressing Pitfalls in Auditing Practices of Automatic Speech Recognition Technologies: A Case Study of People with Aphasia
Authors:
Katelyn Xiaoying Mei,
Anna Seo Gyeong Choi,
Hilke Schellmann,
Mona Sloane,
Allison Koenecke
Abstract:
Automatic Speech Recognition (ASR) has transformed daily tasks from video transcription to workplace hiring. ASR systems' growing use warrants robust and standardized auditing approaches to ensure automated transcriptions of high and equitable quality. This is especially critical for people with speech and language disorders (such as aphasia) who may disproportionately depend on ASR systems to nav…
▽ More
Automatic Speech Recognition (ASR) has transformed daily tasks from video transcription to workplace hiring. ASR systems' growing use warrants robust and standardized auditing approaches to ensure automated transcriptions of high and equitable quality. This is especially critical for people with speech and language disorders (such as aphasia) who may disproportionately depend on ASR systems to navigate everyday life. In this work, we identify three pitfalls in existing standard ASR auditing procedures, and demonstrate how addressing them impacts audit results via a case study of six popular ASR systems' performance for aphasia speakers. First, audits often adhere to a single method of text standardization during data pre-processing, which (a) masks variability in ASR performance from applying different standardization methods, and (b) may not be consistent with how users - especially those from marginalized speech communities - would want their transcriptions to be standardized. Second, audits often display high-level demographic findings without further considering performance disparities among (a) more nuanced demographic subgroups, and (b) relevant covariates capturing acoustic information from the input audio. Third, audits often rely on a single gold-standard metric -- the Word Error Rate -- which does not fully capture the extent of errors arising from generative AI models, such as transcription hallucinations. We propose a more holistic auditing framework that accounts for these three pitfalls, and exemplify its results in our case study, finding consistently worse ASR performance for aphasia speakers relative to a control group. We call on practitioners to implement these robust ASR auditing practices that remain flexible to the rapidly changing ASR landscape.
△ Less
Submitted 10 June, 2025;
originally announced June 2025.
-
Reference-Guided Identity Preserving Face Restoration
Authors:
Mo Zhou,
Keren Ye,
Viraj Shah,
Kangfu Mei,
Mauricio Delbracio,
Peyman Milanfar,
Vishal M. Patel,
Hossein Talebi
Abstract:
Preserving face identity is a critical yet persistent challenge in diffusion-based image restoration. While reference faces offer a path forward, existing reference-based methods often fail to fully exploit their potential. This paper introduces a novel approach that maximizes reference face utility for improved face restoration and identity preservation. Our method makes three key contributions:…
▽ More
Preserving face identity is a critical yet persistent challenge in diffusion-based image restoration. While reference faces offer a path forward, existing reference-based methods often fail to fully exploit their potential. This paper introduces a novel approach that maximizes reference face utility for improved face restoration and identity preservation. Our method makes three key contributions: 1) Composite Context, a comprehensive representation that fuses multi-level (high- and low-level) information from the reference face, offering richer guidance than prior singular representations. 2) Hard Example Identity Loss, a novel loss function that leverages the reference face to address the identity learning inefficiencies found in the existing identity loss. 3) A training-free method to adapt the model to multi-reference inputs during inference. The proposed method demonstrably restores high-quality faces and achieves state-of-the-art identity preserving restoration on benchmarks such as FFHQ-Ref and CelebA-Ref-Test, consistently outperforming previous work.
△ Less
Submitted 27 May, 2025;
originally announced May 2025.
-
LiteCUA: Computer as MCP Server for Computer-Use Agent on AIOS
Authors:
Kai Mei,
Xi Zhu,
Hang Gao,
Shuhang Lin,
Yongfeng Zhang
Abstract:
We present AIOS 1.0, a novel platform designed to advance computer-use agent (CUA) capabilities through environmental contextualization. While existing approaches primarily focus on building more powerful agent frameworks or enhancing agent models, we identify a fundamental limitation: the semantic disconnect between how language models understand the world and how computer interfaces are structur…
▽ More
We present AIOS 1.0, a novel platform designed to advance computer-use agent (CUA) capabilities through environmental contextualization. While existing approaches primarily focus on building more powerful agent frameworks or enhancing agent models, we identify a fundamental limitation: the semantic disconnect between how language models understand the world and how computer interfaces are structured. AIOS 1.0 addresses this challenge by transforming computers into contextual environments that language models can natively comprehend, implementing a Model Context Protocol (MCP) server architecture to abstract computer states and actions. This approach effectively decouples interface complexity from decision complexity, enabling agents to reason more effectively about computing environments. To demonstrate our platform's effectiveness, we introduce LiteCUA, a lightweight computer-use agent built on AIOS 1.0 that achieves a 14.66% success rate on the OSWorld benchmark, outperforming several specialized agent frameworks despite its simple architecture. Our results suggest that contextualizing computer environments for language models represents a promising direction for developing more capable computer-use agents and advancing toward AI that can interact with digital systems. The source code of LiteCUA is available at https://github.com/agiresearch/LiteCUA, and it is also integrated into the AIOS main branch as part of AIOS at https://github.com/agiresearch/AIOS.
△ Less
Submitted 24 May, 2025;
originally announced May 2025.
-
Passing the Buck to AI: How Individuals' Decision-Making Patterns Affect Reliance on AI
Authors:
Katelyn Xiaoying Mei,
Rock Yuren Pang,
Alex Lyford,
Lucy Lu Wang,
Katharina Reinecke
Abstract:
Psychological research has identified different patterns individuals have while making decisions, such as vigilance (making decisions after thorough information gathering), hypervigilance (rushed and anxious decision-making), and buckpassing (deferring decisions to others). We examine whether these decision-making patterns shape peoples' likelihood of seeking out or relying on AI. In an online exp…
▽ More
Psychological research has identified different patterns individuals have while making decisions, such as vigilance (making decisions after thorough information gathering), hypervigilance (rushed and anxious decision-making), and buckpassing (deferring decisions to others). We examine whether these decision-making patterns shape peoples' likelihood of seeking out or relying on AI. In an online experiment with 810 participants tasked with distinguishing food facts from myths, we found that a higher buckpassing tendency was positively correlated with both seeking out and relying on AI suggestions, while being negatively correlated with the time spent reading AI explanations. In contrast, the higher a participant tended towards vigilance, the more carefully they scrutinized the AI's information, as indicated by an increased time spent looking through the AI's explanations. These findings suggest that a person's decision-making pattern plays a significant role in their adoption and reliance on AI, which provides a new understanding of individual differences in AI-assisted decision-making.
△ Less
Submitted 2 May, 2025;
originally announced May 2025.
-
Designing AI Systems that Augment Human Performed vs. Demonstrated Critical Thinking
Authors:
Katelyn Xiaoying Mei,
Nic Weber
Abstract:
The recent rapid advancement of LLM-based AI systems has accelerated our search and production of information. While the advantages brought by these systems seemingly improve the performance or efficiency of human activities, they do not necessarily enhance human capabilities. Recent research has started to examine the impact of generative AI on individuals' cognitive abilities, especially critica…
▽ More
The recent rapid advancement of LLM-based AI systems has accelerated our search and production of information. While the advantages brought by these systems seemingly improve the performance or efficiency of human activities, they do not necessarily enhance human capabilities. Recent research has started to examine the impact of generative AI on individuals' cognitive abilities, especially critical thinking. Based on definitions of critical thinking across psychology and education, this position paper proposes the distinction between demonstrated and performed critical thinking in the era of generative AI and discusses the implication of this distinction in research and development of AI systems that aim to augment human critical thinking.
△ Less
Submitted 20 April, 2025;
originally announced April 2025.
-
The Power of Context: How Multimodality Improves Image Super-Resolution
Authors:
Kangfu Mei,
Hossein Talebi,
Mojtaba Ardakani,
Vishal M. Patel,
Peyman Milanfar,
Mauricio Delbracio
Abstract:
Single-image super-resolution (SISR) remains challenging due to the inherent difficulty of recovering fine-grained details and preserving perceptual quality from low-resolution inputs. Existing methods often rely on limited image priors, leading to suboptimal results. We propose a novel approach that leverages the rich contextual information available in multiple modalities -- including depth, seg…
▽ More
Single-image super-resolution (SISR) remains challenging due to the inherent difficulty of recovering fine-grained details and preserving perceptual quality from low-resolution inputs. Existing methods often rely on limited image priors, leading to suboptimal results. We propose a novel approach that leverages the rich contextual information available in multiple modalities -- including depth, segmentation, edges, and text prompts -- to learn a powerful generative prior for SISR within a diffusion model framework. We introduce a flexible network architecture that effectively fuses multimodal information, accommodating an arbitrary number of input modalities without requiring significant modifications to the diffusion process. Crucially, we mitigate hallucinations, often introduced by text prompts, by using spatial information from other modalities to guide regional text-based conditioning. Each modality's guidance strength can also be controlled independently, allowing steering outputs toward different directions, such as increasing bokeh through depth or adjusting object prominence via segmentation. Extensive experiments demonstrate that our model surpasses state-of-the-art generative SISR methods, achieving superior visual quality and fidelity. See project page at https://mmsr.kfmei.com/.
△ Less
Submitted 18 March, 2025;
originally announced March 2025.
-
Cerebrum (AIOS SDK): A Platform for Agent Development, Deployment, Distribution, and Discovery
Authors:
Balaji Rama,
Kai Mei,
Yongfeng Zhang
Abstract:
Autonomous LLM-based agents have emerged as a powerful paradigm for complex task execution, yet the field lacks standardized tools for development, deployment, distribution and discovery of agents. We present Cerebrum, an Agent SDK for AIOS that addresses this gap through three key components: (1) a comprehensive SDK featuring a modular four-layer architecture for agent development, encompassing L…
▽ More
Autonomous LLM-based agents have emerged as a powerful paradigm for complex task execution, yet the field lacks standardized tools for development, deployment, distribution and discovery of agents. We present Cerebrum, an Agent SDK for AIOS that addresses this gap through three key components: (1) a comprehensive SDK featuring a modular four-layer architecture for agent development, encompassing LLM, memory, storage, and tool management; (2) a community-driven Agent Hub for sharing and discovering agents, complete with version control and dependency management; (3) an interactive web interface for testing and evaluating agents. The platform's effectiveness is demonstrated through implementations of various agent architectures, including Chain of Thought (CoT), ReAct, and tool-use agents. Cerebrum advances the field by providing a unified framework that standardizes agent development while maintaining flexibility for researchers and developers to innovate and distribute their agents. The live website is at https://app.aios.foundation, the code is at https://github.com/agiresearch/Cerebrum, and video is at https://app.aios.foundation/video-demo.
△ Less
Submitted 14 March, 2025;
originally announced March 2025.
-
HarmonySet: A Comprehensive Dataset for Understanding Video-Music Semantic Alignment and Temporal Synchronization
Authors:
Zitang Zhou,
Ke Mei,
Yu Lu,
Tianyi Wang,
Fengyun Rao
Abstract:
This paper introduces HarmonySet, a comprehensive dataset designed to advance video-music understanding. HarmonySet consists of 48,328 diverse video-music pairs, annotated with detailed information on rhythmic synchronization, emotional alignment, thematic coherence, and cultural relevance. We propose a multi-step human-machine collaborative framework for efficient annotation, combining human insi…
▽ More
This paper introduces HarmonySet, a comprehensive dataset designed to advance video-music understanding. HarmonySet consists of 48,328 diverse video-music pairs, annotated with detailed information on rhythmic synchronization, emotional alignment, thematic coherence, and cultural relevance. We propose a multi-step human-machine collaborative framework for efficient annotation, combining human insights with machine-generated descriptions to identify key transitions and assess alignment across multiple dimensions. Additionally, we introduce a novel evaluation framework with tasks and metrics to assess the multi-dimensional alignment of video and music, including rhythm, emotion, theme, and cultural context. Our extensive experiments demonstrate that HarmonySet, along with the proposed evaluation framework, significantly improves the ability of multimodal models to capture and analyze the intricate relationships between video and music.
△ Less
Submitted 4 March, 2025; v1 submitted 3 March, 2025;
originally announced March 2025.
-
OmniRouter: Budget and Performance Controllable Multi-LLM Routing
Authors:
Kai Mei,
Wujiang Xu,
Shuhang Lin,
Yongfeng Zhang
Abstract:
Large language models (LLMs) deliver superior performance but require substantial computational resources and operate with relatively low efficiency, while smaller models can efficiently handle simpler tasks with fewer resources. LLM routing is a crucial paradigm that dynamically selects the most suitable large language models from a pool of candidates to process diverse inputs, ensuring optimal r…
▽ More
Large language models (LLMs) deliver superior performance but require substantial computational resources and operate with relatively low efficiency, while smaller models can efficiently handle simpler tasks with fewer resources. LLM routing is a crucial paradigm that dynamically selects the most suitable large language models from a pool of candidates to process diverse inputs, ensuring optimal resource utilization while maintaining response quality. Existing routing frameworks typically model this as a locally optimal decision-making problem, selecting the presumed best-fit LLM for each query individually, which overlook global budget constraints, resulting in ineffective resource allocation. To tackle this problem, we introduce OmniRouter, a fundamentally controllable routing framework for multi-LLM serving. Instead of making per-query greedy choices, OmniRouter models the routing task as a constrained optimization problem, assigning models that minimize total cost while ensuring the required performance level. Specifically, a hybrid retrieval-augmented predictor is designed to predict the capabilities and costs of LLMs and a constrained optimizer is employed to control globally optimal query-model allocation. Experiments show that OmniRouter achieves up to 6.30% improvement in response accuracy while simultaneously reducing computational costs by at least 10.15% compared to competitive router baselines. The code and the dataset are available at https://github.com/agiresearch/OmniRouter.
△ Less
Submitted 31 May, 2025; v1 submitted 27 February, 2025;
originally announced February 2025.
-
iAgent: LLM Agent as a Shield between User and Recommender Systems
Authors:
Wujiang Xu,
Yunxiao Shi,
Zujie Liang,
Xuying Ning,
Kai Mei,
Kun Wang,
Xi Zhu,
Min Xu,
Yongfeng Zhang
Abstract:
Traditional recommender systems usually take the user-platform paradigm, where users are directly exposed under the control of the platform's recommendation algorithms. However, the defect of recommendation algorithms may put users in very vulnerable positions under this paradigm. First, many sophisticated models are often designed with commercial objectives in mind, focusing on the platform's ben…
▽ More
Traditional recommender systems usually take the user-platform paradigm, where users are directly exposed under the control of the platform's recommendation algorithms. However, the defect of recommendation algorithms may put users in very vulnerable positions under this paradigm. First, many sophisticated models are often designed with commercial objectives in mind, focusing on the platform's benefits, which may hinder their ability to protect and capture users' true interests. Second, these models are typically optimized using data from all users, which may overlook individual user's preferences. Due to these shortcomings, users may experience several disadvantages under the traditional user-platform direct exposure paradigm, such as lack of control over the recommender system, potential manipulation by the platform, echo chamber effects, or lack of personalization for less active users due to the dominance of active users during collaborative learning. Therefore, there is an urgent need to develop a new paradigm to protect user interests and alleviate these issues. Recently, some researchers have introduced LLM agents to simulate user behaviors, these approaches primarily aim to optimize platform-side performance, leaving core issues in recommender systems unresolved. To address these limitations, we propose a new user-agent-platform paradigm, where agent serves as the protective shield between user and recommender system that enables indirect exposure.
△ Less
Submitted 29 May, 2025; v1 submitted 20 February, 2025;
originally announced February 2025.
-
A-MEM: Agentic Memory for LLM Agents
Authors:
Wujiang Xu,
Kai Mei,
Hang Gao,
Juntao Tan,
Zujie Liang,
Yongfeng Zhang
Abstract:
While large language model (LLM) agents can effectively use external tools for complex real-world tasks, they require memory systems to leverage historical experiences. Current memory systems enable basic storage and retrieval but lack sophisticated memory organization, despite recent attempts to incorporate graph databases. Moreover, these systems' fixed operations and structures limit their adap…
▽ More
While large language model (LLM) agents can effectively use external tools for complex real-world tasks, they require memory systems to leverage historical experiences. Current memory systems enable basic storage and retrieval but lack sophisticated memory organization, despite recent attempts to incorporate graph databases. Moreover, these systems' fixed operations and structures limit their adaptability across diverse tasks. To address this limitation, this paper proposes a novel agentic memory system for LLM agents that can dynamically organize memories in an agentic way. Following the basic principles of the Zettelkasten method, we designed our memory system to create interconnected knowledge networks through dynamic indexing and linking. When a new memory is added, we generate a comprehensive note containing multiple structured attributes, including contextual descriptions, keywords, and tags. The system then analyzes historical memories to identify relevant connections, establishing links where meaningful similarities exist. Additionally, this process enables memory evolution - as new memories are integrated, they can trigger updates to the contextual representations and attributes of existing historical memories, allowing the memory network to continuously refine its understanding. Our approach combines the structured organization principles of Zettelkasten with the flexibility of agent-driven decision making, allowing for more adaptive and context-aware memory management. Empirical experiments on six foundation models show superior improvement against existing SOTA baselines. The source code for evaluating performance is available at https://github.com/WujiangXu/AgenticMemory, while the source code of agentic memory system is available at https://github.com/agiresearch/A-mem.
△ Less
Submitted 2 June, 2025; v1 submitted 17 February, 2025;
originally announced February 2025.
-
Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding
Authors:
Mingyu Jin,
Kai Mei,
Wujiang Xu,
Mingjie Sun,
Ruixiang Tang,
Mengnan Du,
Zirui Liu,
Yongfeng Zhang
Abstract:
Large language models (LLMs) have achieved remarkable success in contextual knowledge understanding. In this paper, we show that these concentrated massive values consistently emerge in specific regions of attention queries (Q) and keys (K) while not having such patterns in values (V) in various modern transformer-based LLMs (Q, K, and V mean the representations output by the query, key, and value…
▽ More
Large language models (LLMs) have achieved remarkable success in contextual knowledge understanding. In this paper, we show that these concentrated massive values consistently emerge in specific regions of attention queries (Q) and keys (K) while not having such patterns in values (V) in various modern transformer-based LLMs (Q, K, and V mean the representations output by the query, key, and value layers respectively). Through extensive experiments, we further demonstrate that these massive values play a critical role in interpreting contextual knowledge (knowledge obtained from the current context window) rather than in retrieving parametric knowledge stored within the model's parameters. Our further investigation of quantization strategies reveals that ignoring these massive values leads to a pronounced drop in performance on tasks requiring rich contextual understanding, aligning with our analysis. Finally, we trace the emergence of concentrated massive values and find that such concentration is caused by Rotary Positional Encoding (RoPE), which has appeared since the first layers. These findings shed new light on how Q and K operate in LLMs and offer practical insights for model design and optimization. The Code is Available at https://github.com/MingyuJ666/Rope_with_LLM.
△ Less
Submitted 20 May, 2025; v1 submitted 3 February, 2025;
originally announced February 2025.
-
Redundant Queries in DETR-Based 3D Detection Methods: Unnecessary and Prunable
Authors:
Lizhen Xu,
Shanmin Pang,
Wenzhao Qiu,
Zehao Wu,
Xiuxiu Bai,
Kuizhi Mei,
Jianru Xue
Abstract:
Query-based models are extensively used in 3D object detection tasks, with a wide range of pre-trained checkpoints readily available online. However, despite their popularity, these models often require an excessive number of object queries, far surpassing the actual number of objects to detect. The redundant queries result in unnecessary computational and memory costs. In this paper, we find that…
▽ More
Query-based models are extensively used in 3D object detection tasks, with a wide range of pre-trained checkpoints readily available online. However, despite their popularity, these models often require an excessive number of object queries, far surpassing the actual number of objects to detect. The redundant queries result in unnecessary computational and memory costs. In this paper, we find that not all queries contribute equally -- a significant portion of queries have a much smaller impact compared to others. Based on this observation, we propose an embarrassingly simple approach called \bd{G}radually \bd{P}runing \bd{Q}ueries (GPQ), which prunes queries incrementally based on their classification scores. It is straightforward to implement in any query-based method, as it can be seamlessly integrated as a fine-tuning step using an existing checkpoint after training. With GPQ, users can easily generate multiple models with fewer queries, starting from a checkpoint with an excessive number of queries. Experiments on various advanced 3D detectors show that GPQ effectively reduces redundant queries while maintaining performance. Using our method, model inference on desktop GPUs can be accelerated by up to 1.31x. Moreover, after deployment on edge devices, it achieves up to a 67.86\% reduction in FLOPs and a 76.38\% decrease in inference time. The code will be available at \url{https://github.com/iseri27/Gpq}.
△ Less
Submitted 2 December, 2024;
originally announced December 2024.
-
From Commands to Prompts: LLM-based Semantic File System for AIOS
Authors:
Zeru Shi,
Kai Mei,
Mingyu Jin,
Yongye Su,
Chaoji Zuo,
Wenyue Hua,
Wujiang Xu,
Yujie Ren,
Zirui Liu,
Mengnan Du,
Dong Deng,
Yongfeng Zhang
Abstract:
Large language models (LLMs) have demonstrated significant potential in the development of intelligent applications and systems such as LLM-based agents and agent operating systems (AIOS). However, when these applications and systems interact with the underlying file system, the file system still remains the traditional paradigm: reliant on manual navigation through precise commands. This paradigm…
▽ More
Large language models (LLMs) have demonstrated significant potential in the development of intelligent applications and systems such as LLM-based agents and agent operating systems (AIOS). However, when these applications and systems interact with the underlying file system, the file system still remains the traditional paradigm: reliant on manual navigation through precise commands. This paradigm poses a bottleneck to the usability of these systems as users are required to navigate complex folder hierarchies and remember cryptic file names. To address this limitation, we propose an LLM-based semantic file system ( LSFS ) for prompt-driven file management. Unlike conventional approaches, LSFS incorporates LLMs to enable users or agents to interact with files through natural language prompts, facilitating semantic file management. At the macro-level, we develop a comprehensive API set to achieve semantic file management functionalities, such as semantic file retrieval, file update monitoring and summarization, and semantic file rollback). At the micro-level, we store files by constructing semantic indexes for them, design and implement syscalls of different semantic operations (e.g., CRUD, group by, join) powered by vector database. Our experiments show that LSFS offers significant improvements over traditional file systems in terms of user convenience, the diversity of supported functions, and the accuracy and efficiency of file operations. Additionally, with the integration of LLM, our system enables more intelligent file management tasks, such as content summarization and version comparison, further enhancing its capabilities.
△ Less
Submitted 18 March, 2025; v1 submitted 23 September, 2024;
originally announced October 2024.
-
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents
Authors:
Hanrong Zhang,
Jingyuan Huang,
Kai Mei,
Yifei Yao,
Zhenting Wang,
Chenlu Zhan,
Hongwei Wang,
Yongfeng Zhang
Abstract:
Although LLM-based agents, powered by Large Language Models (LLMs), can use external tools and memory mechanisms to solve complex real-world tasks, they may also introduce critical security vulnerabilities. However, the existing literature does not comprehensively evaluate attacks and defenses against LLM-based agents. To address this, we introduce Agent Security Bench (ASB), a comprehensive frame…
▽ More
Although LLM-based agents, powered by Large Language Models (LLMs), can use external tools and memory mechanisms to solve complex real-world tasks, they may also introduce critical security vulnerabilities. However, the existing literature does not comprehensively evaluate attacks and defenses against LLM-based agents. To address this, we introduce Agent Security Bench (ASB), a comprehensive framework designed to formalize, benchmark, and evaluate the attacks and defenses of LLM-based agents, including 10 scenarios (e.g., e-commerce, autonomous driving, finance), 10 agents targeting the scenarios, over 400 tools, 27 different types of attack/defense methods, and 7 evaluation metrics. Based on ASB, we benchmark 10 prompt injection attacks, a memory poisoning attack, a novel Plan-of-Thought backdoor attack, 4 mixed attacks, and 11 corresponding defenses across 13 LLM backbones. Our benchmark results reveal critical vulnerabilities in different stages of agent operation, including system prompt, user prompt handling, tool usage, and memory retrieval, with the highest average attack success rate of 84.30\%, but limited effectiveness shown in current defenses, unveiling important works to be done in terms of agent security for the community. We also introduce a new metric to evaluate the agents' capability to balance utility and security. Our code can be found at https://github.com/agiresearch/ASB.
△ Less
Submitted 29 May, 2025; v1 submitted 3 October, 2024;
originally announced October 2024.
-
AutoFlow: Automated Workflow Generation for Large Language Model Agents
Authors:
Zelong Li,
Shuyuan Xu,
Kai Mei,
Wenyue Hua,
Balaji Rama,
Om Raheja,
Hao Wang,
He Zhu,
Yongfeng Zhang
Abstract:
Recent advancements in Large Language Models (LLMs) have shown significant progress in understanding complex natural language. One important application of LLM is LLM-based AI Agent, which leverages the ability of LLM as well as external tools for complex-task solving. To make sure LLM Agents follow an effective and reliable procedure to solve the given task, manually designed workflows are usuall…
▽ More
Recent advancements in Large Language Models (LLMs) have shown significant progress in understanding complex natural language. One important application of LLM is LLM-based AI Agent, which leverages the ability of LLM as well as external tools for complex-task solving. To make sure LLM Agents follow an effective and reliable procedure to solve the given task, manually designed workflows are usually used to guide the working mechanism of agents. However, manually designing the workflows requires considerable efforts and domain knowledge, making it difficult to develop and deploy agents on massive scales. To address these issues, we propose AutoFlow, a framework designed to automatically generate workflows for agents to solve complex tasks. AutoFlow takes natural language program as the format of agent workflow and employs a workflow optimization procedure to iteratively optimize the workflow quality. Besides, this work offers two workflow generation methods: fine-tuning-based and in-context-based methods, making the AutoFlow framework applicable to both open-source and closed-source LLMs. Experimental results show that our framework can produce robust and reliable agent workflows. We believe that the automatic generation and interpretation of workflows in natural language represent a promising paradigm for solving complex tasks, particularly with the rapid development of LLMs. The source code of this work is available at https://github.com/agiresearch/AutoFlow.
△ Less
Submitted 1 July, 2024;
originally announced July 2024.
-
Refining Self-Supervised Learnt Speech Representation using Brain Activations
Authors:
Hengyu Li,
Kangdi Mei,
Zhaoci Liu,
Yang Ai,
Liping Chen,
Jie Zhang,
Zhenhua Ling
Abstract:
It was shown in literature that speech representations extracted by self-supervised pre-trained models exhibit similarities with brain activations of human for speech perception and fine-tuning speech representation models on downstream tasks can further improve the similarity. However, it still remains unclear if this similarity can be used to optimize the pre-trained speech models. In this work,…
▽ More
It was shown in literature that speech representations extracted by self-supervised pre-trained models exhibit similarities with brain activations of human for speech perception and fine-tuning speech representation models on downstream tasks can further improve the similarity. However, it still remains unclear if this similarity can be used to optimize the pre-trained speech models. In this work, we therefore propose to use the brain activations recorded by fMRI to refine the often-used wav2vec2.0 model by aligning model representations toward human neural responses. Experimental results on SUPERB reveal that this operation is beneficial for several downstream tasks, e.g., speaker verification, automatic speech recognition, intent classification.One can then consider the proposed method as a new alternative to improve self-supervised speech models.
△ Less
Submitted 13 June, 2024; v1 submitted 12 June, 2024;
originally announced June 2024.
-
AIOS Compiler: LLM as Interpreter for Natural Language Programming and Flow Programming of AI Agents
Authors:
Shuyuan Xu,
Zelong Li,
Kai Mei,
Yongfeng Zhang
Abstract:
Since their inception, programming languages have trended towards greater readability and lower barriers for programmers. Following this trend, natural language can be a promising type of programming language that provides great flexibility and usability and helps towards the democracy of programming. However, the inherent vagueness, ambiguity, and verbosity of natural language pose significant ch…
▽ More
Since their inception, programming languages have trended towards greater readability and lower barriers for programmers. Following this trend, natural language can be a promising type of programming language that provides great flexibility and usability and helps towards the democracy of programming. However, the inherent vagueness, ambiguity, and verbosity of natural language pose significant challenges in developing an interpreter that can accurately understand the programming logic and execute instructions written in natural language. Fortunately, recent advancements in Large Language Models (LLMs) have demonstrated remarkable proficiency in interpreting complex natural language. Inspired by this, we develop a novel system for Code Representation and Execution (CoRE), which employs LLM as interpreter to interpret and execute natural language instructions. The proposed system unifies natural language programming, pseudo-code programming, and flow programming under the same representation for constructing language agents, while LLM serves as the interpreter to interpret and execute the agent programs. In this paper, we begin with defining the programming syntax that structures natural language instructions logically. During the execution, we incorporate external memory to minimize redundancy. Furthermore, we equip the designed interpreter with the capability to invoke external tools, compensating for the limitations of LLM in specialized domains or when accessing real-time information. This work is open-source at https://github.com/agiresearch/CoRE, https://github.com/agiresearch/OpenAGI, and https://github.com/agiresearch/AIOS.
△ Less
Submitted 21 May, 2024; v1 submitted 11 May, 2024;
originally announced May 2024.
-
Exploring Concept Depth: How Large Language Models Acquire Knowledge and Concept at Different Layers?
Authors:
Mingyu Jin,
Qinkai Yu,
Jingyuan Huang,
Qingcheng Zeng,
Zhenting Wang,
Wenyue Hua,
Haiyan Zhao,
Kai Mei,
Yanda Meng,
Kaize Ding,
Fan Yang,
Mengnan Du,
Yongfeng Zhang
Abstract:
Large language models (LLMs) have shown remarkable performances across a wide range of tasks. However, the mechanisms by which these models encode tasks of varying complexities remain poorly understood. In this paper, we explore the hypothesis that LLMs process concepts of varying complexities in different layers, introducing the idea of "Concept Depth" to suggest that more complex concepts are ty…
▽ More
Large language models (LLMs) have shown remarkable performances across a wide range of tasks. However, the mechanisms by which these models encode tasks of varying complexities remain poorly understood. In this paper, we explore the hypothesis that LLMs process concepts of varying complexities in different layers, introducing the idea of "Concept Depth" to suggest that more complex concepts are typically acquired in deeper layers. Specifically, we categorize concepts based on their level of abstraction, defining them in the order of increasing complexity within factual, emotional, and inferential tasks. We conduct extensive probing experiments using layer-wise representations across various LLM families (Gemma, LLaMA, Qwen) on various datasets spanning the three domains of tasks. Our findings reveal that models could efficiently conduct probing for simpler tasks in shallow layers, and more complex tasks typically necessitate deeper layers for accurate understanding. Additionally, we examine how external factors, such as adding noise to the input and quantizing the model weights, might affect layer-wise representations. Our findings suggest that these factors can impede the development of a conceptual understanding of LLMs until deeper layers are explored. We hope that our proposed concept and experimental insights will enhance the understanding of the mechanisms underlying LLMs. Our codes are available at https://github.com/Luckfort/CD.
△ Less
Submitted 4 February, 2025; v1 submitted 10 April, 2024;
originally announced April 2024.
-
Bigger is not Always Better: Scaling Properties of Latent Diffusion Models
Authors:
Kangfu Mei,
Zhengzhong Tu,
Mauricio Delbracio,
Hossein Talebi,
Vishal M. Patel,
Peyman Milanfar
Abstract:
We study the scaling properties of latent diffusion models (LDMs) with an emphasis on their sampling efficiency. While improved network architecture and inference algorithms have shown to effectively boost sampling efficiency of diffusion models, the role of model size -- a critical determinant of sampling efficiency -- has not been thoroughly examined. Through empirical analysis of established te…
▽ More
We study the scaling properties of latent diffusion models (LDMs) with an emphasis on their sampling efficiency. While improved network architecture and inference algorithms have shown to effectively boost sampling efficiency of diffusion models, the role of model size -- a critical determinant of sampling efficiency -- has not been thoroughly examined. Through empirical analysis of established text-to-image diffusion models, we conduct an in-depth investigation into how model size influences sampling efficiency across varying sampling steps. Our findings unveil a surprising trend: when operating under a given inference budget, smaller models frequently outperform their larger equivalents in generating high-quality results. Moreover, we extend our study to demonstrate the generalizability of the these findings by applying various diffusion samplers, exploring diverse downstream tasks, evaluating post-distilled models, as well as comparing performance relative to training compute. These findings open up new pathways for the development of LDM scaling strategies which can be employed to enhance generative capabilities within limited inference budgets.
△ Less
Submitted 10 December, 2024; v1 submitted 1 April, 2024;
originally announced April 2024.
-
AIOS: LLM Agent Operating System
Authors:
Kai Mei,
Xi Zhu,
Wujiang Xu,
Wenyue Hua,
Mingyu Jin,
Zelong Li,
Shuyuan Xu,
Ruosong Ye,
Yingqiang Ge,
Yongfeng Zhang
Abstract:
LLM-based intelligent agents face significant deployment challenges, particularly related to resource management. Allowing unrestricted access to LLM or tool resources can lead to inefficient or even potentially harmful resource allocation and utilization for agents. Furthermore, the absence of proper scheduling and resource management mechanisms in current agent designs hinders concurrent process…
▽ More
LLM-based intelligent agents face significant deployment challenges, particularly related to resource management. Allowing unrestricted access to LLM or tool resources can lead to inefficient or even potentially harmful resource allocation and utilization for agents. Furthermore, the absence of proper scheduling and resource management mechanisms in current agent designs hinders concurrent processing and limits overall system efficiency. As the diversity and complexity of agents continue to grow, addressing these resource management issues becomes increasingly critical to LLM-based agent systems. To address these challenges, this paper proposes the architecture of AIOS (LLM-based AI Agent Operating System) under the context of managing LLM-based agents. It introduces a novel architecture for serving LLM-based agents by isolating resources and LLM-specific services from agent applications into an AIOS kernel. This AIOS kernel provides fundamental services (e.g., scheduling, context management, memory management, storage management, access control) and efficient management of resources (e.g., LLM and external tools) for runtime agents. To enhance usability, AIOS also includes an AIOS-Agent SDK, a comprehensive suite of APIs designed for utilizing functionalities provided by the AIOS kernel. Experimental results demonstrate that using AIOS can achieve up to 2.1x faster execution for serving agents built by various agent frameworks. The source code is available at https://github.com/agiresearch/AIOS.
△ Less
Submitted 11 May, 2025; v1 submitted 25 March, 2024;
originally announced March 2024.
-
What if LLMs Have Different World Views: Simulating Alien Civilizations with LLM-based Agents
Authors:
Zhaoqian Xue,
Beichen Wang,
Suiyuan Zhu,
Kai Mei,
Hua Tang,
Wenyue Hua,
Mengnan Du,
Yongfeng Zhang
Abstract:
This study introduces "CosmoAgent," an innovative artificial intelligence system that utilizes Large Language Models (LLMs) to simulate complex interactions between human and extraterrestrial civilizations. This paper introduces a mathematical model for quantifying the levels of civilization development and further employs a state transition matrix approach to evaluate their trajectories. Through…
▽ More
This study introduces "CosmoAgent," an innovative artificial intelligence system that utilizes Large Language Models (LLMs) to simulate complex interactions between human and extraterrestrial civilizations. This paper introduces a mathematical model for quantifying the levels of civilization development and further employs a state transition matrix approach to evaluate their trajectories. Through this methodology, our study quantitatively analyzes the growth trajectories of civilizations, providing insights into future decision-making at critical points of growth and saturation. Furthermore, this paper acknowledges the vast diversity of potential living conditions across the universe, which could foster unique cosmologies, ethical codes, and worldviews among different civilizations. Recognizing the Earth-centric bias inherent in current LLM designs, we propose the novel concept of using LLM agents with diverse ethical paradigms and simulating interactions between entities with distinct moral principles. This innovative research not only introduces a novel method for comprehending potential inter-civilizational dynamics but also holds practical value in enabling entities with divergent value systems to strategize, prevent conflicts, and engage in games under conditions of asymmetric information. The accompanying code is available at https://github.com/MingyuJ666/Simulating-Alien-Civilizations-with-LLM-based-Agents.
△ Less
Submitted 8 June, 2025; v1 submitted 20 February, 2024;
originally announced February 2024.
-
Careless Whisper: Speech-to-Text Hallucination Harms
Authors:
Allison Koenecke,
Anna Seo Gyeong Choi,
Katelyn X. Mei,
Hilke Schellmann,
Mona Sloane
Abstract:
Speech-to-text services aim to transcribe input audio as accurately as possible. They increasingly play a role in everyday life, for example in personal voice assistants or in customer-company interactions. We evaluate Open AI's Whisper, a state-of-the-art automated speech recognition service outperforming industry competitors, as of 2023. While many of Whisper's transcriptions were highly accurat…
▽ More
Speech-to-text services aim to transcribe input audio as accurately as possible. They increasingly play a role in everyday life, for example in personal voice assistants or in customer-company interactions. We evaluate Open AI's Whisper, a state-of-the-art automated speech recognition service outperforming industry competitors, as of 2023. While many of Whisper's transcriptions were highly accurate, we find that roughly 1\% of audio transcriptions contained entire hallucinated phrases or sentences which did not exist in any form in the underlying audio. We thematically analyze the Whisper-hallucinated content, finding that 38\% of hallucinations include explicit harms such as perpetuating violence, making up inaccurate associations, or implying false authority. We then study why hallucinations occur by observing the disparities in hallucination rates between speakers with aphasia (who have a lowered ability to express themselves using speech and voice) and a control group. We find that hallucinations disproportionately occur for individuals who speak with longer shares of non-vocal durations -- a common symptom of aphasia. We call on industry practitioners to ameliorate these language-model-based hallucinations in Whisper, and to raise awareness of potential biases amplified by hallucinations in downstream applications of speech-to-text models.
△ Less
Submitted 2 May, 2024; v1 submitted 12 February, 2024;
originally announced February 2024.
-
Latent Feature-Guided Diffusion Models for Shadow Removal
Authors:
Kangfu Mei,
Luis Figueroa,
Zhe Lin,
Zhihong Ding,
Scott Cohen,
Vishal M. Patel
Abstract:
Recovering textures under shadows has remained a challenging problem due to the difficulty of inferring shadow-free scenes from shadow images. In this paper, we propose the use of diffusion models as they offer a promising approach to gradually refine the details of shadow regions during the diffusion process. Our method improves this process by conditioning on a learned latent feature space that…
▽ More
Recovering textures under shadows has remained a challenging problem due to the difficulty of inferring shadow-free scenes from shadow images. In this paper, we propose the use of diffusion models as they offer a promising approach to gradually refine the details of shadow regions during the diffusion process. Our method improves this process by conditioning on a learned latent feature space that inherits the characteristics of shadow-free images, thus avoiding the limitation of conventional methods that condition on degraded images only. Additionally, we propose to alleviate potential local optima during training by fusing noise features with the diffusion network. We demonstrate the effectiveness of our approach which outperforms the previous best method by 13% in terms of RMSE on the AISTD dataset. Further, we explore instance-level shadow removal, where our model outperforms the previous best method by 82% in terms of RMSE on the DESOBA dataset.
△ Less
Submitted 10 May, 2025; v1 submitted 4 December, 2023;
originally announced December 2023.
-
War and Peace (WarAgent): Large Language Model-based Multi-Agent Simulation of World Wars
Authors:
Wenyue Hua,
Lizhou Fan,
Lingyao Li,
Kai Mei,
Jianchao Ji,
Yingqiang Ge,
Libby Hemphill,
Yongfeng Zhang
Abstract:
Can we avoid wars at the crossroads of history? This question has been pursued by individuals, scholars, policymakers, and organizations throughout human history. In this research, we attempt to answer the question based on the recent advances of Artificial Intelligence (AI) and Large Language Models (LLMs). We propose \textbf{WarAgent}, an LLM-powered multi-agent AI system, to simulate the partic…
▽ More
Can we avoid wars at the crossroads of history? This question has been pursued by individuals, scholars, policymakers, and organizations throughout human history. In this research, we attempt to answer the question based on the recent advances of Artificial Intelligence (AI) and Large Language Models (LLMs). We propose \textbf{WarAgent}, an LLM-powered multi-agent AI system, to simulate the participating countries, their decisions, and the consequences, in historical international conflicts, including the World War I (WWI), the World War II (WWII), and the Warring States Period (WSP) in Ancient China. By evaluating the simulation effectiveness, we examine the advancements and limitations of cutting-edge AI systems' abilities in studying complex collective human behaviors such as international conflicts under diverse settings. In these simulations, the emergent interactions among agents also offer a novel perspective for examining the triggers and conditions that lead to war. Our findings offer data-driven and AI-augmented insights that can redefine how we approach conflict resolution and peacekeeping strategies. The implications stretch beyond historical analysis, offering a blueprint for using AI to understand human history and possibly prevent future international conflicts. Code and data are available at \url{https://github.com/agiresearch/WarAgent}.
△ Less
Submitted 30 January, 2024; v1 submitted 28 November, 2023;
originally announced November 2023.
-
LightLM: A Lightweight Deep and Narrow Language Model for Generative Recommendation
Authors:
Kai Mei,
Yongfeng Zhang
Abstract:
This paper presents LightLM, a lightweight Transformer-based language model for generative recommendation. While Transformer-based generative modeling has gained importance in various AI sub-fields such as NLP and vision, generative recommendation is still in its infancy due to its unique demand on personalized generative modeling. Existing works on generative recommendation often use NLP-oriented…
▽ More
This paper presents LightLM, a lightweight Transformer-based language model for generative recommendation. While Transformer-based generative modeling has gained importance in various AI sub-fields such as NLP and vision, generative recommendation is still in its infancy due to its unique demand on personalized generative modeling. Existing works on generative recommendation often use NLP-oriented Transformer architectures such as T5, GPT, LLaMA and M6, which are heavy-weight and are not specifically designed for recommendation tasks. LightLM tackles the issue by introducing a light-weight deep and narrow Transformer architecture, which is specifically tailored for direct generation of recommendation items. This structure is especially apt for straightforward generative recommendation and stems from the observation that language model does not have to be too wide for this task, as the input predominantly consists of short tokens that are well-suited for the model's capacity. We also show that our devised user and item ID indexing methods, i.e., Spectral Collaborative Indexing (SCI) and Graph Collaborative Indexing (GCI), enables the deep and narrow Transformer architecture to outperform large-scale language models for recommendation. Besides, to address the hallucination problem of generating items as output, we propose the constrained generation process for generative recommenders. Experiments on real-world datasets show that LightLM outperforms various competitive baselines in terms of both recommendation accuracy and efficiency. The code can be found at https://github.com/dongyuanjushi/LightLM.
△ Less
Submitted 29 October, 2023; v1 submitted 26 October, 2023;
originally announced October 2023.
-
CoDi: Conditional Diffusion Distillation for Higher-Fidelity and Faster Image Generation
Authors:
Kangfu Mei,
Mauricio Delbracio,
Hossein Talebi,
Zhengzhong Tu,
Vishal M. Patel,
Peyman Milanfar
Abstract:
Large generative diffusion models have revolutionized text-to-image generation and offer immense potential for conditional generation tasks such as image enhancement, restoration, editing, and compositing. However, their widespread adoption is hindered by the high computational cost, which limits their real-time application. To address this challenge, we introduce a novel method dubbed CoDi, that…
▽ More
Large generative diffusion models have revolutionized text-to-image generation and offer immense potential for conditional generation tasks such as image enhancement, restoration, editing, and compositing. However, their widespread adoption is hindered by the high computational cost, which limits their real-time application. To address this challenge, we introduce a novel method dubbed CoDi, that adapts a pre-trained latent diffusion model to accept additional image conditioning inputs while significantly reducing the sampling steps required to achieve high-quality results. Our method can leverage architectures such as ControlNet to incorporate conditioning inputs without compromising the model's prior knowledge gained during large scale pre-training. Additionally, a conditional consistency loss enforces consistent predictions across diffusion steps, effectively compelling the model to generate high-quality images with conditions in a few steps. Our conditional-task learning and distillation approach outperforms previous distillation methods, achieving a new state-of-the-art in producing high-quality images with very few steps (e.g., 1-4) across multiple tasks, including super-resolution, text-guided image editing, and depth-to-image generation.
△ Less
Submitted 17 February, 2024; v1 submitted 2 October, 2023;
originally announced October 2023.
-
Bias Against 93 Stigmatized Groups in Masked Language Models and Downstream Sentiment Classification Tasks
Authors:
Katelyn X. Mei,
Sonia Fereidooni,
Aylin Caliskan
Abstract:
The rapid deployment of artificial intelligence (AI) models demands a thorough investigation of biases and risks inherent in these models to understand their impact on individuals and society. This study extends the focus of bias evaluation in extant work by examining bias against social stigmas on a large scale. It focuses on 93 stigmatized groups in the United States, including a wide range of c…
▽ More
The rapid deployment of artificial intelligence (AI) models demands a thorough investigation of biases and risks inherent in these models to understand their impact on individuals and society. This study extends the focus of bias evaluation in extant work by examining bias against social stigmas on a large scale. It focuses on 93 stigmatized groups in the United States, including a wide range of conditions related to disease, disability, drug use, mental illness, religion, sexuality, socioeconomic status, and other relevant factors. We investigate bias against these groups in English pre-trained Masked Language Models (MLMs) and their downstream sentiment classification tasks. To evaluate the presence of bias against 93 stigmatized conditions, we identify 29 non-stigmatized conditions to conduct a comparative analysis. Building upon a psychology scale of social rejection, the Social Distance Scale, we prompt six MLMs: RoBERTa-base, RoBERTa-large, XLNet-large, BERTweet-base, BERTweet-large, and DistilBERT. We use human annotations to analyze the predicted words from these models, with which we measure the extent of bias against stigmatized groups. When prompts include stigmatized conditions, the probability of MLMs predicting negative words is approximately 20 percent higher than when prompts have non-stigmatized conditions. In the sentiment classification tasks, when sentences include stigmatized conditions related to diseases, disability, education, and mental illness, they are more likely to be classified as negative. We also observe a strong correlation between bias in MLMs and their downstream sentiment classifiers (r =0.79). The evidence indicates that MLMs and their downstream sentiment classification tasks exhibit biases against socially stigmatized groups.
△ Less
Submitted 8 June, 2023;
originally announced June 2023.
-
NOTABLE: Transferable Backdoor Attacks Against Prompt-based NLP Models
Authors:
Kai Mei,
Zheng Li,
Zhenting Wang,
Yang Zhang,
Shiqing Ma
Abstract:
Prompt-based learning is vulnerable to backdoor attacks. Existing backdoor attacks against prompt-based models consider injecting backdoors into the entire embedding layers or word embedding vectors. Such attacks can be easily affected by retraining on downstream tasks and with different prompting strategies, limiting the transferability of backdoor attacks. In this work, we propose transferable b…
▽ More
Prompt-based learning is vulnerable to backdoor attacks. Existing backdoor attacks against prompt-based models consider injecting backdoors into the entire embedding layers or word embedding vectors. Such attacks can be easily affected by retraining on downstream tasks and with different prompting strategies, limiting the transferability of backdoor attacks. In this work, we propose transferable backdoor attacks against prompt-based models, called NOTABLE, which is independent of downstream tasks and prompting strategies. Specifically, NOTABLE injects backdoors into the encoders of PLMs by utilizing an adaptive verbalizer to bind triggers to specific words (i.e., anchors). It activates the backdoor by pasting input with triggers to reach adversary-desired anchors, achieving independence from downstream tasks and prompting strategies. We conduct experiments on six NLP tasks, three popular models, and three prompting strategies. Empirical results show that NOTABLE achieves superior attack performance (i.e., attack success rate over 90% on all the datasets), and outperforms two state-of-the-art baselines. Evaluations on three defenses show the robustness of NOTABLE. Our code can be found at https://github.com/RU-System-Software-and-Security/Notable.
△ Less
Submitted 28 May, 2023;
originally announced May 2023.
-
T1: Scaling Diffusion Probabilistic Fields to High-Resolution on Unified Visual Modalities
Authors:
Kangfu Mei,
Mo Zhou,
Vishal M. Patel
Abstract:
Diffusion Probabilistic Field (DPF) models the distribution of continuous functions defined over metric spaces. While DPF shows great potential for unifying data generation of various modalities including images, videos, and 3D geometry, it does not scale to a higher data resolution. This can be attributed to the ``scaling property'', where it is difficult for the model to capture local structures…
▽ More
Diffusion Probabilistic Field (DPF) models the distribution of continuous functions defined over metric spaces. While DPF shows great potential for unifying data generation of various modalities including images, videos, and 3D geometry, it does not scale to a higher data resolution. This can be attributed to the ``scaling property'', where it is difficult for the model to capture local structures through uniform sampling. To this end, we propose a new model comprising of a view-wise sampling algorithm to focus on local structure learning, and incorporating additional guidance, e.g., text description, to complement the global geometry. The model can be scaled to generate high-resolution data while unifying multiple modalities. Experimental results on data generation in various modalities demonstrate the effectiveness of our model, as well as its potential as a foundation framework for scalable modality-unified visual content generation.
△ Less
Submitted 23 May, 2023;
originally announced May 2023.
-
UAV Obstacle Avoidance by Human-in-the-Loop Reinforcement in Arbitrary 3D Environment
Authors:
Xuyang Li,
Jianwu Fang,
Kai Du,
Kuizhi Mei,
Jianru Xue
Abstract:
This paper focuses on the continuous control of the unmanned aerial vehicle (UAV) based on a deep reinforcement learning method for a large-scale 3D complex environment. The purpose is to make the UAV reach any target point from a certain starting point, and the flying height and speed are variable during navigation. In this work, we propose a deep reinforcement learning (DRL)-based method combine…
▽ More
This paper focuses on the continuous control of the unmanned aerial vehicle (UAV) based on a deep reinforcement learning method for a large-scale 3D complex environment. The purpose is to make the UAV reach any target point from a certain starting point, and the flying height and speed are variable during navigation. In this work, we propose a deep reinforcement learning (DRL)-based method combined with human-in-the-loop, which allows the UAV to avoid obstacles automatically during flying. We design multiple reward functions based on the relevant domain knowledge to guide UAV navigation. The role of human-in-the-loop is to dynamically change the reward function of the UAV in different situations to suit the obstacle avoidance of the UAV better. We verify the success rate and average step size on urban, rural, and forest scenarios, and the experimental results show that the proposed method can reduce the training convergence time and improve the efficiency and accuracy of navigation tasks. The code is available on the website https://github.com/Monnalo/UAV_navigation.
△ Less
Submitted 6 April, 2023;
originally announced April 2023.
-
OpenAGI: When LLM Meets Domain Experts
Authors:
Yingqiang Ge,
Wenyue Hua,
Kai Mei,
Jianchao Ji,
Juntao Tan,
Shuyuan Xu,
Zelong Li,
Yongfeng Zhang
Abstract:
Human Intelligence (HI) excels at combining basic skills to solve complex tasks. This capability is vital for Artificial Intelligence (AI) and should be embedded in comprehensive AI Agents, enabling them to harness expert models for complex task-solving towards Artificial General Intelligence (AGI). Large Language Models (LLMs) show promising learning and reasoning abilities, and can effectively u…
▽ More
Human Intelligence (HI) excels at combining basic skills to solve complex tasks. This capability is vital for Artificial Intelligence (AI) and should be embedded in comprehensive AI Agents, enabling them to harness expert models for complex task-solving towards Artificial General Intelligence (AGI). Large Language Models (LLMs) show promising learning and reasoning abilities, and can effectively use external models, tools, plugins, or APIs to tackle complex problems. In this work, we introduce OpenAGI, an open-source AGI research and development platform designed for solving multi-step, real-world tasks. Specifically, OpenAGI uses a dual strategy, integrating standard benchmark tasks for benchmarking and evaluation, and open-ended tasks including more expandable models, tools, plugins, or APIs for creative problem-solving. Tasks are presented as natural language queries to the LLM, which then selects and executes appropriate models. We also propose a Reinforcement Learning from Task Feedback (RLTF) mechanism that uses task results to improve the LLM's task-solving ability, which creates a self-improving AI feedback loop. While we acknowledge that AGI is a broad and multifaceted research challenge with no singularly defined solution path, the integration of LLMs with domain-specific expert models, inspired by mirroring the blend of general and specialized intelligence in humans, offers a promising approach towards AGI. We are open-sourcing the OpenAGI project's code, dataset, benchmarks, evaluation methods, and the UI demo to foster community involvement in AGI advancement: https://github.com/agiresearch/OpenAGI.
△ Less
Submitted 3 November, 2023; v1 submitted 9 April, 2023;
originally announced April 2023.
-
UNICORN: A Unified Backdoor Trigger Inversion Framework
Authors:
Zhenting Wang,
Kai Mei,
Juan Zhai,
Shiqing Ma
Abstract:
The backdoor attack, where the adversary uses inputs stamped with triggers (e.g., a patch) to activate pre-planted malicious behaviors, is a severe threat to Deep Neural Network (DNN) models. Trigger inversion is an effective way of identifying backdoor models and understanding embedded adversarial behaviors. A challenge of trigger inversion is that there are many ways of constructing the trigger.…
▽ More
The backdoor attack, where the adversary uses inputs stamped with triggers (e.g., a patch) to activate pre-planted malicious behaviors, is a severe threat to Deep Neural Network (DNN) models. Trigger inversion is an effective way of identifying backdoor models and understanding embedded adversarial behaviors. A challenge of trigger inversion is that there are many ways of constructing the trigger. Existing methods cannot generalize to various types of triggers by making certain assumptions or attack-specific constraints. The fundamental reason is that existing work does not consider the trigger's design space in their formulation of the inversion problem. This work formally defines and analyzes the triggers injected in different spaces and the inversion problem. Then, it proposes a unified framework to invert backdoor triggers based on the formalization of triggers and the identified inner behaviors of backdoor models from our analysis. Our prototype UNICORN is general and effective in inverting backdoor triggers in DNNs. The code can be found at https://github.com/RU-System-Software-and-Security/UNICORN.
△ Less
Submitted 5 April, 2023;
originally announced April 2023.
-
Hard-aware Instance Adaptive Self-training for Unsupervised Cross-domain Semantic Segmentation
Authors:
Chuang Zhu,
Kebin Liu,
Wenqi Tang,
Ke Mei,
Jiaqi Zou,
Tiejun Huang
Abstract:
The divergence between labeled training data and unlabeled testing data is a significant challenge for recent deep learning models. Unsupervised domain adaptation (UDA) attempts to solve such problem. Recent works show that self-training is a powerful approach to UDA. However, existing methods have difficulty in balancing the scalability and performance. In this paper, we propose a hard-aware inst…
▽ More
The divergence between labeled training data and unlabeled testing data is a significant challenge for recent deep learning models. Unsupervised domain adaptation (UDA) attempts to solve such problem. Recent works show that self-training is a powerful approach to UDA. However, existing methods have difficulty in balancing the scalability and performance. In this paper, we propose a hard-aware instance adaptive self-training framework for UDA on the task of semantic segmentation. To effectively improve the quality and diversity of pseudo-labels, we develop a novel pseudo-label generation strategy with an instance adaptive selector. We further enrich the hard class pseudo-labels with inter-image information through a skillfully designed hard-aware pseudo-label augmentation. Besides, we propose the region-adaptive regularization to smooth the pseudo-label region and sharpen the non-pseudo-label region. For the non-pseudo-label region, consistency constraint is also constructed to introduce stronger supervision signals during model optimization. Our method is so concise and efficient that it is easy to be generalized to other UDA methods. Experiments on GTA5 to Cityscapes, SYNTHIA to Cityscapes, and Cityscapes to Oxford RobotCar demonstrate the superior performance of our approach compared with the state-of-the-art methods. Our codes are available at https://github.com/bupt-ai-cz/HIAST.
△ Less
Submitted 24 March, 2025; v1 submitted 14 February, 2023;
originally announced February 2023.
-
Bi-Noising Diffusion: Towards Conditional Diffusion Models with Generative Restoration Priors
Authors:
Kangfu Mei,
Nithin Gopalakrishnan Nair,
Vishal M. Patel
Abstract:
Conditional diffusion probabilistic models can model the distribution of natural images and can generate diverse and realistic samples based on given conditions. However, oftentimes their results can be unrealistic with observable color shifts and textures. We believe that this issue results from the divergence between the probabilistic distribution learned by the model and the distribution of nat…
▽ More
Conditional diffusion probabilistic models can model the distribution of natural images and can generate diverse and realistic samples based on given conditions. However, oftentimes their results can be unrealistic with observable color shifts and textures. We believe that this issue results from the divergence between the probabilistic distribution learned by the model and the distribution of natural images. The delicate conditions gradually enlarge the divergence during each sampling timestep. To address this issue, we introduce a new method that brings the predicted samples to the training data manifold using a pretrained unconditional diffusion model. The unconditional model acts as a regularizer and reduces the divergence introduced by the conditional model at each sampling step. We perform comprehensive experiments to demonstrate the effectiveness of our approach on super-resolution, colorization, turbulence removal, and image-deraining tasks. The improvements obtained by our method suggest that the priors can be incorporated as a general plugin for improving conditional diffusion models.
△ Less
Submitted 14 December, 2022;
originally announced December 2022.
-
VIDM: Video Implicit Diffusion Models
Authors:
Kangfu Mei,
Vishal M. Patel
Abstract:
Diffusion models have emerged as a powerful generative method for synthesizing high-quality and diverse set of images. In this paper, we propose a video generation method based on diffusion models, where the effects of motion are modeled in an implicit condition manner, i.e. one can sample plausible video motions according to the latent feature of frames. We improve the quality of the generated vi…
▽ More
Diffusion models have emerged as a powerful generative method for synthesizing high-quality and diverse set of images. In this paper, we propose a video generation method based on diffusion models, where the effects of motion are modeled in an implicit condition manner, i.e. one can sample plausible video motions according to the latent feature of frames. We improve the quality of the generated videos by proposing multiple strategies such as sampling space truncation, robustness penalty, and positional group normalization. Various experiments are conducted on datasets consisting of videos with different resolutions and different number of frames. Results show that the proposed method outperforms the state-of-the-art generative adversarial network-based methods by a significant margin in terms of FVD scores as well as perceptible visual quality.
△ Less
Submitted 30 November, 2022;
originally announced December 2022.
-
Rethinking the Reverse-engineering of Trojan Triggers
Authors:
Zhenting Wang,
Kai Mei,
Hailun Ding,
Juan Zhai,
Shiqing Ma
Abstract:
Deep Neural Networks are vulnerable to Trojan (or backdoor) attacks. Reverse-engineering methods can reconstruct the trigger and thus identify affected models. Existing reverse-engineering methods only consider input space constraints, e.g., trigger size in the input space. Expressly, they assume the triggers are static patterns in the input space and fail to detect models with feature space trigg…
▽ More
Deep Neural Networks are vulnerable to Trojan (or backdoor) attacks. Reverse-engineering methods can reconstruct the trigger and thus identify affected models. Existing reverse-engineering methods only consider input space constraints, e.g., trigger size in the input space. Expressly, they assume the triggers are static patterns in the input space and fail to detect models with feature space triggers such as image style transformations. We observe that both input-space and feature-space Trojans are associated with feature space hyperplanes. Based on this observation, we design a novel reverse-engineering method that exploits the feature space constraint to reverse-engineer Trojan triggers. Results on four datasets and seven different attacks demonstrate that our solution effectively defends both input-space and feature-space Trojans. It outperforms state-of-the-art reverse-engineering methods and other types of defenses in both Trojaned model detection and mitigation tasks. On average, the detection accuracy of our method is 93\%. For Trojan mitigation, our method can reduce the ASR (attack success rate) to only 0.26\% with the BA (benign accuracy) remaining nearly unchanged. Our code can be found at https://github.com/RU-System-Software-and-Security/FeatureRE.
△ Less
Submitted 26 October, 2022;
originally announced October 2022.
-
AT-DDPM: Restoring Faces degraded by Atmospheric Turbulence using Denoising Diffusion Probabilistic Models
Authors:
Nithin Gopalakrishnan Nair,
Kangfu Mei,
Vishal M. Patel
Abstract:
Although many long-range imaging systems are designed to support extended vision applications, a natural obstacle to their operation is degradation due to atmospheric turbulence. Atmospheric turbulence causes significant degradation to image quality by introducing blur and geometric distortion. In recent years, various deep learning-based single image atmospheric turbulence mitigation methods, inc…
▽ More
Although many long-range imaging systems are designed to support extended vision applications, a natural obstacle to their operation is degradation due to atmospheric turbulence. Atmospheric turbulence causes significant degradation to image quality by introducing blur and geometric distortion. In recent years, various deep learning-based single image atmospheric turbulence mitigation methods, including CNN-based and GAN inversion-based, have been proposed in the literature which attempt to remove the distortion in the image. However, some of these methods are difficult to train and often fail to reconstruct facial features and produce unrealistic results especially in the case of high turbulence. Denoising Diffusion Probabilistic Models (DDPMs) have recently gained some traction because of their stable training process and their ability to generate high quality images. In this paper, we propose the first DDPM-based solution for the problem of atmospheric turbulence mitigation. We also propose a fast sampling technique for reducing the inference times for conditional DDPMs. Extensive experiments are conducted on synthetic and real-world data to show the significance of our model. To facilitate further research, all codes and pretrained models are publically available at http://github.com/Nithin-GK/AT-DDPM
△ Less
Submitted 20 September, 2022; v1 submitted 23 August, 2022;
originally announced August 2022.
-
Deep Semantic Statistics Matching (D2SM) Denoising Network
Authors:
Kangfu Mei,
Vishal M. Patel,
Rui Huang
Abstract:
The ultimate aim of image restoration like denoising is to find an exact correlation between the noisy and clear image domains. But the optimization of end-to-end denoising learning like pixel-wise losses is performed in a sample-to-sample manner, which ignores the intrinsic correlation of images, especially semantics. In this paper, we introduce the Deep Semantic Statistics Matching (D2SM) Denois…
▽ More
The ultimate aim of image restoration like denoising is to find an exact correlation between the noisy and clear image domains. But the optimization of end-to-end denoising learning like pixel-wise losses is performed in a sample-to-sample manner, which ignores the intrinsic correlation of images, especially semantics. In this paper, we introduce the Deep Semantic Statistics Matching (D2SM) Denoising Network. It exploits semantic features of pretrained classification networks, then it implicitly matches the probabilistic distribution of clear images at the semantic feature space. By learning to preserve the semantic distribution of denoised images, we empirically find our method significantly improves the denoising capabilities of networks, and the denoised results can be better understood by high-level vision tasks. Comprehensive experiments conducted on the noisy Cityscapes dataset demonstrate the superiority of our method on both the denoising performance and semantic segmentation accuracy. Moreover, the performance improvement observed on our extended tasks including super-resolution and dehazing experiments shows its potentiality as a new general plug-and-play component.
△ Less
Submitted 19 July, 2022;
originally announced July 2022.
-
A comparison of different atmospheric turbulence simulation methods for image restoration
Authors:
Nithin Gopalakrishnan Nair,
Kangfu Mei,
Vishal M. Patel
Abstract:
Atmospheric turbulence deteriorates the quality of images captured by long-range imaging systems by introducing blur and geometric distortions to the captured scene. This leads to a drastic drop in performance when computer vision algorithms like object/face recognition and detection are performed on these images. In recent years, various deep learning-based atmospheric turbulence mitigation metho…
▽ More
Atmospheric turbulence deteriorates the quality of images captured by long-range imaging systems by introducing blur and geometric distortions to the captured scene. This leads to a drastic drop in performance when computer vision algorithms like object/face recognition and detection are performed on these images. In recent years, various deep learning-based atmospheric turbulence mitigation methods have been proposed in the literature. These methods are often trained using synthetically generated images and tested on real-world images. Hence, the performance of these restoration methods depends on the type of simulation used for training the network. In this paper, we systematically evaluate the effectiveness of various turbulence simulation methods on image restoration. In particular, we evaluate the performance of two state-or-the-art restoration networks using six simulations method on a real-world LRFID dataset consisting of face images degraded by turbulence. This paper will provide guidance to the researchers and practitioners working in this field to choose the suitable data generation models for training deep models for turbulence mitigation. The implementation codes for the simulation methods, source codes for the networks, and the pre-trained models will be publicly made available.
△ Less
Submitted 19 April, 2022;
originally announced April 2022.
-
Thermal to Visible Image Synthesis under Atmospheric Turbulence
Authors:
Kangfu Mei,
Yiqun Mei,
Vishal M. Patel
Abstract:
In many practical applications of long-range imaging such as biometrics and surveillance, thermal imagining modalities are often used to capture images in low-light and nighttime conditions. However, such imaging systems often suffer from atmospheric turbulence, which introduces severe blur and deformation artifacts to the captured images. Such an issue is unavoidable in long-range imaging and sig…
▽ More
In many practical applications of long-range imaging such as biometrics and surveillance, thermal imagining modalities are often used to capture images in low-light and nighttime conditions. However, such imaging systems often suffer from atmospheric turbulence, which introduces severe blur and deformation artifacts to the captured images. Such an issue is unavoidable in long-range imaging and significantly decreases the face verification accuracy. In this paper, we first investigate the problem with a turbulence simulation method on real-world thermal images. An end-to-end reconstruction method is then proposed which can directly transform thermal images into visible-spectrum images by utilizing natural image priors based on a pre-trained StyleGAN2 network. Compared with the existing two-steps methods of consecutive turbulence mitigation and thermal to visible image translation, our method is demonstrated to be effective in terms of both the visual quality of the reconstructed results and face verification accuracy. Moreover, to the best of our knowledge, this is the first work that studies the problem of thermal to visible image translation under atmospheric turbulence.
△ Less
Submitted 6 April, 2022;
originally announced April 2022.
-
Theoretical Analysis of Deep Neural Networks in Physical Layer Communication
Authors:
Jun Liu,
Haitao Zhao,
Dongtang Ma,
Kai Mei,
Jibo Wei
Abstract:
Recently, deep neural network (DNN)-based physical layer communication techniques have attracted considerable interest. Although their potential to enhance communication systems and superb performance have been validated by simulation experiments, little attention has been paid to the theoretical analysis. Specifically, most studies in the physical layer have tended to focus on the application of…
▽ More
Recently, deep neural network (DNN)-based physical layer communication techniques have attracted considerable interest. Although their potential to enhance communication systems and superb performance have been validated by simulation experiments, little attention has been paid to the theoretical analysis. Specifically, most studies in the physical layer have tended to focus on the application of DNN models to wireless communication problems but not to theoretically understand how does a DNN work in a communication system. In this paper, we aim to quantitatively analyze why DNNs can achieve comparable performance in the physical layer comparing with traditional techniques, and also drive their cost in terms of computational complexity. To achieve this goal, we first analyze the encoding performance of a DNN-based transmitter and compare it to a traditional one. And then, we theoretically analyze the performance of DNN-based estimator and compare it with traditional estimators. Third, we investigate and validate how information is flown in a DNN-based communication system under the information theoretic concepts. Our analysis develops a concise way to open the "black box" of DNNs in physical layer communication, which can be applied to support the design of DNN-based intelligent communication techniques and help to provide explainable performance assessment.
△ Less
Submitted 26 August, 2022; v1 submitted 20 February, 2022;
originally announced February 2022.
-
LTT-GAN: Looking Through Turbulence by Inverting GANs
Authors:
Kangfu Mei,
Vishal M. Patel
Abstract:
In many applications of long-range imaging, we are faced with a scenario where a person appearing in the captured imagery is often degraded by atmospheric turbulence. However, restoring such degraded images for face verification is difficult since the degradation causes images to be geometrically distorted and blurry. To mitigate the turbulence effect, in this paper, we propose the first turbulenc…
▽ More
In many applications of long-range imaging, we are faced with a scenario where a person appearing in the captured imagery is often degraded by atmospheric turbulence. However, restoring such degraded images for face verification is difficult since the degradation causes images to be geometrically distorted and blurry. To mitigate the turbulence effect, in this paper, we propose the first turbulence mitigation method that makes use of visual priors encapsulated by a well-trained GAN. Based on the visual priors, we propose to learn to preserve the identity of restored images on a spatial periodic contextual distance. Such a distance can keep the realism of restored images from the GAN while considering the identity difference at the network learning. In addition, hierarchical pseudo connections are proposed for facilitating the identity-preserving learning by introducing more appearance variance without identity changing. Extensive experiments show that our method significantly outperforms prior art in both the visual quality and face verification accuracy of restored results.
△ Less
Submitted 4 December, 2021;
originally announced December 2021.
-
Mobile App Crowdsourced Test Report Consistency Detection via Deep Image-and-Text Fusion Understanding
Authors:
Shengcheng Yu,
Chunrong Fang,
Quanjun Zhang,
Zhihao Cao,
Yexiao Yun,
Zhenfei Cao,
Kai Mei,
Zhenyu Chen
Abstract:
Crowdsourced testing, as a distinct testing paradigm, has attracted much attention in software testing, especially in mobile application (app) testing field. Compared with in-house testing, crowdsourced testing shows superiority with the diverse testing environments when faced with the mobile testing fragmentation problem. However, crowdsourced testing also encounters the low-quality test report p…
▽ More
Crowdsourced testing, as a distinct testing paradigm, has attracted much attention in software testing, especially in mobile application (app) testing field. Compared with in-house testing, crowdsourced testing shows superiority with the diverse testing environments when faced with the mobile testing fragmentation problem. However, crowdsourced testing also encounters the low-quality test report problem caused by unprofessional crowdworkers involved with different expertise. In order to handle the submitted reports of uneven quality, app developers have to distinguish high-quality reports from low-quality ones to help the bug inspection. One kind of typical low-quality test report is inconsistent test reports, which means the textual descriptions are not focusing on the attached bug-occurring screenshots. According to our empirical survey, only 18.07% crowdsourced test reports are consistent. Inconsistent reports cause waste on mobile app testing.
To solve the inconsistency problem, we propose ReCoDe to detect the consistency of crowdsourced test reports via deep image-and-text fusion understanding. ReCoDe is a two-stage approach that first classifies the reports based on textual descriptions into different categories according to the bug feature. In the second stage, ReCoDe has a deep understanding of the GUI image features of the app screenshots and then applies different strategies to handle different types of bugs to detect the consistency of the crowdsourced test reports. We conduct an experiment on a dataset with over 22k test reports to evaluate ReCoDe, and the results show the effectiveness of ReCoDe in detecting the consistency of crowdsourced test reports. Besides, a user study is conducted to prove the practical value of ReCoDe in effectively helping app developers improve the efficiency of reviewing the crowdsourced test reports.
△ Less
Submitted 12 June, 2023; v1 submitted 16 August, 2021;
originally announced August 2021.
-
A Low Complexity Learning-based Channel Estimation for OFDM Systems with Online Training
Authors:
Kai Mei,
Jun Liu,
Xiaoying Zhang,
Kuo Cao,
Nandana Rajatheva,
Jibo Wei
Abstract:
In this paper, we devise a highly efficient machine learning-based channel estimation for orthogonal frequency division multiplexing (OFDM) systems, in which the training of the estimator is performed online. A simple learning module is employed for the proposed learning-based estimator. The training process is thus much faster and the required training data is reduced significantly. Besides, a tr…
▽ More
In this paper, we devise a highly efficient machine learning-based channel estimation for orthogonal frequency division multiplexing (OFDM) systems, in which the training of the estimator is performed online. A simple learning module is employed for the proposed learning-based estimator. The training process is thus much faster and the required training data is reduced significantly. Besides, a training data construction approach utilizing least square (LS) estimation results is proposed so that the training data can be collected during the data transmission. The feasibility of this novel construction approach is verified by theoretical analysis and simulations. Based on this construction approach, two alternative training data generation schemes are proposed. One scheme transmits additional block pilot symbols to create training data, while the other scheme adopts a decision-directed method and does not require extra pilot overhead. Simulation results show the robustness of the proposed channel estimation method. Furthermore, the proposed method shows better adaptation to practical imperfections compared with the conventional minimum mean-square error (MMSE) channel estimation. It outperforms the existing machine learning-based channel estimation techniques under varying channel conditions.
△ Less
Submitted 14 July, 2021;
originally announced July 2021.
-
Opening the Black Box of Deep Neural Networks in Physical Layer Communication
Authors:
Jun Liu,
Haitao Zhao,
Dongtang Ma,
Kai Mei,
Jibo Wei
Abstract:
Deep Neural Network (DNN)-based physical layer techniques are attracting considerable interest due to their potential to enhance communication systems. However, most studies in the physical layer have tended to focus on the application of DNN models to wireless communication problems but not to theoretically understand how does a DNN work in a communication system. In this paper, we aim to quantit…
▽ More
Deep Neural Network (DNN)-based physical layer techniques are attracting considerable interest due to their potential to enhance communication systems. However, most studies in the physical layer have tended to focus on the application of DNN models to wireless communication problems but not to theoretically understand how does a DNN work in a communication system. In this paper, we aim to quantitatively analyze why DNNs can achieve comparable performance in the physical layer comparing with traditional techniques and their cost in terms of computational complexity. We further investigate and also experimentally validate how information is flown in a DNN-based communication system under the information theoretic concepts.
△ Less
Submitted 18 February, 2022; v1 submitted 2 June, 2021;
originally announced June 2021.
-
SDAN: Squared Deformable Alignment Network for Learning Misaligned Optical Zoom
Authors:
Kangfu Mei,
Shenglong Ye,
Rui Huang
Abstract:
Deep Neural Network (DNN) based super-resolution algorithms have greatly improved the quality of the generated images. However, these algorithms often yield significant artifacts when dealing with real-world super-resolution problems due to the difficulty in learning misaligned optical zoom. In this paper, we introduce a Squared Deformable Alignment Network (SDAN) to address this issue. Our networ…
▽ More
Deep Neural Network (DNN) based super-resolution algorithms have greatly improved the quality of the generated images. However, these algorithms often yield significant artifacts when dealing with real-world super-resolution problems due to the difficulty in learning misaligned optical zoom. In this paper, we introduce a Squared Deformable Alignment Network (SDAN) to address this issue. Our network learns squared per-point offsets for convolutional kernels, and then aligns features in corrected convolutional windows based on the offsets. So the misalignment will be minimized by the extracted aligned features. Different from the per-point offsets used in the vanilla Deformable Convolutional Network (DCN), our proposed squared offsets not only accelerate the offset learning but also improve the generation quality with fewer parameters. Besides, we further propose an efficient cross packing attention layer to boost the accuracy of the learned offsets. It leverages the packing and unpacking operations to enlarge the receptive field of the offset learning and to enhance the ability of extracting the spatial connection between the low-resolution images and the referenced images. Comprehensive experiments show the superiority of our method over other state-of-the-art methods in both computational efficiency and realistic details.
△ Less
Submitted 25 November, 2021; v1 submitted 1 April, 2021;
originally announced April 2021.
-
AttaNet: Attention-Augmented Network for Fast and Accurate Scene Parsing
Authors:
Qi Song,
Kangfu Mei,
Rui Huang
Abstract:
Two factors have proven to be very important to the performance of semantic segmentation models: global context and multi-level semantics. However, generating features that capture both factors always leads to high computational complexity, which is problematic in real-time scenarios. In this paper, we propose a new model, called Attention-Augmented Network (AttaNet), to capture both global contex…
▽ More
Two factors have proven to be very important to the performance of semantic segmentation models: global context and multi-level semantics. However, generating features that capture both factors always leads to high computational complexity, which is problematic in real-time scenarios. In this paper, we propose a new model, called Attention-Augmented Network (AttaNet), to capture both global context and multilevel semantics while keeping the efficiency high. AttaNet consists of two primary modules: Strip Attention Module (SAM) and Attention Fusion Module (AFM). Viewing that in challenging images with low segmentation accuracy, there are a significantly larger amount of vertical strip areas than horizontal ones, SAM utilizes a striping operation to reduce the complexity of encoding global context in the vertical direction drastically while keeping most of contextual information, compared to the non-local approaches. Moreover, AFM follows a cross-level aggregation strategy to limit the computation, and adopts an attention strategy to weight the importance of different levels of features at each pixel when fusing them, obtaining an efficient multi-level representation. We have conducted extensive experiments on two semantic segmentation benchmarks, and our network achieves different levels of speed/accuracy trade-offs on Cityscapes, e.g., 71 FPS/79.9% mIoU, 130 FPS/78.5% mIoU, and 180 FPS/70.1% mIoU, and leading performance on ADE20K as well.
△ Less
Submitted 10 March, 2021;
originally announced March 2021.
-
Scale-Aware Network with Regional and Semantic Attentions for Crowd Counting under Cluttered Background
Authors:
Qiaosi Yi,
Yunxing Liu,
Aiwen Jiang,
Juncheng Li,
Kangfu Mei,
Mingwen Wang
Abstract:
Crowd counting is an important task that shown great application value in public safety-related fields, which has attracted increasing attention in recent years. In the current research, the accuracy of counting numbers and crowd density estimation are the main concerns. Although the emergence of deep learning has greatly promoted the development of this field, crowd counting under cluttered backg…
▽ More
Crowd counting is an important task that shown great application value in public safety-related fields, which has attracted increasing attention in recent years. In the current research, the accuracy of counting numbers and crowd density estimation are the main concerns. Although the emergence of deep learning has greatly promoted the development of this field, crowd counting under cluttered background is still a serious challenge. In order to solve this problem, we propose a ScaleAware Crowd Counting Network (SACCN) with regional and semantic attentions. The proposed SACCN distinguishes crowd and background by applying regional and semantic self-attention mechanisms on the shallow layers and deep layers, respectively. Moreover, the asymmetric multi-scale module (AMM) is proposed to deal with the problem of scale diversity, and regional attention based dense connections and skip connections are designed to alleviate the variations on crowd scales. Extensive experimental results on multiple public benchmarks demonstrate that our proposed SACCN achieves satisfied superior performances and outperform most state-of-the-art methods. All codes and pretrained models will be released soon.
△ Less
Submitted 7 January, 2021; v1 submitted 5 January, 2021;
originally announced January 2021.