-
EDITOR: Effective and Interpretable Prompt Inversion for Text-to-Image Diffusion Models
Authors:
Mingzhe Li,
Gehao Zhang,
Zhenting Wang,
Shiqing Ma,
Siqi Pan,
Richard Cartwright,
Juan Zhai
Abstract:
Text-to-image generation models (e.g., Stable Diffusion) have achieved significant advancements, enabling the creation of high-quality and realistic images based on textual descriptions. Prompt inversion, the task of identifying the textual prompt used to generate a specific artifact, holds significant potential for applications including data attribution, model provenance, and watermarking validation. Recent studies introduced a delayed projection scheme to optimize for prompts representative of the vocabulary space, though challenges in semantic fluency and efficiency remain. Advanced image captioning models or visual large language models can generate highly interpretable prompts, but they often fall short on image similarity. In this paper, we propose a prompt inversion technique called EDITOR for text-to-image diffusion models, which includes initializing embeddings using a pre-trained image captioning model, refining them through reverse-engineering in the latent space, and converting them to texts using an embedding-to-text model. Our experiments on widely used datasets, such as MS COCO, LAION, and Flickr, show that our method outperforms existing methods in terms of image similarity, textual alignment, prompt interpretability, and generalizability. We further illustrate the application of our generated prompts in tasks such as cross-concept image synthesis, concept manipulation, evolutionary multi-concept generation, and unsupervised segmentation.
Submitted 3 June, 2025;
originally announced June 2025.
-
V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models
Authors:
Heng Wang,
Jianbo Ma,
Santiago Pascual,
Richard Cartwright,
Weidong Cai
Abstract:
Building artificial intelligence (AI) systems on top of a set of foundation models (FMs) is becoming a new paradigm in AI research. Their representative and generative abilities learnt from vast amounts of data can be easily adapted and transferred to a wide range of downstream tasks without extra training from scratch. However, leveraging FMs in cross-modal generation remains under-researched when the audio modality is involved. On the other hand, automatically generating semantically relevant sound from visual input is an important problem in cross-modal generation studies. To solve this vision-to-audio (V2A) generation problem, existing methods tend to design and build complex systems from scratch using modestly sized datasets. In this paper, we propose a lightweight solution to this problem by leveraging foundation models, specifically CLIP, CLAP, and AudioLDM. We first investigate the domain gap between the latent space of the visual CLIP and the auditory CLAP models. Then we propose a simple yet effective mapper mechanism (V2A-Mapper) to bridge the domain gap by translating the visual input between CLIP and CLAP spaces. Conditioned on the translated CLAP embedding, the pretrained audio generative FM AudioLDM is adopted to produce high-fidelity and visually aligned sound. Compared to previous approaches, our method only requires a quick training of the V2A-Mapper. We further analyze and conduct extensive experiments on the choice of the V2A-Mapper and show that a generative mapper is better at fidelity and variability (FD) while a regression mapper is slightly better at relevance (CS). Both objective and subjective evaluations on two V2A datasets demonstrate the superiority of our proposed method compared to current state-of-the-art approaches: it is trained with 86% fewer parameters but achieves 53% and 19% improvement in FD and CS, respectively.
Submitted 13 December, 2023; v1 submitted 18 August, 2023;
originally announced August 2023.
-
A low latency attention module for streaming self-supervised speech representation learning
Authors:
Jianbo Ma,
Siqi Pan,
Deepak Chandran,
Andrea Fanelli,
Richard Cartwright
Abstract:
The transformer is a fundamental building block in deep learning, and the attention mechanism is the transformer's core component. Self-supervised speech representation learning (SSRL) represents a popular use-case for the transformer architecture. Due to transformers' acausal behavior, the use of transformers for SSRL has been predominantly focused on acausal applications. However, several media processing problems, such as speech processing, require real-time solutions. In this paper, we present an implementation of the attention module that enables training of SSRL architectures with low compute and memory requirements, while allowing real-time inference with low and fixed latency. The attention module proposed in this paper includes two components, streaming attention (SA) and low-latency streaming attention (LLSA). The SA represents our proposal for an efficient streaming SSRL implementation, while the LLSA solves the latency build-up problem of other streaming attention architectures, such as the masked acausal attention (MAA), guaranteeing a latency equal to one layer even when multiple layers are stacked. We present a comparative analysis between the vanilla attention, which we refer to here as acausal attention (AA), the SA, and the LLSA, by training a streaming SSRL with automatic speech recognition as the downstream task. When training on librispeech-clean-100 and testing on librispeech-test-clean, our low-latency attention module has a word error rate (WER) of 5.84%, which represents a significant improvement over the MAA (WER = 13.82%). Our implementation also reduces the inference latency from 1.92 to 0.16 seconds. The proposed low-latency module preserves many of the benefits of conventional acausal transformers, but also enables latency characteristics that make it applicable to real-time streaming applications.
Submitted 17 March, 2024; v1 submitted 26 February, 2023;
originally announced February 2023.
-
Language Support for Adaptation: Intent-Driven Programming in FAST
Authors:
Yao-Hsiang Yang,
Adam Duracz,
Ferenc A. Bartha,
Ryuichi Sai,
Ahsan Pervaiz,
Saeid Barati,
Dung Nguyen,
Robert Cartwright,
Henry Hoffmann,
Krishna V. Palem
Abstract:
Historically, programming language semantics has focused on assigning a precise mathematical meaning to programs. That meaning is a function from the program's input domain to its output domain determined solely by its syntactic structure. Such a semantics fosters the development of portable applications which are oblivious to the performance characteristics and limitations (such as a maximum memory footprint) of particular hardware and software platforms. This paper introduces the idea of intent-driven programming where the meaning of a program additionally depends on an accompanying intent specification expressing how the ordinary program meaning is dynamically modified during execution to satisfy additional properties expressed by the intent. These include both intensional properties (e.g., resource usage) and extensional properties (e.g., accuracy of the computed answer). To demonstrate the intent-driven programming model's value, this paper presents a general-purpose intent-driven programming language, called FAST, implemented as an extension of Swift. FAST consists of an intent compiler, a profiler, a general controller interface, and a runtime module which supports interoperation with legacy C/C++ code. Compared to existing frameworks for adaptive computing, FAST supports dynamic adaptation to changes both in the operating environment and in the intent itself, and enables the mixing of procedural control and control based on feedback and optimization.
Submitted 12 July, 2019;
originally announced July 2019.
-
NOOP: A Domain-Theoretic Model of Nominally-Typed OOP
Authors:
Moez AbdelGawad,
Robert Cartwright
Abstract:
The majority of industrial-strength object-oriented (OO) software is written using nominally-typed OO programming languages. Extant domain-theoretic models of OOP developed to analyze OO type systems miss, however, a crucial feature of these mainstream OO languages: nominality. This paper presents the construction of NOOP as the first domain-theoretic model of OOP that includes the full class/type name information found in nominally-typed OOP. Including nominal information in the objects of NOOP, and asserting that type inheritance in statically-typed OO programming languages is an inherently nominal notion, allows readily proving that type inheritance and subtyping are completely identified in these languages. This conclusion is in full agreement with the intuitions of developers and language designers of these OO languages, and contrary to the belief that "inheritance is not subtyping," which came from assuming non-nominal (a.k.a. structural) models of OOP.
To motivate the construction of NOOP, this paper briefly presents the benefits of nominal typing to mainstream OO developers and OO language designers, as compared to structural typing. After presenting NOOP, the paper briefly compares NOOP to the most widely known domain-theoretic models of OOP. Leveraging the development of NOOP, the comparisons presented in this paper provide clear, brief, and precise technical and mathematical accounts of the relation between nominal and structural OO type systems. NOOP thus provides a firmer semantic foundation for analyzing and advancing nominally-typed OO programming languages.
Submitted 21 January, 2018;
originally announced January 2018.
-
Domain Theory: An Introduction
Authors:
Robert Cartwright,
Rebecca Parsons,
Moez AbdelGawad
Abstract:
This monograph is an ongoing revision of "Lectures On A Mathematical Theory of Computation" by Dana Scott. Scott's monograph uses a formulation of domains called neighborhood systems in which finite elements are selected subsets of a master set of objects called "tokens". Since tokens have little intuitive significance, Scott has discarded neighborhood systems in favor of an equivalent formulation of domains called information systems. Unfortunately, he has not rewritten his monograph to reflect this change.
We have rewritten Scott's monograph in terms of finitary bases instead of information systems. A finitary basis is an information system that is closed under least upper bounds on finite consistent subsets. This convention ensures that every finite answer is represented by a single basis object instead of a set of objects.
Submitted 14 June, 2016; v1 submitted 19 May, 2016;
originally announced May 2016.
-
Modeling Basic Aspects of Cyber-Physical Systems, Part II
Authors:
Yingfu Zeng,
Chad Rose,
Paul Brauner,
Walid Taha,
Jawad Masood,
Roland Philippsen,
Marcia O. Malley,
Robert Cartwright
Abstract:
We continue to consider the question of what language features are needed to effectively model cyber-physical systems (CPS). In previous work, we proposed using a core language as a way to study this question, and showed how several basic aspects of CPS can be modeled clearly in a language with a small set of constructs. This paper reports on the result of our analysis of two more complex case studies from the domain of rigid body dynamics. The first one, a quadcopter, illustrates that the previously proposed core language can support larger, more interesting systems than previously shown. The second one, a serial robot, provides a concrete example of why we should add language support for static partial derivatives, namely that it would significantly improve the way models of rigid body dynamics can be expressed.
Submitted 5 August, 2014;
originally announced August 2014.