-
X-Dreamer: Creating High-quality 3D Content by Bridging the Domain Gap Between Text-to-2D and Text-to-3D Generation
Authors:
Yiwei Ma,
Yijun Fan,
Jiayi Ji,
Haowei Wang,
Xiaoshuai Sun,
Guannan Jiang,
Annan Shu,
Rongrong Ji
Abstract:
In recent times, automatic text-to-3D content creation has made significant progress, driven by the development of pretrained 2D diffusion models. Existing text-to-3D methods typically optimize the 3D representation to ensure that the rendered image aligns well with the given text, as evaluated by the pretrained 2D diffusion model. Nevertheless, a substantial domain gap exists between 2D images and 3D assets, primarily attributed to variations in camera-related attributes and the exclusive presence of foreground objects. Consequently, employing 2D diffusion models directly for optimizing 3D representations may lead to suboptimal outcomes. To address this issue, we present X-Dreamer, a novel approach for high-quality text-to-3D content creation that effectively bridges the gap between text-to-2D and text-to-3D synthesis. The key components of X-Dreamer are two innovative designs: Camera-Guided Low-Rank Adaptation (CG-LoRA) and Attention-Mask Alignment (AMA) Loss. CG-LoRA dynamically incorporates camera information into the pretrained diffusion models by employing camera-dependent generation for trainable parameters. This integration enhances the alignment between the generated 3D assets and the camera's perspective. AMA loss guides the attention map of the pretrained diffusion model using the binary mask of the 3D object, prioritizing the creation of the foreground object. This module ensures that the model focuses on generating accurate and detailed foreground objects. Extensive evaluations demonstrate the effectiveness of our proposed method compared to existing text-to-3D approaches. Our project webpage: https://xmu-xiaoma666.github.io/Projects/X-Dreamer/ .
Submitted 30 July, 2024; v1 submitted 30 November, 2023;
originally announced December 2023.
-
Pseudo-label Alignment for Semi-supervised Instance Segmentation
Authors:
Jie Hu,
Chen Chen,
Liujuan Cao,
Shengchuan Zhang,
Annan Shu,
Guannan Jiang,
Rongrong Ji
Abstract:
Pseudo-labeling is significant for semi-supervised instance segmentation, which generates instance masks and classes from unannotated images for subsequent training. However, in existing pipelines, pseudo-labels that contain valuable information may be directly filtered out due to mismatches in class and mask quality. To address this issue, we propose a novel framework, called pseudo-label aligning instance segmentation (PAIS), in this paper. In PAIS, we devise a dynamic aligning loss (DALoss) that adjusts the weights of semi-supervised loss terms with varying class and mask score pairs. Through extensive experiments conducted on the COCO and Cityscapes datasets, we demonstrate that PAIS is a promising framework for semi-supervised instance segmentation, particularly in cases where labeled data is severely limited. Notably, with just 1% labeled data, PAIS achieves 21.2 mAP (based on Mask-RCNN) and 19.9 mAP (based on K-Net) on the COCO dataset, outperforming the current state-of-the-art model, i.e., NoisyBoundary with 7.7 mAP, by a margin of over 12 points. Code is available at: https://github.com/hujiecpp/PAIS .
Submitted 10 August, 2023;
originally announced August 2023.
-
Approximated Prompt Tuning for Vision-Language Pre-trained Models
Authors:
Qiong Wu,
Shubin Huang,
Yiyi Zhou,
Pingyang Dai,
Annan Shu,
Guannan Jiang,
Rongrong Ji
Abstract:
Prompt tuning is a parameter-efficient way to deploy large-scale pre-trained models to downstream tasks by adding task-specific tokens. In terms of vision-language pre-trained (VLP) models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks, which greatly exacerbates the already high computational overhead. In this paper, we revisit the principle of prompt tuning for Transformer-based VLP models, and reveal that the impact of soft prompt tokens can actually be approximated via independent information diffusion steps, thereby avoiding the expensive global attention modeling and reducing the computational complexity to a large extent. Based on this finding, we propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning. To validate APT, we apply it to two representative VLP models, namely ViLT and METER, and conduct extensive experiments on a range of downstream tasks. Meanwhile, the generalization of APT is also validated on CLIP for image classification and StableDiffusion for text-to-image generation. The experimental results not only show the superior performance gains and computation efficiency of APT against the conventional prompt tuning methods, e.g., +7.01% accuracy and -82.30% additional computation overhead on METER, but also confirm its merits over other parameter-efficient transfer learning approaches.
Submitted 21 August, 2023; v1 submitted 27 June, 2023;
originally announced June 2023.
-
Improving Perceptual Quality, Intelligibility, and Acoustics on VoIP Platforms
Authors:
Joseph Konan,
Ojas Bhargave,
Shikhar Agnihotri,
Hojeong Lee,
Ankit Shah,
Shuo Han,
Yunyang Zeng,
Amanda Shu,
Haohui Liu,
Xuankai Chang,
Hamza Khalid,
Minseon Gwak,
Kawon Lee,
Minjeong Kim,
Bhiksha Raj
Abstract:
In this paper, we present a method for fine-tuning models trained on the Deep Noise Suppression (DNS) 2020 Challenge to improve their performance on Voice over Internet Protocol (VoIP) applications. Our approach involves adapting the DNS 2020 models to the specific acoustic characteristics of VoIP communications, which include distortion and artifacts caused by compression, transmission, and platform-specific processing. To this end, we propose a multi-task learning framework for VoIP-DNS that jointly optimizes noise suppression and VoIP-specific acoustics for speech enhancement. We evaluate our approach on diverse VoIP scenarios and show that it outperforms both industry performance and state-of-the-art methods for speech enhancement on VoIP applications. Our results demonstrate the potential of models trained on DNS-2020 to be improved and tailored to different VoIP platforms using VoIP-DNS, whose findings have important applications in areas such as speech recognition, voice assistants, and telecommunication.
Submitted 15 March, 2023;
originally announced March 2023.
-
Cellular Network Speech Enhancement: Removing Background and Transmission Noise
Authors:
Amanda Shu,
Hamza Khalid,
Haohui Liu,
Shikhar Agnihotri,
Joseph Konan,
Ojas Bhargave
Abstract:
The primary objective of speech enhancement is to reduce background noise while preserving the target's speech. A common dilemma occurs when a speaker is confined to a noisy environment and receives a call with high background and transmission noise. To address this problem, the Deep Noise Suppression (DNS) Challenge focuses on removing the background noise with next-generation deep learning models to enhance the target's speech; however, researchers fail to consider Voice over IP (VoIP) applications and their transmission noise. Focusing on Google Meet and its cellular application, our work achieves state-of-the-art performance on the Google Meet To Phone Track of the VoIP DNS Challenge. This paper demonstrates how to beat industrial performance and achieve 1.92 PESQ and 0.88 STOI, as well as superior acoustic fidelity, perceptual quality, and intelligibility in various metrics.
Submitted 21 January, 2023;
originally announced January 2023.
-
Language Without Words: A Pointillist Model for Natural Language Processing
Authors:
Peiyou Song,
Anhei Shu,
David Phipps,
Dan Wallach,
Mohit Tiwari,
Jedidiah Crandall,
George Luger
Abstract:
This paper explores two separate questions: Can we perform natural language processing tasks without a lexicon?; and, Should we? Existing natural language processing techniques are either based on words as units or use units such as grams only for basic classification tasks. How close can a machine come to reasoning about the meanings of words and phrases in a corpus without using any lexicon, based only on grams?
Our own motivation for posing this question is based on our efforts to find popular trends in words and phrases from online Chinese social media. This form of written Chinese uses so many neologisms, creative character placements, and combinations of writing systems that it has been dubbed the "Martian Language." Readers must often use visual cues, audible cues from reading out loud, and their knowledge and understanding of current events to understand a post. For analysis of popular trends, the specific problem is that it is difficult to build a lexicon when the invention of new ways to refer to a word or concept is easy and common. For natural language processing in general, we argue in this paper that new uses of language in social media will challenge machines' abilities to operate with words as the basic unit of understanding, not only in Chinese but potentially in other languages.
Submitted 11 December, 2012;
originally announced December 2012.
-
A Pointillism Approach for Natural Language Processing of Social Media
Authors:
Peiyou Song,
Anhei Shu,
Anyu Zhou,
Dan Wallach,
Jedidiah R. Crandall
Abstract:
The Chinese language poses challenges for natural language processing based on the unit of a word even for formal uses of the Chinese language; social media only makes word segmentation in Chinese even more difficult. In this document we propose a pointillism approach to natural language processing. Rather than words that have individual meanings, the basic unit of a pointillism approach is trigrams of characters. These grams take on meaning in aggregate when they appear together in a way that is correlated over time.
Our results from three kinds of experiments show that when words and topics do have a meme-like trend, they can be reconstructed from only trigrams. For example, for 4-character idioms that appear at least 99 times in one day in our data, the unconstrained precision (that is, precision that allows for deviation from a lexicon when the result is just as correct as the lexicon version of the word or phrase) is 0.93. For longer words and phrases collected from Wiktionary, including neologisms, the unconstrained precision is 0.87. We consider these results to be very promising, because they suggest that it is feasible for a machine to reconstruct complex idioms, phrases, and neologisms with good precision without any notion of words. Thus the colorful and baroque uses of language that typify social media in challenging languages such as Chinese may in fact be accessible to machines.
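As a rough illustration of the basic unit described in this abstract, the sketch below extracts overlapping character trigrams and counts them per day. It is not the authors' implementation: the function names and the `posts_by_day` input layout (a day label mapped to a list of post strings) are assumptions made for this example, and the correlation-over-time step is left out.

```python
from collections import Counter


def char_trigrams(text: str) -> list[str]:
    """Return every overlapping 3-character window in `text`."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def daily_trigram_counts(posts_by_day: dict[str, list[str]]) -> dict[str, Counter]:
    """Count trigram occurrences for each day.

    `posts_by_day` is a hypothetical layout: day label -> list of posts.
    """
    return {
        day: Counter(gram for post in posts for gram in char_trigrams(post))
        for day, posts in posts_by_day.items()
    }
```

Trigrams whose daily counts rise and fall together would then be candidates for belonging to the same word or phrase, which is the aggregate meaning the abstract refers to.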
Submitted 21 June, 2012;
originally announced June 2012.
-
Quire: Lightweight Provenance for Smart Phone Operating Systems
Authors:
Michael Dietz,
Shashi Shekhar,
Yuliy Pisetsky,
Anhei Shu,
Dan S. Wallach
Abstract:
Smartphone apps often run with full privileges to access the network and sensitive local resources, making it difficult for remote systems to have any trust in the provenance of network connections they receive. Even within the phone, different apps with different privileges can communicate with one another, allowing one app to trick another into improperly exercising its privileges (a Confused Deputy attack). In Quire, we engineered two new security mechanisms into Android to address these issues. First, we track the call chain of IPCs, allowing an app the choice of operating with the diminished privileges of its callers or of acting explicitly on its own behalf. Second, a lightweight signature scheme allows any app to create a signed statement that can be verified anywhere inside the phone. Both of these mechanisms are reflected in network RPCs, allowing remote systems visibility into the state of the phone when an RPC is made. We demonstrate the usefulness of Quire with two example applications. We built an advertising service, running distinctly from the app which wants to display ads, which can validate clicks passed to it from its host. We also built a payment service, allowing an app to issue a request which the payment service validates with the user. An app cannot forge a payment request by directly connecting to the remote server, nor can the local payment service tamper with the request.
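The call-chain idea in this abstract can be pictured with a small sketch. It is only a toy model and not Quire's API: the function name, the permission strings, and the choice to model "diminished privileges of its callers" as a set intersection over the chain are all assumptions made for illustration.

```python
def effective_permissions(
    call_chain: list[str],
    granted: dict[str, set[str]],
    act_on_own_behalf: bool = False,
) -> set[str]:
    """Return the permissions the last app in `call_chain` may exercise.

    `call_chain` lists app names from the original caller to the callee.
    `granted` maps each app name to its own permission set (hypothetical data).
    """
    callee = call_chain[-1]
    perms = set(granted[callee])
    if act_on_own_behalf:
        # The callee explicitly takes responsibility and uses its own privileges.
        return perms
    # Otherwise restrict to what every caller in the chain is also allowed to do.
    for app in call_chain[:-1]:
        perms &= granted[app]
    return perms
```

Under this model, a low-privilege app that calls a high-privilege one cannot borrow the extra privileges unless the callee opts in, which is the Confused Deputy case the abstract describes.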
Submitted 11 February, 2011;
originally announced February 2011.