Visual Prompting for One-shot Controllable Video Editing without Inversion

Authors: Zhengbo Zhang, Yuxi Zhou, Duo Peng, Joo-Hwee Lim, Zhigang Tu, De Wen Soh, Lin Geng Foo

Abstract: One-shot controllable video editing (OCVE) is an important yet challenging task, aiming to propagate user edits that are made -- using any image editing tool -- on the first frame of a video to all subsequent frames, while ensuring content consistency between edited frames and source frames. To achieve this, prior methods employ DDIM inversion to transform source frames into latent noise, which is then fed into a pre-trained diffusion model, conditioned on the user-edited first frame, to generate the edited video. However, the DDIM inversion process accumulates errors, which hinder the latent noise from accurately reconstructing the source frames, ultimately compromising content consistency in the generated edited frames. To overcome it, our method eliminates the need for DDIM inversion by performing OCVE through a novel perspective based on visual prompting. Furthermore, inspired by consistency models that can perform multi-step consistency sampling to generate a sequence of content-consistent images, we propose a content consistency sampling (CCS) to ensure content consistency between the generated edited frames and the source frames. Moreover, we introduce a temporal-content consistency sampling (TCS) based on Stein Variational Gradient Descent to ensure temporal consistency across the edited frames. Extensive experiments validate the effectiveness of our approach.

Submitted 19 April, 2025; originally announced April 2025.

Comments: Accepted by CVPR 2025

arXiv:2504.00640

POPEN: Preference-Based Optimization and Ensemble for LVLM-Based Reasoning Segmentation

Authors: Lanyun Zhu, Tianrun Chen, Qianxiong Xu, Xuanyi Liu, Deyi Ji, Haiyang Wu, De Wen Soh, Jun Liu

Abstract: Existing LVLM-based reasoning segmentation methods often suffer from imprecise segmentation results and hallucinations in their text responses. This paper introduces POPEN, a novel framework designed to address these issues and achieve improved results. POPEN includes a preference-based optimization method to finetune the LVLM, aligning it more closely with human preferences and thereby generating better text responses and segmentation results. Additionally, POPEN introduces a preference-based ensemble method for inference, which integrates multiple outputs from the LVLM using a preference-score-based attention mechanism for refinement. To better adapt to the segmentation task, we incorporate several task-specific designs in our POPEN framework, including a new approach for collecting segmentation preference data with a curriculum learning mechanism, and a novel preference optimization loss to refine the segmentation capability of the LVLM. Experiments demonstrate that our method achieves state-of-the-art performance in reasoning segmentation, exhibiting minimal hallucination in text responses and the highest segmentation accuracy compared to previous advanced methods like LISA and PixelLM. Project page is https://lanyunzhu.site/POPEN/

Submitted 1 April, 2025; originally announced April 2025.

Comments: CVPR2025

arXiv:2502.02358

MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm

Authors: Ziyan Guo, Zeyu Hu, Na Zhao, De Wen Soh

Abstract: Human motion generation and editing are key components of computer graphics and vision. However, current approaches in this field tend to offer isolated solutions tailored to specific tasks, which can be inefficient and impractical for real-world applications. While some efforts have aimed to unify motion-related tasks, these methods simply use different modalities as conditions to guide motion generation. Consequently, they lack editing capabilities, fine-grained control, and fail to facilitate knowledge sharing across tasks. To address these limitations and provide a versatile, unified framework capable of handling both human motion generation and editing, we introduce a novel paradigm: Motion-Condition-Motion, which enables the unified formulation of diverse tasks with three concepts: source motion, condition, and target motion. Based on this paradigm, we propose a unified framework, MotionLab, which incorporates rectified flows to learn the mapping from source motion to target motion, guided by the specified conditions. In MotionLab, we introduce the 1) MotionFlow Transformer to enhance conditional generation and editing without task-specific modules; 2) Aligned Rotational Position Encoding to guarantee the time synchronization between source motion and target motion; 3) Task Specified Instruction Modulation; and 4) Motion Curriculum Learning for effective multi-task learning and knowledge sharing across tasks. Notably, our MotionLab demonstrates promising generalization capabilities and inference efficiency across multiple benchmarks for human motion. Our code and additional video results are available at: https://diouo.github.io/motionlab.github.io/.

Submitted 12 March, 2025; v1 submitted 4 February, 2025; originally announced February 2025.

arXiv:2410.01535

GaussianBlock: Building Part-Aware Compositional and Editable 3D Scene by Primitives and Gaussians

Authors: Shuyi Jiang, Qihao Zhao, Hossein Rahmani, De Wen Soh, Jun Liu, Na Zhao

Abstract: Recently, with the development of Neural Radiance Fields and Gaussian Splatting, 3D reconstruction techniques have achieved remarkably high fidelity. However, the latent representations learnt by these methods are highly entangled and lack interpretability. In this paper, we propose a novel part-aware compositional reconstruction method, called GaussianBlock, that enables semantically coherent and disentangled representations, allowing for precise and physical editing akin to building blocks, while simultaneously maintaining high fidelity. Our GaussianBlock introduces a hybrid representation that leverages the advantages of both primitives, known for their flexible actionability and editability, and 3D Gaussians, which excel in reconstruction quality. Specifically, we achieve semantically coherent primitives through a novel attention-guided centering loss derived from 2D semantic priors, complemented by a dynamic splitting and fusion strategy. Furthermore, we utilize 3D Gaussians that hybridize with primitives to refine structural details and enhance fidelity. Additionally, a binding inheritance strategy is employed to strengthen and maintain the connection between the two. Our reconstructed scenes are evidenced to be disentangled, compositional, and compact across diverse benchmarks, enabling seamless, direct and precise editing while maintaining high quality.

Submitted 24 April, 2025; v1 submitted 2 October, 2024; originally announced October 2024.

arXiv:2401.00194

On the Identifiability from Modulo Measurements under DFT Sensing Matrix

Authors: Qi Zhang, Jiang Zhu, Fengzhong Qu, Zheng Zhu, De Wen Soh

Abstract: Modulo sampling (MS) has been recently introduced to enhance the dynamic range of conventional ADCs by applying a modulo operator before sampling. This paper examines the identifiability of a measurement model where measurements are taken using a discrete Fourier transform (DFT) sensing matrix, followed by a modulo operator (modulo-DFT). Firstly, we derive a necessary and sufficient condition for the unique identification of the modulo-DFT sensing model based on the number of measurements and the indices of zero elements in the original signal. Then, we conduct a deeper analysis of three specific cases: when the number of measurements is a power of $2$, a prime number, and twice a prime number. Additionally, we investigate the identifiability of periodic bandlimited (PBL) signals under MS, which can be considered as the modulo-DFT sensing model with additional symmetric and conjugate constraints on the original signal. We also provide a necessary and sufficient condition based solely on the number of samples in one period for the unique identification of the PBL signal under MS, though with an ambiguity in the direct current (DC) component. Furthermore, we show that when the oversampling factor exceeds $3(1+1/P)$, the PBL signal can be uniquely identified with an ambiguity in the DC component, where $P$ is the number of harmonics, including the fundamental component, in the positive frequency part. Finally, we also present a recovery algorithm that estimates the original signal by solving integer linear equations, and we conduct simulations to validate our conclusions.

Submitted 6 August, 2024; v1 submitted 30 December, 2023; originally announced January 2024.

arXiv:2309.04901

One-Bit-Aided Modulo Sampling for DOA Estimation

Authors: Qi Zhang, Jiang Zhu, Fengzhong Qu, De Wen Soh

Abstract: Modulo sampling has recently drawn a great deal of attention for cutting-edge applications, due to overcoming the barrier of information loss through sensor saturation and clipping. This is a significant problem, especially when the range of signal amplitudes is unknown or in the near-far case. To overcome this fundamental bottleneck, we propose a one-bit-aided (1bit-aided) modulo sampling scheme for direction-of-arrival (DOA) estimation. On the one hand, one-bit quantization involving a simple comparator offers the advantages of low-cost and low-complexity implementation. On the other hand, one-bit quantization provides an estimate of the normalized covariance matrix of the unquantized measurements via the arcsin law. The estimate of the normalized covariance matrix is used to implement a blind integer-forcing (BIF) decoder that unwraps the modulo samples to construct the covariance matrix, and subspace methods can then be used to perform the DOA estimation. Our approach, named 1bit-aided-BIF, addresses the near-far problem well and overcomes the intrinsic low dynamic range of one-bit quantization. Numerical experiments validate the excellent performance of the proposed algorithm.

Submitted 30 December, 2023; v1 submitted 9 September, 2023; originally announced September 2023.

arXiv:2305.11791

Enhancing Few-shot NER with Prompt Ordering based Data Augmentation

Authors: Huiming Wang, Liying Cheng, Wenxuan Zhang, De Wen Soh, Lidong Bing

Abstract: Recently, data augmentation (DA) methods have been proven to be effective for pre-trained language models (PLMs) in low-resource settings, including few-shot named entity recognition (NER). However, conventional NER DA methods are mostly aimed at sequence labeling models, i.e., token-level classification, and few are compatible with unified autoregressive generation frameworks, which can handle a wider range of NER tasks, such as nested NER. Furthermore, these generation frameworks have a strong assumption that the entities will appear in the target sequence with the same left-to-right order as the source sequence. In this paper, we claim that there is no need to keep this strict order, and more diversified but reasonable target entity sequences can be provided during the training stage as a novel DA method. Nevertheless, a naive mixture of augmented data can confuse the model since one source sequence will then be paired with different target sequences. Therefore, we propose a simple but effective Prompt Ordering based Data Augmentation (PODA) method to improve the training of unified autoregressive generation frameworks under few-shot NER scenarios. Experimental results on three public NER datasets and further analyses demonstrate the effectiveness of our approach.

Submitted 19 May, 2023; originally announced May 2023.

Comments: 7 pages, 2 figures

Showing 1–7 of 7 results for author: Soh, D W