-
The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer
Authors:
Weixian Lei,
Jiacong Wang,
Haochen Wang,
Xiangtai Li,
Jun Hao Liew,
Jiashi Feng,
Zilong Huang
Abstract:
This paper introduces SAIL, a single transformer unified multimodal large language model (MLLM) that integrates raw pixel encoding and language decoding within a singular architecture. Unlike existing modular MLLMs, which rely on a pre-trained vision transformer (ViT), SAIL eliminates the need for a separate vision encoder, presenting a more minimalist architecture design. Instead of introducing n…
▽ More
This paper introduces SAIL, a single transformer unified multimodal large language model (MLLM) that integrates raw pixel encoding and language decoding within a singular architecture. Unlike existing modular MLLMs, which rely on a pre-trained vision transformer (ViT), SAIL eliminates the need for a separate vision encoder, presenting a more minimalist architecture design. Instead of introducing novel architectural components, SAIL adapts mix-attention mechanisms and multimodal positional encodings to better align with the distinct characteristics of visual and textual modalities. We systematically compare SAIL's properties-including scalability, cross-modal information flow patterns, and visual representation capabilities-with those of modular MLLMs. By scaling both training data and model size, SAIL achieves performance comparable to modular MLLMs. Notably, the removal of pretrained ViT components enhances SAIL's scalability and results in significantly different cross-modal information flow patterns. Moreover, SAIL demonstrates strong visual representation capabilities, achieving results on par with ViT-22B in vision tasks such as semantic segmentation. Code and models are available at https://github.com/bytedance/SAIL.
△ Less
Submitted 14 April, 2025;
originally announced April 2025.
-
GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation
Authors:
Tianwei Xiong,
Jun Hao Liew,
Zilong Huang,
Jiashi Feng,
Xihui Liu
Abstract:
In autoregressive (AR) image generation, visual tokenizers compress images into compact discrete latent tokens, enabling efficient training of downstream autoregressive models for visual generation via next-token prediction. While scaling visual tokenizers improves image reconstruction quality, it often degrades downstream generation quality -- a challenge not adequately addressed in existing lite…
▽ More
In autoregressive (AR) image generation, visual tokenizers compress images into compact discrete latent tokens, enabling efficient training of downstream autoregressive models for visual generation via next-token prediction. While scaling visual tokenizers improves image reconstruction quality, it often degrades downstream generation quality -- a challenge not adequately addressed in existing literature. To address this, we introduce GigaTok, the first approach to simultaneously improve image reconstruction, generation, and representation learning when scaling visual tokenizers. We identify the growing complexity of latent space as the key factor behind the reconstruction vs. generation dilemma. To mitigate this, we propose semantic regularization, which aligns tokenizer features with semantically consistent features from a pre-trained visual encoder. This constraint prevents excessive latent space complexity during scaling, yielding consistent improvements in both reconstruction and downstream autoregressive generation. Building on semantic regularization, we explore three key practices for scaling tokenizers:(1) using 1D tokenizers for better scalability, (2) prioritizing decoder scaling when expanding both encoder and decoder, and (3) employing entropy loss to stabilize training for billion-scale tokenizers. By scaling to $\bf{3 \space billion}$ parameters, GigaTok achieves state-of-the-art performance in reconstruction, downstream AR generation, and downstream AR representation quality.
△ Less
Submitted 11 April, 2025;
originally announced April 2025.
-
MagicArticulate: Make Your 3D Models Articulation-Ready
Authors:
Chaoyue Song,
Jianfeng Zhang,
Xiu Li,
Fan Yang,
Yiwen Chen,
Zhongcong Xu,
Jun Hao Liew,
Xiaoyang Guo,
Fayao Liu,
Jiashi Feng,
Guosheng Lin
Abstract:
With the explosive growth of 3D content creation, there is an increasing demand for automatically converting static 3D models into articulation-ready versions that support realistic animation. Traditional approaches rely heavily on manual annotation, which is both time-consuming and labor-intensive. Moreover, the lack of large-scale benchmarks has hindered the development of learning-based solutio…
▽ More
With the explosive growth of 3D content creation, there is an increasing demand for automatically converting static 3D models into articulation-ready versions that support realistic animation. Traditional approaches rely heavily on manual annotation, which is both time-consuming and labor-intensive. Moreover, the lack of large-scale benchmarks has hindered the development of learning-based solutions. In this work, we present MagicArticulate, an effective framework that automatically transforms static 3D models into articulation-ready assets. Our key contributions are threefold. First, we introduce Articulation-XL, a large-scale benchmark containing over 33k 3D models with high-quality articulation annotations, carefully curated from Objaverse-XL. Second, we propose a novel skeleton generation method that formulates the task as a sequence modeling problem, leveraging an auto-regressive transformer to naturally handle varying numbers of bones or joints within skeletons and their inherent dependencies across different 3D models. Third, we predict skinning weights using a functional diffusion process that incorporates volumetric geodesic distance priors between vertices and joints. Extensive experiments demonstrate that MagicArticulate significantly outperforms existing methods across diverse object categories, achieving high-quality articulation that enables realistic animation. Project page: https://chaoyuesong.github.io/MagicArticulate.
△ Less
Submitted 18 February, 2025; v1 submitted 17 February, 2025;
originally announced February 2025.
-
High Quality Human Image Animation using Regional Supervision and Motion Blur Condition
Authors:
Zhongcong Xu,
Chaoyue Song,
Guoxian Song,
Jianfeng Zhang,
Jun Hao Liew,
Hongyi Xu,
You Xie,
Linjie Luo,
Guosheng Lin,
Jiashi Feng,
Mike Zheng Shou
Abstract:
Recent advances in video diffusion models have enabled realistic and controllable human image animation with temporal coherence. Although generating reasonable results, existing methods often overlook the need for regional supervision in crucial areas such as the face and hands, and neglect the explicit modeling for motion blur, leading to unrealistic low-quality synthesis. To address these limita…
▽ More
Recent advances in video diffusion models have enabled realistic and controllable human image animation with temporal coherence. Although generating reasonable results, existing methods often overlook the need for regional supervision in crucial areas such as the face and hands, and neglect the explicit modeling for motion blur, leading to unrealistic low-quality synthesis. To address these limitations, we first leverage regional supervision for detailed regions to enhance face and hand faithfulness. Second, we model the motion blur explicitly to further improve the appearance quality. Third, we explore novel training strategies for high-resolution human animation to improve the overall fidelity. Experimental results demonstrate that our proposed method outperforms state-of-the-art approaches, achieving significant improvements upon the strongest baseline by more than 21.0% and 57.4% in terms of reconstruction precision (L1) and perceptual quality (FVD) on HumanDance dataset. Code and model will be made available.
△ Less
Submitted 29 September, 2024;
originally announced September 2024.
-
Empowering Visual Creativity: A Vision-Language Assistant to Image Editing Recommendations
Authors:
Tiancheng Shen,
Jun Hao Liew,
Long Mai,
Lu Qi,
Jiashi Feng,
Jiaya Jia
Abstract:
Advances in text-based image generation and editing have revolutionized content creation, enabling users to create impressive content from imaginative text prompts. However, existing methods are not designed to work well with the oversimplified prompts that are often encountered in typical scenarios when users start their editing with only vague or abstract purposes in mind. Those scenarios demand…
▽ More
Advances in text-based image generation and editing have revolutionized content creation, enabling users to create impressive content from imaginative text prompts. However, existing methods are not designed to work well with the oversimplified prompts that are often encountered in typical scenarios when users start their editing with only vague or abstract purposes in mind. Those scenarios demand elaborate ideation efforts from the users to bridge the gap between such vague starting points and the detailed creative ideas needed to depict the desired results. In this paper, we introduce the task of Image Editing Recommendation (IER). This task aims to automatically generate diverse creative editing instructions from an input image and a simple prompt representing the users' under-specified editing purpose. To this end, we introduce Creativity-Vision Language Assistant~(Creativity-VLA), a multimodal framework designed specifically for edit-instruction generation. We train Creativity-VLA on our edit-instruction dataset specifically curated for IER. We further enhance our model with a novel 'token-for-localization' mechanism, enabling it to support both global and local editing operations. Our experimental results demonstrate the effectiveness of \ours{} in suggesting instructions that not only contain engaging creative elements but also maintain high relevance to both the input image and the user's initial hint.
△ Less
Submitted 31 May, 2024;
originally announced June 2024.
-
DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention
Authors:
Lianghui Zhu,
Zilong Huang,
Bencheng Liao,
Jun Hao Liew,
Hanshu Yan,
Jiashi Feng,
Xinggang Wang
Abstract:
Diffusion models with large-scale pre-training have achieved significant success in the field of visual content generation, particularly exemplified by Diffusion Transformers (DiT). However, DiT models have faced challenges with quadratic complexity efficiency, especially when handling long sequences. In this paper, we aim to incorporate the sub-quadratic modeling capability of Gated Linear Attent…
▽ More
Diffusion models with large-scale pre-training have achieved significant success in the field of visual content generation, particularly exemplified by Diffusion Transformers (DiT). However, DiT models have faced challenges with quadratic complexity efficiency, especially when handling long sequences. In this paper, we aim to incorporate the sub-quadratic modeling capability of Gated Linear Attention (GLA) into the 2D diffusion backbone. Specifically, we introduce Diffusion Gated Linear Attention Transformers (DiG), a simple, adoptable solution with minimal parameter overhead. We offer two variants, i,e, a plain and U-shape architecture, showing superior efficiency and competitive effectiveness. In addition to superior performance to DiT and other sub-quadratic-time diffusion models at $256 \times 256$ resolution, DiG demonstrates greater efficiency than these methods starting from a $512$ resolution. Specifically, DiG-S/2 is $2.5\times$ faster and saves $75.7\%$ GPU memory compared to DiT-S/2 at a $1792$ resolution. Additionally, DiG-XL/2 is $4.2\times$ faster than the Mamba-based model at a $1024$ resolution and $1.8\times$ faster than DiT with FlashAttention-2 at a $2048$ resolution. We will release the code soon. Code is released at https://github.com/hustvl/DiG.
△ Less
Submitted 26 November, 2024; v1 submitted 28 May, 2024;
originally announced May 2024.
-
ClassDiffusion: More Aligned Personalization Tuning with Explicit Class Guidance
Authors:
Jiannan Huang,
Jun Hao Liew,
Hanshu Yan,
Yuyang Yin,
Yao Zhao,
Humphrey Shi,
Yunchao Wei
Abstract:
Recent text-to-image customization works have proven successful in generating images of given concepts by fine-tuning diffusion models on a few examples. However, tuning-based methods inherently tend to overfit the concepts, resulting in failure to create the concept under multiple conditions (*e.g.*, headphone is missing when generating "a `dog wearing a headphone"). Interestingly, we notice that…
▽ More
Recent text-to-image customization works have proven successful in generating images of given concepts by fine-tuning diffusion models on a few examples. However, tuning-based methods inherently tend to overfit the concepts, resulting in failure to create the concept under multiple conditions (*e.g.*, headphone is missing when generating "a `dog wearing a headphone"). Interestingly, we notice that the base model before fine-tuning exhibits the capability to compose the base concept with other elements (*e.g.*, "a dog wearing a headphone"), implying that the compositional ability only disappears after personalization tuning. We observe a semantic shift in the customized concept after fine-tuning, indicating that the personalized concept is not aligned with the original concept, and further show through theoretical analyses that this semantic shift leads to increased difficulty in sampling the joint conditional probability distribution, resulting in the loss of the compositional ability. Inspired by this finding, we present **ClassDiffusion**, a technique that leverages a **semantic preservation loss** to explicitly regulate the concept space when learning a new concept. Although simple, this approach effectively prevents semantic drift during the fine-tuning process of the target concepts. Extensive qualitative and quantitative experiments demonstrate that the use of semantic preservation loss effectively improves the compositional abilities of fine-tuning models. Lastly, we also extend our ClassDiffusion to personalized video generation, demonstrating its flexibility.
△ Less
Submitted 13 March, 2025; v1 submitted 27 May, 2024;
originally announced May 2024.
-
LightningDrag: Lightning Fast and Accurate Drag-based Image Editing Emerging from Videos
Authors:
Yujun Shi,
Jun Hao Liew,
Hanshu Yan,
Vincent Y. F. Tan,
Jiashi Feng
Abstract:
Accuracy and speed are critical in image editing tasks. Pan et al. introduced a drag-based image editing framework that achieves pixel-level control using Generative Adversarial Networks (GANs). A flurry of subsequent studies enhanced this framework's generality by leveraging large-scale diffusion models. However, these methods often suffer from inordinately long processing times (exceeding 1 minu…
▽ More
Accuracy and speed are critical in image editing tasks. Pan et al. introduced a drag-based image editing framework that achieves pixel-level control using Generative Adversarial Networks (GANs). A flurry of subsequent studies enhanced this framework's generality by leveraging large-scale diffusion models. However, these methods often suffer from inordinately long processing times (exceeding 1 minute per edit) and low success rates. Addressing these issues head on, we present LightningDrag, a rapid approach enabling high quality drag-based image editing in ~1 second. Unlike most previous methods, we redefine drag-based editing as a conditional generation task, eliminating the need for time-consuming latent optimization or gradient-based guidance during inference. In addition, the design of our pipeline allows us to train our model on large-scale paired video frames, which contain rich motion information such as object translations, changing poses and orientations, zooming in and out, etc. By learning from videos, our approach can significantly outperform previous methods in terms of accuracy and consistency. Despite being trained solely on videos, our model generalizes well to perform local shape deformations not presented in the training data (e.g., lengthening of hair, twisting rainbows, etc.). Extensive qualitative and quantitative evaluations on benchmark datasets corroborate the superiority of our approach. The code and model will be released at https://github.com/magic-research/LightningDrag.
△ Less
Submitted 16 September, 2024; v1 submitted 22 May, 2024;
originally announced May 2024.
-
PeRFlow: Piecewise Rectified Flow as Universal Plug-and-Play Accelerator
Authors:
Hanshu Yan,
Xingchao Liu,
Jiachun Pan,
Jun Hao Liew,
Qiang Liu,
Jiashi Feng
Abstract:
We present Piecewise Rectified Flow (PeRFlow), a flow-based method for accelerating diffusion models. PeRFlow divides the sampling process of generative flows into several time windows and straightens the trajectories in each interval via the reflow operation, thereby approaching piecewise linear flows. PeRFlow achieves superior performance in a few-step generation. Moreover, through dedicated par…
▽ More
We present Piecewise Rectified Flow (PeRFlow), a flow-based method for accelerating diffusion models. PeRFlow divides the sampling process of generative flows into several time windows and straightens the trajectories in each interval via the reflow operation, thereby approaching piecewise linear flows. PeRFlow achieves superior performance in a few-step generation. Moreover, through dedicated parameterizations, the PeRFlow models inherit knowledge from the pretrained diffusion models. Thus, the training converges fast and the obtained models show advantageous transfer ability, serving as universal plug-and-play accelerators that are compatible with various workflows based on the pre-trained diffusion models. Codes for training and inference are publicly released. https://github.com/magic-research/piecewise-rectified-flow
△ Less
Submitted 2 September, 2024; v1 submitted 13 May, 2024;
originally announced May 2024.
-
MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation
Authors:
Weimin Wang,
Jiawei Liu,
Zhijie Lin,
Jiangqiao Yan,
Shuo Chen,
Chetwin Low,
Tuyen Hoang,
Jie Wu,
Jun Hao Liew,
Hanshu Yan,
Daquan Zhou,
Jiashi Feng
Abstract:
The growing demand for high-fidelity video generation from textual descriptions has catalyzed significant research in this field. In this work, we introduce MagicVideo-V2 that integrates the text-to-image model, video motion generator, reference image embedding module and frame interpolation module into an end-to-end video generation pipeline. Benefiting from these architecture designs, MagicVideo…
▽ More
The growing demand for high-fidelity video generation from textual descriptions has catalyzed significant research in this field. In this work, we introduce MagicVideo-V2 that integrates the text-to-image model, video motion generator, reference image embedding module and frame interpolation module into an end-to-end video generation pipeline. Benefiting from these architecture designs, MagicVideo-V2 can generate an aesthetically pleasing, high-resolution video with remarkable fidelity and smoothness. It demonstrates superior performance over leading Text-to-Video systems such as Runway, Pika 1.0, Morph, Moon Valley and Stable Video Diffusion model via user evaluation at large scale.
△ Less
Submitted 9 January, 2024;
originally announced January 2024.
-
SegRefiner: Towards Model-Agnostic Segmentation Refinement with Discrete Diffusion Process
Authors:
Mengyu Wang,
Henghui Ding,
Jun Hao Liew,
Jiajun Liu,
Yao Zhao,
Yunchao Wei
Abstract:
In this paper, we explore a principal way to enhance the quality of object masks produced by different segmentation models. We propose a model-agnostic solution called SegRefiner, which offers a novel perspective on this problem by interpreting segmentation refinement as a data generation process. As a result, the refinement process can be smoothly implemented through a series of denoising diffusi…
▽ More
In this paper, we explore a principal way to enhance the quality of object masks produced by different segmentation models. We propose a model-agnostic solution called SegRefiner, which offers a novel perspective on this problem by interpreting segmentation refinement as a data generation process. As a result, the refinement process can be smoothly implemented through a series of denoising diffusion steps. Specifically, SegRefiner takes coarse masks as inputs and refines them using a discrete diffusion process. By predicting the label and corresponding states-transition probabilities for each pixel, SegRefiner progressively refines the noisy masks in a conditional denoising manner. To assess the effectiveness of SegRefiner, we conduct comprehensive experiments on various segmentation tasks, including semantic segmentation, instance segmentation, and dichotomous image segmentation. The results demonstrate the superiority of our SegRefiner from multiple aspects. Firstly, it consistently improves both the segmentation metrics and boundary metrics across different types of coarse masks. Secondly, it outperforms previous model-agnostic refinement methods by a significant margin. Lastly, it exhibits a strong capability to capture extremely fine details when refining high-resolution images. The source code and trained models are available at https://github.com/MengyuWang826/SegRefiner.
△ Less
Submitted 19 December, 2023;
originally announced December 2023.
-
Towards Accurate Guided Diffusion Sampling through Symplectic Adjoint Method
Authors:
Jiachun Pan,
Hanshu Yan,
Jun Hao Liew,
Jiashi Feng,
Vincent Y. F. Tan
Abstract:
Training-free guided sampling in diffusion models leverages off-the-shelf pre-trained networks, such as an aesthetic evaluation model, to guide the generation process. Current training-free guided sampling algorithms obtain the guidance energy function based on a one-step estimate of the clean image. However, since the off-the-shelf pre-trained networks are trained on clean images, the one-step es…
▽ More
Training-free guided sampling in diffusion models leverages off-the-shelf pre-trained networks, such as an aesthetic evaluation model, to guide the generation process. Current training-free guided sampling algorithms obtain the guidance energy function based on a one-step estimate of the clean image. However, since the off-the-shelf pre-trained networks are trained on clean images, the one-step estimation procedure of the clean image may be inaccurate, especially in the early stages of the generation process in diffusion models. This causes the guidance in the early time steps to be inaccurate. To overcome this problem, we propose Symplectic Adjoint Guidance (SAG), which calculates the gradient guidance in two inner stages. Firstly, SAG estimates the clean image via $n$ function calls, where $n$ serves as a flexible hyperparameter that can be tailored to meet specific image quality requirements. Secondly, SAG uses the symplectic adjoint method to obtain the gradients accurately and efficiently in terms of the memory requirements. Extensive experiments demonstrate that SAG generates images with higher qualities compared to the baselines in both guided image and video generation tasks.
△ Less
Submitted 19 December, 2023;
originally announced December 2023.
-
AvatarStudio: High-fidelity and Animatable 3D Avatar Creation from Text
Authors:
Jianfeng Zhang,
Xuanmeng Zhang,
Huichao Zhang,
Jun Hao Liew,
Chenxu Zhang,
Yi Yang,
Jiashi Feng
Abstract:
We study the problem of creating high-fidelity and animatable 3D avatars from only textual descriptions. Existing text-to-avatar methods are either limited to static avatars which cannot be animated or struggle to generate animatable avatars with promising quality and precise pose control. To address these limitations, we propose AvatarStudio, a coarse-to-fine generative model that generates expli…
▽ More
We study the problem of creating high-fidelity and animatable 3D avatars from only textual descriptions. Existing text-to-avatar methods are either limited to static avatars which cannot be animated or struggle to generate animatable avatars with promising quality and precise pose control. To address these limitations, we propose AvatarStudio, a coarse-to-fine generative model that generates explicit textured 3D meshes for animatable human avatars. Specifically, AvatarStudio begins with a low-resolution NeRF-based representation for coarse generation, followed by incorporating SMPL-guided articulation into the explicit mesh representation to support avatar animation and high resolution rendering. To ensure view consistency and pose controllability of the resulting avatars, we introduce a 2D diffusion model conditioned on DensePose for Score Distillation Sampling supervision. By effectively leveraging the synergy between the articulated mesh representation and the DensePose-conditional diffusion model, AvatarStudio can create high-quality avatars from text that are ready for animation, significantly outperforming previous methods. Moreover, it is competent for many applications, e.g., multimodal avatar animations and style-guided avatar creation. For more results, please refer to our project page: http://jeff95.me/projects/avatarstudio.html
△ Less
Submitted 29 November, 2023;
originally announced November 2023.
-
MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model
Authors:
Zhongcong Xu,
Jianfeng Zhang,
Jun Hao Liew,
Hanshu Yan,
Jia-Wei Liu,
Chenxu Zhang,
Jiashi Feng,
Mike Zheng Shou
Abstract:
This paper studies the human image animation task, which aims to generate a video of a certain reference identity following a particular motion sequence. Existing animation works typically employ the frame-warping technique to animate the reference image towards the target motion. Despite achieving reasonable results, these approaches face challenges in maintaining temporal consistency throughout…
▽ More
This paper studies the human image animation task, which aims to generate a video of a certain reference identity following a particular motion sequence. Existing animation works typically employ the frame-warping technique to animate the reference image towards the target motion. Despite achieving reasonable results, these approaches face challenges in maintaining temporal consistency throughout the animation due to the lack of temporal modeling and poor preservation of reference identity. In this work, we introduce MagicAnimate, a diffusion-based framework that aims at enhancing temporal consistency, preserving reference image faithfully, and improving animation fidelity. To achieve this, we first develop a video diffusion model to encode temporal information. Second, to maintain the appearance coherence across frames, we introduce a novel appearance encoder to retain the intricate details of the reference image. Leveraging these two innovations, we further employ a simple video fusion technique to encourage smooth transitions for long video animation. Empirical results demonstrate the superiority of our method over baseline approaches on two benchmarks. Notably, our approach outperforms the strongest baseline by over 38% in terms of video fidelity on the challenging TikTok dancing dataset. Code and model will be made available.
△ Less
Submitted 27 November, 2023;
originally announced November 2023.
-
XAGen: 3D Expressive Human Avatars Generation
Authors:
Zhongcong Xu,
Jianfeng Zhang,
Jun Hao Liew,
Jiashi Feng,
Mike Zheng Shou
Abstract:
Recent advances in 3D-aware GAN models have enabled the generation of realistic and controllable human body images. However, existing methods focus on the control of major body joints, neglecting the manipulation of expressive attributes, such as facial expressions, jaw poses, hand poses, and so on. In this work, we present XAGen, the first 3D generative model for human avatars capable of expressi…
▽ More
Recent advances in 3D-aware GAN models have enabled the generation of realistic and controllable human body images. However, existing methods focus on the control of major body joints, neglecting the manipulation of expressive attributes, such as facial expressions, jaw poses, hand poses, and so on. In this work, we present XAGen, the first 3D generative model for human avatars capable of expressive control over body, face, and hands. To enhance the fidelity of small-scale regions like face and hands, we devise a multi-scale and multi-part 3D representation that models fine details. Based on this representation, we propose a multi-part rendering technique that disentangles the synthesis of body, face, and hands to ease model training and enhance geometric quality. Furthermore, we design multi-part discriminators that evaluate the quality of the generated avatars with respect to their appearance and fine-grained control capabilities. Experiments show that XAGen surpasses state-of-the-art methods in terms of realism, diversity, and expressive control abilities. Code and data will be made available at https://showlab.github.io/xagen.
△ Less
Submitted 22 November, 2023;
originally announced November 2023.
-
MagicProp: Diffusion-based Video Editing via Motion-aware Appearance Propagation
Authors:
Hanshu Yan,
Jun Hao Liew,
Long Mai,
Shanchuan Lin,
Jiashi Feng
Abstract:
This paper addresses the issue of modifying the visual appearance of videos while preserving their motion. A novel framework, named MagicProp, is proposed, which disentangles the video editing process into two stages: appearance editing and motion-aware appearance propagation. In the first stage, MagicProp selects a single frame from the input video and applies image-editing techniques to modify t…
▽ More
This paper addresses the issue of modifying the visual appearance of videos while preserving their motion. A novel framework, named MagicProp, is proposed, which disentangles the video editing process into two stages: appearance editing and motion-aware appearance propagation. In the first stage, MagicProp selects a single frame from the input video and applies image-editing techniques to modify the content and/or style of the frame. The flexibility of these techniques enables the editing of arbitrary regions within the frame. In the second stage, MagicProp employs the edited frame as an appearance reference and generates the remaining frames using an autoregressive rendering approach. To achieve this, a diffusion-based conditional generation model, called PropDPM, is developed, which synthesizes the target frame by conditioning on the reference appearance, the target motion, and its previous appearance. The autoregressive editing approach ensures temporal consistency in the resulting videos. Overall, MagicProp combines the flexibility of image-editing techniques with the superior temporal consistency of autoregressive modeling, enabling flexible editing of object types and aesthetic styles in arbitrary regions of input videos while maintaining good temporal consistency across frames. Extensive experiments in various video editing scenarios demonstrate the effectiveness of MagicProp.
△ Less
Submitted 2 September, 2023;
originally announced September 2023.
-
MagicEdit: High-Fidelity and Temporally Coherent Video Editing
Authors:
Jun Hao Liew,
Hanshu Yan,
Jianfeng Zhang,
Zhongcong Xu,
Jiashi Feng
Abstract:
In this report, we present MagicEdit, a surprisingly simple yet effective solution to the text-guided video editing task. We found that high-fidelity and temporally coherent video-to-video translation can be achieved by explicitly disentangling the learning of content, structure and motion signals during training. This is in contradict to most existing methods which attempt to jointly model both t…
▽ More
In this report, we present MagicEdit, a surprisingly simple yet effective solution to the text-guided video editing task. We found that high-fidelity and temporally coherent video-to-video translation can be achieved by explicitly disentangling the learning of content, structure and motion signals during training. This is in contradict to most existing methods which attempt to jointly model both the appearance and temporal representation within a single framework, which we argue, would lead to degradation in per-frame quality. Despite its simplicity, we show that MagicEdit supports various downstream video editing tasks, including video stylization, local editing, video-MagicMix and video outpainting.
△ Less
Submitted 28 August, 2023;
originally announced August 2023.
-
MagicAvatar: Multimodal Avatar Generation and Animation
Authors:
Jianfeng Zhang,
Hanshu Yan,
Zhongcong Xu,
Jiashi Feng,
Jun Hao Liew
Abstract:
This report presents MagicAvatar, a framework for multimodal video generation and animation of human avatars. Unlike most existing methods that generate avatar-centric videos directly from multimodal inputs (e.g., text prompts), MagicAvatar explicitly disentangles avatar video generation into two stages: (1) multimodal-to-motion and (2) motion-to-video generation. The first stage translates the mu…
▽ More
This report presents MagicAvatar, a framework for multimodal video generation and animation of human avatars. Unlike most existing methods that generate avatar-centric videos directly from multimodal inputs (e.g., text prompts), MagicAvatar explicitly disentangles avatar video generation into two stages: (1) multimodal-to-motion and (2) motion-to-video generation. The first stage translates the multimodal inputs into motion/ control signals (e.g., human pose, depth, DensePose); while the second stage generates avatar-centric video guided by these motion signals. Additionally, MagicAvatar supports avatar animation by simply providing a few images of the target person. This capability enables the animation of the provided human identity according to the specific motion derived from the first stage. We demonstrate the flexibility of MagicAvatar through various applications, including text-guided and video-guided avatar generation, as well as multimodal avatar animation.
△ Less
Submitted 28 August, 2023;
originally announced August 2023.
-
AdjointDPM: Adjoint Sensitivity Method for Gradient Backpropagation of Diffusion Probabilistic Models
Authors:
Jiachun Pan,
Jun Hao Liew,
Vincent Y. F. Tan,
Jiashi Feng,
Hanshu Yan
Abstract:
Existing customization methods require access to multiple reference examples to align pre-trained diffusion probabilistic models (DPMs) with user-provided concepts. This paper aims to address the challenge of DPM customization when the only available supervision is a differentiable metric defined on the generated contents. Since the sampling procedure of DPMs involves recursive calls to the denois…
▽ More
Existing customization methods require access to multiple reference examples to align pre-trained diffusion probabilistic models (DPMs) with user-provided concepts. This paper aims to address the challenge of DPM customization when the only available supervision is a differentiable metric defined on the generated contents. Since the sampling procedure of DPMs involves recursive calls to the denoising UNet, naïve gradient backpropagation requires storing the intermediate states of all iterations, resulting in extremely high memory consumption. To overcome this issue, we propose a novel method AdjointDPM, which first generates new samples from diffusion models by solving the corresponding probability-flow ODEs. It then uses the adjoint sensitivity method to backpropagate the gradients of the loss to the models' parameters (including conditioning signals, network weights, and initial noises) by solving another augmented ODE. To reduce numerical errors in both the forward generation and gradient backpropagation processes, we further reparameterize the probability-flow ODE and augmented ODE as simple non-stiff ODEs using exponential integration. Finally, we demonstrate the effectiveness of AdjointDPM on three interesting tasks: converting visual effects into identification text embeddings, finetuning DPMs for specific types of stylization, and optimizing initial noise to generate adversarial samples for security auditing.
△ Less
Submitted 20 March, 2024; v1 submitted 20 July, 2023;
originally announced July 2023.
-
DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing
Authors:
Yujun Shi,
Chuhui Xue,
Jun Hao Liew,
Jiachun Pan,
Hanshu Yan,
Wenqing Zhang,
Vincent Y. F. Tan,
Song Bai
Abstract:
Accurate and controllable image editing is a challenging task that has attracted significant attention recently. Notably, DragGAN is an interactive point-based image editing framework that achieves impressive editing results with pixel-level precision. However, due to its reliance on generative adversarial networks (GANs), its generality is limited by the capacity of pretrained GAN models. In this…
▽ More
Accurate and controllable image editing is a challenging task that has attracted significant attention recently. Notably, DragGAN is an interactive point-based image editing framework that achieves impressive editing results with pixel-level precision. However, due to its reliance on generative adversarial networks (GANs), its generality is limited by the capacity of pretrained GAN models. In this work, we extend this editing framework to diffusion models and propose a novel approach DragDiffusion. By harnessing large-scale pretrained diffusion models, we greatly enhance the applicability of interactive point-based editing on both real and diffusion-generated images. Our approach involves optimizing the diffusion latents to achieve precise spatial control. The supervision signal of this optimization process is from the diffusion model's UNet features, which are known to contain rich semantic and geometric information. Moreover, we introduce two additional techniques, namely LoRA fine-tuning and latent-MasaCtrl, to further preserve the identity of the original image. Lastly, we present a challenging benchmark dataset called DragBench -- the first benchmark to evaluate the performance of interactive point-based image editing methods. Experiments across a wide range of challenging cases (e.g., images with multiple objects, diverse object categories, various styles, etc.) demonstrate the versatility and generality of DragDiffusion. Code: https://github.com/Yujun-Shi/DragDiffusion.
△ Less
Submitted 7 April, 2024; v1 submitted 26 June, 2023;
originally announced June 2023.
-
Delving Deeper into Data Scaling in Masked Image Modeling
Authors:
Cheng-Ze Lu,
Xiaojie Jin,
Qibin Hou,
Jun Hao Liew,
Ming-Ming Cheng,
Jiashi Feng
Abstract:
Understanding whether self-supervised learning methods can scale with unlimited data is crucial for training large-scale models. In this work, we conduct an empirical study on the scaling capability of masked image modeling (MIM) methods (e.g., MAE) for visual recognition. Unlike most previous works that depend on the widely-used ImageNet dataset, which is manually curated and object-centric, we t…
▽ More
Understanding whether self-supervised learning methods can scale with unlimited data is crucial for training large-scale models. In this work, we conduct an empirical study on the scaling capability of masked image modeling (MIM) methods (e.g., MAE) for visual recognition. Unlike most previous works that depend on the widely-used ImageNet dataset, which is manually curated and object-centric, we take a step further and propose to investigate this problem in a more practical setting. Specifically, we utilize the web-collected Coyo-700M dataset. We randomly sample varying numbers of training images from the Coyo dataset and construct a series of sub-datasets, containing 0.5M, 1M, 5M, 10M, and 100M images, for pre-training. Our goal is to investigate how the performance changes on downstream tasks when scaling with different sizes of data and models. The study reveals that: 1) MIM can be viewed as an effective method to improve the model capacity when the scale of the training data is relatively small; 2) Strong reconstruction targets can endow the models with increased capacities on downstream tasks; 3) MIM pre-training is data-agnostic under most scenarios, which means that the strategy of sampling pre-training data is non-critical. We hope these observations could provide valuable insights for future research on MIM.
△ Less
Submitted 24 May, 2023;
originally announced May 2023.
-
Associating Spatially-Consistent Grouping with Text-supervised Semantic Segmentation
Authors:
Yabo Zhang,
Zihao Wang,
Jun Hao Liew,
Jingjia Huang,
Manyu Zhu,
Jiashi Feng,
Wangmeng Zuo
Abstract:
In this work, we investigate performing semantic segmentation solely through the training on image-sentence pairs. Due to the lack of dense annotations, existing text-supervised methods can only learn to group an image into semantic regions via pixel-insensitive feedback. As a result, their grouped results are coarse and often contain small spurious regions, limiting the upper-bound performance of…
▽ More
In this work, we investigate performing semantic segmentation solely through the training on image-sentence pairs. Due to the lack of dense annotations, existing text-supervised methods can only learn to group an image into semantic regions via pixel-insensitive feedback. As a result, their grouped results are coarse and often contain small spurious regions, limiting the upper-bound performance of segmentation. On the other hand, we observe that grouped results from self-supervised models are more semantically consistent and break the bottleneck of existing methods. Motivated by this, we introduce associate self-supervised spatially-consistent grouping with text-supervised semantic segmentation. Considering the part-like grouped results, we further adapt a text-supervised model from image-level to region-level recognition with two core designs. First, we encourage fine-grained alignment with a one-way noun-to-region contrastive loss, which reduces the mismatched noun-region pairs. Second, we adopt a contextually aware masking strategy to enable simultaneous recognition of all grouped regions. Coupled with spatially-consistent grouping and region-adapted recognition, our method achieves 59.2% mIoU and 32.4% mIoU on Pascal VOC and Pascal Context benchmarks, significantly surpassing the state-of-the-art methods.
△ Less
Submitted 3 April, 2023;
originally announced April 2023.
-
Global Knowledge Calibration for Fast Open-Vocabulary Segmentation
Authors:
Kunyang Han,
Yong Liu,
Jun Hao Liew,
Henghui Ding,
Yunchao Wei,
Jiajun Liu,
Yitong Wang,
Yansong Tang,
Yujiu Yang,
Jiashi Feng,
Yao Zhao
Abstract:
Recent advancements in pre-trained vision-language models, such as CLIP, have enabled the segmentation of arbitrary concepts solely from textual inputs, a process commonly referred to as open-vocabulary semantic segmentation (OVS). However, existing OVS techniques confront a fundamental challenge: the trained classifier tends to overfit on the base classes observed during training, resulting in su…
▽ More
Recent advancements in pre-trained vision-language models, such as CLIP, have enabled the segmentation of arbitrary concepts solely from textual inputs, a process commonly referred to as open-vocabulary semantic segmentation (OVS). However, existing OVS techniques confront a fundamental challenge: the trained classifier tends to overfit on the base classes observed during training, resulting in suboptimal generalization performance to unseen classes. To mitigate this issue, recent studies have proposed the use of an additional frozen pre-trained CLIP for classification. Nonetheless, this approach incurs heavy computational overheads as the CLIP vision encoder must be repeatedly forward-passed for each mask, rendering it impractical for real-world applications. To address this challenge, our objective is to develop a fast OVS model that can perform comparably or better without the extra computational burden of the CLIP image encoder during inference. To this end, we propose a core idea of preserving the generalizable representation when fine-tuning on known classes. Specifically, we introduce a text diversification strategy that generates a set of synonyms for each training category, which prevents the learned representation from collapsing onto specific known category names. Additionally, we employ a text-guided knowledge distillation method to preserve the generalizable knowledge of CLIP. Extensive experiments demonstrate that our proposed model achieves robust generalization performance across various datasets. Furthermore, we perform a preliminary exploration of open-vocabulary video segmentation and present a benchmark that can facilitate future open-vocabulary research in the video domain.
△ Less
Submitted 15 July, 2023; v1 submitted 16 March, 2023;
originally announced March 2023.
-
PV3D: A 3D Generative Model for Portrait Video Generation
Authors:
Zhongcong Xu,
Jianfeng Zhang,
Jun Hao Liew,
Wenqing Zhang,
Song Bai,
Jiashi Feng,
Mike Zheng Shou
Abstract:
Recent advances in generative adversarial networks (GANs) have demonstrated the capabilities of generating stunning photo-realistic portrait images. While some prior works have applied such image GANs to unconditional 2D portrait video generation and static 3D portrait synthesis, there are few works successfully extending GANs for generating 3D-aware portrait videos. In this work, we propose PV3D,…
▽ More
Recent advances in generative adversarial networks (GANs) have demonstrated the capabilities of generating stunning photo-realistic portrait images. While some prior works have applied such image GANs to unconditional 2D portrait video generation and static 3D portrait synthesis, there are few works successfully extending GANs for generating 3D-aware portrait videos. In this work, we propose PV3D, the first generative framework that can synthesize multi-view consistent portrait videos. Specifically, our method extends the recent static 3D-aware image GAN to the video domain by generalizing the 3D implicit neural representation to model the spatio-temporal space. To introduce motion dynamics to the generation process, we develop a motion generator by stacking multiple motion layers to generate motion features via modulated convolution. To alleviate motion ambiguities caused by camera/human motions, we propose a simple yet effective camera condition strategy for PV3D, enabling both temporal and multi-view consistent video generation. Moreover, PV3D introduces two discriminators for regularizing the spatial and temporal domains to ensure the plausibility of the generated portrait videos. These elaborated designs enable PV3D to generate 3D-aware motion-plausible portrait videos with high-quality appearance and geometry, significantly outperforming prior works. As a result, PV3D is able to support many downstream applications such as animating static portraits and view-consistent video motion editing. Code and models are released at https://showlab.github.io/pv3d.
△ Less
Submitted 20 June, 2023; v1 submitted 13 December, 2022;
originally announced December 2022.
-
MagicMix: Semantic Mixing with Diffusion Models
Authors:
Jun Hao Liew,
Hanshu Yan,
Daquan Zhou,
Jiashi Feng
Abstract:
Have you ever imagined what a corgi-alike coffee machine or a tiger-alike rabbit would look like? In this work, we attempt to answer these questions by exploring a new task called semantic mixing, aiming at blending two different semantics to create a new concept (e.g., corgi + coffee machine -- > corgi-alike coffee machine). Unlike style transfer, where an image is stylized according to the refer…
▽ More
Have you ever imagined what a corgi-alike coffee machine or a tiger-alike rabbit would look like? In this work, we attempt to answer these questions by exploring a new task called semantic mixing, aiming at blending two different semantics to create a new concept (e.g., corgi + coffee machine -- > corgi-alike coffee machine). Unlike style transfer, where an image is stylized according to the reference style without changing the image content, semantic blending mixes two different concepts in a semantic manner to synthesize a novel concept while preserving the spatial layout and geometry. To this end, we present MagicMix, a simple yet effective solution based on pre-trained text-conditioned diffusion models. Motivated by the progressive generation property of diffusion models where layout/shape emerges at early denoising steps while semantically meaningful details appear at later steps during the denoising process, our method first obtains a coarse layout (either by corrupting an image or denoising from a pure Gaussian noise given a text prompt), followed by injection of conditional prompt for semantic mixing. Our method does not require any spatial mask or re-training, yet is able to synthesize novel objects with high fidelity. To improve the mixing quality, we further devise two simple strategies to provide better control and flexibility over the synthesized content. With our method, we present our results over diverse downstream applications, including semantic style transfer, novel object synthesis, breed mixing, and concept removal, demonstrating the flexibility of our method. More results can be found on the project page https://magicmix.github.io
△ Less
Submitted 28 October, 2022;
originally announced October 2022.
-
SODAR: Segmenting Objects by DynamicallyAggregating Neighboring Mask Representations
Authors:
Tao Wang,
Jun Hao Liew,
Yu Li,
Yunpeng Chen,
Jiashi Feng
Abstract:
Recent state-of-the-art one-stage instance segmentation model SOLO divides the input image into a grid and directly predicts per grid cell object masks with fully-convolutional networks, yielding comparably good performance as traditional two-stage Mask R-CNN yet enjoying much simpler architecture and higher efficiency. We observe SOLO generates similar masks for an object at nearby grid cells, an…
▽ More
Recent state-of-the-art one-stage instance segmentation model SOLO divides the input image into a grid and directly predicts per grid cell object masks with fully-convolutional networks, yielding comparably good performance as traditional two-stage Mask R-CNN yet enjoying much simpler architecture and higher efficiency. We observe SOLO generates similar masks for an object at nearby grid cells, and these neighboring predictions can complement each other as some may better segment certain object part, most of which are however directly discarded by non-maximum-suppression. Motivated by the observed gap, we develop a novel learning-based aggregation method that improves upon SOLO by leveraging the rich neighboring information while maintaining the architectural efficiency. The resulting model is named SODAR. Unlike the original per grid cell object masks, SODAR is implicitly supervised to learn mask representations that encode geometric structure of nearby objects and complement adjacent representations with context. The aggregation method further includes two novel designs: 1) a mask interpolation mechanism that enables the model to generate much fewer mask representations by sharing neighboring representations among nearby grid cells, and thus saves computation and memory; 2) a deformable neighbour sampling mechanism that allows the model to adaptively adjust neighbor sampling locations thus gathering mask representations with more relevant context and achieving higher performance. SODAR significantly improves the instance segmentation performance, e.g., it outperforms a SOLO model with ResNet-101 backbone by 2.2 AP on COCO \texttt{test} set, with only about 3\% additional computation. We further show consistent performance gain with the SOLOv2 model.
△ Less
Submitted 23 December, 2022; v1 submitted 15 February, 2022;
originally announced February 2022.
-
Energy-efficient neural network inference with microcavity exciton-polaritons
Authors:
M. Matuszewski,
A. Opala,
R. Mirek,
M. Furman,
M. Król,
K. Tyszka,
T. C. H. Liew,
D. Ballarini,
D. Sanvitto,
J. Szczytko,
B. Piętka
Abstract:
We propose all-optical neural networks characterized by very high energy efficiency and performance density of inference. We argue that the use of microcavity exciton-polaritons allows to take advantage of the properties of both photons and electrons in a seamless manner. This results in strong optical nonlinearity without the use of optoelectronic conversion. We propose a design of a realistic ne…
▽ More
We propose all-optical neural networks characterized by very high energy efficiency and performance density of inference. We argue that the use of microcavity exciton-polaritons allows to take advantage of the properties of both photons and electrons in a seamless manner. This results in strong optical nonlinearity without the use of optoelectronic conversion. We propose a design of a realistic neural network and estimate energy cost to be at the level of attojoules per bit, also when including the optoelectronic conversion at the input and output of the network, several orders of magnitude below state-of-the-art hardware implementations. We propose two kinds of nonlinear binarized nodes based either on optical phase shifts and interferometry or on polariton spin rotations.
△ Less
Submitted 28 August, 2021;
originally announced August 2021.
-
Body Meshes as Points
Authors:
Jianfeng Zhang,
Dongdong Yu,
Jun Hao Liew,
Xuecheng Nie,
Jiashi Feng
Abstract:
We consider the challenging multi-person 3D body mesh estimation task in this work. Existing methods are mostly two-stage based--one stage for person localization and the other stage for individual body mesh estimation, leading to redundant pipelines with high computation cost and degraded performance for complex scenes (e.g., occluded person instances). In this work, we present a single-stage mod…
▽ More
We consider the challenging multi-person 3D body mesh estimation task in this work. Existing methods are mostly two-stage based--one stage for person localization and the other stage for individual body mesh estimation, leading to redundant pipelines with high computation cost and degraded performance for complex scenes (e.g., occluded person instances). In this work, we present a single-stage model, Body Meshes as Points (BMP), to simplify the pipeline and lift both efficiency and performance. In particular, BMP adopts a new method that represents multiple person instances as points in the spatial-depth space where each point is associated with one body mesh. Hinging on such representations, BMP can directly predict body meshes for multiple persons in a single stage by concurrently localizing person instance points and estimating the corresponding body meshes. To better reason about depth ordering of all the persons within the same scene, BMP designs a simple yet effective inter-instance ordinal depth loss to obtain depth-coherent body mesh estimation. BMP also introduces a novel keypoint-aware augmentation to enhance model robustness to occluded person instances. Comprehensive experiments on benchmarks Panoptic, MuPoTS-3D and 3DPW clearly demonstrate the state-of-the-art efficiency of BMP for multi-person body mesh estimation, together with outstanding accuracy. Code can be found at: https://github.com/jfzhang95/BMP.
△ Less
Submitted 5 July, 2021; v1 submitted 6 May, 2021;
originally announced May 2021.
-
Efficient emotion recognition using hyperdimensional computing with combinatorial channel encoding and cellular automata
Authors:
Alisha Menon,
Anirudh Natarajan,
Reva Agashe,
Daniel Sun,
Melvin Aristio,
Harrison Liew,
Yakun Sophia Shao,
Jan M. Rabaey
Abstract:
In this paper, a hardware-optimized approach to emotion recognition based on the efficient brain-inspired hyperdimensional computing (HDC) paradigm is proposed. Emotion recognition provides valuable information for human-computer interactions, however the large number of input channels (>200) and modalities (>3) involved in emotion recognition are significantly expensive from a memory perspective.…
▽ More
In this paper, a hardware-optimized approach to emotion recognition based on the efficient brain-inspired hyperdimensional computing (HDC) paradigm is proposed. Emotion recognition provides valuable information for human-computer interactions, however the large number of input channels (>200) and modalities (>3) involved in emotion recognition are significantly expensive from a memory perspective. To address this, methods for memory reduction and optimization are proposed, including a novel approach that takes advantage of the combinatorial nature of the encoding process, and an elementary cellular automaton. HDC with early sensor fusion is implemented alongside the proposed techniques achieving two-class multi-modal classification accuracies of >76% for valence and >73% for arousal on the multi-modal AMIGOS and DEAP datasets, almost always better than state of the art. The required vector storage is seamlessly reduced by 98% and the frequency of vector requests by at least 1/5. The results demonstrate the potential of efficient hyperdimensional computing for low-power, multi-channeled emotion recognition tasks.
△ Less
Submitted 6 April, 2021;
originally announced April 2021.
-
Gemmini: Enabling Systematic Deep-Learning Architecture Evaluation via Full-Stack Integration
Authors:
Hasan Genc,
Seah Kim,
Alon Amid,
Ameer Haj-Ali,
Vighnesh Iyer,
Pranav Prakash,
Jerry Zhao,
Daniel Grubb,
Harrison Liew,
Howard Mao,
Albert Ou,
Colin Schmidt,
Samuel Steffl,
John Wright,
Ion Stoica,
Jonathan Ragan-Kelley,
Krste Asanovic,
Borivoje Nikolic,
Yakun Sophia Shao
Abstract:
DNN accelerators are often developed and evaluated in isolation without considering the cross-stack, system-level effects in real-world environments. This makes it difficult to appreciate the impact of System-on-Chip (SoC) resource contention, OS overheads, and programming-stack inefficiencies on overall performance/energy-efficiency. To address this challenge, we present Gemmini, an open-source*,…
▽ More
DNN accelerators are often developed and evaluated in isolation without considering the cross-stack, system-level effects in real-world environments. This makes it difficult to appreciate the impact of System-on-Chip (SoC) resource contention, OS overheads, and programming-stack inefficiencies on overall performance/energy-efficiency. To address this challenge, we present Gemmini, an open-source*, full-stack DNN accelerator generator. Gemmini generates a wide design-space of efficient ASIC accelerators from a flexible architectural template, together with flexible programming stacks and full SoCs with shared resources that capture system-level effects. Gemmini-generated accelerators have also been fabricated, delivering up to three orders-of-magnitude speedups over high-performance CPUs on various DNN benchmarks.
* https://github.com/ucb-bar/gemmini
△ Less
Submitted 9 July, 2021; v1 submitted 22 November, 2019;
originally announced November 2019.
-
Polaritonic neuromorphic computing outperforms linear classifiers
Authors:
D. Ballarini,
A. Gianfrate,
R. Panico,
A. Opala,
S. Ghosh,
L. Dominici,
V. Ardizzone,
M. De Giorgi,
G. Lerario,
G. Gigli,
T. C. H. Liew,
M. Matuszewski,
D. Sanvitto
Abstract:
Machine learning software applications are nowadays ubiquitous in many fields of science and society for their outstanding capability of solving computationally vast problems like the recognition of patterns and regularities in big datasets. One of the main goals of research is the realization of a physical neural network able to perform data processing in a much faster and energy-efficient way th…
▽ More
Machine learning software applications are nowadays ubiquitous in many fields of science and society for their outstanding capability of solving computationally vast problems like the recognition of patterns and regularities in big datasets. One of the main goals of research is the realization of a physical neural network able to perform data processing in a much faster and energy-efficient way than the state-of-the-art technology. Here we show that lattices of exciton-polariton condensates accomplish neuromorphic computing using fast optical nonlinearities and with lower error rate than any previous hardware implementation. We demonstrate that our neural network significantly increases the recognition efficiency compared to the linear classification algorithms on one of the most widely used benchmarks, the MNIST problem, showing a concrete advantage from the integration of optical systems in reservoir computing architectures.
△ Less
Submitted 7 November, 2019;
originally announced November 2019.
-
Classification Calibration for Long-tail Instance Segmentation
Authors:
Tao Wang,
Yu Li,
Bingyi Kang,
Junnan Li,
Jun Hao Liew,
Sheng Tang,
Steven Hoi,
Jiashi Feng
Abstract:
Remarkable progress has been made in object instance detection and segmentation in recent years. However, existing state-of-the-art methods are mostly evaluated with fairly balanced and class-limited benchmarks, such as Microsoft COCO dataset [8]. In this report, we investigate the performance drop phenomenon of state-of-the-art two-stage instance segmentation models when processing extreme long-t…
▽ More
Remarkable progress has been made in object instance detection and segmentation in recent years. However, existing state-of-the-art methods are mostly evaluated with fairly balanced and class-limited benchmarks, such as Microsoft COCO dataset [8]. In this report, we investigate the performance drop phenomenon of state-of-the-art two-stage instance segmentation models when processing extreme long-tail training data based on the LVIS [5] dataset, and find a major cause is the inaccurate classification of object proposals. Based on this observation, we propose to calibrate the prediction of classification head to improve recognition performance for the tail classes. Without much additional cost and modification of the detection model architecture, our calibration method improves the performance of the baseline by a large margin on the tail classes. Codes will be available. Importantly, after the submission, we find significant improvement can be further achieved by modifying the calibration head, which we will update later.
△ Less
Submitted 30 July, 2020; v1 submitted 29 October, 2019;
originally announced October 2019.
-
PANet: Few-Shot Image Semantic Segmentation with Prototype Alignment
Authors:
Kaixin Wang,
Jun Hao Liew,
Yingtian Zou,
Daquan Zhou,
Jiashi Feng
Abstract:
Despite the great progress made by deep CNNs in image semantic segmentation, they typically require a large number of densely-annotated images for training and are difficult to generalize to unseen object categories. Few-shot segmentation has thus been developed to learn to perform segmentation from only a few annotated examples. In this paper, we tackle the challenging few-shot segmentation probl…
▽ More
Despite the great progress made by deep CNNs in image semantic segmentation, they typically require a large number of densely-annotated images for training and are difficult to generalize to unseen object categories. Few-shot segmentation has thus been developed to learn to perform segmentation from only a few annotated examples. In this paper, we tackle the challenging few-shot segmentation problem from a metric learning perspective and present PANet, a novel prototype alignment network to better utilize the information of the support set. Our PANet learns class-specific prototype representations from a few support images within an embedding space and then performs segmentation over the query images through matching each pixel to the learned prototypes. With non-parametric metric learning, PANet offers high-quality prototypes that are representative for each semantic class and meanwhile discriminative for different classes. Moreover, PANet introduces a prototype alignment regularization between support and query. With this, PANet fully exploits knowledge from the support and provides better generalization on few-shot segmentation. Significantly, our model achieves the mIoU score of 48.1% and 55.7% on PASCAL-5i for 1-shot and 5-shot settings respectively, surpassing the state-of-the-art method by 1.8% and 8.6%.
△ Less
Submitted 6 February, 2020; v1 submitted 18 August, 2019;
originally announced August 2019.
-
FPGA with Improved Routability and Robustness in 130nm CMOS with Open-Source CAD Targetability
Authors:
Guanshun Yu,
Tom Y. Cheng,
Blayne Kettlewell,
Harrison Liew,
Mingoo Seok,
Peter R. Kinget
Abstract:
This paper outlines an FPGA VLSI design methodology that was used to realize a fully functioning FPGA chip in 130nm CMOS with improved routability and memory robustness. The architectural design space exploration and synthesis capability were enabled by the Verilog-to-Routing CAD tool. The capabilities of this tool were extended to enable bitstream generation and deployment. To validate the archit…
▽ More
This paper outlines an FPGA VLSI design methodology that was used to realize a fully functioning FPGA chip in 130nm CMOS with improved routability and memory robustness. The architectural design space exploration and synthesis capability were enabled by the Verilog-to-Routing CAD tool. The capabilities of this tool were extended to enable bitstream generation and deployment. To validate the architecture and bitstream implementation, a Chisel (Constructing Hardware in the Embedded Scala Language) model of the FPGA was created to rapidly verify the microarchitectural details of the device prior to schematic design. A custom carrier board and configuration tool were used to verify correct operational characteristics of the FPGA over various resource utilizations and clock frequencies.
△ Less
Submitted 9 December, 2017;
originally announced December 2017.
-
Perceptrons with Hebbian learning based on wave ensembles in plastic potentials
Authors:
T. Espinosa-Ortega,
T. C. H. Liew
Abstract:
A general scheme to realize a perceptron for hardware neural networks is presented, where multiple interconnections are achieved by a superposition of Schrodinger waves. Spatially patterned potentials process information by coupling different points of reciprocal space. The necessary potential shape is obtained from the Hebbian learning rule, either through exact calculation or construction from a…
▽ More
A general scheme to realize a perceptron for hardware neural networks is presented, where multiple interconnections are achieved by a superposition of Schrodinger waves. Spatially patterned potentials process information by coupling different points of reciprocal space. The necessary potential shape is obtained from the Hebbian learning rule, either through exact calculation or construction from a superposition of known optical inputs. This allows implementation in a wide range of compact optical systems, including: 1) any non-linear optical system; 2) optical systems patterned by optical lithography; and 3) exciton-polariton systems with phonon or nuclear spin interactions.
△ Less
Submitted 25 February, 2015; v1 submitted 29 August, 2014;
originally announced August 2014.
-
Polariton Condensate Transistor Switch
Authors:
T. Gao,
P. S. Eldridge,
T. C. H. Liew,
S. I. Tsintzos,
G. Stavrinidis,
G. Deligeorgis,
Z. Hatzopoulos,
P. G. Savvidis
Abstract:
A polariton condensate transistor switch is realized through optical excitation of a microcavity ridge with two beams. The ballistically ejected polaritons from a condensate formed at the source are gated using the 20 times weaker second beam to switch on and off the flux of polaritons. In the absence of the gate beam the small built-in detuning creates potential landscape in which ejected polarit…
▽ More
A polariton condensate transistor switch is realized through optical excitation of a microcavity ridge with two beams. The ballistically ejected polaritons from a condensate formed at the source are gated using the 20 times weaker second beam to switch on and off the flux of polaritons. In the absence of the gate beam the small built-in detuning creates potential landscape in which ejected polaritons are channelled toward the end of the ridge where they condense. The low loss photon-like propagation combined with strong nonlinearities associated with their excitonic component makes polariton based transistors particularly attractive for the implementation of all-optical integrated circuits.
△ Less
Submitted 24 May, 2012; v1 submitted 21 May, 2012;
originally announced May 2012.