Controllable Generation


A collection of resources on Controllable Generation (Latent Space Manipulation). The specific attribute of pose is summarized in 3D-Aware-Generation; conditional GANs are summarized in Conditional-GANs.

PS: this repo does not cover style transfer; I tend to classify it as artistic creation (mixing and interpolating features from different images).

Key Words: interpreting, latent space navigation, steerable, steerability, interpretable, semantics, manual annotation, meaningful directions, semantic image editing, independence, exclusiveness.

Contributing

Feedback and contributions are welcome! If you think I have missed something, or have any suggestions (papers, implementations, and other resources), feel free to open a pull request or leave an issue. I will release the LaTeX/PDF version in the future. ⬇️ Markdown format:

[Paper Name](abs link)  
*[Author 1](homepage), Author 2, and Author 3*
**[`Conference/Journal Year`] (`Institution`)** [[Github](link)] [[Project](link)]

😄 Now you can use this script to automatically generate the above text.

Contents

  1. Introduction
  2. [Methods Taxonomy](#methods-taxonomy)
  3. Tricks
  4. Literature

Introduction

I think a few questions need to be answered:

  1. Why does this possibility exist at all?

  2. What is the significance?

    Testing the interpolation behavior of a generative model has become a standard, almost mandatory experiment, because we hope the learned model generalizes well, i.e., that its distribution over the latent space is continuous (a manifold).

  3. Which attributes can we edit?

    Facial features in portraits, scaling/rotation/translation of an object's position, viewpoint.

The paradigm we mainly care about is:

editing in the latent space. We want to edit newly generated images, and also given real images, which is why inversion is needed.

Conventional generative models excel at generating random realistic samples with statistics resembling the training set. However, controllable and interactive generation matters more than random sampling. GANs do not provide an inherent way of comprehending or controlling the underlying generative factors, but some research shows that a well-trained GAN encodes different semantics inside its latent space. Therefore, a key problem for generative models is to gain explicit control over the synthesis process or its results.

The goal is to generate or modify images satisfying our specific requirements. The requirement should be semantically meaningful, interpretable, and easy to distinguish, without affecting other attributes (e.g., object pose). The manipulation could target a single attribute or multiple attributes of interest.

The first line of work proposes conditional GANs that generate with $p(z \mid y)$, or employs auxiliary classifiers to push the GAN to generate the desired results.
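As a toy illustration of the conditional input (a sketch, not tied to any specific paper), the condition $y$ can simply be concatenated to the noise vector before it enters the generator; all names below are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, n_classes = 128, 10

# Conditional-GAN input: concatenate a one-hot class label y to the
# noise vector z, so the generator effectively models p(x | z, y).
z = rng.normal(size=latent_dim)
y = np.eye(n_classes)[3]          # condition on class 3
g_input = np.concatenate([z, y])  # this vector is fed to the generator
```

In a real conditional GAN the discriminator receives the label as well, so that mismatched (image, label) pairs are penalized.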

Another line is to utilize the latent space of a pretrained generator for image manipulation. In this line, we can also control and modify a given real image by inverting it into a latent code (inversion). The main idea behind this method is to disentangle the latent space. A common practice is to analyze and dissect GANs’ latent spaces, finding disentangled latent variables suitable for editing. These disentangled latent variables are sometimes interpretable directions $d$. Careful modifications of the latent embeddings then translate to desired changes in the generated output.

$$ x = G(z_0) \rightarrow x' = G(z_0 + \alpha d) $$

These directions should ideally be orthogonal so that they do not interfere with each other; moving along a direction corresponds to variation of a single attribute. These valid directions can also be seen as a manifold of the latent space: meaningful subspaces corresponding to human-understandable attributes.
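The edit $x' = G(z_0 + \alpha d)$ above can be sketched as follows; `G` and the direction `d` here are hypothetical stand-ins (a real setup would use a pretrained generator and a discovered interpretable direction):

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim = 512

# Stand-ins: G would be a pretrained generator network,
# d a discovered interpretable direction (unit norm).
G = lambda z: z                    # placeholder for a real generator
d = rng.normal(size=latent_dim)
d /= np.linalg.norm(d)

z0 = rng.normal(size=latent_dim)   # sample a latent code
x = G(z0)                          # original image
alpha = 3.0                        # edit strength
x_edit = G(z0 + alpha * d)         # image with the attribute shifted
```

Varying `alpha` continuously traverses the latent direction, which is what "continuous manipulation" of an attribute refers to.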

GANs can produce high-fidelity images visually indistinguishable from real ones. However, the generation process is not inherently controllable, yet control over the generated images enables many interesting applications.

Such control can be obtained by first learning the manifold and then realizing image editing through latent space traversal. Moving latent codes towards some certain directions can cause corresponding attribute change in images.

Many works have examined semantic directions in the latent spaces spontaneously learned by pre-trained GANs. The widely used StyleGAN is a common choice for its high-quality synthesis and remarkable latent-based editing quality through its rich and highly disentangled latent space.

Some works use full supervision in the form of semantic labels, others find meaningful directions in a self-supervised fashion, and, finally, recent works present unsupervised methods to achieve the same goal.

Disentanglement can be defined as the ability to control a single factor, or feature, without affecting other ones. A properly disentangled representation can benefit semantic data mixing, transfer learning for downstream tasks, or even interpretability. (from *Face Identity Disentanglement via Latent Space Mapping*)

🎯 Summary

We mainly focus on the disentanglement in the latent space of a generative model. We hope:

  • keep high-quality after editing

  • disentangle in an unsupervised manner

  • find more disentangled directions that do not interfere with each other

  • provide continuous manipulation of multiple attributes simultaneously

  • high-precision editing

  • interaction mode: provide a segmentation mask to change a specific area.

📌 Problem Statement

Given a fixed (pretrained) GAN model consisting of a generator $G$ and a discriminator $D$, and a latent vector $\boldsymbol{z} \in \mathbb{R}^m$ from a known distribution $P(\boldsymbol{z})$, we sample $N$ random vectors $\mathbb{Z} = \{\boldsymbol{z}^{(1)}, \dots, \boldsymbol{z}^{(N)}\}$.

We want to discover $K$ non-linear interpretable paths in the latent space. The most straightforward way is to first generate a collection of synthesized images, then label these images with respect to a target attribute, and finally find the latent separation boundary through supervised training. Given the annotation cost of this approach, finding steerable directions of the latent space in an unsupervised manner, e.g., with PCA, is another direction. The common issue of existing approaches is that they are limited to global semantics; we would like to be able to focus on a particular image region.

Current methods require:

  • carefully designed loss functions

  • introduction of additional attribute labels or features

  • special architectures to train new models

👀 What can we edit/control?

Meaningful human-interpretable directions can refer to either domain-specific factors (e.g., facial expressions) or domain-agnostic factors (e.g., zoom scale). Some examples include:

  • change facial expressions in portraits

  • change view-point or shapes and textures of cars

  • interpolate between different images

  • some simple transformation (rotation, zooming)

📎 Further Impact

  • browse through the concepts that the GAN has learned, internal representation

  • training a general model requires enormous computational resources, so interpret and extend the capabilities of existing GANs

  • for artistic visualization, design, photo enhancement

  • solving many other downstream tasks, including face verification, landmark detection, layout prediction, transfer learning, style mixing, image editing, etc.

Methods Taxonomy

This is a summary of controllable GANs, including:

  • [training stage] conditional GAN mode

  • [test stage] modify a pre-trained GAN model

  • [test stage] modify a latent code of a given image

  • Conditional GANs and Auxiliary Classifier

    different conditioning leads to different outputs

    auxiliary attribute classifiers to guide synthesis

    💬 requires large labeled datasets. These methods are limited to image types for which large annotated datasets are available, like portraits, and offer limited editing control.

  • Analyze and dissect GANs’ latent spaces, finding disentangled latent variables suitable for editing

    💬 do not enable detailed editing and are often slow.

    used in real-time on different images and with different edits

  • change network weight

The first two can only discover directions that researchers already expect, which requires imagination; the last can reveal directions you would not have thought of.

Evaluation Metrics

We need evaluation metrics; this is in fact a question facing disentanglement learning as a whole.

Inspired by Bengio, we can adopt the notion of disentangled representation learning as "a process of decorrelating information in the data into separate informative representations, each of which corresponds to a concept defined by humans".

So the three important properties can be summarized as informativeness, separability, and interpretability.

Measuring Disentanglement: A Review of Metrics
Julian Zaidi, Jonathan Boilard, Ghyslain Gagnon, Marc-André Carbonneau

Theory and Evaluation Metrics for Learning Disentangled Representations
Kien Do, Truyen Tran
[ICLR 2020]

Tricks

Interpolation methods range from simple linear interpolation (lerp) to spherical interpolation (slerp).

  • $z$ is usually assumed to be Gaussian, which makes samples unlikely to fall far from the sphere $\mathcal{S}(\sqrt{d}, d, 2)$ (radius $\sqrt{d}$ in $d$ dimensions). Since projecting onto a sphere is easy and numerically friendly, $z$ is often kept mapped onto a sphere at all times. In practice, the unit sphere is sometimes used instead of the radius-$\sqrt{d}$ one.
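A minimal slerp implementation (the standard formula, not tied to any particular GAN codebase):

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical linear interpolation between two latent vectors."""
    z0n = z0 / np.linalg.norm(z0)
    z1n = z1 / np.linalg.norm(z1)
    # Angle between the two (normalized) latent vectors.
    omega = np.arccos(np.clip(np.dot(z0n, z1n), -1.0, 1.0))
    if np.isclose(omega, 0.0):
        # Nearly parallel vectors: fall back to plain lerp.
        return (1 - t) * z0 + t * z1
    return (np.sin((1 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)
```

Unlike lerp, slerp keeps intermediate points near the shell where Gaussian samples concentrate, so interpolated images tend to stay on the model's learned manifold.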

Truncation:

First generate a series of images with their corresponding latent codes, then compute the average code; when generating a new image, use $$ w' = w_{avg} + \alpha (w - w_{avg}) $$
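A sketch of the truncation trick, assuming we already have samples of the intermediate latent codes; `w_samples` below are random stand-ins for the outputs of a real mapping network:

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim = 512

# Stand-in for codes produced by a pretrained mapping network;
# average them to estimate the "center of mass" w_avg.
w_samples = rng.normal(size=(10_000, latent_dim))
w_avg = w_samples.mean(axis=0)

def truncate(w, alpha=0.7):
    """Pull a latent code toward the average: w' = w_avg + alpha * (w - w_avg)."""
    return w_avg + alpha * (w - w_avg)
```

With $\alpha < 1$, samples are pulled toward the average code, trading diversity for sample quality; $\alpha = 1$ leaves the code unchanged.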

Literature

Please refer to the subfolders for more details.

Survey

Representation Learning: A Review and New Perspectives
Yoshua Bengio, Aaron Courville, Pascal Vincent

Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations
Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf, Olivier Bachem

1.1 Conditional GAN

1.2 Auxiliary Classifier

2.1 Encode

encode a given image into a latent representation of the manipulated image

2.2 Modify GAN Model

control GANs' network parameters

Good: one model can produce countless new images following the modified rules.

2.3 Modify Latent Space

analyze and dissect the latent space, finding disentangled latent variables suitable for editing.

The discovered directions may be linear or nonlinear paths; their evaluation relies either on visual inspection or on laborious human labeling.

2.3.1 Supervised

domain-specific transformations (adding smile or glasses)

weakness: need human labels or pretrained models, expensive to obtain

pipeline: randomly sample a large amount of latent codes, then synthesize corresponding images and annotate them with labels, and finally use these labeled samples to learn a separation boundary in the latent space.

Remaining problems: predefined semantics are required, and large-scale sampling is needed.

either by explicit human annotation, or by the use of pretrained semantic classifiers.

improves the memorability of the output image

explores the hierarchical semantics in the deep generative representations for scene synthesis

Use the classifiers pretrained on the CelebA dataset to predict certain face attributes

Add labels to latent codes and fit a separating hyperplane. The normal to this hyperplane becomes a direction that captures the corresponding attribute.
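A toy sketch of this hyperplane idea on synthetic latent codes: the "attribute labels" are generated from a hidden direction rather than a real classifier, and a logistic-regression fit stands in for the classifiers/SVMs used in practice:

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 2000, 16

# Synthetic stand-in: latent codes labeled by a hidden attribute direction
# (in a real pipeline the labels would come from a classifier on G(z)).
true_dir = rng.normal(size=dim)
true_dir /= np.linalg.norm(true_dir)
Z = rng.normal(size=(n, dim))
y = (Z @ true_dir > 0).astype(float)

# Fit a linear classifier (logistic regression by gradient descent);
# its weight vector is the normal of the separating hyperplane.
w = np.zeros(dim)
lr = 0.5
for _ in range(500):
    logits = np.clip(Z @ w, -30, 30)      # clip for numerical stability
    p = 1.0 / (1.0 + np.exp(-logits))
    w -= lr * Z.T @ (p - y) / n           # gradient of the logistic loss

d_attr = w / np.linalg.norm(w)            # attribute direction = hyperplane normal
```

Editing then moves a code along this normal, `z + alpha * d_attr`, exactly as in the traversal formula earlier in this document.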

2.3.2 Self-Supervised Learning

domain agnostic transformations (zooming or translation)

(image augmentations) - [simple transformations]

solve the optimization problem in the latent space that maximizes the score of the pretrained model, predicting image memorability

2.3.3 Unsupervised

are often less effective at providing semantic meaningful directions and all too often change image identity during an edit

demanding training process that requires drawing large numbers of random latent codes and regressing the latent directions

weakness: subjective visual inspection & laborious human labeling

using unsupervised methods such as PCA to find steerable directions

do not use any optimization
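A PCA sketch in the spirit of these unsupervised methods; the samples below are synthetic stand-ins for intermediate latent codes (e.g., StyleGAN's $w = \mathrm{mapping}(z)$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 5000, 32

# Stand-in for latent codes from a pretrained mapping network;
# here anisotropic Gaussian samples so PCA has structure to find.
scales = np.geomspace(8.0, 0.1, dim)
W = rng.normal(size=(n, dim)) * scales

# PCA via SVD of the centered codes: the principal components serve
# as candidate steerable directions, ordered by explained variance.
W_centered = W - W.mean(axis=0)
_, _, Vt = np.linalg.svd(W_centered, full_matrices=False)
directions = Vt   # each row is an orthonormal candidate edit direction
```

Because the components are orthonormal by construction, the discovered directions do not interfere with each other in the latent space, though whether each one is semantically meaningful still requires visual inspection.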

A geometric analysis of deep generative image models and its applications

Low-Rank Subspaces in GANs
Jiapeng Zhu, Ruili Feng, Yujun Shen, Deli Zhao, Zhengjun Zha, Jingren Zhou, Qifeng Chen
[NeurIPS 2021] (HKUST, Alibaba, USTC)

GAN models can capture natural statistics while isolating independent factors of variation. These factors can be used to control the outcome, but such perturbations affect the global statistics of the image, so we want the manipulation to occur at a localized level. General methods depend on annotations of the independent factors.

We aim to learn spatially and semantically independent latent factors without the need for any annotation.

VAE-based methods use the total correlation of the latent variable distributions as the penalty

InfoGAN-based methods maximize the mutual information between latent factors and related observations.

Usually the extra terms lead to worse generation quality for these typical disentanglement methods.

GAN-based methods discover semantically meaningful directions in the style space of StyleGAN by analysing the distribution of the first-layer output or layer weights.

Generative Hierarchical Features from Synthesizing Images
[CVPR 2021] (CUHK)
Yinghao Xu, Yujun Shen, Jiapeng Zhu, Ceyuan Yang, Bolei Zhou

Editing in style: Uncovering the local semantics of GANs

Decorating your own bedroom: Locally controlling image generation with generative adversarial networks

Spatially controllable image synthesis with internal representation collaging

Human-in-the-loop differential subspace search in high-dimensional latent space

A spectral regularizer for unsupervised disentanglement

The geometry of deep generative image models and its applications

Enjoy Your Editing: Controllable GANs for Image Editing via Latent Space Navigation
[ICLR 2021] (UIUC)
Peiye Zhuang, Oluwasanmi Koyejo, Alexander G. Schwing

Disentangled and Controllable Face Image Generation via 3D Imitative-Contrastive Learning
[CVPR 2020] (Tsinghua)
Yu Deng, Jiaolong Yang, Dong Chen, Fang Wen, Xin Tong

augmenting and regularizing the latent space

A free viewpoint portrait generator with dynamic styling

Disentangled image generation through structured noise injection

Gan-control: Explicitly controllable gans

(GLO) Optimizing the Latent Space of Generative Networks
[ICML 2018] (Facebook)
Piotr Bojanowski, Armand Joulin, David Lopez-Paz, Arthur Szlam

Deforming autoencoders: Unsupervised disentangling of shape and appearance
[ECCV 2018] (Stony Brook University)
Zhixin Shu, Mihir Sahasrabudhe, Alp Guler, Dimitris Samaras, Nikos Paragios, Iasonas Kokkinos

For Face
MaskGAN: Towards Diverse and Interactive Facial Image Manipulation
[CVPR 2020] Cheng-Han Lee, Ziwei Liu, Lingyun Wu, Ping Luo

Related work in papers

Subheadings that commonly appear are:

  • Latent Semantic Interpretation

I have published the materials below.