Stunning image synthesis results

December 19, 2020

Preface

paper: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9157662
code: https://github.com/clovaai/stargan-v2

1. Introduction

• domain: a set of images that can be grouped as a visually distinctive category
• style: each image has a unique appearance, which we call style

For example, if we treat gender as the set of domains, the styles are things like makeup, beard, and hairstyle (top half of Figure 1).

• An ideal image-to-image translation method should be able to synthesize images that reflect the diverse styles in each domain.

Existing methods consider only the mapping between two domains, so they do not scale as the number of domains grows.

• StarGAN [6]: one of the earliest models, which learns the mappings between all available domains using a single generator.

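The generator takes the domain label as an additional input and learns to translate an image into the corresponding domain. However, StarGAN still learns a deterministic mapping for each domain, which does not capture the multi-modal nature of the data distribution.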
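A minimal sketch of this kind of label conditioning, assuming the label is one-hot encoded, replicated over the spatial dimensions, and concatenated to the image channels. The function names and the toy 3 x 2 x 2 image are illustrative and not taken from the StarGAN code. Since the conditioned input is fully determined by the image and the label, a deterministic generator can return only one output per target domain.

```python
def one_hot(label, num_domains):
    """Return a one-hot list for an integer domain label."""
    return [1.0 if i == label else 0.0 for i in range(num_domains)]


def concat_label(image, label, num_domains):
    """Append one constant H x W plane per domain to a C x H x W image."""
    height, width = len(image[0]), len(image[0][0])
    planes = [
        [[value] * width for _ in range(height)]
        for value in one_hot(label, num_domains)
    ]
    return image + planes


# Toy RGB image with 3 channels of size 2 x 2 (hypothetical data).
image = [[[0.5, 0.5], [0.5, 0.5]] for _ in range(3)]

# Conditioning on domain 1 out of 2 yields a 5-channel generator input.
conditioned = concat_label(image, label=1, num_domains=2)
```

Calling `concat_label` twice with the same image and label gives identical inputs, which is why diversity has to come from somewhere else, such as a style code.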

• StarGAN v2: a scalable approach that can generate diverse images across multiple domains.

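It builds on StarGAN and replaces StarGAN's domain label with our proposed domain-specific style code, which can represent the diverse styles of a specific domain. To this end, we introduce two modules: a mapping network and a style encoder.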

• mapping network: learns to transform random Gaussian noise into a style code

• style encoder: learns to extract the style code from a given reference image
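The two modules can be sketched as small multi-head networks: a shared trunk followed by one output head per domain, with the target domain selecting the head (the Discussion section describes both modules as multi-head). This is a toy illustration with untrained random weights. The class name `MultiHeadNet`, the layer sizes, the `tanh` activation, and the flat feature vector standing in for a reference image are all assumptions, not the authors' architecture.

```python
import math
import random


def linear(weights, bias, x):
    """Dense layer: `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def init_layer(rng, n_in, n_out):
    """Random (untrained) weights and zero biases."""
    weights = [[rng.gauss(0.0, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class MultiHeadNet:
    """Shared trunk followed by one style head per domain."""

    def __init__(self, rng, n_in, n_hidden, style_dim, num_domains):
        self.trunk = init_layer(rng, n_in, n_hidden)
        self.heads = [init_layer(rng, n_hidden, style_dim) for _ in range(num_domains)]

    def __call__(self, x, domain):
        hidden = [math.tanh(v) for v in linear(*self.trunk, x)]
        # Only the head of the requested domain produces the style code.
        return linear(*self.heads[domain], hidden)


rng = random.Random(0)
LATENT_DIM, FEATURE_DIM, STYLE_DIM, NUM_DOMAINS = 4, 6, 3, 2

mapping_network = MultiHeadNet(rng, LATENT_DIM, 8, STYLE_DIM, NUM_DOMAINS)
style_encoder = MultiHeadNet(rng, FEATURE_DIM, 8, STYLE_DIM, NUM_DOMAINS)

# Latent-guided: Gaussian noise z -> style code for domain 1.
z = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
s_latent = mapping_network(z, domain=1)

# Reference-guided: features of a reference image -> style code for domain 1.
reference_features = [rng.random() for _ in range(FEATURE_DIM)]
s_reference = style_encoder(reference_features, domain=1)
```

Sampling a new `z`, or swapping the reference, changes the style code while the target domain stays fixed, which is the source of diversity within a domain.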

2. StarGAN v2

2.1. Proposed framework

Style encoder (Figure 2c)

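Given different reference images, E can produce diverse style codes. This allows G to synthesize an output image that reflects the style s of a reference image x.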
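These notes do not say how G consumes s. A common mechanism for injecting a style code into a generator is adaptive instance normalization (AdaIN), sketched below under the assumption that s has already been mapped to a per-channel scale `gamma` and shift `beta`. The function name, the flat-list feature layout, and the toy numbers are illustrative only.

```python
import math


def adain(feature_map, gamma, beta, eps=1e-5):
    """Normalize each channel to zero mean and unit variance, then rescale.

    feature_map: list of channels, each a flat list of activations.
    gamma, beta: per-channel scale and shift, assumed to be derived from s.
    """
    out = []
    for channel, g, b in zip(feature_map, gamma, beta):
        mean = sum(channel) / len(channel)
        var = sum((v - mean) ** 2 for v in channel) / len(channel)
        std = math.sqrt(var + eps)
        out.append([g * (v - mean) / std + b for v in channel])
    return out


# Two toy channels with very different statistics (hypothetical data).
features = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 30.0, 30.0]]

# After AdaIN, each channel's mean and spread are set by the style parameters.
styled = adain(features, gamma=[2.0, 0.5], beta=[1.0, -1.0])
```

The content statistics of each channel are discarded and replaced by values that depend only on the style, so the same source features can be rendered with different styles.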

3. Experiments

• MUNIT [13]
• DRIT [22]
• MSGAN [27]
• StarGAN [6]

All the baselines are trained using the implementations provided by the authors.

datasets

• CelebA-HQ [17], split into two domains: male and female
• our new AFHQ dataset (Appendix), split into three domains: cat, dog, and wildlife

• Fréchet inception distance (FID) [11]
• learned perceptual image patch similarity (LPIPS) [38].

3.1. Analysis of individual components

We evaluate individual components that are added to our baseline StarGAN using CelebA-HQ.

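FID measures the distance between the distributions of real and generated images (lower is better). LPIPS measures the diversity of the generated images (higher is better).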
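To make the direction of the two metrics concrete, here is a heavily simplified sketch. The real FID fits multivariate Gaussians to Inception features and compares their means and covariances; the version below assumes scalar features, where the Fréchet distance reduces to (mu_r - mu_g)^2 + (sigma_r - sigma_g)^2. The diversity function uses a plain mean pairwise L1 distance as a stand-in for LPIPS, which in reality compares deep network features. Function names and data are illustrative.

```python
import math
from statistics import fmean, pvariance


def frechet_distance_1d(real, generated):
    """Fréchet distance between two 1-D Gaussians fitted to the samples."""
    mu_r, mu_g = fmean(real), fmean(generated)
    sigma_r = math.sqrt(pvariance(real))
    sigma_g = math.sqrt(pvariance(generated))
    return (mu_r - mu_g) ** 2 + (sigma_r - sigma_g) ** 2


def mean_pairwise_distance(samples):
    """Average L1 distance over all pairs; a crude stand-in for LPIPS."""
    distances = [
        sum(abs(u - v) for u, v in zip(a, b))
        for i, a in enumerate(samples)
        for b in samples[i + 1:]
    ]
    return fmean(distances)


real = [0.0, 1.0, 2.0, 3.0]
matched = [0.0, 1.0, 2.0, 3.0]   # same distribution: distance 0
shifted = [1.0, 2.0, 3.0, 4.0]   # mean moved by 1: larger distance

fid_matched = frechet_distance_1d(real, matched)
fid_shifted = frechet_distance_1d(real, shifted)

collapsed = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]   # identical outputs
diverse = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]     # varied outputs

diversity_collapsed = mean_pairwise_distance(collapsed)
diversity_diverse = mean_pairwise_distance(diverse)
```

A generator that matches the real distribution drives the first quantity toward zero, while a generator that always emits the same image drives the second toward zero, so the two metrics check different failure modes.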

3.2. Comparison on diverse image synthesis

In this section, we evaluate StarGAN v2 on diverse image synthesis from two perspectives: latent-guided synthesis and reference-guided synthesis.

Human evaluation.

For each comparison, we randomly generate 100 questions, and each question is answered by 10 workers. We also ask each worker a few simple questions to detect unworthy workers. The number of total valid workers is 76.

These results show that StarGAN v2 better extracts and renders the styles onto the input image than the other baselines.
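The bookkeeping behind a study of this shape can be sketched as follows: 100 questions per comparison with 10 answers each gives 1000 answers per comparison, and each method's score is its share of the votes. The method names "A" and "B" and the votes are made up for illustration and do not reproduce any result from the paper.

```python
from collections import Counter

QUESTIONS_PER_COMPARISON = 100
WORKERS_PER_QUESTION = 10


def preference_rates(votes):
    """Map each method name to its share of the votes, in percent."""
    counts = Counter(votes)
    return {method: 100.0 * n / len(votes) for method, n in counts.items()}


# Total number of answers collected for one comparison.
total_answers = QUESTIONS_PER_COMPARISON * WORKERS_PER_QUESTION

# Made-up answers of 10 workers to a single question.
toy_votes = ["A"] * 7 + ["B"] * 3
rates = preference_rates(toy_votes)
```

Answers from workers who fail the screening questions would be dropped before this tally.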

4. Discussion

We discuss several reasons why StarGAN v2 can successfully synthesize images of diverse styles over multiple domains.

• our style code is separately generated per domain by the multi-head mapping network and style encoder.
• our style space is produced by learned transformations
• our modules benefit from fully exploiting training data from multiple domains

To show that our model generalizes over the unseen images, we test a few samples from FFHQ [18] with our model trained on CelebA-HQ (Figure 7). Here, StarGAN v2 successfully captures styles of references and renders these styles correctly to the source images.