A Style-Based Generator Architecture for Generative Adversarial Networks

Uncategorized 2019. 6. 25. 22:37

Abstract

-> automatically learned, unsupervised separation of high-level attributes (e.g., pose and identity when trained on human faces) and stochastic variation in the generated images (e.g., freckles, hair).

-> intuitive, scale-specific control of the synthesis.

-> leads to demonstrably better interpolation properties, and also better disentangles the latent factors of variation.

-> Introduces two new automated methods for measuring interpolation quality and disentanglement, applicable to any generator architecture.

-> Introduces a new, highly varied and high-quality dataset of human faces (FFHQ).

Introduction

Our generator starts from a learned constant input and adjusts the "style" of the image at each convolution layer based on the latent code, therefore directly controlling the strength of image features at different scales. -> I think this is the core of the paper.

The architectural change, combined with noise injected directly into the network, leads to automatic, unsupervised separation of high-level attributes (e.g., pose, identity) from stochastic variation (e.g., freckles, hair) in the generated images, and enables intuitive scale-specific mixing and interpolation operations.

The discriminator and the loss function are not modified.

The input latent space must follow the probability density of the training data, which leads to some degree of unavoidable entanglement.

The intermediate latent space of StyleGAN, however, is free from this restriction and can therefore be disentangled.

To measure the degree of latent space disentanglement of a generator, the paper introduces two metrics: 1. perceptual path length and 2. linear separability.
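As a rough illustration of the first metric: in the paper, the perceptual path length in W is the expectation of d(g(lerp(f(z1), f(z2); t)), g(lerp(f(z1), f(z2); t + ε))) / ε² over random z1, z2 and t, where d is a perceptual distance between images. The sketch below is only a toy under stated assumptions: `toy_g` is a hypothetical linear "generator" and `toy_dist` is a squared L2 distance standing in for the perceptual distance, and the function names are my own.

```python
import numpy as np

def lerp(a, b, t):
    """Linear interpolation between a and b at position t."""
    return a + t * (b - a)

def path_length_w(g, dist, w1, w2, t, eps=1e-4):
    """One Monte Carlo sample of the path length between w1 and w2 at position t."""
    img_a = g(lerp(w1, w2, t))
    img_b = g(lerp(w1, w2, t + eps))
    return dist(img_a, img_b) / eps ** 2

# Toy stand-ins (assumptions): a linear "generator" and squared L2 as the distance.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
toy_g = lambda w: A @ w
toy_dist = lambda a, b: float(np.sum((a - b) ** 2))

w1, w2 = rng.standard_normal(4), rng.standard_normal(4)
samples = [path_length_w(toy_g, toy_dist, w1, w2, t) for t in rng.uniform(0, 1, 16)]
ppl_estimate = float(np.mean(samples))
print(ppl_estimate)
```

For a perfectly linear generator the value does not depend on t, which matches the intuition that a less curved latent space gives a more uniform path.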

2. Style-based generator

-> The input layer is omitted altogether, and synthesis starts from a learned constant.

-> Given a latent code z in the input latent space Z, a non-linear mapping network f : Z → W first produces w ∈ W

-> For simplicity, the dimensionality of both spaces (Z and W) is set to 512.

-> The mapping f is implemented as an 8-layer MLP.
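The mapping network described above can be sketched in NumPy as follows. This is a minimal sketch with random, untrained weights; the leaky ReLU activation, its slope, the weight initialization, and the helper names are assumptions for illustration, not details taken from these notes.

```python
import numpy as np

def init_mapping(rng, dim=512, n_layers=8):
    """Random (untrained) weights for an n_layers-deep MLP f: Z -> W."""
    return [(rng.standard_normal((dim, dim)) * np.sqrt(2.0 / dim), np.zeros(dim))
            for _ in range(n_layers)]

def leaky_relu(x, slope=0.2):
    # Activation choice and slope are assumptions for this sketch.
    return np.where(x > 0, x, slope * x)

def mapping(z, params):
    """Map latent codes z of shape (batch, dim) to intermediate codes w of the same shape."""
    x = z
    for weight, bias in params:
        x = leaky_relu(x @ weight + bias)
    return x

rng = np.random.default_rng(0)
params = init_mapping(rng)
z = rng.standard_normal((4, 512))   # z sampled from the input latent space Z
w = mapping(z, params)              # w in the intermediate latent space W
print(w.shape)
```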

Latent z

Generator with AdaIN (from Rani Horev’s blog). n = number of channels

The way I think of it: μ(x_i) and σ(x_i) are used to erase the existing style, and y_{s,i} and y_{b,i} are used to apply the new style.
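To make this concrete, here is a minimal NumPy sketch of the AdaIN operation as defined in the paper, AdaIN(x_i, y) = y_{s,i} (x_i - μ(x_i)) / σ(x_i) + y_{b,i}. In the real network y_s and y_b come from a learned affine transform of w; here they are simply passed in, and the function name `adain`, the `eps` term, and the toy shapes are assumptions of this sketch.

```python
import numpy as np

def adain(x, y_s, y_b, eps=1e-8):
    """x: feature maps of shape (channels, H, W); y_s, y_b: per-channel scale and bias."""
    mu = x.mean(axis=(1, 2), keepdims=True)      # per-channel mean: part of the old style
    sigma = x.std(axis=(1, 2), keepdims=True)    # per-channel std: part of the old style
    x_norm = (x - mu) / (sigma + eps)            # erase the old style
    return y_s[:, None, None] * x_norm + y_b[:, None, None]  # apply the new style

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8)) * 5.0 + 2.0   # arbitrary incoming statistics
y_s = np.array([1.0, 2.0, 0.5])
y_b = np.array([0.0, -1.0, 3.0])
out = adain(x, y_s, y_b)
print(out.mean(axis=(1, 2)))   # close to y_b
print(out.std(axis=(1, 2)))    # close to y_s
```

Whatever statistics x had on the way in, the output statistics are dictated by y_s and y_b alone.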

2.1. Quality of generated images

Fréchet inception distance (FID) for various generator designs (lower is better). In this paper we calculate the FIDs using 50,000 images drawn randomly from the training set, and report the lowest distance encountered over the course of training.

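C: the authors found that feeding the latent vector into the first convolution layer no longer helps, so they removed the input layer and simplified the architecture to start image synthesis from a 4x4x512 constant tensor. They also found that the network produces meaningful results even though its only input is the styles that control the AdaIN operations.

F: mixing regularization decorrelates neighboring styles and enables more fine-grained control over the generated images.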

The two datasets use different loss functions:

- CELEBA-HQ : WGAN-GP

- FFHQ : WGAN-GP for A and non-saturating loss with R1 regularization for B-F

-> These losses gave the best training results.
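For reference, the non-saturating loss and the R1 penalty can be written down in a few lines. This is a minimal sketch, not the authors' training code: the function names are my own, `gamma=10.0` is an assumed value, and the gradient of the discriminator output with respect to the real images (`grad_real`) is assumed to come from an autodiff framework.

```python
import numpy as np

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return np.logaddexp(0.0, x)

def g_loss_nonsaturating(d_fake_logits):
    """Generator loss: -log(sigmoid(D(G(z)))), averaged over the batch."""
    return softplus(-d_fake_logits).mean()

def d_loss_logistic(d_real_logits, d_fake_logits):
    """Discriminator loss: -log(sigmoid(D(x))) - log(1 - sigmoid(D(G(z))))."""
    return (softplus(-d_real_logits) + softplus(d_fake_logits)).mean()

def r1_penalty(grad_real, gamma=10.0):
    """R1 regularizer: (gamma / 2) * E[ ||grad_x D(x)||^2 ], on real images only."""
    sq_norm = (grad_real.reshape(len(grad_real), -1) ** 2).sum(axis=1)
    return 0.5 * gamma * sq_norm.mean()

print(g_loss_nonsaturating(np.zeros(4)))        # log(2) when D is undecided
print(r1_penalty(np.ones((2, 3)), gamma=10.0))  # 0.5 * 10 * 3
```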

For the figures, the authors avoided sampling from the extreme regions of W using the so-called truncation trick.

truncation trick: a trick not for training the GAN better, but for getting better samples when generating images from an already trained GAN.

The difference from earlier work is that the truncation trick used to be applied to the latent vector z, whereas this paper applies it to w, because w is what matters here. It is also applied only at low resolutions, so the high-resolution details are not affected.

All FIDs in the paper are computed without the truncation trick; it is used only for Figure 2 and the video.
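The truncation trick in W moves each w toward the average w: w' = w_avg + ψ (w - w_avg), with ψ < 1, where w_avg is the mean of f(z) over many samples. A minimal sketch, assuming one copy of w per style input (18 for a 1024x1024 generator); the function name and the defaults `psi=0.7` and `cutoff=8` are illustrative assumptions.

```python
import numpy as np

def truncate(w_per_layer, w_avg, psi=0.7, cutoff=8):
    """w_per_layer: (n_layers, dim), one row of w per style input.

    Only the first `cutoff` layers (the low resolutions) are pulled toward
    w_avg; the remaining high-resolution layers are left untouched.
    """
    out = w_per_layer.copy()
    out[:cutoff] = w_avg + psi * (out[:cutoff] - w_avg)
    return out

w_layers = np.full((18, 4), 2.0)   # toy w, broadcast to 18 style inputs
w_avg = np.zeros(4)                # toy average; in practice the mean of many mapped w
w_trunc = truncate(w_layers, w_avg)
print(w_trunc[0], w_trunc[-1])
```

With ψ = 1 nothing changes, and with ψ = 0 every truncated layer collapses to the average w.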

Figure 2

3. Properties of the style-based generator

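Styles can be modified in a scale-specific way.

The effects of each style are localized in the network, i.e., modifying a specific subset of the styles can be expected to affect only certain aspects of the image.

To see the reason for this localization, consider how the AdaIN operation first normalizes each channel by its mean and variance, and only then applies a scale and bias based on the style. The new per-channel statistics (dictated by the style, i.e. y_{s,i} and y_{b,i}) modify the relative importance of features for the subsequent convolution, but because of the normalization they do not depend on the original statistics. Each style therefore controls only one convolution before being overridden by the next AdaIN operation.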

3.1. Style mixing

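-> To further encourage the styles to localize, mixing regularization is applied.

-> Two random latent codes (z1, z2) are passed through the mapping network to obtain w1 and w2. We can simply switch from one latent code (w1) to the other (w2), since they are independent of each other.

-> w1 is applied before the crossover point and w2 after it. The crossover point is chosen at random.

-> This prevents adjacent styles from becoming correlated.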
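The crossover described in the steps above can be sketched as follows. This is a minimal sketch assuming 18 style inputs (two per resolution from 4x4 to 1024x1024); the function name `mix_styles` and its arguments are my own.

```python
import numpy as np

def mix_styles(w1, w2, n_layers=18, crossover=None, rng=None):
    """Return per-layer styles of shape (n_layers, dim): w1 before the crossover, w2 after."""
    if crossover is None:
        if rng is None:
            rng = np.random.default_rng()
        # Random crossover point, so that both codes are used at least once.
        crossover = int(rng.integers(1, n_layers))
    layer_idx = np.arange(n_layers)[:, None]
    styles = np.where(layer_idx < crossover, w1[None, :], w2[None, :])
    return styles, crossover

w1, w2 = np.zeros(4), np.ones(4)   # toy stand-ins for f(z1) and f(z2)
styles, crossover = mix_styles(w1, w2, crossover=5)
print(styles[:, 0])                # five zeros followed by thirteen ones
```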

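Each layer represents a different kind of style:

from 4x4 layer to 8x8 layer (coarse): large-scale changes such as pose, overall face shape, eyeglasses, and general hair style.

from 16x16 to 32x32 layer (middle): hairstyle, eyes open or closed.

from 64x64 to 1024x1024 layer (fine): fine details such as the color scheme.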
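To see which style inputs belong to each group, the resolution ranges above can be converted to layer indices. This small sketch assumes two style inputs per resolution, starting at 4x4 with indices 0 and 1; the helper name `style_layers` is hypothetical.

```python
import math

def style_layers(res_lo, res_hi):
    """Indices of the style inputs for resolutions res_lo..res_hi (powers of two, from 4)."""
    first = 2 * (int(math.log2(res_lo)) - 2)       # first style input of res_lo
    last = 2 * (int(math.log2(res_hi)) - 2) + 1    # second style input of res_hi
    return list(range(first, last + 1))

coarse = style_layers(4, 8)        # pose, face shape, eyeglasses
middle = style_layers(16, 32)      # hairstyle, eyes open or closed
fine = style_layers(64, 1024)      # color scheme and other fine details
print(coarse, middle, fine)
```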

3.2. Stochastic variation

In a traditional generator, whose only input is through the input layer, the network has to invent a way to generate spatially-varying pseudorandom numbers from earlier activations whenever they are needed. This is not always successful, as evidenced by the repetitive patterns seen in generated images. To avoid this problem, StyleGAN adds per-pixel noise after each convolution.
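A minimal sketch of this noise injection, assuming a single-channel Gaussian noise image that is broadcast to all feature maps with a per-channel scaling factor (learned in the real network, hand-picked here). The function name `add_noise` and the toy shapes are assumptions.

```python
import numpy as np

def add_noise(x, scale, rng):
    """x: conv output of shape (channels, H, W); scale: per-channel factors of shape (channels,)."""
    noise = rng.standard_normal((1,) + x.shape[1:])   # one noise value per pixel
    return x + scale[:, None, None] * noise           # broadcast to every channel

rng = np.random.default_rng(0)
x = np.zeros((3, 8, 8))
scale = np.array([0.0, 0.1, 1.0])   # channel 0 ignores the noise entirely
y = add_noise(x, scale, rng)
print(y.std(axis=(1, 2)))
```

Because the scaling factors are per channel, the network can decide for each feature map how much fresh randomness it wants.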

Figure 4. Examples of stochastic variation. (a) Two generated images. (b) Zoom-in with different realizations of input noise. While the overall appearance is almost identical, individual hairs are placed very differently. (c) Standard deviation of each pixel over 100 different realizations, highlighting which parts of the images are affected by the noise. The main areas are the hair, silhouettes, and parts of background, but there is also interesting stochastic variation in the eye reflections. Global aspects such as identity and pose are unaffected by stochastic variation.

We can see that the noise affects only the stochastic aspects, leaving the overall composition and high-level aspects such as identity intact.

The effect of the noise appears tightly localized in the network. -> The reason is explained at the bottom of page 5 of the paper.
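The per-pixel standard deviation map of Figure 4(c) is easy to reproduce in spirit. The sketch below does not use a real generator: it assumes a toy "image" in which only the top rows react to the noise, and then measures the standard deviation of each pixel over 100 noise realizations. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.standard_normal((16, 16))   # the part of the image fixed by the styles
noise_gain = np.zeros((16, 16))
noise_gain[:4] = 0.5                   # assumption: only the top rows ("hair") react to noise

# 100 realizations of the same image with different input noise.
samples = np.stack([base + noise_gain * rng.standard_normal((16, 16))
                    for _ in range(100)])
std_map = samples.std(axis=0)          # per-pixel standard deviation

print(std_map[:4].mean(), std_map[4:].max())
```

Regions with a high value in `std_map` are the ones the noise controls; everything else stays fixed across realizations.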

3.3. Separation of global effects from stochasticity (an explanation of how style and noise end up with separate effects)

changes to the style -> global effects (changing pose, identity, etc.)

noise -> inconsequential stochastic variation (differently combed hair, beard, etc.)

encode style of an image -> spatially invariant statistics (Gram matrix, channel-wise mean, variance, etc.)

encode specific instance -> spatially variant statistics

In StyleGAN, the style affects the entire image, because complete feature maps are scaled and biased with the same values.

The noise, on the other hand, is added independently to each pixel, which makes it well suited to controlling stochastic variation.

4. Disentanglement studies

disentangled representation: the latent variable describing an image is split into several parts, each carrying information about a different property of the image. (from the internet)

disentanglement -> a common goal is a latent space that consists of linear subspaces, each of which controls one factor of variation.

ex) We want this because we would like to find directions in the latent space such that moving along them changes the direction the face is looking, changes the gender, or makes the face gradually older or younger.
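Moving along such a direction is just vector arithmetic in the latent space. A minimal sketch, assuming a direction vector has already been found by some other means; the direction, the step sizes, and the function name `walk` are hypothetical.

```python
import numpy as np

def walk(w, direction, steps):
    """Return one latent code per step: w + alpha * unit(direction)."""
    direction = direction / np.linalg.norm(direction)
    return np.stack([w + alpha * direction for alpha in steps])

w = np.zeros(4)                                   # toy latent code
age_direction = np.array([2.0, 0.0, 0.0, 0.0])    # hypothetical "age" direction
codes = walk(w, age_direction, steps=[-1.0, 0.0, 1.0])
print(codes)
```

In a well disentangled latent space, feeding these codes to the generator would change one factor (for example age) while leaving the others alone.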

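Previously, the sampling density of z always had to be fitted to the training data, so the factors could not be fully disentangled.

In StyleGAN, the sampling density of W is instead induced by the learned piecewise continuous mapping f(z). This mapping can be adapted to “unwarp” W so that the factors of variation become more linear.

Moreover, the intermediate latent space W of StyleGAN does not have to follow any such fixed distribution.

It should be easier for the generator to produce realistic images from a disentangled representation than from an entangled one.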
