HoloGAN: Unsupervised learning of 3D representations from natural images

Uncategorized 2019. 6. 19. 22:34
Abstract:

HoloGAN => Unsupervised learning of 3D representations from natural images.

Our experiments show that using explicit 3D features enables HoloGAN to disentangle 3D pose and identity, which is further decomposed into shape and appearance, while still being able to generate images with similar or higher visual quality than other generative models.

Training is possible without pose labels, 3D shapes, or multiple views of the same object.

1. Introduction

Models before HoloGAN either required large amounts of labelled training data or produced blurry results.

Recent work has tried to solve this problem using 3D data, but 3D ground-truth data is too expensive to capture and reconstruct.

-> Therefore, there is also a practical motivation to directly learn 3D representations from unlabelled 2D images.

HoloGAN is the model built on this motivation.

HoloGAN learns to separate pose from identity (shape and appearance) only from unlabelled 2D images without sacrificing the visual fidelity of the generated images.

HoloGAN first learns a 3D representation, which is then transformed to a target pose, projected to 2D features, and rendered to generate the final images.

summary of main technical contributions

A novel architecture that combines a strong inductive bias about the 3D world with deep generative models to learn disentangled representations (pose, shape, and appearance) of 3D objects from images. The representation is explicit in 3D and expressive in semantics.

An unconditional GAN that, for the first time, allows native support for view manipulation without sacrificing visual image fidelity.

An unsupervised training approach that enables disentangled representation learning without using labels.

3. Model

->View manipulation therefore can be achieved by directly applying 3D rigid-body transformations to the learnt 3D features.

*rigid transformation

-> A transformation that preserves shape and size and changes only position and orientation.

-> That is, a transformation that permits only rotation and translation.
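The definition above can be checked numerically. The following is a minimal sketch, not code from the paper: the helper names `rigid_transform` and `pairwise_distances`, the 30-degree angle, and the translation vector are all illustrative assumptions. It applies a rotation plus a translation to a few points and confirms that all pairwise distances (and therefore shape and size) are unchanged.

```python
import numpy as np

def rigid_transform(points, R, t):
    """Apply x -> R x + t to an (N, 3) array of points (hypothetical helper)."""
    return points @ R.T + t

def pairwise_distances(p):
    """Matrix of Euclidean distances between all pairs of points."""
    return np.linalg.norm(p[:, None, :] - p[None, :, :], axis=-1)

# Rotation by 30 degrees about the z axis (arbitrary illustrative choice).
theta = np.deg2rad(30.0)
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
t = np.array([1.0, -2.0, 0.5])  # arbitrary translation

points = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0],
                   [0.0, 0.0, 3.0]])
moved = rigid_transform(points, R, t)

# Distances between points are preserved, so shape and size are unchanged.
same_shape = np.allclose(pairwise_distances(points), pairwise_distances(moved))
```

A rotation matrix is orthogonal with determinant 1, which is what guarantees the distance preservation checked here.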

1. Learn a 3D representation using 3D convolutions.

2. Transform this representation to a specific pose.

3. Project to 2D and compute visibility using the projection unit.

4. Compute the shaded colour value of each pixel of the final image using 2D convolutions.

HoloGAN shares many rendering insights with RenderNet.

However, HoloGAN works with natural images, needs no pre-training of the neural renderer, and needs no paired 3D shape-2D image training data.

During training, we sample random poses from a uniform distribution and transform the 3D features using these poses before rendering them to images.

-> Using explicit rigid-body transformations for novel-view synthesis has been shown to produce sharper images with fewer artefacts

-> More importantly, this provides an inductive bias towards representations that are compatible with explicit 3D rigid-body transformations, providing easy view manipulation.

As a result, the learnt representation is explicit in 3D, and pose and identity are disentangled.

Kulkarni et al. categorise the learnt disentangled representation into intrinsic and extrinsic elements.

intrinsic element -> shape, appearance

extrinsic element -> pose, lighting(location, intensity)

-> The 3D transform, which controls pose, is applied to the learnt 3D features, which control identity.

3.1. Learning 3D representations

Figure 3. HoloGAN’s generator network: we employ 3D convolutions, 3D rigid-body transformations, the projection unit and 2D convolutions. We also remove the traditional input layer from z, and start from a learnt constant 4D tensor. The latent vector z is instead fed through MLPs to map to the affine transformation parameters for adaptive instance normalisation (AdaIN). Inputs are coloured gray.

->HoloGAN generates 3D representations from a learnt constant tensor.

-> The random noise vector z acts as a "style" controller. It is mapped by an MLP to the affine parameters for AdaIN after each convolution.

Given some features Φl at layer l of an image x and the noise “style” vector z, AdaIN is defined as:
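In the symbols of the sentence above, the standard AdaIN form normalises the features and then rescales and shifts them with statistics derived from z. Treat the following as a reconstruction under that assumption: σ(z) and μ(z) denote the scale and shift that the MLP produces from z, while μ(Φl(x)) and σ(Φl(x)) are the mean and standard deviation of the features.

```latex
\mathrm{AdaIN}\big(\Phi_l(x), z\big)
  = \sigma(z)\,
    \frac{\Phi_l(x) - \mu\big(\Phi_l(x)\big)}{\sigma\big(\Phi_l(x)\big)}
  + \mu(z)
```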

-> Empirically, the authors report that this architecture disentangles pose and identity better than feeding the noise vector z directly into the first layer of the generator.
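The AdaIN step described above can be sketched in NumPy. This is an illustration under several assumptions, not the authors' code: the function name `adain`, the channels-last layout, the `eps` constant, and the single random linear layer standing in for the MLP are all hypothetical.

```python
import numpy as np

def adain(features, style_mean, style_std, eps=1e-5):
    """Normalise each channel of `features`, then apply a style scale and shift.

    features: array of shape (..., C), channels last (assumed layout).
    style_mean, style_std: arrays of shape (C,), assumed to come from an MLP on z.
    """
    axes = tuple(range(features.ndim - 1))           # all axes except channels
    mu = features.mean(axis=axes, keepdims=True)      # per-channel mean
    sigma = features.std(axis=axes, keepdims=True)    # per-channel std
    return style_std * (features - mu) / (sigma + eps) + style_mean

rng = np.random.default_rng(0)
C = 8
z = rng.normal(size=128)

# Hypothetical stand-in for the MLP: one linear layer giving 2*C numbers.
W = rng.normal(scale=0.1, size=(128, 2 * C))
params = z @ W
style_mean = params[:C]
style_std = np.abs(params[C:]) + 0.1    # kept positive for this illustration only

features = rng.normal(size=(4, 4, 4, C))  # a small 3D feature volume
out = adain(features, style_mean, style_std)
```

After this step, the per-channel mean and standard deviation of `out` match `style_mean` and `style_std` (up to `eps`), which is how z sets the "style" without being the generator's input layer.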

The next part describes two differences between HoloGAN and StyleGAN.

Looking only at the HoloGAN side:

1. HoloGAN learns 3D features from a 4D constant tensor (size 4x4x4x512, where the last dimension is the feature channel) before projecting them to 2D features to generate images.

2. HoloGAN learns 3D features by combining them with rigid-body transformations during training.

pose -> controlled by the 3D transformation

shape -> controlled by 3D features

appearance -> controlled by 2D features

3.2 Learning with view-dependent mappings

The learnt 3D features are transformed to a random pose before being projected to 2D images.

This random pose transformation lets HoloGAN learn a 3D representation that is disentangled and can be rendered from all possible views.

3.2.1 Rigid-body transformation

http://m.blog.daum.net/shksjy/228?tp_nil_a=1 -> This blog explains it well.

The rigid-body transformation is parameterised by 3D rotation; translation is not considered.
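To make the rotation of 3D features concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the names `rotation_y` and `rotate_features`, the choice of the y axis, and the nearest-neighbour resampling are simplifying assumptions (a smoother interpolation scheme could be substituted), and samples that fall outside the grid are simply set to zero.

```python
import numpy as np

def rotation_y(theta):
    """3x3 rotation matrix about the y axis, angle in radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c,   0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s,  0.0, c]])

def rotate_features(volume, R):
    """Resample a (W, H, D, C) feature volume under rotation R about its centre.

    Uses nearest-neighbour lookup (a simplification for this sketch).
    """
    W, H, D, C = volume.shape
    centre = (np.array([W, H, D]) - 1) / 2.0
    grid = np.stack(np.meshgrid(np.arange(W), np.arange(H), np.arange(D),
                                indexing="ij"), axis=-1)
    out_xyz = grid.reshape(-1, 3) - centre
    # Inverse mapping: row-vector product with R equals applying R^T,
    # i.e. for each output voxel we look up where it came from.
    src = out_xyz @ R + centre
    idx = np.rint(src).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array([W, H, D])), axis=1)
    out = np.zeros((W * H * D, C), dtype=volume.dtype)
    out[inside] = volume[idx[inside, 0], idx[inside, 1], idx[inside, 2]]
    return out.reshape(W, H, D, C)

rng = np.random.default_rng(0)
volume = rng.normal(size=(4, 4, 4, 2))

# A random pose: here just one angle drawn uniformly (range is an assumption).
theta = rng.uniform(0.0, 2.0 * np.pi)
posed = rotate_features(volume, rotation_y(theta))

# Sanity check: four quarter turns bring the volume back to where it started.
quarter = rotation_y(np.pi / 2)
back = volume
for _ in range(4):
    back = rotate_features(back, quarter)
```

Because the same feature volume is rendered under many such rotations during training, the features have to stay meaningful under rotation, which is the inductive bias described in Section 3.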

3.2.2. Projection unit

The projection unit is used to learn meaningful 3D representations from 2D images.

The projection unit takes a 4D tensor (3D features) and returns a 3D tensor (2D features).

The projection unit consists of a reshaping layer and an MLP (with a non-linear activation function). The reshaping layer concatenates the depth dimension and the channel dimension, so the tensor goes from 4D (W×H×D×C) to 3D (W×H×(D·C)). The MLP is there to learn occlusion; empirically, the authors used leakyReLU.
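The reshape-then-MLP structure described above can be sketched as follows. This is an assumption-laden illustration rather than the paper's code: `projection_unit` and `leaky_relu` are hypothetical names, the MLP is reduced to a single linear layer applied at every pixel, and the 0.2 slope and tensor sizes are arbitrary.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    """Leaky ReLU; the 0.2 slope is an assumed value."""
    return np.where(x > 0, x, slope * x)

def projection_unit(features_3d, weight, bias):
    """Collapse the depth axis of a (W, H, D, C) tensor into (W, H, C_out).

    Step 1 (reshape): merge depth and channels into one axis of size D*C.
    Step 2 (MLP): a per-pixel linear map plus non-linearity, which is where
    occlusion along the depth axis can be learnt.
    """
    W, H, D, C = features_3d.shape
    flat = features_3d.reshape(W, H, D * C)
    return leaky_relu(flat @ weight + bias)

rng = np.random.default_rng(0)
W_, H_, D_, C_, C_out = 4, 4, 4, 8, 16    # small illustrative sizes
features_3d = rng.normal(size=(W_, H_, D_, C_))
weight = rng.normal(scale=0.1, size=(D_ * C_, C_out))
bias = np.zeros(C_out)

features_2d = projection_unit(features_3d, weight, bias)   # shape (4, 4, 16)
flat = features_3d.reshape(W_, H_, D_ * C_)
```

In the reshaped tensor, entry `d * C + c` along the last axis holds channel `c` at depth `d`, so each pixel's MLP input is the full column of features along the viewing direction.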

3.3 Loss function

Identity regulariser

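-> Lidentity is used to generate higher-resolution (128x128) images.

Lidentity is a loss on whether the vector reconstructed from the generated image matches the latent vector z used by the generator G.

The authors found that this encourages z to maintain the object's identity as the pose varies.

The encoder network F shares most of the discriminator's convolution layers and adds a fully connected layer. It is used to obtain the vector reconstructed from the generated image.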
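A minimal sketch of such a reconstruction loss is given below. It assumes a squared L2 distance averaged over a batch, which these notes do not spell out; `identity_loss` is a hypothetical name, and `z_reconstructed` stands in for the encoder output F(G(z)).

```python
import numpy as np

def identity_loss(z, z_reconstructed):
    """Mean over the batch of the squared L2 distance between z and F(G(z)).

    z, z_reconstructed: arrays of shape (batch, latent_dim).
    The squared-L2 form is an assumption made for this sketch.
    """
    return np.mean(np.sum((z - z_reconstructed) ** 2, axis=1))

z = np.zeros((2, 3))
perfect = identity_loss(z, z)            # encoder recovers z exactly
off_by_one = identity_loss(z, z + 1.0)   # every coordinate is off by 1
```

With a 3-dimensional latent and every coordinate off by 1, the squared distance per sample is 3, so `off_by_one` evaluates to 3.0 while `perfect` is 0.0.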

Style discriminator

-> Our generator is designed to match the “style” of the training images at different levels, which effectively controls image attributes at different scales.

-> we propose multi-scale style discriminators that perform the same task but at the feature level.

-> In particular, the style discriminator tries to classify the mean μ(Φl) and standard deviation σ(Φl), which describe the image “style”
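The statistics mentioned above are straightforward to compute. The sketch below shows only that step, for a single channels-last feature map; `style_statistics` is a hypothetical name, and how the discriminator consumes these values is not covered here.

```python
import numpy as np

def style_statistics(features):
    """Per-channel mean and standard deviation of an (H, W, C) feature map."""
    mu = features.mean(axis=(0, 1))
    sigma = features.std(axis=(0, 1))
    return mu, sigma

# Toy feature map with 4 channels; channel c holds the values c, c+4, ..., c+20.
features = np.arange(24, dtype=float).reshape(2, 3, 4)
mu, sigma = style_statistics(features)
```

Here `mu` is `[10, 11, 12, 13]` and every channel has the same `sigma`, since the channels differ only by a constant offset.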

total loss

λi = λs = 1.0 for all experiments.
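Given the two weights, the total loss is presumably a weighted sum of the GAN loss and the two terms introduced in this section. The following is a sketch under that assumption, with L_GAN and L_style as assumed labels for the adversarial loss and the style discriminator loss:

```latex
\mathcal{L}_{\text{total}}(G, D)
  = \mathcal{L}_{\text{GAN}}(G, D)
  + \lambda_i\,\mathcal{L}_{\text{identity}}(G)
  + \lambda_s\,\mathcal{L}_{\text{style}}(G)
```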

5.4. Disentangling shape and appearance

HoloGAN also learns to further divide identity into shape and appearance.

latent code z1 -> controls the 3D features

latent code z2 -> controls the 2D features

same pose, same z1, but with different z2 at each row.

3D features control the object's shape.

2D features control the object's appearance (texture and lighting).

5.5 Ablation studies

Training without random 3D transformations

Randomly rotating the 3D feature -> to learn a disentangled representation between pose and identity

Training with traditional z input

The model confuses pose and identity: it also changes the object's identity when the object is being rotated.
