Learning Deep Features for Discriminative Localization

Uncategorized 2019. 8. 2. 00:22

One-line summary: using global average pooling improves localization accuracy.

0. Abstract

This paper revisits the global average pooling layer and examines how it lets a CNN acquire remarkable localization ability even though the network is trained only on image-level labels. Global average pooling was originally proposed as a means of regularizing training, but the authors find that it actually builds a generic localizable deep representation that exposes the CNN's implicit attention on an image. The paper reports that, although global average pooling is a simple addition, it brings a large gain in performance.

1. Introduction

Zhou et al. [34] found that the convolutional units in various layers of a convolutional neural network behave as object detectors even though no supervision on object location is provided. However, this ability to localize objects is lost once fully-connected layers are used, because a fully-connected layer cannot preserve spatial information. To address this, the paper uses global average pooling (GAP). GAP has usually served as a structural regularizer that prevents overfitting during training, but the experiments show that it does more than that: it lets the network retain its remarkable localization ability up to the final layer. With this change, the network can easily identify discriminative image regions in a single forward pass for a wide variety of tasks, even tasks it was not originally trained for.

Figure 1. A simple modification of the global average pooling layer combined with our class activation mapping (CAM) technique allows the classification-trained CNN to both classify the image and localize class-specific image regions in a single forward-pass e.g., the toothbrush for brushing teeth and the chainsaw for cutting trees.

1.1 Related Work

1.1.1 Weakly-supervised object localization

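Many previously proposed techniques show good results, but they are not trained end-to-end and need multiple forward passes of the network to localize objects, which makes them hard to scale to real-world datasets. The method proposed in this paper, by contrast, is trained end-to-end and localizes objects in a single forward pass.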

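The approach most similar to the one in this paper is global max pooling. Unlike global average pooling, global max pooling localizes a single point on the object, so its localization is limited to a point lying on the boundary of the object rather than covering the object's full extent.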

Rather than max pooling, we use global average pooling, which recognizes the complete extent of the object.
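The difference between the two pooling operations can be illustrated with a toy example. The sketch below is only an intuition aid, assuming two made-up 6x6 feature maps (the values and the helper names `global_average_pool` and `global_max_pool` are hypothetical, not from the paper): one map fires on a single spot, the other over the whole object.

```python
import numpy as np

def global_average_pool(feature_map):
    """Mean activation over all spatial positions of one feature map."""
    return float(feature_map.mean())

def global_max_pool(feature_map):
    """Largest single activation in one feature map."""
    return float(feature_map.max())

# Toy 6x6 feature maps (made-up values, for illustration only).
# "partial": the unit fires on just one small part of the object.
partial = np.zeros((6, 6))
partial[2, 2] = 1.0

# "full": the unit fires over the whole 3x3 extent of the object.
full = np.zeros((6, 6))
full[1:4, 1:4] = 1.0

gmp_partial, gmp_full = global_max_pool(partial), global_max_pool(full)
gap_partial, gap_full = global_average_pool(partial), global_average_pool(full)

print("max pooling:    ", gmp_partial, gmp_full)  # both maps get the same score
print("average pooling:", gap_partial, gap_full)  # the full-extent map scores higher
```

Under max pooling the two maps are indistinguishable, so training gives the network no reason to respond to more than the single most discriminative point. Under average pooling the score grows with coverage, which is one plausible reading of why GAP favors the full extent of the object.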

2. Class Activation Mapping

We use global average pooling in a CNN to generate class activation maps (CAMs). A CAM for a particular category indicates the discriminative image regions the CNN uses to identify that category. For example, consider a classification network trained on the MNIST dataset. Suppose the input image shows the digit 1 and the network classifies it as 1. The CAM then shows the discriminative image regions that the CNN has learned to rely on when classifying the image as 1.

Figure 3. The CAMs of two classes from ILSVRC [21]. The maps highlight the discriminative image regions used for image classification, the head of the animal for briard and the plates in barbell.

Our network consists largely of convolutional layers, and just before the final output layer we apply global average pooling to the convolutional feature maps. The pooled values are then used as features for a fully-connected layer that produces the desired output (categorical or otherwise).
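The tail of such a network can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the sizes `K`, `H`, `W`, `C`, the random `feature_maps` and `fc_weights`, the helper `gap_classifier`, the softmax output, and the omitted bias term are all choices made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: K feature maps of H x W, C output classes.
K, H, W, C = 8, 7, 7, 5

# Random stand-ins for the last convolutional output and the
# output-layer weights (a real network would learn these).
feature_maps = rng.random((K, H, W))
fc_weights = rng.standard_normal((C, K))

def gap_classifier(feature_maps, fc_weights):
    """Global average pooling followed by one fully-connected layer."""
    # GAP: one scalar per feature map (spatial average) -> shape (K,)
    pooled = feature_maps.mean(axis=(1, 2))
    # Fully-connected layer (bias omitted for simplicity) -> shape (C,)
    scores = fc_weights @ pooled
    # Softmax is assumed here for the categorical case.
    exp = np.exp(scores - scores.max())
    probs = exp / exp.sum()
    return pooled, scores, probs

pooled, scores, probs = gap_classifier(feature_maps, fc_weights)
print(pooled.shape, scores.shape, probs.sum())
```

Note that the fully-connected layer only ever sees one number per feature map, so each output weight is tied to exactly one feature map.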

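With this simple structure, we can identify the importance of image regions by projecting the weights of the output layer back onto the convolutional feature maps, a technique called class activation mapping (CAM).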

Figure 2. Class Activation Mapping: the predicted class score is mapped back to the previous convolutional layer to generate the class activation maps (CAMs). The CAM highlights the class-specific discriminative regions.

As the figure above shows, global average pooling outputs the spatial average of each channel (feature map) of the last convolutional layer. A weighted sum of these values produces the final output.
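Written out, the two steps look as follows. The notation is an assumption introduced here for illustration: $f_k(x, y)$ is the activation of channel $k$ at spatial position $(x, y)$, $Z$ is the number of spatial positions, $w_k^c$ is the output-layer weight connecting channel $k$ to class $c$, and $S_c$ is the score for class $c$ before any softmax (bias omitted).

```latex
\begin{aligned}
F_k &= \frac{1}{Z} \sum_{x,y} f_k(x, y), \\
S_c &= \sum_k w_k^c \, F_k
     = \frac{1}{Z} \sum_{x,y} \sum_k w_k^c \, f_k(x, y).
\end{aligned}
```

Swapping the order of the two sums shows that the class score is a spatial average of a per-location quantity, $\sum_k w_k^c f_k(x, y)$, which measures how much each position contributes to class $c$.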

Similarly, we compute a weighted sum of the feature maps of the last convolutional layer to obtain the CAM.
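This weighted sum is short to write down in NumPy. The sketch below is illustrative only: the sizes, the random `feature_maps` and `fc_weights`, and the helper name `class_activation_map` are assumptions, and the final min-max normalization is just one convenient choice for display.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and random stand-ins for a trained network.
K, H, W, C = 8, 7, 7, 5
feature_maps = rng.random((K, H, W))      # last conv layer output
fc_weights = rng.standard_normal((C, K))  # output-layer weights

def class_activation_map(feature_maps, fc_weights, class_idx):
    """Weighted sum of the K feature maps using the weights of one class."""
    # Contract the channel axis: (K,) with (K, H, W) -> (H, W)
    return np.tensordot(fc_weights[class_idx], feature_maps, axes=([0], [0]))

# Class scores from GAP followed by the output layer (bias omitted).
scores = fc_weights @ feature_maps.mean(axis=(1, 2))
predicted = int(np.argmax(scores))

cam = class_activation_map(feature_maps, fc_weights, predicted)

# Consistency check: the spatial average of the CAM is the class score.
assert np.isclose(cam.mean(), scores[predicted])

# Rescale to [0, 1] so the map can be shown as a heatmap.
cam_vis = (cam - cam.min()) / (cam.max() - cam.min())
print(cam.shape, predicted)
```

The consistency check makes the link to the classifier explicit: the same weights that produce the class score from the pooled features also produce the map, so no extra training or additional forward pass is needed.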
