Abstract

1. Introduction

2. Related Work

Set Prediction

Transformers and Parallel Decoding

Object Detection

3. The DETR model

Object Detection set prediction loss

\[\hat{\sigma} = \underset{\sigma \in \mathfrak{S}_N}{\text{argmin}} \sum_{i=1}^{N} \mathcal{L}_\text{match} (y_i, \hat{y}_{\sigma(i)})\]
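
To solve this assignment in practice, DETR computes the optimal $\hat{\sigma}$ with the Hungarian algorithm. Below is a minimal sketch using SciPy's `linear_sum_assignment`; the cost entries are random placeholders standing in for $\mathcal{L}_\text{match}(y_i, \hat{y}_j)$, which in DETR combines class-probability and box-similarity terms.

```python
# Minimal sketch: optimal bipartite matching via the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

N = 4                        # predictions = padded ground-truth slots
cost = np.random.rand(N, N)  # placeholder for L_match(y_i, y_hat_j)

row_ind, col_ind = linear_sum_assignment(cost)
# col_ind[i] plays the role of sigma_hat(i): ground-truth i is matched
# to prediction col_ind[i], minimizing the total matching cost.
total_cost = cost[row_ind, col_ind].sum()
```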

Bounding box loss
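
From the paper, the box loss is a linear combination of the $\ell_1$ loss and the scale-invariant generalized IoU loss $\mathcal{L}_\text{iou}$, with hyperparameters $\lambda_\text{iou}, \lambda_\text{L1} \in \mathbb{R}$; both terms are normalized by the number of objects in the batch:

\[\mathcal{L}_\text{box}(b_i, \hat{b}_{\sigma(i)}) = \lambda_\text{iou} \mathcal{L}_\text{iou}(b_i, \hat{b}_{\sigma(i)}) + \lambda_\text{L1} \left\| b_i - \hat{b}_{\sigma(i)} \right\|_1\]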

DETR architecture

Figure 2: the overall DETR architecture (CNN backbone → transformer encoder-decoder → prediction FFNs)

Backbone

Transformer encoder

  1. $1\times1$ convolution
    1. Input: high-level activation map $f$
    2. Dimensionality reduction: $C$ → $d$
    3. Output: $z_0 \in \mathbb{R}^{d \times H \times W}$
  2. Since the encoder expects a sequence as input, the spatial dimensions of $z_0$ are collapsed into one dimension → a $d \times HW$ feature map
  3. Each encoder layer has the standard architecture, consisting of a multi-head self-attention (MSA) module and a feed-forward network (FFN)
  4. Since the transformer architecture is permutation-invariant, fixed positional encodings are added to the input of each attention layer (a PyTorch sketch of these steps follows this list)
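
A minimal PyTorch sketch of the four steps above, assuming $C = 2048$ and $d = 256$ as in the paper ($H$, $W$, and the positional encoding tensor are illustrative stand-ins):

```python
import torch
import torch.nn as nn

C, d, H, W = 2048, 256, 25, 34
f = torch.rand(1, C, H, W)            # high-level activation map from the backbone

proj = nn.Conv2d(C, d, kernel_size=1) # step 1: 1x1 conv, C -> d channels
z0 = proj(f)                          # (1, d, H, W)

src = z0.flatten(2).permute(2, 0, 1)  # step 2: (HW, batch, d) sequence
pos = torch.rand(H * W, 1, d)         # stand-in for the fixed positional encoding

encoder_layer = nn.TransformerEncoderLayer(d_model=d, nhead=8)  # step 3: MSA + FFN
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
# Step 4 caveat: DETR adds `pos` at every attention layer, but the stock
# PyTorch encoder only lets us add it once at the input, so this is approximate.
memory = encoder(src + pos)           # (HW, 1, d)
```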

Transformer decoder
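
The decoder transforms $N$ learned embeddings (object queries) in parallel at each layer, rather than autoregressively, cross-attending to the encoder output. A minimal sketch with assumed sizes ($N = 100$, $d = 256$); note that DETR adds the query embeddings inside every attention layer, while the stock PyTorch decoder only accepts them once at the input:

```python
import torch
import torch.nn as nn

d, N = 256, 100
decoder_layer = nn.TransformerDecoderLayer(d_model=d, nhead=8)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
object_queries = nn.Embedding(N, d)       # learned positional embeddings

memory = torch.rand(850, 1, d)            # encoder output, (HW, batch, d)
tgt = object_queries.weight.unsqueeze(1)  # (N, 1, d), all N decoded in parallel
hs = decoder(tgt, memory)                 # (N, 1, d) output embeddings
```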

Prediction feed-forward networks (FFNs)
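
From the paper, the prediction head is a 3-layer perceptron with ReLU activations and hidden dimension $d$ that regresses the normalized box center, height, and width, plus a separate linear layer that predicts the class, including an extra "no object" ($\varnothing$) class. A minimal sketch, assuming $d = 256$ and 91 COCO classes; the final sigmoid keeps box coordinates in $[0, 1]$:

```python
import torch.nn as nn

d, num_classes = 256, 91

class_head = nn.Linear(d, num_classes + 1)  # class logits incl. "no object"

bbox_head = nn.Sequential(                  # 3-layer MLP with ReLU
    nn.Linear(d, d), nn.ReLU(),
    nn.Linear(d, d), nn.ReLU(),
    nn.Linear(d, 4), nn.Sigmoid(),          # normalized (cx, cy, w, h)
)
# Both heads are applied to each decoder output embedding hs:
# class_head(hs), bbox_head(hs)
```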

Auxiliary decoding losses

4. Experiments

4.1 Comparison with Faster R-CNN

Comparison Table with Faster R-CNN

4.2 Ablations

Number of encoder layers

Table 2

Figure 3

Number of decoder layers

Figure 6

Importance of FFN

Importance of positional encodings

Table 3

Interim summary

Loss ablations

Table 4

4.3 Analysis

Decoder output slot analysis

Figure 7

Generalization to unseen numbers of instances

4.4 DETR for panoptic segmentation

Figure 8

Main Result

Table 5

5. Conclusion