Paper notes + source code: DETR: End-to-End Object Detection with Transformers

2020-06-20 10:04:33

Tags: loss, End, outputs, self, Object, num, source code, queries, DETR

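0. Background needed for this paper

Object detection: understand the basic technical approaches of traditional object detection (e.g. anchor-based methods, non-maximum suppression, one-stage and two-stage detectors), and have a rough idea of the SOTA methods of the last few years (e.g. Faster R-CNN)
Transformer: understand how the Transformer works and know the self-attention mechanism
Bipartite matching: understand bipartite matching from graph theory and know the Hungarian algorithm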

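I. Key points of the abstract

1. Compared with the traditional pipeline: DETR removes many hand-designed components, such as non-maximum suppression and anchor design.
These hand-designed components all explicitly encode, to some degree, human prior knowledge about the task.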

2. Core of DETR:
(a) a set-based global loss that forces unique predictions via bipartite matching, and
(b) a transformer encoder-decoder architecture. (The Transformer used in this paper is non-autoregressive.)

For an introduction to non-autoregressive models, see https://zhuanlan.zhihu.com/p/82892975

3. What DETR does:
· Input: a fixed small set of learned object queries
· DETR reasons about:

(a) the relations of the objects
(b) the global image context

to directly output the final set of predictions in parallel

4. Architecture diagram:

A more detailed architecture diagram ↓:

II. Main text

1. First, object detection is framed as a set prediction problem.

2. The whole network is designed end-to-end and trained with a set loss function, which performs bipartite matching between predicted and ground-truth objects.

3. DETR is an innovation in architecture only; it introduces no new specialized layers (ResNet, for example, introduced skip connections, whereas DETR does not innovate at the layer level).

4. DETR's matching loss function uniquely assigns a prediction to a ground truth object (this one-to-one assignment is exactly what bipartite matching means), and it is invariant to a permutation of the predictions (this is why the problem is modeled as bipartite matching, here specifically on an undirected bipartite graph) → this is one reason the predictions can be made in parallel.

"Matching" here is a graph-theory concept; see https://www.renfei.org/blog/bipartite-matching.html
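For reference, the matching and the loss built on it can be written as below. This follows my reading of the formulas in the DETR paper: N is the number of predictions, y_i = (c_i, b_i) is the i-th ground-truth object (padded with "no object" ∅ up to N), and \hat{p}, \hat{b} are the predicted class probabilities and boxes.

```latex
% Optimal one-to-one assignment over all permutations of N elements
\hat{\sigma} = \arg\min_{\sigma \in \mathfrak{S}_N} \sum_{i=1}^{N}
    \mathcal{L}_{\text{match}}\bigl(y_i, \hat{y}_{\sigma(i)}\bigr)

% Pairwise matching cost (only real objects contribute)
\mathcal{L}_{\text{match}}\bigl(y_i, \hat{y}_{\sigma(i)}\bigr) =
    -\mathbb{1}_{\{c_i \neq \varnothing\}}\, \hat{p}_{\sigma(i)}(c_i)
    + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\text{box}}\bigl(b_i, \hat{b}_{\sigma(i)}\bigr)

% Hungarian loss computed on the optimal assignment
\mathcal{L}_{\text{Hungarian}}(y, \hat{y}) = \sum_{i=1}^{N} \Bigl[
    -\log \hat{p}_{\hat{\sigma}(i)}(c_i)
    + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\text{box}}\bigl(b_i, \hat{b}_{\hat{\sigma}(i)}\bigr)
\Bigr]
```

Because the minimum is taken over all permutations σ, reordering the predictions changes neither the optimal pairing nor the loss value.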

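5. On modeling the task as "Set Prediction":

The basic set prediction task is multilabel classification. The usual approach to multilabel classification is "one-vs-rest" (also called one-vs-all: one classifier is trained per label, treating that label's class as the "one" and all remaining classes together as the "rest"). This approach does not apply when there is underlying structure between the "elements", for example nearly identical predicted boxes (does not apply to problems such as detection where there is an underlying structure between elements (i.e., near-identical boxes)). It produces large numbers of almost identical results (near-duplicates). Traditional object detectors use post-processing (such as non-maximum suppression) to deal with these piles of near-identical predictions, but when the task is modeled as set prediction this post-processing is not needed. Set prediction does need a global strategy that models the relations between the "elements", so that the model avoids predicting too many useless, duplicated results.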
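To make the post-processing step concrete, here is a minimal sketch of greedy non-maximum suppression, the kind of hand-designed step that set prediction makes unnecessary. It is an illustration in plain Python, not code from the DETR repository; the function names, the (x1, y1, x2, y2) box format and the 0.5 threshold are assumptions chosen for the example.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap an already kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two near-identical boxes on one object plus one box on a second object.
boxes = [(10, 10, 50, 50), (12, 11, 51, 49), (100, 100, 140, 150)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # indices of the boxes that survive
```

The second box overlaps the first almost completely, so it is suppressed as a near-duplicate. The threshold is a hand-tuned hyperparameter, which is exactly the kind of prior knowledge DETR tries to remove.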

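6. On using "Bipartite Matching" as the "prediction → ground-truth" loss function:

In a set prediction problem, the loss function must be invariant by a permutation of the predictions (the order of the predicted values/boxes must not affect the loss). Bipartite matching, here specifically undirected bipartite matching, models the "prediction → ground-truth" relation as an undirected bipartite graph, and a matching on such a graph has no notion of order. Specifically, the Hungarian algorithm is used to solve the bipartite matching problem.

· Bipartite matching (1) guarantees invariance to the order of the predictions; (2) guarantees a one-to-one match between the two sides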
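The two properties above can be checked on a toy example. The sketch below is not the Hungarian algorithm: to stay dependency-free it finds the minimum-cost one-to-one assignment by brute force over permutations, which is only feasible for a handful of boxes (in practice one would use a real solver such as scipy.optimize.linear_sum_assignment). The cost values and the function name are made up for illustration.

```python
from itertools import permutations


def match(cost):
    """Brute-force minimum-cost matching (a stand-in for the Hungarian algorithm).

    cost[i][j] is the cost of assigning prediction i to ground truth j.
    Returns (total_cost, assignment) where assignment[j] is the index of the
    prediction matched to ground truth j. Assumes len(cost) >= len(cost[0]).
    """
    n_pred, n_gt = len(cost), len(cost[0])
    best = None
    for perm in permutations(range(n_pred), n_gt):
        total = sum(cost[perm[j]][j] for j in range(n_gt))
        if best is None or total < best[0]:
            best = (total, perm)
    return best


# 3 predictions, 2 ground-truth objects (illustrative costs).
cost = [
    [0.9, 0.1],  # prediction 0
    [0.2, 0.8],  # prediction 1
    [0.7, 0.6],  # prediction 2
]
total, assignment = match(cost)

# Reorder the predictions: the optimal total cost must not change.
shuffled = [cost[2], cost[0], cost[1]]
total_shuffled, assignment_shuffled = match(shuffled)
```

Each ground-truth object receives exactly one prediction and no prediction is used twice. The leftover prediction (index 2 in the original order) is the one that, in DETR, would be trained toward the "no object" class.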

7. More accurate predictions on large objects:

The paper says "a result likely enabled by the non-local computations of the transformer". "Non-local computations" here refers to the non-local concept from the paper Non-local Neural Networks (https://arxiv.org/pdf/1711.07971.pdf).

Non-local computations compute information over a "non-local" receptive field; see https://zhuanlan.zhihu.com/p/33345791
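A small sketch of what "non-local" means in practice: in self-attention, the output at every position is a weighted sum over all positions, so the receptive field is the whole input after a single layer. The code below is a simplified single-head scaled dot-product attention in plain Python, with identity query/key/value projections (an assumption made to keep it short; a real Transformer layer learns these projections).

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def self_attention(x):
    """Single-head self-attention with identity Q/K/V projections.

    x is a list of feature vectors, one per position. Returns the attended
    features and the attention weights (one row of weights per position).
    """
    d = len(x[0])
    weights = []
    for q in x:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights.append(softmax(scores))
    out = [[sum(w * v[c] for w, v in zip(row, x)) for c in range(d)]
           for row in weights]
    return out, weights


# Three "positions" (e.g. flattened feature-map locations) with 2-d features.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = self_attention(x)
```

Every row of `weights` is strictly positive and sums to 1, so each output position mixes information from all positions, near or far. A convolution with a small kernel, by contrast, only sees a local neighborhood per layer.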

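III. Results

IV. Source code discussion

In case the code changes later, I picked the latest commit at the time of writing (2020.06.18), 1fcfc65, to explain parts of the source code.

Overview of the DETR network structure: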

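import torch
from torch import nn
from util.misc import NestedTensor, nested_tensor_from_tensor_list  # helpers from the DETR repository
# MLP is a small feed-forward network defined alongside this class in the repository (models/detr.py)

class DETR(nn.Module):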
    """ This is the DETR module that performs object detection """
    def __init__(self, backbone, transformer, num_classes, num_queries, aux_loss=False):
        """ Initializes the model.
        Parameters:
            backbone: torch module of the backbone to be used. See backbone.py
            transformer: torch module of the transformer architecture. See transformer.py
            num_classes: number of object classes
            num_queries: number of object queries, ie detection slot. This is the maximal number of objects
                         DETR can detect in a single image. For COCO, we recommend 100 queries.
            aux_loss: True if auxiliary decoding losses (loss at each decoder layer) are to be used.
        """
        super().__init__()
        self.num_queries = num_queries
        self.transformer = transformer
        hidden_dim = transformer.d_model
        self.class_embed = nn.Linear(hidden_dim, num_classes + 1)
        self.bbox_embed = MLP(hidden_dim, hidden_dim, 4, 3)
        self.query_embed = nn.Embedding(num_queries, hidden_dim)
        self.input_proj = nn.Conv2d(backbone.num_channels, hidden_dim, kernel_size=1)
        self.backbone = backbone
        self.aux_loss = aux_loss

    def forward(self, samples: NestedTensor):
        """ The forward expects a NestedTensor, which consists of:
               - samples.tensor: batched images, of shape [batch_size x 3 x H x W]
               - samples.mask: a binary mask of shape [batch_size x H x W], containing 1 on padded pixels
            It returns a dict with the following elements:
               - "pred_logits": the classification logits (including no-object) for all queries.
                                Shape= [batch_size x num_queries x (num_classes + 1)]
               - "pred_boxes": The normalized boxes coordinates for all queries, represented as
                               (center_x, center_y, height, width). These values are normalized in [0, 1],
                               relative to the size of each individual image (disregarding possible padding).
                               See PostProcess for information on how to retrieve the unnormalized bounding box.
               - "aux_outputs": Optional, only returned when auxilary losses are activated. It is a list of
                                dictionnaries containing the two above keys for each decoder layer.
        """
        if not isinstance(samples, NestedTensor):
            samples = nested_tensor_from_tensor_list(samples)
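        features, pos = self.backbone(samples)  # the backbone is a CNN used for feature extraction

        # take the last feature level and split the NestedTensor into its tensor and padding mask
        src, mask = features[-1].decompose()
        assert mask is not None
        # The last-level features are passed into the Transformer as src. input_proj is a 1x1 convolution
        # that reduces the channel dimension to d_model (the model dimension).
        hs = self.transformer(self.input_proj(src), mask, self.query_embed.weight, pos[-1])[0]

        # To apply the Transformer to object detection, the authors add a "class embedding" network
        # and a "box embedding" network on top of the decoder output.
        outputs_class = self.class_embed(hs)
        # A sigmoid after the box embedding gives box coordinates normalized to [0, 1] (the paper predicts
        # normalized center coordinates, height and width relative to the input image size).
        outputs_coord = self.bbox_embed(hs).sigmoid()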
        out = {'pred_logits': outputs_class[-1], 'pred_boxes': outputs_coord[-1]}
        if self.aux_loss:
            out['aux_outputs'] = self._set_aux_loss(outputs_class, outputs_coord)
        return out

    @torch.jit.unused
    def _set_aux_loss(self, outputs_class, outputs_coord):
        # this is a workaround to make torchscript happy, as torchscript
        # doesn't support dictionary with non-homogeneous values, such
        # as a dict having both a Tensor and a list.
        return [{'pred_logits': a, 'pred_boxes': b}
                for a, b in zip(outputs_class[:-1], outputs_coord[:-1])]
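As a usage illustration, the sketch below shows how the "pred_logits" and "pred_boxes" entries of the returned dict could be turned into final detections for a single image. It is plain Python on nested lists rather than the repository's PostProcess module; the function names and the 0.7 score threshold are assumptions for the example, and boxes are taken to be normalized (center_x, center_y, width, height).

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def cxcywh_to_xyxy(box, img_w, img_h):
    """Convert a normalized (cx, cy, w, h) box to absolute (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return ((cx - w / 2) * img_w, (cy - h / 2) * img_h,
            (cx + w / 2) * img_w, (cy + h / 2) * img_h)


def decode(pred_logits, pred_boxes, img_w, img_h, threshold=0.7):
    """Keep queries whose best real-class probability reaches the threshold.

    pred_logits: one list of (num_classes + 1) logits per query; the last
                 entry is the "no object" class.
    pred_boxes:  one normalized (cx, cy, w, h) box per query.
    """
    results = []
    for logits, box in zip(pred_logits, pred_boxes):
        probs = softmax(logits)[:-1]  # drop the "no object" probability
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] >= threshold:
            results.append((label, probs[label], cxcywh_to_xyxy(box, img_w, img_h)))
    return results


# Two queries, two real classes plus "no object" (illustrative values).
pred_logits = [[5.0, 0.0, 0.0], [0.0, 0.0, 5.0]]
pred_boxes = [(0.5, 0.5, 0.2, 0.4), (0.1, 0.1, 0.1, 0.1)]
detections = decode(pred_logits, pred_boxes, img_w=640, img_h=480)
```

The second query puts almost all of its probability on "no object", so it is dropped without any NMS step. Only the first query survives, with its box scaled back to pixel coordinates.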

TBC. (I will fill in the unfinished parts soon; after all, I am learning as I read and writing it down as I go...)

Source: https://blog.csdn.net/weixin_36047799/article/details/106825645
