YoloV5 in PyTorch: building the object detection platform and training on your own data

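Contents

Preface
Source code download
YoloV5 improvements (incomplete)
I. Overall structure analysis
II. Network structure analysis
  1. The Backbone
  2. Building the FPN feature pyramid for enhanced feature extraction
  3. Getting predictions with the Yolo Head
III. Decoding the predictions
  1. Getting predicted boxes and scores
  2. Score filtering and non-maximum suppression
IV. Training
  1. What the loss computation needs
  2. Positive sample matching
    a. Matching anchors
    b. Matching feature points
  3. Computing the loss
Training your own YoloV5 model
  I. Preparing the dataset
  II. Processing the dataset
  III. Starting training
  IV. Predicting with the trained model

Preface

This article walks through an implementation of the YoloV5 object detection platform in PyTorch. I hope it is a useful reference.

Source code download

https://github.com/bubbliiiing/yolov5-pytorch

YoloV5 improvements (incomplete)

1. Backbone: the Focus structure is used. It takes one value from every other pixel of the image, which yields four independent feature layers; these four layers are then stacked. The width and height information is thereby folded into the channel dimension, and the number of input channels grows fourfold. This structure was used in YoloV5 before version 5; the latest version no longer uses it.

2. Data augmentation: Mosaic. Mosaic stitches four images together. According to the paper, its big advantage is that it enriches the backgrounds of the detected objects, and batch normalization statistics are computed over the data of four images at once.

3. Multiple positive samples per target: in earlier Yolo versions, each ground-truth box corresponded to one positive sample during training, i.e. each ground-truth box was predicted by exactly one anchor. To speed up training, YoloV5 increases the number of positive samples: during training, each ground-truth box can be predicted by several anchors.

These are not all of the improvements; there are others. Listed here are only the ones I find interesting and very effective.

I. Overall structure analysis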

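Before studying YoloV5 in detail, it helps to understand what the network as a whole does; this makes the later details easier to follow.

Like earlier versions of Yolo, YoloV5 can be divided into three parts: the Backbone, the FPN and the Yolo Head.

The Backbone is YoloV5's main feature extraction network. Based on its structure and the naming of earlier Yolo backbones, I usually call it CSPDarknet. The input image first goes through CSPDarknet for feature extraction. The extracted features are called feature layers, i.e. sets of features of the input image. From the backbone we take three feature layers to build the next part of the network; I call these the effective feature layers.

The FPN is YoloV5's enhanced feature extraction network. The three effective feature layers from the backbone are fused here, with the aim of combining feature information at different scales. In the FPN, the effective feature layers are used to keep extracting features. YoloV5 still uses the PANet structure: features are upsampled for fusion and then downsampled again for a second round of fusion.

The Yolo Head is YoloV5's classifier and regressor. After CSPDarknet and the FPN we have three enhanced effective feature layers, each with a width, a height and a number of channels. A feature map can then be viewed as a collection of feature points, each carrying as many features as there are channels. What the Yolo Head does is judge each feature point and decide whether an object corresponds to it. As in earlier versions of Yolo, the head in YoloV5 is coupled rather than decoupled: classification and regression are produced by the same 1x1 convolution.

So the work of the whole YoloV5 network is: feature extraction, feature enhancement, then prediction of the object associated with each feature point.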
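To make the three-stage pipeline concrete, here is a minimal sketch in plain Python of the grid size of the three effective feature layers. It assumes a square input and the strides 8, 16 and 32 that appear later in the decoding code; the function name `feature_map_sizes` is made up for illustration.

```python
def feature_map_sizes(input_size=640, strides=(8, 16, 32)):
    """Return the (height, width) grid of each effective feature layer."""
    return [(input_size // s, input_size // s) for s in strides]


# For a 640x640 input the three layers are 80x80, 40x40 and 20x20 grids
print(feature_map_sizes(640))  # [(80, 80), (40, 40), (20, 20)]
```

Each cell of these grids is one "feature point" in the sense used above.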

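II. Network structure analysis

1. The Backbone

The backbone feature extraction network used by YoloV5 is CSPDarknet. It has five important characteristics.

1. It uses residual blocks. A residual convolution in CSPDarknet has two parts: the main branch is one 1x1 convolution followed by one 3x3 convolution; the residual (shortcut) branch does no processing and simply adds the block's input to the main branch's output.

The whole YoloV5 backbone is built from residual convolutions: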

class Bottleneck(nn.Module):
    # Standard bottleneck
    def __init__(self, c1, c2, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, shortcut, groups, expansion
        super(Bottleneck, self).__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_, c2, 3, 1, g=g)
        self.add = shortcut and c1 == c2

    def forward(self, x):
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))

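Residual networks are easy to optimize and can gain accuracy from considerably increased depth. The skip connections in their residual blocks alleviate the vanishing-gradient problem that comes with adding depth to a neural network.

2. It uses the CSPNet structure. CSPNet is not complicated: it splits the original stack of residual blocks into two parts.

The main branch continues to stack the residual blocks.

The other branch acts like a residual edge: after a small amount of processing it is connected directly to the end.

So a CSP block can be thought of as containing one large residual edge.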

class C3(nn.Module):
    # CSP Bottleneck with 3 convolutions
    def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super(C3, self).__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = Conv(2 * c_, c2, 1)  # act=FReLU(c2)
        self.m = nn.Sequential(*[Bottleneck(c_, c_, shortcut, g, e=1.0) for _ in range(n)])
        # self.m = nn.Sequential(*[CrossConv(c_, c_, 3, 1, g, 1.0, shortcut) for _ in range(n)])

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))
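The channel bookkeeping of the C3 block can be checked without running the network. The sketch below is plain Python that mirrors the arithmetic in the C3 constructor; the helper name `c3_channels` is hypothetical and it only counts channels, it does not build any layers.

```python
def c3_channels(c1, c2, e=0.5):
    """Channel counts inside a C3 block with input c1 and output c2."""
    hidden = int(c2 * e)        # c_ in the C3 constructor
    main_branch = hidden        # cv1, then n Bottlenecks (hidden -> hidden)
    residual_edge = hidden      # cv2 only: the "large residual edge"
    concatenated = main_branch + residual_edge  # input of cv3
    return {"in": c1, "hidden": hidden, "concat": concatenated, "out": c2}


info = c3_channels(128, 128)
print(info["hidden"], info["concat"], info["out"])  # 64 128 128
```

The two branches each carry half of the output channels, and cv3 maps their concatenation back to c2.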

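3. It uses the Focus structure, one of the more interesting structures in YoloV5. It takes one value from every other pixel of the image, which yields four independent feature layers; these four layers are then stacked. The width and height information is thereby folded into the channel dimension and the number of input channels grows fourfold: the stacked feature layer has twelve channels instead of the original three.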

class Focus(nn.Module):
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):  # ch_in, ch_out, kernel, stride, padding, groups
        super(Focus, self).__init__()
        self.conv = Conv(c1 * 4, c2, k, s, p, g, act)

    def forward(self, x):
        return self.conv(torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]], 1))
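The slicing in Focus is easier to see on a tiny example. The sketch below applies the same four slices to a single 4x4 channel stored as nested Python lists; the function name `focus_slices` is made up for illustration, and a real image would have three such channels, giving 3 x 4 = 12 channels after stacking.

```python
def focus_slices(channel):
    """Split one H x W channel (list of rows) into four (H/2) x (W/2) channels."""
    even_even = [row[0::2] for row in channel[0::2]]  # x[..., ::2, ::2]
    odd_even = [row[0::2] for row in channel[1::2]]   # x[..., 1::2, ::2]
    even_odd = [row[1::2] for row in channel[0::2]]   # x[..., ::2, 1::2]
    odd_odd = [row[1::2] for row in channel[1::2]]    # x[..., 1::2, 1::2]
    return [even_even, odd_even, even_odd, odd_odd]


channel = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
for part in focus_slices(channel):
    print(part)
# [[0, 2], [8, 10]]
# [[4, 6], [12, 14]]
# [[1, 3], [9, 11]]
# [[5, 7], [13, 15]]
```

No pixel is discarded: every value of the 4x4 input appears exactly once in the four 2x2 outputs.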

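4. It uses the SiLU activation function, an improvement on Sigmoid and ReLU. SiLU is unbounded above, bounded below, smooth and non-monotonic. It tends to work better than ReLU on deep models and can be seen as a smooth version of ReLU.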

class SiLU(nn.Module):
    @staticmethod
    def forward(x):
        return x * torch.sigmoid(x)
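The listed properties of SiLU can be checked numerically. This is a small sketch using only the standard `math` module; it evaluates x * sigmoid(x) on scalars, so it is an illustration rather than a replacement for the module above.

```python
import math


def silu(x):
    """SiLU on a scalar: x * sigmoid(x), written as x / (1 + exp(-x))."""
    return x / (1.0 + math.exp(-x))


# Non-monotonic and bounded below: the value dips below zero for negative
# inputs and then climbs back towards zero as x becomes more negative.
for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(x, round(silu(x), 4))

# Unbounded above: for large positive x, silu(x) approaches x.
print(round(silu(10.0), 4))
```

For moderately negative inputs the output is slightly negative instead of exactly zero, which is where SiLU differs from ReLU.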

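5. It uses the SPP structure, which extracts features with max pooling at several kernel sizes to enlarge the network's receptive field. In YoloV4, SPP is used inside the FPN; in YoloV5, the SPP module is placed in the backbone.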

class SPP(nn.Module):
    # Spatial pyramid pooling layer used in YOLOv3-SPP
    def __init__(self, c1, c2, k=(5, 9, 13)):
        super(SPP, self).__init__()
        c_ = c1 // 2  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_ * (len(k) + 1), c2, 1, 1)
        self.m = nn.ModuleList([nn.MaxPool2d(kernel_size=x, stride=1, padding=x // 2) for x in k])

    def forward(self, x):
        x = self.cv1(x)
        return self.cv2(torch.cat([x] + [m(x) for m in self.m], 1))
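The pooled maps in SPP can be concatenated because pooling with stride 1 and padding k // 2 keeps the spatial size for odd k. A short sketch of that arithmetic in plain Python follows; the helper names `pooled_size` and `spp_channels` are hypothetical, and the size formula is the standard output-size rule for a pooling window without dilation.

```python
def pooled_size(n, k, stride=1):
    """Output length of max pooling with kernel k, the given stride and padding k // 2."""
    padding = k // 2
    return (n + 2 * padding - k) // stride + 1


def spp_channels(c1, c2, kernels=(5, 9, 13)):
    """Channel counts inside SPP: after cv1, after concatenation, after cv2."""
    hidden = c1 // 2
    concatenated = hidden * (len(kernels) + 1)  # the input plus one map per kernel
    return hidden, concatenated, c2


print([pooled_size(20, k) for k in (5, 9, 13)])  # [20, 20, 20]
print(spp_channels(1024, 1024))                  # (512, 2048, 1024)
```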

The full backbone implementation is:

import torch
import torch.nn as nn


class SiLU(nn.Module):
    @staticmethod
    def forward(x):
        return x * torch.sigmoid(x)


def autopad(k, p=None):
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]
    return p


class Focus(nn.Module):
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):  # ch_in, ch_out, kernel, stride, padding, groups
        super(Focus, self).__init__()
        self.conv = Conv(c1 * 4, c2, k, s, p, g, act)

    def forward(self, x):
        return self.conv(torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]], 1))


class Conv(nn.Module):
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):
        super(Conv, self).__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False)
        self.bn = nn.BatchNorm2d(c2, eps=0.001, momentum=0.03)
        self.act = SiLU() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

    def fuseforward(self, x):
        return self.act(self.conv(x))


class Bottleneck(nn.Module):
    # Standard bottleneck
    def __init__(self, c1, c2, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, shortcut, groups, expansion
        super(Bottleneck, self).__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_, c2, 3, 1, g=g)
        self.add = shortcut and c1 == c2

    def forward(self, x):
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))


class C3(nn.Module):
    # CSP Bottleneck with 3 convolutions
    def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super(C3, self).__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = Conv(2 * c_, c2, 1)  # act=FReLU(c2)
        self.m = nn.Sequential(*[Bottleneck(c_, c_, shortcut, g, e=1.0) for _ in range(n)])
        # self.m = nn.Sequential(*[CrossConv(c_, c_, 3, 1, g, 1.0, shortcut) for _ in range(n)])

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))


class SPP(nn.Module):
    # Spatial pyramid pooling layer used in YOLOv3-SPP
    def __init__(self, c1, c2, k=(5, 9, 13)):
        super(SPP, self).__init__()
        c_ = c1 // 2  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_ * (len(k) + 1), c2, 1, 1)
        self.m = nn.ModuleList([nn.MaxPool2d(kernel_size=x, stride=1, padding=x // 2) for x in k])

    def forward(self, x):
        x = self.cv1(x)
        return self.cv2(torch.cat([x] + [m(x) for m in self.m], 1))

class CSPDarknet(nn.Module):
    def __init__(self, base_channels, base_depth):
        super().__init__()
        #-----------------------------------------------#
        #   The input image is 640, 640, 3
        #   The initial base channel count is 64
        #-----------------------------------------------#

        #-----------------------------------------------#
        #   Feature extraction with the Focus structure
        #   640, 640, 3 -> 320, 320, 12 -> 320, 320, 64
        #-----------------------------------------------#
        self.stem = Focus(3, base_channels, k=3)
        #-----------------------------------------------#
        #   After the convolution: 320, 320, 64 -> 160, 160, 128
        #   After the CSP layer:   160, 160, 128 -> 160, 160, 128
        #-----------------------------------------------#
        self.dark2 = nn.Sequential(
            Conv(base_channels, base_channels * 2, 3, 2),
            C3(base_channels * 2, base_channels * 2, base_depth),
        )
        #-----------------------------------------------#
        #   After the convolution: 160, 160, 128 -> 80, 80, 256
        #   After the CSP layer:   80, 80, 256 -> 80, 80, 256
        #-----------------------------------------------#
        self.dark3 = nn.Sequential(
            Conv(base_channels * 2, base_channels * 4, 3, 2),
            C3(base_channels * 4, base_channels * 4, base_depth * 3),
        )
        #-----------------------------------------------#
        #   After the convolution: 80, 80, 256 -> 40, 40, 512
        #   After the CSP layer:   40, 40, 512 -> 40, 40, 512
        #-----------------------------------------------#
        self.dark4 = nn.Sequential(
            Conv(base_channels * 4, base_channels * 8, 3, 2),
            C3(base_channels * 8, base_channels * 8, base_depth * 3),
        )
        #-----------------------------------------------#
        #   After the convolution: 40, 40, 512 -> 20, 20, 1024
        #   After SPP:             20, 20, 1024 -> 20, 20, 1024
        #   After the CSP layer:   20, 20, 1024 -> 20, 20, 1024
        #-----------------------------------------------#
        self.dark5 = nn.Sequential(
            Conv(base_channels * 8, base_channels * 16, 3, 2),
            SPP(base_channels * 16, base_channels * 16),
            C3(base_channels * 16, base_channels * 16, base_depth, shortcut=False),
        )

    def forward(self, x):
        x = self.stem(x)
        x = self.dark2(x)
        #-----------------------------------------------#
        #   The output of dark3 is 80, 80, 256: an effective feature layer
        #-----------------------------------------------#
        x = self.dark3(x)
        feat1 = x
        #-----------------------------------------------#
        #   The output of dark4 is 40, 40, 512: an effective feature layer
        #-----------------------------------------------#
        x = self.dark4(x)
        feat2 = x
        #-----------------------------------------------#
        #   The output of dark5 is 20, 20, 1024: an effective feature layer
        #-----------------------------------------------#
        x = self.dark5(x)
        feat3 = x
        return feat1, feat2, feat3

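2. Building the FPN feature pyramid for enhanced feature extraction

In the feature utilization part, YoloV5 extracts several feature layers for object detection, three in total.

The three feature layers sit at different depths of the CSPDarknet backbone: the middle, lower-middle and bottom layers. For an input of (640,640,3), their shapes are feat1=(80,80,256), feat2=(40,40,512) and feat3=(20,20,1024).

With these three effective feature layers we build the FPN as follows (the names below describe the stages; the code stores some of these intermediate results under different variable names):

feat3=(20,20,1024) goes through one 1x1 convolution to adjust its channels, giving P5. P5 is upsampled and combined with feat2=(40,40,512), then a CSPLayer extracts features to give P5_upsample, with shape (40,40,512).
P5_upsample=(40,40,512) goes through one 1x1 convolution to adjust its channels, giving P4. P4 is upsampled and combined with feat1=(80,80,256), then a CSPLayer extracts features to give P3_out, with shape (80,80,256).
P3_out=(80,80,256) is downsampled by one 3x3 convolution and stacked with P4, then a CSPLayer extracts features to give P4_out, with shape (40,40,512).
P4_out=(40,40,512) is downsampled by one 3x3 convolution and stacked with P5, then a CSPLayer extracts features to give P5_out, with shape (20,20,1024).

The feature pyramid fuses feature layers of different shapes, which helps extract better features.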
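The channel counts along the top-down and bottom-up paths can be traced with a few lines of plain Python. This is only a sketch of the arithmetic implied by the YoloBody layer definitions in this section; the function name `panet_channels` and the dictionary keys are made up for illustration, and no layers are built.

```python
def panet_channels(base_channels=64):
    """Trace channel counts through the top-down and bottom-up fusion paths."""
    feat1, feat2, feat3 = base_channels * 4, base_channels * 8, base_channels * 16

    # Top-down path: 1x1 conv halves the channels, upsample, concatenate, CSP layer
    p5 = feat3 // 2                # 1x1 conv on feat3
    concat_40 = p5 + feat2         # upsampled P5 stacked with feat2
    p4 = feat2 // 2                # CSP layer back to feat2 channels, then 1x1 conv
    concat_80 = p4 + feat1         # upsampled P4 stacked with feat1
    p3_out = feat1                 # CSP layer

    # Bottom-up path: a stride-2 3x3 conv keeps the channels, concatenate, CSP layer
    concat_40_down = p3_out + p4   # downsampled P3_out stacked with P4
    p4_out = feat2
    concat_20_down = p4_out + p5   # downsampled P4_out stacked with P5
    p5_out = feat3

    return {
        "concat": (concat_40, concat_80, concat_40_down, concat_20_down),
        "outputs": (p3_out, p4_out, p5_out),
    }


trace = panet_channels(64)
print(trace["concat"])   # (1024, 512, 512, 1024)
print(trace["outputs"])  # (256, 512, 1024)
```

The output channel counts match the backbone's feature layers, so the head sees 256, 512 and 1024 channels at the 80x80, 40x40 and 20x20 scales when base_channels is 64.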

import torch
import torch.nn as nn

from nets.CSPdarknet import CSPDarknet, C3, Conv


#---------------------------------------------------#
#   yolo_body
#---------------------------------------------------#
class YoloBody(nn.Module):
    def __init__(self, anchors_mask, num_classes, phi):
        super(YoloBody, self).__init__()
        # phi selects the model size; the keys must be strings
        depth_dict = {'s': 0.33, 'm': 0.67, 'l': 1.00, 'x': 1.33}
        width_dict = {'s': 0.50, 'm': 0.75, 'l': 1.00, 'x': 1.25}
        dep_mul, wid_mul = depth_dict[phi], width_dict[phi]

        base_channels = int(wid_mul * 64)        # 64 when phi is 'l'
        base_depth = max(round(dep_mul * 3), 1)  # 3 when phi is 'l'
        #-----------------------------------------------#
        #   The input image is 640, 640, 3
        #   The initial base channel count is 64
        #-----------------------------------------------#

        #---------------------------------------------------#
        #   Build the CSPDarknet backbone
        #   It returns three effective feature layers with shapes:
        #   80, 80, 256
        #   40, 40, 512
        #   20, 20, 1024
        #---------------------------------------------------#
        self.backbone = CSPDarknet(base_channels, base_depth)

        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")

        self.conv_for_feat3 = Conv(base_channels * 16, base_channels * 8, 1, 1)
        self.conv3_for_upsample1 = C3(base_channels * 16, base_channels * 8, base_depth, shortcut=False)

        self.conv_for_feat2 = Conv(base_channels * 8, base_channels * 4, 1, 1)
        self.conv3_for_upsample2 = C3(base_channels * 8, base_channels * 4, base_depth, shortcut=False)

        self.down_sample1 = Conv(base_channels * 4, base_channels * 4, 3, 2)
        self.conv3_for_downsample1 = C3(base_channels * 8, base_channels * 8, base_depth, shortcut=False)

        self.down_sample2 = Conv(base_channels * 8, base_channels * 8, 3, 2)
        self.conv3_for_downsample2 = C3(base_channels * 16, base_channels * 16, base_depth, shortcut=False)

        self.yolo_head_P3 = nn.Conv2d(base_channels * 4, len(anchors_mask[2]) * (5 + num_classes), 1)
        self.yolo_head_P4 = nn.Conv2d(base_channels * 8, len(anchors_mask[1]) * (5 + num_classes), 1)
        self.yolo_head_P5 = nn.Conv2d(base_channels * 16, len(anchors_mask[0]) * (5 + num_classes), 1)

    def forward(self, x):
        # backbone
        feat1, feat2, feat3 = self.backbone(x)

        # Top-down path
        P5 = self.conv_for_feat3(feat3)
        P5_upsample = self.upsample(P5)
        P4 = torch.cat([P5_upsample, feat2], 1)
        P4 = self.conv3_for_upsample1(P4)

        P4 = self.conv_for_feat2(P4)
        P4_upsample = self.upsample(P4)
        P3 = torch.cat([P4_upsample, feat1], 1)
        P3 = self.conv3_for_upsample2(P3)

        # Bottom-up path
        P3_downsample = self.down_sample1(P3)
        P4 = torch.cat([P3_downsample, P4], 1)
        P4 = self.conv3_for_downsample1(P4)

        P4_downsample = self.down_sample2(P4)
        P5 = torch.cat([P4_downsample, P5], 1)
        P5 = self.conv3_for_downsample2(P5)

        #---------------------------------------------------#
        #   Third feature layer
        #   y3=(batch_size,75,80,80)
        #---------------------------------------------------#
        out2 = self.yolo_head_P3(P3)
        #---------------------------------------------------#
        #   Second feature layer
        #   y2=(batch_size,75,40,40)
        #---------------------------------------------------#
        out1 = self.yolo_head_P4(P4)
        #---------------------------------------------------#
        #   First feature layer
        #   y1=(batch_size,75,20,20)
        #---------------------------------------------------#
        out0 = self.yolo_head_P5(P5)
        return out0, out1, out2
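The `phi` argument selects the model size through the depth and width multipliers. The following plain-Python sketch reproduces that arithmetic from the constructor so the resulting `base_channels` and `base_depth` can be read off for each size; the function name `model_scale` is made up for illustration.

```python
DEPTH = {"s": 0.33, "m": 0.67, "l": 1.00, "x": 1.33}
WIDTH = {"s": 0.50, "m": 0.75, "l": 1.00, "x": 1.25}


def model_scale(phi):
    """Return (base_channels, base_depth) for a model size key."""
    base_channels = int(WIDTH[phi] * 64)
    base_depth = max(round(DEPTH[phi] * 3), 1)
    return base_channels, base_depth


for phi in ("s", "m", "l", "x"):
    print(phi, model_scale(phi))
# s (32, 1)
# m (48, 2)
# l (64, 3)
# x (80, 4)
```

The shapes quoted in the code comments (256, 512, 1024 channels) therefore correspond to phi = 'l'; the other sizes scale every channel count by the width multiplier.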

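3. Getting predictions with the Yolo Head

The FPN gives us three enhanced feature layers with shapes (20,20,1024), (40,40,512) and (80,80,256). These are passed to the Yolo Head to obtain the predictions.

For each feature layer, a single convolution adjusts the number of channels. The final channel count depends on the number of classes to distinguish. In YoloV5, every feature point on every feature layer has 3 anchors (prior boxes).

With the VOC dataset there are 20 classes, so the last dimension is 75 = 3x25 and the three feature layers have shapes (20,20,75), (40,40,75) and (80,80,75).

The 75 splits into 3 groups of 25, one group of 25 parameters per anchor; 25 splits into 4+1+20.
The first 4 parameters are the regression parameters of the feature point; applying them to the anchor gives the predicted box.
The 5th parameter indicates whether the feature point contains an object.
The last 20 parameters indicate which class of object the feature point contains.

With the COCO dataset there are 80 classes, so the last dimension is 255 = 3x85 and the three feature layers have shapes (20,20,255), (40,40,255) and (80,80,255).

The 255 splits into 3 groups of 85, one group of 85 parameters per anchor; 85 splits into 4+1+80.
The first 4 parameters are the regression parameters of the feature point; applying them to the anchor gives the predicted box.
The 5th parameter indicates whether the feature point contains an object.
The last 80 parameters indicate which class of object the feature point contains.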
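The channel arithmetic of the head can be written down directly. This is a minimal plain-Python sketch; the helper names `head_channels` and `split_prediction` are made up for illustration, and the input to `split_prediction` is just a list standing in for the values of one anchor at one feature point.

```python
def head_channels(num_classes, anchors_per_point=3):
    """Output channels of one head: each anchor predicts 4 + 1 + num_classes values."""
    return anchors_per_point * (5 + num_classes)


def split_prediction(values, num_classes):
    """Split the values of one anchor into box regression, objectness and class scores."""
    box = values[:4]
    objectness = values[4]
    class_scores = values[5:5 + num_classes]
    return box, objectness, class_scores


print(head_channels(20))  # 75  (VOC)
print(head_channels(80))  # 255 (COCO)

box, objectness, class_scores = split_prediction(list(range(25)), num_classes=20)
print(box, objectness, len(class_scores))  # [0, 1, 2, 3] 4 20
```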

The head is implemented inside the YoloBody class listed in the previous section: yolo_head_P3, yolo_head_P4 and yolo_head_P5 are the three 1x1 nn.Conv2d layers that map each enhanced feature layer to len(anchors_mask[i]) * (5 + num_classes) channels.

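III. Decoding the predictions

1. Getting predicted boxes and scores

From the second step we have the predictions of the three feature layers, with shapes (N,20,20,255), (N,40,40,255) and (N,80,80,255).

These predictions do not yet correspond to the positions of the final boxes on the image; they must be decoded first. In YoloV5, every feature point on every feature layer has 3 anchors.

The 255 channels of each feature layer split into 3 groups of 85, one group of 85 parameters per anchor, so we first reshape them to (N,20,20,3,85), (N,40,40,3,85) and (N,80,80,3,85). Shapes in the text are written channels-last; in the PyTorch code the tensors are (N,255,H,W) and are reshaped to (N,3,H,W,85).

The 85 splits into 4+1+80.

The first 4 parameters are the regression parameters of the feature point; applying them to the anchor gives the predicted box.
The 5th parameter indicates whether the feature point contains an object.
The last 80 parameters indicate which class of object the feature point contains.

Take the (N,20,20,3,85) layer as an example. It is equivalent to dividing the image into a 20x20 grid of feature points; if a feature point falls inside an object's box, it is used to predict that object.

Consider one of the 20x20 feature points and decode its three anchors:

1. Compute the predicted centers. The first two regression values are passed through a sigmoid and turned into an offset of 2 * sigmoid - 0.5, which shifts the center of each of the three anchors away from the feature point.

2. Compute the predicted width and height. The last two regression values are passed through a sigmoid, and (2 * sigmoid) ** 2 scales the anchor's width and height. (Earlier Yolo versions took an exponential here; the YoloV5 code in this section uses the sigmoid form.)

3. The resulting predicted boxes can now be drawn on the image.

Besides decoding, non-maximum suppression is still needed to prevent boxes of the same class from piling up.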
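The decoding of a single anchor can be sketched with scalars. The arithmetic mirrors the decode_box method in this section (2 * sigmoid - 0.5 for the center offset and (2 * sigmoid) ** 2 for the size), with two simplifying assumptions: the anchor is given in pixels and the result is returned in pixels rather than normalized. The function name `decode_one` and the example anchor (116, 90) are illustrative values, not taken from the source.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def decode_one(tx, ty, tw, th, grid_x, grid_y, anchor_w, anchor_h, stride):
    """Decode the four raw regression values of one anchor into a box in pixels."""
    # Center: the offset lies in (-0.5, 1.5) grid cells around the feature point
    cx = (sigmoid(tx) * 2.0 - 0.5 + grid_x) * stride
    cy = (sigmoid(ty) * 2.0 - 0.5 + grid_y) * stride
    # Size: the scale factor lies in (0, 4) times the anchor size
    w = (sigmoid(tw) * 2.0) ** 2 * anchor_w
    h = (sigmoid(th) * 2.0) ** 2 * anchor_h
    return cx, cy, w, h


# With all raw values at zero, sigmoid gives 0.5: the center sits in the middle
# of the cell and the box has exactly the anchor's size.
print(decode_one(0, 0, 0, 0, grid_x=10, grid_y=10, anchor_w=116, anchor_h=90, stride=32))
# (336.0, 336.0, 116.0, 90.0)
```

Because the size factor is bounded by 4, a predicted box can never be more than four times the anchor's width or height, unlike the exponential form used in earlier Yolo versions.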

def decode_box(self, inputs):
    outputs = []
    for i, input in enumerate(inputs):
        #-----------------------------------------------#
        #   There are three inputs, with shapes
        #   batch_size, 255, 20, 20
        #   batch_size, 255, 40, 40
        #   batch_size, 255, 80, 80
        #-----------------------------------------------#
        batch_size = input.size(0)
        input_height = input.size(2)
        input_width = input.size(3)

        #-----------------------------------------------#
        #   For a 640x640 input
        #   stride_h = stride_w = 32, 16, 8
        #-----------------------------------------------#
        stride_h = self.input_shape[0] / input_height
        stride_w = self.input_shape[1] / input_width
        #-------------------------------------------------#
        #   scaled_anchors are sized relative to the feature layer
        #-------------------------------------------------#
        scaled_anchors = [(anchor_width / stride_w, anchor_height / stride_h) for anchor_width, anchor_height in self.anchors[self.anchors_mask[i]]]

        #-----------------------------------------------#
        #   After reshaping, the three inputs have shapes
        #   batch_size, 3, 20, 20, 85
        #   batch_size, 3, 40, 40, 85
        #   batch_size, 3, 80, 80, 85
        #-----------------------------------------------#
        prediction = input.view(batch_size, len(self.anchors_mask[i]),
                                self.bbox_attrs, input_height, input_width).permute(0, 1, 3, 4, 2).contiguous()

        #-----------------------------------------------#
        #   Adjustment parameters for the anchor center
        #-----------------------------------------------#
        x = torch.sigmoid(prediction[..., 0])
        y = torch.sigmoid(prediction[..., 1])
        #-----------------------------------------------#
        #   Adjustment parameters for the anchor width and height
        #-----------------------------------------------#
        w = torch.sigmoid(prediction[..., 2])
        h = torch.sigmoid(prediction[..., 3])
        #-----------------------------------------------#
        #   Objectness confidence: is there an object
        #-----------------------------------------------#
        conf = torch.sigmoid(prediction[..., 4])
        #-----------------------------------------------#
        #   Class confidences
        #-----------------------------------------------#
        pred_cls = torch.sigmoid(prediction[..., 5:])

        FloatTensor = torch.cuda.FloatTensor if x.is_cuda else torch.FloatTensor
        LongTensor = torch.cuda.LongTensor if x.is_cuda else torch.LongTensor

        #----------------------------------------------------------#
        #   Build the grid: anchor centers at the top-left corner of each cell
        #   batch_size,3,20,20
        #----------------------------------------------------------#
        grid_x = torch.linspace(0, input_width - 1, input_width).repeat(input_height, 1).repeat(
            batch_size * len(self.anchors_mask[i]), 1, 1).view(x.shape).type(FloatTensor)
        grid_y = torch.linspace(0, input_height - 1, input_height).repeat(input_width, 1).t().repeat(
            batch_size * len(self.anchors_mask[i]), 1, 1).view(y.shape).type(FloatTensor)

        #----------------------------------------------------------#
        #   Lay out the anchor widths and heights on the same grid
        #   batch_size,3,20,20
        #----------------------------------------------------------#
        anchor_w = FloatTensor(scaled_anchors).index_select(1, LongTensor([0]))
        anchor_h = FloatTensor(scaled_anchors).index_select(1, LongTensor([1]))
        anchor_w = anchor_w.repeat(batch_size, 1).repeat(1, 1, input_height * input_width).view(w.shape)
        anchor_h = anchor_h.repeat(batch_size, 1).repeat(1, 1, input_height * input_width).view(h.shape)

        #----------------------------------------------------------#
        #   Adjust the anchors with the predictions:
        #   first shift the center away from the grid point,
        #   then scale the anchor width and height.
        #----------------------------------------------------------#
        pred_boxes = FloatTensor(prediction[..., :4].shape)
        pred_boxes[..., 0] = x.data * 2. - 0.5 + grid_x
        pred_boxes[..., 1] = y.data * 2. - 0.5 + grid_y
        pred_boxes[..., 2] = (w.data * 2) ** 2 * anchor_w
        pred_boxes[..., 3] = (h.data * 2) ** 2 * anchor_h

        #----------------------------------------------------------#
        #   Normalize the boxes to fractions of the feature-map size
        #----------------------------------------------------------#
        _scale = torch.Tensor([input_width, input_height, input_width, input_height]).type(FloatTensor)
        output = torch.cat((pred_boxes.view(batch_size, -1, 4) / _scale,
                            conf.view(batch_size, -1, 1), pred_cls.view(batch_size, -1, self.num_classes)), -1)
        outputs.append(output.data)
    return outputs
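Each element of `outputs` has shape (batch_size, 3 * H * W, 5 + num_classes), so the number of candidate boxes per image follows from the grid sizes. A small plain-Python sketch of that count, assuming a square input and three anchors per feature point; the function name `predictions_per_layer` is made up for illustration.

```python
def predictions_per_layer(input_size=640, strides=(32, 16, 8), anchors_per_point=3):
    """Number of decoded boxes produced by each feature layer."""
    return [anchors_per_point * (input_size // s) ** 2 for s in strides]


per_layer = predictions_per_layer(640)
print(per_layer)       # [1200, 4800, 19200]
print(sum(per_layer))  # 25200
```

This is the number of candidates that score filtering and non-maximum suppression have to reduce in the next step.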