
Computer Vision and Object Detection

2023-04-17 11:18

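I first learned about object detection through the Tensorflow Object Detection API. It was easy to use: I passed in a picture of a beach and, in return, the API drew boxes over the objects it recognized. It seemed magical.

I was curious and wanted to dissect the API to understand how it actually works under the hood. It was hard, and I failed. The Tensorflow Object Detection API supports state-of-the-art models that are the product of decades of research. They are intricately woven into the code, the way a watchmaker fits tiny gears together so that they move coherently.

However, most current state-of-the-art models are built on the foundation laid by the Faster RCNN model, which remains one of the most cited papers in computer vision even today. Understanding it is therefore crucial.

In this article, we will break down the Faster RCNN paper, understand how it works, and partially build it in PyTorch to understand its nuances.

Faster R-CNN Overview

For object detection, we need to build a model and teach it to recognize and localize objects in an image.

The Faster R-CNN model takes the following approach: the image first passes through a backbone network to get an output feature map. The backbone is usually a convolutional network such as ResNet or VGG16. The output feature map is a spatially dense tensor that represents the learned features of the image. Next, we generate multiple boxes of different sizes and shapes, called anchor boxes. The purpose of these anchor boxes is to capture the objects in the image.

We use a 1x1 convolutional network to predict the category and offsets of all the anchor boxes. During training, we sample the anchor boxes that overlap the most with the ground truth boxes. These are called positive anchor boxes. We also sample negative anchor boxes, which have little to no overlap with the ground truth boxes.

The network learns to classify anchor boxes using binary cross-entropy loss. Now, the positive anchor boxes may not align exactly with the ground truth boxes. So we train a similar 1x1 convolutional network to predict their offsets from the ground truth boxes. When applied to the anchor boxes, these offsets bring them closer to the ground truth boxes.

We use an L2 regression loss to learn the offsets. The anchor boxes transformed by the predicted offsets are called region proposals, and the network described above is called the region proposal network. This is the first stage of the detector. Faster RCNN is a two-stage detector, so there is a second stage as well.

The input to stage 2 is the region proposals generated in stage 1. In stage 2, we learn to predict the category of the object in each region proposal using a simple convolutional network. Now, the proposals are of different sizes, so we use a technique called ROI pooling to resize them before passing them through the network. The network learns to predict multiple categories using cross-entropy loss.

We use another network to predict the offsets of the region proposals from the ground truth boxes. This network tries to further align the predicted boxes with the ground truth boxes, and it uses an L2 regression loss. Finally, we take a weighted combination of both losses to compute the final loss. In the second stage, we learn to predict both categories and offsets. This is called multi-task learning.

All of this happens during training. During inference, we pass the image through the backbone network and generate anchor boxes, same as before. This time, however, we select only the top 300 boxes that get a high classification score in the first stage and qualify them for the second stage.

In the second stage, we predict the final categories and offsets. We also perform an extra post-processing step that removes duplicate bounding boxes using a technique called non-maximum suppression. If everything works as expected, the detector identifies the objects in the image and draws boxes over them, as shown below:

That was a brief overview of the two-stage Faster RCNN network. In the following sections, we will dig deeper into each part.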

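Setting Up the Environment

All the code used can be found in this GitHub repository. We don't need many dependencies because we will build from scratch. Installing the PyTorch library in a standard anaconda environment is enough.

https://github.com/wingedrasengan927/pytorch-tutorials/tree/master/Object%20Detection

This is the main notebook we will use: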

https://gist.github.com/wingedrasengan927/3d5eb6f1b0d4fb3acbf2550f9db8daf0#file-faster-r-cnn-ipynb

%load_ext autoreload
%autoreload 2

import numpy as np
from skimage import io
from skimage.transform import resize
import matplotlib.pyplot as plt
import random
import matplotlib.patches as patches
from utils import *
from model import *
import os

import torch
import torchvision
from torchvision import ops
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
from torch.nn.utils.rnn import pad_sequence

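Preparing and Loading the Data

First, we need some sample images to work with. Here I have downloaded two high-resolution images (from the DIV2K dataset, credited at the end of this article).

Next, we need to annotate these images. CVAT is one of the popular open-source annotation tools available today.

You simply load the images into the tool, draw boxes around the relevant objects, and label their classes, as shown below:

Once done, you can export the annotations in your preferred format. Here, I have exported them in the CVAT for images 1.1 xml format.

The annotation file contains all the information about the images, the labeled classes, and the bounding box coordinates.

PyTorch Dataset and DataLoader

In PyTorch, it is considered best practice to load data through a class that inherits from PyTorch's Dataset class. This gives us more control over the data and helps keep the code modular. We can also create a PyTorch DataLoader from the dataset instance, which takes care of batching, shuffling, and sampling the data automatically.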

class ObjectDetectionDataset(Dataset):
    '''
    A Pytorch Dataset class to load the images and their corresponding annotations.

    Returns
    ------------
    images: torch.Tensor of size (B, C, H, W)
    gt bboxes: torch.Tensor of size (B, max_objects, 4)
    gt classes: torch.Tensor of size (B, max_objects)
    '''
    def __init__(self, annotation_path, img_dir, img_size, name2idx):
        self.annotation_path = annotation_path
        self.img_dir = img_dir
        self.img_size = img_size
        self.name2idx = name2idx

        self.img_data_all, self.gt_bboxes_all, self.gt_classes_all = self.get_data()

    def __len__(self):
        return self.img_data_all.size(dim=0)

    def __getitem__(self, idx):
        return self.img_data_all[idx], self.gt_bboxes_all[idx], self.gt_classes_all[idx]

    def get_data(self):
        img_data_all = []
        gt_idxs_all = []

        gt_boxes_all, gt_classes_all, img_paths = parse_annotation(self.annotation_path, self.img_dir, self.img_size)

        for i, img_path in enumerate(img_paths):
            # skip if the image path is not valid
            if (not img_path) or (not os.path.exists(img_path)):
                continue

            # read and resize image
            img = io.imread(img_path)
            img = resize(img, self.img_size)

            # convert image to torch tensor and reshape it so channels come first
            img_tensor = torch.from_numpy(img).permute(2, 0, 1)

            # encode class names as integers
            gt_classes = gt_classes_all[i]
            gt_idx = torch.Tensor([self.name2idx[name] for name in gt_classes])

            img_data_all.append(img_tensor)
            gt_idxs_all.append(gt_idx)

        # pad bounding boxes and classes so they are of the same size
        gt_bboxes_pad = pad_sequence(gt_boxes_all, batch_first=True, padding_value=-1)
        gt_classes_pad = pad_sequence(gt_idxs_all, batch_first=True, padding_value=-1)

        # stack all images
        img_data_stacked = torch.stack(img_data_all, dim=0)

        return img_data_stacked.to(dtype=torch.float32), gt_bboxes_pad, gt_classes_pad

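In the class above, we define a function called get_data that loads the annotation file and parses it to extract the image paths, labeled classes, and bounding box coordinates, which are then converted to PyTorch Tensor objects. The images are resized to a fixed size.

Note that we pad the bounding boxes. Combined with resizing, this allows us to batch images together.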
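To make the padding step concrete, here is a minimal sketch in plain Python, without PyTorch. The helper name pad_boxes and the sample coordinates are made up for illustration; the only detail taken from the article is the padding value of -1.

```python
def pad_boxes(boxes_per_image, pad_value=-1.0):
    """Pad each image's list of [xmin, ymin, xmax, ymax] boxes to a common length."""
    max_objects = max(len(boxes) for boxes in boxes_per_image)
    padded = []
    for boxes in boxes_per_image:
        n_missing = max_objects - len(boxes)
        # keep the real boxes, then append placeholder rows filled with pad_value
        padded.append([list(b) for b in boxes] + [[pad_value] * 4 for _ in range(n_missing)])
    return padded


batch = [
    [[10, 20, 50, 80], [30, 40, 90, 120]],  # image with two objects
    [[5, 5, 25, 25]],                       # image with one object
]
padded = pad_boxes(batch)
# every image now has the same number of rows, so the batch is rectangular
```

Because every image ends up with max_objects rows, the batch can be stored as a single (B, max_objects, 4) tensor, and the rows filled with -1 can be masked out later.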

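We can take a few images from the DataLoader and visualize them as shown below:

Backbone Network

Here we will use ResNet 50 as the backbone network. Recall that a single block in ResNet 50 is a stack of bottleneck layers. After each block, the image is halved along the spatial dimensions while the number of channels doubles. A bottleneck layer consists of three convolutional layers along with a skip connection, as shown below:

We will use the first four blocks of ResNet 50 as the backbone network.

Once the image passes through the backbone network, it is downsampled along the spatial dimensions. The output is a feature-rich representation of the image.

If we pass an image that is 640 pixels wide and 480 pixels high through the backbone network, we get an output feature map of size (15, 20), that is, 15 rows by 20 columns. The image has therefore been downscaled by a factor of (32, 32).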
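As a quick check of this arithmetic, the sketch below assumes that the backbone reduces each spatial dimension by a fixed stride of 32, as stated above, and that the image size divides evenly. The helper feature_map_size is hypothetical and is not part of the repository.

```python
def feature_map_size(img_height, img_width, stride=32):
    """Spatial size of the backbone output, assuming a fixed total stride."""
    return img_height // stride, img_width // stride


img_height, img_width = 480, 640
out_h, out_w = feature_map_size(img_height, img_width)

# scale factors used later to move boxes between image space and feature space
height_scale_factor = img_height // out_h
width_scale_factor = img_width // out_w
```

With these numbers, a point at column 10 of the feature map corresponds to pixel column 10 * 32 = 320 in the image.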

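Generating Anchor Points

We treat each point in the feature map as an anchor point. The anchor points are therefore just arrays representing the coordinates along the width and height dimensions.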

def gen_anc_centers(out_size):
    out_h, out_w = out_size

    # the + 0.5 places each anchor point at the center of its feature map cell
    anc_pts_x = torch.arange(0, out_w) + 0.5
    anc_pts_y = torch.arange(0, out_h) + 0.5

    return anc_pts_x, anc_pts_y

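To visualize these anchor points, we can simply project them onto the image space by multiplying by the width and height scale factors.

Generating Anchor Boxes

For each anchor point, we generate nine bounding boxes of different shapes and sizes. We choose the sizes and shapes of these boxes so that they enclose all the objects in the image. The choice of anchor boxes usually depends on the dataset.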
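The nine boxes come from pairing three scales with three aspect ratios. The sketch below is a plain-Python illustration that assumes the scales [2, 4, 6] and ratios [0.5, 1, 1.5] used by the region proposal network class later in this article, with width = scale * ratio and height = scale as in gen_anc_base. The helper names anchor_shapes and anchor_box are made up for the example.

```python
from itertools import product


def anchor_shapes(scales, ratios):
    """Return (width, height) for every scale/ratio pair, in feature-map units."""
    return [(scale * ratio, scale) for scale, ratio in product(scales, ratios)]


def anchor_box(xc, yc, w, h):
    """Box of size (w, h) centered on (xc, yc), as (xmin, ymin, xmax, ymax)."""
    return (xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2)


shapes = anchor_shapes([2, 4, 6], [0.5, 1, 1.5])

# all anchor boxes for the anchor point at the center of feature-map cell (7, 7)
boxes = [anchor_box(7.5, 7.5, w, h) for w, h in shapes]
```

Note that this sketch does not clip the boxes to the feature map, whereas gen_anc_base does.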

def gen_anc_base(anc_pts_x, anc_pts_y, anc_scales, anc_ratios, out_size):
    n_anc_boxes = len(anc_scales) * len(anc_ratios)
    # shape - [1, Wmap, Hmap, n_anchor_boxes, 4] (x index first, then y index)
    anc_base = torch.zeros(1, anc_pts_x.size(dim=0),
                           anc_pts_y.size(dim=0), n_anc_boxes, 4)

    for ix, xc in enumerate(anc_pts_x):
        for jx, yc in enumerate(anc_pts_y):
            anc_boxes = torch.zeros((n_anc_boxes, 4))
            c = 0
            for i, scale in enumerate(anc_scales):
                for j, ratio in enumerate(anc_ratios):
                    w = scale * ratio
                    h = scale

                    xmin = xc - w / 2
                    ymin = yc - h / 2
                    xmax = xc + w / 2
                    ymax = yc + h / 2

                    anc_boxes[c, :] = torch.Tensor([xmin, ymin, xmax, ymax])
                    c += 1

            # clip the boxes so they stay inside the feature map
            anc_base[:, ix, jx, :] = ops.clip_boxes_to_image(anc_boxes, size=out_size)

    return anc_base

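Another advantage of resizing the images is that the anchor boxes can be duplicated across all images.

Again, to visualize the anchor boxes, we project them onto the image space by multiplying by the width and height scale factors.

If we visualize all the anchor boxes of all the anchor points, this is how it looks:

Data Preparation

In this section, we will discuss data preparation for training.

Positive and Negative Anchor Boxes

We only need to sample a few anchor boxes for training. We sample both positive and negative anchor boxes.

Positive anchor boxes contain an object and negative anchor boxes don't. To sample positive anchor boxes, we select the anchor boxes that either have an IoU greater than 0.7 with any of the ground truth boxes or have the highest IoU for a given ground truth box. When the anchor boxes are poorly generated, the threshold condition fails, and the highest-IoU condition comes to the rescue, since it selects one positive box for every ground truth box. To sample negative anchor boxes, we select the anchor boxes whose IoU with every ground truth box is less than 0.3. Usually, the number of negative samples will be far higher than the number of positive samples, so we randomly sample a few of them to match the count of positive samples. IoU (intersection over union) is a metric that measures the overlap between two bounding boxes.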
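For a single pair of boxes, IoU can be worked out by hand. The following is a minimal plain-Python sketch, not the batched torchvision routine that the article's code relies on. The function label_anchor and its "ignored" label for anchors between the two thresholds are illustrative assumptions, and it looks at one IoU value only, so it leaves out the highest-IoU rule.

```python
def box_iou(box_a, box_b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    # corners of the intersection rectangle
    ix_min, iy_min = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix_max, iy_max = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_anchor(iou, pos_thresh=0.7, neg_thresh=0.3):
    """Classify an anchor from its IoU with a ground truth box (threshold rule only)."""
    if iou > pos_thresh:
        return "positive"
    if iou < neg_thresh:
        return "negative"
    return "ignored"


weak_overlap = box_iou((0, 0, 4, 4), (2, 2, 6, 6))    # intersection 4, union 28
strong_overlap = box_iou((0, 0, 4, 4), (0, 0, 4, 5))  # intersection 16, union 20
```

Here the first pair would be labeled negative and the second positive under the thresholds quoted above.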

def get_iou_mat(batch_size, anc_boxes_all, gt_bboxes_all):
    # flatten anchor boxes
    anc_boxes_flat = anc_boxes_all.reshape(batch_size, -1, 4)
    # get total anchor boxes for a single image
    tot_anc_boxes = anc_boxes_flat.size(dim=1)

    # create a placeholder to compute IoUs amongst the boxes
    ious_mat = torch.zeros((batch_size, tot_anc_boxes, gt_bboxes_all.size(dim=1)))

    # compute IoU of the anc boxes with the gt boxes for all the images
    for i in range(batch_size):
        gt_bboxes = gt_bboxes_all[i]
        anc_boxes = anc_boxes_flat[i]
        ious_mat[i, :] = ops.box_iou(anc_boxes, gt_bboxes)

    return ious_mat

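The function above computes the IoU matrix, which contains the IoU of every anchor box with every ground truth box in the image. It takes as input the anchor boxes of shape (B, w_amap, h_amap, n_anc_boxes, 4) and the ground truth boxes of shape (B, max_objects, 4), and returns a matrix of shape (B, anc_boxes_tot, max_objects), where the notation is as follows:

B - batch size
w_amap - width of the output activation map
h_amap - height of the output activation map
n_anc_boxes - number of anchor boxes per anchor point
max_objects - max number of objects in a batch of images
anc_boxes_tot - total number of anchor boxes in the image, i.e., w_amap * h_amap * n_anc_boxes

The function essentially flattens all the anchor boxes and computes the IoU with each ground truth box, as shown below:

Projecting Ground Truth Boxes

It is important to remember that the IoU is computed in feature space, between the generated anchor boxes and the projected ground truth boxes. To project a ground truth box onto feature space, we simply divide its coordinates by the scale factor, as in the following function: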

def project_bboxes(bboxes, width_scale_factor, height_scale_factor, mode='a2p'):
    assert mode in ['a2p', 'p2a']

    batch_size = bboxes.size(dim=0)
    proj_bboxes = bboxes.clone().reshape(batch_size, -1, 4)
    invalid_bbox_mask = (proj_bboxes == -1) # indicating padded bboxes

    if mode == 'a2p':
        # activation map to pixel image
        proj_bboxes[:, :, [0, 2]] *= width_scale_factor
        proj_bboxes[:, :, [1, 3]] *= height_scale_factor
    else:
        # pixel image to activation map
        proj_bboxes[:, :, [0, 2]] /= width_scale_factor
        proj_bboxes[:, :, [1, 3]] /= height_scale_factor

    proj_bboxes.masked_fill_(invalid_bbox_mask, -1) # fill padded bboxes back with -1
    proj_bboxes.resize_as_(bboxes)

    return proj_bboxes

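Now, when we divide the coordinates by the scale factor, we round the values to the nearest integer. This essentially means we are "snapping" the ground truth boxes to the nearest grid in feature space. So if the difference in scale between image space and feature space is large, we won't get accurate projections. It is therefore important to work with high-resolution images in object detection.

Computing Offsets

The positive anchor boxes do not align exactly with the ground truth boxes. So we compute the offsets between the positive anchor boxes and the ground truth boxes and train a neural network to learn these offsets. The offsets are computed as follows: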

tx_ = (gt_cx - anc_cx) / anc_w

ty_ = (gt_cy - anc_cy) / anc_h

tw_ = log(gt_w / anc_w)

th_ = log(gt_h / anc_h)

Where:

gt_cx, gt_cy - centers of ground truth boxes

anc_cx, anc_cy - centers of anchor boxes

gt_w, gt_h - width and height of ground truth boxes

anc_w, anc_h - width and height of anchor boxes

The following function can be used to compute them:

def calc_gt_offsets(pos_anc_coords, gt_bbox_mapping):
    pos_anc_coords = ops.box_convert(pos_anc_coords, in_fmt='xyxy', out_fmt='cxcywh')
    gt_bbox_mapping = ops.box_convert(gt_bbox_mapping, in_fmt='xyxy', out_fmt='cxcywh')

    gt_cx, gt_cy, gt_w, gt_h = gt_bbox_mapping[:, 0], gt_bbox_mapping[:, 1], gt_bbox_mapping[:, 2], gt_bbox_mapping[:, 3]
    anc_cx, anc_cy, anc_w, anc_h = pos_anc_coords[:, 0], pos_anc_coords[:, 1], pos_anc_coords[:, 2], pos_anc_coords[:, 3]

    tx_ = (gt_cx - anc_cx)/anc_w
    ty_ = (gt_cy - anc_cy)/anc_h
    tw_ = torch.log(gt_w / anc_w)
    th_ = torch.log(gt_h / anc_h)

    return torch.stack([tx_, ty_, tw_, th_], dim=-1)

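As you may have noticed, we are teaching the network how far the anchor boxes are from the ground truth boxes. We are not forcing it to predict the exact location and scale of the anchor boxes. The offsets and transformations learned by the network are therefore location and scale invariant.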
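The invariance claim can be checked on a single box pair. The sketch below re-implements the offset formulas above in plain Python for one anchor and one ground truth box, then applies the inverse transform. The helper names to_cxcywh, encode_offsets, and decode_offsets and the sample coordinates are made up for the example.

```python
import math


def to_cxcywh(box):
    """Convert (xmin, ymin, xmax, ymax) to (center x, center y, width, height)."""
    xmin, ymin, xmax, ymax = box
    return ((xmin + xmax) / 2, (ymin + ymax) / 2, xmax - xmin, ymax - ymin)


def encode_offsets(anchor, gt):
    """Offsets (tx, ty, tw, th) that move the anchor onto the ground truth box."""
    acx, acy, aw, ah = to_cxcywh(anchor)
    gcx, gcy, gw, gh = to_cxcywh(gt)
    return ((gcx - acx) / aw, (gcy - acy) / ah, math.log(gw / aw), math.log(gh / ah))


def decode_offsets(anchor, offsets):
    """Apply offsets to an anchor; the inverse of encode_offsets."""
    acx, acy, aw, ah = to_cxcywh(anchor)
    tx, ty, tw, th = offsets
    cx, cy = acx + tx * aw, acy + ty * ah
    w, h = aw * math.exp(tw), ah * math.exp(th)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


anchor, gt = (4.0, 4.0, 8.0, 8.0), (5.0, 3.0, 11.0, 7.0)
offsets = encode_offsets(anchor, gt)
recovered = decode_offsets(anchor, offsets)

# the same pair of boxes, ten times larger, yields the same offsets
offsets_scaled = encode_offsets((40.0, 40.0, 80.0, 80.0), (50.0, 30.0, 110.0, 70.0))
```

Dividing the center shift by the anchor size and taking the logarithm of the size ratio is what makes the targets independent of where the box sits and how large it is.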

Code Walkthrough

Let's walk through the data preparation code. This is probably the most important function in the whole repository.

def get_req_anchors(anc_boxes_all, gt_bboxes_all, gt_classes_all, pos_thresh=0.7, neg_thresh=0.2):
    '''
    Prepare necessary data required for training

    Input
    ------
    anc_boxes_all - torch.Tensor of shape (B, w_amap, h_amap, n_anchor_boxes, 4)
        all anchor boxes for a batch of images
    gt_bboxes_all - torch.Tensor of shape (B, max_objects, 4)
        padded ground truth boxes for a batch of images
    gt_classes_all - torch.Tensor of shape (B, max_objects)
        padded ground truth classes for a batch of images

    Returns
    ---------
    positive_anc_ind - torch.Tensor of shape (n_pos,)
        flattened positive indices for all the images in the batch
    negative_anc_ind - torch.Tensor of shape (n_pos,)
        flattened negative indices for all the images in the batch
    GT_conf_scores - torch.Tensor of shape (n_pos,), IoU scores of +ve anchors
    GT_offsets - torch.Tensor of shape (n_pos, 4),
        offsets between +ve anchors and their corresponding ground truth boxes
    GT_class_pos - torch.Tensor of shape (n_pos,)
        mapped classes of +ve anchors
    positive_anc_coords - (n_pos, 4) coords of +ve anchors (for visualization)
    negative_anc_coords - (n_pos, 4) coords of -ve anchors (for visualization)
    positive_anc_ind_sep - list of indices to keep track of +ve anchors
    '''
    # get the size and shape parameters
    B, w_amap, h_amap, A, _ = anc_boxes_all.shape
    N = gt_bboxes_all.shape[1] # max number of groundtruth bboxes in a batch

    # get total number of anchor boxes in a single image
    tot_anc_boxes = A * w_amap * h_amap

    # get the iou matrix which contains iou of every anchor box
    # against all the groundtruth bboxes in an image
    iou_mat = get_iou_mat(B, anc_boxes_all, gt_bboxes_all)

    # for every groundtruth bbox in an image, find the iou
    # with the anchor box which it overlaps the most
    max_iou_per_gt_box, _ = iou_mat.max(dim=1, keepdim=True)

    # get positive anchor boxes

    # condition 1: the anchor box with the max iou for every gt bbox
    positive_anc_mask = torch.logical_and(iou_mat == max_iou_per_gt_box, max_iou_per_gt_box > 0)
    # condition 2: anchor boxes with iou above a threshold with any of the gt bboxes
    positive_anc_mask = torch.logical_or(positive_anc_mask, iou_mat > pos_thresh)

    positive_anc_ind_sep = torch.where(positive_anc_mask)[0] # get separate indices in the batch
    # combine all the batches and get the idxs of the +ve anchor boxes
    positive_anc_mask = positive_anc_mask.flatten(start_dim=0, end_dim=1)
    positive_anc_ind = torch.where(positive_anc_mask)[0]

    # for every anchor box, get the iou and the idx of the
    # gt bbox it overlaps with the most
    max_iou_per_anc, max_iou_per_anc_ind = iou_mat.max(dim=-1)
    max_iou_per_anc = max_iou_per_anc.flatten(start_dim=0, end_dim=1)

    # get iou scores of the +ve anchor boxes
    GT_conf_scores = max_iou_per_anc[positive_anc_ind]

    # get gt classes of the +ve anchor boxes

    # expand gt classes to map against every anchor box
    gt_classes_expand = gt_classes_all.view(B, 1, N).expand(B, tot_anc_boxes, N)
    # for every anchor box, consider only the class of the gt bbox it overlaps with the most
    GT_class = torch.gather(gt_classes_expand, -1, max_iou_per_anc_ind.unsqueeze(-1)).squeeze(-1)
    # combine all the batches and get the mapped classes of the +ve anchor boxes
    GT_class = GT_class.flatten(start_dim=0, end_dim=1)
    GT_class_pos = GT_class[positive_anc_ind]

    # get gt bbox coordinates of the +ve anchor boxes

    # expand all the gt bboxes to map against every anchor box
    gt_bboxes_expand = gt_bboxes_all.view(B, 1, N, 4).expand(B, tot_anc_boxes, N, 4)
    # for every anchor box, consider only the coordinates of the gt bbox it overlaps with the most
    GT_bboxes = torch.gather(gt_bboxes_expand, -2, max_iou_per_anc_ind.reshape(B, tot_anc_boxes, 1, 1).repeat(1, 1, 1, 4))
    # combine all the batches and get the mapped gt bbox coordinates of the +ve anchor boxes
    GT_bboxes = GT_bboxes.flatten(start_dim=0, end_dim=2)
    GT_bboxes_pos = GT_bboxes[positive_anc_ind]

    # get coordinates of +ve anc boxes
    anc_boxes_flat = anc_boxes_all.flatten(start_dim=0, end_dim=-2) # flatten all the anchor boxes
    positive_anc_coords = anc_boxes_flat[positive_anc_ind]

    # calculate gt offsets
    GT_offsets = calc_gt_offsets(positive_anc_coords, GT_bboxes_pos)

    # get -ve anchors

    # condition: select the anchor boxes with max iou less than the threshold
    negative_anc_mask = (max_iou_per_anc < neg_thresh)
    negative_anc_ind = torch.where(negative_anc_mask)[0]
    # sample -ve samples to match the +ve samples
    negative_anc_ind = negative_anc_ind[torch.randint(0, negative_anc_ind.shape[0], (positive_anc_ind.shape[0],))]
    negative_anc_coords = anc_boxes_flat[negative_anc_ind]

    return (positive_anc_ind, negative_anc_ind, GT_conf_scores, GT_offsets, GT_class_pos,
            positive_anc_coords, negative_anc_coords, positive_anc_ind_sep)

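First, we compute the IoU matrix using the function above. From this matrix, we take the IoU of the most overlapping anchor box for every ground truth box. This is condition 1 for sampling positive anchor boxes. We also apply condition 2 and select the anchor boxes that have an IoU greater than the threshold with any ground truth box in the image. We combine conditions 1 and 2 and sample the positive anchor boxes for all images.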
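The two conditions are easier to follow on a toy IoU matrix. The sketch below is a simplified plain-Python version for a single image that marks each anchor once, whereas the tensor code above works on (anchor, ground truth) pairs across the batch. The function name positive_anchor_mask and the IoU values are made up for illustration.

```python
def positive_anchor_mask(iou_mat, pos_thresh=0.7):
    """iou_mat[a][g] is the IoU of anchor a with ground truth box g."""
    n_gt = len(iou_mat[0])
    # best IoU achieved by any anchor, for each ground truth box
    max_iou_per_gt = [max(row[g] for row in iou_mat) for g in range(n_gt)]

    mask = []
    for row in iou_mat:
        # condition 1: this anchor is the best match for some ground truth box
        is_best = any(row[g] == max_iou_per_gt[g] and max_iou_per_gt[g] > 0 for g in range(n_gt))
        # condition 2: this anchor overlaps some ground truth box above the threshold
        is_above = any(iou > pos_thresh for iou in row)
        mask.append(is_best or is_above)
    return mask


iou_mat = [
    [0.80, 0.00],  # anchor 0: above the threshold for gt 0
    [0.75, 0.10],  # anchor 1: above the threshold for gt 0
    [0.05, 0.40],  # anchor 2: below the threshold, but the best match for gt 1
    [0.00, 0.20],  # anchor 3: satisfies neither condition
]
mask = positive_anchor_mask(iou_mat)

# ground truth box that each anchor overlaps with the most
best_gt = [max(range(len(row)), key=row.__getitem__) for row in iou_mat]
```

Anchor 2 shows why condition 1 matters: without it, the second ground truth box would have no positive anchor at all.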

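Each image will have a different number of positive samples. To avoid this discrepancy during training, we flatten the batch and combine the positive samples from all the images. We can also keep track of where each positive sample originated using torch.where.

Next, we need to compute the offsets of the positive samples from the ground truth boxes. To do so, we need to map every positive sample to its corresponding ground truth box. Note that a positive anchor box is mapped to only one ground truth box, whereas multiple positive anchor boxes can be mapped to the same ground truth box.

To do the mapping, we first expand the ground truth boxes using Tensor.expand to match the total number of anchor boxes. Then, for every anchor box, we select the ground truth box it overlaps with the most.

To do this, we take the indices of the max IoU for all the anchor boxes from the IoU matrix and then "gather" at these indices using torch.gather. Finally, we flatten the batch and filter out the positive samples. The process is shown below:

Mapping each anchor box to the ground truth box it overlaps with the most

We perform the same process for the classes, assigning a class to each positive sample.

Now that we have a ground truth box mapped to each positive sample, we can compute the offsets using the function described above.

Finally, we select the negative samples by sampling the anchor boxes whose IoU with all the ground truth boxes is below a given threshold. Since the negative samples far outnumber the positive ones, we randomly select a few of them to match the count.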
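A minimal sketch of this balancing step in plain Python follows. It samples with replacement, which is similar in spirit to indexing with torch.randint in the function above. The function name sample_negatives, the fixed seed, and the IoU values are illustrative assumptions.

```python
import random


def sample_negatives(max_iou_per_anchor, n_positive, neg_thresh=0.3, seed=0):
    """Pick n_positive negative anchors (with replacement) among those below neg_thresh."""
    rng = random.Random(seed)
    candidates = [i for i, iou in enumerate(max_iou_per_anchor) if iou < neg_thresh]
    if not candidates:
        raise ValueError("no anchor falls below the negative threshold")
    return [rng.choice(candidates) for _ in range(n_positive)]


# best IoU of each anchor against all ground truth boxes (made-up values)
max_iou_per_anchor = [0.80, 0.75, 0.40, 0.20, 0.05, 0.00, 0.10]
negatives = sample_negatives(max_iou_per_anchor, n_positive=3)
```

Because the draw is with replacement, the same negative anchor can appear more than once when few candidates are available.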

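Here is how the positive and negative anchor boxes look:

We can now use the sampled positive and negative anchor boxes for training.

Building the Model

Proposal Module

Let's start with the proposal module. As discussed, every point in the feature map is treated as an anchor point, and every anchor point generates boxes of different sizes and shapes. We want to classify each of these boxes as object or background.

Moreover, we want to predict their offsets from the corresponding ground truth boxes. How can we do this? The solution is to use 1x1 convolutional layers. Now, 1x1 convolutional layers do not increase the receptive field. Their function is not to learn image-level features. Rather, they are used to change the number of filters or as a regression or classification head.

So we take two 1x1 convolutional layers and use one of them to classify each anchor box as object or background. We call this the confidence head. Given a feature map of size (B, C, w_amap, h_amap), we convolve it with a 1x1 kernel to get an output of size (B, n_anc_boxes, w_amap, h_amap). Essentially, each output filter represents the classification score of an anchor box.

In a similar manner, the other 1x1 convolutional layer takes the feature map and produces an output of size (B, n_anc_boxes*4, w_amap, h_amap), where the output filters represent the predicted offsets of the anchor boxes. This is called the regression head.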
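To see why a 1x1 convolution works as a per-location prediction head, it helps to write one out by hand: the same small weight matrix is applied independently at every spatial position, so the spatial size is unchanged and only the channel count changes. The sketch below is a plain-Python illustration with made-up weights and a tiny 2x2 feature map. It is not how nn.Conv2d is implemented, and the three output channels stand in for three hypothetical anchors.

```python
def conv1x1(feature_map, weights, bias):
    """Apply a 1x1 convolution to a feature map stored as [channel][row][col]."""
    n_in = len(feature_map)
    height, width = len(feature_map[0]), len(feature_map[0][0])
    out = []
    for w_row, b in zip(weights, bias):
        # one output channel: a weighted sum over input channels at each location
        plane = [[b + sum(w_row[c] * feature_map[c][y][x] for c in range(n_in))
                  for x in range(width)]
                 for y in range(height)]
        out.append(plane)
    return out


feature_map = [
    [[1.0, 2.0], [3.0, 4.0]],  # input channel 0
    [[0.5, 0.5], [0.5, 0.5]],  # input channel 1
]
weights = [[1.0, 2.0], [0.0, -1.0], [0.5, 0.0]]  # 3 output channels x 2 input channels
bias = [0.0, 0.0, 1.0]

scores = conv1x1(feature_map, weights, bias)  # shape: 3 x 2 x 2
```

Each location of the output holds one score per anchor, which matches the (B, n_anc_boxes, w_amap, h_amap) layout described above, minus the batch dimension.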

class ProposalModule(nn.Module):
    def __init__(self, in_features, hidden_dim=512, n_anchors=9, p_dropout=0.3):
        super().__init__()
        self.n_anchors = n_anchors
        self.conv1 = nn.Conv2d(in_features, hidden_dim, kernel_size=3, padding=1)
        self.dropout = nn.Dropout(p_dropout)
        self.conf_head = nn.Conv2d(hidden_dim, n_anchors, kernel_size=1)
        self.reg_head = nn.Conv2d(hidden_dim, n_anchors * 4, kernel_size=1)

    def forward(self, feature_map, pos_anc_ind=None, neg_anc_ind=None, pos_anc_coords=None):
        # determine mode
        if pos_anc_ind is None or neg_anc_ind is None or pos_anc_coords is None:
            mode = 'eval'
        else:
            mode = 'train'

        out = self.conv1(feature_map)
        out = F.relu(self.dropout(out))

        reg_offsets_pred = self.reg_head(out) # (B, A*4, hmap, wmap)
        conf_scores_pred = self.conf_head(out) # (B, A, hmap, wmap)

        if mode == 'train':
            # get conf scores
            conf_scores_pos = conf_scores_pred.flatten()[pos_anc_ind]
            conf_scores_neg = conf_scores_pred.flatten()[neg_anc_ind]
            # get offsets for +ve anchors
            offsets_pos = reg_offsets_pred.contiguous().view(-1, 4)[pos_anc_ind]
            # generate proposals using offsets
            proposals = generate_proposals(pos_anc_coords, offsets_pos)

            return conf_scores_pos, conf_scores_neg, offsets_pos, proposals

        elif mode == 'eval':
            return conf_scores_pred, reg_offsets_pred

During training, we select the positive anchor boxes and apply the predicted offsets to generate region proposals. The region proposals are computed as follows:
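Written out so that it agrees with the offset definitions given earlier and with the generate_proposals function, the transformation can be stated as below. Here (x, y) is a box center and (w, h) its width and height; these symbol names are chosen for illustration.

```latex
\begin{aligned}
x^{p} &= x^{a} + t_{x}\, w^{a} \\
y^{p} &= y^{a} + t_{y}\, h^{a} \\
w^{p} &= w^{a}\, e^{t_{w}} \\
h^{p} &= h^{a}\, e^{t_{h}}
\end{aligned}
```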

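where the superscript p denotes the region proposal, the superscript a denotes the anchor box, and t denotes the predicted offsets.

The following function implements the above transformation and generates the region proposals: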

def generate_proposals(anchors, offsets):
    # change format of the anchor boxes from 'xyxy' to 'cxcywh'
    anchors = ops.box_convert(anchors, in_fmt='xyxy', out_fmt='cxcywh')

    # apply offsets to anchors to create proposals
    proposals_ = torch.zeros_like(anchors)
    proposals_[:,0] = anchors[:,0] + offsets[:,0]*anchors[:,2]
    proposals_[:,1] = anchors[:,1] + offsets[:,1]*anchors[:,3]
    proposals_[:,2] = anchors[:,2] * torch.exp(offsets[:,2])
    proposals_[:,3] = anchors[:,3] * torch.exp(offsets[:,3])

    # change format of proposals back from 'cxcywh' to 'xyxy'
    proposals = ops.box_convert(proposals_, in_fmt='cxcywh', out_fmt='xyxy')

    return proposals

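Region Proposal Network

The region proposal network is the first stage of the detector. It takes the feature map and produces region proposals.

Here we combine the backbone network, the sampling module, and the proposal module into the region proposal network.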

class RegionProposalNetwork(nn.Module):
    def __init__(self, img_size, out_size, out_channels):
        super().__init__()

        self.img_height, self.img_width = img_size
        self.out_h, self.out_w = out_size

        # downsampling scale factor
        self.width_scale_factor = self.img_width // self.out_w
        self.height_scale_factor = self.img_height // self.out_h

        # scales and ratios for anchor boxes
        self.anc_scales = [2, 4, 6]
        self.anc_ratios = [0.5, 1, 1.5]
        self.n_anc_boxes = len(self.anc_scales) * len(self.anc_ratios)

        # IoU thresholds for +ve and -ve anchors
        self.pos_thresh = 0.7
        self.neg_thresh = 0.3

        # weights for loss
        self.w_conf = 1
        self.w_reg = 5

        self.feature_extractor = FeatureExtractor()
        self.proposal_module = ProposalModule(out_channels, n_anchors=self.n_anc_boxes)

    def forward(self, images, gt_bboxes, gt_classes):
        batch_size = images.size(dim=0)
        feature_map = self.feature_extractor(images)

        # generate anchors
        anc_pts_x, anc_pts_y = gen_anc_centers(out_size=(self.out_h, self.out_w))
        anc_base = gen_anc_base(anc_pts_x, anc_pts_y, self.anc_scales, self.anc_ratios, (self.out_h, self.out_w))
        anc_boxes_all = anc_base.repeat(batch_size, 1, 1, 1, 1)

        # get positive and negative anchors amongst other things
        gt_bboxes_proj = project_bboxes(gt_bboxes, self.width_scale_factor, self.height_scale_factor, mode='p2a')

        # pass the thresholds set in __init__ so they are actually used
        (positive_anc_ind, negative_anc_ind, GT_conf_scores,
         GT_offsets, GT_class_pos, positive_anc_coords,
         negative_anc_coords, positive_anc_ind_sep) = get_req_anchors(
            anc_boxes_all, gt_bboxes_proj, gt_classes, self.pos_thresh, self.neg_thresh)

        # pass through the proposal module
        conf_scores_pos, conf_scores_neg, offsets_pos, proposals = self.proposal_module(
            feature_map, positive_anc_ind, negative_anc_ind, positive_anc_coords)

        cls_loss = calc_cls_loss(conf_scores_pos, conf_scores_neg, batch_size)
        reg_loss = calc_bbox_reg_loss(GT_offsets, offsets_pos, batch_size)

        total_rpn_loss = self.w_conf * cls_loss + self.w_reg * reg_loss

        return total_rpn_loss, feature_map, proposals, positive_anc_ind_sep, GT_class_pos

    def inference(self, images, conf_thresh=0.5, nms_thresh=0.7):
        with torch.no_grad():
            batch_size = images.size(dim=0)
            feature_map = self.feature_extractor(images)

            # generate anchors
            anc_pts_x, anc_pts_y = gen_anc_centers(out_size=(self.out_h, self.out_w))
            anc_base = gen_anc_base(anc_pts_x, anc_pts_y, self.anc_scales, self.anc_ratios, (self.out_h, self.out_w))
            anc_boxes_all = anc_base.repeat(batch_size, 1, 1, 1, 1)
            anc_boxes_flat = anc_boxes_all.reshape(batch_size, -1, 4)

            # get conf scores and offsets
            conf_scores_pred, offsets_pred = self.proposal_module(feature_map)
            conf_scores_pred = conf_scores_pred.reshape(batch_size, -1)
            offsets_pred = offsets_pred.reshape(batch_size, -1, 4)

            # filter out proposals based on conf threshold and nms threshold for each image
            proposals_final = []
            conf_scores_final = []
            for i in range(batch_size):
                conf_scores = torch.sigmoid(conf_scores_pred[i])
                offsets = offsets_pred[i]
                anc_boxes = anc_boxes_flat[i]
                proposals = generate_proposals(anc_boxes, offsets)
                # filter based on confidence threshold
                conf_idx = torch.where(conf_scores >= conf_thresh)[0]
                conf_scores_pos = conf_scores[conf_idx]
                proposals_pos = proposals[conf_idx]
                # filter based on nms threshold
                nms_idx = ops.nms(proposals_pos, conf_scores_pos, nms_thresh)
                conf_scores_pos = conf_scores_pos[nms_idx]
                proposals_pos = proposals_pos[nms_idx]

                proposals_final.append(proposals_pos)
                conf_scores_final.append(conf_scores_pos)

        return proposals_final, conf_scores_final, feature_map

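During both training and inference, the RPN produces scores and offsets for all the anchor boxes. During training, however, we select only the positive and negative anchor boxes to compute the classification loss. To compute the L2 regression loss, we consider only the offsets of the positive samples. The final loss is a weighted combination of these two losses.

During inference, we select the anchor boxes with scores above a given threshold and generate proposals using the predicted offsets. We use the sigmoid function to convert the raw model logits to probability scores.

The proposals generated in both cases are passed to the second stage of the detector.

Classification Module

In the second stage, we receive the region proposals and predict the category of the object in each proposal. This can be done with a simple convolutional network, but there is a catch: the proposals are not all the same size.

Now, you may think of resizing the proposals before feeding them into the model, the way we usually resize images in image classification tasks. The problem is that resizing is not a differentiable operation, so backpropagation cannot flow through it.

Here is a smarter way to resize: we divide the proposal into roughly equal subregions and apply a max pooling operation on each of them to produce outputs of the same size. This is called ROI pooling and is illustrated below: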
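The idea can be sketched on a single-channel grid in plain Python. This is a simplified illustration, not the torchvision operator: it assumes integer, end-exclusive ROI coordinates and a square output, and it splits the region into bins using a floor for each bin's start and a ceiling for its end, so neighboring bins may overlap by one cell when the region does not divide evenly. The name roi_max_pool and the sample grid are made up.

```python
import math


def roi_max_pool(feature, roi, out_size):
    """Max-pool the region roi = (xmin, ymin, xmax, ymax) of a 2D grid
    into an out_size x out_size grid."""
    xmin, ymin, xmax, ymax = roi
    roi_h, roi_w = ymax - ymin, xmax - xmin
    pooled = []
    for i in range(out_size):
        # rows covered by output row i
        y0 = ymin + (i * roi_h) // out_size
        y1 = ymin + math.ceil((i + 1) * roi_h / out_size)
        row = []
        for j in range(out_size):
            # columns covered by output column j
            x0 = xmin + (j * roi_w) // out_size
            x1 = xmin + math.ceil((j + 1) * roi_w / out_size)
            row.append(max(feature[y][x] for y in range(y0, y1) for x in range(x0, x1)))
        pooled.append(row)
    return pooled


feature = [
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
    [11, 12, 13, 14, 15],
    [16, 17, 18, 19, 20],
]
small = roi_max_pool(feature, roi=(0, 0, 3, 2), out_size=2)  # a 3x2 region
large = roi_max_pool(feature, roi=(1, 0, 5, 4), out_size=2)  # a 4x4 region
```

Both regions come out as 2x2 grids even though they started with different sizes, which is exactly what the second stage needs.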

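Max pooling is a differentiable operation, and we use it all the time in convolutional neural networks.

We don't need to implement ROI pooling from scratch; the torchvision.ops library provides it for us.

Once the proposals are resized using ROI pooling, we pass them through a convolutional neural network consisting of a convolutional layer followed by an average pooling layer and a linear layer that produces the category scores.

During inference, we predict the object category by applying the softmax function to the raw model logits and selecting the category with the highest probability score. During training, we compute the classification loss using cross-entropy.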
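For a single proposal, both steps fit in a few lines of plain Python. The sketch below uses made-up logits for three hypothetical classes; the function names softmax and cross_entropy are local to the example and are not the PyTorch functions.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the target class."""
    return -math.log(softmax(logits)[target])


logits = [2.0, 0.5, -1.0]
probs = softmax(logits)

# inference: pick the class with the highest probability
predicted = max(range(len(probs)), key=probs.__getitem__)

# training: the loss is small when the target class already has a high probability
loss_correct = cross_entropy(logits, target=0)
loss_wrong = cross_entropy(logits, target=2)
```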

class ClassificationModule(nn.Module):
    def __init__(self, out_channels, n_classes, roi_size, hidden_dim=512, p_dropout=0.3):
        super().__init__()
        self.roi_size = roi_size
        # hidden network
        self.avg_pool = nn.AvgPool2d(self.roi_size)
        self.fc = nn.Linear(out_channels, hidden_dim)
        self.dropout = nn.Dropout(p_dropout)

        # define classification head
        self.cls_head = nn.Linear(hidden_dim, n_classes)

    def forward(self, feature_map, proposals_list, gt_classes=None):
        if gt_classes is None:
            mode = 'eval'
        else:
            mode = 'train'

        # apply roi pooling on proposals followed by avg pooling
        roi_out = ops.roi_pool(feature_map, proposals_list, self.roi_size)
        roi_out = self.avg_pool(roi_out)

        # flatten the output
        roi_out = roi_out.squeeze(-1).squeeze(-1)

        # pass the output through the hidden network
        out = self.fc(roi_out)
        out = F.relu(self.dropout(out))

        # get the classification scores
        cls_scores = self.cls_head(out)

        if mode == 'eval':
            return cls_scores

        # compute cross entropy loss
        cls_loss = F.cross_entropy(cls_scores, gt_classes.long())

        return cls_loss

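In a full-fledged implementation, we would also include a background category in the second stage, but we leave that out of this tutorial.

In the second stage, we would also add a regression network that further produces offsets for the region proposals. However, since this requires extra bookkeeping, I have not included it in this tutorial.

Non-Maximum Suppression

In the final step of inference, we remove duplicate bounding boxes using a technique called non-maximum suppression. In this technique, we first consider the bounding box with the highest classification score. We then compute the IoU of all the other boxes with this box and remove the ones that have a high IoU score. These are duplicate bounding boxes that overlap with the "original" one. We repeat this process for the remaining boxes until all duplicates are removed.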
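The procedure just described can be sketched in plain Python as a greedy loop. This is an illustration on a handful of made-up boxes, not the torchvision routine; the helper names iou and nms are local to the example.

```python
def iou(a, b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.7):
    """Return the indices of the boxes kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # highest-scoring box still in play
        keep.append(best)
        # drop every remaining box that overlaps the kept box too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 10, 9)]
scores = [0.9, 0.8, 0.7, 0.6]
keep = nms(boxes, scores, iou_thresh=0.5)
```

Boxes 1 and 3 overlap box 0 with IoUs of about 0.68 and 0.9, so they are suppressed, while box 2 does not overlap it at all and is kept.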

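Again, we don't have to implement it from scratch; the torchvision.ops library provides it for us. The NMS step is implemented in the inference method of the stage 1 region proposal network above.

Faster RCNN Model

We combine the region proposal network and the classification module to build the final end-to-end Faster RCNN model.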

class TwoStageDetector(nn.Module):
    def __init__(self, img_size, out_size, out_channels, n_classes, roi_size):
        super().__init__()
        self.rpn = RegionProposalNetwork(img_size, out_size, out_channels)
        self.classifier = ClassificationModule(out_channels, n_classes, roi_size)

    def forward(self, images, gt_bboxes, gt_classes):
        (total_rpn_loss, feature_map, proposals,
         positive_anc_ind_sep, GT_class_pos) = self.rpn(images, gt_bboxes, gt_classes)

        # get separate proposals for each sample
        pos_proposals_list = []
        batch_size = images.size(dim=0)
        for idx in range(batch_size):
            proposal_idxs = torch.where(positive_anc_ind_sep == idx)[0]
            proposals_sep = proposals[proposal_idxs].detach().clone()
            pos_proposals_list.append(proposals_sep)

        cls_loss = self.classifier(feature_map, pos_proposals_list, GT_class_pos)
        total_loss = cls_loss + total_rpn_loss

        return total_loss

    def inference(self, images, conf_thresh=0.5, nms_thresh=0.7):
        batch_size = images.size(dim=0)
        proposals_final, conf_scores_final, feature_map = self.rpn.inference(images, conf_thresh, nms_thresh)
        cls_scores = self.classifier(feature_map, proposals_final)

        # convert scores into probability
        cls_probs = F.softmax(cls_scores, dim=-1)
        # get classes with highest probability
        classes_all = torch.argmax(cls_probs, dim=-1)

        classes_final = []
        # slice classes to map to their corresponding image
        c = 0
        for i in range(batch_size):
            n_proposals = len(proposals_final[i]) # get the number of proposals for each image
            classes_final.append(classes_all[c: c+n_proposals])
            c += n_proposals

        return proposals_final, conf_scores_final, classes_final

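Training the Model

First, let's overfit the network on a small sample of data to make sure everything works as expected. We use a standard training loop with the Adam optimizer and a learning rate of 1e-3.

Here are the results:

Since we trained on a small sample of data, the model has not learned image-level features, so the results are not accurate. This can be improved by training on a large dataset.

Conclusion

In a full implementation, we would train the network on a standard dataset such as MS-COCO or PASCAL VOC and evaluate the results using metrics such as mean average precision or area under the ROC curve. However, the purpose of this tutorial was to understand the Faster RCNN model, so we will leave out the evaluation part.

The field has made significant progress over the years, and many new networks have been developed. Examples include YOLO, EfficientDet, DETR, and Mask RCNN. However, most of them build on the foundation laid by the Faster RCNN model that we discussed in this tutorial.

I hope you enjoyed this article. The code is available on GitHub.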

https://github.com/wingedrasengan927/pytorch-tutorials/tree/master/Object%20Detection

Dataset

The two images used in this article are from the DIV2K dataset. The dataset is licensed under CC0: Public Domain.

@InProceedings{Agustsson_2017_CVPR_Workshops,
  author = {Agustsson, Eirikur and Timofte, Radu},
  title = {NTIRE 2017 Challenge on Single Image Super-Resolution: Dataset and Study},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month = {July},
  year = {2017}
}

Image Credits

Unless the source is explicitly cited in the caption, all images in this tutorial are by the author.

References

Deep learning for Computer Vision, UMich (https://web.eecs.umich.edu/~justincj/teaching/eecs498/WI2022/)

Faster-RCNN paper (https://arxiv.org/abs/1506.01497)

Original title: Computer Vision and Object Detection