4.6 - Object detection

!wget -nc --no-cache -O init.py -q https://raw.githubusercontent.com/rramosp/2021.deeplearning/main/content/init.py
import init; init.init(force_download=False); 
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
%load_ext tensorboard

from sklearn.datasets import *
from local.lib import mlutils
from IPython.display import Image
from skimage import io
tf.__version__
The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard
'2.4.1'

Object detection

Approaches:

  • Classical: slide a window over positions and scales and classify each patch; costly (see the sketch after this list).

  • Two-stage detectors: first obtain proposed regions, then classify them.

  • One-stage detectors: use region priors on a fixed image grid.
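
To see why the classical approach is costly, below is a minimal sliding-window sketch: every patch at every position must go through the classifier, and a real detector would repeat this over several window scales. classifier_fn is a placeholder for any patch classifier, such as the InceptionV3 model used later in this notebook.

def sliding_window_scores(img, classifier_fn, window=(100, 100), stride=50):
    # classify every fixed-size patch at a fixed stride: one classifier
    # call per position, which is what makes this approach expensive
    h, w = img.shape[:2]
    detections = []
    for y in range(0, h - window[0] + 1, stride):
        for x in range(0, w - window[1] + 1, stride):
            patch = img[y:y + window[0], x:x + window[1]]
            detections.append(((x, y, window[1], window[0]), classifier_fn(patch)))
    return detections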

Observe how an image is annotated for detection

This is an example from the Open Images V6 Dataset, a dataset created and curated at Google. Explore and inspect images and annotations to understand the dataset.

Particularly:

We download the class descriptions

!wget -nc https://storage.googleapis.com/openimages/v5/class-descriptions-boxable.csv
c = pd.read_csv("class-descriptions-boxable.csv", names=["code", "description"], index_col="code")
c.head()
--2021-01-27 21:15:00--  https://storage.googleapis.com/openimages/v5/class-descriptions-boxable.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.173.48, 142.250.78.48, 142.250.78.16, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.173.48|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12011 (12K) [text/csv]
Saving to: ‘class-descriptions-boxable.csv’

class-descriptions- 100%[===================>]  11,73K  --.-KB/s    in 0,002s  

2021-01-27 21:15:01 (6,82 MB/s) - ‘class-descriptions-boxable.csv’ saved [12011/12011]
description
code
/m/011k07 Tortoise
/m/011q46kg Container
/m/012074 Magpie
/m/0120dh Sea turtle
/m/01226z Football

An example image

img = io.imread("local/imgs/0003bb040a62c86f.jpg")
plt.figure(figsize=(12,10))
plt.imshow(img)
<matplotlib.image.AxesImage at 0x7fa1c5eb8dc0>
../_images/U4.06 - Object Detection_7_1.png

with its annotations

boxes = pd.read_csv("local/data/openimages_boxes_0003bb040a62c86f.csv")
boxes
ImageID Source LabelName Confidence XMin XMax YMin YMax IsOccluded IsTruncated ... IsDepiction IsInside XClick1X XClick2X XClick3X XClick4X XClick1Y XClick2Y XClick3Y XClick4Y
0 0003bb040a62c86f activemil /m/07j7r 1 0.280625 0.658125 0.021174 0.347449 1 1 ... 0 0 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000
1 0003bb040a62c86f xclick /m/01g317 1 0.326250 0.349375 0.346487 0.397498 0 0 ... 0 0 0.336875 0.326250 0.349375 0.344375 0.346487 0.363811 0.363811 0.397498
2 0003bb040a62c86f xclick /m/01g317 1 0.461875 0.553750 0.312801 0.811357 1 0 ... 0 0 0.505000 0.461875 0.543750 0.553750 0.312801 0.762271 0.811357 0.760346
3 0003bb040a62c86f xclick /m/01g317 1 0.620000 0.641875 0.350337 0.448508 0 0 ... 0 0 0.630000 0.620000 0.641875 0.621250 0.350337 0.436959 0.436959 0.448508
4 0003bb040a62c86f xclick /m/01g317 1 0.650625 0.671250 0.344562 0.446583 0 0 ... 0 0 0.660625 0.650625 0.671250 0.668750 0.344562 0.446583 0.435034 0.442733
5 0003bb040a62c86f xclick /m/01g317 1 0.726250 0.754375 0.354187 0.448508 0 0 ... 0 0 0.733750 0.726250 0.741250 0.754375 0.354187 0.433109 0.448508 0.401347
6 0003bb040a62c86f xclick /m/01mqdt 1 0.397500 0.416875 0.256978 0.286814 0 0 ... 0 0 0.397500 0.406250 0.416875 0.416875 0.281039 0.256978 0.286814 0.286814
7 0003bb040a62c86f xclick /m/01mqdt 1 0.533125 0.560000 0.281039 0.328200 0 0 ... 0 0 0.533125 0.545000 0.560000 0.546250 0.308951 0.281039 0.305101 0.328200
8 0003bb040a62c86f xclick /m/01prls 1 0.163750 0.233125 0.325313 0.409047 0 0 ... 0 0 0.203750 0.163750 0.218125 0.233125 0.325313 0.379211 0.409047 0.366699
9 0003bb040a62c86f xclick /m/01prls 1 0.201875 0.227500 0.316651 0.342637 1 0 ... 0 0 0.205625 0.201875 0.218125 0.227500 0.316651 0.332050 0.342637 0.324350
10 0003bb040a62c86f xclick /m/01prls 1 0.219375 0.255625 0.323388 0.380173 1 0 ... 0 0 0.228750 0.219375 0.249375 0.255625 0.323388 0.336862 0.380173 0.348412
11 0003bb040a62c86f xclick /m/01prls 1 0.235000 0.259375 0.316651 0.354187 1 0 ... 0 0 0.249375 0.235000 0.254375 0.259375 0.316651 0.326275 0.354187 0.337825
12 0003bb040a62c86f xclick /m/01prls 1 0.258125 0.289375 0.323388 0.368624 1 0 ... 0 0 0.268750 0.258125 0.282500 0.289375 0.323388 0.331088 0.368624 0.349374
13 0003bb040a62c86f xclick /m/01prls 1 0.260625 0.298750 0.282964 0.358037 1 0 ... 0 0 0.273125 0.260625 0.293750 0.298750 0.282964 0.305101 0.358037 0.330125
14 0003bb040a62c86f xclick /m/01prls 1 0.264375 0.297500 0.292589 0.333975 1 0 ... 0 0 0.285000 0.297500 0.297500 0.264375 0.292589 0.333975 0.314726 0.312801
15 0003bb040a62c86f xclick /m/01prls 1 0.311875 0.336250 0.330125 0.354187 1 0 ... 0 0 0.315625 0.316875 0.311875 0.336250 0.330125 0.354187 0.354187 0.342637
16 0003bb040a62c86f xclick /m/01prls 1 0.326250 0.353125 0.351299 0.430221 1 0 ... 0 0 0.348125 0.326250 0.339375 0.353125 0.351299 0.390760 0.430221 0.366699
17 0003bb040a62c86f xclick /m/01prls 1 0.424375 0.490000 0.342637 0.415784 1 0 ... 0 0 0.466875 0.424375 0.468125 0.490000 0.342637 0.391723 0.415784 0.348412
18 0003bb040a62c86f xclick /m/07j7r 1 0.000000 0.181875 0.000000 0.366699 1 1 ... 0 0 0.071250 0.000000 0.078125 0.181875 0.000000 0.223292 0.366699 0.063523
19 0003bb040a62c86f xclick /m/07j7r 1 0.113750 0.273750 0.080847 0.360924 0 0 ... 0 0 0.113750 0.188125 0.273750 0.172500 0.210780 0.080847 0.232916 0.360924
20 0003bb040a62c86f xclick /m/07j7r 1 0.761875 0.830000 0.302214 0.362849 0 0 ... 0 0 0.761875 0.782500 0.830000 0.816250 0.335900 0.302214 0.333013 0.362849
21 0003bb040a62c86f xclick /m/07j7r 1 0.799375 0.999375 0.000000 0.471607 0 1 ... 0 0 0.799375 0.999375 0.999375 0.999375 0.223292 0.000000 0.461983 0.471607
22 0003bb040a62c86f xclick /m/0hnnb 1 0.401250 0.555000 0.628489 0.748797 1 0 ... 0 0 0.453750 0.401250 0.519375 0.555000 0.628489 0.681424 0.748797 0.703561
23 0003bb040a62c86f xclick /m/0hnnb 1 0.501250 0.588125 0.570741 0.734360 1 0 ... 0 0 0.550000 0.501250 0.551250 0.588125 0.570741 0.628489 0.734360 0.697786

24 rows × 21 columns

the label counts of the annotations in this image

pd.Series([c.loc[i].description for i in boxes.LabelName]).value_counts()
Land vehicle    10
Person           5
Tree             5
Traffic sign     2
Umbrella         2
dtype: int64
from matplotlib.patches import Rectangle
plt.figure(figsize=(12,10))
ax = plt.subplot(111)
plt.imshow(img)
h,w = img.shape[:2]
for i in range(len(boxes)):
    k = boxes.iloc[i]
    label = c.loc[k.LabelName].values[0]
    # box coordinates are normalized to [0,1]: scale them to pixel units
    ax.add_patch(Rectangle((k.XMin*w, k.YMin*h), (k.XMax-k.XMin)*w, (k.YMax-k.YMin)*h,
                           linewidth=2, edgecolor='r', facecolor='none'))
    plt.text(k.XMin*w, k.YMin*h-10, label, fontsize=12, color="red")
../_images/U4.06 - Object Detection_12_0.png

Patch classification, with InceptionV3 from Keras

some sample patches

patches = [img[190:300, 150:270], 
           img[200:300, 600:700], 
           img[400:500, 400:500],
           img[200:300, 300:400],
           img[220:290, 325:360],
           img[10:180, 330:670]]

plt.figure(figsize=(20,3))
for i,pimg in enumerate(patches):
    plt.subplot(1,len(patches),i+1); plt.imshow(pimg)
../_images/U4.06 - Object Detection_14_0.png
from tensorflow.keras.applications import inception_v3
if not "model" in locals():
    model = inception_v3.InceptionV3(weights='imagenet', include_top=True)
Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/inception_v3/inception_v3_weights_tf_dim_ordering_tf_kernels.h5
96116736/96112376 [==============================] - 33s 0us/step
def plot_img_with_histogram(img):
    plt.figure(figsize=(13,4))
    plt.subplot(121)
    plt.imshow(img, vmin=np.min(img), vmax=np.max(img))
    plt.subplot(122)
    plt.hist(img.flatten(), bins=30);

pimg = patches[2]

plot_img_with_histogram(pimg)
../_images/U4.06 - Object Detection_16_0.png
from skimage.transform import resize
rimg = resize(pimg, output_shape=(299,299,3))
plot_img_with_histogram(rimg)
../_images/U4.06 - Object Detection_17_0.png
pred = model.predict(rimg.reshape(-1,*rimg.shape))
pred.shape
(1, 1000)
k = pd.DataFrame(inception_v3.decode_predictions(pred, top=100)[0], columns=["code", "label", "preds"])
k = k.sort_values(by="preds", ascending=False)
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/imagenet_class_index.json
40960/35363 [==================================] - 0s 1us/step
plt.figure(figsize=(15,3))
n = 40
plt.bar(range(n), k[:n].preds.values)
plt.xticks(range(n), k[:n].label.values, rotation="vertical");
plt.title("classes with highest prediction probability")
plt.grid();
../_images/U4.06 - Object Detection_20_0.png

Observe how we decode the prediction classes. We would need to align them with our detection dataset.

print ('Predicted:')
k = inception_v3.decode_predictions(pred, top=10)[0]
for i in k:
    print("%10s %20s %.6f"%i)
Predicted:
 n03630383             lab_coat 0.134186
 n04479046          trench_coat 0.090892
 n04507155             umbrella 0.087079
 n02971356               carton 0.054701
 n03787032          mortarboard 0.039948
 n03617480               kimono 0.021387
 n02777292         balance_beam 0.017429
 n02669723        academic_gown 0.016952
 n04336792            stretcher 0.016528
 n04456115                torch 0.016192

Patch classification, with an Inception-ResNet model published on TensorFlow Hub

import tensorflow_hub as hub
classnames = pd.read_csv('https://storage.googleapis.com/download.tensorflow.org/data/ImageNetLabels.txt', names=["label"])
if not 'm' in locals():
    m = tf.keras.Sequential([
        hub.KerasLayer("https://tfhub.dev/google/imagenet/inception_resnet_v2/classification/4")
    ])
    m.build([None, 299, 299, 3])  
m.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
keras_layer (KerasLayer)     (None, 1001)              55875273  
=================================================================
Total params: 55,875,273
Trainable params: 0
Non-trainable params: 55,875,273
_________________________________________________________________
preds = m(rimg.reshape(-1,*rimg.shape).astype(np.float32)).numpy()[0]
preds = np.exp(preds)/np.sum(np.exp(preds))
np.sum(preds)
1.0000002
names = classnames.copy()
names["preds"] = preds
names = names.sort_values(by="preds", ascending=False)
names.head()
label preds
880 umbrella 0.581176
479 carton 0.035877
870 trench coat 0.031472
835 suit 0.010810
631 Loafer 0.005403
plt.figure(figsize=(15,3))
n = 40
plt.bar(range(n), names[:n].preds.values)
plt.xticks(range(n), names[:n].label.values, rotation="vertical");
plt.title("classes with highest prediction probability")
plt.grid();
../_images/U4.06 - Object Detection_29_0.png

One stage detectors

This blog: YOLO v3 theory explained contains a detailed explanation of how YOLOv3 builds its detection predictions.

Region priors

A set of box shapes representative of the objects that appear in the training dataset, typically obtained by clustering the training bounding-box dimensions with KMeans; one must decide how many priors to use. For instance:

Image("local/imgs/anchor_boxes.png", width=300)
../_images/U4.06 - Object Detection_31_0.png
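
As a minimal sketch, the priors can be obtained by clustering the normalized widths and heights of the annotated boxes (here using only the single image loaded above, whereas a real setting would cluster the whole training set; note also that YOLOv2 clusters with an IoU-based distance rather than the Euclidean distance KMeans uses by default):

from sklearn.cluster import KMeans
# each cluster center over (width, height) pairs becomes an anchor-box prior
wh = np.c_[boxes.XMax - boxes.XMin, boxes.YMax - boxes.YMin]
priors = KMeans(n_clusters=3, random_state=0).fit(wh).cluster_centers_
priors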

Image fixed partition

  • each fixed-size image cell is responsible for predicting the objects whose center falls within that cell.

  • for instance, the red dot below marks the tree's center and thus the cell responsible for its prediction.

fig = plt.figure(figsize=(8,8))
ax = plt.subplot(111)
plt.imshow(img)
n = 9
for i in range(n):
    plt.axvline(img.shape[1]//n*i, color="black")
    plt.axhline(img.shape[0]//n*i, color="black")
    
k = boxes.iloc[0]
label = c.loc[k.LabelName].values[0]
ax.add_patch(Rectangle((k.XMin*w,k.YMin*h),(k.XMax-k.XMin)*w,(k.YMax-k.YMin)*h, 
                       linewidth=4,edgecolor='r',facecolor='none'))
plt.text(k.XMin*w, k.YMin*h-10, label, fontsize=12, color="red")
plt.scatter(k.XMin*w+(k.XMax-k.XMin)*w*.5,k.YMin*h+(k.YMax-k.YMin)*h*.5, color="red", s=100)
<matplotlib.collections.PathCollection at 0x7fa055077df0>
../_images/U4.06 - Object Detection_33_1.png

Predictions

for each cell and each anchor box the model will make a prediction that will contain:

  • \(t_x\), \(t_y\): the offset of the object center from the cell’s top-left corner.

  • \(t_w\), \(t_h\): the width and height of the object bounding box, relative to the anchor box size.

  • \(t_0\): a proxy for the probability of an object’s center being present at that cell: \(Pr(object)*IOU(b, object)=\sigma(t_0)\)

  • \(\mathbf{p}_c\): a vector of class probabilities

observe that:

  • \(p_w\), \(p_h\) are the dimensions of the anchor box.

  • the sigmoid function \(\sigma\) is used to bound the offset coordinates to \((0,1)\), i.e. inside the cell.

  • the exponential function is used to keep the predicted sizes \(>0\): \(b_w = p_w e^{t_w}\), \(b_h = p_h e^{t_h}\).

  • in \(t_0\) we are interested in both the probability of an object and the IOU of the predicted bounding box.

Therefore for each cell and anchor box we have \(5+C\) predictions, \(C\) being the number of classes in our dataset.

Image taken from YOLO9000: Better, Faster, Stronger

Image("local/imgs/yolo_predictions.png")
../_images/U4.06 - Object Detection_35_0.png
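
As a sketch of how these equations turn a raw prediction into a box, assuming the \((t_x, t_y, t_w, t_h, t_0)\) values for one cell and one anchor:

def decode_box(t, cell_x, cell_y, p_w, p_h):
    # YOLO9000 equations: the sigmoid bounds the center offsets to the cell,
    # the exponential keeps sizes positive and relative to the anchor prior
    sigmoid = lambda z: 1 / (1 + np.exp(-z))
    t_x, t_y, t_w, t_h, t_0 = t
    b_x = cell_x + sigmoid(t_x)    # object center, in grid-cell units
    b_y = cell_y + sigmoid(t_y)
    b_w = p_w * np.exp(t_w)        # box size, scaled from the anchor prior
    b_h = p_h * np.exp(t_h)
    return b_x, b_y, b_w, b_h, sigmoid(t_0)   # sigmoid(t_0) ~ Pr(object)*IOU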

Typically, CNNs will

  • downsample the image dimensions to \(13\times13\) (or whatever grid of cells was defined) and apply a \(1\times1\) 2D convolution with \(n_a(5+C)\) channels, \(n_a\) being the number of anchor boxes (see the sketch after this list).

  • repeat the same process on earlier CNN layers (for instance where the activation map is \(52\times52\) or larger), so that there are sets of prediction boxes at several resolutions.
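
For instance, a minimal Keras sketch of such a prediction head (the \(13\times13\times1024\) input shape and the values \(n_a=3\), \(C=80\) are illustrative, not taken from any particular architecture):

n_a, C = 3, 80
head = tf.keras.Sequential([
    # 1x1 convolution over a 13x13 backbone activation map, producing
    # n_a*(5+C) prediction channels for each cell
    tf.keras.layers.Conv2D(n_a * (5 + C), kernel_size=1, input_shape=(13, 13, 1024))
])
head.output_shape   # (None, 13, 13, 255)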

This works well for predicting large objects, but small ones get lost in the CNN downsampling. To overcome this, different architectures use different techniques:

  • YOLOv3 makes predictions in earlier CNN layers besides the last one.

  • RetinaNet downsamples the image and then upsamples the activation maps, integrating (through skip-like connections) the high-level semantic information of late layers with the spatial information of earlier layers (see the sketch below).

See this blog for further intuition on this.
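
A hedged Keras sketch of this upsample-and-merge idea (shapes and channel counts are illustrative, not RetinaNet's actual configuration):

c4 = tf.keras.Input(shape=(26, 26, 512))    # earlier layer: finer spatial detail
c5 = tf.keras.Input(shape=(13, 13, 1024))   # later layer: stronger semantics
p5 = tf.keras.layers.Conv2D(256, 1)(c5)     # project to a common channel depth
p4 = tf.keras.layers.Add()([
    tf.keras.layers.UpSampling2D(2)(p5),    # upsample 13x13 -> 26x26
    tf.keras.layers.Conv2D(256, 1)(c4),     # lateral (skip-like) connection
])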

Loss function

Observe that we are doing BOTH regression (for the boxes) AND classification (for the object classes). A loss function must therefore be devised that accounts for both, as in the toy sketch below.

See this blog post and this blog post for detailed explanations.
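
As a toy sketch of this mixed structure (the per-cell layout [x, y, w, h, objectness, class probabilities...] is an assumption, and real YOLO losses add per-term weights and a more careful treatment of cells without objects):

def toy_detection_loss(y_true, y_pred):
    obj_mask = y_true[..., 4]   # 1 where a cell/anchor is assigned an object
    # regression term: box coordinates, only where an object is present
    box_loss = tf.reduce_sum(
        obj_mask[..., None] * tf.square(y_true[..., :4] - y_pred[..., :4]))
    # classification terms: objectness everywhere, class probs where present
    obj_loss = tf.reduce_sum(tf.keras.losses.binary_crossentropy(
        y_true[..., 4:5], y_pred[..., 4:5]))
    cls_loss = tf.reduce_sum(obj_mask * tf.keras.losses.categorical_crossentropy(
        y_true[..., 5:], y_pred[..., 5:]))
    return box_loss + obj_loss + cls_loss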

Non maximum suppression

Finally, since there may be many box predictions from different cells, anchors and resolutions, a decision must be made among overlapping predictions. This is non-maximum suppression, and you can check this blog for a detailed explanation.
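
A minimal greedy sketch of the idea (TensorFlow also ships a ready-made version as tf.image.non_max_suppression):

def iou(a, b):
    # intersection-over-union of two (x1, y1, x2, y2) boxes
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, iou_threshold=0.5):
    # keep boxes in decreasing score order, skipping any box that overlaps
    # an already-kept box by more than the IoU threshold
    keep = []
    for i in np.argsort(scores)[::-1]:
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in keep):
            keep.append(i)
    return keep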