Convolutional Neural Networks (CNNs) are biologically inspired variants of MLPs (multi-layer perceptrons). From Hubel and Wiesel's early work on the cat's visual cortex, we know that the visual cortex contains a complex arrangement of cells. These cells are sensitive to small sub-regions of the input space, called receptive fields, and are tiled in such a way as to cover the entire visual field. These cells act as local filters over the input space and are thus well suited to exploit the strong spatially local correlation present in natural images.
Figure 1: Hubel and Wiesel’s early work on cat’s visual cortex |
Another motivation for using CNNs is computational cost. In the sparse autoencoder, one design choice we had made was to "fully connect" all hidden units to all the inputs. On relatively small images (e.g. 28x28 images from the MNIST dataset), it was computationally feasible to learn features on the entire image. However, with larger images (e.g. 96x96 images), learning features that span the entire image (fully connected networks) is very computationally expensive – you would have about $10^{4}$ input units, and assuming you want to learn just 100 features, you would have on the order of $10^{6}$ parameters to learn. The feedforward and backpropagation computations would also be about $10^{2}$ times slower, compared to 28x28 images.
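As a quick sanity check on these counts, here is a minimal back-of-the-envelope sketch; the 28x28, 96x96, and 100-feature figures come from the text above, and the rest is just arithmetic:

```python
# Back-of-the-envelope parameter counts for a fully connected feature layer.
mnist_inputs = 28 * 28        # 784 input units (MNIST)
large_inputs = 96 * 96        # 9216 input units, i.e. about 10^4
num_features = 100            # hidden units we want to learn

params_mnist = mnist_inputs * num_features   # ~7.8 * 10^4 parameters
params_large = large_inputs * num_features   # ~9.2 * 10^5, on the order of 10^6

print(params_mnist, params_large)
```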
Figure 2: Fully Connected Networks |
Figure 3: Locally Connected Networks (shared weights) |
Thus, one simple solution to this problem would be to restrict the connections between the hidden
units and the input units, allowing each hidden unit to connect to only a small subset of the input units. Specifically, each
hidden unit will connect to only a small contiguous region of pixels in the input.
More precisely, having learned features over small (say 8x8) patches, sampled randomly from the larger image, we can then apply
this learned 8x8 feature detector anywhere in the image. Specifically, we can take the learned 8x8 features and convolve them
with the larger image, thus obtaining a different feature activation value at each location in the image.
Let's start by learning features over small 8x8 patches, sampled randomly from the larger image. Here are some sample images
from the reduced STL-10 dataset,
Figure 4: Reduced STL – 10 dataset. Contains 4 classes: airplane, car, cat, and dog. |
Here are the features learned over 8x8 patches, using a sparse autoencoder with a linear decoder on the output layer,
Figure 5: 400 features learned over 8×8 patches. |
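The sparse autoencoder training itself is not shown here, but the patch sampling it relies on is simple. Below is a minimal sketch of how one might randomly sample 8x8 patches from a stack of 64x64 grayscale images; the function name, array shapes, and single-channel assumption are illustrative choices, not the original code:

```python
import numpy as np

def sample_patches(images, patch_dim=8, num_patches=10000, seed=0):
    """Randomly sample square patches from a stack of grayscale images.

    images: array of shape (num_images, height, width)
    returns: array of shape (num_patches, patch_dim * patch_dim)
    """
    rng = np.random.default_rng(seed)
    num_images, height, width = images.shape
    patches = np.empty((num_patches, patch_dim * patch_dim))
    for i in range(num_patches):
        img = rng.integers(num_images)              # pick a random image
        r = rng.integers(height - patch_dim + 1)    # top-left row of the patch
        c = rng.integers(width - patch_dim + 1)     # top-left column of the patch
        patches[i] = images[img, r:r + patch_dim, c:c + patch_dim].ravel()
    return patches
```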
Now, convolving each of these 400 filters/features with the whole image produces a convolved image/feature map,
Figure (a): Image 3 convolved with 2nd feature |
Figure (b): Image 3 convolved with 3rd feature |
Figure (c): Image 39 convolved with 22nd feature |
Figure (d): Image 39 convolved with 45th feature |
Figure 6: Convolved images/Feature maps |
The input image is 64x64 pixels and is convolved with 400 8x8 features/filters,
giving a (64-8+1) x (64-8+1) x 400, i.e. 57 x 57 x 400, dimensional matrix. Each of these 400 57x57 matrices is known as a
feature map.
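A minimal sketch of this convolution step is shown below, using scipy's 2-D "valid" convolution. For simplicity a single-channel image is assumed, and the steps the full pipeline would also apply (adding the autoencoder's bias and passing each map through the sigmoid non-linearity) are omitted:

```python
import numpy as np
from scipy.signal import convolve2d

def convolve_features(image, features):
    """Convolve each learned feature with a (single-channel) image.

    image:    array of shape (64, 64)       -- assumed grayscale for simplicity
    features: array of shape (400, 8, 8)    -- the learned 8x8 feature detectors
    returns:  array of shape (57, 57, 400)  -- one feature map per feature
    """
    img_dim = image.shape[0]
    num_features, patch_dim, _ = features.shape
    conv_dim = img_dim - patch_dim + 1          # 64 - 8 + 1 = 57
    feature_maps = np.zeros((conv_dim, conv_dim, num_features))
    for f in range(num_features):
        # convolve2d flips its kernel; flip the feature first so this acts as
        # the cross-correlation usually meant by "convolution" in CNNs.
        kernel = np.flipud(np.fliplr(features[f]))
        feature_maps[:, :, f] = convolve2d(image, kernel, mode="valid")
    return feature_maps
```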
Here are some convolved images/feature maps,
Figure 7: Convolved Features of Car (Image 2). |
Figure 8: Convolved Features of Dog (Image 4). |
After obtaining features using convolution, we would next like to use them for classification. In
theory, one could use all the extracted features with a classifier such as a softmax classifier, but this can be
computationally challenging. Consider, for instance, our setting: 64x64 pixel images and 400 features learned over 8x8
inputs. Each convolution results in an output of size (64 − 8 + 1) * (64 − 8 + 1) = 3249, and since we have 400 features,
this results in a vector of $57^{2}$ * 400 = 1,299,600 features per example. Learning a classifier with inputs having 1+
million features can be unwieldy, and can also be prone to over-fitting.
To address this, one natural approach is to aggregate statistics of these features at various
locations. For example, one could compute the mean (or max) value of a particular feature over a region of the image. These
summary statistics are much lower in dimension (compared to using all of the extracted features) and can also improve results
(less over-fitting). This operation of aggregation is called pooling, or sometimes mean pooling or max pooling (depending on
the pooling operation applied).
The following image shows how pooling is done over 4 non-overlapping regions of the image.
Figure 9: Pooling |
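A minimal sketch of mean pooling over non-overlapping regions might look as follows; the function and variable names are illustrative, and swapping `mean` for `max` would give max pooling:

```python
import numpy as np

def mean_pool(feature_maps, pool_dim):
    """Mean-pool each feature map over non-overlapping pool_dim x pool_dim regions.

    feature_maps: array of shape (conv_dim, conv_dim, num_features)
    returns:      array of shape (conv_dim // pool_dim, conv_dim // pool_dim, num_features)
    """
    conv_dim, _, num_features = feature_maps.shape
    out_dim = conv_dim // pool_dim
    pooled = np.zeros((out_dim, out_dim, num_features))
    for i in range(out_dim):
        for j in range(out_dim):
            region = feature_maps[i * pool_dim:(i + 1) * pool_dim,
                                  j * pool_dim:(j + 1) * pool_dim, :]
            pooled[i, j, :] = region.mean(axis=(0, 1))  # use .max(...) for max pooling
    return pooled
```

For example, pooling the 57x57 feature maps above with pool_dim = 19 yields 3x3 pooled maps per feature.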
If one chooses the pooling regions to be contiguous areas in the image and only pools features
generated from the same (replicated) hidden units, then these pooling units will be translation invariant. This means
that the same (pooled) feature will be active even when the image undergoes (small) translations. Translation-invariant
features are often desirable; in many tasks (e.g., object detection, audio recognition), the label of the example (image) is
the same even when the image is translated. For example, if you were to take an MNIST digit and translate it left or right, you
would want your classifier to still accurately classify it as the same digit regardless of its final position.
So after we obtain our convolved features, as described earlier, we now decide on the pooling region.
In our case, we chose a pooling dimension of 19x19 pixels, so each 57x57 feature map is pooled down to 3x3 and the resulting
pooled features have dimension 3x3x400. Here are some pooled features of the convolved features of Image 2,
Figure 10: Pooled Feature of 2nd feature map (Image 2) |
Figure 11: Pooled Feature of 3rd feature map (Image 2) |
Now we use the pooled features to train a softmax classifier that maps them to the
class labels. Training the classifier takes only a few minutes.
We then evaluate the classifier on the test set: we convolve the test images, mean-pool those features,
and classify the result, obtaining an accuracy of 80%.
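As a rough sketch of this last step, one could use scikit-learn's multinomial logistic regression as a stand-in for the softmax classifier; the variable names and shapes below are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# pooled_train, pooled_test: arrays of shape (num_images, 3, 3, 400) from the pooling step
# y_train, y_test: integer class labels (0..3 for airplane, car, cat, dog)

X_train = pooled_train.reshape(len(pooled_train), -1)  # flatten 3x3x400 -> 3600 features
X_test = pooled_test.reshape(len(pooled_test), -1)

# Multinomial logistic regression is equivalent to a softmax classifier.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```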
Here are some of the classifier's prediction results. As you might notice, there are some
misclassified examples, which look similar to others in the same category,
Figure (a): Category 1 - Airplanes |
Figure (b): Category 2 - Cars |
Figure (c): Category 3 - Cats |
Figure (d): Category 4 - Dogs |
Figure 12: Result from trained classifier. Accuracy of 80% with some misclassified images. |