Convolutions

In the previous post we discussed using Convolutional Neural Networks for Image Recognition. In order to better understand what the algorithms are doing we need to explore convolutions in more detail.

Definition

A convolution of f and g, denoted f*g, is the integral of the product of the two functions after one is reversed and shifted. Mathematically:

    \[ (f \ast g)(t) = \int_{-\infty}^{\infty} f(\tau)g(t-\tau)d\tau = \int_{-\infty}^{\infty} f(t-\tau)g(\tau)d\tau \]

One way of thinking about a convolution is as an integral that expresses the amount of overlap of g as it is shifted over f: it “blends” one function with another. The image below, courtesy of Wikipedia, demonstrates this.

Here are some more intuitive explanations of convolutions (and ones more applicable to machine learning):

Think about a time series, say the fed funds rate over time. One way to smooth the data is to convolve it against a smaller list of weights. For example, to calculate a weekly moving average we convolve it against [1/5, 1/5, 1/5, 1/5, 1/5] (equivalently, convolve against [1,1,1,1,1] and divide by 5). For a monthly moving average we would convolve it with a list of 25 weights of 1/25 each (assuming 25 business days in a month).
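As a quick illustration, here is a rough numpy sketch of the weekly moving average (the rate series below is made up purely for illustration):

    import numpy as np

    # Made-up daily rate series, one value per business day
    rate = np.array([5.25, 5.25, 5.30, 5.35, 5.30, 5.40, 5.45, 5.50, 5.45, 5.50])

    # A 5-day moving average is a convolution with five equal weights of 1/5
    kernel = np.ones(5) / 5.0

    # mode='valid' keeps only positions where the kernel fully overlaps the data
    weekly_ma = np.convolve(rate, kernel, mode='valid')
    print(weekly_ma)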

As another example, consider rolling two six-sided dice. The probability distribution of the total can be represented as a convolution of the single-die distribution with itself. For example, to get a total of 4 we sum the ways of splitting it across the two dice: P(1)P(3) + P(2)P(2) + P(3)P(1) = 3/36.
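A minimal numpy sketch of the dice example:

    import numpy as np

    # Distribution of a single fair die: P(1) = ... = P(6) = 1/6
    one_die = np.ones(6) / 6.0

    # The distribution of the sum of two dice is the convolution of the
    # single-die distribution with itself; index 0 corresponds to a total of 2
    two_dice = np.convolve(one_die, one_die)

    print(two_dice[4 - 2])   # P(total = 4) = 3/36, roughly 0.0833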

So what does the “Convolution” in Convolutional Neural Networks (aka Deep Learning) do to images? Convolutions (called kernels in image processing) work as feature detectors. Let's look at a few convolutions applied to two images below.

The table compares each kernel's matrix with the result of applying it to the two sample images:

  • Original Image (no kernel)
  • Average
  • Sharpen
  • Edge Detection 1–4
  • Box Blur
  • Gaussian Blur
  • Gradient Detection 1 and 2
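To make the mechanics concrete, here is a rough numpy sketch of applying a kernel to a grayscale image. The sharpen and box blur matrices below are the standard textbook versions, not necessarily the exact matrices used in the table above:

    import numpy as np

    def convolve2d(image, kernel):
        """Naive 'valid' 2D convolution of a grayscale image with a kernel."""
        kh, kw = kernel.shape
        # Flip the kernel, per the mathematical definition of convolution
        flipped = kernel[::-1, ::-1]
        out_h = image.shape[0] - kh + 1
        out_w = image.shape[1] - kw + 1
        out = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * flipped)
        return out

    # Two classic 3x3 kernels
    sharpen = np.array([[ 0, -1,  0],
                        [-1,  5, -1],
                        [ 0, -1,  0]])
    box_blur = np.ones((3, 3)) / 9.0

    image = np.random.rand(32, 32)           # stand-in for a real grayscale image
    print(convolve2d(image, sharpen).shape)  # (30, 30)
    print(convolve2d(image, box_blur).shape)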

The amazing part about Deep Learning is that we don’t have to tell the algorithm which convolutions to use. It learns the kernels itself as part of training!

Image recognition on the CIFAR-10 dataset using deep learning

CIFAR-10 is an established computer vision dataset used for image recognition. It’s a subset of the 80 Million Tiny Images dataset collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton, and is available from the CIFAR website.

The CIFAR-10 dataset consists of 60,000 32×32 color images of 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images.

As a first post, I wanted to write a deep learning algorithm to identify images in the CIFAR-10 dataset. This topic has been covered very widely; in fact, Google’s TensorFlow tutorials cover it very well. However, they do a few things that made it difficult for me to follow their code:

  1. They split the code across multiple files, making it difficult to follow.
  2. They use a binary version of the data and a file stream to feed TensorFlow.

I downloaded the Python version of the data and loaded all the variables into memory. There is some image manipulation done in the TensorFlow tutorial that I recreated directly on the numpy arrays; we will discuss it below.

Prerequisites for this tutorial:
Other than Python (obviously!), you will need:

  • numpy
  • pickle
  • sklearn
  • tensorflow

For TensorFlow I strongly recommend the GPU version if you have the setup for it. The code takes 6 hours on my dual GTX Titan X machine, and running it on a CPU will probably take days or weeks!

Assuming you have everything working, let’s get started!

Start with our import statements
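Mine look roughly like this (a sketch; the sklearn import is only there for the one hot encoding later, and your exact list may differ):

    import os
    import pickle
    import tarfile
    import urllib.request

    import numpy as np
    import tensorflow as tf
    from sklearn.preprocessing import LabelBinarizer  # for one hot encoding the labels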

Declare some global variables we will use. In our code we are using the GradientDescentOptimizer with learning rate decay. I have tested the same code with the AdamOptimizer; Adam runs faster but gives slightly worse results. If you do decide to use the AdamOptimizer, drop the learning rate to 0.0001. The Adam optimizer is described in Kingma and Ba’s paper “Adam: A Method for Stochastic Optimization”.
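The values below are purely illustrative placeholders; the exact settings I trained with (and that produced the 81% result) may differ:

    data_dir      = 'cifar-10-data'   # where the dataset gets downloaded
    image_size    = 32                # CIFAR-10 images are 32x32
    num_channels  = 3
    num_classes   = 10
    batch_size    = 128
    epochs        = 100
    learning_rate = 0.1               # starting rate for GradientDescentOptimizer
    lr_decay      = 0.9               # decay factor for the learning rate schedule
    # If you switch to tf.train.AdamOptimizer, use something much smaller, e.g. 0.0001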

Create the data directory and download the data if it doesn’t already exist; this code skips the download if we already have the data.
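A sketch of the download step, using the imports and data_dir from above (the URL is the official Python pickle version of the dataset on the CIFAR site):

    cifar_url = 'https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz'
    archive_path = os.path.join(data_dir, 'cifar-10-python.tar.gz')
    extracted_dir = os.path.join(data_dir, 'cifar-10-batches-py')

    if not os.path.isdir(extracted_dir):
        os.makedirs(data_dir, exist_ok=True)
        if not os.path.isfile(archive_path):
            urllib.request.urlretrieve(cifar_url, archive_path)
        with tarfile.open(archive_path, 'r:gz') as tar:
            tar.extractall(path=data_dir)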

Load data into numpy arrays. The code below loads the labels from the batches.meta file, and the training and test data. The training data is split across 5 files. We also one hot encode the labels.
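Roughly, the loading and one hot encoding look like this (variable names are mine and may differ from the actual code; the b'...' keys are how the CIFAR pickle files store their dictionaries):

    def unpickle(path):
        with open(path, 'rb') as f:
            return pickle.load(f, encoding='bytes')

    # Class names live in batches.meta
    meta = unpickle(os.path.join(extracted_dir, 'batches.meta'))
    label_names = [name.decode('utf-8') for name in meta[b'label_names']]

    # Training data is split across data_batch_1 ... data_batch_5
    train_x, train_labels = [], []
    for i in range(1, 6):
        batch = unpickle(os.path.join(extracted_dir, 'data_batch_{}'.format(i)))
        train_x.append(batch[b'data'])
        train_labels.extend(batch[b'labels'])

    # Each row is 3072 bytes (1024 red, 1024 green, 1024 blue); reshape to NHWC
    train_x = np.concatenate(train_x).reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)

    test_batch = unpickle(os.path.join(extracted_dir, 'test_batch'))
    test_x = test_batch[b'data'].reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)
    test_labels = test_batch[b'labels']

    # One hot encode the labels
    lb = LabelBinarizer()
    train_y = lb.fit_transform(train_labels)
    test_y = lb.transform(test_labels)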

Having more training data can improve our algorithm. Since we are confined to 50,000 training images (5,000 for each category), we can “manufacture” more images using small image manipulations. We apply three transformations: flipping the image horizontally, randomly adjusting the brightness, and randomly adjusting the contrast. We also normalize the data. Note that there are different ways to do this; standardization works best for images, but rescaling can be an option as well.
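A numpy-only sketch of the idea (the flip probability and the brightness/contrast ranges are illustrative choices, not necessarily the ones I used; the TensorFlow tutorial does these steps with tf.image operations instead):

    def augment_and_standardize(images):
        """Horizontal flip, random brightness/contrast, then per-image standardization."""
        out = images.astype(np.float32)

        # Flip roughly half the images left-to-right (width is axis 2 in NHWC)
        flip_mask = np.random.rand(len(out)) < 0.5
        out[flip_mask] = out[flip_mask, :, ::-1, :]

        # Random brightness: add a small random offset per image
        out += np.random.uniform(-25, 25, size=(len(out), 1, 1, 1))

        # Random contrast: scale each image's distance from its own mean
        means = out.mean(axis=(1, 2, 3), keepdims=True)
        factors = np.random.uniform(0.8, 1.2, size=(len(out), 1, 1, 1))
        out = (out - means) * factors + means

        # Standardize each image to zero mean and unit variance
        means = out.mean(axis=(1, 2, 3), keepdims=True)
        stds = out.std(axis=(1, 2, 3), keepdims=True)
        return (out - means) / (stds + 1e-8)

    train_x = augment_and_standardize(train_x)

    # Standardize (but do not randomly distort) the test images the same way
    test_x = (test_x - test_x.mean(axis=(1, 2, 3), keepdims=True)) / \
             (test_x.std(axis=(1, 2, 3), keepdims=True) + 1e-8)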

Now comes the fun part. This is what our network looks like.

Let's define the various layers of the network. The last line of code (logits = tf.identity(final_output, name='logits')) is there in case you want to view the model in TensorBoard.
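As a rough sketch of what such a layer stack looks like in TensorFlow 1.x (the filter counts, kernel sizes and number of layers below are illustrative and not the exact architecture shown above):

    x = tf.placeholder(tf.float32, shape=[None, 32, 32, 3], name='input')
    y = tf.placeholder(tf.float32, shape=[None, num_classes], name='labels')

    def conv_layer(inputs, filters_in, filters_out, name):
        # 5x5 convolution + ReLU + 2x2 max pooling
        with tf.variable_scope(name):
            w = tf.Variable(tf.truncated_normal([5, 5, filters_in, filters_out], stddev=0.05))
            b = tf.Variable(tf.zeros([filters_out]))
            conv = tf.nn.conv2d(inputs, w, strides=[1, 1, 1, 1], padding='SAME')
            relu = tf.nn.relu(conv + b)
            return tf.nn.max_pool(relu, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

    conv1 = conv_layer(x, 3, 64, 'conv1')       # 32x32x3  -> 16x16x64
    conv2 = conv_layer(conv1, 64, 64, 'conv2')  # 16x16x64 -> 8x8x64

    flat = tf.reshape(conv2, [-1, 8 * 8 * 64])

    # Fully connected layer
    w3 = tf.Variable(tf.truncated_normal([8 * 8 * 64, 384], stddev=0.05))
    b3 = tf.Variable(tf.zeros([384]))
    fc1 = tf.nn.relu(tf.matmul(flat, w3) + b3)

    # Output layer
    w4 = tf.Variable(tf.truncated_normal([384, num_classes], stddev=0.05))
    b4 = tf.Variable(tf.zeros([num_classes]))
    final_output = tf.matmul(fc1, w4) + b4

    # Named so the model is easy to find in TensorBoard
    logits = tf.identity(final_output, name='logits')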

Now we define our cross entropy and optimization function. If you want to use the AdamOptimizer, uncomment that line, comment out the generation_run, model_learning_rate and train_step lines, and lower the learning rate to something like 0.0001; otherwise the model will not converge.
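A sketch of this step, using the generation_run / model_learning_rate / train_step names mentioned above (the decay schedule numbers are illustrative):

    # Softmax cross entropy against the one hot encoded labels
    cross_entropy = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y))

    # Gradient descent with learning rate decay
    generation_run = tf.Variable(0, trainable=False)
    model_learning_rate = tf.train.exponential_decay(
        learning_rate, generation_run, decay_steps=250, decay_rate=lr_decay, staircase=True)
    train_step = tf.train.GradientDescentOptimizer(model_learning_rate).minimize(
        cross_entropy, global_step=generation_run)

    # Alternative: Adam needs a much smaller learning rate (around 0.0001)
    # train_step = tf.train.AdamOptimizer(0.0001).minimize(cross_entropy)

    # Accuracy, for monitoring progress
    correct = tf.equal(tf.argmax(logits, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))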

Now we define some functions to run through our batches. For large networks, memory tends to be a big constraint, so we run through our training data in batches. One epoch is one full pass through the training set (in multiple batches). After each epoch we randomly shuffle the data, which helps improve how the algorithm learns. We run through each batch and train the algorithm, checking accuracy on the 1st, 2nd, ..., 10th, 20th, ..., 100th, ... steps. Lastly we calculate the final accuracy of the model and save it, so we can use the trained weights on test data without having to re-run training.
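In outline, the training loop looks something like this (it assumes the placeholders, train_step, accuracy and the preprocessed arrays from the earlier sketches; the logging schedule is an approximation of the 1, 2, ..., 10, 20, ... pattern):

    saver = tf.train.Saver()
    num_train = len(train_x)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        step = 0
        for epoch in range(epochs):
            # Shuffle the training data at the start of every epoch
            perm = np.random.permutation(num_train)
            shuffled_x, shuffled_y = train_x[perm], train_y[perm]

            for start in range(0, num_train, batch_size):
                batch_x = shuffled_x[start:start + batch_size]
                batch_y = shuffled_y[start:start + batch_size]
                sess.run(train_step, feed_dict={x: batch_x, y: batch_y})
                step += 1

                # Log accuracy on a logarithmic-ish schedule (1, 2, ..., 10, 20, ..., 100, ...)
                if step % (10 ** (len(str(step)) - 1)) == 0:
                    acc = sess.run(accuracy, feed_dict={x: batch_x, y: batch_y})
                    print('step {}: batch accuracy {:.3f}'.format(step, acc))

        # Approximate test accuracy (averaged over test batches), then save the weights
        test_acc = np.mean([
            sess.run(accuracy, feed_dict={x: test_x[i:i + batch_size], y: test_y[i:i + batch_size]})
            for i in range(0, len(test_x), batch_size)])
        print('test accuracy {:.3f}'.format(test_acc))
        saver.save(sess, os.path.join(data_dir, 'cifar_model.ckpt'))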

The model gives around 81% accuracy on the test set. I have an IPython notebook on my GitHub site that lets you load the saved model and run it on random samples from the test set. It outputs the image alongside the softmax probabilities of the top n predictions.