Intro

Machine learning has been growing by leaps and bounds in recent years, and with libraries like TensorFlow, it seems like almost anything is possible. One interesting application of neural networks is the classification of handwritten characters, in this case digits. This article walks through the fundamentals of creating and using a specific kind of network in TensorFlow: a convolutional neural network. Convolutional neural networks are specialized for image recognition and perform much better than a vanilla deep neural network on this kind of task.

Concepts

Before diving into this project, we will need to review some concepts.

TensorFlow

TensorFlow is more than just a machine learning library; it is really a library for creating distributed computation graphs whose execution can be deferred until needed. You first build a graph of calculations, then store it and execute it later within a "session". By storing neural network connection weights as matrices, TensorFlow can be used to create computation graphs that are effectively neural networks. This is the primary use of TensorFlow today, and it is how we'll be using it in this article.
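
To make the "build now, run later" idea concrete, here's a tiny standalone sketch (not part of the digit classifier) that defines a trivial graph and only computes its result once a session runs it:

import tensorflow as tf

# Building the graph: no arithmetic happens yet, we are only describing it.
a = tf.constant(2.0)
b = tf.constant(3.0)
total = a * b

# Running the graph: the session executes the deferred computation.
with tf.Session() as sess:
    print(sess.run(total))  # prints 6.0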

Convolutional Neural Networks

Convolutional neural networks are networks loosely inspired by the way the human visual system processes images. Information is received as a "block" of data, like an image, and filters are applied across the entire image, transforming it and revealing features that can be used for classification. For instance, one filter might find round edges, which could indicate a five or a six. Other filters might find straight lines, indicating a one or a seven. The weights of these filters are learned as the model receives data, so the network gets better and better at classifying images by getting better and better at coaxing features out with its filters. There is much more to a convolutional neural network than this, but it will suffice for this article.
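
To make the idea of a filter concrete, here is a small standalone example (separate from the network we'll build below) that applies a hand-made vertical-edge filter to a toy image using tf.nn.conv2d. The filter values here are picked by hand purely for illustration; in a convolutional network they are learned:

import numpy as np
import tensorflow as tf

# A toy 6x6 "image": dark on the left, bright on the right, so there is a vertical edge.
image = np.zeros((1, 6, 6, 1), dtype=np.float32)
image[0, :, 3:, 0] = 1.0

# A hand-made 3x3 vertical-edge filter (shape: height, width, in channels, out channels).
edge_filter = np.array([[-1, 0, 1],
                        [-1, 0, 1],
                        [-1, 0, 1]], dtype=np.float32).reshape(3, 3, 1, 1)

convolved = tf.nn.conv2d(image, edge_filter, strides=[1, 1, 1, 1], padding="VALID")

with tf.Session() as sess:
    # The output is largest in the columns where the filter lines up with the edge.
    print(sess.run(convolved)[0, :, :, 0])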

The Data

How do we get the data we'll need to train this network? No problem; TensorFlow provides an easy helper for fetching the MNIST dataset, a common machine learning dataset used for classifying handwritten digits. Simply import the input_data module from the TensorFlow MNIST tutorial namespace as below (we also import TensorFlow itself and the fully_connected helper, which we'll use later when building the network). You will need to reshape the data into 28 by 28 squares, since the original dataset stores each image as a flat list of 784 numbers.

import tensorflow as tf
# fully_connected is used later for the network's dense layers
from tensorflow.contrib.layers import fully_connected
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("/tmp/data")

test_imgs = mnist.test.images.reshape(-1, 28, 28, 1)
test_lbls = mnist.test.labels

train_imgs = mnist.train.images.reshape(-1, 28, 28, 1)
train_lbls = mnist.train.labels

The Network

So how might we build such a network? Where do we start? Luckily for us, TensorFlow provides this functionality out of the box, so there's no need to reinvent the wheel. The first things that must be defined are our input and output variables. For these, we'll use placeholders.

X = tf.placeholder(tf.float32, shape=(None, 28, 28, 1))
y = tf.placeholder(tf.int64, shape=(None,), name="y")

Next, we need to define our initial filters. In order to avoid vanishing/exploding gradients, initializing the weights from a truncated normal distribution is recommended. In our case, we will have two sets of filters for our two convolutional layers.

filters = tf.Variable(tf.truncated_normal((5,5,1,32), stddev=0.1))
filters_2 = tf.Variable(tf.truncated_normal((5,5,32,64), stddev=0.1))

Finally, we need to create our actual convolutional layers. This is done using TensorFlow's tf.nn.conv2d method. We also use a name scope to keep things organized. Note the max pooling layers between convolutional layers. The max pool layers aggregate the image data from each filter using a fixed rule (keeping the maximum value in each small window) and are not trained. They simply reduce the complexity of the data by shrinking the spatial size of the feature maps produced by our filters.

with tf.name_scope("dnn"):
    convolution = tf.nn.conv2d(X, filters, strides=[1,2,2,1], padding="SAME")
    max_pool = tf.nn.max_pool(convolution, ksize=[1,2,2,1], strides=[1,2,2,1], padding="VALID")
    convolution_2 = tf.nn.conv2d(max_pool, filters_2, strides=[1,2,2,1], padding="SAME")
    max_pool_2 = tf.nn.max_pool(convolution_2, ksize=[1,2,2,1], strides=[1,2,2,1], padding="VALID")
    flatten = tf.reshape(max_pool_2, [-1, 2 * 2 * 64])
    predict = fully_connected(flatten, 1024, scope="predict")
    keep_prob = tf.placeholder(tf.float32)
    dropout = tf.nn.dropout(predict, keep_prob)
    logits = fully_connected(dropout, n_outputs, scope="outputs", activation_fn=None)

Also note that before our prediction layer, we have to flatten the final max pool output so it can be fed into the fully connected layer. You can print the shapes of the various layers as shown below to figure out what size each layer needs to be.

print("conv", convolution.get_shape())
print("max", max_pool.get_shape())
print("conv2", convolution_2.get_shape())
print("max2", max_pool_2.get_shape())
print("flat", flatten.get_shape())
print("predict", predict.get_shape())
print("dropout", dropout.get_shape())
print("logits", logits.get_shape())
print("logits guess", logits_guess.get_shape())
print("correct", correct.get_shape())
print("accuracy", accuracy.get_shape())

We also apply dropout to avoid overfitting, and we do not apply an activation function to our outputs. Instead, the softmax and cross-entropy are computed together at each training step, which is more efficient and numerically stable. Now to create our training and evaluation operations. We will also namespace these like the previous layers, to make things easier to understand when they are viewed in a visualization tool like TensorBoard.

Our loss is the average of the cross-entropy between the expected outputs and our logits; this much should make sense. For training, we use an Adam optimizer, which is almost always a good default. The learning rate used in this article is 1e-4, the same learning rate used in TensorFlow's own "expert" tutorial on MNIST. Our evaluation is a little more complicated. Since we are training with batches, we need to get the prediction for each item in the batch. We do this by applying tf.argmax to every row of logits using tf.map_fn. Then, we compare the guesses to the actual labels using tf.equal. Our accuracy is the average number of correct predictions (i.e., the percentage of digits we classified correctly).

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")
    
with tf.name_scope("train"):
    optimizer = tf.train.AdamOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

with tf.name_scope("eval"):
    logits_guess = tf.cast(tf.map_fn(tf.argmax, logits, dtype=tf.int64), tf.int64)
    correct = tf.equal(logits_guess, y)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()

To actually train the network, we will need to run through the data several times, running a batch at every iteration. In this case, we will aim for 20,000 iterations. To calculate how many epochs we will need for our batch size, we use the following code.

keep_prob_num = 0.5
batch_size = 50
goal_iterations = 20000
iterations = mnist.train.num_examples // batch_size
epochs = int(goal_iterations / iterations) # so that total iterations ends up being around goal_iterations

Now to actually run the training operation on our graph.

with tf.Session() as sess:
    sess.run(init)
    for i in range(epochs):
        for iteration in range(iterations):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            X_batch_shaped = X_batch.reshape(X_batch.shape[0], 28, 28, 1)
            sess.run(training_op, feed_dict = {X: X_batch_shaped, y: y_batch, keep_prob: keep_prob_num})
            print("epoch:",i)
            print("iteration:", iteration)

It's also recommended that you save the model and evaluate the accuracy at every epoch. You can accomplish this with the following snippets; a combined sketch showing where they fit in the training loop comes after them.

Evaluating

# run inside the training session, once per epoch
accuracy_val = sess.run(accuracy, feed_dict = {X: train_imgs, y: train_lbls,  keep_prob: 1.0})
print("accuracy:", accuracy_val)

Saving

saver = tf.train.Saver()  # create the saver once, after the graph has been built
saver.save(sess, save_path)  # save_path is your chosen checkpoint location
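
One way to put these pieces together, assuming a checkpoint location of your choosing (the "./mnist_cnn.ckpt" below is only an example), is to nest the evaluation and save inside the epoch loop:

saver = tf.train.Saver()
save_path = "./mnist_cnn.ckpt"  # example location; use whatever path you like

with tf.Session() as sess:
    sess.run(init)
    for i in range(epochs):
        for iteration in range(iterations):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            X_batch_shaped = X_batch.reshape(X_batch.shape[0], 28, 28, 1)
            sess.run(training_op, feed_dict={X: X_batch_shaped, y: y_batch, keep_prob: keep_prob_num})
        # once per epoch: evaluate with dropout disabled, then checkpoint the model
        accuracy_val = sess.run(accuracy, feed_dict={X: train_imgs, y: train_lbls, keep_prob: 1.0})
        print("epoch:", i, "accuracy:", accuracy_val)
        saver.save(sess, save_path)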

After running this model through all epochs and iterations, your accuracy should be around 99.2%. Let's check that.

with tf.Session() as sess:
    saver.restore(sess, save_path) #assume you've saved model, but could run in same session immediately after training
    accuracy_val = sess.run(accuracy, feed_dict = {X: test_imgs, y: test_lbls,  keep_prob: 1.0}) # test accuracy
    t_accuracy_val = sess.run(accuracy, feed_dict = {X: train_imgs, y: train_lbls,  keep_prob: 1.0}) # training accuracy
    print("accuracy:", accuracy_val)
    print("train accuracy:", t_accuracy_val)

Of course, in the above, the test accuracy is what's most important, as we want our model to generalize to new data.

Improvements

There are several steps you can take to improve on this model. One is to apply affine transformations (small shifts, rotations, and skews) to the images, creating additional training images that are similar to but slightly different from the originals; see the sketch below. This helps account for handwriting with various "tilts" and other tendencies. You can also train several copies of the same network as an ensemble and have them make the final prediction together, averaging their predictions or choosing the prediction with the highest confidence.
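
As a rough sketch of the augmentation idea, and assuming scipy is available (it is not used anywhere else in this article), something like the following could generate shifted and rotated copies of the training images:

import numpy as np
from scipy.ndimage import rotate, shift

def augment(images):
    # images: a batch of 28x28x1 arrays; returns shifted and rotated copies
    augmented = []
    for img in images:
        augmented.append(shift(img, [1, 0, 0], mode="constant"))       # shift down one pixel
        augmented.append(rotate(img, 10, axes=(0, 1), reshape=False))  # rotate by 10 degrees
    return np.array(augmented)

extra_imgs = augment(train_imgs[:1000])  # for example, augment a subset of the training images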

Conclusion

TensorFlow makes digit classification easier than ever. Machine learning is no longer the domain of specialists, but rather should be a tool in the belt of every programmer, to help solve complex optimization, classification, and regression problems for which there is no obvious or cost-effective solution, and for programs which must respond to new information. Machine learning is the way of the future for many problems, and as has been said in another blogger's post: it's unreasonably effective.