Character level language models using Recurrent Neural Networks

In recent years Recurrent Neural Networks have shown great results in NLP tasks – generating text, neural machine translation, question answering, and a lot more.

In this post we will explore text generation – teaching computers to write in a certain style. This is based off (and a recreation of) Andrej Karpathy’s famous article The Unreasonable Effectiveness of Recurrent Neural Networks.

Predicting the next character in a sentence is a language modeling problem. Traditionally this was done using n-gram models. For example, a unigram model would be the distribution of individual characters. At each time step we would predict a character using that probability distribution. A bigram model would use the probability distribution over pairs of characters (for example, given the first letter a, what is the probability that the second letter is n). Mathematically:

    \[P(W_n|W_{n-1}) = \frac{P(W_{n-1},W_n)}{P(W_{n-1})}\]
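To make the formula concrete, here is a minimal sketch that estimates the bigram conditional probability from raw character counts. With counts, the ratio of joint to marginal probability reduces to count(previous, next) / count(previous). The function name and the toy string are illustrative assumptions, not part of the original code.

```python
from collections import Counter

def bigram_conditional(text):
    """Estimate P(next | prev) for every adjacent character pair in text."""
    pair_counts = Counter(zip(text, text[1:]))  # counts of (prev, next) pairs
    prev_counts = Counter(text[:-1])            # how often each char starts a pair
    return {
        (prev, nxt): count / prev_counts[prev]
        for (prev, nxt), count in pair_counts.items()
    }

probs = bigram_conditional("banana bandana")
# "a" starts five pairs, and four of them are followed by "n".
print(probs[("a", "n")])  # 0.8
```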

Doing this at a word level has a disadvantage – how to handle out-of-vocabulary words. Character models don’t have this problem since they learn general distributions of the underlying text. However, the challenge with n-gram models (word and character) is that the memory required grows exponentially with n. We therefore have a limit to how far back in a sequence we can look. In our example we use an alphabet size of 98 characters (lowercase and capital letters, and special characters like space, parentheses, etc.). A bigram model would have 9,604 possible letter pairs. With a trigram model it grows to 941,192 possible triplets. In our example we go back 30 characters. That would require us to store roughly 5.45e59 possible combinations.
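The growth is easy to check with a few lines of arithmetic. This small sketch assumes only the 98-character alphabet quoted above; the function name is made up for illustration.

```python
ALPHABET_SIZE = 98  # alphabet size quoted in the text

def ngram_table_size(n, alphabet_size=ALPHABET_SIZE):
    """Number of distinct character n-grams over the alphabet."""
    return alphabet_size ** n

for n in (2, 3, 30):
    print(n, f"{ngram_table_size(n):.3g}")
```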

This is where we can leverage the use of RNNs. I’m assuming you have an understanding of LSTMs and I will only describe the network architecture here. There is an excellent article by Christopher Olah on understanding RNNs and LSTMs that goes into the details of the underlying math.

For this problem we take data in sequences of 30 characters and try to predict the next character for each letter. We are using stateful LSTMs – the data is fed in batches but each batch is a continuation of the previous one. We also save the state of the LSTM at the end of each batch and use this as the initial state for the next batch. The benefit of doing this is that the system can learn longer term dependencies like closing an open parenthesis or bracket, ending a sentence with a period, etc. The code is available on my GitHub, and you can tweak the model parameters to see how the results look.
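The batching scheme is the part that is easiest to get wrong, so here is a rough sketch of one way to lay out the data for a stateful LSTM. The text is split into batch_size parallel streams, and batch k holds the k-th chunk of every stream, so row i of one batch picks up exactly where row i of the previous batch stopped. This is an assumption about the layout, not a copy of the code in the repository, and the names are hypothetical.

```python
def stateful_batches(text, batch_size, seq_len):
    """Yield (inputs, targets) where row i of each batch continues
    row i of the previous batch, as a stateful LSTM expects."""
    # Targets are the inputs shifted by one character, so reserve one extra char.
    stream_len = (len(text) - 1) // batch_size
    num_batches = stream_len // seq_len
    for b in range(num_batches):
        inputs, targets = [], []
        for row in range(batch_size):
            start = row * stream_len + b * seq_len
            inputs.append(text[start:start + seq_len])
            targets.append(text[start + 1:start + seq_len + 1])
        yield inputs, targets

batches = list(stateful_batches("abcdefghijklmnopqrstuvwxy", batch_size=2, seq_len=3))
print(batches[0])  # (['abc', 'mno'], ['bcd', 'nop'])
print(batches[1])  # (['def', 'pqr'], ['efg', 'qrs'])
```

In the real model seq_len would be 30, and the LSTM state at the end of one batch would be passed in as the initial state for the next.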

The model is agnostic to the data. I ran it on 3 different datasets – Shakespeare, Aesop’s fables and a crawl of Paul Graham’s website. The same code learns to write in each style after a few epochs. In each case, it learns formatting, which words are commonly used, to close open quotes and parentheses, etc.

We generate sample data as follows – we sample a capital letter (“L” in our case) and then ask the RNN to predict the next letter. We take the n highest probabilities (2 in these examples, but it’s a parameter that can be adjusted) and generate the next letter from them. Using that letter we generate the next one, and so on. Here are samples of the data for each dataset.
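A minimal sketch of the sampling step, assuming the choice among the top n candidates is weighted by their predicted probabilities (the post does not say whether the pick is weighted or uniform). The function name is hypothetical.

```python
import random

def sample_top_n(probs, n=2, rng=random):
    """Pick an index among the n most probable entries, weighted by probability."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:n]
    weights = [probs[i] for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

# Toy distribution over a 4-character alphabet: only indices 1 and 2 can be drawn.
rng = random.Random(0)
draws = {sample_top_n([0.1, 0.5, 0.3, 0.1], n=2, rng=rng) for _ in range(200)}
print(draws)
```

Restricting the draw to the top n keeps the output from wandering into very unlikely characters, while still leaving some variety compared with always taking the single most likely one.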

Shakespeare – we can see that the model learns quickly. At the end of the first epoch it has already learned to format the text, close parentheses (past the 30 character input) and add titles and scenes. After 5 epochs it gets even better and at 60 epochs it generates very “Shakespeare like” text.

Paul Graham posts – we have about 80% less data compared to Shakespeare and his writing style is more “diverse” so the model doesn’t do as well after the first epoch. Words are often incomplete. After 5 epochs we see a significant improvement – most words and the language structure are correct. The writing style is starting to resemble Paul Graham. After 60 epochs we see a big improvement overall but still have issues with some nonsensical words.

Aesop’s fables – the dataset is quite small so the model takes a lot longer to train. But it also gives us an insight into how the RNN is learning. After 1 epoch it only learns the more common letters in the language. It took 15 epochs for it to start to put words together. After 60 epochs it does better, but still produces non-English words. It does learn the writing style though (animal names in capitals, different formatting from the above examples, etc.).

The source code is available on my GitHub for anyone who wants to play with it. Please make sure you have a GPU with CUDA and cuDNN installed, otherwise it will take forever to train. The model parameters can be changed using command line arguments.

I also added a file to the repository that lets you try the same exercise with n-grams, to compare how well the deep learning method does against different n-gram sizes (both speed and accuracy).

The SQuAD Challenge – Machine Comprehension on the Stanford Question Answering Dataset




Over the past few years we have seen some significant advances in NLP tasks like Named Entity Recognition [1], Part of Speech Tagging [2] and Sentiment Analysis [3]. Deep learning architectures have replaced conventional Machine Learning approaches with impressive results. However, reading comprehension remains a challenging task for machine learning [4][5]. The system has to be able to model complex interactions between the paragraph and question. Only recently have we seen models come close to human level accuracy (based on certain metrics for a specific, constrained task). For this paper I implemented the Bidirectional Attention Flow model [6], using pretrained word vectors and training my own character level embeddings. These were combined and passed through multiple deep learning layers to generate a query aware context representation of the paragraph text. My model achieved 76.553% F1 and 66.401% EM on the test set.


2014 saw some of the first scientific papers on using neural networks for machine translation (Bahdanau, et al [7], Cho, et al [8], Sutskever, et al [9]). Since then we have seen an explosion in research leading to advances in Sequence to Sequence models, multilingual neural machine translation, text summarization and sequence labeling.
Machine comprehension evaluates a machine’s understanding by posing a series of reading comprehension questions and associated text, where the answer to each question can be found only in its associated text [5]. Machine comprehension has been a difficult problem to solve – a paragraph would typically contain multiple sentences and Recurrent Neural Networks are known to have problems with long term dependencies. Even though LSTMs and GRUs address the exploding/vanishing gradients RNNs experience, they too struggle in practice. Using just the last hidden state to make predictions means that the final hidden state must encode all the information about a long word sequence. Another problem has been the lack of large datasets that deep learning models need in order to show their potential. MCTest [10] has 500 paragraphs and only 2,000 questions.
Rajpurkar, et al addressed the data issue by creating the SQuAD dataset in 2016 [11]. SQuAD uses articles sourced from Wikipedia and has more than 100,000 questions. The labelled data was obtained by crowdsourcing on Amazon Mechanical Turk – three human responses were taken for each answer and the official evaluation takes the maximum F1 and EM scores for each one.

Sample SQuAD data

Since the release of SQuAD, new research has pushed the boundaries of machine comprehension systems. Most of these use some form of attention mechanism [6][12][13], which tells the decoder layer to “attend” to specific parts of the source sentence at each step. Attention mechanisms address the problem of trying to encode the entire sequence into a final hidden state.
Formally, we can define the task as follows – given a context paragraph c and a question q, we need to predict the answer span (a_start, a_end), the start and end indices in the context text where the answer lies.
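As an illustration of what predicting a span involves, the sketch below picks the pair (a_start, a_end) that maximizes the product of the start and end probabilities, subject to the end not coming before the start. This is a common decoding rule and an assumption on my part here; the function name and the max_answer_len default are hypothetical.

```python
def best_span(start_probs, end_probs, max_answer_len=20):
    """Return (a_start, a_end) maximizing start_probs[i] * end_probs[j]
    with i <= j < i + max_answer_len."""
    best, best_score = (0, 0), -1.0
    for i, p_start in enumerate(start_probs):
        for j in range(i, min(i + max_answer_len, len(end_probs))):
            score = p_start * end_probs[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

# Taking each argmax independently would give start=2, end=0, which is not a valid span.
print(best_span([0.1, 0.2, 0.7], [0.6, 0.3, 0.1]))  # (2, 2)
```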
For this project I implemented the Bidirectional Attention Flow model [6] – a hierarchical multi-stage model that has performed very well on the SQuAD dataset. I trained my own character vectors [15][16], and used pretrained GloVe embeddings [14] for the word vectors. My final submission was a single model – ensemble models would typically yield better results but the complexity of my model meant longer training times.

Related Work

Since its introduction in June 2016, the SQuAD dataset has seen lots of research teams working on the challenge. There is a leaderboard maintained for the challenge. Submissions since Jan 2018 have beaten human accuracy on one of the metrics (Microsoft Research, Alibaba and Google Brain are on this list at the time of writing this paper). Most of these models use some form of attention mechanism and ensemble multiple models.
For example, the R-Net by Microsoft Research [12] is a high performing SQuAD model. They use word and character embeddings along with Self-Matching attention. The Dynamic Coattention Network [13], another high performing SQuAD model, uses coattention.


My model architecture is very closely based on the BiDAF model [6]. I implemented the following layers:

  • Embedding layer – Maps words to high dimensional vectors. The embedding layer is applied separately to both the context and question text. I used two methods:
    • Word embeddings – Maps each word to pretrained vectors. I used 300 dimensional GloVe vectors.
    • Character embeddings – Maps the characters of each word to character embeddings and runs them through multiple convolution and max pooling layers. I trained my own character embeddings due to challenges with the dataset.
  • RNN Encoder layer – Takes the context and question embeddings and runs each one through a Bi-Directional RNN (LSTM). The Bi-RNNs share weights in order to enrich the context-question relationship.
  • Attention Layer – Calculates the BiDirectional attention flow (Context to Query attention and Query to Context attention). We concatenate this with the context embeddings.
  • Modeling Layer – Runs the attention and context layers through multiple layers of Bi-Directional RNNs (LSTMs)
  • Output layer –  Runs the output of the Modeling Layer through two fully connected layers to calculate the start and end indices of the answer span.
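To show what the attention layer computes, here is a NumPy sketch of bidirectional attention as I understand it from the BiDAF paper [6]: a similarity matrix between every context and question position, context-to-query attention over the question, query-to-context attention over the context, and a final concatenation. It works on a single example with random inputs, and the variable names (H, U, w) are assumptions for illustration rather than names from my implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bidaf_attention(H, U, w):
    """H: context encodings (T, d); U: question encodings (J, d);
    w: similarity weights (3d,). Returns a (T, 4d) query aware context."""
    T, J = H.shape[0], U.shape[0]
    Hh = np.repeat(H[:, None, :], J, axis=1)             # (T, J, d)
    Uu = np.repeat(U[None, :, :], T, axis=0)             # (T, J, d)
    # Similarity S[t, j] = w . [h_t; u_j; h_t * u_j]
    S = np.concatenate([Hh, Uu, Hh * Uu], axis=-1) @ w   # (T, J)
    # Context-to-query: each context word attends over the question words.
    c2q = softmax(S, axis=1) @ U                         # (T, d)
    # Query-to-context: weight context words by their best question match.
    b = softmax(S.max(axis=1), axis=0)                   # (T,)
    q2c = np.tile(b @ H, (T, 1))                         # (T, d)
    return np.concatenate([H, c2q, H * c2q, H * q2c], axis=1)

rng = np.random.default_rng(0)
H, U, w = rng.normal(size=(5, 4)), rng.normal(size=(3, 4)), rng.normal(size=12)
G = bidaf_attention(H, U, w)
print(G.shape)  # (5, 16)
```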


The dataset for this project was SQuAD – a reading comprehension dataset. SQuAD uses articles sourced from Wikipedia and has more than 100,000 questions. Our task is to find the answer span within the paragraph text that answers the questions.
The sentences (all converted to lowercase) are tokenized into words using nltk. The words are then converted into high dimensional vector embeddings using GloVe. The characters for each word are also converted into character embeddings and then run through a series of convolutional and max pooling layers. I ran some analysis on the word and character counts in the dataset to better understand what model parameters to use.

We can see that:

  • 99.8 percent of paragraphs are under 400 words
  • 99.9 percent of questions are under 30 words
  • 99 percent of answers are under 20 words (97.6 percent are under 15 words)
  • 99.9 percent of answer spans lie within the first 300 paragraph words

We can use these statistics to adjust our model parameters (described in the next section).
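Statistics like these come from a simple coverage calculation: for a candidate length limit, count what fraction of examples fall under it. A small sketch follows, with made-up lengths standing in for the real token counts.

```python
def coverage(lengths, limit):
    """Fraction of sequences whose length is strictly under limit."""
    return sum(1 for n in lengths if n < limit) / len(lengths)

# Hypothetical paragraph lengths in words, not taken from SQuAD.
paragraph_lengths = [10, 50, 120, 450]
print(coverage(paragraph_lengths, 400))  # 0.75
```

Picking the smallest limit that still covers nearly all examples keeps the padded input tensors small without throwing away much data.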
For the character level encodings, I did an analysis of the character vocabulary in the training text. We had 1,258 unique characters. Since we are using Wikipedia for our training set, many articles contain foreign characters.

Further analysis suggested that these special characters don’t really affect the meaning of a sentence for our task, and that the answer span contained 67 unique characters. I therefore selected these 67 as my character vocabulary and replaced all the others with a special REPLACEMENT TOKEN.
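The replacement step can be sketched as below. The vocabulary is built from the characters seen in the answers, and anything outside it is mapped to a single placeholder. The placeholder character and function names are hypothetical; the post only says that a special replacement token was used.

```python
REPLACEMENT = "\ufffd"  # hypothetical stand-in for the replacement token

def build_char_vocab(answers):
    """Collect every character that appears in any answer span."""
    return set("".join(answers))

def replace_unknown(text, vocab, replacement=REPLACEMENT):
    """Map every character outside the vocabulary to the replacement token."""
    return "".join(ch if ch in vocab else replacement for ch in text)

vocab = build_char_vocab(["paris france", "1066"])
cleaned = replace_unknown("pariş 1066", vocab)
print(cleaned)
```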
Instead of using one-hot embeddings for character vectors, I trained my own character vectors on a subset of Wikipedia. I ran the word2vec algorithm at a character level to get 50 dimensional char2vec character embeddings. A t-SNE plot of the embeddings shows us results similar to word2vec.

I used these trained character vectors for my character embeddings. The maximum length of a paragraph word was 37 characters, and 30 characters for a question word. Since we are using max pooling, I used these as my character dimensions and padded with zero vectors for smaller words.

Model Configuration

I used the following parameters for my model. Some of these (context length, question length, etc.) were fixed based on the data analysis in the previous section. Others were set by trying different parameters to see which ones gave the best results.

| Parameter | Description | Value |
| --- | --- | --- |
| context_len | Number of words in the paragraph input | 300 |
| question_len | Number of words in the question input | 30 |
| embedding_size | Dimension of GloVe embeddings | 300 |
| context_char_len | Number of characters in each word for the paragraph input (zero padded) | 37 |
| question_char_len | Number of characters in each word for the question input (zero padded) | 30 |
| char_embed_size | Dimension of character embeddings | 50 |
| optimizer | Optimizer used | Adam |
| learning_rate | Learning rate | 0.001 |
| dropout | Dropout (used one dropout rate across the network) | 0.15 |
| hidden_size | Size of hidden state vector in the Bi-Directional RNN layers | 200 |
| conv_channel_size | Number of channels in the Convolutional Neural Network | 128 |

Evaluation metric

Performance on SQuAD was measured via two metrics:

  • ExactMatch (EM) – Binary measure of whether the system output matches the ground truth exactly.
  • F1 – Harmonic mean of precision and recall.
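The two metrics can be sketched in a few lines. This is a simplified version that only lowercases and splits on whitespace; the official evaluation script also normalizes punctuation and articles, so scores from this sketch will not match the leaderboard exactly. The function names are my own.

```python
from collections import Counter

def exact_match(prediction, truth):
    """1.0 if the token sequences are identical, else 0.0."""
    return float(prediction.lower().split() == truth.lower().split())

def f1_score(prediction, truth):
    """Token-level harmonic mean of precision and recall."""
    pred_tokens = prediction.lower().split()
    true_tokens = truth.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

def best_over_references(metric, prediction, references):
    """Score against each human answer and keep the maximum."""
    return max(metric(prediction, ref) for ref in references)

print(f1_score("the eiffel tower", "eiffel tower"))     # precision 2/3, recall 1
print(exact_match("the eiffel tower", "eiffel tower"))  # 0.0
```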


My model achieved the following results (I scored much higher on the Dev and Test leaderboards than on my Validation set)

| Dataset | F1 | EM |
| --- | --- | --- |
| Train | 81.600 | 68.000 |
| Val | 69.820 | 54.930 |
| Dev | 75.509 | 65.497 |
| Test | 76.553 | 66.401 |

The original BiDAF paper had an F1 score of 77.323 and EM score of 67.947. My model scored a little lower, possibly because I am missing some details not mentioned in their paper, or I need to tweak my hyperparameters further. Also, my scores were lower running against my cross validation set vs the official competition leaderboard.
I tracked accuracy on the validation set as I added more complexity to my model. I found it interesting to understand how each additional element contributed to the overall score. Each row tracks the added complexity and scores related to adding that component.

| Model | F1 | EM |
| --- | --- | --- |
| Baseline | 39.34 | 28.41 |
| BiDAF | 42.28 | 31.00 |
| Smart Span (adjust answer end location) | 44.61 | 31.13 |
| 1 Bi-directional RNN in Modeling Layer | 66.83 | 51.40 |
| 2 Bi-directional RNNs in Modeling Layer | 68.28 | 53.10 |
| 3 Bi-directional RNNs in Modeling Layer | 68.54 | 53.25 |
| Character CNN | 69.82 | 54.93 |

I also analyzed the questions where we scored zero on the F1 and EM metrics. The F1 score is more forgiving: we get a non-zero F1 if we predict even one word correctly against any of the human responses. I split the questions that scored zero on each metric by question type. The error rates are roughly proportional to the distribution of the question types in the dataset.

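| Question Type | Entire Dev Set (%) | F1=0 (%) | EM=0 (%) |
| --- | --- | --- | --- |
| what | 27.2 | 28.4 | 29.3 |
| is | 18.4 | 18.5 | 18.4 |
| did | 9.1 | 8.8 | 9.0 |
| was | 8.7 | 9.1 | 7.9 |
| do | 6.9 | 6.9 | 7.9 |
| how | 6.2 | 5.9 | 6.1 |
| who | 6.2 | 6.7 | 6.1 |
| are | 4.4 | 3.7 | 4.2 |
| which | 3.3 | 3.4 | 3.1 |
| where | 2.3 | 2.5 | 2.5 |
| when | 3.9 | 2.9 | 2.3 |
| name | 1.8 | 1.5 | 1.5 |
| why | 0.7 | 0.6 | 1.3 |
| would | 0.7 | 0.9 | 0.9 |
| whose | 0.2 | 0.2 | 0.2 |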

However, there were some questions where the system was very close to the correct answer, or where the labelled “correct” answer was technically wrong.


Attention mechanisms coupled with deep neural networks can achieve competitive results on Machine Comprehension. For this project I implemented the BiDirectional attention flow model. My model accuracy was very close to the original paper. In the modeling layer we discovered that deeper networks do increase accuracy, but at a steeper computational cost.
For future work I would like to explore an ensemble of models – using different deep learning layers and attention mechanisms. Looking at the leaderboard, most of the top performing models are ensembles.

Sentiment Analysis of movie reviews part 2 (Convolutional Neural Networks)

In a previous post I looked at sentiment analysis of movie reviews using a Deep Neural Network. That involved using pretrained vectors (GloVe in our case) as a bag of words and fine tuning them for our task.

We will try a different approach to the same problem – using Convolutional Neural Networks (aka Deep Learning). We will take the idea from the image recognition blog and apply it to text classification. The idea is to:

  1. Vectorize at a character level, using just the characters in our text. We don’t use any pretrained vectors for word embeddings.
  2. Apply multiple convolutional and max pooling layers to the data.
  3. Generate a final output layer with softmax.
  4. We’re assuming the Convolutional Neural Network will automatically detect the relationship between characters (pooling them into words and further understanding the relationships between words).

Our input data is just a vectorization of each character. We take all the unique characters in our data and the maximum sentence length, and transform each sentence into a matrix of size maximum_sentence_length X character_count. For sentences shorter than the maximum length, we pad the remaining rows with zeros.
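A minimal NumPy sketch of this vectorization, assuming one-hot rows per character (the post does not spell out the exact encoding). The function name and toy sentences are illustrative only.

```python
import numpy as np

def one_hot_sentences(sentences):
    """Return an array of shape (num_sentences, max_len, num_chars)."""
    chars = sorted(set("".join(sentences)))
    index = {ch: i for i, ch in enumerate(chars)}
    max_len = max(len(s) for s in sentences)
    data = np.zeros((len(sentences), max_len, len(chars)), dtype=np.float32)
    for row, sentence in enumerate(sentences):
        for pos, ch in enumerate(sentence):
            data[row, pos, index[ch]] = 1.0  # rows past the sentence end stay zero
    return data, chars

data, chars = one_hot_sentences(["bad", "a"])
print(data.shape)  # (2, 3, 3)
```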

I used two 1-dimensional convolutional layers with filter size=3, stride=1 and hidden size=64, and relu for the non-linear activation (see the Image Recognition blog for an explanation of this). I also added a pooling layer of size 3 after each convolution.
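It helps to track how the sequence length shrinks through these layers. The sketch below assumes no padding on the convolutions and non-overlapping pooling windows (pool stride equal to pool size); neither detail is stated above, so treat the numbers as illustrative.

```python
def conv_out_len(length, kernel=3, stride=1):
    """Output length of an unpadded 1-D convolution."""
    return (length - kernel) // stride + 1

def pool_out_len(length, pool=3):
    """Output length of non-overlapping max pooling without padding."""
    return length // pool

def stack_out_len(length, blocks=2):
    """Length after `blocks` rounds of convolution followed by pooling."""
    for _ in range(blocks):
        length = pool_out_len(conv_out_len(length))
    return length

# A hypothetical 100-character input: 100 -> 98 -> 32 -> 30 -> 10 positions,
# each with 64 channels, before the fully connected layers.
print(stack_out_len(100), stack_out_len(100) * 64)
```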

Finally, I used 2 fully connected layers of sizes 1024 and 256 with a dropout probability of 0.5 (that should help prevent overfitting). The final layer uses a softmax to generate the output probabilities and we use the standard cross entropy function for the loss. The learning is optimized using the Adam optimizer.
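For readers who want to see the loss spelled out, here is a plain Python sketch of softmax followed by cross entropy for a single example. In practice the framework's built-in loss should be used; this is only to show the arithmetic, and the function names are my own.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log probability assigned to the true class."""
    return -math.log(softmax(logits)[label])

# With two equal logits the model is at chance, so the loss is log(2).
print(cross_entropy([0.0, 0.0], 0))
```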

Overall the results are very close to the deep neural network. We get 59.2% accuracy using CNNs vs 62% for the deep neural network. I think this is close to the maximum information we can extract from this data. What’s interesting is that we used 2 completely different approaches – pretrained word vectors in the Neural Network case, and character level vectors in this Deep Learning case – and we got similar results.

Next post we will explore using LSTMs on the same problem.

Source code available on request.

Image recognition on the CIFAR-10 dataset using deep learning

CIFAR-10 is an established computer vision dataset used for image recognition. It’s a labelled subset of the 80 million tiny images dataset, collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. This is the link to the website.

The CIFAR-10 dataset consists of 60,000 32×32 color images of 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images.

As a first post, I wanted to write a deep learning algorithm to identify images in the CIFAR-10 database. This topic has been very widely covered – in fact Google’s TensorFlow tutorials cover this very well. However, they do a few things that made it difficult for me to follow their code:

  1. They split the code across multiple files making it difficult to follow.
  2. They use a binary version of the file and a file stream to feed TensorFlow.

I downloaded the python version of the data and loaded all the variables into memory. There is some image manipulation done in the tensorflow tutorial that I recreated in the numpy arrays directly and we will discuss it below.

Prerequisites for this tutorial:
Other than Python (obviously!)

  • numpy
  • pickle
  • sklearn
  • tensorflow

For TensorFlow I strongly recommend the GPU version if you have the set-up for it. The code takes 6 hours on my dual GTX Titan X machine and running it on a CPU will probably take days or weeks!
Assuming you have everything working, let’s get started!

Start with our import statements

Declare some global variables we will use. In our code we are using GradientDescentOptimizer with learning rate decay. I have tested the same code with the AdamOptimizer. Adam runs faster but gives slightly worse results. If you do decide to use the AdamOptimizer, drop the learning rate to 0.0001. This is the link to the paper on Adam optimization.
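For reference, learning rate decay usually follows an exponential schedule like the one sketched below. The specific numbers are placeholders, not the values used in my code, and the function is a plain Python illustration rather than the TensorFlow call.

```python
def decayed_learning_rate(initial_rate, step, decay_steps, decay_rate, staircase=True):
    """Exponential decay: initial_rate * decay_rate ** (step / decay_steps).
    With staircase=True the rate drops in discrete jumps every decay_steps."""
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial_rate * decay_rate ** exponent

# Hypothetical schedule: start at 0.1 and multiply by 0.9 every 250 steps.
for step in (0, 250, 500):
    print(step, decayed_learning_rate(0.1, step, decay_steps=250, decay_rate=0.9))
```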

Create the data directory and download the data if it doesn’t exist – the download is skipped if we have already downloaded the data.

Load data into numpy arrays. The code below loads the labels from the batches.meta file, and the training and test data. The training data is split across 5 files. We also one hot encode the labels.
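Two small pieces of this step are worth showing in isolation. In the python version of CIFAR-10 each image is a flat row of 3,072 values (1,024 red, then 1,024 green, then 1,024 blue, each in row-major order), so it needs reshaping before use, and the integer labels need one hot encoding. The sketch below uses synthetic arrays instead of the real files, and the function names are my own.

```python
import numpy as np

def rows_to_images(rows):
    """Convert (N, 3072) CIFAR-10 rows into (N, 32, 32, 3) images."""
    return rows.reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)

def one_hot(labels, num_classes=10):
    """Turn integer class labels into one hot rows."""
    return np.eye(num_classes, dtype=np.float32)[np.asarray(labels)]

rows = np.arange(2 * 3072).reshape(2, 3072)  # synthetic stand-in for a data batch
images = rows_to_images(rows)
labels = one_hot([3, 9])
print(images.shape, labels.shape)  # (2, 32, 32, 3) (2, 10)
```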

Having more training data can improve our algorithms. Since we are confined to 50,000 training images (5,000 for each category), we can “manufacture” more images using small image manipulations. We do 3 transformations – flip the image horizontally, randomly adjust the brightness and randomly adjust the contrast. We also normalize the data. Note that there are different ways to do this, but standardization works best for images. However, rescaling can be an option as well.
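The three transformations and the standardization step can be approximated directly in NumPy, roughly as follows. The ranges for brightness and contrast are placeholder values, and the standardization mirrors the usual per-image zero mean, unit variance recipe; this is a sketch, not the exact code from the post.

```python
import numpy as np

def augment(image, rng, max_brightness=0.2, contrast_range=(0.8, 1.2)):
    """Randomly flip, shift brightness and scale contrast.
    image: float array of shape (H, W, 3), values roughly in [0, 1]."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]                                   # horizontal flip
    image = image + rng.uniform(-max_brightness, max_brightness)   # brightness shift
    mean = image.mean(axis=(0, 1), keepdims=True)                   # per-channel mean
    return (image - mean) * rng.uniform(*contrast_range) + mean     # contrast scale

def standardize(image):
    """Scale one image to zero mean and unit variance."""
    std = max(image.std(), 1.0 / np.sqrt(image.size))  # guard against flat images
    return (image - image.mean()) / std

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
out = standardize(augment(image, rng))
print(out.shape)  # (32, 32, 3)
```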

Now comes the fun part. This is what our network looks like.

Let’s define the various layers of the network. The last line of code (logits=tf.identity(final_output,name='logits')) is done in case you want to view the model in TensorBoard.

Now we define our cross entropy and optimization function. If you want to use the AdamOptimizer, uncomment that line, comment out the generation_run, model_learning_rate and train_step lines, and adjust the learning rate to something lower like 0.0001. Otherwise the model will not converge.

Now we define some functions to run through our batch. For large networks memory tends to be a big constraint. We run through our training data in batches. One epoch is one run through our complete training set (in multiple batches). After each epoch we randomly shuffle our data. This helps improve how our algorithm learns. We run through each batch of data and train our algorithm. We also check for accuracy every 1st, 2nd,…,10th, 20th,…, 100th,… step. Lastly we calculate the final accuracy of the model and save it so we can use the calculated weights on test data without having to re-run it.
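The batching logic, stripped of the TensorFlow calls, looks roughly like this. The reporting schedule is my reading of "every 1st, 2nd, …, 10th, 20th, …, 100th, … step" as steps with a single leading digit followed by zeros; both function names are hypothetical.

```python
import random

def should_report(step):
    """True for steps 1-9, 10, 20, ..., 90, 100, 200, ... and so on."""
    if step < 1:
        return False
    while step % 10 == 0:
        step //= 10
    return step < 10

def epoch_batches(num_examples, batch_size, rng):
    """Shuffle the example indices, then yield them one batch at a time."""
    order = list(range(num_examples))
    rng.shuffle(order)
    for start in range(0, num_examples, batch_size):
        yield order[start:start + batch_size]

rng = random.Random(0)
for epoch in range(2):
    for batch in epoch_batches(10, 4, rng):
        pass  # run one training step on the examples in `batch`
print([s for s in range(1, 301) if should_report(s)])
```

Reporting on this schedule gives frequent feedback early in training, when accuracy changes quickly, and progressively less output later on.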

The model gives around 81% accuracy on the test set. I have an IPython notebook on my GitHub site that lets you load the saved model and run it on random samples from the test set. It outputs the image alongside the softmax probabilities of the top n predictions.