Word embeddings with TensorFlow

Have you ever wondered how it is that you can “talk to” your computer? For example, you can ask your Google Assistant or Amazon Alexa something, or type a simple question to a chatbot and get a tailored answer in return. It looks easy from the user’s point of view (unless you have a distinct accent!), but it gets complicated from the engineering perspective. How come?

When working on your own chatbot or voice assistant, you could code something like this:

if question == 'What is the best OS?': answer('GNU/Linux!')

But what if the question is “What OS is the best?” or “What’s the best system?” This tiny change will make the initial solution fruitless. The possibilities are endless, so there has to be another way, right?

Yes, there is, and it’s called “word embeddings”.

What are word embeddings?

While working on a similar solution, we can represent each word in our set as a vector in a one- or multi-dimensional space. If you were to place a word like “dog” on an axis representing size, where would you put it? At this point there is nothing else, so let’s just put it in the middle and make it our point of reference.

Since we have the reference point, we can now place other words alongside this magnificent golden retriever based on where we feel they should be. Let’s do that with a cat and a bear as well.

The animals are placed on the axis based on their size: the cat is small, the bear is big, and the dog sits somewhere in between.

But how would something like a car tie into that? Vehicles come in different sizes as well, but let’s say a tiger and a bike are the same size, so they would be in the same place on our axis, even though they’re clearly two different things.

We need another dimension to specify the difference between, for example, animals and vehicles. We can add a Y axis and some new words to visualize the correlation.

Now that we have everything visualized, the embedded words are basically vectors, representing their coordinates in the “however many dimensions we created” space:

  • Cat (-5, 0)
  • Dog (0, 0)
  • Bear (4, 0)
  • Harley (-4, 3)
  • Supra (0, 3)
  • Tank (4, 3)
  • Carriage (2, 1.5)
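Once words are coordinates, "how similar are these two words?" becomes "how far apart are these two points?". Here is a minimal sketch of that idea using the seven toy vectors above; the helper names `distance` and `nearest` are made up for this example.

```python
import math

# Coordinates copied from the list above: (size, "vehicle-ness").
words = {
    "cat": (-5, 0), "dog": (0, 0), "bear": (4, 0),
    "harley": (-4, 3), "supra": (0, 3), "tank": (4, 3),
    "carriage": (2, 1.5),
}

def distance(a, b):
    """Euclidean distance between two embedded words."""
    return math.dist(words[a], words[b])

def nearest(word):
    """Closest other word in the toy space."""
    return min((w for w in words if w != word), key=lambda w: distance(word, w))

print(distance("dog", "supra"))  # 3.0
print(nearest("bear"))           # carriage
```

With only two hand-picked dimensions the neighbours are not always sensible, which is exactly why real embeddings use many more.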

Of course, using just two dimensions we can’t accurately represent a large number of different words. It kind of worked for our small vocabulary (seven words), but you’d typically use an “embedding size” of 64, 128 or 256 dimensions. The more dimensions, the more accurate the representation can get, but training will also be slower and you’ll need a bigger dataset.

Summing up: word embeddings are vector representations of words that encode their position in our vocabulary space.

How to create word embeddings?

As you might have guessed, there are a lot of ways to get this representation of words. A lot of them are count-based, like the co-occurrence matrix, but we’ll focus on a more interesting predictive approach that uses neural networks: Word2Vec.

Word2Vec is a model first introduced by Tomas Mikolov and his colleagues at Google; it comes in two flavours:

  • Continuous bag-of-words (CBOW) - Used for predicting the word that fits a sentence, based on context.
  • Skip-gram - Opposite of CBOW, predicts the context given the word.
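To make the difference between the two flavours concrete, here is a small sketch that builds both kinds of training examples from the same sentence. The function `training_pairs` and its `window` parameter are made up for illustration and are not part of Word2Vec itself.

```python
def training_pairs(tokens, window=1):
    """Return (cbow, skip_gram) training examples for a token list."""
    cbow, skip_gram = [], []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        cbow.append((context, target))                   # context -> word
        skip_gram.extend((target, c) for c in context)   # word -> each context word
    return cbow, skip_gram

cbow, skip_gram = training_pairs(["the", "dog", "barks"])
print(cbow[1])         # (['the', 'barks'], 'dog')
print(skip_gram[1:3])  # [('dog', 'the'), ('dog', 'barks')]
```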

Since we’ve been talking about representing words’ meanings, let’s focus on the Skip-gram model.

Dataset preparation

In the next part of this article we’ll be going over code snippets; you can find working source code here. It was made to run on nvidia-docker to easily utilize CUDA cores but you can change it however you like ;)

First we’ll need a dataset. I used The Blog Authorship Corpus, which contains 680,000 blog posts with a total of over 140 million words. I merged all of the posts into one big file and parsed it so that it contains only plain text encoded in UTF-8.

Now that we have our data nice and ready to read, we’ll have to split it into an array of words and lemmatize each one of them, filtering out everything that we don’t want in our embeddings. Lemmatization is the process of reducing the different inflected forms of a word to a single base form (its lemma), for example: “walk”, “walked”, “walking” would be narrowed down to just “walk”, and “good”, “better” and “best” would be reduced to “good”. I’m using the nltk library for that, you can read more about it here. Keep in mind that nltk’s WordNetLemmatizer treats every word as a noun unless you pass it a part-of-speech tag, so verb and adjective forms like the ones above are only reduced when the matching tag is given.

There will be a lot more words than we need, so let’s pick out the most common ones. This will also speed up our training, since we get rid of rare words that would be hard to train anyway.

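import re
from collections import Counter

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from tqdm import tqdm

# Assumption: STOP_WORDS is NLTK's English stop-word list
STOP_WORDS = set(stopwords.words('english'))
wnl = WordNetLemmatizer()

def token_valid(token):
  # keep lowercase alphabetic tokens of 3-19 characters that aren't stop words
  return ((token not in STOP_WORDS)
          and (len(token) > 2)
          and (len(token) < 20)
          and re.match(r'^[a-z]+$', token))

def create_lexicon(file, n_tokens=20000):
  word_list = list()
  for line in tqdm(file):
    words = word_tokenize(line.lower())
    words = [wnl.lemmatize(token) for token in words if token_valid(token)]
    word_list.extend(words)

  # use only n most common words
  _lexicon = ['UNKNOWN'] + [count[0] for count in Counter(word_list).most_common(n_tokens - 1)]
  return _lexicon, word_list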

Notice that we’re adding an ‘UNKNOWN’ token to the beginning of the lexicon. It will represent every word that wasn’t found in our lexicon.

The next thing to do is to create a dictionary that maps each word from our lexicon to its index. We’ll need this later to access our embeddings table.

def create_dictionary(_lexicon):
  # word -> index
  _dictionary = dict()
  for entry in _lexicon:
    _dictionary[entry] = len(_dictionary)
  # index -> word
  _reverse_dictionary = dict(zip(_dictionary.values(), _dictionary.keys()))
  return _dictionary, _reverse_dictionary

Now let’s use the dictionary on our processed ‘word_list’ that we made in ‘create_lexicon’, turning every word into its index.

One more thing to do with our training data is to replace multiple ‘0’s that appear one after another with a single ‘0’. Remember that ‘0’ corresponds to the ‘UNKNOWN’ token, so this won’t have a negative effect on our embeddings: ‘UNKNOWN’s will be ignored anyway.

After the above procedure, the following sentence: [1,5,2,0,0,0,0,5,9] would be transformed into: [1,5,2,0,5,9].
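Both steps can be sketched in plain Python. The function names `encode` and `collapse_unknowns` are made up for this example (the article’s repository may organise this differently), and the tiny dictionary is only there to make the snippet runnable.

```python
def encode(word_list, dictionary):
    """Map words to lexicon indexes; anything unseen becomes 0 ('UNKNOWN')."""
    return [dictionary.get(word, 0) for word in word_list]

def collapse_unknowns(data):
    """Replace each run of consecutive 0s with a single 0."""
    result = []
    for index in data:
        if index == 0 and result and result[-1] == 0:
            continue  # skip repeated UNKNOWNs
        result.append(index)
    return result

dictionary = {"UNKNOWN": 0, "dog": 1, "cat": 2}
print(encode(["dog", "zebra", "cat"], dictionary))     # [1, 0, 2]
print(collapse_unknowns([1, 5, 2, 0, 0, 0, 0, 5, 9]))  # [1, 5, 2, 0, 5, 9]
```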

Model

Now to the fun part!

Take a look at the ‘embeddings.py’ file and let’s start with the ‘generate_batch’ function. But first, there are a few important variables here that need to be clarified as they’re a crucial part of the whole model.

  • ‘context_word’ - Our label: for a given input word x, we want the model to return the approximate vector of a word that is likely to appear in its context.
  • ‘skip_window’ - Number of words to consider on each side (left and right) of the input word.
  • ‘num_skips’ - Number of context words to pick from that window, and train on, for each input word.

Here is a simple visualization of these concepts; hopefully it’ll clear things up a little bit:
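In text form, the same idea can be sketched for a single window of words. The helper `pairs_for_window` and the example sentence are made up for illustration; only ‘skip_window’ and ‘num_skips’ come from the article’s code.

```python
import random

def pairs_for_window(window, skip_window, num_skips, rng=random):
    """(input, label) pairs for one span of 2 * skip_window + 1 words."""
    assert len(window) == 2 * skip_window + 1
    assert num_skips <= 2 * skip_window
    target = window[skip_window]  # the middle word is the input
    candidates = [i for i in range(len(window)) if i != skip_window]
    chosen = rng.sample(candidates, num_skips)  # context positions to use
    return [(target, window[i]) for i in chosen]

window = ["the", "quick", "brown", "fox", "jumps"]
pairs = pairs_for_window(window, skip_window=2, num_skips=2, rng=random.Random(0))
print(pairs)  # two pairs, each with "brown" as the input
```

With ‘skip_window’ set to 2 there are four candidate context words, and ‘num_skips’ decides how many of them are actually turned into training pairs.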

Okay, we can finally move to the code.

We’re using a global variable ‘data_index’ to keep track of what part of the dataset we’re using across batches.

We’re also making some assertions to be sure we have the correct values for our setting. Next, we’re declaring our variables and populating the buffer with the first span of data.

global data_index
assert batch_size % num_skips == 0
assert num_skips <= 2 * skip_window

# a span is the target word plus skip_window words on each side
span = 2 * skip_window + 1

batch = np.ndarray(shape=(batch_size,), dtype=np.int32)
labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
buffer = collections.deque(maxlen=span)

if data_index + span > len(data):
  data_index = 0

buffer.extend(data[data_index:data_index + span])
data_index += span

Let’s move to the loop which is executed for each span of data that fits our ‘batch_size’:

for i in range(batch_size // num_skips):
  context_words = [w for w in range(span) if w != skip_window]
  words_to_use = random.sample(context_words, num_skips)

What we’re doing here is preparing a table of indexes for each word in a span except for the middle target word (its index is always equal to ‘skip_window’) and randomly picking a few of them to match the ‘num_skips’ amount. These are our context words.

  # still inside the `for i` loop: one (input, label) pair per chosen context word
  for j, context_word in enumerate(words_to_use):
    batch[i * num_skips + j] = buffer[skip_window]
    labels[i * num_skips + j, 0] = buffer[context_word]

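Next we have to slide the buffer forward by one word so that it holds the next span of data:

  # still inside the `for i` loop
  if data_index == len(data):
    # wrap around: refill the deque from the start of the data
    buffer.extend(data[:span])
    data_index = span
  else:
    buffer.append(data[data_index])
    data_index += 1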

Before returning the batch we have to adjust our ‘data_index’ to make sure we won’t cut any words out of the span:

data_index = (data_index + len(data) - span) % len(data)
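Put together, the fragments above could look roughly like the function below. This is a sketch assembled from the snippets in this section, not a copy of ‘embeddings.py’: the parameter list with its default values and the tiny `data` list are assumptions made so that the example runs on its own.

```python
import collections
import random

import numpy as np

data_index = 0  # position in the dataset, shared across calls

def generate_batch(data, batch_size=8, num_skips=2, skip_window=1):
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    span = 2 * skip_window + 1  # [skip_window, target, skip_window]

    batch = np.ndarray(shape=(batch_size,), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    buffer = collections.deque(maxlen=span)

    if data_index + span > len(data):
        data_index = 0
    buffer.extend(data[data_index:data_index + span])
    data_index += span

    for i in range(batch_size // num_skips):
        context_words = [w for w in range(span) if w != skip_window]
        words_to_use = random.sample(context_words, num_skips)
        for j, context_word in enumerate(words_to_use):
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[context_word]
        if data_index == len(data):
            buffer.extend(data[:span])  # wrap around to the start
            data_index = span
        else:
            buffer.append(data[data_index])  # slide the window by one word
            data_index += 1

    # Step back so the words at the end of this batch are not skipped next time.
    data_index = (data_index + len(data) - span) % len(data)
    return batch, labels

data = [1, 5, 2, 0, 5, 9, 3, 7, 4, 6]
batch, labels = generate_batch(data)
print(batch.shape, labels.shape)  # (8,) (8, 1)
```

Each input word appears ‘num_skips’ times in a row in `batch`, once for every context word chosen as its label.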

Since we have our batches prepared, we can move on to our neural network. As you remember from the ‘generate_batch’ function, both the inputs (‘train_inputs’) and the labels are just arrays of ints that represent words. The labels have a slightly different shape, (batch_size, 1) instead of (batch_size,), but that’s just how TensorFlow wants it to be.

embeddings = tf.Variable(tf.random_uniform((vocabulary_size, embedding_size), -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)

Our embeddings will be a matrix that assigns a vector to each word in the lexicon: row i holds the vector of the word with index i. To access these vectors with our ids we can use a function from TensorFlow, ‘embedding_lookup’.
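Conceptually, the lookup is nothing more than picking rows out of that matrix. Here is a NumPy sketch of the same idea (not the TensorFlow implementation); the tiny sizes and the ids in `train_inputs` are made up for the example.

```python
import numpy as np

vocabulary_size, embedding_size = 5, 3
rng = np.random.default_rng(0)

# One row per word id, initialised uniformly in [-1, 1) like the table above.
embeddings = rng.uniform(-1.0, 1.0, size=(vocabulary_size, embedding_size))

train_inputs = np.array([3, 0, 3])  # a tiny batch of word ids
embed = embeddings[train_inputs]    # pick the matching rows

print(embed.shape)  # (3, 3)
```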

Then we’re declaring weights and biases, nothing special here.

nce_weights = tf.Variable(
  tf.truncated_normal((vocabulary_size, embedding_size),
                      stddev=1.0 / math.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

For our loss function we’ll use noise-contrastive estimation (NCE). It changes the problem from multinomial classification over the whole vocabulary to binary classification (telling the true context word apart from a handful of sampled noise words), which we can solve with logistic regression. This makes the loss function less computationally expensive. You can read more about it here.

loss = tf.reduce_mean(
  tf.nn.nce_loss(
    weights=nce_weights,
    biases=nce_biases,
    labels=train_labels,
    inputs=embed,
    num_sampled=num_sampled,
    num_classes=vocabulary_size))
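To get a feel for the "binary classification" idea, here is a NumPy sketch of negative sampling, a simplified relative of NCE. It is not what ‘tf.nn.nce_loss’ computes exactly (as far as I know, the TensorFlow op also corrects the logits using the sampling probabilities of the noise words), and the names `negative_sampling_loss` and `noise_ids` are made up for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(embed, weights, biases, label, noise_ids):
    """Binary logistic loss: one true (input, label) pair against a few noise words."""
    true_logit = weights[label] @ embed + biases[label]
    noise_logits = weights[noise_ids] @ embed + biases[noise_ids]
    # Push the true pair towards 1 and the sampled noise pairs towards 0.
    return -np.log(sigmoid(true_logit)) - np.sum(np.log(sigmoid(-noise_logits)))

rng = np.random.default_rng(0)
vocabulary_size, embedding_size, num_sampled = 10, 4, 3
weights = rng.normal(scale=0.5, size=(vocabulary_size, embedding_size))
biases = np.zeros(vocabulary_size)
embed = rng.normal(size=embedding_size)  # vector of one input word

noise_ids = rng.choice(vocabulary_size, size=num_sampled, replace=False)
loss = negative_sampling_loss(embed, weights, biases, label=7, noise_ids=noise_ids)
print(loss > 0)  # True: each -log(sigmoid) term is positive
```

The point is the cost: each training pair touches only 1 + ‘num_sampled’ rows of the weight matrix instead of all ‘vocabulary_size’ of them.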

tf.summary.scalar('embeddings loss', loss)

global_step = tf.Variable(0, trainable=False)  # step counter, not a trained weight
lr = tf.train.exponential_decay(1.0, global_step, 30000, .96, staircase=True)

tf.summary.scalar('embeddings learning rate', lr)

optimizer = tf.train.GradientDescentOptimizer(lr).minimize(loss, global_step=global_step)
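The learning-rate schedule used above is easy to reproduce by hand. With ‘staircase=True’ the rate stays flat for 30000 steps and is then multiplied by 0.96. This is a plain-Python sketch of that formula; the function name is made up for the example.

```python
def decayed_learning_rate(step, initial=1.0, decay_steps=30000, decay_rate=0.96):
    """Staircase exponential decay: the rate drops once every `decay_steps` steps."""
    return initial * decay_rate ** (step // decay_steps)

for step in (0, 29999, 30000, 60000):
    print(step, decayed_learning_rate(step))
```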

The graph is ready. The next step is to start our session with it and feed in the data generated by our ‘generate_batch’ function.

After you’re done training, return the learned embeddings.

# this fragment lives inside the training function; init, merged and writer
# are assumed to have been created while building the graph
with tf.Session(graph=graph) as sess:
  init.run()
  average_loss = 0

  for step in range(n_steps):
    batch_inputs, batch_labels = generate_batch(data)
    feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}
    _, loss_val, summary = sess.run((optimizer, loss, merged), feed_dict)
    average_loss += loss_val
    writer.add_summary(summary, step)

  return embeddings.eval()

Visualization

I think we can agree that it’d be hard to visualize directions in 256 dimensions. Thankfully there are algorithms made for dimensionality reduction. Two that we’ll use are Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE). We’re deep into the article and there are still two more algorithms to write. I can see your joy.

Thankfully, there is a cool tool to do all of that for us. If you’ve already worked with TensorFlow you probably know what I have in mind. TensorBoard has a lot of built-in features and one of them was made specially for embeddings visualization so why reinvent the wheel? Let’s use it!
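If you’re curious what the PCA half looks like under the hood, it fits in a few lines of NumPy. This is only a sketch: the function name `pca_2d` is made up, and random numbers stand in for real embeddings. TensorBoard does all of this (and t-SNE) for us.

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their two directions of largest variance."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(100, 256))  # stand-in for learned embeddings
points = pca_2d(fake_embeddings)
print(points.shape)  # (100, 2)
```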

First we’ll have to prepare all of our embeddings and corresponding labels (the actual words from the lexicon). You can limit the amount if your projector is stuttering too much:

embeds = []
labels = []
for i, label in enumerate(lexicon):
  labels.append(label)
  embeds.append(embed_lookup[i])
  if i > 5000:
    break

Then we save the embeddings to the ‘embeddings.ckpt’ file and the labels to ‘metadata.tsv’. For the embeddings, let’s start by placing them inside a TensorFlow variable.

embeddings = tf.Variable(np.array(embeds), name='embeddings')

Labels have to be written line by line; order is crucial here.

with open(meta_path, 'w') as f:
  for label in labels:
    f.write('%s\n' % label)

Now we can start our session, initialize the embeddings variable and save it with a saver created just for it.

with tf.Session() as sess:
  saver = tf.train.Saver([embeddings])
  sess.run(embeddings.initializer)
  saver.save(sess, embeddings_path)

The final part is to create a writer, initialize the projector with its default configuration and add our embeddings to it (we’re assigning the ‘name’ property of our previously created TensorFlow variable to the ‘tensor_name’ property of the embeddings configuration). After that, we just have to configure a path for our labels and start the projector.

# TF 1.x location of the projector plugin
from tensorflow.contrib.tensorboard.plugins import projector

writer = tf.summary.FileWriter(os.path.join('log'))
config = projector.ProjectorConfig()

embed = config.embeddings.add()
embed.tensor_name = embeddings.name
embed.metadata_path = 'metadata.tsv'
projector.visualize_embeddings(writer, config)

Now, after training is completed, you can use TensorBoard to visualize the created embeddings:

tensorboard --logdir log/projector

TensorBoard prepares graphs using both the t-SNE and PCA dimensionality-reduction methods, and it even provides a built-in search tool for comparing similarity between words. It does that by finding the nearest points in our space, using either cosine or euclidean distance computed from our vectors.
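The nearest-points search is simple enough to sketch outside of TensorBoard too. Below is a minimal NumPy version using cosine similarity. The function name `nearest_words` and the five two-dimensional vectors are made up for the example; with real data you would pass in the lexicon and the trained embedding matrix.

```python
import numpy as np

def nearest_words(word, lexicon, embeddings, top_k=3):
    """Words whose vectors have the highest cosine similarity to `word`."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarity = unit @ unit[lexicon.index(word)]
    order = np.argsort(-similarity)  # most similar first
    return [lexicon[i] for i in order if lexicon[i] != word][:top_k]

lexicon = ["cat", "dog", "bear", "supra", "tank"]
embeddings = np.array([[0.9, 0.1], [1.0, 0.2], [0.8, 0.0], [0.1, 1.0], [0.0, 0.9]])
print(nearest_words("supra", lexicon, embeddings, top_k=1))  # ['tank']
```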

Summary

As you can see, word embeddings are not that hard to create. They have the capacity to represent meanings and relations between words, like “king” - “man” + “woman” = “queen” (“king” is to “man” as “queen” is to “woman”). Because of that, they’re widely used in semantic analysis.

I hope you learned something new today. If you’re still interested in this topic, I encourage you to read the original paper about them, where you will find a more in-depth analysis.
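That vector arithmetic can be tried out on a toy example. The two-dimensional vectors below are hand-made so that the analogy works out exactly. They are not learned values, and with real embeddings the result is only approximately “queen”.

```python
import numpy as np

# Hand-made 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender". Not learned values.
vectors = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def analogy(a, b, c):
    """Return the word closest to a - b + c, skipping the three query words."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return min(candidates, key=lambda w: np.linalg.norm(vectors[w] - target))

print(analogy("king", "man", "woman"))  # queen
```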