This post summarizes the fundamental approaches to representing a word in NLP. I find that writing a concept down and reviewing it helps me understand it better.
Basically, this post tries to answer the question: how to represent a word in a sentence for various applications of NLP?
The post is based on the following notes, tutorials, and slides:
- http://cs224d.stanford.edu/lecture_notes/notes1.pdf (from Stanford CS224d, same for 2 and 3 below)
- http://www.cs.biu.ac.il/~yogo/nnlp.pdf (Dr. Goldberg’s notes: A Primer on Neural Network Models for Natural Language Processing)
- https://github.com/nyu-dl/NLP_DL_Lecture_Note (Dr. Cho’s notes: Natural Language Understanding with Distributed Representation)
- http://www.marekrei.com/pub/Constructing_and_Evaluating_Word_Embeddings.pdf (Dr. Marek Rei and Dr. Ekaterina Kochmar’s slides)
The post is outlined as follows:
- What is a word vector and motivations
- Count-based approach (SVD-based)
- Neural word embedding: basics and variants
1. Word Vector and Motivations
How do we represent the meaning of a word?
We can use a taxonomy like WordNet, which contains hypernym relations and synonym sets. But taxonomies have some disadvantages: they miss nuances between words (e.g., between near-synonyms), they are subjective, and they require human effort to create and maintain. Not good!
How about 1-hot vectors like this:
Note that there are two properties for 1-hot representation (sparse encoding):
- Each vector is orthogonal to every other vector;
- The dimension of each word vector equals the vocabulary size, which is very large
Correspondingly, we may say that the 1-hot representation suits cases where:
- The model does not need to exploit correlations between vectors;
- The vocabulary size/feature space is small
See the work of (Johnson & Zhang, 2015, https://arxiv.org/abs/1504.01255) that uses sparse vector encodings.
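As a minimal sketch of the two properties above (with a toy, made-up vocabulary; real vocabularies are orders of magnitude larger):

```python
import numpy as np

# Toy vocabulary; real vocabularies have 10^4 to 10^6 entries.
vocab = ["cat", "dog", "runs", "the"]
V = len(vocab)

def one_hot(word):
    """Return the sparse (1-hot) encoding of a word."""
    vec = np.zeros(V)
    vec[vocab.index(word)] = 1.0
    return vec

cat, dog = one_hot("cat"), one_hot("dog")
# Property 1: any two distinct word vectors are orthogonal.
print(cat @ dog)   # 0.0
# Property 2: the vector dimension equals the vocabulary size |V|.
print(cat.shape)   # (4,)
```

Note that the dot product between any two distinct words is zero, so no notion of similarity is encoded.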
However, what if we want to incorporate the inherent relationships between words? What statistics can we use?
For a word of interest, we can count how many times every other word co-occurs with it, either within a document or within a window of neighboring words in a sentence. This is based on the motivation that we can use the neighbors of a word to represent its meaning:
- “You shall know a word by the company it keeps.” —- Firth, 1957
- “Words which are similar in meaning occur in similar contexts.” —- Harris, 1954
This is the distributional hypothesis behind distributional vectors that we will introduce.
There are basically 4 categories to choose the neighbors to represent the word:
- The full document or even the paragraph the word belongs to;
- The sentence the word belongs to and the words around it within a particular range;
- The context consists of the words close to it in a parse tree produced by a dependency parser, together with the syntactic relation information;
- Multilingual, translation-based contexts: the contexts of a word are the foreign words aligned with it. We can also combine a monolingual window-based approach with a multilingual one
Generally, the larger the context size is, the more topical similarities we will get; the smaller the context is, the more functional and syntactic the word similarities will be.
For example, with a window size of 2, we get a co-occurrence matrix like the one below for the corresponding sentences:
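A minimal sketch of building such a matrix, using hypothetical example sentences (the post's original sentences were shown in a figure):

```python
import numpy as np

# Hypothetical toy corpus for illustration.
sentences = [["i", "like", "deep", "learning"],
             ["i", "like", "nlp"],
             ["i", "enjoy", "flying"]]
vocab = sorted({w for s in sentences for w in s})
idx = {w: i for i, w in enumerate(vocab)}
window = 2

X = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in sentences:
    for i, w in enumerate(sent):
        # Count every word within `window` positions of w, on both sides.
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                X[idx[w], idx[sent[j]]] += 1

print(X[idx["like"], idx["i"]])  # 2: "i" appears near "like" in two sentences
```

With a symmetric window, the resulting matrix is symmetric, and each row is the raw distributional vector for one word.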
We have now seen what a basic/raw word vector actually is. However, what if we want to reduce the dimensions of the word vectors further? That’s where the SVD-based approach comes in.
2. SVD-based approach
How do we reduce the dimensions of word vectors introduced above?
We can do SVD!
Suppose X is the word co-occurrence matrix. Then we can compute its singular value decomposition, X = USV^T, and keep only the top k singular values.
For each word in X, we use the corresponding row of the truncated U (the leftmost matrix on the right-hand side of the equation). So the dimension of the word embedding becomes k.
What do k dimensions of the vector represent/mean?
Intuitively, they represent the “concepts”/”features” inherent in a word vector. See http://infolab.stanford.edu/~ullman/mmds/ch11.pdf for more detailed interpretations of SVD.
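A minimal sketch of this reduction, using a made-up 5-word co-occurrence matrix:

```python
import numpy as np

# Toy symmetric word-word co-occurrence matrix (made-up counts).
X = np.array([[0, 2, 1, 0, 0],
              [2, 0, 0, 1, 1],
              [1, 0, 0, 3, 0],
              [0, 1, 3, 0, 2],
              [0, 1, 0, 2, 0]], dtype=float)

U, S, Vt = np.linalg.svd(X)
k = 2  # number of "concept" dimensions to keep
# Row i of the truncated U is the k-dim embedding of word i.
# (Some variants scale the rows by the singular values S[:k].)
embeddings = U[:, :k]

print(embeddings.shape)  # (5, 2)
```

Each of the k columns corresponds to one latent "concept" dimension, ordered by the magnitude of its singular value.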
And there are some things to note for the count-based approach:
- Matrix factorization exploits the global word-word co-occurrence information, and matrix factorization methods for generating low-dimensional word representations have their roots in LSA (Latent Semantic Analysis). Apart from SVD, LDA can also be used to deal with the large number of sparse features in raw word vectors.
- Since common words may dominate the co-occurrence matrix, we can use the TF-IDF or PMI for a re-weighting. And we can ignore the function words such as “the”, “he”, etc.
- Pearson correlation can be used as an alternative to the raw count in the co-occurrence matrix, and negative values of correlation are set to 0 in this approach.
- SVD is expensive to perform (roughly O(mn²) cost for an m×n matrix), especially when the original dimensions of the matrix are high.
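The PMI re-weighting mentioned above can be sketched as follows; clipping negative values gives the common "positive PMI" (PPMI) variant, mirroring how negative correlations are zeroed out:

```python
import numpy as np

# Raw co-occurrence counts (toy, made-up matrix).
X = np.array([[0., 3., 1.],
              [3., 0., 2.],
              [1., 2., 0.]])

total = X.sum()
p_ij = X / total             # joint probabilities
p_i = X.sum(axis=1) / total  # marginal (row) probabilities
p_j = X.sum(axis=0) / total  # marginal (column) probabilities

with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(p_ij / np.outer(p_i, p_j))

# Positive PMI: clip negative (and undefined) values to 0.
ppmi = np.maximum(pmi, 0.0)
ppmi[np.isnan(ppmi)] = 0.0
```

High PPMI cells mark word pairs that co-occur far more often than chance would predict, which downweights the influence of very frequent function words.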
Everything above is based on the hypothesis that count-based co-occurrence information between words can help us determine the word vectors. The following method is based on a different motivation: we can directly learn word vectors of a fixed dimension through a particular prediction task (supervised or unsupervised).
3. Neural word embedding
This approach arises from language modeling. Refer to Dr. Cho’s notes (Chapter 5) for a more detailed and comprehensive introduction to language modeling and how it popularized the use of word embeddings. Here, I will briefly summarize two basic approaches.
a. CBOW (Continuous Bag of Words Model):
In this approach, we still exploit context words, as in the count-based method. The difference is that here we use the context words to predict the “center” word, so it is essentially a classification problem (number of classes = |V|). Thus we have two matrices, W and W’, as shown in the following figure, whose entries are learned during training on this prediction task.
In short, once we get the word vectors of the context words, we average them into a single vector. This averaged vector serves as the hidden layer, which is multiplied with each word vector in W’ to produce the scores for the softmax computation.
Go to Dr. Cho’s notes (5.4.2) for another interpretation of the model from the point of view of a Markov Random Field model and Maximum Pseudo-Likelihood approach. We will see why the context vectors get averaged.
So the learned matrices W and W’ contain the word embeddings that we want.
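The CBOW forward pass described above can be sketched as follows (toy sizes, random matrices standing in for the learned W and W’):

```python
import numpy as np

V, d = 10, 4  # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(V, d))        # input embeddings (one row per word)
W_prime = rng.normal(size=(V, d))  # output embeddings

def cbow_probs(context_ids):
    """P(center word | context): average context vectors, then softmax."""
    h = W[context_ids].mean(axis=0)    # averaged hidden layer
    scores = W_prime @ h               # one score per vocabulary word
    e = np.exp(scores - scores.max())  # numerically stable softmax
    return e / e.sum()

p = cbow_probs([1, 3, 5, 7])  # 2m = 4 context words
print(p.sum())  # 1.0 -- a distribution over all |V| classes
```

Training adjusts W and W’ so that the probability mass concentrates on the true center word.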
b. Skip-Gram model:
In contrast to the CBOW model, the Skip-Gram model uses the “center” word to predict its surrounding words. So if there are 2m surrounding words for the center word, we are essentially performing 2m classification tasks, as shown in the figure below.
Some variants and some things to note:
- Alternatives to the softmax-based objective: Since the softmax computation is expensive due to the normalization term, several techniques have been proposed to deal with this issue. For example, Word2Vec used a hierarchical softmax based on the skip-gram model. And Mikolov et al. used a method called negative sampling to convert the original multi-class classification problem into a logistic regression problem. It is motivated by Noise Contrastive Estimation. See more details in Dr. Socher’s notes and the paper Distributed Representations of Words and Phrases and Their Compositionality. For a more detailed overview of alternatives to the expensive softmax computation, see my post on the Rare Word Issue in NMT.
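A minimal sketch of the negative-sampling loss for one skip-gram training pair (random vectors standing in for learned embeddings; 5 noise samples, an assumed count):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
v_center = rng.normal(size=d)     # center-word vector (skip-gram)
v_context = rng.normal(size=d)    # true context-word vector
v_negs = rng.normal(size=(5, d))  # 5 sampled "noise" word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Score the true pair as positive (label 1) and each noise pair as
# negative (label 0): a handful of logistic regressions replacing
# the full |V|-way softmax normalization.
loss = (-np.log(sigmoid(v_center @ v_context))
        - np.sum(np.log(sigmoid(-v_negs @ v_center))))
print(loss)
```

Only the sampled vectors are touched per update, which is what makes training scale to large vocabularies.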
- Word2Vec and GloVe: Apart from Word2Vec, which is essentially a skip-gram model and is widely used in the NLP community, GloVe is another embedding approach that was more recently proposed and has gained popularity (http://nlp.stanford.edu/projects/glove/). It combines the advantages of the count-based and iteration-based approaches. It proposes a new training objective, essentially a weighted least-squares regression, combining the word embeddings with word co-occurrence information.
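For reference, the weighted least-squares objective from the GloVe paper has the form:

```latex
J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
```

where $X_{ij}$ is the co-occurrence count, $w_i$ and $\tilde{w}_j$ are the word and context embeddings with biases $b_i$ and $\tilde{b}_j$, and $f$ is a weighting function that caps the influence of very frequent pairs.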
- GloVe paper (http://aclweb.org/anthology/D14-1162) also makes a thorough overview of many other approaches. Go there for more detailed introduction to the related work.
- Subword modeling: researchers have also tried to model at the sub-word level, e.g., with characters. In this case, the word is no longer the basic unit. See the more detailed introductions in Dr. Goldberg’s notes (5.5.5) and Dr. Cho’s notes (5.1.2). Subword modeling applies not only to word embeddings, but also to many other tasks such as neural machine translation and sentence classification.