Word embeddings — word2vec and vectors that mean something
A lot of people meeting AI today start at the top of the stack — transformers, attention heads, fine-tuning, a chat model that already seems to understand everything. That’s a fine place to use the technology from. It’s a confusing place to understand it from: the big models are deep stacks of ideas accumulated over more than fifty years, and almost every one of those ideas is smaller and far easier to see clearly on its own. Better to build up from the pieces than start at the assembled result — and the piece that matters most, the one GPT-style models and BERT and every retrieval system are all variations on, is the idea that words can be turned into vectors whose geometry encodes meaning. That’s where the modern stack started, and it’s where this series starts too.
You’ve probably seen this famous example before — king − man + woman ≈ queen, the single most-screenshotted result in NLP. Look at the widget below. It demonstrates the arithmetic geometrically: each side of the picture is its own little coordinate system, but the Royalty direction — computed in system A as v(king) − v(man) — points the same way everywhere in embedding space. Carry that vector across to system B, lay it down at woman, and its tip lands at queen. So king − man isolates a direction, and + woman applies it to a different anchor.
The first time you see it, it looks like the embeddings were hand-built by someone who already knew about royalty and gender. They weren’t. Nobody designed that geometry, and working out where it actually comes from is most of what this article is about. The picture is a schematic — the real space has a few hundred dimensions, flattened here to two — but the parallelism is genuinely there.
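To make the arithmetic concrete, here is a minimal sketch with hand-picked 2-D vectors. The coordinates are invented purely for illustration (real embeddings are learned and have hundreds of dimensions), and the helper name `nearest` is hypothetical.

```python
import math

# Hypothetical 2-D embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
# These numbers are made up for illustration, not taken from a trained model.
vectors = {
    "man":   (1.0, 1.0),
    "woman": (1.0, 3.0),
    "king":  (4.0, 1.2),
    "queen": (4.1, 3.1),
    "apple": (2.0, 5.0),
}

def nearest(point, exclude):
    """Return the vocabulary word closest to `point` (Euclidean), skipping `exclude`."""
    candidates = [w for w in vectors if w not in exclude]
    return min(candidates, key=lambda w: math.dist(vectors[w], point))

# king - man isolates the "royalty" offset; adding it to woman moves the anchor.
royalty = tuple(k - m for k, m in zip(vectors["king"], vectors["man"]))
target = tuple(w + r for w, r in zip(vectors["woman"], royalty))

answer = nearest(target, exclude={"king", "man", "woman"})
print(answer)  # queen
```

The three input words are excluded from the search, which mirrors how analogy queries are usually evaluated; without that exclusion, one of the inputs can easily be the closest point.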
So, to work it out, we start at the bottom: how does a word even become something a neural network can work with?
For most of NLP’s history the answer stopped at: tokenize, then hand the model bare integer IDs — token 4271 for king, token 8693 for queen — usually dressed up as one-hot vectors, a vector as long as the entire vocabulary with a single 1 at the token’s index and zeros everywhere else. That’s an honest encoding of “this is word #4271” and nothing more — an opaque index. Under one-hot every word sits exactly the same distance from every other, so king and queen are neighbors in the table only by accident of ordering, unrelated as far as the model can tell. Anything semantic had to be bolted on downstream: engineered features, count-based statistics, lexicons like WordNet. Models worked, but they generalized poorly across words — what a system learned about cat transferred not at all to kitten.
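The "every word is equally far from every other" claim is easy to check. The sketch below uses a made-up four-word vocabulary; the word list and helper names are assumptions for illustration only.

```python
import math

vocab = ["king", "queen", "cat", "kitten"]

def one_hot(index, size):
    """Vector of `size` zeros with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

encoded = {w: one_hot(i, len(vocab)) for i, w in enumerate(vocab)}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Every pair of distinct words has dot product 0 and distance sqrt(2).
distances = {
    (a, b): math.dist(encoded[a], encoded[b])
    for a in vocab for b in vocab if a < b
}
for pair, dist in distances.items():
    print(pair, round(dist, 4))
```

Under this encoding, cat is exactly as far from kitten as it is from king, which is why nothing learned about one word can transfer to another.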
Dense embeddings are what replaced that. Instead of a vocabulary-sized indicator, represent each token as a dense vector — a few hundred real numbers, one per token, all of them learned from data. The payoff is that geometry in that space lines up with meaning: similar words end up near each other, and pairs of related words sit at similar offsets — which is exactly what makes the arithmetic above work. And the part that’s easy to miss: none of this geometry is hand-engineered. No one assigns meanings to the dimensions or decides which words should land near which — it’s all learned. And, more surprising still, learned with no labels at all: not from a pile of human annotations the way you might expect, but from a prediction task the raw text answers for itself.
The vectors start out random.
They acquire meaning purely as a side effect of being forced to do well at a self-supervised prediction task — predict a word’s context (skip-gram) or a word from its context (CBOW). Nobody ever tells the model “king and queen are related.” But to predict the contexts of king accurately, and the contexts of queen accurately, when those two context distributions heavily overlap, the only way the tiny network can succeed is to give king and queen similar vectors. Similarity in the embedding space is the model’s mechanism for sharing predictive evidence between words that behave alike. The geometry falls out of the optimization.
From hypothesis to geometry
How does a hypothesis about word meaning end up producing actual vector geometry — where similar words land near each other and consistent relations become consistent vector offsets? The reasoning involves four moves, each forcing the next: a premise (the distributional hypothesis itself), a self-supervised task built directly on that premise, a representational bottleneck that constrains how the task can be solved, and the geometry that drops out the other end. Walk through them in order and the leap from “words have meaning” to “words become vectors with structure” stops being mysterious.
We start with the distributional hypothesis: a word’s meaning is well approximated by the distribution of contexts it appears in. Words that show up in the same kinds of sentences — couch and sofa, huge and enormous — mean similar things, which quietly reframes “what does this word mean?” as “what does it occur next to?”, a question you can answer from raw text alone, with no dictionary or human annotation.
From that premise comes a self-supervised task: take a word, predict the words that surround it — or take the surrounding words and predict the one in the middle. Since the hypothesis says those neighbors carry the meaning, getting good at the task forces the model to encode meaning — and it costs nothing to set up, because the right answers are already sitting in the text.
Then comes a constraint on that task — a bottleneck. A word’s entire representation has to fit in one short vector — say three hundred numbers — and that’s the only place its information can live. Three hundred numbers isn’t much room: nowhere near enough to give every word its own independent description. So to predict well for all of them at once, the model has to economize — represent words with shared, reusable features, and place words that play the same role at nearly the same coordinates. Similar words end up close together not because anyone asked for it, but because crowding them is the only way everything fits. Take the bottleneck away — let the vector grow as wide as the vocabulary itself, so each word can claim its own dedicated dimension and never has to share — and that pressure vanishes: the model just memorizes which words co-occur with which, one word at a time, and learns nothing that carries over to the next.
Out the other end comes the geometry: words with similar context distributions get similar vectors, and consistent relational patterns — capital-of, plural, past-tense — become consistent vector offsets. Meaning ends up encoded as position and direction in the space — which is what the opening picture is showing.
What you just walked through — predict context, squeeze it through a bottleneck, read off the geometry — isn’t a word2vec quirk. Every embedding model since runs the same play. Self-supervised prediction through a representational bottleneck is how BERT learns its representations, how the sentence encoders behind RAG learn theirs, and how the embedding layer inside every modern LLM gets built. (This is also why model width — the hidden-state dimensionality d — is a hyperparameter that genuinely matters, not just a footnote next to the training objective: the same bottleneck mechanism is operating, the only thing that’s changed is how wide the squeeze is.)
In today’s NLP architectures, “embedding” can mean three quite different things — all built the same way through self-supervised prediction through a bottleneck but each applied to a bigger unit of text than the last:
- Static word embeddings: word2vec, GloVe, fastText. One fixed vector per word, pulled from a lookup table, frozen after training. Word order is thrown away, and bank gets a single vector that averages all its senses. ← this article
- Contextual embeddings: ELMo, then BERT and its descendants. One vector per word as it appears in a sentence: bank in “river bank” and bank in “bank account” come out different, because the surrounding tokens reshape the vector. Skip-gram and CBOW give way to masked language modeling (predict a hidden token from its bidirectional context, run through a deep encoder rather than a shallow average), but the principle is unchanged. ← next in the series
- Sentence / passage embeddings: Sentence-BERT and the modern retrieval crop. Pool a contextual model down to one vector per sentence or document. This is the engine of retrieval-augmented generation: embed the corpus once, embed each query, return the nearest chunks by cosine similarity, the exact geometric bet the picture above is a demo of.
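The retrieval step in the last bullet can be sketched in a few lines. The passage vectors and the query vector below are invented for illustration (in practice they would come from an encoder model), and the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical passage embeddings (in practice these come from an encoder model).
corpus = {
    "how to open a bank account": [0.9, 0.1, 0.0],
    "fishing from the river bank": [0.1, 0.9, 0.1],
    "training a neural network": [0.0, 0.1, 0.9],
}

def retrieve(query_vector, k=1):
    """Return the k passages whose embeddings are most similar to the query."""
    ranked = sorted(corpus, key=lambda text: cosine(corpus[text], query_vector), reverse=True)
    return ranked[:k]

query = [0.8, 0.2, 0.1]  # hypothetical embedding of "savings account fees"
top = retrieve(query, k=1)
print(top)
```

The corpus vectors are computed once and reused for every query; only the query embedding and the similarity ranking happen at request time.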
Where do generative LLMs — ChatGPT, Claude, Llama — fit here? Not at a level of their own; they contain all three. The very bottom layer of every one of them is a static word-embedding table, exactly like word2vec’s — a token ID goes in, a learned vector comes out. Every transformer layer above it produces contextual embeddings, BERT-style; the structural differences are that the attention runs strictly left-to-right and the training task is next-token prediction rather than masked-token prediction. And if you mean-pool the final layer you get a sentence embedding — in fact a lot of today’s strongest text-embedding models are decoder LLMs fine-tuned for exactly that. What makes an LLM an LLM isn’t a new kind of embedding; it’s that you read off its next-token distribution and sample from it, turning a representation machine into a generation machine.
This article is about the first level — static word embeddings. We’ll start with the idea behind all of it (the distributional hypothesis), train a skip-gram word2vec model from scratch, explore the famous analogy arithmetic on a real pre-trained vocabulary, and end at the place where static embeddings stop working — which is exactly where BERT and level two pick up, in the next article. Level three — the sentence-and-passage embeddings behind retrieval-augmented generation — comes later in the series.
The distributional hypothesis
Way back in 1957, the linguist J.R. Firth famously wrote, “You shall know a word by the company it keeps.” That’s the whole idea. Words that appear in similar contexts have similar meanings. To see why, consider three sentences with one word missing:
- The ___ barked at the postman.
- The ___ purred on my lap.
- The ___ flew south for winter.
You don’t need to know the missing words to know they refer to different kinds of animals. The context, the words around the blank, narrows down what fits. Reverse the procedure: any word that consistently fills the slot in “The ___ barked” probably means something like dog. Any word that consistently fills “The ___ purred” probably means something like cat. Words that fill the same slots in the same kinds of sentences are, by this argument, similar.
The distributional hypothesis says: define a word by its contexts. Build a representation that places words close together if their context distributions are similar. That representation will encode semantic similarity, syntactic role, and a surprising amount of world knowledge — all without any human labeling, just from raw text.
The question is how to compute it efficiently. The classic answer, going back to the early 1990s, was very large sparse co-occurrence matrices — latent semantic analysis and its kin. For the sentence “the cat sat on the mat” with a ±2 window, the matrix looks like this:
        the  cat  sat   on  mat
the  [   0,   1,   2,   1,   1 ]
cat  [   1,   0,   1,   1,   0 ]
sat  [   2,   1,   0,   1,   0 ]
on   [   1,   1,   1,   0,   1 ]
mat  [   1,   0,   0,   1,   0 ]

One row per word, one column per context word, and each entry M[i][j] counts how often word j appeared in word i’s context, that is, within a certain window of it (the example uses ±2). The diagonal is zero here because no word in this sentence falls inside its own ±2 window (the two occurrences of the are four positions apart); in a larger corpus a word can co-occur with itself.
The matrix is built by sliding the window across the corpus and tallying. At each position with center word i, look at the words in its ±W window, and for each neighbour j in that window, increment M[i][j] by one. The matrix fills in cell by cell, and after one pass through the corpus every co-occurrence count is recorded.
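The tallying procedure can be sketched directly. This is a minimal illustration on the running sentence, with the vocabulary kept in order of first appearance so the rows line up with the matrix above; the variable names are assumptions.

```python
tokens = "the cat sat on the mat".split()

# Vocabulary in order of first appearance, matching the matrix above.
vocab = list(dict.fromkeys(tokens))          # ['the', 'cat', 'sat', 'on', 'mat']
index = {w: i for i, w in enumerate(vocab)}
V = len(vocab)
W = 2                                         # window half-width (±2)

M = [[0] * V for _ in range(V)]
for pos, center in enumerate(tokens):
    lo = max(0, pos - W)
    hi = min(len(tokens), pos + W + 1)
    for ctx_pos in range(lo, hi):
        if ctx_pos != pos:                    # skip the center position itself
            M[index[center]][index[tokens[ctx_pos]]] += 1

for word, row in zip(vocab, M):
    print(f"{word:>3} {row}")
```

The window is clipped at both ends of the token list, so words near the edges contribute fewer counts than words in the middle.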
This is the first step of the count-then-factorise pipeline, and there are two more on top of it. The shape of a word’s vector changes substantially at each stage:
Stage 1 — raw counts. The matrix above, as it is. Each row is V numbers long (“V-long” for short), one slot for every word in the vocabulary — already the word’s vector, just a wildly oversized one. Each entry is literally “how many times did this specific vocab word appear in this word’s ±2 window across the corpus.” The row for cat is [1, 0, 1, 1, 0]. The toy corpus has only six tokens so the numbers are tiny, but the structure is what matters at scale.
Stage 2 — chance adjustment. Raw counts are dominated by frequency: the co-occurs with cat a lot in a real corpus, but only because the co-occurs with everything. To extract real signal, each co-occurrence has to be discounted by what chance alone would predict — frequent words shouldn’t get credit for co-occurring with everything. The standard fix is PMI (pointwise mutual information): a log-ratio that’s high when two words co-occur more than chance would predict, low when less, and zero when their joint rate matches what two independent words with those frequencies would produce. The matrix shape stays (V, V); only the values change, from raw counts to chance-adjusted scores.
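As a rough sketch of the chance adjustment, PMI for a cell can be computed as log(P(i, j) / (P(i) · P(j))), with the probabilities estimated from the counts. The code below applies that to the toy matrix; assigning 0.0 to unseen pairs and clipping negatives to zero (often called positive PMI) are common conventions assumed here, not something the pipeline requires.

```python
import math

vocab = ["the", "cat", "sat", "on", "mat"]
# Raw ±2 co-occurrence counts for "the cat sat on the mat".
M = [
    [0, 1, 2, 1, 1],
    [1, 0, 1, 1, 0],
    [2, 1, 0, 1, 0],
    [1, 1, 1, 0, 1],
    [1, 0, 0, 1, 0],
]

total = sum(sum(row) for row in M)                     # 18 co-occurrence events
row_sums = [sum(row) for row in M]
col_sums = [sum(M[i][j] for i in range(len(M))) for j in range(len(M))]

def pmi(i, j):
    """log( P(i, j) / (P(i) * P(j)) ); 0.0 for unseen pairs (an assumed convention)."""
    if M[i][j] == 0:
        return 0.0
    return math.log(M[i][j] * total / (row_sums[i] * col_sums[j]))

pmi_matrix = [[pmi(i, j) for j in range(len(M))] for i in range(len(M))]
ppmi_matrix = [[max(0.0, value) for value in row] for row in pmi_matrix]

print(round(pmi_matrix[1][2], 3))   # PMI(cat, sat)
print(round(pmi_matrix[0][3], 3))   # PMI(the, on): negative, the frequent word is discounted
```

With only six tokens the scores are not meaningful in themselves; the point is the shape of the computation, which stays (V, V) in and (V, V) out.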
Stage 3 — compression. Even after chance-adjusting, the matrix is still (V, V) — wildly too wide to be a usable word vector table. The standard fix is SVD (singular value decomposition): find the d directions of greatest variance in the matrix and project each row from V dimensions down to d:
raw cat vector (V long):   [1, 0, 1, 1, 0]
PMI-weighted (V long):     [-0.4, 0, 0.7, 0.3, 0]    ← log-ratios per vocab word (illustrative)
SVD-compressed (d long):   [0.21, -0.34, 0.18, …]    ← d abstract latent dimensions

The compressed vector is d real numbers (a few hundred), and the dimensions are no longer “co-occurrence with one specific word.” Each dimension is a linear combination of all V word-axes: a learned latent direction that captures one dominant pattern of co-occurrence variation across the corpus. You can’t point at dimension 47 and say “this is how often cat appeared near snowboard” anymore; it’s “this is how much of cat’s overall co-occurrence pattern aligns with the 47th-most-prominent direction in the data.”
That (V, d) compressed table (dense real numbers, one row per word) is structurally identical to the word2vec vectors covered in the rest of the article. The path to get there is the difference: count-then-factorise vs train-a-network-directly. The two routes land on essentially the same destination, which is what Levy & Goldberg (2014) formalise mathematically: skip-gram with negative sampling implicitly factorises a shifted PMI matrix. The factorisation is implicit and not literally an SVD, but the object being factorised is the same kind of chance-adjusted co-occurrence matrix.
A direct consequence: the count-then-factorise route produces the same kind of geometry that supports king − man + woman ≈ queen, so the famous analogy arithmetic falls out of either path. Well-tuned PPMI-SVD models (PMI with negative values clipped to zero, then compressed with SVD) are reported to come close to word2vec on standard analogy benchmarks; same destination, different engineering.
For two decades, this count-then-factorise recipe underpinned LSA (latent semantic analysis, 1990 — word-by-document matrix, used for information retrieval) and HAL (hyperspace analog to language, 1996 — word-by-word matrix, used in cognitive science to model human semantic memory). Both produced word vectors as a byproduct of other goals; word2vec (2013) was the first method to make the word-vector table the explicit target.
The common cost across all of them was sheer scale, and it grew faster than vocab size itself. At real V the matrix is a V × V monster: a million words on a side is a trillion cells, almost all of them zero because most pairs of words never co-occur. Storing it is heavy; factorising it with classical SVD is heavier still. On top of the raw compute, the pipeline is a chain of explicit, hand-tuned steps — count, weight, factorise, truncate, scale — each with knobs that interact non-obviously with the others. By the early 2010s this was the recipe everyone used and nobody loved.
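A quick back-of-the-envelope calculation shows the gap in scale. The 4-bytes-per-number figure is an assumption (32-bit floats), and the dense figure is an upper bound, since real pipelines store the mostly-zero matrix sparsely.

```python
V = 1_000_000          # vocabulary size used in the text
d = 300                # embedding width used in the text
BYTES_PER_FLOAT = 4    # assumption: 32-bit floats

dense_cells = V * V
dense_bytes = dense_cells * BYTES_PER_FLOAT
table_bytes = V * d * BYTES_PER_FLOAT

print(f"V x V cells: {dense_cells:,}")
print(f"dense V x V storage: {dense_bytes / 1e12:.1f} TB")
print(f"V x d table storage: {table_bytes / 1e9:.1f} GB")
```

Doubling the vocabulary quadruples the V × V cell count but only doubles the size of the (V, d) table, which is the sense in which the cost grows faster than the vocabulary.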
In 2013, a breakthrough method called word2vec produced dense vectors of a few hundred dimensions, with the same semantic properties — and without ever building this matrix. Instead of counting co-occurrences and squeezing the result down, it learns the vectors directly, by training a tiny neural network on a prediction task. (A year later, GloVe went back to building a matrix and factorising it head-on; fastText extended word2vec with character n-grams. The static-embedding family article covers those alternatives and how they relate to each other.)
word2vec
The word2vec paper proposed two approaches that skip the co-occurrence matrix entirely — they produce the same (V, d) embedding table without ever building the V × V grid in the first place. Instead of counting co-occurrences and then squeezing the matrix down with SVD, train a tiny neural network to predict the words around each word, in either direction, then read the word vectors off the network’s weights when training stops. The prediction is real, but it isn’t what you care about: it’s only there to force structured vectors out of the network. You keep the table and discard the rest. Same destination as LSA/HAL, very different engineering: streaming SGD on raw (center, context) pairs instead of building a trillion-cell matrix and factorising it.
word2vec also has a hard scope limit: it doesn’t read sentences. The model produces one fixed vector per word and nothing more: no sentence representation, no word-order awareness, no compositionality across the words in a phrase. For the roughly five years between word2vec’s release in 2013 and the transformer takeover, the standard NLP pipeline filled that gap by stacking an LSTM (or similar recurrent net) on top of word2vec embeddings: the embeddings carried what a word means; the LSTM carried what depends on order. Why that split eventually collapsed is the topic of the closing sections.
The two algorithms are skip-gram and CBOW — alternative recipes that run the prediction in opposite directions. You use one or the other, never both:
- Skip-gram. Given a center word, predict the words in the small window around it. Input: the center word. Output: a probability distribution over the vocabulary. Each (center, context) pair is a separate training example, so the same center word is reused once per neighbour.
- CBOW (continuous bag of words). The reverse: given the words in a window, predict the center word. Input: the bag of context words. Output: a probability over the vocabulary for the center word. This fill-in-the-blank shape (recover the missing word from its surroundings) is the one BERT’s masked language modeling later inherits and scales up.
We’ll work through skip-gram end to end, then come back to CBOW.
The training data and one-hot inputs
The whole skip-gram pipeline splits cleanly into two stages with a sharp boundary: pre-processing turns raw text into a list of training examples, then training runs those examples through a neural network.
Let’s first look at pre-processing — turning the raw text into a long list of integer pairs. No neural network is involved yet:
- Tokenize the corpus — split the text into a list of word tokens.
- Build the vocabulary: assign each distinct token an integer ID; that ID will later serve as the token’s index into one-hot vectors and as the row number into the embedding matrix. The total count is V (the vocabulary size).
- Extract (center, context) pairs: slide a window over the ID stream and emit one training example per neighbour.
Step 3 is the heart of pre-processing. Conceptually, you pick a window size — convention is two words on either side, making the window five words wide — and slide it through the text one word at a time. At each position the word in the middle is the center, the words flanking it are its context, and the model’s job is to predict the context words from the center. Every (center, context) pairing emitted at that position is one training example.
Using the sentence “the cat sat on the mat” as an example, we set the window to ±2 and slide it word by word. With the window centered on sat, the neighbours are the, cat, on, the, so this single position emits four pairs: (sat, the), (sat, cat), (sat, on), (sat, the). Step forward to on and the window finds cat, sat, the, mat: four more pairs. Step again to the second the and the window, now clipped by the end of the sentence, finds only sat, on, mat: three pairs. Positions near either edge always emit fewer. Continue like this until a whole corpus collapses into a long list of (center, context) pairs, generated entirely from the text itself, with no human ever labelling anything.
At the end of pre-processing you have a sequence of (c, t) integer pairs, where c is the center word’s id and t is the target (context) word’s id, ready to feed. The whole stage fits in about a dozen lines of plain Python on our running example:
# Step 1: tokenize.
corpus = "the cat sat on the mat"
tokens = corpus.split()
# ['the', 'cat', 'sat', 'on', 'the', 'mat']

# Step 2: build vocabulary and convert tokens to integer IDs.
vocab = sorted(set(tokens))                     # ['cat', 'mat', 'on', 'sat', 'the']
word2id = {w: i for i, w in enumerate(vocab)}   # {'cat': 0, 'mat': 1, 'on': 2, 'sat': 3, 'the': 4}
V = len(vocab)                                  # 5
ids = [word2id[w] for w in tokens]              # [4, 0, 3, 2, 4, 1]

# Step 3: slide a ±2 window over the ID stream, emit (center, context) pairs.
window = 2
pairs = []
for i, c in enumerate(ids):
    for j in range(max(0, i - window), min(len(ids), i + window + 1)):
        if i != j:
            pairs.append((c, ids[j]))
len(pairs)  # 18 (center_id, context_id) pairs, one per neighbour

Pre-processing is essentially the same regardless of which word2vec variant you train next: tokenization, vocabulary, and windowing are identical. Only the format of the emitted examples differs: skip-gram packs them as pairs, while CBOW emits one context bag plus its center per window. Training is where the algorithm actually lives, and the rest of this section walks through it in detail.
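For comparison, the CBOW packing described above can be sketched on the same sentence. This is an illustrative variant of the windowing loop, not code from the word2vec release; the name `cbow_examples` is hypothetical.

```python
tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))                    # ['cat', 'mat', 'on', 'sat', 'the']
word2id = {w: i for i, w in enumerate(vocab)}
ids = [word2id[w] for w in tokens]             # [4, 0, 3, 2, 4, 1]

window = 2
cbow_examples = []
for i, center in enumerate(ids):
    lo = max(0, i - window)
    hi = min(len(ids), i + window + 1)
    # One example per position: all neighbours together, then the center as the label.
    context_bag = [ids[j] for j in range(lo, hi) if j != i]
    cbow_examples.append((context_bag, center))

for bag, center in cbow_examples:
    print([vocab[t] for t in bag], "->", vocab[center])
```

The same 18 neighbour relations are present, but grouped into six examples (one per position) instead of 18 separate pairs.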
Each pair is a training example
With pre-processing done, let’s look at how the training algorithm uses these pairs.
Each pair is one training example, representing an input and a label. The four pairs emitted from one window position around sat are four separate training examples with the same input (sat) and four different labels (the, cat, on, the). Only the input goes through the forward pass — the label sits on the side until the loss step.
It’s the same shape as MNIST: each MNIST example pairs an image with its digit label, and here each skip-gram pair (c, t) pairs the center word c (input) with one context word t (label). For (sat, cat): feed sat in, get back a predicted distribution over the vocabulary, compare it against cat, take an SGD step. Then the next pair.
Before we can feed anything into the network, we need to represent the word as a vector of numbers.
In MNIST that step is mostly free — an image is already a grid of pixel intensities, so we just flatten it into a 784-number vector.
A word has no inherent numeric content, so we have to invent the encoding ourselves: assign each vocabulary word an integer index, and represent it as a one-hot V-vector, a vector V numbers long, all zeros except a single 1 at the word’s index. For our running vocabulary of 5 words (V=5), the encoding looks like this:
"cat" → [1, 0, 0, 0, 0]
"mat" → [0, 1, 0, 0, 0]
"on"  → [0, 0, 1, 0, 0]
"sat" → [0, 0, 0, 1, 0]
"the" → [0, 0, 0, 0, 1]

The one-hot is V long: its length is the vocabulary size, around a million for real word2vec. So each word literally becomes a million-number vector where 999,999 entries are 0 and only one is 1. It’s immediately evident that this is a hugely wasteful representation: almost all of the storage and almost all of the multiplications are operating on zeros that contribute nothing.
Luckily, there’s a clever bit of linear algebra that lets us skip building the one-hot vector at all — multiplying a one-hot by a matrix is the same as picking out one row of that matrix. For our 5-word vocabulary, with sat at index 3 and some matrix M of shape (5, d):
one-hot for "sat"       M (5 rows × 3 cols)                     result
[ 0 0 0 1 0 ]  ·  [ row 0:   0.21  -0.43   0.15 ]  =  [ 0.33  -0.27  0.84 ]
                  [ row 1:   0.07   0.62  -0.31 ]     (just row 3)
                  [ row 2:  -0.55   0.18   0.40 ]
                  [ row 3:   0.33  -0.27   0.84 ]
                  [ row 4:  -0.12   0.49  -0.06 ]

# dot product, column by column:
col 0: 0·0.21    + 0·0.07    + 0·(-0.55) + 1·0.33    + 0·(-0.12) = 0.33
col 1: 0·(-0.43) + 0·0.62    + 0·0.18    + 1·(-0.27) + 0·0.49    = -0.27
col 2: 0·0.15    + 0·(-0.31) + 0·0.40    + 1·0.84    + 0·(-0.06) = 0.84

Every term that touches a 0 from the one-hot vanishes, leaving only the row 3 contribution, so the answer is simply row 3 of M. In code we just store the integer 3 (the vocab index) and use it as a direct row lookup; we’ll see this concretely in the forward-pass section. Mathematically a one-hot V-vector and an integer V-index carry the same information; we use the one-hot for the maths because it makes the linear algebra clean, and the integer for the code because it’s V times cheaper.
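The equivalence can be checked numerically with the same matrix. The helper `vec_mat` below is a hypothetical plain-Python stand-in for a matrix multiply, written out so that every multiplication by zero is visible.

```python
M = [
    [0.21, -0.43, 0.15],
    [0.07, 0.62, -0.31],
    [-0.55, 0.18, 0.40],
    [0.33, -0.27, 0.84],
    [-0.12, 0.49, -0.06],
]
sat_index = 3
one_hot = [1.0 if i == sat_index else 0.0 for i in range(len(M))]

def vec_mat(vector, matrix):
    """Row vector times matrix: one dot product per column."""
    n_cols = len(matrix[0])
    return [sum(vector[r] * matrix[r][c] for r in range(len(matrix))) for c in range(n_cols)]

via_matmul = vec_mat(one_hot, M)   # V * d multiplications, almost all by zero
via_lookup = M[sat_index]          # one row read

print(via_matmul)
print(via_lookup)
```

Both routes return row 3; the lookup simply skips the V − 1 rows that were going to be multiplied by zero.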
Mental model — what we’re trying to do
Before we look closely at the network architecture, it’s worth nailing down what the network is actually trying to accomplish.
The core idea in one line: given a list of pairs (center, target), for each pair compute a similarity across dimensions between the center word and every word in the vocabulary — then adjust the weights so the highest-similarity word is the actual target.
The similarity is measured as the dot product between word vectors (embeddings), an unnormalized relative of cosine similarity, and softmax turns the resulting scores into a predicted probability for each vocab word of being the target. The rest of this section unpacks what those weights are, how the network implements this, and why running this loop over a corpus produces meaningful geometry.
We’re going to have two trainable weight matrices, together giving every word in the vocabulary two d-dim vectors (one embedding per role):
- E of shape (V, d): each row is one word’s input embedding, used when the word appears as the center of a training pair (the word being conditioned on).
- E' of shape (d, V): each column is one word’s output embedding, used when the word appears as the target/context being predicted (the word being scored as a candidate).
So a single word w has both a row in E (when it’s the center) and a column in E' (when it’s the target) — two independent trainable d-dim vectors, one for each role. They get independent gradient updates and end up with different values. The asymmetry is intentional: the center plays a different role from the target in the prediction task (one is the conditioning context, the other is being judged for plausibility), and the model is more expressive when it can learn distinct vectors for those two roles than when forced to share. After training, only E ships as “the word vectors”; E' is discarded.
Consider one (center, target) pair scored in isolation. The full softmax forward pass repeats that scoring for every candidate target word, producing V scores, one for each word in the vocabulary.
The score for a (c, t) pair is just v_c · v'_t — the dot product of c’s input embedding with t’s output embedding, which is unnormalized cosine similarity. This is also why E' is stored transposed (columns = output embeddings) rather than as another (V, d) table: storing them as columns means scores = v_c @ E' computes the dot product of v_c against every word’s output embedding in a single matrix-vector multiplication. If E' were (V, d) like E, you’d have to loop or transpose to get the same V dot products. The (d, V) shape is exactly the layout that makes “compute similarity across all words at once” a one-line operation.
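Here is a small sketch of that scoring step in plain Python, followed by the softmax that turns scores into probabilities. The weight values are invented for illustration, and `E_out` is a hypothetical name standing in for E'.

```python
import math

d, V = 3, 5
vocab = ["cat", "mat", "on", "sat", "the"]

# Hypothetical values; in training both would start random and be learned.
v_c = [0.33, -0.27, 0.84]                 # input embedding of the center word (a row of E)
E_out = [                                 # E' with shape (d, V): column t is v'_t
    [0.10, -0.20, 0.05, 0.30, 0.40],
    [0.25, 0.15, -0.10, 0.00, -0.30],
    [-0.05, 0.35, 0.20, -0.15, 0.10],
]

# scores[t] = v_c . v'_t for every vocabulary word t (the v_c @ E' product).
scores = [sum(v_c[k] * E_out[k][t] for k in range(d)) for t in range(V)]

def softmax(xs):
    """Exponentiate and normalise; subtracting the max keeps exp() stable."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(scores)                   # P(w | c) for each w in the vocabulary
for word, s, p in zip(vocab, scores, probs):
    print(f"{word:>3}  score={s:+.3f}  P={p:.3f}")
```

Softmax preserves the ranking of the raw scores, so the highest-scoring column of E' is always the most probable prediction.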
This is the same operation MNIST used to score digits — each output class had its own feature template, dot-producted with the hidden vector to score how well the image matched that class. word2vec runs the same play with columns of E' as per-word templates and v_c as the center word’s feature vector; the only difference is that here the features are learned semantic dimensions (royalty-ness, plurality, gender) rather than hand-interpretable patches (edges, strokes, loops).
With that in mind, the architecture in the next section is just the simplest possible neural net that computes similarity (v_c · v'_t) for every candidate pair (c, t) and updates weights in the two matrices (SGD).
Repeated billions of times across the corpus, this produces a geometry where each word’s input embedding sits near the output embeddings of words it co-occurs with — and transitively, where words that share the same neighbours end up close to each other. king and queen are never told to be similar; they end up similar because they’re both pulled toward royal, throne, crown, monarch. Same for cat and dog: they share neighbours like pet, fur, tail.
The architecture
Since the task is mechanically a supervised labelled-prediction task (given a center word, predict which word from the vocabulary comes nearby, with the labels supplied by the text itself rather than by annotators), the setup is similar to MNIST in key ways: one hidden layer, softmax over output classes, cross-entropy against the label. What differs is the scale (vocabulary size V here vs. 10 digit classes for MNIST), the input format (one-hot vs. dense real-valued pixels), and the goal (we want the trained embeddings, not the prediction itself).
Concretely, the input is a V-dim one-hot for the center word; the hidden layer has d neurons (e.g. 300) and is purely linear (no bias, no nonlinearity); the output has V neurons with softmax across all V producing P(w | c) — the probability of each vocabulary word given the center. The learning lives in two weight matrices that bracket the hidden layer: E (input → hidden, shape (V, d)) and E' (hidden → output, shape (d, V)). These are the network’s only learned parameters; both start random, and after training only E ships as “the word vectors” — E' is discarded.
There’s a sharp asymmetry in how much each side actually computes. Because the input is one-hot, the (V, d) matmul collapses to a single row read E[c] — only that one row of E participates per training example, and only that row receives a gradient on the backward pass. All the real dot products — and consequently almost all the learning per step — live on the hidden → output side: the hidden vector is dot-producted against every column of E' to produce V scores, and every column of E' gets a gradient on every step. Softmax on top is pure normalization with no learned parameters of its own.
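The asymmetry shows up clearly in a single hand-written training step. The sketch below is a plain-Python illustration of one full-softmax SGD update with made-up starting weights and an assumed learning rate of 0.1; `E_out` stands in for E', and the function names are hypothetical.

```python
import math

V, d, lr = 5, 3, 0.1
# Hypothetical starting weights (training would initialise these randomly).
E = [[0.21, -0.43, 0.15], [0.07, 0.62, -0.31], [-0.55, 0.18, 0.40],
     [0.33, -0.27, 0.84], [-0.12, 0.49, -0.06]]           # (V, d)
E_out = [[0.10, -0.20, 0.05, 0.30, 0.40],
         [0.25, 0.15, -0.10, 0.00, -0.30],
         [-0.05, 0.35, 0.20, -0.15, 0.10]]                 # E', (d, V)

def forward(c):
    """Return (v_c, probabilities) for center word id c."""
    v_c = E[c]
    scores = [sum(v_c[k] * E_out[k][t] for k in range(d)) for t in range(V)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return v_c, [e / z for e in exps]

def sgd_step(c, t):
    """One full-softmax SGD update for the pair (c, t); returns the loss before the update."""
    v_c, probs = forward(c)
    loss = -math.log(probs[t])
    # Gradient of cross-entropy w.r.t. the scores: probs - one_hot(t).
    d_scores = [p - (1.0 if j == t else 0.0) for j, p in enumerate(probs)]
    # Gradient w.r.t. v_c uses E' before it is modified.
    d_v = [sum(E_out[k][j] * d_scores[j] for j in range(V)) for k in range(d)]
    for k in range(d):
        for j in range(V):                       # every column of E' is updated
            E_out[k][j] -= lr * v_c[k] * d_scores[j]
    E[c] = [v_c[k] - lr * d_v[k] for k in range(d)]   # only row c of E is updated
    return loss

E_before = [row[:] for row in E]
c, t = 3, 0                                      # the pair (sat, cat)
loss_before = sgd_step(c, t)
loss_after = -math.log(forward(c)[1][t])
print(round(loss_before, 4), round(loss_after, 4))
```

After the step, four of the five rows of E are untouched while all of E' has moved, and the loss on this pair is lower than before.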
To make the layer structure concrete, the whole base-case network is a few lines of Keras:
from tensorflow import keras
from tensorflow.keras import layers

V = 10_000  # vocab_size: toy value so the model builds quickly; e.g. 1_000_000 at real scale
d = 300     # embedding_dim

model = keras.Sequential([
    keras.Input(shape=(1,), dtype='int32'),        # integer ID of the center word
    layers.Embedding(input_dim=V, output_dim=d),   # E: shape (V, d), the lookup
    layers.Reshape((d,)),                          # (1, d) → (d,)
    layers.Dense(V, use_bias=False),               # E': shape (d, V), the linear layer
    layers.Softmax(),                              # softmax over V vocab scores
])
model.compile(optimizer='sgd', loss='sparse_categorical_crossentropy')

Two trainable layers, both linear, no activation in between. Embedding(V, d) is E: Keras’s name for “look up a row from a (V, d) table given an integer index.” Dense(V, use_bias=False) is E': a plain (d, V) linear projection. Reshape strips a redundant length-1 dimension that Keras’s Embedding adds (a Keras-specific quirk: Embedding is built for token sequences and adds an extra axis even when we’re feeding one token at a time); Softmax turns the V raw scores into a probability distribution; sparse_categorical_crossentropy is exactly −log P(label), with the label given as an integer (same loss MNIST uses).
The number of dimensions in the word embedding d is a hyperparameter with no formula for picking it — convention does most of the work.
The well-worn default is d = 300, what the 2013 word2vec paper used on Google News and what GloVe ships as its largest pre-trained option.
In practice you’ll see d = 50–100 for lightweight/edge use, d = 200–300 as the sweet spot, and rarely anything larger. The reason d has to be much smaller than V is the bottleneck argument from earlier: if every word could sprawl into its own dedicated dimension, the model would memorize co-occurrences instead of building shared features, and no geometry would emerge.
What we’ve just described is the base-case forward pass — a softmax over the entire vocabulary on the output side. That’s the textbook version of skip-gram, and it’s what we’ll explain and visualise in the rest of this section. In practice, the full-V softmax is too expensive, so the canonical word2vec code swaps it for negative sampling — k+1 sigmoid scores instead of V softmax entries — and we’ll get to that in the dedicated subsection further down. Same lookup, same idea, cheaper loss.
The forward pass — lookup, scoring, softmax
Now that we’ve covered the high-level architecture — input one-hot, hidden lookup, output scores — let’s zoom in and walk through exactly what happens when a single training example flows through the network. For a center word c, the forward pass moves left-to-right in three stages: lookup, scoring, and softmax.
Stage 1: input × E → hidden (the lookup). Mathematically this is the matrix multiplication one_hot(c) @ E, producing a d-dim hidden vector. As we showed earlier with explicit per-column dot products on a 5-row matrix, the input is sparse — V−1 entries are zero — so almost every multiplication evaluates to zero, and the whole (V, d) matmul collapses to a single row read:
[0, 0, 1, 0, 0] · ⎡ row 0 ⎤
⎢ row 1 ⎥
⎢ row 2 ⎥ = row 2
⎢ row 3 ⎥
⎣ row 4 ⎦
So v_c = E[c] (one array index) is mathematically equivalent to the full one-hot-times-matrix multiplication, and V times faster. The shape (V, d) is the same kind of weight matrix as MNIST’s first layer; the access pattern is “look up one row” instead of “multiply all of them.”
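The equivalence between the one-hot multiplication and the row read is easy to check numerically. The snippet below is a minimal sketch with assumed toy sizes (V = 5, d = 3) and a randomly filled E; none of these values come from a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3                      # toy sizes, assumed for illustration
E = rng.normal(size=(V, d))      # input-side embedding table, shape (V, d)

c = 2                            # index of the center word
one_hot = np.zeros(V)
one_hot[c] = 1.0

via_matmul = one_hot @ E         # V * d multiply-adds, almost all against zeros
via_lookup = E[c]                # a single row read

print(np.allclose(via_matmul, via_lookup))
```

Both paths produce the same d-dim vector; only the amount of work differs.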
Stage 2: hidden × E' → V scores. Second-layer forward pass: scores = v_c @ E', producing V real numbers (one per vocabulary word). This is a genuine (d, V) matrix-vector multiplication, with no shortcuts and no sparsity to exploit. Each score is v_c · E'[:, w], the dot product of v_c with column w of E'.
Using the same 5-word vocab and our v_c from Stage 1 (the embedding for sat):
v_c (1 × d=3)             E' (d=3 × V=5)                          scores (V=5)
"sat" embedding           one column per vocab word               one score per vocab word
                           cat    mat    on     sat    the
                        ⎡  0.45  -0.31   0.18  -0.22   0.07 ⎤     ⎡ -0.19 ⎤ ← cat
[ 0.33  -0.27  0.84 ] · ⎢  0.62   0.15  -0.40   0.33   0.55 ⎥  =  ⎢  0.26 ⎥ ← mat
                        ⎣ -0.20   0.48   0.27   0.11  -0.39 ⎦     ⎢  0.39 ⎥ ← on
                                                                  ⎢ -0.07 ⎥ ← sat
                                                                  ⎣ -0.45 ⎦ ← the
# one score per word, expanded for "cat":
score[cat] = v_c · E'[:, cat]
           = 0.33 · 0.45 + (-0.27) · 0.62 + 0.84 · (-0.20)
           = 0.1485 + (-0.1674) + (-0.1680)
           = -0.187
Five dot products, five raw scores. on comes out highest (+0.39), which makes sense: on actually sits next to sat in “the cat sat on the mat”. But these aren’t probabilities yet: they can be negative, unbounded, and don’t sum to anything in particular. That’s Stage 3’s job.
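The same five scores can be reproduced with one matrix-vector product. This is a small sketch that reuses the toy numbers from the worked example; the variable name E_out stands in for E' and is just a naming choice here.

```python
import numpy as np

vocab = ["cat", "mat", "on", "sat", "the"]
v_c = np.array([0.33, -0.27, 0.84])            # hidden vector for "sat"
E_out = np.array([                             # E', shape (d=3, V=5)
    [ 0.45, -0.31,  0.18, -0.22,  0.07],
    [ 0.62,  0.15, -0.40,  0.33,  0.55],
    [-0.20,  0.48,  0.27,  0.11, -0.39],
])

scores = v_c @ E_out                           # one dot product per column of E'
for word, s in zip(vocab, scores):
    print(f"{word:>3}: {s:+.3f}")

print("highest:", vocab[int(np.argmax(scores))])
```

Each entry of scores is the dot product of v_c with one column of E_out, so the whole stage is a single (d, V) product.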
Stage 3: softmax → probabilities. The V scores — called logits — are arbitrary real numbers: they could be negative, unbounded, not summing to anything in particular. Softmax turns them into P(w | c) — V non-negative numbers that sum to 1, the model’s predicted probability that word w is in the context of c. Drag the scores below to see softmax convert them to probabilities live:
In skip-gram these scores are the dot products v_c · E'[:, w] for every vocab word w, and the resulting probabilities are P(w | c). (For the softmax subtleties — soft-argmax behaviour, shift invariance, temperature — see the MNIST article, which covers them in depth.)
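For readers without the interactive widget, here is a minimal softmax sketch applied to the five toy scores from Stage 2. Subtracting the maximum before exponentiating is a common stability habit and an assumption of this sketch, not something the article prescribes; it does not change the result.

```python
import numpy as np

def softmax(scores):
    shifted = scores - np.max(scores)   # shift for numerical stability; output is unchanged
    exps = np.exp(shifted)
    return exps / exps.sum()

vocab = ["cat", "mat", "on", "sat", "the"]
scores = np.array([-0.1869, 0.2604, 0.3942, -0.0693, -0.4530])
probs = softmax(scores)

for word, p in zip(vocab, probs):
    print(f"P({word} | sat) = {p:.3f}")
print("sum:", probs.sum())
```

The ordering of the words is preserved (on stays on top), but every value is now positive and the five values sum to 1.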
The loss function
The loss for one training example with target word t is the negative log-probability the model assigned to that target:
loss = −log P(t | c)
A single number per example. Small when the softmax has piled probability onto the true target, large when it hasn’t. This is cross-entropy with a one-hot label, the same loss MNIST uses, and the gradient through softmax is derived there.
What different loss values mean. At initialisation, all V scores are tiny random numbers, the softmax is ~uniform, and the loss is log V: about 13.8 for V = 10⁶, about 1.6 for our 5-word toy. Training drives it down. A well-trained word2vec model gets the per-pair loss down to around 4–6 on a real corpus. Since P(target) = e^(−loss), that corresponds to the true target typically receiving a probability between roughly 1/55 (loss 4) and 1/400 (loss 6), far above the 1/V it started with.
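The conversion between loss values and probabilities is plain arithmetic. The helper name uniform_loss below is made up for this sketch.

```python
import math

def uniform_loss(vocab_size):
    """Cross-entropy when every word gets probability 1/V."""
    return -math.log(1.0 / vocab_size)

print(round(uniform_loss(10**6), 2))   # 13.82
print(round(uniform_loss(5), 2))       # 1.61

# Converting a per-pair loss back into the probability assigned to the target.
for loss in (4.0, 5.0, 6.0):
    p = math.exp(-loss)
    print(f"loss {loss:.0f} -> P(target) is about 1/{1 / p:.0f}")
```

The loop prints 1/55, 1/148 and 1/403 for losses of 4, 5 and 6.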
The gradient
The gradient is similar to MNIST: the gradient of the loss with respect to each logit is P(w) − 𝟙[w == t] — predicted probability minus the one-hot target. The target word’s gradient is P(t) − 1 (negative — push its score up); every other word’s is P(w) (positive — push its score down, proportional to how much probability it currently has). Words the model already correctly thinks are unlikely barely move; words it’s wrong about get the most signal.
Backprop carries those score-gradients into E' and E by the chain rule:
scores = v_c @ E' (the forward step we're differentiating)
∂loss / ∂E'[:, w] = ( P(w) − 𝟙[w == t] ) · v_c ← gradient on column w of E'
∂loss / ∂v_c = E' @ ( P − one_hot_t ) ← gradient into the hidden vector
∂loss / ∂E[c] = ∂loss / ∂v_c ← because v_c = E[c]
Two consequences worth keeping in mind, both from the one-hot input:
- E gets a gradient on exactly one row per training example: the center word’s row. The other V−1 rows have zero gradient because v_c = E[c] only read from that one row. (Compare MNIST, where W₁ updates every weight per example because the input is dense pixels.)
- E' gets a gradient on every column. The true target’s column is pulled toward v_c; every other word’s column is pushed away, scaled by P(w) − 𝟙[w == t].
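The three gradient formulas can be written out directly and checked against a finite difference. This is a sketch under assumed toy sizes, random initial weights, and arbitrary center/target indices; E_out stands in for E'.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss_fn(E, E_out, c, t):
    return -np.log(softmax(E[c] @ E_out)[t])

rng = np.random.default_rng(0)
V, d = 5, 3                                  # toy sizes, assumed
E = rng.normal(scale=0.1, size=(V, d))       # E, shape (V, d)
E_out = rng.normal(scale=0.1, size=(d, V))   # E', shape (d, V)
c, t = 3, 2                                  # hypothetical center and target indices

v_c = E[c]
p = softmax(v_c @ E_out)
dscores = p.copy()
dscores[t] -= 1.0                            # P(w) - 1[w == t]

grad_E_out = np.outer(v_c, dscores)          # column w equals dscores[w] * v_c
grad_v_c = E_out @ dscores                   # gradient into the hidden vector
grad_E = np.zeros_like(E)
grad_E[c] = grad_v_c                         # only the center word's row is non-zero

# Finite-difference check on a single entry of E'
eps = 1e-6
E_out_plus = E_out.copy()
E_out_plus[1, 4] += eps
numeric = (loss_fn(E, E_out_plus, c, t) - loss_fn(E, E_out, c, t)) / eps
print(np.isclose(numeric, grad_E_out[1, 4], atol=1e-4))
```

grad_E has exactly one non-zero row while grad_E_out is dense, which is the asymmetry the two bullets describe.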
Putting it together — one pair, end to end
That’s the whole forward + backward pass for a single training example. Concretely on the window centered on sat in the cat sat on the mat — emitting four pairs (sat, the), (sat, cat), (sat, on), (sat, the) — the widget below runs them one at a time over a tiny 5-word vocabulary so each softmax fits on screen. Watch the same input (sat) produce a single distribution over the whole vocab; each pair picks a different target out of that distribution; the loss is −log P(target | sat); and applying the update shifts the bars so the target’s grows and everyone else’s shrinks.
All four pairs share the same input, sat, but each one’s target is a different neighbour. Click through them and notice three things: (1) the input never changes within this window position (only v_sat drifts slightly between updates), (2) each update lifts that pair’s target out of the pack and lowers everyone else in proportion to their current probability (so popular wrong answers get shoved harder), (3) the appears twice in sat’s window, so it gets two updates’ worth of pull, which is exactly the right way for a frequently co-occurring word to register stronger influence.
That’s the entire base-case algorithm: full softmax over the vocab, cross-entropy loss against the true context, gradient updates on E (one row) and E' (every column). It works, but the forward pass scales as O(V) per pair, and with V = 10⁶ that’s a million dot products and a million exponentials for every single training example. Unaffordable. The negative-sampling subsection later in the article is about the trick that makes the same shape O(k) instead, and that’s the version the canonical word2vec code actually runs.
Mini-batches and epochs
The walkthrough above processed pairs one at a time, which is conceptually how every training example is handled. In practice, implementations group examples into mini-batches and process a whole batch in one forward + backward pass — exactly the same mini-batch SGD as MNIST, just averaged across B examples per step:
- Forward pass on a batch of B examples: each (c, t) pair flows through the network. Vectorized, the whole batch is one matrix multiply.
- Loss: compute cross-entropy loss per example, then average across the batch into a single scalar.
- Backward pass: backprop the averaged loss through the network. Gradients on E and E' accumulate across the B examples.
- Gradient descent step: E -= lr × ∂L/∂E, E' -= lr × ∂L/∂E'. Repeat.
One word2vec-specific wrinkle in the backward pass. In MNIST the input is a dense pixel vector, so every weight in W₁ gets a non-zero gradient on every example — after a batch of B examples W₁ has been touched everywhere. In word2vec each example is a one-hot for one center word, so only one row of E participates per example. After a batch of B examples, at most B distinct rows of E receive non-zero gradients (fewer if any center words repeat); the rest stay at zero. This sparsity is what makes the gradient update on E a sparse update — typically implemented with scatter_add or IndexedSlices in real frameworks rather than a dense (V, d) gradient tensor.
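In NumPy the scatter-add idea looks like the sketch below. The sizes, the learning rate, and the all-ones stand-in gradients are assumptions chosen so the effect of a repeated center word is easy to read off.

```python
import numpy as np

V, d, lr = 8, 3, 0.1                      # toy sizes and learning rate, assumed
E = np.zeros((V, d))
centers = np.array([4, 1, 4])             # batch of B = 3 center-word IDs; word 4 repeats
grad_rows = np.ones((3, d))               # stand-in per-example gradients on E[c]

# np.add.at accumulates over repeated indices. A plain E[centers] -= lr * grad_rows
# would apply only one of the two updates destined for row 4.
np.add.at(E, centers, -lr * grad_rows)

touched = np.flatnonzero(np.abs(E).sum(axis=1))
print(touched)        # [1 4]
print(E[4])           # [-0.2 -0.2 -0.2]
```

Only two of the eight rows change, and the repeated center word receives both of its updates.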
Two reasons batching helps:
- GPU throughput. GPUs are designed for parallel computation on big tensors. Doing one pair at a time wastes most of their capacity — the kernel-launch overhead alone exceeds the per-pair compute. Batching turns many small dot products into one big matrix multiplication, which is exactly what GPUs are good at.
- Smoother gradient estimates. Averaging gradients across B examples gives a less noisy update direction, which can let you use a larger learning rate without diverging. (Sometimes the noise of single-example SGD is helpful — it can shake the optimiser out of bad local minima — but for well-shaped losses like this one, smoother is usually better.)
The trade-off: larger batches require more memory and produce fewer updates per epoch, so you may need more epochs or a higher learning rate to compensate.
The original 2013 word2vec used batch size 1 — pure SGD, one pair per step, with multi-threaded Hogwild! parallelism on CPU (each thread streams its own pairs and writes to the shared E / E' without locks; occasional update collisions are silently absorbed). That’s still how Gensim runs in 2026, and it’s the right choice for word2vec specifically — the Gensim subsection later covers why.
An epoch is one complete pass over all the training pairs. word2vec typically trains for 5–15 epochs; the Gensim default is 5. Multiple passes are needed because a single SGD step on one pair only nudges the relevant rows a little; repeated passes move the rows of E and E' toward the geometry the loss prefers.
Why this six-line model isn’t trained in production
The six-line Keras model from earlier in this section is the textbook architecture — and it’s the right mental picture, but nobody actually trains it as written. To see why, it helps to compare the per-example compute against MNIST:
                     INPUT    HIDDEN   OUTPUT   OUTPUT-LAYER MATMUL
MNIST                784      128      10       128 × 10 = 1,280 ops
word2vec (V = 10⁶)   V → 1    300      V        300 × 10⁶ = 300,000,000 ops
The bottleneck isn’t on the input side; that’s actually faster in word2vec (a single row lookup against E, no real arithmetic). The whole cost sits on the output side, and that’s where word2vec and MNIST differ wildly:
- Output layer size. MNIST has 10 output classes (one per digit). word2vec has V output classes (one per vocab word). With V = 1,000,000, that’s 100,000× more output neurons. The whole point of word2vec is “given a center word, predict which word in the entire vocabulary is the target” — and that prediction has to score every vocab word.
- Output-layer weight matrix size. MNIST’s W₂ is (128, 10) = 1,280 weights. word2vec’s E' is (300, 10⁶) = 300 million weights, about 234,000× larger. Every one of those weights participates in the forward pass.
- Output-layer matrix-multiply cost. Per example: MNIST does 128 × 10 = 1,280 multiply-adds; word2vec does 300 × 10⁶ = 300 million multiply-adds. ~234,000× more compute per training pair.
- Softmax cost. MNIST exponentiates 10 numbers per example. word2vec exponentiates 1,000,000: same shape of operation, just 10⁵× more of them.
So the hidden layer (300 dims) is cheap — it’s just the row of E you read out. The complexity comes from what happens after: matrix-multiplying that 300-dim hidden vector against a (300, 10⁶) matrix and softmaxing over a million scores.
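The ratios quoted in this comparison follow from a few integer multiplications, shown here as a quick check using the layer sizes from the table above.

```python
mnist_hidden, mnist_out = 128, 10
w2v_hidden, w2v_out = 300, 10**6

mnist_ops = mnist_hidden * mnist_out      # 1,280 multiply-adds (and weights)
w2v_ops = w2v_hidden * w2v_out            # 300,000,000 multiply-adds (and weights)

print(w2v_out // mnist_out)               # 100000  output-neuron ratio
print(round(w2v_ops / mnist_ops))         # 234375  weight and matmul ratio
```

The weight-count ratio and the multiply-add ratio are the same number because a bias-free dense layer does one multiply-add per weight.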
The fundamental reason word2vec has this problem and MNIST doesn’t: in MNIST you’re picking from 10 fixed classes (the digits), so the output layer naturally has 10 neurons. In word2vec, every word in the vocabulary is a possible output class, so the output layer has to be V neurons wide. Predicting “which word appears in this context?” out of a million-word vocab is fundamentally a million-class classification problem, and its cost scales linearly with V for every example. That’s the bottleneck. (An image classifier would hit the same wall on something like ImageNet-21k, with 21,841 classes, where the final softmax also becomes a non-trivial slice of training cost, though the class count is still roughly 46× smaller than word2vec’s vocab.)
Negative sampling replaces the last two layers with something O(k) instead of O(V) — instead of computing a million-way softmax over the full vocabulary, reframe the prediction as a binary classification (“is this (center, context) pair real or random?”) and train the model to score real pairs high and a handful of sampled random pairs low. Same gradient flavor, hundreds of thousands of times faster, and the resulting vectors are essentially as good. But as a picture of what the network is, before you optimise how it trains, this six-line model is it: an embedding, a dense layer, a softmax.
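To give the binary-classification framing some shape, here is a minimal sketch of the per-pair negative-sampling loss: one sigmoid score for the real pair and k for sampled words. Several things are assumptions of this sketch rather than details from the article: the toy sizes, the output vectors stored one row per word in a matrix called U, and negatives drawn uniformly (word2vec itself draws them from a frequency-based distribution).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(v_c, u_target, u_negatives):
    """-log sigma(v_c . u_t) - sum_n log sigma(-v_c . u_n) for one (center, target) pair."""
    pos = np.log(sigmoid(v_c @ u_target))
    neg = np.log(sigmoid(-(u_negatives @ v_c))).sum()
    return -(pos + neg)

rng = np.random.default_rng(0)
V, d, k = 1000, 16, 5                         # toy sizes, assumed
E = rng.normal(scale=0.1, size=(V, d))        # center-word vectors
U = rng.normal(scale=0.1, size=(V, d))        # output vectors, one row per word (assumed layout)

c, t = 10, 42                                 # hypothetical center and target IDs
neg_ids = rng.integers(0, V, size=k)          # uniform draw, a simplification
loss = negative_sampling_loss(E[c], U[t], U[neg_ids])
print(loss)
```

Only k + 1 dot products are computed, regardless of V. With near-zero initial vectors every sigmoid sits near 0.5, so the starting loss is close to (k + 1) · log 2.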
CBOW — and the bridge to BERT
CBOW (continuous bag of words) is word2vec’s second algorithm, run as a mirror image of skip-gram. Where skip-gram takes a center word and predicts a target word as its context, CBOW takes several words as context and predicts the center — the missing word.
That fill-in-the-blank framing is exactly the Cloze test from 1950s reading-comprehension research — hide some words, ask the model to fill them in from what’s left — which only works if you have a working model of the surrounding language. It’s also exactly the objective behind BERT’s masked language modeling, the pretraining task that powered every contextual encoder since 2018. You can read CBOW as a tiny, linear-shaped BERT and BERT as CBOW grown up with attention: same training task, shallow averaging replaced by a deep transformer stack, symmetric window replaced by the whole sentence, single mask replaced by 15% of tokens at once, static output replaced by contextual per-token vectors. The BERT article walks through all of that in detail.
Mechanically, CBOW differs from skip-gram by exactly one averaging step on the input; the rest of the training loop is identical. So most of the skip-gram sections above carry over unchanged — we focus here only on where CBOW differs.
Pre-processing — context bags instead of pairs
The pre-processing pipeline is the same as skip-gram: tokenize, build vocab, slide a window over the token stream. What changes is the shape of what gets emitted per window position:
Position centered on   Skip-gram emits (per position)        CBOW emits (per position)
─────────────────────  ────────────────────────────────────  ──────────────────────────
the (pos 0)            (the, cat), (the, sat)                ({cat, sat}, the)
cat (pos 1)            (cat, the), (cat, sat), (cat, on)     ({the, sat, on}, cat)
sat (pos 2)            (sat, the), (sat, cat),               ({the, cat, on, the}, sat)
                       (sat, on), (sat, the)
on  (pos 3)            (on, cat), (on, sat),                 ({cat, sat, the, mat}, on)
                       (on, the), (on, mat)
the (pos 4)            (the, sat), (the, on), (the, mat)     ({sat, on, mat}, the)
mat (pos 5)            (mat, on), (mat, the)                 ({on, the}, mat)
─────────────────────  ────────────────────────────────────  ──────────────────────────
                       total: 18 pairs                       total: 6 examples
Same corpus, same window: skip-gram produces 18 separate (center, neighbour) training pairs, CBOW produces 6 (context_bag, center) examples. Step through it in the widget:
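For readers following along without the widget, the same windowing can be sketched in a few lines of Python. The function name window_positions and the window radius of 2 are assumptions chosen to match the table; contexts are kept as lists so that the repeated the is preserved.

```python
def window_positions(tokens, window=2):
    """Yield (center, context_list) for each position, clipping at the sentence edges."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        yield center, context

tokens = "the cat sat on the mat".split()

skipgram_pairs = [(c, w) for c, ctx in window_positions(tokens) for w in ctx]
cbow_examples = [(ctx, c) for c, ctx in window_positions(tokens)]

print(len(skipgram_pairs))   # 18
print(len(cbow_examples))    # 6
print(cbow_examples[2])      # (['the', 'cat', 'on', 'the'], 'sat')
```

Both outputs come from the same pass over the windows; skip-gram flattens each window into pairs, CBOW keeps it as one bag.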
The architecture
CBOW’s architecture is the skip-gram one with its ends flipped — the inputs are the context words, the output is the center word. Same two weight matrices E and E', same dimensionality d, same softmax over V. The only architectural difference is on the input side: skip-gram looks up one row of E (the center), CBOW looks up C rows (one per context word) and averages them into a single d-dim hidden vector h:
Skip-gram:  h = E[c]                                  (one row read)
CBOW:       h = mean(E[c_1], E[c_2], ..., E[c_C])     (C rows read, then averaged)
Visualized below: the C one-hot inputs (shape C × V), the C corresponding row lookups in E (shape V × d), and the averaging step that produces h (shape d).
The averaging is what gives “bag of words” its name — word order inside the window is discarded, the context becomes a multiset. At each training step, the lookups happen on the current, in-flight values of E — the rows you read in the forward pass are the same rows you update in the backward pass, just like any neural-net SGD training. Once h is computed, everything downstream — scores = h @ E', softmax, cross-entropy loss, backprop, SGD on E and E', and the negative-sampling shortcut — is identical to skip-gram.
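The CBOW input side amounts to one fancy-indexing read and one mean. The sketch below uses the 5-word toy vocabulary with randomly initialised matrices (assumed values, not trained ones); E_out stands in for E'.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["cat", "mat", "on", "sat", "the"]
word_to_id = {w: i for i, w in enumerate(vocab)}
V, d = len(vocab), 3
E = rng.normal(scale=0.1, size=(V, d))        # E, shape (V, d)
E_out = rng.normal(scale=0.1, size=(d, V))    # E', shape (d, V)

context = ["the", "cat", "on", "the"]         # bag around the center word "sat"
ids = [word_to_id[w] for w in context]

h = E[ids].mean(axis=0)                       # C row reads, then one average
scores = h @ E_out                            # from here on, same as skip-gram

# Order inside the window does not matter: a reversed bag gives the same h.
h_reversed = E[ids[::-1]].mean(axis=0)
print(np.allclose(h, h_reversed))
```

Because the repeated the contributes its row twice, it carries twice the weight of the other words in the average.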
Training — three small differences from skip-gram
The forward pass, loss, gradient, mini-batching, and negative-sampling shortcut all carry over from skip-gram unchanged. Three differences worth keeping in mind:
- One forward pass per window position. Skip-gram emits C separate pairs per window and runs C forward passes. CBOW emits one example per window — the whole context bag predicts the center. So CBOW is roughly C× faster to train, which is its main practical edge.
- Gradient updates touch more rows of E per example. Skip-gram nudges one row of E per pair (the center’s row). CBOW nudges C rows per example (each context word’s row, with the gradient scaled by 1/C from the averaging). That’s why CBOW does better on frequent words and worse on rare ones: common stop-words get touched on most windows; rare words rarely show up as context.
- E' updates are the same shape, just keyed differently. In skip-gram, each pair (c, t) updates the target word’s column of E'. In CBOW, each example updates the center word’s column of E'. Same update pattern, different word picking it out.
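The 1/C scaling in the second difference can be seen by pushing a gradient on h back through the average. In this sketch grad_h is a made-up stand-in for the gradient arriving from the output side, and the IDs follow the toy vocabulary order cat, mat, on, sat, the.

```python
import numpy as np

V, d = 5, 3
ids = np.array([4, 0, 2, 4])          # context IDs for (the, cat, on, the); "the" = 4 repeats
C = len(ids)
grad_h = np.array([0.3, -0.6, 0.9])   # stand-in for dloss/dh, assumed values

grad_E = np.zeros((V, d))
np.add.at(grad_E, ids, grad_h / C)    # every context occurrence receives grad_h / C

print(grad_E[0])   # "cat": one occurrence, grad_h / 4
print(grad_E[4])   # "the": two occurrences, so twice as much
print(grad_E[3])   # "sat": the center word is not in the bag, so zero
```

Three distinct rows of E receive a gradient from this single example, compared with one row for a skip-gram pair.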
How word2vec is actually trained in practice — Gensim
It might be surprising in 2026 — an era where transformer-based encoders dominate NLP and every headline model is some flavour of attention — but word2vec is still trained, deployed, and shipped in production every day. The reason is partly architectural fit (word2vec is a CPU-shaped workload, which we’ll get to), partly that for plenty of tasks the cheap-and-static vectors are simply good enough (and orders of magnitude faster to query than a BERT forward pass), and partly that some pipelines — recommendation systems, search ranking, vector-database bootstrapping, lightweight semantic features — specifically want the embedding-lookup behaviour rather than a deep contextual model. So the question of how to actually train one isn’t historical — it’s a working engineering question for plenty of teams.
The de facto standard library for training word2vec is Gensim — short for “Generate Similar” — open-source Python, in development since 2009, and despite Keras, PyTorch, and JAX all existing, still what most production word2vec pipelines run on in 2026. From the library’s own description:
Gensim is a free open-source Python library for representing documents as semantic vectors, as efficiently (computer-wise) and painlessly (human-wise) as possible. Gensim is designed to process raw, unstructured digital texts (“plain text”) using unsupervised machine learning algorithms. The algorithms in Gensim — Word2Vec, FastText, Latent Semantic Indexing (LSI/LSA), Latent Dirichlet Allocation (LDA), etc — automatically discover the semantic structure of documents by examining statistical co-occurrence patterns within a corpus of training documents. These algorithms are unsupervised, which means no human input is necessary — you only need a corpus of plain text documents.
Two things to notice from that description.
It covers a family of algorithms, not just word2vec. Word2Vec, FastText, LSA, LDA — every unsupervised statistical-co-occurrence method this article has discussed (and a few more) lives in the same library. The article walked through word2vec specifically; Gensim is the library where you’d actually run any of them.
Efficiently (computer-wise) and painlessly (human-wise) is exactly the trade-off Gensim has optimised for fifteen years: extremely fast on the hardware these algorithms actually need (CPU, as we’ll see in a moment), with a one-liner API that hides every fiddly detail of vocabulary building, subsampling, sampling tables, and the training loop. Both of those properties are why people still reach for it.
A one-liner gets you trained vectors:
from gensim.models import Word2Vec
# sentences: an iterable of tokenized sentences (lists of strings); sg=1 selects skip-gram
model = Word2Vec(sentences, vector_size=300, window=5, min_count=5, sg=1, workers=8)
v_king = model.wv["king"]
Vocabulary building, frequent-word subsampling, the negative-sampling table, the training loop, save/load, similarity queries, analogy arithmetic: Gensim handles all of it. To match this in raw Keras or PyTorch you’d write ten times the code, get worse throughput, and end up with vectors that may not exactly reproduce the canonical word2vec results.
The reason Gensim persists is partly inertia, partly ergonomics — but the deepest reason is that word2vec is a CPU-shaped workload, and Gensim is the best CPU implementation there is.
Why word2vec is CPU-intensive, not GPU-intensive
The standard intuition for “deep learning = GPU” comes from models whose bottleneck is dense matrix multiplication on big tensors: convolutional networks, transformers, big MLPs. GPUs are designed for that — thousands of cores running the same operation on different elements in parallel, fed by high-bandwidth memory laid out for vectorised access. Give a GPU a (4096, 4096) matmul and it crunches through it in microseconds.
Word2vec, structurally, never does a big matmul. Look at what one training pair actually computes:
1. one lookup in E:                    read row E[center]                  ← 1 row out of V
2. lookup target + k negatives in E':  read k+1 rows of E'                 ← k+1 rows out of V
3. dot products against v_c:           k+1 dot products of d-dim vectors   ← ~(k+1) × d multiply-adds
4. gradient updates:                   update 1 row of E, k+1 rows of E'   ← k+2 rows touched
With k = 5–20 and d = 300, each training step does on the order of 10⁴ multiply-adds, about a thousand times less compute per example than a single forward pass through a small image classifier. The “compute” is barely there.
What is there is memory access. Two huge tables, E and E', each V × d × 4 bytes. For V = 10⁶, d = 300, that’s 1.2 GB per matrix — 2.4 GB total. Every training step does a handful of widely-scattered row reads from those matrices: jump to row 4271, jump to row 873291, jump to row 12, jump to row 999847. These rows are 1.2 KB apart, 80 MB apart, sometimes hundreds of MB apart. Cache prefetching doesn’t help — the access pattern is effectively random.
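The memory and compute figures above can be reproduced with back-of-the-envelope arithmetic. float32 storage is assumed, as in the text, and the multiply-add count below covers only the forward dot products.

```python
V, d, bytes_per_float = 10**6, 300, 4      # float32 assumed

row_bytes = d * bytes_per_float            # bytes per word vector
matrix_bytes = V * row_bytes               # one of E or E'

print(row_bytes)                           # 1200
print(matrix_bytes / 1e9)                  # 1.2  (GB, decimal)
print(2 * matrix_bytes / 1e9)              # 2.4

# Forward dot-product work per training pair with k negatives
for k in (5, 20):
    print(k, (k + 1) * d)                  # 1800 and 6300 multiply-adds

# Distance in memory between two of the example rows
gap_mb = (999847 - 873291) * row_bytes / 1e6
print(round(gap_mb))                       # 152
```

A few thousand multiply-adds spread over rows that sit megabytes apart is what makes the workload latency-bound rather than compute-bound.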
This profile — tiny compute per step, large memory, random-access bound — is exactly the wrong shape for GPUs. The GPU’s headline FLOPs throughput is irrelevant when you’re not doing enough multiply-adds to use it; GPU memory wins on sustained sequential bandwidth, which random row reads don’t exploit; kernel-launch overhead is a huge fraction of each tiny step; and synchronising thousands of cores for backward passes that touch only a few rows is wasted coordination. This is the same shape of problem the MNIST-trains-faster-on-CPU article explores in detail: when per-step compute is small enough, kernel-launch overhead and PCIe transfer cost dominate any FLOPs advantage the GPU might have, and the CPU wins on wall-clock time.
The CPU, by contrast, is well suited: modern L1/L2 caches and random-access latency make scattered row reads from a multi-GB embedding table cheap, system RAM is large enough to hold a 1M-word, 300-dim model (2.4 GB) plus thread-local buffers without breaking a sweat, and OS-level threads can update shared memory with occasional clobbering — which is the entire trick behind Hogwild! lock-free parallelism, much cheaper than the GPU equivalent (atomics across thousands of cores).
Gensim leans into all of this.
Its inner loop is Cython (Python compiled to C), single-pair forward-and-backward, with Hogwild! lock-free parallelism across as many CPU threads as the workers parameter specifies. Each thread streams pairs from its own slice of the corpus, looks up its rows of E and E', runs the update, and writes back to the shared matrices without holding a lock. Occasional write-write conflicts are silently absorbed: only a few rows are touched per update, so the collision rate is low and convergence isn’t measurably hurt.
The result, on a beefy laptop: 8 threads, a few million pairs per second, billions of training pairs per hour, all without leaving the CPU. That’s enough to train word2vec on a full Wikipedia dump in a few hours on hardware that costs nothing extra (because you needed the CPU anyway).
Where static embeddings break
word2vec produces static embeddings: one fixed vector per word, regardless of context. This is exactly the right shape for the distributional hypothesis as originally stated, but it has three failure modes that became increasingly visible as NLP moved to harder tasks.
Polysemy
Consider these two sentences:
- I deposited the cheque at the bank.
- We had a picnic on the river bank.
A static embedding gives bank one vector. That vector is some kind of average over both senses, which means it’s a good representation of neither.
[Widget: bank with two sense clusters, computed on glove-wiki-gigaword-300]
The widget above shows the cosine similarity between bank and two clusters of context words: money, loan, account, deposit, interest on one side; river, shore, water, creek, flood on the other. Both clusters pull above zero (the single vector covers both senses), but the financial cluster wins. That’s not a fact about the word; it’s the corpus skew.
There’s no way for downstream code to recover which sense was meant in any specific sentence, because it has access to one vector and a surrounding sequence of other single vectors, all of them sense-blind in the same way.
No syntax sensitivity
A bag-of-vectors representation throws away word order. The sentences:
- Dog bites man.
- Man bites dog.
contain identical word sets and therefore identical bag-of-vectors representations, despite having opposite meanings. Anything built on top of static embeddings has to recover order from somewhere else, typically a recurrent or convolutional layer that processes the sequence directly. That works (it’s how most pre-2018 neural NLP models were built), but it means the embeddings themselves are doing only part of the job.
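The order-blindness is easy to demonstrate. The sketch below uses random stand-in vectors rather than real embeddings, and averaging as the (assumed) way of pooling a sentence; any order-independent pooling such as a sum behaves the same way.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in static embeddings: one fixed random vector per word (not trained values).
vectors = {w: rng.normal(size=4) for w in ("dog", "bites", "man")}

def bag_of_vectors(sentence):
    """Average the static vectors of the words, ignoring their order."""
    return np.mean([vectors[w] for w in sentence.lower().split()], axis=0)

a = bag_of_vectors("Dog bites man")
b = bag_of_vectors("Man bites dog")
print(np.allclose(a, b))
```

The two sentences collapse to the same point, so any downstream model fed only this pooled vector cannot tell them apart.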
Frozen at training time
Static embeddings are fixed once trained. New senses, new compounds, new domain vocabulary — the vectors don’t update. Worse, words that didn’t appear in the training corpus simply don’t have vectors at all.
These three failures look different on the surface but share one cause: a static embedding is a function of the word, not the sentence. Anything sentence-dependent — sense, role, discourse position — has to be handled outside the embedding. That’s a big enough job that it constrained the architectures of the entire pre-2018 era.
The same three failures hit every other static word embedding too — GloVe and fastText produce the same (V, d) lookup table by different training procedures, so they inherit the same blind spots. Solving them requires giving up on “one vector per word” entirely, which is what BERT and the contextual encoders do.
What comes next
The fix is easy to state and was hard to build: make the vector depend on the surrounding sentence, not just the word. bank shouldn’t have one vector — it should have the one it earns in “river bank,” and a different one in “bank account.”
BERT inherits more from word2vec than you’d expect from the architectural gap — the distributional hypothesis (predict missing tokens from context), the trick of a fake prediction task that exists only to force good vectors out of the network, and the bet that geometry encodes meaning. ELMo (2018) ran the first serious version of this with a bidirectional LSTM and per-position hidden states. BERT (also 2018) swapped the LSTM for a Transformer encoder and a new pretraining objective — masked language modeling, which is structurally CBOW with most of CBOW’s limits removed: a deep stack of bidirectional Transformer layers over the whole sentence instead of a shallow average over a five-word window, multiple masks per example instead of one, and contextual per-token output vectors instead of a single frozen lookup row. The training task — fill in the blank — is the same one CBOW ran on a tiny scale back in 2013. The BERT article walks through that machinery in detail.
The level after that is pooling a contextual model down to one vector per sentence or passage — the engine of retrieval-augmented generation, which picks up later in the series.