Word embeddings — word2vec and vectors that mean something
A lot of people meeting AI today start at the top of the stack — transformers, attention heads, fine-tuning, a chat model that already seems to understand everything. That’s a fine place to use the technology from. It’s too high-level a place to learn it from: the big models are deep stacks of ideas accumulated over more than fifty years, and almost every one of those ideas is smaller and far easier to see clearly on its own. To understand it, it’s better to build up from the pieces than start with the finished stack — and the piece that matters most, the one GPT-style models and BERT and every retrieval system are all variations on, is the idea that words can be turned into vectors whose geometry encodes meaning.
You’ve probably seen the famous example before: king − man + woman ≈ queen, arguably the single most-screenshotted result in NLP. Look at the widget below.
It demonstrates this arithmetic geometrically: each side of the picture is its own little coordinate system,
but the Royalty direction — computed in system A as v(king) − v(man) — points the same way everywhere in embedding space.
Carry that vector across to system B, lay it down at woman, and its tip lands at queen.
So king − man isolates a direction, and + woman applies it to a different anchor.
Click the step button to watch the construction step by step.
What the widget demonstrates is a dense vector representation of words: each word is a point in a high-dimensional space, and the geometry between points encodes semantic relationships. Nobody set those values by hand, and nobody told the model that “royalty” or “gender” are even concepts — there are no labelled axes, no category tags, no human marking king as “royal” or queen as “female”. The geometry falls out entirely from the training objective, and working out how that happens is most of what this article is about.
For most of NLP’s history, though, words weren’t vectors at all — they were standalone symbols, integer IDs or one-hot indicators with no built-in relationship to one another. Anything semantic had to be bolted on downstream: engineered features, count-based statistics, lexicons like WordNet. Models worked, but they generalized poorly across words — what a system learned about cat transferred not at all to kitten.
Dense embeddings are what replaced that. Each token becomes a vector of a few hundred real numbers, all learned from data, and the geometry of that space lines up with meaning: similar words land near each other, and pairs of related words sit at similar offsets — exactly what makes the arithmetic above work.
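As a minimal sketch of that offset arithmetic, the snippet below uses five hand-made 2-D vectors. The coordinates are invented for illustration (real embeddings are learned and have hundreds of dimensions), and `cosine` and `nearest` are hypothetical helpers, not functions from any library.

```python
import numpy as np

# Toy vectors, chosen by hand (assumption): axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, -0.8]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, -0.8]),
    "apple": np.array([-0.7, 0.05]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(query, exclude=()):
    """Return the vocabulary word whose vector is most similar to `query`."""
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

# king - man isolates a direction; adding it to woman applies it to a new anchor.
royalty = vectors["king"] - vectors["man"]
result = nearest(vectors["woman"] + royalty, exclude=("king", "man", "woman"))
print(result)
```

Leaving the three query words out of the candidate set is a common convention when scoring analogies; without it, the nearest neighbour is often one of the inputs.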
The training process is what gives those vectors meaning. At initialisation each word is just a short array of random numbers — nothing semantic in there yet. They acquire meaning purely as a side effect of being forced to do well at a self-supervised prediction task — predict a word’s context (skip-gram) or a word from its context (CBOW). Nobody ever tells the model “king and queen are related.” But to predict the contexts of king accurately, and the contexts of queen accurately, when those two context distributions heavily overlap, the only way the tiny network can succeed is to give king and queen similar vectors. Similarity in the embedding space is the model’s mechanism for sharing predictive evidence between words that behave alike. The geometry falls out of the optimization.
From hypothesis to geometry
How does a hypothesis about word meaning end up producing actual vector geometry — where similar words land near each other and consistent relations become consistent vector offsets? The reasoning involves four steps, each forcing the next: a premise (the distributional hypothesis itself), a self-supervised task built directly on that premise, a representational bottleneck that constrains how the task can be solved, and the geometry that emerges at the other end. Walk through them in order and the leap from “words have meaning” to “words become vectors with structure” stops being mysterious.
We start with the distributional hypothesis: a word’s meaning is well approximated by the distribution of contexts it appears in. Words that show up in the same kinds of sentences — couch and sofa, huge and enormous — mean similar things, which quietly reframes the question of what a word means into the question of what it appears next to — and that’s something you can answer from raw text alone, with no dictionary or human annotation.
From that premise comes a self-supervised task: take a word, predict the words that surround it — or take the surrounding words and predict the one in the middle. Since the hypothesis says those neighbors carry the meaning, getting good at the task forces the model to encode meaning — and it costs nothing to set up, because the right answers are already sitting in the text.
Then comes a constraint on that task — a bottleneck. A word’s entire representation has to fit in one short vector — say three hundred numbers — and that’s the only place its information can live. Three hundred numbers isn’t much room: nowhere near enough to give every word its own independent description. So to predict well for all of them at once, the model has to economize — represent words with shared, reusable features, and place words that play the same role at nearly the same coordinates. Similar words end up close together not because anyone asked for it, but because crowding them is the only way everything fits. Take the bottleneck away — let the vector grow as wide as the vocabulary itself, so each word can claim its own dedicated dimension and never has to share — and that pressure vanishes: the model just memorizes which words co-occur with which, one word at a time, and learns nothing that carries over to the next.
Out the other end comes the geometry: words with similar context distributions get similar vectors, and consistent relational patterns — capital-of, plural, past-tense — become consistent vector offsets. Meaning ends up encoded as position and direction in the space — which is what the opening picture is showing.
What you just walked through — predict context, squeeze it through a bottleneck, read off the geometry — isn’t unique to word2vec. Every embedding model after word2vec runs the same play. Self-supervised prediction through a representational bottleneck is how BERT learns its representations, how the sentence encoders behind RAG learn theirs, and how the embedding layer inside every modern LLM gets built.
In today’s NLP architectures, “embedding” can mean three quite different things, all built the same way (self-supervised prediction through a bottleneck) but each applied to a bigger unit of text than the last:
- Static word embeddings: word2vec, GloVe, fastText. One fixed vector per word, pulled from a lookup table, frozen after training. Word order is thrown away, and bank gets a single vector that averages all its senses.
- Contextual embeddings: ELMo, then BERT and its descendants. One vector per word as it appears in a sentence: bank in “river bank” and bank in “bank account” come out different, because the surrounding tokens reshape the vector. The training task shifts to masked language modeling (predict a hidden token from its bidirectional context, run through a deep encoder rather than a shallow average), but the underlying principle, self-supervised prediction through a bottleneck, is unchanged.
- Sentence / passage embeddings: Sentence-BERT and the modern retrieval models. Pool a contextual model down to one vector per sentence or document. This is the engine of retrieval-augmented generation: embed the corpus once, embed each query, return the nearest chunks by cosine similarity, which is the exact geometric bet the picture above is a demo of.
You might be wondering where generative LLMs — ChatGPT, Claude, Llama — fit here. Not at a level of their own; they contain all three. The very bottom layer of every one of them is a static word-embedding table, exactly like word2vec’s — a token ID goes in, a learned vector comes out. Every transformer layer above it produces contextual embeddings, BERT-style. What makes an LLM an LLM isn’t a new kind of embedding; it’s that you read off its next-token distribution and sample from it, turning a representation machine into a generation machine.
This article is about the first level — static word embeddings. We’ll start with the idea behind all of it (the distributional hypothesis), walk through the inner workings of skip-gram and CBOW, explore the famous king–queen analogy on a real pre-trained vocabulary, and end at the place where static embeddings stop working — which is exactly where BERT and level two pick up, in the next article. Level three — the sentence-and-passage embeddings behind retrieval-augmented generation — comes later in the series.
The distributional hypothesis
Way back in 1957, the linguist J.R. Firth famously wrote, “You shall know a word by the company it keeps.” That’s the whole idea. Words that appear in similar contexts have similar meanings. To see why, consider three sentences with one word missing:
- The ___ barked at the postman.
- The ___ purred on my lap.
- The ___ flew south for winter.
You don’t need to know the missing words to know they refer to different kinds of animals. The context — the words around the blank — narrows down what fits. That’s one half of the idea: context predicts the word.
The other half flips it. Collect every word that consistently fills the slot in “The ___ barked” — dog, puppy, hound, retriever. They all fit because they all mean something dog-like. Any word that consistently fills “The ___ purred” — cat, kitten, tabby — means something cat-like. So words that fill the same slots in the same kinds of sentences must mean similar things. That’s the direction that pays off for representations: a model trained to predict context has to give words with similar contexts similar vectors, because that’s the only way it can score them all well at once.
The distributional hypothesis says: define a word by its contexts. Build a representation that places words close together if their context distributions are similar. That representation will encode semantic similarity, syntactic role, and a surprising amount of world knowledge — all without any human labeling, just from raw text.
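A rough way to see "define a word by its contexts" in action is to collect each word's neighbours and compare the sets. The four sentences below are made up for illustration, and Jaccard overlap of context sets is a stand-in chosen for simplicity, not the measure word2vec itself uses.

```python
from collections import defaultdict

# Hypothetical mini-corpus (assumption): two dog-like and two cat-like sentences.
sentences = [
    "the dog barked at the postman",
    "the puppy barked at the door",
    "the cat purred on my lap",
    "the kitten purred on the sofa",
]

window = 2
contexts = defaultdict(set)  # word -> set of words seen within ±window of it
for sentence in sentences:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        contexts[word].update(tokens[j] for j in range(lo, hi) if j != i)

def jaccard(a, b):
    """Overlap between the context sets of words a and b (1.0 = identical)."""
    return len(contexts[a] & contexts[b]) / len(contexts[a] | contexts[b])

print(jaccard("dog", "puppy"))  # words that fill the same slot
print(jaccard("dog", "cat"))    # words that fill different slots
```

Words that fill the same slot share most of their context set, while words from different slots share only the function word.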
The question is how to compute it efficiently. The classic answer, going back to the early 1990s, was very large sparse co-occurrence matrices — latent semantic analysis and its kin. For the sentence “the cat sat on the mat” with a 2-word window (look up to 2 tokens left and right of each center word), the matrix looks like this:
       the  cat  sat  on   mat
the  [ 0,   1,   2,   1,   1 ]
cat  [ 1,   0,   1,   1,   0 ]
sat  [ 2,   1,   0,   1,   0 ]
on   [ 1,   1,   1,   0,   1 ]
mat  [ 1,   0,   0,   1,   0 ]

One row per word, one column per context word, and each entry M[i][j] counts how often word j appeared in word i’s context, that is, within a certain window of it. The diagonal is zero here because no word occurs twice inside the same window; in a real corpus a repeated word (“very very good”) can put counts on the diagonal.
The matrix is built by sliding the window across the corpus and tallying.
At each position with center word i, look at the words in its ±W window, and for each neighbour j in that window, increment M[i][j] by one. Scrub the widget from start to end and the matrix fills in cell by cell — one pass through the corpus and every co-occurrence count is recorded.
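The tallying loop can be sketched in a few lines. This is a minimal illustration on the toy sentence, with the vocabulary order fixed by hand so the rows and columns match the matrix printed earlier; the variable names are assumptions made for the example.

```python
import numpy as np

tokens = "the cat sat on the mat".split()
vocab = ["the", "cat", "sat", "on", "mat"]   # same row/column order as the matrix above
index = {w: i for i, w in enumerate(vocab)}

W = 2                                        # window radius: ±2 tokens
M = np.zeros((len(vocab), len(vocab)), dtype=int)

for pos, center in enumerate(tokens):
    start, stop = max(0, pos - W), min(len(tokens), pos + W + 1)
    for ctx in range(start, stop):
        if ctx != pos:                       # skip the center position itself
            M[index[center], index[tokens[ctx]]] += 1

print(M)
```

Note that the loop skips the center position, not the center word, which is why a repeated word inside one window would land on the diagonal.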
This is the first step of the count-then-factorise pipeline, and there are two more on top of it. The shape of a word’s vector changes substantially at each stage:
Stage 1 — raw counts. The matrix above, as it is. Each row is V numbers long (“V-long” for short), one slot for every word in the vocabulary — already the word’s vector, just a wildly oversized one. Each entry is literally “how many times did this specific vocab word appear in this word’s ±2 window across the corpus.” The row for cat is [1, 0, 1, 1, 0]. The toy corpus has only six tokens so the numbers are tiny, but the structure is what matters at scale.
Stage 2 — chance adjustment. Raw counts are dominated by frequency: the co-occurs with cat a lot in a real corpus, but only because the co-occurs with everything. To extract real signal, each co-occurrence has to be discounted by what chance alone would predict — frequent words shouldn’t get credit for co-occurring with everything. The standard fix is PMI (pointwise mutual information): a log-ratio that’s high when two words co-occur more than chance would predict, low when less, and zero when their joint rate matches what two independent words with those frequencies would produce. The matrix shape stays (V, V); only the values change, from raw counts to chance-adjusted scores.
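Here is one way the chance adjustment might look on the toy matrix. It computes PMI as log(P(i, j) / (P(i) · P(j))) from the counts, then clips negatives to get PPMI. Setting unseen pairs to 0 is a simplifying assumption (their true PMI is minus infinity), and on six tokens the numbers are only illustrative.

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]
M = np.array([
    [0, 1, 2, 1, 1],
    [1, 0, 1, 1, 0],
    [2, 1, 0, 1, 0],
    [1, 1, 1, 0, 1],
    [1, 0, 0, 1, 0],
], dtype=float)

total = M.sum()
p_joint = M / total                                 # P(word, context)
p_word = M.sum(axis=1, keepdims=True) / total       # P(word), shape (V, 1)
p_ctx = M.sum(axis=0, keepdims=True) / total        # P(context), shape (1, V)
expected = p_word * p_ctx                           # joint rate if independent

pmi = np.zeros_like(M)                              # unseen pairs stay at 0 (assumption)
seen = M > 0
pmi[seen] = np.log(p_joint[seen] / expected[seen])
ppmi = np.maximum(pmi, 0.0)                         # positive PMI: clip negatives

the, on, mat = vocab.index("the"), vocab.index("on"), vocab.index("mat")
print(pmi[the, on], pmi[mat, on])
```

Both (the, on) and (mat, on) have a raw count of 1, but the frequent word the is discounted below zero while the rarer mat scores above it. The matrix shape is still (V, V).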
Stage 3 — compression. Even after chance-adjusting, the matrix is still (V, V) — wildly too wide to be a usable word vector table. The standard fix is SVD (singular value decomposition): find the d directions of greatest variance in the matrix and project each row from V dimensions down to d:
raw cat vector  (V long): [1, 0, 1, 1, 0]
PMI-weighted    (V long): [-0.4, 0, 0.7, 0.3, 0]      ← log-ratios per vocab word (illustrative)
SVD-compressed  (d long): [0.21, -0.34, 0.18, …]      ← d abstract latent dimensions

The compressed vector is d real numbers (a few hundred), and the dimensions are no longer “co-occurrence with one specific word.” Each dimension is a linear combination of all V word-axes: a learned latent direction that captures one dominant pattern of co-occurrence variation across the corpus. Dimension 47 no longer means how often cat appeared near snowboard; it now measures how much of cat’s overall co-occurrence pattern aligns with the 47th-most-prominent direction in the data.
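The compression stage can be sketched with NumPy's SVD on the toy matrix. The choice d = 2 and the scaling of the left singular vectors by the singular values are assumptions made for illustration; other weightings of the singular values are also used in practice.

```python
import numpy as np

M = np.array([
    [0, 1, 2, 1, 1],
    [1, 0, 1, 1, 0],
    [2, 1, 0, 1, 0],
    [1, 1, 1, 0, 1],
    [1, 0, 0, 1, 0],
], dtype=float)

# Chance-adjust first: PPMI, with log(0) = -inf clipped to 0.
total = M.sum()
expected = M.sum(axis=1, keepdims=True) * M.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore"):
    ppmi = np.maximum(np.log(M / expected), 0.0)

# Factorise, then keep only the d strongest directions.
d = 2
U, S, Vt = np.linalg.svd(ppmi)        # ppmi == U @ diag(S) @ Vt
embeddings = U[:, :d] * S[:d]         # (V, d): one dense row per word

print(embeddings.shape)
```

Each row of `embeddings` is now d numbers instead of V, and no column corresponds to a specific vocabulary word any more.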
That (V, d) compressed table — dense real numbers, one row per word — is structurally identical to the word2vec vectors covered in the rest of the article. What’s different is how you get there: count-then-factorise vs train-a-network-directly. The two routes land on essentially the same destination — which is what Levy & Goldberg (2014) formalises mathematically.
This means that the count-then-factorise route produces the same kind of geometry that supports king − man + woman ≈ queen — the famous analogy arithmetic falls out of either path.
Well-tuned PPMI-SVD models (PPMI is PMI with negative values clipped to zero) are competitive with word2vec on standard word-similarity and analogy benchmarks; same destination, different engineering.
For two decades, this count-then-factorise recipe underpinned LSA (latent semantic analysis, 1990; word-by-document matrix, used for information retrieval) and HAL (hyperspace analogue to language, 1996; word-by-word matrix, used in cognitive science to model human semantic memory). Both produced word vectors as a byproduct of other goals; word2vec (2013) is the method that popularised treating the word-vector table itself as the explicit target.
The common cost across all of them was sheer scale, and it grew quadratically with vocabulary size. At real V the matrix is a V × V monster: a million words on each side is a trillion cells in total, almost all of them zero because most pairs of words never co-occur. Storing it is heavy; factorising it with classical SVD is heavier still. On top of the raw compute, the pipeline is a chain of explicit, hand-tuned steps (count, weight, factorise, truncate, scale), each with knobs that interact non-obviously with the others. By the early 2010s this was the recipe everyone used and nobody loved.
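A quick back-of-the-envelope calculation puts numbers on that scale. It assumes 4-byte floats and a dense layout for the V × V matrix (a sparse format would store far less), and d = 300 as a representative "few hundred".

```python
V = 1_000_000          # vocabulary size
d = 300                # embedding width (assumed)
BYTES_PER_FLOAT = 4    # float32 (assumed)

cells = V * V                                        # entries in the V x V matrix
dense_terabytes = cells * BYTES_PER_FLOAT / 1e12     # if stored densely
table_gigabytes = V * d * BYTES_PER_FLOAT / 1e9      # the (V, d) table you actually want

print(f"{cells:,} cells")
print(f"{dense_terabytes} TB dense vs {table_gigabytes} GB for the embedding table")
```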
In 2013, a breakthrough method called word2vec produced dense vectors of a few hundred dimensions, with the same semantic properties, and without ever building this matrix. Instead of counting co-occurrences and squeezing the result down, it learns the vectors directly, by training a tiny neural network on a prediction task. (A year later, GloVe went back to building a matrix and factorising it head-on; fastText extended word2vec with character n-grams. The static-embedding family article covers those alternatives and how they relate to each other.)
word2vec
The word2vec paper proposed two approaches that skip the co-occurrence matrix entirely — they produce the same (V, d) embedding table without ever building the V × V grid in the first place. Instead of counting co-occurrences and then squeezing the matrix down with SVD, train a tiny neural network to predict the words around each word, in either direction, then read the word vectors off the network’s weights when training stops. The prediction is real, but it isn’t what you care about: it’s only there to force structured vectors out of the network. You keep the table of learned word embeddings and discard the rest. Same destination as LSA/HAL, very different engineering: stream through word pairs one at a time and nudge the vectors as you go, instead of building a trillion-cell matrix and factorising it.
That training setup also gives word2vec a hard scope limit: it focuses on individual words, not sentences. The model produces one fixed vector per word and nothing more, with no sentence representation, no word-order awareness, and no compositionality across the words in a phrase. In the years between word2vec’s release in 2013 and the transformer takeover later that decade, the standard NLP pipeline filled that gap by stacking an LSTM (or similar recurrent net) on top of word2vec embeddings: the embeddings carried what each word means; the LSTM handled how those meanings fit together in a sentence. Why that split eventually collapsed is the topic of the closing sections.
The two algorithms introduced by word2vec are skip-gram and CBOW — alternative recipes that run the prediction in opposite directions. You use one or the other, never both:
- Skip-gram. Given a center word, predict the words in the small window around it. We feed the center word as input and predict each of the surrounding context-window words in turn; every (center, context) pair is a separate training example, so the same center word is reused once per neighbour.
- CBOW (continuous bag of words). The reverse: given the words in a window, predict the center word. We feed a set of context words in and predict the single word in the middle. This fill-in-the-blank shape (recover the missing word from its surroundings) is the one BERT’s masked language modeling later inherits and scales up.
We’ll work through skip-gram end to end, then come back to CBOW.
The training data and one-hot inputs
The whole skip-gram pipeline splits cleanly into two stages with a sharp boundary: pre-processing turns raw text into a list of training examples, then training runs those examples through a neural network.
Let’s first look at pre-processing — turning the raw text into a long list of integer pairs. No neural network is involved yet:
- Tokenize the corpus: split the text into a list of word tokens (whitespace-separated, usually lowercased; word2vec uses word-level tokens, not subword pieces).
- Build the vocabulary: assign each distinct token an integer ID; that ID will later serve as the token’s index into one-hot vectors and as the row number into the embedding matrix. The total count is V (the vocabulary size).
- Extract (center, context) pairs: slide a window over the ID stream and emit one training example per neighbour.
Step 3 is the heart of pre-processing. Conceptually, you pick a window size — convention is two words on either side, making the window five words wide — and slide it through the text one word at a time. At each position the word in the middle is the center, the surrounding words form its context, and the model’s job is to predict the context words from the center. Every (center, context) pairing emitted at that position is one training example.
Using the sentence “the cat sat on the mat” as an example, we set the window to ±2 and slide it word by word. With the window centered on sat, the neighbours are the, cat, on, the, so this single position emits four pairs: (sat, the), (sat, cat), (sat, on), (sat, the). Step forward to on and the window finds cat, sat, the, mat: four more pairs. Step again to the second the and the window, now clipped by the end of the sentence, finds only sat, on, mat: three pairs. And so on, until a whole corpus collapses into a long list of (center, context) pairs, generated entirely from the text itself, with no human ever labelling anything.
At the end of pre-processing you have a sequence of (c, t) integer pairs, where c is the center word’s id and t is the target (context) word’s id, ready to feed. The whole stage fits in about a dozen lines of plain Python on our running example:
# Step 1: tokenize.
corpus = "the cat sat on the mat"
tokens = corpus.split()
# ['the', 'cat', 'sat', 'on', 'the', 'mat']

# Step 2: build vocabulary and convert tokens to integer IDs.
vocab = sorted(set(tokens))                    # ['cat', 'mat', 'on', 'sat', 'the']
word2id = {w: i for i, w in enumerate(vocab)}  # {'cat': 0, 'mat': 1, 'on': 2, 'sat': 3, 'the': 4}
V = len(vocab)                                 # 5
ids = [word2id[w] for w in tokens]             # [4, 0, 3, 2, 4, 1]

# Step 3: slide a ±2 window over the ID stream, emit (center, context) pairs.
window = 2
pairs = []
for i, c in enumerate(ids):
    for j in range(max(0, i - window), min(len(ids), i + window + 1)):
        if i != j:
            pairs.append((c, ids[j]))
len(pairs)  # 18: exactly the (center_id, context_id) pairs the widget above emits.

Pre-processing is essentially the same regardless of which word2vec variant you train next: tokenization, vocabulary, and windowing are identical. Only the format of the emitted examples differs: skip-gram packs them as pairs, while CBOW emits one context bag plus its center per window. Training is where the algorithm actually lives, and the rest of this section walks through it in detail.
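For comparison, here is a sketch of what the CBOW-shaped examples could look like on the same sentence: one (context bag, center) tuple per window position instead of one pair per neighbour. The name `cbow_examples` and the list-of-IDs layout are assumptions for illustration, not the format of any particular implementation.

```python
tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))                    # ['cat', 'mat', 'on', 'sat', 'the']
word2id = {w: i for i, w in enumerate(vocab)}
ids = [word2id[w] for w in tokens]             # [4, 0, 3, 2, 4, 1]

window = 2
cbow_examples = []
for i, center in enumerate(ids):
    start, stop = max(0, i - window), min(len(ids), i + window + 1)
    context = [ids[j] for j in range(start, stop) if j != i]  # the bag of neighbours
    cbow_examples.append((context, center))

print(len(cbow_examples))   # one example per token position
print(cbow_examples[2])     # window centered on "sat"
```

The same 18 neighbour IDs appear as in the skip-gram pairs; they are just grouped into six bags rather than spread over 18 separate examples.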
Each pair is a training example
With pre-processing done, let’s look at how the training algorithm uses these pairs.
Each pair is one training example, representing an input and a prediction target.
The four pairs emitted from one window position around sat are four separate training examples with the same input sat and four different prediction targets (the, cat, on, the). Only the input goes through the forward pass; the target isn’t used during the network’s computation at all — it’s only consulted at the loss step, to score how well the predicted distribution matched it.
It’s similar to MNIST: each MNIST example pairs an image with its digit label, and here each skip-gram pair (c, t) pairs the center word c (input) with one context word t (target). For (sat, cat): feed sat in, get back a predicted distribution over the vocabulary, compare it against cat, take an SGD step. Then the next pair.
Before we can feed anything into the network, we need to represent the word as a vector of numbers.
In MNIST that step is mostly free — an image is already a grid of pixel intensities, so we just flatten it into a 784-number vector.
A word has no inherent numeric content, so we invented one back in the build-the-vocabulary step above — every word already has an integer index from 0 to V−1. To feed it into the network we expand that index into a one-hot V-vector — a vector V numbers long, all zeros except a single 1 at the word’s index.
For our running vocabulary of 5 words (V=5), the encoding looks like this:

"cat" → [1, 0, 0, 0, 0]
"mat" → [0, 1, 0, 0, 0]
"on"  → [0, 0, 1, 0, 0]
"sat" → [0, 0, 0, 1, 0]
"the" → [0, 0, 0, 0, 1]

The one-hot is V long: its length is the vocabulary size, around a million for real word2vec. So each word literally becomes a million-number vector where 999,999 entries are 0 and only one is 1. It’s immediately evident that this is a hugely wasteful representation: almost all of the storage and almost all of the multiplications are operating on zeros that contribute nothing.
What does the network do with that one-hot? It hands it straight to the first layer as input, which dot-products it against the layer’s weight matrix — entry by entry, summed. With V around a million, that’s a million-by-d matrix multiplication per training example, almost all of it multiplying zeros. Done literally, it’s an enormous amount of wasted arithmetic.
Luckily, there’s a clever bit of linear algebra that lets us skip building the one-hot vector at all — multiplying a one-hot by a matrix is the same as picking out one row of that matrix (a lookup by the word’s integer ID). For our 5-word vocabulary, with sat at index 3 and some matrix M of shape (5, d):
one-hot for "sat"        M (5 rows × 3 cols)                  result
[ 0 0 0 1 0 ]     ·     [ row 0:   0.21  -0.43   0.15 ]   =   [ 0.33 -0.27 0.84 ]
                        [ row 1:   0.07   0.62  -0.31 ]       (just row 3)
                        [ row 2:  -0.55   0.18   0.40 ]
                        [ row 3:   0.33  -0.27   0.84 ]
                        [ row 4:  -0.12   0.49  -0.06 ]

Every term that touches a 0 from the one-hot vanishes, leaving only the row 3 contribution, so the answer is simply row 3 of M.
# dot product, column by column:
col 0: 0·0.21 + 0·0.07 + 0·(-0.55) + 1·0.33 + 0·(-0.12) = 0.33
col 1: 0·(-0.43) + 0·0.62 + 0·0.18 + 1·(-0.27) + 0·0.49 = -0.27
col 2: 0·0.15 + 0·(-0.31) + 0·0.40 + 1·0.84 + 0·(-0.06) = 0.84

So in code we just store the integer 3 (the vocab index) and use it as a direct row lookup; we’ll see this concretely in the forward-pass section. Mathematically a one-hot V-vector and an integer V-index carry the same information; we use the one-hot for the maths because it makes the linear algebra clean, and the integer for the code because it’s V times cheaper.
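The equivalence is easy to check numerically. The sketch below reuses the 5 × 3 matrix from the worked example and compares the full one-hot multiplication against a plain row lookup; the variable names are chosen for the illustration.

```python
import numpy as np

# The example matrix M from above: 5 words, 3 dimensions.
M = np.array([
    [ 0.21, -0.43,  0.15],
    [ 0.07,  0.62, -0.31],
    [-0.55,  0.18,  0.40],
    [ 0.33, -0.27,  0.84],
    [-0.12,  0.49, -0.06],
])

sat_id = 3
one_hot = np.zeros(M.shape[0])
one_hot[sat_id] = 1.0

via_matmul = one_hot @ M     # the "maths" route: V multiplications per column
via_lookup = M[sat_id]       # the "code" route: read one row

print(via_matmul, via_lookup)
```

Both routes give the same three numbers, but the lookup never touches the other four rows.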
Mental model — what we’re trying to do
Before we look closely at the network architecture, it’s worth nailing down what the network is actually trying to accomplish.
The core idea in one line: given a list of pairs (center, target), for each pair compute a similarity across dimensions between the center word and every word in the vocabulary — then keep nudging the weights, pair after pair, so the targets gradually rise to the top of the similarity ranking.
The similarity is measured with a dot product between word vectors (embeddings), which is cosine similarity without the length normalization, and softmax ranks every vocab word by its predicted probability of being the target. The rest of this section unpacks what those weights are, how the network implements this, and why running this loop over a corpus produces meaningful geometry.
We’re going to have two trainable weight matrices — each between a pair of layers in the network — together giving every word in the vocabulary two d-dim vectors (one embedding per role):
- E of shape (V, d): each row is one word’s input embedding, used when the word appears as the center of a training pair (the word being conditioned on).
- E' of shape (d, V): each column is one word’s output embedding, used when the word appears as the target/context being predicted (the word being scored as a candidate).
Note the shape difference: E' has the shape of a transposed E, the same V × d worth of numbers, but laid out as (d, V) so words run as columns instead of rows. Only the layout matches; the values in E' are learned separately and are not a transposed copy of E.
So a single word w has both a row in E (when it’s the center) and a column in E' (when it’s the target) — two independent trainable d-dim vectors, one for each role.
They get independent gradient updates and end up with different values.
The asymmetry is intentional: the center plays a different role from the target in the prediction task (one is the conditioning context, the other is being judged for plausibility), and the model is more expressive when it can learn distinct vectors for those two roles than when forced to share. After training, E is what ships as the final word-embedding table, while E' is discarded.
The widget below makes that step concrete — computing a similarity across dimensions between the center word and every word in the vocabulary, the interaction between E and E'. We focus on sat as the center (highlighted green in E). Step through to watch the matmul score it against every word in the vocabulary — one dot product per word, filling the scores vector entry by entry. (Stages 1 and 2 only — softmax and the loss come a couple of sections later.)
The score for a (c, t) pair is just v_c · v'_t — the dot product of c’s input embedding with t’s output embedding, which is unnormalized cosine similarity.
This is also why E' is stored transposed (columns = output embeddings) rather than as another (V, d) table: storing them as columns means scores = v_c @ E' computes the dot product of v_c against every word’s output embedding in a single matrix-vector multiplication. If E' were (V, d) like E, you’d have to loop or transpose to get the same V dot products. The (d, V) shape is exactly the layout that makes “compute similarity across all words at once” a one-line operation.
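A small NumPy sketch shows the layout argument in practice. The matrices are filled with random numbers as stand-ins for trained weights, and `E_prime` is just a Python-friendly name for E'; the last lines relate the raw dot-product scores to true cosine similarity by dividing out the vector lengths.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3
E = rng.normal(size=(V, d))          # input embeddings: one row per word
E_prime = rng.normal(size=(d, V))    # output embeddings: one column per word

c = 3                                # id of the center word ("sat" in the running example)
v_c = E[c]                           # lookup: shape (d,)

scores = v_c @ E_prime               # all V dot products in one matrix-vector product

# The same thing, one candidate word at a time.
looped = np.array([v_c @ E_prime[:, w] for w in range(V)])

# Dividing by both vector lengths turns each dot product into a cosine.
cosines = scores / (np.linalg.norm(v_c) * np.linalg.norm(E_prime, axis=0))

print(scores.shape, np.allclose(scores, looped))
```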
This is the same operation MNIST used to score digits — each output class had its own feature template, dot-producted with the hidden vector to score how well the image matched that class. word2vec runs the same play with columns of E' as per-word templates and v_c as the center word’s feature vector; the only difference is that here the features are learned semantic dimensions (royalty-ness, plurality, gender) rather than hand-interpretable patches (edges, strokes, loops).
With that in mind, the architecture in the next section is just the simplest possible neural net that computes similarity (v_c · v'_t) for every candidate pair (c, t) and updates weights in the two matrices through gradient descent.
Repeated billions of times across the corpus, this produces a geometry where each word’s input embedding sits near the output embeddings of words it co-occurs with — and transitively, where words that share the same neighbours end up close to each other. king and queen are never told to be similar; they end up similar because they’re both pulled toward royal, throne, crown, monarch. Same for cat and dog: they share neighbours like pet, fur, tail.
The architecture
Once we have collected pairs, we’re essentially running a supervised labelled-prediction task — given a center word, predict which word from the vocabulary comes nearby — so the setup is similar to MNIST in key ways: one hidden layer, softmax over output classes, cross-entropy against the label. What differs is the scale (vocabulary size V here vs. 10 digit classes for MNIST), the input format (one-hot vs. dense real-valued pixels), and the goal (we want the trained embeddings, not the prediction itself).
Concretely, the input is a V-dim one-hot for the center word; the hidden layer has d neurons (e.g. 300) and is purely linear (no bias, no nonlinearity); the output has V neurons with softmax across all V producing P(w | c) — the probability of each vocabulary word given the center. The two matrices E and E' we introduced in the previous section live between those layers: E is the input → hidden weight matrix (shape (V, d)), E' is the hidden → output one (shape (d, V)). These are the network’s only learned parameters; both start random, and after training only E ships as the final word-embedding table — E' is discarded.
To make the layer structure concrete, the whole base-case network is a handful of lines of Keras:
from tensorflow import keras
from tensorflow.keras import layers

V = 5  # vocab_size: 5 for the toy corpus, e.g. 1_000_000 for a real one
d = 3  # embedding_dim: e.g. 300 in practice

model = keras.Sequential([
    keras.Input(shape=(1,), dtype='int32'),       # integer ID of the center word
    layers.Embedding(input_dim=V, output_dim=d),  # E: shape (V, d), the lookup
    layers.Reshape((d,)),                         # (1, d) → (d,)
    layers.Dense(V, use_bias=False),              # E': shape (d, V), the linear layer
    layers.Softmax(),                             # softmax over V vocab scores
])
model.compile(optimizer='sgd', loss='sparse_categorical_crossentropy')

Two trainable layers, Embedding and Dense, both linear, with no activation in between.
Embedding(V, d) is E — Keras’s name for looking up a row from a (V, d) table given an integer index. Dense(V, use_bias=False) is E' — a plain (d, V) linear projection.
The Reshape in between strips a redundant length-1 dimension that Keras’s Embedding adds. It’s a Keras quirk — Embedding is built for token sequences and adds an extra axis even when we’re feeding one token at a time.
Softmax turns the V raw scores into a probability distribution over the vocabulary, and sparse_categorical_crossentropy is the standard classification loss on top of it — the same softmax-plus-cross-entropy MNIST uses, just with V vocabulary classes instead of 10 digits. We’ll walk through the loss and its gradient in detail in the sections that follow.
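As a small numeric illustration of those last two pieces, the sketch below applies softmax and cross-entropy by hand to five made-up scores. The score values and the target index are hypothetical; subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result.

```python
import numpy as np

# Hypothetical raw scores for a 5-word vocabulary (one per candidate word).
scores = np.array([1.2, -0.3, 0.4, 0.0, 2.1])
target = 0                                   # id of the true context word

# Softmax: exponentiate and normalize so the outputs sum to 1.
shifted = scores - scores.max()              # stability shift, result unchanged
probs = np.exp(shifted) / np.exp(shifted).sum()

# Sparse categorical cross-entropy: negative log-probability of the target.
loss = -np.log(probs[target])

print(probs.round(3), float(loss))
```

The loss shrinks toward zero as the probability assigned to the target approaches 1, which is the pressure that pushes the target's score up relative to the others.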
The number of dimensions in the word embedding d is a hyperparameter with no formula for picking it — convention does most of the work.
The well-worn default is d = 300, what the 2013 word2vec paper used on Google News and what GloVe ships as its largest pre-trained option.
In practice you’ll see d = 50–100 for lightweight/edge use, d = 200–300 as the sweet spot, and rarely anything larger.
The reason d has to be much smaller than V is the bottleneck argument from earlier: if every word could sprawl into its own dedicated dimension, the model would memorize co-occurrences instead of building shared features, and no geometry would emerge.
What we’ve just described is the base-case forward pass — a softmax over the entire vocabulary on the output side.
The forward pass — lookup, scoring, softmax
Now that we’ve covered the high-level architecture — input one-hot, hidden lookup, output scores — let’s zoom in and walk through exactly what happens when a single training example flows through the network. For a center word c, the forward pass moves left-to-right in three stages: lookup, scoring, and softmax.
The widget back in the mental-model section already showed the first two stages in isolation — one-hot times matrix collapsing to a row read, then a row-times-matrix producing V scores. Here we name them, add softmax on top, and follow the numbers end to end.
Stage 1: input × E → hidden (the lookup). Mathematically this is the matrix multiplication one_hot(c) @ E, producing a d-dim hidden vector. The input is sparse — V−1 entries are zero — so almost every multiplication evaluates to zero, and the whole (V, d) matmul collapses to a single row read: v_c = E[c].
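The collapse from matmul to row read is easy to check numerically. The sketch below uses a random stand-in for E and toy sizes (V = 5, d = 3, center ID 3), all of which are assumptions made for illustration rather than values from the article's widget.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3                       # toy sizes, chosen only for illustration
E = rng.normal(size=(V, d))       # random stand-in for the embedding table

c = 3                             # integer ID of the center word (hypothetical)
one_hot = np.zeros(V)
one_hot[c] = 1.0

via_matmul = one_hot @ E          # the full (V,) @ (V, d) product
via_lookup = E[c]                 # the single row read it collapses to

print(np.allclose(via_matmul, via_lookup))
```

Both routes give the same d-dim vector, which is why real implementations skip the one-hot and index the table directly.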
Stage 2: hidden × E' → V scores. Second-layer forward pass: scores = v_c @ E', producing V real numbers (one per vocabulary word). This is a genuine product against the full (d, V) matrix: no shortcuts, no sparsity to exploit. The score for word w is v_c · E'[:, w], the dot product of v_c with the w-th column of E', and there is one such dot product per vocabulary word. This is where most of the matmul work happens: every column of E' contributes to the forward pass and gets a gradient on the backward pass, so almost all the per-step compute, and almost all the learning, lives on the hidden → output side.
The matmul never looks at the target word. With sat as the center in the five-word example, it uses only v_c = E[sat] and scores sat against the entire vocabulary, producing all V scores at once. So the same five scores (cat -0.19, mat 0.26, on 0.39, sat -0.07, the -0.45) come out for every training pair that shares the center sat: (sat, cat), (sat, on), and (sat, the) all run the same matmul and land on the same five numbers. The target word enters only later, at the loss; the forward pass never sees it.
Stage 3: softmax → probabilities. The V scores — called logits — are arbitrary real numbers: they could be negative, unbounded, not summing to anything in particular. Softmax turns them into P(w | c) — V non-negative numbers that sum to 1, the model’s predicted probability that word w is in the context of c.
Using the same 5-word vocab, the widget below walks all three stages. Pick a (center, context) training pair, then hit step to fill the scores vector one dot product at a time; once all V scores are in, softmax turns them into probabilities.
Five dot products, five raw scores, then softmax. on comes out highest at the score stage (+0.39), which makes sense — on actually sits next to sat in “the cat sat on the mat” — and after softmax it carries the largest share of the probability mass too.
In skip-gram these scores are the dot products v_c · E'[:, w] for every vocab word w, and the resulting probabilities are P(w | c). Pulling out just the softmax step from the widget above, the five scores [-0.19, 0.26, 0.39, -0.07, -0.45] turn into a proper probability distribution that sums to 1:
The mechanics are simple: exponentiate every score, sum the results to get the normaliser (here ~5.17), then divide each exp(score) by that sum. The result is a probability distribution that sums to 1 by construction, regardless of what the inputs were. For the softmax subtleties (soft-argmax behaviour, shift invariance, temperature), see the MNIST article, which covers them in depth.
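As a rough check of those numbers, here is a minimal softmax in plain Python applied to the five scores quoted in the text. The max-subtraction inside `softmax` is a common numerical-stability habit and an addition of this sketch; it does not change the result.

```python
import math

vocab = ["cat", "mat", "on", "sat", "the"]
scores = [-0.19, 0.26, 0.39, -0.07, -0.45]   # scores quoted in the text

def softmax(xs):
    """Exponentiate, sum, divide. Shifting by max(xs) leaves the output unchanged."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(scores)
normaliser = sum(math.exp(s) for s in scores)   # unshifted sum, the ~5.17 in the text

for word, p in zip(vocab, probs):
    print(f"{word}: {p:.2f}")
print(f"normaliser: {normaliser:.2f}")
```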
The loss function
The loss for one training example with target word t is the negative log-probability the model assigned to that target:
loss = −log P(t | c)
That gives us a single number per training pair (c, t): small when the softmax has piled probability onto the true target, large when it hasn’t. This is cross-entropy with a one-hot label, the same loss MNIST uses, and the gradient through softmax is derived there.
Concretely, for the pair (sat, cat) using the scores and probabilities from above:
cat mat on sat the
scores = [-0.19, 0.26, 0.39, -0.07, -0.45]
probabilities = [ 0.16, 0.25, 0.29, 0.18, 0.12]
target = cat
P(cat | sat) = 0.16
loss = −log(0.16) ≈ 1.83
# what-ifs — how the loss responds to different P(cat):
P(cat) = 0.90 → loss = −log(0.90) ≈ 0.11 (good prediction)
P(cat) = 0.01 → loss = −log(0.01) ≈ 4.6 (bad prediction)
The widget below carries the five softmax probabilities over from above and plots them against the −log curve. Click a different word to designate it as the target: the marker slides along the curve, and you can see directly how a target the model already favours costs almost nothing while a target it underweights pays a sharp price.
The graph shows that when the model correctly assigns the target a large probability, the loss is tiny, and when it underweights the target — giving the right answer only a small probability — the loss is huge. Because of the logarithmic shape of −log(P), the transition between the two extremes is sharp, not gradual — the steeper the curve gets, the more aggressively the gradient pushes the target’s probability upward.
Compare the actual target cat to the uniform reference. cat sits at (0.16, 1.83) — the target in (sat, cat) — while the uniform reference point sits at (0.20, 1.61). That uniform baseline is where any untrained model starts — at initialisation the softmax spreads probability roughly equally across all the words in the vocabulary.
cat’s probability is below uniform, so its loss is above the baseline — the model is doing slightly worse than random on this pair, exactly the situation training is built to correct.
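The comparison with the uniform baseline can be reproduced in a few lines. This sketch takes the rounded probabilities from the text as given and treats each word in turn as a hypothetical target; the helper name `cross_entropy` is an assumption of the sketch.

```python
import math

# Rounded softmax outputs for center "sat", as quoted in the text
probs = {"cat": 0.16, "mat": 0.25, "on": 0.29, "sat": 0.18, "the": 0.12}

def cross_entropy(probs, target):
    """Loss for one pair: negative log of the probability given to the target."""
    return -math.log(probs[target])

loss_cat = cross_entropy(probs, "cat")
uniform_loss = -math.log(1.0 / len(probs))      # every word at 1/V = 0.20

losses = {word: cross_entropy(probs, word) for word in probs}
for word, loss in losses.items():
    side = "above" if loss > uniform_loss else "below"
    print(f"target={word}: loss={loss:.2f} ({side} the uniform baseline)")
```

Only the words the model already rates above 0.20 (mat and on) land below the baseline; cat, sat, and the all cost more than a uniform guess would.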
The gradient
The gradient is similar to MNIST: the gradient of the loss with respect to each logit is P(w) − 𝟙[w == t] — predicted probability minus the one-hot target. The target word’s gradient is P(t) − 1 (negative — push its score up); every other word’s is P(w) (positive — push its score down, proportional to how much probability it currently has). Words the model already correctly thinks are unlikely barely move; words it’s wrong about get the most signal.
The widget below makes that subtraction concrete on the same five probabilities from above. Click a different target to see the gradient row redraw — one tall blue bar pulling the target’s score up, four short red bars pushing the others’ down.
Every word except the target gets a small positive gradient — SGD pushes its score (and therefore its column of E') down, proportionally to how much probability it currently steals. The target gets one big negative gradient of size P(target) − 1 (about -0.84 for cat) — SGD pulls its score up. The further the model is from P(target) = 1, the closer that pull gets to −1 (the strongest single-step pull possible for a single example). Sum across the row and it’s zero: probability mass is being moved around, not created.
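The subtraction itself is one line of NumPy. The sketch below again uses the rounded probabilities from the text; `logit_gradient` is a hypothetical helper name.

```python
import numpy as np

vocab = ["cat", "mat", "on", "sat", "the"]
probs = np.array([0.16, 0.25, 0.29, 0.18, 0.12])   # rounded values from the text

def logit_gradient(probs, target_index):
    """Gradient of -log P(target) with respect to the logits: P - one_hot(target)."""
    grad = probs.copy()
    grad[target_index] -= 1.0
    return grad

grad = logit_gradient(probs, vocab.index("cat"))
for word, g in zip(vocab, grad):
    print(f"{word}: {g:+.2f}")
print("sum:", round(float(grad.sum()), 12))
```

The target's entry is the only negative one, and the entries cancel to zero, matching the "moved around, not created" picture.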
Backprop carries those score-gradients into E' and E by the chain rule:
scores = v_c @ E' (the forward step we're differentiating)
∂loss / ∂E'[:, w] = ( P(w) − 𝟙[w == t] ) · v_c ← gradient on column w of E'
∂loss / ∂v_c = E' @ ( P − one_hot_t ) ← gradient into the hidden vector
∂loss / ∂E[c] = ∂loss / ∂v_c ← because v_c = E[c]
Two consequences worth keeping in mind, both from the one-hot input:
- E gets a gradient on exactly one row per training example: the center word’s row. The other V−1 rows have zero gradient because v_c = E[c] only read from that one row. (Compare MNIST, where W₁ updates every weight per example because the input is dense pixels.)
- E' gets a gradient on every column. The true target’s column is pulled toward v_c; every other word’s column is pushed away, scaled by P(w) − 𝟙[w == t].
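The chain-rule formulas above can be sanity-checked against a finite difference. This is a sketch under toy assumptions: random weights, V = 5, d = 3, a hypothetical pair (c = 3, t = 0), and `E_out` standing in for E'.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3
E = rng.normal(scale=0.5, size=(V, d))
E_out = rng.normal(scale=0.5, size=(d, V))    # E' in the text
c, t = 3, 0                                   # hypothetical center and target IDs

def loss_fn(E, E_out, c, t):
    scores = E[c] @ E_out
    shifted = scores - scores.max()
    return -(shifted[t] - np.log(np.exp(shifted).sum()))

def gradients(E, E_out, c, t):
    v_c = E[c]
    scores = v_c @ E_out
    p = np.exp(scores - scores.max())
    p /= p.sum()
    delta = p.copy()
    delta[t] -= 1.0                           # P - one_hot_t
    grad_E_out = np.outer(v_c, delta)         # column w equals delta[w] * v_c
    grad_E = np.zeros_like(E)
    grad_E[c] = E_out @ delta                 # only the center row is nonzero
    return grad_E, grad_E_out

grad_E, grad_E_out = gradients(E, E_out, c, t)

# Central finite difference on one arbitrary entry of E'
eps = 1e-6
E_plus, E_minus = E_out.copy(), E_out.copy()
E_plus[1, 2] += eps
E_minus[1, 2] -= eps
numeric = (loss_fn(E, E_plus, c, t) - loss_fn(E, E_minus, c, t)) / (2 * eps)
print(numeric, grad_E_out[1, 2])
```

The analytic and numeric values agree to several decimal places, and `grad_E` has exactly one nonzero row, which is the first of the two consequences listed above.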
Those gradients are then used to take a gradient descent step: apply them to the weights as E -= lr × ∂L/∂E and E' -= lr × ∂L/∂E'. This is the optimiser’s job, and the choice of optimiser (vanilla SGD, momentum, Adam, RMSprop, AdaGrad…) only matters at this step — they all consume the same gradients but use them differently.
Mini-batches and epochs
The walkthrough above processed pairs one at a time. Most NN training generalises that into mini-batch SGD — group examples into batches of B and process a whole batch in one forward + backward pass, exactly the same mini-batch SGD as MNIST, just averaged across B examples per step. That’s the recipe BERT, GPT, and pretty much every modern model use, and it’s what the bullets below describe.
Word2vec is an exception. The original 2013 release used batch size 1 — pure SGD, one pair per step, with multi-threaded Hogwild! parallelism on CPU (each thread streams its own pairs and writes to the shared E / E' without locks; occasional update collisions are silently absorbed). That’s still how Gensim runs in 2026, and it’s the right choice for word2vec specifically — the Gensim subsection later covers why. The bullets below describe the more general mini-batch flow because that’s the recipe transferable to BERT and beyond; just keep in mind that for word2vec specifically, B=1 and Hogwild! win out.
An epoch is one complete pass over all the training pairs, regardless of batch size. word2vec typically trains for 5–15 epochs; the Gensim default is 5. Each pair gets seen multiple times because a single SGD step on one pair doesn’t fully shape the relevant rows — repeated passes converge E and E' toward the geometry the loss prefers.
Per-batch, the loop looks like this:
- Forward pass on a batch of B examples: each (c, t) pair flows through the network. Vectorized, the lookup stacks B rows from E into a (B, d) matrix, the score step becomes a single (B, d) @ (d, V) → (B, V) matmul, and softmax runs row-wise.
- Loss: compute cross-entropy loss per example (same formula as B=1), then average across the batch into a single scalar.
- Backward pass (backprop). Compute the gradients via the chain rule: pure calculus from the loss back to every weight, no weight updates yet. Output: gradient tensors ∂L/∂E and ∂L/∂E', the same shape as the weight tensors.
- Gradient descent step: apply the gradients to the weights via the optimiser, exactly as in the single-pair case above. Repeat for the next batch.
Run that loop across many batches per epoch and a handful of epochs over the corpus, and the rows of E and E' settle into the geometry the loss prefers.
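For readers who want to see the whole loop in one place, here is a minimal NumPy sketch of full-softmax skip-gram with mini-batches on the six-word toy sentence. The sizes, learning rate, epoch count, window of 2, and the word-to-ID mapping are all assumptions made for illustration, and this is plain mini-batch SGD, not the Hogwild! scheme word2vec itself uses.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, B, lr = 5, 3, 4, 0.2                    # toy sizes and learning rate (assumed)
E = rng.normal(scale=0.5, size=(V, d))        # input embeddings
E_out = rng.normal(scale=0.5, size=(d, V))    # E' in the text

# Assumed IDs: cat=0, mat=1, on=2, sat=3, the=4
tokens = [4, 0, 3, 2, 4, 1]                   # "the cat sat on the mat"
pairs = np.array([(tokens[i], tokens[j])
                  for i in range(len(tokens))
                  for j in range(max(0, i - 2), min(len(tokens), i + 3))
                  if j != i])

def softmax_rows(scores):
    shifted = scores - scores.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

def mean_loss(E, E_out, centers, targets):
    P = softmax_rows(E[centers] @ E_out)
    return float(-np.log(P[np.arange(len(centers)), targets]).mean())

def batch_step(E, E_out, centers, targets, lr):
    n = len(centers)
    H = E[centers]                            # forward: (n, d) lookup
    P = softmax_rows(H @ E_out)               # forward: (n, V) probabilities
    delta = P.copy()
    delta[np.arange(n), targets] -= 1.0       # P - one_hot, row by row
    delta /= n                                # gradient of the batch-mean loss
    grad_E_out = H.T @ delta                  # backward: (d, V)
    grad_H = delta @ E_out.T                  # backward: (n, d)
    E_out -= lr * grad_E_out                  # SGD step on E'
    np.subtract.at(E, centers, lr * grad_H)   # SGD step on E, summing repeated centers

initial_loss = mean_loss(E, E_out, pairs[:, 0], pairs[:, 1])
for epoch in range(100):
    rng.shuffle(pairs)                        # reshuffle rows each epoch
    for start in range(0, len(pairs), B):
        batch = pairs[start:start + B]
        batch_step(E, E_out, batch[:, 0], batch[:, 1], lr)
final_loss = mean_loss(E, E_out, pairs[:, 0], pairs[:, 1])
print(f"loss before: {initial_loss:.3f}  after: {final_loss:.3f}")
```

`np.subtract.at` is used for the E update because a batch can contain the same center word more than once, and ordinary fancy-index assignment would silently drop the repeated contributions.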
CBOW — and the bridge to BERT
CBOW (continuous bag of words) is word2vec’s second algorithm, run as a mirror image of skip-gram. Where skip-gram takes a center word and predicts a target word as its context, CBOW takes several words as context and predicts the center — the missing word.
That fill-in-the-blank framing is exactly the Cloze test from 1950s reading-comprehension research — hide some words, ask the model to fill them in from what’s left — which only works if you have a working model of the surrounding language. It’s also exactly the objective behind BERT’s masked language modeling, the pretraining task that powered every contextual encoder since 2018. You can read CBOW as a tiny, linear-shaped BERT and BERT as CBOW grown up with attention: same training task, shallow averaging replaced by a deep transformer stack, symmetric window replaced by the whole sentence, single mask replaced by 15% of tokens at once, static output replaced by contextual per-token vectors. The BERT article walks through all of that in detail.
Mechanically, CBOW differs from skip-gram by exactly one averaging step on the input; the rest of the training loop is identical. So most of the skip-gram sections above carry over unchanged — we focus here only on where CBOW differs.
Pre-processing — context bags instead of pairs
The pre-processing pipeline is the same as skip-gram: tokenize, build vocab, slide a window over the token stream. What changes is the shape of what gets emitted per window position:
Position centered on   Skip-gram emits (per position)        CBOW emits (per position)
─────────────────────  ────────────────────────────────────  ──────────────────────────
the (pos 0)            (the, cat), (the, sat)                ({cat, sat}, the)
cat (pos 1)            (cat, the), (cat, sat), (cat, on)     ({the, sat, on}, cat)
sat (pos 2)            (sat, the), (sat, cat),               ({the, cat, on, the}, sat)
                       (sat, on), (sat, the)
on (pos 3)             (on, cat), (on, sat),                 ({cat, sat, the, mat}, on)
                       (on, the), (on, mat)
the (pos 4)            (the, sat), (the, on), (the, mat)     ({sat, on, mat}, the)
mat (pos 5)            (mat, on), (mat, the)                 ({on, the}, mat)
─────────────────────  ────────────────────────────────────  ──────────────────────────
total                  18 pairs                              6 examples
Same corpus, same window: skip-gram produces 18 separate (center, neighbour) training pairs, CBOW produces 6 (context_bag, center) examples. Step through it in the widget:
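The counts in the table can be reproduced with a short pure-Python sketch. It assumes a symmetric window of 2 on each side, which is what the listed pairs imply; the function names are hypothetical.

```python
tokens = "the cat sat on the mat".split()
window = 2   # words on each side; inferred from the pairs listed in the table

def context_indices(i, n, window):
    """Positions within `window` of i, clipped to the sentence, excluding i itself."""
    return [j for j in range(max(0, i - window), min(n, i + window + 1)) if j != i]

def skipgram_pairs(tokens, window):
    """One (center, neighbour) pair per context position."""
    return [(tokens[i], tokens[j])
            for i in range(len(tokens))
            for j in context_indices(i, len(tokens), window)]

def cbow_examples(tokens, window):
    """One (context_bag, center) example per window position."""
    return [([tokens[j] for j in context_indices(i, len(tokens), window)], tokens[i])
            for i in range(len(tokens))]

sg = skipgram_pairs(tokens, window)
cb = cbow_examples(tokens, window)
print(len(sg), "skip-gram pairs;", len(cb), "CBOW examples")
print(cb[2])
```

The context bags are kept as lists rather than sets so that a word appearing twice in the window (as "the" does around "sat") is counted twice.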
The architecture
CBOW’s architecture is the skip-gram one with its ends flipped — the inputs are the context words, the output is the center word. Same two weight matrices E and E', same dimensionality d, same softmax over V. The only architectural difference is on the input side: skip-gram looks up one row of E (the center), CBOW looks up C rows (one per context word) and averages them into a single d-dim hidden vector h:
Skip-gram: h = E[c] (one row read)
CBOW: h = mean(E[c_1], E[c_2], ..., E[c_C]) (C rows read, then averaged)
Visualized below: the C one-hot inputs, the C corresponding row lookups in E, and the averaging step that produces h:
[Figure: C one-hot inputs (shape: C × V) → row lookups in E (shape: V × d) → averaged hidden vector h (shape: d)]
The averaging is what gives “bag of words” its name — word order inside the window is discarded, the context becomes a multiset. At each training step, the lookups happen on the current, in-flight values of E — the rows you read in the forward pass are the same rows you update in the backward pass, just like any neural-net SGD training. Once h is computed, everything downstream — scores = h @ E', softmax, cross-entropy loss, backprop, SGD on E and E', and the negative-sampling shortcut — is identical to skip-gram.
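A minimal NumPy sketch of the CBOW forward pass, assuming random stand-in weights, toy sizes, and a hypothetical ID mapping (cat=0, mat=1, on=2, sat=3, the=4). It also checks the bag-of-words property directly: reordering the context leaves h unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3
E = rng.normal(size=(V, d))             # random stand-in for the input table
E_out = rng.normal(size=(d, V))         # E' in the text

context_ids = [4, 0, 2, 4]              # {the, cat, on, the}, the bag around "sat"
h = E[context_ids].mean(axis=0)         # C row reads, then one average

scores = h @ E_out                      # from here on, identical to skip-gram
probs = np.exp(scores - scores.max())
probs /= probs.sum()

h_reordered = E[[2, 4, 4, 0]].mean(axis=0)   # same multiset, different order
print(np.allclose(h, h_reordered))
```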
Training — three small differences from skip-gram
The forward pass, loss, gradient, mini-batching, and negative-sampling shortcut all carry over from skip-gram unchanged. Three differences worth keeping in mind:
- One forward pass per window position. Skip-gram emits C separate pairs per window and runs C forward passes. CBOW emits one example per window — the whole context bag predicts the center. So CBOW is roughly C× faster to train, which is its main practical edge.
- Gradient updates touch more rows of E per example. Skip-gram nudges one row of E per pair (the center’s row). CBOW nudges C rows per example (each context word’s row, with the gradient scaled by 1/C from the averaging). That’s why CBOW does better on frequent words and worse on rare ones: common stop-words get touched on most windows; rare words rarely show up as context.
- E' updates are the same shape, just keyed differently. In skip-gram, each pair (c, t) pulls the target word’s column of E' toward the hidden vector. In CBOW, each example does the same for the center word’s column. Same update pattern, different word picking it out.
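The 1/C scaling in the second bullet can be made concrete. The sketch below assumes the full-softmax loss from the skip-gram sections, random toy weights, and the same hypothetical ID mapping (cat=0, mat=1, on=2, sat=3, the=4).

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3
E = rng.normal(size=(V, d))
E_out = rng.normal(size=(d, V))               # E' in the text

context_ids = np.array([4, 0, 2, 4])          # bag around "sat"; "the" (ID 4) appears twice
center = 3
C = len(context_ids)

h = E[context_ids].mean(axis=0)
scores = h @ E_out
p = np.exp(scores - scores.max())
p /= p.sum()
delta = p.copy()
delta[center] -= 1.0                          # P - one_hot(center)

grad_E_out = np.outer(h, delta)               # every column of E' gets a gradient
grad_h = E_out @ delta                        # gradient into the averaged vector
grad_E = np.zeros_like(E)
np.add.at(grad_E, context_ids, grad_h / C)    # each occurrence receives a 1/C share

print(np.flatnonzero(grad_E.any(axis=1)))     # rows of E that were touched
```

Rows for words outside the bag stay at zero, and a word that occurs twice in the bag collects two shares, which is why the accumulation uses `np.add.at` rather than plain indexed assignment.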
How word2vec is actually trained in practice — Gensim
It might be surprising in 2026 — an era where transformer-based encoders dominate NLP and every headline model is some flavour of attention — but word2vec is still trained, deployed, and shipped in production every day. The reason is partly architectural fit (word2vec is a CPU-shaped workload, which we’ll get to), partly that for plenty of tasks the cheap-and-static vectors are simply good enough (and orders of magnitude faster to query than a BERT forward pass), and partly that some pipelines — recommendation systems, search ranking, vector-database bootstrapping, lightweight semantic features — specifically want the embedding-lookup behaviour rather than a deep contextual model. So the question of how to actually train one isn’t historical — it’s a working engineering question for plenty of teams.
The de facto standard library for training word2vec is Gensim — short for “Generate Similar” — open-source Python, in development since 2009, and despite Keras, PyTorch, and JAX all existing, still what most production word2vec pipelines run on in 2026. From the library’s own description:
Gensim is a free open-source Python library for representing documents as semantic vectors, as efficiently (computer-wise) and painlessly (human-wise) as possible. Gensim is designed to process raw, unstructured digital texts (“plain text”) using unsupervised machine learning algorithms. The algorithms in Gensim — Word2Vec, FastText, Latent Semantic Indexing (LSI/LSA), Latent Dirichlet Allocation (LDA), etc — automatically discover the semantic structure of documents by examining statistical co-occurrence patterns within a corpus of training documents. These algorithms are unsupervised, which means no human input is necessary — you only need a corpus of plain text documents.
Two things to notice from that description.
It covers a family of algorithms, not just word2vec. Word2Vec, FastText, LSA, LDA — every unsupervised statistical-co-occurrence method this article has discussed (and a few more) lives in the same library. The article walked through word2vec specifically; Gensim is the library where you’d actually run any of them.
Efficiently (computer-wise) and painlessly (human-wise) is exactly the trade-off Gensim has optimised for fifteen years: extremely fast on the hardware these algorithms actually need (CPU, as we’ll see in a moment), with a one-liner API that hides every fiddly detail of vocabulary building, subsampling, sampling tables, and the training loop. Both of those properties are why people still reach for it.
A one-liner gets you trained vectors:
from gensim.models import Word2Vec
model = Word2Vec(sentences, vector_size=300, window=5, min_count=5, sg=1, workers=8)
v_king = model.wv["king"]
Vocabulary building, frequent-word subsampling, the negative-sampling table, the training loop, save/load, similarity queries, analogy arithmetic: Gensim handles all of it. To match this in raw Keras or PyTorch you’d write ten times the code, get worse throughput, and end up with vectors that may not exactly reproduce the canonical word2vec results.
The reason Gensim persists is partly inertia, partly ergonomics — but the deepest reason is that word2vec is a CPU-shaped workload, and Gensim is the best CPU implementation there is.
Why word2vec is CPU-intensive, not GPU-intensive
The standard intuition for “deep learning = GPU” comes from models whose bottleneck is dense matrix multiplication on big tensors: convolutional networks, transformers, big MLPs. GPUs are designed for that — thousands of cores running the same operation on different elements in parallel, fed by high-bandwidth memory laid out for vectorised access. Give a GPU a (4096, 4096) matmul and it crunches through it in microseconds.
Word2vec, structurally, never does a big matmul. Look at what one training pair actually computes:
1. one lookup in E:                     read row E[center]                  ← 1 row out of V
2. lookup target + k negatives in E':   read k+1 rows of E'                 ← k+1 rows out of V
3. dot products against v_c:            k+1 dot products of d-dim vectors   ← ~(k+1) × d multiply-adds
4. gradient updates:                    update 1 row of E, k+1 rows of E'   ← k+2 rows touched
With k = 5–20 and d = 300, each training step does on the order of 10⁴ multiply-adds, about a thousand times less compute per example than a single forward pass through a small image classifier. The “compute” is barely there.
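One plausible rendering of those four steps in NumPy is sketched below. It assumes the usual sigmoid-based negative-sampling loss, stores E' one row per word (`E_out_rows`, a layout chosen for this sketch), and fixes the negatives by hand, whereas a real implementation samples them from a frequency-based table. This is not Gensim's Cython code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(E, E_out_rows, center, target, negatives, lr):
    """One skip-gram negative-sampling update; modifies both tables in place."""
    v_c = E[center].copy()                         # 1. one row of E
    ids = np.concatenate(([target], negatives))    # 2. target + k negatives
    U = E_out_rows[ids]                            #    k+1 rows of E' (a copy)
    scores = U @ v_c                               # 3. k+1 dot products
    labels = np.zeros(len(ids))
    labels[0] = 1.0                                # the target is the only positive
    loss = -np.log(sigmoid(scores[0])) - np.log(sigmoid(-scores[1:])).sum()
    g = sigmoid(scores) - labels                   # d loss / d score
    E[center] -= lr * (g @ U)                      # 4. update 1 row of E
    np.subtract.at(E_out_rows, ids, lr * np.outer(g, v_c))   # and k+1 rows of E'
    return float(loss)

rng = np.random.default_rng(0)
V, d = 50, 8                                       # toy sizes
E = rng.normal(scale=0.1, size=(V, d))
E_out_rows = rng.normal(scale=0.1, size=(V, d))
E_before, out_before = E.copy(), E_out_rows.copy()

center, target = 3, 7                              # hypothetical word IDs
negatives = np.array([10, 11, 12, 13, 14])         # fixed here; normally sampled
loss_first = sgns_step(E, E_out_rows, center, target, negatives, lr=0.1)
rows_E = np.flatnonzero((E != E_before).any(axis=1))
rows_out = np.flatnonzero((E_out_rows != out_before).any(axis=1))
loss_second = sgns_step(E, E_out_rows, center, target, negatives, lr=0.1)
print(rows_E, rows_out, loss_first, loss_second)
```

After one step only k+2 rows differ from their starting values; everything else in the two tables is untouched, which is the sparse access pattern the paragraph describes.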
This is the same shape of problem the MNIST-trains-faster-on-CPU article explores in detail: when per-step compute is small enough, kernel-launch overhead and PCIe transfer cost dominate any FLOPs advantage the GPU might have, and the CPU wins on wall-clock time.
So the bottleneck is memory access, not arithmetic, and that’s exactly where the CPU is the right fit. E and E' together take 2 × V × d × 4 bytes (about 2.4 GB for V = 10⁶, d = 300), and each step does a handful of widely scattered row reads into them, a pattern hardware prefetching can’t help with. The CPU’s L1/L2 caches absorb a good share of those random reads, since the rows of frequent words stay hot; system RAM holds the 2.4 GB model with room to spare; and OS threads can clobber-update shared memory (the Hogwild! trick) much more cheaply than the GPU equivalent (atomics across thousands of cores).
Gensim leans into all of this.
Its inner loop is Cython, single-pair forward-and-backward, with Hogwild! lock-free parallelism across CPU threads (the workers argument).
On a beefy laptop, that’s 8 threads, a few million pairs per second, billions of training pairs per hour, all without leaving the CPU. That’s enough to train word2vec on a full Wikipedia dump in a few hours on hardware that doesn’t require expensive GPUs.
Where static embeddings break
word2vec produces static embeddings: one fixed vector per word, regardless of context. This is exactly the right shape for the distributional hypothesis as originally stated, but it has three failure modes that became increasingly visible as NLP moved to harder tasks.
Polysemy
Consider these two sentences:
- I deposited the cheque at the bank.
- We had a picnic on the river bank.
A static embedding gives bank one vector. That vector is some kind of average over both senses, which means it’s a good representation of neither.
[Widget: bank with two sense clusters (glove-wiki-gigaword-300)]
The widget above shows the cosine similarity between bank and two clusters of context words: money, loan, account, deposit, interest on one side; river, shore, water, creek, flood on the other. Both clusters pull above zero (the single vector covers both senses), but the financial cluster wins. That’s not a fact about the word; it’s the corpus skew.
There’s no way for downstream code to recover which sense was meant in any specific sentence, because it has access to one vector and a surrounding sequence of other single vectors, all of them sense-blind in the same way.
No syntax sensitivity
A bag-of-vectors representation throws away word order. The sentences:
- Dog bites man.
- Man bites dog.
contain identical word sets and therefore identical bag-of-vectors representations, despite having opposite meanings. Anything built on top of static embeddings has to recover order from somewhere else, typically a recurrent or convolutional layer that processes the sequence directly. That works (it’s how most pre-2018 neural NLP models were built), but it means the embeddings themselves are doing only part of the job.
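The order-blindness is easy to demonstrate. The sketch below averages per-word vectors for both sentences; the vectors are random stand-ins rather than trained embeddings, and `bag_of_vectors` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-ins, not trained embeddings; only the lookup structure matters here
vectors = {word: rng.normal(size=4) for word in ["dog", "bites", "man"]}

def bag_of_vectors(sentence):
    """Average the static vectors of the words, ignoring their order."""
    words = sentence.lower().rstrip(".").split()
    return np.mean([vectors[w] for w in words], axis=0)

a = bag_of_vectors("Dog bites man.")
b = bag_of_vectors("Man bites dog.")
print(np.allclose(a, b))
```

Any pooling that is symmetric in its inputs (sum, mean, max) behaves the same way, so the order information has to come from a layer that sees positions.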
Frozen at training time
Static embeddings are fixed once trained. New senses, new compounds, new domain vocabulary — the vectors don’t update. Worse, words that didn’t appear in the training corpus simply don’t have vectors at all.
These three failures look different on the surface but share one cause: a static embedding is a function of the word, not the sentence. Anything sentence-dependent — sense, role, discourse position — has to be handled outside the embedding. That’s a big enough job that it constrained the architectures of the entire pre-2018 era.
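The same failures hit every other static word embedding too: GloVe and fastText produce the same kind of context-independent lookup by different training procedures, so they inherit the same blind spots. (fastText’s subword n-grams let it compose a vector for an unseen word, which softens the out-of-vocabulary part of the third failure but leaves polysemy and word order untouched.) Solving them requires giving up on “one vector per word” entirely, which is what BERT and the contextual encoders do.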
What comes next
The fix is easy to state and was hard to build: make the vector depend on the surrounding sentence, not just the word. bank shouldn’t have one vector — it should have the one it earns in “river bank,” and a different one in “bank account.” That’s exactly the gap BERT-style architectures closed.
BERT inherits more from word2vec than you’d expect from the architectural gap. The distributional hypothesis carries over (predict missing tokens from context); so does the trick of a fake prediction task that exists only to force good vectors out of the network; so does the bet that geometry encodes meaning. The training task itself — fill in the blank — is the same one CBOW ran on a tiny scale back in 2013.
What changed is the architecture and the scale. ELMo (2018) ran the first serious version with a bidirectional LSTM and per-position hidden states. BERT (also 2018) swapped the LSTM for a Transformer encoder and introduced masked language modeling — structurally CBOW with most of CBOW’s limits removed: a deep stack of bidirectional Transformer layers over the whole sentence instead of a shallow average over a five-word window, multiple masks per example instead of one, and contextual per-token output vectors instead of a single frozen lookup row. The BERT article walks through that machinery in detail.