Negative sampling and contrastive learning — from word2vec to CLIP

The textbook version of word2vec relies on a softmax output layer to turn the model’s raw scores into a probability for every word in the vocabulary — and that’s where the scaling problem comes from. In training, the loss only ever uses one probability per pair, P(target | center). But softmax defines it as a share of a vocabulary-wide total:

P(target) = exp(score_target) / Σ_w exp(score_w)

The denominator sums over all V words, so to get the one number you actually want, you must compute all V scores — not because you want them, but because the normalizer needs them. Drop any single score and the normalization breaks.

With V = 10⁶ vocab words and embedding dim d = 300, scoring the center against every word is a V × d matmul — 300 million multiply-adds — plus a million exponentiations for the denominator, per training pair.
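To make that cost concrete, here is a minimal NumPy sketch of the full-softmax probability for a single pair. The sizes are shrunk (V = 10,000 rather than 10⁶) so it runs quickly, and the names `E_out`, `v_c`, and `target` are illustrative assumptions, not anything from the original word2vec code.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10_000, 300                            # assumed toy sizes; the article's V is 10**6
E_out = rng.normal(scale=0.1, size=(V, d))    # hypothetical output matrix E'
v_c = rng.normal(scale=0.1, size=d)           # hypothetical center vector
target = 123                                  # arbitrary target index

scores = E_out @ v_c                          # V dot products: V * d multiply-adds
scores -= scores.max()                        # shift for numerical stability; ratios unchanged
exp_scores = np.exp(scores)                   # V exponentiations
p_target = exp_scores[target] / exp_scores.sum()   # one number, but all V scores were needed

multiply_adds = V * d
print(f"P(target | center) = {p_target:.3e}, multiply-adds = {multiply_adds:,}")
```

Only `p_target` feeds the loss, yet every row of `E_out` had to be read to build the denominator.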

To feel the scale, compare per-example compute to MNIST:

                          INPUT      HIDDEN     OUTPUT     OUTPUT-LAYER MATMUL
MNIST                     784        128        10         128 × 10  = 1,280 ops
word2vec  (V = 10⁶)       V → 1      300        V          300 × 10⁶ = 300,000,000 ops

MNIST is a 10-class classification problem; word2vec is a million-class one — one output neuron per vocab word. Every cost in the output layer — weights, matmul, softmax — scales linearly with that class count.

word2vec never actually ran that V-way softmax. The first paper used hierarchical softmax, and the follow-up (Mikolov et al., 2013) introduced negative sampling, which drops the normalizer and focuses on individual scores instead. Each (center, w) pair stops being one V-class classification (scoring how likely w is relative to every other word in the vocabulary) and becomes its own binary judgment: how likely is this specific pair to be real? For each selected pair, the model produces a single sigmoid score σ(v_c · v'_w), pushed up for real pairs and down for random ones. Because each judgment stands alone, the loss is a plain sum of per-pair terms, and that separability is the whole reason you can score just k+1 rows of E' and ignore the rest: the other words never enter this step’s loss at all.

Mechanically, that shrinks the score step dramatically. Where softmax does one matmul against the whole output matrix to get every score at once, negative sampling does several individual dot products instead — one for each of the k randomly picked negatives plus one for the real pair, k+1 in total. The other V − (k+1) rows of E' aren’t read, aren’t multiplied against, aren’t updated this step. Each of the k+1 dot products goes through its own sigmoid, independently — no denominator coupling them. The per-pair loss decomposes into a sum of k+1 independent terms — same maximum-likelihood machinery as softmax, just over k+1 independent Bernoullis instead of one V-wide multinomial.

This mechanism is called contrastive learning — pull paired things together, push random things apart — and it can be applied to anything embeddable. The two flavors you might have encountered: dense retrievers like DPR, the engine of RAG, trained on each query paired with its relevant passage as a positive and random passages as negatives; and RLHF reward models, where preferred responses play the positive role and rejected ones the negative.

From pairs to negatives

word2vec’s training data is (center, context) pairs from a sliding window over the corpus: the cat sat on the mat produces (sat, cat), (sat, on), (sat, the), and so on — billions of genuine co-occurrences.
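The sliding window can be sketched in a few lines of plain Python. The helper name `skipgram_pairs` and the window size of 2 are assumptions made for illustration; real implementations typically also subsample frequent words and vary the window.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                      # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens, window=2)
sat_pairs = [p for p in pairs if p[0] == "sat"]
print(sat_pairs)
# [('sat', 'the'), ('sat', 'cat'), ('sat', 'on'), ('sat', 'the')]
```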

Negative sampling keeps the pairs identical, but instead of just maximizing the likelihood of the real (center, context) match, it adds k random pairs per step as counter-balance — adjusting the weights so that their likelihood is also minimized.

Both kinds of pairs share the same center but pick the accompanying word (partner) differently:

  • Real pairs — walk the corpus and take the words that actually appear in each center’s context window. The partner is shaped by what genuinely co-occurs in the corpus (the joint distribution) — (sat, cat) shows up because it really happened.
  • Random pairs — keep the same center, but draw the partner from how often each word appears in the corpus (the unigram distribution), ignoring co-occurrence entirely. You get pairs like (sat, banana) — both real words, but they never appeared near each other.

Negative sampling keeps the full-softmax skeleton — three forward stages, a loss, a gradient — and only changes what happens inside each. The one genuinely new operation is sampling: draw k negative words for each pair from the unigram distribution (the random-pair draw above). Everything else is full softmax restricted in scope: the lookup still reads a single row from E, and the score step still dots the center against rows of E' — just k+1 of them (the target plus the k negatives) instead of all V.

k+1 dot products instead of V — the same v_c · v'_w operation full softmax runs, just evaluated for the words in the loss rather than the whole vocabulary. The full v_c @ E' matmul required with softmax never happens.
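As a rough sketch of that restriction, the snippet below gathers only the k+1 rows of E' that enter the loss and checks that they match the corresponding entries of the full matmul. The sizes, indices, and the uniform negative draw are placeholders chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 10_000, 300, 5                      # assumed sizes for illustration
E = rng.normal(scale=0.1, size=(V, d))        # input embeddings
E_out = rng.normal(scale=0.1, size=(V, d))    # output embeddings E'
center, target = 42, 7                        # arbitrary word indices
negatives = rng.choice(V, size=k)             # uniform draw here, as a placeholder

rows = np.concatenate(([target], negatives))  # the k+1 rows that enter the loss
v_c = E[center]                               # lookup: one row of E
scores = E_out[rows] @ v_c                    # k+1 dot products, shape (k+1,)

full_scores = E_out @ v_c                     # what full softmax would have to compute
assert np.allclose(scores, full_scores[rows]) # same numbers, a tiny fraction of the work
print(scores.shape, full_scores.shape)        # (6,) (10000,)
```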

A raw dot product v_c · v'_w can be any real number — positive, negative, large, small — but the loss needs a probability between 0 and 1: “how likely is this pair real?” The sigmoid function σ is what gets us there.

Once we have the scores, we can turn each into a probability with sigmoid. σ squashes any real-valued score into (0, 1):

dot product v_c · v'_w     σ       model says
large positive            ~1      “real pair”
~0                        0.5     unsure
large negative            ~0      “random pair”

For each real pair we want σ to climb toward 1; for each random pair, toward 0 — training pushes the dot products in those directions.

On the same five-word toy from the word2vec article, the widget below runs that score step. Pick a (center, target) training pair, toggle which words are sampled as the negatives, then step through the k+1 dot products — score[w] = v_c · v'_w, expanded term by term, then through its sigmoid. Only the target and the negatives get scored; the other rows of E' stay greyed, never read.

[Interactive widget: scoring one pair · k+1 dot products, not V]

Training pair: (sat, on). Sampled negatives: cat, mat (k = 2). All 3 dot products done, so the loss is defined.

E (V=5 × d=3); row sat = v_c

cat     0.21   -0.43    0.15
mat     0.07    0.62   -0.31
on     -0.55    0.18    0.40
sat     0.33   -0.27    0.84
the    -0.12    0.49   -0.06

E' (V=5 × d=3); target + k negatives are read, the rest unread

cat     0.45    0.62   -0.20
mat    -0.31    0.15    0.48
on      0.18   -0.40    0.27
sat    -0.22    0.33    0.11
the     0.07    0.55   -0.39

Stage 2 (score: k+1 dot products, not V) and Stage 3 (sigmoid: independent σ, no softmax). Each score goes through its own σ with no shared denominator; target → 1, negatives → 0. The rows for sat and the are skipped.

word    role        score     σ(score)   direction
on      positive    +0.394    0.597      → pushed toward 1
cat     negative    -0.187    0.453      → pushed toward 0
mat     negative    +0.260    0.565      → pushed toward 0

loss = −Σ log σ over the k+1 terms = 1.95

Where full softmax pushes the V scores through one normalized distribution, negative sampling sends each of the k+1 scores through its own σ independently — σ(v_c · v'_w) stands on its own as the model’s estimate that (center, w) is a real pair, with no shared denominator. k+1 independent probabilities rather than one V-wide distribution.
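The toy numbers from the widget can be reproduced directly. This is a small sketch that hard-codes the center row E[sat] and the three rows of E' that get read; the dictionary layout is just a convenience, not how a real embedding table is stored.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v_c = np.array([0.33, -0.27, 0.84])           # E[sat], the center row
E_out = {                                     # the three rows of E' that get read
    "on":  np.array([0.18, -0.40, 0.27]),     # target (positive)
    "cat": np.array([0.45, 0.62, -0.20]),     # negative
    "mat": np.array([-0.31, 0.15, 0.48]),     # negative
}

scores = {w: float(v_c @ row) for w, row in E_out.items()}   # one dot product per word
probs = {w: float(sigmoid(s)) for w, s in scores.items()}    # each through its own sigmoid
for w in E_out:
    print(f"{w:>3}  score={scores[w]:+.3f}  sigma={probs[w]:.3f}")
```

Note that the three σ values do not sum to 1, and nothing requires them to: each is a separate yes/no estimate.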

How are negatives sampled?

word2vec originally proposed sampling negatives from the unigram-frequency distribution — pick random words from the vocab, weighted by how often they appear. However, modern contrastive systems — CLIP, DPR, Sentence-BERT — skip that entirely and treat the other examples in the same training batch as the negatives. That approach is called in-batch negatives. Let’s quickly explore both.

In the unigram-frequency sampling approach, we favor frequent words: contrasting against words the model actually encounters matters more than contrasting against rare ones. If zebra appears only a handful of times in the corpus, the model rarely needs to learn that other words shouldn’t have similar associations with zebra — that’s wasted signal.

However, pure frequency weighting would mean ~98% of negatives are words that co-occur with everything and carry no distinguishing information — so-called stop words like the and and. We can achieve a middle ground by smoothing the frequencies: still biased toward frequent words (so most negatives are realistic) but dampened enough that content-bearing words like cat, mat, king get sampled often enough to teach real semantic distinctions.

Each word is sampled with probability proportional to how often the word appears in the corpus, counted alone (its unigram frequency). The original paper found empirically that raising those counts to the 3/4 power before normalizing works best — values from 0.5 to 1.0 all produce decent results, with 0.75 hitting the sweet spot:

P(w) = count(w)^0.75 / Σ_w' count(w')^0.75

P(w) here is the probability that word w gets drawn as a negative sample. The shape — counts normalized by their sum — is the same recipe softmax uses; raising the counts to 0.75 before normalizing is a temperature-scaling move that flattens the distribution. The numerator is each word’s count(w)^0.75; the denominator sums those across the vocabulary so probabilities sum to 1.

With the distribution in hand, sampling is straightforward: compute P(w) for every word in the vocab using the formula above, then for each positive (center, context) pair draw k random words from this distribution — a weighted dice-roll over the vocabulary, repeated k times, where words with higher P(w) get picked more often.
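Here is a minimal sketch of that procedure on a made-up thirteen-token corpus. The corpus and variable names are assumptions for illustration; production implementations usually precompute a lookup table rather than calling a weighted draw per pair.

```python
import numpy as np
from collections import Counter

corpus = "the cat sat on the mat and the dog sat on the rug".split()  # toy corpus (assumption)
counts = Counter(corpus)
vocab = sorted(counts)
freq = np.array([counts[w] for w in vocab], dtype=float)

raw = freq / freq.sum()                 # pure unigram frequency
smoothed = freq ** 0.75                 # dampen frequent words
neg_dist = smoothed / smoothed.sum()    # P(w) from the formula above

rng = np.random.default_rng(0)
k = 5
negatives = rng.choice(len(vocab), size=k, p=neg_dist)   # k weighted draws, with replacement
print([vocab[i] for i in negatives])

the, cat = vocab.index("the"), vocab.index("cat")
print(f"P(the): raw={raw[the]:.3f}  smoothed={neg_dist[the]:.3f}")
print(f"P(cat): raw={raw[cat]:.3f}  smoothed={neg_dist[cat]:.3f}")
```

The frequent word loses probability mass and the rare word gains some, which is the flattening effect described above.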

In contrast, modern systems don’t sample from a vocab distribution at all. They reuse the batch they’re already processing — an approach called in-batch negatives.

Training happens in batches of N real (query, positive) pairs: (Q1, P1), …, (QN, PN). Each Qi truly matches its own Pi (a query and its relevant passage, an image and its caption, etc.). For the negatives of Qi, just look at the other positives in the batch, Pj for j ≠ i, which are effectively random with respect to Qi:

batch:    (Q1, P1)   (Q2, P2)   (Q3, P3)   (Q4, P4)

for Q1:   positive = P1,   negatives = {P2, P3, P4}
for Q2:   positive = P2,   negatives = {P1, P3, P4}
for Q3:   positive = P3,   negatives = {P1, P2, P4}
for Q4:   positive = P4,   negatives = {P1, P2, P3}

The cleverness: you’d already be running a forward pass on P1 through PN to score each Qi · Pi. The embeddings for all Ps are already in memory. Scoring Qi against Pj for j ≠ i is just one extra dot product — no additional forward pass through the encoder. A batch of 256 yields 255 negatives per query “for free,” equivalent to k = 255 instead of word2vec’s typical k = 5–20.
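In array terms, the whole batch of positives and negatives falls out of one matrix product. The sketch below uses random vectors as stand-ins for encoder outputs; `Q`, `P`, and the sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 4, 8                          # batch of 4 pairs, 8-dim embeddings (assumed)
Q = rng.normal(size=(N, d))          # stand-ins for encoder outputs of Q1..QN
P = rng.normal(size=(N, d))          # stand-ins for encoder outputs of P1..PN

S = Q @ P.T                          # S[i, j] = Qi · Pj, all N*N scores in one matmul
positives = np.diag(S)               # S[i, i]: each query with its own passage
neg_mask = ~np.eye(N, dtype=bool)    # True off the diagonal: Pj for j != i
negatives = S[neg_mask].reshape(N, N - 1)   # N - 1 in-batch negatives per query

print(positives.shape, negatives.shape)     # (4,) (4, 3)
```

The encoders ran once per item; the extra cost of the negatives is only the off-diagonal entries of `S`.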

These negatives are usually higher quality than unigram draws too: real text or images that the model needs to genuinely distinguish, not random stop words. Sharper gradients, faster convergence. The tradeoff: they only span the current batch, not the full corpus — which is why systems like MoCo layer a memory queue on top, and others use hard-negative mining for even tighter contrasts.

The loss

The per-pair loss is a sum of k+1 log-sigmoid terms, one for the positive and one for each negative, in place of full softmax’s −log P(target | center):

loss = − log σ(v_c · v'_t)  −  Σ log σ(−v_c · v'_n)
       ─────────────────       ───────────────────────
       true (positive) pair     k sampled negatives

Where v_c is the center word’s input embedding, v'_t is the true target’s output embedding, v'_n is a sampled negative word’s output embedding, and σ is the sigmoid function. The first term pushes the true pair’s dot product up (toward σ(·) = 1); the second term pushes each negative’s dot product down (toward σ(·) = 0).

The negative term uses σ(−v_c · v'_n) — the negative of the dot product — which works because of the identity σ(−x) = 1 − σ(x). So −log σ(−v_c · v'_n) is just −log(1 − σ(v_c · v'_n)): the standard “wrong class” half of cross-entropy, applied to the “this isn’t a real pair” direction. Each term in the loss is binary cross-entropy (BCE) applied to one (center, w) pair — label 1 for the positive, label 0 for each negative. The total loss is k+1 BCEs added together.

What does −log σ(z) look like as σ changes? When σ → 1, the term goes to 0 — correct prediction, no loss. When σ → 0, the term explodes toward infinity — wrong prediction, big penalty. The mirror holds for negatives: −log(1 − σ) → 0 when σ → 0 (correct), → ∞ when σ → 1 (model wrongly thinks a random pair is real). That asymmetric curve is what makes the loss self-throttling — nearly-right predictions barely move the loss; very-wrong ones dominate it. Gradient descent then automatically focuses attention on the worst predictions.

Each scored word contributes one term — −log σ for the positive, −log(1 − σ) for each negative:

word   role       σ      term               value
on     positive   0.60   −log(0.60)         0.51
cat    negative   0.45   −log(1 − 0.45)     0.60
mat    negative   0.56   −log(1 − 0.56)     0.83

Total: L ≈ 1.95. mat contributes most — its σ (0.56) is the furthest from where a negative should be (0).
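The same total can be computed from the raw scores with the σ(−x) form of the loss. This sketch uses `np.logaddexp` for a numerically stable log-sigmoid; the helper name `log_sigmoid` is my own.

```python
import numpy as np

def log_sigmoid(x):
    # log σ(x) = -log(1 + exp(-x)), computed stably
    return -np.logaddexp(0.0, -x)

score_pos = 0.394                       # v_c · v'_on
score_neg = np.array([-0.187, 0.260])   # v_c · v'_cat, v_c · v'_mat

pos_term = -log_sigmoid(score_pos)      # -log σ(score)
neg_terms = -log_sigmoid(-score_neg)    # -log σ(-score) = -log(1 - σ(score))
loss = pos_term + neg_terms.sum()
print(f"positive: {pos_term:.2f}, negatives: {np.round(neg_terms, 2)}, total: {loss:.2f}")
```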

And the sum isn’t just convenient — it’s the joint log-loss of k+1 independent binary decisions. Because each (center, w) pair is judged independently, the probability that all k+1 decisions are correct is the product of their individual probabilities:

P(all right)  =  P(positive right) × P(neg₁ right) × … × P(neg_k right)

−log turns a product into a sum:

−log P(all right)  =  −log P(positive)  +  −log P(neg₁)  +  …  +  −log P(neg_k)

That’s exactly the row of terms above — adding the per-pair −log’s isn’t an extra modeling choice; it’s what the joint log-likelihood of independent decisions looks like.

Stepping back: both softmax and negative sampling are doing maximum likelihood — they differ only in what family of probabilities they model.

                     softmax                                   negative sampling
what’s modeled       P(target | center) over the whole vocab   P(real pair | center, w) for each w
distribution type    one multinomial over V classes            k+1 independent Bernoullis
outputs              V numbers summing to 1                    k+1 numbers in (0,1), no sum constraint
loss                 one V-way cross-entropy                   sum of k+1 binary cross-entropies

Every σ in the row above is a legitimate probability — the model’s estimate that this particular (c, w) pair is real — and the total loss is a genuine joint log-likelihood. Negative sampling is fully probabilistic — just locally (one Bernoulli per pair) instead of globally (one normalized distribution over the vocab).

What you give up is the calibrated P(target | center) that sums to 1 across V — the global, normalized view. What you keep is the maximum-likelihood machinery: each σ is a real probability, and the gradient is a real ML gradient. Mikolov et al. were explicit about this trade in the deeper-point paragraph earlier — they only wanted good vectors, not a calibrated probability model. The loss is happy when positive σ → 1, every negative σ → 0, and total → 0 — and every gradient step pushes in that direction.

The gradient

Backprop touches only what the forward pass read. The k+1 scored rows of E' update — the target’s row pulled toward v_c, each negative’s row pushed away — while the other V − (k+1) rows, never read, collect no gradient at all. On the input side, the single row E[center] updates, exactly as in full softmax.

To minimize L we need its gradient with respect to every parameter that touched the forward pass: v_c (the center’s row in E), v'_t (the target’s row in E'), and each v'_n (one row per negative). With one calculus fact —

d/dz [ −log σ(z) ] = σ(z) − 1

— the chain rule gives all three:

∂L / ∂v'_t = (σ_t − 1) · v_c                       ← target's output row
∂L / ∂v'_n = σ_n · v_c                             ← each negative's output row
∂L / ∂v_c  = (σ_t − 1) · v'_t  +  Σ_n σ_n · v'_n   ← center's input row

where σ_t = σ(v_c · v'_t) and σ_n = σ(v_c · v'_n), exactly the numbers from the loss section above.

Notice the symmetry: every output-row gradient (∂L/∂v'_t, ∂L/∂v'_n) is a scalar times v_c, and the center’s gradient is a weighted sum of the output rows it scored against. That’s a direct consequence of the dot product being symmetric in its arguments — differentiating any f(v_c · v'_w) w.r.t. v'_w always yields something proportional to v_c, and vice versa.

Three observations fall straight out:

  • The positive’s gradient points along v_c, scaled by σ_t − 1 (a negative number — σ_t is below 1). When we subtract the gradient, v'_t is pulled in the v_c direction — toward v_c.
  • Each negative’s gradient also points along v_c, scaled by σ_n (positive). Subtracting it pushes each v'_n in the −v_c direction — away from v_c.
  • The center’s gradient combines all of them — one pull from the target, one push per negative — each weighted by how wrong its sigmoid currently is.

Plugging in the toy values from above (σ_on = 0.60, σ_cat = 0.45, σ_mat = 0.56, with the embedding tables from the score widget):

∂L/∂v_c = (0.60 − 1) · v'_on  +  0.45 · v'_cat  +  0.56 · v'_mat
        ≈ [ −0.04,  0.53,  0.07 ]

The negative coefficient on the positive (σ_on − 1 = −0.40) and the positive coefficients on the negatives combine into this gradient; gradient descent then steps the center in the opposite direction.
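A quick way to trust a hand-derived gradient is to compare it against finite differences. The sketch below does that for ∂L/∂v_c on the toy values; the step size `eps` and the function names are my own choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss(v_c, v_t, v_negs):
    # -log σ(v_c · v'_t) - Σ_n log σ(-v_c · v'_n)
    return -np.log(sigmoid(v_c @ v_t)) - np.log(sigmoid(-(v_negs @ v_c))).sum()

v_c = np.array([0.33, -0.27, 0.84])                 # E[sat]
v_t = np.array([0.18, -0.40, 0.27])                 # E'[on]
v_negs = np.array([[0.45, 0.62, -0.20],             # E'[cat]
                   [-0.31, 0.15, 0.48]])            # E'[mat]

s_t = sigmoid(v_c @ v_t)
s_n = sigmoid(v_negs @ v_c)
grad_analytic = (s_t - 1) * v_t + s_n @ v_negs      # the ∂L/∂v_c formula above

eps = 1e-6                                          # central differences, one axis at a time
grad_numeric = np.array([
    (loss(v_c + eps * e, v_t, v_negs) - loss(v_c - eps * e, v_t, v_negs)) / (2 * eps)
    for e in np.eye(3)
])
print(np.round(grad_analytic, 2))
print(np.allclose(grad_analytic, grad_numeric, atol=1e-6))
```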

One more elegant property to notice: the same scalar (1 − σ_t) appears in both the target’s update (v'_t moves toward v_c by (1 − σ_t)·v_c) and the center’s pull (v_c moves toward v'_t by (1 − σ_t)·v'_t). Both vectors move toward each other with exactly the same weight, which is the co-adaptation symmetry of the dot-product loss. The same holds for each negative, with σ_n instead. The update section formalizes this next.

The update

Gradient descent with learning rate η:

v'_t ← v'_t + η · (1 − σ_t) · v_c                    ← step toward v_c
v'_n ← v'_n − η · σ_n · v_c                          ← step away from v_c
v_c  ← v_c  + η · (1 − σ_t) · v'_t  −  η · Σ_n σ_n · v'_n
                                                      ← toward v'_t, away from each v'_n

Notice the step sizes:

  • v'_t moves by η(1 − σ_t) — largest when σ_t is small (the positive is failing).
  • v'_n moves by η · σ_n — largest when σ_n is large (a negative looks falsely real).

So in our state, mat (σ = 0.56) gets pushed harder than cat (σ = 0.45); and the more wrong any pair is, the bigger the correction it gets. That self-throttling (small corrections when the model is already right, big ones when it isn’t) keeps the updates focused on mistakes and drives the loss down over training, on average rather than at every single step.

Plugging in the gradient from above with η = 0.1 nudges the center:

v_c     ≈ [ 0.33, −0.27,  0.84 ]
v_c_new = v_c − η · ∂L/∂v_c  ≈ [ 0.33, −0.32,  0.83 ]

A small step, but in the direction the loss demands. v'_on, v'_cat, and v'_mat get their own updates at the same time using the formulas above; we focus on v_c here to keep the trace short.

Verifying the step

Did the update actually help? Recompute the three dot products with the new v_c (using the same v'_w for clarity — in practice they update simultaneously, sharpening the effect):

dot product     before   after   direction
v_c · v'_on     0.39     0.41    up: positive more aligned ✓
v_c · v'_cat    −0.19    −0.22   down: negative pushed apart ✓
v_c · v'_mat    0.26     0.25    down: negative pushed apart ✓

Every σ moves the right way, and the per-pair loss drops from 1.95 to ~1.92. One pair, one tiny step — multiplied by billions of pairs and many epochs, this is what carves out the embedding geometry where matched things land close and random things land far.
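The before-and-after check is easy to replay. This sketch moves only v_c, as in the trace, and holds the three output rows fixed; `pair_loss` and `eta` are names chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pair_loss(v_c, v_t, v_negs):
    # one positive term plus one term per negative
    return -np.log(sigmoid(v_c @ v_t)) - np.log(1.0 - sigmoid(v_negs @ v_c)).sum()

v_c = np.array([0.33, -0.27, 0.84])                 # E[sat]
v_t = np.array([0.18, -0.40, 0.27])                 # E'[on]
v_negs = np.array([[0.45, 0.62, -0.20],             # E'[cat]
                   [-0.31, 0.15, 0.48]])            # E'[mat]
eta = 0.1

grad = (sigmoid(v_c @ v_t) - 1) * v_t + sigmoid(v_negs @ v_c) @ v_negs
v_c_new = v_c - eta * grad                          # only the center moves in this trace

before = pair_loss(v_c, v_t, v_negs)
after = pair_loss(v_c_new, v_t, v_negs)
print(np.round(v_c_new, 2))
print(f"loss before: {before:.2f}, after: {after:.2f}")
```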

Compare to the NumPy loop below: its three update lines are exactly these three updates. Ep[t] -= lr * (s_t - 1) * v_c is v'_t’s step, Ep[negs] -= lr * s_n[:, None] * v_c is the negatives’, and E[c] -= lr * ((s_t - 1) * v_t + s_n @ v_n) is v_c’s.

Why dropping the softmax doesn’t hurt quality

The shift from “V-class classification” to “binary classification with random negatives” looks like it should lose information — we’re not directly maximizing P(true target | center) anymore. So why does it produce equally good embeddings?

The gradient signal has the same structure. The full-softmax gradient pulls v_c toward v'_t and pushes it away from every other word’s output embedding, weighted by that word’s predicted probability. Negative sampling does the same thing on a sampled subset: pull toward v'_t, push away from k randomly selected words’ embeddings, each weighted by its σ. Strictly speaking this optimizes a different objective rather than giving an unbiased estimate of the softmax gradient, but across many examples the pushes and pulls point the same way. The model converges to similar geometry; it just gets there with a small constant amount of work per step instead of a vocabulary-scaled amount.

[Interactive widget: one SGD step on a (center, context) pair · simplified to 2D. Vectors: v_sat (center), v_cat (true), v_banana, v_zebra, v_rocket.]

word      σ(v_sat · v_w)   target
cat       0.576            → 1
banana    0.582            → 0
zebra     0.511            → 0
rocket    0.516            → 0

loss = 2.866 (steps taken: 0)

One pair, one step. The center’s input vector v_sat (green) is pulled toward the true target v_cat (blue) — raising their dot product, so σ(v_sat · v_cat) climbs toward 1 — and away from each negative v_banana, v_zebra, v_rocket (red), so each σ(v_sat · v_neg) drops toward 0. Press step repeatedly and watch the geometry sort itself out. Real word2vec runs the same update in a few hundred dimensions, with 5–20 negatives, against each of billions of pairs.

Run the same loop on a whole vocabulary — many pairs, many updates — and the embedding table sorts itself into clusters by meaning:

[Interactive widget: step-by-step skip-gram training · 21-word vocab, 2-D embeddings, k=5 negatives · 532 pairs/epoch]

Vocabulary: the, king, sat, in, palace, queen, ruled, worked, played, garden, a, man, chair, woman, boy, ran, girl, on, mat, cat, dog.
Legend (clusters): royalty, adult, youth, animal, verb, place, function.

Real skip-gram with negative sampling, run live in the browser on a tiny synthetic corpus. Each step picks one (center, context) pair, samples 5 random negatives, and applies one SGD update: v_center is pulled toward v_context and pushed away from each negative. At step 0 the vectors are random — same-cluster words sit no closer than random pairs. Hit play or +200 and watch them sort themselves: the royalty words drift together, so do the adults, the youths, the animals, the verbs. Nobody told the model these groupings exist; they fall out because words that share contexts in the corpus end up with rows that have to share predictive geometry. The vectors are forced into 2D so we can draw them — the actual algorithm is identical at 300.

In NumPy, the inner loop takes only a handful of lines:

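import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Assumes E, Ep (V × d float arrays), pairs, neg_dist, V, k, lr, and rng already exist.
for c, t in pairs:
    negs = rng.choice(V, size=k, p=neg_dist)   # k negatives from the smoothed unigram
    v_c, v_t, v_n = E[c].copy(), Ep[t].copy(), Ep[negs]   # copies, so all updates use pre-step values
    s_t = sigmoid(v_c @ v_t)        # want → 1
    s_n = sigmoid(v_c @ v_n.T)       # want → 0 (each)
    Ep[t]    -= lr * (s_t - 1) * v_c           # pull v_t toward v_c
    Ep[negs] -= lr * s_n[:, None]  * v_c       # push v_n's away from v_c (a repeated negative gets one update)
    E[c]     -= lr * ((s_t - 1) * v_t + s_n @ v_n)  # both, into v_c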

That’s the whole training step. Each iteration is one frame of the widgets above; running across billions of pairs is what produces a real word2vec embedding table.

The reframing also explains why the model learns anything beyond raw word frequency: real pairs carry co-occurrence signal, random pairs carry only marginal frequency. The log-ratio between those two distributions is pointwise mutual information — so a classifier that separates them is implicitly learning the PMI between center and context.

Mikolov et al. (2013) reported that negative sampling outperformed hierarchical softmax on their analogical-reasoning benchmark. Levy & Goldberg (2014) later showed that skip-gram with negative sampling is implicitly factorizing a shifted PMI (pointwise mutual information) matrix, the same kind of matrix older count-then-factorize methods like SVD were trying to factor explicitly. The trick wasn’t a hack; it was a different path to the same mathematical destination.
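To see what that matrix looks like, here is a small sketch that builds PMI and its shifted version from a made-up co-occurrence table. The counts are invented for illustration, and the shift by log k follows the Levy & Goldberg result quoted above.

```python
import numpy as np

# Hypothetical co-occurrence counts: rows are centers, columns are contexts.
words = ["cat", "sat", "the"]
C = np.array([[0, 4, 6],
              [4, 0, 6],
              [6, 6, 2]], dtype=float)

total = C.sum()
p_joint = C / total                                  # the "real pair" distribution
p_center = C.sum(axis=1, keepdims=True) / total      # marginal over centers
p_context = C.sum(axis=0, keepdims=True) / total     # marginal over contexts

with np.errstate(divide="ignore"):                   # log(0) -> -inf for pairs never seen
    pmi = np.log(p_joint / (p_center * p_context))   # log-ratio of joint to product of marginals

k = 5                                                # number of negatives per positive
shifted_pmi = pmi - np.log(k)                        # PMI shifted down by log k
print(np.round(pmi, 2))
```

Pairs that co-occur more often than their marginals predict get positive PMI; pairs that never co-occur go to negative infinity, which is one reason practical count-based methods clip or smooth the matrix.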

When to use what

Negative sampling and full softmax are tools for different jobs. Softmax retains two genuine advantages that don’t matter for word2vec’s purpose but matter elsewhere:

  • Calibrated probabilities: softmax gives a true conditional P(target | center) that sums to 1 across the vocabulary. Negative sampling’s k+1 independent Bernoullis don’t form a coherent distribution; you can read each as “is this pair real?” but not as “where does this word rank against all V alternatives?”
  • Denser per-step gradient: softmax’s gradient pushes the target up against every alternative simultaneously. Negative sampling only contrasts against k random samples per step, so per-step updates are noisier. Averaged over many examples the pushes cover the vocabulary, but each individual step is a weaker signal.

So if your vocabulary is small enough that the V-way softmax is affordable, softmax wins on accuracy and you don’t pay much for it. That’s why most modern classifiers stick with it:

  • BERT’s masked language modeling uses full softmax over its ~30,000 WordPiece vocab.
  • GPT-style next-token prediction uses full softmax over its ~50,000–128,000 Byte-Pair Encoding (BPE) vocab.
  • Standard image classification (10 to ~21,000 classes) uses full softmax.

Negative sampling becomes the right choice when the classes are something like every vocab word, every product in the catalog, every passage in the corpus, every user on the platform — sets so large that the full softmax becomes intractable.

Strip the word-level vocabulary out, and negative sampling generalizes into one of the most important training paradigms in modern ML: contrastive learning. Given pairs of things that go together and pairs of things that don’t, learn embeddings such that matched pairs land close (high dot product / cosine similarity) and unmatched pairs land far. The “things” can be anything embeddable: words, images, sentences, passages, audio clips, graph nodes, user-item pairs. The loss has the same shape — bring positives together, push negatives apart — but the data and the encoders change.

The method underpins a large fraction of modern training procedures. DPR (Dense Passage Retrieval), the engine of RAG, trains on (query, relevant_passage) positives plus sampled (query, irrelevant_passage) negatives, the same shape as word2vec’s (c, t) vs (c, random). RLHF reward models train on (preferred_response, rejected_response) pairs, the structural analog of (positive, negative) with the same contrastive flavor. CLIP pulls matched (image, caption) embeddings together and pushes the unmatched combinations within the batch apart: image-text alignment as one giant in-batch contrastive loss.

CLIP — word2vec’s idea, scaled up massively

Contrastive learning applies to anything embeddable, not just words — and CLIP demonstrates this most clearly. Word2vec trained 300-dim word vectors from sentence co-occurrence; CLIP trained joint image-and-text embeddings from 400 million (image, caption) pairs scraped from the internet — using the same pull-paired-together, push-random-apart loss, just with two encoders instead of one embedding table, and in-batch negatives instead of unigram-frequency draws.

What makes CLIP worth a closer look in the context of this article:

  • It validates the central claim — the method generalizes. If negative sampling were specific to language modeling, CLIP couldn’t work. The fact that it does work, and the resulting embedding space is rich enough to support zero-shot image classification (more on that below), is direct evidence that the pull/push geometry is doing real semantic work — not just memorizing word frequencies.
  • It’s the architectural template most modern multimodal systems follow. DALL-E uses CLIP. Stable Diffusion uses CLIP’s text encoder. Virtually every modern image-text-aligned system has a CLIP-style contrastive backbone somewhere. So understanding negative sampling + in-batch negatives is understanding how all of these were trained.

Two encoders — a vision transformer for images, a transformer for text — produce embeddings into a shared 512-dim space. Training data: 400 million (image, caption) pairs scraped from the internet.

For each training batch of N pairs (I_1, T_1), ..., (I_N, T_N):

  • Encode each image into an embedding; encode each caption into an embedding.
  • Compute the N × N similarity matrix S[i][j] = I_embed[i] · T_embed[j]: every image’s (L2-normalized) embedding dotted with every caption’s, scaled by a learned temperature.
  • The diagonal entries S[i][i] are the matched pairs (positives); everything off-diagonal is mismatched (in-batch negatives, contributed for free by the rest of the batch).
  • Loss: softmax cross-entropy across each row (image picks its caption from N candidates) and softmax cross-entropy across each column (caption picks its image from N candidates), averaged.

For CLIP’s batch size of 32K, each image has 32K − 1 negatives for free — every other caption in the batch. Compare to word2vec’s k = 5–20. The in-batch shortcut from earlier in this article is the entire game at this scale.
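The batch recipe above can be sketched in NumPy. This is an illustration under stated assumptions, not CLIP’s actual code: the embeddings are random stand-ins for encoder outputs, the temperature is fixed at 0.07 rather than learned, and the two directions are combined by averaging.

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)          # stabilize before exponentiating
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    # L2-normalize so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    S = img @ txt.T / temperature                    # N x N similarity matrix
    diag = np.arange(len(S))
    loss_img = -log_softmax(S, axis=1)[diag, diag].mean()   # each image picks its caption
    loss_txt = -log_softmax(S, axis=0)[diag, diag].mean()   # each caption picks its image
    return (loss_img + loss_txt) / 2

rng = np.random.default_rng(0)
N, d = 8, 16                                         # toy batch and width (assumed)
img_emb = rng.normal(size=(N, d))                    # stand-ins for image-encoder outputs
txt_emb = img_emb + 0.1 * rng.normal(size=(N, d))    # captions that roughly match their images

matched = clip_style_loss(img_emb, txt_emb)
shuffled = clip_style_loss(img_emb, txt_emb[::-1])   # break the pairing by reversing captions
print(f"matched batch loss:  {matched:.3f}")
print(f"shuffled batch loss: {shuffled:.3f}")
```

When the diagonal holds the true pairs the loss is small; reversing the captions moves the true matches off the diagonal and the loss jumps.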

What you get: an embedding space where semantically related images and texts land close, unrelated ones land far. That’s why CLIP can do zero-shot image classification — compute text embeddings for class names (“a photo of a dog”, “a photo of a cat”, etc.), then classify an image by which class embedding it’s closest to. The geometry the contrastive loss carved out already encodes meaning across the two modalities; no labeled classifier needed.
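Zero-shot classification then reduces to a nearest-neighbor lookup in the shared space. The sketch below uses made-up 3-dim vectors in place of real encoder outputs, and `zero_shot_classify` is a hypothetical helper, not part of any CLIP library.

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs, class_names):
    # cosine similarity between the image and each class prompt's embedding
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    sims = txt @ img
    return class_names[int(np.argmax(sims))], sims

# Made-up 3-dim embeddings standing in for real text-encoder outputs.
class_names = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
class_text_embs = np.array([[0.9, 0.1, 0.0],
                            [0.1, 0.9, 0.0],
                            [0.0, 0.1, 0.9]])
image_emb = np.array([0.2, 0.8, 0.1])        # hypothetical embedding of a cat photo

label, sims = zero_shot_classify(image_emb, class_text_embs, class_names)
print(label)   # a photo of a cat
```

No classifier head is trained here; adding a new class only means encoding one more prompt.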

Same method word2vec introduced. Different data, different encoders, internet-scale batches — but the bones are identical: pull paired things together, push random things apart.

The clean way to read the modern landscape: softmax-based training (LLMs, BERT, image classifiers) is for predicting the right token from a fixed small set; negative-sampling-style contrastive training is for learning embeddings of two things such that matched pairs land close and random pairs land far. Modern ML uses both, for different jobs. word2vec taught the field how to do the second one efficiently, and that lesson has aged extremely well.