
Cosine similarity — how vectors measure meaning

king − man + woman ≈ queen. A BERT-family encoder finds the paragraph that answers your question without sharing any words with it. A movie recommender suggests something you’d never have searched for. Vector RAG retrieves the right page out of a million chunks. All of these are computing the same number underneath — a single score that ranks one vector against another by how alike they are.

That number is cosine similarity. Once you see how it works, the resemblance between the applications stops being mysterious. This article is about that one operation: where it comes from geometrically, why “dot product after normalising” turns out to be the right answer, and what happens when you try to compute it fast over a billion vectors.

We start two-dimensional and concrete and end with a brief tour of every article on this site that ends up cashing in on cosine.

Two vectors, one question

Two vectors. Are they alike? “Alike” is doing a lot of work in that sentence; there are two reasonable answers.

Compare three vectors:

a = ( 3, 4)
b = ( 6, 8)
c = (-4, 3)

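a and b point in the same direction: b is just a scaled by 2. c is perpendicular to a. Asked “which is most like a?”, most people would say b: same direction, just longer. But if you interpret “alike” as “close together in space,” the picture is murkier. b sits 5 units from a, a full length of a away, and c is about 7.1 units away, so the vector that points the same way is only modestly closer than the one at right angles. Stretch b further, to (30, 40), and it keeps its direction while landing 45 units from a, far beyond c.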

There are two notions of similarity hiding here. Direction (does the vector point the same way?) and distance (does it land in the same neighbourhood?). The rest of this article is about why machine learning models almost always pick the first one and what it costs them.
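To make the two notions concrete, here is a minimal sketch that measures both on the vectors above. The helper names angle_deg and distance, and the extra vector a_stretched = 10·a, are illustrative choices of ours rather than part of the original example; the angle is computed with the formula the article derives in the following sections.

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([6.0, 8.0])
c = np.array([-4.0, 3.0])
a_stretched = 10.0 * a  # hypothetical extra vector: same direction as a, ten times longer


def angle_deg(u, v):
    """Angle between u and v in degrees (the direction notion)."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))


def distance(u, v):
    """Euclidean distance between the tips of u and v (the distance notion)."""
    return float(np.linalg.norm(u - v))


for name, v in [("b", b), ("c", c), ("a_stretched", a_stretched)]:
    print(f"{name:12s} angle={angle_deg(a, v):6.1f} deg  distance={distance(a, v):6.2f}")
```

By angle, b and a_stretched are both perfect matches for a and c is at 90°. By distance, the ranking depends entirely on length: b is closer than c, while a_stretched is much further away than c.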

Vector, embedding, tensor — same numbers, different jobs

A quick aside before we go further. Three words get used almost interchangeably for what is, on disk, the same array of floats — vector, embedding, tensor — and it’s worth pinning down which job each one is doing, because they keep showing up side by side for the rest of the article.

Vector is the mathematical role. School geometry teaches “an arrow with magnitude and direction”; that’s where the intuition comes from but it isn’t the definition. A vector is anything you can add to another and scale by a number — which means a 2D arrow qualifies, but so does a list of 768 floats, and so does a probability distribution over a thousand categories. They all live in vector spaces and admit the same operations, including the dot product and the angle between two of them. That’s what lets the next section’s math work identically on (3, 4) and on a BERT embedding: a 768-D vector still has a direction in the same sense the 2D arrow does — just in 768-D space instead of on a sheet of paper.

Embedding is a vector whose coordinates have been learned to encode meaning — the word2vec output, a sentence-transformer output, an image encoder’s penultimate layer. The numbers themselves are unreadable; what matters is that nearby vectors correspond to similar things — cat near kitten, far from tractor. An embedding is a vector with semantic structure laid on top.

Tensor is the storage shape. In PyTorch or TensorFlow, “tensor” just means an n-dimensional array of numbers — a scalar is a rank-0 tensor, a vector is rank-1, a matrix is rank-2, a batch of RGB images is rank-4. The word emphasizes the container, not what the numbers mean.

So embedding = torch.tensor([0.2, 0.5, -1.3]) is simultaneously all three: stored as a tensor, mathematically a vector, semantically an embedding. The reason ML engineers slide between the words is that deep-learning libraries handed them one object that wears all three hats, and which hat you call out depends on what you’re doing with it — shape-shuffling, geometry, or meaning. The rest of this article treats them as one thing — an arrow in some number of dimensions — and asks how to compare two of them.

Dot product, revisited

The dot product of two vectors is one number, computed by multiplying matching components and adding the results:

$$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^d a_i b_i = a_1 b_1 + a_2 b_2 + \cdots + a_d b_d$$

If you’ve read the backprop article, this is the same operation a single neuron applies to its inputs — and a CNN kernel applies to each image patch in the CNN article. The dot product is everywhere, and for a reason: it has a second definition that makes it geometric instead of algebraic.

For any two vectors $\mathbf{a}, \mathbf{b}$ in any number of dimensions:

$$\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \, \|\mathbf{b}\| \cos\theta$$

where $\|\mathbf{a}\| = \sqrt{\sum_i a_i^2}$ is the L2 norm (Euclidean length) of $\mathbf{a}$ and $\theta$ is the angle between the two vectors. The two definitions return the same number; the algebraic form gives you a recipe for computing it, the geometric form tells you what it means.

Check the equivalence on the three vectors above:

  • $\|\mathbf{a}\| = \sqrt{9 + 16} = 5$, $\|\mathbf{b}\| = \sqrt{36 + 64} = 10$, $\|\mathbf{c}\| = \sqrt{16 + 9} = 5$.
  • $\mathbf{a} \cdot \mathbf{b} = 3 \cdot 6 + 4 \cdot 8 = 50$. Geometric: $5 \cdot 10 \cdot \cos 0^\circ = 50$. ✓
  • $\mathbf{a} \cdot \mathbf{c} = 3 \cdot (-4) + 4 \cdot 3 = 0$. Geometric: $5 \cdot 5 \cdot \cos 90^\circ = 0$. ✓

The geometric form is what makes the dot product useful as a similarity score. The cosine of the angle is exactly the “do these point the same way?” question — and it falls out of the dot product, if we can find a way to remove the two lengths.
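The same check can be run in code. This is a small sketch, not a general-purpose routine: the function names are ours, and dot_geometric only works in 2D because it measures the angle independently with atan2, so that the comparison between the two definitions is not circular.

```python
import math
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([6.0, 8.0])
c = np.array([-4.0, 3.0])


def dot_algebraic(u, v):
    """Multiply matching components and add the results."""
    return float(np.sum(u * v))


def dot_geometric(u, v):
    """|u| |v| cos(theta), with theta taken from each vector's polar angle (2D only)."""
    theta = math.atan2(v[1], v[0]) - math.atan2(u[1], u[0])
    return float(np.linalg.norm(u) * np.linalg.norm(v) * math.cos(theta))


for name, v in [("b", b), ("c", c)]:
    print(name, dot_algebraic(a, v), round(dot_geometric(a, v), 9))
```

Up to floating-point rounding, both functions agree on 50 for a·b and 0 for a·c.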

The length problem

Look again at the dot products above. $\mathbf{a} \cdot \mathbf{b} = 50$ (same direction, big number), $\mathbf{a} \cdot \mathbf{c} = 0$ (perpendicular, zero). So far so good. But now consider a fourth vector $\mathbf{d} = (1.5, 2)$, which has exactly the same direction as $\mathbf{a}$ and half the length:

$$\mathbf{a} \cdot \mathbf{d} = 3 \cdot 1.5 + 4 \cdot 2 = 12.5$$

That score is much lower than $\mathbf{a} \cdot \mathbf{b} = 50$, even though $\mathbf{d}$ and $\mathbf{a}$ point in the exact same direction. The dot product is rewarding $\mathbf{b}$ over $\mathbf{d}$ purely because $\mathbf{b}$ is longer.

This is the same effect that makes a CNN feature map spike on bright patches even when the kernel’s pattern matches better in a dark region: a sliding kernel is a sliding dot product, and unnormalised dot products pick up brightness as well as pattern. In retrieval, the same thing happens with raw, unnormalised vectors. Whenever a representation’s norm grows with passage length, as it does for word-count or sum-pooled vectors, a long passage racks up a larger dot product against a matching query than a short passage on the same topic, just by having more vector mass.

If you only care about direction — and for “meaning,” that’s what you want — you have to get rid of the lengths.

Normalise it out

The fix is mechanical. For any non-zero vector $\mathbf{v}$, define its unit vector:

$$\hat{\mathbf{v}} = \frac{\mathbf{v}}{\|\mathbf{v}\|}$$

You’ve divided away the length. $\hat{\mathbf{v}}$ points in the same direction as $\mathbf{v}$, but has length exactly 1. Every unit vector in $d$ dimensions lives on the unit sphere: the set of all points at distance 1 from the origin. (In 2D, the unit “sphere” is the unit circle.)

For $\mathbf{a} = (3, 4)$ with length 5: $\hat{\mathbf{a}} = (0.6, 0.8)$. Check: $0.6^2 + 0.8^2 = 1$. The same trick on $\mathbf{b}$ (length 10) and $\mathbf{d}$ (length 2.5) gives the same unit vector, $(0.6, 0.8)$. They were the same direction all along, and now that’s obvious.

Cosine similarity is exactly the dot product of two unit vectors:

$$\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|} = \hat{\mathbf{a}} \cdot \hat{\mathbf{b}}$$

You can derive this directly: divide both sides of $\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\|\|\mathbf{b}\|\cos\theta$ by $\|\mathbf{a}\|\|\mathbf{b}\|$ and you get $\cos\theta$ on one side and the dot product over the lengths on the other.

The range is $[-1, 1]$:

  • $1$ = same direction (angle 0°)
  • $0$ = perpendicular (angle 90°), “unrelated”
  • $-1$ = opposite direction (angle 180°)

For embeddings produced by trained models, you’ll almost never see negative numbers — the learned space is mostly one-sided, so unrelated vectors land near 0, related ones in the 0.4–0.8 range, and same-thing ones near 1. We come back to why in the high-dimensional section.

Re-running the four-vector example with cosine instead of plain dot product:

$$\cos(\mathbf{a}, \mathbf{b}) = 0.6 \cdot 0.6 + 0.8 \cdot 0.8 = 1$$
$$\cos(\mathbf{a}, \mathbf{c}) = 0.6 \cdot (-0.8) + 0.8 \cdot 0.6 = 0$$
$$\cos(\mathbf{a}, \mathbf{d}) = 1$$

The misleading magnitude is gone. $\mathbf{b}$ and $\mathbf{d}$ are now correctly tied at perfect-match. That’s the whole point.
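The four-vector comparison can be reproduced with a few lines of NumPy. This is a sketch: the name cosine_similarity and the decision to raise an error on a zero vector (for which no direction exists) are our own choices.

```python
import numpy as np


def cosine_similarity(u, v):
    """Dot product divided by both lengths; undefined for zero vectors."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    if nu == 0.0 or nv == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(u, v) / (nu * nv))


a = np.array([3.0, 4.0])
others = {
    "b": np.array([6.0, 8.0]),
    "c": np.array([-4.0, 3.0]),
    "d": np.array([1.5, 2.0]),
}

for name, v in others.items():
    print(f"{name}: dot={float(np.dot(a, v)):5.1f}  cosine={cosine_similarity(a, v):.3f}")
```

The raw dot products differ (50, 0 and 12.5), while the cosines for b and d are both 1 up to rounding.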

Normalise once, dot product forever

Here is the pragmatic move that turns cosine similarity from a formula with two square roots into one of the fastest operations in computing.

Take every vector in your index and normalise it once, at index time. The lengths are now permanently 1. And once the lengths are 1, the cosine formula collapses:

$$\cos(\mathbf{q}, \mathbf{c}) = \frac{\mathbf{q} \cdot \mathbf{c}}{\|\mathbf{q}\| \, \|\mathbf{c}\|} = \mathbf{q} \cdot \mathbf{c}$$

A plain dot product. No norms to compute, no fraction to evaluate, no square roots, no division.

This is why embedding libraries make unit-normalised output easy to get, and some return it by default. SentenceTransformers exposes a normalize_embeddings flag on encode (off unless you set it to True); OpenAI’s embedding endpoints return unit vectors; the BGE and E5 model families are trained with cosine-based contrastive objectives, so normalising their outputs is the expected usage. You pay the cost of normalisation once, when each vector enters the index, and from then on similarity is a dot product.
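A quick way to convince yourself that nothing is lost is to compute the scores both ways. The sketch below uses random vectors as stand-ins for embeddings, and normalise_rows is a hypothetical helper name; the eps guard against zero-length rows is our addition.

```python
import numpy as np


def normalise_rows(X, eps=1e-12):
    """Scale every row of X to unit L2 length (eps guards against zero rows)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)


rng = np.random.default_rng(0)
C = rng.normal(size=(1000, 64))  # stand-in chunk vectors, not real embeddings
q = rng.normal(size=64)          # stand-in query vector

# Full formula: dot product divided by both norms, recomputed at query time.
full = (C @ q) / (np.linalg.norm(C, axis=1) * np.linalg.norm(q))

# Pre-normalised route: pay for the norms once, then use a plain dot product.
C_unit = normalise_rows(C)
q_unit = q / np.linalg.norm(q)
fast = C_unit @ q_unit

print(np.allclose(full, fast))
```

The two score arrays agree to floating-point precision, so the ranking is unchanged.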

At scale, the consequence is dramatic. Stack $N$ chunk vectors of dimension $d$ into an $N \times d$ matrix $C$. A query is a $d$-vector $\mathbf{q}$. Every chunk’s cosine score against the query is one matrix-vector multiply:

$$\text{scores} = C \mathbf{q}$$

In numpy that’s literally the line scores = index @ q. Top-k is then argsort(-scores)[:k]. That is the entire inner loop of a flat-vector RAG retriever, and it runs in milliseconds on a CPU for tens of thousands of chunks:

import numpy as np

# index built once — every chunk vector normalised to length 1
chunk_vecs = embedder.encode(corpus, normalize_embeddings=True)   # (N, d)
index = np.asarray(chunk_vecs)

def search(query, k=5):
    q = embedder.encode(query, normalize_embeddings=True)          # (d,)
    scores = index @ q                                              # (N,)
    return np.argsort(-scores)[:k]

Ten lines, including imports. Everything more sophisticated — pgvector, Pinecone, HNSW, IVF-PQ, the rest of the vector-database ecosystem — is in service of computing this dot product faster when NN stops being tens of thousands and starts being millions or billions.
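The snippet above relies on an embedder and a corpus defined elsewhere. For experimenting without a model, here is a self-contained variant that uses random unit vectors as stand-ins for chunk embeddings (an assumption made purely for illustration). It also swaps the full argsort for np.argpartition, which finds the top k without sorting all N scores; that substitution is ours, not something the original snippet does.

```python
import numpy as np

rng = np.random.default_rng(42)
N, d = 10_000, 384

# Synthetic stand-ins for chunk embeddings, normalised once at index time.
index = rng.normal(size=(N, d)).astype(np.float32)
index /= np.linalg.norm(index, axis=1, keepdims=True)


def search(q, k=5):
    """Return the indices of the k highest-cosine rows, best first."""
    q = q / np.linalg.norm(q)
    scores = index @ q                        # (N,) cosine scores
    top = np.argpartition(-scores, k)[:k]     # the k best, in arbitrary order
    return top[np.argsort(-scores[top])]      # sort only those k


# Query: a lightly perturbed copy of chunk 123, so we know what should come back first.
query = index[123] + 0.02 * rng.normal(size=d).astype(np.float32)
hits = search(query)
print(hits)
```

Because the query is a near-copy of row 123, that row should be the first hit; the remaining hits are whichever random vectors happen to score highest.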

For real numbers on real text, the RAG article has a heatmap of all-MiniLM-L6-v2 cosines between a tiny refund-policy corpus and four queries. The visible 0.6-vs-0.2 split between “right meaning” and “irrelevant” rows is this same dot product, evaluated on 384-dimensional unit vectors.

Cosine vs L2 distance

Several libraries default to squared Euclidean distance instead of cosine:

$$\|\mathbf{q} - \mathbf{c}\|^2 = \sum_{i=1}^d (q_i - c_i)^2$$

For unit vectors this is exactly cosine in disguise. Expand it:

$$\|\mathbf{q} - \mathbf{c}\|^2 = \|\mathbf{q}\|^2 - 2\,\mathbf{q} \cdot \mathbf{c} + \|\mathbf{c}\|^2$$

If both vectors are unit length, $\|\mathbf{q}\|^2 = \|\mathbf{c}\|^2 = 1$, so this simplifies to:

$$\|\mathbf{q} - \mathbf{c}\|^2 = 2 - 2\cos(\mathbf{q}, \mathbf{c})$$

The two quantities are linearly related, with a negative slope. Sorting by smallest L2 distance gives the same ordering as sorting by largest cosine — same top-k, same answer. Whether your vector database calls its metric cosine, dot, or euclidean doesn’t change retrieval results, as long as the vectors are normalised.
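The identity and the matching top-k are easy to check numerically. The sketch below uses random unit vectors as stand-ins for normalised embeddings; the sizes and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in unit vectors (random, not real embeddings).
C = rng.normal(size=(500, 32))
C /= np.linalg.norm(C, axis=1, keepdims=True)
q = rng.normal(size=32)
q /= np.linalg.norm(q)

cos = C @ q                           # cosine scores, larger is better
sq_l2 = np.sum((C - q) ** 2, axis=1)  # squared Euclidean distance, smaller is better

identity_holds = np.allclose(sq_l2, 2.0 - 2.0 * cos)
top_by_cos = np.argsort(-cos)[:10]
top_by_l2 = np.argsort(sq_l2)[:10]

print(identity_holds, np.array_equal(top_by_cos, top_by_l2))
```

If you skip the two normalisation lines, the identity no longer holds and the two rankings are free to disagree.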

The two only diverge when norms differ — that is, when you forget to normalise, or when you’re working with something other than embeddings. For raw count vectors, image pixels, or anything where length carries information, the two metrics measure genuinely different things and you have to pick deliberately.

A second wrinkle: cosine distance ($1 - \cos$) is sometimes called a metric, but it doesn’t satisfy the triangle inequality. Algorithms that require a true metric, such as metric trees (ball trees, VP-trees) and certain clustering implementations, won’t work directly on cosine distance and need either L2 distance on unit vectors or an explicit metric-compatible transform, for example the angle $\theta$ itself.
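A three-vector counterexample makes the triangle-inequality failure concrete. The sketch below places unit vectors at 0°, 45° and 90° (our choice of angles) and compares the direct cosine distance with the detour through the middle vector.

```python
import numpy as np


def unit(deg):
    """2D unit vector at the given angle in degrees."""
    t = np.radians(deg)
    return np.array([np.cos(t), np.sin(t)])


def cosine_distance(u, v):
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


a, b, c = unit(0), unit(45), unit(90)

direct = cosine_distance(a, c)                           # 1 - cos 90°, about 1.0
detour = cosine_distance(a, b) + cosine_distance(b, c)   # 2 * (1 - cos 45°), about 0.586
print(direct, detour, direct <= detour)

# Euclidean distance on the same unit vectors does respect the inequality.
l2_direct = float(np.linalg.norm(a - c))
l2_detour = float(np.linalg.norm(a - b) + np.linalg.norm(b - c))
print(l2_direct, l2_detour, l2_direct <= l2_detour)
```

Going via b is “shorter” than going directly under cosine distance, which is exactly what a metric forbids.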

Making it fast — locality-sensitive hashing

A matrix-vector product is fast, but it still touches every row in the index. For $N = 10^9$ vectors of dimension 768, that’s on the order of $10^{12}$ multiply-adds per query, and a serial scan is no longer in the budget.

The classic clever fix is locality-sensitive hashing (LSH). The variant for cosine is so geometric you can almost picture it.

Pick a random vector $\mathbf{r}$ from the embedding space. It defines a hyperplane through the origin: the set of all points $\mathbf{v}$ with $\mathbf{r} \cdot \mathbf{v} = 0$. Every other point is either “above” the hyperplane ($\mathbf{r} \cdot \mathbf{v} > 0$) or “below” ($\mathbf{r} \cdot \mathbf{v} < 0$). Record a single bit per vector: 1 for above, 0 for below.

Now pick $k$ such random hyperplanes. Each vector gets a $k$-bit hash code: its side of plane 1, side of plane 2, …, side of plane $k$.

The crucial property (and this is the part to convince yourself of with a picture) is that the probability two vectors land on the same side of a random hyperplane through the origin is:

$$P[\text{same side}] = 1 - \frac{\theta}{\pi}$$

where $\theta$ is the angle between them, in radians. Nearly-parallel vectors (small $\theta$) almost always agree across many hyperplanes. Orthogonal vectors agree about half the time. Nearly anti-parallel vectors almost never agree. The expected Hamming distance between two $k$-bit codes is $k\theta/\pi$, so bit-strings that are close in Hamming distance are strong evidence of a small angle, and the evidence sharpens as $k$ grows.
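If the picture is not convincing, a simulation can be. The sketch below fixes two unit vectors at a known angle, draws many random hyperplane normals from a standard Gaussian (whose direction is uniform over the sphere), and counts how often the two vectors fall on the same side. The function name, dimension and trial count are arbitrary choices of ours.

```python
import numpy as np


def same_side_rate(theta, d=16, trials=100_000, seed=0):
    """Empirical probability that two vectors at angle theta (radians)
    land on the same side of a random hyperplane through the origin."""
    rng = np.random.default_rng(seed)
    u = np.zeros(d)
    u[0] = 1.0
    v = np.zeros(d)
    v[0], v[1] = np.cos(theta), np.sin(theta)
    R = rng.normal(size=(trials, d))  # one random normal vector per row
    return float(np.mean(np.sign(R @ u) == np.sign(R @ v)))


for deg in (10, 60, 90, 150):
    theta = np.radians(deg)
    print(f"{deg:3d} deg  simulated={same_side_rate(theta):.3f}  formula={1 - theta / np.pi:.3f}")
```

The simulated rate should track the formula closely at every angle, with small sampling noise.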

That gives you the search:

  1. Index time. For each chunk vector, compute its $k$-bit hash. Bucket all chunks by hash code.
  2. Query time. Hash the query. Look in the same bucket (and nearby buckets by Hamming distance) to get candidates. Compute the exact cosine only for those candidates and return the top-k.

You’ve gone from “score every vector in the index” to “score only the small subset whose hash code is near the query’s hash code.” For well-tuned $k$ and number of hash tables, that can cut the number of exact comparisons by orders of magnitude, at the cost of a small recall loss because some true neighbours might not share enough hash bits to surface as candidates.
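The two steps above can be sketched as a small class. This is a toy, single-table version under simplifying assumptions: the class name HyperplaneLSH is hypothetical, “nearby buckets” is taken to mean codes exactly one bit flip away, and production systems typically use several hash tables and more careful probing.

```python
from collections import defaultdict

import numpy as np


class HyperplaneLSH:
    """Toy random-hyperplane LSH index for cosine similarity (single hash table)."""

    def __init__(self, dim, n_bits=12, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_bits, dim))  # one hyperplane normal per bit
        self.weights = 1 << np.arange(n_bits)         # packs the bits into one integer
        self.n_bits = n_bits
        self.buckets = defaultdict(list)
        self.vectors = None

    def _hash(self, v):
        bits = (self.planes @ v) > 0                  # which side of each hyperplane
        return int(np.dot(bits, self.weights))

    def fit(self, X):
        # Index time: normalise once, then bucket every vector by its hash code.
        self.vectors = X / np.linalg.norm(X, axis=1, keepdims=True)
        for i, v in enumerate(self.vectors):
            self.buckets[self._hash(v)].append(i)
        return self

    def candidates(self, q):
        # Query's own bucket plus every bucket at Hamming distance 1.
        h = self._hash(q)
        codes = [h] + [h ^ (1 << b) for b in range(self.n_bits)]
        return np.array([i for c in codes for i in self.buckets.get(c, [])], dtype=int)

    def query(self, q, k=5):
        # Query time: exact cosine on the candidates only.
        q = q / np.linalg.norm(q)
        cand = self.candidates(q)
        if cand.size == 0:
            return cand
        scores = self.vectors[cand] @ q
        return cand[np.argsort(-scores)[:k]]


rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 64))           # stand-in vectors, not real embeddings
lsh = HyperplaneLSH(dim=64, n_bits=12).fit(X)
hits = lsh.query(X[7], k=3)
n_scored = lsh.candidates(X[7] / np.linalg.norm(X[7])).size
print(hits, n_scored, "of", len(X), "vectors scored exactly")
```

Querying with a vector that is already in the index returns that vector first, while only a small fraction of the 5000 rows is scored exactly. Raising n_bits shrinks the buckets (faster, lower recall); lowering it does the opposite.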

In a 2D picture: the unit circle, three random diameters chopping it into six pie-slices, two nearby points landing in the same slice and one distant point landing in a different slice. The hash code is which slice. That’s it — every other LSH detail is engineering on top of this idea.

LSH isn’t the dominant ANN method anymore — HNSW (a graph-based method) and IVF-PQ (clustering + product quantisation) outperform it in production vector stores. But it is the most geometrically transparent: you can see why it works by drawing two unit vectors and a random line through the origin.

What changes in 768 dimensions

Cosine in 2D and cosine in 768D behave differently, and the reason matters for understanding retrieval scores.

Take two random unit vectors in 2D. The angle between them is uniform on $[0^\circ, 180^\circ]$, so their cosine ranges over all of $[-1, 1]$; because the cosine curve is flat near $0^\circ$ and $180^\circ$, the values actually pile up towards the two ends. You’ll see plenty of high-cosine pairs by chance.

Now do it in 100D. The cosines pile up around 0. By 768D, which is BERT-base’s output dimensionality and the size of many production embedding vectors, the cosines between random unit vectors are sharply concentrated in a narrow band: the standard deviation is $1/\sqrt{d} \approx 0.036$, so most pairs fall within $\pm 0.05$ of 0. The geometric reason is that almost all of the surface area of a high-dimensional sphere is concentrated near the equator with respect to any chosen axis. Two random points on the sphere are almost certainly nearly perpendicular.
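The concentration is easy to observe directly. The sketch below samples pairs of random unit vectors (normalised Gaussian draws, which are uniform on the sphere) in 2, 100 and 768 dimensions and compares the spread of their cosines with $1/\sqrt{d}$. The function name and sample size are our own choices.

```python
import numpy as np


def random_cosines(d, n_pairs=5000, seed=0):
    """Cosines between n_pairs independent pairs of random unit vectors in d dimensions."""
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(n_pairs, d))
    V = rng.normal(size=(n_pairs, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    return np.einsum("ij,ij->i", U, V)  # row-wise dot products


for d in (2, 100, 768):
    cos = random_cosines(d)
    within = float(np.mean(np.abs(cos) < 0.05))
    print(f"d={d:4d}  std={cos.std():.3f}  1/sqrt(d)={1 / np.sqrt(d):.3f}  "
          f"share within 0.05 of zero={within:.2f}")
```

In 2D the cosines spread across the whole interval; by 768D the large majority sit within 0.05 of zero.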

A few consequences for retrieval:

  1. Background noise sits at cosine ≈ 0, not −1. A score of 0.05 isn’t “weakly related”; it’s “as unrelated as two random vectors in this space.” Strongly negative cosines are vanishingly rare.
  2. The useful range is compressed. Real matches might score 0.7 instead of 0.99, because the entire learned space lives in a fraction of the sphere. The score scale is conditioned on the model.
  3. Anisotropy is what happens when this gets worse. Trained encoders sometimes push all of their outputs into a narrow cone of the sphere; every pair scores high; nothing is distinguishable from anything else. Embeddings taken from language models that were never tuned for retrieval often exhibit this; retrieval encoders (BGE, E5, sentence-transformers) are explicitly trained against it. Some libraries apply a whitening post-processing step (subtract the mean, then multiply by the inverse square root of the covariance) to spread the cone back out.

The same high-dimensional concentration also helps LSH in practice: in 768D, unrelated vectors sit close to 90° from the query, so they agree with it on only about half of the random hyperplanes, and genuine neighbours stand out clearly against that background. The thing that hurts intuition rescues the algorithm.
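The whitening step mentioned in point 3 can be sketched in a few lines. This is a plain PCA-whitening sketch under our own assumptions (the function names, the eps regulariser and the synthetic “cone” data are all illustrative); libraries that offer whitening may differ in details such as dimensionality reduction.

```python
import numpy as np


def whiten(X, eps=1e-8):
    """Centre X, then rescale so the sample covariance becomes the identity."""
    Xc = X - X.mean(axis=0, keepdims=True)
    cov = (Xc.T @ Xc) / (len(X) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Divide each eigenvector (column) by the square root of its eigenvalue.
    W = eigvecs / np.sqrt(np.maximum(eigvals, 0.0) + eps)
    return Xc @ W


def mean_pairwise_cosine(X):
    """Average cosine over all distinct pairs of rows."""
    S = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = S @ S.T
    iu = np.triu_indices(len(S), k=1)
    return float(sims[iu].mean())


rng = np.random.default_rng(0)
# Synthetic narrow cone: a large shared offset plus small per-vector noise.
cone = 5.0 + 0.1 * rng.normal(size=(2000, 16))
flat = whiten(cone)

print(round(mean_pairwise_cosine(cone), 3), round(mean_pairwise_cosine(flat), 3))
```

Before whitening, almost every pair scores close to 1; afterwards the average pairwise cosine is close to 0, so the scores can discriminate again. Queries have to be transformed with the same mean and matrix as the corpus.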

Where this shows up

A short tour of every place this site ends up cashing in on cosine:

  • Single neurons (backprop). A neuron computes $\mathbf{w} \cdot \mathbf{x} + b$, which is already a dot product. Training learns the right $\mathbf{w}$ to make the dot product large for some inputs and small for others, with no normalisation: magnitude is part of the signal. So neurons aren’t computing cosine, but they are computing the primitive that cosine is built from.
  • CNN kernels (CNN article). A convolution is a sliding dot product between a small kernel and each image patch. The article frames this as pattern matching, and that intuition is exactly right up to the brightness problem from earlier — unnormalised dot products see brightness, not pure pattern. Modern CNNs invest heavily in normalisation layers (BatchNorm, LayerNorm) downstream of every convolution partly to mitigate this.
  • word2vec analogies. The famous king − man + woman ≈ queen example works because cosine ignores length. Subtraction-and-addition gives you a vector in approximately the right direction but with a meaningless magnitude; cosine finds the nearest dictionary word by direction and shrugs at the wrong length.
  • BERT and retrieval encoders (BERT article). Token vectors get pooled to one vector per passage; the model is fine-tuned with a contrastive loss that makes cosine similarity between (query, relevant passage) pairs large and (query, irrelevant) small. The whole purpose of the fine-tune is to align cosine with relevance.
  • RAG (RAG article). Flat-vector RAG is one giant cosine search, run for every user query. That article walks through where the assumption “cosine = relevance” holds and where it breaks.
  • Recommenders. User and item embeddings, learned by matrix factorisation or two-tower neural networks; “items similar to what you watched” is a cosine top-k over the item matrix.

All six are the same operation under different names. The reason for having one foundational article is so you can recognise the operation in all six places and not get fooled by the surface differences.

Where cosine fails

Cosine similarity is a measure of geometric closeness in a vector space. It is not a measure of truth or relevance. Those are properties of the space you projected your data into, and cosine just measures angles in that space. A few specific failure modes worth knowing:

  • Anisotropy. As described in the high-dimensional section: when a model’s output space is a narrow cone, every pair scores high and the metric stops being informative. Diagnosis: sample 1000 random vectors from your corpus, compute pairwise cosines, look at the distribution. If everything is between 0.6 and 0.95, you’ve got anisotropy. Fix: switch to a contrastively-trained encoder, or apply whitening.
  • Wrong space. If the model was trained on web text and you’re retrieving over legal contracts, it has learned to bring things close together that look like similar web text, which is not the same as similar legal claims. Cosine confidently returns the wrong neighbour. The fix is domain adaptation or a stronger general-purpose model — not anything to do with the metric.
  • Surface similarity. Two passages with the same boilerplate but opposite content can land close in cosine: “I love this product because…” and “I hate this product because…” share most of the sentence, and cosine doesn’t know that “love” vs “hate” is the load-bearing word. Cross-encoder rerankers (covered in the RAG article) are how production systems patch this.
  • No notion of authority or freshness. Cosine doesn’t care that the 2025 document supersedes the 2023 one, or that an official policy outranks a Slack mention. That kind of information has to be wired in as metadata filters or score boosts. The metric itself is timeless.
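The anisotropy diagnosis from the first bullet is simple enough to script. The sketch below samples up to 1000 vectors, computes every pairwise cosine once, and summarises the distribution. The function name is hypothetical, and the two demo matrices are synthetic stand-ins (one isotropic, one deliberately squeezed into a cone) rather than real corpus embeddings.

```python
import numpy as np


def pairwise_cosine_sample(X, n=1000, seed=0):
    """Cosines between all distinct pairs drawn from a random sample of n rows of X."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(n, len(X)), replace=False)
    S = X[idx] / np.linalg.norm(X[idx], axis=1, keepdims=True)
    sims = S @ S.T
    iu = np.triu_indices(len(S), k=1)  # each unordered pair once, no self-pairs
    return sims[iu]


rng = np.random.default_rng(3)
isotropic = rng.normal(size=(2000, 64))  # spread evenly in all directions
cone = isotropic + 4.0                   # shared offset pushes everything into a cone

for name, X in [("isotropic", isotropic), ("cone", cone)]:
    cos = pairwise_cosine_sample(X)
    p5, p50, p95 = np.percentile(cos, [5, 50, 95])
    print(f"{name:10s} 5%={p5:.2f}  median={p50:.2f}  95%={p95:.2f}")
```

On your own corpus, replace the synthetic matrices with your stored embeddings. A median well above zero with a narrow spread is the warning sign described above.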

Wrap

Cosine similarity is dot product with the length signal removed. Removing the length signal is what makes it work as a measure of meaning — similarity shouldn’t depend on how long a passage is, how bright a patch is, or how many times a word appears.

That’s the whole article in one sentence. The reason cosine shows up in so many places (neurons, CNNs, word2vec, BERT, RAG, recommenders, LSH) is that the underlying primitive (dot product) is everywhere, and the geometric form $\|\mathbf{a}\|\|\mathbf{b}\|\cos\theta$ tells us exactly which factor to divide out when we want direction without magnitude. From there it’s all engineering: pre-normalise to make the matmul cheap, hash randomly to make the matmul approximate, train the encoder to make the angle meaningful.