How binary arithmetic works: two's complement integers and IEEE-754 floats

Every numeric operation your program runs — a + b, x * y, count / 2 — ultimately becomes a sequence of bit-level operations on bit patterns. How those bit operations turn into the correct arithmetic result depends entirely on which encoding the bits are using.

For integers, computers use two’s complement, a beautifully simple scheme where signed addition uses the exact same hardware circuit as unsigned addition — the bits don’t know or care whether you meant them to be signed. For fractions, computers use IEEE-754 floating point, a far more elaborate machinery involving separate sign, exponent, and mantissa fields that have to be aligned and rounded on every operation.

This article walks through the arithmetic of both — what actually happens to the bits during addition, subtraction, multiplication, and division in each format, where the operations break, and how the two systems compare. The encoding side of each format is covered in sibling articles.

Here we focus purely on the arithmetic operations themselves, with short recaps of the encoding where needed.

Arithmetic on two’s complement integers

We’ll assume you already know the encoding basics — the most significant bit (MSB) carries negative weight, to negate a number you flip its bits and add 1, and the range is asymmetric by one (-8 to +7 in 4 bits, or -2,147,483,648 to +2,147,483,647 in 32-bit). If any of that is unfamiliar, the full walkthrough is in offset binary vs two’s complement. The 4-bit examples used below are just a stand-in for any n-bit width.

Addition and subtraction: the same as unsigned

The first beautiful property of two’s complement: signed addition uses the same physical circuit as unsigned addition. Two’s complement stores each negative number x using the bit pattern that, when read as unsigned, equals x + 2^n — for instance, in 4 bits (n = 4), -3 is stored as the pattern for 13: x + 2^n = -3 + 2^4 = -3 + 16 = 13, which is 1101 in binary. Because of this, adding a negative and a positive number often overflows the n-bit register, but the hardware simply discards the carry-out — and the signed answer falls out automatically.

Let’s see addition of a negative number in action. Trace -3 + 5 in 4 bits — -3 is encoded as 1101 (= 13), 5 is 0101:

    1 1 0 1     (-3, stored as 13)
  + 0 1 0 1     (+5)
  ─────────
  1 0 0 1 0     (= 18; doesn't fit in 4 bits — the leading 1 is the carry-out, weight 2^4 = 16)
    0 0 1 0     (kept in the 4-bit register: 2)  ✓ matches -3 + 5 = 2

The hardware simply added the unsigned bit patterns 13 + 5 = 18. The carry bit at the 5th position (weight 2^4 = 16) didn’t fit in the 4-bit register and was dropped. That dropped 16 is exactly the +2^n part of the x + 2^n formula we saw above — the encoding offset and the overflow cancel, and the signed answer comes out for free. We simply interpret the resulting bits as signed; the hardware machinery doesn’t distinguish. The same logic covers no-overflow cases: -7 + 4 is 1001 + 0100 = 1101, which decodes as -3 (signed) or 13 (unsigned).
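The trace above can be reproduced in a few lines of Python. A sketch, not hardware: the & MASK step plays the role of the register dropping the carry-out, and to_signed is our interpretation step.

```python
MASK = 0xF  # 4-bit register

def to_signed(bits4):
    # Read a 4-bit pattern as two's complement: the MSB carries weight -8.
    return bits4 - 16 if bits4 & 0x8 else bits4

a = -3 & MASK        # 0b1101 = 13, the stored pattern for -3
b = 5 & MASK         # 0b0101
s = (a + b) & MASK   # the adder works on raw patterns; & MASK drops the carry-out
assert s == 0b0010 and to_signed(s) == 2   # -3 + 5 = 2
```

The same three lines handle the no-overflow case too: ((-7 & MASK) + (4 & MASK)) & MASK gives 0b1101, which to_signed reads as -3.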

Since subtraction is simply addition of a negated number, it needs no separate circuit either: a - b = a + (-b), where negation is the flip-bits-and-add-1 recipe. For 5 - 3: -3 is 1101, then 0101 + 1101 = 10010. The leading bit falls off in 4 bits, leaving 0010 = 2 — that dropped bit is the modular wrap. The full framing (n-bit arithmetic as a circle of 2^n positions) is walked through in the modular-arithmetic section of the encoding article.

The modulo-2^n mechanism leads to the following handling of overflows: INT_MAX + 1 = INT_MIN, and INT_MIN - 1 = INT_MAX — not a bug, just the modular step crossing the seam between the positive and negative halves. Note that INT_MAX and INT_MIN are C/C++ names for the signed 32-bit extremes 2^31 − 1 and −2^31; Java has Integer.MAX_VALUE/MIN_VALUE, Rust has i32::MAX/MIN. Python and JavaScript don’t, since their integer types are arbitrary-precision.
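That seam-crossing is easy to sketch in Python by emulating a 32-bit register with a mask (Python ints never overflow on their own, so the mask does the wrapping):

```python
MASK32 = 0xFFFFFFFF
INT_MAX, INT_MIN = 2**31 - 1, -2**31

def to_signed32(u):
    # Interpret a 32-bit pattern as two's complement.
    return u - 2**32 if u & 0x80000000 else u

# The modular step crosses the seam between the halves:
assert to_signed32((INT_MAX + 1) & MASK32) == INT_MIN
assert to_signed32((INT_MIN - 1) & MASK32) == INT_MAX
```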

Comparison: where bits alone aren’t enough

Unlike addition, comparison does depend on interpretation. The bit pattern 1111 is 15 as unsigned, so naturally 1111 > 0001 (15 > 1). However, if those same bits are meant to be a signed -1, the right answer is -1 < 1 — but a bit-level comparison still gives 1111 > 0001. Same bits, opposite answers.

At the hardware level, comparison is often implemented as a subtraction: the CPU computes a - b and sets flags (sign, zero, overflow, carry); branch instructions then read those flags to decide the branch outcome. The subtraction itself uses the same a + (-b) mechanism from above — the CPU just discards the result and keeps only the flags. A few 4-bit examples to see each flag in action:

 5 - 3:   0101 - 0011 = 0010    carry=0, sign=0, zero=0, overflow=0
 3 - 5:   0011 - 0101 = 1110    carry=1, sign=1, zero=0, overflow=0    (= -2 signed)
 5 - 5:   0101 - 0101 = 0000    carry=0, sign=0, zero=1, overflow=0
-8 - 1:   1000 - 0001 = 0111    carry=0, sign=0, zero=0, overflow=1    (= -9, doesn't fit)

Take the first row, 5 - 3 = 2, where every flag is 0: carry = 0 because the unsigned subtraction stays in range (5 ≥ 3); sign = 0 because the result’s MSB (0010) is 0; zero = 0 because the result is non-zero; overflow = 0 because +2 fits easily in the signed 4-bit range. The other rows flip one flag each: carry fires when the unsigned subtraction wraps past zero (row 2: 3 < 5, so the result is the two’s-complement encoding of a negative number); sign fires when the result’s MSB is 1 (row 2’s 1110); zero fires when the operands were equal (row 3); overflow fires when the signed result falls out of the 4-bit signed range (row 4: -8 - 1 = -9 underflows below the minimum -8).

Because the same bits can mean different things signed vs unsigned, CPUs expose two families of branch instructions — signed (JL, JG on x86 — “jump if less/greater”) and unsigned (JB, JA — “jump if below/above”) — each reading a different flag combination:

  Comparison    Unsigned                      Signed
  a < b         JB:  carry = 1                JL:  sign ⊕ overflow = 1
  a > b         JA:  carry = 0 ∧ zero = 0     JG:  sign ⊕ overflow = 0 ∧ zero = 0
  a ≤ b         JBE: carry = 1 ∨ zero = 1     JLE: sign ⊕ overflow = 1 ∨ zero = 1
  a ≥ b         JAE: carry = 0                JGE: sign ⊕ overflow = 0

In the conditions, ⊕ is logical XOR (exactly one input is 1), ∧ is logical AND (both inputs are 1), and ∨ is logical OR (at least one input is 1). The compiler picks the right variant from the declared type.

In action, again in 4 bits (each row’s flag condition is satisfied, so the branch is taken):

  JB  (1 < 2  unsigned):   0001 - 0010 = 1111    carry = 1
  JA  (3 > 2  unsigned):   0011 - 0010 = 0001    carry = 0, zero = 0
  JL  (-1 < 1 signed):     1111 - 0001 = 1110    sign ⊕ overflow = 1 ⊕ 0 = 1
  JG  (3 > 2  signed):     0011 - 0010 = 0001    sign ⊕ overflow = 0 ⊕ 0 = 0, zero = 0

The right-hand side of each row reads as formula → substituted flag values → boolean result. Take the JL row: the subtraction 1111 - 0001 = 1110 has sign = 1 (MSB of the result) and overflow = 0 (the signed value -2 still fits in 4-bit), so sign ⊕ overflow becomes 1 ⊕ 0, which evaluates to 1 — JL’s firing condition. The JG row works the same way with sign = 0 and overflow = 0, giving 0 ⊕ 0 = 0 (plus zero = 0), which is JG’s firing condition.

Equality (a == b, instruction JE) and its negation (JNE) just read the zero flag and are identical for signed and unsigned.
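The whole flag-and-branch pipeline can be sketched in Python, 4-bit as in the traces above. The function names here are descriptive, not x86 mnemonics; the overflow test uses the standard XOR trick for subtraction.

```python
def flags4(a, b):
    # Flags produced by the 4-bit subtraction a - b (a, b are patterns 0..15).
    r = (a - b) & 0xF
    carry = a < b                               # unsigned subtraction wrapped past zero
    sign = bool(r & 0x8)                        # MSB of the result
    zero = r == 0
    overflow = ((a ^ b) & (a ^ r) & 0x8) != 0   # signed result fell outside -8..7
    return carry, sign, zero, overflow

def below(a, b):                # unsigned a < b — what JB tests
    carry, _, _, _ = flags4(a, b)
    return carry

def less(a, b):                 # signed a < b — what JL tests
    _, sign, _, overflow = flags4(a, b)
    return sign != overflow     # sign XOR overflow

assert below(0b0001, 0b1111)    # 1 < 15 as unsigned
assert not less(0b0001, 0b1111) # but 1 < -1 is false as signed
assert less(0b1111, 0b0001)     # -1 < 1 as signed
```

The first two assertions are the “same bits, opposite answers” point from above: 0001 vs 1111 compares one way unsigned and the other way signed, from the very same subtraction.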

Multiplication: shift and add

Recall long multiplication from grade school: for each digit of one operand, multiply it by the other operand, shift the resulting row by the digit’s position, and sum all the rows. For example, 123 × 456:

        1 2 3              (a = 123)
      × 4 5 6              (b = 456)
      ─────────
        7 3 8              ← 123 × 6 (6 is at position 0, no shift)
      6 1 5                ← 123 × 5, shifted one position left (5 is at position 1)
    4 9 2                  ← 123 × 4, shifted two positions left (4 is at position 2)
    ─────────
    5 6 0 8 8              = 56088

Binary works the same way, but with one big simplification: each digit of the multiplier is either 0 or 1. So each row is either a shifted copy of the multiplicand a (when the multiplier bit is 1) or just zero (when the bit is 0) — no digit-by-digit multiplication step like in base 10. It’s as if every multiplication were against a base-10 number whose digits are just 1s and 0s — like 25 × 101:

        2 5               (a = 25)
      × 1 0 1             (b = 101)
      ─────────
        2 5               ← digit 0 = 1: add 25
      0 0                 ← digit 1 = 0: skip
    2 5                   ← digit 2 = 1: add 25 shifted 2 left (= 2500)
    ─────────
    2 5 2 5               = 2525

Notice there’s no actual digit-multiplication step — every row is either 25 shifted to its position or just 0. And the shift itself is just zero-padding at the low end: 25 shifted 2 left is 2500, with two zeros appended (in binary the same shift turns 0011 into 1100). So the entire operation reduces to “either add a zero-padded copy of a, or skip”. That’s the property that makes binary multiplication so cheap: in binary the multiplier always has digits in {0, 1}, so every row collapses to the same shift-or-zero choice, no matter the operand.

Here’s 3 × 5 (0011 × 0101) in 4 bits, run through the same algorithm:

        0 0 1 1           (a = 3)
      × 0 1 0 1           (b = 5)
      ─────────
        0 0 1 1           ← bit 0 = 1: add a
      0 0 0 0             ← bit 1 = 0: skip
    0 0 1 1               ← bit 2 = 1: add a ≪ 2
  0 0 0 0                 ← bit 3 = 0: skip
  ─────────────────
  0 0 0 0 1 1 1 1         = 15

The row-by-row layout suggests n sequential steps — but the CPU doesn’t need them sequential. Each row depends only on a and one bit of b, not on any previous row’s result, so all n rows can be computed in parallel. To produce row i, the hardware shifts a left by i (the row’s position), then either keeps the shifted value (if b[i] is 1) or replaces it with all zeros (if b[i] is 0). The n partial products are then summed by a parallel adder tree (Wallace or Dadda), which takes O(log n) gate delay — in contrast to O(n) cycles for a serial shift-and-add multiplier.

In hardware, this becomes a parallel tree: b splits into its bits, each bit is a branch, and each branch decides 0 or 1. If 1, that branch contributes a shifted by its position; if 0, it contributes zero. All branches run in parallel, then a tree adder sums them up.

For a = 0011 (= 3), b = 0101 (= 5) the computation process looks like this:

              a = 0011 (broadcast to every branch)

  ┌───────────┼───────────┬───────────┬───────────┐
  │           │           │           │           │
  ▼           ▼           ▼           ▼
a ≪ 0       a ≪ 1       a ≪ 2       a ≪ 3            ← shift by branch index
= 00000011  = 00000110  = 00001100  = 00011000        (result in 8-bit field)
  │           │           │           │
  × b₀=1      × b₁=0      × b₂=1      × b₃=0          ← gate with that b bit
  │           │           │           │                 (if bᵢ=0, output = 0;
  ▼           ▼           ▼           ▼                 if bᵢ=1, output = a≪i)
00000011    00000000    00001100    00000000
(partial₀)  (partial₁)  (partial₂)  (partial₃)
  │           │           │           │
  └───────────┴─────┬─────┴───────────┘


       ┌────────────────────────────┐
       │  parallel adder tree       │   ← O(log n) gate depth
       │  sum = 00001111 = 15       │
       └──────────────┬─────────────┘


                  P = 00001111 = 15                ← 2n-bit product (= a × b)
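A serial software rendering of the same idea (a sketch: real hardware computes the rows in parallel, as the diagram shows; this loop does them one at a time):

```python
def mul_shift_add(a, b, n=4):
    # Unsigned n-bit multiply: one shifted copy of a per set bit of b.
    product = 0
    for i in range(n):
        if (b >> i) & 1:       # multiplier bit 1 → row is a << i
            product += a << i
        # multiplier bit 0 → row is all zeros, nothing to add
    return product             # up to 2n bits wide

assert mul_shift_add(3, 5) == 15   # 0011 × 0101, matching the trace above
```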

There’s a practical pitfall worth knowing about — multiplication can silently overflow, even when the operands are nowhere near the type’s max. That’s because the product needs 2n bits, not n — for 4-bit × 4-bit, the largest product is 15 × 15 = 225 = 11100001, which spills into 8 bits; for 8-bit × 8-bit, 255 × 255 = 65,025 needs 16 bits. Different languages handle this in different ways:

  Strategy              Behavior on overflow                              Example
  Silent truncation     Keep low n bits, discard the high half            C/Java int * int → int; Rust i32 * i32 in release
  Wider type            Promote operands first so all 2n bits fit         C: (int64_t)a * (int64_t)b; Java: (long)a * b
  Checked / trapping    Detect overflow, return an error or panic         Rust a.checked_mul(b) → Option; Rust debug-mode * panics
  Saturating            Clamp to the type’s MAX/MIN instead of wrapping   Rust a.saturating_mul(b)
  Arbitrary-precision   Use a type that grows — no overflow at all        Python int, JavaScript BigInt, Java BigInteger
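The silent-truncation row is easy to emulate in Python, whose ints otherwise never overflow — a mask plays the role of the fixed-width register:

```python
def mul_wrap(a, b, n=8):
    # Silent truncation: keep only the low n bits of the 2n-bit product.
    return (a * b) & ((1 << n) - 1)

assert 255 * 255 == 65025        # the full product needs 16 bits (0xFE01)
assert mul_wrap(255, 255) == 1   # an 8-bit register keeps only the low byte, 0x01
```

Note how far the truncated answer is from the real one: both operands were legal 8-bit values, yet the stored product is 1.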

Signed multiplication

Unlike addition, multiplication doesn’t handle signed and unsigned operands with the exact same algorithm. That’s because in two’s complement, the multiplier’s MSB carries negative weight — so when it’s 1, that row’s contribution should be subtracted rather than added. For example, with b = 1110 and a = 3: read as unsigned, b = 14 and all rows are added (0 + 6 + 12 + 24 = 42, matching 3 × 14); read as signed, b = -2 and the MSB row is subtracted instead (0 + 6 + 12 − 24 = -6, matching 3 × -2). Same operand bits, one row’s sign flipped.

So a few tweaks to the algorithm above are required: subtract (rather than add) the partial product for the sign-bit row, and sign-extend each partial product to the full 2n-bit width so negative contributions propagate correctly into the upper half. Running 3 × -2 (0011 × 1110) through this recipe:

        0 0 1 1                  (a = 3)
      × 1 1 1 0                  (b = -2 signed)
      ─────────
  0 0 0 0 0 0 0 0    ← bit 0 = 0: skip
  0 0 0 0 0 1 1 0    ← bit 1 = 1: add a ≪ 1
  0 0 0 0 1 1 0 0    ← bit 2 = 1: add a ≪ 2
  1 1 1 0 1 0 0 0    ← bit 3 = 1 (MSB!): subtract a ≪ 3 (= -24, encoded in 8-bit two's complement)
  ─────────────────
  1 1 1 1 1 0 1 0    = -6 (signed)  ✓

The signed/unsigned divergence lives entirely in the upper n bits of the full 2n-bit product — the low n bits are the same either way. For example, with the same operand bits 0011 × 1110:

  unsigned (3 × 14 = 42):   0 0 1 0  |  1 0 1 0
  signed   (3 × -2 = -6):   1 1 1 1  |  1 0 1 0
                            ───────     ───────
                            upper n     low n
                            (differs)   (identical)

That’s why same-width multiplication (int * int → int) doesn’t need separate signed/unsigned variants at the language level: it’s just the low n bits, which wrap on overflow like addition (no special pathological case like INT_MIN / -1, the way division has).
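The upper-half/lower-half split is easy to check in Python with the same operands as above:

```python
MASK4 = 0xF

# Same operand bits 0011 × 1110, two interpretations:
unsigned_prod = 3 * 14          # b read as unsigned 14 → 42  = 0010 1010
signed_prod = (3 * -2) & 0xFF   # b read as signed  -2 → -6  = 1111 1010 in 8 bits

assert unsigned_prod & MASK4 == signed_prod & MASK4   # low 4 bits identical
assert unsigned_prod >> 4 != signed_prod >> 4         # upper 4 bits differ
```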

Division: shift and subtract

Let’s now look at division. At the bit level, it’s nothing exotic — just long division applied to binary. Recall school-style long division, say 105 ÷ 7:

      1 5
    ─────
7 │ 1 0 5
      7
    ─────
      3 5
      3 5
    ─────
        0

To compute, we walk the digits of the dividend (105) left-to-right. At each step we ask “how many times does the divisor (7) fit into what we have so far?”, subtract that many times, bring down the next digit, and repeat.

Binary division works exactly the same way, except the “how many times?” question has only two possible answers: 0 or 1 — the divisor either fits once or not at all. That’s the key simplification. In base 10 each step needs an actual calculation (or estimation) to pin down a digit from 0–9 — you have to figure out how much of the divisor fits, which in general requires multiplication and trial-and-error.

In base 2 there’s nothing to calculate: one comparison between the running remainder and the divisor gives the answer directly — 1 if the divisor fits, 0 if not. No multiplication table, no trial-and-error, just a single subtract-or-don’t decision per bit. That’s why binary long division maps so cleanly onto hardware: the top-level loop iterates over the dividend bits, while each iteration is just a comparator and a conditional subtract — nothing more.

And recall from the comparison section that the comparator is implemented as a subtraction: R ≥ D is just R − D with a sign/carry-flag check. So for each loop iteration the hardware speculatively computes R − D: if the result is non-negative (no borrow), it commits that result back to R and records 1 in Q; otherwise it throws the result away and records 0. One subtract per iteration, always.

The CPU keeps two registers: R holds the running remainder, Q accumulates the quotient bit by bit. At each step it shifts R left and brings in the next dividend bit, then compares R with D by always computing R − D. Based on the result:

  • if R − D is non-negative (D fits) → sets the new Q bit to 1 and replaces R with that result
  • if R − D is negative (D doesn’t fit) → sets the new Q bit to 0, discards the result, and leaves R unchanged
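Those two branches translate directly into a software loop. This is a sketch of the textbook restoring-division algorithm, not any particular CPU’s circuit:

```python
def divide(dividend, divisor, n=4):
    # Unsigned restoring division: one compare-and-maybe-subtract per bit.
    R, Q = 0, 0
    for i in range(n - 1, -1, -1):
        R = (R << 1) | ((dividend >> i) & 1)  # shift in the next dividend bit
        if R >= divisor:                      # the speculative R - D succeeded
            R -= divisor                      # commit it to R
            Q = (Q << 1) | 1                  # record quotient bit 1
        else:
            Q = Q << 1                        # divisor didn't fit: quotient bit 0
    return Q, R

assert divide(13, 3) == (4, 1)   # 13 ÷ 3 = 4 remainder 1
```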

Let’s see an example: 13 ÷ 3 in 4 bits, dividend 1310=1101213_{10} = 1101_2, divisor 310=001123_{10} = 0011_2. Step through the widget below to see both renderings in parallel — the school-style long division on the left, and the CPU’s shift-register trace on the right. The widget shows four register rows: Ds is the divisor (what we’ve been calling D in the prose), Dd is the dividend, R is the running remainder, and Q is the quotient being built bit by bit. Mental model: Dd is the input stream of bits being fed into R (the next dividend bit shifts in on each tick); Ds is the static reference that R is compared against; R and Q are the work registers, both shifting left each tick.

[Interactive widget: school-style long division rendered next to the CPU shift-register trace.
 Registers: Ds = 0011 (divisor, 3) · Dd = 1101 (dividend, 13) · R = 0000 (remainder) · Q = 0000 (quotient).
 Rule per tick: if R < Ds → no subtract, Q bit = 0; if R ≥ Ds → subtract (R ← R − Ds), Q bit = 1.]

The widget above has 14 states — a consistent 3-sub-stage structure per tick (shift R, compare/subtract, shift Q). The trace below mirrors the per-step text shown in the widget’s info panel:

Step 0   — Init. Ds = 0011 (3), R = 0000, Q = 0000.

Tick 1 — bit 3 = 1 (no subtract)
    Step 1   — (a) shift R: bit 3 shifted in → R = 0001. No compare yet.
    Step 2   — (b) compare: 1 < 3 → no subtract. "R < D" branch lit.
    Step 3   — (c) shift Q: Q = 0000. First quotient digit = 0.

Tick 2 — bit 2 = 1 (subtract fires)
    Step 4   — (a) shift R: bit 2 shifted in → R = 0011 (intermediate, visible).
    Step 5   — (b) compare: 3 ≥ 3 → "R ≥ D" branch lit; will subtract on the next click.
    Step 6   — (c) subtract + shift Q: R = 0011 − 0011 = 0000; Q = 0001. School-side "1 1" row appears.

Tick 3 — bit 1 = 0 (no subtract)
    Step 7   — (a) shift R: bit 1 shifted in → R = 0000.
    Step 8   — (b) compare: 0 < 3 → no subtract.
    Step 9   — (c) shift Q: Q = 0010. Third quotient digit = 0.

Tick 4 — bit 0 = 1, LSB (no subtract)
    Step 10  — (a) shift R: bit 0 shifted in → R = 0001.
    Step 11  — (b) compare: 1 < 3 → no subtract.
    Step 12  — (c) shift Q: Q = 0100. Fourth quotient digit = 0.

Step 13  — Done. Q = 0100 (= 4), R = 0001 (= 1). 13 ÷ 3 = 4 remainder 1.

After n steps (one per bit), you have the full quotient. Whatever Q and R the CPU lands on at the end of a run, they satisfy a tight identity by construction:

(a / b) · b + (a mod b) = a    for any b ≠ 0

The identity isn’t an extra rule — it’s the definition of the remainder: R is whatever’s left over after the loop has subtracted D as many times as possible. The widget’s Q = 4, R = 1 is a trivial check: 4 · 3 + 1 = 13. Shift-and-subtract produces both a / b and a % b in a single pass — both results fall out together.

Implementations preserve this identity wherever Q and R both fit in the register. When the true quotient isn’t a whole integer, it has to be rounded — and languages take two approaches:

  • Truncation toward zero (C99/C++, Rust, Java, Go): a / b rounds toward zero; the remainder inherits the sign of the dividend.
  • Floored division (Python): a // b rounds toward −∞; the remainder inherits the sign of the divisor.

Example with a = -17, b = 5 (true quotient is −3.4):

C:       -17 / 5  = -3     -17 % 5 = -2    (-3) * 5 + (-2) = -17   ✓
Python:  -17 // 5 = -4     -17 % 5 =  3    (-4) * 5 +  3   = -17   ✓

Both satisfy the identity; they differ only on the rounding direction. The shift-and-subtract algorithm above produces truncated division directly — it operates on absolute values and reapplies the sign at the boundary. Floored division is a small post-adjustment: if the operand signs differ and the remainder is non-zero, decrement the quotient by 1 and add b to the remainder.
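Python’s // is floored, so C-style truncating division has to be built by hand — a small sketch that makes both conventions visible side by side:

```python
def trunc_divmod(a, b):
    # C-style division: quotient rounds toward zero,
    # remainder inherits the dividend's sign.
    q = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        q = -q
    r = a - q * b                 # the identity q*b + r == a pins down r
    return q, r

assert trunc_divmod(-17, 5) == (-3, -2)   # C:      -17 / 5, -17 % 5
assert divmod(-17, 5) == (-4, 3)          # Python: floored division
```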

As we just saw, the shift-and-subtract algorithm runs one loop iteration per bit, making integer division an O(n) operation for n-bit numbers — noticeably slower than addition or multiplication, which is why even modern CPUs take many cycles per idiv instruction. Simpler architectures sometimes implement it as a software loop rather than dedicated hardware, while modern CPUs skip plain shift-and-subtract in favor of faster algorithms — SRT division, Newton–Raphson, and Goldschmidt — that reduce division to a handful of multiplications, since multiplication is much cheaper in hardware.

Signed division is the same algorithm run on the absolute values of the operands, with signs fixed up before and after the loop — the quotient is negated if exactly one operand was negative, and the remainder takes the sign of the dividend (C/C++) or the divisor (Python).

This works cleanly for every pair — except one edge case.

The one signed division that overflows

Addition overflows are fixable in the sense that the wrap-around is well-defined. But there’s exactly one arithmetic operation in two’s complement that has no sensible answer at all: dividing the minimum value by -1.

INT_MIN ÷ (-1) = −INT_MIN = +2^31

That’s exactly INT_MAX + 1, or (2^31 − 1) + 1 = 2^31, which is one above the largest representable signed value. The result literally cannot be represented in any 32-bit two’s complement bit pattern. There’s no wrap-around that gives the right answer — 2^31 just doesn’t exist in the format.

This is the direct consequence of the asymmetric-by-one range. Every negative integer except INT_MIN has a matching positive; INT_MIN is the lone exception. Attempting to negate it (via unary - or via division by -1) asks for a value the format can’t produce.

Different systems handle this differently:

  System                Behavior
  x86 CPUs (directly)   Division overflow exception (program crashes via SIGFPE on Unix)
  C / C++               Undefined behavior — compiler may assume it never happens
  Java                  Defined: Integer.MIN_VALUE / -1 == Integer.MIN_VALUE (wraps to itself)
  Python                Not an issue — int is arbitrary precision
  Rust (debug)          Panics with an overflow message
  Rust (release)        Also panics: division overflow is checked in every build mode

They all have to make a choice because the mathematical answer simply doesn’t fit in the bit pattern — there’s no well-behaved fallback.
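The wrap is visible by running the flip-bits-and-add-1 negation under a 32-bit mask — a sketch of what wrapping implementations like Java’s end up doing:

```python
MASK32 = 0xFFFFFFFF
INT_MIN = -2**31

def to_signed32(u):
    # Interpret a 32-bit pattern as two's complement.
    return u - 2**32 if u & 0x80000000 else u

neg = (~INT_MIN + 1) & MASK32        # asks for +2**31, which has no encoding
assert neg == 0x80000000             # ...so the result is INT_MIN's own pattern
assert to_signed32(neg) == INT_MIN   # negating INT_MIN lands back on INT_MIN
```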

Arithmetic on IEEE-754 floats

Float arithmetic is an entirely different animal. Instead of the clean modular world of two’s complement, IEEE-754 gives you a grid of representable values whose spacing doubles every time the exponent increments — dense near zero, sparse near the extremes. Each operation has to align exponents, perform the underlying math, re-normalize the result, and round to fit the mantissa.

We’ll assume you already know the IEEE-754 binary64 layout — 1 sign bit, 11 biased exponent bits, 52 mantissa bits, implicit leading 1. If that’s new, the full breakdown is in how JavaScript’s Number and Python’s float store numbers. Here we pick up where that article leaves off — the format is fixed; now what happens when you do arithmetic on two of these numbers.

A quick note on terminology

One term that comes up below is subnormal (sometimes called denormal): a very small float, smaller than the smallest normal value 2^-1022. Normal floats have the form 1.xxx × 2^e with the implicit leading 1; subnormals drop that implicit 1, encoded with an all-zero exponent field and a non-zero mantissa. This lets IEEE-754 represent values gradually down to about 2^-1074 instead of jumping from 2^-1022 straight to zero — a feature called gradual underflow.

All four arithmetic operations follow the same skeleton: each of the three fields (sign, exponent, mantissa) is handled separately, and then the result is normalized (forced back into 1.xxx × 2^e form) and rounded to fit 52 mantissa bits.

For multiplication and division, the operands are already in 1.xxx × 2^e form, so the operation runs directly: signs are XORed, exponents are added (mul) or subtracted (div), and mantissas are multiplied or divided. Three steps: operate → normalize → round.

For addition and subtraction, there’s an extra preamble step — alignment — because adding values requires them at the same scale. The smaller operand’s mantissa is shifted right until both exponents match, and only then are the mantissas combined; the sign of the result comes from the larger-magnitude operand. Four steps: align → operate → normalize → round.

Now let’s walk through each operation in detail, starting with addition.

Addition: align, add, normalize, round

Adding two floats is more involved than adding two integers. The five steps are:

  1. Compare exponents. Pick the larger one as the reference exponent.
  2. Align mantissas. Shift the smaller operand’s mantissa right so both mantissas represent values at the same exponent. This may shift bits off the right end (which become the guard, round, and sticky bits used for correct rounding).
  3. Add the aligned mantissas. Plain binary addition of the two (including implicit leading 1s).
  4. Re-normalize. If the sum overflowed into the next exponent (e.g., 1.1 + 1.1 = 11.0, needing 1.10 × 2^1), shift the mantissa right and bump the exponent. If the sum is smaller than normalized form (subtraction cancellation), shift left and decrement the exponent.
  5. Round. The re-normalized mantissa usually has more bits than the 52-bit field allows. Round to the nearest representable value using IEEE-754’s default round to nearest, ties to even.

We can see how all those steps apply by following in detail the famous case where 0.1 + 0.2 evaluates to 0.30000000000000004 instead of 0.3. Both 0.1 and 0.2 are infinitely repeating binaries (the same way 1/3 is 0.333… in decimal: 0.1 in binary is 0.000110011001100…, with the block 0011 repeating forever), so each gets rounded to 52 mantissa bits when stored, and the addition rounds again. Three roundings whose errors don’t cancel. A detailed bit-by-bit walk-through is in the sibling article.

Given this example, we can see that float addition is not associative and not exact. Three roundings give a different answer than one rounding of the same mathematical result. This is why (a + b) + c and a + (b + c) can produce different float results.
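Both facts are one-liners to verify (Python here, but any IEEE-754 binary64 language gives the same booleans):

```python
# Three roundings (store 0.1, store 0.2, round the sum) leave a visible error:
assert 0.1 + 0.2 != 0.3
assert 0.1 + 0.2 == 0.30000000000000004

# And the grouping changes which roundings happen, so addition isn't associative:
assert (0.1 + 0.2) + 0.3 != 0.1 + (0.2 + 0.3)
```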

Subtraction: the catastrophic cancellation trap

Float subtraction follows the same align-add-normalize-round dance as addition, but with one extra concern: catastrophic cancellation.

When two nearly-equal floats are subtracted, the leading bits cancel. The re-normalize step then shifts left by many positions, “pulling up” low-order bits that were originally the least significant (and therefore least precise) parts of the operands.

Example in binary64: two doubles that differ only in the last mantissa bit:

  • a = 1.0000…0001 × 2⁰ (mantissa: 51 zeros then a 1; equals 1 + 2⁻⁵²)
  • b = 1.0000…0000 × 2⁰ (mantissa: all zeros; equals 1.0)

a − b = 0.0000…0001 × 2⁰, which re-normalizes to 1.0 × 2⁻⁵². The 52 leading mantissa bits canceled away, and the bottom bit — already at the precision limit — became the entire result. If a itself was the rounded output of some upstream computation, that bottom bit is mostly rounding noise, and the subtraction has just promoted it to the most significant position.

This is why numerical code often restructures formulas to avoid subtracting nearly-equal values — cancellation can turn a small rounding error into a result with effectively no valid digits.
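A minimal reproduction of the example above, using the fact that the machine epsilon 2^-52 is exactly one last-place bit at exponent 0:

```python
import sys

a = 1.0 + sys.float_info.epsilon   # 1.0000…0001 × 2^0: only the last mantissa bit set
b = 1.0
diff = a - b
assert diff == 2 ** -52            # all 52 leading bits canceled away
# The subtraction itself is exact; the danger is that if a's last bit was
# rounding noise from an upstream computation, that noise is now the entire result.
```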

Comparison: bit-order works (mostly)

Float comparison is almost as simple as integer comparison: for non-special values, comparing two floats by their bit pattern gives the same answer as comparing their numeric values. That’s because IEEE-754 puts sign in the MSB, then exponent, then mantissa — so a larger bit pattern (read as sign-magnitude) corresponds to a larger float. CPUs can compare two floats with a single hardware instruction.

But two special cases break the simple rule:

  • NaN compares unequal to everything, including itself. NaN == NaN evaluates to false. This is intentional — NaN represents “no valid number,” so it can’t be equal to anything. Sorting algorithms and hash tables have to handle NaN specially.
  • +0 == -0, even though their sign bits differ. The two zeros compare equal numerically, but they’re not fully interchangeable: 1 / +0 gives +∞ while 1 / -0 gives -∞.

These show up in surprising places. In JavaScript, [NaN].includes(NaN) is true (the spec’s SameValueZero comparison treats NaN as equal to itself) but [NaN].indexOf(NaN) is -1 (indexOf uses ===, which honors the IEEE-754 rule). And in most languages, if (x == x) is the standard idiom for “is x not NaN?”
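The same rules are directly observable in Python (note that Python raises ZeroDivisionError for 1 / -0.0 instead of returning -∞, so that particular corner can’t be demonstrated here):

```python
import math

nan = float("nan")
assert nan != nan                         # NaN is unequal even to itself
assert not (nan == nan)
assert [nan].index(nan) == 0              # containers match NaN by identity first,
                                          # mirroring JS's includes-vs-indexOf split

assert 0.0 == -0.0                        # the two zeros compare equal...
assert math.copysign(1.0, -0.0) == -1.0   # ...yet the sign bit is really stored
```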

Multiplication: simpler than addition

Surprisingly, float multiplication is simpler than addition, because you don’t need to align exponents.

To compute a × b:

  1. Multiply the mantissas (including implicit leading 1s). Each mantissa is at least 1 (because the implicit leading bit is 1) and less than 2 (anything ≥ 2 would have been renormalized to a higher exponent). So their product is at least 1 and less than 4 (minimum 1 × 1 = 1; maximum just under 2 × 2 = 4), meaning the result is either already normalized (1.xxx) or one bit too big (1x.xxx).
  2. Add the true exponents. Biased exponents have to be un-biased first, then re-biased: (e₁ − 1023) + (e₂ − 1023) + 1023 = e₁ + e₂ − 1023.
  3. XOR the sign bits. Positive × negative = negative, etc.
  4. Re-normalize if needed. If the mantissa product overflowed to 1x.xxx, shift right by 1 and add 1 to the exponent.
  5. Round the mantissa to fit 52 bits.

No alignment means only one source of rounding error (step 5), plus whatever error was already in the operands. So a × b is typically more accurate than a + b when both are computed on freshly-rounded inputs.

We can see how those steps apply by following 0.1 × 0.2, which evaluates to 0.020000000000000004 instead of 0.02. Both 0.1 and 0.2 round to the same 52-bit mantissa (the same infinite-repeating binary we saw before), differing only in exponent:

  • 0.11.10011001…10011010 × 242^{-4} (52 bits)
  • 0.21.10011001…10011010 × 232^{-3} (52 bits)

Running the five steps:

  1. Multiply the mantissas. Each mantissa decodes to approximately 1.6 in decimal (the binary 1.10011001…10011010 is the closest 53-bit approximation of 8/5 = 1.6). The full multiplication of two 53-bit values produces up to 106 bits, with a value ≈ 1.6 × 1.6 = 2.56. Since 2.56 ≥ 2, the result is in 1x.xxx form and renormalization is needed.
  2. Add the true exponents: (−4) + (−3) = −7.
  3. XOR the signs: positive × positive = positive.
  4. Renormalize. The mantissa product is in 1x.xxx form, so shift right by 1 (the mantissa becomes ≈ 1.28) and bump the exponent from −7 to −6.
  5. Round the 106-bit mantissa product down to 52 bits.

The final stored value decodes to the exact decimal 0.020000000000000004163336342344337026588618755340576171875 — close to 0.02, but not equal. The rounding errors in 0.1 and 0.2 get carried into the product, and the product itself rounds again.
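The shared-mantissa claim can be checked by pulling the raw fields out of each double (struct reinterprets the bytes; the field split follows the binary64 layout):

```python
import struct

def fields(x):
    # Split a binary64 into (sign, biased exponent, 52 mantissa bits).
    u = struct.unpack("<Q", struct.pack("<d", x))[0]
    return u >> 63, (u >> 52) & 0x7FF, u & ((1 << 52) - 1)

_, e1, m1 = fields(0.1)
_, e2, m2 = fields(0.2)
assert m1 == m2       # identical mantissa bit patterns
assert e2 - e1 == 1   # true exponents -4 vs -3 (biased: 1019 vs 1020)
```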

Division: rarely exact

Float division mirrors multiplication’s five-step structure, with the operations reversed:

  1. Divide the mantissas. Each mantissa is at least 1 and less than 2, so the quotient mantissa is at least 0.5 and less than 2 — either already normalized (1.xxx) or one bit too small (0.xxx).
  2. Subtract the true exponents. Same un-bias-then-rebias dance as multiplication: (e_1 - 1023) - (e_2 - 1023) + 1023 = e_1 - e_2 + 1023.
  3. XOR the sign bits. Negative / positive = negative, etc.
  4. Re-normalize if needed. If the quotient mantissa is 0.xxx, shift left by 1 and decrement the exponent.
  5. Round the mantissa to fit 52 bits.

The big difference from multiplication: division is rarely exact. Two 53-bit values can produce a quotient that needs infinitely many bits to represent — even when both operands are simple. For instance, 1.0 / 3.0 is the binary equivalent of 1/3 = 0.333… in decimal: an infinite repeating binary that has to be rounded. So step 5 almost always rounds, even when no rounding was needed for the operands.

Modern CPUs implement float division via the same Newton–Raphson / Goldschmidt iterations mentioned earlier for integer division, reducing it to a few multiplications — but division is still typically the slowest of the four basic float operations.
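The 1.0 / 3.0 case is easy to check directly. A small sketch (standard library only, assuming IEEE-754 doubles) showing that the stored quotient is not 1/3, but is within half an ulp of it:

```python
from fractions import Fraction

q = 1.0 / 3.0
print(repr(q))                         # 0.3333333333333333

# The stored double is not exactly 1/3 — the infinite repeating
# binary fraction had to be rounded in step 5:
print(Fraction(q) == Fraction(1, 3))   # False

# But the error is below half an ulp (doubles near 1/3 are 2**-54
# apart, so half an ulp is 2**-55):
error = abs(Fraction(q) - Fraction(1, 3))
print(error < Fraction(1, 2**55))      # True
```

The same pattern holds for almost any quotient: correctly rounded, but almost never exact.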

Overflow and underflow: infinity and zero instead of wrap-around

Unlike two’s complement integers, floats don’t wrap around on overflow. IEEE-754 reserves special bit patterns for “too big” and “too small” results:

  • Overflow → the exponent would exceed its maximum (true exponent 1023, i.e. results at or beyond roughly 2^{1024}). Result becomes ±∞, encoded as all-ones exponent with all-zero mantissa.
  • Underflow → the exponent would drop below the minimum (2^{-1022} is the smallest normal). Result becomes a subnormal (mantissa has no implicit leading 1, allowing gradual loss of precision) or eventually ±0.

In bit layout (sign | 11 exponent bits | 52 mantissa bits):

+∞          :  0 | 11111111111 | 0000…0000
-∞          :  1 | 11111111111 | 0000…0000
+0          :  0 | 00000000000 | 0000…0000
-0          :  1 | 00000000000 | 0000…0000
subnormal   :  s | 00000000000 | xxxx…xxxx   (non-zero mantissa; no implicit leading 1)
NaN         :  s | 11111111111 | xxxx…xxxx   (non-zero mantissa)

So 1e300 × 1e300 in a double produces Infinity, not some wrapped-around bit pattern. And 1e-300 × 1e-300 underflows through subnormals toward zero.

The bit patterns for special values all live at the extremes of the exponent range, which is why offset binary was chosen for the exponent field in the first place — it puts all-zero and all-one exponent patterns at the boundaries where reservations are natural.
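The layout table above can be verified directly by reinterpreting a double's bytes as an integer. A sketch using `struct` (assumes CPython floats are 64-bit IEEE-754 doubles, true on all mainstream platforms; the `bits` helper is just for display):

```python
import math
import struct

def bits(x: float) -> str:
    """Format a double as sign | 11 exponent bits | 52 mantissa bits."""
    (n,) = struct.unpack('>Q', struct.pack('>d', x))
    b = f'{n:064b}'
    return f'{b[0]} | {b[1:12]} | {b[12:]}'

print(bits(math.inf))    # 0 | 11111111111 | 0000…0000
print(bits(-math.inf))   # 1 | 11111111111 | 0000…0000
print(bits(0.0))         # 0 | 00000000000 | 0000…0000
print(bits(-0.0))        # 1 | 00000000000 | 0000…0000
print(bits(5e-324))      # smallest subnormal: all-zero exponent, mantissa …0001
print(bits(math.nan))    # all-ones exponent, non-zero mantissa

# Overflow saturates to infinity; deep underflow falls through the
# subnormal range all the way to zero:
print(1e300 * 1e300)     # inf
print(1e-300 * 1e-300)   # 0.0
```

Note that 1e-300 × 1e-300 ≈ 1e-600 is below even the smallest subnormal (≈ 4.9e-324), so it underflows all the way to +0.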

Division by zero: Infinity, not an exception

Integer division by zero traps or raises an exception in most languages. Float division by zero doesn’t — it produces ±Infinity:

> 1 / 0
Infinity
> -1 / 0
-Infinity

Python is an exception (it raises ZeroDivisionError at the language layer), but the underlying IEEE-754 arithmetic produces Infinity; Python just intercepts it. In JavaScript you get the raw behavior.

0 / 0 is different — it produces NaN, the “not a number” special value that propagates through further arithmetic and compares unequal to everything, including itself.
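NaN's propagation and self-inequality can be seen even in Python, which traps division by zero but otherwise exposes IEEE-754 behavior directly. A quick sketch (using inf − inf as an untrapped way to produce NaN):

```python
import math

nan = float('inf') - float('inf')   # inf - inf has no meaningful value: NaN
print(nan)                          # nan

# NaN propagates through further arithmetic…
print(nan + 1, nan * 0)             # nan nan

# …and compares unequal to everything, including itself:
print(nan == nan)                   # False
print(math.isnan(nan))              # True — the only reliable test
```

The `x == x` check failing for NaN is exactly the equality quirk noted in the comparison below: it is the one value for which bit-identical operands compare unequal.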

How the two systems differ

With both systems laid out, the contrasts are striking. Integer arithmetic and float arithmetic share almost nothing at the bit level, and the differences matter in practice:

Aspect            | Two's complement integers             | IEEE-754 floats
------------------+---------------------------------------+------------------------------------------
Bit spacing       | Uniform — every integer is 1 apart    | Logarithmic — density is highest near zero, sparse near the extremes
Addition          | Exact, commutative, associative       | Rounded; associative only by coincidence
Multiplication    | Exact within range, overflows wrap    | Rounded once per operation
Overflow          | Wraps modulo 2^n                      | Saturates to ±Infinity
Underflow         | N/A (integers don't have underflow)   | Gradual via subnormals, then ±0
Division by zero  | Exception / undefined                 | ±Infinity (or NaN for 0/0)
Signedness        | Encoded in the bits (MSB convention)  | Explicit sign bit; +0 and -0 both exist
Equality          | Bit-exact; x == x always true         | Mostly bit-exact; NaN != NaN is the exception

For most application code, the differences show up as everyday quirks:

  • Integers are predictable but narrow. Addition is associative and exact, so (a + b) + c == a + (b + c) always. But step outside the range and you wrap (or crash, depending on language).
  • Floats are wide but imprecise. You can represent values from 10^{-308} to 10^{308}, but few of them exactly. Every operation rounds, and rounding errors accumulate in ways that depend on operation order.
  • Mixing them needs care. Converting a large int64 to float64 loses precision past 2^{53} (the safe-integer boundary). Converting a large float to int truncates. Neither is reversible in general.
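The safe-integer boundary is easy to demonstrate. Python integers are arbitrary-precision, so converting through `float` (an IEEE-754 double) makes the precision loss visible:

```python
n = 2**53

# Above 2**53, consecutive integers collapse onto the same double:
print(float(n) == float(n + 1))    # True — n+1 has no double of its own
print(float(n - 1) == float(n))    # False — below 2**53 every integer is exact

# And converting a float to int truncates toward zero:
print(int(2.999999999999999))      # 2
```

Neither direction is reversible in general: int → float drops low-order bits past 2^{53}, and float → int drops the fraction entirely.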

Underneath those everyday observations, four deeper points are worth holding onto:

  1. Two’s complement integer arithmetic is modular. Every operation happens modulo 2^n, which is why addition and subtraction use the same circuit and why INT_MAX + 1 = INT_MIN. The asymmetric range (one extra negative) gives you the INT_MIN / -1 edge case where the true result has no bit pattern at all.
  2. Float arithmetic is align-add-normalize-round. Every operation on two floats involves re-aligning their exponents (for addition), performing the math, re-normalizing the result, and rounding to 52 mantissa bits. Every step can introduce error, which is why float results are rarely bit-exact.
  3. Overflow behaves completely differently. Integers wrap; floats saturate to infinity. That’s not a superficial choice — it reflects the fundamentally different structure of the two number spaces (flat modular vs logarithmic with reserved endpoints).
  4. The bits don’t carry meaning — the type does. The same 32 bits can be an int or a float depending on what the program decides. Two’s complement addition and float addition are entirely different operations that happen to use the same 32 bits of memory.
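Point 4 can be made concrete by reading one bit pattern both ways. A sketch using `struct` (the pattern 0x40490FDB is just an illustrative choice; it happens to be single-precision π):

```python
import struct

raw = struct.pack('>I', 0x40490FDB)       # one particular 32-bit pattern

(as_int,)   = struct.unpack('>i', raw)    # read as a two's complement int32
(as_float,) = struct.unpack('>f', raw)    # read as an IEEE-754 float32

print(as_int)      # 1078530011
print(as_float)    # ≈ 3.1415927 — same bits, entirely different number
```

Nothing in the 32 bits says which reading is correct; the type annotation in your program is the only thing deciding which arithmetic circuit those bits flow through.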

Knowing which arithmetic system you’re in (and its quirks) is usually more important than knowing the specific encoding of your numbers. When a calculation gives an unexpected answer, the question “which system handled this?” is usually the fastest way to the answer.