GPU memory, precision, and multi-GPU inference
In the previous article, we saw that a bigger GPU doesn’t help when your model is too small to saturate the hardware. But what happens when your model is too large for the hardware? That’s a different problem entirely — and the constraint shifts from compute to memory.
GPUs have their own RAM
When you load a model onto a GPU, the weights don’t go into your system’s main RAM. They go into VRAM (Video RAM) — dedicated memory sitting physically on the graphics card, right next to the processing cores.
VRAM is fast. The L4’s GDDR6 delivers ~300 GB/s bandwidth, compared to ~50 GB/s for typical system RAM. But there’s much less of it. A high-end consumer card like the RTX 4090 has 24 GB. The datacenter A100 has 40 or 80 GB. That’s the hard ceiling for what you can load.
For our MNIST model with ~101K parameters, this is irrelevant — the entire model fits in about 400 KB. But once you start working with large language models or vision transformers, VRAM becomes the binding constraint.
How much VRAM does a model need?
The formula is simple: number of parameters × bytes per parameter.
The “bytes per parameter” depends on the numerical precision you choose:
| Precision | Bytes per param | 0.5B model | 7B model | 70B model |
|---|---|---|---|---|
| float32 | 4 | 2 GB | 28 GB | 280 GB |
| float16 / bfloat16 | 2 | 1 GB | 14 GB | 140 GB |
| int8 | 1 | 0.5 GB | 7 GB | 70 GB |
| int4 | 0.5 | 0.25 GB | 3.5 GB | 35 GB |
These numbers are just the weights. During inference, you also need memory for activations and the KV cache (which grows with context length and batch size), but for typical workloads the weights are the dominant cost.
A 7B parameter model at float16 needs ~14 GB of VRAM — that fits on a single L4 or RTX 4090. The same model at float32 needs ~28 GB — it doesn’t fit. A 70B model at float16 needs ~140 GB — that’s two 80 GB A100s.
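As a sanity check, here’s a quick sketch of that arithmetic (the helper below is just illustrative; 7e9 is a round parameter count, not any specific model):

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, ignoring activations and the KV cache."""
    return num_params * bytes_per_param / 1e9

# A 7B-parameter model at each precision from the table above
for name, bytes_per_param in [("float32", 4), ("float16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{name:>8}: ~{weight_memory_gb(7e9, bytes_per_param):.1f} GB")
# float32: ~28.0 GB, float16: ~14.0 GB, int8: ~7.0 GB, int4: ~3.5 GB
```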
Model precision: trading bits for memory
When a model is trained, each weight is typically stored as a float32 — a 32-bit floating-point number with about 7 decimal digits of precision. But for inference, you usually don’t need all that precision.
float16 / bfloat16
Using 16 bits instead of 32 halves memory usage. Most modern models are trained with mixed precision (float16 or bfloat16 for the forward/backward passes, float32 for the weight updates), so the weights were never meaningfully precise to 32 bits — the extra bits carry little real information. Loading in float16 for inference is essentially lossless.
In code, it’s a single parameter:
```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    torch_dtype=torch.float16,  # half the memory, same quality
    device_map="auto",
)
```
int8 and int4 quantization
Quantization goes further, representing weights as 8-bit or 4-bit integers. This does introduce rounding error — you’re compressing a continuous value into 256 (int8) or 16 (int4) discrete levels.
But neural networks are robust to small perturbations. Millions or billions of weights work together, so tiny rounding errors in individual weights wash out statistically. In practice, int8 quantization is nearly indistinguishable from float16 for most models. Int4 shows some degradation on reasoning-heavy tasks but is remarkably good for general use.
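In Transformers, quantized loading goes through a quantization config. A minimal sketch using the bitsandbytes integration (the model name is reused from above; whether int4 is acceptable depends on your task):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit weights: ~3.5 GB for a 7B model
    bnb_4bit_compute_dtype=torch.float16,  # compute still happens in half precision
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```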
Finding the right precision
The practical workflow: train at full precision, then evaluate at progressively lower precision. If accuracy barely drops but memory halves, that’s a clear win for deployment.
```
float32  →  float16  →  int8  →  int4
28 GB       14 GB       7 GB     3.5 GB   (for a 7B model)
baseline    ~same       ~same    slight drop
```
Each step down makes the model cheaper and faster to serve. The question is where quality degrades enough to matter for your use case.
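A sketch of that sweep, assuming you already have an evaluate() function for your task (a hypothetical helper, not a library call):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

configs = {
    "float32": dict(torch_dtype=torch.float32),
    "float16": dict(torch_dtype=torch.float16),
    "int8":    dict(quantization_config=BitsAndBytesConfig(load_in_8bit=True)),
    "int4":    dict(quantization_config=BitsAndBytesConfig(load_in_4bit=True)),
}

for name, kwargs in configs.items():
    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen2.5-7B-Instruct", device_map="auto", **kwargs
    )
    print(name, evaluate(model))  # evaluate(): your own accuracy/perplexity harness
    del model                     # free VRAM before loading the next variant
    torch.cuda.empty_cache()
```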
When one GPU isn’t enough
What if your model doesn’t fit in VRAM even at reduced precision? You have a few options.
Offloading to CPU RAM
Libraries like HuggingFace Transformers support device_map="auto", which fills GPU VRAM first, then spills remaining layers to CPU RAM, and can even offload to disk as a last resort.
It works, but every time a layer on CPU needs to run, data shuttles back and forth over the PCIe bus. For the same reasons we saw in the previous article — transfer overhead dominating compute — this is significantly slower than having everything on GPU.
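A sketch of capped offloading (the memory limits here are made-up numbers you would tune to your own hardware):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",                        # fill GPU VRAM first, then spill over
    max_memory={0: "20GiB", "cpu": "48GiB"},  # illustrative caps for GPU 0 and CPU RAM
    offload_folder="offload",                 # last resort: spill remaining layers to disk
)
```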
Splitting across multiple GPUs
There are two main strategies:
Pipeline parallelism assigns different layers to different GPUs. GPU 1 gets layers 0–19, GPU 2 gets layers 20–39. Data flows through GPU 1, then the activations transfer to GPU 2 for the remaining layers. This is simple to implement (it’s what device_map="auto" does across multiple GPUs), but only one GPU is active at a time per request — the others are idle, waiting for their turn in the pipeline.
Tensor parallelism splits individual layers across GPUs. A single matrix multiplication gets divided so each GPU computes a slice, then they combine results. This keeps all GPUs active simultaneously, but requires high-speed GPU-to-GPU interconnects (like NVLink at ~900 GB/s, vs PCIe at ~32 GB/s) because the GPUs need to exchange partial results on every layer.
For serious multi-GPU serving, tools like vLLM and TensorRT-LLM handle tensor parallelism automatically. For training, frameworks like DeepSpeed and FSDP (Fully Sharded Data Parallel) distribute both parameters and optimizer state across GPUs.
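With vLLM, for example, tensor parallelism is a single argument. A sketch, assuming two GPUs with a fast interconnect:

```python
from vllm import LLM, SamplingParams

# Shard every layer across 2 GPUs (tensor parallelism)
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", tensor_parallel_size=2)

outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```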
The same principle applies
The pattern is the same one from the previous article: the overhead of coordination (transfers, synchronization, scheduling) must be small relative to the actual compute, or the extra hardware doesn’t help. Tensor parallelism across slow interconnects is the multi-GPU equivalent of running tiny kernels on a big GPU — the communication overhead eats the gains.
What to take away
- VRAM is the hard constraint for larger models. Compute speed is one axis; memory is another. Know the formula (params × bytes per weight) before choosing hardware.
- Lower precision is (almost) free performance. Float16 inference uses half the memory with negligible quality loss. Int8/int4 quantization pushes further — evaluate the tradeoff for your use case.
- Multi-GPU has its own overhead tax. Splitting a model across GPUs introduces communication overhead. If the coordination cost exceeds the compute benefit, more hardware doesn’t help.
- Start with the math. Before spinning up a GPU instance, multiply your model’s parameter count by the bytes per weight at your target precision. If it fits in one GPU’s VRAM, you’re set. If not, decide between quantization, offloading, or multi-GPU — in that order of simplicity.