Deploying open source LLMs with Docker and GPUs on GCP
If you’ve ever wanted to run your own LLM — fully under your control, no API keys, no rate limits — this article will show you how. We’ll start from scratch: what a GPU Docker image actually is, how to package a simple model inside one, and then scale up to hosting a production-grade open source LLM on a GCP VM with serious GPU power.
What is a GPU Docker image?
A regular Docker container runs on CPU. To use a GPU inside a container, you need three things:
- NVIDIA drivers installed on the host machine
- NVIDIA Container Toolkit (formerly nvidia-docker) — a runtime that exposes GPUs to containers
- A CUDA base image — a Docker image with NVIDIA’s CUDA libraries pre-installed
NVIDIA publishes official CUDA images on Docker Hub. For example:
nvidia/cuda:12.8.0-base-ubuntu24.04
This image name breaks down as:
- nvidia/cuda: the official NVIDIA CUDA image repository
- 12.8.0: the CUDA toolkit version
- base: the image variant (more on this below)
- ubuntu24.04: the underlying OS
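If you script image selection (for example in CI), the same breakdown can be done in code. The following is a minimal sketch that assumes the repo:version-variant-os layout described above; parse_cuda_tag and CudaImageTag are hypothetical helper names, not part of any NVIDIA tooling.

```python
from typing import NamedTuple


class CudaImageTag(NamedTuple):
    repository: str    # e.g. "nvidia/cuda"
    cuda_version: str  # e.g. "12.8.0"
    variant: str       # e.g. "base", "runtime", "devel"
    os_name: str       # e.g. "ubuntu24.04"


def parse_cuda_tag(image: str) -> CudaImageTag:
    """Split 'repo:version-variant-os' into its four parts."""
    repository, _, tag = image.partition(":")
    if not tag:
        raise ValueError(f"no tag in image reference: {image!r}")
    # The version is everything before the first dash, the OS everything after the last.
    cuda_version, _, rest = tag.partition("-")
    variant, _, os_name = rest.rpartition("-")
    if not (cuda_version and variant and os_name):
        raise ValueError(f"unexpected tag layout: {tag!r}")
    return CudaImageTag(repository, cuda_version, variant, os_name)


print(parse_cuda_tag("nvidia/cuda:12.8.0-base-ubuntu24.04"))
```

Splitting the version from the left and the OS from the right keeps multi-word variants intact in the middle field.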
CUDA image variants
NVIDIA publishes three variants of each CUDA image, each building on the previous one:
| Variant | Contents | Size | Use case |
|---|---|---|---|
| base | CUDA runtime only | ~120 MB | Running pre-compiled CUDA apps |
| runtime | base + CUDA math libs + NCCL (cuDNN ships in the separate cudnn-runtime tags) | ~1.5 GB | Running ML inference frameworks |
| devel | runtime + headers + compiler toolchain | ~3.5 GB | Compiling CUDA code from source |
For deploying inference, runtime is usually the right choice. You don’t need devel unless you’re compiling custom CUDA kernels.
Verifying GPU access
The simplest smoke test is running nvidia-smi inside the container:
docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu24.04 nvidia-smi
The --gpus all flag tells Docker to expose all host GPUs to the container via the NVIDIA Container Toolkit. If this prints a table showing your GPU model, memory, and driver version, your GPU setup is working.
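For an automated check, nvidia-smi can also print machine-readable CSV. The sketch below parses that output in Python; the helper names are hypothetical, and the sample row uses illustrative values rather than output captured from a real machine.

```python
import subprocess

# nvidia-smi's CSV query mode: one row per GPU, no header line.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total,driver_version",
    "--format=csv,noheader",
]


def parse_gpu_csv(output: str) -> list[dict[str, str]]:
    """Turn nvidia-smi CSV rows into one dict per GPU."""
    gpus = []
    for row in output.strip().splitlines():
        name, memory, driver = (field.strip() for field in row.split(","))
        gpus.append({"name": name, "memory_total": memory, "driver_version": driver})
    return gpus


def list_gpus() -> list[dict[str, str]]:
    """Run nvidia-smi; raises if the binary is missing or no driver is loaded."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


# Illustrative row in the format the query above produces (values are made up).
sample = "Tesla T4, 15360 MiB, 570.86.15\n"
print(parse_gpu_csv(sample))
```

Calling list_gpus() inside a container started with --gpus all should return at least one entry; an exception or an empty list means the GPU is not visible.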
Building a minimal GPU inference container
Before we deploy a full LLM, let’s build a simple container that loads a small model and runs inference. This will make the mechanics clear.
The Dockerfile
FROM nvidia/cuda:12.8.0-runtime-ubuntu24.04
# Install Python and pip
RUN apt-get update && apt-get install -y \
    python3 \
    python3-pip \
    python3-venv \
    && rm -rf /var/lib/apt/lists/*
# Create a virtual environment
RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
# Install PyTorch with CUDA support and transformers
RUN pip install --no-cache-dir \
    torch \
    transformers \
    accelerate
# Copy inference script
COPY inference.py /app/inference.py
WORKDIR /app
CMD ["python3", "inference.py"]
The inference script
# inference.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"
print(f"Loading {model_name}...")
print(f"CUDA available: {torch.cuda.is_available()}")
# get_device_name raises when no GPU is visible, so only call it if CUDA is available
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Explain what Docker is in two sentences."}
]
# Render the chat template to a prompt string, then tokenize it
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
# Drop the prompt tokens so only the newly generated text is decoded
response = tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(f"\nResponse: {response}")
We’re using Qwen2.5-0.5B-Instruct, a tiny 0.5 billion parameter model. It’s small enough to download quickly and run on almost any GPU, making it perfect for testing the pipeline.
Build and run
docker build -t my-llm .
docker run --rm --gpus all my-llm
This works, but it has a major problem: the model downloads from Hugging Face every time you start the container. For a 0.5B model that’s mildly annoying. For a 70B model, it’s a dealbreaker. We’ll fix this later.
Deploying a real model: PaddleOCR as an API
The minimal container above runs inference once and exits. A real deployment needs to serve an API — accept requests, run the model, return results. Let’s build one using PaddleOCR, an open source OCR toolkit that detects and recognizes text in images.
This follows the same pattern we’ll use for the MNIST training server later: package a model inside a Docker container with a FastAPI server in front of it.
Not every model needs a GPU. PaddleOCR’s models are small (~150 MB total) and process images in under 100ms on CPU — fast enough for most production workloads. Using CPU keeps the Dockerfile simple and avoids CUDA version compatibility issues between PaddlePaddle and the base image. We’ll use GPUs where they actually matter: training neural networks and running LLMs.
The server
# server.py
from fastapi import FastAPI, HTTPException, UploadFile
from paddleocr import PaddleOCR
import numpy as np
import cv2
import uvicorn

app = FastAPI()
# Load the models once at startup (CPU only)
ocr = PaddleOCR(use_angle_cls=True, lang="en", use_gpu=False)

@app.post("/ocr")
async def run_ocr(file: UploadFile):
    contents = await file.read()
    img = cv2.imdecode(np.frombuffer(contents, np.uint8), cv2.IMREAD_COLOR)
    if img is None:
        # imdecode returns None when the upload is not a decodable image
        raise HTTPException(status_code=400, detail="Could not decode image")
    result = ocr.ocr(img, cls=True)
    lines = []
    # result[0] may be None when no text is detected, so fall back to an empty list
    for line in result[0] or []:
        bbox, (text, confidence) = line
        lines.append({"text": text, "confidence": round(float(confidence), 4), "bbox": bbox})
    return {"lines": lines}

@app.get("/health")
def health():
    return {"status": "ok"}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8080)
PaddleOCR loads three models on startup: text detection, text recognition, and angle classification. Once loaded, each request is just a forward pass through these models, so responses are fast.
The Dockerfile
FROM python:3.12-slim
RUN apt-get update && apt-get install -y \
    libgl1 libglib2.0-0 libgomp1 \
    && rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir \
    paddlepaddle "paddleocr>=2.6,<3" \
    fastapi uvicorn python-multipart
# Download OCR models at build time so startup is instant
RUN python3 -c "from paddleocr import PaddleOCR; PaddleOCR(use_angle_cls=True, lang='en')"
COPY server.py /app/server.py
WORKDIR /app
EXPOSE 8080
CMD ["python3", "server.py"]
Two things to notice:
- libgl1 and libglib2.0-0: OpenCV dependencies that aren’t in the slim Python image. Without them, import cv2 fails. This is a common gotcha when containerizing computer vision code.
- The RUN python3 -c "..." line downloads PaddleOCR’s models (~150 MB) during the build, baking them into the image. This is the “bake the model in” pattern: the container starts serving immediately with no download wait.
Build and run
docker build -t paddleocr-server .
docker run -d -p 8080:8080 paddleocr-server
Test it
# OCR a local image (substitute the path to your own file)
curl -X POST http://localhost:8080/ocr -F "file=@/path/to/image.jpg"
# Response:
# {
# "lines": [
# {"text": "TOTAL", "confidence": 0.9987, "bbox": [[45, 120], [130, 120], [130, 145], [45, 145]]},
# {"text": "$42.50", "confidence": 0.9954, "bbox": [[150, 120], [240, 120], [240, 145], [150, 145]]}
# ]
# }
This is the general pattern for deploying any Python-based model as a GPU API (the PaddleOCR server above is the CPU variant of it: a plain Python base image and no --gpus flag):
- Start from a CUDA runtime image
- Install your ML framework and model dependencies
- Download model weights at build time
- Wrap inference in a FastAPI server
- Run with --gpus all
The MNIST training server later in this article uses the same pattern, and so would any custom model you want to deploy — a fine-tuned classifier, a speech-to-text model, an image segmentation pipeline. The Dockerfile structure and FastAPI wrapper are always the same; only the model and the request/response schema change.
Key concepts before scaling up
Before we deploy a real LLM, let’s understand a few concepts that will determine our hardware choices.
Model size and GPU memory
How do you know which GPU (or how many GPUs) you need for a given model? You don’t have to guess — you can calculate it. Every model parameter is a number stored in a specific precision, and each precision uses a known number of bytes:
| Precision | Bytes per parameter |
|---|---|
| FP32 | 4 bytes |
| FP16/BF16 | 2 bytes |
| INT8 | 1 byte |
| INT4 | 0.5 bytes |
The formula is straightforward:
$$\text{memory (bytes)} = \text{number of parameters} \times \text{bytes per parameter}$$
A small example: the MNIST network
To make this concrete, let’s start with the small neural network from the previous article — a 2-layer network that classifies handwritten digits:
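784 inputs → 128 hidden neurons → 10 output classes
That’s $784 \times 128 + 128 + 128 \times 10 + 10 = 101{,}770$ parameters (weights plus biases). How much memory does it need?
$$101{,}770 \times 4 \text{ bytes (FP32)} = 407{,}080 \text{ bytes} \approx 0.4 \text{ MB}$$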
Under half a megabyte. This model runs comfortably on any hardware — it would fit on a GPU from 2005. Even the larger version we experimented with (784→512→256→10, 535,818 parameters) is only ~2 MB in FP32. These models are so small that hardware planning isn’t really a question — they don’t need a GPU.
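The same arithmetic is easy to script when you want to size other fully connected networks. This is a small sketch; dense_param_count and weight_bytes are hypothetical helper names, and 1 MB is taken as 10^6 bytes.

```python
# Bytes per parameter, from the precision table above.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}


def dense_param_count(layer_sizes: list[int]) -> int:
    """Weights plus biases for a fully connected network."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def weight_bytes(params: int, precision: str = "fp32") -> float:
    """Memory needed to store the parameters at the given precision."""
    return params * BYTES_PER_PARAM[precision]


for sizes in ([784, 128, 10], [784, 512, 256, 10]):
    params = dense_param_count(sizes)
    print(sizes, f"{params:,} parameters", f"{weight_bytes(params) / 1e6:.2f} MB in FP32")
```

Both networks discussed here come out at 101,770 and 535,818 parameters respectively.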
But speed matters when you’re experimenting. The MNIST article has ~20 training runs — different learning rates, batch sizes, architectures, activation functions. On a basic CPU VM, each run takes ~30 seconds, so the full set of experiments takes 5–10 minutes of waiting. On a cheap GPU, each run takes 3–5 seconds — experimentation feels instant.
| Setup | Per run (5 epochs) | All experiments (~20 runs) | Cost/hour |
|---|---|---|---|
| e2-medium (CPU only) | ~30s | ~10 min | ~$0.03 |
| n1-standard-4 + 1x T4 | ~3-5s | ~1-2 min | ~$0.80 |
The T4 has 16 GB of VRAM — absurd overkill for a 0.4 MB model, but it’s the cheapest GPU on GCP and the parallelism on matrix multiplications makes a noticeable difference even for small matrices.
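To put the table in dollar terms, here is a rough sketch of what one full batch of experiments costs on each setup. It uses the approximate figures above, takes 4 seconds as the midpoint for the T4, and counts only time spent training (idle VM time is billed too).

```python
def batch_cost(seconds_per_run: float, runs: int, dollars_per_hour: float) -> tuple[float, float]:
    """Return (wall-clock minutes, dollars) for a batch of sequential runs."""
    total_seconds = seconds_per_run * runs
    return total_seconds / 60, total_seconds / 3600 * dollars_per_hour


# Approximate figures from the table above.
setups = {
    "e2-medium (CPU only)": batch_cost(30, 20, 0.03),
    "n1-standard-4 + 1x T4": batch_cost(4, 20, 0.80),
}
for name, (minutes, dollars) in setups.items():
    print(f"{name}: {minutes:.1f} min, ${dollars:.3f}")
```

Under these assumptions the GPU batch costs a couple of cents, a bit more than the CPU batch, and saves several minutes of waiting per round of experiments.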
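Since you only need the GPU for a few minutes at a time, create the VM once, set up your environment, and then suspend it when you’re done. Suspended VMs save their full state (memory, running processes, installed packages) to disk, so when you resume, everything is exactly where you left it in ~10–20 seconds, with no re-provisioning needed. You only pay for storage while suspended. Suspend has documented limitations, and Compute Engine has not always allowed it for VMs with attached GPUs, so check the current documentation for your machine type. If the suspend command is rejected, stopping and starting the VM keeps the disk (installed packages and data) but not the memory state.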
# One-time setup: create the VM
gcloud compute instances create mnist-experiments \
--zone=us-central1-a \
--machine-type=n1-standard-4 \
--accelerator=type=nvidia-tesla-t4,count=1 \
--boot-disk-size=50GB \
--image-family=common-cu128-ubuntu-2404-nvidia-570 \
--image-project=deeplearning-platform-release \
--maintenance-policy=TERMINATE \
--metadata="install-nvidia-driver=True"
# SSH in, install your dependencies, run experiments...
gcloud compute ssh mnist-experiments --zone=us-central1-a
# When done — suspend (state saved, GPU charges stop)
gcloud compute instances suspend mnist-experiments --zone=us-central1-a
# Next day — resume in ~10-20 seconds, everything intact
gcloud compute instances resume mnist-experiments --zone=us-central1-a
# When you're done for good — delete to stop all charges
gcloud compute instances delete mnist-experiments --zone=us-central1-a
The workflow is: create once → suspend/resume as needed → delete when done for good. While the VM is suspended you pay only for storage (about $2/month), with no GPU or CPU charges.
Running experiments remotely
If you don’t want to stay SSH’d into the VM watching terminal output, you can wrap the training in a small FastAPI service. Upload a Python training script, let it run in the background, and poll for logs from your local machine.
You don’t need a heavy framework for this. Tools like Ray and MLflow are great when you need distributed training or experiment tracking across teams, but for running a few experiments on a single GPU they’re overkill. A FastAPI app that runs uploaded scripts in a subprocess is about 80 lines and does exactly what we need.
The approach is simple: you upload a .py file, the server saves it, runs it as a subprocess with full GPU access, and captures everything it prints to stdout. Since we’re writing Python training scripts throughout this series, this is a natural fit — your experiment is the script. You define the model architecture, the optimizer, the training loop, everything. The server just runs it and streams the output.
# train_server.py
import uuid
import subprocess
import threading
from collections import deque
from fastapi import FastAPI, HTTPException, UploadFile
import uvicorn

app = FastAPI()

class Job:
    def __init__(self, name: str, script_path: str):
        self.id = uuid.uuid4().hex[:8]
        self.name = name
        self.script_path = script_path
        self.status = "queued"  # queued | running | completed | failed
        self.logs: list[str] = []
        self.error: str | None = None

    def log(self, message: str):
        self.logs.append(message)
        print(f"[{self.name}] {message}")

# --- Job queue and runner ---
jobs: dict[str, Job] = {}
job_queue: deque[Job] = deque()
runner_lock = threading.Lock()

def run_job(job: Job):
    job.status = "running"
    job.log(f"Running: {job.name}")
    try:
        proc = subprocess.Popen(
            ["python3", "-u", job.script_path],  # -u for unbuffered output
            stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
            text=True, bufsize=1,
        )
        for line in proc.stdout:
            job.log(line.rstrip())
        proc.wait()
        if proc.returncode == 0:
            job.status = "completed"
            job.log("Done.")
        else:
            job.status = "failed"
            job.error = f"Exit code {proc.returncode}"
    except Exception as e:
        job.status = "failed"
        job.error = str(e)
        job.log(f"Error: {e}")

def process_queue():
    """Process jobs sequentially: one GPU, one job at a time."""
    while job_queue:
        run_job(job_queue.popleft())

def enqueue(job: Job):
    jobs[job.id] = job
    job_queue.append(job)
    # Start a runner thread only if none is active (the lock is held while one runs)
    if runner_lock.acquire(blocking=False):
        def run():
            try:
                process_queue()
            finally:
                runner_lock.release()
        threading.Thread(target=run, daemon=True).start()

# --- API ---
@app.post("/jobs")
async def submit_job(file: UploadFile):
    script_path = f"/tmp/train_{uuid.uuid4().hex[:8]}.py"
    with open(script_path, "wb") as f:
        f.write(await file.read())
    job = Job(name=file.filename, script_path=script_path)
    enqueue(job)
    return {"job_id": job.id, "name": job.name, "status": job.status}

@app.get("/jobs")
def list_jobs():
    return [
        {"job_id": j.id, "name": j.name, "status": j.status}
        for j in jobs.values()
    ]

@app.get("/jobs/{job_id}")
def get_job(job_id: str):
    job = jobs.get(job_id)
    if not job:
        raise HTTPException(status_code=404, detail="Job not found")
    return {
        "job_id": job.id, "name": job.name, "status": job.status,
        "error": job.error, "log_lines": len(job.logs),
    }

@app.get("/jobs/{job_id}/logs")
def get_job_logs(job_id: str, since: int = 0):
    job = jobs.get(job_id)
    if not job:
        raise HTTPException(status_code=404, detail="Job not found")
    return {"logs": job.logs[since:], "total": len(job.logs)}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8080)
Run the server on the VM:
# python-multipart is required by FastAPI for file uploads (UploadFile)
pip install fastapi uvicorn python-multipart
python train_server.py
Now from your local machine, upload a training script and poll for progress:
# Submit a training script
curl -X POST http://$VM_IP:8080/jobs -F "file=@experiment.py"
# {"job_id": "a1b2c3d4", "name": "experiment.py", "status": "queued"}
# Poll logs (use ?since=N to only get new lines; quote the URL so the shell ignores the ?)
curl "http://$VM_IP:8080/jobs/a1b2c3d4/logs?since=0"
# Check status
curl http://$VM_IP:8080/jobs/a1b2c3d4
# List all jobs
curl http://$VM_IP:8080/jobs
The -u flag in the Popen call ensures Python’s output is unbuffered, so each print() in your training script shows up in the job’s logs immediately. Jobs run sequentially, one at a time, so they never compete for the single GPU, and the queue lets you submit several scripts and walk away.
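Polling by hand with curl gets tedious, so here is a minimal client sketch that follows a job's logs until it finishes. It uses only the standard library and the two endpoints defined above; follow_logs and http_get_json are hypothetical names, and the fetch function is injectable so the loop can be exercised without a running server.

```python
import json
import time
import urllib.request
from typing import Callable, Iterator


def http_get_json(url: str) -> dict:
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def follow_logs(base_url: str, job_id: str,
                get_json: Callable[[str], dict] = http_get_json,
                poll_seconds: float = 2.0) -> Iterator[str]:
    """Yield new log lines until the job is completed or failed."""
    seen = 0
    while True:
        # Read the status before the logs, so a job that finishes between the
        # two calls still gets one more log fetch on the next iteration.
        status = get_json(f"{base_url}/jobs/{job_id}")["status"]
        page = get_json(f"{base_url}/jobs/{job_id}/logs?since={seen}")
        yield from page["logs"]
        seen = page["total"]  # next poll asks only for lines after this index
        if status in ("completed", "failed"):
            return
        time.sleep(poll_seconds)


# Usage (needs the server running; substitute your VM's address and job id):
# for line in follow_logs("http://VM_IP:8080", "a1b2c3d4"):
#     print(line)
```

Tracking the total returned by the server, rather than counting lines locally, keeps the since offset in step with the server's own log list.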
A training script is just a regular Python file. Here’s one that runs the baseline experiment from the MNIST article:
# experiment.py
from tensorflow import keras

(train_images, train_labels), (test_images, test_labels) = keras.datasets.mnist.load_data()
X_train = train_images.reshape(60000, 784).astype("float32") / 255.0
X_test = test_images.reshape(10000, 784).astype("float32") / 255.0

model = keras.Sequential([
    keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.1),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.fit(X_train, train_labels, epochs=5, batch_size=32,
          validation_data=(X_test, test_labels))
Because the script is the experiment, you have full control: change the model architecture, swap the optimizer, add dropout or convolutional layers, try a completely different approach. Write a new .py file, upload it, and poll for results. No need to modify the server.
Note: This server executes uploaded Python scripts on your VM. This is fine for a personal experiment machine that only you can reach — don’t expose it to the internet without authentication.
Scaling up to LLMs
The math is the same for large language models — but the numbers get serious:
| Precision | 7B model | 70B model |
|---|---|---|
| FP32 | 28 GB | 280 GB |
| FP16/BF16 | 14 GB | 140 GB |
| INT8 | 7 GB | 70 GB |
| INT4 | 3.5 GB | 35 GB |
These numbers are just for the model weights. Inference also requires memory for KV-cache (which stores attention state for each token in the context), activations, and framework overhead. A safe rule of thumb: plan for 1.2x the weight size at minimum.
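So the full formula for picking a GPU is:
$$\text{GPU memory} \geq \text{number of parameters} \times \text{bytes per parameter} \times 1.2$$
This means a 7B model in FP16 needs roughly $14 \text{ GB} \times 1.2 \approx 17 \text{ GB}$ of GPU memory. A single NVIDIA L4 (24 GB) or A100 (40/80 GB) can comfortably run it. A 70B model in FP16 needs ~168 GB, more than any single GPU covered in this article, so you’d need multiple GPUs or quantization.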
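As a quick sanity check, the sizing rule can be written as a few lines of Python. This sketch reproduces the weight table and applies the 1.2x rule of thumb; the function names are hypothetical, and 1 GB is taken as 10^9 bytes.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_gb(params_billions: float, precision: str) -> float:
    """Weight size in GB: billions of parameters times bytes per parameter."""
    return params_billions * BYTES_PER_PARAM[precision]


def min_vram_gb(params_billions: float, precision: str, overhead: float = 1.2) -> float:
    """Weights plus the rule-of-thumb allowance for KV-cache and framework overhead."""
    return weights_gb(params_billions, precision) * overhead


for size in (7, 70):
    for precision in ("fp32", "fp16", "int8", "int4"):
        print(f"{size}B {precision}: weights {weights_gb(size, precision):g} GB, "
              f"plan for at least {min_vram_gb(size, precision):.1f} GB")
```

The overhead factor is a parameter on purpose: 1.2 is a floor, and longer contexts or more concurrent requests push it higher.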
What requires empirical testing
The 1.2x multiplier is a minimum estimate. The actual overhead depends on factors you can’t fully predict from the parameter count alone:
- Context length: KV-cache grows linearly with max_model_len. A 72B model at 4,096 context uses far less memory than at 32,768.
- Concurrent requests: each concurrent request needs its own KV-cache allocation. How many users you can serve simultaneously depends on your traffic patterns.
- Throughput and latency: the parameter count tells you whether a model fits on a GPU, but not how fast it runs. Tokens per second, time to first token, and requests per second under load all require benchmarking with realistic workloads.
In short: VM sizing is calculable, performance tuning is empirical. The formula above tells you which GPU to rent. Load testing tells you whether it’s fast enough.
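The KV-cache part is the one piece of that overhead you can still estimate on paper: each token stores one key and one value vector per layer and per KV head. The sketch below assumes a model shape of 80 layers, 8 KV heads, and a head dimension of 128 for Qwen2.5-72B; treat those numbers as an assumption and verify them against the model's config.json.

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Keys and values (the factor 2) for every layer and KV head, for one token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


# Assumed Qwen2.5-72B shape with an FP16 cache; check config.json before relying on it.
per_token = kv_cache_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
print(f"{per_token:,} bytes per token")
for context in (4096, 32768):
    print(f"{context} tokens: {per_token * context / 2**30:.2f} GiB per sequence")
```

Multiply the per-sequence figure by the number of concurrent requests to see how quickly long contexts eat into the memory left after the weights.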
Quantization
Quantization reduces precision to shrink model size and speed up inference. A 70B model quantized to INT4 fits in ~35 GB — achievable on a single A100 80 GB. The quality trade-off is often surprisingly small: INT4 quantized models typically score within a few percent of their full-precision counterparts on benchmarks.
The most common quantization formats you’ll encounter:
- GPTQ — post-training quantization, requires a calibration dataset
- AWQ — activation-aware quantization, generally better quality than GPTQ
- GGUF — used by llama.cpp, supports CPU+GPU split inference
Many models on Hugging Face come pre-quantized. Look for names like Qwen2.5-72B-Instruct-AWQ or Qwen2.5-72B-Instruct-GPTQ-Int4.
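To build intuition for what reducing precision means, here is a deliberately naive sketch: symmetric round-to-nearest quantization with a single scale for the whole tensor. It is not GPTQ or AWQ, which choose scales far more carefully and work per group of weights, and the weight values are made up for illustration.

```python
def quantize(weights: list[float], bits: int) -> tuple[list[int], float]:
    """Map floats to signed integer codes using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    # Fall back to a scale of 1.0 if every weight is zero, to avoid dividing by zero.
    scale = (max(abs(w) for w in weights) / qmax) or 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(codes: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.53, 0.97, -0.08, 0.31]  # made-up example values
for bits in (8, 4):
    codes, scale = quantize(weights, bits)
    restored = dequantize(codes, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"INT{bits}: codes={codes} scale={scale:.4f} max error={worst:.4f}")
```

The rounding error per weight is at most half the scale, so INT4 (15 levels) is much coarser than INT8 (255 levels); the calibration-based formats listed above exist to keep that coarseness from hurting model quality.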
Inference frameworks
You could load models directly with transformers like we did above, but production deployments use specialized inference servers that handle batching, caching, and serving efficiently:
| Framework | Best for | Key features |
|---|---|---|
| vLLM | High-throughput serving | PagedAttention, continuous batching, OpenAI-compatible API |
| TGI (Text Generation Inference) | Hugging Face ecosystem | Built-in quantization, easy model loading |
| Ollama | Simplicity | One-command setup, manages model downloads |
| SGLang | Advanced use cases | Structured generation, RadixAttention |
For most deployments, vLLM is the default choice. It’s fast, battle-tested, and exposes an OpenAI-compatible API, so your existing code that calls OpenAI can point at your self-hosted model with just a URL change.
Deploying Qwen 72B on GCP with vLLM
Now let’s deploy a serious model. We’ll use Qwen2.5-72B-Instruct-AWQ (a quantized 72B model) served by vLLM on a GCP VM with NVIDIA GPUs.
Step 1: Choose the right GPU VM
GCP offers several GPU options. Here’s what matters for LLM inference:
| GPU | VRAM | Good for | GCP machine type |
|---|---|---|---|
| T4 | 16 GB | Small models (7B quantized) | n1-standard-8 + 1x T4 |
| L4 | 24 GB | Medium models (7B FP16, 13B quantized) | g2-standard-12 + 1x L4 |
| A100 40 GB | 40 GB | Large models (30B quantized) | a2-highgpu-1g |
| A100 80 GB | 80 GB | 70B quantized on single GPU | a2-ultragpu-1g |
| H100 80 GB | 80 GB | Fastest inference, 70B+ models | a3-highgpu-1g |
For Qwen 72B AWQ (~36 GB of weights by the formula, ~40 GB as downloaded), a single A100 40 GB is too small once KV-cache is added. Either one A100 80 GB or two A100 40 GB (using tensor parallelism) leaves comfortable room for KV-cache with long contexts.
We’ll use an a2-ultragpu-1g instance (1x A100 80 GB).
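The selection logic can be sketched as a lookup over the table above. The 0.9 headroom factor is an assumption (it leaves 10% of VRAM unused), and smallest_fit is a hypothetical helper, not a GCP or vLLM API.

```python
# (GPU, VRAM in GB, machine type) copied from the table above, smallest first.
# The H100 row is left out because it has the same VRAM as the A100 80 GB.
GPU_OPTIONS = [
    ("T4", 16, "n1-standard-8 + 1x T4"),
    ("L4", 24, "g2-standard-12 + 1x L4"),
    ("A100 40 GB", 40, "a2-highgpu-1g"),
    ("A100 80 GB", 80, "a2-ultragpu-1g"),
]


def smallest_fit(required_gb: float, headroom: float = 0.9):
    """Return (gpu, machine_type) for the first option that covers the requirement."""
    for name, vram_gb, machine_type in GPU_OPTIONS:
        if vram_gb * headroom >= required_gb:
            return name, machine_type
    return None  # nothing in the table fits on a single GPU


# ~40 GB of AWQ weights times the 1.2x rule of thumb.
print(smallest_fit(40 * 1.2))
# 7B in FP16: 14 GB of weights times 1.2.
print(smallest_fit(14 * 1.2))
```

A result of None is the signal to look at multi-GPU machine types or a more aggressive quantization.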
Step 2: Create the VM
gcloud compute instances create llm-server \
  --zone=us-central1-a \
  --machine-type=a2-ultragpu-1g \
  --boot-disk-size=200GB \
  --image-family=common-cu128-ubuntu-2404-nvidia-570 \
  --image-project=deeplearning-platform-release \
  --maintenance-policy=TERMINATE \
  --metadata="install-nvidia-driver=True"
Key flags:
- --image-family=common-cu128-ubuntu-2404-nvidia-570 selects GCP’s Deep Learning VM image, which comes with NVIDIA drivers, CUDA, and the NVIDIA Container Toolkit pre-installed. This saves a lot of setup.
- --boot-disk-size=200GB leaves room for the weights. LLM weights are large: the 72B AWQ model is ~40 GB.
- --maintenance-policy=TERMINATE is required for GPU instances. GCP can’t live-migrate GPU VMs, so it terminates them instead.
Step 3: Install Docker on the VM
SSH into the instance:
gcloud compute ssh llm-server --zone=us-central1-a
The Deep Learning VM image usually has Docker pre-installed. Verify:
docker --version
nvidia-smi
If Docker isn’t installed:
# Install Docker
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
# Install NVIDIA Container Toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Re-login for group changes
exit
Step 4: Launch vLLM
This is where it all comes together. One command to serve a 72B model with an OpenAI-compatible API:
docker run -d \
  --name vllm \
  --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-72B-Instruct-AWQ \
  --quantization awq \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.9
Let’s break this down:
- -v ~/.cache/huggingface:/root/.cache/huggingface mounts the host’s Hugging Face cache into the container. This means the model downloads once and persists across container restarts. This fixes the “download every time” problem from our simple example.
- --ipc=host shares the host’s IPC namespace, needed for PyTorch’s shared memory when using multiple workers.
- --model Qwen/Qwen2.5-72B-Instruct-AWQ names the model to serve. vLLM downloads it from Hugging Face automatically.
- --quantization awq tells vLLM to use AWQ dequantization kernels.
- --max-model-len 4096 limits the context window. Shorter context = less KV-cache memory = more room for batching.
- --gpu-memory-utilization 0.9 lets vLLM use 90% of GPU memory. vLLM pre-allocates memory for KV-cache, so this controls how much it reserves.
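To see how these flags interact, here is a back-of-the-envelope sketch of the KV-cache budget left after the weights are loaded. It ignores activation memory and other runtime overhead that vLLM also reserves, so treat the result as an upper bound; the 327,680 bytes per token is an assumed FP16 KV-cache footprint for this model, and kv_token_budget is a hypothetical helper.

```python
def kv_token_budget(vram_gb: float, utilization: float, weights_gb: float,
                    kv_bytes_per_token: int) -> int:
    """Tokens of KV-cache that fit after the weights are loaded (rough upper bound)."""
    budget_bytes = (vram_gb * utilization - weights_gb) * 1e9
    return max(0, int(budget_bytes // kv_bytes_per_token))


# One A100 80 GB, --gpu-memory-utilization 0.9, ~40 GB of AWQ weights.
tokens = kv_token_budget(vram_gb=80, utilization=0.9, weights_gb=40,
                         kv_bytes_per_token=327_680)
max_model_len = 4096
print(f"~{tokens:,} cached tokens, about {tokens // max_model_len} "
      f"full-length sequences at --max-model-len {max_model_len}")
```

Raising --max-model-len does not change the token budget; it changes how many full-length requests that budget can hold at once.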
The first run will take a while as it downloads the model (~40 GB). Watch the logs:
docker logs -f vllm
Once you see Uvicorn running on http://0.0.0.0:8000, the server is ready.
Step 5: Test the API
From the VM itself:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-72B-Instruct-AWQ",
"messages": [
{"role": "user", "content": "What is the capital of France?"}
],
"max_tokens": 100
}'
The API is OpenAI-compatible, so you can use any OpenAI SDK client by changing the base URL:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM doesn't require auth by default
)
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct-AWQ",
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
)
print(response.choices[0].message.content)
Step 6: Expose the API externally
By default, the VM’s port 8000 isn’t accessible from the internet. To open it:
# Create a firewall rule
gcloud compute firewall-rules create allow-vllm \
--direction=INGRESS \
--action=ALLOW \
--rules=tcp:8000 \
--target-tags=vllm-server
# Add the tag to the VM
gcloud compute instances add-tags llm-server \
--zone=us-central1-a \
  --tags=vllm-server
Now you can access the API from anywhere using the VM’s external IP:
EXTERNAL_IP=$(gcloud compute instances describe llm-server \
--zone=us-central1-a \
--format='get(networkInterfaces[0].accessConfigs[0].natIP)')
curl http://$EXTERNAL_IP:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-72B-Instruct-AWQ",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Important: This exposes an unauthenticated API to the internet. For production, put it behind a reverse proxy with authentication, or use GCP’s Identity-Aware Proxy (IAP). vLLM also supports an --api-key flag to require a bearer token.
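With a bearer token in place, clients send it in the Authorization header (the OpenAI SDK does this with whatever you pass as api_key). The sketch below builds such a request with the standard library only, without sending it; EXTERNAL_IP and my-secret-token are placeholders, and chat_request is a hypothetical helper.

```python
import json
import urllib.request


def chat_request(base_url: str, api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a chat completion request carrying a bearer token."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Placeholder host and token; substitute your VM's external IP and your own key.
req = chat_request("http://EXTERNAL_IP:8000", "my-secret-token",
                   "Qwen/Qwen2.5-72B-Instruct-AWQ", "Hello!")
print(req.get_method(), req.full_url)
# To send it: json.load(urllib.request.urlopen(req, timeout=60))
```

A token over plain HTTP can still be sniffed in transit, which is one more reason to terminate TLS at a reverse proxy in front of the server.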
Multi-GPU setups
If your model doesn’t fit on a single GPU, vLLM supports tensor parallelism — splitting the model across multiple GPUs:
docker run -d \
  --name vllm \
  --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-72B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9
The --tensor-parallel-size 4 flag splits the model across 4 GPUs. For this to work, you need a VM with that many GPUs, like a2-ultragpu-4g (4x A100 80 GB). With 40 GB cards you would need more of them, for example a2-highgpu-8g (8x A100 40 GB) with --tensor-parallel-size 8.
The full unquantized 72B model in FP16 has ~144 GB of weights, or ~173 GB with the 1.2x overhead rule. That does not fit in 2x A100 80 GB (160 GB total, of which vLLM uses 90%), but 4x A100 80 GB (320 GB total) runs it with room to spare for KV-cache.
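A quick way to choose the tensor parallel size is to divide the estimated requirement by the number of GPUs and compare it with what each GPU can offer. This sketch applies the 1.2x rule of thumb from earlier and a 0.9 utilization factor; it assumes the model shards evenly, and the helper names are hypothetical. Note that vLLM generally also requires the tensor parallel size to divide the model's number of attention heads.

```python
def per_gpu_need_gb(params_billions: float, bytes_per_param: float, tp_size: int,
                    overhead: float = 1.2) -> float:
    """Approximate memory each GPU must hold when the model is sharded evenly."""
    return params_billions * bytes_per_param * overhead / tp_size


def fits(params_billions: float, bytes_per_param: float, tp_size: int,
         vram_gb: float, utilization: float = 0.9) -> bool:
    """True if each GPU's share fits in the memory vLLM is allowed to use."""
    return per_gpu_need_gb(params_billions, bytes_per_param, tp_size) <= vram_gb * utilization


# Unquantized 72B model in FP16 (2 bytes per parameter) on A100 80 GB cards.
for tp in (2, 4, 8):
    need = per_gpu_need_gb(72, 2, tp)
    print(f"tensor-parallel-size {tp}: ~{need:.1f} GB per GPU, "
          f"fits on A100 80 GB: {fits(72, 2, tp, 80)}")
```

Under these assumptions two 80 GB cards fall short for the FP16 model, while four leave a comfortable margin.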
Production considerations
Keeping the model warm
Container restarts re-download the model unless you persist the cache. We handled this with the -v mount, but you can go further by baking the model into the image:
FROM vllm/vllm-openai:latest
# Download model at build time
RUN python3 -c "from huggingface_hub import snapshot_download; \
    snapshot_download('Qwen/Qwen2.5-72B-Instruct-AWQ')"
This creates a larger image (~40 GB+) but guarantees zero download time on startup. For GCP specifically, push this image to Artifact Registry in the same region as your VM to minimize pull time:
# Build the image from the Dockerfile above (my-vllm is the local tag used below)
docker build -t my-vllm .
# Tag and push to Artifact Registry
docker tag my-vllm us-central1-docker.pkg.dev/my-project/my-repo/vllm-qwen72b:latest
docker push us-central1-docker.pkg.dev/my-project/my-repo/vllm-qwen72b:latest
Cost management
GPU VMs are expensive. An a2-ultragpu-1g costs roughly $8-12/hour on-demand. Strategies to manage costs:
- Preemptible/Spot VMs: 60-91% cheaper, but GCP can reclaim them with 30 seconds’ notice. Good for batch workloads, risky for serving.
- Committed use discounts — 1 or 3-year commitments for 37-55% off. Good if you know you need sustained GPU capacity.
- Stop when idle — use a startup script that launches vLLM, and stop the VM when you’re not using it. You only pay for the disk when stopped.
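To compare these strategies in dollars, a rough monthly estimate is enough. The sketch below uses ~$10/hour as the on-demand rate for an a2-ultragpu-1g and the low ends of the discount ranges above; all figures are approximations, disk charges are not included, and monthly_cost is a hypothetical helper.

```python
def monthly_cost(dollars_per_hour: float, hours_per_day: float = 24,
                 discount: float = 0.0, days: int = 30) -> float:
    """Compute-only cost for a month at the given usage and discount."""
    return dollars_per_hour * (1 - discount) * hours_per_day * days


scenarios = {
    "on-demand, always on": monthly_cost(10),
    "spot at 60% off, always on": monthly_cost(10, discount=0.60),
    "committed use at 37% off": monthly_cost(10, discount=0.37),
    "on-demand, stopped outside 8 h/day": monthly_cost(10, hours_per_day=8),
}
for name, dollars in scenarios.items():
    print(f"{name}: ${dollars:,.0f}/month")
```

For a server that is only needed during working hours, stopping it when idle gives savings comparable to the spot discount without the risk of preemption.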
# Stop the VM (keeps disk, no GPU charges)
gcloud compute instances stop llm-server --zone=us-central1-a
# Start it back up
gcloud compute instances start llm-server --zone=us-central1-a
Health checks and auto-restart
Add a Docker restart policy and a health check. The restart policy brings the container back if the process exits or the VM reboots. The health check only reports status (docker ps shows healthy or unhealthy); Docker by itself does not restart a container for being unhealthy.
docker run -d \
  --name vllm \
  --gpus all \
  --restart unless-stopped \
  --health-cmd="curl -f http://localhost:8000/health || exit 1" \
  --health-interval=30s \
  --health-timeout=10s \
  --health-retries=3 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-72B-Instruct-AWQ \
  --quantization awq \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.9
Quick reference
Here’s a cheat sheet for common model sizes and recommended GCP setups:
| Model | Precision | VRAM needed | GCP setup | ~Cost/hour |
|---|---|---|---|---|
| Qwen2.5-7B | FP16 | ~17 GB | 1x L4 (g2-standard-12) | ~$1.50 |
| Qwen2.5-7B | INT4 | ~5 GB | 1x T4 (n1-standard-8) | ~$0.80 |
| Qwen2.5-32B | AWQ | ~18 GB | 1x L4 (g2-standard-12) | ~$1.50 |
| Qwen2.5-72B | AWQ | ~40 GB | 1x A100 80 GB | ~$10 |
| Qwen2.5-72B | FP16 | ~173 GB | 4x A100 80 GB (a2-ultragpu-4g) | ~$40 |
| Llama 3 70B | AWQ | ~38 GB | 1x A100 80 GB | ~$10 |
Prices are approximate on-demand rates and vary by region.
Wrapping up
The path from “I want to run my own LLM” to actually doing it is shorter than it looks:
- Pick a model and precision that fits your GPU budget
- Use a serving framework like vLLM — don’t reinvent the wheel
- Run it in a GPU-enabled Docker container on a GCP VM
- Mount the model cache or bake it into the image so restarts are fast
The OpenAI-compatible API means your application code doesn’t need to know or care whether it’s talking to OpenAI, Anthropic, or your own Qwen instance running on a VM in us-central1. You just change the base URL.