Deploying open source LLMs with Docker and GPUs on GCP

If you’ve ever wanted to run your own LLM — fully under your control, no API keys, no rate limits — this article will show you how. We’ll start from scratch: what a GPU Docker image actually is, how to package a simple model inside one, and then scale up to hosting a production-grade open source LLM on a GCP VM with serious GPU power.

What is a GPU Docker image?

A regular Docker container runs on CPU. To use a GPU inside a container, you need three things:

  1. NVIDIA drivers installed on the host machine
  2. NVIDIA Container Toolkit (formerly nvidia-docker) — a runtime that exposes GPUs to containers
  3. A CUDA base image — a Docker image with NVIDIA’s CUDA libraries pre-installed

NVIDIA publishes official CUDA images on Docker Hub. For example:

nvidia/cuda:12.8.0-base-ubuntu24.04

This image name breaks down as:

  • nvidia/cuda — the official NVIDIA CUDA image repository
  • 12.8.0 — the CUDA toolkit version
  • base — the image variant (more on this below)
  • ubuntu24.04 — the underlying OS
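To make the naming scheme concrete, here is a minimal sketch that splits an image reference into those four parts. The `parse_cuda_image` helper and the `CudaImageTag` type are hypothetical names introduced only for illustration, and the parser assumes the `version-variant-os` tag layout described above.

```python
from typing import NamedTuple


class CudaImageTag(NamedTuple):
    repository: str
    cuda_version: str
    variant: str
    os: str


def parse_cuda_image(image: str) -> CudaImageTag:
    """Split an image reference like 'nvidia/cuda:12.8.0-base-ubuntu24.04'."""
    repository, _, tag = image.partition(":")
    # The first field is the CUDA version and the last is the OS; whatever
    # sits between them is treated as the variant.
    parts = tag.split("-")
    if len(parts) < 3:
        raise ValueError(f"unexpected tag format: {tag!r}")
    return CudaImageTag(repository, parts[0], "-".join(parts[1:-1]), parts[-1])


print(parse_cuda_image("nvidia/cuda:12.8.0-base-ubuntu24.04"))
```

A helper like this is mostly useful in build scripts that need to check which CUDA version or variant an image is pinned to.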

CUDA image variants

NVIDIA publishes three variants of each CUDA image, each building on the previous one:

| Variant | Contents | Size | Use case |
| --- | --- | --- | --- |
| base | CUDA runtime only | ~120 MB | Running pre-compiled CUDA apps |
| runtime | base + CUDA math libs + NCCL (cuDNN ships in the separate cudnn-runtime tags) | ~1.5 GB | Running ML inference frameworks |
| devel | runtime + headers + compiler toolchain | ~3.5 GB | Compiling CUDA code from source |

For deploying inference, runtime is usually the right choice. You don’t need devel unless you’re compiling custom CUDA kernels.

Verifying GPU access

The simplest smoke test is running nvidia-smi inside the container:

docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu24.04 nvidia-smi

The --gpus all flag tells Docker to expose all host GPUs to the container via the NVIDIA Container Toolkit. If this prints a table showing your GPU model, memory, and driver version — your GPU setup is working.

Building a minimal GPU inference container

Before we deploy a full LLM, let’s build a simple container that loads a small model and runs inference. This will make the mechanics clear.

The Dockerfile

FROM nvidia/cuda:12.8.0-runtime-ubuntu24.04

# Install Python and pip
RUN apt-get update && apt-get install -y \
    python3 \
    python3-pip \
    python3-venv \
    && rm -rf /var/lib/apt/lists/*

# Create a virtual environment
RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Install PyTorch with CUDA support and transformers
RUN pip install --no-cache-dir \
    torch \
    transformers \
    accelerate

# Copy inference script
COPY inference.py /app/inference.py

WORKDIR /app

CMD ["python3", "inference.py"]

The inference script

# inference.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"

print(f"Loading {model_name}...")
print(f"CUDA available: {torch.cuda.is_available()}")
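# Guard the device lookup so the script reports a missing GPU instead of crashing
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'none'}")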

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

messages = [
    {"role": "user", "content": "Explain what Docker is in two sentences."}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

print(f"\nResponse: {response}")

We’re using Qwen2.5-0.5B-Instruct — a tiny 0.5 billion parameter model. It’s small enough to download quickly and run on almost any GPU, making it perfect for testing the pipeline.

Build and run

docker build -t my-llm .
docker run --rm --gpus all my-llm

This works, but it has a major problem: the model downloads from Hugging Face every time you start the container. For a 0.5B model that’s mildly annoying. For a 70B model, it’s a dealbreaker. We’ll fix this later.

Deploying a real model: PaddleOCR as an API

The minimal container above runs inference once and exits. A real deployment needs to serve an API — accept requests, run the model, return results. Let’s build one using PaddleOCR, an open source OCR toolkit that detects and recognizes text in images.

This follows the same pattern we’ll use for the MNIST training server later: package a model inside a Docker container with a FastAPI server in front of it.

Not every model needs a GPU. PaddleOCR’s models are small (~150 MB total) and process images in under 100ms on CPU — fast enough for most production workloads. Using CPU keeps the Dockerfile simple and avoids CUDA version compatibility issues between PaddlePaddle and the base image. We’ll use GPUs where they actually matter: training neural networks and running LLMs.

The server

# server.py
from fastapi import FastAPI, UploadFile
from paddleocr import PaddleOCR
import numpy as np
import cv2
import uvicorn

app = FastAPI()
ocr = PaddleOCR(use_angle_cls=True, lang="en", use_gpu=False)


@app.post("/ocr")
async def run_ocr(file: UploadFile):
    contents = await file.read()
    img = cv2.imdecode(np.frombuffer(contents, np.uint8), cv2.IMREAD_COLOR)
    result = ocr.ocr(img, cls=True)

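    lines = []
    # PaddleOCR returns [None] when no text is detected, so fall back to an empty list
    for line in (result and result[0]) or []:
        bbox, (text, confidence) = line
        lines.append({"text": text, "confidence": round(confidence, 4), "bbox": bbox})
    return {"lines": lines}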


@app.get("/health")
def health():
    return {"status": "ok"}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8080)

PaddleOCR loads three models on startup — text detection, text recognition, and angle classification. Once loaded, each request is just a forward pass through these models, so responses are fast.

The Dockerfile

FROM python:3.12-slim

RUN apt-get update && apt-get install -y \
    libgl1 libglib2.0-0 libgomp1 \
    && rm -rf /var/lib/apt/lists/*

RUN pip install --no-cache-dir \
    paddlepaddle "paddleocr>=2.6,<3" \
    fastapi uvicorn python-multipart

# Download OCR models at build time so startup is instant
RUN python3 -c "from paddleocr import PaddleOCR; PaddleOCR(use_angle_cls=True, lang='en')"

COPY server.py /app/server.py
WORKDIR /app

EXPOSE 8080
CMD ["python3", "server.py"]

Two things to notice:

  • libgl1 and libglib2.0-0 — OpenCV dependencies that aren’t in the slim Python image. Without them, import cv2 fails. This is a common gotcha when containerizing computer vision code.
  • The RUN python3 -c "..." line downloads PaddleOCR’s models (~150 MB) during the build, baking them into the image. This is the same “bake the model in” pattern — the container starts serving immediately with no download wait.

Build and run

docker build -t paddleocr-server .
docker run -d -p 8080:8080 paddleocr-server

Test it

# OCR a local image (image.jpg is a placeholder; the form field must be named "file")
curl -X POST http://localhost:8080/ocr -F "file=@image.jpg"

# Response:
# {
#   "lines": [
#     {"text": "TOTAL", "confidence": 0.9987, "bbox": [[45, 120], [130, 120], [130, 145], [45, 145]]},
#     {"text": "$42.50", "confidence": 0.9954, "bbox": [[150, 120], [240, 120], [240, 145], [150, 145]]}
#   ]
# }
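On the client side, the response usually needs a little post-processing. The following is a minimal sketch, assuming the response shape shown above; `extract_text` and its `min_confidence` threshold are hypothetical names chosen for illustration, not part of PaddleOCR.

```python
def extract_text(response: dict, min_confidence: float = 0.9) -> str:
    """Join recognized text, dropping low-confidence lines."""
    kept = [
        line["text"]
        for line in response.get("lines", [])
        if line["confidence"] >= min_confidence
    ]
    return " ".join(kept)


# Sample payload copied from the example response above.
sample = {
    "lines": [
        {"text": "TOTAL", "confidence": 0.9987,
         "bbox": [[45, 120], [130, 120], [130, 145], [45, 145]]},
        {"text": "$42.50", "confidence": 0.9954,
         "bbox": [[150, 120], [240, 120], [240, 145], [150, 145]]},
    ]
}
print(extract_text(sample))
```

The right threshold depends on your images, so treat 0.9 as a starting point rather than a recommendation.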

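This is the general pattern for deploying any Python-based model as an API. The PaddleOCR example runs on CPU; for a model that needs a GPU, only steps 1 and 5 change:

  1. Start from a slim Python image (or a CUDA runtime image for GPU models)
  2. Install your ML framework and model dependencies
  3. Download model weights at build time
  4. Wrap inference in a FastAPI server
  5. Run the container (adding --gpus all for GPU models)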

The MNIST training server later in this article uses the same pattern, and so would any custom model you want to deploy — a fine-tuned classifier, a speech-to-text model, an image segmentation pipeline. The Dockerfile structure and FastAPI wrapper are always the same; only the model and the request/response schema change.

Key concepts before scaling up

Before we deploy a real LLM, let’s understand a few concepts that will determine our hardware choices.

Model size and GPU memory

How do you know which GPU (or how many GPUs) you need for a given model? You don’t have to guess — you can calculate it. Every model parameter is a number stored in a specific precision, and each precision uses a known number of bytes:

| Precision | Bytes per parameter |
| --- | --- |
| FP32 | 4 bytes |
| FP16/BF16 | 2 bytes |
| INT8 | 1 byte |
| INT4 | 0.5 bytes |

The formula is straightforward:

$$\text{memory for weights} = \text{parameters} \times \text{bytes per parameter}$$

A small example: the MNIST network

To make this concrete, let’s start with the small neural network from the previous article — a 2-layer network that classifies handwritten digits:

784 inputs → 128 hidden neurons → 10 output classes

That’s $(784 \times 128 + 128) + (128 \times 10 + 10) = 101{,}770$ parameters. How much memory does it need?

$$101{,}770 \times 4 \text{ bytes (FP32)} = 407{,}080 \text{ bytes} \approx 0.4 \text{ MB}$$

Under half a megabyte. This model runs comfortably on any hardware — it would fit on a GPU from 2005. Even the larger version we experimented with (784→512→256→10, 535,818 parameters) is only ~2 MB in FP32. These models are so small that hardware planning isn’t really a question — they don’t need a GPU.
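The arithmetic above is easy to check in code. This is a small sketch; `dense_param_count` and `weight_memory_bytes` are hypothetical helper names, and the count assumes plain fully connected layers with one bias per output neuron.

```python
def dense_param_count(layer_sizes: list[int]) -> int:
    """Weights plus biases for a fully connected network."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


def weight_memory_bytes(params: int, bytes_per_param: float = 4) -> float:
    """Memory for the weights alone (FP32 by default)."""
    return params * bytes_per_param


small = dense_param_count([784, 128, 10])
large = dense_param_count([784, 512, 256, 10])
print(small, weight_memory_bytes(small))   # 101770 407080
print(large, weight_memory_bytes(large))   # 535818 2143272
```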

But speed matters when you’re experimenting. The MNIST article has ~20 training runs covering different learning rates, batch sizes, architectures, and activation functions. On a basic CPU VM, each run takes ~30 seconds, so the full set takes about 10 minutes of waiting. On a cheap GPU, each run takes 3–5 seconds, so the full set finishes in 1–2 minutes and experimentation feels instant.

| Setup | Per run (5 epochs) | All experiments (~20 runs) | Cost/hour |
| --- | --- | --- | --- |
| e2-medium (CPU only) | ~30s | ~10 min | ~$0.03 |
| n1-standard-4 + 1x T4 | ~3-5s | ~1-2 min | ~$0.80 |

The T4 has 16 GB of VRAM — absurd overkill for a 0.4 MB model, but it’s the cheapest GPU on GCP and the parallelism on matrix multiplications makes a noticeable difference even for small matrices.

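Since you only need the GPU for a few minutes at a time, create the VM once, set up your environment, and then suspend it when you’re done. Suspended VMs save their full state (memory, running processes, installed packages) to disk, so when you resume, everything is exactly where you left it in ~10–20 seconds, with no re-provisioning needed. You only pay for storage while suspended. One caveat: Compute Engine has historically not allowed suspending every GPU-attached instance, so if the suspend command is rejected for your configuration, use gcloud compute instances stop and start instead. That keeps the disk and stops GPU charges, but running processes are lost.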

# One-time setup: create the VM
gcloud compute instances create mnist-experiments \
    --zone=us-central1-a \
    --machine-type=n1-standard-4 \
    --accelerator=type=nvidia-tesla-t4,count=1 \
    --boot-disk-size=50GB \
    --image-family=common-cu128-ubuntu-2404-nvidia-570 \
    --image-project=deeplearning-platform-release \
    --maintenance-policy=TERMINATE \
    --metadata="install-nvidia-driver=True"

# SSH in, install your dependencies, run experiments...
gcloud compute ssh mnist-experiments --zone=us-central1-a

# When done — suspend (state saved, GPU charges stop)
gcloud compute instances suspend mnist-experiments --zone=us-central1-a

# Next day — resume in ~10-20 seconds, everything intact
gcloud compute instances resume mnist-experiments --zone=us-central1-a

# When you're done for good — delete to stop all charges
gcloud compute instances delete mnist-experiments --zone=us-central1-a

The workflow is: create once → suspend/resume as needed → delete when done for good. At $0.80/hour for the GPU, a 2-minute experiment session costs about 3 cents. While suspended, you only pay for the 50 GB boot disk (~$2/month), with no GPU or CPU charges.

Running experiments remotely

If you don’t want to stay SSH’d into the VM watching terminal output, you can wrap the training in a small FastAPI service. Upload a Python training script, let it run in the background, and poll for logs from your local machine.

You don’t need a heavy framework for this. Tools like Ray and MLflow are great when you need distributed training or experiment tracking across teams, but for running a few experiments on a single GPU they’re overkill. A FastAPI app that runs uploaded scripts in a subprocess is about 80 lines and does exactly what we need.

The approach is simple: you upload a .py file, the server saves it, runs it as a subprocess with full GPU access, and captures everything it prints to stdout. Since we’re writing Python training scripts throughout this series, this is a natural fit — your experiment is the script. You define the model architecture, the optimizer, the training loop, everything. The server just runs it and streams the output.

# train_server.py
import uuid
import subprocess
import threading
from collections import deque
from fastapi import FastAPI, HTTPException, UploadFile
import uvicorn

app = FastAPI()


class Job:
    def __init__(self, name: str, script_path: str):
        self.id = uuid.uuid4().hex[:8]
        self.name = name
        self.script_path = script_path
        self.status = "queued"      # queued | running | completed | failed
        self.logs: list[str] = []
        self.error: str | None = None

    def log(self, message: str):
        self.logs.append(message)
        print(f"[{self.name}] {message}")


# --- Job queue and runner ---

jobs: dict[str, Job] = {}
job_queue: deque[Job] = deque()
runner_lock = threading.Lock()


def run_job(job: Job):
    job.status = "running"
    job.log(f"Running: {job.name}")

    try:
        proc = subprocess.Popen(
            ["python3", "-u", job.script_path],   # -u for unbuffered output
            stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
            text=True, bufsize=1,
        )
        for line in proc.stdout:
            job.log(line.rstrip())
        proc.wait()

        if proc.returncode == 0:
            job.status = "completed"
            job.log("Done.")
        else:
            job.status = "failed"
            job.error = f"Exit code {proc.returncode}"
    except Exception as e:
        job.status = "failed"
        job.error = str(e)
        job.log(f"Error: {e}")


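runner_active = False  # True while a worker thread is draining the queue


def process_queue():
    """Process jobs sequentially: one GPU, one job at a time."""
    global runner_active
    while True:
        # Check-and-pop under the lock so a job enqueued while the worker
        # is shutting down is never left stranded in the queue.
        with runner_lock:
            if not job_queue:
                runner_active = False
                return
            job = job_queue.popleft()
        run_job(job)


def enqueue(job: Job):
    global runner_active
    jobs[job.id] = job
    with runner_lock:
        job_queue.append(job)
        start_worker = not runner_active
        runner_active = True
    if start_worker:
        threading.Thread(target=process_queue, daemon=True).start()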


# --- API ---

@app.post("/jobs")
async def submit_job(file: UploadFile):
    script_path = f"/tmp/train_{uuid.uuid4().hex[:8]}.py"
    with open(script_path, "wb") as f:
        f.write(await file.read())

    job = Job(name=file.filename, script_path=script_path)
    enqueue(job)
    return {"job_id": job.id, "name": job.name, "status": job.status}


@app.get("/jobs")
def list_jobs():
    return [
        {"job_id": j.id, "name": j.name, "status": j.status}
        for j in jobs.values()
    ]


@app.get("/jobs/{job_id}")
def get_job(job_id: str):
    job = jobs.get(job_id)
    if not job:
        raise HTTPException(status_code=404, detail="Job not found")
    return {
        "job_id": job.id, "name": job.name, "status": job.status,
        "error": job.error, "log_lines": len(job.logs),
    }


@app.get("/jobs/{job_id}/logs")
def get_job_logs(job_id: str, since: int = 0):
    job = jobs.get(job_id)
    if not job:
        raise HTTPException(status_code=404, detail="Job not found")
    return {"logs": job.logs[since:], "total": len(job.logs)}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8080)

Run the server on the VM:

# python-multipart is required by FastAPI to parse file uploads
pip install fastapi uvicorn python-multipart
python train_server.py

Now from your local machine, upload a training script and poll for progress:

# Submit a training script
curl -X POST http://$VM_IP:8080/jobs -F "file=@experiment.py"
# {"job_id": "a1b2c3d4", "name": "experiment.py", "status": "queued"}

# Poll logs (use ?since=N to only get new lines; quote the URL so the shell leaves "?" alone)
curl "http://$VM_IP:8080/jobs/a1b2c3d4/logs?since=0"

# Check status
curl http://$VM_IP:8080/jobs/a1b2c3d4

# List all jobs
curl http://$VM_IP:8080/jobs

The -u flag in the Popen call ensures Python’s output is unbuffered, so each print() in your training script shows up in the job’s logs immediately. Jobs run sequentially — one GPU can only train one model at a time — but the queue lets you submit several scripts and walk away.
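If you would rather not re-run curl by hand, the polling can be scripted. Below is a minimal sketch of a client that follows a job’s logs using the `since` parameter. The `follow_job` and `get_json` names are hypothetical, the `fetch` argument exists only so the loop can be exercised without a live server, and the endpoints are the ones defined in train_server.py above.

```python
import json
import time
import urllib.request


def get_json(url: str) -> dict:
    """Fetch a URL and decode the JSON body."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def follow_job(base_url: str, job_id: str, fetch=get_json,
               interval: float = 2.0) -> str:
    """Print new log lines until the job finishes; return its final status."""
    seen = 0
    while True:
        # Read the status before the logs, so output printed before the job
        # finished is picked up by the log fetch that follows.
        status = fetch(f"{base_url}/jobs/{job_id}")["status"]
        logs = fetch(f"{base_url}/jobs/{job_id}/logs?since={seen}")
        for line in logs["logs"]:
            print(line)
        # Advance by what was actually received, not by the reported total.
        seen += len(logs["logs"])
        if status in ("completed", "failed"):
            return status
        time.sleep(interval)
```

Usage would look like `follow_job("http://VM_IP:8080", "a1b2c3d4")`, with the job ID taken from the submit response.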

A training script is just a regular Python file. Here’s one that runs the baseline experiment from the MNIST article:

# experiment.py
from tensorflow import keras

(train_images, train_labels), (test_images, test_labels) = keras.datasets.mnist.load_data()
X_train = train_images.reshape(60000, 784).astype("float32") / 255.0
X_test = test_images.reshape(10000, 784).astype("float32") / 255.0

model = keras.Sequential([
    keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    keras.layers.Dense(10, activation="softmax"),
])

model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.1),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

model.fit(X_train, train_labels, epochs=5, batch_size=32,
          validation_data=(X_test, test_labels))

Because the script is the experiment, you have full control — change the model architecture, swap the optimizer, add dropout or convolutional layers, try a completely different approach. Write a new .py file, upload it, and poll for results. No need to modify the server.

Note: This server executes uploaded Python scripts on your VM. This is fine for a personal experiment machine that only you can reach — don’t expose it to the internet without authentication.

Scaling up to LLMs

The math is the same for large language models — but the numbers get serious:

| Precision | 7B model | 70B model |
| --- | --- | --- |
| FP32 | 28 GB | 280 GB |
| FP16/BF16 | 14 GB | 140 GB |
| INT8 | 7 GB | 70 GB |
| INT4 | 3.5 GB | 35 GB |

These numbers are just for the model weights. Inference also requires memory for KV-cache (which stores attention state for each token in the context), activations, and framework overhead. A safe rule of thumb: plan for 1.2x the weight size at minimum.

So the full formula for picking a GPU is:

$$\text{minimum VRAM} \approx \text{parameters} \times \text{bytes per parameter} \times 1.2$$

This means a 7B model in FP16 needs roughly $7\text{B} \times 2 \times 1.2 = 16.8\text{ GB}$ of GPU memory. A single NVIDIA L4 (24 GB) or A100 (40/80 GB) can comfortably run it. A 70B model in FP16 needs ~168 GB, more than a single 80 GB A100 or H100 can hold, so you’d need multiple GPUs or quantization.
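The same estimate as a few lines of Python, as a sketch only: `min_vram_gb` and the `BYTES_PER_PARAM` table are hypothetical names, 1 GB is taken as 10^9 bytes, and the 1.2 multiplier is the rule of thumb from above, not a measured value.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def min_vram_gb(params_billion: float, precision: str, overhead: float = 1.2) -> float:
    """Weights times the rule-of-thumb overhead, in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision] * overhead


for size in (7, 70):
    for precision in ("fp16", "int4"):
        print(f"{size}B {precision}: {min_vram_gb(size, precision):.1f} GB")
```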

What requires empirical testing

The 1.2x multiplier is a minimum estimate. The actual overhead depends on factors you can’t fully predict from the parameter count alone:

  • Context length — KV-cache grows linearly with max_model_len. A 72B model at 4,096 context uses far less memory than at 32,768.
  • Concurrent requests — each concurrent request needs its own KV-cache allocation. How many users you can serve simultaneously depends on your traffic patterns.
  • Throughput and latency — the parameter count tells you whether a model fits on a GPU, but not how fast it runs. Tokens per second, time to first token, and requests per second under load all require benchmarking with realistic workloads.

In short: VM sizing is calculable, performance tuning is empirical. The formula above tells you which GPU to rent. Load testing tells you whether it’s fast enough.
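To get a feel for how context length drives memory, here is a rough sketch of the KV-cache size for a single sequence. The formula (keys and values, per layer, per KV head, per token) is the standard one for transformer attention. The layer count, KV head count, and head dimension below are assumed values for a 72B-class model with grouped-query attention, so read the real ones from the model’s config.json before relying on the numbers.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Keys and values (hence the factor 2) for every layer and cached token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Assumed architecture values; not taken from the article.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

for context in (4096, 32768):
    gib = kv_cache_bytes(context, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{context:>6} tokens: {gib:.2f} GiB per sequence")
```

Multiply the per-sequence figure by the number of concurrent requests to see why traffic patterns matter as much as context length.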

Quantization

Quantization reduces precision to shrink model size and speed up inference. A 70B model quantized to INT4 fits in ~35 GB — achievable on a single A100 80 GB. The quality trade-off is often surprisingly small: INT4 quantized models typically score within a few percent of their full-precision counterparts on benchmarks.
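The core idea fits in a few lines. This sketch uses plain symmetric round-to-nearest quantization with one shared scale, which is deliberately simpler than what GPTQ or AWQ do; the `quantize` and `dequantize` helpers and the sample weights are made up for illustration.

```python
def quantize(weights: list[float], bits: int = 4) -> tuple[list[int], float]:
    """Map floats to signed integers in [-qmax, qmax] with one shared scale."""
    qmax = 2 ** (bits - 1) - 1          # 7 for INT4, 127 for INT8
    scale = max(abs(w) for w in weights) / qmax or 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the integers and the scale."""
    return [v * scale for v in q]


weights = [0.12, -0.53, 0.98, -0.05, 0.31]
q, scale = quantize(weights, bits=4)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"max error {max_error:.3f}")
```

Each weight now needs 4 bits instead of 16 or 32, at the cost of a rounding error of at most half the scale. Real formats reduce that error further by using per-group scales and calibration data.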

The most common quantization formats you’ll encounter:

  • GPTQ — post-training quantization, requires a calibration dataset
  • AWQ — activation-aware quantization, generally better quality than GPTQ
  • GGUF — used by llama.cpp, supports CPU+GPU split inference

Many models on Hugging Face come pre-quantized. Look for names like Qwen2.5-72B-Instruct-AWQ or Qwen2.5-72B-Instruct-GPTQ-Int4.

Inference frameworks

You could load models directly with transformers like we did above, but production deployments use specialized inference servers that handle batching, caching, and serving efficiently:

| Framework | Best for | Key features |
| --- | --- | --- |
| vLLM | High-throughput serving | PagedAttention, continuous batching, OpenAI-compatible API |
| TGI (Text Generation Inference) | Hugging Face ecosystem | Built-in quantization, easy model loading |
| Ollama | Simplicity | One-command setup, manages model downloads |
| SGLang | Advanced use cases | Structured generation, RadixAttention |

For most deployments, vLLM is the default choice. It’s fast, battle-tested, and exposes an OpenAI-compatible API, so your existing code that calls OpenAI can point at your self-hosted model with just a URL change.

Deploying Qwen 72B on GCP with vLLM

Now let’s deploy a serious model. We’ll use Qwen2.5-72B-Instruct-AWQ (a quantized 72B model) served by vLLM on a GCP VM with NVIDIA GPUs.

Step 1: Choose the right GPU VM

GCP offers several GPU options. Here’s what matters for LLM inference:

| GPU | VRAM | Good for | GCP machine type |
| --- | --- | --- | --- |
| T4 | 16 GB | Small models (7B quantized) | n1-standard-8 + 1x T4 |
| L4 | 24 GB | Medium models (7B FP16, 13B quantized) | g2-standard-12 + 1x L4 |
| A100 40 GB | 40 GB | Large models (30B quantized) | a2-highgpu-1g |
| A100 80 GB | 80 GB | 70B quantized on single GPU | a2-ultragpu-1g |
| H100 80 GB | 80 GB | Fastest inference, 70B+ models | a3-highgpu-1g |

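For Qwen 72B AWQ (~36 GB of weights by the formula, ~40 GB as downloaded), a single 40 GB GPU cannot hold the weights plus KV-cache, so we need at least one A100 80 GB. Two A100 40 GB with tensor parallelism also work, but a single A100 80 GB is simpler and leaves comfortable room for KV-cache at long contexts.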

We’ll use an a2-ultragpu-1g instance (1x A100 80 GB).

Step 2: Create the VM

gcloud compute instances create llm-server \
    --zone=us-central1-a \
    --machine-type=a2-ultragpu-1g \
    --boot-disk-size=200GB \
    --image-family=common-cu128-ubuntu-2404-nvidia-570 \
    --image-project=deeplearning-platform-release \
    --maintenance-policy=TERMINATE \
    --metadata="install-nvidia-driver=True"

Key flags:

  • --image-family=common-cu128-ubuntu-2404-nvidia-570 — GCP’s Deep Learning VM image, which comes with NVIDIA drivers, CUDA, and the NVIDIA Container Toolkit pre-installed. This saves a lot of setup.
  • --boot-disk-size=200GB — LLM weights are large. The 72B AWQ model is ~40 GB. Give yourself room.
  • --maintenance-policy=TERMINATE — required for GPU instances. GCP can’t live-migrate GPU VMs, so it terminates them instead.

Step 3: Install Docker on the VM

SSH into the instance:

gcloud compute ssh llm-server --zone=us-central1-a

The Deep Learning VM image usually has Docker pre-installed. Verify:

docker --version
nvidia-smi

If Docker isn’t installed:

# Install Docker
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER

# Install NVIDIA Container Toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
    sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Re-login for group changes
exit

Step 4: Launch vLLM

This is where it all comes together. One command to serve a 72B model with an OpenAI-compatible API:

docker run -d \
    --name vllm \
    --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model Qwen/Qwen2.5-72B-Instruct-AWQ \
    --quantization awq \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.9

Let’s break this down:

  • -v ~/.cache/huggingface:/root/.cache/huggingface — mounts the host’s Hugging Face cache into the container. This means the model downloads once and persists across container restarts. This fixes the “download every time” problem from our simple example.
  • --ipc=host — shares the host’s IPC namespace, needed for PyTorch’s shared memory when using multiple workers.
  • --model Qwen/Qwen2.5-72B-Instruct-AWQ — the model to serve. vLLM downloads it from Hugging Face automatically.
  • --quantization awq — tells vLLM to use AWQ dequantization kernels.
  • --max-model-len 4096 — limits the context window. Shorter context = less KV-cache memory = more room for batching.
  • --gpu-memory-utilization 0.9 — use 90% of GPU memory. vLLM pre-allocates memory for KV-cache, so this controls how much it reserves.

The first run will take a while as it downloads the model (~40 GB). Watch the logs:

docker logs -f vllm

Once you see Uvicorn running on http://0.0.0.0:8000, the server is ready.

Step 5: Test the API

From the VM itself:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-72B-Instruct-AWQ",
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "max_tokens": 100
    }'

The API is OpenAI-compatible, so you can use any OpenAI SDK client by changing the base URL:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # vLLM doesn't require auth by default
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct-AWQ",
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ]
)

print(response.choices[0].message.content)

Step 6: Expose the API externally

By default, the VM’s port 8000 isn’t accessible from the internet. To open it:

# Create a firewall rule
gcloud compute firewall-rules create allow-vllm \
    --direction=INGRESS \
    --action=ALLOW \
    --rules=tcp:8000 \
    --target-tags=vllm-server

# Add the tag to the VM
gcloud compute instances add-tags llm-server \
    --zone=us-central1-a \
    --tags=vllm-server

Now you can access the API from anywhere using the VM’s external IP:

EXTERNAL_IP=$(gcloud compute instances describe llm-server \
    --zone=us-central1-a \
    --format='get(networkInterfaces[0].accessConfigs[0].natIP)')

curl http://$EXTERNAL_IP:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-72B-Instruct-AWQ",
        "messages": [{"role": "user", "content": "Hello!"}]
    }'

Important: This exposes an unauthenticated API to the internet. For production, put it behind a reverse proxy with authentication, or use GCP’s Identity-Aware Proxy (IAP). vLLM also supports an --api-key flag to require a bearer token.

Multi-GPU setups

If your model doesn’t fit on a single GPU, vLLM supports tensor parallelism — splitting the model across multiple GPUs:

docker run -d \
    --name vllm \
    --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model Qwen/Qwen2.5-72B-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9

The --tensor-parallel-size 4 flag splits the model across 4 GPUs. For this to work, you need a VM with at least that many GPUs and enough combined memory, like a2-ultragpu-4g (4x A100 80 GB) or a2-highgpu-8g (8x A100 40 GB, with --tensor-parallel-size 8).

The unquantized 72B model in FP16 is ~144 GB of weights, or ~173 GB with the 1.2x overhead rule. Two A100 80 GB (160 GB total, 144 GB usable at 0.9 utilization) leave no room for KV-cache, which is why this example uses four (320 GB total).
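Before renting a multi-GPU machine, it is worth checking whether a given GPU count actually covers the estimate. This sketch applies the 1.2x rule of thumb per GPU and assumes the weights split evenly across the tensor-parallel group. `fits_tensor_parallel` is a hypothetical helper, and the check is a conservative estimate that real measurements may relax or tighten.

```python
def fits_tensor_parallel(params_billion: float, bytes_per_param: float,
                         gpu_vram_gb: float, tp_size: int,
                         overhead: float = 1.2, utilization: float = 0.9) -> bool:
    """True if each GPU's share of weights plus overhead fits its usable memory."""
    per_gpu_need = params_billion * bytes_per_param * overhead / tp_size
    return per_gpu_need <= gpu_vram_gb * utilization


for tp in (2, 4):
    need = 72 * 2 * 1.2 / tp
    ok = fits_tensor_parallel(72, 2, 80, tp)
    print(f"TP={tp}: {need:.1f} GB per GPU vs {80 * 0.9:.0f} GB usable -> {ok}")
```

By this estimate, the FP16 72B model needs 86.4 GB per GPU at a tensor-parallel size of 2, which exceeds the 72 GB usable on an 80 GB card, while a size of 4 brings it down to 43.2 GB.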

Production considerations

Keeping the model warm

Container restarts re-download the model unless you persist the cache. We handled this with the -v mount, but you can go further by baking the model into the image:

FROM vllm/vllm-openai:latest

# Download model at build time
RUN python3 -c "from huggingface_hub import snapshot_download; \
    snapshot_download('Qwen/Qwen2.5-72B-Instruct-AWQ')"

This creates a larger image (~40 GB+) but guarantees zero download time on startup. For GCP specifically, push this image to Artifact Registry in the same region as your VM to minimize pull time:

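# Build, then tag and push to Artifact Registry
# (run `gcloud auth configure-docker us-central1-docker.pkg.dev` once first)
docker build -t my-vllm .
docker tag my-vllm us-central1-docker.pkg.dev/my-project/my-repo/vllm-qwen72b:latest
docker push us-central1-docker.pkg.dev/my-project/my-repo/vllm-qwen72b:latest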

Cost management

GPU VMs are expensive. An a2-ultragpu-1g costs roughly $8-12/hour on-demand. Strategies to manage costs:

  • Preemptible/Spot VMs — up to 60-91% cheaper, but GCP can reclaim them with 30 seconds notice. Good for batch workloads, risky for serving.
  • Committed use discounts — 1 or 3-year commitments for 37-55% off. Good if you know you need sustained GPU capacity.
  • Stop when idle — use a startup script that launches vLLM, and stop the VM when you’re not using it. You only pay for the disk when stopped.
# Stop the VM (keeps disk, no GPU charges)
gcloud compute instances stop llm-server --zone=us-central1-a

# Start it back up
gcloud compute instances start llm-server --zone=us-central1-a
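To compare these strategies, a back-of-the-envelope calculation helps. This sketch uses the article’s rough $10/hour figure for an a2-ultragpu-1g; the `monthly_cost` helper is hypothetical, the discount is an example value from the ranges above, and disk storage is left out.

```python
def monthly_cost(hourly_rate: float, hours_per_day: float,
                 discount: float = 0.0, days: int = 30) -> float:
    """Compute cost for one month given usage hours and a fractional discount."""
    return hourly_rate * (1 - discount) * hours_per_day * days


RATE = 10.0  # rough on-demand figure from above, USD/hour

print(f"always on:        ${monthly_cost(RATE, 24):,.0f}")
print(f"8 h/day, stopped: ${monthly_cost(RATE, 8):,.0f}")
print(f"spot at 60% off:  ${monthly_cost(RATE, 24, discount=0.60):,.0f}")
```

Under these assumptions, stopping the VM outside an 8-hour working day saves more than a 60% Spot discount on an always-on instance, without the risk of preemption.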

Health checks and auto-restart

Add Docker restart policy and a health check:

docker run -d \
    --name vllm \
    --gpus all \
    --restart unless-stopped \
    --health-cmd="curl -f http://localhost:8000/health || exit 1" \
    --health-interval=30s \
    --health-timeout=10s \
    --health-retries=3 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model Qwen/Qwen2.5-72B-Instruct-AWQ \
    --quantization awq \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.9

Quick reference

Here’s a cheat sheet for common model sizes and recommended GCP setups:

| Model | Precision | VRAM needed | GCP setup | ~Cost/hour |
| --- | --- | --- | --- | --- |
| Qwen2.5-7B | FP16 | ~17 GB | 1x L4 (g2-standard-12) | ~$1.50 |
| Qwen2.5-7B | INT4 | ~5 GB | 1x T4 (n1-standard-8) | ~$0.80 |
| Qwen2.5-32B | AWQ | ~18 GB | 1x L4 (g2-standard-12) | ~$1.50 |
| Qwen2.5-72B | AWQ | ~40 GB | 1x A100 80 GB | ~$10 |
| Qwen2.5-72B | FP16 | ~173 GB | 4x A100 80 GB | ~$40 |
| Llama 3 70B | AWQ | ~38 GB | 1x A100 80 GB | ~$10 |

Prices are approximate on-demand rates and vary by region.

Wrapping up

The path from “I want to run my own LLM” to actually doing it is shorter than it looks:

  1. Pick a model and precision that fits your GPU budget
  2. Use a serving framework like vLLM — don’t reinvent the wheel
  3. Run it in a GPU-enabled Docker container on a GCP VM
  4. Mount the model cache or bake it into the image so restarts are fast

The OpenAI-compatible API means your application code doesn’t need to know or care whether it’s talking to OpenAI, Anthropic, or your own Qwen instance running on a VM in us-central1. You just change the base URL.