Async/await for containers: how Trigger.dev suspends and resumes running tasks

I’ve been looking for a platform to run AI-powered workflows — LLM chains, data pipelines, agent loops — without building all the queue, retry, and scheduling infrastructure myself. That’s how I came across Trigger.dev, an open-source platform for running background tasks in TypeScript. It’s a nice project, but one feature in particular caught my attention.

If your task calls wait.for() or awaits a child task with triggerAndWait():

export const myTask = task({
  id: "my-task",
  run: async () => {
    // Container is suspended here — you pay nothing for the hour
    await wait.for({ hours: 1 });

    // Container is suspended while the child task runs in its own container
    const result = await childTask.triggerAndWait({ data: "some data" });
  },
});

the platform suspends the entire container, stops billing you for compute, and resumes execution from the exact point where it paused — whether that’s an hour later or when the child task finishes. No serialization, no state management, no re-running previous steps.

Trigger.dev calls this the Checkpoint-Resume System. While waiting for a subtask or a programmed pause, the system checkpoints the task’s entire state — memory, CPU registers, open file descriptors — and releases all resources. When the wait ends or the subtask completes, the checkpoint is loaded into a new execution environment, restoring the task to its exact state before suspension. The task resumes from where it left off, with subtask results seamlessly integrated.

What makes this interesting is that it’s essentially async concurrency applied at the infrastructure level. In JavaScript, await suspends a function while a network call completes, freeing the event loop for other work. Python coroutines do the same thing. Trigger.dev takes that same pattern — pause execution, release resources, resume when ready — but applies it to entire containers instead of functions. The await in your task code literally suspends the container, and when it resumes, it might be on a completely different VM.
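For comparison, here is the function-level version of the pattern in plain Node.js. While the function is parked at the await, the event loop is free to run other work; Trigger.dev applies the same shape to whole containers:

```typescript
// Function-level suspension: `await` parks this function without blocking
// the event loop. setTimeout stands in for a real network call.
async function fetchLike(): Promise<string> {
  const result = await new Promise<string>((resolve) =>
    setTimeout(() => resolve("done"), 10)
  );
  return result; // execution resumes here, locals intact
}
```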

That sounded almost too good to be true, so I asked Claude to explore the source code to understand how it actually works. The answer involves CRIU (Checkpoint/Restore In Userspace), Docker’s experimental checkpoint API, Buildah for OCI image creation, and a carefully designed state machine to coordinate it all.

In this article I want to share what I found.

The problem: long waits in serverless tasks

Let’s first understand where checkpoints might be needed. Consider a task that processes a payment, waits for a confirmation, and then sends a receipt:

import { task, wait } from "@trigger.dev/sdk";

export const processPayment = task({
  id: "process-payment",
  run: async (payload) => {
    const charge = await chargeCustomer(payload);

    // Parent container is suspended while getConfirmation runs in its own container
    // Could take hours or days — you don't pay for the wait
    const confirmation = await getConfirmation.triggerAndWait({
      chargeId: charge.id,
    });

    await sendReceipt(charge, confirmation);

    return { success: true };
  },
});
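The child task getConfirmation isn’t shown above; a hypothetical definition could look like this (the id and the polling logic are invented for illustration, not taken from any real project):

```typescript
import { task } from "@trigger.dev/sdk";

// Hypothetical child task: runs in its own container while the parent is
// checkpointed. Its return value becomes the parent's `confirmation`.
export const getConfirmation = task({
  id: "get-confirmation",
  run: async (payload: { chargeId: string }) => {
    // ...poll the payment provider, or wait on a webhook-driven signal...
    return { chargeId: payload.chargeId, confirmed: true };
  },
});
```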

The core problem is that serverless functions have hard execution timeouts. AWS Lambda caps at 15 minutes, GCP Cloud Functions at 60 minutes. A 24-hour workflow simply can’t run in a single function invocation. The standard advice is to reach for a different compute model entirely — containers on ECS/GKE, VMs, or batch services — but then you lose the simplicity of “just write a function.”

Even if you stay within timeout limits, you have three bad options:

  1. Keep the container running while the confirmation arrives — you pay for compute the entire time, even though the parent task is doing nothing
  2. Split into multiple tasks — break the workflow into chargeCustomer, a scheduled trigger, and sendReceipt, losing the simplicity of a single function. Workflow orchestrators like AWS Step Functions or Google Cloud Workflows can help, but you’re now debugging state machines instead of async functions
  3. Serialize state to a database — save charge somewhere, schedule a follow-up job, deserialize on resume — now you’re building a workflow engine

Trigger.dev’s answer is option 4: freeze the container’s memory to disk, shut it down, and restore it later.

When you call triggerAndWait, it spawns the child task in a separate container, then the parent is checkpointed and suspended — releasing its compute and concurrency — until the child completes. The parent resumes with the child’s return value, as if it were a normal await. The same mechanism kicks in for timer waits like await wait.for({ hours: 24 }).

How this compares to traditional approaches

Most workflow engines handle long waits by serializing state to a database. Here’s how checkpointing compares:

Database serialization (Flowable, Temporal, etc.):

  • You (or the framework) decide what to save
  • State must be serializable — closures, file handles, open connections are lost
  • Restore recreates the process and rehydrates from saved state
  • Lightweight storage (a few KB of serialized variables)
  • Framework must understand your language’s runtime

Container checkpointing (Trigger.dev):

  • CRIU saves everything automatically — you don’t think about it
  • Nothing is lost — memory, call stack, local variables, closures all preserved
  • Restore is exact — the process doesn’t know it was checkpointed
  • Heavy storage (container memory image, potentially hundreds of MB)
  • Language-agnostic — works with any process running in the container

The trade-off is clear: checkpointing is simpler for the developer (zero serialization burden) but more expensive in storage and restore latency. Database serialization is lightweight but requires the framework (or the developer) to explicitly manage what’s saved.

Trigger.dev’s bet is that developer simplicity is worth the infrastructure cost. You write a normal async function with await calls, and the platform handles the rest. No workflow DSL, no state classes, no serialization interfaces.

CRIU: the technology underneath

The foundation is CRIU — Checkpoint/Restore In Userspace. CRIU is a Linux tool that can freeze a running process (or tree of processes), save its complete state to disk, and restore it later. “Complete state” means everything: memory pages, register contents, file descriptors, socket state, signal handlers, all of it.

CRIU operates at the OS level. It doesn’t know or care what language your code is written in, what variables you have in scope, or what your call stack looks like. It captures raw memory pages and kernel state. This means any program can be checkpointed — Node.js, Python, C++, whatever is running in the container.

Docker has experimental support for CRIU through docker checkpoint create, and Kubernetes supports it via the CRI (Container Runtime Interface) with crictl checkpoint. Trigger.dev uses both, depending on the deployment mode. When CRIU isn’t available at all — missing binary, unsupported kernel, or Docker experimental features not enabled — Trigger.dev falls back to docker pause, which suspends the container but doesn’t capture state. The workflow continues, but if the container dies, the run is lost. This fallback exists for development environments where setting up CRIU isn’t practical.
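To make the fallback concrete, here is a minimal sketch (invented names, not Trigger.dev’s actual code) of how a supervisor might pick a suspend strategy from host capabilities:

```typescript
// Illustrative capability check: which suspend mechanism the host supports.
type SuspendMode = "criu-cri" | "criu-docker" | "pause-fallback";

interface HostCapabilities {
  criuBinary: boolean;         // `criu` installed and the kernel supports it
  dockerExperimental: boolean; // Docker experimental features enabled
  criCheckpoint: boolean;      // CRI runtime supports `crictl checkpoint`
}

function pickSuspendMode(caps: HostCapabilities): SuspendMode {
  if (caps.criuBinary && caps.criCheckpoint) return "criu-cri";       // Kubernetes
  if (caps.criuBinary && caps.dockerExperimental) return "criu-docker"; // local Docker
  // docker pause: suspends the container but state dies with it
  return "pause-fallback";
}
```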

Trying it yourself

You can see CRIU in action with a simple Python counter inside a Docker container. Start a container with CRIU installed (--privileged is needed for CRIU to access process memory):

docker run -d --name criu-demo --privileged python:3.12-slim bash -c 'apt-get update -qq && apt-get install -y -qq criu > /dev/null 2>&1 && sleep infinity'

Copy the counter script — it just increments a number and writes it to a file every second:

docker exec criu-demo bash -c 'cat > /counter.py << "EOF"
import time
count = 0
while True:
    count += 1
    with open("/output.txt", "a") as f:
        f.write(f"count = {count}\n")
    time.sleep(1)
EOF'

Copy the demo script — it starts the counter, checkpoints it, then restores it:

docker exec criu-demo bash -c 'cat > /demo.sh << "EOF"
#!/bin/bash
python3 /counter.py &
PID=$!
disown
sleep 5

echo "--- before checkpoint ---"
cat /output.txt

mkdir -p /checkpoint
criu dump -t $PID -D /checkpoint --shell-job -v0

echo "--- checkpointed, process killed ---"
> /output.txt

criu restore -d -D /checkpoint --shell-job -v0
sleep 5

echo "--- after restore ---"
cat /output.txt
EOF
chmod +x /demo.sh'

Run the demo:

docker exec criu-demo /demo.sh

Output:

--- before checkpoint ---
count = 1
count = 2
count = 3
count = 4
count = 5
--- checkpointed, process killed ---
--- after restore ---
count = 7
count = 8
count = 9
count = 10
count = 11

The counter was checkpointed after writing count 5, the process was killed, then CRIU restored it from the checkpoint — and it resumed counting as if nothing happened. The variable count was sitting in Python’s heap memory, and CRIU captured and restored the entire memory state. This is exactly the mechanism Trigger.dev uses, just wrapped in more orchestration.

Note that the container needs --privileged for CRIU to access process memory. In production, Trigger.dev doesn’t run CRIU inside the task container — the supervisor calls docker checkpoint create or crictl checkpoint from the outside, which invokes CRIU on the container as a whole.

How the checkpoint flow works

Here’s the full checkpoint-resume flow for a parent task that triggers a child task (based on the diagram from Trigger.dev docs):

The diagram shows the flow for triggerAndWait, but the same mechanism applies to wait.for() — the only difference is what resolves the waitpoint (a timer vs a child task completing).

Here are the key source files behind each actor:

Diagram actor        Source                           Class
Trigger.dev          run-engine/engine/index.ts       RunEngine
Parent/Child Task    managed/controller.ts            ManagedRunController
CR System            coordinator/checkpointer.ts      Checkpointer
Storage              coordinator/exec.ts              Buildah

The CR System in the diagram maps to the Coordinator container — here’s how these components are laid out on a worker VM:

Worker VM
├── Supervisor container (apps/supervisor)
│   ├── Dequeues runs from the platform
│   ├── Creates task containers on demand
│   └── Coordinates with Coordinator for checkpointing

├── Coordinator container (apps/coordinator)  ← "CR System" in the diagram
│   ├── Runs the Checkpointer (CRIU, Buildah)
│   ├── Has access to Docker daemon
│   └── Freezes task containers from the outside

└── Task container (ephemeral, one per run)
    ├── Controller (ManagedRunController)  [entry point]
    │   └── Signals when task is suspendable

    └── Worker  [child process, forked via IPC]
        └── Your task code (task.run())

The task container can’t checkpoint itself. The controller signals that it’s suspendable, the supervisor tells the coordinator, and the coordinator freezes the container from the outside. When CRIU checkpoints the container, it captures both processes and the IPC channel — on restore, both resume simultaneously. This separation is what makes cross-machine restore possible — the checkpoint image is pushed to a registry, and any node can pull and restore it.
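The suspendable signal can be sketched as a small message protocol. The message names below are invented for illustration; the real IPC schema lives in the controller and worker sources:

```typescript
// Illustrative worker → controller IPC messages (not the actual schema).
type WorkerMessage =
  | { type: "TASK_HEARTBEAT"; runId: string }
  | { type: "WAIT_FOR_DURATION"; runId: string; ms: number }      // wait.for()
  | { type: "WAIT_FOR_TASK"; runId: string; childRunId: string }; // triggerAndWait()

// The controller treats either wait message as "this run can be frozen now".
function isSuspendable(msg: WorkerMessage): boolean {
  return msg.type === "WAIT_FOR_DURATION" || msg.type === "WAIT_FOR_TASK";
}
```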

Let’s walk through each step.

Step 1: Start execution

The supervisor running on the worker VM dequeues the run from the platform and creates a task container via workloadManager.create(), passing environment variables (e.g. TRIGGER_SUPERVISOR_API_DOMAIN) so the controller process inside the task container knows how to reach the supervisor’s Workload API over HTTP. The controller, ManagedRunController, is the main Node.js process inside the task container; it uses Node’s fork() to spawn a worker child process that runs your task code. The two processes communicate via Node.js IPC.

Step 2: Trigger child task

When your code calls await childTask.triggerAndWait(...), two things happen:

  • the SDK running inside the worker process makes an API call to the Trigger.dev platform (the “Trigger.dev” actor in the diagram — not the supervisor on the VM), which queues the child task for execution and creates a waitpoint — a record in the database that says “this run is waiting for this child task to complete.”
  • the worker signals to the controller via IPC that it’s suspendable — ready to be frozen without data loss.

The parent doesn’t wait for the child to start; it just tells the platform “run this” and signals “I can be checkpointed now.” For wait.for(), the waitpoint is a datetime instead of a child task, but the rest of the flow is identical.
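A waitpoint can be modeled as a small tagged union covering both cases. This is an illustrative shape, not the actual database schema:

```typescript
// Illustrative waitpoint record: either a timer or a child-task dependency.
type Waitpoint =
  | { kind: "DATETIME"; runId: string; resumeAt: Date }             // wait.for()
  | { kind: "TASK_COMPLETION"; runId: string; childRunId: string }; // triggerAndWait()

// A waitpoint resolves when its timer expires or its child run completes.
function isResolved(wp: Waitpoint, now: Date, completedRuns: Set<string>): boolean {
  return wp.kind === "DATETIME"
    ? now.getTime() >= wp.resumeAt.getTime()
    : completedRuns.has(wp.childRunId);
}
```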

Step 3: Request snapshot

The controller calls suspendRun() on the supervisor’s Workload API. The supervisor delegates to the coordinator via the CheckpointClient. The coordinator invokes CRIU from the outside to freeze the task container. The checkpoint logic lives in checkpointAndPush() and has two modes:

Docker mode (local/development):

docker checkpoint create --leave-running <container-name> <checkpoint-name>

Kubernetes mode (production):

crictl checkpoint --export=/checkpoints/<identifier>.tar <container-id>

Both commands tell the container runtime to invoke CRIU, which freezes all processes, dumps all memory pages to disk, and saves kernel state (file descriptors, sockets, timers).
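The two modes reduce to a small dispatch. The command shapes mirror the ones above; the function itself and its option names are invented for illustration:

```typescript
// Illustrative sketch of building the checkpoint command per mode.
function checkpointCommand(
  mode: "docker" | "kubernetes",
  opts: { containerId: string; checkpointName: string; exportPath?: string }
): string[] {
  if (mode === "docker") {
    // --leave-running keeps the container alive after the dump
    return ["docker", "checkpoint", "create", "--leave-running", opts.containerId, opts.checkpointName];
  }
  // Kubernetes mode exports the checkpoint as a tar archive
  return ["crictl", "checkpoint", `--export=${opts.exportPath}`, opts.containerId];
}
```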

Step 4: Store snapshot

In production (Kubernetes mode), the checkpoint is exported as a tar archive. Trigger.dev wraps it in an OCI container image using the Buildah class and pushes it to a registry:

buildah from scratch
buildah add <container> /checkpoints/<identifier>.tar /
buildah config --annotation=io.kubernetes.cri-o.annotations.checkpoint.name=<shortCode> <container>
buildah commit <container> <registry>/<namespace>/<project>:<version>.prod-<shortCode>
buildah push --tls-verify <imageRef>

The result is a standard container image that can be pulled and restored on any node in the cluster.

Step 5: Release resources

Once the checkpoint image is stored, the platform:

  1. Updates the run status to WAITING_TO_RESUME
  2. Stores a TaskRunCheckpoint record (type, location, image reference)
  3. Releases all concurrency for this run

If you have a queue with concurrencyLimit: 5 and three tasks are suspended, those three slots are freed up. Suspended tasks consume zero compute and zero concurrency.
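A toy model of that accounting, with only the two statuses named in this flow: suspended runs simply stop counting against the limit.

```typescript
// Illustrative concurrency accounting: only executing runs occupy slots.
interface Run {
  id: string;
  status: "EXECUTING" | "WAITING_TO_RESUME";
}

function usedSlots(runs: Run[]): number {
  return runs.filter((r) => r.status === "EXECUTING").length;
}
```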

Step 6: Child task completes

The child task runs in its own container. When it finishes, it reports back to Trigger.dev. For wait.for(), this step is replaced by the timer expiring.

Step 7: Retrieve snapshot and restore state

The platform requests the checkpoint from the CR system, which retrieves the snapshot image from storage. A new container is started from the checkpoint image — CRIU restores all processes to their exact memory state.

The restored controller detects it’s been restored and calls continueRunExecution():

POST /api/runs/{runId}/continue
Body: { snapshotId: "...", workerId: "...", runnerId: "..." }

The container might be on a different physical node than the original — the checkpoint image is in a registry, and any node can pull it.

Step 8: Resume and complete execution

The backend validates the snapshot, updates the run to EXECUTING, and the task continues from the line after the await. From your code’s perspective, nothing happened — the await resolved with the child task’s return value, and execution continues normally.
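The status transitions in this flow form a small state machine. The EXECUTING and WAITING_TO_RESUME statuses are from the steps above; the COMPLETED status and the event names are illustrative additions:

```typescript
// Toy state machine for the run statuses in the checkpoint-resume flow.
type RunStatus = "EXECUTING" | "WAITING_TO_RESUME" | "COMPLETED";
type RunEvent = "suspend" | "waitpoint_resolved" | "finished";

function next(status: RunStatus, ev: RunEvent): RunStatus {
  if (status === "EXECUTING" && ev === "suspend") return "WAITING_TO_RESUME";
  if (status === "WAITING_TO_RESUME" && ev === "waitpoint_resolved") return "EXECUTING";
  if (status === "EXECUTING" && ev === "finished") return "COMPLETED";
  throw new Error(`invalid transition: ${status} + ${ev}`);
}
```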

What can go wrong

Checkpointing isn’t magic. There are edge cases:

Network connections don’t survive. TCP sockets are saved by CRIU, but by the time the container restores (minutes, hours, or days later), the remote end has long closed the connection. Any open HTTP connections, database connections, or WebSocket connections will be stale. The Trigger.dev runtime handles this for its own connections (re-establishing the supervisor WebSocket and HTTP client), but if your task code holds open connections across a wait, they’ll fail on resume.
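One defensive pattern is to scope each connection so nothing stays open across a wait boundary. A sketch, where connect is a stand-in for your actual driver:

```typescript
// Illustrative pattern: open, use, close, so no connection ever spans a
// wait.for() or triggerAndWait() boundary.
interface Conn<T> {
  query: (q: string) => Promise<T>;
  close: () => Promise<void>;
}

async function withConnection<T>(
  connect: () => Promise<Conn<T>>,
  q: string
): Promise<T> {
  const conn = await connect();
  try {
    return await conn.query(q);
  } finally {
    await conn.close(); // never leave this open across a wait
  }
}
```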

File system changes are ephemeral. The checkpoint captures memory, not disk. If your task wrote temporary files before the wait, they won’t exist after restore (the container runs on a fresh filesystem from the image). Design tasks to be self-contained across wait boundaries.
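The distinction in practice: heap state survives the checkpoint, files written before the wait do not. In this sketch, suspend is a stand-in for wait.for():

```typescript
// Illustrative only: what survives a checkpoint-restore cycle.
async function processAcrossWait(suspend: () => Promise<void>): Promise<number> {
  const rows = ["a", "b", "c"]; // heap memory: captured and restored by CRIU
  // fs.writeFileSync("/tmp/rows.json", JSON.stringify(rows));
  //   ^ disk: would NOT exist after restore on a fresh filesystem
  await suspend(); // container frozen here, possibly for days
  return rows.length; // still 3 on resume
}
```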

Checkpoint size scales with memory usage. A container using 2GB of RAM produces roughly a 2GB checkpoint image. For tasks with large in-memory datasets, this means significant storage and transfer time, and large checkpoints are inherently slower to create and restore.

CRIU requires kernel support. CRIU needs specific kernel features (namespaces, cgroups) and Docker’s experimental mode. In Kubernetes, the container runtime (CRI-O or containerd) must be configured for checkpoint support. This isn’t available everywhere, which is why Trigger.dev has the docker pause fallback.

The elegance of the approach

What I find most compelling about this design is the abstraction boundary. From the developer’s perspective:

await wait.for({ hours: 24 });

That’s it. One line. Behind it: CRIU freezes all processes, memory pages are dumped to disk, Buildah wraps them in an OCI image, the image is pushed to a registry, concurrency slots are released, a state machine tracks the snapshot lifecycle, and hours later a new container is started from the checkpoint image on potentially a different machine, the process is restored, connections are re-established, and the await resolves.

The complexity is entirely behind the platform boundary. Your task is just an async function.