Module 07 · DevOps · medium · 17 min

Module 7 — Shipping It to Production

Most tutorials end before shipping. But a service nobody can deploy, observe, or confirm is healthy isn't done; it's a demo. This module does the unglamorous work that turns your code into something that survives contact with real users.

Your app works and it's tested. Now it has to run somewhere other than your laptop, deploy without you SSHing into a box, and tell you when it's sick. This module applies The Edge chapter: a proper container, a pipeline, observability, and an SLO.

Goal

  • A Dockerfile that builds a small, secure image.
  • A CI/CD pipeline: on push, run tests and migrations; on green + main, deploy.
  • Structured logging with a request id, and a health check that means something.
  • One SLO defined, so "is it broken?" has an answer that isn't a vibe.

Step 1: Containerise it properly

A container packages your app so it runs the same everywhere. "Properly" means small, layer-cached, and not running as root.

# build stage — has the toolchain
FROM node:22-slim AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build            # tsc → dist/

# runtime stage — only what's needed to run
FROM node:22-slim
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev        # no dev deps in the shipped image
COPY --from=build /app/dist ./dist
USER node                    # don't run as root
EXPOSE 3000
CMD ["node", "dist/server.js"]

Why multi-stage, and why not root

The build stage has TypeScript and dev dependencies; the runtime stage copies only the compiled output and production deps, so the shipped image is smaller and has less attack surface (no compiler, no dev tooling to exploit). USER node means that if someone does compromise the process, they don't get root inside the container. Ordering the COPY package*.json before COPY . . lets Docker cache the slow npm ci layer and only re-run it when dependencies actually change. Small, layered, non-root — the three properties of a production image.

Secrets do not go in the image

DATABASE_URL, your LLM API key (Module 8), session secrets: none of these are baked into the Dockerfile or committed. They're injected at runtime as environment variables by the platform. An image is a build artifact that may get pushed to a registry; a secret in it is a secret leaked. This is exactly why your Module 1 config layer reads from the environment instead of from files in the repo.
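As a minimal sketch of what "reads from the environment" can look like, the loader below fails fast at startup when a required variable is missing. It is not the actual Module 1 code: the names AppConfig, loadConfig, and requireVar, and the SESSION_SECRET and PORT variables, are assumptions made for illustration.

```typescript
// Hypothetical config loader. Names here are illustrative, not taken from Module 1.
type Env = Record<string, string | undefined>;

interface AppConfig {
  databaseUrl: string;
  sessionSecret: string;
  port: number;
}

// Return the variable's value, or throw if it is missing or empty.
function requireVar(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value === "") {
    // Name the missing variable, but never print secret values.
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

function loadConfig(env: Env): AppConfig {
  return {
    databaseUrl: requireVar(env, "DATABASE_URL"),
    sessionSecret: requireVar(env, "SESSION_SECRET"), // assumed variable name
    port: Number(env["PORT"] ?? "3000"),              // assumed default port
  };
}
```

In a Node process this would typically be called once at startup as loadConfig(process.env), so a misconfigured container crashes immediately instead of failing on its first request.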

Step 2: The CI/CD pipeline

The pipeline turns "push to git" into "tested and deployed" with no manual steps to forget. The order is the safety: test before you deploy, always.

  1. On every push: build and test

    Spin up Postgres in CI, run your migrations against it, run the suite from Module 6. A red suite stops here — nothing ships.

  2. On green + main: build and push the image

    Build the Docker image, tag it with the commit SHA, push it to a registry.

  3. Deploy with migrations first, then a rolling update

    Run pending migrations, then roll out the new image gradually — start new instances, health-check them, shift traffic, retire the old. Your Module 1 graceful shutdown is what makes the retirement clean.

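# .github/workflows/ci.yml (sketch)
on: push
jobs:
  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env: { POSTGRES_PASSWORD: test }
        ports: ["5432:5432"]
        options: --health-cmd pg_isready --health-interval 5s --health-retries 10   # wait until Postgres accepts connections
    env:
      DATABASE_URL: postgres://postgres:test@localhost:5432/postgres   # assumed connection string for the CI database
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run migrate up      # real migrations
      - run: npm test                # the suite must pass to proceed
  deploy:
    needs: test                      # only if tests passed
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: echo "build, push image, trigger platform deploy"   # placeholder: these steps are platform-specific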

The migration ordering trap

Run migrations before the new code that depends on them, and design them to be backward-compatible with the currently running code, because during a rolling deploy both versions run at once. Adding a nullable column is safe; renaming or dropping one mid-deploy breaks the old instances still serving traffic. This is the "zero-downtime changes" point from the data chapter — the migration and the deploy are a coordinated pair, not two independent steps.
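One way to reason about the trap is to ask, for every schema change, whether the code that is still running can find every column it uses. The sketch below is a toy model with hypothetical column names, not a real migration tool: a schema is just a set of column names, and each app version lists the columns its queries touch.

```typescript
// Hypothetical model: a schema is the set of column names on one table,
// and each app version declares the columns its queries touch.
type Schema = ReadonlySet<string>;

// Columns the given app version needs that the schema no longer has.
function missingColumns(schema: Schema, columnsUsedByCode: readonly string[]): string[] {
  return columnsUsedByCode.filter((col) => !schema.has(col));
}

const oldCodeUses = ["id", "title", "body"]; // illustrative column names

// Adding a nullable column: old instances ignore it and keep working.
const afterAdd: Schema = new Set(["id", "title", "body", "language"]);
const brokenByAdd = missingColumns(afterAdd, oldCodeUses); // empty, nothing breaks

// Renaming body to content in one step: old instances still select "body".
const afterRename: Schema = new Set(["id", "title", "content"]);
const brokenByRename = missingColumns(afterRename, oldCodeUses); // contains "body"

console.log({ brokenByAdd, brokenByRename });
```

The usual way around the rename problem is to split it across deploys: add the new column, have the code write both, backfill, switch reads to the new column, and drop the old one only once no running version uses it.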

Step 3: Observability — see inside the running system

When this breaks at 2am (see the incident deep dive), you'll have only what you instrumented now. The classic three pillars are logs, metrics, and traces; this module sets up the first two:

Structured logs

Log JSON, not prose, with a request id attached to every line of a request. {"reqId":"abc","route":"POST /snippets","status":201,"ms":42}. Structured logs are queryable ("show me all 500s in the last hour"); a request id lets you follow one request across every log line it produced.
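To make "queryable" concrete, here is a small sketch that treats log output as newline-delimited JSON and answers both questions with plain filters. The field names follow the example line above; the sample data and the parseLogs helper are made up for illustration, and in production a log platform's query language would do this job.

```typescript
// Field names follow the example log line; the sample data is made up.
interface AccessLog {
  reqId: string;
  route: string;
  status: number;
  ms: number;
}

// Parse newline-delimited JSON, skipping blank lines.
function parseLogs(raw: string): AccessLog[] {
  return raw
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as AccessLog);
}

const sample = [
  '{"reqId":"abc","route":"POST /snippets","status":201,"ms":42}',
  '{"reqId":"def","route":"GET /s/:publicId","status":500,"ms":910}',
  '{"reqId":"ghi","route":"GET /s/:publicId","status":200,"ms":18}',
].join("\n");

const logs = parseLogs(sample);

// "Show me all 500s" becomes a filter on a field, not a regex over prose.
const serverErrors = logs.filter((l) => l.status >= 500);

// Following one request is a filter on its reqId.
const oneRequest = logs.filter((l) => l.reqId === "def");

console.log(serverErrors.length, oneRequest.length);
```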

Metrics + a real health check

Track the golden signals (latency, traffic, errors, saturation). Split the health check in two: /health stays a liveness check ("am I running?") and a new /ready is a readiness check ("can I reach Postgres and Redis?"), because an instance that's up but can't reach its database should not receive traffic.

// request-id + structured access log middleware
app.use("*", async (c, next) => {
  const reqId = crypto.randomUUID();
  const start = Date.now();
  c.set("reqId", reqId);
  await next();
  console.log(JSON.stringify({
    reqId, method: c.req.method, path: c.req.path,
    status: c.res.status, ms: Date.now() - start,
  }));
});
// readiness: don't take traffic if dependencies are down
app.get("/ready", async (c) => {
  try {
    await pool.query("select 1");
    await redis.ping();
    return c.json({ ready: true });
  } catch {
    return c.json({ ready: false }, 503); // LB stops routing here
  }
});

Step 4: Define one SLO

An SLO (service level objective) is a number that defines "good enough," so "is it broken?" stops being a vibe. The chapter's framing: pick a target, measure against it, and the gap between perfect and your target is your error budget — the room you have to take risks.

SLO: 99.5% of GET /s/:publicId requests succeed (non-5xx) and return in under 300ms, measured over 30 days.

That 0.5% is the budget. If you're comfortably within it, ship features faster. If you've burned it — a bad week of errors — the signal is to stop shipping risk and spend effort on reliability. One SLO on your most important path is worth more than a dashboard of metrics nobody has a threshold for.
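The budget arithmetic can be sketched in a few lines. This assumes you can get per-request status and latency for the SLO route (for example from the structured access logs); the names RequestOutcome, errorBudget, and budgetReport are hypothetical. The target is expressed in basis points (9950 = 99.5%) only to keep the maths in integers.

```typescript
// One observed request on the SLO route.
interface RequestOutcome {
  status: number;
  ms: number;
}

// "Good" under the SLO: non-5xx and faster than 300 ms.
function isGood(r: RequestOutcome): boolean {
  return r.status < 500 && r.ms < 300;
}

// How many bad requests the target allows for a given volume.
// targetBp is the target in basis points: 9950 means 99.5%.
function errorBudget(totalRequests: number, targetBp: number): number {
  return Math.floor((totalRequests * (10000 - targetBp)) / 10000);
}

// Compare observed bad requests against the allowed budget.
function budgetReport(outcomes: RequestOutcome[], targetBp: number) {
  const bad = outcomes.filter((r) => !isGood(r)).length;
  const allowed = errorBudget(outcomes.length, targetBp);
  return { total: outcomes.length, bad, allowed, remaining: allowed - bad, withinBudget: bad <= allowed };
}

// With 1,000,000 requests in the 30-day window, 99.5% allows 5,000 bad ones.
console.log(errorBudget(1000000, 9950)); // 5000
```

A slow success counts against the budget just like a 500 does, because the SLO combines success and latency in one condition.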

Why an SLO beats 'aim for 100%'

100% uptime is impossible, and chasing it gets disproportionately expensive: each additional nine costs far more than the one before it. An SLO names what's actually good enough for users and turns reliability into a budget you spend deliberately: within budget, move fast; over budget, slow down and harden. It converts an emotional argument ("is this reliable enough?") into a number both engineers and the business can point at.

Acceptance check

docker build -t snippets . && docker run -p 3000:3000 --env-file .env snippets
# in a second terminal, while the container is running:
curl localhost:3000/health    # 200
curl localhost:3000/ready     # 200 if DB+Redis reachable; stop Postgres → 503

# push a branch → CI runs migrations + tests. Merge to main → it deploys.
# tail the logs → each request is one JSON line with a reqId, status, and ms.

You're done when the app runs from a non-root multi-stage image, CI refuses to deploy on a red suite, /ready flips to 503 when a dependency is down, logs are structured with a request id, and you've written down one SLO. Commit it.

What you just internalised

Shipping is its own discipline: a small non-root container with no baked-in secrets, a pipeline that tests before it deploys and coordinates migrations with rolling updates, observability you set up before the incident (structured logs with a request id, golden-signal metrics, a readiness check that gates traffic), and one SLO that turns reliability into a budget. This is the table-stakes work that keeps the thing alive after launch — the part that separates a demo from a service.
