Journal · 8 May 2026 · 6 min read

Three production outages from one schema migration — and the deploy script we hardened to stop them

Three days, three production 500s, all on actions touching the same set of tables. Each fix was the same shape: a column had been added in the schema files but never made it to the production database. Each time, the deploy pipeline had reported “Deploy OK.” Each time, it was lying.

What follows is a narrative of how the three incidents stacked up, what we wrongly diagnosed each time, and the three changes we eventually made to the deploy script that have stopped the class of failure permanently.

Day one — “but the column is nullable”

A team member added an optional column to a table (let's call it orders.confirmed_at) and updated the form that writes orders. They committed both files together, pushed, watched the CI go green, opened the production page, and clicked save. 500.

The error in the logs was unambiguous:

PostgresError: column "confirmed_at" of relation "orders" does not exist

The first thought, theirs and ours, was “but I made it nullable, so it shouldn't fail on insert.” Drizzle doesn't care that the column is nullable. Called without arguments, Drizzle's .returning() emits RETURNING with the full column list from the schema. Same for .select(): it emits SELECT col1, col2, … enumerating every column the schema knows about. If production is missing one of them, every such query against the table fails, and the action behind it 500s. The mutation's nullable-on-insert behaviour is irrelevant.

Quick fix: SSH to prod, run db:push manually, restart PM2. Site back up in five minutes. We told ourselves it was a one-off.

Day two — “the deploy script was supposed to handle this”

The next afternoon, a different table, same shape of failure. This time the schema-changing PR was merged at 14:15; deploy started at 14:17; reported success at 14:23. Form submission at 14:31: 500.

We dug into the deploy script. The db:push step was running drizzle-kit push against the production database, but through the Supabase transaction pooler on port 6543. Drizzle-kit's push uses introspection that emits multiple statements per logical change; the transaction pooler can't hold a session across them. The push quietly hung until it hit the script's 180-second timeout, and the script's || true swallowed the failure. The deploy moved on, restarted PM2, and returned green.
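To make the swallowing concrete, here is a minimal sketch of the two shapes of the step. It is not our actual script: push_schema is a hypothetical stand-in that simply fails with status 124, the status GNU timeout reports when a command overruns its limit.

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for the real push command. It fails with 124, the exit
# status GNU `timeout` reports when the wrapped command exceeds its time limit.
push_schema() {
  return 124
}

# Old shape: `|| true` turns any failure into success, so the caller cannot
# tell a timed-out push from one that actually applied the schema.
push_step_swallowed() {
  push_schema || true
}

# Hardened shape: the step returns whatever the push returned.
push_step_strict() {
  push_schema
}
```

Calling push_step_swallowed leaves the exit status at 0, while push_step_strict leaves it at 124, which is what allows a later step to refuse to continue.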

Two simultaneous fixes:

Day three — “the deploy script was supposed to handle this (this time)”

You can see where this is going. Day three, third table, third 500. This time the deploy was even greener — the db:push step had logged No changes detected. So why was the column missing?

Because the previous deploy hadn't actually applied the schema. The db:push step had failed on day two before our fix, and the script had moved on. The day-three deploy ran db:push again, but the schema files in this commit were unchanged from the previous commit (different feature, no schema touch), so drizzle reported nothing to do — without ever rerunning the push that was needed from two deploys ago.

The class of failure here isn't “script broken once”; it's “the script incorrectly assumes that this commit's schema state matches production's schema state.” The fix is structural.

The hardened deploy script

Three changes, layered:

1. Schema-touch detection gates db:push

The deploy script now diffs the schema files against the last successfully-deployed SHA. If any file in packages/db/src/schema/ differs, db:push runs. Otherwise it's skipped, which is the desired behaviour. The crucial detail is that we track the last successful push, not the last attempt: the .last-deployed-sha file is only updated after the push step exits 0.
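The gate described above could be sketched roughly as follows. This is an illustration under assumptions, not our script verbatim: the function names are hypothetical, using git diff --name-only to find changed files is an assumption, and so is the choice to push whenever the recorded SHA is missing or unusable.

```shell
#!/usr/bin/env bash

SCHEMA_DIR="packages/db/src/schema/"   # path from the article
STATE_FILE=".last-deployed-sha"        # file name from the article

# Succeeds if any path on stdin (one per line) lies under the schema directory.
schema_touched() {
  grep -q "^${SCHEMA_DIR}"
}

# Succeeds when db:push must run. Assumption: when in doubt, push.
needs_push() {
  local last_sha changed
  # No record of a successful push yet.
  [ -s "$STATE_FILE" ] || return 0
  last_sha="$(cat "$STATE_FILE")"
  # If the diff itself fails (for example an unknown SHA), push as well.
  changed="$(git diff --name-only "$last_sha" HEAD)" || return 0
  schema_touched <<< "$changed"
}

# Runs the push command given as arguments and records the current SHA
# only if that command exits 0. A failed push leaves the state file untouched.
push_if_needed() {
  needs_push || return 0
  "$@" || return 1
  git rev-parse HEAD > "$STATE_FILE"
}
```

It would be invoked as push_if_needed followed by the project's real push command. Because the state file only moves forward on success, a push that failed two deploys ago still shows up as a schema difference on the next deploy.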

2. The push step fails the deploy

set -e wraps the push step. No || true. If the push fails, the script aborts before restarting PM2. PM2 keeps serving the previous build off the previous schema. The site stays up; the deploy reports red; we get an alert; we fix.
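As a rough sketch of that ordering, assuming hypothetical push_schema and restart_app functions in place of the real db:push and PM2 commands:

```shell
#!/usr/bin/env bash

# Hypothetical stand-ins so the sketch runs anywhere. A real script would call
# the project's db:push command and a PM2 restart here.
push_schema() { echo "push failed" >&2; return 1; }
restart_app() { echo "restarting app"; }

# The body runs in a subshell, so `set -e` ends only the deploy, not the caller.
# Caveat: bash ignores `set -e` when the function is called from an `if`,
# `&&` or `||` context, so call `deploy` as a plain command.
deploy() (
  set -e
  push_schema    # no `|| true`: a non-zero status stops the deploy here
  restart_app    # only reached when the push succeeded
)
```

With the failing stand-in, deploy exits non-zero and never prints "restarting app", which mirrors the behaviour we wanted: the old build keeps serving and the deploy reports red.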

3. Post-deploy verification on the affected endpoint

For schema changes, the deploy script (or the human merging the PR) hits one tRPC endpoint that touches the new column and verifies the response includes the field. Five seconds, catches the gap before the first user does. We made this part of the “PR template” for any change that touches the schema directory.
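A check along these lines could look like the sketch below. The function names are hypothetical, the URL and field name are supplied by the caller, and the grep match is a deliberately crude test for a JSON key rather than a real JSON parse.

```shell
#!/usr/bin/env bash

# Succeeds if the JSON on stdin contains a key with the given name.
# Crude by design: it looks for "field": anywhere in the body.
response_has_field() {
  local field="$1"
  grep -q "\"${field}\"[[:space:]]*:"
}

# Fetches the endpoint and checks that the new field is in the response.
# -f makes curl fail on HTTP errors, so a 500 fails the check too.
verify_endpoint() {
  local url="$1" field="$2" body
  body="$(curl -fsS --max-time 10 "$url")" || return 1
  response_has_field "$field" <<< "$body"
}
```

A deploy would call verify_endpoint with the URL of an endpoint that reads the changed table and the name of the new field as it appears in the response, and fail loudly if the check returns non-zero.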

The methodology takeaway

The class of failure here isn't Drizzle-specific or Supabase-specific. It's a category: “the deploy script's notion of state diverges silently from production's actual state.” Same shape applies to env vars, to migrations, to feature flags, to CDN cache purges. The methodology answer is the same:

Each of those three fixes became an invariant in the project's rule library. They travel with the method to the next codebase. The next team won't have to discover the same failure three times to get the lesson.


More on the methodology these invariants come from in The agentic delivery method. More from the same codebase in the PickNDeal case study.