Three production outages from one schema migration, and the deploy script we hardened to stop them

Three days, three production-shape 500s during PickNDeal's pre-launch shakedown. The deploy ran against the production database, with the production deploy script, in a fully production-shape environment, but no public users had been let onto the system yet. We were the only ones hitting the failing endpoints. Each failure had the same shape: a column had been added in the schema files but never made it to the production database. Each time, the deploy pipeline had reported “Deploy OK.” Each time, it was lying.

What follows is a narrative of how the three failure modes stacked up, what we wrongly diagnosed each time, and the three changes we eventually made to the deploy script that have stopped the class of failure permanently. The dress rehearsal was specifically designed to surface this kind of thing before we opened the doors, and it did.

Day one, “but the column is nullable”

A team member added an optional column to a table (call it orders.confirmed_at) and updated the form that writes orders. They committed both files together, pushed, watched the CI go green, opened the production page, and clicked save. 500.

The error in the logs was unambiguous:

PostgresError: column "confirmed_at" of relation "orders" does not exist

The first thought, and ours too, was “but I made it nullable, so it shouldn't fail on insert.” Drizzle doesn't care that the column is nullable. Called without arguments, Drizzle's .returning() emits RETURNING with the full column list from the schema. Same for .select(): it emits SELECT col1, col2, … enumerating every column the schema knows about. If production is missing one of them, every such query against the table fails with a 500. The mutation's nullable-on-insert behaviour is irrelevant.

Quick fix: SSH to prod, run db:push manually, restart PM2. Site back up in five minutes. We told ourselves it was a one-off.

Day two, “the deploy script was supposed to handle this”

The next afternoon, a different table, same shape of failure. This time the schema-changing PR was merged at 14:15; deploy started at 14:17; reported success at 14:23. Form submission at 14:31: 500.

We dug into the deploy script. The db:push step was running drizzle-kit push against the production database, but through the Supabase transaction pooler on port 6543. Drizzle-kit's push uses introspection that emits multiple statements per logical change; the transaction pooler can't hold a session across them. The push quietly hung until the script's 180-second timeout, and the script's || true swallowed the failure. The deploy moved on, restarted PM2, and returned green.
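The masking effect is easy to reproduce in isolation. The sketch below is not the actual deploy script: hung_push is a hypothetical stand-in that uses sleep in place of the hung push, and it assumes the time limit is enforced with GNU coreutils timeout, which exits with status 124 when the limit is hit.

```shell
# Hypothetical stand-in for a push that hangs: "sleep 5" plays the hung command.
# $1 is the time limit in seconds (the script in this post used 180).
hung_push() {
  timeout "$1" sleep 5
}

# Old shape: "|| true" replaces whatever the push returned with success.
swallowed_status() {
  hung_push "$1" || true
  echo "$?"
}

# Same step with the exit status captured instead of discarded.
raw_status() {
  rc=0
  hung_push "$1" || rc=$?
  echo "$rc"
}
```

With a one-second limit, raw_status 1 prints 124 while swallowed_status 1 prints 0. Any later step that only looks at the exit status sees a successful push.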

Two simultaneous fixes:

Switch DB_PORT for the push step to 5432 (session pooler), not 6543. Drizzle introspection works against the session pooler.
Drop the || true. If the schema push fails, the deploy fails, loudly. Better a red CI than a silent corruption.

Day three, “the deploy script was supposed to handle this (this time)”

You can see where this is going. Day three, third table, third 500. This time the deploy was even greener. The db:push step had logged No changes detected. So why was the column missing?

Because the previous deploy hadn't actually applied the schema. The db:push step had failed on day two before our fix, and the script had moved on. The day-three deploy ran db:push again, but the schema files in this commit were unchanged from the previous commit (different feature, no schema touch), so Drizzle reported nothing to do, and the push that had been needed two deploys earlier was never rerun.

The class of failure here isn't “script broken once”. It's “the script incorrectly assumes that this commit's schema state matches production's schema state.” The fix is structural.

The hardened deploy script

Three changes, layered:

1. Schema-touch detection gates db:push

The deploy script now diffs the schema files against the last successfully-deployed SHA. If any file in packages/db/src/schema/ differs, db:push runs; otherwise it's skipped. The important detail is that we track the last successful push, not the last attempt: the .last-deployed-sha file is only updated after the push step exits 0.
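A minimal sketch of that gate, under stated assumptions: the function names are hypothetical, the push command is passed in by the caller rather than hard-coded, and the script runs from the repository root. Only the schema directory and the .last-deployed-sha file name come from the description above.

```shell
SCHEMA_DIR="packages/db/src/schema/"
SHA_FILE=".last-deployed-sha"

# Succeeds (exit 0) when a push is needed.
schema_changed_since_last_push() {
  # No record of a successful push yet: assume a push is needed.
  [ -f "$SHA_FILE" ] || return 0
  local last_sha
  last_sha="$(cat "$SHA_FILE")"
  # "git diff --quiet" exits 0 when nothing differs and non-zero otherwise.
  # An unknown SHA also exits non-zero, so errors lean towards pushing.
  ! git diff --quiet "$last_sha" HEAD -- "$SCHEMA_DIR"
}

# Usage: push_if_schema_changed <push command...>
push_if_schema_changed() {
  if schema_changed_since_last_push; then
    "$@" || return 1                   # failed push: leave SHA_FILE untouched
    git rev-parse HEAD > "$SHA_FILE"   # recorded only after the push exits 0
  fi
}
```

Because a failed push leaves the recorded SHA alone, the next deploy still sees the schema diff and retries, even if that deploy's own commit touches no schema files.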

2. The push step fails the deploy

set -e wraps the push step. No || true. If the push fails, the script aborts before restarting PM2. PM2 keeps serving the previous build off the previous schema. The site stays up; the deploy reports red; we get an alert; we fix.
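The ordering can be sketched as follows. This is an illustration, not the real script: deploy_strict is a hypothetical name, and both commands are placeholders supplied by the caller. The steps run in a child shell started with -e because bash ignores errexit inside a function that is called from an if, && or || context, which would quietly undo the protection.

```shell
# $1 = schema push command, $2 = restart command (both placeholders).
deploy_strict() {
  bash -e -c '
    eval "$1"          # schema push: a non-zero exit aborts the child shell here
    eval "$2"          # restart: reached only when the push exited 0
    echo "Deploy OK"
  ' deploy "$1" "$2"
}
```

Calling deploy_strict 'false' 'echo restarted' prints nothing and returns non-zero, so the restart never happens and the previous build keeps serving. With 'true' as the push command it prints restarted and then Deploy OK.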

3. Post-deploy verification on the affected endpoint

For schema changes, the deploy script (or the human merging the PR) hits one tRPC endpoint that touches the new column and verifies the response includes the field. Five seconds, catches the gap before the first user does. We made this part of the “PR template” for any change that touches the schema directory.
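One way the scripted version of this check could look, as a rough sketch: both function names are hypothetical, the URL is whatever endpoint the PR author nominates, and the field check is a crude textual match for a JSON key rather than a real JSON parse.

```shell
# Reads a response body on stdin; succeeds if it contains the JSON key $1.
response_has_field() {
  grep -q "\"$1\"[[:space:]]*:"
}

# $1 = endpoint URL (supplied by the caller), $2 = field expected in the response.
verify_endpoint() {
  # --fail makes curl exit non-zero on HTTP errors; an empty body then fails the grep.
  curl --fail --silent --show-error --max-time 5 "$1" | response_has_field "$2"
}
```

A body such as {"confirmed_at":null} passes the check, since the key is present even though the value is null. A body without the key, or an HTTP error, fails it.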

The methodology takeaway

The class of failure here isn't Drizzle-specific or Supabase-specific. It's a category: “the deploy script's notion of state diverges silently from production's actual state.” The same shape applies to env vars, to migrations, to feature flags, to CDN cache purges. The method's answer is the same:

Detect the change before deploy (schema-touch detection).
Don't swallow failures (no || true).
Verify post-deploy against the actual contract you intended to deliver, not against “PM2 returned 200 on /.”

Each of those three fixes became a rule in the project's rule library. They travel with the method to the next codebase. The next team won't have to discover the same failure three times to get the lesson.

This is the failure mode we now have a rule for. The full rule lives in feedback_schema_changes_require_migration.md and ships with every new contributor's clone of the codebase. The next time a session looks at a Drizzle schema file, it reads this rule before writing the consuming code, and the third occurrence stops shipping to production. See the agentic engineering method, principle 02: “Codify what production teaches.”

More on the methodology these rules come from in The agentic delivery method. More from the same codebase in the PickNDeal case study.