Define the surface, not the prompt
The first thing we do on any new system is define the surface area an agent operates on — the read tools, the write tools, the human-approval gates — not the prompt that drives it. Prompts are throwaway; the tool surface compounds. A well-designed surface makes a mediocre prompt produce reliable results; a great prompt with a leaky surface produces undefined behaviour at scale.
Concretely: every action the agent can take goes through a typed tool. Every state mutation goes through a human-review gate by default. We earn the right to remove gates by demonstrating a clean audit history.
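A minimal sketch of what that surface can look like, assuming a zod-style schema library. The tool name, its fields, and the `enqueueForApproval` helper are illustrative, not lifted from a real codebase:

```ts
import { z } from "zod";

// Illustrative gate: in a real system this writes to the review queue.
async function enqueueForApproval(tool: string, input: unknown): Promise<void> {
  console.log("queued for human review:", tool, input);
}

// Every agent action is a typed tool: schema-validated input, an explicit
// mutation flag, one execution path.
const RefundInput = z.object({
  orderId: z.string(),
  amountCents: z.number().int().positive(),
});

const refundOrder = {
  name: "refund_order",
  mutatesState: true, // state mutations gate on human review by default
  async call(raw: unknown) {
    const input = RefundInput.parse(raw); // malformed agent output fails here, not downstream
    if (refundOrder.mutatesState) {
      await enqueueForApproval(refundOrder.name, input);
      return { status: "queued_for_review" as const };
    }
    return { status: "executed" as const };
  },
};
```

Removing the gate later is a one-flag change backed by audit history, not a prompt edit.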
Invariants are the moat
We maintain a project-scoped invariant library — a set of rules of the form “don't do X because here's the production incident that taught us why.” Each invariant has a name, a one-paragraph explanation of the failure mode that produced it, and code-level enforcement when possible.
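As a sketch, one entry can be this small. The invariant name, incident text, and enforcement check below are invented for illustration:

```ts
type QueryContext = { rowsAffected: number; inTransaction: boolean }; // illustrative

interface Invariant {
  name: string;
  incident: string;                      // the failure mode that produced the rule
  enforce?: (ctx: QueryContext) => void; // code-level check, when one is possible
}

const noUnbatchedBulkWrites: Invariant = {
  name: "no-unbatched-bulk-writes",
  incident:
    "Illustrative example: a single unbatched UPDATE held a table lock long enough to time out checkout.",
  enforce: (ctx) => {
    if (ctx.rowsAffected > 1000 && !ctx.inTransaction) {
      throw new Error("invariant violated: no-unbatched-bulk-writes");
    }
  },
};
```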
The PickNDeal codebase has 20+ invariants. The first 10 were written after watching agents (and humans) rediscover the same production failures. Each invariant marks a class of incident that no longer recurs. The list compounds. The list is the moat.
The audit trail is the product
Every agent run writes to a trail: which tool calls, which inputs, which outputs, which human approvals, which auto-rollbacks. Stored as structured data, queryable, exportable. When something goes wrong — and it will — the trail is how we diagnose without re-running the agent. When something goes right, the trail is how we prove it to a stakeholder who wasn't in the room.
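A sketch of the entry shape, assuming a relational store; the field names are ours, not a published schema:

```ts
interface TrailEntry {
  runId: string;        // groups every step of one agent run
  step: number;
  tool: string;
  input: unknown;
  output: unknown;
  approvedBy?: string;  // present when a human gate was involved
  rolledBack: boolean;  // set by auto-rollback
  at: string;           // ISO-8601 timestamp
}

// Diagnosis becomes a query instead of a re-run, e.g. (illustrative SQL):
// SELECT * FROM trail WHERE run_id = $1 ORDER BY step;
```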
For client engagements, the audit trail is what we hand over alongside the working code. It's the evidence layer that lets the client's board sign off on agentic systems running in production.
Human-in-the-loop on the fault line, not everywhere
Reviewing every agent action manually defeats the point. Not reviewing any action defeats the point in a different way. We design the human-review surface around the fault line — the small set of actions that, if wrong, produce real damage (financial transactions, customer-visible mutations, irreversible deletes). Everything else runs on rails with audit-after-the-fact.
The dashboard pattern we use is exactly this: a queue of pending actions the agent has flagged, with one-tap approve/reject and a structured-diff view of what changed. We'll publish a standalone version of this UI as the first artefact under /open-source.
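The shape behind that queue can be sketched like this; names are assumptions, and the published artefact may differ:

```ts
interface PendingAction {
  id: string;
  tool: string;
  reason: string;                                             // why the agent flagged it
  diff: { field: string; before: unknown; after: unknown }[]; // feeds the structured-diff view
}

// Illustrative stubs for the trail writer and executor.
declare function audit(entry: object): Promise<void>;
declare function execute(action: PendingAction): Promise<void>;

async function review(action: PendingAction, verdict: "approve" | "reject", reviewer: string) {
  await audit({ actionId: action.id, verdict, reviewer }); // every decision lands in the trail
  if (verdict === "approve") await execute(action);        // rejects never touch state
}
```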
Reproduce in production-shape, not toy-shape
The standard mistake is testing agentic systems on toy data and trusting that production will look the same. It never does. We reproduce production shape from day one: real schemas, real concurrency, real error rates. If the agent can't handle a 1% failure rate from an upstream service in dev, it will cascade in production.
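One way to get that failure rate in dev is a thin fault-injection wrapper around the upstream client. The helper below is our own illustration, with the 1% default taken from the paragraph above:

```ts
// Wrap any async client call so a configurable slice of calls fails in dev,
// the way the real upstream does in production.
function withFailureRate<A extends unknown[], R>(
  fn: (...args: A) => Promise<R>,
  rate = 0.01, // the 1% upstream failure rate from the text
): (...args: A) => Promise<R> {
  return async (...args) => {
    if (Math.random() < rate) throw new Error("injected upstream failure");
    return fn(...args);
  };
}

// Usage (illustrative client):
// const fetchStock = withFailureRate(inventoryClient.fetchStock.bind(inventoryClient));
```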
Cron + queue + webhook is the holy trinity
Most agent failures we've seen are timing failures, not logic failures: cron jobs that run on a slightly wrong cadence; webhooks that retry into duplicate state; queues that lose messages on restart. We start every agentic system with the same trio: durable cron (with retries + alerting), idempotent webhook handlers (with HMAC verification + replay protection), and at-least-once queue processing (with deduplication on the consumer side). A sketch of the webhook leg follows.
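This sketch uses Node's crypto module; the parameter names, secret source, and in-memory replay set are assumptions (back the set with a durable store in production):

```ts
import { createHmac, timingSafeEqual } from "node:crypto";

const seen = new Set<string>(); // replay protection; use a durable store in production

declare function handleEvent(event: unknown): Promise<void>; // the idempotent business handler

export async function handleWebhook(body: string, signature: string, eventId: string) {
  // HMAC verification: recompute the signature over the raw body.
  const expected = createHmac("sha256", process.env.WEBHOOK_SECRET ?? "")
    .update(body)
    .digest("hex");
  const valid =
    expected.length === signature.length &&
    timingSafeEqual(Buffer.from(expected), Buffer.from(signature));
  if (!valid) throw new Error("invalid webhook signature");

  // Replay protection: retried deliveries of the same event are acknowledged, not re-processed.
  if (seen.has(eventId)) return;
  seen.add(eventId);

  await handleEvent(JSON.parse(body)); // at-least-once delivery means this must stay idempotent
}
```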
Ship the boring infra first
Authentication, role-scoped permissions, secrets management, deployment hardening, error tracking, structured logging — none of this is “agentic.” All of it must work before any agent does anything interesting. The work most teams skip in their excitement is exactly what determines whether the interesting work survives the first month in production.
Where the method came from
Refined across production codebases including PickNDeal (B2B + D2C food marketplace, 14 phases shipped), PayoutKit (Stripe Connect commercial boilerplate), and client engagements going back to 2018. Each phase surfaced a class of failure that produced an invariant. The invariants travel between projects — the methodology is the same whether we're building a marketplace or replacing a legacy procurement system.
We're publishing the engineering stories behind each invariant on the journal, and the human-review-loop UI is being extracted as the first open-source artefact. The full methodology lives inside client engagements.
Get the method as a 12-page PDF.
Seven principles + a representative slice of the invariant library + a one-page “how to apply this to your codebase” guide. Yours to share with your team or your CTO.
Want this method applied to your system?
We take on a small number of engagements per quarter. Discovery, build, or transformation tracks.
Submit a project brief