The x402 Auto-Pilot

April 20, 2026 • 10 min read

When we talk to people about our x402 payment gateway, the same question usually comes up once the technical part of the conversation ends: how does a small team like ours sustain the test coverage and release pace they see in our repos? The honest answer is that we built a dev loop around isolated Docker containers and Claude Code that removes most of the drudgery from the work that normally eats a small team's schedule. Once that loop was running, we stopped thinking about it as tooling and started thinking about it as the auto-pilot for the whole gateway. This post looks at what that means in practice, because it is the part of our workflow clients are most curious about and the least visible from the outside.

Why we wanted an auto-pilot at all

A multi-chain x402 gateway is the kind of codebase where the number of things that can go wrong grows quickly with the number of chains and rails you support. Each chain has its own client library, its own idea of how a signature should be formatted, its own fee estimation quirks, and its own set of failure modes that only show up under real network conditions. Supporting Base, XRPL, Solana, and Stellar in a single gateway means your tests can no longer be a polite suite that a developer runs before opening a pull request: by the time a developer has run the full suite once, they have lost thirty minutes. That is not a sustainable model for a small team. If you let it drift, the test suite becomes aspirational instead of real, which is exactly when the production incidents start.

The reframe that got us out of that trap was to stop thinking of the test suite as something a developer runs and start thinking of it as something the codebase continuously runs on its own. The only way that reframe works without a huge cloud bill is to get really efficient about how tests are sliced, how containers are reused, and how agents decide what actually needs to run in response to a given change. That is the problem the auto-pilot solves, and the result is that our coverage is real, our feedback is fast, and the team spends its time on the interesting parts of the gateway instead of waiting on green checks.

The architecture in plain language

The setup has three layers and they are all fairly boring in isolation, which is how we know it works. At the bottom we have a set of Docker images, each of which is pinned to a specific chain environment and its required tooling. There is an XRPL image with rippled and xrpl.js, an EVM image with Foundry and a local Anvil chain, a Solana image with the CLI and a local validator, a Stellar image with the SDK and a quickstart node, and a few general-purpose images for the gateway code itself and its TypeScript toolchain. Each image is deterministic and small enough that we can spin up dozens of them in parallel on a single workstation without anything catching fire.
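To make the bottom layer concrete, here is a sketch of how the orchestrator could describe those pinned images in code. The image names, tags, ready-signal log lines, and ports are all hypothetical stand-ins, not the images we actually publish; the point is the shape: one deterministic, pinned entry per chain.

```typescript
// Hypothetical registry of the per-chain Docker images. Every tag is
// pinned (never "latest") so the same change always runs against the
// same chain environment.
interface ChainImage {
  image: string;        // pinned image tag
  readyLogLine: string; // log line that signals the local node is usable
  ports: number[];      // ports the test harness connects to
}

const CHAIN_IMAGES: Record<string, ChainImage> = {
  xrpl:    { image: "autopilot/xrpl:2.1.0",     readyLogLine: "Application starting", ports: [5005, 6006] },
  evm:     { image: "autopilot/anvil:0.8.0",    readyLogLine: "Listening on",         ports: [8545] },
  solana:  { image: "autopilot/solana:1.18.0",  readyLogLine: "JSON RPC URL",         ports: [8899] },
  stellar: { image: "autopilot/stellar:21.0.0", readyLogLine: "ingestion caught up",  ports: [8000] },
};

// Fail loudly on an unknown chain instead of falling back to a guess.
function imageFor(chain: string): string {
  const entry = CHAIN_IMAGES[chain];
  if (!entry) throw new Error(`no image pinned for chain: ${chain}`);
  return entry.image;
}
```

The ready-signal line is what lets the orchestrator start agents writing code while containers are still warming up, then block only at the moment tests actually need the node.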

The middle layer is the orchestrator. When a change lands in a working directory, the orchestrator reads the diff, decides which test groups are potentially affected, and spawns the right containers in parallel. This is not a shell script; it is a small service that tracks container state, collects logs, and surfaces failures back up to the agent layer in a format that Claude Code can consume. The point of this layer is to make sure the auto-pilot never guesses about which tests to run and never runs the same test twice because two unrelated changes happened to touch the same file.
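The "never guess, never run twice" selection step can be sketched as a pure function. The path conventions here (`src/chains/<chain>/...`, `src/gateway/...`) are assumptions for illustration, not our actual layout, and the real service also tracks container state and streams logs.

```typescript
// Chains the gateway supports, per the architecture above.
const CHAINS = ["base", "xrpl", "solana", "stellar"];

// Map one changed file to the test groups it can affect.
function groupsForFile(path: string): string[] {
  for (const chain of CHAINS) {
    if (path.startsWith(`src/chains/${chain}/`)) return [chain];
  }
  // Shared gateway code can affect every chain, so everything runs.
  if (path.startsWith("src/gateway/")) return [...CHAINS, "gateway"];
  return ["gateway"];
}

// Deduplicate across all changed files so two unrelated changes that
// touch the same area still schedule each test group exactly once.
function selectTestGroups(changedFiles: string[]): string[] {
  const groups = new Set<string>();
  for (const file of changedFiles) {
    for (const g of groupsForFile(file)) groups.add(g);
  }
  return [...groups].sort();
}
```

A change confined to one chain's directory spins up one container; a change to shared gateway code fans out to all of them. The dedupe through the `Set` is what keeps cost linear in the number of affected groups rather than the number of changed files.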

The top layer is the agent layer, which is where Claude Code lives. Claude Code runs inside an isolated environment with exactly the tools it needs for the task it is handling, and the orchestrator hands it the container logs and the failing output when something breaks. That last piece is what makes the whole thing feel like an auto-pilot instead of a build system, because when a test fails, Claude Code can look at the actual chain behavior, correlate it with the gateway code, and propose a fix that goes through the same review loop that any human change would. The humans are still in the loop, but the loop is built around agents that have already done the diagnostic work before the human even opens the pull request.
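The handoff between the orchestrator and the agent layer might look something like the following. The field names and format are invented for illustration; what matters is that the agent receives the chain, the failing test, and the relevant container output as one coherent payload instead of having to go fishing for context.

```typescript
// Hypothetical shape of the failure report the orchestrator hands to
// the agent layer when a test group goes red.
interface FailureReport {
  chain: string;
  testGroup: string;
  failingTest: string;
  exitCode: number;
  logTail: string[]; // last N lines of container output
}

// Render the report as plain text the agent can read directly,
// with the container logs attached below the summary.
function formatForAgent(report: FailureReport): string {
  return [
    `Test failure in group "${report.testGroup}" (chain: ${report.chain})`,
    `Failing test: ${report.failingTest}, exit code ${report.exitCode}`,
    "--- container log tail ---",
    ...report.logTail,
  ].join("\n");
}
```

Because the same payload format is used for every chain, the agent's diagnostic loop does not change when the chain does; only the log contents do.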

What a normal change actually looks like

Here is the flow for a feature that we shipped recently, which was adding a new optional x402 header for usage metering. A Claude Code agent starts in a working directory with the gateway source mounted, reads the relevant spec notes from our persistent memory, and drafts a plan that touches the request parser, the response builder, the metering store, and a test file for each of the four supported chains. The orchestrator sees the plan, reserves containers for the chains that are actually implicated, and the agent writes code while the containers are warming up. When the first set of tests runs, the orchestrator streams the results back into the agent's context, and the agent iterates on the implementation until every chain passes.
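The parser end of that feature can be sketched as follows. The post does not name the real header, so the name "x402-meter-id" and its format here are hypothetical; the sketch only shows the property the tests across all four chains had to agree on: absence of an optional header is fine, but a malformed value is rejected.

```typescript
// Illustrative result of parsing an optional usage-metering header.
interface MeterInfo {
  meterId: string;
}

// Optional headers must degrade gracefully: a missing header returns
// null, while a malformed value throws so metering data stays clean.
// Header name and value format are hypothetical.
function parseMeterHeader(headers: Record<string, string>): MeterInfo | null {
  const raw = headers["x402-meter-id"];
  if (raw === undefined) return null; // header is optional
  if (!/^[a-z0-9-]{1,64}$/.test(raw)) {
    throw new Error(`malformed x402-meter-id: ${raw}`);
  }
  return { meterId: raw };
}
```

Keeping the parse step chain-agnostic like this is what let the same test assertion run in all four chain containers; only the settlement side of the metering path differed per chain.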

Nothing about this is magic. The thing that makes it work is that the agent is never reading stale information, never missing a test because of a flaky shared environment, and never blocked by the kind of state leakage that happens when you try to run multi-chain tests on a single long-lived node. The containers are disposable, the state is fresh, and the agent has everything it needs to diagnose a failure the moment it happens. When the pull request eventually opens, a human reads it, asks whatever clarifying questions they want, and merges. The review is faster because the code already has green tests across every chain, the commit history is clean because the agent was instructed to make small reviewable commits, and the documentation is already written because that was part of the plan.

Why Docker is the right choice here

Some people ask why we did not go with a more fashionable sandbox like a microVM or a function-as-a-service platform. The answer is that Docker has two specific properties that matter for this workload. The first is that the tooling for running, inspecting, and cleaning up containers is mature and understood by everyone who might ever need to touch the system, which is important when you want your auto-pilot to be legible to new engineers instead of mysterious. The second is that Docker composes well with everything else we use, from our CI to our local workstations to the agent runtime, so we do not have to maintain two parallel environments and keep them in sync. Boring technology that works is more valuable than clever technology that almost works, and Docker is the boring choice that keeps paying off.

We do pay attention to the overhead and the cleanup story. Long-running dev loops can accumulate dangling volumes and stale networks faster than people expect, and the orchestrator is explicit about tearing things down at the end of every test group. We use image caching aggressively for the expensive parts like local chain snapshots, and we keep the images themselves small enough that pulling a fresh copy on a new worker is not a ceremony. None of this is exotic; it is just the kind of maintenance that a shop handles once and then stops thinking about.
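The teardown step at the end of each test group can be sketched as a plan of ordinary Docker CLI invocations. The planning function here is an illustration of the sequencing, under the assumption that the orchestrator keeps its own list of the container IDs it started and executes each command via a child process; `docker rm -f`, `docker volume prune -f`, and `docker network prune -f` are standard CLI subcommands.

```typescript
// Build the ordered list of docker commands that tears down one test
// group: remove the group's containers first, then prune the dangling
// volumes and stale networks that long-running loops accumulate.
function teardownCommands(containerIds: string[]): string[][] {
  const cmds: string[][] = [];
  if (containerIds.length > 0) {
    cmds.push(["docker", "rm", "-f", ...containerIds]);
  }
  cmds.push(["docker", "volume", "prune", "-f"]);
  cmds.push(["docker", "network", "prune", "-f"]);
  return cmds;
}
```

Making the plan a pure function keeps it easy to test without touching a Docker daemon, which is in the same spirit as the rest of the loop: the boring parts are explicit, inspectable, and cheap to verify.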

What this unlocks for clients

The reason this matters for anyone who might hire us is that it is the reason our x402 integration engagements can ship in one to three weeks with the kind of coverage you expect from a much bigger team. When we add a new chain or a new feature to your x402 integration, we are not starting from scratch and we are not relying on a human to remember every detail of every chain's quirks. The auto-pilot has already been over this ground, the memory has already been written, and the test scaffolding is ready to receive your specific use case. That is where our flat-fee pricing comes from, because we know exactly what our cost structure looks like for the kinds of builds we take on.

It is also the reason we are comfortable taking on work where the chain list is going to grow over time, which is a common pattern for teams building on top of BitBooth. You do not have to explain to us how to keep the tests honest as the surface area expands, because honesty is baked into the loop. Your integration inherits a testing posture that was earned over thousands of test runs on the gateway itself, and it stays that way for as long as the retainer relationship runs.

The last thing we want to say about all this is that the auto-pilot is not a pitch for our services so much as a description of how small teams can do work that looks like it came from much bigger ones. If you are building your own x402 implementation or any serious payment rail, the specific choices might not match yours, but the shape of the answer is probably similar. The drudgery has to go somewhere, and putting it into containers and agents that never get tired is how we got our weekends back while also shipping better code.

Need this kind of discipline on your own x402 build?

We scope x402 and agent-payment integrations starting at ten thousand dollars flat, with delivery in one to three weeks. The first call is free.

Scope an Integration · See Services