How a Multi-Agent Engineering Harness Lets a Small Team Deliver Like a Large One

The hardest thing about being a small software firm is not the engineering; it is the arithmetic of headcount. A contracting officer reading a statement of work has every reason to assume that delivery velocity scales with the number of people on the team and that a small shop will simply move slower than a large one. We built the thing that breaks that assumption: a multi-agent AI engineering harness that runs on the Claude Agent SDK and lets the firm produce at a velocity and a volume that would normally require a much larger staff. This is an honest writeup of what that harness actually is, how it is wired, and what it deliberately refuses to do without a human in the loop. The framing that matters for a federal buyer is that this is how one certified small business delivers like a large contractor without pretending to be one.

Why a small firm builds a harness in the first place

The conventional way a software company adds throughput is to add engineers, and that path is closed to a firm that is deliberately small and intends to stay that way. The alternative we chose was to treat the engineering process itself as something that could be encoded, supervised, and run continuously, so the leverage comes from how the work is organized rather than from how many hands are on it. A harness in this sense is not a chatbot, and it is not a single model answering questions. It is an orchestration system that holds a backlog of real engineering goals, decides which one matters most right now, hands that goal to a specialist that knows how to do that one kind of work, and then refuses to let the result through until it has passed the checks a careful human reviewer would insist on. The point was never to remove the engineer from the loop. It was to let one engineer supervise what behaves like a small, disciplined team that never loses the thread between sessions.

The orchestrator and its specialist subagents

At the center of the harness is an orchestrator whose only job is to look at a backlog of engineering goals, pick the highest-priority item, and delegate it to whichever specialist is right for that kind of work. The specialists are not one general agent wearing different hats; they are distinct subagents with distinct role briefs, and each one is narrow on purpose. There is an endpoint builder that writes the actual production feature, a test writer that covers it, a docs writer that documents it, an infra engineer that handles the deployment and the cloud wiring, and a smoke runner that exercises the result end to end to confirm it actually works rather than merely compiles. Alongside those five sits a dedicated x402-payments security reviewer, a specialist whose entire reason to exist is to look at anything touching payment settlement with the suspicion a payments auditor brings, because the cost of a quiet mistake in settlement code is categorically higher than almost anywhere else. The orchestrator does not do any of this work itself. It routes, and that separation is what keeps each agent good at one thing instead of mediocre at all of them.
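The pick-and-delegate loop described above can be sketched in a few lines. This is a minimal illustration and not the harness's actual code: the `Goal` fields, the `kind` labels, the convention that a lower number means higher priority, and the string-returning specialist stubs are all assumptions made for the example. It does not call the Claude Agent SDK.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass(order=True)
class Goal:
    priority: int                      # assumption: lower number = more urgent
    title: str = field(compare=False)
    kind: str = field(compare=False)   # which specialist should take this goal


def make_specialist(role: str) -> Callable[[Goal], str]:
    """Hypothetical stand-in for a subagent; a real one would run a model session."""
    def run(goal: Goal) -> str:
        return f"{role}: {goal.title}"
    return run


# One narrow specialist per kind of work, mirroring the roles named in the text.
SPECIALISTS: Dict[str, Callable[[Goal], str]] = {
    "endpoint": make_specialist("endpoint-builder"),
    "tests": make_specialist("test-writer"),
    "docs": make_specialist("docs-writer"),
    "infra": make_specialist("infra-engineer"),
    "smoke": make_specialist("smoke-runner"),
    "payments-review": make_specialist("x402-payments-security-reviewer"),
}


class Orchestrator:
    """Routes goals to specialists; it never does the work itself."""

    def __init__(self) -> None:
        self._backlog: List[Goal] = []

    def add(self, goal: Goal) -> None:
        heapq.heappush(self._backlog, goal)

    def run_once(self) -> Optional[str]:
        if not self._backlog:
            return None
        goal = heapq.heappop(self._backlog)   # highest-priority goal first
        return SPECIALISTS[goal.kind](goal)   # delegate by kind of work
```

Keeping the routing table separate from the specialists means a new role can be added without touching the orchestrator's logic.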

The reason to split the work this way is the same reason a real engineering organization splits roles. A test writer that is only ever asked to write tests develops a sharper sense of edge cases than a generalist who bolts a few assertions onto a feature at the end, and a security reviewer that only ever reviews is not tempted to wave through its own code. By giving each subagent a typed brief and a single responsibility, the harness gets the benefit of specialization without the coordination overhead that the same division of labor imposes on a human team, because the orchestrator carries the context between them rather than relying on a standup or a ticket handoff to keep everyone aligned.
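One way to picture a typed brief with a single responsibility is an immutable record per role. The sketch below is an assumption about shape only: the field names (`role`, `responsibility`, `may_author_changes`) and the `accepts` helper are hypothetical and are not taken from the harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleBrief:
    role: str
    responsibility: str        # the single kind of work this role accepts
    may_author_changes: bool   # a reviewer that cannot write has nothing of its own to wave through


# Hypothetical briefs for two of the roles named in the text.
TEST_WRITER = RoleBrief("test-writer", "tests", True)
SECURITY_REVIEWER = RoleBrief("x402-payments-security-reviewer", "payment-review", False)


def accepts(brief: RoleBrief, work_kind: str) -> bool:
    """A role takes a piece of work only if it matches its one responsibility."""
    return brief.responsibility == work_kind
```

Freezing the dataclass means a brief cannot be widened at runtime, which keeps each role narrow by construction.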

The gates that keep it honest

The part of the harness that matters most to anyone worried about autonomous systems is not what it can do but what it is not allowed to do unsupervised. Every cycle of work runs behind two gates. The first is a lint-and-test gate, which means nothing the harness produces is considered done until it passes the project's linting and its test suite, so a change that breaks the build or fails a test never advances no matter how confident the agent that wrote it happens to be. The second is a supervisor-gated diff, which means the actual change set is held for review before it is allowed to land rather than being committed blindly, so a human stays in the position of approving what enters the codebase. Those two gates together are the difference between a system that generates plausible-looking work and a system whose output you would actually be willing to ship, and they are why we describe the harness as supervised automation rather than an autonomous coder turned loose on a repository.
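The two gates can be expressed as a short sequence of checks. In this sketch, which assumes rather than reproduces the real implementation, lint and test outcomes are plain booleans (in practice they would come from running the project's own tools), and `supervisor_approves` is a hypothetical callback standing in for the human review step.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ChangeSet:
    diff: str
    lint_ok: bool    # assumption: result of the project's linter, computed elsewhere
    tests_ok: bool   # assumption: result of the project's test suite, computed elsewhere


def lint_and_test_gate(change: ChangeSet) -> bool:
    """Gate 1: nothing counts as done unless both lint and tests pass."""
    return change.lint_ok and change.tests_ok


def run_gates(change: ChangeSet, supervisor_approves: Callable[[str], bool]) -> str:
    """Gate 2 runs only after gate 1: the diff is held until a supervisor approves it."""
    if not lint_and_test_gate(change):
        return "rejected: lint/test gate"
    if not supervisor_approves(change.diff):
        return "held: awaiting supervisor approval"
    return "landed"
```

The ordering is deliberate in this sketch: the supervisor is never asked to look at a change that already fails the automated checks.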

The harness also has to handle the failure mode that sinks most naive automation, which is getting stuck. When a specialist cannot make progress on a goal, the system does not sit and spin or halt and wait for a human to notice. It rotates, moving on to the next viable goal in the backlog and coming back to the stuck one later rather than burning a cycle against a wall. This rotate-on-stuck behavior is what lets the whole thing run unattended on a roughly fifteen-minute tick: a loop that runs every quarter hour is only useful if it degrades gracefully when one item is hard, and one that freezes on the first difficult goal delivers nothing while looking busy.
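Rotate-on-stuck maps naturally onto a double-ended queue. The sketch below is illustrative only: `run_tick` and the `attempt` callback are hypothetical names, and the policy of finishing at most one goal per tick is an assumption made to keep the example small.

```python
from collections import deque
from typing import Callable, Deque, Optional


def run_tick(backlog: Deque[str], attempt: Callable[[str], bool]) -> Optional[str]:
    """Try each goal at most once per tick.

    A goal that makes no progress is rotated to the back of the backlog so it
    is retried on a later pass instead of blocking everything behind it.
    Returns the goal that completed, or None if every goal was stuck.
    """
    for _ in range(len(backlog)):
        goal = backlog.popleft()
        if attempt(goal):
            return goal          # done: the goal leaves the backlog
        backlog.append(goal)     # stuck: rotate and move on
    return None
```

Bounding the loop by the backlog length guarantees a tick always terminates, even when every goal is stuck.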

From one harness to a fleet

Once the single-orchestrator pattern was proven on production engineering work, the same ideas scaled outward into something larger: an Autonomous Defensive Research Fleet built to grow a defensive Web3 security knowledge base on its own. Where the engineering harness is one orchestrator coordinating a handful of specialists, the fleet is a containerized system of nineteen specialist roles. One manager container fires every fifteen minutes and hands work out to worker containers, each of which spawns its own headless Claude session so the workers run genuinely in parallel rather than taking turns. A Redis container holds the job queue that keeps the whole thing coordinated, and a dashboard service exposes runtime tuning so the tick interval and the fleet size can be adjusted live without tearing anything down. It is the same philosophy as the engineering harness, narrow specialists coordinated by a manager and fed from a queue, rebuilt at the scale of a standing operation that runs around the clock.
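The manager, queue, and parallel-worker shape can be simulated in a single process. To be clear about the substitutions: this sketch uses Python's in-memory `queue.Queue` in place of the Redis job queue and threads in place of worker containers, and each "session" is reduced to producing a file name. The function names and the `.md` naming scheme are assumptions for illustration.

```python
import queue
import threading
from typing import List


def manager_tick(jobs: "queue.Queue[str]", roles: List[str]) -> None:
    """Stand-in for the manager: on each tick, enqueue one job per role."""
    for role in roles:
        jobs.put(role)


def worker(jobs: "queue.Queue[str]", results: List[str], lock: threading.Lock) -> None:
    """Stand-in for a worker container: drain jobs until the queue is empty."""
    while True:
        try:
            role = jobs.get_nowait()
        except queue.Empty:
            return
        doc = f"{role}.md"       # a real worker would run a headless session here
        with lock:
            results.append(doc)


def run_fleet(roles: List[str], n_workers: int) -> List[str]:
    jobs: "queue.Queue[str]" = queue.Queue()
    results: List[str] = []
    lock = threading.Lock()
    manager_tick(jobs, roles)
    threads = [
        threading.Thread(target=worker, args=(jobs, results, lock))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)       # completion order varies between runs, so sort
```

Because workers pull from a shared queue rather than being assigned jobs up front, changing `n_workers` changes throughput without changing the manager, which is the property that makes live fleet-size tuning plausible.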

What makes the fleet worth describing is that it has already run, and the numbers from its first day are real rather than projected. In its first twenty-four hours the fleet produced 710 markdown research documents across thirty successful runs with zero crashes, and all of that output was auto-pushed to GitHub as it was generated. We are careful to be precise about what that figure means: it is a measure of research throughput and not a claim about audited findings or shipped product. As a demonstration of what disciplined multi-agent orchestration can sustain, though, it is concrete, because 710 documents in a day across thirty clean runs is a volume of structured output a small human team would take a long time to match and could not match without rest.

What this means for a federal buyer

The reason any of this belongs in a writeup aimed at the government rather than buried in an engineering notebook is that it directly answers the question a contracting officer is right to ask about a small firm, which is whether a company this size can actually deliver at the pace a real program demands. Our answer is that delivery capacity here is not bounded by headcount the way it is at a conventional shop, because the harness lets one engineer drive the throughput of a coordinated team while keeping a human at every gate that matters. That means the firm can take on more work and turn it around faster than its size would suggest without cutting the corners small shops are usually accused of cutting under pressure. The supervisor-gated diff and the lint-and-test gate are not decoration. They are the assurance that velocity does not come at the expense of review, and that is exactly what a federal buyer needs to hear from a small vendor.

We are also clear-eyed that a harness is a tool and not a substitute for engineering judgment, and we run it that way on purpose. The orchestrator decides what to work on, the specialists do the work, and the gates catch what the gates are built to catch, but a human owns the backlog, approves the diffs, and answers for everything that ships, because that is the only honest way to operate automation at this level of capability. The combination is what we think a modern small business should look like: a firm that uses agent infrastructure to deliver at the output of a much larger contractor while staying small enough to stay accountable, built on the same discipline of staying online under load that the Navy taught the engineer who built it. For a government buyer weighing whether a certified small business can keep up, the harness is the most direct evidence we can offer, because it is not a promise about future capacity. It is a system already running.

Need delivery velocity from a small certified team?

We are an SBA-certified SDVOSB software, AI, and cybersecurity firm that uses agent infrastructure to deliver at the output of a much larger shop while keeping a human at every gate. If you are a contracting officer or a prime weighing a capable veteran-owned partner, start a conversation and look at the work.

Start a Contract Conversation
View Capability Statement