Engineering Production-Grade Autonomous Agent Orchestration: A Complete Technical Deep Dive
How I built a multi-agent system with ephemeral subagents, file-based state persistence, SAFe methodology integration, and production deployment pipelines—validated against authoritative patterns from Anthropic, OpenAI, Microsoft, and Manus AI.
1. The Problem: Context Window Overflow
It started with a familiar frustration. I had built Nabster—my autonomous AI operations hub—to handle everything from X/Twitter management to candidate pipeline monitoring. It worked beautifully until I asked it to write code.
The problem wasn't capability. Claude Code is extraordinarily capable. The problem was context accumulation. Every file read, every command executed, every iteration on a bug—it all stacked up. Within a single coding session, I'd watch the context window fill: 50K tokens, 100K, 150K. Eventually, the session would hit the context limit and fail with the dreaded overflow error.
This wasn't sustainable. I needed a coding agent that could:
1. Work without filling up its own context window
2. Survive session crashes and resume seamlessly
3. Maintain state across ephemeral executions
4. Integrate with proper product management workflows
The Core Insight: The solution wasn't to make agents smarter—it was to make them ephemeral. Spawn fresh, do work, persist state to files, terminate. The files become the memory. The agent becomes disposable.
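To make that lifecycle concrete, here's a minimal sketch in TypeScript. The workspace path, file layout, and the `doOneUnitOfWork` callback are illustrative stand-ins, not Nabster's actual code:

```typescript
import { readFileSync, writeFileSync, existsSync } from "node:fs";
import { join } from "node:path";

// Hypothetical workspace layout; the real paths live in each agent's workspace.
const WORKSPACE = "/home/clawdbot/coder-nabster";
const PROGRESS = join(WORKSPACE, "progress.md");

function loadProgress(): string {
  // Files are the memory: a fresh spawn knows only what was persisted.
  return existsSync(PROGRESS) ? readFileSync(PROGRESS, "utf8") : "## Progress\n(no prior state)\n";
}

function saveProgress(state: string): void {
  // Persist before terminating; the next spawn starts from this file.
  writeFileSync(PROGRESS, state, "utf8");
}

async function ephemeralSpawn(doOneUnitOfWork: (state: string) => Promise<string>): Promise<void> {
  const state = loadProgress();                   // 1. spawn fresh, read memory from disk
  const newState = await doOneUnitOfWork(state);  // 2. a bounded unit of work, no context accumulation
  saveProgress(newState);                         // 3. persist state
  // 4. terminate; the process (and its context window) is disposable
}
```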
2. Research Phase: Learning from the Giants
Before writing a single line of configuration, I dove deep into authoritative sources. If I was going to build production-grade agent orchestration, I needed to understand what the leading AI labs had learned.
Anthropic's Multi-Agent Research System
Anthropic published detailed documentation on their Claude Research feature—a multi-agent system that achieved a 90.2% performance improvement over single-agent approaches. Key patterns:
- Orchestrator-worker pattern: A lead agent coordinates specialized subagents
- Detailed task specifications: Each subagent needs an objective, output format, tool guidance, and task boundaries
- External memory persistence: Save plans to files before context exceeds limits
- Lightweight references: Pass file paths between agents, not full content
Anthropic's claude-progress.txt Pattern
Their documentation on long-running agents revealed an elegant pattern: the claude-progress.txt file.
“Imagine a software project staffed by engineers working in shifts, where each new engineer arrives with no memory of what happened on the previous shift.”
— Anthropic Engineering Blog
The solution: An Initializer agent creates the environment and a progress file. The Coding agent reads the progress file on every spawn, makes incremental changes, commits to git, and updates the progress file. Git history plus progress file equals recovery mechanism.
Manus AI's Context Engineering
Manus AI published fascinating insights on attention manipulation. Their agents create and continuously update todo.md files—not just for organization, but as a deliberate mechanism to keep objectives in the model's recent attention span.
With an average of 50 tool calls per task, maintaining focus is critical. By “reciting” objectives at the end of the context, Manus reduces goal drift and misalignment. They also treat the file system as external memory—unlimited in size, persistent by nature, directly operable by the agent.
Interestingly, Manus evolved from a simple todo.md approach to a dedicated Planner + Executor architecture—finding that roughly 30% of actions were spent updating the todo list. This validated my own architectural instincts.
Microsoft Azure AI Agent Design Patterns
Microsoft's Azure Architecture Center provided a comprehensive taxonomy of orchestration patterns:
| Pattern | Use Case | Trade-offs |
|---|---|---|
| Sequential | Clear stage dependencies | Simple but higher latency |
| Concurrent | Independent subtasks | Higher throughput, needs merge |
| Handoff | Dynamic delegation | Flexible but risk of loops |
| Hierarchical | Manager coordinates specialists | Clear ownership, more hops |
They also emphasized checkpoint features for recovery, circuit breakers to prevent cascading failures, and graceful degradation when agents fail.
SagaLLM: Academic Rigor
A VLDB 2025 paper on multi-agent coordination provided the final piece: structured handoffs.
“Free-text handoffs are the main source of context loss. Treat inter-agent transfer like a public API.”
— SagaLLM, VLDB 2025
The recommendation: Use JSON Schema-based structured outputs for all handoffs. Validate contract conformance, dependency satisfaction, and cross-agent consistency.
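As a rough illustration of what such a contract might look like: the story fields below are hypothetical, and Ajv is just one common validator. The paper prescribes the pattern, not this specific schema.

```typescript
import Ajv from "ajv"; // any conformant JSON Schema validator works

// Hypothetical handoff contract for a PM -> Coder story transfer.
const storySchema = {
  type: "object",
  required: ["story_id", "title", "acceptance_criteria"],
  properties: {
    story_id: { type: "string", pattern: "^STORY-\\d+$" },
    title: { type: "string", minLength: 1 },
    acceptance_criteria: { type: "array", items: { type: "string" }, minItems: 1 },
    dependencies: { type: "array", items: { type: "string" } },
  },
  additionalProperties: false,
};

const validate = new Ajv().compile(storySchema);

export function acceptHandoff(payload: unknown) {
  // Reject free-text or malformed handoffs before the receiving agent acts on them.
  if (!validate(payload)) {
    throw new Error(`Handoff rejected: ${JSON.stringify(validate.errors)}`);
  }
  return payload;
}
```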
3. Architecture Evolution: From Dev to PM+Coder
My first design was “Dev Nabster”—a single coding agent with file-based state. It worked. Tests passed. Recovery from crashes succeeded. But something was fundamentally missing.
The problem: I was giving vague requirements and expecting perfect code. “Add a rate limiter” doesn't specify limits, storage mechanism, error responses, or header formats. The agent was guessing. Sometimes correctly, often not.
The solution emerged from product management principles: separate planning from execution.
The Two-Agent Architecture
```
Main Nabster (always running)
│
└── PM Nabster (ephemeral)
    │
    └── Coder Nabster (ephemeral)
```

PM Nabster owns requirements. It asks clarifying questions, creates dev-ready stories with acceptance criteria, reviews completed work, handles deployment, and verifies in production.
Coder Nabster owns implementation. It receives stories with clear criteria, implements exactly what's specified, writes tests, and submits for review. If rejected, it fixes and resubmits.
This mirrors Manus AI's evolution from todo.md to Planner+Executor. The separation isn't arbitrary—it's a recognition that planning and coding require different modes of thinking.
4. SAFe Methodology: WSJF and Track Selection
Not every request deserves the same process. A critical bug fix shouldn't go through the same ceremony as a multi-week feature. PM Nabster implements WSJF (Weighted Shortest Job First) from SAFe to intelligently route work.
The WSJF Assessment
```
WSJF SCORING:
- Business Value:    [1-5]
- Time Criticality:  [1-5]
- Risk/Opportunity:  [1-5]
- Size:              [XS/S/M/L/XL]

Score = (Value + Urgency + Risk) / Size
```
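As a sketch of how that score could be computed in code: the size-to-points mapping here is an assumption (the assessment itself only defines XS-XL buckets), chosen so the worked example later in this post comes out the same.

```typescript
// Assumed size-to-points mapping; only the XS..XL buckets are defined by the standard.
const SIZE_POINTS = { XS: 1, S: 2, M: 3, L: 5, XL: 8 } as const;
type Size = keyof typeof SIZE_POINTS;

interface WsjfInput {
  businessValue: number;   // 1-5
  timeCriticality: number; // 1-5
  riskOpportunity: number; // 1-5
  size: Size;
}

function wsjfScore(i: WsjfInput): number {
  // Score = (Value + Urgency + Risk) / Size
  return (i.businessValue + i.timeCriticality + i.riskOpportunity) / SIZE_POINTS[i.size];
}

// Matches the live test below: (3 + 2 + 2) / 2 = 3.5
console.log(wsjfScore({ businessValue: 3, timeCriticality: 2, riskOpportunity: 2, size: "S" }));
```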
Three Tracks
HOTFIX Track
Quick triage → Immediate coding → Fast review → Ship. For production fires and showstoppers.
STANDARD Track
Light refinement (2-3 questions) → Story with criteria → Code → Full review → Ship.
PROJECT Track
Discovery → Planning → Milestones → Stories → Review loops → Stakeholder check-ins.
This right-sizing ensures we're not over-engineering simple fixes or under-planning complex features.
5. File-Based State: The Memory Architecture
The key insight from all the research: agents are ephemeral, files are memory. Here's the complete state architecture:
PM Nabster's State Files
```
/home/clawdbot/pm-nabster/
├── SOUL.md           # Identity, principles, methodology
├── AGENTS.md         # Operating rules, protocols
├── progress.md       # Current state, file references
├── intake/           # Original requests (verbatim)
│   └── REQ-2026-01-30-001.json
├── sessions/         # Q&A history, decisions
│   └── 2026-01-30-REQ-001.md
├── backlog/
│   ├── ready/        # Stories ready for Coder
│   ├── in-progress/  # Stories being implemented
│   └── done/         # Completed with verification
├── checkpoints/      # Recovery snapshots
└── templates/        # Story, review templates
```
The Context Recovery Protocol
When PM Nabster spawns, it follows a strict protocol:
1. Read SOUL.md (identity)
2. Read progress.md (where am I?)
3. If mid-task, read referenced files:
   - Intake file (original request)
   - Session file (Q&A history)
   - Story file (if exists)
4. Continue from where the previous spawn stopped
The critical rule: Never restart from scratch. Never assume context. Always read the files.
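A minimal sketch of that spawn-time recovery, assuming a workspace laid out as above. The "ref:" convention for listing referenced files in progress.md is my invention for the illustration, not the actual file format.

```typescript
import { readFileSync, existsSync } from "node:fs";
import { join } from "node:path";

const WORKSPACE = "/home/clawdbot/pm-nabster";

function readIfExists(relPath: string): string | null {
  const full = join(WORKSPACE, relPath);
  return existsSync(full) ? readFileSync(full, "utf8") : null;
}

export function recoverContext() {
  const soul = readIfExists("SOUL.md");         // 1. identity
  const progress = readIfExists("progress.md"); // 2. where am I?
  if (!soul || !progress) throw new Error("Workspace is missing required state files");

  // 3. Assumed convention: progress.md lists referenced files as "ref: <path>" lines.
  const refs = progress
    .split("\n")
    .filter((line) => line.startsWith("ref: "))
    .map((line) => line.slice("ref: ".length).trim());

  const referenced = refs.map((path) => ({ path, content: readIfExists(path) }));

  // 4. The caller continues from this state; nothing restarts from scratch.
  return { soul, progress, referenced };
}
```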
Attention Management
Following Manus AI's pattern, both agents implement attention management:
```
Before any major action:
1. Re-read current objectives from progress.md
2. Explicitly state: "Current objective: [X]. Next action: [Y]"
3. This keeps goals in recent attention, prevents drift
```
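In code terms, recitation just means re-injecting the objective at the tail of the prompt before each tool call. A sketch, where the progress.md parsing convention and the `buildPrompt` helper are assumptions:

```typescript
import { readFileSync } from "node:fs";

// Assumed convention: progress.md records a "Current objective: ..." line.
function currentObjective(progressPath: string): string {
  const line = readFileSync(progressPath, "utf8")
    .split("\n")
    .find((l) => l.toLowerCase().startsWith("current objective:"));
  return line ?? "Current objective: (none recorded)";
}

// Recitation: append the objective to the END of the context so it sits in recent attention.
function buildPrompt(history: string, progressPath: string, nextAction: string): string {
  return `${history}\n\n${currentObjective(progressPath)}\nNext action: ${nextAction}`;
}
```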
6. Production Deployment Pipeline
A critical realization: “code complete” is not “done.” PM Nabster owns the full lifecycle:
Production Verification Protocol
This is where most automation stops—and where we go further. PM Nabster actually hits the live endpoints to verify functionality:
```bash
# For a rate limiter feature:

# Test 1: Endpoint responds
curl -I https://production.url/api/hello
# Verify: HTTP 200, X-RateLimit headers present

# Test 2: Rate limiting works
for i in {1..105}; do
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" ...)
  echo "Request $i: $STATUS"
done
# Verify: Requests 1-100 return 200
# Verify: Requests 101+ return 429
```

All verification results are documented with evidence:
```json
{
  "story_id": "STORY-001",
  "verification_status": "PASSED",
  "test_results": {
    "endpoint_works": "PASSED - HTTP 200",
    "rate_headers_present": "PASSED - X-RateLimit-*",
    "allows_100_requests": "PASSED - 100x HTTP 200",
    "blocks_101_plus": "PASSED - 5x HTTP 429"
  },
  "production_url": "https://...",
  "commit": "972a8b1"
}
```

7. Live Testing: Rate Limiter End-to-End
Theory is nothing without practice. Here's the complete flow from a real test:
The Request
PM Nabster's WSJF Assessment
```
Business Value:    3 (security/stability)
Time Criticality:  2 (not urgent)
Risk/Opportunity:  2 (prevents abuse)
Size:              S (middleware pattern)

Score: (3+2+2)/2 = 3.5 → STANDARD Track
```
Clarifying Questions
PM Nabster asked:
- Is in-memory storage acceptable, or do you need Redis/persistent storage?
- What's the rate limit? (requests per minute per IP)
- Is Express.js acceptable for the example server?
Stakeholder answers: In-memory is fine. 100 requests/minute/IP. Express is fine.
The Story
```
STORY-001: Implement API Rate Limiter Middleware

Acceptance Criteria:
1. Given a client IP makes requests, When count <= 100 in last minute, Then request should be allowed
2. Given a client IP has made 100 requests, When they make another, Then return HTTP 429 with retryAfter
3. Given a client was rate limited, When 1 minute passes, Then their count resets
4. Given multiple client IPs, When each makes requests, Then each has independent limit
5. Given any request, Then X-RateLimit-Remaining header included
```
Coder Nabster's Implementation
Coder received the story and produced:
- `src/middleware/rateLimiter.js` - Core middleware (a minimal sketch of the idea follows below)
- `src/middleware/rateLimiter.test.js` - 26 comprehensive tests
- `src/server.js` - Example Express server
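For a feel of what the middleware does, here is a minimal fixed-window sketch of the same idea. It is not Coder Nabster's actual rateLimiter.js, and it leaves the Express wiring and response headers to a thin wrapper:

```typescript
// Minimal fixed-window limiter: 100 requests per IP per minute, in-memory.
interface Window { count: number; windowStart: number }

class RateLimiter {
  private windows = new Map<string, Window>();
  constructor(private limit = 100, private windowMs = 60_000) {}

  check(ip: string, now = Date.now()): { allowed: boolean; remaining: number; retryAfterSec: number } {
    const w = this.windows.get(ip);
    if (!w || now - w.windowStart >= this.windowMs) {
      // New window for this IP: criterion 3 (reset after a minute) and 4 (per-IP independence).
      this.windows.set(ip, { count: 1, windowStart: now });
      return { allowed: true, remaining: this.limit - 1, retryAfterSec: 0 };
    }
    if (w.count < this.limit) {
      w.count += 1; // criterion 1: allow while the count stays within the window's limit
      return { allowed: true, remaining: this.limit - w.count, retryAfterSec: 0 };
    }
    // criterion 2: over the limit, so the caller should return HTTP 429 with retryAfter
    const retryAfterSec = Math.ceil((w.windowStart + this.windowMs - now) / 1000);
    return { allowed: false, remaining: 0, retryAfterSec };
  }
}
```

An Express wrapper around `check()` would set the X-RateLimit-Remaining header from `remaining` and translate `allowed: false` into an HTTP 429 with `retryAfter`, covering criteria 2 and 5.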
PM Review
PM Nabster verified:
- ✓ All 5 acceptance criteria met
- ✓ All 4 edge cases handled
- ✓ 26 tests passing
- ✓ No security issues
- ✓ VERDICT: APPROVED
Production Verification
Server started, ngrok tunnel created, PM hit the live endpoint:
```
# Request 1
HTTP/2 200
x-ratelimit-remaining: 99

# Request 100
HTTP/2 200
x-ratelimit-remaining: 0

# Request 101
HTTP/2 429
{"error":"Too Many Requests","retryAfter":12}
```
Result: STORY-001 deployed and verified in production. All acceptance criteria confirmed working on live infrastructure.
8. Best Practices Audit
After building the system, I audited it against the authoritative sources. The alignment was strong:
| Best Practice | Source | Our Implementation |
|---|---|---|
| Orchestrator-worker pattern | Anthropic | ✓ Main → PM → Coder |
| File-based state persistence | Anthropic, Manus | ✓ intake/, sessions/, backlog/ |
| Planner + Executor separation | Manus | ✓ PM + Coder |
| Attention management (todo.md) | Manus | ✓ progress.md + recitation |
| Structured JSON handoffs | SagaLLM, OpenAI | ✓ Story JSON with schema |
| Circuit breakers | Microsoft | ✓ 3-failure escalation |
| Checkpointing | Microsoft | ✓ checkpoints/ directory |
| Graceful degradation | Microsoft, Anthropic | ✓ Failure protocols |
Based on the audit, I added improvements: explicit attention management instructions, circuit breaker rules for repeated failures, schema validation requirements, and context limit awareness.
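The circuit-breaker rule is the simplest of these to illustrate: stop retrying and escalate after three consecutive failures. A sketch, where the threshold and the escalation hook stand in for whatever the agent's AGENTS.md actually specifies:

```typescript
// Escalate instead of retrying forever: trip after N consecutive failures on the same task.
class CircuitBreaker {
  private failures = 0;
  constructor(private threshold = 3, private escalate: (err: unknown) => void = console.error) {}

  async run<T>(task: () => Promise<T>): Promise<T | undefined> {
    if (this.failures >= this.threshold) {
      this.escalate(new Error("Circuit open: escalating to the orchestrator instead of retrying"));
      return undefined;
    }
    try {
      const result = await task();
      this.failures = 0; // success closes the circuit again
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.threshold) this.escalate(err);
      throw err;
    }
  }
}
```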
9. The Agent Creation Standard
With the patterns proven, I needed to ensure future agents would follow the same rigor. I created a comprehensive standard document at /home/clawdbot/nabster/standards/AGENT-CREATION-STANDARD.md.
The standard mandates:
Required Workspace Structure
```
/home/clawdbot/[agent-name]/
├── SOUL.md              # Identity (REQUIRED)
├── AGENTS.md            # Operating rules (REQUIRED)
├── progress.md          # State for continuity (REQUIRED)
├── checkpoints/         # Recovery snapshots
└── [domain-specific]/   # Role-specific directories
```
Required SOUL.md Sections
- Identity statement (who, what, NOT what)
- Core principles (3-5)
- Hierarchy position
- Critical rules
Required AGENTS.md Protocols
- On Every Spawn protocol
- Attention management
- Failure & recovery protocol
- Context recovery protocol
- Autonomy levels
- Circuit breaker rules
Testing Requirements
Before deploying any new agent:
- Fresh spawn test: Verify it reads SOUL.md, writes progress.md
- Recovery test: Set mid-task state, verify continuation
- Failure test: Create blocker, verify graceful handling
- Handoff test: Verify context transfers correctly
10. The Chain Rule: Ensuring Compliance Forever
A standard is useless if agents don't follow it. The question: how do we guarantee every future agent uses the standard?
The answer: The Chain Rule. Every agent that spawns another agent must include this in the spawn prompt:
STANDING RULE: If you ever create a NEW agent type, you MUST first read /home/clawdbot/nabster/standards/AGENT-CREATION-STANDARD.md and follow it completely. Pass this rule to any agent you spawn.
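Mechanically, the spawner only has to prepend that text to every child prompt. A sketch, where `buildSpawnPrompt` is a hypothetical helper rather than an existing function in the system:

```typescript
const STANDING_RULE =
  "STANDING RULE: If you ever create a NEW agent type, you MUST first read " +
  "/home/clawdbot/nabster/standards/AGENT-CREATION-STANDARD.md and follow it completely. " +
  "Pass this rule to any agent you spawn.";

// Every child prompt carries the rule, so the chain never breaks.
function buildSpawnPrompt(taskPrompt: string): string {
  return `${STANDING_RULE}\n\n${taskPrompt}`;
}
```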
This creates an unbroken chain:
```
Main Nabster spawns PM Nabster
    ↓ includes standing rule
PM Nabster spawns Coder Nabster
    ↓ includes standing rule
Coder Nabster spawns [future agent]
    ↓ includes standing rule
...forever
```

Reinforcement Points
The rule is embedded in multiple places:
- Main Nabster's SOUL.md (Principle #7)
- Main Nabster's AGENTS.md (explicit section)
- Main Nabster's MEMORY.md (Standing Rules)
- PM Nabster's SOUL.md (Critical Rule #6)
- PM Nabster's AGENTS.md (Standing Rule section)
- Coder Nabster's AGENTS.md (Standing Rule section)
- The standard document itself
Weekly Audit
Every Saturday, Main Nabster performs an audit: verify all registered agents have the required files and sections. Any gaps are reported and fixed.
```markdown
## Weekly Agent Audit Report - [Date]

| Agent         | SOUL.md | AGENTS.md | progress.md | Standard |
|---------------|---------|-----------|-------------|----------|
| PM Nabster    | ✓       | ✓         | ✓           | ✓        |
| Coder Nabster | ✓       | ✓         | ✓           | ✓        |
```
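A sketch of what the underlying check could look like. The registry of workspaces is hypothetical; the required files come straight from the standard.

```typescript
import { existsSync } from "node:fs";
import { join } from "node:path";

const REQUIRED_FILES = ["SOUL.md", "AGENTS.md", "progress.md"];
// Hypothetical registry; the real list would come from wherever registered agents are tracked.
const AGENT_WORKSPACES = ["/home/clawdbot/pm-nabster", "/home/clawdbot/coder-nabster"];

for (const workspace of AGENT_WORKSPACES) {
  const missing = REQUIRED_FILES.filter((f) => !existsSync(join(workspace, f)));
  console.log(
    missing.length === 0
      ? `${workspace}: compliant`
      : `${workspace}: MISSING ${missing.join(", ")}`
  );
}
```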
11. Conclusion: What We Built
In a single session, we architected and implemented a production-grade multi-agent orchestration system:
Ephemeral Agents with File-Based Memory
Agents spawn fresh, read state from files, persist before termination. Context dies, memory lives.
PM + Coder Separation
Planning and execution as distinct roles. Clear handoffs via structured JSON stories.
SAFe Methodology Integration
WSJF scoring routes work to appropriate tracks. Right-sized process for every request.
Full Deployment Pipeline
Commit, push, build, deploy, verify in production. Evidence documented.
Agent Creation Standard
Templates, checklists, and the Chain Rule ensure every future agent follows the patterns.
The system is now live. I can ask Nabster to build any feature, and it flows through PM for refinement, to Coder for implementation, back to PM for review, through deployment, and into production with verification.
More importantly, the patterns are documented and enforced. This isn't a one-off solution—it's infrastructure for building reliable autonomous systems at scale.
The Meta-Lesson
The best AI systems aren't the ones with the most capabilities—they're the ones with the clearest boundaries. By making agents ephemeral and files permanent, by separating planning from execution, by right-sizing process to complexity, we built something that's both powerful and predictable. That's the goal.
Sources & References
- Anthropic: How We Built Our Multi-Agent Research System
- Anthropic: Effective Harnesses for Long-Running Agents
- Manus AI: Context Engineering for AI Agents
- Microsoft Azure: AI Agent Orchestration Patterns
- OpenAI: Orchestrating Multiple Agents
- SagaLLM: Context Management for Multi-Agent Systems (VLDB 2025)