Lab Guide: Unit 5 — Multi-Agent Orchestration for Security
CSEC 602 | Weeks 1–4 | Semester 2
Build, compare, and evaluate multi-agent security systems using Claude Agent SDK, OpenAI Agents SDK, and Claude Managed Agents. Produce a quantitative framework comparison for your midyear report.
Week 1 — Multi-Agent Architecture Patterns
Lab Goal: Implement the Supervisor pattern in Python using the Claude Agent SDK. Map four architecture patterns to concrete security scenarios. Measure coordination overhead vs. single-agent baseline.
Knowledge Check — Week 1
1. When is the Supervisor pattern most appropriate for security operations?
2. Which pattern is best for threat hunting where diverse analytical perspectives are valuable?
3. When should you NOT use a multi-agent architecture?
In Week 15 of Semester 1 you deployed your hardened Sprint II agent to Claude Managed Agents — client.beta.agents.create(), client.beta.environments.create(), and a streaming session loop. That's one of the three harnesses you'll compare this unit.
This week's lab implements the same pattern (Supervisor + specialists) but in the Claude Agent SDK — your own Python loop, not Anthropic's hosted loop. Weeks 2 and 3 add the OpenAI Agents SDK and Managed Agents sides of the comparison. By Week 4 you'll have run the same security workload through all three and have real numbers.
Required: AIUC-1 pre-check before building (5 minutes).
This system makes consequential recommendations — containment actions, escalations, account lockouts. Before writing code, answer these questions in unit5/aiuc1-precheck.md:
- What data does this system process? Security alerts likely contain IP addresses, usernames, system names — potentially PII. How will you handle it?
- Who is affected by an incorrect decision? A false positive containment recommendation triggers unnecessary lockouts or isolation. A false negative misses a real threat. Who bears the cost?
- Which AIUC-1 domains are in scope? (B: Security, D: Reliability, E: Accountability — at minimum)
- What human oversight exists for high-severity recommendations? Does a "P1 — isolate host" recommendation execute automatically, or does a human review it first?
The full AIUC-1 audit of this system happens in Unit 7. This pre-check ensures your architecture doesn't build in gaps that will surface there.
Lab Exercise: Supervisor Pattern SOC System
The /audit-aiuc1 skill is included in the course skills bundle. The skill file is at .claude/skills/audit-aiuc1/SKILL.md.
pip install anthropic
mkdir -p ~/noctua-labs/unit5/week1 && cd ~/noctua-labs/unit5/week1
# SDK uses: from anthropic import Anthropic
# Claude Code prompt:
# "Build a Supervisor agent using the Anthropic SDK that:
# 1. Receives a security alert as input
# 2. Routes to one of three specialists: threat_analyst, containment_advisor, compliance
# 3. Each specialist is a separate Claude call with a specialized system prompt
# 4. Supervisor synthesizes specialist output into final recommendation
# 5. Returns structured JSON: {routing, specialist_output, synthesis, confidence}"
claude

These tools are mocks — and that's intentional. The tools in this lab simulate the output of your Unit 2 MCP server (query_cve, query_asset_exposure, generate_incident_report). The architecture is identical to what you built: an agent calls a tool, the tool returns structured data, the agent reasons about it. Only the transport changes. In Unit 7 (Hardening), these mock tools get replaced with calls to your actual MCP server via MCP client. Build the agent architecture correctly here — the tool swap in Unit 7 is a one-line configuration change, not a rewrite.
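The structure the Claude Code prompt above describes can be sketched as follows. This is a minimal skeleton with the model call stubbed out so the routing and synthesis flow is visible; swap call_model for a real client.messages.create(...) call from the anthropic SDK. The specialist names match the prompt; the keyword-based routing and the system-prompt wording are illustrative assumptions, not the required implementation.

```python
import json

# Specialist system prompts — placeholders; write real ones for the lab
SPECIALISTS = {
    "threat_analyst": "You are a threat analyst. Identify TTPs and hypotheses.",
    "containment_advisor": "You advise on containment actions and blast radius.",
    "compliance": "You assess regulatory and reporting obligations.",
}

def call_model(system: str, prompt: str) -> str:
    # Stub for illustration: replace with client.messages.create(...)
    # and return response.content[0].text
    return f"[{system.split('.')[0]}] analysis of: {prompt[:60]}"

def route(alert: str) -> str:
    # Supervisor routing — keyword heuristic here for testability; in the
    # lab this is itself a Claude call returning one of the three names
    text = alert.lower()
    if any(k in text for k in ("isolate", "contain", "lockout")):
        return "containment_advisor"
    if any(k in text for k in ("gdpr", "pci", "report", "regulator")):
        return "compliance"
    return "threat_analyst"

def triage(alert: str) -> dict:
    specialist = route(alert)
    output = call_model(SPECIALISTS[specialist], alert)
    synthesis = call_model("You are the supervisor. Synthesize.", output)
    return {
        "routing": specialist,
        "specialist_output": output,
        "synthesis": synthesis,
        "confidence": "medium",  # in the lab, have the supervisor self-rate
    }

print(json.dumps(triage("Beaconing from host WKS-114 — contain?"), indent=2))
```

The point of the skeleton is the shape of the returned JSON — it matches the {routing, specialist_output, synthesis, confidence} contract in the prompt, which Week 4's evaluation harness depends on.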
Lab vs. production architecture. In this lab, the supervisor agent receives simulated alert data passed directly to the triage function. In a production SOC, the equivalent input comes from a SIEM stream, Kafka topic, webhook, or security data pipeline — not a manual function call. The detection and triage logic you're writing is identical; only the data ingestion layer differs. When you deploy this in a real environment (Unit 8 capstone), the input wiring is what changes — the agent logic does not.
Add model attribution to your outputs. Every agent output that goes downstream — to another agent, to a human analyst, to an audit log — should declare how it was produced. Add a pipeline_metadata field to your triage output:
"pipeline_metadata": {
"classifier_model": "claude-haiku-4-5-20251001",
"enricher_invoked": True,
"report_model": "claude-sonnet-4-6",
"total_duration_ms": elapsed_ms,
"pipeline_version": "1.0"
}

When a recommendation is wrong, you need to know which stage produced the error. Without attribution, you're debugging blind. This is the audit record for agentic pipelines.
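One way to populate the field is to time the pipeline with time.perf_counter() and attach the metadata as the last step of the triage function. A sketch — the model IDs and stage names follow the example above; everything else (function name, stage placeholders) is an assumption:

```python
import json
import time

def run_pipeline(alert: str) -> dict:
    start = time.perf_counter()

    # ... classifier stage, optional enricher, report stage ...
    enricher_invoked = True  # set by the pipeline at runtime, not hardcoded

    elapsed_ms = int((time.perf_counter() - start) * 1000)
    return {
        "recommendation": "[pipeline output goes here]",
        "pipeline_metadata": {
            "classifier_model": "claude-haiku-4-5-20251001",
            "enricher_invoked": enricher_invoked,
            "report_model": "claude-sonnet-4-6",
            "total_duration_ms": elapsed_ms,
            "pipeline_version": "1.0",
        },
    }

print(json.dumps(run_pipeline("test alert")["pipeline_metadata"], indent=2))
```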
Two protocols appear in agentic system architecture. They solve different problems:
- MCP (Model Context Protocol) — how an agent calls a tool. Defines the interface between an agent and external capabilities: APIs, databases, file systems.
- A2A (Agent-to-Agent Protocol) — how agents communicate with each other. Defines message passing, task delegation, and result handoff between agents in a multi-agent system.
A production multi-agent system needs both: MCP to connect agents to tools, A2A to connect agents to each other. They operate at different trust boundaries — tool calls and inter-agent messages have different authentication and authorization requirements. The Claude Agent SDK used in this lab handles inter-agent communication natively; A2A is the standardized protocol for this layer when interoperability across agent frameworks is required.
Week 2 — OpenAI Agents SDK
Lab Goal: Build a SOC investigation agent using the OpenAI Agents SDK. Compare the Runner-managed loop pattern (your process, client-side tools) against Claude Managed Agents (Anthropic infrastructure, server-side tools). Run against the same 10-incident test suite you'll use all unit.
Knowledge Check — Week 2
1. What is the fundamental architectural difference between OpenAI Agents SDK and Claude Managed Agents?
2. What is the difference between handoffs=[] and .as_tool() in OpenAI Agents SDK?
Lab Exercise: OpenAI Agents SDK SOC Investigation
pip install openai-agents
# oai_soc_agent.py
from agents import Agent, Runner, function_tool
@function_tool
def enrich_ioc(indicator: str) -> str:
"""Look up threat intel for an IP, domain, or hash."""
# Implement with your preferred threat intel source
# e.g., VirusTotal API, AbuseIPDB, local feed
return f"[IOC enrichment result for {indicator}]"
@function_tool
def query_logs(query: str, time_range: str = "1h") -> str:
"""Search SIEM logs for a given query string."""
# Implement with your SIEM or log source
return f"[Log query results for: {query} over {time_range}]"
soc_agent = Agent(
name="SOC Analyst",
instructions="""You are a senior SOC analyst. When given a security alert:
1. Enrich all IoCs (IPs, domains, hashes) using the enrich_ioc tool
2. Query logs for related activity using the query_logs tool
3. Apply CCT analysis: identify TTPs, generate 3 hypotheses with probabilities
4. Produce a structured incident report: severity, TTPs, hypotheses, recommendations""",
model="gpt-4o",
tools=[enrich_ioc, query_logs],
)
# Synchronous — simplest path
from agents import Runner
result = Runner.run_sync(soc_agent, meridian_incident_text)
print(result.final_output)
# Streaming — see events as they arrive
import asyncio
async def run_streamed(alert: str):
result = Runner.run_streamed(soc_agent, alert)
async for event in result.stream_events():
if event.type == "raw_response_event":
continue
print(f"[{event.type}]", flush=True)
print(result.final_output)
asyncio.run(run_streamed(meridian_incident_text))
from agents import Agent, Runner
malware_agent = Agent(
name="Malware Analyst",
instructions="Specialize in malware behavior analysis, hash lookups, and sandbox report interpretation.",
tools=[enrich_ioc],
)
# Pattern 1: Handoff — triage routes, specialist takes over
triage_agent = Agent(
name="Triage",
instructions="Analyze alert type. Route network/phishing alerts to soc_agent. Route malware/hash alerts to malware_agent.",
handoffs=[soc_agent, malware_agent],
)
result = Runner.run_sync(triage_agent, alert_text)
print(result.final_output)
# Pattern 2: as_tool — orchestrator invokes specialist, stays in control
orchestrator = Agent(
name="Orchestrator",
instructions="Coordinate investigation. Use soc_analyst for network alerts, malware_analyst for hash/file alerts.",
tools=[
soc_agent.as_tool(
tool_name="soc_analyst",
tool_description="Run a full SOC investigation on a network or phishing alert"
),
malware_agent.as_tool(
tool_name="malware_analyst",
tool_description="Analyze malware samples, hashes, or sandbox reports"
),
],
)
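To generate comparable latency numbers for the Week 4 report, wrap each framework's entry point in a small timing helper. A sketch — the Runner call in the comment is the real usage against the Week 2 agent; the dummy function below just makes the helper testable offline:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn with the given arguments; return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Real usage against the Week 2 agent:
#   result, elapsed = timed(Runner.run_sync, soc_agent, incident["alert_text"])
#   usage = result.raw_responses[-1].usage   # token counts for the cost column

def dummy_investigation(alert: str) -> str:
    return f"report for {alert}"

report, elapsed = timed(dummy_investigation, "INC-001")
print(f"{report} ({elapsed * 1000:.2f} ms)")
```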
Run Runner.run_sync() in a loop and time each call with time.perf_counter(). OpenAI usage data is in result.raw_responses[-1].usage.

Week 3 — Claude Managed Agents
Lab Goal: Implement the same SOC investigation pipeline from Week 1 using Claude Managed Agents — Anthropic's hosted agent harness. Compare the development experience and operational characteristics against the custom SDK approach you already built.
Knowledge Check — Week 3
1. What is the fundamental difference between Claude Managed Agents and the custom SDK supervisor from Week 1?
Lab Exercise: SOC Investigator as a Managed Agent
import anthropic, os
client = anthropic.Anthropic()
# Create the agent (run once — save agent.id to .env or config file)
agent = client.beta.agents.create(
name="SOC Investigator",
model="claude-sonnet-4-6",
system="""You are a senior SOC analyst conducting incident investigations.
Given a security alert:
1. Extract and enrich all IoCs (IPs, domains, file hashes) using web search
2. Apply CCT analysis: map to MITRE ATT&CK, generate 3 hypotheses with probabilities
3. Assess severity (Critical/High/Medium/Low) with justification
4. Produce a structured JSON report followed by a concise narrative summary""",
tools=[{"type": "agent_toolset_20260401"}], # includes bash, read, write, grep, web_search
)
# Create the environment (run once — save environment.id)
environment = client.beta.environments.create(
name="soc-investigation-env",
config={"type": "cloud", "networking": {"type": "unrestricted"}},
)
print(f"AGENT_ID={agent.id}")
print(f"AGENT_VERSION={agent.version}")
print(f"ENVIRONMENT_ID={environment.id}")
print("Save these to your .env file — do not re-create on each run")

import anthropic, os, json
client = anthropic.Anthropic()
AGENT_ID = os.environ["AGENT_ID"]
ENVIRONMENT_ID = os.environ["ENVIRONMENT_ID"]
def investigate(alert_text: str, title: str = "SOC Investigation") -> dict:
"""Run a Managed Agent investigation and return the structured report."""
session = client.beta.sessions.create(
agent=AGENT_ID,
environment_id=ENVIRONMENT_ID,
title=title,
)
print(f"Session: {session.id}")
report_text = []
with client.beta.sessions.events.stream(session.id) as stream:
# Send the alert — stream must be open before sending
client.beta.sessions.events.send(
session.id,
events=[{"type": "user.message", "content": [
{"type": "text", "text": alert_text}
]}],
)
for event in stream:
if event.type == "agent.message":
for block in event.content:
if hasattr(block, "text"):
print(block.text, end="", flush=True)
report_text.append(block.text)
elif event.type == "agent.tool_use":
print(f"\n[Tool: {event.name}]", flush=True)
elif event.type == "session.status_idle":
print("\n── Investigation complete ──")
break
return {"session_id": session.id, "report": "".join(report_text)}
# Test with the Meridian Financial incident
with open("test_incidents.json") as f:
incidents = json.load(f)
result = investigate(incidents[0]["alert_text"], title=incidents[0]["id"])
print(f"\nSession ID for audit: {result['session_id']}")

import time, json
results = []
for incident in incidents:
start = time.time()
result = investigate(incident["alert_text"], title=incident["id"])
elapsed = time.time() - start
# Evaluate output
report = result["report"]
results.append({
"id": incident["id"],
"framework": "managed_agents",
"latency_s": round(elapsed, 1),
"severity_correct": incident["expected_severity"].lower() in report.lower(),
"ioc_count_found": sum(1 for ioc in incident["expected_iocs"] if ioc in report),
"session_id": result["session_id"], # audit trail
})
with open("results_managed_agents.json", "w") as f:
json.dump(results, f, indent=2)
print(f"Accuracy: {sum(r['severity_correct'] for r in results)}/{len(results)}")
print(f"Avg latency: {sum(r['latency_s'] for r in results)/len(results):.1f}s")

Week 4 — Agent Evaluation & Framework Comparison
Knowledge Check — Week 4
1. What dimensions should a comprehensive agent evaluation harness measure?
Lab Exercise: Framework Comparison Evaluation
Framework Comparison Report — Required Structure (10% deliverable)
This report is evaluated as a security practitioner document, not a generic benchmark report. Generic accuracy metrics are insufficient — include security-specific measures.
Required sections:
1. Executive Summary (1 paragraph): Which framework do you recommend for each of the 3 security scenarios, and why? Lead with the decision.
2. Methodology: What did you test? What did you NOT test? Who evaluated (and is there evaluator bias — did the same person write the test cases and grade results)?
3. Security-Specific Metrics (required for each framework):
- False Negative Rate: what % of real threats did the system miss?
- False Positive Rate: what % of legitimate activity was flagged?
- Hallucination rate in threat attribution: when the system attributed a threat, how often was the attribution incorrect?
- Latency under load (P95): can the system handle real alert volume?
4. Framework-by-Framework Analysis: Apply all three frameworks to each of 3 security scenarios.
5. What These Numbers Don't Capture (required — this section cannot be omitted): What would a different evaluator find? What failure modes did your test set not cover? What adversarial inputs did you not test? This section demonstrates CCT Pillar 1 (Empirical Inquiry) applied to your own evaluation.
Appendix: AIUC-1 Self-Audit: For the framework you recommend, complete a brief AIUC-1 check: what domains are in scope, what controls exist, what gaps remain. This connects your Unit 3 learning to your Unit 5 recommendation.
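The section-3 metrics can be computed mechanically from your per-incident results files. A sketch, assuming each record carries a ground-truth label, the system's verdict, and a latency — these field names are assumptions for illustration, not the Week 3 results schema verbatim:

```python
import math

def security_metrics(records: list) -> dict:
    """records: [{'is_threat': bool, 'flagged': bool, 'latency_s': float}, ...]"""
    threats = [r for r in records if r["is_threat"]]
    benign = [r for r in records if not r["is_threat"]]
    fn = sum(1 for r in threats if not r["flagged"])   # missed real threats
    fp = sum(1 for r in benign if r["flagged"])        # flagged legit activity
    lat = sorted(r["latency_s"] for r in records)
    p95 = lat[min(len(lat) - 1, math.ceil(0.95 * len(lat)) - 1)]
    return {
        "false_negative_rate": fn / len(threats) if threats else 0.0,
        "false_positive_rate": fp / len(benign) if benign else 0.0,
        "p95_latency_s": p95,
    }

sample = [
    {"is_threat": True,  "flagged": True,  "latency_s": 4.1},
    {"is_threat": True,  "flagged": False, "latency_s": 3.8},  # false negative
    {"is_threat": False, "flagged": True,  "latency_s": 2.9},  # false positive
    {"is_threat": False, "flagged": False, "latency_s": 3.2},
]
print(security_metrics(sample))
```

Hallucination rate in attribution needs a human grader per incident and cannot be computed this way — record it as a separate labeled column in your results files.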
Conversation starter before writing:
Ask Claude: "I'm writing a framework comparison report for multi-agent security systems evaluated on SOC operations. What metrics matter most for this use case, and what are the most common mistakes in framework comparison reports that security architects reject?"
Deep Agents Exercise — Build Your Institutional Knowledge Base
Exercise Goal: Write a human-curated AGENTS.md for your Unit 5 capstone system. Then audit it against the surgical test: every line must contain knowledge Claude cannot infer from reading 5 files in your codebase. If it can infer it, remove the line. Practitioner experience suggests human-written context files increase task success ~4%; LLM-generated files reduce it ~3%.
Lab Exercise: AGENTS.md for Capstone System
# AGENTS.md — [Your Capstone System Name]
# Human-written. Last updated: [date]
## Patterns & Conventions
- [e.g., All agent outputs return structured JSON with keys: status, data, confidence, error]
- [e.g., Use claude-haiku-4-5 for triage agents, claude-sonnet-4-6 for analysis agents]
## Anti-Patterns — DO NOT
- DO NOT add tool calls without input validation — every parameter must be sanitized
- DO NOT write agent state to disk except via the approved SQLite helper in utils/state.py
- [Add what would go wrong in YOUR system without this guidance]
## Known Traps
- [e.g., The log parser returns UTC timestamps but the alert format uses local time — normalize before comparison]
- [e.g., Staging environment uses a reduced threat intel feed — test results differ from production]

Unit 5 Major Deliverable
- Framework Comparison Report (10% of grade) — quantitative evaluation of Claude Agent SDK, OpenAI Agents SDK, and Claude Managed Agents with test suite results, visualizations, and justified recommendations
- All three framework implementations — working code repositories for each
The practitioner deciding between Claude Managed Agents and OpenAI Agents SDK for a production SOC deployment next month will be searching for exactly what you built. Open source the test suite, publish the results as a GitHub repo, and write a README summarizing your findings and recommendations.
The security community — including hobbyists, junior engineers, and teams at organizations that can't afford consultants — makes better decisions when practitioners share real evaluations. You ran the experiment. Share the results. Be the resource you wished existed when you started this unit.
Unit 5 Complete
You have built multi-agent systems in three frameworks and produced a quantitative comparison report.