Lab Guide: Unit 5 — Multi-Agent Orchestration for Security
CSEC 602 | Weeks 1–4 | Semester 2
Build, compare, and evaluate multi-agent security systems using Claude Agent SDK, OpenAI Agents SDK, and Claude Managed Agents. Produce a quantitative framework comparison for your midyear report.
Week 1 — Multi-Agent Architecture Patterns
Lab Goal: Implement the Supervisor pattern in Python using the Claude Agent SDK. Map four architecture patterns to concrete security scenarios. Measure coordination overhead vs. single-agent baseline.
Knowledge Check — Week 1
1. When is the Supervisor pattern most appropriate for security operations?
2. Which pattern is best for threat hunting where diverse analytical perspectives are valuable?
3. When should you NOT use a multi-agent architecture?
In Week 15 of Semester 1 you deployed your hardened Sprint II agent to Claude Managed Agents — client.beta.agents.create(), client.beta.environments.create(), and a streaming session loop. That's one of the three harnesses you'll compare this unit.
This week's lab implements the same pattern (Supervisor + specialists) but in the Claude Agent SDK — your own Python loop, not Anthropic's hosted loop. Weeks 2 and 3 add the OpenAI Agents SDK and Managed Agents sides of the comparison. By Week 4 you'll have run the same security workload through all three and have real numbers.
Required: AIUC-1 pre-check before building (5 minutes).
This system makes consequential recommendations — containment actions, escalations, account lockouts. Before writing code, answer these questions in unit5/aiuc1-precheck.md:
- What data does this system process? Security alerts likely contain IP addresses, usernames, system names — potentially PII. How will you handle it?
- Who is affected by an incorrect decision? A false positive containment recommendation triggers unnecessary lockouts or isolation. A false negative misses a real threat. Who bears the cost?
- Which AIUC-1 domains are in scope? (B: Security, D: Reliability, E: Accountability — at minimum)
- What human oversight exists for high-severity recommendations? Does a "P1 — isolate host" recommendation execute automatically, or does a human review it first?
The full AIUC-1 audit of this system happens in Unit 7. This pre-check ensures your architecture doesn't build in gaps that will surface there.
Lab Exercise: Supervisor Pattern SOC System
The /audit-aiuc1 skill is included in the course skills bundle. The skill file is at .claude/skills/audit-aiuc1/SKILL.md.
pip install anthropic
mkdir -p ~/noctua-labs/unit5/week1 && cd ~/noctua-labs/unit5/week1
# SDK uses: from anthropic import Anthropic
# Claude Code prompt:
# "Build a Supervisor agent using the Anthropic SDK that:
# 1. Receives a security alert as input
# 2. Routes to one of three specialists: threat_analyst, containment_advisor, compliance
# 3. Each specialist is a separate Claude call with a specialized system prompt
# 4. Supervisor synthesizes specialist output into final recommendation
# 5. Returns structured JSON: {routing, specialist_output, synthesis, confidence}"
claude

These tools are mocks — and that's intentional. The tools in this lab simulate the output of your Unit 2 MCP server (query_cve, query_asset_exposure, generate_incident_report). The architecture is identical to what you built: an agent calls a tool, the tool returns structured data, the agent reasons about it. Only the transport changes. In Unit 7 (Hardening), these mock tools get replaced with calls to your actual MCP server via MCP client. Build the agent architecture correctly here — the tool swap in Unit 7 is a one-line configuration change, not a rewrite.
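The structure the Claude Code prompt above describes can be sketched as follows. This is a minimal skeleton with the model call stubbed out so the routing and synthesis flow is visible; swap call_model for a real client.messages.create(...) call from the anthropic SDK. The specialist names match the prompt; the keyword-based routing and the system-prompt wording are illustrative assumptions, not the required implementation.

```python
import json

# Specialist system prompts — placeholders; write real ones for the lab
SPECIALISTS = {
    "threat_analyst": "You are a threat analyst. Identify TTPs and hypotheses.",
    "containment_advisor": "You advise on containment actions and blast radius.",
    "compliance": "You assess regulatory and reporting obligations.",
}

def call_model(system: str, prompt: str) -> str:
    # Stub for illustration: replace with client.messages.create(...)
    # and return response.content[0].text
    return f"[{system.split('.')[0]}] analysis of: {prompt[:60]}"

def route(alert: str) -> str:
    # Supervisor routing — keyword heuristic here for testability; in the
    # lab this is itself a Claude call returning one of the three names
    text = alert.lower()
    if any(k in text for k in ("isolate", "contain", "lockout")):
        return "containment_advisor"
    if any(k in text for k in ("gdpr", "pci", "report", "regulator")):
        return "compliance"
    return "threat_analyst"

def triage(alert: str) -> dict:
    specialist = route(alert)
    output = call_model(SPECIALISTS[specialist], alert)
    synthesis = call_model("You are the supervisor. Synthesize.", output)
    return {
        "routing": specialist,
        "specialist_output": output,
        "synthesis": synthesis,
        "confidence": "medium",  # in the lab, have the supervisor self-rate
    }

print(json.dumps(triage("Beaconing from host WKS-114 — contain?"), indent=2))
```

The point of the skeleton is the shape of the returned JSON — it matches the {routing, specialist_output, synthesis, confidence} contract in the prompt, which Week 4's evaluation harness depends on.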
Lab vs. production architecture. In this lab, the supervisor agent receives simulated alert data passed directly to the triage function. In a production SOC, the equivalent input comes from a SIEM stream, Kafka topic, webhook, or security data pipeline — not a manual function call. The detection and triage logic you're writing is identical; only the data ingestion layer differs. When you deploy this in a real environment (Unit 8 capstone), the input wiring is what changes — the agent logic does not.
Add model attribution to your outputs. Every agent output that goes downstream — to another agent, to a human analyst, to an audit log — should declare how it was produced. Add a pipeline_metadata field to your triage output:
"pipeline_metadata": {
"classifier_model": "claude-haiku-4-5-20251001",
"enricher_invoked": True,
"report_model": "claude-sonnet-4-6",
"total_duration_ms": elapsed_ms,
"pipeline_version": "1.0"
}

When a recommendation is wrong, you need to know which stage produced the error. Without attribution, you're debugging blind. This is the audit record for agentic pipelines.
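One way to populate the field is to time the pipeline with time.perf_counter() and attach the metadata as the last step of the triage function. A sketch — the model IDs and stage names follow the example above; everything else (function name, stage placeholders) is an assumption:

```python
import json
import time

def run_pipeline(alert: str) -> dict:
    start = time.perf_counter()

    # ... classifier stage, optional enricher, report stage ...
    enricher_invoked = True  # set by the pipeline at runtime, not hardcoded

    elapsed_ms = int((time.perf_counter() - start) * 1000)
    return {
        "recommendation": "[pipeline output goes here]",
        "pipeline_metadata": {
            "classifier_model": "claude-haiku-4-5-20251001",
            "enricher_invoked": enricher_invoked,
            "report_model": "claude-sonnet-4-6",
            "total_duration_ms": elapsed_ms,
            "pipeline_version": "1.0",
        },
    }

print(json.dumps(run_pipeline("test alert")["pipeline_metadata"], indent=2))
```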
Two protocols appear in agentic system architecture. They solve different problems:
- MCP (Model Context Protocol) — how an agent calls a tool. Defines the interface between an agent and external capabilities: APIs, databases, file systems.
- A2A (Agent-to-Agent Protocol) — how agents communicate with each other. Defines message passing, task delegation, and result handoff between agents in a multi-agent system.
A production multi-agent system needs both: MCP to connect agents to tools, A2A to connect agents to each other. They operate at different trust boundaries — tool calls and inter-agent messages have different authentication and authorization requirements. The Claude Agent SDK used in this lab handles inter-agent communication natively; A2A is the standardized protocol for this layer when interoperability across agent frameworks is required.
Week 2 — OpenAI Agents SDK
Lab Goal: Build a SOC investigation agent using the OpenAI Agents SDK. Compare the Runner-managed loop pattern (your process, client-side tools) against Claude Managed Agents (Anthropic infrastructure, server-side tools). Run against the same 10-incident test suite you'll use all unit.
Knowledge Check — Week 2
1. What is the fundamental architectural difference between OpenAI Agents SDK and Claude Managed Agents?
2. What is the difference between handoffs=[] and .as_tool() in OpenAI Agents SDK?
Lab Exercise: OpenAI Agents SDK SOC Investigation
pip install openai-agents
# oai_soc_agent.py
from agents import Agent, Runner, function_tool
@function_tool
def enrich_ioc(indicator: str) -> str:
"""Look up threat intel for an IP, domain, or hash."""
# Implement with your preferred threat intel source
# e.g., VirusTotal API, AbuseIPDB, local feed
return f"[IOC enrichment result for {indicator}]"
@function_tool
def query_logs(query: str, time_range: str = "1h") -> str:
"""Search SIEM logs for a given query string."""
# Implement with your SIEM or log source
return f"[Log query results for: {query} over {time_range}]"
soc_agent = Agent(
name="SOC Analyst",
instructions="""You are a senior SOC analyst. When given a security alert:
1. Enrich all IoCs (IPs, domains, hashes) using the enrich_ioc tool
2. Query logs for related activity using the query_logs tool
3. Apply CCT analysis: identify TTPs, generate 3 hypotheses with probabilities
4. Produce a structured incident report: severity, TTPs, hypotheses, recommendations""",
model="gpt-4o",
tools=[enrich_ioc, query_logs],
)
# Synchronous — simplest path
from agents import Runner
result = Runner.run_sync(soc_agent, meridian_incident_text)
print(result.final_output)
# Streaming — see events as they arrive
import asyncio
async def run_streamed(alert: str):
result = Runner.run_streamed(soc_agent, alert)
async for event in result.stream_events():
if event.type == "raw_response_event":
continue
print(f"[{event.type}]", flush=True)
print(result.final_output)
asyncio.run(run_streamed(meridian_incident_text))
from agents import Agent, Runner
malware_agent = Agent(
name="Malware Analyst",
instructions="Specialize in malware behavior analysis, hash lookups, and sandbox report interpretation.",
tools=[enrich_ioc],
)
# Pattern 1: Handoff — triage routes, specialist takes over
triage_agent = Agent(
name="Triage",
instructions="Analyze alert type. Route network/phishing alerts to soc_agent. Route malware/hash alerts to malware_agent.",
handoffs=[soc_agent, malware_agent],
)
result = Runner.run_sync(triage_agent, alert_text)
print(result.final_output)
# Pattern 2: as_tool — orchestrator invokes specialist, stays in control
orchestrator = Agent(
name="Orchestrator",
instructions="Coordinate investigation. Use soc_analyst for network alerts, malware_analyst for hash/file alerts.",
tools=[
soc_agent.as_tool(
tool_name="soc_analyst",
tool_description="Run a full SOC investigation on a network or phishing alert"
),
malware_agent.as_tool(
tool_name="malware_analyst",
tool_description="Analyze malware samples, hashes, or sandbox reports"
),
],
)
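To generate comparable latency numbers for the Week 4 report, wrap each framework's entry point in a small timing helper. A sketch — the Runner call in the comment is the real usage against the Week 2 agent; the dummy function below just makes the helper testable offline:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn with the given arguments; return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Real usage against the Week 2 agent:
#   result, elapsed = timed(Runner.run_sync, soc_agent, incident["alert_text"])
#   usage = result.raw_responses[-1].usage   # token counts for the cost column

def dummy_investigation(alert: str) -> str:
    return f"report for {alert}"

report, elapsed = timed(dummy_investigation, "INC-001")
print(f"{report} ({elapsed * 1000:.2f} ms)")
```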
Run Runner.run_sync() in a loop and time each call with time.perf_counter(). OpenAI usage data is in result.raw_responses[-1].usage.

Week 3 — Claude Managed Agents
Lab Goal: Implement the same SOC investigation pipeline from Week 1 using Claude Managed Agents — Anthropic's hosted agent harness. Compare the development experience and operational characteristics against the custom SDK approach you already built.
Knowledge Check — Week 3
1. What is the fundamental difference between Claude Managed Agents and the custom SDK supervisor from Week 1?
Lab Exercise: SOC Investigator as a Managed Agent
import anthropic, os
client = anthropic.Anthropic()
# Create the agent (run once — save agent.id to .env or config file)
agent = client.beta.agents.create(
name="SOC Investigator",
model="claude-sonnet-4-6",
system="""You are a senior SOC analyst conducting incident investigations.
Given a security alert:
1. Extract and enrich all IoCs (IPs, domains, file hashes) using web search
2. Apply CCT analysis: map to MITRE ATT&CK, generate 3 hypotheses with probabilities
3. Assess severity (Critical/High/Medium/Low) with justification
4. Produce a structured JSON report followed by a concise narrative summary""",
tools=[{"type": "agent_toolset_20260401"}], # includes bash, read, write, grep, web_search
)
# Create the environment (run once — save environment.id)
environment = client.beta.environments.create(
name="soc-investigation-env",
config={"type": "cloud", "networking": {"type": "unrestricted"}},
)
print(f"AGENT_ID={agent.id}")
print(f"AGENT_VERSION={agent.version}")
print(f"ENVIRONMENT_ID={environment.id}")
print("Save these to your .env file — do not re-create on each run")

import anthropic, os, json
client = anthropic.Anthropic()
AGENT_ID = os.environ["AGENT_ID"]
ENVIRONMENT_ID = os.environ["ENVIRONMENT_ID"]
def investigate(alert_text: str, title: str = "SOC Investigation") -> dict:
"""Run a Managed Agent investigation and return the structured report."""
session = client.beta.sessions.create(
agent=AGENT_ID,
environment_id=ENVIRONMENT_ID,
title=title,
)
print(f"Session: {session.id}")
report_text = []
with client.beta.sessions.events.stream(session.id) as stream:
# Send the alert — stream must be open before sending
client.beta.sessions.events.send(
session.id,
events=[{"type": "user.message", "content": [
{"type": "text", "text": alert_text}
]}],
)
for event in stream:
if event.type == "agent.message":
for block in event.content:
if hasattr(block, "text"):
print(block.text, end="", flush=True)
report_text.append(block.text)
elif event.type == "agent.tool_use":
print(f"\n[Tool: {event.name}]", flush=True)
elif event.type == "session.status_idle":
print("\n── Investigation complete ──")
break
return {"session_id": session.id, "report": "".join(report_text)}
# Test with the Meridian Financial incident
with open("test_incidents.json") as f:
incidents = json.load(f)
result = investigate(incidents[0]["alert_text"], title=incidents[0]["id"])
print(f"\nSession ID for audit: {result['session_id']}")

import time, json
results = []
for incident in incidents:
start = time.time()
result = investigate(incident["alert_text"], title=incident["id"])
elapsed = time.time() - start
# Evaluate output
report = result["report"]
results.append({
"id": incident["id"],
"framework": "managed_agents",
"latency_s": round(elapsed, 1),
"severity_correct": incident["expected_severity"].lower() in report.lower(),
"ioc_count_found": sum(1 for ioc in incident["expected_iocs"] if ioc in report),
"session_id": result["session_id"], # audit trail
})
with open("results_managed_agents.json", "w") as f:
json.dump(results, f, indent=2)
print(f"Accuracy: {sum(r['severity_correct'] for r in results)}/{len(results)}")
print(f"Avg latency: {sum(r['latency_s'] for r in results)/len(results):.1f}s")

Week 4 — Agent Evaluation & Framework Comparison
Knowledge Check — Week 4
1. What dimensions should a comprehensive agent evaluation harness measure?
Lab Exercise: Framework Comparison Evaluation
Framework Comparison Report — Required Structure (10% deliverable)
This report is evaluated as a security practitioner document, not a generic benchmark report. Generic accuracy metrics are insufficient — include security-specific measures.
Required sections:
1. Executive Summary (1 paragraph): Which framework do you recommend for each of the 3 security scenarios, and why? Lead with the decision.
2. Methodology: What did you test? What did you NOT test? Who evaluated (and is there evaluator bias — did the same person write the test cases and grade results)?
3. Security-Specific Metrics (required for each framework):
- False Negative Rate: what % of real threats did the system miss?
- False Positive Rate: what % of legitimate activity was flagged?
- Hallucination rate in threat attribution: when the system attributed a threat, how often was the attribution incorrect?
- Latency under load (P95): can the system handle real alert volume?
4. Framework-by-Framework Analysis: Apply all three frameworks to each of 3 security scenarios.
5. What These Numbers Don't Capture (required — this section cannot be omitted): What would a different evaluator find? What failure modes did your test set not cover? What adversarial inputs did you not test? This section demonstrates CCT Pillar 1 (Empirical Inquiry) applied to your own evaluation.
Appendix: AIUC-1 Self-Audit: For the framework you recommend, complete a brief AIUC-1 check: what domains are in scope, what controls exist, what gaps remain. This connects your Unit 3 learning to your Unit 5 recommendation.
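The section-3 metrics can be computed mechanically from your per-incident results files. A sketch, assuming each record carries a ground-truth label, the system's verdict, and a latency — these field names are assumptions for illustration, not the Week 3 results schema verbatim:

```python
import math

def security_metrics(records: list) -> dict:
    """records: [{'is_threat': bool, 'flagged': bool, 'latency_s': float}, ...]"""
    threats = [r for r in records if r["is_threat"]]
    benign = [r for r in records if not r["is_threat"]]
    fn = sum(1 for r in threats if not r["flagged"])   # missed real threats
    fp = sum(1 for r in benign if r["flagged"])        # flagged legit activity
    lat = sorted(r["latency_s"] for r in records)
    p95 = lat[min(len(lat) - 1, math.ceil(0.95 * len(lat)) - 1)]
    return {
        "false_negative_rate": fn / len(threats) if threats else 0.0,
        "false_positive_rate": fp / len(benign) if benign else 0.0,
        "p95_latency_s": p95,
    }

sample = [
    {"is_threat": True,  "flagged": True,  "latency_s": 4.1},
    {"is_threat": True,  "flagged": False, "latency_s": 3.8},  # false negative
    {"is_threat": False, "flagged": True,  "latency_s": 2.9},  # false positive
    {"is_threat": False, "flagged": False, "latency_s": 3.2},
]
print(security_metrics(sample))
```

Hallucination rate in attribution needs a human grader per incident and cannot be computed this way — record it as a separate labeled column in your results files.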
Conversation starter before writing:
Ask Claude: "I'm writing a framework comparison report for multi-agent security systems evaluated on SOC operations. What metrics matter most for this use case, and what are the most common mistakes in framework comparison reports that security architects reject?"
Deep Agents Exercise — Build Your Institutional Knowledge Base
Exercise Goal: Write a human-curated AGENTS.md for your Unit 5 capstone system. Then audit it against the surgical test: every line must contain knowledge Claude cannot infer from reading 5 files in your codebase. If it can infer it, remove the line. Practitioner experience suggests human-written context files increase task success ~4%; LLM-generated files reduce it ~3%.
Lab Exercise: AGENTS.md for Capstone System
# AGENTS.md — [Your Capstone System Name]
# Human-written. Last updated: [date]
## Patterns & Conventions
- [e.g., All agent outputs return structured JSON with keys: status, data, confidence, error]
- [e.g., Use claude-haiku-4-5 for triage agents, claude-sonnet-4-6 for analysis agents]
## Anti-Patterns — DO NOT
- DO NOT add tool calls without input validation — every parameter must be sanitized
- DO NOT write agent state to disk except via the approved SQLite helper in utils/state.py
- [Add what would go wrong in YOUR system without this guidance]
## Known Traps
- [e.g., The log parser returns UTC timestamps but the alert format uses local time — normalize before comparison]
- [e.g., Staging environment uses a reduced threat intel feed — test results differ from production]

Unit 5 Major Deliverable
- Framework Comparison Report (10% of grade) — quantitative evaluation of Claude Agent SDK, OpenAI Agents SDK, and Claude Managed Agents with test suite results, visualizations, and justified recommendations
- All three framework implementations — working code repositories for each
The practitioner deciding between Claude Managed Agents and OpenAI Agents SDK for a production SOC deployment next month will be searching for exactly what you built. Open source the test suite, publish the results as a GitHub repo, and write a README summarizing your findings and recommendations.
The security community — including hobbyists, junior engineers, and teams at organizations that can't afford consultants — makes better decisions when practitioners share real evaluations. You ran the experiment. Share the results. Be the resource you wished existed when you started this unit.
Unit 5 Complete
You have built multi-agent systems in three frameworks and produced a quantitative comparison report.