Lab Guide: Unit 4 — Rapid Prototyping with Agentic Tools

CSEC 601 | Weeks 13–16 | Semester 1

Four weeks of agentic engineering: multi-agent architecture (Week 13), Sprint I build (Week 14), Sprint II hardening (Week 15), and midyear presentations (Week 16). This is where everything from Semester 1 comes together.

Claude for velocity, not shortcuts: The goal is to learn production thinking fast, not to avoid thinking. Ask Claude to review your architecture decisions, not to make them for you.

Week 13 — Multi-Agent Architecture Deep Dive

Week 13 Lab: Build a Multi-Agent SOC System with Worktrees

Lab Goal: Build a four-agent SOC system (Orchestrator + Recon Agent + Analysis Agent + Reporting Agent) using Claude Agent SDK, with worktrees for parallel development. Implement the Hierarchical orchestration pattern.

Tool-agnostic framing: worktrees and multi-agent patterns

Worktrees are a git feature, not a Claude Code feature. Any IDE or terminal supports them. VS Code Multi-Root Workspaces, Cursor, and JetBrains all work with git worktrees. The Claude Code integration shown here is one way to use them.

Orchestrator + specialized worker is a pattern, not a framework. Whether you use the Anthropic SDK (shown here), Claude Managed Agents, OpenAI Agents SDK, or AutoGen — the architecture is the same: one agent routes and coordinates, specialized agents execute with focused context and tool scope. The framework changes; the pattern doesn't.
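The pattern can be sketched framework-free in a few lines of plain Python. The `Agent` and `Orchestrator` classes below are illustrative stand-ins (not SDK types), with stub handlers where real model calls would go:

```python
class Agent:
    """Specialist: narrow prompt, focused tool scope (illustrative stand-in)."""
    def __init__(self, name, handler):
        self.name, self.handler = name, handler

    def run(self, task):
        return self.handler(task)

class Orchestrator:
    """Routes each task to the one specialist whose scope matches."""
    def __init__(self, routes):
        self.routes = routes  # {task_type: Agent}

    def run(self, task_type, payload):
        agent = self.routes.get(task_type)
        if agent is None:
            raise ValueError(f"no specialist registered for {task_type!r}")
        return agent.run(payload)

# Stub handlers stand in for real model calls:
triage = Orchestrator({
    "recon":  Agent("recon",  lambda alert: {"iocs": ["203.0.113.7"]}),
    "report": Agent("report", lambda data:  {"summary": str(data)}),
})
print(triage.run("recon", "suspicious login alert"))  # → {'iocs': ['203.0.113.7']}
```

Swapping the lambda handlers for SDK, Managed Agents, or AutoGen calls changes the syntax inside `run`, not the routing structure around it.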

Knowledge Check — Week 13

1. Why do worktrees enable parallel agent development without conflicts?

2. What distinguishes the Expert Swarm Pattern from Hierarchical orchestration?

3. Why can multiple specialized agents sometimes cost LESS than one large general agent?

Try /worktree-setup for Parallel Agent Development

This lab has you building three agents simultaneously across isolated worktrees. The /worktree-setup skill generates the exact commands for your project — worktree create commands, directory map, and integration merge steps — based on your spec.

curl -o ~/.claude/commands/worktree-setup.md https://raw.githubusercontent.com/r33n3/Noctua/main/docs/skills/worktree-setup.md
# Then in Claude Code, after your spec is ready:
/worktree-setup three-agent SOC system: recon-agent, analysis-agent, reporting-agent

What you're learning here transfers beyond Claude Code. Worktrees are a git feature — not a Claude Code invention. The git worktree add command works in any git repository, with any development tool. Agent orchestration patterns (supervisor → specialist, parallel fanout, sequential pipeline) appear in Claude Managed Agents, OpenAI Agents SDK, AutoGen, and the Claude Agent SDK with different syntax but identical architecture. The mental model you build this week — how to decompose a complex task into a supervised multi-agent pipeline — is framework-independent. The syntax is not.

Lab Exercise: Three-Agent SOC System

Architecture: Incident arrives → Orchestrator routes to Recon Agent (gathers IoC data via your MCP tools) → Analysis Agent (applies CCT framework to synthesize findings) → Reporting Agent (generates structured JSON + Markdown report). All three run under the Orchestrator's control.
cd ~/noctua-labs
git init soc-agent-system && cd soc-agent-system
git commit --allow-empty -m "init"
# Create worktrees for each agent component
git worktree add ../soc-recon   -b feature/recon-agent
git worktree add ../soc-analysis -b feature/analysis-agent
git worktree add ../soc-reporting -b feature/reporting-agent
cd ~/noctua-labs/soc-recon
claude
# Prompt: "Build a recon security agent in Python using the Anthropic SDK.
# The agent receives raw alert text and must:
# 1. Extract all IoCs (IPs, hashes, CVEs, domains)
# 2. Use tool calls to enrich each IoC via the MCP tools
# 3. Return structured JSON: {iocs: [{type, value, enrichment, severity}], summary}
# Use claude-sonnet-4-6. Include error handling for failed tool calls."
cd ~/noctua-labs/soc-agent-system
claude
# Prompt: "Build orchestrator.py that coordinates three agents in sequence.
# Import: recon_agent.run(alert_text), analysis_agent.run(recon_json),
# reporting_agent.run(analysis_json, start_ts)
# Requirements:
# - 30s asyncio timeout per agent via asyncio.wait_for()
# - If recon times out: pass partial data to analysis with recon_status='timeout'
# - If analysis fails: reporting still runs with error summary
# - Return: {mtti_seconds, recon_status, analysis_status, report}
# - Log each stage with timestamp to structured JSON log"
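The requirements in that prompt reduce to one reusable skeleton. This sketch is not the graded solution: it shows how `asyncio.wait_for()` gives each stage its 30-second budget and how partial results flow forward on timeout or failure. The agent callables are assumed to be async functions matching the signatures in the prompt:

```python
import asyncio
import time

async def run_stage(coro, timeout=30):
    """Run one agent stage under a timeout; never let an exception escape."""
    try:
        return "ok", await asyncio.wait_for(coro, timeout=timeout)
    except asyncio.TimeoutError:
        return "timeout", None
    except Exception as exc:
        return "error", str(exc)

async def orchestrate(alert_text, recon, analysis, reporting):
    """Sequence the three agents; degrade gracefully on timeout or failure."""
    start = time.monotonic()
    recon_status, recon_json = await run_stage(recon(alert_text))
    if recon_status != "ok":
        recon_json = {"iocs": [], "recon_status": recon_status}  # pass partial data forward
    analysis_status, analysis_json = await run_stage(analysis(recon_json))
    if analysis_status != "ok":
        analysis_json = {"error_summary": f"analysis {analysis_status}"}
    _, report = await run_stage(reporting(analysis_json, start))
    return {
        "mtti_seconds": round(time.monotonic() - start, 2),
        "recon_status": recon_status,
        "analysis_status": analysis_status,
        "report": report,
    }
```

The structured JSON logging the prompt asks for would hook into each `run_stage` call; it is omitted here to keep the control flow visible.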
Alternative: Claude Managed Agents

The system you just built manages the orchestration loop in Python — your code calls each agent in sequence, handles errors, and wires results together. Claude Managed Agents is Anthropic's hosted version of this pattern: you define the agent config once, start a session per investigation, and stream events. The container, tools, and agent loop run on Anthropic's infrastructure.

The same SOC investigation, rebuilt as a Managed Agent:

import anthropic

client = anthropic.Anthropic()

# ── ONE-TIME SETUP (run once, save the IDs) ──────────────────────────────────
agent = client.beta.agents.create(
    name="SOC Analyst",
    model="claude-sonnet-4-6",
    system="""You are a senior SOC analyst. When given a security alert:
1. Enrich all IoCs (IPs, domains, hashes) using available tools
2. Apply CCT analysis: identify TTPs, generate 3 hypotheses with probabilities
3. Produce a structured incident report in JSON + narrative summary""",
    tools=[{"type": "agent_toolset_20260401"}],  # bash, read/write, web_search all included
)
environment = client.beta.environments.create(
    name="soc-env",
    config={"type": "cloud", "networking": {"type": "unrestricted"}},
)
# Save agent.id and environment.id — reuse across all investigations

# ── PER-INVESTIGATION (run for each new alert) ────────────────────────────────
session = client.beta.sessions.create(
    agent=agent.id,
    environment_id=environment.id,
    title="Meridian Financial Incident",
)

with client.beta.sessions.events.stream(session.id) as stream:
    client.beta.sessions.events.send(
        session.id,
        events=[{"type": "user.message", "content": [
            {"type": "text", "text": meridian_incident_text}
        ]}],
    )
    for event in stream:
        if event.type == "agent.message":
            for block in event.content:
                if hasattr(block, "text"):
                    print(block.text, end="")
        elif event.type == "agent.tool_use":
            print(f"\n[Tool: {event.name}]")
        elif event.type == "session.status_idle":
            print("\n── Investigation complete ──")
            break

Custom orchestrator vs. Managed Agents — when to choose:

  • Custom SDK orchestrator (this lab): use when you need custom tool execution on your own infrastructure, approval gates before sensitive actions, or fine-grained control over every agent call.
  • Claude Managed Agents: use when you want Anthropic to host the container and run the loop — ideal for long-running investigations, file-heavy work, and teams that don't want to manage agent infrastructure.

The orchestrator pattern you designed this week transfers directly to Managed Agents. The agent's reasoning logic (recon → analysis → report) lives in the system prompt; what changes is who runs the infrastructure underneath it. You'll evaluate both approaches quantitatively in Unit 5.

Week 13 Deliverables
  • soc-agent-system/ — complete multi-agent repository with all three agents and orchestrator
  • Architecture Diagram — data flow diagram showing alert → Orchestrator → Recon → Analysis → Reporting → output
  • MTTI Comparison — Week 1 manual vs. Week 13 automated, with cost analysis

Week 14 — Rapid Prototyping Sprint I

Week 14 Lab: Concept to Working Demo in 3 Hours

Lab Goal: Execute a time-boxed rapid prototyping sprint: 20 minutes for problem scoping and spec, 2 hours for building, 40 minutes for testing and demo prep. You will measure MTTS, MTTP, and MTTSol for your sprint. This prototype becomes the foundation of your midyear project.

Knowledge Check — Week 14

1. In the sprint context, what does MTTS measure?

2. What does the Think → Spec → Build → Retro cycle prescribe for time allocation?

Use /think + /build-spec for Your 20-Minute Scoping Phase

The Think → Spec → Build → Retro cycle was built for exactly this sprint format. In your 20-minute planning window: run /think first to validate your chosen problem, then run /build-spec to produce the formal 1-page spec. Both skills give you structured output you can paste directly into the build phase.

curl -o ~/.claude/commands/build-spec.md https://raw.githubusercontent.com/r33n3/Noctua/main/docs/skills/build-spec.md
# Sprint scoping sequence in Claude Code:
/think I want to build a phishing email triage pipeline for this sprint. What are the risks and unknowns?
# Then after validating direction:
/build-spec phishing email triage pipeline — single sprint, Claude agent + MCP tools, 3-hour build window

Track two metrics, not one. Most students track MTTP (Mean Time to Prototype) — time from "go" to working demo. Also track MTTS (Mean Time to Spec) — time from "go" to a written, signed-off spec. Why:

  • MTTP tells you how fast you execute
  • MTTS tells you how fast you make design decisions
  • If MTTP improves by compressing MTTS (skipping the spec to start building faster), you've traded evaluation validity for speed

The spec phase should not be rushed. It is where security decisions are made.
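A timestamp log is enough to capture both metrics. This minimal sketch (the class and method names are ours, not course-provided) records phase boundaries against the sprint "go" signal:

```python
import time

class SprintClock:
    """Record phase boundaries against the sprint 'go' signal."""
    def __init__(self):
        self.marks = {"go": time.monotonic()}

    def mark(self, phase):
        self.marks[phase] = time.monotonic()

    def minutes_to(self, phase):
        return (self.marks[phase] - self.marks["go"]) / 60

clock = SprintClock()
# ...write and sign off the spec...
clock.mark("spec_signed_off")   # MTTS = clock.minutes_to("spec_signed_off")
# ...build until the demo works end-to-end...
clock.mark("demo_working")      # MTTP = clock.minutes_to("demo_working")
```

If `minutes_to("spec_signed_off")` shrinks sprint over sprint while `minutes_to("demo_working")` barely moves, you are compressing design time, not execution time.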

Required: AIUC-1 pre-check before finalizing your spec (5 minutes).

Before writing a single line of code, answer these four questions and document them in sprint1/aiuc1-precheck.md:

  1. What data does this system process? (Email content? Does it contain PII? Attachments?)
  2. Who is affected by a wrong decision? (False positive = missed real threat reported to nobody. False negative = analyst alerted on legitimate email.)
  3. Which AIUC-1 domains are in scope? (B: Security — does the system have access to security-sensitive data? D: Reliability — what happens when it's wrong? E: Accountability — who reviews its decisions?)
  4. What human oversight exists for high-severity outputs? (Does a P1 classification auto-escalate, or does a human review first?)

The full AIUC-1 audit happens in Unit 3 (which you've already completed). This pre-check ensures your Sprint I design avoids the gaps your Unit 3 audit identified.
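If you want a file skeleton to fill in during the five minutes, a heredoc works. The headings below mirror the four questions, and the path matches the one named above:

```shell
mkdir -p sprint1
cat > sprint1/aiuc1-precheck.md <<'EOF'
# AIUC-1 Pre-Check — Sprint I

## 1. What data does this system process?
(email content? PII? attachments?)

## 2. Who is affected by a wrong decision?
(false positive vs. false negative impact)

## 3. Which AIUC-1 domains are in scope?
(B: Security, D: Reliability, E: Accountability)

## 4. What human oversight exists for high-severity outputs?
(auto-escalate vs. human review for P1 classifications)
EOF
```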

Lab Exercise: Sprint I — Timed Prototype

Timer is active: The lab instructor will signal the start. Record your exact start time. Each phase is time-boxed — when the phase ends, move on regardless of completion state. Track your metrics honestly.

Finding the skill

The /audit-aiuc1 skill is included in the course skills bundle. The skill file is at .claude/skills/audit-aiuc1/SKILL.md.

mkdir -p ~/noctua-labs/unit4/sprint1
cd ~/noctua-labs/unit4/sprint1
git init
# Start building with Claude Code immediately — spec in context
claude
# Prompt: "I am building [your chosen system]. Here is my spec:
# [paste your 1-page spec]. Let's start with the core agent loop.
# Build [first component] first."
/check-antipatterns ~/noctua-labs/unit4/sprint1/

# Required: Zero CRITICAL findings to pass Sprint I
# Document: HIGH findings → deferred to Sprint II with justification
# Include: report output in Sprint I deliverables package
Discussion (~10 min): The Cost of Being Thorough

Setup: The semi-formal approach took 2.8x more agent steps. Full harness runs cost $125–200 and took 4–6 hours. The quick version was $9 and took 20 minutes.

Discussion prompt: Your sprint has a token budget. Running /check-antipatterns with full semi-formal analysis costs 3x more than a quick scan. You have 5 components to assess. Do you run the thorough check on all 5? Or do you triage — run the quick check on everything, and the thorough check on the 2 highest-risk components? How do you decide WHICH components get the thorough check? What makes a component high-risk?

Key insight: Thoroughness is a budget allocation problem, not a binary choice. The answer is tiered verification: every commit gets quick checks (linting, basic /code-review); every PR gets standard checks (/code-review with confidence threshold); before deployment gets full semi-formal analysis (all three evaluators); high-risk components get additional manual review on top. The cost of a thorough assessment is always less than the cost of a missed vulnerability in production. But you can't run thorough assessments on everything — you'd never ship. Match the reasoning depth to the stakes of the decision.

Instructor note: Have students actually measure the cost difference. Run /check-antipatterns on a component and record the token cost. Then run it with explicit instructions to "trace every function call and provide file:line evidence for every finding." Compare costs. The difference makes the tradeoff concrete.

Course connection: /cost tracking, /effort levels, three-tier code review, sprint budget management. Students should track their actual token costs for quick vs thorough assessments across the sprint.

Source: Ugare & Chandra, "Agentic Code Reasoning," arXiv:2603.01896v2

Week 14 Deliverables
  • Sprint I Prototype — code repository with README, working end-to-end (even if incomplete)
  • Sprint I Metrics — MTTS, MTTP, MTTSol, token cost, and completion percentage
  • 5-Minute Demo Script — written and rehearsed presentation for Sprint I showcase
  • AI Methodology Note — 100-word description of how Claude Code was used in the sprint

Week 15 — Rapid Prototyping Sprint II: Hardening

Week 15 Lab: Iterate, Harden, and Production-Ready Quality

Lab Goal: Transform your Sprint I prototype into a hardened, documented, and ethically audited system. Apply the full quality checklist: error handling, input validation, logging, CCT analysis, AIUC-1 compliance, performance measurement.

Knowledge Check — Week 15

1. What does 'hardening' a security prototype specifically require?

Dependency review before you plan. Before committing to Sprint II scope, audit your dependencies:

  • What Python packages do you need that aren't currently installed?
  • What does adding them require (compilation? system libraries? significant disk space?)
  • Do any planned remediations (PII scanning, encryption, external API integrations) require new dependencies?

Dependencies discovered mid-sprint become scope constraints. A PII scanner that requires a 200MB model download or a complex compilation step can block a sprint if discovered on day two. Five minutes of dependency review at sprint start saves hours of mid-sprint replanning.
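A five-minute audit can be scripted. This stdlib-only sketch (the function name and the `planned` list are ours, for illustration) reports which required distributions are missing from the current environment before the sprint clock starts:

```python
from importlib import metadata

def missing_distributions(required):
    """Return the required distributions not installed in this environment."""
    missing = []
    for name in required:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing

# Run against your Sprint II plan before committing to scope:
planned = ["anthropic", "presidio-analyzer", "cryptography"]
print(missing_distributions(planned))
```

Anything this prints still needs the second half of the review by hand: does installing it require compilation, system libraries, or a large model download?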

Lab Exercise: Sprint II Hardening Checklist

import logging, json

class JSONFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            'ts': self.formatTime(record),
            'level': record.levelname,
            'event': record.getMessage(),
            'agent': getattr(record, 'agent', None),
            'tool': getattr(record, 'tool', None),
        })

# Wire the formatter into a handler once at startup:
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger = logging.getLogger("soc")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Use: logger.info("tool_call", extra={'agent': 'recon', 'tool': 'query_cve'})
/check-antipatterns ~/noctua-labs/unit4/sprint1/

# Track improvement:
# Sprint I:  CRITICAL __ HIGH __ MEDIUM __
# Sprint II: CRITICAL __ HIGH __ MEDIUM __
# Target: READY status (no CRITICAL or HIGH)
import anthropic, json

client = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY from env

# ── STEP 1: Deploy the agent (run ONCE — save the IDs) ───────────────────────
# Paste your Sprint II system prompt below:
SYSTEM_PROMPT = """
[Your hardened Sprint II system prompt here — the one that passed AIUC-1 audit]
"""

agent = client.beta.agents.create(
    name="Unit 4 Sprint II — <your tool name>",
    model="claude-sonnet-4-6",
    system=SYSTEM_PROMPT,
    tools=[{"type": "agent_toolset_20260401"}],  # bash, file ops, web_search included
)
environment = client.beta.environments.create(
    name="unit4-sprint2-env",
    config={"type": "cloud", "networking": {"type": "unrestricted"}},
)

# Save IDs — you'll reuse these for every session
ids = {"agent_id": agent.id, "environment_id": environment.id}
with open("managed_agent_ids.json", "w") as f:
    json.dump(ids, f, indent=2)
print(f"Deployed: {ids}")

# ── STEP 2: Run a test session against your Sprint I test cases ───────────────
with open("managed_agent_ids.json") as f:
    ids = json.load(f)

session = client.beta.sessions.create(
    agent=ids["agent_id"],
    environment_id=ids["environment_id"],
    title="Test Session — Week 15",
)

test_input = "[Paste your Sprint I test case #1 here]"

with client.beta.sessions.events.stream(session.id) as stream:
    client.beta.sessions.events.send(session.id, events=[{
        "type": "user.message",
        "content": [{"type": "text", "text": test_input}]
    }])
    for event in stream:
        if event.type == "agent.message":
            for block in event.content:
                if hasattr(block, "text"):
                    print(block.text, end="", flush=True)
        elif event.type == "agent.tool_use":
            print(f"\n[Tool: {event.name}]", flush=True)
        elif event.type == "session.status_idle":
            print("\n\n── Session complete ──")
            break
After deployment: answer these questions
  • Did your agent produce the same output as the local version? If not — why? (context window, tool availability, system prompt differences?)
  • How long did the session take vs. local execution? What drove the difference?
  • Your agent is now "live" — anyone with the session API and your agent ID can run it. What access controls would a production deployment need?
  • Look at tools/mass/claude-managed-agents/ in this repo — that's the MASS scanner deployed the same way. Your Sprint II agent is structurally identical. In Unit 5 you'll compare this pattern against two others.
Discussion (~10 min): Semi-Formal vs Fully Formal — The Practical Middle

Setup: The paper explicitly positions semi-formal reasoning between unstructured chain-of-thought (too loose) and fully formal verification in Lean or Coq (too rigid). Unstructured: 78% accuracy. Semi-formal: 88% accuracy. Fully formal: theoretically 100% accuracy on whatever you can formalize — but formalizing a Django codebase with Python, PostgreSQL, Redis, and three API integrations would take years.

Discussion prompt: Why not just use formal verification for everything? Let students identify: cost, time, scope limitations. Then push: "Is there any part of your capstone system that SHOULD be formally verified? What about the parts that can't be?"

Key insight: Three levels of verification rigor. Informal (chain-of-thought): fast, cheap, broad scope, 78% accuracy — use for quick checks, brainstorming, initial triage. Semi-formal (structured templates): moderate cost, broad scope, 88% accuracy — use for security assessments, code review, audit, most production work. Fully formal (Lean, Coq, proof systems): expensive, narrow scope, 100% accuracy on formalized scope — use for cryptographic implementations, authentication logic, critical safety properties. For your capstone: your agent's system prompt governance is semi-formal. Your IAM policies are closer to formal (AWS validates them against a policy language). Your overall security posture is assessed semi-formally (AIUC-1 audit with evidence chains). The mix is intentional.

Instructor note: This is a good capstone prep discussion because it frames what the CISO panel will ask: "How confident are you in this assessment? What's the rigor level? Where would formal verification add value? Where is it impractical?" Students who can articulate the rigor-cost-scope tradeoff demonstrate senior engineering judgment.

Course connection: Engineering Assessment Stack (start simple, escalate when needed), four defense layers, the entire course philosophy of "minimum viable rigor for the stakes involved."

Source: Ugare & Chandra, "Agentic Code Reasoning," arXiv:2603.01896v2

Week 15 Deliverables
  • Hardened Prototype — complete code with error handling, validation, logging, and README
  • Sprint II AIUC-1 Governance Audit — AIUC-1 compliance matrix for the hardened prototype
  • Sprint I vs II Comparison — metrics comparison table showing measurable improvements
  • Managed Agent Deployment — managed_agent_ids.json proving a live deployment + answers to the 4 post-deployment questions
Close the Cycle: Run /retro Before Week 16 Presentations

Before your demo, run a structured retrospective on both sprints. The /retro skill produces a document comparing what you spec'd vs. what you built, what worked, what didn't, and what you'd carry into a third sprint. This feeds your presentation directly — and becomes part of your portfolio.

curl -o ~/.claude/commands/retro.md https://raw.githubusercontent.com/r33n3/Noctua/main/docs/skills/retro.md
# After Sprint II, in Claude Code:
/retro Unit 4 Sprint I + II — phishing triage pipeline
Merge Before You Present: Run /merge-worktrees

If you built Sprint II across multiple worktrees, merge them now — before Week 16 presentations. Presenting from an unmerged worktree means your demo runs against a branch that won't survive the sprint. The /merge-worktrees skill handles conflict resolution, runs your test suite after each merge, and produces a merge report.

# From your main Claude Code session (not inside a worktree):
/merge-worktrees

Critical: Run this from your main session, not from inside a worktree. The skill merges worktrees into main — it can't run from inside one of the branches being merged.
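For reference, the manual flow the skill automates is plain git. This self-contained sketch runs in a scratch repo so you can see the loop: branch in a worktree, merge into main, remove the worktree. In your real project, run your test suite after each merge before continuing:

```shell
# Scratch-repo demo of the manual flow (requires git >= 2.28 for `init -b`):
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q -b main
git -c user.email=demo@example.com -c user.name=demo commit -q --allow-empty -m init
git worktree add -q "$repo-agent" -b feature/agent
( cd "$repo-agent" &&
  echo "print('agent')" > agent.py &&
  git add agent.py &&
  git -c user.email=demo@example.com -c user.name=demo commit -q -m "agent work" )
# Back on main: merge the feature branch, then run your tests
git merge -q --no-ff feature/agent -m "merge feature/agent"
# pytest || exit 1    # gate each merge on green tests in a real project
git worktree remove "$repo-agent"
git branch -d feature/agent
git log --oneline
```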


Week 16 — Midyear Project Presentations

Week 16 Lab: Demo, Defend, Reflect

Lab Goal: Present your hardened prototype to the class and instructor panel. Defend architectural decisions using CCT. Conduct peer reviews. Reflect on the semester's learning trajectory.

Class Discussion — Before Presentations

Before the demo session begins, take 15 minutes for structured reflection. These questions are the synthesis moment — 16 weeks of CCT, tools, ethics, and prototyping converge here.

  • Week 1 vs. Week 16 MTTI: Your MTTI at Week 1 was roughly 26 minutes. What is your Week 16 MTTI for a similar-complexity investigation? What accounts for the difference — skill, tooling, or workflow? How much of the improvement came from faster CCT reasoning vs. faster execution?
  • Changed assumptions: What assumption about AI-assisted security work did you hold in Week 1 that turned out to be wrong? What evidence from a specific lab changed your mind?
  • CCT in practice: Which of the five CCT pillars challenged your instincts most? Was there a lab where you realized you had been applying a pillar superficially?
  • Course redesign: If you were designing this semester's curriculum, what would you add, cut, or reorder?
  • Portfolio gap: Your Sprint II prototype is now a portfolio item. What would you need to do to present it in a job interview with confidence? What is the gap between "it works" and "I would stake my professional reputation on this"?

Ship It: Release Pipeline Checklist

Before your prototype earns a PR and goes to leadership review, run it through this shipping checklist. Each step has a reason — this is the production shipping discipline practiced throughout the course: spec before build, test before ship, document before deploy.

Unit 4 Pre-Landing AI Checklist

Review each item. Pass = confirmed. Skipped = documented in your PR description with written justification.

  1. LLM trust boundaries — Does your system treat all LLM outputs as untrusted input? Is there validation before acting on agent decisions? Pass: every agent output is validated before downstream use.
  2. SQL / injection safety — If your system writes to any database or constructs queries, are inputs parameterized? Pass: no string concatenation in queries.
  3. Race conditions — If agents run concurrently, is shared state (logs, files, rate counters) protected? Pass: thread-safe or process-isolated.
  4. Enum completeness — Do all match/case or if-elif blocks have an explicit default? Pass: no silent fall-through on unexpected values.
  5. Error propagation — Does every exception either recover gracefully or surface clearly to the caller? Pass: no bare except: pass blocks.
  6. Secrets in env vars — Are all API keys, tokens, and credentials in environment variables, not in code or committed files? Pass: git grep -iE "api_key[[:space:]]*=" returns zero matches.
  7. Blast radius — Does this PR change fewer than 5 files? If more, is the scope justified? Pass: documented rationale for any large-scope change.

Document any item you knowingly skip and the reason why. This is a graded deliverable — the documentation is as important as the checklist result.
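Checklist item 1 is concrete enough to demo. A hedged sketch of the "validate before downstream use" rule, with the schema and `ALLOWED_SEVERITIES` allow-list invented for illustration:

```python
import json

ALLOWED_SEVERITIES = {"low", "medium", "high", "critical"}

def validate_agent_output(raw: str) -> dict:
    """Parse and range-check agent JSON before anything downstream acts on it."""
    data = json.loads(raw)  # malformed JSON raises ValueError here
    if not isinstance(data.get("iocs"), list):
        raise ValueError("agent output missing 'iocs' list")
    for ioc in data["iocs"]:
        if ioc.get("severity") not in ALLOWED_SEVERITIES:
            raise ValueError(f"unexpected severity: {ioc.get('severity')!r}")
    return data

print(validate_agent_output('{"iocs": [{"severity": "high"}]}'))
```

The point is that the gate sits between the agent and anything with side effects; a rejected output stops the pipeline instead of silently escalating an incident.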

Coming in Semester 2 — the full production checklist

Unit 7 extends this checklist with governance-grade requirements: AIUC-1 audit across all 6 domains, MASS security scan, agent identity and allowance profiles, distributed tracing, red team report, cost caps, and tested human escalation paths. The 7 items above are the Semester 1 foundation you build on.

gh pr create \
  --title "Sprint II: Threat Hunter v0.1.0 — Hardened" \
  --body "## Problem
Analysts spend 45 min/day triaging phishing alerts manually.

## What this does
3-agent system: classifier → enricher → reporter. 94% accuracy on test set.

## Architecture decisions
- Claude Sonnet for classification (cost/accuracy tradeoff)
- Async enrichment with 30s timeout + fallback
- Human escalation at confidence < 0.7

## Test coverage
- 23 unit tests, 4 integration tests, all passing
- Pre-landing checklist: all 7 items reviewed, 0 deferred

## Known gaps
- No rate limiting on the enrichment API (tracked in TODOS.md)
- Evaluation dataset is synthetic; production data may differ"

Presentation Preparation Checklist

Peer Review Template — copy and complete for each presenter
# Peer Review — [Presenter Name]
**Reviewer:** [Your Name] | **Date:** [Date] | **System:** [System Name]

## 1. Problem solved
[1-2 sentences: What security problem did they address? Was the problem well-defined?]

## 2. Most impressive technical achievement
[Specific: name the component, approach, or design decision that stood out.
Not "it worked well" — what specifically demonstrated skill?]

## 3. Most significant gap or risk
[Specific: a security gap, architectural weakness, or untested edge case.
Apply CCT Pillar 1 — what evidence supports this being a real risk?]

## 4. One improvement suggestion
[Actionable: "Add rate limiting to the enrichment API call in recon_agent.py"
not "improve security." Something they could implement in a day.]

## Overall: Would you use this in a real SOC?
[ ] Yes, as-is  [ ] Yes, with modifications  [ ] Not yet — needs X first
Reason: [one sentence]
Week 16 Deliverables
  • 10-Minute Live Demo — of your hardened Sprint II prototype to the class
  • Peer Review Forms — one for each teammate's presentation
  • Semester Reflection (750 words) — learning trajectory from Week 1 to Week 16
  • GitHub Repository — final, tagged release of your prototype with complete documentation
Semester 1 Portfolio — Make It Public and Share It

Stop and take stock of what you've built: a CCT analysis framework, a multi-tool MCP server with RAG, an AI security policy, an AI ethics audit, and a 3-agent SOC prototype that went through two hardening sprints. That is a real portfolio — not coursework, not exercises, actual working security systems.

If your repositories aren't public on GitHub with proper READMEs, fix that now. Share the links — on LinkedIn, in security forums, with your team, in Discord servers like BloodHound Gang or Security BSides channels. There are security practitioners, hobbyists, and students who would learn from your AI ethics audit template alone. Security only improves when practitioners share what works. You don't have to wait until you're an expert to share. You are already building things people need.

Use this prompt to generate READMEs:

Write a GitHub README for my [project name] that explains the security problem it solves, how to run it locally, what tools it uses, and what someone could fork or extend.
Build Your Sprint Skill — Shortcut the Next One

You've now run two sprints and have a repeatable pattern: planning → scaffolding → hardening → review. Turn this into a Claude Code skill. Create a /sprint-setup skill that scaffolds a new security agent project with your preferred directory structure, CLAUDE.md, logging config, and ethics checklist pre-wired. The next sprint starts in 20 seconds instead of 20 minutes.

Use this prompt:

Based on my Sprint I and II work, write a Claude Code skill file called sprint-setup.md that scaffolds a new security agent project with my standard structure, dependencies, CLAUDE.md, and hardening checklist already in place.

Knowledge Check — Week 16

1. Which best describes production readiness for an AI system?

2. What does applying the production engineer mindset to your final presentation mean in practice?


Semester 1 Complete!

You have completed all four units of CSEC 601: CCT Foundations, Agent Tool Architecture, AI Security Governance, and Rapid Prototyping. You are ready for Semester 2.

What Semester 2 does with what you built.

The tools and systems you built this semester are the starting material for Semester 2 — not background context, but direct inputs.

  • Unit 5 (Multi-Agent Orchestration): Your phishing triage agent from Sprint I becomes a supervised multi-agent pipeline. The single agent that classifies and reports becomes a specialist team: a classifier agent, an enrichment agent, a report-writer agent.
  • Unit 6 (Red Teaming): You will attack your own Unit 2 MCP server using the techniques from Unit 6. The tools you built are the targets.
  • Unit 7 (Hardening): The Cedar policies you wrote in Week 12 are deployed to Amazon Verified Permissions. Your Unit 2 MCP server gets production security hardening. The gaps your Unit 3 audit identified get closed.
  • Unit 8 (Capstone): Your Sprint II prototype is the capstone starting point. You're not starting from scratch — you're hardening and scaling what's already there.

✓ What you mastered

  • Sprint planning: spec-first, scope definition, integration estimation
  • Agentic prototype development under ethical constraints
  • Evaluation methodology: ground truth independence, adversarial test cases
  • Multi-agent architecture patterns (worktree-based, supervisor/specialist)

↻ What was introduced (returns later)

  • Production hardening (Unit 7)
  • Red team testing of your own tools (Unit 6)
  • Multi-agent orchestration at scale (Unit 5)

→ What's waiting next

Semester 2 begins with Unit 5 — your prototype enters a multi-agent pipeline, and the security evaluation gets serious.

Continue: Semester 2 Lab Guide — Unit 5: Multi-Agent Orchestration →