# Evaluate Presets
## Overview

Systematically test all hat collection presets using shell scripts. Direct CLI invocation, with no meta-orchestration complexity.
## When to Use
- Testing preset configurations after changes
- Auditing the preset library for quality
- Validating new presets work correctly
- After modifying hat routing logic
## Quick Start

Evaluate a single preset:

```bash
./tools/evaluate-preset.sh tdd-red-green claude
```

Evaluate all presets:

```bash
./tools/evaluate-all-presets.sh claude
```
Arguments:

- First arg: preset name (without the `.yml` extension)
- Second arg: backend (`claude` or `kiro`; defaults to `claude`)
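For reference, a minimal sketch of how these arguments might be handled; the variable names and the `presets/<name>.yml` lookup are assumptions, not the actual script contents:

```bash
#!/usr/bin/env bash
# Hypothetical argument handling; see tools/evaluate-preset.sh for the real logic.
set -euo pipefail

PRESET="${1:?usage: evaluate-preset.sh <preset> [backend]}"  # e.g. tdd-red-green
BACKEND="${2:-claude}"                                       # claude (default) or kiro
PRESET_FILE="presets/${PRESET}.yml"                          # assumed preset location

echo "Evaluating ${PRESET_FILE} with backend ${BACKEND}"
```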
## Bash Tool Configuration
IMPORTANT: When invoking these scripts via the Bash tool, use these settings:
- Single preset evaluation: Use `timeout: 600000` (10 minutes max) and `run_in_background: true`
- All presets evaluation: Use `timeout: 600000` (10 minutes max) and `run_in_background: true`
Since preset evaluations can run for hours (especially the full suite), always run in background mode and use the TaskOutput tool to check progress periodically.
Example invocation pattern:

Bash tool with:

```
command: "./tools/evaluate-preset.sh tdd-red-green claude"
timeout: 600000
run_in_background: true
```
After launching, use TaskOutput with block: false to check status without waiting for completion.
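If you prefer to peek from a shell instead, you can tail the run's log directly; this assumes the `.eval/` layout described in the next section and that a run is already in progress:

```bash
# Follow the latest run of a preset without blocking on completion
# (the `latest` symlink is described under "Output structure" below).
tail -n 20 -f .eval/logs/tdd-red-green/latest/output.log
```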
## What the Scripts Do

### `evaluate-preset.sh`

- Loads the test task from `tools/preset-test-tasks.yml` (if `yq` is available)
- Creates a merged config with evaluation settings
- Runs Ralph with `--record-session` for metrics capture
- Captures output logs, exit codes, and timing
- Extracts metrics: iterations, hats activated, events published
Output structure:

```
.eval/
├── logs/<preset>/<timestamp>/
│   ├── output.log          # Full stdout/stderr
│   ├── session.jsonl       # Recorded session
│   ├── metrics.json        # Extracted metrics
│   ├── environment.json    # Runtime environment
│   └── merged-config.yml   # Config used
└── logs/<preset>/latest -> <timestamp>
```
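To see what a finished run actually used and produced, inspect these files directly (paths from the tree above; `jq` assumed to be installed):

```bash
# Inspect the most recent run of one preset
cat .eval/logs/tdd-red-green/latest/merged-config.yml    # config the run used
jq '.' .eval/logs/tdd-red-green/latest/environment.json  # runtime environment
```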
### `evaluate-all-presets.sh`

Runs all 12 presets sequentially and generates a summary:

```
.eval/results/<suite-id>/
├── SUMMARY.md         # Markdown report
├── <preset>.json      # Per-preset metrics
└── latest -> <suite-id>
```
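Once a suite finishes, the quickest review is the summary plus a spot-check of one preset's metrics; the `jq` fields below assume `<preset>.json` mirrors the `metrics.json` fields listed later in this document:

```bash
# Read the latest suite summary
cat .eval/results/latest/SUMMARY.md

# Spot-check one preset's result (field names assumed to match metrics.json)
jq '{iterations, completed}' .eval/results/latest/tdd-red-green.json
```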
## Presets Under Evaluation

| Preset | Test Task |
|---|---|
| `tdd-red-green` | Add `is_palindrome()` function |
| `adversarial-review` | Review user input handler for security |
| `socratic-learning` | Understand `HatRegistry` |
| `spec-driven` | Specify and implement `StringUtils::truncate()` |
| `mob-programming` | Implement a Stack data structure |
| `scientific-method` | Debug failing mock test assertion |
| `code-archaeology` | Understand history of `config.rs` |
| `performance-optimization` | Profile hat matching |
| `api-design` | Design a Cache trait |
| `documentation-first` | Document RateLimiter |
| `incident-response` | Respond to "tests failing in CI" |
| `migration-safety` | Plan v1 to v2 config migration |
## Interpreting Results

Exit codes from `evaluate-preset.sh`:

- `0` → Success (LOOP_COMPLETE reached)
- `124` → Timeout (preset hung or took too long)
- Other → Failure (check `output.log`)
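When scripting around a run, you can branch on the exit code directly; a small sketch using the Quick Start example preset:

```bash
# Run one preset and branch on the documented exit codes
./tools/evaluate-preset.sh tdd-red-green claude
case $? in
  0)   echo "success: LOOP_COMPLETE reached" ;;
  124) echo "timeout: preset hung or took too long" ;;
  *)   echo "failure: inspect .eval/logs/tdd-red-green/latest/output.log" ;;
esac
```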
Metrics in `metrics.json`:

- `iterations` → How many event loop cycles
- `hats_activated` → Which hats were triggered
- `events_published` → Total events emitted
- `completed` → Whether the completion promise was reached
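A quick way to pull these fields for the latest run of a preset (field names are the ones listed above; the rest of the JSON shape is assumed):

```bash
# Summarize the latest run of one preset
jq '{iterations, hats_activated, events_published, completed}' \
  .eval/logs/tdd-red-green/latest/metrics.json
```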
## Hat Routing Performance
Critical: Validate that hats get fresh context per Tenet #1 ("Fresh Context Is Reliability").
### What Good Looks Like
Each hat should execute in its own iteration:

```
Iter 1: Ralph → publishes starting event → STOPS
Iter 2: Hat A → does work → publishes next event → STOPS
Iter 3: Hat B → does work → publishes next event → STOPS
Iter 4: Hat C → does work → LOOP_COMPLETE
```
### Red Flags (Same-Iteration Hat Switching)

BAD: Multiple hat personas in one iteration:

```
Iter 2: Ralph does Blue Team + Red Team + Fixer work
        ^^^ All in one bloated context!
```
### How to Check

1. Count iterations vs. events in `session.jsonl`:

```bash
# Count iterations
grep -c "_meta.loop_start\|ITERATION" .eval/logs/<preset>/latest/output.log

# Count events published
grep -c "bus.publish" .eval/logs/<preset>/latest/session.jsonl
```
Expected: iterations ≈ events published (one event per iteration). Bad sign: 2-3 iterations but 5+ events (all work in a single iteration).
2. Check for same-iteration hat switching in `output.log`:

```bash
grep -E "ITERATION|Now I need to perform|Let me put on|I'll switch to" \
  .eval/logs/<preset>/latest/output.log
```
Red flag: Hat-switching phrases WITHOUT an ITERATION separator between them.
3. Check event timestamps in `session.jsonl`:

```bash
jq -r '.ts' .eval/logs/<preset>/latest/session.jsonl
```
Red flag: Multiple events with identical timestamps (published in same iteration).
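Checks 1 and 3 can be bundled into a small helper; it only reuses the grep patterns, the `.ts` field, and the `latest` symlink shown above, and the script name is hypothetical:

```bash
#!/usr/bin/env bash
# check-routing.sh (hypothetical name): routing sanity checks for one preset's latest run.
PRESET="${1:?usage: check-routing.sh <preset>}"
LOG_DIR=".eval/logs/${PRESET}/latest"

# Check 1: iterations vs. published events
ITERS=$(grep -c "_meta.loop_start\|ITERATION" "$LOG_DIR/output.log")
EVENTS=$(grep -c "bus.publish" "$LOG_DIR/session.jsonl")
echo "iterations=$ITERS events_published=$EVENTS"

# Check 3: timestamps that repeat (events published in the same iteration)
echo "duplicate timestamps:"
jq -r '.ts' "$LOG_DIR/session.jsonl" | sort | uniq -d
```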
### Routing Performance Triage
| Pattern | Diagnosis | Action |
|---|---|---|
| iterations ≈ events | ✅ Good | Hat routing working |
| iterations << events | ⚠️ Same-iteration switching | Check prompt has STOP instruction |
| iterations >> events | ⚠️ Recovery loops | Agent not publishing required events |
| 0 events | ❌ Broken | Events not being read from JSONL |
### Root Cause Checklist

If hat routing is broken:

1. Check the workflow prompt in `hatless_ralph.rs`:
   - Does it say "CRITICAL: STOP after publishing"?
   - Is the DELEGATE section clear about yielding control?
2. Check hat instructions propagation:
   - Does `HatInfo` include an `instructions` field?
   - Are instructions rendered in the `## HATS` section?
3. Check events context:
   - Is `build_prompt(context)` using the context parameter?
   - Does the prompt include a `## PENDING EVENTS` section?
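The first and third checks can be done mechanically with grep; the `src/` location of `hatless_ralph.rs` is an assumption (standard Cargo layout):

```bash
# Confirm the workflow prompt contains the STOP instruction and events section
grep -rn "CRITICAL: STOP after publishing" src/
grep -rn "## PENDING EVENTS" src/
```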
## Autonomous Fix Workflow
After evaluation, delegate fixes to subagents:
### Step 1: Triage Results

Read `.eval/results/latest/SUMMARY.md` and identify:

- ❌ FAIL → Create code tasks for fixes
- ⏱️ TIMEOUT → Investigate infinite loops
- ⚠️ PARTIAL → Check for edge cases
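A hedged shortcut for this first pass, assuming the status strings above appear literally in `SUMMARY.md`:

```bash
# Surface only the presets that need follow-up
grep -E "FAIL|TIMEOUT|PARTIAL" .eval/results/latest/SUMMARY.md
```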
### Step 2: Dispatch Task Creation
For each issue, spawn a Task agent:
"Use /code-task-generator to create a task for fixing: [issue from evaluation]
Output to: tasks/preset-fixes/"
### Step 3: Dispatch Implementation
For each created task:
"Use /code-assist to implement: tasks/preset-fixes/[task-file].code-task.md
Mode: auto"
### Step 4: Re-evaluate

```bash
./tools/evaluate-preset.sh <fixed-preset> claude
```
## Prerequisites

- yq (optional): For loading test tasks from YAML. Install: `brew install yq`
- Cargo: Must be able to build Ralph
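A quick preflight check before kicking off a long suite; nothing here is required by the scripts themselves, it just confirms the prerequisites above:

```bash
# Verify prerequisites (yq is optional)
command -v yq >/dev/null && echo "yq: $(yq --version)" || echo "yq missing (optional)"
cargo build   # Ralph must build before evaluations can run
```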
## Related Files

- `tools/evaluate-preset.sh` – Single preset evaluation
- `tools/evaluate-all-presets.sh` – Full suite evaluation
- `tools/preset-test-tasks.yml` – Test task definitions
- `tools/preset-evaluation-findings.md` – Manual findings doc
- `presets/` – The preset collection being evaluated