Co-evolutionary Examples Mining Guide¶
Static examples.json files have a natural expiry date. When apo-textgrad is first set up, examples are written for the prompt's current capability — but as the prompt improves, those examples become trivially easy, PASS_RATE saturates, the gradient signal disappears, and optimization stalls at a local optimum. Meanwhile, the skill itself evolves: new conventions, new file layouts, new issue formats. The corpus quietly goes stale.
examples-miner breaks this pattern by coupling the corpus to the project's own history. Every completed issue with a session log is an implicit human approval — the agent's output was good enough that a human accepted it and the issue closed. examples-miner harvests these real labeled invocations, quality-gates them through a three-layer judge, calibrates difficulty to an informative 40–80% band, runs apo-textgrad as an inner sub-loop to extract a gradient signal, and synthesizes adversarial examples targeting exactly the failure pattern the optimizer found. Repeat, and difficulty always leads the prompt's current capability.
Table of Contents¶
- The Problem with Static Examples
- What examples-miner Does
- When to Use
- Quick Start
- How the Pipeline Works
- Stage 1: Harvest
- Stage 2: Judge — Three-Layer Quality Stack
- Stage 3: Calibrate
- Stage 4: Write + Optimize
- Stage 5: Synthesize — Adversarial Examples
- Stage 6: Screen → Score → Merge
- Stage 7: Diversify + Publish
- The examples.json Schema
- Configuration Reference
- File I/O Reference
- Configuring for a Different Skill
- Incremental Harvesting
- The Oracle Sub-loop (v2 Upgrade)
- Monitoring a Run
- Tips
- Troubleshooting
- See Also
The Problem with Static Examples¶
apo-textgrad works by testing the current prompt against a batch of (input, expected) pairs, counting how many pass, and computing a text gradient from the failures. This gradient drives the next prompt revision. The mechanism is powerful — but it depends entirely on the examples being informative.
When examples are hand-crafted at the time the skill is written:
- The prompt improves — iterations drive the pass rate up toward 100%.
- The examples become trivially easy — the prompt now handles them without effort.
PASS_RATEsaturates → the loop emitsCONVERGEDand stops.- The prompt still fails on harder real inputs that were never in the corpus — but the optimizer doesn't know this.
The result is a prompt optimized for the hand-crafted corpus, not for the full distribution of real invocations. The corpus is also frozen in time: as the skill's conventions evolve, examples reference old file paths, old issue formats, old tool patterns. The optimizer silently reinforces stale behavior.
examples-miner solves both problems. It sources examples from the project's completed issue history — real invocations that real humans accepted. It continuously recalibrates difficulty to keep examples in the informative 40–80% pass-rate band. And after each optimizer run, it synthesizes new adversarial examples that specifically target the current gradient's failure pattern, so the corpus always stays ahead of the prompt's current capability.
What examples-miner Does¶
Two loops coupled together:
┌── OUTER LOOP: examples-miner ───────────────────────────────────────────┐
│ │
│ harvest ──→ judge ──→ calibrate ──→ write_examples │
│ │ │
│ run_optimizer │
│ (inner: apo-textgrad) │
│ │ │ │
│ SUCCESS FAILURE │
│ │ │ │
│ synthesize │ │
│ │ │ │
│ screen_adversarial │ │
│ │ │ │
│ score_adversarial │ │
│ │ │ │
│ merge │ │
│ │ │ │
│ diversify ←──┘ │
│ │ │
│ publish ──→ done │
└──────────────────────────────────────────────────────────────────────────┘
The outer loop mines, calibrates, and publishes the corpus. The inner loop (apo-textgrad) runs as a child FSM via sub-loop chaining (context_passthrough: true), inheriting the outer loop's prompt_file and examples_file. After the inner loop completes, the outer loop reads its gradient signal from ${captured.run_optimizer.gradient.output} and uses it to synthesize adversarial examples that target the exact failure pattern the optimizer found.
When to Use¶
| Situation | Use examples-miner? |
|---|---|
apo-textgrad PASS_RATE plateaus ≥90% but prompt still fails on real invocations |
Yes — corpus is stale or too easy |
examples.json was hand-crafted months ago and the skill has evolved |
Yes — conventions have drifted |
No examples.json exists yet and you want one from real project history |
Yes — first-run builds from scratch |
| You have a brand-new skill with no completed issues yet | No — harvest will be empty; write initial examples manually, then use the miner |
You want to run apo-textgrad against a specific curated test set |
No — manage that corpus manually |
Quick Start¶
# 1. Check that session data exists for the skill
ll-messages --skill capture-issue --examples-format --stdout | head -3
# 2. First run: mine all history, optimize the prompt, publish examples.json
ll-loop run examples-miner \
--context skill_name=capture-issue \
--context prompt_file=skills/capture-issue/SKILL.md
# 3. Verify the output
python3 -c "
import json
with open('examples.json') as f:
data = json.load(f)
print(f'{len(data)} examples')
print('Sources:', set(e.get(\"source\") for e in data))
print('Difficulty range:', min(e.get(\"difficulty_score\", 0) for e in data),
'–', max(e.get(\"difficulty_score\", 0) for e in data))
"
On the first run, no corpus.last_harvested sentinel exists so all session history is harvested. Subsequent runs are incremental — only sessions newer than the sentinel are re-processed.
How the Pipeline Works¶
Stage 1: Harvest¶
harvest (shell, 120s)
runs: ll-messages --skill <skill_name> --examples-format --context-window 3 [--since <sentinel>] --stdout
produces: harvested_examples (JSON lines)
The harvest state runs ll-messages with --examples-format to extract (input, expected) pairs from Claude Code session logs. Each record is a JSON object on its own line:
{
"type": "example",
"skill": "capture-issue",
"input": "<N preceding user messages concatenated>",
"output": "{\"tools_used\": [\"Read\", \"Write\"], \"files_modified\": [\".issues/features/P3-FEAT-849-...\"], \"completion_status\": \"success\"}",
"session_id": "361c2c3a-bd3c-417b-9d69-cfd541e136fc",
"timestamp": "2026-03-21T22:21:36",
"context_window": 3
}
ResponseMetadata — A JSON object automatically captured by little-loops at issue completion. Key fields:
issue_id(string),skill(string),duration_ms(number),tool_calls(array),exit_status(string). You don't create this manually — the harness captures it automatically.
Important: the output field is not free text. It is a JSON-serialized ResponseMetadata object recording what tools the agent used and what files it changed — not the raw assistant response. The oracle judge evaluates tool choices and file changes, not prose quality.
On the first run, no sentinel file exists and all sessions are harvested. On subsequent runs, --since $(cat corpus.last_harvested) limits the query to sessions added after the last publish.
Stage 2: Judge — Three-Layer Quality Stack¶
judge (prompt, 300s)
reads: harvested_examples
produces: judge_scores (JSON array with per-example metadata)
Every harvested candidate passes through three sequential quality layers. Failing any layer discards the candidate.
Layer 1 — Code persistence (objective, uses Bash tool):
For each candidate, the judge checks files_modified from the output JSON against the current repository:
persistence_score= fraction offiles_modifiedstill tracked in HEADpersistence_age= average commit count for still-present files
Candidates where persistence_score = 0 (all modified files are absent from HEAD) are discarded immediately. If the agent created a file that was later deleted, the output was likely incorrect or superseded — not worth training on.
Layer 2 — Revision distance (heuristic, uses Bash tool):
For each surviving candidate, the judge estimates how much the linked issue was revised after the invocation:
- 1–2 session log entries in
## Session Log→revision_distance ≈ 0.1(output accepted quickly) - 3–4 entries →
revision_distance ≈ 0.4 - 5+ entries →
revision_distance ≈ 0.7(heavily revised before acceptance)
Low distance signals the output was accepted nearly as-is — high training utility. High distance means the output needed substantial correction — lower confidence.
Layer 3 — Oracle rubric (LLM scoring):
For candidates that clear layers 1–2, the judge scores the output JSON against skill-specific criteria (0–100 pts):
tools_usedquality: were appropriate tools used for this skill? (0–33 pts)files_modifiedquality: are the file paths realistic and relevant? (0–33 pts)completion_statusvalidity: is"success"the correct status for an accepted output? (0–34 pts)
Deduplication: when multiple candidates exist for the same issue (same invocation context, multiple refinement passes), only the highest-ranked is kept. The ranking metric is:
The surviving candidates are emitted as a JSON array with all computed metadata fields.
Stage 3: Calibrate¶
calibrate (prompt, 120s)
reads: judge_scores, corpus_state_file (optional)
produces: calibrated_corpus (JSON array, overwrites on re-runs)
The calibrate state assigns a difficulty_score (0–100) to each surviving candidate and applies the difficulty band filter:
| Score range | Classification | Action |
|---|---|---|
| 0–39 | Trivially easy | Excluded — won't produce gradient signal |
| 40–80 | Informatively challenging | Included |
| 81–100 | Noise-level hard | Excluded — will always fail, wasting LLM calls |
This range targets non-trivial but learnable examples — too easy (< 40) provides no training signal; too hard (> 80) causes instability.
If corpus_state_file (default: corpus.json) exists, calibrate loads it, decays each existing entry's freshness_weight by ×0.9, and merges with today's included set (deduplicating by input text). This preserves the accumulated corpus across runs while progressively down-weighting stale examples.
The output is a JSON array capturing all metadata. The final line is a sentinel: CALIBRATED_COUNT=N.
Stage 4: Write + Optimize¶
write_examples (prompt, 60s)
reads: calibrated_corpus
writes: examples_file to disk
produces: write_examples_result
run_optimizer (sub-loop: apo-textgrad, context_passthrough: true)
reads: examples_file (from disk), prompt_file (from disk)
on_success → synthesize
on_failure → diversify
apo-textgrad reads examples_file from disk — it cannot receive the corpus via FSM captures. The write_examples state handles this by writing the calibrated corpus to examples_file before the optimizer starts. This is the intermediate write; publish performs the final write at the end.
run_optimizer invokes apo-textgrad as a child FSM. With context_passthrough: true, the child inherits the parent's prompt_file, examples_file, and other context variables. The child runs to completion independently (up to its own max_iterations: 20), then routes the parent:
- SUCCESS (child reached its
doneterminal state): gradient signal is available in${captured.run_optimizer.gradient.output}→ proceed tosynthesize - FAILURE (child hit
max_iterationsor timed out): no gradient available → skip adversarial synthesis, go directly todiversify
Stage 5: Synthesize — Adversarial Examples¶
synthesize (prompt, 300s)
reads: run_optimizer.gradient.output, calibrated_corpus
produces: adversarial_candidates (JSON array)
If the optimizer produced a gradient (not CONVERGED), the synthesize state uses it to generate adversarial examples. The gradient includes:
FAILURE_PATTERN: <common theme across all failures>
ROOT_CAUSE: <what is wrong in the current prompt>
GRADIENT: <precise instruction for how to change the prompt>
The FAILURE_PATTERN is matched to one of five perturbation types:
| Perturbation type | What it does | When to use |
|---|---|---|
complexity_injection |
Adds a second symptom that may or may not belong in the same issue | Prompt fails to exercise scope boundary judgment |
ambiguity_injection |
Strips specific file/function names, forcing discovery rather than copying | Prompt over-fits to copying literal references |
domain_shift |
Reproduces the same failure pattern in a different subsystem | Prompt passes in training domain but fails in others |
priority_boundary |
Edge case sitting between two adjacent priority levels | Prompt misclassifies borderline priority cases |
type_confusion |
Description that looks like FEAT but is BUG (or vice versa) | Prompt mis-classifies issue types |
Seeds are selected as the up to 5 examples nearest the 80% difficulty boundary (highest difficulty_score ≤ 80) — the region with the highest information density for perturbation.
Critical constraint: only the seed's INPUT is perturbed. The original expected output is never reused. A fresh expected output is generated for each perturbed input — this prevents ground-truth corruption where the expected output no longer matches the perturbed input.
Stage 6: Screen → Score → Merge¶
screen_adversarial (prompt, 120s)
reads: adversarial_candidates
produces: screened_adversarial
score_adversarial (prompt, 300s)
reads: screened_adversarial, calibrated_corpus
produces: validated_adversarial
merge (prompt, 60s)
reads: calibrated_corpus, validated_adversarial
produces: calibrated_corpus (overwritten — now contains both harvested + adversarial)
Screen: mechanical quality checks. Discards candidates with missing required fields, non-JSON expected values, empty inputs, or perturbation_type values outside the five-type taxonomy. File path plausibility is a soft check (flags but does not discard).
Score: oracle rubric applied to screened candidates: - Perturbation coherence: is the perturbation type correctly applied? (0–34 pts) - Expected output quality: does the fresh expected output correctly reflect what the skill should produce? (0–33 pts) - Adversarial value: does this example target a real weakness from the failure cluster? (0–33 pts)
Only candidates in the 40–80 difficulty band are accepted. The adversarial cap is enforced here: source: adversarial examples must stay ≤ 30% of the final corpus size. If accepted candidates would exceed the cap, the lowest-scoring ones are trimmed first.
Merge: concatenates the calibrated harvested corpus with the validated adversarial examples. Overwrites calibrated_corpus so the rest of the pipeline sees the combined set. Each example retains its source field ("harvested" or "adversarial") as provenance.
Stage 7: Diversify + Publish¶
diversify (prompt, 120s)
reads: calibrated_corpus (merged)
produces: calibrated_corpus (overwritten — final diversified version)
publish (prompt, 60s)
reads: calibrated_corpus (final)
writes: examples_file, corpus.last_harvested
produces: publish_result
Diversify: enforces per-axis coverage minimums when the corpus has ≥ 10 examples:
- At least 2 examples per represented issue type (BUG / FEAT / ENH / EPIC)
- At least 1 example per represented priority band (P0–P5)
- Adversarial examples ≤ 30% of corpus (trim lowest oracle_score first)
Sub-sampling removes the easiest examples (lowest difficulty_score) from over-represented groups. If coverage targets are already met, the corpus passes through unchanged. The final line is a sentinel: FINAL_COUNT=N.
Publish: writes the final corpus to examples_file (overwriting the intermediate write from write_examples) and updates the harvest sentinel:
The sentinel records the UTC timestamp of this publish. The next run's harvest state reads it to limit ll-messages to new sessions only.
The examples.json Schema¶
Each object in the published examples.json array has the following fields:
{
"input": "User message history preceding the skill invocation (N messages concatenated)",
"expected": "{\"tools_used\": [\"Read\", \"Write\"], \"files_modified\": [\".issues/features/P3-FEAT-849-...\"], \"completion_status\": \"success\"}",
"source": "harvested",
"difficulty_score": 62,
"failure_cluster": null,
"freshness_weight": 0.9,
"persistence_score": 1.0,
"persistence_age": 14,
"revision_distance": 0.1,
"oracle_score": 87
}
| Field | Type | Description |
|---|---|---|
input * |
string | Concatenated preceding user messages (context window) |
expected * |
string | JSON-serialized ResponseMetadata — NOT free text |
source |
"harvested" | "adversarial" |
Provenance |
difficulty_score |
0–100 | Estimated pass-rate difficulty; 40–80 = active band |
failure_cluster |
string | null | FAILURE_PATTERN tag for adversarial examples; null for harvested |
freshness_weight |
0.0–1.0 | Decays ×0.9 per run; new examples start at 1.0 |
perturbation_type |
string | absent | Adversarial only: which of the five types was applied |
seed_id |
string | absent | Adversarial only: session_id of the source example |
persistence_score |
0.0–1.0 | Harvested: fraction of files_modified still in HEAD |
persistence_age |
integer | Harvested: average commit count for still-present files |
revision_distance |
0.0–1.0 | Harvested: how much the linked issue was revised post-invocation |
oracle_score |
0–100 | Quality gate score from the oracle rubric |
Fields marked * are required. All others are metadata used by the miner.
apo-textgrad reads only input and expected — the other fields are miner bookkeeping and are preserved across re-runs.
Configuration Reference¶
Set context variables with --context key=value flags or by editing the loop's context: block after ll-loop install.
| Variable | Default | Notes |
|---|---|---|
skill_name |
capture-issue |
Controls ll-messages --skill filter; must match a real skill name exactly |
prompt_file |
system.md |
Path to the prompt being optimized; passed to the inner apo-textgrad loop |
examples_file |
examples.json |
Written twice per run: intermediate (before optimizer) and final (at publish) |
corpus_state_file |
corpus.json |
If this file exists, calibrate loads it and decays freshness_weight ×0.9 |
target_pass_rate |
0.6 |
Center of the 40–80% difficulty band (fraction, 0–1); used only in the calibrate prompt |
File I/O Reference¶
| File | Read by | Written by | Purpose |
|---|---|---|---|
corpus.last_harvested |
harvest (incremental --since flag) |
publish (UTC timestamp) |
Incremental harvest sentinel |
corpus.json (or corpus_state_file) |
calibrate (Read tool, optional) |
Not written by the miner | Persisted calibration state for freshness decay |
examples.json (or examples_file) |
run_optimizer inner loop |
write_examples (intermediate), publish (final) |
The training corpus for apo-textgrad |
.issues/completed/*.md |
judge (session log entry count via Bash) |
Never | Source of revision distance heuristic |
Session JSONL files in ~/.claude/projects/ |
ll-messages in harvest |
Never | Source of raw harvested candidates |
Configuring for a Different Skill¶
The default skill_name: capture-issue is just a starting point. The miner works for any skill that has completed issues with session logs.
# Example: mine refine-issue sessions
ll-loop run examples-miner \
--context skill_name=refine-issue \
--context prompt_file=commands/refine-issue.md \
--context examples_file=refine-examples.json \
--context corpus_state_file=refine-corpus.json
# Verify
python3 -c "import json; d=json.load(open('refine-examples.json')); print(len(d), 'examples')"
Steps for a new skill:
- Check harvest availability:
ll-messages --skill <name> --examples-format --stdout | wc -l— if 0, the skill has no eligible session history yet - Set
skill_nameto the skill's exact name (e.g.,ready-issue,manage-issue) - Set
prompt_fileto the skill's SKILL.md or equivalent prompt file - Use a separate
examples_fileper skill to avoid cross-contaminating corpora - Use a separate
corpus_state_fileper skill for independent freshness tracking
On the inline oracle: the judge's oracle rubric is generic — it scores tools_used, files_modified, and completion_status without knowing what the "correct" tools or files are for a given skill. For skill-specific precision (e.g., knowing that refine-issue must always modify an issue file), see the Oracle Sub-loop section below.
Incremental Harvesting¶
The sentinel file corpus.last_harvested is the key to efficient re-runs:
First run:
corpus.last_harvested not present
→ harvest state: SINCE_ARG=""
→ ll-messages harvests ALL session history
→ publish writes: 2026-03-21T22:00:00Z > corpus.last_harvested
Second run (next day):
corpus.last_harvested = "2026-03-21T22:00:00Z"
→ harvest state: SINCE_ARG="--since 2026-03-21T22:00:00Z"
→ ll-messages harvests only sessions from the past day
→ publish updates sentinel to current time
Important: the sentinel is written only by publish. If the loop terminates early (e.g., judge is blocked, or the loop times out), the sentinel is NOT updated. The next run will re-harvest the same window — no sessions are skipped.
Forcing a full reharvest (e.g., after a major skill refactor):
Resetting the corpus (start fresh, ignore freshness decay):
rm corpus.last_harvested corpus.json examples.json
ll-loop run examples-miner --context skill_name=<name> ...
The Oracle Sub-loop (v2 Upgrade)¶
This upgrade is optional for most projects. Adopt it when your quality rubric needs to vary per-skill rather than using a global judge.
Advanced — The v2 oracle pattern requires familiarity with nested FSM loops. Complete the rest of this guide before attempting this upgrade.
The built-in examples-miner.yaml uses inline LLM judging in the judge and score_adversarial states. This is a generic rubric that works across all skills but doesn't know skill-specific invariants.
For production use on a specific skill, upgrade to a dedicated oracle sub-loop:
1. Install the loop¶
2. Create the oracle YAML¶
Copy the reference implementation and adapt for your skill:
mkdir -p .loops/oracles
cp loops/oracles/oracle-capture-issue.yaml .loops/oracles/oracle-<skill>.yaml
Edit .loops/oracles/oracle-<skill>.yaml:
- Update skill_name in the context block
- Customize check_mechanical's Python check for the skill's files_modified invariants (e.g., refine-issue should always modify an issue file in .issues/)
- Update the score_semantic prompt for the skill's specific tool expectations
3. Wire the oracle into judge¶
In .loops/examples-miner.yaml, replace the judge state's action_type: prompt with a sub-loop delegation:
judge:
loop: oracles/oracle-<skill>
context_passthrough: true
on_success: calibrate
on_failure: done
Note: The
loop:field does not support context interpolation —loop: oracles/oracle-${context.skill_name}does NOT work. Hardcode the oracle path.
4. Calibrate the oracle¶
Use ensemble agreement bootstrapping to validate the oracle before deploying:
- Run the oracle at multiple temperatures against the full candidate pool
- Examples where all temperature variants agree strongly → use as fixture set
- Validate against deliberately degraded candidates (strip required sections, corrupt file paths in known-good examples) — the oracle must score these low; if it doesn't, it is miscalibrated
- If oracle scores on the fixture set drift significantly across consecutive runs, halt and re-calibrate rather than corrupting the corpus
Monitoring a Run¶
Use --verbose to see candidate counts at each state:
ll-loop run examples-miner --verbose \
--context skill_name=capture-issue \
--context prompt_file=skills/capture-issue/SKILL.md
The states correspond to the pipeline stages described in How the Pipeline Works above. What to watch for at each state:
| State | What to look for |
|---|---|
harvest |
Line count of JSON output = raw candidate count; 0 lines means no matching sessions |
judge |
Length of JSON array in output = survivors after quality gating; watch for discards |
calibrate |
CALIBRATED_COUNT=N on last output line; N=0 means all candidates outside band |
write_examples |
WRITTEN=N; confirms intermediate corpus write succeeded |
run_optimizer |
Sub-loop states shown with indent; watch for CONVERGED vs exhausting iterations |
synthesize |
Length of JSON array = adversarial candidates generated; [] means no gradient / optimizer converged |
screen_adversarial |
SCREENED_COUNT=N |
score_adversarial |
ADVERSARIAL_ACCEPTED=N; also watch adversarial cap enforcement in output |
diversify |
FINAL_COUNT=N |
publish |
PUBLISHED=N = final corpus size; confirms sentinel updated |
The loop exits at done (terminal). If it exits early via on_blocked: done on judge or calibrate, the corpus was not published and the sentinel was not updated.
Tips¶
- Start with
--verboseon the first run to see candidate counts at each stage and diagnose empty-corpus issues early. - Verify harvest first: run
ll-messages --skill <name> --examples-format --stdout | wc -lstandalone before running the full loop — it confirms the session data exists and the skill name is correct. - Separate
examples_fileper skill: don't shareexamples.jsonacross skills; each skill's corpus should be isolated for clean gradient signals. - Delete
corpus.last_harvestedafter major refactors: when skill conventions change significantly (new file layout, renamed commands), force a full reharvest to pick up all historical examples under the new conventions. - Use
corpus_state_filefor long-running projects: it preservesfreshness_weightdecay across runs and prevents corpus churn when recent sessions produce no new candidates. - The
synthesize → []result is the happy path: if the optimizer converged andFAILURE_PATTERNis absent,synthesizecorrectly outputs[]. No adversarial examples means the corpus is already doing its job. - Install before customizing:
ll-loop install examples-minercopies the YAML to.loops/so you can tune timeouts, adjust the difficulty band, or wire a skill-specific oracle without affecting the built-in. - The intermediate
write_examplesmatters: if this state is blocked,run_optimizerwill read whateverexamples.jsonwas on disk before the run — potentially stale. Checkwrite_examplesoutput in--verbosemode if the optimizer behaves unexpectedly.
Troubleshooting¶
| Symptom | Cause | Fix |
|---|---|---|
harvest produces 0 lines |
No sessions match --skill filter; or sentinel timestamp is in the future |
Run ll-messages --skill <name> --stdout standalone to verify; delete corpus.last_harvested to reset the window |
judge outputs empty array [] |
All candidates discarded: files absent from HEAD, or oracle_score too low | Use --verbose to see which layer is discarding; check git log for files_modified paths; the skill may not modify trackable files |
calibrate outputs CALIBRATED_COUNT=0 |
All judged candidates outside 40–80 difficulty band | Corpus may be all easy (prompt too good for history) or all hard (few quality examples); delete corpus.json to clear stale accumulated state |
Loop skips to diversify after run_optimizer |
Inner apo-textgrad hit max_iterations or timed out |
Expected behavior on timeout; increase outer loop timeout: (default 7200s) or override inner loop via context; check --verbose for inner loop state |
synthesize outputs [] |
Gradient is CONVERGED or FAILURE_PATTERN absent |
This is the happy path — the optimizer converged on the current corpus; no adversarial synthesis is needed |
ADVERSARIAL_ACCEPTED=0 |
All adversarial candidates outside 40–80 band, below oracle threshold, or cap already full | Check synthesize output in --verbose; the perturbation type may not match the actual failure pattern; or harvested corpus is already at/near the 30% adversarial cap |
| Final corpus is empty | All paths produced empty arrays | Run with --verbose and trace which state first produced []; the most common cause is an empty harvest (no sessions) flowing through to an empty judge and calibrate |
publish writes the intermediate corpus |
diversify was blocked or skipped; publish reads whatever calibrated_corpus last held |
Check diversify in --verbose for blocking; increase the diversify timeout |
Sentinel corpus.last_harvested has wrong date |
System clock skew or a previous run wrote a future timestamp | cat corpus.last_harvested to inspect; delete and let the next run rebuild from scratch |
See Also¶
- LOOPS_GUIDE.md — quick-reference section for
examples-miner: context variables table, FSM flow diagram, perturbation taxonomy, basic invocations scripts/little_loops/loops/examples-miner.yaml— full annotated loop source (12 states with complete action prompts)scripts/little_loops/loops/oracles/oracle-capture-issue.yaml— reference implementation for the v2 oracle sub-loop (two-phase: shell mechanical checks + LLM semantic scoring)scripts/little_loops/loops/apo-textgrad.yaml— inner optimizer loop invoked byrun_optimizer; readsexamples_file, emitsFAILURE_PATTERN/ROOT_CAUSE/GRADIENT- Automatic Harnessing Guide — related guide for wrapping skills in layered quality evaluation pipelines
- FSM Loop Architecture — sub-loop chaining (
context_passthrough), state field reference, evaluator catalog