Automatic Harnessing Guide¶

A harness loop wraps your skill in a multi-stage quality pipeline — automatically retrying until the output passes your quality bar.

The hard problem in automated iteration isn't running the skill — it's knowing when the output is actually good. A harness loop is a quality evaluation pipeline that applies a skill or prompt to work items, then evaluates the result from multiple angles before advancing: mechanical tests catch regressions, LLM judgment assesses semantic quality, user-simulation skills verify the experience as a real user would, and diff invariants catch runaway changes. The wizard auto-derives this evaluation framework from your project config so you don't write it by hand.

Table of Contents¶

What Is a Harness Loop?
The Evaluation Pipeline
Evaluation Phases Explained
Tool-Based Gates (check_concrete)
MCP Tool Gates (check_mcp)
Skill-as-Judge (check_skill)
LLM-as-Judge (check_semantic)
Diff Invariants (check_invariants)
Stall Detection (check_stall)
When to Use a Harness
Creating a Harness: The 4-Step Wizard
Step H1: Choose a Target
Step H2: Work Item Discovery
Step H3: Evaluation Phases
Step H4: Iteration Budget
Generated FSM Structure
Variant A: Single-Shot
Variant B: Multi-Item
Using the Example Files
Worked Example: Harness refine-issue
Tips
Troubleshooting
See Also

What Is a Harness Loop?¶

A harness loop is a pre-structured FSM pattern that repeatedly applies a skill or prompt to a list of work items (or once in single-shot mode), evaluating success after each run through a layered quality pipeline.

The Evaluation Pipeline¶

Each harness applies up to five evaluation phases in sequence, cheapest first:

Phase	What it checks
`check_concrete`	Exit code from test/lint/type command — objective, fast
`check_mcp`	MCP server tool call — deterministic external state
`check_skill`	Full agentic user simulation — did it work as a real user would?
`check_semantic`	LLM judges output quality — semantic correctness
`check_invariants`	Diff line count — catches runaway changes

Each phase is optional; the wizard pre-selects based on your project config. All five can be active simultaneously, or you can use any subset.

Conceptual cycle:

            ┌──────────────────────────────────────────────────┐
            │                                                  │
            ▼                                                  │
       ┌─────────┐     items      ┌─────────┐                 │
       │discover │───remaining───►│ execute │                 │
       └─────────┘                └────┬────┘                 │
            │                         │                       │
         no items                   next                      │
            │                         ▼               on_no (retry)
            ▼                  ┌──────────────┐               │
          done ◄── terminal    │check_concrete│───────────────┤
                               └──────┬───────┘               │
                                   on_yes                      │
                                      ▼                        │
                               ┌──────────────┐               │
                               │  check_mcp   │───────────────┤
                               └──────┬───────┘               │
                                   on_yes                      │
                                      ▼                        │
                               ┌──────────────┐               │
                               │ check_skill  │───────────────┤
                               └──────┬───────┘               │
                                   on_yes                      │
                                      ▼                        │
                               ┌──────────────┐               │
                               │check_semantic│───────────────┤
                               └──────┬───────┘               │
                                   on_yes                      │
                                      ▼                        │
                               ┌──────────────────┐           │
                               │check_invariants  │───────────┘
                               └──────┬───────────┘
                                   on_yes
                                      ▼
                                  ┌─────────┐
                                  │ advance │──► discover
                                  └─────────┘

Evaluation Phases Explained¶

Tool-Based Gates (`check_concrete`)¶

Runs the highest-priority configured tool command from ll-config.json as a shell action with an exit_code evaluator. Exit code 0 = pass, non-zero = fail (retry execute).

This phase provides fast, objective feedback. It runs before the LLM judge, so failures are caught cheaply.
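The exit-code rule above can be sketched in a few lines of Python. This is an illustration of the pass/fail mapping only, not the harness's own evaluator; the function name `exit_code_verdict` is hypothetical.

```python
import subprocess
import sys


def exit_code_verdict(command: list) -> str:
    """Run a command and map its exit code to a yes/no verdict."""
    completed = subprocess.run(command, capture_output=True, text=True)
    # Exit code 0 passes the gate; any non-zero code means "retry execute".
    return "yes" if completed.returncode == 0 else "no"


# Stand-ins for a passing and a failing test command.
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(3)"]
print(exit_code_verdict(passing), exit_code_verdict(failing))
```

In a real loop the command would be the configured test, lint, or type command rather than these stand-ins.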

MCP Tool Gates (`check_mcp`)¶

action_type: mcp_tool invokes an MCP server tool directly — not via Claude — yielding deterministic output at ~500ms latency. The mcp_result evaluator routes on the MCP response envelope rather than an exit code or LLM judgment. This makes it a good fit for verifying external state that the other evaluation phases cannot observe.

mcp_result verdict table:

Verdict	Meaning
`success`	Tool ran and succeeded (`isError: false`)
`tool_error`	Tool ran but reported failure (`isError: true`)
`not_found`	Server or tool not registered in `.mcp.json`
`timeout`	Transport-level timeout

Generic pattern (check_mcp is a naming convention, not a reserved name):

check_mcp:
  action_type: mcp_tool
  action: "server/tool-name"              # server_name/tool_name from .mcp.json
  params:
    key: "${captured.current_item.output}"  # ${variable} interpolation supported
  capture: mcp_result
  route:
    success: check_invariants    # or next evaluation state
    tool_error: execute          # retry the execute state
    not_found: check_invariants  # server not configured — skip this gate
    timeout: execute
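As a rough mental model, the verdict table and the route block above behave like a classification step followed by a lookup. The sketch below assumes the envelope is a dict carrying an `isError` flag, as the verdict table suggests; the `registered` and `timed_out` parameters are hypothetical stand-ins for conditions the real runner detects itself.

```python
from typing import Optional

# Mirrors the route block in the generic pattern above.
ROUTES = {
    "success": "check_invariants",
    "tool_error": "execute",
    "not_found": "check_invariants",
    "timeout": "execute",
}


def mcp_verdict(envelope: Optional[dict], *, registered: bool = True,
                timed_out: bool = False) -> str:
    """Classify an MCP call outcome into one of the four verdicts."""
    if not registered:
        return "not_found"      # server or tool missing from .mcp.json
    if timed_out or envelope is None:
        return "timeout"        # no response arrived in time
    # The tool ran; isError distinguishes success from a reported failure.
    return "tool_error" if envelope.get("isError", False) else "success"


print(ROUTES[mcp_verdict({"isError": False})])
```

Note that `not_found` routes forward rather than back to execute, since retrying the skill cannot register a missing server.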

Example: Browser UI verification (one application among many)

A harness that implements a UI feature can use a playwright MCP server to check that the rendered page reflects the change before advancing:

check_mcp:
  action_type: mcp_tool
  action: "playwright/screenshot"
  params:
    url: "http://localhost:3000"
  capture: ui_result
  route:
    success: check_invariants
    tool_error: execute
    not_found: check_invariants  # playwright not configured — skip
    timeout: execute             # dev server may not be up yet

Other MCP gate applications:

- database/query: verify a record was written
- github/list_pull_requests: confirm a PR was created
- slack/get_messages: check a notification was sent
- filesystem/read_file: verify a file was created at the expected path

Placement: check_mcp slots after check_concrete (cheap shell gates first) and before check_skill, check_semantic, and check_invariants. If the MCP call is expensive or optional, moving it later in the chain (just before check_invariants) avoids wasted cost on items that fail earlier checks.

Skill-as-Judge (`check_skill`)¶

check_skill is the highest-fidelity evaluation mode in the pipeline: it invokes a skill whose job is to use the feature as a real user would, then judges whether the user experience actually worked. This is the only phase that evaluates from the perspective that actually matters — a real user completing a real workflow. Browser navigation, form submission, multi-step UX flows, or any end-to-end user simulation all belong here.

The skill runs as a full agentic Claude session and produces natural-language output; an llm_structured evaluator parses its verdict (YES/NO with rationale) and routes accordingly.

How it differs from check_mcp:

	`check_mcp`	`check_skill`
Execution	Single deterministic tool call	Full agentic Claude session
Latency	~500ms	30–300s
Output	Structured MCP envelope	Natural-language rationale
Best for	Verifying discrete external state	Exercising complex user flows

YAML pattern:

check_skill:
  action: "/ll:act-as-user 'Navigate to /dashboard and verify the new filter works'"
  action_type: slash_command
  timeout: 300
  evaluate:
    type: llm_structured
    prompt: >
      Did the skill successfully complete the user flow without errors?
      Did it confirm the expected feature is present and working?
      Answer YES or NO with what it observed.
  on_yes: check_invariants
  on_no: execute

action_type values for skill invocations:

`action_type`	How it runs	When to use
`slash_command`	Executes the action string as a named slash command directly	Use when the action is a fixed `/ll:<name>` slash command
`prompt`	Sends the action string as a free-form instruction to Claude	Use for natural-language prompts, or when the skill name is dynamic or constructed at runtime

For skills invoked as free-form prompts (no fixed slash command), use action_type: prompt:

check_skill:
  action: "Use the scrape-docs skill to fetch /api/users and confirm the new 'role' field appears in the response"
  action_type: prompt
  timeout: 180
  evaluate:
    type: llm_structured
    prompt: >
      Did the skill confirm the 'role' field is present in the API response?
      Answer YES or NO with what it observed.
  on_yes: check_invariants
  on_no: execute

Placement: check_skill slots after check_concrete and check_mcp (cheap/deterministic gates first) and before check_semantic / check_invariants. When check_skill covers quality assessment, check_semantic can be omitted — the skill already provides semantic judgment from a user perspective.

Cost consideration: check_skill runs a full agentic session (30–300s, proportional cost). Use it when a skill can verify something the other phases cannot observe — actual rendered UI, end-to-end user flow, or external system state that deterministic checks can't reach.

LLM-as-Judge (`check_semantic`)¶

Uses an llm_structured evaluator in which Claude assesses whether the previous action achieved its intent. The wizard collects two criteria from the user, what should change on success and what indicates failure, and generates a numbered multi-criteria evaluation prompt:

evaluate:
  type: llm_structured
  prompt: >
    Evaluate the previous action on these criteria:
    1. [success criterion: what should be different after the skill runs successfully]
    2. Absence of failure signals: [failure criterion: what would indicate the skill failed]
    Answer YES only if all criteria pass. Otherwise NO, stating which criterion failed.

The wizard asks two follow-up questions when LLM-as-judge is selected: "What should be different in the output after the skill runs successfully?" and "What would indicate the skill failed or made no progress?" The answers populate criteria 1 and 2 respectively. For custom prompts, the same two-question format applies.

Why echo as the action? The generated check_semantic state runs a shell echo, and the evaluator receives that echo string as <action_output> in the LLM prompt, so a bare echo gives the judge minimal evidence. To evaluate a prior state's output, set source: "${captured.<var>.output}" on the evaluate block, where <var> is the capture key on the source state. Note: ${prev.output} at check_semantic resolves to check_concrete's output (pytest results), not execute's skill output; use the capture + source pattern instead (see production examples in loops/issue-staleness-review.yaml:36-47).
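To make the template concrete, here is a minimal sketch of how two answers could be slotted into the numbered prompt shown in this section. The function name `build_semantic_prompt` is hypothetical; the wizard's real implementation is not shown in this guide.

```python
def build_semantic_prompt(success_criterion: str, failure_criterion: str) -> str:
    """Assemble the two-criterion judge prompt from the wizard's two answers."""
    lines = [
        "Evaluate the previous action on these criteria:",
        f"1. {success_criterion}",
        f"2. Absence of failure signals: {failure_criterion}",
        "Answer YES only if all criteria pass. "
        "Otherwise NO, stating which criterion failed.",
    ]
    return "\n".join(lines)


print(build_semantic_prompt(
    "No lint or type errors remain in the modified files",
    "no error output, no unresolved violations reported",
))
```

Keeping each criterion on its own numbered line lets the judge name the specific criterion that failed.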

Diff Invariants (`check_invariants`)¶

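Runs git diff --stat HEAD | wc -l | tr -d ' ' and checks that the resulting line count is less than 50 using an output_numeric evaluator. Because git diff --stat prints one line per changed file plus a summary line, this count tracks how many files changed, not how many lines of code were edited. It catches runaway changes: if a skill touches far more than expected, the loop retries rather than advancing.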

Adjust the target value for skills that intentionally make large changes.
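The numeric comparison can be pictured as follows. This sketch assumes the evaluator parses the command output as a number and compares it to target; only the lt operator appears in this guide, so the other operator names and the "error" verdict for unparseable output are assumptions.

```python
import operator

# Only "lt" is shown in this guide; the remaining keys are assumed for illustration.
COMPARATORS = {
    "lt": operator.lt,
    "le": operator.le,
    "gt": operator.gt,
    "ge": operator.ge,
    "eq": operator.eq,
}


def output_numeric_verdict(output: str, op: str, target: float) -> str:
    """Compare numeric command output against a target value."""
    try:
        value = float(output.strip())
    except ValueError:
        return "error"  # assumed behaviour when the output is not a number
    return "yes" if COMPARATORS[op](value, target) else "no"


# "12\n" is what `wc -l | tr -d ' '` might print for a small change set.
print(output_numeric_verdict("12\n", "lt", 50))
```

With `operator: lt` and `target: 50`, an output of exactly 50 fails the gate, because the comparison is strict.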

Referencing Captured Outputs¶

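Use ${captured.<capture_key>.output} to pass output from one state to a later state, where <capture_key> is the value of the capture: field on the state that produced the output. For example, capture: "current_item" on discover is read back as ${captured.current_item.output}. The line below assumes the execute state sets capture: "execute":

prompt: "Review this output: ${captured.execute.output}"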

Use ${prev.output} to reference the immediately preceding state's output.
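The substitution itself is plain string interpolation keyed by capture name (as in `capture: "current_item"` in the examples in this guide). The sketch below shows the idea with a regular expression; it is not the loop runner's implementation, and the issue ID `BUG-123` is a made-up value.

```python
import re

# Matches ${captured.<key>.output} and captures <key>.
_PLACEHOLDER = re.compile(r"\$\{captured\.([A-Za-z_][A-Za-z0-9_]*)\.output\}")


def interpolate(template: str, captured: dict) -> str:
    """Replace each ${captured.<key>.output} placeholder with its stored output."""
    def replace(match):
        key = match.group(1)
        if key not in captured:
            raise KeyError(f"no captured output named {key!r}")
        return captured[key]

    return _PLACEHOLDER.sub(replace, template)


print(interpolate(
    "/ll:refine-issue ${captured.current_item.output} --auto",
    {"current_item": "BUG-123"},
))
```

Raising on an unknown key, rather than leaving the placeholder in place, surfaces a misspelled capture name before the skill runs with a literal `${...}` argument.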

Stall Detection (`check_stall`)¶

Add a check_stall state when a skill might loop without making any code changes. This is especially important for prompt-based skills that sometimes conclude "nothing to do" — without stall detection, they exhaust max_iterations silently.

When to add stall detection:

- The action uses action_type: prompt and may no-op
- You see a harness exhausting max_iterations without git commits
- The skill being harnessed sometimes returns "already done"

Placement: Insert check_stall between execute and the first check state (e.g., check_concrete). In this position, use on_yes: check_concrete (or whichever check state comes first) and on_no: advance (multi-item) or on_no: done (single-shot). Placing it here avoids making LLM-based quality checks on output from a run that has already stalled.

check_stall:
  action: "echo 'checking stall'"     # output ignored by diff_stall
  action_type: shell
  evaluate:
    type: diff_stall
    scope: ["scripts/"]    # optional: limit diff to specific paths
    max_stall: 2           # optional: consecutive no-change iterations before stall
  on_yes: check_concrete   # progress detected — proceed to evaluation chain
  on_no: advance           # stalled — skip item (use on_no: done for single-shot)
  on_error: check_concrete

diff_stall field reference:

Field	Type	Default	Description
`scope`	`list[str]`	(entire repo)	Paths to limit `git diff --stat` to
`max_stall`	`int`	`1`	Consecutive no-change iterations before failure verdict

Verdicts:

Verdict	Meaning
`yes`	Progress detected (diff changed)
`no`	Stalled — no changes for `max_stall` consecutive iterations
`error`	git unavailable or command failed
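One way to picture the diff_stall verdicts is as a counter over successive diff snapshots. The sketch below is an assumption-laden illustration: it assumes the evaluator compares each `git diff --stat` output with the previous one, and that it keeps returning `yes` until `max_stall` consecutive unchanged snapshots have been seen. The class name `StallDetector` is hypothetical.

```python
class StallDetector:
    """Track consecutive iterations in which the diff snapshot did not change."""

    def __init__(self, baseline: str = "", max_stall: int = 1) -> None:
        self.previous = baseline      # diff snapshot from the prior iteration
        self.max_stall = max_stall    # unchanged iterations tolerated before "no"
        self.stalled_runs = 0

    def observe(self, diff_stat: str) -> str:
        if diff_stat != self.previous:
            # The diff changed, so the skill made progress: reset the counter.
            self.previous = diff_stat
            self.stalled_runs = 0
            return "yes"
        self.stalled_runs += 1
        return "no" if self.stalled_runs >= self.max_stall else "yes"


detector = StallDetector(max_stall=2)
print([detector.observe(""), detector.observe("")])
```

Under these assumptions, `max_stall: 2` gives a prompt-based skill one no-op iteration of grace before the item is skipped.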

Full 6-phase ordering (with all phases active):

check_stall      → no-op detection (diff_stall, <1s) — first, before any evaluation cost
check_concrete   → cheapest (exit code, <1s)
check_mcp        → deterministic tool call (~500ms)
check_skill      → agentic user simulation (30–300s)
check_semantic   → LLM text quality judgment (can omit when check_skill covers it)
check_invariants → diff size (cheapest final gate)
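The point of this ordering is short-circuiting: an item that fails a cheap gate never reaches the expensive ones. A minimal sketch, with gates modelled as hypothetical zero-argument callables returning True on pass:

```python
from typing import Callable, List, Optional, Tuple

Gate = Tuple[str, Callable[[], bool]]


def first_failing_gate(gates: List[Gate]) -> Optional[str]:
    """Run gates in order and stop at the first failure.

    Returns the failing gate's name, or None if every gate passed.
    """
    for name, check in gates:
        if not check():
            return name  # later (more expensive) gates are never invoked
    return None


pipeline = [
    ("check_concrete", lambda: True),
    ("check_mcp", lambda: False),    # fails here
    ("check_skill", lambda: True),   # never runs in this example
]
print(first_failing_gate(pipeline))
```

In the generated FSM the same effect comes from each state's on_yes pointing at the next gate, while the failure transitions route elsewhere.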

Decision guide — when to reach for each phase:

Phase	Use when
`check_stall` (diff_stall)	The action is prompt-based and may no-op silently
`check_concrete` (shell)	A CLI tool exit-codes on pass/fail
`check_mcp` (mcp_tool)	An MCP server can deterministically verify the result
`check_skill` (slash_command + llm_structured)	A skill can exercise the feature end-to-end as a user would
`check_semantic` (LLM judge)	You need judgment about output quality
`check_invariants` (diff size)	You want to catch runaway changes

When to Use a Harness¶

Use a harness loop when you want to:

Wrap a skill in quality gates — ensure tests pass and the LLM confirms success before advancing to the next item
Run a skill repeatedly over a list — refining every open issue, checking every file, processing a batch of items in priority order
Set up a single polished iteration — execute a skill once with full evaluation rather than just calling it manually
Run capable skills at scale, unattended — even for skills that rarely fail, a harness lets you process 50 items unattended with the same confidence you'd have watching 1 item manually; the value isn't just catching failures, it's enabling workflows no single agent call could safely accomplish

As models improve, the harness becomes more ambitious, not less necessary — better skills expand the space of what a well-composed evaluation pipeline can accomplish.

Compare to hand-authoring a loop:

Approach	Effort	Evaluators	Stall protection
Harness wizard	~2 min	Auto-derived	Available as add-on
Hand-authored YAML	30–60 min	Manual	Manual

If your workflow is highly custom (e.g., multi-branch routing, complex captured-variable logic), hand-author using the FSM reference. Otherwise, use the harness wizard.

Creating a Harness: The 4-Step Wizard¶

Run /ll:create-loop and select "Harness a skill or prompt" when prompted for loop type.

Step H1: Choose a Target¶

The wizard scans skills/*/SKILL.md and presents every available skill with its description. Pick one, or choose "Custom prompt" to enter free-form natural language.

What do you want to harness?
  ○ refine-issue       — Refine issue files with codebase-driven research
  ○ format-issue       — Format issue files to align with template v2.0
  ○ check-code         — Run code quality checks (lint, format, types, build)
  ○ audit-docs         — Audit documentation for accuracy and completeness
  ○ ...                — (all discovered skills listed)
  ○ Custom prompt      — Enter a free-form natural language prompt to repeat

If you pick Custom prompt, you'll also be asked: "What does 'done' look like?" — this answer drives the LLM-as-judge evaluation prompt.

Step H2: Work Item Discovery¶

How are work items discovered?
  ○ Single-shot (no item iteration)      — Run once; no discover state
  ○ Active issues list (Recommended for issue skills) — ll-issues list --json
  ○ File glob pattern                    — Find files matching a pattern
  ○ Manual list                          — Hard-code items in the loop

If you pick File glob pattern, you'll be prompted for the glob (e.g., .issues/**/*.md). If you pick Manual list, you'll enter comma-separated items.

Discovery commands generated per mode:

Mode	Discovery Command
Active issues list	`ll-issues list --json | python3 -c "..."`
File glob pattern	`find . -name '<pattern>' -not -path './.git/*' | sort | head -1`
Manual list	`python3 -c "items='<item1>,<item2>,...'.split(','); print(items[0])"`

The active issues command filters for status == 'open', prints the first issue ID, and exits 1 when the list is empty. See Variant B below for the full Python snippet.
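The filtering logic of the active issues command can be written as a small testable function. This sketch assumes the JSON is a list of objects with status and id fields, as the snippet in the worked example implies; the issue IDs in the usage lines are made up.

```python
import json
from typing import Optional


def first_open_issue_id(issues_json: str) -> Optional[str]:
    """Return the ID of the first issue whose status is 'open', or None."""
    issues = json.loads(issues_json)
    open_issues = [issue for issue in issues if issue.get("status") == "open"]
    # None corresponds to the shell snippet exiting 1 (no items left).
    return open_issues[0]["id"] if open_issues else None


sample = json.dumps([
    {"id": "ENH-2", "status": "closed"},
    {"id": "BUG-7", "status": "open"},
])
print(first_open_issue_id(sample))
```

In the loop itself the empty case is signalled through the exit code, which the exit_code evaluator on discover turns into the on_no transition to done.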

Step H3: Evaluation Phases¶

The wizard reads .ll/ll-config.json to detect configured tool commands and presents only relevant options. Every available phase except skill-based evaluation is pre-selected, and you can change the selection; stall detection is pre-selected because all H1 choices produce prompt-based execution. (See Evaluation Phases Explained above for what each phase does.)

Which evaluation phases should be included? (multi-select)
  ☑ Tool-based gates (Recommended)                      — Shell checks using test/lint/type commands
  ☑ Stall detection (Recommended for prompt-based skills) — Detects no-op iterations
  ☑ LLM-as-judge                                        — Claude assesses output against skill description
  ☑ Diff invariants                                     — git diff --stat line count < 50
  ○ Skill-based evaluation (Optional)                   — Invoke a skill to exercise and verify the feature as a user would

Note: check_mcp is not offered by the wizard. If your harness requires an MCP tool call for evaluation, add a check_mcp state manually to the generated YAML after wizard completion. See check_mcp in the Evaluation Phases Explained section for the required fields.

Tool-gate priority order (highest-priority configured command wins):

1. test_cmd: most comprehensive
2. lint_cmd: fast feedback
3. type_cmd: type safety
4. If none configured: check_concrete state is omitted entirely
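The priority rule amounts to "first configured command wins". The sketch below assumes the three commands are available as a flat mapping; the actual layout of ll-config.json is not shown in this guide, and `pick_tool_gate` is a hypothetical name.

```python
from typing import Mapping, Optional

# Highest priority first, as listed above.
PRIORITY = ("test_cmd", "lint_cmd", "type_cmd")


def pick_tool_gate(commands: Mapping[str, Optional[str]]) -> Optional[str]:
    """Return the highest-priority configured command, or None if none is set."""
    for key in PRIORITY:
        command = commands.get(key)
        if command:  # skips missing keys, None, and empty strings
            return command
    return None  # corresponds to omitting the check_concrete state


print(pick_tool_gate({"lint_cmd": "ruff check .", "type_cmd": "mypy ."}))
```

The example commands (`ruff check .`, `mypy .`) are placeholders for whatever the project has configured.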

Step H4: Iteration Budget¶

How many retries per item before giving up?
  ○ 3 retries (Recommended)   — Good balance for most skills
  ○ 5 retries                 — For complex or slow-converging skills
  ○ 1 retry (strict)          — Fail fast; skip items that don't resolve immediately

What is the total iteration budget?
  ○ 50 (Recommended)    — For up to ~15 items with 3 retries each
  ○ 100                 — For larger item sets
  ○ 200                 — For long-running batch operations

Convergence defaults by skill category:

Skill category	Suggested max_iterations	Per-item retries
Issue refinement / analysis	200	3
Code quality / fix	50	5
Documentation	100	3
Custom prompt	50	3
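A quick way to sanity-check a budget is to compute the worst case in which every item uses all of its retries. This sketch assumes each execute attempt consumes one unit of max_iterations, which this guide does not state explicitly, so treat the result as a rough lower bound on the budget rather than an exact rule.

```python
def worst_case_attempts(items: int, retries_per_item: int) -> int:
    """Attempts needed if every item exhausts its retries."""
    return items * retries_per_item


def fits_budget(items: int, retries_per_item: int, max_iterations: int) -> bool:
    """True when the worst case stays within the total iteration budget."""
    return worst_case_attempts(items, retries_per_item) <= max_iterations


# The "up to ~15 items with 3 retries each" guidance for a budget of 50.
print(worst_case_attempts(15, 3), fits_budget(15, 3, 50))
```

Under that assumption, 15 items at 3 retries need at most 45 attempts, which is why a budget of 50 is suggested for item sets of that size.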

Generated FSM Structure¶

Variant A: Single-Shot¶

Generated when work item mode is "Single-shot". Starts directly at execute with no discovery loop.

name: "harness-check-code"
initial: execute
max_iterations: 5          # = per-item retries
states:

  execute:
    action: /ll:check-code --auto
    action_type: slash_command
    next: check_concrete

  check_concrete:            # present if tool-based gates selected
    action: python -m pytest scripts/tests/ -q --tb=no
    action_type: shell
    evaluate:
      type: exit_code
    on_yes: check_semantic
    on_no: execute

  check_semantic:            # present if LLM-as-judge selected
    action: echo 'Evaluating output quality'
    action_type: shell
    evaluate:
      type: llm_structured
      prompt: >
        Evaluate the previous action on these criteria:
        1. No lint or type errors remain in the modified files
        2. Absence of failure signals: no error output, no unresolved violations reported
        Answer YES only if all criteria pass. Otherwise NO, stating which criterion failed.
    on_yes: check_invariants
    on_no: execute

  check_invariants:          # present if diff invariants selected
    action: "git diff --stat HEAD | wc -l | tr -d ' '"
    action_type: shell
    evaluate:
      type: output_numeric
      operator: lt
      target: 50
    on_yes: done
    on_no: execute

  done:
    terminal: true

Ready-to-run example: scripts/little_loops/loops/harness-single-shot.yaml is a fully annotated version of this variant, including commented-out check_mcp and check_skill optional gates. See Using the Example Files below.

Variant B: Multi-Item¶

Generated for Active issues list, File glob pattern, or Manual list. Adds discover and advance states around the evaluation chain.

name: "harness-refine-issue"
initial: discover
max_iterations: 200        # total budget across all items
states:

  discover:                  # shell command pops the next item
    action: "ll-issues list --json | python3 -c ..."
    action_type: shell
    capture: "current_item"
    evaluate:
      type: exit_code
    on_yes: execute          # item found → process it
    on_no: done              # no items left → finished

  execute:
    action: /ll:refine-issue ${captured.current_item.output} --auto
    action_type: prompt
    max_retries: 3           # prevents a stuck item from exhausting the budget
    on_retry_exhausted: advance
    next: check_concrete

  check_concrete:
    action: python -m pytest scripts/tests/ -q --tb=no
    action_type: shell
    evaluate:
      type: exit_code
    on_yes: check_semantic
    on_no: execute

  check_semantic:
    action: echo 'Evaluating refinement quality'
    action_type: shell
    evaluate:
      type: llm_structured
      prompt: >
        Evaluate the previous action on these criteria:
        1. The issue file was meaningfully updated with new codebase-grounded information
        2. Absence of failure signals: no error output, no unchanged or empty issue content
        Answer YES only if all criteria pass. Otherwise NO, stating which criterion failed.
    on_yes: check_invariants
    on_no: execute

  check_invariants:
    action: "git diff --stat HEAD | wc -l | tr -d ' '"
    action_type: shell
    evaluate:
      type: output_numeric
      operator: lt
      target: 50
    on_yes: advance
    on_no: execute

  advance:                   # mark item done, loop back to discover
    action: echo 'Item complete'
    action_type: shell
    next: discover

  done:
    terminal: true

max_retries + on_retry_exhausted: Adding these to execute is the key safeguard in multi-item loops. Without them, one item that never passes evaluation will consume the entire max_iterations budget. With them, the loop skips the stuck item and moves on after max_retries attempts.
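The effect of this safeguard is easy to see in a toy simulation. The sketch below is not the FSM runner; it assumes one budget unit per execute attempt and models evaluation as a hypothetical `passes(item, attempt)` callback.

```python
from typing import Callable, List, Tuple


def run_items(
    items: List[str],
    passes: Callable[[str, int], bool],
    max_retries: int = 3,
    max_iterations: int = 200,
) -> Tuple[List[str], List[str], int]:
    """Process items in order, skipping any that fail max_retries attempts."""
    done: List[str] = []
    skipped: List[str] = []
    used = 0
    for item in items:
        for attempt in range(1, max_retries + 1):
            if used >= max_iterations:
                return done, skipped, used  # total budget exhausted
            used += 1
            if passes(item, attempt):
                done.append(item)
                break
        else:
            # All attempts failed: the on_retry_exhausted -> advance path.
            skipped.append(item)
    return done, skipped, used


print(run_items(["a", "stuck", "b"], lambda item, attempt: item != "stuck"))
```

Here the stuck item costs three attempts and the loop still reaches the last item; without the per-item cap, it would keep retrying until the total budget ran out.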

Ready-to-run example: scripts/little_loops/loops/harness-multi-item.yaml is a fully annotated version of this variant with all five evaluation phases active, including check_mcp and check_skill. See Using the Example Files below.

Using the Example Files¶

Two annotated example harness loops ship as built-ins under scripts/little_loops/loops/:

File	Variant	Phases included
`scripts/little_loops/loops/harness-single-shot.yaml`	A — Single-shot	`check_stall`, `check_concrete`, `check_semantic`, `check_invariants`; `check_mcp` and `check_skill` as commented-out optional gates
`scripts/little_loops/loops/harness-multi-item.yaml`	B — Multi-item	All five phases active: `check_concrete`, `check_mcp`, `check_skill`, `check_semantic`, `check_invariants`

Each state in both files has an # EXAMPLE: comment explaining its pedagogical purpose.

Validate structure¶

ll-loop validate harness-single-shot
ll-loop validate harness-multi-item

Run interactively (dry-run)¶

ll-loop test walks through every state and lets you choose simulated verdicts — useful for understanding the FSM transitions without executing the real skill:

ll-loop test harness-single-shot

Run for real¶

ll-loop run harness-single-shot
ll-loop run harness-multi-item

The multi-item example discovers open issues via ll-issues list and runs /ll:manage-issue on each one. Make sure you have open issues before running it.

Adapt to your own workflow¶

The recommended approach is to copy, rename, and edit rather than modifying the originals (so they remain usable as references):

ll-loop install harness-single-shot   # copies to .loops/harness-single-shot.yaml
cp .loops/harness-single-shot.yaml .loops/my-harness.yaml

Key fields to change:

Field	What to change it to
`name`	A descriptive name for your loop
`execute.action`	Your skill or prompt (e.g., `/ll:check-code --auto`)
`check_concrete.action`	Your test/lint command, or remove the state entirely
`check_semantic.evaluate.prompt`	Multi-criteria numbered prompt: criterion 1 (what should change), criterion 2 (absence of failure signals)
`check_invariants.evaluate.target`	Increase if your skill makes large diffs legitimately
`discover.action`	Your item discovery command (multi-item only)

After editing, validate with ll-loop validate <your-file> before running.

Worked Example: Harness `refine-issue`¶

The following is a production-ready harness that refines all active issues. It reflects running the wizard with: target = refine-issue, discovery = active issues, tool-based gates, LLM-as-judge, and diff invariants selected, 3 retries, 200 iterations.

See also: scripts/little_loops/loops/harness-multi-item.yaml is a runnable annotated variant of this pattern with all five evaluation phases active, including check_mcp and check_skill.

Note: This example includes check_concrete, check_semantic, and check_invariants but omits check_stall, check_mcp, and check_skill. The check_mcp gate is not generated by the wizard; add it manually if needed (see MCP Tool Gates). The check_skill gate is optional and only applies when a user-simulation skill is available for the workflow; it is omitted here to keep the example minimal. check_stall is likewise left out; add it as described under Stall Detection if the skill may no-op.

name: "harness-refine-issue"
initial: discover
max_iterations: 200
timeout: 14400                    # 4-hour wall clock limit (seconds)
states:

  discover:                       # pop the next open issue ID
    action: |
      ll-issues list --json | python3 -c "
      import json, sys
      issues = json.load(sys.stdin)
      open_issues = [i for i in issues if i.get('status') == 'open']
      if not open_issues:
          sys.exit(1)
      print(open_issues[0]['id'])
      "
    action_type: shell
    capture: "current_item"       # stored as ${captured.current_item.output}
    evaluate:
      type: exit_code
    on_yes: execute
    on_no: done                   # empty list → all issues processed

  execute:                        # invoke the skill with the captured issue ID
    action: /ll:refine-issue ${captured.current_item.output} --auto
    action_type: prompt
    max_retries: 3                # prevents a stuck issue from exhausting max_iterations
    on_retry_exhausted: advance
    next: check_concrete

  check_concrete:                 # run tests to confirm no regressions
    action: python -m pytest scripts/tests/ -q --tb=no
    action_type: shell
    evaluate:
      type: exit_code
    on_yes: check_semantic
    on_no: execute

  check_semantic:                 # LLM confirms the issue was actually refined
    action: echo 'Evaluating refinement quality'
    action_type: shell
    evaluate:
      type: llm_structured
      prompt: >
        Did the previous /ll:refine-issue action successfully refine the issue?
        Check that: the issue file was updated with new content, confidence scores
        were added or improved, and no errors occurred. Answer YES or NO.
    on_yes: check_invariants
    on_no: execute

  check_invariants:               # catch runaway edits (diff --stat output of 50+ lines)
    action: "git diff --stat HEAD | wc -l | tr -d ' '"
    action_type: shell
    evaluate:
      type: output_numeric
      operator: lt
      target: 50
    on_yes: advance
    on_no: execute

  advance:                        # mark item done, loop back for the next one
    action: echo 'Issue refined'
    action_type: shell
    next: discover

  done:
    terminal: true

Tips¶

Route not_found to the next phase, not back to execute, in check_mcp states. If the MCP server isn't configured in .mcp.json, retrying the execute state won't fix it — skip to the next evaluation gate instead.
Start with single-shot to validate the skill works end-to-end before adding discovery. Use ll-loop run <file> with a single item to test the evaluation chain.
Use ll-loop validate to check the FSM structure before full execution — it validates YAML syntax, transition completeness, and terminal reachability.
Add stall detection for prompt-based skills (especially custom prompts) that may no-op. A skill that says "already done" on every item will silently exhaust your budget without it.
Check that ll-config.json has at least one tool command (test_cmd, lint_cmd, or type_cmd) so the wizard generates the check_concrete gate. Without one, the wizard omits the tool phase and your loop has no objective quality check.
Tune target: 50 in check_invariants if your skill intentionally makes large changes (e.g., a doc rewrite skill). Increase to 150–200 for documentation-heavy skills.
Set a loop-level timeout (in seconds) for long-running batch operations to avoid unbounded runs.
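To make "transition completeness" and "terminal reachability" from the ll-loop validate tip concrete, here is a minimal sketch of such checks over a parsed states mapping. It is not ll-loop's actual validator; the transition keys are the ones that appear in the YAML in this guide, and the function names are hypothetical.

```python
from collections import deque
from typing import Dict, List

TRANSITION_KEYS = ("next", "on_yes", "on_no", "on_error", "on_retry_exhausted")


def targets(state: dict) -> List[str]:
    """Collect every state name this state can transition to."""
    found = [state[key] for key in TRANSITION_KEYS if key in state]
    found.extend(state.get("route", {}).values())
    return found


def validate(initial: str, states: Dict[str, dict]) -> List[str]:
    """Return a list of structural problems (empty when the FSM looks sound)."""
    if initial not in states:
        return [f"initial state {initial!r} is not defined"]
    problems = []
    for name, state in states.items():
        for target in targets(state):
            if target not in states:
                problems.append(f"{name} -> {target}: undefined state")
    # Breadth-first search from the initial state over defined transitions.
    seen, queue = {initial}, deque([initial])
    while queue:
        for target in targets(states[queue.popleft()]):
            if target in states and target not in seen:
                seen.add(target)
                queue.append(target)
    if not any(states[name].get("terminal") for name in seen):
        problems.append("no terminal state is reachable from the initial state")
    return problems


fsm = {
    "execute": {"next": "check_concrete"},
    "check_concrete": {"on_yes": "done", "on_no": "execute"},
    "done": {"terminal": True},
}
print(validate("execute", fsm))
```

A loop that passes checks like these can still fail at runtime, so the dry-run walkthrough with ll-loop test remains worthwhile.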

Troubleshooting¶

Symptom	Cause	Fix
Loop exhausts `max_iterations` without finishing	No stall detection; one item looping forever	Add `check_stall` state; or add `max_retries` + `on_retry_exhausted` on `execute`
LLM-judge always returns NO	Evaluation prompt too strict or vague	Edit `check_semantic.evaluate.prompt` to match actual skill output characteristics
`check_concrete` state missing from generated YAML	No tool commands in `ll-config.json`	Run `/ll:configure` to set `test_cmd`, `lint_cmd`, or `type_cmd`
`discover` exits immediately with no items	Discovery command filter too narrow	Check that issues have `status: open`; verify `ll-issues list` returns results
`check_invariants` always fails	Skill makes large diffs legitimately	Increase `target` from 50 to a value appropriate for the skill
Loop runs but nothing changes across iterations	Skill is idempotent / "already done"	Add `check_stall` with `max_stall: 1` to skip no-op items
`check_mcp` always routes to `not_found`	Server not registered in `.mcp.json`	Add the MCP server entry to `.mcp.json` or route `not_found` to the next phase to skip gracefully
`check_skill` always returns NO	Skill prompt too broad or skill has no browser/nav capability	Narrow the skill instruction; ensure the skill has access to the target system; check timeout is long enough