Running the demo
The demo workspace is deliberately almost empty — just an AGENTS.md (the project's conventions) and a .gitignore. It exists mainly as a git repo, which is all Tilth needs to do what it always does: branch off a worktree and build inside it. The path mirrors what a real first-time user does — seed a task list with tilth prep-feature, then run it. Nothing is pre-baked; the todo CLI gets built from scratch during the run.
Clone the demo workspace
Path used on this page. Commands below use ~/projects/tilth-demo as an illustrative location. Tilth doesn't care where the workspace lives (the path is just a CLI argument), so substitute any directory that matches your setup. Treat the demo repo as a stand-in for your own.
Seed a task list
Tilth's task list (prd.json) and the matching acceptance tests come from an interview the harness runs against your codebase. Kick it off:
- [ ] item in TODOS.md"), then asks a few targeted questions to slice the work and lock acceptance criteria. The output is harness-owned and lands on the Tilth side, not in the demo repo: the task list at <tilth-clone>/sessions/<id>/prd.json, plus one test_t<NNN>_*.py per task under that session's worktree at <tilth-clone>/sessions/<id>/workspace/tests/ (on branch session/<id>). Your demo checkout stays as empty as it started. See Seeding a session for the full interview-engine story.
Where a session's state lives. Everything the harness writes — prd.json, the seeded tests under workspace/, the event log — sits under your Tilth clone (sessions/<id>/); only the session/<id> branch and its worktree admin entry live in the demo repo's .git. Full breakdown in Session layout.
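As a rough illustration of that layout, the sketch below maps the artifacts named on this page to paths under a Tilth clone. The helper names (`session_paths`, `seeded_tests`) and the example clone location are hypothetical and not part of Tilth; only the relative paths are taken from this page.

```python
from pathlib import Path


def session_paths(tilth_clone: Path, session_id: str) -> dict[str, Path]:
    """Map the harness-owned artifacts named on this page to their locations.

    Hypothetical helper: Tilth itself may resolve these paths differently.
    """
    root = tilth_clone / "sessions" / session_id
    return {
        "prd": root / "prd.json",
        "worktree": root / "workspace",
        "tests": root / "workspace" / "tests",
        "events": root / "events.jsonl",
        "summary": root / "summary.json",
        "proposed_learnings": root / "proposed-learnings.md",
    }


def seeded_tests(tests_dir: Path) -> list[Path]:
    """List per-task test files that follow the test_t<NNN>_*.py naming."""
    return sorted(tests_dir.glob("test_t*_*.py"))


if __name__ == "__main__":
    # Both the clone location and the session id are made up for the example.
    paths = session_paths(Path.home() / "src" / "tilth", "example-session")
    for name, path in paths.items():
        print(f"{name:>20}: {path}")
```

Nothing in this mapping points into the demo checkout, which is the point: the demo repo only holds the `session/<id>` branch and the worktree admin entry.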
You can preview what a finished seed for this codebase looks like by reading examples/seed-reference/todo-cli/ in the Tilth repo — same project, a hand-crafted reference.
Run a session against the demo
What happens, end-to-end:
- Tilth verifies the path is a git repo on a clean main.
- Creates a worktree of the demo repo. The working tree lives at <tilth-clone>/sessions/<id>/workspace/ (inside Tilth, gitignored); the new branch session/<id> is registered in the demo repo's .git. The two halves live in different places by design; see Session layout for the why.
- Loops through pending tasks in prd.json. For each task:
    - Reset context. Prompt = system + the feature plan (as context) + AGENTS.md + recent progress + this task (and, on a retry, the evaluator's prior verdicts on it).
    - Tool-loop with the worker model (bash, file ops, search) until it calls submit_case to present its finished work.
    - Run ruff + pytest in the worktree. Failures get fed back into the loop.
    - Evaluator model reviews the case + diff in a fresh context (it also sees this task's prior verdicts). Rejections get fed back.
    - Self-improvement prompt: the worker considers whether the task surfaced a durable observation worth proposing. Any proposal lands in sessions/<id>/proposed-learnings.md (not in your repo) for end-of-run review.
    - Commit on the worktree branch. Append to progress.txt. Mark the task done in prd.json.
- Stops on: all tasks done, iteration cap, wall-clock cap, token cap, evaluator-call cap, or a terminal failure (e.g. a provider returning empty responses, or the worker never presenting a case).
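The control flow described above can be sketched roughly as follows. This is a simplified illustration, not Tilth's implementation: `Task`, `run_session`, and the three callables are hypothetical stand-ins for the worker tool loop, the `ruff` + `pytest` validators, and the evaluator, and only the iteration cap is modelled among the stop conditions.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Task:
    id: str
    description: str
    status: str = "pending"
    verdicts: list[str] = field(default_factory=list)  # prior evaluator rejections


def run_session(
    tasks: list[Task],
    work: Callable[[Task], str],                        # tool loop; returns the submitted case
    validate: Callable[[Task], bool],                   # stands in for ruff + pytest
    evaluate: Callable[[Task, str], tuple[bool, str]],  # returns (accepted, verdict)
    max_iterations: int = 5,                            # illustrative default
) -> str:
    """Loop over pending tasks and return the reason the run stopped."""
    for task in tasks:
        if task.status != "pending":
            continue
        for _ in range(max_iterations):    # the iteration cap is counted per task
            case = work(task)
            if not validate(task):
                continue                   # validator failure: go around again
            accepted, verdict = evaluate(task, case)
            if accepted:
                task.status = "done"       # a real harness would commit here
                break
            task.verdicts.append(verdict)  # available to the next attempt
        else:
            return "iteration cap"
    return "all tasks done"


if __name__ == "__main__":
    seen: list[str] = []

    def reject_once(task: Task, case: str) -> tuple[bool, str]:
        seen.append(case)
        return (len(seen) > 1, "missing edge-case test")

    demo = [Task("t001", "add a todo item")]
    print(run_session(demo, lambda t: "case", lambda t: True, reject_once))
    print(demo[0].verdicts)
```

The `for ... else` form keeps the cap check next to the loop it guards: the `else` branch runs only when a task used every iteration without being accepted.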
You can interrupt at any point with Ctrl-C. Ctrl-C and cap hits (iteration, wall-clock, token) all leave the run in a resumable state — see Resuming a session to pick it back up. Of the three caps, only the token cap needs attention before you resume: the cumulative token total carries across resumes, so if TILTH_MAX_TOKENS is what stopped the run, raise it in .env first or tilth resume trips it again on the first check. The wall-clock budget resets per resume, and the iteration cap is per-task (a retried task starts counting from one), so neither blocks a resume unless the work genuinely needs a bigger budget — see What resume does.
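To see why only the token cap blocks a resume, consider the minimal model below. It assumes a cap trips once usage reaches the limit; the `Budget` class and its field names are hypothetical, and only the carry-over versus reset behaviour is taken from the paragraph above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Budget:
    max_tokens: int
    max_wall_seconds: float
    tokens_used: int = 0            # cumulative: carries across resumes
    wall_seconds_used: float = 0.0  # reset at the start of each resume

    def exceeded(self) -> Optional[str]:
        """Return the name of the tripped cap, or None if within budget."""
        if self.tokens_used >= self.max_tokens:
            return "token cap"
        if self.wall_seconds_used >= self.max_wall_seconds:
            return "wall-clock cap"
        return None

    def resume(self) -> None:
        """Begin a new run segment: the wall clock resets, tokens do not."""
        self.wall_seconds_used = 0.0


if __name__ == "__main__":
    budget = Budget(max_tokens=1000, max_wall_seconds=3600, tokens_used=1000)
    budget.resume()
    print(budget.exceeded())  # still the token cap: the total carried over
    budget.max_tokens = 2000  # analogous to raising TILTH_MAX_TOKENS first
    print(budget.exceeded())
```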
What you should expect to see
The console streams every tool call as it happens. The per-task loop has the shape below:
One task's lifecycle inside the harness. The worker sees the Prompt and the Tool Loop; the Self-Improve step and the cross-task evaluation machinery stay harness-side. Failed validators or a rejected evaluator verdict feed back into the Tool Loop for another iteration.
A clean run ends with every task in prd.json marked done and a commit-per-task on the session/<id> branch. When the loop doesn't track this cleanly, watch for these patterns:
- A task spinning is signalled by the same files being read and re-written across iterations. If it happens, kill the run and rewrite the task description before retrying.
- Validator feedback loops show as repeated validator_failed → next iteration patterns. A handful is normal; a long string usually means the test suite or the lint config is misaligned with the agent's idea of "done."
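If you would rather count than eyeball the console, a small tally like the one below can flag tasks with many validator failures. It is only a sketch: the `(task_id, event_kind)` pair shape, the function name, and the threshold of 5 are assumptions made for illustration, and `validator_failed` is used simply as the label shown above.

```python
from collections import Counter
from typing import Iterable


def noisy_tasks(
    events: Iterable[tuple[str, str]],
    threshold: int = 5,  # arbitrary cut-off for "a long string"
) -> dict[str, int]:
    """Return tasks whose validator-failure count reaches the threshold.

    `events` is a hypothetical sequence of (task_id, event_kind) pairs.
    """
    counts = Counter(task for task, kind in events if kind == "validator_failed")
    return {task: n for task, n in counts.items() if n >= threshold}


if __name__ == "__main__":
    sample = [("t001", "validator_failed")] * 2 + [("t002", "validator_failed")] * 6
    print(noisy_tasks(sample))
```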
After the run
Once every task in prd.json is done, the harness closes out the final task and prints all tasks complete followed by a run summary:
A clean ending. Every task is committed on the session branch (left); the run summary tallies what happened (centre); and the artifacts the run wrote under sessions/<id>/ — the event log, the rolled-up summary, the resume checkpoint, and any proposed learnings — sit on the right, outside the worktree for you to read, resume from, or review by hand. AGENTS.md is never touched by the run.
To inspect what just got committed:
Each task is one commit. If you like the work, merge it into main like any other branch; if not, delete the branch. The harness never auto-merges. (You can also use Resetting a session to throw away the worktree, branch, and the harness's session directory in one shot.)
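For scripted review, the per-task commits can also be listed from Python by shelling out to `git`. The sketch assumes the session branch was cut from `main` and that `git` is on your PATH; the function names are hypothetical, and `git log --oneline main..session/<id>` is plain git rather than a Tilth command.

```python
import subprocess
from pathlib import Path


def session_log_command(repo: Path, session_id: str, base: str = "main") -> list[str]:
    """Build a git command listing commits on session/<id> that are not on base."""
    return [
        "git", "-C", str(repo),
        "log", "--oneline",
        f"{base}..session/{session_id}",
    ]


def session_commits(repo: Path, session_id: str) -> list[str]:
    """Run the command and return one line per commit (newest first)."""
    result = subprocess.run(
        session_log_command(repo, session_id),
        capture_output=True,
        text=True,
        check=True,  # raise if the repo or branch does not exist
    )
    return result.stdout.splitlines()
```

Point `repo` at the demo checkout, since that is where the `session/<id>` branch is registered.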
The session log lives at <tilth-clone>/sessions/<id>/events.jsonl — every model call, tool call, validator run, evaluator verdict, and proposed-learning verdict is recorded (see Session layout → Event types for the full taxonomy). Alongside it, sessions/<id>/summary.json carries a rolled-up snapshot (token totals, per-task iteration counts, tool histogram, hook outcomes, evaluator accepts/rejects with rejection categories) refreshed at every task boundary — read that when you want a quick stat without jq-ing the full log.
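When summary.json doesn't carry the stat you want, the log itself is easy to tally because it is JSON Lines, one object per line. The sketch below assumes each event stores its kind under a `"type"` key; that key name is a guess, so check Session layout → Event types for the real schema and pass a different `key` if needed.

```python
import json
from collections import Counter
from pathlib import Path


def tally_events(log_path: Path, key: str = "type") -> Counter:
    """Count events in a JSON Lines file by the value stored under `key`.

    The default key name is an assumption, not the documented schema.
    """
    counts: Counter = Counter()
    with log_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            event = json.loads(line)
            counts[event.get(key, "<missing>")] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python tally.py <tilth-clone>/sessions/<id>/events.jsonl
    for kind, n in tally_events(Path(sys.argv[1])).most_common():
        print(f"{n:6d}  {kind}")
```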
For a more readable view of a finished run, see Visualizing a session.


