What is Spellbook?
Spellbook is a sandboxed execution platform for specification-driven coding agents.
Overview
Spellbook helps teams define what should be built, how it should behave, what constraints must be respected, and how generated code should be verified before it is trusted.
Spellbook is infrastructure for disciplined AI-native development.
The problem
AI coding agents can produce code quickly, but speed alone is not enough for serious engineering. Teams also need:
- clear intent
- domain correctness
- architecture consistency
- local repo conventions
- safety checks
- reviewable changes
- audit logs
- repeatable workflows
The Spellbook approach
Spellbook places coding agents inside a structured engineering loop.
Governed execution loop:
```text
init -> specify -> architecture -> local conventions -> plan -> build -> test -> review -> ship -> verify -> monitor -> learn
```
What Spellbook provides
- A project control layer
- A specification repository
- Requirements and quality gates
- Agent onboarding rules
- Controlled execution workspaces
- Task state tracking
- Test and verification reports
- Audit-ready artifacts
What Spellbook is not
Spellbook is not a replacement for engineers. Spellbook is not just a prompt template. Spellbook is not only a code generator. Spellbook is not magic.
Quickstart
Create a minimal control layer, add context, and run a reviewable task.
Minimal flow
The quickest path is to initialize the project, capture intent, create a task, run the task, and verify the output.
```shell
# first run
spellbook init
spellbook specify
spellbook requirements add
spellbook task create "Add disabled-user login rejection"
spellbook plan task-001
spellbook build task-001
spellbook test task-001
spellbook verify task-001
spellbook review task-001
```
What to check
A successful first run should produce a plan, changed files, test output, a diff summary, and a review report linked back to requirements.
Create your first .spellbook project
The `.spellbook/` directory is the control layer agents read before changing code.
Initialize
```shell
# init
spellbook init
```
The initialization step should create onboarding files, a project manifest, and folders for context, requirements, architecture, tasks, and reports.
First files
Start with `AGENTS.md`, `PEOPLE.md`, `spellbook.yaml`, and a short product context file. These files tell agents how to behave and tell humans what review evidence to expect.
Run your first agent task
Agent tasks are scoped units of work that can be planned, executed, tested, and reported.
Create and plan
```shell
# task
spellbook task create "Reject disabled-user login"
spellbook plan task-001
```
Execute and review
After execution, review the task record before accepting the diff. The report should show why the change exists, which requirement it satisfies, and what still needs human judgment.
Getting Started
Build a first workflow from installation through review.
Install
Install the Spellbook command-line tool in the environment where agents will plan and execute work.
Note: Keep the first project local. Validate the control layer before connecting CI or organization policies.
Commands
```shell
# guided setup
spellbook init
spellbook specify
spellbook requirements add
spellbook architecture init
spellbook conventions init
spellbook task create "Add disabled-user login rejection"
spellbook plan task-001
spellbook build task-001
spellbook test task-001
spellbook verify task-001
spellbook review task-001
```
Specification-driven development
Make intent, requirements, domain rules, architecture constraints, and verification criteria explicit before code is generated or changed.
Prompt-driven vs specification-driven
| Prompt-driven coding | Specification-driven coding |
| --- | --- |
| “Build login.” | “Build login according to these entities, states, invariants, requirements, contracts, tests, and security rules.” |
| Context is hidden in chat. | Context is versioned in the repo. |
| Agent guesses conventions. | Agent reads local conventions. |
| Review starts after code exists. | Review starts from intent and plan. |
| Tests are optional follow-up. | Verification is part of execution. |
| Hard to audit. | Task history and evidence are captured. |
Goal
The goal is not to make agents more creative. The goal is to make their work more constrained, inspectable, and trustworthy.
The Spellbook loop
The loop keeps agent work connected to product intent from project initialization through learning.
Loop steps
- init
- specify
- architecture
- local conventions
- plan
- build
- test
- review
- ship
- verify
- monitor
- learn
Controlled execution
Controlled execution means agent work happens inside a bounded environment with policies, logs, limits, artifacts, and verification.
Boundaries
Agents should know which tools are allowed, which actions need approval, which commands validate the work, and which shortcuts are forbidden.
Requirement packs
A requirement pack groups functional, non-functional, security, compliance, quality, or operational requirements.
When to use packs
Use packs when a task crosses a known engineering boundary, such as authentication, payments, authorization, audit logging, or production change management.
Domain truth
Domain truth is the stable business and system knowledge generated code must respect.
Examples
- Disabled users cannot log in.
- Transfers require a source account, destination account, and settled ledger entry.
- Audit events must preserve actor, action, target, and timestamp.
Invariants and forbidden shortcuts
An invariant is a rule that must always remain true. A forbidden shortcut is a tempting move that violates correctness, security, maintainability, or product intent.
```yaml
# policy
invariant: PasswordHashNeverReturned
must_hold_for:
  - GET /me
  - POST /login
forbidden_shortcuts:
  - return user model directly
  - skip response serialization
```
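One way to make this invariant checkable is to build every response from an explicit allow-list of fields. The sketch below is illustrative only: the `User` model, its field names, and both helper functions are assumptions, not part of Spellbook.

```python
from dataclasses import asdict, dataclass


@dataclass
class User:
    """Hypothetical user model; the field names are assumptions."""
    id: int
    email: str
    password_hash: str


# Only fields named here may appear in a response body.
PUBLIC_USER_FIELDS = ("id", "email")


def serialize_user(user: User) -> dict:
    """Build the response body from the allow-list instead of the raw model."""
    return {name: getattr(user, name) for name in PUBLIC_USER_FIELDS}


def password_hash_never_returned(response_body: dict) -> bool:
    """Return True when the invariant holds for a response body."""
    return "password_hash" not in response_body


user = User(id=1, email="a@example.com", password_hash="not-a-real-hash")

# Going through the serializer keeps the invariant.
safe_body = serialize_user(user)

# The forbidden shortcut: returning the model directly leaks the hash.
leaky_body = asdict(user)
```

Here `password_hash_never_returned(safe_body)` is true and `password_hash_never_returned(leaky_body)` is false, which is the kind of check a test for `GET /me` or `POST /login` could assert.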
Local conventions
Local conventions capture repo-specific engineering rules for naming, errors, tests, logging, layout, and agent behavior.
Examples
Conventions should tell agents where tests live, how errors are represented, how logs are structured, and when a change requires approval.
Quality gates
A quality gate is a check that must pass before a task can be considered complete.
Common gates
- tests pass
- requirements have evidence
- architecture constraints are respected
- security rules pass
- review report is produced
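The gates above can be thought of as named predicates over the evidence a task produced. The following is a minimal sketch under that assumption; `TaskEvidence`, its fields, the `REQ-…` identifiers, and the report path are hypothetical and do not describe Spellbook's actual data model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TaskEvidence:
    """Hypothetical summary of what a finished task produced."""
    tests_passed: bool
    required_requirements: set = field(default_factory=set)
    evidenced_requirements: set = field(default_factory=set)
    architecture_violations: list = field(default_factory=list)
    security_findings: list = field(default_factory=list)
    review_report: Optional[str] = None  # path to the report, if one exists


# Each gate is a name plus a check that returns True when the gate passes.
GATES = {
    "tests pass": lambda e: e.tests_passed,
    "requirements have evidence": lambda e: e.required_requirements <= e.evidenced_requirements,
    "architecture constraints are respected": lambda e: not e.architecture_violations,
    "security rules pass": lambda e: not e.security_findings,
    "review report is produced": lambda e: e.review_report is not None,
}


def failed_gates(evidence: TaskEvidence) -> list:
    """Return the names of all gates that do not pass, in declaration order."""
    return [name for name, check in GATES.items() if not check(evidence)]


def is_complete(evidence: TaskEvidence) -> bool:
    """A task is complete only when every gate passes."""
    return not failed_gates(evidence)
```

Returning the list of failed gate names, rather than a single boolean, gives a reviewer something concrete to act on.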
Agent task lifecycle
A task moves through states that make agent work observable and reviewable.
States
Task state machine:
```text
created -> planned -> running -> testing -> review_ready -> approved -> shipped
created -> planned -> blocked
running -> failed -> retry_planned
```
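The transitions listed above can be encoded as a lookup table so that an illegal state change fails loudly. This is a sketch, not Spellbook's implementation; `ALLOWED_TRANSITIONS`, `InvalidTransition`, and `advance` are names chosen for illustration, and only the transitions shown in this section are included.

```python
# Transition table copied from the three paths listed in this section.
ALLOWED_TRANSITIONS = {
    "created": {"planned"},
    "planned": {"running", "blocked"},
    "running": {"testing", "failed"},
    "testing": {"review_ready"},
    "review_ready": {"approved"},
    "approved": {"shipped"},
    "failed": {"retry_planned"},
}


class InvalidTransition(Exception):
    """Raised when a task tries to skip or reorder lifecycle states."""


def advance(current: str, target: str) -> str:
    """Return the new state, or raise if the move is not in the table."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"{current} -> {target} is not allowed")
    return target


def replay(states: list) -> str:
    """Validate a recorded state history and return its final state."""
    current = states[0]
    for target in states[1:]:
        current = advance(current, target)
    return current
```

Replaying a task's recorded history through `replay` is one way a reviewer could confirm that no state, such as `testing`, was skipped.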
Concepts
Core terms used throughout Spellbook.
Definitions
Spellbook project
A repository with a `.spellbook/` control layer.
Specification
A written definition of intended behavior, constraints, and verification criteria.
Domain truth
The stable business and system rules generated code must respect.
Invariant
A rule that must always remain true.
Forbidden shortcut
A tempting implementation move that is not allowed.
Requirement pack
A grouped set of functional, non-functional, security, compliance, quality, or operational requirements.
Quality gate
A check that must pass before a task can be complete.
Agent task
A scoped unit of work an agent can plan, execute, test, and report.
Controlled execution
Agent work inside a bounded environment with logs, limits, policies, artifacts, and verification.
The .spellbook project structure
The `.spellbook/` folder keeps agent instructions, human onboarding, specifications, runner policy, tasks, and reports together.
Directory tree
```text
.spellbook/
├── AGENTS.md
├── PEOPLE.md
├── spellbook.yaml
├── context/
├── domain/
├── requirements/
├── architecture/
├── local-conventions/
├── runner/
├── tasks/
└── reports/
```
Reference
| Path | Purpose |
| --- | --- |
| .spellbook/AGENTS.md | Quick onboarding file for coding agents. Defines how agents should read the project, plan work, make changes, run checks, and report results. |
| .spellbook/PEOPLE.md | Human onboarding companion. Explains assumptions, workflow, review expectations, and human-agent collaboration. |
| .spellbook/context/ | Product and business context: goals, actors, workflows, non-goals, boundaries, metrics, and product truth. |
| .spellbook/domain/ | Domain model: entities, states, events, effects, invariants, forbidden shortcuts, contracts, failure modes, and risk maps. |
| .spellbook/requirements/ | Functional and non-functional requirements, quality gates, security gates, compliance packs, mappings, schemas, and acceptance criteria. |
| .spellbook/architecture/ | Runtime components, architecture decisions, integration patterns, design rules, and implementation mappings. |
| .spellbook/local-conventions/ | Repository-specific engineering rules for naming, testing, errors, observability, layout, logging, and agent behavior. |
| .spellbook/runner/ | Execution environment definitions: sandboxing, timeouts, allowed tools, artifacts, retry behavior, task state, and verification commands. |
| .spellbook/tasks/ | Planned and executed coding tasks with intent, scope, plan, changes, verification, artifacts, risk flags, and review status. |
| .spellbook/reports/ | Generated reports: test results, verification results, risk summaries, diff summaries, timelines, cost metrics, and learning notes. |
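A quick way to confirm that a repository has the layout described above is to check for each expected file and folder. The sketch below uses only the names from the directory tree; the `missing_entries` helper is hypothetical and is not a Spellbook command.

```python
from pathlib import Path

# Names taken from the directory tree in this section.
EXPECTED_FILES = ("AGENTS.md", "PEOPLE.md", "spellbook.yaml")
EXPECTED_DIRS = (
    "context",
    "domain",
    "requirements",
    "architecture",
    "local-conventions",
    "runner",
    "tasks",
    "reports",
)


def missing_entries(repo_root: Path) -> list:
    """List the expected .spellbook/ files and folders that are absent."""
    control = repo_root / ".spellbook"
    missing = [name for name in EXPECTED_FILES if not (control / name).is_file()]
    missing += [f"{name}/" for name in EXPECTED_DIRS if not (control / name).is_dir()]
    return missing


if __name__ == "__main__":
    # Check the current working directory and print anything that is missing.
    for entry in missing_entries(Path.cwd()):
        print(f"missing: .spellbook/{entry}")
```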
context/
Product and business context for agents.
What belongs here
Goals, actors, workflows, non-goals, boundaries, success metrics, product intent, and stable product language.
domain/
Domain truth and system rules.
What belongs here
Entities, states, events, effects, invariants, forbidden shortcuts, contracts, failure modes, and risk maps.
requirements/
Functional and non-functional requirements.
What belongs here
Acceptance criteria, requirement IDs, security requirements, quality bars, compliance mappings, and evidence expectations.
architecture/
Runtime components and design rules.
What belongs here
Architecture decisions, integration patterns, ownership, component boundaries, and mappings between domain concepts and implementation.
local-conventions/
Repository-specific engineering rules.
What belongs here
Naming, test style, error handling, observability, code layout, logging, commit rules, and approval-required agent behavior.
runner/
Execution environment definitions.
What belongs here
Sandboxing, timeouts, allowed tools, captured logs, artifacts, retry behavior, task state machine, and verification commands.
tasks/
Planned and executed coding tasks.
What belongs here
Task intent, scope, plan, changed files, verification, artifacts, risk flags, retry history, and review status.
reports/
Generated review and verification reports.
What belongs here
Test results, verification results, risk summaries, diff summaries, task timelines, cost and latency metrics, and learning notes.
auth-v0
A compact authentication boundary that demonstrates why executable context matters.
Spec
```yaml
# auth-v0
Intent: Users can register, log in, log out, and access their profile.
Domain:
  - User
  - Session
States:
  User: PendingEmailVerification | Active | Disabled
  Session: Active | Revoked | Expired
Invariants:
  - DisabledUserCannotLogin
  - SessionBelongsToExistingUser
  - SessionHasExpiry
  - PasswordHashNeverReturned
  - OnlyActiveSessionAuthorizesProtectedRoutes
Runtime:
  - AuthController
  - UserRepository
  - SessionRepository
  - PasswordHasher
  - TokenIssuer
  - AuditLogger
Routes:
  - POST /register
  - POST /login
  - POST /logout
  - GET /me
Tests:
  - Disabled user login is rejected
  - Expired session is rejected
  - Password hash is never returned
  - Token is not issued before password validation
```
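Two of the tests in this spec, "Disabled user login is rejected" and "Token is not issued before password validation", constrain the order of steps inside the login handler. The sketch below shows one ordering that satisfies both. The `User` fields, the `login` signature, and the injected `verify_password` and `issue_token` callables are assumptions standing in for `PasswordHasher` and `TokenIssuer`; how `PendingEmailVerification` users are treated is left open because the spec does not say.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class User:
    """Hypothetical user record; state values come from the spec above."""
    email: str
    password_hash: str
    state: str  # "PendingEmailVerification", "Active", or "Disabled"


class LoginRejected(Exception):
    """Raised whenever login must not produce a token."""


def login(
    user: Optional[User],
    password: str,
    verify_password: Callable[[str, str], bool],
    issue_token: Callable[[User], str],
) -> str:
    """Validate credentials and account state before issuing a token."""
    # Step 1: password validation always happens before any token is issued.
    if user is None or not verify_password(password, user.password_hash):
        raise LoginRejected("invalid credentials")
    # Step 2: DisabledUserCannotLogin, even with a correct password.
    if user.state == "Disabled":
        raise LoginRejected("user disabled")
    # Step 3: only now is the token issuer called.
    return issue_token(user)
```

Checking the password before the account state is a design choice in this sketch: it avoids revealing that an account is disabled to someone who does not know the password.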
Fintech transfer
A money movement boundary needs explicit state, idempotency, ledger, and audit rules.
Useful checks
- Transfer has source and destination accounts.
- Ledger entries balance.
- Retries are idempotent.
- Failed transfers preserve audit evidence.
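Three of the checks above (source and destination accounts, balanced ledger entries, idempotent retries) can be illustrated with a small in-memory ledger. This is a sketch under stated assumptions: the `Ledger` class, the idempotency-key parameter, and the account names are hypothetical, and audit evidence for failed transfers is not modeled here.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass(frozen=True)
class LedgerEntry:
    account: str
    amount: Decimal  # negative for a debit, positive for a credit


@dataclass
class Ledger:
    entries: list = field(default_factory=list)
    applied: dict = field(default_factory=dict)  # idempotency key -> entries

    def transfer(self, key: str, source: str, destination: str, amount: Decimal):
        """Record a transfer once per idempotency key."""
        if not source or not destination:
            raise ValueError("transfer needs a source and a destination account")
        if amount <= 0:
            raise ValueError("amount must be positive")
        # A retry with the same key returns the original entries unchanged.
        if key in self.applied:
            return self.applied[key]
        pair = (LedgerEntry(source, -amount), LedgerEntry(destination, amount))
        self.entries.extend(pair)
        self.applied[key] = pair
        return pair

    def is_balanced(self) -> bool:
        """Debits and credits across all entries must sum to zero."""
        return sum((e.amount for e in self.entries), Decimal("0")) == 0
```

Calling `transfer` twice with the same key leaves exactly two entries in the ledger, which is the property a retry test would assert.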
Legacy codebase migration
Use Spellbook to keep modernization work inside architecture and compatibility boundaries.
Pattern
Capture current behavior first, map risky modules, define forbidden shortcuts, then migrate behind tests and evidence reports.
API hardening task
Security-sensitive API changes should require explicit gates.
Suggested gates
- authentication checks
- authorization checks
- input validation
- error response policy
- audit logging
Test-generation task
Generated tests should trace back to requirements and domain rules, not just implementation branches.
Evidence
Ask the task report to show which requirement each test covers and which requirements still lack coverage.
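The evidence described above amounts to a mapping from tests to requirement IDs, inverted to show coverage per requirement. A minimal sketch follows, assuming tests are tagged with requirement IDs; the `coverage_report` function and the `REQ-…` style identifiers are illustrative, not a Spellbook report format.

```python
def coverage_report(requirement_ids: list, test_to_requirements: dict) -> dict:
    """Group tests by the requirement they cover and flag the gaps."""
    covered = {rid: [] for rid in requirement_ids}
    untraced = []
    for test_name, rids in test_to_requirements.items():
        known = [rid for rid in rids if rid in covered]
        # A test that maps to no known requirement cannot serve as evidence.
        if not known:
            untraced.append(test_name)
        for rid in known:
            covered[rid].append(test_name)
    return {
        "covered": {rid: tests for rid, tests in covered.items() if tests},
        "uncovered": [rid for rid, tests in covered.items() if not tests],
        "untraced_tests": untraced,
    }


report = coverage_report(
    ["REQ-1", "REQ-2", "REQ-3"],
    {
        "test_disabled_user_login_rejected": ["REQ-1"],
        "test_expired_session_rejected": ["REQ-2"],
        "test_internal_branch": [],
    },
)
```

In this example `report["uncovered"]` contains `REQ-3`, and `test_internal_branch` is listed as untraced because it only exercises an implementation branch.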
Runner
The runner is where specification-driven development becomes executable.
Expected capabilities
Execution boundary: The runner owns the practical boundary between agent intent and codebase mutation.
Isolation
Per-task isolated execution environment, workspace checkout, and branch or worktree per task.
Reproducibility
Reproducible dependency setup, timeout and resource constraints, cancellation, and resumability.
Evidence
Captured stdout, stderr, artifacts, diffs, structured logs, task timelines, and an audit log of tool actions.
State
Task state machine, persisted task records, step history, memory snapshots, retries with reason codes, and error taxonomy.
Validation
Test execution, verification commands, rollback strategy, and a policy for allowed tools and approval-required actions.
Measurement
Token, cost, latency, failure, and quality counters for each agent task.
Runner task states:
```text
created -> planned -> running -> testing -> review_ready
failed -> retry_planned -> running
blocked -> human_input_required
```
Custom gates
Custom gates encode repo-specific validation rules.
Example
```yaml
# gate
gate: auth-boundary-review
requires:
  - npm test
  - security-policy-check
  - human approval
```
Plugins
Plugins are extension points for tools, validators, report generators, or organization policy integrations.
Guidance
Keep plugins narrow. They should add a capability without hiding task evidence from reviewers.
Policy rules
Policy rules define allowed actions, rejected actions, and approval-required actions.
Examples
- Reject destructive commands unless approved.
- Require human approval for auth, payment, and production-change boundaries.
- Reject skipped tests unless a task report explains why.
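The three example rules above can be sketched as small decision functions. Everything here is an assumption made for illustration: the `Decision` enum, the list of destructive command prefixes, the boundary labels, and the function names are not Spellbook's policy format.

```python
import shlex
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    REQUIRE_APPROVAL = "require_approval"
    REJECT = "reject"


# Assumed examples of destructive commands; a real policy would define its own.
DESTRUCTIVE_PREFIXES = (
    ("rm", "-rf"),
    ("git", "push", "--force"),
    ("git", "reset", "--hard"),
)

# Boundaries named in the second rule above.
APPROVAL_BOUNDARIES = {"auth", "payment", "production-change"}


def decide(command: str, boundaries: set, approved: bool = False) -> Decision:
    """Classify a proposed command given the boundaries its task touches."""
    tokens = tuple(shlex.split(command))
    destructive = any(tokens[: len(prefix)] == prefix for prefix in DESTRUCTIVE_PREFIXES)
    if destructive and not approved:
        return Decision.REJECT
    if boundaries & APPROVAL_BOUNDARIES and not approved:
        return Decision.REQUIRE_APPROVAL
    return Decision.ALLOW


def unexplained_skips(skipped_tests: list, explanations: dict) -> list:
    """Return skipped tests that the task report does not explain."""
    return [name for name in skipped_tests if not explanations.get(name, "").strip()]
```

A runner could reject a task whenever `unexplained_skips` returns a non-empty list, which matches the third rule.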
Observability
Observability makes agent work measurable in the same way as other engineering work.
Signals
Track success rate, time to success, retries, failed attempts, human interventions, requirements satisfied, requirements missed, cost, latency, diff size, review corrections, and spec gaps.
CI integration
CI should validate the same gates agents run locally.
Pattern
Run `spellbook validate`, execute required test commands, attach reports to pull requests, and fail the build when required evidence is missing.
Agent evaluation
Evaluate agents by evidence quality, not just whether code compiles.
Metrics
Measure requirements coverage, test pass rate, retry count, review corrections, security findings, cost per task, and spec gaps discovered.
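Several of these metrics can be aggregated from per-task records. The sketch below assumes a hypothetical `TaskRecord` shape and a cost expressed in US dollars; neither is defined by Spellbook, and only a subset of the listed metrics is computed.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    """Hypothetical per-task record; the field names are assumptions."""
    task_id: str
    succeeded: bool
    retries: int
    cost_usd: float
    requirements_total: int
    requirements_satisfied: int


def summarize(records: list) -> dict:
    """Aggregate a few evaluation metrics across a batch of task records."""
    if not records:
        raise ValueError("at least one task record is required")
    count = len(records)
    total_requirements = sum(r.requirements_total for r in records)
    satisfied = sum(r.requirements_satisfied for r in records)
    return {
        "success_rate": sum(1 for r in records if r.succeeded) / count,
        "mean_retries": sum(r.retries for r in records) / count,
        "cost_per_task": sum(r.cost_usd for r in records) / count,
        # Coverage is computed over all requirements, not averaged per task.
        "requirements_coverage": satisfied / total_requirements if total_requirements else 0.0,
    }
```

Pooling requirements across tasks keeps a task with one requirement from weighing as much as a task with twenty.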