Model Selection
The single biggest performance lever is model choice.
| Priority | Model Type | Examples | Tradeoff |
|---|---|---|---|
| Speed | Small/fast | GPT-4.1-mini, Haiku, Gemini Flash | Faster, cheaper, less capable |
| Balance | Mid-tier | Sonnet, GPT-4.1 | Good for most tasks |
| Quality | Large | Opus, GPT-5.4 | Slower, expensive, best reasoning |
Switch mid-session with /model — use a fast model for exploration, switch to a capable model for complex changes.
Context Management
Monitor usage
> /context
Context: ~45000 tokens (23% of 200000 window)
Auto-compact at: 167000 tokens
Messages: 24
Manual compaction
If context is getting large, compact proactively:
> /compact
Freed ~12000 estimated tokens.
Start fresh when stuck
If the agent seems confused or repetitive, clear context:
> /clear
Or start a new session — previous sessions are auto-saved and can be resumed with /resume.
Cost Control
Set a budget
[api]
max_cost_usd = 5.0
The agent stops when the budget is reached. Check spending with /cost.
Reduce cost per turn
- Shorter prompts: be specific, avoid repeating instructions the agent already has
- Use AGENTS.md: persistent context loads once instead of being re-explained each session
- Smaller models for simple tasks:
/model gpt-4.1-minifor quick edits, switch back for complex work - Plan mode:
/planfor exploration uses only read tools (cheaper turns)
Token Budget
Enable budget tracking to get warnings before hitting limits:
[features]
token_budget = true
The agent shows a warning when approaching the auto-compact threshold.
Compaction Tuning
The default thresholds work well for most use cases. For very long sessions:
- Microcompact fires first and is free — it clears old tool results
- LLM summary costs one API call but frees significant context
- Context collapse is a last resort — removes middle messages entirely
If sessions are compacting too aggressively, consider using a model with a larger context window (200K+).
Tool Execution Speed
Streaming execution
Tools execute as soon as their input is parsed from the LLM response stream — they don't wait for the full response. This is automatic and requires no configuration.
Parallel reads
Read-only tools (FileRead, Grep, Glob, WebFetch) run in parallel. The agent naturally batches reads when exploring code. No tuning needed.
Bash timeout
Long-running shell commands can be given explicit timeouts:
{"command": "npm test", "timeout": 60000}
The default timeout is 120 seconds. Background mode (run_in_background: true) has no timeout.
Session Persistence
Sessions auto-save on exit. For long-running work:
- Use
/forkto create a checkpoint before risky changes - Use
/resume <id>to return to a previous state - Use
/exportor/shareto save a readable copy
Benchmarking
Run the built-in benchmarks to measure performance on your machine:
cargo bench # All benchmarks
cargo bench -- microcompact # Compaction only
cargo bench -- estimate_tokens # Token estimation only
Results with HTML reports are generated in target/criterion/.