Why diff prompts at all
Small wording changes have outsized effects on model behaviour. "Be concise" and "Respond concisely" can produce noticeably different output styles. Reviewing the exact delta between a working prompt and a proposed change forces you to think about why each line is there.
Pair this tool with an eval set: when you change the prompt, run the eval, then decide whether the change is worth shipping. Over time, the "diff + eval" loop is how prompt quality accumulates.
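As a concrete illustration of that loop, here is a minimal sketch. Every name in it is hypothetical (`score`, `run_eval`, `EVAL_SET` are illustrative, not part of this tool or any real eval framework), and the grader is a stub:

```python
# Sketch of the "diff + eval" loop. `score` is a stub standing in for
# whatever grader you use (model call + rubric, exact match, etc.).

EVAL_SET = [
    {"input": "Summarize the release notes.", "expect": "short"},
    {"input": "Explain the refund policy.", "expect": "short"},
]

def score(prompt: str, case: dict) -> float:
    # Stub: in practice, send prompt + case["input"] to the model
    # and grade the response against case["expect"].
    return 1.0

def run_eval(prompt: str) -> float:
    return sum(score(prompt, c) for c in EVAL_SET) / len(EVAL_SET)

current = "You are a helpful assistant. Be concise."
proposed = "You are a helpful assistant. Respond concisely."

baseline, candidate = run_eval(current), run_eval(proposed)
print(f"baseline={baseline:.2f}  candidate={candidate:.2f}")
# Diff the two prompts, look at the scores, and only then decide to ship.
```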
FAQ
- Why a separate diff tool — can't I use git diff?
- You can. But prompts often live as runtime-assembled strings rather than files in a repo. Pasting two snapshots into one place lets you compare them without committing or creating a throwaway branch.
- Word-level highlighting?
- Not yet. A line-level diff catches the structural changes (an added section, a removed instruction), which is what matters for prompt iteration. Word-level highlighting adds noise without much signal.
- How big can the inputs be?
- The practical limit is a few thousand lines per side, since the LCS algorithm is O(M×N). Even full long-context system prompts (e.g., 100k tokens) stay snappy because their line counts stay modest. See the sketch after this list.
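For readers curious about that quadratic bound, here is a minimal sketch of a line-level LCS diff. It is not this tool's actual implementation; the function name `line_diff` and the `-`/`+`/space output markers are illustrative. The (M+1)×(N+1) dynamic-programming table is exactly where the O(M×N) cost comes from:

```python
def line_diff(old: str, new: str) -> list[str]:
    """Line-level diff via longest common subsequence (LCS)."""
    a, b = old.splitlines(), new.splitlines()
    m, n = len(a), len(b)
    # dp[i][j] = LCS length of a[i:] and b[j:]; this (M+1) x (N+1)
    # table is the source of the O(M*N) time and memory cost.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table, emitting unchanged ("  "), removed ("- "),
    # and added ("+ ") lines in order.
    out: list[str] = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append("  " + a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            out.append("- " + a[i])
            i += 1
        else:
            out.append("+ " + b[j])
            j += 1
    out.extend("- " + line for line in a[i:])
    out.extend("+ " + line for line in b[j:])
    return out

print("\n".join(line_diff("Be concise.\nUse markdown.",
                          "Respond concisely.\nUse markdown.")))
```

On the two wordings from the intro, this prints `- Be concise.` and `+ Respond concisely.` while the shared line stays unmarked, which is the line-level granularity the FAQ describes.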
Related tools
- Prompt Template Builder
Compose a prompt with named variables and see the rendered output side by side.
- System Prompt Analyzer
Static analysis that catches common prompt anti-patterns and surfaces token counts.
- Few-shot Examples Formatter
Drop in input/output pairs and get them rendered as XML, Q&A, JSON, or markdown few-shot blocks.