Building A Token-Level “Diff” Inspector For Prompt Rewrites
Written by
Nova Neural
The itch that started this project
I kept seeing the same frustrating pattern while iterating on prompts: I’d “rewrite the prompt to be clearer,” run the generation again, and somehow the output would change in ways I couldn’t explain. Reading two prompts side-by-side didn’t help because most diffs were tiny, and the model’s internal reasoning isn’t visible.
So I built a small tool that answers a very specific question:
When I transform a prompt (via rewriting or templating), which exact tokens change, and how do the token boundaries shift?
To do that, I wrote a token-level “diff” inspector that:
- tokenizes an original prompt and a rewritten prompt using the same tokenizer,
- computes a diff at the token level (not character level),
- prints the changed token spans so I can see boundary shifts and the magnitude of change.
This isn’t a generic “prompt diff” tool; it’s a practical debugging lens for generative AI prompt rewrites where tokenization effects matter.
What I mean by token diff (quick, concrete)
A token is the chunk of text a language model actually processes internally. Tokenizers often split words into sub-parts (for example, unbelievable might become something like un, believable depending on the tokenizer).
So if you only compare text, you miss cases where:
- the rewritten prompt changes whitespace that affects token boundaries,
- punctuation or quotes shift tokens,
- a small edit causes a bigger tokenization ripple.
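To make the ripple effect concrete, here is a minimal sketch using a toy greedy longest-match tokenizer. The vocabulary and the `toy_tokenize` helper are made up for illustration; real tokenizers use learned merge rules and will split text differently, but the shape of the problem is the same: one inserted character can turn one token into several.

```python
# Hypothetical vocabulary, invented for this example only.
VOCAB = {"unbelievable", "un", "believable", "-", " ", "result", "an"}


def toy_tokenize(text, vocab=VOCAB):
    """Greedy longest-match split; unknown characters become single-char tokens."""
    tokens = []
    max_len = max(len(piece) for piece in vocab)
    i = 0
    while i < len(text):
        # Try the longest candidate first, fall back to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens


before = toy_tokenize("an unbelievable result")
after = toy_tokenize("an un-believable result")

print(before)  # one token covers "unbelievable"
print(after)   # the hyphen splits it into three tokens
```

A character diff reports a single inserted hyphen here, while the token sequence grows by two tokens.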
Working setup
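I implemented this using the Hugging Face transformers library. The code below loads the tokenizer for meta-llama/Llama-3.1-8B-Instruct, a commonly used chat-capable model. That repository is gated on the Hugging Face Hub, so you need to accept its license and authenticate before the download works; otherwise, swap in the tokenizer of any model you already have access to.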
Install dependencies
pip install -U transformers accelerate
Step-by-step: a token-level diff inspector
1) The core script
Save as token_diff_inspector.py:
import difflib
from dataclasses import dataclass
from typing import List, Tuple

from transformers import AutoTokenizer

# One diff operation: (tag, (i1, i2), (j1, j2)).
DiffOp = Tuple[str, Tuple[int, int], Tuple[int, int]]


@dataclass
class TokenizedPrompt:
    text: str
    token_ids: List[int]
    tokens: List[str]


def tokenize_prompt(tokenizer, prompt: str) -> TokenizedPrompt:
    """
    Tokenizes the prompt and returns both token ids and their string forms.

    Why we store both:
    - token ids are what the model consumes,
    - token strings make the diff readable.
    """
    token_ids = tokenizer.encode(prompt, add_special_tokens=False)
    tokens = tokenizer.convert_ids_to_tokens(token_ids)
    return TokenizedPrompt(text=prompt, token_ids=token_ids, tokens=tokens)


def format_token(tokens: List[str], token_ids: List[int], idx: int) -> str:
    """
    Formats a token as a human-readable snippet including its string form and id.
    """
    return f"{tokens[idx]!r}({token_ids[idx]})"


def token_diff(orig: TokenizedPrompt, rewritten: TokenizedPrompt) -> List[DiffOp]:
    """
    Computes a diff at the token string level.

    Returns a list of (op, (i1, i2), (j1, j2)) tuples where:
    - op is one of 'equal', 'replace', 'delete', 'insert'
    - (i1, i2) is the index range in the original token sequence
    - (j1, j2) is the index range in the rewritten token sequence
    """
    sm = difflib.SequenceMatcher(a=orig.tokens, b=rewritten.tokens)
    ops = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        # Keep both ranges so the printer can slice either sequence.
        ops.append((tag, (i1, i2), (j1, j2)))
    return ops


def print_diff(orig: TokenizedPrompt, rewritten: TokenizedPrompt) -> None:
    """
    Prints a token-level diff, one block per changed span.

    This is where most of the value comes from:
    seeing how token boundaries shift after rewriting.
    """
    ops = token_diff(orig, rewritten)

    print("=== ORIGINAL PROMPT ===")
    print(orig.text)
    print("\n=== REWRITTEN PROMPT ===")
    print(rewritten.text)
    print("\n=== TOKEN DIFF (token strings) ===")

    for tag, (i1, i2), (j1, j2) in ops:
        if tag == "equal":
            # Skip unchanged spans to avoid massive output.
            continue

        if tag == "replace":
            print("\n[REPLACE]")
            print(f"Original tokens [{i1}:{i2}]:")
            for idx in range(i1, i2):
                print(" -", format_token(orig.tokens, orig.token_ids, idx))
            print(f"Rewritten tokens [{j1}:{j2}]:")
            for idx in range(j1, j2):
                print(" -", format_token(rewritten.tokens, rewritten.token_ids, idx))
        elif tag == "delete":
            print("\n[DELETE]")
            print(f"Original tokens [{i1}:{i2}]:")
            for idx in range(i1, i2):
                print(" -", format_token(orig.tokens, orig.token_ids, idx))
        elif tag == "insert":
            print("\n[INSERT]")
            print(f"Rewritten tokens [{j1}:{j2}]:")
            for idx in range(j1, j2):
                print(" -", format_token(rewritten.tokens, rewritten.token_ids, idx))

    # Also print a token count summary.
    print("\n=== SUMMARY ===")
    print(f"Original token count: {len(orig.token_ids)}")
    print(f"Rewritten token count: {len(rewritten.token_ids)}")
    print(f"Token strings equal? {orig.tokens == rewritten.tokens}")


def main():
    # Only the tokenizer is loaded; the model weights are not needed here.
    # You can swap this to another chat model; the diff will still work
    # as long as the tokenizer exists.
    model_name = "meta-llama/Llama-3.1-8B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

    # Original prompt: short and slightly ambiguous
    prompt_a = (
        "Write a marketing email for a new AI code assistant. "
        "Keep it friendly."
    )

    # Rewritten prompt: more constraints, includes structure
    prompt_b = (
        "Write a friendly marketing email for a new AI code assistant.\n"
        "Constraints:\n"
        "- 90 to 120 words\n"
        "- Use a single short hook sentence\n"
        "- Include one concrete example of how the assistant helps\n"
        "- End with a clear call to action"
    )

    orig = tokenize_prompt(tokenizer, prompt_a)
    rewritten = tokenize_prompt(tokenizer, prompt_b)
    print_diff(orig, rewritten)


if __name__ == "__main__":
    main()
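If the opcode tuples are unfamiliar, the following sketch shows what `difflib.SequenceMatcher.get_opcodes()` returns for two short token lists. The token strings are invented for the example and do not come from a real tokenizer. I pass `autojunk=False` as a precaution, since the default heuristic can treat very frequent items as junk once the second sequence reaches 200 items, which may matter for long prompts.

```python
import difflib

# Made-up token strings standing in for real tokenizer output.
orig_tokens = ["Keep", " it", " short", " and", " friendly", "."]
new_tokens = ["Keep", " it", " friendly", "!"]

sm = difflib.SequenceMatcher(a=orig_tokens, b=new_tokens, autojunk=False)
ops = sm.get_opcodes()

for tag, i1, i2, j1, j2 in ops:
    # a[i1:i2] is the original span, b[j1:j2] is the rewritten span.
    print(f"{tag:8} {orig_tokens[i1:i2]} -> {new_tokens[j1:j2]}")

# Only the spans that differ, which is what the inspector prints.
changed = [op for op in ops if op[0] != "equal"]
```

Each opcode pairs a slice of the original sequence with a slice of the rewritten one, so an `insert` has an empty original slice and a `delete` has an empty rewritten slice.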
2) Run it
python token_diff_inspector.py
You’ll get output showing:
- the original prompt,
- the rewritten prompt,
- the token spans that changed ([REPLACE], [DELETE], [INSERT]),
- token counts for quick sanity checks.
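The script prints changed spans as separate blocks. If you would rather see the two sequences next to each other, one possible variation is a row-per-token view. This is a sketch, not part of the script above: `aligned_rows` is a hypothetical helper, and the token strings are invented.

```python
import difflib
from typing import List, Tuple

MARKERS = {"equal": " ", "replace": "~", "delete": "-", "insert": "+"}


def aligned_rows(a: List[str], b: List[str]) -> List[Tuple[str, str, str]]:
    """Returns (marker, original_token, rewritten_token) rows; gaps are ''."""
    sm = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    rows = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        left, right = a[i1:i2], b[j1:j2]
        # Pad the shorter side so every row has both columns.
        for k in range(max(len(left), len(right))):
            l_tok = left[k] if k < len(left) else ""
            r_tok = right[k] if k < len(right) else ""
            rows.append((MARKERS[tag], l_tok, r_tok))
    return rows


# Made-up token strings standing in for real tokenizer output.
a = ["Write", " a", " marketing", " email", "."]
b = ["Write", " a", " friendly", " marketing", " email", ":"]

rows = aligned_rows(a, b)
for marker, l_tok, r_tok in rows:
    print(f"{marker} {l_tok!r:16} {r_tok!r}")
```

Using `repr` for each token keeps leading spaces and newlines visible, which is usually where the interesting differences are.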
Why this works (and what it catches)
When I first built this, I expected a prompt-rewrite diff to be mostly about text changes. The token diff revealed something else: token boundaries shift because whitespace, newlines, punctuation, and quotes are tokenized too, and they are often merged into neighboring tokens, so editing them changes how the surrounding text is split.
Two practical things I found when I used this on prompt rewrites:
1) “Same meaning” rewrites still change many tokens
Even when a rewrite is conceptually similar, adding formatting (like bullet points or line breaks) can create many new tokens, sometimes enough to change the model’s behavior because it sees more structure.
2) Small punctuation edits can cause disproportionate token changes
Replacing a period with something like :\n- can change how the neighboring text is split or merged into tokens. The token diff surfaces those boundary changes immediately, because the token sequences differ even when the characters look almost the same.
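One way to put a number on "disproportionate" is to count how many tokens were removed and added, alongside a similarity ratio. The sketch below assumes you already have two token lists; `change_summary` is a hypothetical helper, and the token strings are invented rather than produced by a real tokenizer.

```python
import difflib
from typing import Dict, List


def change_summary(orig_tokens: List[str], new_tokens: List[str]) -> Dict[str, float]:
    """Counts removed/added tokens and reports difflib's similarity ratio."""
    sm = difflib.SequenceMatcher(a=orig_tokens, b=new_tokens, autojunk=False)
    removed = 0
    added = 0
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag in ("replace", "delete"):
            removed += i2 - i1
        if tag in ("replace", "insert"):
            added += j2 - j1
    # ratio() is 2 * matches / (len(a) + len(b)); 1.0 means identical.
    return {"removed": removed, "added": added, "similarity": sm.ratio()}


# Made-up tokens: a period replaced by a colon, a newline, and a bullet.
orig_tokens = ["Keep", " it", " friendly", "."]
new_tokens = ["Keep", " it", " friendly", ":", "\n", "-"]

summary = change_summary(orig_tokens, new_tokens)
print(summary)
```

In this invented case a one-token punctuation change removes one token and adds three, which is the kind of imbalance worth looking at more closely.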
Token diff on chat templates (the part I almost missed)
A common gotcha in generative AI is that many models expect chat-formatted input (system/user/assistant roles). The tokenizer's chat template wraps each message in role markers and extra formatting, so the token sequence the model actually sees can differ from what you get by tokenizing the raw prompt text.
For that reason, I also experimented with token diff using the tokenizer’s chat template. Here’s a minimal version you can run separately.
chat_token_diff.py
from transformers import AutoTokenizer


def tokenize_with_chat_template(tokenizer, messages):
    """
    Applies the tokenizer's chat template to messages,
    then tokenizes the resulting string.
    """
    rendered = tokenizer.apply_chat_template(messages, tokenize=False)
    token_ids = tokenizer.encode(rendered, add_special_tokens=False)
    tokens = tokenizer.convert_ids_to_tokens(token_ids)
    return rendered, token_ids, tokens


def main():
    model_name = "meta-llama/Llama-3.1-8B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

    messages_a = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the impact of RAG in one paragraph."},
    ]
    messages_b = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the impact of Retrieval-Augmented Generation (RAG) in one paragraph."},
    ]

    rendered_a, ids_a, toks_a = tokenize_with_chat_template(tokenizer, messages_a)
    rendered_b, ids_b, toks_b = tokenize_with_chat_template(tokenizer, messages_b)

    print("=== RENDERED CHAT A ===")
    print(rendered_a)
    print("\n=== RENDERED CHAT B ===")
    print(rendered_b)

    print("\nToken count A:", len(ids_a))
    print("Token count B:", len(ids_b))
    print("Token strings equal?", toks_a == toks_b)


if __name__ == "__main__":
    main()
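This version helps when your “rewrite” changes only the user content, because the full rendered chat string (including role markers and formatting) is what actually gets tokenized. As written, it only compares token counts and equality rather than printing the changed spans.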
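Since two rendered chats usually share the same template tokens at the start and end, it can help to strip that shared context and look only at the region that changed. The sketch below is one way to do that under the assumption that you have both token lists in hand; `trim_common` is a hypothetical helper, and the role markers such as `<sys>` and `<user>` are placeholders, not the markers any real chat template emits.

```python
from typing import List, Tuple


def trim_common(a: List[str], b: List[str]) -> Tuple[int, List[str], List[str]]:
    """Drops the shared prefix and suffix; returns (prefix_len, a_mid, b_mid)."""
    limit = min(len(a), len(b))
    start = 0
    while start < limit and a[start] == b[start]:
        start += 1
    end = 0
    # Stop before the suffix overlaps the prefix we already matched.
    while end < limit - start and a[-1 - end] == b[-1 - end]:
        end += 1
    return start, a[start:len(a) - end], b[start:len(b) - end]


# Placeholder tokens; real templates and token strings will differ.
toks_a = ["<sys>", "You", " are", " helpful", "<user>", "Summarize", " RAG", "<end>"]
toks_b = ["<sys>", "You", " are", " helpful", "<user>", "Summarize",
          " Retrieval", "-", "Augmented", " Generation", " (", "RAG", ")", "<end>"]

prefix_len, mid_a, mid_b = trim_common(toks_a, toks_b)
print(f"shared prefix: {prefix_len} tokens")
print("original span: ", mid_a)
print("rewritten span:", mid_b)
```

The trimmed spans can then be passed to a token diff, so the output stays focused on the user content instead of the template scaffolding.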
What I learned building this
Building a token-level diff inspector changed how I debug generative AI prompts. Instead of guessing why outputs changed, I could see whether the rewrite:
- introduced new structural tokens (like newlines and bullets),
- shifted token boundaries due to punctuation/spacing,
- produced large token count changes that might alter the model’s behavior.
In short: token diffs gave me a concrete, repeatable way to inspect prompt rewrites, especially in formatting-heavy prompts where “minor” text edits often have major tokenization consequences.