Building A Token-Level “Diff” Inspector For Prompt Rewrites

The itch that started this project

I kept seeing the same frustrating pattern while iterating on prompts: I’d “rewrite the prompt to be clearer,” run the generation again, and somehow the output would change in ways I couldn’t explain. Reading two prompts side-by-side didn’t help because most diffs were tiny, and the model’s internal reasoning isn’t visible.

So I built a small tool that answers a very specific question:

When I transform a prompt (via rewriting or templating), which tokens actually change, and how do the token boundaries shift?

To do that, I wrote a token-level “diff” inspector that:

tokenizes an original prompt and a rewritten prompt using the same tokenizer,
computes a diff at the token level (not character level),
prints the changed token spans, with token ids, so I can see boundary shifts and the magnitude of the change.

This isn’t a generic “prompt diff” tool. It’s a practical debugging lens for generative AI prompt rewrites where tokenization effects matter.

What I mean by token diff (quick, concrete)

A token is the chunk of text a language model actually processes internally. Tokenizers often split words into sub-parts (for example, unbelievable might become something like un, believable depending on the tokenizer).

So if you only compare text, you miss cases where:

the rewritten prompt changes whitespace that affects token boundaries,
punctuation or quotes shift tokens,
a small edit causes a bigger tokenization ripple.
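To make the whitespace case concrete without downloading anything, here is a minimal sketch using a toy tokenizer. The splitting rule in toy_tokenize is my own stand-in (a word or punctuation mark with any preceding whitespace glued to it), not a real model vocabulary, so treat the exact tokens as illustrative only.

```python
import re
from typing import List


def toy_tokenize(text: str) -> List[str]:
    # Hypothetical rule, not a real vocabulary: a token is a word or a single
    # punctuation mark, with any whitespace before it glued to its front.
    return re.findall(r"\s*(?:\w+|[^\w\s])", text)


tokens_a = toy_tokenize("Keep it friendly.")
tokens_b = toy_tokenize("Keep it  friendly .")

# Count positions where the token at the same index differs.
changed = sum(x != y for x, y in zip(tokens_a, tokens_b))

print(tokens_a)  # ['Keep', ' it', ' friendly', '.']
print(tokens_b)  # ['Keep', ' it', '  friendly', ' .']
print(changed)   # 2
```

Two extra spaces are easy to miss when reading the text, yet half of the tokens in this toy example are different. A real subword tokenizer will split differently, but the same kind of ripple is what the inspector is meant to surface.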

Working setup

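I implemented this using the Hugging Face transformers library. The code below uses a tokenizer from a commonly used chat-capable model. Note that the meta-llama repositories on the Hugging Face Hub are gated, so you need to accept the license and authenticate (for example with huggingface-cli login) before the tokenizer will download.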

Install dependencies

pip install -U transformers accelerate

Step-by-step: a token-level diff inspector

1) The core script

Save as token_diff_inspector.py:

import difflib
from dataclasses import dataclass
from typing import List, Tuple

from transformers import AutoTokenizer


@dataclass
class TokenizedPrompt:
    text: str
    token_ids: List[int]
    tokens: List[str]


def tokenize_prompt(tokenizer, prompt: str) -> TokenizedPrompt:
    """
    Tokenizes the prompt and returns both token ids and their string forms.

    Why we store both:
    - token ids are what the model consumes,
    - token strings make the diff readable.
    """
    token_ids = tokenizer.encode(prompt, add_special_tokens=False)
    tokens = tokenizer.convert_ids_to_tokens(token_ids)
    return TokenizedPrompt(text=prompt, token_ids=token_ids, tokens=tokens)


def format_token(tokens: List[str], token_ids: List[int], idx: int) -> str:
    """
    Formats a token as a human-readable snippet including its string form and id.
    """
    return f"{tokens[idx]!r}({token_ids[idx]})"


def token_diff(
    orig: TokenizedPrompt, rewritten: TokenizedPrompt
) -> List[Tuple[str, Tuple[int, int], Tuple[int, int]]]:
    """
    Computes a diff at the token string level.

    Returns a list of (op, (i1, i2), (j1, j2)) tuples where:
    - op is one of 'equal', 'replace', 'delete', 'insert'
    - (i1, i2) is the half-open index range in the original token sequence
    - (j1, j2) is the half-open index range in the rewritten token sequence
    """
    # autojunk=False turns off difflib's "popular element" heuristic, which can
    # otherwise treat frequent tokens as junk once a sequence reaches 200 items.
    sm = difflib.SequenceMatcher(a=orig.tokens, b=rewritten.tokens, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        # Keep both ranges so the printer can slice either sequence.
        ops.append((tag, (i1, i2), (j1, j2)))
    return ops


def print_diff(orig: TokenizedPrompt, rewritten: TokenizedPrompt) -> None:
    """
    Prints a token-level diff, one block per changed span.

    This is where most of the value comes from:
    seeing how token boundaries shift after rewriting.
    """
    ops = token_diff(orig, rewritten)

    print("=== ORIGINAL PROMPT ===")
    print(orig.text)
    print("\n=== REWRITTEN PROMPT ===")
    print(rewritten.text)
    print("\n=== TOKEN DIFF (token strings) ===")

    # Print one block per changed span.
    for tag, (i1, i2), (j1, j2) in ops:
        if tag == "equal":
            # Skip unchanged spans entirely to keep the output short.
            continue

        if tag == "replace":
            print("\n[REPLACE]")
            print(f"Original tokens [{i1}:{i2}]:")
            for idx in range(i1, i2):
                print("  -", format_token(orig.tokens, orig.token_ids, idx))
            print(f"Rewritten tokens [{j1}:{j2}]:")
            for idx in range(j1, j2):
                print("  -", format_token(rewritten.tokens, rewritten.token_ids, idx))

        elif tag == "delete":
            print("\n[DELETE]")
            print(f"Original tokens [{i1}:{i2}]:")
            for idx in range(i1, i2):
                print("  -", format_token(orig.tokens, orig.token_ids, idx))

        elif tag == "insert":
            print("\n[INSERT]")
            print(f"Rewritten tokens [{j1}:{j2}]:")
            for idx in range(j1, j2):
                print("  -", format_token(rewritten.tokens, rewritten.token_ids, idx))

    # Also print token counts summary
    print("\n=== SUMMARY ===")
    print(f"Original token count:  {len(orig.token_ids)}")
    print(f"Rewritten token count: {len(rewritten.token_ids)}")
    print(f"Token strings equal? {orig.tokens == rewritten.tokens}")


def main():
    # Only the tokenizer is loaded here; the model weights are never downloaded.
    # You can swap in another model id; the diff works as long as the repo ships a tokenizer.
    model_name = "meta-llama/Llama-3.1-8B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

    # Original prompt: short and slightly ambiguous
    prompt_a = (
        "Write a marketing email for a new AI code assistant. "
        "Keep it friendly."
    )

    # Rewritten prompt: more constraints, includes structure
    prompt_b = (
        "Write a friendly marketing email for a new AI code assistant.\n"
        "Constraints:\n"
        "- 90 to 120 words\n"
        "- Use a single short hook sentence\n"
        "- Include one concrete example of how the assistant helps\n"
        "- End with a clear call to action"
    )

    orig = tokenize_prompt(tokenizer, prompt_a)
    rewritten = tokenize_prompt(tokenizer, prompt_b)

    print_diff(orig, rewritten)


if __name__ == "__main__":
    main()

2) Run it

python token_diff_inspector.py

You’ll get output showing:

the original prompt,
the rewritten prompt,
the token spans that changed ([REPLACE], [DELETE], [INSERT]),
token counts for quick sanity checks.
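The per-span blocks that print_diff emits list the original and rewritten tokens one after the other. When I want to scan boundary shifts quickly, a two-column layout can be easier to read. The following is a minimal sketch of that idea, not part of the script above: side_by_side is a hypothetical helper, and the token lists are hand-written examples, not real tokenizer output.

```python
import difflib
from itertools import zip_longest
from typing import List


def side_by_side(orig_tokens: List[str], new_tokens: List[str], width: int = 14) -> List[str]:
    """Return one row per token pair; rows in changed spans start with '!'."""
    sm = difflib.SequenceMatcher(a=orig_tokens, b=new_tokens, autojunk=False)
    rows = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        marker = " " if tag == "equal" else "!"
        # Pad the shorter side of a span with blanks so the columns stay paired.
        for left, right in zip_longest(orig_tokens[i1:i2], new_tokens[j1:j2]):
            left_cell = repr(left) if left is not None else ""
            right_cell = repr(right) if right is not None else ""
            rows.append(f"{marker} {left_cell:<{width}} | {right_cell}".rstrip())
    return rows


# Hand-written token lists for illustration only.
rows = side_by_side(
    ["Keep", " it", " friendly", "."],
    ["Keep", " it", " friendly", ":", "\n", "-"],
)
print("\n".join(rows))
```

Using repr for each cell keeps leading spaces and newline tokens visible, which is the whole point when the change is mostly whitespace.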

Why this works (and what it catches)

When I first built this, I expected “prompt rewrite diff” to be mostly about text changes. Token diff revealed something else: token boundaries can shift because whitespace, newlines, punctuation, and quotes are themselves tokenized, and in many BPE vocabularies a leading space is part of the token that follows it.

Two practical things I found when I used this on prompt rewrites:

1) “Same meaning” rewrites still change many tokens

Even when a rewrite is conceptually similar, adding formatting (like bullet points or line breaks) can create many new tokens. The model conditions on every token it receives, so that extra structure is a real change to its input and can be enough to shift its behavior.
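To put a number on “many tokens,” the opcodes from difflib can be rolled up into simple counts. This is a sketch under my own assumptions: change_stats is a hypothetical helper, and the example lists are whitespace-split words standing in for real tokens.

```python
import difflib
from typing import Dict, List


def change_stats(orig_tokens: List[str], new_tokens: List[str]) -> Dict[str, int]:
    """Count tokens kept, removed from the original, and added in the rewrite."""
    sm = difflib.SequenceMatcher(a=orig_tokens, b=new_tokens, autojunk=False)
    stats = {"kept": 0, "removed": 0, "added": 0}
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            stats["kept"] += i2 - i1
        else:
            # replace, delete, and insert all reduce to "some removed, some added".
            stats["removed"] += i2 - i1
            stats["added"] += j2 - j1
    return stats


# Whitespace-split stand-ins for tokens, for illustration only.
orig = "Write a marketing email . Keep it friendly .".split()
new = "Write a friendly marketing email . - 90 to 120 words".split()

stats = change_stats(orig, new)
print(stats)  # {'kept': 5, 'removed': 4, 'added': 6}
```

A useful sanity check is that kept plus removed equals the original length and kept plus added equals the rewritten length.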

2) Small punctuation edits can cause disproportionate token changes

Replacing a period with something like :\n- can split or merge tokens around the edit, so the tokenizer may regroup neighboring characters as well. The token diff shows those boundary changes immediately because the tokens differ even when the characters look almost the same.
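One rough way to see “disproportionate” is to compare similarity at the character level with similarity at the token level. The sketch below uses difflib’s ratio() for both. The token lists are a hypothetical tokenization I wrote by hand, so the exact numbers only illustrate the gap and say nothing about a specific model.

```python
import difflib

text_a = "Keep it friendly."
text_b = "Keep it friendly:\n-"

# Hypothetical tokenization, written by hand for illustration.
tokens_a = ["Keep", " it", " friendly", "."]
tokens_b = ["Keep", " it", " friendly", ":", "\n", "-"]

# ratio() is 2 * matches / (len(a) + len(b)), so 1.0 means identical.
char_ratio = difflib.SequenceMatcher(a=text_a, b=text_b, autojunk=False).ratio()
token_ratio = difflib.SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False).ratio()

print(round(char_ratio, 3))   # 0.889
print(round(token_ratio, 3))  # 0.6
```

The strings are almost 89% similar as characters, but only 60% similar as token sequences under this toy split.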

Token diff on chat templates (the part I almost missed)

A common gotcha in generative AI is that many models expect chat-formatted input (system/user/assistant roles). The tokenizer’s chat template wraps each message in role markers and special tokens, so the token sequence the model sees differs depending on whether you tokenize raw text or the rendered chat messages.

For that reason, I also experimented with token diff using the tokenizer’s chat template. Here’s a minimal version you can run separately.

Save as chat_token_diff.py:

from transformers import AutoTokenizer

def tokenize_with_chat_template(tokenizer, messages):
    """
    Renders messages through the tokenizer's chat template, then tokenizes
    the rendered string.
    """
    rendered = tokenizer.apply_chat_template(messages, tokenize=False)
    # The rendered string already contains whatever special tokens the template
    # emits, so add_special_tokens=False avoids adding them a second time.
    token_ids = tokenizer.encode(rendered, add_special_tokens=False)
    tokens = tokenizer.convert_ids_to_tokens(token_ids)
    return rendered, token_ids, tokens

def main():
    model_name = "meta-llama/Llama-3.1-8B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

    messages_a = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the impact of RAG in one paragraph."},
    ]

    messages_b = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the impact of Retrieval-Augmented Generation (RAG) in one paragraph."},
    ]

    rendered_a, ids_a, toks_a = tokenize_with_chat_template(tokenizer, messages_a)
    rendered_b, ids_b, toks_b = tokenize_with_chat_template(tokenizer, messages_b)

    print("=== RENDERED CHAT A ===")
    print(rendered_a)
    print("\n=== RENDERED CHAT B ===")
    print(rendered_b)

    print("\nToken count A:", len(ids_a))
    print("Token count B:", len(ids_b))
    print("Token strings equal?", toks_a == toks_b)

if __name__ == "__main__":
    main()

This version helps when your “rewrite” changes only the user content but the full rendered chat string (including role markers and formatting) is what actually gets tokenized.
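Because the template scaffolding is identical in both renders, most of the two token sequences match, and the interesting part is the window in the middle. Here is a minimal sketch for isolating that window by trimming the shared prefix and suffix. changed_window is a hypothetical helper, and the role markers in the example (<sys>, <user>, <end>) are made up for illustration; they are not the markers any real chat template emits.

```python
from typing import List, Tuple


def changed_window(a: List[str], b: List[str]) -> Tuple[int, int, int]:
    """Return (start, a_end, b_end) so that a[start:a_end] and b[start:b_end]
    are the only parts that differ; everything outside is shared."""
    limit = min(len(a), len(b))
    prefix = 0
    while prefix < limit and a[prefix] == b[prefix]:
        prefix += 1
    suffix = 0
    # Stop before the suffix overlaps the prefix on the shorter sequence.
    while suffix < limit - prefix and a[-1 - suffix] == b[-1 - suffix]:
        suffix += 1
    return prefix, len(a) - suffix, len(b) - suffix


# Made-up role markers and tokens, for illustration only.
toks_a = ["<sys>", "You", " are", " helpful", "<user>", "Summarize", " RAG", "<end>"]
toks_b = ["<sys>", "You", " are", " helpful", "<user>", "Summarize",
          " Retrieval", "-", "Augmented", " Generation", " (", "RAG", ")", "<end>"]

start, a_end, b_end = changed_window(toks_a, toks_b)
print(toks_a[start:a_end])  # [' RAG']
print(toks_b[start:b_end])
```

With real tokenizer output, the same trimming tells you whether a rewrite touched only the user content or also disturbed tokens at the edge of the template markers.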

What I learned building this

Building a token-level diff inspector changed how I debug generative AI prompts. Instead of guessing why outputs changed, I could see whether the rewrite:

introduced new structural tokens (like newlines and bullets),
shifted token boundaries due to punctuation/spacing,
produced large token count changes that might alter the model’s behavior.

In short: token diffs gave me a concrete, repeatable way to inspect prompt rewrites, especially in formatting-heavy prompts where “minor” text edits often have major tokenization consequences.