Distributed Systems & CryptographyJuly 2, 2026

Building A Verifiable Merkle-Map For Off-Chain Digital Provenance Receipts

C

Written by

Cipher Stone

The problem I ran into

I was building a provenance pipeline where files (PDFs, videos, build artifacts) were stored off-chain (S3-like storage), but I still needed a cryptographically verifiable “receipt” that proves:

  1. Which exact file bytes were involved.
  2. When the pipeline decided the file was part of a provenance event.
  3. That the receipt didn’t get tampered with—even if someone later replaced or reordered files off-chain.

Blockchains are great for anchoring data, but anchoring every file byte directly on-chain is a non-starter. So I ended up with a niche but practical infrastructure piece:

I built a verifiable Merkle-Map receipt format that can bundle many off-chain file hashes, produce inclusion proofs, and anchor the Merkle root on-chain.

Along the way I also discovered a subtle pitfall: many “Merkle proof” examples ignore how leaves are keyed. If leaf positions depend on array order, reordering breaks verification even when the set is identical. A Merkle-Map avoids that by using a keyed leaf position derived from something stable (like a content identifier).

Below is exactly what I implemented: a deterministic Merkle-Map builder, a receipt serializer, and code to verify inclusion proofs.


Background: what I mean by “Merkle-Map”

A Merkle tree is a hash-based authenticated data structure. It commits to many items using only a root hash. A Merkle proof shows that a specific item is included under that root.

A Merkle-Map is the keyed variant: each entry is mapped to a leaf position based on a key (not an array index). That means the same set of (key, value) pairs yields the same root regardless of ordering.

In this post, my leaves represent:

  • key: a stable identifier derived from the provenance event metadata + file identifier (so it’s deterministic)
  • value: the hash of the file bytes (off-chain content)

Then I:

  • build the Merkle root over all leaves
  • generate inclusion proofs per file
  • anchor the Merkle root on-chain (shown as the “receipt” format; the chain anchoring part is typically a transaction call)

The specific receipt format I used

Each provenance receipt contains:

  • eventId: string (human-friendly ID)
  • timestamp: integer (unix seconds)
  • items: list of (fileId, fileHash)
  • root: Merkle-Map root hash
  • optional per-item proofs: inclusion proof for each (key, fileHash)

The key design choice: leaf key = hash of a “leaf key material” string, not the fileId directly. This lets me version the receipt format later without breaking everything.

Leaf key material I used:

leafKeyMaterial = "v1|eventId=<eventId>|fileId=<fileId>"

Leaf hash value I used:

leafValue = SHA256(fileBytes)

Then I map keys into the Merkle-Map.


Step-by-step implementation (deterministic Merkle-Map)

This implementation uses:

  • SHA-256
  • a binary Merkle tree of fixed depth
  • leaves keyed by the hash-bitstring of leafKeyMaterial

What “fixed depth” means here

I chose depth D = 20 (about a million possible leaf positions). For the demo below, I keep it small but deterministic.

If two different keys map to the same leaf, they overwrite—so I include a collision check. In production systems I usually:

  • increase depth
  • or use a different structure (like a sparse Merkle tree with non-overlapping paths)
  • or treat collisions explicitly by using secondary hashing

For this blog post I’ll keep it deterministic and transparent.


Working code: Merkle-Map builder + inclusion proofs

import hashlib import json from dataclasses import dataclass from typing import Dict, List, Tuple, Optional def sha256(data: bytes) -> bytes: return hashlib.sha256(data).digest() def hex32(b: bytes) -> str: return b.hex() @dataclass class Proof: index: int # leaf index in the fixed-depth tree key: str # leaf key material (debug-friendly) value_hash: str # hex hash of file bytes path: List[str] # sibling hashes from leaf to root (hex) @dataclass class Receipt: eventId: str timestamp: int depth: int root: str entries: List[Tuple[str, str]] # (fileId, fileHashHex) proofs: Optional[Dict[str, Proof]] = None # fileId -> proof def leaf_index_from_key(leaf_key_material: str, depth: int) -> int: """ Deterministically maps a leaf key to an index using the first `depth` bits of SHA256(leaf_key_material). """ key_hash = sha256(leaf_key_material.encode("utf-8")) bits_needed = depth idx = 0 consumed = 0 for byte in key_hash: for bitpos in range(7, -1, -1): if consumed >= bits_needed: break idx = (idx << 1) | ((byte >> bitpos) & 1) consumed += 1 if consumed >= bits_needed: break return idx def hash_leaf(value_hash: bytes) -> bytes: """ Domain-separated leaf hash. """ return sha256(b"leaf|" + value_hash) def hash_internal(left: bytes, right: bytes) -> bytes: """ Domain-separated internal node hash. """ return sha256(b"node|" + left + b"|" + right) def build_merkle_map(entries: List[Tuple[str, str]], eventId: str, depth: int) -> Tuple[bytes, Dict[int, bytes]]: """ entries: list of (fileId, fileHashHex) where fileHashHex = SHA256(fileBytes).hex() Returns: (root, leaf_map) where leaf_map[index] = leaf_digest """ # leaf_map maps fixed leaf index -> hashed leaf leaf_map: Dict[int, bytes] = {} for fileId, fileHashHex in entries: file_hash = bytes.fromhex(fileHashHex) leaf_key_material = f"v1|eventId={eventId}|fileId={fileId}" idx = leaf_index_from_key(leaf_key_material, depth) leaf_digest = hash_leaf(file_hash) if idx in leaf_map: # Deterministic collision check: in a keyed map collision means two keys hit same leaf # This is unlikely with enough depth but still possible. if leaf_map[idx] != leaf_digest: raise ValueError(f"Collision at leaf index {idx}: two distinct leaves map to same position.") leaf_map[idx] = leaf_digest # Build the tree bottom-up for the fixed-depth binary tree. size = 1 << depth level: List[bytes] = [hash_leaf(b"\x00" * 32) for _ in range(size)] # default empty leaf for idx, leaf_digest in leaf_map.items(): level[idx] = leaf_digest # Now iteratively compute internal levels for d in range(depth, 0, -1): next_level = [] for i in range(0, len(level), 2): left = level[i] right = level[i + 1] next_level.append(hash_internal(left, right)) level = next_level root = level[0] return root, leaf_map def generate_proof(fileId: str, fileHashHex: str, eventId: str, depth: int, leaf_map: Dict[int, bytes]) -> Proof: """ Generate inclusion proof for the given file entry. """ file_hash = bytes.fromhex(fileHashHex) leaf_key_material = f"v1|eventId={eventId}|fileId={fileId}" idx = leaf_index_from_key(leaf_key_material, depth) if idx not in leaf_map: raise ValueError("Entry not found in leaf_map (proof cannot be generated).") size = 1 << depth level: List[bytes] = [hash_leaf(b"\x00" * 32) for _ in range(size)] for i, leaf_digest in leaf_map.items(): level[i] = leaf_digest path: List[str] = [] index = idx # Build path: sibling hashes from leaf up to root for _ in range(depth): sibling_index = index ^ 1 path.append(hex32(level[sibling_index])) # move to parent index //= 2 # recompute next level next_level = [] for i in range(0, len(level), 2): next_level.append(hash_internal(level[i], level[i + 1])) level = next_level return Proof( index=idx, key=leaf_key_material, value_hash=fileHashHex, path=path ) def verify_proof(proof: Proof, root_hex: str, depth: int) -> bool: """ Verify inclusion proof by recomputing root from leaf and sibling path. """ value_hash = bytes.fromhex(proof.value_hash) current = hash_leaf(value_hash) index = proof.index for sibling_hex in proof.path: sibling = bytes.fromhex(sibling_hex) if index % 2 == 0: # current is left child current = hash_internal(current, sibling) else: # current is right child current = hash_internal(sibling, current) index //= 2 return current.hex() == root_hex def make_receipt(eventId: str, timestamp: int, items: List[Tuple[str, bytes]], depth: int, include_proofs: bool = True) -> Receipt: """ items: list of (fileId, fileBytes) """ entries: List[Tuple[str, str]] = [] for fileId, fileBytes in items: entries.append((fileId, sha256(fileBytes).hex())) root_bytes, leaf_map = build_merkle_map(entries, eventId, depth) root_hex = root_bytes.hex() proofs = None if include_proofs: proofs = {} for fileId, fileHashHex in entries: proofs[fileId] = generate_proof(fileId, fileHashHex, eventId, depth, leaf_map) return Receipt( eventId=eventId, timestamp=timestamp, depth=depth, root=root_hex, entries=entries, proofs=proofs ) if __name__ == "__main__": # Demo data: two "off-chain" files file_a = b"hello provenance file A" file_b = b"hello provenance file B" eventId = "build-2026-07-01T12:00Z" timestamp = 1719835200 # example unix seconds depth = 6 # small for demo (64 leaves) receipt = make_receipt( eventId=eventId, timestamp=timestamp, items=[("fileA", file_a), ("fileB", file_b)], depth=depth, include_proofs=True ) print("Receipt root:", receipt.root) print("Entries:", receipt.entries) # Verify each proof against the anchored root for fileId, proof in receipt.proofs.items(): ok = verify_proof(proof, receipt.root, receipt.depth) print(f"Proof for {fileId} valid:", ok) # Serialize receipt (what I'd anchor root on-chain; proofs can be distributed off-chain) receipt_json = json.dumps({ "eventId": receipt.eventId, "timestamp": receipt.timestamp, "depth": receipt.depth, "root": receipt.root, "entries": receipt.entries, "proofs": { fid: { "index": p.index, "key": p.key, "value_hash": p.value_hash, "path": p.path } for fid, p in receipt.proofs.items() } }, indent=2) print("\nSerialized receipt (truncated):") print(receipt_json[:800], "...")

What to look for in the code

  • leaf_index_from_key(...)
    Converts stable leafKeyMaterial into a deterministic leaf index using the first depth bits of SHA256. That prevents dependence on insertion order.

  • build_merkle_map(...)
    Builds a fixed-depth binary tree. Missing leaves use a default digest (leaf|0x00…00) so every leaf position is well-defined.

  • generate_proof(...)
    Walks upward collecting sibling hashes in a path. That path is what a verifier needs to recompute the root.

  • verify_proof(...)
    Recomputes the root from the leaf hash + sibling path, checking the result matches the anchored root.


How this maps to blockchain infrastructure

In practice, I anchor only:

  • root (Merkle-Map root)
  • eventId (or a hash of it)
  • timestamp (or let the chain’s block time serve that role)

A simplified “anchoring transaction” payload typically looks like:

  • anchorRoot = receipt.root
  • anchorEvent = SHA256(eventId).hex()
  • store both in a smart contract call or a log event

Then later, when someone needs to prove a file was included, they provide:

  • the receipt metadata (or at least eventId, depth, root)
  • the file hash
  • the inclusion proof path

No file bytes are posted on-chain—only constant-size hashes and proofs.


Minimal on-chain anchoring sketch (conceptual)

The exact smart contract depends on your chain (EVM, Solana, Cosmos, etc.). The infrastructure concept stays the same: store root in contract state keyed by eventIdHash.

Here’s what the data flow looks like:

  1. Off-chain: run make_receipt(...) → get root
  2. On-chain: call anchor(eventIdHash, root)
  3. Off-chain later: verify proof against anchored root

The code above already implements the cryptographic part (receipt + proof verification). Anchoring is “plumbing.”


Conclusion

I built a deterministic, keyed Merkle-Map receipt system for off-chain digital provenance: each file is committed via SHA256(fileBytes) and placed into a fixed-depth Merkle structure using a stable, versioned leaf key. The result is a compact root that can be anchored on-chain, while individual files can later be proven included using short inclusion proofs reconstructed from sibling hashes. Most importantly, I learned that keyed leaf placement (not array order) is what makes provenance receipts resilient to reordering and makes verification predictable across distributed systems.