Building A Verifiable Merkle-Map For Off-Chain Digital Provenance Receipts
Written by
Cipher Stone
The problem I ran into
I was building a provenance pipeline where files (PDFs, videos, build artifacts) were stored off-chain (S3-like storage), but I still needed a cryptographically verifiable “receipt” that proves:
- Which exact file bytes were involved.
- When the pipeline decided the file was part of a provenance event.
- That the receipt didn’t get tampered with—even if someone later replaced or reordered files off-chain.
Blockchains are great for anchoring data, but anchoring every file byte directly on-chain is a non-starter. So I ended up with a niche but practical infrastructure piece:
I built a verifiable Merkle-Map receipt format that can bundle many off-chain file hashes, produce inclusion proofs, and anchor the Merkle root on-chain.
Along the way I also discovered a subtle pitfall: many “Merkle proof” examples ignore how leaves are keyed. If leaf positions depend on array order, reordering breaks verification even when the set is identical. A Merkle-Map avoids that by using a keyed leaf position derived from something stable (like a content identifier).
Below is exactly what I implemented: a deterministic Merkle-Map builder, a receipt serializer, and code to verify inclusion proofs.
Background: what I mean by “Merkle-Map”
A Merkle tree is a hash-based authenticated data structure. It commits to many items using only a root hash. A Merkle proof shows that a specific item is included under that root.
A Merkle-Map is the keyed variant: each entry is mapped to a leaf position based on a key (not an array index). That means the same set of (key, value) pairs yields the same root regardless of ordering.
In this post, my leaves represent:
key: a stable identifier derived from the provenance event metadata + file identifier (so it’s deterministic)value: the hash of the file bytes (off-chain content)
Then I:
- build the Merkle root over all leaves
- generate inclusion proofs per file
- anchor the Merkle root on-chain (shown as the “receipt” format; the chain anchoring part is typically a transaction call)
The specific receipt format I used
Each provenance receipt contains:
eventId: string (human-friendly ID)timestamp: integer (unix seconds)items: list of(fileId, fileHash)root: Merkle-Map root hash- optional per-item
proofs: inclusion proof for each(key, fileHash)
The key design choice: leaf key = hash of a “leaf key material” string, not the fileId directly. This lets me version the receipt format later without breaking everything.
Leaf key material I used:
leafKeyMaterial = "v1|eventId=<eventId>|fileId=<fileId>"
Leaf hash value I used:
leafValue = SHA256(fileBytes)
Then I map keys into the Merkle-Map.
Step-by-step implementation (deterministic Merkle-Map)
This implementation uses:
- SHA-256
- a binary Merkle tree of fixed depth
- leaves keyed by the hash-bitstring of
leafKeyMaterial
What “fixed depth” means here
I chose depth D = 20 (about a million possible leaf positions). For the demo below, I keep it small but deterministic.
If two different keys map to the same leaf, they overwrite—so I include a collision check. In production systems I usually:
- increase depth
- or use a different structure (like a sparse Merkle tree with non-overlapping paths)
- or treat collisions explicitly by using secondary hashing
For this blog post I’ll keep it deterministic and transparent.
Working code: Merkle-Map builder + inclusion proofs
import hashlib import json from dataclasses import dataclass from typing import Dict, List, Tuple, Optional def sha256(data: bytes) -> bytes: return hashlib.sha256(data).digest() def hex32(b: bytes) -> str: return b.hex() @dataclass class Proof: index: int # leaf index in the fixed-depth tree key: str # leaf key material (debug-friendly) value_hash: str # hex hash of file bytes path: List[str] # sibling hashes from leaf to root (hex) @dataclass class Receipt: eventId: str timestamp: int depth: int root: str entries: List[Tuple[str, str]] # (fileId, fileHashHex) proofs: Optional[Dict[str, Proof]] = None # fileId -> proof def leaf_index_from_key(leaf_key_material: str, depth: int) -> int: """ Deterministically maps a leaf key to an index using the first `depth` bits of SHA256(leaf_key_material). """ key_hash = sha256(leaf_key_material.encode("utf-8")) bits_needed = depth idx = 0 consumed = 0 for byte in key_hash: for bitpos in range(7, -1, -1): if consumed >= bits_needed: break idx = (idx << 1) | ((byte >> bitpos) & 1) consumed += 1 if consumed >= bits_needed: break return idx def hash_leaf(value_hash: bytes) -> bytes: """ Domain-separated leaf hash. """ return sha256(b"leaf|" + value_hash) def hash_internal(left: bytes, right: bytes) -> bytes: """ Domain-separated internal node hash. """ return sha256(b"node|" + left + b"|" + right) def build_merkle_map(entries: List[Tuple[str, str]], eventId: str, depth: int) -> Tuple[bytes, Dict[int, bytes]]: """ entries: list of (fileId, fileHashHex) where fileHashHex = SHA256(fileBytes).hex() Returns: (root, leaf_map) where leaf_map[index] = leaf_digest """ # leaf_map maps fixed leaf index -> hashed leaf leaf_map: Dict[int, bytes] = {} for fileId, fileHashHex in entries: file_hash = bytes.fromhex(fileHashHex) leaf_key_material = f"v1|eventId={eventId}|fileId={fileId}" idx = leaf_index_from_key(leaf_key_material, depth) leaf_digest = hash_leaf(file_hash) if idx in leaf_map: # Deterministic collision check: in a keyed map collision means two keys hit same leaf # This is unlikely with enough depth but still possible. if leaf_map[idx] != leaf_digest: raise ValueError(f"Collision at leaf index {idx}: two distinct leaves map to same position.") leaf_map[idx] = leaf_digest # Build the tree bottom-up for the fixed-depth binary tree. size = 1 << depth level: List[bytes] = [hash_leaf(b"\x00" * 32) for _ in range(size)] # default empty leaf for idx, leaf_digest in leaf_map.items(): level[idx] = leaf_digest # Now iteratively compute internal levels for d in range(depth, 0, -1): next_level = [] for i in range(0, len(level), 2): left = level[i] right = level[i + 1] next_level.append(hash_internal(left, right)) level = next_level root = level[0] return root, leaf_map def generate_proof(fileId: str, fileHashHex: str, eventId: str, depth: int, leaf_map: Dict[int, bytes]) -> Proof: """ Generate inclusion proof for the given file entry. """ file_hash = bytes.fromhex(fileHashHex) leaf_key_material = f"v1|eventId={eventId}|fileId={fileId}" idx = leaf_index_from_key(leaf_key_material, depth) if idx not in leaf_map: raise ValueError("Entry not found in leaf_map (proof cannot be generated).") size = 1 << depth level: List[bytes] = [hash_leaf(b"\x00" * 32) for _ in range(size)] for i, leaf_digest in leaf_map.items(): level[i] = leaf_digest path: List[str] = [] index = idx # Build path: sibling hashes from leaf up to root for _ in range(depth): sibling_index = index ^ 1 path.append(hex32(level[sibling_index])) # move to parent index //= 2 # recompute next level next_level = [] for i in range(0, len(level), 2): next_level.append(hash_internal(level[i], level[i + 1])) level = next_level return Proof( index=idx, key=leaf_key_material, value_hash=fileHashHex, path=path ) def verify_proof(proof: Proof, root_hex: str, depth: int) -> bool: """ Verify inclusion proof by recomputing root from leaf and sibling path. """ value_hash = bytes.fromhex(proof.value_hash) current = hash_leaf(value_hash) index = proof.index for sibling_hex in proof.path: sibling = bytes.fromhex(sibling_hex) if index % 2 == 0: # current is left child current = hash_internal(current, sibling) else: # current is right child current = hash_internal(sibling, current) index //= 2 return current.hex() == root_hex def make_receipt(eventId: str, timestamp: int, items: List[Tuple[str, bytes]], depth: int, include_proofs: bool = True) -> Receipt: """ items: list of (fileId, fileBytes) """ entries: List[Tuple[str, str]] = [] for fileId, fileBytes in items: entries.append((fileId, sha256(fileBytes).hex())) root_bytes, leaf_map = build_merkle_map(entries, eventId, depth) root_hex = root_bytes.hex() proofs = None if include_proofs: proofs = {} for fileId, fileHashHex in entries: proofs[fileId] = generate_proof(fileId, fileHashHex, eventId, depth, leaf_map) return Receipt( eventId=eventId, timestamp=timestamp, depth=depth, root=root_hex, entries=entries, proofs=proofs ) if __name__ == "__main__": # Demo data: two "off-chain" files file_a = b"hello provenance file A" file_b = b"hello provenance file B" eventId = "build-2026-07-01T12:00Z" timestamp = 1719835200 # example unix seconds depth = 6 # small for demo (64 leaves) receipt = make_receipt( eventId=eventId, timestamp=timestamp, items=[("fileA", file_a), ("fileB", file_b)], depth=depth, include_proofs=True ) print("Receipt root:", receipt.root) print("Entries:", receipt.entries) # Verify each proof against the anchored root for fileId, proof in receipt.proofs.items(): ok = verify_proof(proof, receipt.root, receipt.depth) print(f"Proof for {fileId} valid:", ok) # Serialize receipt (what I'd anchor root on-chain; proofs can be distributed off-chain) receipt_json = json.dumps({ "eventId": receipt.eventId, "timestamp": receipt.timestamp, "depth": receipt.depth, "root": receipt.root, "entries": receipt.entries, "proofs": { fid: { "index": p.index, "key": p.key, "value_hash": p.value_hash, "path": p.path } for fid, p in receipt.proofs.items() } }, indent=2) print("\nSerialized receipt (truncated):") print(receipt_json[:800], "...")
What to look for in the code
-
leaf_index_from_key(...)
Converts stableleafKeyMaterialinto a deterministic leaf index using the firstdepthbits ofSHA256. That prevents dependence on insertion order. -
build_merkle_map(...)
Builds a fixed-depth binary tree. Missing leaves use a default digest (leaf|0x00…00) so every leaf position is well-defined. -
generate_proof(...)
Walks upward collecting sibling hashes in apath. Thatpathis what a verifier needs to recompute the root. -
verify_proof(...)
Recomputes the root from the leaf hash + sibling path, checking the result matches the anchored root.
How this maps to blockchain infrastructure
In practice, I anchor only:
root(Merkle-Map root)eventId(or a hash of it)timestamp(or let the chain’s block time serve that role)
A simplified “anchoring transaction” payload typically looks like:
anchorRoot = receipt.rootanchorEvent = SHA256(eventId).hex()- store both in a smart contract call or a log event
Then later, when someone needs to prove a file was included, they provide:
- the receipt metadata (or at least
eventId,depth,root) - the file hash
- the inclusion proof path
No file bytes are posted on-chain—only constant-size hashes and proofs.
Minimal on-chain anchoring sketch (conceptual)
The exact smart contract depends on your chain (EVM, Solana, Cosmos, etc.). The infrastructure concept stays the same: store root in contract state keyed by eventIdHash.
Here’s what the data flow looks like:
- Off-chain: run
make_receipt(...)→ getroot - On-chain: call
anchor(eventIdHash, root) - Off-chain later: verify
proofagainst anchoredroot
The code above already implements the cryptographic part (receipt + proof verification). Anchoring is “plumbing.”
Conclusion
I built a deterministic, keyed Merkle-Map receipt system for off-chain digital provenance: each file is committed via SHA256(fileBytes) and placed into a fixed-depth Merkle structure using a stable, versioned leaf key. The result is a compact root that can be anchored on-chain, while individual files can later be proven included using short inclusion proofs reconstructed from sibling hashes. Most importantly, I learned that keyed leaf placement (not array order) is what makes provenance receipts resilient to reordering and makes verification predictable across distributed systems.