[FEATURE]: Deterministic Tokenization Pipeline with Cryptographic Verification #15

@Varshiniputtabakula

Description

Feature and its Use Cases

Current Problem:
There is a verification gap in the pipeline between the dataset verification layer (Layer 2) and model training (Layer 4):
Verified Dataset (Layer 2) ----GAP---- Model Training (Layer 4)

While the dataset verification work establishes cryptographic proof of the raw and processed Wikipedia data, there is no mechanism to verify the tokenization step. This means the following cannot be detected:
- Use of a different tokenizer than claimed
- A different tokenizer configuration than claimed
- Tampering with the tokenized output before training

This matches the gap identified in "A Framework for Cryptographic Verifiability of End-to-End AI Pipelines" (2025): the Extraction and Analysis phase (Stage 2) is the critical point where current pipelines fail to provide full verifiability.

Solution:
Build a deterministic tokenization pipeline with three verifiable components:

  1. Verifiable Tokenizer Configuration
     - Train a BPE tokenizer on the preprocessed Wikipedia text using HuggingFace Tokenizers
     - Use a fixed vocabulary size and configuration for determinism
     - Hash the resulting vocab.json and merges.txt with the existing SHA-256 utilities from PR #1 (deterministic dataset hashing utility)
     - Add these hashes to the verification manifest

  2. Verifiable Tokenized Dataset
     - Tokenize the entire preprocessed Wikipedia dataset deterministically
     - Build a Merkle tree over the tokenized chunks
     - Add the Merkle root to the verification manifest

  3. Cryptographic Linkage
     - Reference the processed-data Merkle root from Layer 2 in the tokenization manifest entry
     - This creates an unbroken verification chain:

Raw Data Merkle Root
        ↓
Processed Data Merkle Root
        ↓
Tokenizer Config Hash
        ↓
Tokenized Data Merkle Root
        ↓
Model Training
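To make the proposal concrete, here is a rough sketch of how the three components could fit together. This is a hypothetical illustration only, not the actual implementation: the helper names (`sha256_file`, `merkle_root`, `hash_token_chunk`, `build_manifest_entry`), the fixed-width chunk encoding, and the manifest schema are all placeholder choices, and the utilities from PR #1 / PR #8 are stood in by minimal local versions.

```python
# Hypothetical sketch of the tokenization verification chain.
# All names and the manifest schema are placeholders, not the project's API.
import hashlib
import json


def sha256_file(path):
    """Stream a file through SHA-256 (stand-in for the PR #1 utility)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 16), b""):
            h.update(block)
    return h.hexdigest()


def hash_token_chunk(token_ids):
    """Hash one chunk of token ids.

    A fixed-width little-endian encoding keeps the leaf hash
    deterministic across platforms (assumed chunking scheme).
    """
    raw = b"".join(t.to_bytes(4, "little") for t in token_ids)
    return hashlib.sha256(raw).hexdigest()


def merkle_root(leaf_hashes):
    """Pairwise SHA-256 reduction (stand-in for the PR #8 Merkle tree).

    Odd-sized levels duplicate their last node, a common convention.
    """
    if not leaf_hashes:
        raise ValueError("no leaves")
    level = list(leaf_hashes)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [
            hashlib.sha256(bytes.fromhex(a) + bytes.fromhex(b)).hexdigest()
            for a, b in zip(level[::2], level[1::2])
        ]
    return level[0]


def build_manifest_entry(vocab_path, merges_path, token_chunks,
                         processed_data_root):
    """Assemble the Layer 3 manifest entry, linking back to Layer 2."""
    leaves = [hash_token_chunk(chunk) for chunk in token_chunks]
    return {
        "stage": "tokenization",
        "tokenizer_vocab_sha256": sha256_file(vocab_path),
        "tokenizer_merges_sha256": sha256_file(merges_path),
        "tokenized_data_merkle_root": merkle_root(leaves),
        "links_to": {"processed_data_merkle_root": processed_data_root},
    }
```

A verifier would re-tokenize from the verified processed data, recompute the config hashes and Merkle root, and compare them byte-for-byte against the manifest entry; any tokenizer swap, config change, or post-hoc tampering surfaces as a mismatch.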

Additional Context

References:
"A Framework for Cryptographic Verifiability of End-to-End AI Pipelines" (IWSPA 2025): https://arxiv.org/pdf/2503.22573v1
PR #1: Deterministic dataset hashing utility
PR #8: Merkle Tree Based Chunk Level Hashing

Open to feedback on the approach before implementation begins.

Code of Conduct

  • I have joined the Discord server and will post updates there
  • I have searched existing issues to avoid duplicates
