Feature and its Use Cases
Current Problem:
There is a verification gap in the pipeline between the dataset verification layer (Layer 2) and model training (Layer 4):
Verified Dataset (Layer 2) ----GAP---- Model Training (Layer 4)
While the dataset verification work establishes cryptographic proof of the raw and processed Wikipedia data, there is no mechanism to verify the tokenization step. This means the following cannot be detected:
- Use of a different tokenizer than claimed
- A different tokenizer configuration than claimed
- Tampering with the tokenized output before training
This matches the gap identified in "A Framework for Cryptographic Verifiability of End-to-End AI Pipelines" (2025) — the Extraction and Analysis phase (Stage 2) is the critical gap where current pipelines fail to provide full verifiability.
Solution:
Build a deterministic tokenization pipeline with three verifiable components:
Verifiable Tokenizer Configuration
- Train a BPE tokenizer on the preprocessed Wikipedia text using HuggingFace Tokenizers
- Use a fixed vocabulary size and configuration for determinism
- Hash the resulting vocab.json and merges.txt using the existing SHA-256 utilities from PR #1 (deterministic dataset hashing utility)
- Add these hashes to the verification manifest (see the sketch below)
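A minimal sketch of what this step could look like with the HuggingFace `tokenizers` library. The corpus path, vocabulary size, special tokens, and the `sha256_file` helper are placeholders (the real hashing should reuse the utility from PR #1), so treat this as a shape proposal rather than the final implementation:

```python
# Sketch: train a BPE tokenizer with a fixed configuration, then hash its files.
# CORPUS_FILES, VOCAB_SIZE, special tokens, and sha256_file are illustrative
# assumptions, not the project's actual names.
import hashlib
import json
import os

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

CORPUS_FILES = ["data/processed/wiki_preprocessed.txt"]  # preprocessed Wikipedia text
VOCAB_SIZE = 32_000                                       # fixed for determinism


def train_bpe_tokenizer(files, vocab_size):
    """Train a BPE tokenizer with a pinned, fully specified configuration."""
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],
    )
    tokenizer.train(files, trainer)
    return tokenizer


def sha256_file(path):
    """Streaming SHA-256 of a file (stand-in for the PR #1 utility)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()


if __name__ == "__main__":
    os.makedirs("artifacts", exist_ok=True)
    tok = train_bpe_tokenizer(CORPUS_FILES, VOCAB_SIZE)
    tok.model.save("artifacts")  # writes vocab.json and merges.txt
    manifest_entry = {
        "tokenizer_vocab_sha256": sha256_file("artifacts/vocab.json"),
        "tokenizer_merges_sha256": sha256_file("artifacts/merges.txt"),
    }
    print(json.dumps(manifest_entry, indent=2))
```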
Verifiable Tokenized Dataset
- Tokenize the entire preprocessed Wikipedia dataset deterministically
- Build a Merkle tree over the tokenized chunks
- Add the Merkle root to the verification manifest (see the chunk-hashing sketch below)
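A rough sketch of chunk-level Merkle hashing over the tokenized output. The chunk size and the leaf encoding (4-byte little-endian token ids) are assumptions here; the real layout should follow the Merkle utilities introduced in PR #8:

```python
# Sketch: binary Merkle tree over fixed-size chunks of the token-id stream.
import hashlib

CHUNK_TOKENS = 4096  # tokens per leaf chunk (illustrative)


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def chunk_token_ids(token_ids, chunk_size=CHUNK_TOKENS):
    """Split the flat token-id stream into fixed-size chunks."""
    for i in range(0, len(token_ids), chunk_size):
        yield token_ids[i:i + chunk_size]


def merkle_root(leaves):
    """Compute a binary Merkle root; an odd node is promoted unchanged."""
    if not leaves:
        return sha256(b"")
    level = [sha256(leaf) for leaf in leaves]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                nxt.append(sha256(level[i] + level[i + 1]))
            else:
                nxt.append(level[i])
        level = nxt
    return level[0]


def tokenized_dataset_root(token_ids):
    """Encode each chunk deterministically (4-byte little-endian ids) and hash."""
    leaves = [
        b"".join(t.to_bytes(4, "little") for t in chunk)
        for chunk in chunk_token_ids(token_ids)
    ]
    return merkle_root(leaves).hex()
```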
Cryptographic Linkage
- Reference the processed data Merkle root from Layer 2 in the tokenization manifest entry (a hypothetical entry shape is sketched after the chain below)
- This creates an unbroken verification chain:
Raw Data Merkle Root
↓
Processed Data Merkle Root
↓
Tokenizer Config Hash
↓
Tokenized Data Merkle Root
↓
Model Training
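To make the linkage concrete, here is a hypothetical shape for the tokenization manifest entry; all field names and the overall manifest layout are illustrative and should follow whatever schema the Layer 2 manifest already established:

```python
# Hypothetical tokenization entry in the verification manifest.
# Field names are illustrative only; placeholders mark values produced
# by the steps sketched above.
import json

tokenization_entry = {
    "stage": "tokenization",
    # links back to Layer 2, closing the gap
    "processed_data_merkle_root": "<Merkle root from the Layer 2 manifest>",
    "tokenizer": {
        "type": "BPE",
        "vocab_size": 32000,
        "vocab_json_sha256": "<sha256 of vocab.json>",
        "merges_txt_sha256": "<sha256 of merges.txt>",
    },
    "tokenized_data_merkle_root": "<Merkle root over tokenized chunks>",
}

print(json.dumps(tokenization_entry, indent=2))
```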
Additional Context
References:
"A Framework for Cryptographic Verifiability of End-to-End AI Pipelines" (IWSPA 2025): https://arxiv.org/pdf/2503.22573v1
PR #1: Deterministic dataset hashing utility
PR #8: Merkle Tree Based Chunk Level Hashing
Open to feedback on the approach before implementation begins.
Code of Conduct
I have joined the Discord server and will post updates there
I have searched existing issues to avoid duplicates