
Releases: tusharinqueue/tewtoken

Initial release - 8k vocab, bilingual BPE

16 Mar 18:42


TewToken v1.0.0 — Initial Release

First working release of TewToken, a bilingual BPE tokenizer built from scratch in pure Python.

What's included

  • BPE algorithm implemented from scratch, with no HuggingFace or PyTorch dependency
  • Trained on YouTube transcripts (English + Hindi)
  • 8,000 merge rules learned
  • ~7,900 token vocabulary
  • 13 utility functions (encode, decode, tokenize, count_tokens, batch, truncate and more)
  • Importable as a Python package
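
For readers new to BPE, here is a minimal sketch of how merge rules can be learned and applied in pure Python. It is not TewToken's actual code: the names `train_bpe`, `merge_pair` and `encode_word` are hypothetical, the toy corpus is invented for illustration, and splitting words into Unicode code points is an assumption that may differ from how TewToken pre-tokenizes English and Hindi text.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (illustrative only)."""
    # Each distinct word becomes a tuple of single-character symbols with a frequency.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge
        # Pick the most frequent pair; ties are broken by comparing the pairs themselves.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Apply the new rule to every word before the next round.
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[merge_pair(symbols, best)] += freq
        words = merged_words
    return merges


def encode_word(word, merges):
    """Tokenize one word by applying the learned merge rules in training order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus, made up for this example.
corpus = ["low", "low", "lower", "lowest"]
merges = train_bpe(corpus, num_merges=2)
print(merges)                         # [('o', 'w'), ('l', 'ow')]
print(encode_word("lowest", merges))  # ['low', 'e', 's', 't']
```

Words that share no learned pairs, including text in a script absent from the toy corpus, simply fall back to single-character symbols, so joining the tokens always reproduces the input.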

Install

pip install git+https://github.com/tusharinqueue/tewtoken.git

Note

v2.0.0 is coming soon, with a 32k vocabulary trained on ~1 GB of Wikipedia data.

Full Changelog: https://github.com/tusharinqueue/tewtoken/commits/v1.0.0