Releases: tusharinqueue/tewtoken
Initial release - 8k vocab, bilingual BPE
TewToken v1.0.0 — Initial Release
First working release of TewToken, a bilingual BPE tokenizer built from scratch in pure Python.
What's included
- BPE algorithm implemented from zero — no HuggingFace, no PyTorch
- Trained on YouTube transcripts (English + Hindi)
- 8,000 merge rules learned
- ~7,900 token vocabulary
- 13 utility functions (encode, decode, tokenize, count_tokens, batch, truncate and more)
- Importable as a Python package
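For readers unfamiliar with BPE, the merge-rule learning mentioned above can be sketched roughly as follows. This is a minimal illustration of the general technique, not TewToken's actual code: the names `train_bpe`, `merge_pair`, and `encode_word` are hypothetical, and it assumes whitespace pre-splitting and character-level starting symbols, which may differ from what the package does.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules (hypothetical helper, not the package API)."""
    # Assumption: words are split on whitespace and start as tuples of characters.
    words = Counter(tuple(w) for w in text.split())
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge
        # Pick the most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Apply the new rule to every word before the next round.
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges


def encode_word(word, merges):
    """Apply the learned merge rules, in the order they were learned, to one word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = train_bpe("low low low lower lowest", num_merges=2)
print(merges)                        # [('o', 'w'), ('l', 'ow')]
print(encode_word("lower", merges))  # ['low', 'e', 'r']
```

Decoding in this sketch is just string concatenation of the symbols; a real tokenizer would additionally map each symbol to an integer ID and handle whitespace and unknown characters.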
Install
pip install git+https://github.com/tusharinqueue/tewtoken.git
Note
v2.0.0 is planned next, with a 32k vocabulary trained on ~1 GB of Wikipedia data.
Full Changelog: https://github.com/tusharinqueue/tewtoken/commits/v1.0.0