Providing byte level offsets for effective alignment in Cross-Tokenizer On-Policy Distillation #1880
JqzChandler wants to merge 2 commits into huggingface:main from
Conversation
Pull Request Overview
This PR adds an offset_type parameter to the encode and encode_batch methods, allowing users to choose between character-based offsets ("char"), byte-based offsets ("byte"), or no offsets ("none") for faster encoding. The default is "char" to maintain backward compatibility.
- Adds `offset_type` parameter to both Rust and Python encoding methods
- Routes to appropriate internal methods based on offset type selection
- Provides input validation with helpful error messages
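The routing and validation described above can be sketched as follows (a simplified Python stand-in for the Rust binding logic; the placeholder return values and internal branch names are assumptions, not the PR's actual code):

```python
# Sketch of the offset_type routing/validation the PR adds to encode().
# Placeholder results stand in for calls to the internal Rust methods.
def encode(text, offset_type="char"):
    if offset_type == "char":
        return ("char-offsets-encoding", text)   # default, backward compatible
    elif offset_type == "byte":
        return ("byte-offsets-encoding", text)   # byte-level offsets
    elif offset_type == "none":
        return ("no-offsets-encoding", text)     # skip offset computation
    else:
        raise ValueError(
            f"Invalid offset_type {offset_type!r}: expected 'char', 'byte', or 'none'"
        )
```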
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| bindings/python/src/tokenizer.rs | Adds offset_type parameter to encode and encode_batch Rust methods with validation and routing logic |
| bindings/python/py_src/tokenizers/implementations/base_tokenizer.py | Adds offset_type parameter to Python wrapper methods with documentation |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
ArthurZucker
left a comment
Thanks! If you want to expose it, it would be better to just add encode_char_offsets to the bindings (it's less breaking and needs no API changes).
If that works for you, running `python stub.py` will update the inits.
```python
pair: Optional[InputSequence] = None,
is_pretokenized: bool = False,
add_special_tokens: bool = True,
offset_type: str = "char",
```
I'd rather we make it optional!
btw, this file is not really something I thought had a lot of usage 😄
WDYT about adding some documentation in a markdown file instead, since this already exists?
issue
#1881
Our team tried several alignment approaches when implementing our in-house On-Policy distillation method, including trl's existing implementation, which repeatedly calls the decode() method but runs into correctness issues in edge cases and carries high computational overhead.
Later we found a better method: with a single call to the encode() method that returns byte-level offsets for all tokens, we can sidestep BPE's complexity, and byte-level offsets are also compatible with all other types of tokenizers. Additionally, for distillation between two BPE tokenizers, skipping the string as an intermediate representation yields more accurate alignment.
Therefore, we hope to merge this simple patch to expose the byte-level offset calculation already supported in the Rust code for use by the Python classes.
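To illustrate why byte-level offsets differ from character offsets, here is a minimal, self-contained sketch (the helper name `char_to_byte_offsets` is hypothetical, not part of the tokenizers API): multi-byte UTF-8 characters make the two offset spaces diverge, which is exactly what cross-tokenizer alignment must account for.

```python
def char_to_byte_offsets(text, char_offsets):
    """Convert (start, end) character offsets into UTF-8 byte offsets.

    Hypothetical helper for illustration only: prefix sums of each
    character's UTF-8 byte length map char positions to byte positions.
    """
    byte_pos = [0]
    for ch in text:
        byte_pos.append(byte_pos[-1] + len(ch.encode("utf-8")))
    return [(byte_pos[s], byte_pos[e]) for s, e in char_offsets]

text = "héllo"
# Char offsets for the pieces "hé" and "llo"; "é" is 2 bytes in UTF-8,
# so the byte offsets shift relative to the char offsets.
print(char_to_byte_offsets(text, [(0, 2), (2, 5)]))
# → [(0, 3), (3, 6)]
```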
More description at:
huggingface/trl#4393