Providing byte level offsets for effective alignment in Cross-Tokenizer On-Policy Distillation #1880
JqzChandler wants to merge 2 commits into huggingface:main from
Conversation
Pull Request Overview
This PR adds an offset_type parameter to the encode and encode_batch methods, allowing users to choose between character-based offsets ("char"), byte-based offsets ("byte"), or no offsets ("none") for faster encoding. The default is "char" to maintain backward compatibility.
- Adds `offset_type` parameter to both Rust and Python encoding methods
- Routes to appropriate internal methods based on offset type selection
- Provides input validation with helpful error messages
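The routing and validation described above can be sketched as follows (a simplified Python stand-in for the Rust binding logic; the placeholder return values and internal branch names are assumptions, not the PR's actual code):

```python
# Sketch of the offset_type routing/validation the PR adds to encode().
# Placeholder results stand in for calls to the internal Rust methods.
def encode(text, offset_type="char"):
    if offset_type == "char":
        return ("char-offsets-encoding", text)   # default, backward compatible
    elif offset_type == "byte":
        return ("byte-offsets-encoding", text)   # byte-level offsets
    elif offset_type == "none":
        return ("no-offsets-encoding", text)     # skip offset computation
    else:
        raise ValueError(
            f"Invalid offset_type {offset_type!r}: expected 'char', 'byte', or 'none'"
        )
```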
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| bindings/python/src/tokenizer.rs | Adds offset_type parameter to encode and encode_batch Rust methods with validation and routing logic |
| bindings/python/py_src/tokenizers/implementations/base_tokenizer.py | Adds offset_type parameter to Python wrapper methods with documentation |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
ArthurZucker
left a comment
Thanks! If you want to expose it, it would be better to just add encode_char_offsets to the bindings (it's less breaking and needs no API changes).
If that works for you, running `python stub.py` will update the inits.
```python
pair: Optional[InputSequence] = None,
is_pretokenized: bool = False,
add_special_tokens: bool = True,
offset_type: str = "char",
```
I'd rather we make it optional!
btw, this file is not really something I thought had a lot of usage 😄
WDYT about adding some documentation in a markdown file instead, since this already exists?
issue
#1881
Our team tried several alignment approaches when implementing our in-house On-Policy distillation method, including trl's existing implementation, which repeatedly calls the decode() method but runs into correctness issues in edge cases and carries high computational overhead.
Later we found a better method: with a single call to the encode() method that returns byte-level offsets for all tokens, we can sidestep BPE's complexity, and byte-level offsets are also compatible with all other types of tokenizers. Additionally, for distillation between two BPE tokenizers, skipping the string as an intermediate representation yields more accurate alignment.
Therefore, we hope to merge this simple patch to expose the byte-level offset calculation already supported in the Rust code for use by the Python classes.
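To illustrate why byte-level offsets differ from character offsets, here is a minimal, self-contained sketch (the helper name `char_to_byte_offsets` is hypothetical, not part of the tokenizers API): multi-byte UTF-8 characters make the two offset spaces diverge, which is exactly what cross-tokenizer alignment must account for.

```python
def char_to_byte_offsets(text, char_offsets):
    """Convert (start, end) character offsets into UTF-8 byte offsets.

    Hypothetical helper for illustration only: prefix sums of each
    character's UTF-8 byte length map char positions to byte positions.
    """
    byte_pos = [0]
    for ch in text:
        byte_pos.append(byte_pos[-1] + len(ch.encode("utf-8")))
    return [(byte_pos[s], byte_pos[e]) for s, e in char_offsets]

text = "héllo"
# Char offsets for the pieces "hé" and "llo"; "é" is 2 bytes in UTF-8,
# so the byte offsets shift relative to the char offsets.
print(char_to_byte_offsets(text, [(0, 2), (2, 5)]))
# → [(0, 3), (3, 6)]
```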
More description at:
huggingface/trl#4393