[RNE Rewrite] feat: add tokenizer pipeline (#1248) #1274

Draft
msluszniak wants to merge 1 commit into rne-rewrite from @ms/issue1248-tokenizer

Conversation

@msluszniak

Member

Description

Adds the tokenizer pipeline (issue #1248) using the new worklet-based architecture, with functional parity to the current TokenizerModule.

A new text extension exposes a loadTokenizer JSI primitive (top-level on __rnexecutorch_jsi__, like loadModel) that returns a Tokenizer host object backed by tokenizers::HFTokenizer. On top of it sit an async createTokenizer(config, runtime?) factory (async methods, *Worklet variants, and dispose) and a useTokenizer hook. Methods: encode, decode, getVocabSize, idToToken, tokenToId, with the same semantics as today (special tokens follow the tokenizer.json post_processor). The *Worklet variants let an upcoming text-embeddings task tokenize → build tensors → run forward within a single worklet.

  • C++: cpp/extensions/text/{tokenizer,install}.{h,cpp}, wired into RnExecutorch.cpp.
  • TS: src/extensions/text/{ops,tasks}/tokenizer.ts, src/hooks/useTokenizer.ts, exports in index.ts, example models.tokenizer.ALL_MINILM_L6_V2.
  • Build: tokenizer include path added to android/CMakeLists.txt and the podspec (headers ship in the ExecuTorch llm bundle); documented in third-party/README.md.
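
For reviewers unfamiliar with the method surface, here is a minimal, self-contained sketch of how the five methods relate and what the "tokenize → build tensors" step could look like. Everything in it is an assumption for illustration: `TokenizerLike`, `createToyTokenizer`, and `buildInputs` are hypothetical names, the signatures are guesses, and the toy whitespace tokenizer stands in for the real HFTokenizer-backed host object (it adds no special tokens and ignores the post_processor). The Int32Array buffers and the pad id of 0 are likewise assumed, not taken from this PR.

```typescript
// Hypothetical shape of the methods listed in this PR.
// Signatures are assumptions, not the actual createTokenizer result type.
interface TokenizerLike {
  encode(text: string): number[];
  decode(ids: number[]): string;
  getVocabSize(): number;
  idToToken(id: number): string | undefined;
  tokenToId(token: string): number | undefined;
}

// Toy whitespace tokenizer used only to make the sketch runnable.
function createToyTokenizer(vocab: string[], unkToken = "[UNK]"): TokenizerLike {
  const tokenIds = new Map<string, number>();
  vocab.forEach((token, id) => tokenIds.set(token, id));
  const unkId = tokenIds.get(unkToken);
  if (unkId === undefined) {
    throw new Error(`vocab must contain ${unkToken}`);
  }
  return {
    // Lowercase, split on whitespace, map unknown words to the [UNK] id.
    encode: (text) =>
      text
        .toLowerCase()
        .split(/\s+/)
        .filter((word) => word.length > 0)
        .map((word) => tokenIds.get(word) ?? unkId),
    decode: (ids) => ids.map((id) => vocab[id] ?? unkToken).join(" "),
    getVocabSize: () => vocab.length,
    idToToken: (id) => vocab[id],
    tokenToId: (token) => tokenIds.get(token),
  };
}

// Pads or truncates token ids to maxLength and builds a matching attention
// mask (1 for real tokens, 0 for padding). Buffer type is an assumption.
function buildInputs(ids: number[], maxLength: number, padId: number) {
  const inputIds = new Int32Array(maxLength).fill(padId);
  const attentionMask = new Int32Array(maxLength); // zero-initialised
  ids.slice(0, maxLength).forEach((id, i) => {
    inputIds[i] = id;
    attentionMask[i] = 1;
  });
  return { inputIds, attentionMask };
}

const toyTokenizer = createToyTokenizer(["[PAD]", "[UNK]", "hello", "world"]);
const toyIds = toyTokenizer.encode("Hello brave world");
const toyInputs = buildInputs(toyIds, 5, 0);
console.log(toyIds, toyTokenizer.decode(toyIds), toyInputs);
```

In the real pipeline the ids would come from the tokenizer's encode (or its *Worklet variant) and the buffers would feed the model's forward call inside the same worklet. The sketch only shows the data flow between those steps.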

Introduces a breaking change?

  • Yes
  • No

Type of change

  • Bug fix (change which fixes an issue)
  • New feature (change which adds functionality)
  • Documentation update (improves or adds clarity to existing documentation)
  • Other (chores, tests, code style improvements etc.)

Tested on

  • iOS
  • Android

Testing instructions

CI is TypeScript-only on this branch; native code is not compiled in CI. Verified locally: yarn typecheck, root yarn lint, and yarn prepare (bob build) all pass. An on-device native build requires provisioning the ExecuTorch third-party artifacts (see third-party/README.md).

Screenshots

Related issues

#1248, part of #1208

Checklist

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation accordingly
  • My changes generate no new warnings

Additional notes

The C++ mirrors the proven current TokenizerModule calls against the same software-mansion-labs/tokenizers-cpp fork. Tokenizer download currently uses the temporary react-native-fs-based useResourceDownload introduced in #1264 (to be replaced by the ResourceFetcher in #1253).

@msluszniak msluszniak marked this pull request as draft June 22, 2026 12:49
@msluszniak msluszniak self-assigned this Jun 22, 2026
@msluszniak msluszniak added the feature PRs that implement a new feature label Jun 22, 2026
@msluszniak msluszniak linked an issue Jun 22, 2026 that may be closed by this pull request
@msluszniak msluszniak force-pushed the @ms/issue1248-tokenizer branch 4 times, most recently from c5817d8 to f426882 Compare June 22, 2026 13:30
@msluszniak msluszniak force-pushed the @ms/issue1248-tokenizer branch from f426882 to 66dfb9d Compare June 22, 2026 17:10
Labels

feature (PRs that implement a new feature), refactoring

Development

Successfully merging this pull request may close these issues.

[RNE Rewrite] Add tokenizer pipeline implementation

1 participant