[RNE Rewrite] feat: add tokenizer pipeline (#1248) #1274
Draft
msluszniak wants to merge 1 commit into
Description
Adds the tokenizer pipeline (issue #1248) using the new worklet-based architecture, with functional parity to the current `TokenizerModule`.

A new `text` extension exposes a `loadTokenizer` JSI primitive (top-level on `__rnexecutorch_jsi__`, like `loadModel`) returning a `Tokenizer` host object backed by `tokenizers::HFTokenizer`. On top of it sits a `createTokenizer(config, runtime?)` async factory (async + `*Worklet` variants + `dispose`) and a `useTokenizer` hook. Methods: `encode`, `decode`, `getVocabSize`, `idToToken`, `tokenToId`, with the same semantics as today (special tokens follow the `tokenizer.json` `post_processor`). The `*Worklet` variants let an upcoming text-embeddings task tokenize → build tensors → run forward within a single worklet.

- `cpp/extensions/text/{tokenizer,install}.{h,cpp}`, wired into `RnExecutorch.cpp`.
- `src/extensions/text/{ops,tasks}/tokenizer.ts`, `src/hooks/useTokenizer.ts`, exports in `index.ts`, example `models.tokenizer.ALL_MINILM_L6_V2`.
- `android/CMakeLists.txt` and the podspec (headers ship in the ExecuTorch llm bundle); documented in `third-party/README.md`.

Introduces a breaking change?
Type of change
Tested on
Testing instructions
CI is TypeScript-only on this branch; native is not compiled in CI. Verified locally: `yarn typecheck`, root `yarn lint`, and `yarn prepare` (bob build) all pass. On-device native build requires provisioning the ExecuTorch `third-party` artifacts (see `third-party/README.md`).

Screenshots
Related issues
#1248, part of #1208
Checklist
Additional notes
The C++ mirrors the proven current `TokenizerModule` calls against the same `software-mansion-labs/tokenizers-cpp` fork. Tokenizer download currently uses the temporary `react-native-fs`-based `useResourceDownload` introduced in #1264 (to be replaced by the ResourceFetcher in #1253).
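For reviewers who want a concrete picture of the method surface listed in the Description, here is a minimal, self-contained sketch. It is not the code from this PR: `TokenizerLike`, `createToyTokenizer`, and `buildInputs` are hypothetical names, the signatures are assumed, and a toy whitespace tokenizer stands in for the native `tokenizers::HFTokenizer`-backed host object. Everything is synchronous for brevity, whereas the PR ships async and `*Worklet` variants. It only illustrates the tokenize → build tensors step that a future text-embeddings task could run before calling forward.

```typescript
// Hypothetical interface: the five method names come from the PR description,
// the exact signatures are assumptions made for this sketch.
interface TokenizerLike {
  encode(text: string): number[];
  decode(ids: number[]): string;
  getVocabSize(): number;
  idToToken(id: number): string | undefined;
  tokenToId(token: string): number | undefined;
}

// Toy whitespace tokenizer (NOT the native implementation). Unknown words map
// to "[UNK]" when the vocabulary contains it, otherwise to id 0.
function createToyTokenizer(vocab: string[]): TokenizerLike {
  const ids = new Map<string, number>();
  vocab.forEach((token, index) => ids.set(token, index));
  const unkId = ids.get("[UNK]") ?? 0;

  return {
    encode: (text) =>
      text
        .toLowerCase()
        .split(/\s+/)
        .filter((word) => word.length > 0)
        .map((word) => ids.get(word) ?? unkId),
    decode: (tokenIds) => tokenIds.map((id) => vocab[id] ?? "[UNK]").join(" "),
    getVocabSize: () => vocab.length,
    idToToken: (id) => vocab[id],
    tokenToId: (token) => ids.get(token),
  };
}

// "Build tensors" step: truncate or pad to a fixed length and derive the
// attention mask (1 for real tokens, 0 for padding).
function buildInputs(
  tokenizer: TokenizerLike,
  text: string,
  maxLength: number,
  padId: number
): { inputIds: number[]; attentionMask: number[] } {
  const encoded = tokenizer.encode(text).slice(0, maxLength);
  const inputIds = new Array<number>(maxLength).fill(padId);
  const attentionMask = new Array<number>(maxLength).fill(0);
  encoded.forEach((id, i) => {
    inputIds[i] = id;
    attentionMask[i] = 1;
  });
  return { inputIds, attentionMask };
}

// Usage with a four-entry toy vocabulary.
const toy = createToyTokenizer(["[PAD]", "[UNK]", "hello", "world"]);
const inputs = buildInputs(toy, "Hello world", 4, 0);
console.log(inputs.inputIds, inputs.attentionMask);
```

In the real pipeline the padded ids and mask would be copied into model input tensors inside the same worklet; that part depends on APIs not shown in this PR description, so it is left out here.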