Conversation

@lucylq lucylq commented Jan 14, 2026

Summary

LoRA models created with Unsloth use HFTokenizer, which is not supported by the static runner. Switch the static runner to pytorch_tokenizers.get_tokenizer so HuggingFace tokenizers are supported as well.
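
For context, a minimal sketch (under assumptions, not the PR diff itself) of how a runner can load a tokenizer through pytorch_tokenizers.get_tokenizer; the import path comes from the review below, while the encode/decode signatures are assumed:

    # Hedged sketch: get_tokenizer picks the backend (HF tokenizer.json,
    # Tiktoken, SentencePiece) from the file it is given. Signatures assumed.
    from pytorch_tokenizers import get_tokenizer

    tokenizer = get_tokenizer("tokenizer.json")  # HF tokenizer.json from the Unsloth LoRA export
    tokens = tokenizer.encode("What is 15% of 80?", bos=True, eos=False)
    print(tokenizer.decode(tokens))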

Test plan

Export the Llama 1B LoRA model

python export_static_llm_coreml.py \
      --checkpoint $LLAMA1B/original/consolidated.00.pth  \
      --params $LLAMA1B/original/params.json \
      --adapter_checkpoint $LLAMA1B/lora/adapter_model.safetensors \
      --adapter_config $LLAMA1B/lora/adapter_config.json \
      --output coreml-llama1b-lora.pte \
      --max_context_len 1024

Run the Llama 1B LoRA model

(executorch) lfq@lfq-mbp llama % python run_static_llm.py \
    --model /Users/lfq/executorch/examples/apple/coreml/llama/coreml-llama1b-lora.pte \
    --params $LLAMA1B/original/params.json \
    --tokenizer $LLAMA1B/tokenizer.json \
    --tokenizer_config $LLAMA1B/tokenizer_config.json \
    --prompt "What is 15% of 80?" \
    --max_new_tokens 100 
W0114 16:23:07.644390 81771 site-packages/torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
W0114 16:23:08.208112 81771 site-packages/torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
[... 12 identical "triton not found" warnings omitted ...]
I tokenizers:regex.cpp:27] Registering override fallback regex
Model config: 16 layers, dim=2048
Input length: 32, Cache length: 992
Loading model from /Users/lfq/executorch/examples/apple/coreml/llama/coreml-llama1b-lora.pte...
[program.cpp:154] InternalConsistency verification requested but not available
[ETCoreMLModelManager.mm:474] Cache Hit: Successfully retrieved compiled model with identifier=executorch_2d7b5a72-14ac-4133-b35d-269dd19a3ed5_cpu_and_ne from the models cache.
[ETCoreMLModelManager.mm:474] Cache Hit: Successfully retrieved compiled model with identifier=executorch_9d2dc1da-5080-4b9c-a49d-031352db1b03_cpu_and_ne from the models cache.
[ETCoreMLModelManager.mm:474] Cache Hit: Successfully retrieved compiled model with identifier=executorch_4dd01157-98ab-4a24-b89b-abd4b98b1f3e_cpu_and_ne from the models cache.
Method metadata: num_inputs=36, num_outputs=33

Prompt: What is 15% of 80?
Prompt tokens: 10
--------------------------------------------------
Prefilling... done in 0.15s

What is 15% of 80? - 15% of 80 is equal to 12. The answer is 0.12. The answer is 0.12. The answer is 0.12. The answer is 0.12. The answer is 0.12. The answer is 0.12. The answer is 0.12. The answer is 0.12. The answer is 0.12. The answer is 0.12. The answer is 0.12
--------------------------------------------------
Prefill: 10 tokens in 0.15s
Decode: 100 tokens in 7.18s (13.92 tok/s)


pytorch-bot bot commented Jan 14, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16606

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 1 Pending, 2 Unrelated Failures

As of commit b4ba65d with merge base 9510334:

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed label Jan 14, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@lucylq lucylq force-pushed the lfq.use-pytorch-tokenizer-static-runner branch from f025e17 to 162b667 on January 14, 2026 22:20
@lucylq lucylq marked this pull request as ready for review January 14, 2026 22:20
Copilot AI review requested due to automatic review settings January 14, 2026 22:20

Copilot AI left a comment

Pull request overview

This PR updates the static runner to support HuggingFace tokenizers (like those used by Qwen models) by replacing the custom tokenizer wrapper with pytorch_tokenizers.get_tokenizer. Additionally, it fixes the RMSNorm usage in static attention to use the custom RMSNorm implementation instead of torch.nn.RMSNorm.

Changes:

  • Replaced custom tokenizer wrapper with pytorch_tokenizers.get_tokenizer to support HuggingFace tokenizers
  • Added get_stop_tokens helper function to handle different tokenizer interfaces (see the sketch after this list)
  • Changed torch.nn.RMSNorm to custom RMSNorm in StaticAttention initialization
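
A minimal sketch of what such a stop-token helper could look like; the attribute names probed here (stop_tokens, eos_id) are assumptions about the tokenizer interfaces, not the PR's exact code:

    def get_stop_tokens(tokenizer) -> set:
        # Different tokenizer backends expose stop tokens differently; probe
        # the interface and fall back to a single EOS id. Names are assumed.
        if hasattr(tokenizer, "stop_tokens") and tokenizer.stop_tokens:
            return set(tokenizer.stop_tokens)
        eos = getattr(tokenizer, "eos_id", None)
        if callable(eos):
            eos = eos()
        if eos is not None:
            return {eos}
        raise ValueError("tokenizer exposes no stop-token or EOS interface")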

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Files reviewed:

  • examples/models/llama/static_attention.py: replaces torch.nn.RMSNorm with the custom RMSNorm import for the QK normalization layers (illustrated below)
  • examples/apple/coreml/llama/run_static_llm.py: removes the custom Tokenizer class and switches to the pytorch_tokenizers library for broader tokenizer support
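
For reference, a generic RMSNorm of the kind the custom import provides; a minimal illustrative sketch using the usual RMSNorm definition, not the repository's implementation:

    import torch

    class RMSNorm(torch.nn.Module):
        # Root-mean-square layer norm: scale activations by the reciprocal RMS
        # over the last dimension, then apply a learned per-channel weight.
        # Shown only to illustrate what the QK normalization layers compute.
        def __init__(self, dim: int, eps: float = 1e-6):
            super().__init__()
            self.eps = eps
            self.weight = torch.nn.Parameter(torch.ones(dim))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
            return x * rms * self.weight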


@lucylq lucylq force-pushed the lfq.use-pytorch-tokenizer-static-runner branch from 162b667 to b4ba65d on January 14, 2026 23:26
This was referenced Jan 15, 2026
@lucylq lucylq changed the title Use pytorch_tokenizer in static runner Use pytorch_tokenizer in coreml static runner Jan 15, 2026
@lucylq lucylq merged commit 33974d5 into main Jan 15, 2026
310 of 323 checks passed
@lucylq lucylq deleted the lfq.use-pytorch-tokenizer-static-runner branch January 15, 2026 18:29