Add LLM performance regression instrumentation tests (#19700) by psiddh · Pull Request #19700 · pytorch/executorch

psiddh · 2026-05-20T15:08:03Z

Summary:
Adds LlmPerformanceTest, an Android instrumentation test that measures inference performance metrics (TPS, TPS stability, TTFT) for ExecuTorch LLM on the stories110M fixture and asserts they meet minimum thresholds.

This enables OKR 3.3 (Performance Testing: TPS/latency regression detection) using the same zero-infra approach as D105741356 — same fixture, same CI paths, no new dependencies.

Three performance aspects are tested:

testTpsAboveThreshold — decode speed regression gate. A warm-up run is excluded from measurement. Threshold is configurable via instrumentation arg (minTps) so the same APK works on emulator (1.0 TPS) and device (10+ TPS).
testTpsStability — checks coefficient of variation across 3 runs is below 0.5. Catches thread contention, GC pressure, or scheduling instability that causes inconsistent user experience.
testTimeToFirstToken — measures prompt evaluation latency (prefill time). Asserts TTFT < 30s. Catches regressions in the prefill/KV-cache-fill path that make the app feel unresponsive before generation starts.

All metrics are reported via InstrumentationRegistry.sendStatus() for CI metric capture and future dashboarding.

Differential Revision: D105840841

Pulled By: psiddh

pytorch-bot · 2026-05-20T15:08:09Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19700

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

Run pull request jobs on OSDC runners in shadow mode

❌ 18 Cancelled Jobs, 2 Unrelated Failures

As of commit d98cb92 with merge base 8debe93 ():

CANCELLED JOBS - The following jobs were cancelled. Please retry:

pull / test-coreml-bc-macos (macos-m1-stable) / macos-job (gh)
pull / test-lora-linux / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux (ic4, portable, linux.4xlarge.memory) / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux (phi_4_mini, portable, linux.4xlarge.memory) / linux-job (gh)
pull / test-qnn-delegate-linux / linux-job (gh)
##[error]The operation was canceled.
pull / test-qnn-testsuite-linux / package-golden-artifacts (gh)
pull / test-qnn-testsuite-linux / test-backend-linux (qnn, models) / linux-job (gh)
##[error]The operation was canceled.
pull / test-sqnr-static-llm-qnn-linux (smollm2_135m) / linux-job (gh)
##[error]The operation was canceled.
pull / unittest / linux / linux-job (gh)
##[error]The operation was canceled.
pull / unittest / macos / macos-job (gh)
pull / unittest-arm-backend-with-no-deps (test_pytest_models_tosa) / linux-job (gh)
##[error]The operation was canceled.
pull / unittest-arm-backend-with-no-deps (test_pytest_ops_tosa) / linux-job (gh)
##[error]The operation was canceled.
pull / unittest-buck / macos / macos-job (gh)
pull / unittest-editable / linux / linux-job (gh)
##[error]The operation was canceled.
pull / unittest-editable / macos / macos-job (gh)
pull / unittest-nxp-neutron / linux-job (gh)
##[error]The operation was canceled.
Test CoreML Backend / test-coreml / test-backend-macos (coreml, models) / macos-job (gh)
Test CoreML Backend / test-coreml / test-backend-macos (coreml, operators) / macos-job (gh)

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

pull / unittest / windows / windows-job (gh) (matched win rule in flaky-rules.json)
##[error]The operation was canceled.
pull / unittest-editable / windows / windows-job (gh) (matched win rule in flaky-rules.json)
##[error]The operation was canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-codesync · 2026-05-20T15:08:15Z

@psiddh has exported this pull request. If you are a Meta employee, you can view the originating Diff in D105840841.

github-actions · 2026-05-20T15:08:51Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot

Pull request overview

Adds an Android instrumentation test suite (LlmPerformanceTest) intended to detect LLM inference performance regressions on the TinyStories-110M fixture by measuring TPS, TPS stability, and TTFT, and reporting metrics via InstrumentationRegistry.sendStatus().

Changes:

Introduces TPS threshold gating with an overridable minTps instrumentation argument.
Adds a stability check using coefficient of variation across multiple runs.
Adds a TTFT measurement and threshold assertion, plus metric reporting for CI capture.

Comments suppressed due to low confidence (2)

extension/android/executorch_android/src/androidTest/java/org/pytorch/executorch/LlmPerformanceTest.kt:142

Same issue as above: LlmModule.load() is void/Unit and throws on failure, so assigning it to loadResult and comparing to 0 won’t compile. Use llmModule.load() without a return-code check (or catch/fail() on exception).

    val loadResult = llmModule.load()
    assertTrue("Model failed to load", loadResult == 0)

extension/android/executorch_android/src/androidTest/java/org/pytorch/executorch/LlmPerformanceTest.kt:191

Same issue as above: LlmModule.load() returns void/Unit and throws on failure; this return-code assertion will not compile. Replace with llmModule.load() and rely on exceptions to fail the test (or catch and fail).

    val loadResult = llmModule.load()
    assertTrue("Model failed to load", loadResult == 0)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

meta-codesync · 2026-05-20T15:56:06Z

@psiddh has imported this pull request. If you are a Meta employee, you can view this in D105840841.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.

+    reportMetric("measured_tps", measuredTps)
+    reportMetric("measured_tokens", generatedTokens.size.toFloat())
+    reportMetric("min_tps_threshold", minTps)


Summary: Adds `LlmPerformanceTest`, an Android instrumentation test that measures inference performance metrics (TPS, TPS stability, TTFT) for ExecuTorch LLM on the stories110M fixture and asserts they meet minimum thresholds. This enables OKR 3.3 (Performance Testing: TPS/latency regression detection) using the same zero-infra approach as D105741356 — same fixture, same CI paths, no new dependencies. Three performance aspects are tested: 1. `testTpsAboveThreshold` — decode speed regression gate. A warm-up run is excluded from measurement. Threshold is configurable via instrumentation arg (`minTps`) so the same APK works on emulator (1.0 TPS) and device (10+ TPS). 2. `testTpsStability` — checks coefficient of variation across 3 runs is below 0.5. Catches thread contention, GC pressure, or scheduling instability that causes inconsistent user experience. 3. `testTimeToFirstToken` — measures prompt evaluation latency (prefill time). Asserts TTFT < 30s. Catches regressions in the prefill/KV-cache-fill path that make the app feel unresponsive before generation starts. All metrics are reported via InstrumentationRegistry.sendStatus() for CI metric capture and future dashboarding. Differential Revision: D105840841 Pulled By: psiddh

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

digantdesai · 2026-05-20T21:28:55Z

@claude review this PR

claude · 2026-05-20T21:29:42Z

Claude finished @digantdesai's task in 1m 46s —— View job

Code Review: LlmPerformanceTest.kt

Read the changed file and gather context
Review code for correctness, style, and best practices
Post review feedback

Overall this is a well-structured test that follows the patterns established by LlmModuleConversationHistoryTest. The code is clean, the thresholds are appropriately conservative for emulator CI, and the metric reporting via InstrumentationRegistry.sendStatus() is a good approach for CI capture. A few observations:

Looks Good

setUp pattern — Uses requireNotNull(...).use { ... } for resource streams (lines 62-71), which is the correct Kotlin pattern to ensure streams close on both success and failure. This is an improvement over the older LlmModuleInstrumentationTest which manually closes streams.
load() usage — The test correctly calls llmModule.load() without capturing a return value (lines 95, 140, 188), matching the void signature. The earlier Copilot comments about loadResult == 0 appear to be from a stale revision.
TTFT uses first_token_ms — Line 204 correctly uses first_token_ms - inference_start_ms, which matches the C++ stats definition in stats.h:43 (first_token: Timestamp when the first generated token is emitted).
minTps validation — Line 258 validates the instrumentation arg is finite and positive, addressing the concern about invalid overrides.
minOrNull()!! — Lines 169-170 use the non-deprecated Kotlin API (not min()/max()).

Suggestions (non-blocking)

1. TPS calculation uses generated_tokens from stats JSON, but the denominator excludes prefill — potential off-by-one (lines 229-234)

The generated_tokens field from stats is num_generated_tokens (total minus prompt tokens). The decode time is inference_end_ms - prompt_eval_end_ms. This is correct and consistent with how stats_to_json_string computes decode_token_per_sec. However, consider using the pre-computed decode_token_per_sec field from the JSON directly — it avoids duplicating the calculation and stays in sync if the stats definition changes.

2. Synchronized list may not be needed (lines 53-55)

LlmModule.generate() is a synchronous, blocking call (it holds mLock). The onResult and onStats callbacks are invoked from the same thread context during generate(). The Collections.synchronizedList and @Volatile annotations suggest multi-threaded access, but since all access happens sequentially (generate blocks → callbacks fire → test reads results), plain lists would suffice. This isn't a bug — just unnecessary overhead that might mislead readers about threading guarantees.

3. Consider asserting first_token_ms > 0 before computing TTFT (line 204)

If first_token_ms is 0 (the default in the Stats struct, meaning the field was never set), the TTFT calculation would produce a negative number (0 - inference_start_ms), which would still pass the <= 30_000 assertion but report a meaningless metric. A guard like:

assertTrue("first_token_ms was not set by the runtime", firstTokenMs > 0)

...would catch this case explicitly.

4. Stability test: 3 iterations with CV < 0.5 is very permissive (lines 148, 290)

With only 3 samples, the coefficient of variation estimate has extremely high uncertainty. A CV threshold of 0.5 means you'd only catch cases where one run is ~2x faster/slower than another — at that point it's less "stability testing" and more "catastrophic failure detection." This is fine for an initial gate, but worth noting in follow-up that increasing STABILITY_ITERATIONS to 5+ would make this more statistically meaningful.

5. Minor: lastStatsJson!! assertion (line 115)

In testTpsAboveThreshold, line 115 does JSONObject(lastStatsJson!!) which would throw an NPE if lastStatsJson is null. The test does assert tpsResults.isNotEmpty() above (which implies onStats was called and parsed successfully), so in practice this is safe. But it's a slightly indirect guarantee — if the JSON had a valid generated_tokens and timestamps but a parsing error elsewhere, lastStatsJson would still be set. This is fine as-is.

Summary

This is a solid addition. The test correctly uses the ExecuTorch stats API, handles warm-up properly, and provides configurable thresholds for different environments. The suggestions above are minor improvements, none are blocking.

Copilot AI review requested due to automatic review settings May 20, 2026 15:08

psiddh requested a review from kirklandsign as a code owner May 20, 2026 15:08

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 20, 2026

meta-codesync Bot added fb-exported meta-exported labels May 20, 2026

psiddh requested a review from GregoryComer May 20, 2026 15:08

Copilot started reviewing on behalf of psiddh May 20, 2026 15:08 View session

Copilot AI reviewed May 20, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings May 20, 2026 16:34

Copilot started reviewing on behalf of psiddh May 20, 2026 16:35 View session

Copilot AI reviewed May 20, 2026

View reviewed changes

meta-codesync Bot changed the title ~~Add LLM performance regression instrumentation tests~~ Add LLM performance regression instrumentation tests (#19700) May 20, 2026

Copilot AI review requested due to automatic review settings May 20, 2026 18:28

psiddh force-pushed the export-D105840841 branch from d98cb92 to cbd0a6c Compare May 20, 2026 18:28

Copilot started reviewing on behalf of psiddh May 20, 2026 18:29 View session

Copilot AI reviewed May 20, 2026

View reviewed changes

digantdesai approved these changes May 20, 2026

View reviewed changes

psiddh merged commit 2b7a5a2 into pytorch:main May 20, 2026
177 checks passed

Conversation

psiddh commented May 20, 2026 • edited by meta-codesync Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pytorch-bot Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19700

❗ 1 Active SEVs

❌ 18 Cancelled Jobs, 2 Unrelated Failures

Uh oh!

meta-codesync Bot commented May 20, 2026

Uh oh!

github-actions Bot commented May 20, 2026

This PR needs a release notes: label

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

meta-codesync Bot commented May 20, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

digantdesai commented May 20, 2026

Uh oh!

claude Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review: LlmPerformanceTest.kt

Looks Good

Suggestions (non-blocking)

Summary

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

psiddh commented May 20, 2026 •

edited by meta-codesync Bot

Loading

pytorch-bot Bot commented May 20, 2026 •

edited

Loading

This PR needs a `release notes:` label

claude Bot commented May 20, 2026 •

edited

Loading