Skip to content

Add LLM performance regression instrumentation tests (#19700)#19700

Merged
psiddh merged 1 commit into
pytorch:mainfrom
psiddh:export-D105840841
May 20, 2026
Merged

Add LLM performance regression instrumentation tests (#19700)#19700
psiddh merged 1 commit into
pytorch:mainfrom
psiddh:export-D105840841

Conversation

@psiddh
Copy link
Copy Markdown
Contributor

@psiddh psiddh commented May 20, 2026

Summary:
Adds LlmPerformanceTest, an Android instrumentation test that measures inference performance metrics (TPS, TPS stability, TTFT) for ExecuTorch LLM on the stories110M fixture and asserts they meet minimum thresholds.

This enables OKR 3.3 (Performance Testing: TPS/latency regression detection) using the same zero-infra approach as D105741356 — same fixture, same CI paths, no new dependencies.

Three performance aspects are tested:

  1. testTpsAboveThreshold — decode speed regression gate. A warm-up run is excluded from measurement. Threshold is configurable via instrumentation arg (minTps) so the same APK works on emulator (1.0 TPS) and device (10+ TPS).

  2. testTpsStability — checks coefficient of variation across 3 runs is below 0.5. Catches thread contention, GC pressure, or scheduling instability that causes inconsistent user experience.

  3. testTimeToFirstToken — measures prompt evaluation latency (prefill time). Asserts TTFT < 30s. Catches regressions in the prefill/KV-cache-fill path that make the app feel unresponsive before generation starts.

All metrics are reported via InstrumentationRegistry.sendStatus() for CI metric capture and future dashboarding.

Differential Revision: D105840841

Pulled By: psiddh

Copilot AI review requested due to automatic review settings May 20, 2026 15:08
@psiddh psiddh requested a review from kirklandsign as a code owner May 20, 2026 15:08
@pytorch-bot
Copy link
Copy Markdown

pytorch-bot Bot commented May 20, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19700

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

❌ 18 Cancelled Jobs, 2 Unrelated Failures

As of commit d98cb92 with merge base 8debe93 (image):

CANCELLED JOBS - The following jobs were cancelled. Please retry:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 20, 2026
@meta-codesync
Copy link
Copy Markdown
Contributor

meta-codesync Bot commented May 20, 2026

@psiddh has exported this pull request. If you are a Meta employee, you can view the originating Diff in D105840841.

@github-actions
Copy link
Copy Markdown

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an Android instrumentation test suite (LlmPerformanceTest) intended to detect LLM inference performance regressions on the TinyStories-110M fixture by measuring TPS, TPS stability, and TTFT, and reporting metrics via InstrumentationRegistry.sendStatus().

Changes:

  • Introduces TPS threshold gating with an overridable minTps instrumentation argument.
  • Adds a stability check using coefficient of variation across multiple runs.
  • Adds a TTFT measurement and threshold assertion, plus metric reporting for CI capture.
Comments suppressed due to low confidence (2)

extension/android/executorch_android/src/androidTest/java/org/pytorch/executorch/LlmPerformanceTest.kt:142

  • Same issue as above: LlmModule.load() is void/Unit and throws on failure, so assigning it to loadResult and comparing to 0 won’t compile. Use llmModule.load() without a return-code check (or catch/fail() on exception).
    val loadResult = llmModule.load()
    assertTrue("Model failed to load", loadResult == 0)

extension/android/executorch_android/src/androidTest/java/org/pytorch/executorch/LlmPerformanceTest.kt:191

  • Same issue as above: LlmModule.load() returns void/Unit and throws on failure; this return-code assertion will not compile. Replace with llmModule.load() and rely on exceptions to fail the test (or catch and fail).
    val loadResult = llmModule.load()
    assertTrue("Model failed to load", loadResult == 0)


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@meta-codesync
Copy link
Copy Markdown
Contributor

meta-codesync Bot commented May 20, 2026

@psiddh has imported this pull request. If you are a Meta employee, you can view this in D105840841.

Copilot AI review requested due to automatic review settings May 20, 2026 16:34
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.

Comment on lines +114 to +116
reportMetric("measured_tps", measuredTps)
reportMetric("measured_tokens", generatedTokens.size.toFloat())
reportMetric("min_tps_threshold", minTps)
Summary:
Adds `LlmPerformanceTest`, an Android instrumentation test that measures inference performance metrics (TPS, TPS stability, TTFT) for ExecuTorch LLM on the stories110M fixture and asserts they meet minimum thresholds.

This enables OKR 3.3 (Performance Testing: TPS/latency regression detection) using the same zero-infra approach as D105741356 — same fixture, same CI paths, no new dependencies.

Three performance aspects are tested:

1. `testTpsAboveThreshold` — decode speed regression gate. A warm-up run is excluded from measurement. Threshold is configurable via instrumentation arg (`minTps`) so the same APK works on emulator (1.0 TPS) and device (10+ TPS).

2. `testTpsStability` — checks coefficient of variation across 3 runs is below 0.5. Catches thread contention, GC pressure, or scheduling instability that causes inconsistent user experience.

3. `testTimeToFirstToken` — measures prompt evaluation latency (prefill time). Asserts TTFT < 30s. Catches regressions in the prefill/KV-cache-fill path that make the app feel unresponsive before generation starts.

All metrics are reported via InstrumentationRegistry.sendStatus() for CI metric capture and future dashboarding.


Differential Revision: D105840841

Pulled By: psiddh
@meta-codesync meta-codesync Bot changed the title Add LLM performance regression instrumentation tests Add LLM performance regression instrumentation tests (#19700) May 20, 2026
Copilot AI review requested due to automatic review settings May 20, 2026 18:28
@psiddh psiddh force-pushed the export-D105840841 branch from d98cb92 to cbd0a6c Compare May 20, 2026 18:28
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

@digantdesai
Copy link
Copy Markdown
Contributor

@claude review this PR

@claude
Copy link
Copy Markdown

claude Bot commented May 20, 2026

Claude finished @digantdesai's task in 1m 46s —— View job


Code Review: LlmPerformanceTest.kt

  • Read the changed file and gather context
  • Review code for correctness, style, and best practices
  • Post review feedback

Overall this is a well-structured test that follows the patterns established by LlmModuleConversationHistoryTest. The code is clean, the thresholds are appropriately conservative for emulator CI, and the metric reporting via InstrumentationRegistry.sendStatus() is a good approach for CI capture. A few observations:


Looks Good

  1. setUp pattern — Uses requireNotNull(...).use { ... } for resource streams (lines 62-71), which is the correct Kotlin pattern to ensure streams close on both success and failure. This is an improvement over the older LlmModuleInstrumentationTest which manually closes streams.

  2. load() usage — The test correctly calls llmModule.load() without capturing a return value (lines 95, 140, 188), matching the void signature. The earlier Copilot comments about loadResult == 0 appear to be from a stale revision.

  3. TTFT uses first_token_ms — Line 204 correctly uses first_token_ms - inference_start_ms, which matches the C++ stats definition in stats.h:43 (first_token: Timestamp when the first generated token is emitted).

  4. minTps validation — Line 258 validates the instrumentation arg is finite and positive, addressing the concern about invalid overrides.

  5. minOrNull()!! — Lines 169-170 use the non-deprecated Kotlin API (not min()/max()).


Suggestions (non-blocking)

1. TPS calculation uses generated_tokens from stats JSON, but the denominator excludes prefill — potential off-by-one (lines 229-234)

The generated_tokens field from stats is num_generated_tokens (total minus prompt tokens). The decode time is inference_end_ms - prompt_eval_end_ms. This is correct and consistent with how stats_to_json_string computes decode_token_per_sec. However, consider using the pre-computed decode_token_per_sec field from the JSON directly — it avoids duplicating the calculation and stays in sync if the stats definition changes.

2. Synchronized list may not be needed (lines 53-55)

LlmModule.generate() is a synchronous, blocking call (it holds mLock). The onResult and onStats callbacks are invoked from the same thread context during generate(). The Collections.synchronizedList and @Volatile annotations suggest multi-threaded access, but since all access happens sequentially (generate blocks → callbacks fire → test reads results), plain lists would suffice. This isn't a bug — just unnecessary overhead that might mislead readers about threading guarantees.

3. Consider asserting first_token_ms > 0 before computing TTFT (line 204)

If first_token_ms is 0 (the default in the Stats struct, meaning the field was never set), the TTFT calculation would produce a negative number (0 - inference_start_ms), which would still pass the <= 30_000 assertion but report a meaningless metric. A guard like:

assertTrue("first_token_ms was not set by the runtime", firstTokenMs > 0)

...would catch this case explicitly.

4. Stability test: 3 iterations with CV < 0.5 is very permissive (lines 148, 290)

With only 3 samples, the coefficient of variation estimate has extremely high uncertainty. A CV threshold of 0.5 means you'd only catch cases where one run is ~2x faster/slower than another — at that point it's less "stability testing" and more "catastrophic failure detection." This is fine for an initial gate, but worth noting in follow-up that increasing STABILITY_ITERATIONS to 5+ would make this more statistically meaningful.

5. Minor: lastStatsJson!! assertion (line 115)

In testTpsAboveThreshold, line 115 does JSONObject(lastStatsJson!!) which would throw an NPE if lastStatsJson is null. The test does assert tpsResults.isNotEmpty() above (which implies onStats was called and parsed successfully), so in practice this is safe. But it's a slightly indirect guarantee — if the JSON had a valid generated_tokens and timestamps but a parsing error elsewhere, lastStatsJson would still be set. This is fine as-is.


Summary

This is a solid addition. The test correctly uses the ExecuTorch stats API, handles warm-up properly, and provides configurable thresholds for different environments. The suggestions above are minor improvements, none are blocking.

@psiddh psiddh merged commit 2b7a5a2 into pytorch:main May 20, 2026
177 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported meta-exported

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants