Add LLM performance regression instrumentation tests (#19700)
@psiddh has exported this pull request. If you are a Meta employee, you can view the originating Diff in D105840841.
Pull request overview
Adds an Android instrumentation test suite (LlmPerformanceTest) intended to detect LLM inference performance regressions on the TinyStories-110M fixture by measuring TPS, TPS stability, and TTFT, and reporting metrics via InstrumentationRegistry.sendStatus().
Changes:
- Introduces TPS threshold gating with an overridable `minTps` instrumentation argument.
- Adds a stability check using coefficient of variation across multiple runs.
- Adds a TTFT measurement and threshold assertion, plus metric reporting for CI capture.
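The overridable threshold described above can be sketched as a small pure function. This is only an illustration: `resolveMinTps` and its default of 1.0 are hypothetical names and values, not taken from the PR, and `raw` stands in for the string an instrumentation runner would pass (for example with `-e minTps 10`).

```kotlin
// Hypothetical helper; the PR's actual argument handling is not shown in this conversation.
// `raw` stands in for the string value of the `minTps` instrumentation argument.
fun resolveMinTps(raw: String?, default: Float = 1.0f): Float {
    // Fall back to the default when the argument is missing or not a number.
    val parsed = raw?.trim()?.toFloatOrNull() ?: return default
    // Reject non-positive or non-finite thresholds, which would make the gate meaningless.
    return if (parsed.isFinite() && parsed > 0f) parsed else default
}

fun main() {
    println(resolveMinTps(null))   // argument absent: default applies
    println(resolveMinTps("10"))   // device-style override
    println(resolveMinTps("fast")) // unparsable: default applies
}
```

Keeping the parsing separate from the Android-specific lookup makes the fallback behaviour testable without a device.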
Comments suppressed due to low confidence (2)
extension/android/executorch_android/src/androidTest/java/org/pytorch/executorch/LlmPerformanceTest.kt:142
- Same issue as above: `LlmModule.load()` is `void`/`Unit` and throws on failure, so assigning it to `loadResult` and comparing to 0 won't compile. Use `llmModule.load()` without a return-code check (or catch and `fail()` on exception).
val loadResult = llmModule.load()
assertTrue("Model failed to load", loadResult == 0)
extension/android/executorch_android/src/androidTest/java/org/pytorch/executorch/LlmPerformanceTest.kt:191
- Same issue as above: `LlmModule.load()` returns `void`/`Unit` and throws on failure; this return-code assertion will not compile. Replace with `llmModule.load()` and rely on exceptions to fail the test (or catch and `fail`).
val loadResult = llmModule.load()
assertTrue("Model failed to load", loadResult == 0)
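If `load()` does signal failure by throwing, as these low-confidence comments claim, the check could look roughly like the sketch below. `FakeModule` and `loadOrDescribeFailure` are hypothetical stand-ins, not the real `LlmModule` API, and whether the real method returns a status code or throws depends on the ExecuTorch version in use.

```kotlin
// Hypothetical stand-in for a module whose load() returns Unit and throws on failure.
// This is NOT the real LlmModule class.
class FakeModule(private val shouldFail: Boolean) {
    fun load() {
        if (shouldFail) throw IllegalStateException("model file not found")
    }
}

// Returns null on success, or a message that a test could pass to fail().
fun loadOrDescribeFailure(module: FakeModule): String? =
    try {
        module.load()
        null
    } catch (e: Exception) {
        "Model failed to load: ${e.message}"
    }

fun main() {
    println(loadOrDescribeFailure(FakeModule(shouldFail = false)))
    println(loadOrDescribeFailure(FakeModule(shouldFail = true)))
}
```

In a JUnit test, simply calling `load()` and letting the exception propagate would also fail the test; the wrapper only adds a clearer message.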
@psiddh has imported this pull request. If you are a Meta employee, you can view this in D105840841.
reportMetric("measured_tps", measuredTps)
reportMetric("measured_tokens", generatedTokens.size.toFloat())
reportMetric("min_tps_threshold", minTps)
Summary:
Adds `LlmPerformanceTest`, an Android instrumentation test that measures inference performance metrics (TPS, TPS stability, TTFT) for ExecuTorch LLM on the stories110M fixture and asserts they meet minimum thresholds.

This enables OKR 3.3 (Performance Testing: TPS/latency regression detection) using the same zero-infra approach as D105741356: same fixture, same CI paths, no new dependencies.

Three performance aspects are tested:
1. `testTpsAboveThreshold`: decode speed regression gate. A warm-up run is excluded from measurement. Threshold is configurable via instrumentation arg (`minTps`) so the same APK works on emulator (1.0 TPS) and device (10+ TPS).
2. `testTpsStability`: checks that the coefficient of variation across 3 runs is below 0.5. Catches thread contention, GC pressure, or scheduling instability that causes an inconsistent user experience.
3. `testTimeToFirstToken`: measures prompt evaluation latency (prefill time). Asserts TTFT < 30s. Catches regressions in the prefill/KV-cache-fill path that make the app feel unresponsive before generation starts.

All metrics are reported via InstrumentationRegistry.sendStatus() for CI metric capture and future dashboarding.

Differential Revision: D105840841

Pulled By: psiddh
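To make the two timing metrics concrete, here is a minimal sketch of how TTFT and decode TPS could be derived from per-run timestamps. The `RunTimings` class, its field names, and the choice to measure TPS over the window after the first token are all assumptions for illustration; the PR's actual formula is not shown in this conversation.

```kotlin
// Hypothetical container for timestamps a runtime might report for one generation run.
data class RunTimings(
    val inferenceStartMs: Long,
    val firstTokenMs: Long,
    val inferenceEndMs: Long,
    val generatedTokens: Int,
)

// Time to first token: prefill latency as seen by the caller.
fun ttftMs(t: RunTimings): Long = t.firstTokenMs - t.inferenceStartMs

// Decode throughput: tokens per second over the window after the first token.
// Returns 0 when the window is empty so callers never divide by zero.
fun decodeTps(t: RunTimings): Float {
    val decodeMs = t.inferenceEndMs - t.firstTokenMs
    if (decodeMs <= 0L || t.generatedTokens <= 0) return 0f
    return t.generatedTokens * 1000f / decodeMs
}

fun main() {
    val run = RunTimings(1_000, 1_500, 3_500, 40)
    val minTps = 10f          // device-style threshold from the summary
    val maxTtftMs = 30_000L   // the 30 s TTFT bound from the summary
    println("ttft_ms=${ttftMs(run)} tps=${decodeTps(run)}")
    println("tps gate: ${decodeTps(run) >= minTps}, ttft gate: ${ttftMs(run) < maxTtftMs}")
}
```

Returning 0 for an empty decode window makes a missing timestamp fail the TPS gate rather than crash the test.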
@claude review this PR
Claude finished @digantdesai's task in 1m 46s

Code Review: LlmPerformanceTest.kt

Overall this is a well-structured test that follows the patterns established by

Looks Good

Suggestions (non-blocking)

1. TPS calculation uses
   The
2. Synchronized list may not be needed (lines 53-55)
3. Consider asserting
   If
   assertTrue("first_token_ms was not set by the runtime", firstTokenMs > 0)
   ...would catch this case explicitly.
4. Stability test: 3 iterations with CV < 0.5 is very permissive (lines 148, 290)
   With only 3 samples, the coefficient of variation estimate has extremely high uncertainty. A CV threshold of 0.5 means you'd only catch cases where one run is ~2x faster/slower than another; at that point it's less "stability testing" and more "catastrophic failure detection." This is fine for an initial gate, but worth noting in follow-up that increasing
5. Minor: In

Summary

This is a solid addition. The test correctly uses the ExecuTorch stats API, handles warm-up properly, and provides configurable thresholds for different environments. The suggestions above are minor improvements; none are blocking.
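The point about a three-sample CV gate being permissive can be checked numerically. The sketch below assumes the sample standard deviation (n - 1 denominator); the PR may use the population form instead, and `coefficientOfVariation` is a hypothetical helper name.

```kotlin
import kotlin.math.sqrt

// Coefficient of variation: sample standard deviation divided by the mean.
// Assumes the (n - 1) denominator; the PR's exact definition is not shown here.
fun coefficientOfVariation(samples: List<Double>): Double {
    require(samples.size >= 2) { "need at least two samples" }
    val mean = samples.average()
    require(mean > 0.0) { "mean must be positive" }
    val variance = samples.sumOf { (it - mean) * (it - mean) } / (samples.size - 1)
    return sqrt(variance) / mean
}

fun main() {
    val steady = listOf(10.0, 10.5, 9.5)
    val oneRunDoubled = listOf(10.0, 10.0, 20.0)
    val oneRunTripled = listOf(10.0, 10.0, 30.0)
    for (runs in listOf(steady, oneRunDoubled, oneRunTripled)) {
        val cv = coefficientOfVariation(runs)
        println("runs=$runs cv=$cv passes=${cv < 0.5}")
    }
}
```

With these illustrative TPS values, a run at twice the speed of the other two still passes the 0.5 gate under this definition, while a run at three times the speed fails it, which is consistent with the review's description of the threshold as coarse.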