LCORE-1206: add tests for too long question #1232
radofuchs wants to merge 1 commit into lightspeed-core:main from
Conversation
Walkthrough

This PR extends end-to-end test coverage to validate query handling with shields enabled and disabled, particularly for long-query scenarios. It introduces shield management utilities (unregister/re-register via API), enhanced streaming error handling with error event tracking, and diagnostic capabilities for Llama Stack restoration failures.
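The "streaming error handling with error event tracking" mentioned in the walkthrough could work roughly as follows: parse the streamed body line by line and collect error events alongside token chunks. This is a minimal sketch only; the SSE-style `data:` lines and the JSON event shape (`event`/`data` fields) are assumptions, and the suite's actual `_parse_streaming_response` may differ.

```python
import json


def parse_streaming_response(raw: str) -> dict:
    """Parse an SSE-style streamed body into text chunks and error events.

    Assumes lines of the form `data: {...}` carrying JSON events with an
    `event` field; adapt to the real event format used by the service.
    """
    chunks: list[str] = []
    errors: list[str] = []
    for line in raw.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip malformed events rather than failing the parse
        if event.get("event") == "error":
            errors.append(event.get("data", {}).get("message", ""))
        elif event.get("event") == "token":
            chunks.append(event.get("data", {}).get("token", ""))
    return {"text": "".join(chunks), "errors": errors}
```

With a parser shaped like this, an assertion step can check `errors` for an expected message instead of scanning raw text.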
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 1
🧹 Nitpick comments (3)
tests/e2e/features/steps/llm_query_response.py (1)
193-193: Remove raw streamed body dump from assertion step.

Line 193 prints full streamed content on every run. This can create noisy logs and expose payload content unnecessarily in CI output.
🔧 Suggested fix
```diff
 @then("The streamed response contains error message {message}")
 def check_streamed_response_error_message(context: Context, message: str) -> None:
@@
     assert context.response is not None, "Request needs to be performed first"
-    print(context.response.text)
     parsed = _parse_streaming_response(context.response.text)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/e2e/features/steps/llm_query_response.py` at line 193, remove the raw dump call print(context.response.text) from the assertion step; either delete it or replace it with a non-sensitive debug/log call that only emits a short, masked summary (e.g., status and length) behind a verbosity flag or logger.debug, ensuring the assertion logic remains unchanged.

tests/e2e/features/query.feature (1)
220-225: Consider validating token metrics in the new long-query scenario.

Lines 220-225 check status/body only. Adding token metric assertions here would strengthen coverage against quota-related regressions on too-long query failures.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/e2e/features/query.feature` around lines 220 - 225, add assertions to the "Check if query with shields returns 413 when question is too long for model context" scenario to validate token metrics in the too-long-query response; after the step "When I use \"query\" to ask question with too-long query and authorization header" and the existing status/body checks, assert that the response includes token-related fields (e.g., token_count/token_usage or relevant headers) and that those values reflect the query exceeded the model/context limit (e.g., token_count > model_limit or a flagged usage field). Update the scenario's expectations so the step verifying the 413 and "Prompt is too long" message also checks the presence and correctness of these token metric indicators returned by the query handler.

tests/e2e/features/streaming_query.feature (1)
182-195: Add explicit token-metrics assertions to the new long-query streaming scenarios.

Lines 182 and 190 validate status/error behavior, but they don't assert token metric behavior. Please add metric capture/assertions so these paths also protect against quota accounting regressions.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/e2e/features/streaming_query.feature` around lines 182 - 195, Add explicit token-metrics capture and assertions around the two long-query streaming scenarios that use the "streaming_query" step (the scenario with shields that returns 413 and the `@disable-shields` scenario that returns 200). Before invoking the "streaming_query" step, record current token-related metrics (e.g., prompt_tokens, completion_tokens, tokens_consumed) from the same metrics endpoint or helper your test suite uses (call the existing getMetrics / metrics helper), then call "streaming_query" and re-fetch metrics and assert the deltas match expected behavior for each path (ensure the 413 path still records prompt token accounting if expected, and the `@disable-shields` path records streamed error token usage); add these metric assertions into the scenarios in tests/e2e/features/streaming_query.feature adjacent to the existing status/body checks.
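The metric-delta pattern these prompts describe could be sketched as below. Everything here is illustrative: the metric names, the Prometheus-style `/metrics` text being scraped, and the helper names are assumptions, and in the real suite the two step functions would be behave `@given`/`@then` steps operating on behave's `Context` rather than the stand-in dataclass used here.

```python
from dataclasses import dataclass, field

# Metric name prefixes are assumptions for illustration; adapt them to the
# counters the service actually exports on its metrics endpoint.
TOKEN_METRIC_PREFIXES = ("ls_llm_token_sent_total", "ls_llm_token_received_total")


def parse_token_counters(metrics_text: str) -> dict[str, float]:
    """Extract token counters from Prometheus text exposition format."""
    counters: dict[str, float] = {}
    for line in metrics_text.splitlines():
        if line.startswith(TOKEN_METRIC_PREFIXES):
            # "name{labels} value" -> split on the last space
            name, _, value = line.rpartition(" ")
            counters[name] = float(value)
    return counters


@dataclass
class StepContext:
    """Stand-in for behave's Context object."""
    token_counters_before: dict[str, float] = field(default_factory=dict)


def capture_token_metrics(context: StepContext, metrics_text: str) -> None:
    # Would run before invoking the "streaming_query" step.
    context.token_counters_before = parse_token_counters(metrics_text)


def assert_sent_tokens_increased(context: StepContext, metrics_text: str) -> None:
    # Would run after the step, comparing counter deltas.
    after = parse_token_counters(metrics_text)
    for name, before in context.token_counters_before.items():
        if name.startswith("ls_llm_token_sent_total"):
            assert after.get(name, 0.0) > before, f"{name} did not increase"
```

In the feature files, the capture step would precede the existing `streaming_query` step and the delta assertion would sit next to the current status/body checks.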
📒 Files selected for processing (5)
tests/e2e/features/environment.py
tests/e2e/features/query.feature
tests/e2e/features/steps/llm_query_response.py
tests/e2e/features/streaming_query.feature
tests/e2e/utils/llama_stack_shields.py
```python
# @disable-shields: unregister shield via client.shields.delete("llama-guard").
# Only in server mode: in library mode there is no separate Llama Stack to call,
# and unregistering in the test process would not affect the app's in-process instance.
if "disable-shields" in scenario.effective_tags:
    if context.is_library_mode:
        scenario.skip(
            "Shield unregister/register only applies in server mode (Llama Stack as a "
            "separate service). In library mode the app's shields cannot be disabled from e2e."
        )
        return
    try:
        saved = unregister_shield("llama-guard")
        context.llama_guard_provider_id = saved[0] if saved else None
        context.llama_guard_provider_shield_id = saved[1] if saved else None
        print("Unregistered shield llama-guard for this scenario")
```
Avoid unconditional shield re-registration; it can corrupt scenario isolation.
At Line 246, register_shield(...) runs for every @disable-shields scenario. But Line 187 can return None (shield absent), and then register_shield falls back to defaults, potentially creating a shield that did not exist before the scenario.
🔧 Suggested fix
```diff
 def before_scenario(context: Context, scenario: Scenario) -> None:
+    context.llama_guard_restore_required = False
+
     # @disable-shields: unregister shield via client.shields.delete("llama-guard").
     # Only in server mode: in library mode there is no separate Llama Stack to call,
     # and unregistering in the test process would not affect the app's in-process instance.
     if "disable-shields" in scenario.effective_tags:
@@
         try:
             saved = unregister_shield("llama-guard")
-            context.llama_guard_provider_id = saved[0] if saved else None
-            context.llama_guard_provider_shield_id = saved[1] if saved else None
-            print("Unregistered shield llama-guard for this scenario")
+            if saved:
+                context.llama_guard_provider_id = saved[0]
+                context.llama_guard_provider_shield_id = saved[1]
+                context.llama_guard_restore_required = True
+                print("Unregistered shield llama-guard for this scenario")
+            else:
+                context.llama_guard_provider_id = None
+                context.llama_guard_provider_shield_id = None
+                print("Shield llama-guard was not registered; nothing to restore")
         except Exception as e:  # pylint: disable=broad-exception-caught
             scenario.skip(
                 f"Could not unregister shield (is Llama Stack reachable?): {e}"
             )
             return
@@
-    if "disable-shields" in scenario.effective_tags:
+    if (
+        "disable-shields" in scenario.effective_tags
+        and not context.is_library_mode
+        and getattr(context, "llama_guard_restore_required", False)
+    ):
         try:
             provider_id = getattr(context, "llama_guard_provider_id", None)
             provider_shield_id = getattr(
                 context, "llama_guard_provider_shield_id", None
             )
             register_shield(
                 "llama-guard",
                 provider_id=provider_id,
                 provider_shield_id=provider_shield_id,
             )
             print("Re-registered shield llama-guard")
+            context.llama_guard_restore_required = False
         except Exception as e:  # pylint: disable=broad-exception-caught
             print(f"Warning: Could not re-register shield: {e}")
```

Also applies to: 245-256
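For context, the unregister/register helpers referenced in this comment (from tests/e2e/utils/llama_stack_shields.py) could look roughly like the sketch below. The client is duck-typed here; the real helpers presumably construct a LlamaStackClient internally, and the field names used to look up an existing shield (`identifier`, `provider_resource_id`) are assumptions.

```python
from typing import Any, Optional, Tuple


def unregister_shield(client: Any, shield_id: str) -> Optional[Tuple[str, str]]:
    """Delete a shield and return its (provider_id, provider_shield_id) so it
    can be restored later, or None if the shield was not registered."""
    existing = next(
        (s for s in client.shields.list() if s.identifier == shield_id), None
    )
    if existing is None:
        return None  # nothing to delete, and nothing to restore afterwards
    client.shields.delete(shield_id)
    return existing.provider_id, existing.provider_resource_id


def register_shield(
    client: Any,
    shield_id: str,
    provider_id: Optional[str] = None,
    provider_shield_id: Optional[str] = None,
) -> None:
    """Re-create a shield, falling back to defaults when no saved IDs exist.

    This fallback is exactly what the review flags: calling it when
    unregister_shield returned None creates a shield that did not exist before.
    """
    client.shields.register(
        shield_id=shield_id,
        provider_id=provider_id or "llama-guard",
        provider_shield_id=provider_shield_id or shield_id,
    )
```

With helpers shaped like this, the suggested fix amounts to only calling `register_shield` when `unregister_shield` returned a saved tuple.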
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/e2e/features/environment.py` around lines 176 - 190, When restoring
shields after a `@disable-shields` scenario, don't unconditionally call
register_shield with defaults if unregister_shield returned no previous
provider; change the restore logic to check the saved tuple from
unregister_shield (the values stored on context.llama_guard_provider_id and
context.llama_guard_provider_shield_id) and only call register_shield to
re-create the shield when saved is truthy (i.e., the shield existed before). If
saved is falsy/None, skip register_shield so you don't create a new default
shield and thus preserve scenario isolation; update any code paths that assume
register_shield will always run to use this presence check instead.
Description
Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit