LCORE-1206: add tests for too long question #1232

Open
radofuchs wants to merge 1 commit into lightspeed-core:main from radofuchs:LCORE_1206_OLS_test_alignment

Conversation

@radofuchs
Contributor

@radofuchs radofuchs commented Feb 27, 2026

Description

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change
  • Unit tests improvement
  • Integration tests improvement
  • End to end tests improvement
  • Benchmarks improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: (e.g., Claude, CodeRabbit, Ollama, etc., N/A if not used) Cursor
  • Generated by: (e.g., tool name and version; N/A if not used)

Related Tickets & Documents

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • Tests
    • Enhanced end-to-end test coverage for query scenarios with long inputs, including tests with and without shield protection.
    • Improved test infrastructure with robust shield management and Llama Stack diagnostics capabilities.
    • Added streaming response error handling and validation for test scenarios.

@coderabbitai
Contributor

coderabbitai bot commented Feb 27, 2026

Walkthrough

This PR extends end-to-end test coverage to validate query handling with shields enabled and disabled, particularly for long-query scenarios. It introduces shield management utilities (unregister/re-register via API), enhanced streaming error handling with error event tracking, and diagnostic capabilities for Llama Stack restoration failures.
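The unregister/re-register contract described above can be sketched with an in-memory stub. The real utilities talk to Llama Stack through AsyncLlamaStackClient; the class and method names below are illustrative only, demonstrating the save/restore semantics rather than the PR's actual code:

```python
from __future__ import annotations

# Illustrative stand-in for the shields API: maps a shield identifier to its
# (provider_id, provider_shield_id) configuration. Real code performs these
# operations over the network against a running Llama Stack instance.
class FakeShieldRegistry:
    def __init__(self) -> None:
        self._shields: dict[str, tuple[str, str]] = {}

    def register(self, identifier: str, provider_id: str, provider_shield_id: str) -> None:
        self._shields[identifier] = (provider_id, provider_shield_id)

    def unregister(self, identifier: str) -> tuple[str, str] | None:
        """Remove the shield, returning its previous config (None if it was absent)."""
        return self._shields.pop(identifier, None)

    def restore(self, identifier: str, saved: tuple[str, str] | None) -> bool:
        """Re-register only when the shield existed before; preserves scenario isolation."""
        if saved is None:
            return False
        self.register(identifier, *saved)
        return True


registry = FakeShieldRegistry()
registry.register("llama-guard", "llama-guard-provider", "llama-guard-shield")
saved = registry.unregister("llama-guard")         # shield gone for the scenario
restored = registry.restore("llama-guard", saved)  # put it back afterwards
```

Keeping the saved config nullable is what lets the after-scenario hook distinguish "restore the previous shield" from "there was nothing to restore".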

Changes

  • Shield Management & Diagnostics (tests/e2e/features/environment.py, tests/e2e/utils/llama_stack_shields.py):
    Adds before/after scenario hooks to unregister and re-register the "llama-guard" shield in server mode, with a new utility module providing AsyncLlamaStackClient-based functions for shield lifecycle operations. Introduces _print_llama_stack_diagnostics() to emit container state, health, and docker logs on restoration failure or unhealthy state.
  • Query Test Scenarios (tests/e2e/features/query.feature, tests/e2e/features/streaming_query.feature):
    Adds four new test scenarios: two in query.feature (active and skipped) testing long-query 413 responses with/without shields; two in streaming_query.feature testing long-query handling with shields enabled (expects 413) and disabled (expects 200 with a streamed error).
  • Query Step Implementation (tests/e2e/features/steps/llm_query_response.py):
    Adds streaming response aggregation via _read_streamed_response() to handle streaming endpoints; extends streaming error tracking to capture "error" events in SSE streams; adds ask_question_too_long_authorized() and check_streamed_response_error_message() steps for long-query validation and error verification.
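The streaming aggregation described for llm_query_response.py can be sketched roughly as follows. The function name, SSE event shape, and "token"/"error" discriminators are assumptions modeled on common server-sent-event payloads, not the PR's exact format:

```python
import json

def read_streamed_response(body: str) -> tuple[str, list[str]]:
    """Aggregate an SSE body: concatenate token events, collect "error" events.

    Assumes "data: {...}" lines carrying JSON with an "event" discriminator,
    a common SSE shape (hypothetical here, not necessarily the PR's format).
    """
    answer_parts: list[str] = []
    errors: list[str] = []
    for line in body.splitlines():
        if not line.startswith("data: "):
            continue  # skip blank keep-alives, comments, and "event:" lines
        payload = line[len("data: "):].strip()
        if payload == "[DONE]":
            break
        event = json.loads(payload)
        if event.get("event") == "error":
            errors.append(event.get("data", ""))
        elif event.get("event") == "token":
            answer_parts.append(event.get("data", ""))
    return "".join(answer_parts), errors

body = (
    'data: {"event": "token", "data": "Hello"}\n'
    'data: {"event": "token", "data": " world"}\n'
    'data: {"event": "error", "data": "Prompt is too long"}\n'
    "data: [DONE]\n"
)
answer, errors = read_streamed_response(body)
```

Collecting errors separately is what lets a 200-status stream still be asserted against an in-band error message, as the disabled-shields scenario requires.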

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • tisnik
  • are-ces
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Linked Issues check: ❓ Inconclusive. The PR adds tests for too-long questions with shields, but linked issue #1103 states the objective is to add tests for token quotas. The relationship between long-question tests and token-quota testing is unclear. Resolution: clarify how the too-long-question test coverage addresses the token-quota testing requirement stated in issue #1103, or verify whether the scope has been updated.

✅ Passed checks (4 passed)

  • Description check: ✅ Passed. Check skipped - CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The PR title clearly and concisely summarizes the main objective: adding end-to-end tests for handling too-long questions, which aligns with the primary changes across all modified files.
  • Out of Scope Changes check: ✅ Passed. All changes are focused on end-to-end tests for too-long questions: new test scenarios, streaming response handling, shield management, and diagnostics. These are all directly related to the PR objective.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
tests/e2e/features/steps/llm_query_response.py (1)

193-193: Remove raw streamed body dump from assertion step.

Line 193 prints full streamed content on every run. This can create noisy logs and expose payload content unnecessarily in CI output.

🔧 Suggested fix
 @then("The streamed response contains error message {message}")
 def check_streamed_response_error_message(context: Context, message: str) -> None:
 @@
     assert context.response is not None, "Request needs to be performed first"
-    print(context.response.text)
     parsed = _parse_streaming_response(context.response.text)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/features/steps/llm_query_response.py` at line 193, Remove the raw
dump call print(context.response.text) from the assertion step (the print
statement that outputs context.response.text); either delete it or replace it
with a non-sensitive debug/log call that only emits a short, masked summary
(e.g., status and length) behind a verbosity flag or logger.debug, ensuring the
assertion logic (in the step handling LLM responses) remains unchanged.
tests/e2e/features/query.feature (1)

220-225: Consider validating token metrics in the new active long-query scenario.

Line 220-225 checks status/body only. Adding token metric assertions here would strengthen coverage for quota-related regressions on too-long query failures.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/features/query.feature` around lines 220 - 225, Add assertions to
the "Check if query with shields returns 413 when question is too long for model
context" scenario to validate token metrics in the too-long-query response;
after the step "When I use \"query\" to ask question with too-long query and
authorization header" and the existing status/body checks, assert that the
response includes token-related fields (e.g., token_count/token_usage or
relevant headers) and that those values reflect the query exceeded the
model/context limit (e.g., token_count > model_limit or a flagged usage field).
Update the scenario's expectations so the step verifying the 413 and "Prompt is
too long" message also checks the presence and correctness of these token metric
indicators returned by the query handler.
tests/e2e/features/streaming_query.feature (1)

182-195: Add explicit token-metrics assertions to new long-query streaming scenarios.

Lines 182 and 190 validate status/error behavior but do not assert token-metric behavior. Please add metric capture and assertions so these paths also protect against quota-accounting regressions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/features/streaming_query.feature` around lines 182 - 195, Add
explicit token-metrics capture and assertions around the two long-query
streaming scenarios that use the "streaming_query" step (the scenario with
shields that returns 413 and the `@disable-shields` scenario that returns 200).
Before invoking the "streaming_query" step, record current token-related metrics
(e.g., prompt_tokens, completion_tokens, tokens_consumed) from the same metrics
endpoint or helper your test suite uses (call the existing getMetrics / metrics
helper), then call "streaming_query" and re-fetch metrics and assert the deltas
match expected behavior for each path (ensure the 413 path still records prompt
token accounting if expected, and the `@disable-shields` path records streamed
error token usage); add these metric assertions into the scenarios in
tests/e2e/features/streaming_query.feature adjacent to the existing status/body
checks.
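The before/after metric-delta pattern suggested above can be sketched as follows. The counter name and the minimal Prometheus-style text parsing are assumptions for illustration, not this repo's actual helpers or metric names:

```python
def parse_counters(metrics_text: str) -> dict[str, float]:
    """Parse a minimal Prometheus-style exposition body into {metric_name: value}.

    Ignores comments and blank lines; assumes unlabeled "name value" lines.
    """
    counters: dict[str, float] = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        try:
            counters[name] = float(value)
        except ValueError:
            continue
    return counters

def counter_delta(before: str, after: str, name: str) -> float:
    """Delta of one counter between two /metrics snapshots."""
    return parse_counters(after).get(name, 0.0) - parse_counters(before).get(name, 0.0)

# Hypothetical snapshots captured before and after a long-query request:
before = "llm_prompt_tokens_total 120\n"
after = "llm_prompt_tokens_total 220\n"
delta = counter_delta(before, after, "llm_prompt_tokens_total")
```

Asserting on the delta rather than the absolute value keeps the check stable when scenarios run in any order.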

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9a6024a and 51e8477.

📒 Files selected for processing (5)
  • tests/e2e/features/environment.py
  • tests/e2e/features/query.feature
  • tests/e2e/features/steps/llm_query_response.py
  • tests/e2e/features/streaming_query.feature
  • tests/e2e/utils/llama_stack_shields.py

Comment on lines +176 to +190

    # @disable-shields: unregister shield via client.shields.delete("llama-guard").
    # Only in server mode: in library mode there is no separate Llama Stack to call,
    # and unregistering in the test process would not affect the app's in-process instance.
    if "disable-shields" in scenario.effective_tags:
        if context.is_library_mode:
            scenario.skip(
                "Shield unregister/register only applies in server mode (Llama Stack as a "
                "separate service). In library mode the app's shields cannot be disabled from e2e."
            )
            return
        try:
            saved = unregister_shield("llama-guard")
            context.llama_guard_provider_id = saved[0] if saved else None
            context.llama_guard_provider_shield_id = saved[1] if saved else None
            print("Unregistered shield llama-guard for this scenario")

⚠️ Potential issue | 🟠 Major

Avoid unconditional shield re-registration; it can corrupt scenario isolation.

At Line 246, register_shield(...) runs for every @disable-shields scenario. But Line 187 can return None (shield absent), and then register_shield falls back to defaults, potentially creating a shield that did not exist before the scenario.

🔧 Suggested fix
 def before_scenario(context: Context, scenario: Scenario) -> None:
+    context.llama_guard_restore_required = False
+
     # `@disable-shields`: unregister shield via client.shields.delete("llama-guard").
     # Only in server mode: in library mode there is no separate Llama Stack to call,
     # and unregistering in the test process would not affect the app's in-process instance.
     if "disable-shields" in scenario.effective_tags:
@@
         try:
             saved = unregister_shield("llama-guard")
-            context.llama_guard_provider_id = saved[0] if saved else None
-            context.llama_guard_provider_shield_id = saved[1] if saved else None
-            print("Unregistered shield llama-guard for this scenario")
+            if saved:
+                context.llama_guard_provider_id = saved[0]
+                context.llama_guard_provider_shield_id = saved[1]
+                context.llama_guard_restore_required = True
+                print("Unregistered shield llama-guard for this scenario")
+            else:
+                context.llama_guard_provider_id = None
+                context.llama_guard_provider_shield_id = None
+                print("Shield llama-guard was not registered; nothing to restore")
         except Exception as e:  # pylint: disable=broad-exception-caught
             scenario.skip(
                 f"Could not unregister shield (is Llama Stack reachable?): {e}"
             )
             return
@@
-    if "disable-shields" in scenario.effective_tags:
+    if (
+        "disable-shields" in scenario.effective_tags
+        and not context.is_library_mode
+        and getattr(context, "llama_guard_restore_required", False)
+    ):
         try:
             provider_id = getattr(context, "llama_guard_provider_id", None)
             provider_shield_id = getattr(
                 context, "llama_guard_provider_shield_id", None
             )
             register_shield(
                 "llama-guard",
                 provider_id=provider_id,
                 provider_shield_id=provider_shield_id,
             )
             print("Re-registered shield llama-guard")
+            context.llama_guard_restore_required = False
         except Exception as e:  # pylint: disable=broad-exception-caught
             print(f"Warning: Could not re-register shield: {e}")

Also applies to: 245-256

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/features/environment.py` around lines 176 - 190, When restoring
shields after a `@disable-shields` scenario, don't unconditionally call
register_shield with defaults if unregister_shield returned no previous
provider; change the restore logic to check the saved tuple from
unregister_shield (the values stored on context.llama_guard_provider_id and
context.llama_guard_provider_shield_id) and only call register_shield to
re-create the shield when saved is truthy (i.e., the shield existed before). If
saved is falsy/None, skip register_shield so you don't create a new default
shield and thus preserve scenario isolation; update any code paths that assume
register_shield will always run to use this presence check instead.

@radofuchs radofuchs requested a review from tisnik February 27, 2026 08:59