LCORE-1206: add tests for too long question #1232

Open
radofuchs wants to merge 1 commit into lightspeed-core:main from radofuchs:LCORE_1206_OLS_test_alignment

Conversation

@radofuchs
Contributor

@radofuchs radofuchs commented Feb 27, 2026

Description

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change
  • Unit tests improvement
  • Integration tests improvement
  • End to end tests improvement
  • Benchmarks improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: (e.g., Claude, CodeRabbit, Ollama, etc., N/A if not used) Cursor
  • Generated by: (e.g., tool name and version; N/A if not used)

Related Tickets & Documents

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • Tests
    • Enhanced end-to-end test coverage for query scenarios with long inputs, including tests with and without shield protection.
    • Improved test infrastructure with robust shield management and Llama Stack diagnostics capabilities.
    • Added streaming response error handling and validation for test scenarios.

@coderabbitai
Contributor

coderabbitai bot commented Feb 27, 2026

Walkthrough

This PR extends end-to-end test coverage to validate query handling with shields enabled and disabled, particularly for long-query scenarios. It introduces shield management utilities (unregister/re-register via API), enhanced streaming error handling with error event tracking, and diagnostic capabilities for Llama Stack restoration failures.
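The unregister/re-register contract described above can be sketched with an in-memory stub. The real utilities talk to Llama Stack through AsyncLlamaStackClient; the class and method names below are illustrative only, demonstrating the save/restore semantics rather than the PR's actual code:

```python
from __future__ import annotations

# Illustrative stand-in for the shields API: maps a shield identifier to its
# (provider_id, provider_shield_id) configuration. Real code performs these
# operations over the network against a running Llama Stack instance.
class FakeShieldRegistry:
    def __init__(self) -> None:
        self._shields: dict[str, tuple[str, str]] = {}

    def register(self, identifier: str, provider_id: str, provider_shield_id: str) -> None:
        self._shields[identifier] = (provider_id, provider_shield_id)

    def unregister(self, identifier: str) -> tuple[str, str] | None:
        """Remove the shield, returning its previous config (None if it was absent)."""
        return self._shields.pop(identifier, None)

    def restore(self, identifier: str, saved: tuple[str, str] | None) -> bool:
        """Re-register only when the shield existed before; preserves scenario isolation."""
        if saved is None:
            return False
        self.register(identifier, *saved)
        return True


registry = FakeShieldRegistry()
registry.register("llama-guard", "llama-guard-provider", "llama-guard-shield")
saved = registry.unregister("llama-guard")         # shield gone for the scenario
restored = registry.restore("llama-guard", saved)  # put it back afterwards
```

Keeping the saved config nullable is what lets the after-scenario hook distinguish "restore the previous shield" from "there was nothing to restore".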

Changes

  • Shield Management & Diagnostics (tests/e2e/features/environment.py, tests/e2e/utils/llama_stack_shields.py):
    Adds before/after scenario hooks to unregister and re-register the "llama-guard" shield in server mode, with a new utility module providing AsyncLlamaStackClient-based functions for shield lifecycle operations. Introduces _print_llama_stack_diagnostics() to emit container state, health, and docker logs on restoration failure or unhealthy state.
  • Query Test Scenarios (tests/e2e/features/query.feature, tests/e2e/features/streaming_query.feature):
    Adds four new test scenarios: two in query.feature (active and skipped) testing long-query 413 responses with/without shields; two in streaming_query.feature testing long-query handling with shields enabled (expects 413) and disabled (expects 200 with a streamed error).
  • Query Step Implementation (tests/e2e/features/steps/llm_query_response.py):
    Adds streaming response aggregation via _read_streamed_response() to handle streaming endpoints; extends streaming error tracking to capture "error" events in SSE streams; adds ask_question_too_long_authorized() and check_streamed_response_error_message() steps for long-query validation and error verification.
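The streaming aggregation described for llm_query_response.py can be sketched roughly as follows. The function name, SSE event shape, and "token"/"error" discriminators are assumptions modeled on common server-sent-event payloads, not the PR's exact format:

```python
import json

def read_streamed_response(body: str) -> tuple[str, list[str]]:
    """Aggregate an SSE body: concatenate token events, collect "error" events.

    Assumes "data: {...}" lines carrying JSON with an "event" discriminator,
    a common SSE shape (hypothetical here, not necessarily the PR's format).
    """
    answer_parts: list[str] = []
    errors: list[str] = []
    for line in body.splitlines():
        if not line.startswith("data: "):
            continue  # skip blank keep-alives, comments, and "event:" lines
        payload = line[len("data: "):].strip()
        if payload == "[DONE]":
            break
        event = json.loads(payload)
        if event.get("event") == "error":
            errors.append(event.get("data", ""))
        elif event.get("event") == "token":
            answer_parts.append(event.get("data", ""))
    return "".join(answer_parts), errors

body = (
    'data: {"event": "token", "data": "Hello"}\n'
    'data: {"event": "token", "data": " world"}\n'
    'data: {"event": "error", "data": "Prompt is too long"}\n'
    "data: [DONE]\n"
)
answer, errors = read_streamed_response(body)
```

Collecting errors separately is what lets a 200-status stream still be asserted against an in-band error message, as the disabled-shields scenario requires.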

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • tisnik
  • are-ces
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Linked Issues check: ❓ Inconclusive. The PR adds tests for too-long questions with shields, but linked issue #1103 states the objective is to add tests for token quotas. The relationship between long-question tests and token-quota testing is unclear. Resolution: clarify how the too-long-question test coverage addresses the token-quota testing requirement stated in issue #1103, or verify whether the scope has been updated.

✅ Passed checks (4 passed)

  • Description check: ✅ Passed. Check skipped - CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The PR title clearly and concisely summarizes the main objective: adding end-to-end tests for handling too-long questions, which aligns with the primary changes across all modified files.
  • Out of Scope Changes check: ✅ Passed. All changes are focused on end-to-end tests for too-long questions: new test scenarios, streaming response handling, shield management, and diagnostics. These are all directly related to the PR objective.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
tests/e2e/features/steps/llm_query_response.py (1)

193-193: Remove raw streamed body dump from assertion step.

Line 193 prints full streamed content on every run. This can create noisy logs and expose payload content unnecessarily in CI output.

🔧 Suggested fix
 @then("The streamed response contains error message {message}")
 def check_streamed_response_error_message(context: Context, message: str) -> None:
 @@
     assert context.response is not None, "Request needs to be performed first"
-    print(context.response.text)
     parsed = _parse_streaming_response(context.response.text)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/features/steps/llm_query_response.py` at line 193, Remove the raw
dump call print(context.response.text) from the assertion step (the print
statement that outputs context.response.text); either delete it or replace it
with a non-sensitive debug/log call that only emits a short, masked summary
(e.g., status and length) behind a verbosity flag or logger.debug, ensuring the
assertion logic (in the step handling LLM responses) remains unchanged.
tests/e2e/features/query.feature (1)

220-225: Consider validating token metrics in the new active long-query scenario.

Line 220-225 checks status/body only. Adding token metric assertions here would strengthen coverage for quota-related regressions on too-long query failures.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/features/query.feature` around lines 220 - 225, Add assertions to
the "Check if query with shields returns 413 when question is too long for model
context" scenario to validate token metrics in the too-long-query response;
after the step "When I use \"query\" to ask question with too-long query and
authorization header" and the existing status/body checks, assert that the
response includes token-related fields (e.g., token_count/token_usage or
relevant headers) and that those values reflect the query exceeded the
model/context limit (e.g., token_count > model_limit or a flagged usage field).
Update the scenario's expectations so the step verifying the 413 and "Prompt is
too long" message also checks the presence and correctness of these token metric
indicators returned by the query handler.
tests/e2e/features/streaming_query.feature (1)

182-195: Add explicit token-metrics assertions to new long-query streaming scenarios.

Lines 182 and 190 validate status/error behavior but do not assert token-metric behavior. Please add metric capture and assertions so these paths also protect against quota-accounting regressions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/features/streaming_query.feature` around lines 182 - 195, Add
explicit token-metrics capture and assertions around the two long-query
streaming scenarios that use the "streaming_query" step (the scenario with
shields that returns 413 and the `@disable-shields` scenario that returns 200).
Before invoking the "streaming_query" step, record current token-related metrics
(e.g., prompt_tokens, completion_tokens, tokens_consumed) from the same metrics
endpoint or helper your test suite uses (call the existing getMetrics / metrics
helper), then call "streaming_query" and re-fetch metrics and assert the deltas
match expected behavior for each path (ensure the 413 path still records prompt
token accounting if expected, and the `@disable-shields` path records streamed
error token usage); add these metric assertions into the scenarios in
tests/e2e/features/streaming_query.feature adjacent to the existing status/body
checks.
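The before/after metric-delta pattern suggested above can be sketched as follows. The counter name and the minimal Prometheus-style text parsing are assumptions for illustration, not this repo's actual helpers or metric names:

```python
def parse_counters(metrics_text: str) -> dict[str, float]:
    """Parse a minimal Prometheus-style exposition body into {metric_name: value}.

    Ignores comments and blank lines; assumes unlabeled "name value" lines.
    """
    counters: dict[str, float] = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        try:
            counters[name] = float(value)
        except ValueError:
            continue
    return counters

def counter_delta(before: str, after: str, name: str) -> float:
    """Delta of one counter between two /metrics snapshots."""
    return parse_counters(after).get(name, 0.0) - parse_counters(before).get(name, 0.0)

# Hypothetical snapshots captured before and after a long-query request:
before = "llm_prompt_tokens_total 120\n"
after = "llm_prompt_tokens_total 220\n"
delta = counter_delta(before, after, "llm_prompt_tokens_total")
```

Asserting on the delta rather than the absolute value keeps the check stable when scenarios run in any order.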

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9a6024a and 51e8477.

📒 Files selected for processing (5)
  • tests/e2e/features/environment.py
  • tests/e2e/features/query.feature
  • tests/e2e/features/steps/llm_query_response.py
  • tests/e2e/features/streaming_query.feature
  • tests/e2e/utils/llama_stack_shields.py

Comment on lines +176 to +190

    # @disable-shields: unregister shield via client.shields.delete("llama-guard").
    # Only in server mode: in library mode there is no separate Llama Stack to call,
    # and unregistering in the test process would not affect the app's in-process instance.
    if "disable-shields" in scenario.effective_tags:
        if context.is_library_mode:
            scenario.skip(
                "Shield unregister/register only applies in server mode (Llama Stack as a "
                "separate service). In library mode the app's shields cannot be disabled from e2e."
            )
            return
        try:
            saved = unregister_shield("llama-guard")
            context.llama_guard_provider_id = saved[0] if saved else None
            context.llama_guard_provider_shield_id = saved[1] if saved else None
            print("Unregistered shield llama-guard for this scenario")

⚠️ Potential issue | 🟠 Major

Avoid unconditional shield re-registration; it can corrupt scenario isolation.

At Line 246, register_shield(...) runs for every @disable-shields scenario. But Line 187 can return None (shield absent), and then register_shield falls back to defaults, potentially creating a shield that did not exist before the scenario.

🔧 Suggested fix
 def before_scenario(context: Context, scenario: Scenario) -> None:
+    context.llama_guard_restore_required = False
+
     # `@disable-shields`: unregister shield via client.shields.delete("llama-guard").
     # Only in server mode: in library mode there is no separate Llama Stack to call,
     # and unregistering in the test process would not affect the app's in-process instance.
     if "disable-shields" in scenario.effective_tags:
@@
         try:
             saved = unregister_shield("llama-guard")
-            context.llama_guard_provider_id = saved[0] if saved else None
-            context.llama_guard_provider_shield_id = saved[1] if saved else None
-            print("Unregistered shield llama-guard for this scenario")
+            if saved:
+                context.llama_guard_provider_id = saved[0]
+                context.llama_guard_provider_shield_id = saved[1]
+                context.llama_guard_restore_required = True
+                print("Unregistered shield llama-guard for this scenario")
+            else:
+                context.llama_guard_provider_id = None
+                context.llama_guard_provider_shield_id = None
+                print("Shield llama-guard was not registered; nothing to restore")
         except Exception as e:  # pylint: disable=broad-exception-caught
             scenario.skip(
                 f"Could not unregister shield (is Llama Stack reachable?): {e}"
             )
             return
@@
-    if "disable-shields" in scenario.effective_tags:
+    if (
+        "disable-shields" in scenario.effective_tags
+        and not context.is_library_mode
+        and getattr(context, "llama_guard_restore_required", False)
+    ):
         try:
             provider_id = getattr(context, "llama_guard_provider_id", None)
             provider_shield_id = getattr(
                 context, "llama_guard_provider_shield_id", None
             )
             register_shield(
                 "llama-guard",
                 provider_id=provider_id,
                 provider_shield_id=provider_shield_id,
             )
             print("Re-registered shield llama-guard")
+            context.llama_guard_restore_required = False
         except Exception as e:  # pylint: disable=broad-exception-caught
             print(f"Warning: Could not re-register shield: {e}")

Also applies to: 245-256

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/features/environment.py` around lines 176 - 190, When restoring
shields after a `@disable-shields` scenario, don't unconditionally call
register_shield with defaults if unregister_shield returned no previous
provider; change the restore logic to check the saved tuple from
unregister_shield (the values stored on context.llama_guard_provider_id and
context.llama_guard_provider_shield_id) and only call register_shield to
re-create the shield when saved is truthy (i.e., the shield existed before). If
saved is falsy/None, skip register_shield so you don't create a new default
shield and thus preserve scenario isolation; update any code paths that assume
register_shield will always run to use this presence check instead.

@radofuchs radofuchs requested a review from tisnik February 27, 2026 08:59