LCORE-1350: Refactor of vector_search utilities by asimurka · Pull Request #1233 · lightspeed-core/lightspeed-stack

asimurka · 2026-02-27T08:17:32Z

Description

This PR refactors vector search utilities such that:

functions accept individual query request attributes instead of the whole request object
configuration options are not passed as arguments
already existing utilities are reused where possible
unnecessary variables are removed

Type of change

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

Assisted-by: N/A

Related Tickets & Documents

Related Issue # LCORE-1350
Closes #

Checklist before requesting a review

I have performed a self-review of my code.
PR has passed all pre-merge test jobs.
If it is a core feature, I have added thorough tests.

Testing

Please provide detailed steps to perform tests related to this code change.
How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

Refactor
- Simplified internal vector search and query parameter handling to improve consistency across regular and streaming results.
Bug Fixes
- More reliable aggregation and deduplication of referenced documents so search results and streamed responses show clearer, non-duplicated document references.

coderabbitai · 2026-02-27T08:17:54Z

Walkthrough

Refactors vector search to pass a raw query string and optional solr params into perform_vector_search (signature changed). _is_solr_enabled no longer accepts configuration param. Endpoints adapt to the new return values and streaming deduplication moved to a shared utility.

Changes

Cohort / File(s)	Summary
Endpoints `src/app/endpoints/query.py`, `src/app/endpoints/streaming_query.py`	Call sites updated to pass `query_request.query` and `query_request.solr` into `perform_vector_search`; removed local initializations for `doc_ids_from_chunks` / `pre_rag_chunks`; streaming endpoint now uses `deduplicate_referenced_documents` and `retrieve_response_generator` gained a `doc_ids_from_chunks` parameter.
Vector search utility `src/utils/vector_search.py`	Changed `perform_vector_search(client, query_request, configuration, ...)` → `perform_vector_search(client, query: str, solr: Optional[dict]=None, ...)`; `_is_solr_enabled()` now parameterless; `_build_query_params(solr)` signature updated; chunk typing refined to `Chunk`.
Imports / types `src/app/endpoints/query.py`, `src/app/endpoints/streaming_query.py`, `src/utils/vector_search.py`	Removed `ReferencedDocument` / `RAGChunk` imports from endpoints; added/adjusted imports for `Chunk` and `deduplicate_referenced_documents`; updated type usages and logging references to new param names (`query`, `solr`).

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Endpoint as Endpoint (query/streaming)
    participant VectorSearch as perform_vector_search
    participant VectorStore as Vector IO
    participant Dedupe as deduplicate_referenced_documents

    Client->>Endpoint: send QueryRequest (includes query, solr)
    Endpoint->>VectorSearch: perform_vector_search(query, solr, client)
    VectorSearch->>VectorStore: query vector index (query, solr params)
    VectorStore-->>VectorSearch: returned chunks + doc ids
    VectorSearch-->>Endpoint: return chunks, doc_ids_from_chunks, pre_rag_chunks
    Endpoint->>Dedupe: deduplicate_referenced_documents(tool_docs + pre_rag_chunks)
    Dedupe-->>Endpoint: referenced_documents
    Endpoint-->>Client: streaming/generated response + turn_summary (referenced_documents)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

LCORE-873: Support Solr Vector I/O Provider in LCORE #868: Modifies vector search API and endpoint integrations to support Solr-backed vector search and signature changes.

Suggested reviewers

are-ces
tisnik

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'LCORE-1350: Refactor of vector_search utilities' accurately describes the main change in the PR, which refactors vector_search utilities across multiple files including src/utils/vector_search.py and related endpoint modules.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

asimurka · 2026-02-27T08:20:19Z

src/app/endpoints/query.py


    if doc_ids_from_chunks:
        turn_summary.referenced_documents = deduplicate_referenced_documents(
-            doc_ids_from_chunks + (turn_summary.referenced_documents or [])


Not necessary, attribute is never None

asimurka · 2026-02-27T08:21:01Z

src/app/endpoints/streaming_query.py


-    pre_rag_chunks: list[RAGChunk] = []
-    doc_ids_from_chunks: list[ReferencedDocument] = []
-


Not necessary to initialize

asimurka · 2026-02-27T08:21:36Z

src/app/endpoints/streaming_query.py

        rag_id_mapping=context.rag_id_mapping,
    )

-    # Merge pre-RAG documents with tool-based documents (similar to query.py)


Utility function for this purpose already exists

coderabbitai

🧹 Nitpick comments (2)

src/utils/vector_search.py (1)

41-41: Add explicit return type annotation.

The _build_query_params function is missing its return type annotation. As per coding guidelines, all functions must have complete type annotations.

Proposed fix

-def _build_query_params(solr: Optional[dict[str, Any]] = None) -> dict:
+def _build_query_params(solr: Optional[dict[str, Any]] = None) -> dict[str, Any]:

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/utils/vector_search.py` at line 41, Add an explicit return type to the
_build_query_params function signature (e.g., change def
_build_query_params(solr: Optional[dict[str, Any]] = None) -> dict[str, Any]:)
so the function has a complete type annotation; update the signature in the
_build_query_params definition and ensure typing imports remain correct for
Optional and Any if needed.

src/app/endpoints/streaming_query.py (1)

263-280: Update docstring to document the new doc_ids_from_chunks parameter.

The function signature was updated to include doc_ids_from_chunks: list[ReferencedDocument], but the docstring's Args section doesn't document this new parameter. As per coding guidelines, all functions must have complete docstrings with Parameters sections.

Proposed docstring update

     """
     Retrieve the appropriate response generator.

     Handles shield moderation check and retrieves response.
     Returns the generator (shield violation or response generator) and turn_summary.
     Fills turn_summary attributes for token usage, referenced documents, and tool calls.

     Args:
         responses_params: The Responses API parameters
         context: The response generator context
+        doc_ids_from_chunks: Referenced documents extracted from pre-RAG vector search

     Returns:
         tuple[AsyncIterator[str], TurnSummary]: The response generator and turn summary

     """

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/app/endpoints/streaming_query.py` around lines 263 - 280, The docstring
for retrieve_response_generator is missing documentation for the new parameter
doc_ids_from_chunks; update the Args (or Parameters) section to include
doc_ids_from_chunks: list[ReferencedDocument] describing it as the list of
ReferencedDocument objects extracted from input chunks that should be considered
for referenced document tracking and included in
TurnSummary.referenced_documents, and ensure the description notes its role in
token usage and moderation handling similar to responses_params and context.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/app/endpoints/streaming_query.py`:
- Around line 263-280: The docstring for retrieve_response_generator is missing
documentation for the new parameter doc_ids_from_chunks; update the Args (or
Parameters) section to include doc_ids_from_chunks: list[ReferencedDocument]
describing it as the list of ReferencedDocument objects extracted from input
chunks that should be considered for referenced document tracking and included
in TurnSummary.referenced_documents, and ensure the description notes its role
in token usage and moderation handling similar to responses_params and context.

In `@src/utils/vector_search.py`:
- Line 41: Add an explicit return type to the _build_query_params function
signature (e.g., change def _build_query_params(solr: Optional[dict[str, Any]] =
None) -> dict[str, Any]:) so the function has a complete type annotation; update
the signature in the _build_query_params definition and ensure typing imports
remain correct for Optional and Any if needed.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9a6024a and d69de5e.

📒 Files selected for processing (3)

src/app/endpoints/query.py
src/app/endpoints/streaming_query.py
src/utils/vector_search.py

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/utils/vector_search.py (1)

48-57: ⚠️ Potential issue | 🟠 Major

Avoid logging raw Solr payloads at info level.

These lines log request-derived Solr filters and full query params verbatim at info. That can leak sensitive metadata and create noisy high-volume logs. Keep this detail in debug (or log only coarse flags/counts).

Proposed fix

-    logger.info("Initial params: %s", params)
-    logger.info("solr: %s", solr)
+    logger.debug("Initialized vector search params")
+    logger.debug("Solr filters present: %s", bool(solr))

     if solr:
         params["solr"] = solr
-        logger.info("Final params with solr filters: %s", params)
+        logger.debug("Applied Solr filters to vector search params")
     else:
-        logger.info("No solr filters provided")
+        logger.debug("No Solr filters provided")

-    logger.info("Final params being sent to vector_io.query: %s", params)
+    logger.debug("Vector search params ready for vector_io.query")

As per coding guidelines, "Use logger.debug() for detailed diagnostic information, logger.info() for general execution info, logger.warning() for unexpected events, and logger.error() for serious problems".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/utils/vector_search.py` around lines 48 - 57, The current logging in
vector_search.py prints raw Solr filters and full query params at info level
(logger.info("Initial params: %s", params), logger.info("solr: %s", solr),
logger.info("Final params with solr filters: %s", params), logger.info("Final
params being sent to vector_io.query: %s", params)), which can leak sensitive
data; change these to logger.debug() and/or replace the payloads with coarse
summaries (e.g., presence flags or counts) so only non-sensitive, high-level
info remains at info level—update the logger calls around the solr and params
variables in the function handling the vector query accordingly.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@src/utils/vector_search.py`:
- Around line 48-57: The current logging in vector_search.py prints raw Solr
filters and full query params at info level (logger.info("Initial params: %s",
params), logger.info("solr: %s", solr), logger.info("Final params with solr
filters: %s", params), logger.info("Final params being sent to vector_io.query:
%s", params)), which can leak sensitive data; change these to logger.debug()
and/or replace the payloads with coarse summaries (e.g., presence flags or
counts) so only non-sensitive, high-level info remains at info level—update the
logger calls around the solr and params variables in the function handling the
vector query accordingly.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d69de5e and 29135a9.

📒 Files selected for processing (3)

src/app/endpoints/query.py
src/app/endpoints/streaming_query.py
src/utils/vector_search.py

asimurka commented Feb 27, 2026

View reviewed changes

coderabbitai bot reviewed Feb 27, 2026

View reviewed changes

Refactor of vector_search utilities

29135a9

asimurka force-pushed the vector_search_refactor branch from d69de5e to 29135a9 Compare February 27, 2026 08:25

asimurka requested a review from tisnik February 27, 2026 08:26

coderabbitai bot reviewed Feb 27, 2026

View reviewed changes

asimurka marked this pull request as draft February 27, 2026 11:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LCORE-1350: Refactor of vector_search utilities#1233

LCORE-1350: Refactor of vector_search utilities#1233
asimurka wants to merge 1 commit intolightspeed-core:mainfrom
asimurka:vector_search_refactor

asimurka commented Feb 27, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Feb 27, 2026 •

edited

Loading

Uh oh!

asimurka Feb 27, 2026

Uh oh!

asimurka Feb 27, 2026

Uh oh!

asimurka Feb 27, 2026

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant


		pre_rag_chunks: list[RAGChunk] = []
		doc_ids_from_chunks: list[ReferencedDocument] = []

Conversation

asimurka commented Feb 27, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Type of change

Tools used to create PR

Related Tickets & Documents

Checklist before requesting a review

Testing

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Feb 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Uh oh!

asimurka Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

asimurka Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

asimurka Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

asimurka commented Feb 27, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Feb 27, 2026 •

edited

Loading