LCORE-1350: Refactor of vector_search utilities#1233
LCORE-1350: Refactor of vector_search utilities#1233asimurka wants to merge 1 commit intolightspeed-core:mainfrom
Conversation
WalkthroughRefactors vector search to pass a raw Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant Endpoint as Endpoint (query/streaming)
participant VectorSearch as perform_vector_search
participant VectorStore as Vector IO
participant Dedupe as deduplicate_referenced_documents
Client->>Endpoint: send QueryRequest (includes query, solr)
Endpoint->>VectorSearch: perform_vector_search(query, solr, client)
VectorSearch->>VectorStore: query vector index (query, solr params)
VectorStore-->>VectorSearch: returned chunks + doc ids
VectorSearch-->>Endpoint: return chunks, doc_ids_from_chunks, pre_rag_chunks
Endpoint->>Dedupe: deduplicate_referenced_documents(tool_docs + pre_rag_chunks)
Dedupe-->>Endpoint: referenced_documents
Endpoint-->>Client: streaming/generated response + turn_summary (referenced_documents)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
|
||
| if doc_ids_from_chunks: | ||
| turn_summary.referenced_documents = deduplicate_referenced_documents( | ||
| doc_ids_from_chunks + (turn_summary.referenced_documents or []) |
There was a problem hiding this comment.
Not necessary, attribute is never None
|
|
||
| pre_rag_chunks: list[RAGChunk] = [] | ||
| doc_ids_from_chunks: list[ReferencedDocument] = [] | ||
|
|
There was a problem hiding this comment.
Not necessary to initialize
| rag_id_mapping=context.rag_id_mapping, | ||
| ) | ||
|
|
||
| # Merge pre-RAG documents with tool-based documents (similar to query.py) |
There was a problem hiding this comment.
Utility function for this purpose already exists
There was a problem hiding this comment.
🧹 Nitpick comments (2)
src/utils/vector_search.py (1)
41-41: Add explicit return type annotation.The
_build_query_paramsfunction is missing its return type annotation. As per coding guidelines, all functions must have complete type annotations.Proposed fix
-def _build_query_params(solr: Optional[dict[str, Any]] = None) -> dict: +def _build_query_params(solr: Optional[dict[str, Any]] = None) -> dict[str, Any]:🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/utils/vector_search.py` at line 41, Add an explicit return type to the _build_query_params function signature (e.g., change def _build_query_params(solr: Optional[dict[str, Any]] = None) -> dict[str, Any]:) so the function has a complete type annotation; update the signature in the _build_query_params definition and ensure typing imports remain correct for Optional and Any if needed.src/app/endpoints/streaming_query.py (1)
263-280: Update docstring to document the newdoc_ids_from_chunksparameter.The function signature was updated to include
doc_ids_from_chunks: list[ReferencedDocument], but the docstring's Args section doesn't document this new parameter. As per coding guidelines, all functions must have complete docstrings with Parameters sections.Proposed docstring update
""" Retrieve the appropriate response generator. Handles shield moderation check and retrieves response. Returns the generator (shield violation or response generator) and turn_summary. Fills turn_summary attributes for token usage, referenced documents, and tool calls. Args: responses_params: The Responses API parameters context: The response generator context + doc_ids_from_chunks: Referenced documents extracted from pre-RAG vector search Returns: tuple[AsyncIterator[str], TurnSummary]: The response generator and turn summary """🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/app/endpoints/streaming_query.py` around lines 263 - 280, The docstring for retrieve_response_generator is missing documentation for the new parameter doc_ids_from_chunks; update the Args (or Parameters) section to include doc_ids_from_chunks: list[ReferencedDocument] describing it as the list of ReferencedDocument objects extracted from input chunks that should be considered for referenced document tracking and included in TurnSummary.referenced_documents, and ensure the description notes its role in token usage and moderation handling similar to responses_params and context.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/app/endpoints/streaming_query.py`:
- Around line 263-280: The docstring for retrieve_response_generator is missing
documentation for the new parameter doc_ids_from_chunks; update the Args (or
Parameters) section to include doc_ids_from_chunks: list[ReferencedDocument]
describing it as the list of ReferencedDocument objects extracted from input
chunks that should be considered for referenced document tracking and included
in TurnSummary.referenced_documents, and ensure the description notes its role
in token usage and moderation handling similar to responses_params and context.
In `@src/utils/vector_search.py`:
- Line 41: Add an explicit return type to the _build_query_params function
signature (e.g., change def _build_query_params(solr: Optional[dict[str, Any]] =
None) -> dict[str, Any]:) so the function has a complete type annotation; update
the signature in the _build_query_params definition and ensure typing imports
remain correct for Optional and Any if needed.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
src/app/endpoints/query.pysrc/app/endpoints/streaming_query.pysrc/utils/vector_search.py
d69de5e to
29135a9
Compare
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/utils/vector_search.py (1)
48-57:⚠️ Potential issue | 🟠 MajorAvoid logging raw Solr payloads at
infolevel.These lines log request-derived Solr filters and full query params verbatim at
info. That can leak sensitive metadata and create noisy high-volume logs. Keep this detail indebug(or log only coarse flags/counts).Proposed fix
- logger.info("Initial params: %s", params) - logger.info("solr: %s", solr) + logger.debug("Initialized vector search params") + logger.debug("Solr filters present: %s", bool(solr)) if solr: params["solr"] = solr - logger.info("Final params with solr filters: %s", params) + logger.debug("Applied Solr filters to vector search params") else: - logger.info("No solr filters provided") + logger.debug("No Solr filters provided") - logger.info("Final params being sent to vector_io.query: %s", params) + logger.debug("Vector search params ready for vector_io.query")As per coding guidelines, "Use
logger.debug()for detailed diagnostic information,logger.info()for general execution info,logger.warning()for unexpected events, andlogger.error()for serious problems".🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/utils/vector_search.py` around lines 48 - 57, The current logging in vector_search.py prints raw Solr filters and full query params at info level (logger.info("Initial params: %s", params), logger.info("solr: %s", solr), logger.info("Final params with solr filters: %s", params), logger.info("Final params being sent to vector_io.query: %s", params)), which can leak sensitive data; change these to logger.debug() and/or replace the payloads with coarse summaries (e.g., presence flags or counts) so only non-sensitive, high-level info remains at info level—update the logger calls around the solr and params variables in the function handling the vector query accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@src/utils/vector_search.py`:
- Around line 48-57: The current logging in vector_search.py prints raw Solr
filters and full query params at info level (logger.info("Initial params: %s",
params), logger.info("solr: %s", solr), logger.info("Final params with solr
filters: %s", params), logger.info("Final params being sent to vector_io.query:
%s", params)), which can leak sensitive data; change these to logger.debug()
and/or replace the payloads with coarse summaries (e.g., presence flags or
counts) so only non-sensitive, high-level info remains at info level—update the
logger calls around the solr and params variables in the function handling the
vector query accordingly.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
src/app/endpoints/query.pysrc/app/endpoints/streaming_query.pysrc/utils/vector_search.py
Description
This PR refactors vector search utilities such that:
Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit
Refactor
Bug Fixes