UN-2836 [FEAT] Return full text contents of input file in API response #1904
Walkthrough
Adds a new …
Changes
Include extracted_text feature
Sequence Diagram
sequenceDiagram
participant Client
participant APIView as API View
participant Serializer
participant Helper as DeploymentHelper
participant Execution as ExecutionResult/DTO
Client->>APIView: POST/GET with include_extracted_text flag
APIView->>Serializer: validate request/query
Serializer-->>APIView: validated data (include_extracted_text)
APIView->>Helper: execute_workflow(..., include_extracted_text=flag)
Helper->>Execution: run/process execution
alt include_extracted_text == true
Helper->>Execution: promote_extracted_text()
Execution-->>Helper: response with top-level extracted_text
else
Helper->>Execution: ensure extracted_text removed from metadata
Execution-->>Helper: response without extracted_text
end
Helper-->>APIView: shaped execution response
APIView-->>Client: return API response
| Filename | Overview |
|---|---|
| backend/workflow_manager/workflow_v2/dto.py | Adds promote_extracted_text() that safely copies extracted_text from inner metadata to item top-level; guards against non-list result and non-dict items correctly. |
| backend/api_v2/deployment_helper.py | Threads include_extracted_text through both sync execution and async-poll paths; conditional preservation and promotion logic is symmetric and correct. |
| backend/api_v2/api_deployment_views.py | Reads include_extracted_text from both POST and GET serializers and forwards it to the appropriate helper methods; no logic gaps. |
| backend/api_v2/serializers.py | Adds include_extracted_text = BooleanField(default=False) to both ExecutionRequestSerializer and ExecutionQuerySerializer; docstring updated for the POST serializer only. |
| backend/api_v2/constants.py | Adds INCLUDE_EXTRACTED_TEXT string constant; consistent with existing constant pattern. |
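As a rough illustration of the promotion step described for dto.py, the sketch below works on plain dictionaries rather than the real response class. The name `promote_extracted_text` and the `result[i].result.metadata` path come from the PR description; the standalone-function form, the guards as written, and the sample data are assumptions made for illustration, not the PR's actual code.

```python
from typing import Any


def promote_extracted_text(result: Any) -> None:
    """Copy extracted_text from each item's inner metadata to the item's top level.

    Hypothetical standalone version; the PR implements this as a DTO method.
    """
    if not isinstance(result, list):
        return  # nothing to promote when the result is not a list of file items
    for item in result:
        if not isinstance(item, dict):
            continue  # skip malformed entries instead of raising
        inner = item.get("result")
        if not isinstance(inner, dict):
            continue
        metadata = inner.get("metadata")
        if not isinstance(metadata, dict):
            continue
        if "extracted_text" in metadata:
            # Copy, not move: the metadata entry is left in place here.
            item["extracted_text"] = metadata["extracted_text"]


# Illustrative data only; the field names besides extracted_text are assumed.
file_results = [
    {"file": "a.pdf", "result": {"output": {}, "metadata": {"extracted_text": "hello"}}},
    "unexpected-non-dict-item",
    {"file": "b.pdf", "result": None},
]
promote_extracted_text(file_results)
```

Because the value is copied rather than moved, whether it also remains in metadata is decided by the later metadata filtering, not by this step.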
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[API Request] --> B{POST or GET?}
B -- POST --> C[ExecutionRequestSerializer]
B -- GET poll --> D[ExecutionQuerySerializer]
C --> E[execute_workflow_and_get_response]
D --> F[get_execution_status]
F --> G{COMPLETED?}
G -- No --> H[Return status only]
G -- Yes --> I[process_completed_execution]
E --> J{enable_highlight?}
I --> J
J -- No --> K[remove highlight_data]
K --> L{include_extracted_text?}
L -- No --> M[remove extracted_text from metadata]
L -- Yes --> N[keep extracted_text in metadata]
J -- Yes --> N
N --> O[promote_extracted_text]
M --> P{include_metadata?}
O --> P
P -- No --> Q[remove_inner_result_metadata]
P -- Yes --> R[enrich with usage metadata]
Q --> S[Return: extracted_text at top level]
R --> S
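The branches in the flowchart can be walked through with a small dictionary-based sketch for a single file result. It assumes promotion happens only when the flag is set, as the sequence diagram indicates, and it leaves out the usage-metadata enrichment step. The function name `shape_file_result` and the sample data are hypothetical and do not appear in the PR.

```python
from typing import Any


def shape_file_result(
    item: dict[str, Any],
    *,
    enable_highlight: bool,
    include_extracted_text: bool,
    include_metadata: bool,
) -> dict[str, Any]:
    """Hypothetical walk through the flowchart's branches for one file result."""
    # Work on shallow copies so the caller's data is not mutated.
    inner = dict(item.get("result") or {})
    metadata = dict(inner.get("metadata") or {})

    if not enable_highlight:
        metadata.pop("highlight_data", None)
        if not include_extracted_text:
            metadata.pop("extracted_text", None)

    shaped = {**item, "result": inner}
    if include_extracted_text and "extracted_text" in metadata:
        shaped["extracted_text"] = metadata["extracted_text"]  # promotion step

    if include_metadata:
        inner["metadata"] = metadata
    else:
        inner.pop("metadata", None)
    return shaped


raw = {
    "file": "a.pdf",
    "result": {
        "output": {"k": "v"},
        "metadata": {"extracted_text": "hello", "highlight_data": [1]},
    },
}
text_only = shape_file_result(
    raw, enable_highlight=False, include_extracted_text=True, include_metadata=False
)
default = shape_file_result(
    raw, enable_highlight=False, include_extracted_text=False, include_metadata=True
)
```

In this sketch `text_only` carries the text at the top level with no metadata at all, while `default` keeps a metadata dictionary that no longer contains `extracted_text`.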
Add `include_extracted_text` parameter to API deployment endpoints that returns the full extracted text of each input file at the top level of each file result, independent of `include_metadata` and the `ENABLE_HIGHLIGHT_API_DEPLOYMENT` configuration flag.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
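From a client's point of view, opting in might look like adding one query parameter when polling. The sketch below only builds the URL and sends nothing. The base URL, the helper name `build_poll_url`, and the `execution_id` parameter name are assumptions for illustration; only `include_extracted_text` is taken from this PR.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://example.invalid/deployment/api/my-org/my-api/"


def build_poll_url(execution_id: str, include_extracted_text: bool = False) -> str:
    """Build a polling URL that carries the opt-in flag as a lowercase boolean."""
    params = {
        "execution_id": execution_id,  # assumed parameter name
        "include_extracted_text": str(include_extracted_text).lower(),
    }
    return f"{BASE_URL}?{urlencode(params)}"


poll_url = build_poll_url("1234", include_extracted_text=True)
```

Leaving the flag out corresponds to the default of `false`, so existing clients would not need to change.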
🧹 Nitpick comments (1)
backend/api_v2/deployment_helper.py (1)
482-487: Logic is correct; consider extracting shared post-processing into a helper.
The extracted_text handling logic here is identical to lines 277-282 in execute_workflow(). Both blocks share the same pattern: the enable_highlight check, highlight_data removal, and conditional extracted_text removal or promotion.
♻️ Optional: Extract shared logic into a private helper

    @staticmethod
    def _apply_response_post_processing(
        response: ExecutionResponse,
        organization: Any,
        include_extracted_text: bool,
        include_metadata: bool,
        include_metrics: bool,
    ) -> None:
        """Apply common post-processing to execution responses."""
        enable_highlight = False
        if ConfigurationRegistry.is_config_key_available(
            "ENABLE_HIGHLIGHT_API_DEPLOYMENT"
        ):
            enable_highlight = Configuration.get_value_by_organization(
                config_key="ENABLE_HIGHLIGHT_API_DEPLOYMENT",
                organization=organization,
            )
        if not enable_highlight:
            response.remove_result_metadata_keys(["highlight_data"])
            if not include_extracted_text:
                response.remove_result_metadata_keys(["extracted_text"])
        if include_extracted_text:
            response.promote_extracted_text()
        if include_metadata or include_metrics:
            DeploymentHelper._enrich_result_with_usage_metadata(response)
        if not include_metadata:
            response.remove_inner_result_metadata()
        if not include_metrics:
            response.remove_result_metrics()

This would reduce the duplicated logic in both execute_workflow() and process_completed_execution().
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/api_v2/deployment_helper.py` around lines 482 - 487, The duplicated post-processing logic for extracted_text/highlight_data in process_completed_execution() and execute_workflow() should be moved into a private helper to avoid duplication: add a static method (e.g., _apply_response_post_processing(response: ExecutionResponse, organization, include_extracted_text: bool, include_metadata: bool, include_metrics: bool)) in DeploymentHelper that encapsulates the ENABLE_HIGHLIGHT_API_DEPLOYMENT config check, removal of "highlight_data" and conditional removal/promotion of "extracted_text", and the existing metadata/metrics enrichment/removal logic (calling _enrich_result_with_usage_metadata, remove_inner_result_metadata, remove_result_metrics), then replace the duplicated blocks in execute_workflow() and process_completed_execution() with calls to this new helper.
chandrasekharan-zipstack left a comment
LGTM. @pk-zipstack, please follow up on whether we can eventually avoid showing this extracted text as part of include_metadata.
What
- Adds an `include_extracted_text` boolean parameter (default `false`) to both sync (POST) and async polling (GET) API deployment endpoints
- When enabled, each file result carries the full extracted text at its top level as `"extracted_text": "..."`
- Works independently of `include_metadata` and the `ENABLE_HIGHLIGHT_API_DEPLOYMENT` configuration
Why
- Currently `extracted_text` is only available inside metadata, and is gated behind both `include_metadata=true` and the enterprise `ENABLE_HIGHLIGHT_API_DEPLOYMENT` flag
How
- Added an `INCLUDE_EXTRACTED_TEXT` constant to `ApiExecution`
- Added an `include_extracted_text` field to `ExecutionRequestSerializer` (POST) and `ExecutionQuerySerializer` (GET)
- Added a `promote_extracted_text()` method to the `ExecutionResponse` DTO: it copies `extracted_text` from `result[i].result.metadata` to `result[i].extracted_text`
- In both the POST (`deployment_helper.py`) and GET (`api_deployment_views.py`) flows: when `include_extracted_text=true`, preserve `extracted_text` in metadata before highlight filtering, then promote it to the top level
- The existing `include_metadata` / highlight filtering still runs, so metadata is cleaned as usual
Can this PR break any existing features. If yes, please list possible items. If no, please explain why. (PS: Admins do not merge the PR without this section filled)
- No. The parameter defaults to `false`, so existing API behavior is unchanged. The promotion logic only runs when explicitly requested. No changes to the execution pipeline, caching, or metadata population.
Database Migrations
Env Config
Relevant Docs
Related Issues or PRs
Dependencies Versions
Notes on Testing
- `include_extracted_text=true` returns `extracted_text` at file-result top level
- `include_extracted_text=false` (default) does not include `extracted_text`
- `include_extracted_text=true` returns `extracted_text` at file-result top level
- `include_extracted_text=true` works without `include_metadata=true`
- `include_extracted_text=true` works even when `ENABLE_HIGHLIGHT_API_DEPLOYMENT` is disabled
- With `include_extracted_text=true` and `include_metadata=true`, `extracted_text` appears at top level and in metadata
Screenshots
Checklist
I have read and understood the Contribution Guidelines.