UN-2836 [FEAT] Return full text contents of input file in API response by pk-zipstack · Pull Request #1904 · Zipstack/unstract

pk-zipstack · 2026-04-07T03:48:54Z

What

- Add include_extracted_text boolean parameter (default false) to both sync (POST) and async polling (GET) API deployment endpoints.
- When enabled, the full extracted text of each input file is returned at the top level of each file result as "extracted_text": "...".
- Works independently of include_metadata and the ENABLE_HIGHLIGHT_API_DEPLOYMENT configuration.
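
As a rough illustration of the caller's side, the sketch below builds a polling URL that opts in to the new flag and then reads the top-level extracted_text from each file result. Only the include_extracted_text parameter and the top-level extracted_text key come from this PR; the base URL, the execution_id parameter name, the helper names, and the sample payload are assumptions made for the example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical deployment URL; substitute the real endpoint of your API deployment.
STATUS_URL = "https://unstract.example.com/deployment/api/my_org/my_api/"


def build_poll_url(execution_id: str, include_extracted_text: bool = False) -> str:
    """Build the GET polling URL, opting in to extracted text when requested."""
    params = {
        "execution_id": execution_id,  # assumed name of the polling parameter
        "include_extracted_text": str(include_extracted_text).lower(),
    }
    return f"{STATUS_URL}?{urlencode(params)}"


def collect_extracted_text(file_results: list) -> list:
    """Return the top-level extracted_text of each file result, or None if absent."""
    return [
        item.get("extracted_text") if isinstance(item, dict) else None
        for item in file_results
    ]


# Illustrative file results shaped as described in this PR (values are made up).
sample_results = [
    {"file": "invoice.pdf", "result": {"output": {}}, "extracted_text": "Invoice 42"},
    {"file": "memo.pdf", "result": {"output": {}}},
]

url = build_poll_url("abc-123", include_extracted_text=True)
texts = collect_extracted_text(sample_results)
```

Reading the key with .get() keeps the client tolerant of responses produced without the flag, where extracted_text is simply absent.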

Why

- Customers using the Unstract Python client want to retrieve the full extracted text of uploaded documents alongside the structured extraction results.
- Currently extracted_text is only available inside metadata, and is gated behind both include_metadata=true and the enterprise ENABLE_HIGHLIGHT_API_DEPLOYMENT flag.
- This makes it inaccessible to OSS users and adds unnecessary coupling between text retrieval and highlighting features.

How

- Added INCLUDE_EXTRACTED_TEXT constant to ApiExecution.
- Added include_extracted_text field to ExecutionRequestSerializer (POST) and ExecutionQuerySerializer (GET).
- Added promote_extracted_text() method to ExecutionResponse DTO, which copies extracted_text from result[i].result.metadata to result[i].extracted_text.
- In both POST (deployment_helper.py) and GET (api_deployment_views.py) flows: when include_extracted_text=true, preserve extracted_text in metadata before highlight filtering, then promote it to the top level.
- After promotion, the normal include_metadata / highlight filtering still runs, so metadata is cleaned as usual.
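
The promotion step described above can be sketched on plain dictionaries. This is not the actual ExecutionResponse method from dto.py; it is a standalone function written under the assumption that each file result is a dict whose "result" entry may carry a "metadata" dict, as the paths result[i].result.metadata and result[i].extracted_text suggest. The sample data is made up.

```python
from typing import Any


def promote_extracted_text(result: Any) -> None:
    """Copy extracted_text from each item's inner metadata to the item's top level."""
    if not isinstance(result, list):
        return  # nothing to promote when the result is not a list of file results
    for item in result:
        if not isinstance(item, dict):
            continue  # skip entries that are not file-result dicts
        inner = item.get("result")
        if not isinstance(inner, dict):
            continue
        metadata = inner.get("metadata")
        if isinstance(metadata, dict) and "extracted_text" in metadata:
            # Copy rather than move: later metadata filtering decides what stays nested.
            item["extracted_text"] = metadata["extracted_text"]


file_results = [
    {"file": "a.pdf", "result": {"output": {}, "metadata": {"extracted_text": "hello"}}},
    {"file": "b.pdf", "result": {"output": {}}},
    "unexpected entry",
]
promote_extracted_text(file_results)
```

Copying instead of moving matches the behavior noted in the testing section, where the text can appear both at the top level and inside metadata when include_metadata is also set.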

Can this PR break any existing features. If yes, please list possible items. If no, please explain why. (PS: Admins do not merge the PR without this section filled)

No. The parameter defaults to false, so existing API behavior is unchanged. The promotion logic only runs when explicitly requested. No changes to the execution pipeline, caching, or metadata population.

Database Migrations

None

Env Config

None

Relevant Docs

N/A

Related Issues or PRs

UN-2836

Dependencies Versions

None

Notes on Testing

- POST with include_extracted_text=true returns extracted_text at file-result top level.
- POST with include_extracted_text=false (default) does not include extracted_text.
- GET polling with include_extracted_text=true returns extracted_text at file-result top level.
- include_extracted_text=true works without include_metadata=true.
- include_extracted_text=true works even when ENABLE_HIGHLIGHT_API_DEPLOYMENT is disabled.
- When both include_extracted_text=true and include_metadata=true, extracted_text appears at top level and in metadata.
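
One way to turn these scenarios into repeatable checks is a small shape validator that a client-side test could run against each file result. The function name, the problem messages, and the sample payloads below are hypothetical; the rules it encodes are only the ones listed above, plus the PR's statement that metadata is removed when include_metadata is not set.

```python
def check_file_result(item: dict, *, include_extracted_text: bool, include_metadata: bool) -> list:
    """Return a list of human-readable problems; an empty list means the shape matches."""
    problems = []
    has_top_level_text = "extracted_text" in item
    if include_extracted_text and not has_top_level_text:
        problems.append("extracted_text missing at top level")
    if not include_extracted_text and has_top_level_text:
        problems.append("extracted_text present although not requested")
    inner = item.get("result")
    has_metadata = isinstance(inner, dict) and "metadata" in inner
    if not include_metadata and has_metadata:
        problems.append("metadata present although not requested")
    return problems


# Made-up file results standing in for real API responses.
text_only = {"file": "a.pdf", "result": {"output": {}}, "extracted_text": "hello"}
default_shape = {"file": "a.pdf", "result": {"output": {}}}

ok = check_file_result(text_only, include_extracted_text=True, include_metadata=False)
bad = check_file_result(default_shape, include_extracted_text=True, include_metadata=False)
```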

Screenshots

Checklist

I have read and understood the Contribution Guidelines.

coderabbitai · 2026-04-07T03:49:11Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e387a425-fa1b-4f5a-beae-17b40d757976

📥 Commits

Reviewing files that changed from the base of the PR and between 20f097a and aac777f.

📒 Files selected for processing (1)

backend/api_v2/api_deployment_views.py

🚧 Files skipped from review as they are similar to previous changes (1)

backend/api_v2/api_deployment_views.py

Summary by CodeRabbit

New Features
- Added an opt-in include_extracted_text flag (default: false) for execution requests and queries so clients can request full extracted text in API responses.
- Completed-execution responses are now post-processed to surface extracted text when the flag is enabled; when disabled, extracted text is omitted as before.

Walkthrough

Adds a new include_extracted_text boolean flag that is validated on requests/queries and propagated into execution paths; when true, extracted text is promoted into top-level extracted_text fields instead of being removed during post-processing.

Changes

Include extracted_text feature

Layer / File(s)	Summary
Flag constant `backend/api_v2/constants.py`	`ApiExecution.INCLUDE_EXTRACTED_TEXT = "include_extracted_text"` added.
Serializers `backend/api_v2/serializers.py`	Added `include_extracted_text = BooleanField(default=False)` to `ExecutionRequestSerializer` and `ExecutionQuerySerializer`.
API views `backend/api_v2/api_deployment_views.py`	`post` and `get` now read `include_extracted_text` from validated body/query and forward it into deployment execution/post-processing (`DeploymentHelper`).
Deployment helper `backend/api_v2/deployment_helper.py`	`execute_workflow` and `process_completed_execution` gained `include_extracted_text: bool = False`; extracted-text removal now conditional and promotion invoked when true.
Response DTO `backend/workflow_manager/workflow_v2/dto.py`	Added `ExecutionResponse.promote_extracted_text()` to surface extracted text from nested metadata to top-level `extracted_text` entries.

Sequence Diagram

sequenceDiagram
    participant Client
    participant APIView as API View
    participant Serializer
    participant Helper as DeploymentHelper
    participant Execution as ExecutionResult/DTO

    Client->>APIView: POST/GET with include_extracted_text flag
    APIView->>Serializer: validate request/query
    Serializer-->>APIView: validated data (include_extracted_text)
    APIView->>Helper: execute_workflow(..., include_extracted_text=flag)
    Helper->>Execution: run/process execution
    alt include_extracted_text == true
        Helper->>Execution: promote_extracted_text()
        Execution-->>Helper: response with top-level extracted_text
    else
        Helper->>Execution: ensure extracted_text removed from metadata
        Execution-->>Helper: response without extracted_text
    end
    Helper-->>APIView: shaped execution response
    APIView-->>Client: return API response

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title directly and concisely describes the main feature: returning full text contents of input files in API responses, which is the primary objective of the changeset.
Description check	✅ Passed	The description comprehensively covers all required sections: What, Why, How, breaking changes assessment, migrations, env config, related issues, testing notes, and checklist confirmation.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

greptile-apps · 2026-04-07T03:57:03Z

Greptile Summary

This PR adds an include_extracted_text boolean parameter (default false) to both the sync POST and async GET API deployment endpoints, allowing callers to receive the full extracted text of each uploaded document at the top level of file results without needing include_metadata=true or the enterprise ENABLE_HIGHLIGHT_API_DEPLOYMENT flag.

dto.py: Adds promote_extracted_text() to ExecutionResponse, copying extracted_text from item["result"]["metadata"] to item["extracted_text"].
deployment_helper.py / api_deployment_views.py: In both sync and async-poll execution paths, the extracted_text key is conditionally preserved in metadata before highlight filtering, then promoted to the top level.
serializers.py / constants.py: New include_extracted_text field wired through ExecutionRequestSerializer, ExecutionQuerySerializer, and ApiExecution constants.

Confidence Score: 5/5

Safe to merge — the new parameter defaults to false, preserving all existing API behavior unchanged.

All five changed files are additive: a new opt-in boolean parameter wired consistently through constants, serializers, helper methods, and views. The promotion logic in promote_extracted_text() is guarded against non-list results and non-dict items, and it runs after the existing metadata-removal guards so no existing response shape is altered when the flag is off. Both the sync POST and async GET polling paths handle the new flag symmetrically. No database, cache, or execution-pipeline changes are involved.

No files require special attention.

Important Files Changed

Filename	Overview
backend/workflow_manager/workflow_v2/dto.py	Adds promote_extracted_text() that safely copies extracted_text from inner metadata to item top-level; guards against non-list result and non-dict items correctly.
backend/api_v2/deployment_helper.py	Threads include_extracted_text through both sync execution and async-poll paths; conditional preservation and promotion logic is symmetric and correct.
backend/api_v2/api_deployment_views.py	Reads include_extracted_text from both POST and GET serializers and forwards it to the appropriate helper methods; no logic gaps.
backend/api_v2/serializers.py	Adds include_extracted_text = BooleanField(default=False) to both ExecutionRequestSerializer and ExecutionQuerySerializer; docstring updated for the POST serializer only.
backend/api_v2/constants.py	Adds INCLUDE_EXTRACTED_TEXT string constant; consistent with existing constant pattern.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[API Request] --> B{POST or GET?}
    B -- POST --> C[ExecutionRequestSerializer]
    B -- GET poll --> D[ExecutionQuerySerializer]
    C --> E[execute_workflow_and_get_response]
    D --> F[get_execution_status]
    F --> G{COMPLETED?}
    G -- No --> H[Return status only]
    G -- Yes --> I[process_completed_execution]
    E --> J{enable_highlight?}
    I --> J
    J -- No --> K[remove highlight_data]
    K --> L{include_extracted_text?}
    L -- No --> M[remove extracted_text from metadata]
    L -- Yes --> N[keep extracted_text in metadata]
    J -- Yes --> N
    N --> O[promote_extracted_text]
    M --> P{include_metadata?}
    O --> P
    P -- No --> Q[remove_inner_result_metadata]
    P -- Yes --> R[enrich with usage metadata]
    Q --> S[Return: extracted_text at top level]
    R --> S
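
The ordering in the flowchart matters: promotion has to happen before metadata is dropped, or the text would be lost when include_metadata is false. The sketch below replays that order on plain dictionaries. It is a simplified stand-in, not the code in deployment_helper.py: the function name and sample data are assumptions, every file result is assumed to be a dict, and the usage-metadata enrichment and metrics handling are left out.

```python
import copy


def shape_file_results(results, *, enable_highlight, include_extracted_text, include_metadata):
    """Apply the filtering order from the flowchart to a list of file-result dicts."""
    for item in results:
        inner = item.get("result", {})
        metadata = inner.get("metadata", {})
        if not enable_highlight:
            metadata.pop("highlight_data", None)
            if not include_extracted_text:
                metadata.pop("extracted_text", None)
        # Promote before metadata is dropped, otherwise the text would be lost.
        if include_extracted_text and "extracted_text" in metadata:
            item["extracted_text"] = metadata["extracted_text"]
        if not include_metadata:
            inner.pop("metadata", None)
    return results


# Made-up input standing in for a completed execution's file results.
raw = [{
    "file": "a.pdf",
    "result": {
        "output": {"total": "42"},
        "metadata": {"extracted_text": "hello", "highlight_data": [1, 2]},
    },
}]

text_only = shape_file_results(
    copy.deepcopy(raw),
    enable_highlight=False, include_extracted_text=True, include_metadata=False,
)
both = shape_file_results(
    copy.deepcopy(raw),
    enable_highlight=False, include_extracted_text=True, include_metadata=True,
)
default = shape_file_results(
    copy.deepcopy(raw),
    enable_highlight=False, include_extracted_text=False, include_metadata=False,
)
```

With the flag off and everything else at its default, the result carries only the structured output, which is the unchanged pre-PR shape.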

Reviews (4): Last reviewed commit: "Merge branch 'main' into UN-2836-include..."

Add `include_extracted_text` parameter to API deployment endpoints that returns the full extracted text of each input file at the top level of each file result, independent of `include_metadata` and the `ENABLE_HIGHLIGHT_API_DEPLOYMENT` configuration flag.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai

🧹 Nitpick comments (1)

backend/api_v2/deployment_helper.py (1)

482-487: Logic is correct; consider extracting shared post-processing into a helper.

The extracted_text handling logic here is identical to lines 277-282 in execute_workflow(). Both blocks share the same pattern for enable_highlight checking, highlight_data removal, and extracted_text conditional removal/promotion.

♻️ Optional: Extract shared logic into a private helper

@staticmethod
def _apply_response_post_processing(
    response: ExecutionResponse,
    organization: Any,
    include_extracted_text: bool,
    include_metadata: bool,
    include_metrics: bool,
) -> None:
    """Apply common post-processing to execution responses."""
    enable_highlight = False
    if ConfigurationRegistry.is_config_key_available(
        "ENABLE_HIGHLIGHT_API_DEPLOYMENT"
    ):
        enable_highlight = Configuration.get_value_by_organization(
            config_key="ENABLE_HIGHLIGHT_API_DEPLOYMENT",
            organization=organization,
        )
    if not enable_highlight:
        response.remove_result_metadata_keys(["highlight_data"])
        if not include_extracted_text:
            response.remove_result_metadata_keys(["extracted_text"])
    if include_extracted_text:
        response.promote_extracted_text()
    if include_metadata or include_metrics:
        DeploymentHelper._enrich_result_with_usage_metadata(response)
    if not include_metadata:
        response.remove_inner_result_metadata()
    if not include_metrics:
        response.remove_result_metrics()

This would reduce the duplicated logic in both execute_workflow() and process_completed_execution().

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@backend/api_v2/deployment_helper.py` around lines 482 - 487, The duplicated
post-processing logic for extracted_text/highlight_data in
process_completed_execution() and execute_workflow() should be moved into a
private helper to avoid duplication: add a static method (e.g.,
_apply_response_post_processing(response: ExecutionResponse, organization,
include_extracted_text: bool, include_metadata: bool, include_metrics: bool)) in
DeploymentHelper that encapsulates the ENABLE_HIGHLIGHT_API_DEPLOYMENT config
check, removal of "highlight_data" and conditional removal/promotion of
"extracted_text", and the existing metadata/metrics enrichment/removal logic
(calling _enrich_result_with_usage_metadata, remove_inner_result_metadata,
remove_result_metrics), then replace the duplicated blocks in execute_workflow()
and process_completed_execution() with calls to this new helper.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 76482e9f-7631-4155-b078-692e1cf52d79

📥 Commits

Reviewing files that changed from the base of the PR and between 3461d01 and 20f097a.

📒 Files selected for processing (5)

backend/api_v2/api_deployment_views.py
backend/api_v2/constants.py
backend/api_v2/deployment_helper.py
backend/api_v2/serializers.py
backend/workflow_manager/workflow_v2/dto.py

✅ Files skipped from review due to trivial changes (1)

backend/api_v2/constants.py

🚧 Files skipped from review as they are similar to previous changes (3)

backend/workflow_manager/workflow_v2/dto.py
backend/api_v2/serializers.py
backend/api_v2/api_deployment_views.py

sonarqubecloud · 2026-04-07T06:52:00Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

chandrasekharan-zipstack

LGTM, @pk-zipstack follow up on whether we can eventually avoid showing this extracted text as part of include_metadata

sonarqubecloud · 2026-05-27T12:22:17Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

github-actions · 2026-05-27T12:22:18Z

Test Results

Summary

✅ Runner Tests: 11 passed, 0 failed (11 total)
✅ SDK1 Tests: 347 passed, 0 failed (347 total)

Runner Tests - Full Report

filepath	function	passed	SUBTOTAL
runner/src/unstract/runner/clients/test_docker.py	test_logs	1	1
runner/src/unstract/runner/clients/test_docker.py	test_cleanup	1	1
runner/src/unstract/runner/clients/test_docker.py	test_cleanup_skip	1	1
runner/src/unstract/runner/clients/test_docker.py	test_client_init	1	1
runner/src/unstract/runner/clients/test_docker.py	test_get_image_exists	1	1
runner/src/unstract/runner/clients/test_docker.py	test_get_image	1	1
runner/src/unstract/runner/clients/test_docker.py	test_get_container_run_config	1	1
runner/src/unstract/runner/clients/test_docker.py	test_get_container_run_config_without_mount	1	1
runner/src/unstract/runner/clients/test_docker.py	test_run_container	1	1
runner/src/unstract/runner/clients/test_docker.py	test_get_image_for_sidecar	1	1
runner/src/unstract/runner/clients/test_docker.py	test_sidecar_container	1	1
TOTAL		11	11

SDK1 Tests - Full Report

pk-zipstack force-pushed the UN-2836-include-extracted-text-api-response branch from 3461d01 to 20f097a April 7, 2026 06:32

coderabbitai Bot reviewed Apr 7, 2026

Merge branch 'main' into UN-2836-include-extracted-text-api-response

9995a23

pk-zipstack requested review from Deepak-Kesavan and chandrasekharan-zipstack April 20, 2026 08:25

pk-zipstack assigned harini-venkataraman Apr 20, 2026

pk-zipstack assigned pk-zipstack and unassigned harini-venkataraman May 11, 2026

chandrasekharan-zipstack approved these changes May 18, 2026

Deepak-Kesavan approved these changes May 27, 2026

Merge branch 'main' into UN-2836-include-extracted-text-api-response

aac777f

chandrasekharan-zipstack merged commit b89f412 into main May 27, 2026
8 checks passed

chandrasekharan-zipstack deleted the UN-2836-include-extracted-text-api-response branch May 27, 2026 12:30

chandrasekharan-zipstack merged 3 commits into main from UN-2836-include-extracted-text-api-response
