Skip to content

Initial implementation of report-builder interface#5266

Open
Swiddis wants to merge 16 commits intoopensearch-project:mainfrom
Swiddis:feat/report-builder
Open

Initial implementation of report-builder interface#5266
Swiddis wants to merge 16 commits intoopensearch-project:mainfrom
Swiddis:feat/report-builder

Conversation

@Swiddis
Copy link
Copy Markdown
Collaborator

@Swiddis Swiddis commented Mar 25, 2026

Description

Well, this is turning out quite long.

The fundamental idea of this PR is to add a lot more context to error messages. For example, take this bad query:

source = big5
| eval range_bucket = case(
   `metrics.size` < -10, 'range_1',
   `metrics.size` >= -10 and `metrics.size` < 10, 'range_2',
   `metrics.size` >= 10 and `metrics.size` < 100, 'range_3',
   `metrics.size` >= 100 and `metrics.size` < 1000, 'range_4',
   `metrics.size` >= 1000 and `metrics.size` < 2000, 'range_5',
   `metrics.size` >= 2000, 'range_6')
| stats count() by range_bucket, span(`@timestamp`, 1h) as auto_span
| sort + range_bucket, + auto_spa

Now there's more context, provided at various layers:

{
  "error": {
    "context": {
      "stage": "analyzing",
      "stage_description": "Parsing and validating the query"
    },
    "details": "Field [nonex] not found.",
    "location": [
      "while preparing and validating the query plan"
    ],
    "code": "FIELD_NOT_FOUND",
    "type": "IllegalArgumentException"
  },
  "status": 400
}

Or, the humble 'index not found':

{
  "error": {
    "context": {
      "stage": "analyzing",
      "index_name": "big5_nonex",
      "stage_description": "Parsing and validating the query"
    },
    "details": "no such index [big5_nonex]",
    "location": [
      "while preparing and validating the query plan",
      "while fetching index mappings"
    ],
    "code": "INDEX_NOT_FOUND",
    "type": "IndexNotFoundException"
  },
  "status": 404
}

The doctest deltas have a few concrete cases where I went through and did light touch-ups. Improving any particular error takes O(2 minutes) to wrap in a report with context, and it'll integrate with any higher-up wrapping to give the completed report at the end.

A lot of this PR in terms of line count is just patching specific places where tests are asserting errors behave in specific ways. The core of the PR is:

  • ErrorReport and StageErrorHandler, which adds utilities for incrementally building detailed errors
  • QueryService where we hook up the outer shell of the stage handlers, and
  • RestPPLQueryAction where we handle and present the errors.

Simple tests are in error_reports.yml, as well as doctest and various other scattered tests. (There's also lots of tests that have no diff, but did catch bugs in earlier versions of this PR.)

The enhancements aren't super complete in this PR as the emphasis is scaffolding (and the integration work + fixing tests is already quite significant), for the most part the changes are backwards compatible in normal cases. The type, details, and status fields are in the same spots and by default pull from the underlying exception. I'll be adding more thorough probes for specific error cases in followups.

Related Issues

Meta: #5261
Closes #4919

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Simeon Widdis <sawiddis@amazon.com>
Signed-off-by: Simeon Widdis <sawiddis@amazon.com>
Signed-off-by: Simeon Widdis <sawiddis@amazon.com>
Signed-off-by: Simeon Widdis <sawiddis@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

Failed to generate code suggestions for PR

@Swiddis Swiddis added the enhancement New feature or request label Mar 25, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Failed to generate code suggestions for PR

Signed-off-by: Simeon Widdis <sawiddis@amazon.com>
@Swiddis Swiddis force-pushed the feat/report-builder branch from c1c2355 to 23c5fbb Compare March 25, 2026 23:45
@github-actions
Copy link
Copy Markdown
Contributor

Failed to generate code suggestions for PR

Signed-off-by: Simeon Widdis <sawiddis@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

Failed to generate code suggestions for PR

Swiddis added 3 commits March 26, 2026 12:04
Signed-off-by: Simeon Widdis <sawiddis@amazon.com>
Signed-off-by: Simeon Widdis <sawiddis@amazon.com>
Signed-off-by: Simeon Widdis <sawiddis@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

Failed to generate code suggestions for PR

ahkcs added a commit to ahkcs/sql that referenced this pull request Mar 26, 2026
NPE in the analytics path is a server bug (null schema field, missing
row), not bad user input. Removed from client error list. Will sync
this classification with RestPPLQueryAction updates in opensearch-project#5266.

Signed-off-by: Kai Huang <kaihuang@amazon.com>
Signed-off-by: Kai Huang <ahkcs@amazon.com>
Signed-off-by: Simeon Widdis <sawiddis@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 26, 2026

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 6f8d67a.

PathLineSeverityDescription
integ-test/src/test/java/org/opensearch/sql/datasource/DataSourceEnabledIT.java154lowDebug logging statements (System.err.println) printing full HTTP response bodies were added to a test helper method. While in test code and not exfiltrating data externally, these could inadvertently log sensitive response content (tokens, credentials) to stderr in CI/CD pipelines. Likely a development artifact left in unintentionally.
core/src/main/java/org/opensearch/sql/calcite/utils/CalciteToolsHelper.java434lowInternal query plan content is extracted from exception messages and placed into structured error responses returned to clients (via the 'plan' context key). This could expose internal execution plan details to end users. The behavior is intentional for debugging but represents information disclosure of internal system structure.

The table above displays the top 10 most important findings.

Total: 2 | Critical: 0 | High: 0 | Medium: 0 | Low: 2


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@github-actions
Copy link
Copy Markdown
Contributor

Failed to generate code suggestions for PR

@Swiddis
Copy link
Copy Markdown
Collaborator Author

Swiddis commented Mar 26, 2026

The error enrichment logic extracts the raw SQL query plan string from internal exception messages and...

In context, this is pulling details out of a potentially very fuzzy Calcite error. The error logic without any of this ultimately returns the plan with a message resembling "Failed to prepare plan: [blob]." The plan is provided as context as the innermost error might be weird and only apply to one node, e.g. CalciteEnumerableNestedAggregate having a problem means nothing without access to the plan containing said aggregate. In OSD this plan probably won't be presented to users directly.

Any details about our internal implementation can be easily scrutinized by navigating to https://github.com/opensearch-project/sql in a recent version of Netscape Navigator.

ahkcs added a commit that referenced this pull request Mar 26, 2026
…5267)

* [Mustang] Add query routing and execution handoff for Parquet-backed indices (#5247)

Implements the query routing and AnalyticsExecutionEngine for Project
Mustang's unified query pipeline. PPL queries targeting parquet_ prefixed
indices are routed through UnifiedQueryPlanner and executed via a stub
QueryPlanExecutor, with results formatted through the existing JDBC
response pipeline.

New files:
- QueryPlanExecutor: @FunctionalInterface contract for analytics engine
- AnalyticsExecutionEngine: converts Iterable<Object[]> to QueryResponse
  with type mapping and query size limit enforcement
- RestUnifiedQueryAction: orchestrates schema building, planning,
  execution on sql-worker thread pool, with client/server error
  classification and metrics
- StubQueryPlanExecutor: canned data for parquet_logs and parquet_metrics
  tables for development and testing

Modified files:
- RestPPLQueryAction: routing branch for parquet_ indices
- SQLPlugin: passes ClusterService and NodeClient to RestPPLQueryAction
- plugin/build.gradle: adds :api dependency for UnifiedQueryPlanner

Signed-off-by: Kai Huang <kaihuang@amazon.com>
Signed-off-by: Kai Huang <ahkcs@amazon.com>

* Align isClientError with RestPPLQueryAction classification

Add missing exception types to isClientError(): IndexNotFoundException,
ExpressionEvaluationException, QueryEngineException,
DataSourceClientException, IllegalAccessException. Matches the full
list in RestPPLQueryAction.isClientError().

Signed-off-by: Kai Huang <kaihuang@amazon.com>
Signed-off-by: Kai Huang <ahkcs@amazon.com>

* Move stub code into analytics/stub sub-package

Extract StubSchemaProvider, StubQueryPlanExecutor, and StubIndexDetector
into plugin/.../rest/analytics/stub/ package to clearly separate
temporary stub code from production code. RestUnifiedQueryAction now
delegates to these stub classes.

Signed-off-by: Kai Huang <kaihuang@amazon.com>
Signed-off-by: Kai Huang <ahkcs@amazon.com>

* Use ErrorMessageFactory for error responses

Replace hand-crafted JSON error response with
ErrorMessageFactory.createErrorMessage(), matching the standard error
format used in RestPPLQueryAction.reportError().

Signed-off-by: Kai Huang <kaihuang@amazon.com>
Signed-off-by: Kai Huang <ahkcs@amazon.com>

* Make metrics and response formatting query-type-aware

Use QueryType to select the correct metrics (PPL_REQ_TOTAL vs REQ_TOTAL,
PPL_FAILED_REQ_COUNT_* vs FAILED_REQ_COUNT_*) and LangSpec (PPL_SPEC
vs SQL_SPEC) so this class can serve both PPL and SQL queries when
unified SQL support is added.

Signed-off-by: Kai Huang <kaihuang@amazon.com>
Signed-off-by: Kai Huang <ahkcs@amazon.com>

* Move analytics routing from REST layer to transport layer

Move the analytics index routing check from RestPPLQueryAction into
TransportPPLQueryAction.doExecute(). This ensures the analytics path
gets the same PPL enabled check, metrics, request ID, and inter-plugin
transport support as the existing Lucene path. RestPPLQueryAction and
SQLPlugin are reverted to their original state.

Added executeViaTransport() to RestUnifiedQueryAction which returns
results via ActionListener<TransportPPLQueryResponse> instead of
RestChannel, integrating properly with the transport action pattern.

Signed-off-by: Kai Huang <kaihuang@amazon.com>
Signed-off-by: Kai Huang <ahkcs@amazon.com>

* Add query size limit to RelNode plan instead of post-processing

Add LogicalSystemLimit to the RelNode plan before passing it to the
analytics engine, consistent with PPL V3 (QueryService.convertToCalcitePlan).
This ensures the analytics engine enforces the limit during execution
rather than returning all rows for post-processing truncation.

Remove post-processing querySizeLimit truncation from
AnalyticsExecutionEngine -- the limit is now part of the plan the
executor receives.

Signed-off-by: Kai Huang <kaihuang@amazon.com>
Signed-off-by: Kai Huang <ahkcs@amazon.com>

* Propagate security context to sql-worker thread pool

Wrap scheduled lambdas in both execute() and executeViaTransport() with
withCurrentContext() to capture and restore ThreadContext (user identity,
permissions, audit trail) on the worker thread. Follows the same pattern
as OpenSearchQueryManager.withCurrentContext().

Signed-off-by: Kai Huang <kaihuang@amazon.com>
Signed-off-by: Kai Huang <ahkcs@amazon.com>

* Move analytics routing after profiling setup

Move the analytics index routing check after QueryContext.setProfile()
and wrapWithProfilingClear(listener). Use clearingListener instead of
raw listener so profiling thread-local state is properly cleaned up
after analytics path execution.

Signed-off-by: Kai Huang <kaihuang@amazon.com>
Signed-off-by: Kai Huang <ahkcs@amazon.com>

* Remove NPE from isClientError classification

NPE in the analytics path is a server bug (null schema field, missing
row), not bad user input. Removed from client error list. Will sync
this classification with RestPPLQueryAction updates in #5266.

Signed-off-by: Kai Huang <kaihuang@amazon.com>
Signed-off-by: Kai Huang <ahkcs@amazon.com>

* Remove duplicate REST execution path from RestUnifiedQueryAction

Remove execute(query, queryType, channel), doExecute(), createQueryListener(channel),
recordSuccessMetric(), recordFailureMetric(), reportError(), and related
REST imports. Since routing now goes through TransportPPLQueryAction,
the REST-specific path was unused. Renamed executeViaTransport() to
execute() as the sole entry point.

Signed-off-by: Kai Huang <kaihuang@amazon.com>
Signed-off-by: Kai Huang <ahkcs@amazon.com>

---------

Signed-off-by: Kai Huang <kaihuang@amazon.com>
Signed-off-by: Kai Huang <ahkcs@amazon.com>
Signed-off-by: Simeon Widdis <sawiddis@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

Failed to generate code suggestions for PR

@github-actions
Copy link
Copy Markdown
Contributor

Failed to generate code suggestions for PR

…oint

Signed-off-by: Simeon Widdis <sawiddis@amazon.com>
@Swiddis Swiddis added the error-experience Issues related to how we handle failure cases in the plugin. label Mar 27, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Failed to generate code suggestions for PR

@Swiddis Swiddis force-pushed the feat/report-builder branch from 090ece5 to 4b73f48 Compare March 27, 2026 19:53
@github-actions
Copy link
Copy Markdown
Contributor

Failed to generate code suggestions for PR

@Swiddis Swiddis force-pushed the feat/report-builder branch from 4b73f48 to ca77eeb Compare March 27, 2026 19:56
Signed-off-by: Simeon Widdis <sawiddis@amazon.com>
@Swiddis Swiddis force-pushed the feat/report-builder branch from ca77eeb to 6f8d67a Compare March 27, 2026 19:56
@github-actions
Copy link
Copy Markdown
Contributor

Failed to generate code suggestions for PR

1 similar comment
@github-actions
Copy link
Copy Markdown
Contributor

Failed to generate code suggestions for PR

Comment on lines +404 to +415
private static boolean isNonPushdownEnumerableAggregate(String message) {
return message.contains("Error while preparing plan")
&& message.contains("CalciteEnumerableNestedAggregate");
}

// Detect if error is due to window functions in unsupported context (bins on time fields)
private static boolean isWindowBinOnTimeField(SQLException e) {
String errorMsg = e.getMessage();
return errorMsg != null
&& errorMsg.contains("Error while preparing plan")
&& errorMsg.contains("WIDTH_BUCKET");
}
Copy link
Copy Markdown
Collaborator Author

@Swiddis Swiddis Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self-suggestion: These conditions can be simplified by using error codes at a lower level

The checks here just become report.getCode() == ErrorCode.ILLEGAL_AGGREGATE_PUSHDOWN and report.getCode() == ErrorCode.INVALID_WINDOW_BIN or something, and we can remove the helpers. Can add this if it comes up or is in the way of other changes.

I mostly bring it up to highlight why I'm introducing codes in the first place: They can let us specify much more specific scenarios than generic exceptions do. I just didn't fully integrate it here in the interest of time & keeping the PR as small as I could get it.

@anasalkouz
Copy link
Copy Markdown
Member

Not sure if understand the improvements: what is the error experience before and after this PR?

@Swiddis
Copy link
Copy Markdown
Collaborator Author

Swiddis commented Mar 27, 2026

You can see some concrete examples by looking at the diffs for doctests that check for errors https://github.com/opensearch-project/sql/pull/5266/changes#diff-601c336135ceed5ca58ff75bf3783afd2722610e3b5583160476fc9378b34cd6

@dai-chen
Copy link
Copy Markdown
Collaborator

In meta issue, we mention the "The core problem is that misbehaving queries are typically hard to debug". So the improvements like location are mostly for developer? Probably need some clarification for the goal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request error-experience Issues related to how we handle failure cases in the plugin.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[RFC] Error Report-building Mechanism for PPL Errors

3 participants