Skip to content

Correctness fixes for analytics-engine search route (datetime schema + IP/BINARY byte[])#5447

Open
ahkcs wants to merge 4 commits into
opensearch-project:mainfrom
ahkcs:pr/search-special-index-parquet
Open

Correctness fixes for analytics-engine search route (datetime schema + IP/BINARY byte[])#5447
ahkcs wants to merge 4 commits into
opensearch-project:mainfrom
ahkcs:pr/search-special-index-parquet

Conversation

@ahkcs
Copy link
Copy Markdown
Collaborator

@ahkcs ahkcs commented May 15, 2026

What this PR does

Four narrowly-scoped correctness fixes the analytics-engine search path needs. No visitor lowering, no work-arounds for typed-predicate query_string — those belong upstream (see below).

# Commit What it fixes Without this commit
1 Inject parquet settings in testSearchCommandWithSpecialIndexName Makes the special-index test create a parquet-backed shard like the rest of the suite. Test runs on a Lucene-backed shard — not representative of the analytics-engine route.
2 Pass minimal {"properties":{}} mapping in testSearchCommandWithSpecialIndexName Empty index needs at least an empty properties map for the analytics catalog to surface it. Index goes missing; test fails at "index not found".
3 Preserve datetime schema-column types in AnalyticsExecutionEngine DatetimeOutputCastRule casts every datetime column to VARCHAR for PPL's wire format. The response schema now recovers the pre-cast RelDataType so a @timestamp column is reported as timestamp, not string. Type-checked clients break: schema says string for what is actually a timestamp.
4 Convert byte[] cells to ExprValue for IP / BINARY fields ExprValueUtils.fromObjectValue gets a byte[] branch — 4 / 16 bytes → ExprIpValue via InetAddress.getByAddress; other lengths → base64 ExprStringValue. Every IP / binary cell throws ExpressionEvaluationException: unsupported object class [B. Hard crash.

Two test-plumbing commits + two correctness commits in two files (AnalyticsExecutionEngine, ExprValueUtils). That's the whole PR.

Out of scope — typed-predicate query_string is a Lucene-secondary concern

CalciteSearchCommandIT on the analytics-engine route (tests.analytics.force_routing=true -Dtests.analytics.parquet_indices=true, against current upstream/main + opensearch-project/OpenSearch#21681) — 26 / 52 pass. The 25 failures all collapse to one upstream root cause: Lucene-secondary query_string doesn't re-type literals inside the query body using the field mapping the way normal Lucene _search does.

query_string input Lucene _search Analytics-engine Lucene secondary
[severityText], "ERROR" (text equality) ✅ matches ✅ matches
[severityNumber], "severityNumber:>15" (numeric range) ✅ re-parses "15" as long, BKD range hit ❌ literal stays as text → 0 rows
[@timestamp], "@timestamp:>2024-01-01" (date range) ✅ re-parses date ❌ 0 rows
query_string(text) AND age>30 (mixed-type AND) Panic: primitive array

Why this PR doesn't paper over it on the SQL side. The SQL plugin could lower numeric/boolean comparisons to native typed RexCall and dodge the gap (an earlier version of this PR did exactly that, lifting 26 → 38). The fix doesn't belong here, though — typed query_string should evaluate correctly on the Lucene secondary regardless of which engine produced the predicate. The right place is the analytics-engine sandbox's QuerySerializerRegistry / Lucene-secondary mapping path that opensearch-project/OpenSearch#21562 introduced. Once typed query_string works end-to-end there, the 25 failures unblock with no change in this repo.

The compound-AND panic that #21562's author flagged in their own test cases 10c / 10d as "handled separately" is the same family of bug.

Results

CalciteSearchCommandIT on the analytics-engine route, against current upstream/main + opensearch-project/OpenSearch#21681:

Step PASS / 52
Pre-#21631 baseline 3
+ Upstream lands #21631 / #21698 / #21628 / #21622 / #21562 ~26
+ opensearch-project/OpenSearch#21681 (sandbox fixes) + this PR (4 correctness lifts) 26 ⇨ correctness on the 26 that run

The +0 net pass-rate vs. upstream baseline is intentional: the SQL-side fixes here prevent wrong answers (timestamp-as-string schema, byte[] crash on IP / BINARY scans) on tests that already pass at row-count level. They don't unblock new tests — that takes the Lucene-secondary fix described above.

Out of scope — 25 failures bucketed by root cause

All collapse to "Lucene-secondary query_string doesn't type-coerce literals using the field mapping":

Bucket Tests (25) Where to fix
Numeric / decimal range (5) testSearchWithNumericComparison, testSearchWithInclusiveRanges, testSearchWithDoubleFieldComparisons, testSearchWithDoubleINOperator, testSearchWithDoubleRangeOperators analytics-engine Lucene-secondary mapping / QueryStringQueryBuilder evaluator
Date-math / date range (10) testSearchWithAbsoluteEarliestAndNow, testSearchWithDateFormats, testSearchWithDateRangeComparisons, testSearchWithDateINOperator, testSearchWithRelativeTimeModifiers, testSearchWithQuarterlyModifiers, testSearchWithTimeUnitSnapping, testSearchTimeModifierWithSnappedWeek, testSearchWithChainedRelativeTimeRange, testSearchWithNumericTimeRange Same
Compound predicate (4) testSearchWithNestedParentheses, testSearchWithComplexBooleanExpression, testSearchMixedWithPipeCommands, testSearchWithFieldNotEquals Composite engine — composing query_string-delegated and DataFusion-native predicates
Attribute / multi-field (2) testSearchWithAttributeFields, testSearchWithMultipleFieldTypes Same family
IP address (1) testSearchWithIPAddress Same family
Lucene-secondary NOT / wildcard semantics (2) testWildcardPatternMatching, testDifferenceBetweenNOTAndNotEquals Lucene-secondary side, separate from #21562
Special-index empty-mapping streaming (1) testSearchCommandWithSpecialIndexName Composite-engine streaming — can't open empty composite shard

Tracked in docs/dev/search-command-analytics-route-status.md.

Check list

  • New functionality includes testing
  • Commits are signed per the DCO using --signoff

@ahkcs ahkcs changed the title Inject parquet settings in testSearchCommandWithSpecialIndexName Lower structured search predicates to native RexCall; analytics-route coverage fixes May 15, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 0cef8c8

@ahkcs ahkcs force-pushed the pr/search-special-index-parquet branch from 602e1f0 to 0cef8c8 Compare May 15, 2026 21:17
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 0cef8c8

1 similar comment
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 0cef8c8

@ahkcs ahkcs added calcite calcite migration releated PPL Piped processing language enhancement New feature or request labels May 15, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 828b80d

@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 52ffeb8

@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 96b33cf

@ahkcs ahkcs force-pushed the pr/search-special-index-parquet branch from 96b33cf to c6e07a0 Compare May 15, 2026 22:34
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit c6e07a0

@ahkcs ahkcs force-pushed the pr/search-special-index-parquet branch from c6e07a0 to 51f4213 Compare May 18, 2026 19:03
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 51f4213

@ahkcs ahkcs force-pushed the pr/search-special-index-parquet branch from 51f4213 to 8416a3b Compare May 18, 2026 19:16
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 8416a3b

@ahkcs ahkcs force-pushed the pr/search-special-index-parquet branch from 8416a3b to 02807c7 Compare May 18, 2026 19:54
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 02807c7

@ahkcs ahkcs force-pushed the pr/search-special-index-parquet branch 2 times, most recently from dc23a73 to 9583614 Compare May 18, 2026 22:09
ahkcs added 2 commits May 19, 2026 10:00
The test creates `logs-2021.01.11` and `logs-7.10.0-2021.01.11` directly
via raw PUTs to exercise the parser's handling of dot/version-prefixed
index names. The raw path bypasses `TestUtils.createIndexByRestClient`,
so the analytics-engine `tests.analytics.parquet_indices=true` toggle
never injects the composite/parquet settings. On the analytics-engine
route this surfaces as `Table 'logs-2021.01.11' not found` (the
catalog-builder only exposes indices that match an analytics-engine-
recognized data format), failing the test.

Routing the creates through `TestUtils.createIndexByRestClient` makes
both code paths honor the toggle: legacy / Calcite continues to create
plain Lucene-backed indices, analytics-engine gets parquet-backed
composite indices. The index names are still tested verbatim.

Surfaced by `CalciteSearchCommandIT` on the analytics-engine route.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
…ialIndexName

`OpenSearchSchemaBuilder.buildSchema` skips any index whose mapping has
no `properties` block — the analytics catalog then doesn't surface the
index, and the PPL parser fails with `Table 'logs-2021.01.11' not
found` before reaching the row type. Routing the special-index create
through `TestUtils.createIndexByRestClient` with `null` mapping
produces an empty `{}` body, so the index is created with no mapping
at all and the catalog still skips it.

Pass a minimal `{"mappings":{"properties":{}}}` body. OpenSearch
strips the empty `properties` to `{}` server-side but the schema
builder sees a non-null mapping and registers the (zero-column) table.
The query then parses cleanly and the test's `verifyDataRows()` (no
expected rows) succeeds against the empty index.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
@ahkcs ahkcs force-pushed the pr/search-special-index-parquet branch from 20b7bc3 to 57178c1 Compare May 19, 2026 17:01
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 57178c1

@ahkcs ahkcs changed the title Lower structured search predicates to native RexCall; analytics-route coverage fixes Correctness fixes for analytics-engine search route (datetime schema + IP/BINARY byte[]) May 19, 2026
`DatetimeOutputCastRule` (api/spec/datetime) wraps every datetime-typed
root column in `CAST(... AS VARCHAR)` so the analytics-engine backend
can emit PPL's documented space-separator format via `to_char`
(`DatetimeOutputCastRewriter` on the OpenSearch side). The downside is
that `plan.getRowType()` then advertises those columns as VARCHAR — the
response schema reports `"string"` for what callers expect as
`"timestamp"`. The legacy / Calcite-reference path doesn't have this
divergence: the wire value is the formatted string while the schema
column type stays as the original datetime type.

Walk the outermost `Project` (descending through a single `Sort`, the
`LogicalSystemLimit` system query-size wrapper) and, for any slot of
shape `CAST(<datetime> AS VARCHAR)`, swap in the inner `RelDataType`
for the schema column. Values still come back as the strings DataFusion
emitted — only the schema column type is restored.

Falls back to the slot's reported type for non-cast slots, non-datetime
sources, or when the root isn't a Project (raw scan / aggregate fragment).

Surfaced by `verifySchema(... schema("@timestamp", "timestamp") ...)`
assertions in `CalciteSearchCommandIT` time-modifier tests; this commit
removes the schema-type divergence for those queries (the remaining
data-row mismatches are downstream Lucene-secondary issues filed
separately).

Signed-off-by: Kai Huang <ahkcs@amazon.com>
@ahkcs ahkcs force-pushed the pr/search-special-index-parquet branch from 57178c1 to adf0887 Compare May 19, 2026 18:23
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit adf0887

// `OpenSearchSchemaBuilder` + `schema_coerce.rs`) and route by type here.
if (bytes.length == 4 || bytes.length == 16) {
try {
return new ExprIpValue(java.net.InetAddress.getByAddress(bytes).getHostAddress());
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: This is going to trip someone up at some point. In particular, any 4-letter words in ASCII will hit this. I don't want to Norway our customers.

Is there not a way to somehow tag the bytes with whether they're actually IP data or not?

Copy link
Copy Markdown
Collaborator Author

@ahkcs ahkcs May 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — you're right. Dropped the length-based heuristic in the latest force-push: byte[] cells now always render as base64, no IP-guessing.

Trade-off: genuine IPv4 / IPv6 cells render as base64 strings instead of dotted-decimal until the schema actually tells us "this column is IP." That's the upstream TODO — OpenSearchSchemaBuilder.mapFieldType currently collapses both ip and binary into VARBINARY. Splitting them into separate UDTs is a cross-repo refactor (analytics-api + isthmus + Substrait + SQL plugin), so leaving it as a follow-up.
The conservative fix here still resolves the original crash (every PPL query scanning an ip column was throwing unsupported object class [B) — just without the misinterpretation risk you flagged.

`AnalyticsExecutionEngine.convertRows` reaches `ExprValueUtils.fromObjectValue`
for every cell in every row. OpenSearch `ip` and `binary` fields are stored
as raw bytes, so once the analytics-engine reader (paired with OpenSearch
PR #21681 commit 4) materializes a parquet column, each cell arrives as a
Java `byte[]`. The fall-through catch-all previously threw:

    ExpressionEvaluationException: unsupported object class [B
      at ExprValueUtils.fromObjectValue(ExprValueUtils.java:168)
      at AnalyticsExecutionEngine.convertRows(AnalyticsExecutionEngine.java:128)

Every PPL query that even projects an ip column fails on the analytics route
with this error, regardless of operator or predicate.

`OpenSearchSchemaBuilder.mapFieldType` collapses both `ip` and `binary` into
Calcite `VARBINARY`, so the cell-level type system can't distinguish the two.
An earlier draft of this commit gated IP rendering on byte length (4 or 16
bytes → `ExprIpValue` via `InetAddress.getByAddress`), but that misinterprets
any 4-byte ASCII payload (e.g. `"INFO"`, `"WARN"`, `"POST"`) as a public
IPv4 address — the classic "Norway problem" applied to bytes. With no schema
context here we can't safely guess, so default to a base64-encoded
`ExprStringValue`. Typed call sites that DO know the column is `ip` should
route through `ipValue(String)` explicitly.

Once `OpenSearchSchemaBuilder` splits `ip` and `binary` into separate UDTs
(TODO already filed in that file), this branch can render IP cells in
dotted-decimal form again without the length-guessing.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
@ahkcs ahkcs force-pushed the pr/search-special-index-parquet branch from adf0887 to cdac62e Compare May 19, 2026 19:20
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit cdac62e

@opensearch-project opensearch-project deleted a comment from github-actions Bot May 19, 2026
@opensearch-project opensearch-project deleted a comment from github-actions Bot May 19, 2026
@ahkcs ahkcs requested a review from Swiddis May 19, 2026 22:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

calcite calcite migration releated enhancement New feature or request PPL Piped processing language

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants