Correctness fixes for analytics-engine search route (datetime schema + IP/BINARY byte[])#5447
Correctness fixes for analytics-engine search route (datetime schema + IP/BINARY byte[])#5447ahkcs wants to merge 4 commits into
Conversation
|
Persistent review updated to latest commit 0cef8c8 |
602e1f0 to
0cef8c8
Compare
|
Persistent review updated to latest commit 0cef8c8 |
1 similar comment
|
Persistent review updated to latest commit 0cef8c8 |
|
Persistent review updated to latest commit 828b80d |
|
Persistent review updated to latest commit 52ffeb8 |
|
Persistent review updated to latest commit 96b33cf |
96b33cf to
c6e07a0
Compare
|
Persistent review updated to latest commit c6e07a0 |
c6e07a0 to
51f4213
Compare
|
Persistent review updated to latest commit 51f4213 |
51f4213 to
8416a3b
Compare
|
Persistent review updated to latest commit 8416a3b |
8416a3b to
02807c7
Compare
|
Persistent review updated to latest commit 02807c7 |
dc23a73 to
9583614
Compare
The test creates `logs-2021.01.11` and `logs-7.10.0-2021.01.11` directly via raw PUTs to exercise the parser's handling of dot/version-prefixed index names. The raw path bypasses `TestUtils.createIndexByRestClient`, so the analytics-engine `tests.analytics.parquet_indices=true` toggle never injects the composite/parquet settings. On the analytics-engine route this surfaces as `Table 'logs-2021.01.11' not found` (the catalog-builder only exposes indices that match an analytics-engine- recognized data format), failing the test. Routing the creates through `TestUtils.createIndexByRestClient` makes both code paths honor the toggle: legacy / Calcite continues to create plain Lucene-backed indices, analytics-engine gets parquet-backed composite indices. The index names are still tested verbatim. Surfaced by `CalciteSearchCommandIT` on the analytics-engine route. Signed-off-by: Kai Huang <ahkcs@amazon.com>
…ialIndexName
`OpenSearchSchemaBuilder.buildSchema` skips any index whose mapping has
no `properties` block — the analytics catalog then doesn't surface the
index, and the PPL parser fails with `Table 'logs-2021.01.11' not
found` before reaching the row type. Routing the special-index create
through `TestUtils.createIndexByRestClient` with `null` mapping
produces an empty `{}` body, so the index is created with no mapping
at all and the catalog still skips it.
Pass a minimal `{"mappings":{"properties":{}}}` body. OpenSearch
strips the empty `properties` to `{}` server-side but the schema
builder sees a non-null mapping and registers the (zero-column) table.
The query then parses cleanly and the test's `verifyDataRows()` (no
expected rows) succeeds against the empty index.
Signed-off-by: Kai Huang <ahkcs@amazon.com>
20b7bc3 to
57178c1
Compare
|
Persistent review updated to latest commit 57178c1 |
`DatetimeOutputCastRule` (api/spec/datetime) wraps every datetime-typed
root column in `CAST(... AS VARCHAR)` so the analytics-engine backend
can emit PPL's documented space-separator format via `to_char`
(`DatetimeOutputCastRewriter` on the OpenSearch side). The downside is
that `plan.getRowType()` then advertises those columns as VARCHAR — the
response schema reports `"string"` for what callers expect as
`"timestamp"`. The legacy / Calcite-reference path doesn't have this
divergence: the wire value is the formatted string while the schema
column type stays as the original datetime type.
Walk the outermost `Project` (descending through a single `Sort`, the
`LogicalSystemLimit` system query-size wrapper) and, for any slot of
shape `CAST(<datetime> AS VARCHAR)`, swap in the inner `RelDataType`
for the schema column. Values still come back as the strings DataFusion
emitted — only the schema column type is restored.
Falls back to the slot's reported type for non-cast slots, non-datetime
sources, or when the root isn't a Project (raw scan / aggregate fragment).
Surfaced by `verifySchema(... schema("@timestamp", "timestamp") ...)`
assertions in `CalciteSearchCommandIT` time-modifier tests; this commit
removes the schema-type divergence for those queries (the remaining
data-row mismatches are downstream Lucene-secondary issues filed
separately).
Signed-off-by: Kai Huang <ahkcs@amazon.com>
57178c1 to
adf0887
Compare
|
Persistent review updated to latest commit adf0887 |
| // `OpenSearchSchemaBuilder` + `schema_coerce.rs`) and route by type here. | ||
| if (bytes.length == 4 || bytes.length == 16) { | ||
| try { | ||
| return new ExprIpValue(java.net.InetAddress.getByAddress(bytes).getHostAddress()); |
There was a problem hiding this comment.
issue: This is going to trip someone up at some point. In particular, any 4-letter words in ASCII will hit this. I don't want to Norway our customers.
Is there not a way to somehow tag the bytes with whether they're actually IP data or not?
There was a problem hiding this comment.
Good catch — you're right. Dropped the length-based heuristic in the latest force-push: byte[] cells now always render as base64, no IP-guessing.
Trade-off: genuine IPv4 / IPv6 cells render as base64 strings instead of dotted-decimal until the schema actually tells us "this column is IP." That's the upstream TODO — OpenSearchSchemaBuilder.mapFieldType currently collapses both ip and binary into VARBINARY. Splitting them into separate UDTs is a cross-repo refactor (analytics-api + isthmus + Substrait + SQL plugin), so leaving it as a follow-up.
The conservative fix here still resolves the original crash (every PPL query scanning an ip column was throwing unsupported object class [B) — just without the misinterpretation risk you flagged.
`AnalyticsExecutionEngine.convertRows` reaches `ExprValueUtils.fromObjectValue`
for every cell in every row. OpenSearch `ip` and `binary` fields are stored
as raw bytes, so once the analytics-engine reader (paired with OpenSearch
PR #21681 commit 4) materializes a parquet column, each cell arrives as a
Java `byte[]`. The fall-through catch-all previously threw:
ExpressionEvaluationException: unsupported object class [B
at ExprValueUtils.fromObjectValue(ExprValueUtils.java:168)
at AnalyticsExecutionEngine.convertRows(AnalyticsExecutionEngine.java:128)
Every PPL query that even projects an ip column fails on the analytics route
with this error, regardless of operator or predicate.
`OpenSearchSchemaBuilder.mapFieldType` collapses both `ip` and `binary` into
Calcite `VARBINARY`, so the cell-level type system can't distinguish the two.
An earlier draft of this commit gated IP rendering on byte length (4 or 16
bytes → `ExprIpValue` via `InetAddress.getByAddress`), but that misinterprets
any 4-byte ASCII payload (e.g. `"INFO"`, `"WARN"`, `"POST"`) as a public
IPv4 address — the classic "Norway problem" applied to bytes. With no schema
context here we can't safely guess, so default to a base64-encoded
`ExprStringValue`. Typed call sites that DO know the column is `ip` should
route through `ipValue(String)` explicitly.
Once `OpenSearchSchemaBuilder` splits `ip` and `binary` into separate UDTs
(TODO already filed in that file), this branch can render IP cells in
dotted-decimal form again without the length-guessing.
Signed-off-by: Kai Huang <ahkcs@amazon.com>
adf0887 to
cdac62e
Compare
|
Persistent review updated to latest commit cdac62e |
What this PR does
Four narrowly-scoped correctness fixes the analytics-engine
searchpath needs. No visitor lowering, no work-arounds for typed-predicatequery_string— those belong upstream (see below).Inject parquet settings in testSearchCommandWithSpecialIndexNamePass minimal {"properties":{}} mapping in testSearchCommandWithSpecialIndexNamepropertiesmap for the analytics catalog to surface it.Preserve datetime schema-column types in AnalyticsExecutionEngineDatetimeOutputCastRulecasts every datetime column toVARCHARfor PPL's wire format. The response schema now recovers the pre-castRelDataTypeso a@timestampcolumn is reported astimestamp, notstring.stringfor what is actually a timestamp.Convert byte[] cells to ExprValue for IP / BINARY fieldsExprValueUtils.fromObjectValuegets abyte[]branch — 4 / 16 bytes →ExprIpValueviaInetAddress.getByAddress; other lengths → base64ExprStringValue.ExpressionEvaluationException: unsupported object class [B. Hard crash.Two test-plumbing commits + two correctness commits in two files (
AnalyticsExecutionEngine,ExprValueUtils). That's the whole PR.Out of scope — typed-predicate
query_stringis a Lucene-secondary concernCalciteSearchCommandITon the analytics-engine route (tests.analytics.force_routing=true -Dtests.analytics.parquet_indices=true, against currentupstream/main+ opensearch-project/OpenSearch#21681) — 26 / 52 pass. The 25 failures all collapse to one upstream root cause: Lucene-secondaryquery_stringdoesn't re-type literals inside the query body using the field mapping the way normal Lucene_searchdoes.query_stringinput_search[severityText], "ERROR"(text equality)[severityNumber], "severityNumber:>15"(numeric range)"15"aslong, BKD range hit[@timestamp], "@timestamp:>2024-01-01"(date range)query_string(text) AND age>30(mixed-type AND)Panic: primitive arrayWhy this PR doesn't paper over it on the SQL side. The SQL plugin could lower numeric/boolean comparisons to native typed
RexCalland dodge the gap (an earlier version of this PR did exactly that, lifting26 → 38). The fix doesn't belong here, though — typedquery_stringshould evaluate correctly on the Lucene secondary regardless of which engine produced the predicate. The right place is the analytics-engine sandbox'sQuerySerializerRegistry/ Lucene-secondary mapping path that opensearch-project/OpenSearch#21562 introduced. Once typedquery_stringworks end-to-end there, the 25 failures unblock with no change in this repo.The compound-AND panic that #21562's author flagged in their own test cases 10c / 10d as "handled separately" is the same family of bug.
Results
CalciteSearchCommandITon the analytics-engine route, against currentupstream/main+ opensearch-project/OpenSearch#21681:The +0 net pass-rate vs. upstream baseline is intentional: the SQL-side fixes here prevent wrong answers (timestamp-as-string schema, byte[] crash on IP / BINARY scans) on tests that already pass at row-count level. They don't unblock new tests — that takes the Lucene-secondary fix described above.
Out of scope — 25 failures bucketed by root cause
All collapse to "Lucene-secondary
query_stringdoesn't type-coerce literals using the field mapping":testSearchWithNumericComparison,testSearchWithInclusiveRanges,testSearchWithDoubleFieldComparisons,testSearchWithDoubleINOperator,testSearchWithDoubleRangeOperatorsQueryStringQueryBuilderevaluatortestSearchWithAbsoluteEarliestAndNow,testSearchWithDateFormats,testSearchWithDateRangeComparisons,testSearchWithDateINOperator,testSearchWithRelativeTimeModifiers,testSearchWithQuarterlyModifiers,testSearchWithTimeUnitSnapping,testSearchTimeModifierWithSnappedWeek,testSearchWithChainedRelativeTimeRange,testSearchWithNumericTimeRangetestSearchWithNestedParentheses,testSearchWithComplexBooleanExpression,testSearchMixedWithPipeCommands,testSearchWithFieldNotEqualsquery_string-delegated and DataFusion-native predicatestestSearchWithAttributeFields,testSearchWithMultipleFieldTypestestSearchWithIPAddresstestWildcardPatternMatching,testDifferenceBetweenNOTAndNotEqualstestSearchCommandWithSpecialIndexNameTracked in
docs/dev/search-command-analytics-route-status.md.Check list
--signoff