Support deleting columns from a table schema, with optional on-reload reclamation#18831
Support deleting columns from a table schema, with optional on-reload reclamation#18831sourabh-27 wants to merge 3 commits into
Conversation
xiangfu0
left a comment
There was a problem hiding this comment.
Found one high-signal correctness issue; see inline comment.
| String column = columnMetadata.getColumnName(); | ||
| // Only remove auto-generated columns | ||
| if (!_schema.hasColumn(column) && columnMetadata.isAutoGenerated()) { | ||
| if (!_schema.hasColumn(column) && (columnMetadata.isAutoGenerated() || reclaimDeletedColumns)) { |
There was a problem hiding this comment.
When reclaimDeletedColumnsOnReload=true, this now reclaims every column missing from the schema, not just auto-generated default columns. That also matches live synthetic columns such as OPEN_STRUCT materialized children (payload$key, payload$__sparse__), which are intentionally not schema fields but are still required by the standard column-loading path. A reload on a table with OPEN_STRUCT columns would therefore physically delete still-live child indexes/data. Please exclude synthetic/materialized child columns from this reclaim path (for example via the existing parent-column metadata / OPEN_STRUCT naming) and add a reload test for it.
Description
Solves #18808
Apache Pinot's schema-update API has been strictly additive. The
PUT /schemas/{schemaName}endpoint rejects any update that removes a column to prevent accidental backward-incompatible changes. The only workaround was usingforce=true, which is unsafe as it bypasses all structural validation. Furthermore, Pinot lacked a mechanism to reclaim on-disk space occupied by removed columns in existing, already-built segments.This PR introduces guard-railed support for deleting columns from a schema, divided into two independent, opt-in tiers:
1. Logical Deletion (Controller / Schema API)
A new
allowColumnDeletionquery parameter is introduced to the schema-update endpoints.true, callers can intentionally drop columns present in the old schema but absent from the new one.2. Physical Reclamation (Server / Segment Reload)
A config flag,
reclaimDeletedColumnsOnReload, dictates whether data for ingested columns missing from the schema is physically purged from segments during a reload operation.true), a segment reload explicitly drops the forward index, dictionary, and all auxiliary indexes for columns no longer present in the schema, freeing up disk space.3. Query Layer Behavior (Unchanged)
Changes
pinot-controllerPinotSchemaRestletResource: Added theallowColumnDeletionquery parameter (default:false) to both the multipart and JSONPUT /schemas/{schemaName}endpoints.PinotHelixResourceManager: Passed theallowColumnDeletionparameter down intoupdateSchema(...).pinot-spiSchema: OverloadedisBackwardCompatibleWith(Schema oldSchema, boolean allowColumnDeletion). The original single-argument method signature remains intact.IndexingConfig: Added thereclaimDeletedColumnsOnReloadoption (default:false).pinot-segment-localIndexLoadingConfig: Exposed theisReclaimDeletedColumnsOnReload()configuration property.BaseDefaultColumnHandler: Updated to computeREMOVEactions for ingested columns absent from the schema when the reclamation flag is active (extending the existing auto-generated column removal logic).pinot-clientsSchemaAdminClient: OverloadedupdateSchema(..., boolean allowColumnDeletion).Testing
./mvnw -pl pinot-spi -am -Dtest=SchemaTest -Dsurefire.failIfNoSpecifiedTests=false test./mvnw -pl pinot-controller -am -Dtest=PinotSchemaRestletResourceTest -Dsurefire.failIfNoSpecifiedTests=false test./mvnw -pl pinot-segment-local -am -Dtest=DefaultColumnHandlerTest,SegmentPreProcessorTest -Dsurefire.failIfNoSpecifiedTests=false test./mvnw -pl pinot-integration-tests -am -Dtest=OfflineClusterIntegrationTest#testSchemaColumnDeletion -Dsurefire.failIfNoSpecifiedTests=false test./mvnw spotless:apply -pl pinot-spi,pinot-controller,pinot-segment-local,pinot-clients,pinot-integration-tests./mvnw license:format -pl pinot-spi,pinot-controller,pinot-segment-local,pinot-clients,pinot-integration-tests./mvnw checkstyle:check -pl pinot-spi,pinot-controller,pinot-segment-local,pinot-clients,pinot-integration-tests./mvnw license:check -pl pinot-spi,pinot-controller,pinot-segment-local,pinot-clients,pinot-integration-testsgit diff --checkRelease Notes
PUT /schemas/{schemaName}endpoints now accept an optionalallowColumnDeletionquery parameter (default:false). Whentrue, columns omitted from the new schema are safely dropped, provided they are not actively referenced by any table configuration. Structural type and primary-key compatibility assertions remain strictly enforced.indexingConfig.reclaimDeletedColumnsOnReload(default:false). When activated, ingested columns omitted from the schema are physically wiped from segments upon reload to reclaim storage space.