Skip to content

Support deleting columns from a table schema, with optional on-reload reclamation#18831

Open
sourabh-27 wants to merge 3 commits into
apache:masterfrom
sourabh-27:feature/DeleteColumn
Open

Support deleting columns from a table schema, with optional on-reload reclamation#18831
sourabh-27 wants to merge 3 commits into
apache:masterfrom
sourabh-27:feature/DeleteColumn

Conversation

@sourabh-27

Copy link
Copy Markdown

Description

Solves #18808

Apache Pinot's schema-update API has been strictly additive. The PUT /schemas/{schemaName} endpoint rejects any update that removes a column to prevent accidental backward-incompatible changes. The only workaround was using force=true, which is unsafe as it bypasses all structural validation. Furthermore, Pinot lacked a mechanism to reclaim on-disk space occupied by removed columns in existing, already-built segments.

This PR introduces guard-railed support for deleting columns from a schema, divided into two independent, opt-in tiers:

1. Logical Deletion (Controller / Schema API)

A new allowColumnDeletion query parameter is introduced to the schema-update endpoints.

  • Behavior: When set to true, callers can intentionally drop columns present in the old schema but absent from the new one.
  • Safety Guards: All other backward-compatibility rules remain same (e.g., changes to column types or primary keys are still rejected).

2. Physical Reclamation (Server / Segment Reload)

A config flag, reclaimDeletedColumnsOnReload, dictates whether data for ingested columns missing from the schema is physically purged from segments during a reload operation.

  • Behavior: Previously, only auto-generated default columns were cleaned up; ingested column data persisted indefinitely. When this flag is enabled (set to true), a segment reload explicitly drops the forward index, dictionary, and all auxiliary indexes for columns no longer present in the schema, freeing up disk space.

3. Query Layer Behavior (Unchanged)

  • Behavior: Queries referencing a column that has been deleted from the schema will throw an error. This remains the consistent, standard behavior alongside these changes.

Changes

pinot-controller

  • PinotSchemaRestletResource: Added the allowColumnDeletion query parameter (default: false) to both the multipart and JSON PUT /schemas/{schemaName} endpoints.
  • PinotHelixResourceManager: Passed the allowColumnDeletion parameter down into updateSchema(...).

pinot-spi

  • Schema: Overloaded isBackwardCompatibleWith(Schema oldSchema, boolean allowColumnDeletion). The original single-argument method signature remains intact.
  • IndexingConfig: Added the reclaimDeletedColumnsOnReload option (default: false).

pinot-segment-local

  • IndexLoadingConfig: Exposed the isReclaimDeletedColumnsOnReload() configuration property.
  • BaseDefaultColumnHandler: Updated to compute REMOVE actions for ingested columns absent from the schema when the reclamation flag is active (extending the existing auto-generated column removal logic).

pinot-clients

  • SchemaAdminClient: Overloaded updateSchema(..., boolean allowColumnDeletion).

Testing

  • ./mvnw -pl pinot-spi -am -Dtest=SchemaTest -Dsurefire.failIfNoSpecifiedTests=false test
  • ./mvnw -pl pinot-controller -am -Dtest=PinotSchemaRestletResourceTest -Dsurefire.failIfNoSpecifiedTests=false test
  • ./mvnw -pl pinot-segment-local -am -Dtest=DefaultColumnHandlerTest,SegmentPreProcessorTest -Dsurefire.failIfNoSpecifiedTests=false test
  • ./mvnw -pl pinot-integration-tests -am -Dtest=OfflineClusterIntegrationTest#testSchemaColumnDeletion -Dsurefire.failIfNoSpecifiedTests=false test
  • ./mvnw spotless:apply -pl pinot-spi,pinot-controller,pinot-segment-local,pinot-clients,pinot-integration-tests
  • ./mvnw license:format -pl pinot-spi,pinot-controller,pinot-segment-local,pinot-clients,pinot-integration-tests
  • ./mvnw checkstyle:check -pl pinot-spi,pinot-controller,pinot-segment-local,pinot-clients,pinot-integration-tests
  • ./mvnw license:check -pl pinot-spi,pinot-controller,pinot-segment-local,pinot-clients,pinot-integration-tests
  • git diff --check

Release Notes

  • New Schema-Update Option: PUT /schemas/{schemaName} endpoints now accept an optional allowColumnDeletion query parameter (default: false). When true, columns omitted from the new schema are safely dropped, provided they are not actively referenced by any table configuration. Structural type and primary-key compatibility assertions remain strictly enforced.
  • New Table Indexing Configuration: Added indexingConfig.reclaimDeletedColumnsOnReload (default: false). When activated, ingested columns omitted from the schema are physically wiped from segments upon reload to reclaim storage space.

@xiangfu0 xiangfu0 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found one high-signal correctness issue; see inline comment.

String column = columnMetadata.getColumnName();
// Only remove auto-generated columns
if (!_schema.hasColumn(column) && columnMetadata.isAutoGenerated()) {
if (!_schema.hasColumn(column) && (columnMetadata.isAutoGenerated() || reclaimDeletedColumns)) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When reclaimDeletedColumnsOnReload=true, this now reclaims every column missing from the schema, not just auto-generated default columns. That also matches live synthetic columns such as OPEN_STRUCT materialized children (payload$key, payload$__sparse__), which are intentionally not schema fields but are still required by the standard column-loading path. A reload on a table with OPEN_STRUCT columns would therefore physically delete still-live child indexes/data. Please exclude synthetic/materialized child columns from this reclaim path (for example via the existing parent-column metadata / OPEN_STRUCT naming) and add a reload test for it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants