
[SPARK-55716][SQL] Support NOT NULL constraint enforcement for V1 file source table inserts #54517

Closed
yaooqinn wants to merge 2 commits into apache:master from yaooqinn:SPARK-55716

Conversation

@yaooqinn (Member) commented Feb 26, 2026

What changes were proposed in this pull request?

V1 file-based DataSource writes (Parquet, ORC, JSON, etc.) silently accept null values into NOT NULL columns. This PR adds opt-in NOT NULL constraint enforcement controlled by spark.sql.fileSource.insert.enforceNotNull.

Changes:

  1. CreateDataSourceTableCommand: Preserves user-specified nullability by recursively merging nullability flags from the user schema into the resolved dataSource.schema. Previously it stored dataSource.schema directly, which is all-nullable due to DataSource.resolveRelation() calling dataSchema.asNullable (SPARK-13738).

  2. PreprocessTableInsertion: Restores nullability flags from the catalog schema before null checks, ensuring AssertNotNull is injected when needed. Gated behind spark.sql.fileSource.insert.enforceNotNull.

  3. New config: spark.sql.fileSource.insert.enforceNotNull (default false) — when set to true, enables NOT NULL constraint enforcement for V1 file-based tables, consistent with the behavior for other data sources and V2 catalog tables.

  4. SparkGetColumnsOperation: Fixed IS_NULLABLE to respect column.nullable instead of always returning "YES".
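The recursive merge in change 1 can be illustrated with a minimal, non-Spark sketch. Schemas are modeled here as plain dicts (structs carry "fields", arrays an "element", maps a "value"); all names are hypothetical and only mimic the idea of copying nullability flags from the user schema onto the all-nullable resolved schema.

```python
# Hedged sketch (not Spark's actual code) of merging user-specified
# nullability into a resolved schema that asNullable made all-nullable.

def merge_nullability(user, resolved):
    """Return resolved's type structure with nullability flags taken from user."""
    merged = dict(resolved)
    if "nullable" in user:
        merged["nullable"] = user["nullable"]
    if user.get("type") == "struct" and resolved.get("type") == "struct":
        by_name = {f["name"]: f for f in user["fields"]}
        merged["fields"] = [
            merge_nullability(by_name[f["name"]], f) if f["name"] in by_name else f
            for f in resolved["fields"]
        ]
    elif user.get("type") == "array" and resolved.get("type") == "array":
        merged["element"] = merge_nullability(user["element"], resolved["element"])
    elif user.get("type") == "map" and resolved.get("type") == "map":
        merged["value"] = merge_nullability(user["value"], resolved["value"])
    return merged

# The user declared id and nested s.x as NOT NULL; the resolved schema lost that.
user = {"type": "struct", "fields": [
    {"name": "id", "type": "int", "nullable": False},
    {"name": "s", "type": "struct", "nullable": True,
     "fields": [{"name": "x", "type": "int", "nullable": False}]},
]}
resolved = {"type": "struct", "fields": [
    {"name": "id", "type": "int", "nullable": True},
    {"name": "s", "type": "struct", "nullable": True,
     "fields": [{"name": "x", "type": "int", "nullable": True}]},
]}
merged = merge_nullability(user, resolved)
print(merged["fields"][0]["nullable"])               # False: NOT NULL restored on id
print(merged["fields"][1]["fields"][0]["nullable"])  # False: restored on nested s.x
```

The key point is that the merge has to recurse through nested structs, arrays, and maps, since asNullable strips nullability recursively.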

Why are the changes needed?

DataSource.resolveRelation() calls dataSchema.asNullable (added in SPARK-13738 for read safety), which strips all NOT NULL constraints recursively. CreateDataSourceTableCommand then stores this all-nullable schema in the catalog, permanently losing NOT NULL information. As a result, PreprocessTableInsertion never injects AssertNotNull for V1 file source tables.

Note: InsertableRelation (e.g., SimpleInsertSource) does NOT have this problem because it preserves the original schema (SPARK-24583).

Does this PR introduce any user-facing change?

No change in default behavior. Users can opt in to NOT NULL enforcement for V1 file source tables by setting spark.sql.fileSource.insert.enforceNotNull to true.
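Opting in would look roughly like the following (a hedged sketch based on the config name in this PR; the exact failure behavior and message are illustrative):

```sql
SET spark.sql.fileSource.insert.enforceNotNull=true;

CREATE TABLE events (id INT NOT NULL, payload STRING) USING parquet;

-- With the config on, this insert fails at write time instead of
-- silently storing a NULL in the NOT NULL column.
INSERT INTO events VALUES (NULL, 'a');
```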

How was this patch tested?

  • Added 7 new tests in InsertSuite covering top-level, nested struct, array, and map null constraint enforcement.
  • Fixed 3 existing interval column test assertions in SparkMetadataOperationSuite.

Was this patch authored or co-authored using generative AI tooling?

Yes, co-authored with GitHub Copilot.

@yaooqinn (Member, Author):

cc @dongjoon-hyun @cloud-fan @gengliangwang: is the fix direction correct? Is this a genuine bug or a design choice? I haven't found any public discussions in this area.

@dongjoon-hyun (Member) commented Feb 26, 2026

Hi, @yaooqinn

  • Apache Spark didn't claim to support SQL CONSTRAINT for v1 yet, did it? IIUC, it should be handled explicitly by the user in the SELECT clause of that INSERT statement.
  • SPARK-51207 (SPIP: Constraints in DSv2) is a fairly new feature, introduced only in Apache Spark 4.1.0.

To me, this PR seems to introduce a new feature rather than fix a bug.

cc @aokolnychyi , @peter-toth , too.

@yaooqinn (Member, Author) commented Feb 27, 2026

Apache Spark didn't claim to support SQL CONSTRAINT for v1 yet, did it? IIUC, it should be handled explicitly by the user in the SELECT clause of that INSERT statement.

@dongjoon-hyun
@dongjoon-hyun
I'm not sure. StoreAssignmentPolicy was introduced in 3.0, while the NOT NULL keyword has long been a supported (though perhaps not fully or well documented) feature. It looks like the V2 support is an addition built on top of that. In legacy V1, it partially works for some data sources where asNullable is not called.

@dongjoon-hyun (Member) commented Feb 27, 2026

IIUC, StoreAssignmentPolicy is supposed to check type casting only. For a single nullable type, I'm not sure that was achieved by SPARK-28730.

@dongjoon-hyun (Member) commented Feb 27, 2026

Anyway, let's wait for @gengliangwang's and @cloud-fan's comments. Maybe I lost track of the code changes.

@yaooqinn (Member, Author):

IIUC, StoreAssignmentPolicy is supposed to check type casting only.

@dongjoon-hyun
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala#L447

FYI, null constraints for table output rely on it too.


@cloud-fan (Contributor):

This is probably too breaking a change, and many users may already treat this bug as a feature.

If users need strict not null enforcement, they should migrate to DS v2.

@gengliangwang (Member):

+1 with @cloud-fan . At least, the default behavior should not be changed in 4.2.

@yaooqinn (Member, Author):

Yeah, I agree with you that this is too breaking. However, many of our clients and developers here have raised doubts about this strange behavior of Spark, and there are no clear docs for it.

they should migrate to DS v2.

Do you mean by using built-in sources with v2 code path or suggesting users migrate to third-party formats?

@cloud-fan (Contributor):

I think using plain file source tables is not recommended now; people should switch to lakehouse table formats.

@dongjoon-hyun (Member) commented Feb 27, 2026

Since DSv2 migration is an independent topic from this DSv1 bug, I made a PR for DSv2 specifically, @cloud-fan , @gengliangwang , @yaooqinn .

@yaooqinn (Member, Author):

Hi @dongjoon-hyun, built-in file sources with the DSv2 code path have the same issue, based on my unit tests.

@yaooqinn (Member, Author) commented Feb 27, 2026

@cloud-fan Instead of making NOT NULL enforcement the default behavior, would you mind if we give it a shot and provide an opt-in option for users who want built-in Parquet and Delta Lake to behave the same on NOT NULL constraints?

For streaming sources, we have such an option for toggling asNullable.

…ble inserts

V1 file-based DataSource writes (parquet/orc/json) silently accept null values into NOT NULL columns. This PR adds opt-in NOT NULL constraint enforcement by:

1. CreateDataSourceTableCommand: Preserve user-specified nullability by recursively merging nullability flags from the user schema into the resolved dataSource.schema.
2. PreprocessTableInsertion: Restore nullability flags from the catalog schema before null checks. This ensures AssertNotNull is injected when needed, gated by spark.sql.fileSource.insert.enforceNotNull.
3. Config: spark.sql.fileSource.insert.enforceNotNull (default false) - when true, enforces NOT NULL constraints for file-based tables.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@yaooqinn yaooqinn changed the title [SPARK-55716][SQL] Fix V1 file source NOT NULL constraint enforcement [SPARK-55716][SQL] Support NOT NULL constraint enforcement for V1 file source table inserts Feb 28, 2026
@cloud-fan (Contributor):

I'm fine with a new config to provide NOT NULL enforcement.

@yaooqinn (Member, Author) commented Mar 1, 2026

Thank you @cloud-fan. Can you take a deeper review when you have some time? Happy weekend!

@dongjoon-hyun (Member) left a comment:

From my side, the latest approach looks safer to me. Thank you for clarifying the AS-IS situation and proposing a better solution for Spark users, @yaooqinn! Although we may hit some corner cases later, I believe this leads us toward a more desirable status.

@yaooqinn (Member, Author) commented Mar 3, 2026

.internal()

@dongjoon-hyun, thank you.

Before merging this, I want to discuss the config scope: shall we make it public, since we are providing an option for Spark users?

@dongjoon-hyun (Member):

+1 for making it public, like the spark.sql.storeAssignmentPolicy config.

@yaooqinn yaooqinn closed this in fccedae Mar 3, 2026
@yaooqinn (Member, Author) commented Mar 3, 2026

Thank you @dongjoon-hyun and all, merged to master

@yaooqinn yaooqinn deleted the SPARK-55716 branch March 3, 2026 08:23
.createWithDefault(StoreAssignmentPolicy.ANSI)

val FILE_SOURCE_INSERT_ENFORCE_NOT_NULL =
buildConf("spark.sql.fileSource.insert.enforceNotNull")
Member:

Suggested change
buildConf("spark.sql.fileSource.insert.enforceNotNull")
buildConf("spark.sql.files.insert.enforceNotNull")

Can we follow the existing config namespace? Currently, we have many spark.sql.files.* configs which are "effective only when using file-based sources"

spark.sql.files.maxPartitionBytes
spark.sql.files.openCostInBytes
spark.sql.files.minPartitionNum
spark.sql.files.maxPartitionNum
spark.sql.files.ignoreMissingFiles
...

Member Author:

It doesn't make much sense to me; these configurations work on plain files, but this one takes no effect in the file-only mode.

Member:

but this one actually takes no effect on the file-only mode

It's true, because a file-based table without a catalog cannot store constraint info, but is that worth a new config namespace? I cannot infer such info from the proposed namespace name fileSource. IMO, it still fits the spark.sql.files.* scope, and we can mention such a limitation in the config docs if necessary.


case _ =>
// Merge nullability from the user-specified schema into the resolved schema.
// DataSource.resolveRelation() calls dataSchema.asNullable which strips NOT NULL
Contributor:

is it simpler to not do dataSchema.asNullable if the flag is on?


yaooqinn added a commit that referenced this pull request Mar 4, 2026
…ng asNullable in resolveRelation

### What changes were proposed in this pull request?

Followup to #54517. Simplifies NOT NULL constraint preservation per [review feedback](#54517 (comment)).

Instead of calling `dataSchema.asNullable` in `resolveRelation()` and then restoring nullability with recursive `restoreNullability`/`restoreDataTypeNullability` helpers in `CreateDataSourceTableCommand`, this PR:

1. Adds a `forceNullable` parameter to `DataSource.resolveRelation()` (default `true`, preserving existing behavior)
2. Passes `forceNullable = !enforceNotNull` from `CreateDataSourceTableCommand`, so `asNullable` is skipped only when the config is enabled
3. Removes `restoreNullability` and `restoreDataTypeNullability` helper methods entirely

**Data flow:**
- **Config OFF** (default): `forceNullable = true` → `asNullable` runs → same behavior as before SPARK-55716
- **Config ON**: `forceNullable = false` → `asNullable` skipped → catalog stores NOT NULL → `PreprocessTableInsertion` enforces at insert time
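The data flow above can be sketched as a small, non-Spark Python example. All function names and signatures here are illustrative stand-ins for `DataSource.resolveRelation()` and `CreateDataSourceTableCommand`, modeling a schema simply as a field-to-nullable mapping.

```python
# Hedged sketch of the followup's control flow: resolve_relation applies
# as_nullable unless the caller passes force_nullable=False, which the
# create-table path does only when the enforcement config is on.

def as_nullable(schema):
    # Recursively-nullable stand-in: mark every field nullable,
    # mimicking what dataSchema.asNullable does.
    return {name: True for name in schema}

def resolve_relation(schema, force_nullable=True):
    # Default True preserves the pre-SPARK-55716 behavior.
    return as_nullable(schema) if force_nullable else dict(schema)

def create_table_schema(schema, enforce_not_null):
    # forceNullable = !enforceNotNull: skip asNullable only when enforcing.
    return resolve_relation(schema, force_nullable=not enforce_not_null)

user_schema = {"id": False, "payload": True}  # field -> nullable

# Config OFF: asNullable runs, NOT NULL on id is lost in the catalog.
print(create_table_schema(user_schema, enforce_not_null=False))
# Config ON: asNullable skipped, catalog keeps id as NOT NULL.
print(create_table_schema(user_schema, enforce_not_null=True))
```

This is why the followup could delete the restoreNullability helpers entirely: gating the single asNullable call subsumes the recursive restore.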

### Why are the changes needed?

Addresses review feedback: "is it simpler to not do `dataSchema.asNullable` if the flag is on?"

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing SPARK-55716 tests (7 tests in InsertSuite) and ShowCreateTableSuite (30 tests) all pass.

### Was this patch authored or co-authored using generative AI tooling?

Yes, co-authored with GitHub Copilot.

Closes #54597 from yaooqinn/SPARK-55716.

Authored-by: Kent Yao <kentyao@microsoft.com>
Signed-off-by: Kent Yao <kentyao@microsoft.com>
