[SPARK-55716][SQL] Support NOT NULL constraint enforcement for V1 file source table inserts #54517
yaooqinn wants to merge 2 commits into apache:master
Conversation
Force-pushed 30258f5 to 8aeab4f
cc @dongjoon-hyun @cloud-fan @gengliangwang, is the fix direction correct? Is this a genuine bug or a design choice? I haven't found any public discussions in this area.
Hi, @yaooqinn
For me, this PR seems to introduce a new feature rather than fix a bug. cc @aokolnychyi , @peter-toth , too.
@dongjoon-hyun
Force-pushed 8aeab4f to 28f9608
IIUC,
Anyway, let's wait for @gengliangwang and @cloud-fan 's comments. Maybe I lost track of the code changes.
@dongjoon-hyun FYI, NULL constraints for table output rely on it too.
Got it. So, from Apache Spark 3.5.0 via the following? Then, cc @aokolnychyi, too.
Force-pushed 28f9608 to 4b7a2fd
Force-pushed 4b7a2fd to 68ad307
This change is probably too breaking, and many users may already treat this bug as a feature. If users need strict NOT NULL enforcement, they should migrate to DS v2.
+1 with @cloud-fan. At least, the default behavior should not be changed in 4.2.
Yeah, I agree with you that this is too breaking. However, many of our clients and developers here have raised doubts about this strange behavior of Spark, and there are no clear docs for it.
Do you mean using built-in sources with the v2 code path, or suggesting users migrate to third-party formats?
I think using plain file source tables is not recommended now; people should switch to lakehouse table formats.
Since DSv2 migration is an independent topic from this DSv1 bug, I made a PR for DSv2 specifically, @cloud-fan , @gengliangwang , @yaooqinn .
Hi @dongjoon-hyun, built-in file sources with the DSv2 code path have the same issue, based on my unit tests.
@cloud-fan Instead of making NOT NULL enforcement the default behavior, would you mind if we give it a shot and provide an opt-in option for users who want built-in Parquet and Delta Lake to behave the same on NOT NULL constraints? For streaming sources, we have such an option for toggling asNullable.
Force-pushed 68ad307 to d9631e2
…ble inserts

V1 file-based DataSource writes (parquet/orc/json) silently accept null values into NOT NULL columns. This PR adds opt-in NOT NULL constraint enforcement by:

1. CreateDataSourceTableCommand: Preserve user-specified nullability by recursively merging nullability flags from the user schema into the resolved dataSource.schema.
2. PreprocessTableInsertion: Restore nullability flags from the catalog schema before null checks. This ensures AssertNotNull is injected when needed, gated by spark.sql.fileSource.insert.enforceNotNull.
3. Config: spark.sql.fileSource.insert.enforceNotNull (default false): when true, enforces NOT NULL constraints for file-based tables.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
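The recursive merge described in step 1 can be sketched roughly as follows. This is a simplified sketch using Spark's public `StructType` API, not the exact patch: fields are matched by exact name here, whereas the real code may use resolver-aware matching. The helper names follow the `restoreNullability`/`restoreDataTypeNullability` helpers mentioned in the follow-up.

```scala
import org.apache.spark.sql.types._

// Sketch: copy user-declared nullability onto the resolved schema, which
// DataSource.resolveRelation() made all-nullable via asNullable.
def restoreNullability(resolved: StructType, user: StructType): StructType = {
  val byName = user.fields.map(f => f.name -> f).toMap
  StructType(resolved.fields.map { rf =>
    byName.get(rf.name) match {
      case Some(uf) =>
        rf.copy(
          dataType = restoreDataTypeNullability(rf.dataType, uf.dataType),
          nullable = uf.nullable)
      case None => rf
    }
  })
}

// Recurse into nested structs, arrays, and maps so nested NOT NULL
// flags (containsNull / valueContainsNull) are restored as well.
def restoreDataTypeNullability(resolved: DataType, user: DataType): DataType =
  (resolved, user) match {
    case (r: StructType, u: StructType) => restoreNullability(r, u)
    case (r: ArrayType, u: ArrayType) =>
      ArrayType(restoreDataTypeNullability(r.elementType, u.elementType), u.containsNull)
    case (r: MapType, u: MapType) =>
      MapType(
        restoreDataTypeNullability(r.keyType, u.keyType),
        restoreDataTypeNullability(r.valueType, u.valueType),
        u.valueContainsNull)
    case (r, _) => r
  }
```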
Force-pushed d9631e2 to dd394fc
I'm fine with a new config to provide NOT NULL enforcement.
Thank you @cloud-fan, can you take a deeper review when you have some time? Happy weekend!
dongjoon-hyun left a comment
From my side, the latest approach looks safer to me. Thank you for clarifying the AS-IS situation and proposing a better solution for Spark users, @yaooqinn ! Although we may hit some corner cases later, I believe this leads us toward a more desirable state.
@dongjoon-hyun, thank you. Before merging this, I want to discuss the config scope: shall we make it public, since we are providing options for Spark users?
+1 for
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
Thank you @dongjoon-hyun and all, merged to master.
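With the config merged, the opt-in behavior can be exercised from a spark-shell session roughly as follows. This is a sketch: `spark` is the session provided by the shell, and the exact error raised by the failed insert depends on the Spark build.

```scala
// Opt in to NOT NULL enforcement for V1 file source tables.
spark.conf.set("spark.sql.fileSource.insert.enforceNotNull", "true")

// The NOT NULL flag on `id` is now preserved in the catalog schema.
spark.sql("CREATE TABLE t (id INT NOT NULL, name STRING) USING parquet")

// With the config on, an AssertNotNull check runs at insert time, so a
// null in the NOT NULL column fails the job instead of being silently
// written to the Parquet files.
spark.sql("INSERT INTO t VALUES (CAST(NULL AS INT), 'x')")
```

With the config left at its default (`false`), the same insert keeps the pre-existing behavior and writes the null.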
Context in SQLConf.scala:

    .createWithDefault(StoreAssignmentPolicy.ANSI)

  val FILE_SOURCE_INSERT_ENFORCE_NOT_NULL =
    buildConf("spark.sql.fileSource.insert.enforceNotNull")

Suggested change:

-    buildConf("spark.sql.fileSource.insert.enforceNotNull")
+    buildConf("spark.sql.files.insert.enforceNotNull")
Can we follow the existing config namespace? Currently, we have many spark.sql.files.* configs which are "effective only when using file-based sources"
spark.sql.files.maxPartitionBytes
spark.sql.files.openCostInBytes
spark.sql.files.minPartitionNum
spark.sql.files.maxPartitionNum
spark.sql.files.ignoreMissingFiles
...
This doesn't make much sense to me: these configurations work on pure files, but this one actually takes no effect in file-only mode.
but this one actually takes no effect on the file-only mode
It's true because a file-based table without a catalog cannot store constraint info, but is that worth a new namespace for the config? I cannot infer such info from the proposed namespace name fileSource. IMO, it still fits the spark.sql.files. scope, and we can mention such a limitation in the config docs, if necessary.
    case _ =>
      // Merge nullability from the user-specified schema into the resolved schema.
      // DataSource.resolveRelation() calls dataSchema.asNullable which strips NOT NULL
is it simpler to not do dataSchema.asNullable if the flag is on?
…ng asNullable in resolveRelation

### What changes were proposed in this pull request?

Followup to #54517. Simplifies NOT NULL constraint preservation per [review feedback](#54517 (comment)). Instead of calling `dataSchema.asNullable` in `resolveRelation()` and then restoring nullability with recursive `restoreNullability`/`restoreDataTypeNullability` helpers in `CreateDataSourceTableCommand`, this PR:

1. Adds a `forceNullable` parameter to `DataSource.resolveRelation()` (default `true`, preserving existing behavior)
2. Passes `forceNullable = !enforceNotNull` from `CreateDataSourceTableCommand`, so `asNullable` is skipped only when the config is enabled
3. Removes the `restoreNullability` and `restoreDataTypeNullability` helper methods entirely

**Data flow:**

- **Config OFF** (default): `forceNullable = true` → `asNullable` runs → same behavior as before SPARK-55716
- **Config ON**: `forceNullable = false` → `asNullable` skipped → catalog stores NOT NULL → `PreprocessTableInsertion` enforces at insert time

### Why are the changes needed?

Addresses review feedback: "is it simpler to not do `dataSchema.asNullable` if the flag is on?"

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing SPARK-55716 tests (7 tests in InsertSuite) and ShowCreateTableSuite (30 tests) all pass.

### Was this patch authored or co-authored using generative AI tooling?

Yes, co-authored with GitHub Copilot.

Closes #54597 from yaooqinn/SPARK-55716.

Authored-by: Kent Yao <kentyao@microsoft.com>
Signed-off-by: Kent Yao <kentyao@microsoft.com>
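The shape of the follow-up can be illustrated with a small self-contained model. `DataSourceSketch` and `allNullable` are hypothetical names for this sketch only: `asNullable` itself is Spark-internal, so the sketch inlines a top-level-only equivalent, and the real `resolveRelation()` returns a relation rather than a schema.

```scala
import org.apache.spark.sql.types._

// Simplified model of the follow-up change: resolveRelation() widens the
// schema to all-nullable only when the caller asks for it.
final case class DataSourceSketch(dataSchema: StructType) {
  // Top-level stand-in for Spark's internal dataSchema.asNullable.
  private def allNullable(s: StructType): StructType =
    StructType(s.fields.map(_.copy(nullable = true)))

  // forceNullable defaults to true, preserving pre-SPARK-55716 behavior.
  def resolveRelation(forceNullable: Boolean = true): StructType =
    if (forceNullable) allNullable(dataSchema) else dataSchema
}

// In CreateDataSourceTableCommand the call site would pass (sketch):
//   val schema = dataSource.resolveRelation(forceNullable = !enforceNotNull)
// so NOT NULL flags reach the catalog only when the config is enabled.
```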
What changes were proposed in this pull request?
V1 file-based DataSource writes (Parquet, ORC, JSON, etc.) silently accept null values into NOT NULL columns. This PR adds opt-in NOT NULL constraint enforcement controlled by `spark.sql.fileSource.insert.enforceNotNull`.

Changes:

- `CreateDataSourceTableCommand`: Preserves user-specified nullability by recursively merging nullability flags from the user schema into the resolved `dataSource.schema`. Previously it stored `dataSource.schema` directly, which is all-nullable due to `DataSource.resolveRelation()` calling `dataSchema.asNullable` (SPARK-13738).
- `PreprocessTableInsertion`: Restores nullability flags from the catalog schema before null checks, ensuring `AssertNotNull` is injected when needed. Gated behind `spark.sql.fileSource.insert.enforceNotNull`.
- New config: `spark.sql.fileSource.insert.enforceNotNull` (default `false`): when set to `true`, enables NOT NULL constraint enforcement for V1 file-based tables, consistent with the behavior for other data sources and V2 catalog tables.
- `SparkGetColumnsOperation`: Fixed `IS_NULLABLE` to respect `column.nullable` instead of always returning `"YES"`.

Why are the changes needed?

`DataSource.resolveRelation()` calls `dataSchema.asNullable` (added in SPARK-13738 for read safety), which strips all NOT NULL constraints recursively. `CreateDataSourceTableCommand` then stores this all-nullable schema in the catalog, permanently losing NOT NULL information. As a result, `PreprocessTableInsertion` never injects `AssertNotNull` for V1 file source tables.

Note: `InsertableRelation` (e.g., `SimpleInsertSource`) does NOT have this problem because it preserves the original schema (SPARK-24583).

Does this PR introduce any user-facing change?

No change in default behavior. Users can opt in to NOT NULL enforcement for V1 file source tables by setting `spark.sql.fileSource.insert.enforceNotNull` to `true`.

How was this patch tested?

- Tests in `InsertSuite` covering top-level, nested struct, array, and map null constraint enforcement.
- Tests in `SparkMetadataOperationSuite`.

Was this patch authored or co-authored using generative AI tooling?
Yes, co-authored with GitHub Copilot.