[SPARK-55716][SQL] Support NOT NULL constraint enforcement for V1 file source table inserts #54517
yaooqinn wants to merge 2 commits into apache:master
Conversation
Force-pushed 30258f5 to 8aeab4f
cc @dongjoon-hyun @cloud-fan @gengliangwang, is the fix direction correct? Is this a genuine bug or a design choice? I haven't found any public discussions in this area.
Hi, @yaooqinn
For me, this PR seems to introduce a new feature rather than fix a bug. cc @aokolnychyi , @peter-toth , too.
@dongjoon-hyun
Force-pushed 8aeab4f to 28f9608
IIUC,
Anyway, let's wait for @gengliangwang and @cloud-fan 's comments. Maybe I lost track of the code changes.
@dongjoon-hyun FYI, NULL constraints for table output rely on it too.
Got it. So, from Apache Spark 3.5.0 via the following? Then, cc @aokolnychyi, too.
Force-pushed 28f9608 to 4b7a2fd
Force-pushed 4b7a2fd to 68ad307
This change is probably too breaking, and many users may already treat this bug as a feature. If users need strict NOT NULL enforcement, they should migrate to DS v2.
+1 with @cloud-fan. At least, the default behavior should not be changed in 4.2.
Yeah, I agree with you that this is too breaking. However, many of our clients and developers here have raised doubts about this strange behavior of Spark, and there are no clear docs for it.
Do you mean using built-in sources with the v2 code path, or suggesting users migrate to third-party formats?
I think using plain file source tables is not recommended now; people should switch to lakehouse table formats.
Since DSv2 migration is an independent topic from this DSv1 bug, I made a PR for DSv2 specifically, @cloud-fan , @gengliangwang , @yaooqinn .
Hi @dongjoon-hyun, built-in file sources with the DSv2 code path have the same issue, based on my unit tests.
@cloud-fan Instead of making NOT NULL enforcement the default behavior, would you mind if we give it a shot and provide an opt-in option for users who want built-in Parquet and Delta Lake to behave the same on NOT NULL constraints? For streaming sources, we have such an option for toggling asNullable.
Force-pushed 68ad307 to d9631e2
…ble inserts

V1 file-based DataSource writes (parquet/orc/json) silently accept null values into NOT NULL columns. This PR adds opt-in NOT NULL constraint enforcement by:

1. CreateDataSourceTableCommand: Preserve user-specified nullability by recursively merging nullability flags from the user schema into the resolved dataSource.schema.
2. PreprocessTableInsertion: Restore nullability flags from the catalog schema before null checks. This ensures AssertNotNull is injected when needed, gated by spark.sql.fileSource.insert.enforceNotNull.
3. Config: spark.sql.fileSource.insert.enforceNotNull (default false): when true, enforces NOT NULL constraints for file-based tables.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
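The recursive merge described in step 1 can be sketched roughly as follows. This is a simplified sketch using Spark's public `StructType` API, not the exact patch: fields are matched by exact name here, whereas the real code may use resolver-aware matching. The helper names follow the `restoreNullability`/`restoreDataTypeNullability` helpers mentioned in the follow-up.

```scala
import org.apache.spark.sql.types._

// Sketch: copy user-declared nullability onto the resolved schema, which
// DataSource.resolveRelation() made all-nullable via asNullable.
def restoreNullability(resolved: StructType, user: StructType): StructType = {
  val byName = user.fields.map(f => f.name -> f).toMap
  StructType(resolved.fields.map { rf =>
    byName.get(rf.name) match {
      case Some(uf) =>
        rf.copy(
          dataType = restoreDataTypeNullability(rf.dataType, uf.dataType),
          nullable = uf.nullable)
      case None => rf
    }
  })
}

// Recurse into nested structs, arrays, and maps so nested NOT NULL
// flags (containsNull / valueContainsNull) are restored as well.
def restoreDataTypeNullability(resolved: DataType, user: DataType): DataType =
  (resolved, user) match {
    case (r: StructType, u: StructType) => restoreNullability(r, u)
    case (r: ArrayType, u: ArrayType) =>
      ArrayType(restoreDataTypeNullability(r.elementType, u.elementType), u.containsNull)
    case (r: MapType, u: MapType) =>
      MapType(
        restoreDataTypeNullability(r.keyType, u.keyType),
        restoreDataTypeNullability(r.valueType, u.valueType),
        u.valueContainsNull)
    case (r, _) => r
  }
```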
Force-pushed d9631e2 to dd394fc
I'm fine with a new config to provide NOT NULL enforcement.
Thank you @cloud-fan, can you take a deeper review when you have some time? Happy weekend!
dongjoon-hyun left a comment
From my side, the latest approach looks safer to me. Thank you for clarifying the AS-IS situation and proposing a better solution for Spark users, @yaooqinn ! Although we may hit some corner cases later, I believe this leads us toward a more desirable state.
@dongjoon-hyun, thank you. Before merging this, I want to discuss the config scope: shall we make it public, since we are providing options for Spark users?
+1 for
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
Thank you @dongjoon-hyun and all, merged to master.
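With the config merged, the opt-in behavior can be exercised from a spark-shell session roughly as follows. This is a sketch: `spark` is the session provided by the shell, and the exact error raised by the failed insert depends on the Spark build.

```scala
// Opt in to NOT NULL enforcement for V1 file source tables.
spark.conf.set("spark.sql.fileSource.insert.enforceNotNull", "true")

// The NOT NULL flag on `id` is now preserved in the catalog schema.
spark.sql("CREATE TABLE t (id INT NOT NULL, name STRING) USING parquet")

// With the config on, an AssertNotNull check runs at insert time, so a
// null in the NOT NULL column fails the job instead of being silently
// written to the Parquet files.
spark.sql("INSERT INTO t VALUES (CAST(NULL AS INT), 'x')")
```

With the config left at its default (`false`), the same insert keeps the pre-existing behavior and writes the null.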
Context in SQLConf.scala:

    .createWithDefault(StoreAssignmentPolicy.ANSI)

  val FILE_SOURCE_INSERT_ENFORCE_NOT_NULL =
    buildConf("spark.sql.fileSource.insert.enforceNotNull")

Suggested change:

-    buildConf("spark.sql.fileSource.insert.enforceNotNull")
+    buildConf("spark.sql.files.insert.enforceNotNull")
Can we follow the existing config namespace? Currently, we have many spark.sql.files.* configs which are "effective only when using file-based sources"
spark.sql.files.maxPartitionBytes
spark.sql.files.openCostInBytes
spark.sql.files.minPartitionNum
spark.sql.files.maxPartitionNum
spark.sql.files.ignoreMissingFiles
...
This doesn't make much sense to me: these configurations work on pure files, but this one actually takes no effect in file-only mode.
but this one actually takes no effect on the file-only mode
It's true because a file-based table without a catalog cannot store constraint info, but is that worth a new namespace for the config? I cannot infer such info from the proposed namespace name fileSource. IMO, it still fits the spark.sql.files. scope, and we can mention such a limitation in the config docs, if necessary.
    case _ =>
      // Merge nullability from the user-specified schema into the resolved schema.
      // DataSource.resolveRelation() calls dataSchema.asNullable which strips NOT NULL
is it simpler to not do dataSchema.asNullable if the flag is on?
…ng asNullable in resolveRelation

### What changes were proposed in this pull request?

Followup to #54517. Simplifies NOT NULL constraint preservation per [review feedback](#54517 (comment)). Instead of calling `dataSchema.asNullable` in `resolveRelation()` and then restoring nullability with recursive `restoreNullability`/`restoreDataTypeNullability` helpers in `CreateDataSourceTableCommand`, this PR:

1. Adds a `forceNullable` parameter to `DataSource.resolveRelation()` (default `true`, preserving existing behavior)
2. Passes `forceNullable = !enforceNotNull` from `CreateDataSourceTableCommand`, so `asNullable` is skipped only when the config is enabled
3. Removes the `restoreNullability` and `restoreDataTypeNullability` helper methods entirely

**Data flow:**

- **Config OFF** (default): `forceNullable = true` → `asNullable` runs → same behavior as before SPARK-55716
- **Config ON**: `forceNullable = false` → `asNullable` skipped → catalog stores NOT NULL → `PreprocessTableInsertion` enforces at insert time

### Why are the changes needed?

Addresses review feedback: "is it simpler to not do `dataSchema.asNullable` if the flag is on?"

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing SPARK-55716 tests (7 tests in InsertSuite) and ShowCreateTableSuite (30 tests) all pass.

### Was this patch authored or co-authored using generative AI tooling?

Yes, co-authored with GitHub Copilot.

Closes #54597 from yaooqinn/SPARK-55716.

Authored-by: Kent Yao <kentyao@microsoft.com>
Signed-off-by: Kent Yao <kentyao@microsoft.com>
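The shape of the follow-up can be illustrated with a small self-contained model. `DataSourceSketch` and `allNullable` are hypothetical names for this sketch only: `asNullable` itself is Spark-internal, so the sketch inlines a top-level-only equivalent, and the real `resolveRelation()` returns a relation rather than a schema.

```scala
import org.apache.spark.sql.types._

// Simplified model of the follow-up change: resolveRelation() widens the
// schema to all-nullable only when the caller asks for it.
final case class DataSourceSketch(dataSchema: StructType) {
  // Top-level stand-in for Spark's internal dataSchema.asNullable.
  private def allNullable(s: StructType): StructType =
    StructType(s.fields.map(_.copy(nullable = true)))

  // forceNullable defaults to true, preserving pre-SPARK-55716 behavior.
  def resolveRelation(forceNullable: Boolean = true): StructType =
    if (forceNullable) allNullable(dataSchema) else dataSchema
}

// In CreateDataSourceTableCommand the call site would pass (sketch):
//   val schema = dataSource.resolveRelation(forceNullable = !enforceNotNull)
// so NOT NULL flags reach the catalog only when the config is enabled.
```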
What changes were proposed in this pull request?
V1 file-based DataSource writes (Parquet, ORC, JSON, etc.) silently accept null values into NOT NULL columns. This PR adds opt-in NOT NULL constraint enforcement controlled by `spark.sql.fileSource.insert.enforceNotNull`.

Changes:

- `CreateDataSourceTableCommand`: Preserves user-specified nullability by recursively merging nullability flags from the user schema into the resolved `dataSource.schema`. Previously it stored `dataSource.schema` directly, which is all-nullable due to `DataSource.resolveRelation()` calling `dataSchema.asNullable` (SPARK-13738).
- `PreprocessTableInsertion`: Restores nullability flags from the catalog schema before null checks, ensuring `AssertNotNull` is injected when needed. Gated behind `spark.sql.fileSource.insert.enforceNotNull`.
- New config: `spark.sql.fileSource.insert.enforceNotNull` (default `false`): when set to `true`, enables NOT NULL constraint enforcement for V1 file-based tables, consistent with the behavior for other data sources and V2 catalog tables.
- `SparkGetColumnsOperation`: Fixed `IS_NULLABLE` to respect `column.nullable` instead of always returning `"YES"`.

Why are the changes needed?

`DataSource.resolveRelation()` calls `dataSchema.asNullable` (added in SPARK-13738 for read safety), which strips all NOT NULL constraints recursively. `CreateDataSourceTableCommand` then stores this all-nullable schema in the catalog, permanently losing NOT NULL information. As a result, `PreprocessTableInsertion` never injects `AssertNotNull` for V1 file source tables.

Note: `InsertableRelation` (e.g., `SimpleInsertSource`) does NOT have this problem because it preserves the original schema (SPARK-24583).

Does this PR introduce any user-facing change?

No change in default behavior. Users can opt in to NOT NULL enforcement for V1 file source tables by setting `spark.sql.fileSource.insert.enforceNotNull` to `true`.

How was this patch tested?

- Tests in `InsertSuite` covering top-level, nested struct, array, and map null constraint enforcement.
- Tests in `SparkMetadataOperationSuite`.

Was this patch authored or co-authored using generative AI tooling?
Yes, co-authored with GitHub Copilot.