bug: datafusion-spark string literals don't interpret escape sequences like Spark #21516

@andygrove

Describe the bug

DataFusion-spark does not interpret \t and \n in SQL string literals: each is kept as two literal characters (a backslash followed by a letter), while Apache Spark interprets them as escape sequences (tab and newline). This affects any function that receives string literal arguments containing these sequences.

To Reproduce

PySpark (Spark behavior):

SELECT soundex('\thello');  -- returns tab + "hello" (soundex passes through non-alpha input)
SELECT soundex('\nhello');  -- returns newline + "hello"
SELECT length('\thello');   -- 6 (tab is one character)
SELECT length('\nhello');   -- 6 (newline is one character)

DataFusion-spark (current behavior):

SELECT soundex('\thello');  -- returns literal "\thello" (backslash-t-hello)
SELECT soundex('\nhello');  -- returns literal "\nhello" (backslash-n-hello)
SELECT length('\thello');   -- 7 (\t is two characters: backslash and t)
SELECT length('\nhello');   -- 7 (\n is two characters: backslash and n)

Expected behavior

DataFusion-spark should interpret \t, \n, and other escape sequences in string literals the same way Spark does, for Spark compatibility.
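
To make the expected behavior concrete, here is a minimal Python sketch of escape interpretation for the body of a string literal. The names `unescape_literal` and `SIMPLE_ESCAPES` are hypothetical and exist only for illustration; this is not DataFusion or Spark code. It handles only a few single-character escapes and, as a simplifying assumption, drops the backslash before any other character, which may not match Spark in every case.

```python
# Hypothetical helper for illustration only; not part of DataFusion or Spark.
# Covers a small subset of escapes (\t, \n, \r, \\, \').
SIMPLE_ESCAPES = {"t": "\t", "n": "\n", "r": "\r", "\\": "\\", "'": "'"}


def unescape_literal(body: str) -> str:
    """Interpret backslash escapes in the body of a SQL string literal."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body):
            nxt = body[i + 1]
            # Assumption: for escapes not in the table, drop the backslash
            # and keep the following character unchanged.
            out.append(SIMPLE_ESCAPES.get(nxt, nxt))
            i += 2
        else:
            # Ordinary character (or a lone trailing backslash): keep as-is.
            out.append(ch)
            i += 1
    return "".join(out)


raw = r"\thello"                   # what the SQL text contains: backslash, t, h, e, l, l, o
print(len(raw))                    # 7, the length DataFusion-spark currently reports
print(len(unescape_literal(raw)))  # 6, the length Spark reports
```

The raw string models the current DataFusion-spark interpretation, and the unescaped result models the Spark interpretation described above.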

Additional context

This is a string literal parsing issue, not specific to soundex. It affects every function that is called with a string literal containing an escape sequence. The .slt tests at string/soundex.slt lines 83 and 193 have expected values that match DataFusion's literal interpretation rather than Spark's escape interpretation.

This was discovered by running a PySpark validation script against the .slt test files (see #17045, #21508).
