Skip to content

Enhance URI scheme validation for Windows paths#3161

Open
AlgoDeveloper400 wants to merge 2 commits intoapache:mainfrom
AlgoDeveloper400:main
Open

Enhance URI scheme validation for Windows paths#3161
AlgoDeveloper400 wants to merge 2 commits intoapache:mainfrom
AlgoDeveloper400:main

Conversation

@AlgoDeveloper400
Copy link
Copy Markdown

fix: handle Windows drive letters in parse_location

Rationale for this change

When a Windows user passes a local file path like C:\Users\file.avro to PyArrowFileIO,
Python's urlparse incorrectly treats the Windows drive letter C as a URL scheme (like s3 or http).

This caused PyIceberg to crash with:

Unrecognized filesystem type in URI: 'c'

The Fix

Before ❌ (Original Code):

uri = urlparse(location)

if not uri.scheme:
    default_scheme = properties.get("DEFAULT_SCHEME", "file")
    default_netloc = properties.get("DEFAULT_NETLOC", "")
    return default_scheme, default_netloc, os.path.abspath(location)

After ✅ (Fixed Code):

uri = urlparse(location)

if not uri.scheme or (len(uri.scheme) == 1 and uri.scheme.isalpha()):
    # len == 1 and isalpha() catches Windows drive letters like C:\ D:\
    default_scheme = properties.get("DEFAULT_SCHEME", "file")
    default_netloc = properties.get("DEFAULT_NETLOC", "")
    return default_scheme, default_netloc, os.path.abspath(location)

The only change:

# Before ❌
if not uri.scheme:

# After ✅
if not uri.scheme or (len(uri.scheme) == 1 and uri.scheme.isalpha()):

The added condition checks if the scheme is a single alphabetic character (e.g. C, D, E)
and treats it as a Windows drive letter instead of a URL scheme.


Example

from pyiceberg.io.pyarrow import PyArrowFileIO

io = PyArrowFileIO()

# Before fix - crashed with: Unrecognized filesystem type in URI: 'c'
# After fix - works correctly
scheme, netloc, path = io.parse_location("C:\\Users\\test\\file.avro")

print(scheme)  # 'file'
print(netloc)  # ''
print(path)    # 'C:\\Users\\test\\file.avro'

Impact

This fix affects all local file operations on Windows including:

  • Reading local Iceberg tables
  • Writing local Iceberg tables
  • Any local Avro/Parquet file operations

Are these changes tested?

Yes - existing tests now pass on Windows.

tests/test_avro_sanitization.py

python -m pytest tests/test_avro_sanitization.py -v
tests/test_avro_sanitization.py::test_comprehensive_field_name_sanitization  PASSED
tests/test_avro_sanitization.py::test_comprehensive_avro_compatibility        PASSED
tests/test_avro_sanitization.py::test_emoji_field_name_sanitization           PASSED

tests/io/test_pyarrow.py

python -m pytest tests/io/test_pyarrow.py::test_pyarrow_infer_local_fs_from_path -v
tests/io/test_pyarrow.py::test_pyarrow_infer_local_fs_from_path               PASSED

Are there any user-facing changes?

Yes - fixes local file access on Windows for all PyIceberg users.

@Fokko
Copy link
Copy Markdown
Contributor

Fokko commented Mar 27, 2026

@AlgoDeveloper400 can you fix the linter? I think it is reasonable to add this, but we might break it easily since there are no tests to enforce the behavior

@AlgoDeveloper400
Copy link
Copy Markdown
Author

Okay I will fix the linter,

@AlgoDeveloper400
Copy link
Copy Markdown
Author

@Fokko All the tests are successful and the linter issue has been resolved.

super().__init__(properties=properties)

@staticmethod
def parse_location(location: str, properties: Properties = EMPTY_DICT) -> tuple[str, str, str]:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could still add against parse_location here for the drive something that can ensure we are treating drive letters as local paths and not uri schemes.

How about something like this with the windows paths as parameters

def test_parse_location_windows_drive_letter() -> None:
  loc = "C:\\Users\\test\\file.avro"
  scheme, netloc, _ = PyArrowFileIO.parse_location(loc)
  assert scheme == "file"
  assert netloc == ""

# len == 1 and alpha catches Windows drive letters like C:\ D:\
default_scheme = properties.get("DEFAULT_SCHEME", "file")
default_netloc = properties.get("DEFAULT_NETLOC", "")
return default_scheme, default_netloc, os.path.abspath(location)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a few other places use the urlparse and will be affected like fs spec and schema resolution.

def _infer_file_io_from_scheme(path: str, properties: Properties) -> FileIO | None:
parsed_url = urlparse(path)
if parsed_url.scheme:
if file_ios := SCHEMA_TO_FILE_IO.get(parsed_url.scheme):
for file_io_path in file_ios:
if file_io := _import_file_io(file_io_path, properties):
return file_io
else:
warnings.warn(f"No preferred file implementation for scheme: {parsed_url.scheme}", stacklevel=2)
return None

Maybe we could create a helper in the init class that all modules could use.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants