Slow parsing of large CSV files #696

@Liam3851

In #594, the Python CSV engine is selected for pandas reads of files larger than 50 MB. I am a long-time pandas user and contributor, and I am unaware of any benefit to using the Python engine for large files. Importantly, the PR (written by Claude) makes the following claims:

- Python integers: Have no practical 32-bit limits and can handle large datasets
- pandas C parser: Uses 32-bit integers internally, causing overflow with very large CSV files (5M+ rows)
- Trigger condition: Large result files (>50MB) processed with pandas default C engine

All of this appears to be an AI hallucination. It is simply untrue that pandas cannot read CSV files of 5M+ rows. The pandas C parser has no int32 limit, and the default dtype for pandas integers is int64. It can easily read files larger than 4 GB if you have the RAM. Alternatively, you can pass chunksize to pandas read_csv to keep each chunk small.
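For illustration, here is a minimal sketch of the chunked approach with the C engine. The helper name `summarize_in_chunks`, the column names, and the in-memory CSV are hypothetical stand-ins for a real result file; the values are deliberately larger than 2**31 to show that they come back as int64.

```python
import io

import pandas as pd


def summarize_in_chunks(source, chunksize=100_000):
    """Stream a CSV with the C engine, holding only one chunk in memory at a time."""
    total_rows = 0
    max_value = None
    dtypes = set()
    for chunk in pd.read_csv(source, engine="c", chunksize=chunksize):
        total_rows += len(chunk)
        chunk_max = int(chunk["value"].max())
        max_value = chunk_max if max_value is None else max(max_value, chunk_max)
        dtypes.add(str(chunk["value"].dtype))
    return total_rows, max_value, dtypes


# Hypothetical stand-in for a large result file: every value exceeds the int32 range.
csv_text = "id,value\n" + "".join(f"{i},{i + 2**40}\n" for i in range(1000))
row_count, largest, value_dtypes = summarize_in_chunks(io.StringIO(csv_text), chunksize=256)
```

In real use, `source` would be a file path and `chunksize` would be tuned to the available memory.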

Please consider reverting #594. I am a long-time pyathena user, and this change makes the PandasCursor almost unusable for data with timestamp datatypes.

Edit: I was running this on pandas 1.5.3. Using a different pandas version does make this somewhat more bearable. Additionally, much of the performance loss comes from using the new-style nullable Int64 dtypes, which we presumably can't do without: I appreciate the lossless integers, and I'm sure others do as well. Benchmark results for the C parser vs. the Python parser on a 700 MB CSV result with 2 timestamp and 5 nullable integer columns are below:

| pandas_version | athena_defaults | athena_c | pandas_defaults |
| --- | --- | --- | --- |
| 1.5.3 | 241 | 218 | 37.9 |
| 2.3.3 | 68.3 | 53.3 | 22.1 |

- athena_defaults: Python engine (PyAthena defaults, using the PyAthena read_csv_kwargs in the pd.read_csv call)
- athena_c: C engine (same arguments, replacing engine='python' with engine='c', with nullable int support)
- pandas_defaults: C engine (no keyword arguments, just read_csv(filename), with no support for nullable ints)
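A comparison along these lines can be sketched as follows. This is not the script that produced the numbers above: the helper `timed_read`, the column names `ts` and `n`, and the tiny in-memory CSV are assumptions for illustration, and the keyword arguments only approximate a nullable-integer plus timestamp configuration rather than reproducing the actual PyAthena read_csv_kwargs.

```python
import io
import time

import pandas as pd


def timed_read(csv_text, engine):
    """Parse the same CSV with the given engine and return (frame, elapsed seconds)."""
    start = time.perf_counter()
    frame = pd.read_csv(
        io.StringIO(csv_text),
        engine=engine,
        dtype={"n": "Int64"},  # nullable integer column (assumed name)
        parse_dates=["ts"],    # timestamp column (assumed name)
    )
    return frame, time.perf_counter() - start


# Hypothetical sample data: one timestamp column, one integer column with gaps.
rows = ["ts,n"]
for i in range(2000):
    n = "" if i % 10 == 0 else str(i)
    rows.append(f"2024-01-01 00:00:{i % 60:02d},{n}")
csv_text = "\n".join(rows) + "\n"

df_c, seconds_c = timed_read(csv_text, "c")
df_py, seconds_py = timed_read(csv_text, "python")
print(f"c engine: {seconds_c:.4f}s, python engine: {seconds_py:.4f}s")
```

Both engines should produce the same frame here, nullable integers included, so the choice of engine affects only parse time in this sketch. Meaningful timings need a file of realistic size.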
