In #594, the Python CSV engine is selected for pandas reads of files larger than 50 MB. I am a long-time pandas user and contributor, and I am unaware of any benefit to using the Python engine for large files. Importantly, the PR (written by Claude) makes the following claims:
- Python integers: Have no practical 32-bit limits and can handle large datasets
- pandas C parser: Uses 32-bit integers internally, causing overflow with very large CSV files (5M+ rows)
- Trigger condition: Large result files (>50MB) processed with pandas default C engine
All of this appears to be AI hallucination. It is simply untrue that pandas cannot read CSV files with 5M+ rows. The pandas C parser does not have any int32 limits, and the default dtype for pandas integers is int64. It can read files larger than 4 GB with ease, given enough RAM. Alternatively, you can pass chunksize to pd.read_csv to process the file in smaller pieces.
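For reference, a minimal sketch of the chunksize approach with the default C engine (the file here is an in-memory stand-in for a large result CSV):

```python
import io
import pandas as pd

# Stand-in for a large result file; chunksize bounds peak memory
# while still using the fast C parser (the default engine).
csv_data = io.StringIO("a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(1_000_000)))

total_rows = 0
for chunk in pd.read_csv(csv_data, chunksize=100_000):
    total_rows += len(chunk)  # process each 100k-row chunk in turn

print(total_rows)  # 1000000
```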
Please consider reverting #594. I am a long-time PyAthena user, and this change makes the PandasCursor almost unusable for data with timestamp datatypes.
Edit: I was running this on pandas 1.5.3. Changing pandas versions does make this somewhat more bearable. Additionally, much of the performance loss comes from using new-style nullable Int64 dtypes; those, however, we presumably can't do without. I appreciate the lossless integers, and I'm sure others do as well. Benchmark results of the C vs. the Python parser on a 700 MB CSV result with 2 timestamp and 5 nullable integer columns are below:
| pandas_version | athena_defaults | athena_c | pandas_defaults |
|---|---|---|---|
| 1.5.3 | 241 | 218 | 37.9 |
| 2.3.3 | 68.3 | 53.3 | 22.1 |
- athena_defaults: Python engine (PyAthena defaults, using PyAthena's read_csv_kwargs in the pd.read_csv call)
- athena_c: C engine (same arguments, replacing engine='python' with engine='c'; with nullable int support)
- pandas_defaults: C engine (no keyword arguments, just read_csv(filename); no support for nullable ints)
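To illustrate what the athena_c configuration approximates, here is a minimal sketch of the C engine combined with nullable Int64 dtypes, so NULLs survive without float coercion. The column names and dtype mapping are illustrative only, not PyAthena's actual read_csv_kwargs:

```python
import io
import pandas as pd

# Illustrative CSV with a NULL integer and a timestamp column.
csv_data = io.StringIO(
    "id,value,ts\n"
    "1,,2024-01-01 00:00:00\n"
    "2,5,2024-01-02 00:00:00\n"
)

df = pd.read_csv(
    csv_data,
    engine="c",                               # fast C parser
    dtype={"id": "Int64", "value": "Int64"},  # nullable integer dtype
    parse_dates=["ts"],                       # parse timestamps
)

print(str(df["value"].dtype))   # Int64
print(df["value"].isna().sum())  # 1 (NULL preserved as <NA>, not NaN-as-float)
```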