In #594, the Python CSV engine is selected for pandas reads of files larger than 50 MB. I am a long-time pandas user and contributor, and I am unaware of any benefit to using the Python engine for large files. Importantly, the PR (written by Claude) makes the following claims:
- Python integers: Have no practical 32-bit limits and can handle large datasets
- pandas C parser: Uses 32-bit integers internally, causing overflow with very large CSV files (5M+ rows)
- Trigger condition: Large result files (>50MB) processed with pandas default C engine
All of this appears to be AI hallucination. It is simply untrue that pandas cannot read CSV files with 5M+ rows. The pandas C parser does not have any int32 limits, and the default dtype for pandas integers is int64. It can read files larger than 4 GB with ease, given enough RAM. Alternatively, you can pass chunksize to pd.read_csv to process the file in smaller pieces.
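For reference, a minimal sketch of the chunksize approach with the default C engine (the file here is an in-memory stand-in for a large result CSV):

```python
import io
import pandas as pd

# Stand-in for a large result file; chunksize bounds peak memory
# while still using the fast C parser (the default engine).
csv_data = io.StringIO("a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(1_000_000)))

total_rows = 0
for chunk in pd.read_csv(csv_data, chunksize=100_000):
    total_rows += len(chunk)  # process each 100k-row chunk in turn

print(total_rows)  # 1000000
```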
Please consider reverting #594. I am a long-time PyAthena user, and this change makes the PandasCursor almost unusable for data with timestamp datatypes.
Edit: I was running this on pandas 1.5.3. Changing pandas versions does make this somewhat more bearable. Additionally, much of the performance loss comes from using new-style nullable Int64 dtypes; those, however, we presumably can't do without. I appreciate the lossless integers, and I'm sure others do as well. Benchmark results of the C vs. the Python parser on a 700 MB CSV result with 2 timestamp and 5 nullable integer columns are below:
| pandas_version | athena_defaults | athena_c | pandas_defaults |
|---|---|---|---|
| 1.5.3 | 241 | 218 | 37.9 |
| 2.3.3 | 68.3 | 53.3 | 22.1 |
- athena_defaults: Python engine (PyAthena defaults, using PyAthena's read_csv_kwargs in the pd.read_csv call)
- athena_c: C engine (same arguments, replacing engine='python' with engine='c'; with nullable int support)
- pandas_defaults: C engine (no keyword arguments, just read_csv(filename); no support for nullable ints)
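To illustrate what the athena_c configuration approximates, here is a minimal sketch of the C engine combined with nullable Int64 dtypes, so NULLs survive without float coercion. The column names and dtype mapping are illustrative only, not PyAthena's actual read_csv_kwargs:

```python
import io
import pandas as pd

# Illustrative CSV with a NULL integer and a timestamp column.
csv_data = io.StringIO(
    "id,value,ts\n"
    "1,,2024-01-01 00:00:00\n"
    "2,5,2024-01-02 00:00:00\n"
)

df = pd.read_csv(
    csv_data,
    engine="c",                               # fast C parser
    dtype={"id": "Int64", "value": "Int64"},  # nullable integer dtype
    parse_dates=["ts"],                       # parse timestamps
)

print(str(df["value"].dtype))   # Int64
print(df["value"].isna().sum())  # 1 (NULL preserved as <NA>, not NaN-as-float)
```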