Commit 946d70a
feat: add dictionary_columns to scan API for memory-efficient string reads
Exposes `dictionary_columns: tuple[str, ...] | None = None` on `Table.scan()`
and `DataScan`, threading it through to PyArrow's `ParquetFileFormat` so that
named columns are read as `DictionaryArray` instead of plain `large_utf8`.
This can substantially reduce memory usage for JSON/string columns whose
values repeat often, because each distinct value is stored once and rows hold
only integer indices (issue #3168). It also addresses the general scan
parameter extensibility request (issue #3170).
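To illustrate why a dictionary-encoded column can be smaller when values repeat, here is a minimal pure-Python sketch of the encoding idea. It is not PyArrow's implementation; the helper names `dictionary_encode` and `dictionary_decode` and the sample column are hypothetical and exist only for this illustration.

```python
def dictionary_encode(values):
    """Return (dictionary, indices) such that dictionary[indices[i]] == values[i]."""
    dictionary = []   # each distinct value, stored once, in first-seen order
    positions = {}    # value -> its position in `dictionary`
    indices = []      # one small integer per row
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the plain column from its dictionary-encoded form."""
    return [dictionary[i] for i in indices]


# Hypothetical column: 3000 rows, but only two distinct JSON strings.
column = ['{"kind": "click"}', '{"kind": "view"}', '{"kind": "click"}'] * 1000
dictionary, indices = dictionary_encode(column)
```

In this sketch the 3000-row column collapses to a two-entry dictionary plus 3000 integers. The saving shrinks as the number of distinct values approaches the number of rows.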
Key implementation details:
- The option is Parquet-only: `dictionary_columns` is not passed when reading
  ORC files
- `ArrowScan.to_table()` rebuilds the Arrow schema with dict types before the
  empty-table fast-path so the schema is consistent regardless of row count
- `DataScan.to_arrow_batch_reader()` rebuilds `target_schema` with dict types
  to prevent `.cast()` from silently decoding DictionaryArray back to plain
  string
- `DataScan.__init__` declares and stores the param so `TableScan.update()`
  (which uses `inspect.signature`) preserves it across scan copies
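The last point can be sketched as follows. This is a simplified stand-in, not the project's actual classes: `SketchScan` and its parameters are hypothetical, and the real `update()` may differ in detail. The sketch shows why a parameter must be both declared in `__init__` and stored under the same attribute name for a signature-driven copy to carry it along.

```python
import inspect


class SketchScan:
    """Hypothetical scan object whose copies are built from its __init__ signature."""

    def __init__(self, row_filter="true", limit=None, dictionary_columns=None):
        # Each parameter is stored under its own name so update() can read it back.
        self.row_filter = row_filter
        self.limit = limit
        self.dictionary_columns = dictionary_columns

    def update(self, **overrides):
        # Collect the current value of every declared __init__ parameter,
        # then apply the overrides and build a fresh instance.
        params = inspect.signature(type(self).__init__).parameters
        current = {name: getattr(self, name) for name in params if name != "self"}
        return type(self)(**{**current, **overrides})


scan = SketchScan(dictionary_columns=("payload",))
limited = scan.update(limit=10)  # dictionary_columns carries over to the copy
```

A parameter that is missing from the signature would never be collected into `current`, so the copy would silently fall back to its default.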
Fixes #3168, closes #3170
1 parent 1a54e9c commit 946d70a
4 files changed
Lines changed: 286 additions & 3 deletions