[opt](csv reader) optimize stream load CSV read performance #60920
Open
liaoxin01 wants to merge 1 commit into apache:master from
Cache nullable string column pointers per-batch to eliminate per-row assert_cast, inline the write path to bypass StringSerDe layer, and pre-reserve ColumnStr/NullMap capacity to reduce realloc overhead.
b5643ff to 9711a75
Pull request overview
Optimizes vectorized CSV stream-load parsing for nullable string columns by removing per-row SerDe overhead and reducing reallocations in the hot loop.
Changes:
- Adds per-batch caching of `ColumnNullable` nested string column and null-map pointers to avoid repeated `assert_cast` per row.
- Inlines the nullable-string CSV decode path (null detection + escape handling + `insert_data`/`push_back`) instead of calling through SerDe layers.
- Pre-reserves `offsets`, `chars`, and `null_map` capacity per batch to reduce `PODArray` growth overhead.
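As a rough illustration of the per-batch caching and pre-reserve ideas listed above, here is a minimal sketch. It does not use the real Doris classes: `IColumn`, `StringColumn`, `NullableColumn`, `CachedNullableString`, `build_cache`, and the `avg_len_hint` parameter are all hypothetical stand-ins, `std::vector` stands in for `PODArray`, and `dynamic_cast` stands in for `assert_cast`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <vector>

// Hypothetical stand-ins for the column classes; not the real Doris API.
struct IColumn {
    virtual ~IColumn() = default;
};

struct StringColumn : IColumn {
    std::vector<char> chars;        // concatenated bytes of all rows
    std::vector<uint32_t> offsets;  // end offset of each row inside chars
};

struct NullableColumn : IColumn {
    std::unique_ptr<IColumn> nested = std::make_unique<StringColumn>();
    std::vector<uint8_t> null_map;  // 1 = NULL, 0 = value present
};

// Pointers resolved once per batch so the row loop performs no casts.
struct CachedNullableString {
    StringColumn* str = nullptr;
    std::vector<uint8_t>* null_map = nullptr;
};

// Resolves the nested pointers for every column and reserves room for one batch.
// avg_len_hint is an assumed estimate of bytes per field, used only for reserving.
std::vector<CachedNullableString> build_cache(const std::vector<IColumn*>& columns,
                                              size_t batch_rows, size_t avg_len_hint) {
    std::vector<CachedNullableString> cache;
    cache.reserve(columns.size());
    for (IColumn* col : columns) {
        // The checked casts happen here, once per column per batch.
        auto* nullable = dynamic_cast<NullableColumn*>(col);
        if (nullable == nullptr) {
            throw std::invalid_argument("column is not nullable");
        }
        auto* str = dynamic_cast<StringColumn*>(nullable->nested.get());
        if (str == nullptr) {
            throw std::invalid_argument("nested column is not a string column");
        }
        // Reserve up front so appends in the row loop rarely reallocate.
        str->offsets.reserve(str->offsets.size() + batch_rows);
        str->chars.reserve(str->chars.size() + batch_rows * avg_len_hint);
        nullable->null_map.reserve(nullable->null_map.size() + batch_rows);
        cache.push_back({str, &nullable->null_map});
    }
    assert(cache.size() == columns.size());
    return cache;
}
```

With a cache like this, the row loop can index `cache[col_idx]` and append directly, which is the shape of the change described in the list above; the actual cache structure in the PR may differ.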
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| be/src/vec/exec/format/csv/csv_reader.h | Adds nullable string column cache structures/members and required column includes. |
| be/src/vec/exec/format/csv/csv_reader.cpp | Initializes/uses the cache per batch, inlines nullable-string deserialization, and adds per-batch reserves. |
run buildall
TPC-H: Total hot run time: 28649 ms
TPC-DS: Total hot run time: 184062 ms
Proposed changes
Optimize stream load CSV read performance for nullable string columns by eliminating per-row overhead from the SerDe abstraction layer.
Changes
- Cache nullable string column pointers per-batch: Pre-compute `assert_cast` results (ColumnStr and NullMap pointers) once per batch instead of once per row per column, stored in `NullableStringColumnCache`.
- Inline nullable string write path: Bypass `_deserialize_nullable_string` and `StringSerDe::deserialize_one_cell_from_csv` in the hot loop, directly performing null checks, escape handling, and `insert_data`/`push_back`.
- Pre-reserve column capacity: Reserve `offsets`, `chars`, and `null_map` capacity at batch start to reduce PODArray realloc overhead during the row loop.
Performance
Tested with ClickBench dataset stream load:
Flame graph analysis
Before optimization, the `_deserialize_nullable_string` path dominated with +96s self-time from:
- `assert_cast<ColumnNullable&>` (+65s)
- `StringSerDe::deserialize_one_cell_from_csv` intermediate layer (+54s)
After optimization, these costs are eliminated or amortized to per-batch.
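To make the shape of the remaining hot loop concrete, here is a hedged sketch of an inlined nullable-string write path. It is not the code from this PR: `StringBuffers`, `append_field`, and `append_batch` are hypothetical names, `std::vector` stands in for `PODArray`, and the sketch assumes that `\N` is the NULL literal and that a backslash escapes the byte that follows it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Hypothetical flat buffers standing in for a string column and its null map.
struct StringBuffers {
    std::vector<char> chars;        // concatenated bytes of all rows
    std::vector<uint32_t> offsets;  // end offset of each row inside chars
    std::vector<uint8_t> null_map;  // 1 = NULL, 0 = value present
};

// Appends one CSV field with no virtual calls or casts.
// Assumption: "\N" marks NULL and a backslash escapes the next byte.
inline void append_field(StringBuffers& out, std::string_view field,
                         std::string_view null_literal = "\\N") {
    if (field == null_literal) {
        // NULL row: an empty string slot plus a set null flag.
        out.offsets.push_back(static_cast<uint32_t>(out.chars.size()));
        out.null_map.push_back(1);
        return;
    }
    for (size_t i = 0; i < field.size(); ++i) {
        // Drop the escaping backslash and keep the byte that follows it.
        if (field[i] == '\\' && i + 1 < field.size()) {
            ++i;
        }
        out.chars.push_back(field[i]);
    }
    out.offsets.push_back(static_cast<uint32_t>(out.chars.size()));
    out.null_map.push_back(0);
}

// Reserves once for the whole batch, then runs the tight per-row loop.
inline void append_batch(StringBuffers& out, const std::vector<std::string_view>& fields) {
    size_t total_bytes = 0;
    for (std::string_view f : fields) {
        total_bytes += f.size();
    }
    out.offsets.reserve(out.offsets.size() + fields.size());
    out.null_map.reserve(out.null_map.size() + fields.size());
    out.chars.reserve(out.chars.size() + total_bytes);
    for (std::string_view f : fields) {
        append_field(out, f);
    }
    // Every row must have exactly one offset and one null flag.
    assert(out.offsets.size() == out.null_map.size());
}
```

For example, calling `append_batch` with the fields `abc`, `\N`, `a\,b`, and an empty string would leave four offsets and four null flags in `out`, with only the second row flagged as NULL. The per-row work is then a comparison, a byte copy, and two appends, which is the kind of cost profile the flame graph comparison above describes.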