perf: add SIMD-accelerated UTF-8 validation to core arrow crates #9495
lyang24 wants to merge 1 commit into apache:main
/// on the happy path for improved throughput. Falls back to `std::str::from_utf8`
/// on the error path to provide a detailed [`std::str::Utf8Error`].
#[inline(always)]
pub fn check_utf8(val: &[u8]) -> Result<&str, std::str::Utf8Error> {
Can we unify this with the existing UTF-8 check?
Hi, do you mean unifying it with the one in the parquet folder?
run benchmark json_reader arrow_reader
🤖 Arrow criterion benchmark running (GKE)
🤖 Arrow criterion benchmark completed (GKE)
Some work is needed to unify the UTF-8 checks with parquet. Moving this one to draft for now.
Which issue does this PR close?
validate_string_view and other utf8 validation #7014.
Rationale for this change
Add simdutf8 for fast UTF-8 validation in arrow-data, arrow-array, arrow-row, and arrow-csv. A shared check_utf8() utility in arrow-data uses SIMD on the happy path and falls back to std::str::from_utf8 on error for detailed Utf8Error. The feature is default-enabled in the arrow umbrella crate.
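The fast-path-then-fallback shape described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual body: only the `check_utf8` signature comes from the diff, and `fast_path` is a hypothetical helper. With the simdutf8 crate, the fast path would presumably be `simdutf8::basic::from_utf8(val).ok()`, whose error type carries no position information; here it delegates to the standard library so the sketch builds without extra dependencies.

```rust
use std::str::Utf8Error;

/// Hypothetical stand-in for the SIMD validator (an assumption, not PR code).
/// With simdutf8 this would be `simdutf8::basic::from_utf8(val).ok()`.
/// It reports only success or failure, with no detail about where the
/// input became invalid.
fn fast_path(val: &[u8]) -> Option<&str> {
    std::str::from_utf8(val).ok()
}

/// Validates `val` on the fast path first. Only when that fails does it
/// re-run `std::str::from_utf8` to recover a detailed `Utf8Error`.
pub fn check_utf8(val: &[u8]) -> Result<&str, Utf8Error> {
    match fast_path(val) {
        // Happy path: the input is valid, no second pass needed.
        Some(s) => Ok(s),
        // Error path: pay for a second scan to learn the error position.
        None => std::str::from_utf8(val),
    }
}
```

Under these assumptions, `check_utf8(b"hello")` returns `Ok("hello")`, while invalid input such as `b"ab\xffcd"` yields a `Utf8Error` whose `valid_up_to()` points at the first bad byte. The second scan runs only on invalid input, so valid data is validated once.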
What changes are included in this PR?
A SIMD implementation of UTF-8 validation (simdutf8) replaces the standard library method (std::str::from_utf8) on the happy path.
Are these changes tested?
All tests passed.
Are there any user-facing changes?
No.