chore(bench): kernel-vs-Thrift performance baseline harness + results #790
Open
vikrantpuppala wants to merge 1 commit into
One-shot benchmark script under scripts/ that runs each (backend × SQL-shape) combination N+1 times against a live warehouse, drops the first run (cache warm-up), and reports min/median/max for session-open, time-to-first-row, drain, and RSS delta.
Not a CI gate — single-machine, single-warehouse, high-variance script meant to be re-run by hand when we want a baseline.
Shapes:
- SELECT 1 (pure round-trip latency, no data)
- range(10k) (inline result, ~10K rows)
- range(1M) (crosses CloudFetch threshold; currently panics on the kernel backend — see kernel issue #19, nested block_on bug)
- wide-uuid(100k) (wider rows, Arrow serialization)
- metadata.catalogs (metadata round-trip)
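For illustration, the SQL shapes above might be declared as a simple table in the script. The names and exact SQL below are a sketch, not the script's actual definitions (metadata.catalogs is a metadata API call rather than a SQL statement, so it is omitted here):

```python
# Hypothetical shape table; the real bench_kernel_vs_thrift.py
# may name or phrase these differently.
SHAPES = {
    "select-1": "SELECT 1",                      # pure round-trip, no data
    "range-10k": "SELECT * FROM range(10000)",   # inline result path
    "range-1m": "SELECT * FROM range(1000000)",  # crosses CloudFetch threshold
    "wide-uuid-100k": (
        "SELECT uuid() AS a, uuid() AS b, uuid() AS c, uuid() AS d "
        "FROM range(100000)"                     # wider rows, Arrow serialization
    ),
}
```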
Output is a Markdown table you can paste into a PR. Run with:
```
set -a && source ~/.databricks/pecotesting-creds && set +a
export DATABRICKS_SERVER_HOSTNAME=${DATABRICKS_HOST#https://}
.venv/bin/python scripts/bench_kernel_vs_thrift.py
```
Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
Update: kernel issue #19 fixed in databricks-sql-kernel#20. Re-ran the benchmark with the fix applied:
The big result: (comparison table not preserved in this capture). Other observations unchanged:
Summary
One-shot benchmark script that runs each (backend × SQL-shape) combination N+1 times against a live warehouse, drops the first run (cache warm-up), and reports min/median/max for session-open, time-to-first-row (TTFR), drain, and RSS delta.
Not a CI gate — single-machine, single-warehouse, high-variance. Meant to be re-run by hand when we want a baseline. Output is a Markdown table you can paste into a PR.
```
set -a && source ~/.databricks/pecotesting-creds && set +a
export DATABRICKS_SERVER_HOSTNAME=${DATABRICKS_HOST#https://}
.venv/bin/python scripts/bench_kernel_vs_thrift.py
```
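The measurement loop can be sketched roughly as follows. The function names and the stand-in workload are illustrative, not the script's actual API; the real script times session-open, TTFR, drain, and RSS separately per backend:

```python
import statistics
import time

def bench(run_query, n_samples=5):
    """Run a query n_samples + 1 times, drop the warm-up run,
    and report (min, median, max) elapsed time in milliseconds."""
    elapsed_ms = []
    for _ in range(n_samples + 1):
        start = time.perf_counter()
        run_query()  # stand-in for open-session + fetch-first-row + drain
        elapsed_ms.append((time.perf_counter() - start) * 1000)
    samples = elapsed_ms[1:]  # first run warms caches; discard it
    return min(samples), statistics.median(samples), max(samples)

# Toy workload so the sketch is runnable end to end.
lo, med, hi = bench(lambda: sum(range(1000)))
```

Reporting min/median/max rather than mean keeps one slow outlier sample from skewing the headline number, which matters for a high-variance single-machine harness like this one.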
Results (median of 5 samples, warm-up dropped, dogfood)
Detailed tables with min/max ranges and RSS-delta numbers are in the script output.
Three findings worth flagging
1. Fixed kernel TTFR overhead (~700ms)
`SELECT 1` is the cleanest signal because drain time is essentially zero. Kernel pays ~700ms more than Thrift on every query. On large queries (drain dominates) the relative cost shrinks to 1.3–1.5×.
Plausible causes haven't been investigated yet; a flamegraph would distinguish between them. Worth a follow-up.
2. CloudFetch panic on large results (kernel issue #19)
`range(1M)` crosses the CloudFetch threshold; the kernel's reader calls `runtime_handle.block_on` from a sync trait method, which panics when invoked from inside our PyO3 `runtime.block_on`. Until this is fixed, `use_sea=True` is unusable in production for any large-result workload. The connector's e2e tests use `range(10000)`, below the CloudFetch threshold, so the bug never surfaced there.
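The failure mode is the classic "re-enter a blocking runtime from inside itself" pattern. Python's asyncio enforces the same rule, which makes for a quick runnable illustration; this is an analogy, not the kernel's actual code path:

```python
import asyncio

async def outer():
    # Analogous to the kernel's sync reader calling
    # runtime_handle.block_on while already inside the PyO3
    # runtime.block_on: re-entering the running loop is an error.
    loop = asyncio.get_running_loop()
    try:
        loop.run_until_complete(asyncio.sleep(0))
    except RuntimeError as e:
        return str(e)

msg = asyncio.run(outer())
```

In tokio the equivalent misuse panics rather than raising, which is why the kernel backend aborts instead of surfacing a catchable Python exception.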
3. RSS overhead ~+1MB per kernel session
Consistent across every shape. Probably tokio worker thread stacks (default ~2MB × N workers, partially committed). Not a problem at small connection counts; maps to the "process-global `OnceLock`" follow-up the kernel reviewer flagged.
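One way the RSS-delta column could be sampled, as a sketch (the real script's method may differ; note `ru_maxrss` is a high-water mark, not an instantaneous reading, and its unit is kilobytes on Linux but bytes on macOS):

```python
import resource
import sys

def peak_rss_kib():
    """Process peak RSS, normalized to KiB across platforms."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS, kilobytes on Linux.
    return peak // 1024 if sys.platform == "darwin" else peak

before = peak_rss_kib()
buf = bytearray(8 * 1024 * 1024)  # allocate ~8 MiB to move the needle
after = peak_rss_kib()
delta = after - before
```

For a point-in-time value instead of a high-water mark, reading VmRSS from /proc/self/status works on Linux; a ~1MB per-session delta is close to the noise floor of either method, so the consistency across shapes is the stronger signal here.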
What this doesn't measure
Useful next benchmarks once #19 lands: real CloudFetch shape (1M+ rows), concurrent sessions, and a memory-profile pass.
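For the concurrent-sessions follow-up, the harness shape might look like this; `open_session` is a hypothetical stand-in for the connector's session-open call, and the worker count is arbitrary:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def open_session(_):
    # Hypothetical stand-in for connector session open + SELECT 1.
    time.sleep(0.01)
    return time.perf_counter()

def bench_concurrent(n_sessions=8):
    """Open n_sessions in parallel and time until the slowest finishes.
    Would exercise shared tokio-runtime state the serial benchmark never hits."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_sessions) as pool:
        list(pool.map(open_session, range(n_sessions)))
    return time.perf_counter() - start

elapsed = bench_concurrent()
```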
This pull request and its description were written by Isaac.