Add pg_clickhouse #742
base: main
Conversation
It duplicates the ClickHouse benchmark config to load the data (except that the column names are lowercase) and the PostgreSQL config to run the queries. All queries push down to ClickHouse without error.
Output in my test:

@@ -0,0 +1,43 @@
SELECT COUNT(*) FROM hits;
Here are the measurements with c6a.4xlarge instances: https://pastila.clickhouse.com/?00012d7f/6abbadde357ce424eac9b0db7b1aa307#sOCA4BXIIk/CnS1ic4Kt8w== (runtimes are at the end)
I couldn't spot any error messages in the log output. What makes me suspicious is that the corresponding "pure" ClickHouse measurements on c6a.4xlarge systems show worse runtimes: https://github.com/ClickHouse/ClickBench/blob/main/clickhouse/results/c6a.4xlarge.json (see e.g. the cold runs, i.e. the first value in each row).
Do you have an explanation for how that may be?
Maybe we need to add more checks that the data was successfully imported and the queries return non-empty results?
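For example, something along these lines at the end of the load step could catch both problems (a sketch; the psql flags are placeholders, and 99997497 is the row count of the full hits dataset):

    # Sketch of a post-import sanity check; connection flags are placeholders.
    EXPECTED=99997497
    ACTUAL=$(psql -t -A -c 'SELECT COUNT(*) FROM hits;')
    if [ "$ACTUAL" != "$EXPECTED" ]; then
        echo "Import check failed: expected $EXPECTED rows, got $ACTUAL" >&2
        exit 1
    fi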
I'm pretty sure it exits with an error on load failure; I saw that a lot while I was working out how to make it work. Same for query failures; they mess with the resulting output.
I ran the test in Docker without grepping out the query output and got this. Note that this is just one Parquet file; I don't have capacity for 100.
For comparison, here's what I get in Docker for one Parquet file running the base ClickHouse benchmark. It looks fairly consistent at a glance.
I'm working through the queries, comparing outputs. I found that pg_clickhouse isn't pushing down MIN() and MAX(), and that the EventDate column is turning up empty. I'm going to work on that, then, once those are fixed, go through the rest of the queries and fix any other issues I find.
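One quick way to check pushdown (a sketch, assuming pg_clickhouse surfaces the remote query in EXPLAIN VERBOSE output the way postgres_fdw does):

    # A pushed-down aggregate shows up inside the foreign scan's remote
    # SQL; a local Aggregate node sitting on top of the scan means the
    # MIN()/MAX() are being computed in PostgreSQL instead.
    psql -c 'EXPLAIN VERBOSE SELECT MIN(EventDate), MAX(EventDate) FROM hits;'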
I've worked through all the queries, and only one wasn't working properly:

    SELECT MIN(EventDate), MAX(EventDate) FROM hits;

I've fixed it and will make a release shortly. After that we can run the benchmark again.
I compared the output of all the other queries against a hits table with 100m rows in ClickHouse, and they all looked right (some vary, but the outputs in those cases are somewhat nondeterministic).
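A spot check of that kind can be scripted, e.g. (a sketch; connection details are placeholders, and column-name casing differs between the two schemas, so per-column queries need adjusting on each side):

    # Compare one query's output from pg_clickhouse (via psql) against the
    # reference hits table queried in ClickHouse directly; any diff output
    # indicates a mismatch.
    q='SELECT COUNT(*) FROM hits'
    diff <(psql -t -A -c "$q") <(clickhouse-client --query "$q" --format TSV)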
Note that this PR fixes the previous one mainly by removing | tail -n1 from run.sh, which allows all three runs for each query to be collected rather than just the last. But curiously, an awful lot of the benchmark configs in this repo contain the same thing:
❯ rg -F 'tail -n1'
cloud-init.sh.in
24:df -B1 / | tail -n1 | awk '{ print $3 }' | tee -a log
29:df -B1 / | tail -n1 | awk '{ print $3 }' | tee -a log
redshift-serverless/run.sh
8: psql -h "${FQDN}" -U awsuser -d dev -p 5439 -t -c 'SET enable_result_cache_for_session = off' -c '\timing' -c "$query" 2>&1 | grep -P 'Time|psql: error' | tail -n1
paradedb/run.sh
12: psql -h localhost -U postgres -p 5432 -t -c '\timing' -c "$query" 2>&1 | grep -P 'Time|psql: error' | tail -n1
mariadb/benchmark.sh
30:sudo mariadb test -e "SELECT data_length + index_length FROM information_schema.TABLES WHERE table_schema = 'test' AND table_name = 'hits';" | tail -n1
hydra/run.sh
11: ) | psql $DATABASE_URL 2>&1 | grep -P 'Time|psql: error' | tail -n1
pg_duckdb-motherduck/run.sh
14: ) | psql --no-psqlrc --tuples-only $CONNECTION 2>&1 | grep -P 'Time|psql: error' | tail -n1
postgresql-indexed/run.sh
13: ) | sudo -u postgres psql test -t 2>&1 | grep -P 'Time|psql: error' | tail -n1
cloudberry/run.sh
13: psql -d postgres -t -f /tmp/query_temp.sql 2>&1 | grep -P 'Time|psql: error' | tail -n1
pg_mooncake/run.sh
14: ) | psql $CONNECTION 2>&1 | grep -P 'Time|psql: error' | tail -n1
greenplum/run.sh
13: psql -d postgres -t -f /tmp/query_temp.sql 2>&1 | grep -P 'Time|psql: error' | tail -n1
aurora-postgresql/run.sh
8: psql -U postgres -h "${FQDN}" test -t -c '\timing' -c "$query" 2>&1 | grep -P 'Time|psql: error' | tail -n1
postgresql-orioledb/run.sh
13: ) | psql -h localhost -p 5432 -U postgres -d test -t 2>&1 | grep -P 'Time|psql: error' | tail -n1
umbra/run.sh
22: PGPASSWORD=postgres psql -p 5432 -h 127.0.0.1 -U postgres -t -c '\timing' -c "$query" 2>&1 | grep -P 'Time|psql: error' | tail -n1
pg_duckdb-parquet/run.sh
13: ) | psql --no-psqlrc --tuples-only postgres://postgres:duckdb@localhost:5432/postgres 2>&1 | grep -P 'Time|psql: error' | tail -n1
mysql/benchmark.sh
26:sudo mysql test -e "SELECT data_length + index_length FROM information_schema.TABLES WHERE table_schema = 'test' AND table_name = 'hits';" | tail -n1
cedardb/run.sh
22: PGPASSWORD=test psql -h localhost -U postgres -t -c '\timing' -c "$query" 2>&1 | grep -P 'Time|psql: error' | tail -n1
yugabytedb/run.sh
13: ) | ./yugabyte/bin/ysqlsh -U yugabyte -d test -t 2>&1 | grep -P 'Time|ysqlsh: error' | tail -n1
pgpro_tam/run.sh
13: ) | psql -h 127.0.0.1 -U postgres -t 2>&1 | grep -P 'Time|psql: error' | tail -n1
timescaledb-no-columnstore/run.sh
11: sudo -u postgres psql nocolumnstore -t -c '\timing' -c "$query" 2>&1 | grep -P 'Time|psql: error' | tail -n1
timescaledb/run.sh
11: sudo -u postgres psql test -t -c '\timing' -c "$query" 2>&1 | grep -P 'Time|psql: error' | tail -n1
citus/run.sh
11: psql -U postgres -h localhost -d postgres --no-password -t -c '\timing' -c "$query" 2>&1 | grep -P 'Time|psql: error' | tail -n1
tidb/benchmark.sh
115:mysql test -e "SELECT (DATA_LENGTH + INDEX_LENGTH) AS TIKV_STORAGE_SIZE_BYTES FROM information_schema.tables WHERE table_schema = '$DB_NAME' AND table_name = '$TABLE_NAME';" | tail -n1
120: mysql test -e "SELECT TOTAL_SIZE AS TIFLASH_STORAGE_SIZE_BYTES FROM information_schema.tiflash_tables WHERE TIDB_DATABASE = '$DB_NAME' AND TIDB_TABLE = '$TABLE_NAME';" | tail -n1
cratedb/run.sh
18: psql -U crate -h localhost --no-password -t -c '\timing' -c "$query" 2>&1 | grep -P 'Time|psql: error' | tail -n1
supabase/run.sh
13: ) | psql ${SUPABASE_CONNECTION_STRING} -t 2>&1 | grep -P 'Time|psql: error' | tail -n1
pg_duckdb-indexed/run.sh
16: ) | psql $CONNECTION --no-psqlrc --tuples-only 2>&1 | grep -P 'Time|psql: error' | tail -n1
tablespace/run.sh
13: psql "host=$HOSTNAME port=5432 dbname=csdb user=csuser password=$PASSWORD sslmode=require" -c '\timing' -c "$query" 2>&1 | grep -P 'Time|psql: error' | tail -n1
clickhouse-parquet-partitioned/run.sh
11: RES=$(./clickhouse local --time --format Null --query="$(cat create.sql); $query" 2>&1 | tail -n1)
clickhouse-parquet/run.sh
11: RES=$(./clickhouse local --time --format Null --query="$(cat create.sql); $query" 2>&1 | tail -n1)
paradedb-partitioned/run.sh
12: psql -h localhost -U postgres -p 5432 -t -c '\timing' -c "$query" 2>&1 | grep -P 'Time|psql: error' | tail -n1
clickhouse-cloud/collect-results.sh
28: "data_size": '$(tail -n1 "$f" | tr -d "\n")',
hologres/run.sh
25: PGUSER=$PG_USER PGPASSWORD=$PG_PASSWORD psql -h $HOST_NAME -p $PORT -d $DB -t -f $temp_file 2>&1 | grep -P 'ms|psql: error' | tail -n1
mysql-myisam/benchmark.sh
26:sudo mysql test -e "SELECT data_length + index_length FROM information_schema.TABLES WHERE table_schema = 'test' AND table_name = 'hits';" | tail -n1
cockroachdb/benchmark.sh
39:cockroach sql --insecure --host=localhost --database=test --execute="SELECT SUM(range_size) FROM [SHOW RANGES FROM TABLE hits WITH DETAILS];" | tail -n1
alloydb/run.sh
11: PGPASSWORD='<PASSWORD>' psql -h 127.0.0.1 -p 5432 -U postgres -d clickbenc -c "$query" 2>&1 | grep -P 'Time|psql: error' | tail -n1
redshift/run.sh
8: psql -h "${FQDN}" -U awsuser -d dev -p 5439 -t -c 'SET enable_result_cache_for_session = off' -c '\timing' -c "$query" 2>&1 | grep -P 'Time|psql: error' | tail -n1
timescale-cloud/run.sh
8: psql "${CONNECTION_STRING}" -t -c '\timing' -c "$query" 2>&1 | grep -P 'Time|psql: error' | tail -n1
tembo-olap/run.sh
11: psql "host=$HOSTNAME port=5432 dbname=test user=postgres password=$PASSWORD sslmode=require" -t -c '\timing' -c "$query" 2>&1 | grep -P 'Time|psql: error' | tail -n1
pg_duckdb/run.sh
16: ) | psql $CONNECTION --no-psqlrc --tuples-only 2>&1 | grep -P 'Time|psql: error' | tail -n1
Are those all only supposed to output the timing from the last run?
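For context, the shared shape of those scripts is roughly this (a simplified sketch, not any one engine's exact script):

    TRIES=3
    while read -r query; do
        for i in $(seq 1 "$TRIES"); do
            # grep keeps the timing/error lines; tail -n1 then keeps only
            # the last matching line of this single invocation (e.g. when
            # an extra SET statement also emits a Time: line), so each try
            # still prints exactly one line, three lines per query overall
            psql test -t -c '\timing' -c "$query" 2>&1 |
                grep -P 'Time|psql: error' | tail -n1
        done
    done < queries.sql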
Yes, that's the reason.
The automation expects a certain result format to work.
Details are in the repository README: https://github.com/ClickHouse/ClickBench
(section "How to add a new result").
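Concretely, each query needs to end up as one row of three runtimes in the results file, so run.sh has to emit three timing lines per query. A sketch of extracting them (the regex is illustrative; \timing reports milliseconds, while the results files store seconds, so convert before pasting):

    # Print the three raw timings for one query, one line per try.
    for i in 1 2 3; do
        psql test -t -c '\timing' -c "$query" 2>&1 |
            grep -oP 'Time: \K[0-9.]+'
    done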
Those are the instructions I followed, though I don't see a description of how to get the results into results/.
Should the other engines be updated to emit all three results for each query?
I've released pg_clickhouse v0.1.2 with the fix.
Replaces #739 by eliminating a wayward | head -n1 and makes the column names in ClickHouse lowercase. Closes ClickHouse/pg_clickhouse#82.