Docker compose local (dev) distributed Storm cluster with full observability and network simulation #8706
GGraziadei wants to merge 3 commits into
rzo1
left a comment
Thanks for the PR. Useful dev tooling and the README covers the setup well. A few things to sort out before it can merge against master:
1. CI will fail on Apache RAT. The two Grafana dashboards (grafana/dashboards/storm-cluster.json, storm-metrics-v2.json) have no ASF license header, and JSON has no comment syntax to carry one. RAT does scan JSON (we already exclude package-lock.json in the root pom.xml), so please add an exclusion there, e.g.:
<exclude>**/dev-tools/cluster/grafana/dashboards/*.json</exclude>
Please run mvn apache-rat:check -Prat locally to confirm nothing else (e.g. the new extlib-daemon/.gitignore) trips it.
2. topology.tuple.compression.enable references an unmerged feature. This config key isn't on master (the existing storm.compression.zstd.* / ZstdBridgeThriftSerializationDelegate is cluster-state serialization, not tuple compression). The Dockerfile comment "so it runs your code (e.g. the zstd tuple-compression feature)" and the topology.tuple.compression.enable: false in FileReadWordCountTopo-cluster.yaml will be silently ignored. Please drop these so the harness works against current master, or land it alongside the tuple-compression PR. (The EWMA/jitter config and metrics are fine — those are already on master.)
3. Please bind published ports to localhost. docker-compose.yml publishes 6627/8080/9090/3000 on 0.0.0.0. With unauthenticated Nimbus Thrift and Grafana admin/admin, that exposes a dev cluster to the whole LAN. 127.0.0.1:8080:8080 etc. is safer.
4. Windows support is missing — fine as a follow-up, but please note the Linux/macOS (or WSL2) requirement in the README Prerequisites. The scripts are bash + mvn (not mvn.cmd), and netsim.sh is tc/netem-only by nature.
Minor:
- netsim.sh hardcodes cluster-supervisor{1,2}-1, which assumes the Compose project name is cluster; it breaks under -p <name> or a renamed checkout. Consider resolving via docker compose ps -q supervisor1.
- prepare-extlib.sh defaults STORM_VERSION to 3.0.0-SNAPSHOT instead of reading the pom like build-image.sh does; it'll cp a wrong-named jar after a version bump. Source .env or read the pom.
- storm-metrics-v2.json is missing a trailing newline.
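The first two minor points could be addressed along these lines. This is a minimal sketch, not the PR's actual code: the function names resolve_container and pom_version are hypothetical, and it assumes the Docker Compose v2 CLI and a maven-help-plugin recent enough to support -DforceStdout (3.1.0 or later).

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, not taken from the PR's scripts.

# Resolve a Compose service name to its container ID, so the script does not
# depend on the Compose project name (cluster-supervisor1-1 vs. <name>-supervisor1-1).
resolve_container() {
  local service="$1"
  local id
  id="$(docker compose ps -q "$service")"
  if [ -z "$id" ]; then
    echo "no running container for service: $service" >&2
    return 1
  fi
  printf '%s\n' "$id"
}

# Read the project version from the pom instead of hardcoding a default.
# Must be run from a directory containing the relevant pom.xml.
pom_version() {
  mvn -q help:evaluate -Dexpression=project.version -DforceStdout
}
```

For example, STORM_VERSION="${STORM_VERSION:-$(pom_version)}" would keep an explicit override while falling back to the pom. Note that docker compose ps -q prints one ID per container, so a scaled service would return several lines.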
Thanks for the detailed review and the helpful insights! Everything is now pushed and ready for another look!
What is the purpose of the change
This PR introduces a repeatable, Docker-based distributed Storm dev cluster designed for realistic storm-perf benchmarking on a local machine. It provisions a complete environment, including Nimbus, ZooKeeper, and two Supervisors, so that inter-worker traffic crosses the network and incurs real serialization overhead.
Backed by a full observability stack (Prometheus and Grafana), the setup provides granular, per-task tracking via Storm Metrics v2. It also includes a netsim.sh utility to inject controlled network latency and jitter, so developers can stress-test topology resilience and analyze bottlenecks under degraded network conditions.
How was the change tested
I verified the environment by executing the benchmark smoke test outlined in the README.md, running the FileReadWordCountTopo topology for 120 seconds across two workers on separate supervisors.
The smoke test validated baseline performance and reproduced the expected bottleneck. Injecting typical datacenter network conditions (3 ms latency, 1 ms jitter) raised average complete latency from 390 ms to 446 ms (about 14%); the resulting back-pressure reduced total tuple throughput from 40.93M to 36.25M (about 11%) without dropping packets.
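For readers unfamiliar with tc/netem, the latency and jitter injection described above can be sketched roughly as follows. This is not the actual netsim.sh: the function names, the eth0 interface, and the supervisor1 service name are assumptions, and it presumes the container has iproute2 installed and the NET_ADMIN capability.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers, not the PR's netsim.sh.

# Apply a fixed delay plus jitter to all egress traffic on one interface of a
# Compose service's container. "replace" makes the call safe to repeat.
apply_netem() {
  local service="$1" dev="$2" delay_ms="$3" jitter_ms="$4"
  local id
  id="$(docker compose ps -q "$service")"
  if [ -z "$id" ]; then
    echo "no running container for service: $service" >&2
    return 1
  fi
  docker exec "$id" tc qdisc replace dev "$dev" root netem \
    delay "${delay_ms}ms" "${jitter_ms}ms"
}

# Remove the netem qdisc again, restoring the default network behavior.
clear_netem() {
  local service="$1" dev="$2"
  local id
  id="$(docker compose ps -q "$service")"
  if [ -z "$id" ]; then
    echo "no running container for service: $service" >&2
    return 1
  fi
  docker exec "$id" tc qdisc del dev "$dev" root
}
```

Under these assumptions, apply_netem supervisor1 eth0 3 1 would correspond to the 3 ms latency and 1 ms jitter used in the smoke test, and clear_netem supervisor1 eth0 would undo it.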