
Docker compose local (dev) distributed Storm cluster with full observability and network simulation #8706

Open
GGraziadei wants to merge 3 commits into apache:master from GGraziadei:docker-dev-cluster

@GGraziadei
Contributor

What is the purpose of the change

This PR introduces a repeatable, Docker-based distributed Storm dev cluster designed for realistic storm-perf benchmarking on a local machine. It provisions a complete environment, including Nimbus, ZooKeeper, and two Supervisors, forcing inter-worker traffic across the network so that it incurs real serialization overhead.
Backed by a full observability stack (Prometheus and Grafana), the setup provides granular, per-task tracking via Storm Metrics v2. Additionally, it includes a netsim.sh utility to inject controlled network latency and jitter, allowing developers to easily stress-test topology resilience and analyze bottlenecks under degraded network conditions.
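
For orientation, latency and jitter injection of this kind is usually done with the netem qdisc of tc. The sketch below is not taken from netsim.sh; it only shows the shape of the commands involved. The function names `netem_cmd` and `netem_clear_cmd` and the interface name `eth0` are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not taken from netsim.sh): build the tc/netem command
# that adds a fixed delay plus jitter to one network interface.
# Usage: netem_cmd <delay_ms> <jitter_ms> [interface]
netem_cmd() {
  local delay_ms="$1" jitter_ms="$2" dev="${3:-eth0}"   # eth0 is an assumption
  # "replace" creates the root qdisc or overwrites an existing one.
  echo "tc qdisc replace dev ${dev} root netem delay ${delay_ms}ms ${jitter_ms}ms"
}

# Hypothetical helper: build the command that removes the emulation again.
netem_clear_cmd() {
  local dev="${1:-eth0}"
  echo "tc qdisc del dev ${dev} root"
}

# Print the command for a 3 ms delay with 1 ms jitter.
netem_cmd 3 1
```

The printed command would have to run inside the target container's network namespace, for example through `docker exec <container> ...`, and the container needs the NET_ADMIN capability for tc to modify the qdisc.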

How was the change tested

I verified the environment by executing the benchmark smoke test outlined in the README.md, running the FileReadWordCountTopo topology for 120 seconds across two workers on separate supervisors.
The smoke test validated baseline performance and reproduced the expected bottlenecks. Injecting typical datacenter network conditions (3 ms latency, 1 ms jitter) raised average complete latency from 390 ms to 446 ms; the resulting back-pressure reduced total tuple throughput from 40.93M to 36.25M without dropping packets.
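
To put the reported numbers in relative terms, the small calculation below computes the percentage change for both measurements. The helper name `pct_change` is introduced here for illustration only; the input values are the ones quoted above.

```shell
#!/usr/bin/env bash
# Relative change between a baseline and a degraded measurement, in percent.
# Usage: pct_change <baseline> <degraded>
pct_change() {
  LC_ALL=C awk -v base="$1" -v new="$2" \
    'BEGIN { printf "%+.1f\n", (new - base) / base * 100 }'
}

pct_change 390 446       # average complete latency in ms, prints +14.4
pct_change 40.93 36.25   # total tuples in millions, prints -11.4
```

In other words, roughly 14% more latency and 11% less throughput under the injected 3 ms / 1 ms profile.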

@GGraziadei changed the title from "docker compose dev cluster" to "Docker compose local (dev) distributed Storm cluster with full observability and network simulation" May 22, 2026
@GGraziadei marked this pull request as draft May 22, 2026 12:01
@GGraziadei marked this pull request as ready for review May 22, 2026 13:43
Contributor

@rzo1 left a comment

Thanks for the PR. Useful dev tooling and the README covers the setup well. A few things to sort out before it can merge against master:

1. CI will fail on Apache RAT. The two Grafana dashboards (grafana/dashboards/storm-cluster.json, storm-metrics-v2.json) have no ASF license header, and JSON has no comment syntax to carry one. RAT does scan JSON (we already exclude package-lock.json in the root pom.xml), so please add an exclusion there, e.g.:

<exclude>**/dev-tools/cluster/grafana/dashboards/*.json</exclude>

Please run mvn apache-rat:check -Prat locally to confirm nothing else (e.g. the new extlib-daemon/.gitignore) trips it.

2. topology.tuple.compression.enable references an unmerged feature. This config key isn't on master (the existing storm.compression.zstd.* / ZstdBridgeThriftSerializationDelegate is cluster-state serialization, not tuple compression). The Dockerfile comment "so it runs your code (e.g. the zstd tuple-compression feature)" and the topology.tuple.compression.enable: false in FileReadWordCountTopo-cluster.yaml will be silently ignored. Please drop these so the harness works against current master, or land it alongside the tuple-compression PR. (The EWMA/jitter config and metrics are fine — those are already on master.)

3. Please bind published ports to localhost. docker-compose.yml publishes 6627/8080/9090/3000 on 0.0.0.0. With unauthenticated Nimbus Thrift and Grafana admin/admin, that exposes a dev cluster to the whole LAN. 127.0.0.1:8080:8080 etc. is safer.

4. Windows support is missing — fine as a follow-up, but please note the Linux/macOS (or WSL2) requirement in the README Prerequisites. The scripts are bash + mvn (not mvn.cmd), and netsim.sh is tc/netem-only by nature.
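
As a rough illustration of the port-binding point (item 3), the sketch below lists Compose port mappings that are not bound to 127.0.0.1. It is a simplistic text filter, not a YAML parser: it only understands the short `HOST:CONTAINER` and `IP:HOST:CONTAINER` list syntax with no trailing comments, and the function name `unbound_ports` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical check: read a Compose file on stdin and print every short-syntax
# port mapping that is not explicitly bound to 127.0.0.1.
unbound_ports() {
  tr -d "\"'" \
    | grep -E '^[[:space:]]*-[[:space:]]+([0-9.]+:)?[0-9]+:[0-9]+[[:space:]]*$' \
    | grep -Ev '^[[:space:]]*-[[:space:]]+127\.0\.0\.1:' \
    | sed -E 's/^[[:space:]]*-[[:space:]]+//'
}

# Example invocation (path is an assumption):
#   unbound_ports < docker-compose.yml
```

An empty result means every short-syntax mapping already carries the loopback prefix, as in `127.0.0.1:8080:8080`.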

Minor:

  • netsim.sh hardcodes cluster-supervisor{1,2}-1, which assumes the Compose project name is cluster; breaks under -p <name> or a renamed checkout. Consider resolving via docker compose ps -q supervisor1.
  • prepare-extlib.sh defaults STORM_VERSION to 3.0.0-SNAPSHOT instead of reading the pom like build-image.sh does — it'll cp a wrong-named jar after a version bump. Source .env or read the pom.
  • storm-metrics-v2.json is missing a trailing newline.
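
The first two minor points could be addressed along these lines. This is a sketch under assumptions, not the code in the PR: the function names `resolve_container` and `storm_version` are hypothetical, and the version lookup assumes Maven with a maven-help-plugin recent enough to support `-DforceStdout` (3.1.0 or later).

```shell
#!/usr/bin/env bash
# Hypothetical helper: resolve a Compose service to its container ID so the
# script does not depend on the Compose project name (works with -p <name>).
resolve_container() {
  local service="$1" id
  id="$(docker compose ps -q "$service")"
  if [ -z "$id" ]; then
    echo "service '$service' is not running" >&2
    return 1
  fi
  echo "$id"
}

# Hypothetical helper: read the project version from the pom instead of
# hardcoding a default. Assumes maven-help-plugin 3.1.0+ for -DforceStdout.
storm_version() {
  mvn -q -f "${1:-pom.xml}" help:evaluate -Dexpression=project.version -DforceStdout
}

# Example invocation (requires a running cluster and Maven on the PATH):
#   docker exec "$(resolve_container supervisor1)" hostname
#   STORM_VERSION="$(storm_version)"
```

If a service is scaled to several replicas, `docker compose ps -q` prints one ID per line, so a script would need to iterate over the result rather than treat it as a single container.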

@GGraziadei
Contributor Author

Thanks for the detailed review and the helpful insights!
I have applied all the requested changes.
Added the exclusion for the Grafana JSON dashboards in the root pom.xml and verified locally with mvn apache-rat:check -Prat.
Removed the references and configurations related to the unmerged tuple compression feature (apologies for the distraction and mixing the PR contents!).
Bound all published ports to 127.0.0.1 in docker-compose.yml to prevent LAN exposure.
Updated the README to state the Linux/macOS/WSL2 requirement explicitly, resolved the hardcoded container names in netsim.sh, fixed the versioning fallback in prepare-extlib.sh, and added the missing trailing newline to the JSON file.

Everything is now pushed and ready for another look!
