Add fault tolerance demo docs #22837

Open
jhlodin wants to merge 1 commit into main from jl/doc-11437

@jhlodin jhlodin commented Feb 25, 2026

https://cockroachlabs.atlassian.net/browse/DOC-11437

  • Document the new in-Console fault tolerance demo (in preview)
  • Add mentions of the demo as a next step after deploying in Advanced
  • Add release note

netlify bot commented Feb 25, 2026

Deploy Preview for cockroachdb-api-docs canceled.

Name Link
🔨 Latest commit f4865a7
🔍 Latest deploy log https://app.netlify.com/projects/cockroachdb-api-docs/deploys/699e5939d006bc0008b5d619

netlify bot commented Feb 25, 2026

Deploy Preview for cockroachdb-interactivetutorials-docs canceled.

Name Link
🔨 Latest commit f4865a7
🔍 Latest deploy log https://app.netlify.com/projects/cockroachdb-interactivetutorials-docs/deploys/699e5939bfbf62000842df17

Contributor Author

This page is woefully out of date. There is an ongoing discussion to either remove this page altogether or commit to better maintenance, but for right now it's probably better to add the note than not.

@jhlodin jhlodin requested review from fantapop and jaiayu February 25, 2026 02:08

netlify bot commented Feb 25, 2026

Netlify Preview

Name Link
🔨 Latest commit f4865a7
🔍 Latest deploy log https://app.netlify.com/projects/cockroachdb-docs/deploys/699e59396cc11500085420c1
😎 Deploy Preview https://deploy-preview-22837--cockroachdb-docs.netlify.app


### Fault tolerance demo

CockroachDB {{ site.data.products.advanced }} includes a [built-in fault tolerance demo]({% link {{ page.version.version }}/demo-cockroachdb-resilience.md %}#run-a-guided-demo-in-cockroachdb-cloud) that allows you to monitor query execution during a simulated failure and recovery. The fault tolerance demo is in Preview.

I noticed that "is in Preview" is similar to a line in the Folders section below, but "preview" is not capitalized there. Also, should we add the link? ({% link {{ page.version.version }}/cockroachdb-feature-availability.md %})

---

This page guides you through a simple demonstration of how CockroachDB remains available during, and recovers after, failure. Starting with a 6-node local cluster with the default 3-way replication, you'll run a sample workload, terminate a node to simulate failure, and see how the cluster continues uninterrupted. You'll then leave that node offline for long enough to watch the cluster repair itself by re-replicating missing data to other nodes. You'll then prepare the cluster for 2 simultaneous node failures by increasing to 5-way replication, then take two nodes offline at the same time, and again see how the cluster continues uninterrupted.
This page describes how to see a hands-on demonstration of how CockroachDB's fault-tolerant design allows services to remain available during a failure and recovery.

super nit: Multiple uses of "how" close together sound a bit awkward here ("how to see ... how ..."). I don't have any suggestions for improvement, however.

CockroachDB {{ site.data.products.cloud }} {{ site.data.products.advanced }} includes a built-in fault tolerance demo in the {{ site.data.products.cloud }} Console that automatically runs a sample workload and simulates a node failure on your cluster, showing real-time metrics of query latency and failure rate during the outage and recovery.

{{ site.data.alerts.callout_info }}
The CockroachDB {{ site.data.products.cloud }} fault tolerance demo is in [Preview]({% link {{ page.version.version }}/cockroachdb-feature-availability.md %}).

Not for this PR, but it seems like we should have prebuilt macros for each of the visibilities. It's annoying that we'd have to add the version and availability link in each place we use this. Ideally we could do this and have it link up correctly:

fault tolerance demo is in {{site.data.visibility.preview}}.

- A [CockroachDB {{ site.data.products.advanced }} cluster]({% link cockroachcloud/create-an-advanced-cluster.md %}) with at least three nodes.
- All nodes are healthy.
- The cluster's CPU utilization is below 30%.
- The cluster does not have a custom [replication zone configuration]({% link {{ page.version.version }}/configure-replication-zones.md %}).

We dropped this one.

There are some others but I don't think most of them are worth listing.

The one additional requirement that I think we should consider is that the cluster must be in an unlocked state. For example, if they are already undergoing cluster disruption, scaling their cluster, or the cluster is under maintenance, they won't be able to run the demo. Anything that has locked the cluster will prevent the demo from starting. The messaging we show the user in this case is:

The fault tolerance demo cannot be run because this cluster is currently in a locked state. Try again once the cluster is available.
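
As a rough illustration of the eligibility conditions discussed in this thread, here is a minimal Python sketch. It is not the Console's actual implementation: `ClusterStatus`, `demo_blockers`, and every field name are hypothetical, and the non-locked messages are invented for the example. Only the locked-state message is taken from the comment above. The custom zone configuration check is omitted because it was dropped.

```python
from dataclasses import dataclass

# Message quoted in the review comment for a locked cluster.
LOCKED_MESSAGE = (
    "The fault tolerance demo cannot be run because this cluster is "
    "currently in a locked state. Try again once the cluster is available."
)


@dataclass
class ClusterStatus:
    """Hypothetical snapshot of the cluster state relevant to the demo."""
    node_count: int
    healthy_nodes: int
    cpu_utilization: float  # fraction between 0.0 and 1.0
    locked: bool            # e.g. scaling, maintenance, ongoing disruption


def demo_blockers(status: ClusterStatus) -> list[str]:
    """Return the reasons the demo cannot start (empty list if eligible)."""
    blockers = []
    if status.locked:
        blockers.append(LOCKED_MESSAGE)
    if status.node_count < 3:
        blockers.append("The cluster must have at least three nodes.")
    if status.healthy_nodes < status.node_count:
        blockers.append("All nodes must be healthy.")
    if status.cpu_utilization >= 0.30:
        blockers.append("CPU utilization must be below 30%.")
    return blockers
```

An empty list means the demo could start; otherwise each entry corresponds to one unmet prerequisite, which is roughly how the docs list could map to user-facing messages.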

- The cluster's CPU utilization is below 30%.
- The cluster does not have a custom [replication zone configuration]({% link {{ page.version.version }}/configure-replication-zones.md %}).

To run the fault tolerance demo, open the {{ site.data.products.cloud }} Console and navigate to **Actions > Fault tolerance demo**. Follow the prompts to check that your cluster is eligible and begin the demo.

Follow the prompts to check that your cluster is eligible

I wonder if this will be confusing? There aren't really any visible prompts to check eligibility since we run them automatically when you try to start the demo.

To start using your CockroachDB {{ site.data.products.advanced }} cluster, refer to:

- [Connect to your cluster]({% link cockroachcloud/connect-to-your-cluster.md %})
- Run the [fault tolerance demo]({% link {{ site.versions["stable"] }}/demo-cockroachdb-resilience.md %}#run-a-guided-demo-in-cockroachdb-cloud)

I noticed that the headline that produces this page anchor uses the macro {{ site.data.products.cloud }}, but the value is hard-coded here in the anchor. That means that if we ever changed "cloud", the anchor would break. This is obviously unlikely, and perhaps an automated link scanner would catch it, but it suggests that our current system for linking within pages is lacking. It would be better if we had some layer of indirection for each headline that would allow us to change its name without changing its ID. Then we could use the ID to look up the current name and generate the anchor on the fly. Obviously we wouldn't do any of that in this PR. Just food for thought.
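
To make the indirection idea concrete, here is a minimal Python sketch, assuming a hypothetical registry that maps stable heading IDs to their current rendered titles. The names `HEADINGS`, `slugify`, and `anchor_for`, the ID `cloud-guided-demo`, and the heading text are all assumptions for illustration; the slug rule only approximates how a Markdown renderer derives anchors and is not the docs build's actual logic.

```python
import re

# Hypothetical registry: stable heading ID -> current rendered heading text.
# The title is assumed from the anchor used in this PR.
HEADINGS = {
    "cloud-guided-demo": "Run a guided demo in CockroachDB Cloud",
}


def slugify(title: str) -> str:
    """Approximate anchor generation: lowercase, drop punctuation, hyphenate."""
    slug = re.sub(r"[^a-z0-9\s-]", "", title.lower())
    return re.sub(r"[\s-]+", "-", slug).strip("-")


def anchor_for(heading_id: str) -> str:
    """Look up the current title by stable ID and derive its anchor."""
    return "#" + slugify(HEADINGS[heading_id])
```

With this shape, links would reference `cloud-guided-demo` rather than the rendered slug, so renaming the heading (or the product macro it contains) would only require updating the registry entry.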
