| 1 | +# Self-Healing E2E Test Plan (System Harness) |
| 2 | + |
| 3 | +## Goal |
| 4 | + |
| 5 | +Add deterministic, production-grade system E2E coverage for self-healing in `tests/system`, aligned with the current off-chain phase: |
| 6 | + |
| 7 | +- Trigger based on weighted watchlist (multi-reporter epoch reports). |
| 8 | +- Challenge generation from on-chain CASCADE actions. |
| 9 | +- Recipient-side reconstruction via `RecoveryReseed`. |
| 10 | +- Observer quorum verification. |
| 11 | +- Restart-safe once-per-window behavior. |
| 12 | + |
| 13 | +## Out of Scope |
| 14 | + |
| 15 | +- Any on-chain phase-4 capability/governance changes. |
| 16 | +- Redesign of `x/audit` or `x/action` module semantics. |
| 17 | +- Replacing current supernode runtime architecture. |
| 18 | + |
| 19 | +## Existing Harness Baseline |
| 20 | + |
| 21 | +- Reuse the system harness and lifecycle already used by the cascade tests: |
| 22 | + - `tests/system/main_test.go` |
| 23 | + - `tests/system/e2e_cascade_test.go` |
| 24 | + - `tests/system/supernode-utils.go` |
| 25 | +- Existing startup commands: |
| 26 | + - `make setup-supernodes` |
| 27 | + - `make test-cascade` |
| 28 | + - `make test-e2e` |
| 29 | + |
| 30 | +## Files To Add |
| 31 | + |
| 32 | +- `tests/system/e2e_self_healing_test.go` |
| 33 | +- `tests/system/self_healing_helpers.go` |
| 34 | + |
| 35 | +## Optional Harness Hardening (Recommended First) |
| 36 | + |
| 37 | +To make event-state assertions truly per-node, isolate node-local SQLite in system tests: |
| 38 | + |
| 39 | +1. In `tests/system/supernode-utils.go`, set per-process `HOME` to each node data dir before `cmd.Start()`. |
| 40 | +2. Keep one sqlite DB per supernode process (`$HOME/.supernode/history.db`). |
| 41 | + |
| 42 | +If this is not done, system tests may still pass, but assertions that depend on node-local ownership or lease state can be ambiguous. |
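A minimal sketch of the per-process `HOME` override described above. The binary name, its arguments, and the `nodeEnv`/`nodeDBPath` helpers are hypothetical; only the `$HOME/.supernode/history.db` layout comes from this plan.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// nodeEnv returns a copy of base with HOME pointing at nodeDir.
// os/exec keeps only the last value for a duplicated key, so appending is enough.
func nodeEnv(base []string, nodeDir string) []string {
	env := make([]string, 0, len(base)+1)
	env = append(env, base...)
	return append(env, "HOME="+nodeDir)
}

// nodeDBPath is where a node's SQLite file would land once HOME is isolated.
func nodeDBPath(nodeDir string) string {
	return filepath.Join(nodeDir, ".supernode", "history.db")
}

func main() {
	nodeDir := filepath.Join(os.TempDir(), "supernode-1") // hypothetical data dir
	cmd := exec.Command("supernode", "start")             // hypothetical binary and args
	cmd.Env = nodeEnv(os.Environ(), nodeDir)
	// The real harness would call cmd.Start() here; it is omitted in this sketch.
	fmt.Println(nodeDBPath(nodeDir))
}
```

Copying the base environment first keeps variables such as `PATH` intact while still giving each process its own database location.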
| 43 | + |
| 44 | +## Test Fixture Design |
| 45 | + |
| 46 | +Create one reusable fixture function, for example `setupSelfHealingFixture(t)`: |
| 47 | + |
| 48 | +1. Start chain and register supernodes (reuse existing helper flow). |
| 49 | +2. Start all supernodes. |
| 50 | +3. Create one CASCADE action from `tests/system/test.txt` (reuse cascade helper/client logic). |
| 51 | +4. Wait for action to reach `DONE`/`APPROVED`. |
| 52 | +5. Capture: |
| 53 | + - `actionID` |
| 54 | + - expected data hash from action metadata (`CascadeMetadata.DataHash`) |
| 55 | + - candidate anchor key from metadata (the lexicographically smallest key in `RqIdsIds`) |
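Picking the anchor key can be sketched as a plain string minimum. The `anchorKey` helper and the sample IDs are hypothetical; the sketch assumes `RqIdsIds` is available as a slice of strings.

```go
package main

import "fmt"

// anchorKey returns the lexicographically smallest ID, or false for an empty list.
func anchorKey(ids []string) (string, bool) {
	if len(ids) == 0 {
		return "", false
	}
	smallest := ids[0]
	for _, id := range ids[1:] {
		if id < smallest { // Go compares strings byte-wise
			smallest = id
		}
	}
	return smallest, true
}

func main() {
	key, ok := anchorKey([]string{"b2", "a9", "a10"})
	fmt.Println(key, ok)
}
```

Byte-wise comparison means `a10` sorts before `a9`, so the fixture should not assume numeric ordering of keys.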
| 56 | + |
| 57 | +## Weighted Watchlist Trigger Setup |
| 58 | + |
| 59 | +The self-healing trigger requires a weighted view across reporters, not a single node's local opinion. |
| 60 | + |
| 61 | +Use `AuditMsg().SubmitEpochReport(...)` from multiple reporter nodes in the same epoch: |
| 62 | + |
| 63 | +1. Query current epoch via `Audit().GetCurrentEpoch`. |
| 64 | +2. Query audit params for: |
| 65 | + - `required_open_ports` |
| 66 | + - `peer_quorum_reports` |
| 67 | + - `peer_port_postpone_threshold_percent` |
| 68 | +3. Submit reports from at least `peer_quorum_reports` distinct reporters against target holders. |
| 69 | +4. For targets intended to be flagged, submit enough closed-port votes that `closed_votes / total_votes` meets the `peer_port_postpone_threshold_percent` threshold. |
| 70 | +5. Wait at least one block after report submission. |
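The steps above imply a check the test can run before submitting, to confirm that its planned reports will cross the threshold. The following is a sketch only: the `tally` type and `flagged` function are hypothetical, and the exact aggregation performed on-chain is an assumption, not taken from the module.

```go
package main

import "fmt"

// tally is a hypothetical summary of the reports the test plans to submit
// against one target in one epoch.
type tally struct {
	Reporters   int // distinct reporters
	ClosedVotes int // votes reporting a required port as closed
	TotalVotes  int
}

// flagged reports whether the target should land on the weighted watchlist,
// assuming quorum = peer_quorum_reports and
// thresholdPercent = peer_port_postpone_threshold_percent.
func flagged(t tally, quorum, thresholdPercent int) bool {
	if t.TotalVotes == 0 || t.Reporters < quorum {
		return false
	}
	// Integer form of closed/total >= thresholdPercent/100.
	return t.ClosedVotes*100 >= thresholdPercent*t.TotalVotes
}

func main() {
	fmt.Println(flagged(tally{Reporters: 3, ClosedVotes: 2, TotalVotes: 3}, 3, 66))
	fmt.Println(flagged(tally{Reporters: 3, ClosedVotes: 1, TotalVotes: 3}, 3, 66))
}
```

Keeping the comparison in integers avoids floating-point edge cases when the ratio sits exactly on the threshold.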
| 71 | + |
| 72 | +## Core E2E Scenarios |
| 73 | + |
| 74 | +### 1) Happy Path: Request -> Reseed -> Verify -> Complete |
| 75 | + |
| 76 | +Steps: |
| 77 | + |
| 78 | +1. Ensure holders of the selected action anchor key are on weighted watchlist. |
| 79 | +2. Wait for one generation/process window. |
| 80 | +3. Let challenger emit challenge and process event. |
| 81 | + |
| 82 | +Assertions: |
| 83 | + |
| 84 | +1. One event exists per `(window, action_id)` challenge ID. |
| 85 | +2. Recipient response accepted. |
| 86 | +3. `reconstruction_required=true` when recipient was missing local data. |
| 87 | +4. Observer verification reaches threshold. |
| 88 | +5. Event status becomes `completed`. |
| 89 | +6. Reconstructed content hash matches action metadata hash (see hash assertion section). |
| 90 | + |
| 91 | +### 2) Recipient Down -> Retry -> Terminal |
| 92 | + |
| 93 | +Steps: |
| 94 | + |
| 95 | +1. Bring recipient process down before processing. |
| 96 | +2. Let challenger process event across retries. |
| 97 | + |
| 98 | +Assertions: |
| 99 | + |
| 100 | +1. Event transitions to `retry` with backoff timestamps. |
| 101 | +2. `attempt_count` increments on each claim. |
| 102 | +3. After `max_event_attempts`, status becomes `terminal`. |
| 103 | +4. Terminal reason indicates recipient/request failure. |
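The transitions these assertions expect can be modeled as a small state machine. This is a sketch under assumptions: the `event` struct mirrors only a few columns, the linear backoff is invented for illustration, and `claim`, `fail`, and `simulate` are hypothetical names rather than the supernode's own.

```go
package main

import (
	"fmt"
	"time"
)

// event mirrors a few columns of self_healing_challenge_events.
type event struct {
	Status       string
	AttemptCount int
	NextRetryAt  time.Time
	LastError    string
}

// claim marks the event as being processed and counts the attempt.
func claim(ev event) event {
	ev.Status = "processing"
	ev.AttemptCount++
	return ev
}

// fail moves a claimed event to retry, or to terminal once attempts are exhausted.
// The linear backoff (base * attempt) is an assumption for illustration.
func fail(ev event, now time.Time, maxAttempts int, base time.Duration, reason string) event {
	ev.LastError = reason
	if ev.AttemptCount >= maxAttempts {
		ev.Status = "terminal"
		return ev
	}
	ev.Status = "retry"
	ev.NextRetryAt = now.Add(base * time.Duration(ev.AttemptCount))
	return ev
}

// simulate runs n failed attempts and returns the status after each one.
func simulate(n, maxAttempts int) ([]string, event) {
	now := time.Unix(0, 0)
	ev := event{Status: "pending"}
	statuses := make([]string, 0, n)
	for i := 0; i < n; i++ {
		ev = fail(claim(ev), now, maxAttempts, time.Second, "recipient unreachable")
		statuses = append(statuses, ev.Status)
	}
	return statuses, ev
}

func main() {
	statuses, ev := simulate(3, 3)
	fmt.Println(statuses, ev.AttemptCount)
}
```

A test can compare the sequence of statuses it polls from the DB against a model like this, rather than hard-coding timing expectations.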
| 104 | + |
| 105 | +### 3) Observer Quorum Fail |
| 106 | + |
| 107 | +Steps: |
| 108 | + |
| 109 | +1. Keep recipient up. |
| 110 | +2. Make enough observers unavailable, or have them return a mismatching response, so that quorum cannot be reached. |
| 111 | + |
| 112 | +Assertions: |
| 113 | + |
| 114 | +1. Verification messages captured for observers. |
| 115 | +2. `ok_count < observer_threshold`. |
| 116 | +3. Event is retried, then terminal when max attempts is reached. |
| 117 | + |
| 118 | +### 4) Duplicate Replay / Once-Per-Window |
| 119 | + |
| 120 | +Steps: |
| 121 | + |
| 122 | +1. Trigger same generation window repeatedly. |
| 123 | +2. Attempt duplicate insertion/replay of same challenge ID. |
| 124 | + |
| 125 | +Assertions: |
| 126 | + |
| 127 | +1. Only one row per challenge ID in `self_healing_challenge_events`. |
| 128 | +2. Processing runs once for that window challenge identity. |
| 129 | +3. No duplicate completion metrics for same sender/message type tuple. |
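Once-per-window behavior depends on a deterministic challenge identity plus an insert that rejects duplicates. The sketch below assumes a SHA-256 derivation over `(window, action_id)` and an in-memory store; both are stand-ins, since this plan does not specify how the challenge ID is actually computed or stored.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// challengeID derives a stable ID from the window and action.
// The exact derivation is hypothetical.
func challengeID(windowID int64, actionID string) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%d:%s", windowID, actionID)))
	return hex.EncodeToString(sum[:])
}

// eventStore stands in for a table with a unique constraint on challenge_id.
type eventStore struct {
	rows map[string]struct{}
}

func newEventStore() *eventStore {
	return &eventStore{rows: make(map[string]struct{})}
}

// insertOnce returns true only for the first insert of a given ID.
func (s *eventStore) insertOnce(id string) bool {
	if _, exists := s.rows[id]; exists {
		return false
	}
	s.rows[id] = struct{}{}
	return true
}

func main() {
	store := newEventStore()
	id := challengeID(42, "action-1")
	fmt.Println(store.insertOnce(id), store.insertOnce(id), len(store.rows))
}
```

In the system test, the equivalent check is a row count per `challenge_id` after replaying the same window several times.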
| 130 | + |
| 131 | +### 5) Restart-Safe Lease Reclaim |
| 132 | + |
| 133 | +Steps: |
| 134 | + |
| 135 | +1. Force challenger to claim event. |
| 136 | +2. Kill challenger before completion update. |
| 137 | +3. Wait for lease expiry. |
| 138 | +4. Restart challenger. |
| 139 | + |
| 140 | +Assertions: |
| 141 | + |
| 142 | +1. Event is reclaimed. |
| 143 | +2. Processing resumes and completes (or retries/terminal deterministically). |
| 144 | +3. No double-complete state. |
| 145 | + |
| 146 | +### 6) Stale Window Handling |
| 147 | + |
| 148 | +Steps: |
| 149 | + |
| 150 | +1. Insert or mutate an old event payload with stale `window_id`. |
| 151 | +2. Run event processor tick. |
| 152 | + |
| 153 | +Assertions: |
| 154 | + |
| 155 | +1. Event immediately marked `terminal`. |
| 156 | +2. Reason is `stale_window`. |
| 157 | +3. No request RPC is sent. |
| 158 | + |
| 159 | +### 7) Scale/Throughput Smoke (Bounded) |
| 160 | + |
| 161 | +Steps: |
| 162 | + |
| 163 | +1. Generate many eligible targets (keep this bounded for a system test, e.g. 300-1000). |
| 164 | +2. Configure `max_events_per_tick`, `event_workers`, and retry intervals for test runtime. |
| 165 | + |
| 166 | +Assertions: |
| 167 | + |
| 168 | +1. Processor drains queue over ticks without deadlock. |
| 169 | +2. No unbounded goroutine growth. |
| 170 | +3. Event throughput scales with `event_workers` until bounded by I/O. |
| 171 | +4. DB lock errors do not appear persistently. |
| 172 | + |
| 173 | +## Hash Integrity Assertion (Required) |
| 174 | + |
| 175 | +For healed files, assert that the reconstructed hash equals the action metadata hash: |
| 176 | + |
| 177 | +1. Decode `CascadeMetadata` for the action (`DataHash` is base64-encoded hash payload). |
| 178 | +2. Validate the reconstructed file/key hash using the same verification utility used in the cascade flow (`cascadekit.VerifyB64DataHash` path). |
| 179 | +3. Also assert observer-reported `reconstructed_hash_hex` matches recipient reconstructed local content. |
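The shape of these comparisons can be sketched as below. SHA-256 is used purely as a stand-in, because this plan does not state which hash algorithm backs `DataHash`; the real tests should go through the `cascadekit.VerifyB64DataHash` path. All helper names here are hypothetical.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/base64"
	"encoding/hex"
	"fmt"
	"strings"
)

// hashFunc lets the caller plug in the hash the cascade flow really uses.
type hashFunc func([]byte) []byte

// sha256Sum is a placeholder hash for this sketch only.
func sha256Sum(b []byte) []byte {
	sum := sha256.Sum256(b)
	return sum[:]
}

// encodeB64Hash builds a base64 hash payload, as the metadata is described to hold.
func encodeB64Hash(content []byte, h hashFunc) string {
	return base64.StdEncoding.EncodeToString(h(content))
}

// matchesB64Hash checks reconstructed content against a base64-encoded hash.
func matchesB64Hash(content []byte, b64 string, h hashFunc) (bool, error) {
	want, err := base64.StdEncoding.DecodeString(b64)
	if err != nil {
		return false, fmt.Errorf("decode data hash: %w", err)
	}
	return bytes.Equal(h(content), want), nil
}

// matchesHex checks an observer-reported hex hash against local content.
func matchesHex(content []byte, reportedHex string, h hashFunc) bool {
	return strings.EqualFold(hex.EncodeToString(h(content)), reportedHex)
}

func main() {
	original := []byte("contents of tests/system/test.txt")
	dataHash := encodeB64Hash(original, sha256Sum)
	ok, err := matchesB64Hash(original, dataHash, sha256Sum)
	fmt.Println(ok, err)
}
```

Passing the hash as a parameter keeps the comparison logic reusable once the actual algorithm is wired in.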
| 180 | + |
| 181 | +## DB/State Assertions To Query |
| 182 | + |
| 183 | +Primary tables: |
| 184 | + |
| 185 | +- `self_healing_challenge_events` |
| 186 | +- `self_healing_execution_metrics` |
| 187 | + |
| 188 | +Key columns to assert: |
| 189 | + |
| 190 | +- `challenge_id` |
| 191 | +- `status` (`pending`, `processing`, `retry`, `completed`, `terminal`) |
| 192 | +- `attempt_count` |
| 193 | +- `lease_owner`, `lease_expires_at` |
| 194 | +- `next_retry_at` |
| 195 | +- `last_error` |
| 196 | + |
| 197 | +## Execution Commands |
| 198 | + |
| 199 | +1. `make setup-supernodes` |
| 200 | +2. `make test-cascade` (baseline) |
| 201 | +3. `go test ./tests/system -run TestSelfHealing -v` |
| 202 | +4. `make test-e2e` |
| 203 | + |
| 204 | +## Acceptance Criteria |
| 205 | + |
| 206 | +1. All new self-healing E2E tests pass reliably on repeated runs. |
| 207 | +2. No dependence on local timing races; use polling with bounded timeouts. |
| 208 | +3. Happy path proves: |
| 209 | + - weighted trigger, |
| 210 | + - reconstruction via reseed, |
| 211 | + - observer quorum, |
| 212 | + - hash integrity. |
| 213 | +4. Failure-path tests prove: |
| 214 | + - retry policy, |
| 215 | + - terminal behavior, |
| 216 | + - restart-safe reclaim, |
| 217 | + - stale-window rejection. |
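The "polling with bounded timeouts" criterion can be met with a small helper along these lines. `pollUntil` and `errPollTimeout` are hypothetical names; in a real test the condition would typically query `self_healing_challenge_events` for the expected status.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errPollTimeout = errors.New("condition not met before timeout")

// pollUntil calls cond every interval until it returns true, returns an error,
// or the timeout elapses. cond is always evaluated at least once.
func pollUntil(timeout, interval time.Duration, cond func() (bool, error)) error {
	deadline := time.Now().Add(timeout)
	for {
		done, err := cond()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		if time.Now().After(deadline) {
			return errPollTimeout
		}
		time.Sleep(interval)
	}
}

func main() {
	calls := 0
	err := pollUntil(time.Second, 10*time.Millisecond, func() (bool, error) {
		calls++
		return calls >= 3, nil // stand-in for "status == completed"
	})
	fmt.Println(calls, err)
}
```

Returning a distinct timeout error lets failure-path tests tell "never reached the state" apart from a query error.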