Commit 39dc474

self-healing: harden processing flow and add e2e implementation docs
1 parent 40c96ce commit 39dc474

17 files changed

Lines changed: 2313 additions & 88 deletions

Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
# Master Prompt: Implement Self-Healing E2E Coverage

Use this prompt as-is with an implementation agent.

---

You are working in:

- Repo: `/Users/bilaltanveer/GolangProjects/src/Lumera Protocol/supernode`
- Branch: `jawad/self-healing-phase2-supernode`

## Objective

Implement production-grade system E2E coverage for self-healing, aligned with the current off-chain self-healing architecture.

Use this plan as authoritative guidance:

- `docs/self-healing-e2e-test-plan.md`

## Required Context To Read First

1. `tests/system/README.md`
2. `tests/system/e2e_cascade_test.go`
3. `tests/system/supernode-utils.go`
4. `supernode/self_healing/service.go`
5. `supernode/transport/grpc/self_healing/handler.go`
6. `supernode/cascade/reseed.go`
7. `pkg/storage/queries/self_healing.go`
8. `pkg/storage/queries/sqlite.go`
9. `proto/supernode/self_healing.proto`

## Constraints

1. Do not introduce on-chain phase-4 changes.
2. Keep behavior aligned with weighted watchlist + action-driven targets.
3. Do not revert unrelated branch changes.
4. Keep tests deterministic with bounded polling (no reliance on bare sleeps).
5. Reuse existing system harness patterns and conventions.

## Deliverables

1. Add self-healing E2E tests in `tests/system`:
   - `tests/system/e2e_self_healing_test.go`
   - `tests/system/self_healing_helpers.go`
2. Cover scenarios:
   - happy path (reseed + verify + complete),
   - recipient down retry/terminal,
   - observer quorum failure,
   - duplicate replay once-per-window,
   - restart-safe lease reclaim,
   - stale-window terminal,
   - bounded scale smoke.
3. Add hash-integrity assertions:
   - reconstructed data must match action metadata hash.
4. If needed for correctness, make minimal harness hardening changes (for example per-node HOME isolation in system tests so sqlite is node-local).
5. Keep code formatted (`gofmt`) and maintain readable block formatting.

## Test Requirements

1. Weighted trigger must be exercised through multi-reporter epoch reports (`AuditMsg().SubmitEpochReport`) and not via local-only shortcuts.
2. Reconstruction path must validate that the recipient healed via `RecoveryReseed` path semantics.
3. Assertions must include DB event status progression from `pending` to terminal/completed states where relevant.
4. Assertions must verify no duplicate processing for the same deterministic challenge ID in the same window.

## Execution Commands

Run and report output for:

1. `make setup-supernodes`
2. `go test ./tests/system -run TestSelfHealing -v`
3. `make test-e2e`

If environment-specific linker/toolchain constraints block a full run, still run all reachable subsets and explicitly report:

- what passed,
- what failed,
- exact blocker,
- why the blocker is environmental rather than a logic error.

## Quality Bar

Do not stop at the happy path.

Implementation is complete only when:

1. all listed scenarios are implemented,
2. assertions are explicit and meaningful,
3. no obvious race/flaky timing patterns remain,
4. code is clean and maintainable.

## Final Output Format

Provide:

1. concise summary of changes,
2. file-by-file change list,
3. test command results,
4. known risks or follow-ups.

---

docs/self-healing-e2e-test-plan.md

Lines changed: 217 additions & 0 deletions
@@ -0,0 +1,217 @@
# Self-Healing E2E Test Plan (System Harness)

## Goal

Add deterministic, production-grade system E2E coverage for self-healing in `tests/system`, aligned with the current off-chain phase:

- Trigger based on weighted watchlist (multi-reporter epoch reports).
- Challenge generation from on-chain CASCADE actions.
- Recipient-side reconstruction via `RecoveryReseed`.
- Observer quorum verification.
- Restart-safe once-per-window behavior.

## Out of Scope

- Any on-chain phase-4 capability/governance changes.
- Redesign of `x/audit` or `x/action` module semantics.
- Replacing current supernode runtime architecture.

## Existing Harness Baseline

- Use existing system harness and lifecycle already used by cascade tests:
  - `tests/system/main_test.go`
  - `tests/system/e2e_cascade_test.go`
  - `tests/system/supernode-utils.go`
- Existing startup commands:
  - `make setup-supernodes`
  - `make test-cascade`
  - `make test-e2e`

## Files To Add

- `tests/system/e2e_self_healing_test.go`
- `tests/system/self_healing_helpers.go`

## Optional Harness Hardening (Recommended First)

To ensure truly per-node event-state assertions, isolate node-local SQLite in system tests:

1. In `tests/system/supernode-utils.go`, set per-process `HOME` to each node data dir before `cmd.Start()`.
2. Keep one SQLite DB per supernode process (`$HOME/.supernode/history.db`).

If this is not done, system tests may still pass, but assertions that depend on node-local ownership/lease state can be ambiguous.
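
A minimal sketch of step 1, assuming the harness launches each supernode through `os/exec`. The helper names `withNodeHome` and `newNodeCmd`, the binary name, and the data directory are hypothetical and only illustrate the idea; the real launch code lives in `tests/system/supernode-utils.go`.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// withNodeHome returns a copy of env in which HOME points at nodeDir.
// Any existing HOME entry is dropped so the child sees exactly one value.
func withNodeHome(env []string, nodeDir string) []string {
	out := make([]string, 0, len(env)+1)
	for _, kv := range env {
		if strings.HasPrefix(kv, "HOME=") {
			continue
		}
		out = append(out, kv)
	}
	return append(out, "HOME="+nodeDir)
}

// newNodeCmd builds (but does not start) the command for one supernode.
// binary and args are placeholders for whatever the harness already runs.
func newNodeCmd(binary, nodeDir string, args ...string) *exec.Cmd {
	cmd := exec.Command(binary, args...)
	cmd.Env = withNodeHome(os.Environ(), nodeDir)
	return cmd
}

func main() {
	// Hypothetical binary name and data dir, for illustration only.
	cmd := newNodeCmd("supernode", "/tmp/supernode-1", "start")
	fmt.Println(cmd.Env[len(cmd.Env)-1]) // prints HOME=/tmp/supernode-1
}
```

Setting `cmd.Env` before `cmd.Start()` keeps the override local to the child process, so the test process and the other nodes are unaffected.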

## Test Fixture Design

Create one reusable fixture function, for example `setupSelfHealingFixture(t)`:

1. Start chain and register supernodes (reuse existing helper flow).
2. Start all supernodes.
3. Create one CASCADE action from `tests/system/test.txt` (reuse cascade helper/client logic).
4. Wait for action to reach `DONE`/`APPROVED`.
5. Capture:
   - `actionID`
   - expected data hash from action metadata (`CascadeMetadata.DataHash`)
   - candidate anchor key from metadata (the lexicographically smallest key in `RqIdsIds`)
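
The anchor-key selection in step 5 can be sketched as a plain string comparison. This assumes `RqIdsIds` is available to the test as a `[]string`; the helper name `smallestKey` is hypothetical.

```go
package main

import "fmt"

// smallestKey returns the lexicographically smallest non-empty key.
// The boolean is false when the slice holds no usable key.
func smallestKey(keys []string) (string, bool) {
	best, found := "", false
	for _, k := range keys {
		if k == "" {
			continue
		}
		if !found || k < best {
			best, found = k, true
		}
	}
	return best, found
}

func main() {
	key, ok := smallestKey([]string{"k-beta", "k-alpha", "k-gamma"})
	fmt.Println(key, ok) // prints k-alpha true
}
```

Returning a boolean lets the fixture fail fast with a clear message when the action metadata carries no keys.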

## Weighted Watchlist Trigger Setup

The self-healing trigger requires a weighted view, not a single local opinion.

Use `AuditMsg().SubmitEpochReport(...)` from multiple reporter nodes in the same epoch:

1. Query current epoch via `Audit().GetCurrentEpoch`.
2. Query audit params for:
   - `required_open_ports`
   - `peer_quorum_reports`
   - `peer_port_postpone_threshold_percent`
3. Submit reports from at least `peer_quorum_reports` distinct reporters against target holders.
4. For watchers intended to be flagged, make `closed_votes / total_votes >= threshold`.
5. Wait at least one block after report submission.
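
To decide how many reporters must vote "closed" in step 4, the ratio check can be done in integer arithmetic. This is a sketch that assumes the threshold is a whole-number percent and that the comparison is inclusive, as written in the plan; how the chain rounds in practice is not confirmed here. Both function names are hypothetical.

```go
package main

import "fmt"

// meetsThreshold reports whether closed/total >= percent/100,
// avoiding floating point by cross-multiplying.
func meetsThreshold(closed, total, percent uint64) bool {
	if total == 0 {
		return false
	}
	return closed*100 >= percent*total
}

// minClosedVotes returns the smallest closed-vote count that satisfies
// meetsThreshold for the given total, i.e. ceil(percent*total/100).
func minClosedVotes(total, percent uint64) uint64 {
	return (percent*total + 99) / 100
}

func main() {
	// Example values only; real values come from the audit params query.
	total, percent := uint64(3), uint64(66)
	need := minClosedVotes(total, percent)
	fmt.Println(need, meetsThreshold(need, total, percent)) // prints 2 true
}
```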

## Core E2E Scenarios

### 1) Happy Path: Request -> Reseed -> Verify -> Complete

Steps:

1. Ensure holders of the selected action anchor key are on weighted watchlist.
2. Wait for one generation/process window.
3. Let challenger emit challenge and process event.

Assertions:

1. One event exists per `(window, action_id)` challenge ID.
2. Recipient response accepted.
3. `reconstruction_required=true` when recipient was missing local data.
4. Observer verification reaches threshold.
5. Event status becomes `completed`.
6. Reconstructed content hash matches action metadata hash (see hash assertion section).

### 2) Recipient Down -> Retry -> Terminal

Steps:

1. Bring recipient process down before processing.
2. Let challenger process event across retries.

Assertions:

1. Event transitions to `retry` with backoff timestamps.
2. `attempt_count` increments on each claim.
3. After `max_event_attempts`, status becomes `terminal`.
4. Terminal reason indicates recipient/request failure.
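
One way to assert the progression above without depending on catching every intermediate state is to check invariants over polled samples. The sketch below assumes each poll yields a `(status, attempt_count)` pair read from the events table; the `snapshot` type and `checkProgression` are hypothetical names, and the two invariants (attempts never decrease, final states never change) are assumptions about the intended semantics.

```go
package main

import "fmt"

// snapshot is one polled observation of an event row.
type snapshot struct {
	Status       string
	AttemptCount int
}

// isFinal reports whether a status should never change once reached.
func isFinal(status string) bool {
	return status == "completed" || status == "terminal"
}

// checkProgression validates a time-ordered list of samples. It tolerates
// missed intermediate states because it only compares adjacent samples.
func checkProgression(samples []snapshot) error {
	for i := 1; i < len(samples); i++ {
		prev, cur := samples[i-1], samples[i]
		if cur.AttemptCount < prev.AttemptCount {
			return fmt.Errorf("sample %d: attempt_count dropped from %d to %d",
				i, prev.AttemptCount, cur.AttemptCount)
		}
		if isFinal(prev.Status) && cur.Status != prev.Status {
			return fmt.Errorf("sample %d: status left final state %q for %q",
				i, prev.Status, cur.Status)
		}
	}
	return nil
}

func main() {
	samples := []snapshot{
		{"pending", 0}, {"processing", 1}, {"retry", 1},
		{"processing", 2}, {"terminal", 2},
	}
	fmt.Println(checkProgression(samples)) // prints <nil>
}
```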

### 3) Observer Quorum Fail

Steps:

1. Keep recipient up.
2. Make enough observers unavailable, or have them return mismatching responses.

Assertions:

1. Verification messages captured for observers.
2. `ok_count < observer_threshold`.
3. Event is retried, then terminal when max attempts is reached.

### 4) Duplicate Replay / Once-Per-Window

Steps:

1. Trigger same generation window repeatedly.
2. Attempt duplicate insertion/replay of same challenge ID.

Assertions:

1. Only one row per challenge ID in `self_healing_challenge_events`.
2. Processing runs once for that window challenge identity.
3. No duplicate completion metrics for same sender/message type tuple.
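
Assertion 3 amounts to a uniqueness check over the metric rows read back from the database. The sketch below assumes each row can be reduced to a `(challenge ID, sender, message type)` tuple; the `metricKey` field names are hypothetical and may not match the actual columns of `self_healing_execution_metrics`.

```go
package main

import "fmt"

// metricKey identifies one logical metric entry. Field names are
// illustrative, not the real column names.
type metricKey struct {
	ChallengeID string
	Sender      string
	MessageType string
}

// duplicateMetrics returns every key that occurs more than once,
// in the order in which the second occurrence was seen.
func duplicateMetrics(rows []metricKey) []metricKey {
	seen := make(map[metricKey]int, len(rows))
	var dups []metricKey
	for _, r := range rows {
		seen[r]++
		if seen[r] == 2 {
			dups = append(dups, r)
		}
	}
	return dups
}

func main() {
	rows := []metricKey{
		{"c1", "node-a", "response"},
		{"c1", "node-b", "verification"},
		{"c1", "node-a", "response"},
	}
	fmt.Println(len(duplicateMetrics(rows))) // prints 1
}
```

A test would then require the returned slice to be empty and print the offending keys on failure.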

### 5) Restart-Safe Lease Reclaim

Steps:

1. Force challenger to claim event.
2. Kill challenger before completion update.
3. Wait for lease expiry.
4. Restart challenger.

Assertions:

1. Event is reclaimed.
2. Processing resumes and completes (or retries/terminal deterministically).
3. No double-complete state.
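
Before asserting that the event is reclaimed, a test needs to know when the lease is actually eligible for takeover. A minimal sketch, assuming the `lease_owner` and `lease_expires_at` columns map to a string and a timestamp, and that a lease becomes reclaimable at the exact expiry instant (the real boundary rule is not stated in the plan). The `lease` type, `reclaimable`, and `demoNow` are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// lease mirrors the two lease columns of an event row.
type lease struct {
	Owner     string
	ExpiresAt time.Time
}

// reclaimable reports whether another worker may claim the event at now:
// either nobody owns it, or the current lease has expired.
func reclaimable(l lease, now time.Time) bool {
	return l.Owner == "" || !now.Before(l.ExpiresAt)
}

// demoNow is a fixed reference time so the example is deterministic.
var demoNow = time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)

func main() {
	l := lease{Owner: "challenger-a", ExpiresAt: demoNow.Add(30 * time.Second)}
	fmt.Println(reclaimable(l, demoNow))                     // prints false
	fmt.Println(reclaimable(l, demoNow.Add(31*time.Second))) // prints true
}
```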

### 6) Stale Window Handling

Steps:

1. Insert or mutate an old event payload with stale `window_id`.
2. Run event processor tick.

Assertions:

1. Event is immediately marked `terminal`.
2. Reason is `stale_window`.
3. No request RPC is sent.

### 7) Scale/Throughput Smoke (Bounded)

Steps:

1. Generate many eligible targets (for system test keep bounded, e.g. 300-1000).
2. Configure `max_events_per_tick`, `event_workers`, and retry intervals for test runtime.

Assertions:

1. Processor drains queue over ticks without deadlock.
2. No unbounded goroutine growth.
3. Event throughput scales with `event_workers` until bounded by I/O.
4. DB lock errors do not appear persistently.

## Hash Integrity Assertion (Required)

For healed files, assert that the reconstructed hash equals the action metadata hash:

1. Decode `CascadeMetadata` for the action (`DataHash` is base64-encoded hash payload).
2. Validate reconstructed file/key hash using the same verification utility used in cascade flow (`cascadekit.VerifyB64DataHash` path).
3. Also assert that the observer-reported `reconstructed_hash_hex` matches the recipient's reconstructed local content.
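
The shape of these comparisons can be sketched with the standard library. This is not the `cascadekit` implementation: the hash algorithm is passed in as a parameter, SHA-256 is used purely as a stand-in for the demo, and standard (padded) base64 is an assumption. In real tests, step 2 should go through the existing cascade verification utility. All function names here are hypothetical.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/base64"
	"encoding/hex"
	"fmt"
	"hash"
)

// demoHash is a stand-in algorithm; the real one is whatever cascade uses.
func demoHash() hash.Hash { return sha256.New() }

// b64Sum hashes data and returns the digest as standard base64.
func b64Sum(data []byte, newHash func() hash.Hash) string {
	h := newHash()
	h.Write(data)
	return base64.StdEncoding.EncodeToString(h.Sum(nil))
}

// verifyB64Hash checks that data hashes to the base64-encoded digest.
func verifyB64Hash(data []byte, wantB64 string, newHash func() hash.Hash) error {
	want, err := base64.StdEncoding.DecodeString(wantB64)
	if err != nil {
		return fmt.Errorf("decode expected hash: %w", err)
	}
	h := newHash()
	h.Write(data)
	if got := h.Sum(nil); !bytes.Equal(got, want) {
		return fmt.Errorf("hash mismatch: got %x, want %x", got, want)
	}
	return nil
}

// hexMatchesB64 compares a hex digest (observer side) with a base64
// digest (metadata side) by decoding both to raw bytes.
func hexMatchesB64(hashHex, hashB64 string) (bool, error) {
	a, err := hex.DecodeString(hashHex)
	if err != nil {
		return false, fmt.Errorf("decode hex hash: %w", err)
	}
	b, err := base64.StdEncoding.DecodeString(hashB64)
	if err != nil {
		return false, fmt.Errorf("decode base64 hash: %w", err)
	}
	return bytes.Equal(a, b), nil
}

func main() {
	data := []byte("reconstructed file bytes")
	want := b64Sum(data, demoHash)
	fmt.Println(verifyB64Hash(data, want, demoHash) == nil) // prints true
}
```

Comparing decoded bytes rather than strings avoids false mismatches caused by hex case or base64 padding differences.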

## DB/State Assertions To Query

Primary tables:

- `self_healing_challenge_events`
- `self_healing_execution_metrics`

Key columns to assert:

- `challenge_id`
- `status` (`pending`, `processing`, `retry`, `completed`, `terminal`)
- `attempt_count`
- `lease_owner`, `lease_expires_at`
- `next_retry_at`
- `last_error`

## Execution Commands

1. `make setup-supernodes`
2. `make test-cascade` (baseline)
3. `go test ./tests/system -run TestSelfHealing -v`
4. `make test-e2e`

## Acceptance Criteria

1. All new self-healing E2E tests pass reliably on repeated runs.
2. No flaky dependence on local timing races; use polling with bounded timeouts.
3. Happy path proves:
   - weighted trigger,
   - reconstruction via reseed,
   - observer quorum,
   - hash integrity.
4. Failure-path tests prove:
   - retry policy,
   - terminal behavior,
   - restart-safe reclaim,
   - stale-window rejection.
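
Criterion 2 (polling with bounded timeouts) could be served by a small helper along these lines. The name `pollUntil` and its signature are hypothetical; the existing system harness may already provide an equivalent that should be preferred.

```go
package main

import (
	"fmt"
	"time"
)

// pollUntil calls cond every interval until it returns true, returns an
// error, or the timeout would be exceeded by the next sleep.
func pollUntil(timeout, interval time.Duration, cond func() (bool, error)) error {
	deadline := time.Now().Add(timeout)
	for {
		ok, err := cond()
		if err != nil {
			return fmt.Errorf("poll condition failed: %w", err)
		}
		if ok {
			return nil
		}
		if time.Now().Add(interval).After(deadline) {
			return fmt.Errorf("condition not met within %s", timeout)
		}
		time.Sleep(interval)
	}
}

func main() {
	calls := 0
	err := pollUntil(2*time.Second, 10*time.Millisecond, func() (bool, error) {
		calls++ // stands in for a DB query such as "is status completed?"
		return calls >= 3, nil
	})
	fmt.Println(err, calls) // prints <nil> 3
}
```

The condition is always evaluated at least once, and a failing query aborts immediately instead of being retried until the deadline.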

pkg/lumera/modules/action/action_mock.go

Lines changed: 15 additions & 0 deletions
Generated file; diff not rendered by default.

pkg/lumera/modules/action/impl.go

Lines changed: 12 additions & 0 deletions
@@ -57,3 +57,15 @@ func (m *module) GetParams(ctx context.Context) (*types.QueryParamsResponse, err
 
 	return resp, nil
 }
+
+// ListActions fetches actions with optional filters/pagination.
+func (m *module) ListActions(ctx context.Context, req *types.QueryListActionsRequest) (*types.QueryListActionsResponse, error) {
+	if req == nil {
+		req = &types.QueryListActionsRequest{}
+	}
+	resp, err := m.client.ListActions(ctx, req)
+	if err != nil {
+		return nil, fmt.Errorf("failed to list actions: %w", err)
+	}
+	return resp, nil
+}

pkg/lumera/modules/action/interface.go

Lines changed: 1 addition & 0 deletions
@@ -13,6 +13,7 @@ type Module interface {
 	GetAction(ctx context.Context, actionID string) (*types.QueryGetActionResponse, error)
 	GetActionFee(ctx context.Context, dataSize string) (*types.QueryGetActionFeeResponse, error)
 	GetParams(ctx context.Context) (*types.QueryParamsResponse, error)
+	ListActions(ctx context.Context, req *types.QueryListActionsRequest) (*types.QueryListActionsResponse, error)
 }
 
 // NewModule creates a new Action module client
