diff --git a/docs/fma.md b/docs/fma.md new file mode 100644 index 00000000..cde5e704 --- /dev/null +++ b/docs/fma.md @@ -0,0 +1,524 @@ +# pg_durable Failure Mode Analysis + +**Status**: Draft +**Created**: 2026-03-24 + +--- + +## 1. Overview + +This document catalogs production failure modes for pg_durable, a PostgreSQL extension providing durable SQL function execution. pg_durable runs entirely inside the PostgreSQL server — a single background worker process orchestrates durable functions via [duroxide](https://github.com/anthropics/duroxide), while user sessions build function graphs through DSL operators. + +**Scope**: Failures that affect a pg_durable deployment on a PostgreSQL-as-a-Service (PaaS) platform. Covers the background worker, activity execution, orchestration logic, client-side DSL, extension lifecycle, and operational concerns. + +**Methodology**: Static analysis of `src/`, `sql/`, and `tests/` combined with architectural reasoning about the PostgreSQL process model and duroxide runtime behavior. + +--- + +## 2. Severity Definitions + +| Level | Definition | +|-------|-----------| +| **SEV-1** | All durable functions for all users are blocked or data loss is possible. Requires immediate operator intervention. | +| **SEV-2** | Subset of users or workloads affected, or degraded functionality system-wide. | +| **SEV-3** | Single-user or single-instance failure with no broader impact. Self-recoverable or cosmetic. | + +--- + +## 3. Failure Modes + +### FM-1: Background Worker Fails to Start + +| Attribute | Detail | +|-----------|--------| +| **Severity** | SEV-1 | +| **Component** | `src/worker.rs` — `duroxide_worker_main` | +| **Trigger** | Tokio runtime creation fails (fd exhaustion, OOM), `shared_preload_libraries` misconfigured, or PostgreSQL crashes the worker on startup. | +| **Impact** | No durable functions execute. `df.start()` succeeds (rows written to `df.instances`/`df.nodes`), but instances remain `pending` indefinitely. 
Users see workflows that never progress. | +| **Detection — existing** | PostgreSQL log: `"pg_durable: failed to create tokio runtime: {}"`. PostgreSQL's built-in `pg_stat_activity` shows no `pg_durable_worker` background worker. The `df._worker_epoch` table remains empty. | +| **Detection — gap** | **No health-check SQL function** (e.g., `df.worker_alive()`) exists for users or monitoring dashboards to query. The only detection is log-scraping or direct system catalog inspection. | +| **Programmatic mitigation** | PostgreSQL automatically restarts the background worker 5 seconds after it exits, as configured by `set_restart_time(Some(Duration::from_secs(5)))`. The worker registers with `BgWorkerStartTime::RecoveryFinished`, so it starts only after crash recovery completes. | +| **Process mitigation** | PaaS alerting on `pg_stat_activity` background worker presence. Log-based alert on `"failed to create tokio runtime"`. | +| **User recommendation** | Run `SELECT * FROM df._worker_epoch`: a recent `last_seen_at` timestamp confirms the worker is alive. If the table is empty or `last_seen_at` is stale, contact your database administrator. | + +--- + +### FM-2: Worker Cannot Connect to PostgreSQL + +| Attribute | Detail | +|-----------|--------| +| **Severity** | SEV-1 | +| **Component** | `src/worker.rs` — poll pool creation, store initialization | +| **Trigger** | The `pg_durable.worker_role` GUC names a role that doesn't exist, has been dropped, or whose password/auth has changed. The `pg_durable.database` GUC names a database that doesn't exist. Network-level issues on loopback (rare for local connections). | +| **Impact** | Worker enters an infinite retry loop. Logs fill with `"failed to create polling pool (will retry in 5s): {}"` or `"failed to create PostgreSQL store (will retry): {}"`. No durable functions execute. | +| **Detection — existing** | PostgreSQL log messages every 1–5 seconds. `df._worker_epoch` table stays empty. | +| **Detection — gap** | **No retry counter or backoff telemetry**.
An operator reading logs sees repeated errors but has no metric for "worker has been retrying for N minutes". No alerting hook. | +| **Programmatic mitigation** | Retry loops are infinite with fixed intervals (5s for poll pool, 1s for store). The worker checks `is_shutdown_requested()` between retries to allow clean shutdown. | +| **Process mitigation** | PaaS validation at provisioning time: ensure worker role exists, is a superuser, and can authenticate. Alert on repeated `"will retry"` log patterns. | +| **User recommendation** | If no workflows are executing, ask the database administrator to verify that the `pg_durable.worker_role` (`SHOW pg_durable.worker_role`) exists and has superuser privileges. | + +--- + +### FM-3: Worker Role Is Not a Superuser + +| Attribute | Detail | +|-----------|--------| +| **Severity** | SEV-1 | +| **Component** | `src/lib.rs` — extension SQL validation, `src/worker.rs` — runtime operation | +| **Trigger** | The role named by `pg_durable.worker_role` exists but is not a superuser. | +| **Impact** | Worker connects successfully and starts the duroxide runtime, but RLS policies filter out all rows in `df.instances` and `df.nodes` for the worker's activities. `load_function_graph` finds no instance. `update_instance_status` and `update_node_status` update zero rows. All workflows stall at `pending` or `running` with no progress. This is a **silent failure** — no error is raised. | +| **Detection — existing** | Extension SQL emits `RAISE WARNING 'pg_durable: worker role "..." is NOT a superuser...'` at `CREATE EXTENSION` time. Activity traces show `"Instance {id} not found after 5s"` in duroxide logs. `df.metrics()` shows `running_instances` climbing while `completed_instances` stays flat. | +| **Detection — gap** | **The warning at extension creation is easily missed.** There is no recurring health check that validates the worker role's privilege level. 
The activity failure message doesn't distinguish "RLS filtered" from "genuinely missing instance". | +| **Programmatic mitigation** | The `CREATE EXTENSION` SQL includes a `DO $$` block that checks `rolsuper` for the worker role. However, it only emits a `WARNING`, not an `EXCEPTION`. | +| **Process mitigation** | PaaS should enforce that the worker role is superuser as part of the managed PostgreSQL setup. Consider promoting the warning to an error. | +| **User recommendation** | Users cannot fix this themselves; it's a platform configuration issue. Symptom: all workflows stuck at `pending`/`running`. | + +> **Recommendation**: Promote the `RAISE WARNING` at extension creation to `RAISE EXCEPTION` so that `CREATE EXTENSION pg_durable` fails fast if the worker role isn't a superuser. + +--- + +### FM-4: Extension Created in Wrong Database + +| Attribute | Detail | +|-----------|--------| +| **Severity** | SEV-2 | +| **Component** | `src/lib.rs` — database validation SQL | +| **Trigger** | User runs `CREATE EXTENSION pg_durable` in a database other than the one configured in `pg_durable.database`. | +| **Impact** | Extension tables exist in one database; the background worker connects to a different one. Workflows submitted in the wrong database are never picked up. | +| **Detection — existing** | Extension SQL includes a `DO $$` block that checks `current_database()` against `current_setting('pg_durable.database')` and raises `EXCEPTION` in production builds or `NOTICE` in test builds. | +| **Detection — gap** | In test builds (which may leak to staging), only a `NOTICE` is emitted, not an error. | +| **Programmatic mitigation** | The database check in `CREATE EXTENSION` prevents creation in the wrong database (in production builds). | +| **Process mitigation** | PaaS should create the extension as part of managed provisioning, targeting the correct database. 
| +| **User recommendation** | If you receive a database mismatch error, run `CREATE EXTENSION` in the database shown by `SHOW pg_durable.database`. | + +--- + +### FM-5: Transaction Visibility Race (df.start → Worker Pickup) + +| Attribute | Detail | +|-----------|--------| +| **Severity** | SEV-3 | +| **Component** | `src/activities/load_function_graph.rs` | +| **Trigger** | `df.start()` inserts into `df.instances` and `df.nodes`, then calls `start_orchestration()` on the duroxide client. The duroxide runtime may schedule the orchestration's `load_function_graph` activity before the user's transaction commits. | +| **Impact** | The activity's SQL query against `df.instances` finds no row (transaction not yet visible). If the user's transaction takes longer than 5 seconds to commit, the activity fails with `"Instance {id} not found after 5s"`. The instance transitions to `failed`. | +| **Detection — existing** | Activity trace: `"Instance {id} not yet visible, waiting for transaction commit..."` followed by `"Instance {id} not found after 5s"`. | +| **Detection — gap** | **No metric** for how often the 5-second retry window is hit, or how close to the limit activities get. | +| **Programmatic mitigation** | `load_function_graph` retries with 100ms polling for up to 5 seconds (`MAX_WAIT_SECS`). This handles the common case where the commit is milliseconds away. | +| **Process mitigation** | Document that `df.start()` should be called near the end of a transaction, not inside a long-running transaction with many preceding statements. | +| **User recommendation** | Call `df.start()` as the last operation before `COMMIT`. If workflows fail immediately with "not found", your transaction may be too long. 
| + +--- + +### FM-6: User SQL Execution Failure + +| Attribute | Detail | +|-----------|--------| +| **Severity** | SEV-3 | +| **Component** | `src/activities/execute_sql.rs` | +| **Trigger** | The SQL query in a `df.sql()` node contains a syntax error, references a non-existent table/column, or the `submitted_by` user lacks the required privileges. | +| **Impact** | The activity returns an error. The orchestration marks the node as `failed` and propagates the error. The instance transitions to `failed` with the SQL error message in the output. **This is expected behavior** — user SQL errors are surfaced correctly. | +| **Detection — existing** | Activity trace: `"SQL execution failed: {}"`. Node status in `df.nodes` set to `failed` with the error in the `result` column. Instance status in `df.instances` set to `failed`. `df.status()` returns `'failed'`. `df.result()` accessible for error diagnosis. | +| **Detection — gap** | None significant. Error reporting is good. | +| **Programmatic mitigation** | None needed — this is correct error propagation. | +| **User recommendation** | Check `SELECT * FROM df.instance_nodes('your-instance-id')` to see which node failed and the error message. Fix the SQL and re-submit. | + +--- + +### FM-7: User SQL Connection/Authentication Failure + +| Attribute | Detail | +|-----------|--------| +| **Severity** | SEV-2 (if systemic) or SEV-3 (if isolated) | +| **Component** | `src/activities/execute_sql.rs` — `connect_as_user()` | +| **Trigger** | The `execute_sql` activity connects to PostgreSQL as the `login_role` and then `SET ROLE` to `submitted_by`. Connection may fail if: (a) PostgreSQL is at its `max_connections` limit, (b) the login role's password changed, (c) `pg_hba.conf` rejects the connection, (d) the role was dropped after `df.start()`. | +| **Impact** | Activity fails with `"Failed to connect as ..."`. Instance transitions to `failed`. 
If `max_connections` is exhausted, this affects all concurrent workflow executions, not just one user. | +| **Detection — existing** | Activity trace: `"Failed to connect..."` or `"SET ROLE ... failed: {}"`. | +| **Detection — gap** | **No connection pool metrics** for the per-user sqlx connections. No visibility into how many concurrent activity connections are open. No correlation between `max_connections` pressure and durable function failures. | +| **Programmatic mitigation** | Each SQL activity opens a fresh connection (no pooling for user connections, which is correct for `SET ROLE` isolation). Duroxide may retry the activity depending on its retry policy. | +| **Process mitigation** | PaaS should monitor `max_connections` utilization and alert when approaching capacity. Reserve connections for the worker role. | +| **User recommendation** | If workflows fail with connection errors, check whether your database is at connection capacity. Reduce the concurrent workflow count or increase `max_connections`. | + +> **Recommendation**: Add connection-count telemetry or, at minimum, log the active connection count when a connection attempt fails. + +--- + +### FM-8: HTTP Activity Failure (Network / Remote Server) + +| Attribute | Detail | +|-----------|--------| +| **Severity** | SEV-3 | +| **Component** | `src/activities/execute_http.rs` | +| **Trigger** | Target HTTP server returns 5xx, connection times out, DNS resolution fails, or the remote server is unreachable. | +| **Impact** | Activity fails with a descriptive error (`"HTTP timeout after {timeout}s"`, `"HTTP connection failed"`, `"HTTP request failed: status {code}"`). Instance transitions to `failed`. | +| **Detection — existing** | Activity traces: `"HTTP {method} completed: status={status}, ok={ok}, duration={duration}ms"`. SSRF blocks logged separately. | +| **Detection — gap** | **No automatic retry for transient HTTP errors.** A single 503 fails the entire workflow.
**No histogram of HTTP latencies** or error-rate metric. | +| **Programmatic mitigation** | Timeout is user-configurable (`df.http()` `timeout_seconds` parameter, default 30s). Redirect following is disabled to prevent SSRF bypass. | +| **Process mitigation** | Document that users should wrap HTTP calls in retry logic using `df.loop()` with error handling if they need resilience against transient failures. | +| **User recommendation** | Set appropriate `timeout_seconds` for your endpoint. For resilience against transient failures, wrap HTTP nodes in a loop with a condition that checks for success. Consider using `df.http()` with explicit error handling. | + +> **Recommendation**: Consider adding a built-in retry option to `df.http()` (e.g., `retries` parameter with exponential backoff) for transient HTTP errors (429, 502, 503, 504). + +--- + +### FM-9: SSRF Attempt / Blocked Request + +| Attribute | Detail | +|-----------|--------| +| **Severity** | SEV-3 (security event, not a system failure) | +| **Component** | `src/ssrf.rs`, `src/activities/execute_http.rs` | +| **Trigger** | User-submitted URL targets a private IP range (10.x, 172.16.x, 192.168.x, 169.254.x, loopback), uses a non-HTTP(S) scheme, or DNS resolves to a blocked IP. | +| **Impact** | Activity fails with `"BLOCKED: ..."`. Workflow transitions to `failed`. This is correct defensive behavior. | +| **Detection — existing** | Activity traces include audit fields: `"HTTP BLOCKED (scheme\|ip) url={url} submitted_by={user} login_role={role}"`. These are logged at the duroxide activity trace level. | +| **Detection — gap** | **No dedicated security event stream** for SSRF blocks. Traces are mixed with normal activity logs. A PaaS security team would need to grep duroxide traces for `"BLOCKED"` — there's no structured security audit log or counter metric. **The `no-ssrf-protection` feature flag**, if accidentally enabled in a production build, disables all protection silently. 
| +| **Programmatic mitigation** | Three-layer defense: URL scheme validation, IP literal check, DNS resolution filtering via `SsrfSafeResolver`. Redirect following disabled. IPv4-mapped IPv6 addresses unwrapped and checked. | +| **Process mitigation** | PaaS build pipeline should verify `no-ssrf-protection` feature is not enabled. Security team should have alerts on `"HTTP BLOCKED"` patterns in logs. | +| **User recommendation** | If your HTTP request is blocked and the target is a legitimate external service, verify the URL resolves to a public IP address. Private IP ranges and cloud metadata endpoints are blocked by design. | + +> **Recommendation**: Emit a structured security event (e.g., to a separate `df.security_events` table or a dedicated log channel) for all SSRF blocks, including the requesting user, URL, and block reason. Add a build-time assertion that `no-ssrf-protection` is never enabled alongside `pg17` (the production feature). + +--- + +### FM-10: Orchestration Deadlock / Infinite Loop + +| Attribute | Detail | +|-----------|--------| +| **Severity** | SEV-2 | +| **Component** | `src/orchestrations/execute_function_graph.rs` | +| **Trigger** | A `df.loop()` with no condition and a body that never calls `df.break()`. Or a condition that always evaluates to true. The loop uses `continue_as_new` for each iteration, so duroxide creates a new execution per iteration. | +| **Impact** | The instance runs indefinitely, consuming duroxide execution history. Each iteration creates duroxide events/state. The loop doesn't block other instances (duroxide dispatches independently), but it accumulates storage in `duroxide.*` tables unboundedly. | +| **Detection — existing** | `df.metrics()` shows `running_instances` staying elevated. `df.instance_executions()` shows a growing execution count. Orchestration traces log `"Continuing as new for next loop iteration"` repeatedly. 
| +| **Detection — gap** | **No max-iteration limit.** **No max-execution-duration limit.** **No alerting threshold** for instances that have been running longer than N minutes/hours. No per-instance resource consumption metric. | +| **Programmatic mitigation** | `continue_as_new` prevents orchestration history from growing unboundedly within a single execution (each iteration is a fresh execution). Users can `df.cancel()` a runaway instance. | +| **Process mitigation** | PaaS should implement a TTL or max-duration policy for durable function instances. Alert on instances running longer than a configurable threshold. | +| **User recommendation** | Always include a termination condition in `df.loop()`. Monitor long-running instances with `df.list_instances('running')` and cancel with `df.cancel()` if needed. | + +> **Recommendation**: Add a `max_iterations` parameter to `df.loop()` (default: unbounded, but warn in docs). Add a system-wide GUC `pg_durable.max_instance_duration_seconds` that auto-cancels instances exceeding the limit. + +--- + +### FM-11: JOIN Branch Failure (Partial Failure in Parallel Execution) + +| Attribute | Detail | +|-----------|--------| +| **Severity** | SEV-3 | +| **Component** | `src/orchestrations/execute_function_graph.rs` — JOIN handling | +| **Trigger** | One branch of a `df.join()` (parallel execution) fails while others succeed. | +| **Impact** | The entire JOIN fails. **All branch results are discarded**, including successful ones. The instance transitions to `failed`. This is the correct semantic (all-or-nothing parallel execution), but may surprise users who expect partial results. | +| **Detection — existing** | Node status for the failed branch in `df.instance_nodes()`. Orchestration output contains the error from the failing branch. | +| **Detection — gap** | **No visibility into which branches succeeded before the JOIN was marked failed.** Successful branch results are lost. 
| +| **Programmatic mitigation** | None — this is the defined JOIN semantic. | +| **User recommendation** | If you need partial-failure tolerance in parallel execution, wrap each branch's SQL in its own error handling (e.g., `BEGIN...EXCEPTION...END` in PL/pgSQL) so it returns an error value instead of raising an exception. | + +--- + +### FM-12: RACE Semantics — Losing Branch Continues Running + +| Attribute | Detail | +|-----------|--------| +| **Severity** | SEV-3 | +| **Component** | `src/orchestrations/execute_function_graph.rs` — RACE handling | +| **Trigger** | A `df.race()` completes when the first branch finishes. The "losing" branch is **not cancelled** — it continues executing. | +| **Impact** | Side effects from the losing branch (SQL writes, HTTP calls) still occur even though the RACE result has been determined. This can cause unexpected mutations or duplicate HTTP requests. Resource waste from the abandoned branch. | +| **Detection — existing** | The losing branch's node status in `df.instance_nodes()` will eventually show `completed` or `failed` independently. | +| **Detection — gap** | **No clear indication** in the RACE result that the losing branch is still running. No log distinguishing "race winner" from "race loser (still running)". | +| **Programmatic mitigation** | None — duroxide `select2` doesn't cancel the loser. | +| **Process mitigation** | Document this behavior prominently: RACE does not cancel the losing branch. Users must ensure losing branches are idempotent or side-effect-free. | +| **User recommendation** | Use `df.race()` only when both branches are safe to run to completion independently. Do not use RACE if losing branches have destructive side effects (e.g., DELETE statements). 
| + +--- + +### FM-13: Variable Substitution — Unset Variables + +| Attribute | Detail | +|-----------|--------| +| **Severity** | SEV-3 | +| **Component** | `src/types.rs` — `substitute_all()` | +| **Trigger** | A SQL query references `$varname` but the variable was never set via `df.setvar()` or captured via `\|=>`. | +| **Impact** | The literal string `$varname` is left in the query unchanged. The query is sent to PostgreSQL as-is, which will likely produce a syntax error or unexpected behavior (e.g., `$varname` could be interpreted as a dollar-quoted string boundary). The activity fails with a SQL error. | +| **Detection — existing** | Activity trace: `"SQL execution failed: ..."` with the raw query visible in `"Executing SQL: {final_query}"`. | +| **Detection — gap** | **No warning at substitution time** that a referenced variable was not found. The substitution silently passes through unknown references. | +| **Programmatic mitigation** | None — `substitute_all()` uses `String::replace()` which is a no-op for missing keys. | +| **Process mitigation** | Document variable substitution behavior and the requirement to set variables before referencing them. | +| **User recommendation** | Set all variables with `df.setvar()` before calling `df.start()`. Use `\|=> 'name'` to capture intermediate results. Check `df.instance_nodes()` output to see the actual executed SQL if a node fails. | + +> **Recommendation**: Log a warning in `substitute_all()` when a `$varname` pattern is present in the query but no matching variable is found in the vars map. + +--- + +### FM-14: Extension DROP While Workflows Are Running + +| Attribute | Detail | +|-----------|--------| +| **Severity** | SEV-1 | +| **Component** | `src/worker.rs` — epoch sentinel, extension lifecycle | +| **Trigger** | An administrator runs `DROP EXTENSION pg_durable CASCADE` while durable functions are in-flight. | +| **Impact** | The `df.instances`, `df.nodes`, and `df.vars` tables are dropped. 
Running orchestrations lose their state tables. The worker detects the extension drop via the epoch sentinel (or extension-existence polling) and returns to the "waiting for extension" state. Duroxide runtime is shut down with a 10-second timeout. **In-flight activities that are mid-SQL-execution may fail with "relation does not exist" errors.** All instance data is permanently lost. | +| **Detection — existing** | Worker log: `"pg_durable: epoch sentinel gone — extension dropped or recreated"`. Worker log: `"pg_durable: initiating duroxide runtime shutdown..."`. | +| **Detection — gap** | **No pre-drop safety check.** PostgreSQL allows `DROP EXTENSION` even with active instances. No advisory lock or "in-use" guard. **Duroxide state in `duroxide.*` tables may or may not be dropped** depending on whether they're owned by the extension. | +| **Programmatic mitigation** | The worker gracefully shuts down the duroxide runtime (10s timeout). After extension re-creation, the worker re-initializes. | +| **Process mitigation** | PaaS should restrict `DROP EXTENSION` to maintenance windows. Document the data-loss implications. Consider an event trigger that warns when durable functions are active. | +| **User recommendation** | Never drop the extension while workflows are running. Check `SELECT count(*) FROM df.instances WHERE status IN ('pending', 'running')` before dropping. | + +> **Recommendation**: Add an event trigger or pre-drop check that warns/blocks if active instances exist. + +--- + +### FM-15: Duroxide State Corruption / Schema Drift + +| Attribute | Detail | +|-----------|--------| +| **Severity** | SEV-1 | +| **Component** | duroxide-pg-opt provider, `sql/duroxide_install.sql` | +| **Trigger** | The `duroxide.*` schema tables become corrupted (e.g., manual edits, failed migration, storage corruption), or the pg_durable extension is compiled against a different version of duroxide-pg-opt than what's in the database. 
| +| **Impact** | `PostgresProvider::new_with_config()` fails schema validation, entering the infinite retry loop. Or, runtime starts but produces incorrect behavior (events lost, wrong execution order, duplicate activity dispatches). | +| **Detection — existing** | Worker log: `"failed to create PostgreSQL store (will retry): {}"` with schema validation errors. The `verify-duroxide-migrations.sh` script ensures compile-time consistency. | +| **Detection — gap** | **No runtime schema version check** after initial startup. If tables are altered while running, behavior is undefined. **No checksum or version stamp** in the duroxide schema for runtime verification. | +| **Programmatic mitigation** | The provider uses `MigrationPolicy::VerifyOnly` (never auto-migrates; only verifies schema matches expectations). CI runs `verify-duroxide-migrations.sh` on every PR. | +| **Process mitigation** | PaaS should never allow direct DDL on `duroxide.*` tables. Schema modifications only through extension upgrades. | +| **User recommendation** | Do not modify tables in the `duroxide` schema directly. If workflows stop processing, contact your database administrator. | + +--- + +### FM-16: Single Background Worker Bottleneck + +| Attribute | Detail | +|-----------|--------| +| **Severity** | SEV-2 | +| **Component** | `src/worker.rs` — single worker architecture | +| **Trigger** | High volume of concurrent durable function submissions. Each SQL activity opens a synchronous database connection. Long-running SQL queries block the activity thread. | +| **Impact** | Duroxide dispatches activities to a thread pool within the single background worker process. Under high load, the thread pool saturates. New orchestrations and activities queue up. Latency increases for all users. With many long-running SQL queries, the worker's connection count approaches PostgreSQL's `max_connections`. | +| **Detection — existing** | `df.metrics()` shows `running_instances` count. 
`df.list_instances('pending')` shows queued work. Increasing gap between `df.start()` time and first activity execution visible in `df.instance_nodes()` timestamps. | +| **Detection — gap** | **No queue depth metric.** **No activity throughput metric** (activities/second). **No worker thread pool utilization metric.** **No p50/p95/p99 latency metric** for activity execution or end-to-end instance completion. These are critical for capacity planning. | +| **Programmatic mitigation** | Duroxide's internal dispatcher handles concurrency. `continue_as_new` for loops limits per-execution history growth. | +| **Process mitigation** | PaaS should establish capacity guidelines (max concurrent instances per database size/tier). Monitor queue depth trends. | +| **User recommendation** | If workflows are slow to start, check the pending instance count with `SELECT count(*) FROM df.instances WHERE status = 'pending'`. Avoid submitting many long-running SQL workflows simultaneously. | + +> **Recommendation**: Expose worker thread pool metrics via `df.metrics()` — add fields for `pending_activities`, `active_activities`, `activity_throughput_per_min`. Consider adding a GUC for max concurrent activities. + +--- + +### FM-17: PostgreSQL Restart / Failover + +| Attribute | Detail | +|-----------|--------| +| **Severity** | SEV-2 | +| **Component** | Worker lifecycle, duroxide runtime | +| **Trigger** | PostgreSQL server restarts (planned maintenance, crash recovery, HA failover). | +| **Impact** | Background worker process terminates. All in-flight activities are interrupted. On restart, the worker re-initializes: creates new Tokio runtime, reconnects, re-creates the duroxide runtime. Duroxide replays incomplete orchestrations from their last checkpoint. **Activities that were mid-execution are re-dispatched** (duroxide's at-least-once guarantee). SQL queries that were partially executed may be re-executed. 
| +| **Detection — existing** | Worker log: `"pg_durable: duroxide background worker starting..."` (after restart). The `df._worker_epoch` table gets a new epoch UUID. | +| **Detection — gap** | **No metric for "time since last worker restart"** or "worker uptime". **No explicit log** distinguishing a fresh start from a restart-after-crash. **No replay counter** showing how many orchestrations were replayed after restart. | +| **Programmatic mitigation** | Duroxide's durable execution model handles this: orchestrations replay deterministically, activities that completed are not re-executed (their results are in the event log), activities that were in-flight are re-dispatched. | +| **Process mitigation** | PaaS should monitor worker restarts. Frequent restarts indicate an underlying issue. | +| **User recommendation** | Durable functions survive PostgreSQL restarts by design. If a workflow was `running` before a restart, it will resume automatically. Ensure your SQL operations are idempotent where possible, as in-flight activity SQL may be re-executed. | + +> **Critical user guidance**: SQL activities have **at-least-once** execution semantics. A SQL statement that was executing when PostgreSQL restarted will be re-executed after recovery. **Users must design their SQL to be idempotent** (e.g., use `INSERT ... ON CONFLICT`, `UPDATE ... WHERE` with guards, not bare `INSERT`). + +--- + +### FM-18: Column Type Extraction Loss in SQL Results + +| Attribute | Detail | +|-----------|--------| +| **Severity** | SEV-3 | +| **Component** | `src/activities/execute_sql.rs` — result row mapping | +| **Trigger** | A SQL query returns a column of a PostgreSQL type not handled by the type-extraction cascade (which tries: String, i64, i32, bool, f64 in order). Custom types, arrays, composite types, `bytea`, `uuid`, `inet`, etc. | +| **Impact** | The column value is silently replaced with `null` in the JSON result. The workflow continues with partial data. 
Downstream nodes that depend on this value see `null` instead of the actual value. | +| **Detection — existing** | None — the fallback to `null` is silent. The activity trace logs `"SQL returned N rows"` without indicating data loss. | +| **Detection — gap** | **No warning when a column value falls through all type extractors to null.** This is a silent data loss bug. | +| **Programmatic mitigation** | The type cascade covers the most common types. Users can `CAST` to supported types in their SQL. | +| **User recommendation** | Cast complex column types to `text` in your SQL queries (e.g., `SELECT my_uuid::text, my_array::text FROM ...`) to ensure values are captured in the result. | + +> **Recommendation**: Log a warning (via `ctx.trace_info`) when a column value falls through all type extractors to null, including the column name and PostgreSQL type OID. + +--- + +### FM-19: `continue_as_new` Serialization Failure in Loops + +| Attribute | Detail | +|-----------|--------| +| **Severity** | SEV-2 | +| **Component** | `src/orchestrations/execute_function_graph.rs` — loop iteration | +| **Trigger** | The orchestration input (including accumulated variable state and graph) fails to serialize for the `continue_as_new` call. This could happen if the state has grown very large (many variables with large values) or contains non-serializable data. | +| **Impact** | The `unwrap_or(...)` fallback provides a minimal input (just the instance ID), potentially losing accumulated loop state including variables, iteration results, and context. The next iteration starts with degraded state, which may cause incorrect behavior or errors. | +| **Detection — existing** | No explicit log for serialization failure — the `unwrap_or` silently degrades. | +| **Detection — gap** | **Complete blind spot.** No logging, no metric, no indication that state was lost during `continue_as_new`. 
| +| **Programmatic mitigation** | The `unwrap_or` prevents a panic but trades correctness for availability. | +| **Process mitigation** | None currently. | +| **User recommendation** | Keep workflow variable counts and sizes reasonable. Avoid storing large result sets in named variables (`\|=> 'name'`). | + +> **Recommendation**: Replace the `unwrap_or` with explicit error handling that logs a warning and/or fails the orchestration cleanly rather than silently degrading. + +--- + +### FM-20: Stale Client Connection in Backend Processes + +| Attribute | Detail | +|-----------|--------| +| **Severity** | SEV-2 | +| **Component** | `src/client.rs` — `OnceLock` | +| **Trigger** | A backend process (user session) creates a duroxide `Client` via `get_duroxide_client()` on the first call. The client holds a connection pool to the duroxide store. If the underlying connection becomes stale (e.g., after a network partition heals, pgbouncer timeout, or connection idle timeout), subsequent calls fail. | +| **Impact** | `df.start()`, `df.cancel()`, `df.signal()` fail for that backend session with connection errors. Since `OnceLock` initializes only once, the stale client persists for the lifetime of the backend process. The user must disconnect and reconnect to get a fresh client. | +| **Detection — existing** | `pgrx::error!` with `"Failed to start durable function: ..."` or similar. | +| **Detection — gap** | **No health-check or reconnection logic** for the cached client. No metric for client connection age or staleness. | +| **Programmatic mitigation** | sqlx's built-in pool management handles some connection recycling, but the pool configuration isn't tuned for long-lived backend processes. | +| **Process mitigation** | PaaS connection management (e.g., pgbouncer) should be configured with timeouts compatible with pg_durable's connection caching. | +| **User recommendation** | If `df.start()` fails with a connection error, disconnect your session and reconnect. 
The new session will create a fresh client. | + +> **Recommendation**: Add connection validation (e.g., test query before use) or TTL-based client recycling to `get_duroxide_client()`. + +--- + +### FM-21: Duroxide Runtime Shutdown Timeout + +| Attribute | Detail | +|-----------|--------| +| **Severity** | SEV-2 | +| **Component** | `src/worker.rs` — `duroxide_runtime.shutdown(Some(10_000))` | +| **Trigger** | During PostgreSQL shutdown or extension drop, the duroxide runtime is given 10 seconds to complete shutdown. If activities are mid-execution (e.g., a long-running SQL query or HTTP request), they may not complete within this window. | +| **Impact** | Activities are forcibly terminated. In-flight SQL statements are rolled back by PostgreSQL. Orchestrations are interrupted mid-execution. On next startup, duroxide replays from the last checkpoint, re-dispatching interrupted activities. **During the shutdown window, the PostgreSQL shutdown is delayed by up to 10 seconds**, which may cause PaaS health checks to flag the instance. | +| **Detection — existing** | Worker log: `"pg_durable: initiating duroxide runtime shutdown..."` followed by `"pg_durable: duroxide runtime shutdown complete"`. | +| **Detection — gap** | **No log indicating whether shutdown completed within the timeout or was forcibly terminated.** The 10s timeout is fire-and-forget: nothing records whether activities stopped cleanly. | +| **Programmatic mitigation** | The 10s timeout is hardcoded. The Tokio runtime has a separate 5s `shutdown_timeout`. Duroxide replay handles interrupted work. | +| **Process mitigation** | PaaS should configure PostgreSQL shutdown timeouts to accommodate the 10s duroxide shutdown + 5s Tokio shutdown. | +| **User recommendation** | Long-running workflows will resume after a server restart. No action needed. 
| + +--- + +### FM-22: Node ID Collision + +| Attribute | Detail | +|-----------|--------| +| **Severity** | SEV-3 (extremely unlikely) | +| **Component** | `src/dsl.rs` — node ID generation | +| **Trigger** | Node IDs are generated as 8-character hex strings (4 bytes of randomness = ~4 billion possibilities). Under very high volume, a collision is theoretically possible within a single instance's graph. | +| **Impact** | An INSERT into `df.nodes` fails with a primary key violation. `df.start()` fails and returns an error to the user. No data corruption — the transaction is rolled back. | +| **Detection — existing** | `pgrx::error!("Failed to insert node {}: {:?}")` | +| **Detection — gap** | None — the error is clear and the failure is safe. | +| **Programmatic mitigation** | The probability is small for typical graph sizes (< 1000 nodes). By the birthday bound, an n-node graph with uniformly random IDs collides with probability of roughly n²/2³³: about 1 in 8,600 at 1,000 nodes and about 1 in 860,000 at 100 nodes. | +| **User recommendation** | Retry `df.start()` if you encounter a node insertion error. | + +--- + +## 4. Telemetry & Observability Assessment + +### 4.1 What Exists Today + +| Mechanism | Location | Content | Consumers | +|-----------|----------|---------|-----------| +| **PostgreSQL server logs** (`pgrx::log!`) | `src/worker.rs`, `src/dsl.rs`, `src/client.rs` | ~25 lifecycle messages with `"pg_durable:"` prefix | Log aggregation (CloudWatch, Azure Monitor, etc.) 
| +| **Duroxide activity traces** (`ctx.trace_info`) | `src/activities/*.rs` | SQL audit trail, HTTP audit trail, SSRF blocks, status updates | Stored in duroxide event history; queryable via `df.instance_nodes()` | +| **Duroxide orchestration traces** (`ctx.trace_info`) | `src/orchestrations/*.rs` | Node execution flow, variable substitution, loop iterations, condition evaluation | Stored in duroxide event history | +| **`df.metrics()`** | `src/monitoring.rs` | 6 aggregate counters: total/running/completed/failed instances, total executions, total events | User SQL queries, dashboards | +| **`df.status()`** | `src/dsl.rs` | Per-instance status: pending/running/completed/failed/cancelled | User SQL queries, polling loops | +| **`df.list_instances()`** | `src/monitoring.rs` | RLS-filtered instance listing with status, label, output | User SQL queries | +| **`df.instance_info()`** | `src/monitoring.rs` | Single-instance detail with execution count | User SQL queries | +| **`df.instance_executions()`** | `src/monitoring.rs` | Execution history for looping instances | User SQL queries | +| **`df.instance_nodes()`** | `src/monitoring.rs` | Per-node execution status and results | User SQL queries | +| **Epoch sentinel** (`df._worker_epoch`) | `src/worker.rs`, `src/lib.rs` | Worker liveness: UUID + `last_seen_at` timestamp | Operator queries | +| **Tracing subscriber** | `src/worker.rs` | Configurable via `RUST_LOG` env var; defaults to `warn` with `info` for duroxide modules | stderr → PostgreSQL log file | + +### 4.2 Telemetry Gaps + +The following gaps are prioritized by operational impact: + +#### Critical Gaps (needed for production readiness) + +| Gap | Relevant Failure Modes | Recommendation | +|-----|----------------------|----------------| +| **No worker health-check function** | FM-1, FM-2, FM-3 | Add `df.worker_status()` returning `(alive bool, last_heartbeat timestamptz, uptime_seconds int, current_epoch uuid)` by querying `df._worker_epoch`. 
| +| **No queue depth / throughput metrics** | FM-16 | Extend `df.metrics()` with `pending_instances`, `avg_completion_time_ms`, and if feasible `active_activities`, `pending_activities`. | +| **No per-instance duration metric** | FM-10, FM-16 | Add `duration_ms` to `df.list_instances()` output (computed from `created_at` to `completed_at`). | +| **No connection count visibility** | FM-7, FM-16 | Log active activity connection count in `execute_sql`; consider a `df.worker_connections()` metric. | + +#### Important Gaps (needed for operational maturity) + +| Gap | Relevant Failure Modes | Recommendation | +|-----|----------------------|----------------| +| **Silent monitoring failures** | FM-16 | `df.list_instances()`, `df.instance_info()`, etc. return empty results on internal errors (store connection failure). Add `RAISE WARNING` when the duroxide client fails. | +| **No structured security audit log** | FM-9 | Emit SSRF blocks to a dedicated channel or table, not just activity traces. | +| **No `continue_as_new` failure logging** | FM-19 | Replace `unwrap_or` with explicit error logging. | +| **No column type fallback warning** | FM-18 | Log when a SQL result column value falls through to `null`. | +| **No activity retry telemetry** | FM-5, FM-8 | Log/count transaction-visibility retries in `load_function_graph` and any future HTTP retries. | + +#### Nice-to-Have Gaps (for mature observability) + +| Gap | Relevant Failure Modes | Recommendation | +|-----|----------------------|----------------| +| **No histogram metrics** (p50/p95/p99 latency) | FM-16 | Requires integration with a metrics library (e.g., `metrics` crate exported via `pg_stat` or Prometheus endpoint). | +| **No worker restart counter** | FM-17 | Track epoch changes in `df._worker_epoch` (each new row = a restart). | +| **No `RACE` winner/loser logging** | FM-12 | Log which branch won a RACE and that the other continues running. 
| +| **No variable substitution miss warning** | FM-13 | Warn when `$varname` patterns remain after substitution. | + +### 4.3 Log Searchability + +All PostgreSQL-level logs use the `"pg_durable: "` prefix, making them grep-friendly. Recommended log-based alert patterns for operators: + +| Pattern | Meaning | Action | +|---------|---------|--------| +| `"pg_durable: failed to create tokio runtime"` | Worker startup failure | Page on-call — FM-1 | +| `"will retry"` repeated > 10 times | Worker connection loop stuck | Investigate auth/connectivity — FM-2 | +| `"worker role.*NOT a superuser"` | Misconfigured worker role | Fix role privileges — FM-3 | +| `"epoch sentinel gone"` | Extension dropped/recreated | Verify intentional — FM-14 | +| `"HTTP BLOCKED"` | SSRF attempt | Security review — FM-9 | +| `"Instance.*not found after 5s"` | Transaction visibility timeout | Check for long transactions — FM-5 | +| `"failed to create PostgreSQL store"` repeated | Duroxide schema issue | Check migrations — FM-15 | + +--- + +## 5. User-Facing Recommendations Summary + +### Before Going to Production + +1. **Verify worker health**: Query `SELECT * FROM df._worker_epoch` — confirm `last_seen_at` is recent. +2. **Test idempotency**: All SQL in durable functions may be re-executed after a server restart. Use `INSERT ... ON CONFLICT`, conditional `UPDATE`s, etc. +3. **Set appropriate timeouts**: `df.http()` timeout defaults to 30s. Set it based on your endpoint's expected latency. +4. **Cast complex types**: Use `::text` casts for UUID, array, composite, and other non-primitive columns in `df.sql()` queries. +5. **Scope variables**: Set all `df.setvar()` values before `df.start()`. Use `\|=> 'name'` for intermediate results. 
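
The checklist above can be sketched as plain PostgreSQL statements. This is an illustrative sketch only: the `payments` and `audit_events` tables, their columns, and the sample values are hypothetical and not part of pg_durable. Only `df._worker_epoch` and its `last_seen_at` column come from this document.

```sql
-- Hypothetical demo tables (assumptions for illustration, not part of pg_durable).
CREATE TABLE IF NOT EXISTS payments (
    payment_id   text   PRIMARY KEY,
    amount_cents bigint NOT NULL,
    status       text   NOT NULL DEFAULT 'pending'
);
CREATE TABLE IF NOT EXISTS audit_events (
    event_id  uuid PRIMARY KEY,
    tags      text[],
    source_ip inet
);

-- Item 1: worker liveness. A recent last_seen_at means the worker is running.
SELECT * FROM df._worker_epoch;

-- Item 2: idempotent INSERT. If the activity is re-dispatched after a restart,
-- the second execution finds the existing key and inserts nothing.
INSERT INTO payments (payment_id, amount_cents)
VALUES ('pay-001', 1250)
ON CONFLICT (payment_id) DO NOTHING;

-- Item 2: guarded UPDATE. The status check makes a second execution
-- match zero rows instead of repeating the transition.
UPDATE payments
SET    status = 'settled'
WHERE  payment_id = 'pay-001'
  AND  status = 'pending';

-- Item 4: cast uuid, array, and inet columns to text so the values are not
-- replaced with null in the JSON result.
SELECT event_id::text, tags::text, source_ip::text
FROM   audit_events;
```

Running the `INSERT` and `UPDATE` twice in a row leaves `payments` in the same state as running them once, which is the property at-least-once execution requires.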
+### Monitoring Your Workflows + +| What to Check | How | +|---------------|-----| +| Workflow status | `SELECT * FROM df.status('instance-id')` | +| All your workflows | `SELECT * FROM df.list_instances()` | +| Stuck workflows | `SELECT * FROM df.list_instances('running')` — check for old entries | +| Failed workflow details | `SELECT * FROM df.instance_nodes('instance-id')` — find the failed node | +| System health | `SELECT * FROM df.metrics()` — watch for growing `running_instances` with flat `completed_instances` | + +### When Things Go Wrong + +| Symptom | Likely Cause | Action | +|---------|-------------|--------| +| Workflow stuck at `pending` | Worker not running, or worker role is not a superuser | Check `df._worker_epoch`, contact DBA | +| Workflow `failed` immediately | SQL error, missing table/role, validation failure | Check `df.instance_nodes()` for error details | +| HTTP node failed | Timeout, SSRF block, remote server error | Check node result for error message; verify URL is public | +| All workflows slow | Worker overloaded or PostgreSQL under pressure | Check `df.metrics()`, reduce concurrent submissions | +| `df.start()` fails with a connection error | Stale client in backend session | Reconnect your session | + +--- + +## 6. 
Service-Owner (PaaS) Operational Runbook + +### Alerts to Configure + +| Alert | Condition | Severity | Failure Mode | +|-------|-----------|----------|-------------| +| Worker absent | No `pg_durable_worker` in `pg_stat_activity` for > 30s | P1 | FM-1 | +| Worker retry storm | `"will retry"` in logs > 10 occurrences/minute | P1 | FM-2 | +| Worker role warning | `"NOT a superuser"` in logs at extension creation | P1 | FM-3 | +| Epoch sentinel stale | `df._worker_epoch.last_seen_at` > 60s old | P2 | FM-1, FM-14 | +| Pending queue growth | `df.metrics().running_instances` increasing without `completed_instances` growth | P2 | FM-3, FM-16 | +| SSRF blocks | `"HTTP BLOCKED"` in logs | P3 (security) | FM-9 | +| Extension dropped | `"epoch sentinel gone"` in logs | P2 | FM-14 | + +### Capacity Planning Considerations + +- Each SQL activity opens one PostgreSQL connection. Plan `max_connections` with headroom for concurrent activity execution. +- Duroxide state tables (`duroxide.*`) grow with instance count and execution history. Plan storage for long-running or eternal (looping) instances. +- The single background worker is the throughput bottleneck. Monitor pending-to-running transition latency as a proxy for capacity. + +### Upgrade / Migration Safety + +- `pg_durable.worker_role` and `pg_durable.database` are `PGC_POSTMASTER` GUCs — changes require a PostgreSQL restart. +- Extension upgrades (`ALTER EXTENSION pg_durable UPDATE`) must be tested against the existing duroxide schema. The `MigrationPolicy::VerifyOnly` setting means the extension will not auto-migrate — schema must already match. +- Always run `verify-duroxide-migrations.sh` before deploying a new version. diff --git a/docs/service-fma.md b/docs/service-fma.md new file mode 100644 index 00000000..880665fb --- /dev/null +++ b/docs/service-fma.md @@ -0,0 +1,484 @@ +# pg_durable Service-Level Failure Mode Analysis + +**Status**: Draft +**Created**: 2026-03-24 + +--- + +## 1. 
Overview + +This document analyzes failure modes for pg_durable **as a feature of a managed PostgreSQL-as-a-Service (PaaS) platform** — e.g., Azure Database for PostgreSQL Flexible Server. It covers infrastructure, control plane, data plane, and operational concerns that are outside the extension's own codebase but directly affect pg_durable users. + +For extension-internal failure modes (background worker, activities, orchestrations, DSL, client), see [fma.md](fma.md). + +**Assumptions**: +- pg_durable is deployed as a first-party extension on the PaaS platform. +- The extension `.so` is baked into the PostgreSQL engine image. +- `shared_preload_libraries` includes `pg_durable` on all nodes where the feature is enabled. +- The duroxide schema (`duroxide.*`) and extension schema (`df.*`) live in the same database. +- The background worker runs as the platform superuser role (e.g., `azuresu`). + +--- + +## 2. Severity Definitions + +| Level | Definition | +|-------|-----------| +| **SEV-1** | All pg_durable users in a region/stamp are impacted. Data loss or extended unavailability. | +| **SEV-2** | Subset of users affected, or degraded functionality across the service. | +| **SEV-3** | Single-server or single-tenant impact. Self-recoverable or cosmetic. | + +--- + +## 3. Deployments + +### SFM-1: Region Buildout — Missing Feature Registration + +| Attribute | Detail | +|-----------|--------| +| **Scenario** | New region is enabled but the infrastructure subscription lacks feature registration for pg_durable (e.g., `shared_preload_libraries` allowlisting, extension package deployment, or ARM resource provider registration). | +| **Severity** | SEV-2 (new region only) | +| **Impact** | Customers in the new region cannot enable pg_durable. `CREATE EXTENSION pg_durable` fails or `shared_preload_libraries` rejects the library name. Existing regions unaffected. 
| +| **Programmatic mitigation** | Extension SQL includes a startup check: `pg_durable must be loaded via shared_preload_libraries` — fast-fails if misconfigured. | +| **Process mitigation** | Buildout checklist should include: (1) extension package deployed to region's image, (2) `shared_preload_libraries` allowlist updated, (3) ARM RP feature flag enabled. Validate with smoke test before region GA. | +| **Detection** | Customer-reported `CREATE EXTENSION` failures. Platform provisioning logs show missing extension in `pg_available_extensions`. | +| **Recommendation** | Add pg_durable to the region-buildout validation suite: after deployment, run `SELECT * FROM pg_available_extensions WHERE name = 'pg_durable'` and `SHOW shared_preload_libraries` on a canary server. | + +### SFM-2: Region Buildout — Monitoring Not Configured + +| Attribute | Detail | +|-----------|--------| +| **Scenario** | New region is enabled but platform monitoring (Geneva/Azure Monitor rules, dashboards, ICM connectors) hasn't been onboarded for pg_durable-specific signals. | +| **Severity** | SEV-2 | +| **Impact** | pg_durable failures in the new region go undetected by the service team. No alerts fire for worker crashes, stuck instances, or other FM-* scenarios from [fma.md](fma.md). | +| **Programmatic mitigation** | None — monitoring configuration is external to the extension. | +| **Process mitigation** | Buildout checklist should include monitoring validation. Use infrastructure-as-code for monitor/alert definitions so they deploy atomically with the region. | +| **Detection** | Periodic audit of monitoring coverage per region. Synthetic canary tests that verify alerts fire. | +| **Recommendation** | Define pg_durable monitoring rules as code (e.g., Azure Monitor alert rules in Bicep) and deploy them as part of the region buildout pipeline, not as a separate manual step. 
| + +### SFM-3: Engine Image Deployment — Extension `.so` Missing or Mismatched + +| Attribute | Detail | +|-----------|--------| +| **Scenario** | A new engine image is deployed to the fleet but the pg_durable `.so` is missing, is compiled against the wrong PostgreSQL major version, or is an older version than expected. | +| **Severity** | SEV-1 (if rollout is fleet-wide) or SEV-2 (if canary catches it) | +| **Impact** | On server restart with the new image, PostgreSQL fails to start because `shared_preload_libraries` references a missing/incompatible library. Or, PostgreSQL starts but pg_durable functions produce unexpected behavior due to binary-schema mismatch. See FM-15 in [fma.md](fma.md) for duroxide schema drift specifics. | +| **Programmatic mitigation** | The extension's `_PG_init()` fails fast if not loaded via `shared_preload_libraries`. The duroxide provider uses `MigrationPolicy::VerifyOnly` on backend connections (fails closed on schema mismatch). CI runs `test-upgrade.sh` to validate backward compatibility. | +| **Process mitigation** | Image build pipeline should: (1) compile the extension against the exact PG version in the image, (2) run smoke tests before fleet rollout, (3) use canary deployments. Docker CI (`docker.yml`) validates the image builds and passes E2E tests. | +| **Detection** | PostgreSQL fails to start → platform health check detects unresponsive server. Or, `df.version()` returns unexpected version after image update. | +| **Recommendation** | Add a post-deployment validation step that connects to a canary server and runs `SELECT df.version()`, verifying it matches the expected version in the release manifest. | + +### SFM-4: Engine Image Deployment — Binary-Schema Gap During Rolling Update + +| Attribute | Detail | +|-----------|--------| +| **Scenario** | New pg_durable `.so` is deployed via engine image update (fleet maintenance window). Customers have not yet run `ALTER EXTENSION pg_durable UPDATE`. 
The new binary runs against the old schema for hours, days, or indefinitely. | +| **Severity** | SEV-1 (if backward compat is broken) or SEV-3 (if tested and compatible) | +| **Impact** | If the new binary is not backward compatible with the old schema, durable functions fail silently or produce incorrect results. The background worker may crash-loop or activities may error. | +| **Programmatic mitigation** | CI enforces backward compatibility via `test-upgrade.sh` (Scenario B1: new `.so` against all previous schemas). The duroxide provider's `VerifyOnly` policy fails closed on schema mismatch. | +| **Process mitigation** | All code changes must pass upgrade tests before merge. Release notes must document any required `ALTER EXTENSION UPDATE` steps. The upgrade model is designed for an extended binary-schema gap. | +| **Detection** | Worker logs: `"failed to create PostgreSQL store (will retry)"` repeated — indicates schema verification failure. `df.metrics()` shows no progress. See FM-15 in [fma.md](fma.md). | +| **Recommendation** | For breaking schema changes, use a two-phase rollout: (1) deploy binary that supports both old and new schema, (2) after fleet adoption, deploy the schema migration via `ALTER EXTENSION UPDATE` guidance. Never ship a binary that requires the new schema to function. | + +### SFM-5: Sidecar Deployment Failure + +| Attribute | Detail | +|-----------|--------| +| **Scenario** | Platform sidecars (monitoring agent, log collector, backup agent, security scanner) fail to deploy or crash on a node running pg_durable. | +| **Severity** | SEV-2 | +| **Impact** | pg_durable itself is unaffected (it runs inside the PostgreSQL process, not as a sidecar). 
However: (1) log collection failure means pg_durable worker logs and activity traces are not ingested — all detection mechanisms in [fma.md](fma.md) Section 4.3 become blind, (2) backup agent failure means duroxide state and extension tables may not be backed up, (3) monitoring agent failure means platform-level metrics (CPU, memory, connections) that are proxies for pg_durable health are missing. | +| **Programmatic mitigation** | None internal to pg_durable. | +| **Process mitigation** | Platform sidecar health monitoring. Ensure sidecar restarts don't trigger PostgreSQL restarts (process isolation). | +| **Detection** | Sidecar health checks. Gap in telemetry ingestion (missing log entries for expected time windows). | +| **Recommendation** | pg_durable's most critical observability data is in PostgreSQL server logs (`pgrx::log!` with `"pg_durable:"` prefix) and duroxide execution history (in `duroxide.*` tables). Ensure the log collector captures the PostgreSQL log file even if other sidecars fail. | + +### SFM-6: Control Ring — Management Service Deployment Failure + +| Attribute | Detail | +|-----------|--------| +| **Scenario** | An Orcas/Management Service (ARM RP) deployment fails or causes an outage. This includes the component that handles `CREATE SERVER`, `UPDATE SERVER`, and extension management operations. | +| **Severity** | SEV-1 (if management plane is down) | +| **Impact** | Customers cannot create new servers with pg_durable, cannot enable/disable the extension via portal/CLI, and cannot perform server scaling operations. **Running durable functions on existing servers are unaffected** — the data plane (PostgreSQL + background worker) operates independently of the management plane. | +| **Programmatic mitigation** | pg_durable has no dependency on the management plane at runtime. The background worker and all DSL/monitoring functions operate purely within the PostgreSQL process. 
| +| **Process mitigation** | Standard management service deployment safeguards (canary, rollback, health checks). | +| **Detection** | ARM RP health monitoring. Customer-reported provisioning failures. | +| **pg_durable-specific concern** | If the management service deployment includes a change to pg_durable's `shared_preload_libraries` configuration or GUC defaults, a failed rollout could leave servers in an inconsistent state where some have the new config and others don't. | + +--- + +## 4. Service SLAs / KPIs and Customer Workflows + +### SFM-7: Login Availability Below SLA + +| Attribute | Detail | +|-----------|--------| +| **Scenario** | Customer cannot authenticate to the PostgreSQL server. Login success rate drops below SLA. | +| **Severity** | SEV-1 | +| **Impact** | Customers cannot submit new durable functions (`df.start()`), check status (`df.status()`), or retrieve results (`df.result()`). **Running durable functions continue executing** — the background worker uses its own connection pool (authenticated at startup) and is not affected by frontend authentication failures. However, `execute_sql` activities that need to connect as a user role may fail if the authentication substrate is globally degraded. See FM-7 in [fma.md](fma.md). | +| **Programmatic mitigation** | The worker's sqlx pool is long-lived and reconnects independently. Activity connections use `SET ROLE` after connecting as the login role, which may succeed even if new logins are throttled (existing connections survive). | +| **Process mitigation** | Platform login availability monitoring and alerting. | +| **Detection** | Platform login success rate metric. `execute_sql` activity failures with auth-related errors in duroxide traces. | +| **pg_durable-specific concern** | The `execute_sql` activity opens a **new connection per SQL node** (not pooled). Under login degradation, each SQL node execution pays the full authentication cost and may fail. 
High-concurrency workflows amplify the impact. | + +### SFM-8: Customer Cannot Connect (Network/Firewall) + +| Attribute | Detail | +|-----------|--------| +| **Scenario** | Customer connectivity is blocked by firewall rules, VNet misconfiguration, private endpoint issues, or DNS resolution failure. | +| **Severity** | SEV-3 (single customer) | +| **Impact** | Customer cannot interact with pg_durable at all (no DSL calls, no monitoring). **Running durable functions continue executing** — the background worker is local to the PostgreSQL process and doesn't traverse the customer's network path. Activities that execute SQL connect via the local socket/loopback, not through the customer-facing endpoint. | +| **Programmatic mitigation** | None — this is a platform networking concern. | +| **Detection** | Customer-reported. Platform connection metrics per server. | +| **pg_durable-specific concern** | If `df.http()` nodes target endpoints within the customer's VNet, those HTTP requests originate from the PostgreSQL server's network context, not the customer's client. Firewall rules must account for the server's egress IP, not the customer's ingress path. | + +### SFM-9: Azure Compute Failure (Full) + +| Attribute | Detail | +|-----------|--------| +| **Scenario** | The VM or container hosting the PostgreSQL server experiences a compute failure (hardware fault, hypervisor crash, VM eviction). | +| **Severity** | SEV-1 | +| **Impact** | PostgreSQL process terminates. Background worker dies. All in-flight activities are interrupted. **Duroxide's durability guarantee applies**: on restart, the worker replays incomplete orchestrations from the last checkpoint. Activities that were mid-execution are re-dispatched. See FM-17 in [fma.md](fma.md) for restart/replay behavior. | +| **Programmatic mitigation** | Duroxide's event-sourced architecture provides at-least-once execution. PostgreSQL restarts the background worker after its configured restart interval (`set_restart_time(5s)`), which keeps recovery quick. 
The epoch sentinel detects the restart and re-initializes cleanly. | +| **Process mitigation** | Platform HA: availability zone redundancy, automated failover, VM auto-restart. | +| **Detection** | Platform VM health monitoring. After restart: worker log `"pg_durable: duroxide background worker starting..."`. `df._worker_epoch` shows a new epoch UUID. | +| **pg_durable-specific concern** | **SQL activities have at-least-once semantics.** In-flight SQL statements are rolled back by PostgreSQL crash recovery, then re-dispatched by duroxide replay. Users must design SQL to be **idempotent** (`INSERT ... ON CONFLICT`, conditional UPDATEs). This is the single most important user-facing guidance for pg_durable on a PaaS. | + +### SFM-10: Azure Storage Failure (Full) + +| Attribute | Detail | +|-----------|--------| +| **Scenario** | The managed disk or storage subsystem backing PostgreSQL's data directory becomes unavailable or experiences data loss. | +| **Severity** | SEV-1 | +| **Impact** | PostgreSQL cannot read/write data. All pg_durable state is lost if storage is unrecoverable: `df.instances`, `df.nodes`, `df.vars`, and all `duroxide.*` tables (orchestration history, event log, activity state). **Total loss of durable function state.** There is no external state store — everything is in PostgreSQL. | +| **Programmatic mitigation** | None — pg_durable stores all state in PostgreSQL by design. There is no out-of-band state replication. | +| **Process mitigation** | Platform storage redundancy (LRS/ZRS/GRS). Point-in-time restore (PITR) from backups. | +| **Detection** | Platform storage health alerts. PostgreSQL `PANIC` logs. | +| **pg_durable-specific concern** | Duroxide's durability guarantee is **only as strong as the underlying PostgreSQL storage**. Unlike external orchestrators (Temporal, Azure Durable Functions) that have independent state stores, pg_durable's state lives in the same storage as user data. 
A storage failure that loses user data **also loses orchestration state**. Recovery via PITR restores both user data and pg_durable state to the same point in time, which is actually a consistency advantage — but any durable functions that completed between the restore point and the failure are lost. | + +### SFM-11: Limited Azure Storage/Compute Failure + +| Attribute | Detail | +|-----------|--------| +| **Scenario** | Partial degradation: elevated I/O latency, intermittent storage errors, CPU throttling, or memory pressure. | +| **Severity** | SEV-2 | +| **Impact** | pg_durable operations slow down. `execute_sql` activities take longer. The duroxide runtime's polling slows as each store query takes longer. The worker's sqlx pool may experience connection timeouts. Under memory pressure, the Tokio runtime may fail to spawn tasks. Under CPU throttling, duroxide's orchestration dispatcher falls behind. | +| **Programmatic mitigation** | The sqlx pool has built-in connection health checks. Duroxide polling is tolerant of latency (it just polls less frequently). Activity timeouts prevent indefinite blocking. | +| **Detection** | Platform I/O and CPU metrics. `df.metrics()` shows growing `running_instances` without corresponding `completed_instances` growth — see FM-16 in [fma.md](fma.md). Activity traces show increasing `duration_ms` for HTTP nodes. | +| **pg_durable-specific concern** | The background worker is a **single process** running inside PostgreSQL. It competes with user workloads for CPU and memory. Under resource pressure, user queries and durable function execution degrade together. There is no resource isolation between the worker and user sessions. | + +--- + +## 5. Manageability + +### SFM-12: Create Server + +| Attribute | Detail | +|-----------|--------| +| **Scenario** | Customer creates a new server with pg_durable enabled. 
The provisioning workflow must: (1) configure `shared_preload_libraries`, (2) set GUCs (`pg_durable.worker_role`, `pg_durable.database`), (3) ensure the worker role exists and is superuser, (4) run `CREATE EXTENSION pg_durable`. | +| **Severity** | SEV-3 (single server) | +| **Impact** | If any step fails, the server exists but pg_durable is non-functional. Common failure modes: (a) `shared_preload_libraries` not set → extension load fails (see FM-1 in [fma.md](fma.md)), (b) worker role doesn't exist or isn't superuser → silent failure (see FM-3 in [fma.md](fma.md)), (c) `CREATE EXTENSION` runs in wrong database → worker can't find extension (see FM-4 in [fma.md](fma.md)). | +| **Programmatic mitigation** | Extension SQL validates: `shared_preload_libraries` inclusion, worker role existence/superuser status, correct database. The validation emits errors (not just warnings) for critical misconfigurations in production builds. | +| **Process mitigation** | Provisioning workflow should have explicit pg_durable setup steps with validation at each stage. Post-provisioning smoke test: `SELECT df.version()`. | +| **Detection** | Provisioning workflow logs. Customer-reported. Post-provision health check. | +| **Recommendation** | Add a provisioning validation step that waits for the background worker to initialize (check `df._worker_epoch` has a recent `last_seen_at`) before marking the server as `Ready`. | + +### SFM-13: Update Server — Compute Scale Up/Down + +| Attribute | Detail | +|-----------|--------| +| **Scenario** | Customer scales compute (vCPU/memory) up or down. This typically requires a server restart. | +| **Severity** | SEV-3 | +| **Impact** | PostgreSQL restarts. Background worker terminates and re-initializes. Same behavior as FM-17 in [fma.md](fma.md): duroxide replays incomplete orchestrations. **Brief interruption** to durable function execution during the restart window (typically seconds to a minute). 
| +| **Programmatic mitigation** | Duroxide replay handles restart. Worker auto-restarts after 5s. | +| **Process mitigation** | Scale operations should occur during maintenance windows when possible. Platform should drain active connections gracefully before restart. | +| **Detection** | New epoch UUID in `df._worker_epoch` after scale operation. Worker log: `"duroxide background worker starting..."`. | +| **pg_durable-specific concern** | Scaling **down** could push the server into resource pressure. If the worker's Tokio runtime or activity sqlx pool was sized for the larger tier, the reduced tier may not have enough memory/connections. GUCs like `max_connections` may be auto-adjusted, reducing headroom for activity connections. | + +### SFM-14: Update Server — Storage Scale Up/Down + +| Attribute | Detail | +|-----------|--------| +| **Scenario** | Customer scales storage. May or may not require restart depending on platform. | +| **Severity** | SEV-3 | +| **Impact** | If restart is required: same as SFM-13. If online resize: pg_durable is unaffected — it doesn't manage storage directly. Scaling **down** could trigger disk space pressure if duroxide tables have grown large (see SFM-28). | +| **Programmatic mitigation** | None specific to pg_durable. | +| **Recommendation** | Before scaling storage down, check the size of `duroxide.*` tables: `SELECT pg_size_pretty(pg_total_relation_size('duroxide.instances'))`. Duroxide execution history can be significant for long-running or eternal functions. | + +### SFM-15: Drop Server + +| Attribute | Detail | +|-----------|--------| +| **Scenario** | Customer or platform deletes the server. | +| **Severity** | SEV-3 (intentional) or SEV-1 (accidental) | +| **Impact** | All pg_durable state is permanently destroyed: instance history, node results, duroxide execution log, variables. There is no external backup of orchestration state. | +| **Programmatic mitigation** | None — state lives entirely in PostgreSQL.
| +| **Process mitigation** | Platform soft-delete / retention period for dropped servers. Backup retention policies. | +| **pg_durable-specific concern** | Unlike external orchestrators that have independent state, pg_durable state is **co-located with the database**. Dropping the server also drops the orchestration engine and all its history. Users should be warned that dropping a server with active durable functions is irreversible. | + +### SFM-16: Update Server — `shared_preload_libraries` Change + +| Attribute | Detail | +|-----------|--------| +| **Scenario** | A server configuration change removes `pg_durable` from `shared_preload_libraries`, or a platform update resets the configuration. | +| **Severity** | SEV-1 (for that server) | +| **Impact** | After the next PostgreSQL restart, pg_durable's `_PG_init()` is not called. The background worker is never registered. Extension functions still exist (SQL objects), but the worker doesn't run. All pending/running durable functions stall. New `df.start()` calls succeed (they write to tables) but instances never execute. | +| **Programmatic mitigation** | `_PG_init()` errors if not in `shared_preload_libraries`. But this only fires when the library is explicitly loaded, not when it's absent. | +| **Detection** | Worker absence from `pg_stat_activity`. Empty or stale `df._worker_epoch`. `df.start()` succeeds but `df.status()` never transitions from `pending`. | +| **Recommendation** | Platform should treat `pg_durable` in `shared_preload_libraries` as an invariant when the extension is installed. Configuration changes that remove it should be blocked or warn explicitly. | + +--- + +## 6. Disaster Recovery + +### SFM-17: Point-in-Time Restore (PITR) + +| Attribute | Detail | +|-----------|--------| +| **Scenario** | Customer or platform initiates a PITR to a point before a data loss/corruption event. 
| +| **Severity** | SEV-2 | +| **Impact** | The restored database includes pg_durable state (`df.*` and `duroxide.*` tables) as of the restore point. Durable functions that **completed after the restore point are lost** — their results, status updates, and execution history revert. Durable functions that were `running` at the restore point will be **replayed from their last checkpoint** when the background worker starts on the restored server. Some activities may re-execute (at-least-once). | +| **Programmatic mitigation** | Duroxide's replay model handles partial state correctly — it replays from the last committed event. This is **the same behavior as a crash recovery** (see FM-17 in [fma.md](fma.md)). | +| **Process mitigation** | Document PITR behavior for pg_durable in the user guide. | +| **Detection** | After restore: `df._worker_epoch` shows a new epoch. Some instances may show statuses that don't match what the user last observed. | +| **pg_durable-specific concern** | PITR restores both user data and orchestration state to the same point-in-time. This is actually **better than external orchestrators** where the orchestration state and database are restored independently, requiring reconciliation. With pg_durable, the orchestration state is always consistent with the data it operated on. | +| **User recommendation** | After a PITR, check `df.list_instances('running')` — some workflows may re-execute activities. Ensure your SQL is idempotent. Workflows that completed after the restore point will appear in their pre-completion state. | + +### SFM-18: Accidental Data Deletion by User + +| Attribute | Detail | +|-----------|--------| +| **Scenario** | User accidentally runs `DELETE FROM df.instances` or `DROP TABLE df.nodes` or similar destructive DML/DDL against pg_durable tables. | +| **Severity** | SEV-2 (for that user/server) | +| **Impact** | **With RLS enabled**: User can only delete their own rows from `df.instances` and `df.nodes`. 
Other users' workflows are unaffected. The user's own workflows lose their tracking state, but duroxide still has the execution history in `duroxide.*` tables. If `duroxide.*` tables are intact, in-flight orchestrations continue — but status updates and result writes back to `df.instances`/`df.nodes` will fail (rows missing). **Without RLS / superuser**: All workflow state destroyed. | +| **Programmatic mitigation** | RLS limits blast radius to the calling user's own rows. Decision 8 in [rls.md](rls.md) grants no DELETE privilege to PUBLIC on `df.instances`/`df.nodes`, preventing accidental deletes by non-superusers entirely. | +| **Detection** | `update_instance_status` and `update_node_status` activities fail with update-zero-rows. Duroxide traces show failures. | +| **User recommendation** | Do not run DML directly against `df.*` tables. Use `df.cancel()` to stop workflows. If you accidentally deleted rows, contact your DBA — PITR may be needed. | + +### SFM-19: Azure Regional Disaster / Full Region Failure + +| Attribute | Detail | +|-----------|--------| +| **Scenario** | An entire Azure region becomes unavailable. | +| **Severity** | SEV-1 | +| **Impact** | All pg_durable workloads in the region are unavailable. If geo-redundant backups are configured, the database (including all pg_durable state) can be restored in another region. pg_durable state is recovered along with the database — no separate state recovery needed. | +| **Programmatic mitigation** | None specific to pg_durable. State co-location with the database means DR procedures restore everything atomically. | +| **Process mitigation** | Geo-redundant backups. Cross-region read replicas (if supported — note: pg_durable's background worker only runs on the primary). | +| **pg_durable-specific concern** | **Read replicas cannot run durable functions.** The background worker only operates on the primary. 
If a read replica is promoted during DR, the worker will start on the new primary and begin processing. In-flight workflows replay from the last replicated checkpoint. The replication lag determines how much orchestration progress is lost. | + +### SFM-20: Restoring an Accidentally Dropped Server + +| Attribute | Detail | +|-----------|--------| +| **Scenario** | Server was dropped and needs to be recovered from the platform's retention/soft-delete mechanism. | +| **Severity** | SEV-1 | +| **Impact** | If the platform supports soft-delete recovery, the full database (including pg_durable state) is restored. `shared_preload_libraries` and GUC configuration must be re-applied. The background worker will start fresh — duroxide replays any incomplete orchestrations. | +| **pg_durable-specific concern** | Ensure the restored server has the same `pg_durable.worker_role` and `pg_durable.database` GUC values. If the worker role was dropped as part of cleanup, it must be recreated before the worker can operate. | + +--- + +## 7. Reliability / Performance + +### SFM-21: IOPS Limit Reached + +| Attribute | Detail | +|-----------|--------| +| **Scenario** | The server hits its provisioned IOPS limit due to high I/O from user workloads, duroxide polling, or activity execution. | +| **Severity** | SEV-2 | +| **Impact** | All database operations slow down, including pg_durable. Duroxide polling (long-poll or interval-based) becomes slower. Activity SQL execution takes longer. Status updates to `df.instances`/`df.nodes` are delayed. Users observe workflows completing slowly. The worker's throughput decreases proportionally. | +| **Programmatic mitigation** | Duroxide uses long-polling (reduces unnecessary I/O vs. tight polling). Activity connections are opened on-demand and closed after use (no persistent per-user pool). | +| **Detection** | Platform IOPS metrics at provisioned limit. `df.metrics()` shows growing `running_instances` without `completed_instances` growth. 
Activity trace durations increase. | +| **pg_durable-specific concern** | Duroxide's polling-based dispatcher generates **baseline IOPS** even when idle. Each poll cycle queries the `duroxide.*` tables. Under IOPS contention, this baseline load compounds the problem. Additionally, each `execute_sql` activity writes: (1) activity start event, (2) SQL execution, (3) activity completion event, (4) node status update, (5) instance status update — **5+ write operations per SQL node**. High-throughput workflows with many SQL nodes generate significant write IOPS. | +| **User recommendation** | Monitor your server's IOPS utilization. If durable function throughput degrades coincident with IOPS saturation, scale to a higher IOPS tier or reduce concurrent workflow submissions. | + +### SFM-22: Write Latency High + +| Attribute | Detail | +|-----------|--------| +| **Scenario** | Storage write latency is elevated (slow disk, storage throttling, VNet latency to remote storage). | +| **Severity** | SEV-2 | +| **Impact** | Every duroxide event, activity checkpoint, and status update involves a write. Orchestration progress slows proportionally to write latency. `df.start()` (which inserts into `df.instances` and `df.nodes`) becomes slow from the user's perspective. `execute_sql` activities that perform writes are doubly affected (user SQL write + duroxide checkpoint write). | +| **Detection** | Platform storage latency metrics. User-perceived `df.start()` latency. Duroxide execution durations increase across the board. | +| **pg_durable-specific concern** | Duroxide's event-sourcing model is **write-heavy by design**. Every activity invocation, every orchestration decision, and every checkpoint is an INSERT into `duroxide.*` tables. pg_durable amplifies write-latency impact more than a typical read-heavy PostgreSQL workload. 
| + +### SFM-23: Resource Usage High (CPU/Memory) + +| Attribute | Detail | +|-----------|--------| +| **Scenario** | Server CPU or memory utilization is high from user workloads, leaving insufficient resources for the pg_durable background worker. | +| **Severity** | SEV-2 | +| **Impact** | The background worker competes with user backend processes for CPU and memory. Under memory pressure, the Tokio runtime may fail to allocate buffers. Under CPU saturation, duroxide's dispatcher thread can't keep up with orchestration decisions. Activity throughput drops. In extreme cases, the operating system's OOM killer terminates the worker process. | +| **Programmatic mitigation** | PostgreSQL auto-restarts the background worker after termination (5s delay). The worker has minimal steady-state memory (main allocation is the sqlx pool and Tokio runtime). | +| **Detection** | Platform CPU/memory metrics. Worker crashes appear as restarts in `df._worker_epoch` (new epoch UUID). Repeated entries in PostgreSQL log: `"pg_durable: duroxide background worker starting..."`. | +| **pg_durable-specific concern** | There is **no resource isolation** between the background worker and user sessions. No cgroup, no memory limit, no CPU pinning. The worker is a regular PostgreSQL background worker process. On a server running both heavy user queries and many durable functions, the two workloads contend for the same resources. See FM-16 in [fma.md](fma.md) for the single-worker bottleneck analysis. | + +### SFM-24: Server Crash Due to Lack of Resources + +| Attribute | Detail | +|-----------|--------| +| **Scenario** | PostgreSQL crashes due to OOM, disk full, or other resource exhaustion. | +| **Severity** | SEV-1 | +| **Impact** | Same as SFM-9 (compute failure). All in-flight activities are interrupted. Duroxide replays after recovery. There is a risk of corruption if the crash occurs during a WAL write for duroxide tables, but PostgreSQL's crash recovery (WAL replay) restores consistency.
| +| **pg_durable-specific concern** | If the crash was caused by duroxide's own resource consumption (e.g., a runaway loop creating unbounded execution history), the crash-and-restart cycle may repeat. The worker will restart, pick up the same runaway orchestration, and consume resources again. See FM-10 in [fma.md](fma.md) for infinite loop scenarios. Without a max-duration or max-iteration limit, this creates a **crash loop**. | +| **Recommendation** | Implement a circuit breaker: if the worker crashes N times within a window, delay restart progressively. Add a system-level `pg_durable.max_instance_duration_seconds` GUC to auto-cancel runaway instances. | + +--- + +## 8. Monitoring + +### SFM-25: Issues Not Detected (TTD Gap) + +| Attribute | Detail | +|-----------|--------| +| **Scenario** | A pg_durable failure occurs but the service team doesn't detect it. | +| **Severity** | SEV-2 | +| **Impact** | Customer workflows are broken but no alert fires. Time-to-detect (TTD) is driven by customer complaint rather than proactive monitoring. | +| **Current detection capabilities** | See [fma.md](fma.md) Section 4 for the full telemetry inventory. Key signals: | +| **Gaps** | | +| **Recommendation** | Build a PaaS monitoring integration that: (1) periodically calls `df.metrics()` and publishes to Azure Monitor as custom metrics, (2) checks `df._worker_epoch.last_seen_at` for worker liveness, (3) queries `SELECT count(*) FROM df.instances WHERE status = 'pending' AND created_at < now() - interval '5 minutes'` for stuck-instance detection. See also [fma.md](fma.md) Section 4.2 for the complete gap analysis. | + +### SFM-26: Issues Detected Too Slowly (TTD Below Target) + +| Attribute | Detail | +|-----------|--------| +| **Scenario** | Alerts exist but fire too late — e.g., a log-based alert has a 5-minute ingestion delay, or a metric threshold is set too high. 
| +| **Severity** | SEV-2 | +| **Impact** | Customers experience extended downtime before the service team is engaged. | +| **pg_durable-specific concern** | The epoch sentinel heartbeat (`df._worker_epoch.last_seen_at`) is updated every ~5 seconds during the worker's main loop. If log ingestion has a 5-minute lag, a worker crash at T=0 isn't detectable via logs until T=5min. Direct database queries against `df._worker_epoch` would detect it within 10–15 seconds. | +| **Recommendation** | For the fastest TTD, use a **synthetic canary** that periodically submits a trivial durable function (`df.start(df.sql('SELECT 1'), 'canary')`) and verifies completion within an expected SLA (e.g., 30 seconds). This is an end-to-end health probe that catches all failure modes: worker down, connection failures, permission issues, schema drift, resource exhaustion. | + +### SFM-27: Issues Mitigated Too Slowly (TTM Below Target) + +| Attribute | Detail | +|-----------|--------| +| **Scenario** | An issue is detected but mitigation takes too long — e.g., worker restart requires manual intervention, or a configuration change needs a PostgreSQL restart. | +| **Severity** | SEV-2 | +| **Impact** | Extended customer impact. | +| **pg_durable-specific concern** | Two GUCs (`pg_durable.worker_role`, `pg_durable.database`) are `PGC_POSTMASTER` — they require a full PostgreSQL restart to change. If the mitigation involves changing these values, TTM includes the restart window plus any HA failover time. The background worker auto-restarts after a crash (5s delay), so crash-related mitigations are fast. But for configuration issues (wrong role, wrong database), there is no way to reconfigure without a restart. | +| **Recommendation** | Provide a runbook for common pg_durable mitigations. Include expected TTM for each: | + +--- + +## 9. 
Billing + +### SFM-28: Incorrect Billing / Metering + +| Attribute | Detail | +|-----------|--------| +| **Scenario** | pg_durable's resource consumption is not properly accounted for in billing, or customers are charged for resources consumed by the background worker's internal operations. | +| **Severity** | SEV-2 | +| **Impact** | If billing is based on compute time, storage, or IOPS: the background worker's polling, activity execution, and duroxide state management generate measurable resource consumption. This is "extension overhead" that the customer didn't explicitly trigger. Storage for `duroxide.*` tables grows with orchestration history and may be significant for high-volume users. | +| **pg_durable-specific concern** | Duroxide tables (`duroxide.instances`, `duroxide.execution_events`, etc.) can grow very large for: (a) eternal/looping functions (each iteration creates a new execution), (b) functions with many parallel branches (high event count), (c) long-running functions (event history accumulates). There is currently **no automatic purge/TTL** for completed duroxide state. A customer with thousands of completed workflows accumulates unbounded storage in `duroxide.*`. | +| **Recommendation** | (1) Document pg_durable's storage overhead in the pricing FAQ. (2) Implement a TTL/purge mechanism for completed orchestration history (e.g., `pg_durable.history_retention_days` GUC). (3) Include `duroxide.*` table sizes in storage usage dashboards so customers can see the breakdown. | + +--- + +## 10. Service Components + +### SFM-29: Backup — Full/Diff/Log Backups Out of SLO + +| Attribute | Detail | +|-----------|--------| +| **Scenario** | Platform backup jobs (full, differential, or WAL archiving) exceed their SLO windows. | +| **Severity** | SEV-2 | +| **Impact** | RPO (Recovery Point Objective) increases. If a failure occurs during the backup gap, more pg_durable state is lost on PITR. 
Backup operations competing for I/O may slow duroxide's write-heavy workload (see SFM-21). | +| **pg_durable-specific concern** | Duroxide's write amplification (multiple events per activity, event-sourced model) increases WAL volume. High-throughput durable function workloads generate more WAL than typical OLTP, which extends backup times. Large duroxide tables increase full/differential backup size. | +| **Recommendation** | Monitor WAL generation rate for servers with pg_durable enabled. Consider automatic vacuuming and table maintenance for `duroxide.*` tables. | + +### SFM-30: Backup — Detected Corrupt Backups + +| Attribute | Detail | +|-----------|--------| +| **Scenario** | Backup integrity check reveals corruption in a backup that includes pg_durable state. | +| **Severity** | SEV-1 | +| **Impact** | The backup may not be usable for PITR. Last known good backup determines actual RPO. pg_durable state may be irrecoverable for the affected window. | +| **pg_durable-specific concern** | Because pg_durable state is in PostgreSQL tables (not an external store), backup corruption affects orchestration state equally. There is no secondary copy or replication target for duroxide state. | +| **Recommendation** | Standard backup integrity validation applies. No pg_durable-specific action needed — corrupted backups are a platform-level concern. | + +### SFM-31: Backup — Azure Storage Failure + +| Attribute | Detail | +|-----------|--------| +| **Scenario** | The Azure Storage account used for backups becomes unavailable. | +| **Severity** | SEV-1 | +| **Impact** | Backups cannot be written. RPO effectively becomes "last successful backup." If the primary database also fails, pg_durable state is unrecoverable past the last good backup. Active durable functions are unaffected (they run from primary storage). | +| **pg_durable-specific concern** | None specific beyond the general database concern. 
| + +### SFM-32: Restore — Slow or Hung Restore + +| Attribute | Detail | +|-----------|--------| +| **Scenario** | A PITR or full restore operation takes longer than expected or hangs. | +| **Severity** | SEV-2 | +| **Impact** | Extended RTO (Recovery Time Objective). pg_durable is unavailable for the duration. After restore completes, the background worker starts fresh and duroxide replays orchestrations — adding additional time before durable functions resume processing. | +| **pg_durable-specific concern** | Large `duroxide.*` tables increase restore time. The post-restore duroxide replay phase adds latency before the worker is fully operational. For servers with many in-flight orchestrations at the restore point, the replay phase can be significant. | +| **Recommendation** | Include duroxide table sizes in restore-time estimation. Consider adding a `df.purge_completed(older_than interval)` function to help customers manage duroxide table sizes. | + +### SFM-33: Storage — Loss of Data Files or WAL + +| Attribute | Detail | +|-----------|--------| +| **Scenario** | Primary data files or WAL segments for the PostgreSQL data directory are lost or corrupted. | +| **Severity** | SEV-1 | +| **Impact** | If WAL is intact: PostgreSQL crash recovery may restore consistency, and pg_durable state is recovered along with all other data. If data files for `duroxide.*` or `df.*` tables are lost: those tables are unreadable. Worker enters retry loop (can't read duroxide schema). PITR is required. | +| **pg_durable-specific concern** | The `duroxide.*` schema and `df.*` schema are stored in the same tablespace as user data (default tablespace). There is no separation. Loss of tablespace data files affects pg_durable equally. | + +### SFM-34: Storage — Corrupt Page in pg_durable Tables + +| Attribute | Detail | +|-----------|--------| +| **Scenario** | A data page corruption is detected in a `df.*` or `duroxide.*` table. 
| +| **Severity** | SEV-2 | +| **Impact** | Queries against the corrupted table fail. If `duroxide.instances` or `duroxide.execution_events` is corrupted, the runtime may fail to dispatch or replay orchestrations. If `df.instances` is corrupted, `df.status()` and `df.list_instances()` fail. If `df.nodes` is corrupted, `load_function_graph` fails for affected instances. | +| **Programmatic mitigation** | PostgreSQL's `data_checksums` (if enabled) detects corruption at read time. As a last resort, the `pg_surgery` extension can forcibly remove damaged heap tuples (`heap_force_kill`). | +| **Detection** | PostgreSQL log: `WARNING: page verification failed`. Worker retry logs if duroxide tables are affected. | +| **pg_durable-specific concern** | Duroxide's event-sourced model means a corrupt page in `duroxide.execution_events` could affect replay of multiple orchestrations — not just one. The impact radius of a single corrupt page may be broader than for a typical application table. | +| **Recommendation** | Enable `data_checksums` on all managed PostgreSQL instances running pg_durable. Monitor for checksum failure warnings. | + +--- + +## 11. Summary: Detection Matrix + +This matrix cross-references each service failure mode with the detection mechanisms available. + +| SFM | Platform Metrics | PG Server Logs | `df.metrics()` | `df._worker_epoch` | Synthetic Canary | Gap?
| +|-----|:---:|:---:|:---:|:---:|:---:|:---:| +| SFM-1 (Feature registration) | | | | | Y | | +| SFM-2 (Monitoring gaps) | | | | | | **Yes** — meta-gap | +| SFM-3 (Image mismatch) | Y (health) | Y | | | Y | | +| SFM-4 (Binary-schema gap) | | Y | Y | | Y | | +| SFM-5 (Sidecar failure) | Y | | | | | | +| SFM-6 (Control ring) | Y | | | | | | +| SFM-7 (Login SLA) | Y | | | | Y | | +| SFM-8 (Connectivity) | Y | | | | | | +| SFM-9 (Compute failure) | Y | Y | | Y | Y | | +| SFM-10 (Storage failure) | Y | Y | | | | | +| SFM-11 (Resource degradation) | Y | | Y | | Y | | +| SFM-12 (Create server) | | Y | | Y | Y | | +| SFM-13 (Scale compute) | | Y | | Y | | | +| SFM-14 (Scale storage) | | | | | | | +| SFM-15 (Drop server) | Y | | | | | | +| SFM-16 (Config change) | | | | Y | Y | | +| SFM-17 (PITR) | | Y | | Y | | | +| SFM-18 (Accidental delete) | | | | | | **Yes** — no alert | +| SFM-19 (Regional DR) | Y | | | | | | +| SFM-20 (Restore dropped) | | Y | | Y | | | +| SFM-21 (IOPS limit) | Y | | Y | | Y | | +| SFM-22 (Write latency) | Y | | | | Y | | +| SFM-23 (CPU/Memory) | Y | Y | | Y | | | +| SFM-24 (Crash loop) | Y | Y | | Y | | **Partial** — no circuit breaker metric | +| SFM-25 (TTD gap) | | | | | | **Yes** — see details | +| SFM-26 (Slow TTD) | | | | Y | Y | | +| SFM-27 (Slow TTM) | | Y | | | | | +| SFM-28 (Billing) | | | | | | **Yes** — no duroxide storage metric | +| SFM-29 (Backup SLO) | Y | | | | | | +| SFM-30 (Corrupt backup) | Y | | | | | | +| SFM-31 (Backup storage) | Y | | | | | | +| SFM-32 (Slow restore) | Y | | | | | | +| SFM-33 (Data file loss) | Y | Y | | | | | +| SFM-34 (Corrupt page) | | Y | | | | | + +**Key takeaways**: +- A **synthetic canary** (submit and verify a trivial workflow) detects the widest range of failure modes end-to-end. +- **`df._worker_epoch`** is the best pg_durable-specific liveness signal — but requires direct database access, not log scraping. 
+- **`df.metrics()`** is useful for throughput degradation detection but lacks queue depth, latency percentiles, and storage size metrics. +- The largest detection gaps are in **billing/storage accounting** and **accidental user-side data deletion**.
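
The two strongest signals in the takeaways, the `df._worker_epoch` heartbeat and the synthetic canary, could be combined by a monitoring job roughly as sketched below. This is only an illustration under stated assumptions: the `ProbeSample` type, the `classify` function, and the result labels are hypothetical names that do not exist in pg_durable, and the thresholds simply reuse the figures quoted in SFM-26 (a heartbeat written about every 5 seconds, detection within 10 to 15 seconds, and an example canary SLA of 30 seconds). How the timestamps are fetched from the database is deliberately left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed thresholds, derived from the figures quoted in SFM-26; tune per platform.
HEARTBEAT_STALE_AFTER = timedelta(seconds=15)  # heartbeat is written every ~5 s
CANARY_SLA = timedelta(seconds=30)             # example SLA for the canary workflow


@dataclass
class ProbeSample:
    """One monitoring sample (hypothetical structure for this sketch)."""
    now: datetime
    last_seen_at: Optional[datetime]         # df._worker_epoch.last_seen_at; None if the table is empty
    canary_started_at: datetime              # when the canary workflow was submitted
    canary_completed_at: Optional[datetime]  # None while the canary has not completed


def classify(sample: ProbeSample) -> str:
    """Return 'worker_down', 'canary_failed', 'canary_pending', or 'healthy'."""
    # A missing or stale heartbeat means the background worker is not running.
    if sample.last_seen_at is None:
        return "worker_down"
    if sample.now - sample.last_seen_at > HEARTBEAT_STALE_AFTER:
        return "worker_down"

    # The worker looks alive, so judge the end-to-end canary against its SLA.
    if sample.canary_completed_at is None:
        elapsed = sample.now - sample.canary_started_at
        return "canary_failed" if elapsed > CANARY_SLA else "canary_pending"

    elapsed = sample.canary_completed_at - sample.canary_started_at
    return "canary_failed" if elapsed > CANARY_SLA else "healthy"
```

Checking the heartbeat first separates "worker is down" from "worker is up but workflows are not completing", which map to different runbook entries. A real probe would also need to clean up its canary instances so they do not add to `duroxide.*` table growth.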