Retry mechanism with retry breaking system#797
Conversation
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8bc993cafd
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
# Conflicts: # lib/OnyxUtils.ts # lib/storage/providers/IDBKeyValProvider/createStore.ts
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 23bc519d72
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: edc6ec8dbb
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| StorageCircuitBreaker.recordEviction(); | ||
|
|
||
| // @ts-expect-error No overload matches this call. | ||
| return remove(keyForRemoval).then(() => onyxMethod(defaultParams, nextRetryAttempt)); |
There was a problem hiding this comment.
Move eviction marker after deletion completes
Fresh evidence beyond the earlier stale-marker issue is that this records an eviction before remove(keyForRemoval) has resolved. When several writes hit quota concurrently, the next write's rejection can run while the first deletion transaction is still pending, so recordCapacityFailure() counts it as a no-progress eviction even though nothing has been freed yet; after five such racing failures the breaker opens and subsequent storage writes are skipped for 60s despite the evictions potentially succeeding. Record the eviction only after the deletion completes, or associate the no-progress verdict with the retry that follows that deletion.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
+1 Can we move the marker until after remove() resolves, or better tie the no-progress verdict to the retry that follows that specific eviction?
roryabraham
left a comment
There was a problem hiding this comment.
I've spent some time reviewing this, and I think this refactor makes things cleaner:
- Create a generic, type-safe
StateMachineclass implementing a basic state machine - Create classes implementing the basic circuit breaker pattern. We actually have two variants of the pattern: one that counts errors over a rolling window, and another that counts consecutive errors. So we create an abstract base class and two concrete subclasses
- Bonus: the "half-open" state (common in other implementations of the circuit breaker pattern I've seen) helps reduce the "thundering herd" problem when we probe a service that hasn't yet recovered
- StorageCircuitBreaker is a composition of both: the rolling window and the consecutive. This logic was already present, just made clearer by the refactor imo.
- (optional, not shown in the diff) - put
StateMachineand theCircuitBreakerclasses in expensify-common for reuse, since they're generic.
diff
diff --git a/lib/CircuitBreaker/AbstractCircuitBreaker.ts b/lib/CircuitBreaker/AbstractCircuitBreaker.ts
new file mode 100644
index 00000000..d3b4ca87
--- /dev/null
+++ b/lib/CircuitBreaker/AbstractCircuitBreaker.ts
@@ -0,0 +1,155 @@
+import StateMachine from '../StateMachine';
+import {CIRCUIT_BREAKER_TRANSITIONS, type CircuitBreakerOptions, type CircuitBreakerState} from './types';
+
+/**
+ * Generic circuit breaker built on {@link StateMachine}.
+ *
+ * - **closed**: requests are allowed; failures are counted.
+ * - **open**: requests are rejected until {@link resetTimeoutMs} elapses.
+ * - **half-open**: a single probe request is allowed; success closes the circuit, failure reopens it.
+ *
+ * Subclasses implement failure counting for a specific policy (consecutive, rolling window, etc.).
+ *
+ * @example
+ * const breaker = new ConsecutiveFailureCircuitBreaker({failureThreshold: 3, resetTimeoutMs: 30_000});
+ *
+ * async function callUpstream() {
+ * if (!breaker.isAllowed()) {
+ * throw new Error('Circuit open');
+ * }
+ *
+ * try {
+ * const result = await fetch('/api/data');
+ * breaker.recordSuccess();
+ * return result;
+ * } catch {
+ * breaker.recordFailure();
+ * throw new Error('Upstream failed');
+ * }
+ * }
+ */
+abstract class AbstractCircuitBreaker {
+ private machine: StateMachine<typeof CIRCUIT_BREAKER_TRANSITIONS, CircuitBreakerState>;
+
+ private openedAt = 0;
+
+ private isProbeInFlight = false;
+
+ private readonly resetTimeoutMs: number;
+
+ private readonly onTrip?: (reason: string) => void;
+
+ private readonly onClose?: () => void;
+
+ constructor(options: CircuitBreakerOptions = {}) {
+ this.resetTimeoutMs = options.resetTimeoutMs ?? 60_000;
+ this.onTrip = options.onTrip;
+ this.onClose = options.onClose;
+ this.machine = new StateMachine('closed', CIRCUIT_BREAKER_TRANSITIONS);
+ }
+
+ /** Record a failure while the circuit is closed. Returns a trip reason when the threshold is exceeded. */
+ protected abstract recordFailureInClosed(): string | null;
+
+ /** Update failure state after a successful request while the circuit is closed. */
+ protected abstract recordSuccessInClosed(): void;
+
+ /** Clear accumulated failure state without changing circuit state. */
+ protected abstract resetFailureState(): void;
+
+ /**
+ * Whether a request may proceed.
+ * Returns `false` while open. In half-open, only one probe is admitted at a time.
+ */
+ isAllowed(): boolean {
+ const currentState = this.getState();
+
+ if (currentState === 'open') {
+ return false;
+ }
+
+ if (currentState === 'half-open') {
+ if (this.isProbeInFlight) {
+ return false;
+ }
+ this.isProbeInFlight = true;
+ }
+
+ return true;
+ }
+
+ /**
+ * Record a failed request. May open the circuit from closed or half-open.
+ * @returns `true` when the circuit is open after recording (the request must not proceed).
+ */
+ recordFailure(): boolean {
+ if (this.machine.state === 'open') {
+ return true;
+ }
+
+ if (this.machine.state === 'half-open') {
+ this.trip();
+ return true;
+ }
+
+ const reason = this.recordFailureInClosed();
+ if (reason) {
+ this.trip(reason);
+ return true;
+ }
+
+ return false;
+ }
+
+ /** Record a successful request. Closes the circuit from half-open and clears failure counts. */
+ recordSuccess(): void {
+ if (this.machine.state === 'half-open') {
+ this.close();
+ return;
+ }
+
+ if (this.machine.state === 'closed') {
+ this.recordSuccessInClosed();
+ }
+ }
+
+ private getState(): CircuitBreakerState {
+ this.maybeRecover();
+ return this.machine.state;
+ }
+
+ private trip(reason = ''): void {
+ if (this.machine.state === 'open') {
+ return;
+ }
+
+ this.machine = this.machine.transition('open');
+ this.openedAt = Date.now();
+ this.isProbeInFlight = false;
+ this.resetFailureState();
+ this.onTrip?.(reason);
+ }
+
+ private close(): void {
+ this.machine = new StateMachine('closed', CIRCUIT_BREAKER_TRANSITIONS);
+ this.openedAt = 0;
+ this.isProbeInFlight = false;
+ this.resetFailureState();
+ this.onClose?.();
+ }
+
+ private maybeRecover(): void {
+ if (this.machine.state !== 'open') {
+ return;
+ }
+
+ if (Date.now() - this.openedAt < this.resetTimeoutMs) {
+ return;
+ }
+
+ this.machine = this.machine.transition('half-open');
+ this.isProbeInFlight = false;
+ }
+}
+
+export default AbstractCircuitBreaker;
diff --git a/lib/CircuitBreaker/ConsecutiveFailureCircuitBreaker.ts b/lib/CircuitBreaker/ConsecutiveFailureCircuitBreaker.ts
new file mode 100644
index 00000000..a95f966e
--- /dev/null
+++ b/lib/CircuitBreaker/ConsecutiveFailureCircuitBreaker.ts
@@ -0,0 +1,41 @@
+import AbstractCircuitBreaker from './AbstractCircuitBreaker';
+import type {CircuitBreakerOptions} from './types';
+
+type ConsecutiveFailureCircuitBreakerOptions = CircuitBreakerOptions & {
+ /** Consecutive failures in the closed state before the circuit opens. */
+ failureThreshold?: number;
+};
+
+/** Trips after consecutive failures in a row. Success resets the count. */
+class ConsecutiveFailureCircuitBreaker extends AbstractCircuitBreaker {
+ private failureCount = 0;
+
+ private readonly failureThreshold: number;
+
+ private readonly formatTripReason: (failureCount: number, windowMs?: number) => string;
+
+ constructor(options: ConsecutiveFailureCircuitBreakerOptions = {}) {
+ super(options);
+ this.failureThreshold = options.failureThreshold ?? 5;
+ this.formatTripReason = options.formatTripReason ?? ((failureCount) => `${failureCount} consecutive failures`);
+ }
+
+ protected recordFailureInClosed(): string | null {
+ this.failureCount += 1;
+ if (this.failureCount >= this.failureThreshold) {
+ return this.formatTripReason(this.failureCount);
+ }
+ return null;
+ }
+
+ protected recordSuccessInClosed(): void {
+ this.failureCount = 0;
+ }
+
+ protected resetFailureState(): void {
+ this.failureCount = 0;
+ }
+}
+
+export default ConsecutiveFailureCircuitBreaker;
+export type {ConsecutiveFailureCircuitBreakerOptions};
diff --git a/lib/CircuitBreaker/RollingWindowCircuitBreaker.ts b/lib/CircuitBreaker/RollingWindowCircuitBreaker.ts
new file mode 100644
index 00000000..0c150cc7
--- /dev/null
+++ b/lib/CircuitBreaker/RollingWindowCircuitBreaker.ts
@@ -0,0 +1,61 @@
+import AbstractCircuitBreaker from './AbstractCircuitBreaker';
+import type {CircuitBreakerOptions} from './types';
+
+type RollingWindowCircuitBreakerOptions = CircuitBreakerOptions & {
+ /** Rolling window length in milliseconds. */
+ windowMs: number;
+
+ /** Failures inside the window above which the circuit opens. */
+ failureThreshold: number;
+
+ /** Called when the rolling window is empty before a new failure is recorded. */
+ onWindowBecameEmpty?: () => void;
+};
+
+/** Trips when failures inside a rolling time window exceed the threshold. */
+class RollingWindowCircuitBreaker extends AbstractCircuitBreaker {
+ private failureTimestamps: number[] = [];
+
+ private readonly windowMs: number;
+
+ private readonly failureThreshold: number;
+
+ private readonly formatTripReason: (failureCount: number, windowMs?: number) => string;
+
+ private readonly onWindowBecameEmpty?: () => void;
+
+ constructor(options: RollingWindowCircuitBreakerOptions) {
+ super(options);
+ this.windowMs = options.windowMs;
+ this.failureThreshold = options.failureThreshold;
+ this.onWindowBecameEmpty = options.onWindowBecameEmpty;
+ this.formatTripReason = options.formatTripReason ?? ((failureCount, windowMs) => `${failureCount} failures within ${(windowMs ?? 0) / 1000}s`);
+ }
+
+ protected recordFailureInClosed(): string | null {
+ const now = Date.now();
+ this.failureTimestamps = this.failureTimestamps.filter((timestamp) => now - timestamp < this.windowMs);
+
+ if (this.failureTimestamps.length === 0) {
+ this.onWindowBecameEmpty?.();
+ }
+
+ this.failureTimestamps.push(now);
+
+ if (this.failureTimestamps.length > this.failureThreshold) {
+ return this.formatTripReason(this.failureTimestamps.length, this.windowMs);
+ }
+ return null;
+ }
+
+ protected recordSuccessInClosed(): void {
+ // A success does not clear the rolling window; only {@link resetFailureState} does.
+ }
+
+ protected resetFailureState(): void {
+ this.failureTimestamps = [];
+ }
+}
+
+export default RollingWindowCircuitBreaker;
+export type {RollingWindowCircuitBreakerOptions};
diff --git a/lib/CircuitBreaker/types.ts b/lib/CircuitBreaker/types.ts
new file mode 100644
index 00000000..04d8358d
--- /dev/null
+++ b/lib/CircuitBreaker/types.ts
@@ -0,0 +1,25 @@
+/** States of the circuit breaker. */
+type CircuitBreakerState = 'closed' | 'open' | 'half-open';
+
+const CIRCUIT_BREAKER_TRANSITIONS = {
+ closed: ['open'],
+ open: ['half-open'],
+ 'half-open': ['closed', 'open'],
+} as const satisfies Record<CircuitBreakerState, readonly CircuitBreakerState[]>;
+
+type CircuitBreakerOptions = {
+ /** Time in milliseconds the circuit stays open before moving to half-open. */
+ resetTimeoutMs?: number;
+
+ /** Called once each time the circuit opens. */
+ onTrip?: (reason: string) => void;
+
+ /** Called when the circuit closes. */
+ onClose?: () => void;
+
+ /** Builds the trip reason string when the threshold is exceeded. */
+ formatTripReason?: (failureCount: number, windowMs?: number) => string;
+};
+
+export type {CircuitBreakerOptions, CircuitBreakerState};
+export {CIRCUIT_BREAKER_TRANSITIONS};
diff --git a/lib/OnyxUtils.ts b/lib/OnyxUtils.ts
index 4979b3c5..d89927fa 100644
--- a/lib/OnyxUtils.ts
+++ b/lib/OnyxUtils.ts
@@ -826,10 +826,12 @@ function retryOperation<TMethod extends RetriableOnyxOperation>(
const nextRetryAttempt = currentRetryAttempt + 1;
const errorClass = Storage.classifyError(error);
- // Once the breaker is open, every capacity write is going to fail the same way. Drop it silently —
+ // While open (or while a half-open probe is already in flight), drop capacity retries silently —
// the breaker already emitted its single alert, and logging per failed write is exactly the storm
- // we are suppressing. (We return before the log line below on purpose.)
- if (errorClass === StorageErrorClass.CAPACITY && StorageCircuitBreaker.isTripped()) {
+ // we are suppressing. A rejected half-open caller is the in-flight probe failing; record that so
+ // the circuit reopens for another window. (We return before the log line below on purpose.)
+ if (errorClass === StorageErrorClass.CAPACITY && !StorageCircuitBreaker.isAllowed()) {
+ StorageCircuitBreaker.recordProbeFailure();
return Promise.resolve();
}
@@ -869,8 +871,7 @@ function retryOperation<TMethod extends RetriableOnyxOperation>(
// cannot stop a session-wide storm — each evicted key triggers an OnyxDerived recompute that spawns
// a fresh write with its own budget — so the breaker is what actually halts the meltdown. (The
// already-open case returned silently at the top of this function.)
- StorageCircuitBreaker.recordCapacityFailure();
- if (StorageCircuitBreaker.isTripped()) {
+ if (StorageCircuitBreaker.recordCapacityFailure()) {
// This failure tripped the breaker; it already emitted its single alert. Stop here.
return Promise.resolve();
}
diff --git a/lib/StateMachine.ts b/lib/StateMachine.ts
new file mode 100644
index 00000000..43d9e84b
--- /dev/null
+++ b/lib/StateMachine.ts
@@ -0,0 +1,63 @@
+import type {ReadonlyDeep} from 'type-fest';
+
+/**
+ * A directed transition graph keyed by state name.
+ * Use `as const` when defining a graph so illegal transitions are caught at compile time.
+ *
+ * @example
+ * const transitions = {
+ * idle: ['loading'],
+ * loading: ['success', 'error'],
+ * success: [],
+ * error: ['idle'],
+ * } as const;
+ */
+type TransitionGraph = Readonly<Record<string, readonly string[]>>;
+
+/** Target states reachable from `Current` according to `Graph`. */
+type TransitionsFrom<Graph extends TransitionGraph, Current extends keyof Graph & string> = Graph[Current] extends readonly (infer Target extends string)[] ? Target : never;
+
+/**
+ * An immutable, type-safe finite state machine.
+ * Pass the transition graph with `as const` so `transition` only accepts legal target states.
+ *
+ * @example
+ * const transitions = {
+ * idle: ['loading'],
+ * loading: ['success', 'error'],
+ * success: [],
+ * error: ['idle'],
+ * } as const;
+ *
+ * const idleMachine = new StateMachine('idle', transitions);
+ * const loadingMachine = idleMachine.transition('loading');
+ * loadingMachine.transition('success');
+ */
+class StateMachine<const Graph extends TransitionGraph, Current extends keyof Graph & string> {
+ /** The current state. Deeply readonly and owned by this state machine instance. */
+ readonly state: ReadonlyDeep<Current>;
+
+ private readonly transitions: Graph;
+
+ constructor(currentState: Current, transitions: Graph) {
+ this.state = currentState as ReadonlyDeep<Current>;
+ this.transitions = transitions;
+ Object.freeze(this);
+ }
+
+ /**
+ * Transition to a new state, returning a new state machine instance.
+ * Only transitions declared in the graph for the current state are accepted.
+ */
+ transition<Target extends TransitionsFrom<Graph, Current>>(target: Target): StateMachine<Graph, Target> {
+ const validTargets = this.transitions[this.state as Current];
+ if (!validTargets?.includes(target)) {
+ throw new Error(`Illegal transition from "${String(this.state)}" to "${String(target)}"`);
+ }
+
+ return new StateMachine(target, this.transitions);
+ }
+}
+
+export default StateMachine;
+export type {TransitionGraph, TransitionsFrom};
diff --git a/lib/StorageCircuitBreaker.ts b/lib/StorageCircuitBreaker.ts
index d7c55677..71f54dc3 100644
--- a/lib/StorageCircuitBreaker.ts
+++ b/lib/StorageCircuitBreaker.ts
@@ -1,3 +1,5 @@
+import ConsecutiveFailureCircuitBreaker from './CircuitBreaker/ConsecutiveFailureCircuitBreaker';
+import RollingWindowCircuitBreaker from './CircuitBreaker/RollingWindowCircuitBreaker';
import * as Logger from './Logger';
/**
@@ -15,106 +17,117 @@ import * as Logger from './Logger';
* failure (the eviction freed nothing the next write could use — a no-progress cycle). This is a
* cheap proxy for `getDatabaseSize()`, which is costly and only reports origin-wide usage.
*
- * On trip it emits exactly ONE alert and self-resets once the rolling window clears, so a persistent
- * condition produces at most one alert per window instead of one log line per failed write.
+ * On trip it emits exactly ONE alert. After {@link ROLLING_WINDOW_MS} the circuit moves to half-open
+ * and admits a single eviction+retry probe; a successful probe closes the circuit, a failed probe
+ * reopens it for another window.
*/
+class StorageCircuitBreaker {
+ private static readonly ROLLING_WINDOW_MS = 60 * 1000;
-/** Rolling window over which capacity failures are counted, and how long a trip stays open. */
-const ROLLING_WINDOW_MS = 60 * 1000;
+ private static readonly FAILURE_THRESHOLD = 50;
-/** Capacity failures within the window above which the breaker trips (storm backstop). */
-const FAILURE_THRESHOLD = 50;
+ private static readonly NO_PROGRESS_CAP = 5;
-/** Consecutive no-progress evictions (evict -> still capacity failure) above which the breaker trips. */
-const NO_PROGRESS_CAP = 5;
+ private evictionAwaitingResult = false;
-let failureTimestamps: number[] = [];
-let consecutiveNoProgressEvictions = 0;
-let evictionAwaitingResult = false;
-let trippedUntil = 0;
+ private hasTripped = false;
-function reset(): void {
- failureTimestamps = [];
- consecutiveNoProgressEvictions = 0;
- evictionAwaitingResult = false;
- trippedUntil = 0;
-}
+ private rollingWindowBreaker!: RollingWindowCircuitBreaker;
-/** Whether the breaker is currently open. Self-resets once the window since the trip has cleared. */
-function isTripped(): boolean {
- if (trippedUntil === 0) {
- return false;
- }
- if (Date.now() >= trippedUntil) {
- reset();
- return false;
- }
- return true;
-}
+ private noProgressBreaker!: ConsecutiveFailureCircuitBreaker;
-function trip(reason: string): void {
- trippedUntil = Date.now() + ROLLING_WINDOW_MS;
- Logger.logAlert(`Storage circuit breaker tripped: ${reason}. Halting eviction/retry for ${ROLLING_WINDOW_MS / 1000}s to stop a storage failure storm.`);
-}
+ constructor() {
+ this.initBreakers();
+ }
-/**
- * Record a CAPACITY failure. Call once per capacity failure in `retryOperation`, BEFORE deciding
- * whether to evict; then check {@link isTripped} to decide whether to proceed.
- */
-function recordCapacityFailure(): void {
- // While open, recording is a no-op: no extra timestamps, no second alert, and nothing to keep the
- // window from clearing. `isTripped()` self-resets here once the window has elapsed.
- if (isTripped()) {
- return;
+ /** Whether a capacity eviction+retry may proceed. */
+ isAllowed(): boolean {
+ return this.rollingWindowBreaker.isAllowed() && this.noProgressBreaker.isAllowed();
}
- const now = Date.now();
- failureTimestamps = failureTimestamps.filter((timestamp) => now - timestamp < ROLLING_WINDOW_MS);
+ /**
+ * Record that a half-open probe failed. No-op while open; reopens the half-open breaker on failure.
+ */
+ recordProbeFailure(): void {
+ if (this.rollingWindowBreaker.recordFailure()) {
+ return;
+ }
+ this.noProgressBreaker.recordFailure();
+ }
- // A fresh storm (nothing left in the window) resets the no-progress tracking so a stale eviction
- // from an earlier, unrelated incident can't be miscounted as no-progress for this one.
- if (failureTimestamps.length === 0) {
- consecutiveNoProgressEvictions = 0;
- evictionAwaitingResult = false;
+ /**
+ * Record a CAPACITY failure. Call once per capacity failure in `retryOperation`, BEFORE deciding
+ * whether to evict. Returns `true` when the breaker is open and eviction must not proceed.
+ */
+ recordCapacityFailure(): boolean {
+ const rollingTripped = this.rollingWindowBreaker.recordFailure();
+ if (rollingTripped) {
+ return true;
+ }
+
+ if (!this.evictionAwaitingResult) {
+ return false;
+ }
+
+ this.evictionAwaitingResult = false;
+ return this.noProgressBreaker.recordFailure();
}
- // We evicted on the previous cycle and we're back here with another capacity failure, so that
- // eviction freed no usable space.
- if (evictionAwaitingResult) {
- consecutiveNoProgressEvictions += 1;
- evictionAwaitingResult = false;
+ /** Record that `retryOperation` just evicted a key, so the next capacity failure counts as no-progress. */
+ recordEviction(): void {
+ this.evictionAwaitingResult = true;
}
- failureTimestamps.push(now);
+ /** Record that a storage write succeeded. */
+ recordWriteSuccess(): void {
+ if (this.evictionAwaitingResult) {
+ this.evictionAwaitingResult = false;
+ }
+ this.rollingWindowBreaker.recordSuccess();
+ this.noProgressBreaker.recordSuccess();
+ }
- if (failureTimestamps.length > FAILURE_THRESHOLD) {
- trip(`${failureTimestamps.length} capacity failures within ${ROLLING_WINDOW_MS / 1000}s`);
- return;
+ reset(): void {
+ this.initBreakers();
+ this.hasTripped = false;
+ this.evictionAwaitingResult = false;
}
- if (consecutiveNoProgressEvictions >= NO_PROGRESS_CAP) {
- trip(`${consecutiveNoProgressEvictions} consecutive evictions freed no usable space`);
+
+ private initBreakers(): void {
+ this.noProgressBreaker = new ConsecutiveFailureCircuitBreaker({
+ failureThreshold: StorageCircuitBreaker.NO_PROGRESS_CAP,
+ resetTimeoutMs: StorageCircuitBreaker.ROLLING_WINDOW_MS,
+ formatTripReason: (failureCount) => `${failureCount} consecutive evictions freed no usable space`,
+ onTrip: (reason) => this.handleTrip(reason),
+ onClose: () => this.handleBreakerClose(),
+ });
+
+ this.rollingWindowBreaker = new RollingWindowCircuitBreaker({
+ windowMs: StorageCircuitBreaker.ROLLING_WINDOW_MS,
+ failureThreshold: StorageCircuitBreaker.FAILURE_THRESHOLD,
+ resetTimeoutMs: StorageCircuitBreaker.ROLLING_WINDOW_MS,
+ formatTripReason: (failureCount) => `${failureCount} capacity failures within ${StorageCircuitBreaker.ROLLING_WINDOW_MS / 1000}s`,
+ onWindowBecameEmpty: () => {
+ this.noProgressBreaker.recordSuccess();
+ this.evictionAwaitingResult = false;
+ },
+ onTrip: (reason) => this.handleTrip(reason),
+ onClose: () => this.handleBreakerClose(),
+ });
}
-}
-/** Record that `retryOperation` just evicted a key, so the next capacity failure counts as no-progress. */
-function recordEviction(): void {
- evictionAwaitingResult = true;
-}
+ private handleTrip(reason: string): void {
+ if (this.hasTripped) {
+ return;
+ }
+ this.hasTripped = true;
+ Logger.logAlert(`Storage circuit breaker tripped: ${reason}. Halting eviction/retry for ${StorageCircuitBreaker.ROLLING_WINDOW_MS / 1000}s to stop a storage failure storm.`);
+ }
-/**
- * Record that a storage write SUCCEEDED. If an eviction was awaiting its verdict, the eviction freed
- * usable space — so it must NOT later be miscounted as a no-progress cycle by the next capacity
- * failure. Clear the pending flag and reset the consecutive no-progress streak (a success breaks the
- * streak). No-op when no eviction is pending (the common case), so it's cheap to call on every write.
- */
-function recordWriteSuccess(): void {
- if (!evictionAwaitingResult) {
- return;
+ private handleBreakerClose(): void {
+ this.hasTripped = false;
+ this.evictionAwaitingResult = false;
}
- evictionAwaitingResult = false;
- consecutiveNoProgressEvictions = 0;
}
-const StorageCircuitBreaker = {recordCapacityFailure, recordEviction, recordWriteSuccess, isTripped, reset, ROLLING_WINDOW_MS, FAILURE_THRESHOLD, NO_PROGRESS_CAP};
-
-export default StorageCircuitBreaker;
+export default new StorageCircuitBreaker();
diff --git a/tests/types/StateMachine.ts b/tests/types/StateMachine.ts
new file mode 100644
index 00000000..8c12958e
--- /dev/null
+++ b/tests/types/StateMachine.ts
@@ -0,0 +1,24 @@
+import StateMachine from '../../lib/StateMachine';
+
+const transitions = {
+ idle: ['loading'],
+ loading: ['success', 'error'],
+ success: [],
+ error: ['idle'],
+} as const;
+
+const machine = new StateMachine('idle', transitions);
+
+// Valid transitions
+machine.transition('loading');
+machine.transition('loading').transition('success');
+machine.transition('loading').transition('error').transition('idle');
+
+// @ts-expect-error illegal transition from idle
+machine.transition('success');
+
+// @ts-expect-error illegal transition from loading
+machine.transition('loading').transition('idle');
+
+// @ts-expect-error terminal state has no outgoing transitions
+machine.transition('loading').transition('success').transition('loading');
diff --git a/tests/unit/StorageCircuitBreakerTest.ts b/tests/unit/StorageCircuitBreakerTest.ts
index 94069e06..5d5da7c4 100644
--- a/tests/unit/StorageCircuitBreakerTest.ts
+++ b/tests/unit/StorageCircuitBreakerTest.ts
@@ -1,6 +1,11 @@
import * as Logger from '../../lib/Logger';
import StorageCircuitBreaker from '../../lib/StorageCircuitBreaker';
+/** Mirror StorageCircuitBreaker tuning — tests assert behavior at these boundaries. */
+const ROLLING_WINDOW_MS = 60_000;
+const FAILURE_THRESHOLD = 50;
+const NO_PROGRESS_CAP = 5;
+
describe('StorageCircuitBreaker', () => {
let currentTime = 1_000_000;
let nowSpy: jest.SpyInstance;
@@ -9,6 +14,20 @@ describe('StorageCircuitBreaker', () => {
currentTime += ms;
};
+ const expectAdmissionClosed = () => {
+ expect(StorageCircuitBreaker.isAllowed()).toBe(true);
+ expect(StorageCircuitBreaker.isAllowed()).toBe(true);
+ };
+
+ const expectAdmissionOpen = () => {
+ expect(StorageCircuitBreaker.isAllowed()).toBe(false);
+ };
+
+ const expectAdmissionHalfOpen = () => {
+ expect(StorageCircuitBreaker.isAllowed()).toBe(true);
+ expect(StorageCircuitBreaker.isAllowed()).toBe(false);
+ };
+
beforeEach(() => {
currentTime = 1_000_000;
nowSpy = jest.spyOn(Date, 'now').mockImplementation(() => currentTime);
@@ -21,41 +40,42 @@ describe('StorageCircuitBreaker', () => {
});
it('should not trip below the failure threshold', () => {
- for (let i = 0; i < StorageCircuitBreaker.FAILURE_THRESHOLD; i++) {
- StorageCircuitBreaker.recordCapacityFailure();
+ for (let i = 0; i < FAILURE_THRESHOLD; i++) {
+ expect(StorageCircuitBreaker.recordCapacityFailure()).toBe(false);
}
- expect(StorageCircuitBreaker.isTripped()).toBe(false);
+ expectAdmissionClosed();
});
it('should trip once capacity failures exceed the threshold within the window', () => {
- for (let i = 0; i <= StorageCircuitBreaker.FAILURE_THRESHOLD; i++) {
- StorageCircuitBreaker.recordCapacityFailure();
+ for (let i = 0; i < FAILURE_THRESHOLD; i++) {
+ expect(StorageCircuitBreaker.recordCapacityFailure()).toBe(false);
}
- expect(StorageCircuitBreaker.isTripped()).toBe(true);
+ expect(StorageCircuitBreaker.recordCapacityFailure()).toBe(true);
+ expectAdmissionOpen();
});
it('should not trip when failures are spread across multiple windows', () => {
- for (let i = 0; i <= StorageCircuitBreaker.FAILURE_THRESHOLD; i++) {
+ for (let i = 0; i <= FAILURE_THRESHOLD; i++) {
StorageCircuitBreaker.recordCapacityFailure();
// Space each failure out so older ones fall out of the rolling window before the count builds up.
advance(2_000);
}
- expect(StorageCircuitBreaker.isTripped()).toBe(false);
+ expectAdmissionClosed();
});
it('should trip after consecutive no-progress evictions', () => {
// Each cycle is a capacity failure followed by an eviction that frees no usable space.
- for (let i = 0; i < StorageCircuitBreaker.NO_PROGRESS_CAP; i++) {
+ for (let i = 0; i < NO_PROGRESS_CAP; i++) {
StorageCircuitBreaker.recordCapacityFailure();
StorageCircuitBreaker.recordEviction();
}
// The next capacity failure observes that the last eviction made no progress, tipping it over.
- StorageCircuitBreaker.recordCapacityFailure();
+ expect(StorageCircuitBreaker.recordCapacityFailure()).toBe(true);
- expect(StorageCircuitBreaker.isTripped()).toBe(true);
+ expectAdmissionOpen();
});
it('should not trip when each eviction makes progress (retry succeeds)', () => {
@@ -63,18 +83,18 @@ describe('StorageCircuitBreaker', () => {
// A successful retry means the eviction freed usable space, so it must never be counted as a
// no-progress cycle by the next failure. Without recordWriteSuccess the stale pending flag made
// each subsequent failure look like no-progress and tripped the breaker after NO_PROGRESS_CAP cycles.
- for (let i = 0; i < StorageCircuitBreaker.NO_PROGRESS_CAP + 3; i++) {
+ for (let i = 0; i < NO_PROGRESS_CAP + 3; i++) {
StorageCircuitBreaker.recordCapacityFailure();
StorageCircuitBreaker.recordEviction();
StorageCircuitBreaker.recordWriteSuccess();
}
- expect(StorageCircuitBreaker.isTripped()).toBe(false);
+ expectAdmissionClosed();
});
it('should reset the no-progress streak when an eviction finally makes progress', () => {
// A few no-progress evictions build the streak up, but short of the cap.
- for (let i = 0; i < StorageCircuitBreaker.NO_PROGRESS_CAP - 1; i++) {
+ for (let i = 0; i < NO_PROGRESS_CAP - 1; i++) {
StorageCircuitBreaker.recordCapacityFailure();
StorageCircuitBreaker.recordEviction();
}
@@ -87,24 +107,24 @@ describe('StorageCircuitBreaker', () => {
// Two more no-progress cycles must not trip, because the streak was reset by the success above.
StorageCircuitBreaker.recordCapacityFailure();
StorageCircuitBreaker.recordEviction();
- StorageCircuitBreaker.recordCapacityFailure();
+ expect(StorageCircuitBreaker.recordCapacityFailure()).toBe(false);
- expect(StorageCircuitBreaker.isTripped()).toBe(false);
+ expectAdmissionClosed();
});
it('should not count a failure as no-progress when no eviction preceded it', () => {
// Capacity failures with no interleaved evictions must not accumulate no-progress cycles.
- for (let i = 0; i < StorageCircuitBreaker.NO_PROGRESS_CAP + 2; i++) {
- StorageCircuitBreaker.recordCapacityFailure();
+ for (let i = 0; i < NO_PROGRESS_CAP + 2; i++) {
+ expect(StorageCircuitBreaker.recordCapacityFailure()).toBe(false);
}
- expect(StorageCircuitBreaker.isTripped()).toBe(false);
+ expectAdmissionClosed();
});
it('should emit exactly one alert when it trips, even as failures continue', () => {
const logAlertSpy = jest.spyOn(Logger, 'logAlert');
- for (let i = 0; i <= StorageCircuitBreaker.FAILURE_THRESHOLD; i++) {
+ for (let i = 0; i <= FAILURE_THRESHOLD; i++) {
StorageCircuitBreaker.recordCapacityFailure();
}
// Further failures while open must not produce more alerts.
@@ -115,28 +135,61 @@ describe('StorageCircuitBreaker', () => {
expect(logAlertSpy).toHaveBeenCalledWith(expect.stringContaining('Storage circuit breaker tripped'));
});
- it('should self-reset once the rolling window clears', () => {
- for (let i = 0; i <= StorageCircuitBreaker.FAILURE_THRESHOLD; i++) {
+ it('should move to half-open once the open window clears', () => {
+ for (let i = 0; i <= FAILURE_THRESHOLD; i++) {
StorageCircuitBreaker.recordCapacityFailure();
}
- expect(StorageCircuitBreaker.isTripped()).toBe(true);
+ expectAdmissionOpen();
- advance(StorageCircuitBreaker.ROLLING_WINDOW_MS);
+ advance(ROLLING_WINDOW_MS);
- expect(StorageCircuitBreaker.isTripped()).toBe(false);
+ expectAdmissionHalfOpen();
+ });
+
+ it('should admit only one capacity retry while half-open', () => {
+ for (let i = 0; i <= FAILURE_THRESHOLD; i++) {
+ StorageCircuitBreaker.recordCapacityFailure();
+ }
+ advance(ROLLING_WINDOW_MS);
+
+ expectAdmissionHalfOpen();
+ });
+
+ it('should close the circuit when a half-open probe succeeds', () => {
+ for (let i = 0; i <= FAILURE_THRESHOLD; i++) {
+ StorageCircuitBreaker.recordCapacityFailure();
+ }
+ advance(ROLLING_WINDOW_MS);
+ expect(StorageCircuitBreaker.isAllowed()).toBe(true);
+
+ StorageCircuitBreaker.recordWriteSuccess();
+
+ expectAdmissionClosed();
+ });
+
+ it('should reopen when a half-open probe fails', () => {
+ for (let i = 0; i <= FAILURE_THRESHOLD; i++) {
+ StorageCircuitBreaker.recordCapacityFailure();
+ }
+ advance(ROLLING_WINDOW_MS);
+ expect(StorageCircuitBreaker.isAllowed()).toBe(true);
+
+ StorageCircuitBreaker.recordProbeFailure();
+
+ expectAdmissionOpen();
});
it('should reset no-progress tracking after the window clears between storms', () => {
// First storm: some no-progress evictions, but not enough to trip.
- for (let i = 0; i < StorageCircuitBreaker.NO_PROGRESS_CAP - 1; i++) {
+ for (let i = 0; i < NO_PROGRESS_CAP - 1; i++) {
StorageCircuitBreaker.recordCapacityFailure();
StorageCircuitBreaker.recordEviction();
}
// Let the window fully clear so the next failure starts a fresh storm.
- advance(StorageCircuitBreaker.ROLLING_WINDOW_MS + 1);
- StorageCircuitBreaker.recordCapacityFailure();
+ advance(ROLLING_WINDOW_MS + 1);
+ expect(StorageCircuitBreaker.recordCapacityFailure()).toBe(false);
- expect(StorageCircuitBreaker.isTripped()).toBe(false);
+ expectAdmissionClosed();
});
});
diff --git a/tests/unit/onyxUtilsTest.ts b/tests/unit/onyxUtilsTest.ts
index 99ae441d..7a1e51c9 100644
--- a/tests/unit/onyxUtilsTest.ts
+++ b/tests/unit/onyxUtilsTest.ts
@@ -747,6 +747,8 @@ describe('OnyxUtils', () => {
describe('retryOperation', () => {
const retryOperationSpy = jest.spyOn(OnyxUtils, 'retryOperation');
+ /** Mirrors StorageCircuitBreaker rolling-window trip threshold. */
+ const STORAGE_FAILURE_THRESHOLD = 50;
const genericError = new Error('Generic storage error');
const invalidDataError = new Error("Failed to execute 'put' on 'IDBObjectStore': invalid data");
const diskFullError = new Error('database or disk is full');
@@ -889,21 +891,21 @@ describe('OnyxUtils', () => {
// No evictable keys are configured, so each failing write records exactly one capacity
// failure with the breaker (it cannot evict). Enough of them within one window trips it.
- for (let i = 0; i <= StorageCircuitBreaker.FAILURE_THRESHOLD; i++) {
+ for (let i = 0; i <= STORAGE_FAILURE_THRESHOLD; i++) {
await Onyx.set(ONYXKEYS.TEST_KEY, {test: i});
}
await waitForPromisesToResolve();
- expect(StorageCircuitBreaker.isTripped()).toBe(true);
+ expect(StorageCircuitBreaker.isAllowed()).toBe(false);
expect(logAlertSpy).toHaveBeenCalledWith(expect.stringContaining('Storage circuit breaker tripped'));
});
it('should drop capacity writes silently while the circuit breaker is open', async () => {
// Trip the breaker deterministically so every capacity failure below is observed while open.
- for (let i = 0; i <= StorageCircuitBreaker.FAILURE_THRESHOLD; i++) {
+ for (let i = 0; i <= STORAGE_FAILURE_THRESHOLD; i++) {
StorageCircuitBreaker.recordCapacityFailure();
}
- expect(StorageCircuitBreaker.isTripped()).toBe(true);
+ expect(StorageCircuitBreaker.isAllowed()).toBe(false);
// Clear so we only observe logging caused by the write below, not the trip alert above.
const logInfoSpy = jest.spyOn(Logger, 'logInfo').mockClear();
@@ -915,7 +917,7 @@ describe('OnyxUtils', () => {
// The write and any cascading derived writes are dropped without per-write log spam, and
// without re-alerting — the single trip alert is the only signal while open.
- expect(StorageCircuitBreaker.isTripped()).toBe(true);
+ expect(StorageCircuitBreaker.isAllowed()).toBe(false);
expect(logInfoSpy).not.toHaveBeenCalledWith(expect.stringContaining('Failed to save to storage'));
expect(logAlertSpy).not.toHaveBeenCalled();
});
@@ -1651,14 +1653,19 @@ describe('OnyxUtils', () => {
await LocalOnyx.set(unrelatedKey, {value: 'evict-me'});
const memberCalls: unknown[] = [];
- LocalOnyx.connect({key: memberKey, callback: (value) => memberCalls.push(value)});
+ LocalOnyx.connect({
+ key: memberKey,
+ callback: (value) => memberCalls.push(value),
+ });
await waitForPromisesToResolve();
memberCalls.length = 0;
// Storage.multiMerge rejects once with disk-full, then succeeds on retry.
LocalStorageMock.multiMerge = jest.fn(LocalStorageMock.multiMerge).mockRejectedValueOnce(diskFullError).mockImplementation(LocalStorageMock.multiMerge);
- await LocalOnyx.mergeCollection(collectionKey, {[memberKey]: {value: 'merged'}} as GenericCollection);
+ await LocalOnyx.mergeCollection(collectionKey, {
+ [memberKey]: {value: 'merged'},
+ } as GenericCollection);
// The old code evicted the in-flight key and re-ran the merge against an empty cache,
// collapsing {id: 1, value: 'orig'} + {value: 'merged'} to just {value: 'merged'}. Now
@@ -1678,7 +1685,10 @@ describe('OnyxUtils', () => {
expect(LocalOnyxCache.getKeyForEviction()).toBe(memberKey);
const memberCalls: unknown[] = [];
- LocalOnyx.connect({key: memberKey, callback: (value) => memberCalls.push(value)});
+ LocalOnyx.connect({
+ key: memberKey,
+ callback: (value) => memberCalls.push(value),
+ });
await waitForPromisesToResolve();
memberCalls.length = 0;
@@ -1686,7 +1696,9 @@ describe('OnyxUtils', () => {
// finds no acceptable key and reports the quota instead of dropping (and truncating) it.
LocalStorageMock.multiMerge = jest.fn(LocalStorageMock.multiMerge).mockRejectedValue(diskFullError);
- await LocalOnyx.mergeCollection(collectionKey, {[memberKey]: {value: 'merged'}} as GenericCollection);
+ await LocalOnyx.mergeCollection(collectionKey, {
+ [memberKey]: {value: 'merged'},
+ } as GenericCollection);
expect(LocalOnyxCache.get(memberKey)).toEqual({id: 1, value: 'merged'});
expect(memberCalls.at(-1)).toEqual({id: 1, value: 'merged'});| UNKNOWN: 'unknown', | ||
| } as const; | ||
|
|
||
| type StorageErrorClassValue = ValueOf<typeof StorageErrorClass>; |
There was a problem hiding this comment.
NAB: I think ValueOf<typeof StorageErrorClass> is actually a touch clearer at the callsites than this named alias.
Details
Replace the brittle, string-matching storage-error retry logic with a layered design — classify the error → route it to exactly one recovery owner → stop session-wide storms with a circuit breaker.
Related Issues
Expensify/App#94069
Linked E/App PR
Expensify/App#92931
Automated Tests
Manual Tests
Prerequisites: put Chrome into insufficient storage situation by going to Console -> Application -> Storage and enable
Simulate custom storage quota. Set a value lower than used by the application.OnyxStorage circuit breaker tripped: 5 consecutive evictions freed no usable space. Halting eviction/retry for 60s to stop a storage failure storm.Author Checklist
### Related Issuessection above### Linked E/App PRsection above, and verified this change against it (E/App CI passed and manual testing completed)TestssectiontoggleReportand notonIconClick)myBool && <MyComponent />.STYLE.md) were followedAvatar, I verified the components usingAvatarare working as expected)/** comment above it */thisproperly so there are no scoping issues (i.e. foronClick={this.submit}the methodthis.submitshould be bound tothisin the constructor)thisare necessary to be bound (i.e. avoidthis.submit = this.submit.bind(this);ifthis.submitis never passed to a component event handler likeonClick)Avataris modified, I verified thatAvataris working as expected in all cases)mainbranch was merged into this PR after a review, I tested again and verified the outcome was still expected according to theTeststeps.Screenshots/Videos
Android: Native
Android: mWeb Chrome
iOS: Native
iOS: mWeb Safari
MacOS: Chrome / Safari
before:
Nagranie.z.ekranu.2026-06-22.o.15.05.45.mov
after:
Nagranie.z.ekranu.2026-06-22.o.15.07.34.mov