Skip to content

Fix broker stuck in SYNCHRONIZING on DB error during rollback#4995

Open
kathap wants to merge 2 commits intomainfrom
fix-service-broker-stuck-synchronizing
Open

Fix broker stuck in SYNCHRONIZING on DB error during rollback#4995
kathap wants to merge 2 commits intomainfrom
fix-service-broker-stuck-synchronizing

Conversation

@kathap
Copy link
Copy Markdown
Contributor

@kathap kathap commented Apr 7, 2026

When a service broker update job fails, a database connection failure during state rollback could leave the broker permanently stuck in SYNCHRONIZING state with a FAILED job. This PR adds comprehensive error handling and a recovery hook to ensure brokers are reverted to their previous state even when database outages occur during job failure.

Changes:

a) app/jobs/v3/services/update_broker_job.rb:

  • Add recover_from_failure hook to revert broker state when job transitions to FAILED
  • Refactor rescue block to use rollback_broker_state and destroy_update_request helper methods with graceful error handling
  • Add conditional WHERE clause (state: SYNCHRONIZING) to prevent overwriting newer broker states

b) app/jobs/pollable_job_wrapper.rb:

  • Invoke recover_from_failure hook in failure method using Enqueuer.unwrap_job
  • Log recovery failures without raising to prevent cascading errors

c) spec/unit/jobs/v3/services/update_broker_job_spec.rb:

  • Add test for database disconnect during state rollback
  • Add comprehensive tests for recover_from_failure method covering normal recovery, idempotency, state protection, and error handling

d) spec/unit/jobs/pollable_job_wrapper_spec.rb:

  • Add integration test verifying recovery hook is called on job failure
  • A short explanation of the proposed change:

This PR adds multi-layered error handling for broker state management during job failures:

  1. Immediate rollback: Best-effort state revert in the job's rescue block (extracted to rollback_broker_state helper)
  2. Failure hook: recover_from_failure method called when job exhausts retries and transitions to FAILED
  3. Conditional updates: WHERE clause ensures only SYNCHRONIZING brokers are reverted, protecting against overwriting newer states
  • An explanation of the use cases your change solves

This change solves a critical production issue where service brokers become permanently stuck in SYNCHRONIZING state. This occurs when:

  1. A service broker update job is executing and encounters an error (e.g., invalid catalog, validation failure)
  2. The job's error handling attempts to revert the broker state back to its previous state (e.g., AVAILABLE)
  3. A database connection failure occurs during this state revert operation
  4. The job exhausts retries while the database remains unavailable
  5. Without this fix: The broker remains stuck in SYNCHRONIZING state with a FAILED job, requiring manual intervention
  6. With this fix: When the job transitions to FAILED, the recover_from_failure hook is invoked by PollableJobWrapper, which attempts to revert the broker state. The conditional WHERE
    clause prevents overwriting any newer state that may have been set, ensuring safe recovery.
  • I have reviewed the contributing guide

  • I have viewed, signed, and submitted the Contributor License Agreement

  • I have made this pull request to the main branch

  • I have run all the unit tests using bundle exec rake

  • I have run CF Acceptance Tests

@kathap kathap marked this pull request as draft April 7, 2026 12:09
@kathap kathap force-pushed the fix-service-broker-stuck-synchronizing branch from edad655 to c6e8ca8 Compare April 7, 2026 13:09
@kathap kathap marked this pull request as ready for review April 8, 2026 07:40
@kathap kathap force-pushed the fix-service-broker-stuck-synchronizing branch 3 times, most recently from b45ec83 to fdf1cfb Compare April 8, 2026 09:27
johha
johha previously approved these changes Apr 8, 2026
@kathap kathap marked this pull request as draft April 8, 2026 09:41
@kathap kathap marked this pull request as ready for review April 9, 2026 10:40
@kathap kathap force-pushed the fix-service-broker-stuck-synchronizing branch from df03384 to c00c7e5 Compare April 9, 2026 14:48
johha
johha previously approved these changes Apr 9, 2026
Copy link
Copy Markdown
Contributor

@johha johha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

When a service broker update job fails and attempts to revert the
  broker state, a database connection failure could cause the job to
  crash without properly handling the original error. This left the
  broker stuck in SYNCHRONIZING state with a FAILED job.

  This change wraps the state rollback operation in error handling to
  catch database errors and allow the original exception to be raised
  and the job to be retried properly.

  Changes:
  - app/jobs/v3/services/update_broker_job.rb: Add error handling around
    ServiceBroker.where().update() call in rescue block to gracefully
    handle database disconnections during state rollback
  - spec/unit/jobs/v3/services/update_broker_job_spec.rb: Add test case
    for database disconnect during state rollback
@kathap kathap force-pushed the fix-service-broker-stuck-synchronizing branch from c00c7e5 to b002752 Compare April 9, 2026 14:56
johha
johha previously approved these changes Apr 9, 2026
@kathap kathap marked this pull request as draft April 9, 2026 15:15
@kathap kathap force-pushed the fix-service-broker-stuck-synchronizing branch from 7de4587 to 8a9524d Compare April 10, 2026 06:58
When UpdateBrokerJob exhausts retries and transitions to FAILED,
  invoke recover_from_failure to revert broker from SYNCHRONIZING
  back to previous state. This ensures brokers don't remain stuck
  when jobs fail during extended database outages.

  Changes:
  - UpdateBrokerJob: Add recover_from_failure method with conditional
    WHERE clause to safely revert SYNCHRONIZING brokers
  - PollableJobWrapper: Call recover_from_failure hook in failure method
  - Extract rollback_broker_state and destroy_update_request helpers
    for cleaner error handling
  - Add tests for recovery hook behavior and edge cases
@kathap kathap force-pushed the fix-service-broker-stuck-synchronizing branch from 8a9524d to 1c425fc Compare April 10, 2026 07:09
@kathap kathap marked this pull request as ready for review April 10, 2026 07:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants