Operator#installShutdownHook silently skips installation when leader election is enabled #3376

@Dennis-Mircea

Summary

When leader election is enabled, Operator#installShutdownHook(Duration) does not install any JVM shutdown hook. It only logs a warning and returns. This drops a behavior the caller explicitly opted into, and the rationale is documented neither in the code nor in the related commit history. On bare-JVM deployments this leaves the leader lease unreleased on SIGTERM, forcing standbys to wait for lease expiry before they can take over.

Additionally, the method's Duration gracefulShutdownTimeout parameter has been dead since PR #2479. It is accepted but never honored at runtime, yet the JavaDoc still describes it as the timeout used during shutdown.

Code references

operator-framework-core/src/main/java/io/javaoperatorsdk/operator/Operator.java

// lines 143-150
@SuppressWarnings("unused")
public void installShutdownHook(Duration gracefulShutdownTimeout) {
  if (!leaderElectionManager.isLeaderElectionEnabled()) {
    Runtime.getRuntime().addShutdownHook(new Thread(this::stop));
  } else {
    log.warn("Leader election is on, shutdown hook will not be installed.");
  }
}

And the stop() method already handles leader election correctly:

// lines 191-214 (relevant excerpt)
@Override
public void stop() throws OperatorException {
  ...
  controllerManager.stop();
  configurationService.getExecutorServiceManager().stop(reconciliationTerminationTimeout);
  leaderElectionManager.stop();              // <-- already releases the lease
  if (configurationService.closeClientOnStop()) {
    getKubernetesClient().close();
  }
  ...
}

Observations

1. The asymmetry between installShutdownHook and stop

stop() is fully aware of leader election: it calls leaderElectionManager.stop(). If the shutdown hook simply invoked this::stop (as it already does in the non-leader-election branch), the lease would be released cleanly on JVM shutdown. There is no apparent technical reason the hook can't be installed when leader election is on, since the cleanup path is already there.

The current behavior means that:

  • The user explicitly calls installShutdownHook(...) to request graceful shutdown on SIGTERM.
  • JOSDK silently declines to honor that request (only a log.warn).
  • On SIGTERM, the JVM exits without invoking stop().
  • leaderElectionManager.stop() never runs.
  • The lease held by the dying leader is never released and stays valid until its leaseDuration elapses.
  • Standby replicas wait out the lease before electing themselves leader.

This is the opposite of what most users would expect from "install a graceful shutdown hook on a leader-elected operator".
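The sequence above can be reproduced in a small self-contained model. This is only a sketch: CurrentHookModel, hookRegistrar and leaseHeldAfterSigterm are hypothetical names invented for illustration, not JOSDK classes, and the model assumes the instance holds the lease whenever leader election is on. The hook registrar is injected so that "SIGTERM" can be simulated by running the registered hooks directly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical stand-in for the operator; it is not the real JOSDK class. */
class CurrentHookModel {
  private final boolean leaderElectionEnabled;
  // In a real JVM this would be: r -> Runtime.getRuntime().addShutdownHook(new Thread(r))
  private final Consumer<Runnable> hookRegistrar;
  private boolean leaseHeld;

  CurrentHookModel(boolean leaderElectionEnabled, Consumer<Runnable> hookRegistrar) {
    this.leaderElectionEnabled = leaderElectionEnabled;
    this.hookRegistrar = hookRegistrar;
    // Assumption: with leader election on, this instance currently holds the lease.
    this.leaseHeld = leaderElectionEnabled;
  }

  /** Mirrors the branch quoted in the issue: the hook is registered only when leader election is off. */
  void installShutdownHook() {
    if (!leaderElectionEnabled) {
      hookRegistrar.accept(this::stop);
    }
  }

  /** Mirrors stop(): giving up the lease is part of the ordinary stop path. */
  void stop() {
    leaseHeld = false;
  }

  boolean isLeaseHeld() {
    return leaseHeld;
  }

  /** Simulates SIGTERM by running every registered hook, then reports the lease state. */
  static boolean leaseHeldAfterSigterm(boolean leaderElectionEnabled) {
    List<Runnable> hooks = new ArrayList<>();
    CurrentHookModel operator = new CurrentHookModel(leaderElectionEnabled, hooks::add);
    operator.installShutdownHook();
    hooks.forEach(Runnable::run);
    return operator.isLeaseHeld();
  }

  public static void main(String[] args) {
    System.out.println("lease held after SIGTERM: " + leaseHeldAfterSigterm(true));
  }
}
```

In this model, no hook is registered when leader election is on, so nothing ever calls stop() and the lease flag stays set, even though stop() itself would have cleared it.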

2. Dead gracefulShutdownTimeout parameter

Originally installShutdownHook(Duration) called stop(gracefulShutdownTimeout), honoring the per-call value. Commit 9390d89e ("feat: support for graceful shutdown based on configuration", PR #2479) re-routed the timeout to come from ConfigurationService.reconciliationTerminationTimeout() and changed the hook to call the no-arg this::stop. The Duration parameter was kept in the signature (probably for binary compatibility) but is no longer used, hence the @SuppressWarnings("unused"). The JavaDoc still says:

@param gracefulShutdownTimeout timeout to wait for executor threads to complete actual reconciliations

…which is misleading. Callers can pass any value and it will have no effect.
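To make the parameter flow concrete, here is a minimal model of it. TimeoutFlowModel and effectiveTimeout are hypothetical names; configuredTimeout plays the role of ConfigurationService.reconciliationTerminationTimeout() and nothing here calls the real JOSDK API. It illustrates only what the issue describes: the argument is accepted, never read, and the configured value is what stop() ends up using.

```java
import java.time.Duration;

/** Hypothetical stand-in; only the flow of the timeout value is modelled. */
class TimeoutFlowModel {
  // Plays the role of ConfigurationService.reconciliationTerminationTimeout().
  private final Duration configuredTimeout;
  private Runnable hook;
  private Duration timeoutUsedOnStop;

  TimeoutFlowModel(Duration configuredTimeout) {
    this.configuredTimeout = configuredTimeout;
  }

  /** Same shape as the method in the issue: the argument is accepted but never read. */
  @SuppressWarnings("unused")
  void installShutdownHook(Duration gracefulShutdownTimeout) {
    hook = this::stop;
  }

  /** The no-arg stop path only knows about the configured value. */
  void stop() {
    timeoutUsedOnStop = configuredTimeout;
  }

  /** Installs the hook with one value, runs it, and reports which timeout was actually used. */
  static Duration effectiveTimeout(Duration configured, Duration passedToHook) {
    TimeoutFlowModel model = new TimeoutFlowModel(configured);
    model.installShutdownHook(passedToHook);
    model.hook.run();
    return model.timeoutUsedOnStop;
  }

  public static void main(String[] args) {
    Duration used = effectiveTimeout(Duration.ofSeconds(10), Duration.ofMinutes(5));
    System.out.println("timeout used on stop: " + used);
  }
}
```

Whatever Duration the caller passes to installShutdownHook, the model reports the configured value, which is why the current @param text reads as a promise the method no longer keeps.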

3. No documented rationale for the skip

I searched the commit history for installShutdownHook and the warn-and-skip behavior, and could not find a commit message or PR description explaining why the hook is skipped when leader election is on. The closest hypothesis I can construct is that "frameworks like Spring/Quarkus install their own shutdown hooks, so JOSDK should not double-register", but that is never stated in the code, and it also leaves bare-JVM users with broken graceful shutdown.

Open Questions

  1. Is the skip-on-leader-election behavior intentional? If so, what's the underlying concern (double-hook registration with a framework, race with the lease renewer, something else)?
  2. Should installShutdownHook(...) install the hook regardless of leader election, and rely on stop() (which already handles leaderElectionManager.stop()) to do the right thing?
  3. The Duration parameter on installShutdownHook(Duration) is unused. Should it be deprecated in favor of a no-arg overload, with the JavaDoc updated to point at ConfigurationServiceOverrider#withReconciliationTerminationTimeout(...) as the real configuration knob?

Possible directions

Depending on maintainer guidance, any of these are reasonable PR scopes (single PR or split):

  • Docs-only: update the JavaDoc and the log.warn message to explain the current behavior and point users at the right pattern for framework-managed shutdown. No behavior change.
  • Behavior fix: install the hook regardless of leader-election state, since stop() is already leader-election-aware.
  • API cleanup: deprecate the Duration parameter; introduce installShutdownHook() (no-arg) and update the JavaDoc to reference withReconciliationTerminationTimeout(...).
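As a rough illustration of how the behavior fix and the API cleanup could fit together, the sketch below uses a hypothetical ProposedHookModel rather than the real Operator class. Two things in it are assumptions that go beyond the issue text: the injected hookRegistrar (used so the hook can be run without exiting the JVM) and the AtomicBoolean guard that makes stop() idempotent, added only to show one way a double invocation by a framework-owned hook could be made harmless. Whether the real stop() needs or already has such a guard is not established here.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

/** Hypothetical model of the proposed directions; not the real JOSDK Operator. */
class ProposedHookModel {
  // In a real JVM this would be: r -> Runtime.getRuntime().addShutdownHook(new Thread(r))
  private final Consumer<Runnable> hookRegistrar;
  private final AtomicBoolean stopped = new AtomicBoolean(false);
  private boolean leaseHeld = true; // assumption: this instance is the current leader
  private int stopRuns = 0;

  ProposedHookModel(Consumer<Runnable> hookRegistrar) {
    this.hookRegistrar = hookRegistrar;
  }

  /** Behavior fix: always register the hook, independent of leader election. */
  void installShutdownHook() {
    hookRegistrar.accept(this::stop);
  }

  /** API cleanup: the old signature stays for compatibility and simply delegates. */
  @Deprecated
  void installShutdownHook(Duration gracefulShutdownTimeout) {
    installShutdownHook();
  }

  /** Idempotent stop (an assumption of this sketch): only the first call does any work. */
  void stop() {
    if (!stopped.compareAndSet(false, true)) {
      return;
    }
    stopRuns++;
    leaseHeld = false; // stands in for leaderElectionManager.stop()
  }

  boolean isLeaseHeld() {
    return leaseHeld;
  }

  int stopRuns() {
    return stopRuns;
  }
}
```

With this shape, a caller on a bare JVM gets the lease released by the hook, and a second stop() call from elsewhere becomes a no-op in the model.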
