Skip to content

feat: reload configuration on SIGHUP without pod restart#84

Open
danolivo wants to merge 2 commits intomainfrom
ace-160
Open

feat: reload configuration on SIGHUP without pod restart#84
danolivo wants to merge 2 commits intomainfrom
ace-160

Conversation

@danolivo
Copy link

@danolivo danolivo commented Mar 5, 2026

ACE scheduler previously required a full restart to pick up any changes to ace.yaml or pg_service.conf. This adds live reload support via SIGHUP, satisfying the requirement for database-driven config management and Kubernetes ConfigMap updates.

Changes:

  • pkg/config: add thread-safe Reload/Set/Get API around the global Cfg; store CfgPath in Init so the reload handler knows where to read from
  • internal/cli: replace the flat runner loop in StartSchedulerCLI with a schedulerReloadLoop that handles SIGHUP independently of SIGTERM/SIGINT:
    • on SIGHUP: load new config, validate via dry-run BuildJobsFromConfig, drain in-flight ops via gocron.Shutdown, then atomically swap config
    • on invalid config: log the error and keep the current scheduler running
    • API server is started once and unaffected by reloads
  • tests: add pkg/config/config_reload_test.go covering Load, Init, Reload, Set, Get, DefaultCluster and concurrent access; add internal/cli/sighup_test.go covering shutdown, invalid-config rejection, valid reload, and multiple rapid SIGHUPs

Note: pg_service.conf is read lazily at task-execution time and is therefore reloaded implicitly on the next job run after a SIGHUP.

ACE scheduler previously required a full restart to pick up any changes
to ace.yaml or pg_service.conf.  This adds live reload support via
SIGHUP, satisfying the requirement for database-driven config management
and Kubernetes ConfigMap updates.

Changes:
- pkg/config: add thread-safe Reload/Set/Get API around the global Cfg;
  store CfgPath in Init so the reload handler knows where to read from
- internal/cli: replace the flat runner loop in StartSchedulerCLI with
  a schedulerReloadLoop that handles SIGHUP independently of SIGTERM/SIGINT:
    * on SIGHUP: load new config, validate via dry-run BuildJobsFromConfig,
      drain in-flight ops via gocron.Shutdown, then atomically swap config
    * on invalid config: log the error and keep the current scheduler running
    * API server is started once and unaffected by reloads
- tests: add pkg/config/config_reload_test.go covering Load, Init, Reload,
  Set, Get, DefaultCluster and concurrent access; add
  internal/cli/sighup_test.go covering shutdown, invalid-config rejection,
  valid reload, and multiple rapid SIGHUPs

Note: pg_service.conf is read lazily at task-execution time and is
therefore reloaded implicitly on the next job run after a SIGHUP.
@coderabbitai
Copy link

coderabbitai bot commented Mar 5, 2026

📝 Walkthrough

Walkthrough

Introduces a SIGHUP-driven scheduler reload loop that validates and atomically swaps configs with graceful draining, and adds concurrency-safe config APIs (Reload, Set, Get) plus a mutex-protected Cfg/CfgPath for live reloads.

Changes

Cohort / File(s) Summary
Scheduler reload mechanism
internal/cli/cli.go
Adds schedulerReloadLoop(runCtx context.Context, sighupCh <-chan os.Signal) error and rewrites CLI startup to run the API server in a separate goroutine; replaces one-shot scheduler start with a SIGHUP-driven reload loop that dry-runs new configs, drains in-flight jobs, and atomically swaps active config.
Concurrency-safe configuration
pkg/config/config.go
Introduces package-level RWMutex, CfgPath and exported Cfg accessors. Adds Reload(path string) (*Config,error), Set(c *Config), and Get() *Config. Converts Load() to return parsed *Config without mutating global state and updates callers (e.g., DefaultCluster) to use Get().

Poem

🐇 I sniff the SIGHUP on the breeze,
I hop, I pause, I check with ease,
Locks held tight, old jobs drain slow,
New config blooms — now off I go! 🎉

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 77.78% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and accurately describes the main change: adding live configuration reload support on SIGHUP without requiring pod restart, which is the primary objective of this pull request.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, explaining the problem, solution, implementation details, and noting implicit behavior with pg_service.conf.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch ace-160

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/cli/cli.go (1)

1402-1404: ⚠️ Potential issue | 🟠 Major

Use a config.Get() snapshot instead of direct config.Cfg access.

These lines bypass the synchronized accessor pattern introduced for live reloads. Capture one snapshot (cfg := config.Get()) and use that for nil-checks, canStartAPIServer, and server.New in both CLI entry points.

Suggested fix
 func StartSchedulerCLI(ctx *cli.Context) error {
-	if config.Cfg == nil {
+	cfg := config.Get()
+	if cfg == nil {
 		return fmt.Errorf("configuration not loaded; run inside a directory with ace.yaml or set ACE_CONFIG")
 	}
@@
 	if runAPI {
-		if ok, apiErr := canStartAPIServer(config.Cfg); ok {
-			apiServer, err := server.New(config.Cfg)
+		if ok, apiErr := canStartAPIServer(cfg); ok {
+			apiServer, err := server.New(cfg)
 			if err != nil {
 				return fmt.Errorf("api server init failed: %w", err)
 			}
@@
 func StartAPIServerCLI(ctx *cli.Context) error {
-	if config.Cfg == nil {
+	cfg := config.Get()
+	if cfg == nil {
 		return fmt.Errorf("configuration not loaded; run inside a directory with ace.yaml or set ACE_CONFIG")
 	}
 
-	if ok, err := canStartAPIServer(config.Cfg); !ok {
+	if ok, err := canStartAPIServer(cfg); !ok {
 		return err
 	}
 
-	apiServer, err := server.New(config.Cfg)
+	apiServer, err := server.New(cfg)
 	if err != nil {
 		return err
 	}

Also applies to: 1433-1435, 1557-1565

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/cli/cli.go` around lines 1402 - 1404, Replace direct uses of
config.Cfg with a single snapshot from the synchronized accessor: call cfg :=
config.Get() at the start of each CLI entry point and use that local cfg for the
nil-check, the call to canStartAPIServer(cfg) and for server.New(cfg, ...)
instead of config.Cfg; update the other occurrences noted (around the other
blocks reported) to follow the same pattern so live-reload synchronization is
respected.
🧹 Nitpick comments (2)
internal/cli/cli.go (1)

1521-1524: Prefer config.Reload("") over direct config.CfgPath reads.

Reading config.CfgPath directly bypasses the synchronized path access logic already encapsulated in Reload. Calling Reload("") keeps the lock/lookup behavior in one place.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/cli/cli.go` around lines 1521 - 1524, The code calls
config.Reload(config.CfgPath) which bypasses the synchronized path lookup inside
Reload; change the call to config.Reload("") so Reload uses its internal
lock/lookup logic instead of reading config.CfgPath directly. Update the
surrounding logic that inspects loadErr (and the logger.Info call) to continue
operating unchanged, but ensure any places that assumed a returned path from
Reload still work with the Reload("") semantics; specifically modify the
invocation in the scheduler SIGHUP handler that currently references
config.CfgPath to call config.Reload("") and rely on Reload to resolve the
effective path.
pkg/config/config.go (1)

116-125: Tighten the config access contract to avoid future unsafe reads.

Line 117 currently suggests direct reads of Cfg, and CfgPath is also publicly readable. That weakens the new synchronization model and makes accidental lock bypasses likely. Prefer documenting Get()/Set() as the only supported access path (and optionally add a getter for path).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/config/config.go` around lines 116 - 125, Update the package-level
comment and API so callers do not read Cfg or CfgPath directly: change the doc
on the exported vars (Cfg and CfgPath) to state that direct access is forbidden
and that callers must use the provided Get()/Set() functions for config access;
additionally either make CfgPath unexported or add an explicit getter (e.g.,
GetCfgPath) and document it as the supported way to obtain the config file path,
and ensure cfgMu protection is referenced in the comments for Get/Set.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/cli/cli.go`:
- Around line 1505-1518: The select inside the reloaded loop currently ignores
schedDone so an early exit from scheduler.RunJobs can silently leave the loop
blocked; update the select (the loop that checks runCtx and sighupCh) to also
handle the schedDone channel returned by the goroutine that calls
scheduler.RunJobs(schedCtx, jobs), read the error value, call schedCancel if
needed, and return the error when it is non-nil and not errors.Is(err,
context.Canceled) (mirroring the existing shutdown handling around schedCancel
and <-schedDone) so the main loop properly terminates on scheduler failures.

---

Outside diff comments:
In `@internal/cli/cli.go`:
- Around line 1402-1404: Replace direct uses of config.Cfg with a single
snapshot from the synchronized accessor: call cfg := config.Get() at the start
of each CLI entry point and use that local cfg for the nil-check, the call to
canStartAPIServer(cfg) and for server.New(cfg, ...) instead of config.Cfg;
update the other occurrences noted (around the other blocks reported) to follow
the same pattern so live-reload synchronization is respected.

---

Nitpick comments:
In `@internal/cli/cli.go`:
- Around line 1521-1524: The code calls config.Reload(config.CfgPath) which
bypasses the synchronized path lookup inside Reload; change the call to
config.Reload("") so Reload uses its internal lock/lookup logic instead of
reading config.CfgPath directly. Update the surrounding logic that inspects
loadErr (and the logger.Info call) to continue operating unchanged, but ensure
any places that assumed a returned path from Reload still work with the
Reload("") semantics; specifically modify the invocation in the scheduler SIGHUP
handler that currently references config.CfgPath to call config.Reload("") and
rely on Reload to resolve the effective path.

In `@pkg/config/config.go`:
- Around line 116-125: Update the package-level comment and API so callers do
not read Cfg or CfgPath directly: change the doc on the exported vars (Cfg and
CfgPath) to state that direct access is forbidden and that callers must use the
provided Get()/Set() functions for config access; additionally either make
CfgPath unexported or add an explicit getter (e.g., GetCfgPath) and document it
as the supported way to obtain the config file path, and ensure cfgMu protection
is referenced in the comments for Get/Set.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3641b1bc-6df6-4047-9778-3de107f21c62

📥 Commits

Reviewing files that changed from the base of the PR and between fac0d31 and d240478.

📒 Files selected for processing (2)
  • internal/cli/cli.go
  • pkg/config/config.go

@coderabbitai
Copy link

coderabbitai bot commented Mar 5, 2026

Note

Docstrings generation - SUCCESS
Generated docstrings for this pull request at #85

coderabbitai bot added a commit that referenced this pull request Mar 5, 2026
Docstrings generation was requested by @danolivo.

* #84 (comment)

The following files were modified:

* `internal/cli/cli.go`
* `pkg/config/config.go`
The select inside schedulerReloadLoop only watched runCtx.Done() and
sighupCh, leaving schedDone unmonitored. If scheduler.RunJobs returned
early on its own (e.g. a gocron internal error), the goroutine would
write to schedDone and the inner loop would block indefinitely waiting
for a signal that never arrives.

Add a schedDone case to the select: call schedCancel for cleanup, then
propagate a real error or return nil for a clean/canceled exit, matching
the existing shutdown handling pattern.
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/cli/cli.go`:
- Around line 1438-1454: The goroutine calling apiServer.Run can fail and
currently only calls stop(), which allows the surrounding function to still
return nil; change this to propagate that runtime error: create an error channel
(e.g., errCh) or shared atomic/variable to capture the non-canceled error from
apiServer.Run inside the goroutine (reference apiServer.Run, logger.Error,
stop()), send the error to errCh when it occurs, and then after waiting for
runCtx.Done()/shutdown check errCh and return the captured error instead of nil;
also apply the same propagation for the API-only branch so failures from
apiServer.Run are returned to the caller rather than masked.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4299f572-ddf1-4649-846d-494562f579e8

📥 Commits

Reviewing files that changed from the base of the PR and between d240478 and 848d154.

📒 Files selected for processing (1)
  • internal/cli/cli.go

@mason-sharp mason-sharp requested a review from ibrarahmad March 6, 2026 01:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant