Hyperfleet-542: Add OpenTelemetry instrumentation to Sentinel #51
tirthct wants to merge 11 commits into openshift-hyperfleet:main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing a comment.
Walkthrough

Adds OpenTelemetry tracing across the service: new pkg/telemetry with tracer-provider initialization (OTLP or stdout), sampler configuration via env vars, span helpers, CloudEvent trace injection, and shutdown; cmd/sentinel conditionally initializes and shuts down the tracer provider based on TRACING_ENABLED and OTEL_SERVICE_NAME; pkg/logger exposes OTel config; the HTTP client transport is wrapped with otelhttp; internal/sentinel creates annotated spans for poll/evaluate/publish and attaches trace context to published CloudEvents; extensive unit, integration, and instrumentation tests are added; config examples include messaging_system entries; go.mod is updated with OTLP-related dependencies.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Main as Service Startup
    participant Config as Config/Logger
    participant TP as TracerProvider
    participant Sentinel as Sentinel Loop
    participant Client as HTTP Client
    participant Publisher as Event Publisher
    participant Exporter as Trace Exporter
    Main->>Config: Load OTel config (TRACING_ENABLED, OTEL_SERVICE_NAME, sampling_rate)
    alt tracing enabled
        Main->>TP: InitTraceProvider(serviceName, version, sampling_rate)
        TP->>Exporter: Initialize exporter (OTLP HTTP or stdout)
        Exporter-->>TP: Exporter ready
        TP-->>Main: Tracer provider initialized
    else tracing disabled
        Main-->>Config: Tracing disabled
    end
    Main->>Sentinel: Start polling loop
    activate Sentinel
    Sentinel->>Sentinel: Start "sentinel.poll" span
    loop per resource
        Sentinel->>Sentinel: Start "sentinel.evaluate" span (resource attrs)
        Sentinel->>Client: HTTP request (otelhttp-wrapped)
        Client->>Client: Create HTTP span(s)
        Client-->>Sentinel: Response
        Sentinel->>Publisher: Publish event (SetTraceContext from current span)
        Publisher->>Publisher: Start publish span (messaging attrs)
        Publisher->>Exporter: Export trace
        Publisher-->>Sentinel: Publish result
        Publisher->>Publisher: End publish span
        Sentinel->>Sentinel: End "sentinel.evaluate" span
    end
    Sentinel->>Sentinel: End "sentinel.poll" span
    deactivate Sentinel
    alt tracing enabled
        Main->>TP: Shutdown tracer provider
        TP->>Exporter: Flush & close
        Exporter-->>TP: Shutdown complete
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/sentinel/sentinel.go (1)
113-187: ⚠️ Potential issue | 🔴 Critical

Span leak: `evalSpan.End()` and `publishSpan.End()` are skipped on `continue` paths.

When `event.SetData` fails (Line 141) or `Publish` fails (Line 164), the `continue` statements skip `evalSpan.End()` at Line 187 and `publishSpan.End()` at Line 167 respectively. Leaked spans will accumulate in memory and produce incomplete traces. The safest fix is to end spans via `defer` or explicit `End()` calls before each `continue`.

Proposed fix
```diff
 // span: sentinel.evaluate
 evalCtx, evalSpan := telemetry.StartSpan(ctx, "sentinel.evaluate",
     attribute.String("hyperfleet.resource_type", resource.Kind),
     attribute.String("hyperfleet.resource_id", resource.ID),
 )

 decision := s.decisionEngine.Evaluate(resource, now)
 if decision.ShouldPublish {
     pending++

     // Add decision reason to context for structured logging
     eventCtx := logger.WithDecisionReason(evalCtx, decision.Reason)

     // Create CloudEvent
     event := cloudevents.NewEvent()
     event.SetSpecVersion(cloudevents.VersionV1)
     event.SetType(fmt.Sprintf("com.redhat.hyperfleet.%s.reconcile", resource.Kind))
     event.SetSource("hyperfleet-sentinel")
     event.SetID(uuid.New().String())

     if err := event.SetData(cloudevents.ApplicationJSON, map[string]interface{}{
         "kind":       resource.Kind,
         "id":         resource.ID,
         "generation": resource.Generation,
         "href":       resource.Href,
         "reason":     decision.Reason,
     }); err != nil {
         s.logger.Errorf(eventCtx, "Failed to set event data resource_id=%s error=%v", resource.ID, err)
+        evalSpan.End()
         continue
     }

     // span: publish (child of sentinel.evaluate)
     publishCtx, publishSpan := telemetry.StartSpan(eventCtx, fmt.Sprintf("%s publish", topic),
         attribute.String("hyperfleet.resource_type", resource.Kind),
         attribute.String("hyperfleet.resource_id", resource.ID),
         attribute.String("hyperfleet.decision_reason", decision.Reason),
         attribute.String("messaging.system", "gcp_pubsub"),
         attribute.String("messaging.operation.type", "publish"),
         attribute.String("messaging.destination.name", topic),
         attribute.String("messaging.message.id", event.ID()),
     )

     if publishSpan.SpanContext().IsValid() {
         telemetry.SetTraceContext(&event, publishSpan)
     }

     // Publish to broker using configured topic
     if err := s.publisher.Publish(publishCtx, topic, &event); err != nil {
         // Record broker error
         metrics.UpdateBrokerErrorsMetric(resourceType, resourceSelector, "publish_error")
         s.logger.Errorf(eventCtx, "Failed to publish event resource_id=%s error=%v", resource.ID, err)
+        publishSpan.End()
+        evalSpan.End()
         continue
     }
     publishSpan.End()
```

Alternatively, a cleaner approach would be to use a closure or restructure to use defer-like patterns within the loop body.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/sentinel/sentinel.go` around lines 113 - 187, The spans started by telemetry.StartSpan (evalSpan and publishSpan) are leaked on early continue paths (failures in event.SetData and s.publisher.Publish); fix by ensuring each span is ended on all control paths—either call defer evalSpan.End() immediately after creating evalSpan and defer publishSpan.End() right after creating publishSpan (so they run even on continue), or explicitly call publishSpan.End() and evalSpan.End() just before every continue in the error branches (references: telemetry.StartSpan, evalSpan, publishSpan, event.SetData, s.publisher.Publish).
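The defer-based alternative mentioned above can be sketched with a per-iteration closure, so that every exit path (including what used to be `continue`) ends the spans. This is a minimal stand-alone illustration: the `span` type is a stub, not the project's `telemetry` package, and the real code creates `publishSpan` mid-iteration rather than up front.

```go
package main

import "fmt"

// span is a hypothetical stand-in for an OTel trace.Span; only End matters here.
type span struct {
	name  string
	ended bool
}

func (s *span) End() { s.ended = true }

// processResource wraps one loop iteration in a closure so that deferred
// End() calls run on every exit path; an early return inside the closure
// plays the role of the original `continue`.
func processResource(id string, failPublish bool) (evalSpan, publishSpan *span) {
	evalSpan = &span{name: "sentinel.evaluate"}
	publishSpan = &span{name: "publish"}
	func() {
		defer evalSpan.End() // always runs, even on early return
		// ... evaluate, build CloudEvent ...
		defer publishSpan.End()
		if failPublish {
			return // was `continue`; both deferred Ends still fire
		}
		// ... publish to broker ...
	}()
	return evalSpan, publishSpan
}

func main() {
	e, p := processResource("res-1", true)
	fmt.Println(e.ended, p.ended)
}
```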
🧹 Nitpick comments (4)
internal/sentinel/sentinel_test.go (1)
444-448: Global tracer provider is set but never restored after test.
`otel.SetTracerProvider(tp)` modifies global state. While tests in the same package run sequentially by default, this could affect subsequent tests that rely on the global provider state (or lack thereof). Consider saving and restoring the previous provider.

Suggested pattern

```diff
 otel.SetTracerProvider(tp)
+// Optionally restore previous provider on cleanup
+// prevTP := otel.GetTracerProvider()
+// defer otel.SetTracerProvider(prevTP)
```

This is optional since tests currently run sequentially and the deferred `tp.Shutdown` will clean up the provider.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/sentinel/sentinel_test.go` around lines 444 - 448, The test sets a global provider via trace.NewTracerProvider(...) and otel.SetTracerProvider(tp) but doesn't restore the previous global provider; capture the current provider with otel.GetTracerProvider() before calling otel.SetTracerProvider(tp), and defer otel.SetTracerProvider(previousProvider) so global state is restored; keep the existing deferred tp.Shutdown() to shut down the new provider.

internal/client/client_test.go (1)
796-797: Ignored `w.Write` error return value.

Line 797 ignores the error from `w.Write(...)`, unlike the rest of the test file, which consistently handles write errors.

Suggested fix

```diff
-w.Write([]byte(`{"error": "internal server error"}`))
+if _, err := w.Write([]byte(`{"error": "internal server error"}`)); err != nil {
+    t.Logf("Error writing response: %v", err)
+}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/client/client_test.go` around lines 796 - 797, The test HTTP handler ignores the error returned from w.Write(...) at the end of the handler; update the handler to check the write error and fail the test if it occurs. Specifically, replace the bare w.Write(...) call following w.WriteHeader(http.StatusInternalServerError) with a checked write (capture the return error from w.Write and call t.Fatalf or t.Errorf with the error), using the same testing.T variable used elsewhere in the file so failures are reported consistently.

test/integration/integration_test.go (1)
459-464: `InitTraceProvider` with the stdout exporter will write span output to stdout during test runs.

Since no `OTEL_EXPORTER_OTLP_ENDPOINT` env var is set in the test, `InitTraceProvider` falls back to the stdout exporter, which will print formatted span data to stdout during test execution. This is noisy but functional. If this is intentional for debugging, it's fine; otherwise consider using an in-memory exporter as done in `TestIntegration_EndToEndSpanHierarchy`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/integration/integration_test.go` around lines 459 - 464, The test currently calls telemetry.InitTraceProvider(ctx, "sentinel", "test", 1.0) which falls back to the stdout exporter and prints spans to stdout; change the test to use the in-memory exporter instead of stdout (or set OTEL_EXPORTER_OTLP_ENDPOINT) by invoking the InitTraceProvider variant or option the project uses for in-memory tracing (match the pattern used in TestIntegration_EndToEndSpanHierarchy), and keep the deferred telemetry.Shutdown(tp) call; update the call site in integration_test.go to use that in-memory configuration so spans are not emitted to stdout during test runs.

pkg/logger/logger.go (1)
46-50: Consider validating the `SamplingRate` range.

`SamplingRate` is a `float64` with no validation at the config level. While `InitTraceProvider` in `pkg/telemetry/otel.go` handles edge cases (≥1.0 → AlwaysSample, ≤0.0 → NeverSample), a negative or >1.0 value could be confusing to operators. Consider documenting the valid range or adding a validation method on `OTelConfig`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/logger/logger.go` around lines 46 - 50, Add explicit validation for OTelConfig.SamplingRate by implementing a validation method (e.g., func (c *OTelConfig) Validate() error) that ensures SamplingRate is within the inclusive range 0.0–1.0 and returns a clear error if out of range; reference the OTelConfig struct and call this Validate method where configuration is loaded or before InitTraceProvider is invoked so invalid values error early (include a helpful message indicating the valid range).
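A minimal sketch of such a validation method. The struct shape and field names (`Enabled`, `SamplingRate`) are taken from the review discussion; the error wording is illustrative.

```go
package main

import "fmt"

// OTelConfig mirrors the reviewed struct shape; field names here follow the review.
type OTelConfig struct {
	Enabled      bool
	SamplingRate float64
}

// Validate rejects sampling rates outside the inclusive [0.0, 1.0] range so
// misconfigurations fail at load time instead of being silently clamped later.
func (c *OTelConfig) Validate() error {
	if c.SamplingRate < 0.0 || c.SamplingRate > 1.0 {
		return fmt.Errorf("otel: sampling_rate must be in [0.0, 1.0], got %g", c.SamplingRate)
	}
	return nil
}

func main() {
	fmt.Println((&OTelConfig{SamplingRate: 0.5}).Validate() == nil)
	fmt.Println((&OTelConfig{SamplingRate: 1.5}).Validate() != nil)
}
```

Calling `Validate` where the config is loaded surfaces a bad rate before `InitTraceProvider` ever runs.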
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/sentinel/main.go`:
- Around line 257-262: The telemetry shutdown is using the same shutdownCtx as
metricsServer.Shutdown and the log call uses the already-cancelled ctx; create a
separate context with its own timeout for telemetry (e.g., teleCtx via
context.WithTimeout) and call telemetry.Shutdown(teleCtx, tp) so it has its own
budget independent of metricsServer.Shutdown, ensure you cancel the teleCtx
after use, and change the error log invocation (log.Extra(...).Error) to use the
telemetry context (teleCtx) or the appropriate non-cancelled shutdown context so
the log is not emitted with a canceled ctx; reference telemetry.Shutdown,
metricsServer.Shutdown, shutdownCtx, tp, and log.Extra to locate and update the
code.
- Around line 153-169: The OTel settings never get set because initLogging uses
DefaultConfig() which leaves LogConfig.OTel.Enabled false; update initLogging to
populate logCfg.OTel.Enabled and logCfg.OTel.SamplingRate from configuration
sources (either the loaded SentinelConfig/YAML or environment variables) so the
check in main.go that inspects logCfg.OTel.Enabled can actually enable tracing;
specifically, within initLogging (and where config.LoadConfig/SentinelConfig is
processed) read an OTEL_ENABLED (boolean) env var or map the YAML fields into
LogConfig.OTel (including SamplingRate) and assign them to logCfg before main.go
calls telemetry.InitTraceProvider (referencing symbols: initLogging, LogConfig,
DefaultConfig, logCfg.OTel.Enabled, logCfg.OTel.SamplingRate,
config.LoadConfig).
In `@internal/sentinel/sentinel.go`:
- Around line 149-150: The attribute.String("messaging.system", "gcp_pubsub") is
hardcoded but should come from configuration; change the code that builds the
span/attributes (the attribute.String calls around
messaging.operation.type/publish in sentinel.go) to read the messaging system
from a configuration value or environment variable (e.g., cfg.MessagingSystem or
os.Getenv("MESSAGING_SYSTEM")) and use that value when constructing
attribute.String("messaging.system", <configuredValue>), with a sensible default
(e.g., "rabbitmq") if unset.
In `@pkg/telemetry/otel_test.go`:
- Around line 308-326: TestSetTraceContext_InvalidSpan currently ends a real
span which doesn't make SpanContext invalid; replace creation of the span with a
no-op/non-recording span (e.g., obtain span via
trace.SpanFromContext(context.Background())) so span.SpanContext().IsValid() is
false and the !IsValid() branch in SetTraceContext is exercised, then call
SetTraceContext(&event, span) and assert that no "traceparent" extension was
added to event.Extensions(); update imports accordingly to reference
go.opentelemetry.io/otel/trace if needed.
- Around line 42-77: TestInitTraceProvider_OTLPExporter currently points
OTEL_EXPORTER_OTLP_ENDPOINT at a non-existent collector and then calls Shutdown
which blocks/fails; fix by making the test use a local httptest server that
responds 200 to POST /v1/traces and set OTEL_EXPORTER_OTLP_ENDPOINT to that
server's URL (server.URL + "/v1/traces"), so InitTraceProvider, tracer use, and
the deferred Shutdown call complete reliably; update the test to start
httptest.NewServer before setting the env var and close the server at the end,
referencing TestInitTraceProvider_OTLPExporter, InitTraceProvider, and Shutdown
in your changes.
- Around line 65-70: Deferred cleanup closures calling Shutdown currently use
t.Fatal which is unsafe inside defers; change those deferred anonymous functions
(the ones invoking Shutdown(ctx, tp) and similar cleanup calls) to call t.Errorf
with the error when Shutdown returns non-nil instead of t.Fatal/t.Fatalf, so the
test reports the cleanup failure without invoking Fatal after the test body has
returned. Locate the three other deferred closures in this file that call
Shutdown or similar cleanup (same pattern as the closure around Shutdown at
lines noted in the review) and replace their t.Fatal/t.Fatalf calls with
t.Errorf while keeping the rest of the defer logic unchanged.
In `@pkg/telemetry/otel.go`:
- Line 117: Fix the typo in the doc comment for StartSpan: change "conext" to
"context" in the comment that reads "// StartSpan starts a span and enriches
conext with trace/span IDs for logging" so it correctly reads "// StartSpan
starts a span and enriches context with trace/span IDs for logging"; locate the
comment above the StartSpan function in otel.go and update the word only.
- Around line 22-29: The listed package-level string literals
(SAMPLER_ALWAYS_ON, SAMPLER_ALWAYS_OFF, SAMPLER_TRACE_ID_RATIO,
ENV_OTEL_TRACES_SAMPLER, ENV_OTEL_TRACES_SAMPLER_ARG,
ENV_OTEL_EXPORTER_OTLP_ENDPOINT) should be declared as constants instead of
variables; replace the var (...) block in pkg/telemetry/otel.go with a const
(...) block so these identifiers are immutable and cannot be reassigned
elsewhere in the package.
- Around line 139-145: The traceparent being set in SetTraceContext incorrectly
hardcodes the trace-flags byte to "01"; update SetTraceContext to read the
actual flags from span.SpanContext().TraceFlags() and format that byte as two
hex characters when building traceParent (keep the existing TraceID and SpanID
usage and the IsValid() guard). Use the span context's TraceFlags value
(converted to a uint8 and formatted with "%02x") instead of the fixed "-01" so
the traceparent accurately reflects whether the span was sampled.
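The flag-aware formatting can be sketched in isolation. In the real code the three inputs would come from `span.SpanContext()` (`TraceID().String()`, `SpanID().String()`, and `byte(TraceFlags())`); here they are plain values, and `buildTraceparent` is a hypothetical helper.

```go
package main

import "fmt"

// buildTraceparent renders a W3C traceparent header
// (version-traceid-spanid-flags). "%02x" encodes the actual sampled flag
// instead of a hardcoded "01".
func buildTraceparent(traceID, spanID string, flags byte) string {
	return fmt.Sprintf("00-%s-%s-%02x", traceID, spanID, flags)
}

func main() {
	// Sampled vs. unsampled only differ in the trailing flags byte.
	fmt.Println(buildTraceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", 0x01))
	fmt.Println(buildTraceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", 0x00))
}
```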
- Around line 41-46: The code incorrectly passes a full URL from
ENV_OTEL_EXPORTER_OTLP_ENDPOINT into otlptracehttp.WithEndpoint (which expects
host:port), causing mis-parsing; fix by either removing the manual env var
handling and simply calling otlptracehttp.New(ctx) so the SDK reads
OTEL_EXPORTER_OTLP_ENDPOINT itself, or if you must pass it explicitly, replace
otlptracehttp.WithEndpoint(otlpEndpoint) with
otlptracehttp.WithEndpointURL(parsedURL) (use the URL form) and ensure
otlpEndpoint is a properly parsed *url.URL; update error logging in the
otlptracehttp.New/creation branch (log.Errorf) accordingly and remove the
redundant env read of ENV_OTEL_EXPORTER_OTLP_ENDPOINT.
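To illustrate the mismatch: endpoint options of the `WithEndpoint` kind expect a bare host:port, so a full URL has to be parsed first (letting the SDK read the env var itself, or using the URL-form option, avoids this entirely). A stdlib-only sketch of that extraction; `endpointHost` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net/url"
)

// endpointHost extracts the host:port that host-based endpoint options expect
// from a full URL such as one placed in OTEL_EXPORTER_OTLP_ENDPOINT.
func endpointHost(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	if u.Host == "" {
		return "", fmt.Errorf("no host in %q", raw)
	}
	return u.Host, nil
}

func main() {
	h, _ := endpointHost("http://collector:4318")
	fmt.Println(h) // collector:4318
}
```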
In `@test/integration/integration_test.go`:
- Around line 816-817: The loop using pollSpans[:2] can panic if fewer than 2
spans were captured; update the iteration to safely bound the slice by computing
n := len(pollSpans) and if n > 2 set n = 2, then iterate over pollSpans[:n] (or
otherwise iterate up to min(2, len(pollSpans))). Apply this change around the
existing pollSpans variable and the for _, pollSpan := range pollSpans[:2] loop
so the test never slices beyond the available elements.
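The bounded iteration can be sketched generically; `firstN` is a hypothetical helper.

```go
package main

import "fmt"

// firstN safely bounds a slice so iterating over "the first two poll spans"
// cannot panic when fewer were captured.
func firstN[T any](s []T, n int) []T {
	if len(s) < n {
		n = len(s)
	}
	return s[:n]
}

func main() {
	spans := []string{"poll-1"} // only one span captured
	for _, s := range firstN(spans, 2) { // no panic despite asking for two
		fmt.Println(s)
	}
}
```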
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
go.sum is excluded by !**/*.sum
📒 Files selected for processing (10)
cmd/sentinel/main.go, go.mod, internal/client/client.go, internal/client/client_test.go, internal/sentinel/sentinel.go, internal/sentinel/sentinel_test.go, pkg/logger/logger.go, pkg/telemetry/otel.go, pkg/telemetry/otel_test.go, test/integration/integration_test.go
♻️ Duplicate comments (2)
cmd/sentinel/main.go (2)
160-176: ⚠️ Potential issue | 🔴 Critical

`logCfg.OTel.Enabled` is always `false`, so OTel initialization is unreachable dead code.

`initLogging` constructs `LogConfig` via `logger.DefaultConfig()` (which sets `OTel.Enabled: false`) and only updates `Level`, `Format`, and `Output`. Neither `OTEL_ENABLED` nor `OTEL_SAMPLING_RATE` is ever read and applied to `logCfg.OTel`. As a result, the branch at Line 166 is never taken and tracing is never activated.

This was flagged in a previous review; the fix (reading `OTEL_ENABLED`/`OTEL_SAMPLING_RATE` from env vars in `initLogging`) has not been applied yet.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/sentinel/main.go` around lines 160 - 176, The OTel branch is never reached because initLogging uses logger.DefaultConfig() which leaves logCfg.OTel.Enabled false and never reads OTEL_ENABLED/OTEL_SAMPLING_RATE; update initLogging to read the OTEL_ENABLED and OTEL_SAMPLING_RATE environment variables and apply them to the returned LogConfig (i.e., set logCfg.OTel.Enabled based on OTEL_ENABLED parsing and set logCfg.OTel.SamplingRate from OTEL_SAMPLING_RATE parsing), ensuring any parsing errors are handled sensibly so the downstream check of logCfg.OTel.Enabled in main.go will correctly enable telemetry; reference initLogging, logger.DefaultConfig(), LogConfig and logCfg.OTel.Enabled/SamplingRate in your changes.
290-295: ⚠️ Potential issue | 🟠 Major

OTel shutdown reuses the shared `shutdownCtx`, and the error log uses an already-cancelled `ctx`.

Two pre-existing issues remain unaddressed:

- `telemetry.Shutdown` shares the same `shutdownCtx` (20 s) already partially consumed by `healthServer.Shutdown` + `metricsServer.Shutdown`. If those are slow, the OTel flush budget is eroded.
- Line 293 passes the root `ctx` (cancelled at Line 279) to `log.Extra(...).Error(...)`, so the log call is issued with a cancelled context.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/sentinel/main.go` around lines 290 - 295, The OTel shutdown call reuses the shared shutdownCtx and logs using the already-cancelled root ctx; change telemetry.Shutdown(shutdownCtx, tp) to use its own fresh context and ensure the log call uses a live context: create a new context with timeout (e.g., otelCtx, otelCancel := context.WithTimeout(context.Background(), 20*time.Second)) and defer otelCancel(), call telemetry.Shutdown(otelCtx, tp), and pass a non-cancelled context (e.g., otelCtx or context.Background()) to log.Extra(...).Error(...) so the shutdown has its own budget and the error is logged with an active context.
🧹 Nitpick comments (1)
cmd/sentinel/main.go (1)
30-33: `serviceName` is a runtime-configurable value mixed into the build-info `var` block.

`version`, `commit`, and `date` are set by `ldflags` at build time and are effectively constants, while `serviceName` is mutated at runtime inside `runServe` (Line 162). Keeping them in the same block is misleading. Consider making `serviceName` a local variable in `runServe` or, if it needs to be the default value, a separate `const`.

♻️ Suggested refactor

```diff
 var (
     version = "dev"
     commit  = "none"
     date    = "unknown"
-    serviceName = "sentinel"
 )
```

Then in `runServe`:

```diff
+    serviceName := "sentinel"
     if envServiceName := os.Getenv("OTEL_SERVICE_NAME"); envServiceName != "" {
         serviceName = envServiceName
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/sentinel/main.go` around lines 30 - 33, The serviceName variable should not live in the same build-info var block as the ldflags-populated constants (version, commit, date); move serviceName out of that var block and make it either a local variable inside runServe (referencing runServe) if it is mutated at runtime, or make it a separate const/var with a default value distinct from version/commit/date if you need a package-level default. Update any references to serviceName accordingly (e.g., where runServe mutates it) so you no longer mix runtime-mutable state with the ldflags-injected build constants.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
go.sum is excluded by !**/*.sum
📒 Files selected for processing (2)
cmd/sentinel/main.go, go.mod
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/client/client_test.go`:
- Line 804: The test currently ignores the error returned by
NewHyperFleetClient; update the test to capture both return values (client, err
:= NewHyperFleetClient(...)) and assert/fail if err != nil (e.g., t.Fatalf or
require.NoError) before using client so any constructor failures surface
immediately; ensure you also handle the client variable in subsequent
assertions.
- Around line 795-800: In the httptest server handler (the anonymous
http.HandlerFunc passed to httptest.NewServer) replace the unsafe t.Fatalf call
with t.Errorf and then return to avoid invoking FailNow from a handler
goroutine; specifically change the error handling after w.Write to call
t.Errorf("Failed to write response: %v", err) followed by return instead of
t.Fatalf.
- Around line 670-676: Capture the previous global tracer provider before
calling otel.SetTracerProvider(tp) in both
TestNewHyperFleetClient_HTTPInstrumentation and
TestNewHyperFleetClient_HTTPInstrumentation_ErrorCase (e.g., prevTp :=
otel.GetTracerProvider()), then in the defer restore it (e.g., defer
otel.SetTracerProvider(prevTp)) and still shutdown tp as before; ensure you
reference the local tp shutdown logic (tp.Shutdown(ctx)) but add restoring the
previous provider to avoid test-state leakage.
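The save/restore shape can be sketched with a plain package-level variable standing in for the global tracer provider that `otel.SetTracerProvider` mutates; `withProvider` is a hypothetical helper showing the defer-based restore.

```go
package main

import "fmt"

// globalProvider stands in for the process-wide tracer provider.
var globalProvider = "default"

// withProvider swaps in a test provider and restores the previous one when
// done, mirroring otel.GetTracerProvider / otel.SetTracerProvider save/restore.
func withProvider(p string, body func()) {
	prev := globalProvider // save (cf. otel.GetTracerProvider())
	globalProvider = p     // swap (cf. otel.SetTracerProvider(tp))
	defer func() { globalProvider = prev }()
	body()
}

func main() {
	withProvider("test-tp", func() { fmt.Println(globalProvider) })
	fmt.Println(globalProvider) // restored after the test body
}
```

In the real tests, the deferred `tp.Shutdown(ctx)` would remain alongside the restore.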
```go
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusInternalServerError)
	_, err := w.Write([]byte(`{"error": "internal server error"}`))
	if err != nil {
		t.Fatalf("Failed to write response: %v", err)
	}
```
🧩 Analysis chain
🏁 Script executed:
cat -n internal/client/client_test.go | sed -n '790,810p'
Repository: openshift-hyperfleet/hyperfleet-sentinel
Length of output: 868
🏁 Script executed:
# Also check if there are other similar patterns in the test file
rg 't\.Fatalf|t\.FailNow' internal/client/client_test.go -A 3 -B 3
Repository: openshift-hyperfleet/hyperfleet-sentinel
Length of output: 4362
🏁 Script executed:
# Check the full context around httptest usage in this file
rg 'httptest\.NewServer' internal/client/client_test.go -B 5 -A 20
Repository: openshift-hyperfleet/hyperfleet-sentinel
Length of output: 11976
Replace t.Fatalf with t.Errorf and return in httptest handler.
Calling t.Fatalf from an httptest handler goroutine is unsafe—Fatalf and FailNow can only be called from the test goroutine. The pattern throughout this file uses t.Errorf for errors in handlers.
Proposed fix
_, err := w.Write([]byte(`{"error": "internal server error"}`))
if err != nil {
- t.Fatalf("Failed to write response: %v", err)
+ t.Errorf("Failed to write response: %v", err)
+ return
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusInternalServerError)
	_, err := w.Write([]byte(`{"error": "internal server error"}`))
	if err != nil {
		t.Errorf("Failed to write response: %v", err)
		return
	}
```
Replaced fatals with errors
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check the specific area where the issue was reported (around line 795-800)
cat -n internal/client/client_test.go | sed -n '792,802p'
# Also search for any remaining t.Fatalf in httptest handlers to ensure all instances were fixed
echo "---"
echo "Checking for any remaining t.Fatalf in httptest handlers:"
rg -n 't\.Fatalf' internal/client/client_test.go -A 2 -B 2 | head -40
Length of output: 1765
🏁 Script executed:
#!/bin/bash
# Check the TestNewHyperFleetClient_HTTPInstrumentation_ErrorCase function specifically
# This is where the issue was reported
cat -n internal/client/client_test.go | sed -n '778,824p'
Length of output: 1843
🏁 Script executed:
#!/bin/bash
# Check the other instrumentation test (TestNewHyperFleetClient_HTTPInstrumentation)
# to see if there are similar issues with t.Fatalf in handlers
cat -n internal/client/client_test.go | sed -n '662,700p'
Length of output: 1569
@tirthct, I checked the code and the t.Fatalf call is still present at line 809 in the TestNewHyperFleetClient_HTTPInstrumentation_ErrorCase function:
```go
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusInternalServerError)
	_, err := w.Write([]byte(`{"error": "internal server error"}`))
	if err != nil {
		t.Fatalf("Failed to write response: %v", err) // ← Still here at line 809
	}
}))
```

Could you please replace this with t.Errorf followed by return to match the pattern used elsewhere in the test file? The fix should be:
```go
if err != nil {
	t.Errorf("Failed to write response: %v", err)
	return
}
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check the exact code around line 809 where the fix should be applied
cat -n internal/client/client_test.go | sed -n '805,812p'
Length of output: 361
@tirthct, the fix is almost complete, but the return statement is still missing. Currently at line 809, the code shows:
```go
if err != nil {
	t.Errorf("Failed to write response: %v", err)
}
```

It should be:
```go
if err != nil {
	t.Errorf("Failed to write response: %v", err)
	return
}
```

Without the return, the handler continues executing after the error, which could lead to unexpected behavior. Could you please add the return statement after t.Errorf?
```diff
 // Publish to broker using configured topic
-if err := s.publisher.Publish(eventCtx, topic, &event); err != nil {
+if err := s.publisher.Publish(publishCtx, topic, &event); err != nil {
```
Errors in the Publish path are logged but never recorded on the span. The tracing standard (https://github.com/openshift-hyperfleet/architecture/blob/main/hyperfleet/standards/tracing.md) requires span.RecordError() and span.SetStatus() so errors are visible in trace backends (Jaeger/Tempo). Without them, failed publish operations appear as successful spans.
Suggested:

```go
if err := s.publisher.Publish(publishCtx, topic, &event); err != nil {
	publishSpan.RecordError(err)
	publishSpan.SetStatus(codes.Error, "publish failed")
	metrics.UpdateBrokerErrorsMetric(resourceType, resourceSelector, "publish_error")
	s.logger.Errorf(eventCtx, "Failed to publish event resource_id=%s error=%v", resource.ID, err)
	publishSpan.End()
	evalSpan.End()
	continue
}
```
Ah, okay. Added RecordError() and SetStatus()
…OTel traceparent function and tests
Actionable comments posted: 5
♻️ Duplicate comments (3)
pkg/telemetry/otel.go (1)
93-93: ⚠️ Potential issue | 🟠 Major

Prefer parent-based ratio sampling to preserve incoming trace decisions.

Using `trace.TraceIDRatioBased` directly can resample child spans independently. Wrap it with `trace.ParentBased(...)` for distributed trace continuity.

💡 Proposed fix

```diff
-	sampler = trace.TraceIDRatioBased(samplingRate)
+	sampler = trace.ParentBased(trace.TraceIDRatioBased(samplingRate))
```

OpenTelemetry Go: For services receiving remote parent context, is ParentBased(TraceIDRatioBased(x)) recommended over TraceIDRatioBased(x) to preserve parent sampling decisions?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/telemetry/otel.go` at line 93, replace the direct TraceIDRatioBased sampler assignment so child spans respect remote parent sampling: instead of assigning sampler = trace.TraceIDRatioBased(samplingRate), wrap it with trace.ParentBased(trace.TraceIDRatioBased(samplingRate)) (use the same samplingRate variable) so parent-based sampling is preserved for incoming trace contexts.

cmd/sentinel/main.go (2)
34-34: ⚠️ Potential issue | 🟡 Minor

Set the default OTel service name to `hyperfleet-sentinel`.

Keeping the default as `sentinel` can fragment service identity in tracing backends unless explicitly overridden.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/sentinel/main.go` at line 34, the default OTel serviceName value currently set as serviceName = "sentinel" should be changed to "hyperfleet-sentinel" to standardize tracing identity; update the assignment of the serviceName variable so traces are attributed consistently unless explicitly overridden.
301-305: ⚠️ Potential issue | 🟡 Minor

Use the telemetry shutdown context when logging shutdown errors.

The error log still uses `ctx` instead of `otelShutdownCtx`, which may already be canceled during shutdown.

💡 Proposed fix

```diff
 if err := telemetry.Shutdown(otelShutdownCtx, tp); err != nil {
-	log.Extra("error", err).Error(ctx, "Failed to shutdown OpenTelemetry")
+	log.Extra("error", err).Error(otelShutdownCtx, "Failed to shutdown OpenTelemetry")
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/sentinel/main.go` around lines 301 - 305, the shutdown logging currently uses the outer context variable ctx, which may be canceled; change the log call to use the telemetry shutdown context instead: when telemetry.Shutdown(otelShutdownCtx, tp) returns an error, pass otelShutdownCtx into log.Extra(...).Error(...) so the failure is logged with the shutdown-specific context created by context.WithTimeout.
🧹 Nitpick comments (1)
test/integration/integration_test.go (1)
902-909: Trace continuity assertion is too broad and can become flaky.

This currently counts trace IDs from all captured spans. If unrelated root spans are emitted, the check can fail even when the poll/evaluate/publish hierarchy is correct. Prefer deriving continuity from poll traces (or poll descendants) only.

Suggested tightening

```diff
-	// Verify trace continuity - spans should form coherent traces
-	traceIDs := make(map[string]bool)
-	for _, span := range spans {
-		traceIDs[span.SpanContext.TraceID().String()] = true
-	}
-
-	if len(traceIDs) > len(pollSpans) {
-		t.Errorf("Expected spans to belong to %d traces (one per poll), but found %d trace IDs", len(pollSpans), len(traceIDs))
-	}
+	// Verify trace continuity for poll roots specifically (one trace per poll span)
+	pollTraceIDs := make(map[string]struct{})
+	for _, pollSpan := range pollSpans {
+		pollTraceIDs[pollSpan.SpanContext.TraceID().String()] = struct{}{}
+	}
+	if len(pollTraceIDs) != len(pollSpans) {
+		t.Errorf("Expected one unique trace per poll span, got %d unique traces for %d poll spans", len(pollTraceIDs), len(pollSpans))
+	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/integration/integration_test.go` around lines 902 - 909, The current check counts trace IDs from all captured spans which can include unrelated roots and become flaky; change it to derive continuity only from the poll traces by first building a set of expectedTraceIDs from pollSpans (use each pollSpan.SpanContext.TraceID().String()) and then only collect/count trace IDs from spans whose TraceID is in that expectedTraceIDs set (i.e., filter spans by membership in the poll trace set) before asserting the number of distinct poll-related traces matches len(pollSpans).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/config/config.go`:
- Around line 110-112: The code currently assigns MESSAGING_SYSTEM even when the
environment variable is present but empty; update the block that reads
os.LookupEnv("MESSAGING_SYSTEM") so it trims and checks the value is non-empty
before assigning to cfg.MessagingSystem (e.g., use strings.TrimSpace on the
looked-up value and only set cfg.MessagingSystem when the trimmed value != "");
this avoids clearing defaults or emitting empty messaging.system attributes.
In `@internal/sentinel/sentinel.go`:
- Around line 114-118: The telemetry spans created via telemetry.StartSpan
(e.g., evalSpan and publishSpan in the sentinel.evaluate flow) are not ended on
error paths that call continue; update each error branch that returns/continues
after calling resource.SetData or publisher.Publish to call the corresponding
span.End() (evalSpan.End() and, if publishSpan was created, publishSpan.End())
before continuing so all spans are closed; search for usages of SetData and
Publish within the sentinel evaluation loop (the branches around evalSpan and
publishSpan creation) and add the missing End() calls on every error path.
In `@pkg/telemetry/otel_test.go`:
- Around line 84-104: The test calls Shutdown twice (once in the deferred
anonymous func and once explicitly), causing flakiness; update
TestInitTraceProvider_OTLPExporter to call Shutdown only once by removing the
redundant call: either eliminate the explicit err = Shutdown(shutdownCtx, tp)
block (keeping the defer) or remove the defer and keep the explicit shutdown,
ensuring only a single call to Shutdown(ctx, tp) on the trace provider
(reference Shutdown and the deferred anonymous func that wraps Shutdown).
In `@test/integration/integration_test.go`:
- Around line 717-718: The test currently ignores the error returned by
client.NewHyperFleetClient; modify the setup to capture the error (e.g.,
hyperfleetClient, err := client.NewHyperFleetClient(...)) and fail fast if it is
non-nil by calling t.Fatalf or using the test assertion helper (e.g.,
require.NoError) so setup failures are explicit; update the code around
NewHyperFleetClient and any test setup that depends on hyperfleetClient to use
the checked variable.
- Around line 675-682: The test sets a new global OTEL tracer provider with
otel.SetTracerProvider(tp) but never restores the previous global provider,
leaking state to other tests; capture the existing provider via
otel.GetTracerProvider() before calling otel.SetTracerProvider(tp) and in the
deferred cleanup restore it by calling otel.SetTracerProvider(prevProvider) (and
then perform tp.Shutdown(cleanupCtx) as currently implemented), referencing
otel.SetTracerProvider, otel.GetTracerProvider, tp, and tp.Shutdown to locate
where to add the save/restore logic.
---
Duplicate comments:
In `@cmd/sentinel/main.go`:
- Line 34: The default OTel serviceName value currently set as serviceName =
"sentinel" should be changed to "hyperfleet-sentinel" to standardize tracing
identity; update the assignment of the serviceName variable in
cmd/sentinel/main.go (the serviceName symbol) to use "hyperfleet-sentinel" as
the default so traces are attributed consistently unless explicitly overridden.
- Around line 301-305: The shutdown logging currently uses the outer request
context variable ctx which may be canceled; change the log call to use the
telemetry shutdown context instead: when calling
telemetry.Shutdown(otelShutdownCtx, tp) and it returns an error, pass
otelShutdownCtx into log.Extra(...).Error(...) (keeping the existing
log.Extra("error", err) and message) so the failure is logged with the
shutdown-specific context created by context.WithTimeout
(otelShutdownCtx/otelShutdownCancel).
In `@pkg/telemetry/otel.go`:
- Line 93: Replace the direct TraceIDRatioBased sampler assignment so child
spans respect remote parent sampling: instead of assigning sampler =
trace.TraceIDRatioBased(samplingRate) wrap it with
trace.ParentBased(trace.TraceIDRatioBased(samplingRate)) (use the same
samplingRate variable) so parent-based sampling is preserved for incoming trace
contexts.
---
Nitpick comments:
In `@test/integration/integration_test.go`:
- Around line 902-909: The current check counts trace IDs from all captured
spans which can include unrelated roots and become flaky; change it to derive
continuity only from the poll traces by first building a set of expectedTraceIDs
from pollSpans (use each pollSpan.SpanContext.TraceID().String()) and then only
collect/count trace IDs from spans whose TraceID is in that expectedTraceIDs set
(i.e., filter spans by membership in the poll trace set) before asserting the
number of distinct poll-related traces matches len(pollSpans).
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- cmd/sentinel/main.go
- configs/dev-example.yaml
- configs/gcp-pubsub-example.yaml
- configs/rabbitmq-example.yaml
- internal/client/client_test.go
- internal/config/config.go
- internal/sentinel/sentinel.go
- pkg/telemetry/otel.go
- pkg/telemetry/otel_test.go
- test/integration/integration_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/client/client_test.go
… tracing standardization
Actionable comments posted: 2
♻️ Duplicate comments (2)
cmd/sentinel/main.go (1)
301-305: ⚠️ Potential issue | 🟡 Minor

Use `otelShutdownCtx` for the shutdown error log context.

Line 304 logs with `ctx` even though telemetry shutdown is executed under `otelShutdownCtx`.

💡 Proposed fix

```diff
 if tp != nil {
 	otelShutdownCtx, otelShutdownCancel := context.WithTimeout(context.Background(), 5*time.Second)
 	defer otelShutdownCancel()
 	if err := telemetry.Shutdown(otelShutdownCtx, tp); err != nil {
-		log.Extra("error", err).Error(ctx, "Failed to shutdown OpenTelemetry")
+		log.Extra("error", err).Error(otelShutdownCtx, "Failed to shutdown OpenTelemetry")
 	}
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/sentinel/main.go` around lines 301 - 305, the shutdown error log is using the wrong context; change the log call to use otelShutdownCtx (created by context.WithTimeout) instead of ctx so the telemetry.Shutdown error is logged with the same context it runs under.

internal/sentinel/sentinel.go (1)
134-143: ⚠️ Potential issue | 🔴 Critical

End and mark `evalSpan` on `SetData` failure before `continue`.

On Line 142, `continue` skips Line 192, so `evalSpan` leaks on this error path and the span is not marked failed.

🐛 Proposed fix

```diff
 if err := event.SetData(cloudevents.ApplicationJSON, map[string]interface{}{
 	"kind":       resource.Kind,
 	"id":         resource.ID,
 	"generation": resource.Generation,
 	"href":       resource.Href,
 	"reason":     decision.Reason,
 }); err != nil {
+	evalSpan.RecordError(err)
+	evalSpan.SetStatus(codes.Error, "set event data failed")
 	s.logger.Errorf(eventCtx, "Failed to set event data resource_id=%s error=%v", resource.ID, err)
+	evalSpan.End()
 	continue
 }
```

Also applies to: 192-192
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/sentinel/sentinel.go` around lines 134 - 143, When event.SetData(...) fails in the block around event.SetData, ensure you mark and end the `evalSpan` before returning/continuing: after logging the error in the SetData error branch, record the error on `evalSpan` (e.g., RecordError / SetStatus to error with the err message) and call `evalSpan.End()` prior to the `continue` so the span is not leaked; apply the same fix to the other identical error path near the later SetData call (the occurrence referenced at line ~192).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/sentinel/main.go`:
- Around line 151-155: The code currently ignores parse errors from
strconv.ParseBool for TRACING_ENABLED; modify the block around
otelEnabled/strconv.ParseBool so that when err != nil you either log a clear
error and exit (fail fast) or at minimum emit a warning to stderr/logger
indicating the invalid value and that tracing will remain at its default, e.g.
using log.Printf or fmt.Fprintf(os.Stderr, ...), referencing the TRACING_ENABLED
env var, the strconv.ParseBool call, and cfg.OTel.Enabled to show
current/default state; ensure successful ParseBool still sets cfg.OTel.Enabled
as before.
In `@pkg/telemetry/otel.go`:
- Around line 77-80: The code silently ignores strconv.ParseFloat errors when
reading ENV_OTEL_TRACES_SAMPLER_ARG; update the parsing branch (the block
referencing ENV_OTEL_TRACES_SAMPLER_ARG and samplingRate) to detect a parse
error and handle it explicitly by logging a clear warning or returning an error
that includes the environment variable name and the invalid value, and only
assign samplingRate on successful parse; ensure the log/returned message
references ENV_OTEL_TRACES_SAMPLER_ARG and the raw arg so misconfiguration is
visible.
---
Duplicate comments:
In `@cmd/sentinel/main.go`:
- Around line 301-305: The shutdown error log is using the wrong context; change
the log call to use otelShutdownCtx (created by context.WithTimeout) instead of
ctx so the telemetry.Shutdown error is logged with the same context it runs
under: locate otelShutdownCtx and otelShutdownCancel and update the
log.Extra(...).Error(...) invocation that follows
telemetry.Shutdown(otelShutdownCtx, tp) to pass otelShutdownCtx as the context
argument.
In `@internal/sentinel/sentinel.go`:
- Around line 134-143: When event.SetData(...) fails in the block around
event.SetData, ensure you mark and end the `evalSpan` before
returning/continuing: after logging the error in the SetData error branch,
record the error on `evalSpan` (e.g., RecordError / SetStatus to error with the
err message) and call `evalSpan.End()` prior to the `continue` so the span is
not leaked; apply the same fix to the other identical error path near the later
SetData call (the occurrence referenced at line ~192).
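The env-var parsing fixes requested above (for TRACING_ENABLED and OTEL_TRACES_SAMPLER_ARG) share one shape: warn on malformed input and keep the default rather than silently ignoring the parse error. A minimal sketch follows; the helper names and the `warnf` callback are illustrative, not the PR's actual API.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseBoolEnv mirrors the suggested TRACING_ENABLED handling: empty keeps
// the default, malformed input warns and keeps the default.
func parseBoolEnv(name, raw string, def bool, warnf func(string, ...any)) bool {
	raw = strings.TrimSpace(raw)
	if raw == "" {
		return def
	}
	v, err := strconv.ParseBool(raw)
	if err != nil {
		warnf("invalid %s=%q, keeping default %v", name, raw, def)
		return def
	}
	return v
}

// parseFloatEnv does the same for OTEL_TRACES_SAMPLER_ARG, so a bad
// sampling rate never silently becomes zero.
func parseFloatEnv(name, raw string, def float64, warnf func(string, ...any)) float64 {
	raw = strings.TrimSpace(raw)
	if raw == "" {
		return def
	}
	v, err := strconv.ParseFloat(raw, 64)
	if err != nil {
		warnf("invalid %s=%q, keeping default %g", name, raw, def)
		return def
	}
	return v
}

func main() {
	warnf := func(format string, args ...any) { fmt.Printf("WARN: "+format+"\n", args...) }
	fmt.Println(parseBoolEnv("TRACING_ENABLED", "yes?", false, warnf))        // warns, then false
	fmt.Println(parseFloatEnv("OTEL_TRACES_SAMPLER_ARG", "0.25", 1.0, warnf)) // 0.25
}
```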
/retest
Actionable comments posted: 3
🧹 Nitpick comments (2)
pkg/telemetry/otel.go (1)
87-96: `ParentBased` not applied for `AlwaysSample` and `NeverSample` cases.

When `samplingRate >= 1.0` or `<= 0.0`, the sampler is set directly to `AlwaysSample()` or `NeverSample()` without wrapping in `ParentBased()`. This means parent sampling decisions are ignored in these cases, potentially breaking trace continuity across service boundaries.

For consistency with the architecture standard that specifies `parentbased_traceidratio` as the default strategy, consider wrapping all samplers:

♻️ Proposed fix

```diff
 var sampler trace.Sampler
 switch {
 case samplingRate >= 1.0:
-	sampler = trace.AlwaysSample() // Sample all
+	sampler = trace.ParentBased(trace.AlwaysSample())
 case samplingRate <= 0.0:
-	sampler = trace.NeverSample() // Sample none
+	sampler = trace.ParentBased(trace.NeverSample())
 default:
 	sampler = trace.ParentBased(trace.TraceIDRatioBased(samplingRate))
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/telemetry/otel.go` around lines 87 - 96, the sampler logic sets the sampler directly to trace.AlwaysSample() or trace.NeverSample() for samplingRate >= 1.0 or <= 0.0, which bypasses parent-based behavior; wrap those base samplers with trace.ParentBased(...) so all cases respect the parent sampling decision.

pkg/telemetry/otel_test.go (1)
181-185: Test assertion for `expectedSample=false` cases may be misleading.

For `always_off` (line 117-121), the test expects `expectedSample: false` but only checks `IsValid()` when `expectedSample` is true. The span context remains valid even when not sampled — `NeverSample` still produces a valid `SpanContext`, it just won't be exported.

To properly verify sampling behavior, check `span.SpanContext().IsSampled()` instead:

💡 Suggested enhancement

```diff
-	if tt.expectedSample {
-		if !span.SpanContext().IsValid() {
-			t.Error("Expected valid span context for sampling=true")
-		}
+	if span.SpanContext().IsSampled() != tt.expectedSample {
+		t.Errorf("Expected IsSampled()=%v, got %v", tt.expectedSample, span.SpanContext().IsSampled())
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/telemetry/otel_test.go` around lines 181 - 185, the test currently only checks span.SpanContext().IsValid() when tt.expectedSample is true, which doesn't verify sampling; update the assertion to use span.SpanContext().IsSampled() and compare it against tt.expectedSample so cases like "always_off" correctly validate sampling behavior.
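The parent-based sampling behavior these two findings revolve around can be modeled in a few lines. This is a simplified conceptual model, not the SDK implementation: with `ParentBased(...)`, a span that has a parent inherits the parent's sampled flag, and the wrapped sampler is only consulted for root spans — which is also exactly what `IsSampled()` (as opposed to `IsValid()`) reports.

```go
package main

import "fmt"

// sampleDecision is a simplified model (not the real SDK code) of why the
// review asks for trace.ParentBased(...) around every base sampler: for
// spans with a parent, the parent's sampled flag wins; the wrapped base
// sampler only decides for root spans.
func sampleDecision(hasParent, parentSampled, rootDecision bool) bool {
	if hasParent {
		return parentSampled
	}
	return rootDecision
}

func main() {
	// A child of a sampled upstream span stays sampled even if the local
	// base sampler (a low ratio, or NeverSample) would have said no —
	// trace continuity across service boundaries holds.
	fmt.Println(sampleDecision(true, true, false)) // true
	// Root spans still follow the base sampler's decision.
	fmt.Println(sampleDecision(false, false, false)) // false
}
```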
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/sentinel/sentinel_test.go`:
- Around line 449-454: The deferred shutdown of the tracer provider uses
t.Fatalf which runs after the test body and can mislead test results; change the
deferred call in the shutdown closure (the anonymous func that calls
tp.Shutdown(ctx) on the trace.TracerProvider instance `tp` using `ctx`) to use
t.Errorf to report errors instead of t.Fatalf, so cleanup failures are reported
without aborting after test completion.
- Line 467: The test currently ignores the error returned by
NewHyperFleetClient; change the setup to capture the error (e.g.,
hyperfleetClient, err := client.NewHyperFleetClient(server.URL, 10*time.Second))
and fail fast on error (e.g., t.Fatalf or require.NoError(t, err)) so setup
failures are reported immediately instead of surfacing as secondary errors later
when hyperfleetClient is used.
In `@pkg/telemetry/otel.go`:
- Around line 77-83: The current logic checks ENV_OTEL_TRACES_SAMPLER_ARG and
logs "Invalid" in the else branch when the env var is empty, but it fails to
warn on strconv.ParseFloat errors; update the logic in the block that reads
ENV_OTEL_TRACES_SAMPLER_ARG so that: if arg == "" do nothing (keep default
samplingRate), if arg != "" attempt strconv.ParseFloat and on err != nil call
log.Warnf with a clear message including ENV_OTEL_TRACES_SAMPLER_ARG and the bad
arg value and that the default samplingRate will be used, and only set
samplingRate when ParseFloat succeeds; adjust references to
ENV_OTEL_TRACES_SAMPLER_ARG, arg, strconv.ParseFloat, samplingRate, and
log.Warnf accordingly.
---
Nitpick comments:
In `@pkg/telemetry/otel_test.go`:
- Around line 181-185: The test currently only checks
span.SpanContext().IsValid() when tt.expectedSample is true, which doesn't
verify sampling; update the assertion to use span.SpanContext().IsSampled() (and
assert true/false according to tt.expectedSample) so cases like the "always_off"
sampler correctly validate sampling behavior; locate the table-driven test and
replace the IsValid-based conditional with an assertion on
SpanContext().IsSampled() for the test case keyed by tt.expectedSample in the
test function in pkg/telemetry/otel_test.go.
In `@pkg/telemetry/otel.go`:
- Around line 87-96: The sampler logic sets sampler directly to
trace.AlwaysSample() or trace.NeverSample() for samplingRate >= 1.0 or <= 0.0,
which bypasses parent-based behavior; change those branches to wrap the base
samplers with trace.ParentBased(...) so all cases—including
trace.AlwaysSample(), trace.NeverSample() and
trace.TraceIDRatioBased(samplingRate)—are applied via trace.ParentBased, keeping
the existing switch and the sampler variable but replacing direct calls with
trace.ParentBased(trace.AlwaysSample()), trace.ParentBased(trace.NeverSample()),
and trace.ParentBased(trace.TraceIDRatioBased(samplingRate)).
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- cmd/sentinel/main.go
- internal/client/client_test.go
- internal/config/config.go
- internal/sentinel/sentinel.go
- internal/sentinel/sentinel_test.go
- pkg/telemetry/otel.go
- pkg/telemetry/otel_test.go
- test/integration/integration_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- internal/client/client_test.go
- internal/config/config.go
```go
	attribute.String("hyperfleet.resource_id", resource.ID),
)

decision := s.decisionEngine.Evaluate(resource, now)
```
The sentinel.evaluate span is missing the hyperfleet.decision_reason attribute. The decision
is computed (s.decisionEngine.Evaluate(...)) but the result is never added to
evalSpan. Per the JIRA acceptance criteria and the architecture tracing standard,
hyperfleet.decision_reason is a required attribute on this span.
This is especially important for skipped resources — since no publish span is created in that
case, the decision reason would be completely absent from the trace.
Suggestion: add the attribute after the evaluation:
```diff
 decision := s.decisionEngine.Evaluate(resource, now)
+evalSpan.SetAttributes(attribute.String("hyperfleet.decision_reason", decision.Reason))
```
Added decision reason to evalSpan (though we only need to add it to the publish span)
…r check for new hyperfleet client
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/telemetry/otel.go`:
- Around line 142-149: Add nil/zero guards in SetTraceContext: return early if
event is nil or span is nil/invalid before calling span.SpanContext(); check
span != nil (or equivalent) and use span.SpanContext().IsValid() only after
those checks to avoid panics from dereferencing nil spans or nil events, and
keep the existing traceparent formatting logic (traceParent := fmt.Sprintf(...)
and event.SetExtension("traceparent", traceParent)) inside the guarded block so
they only run when both inputs are valid.
- Around line 41-43: The current code calls
otlptracehttp.WithEndpointURL(otlpEndpoint) (inside otlptracehttp.New), which
treats the value as a full URL and prevents the SDK from appending the
/v1/traces path; remove that call and either let the SDK read
OTEL_EXPORTER_OTLP_ENDPOINT itself (omit the WithEndpointURL option) or, if
specifying programmatically, replace WithEndpointURL with
otlptracehttp.WithEndpoint(...) using a host:port value (and keep the call site
in otlptracehttp.New).
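The `SetTraceContext` guards discussed above protect the point where the span context is serialized into the CloudEvents `traceparent` extension. The wire format being produced is the W3C trace-context header: `version-traceid-spanid-flags` in lowercase hex. The sketch below only illustrates that format — the helper name and signature are assumptions, not the PR's actual code.

```go
package main

import "fmt"

// formatTraceparent shows the W3C trace-context shape that a helper like
// SetTraceContext injects as the CloudEvents "traceparent" extension:
// version "00", then the 32-hex-digit trace ID, the 16-hex-digit span ID,
// and the trace flags ("01" when sampled).
func formatTraceparent(traceID, spanID string, sampled bool) string {
	flags := "00"
	if sampled {
		flags = "01"
	}
	return fmt.Sprintf("00-%s-%s-%s", traceID, spanID, flags)
}

func main() {
	// Example IDs taken from the W3C Trace Context specification.
	fmt.Println(formatTraceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", true))
}
```

This is also why the nil/invalid guards matter: serializing an all-zero trace or span ID would produce a traceparent value that downstream consumers must reject.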
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- internal/sentinel/sentinel.go
- internal/sentinel/sentinel_test.go
- pkg/telemetry/otel.go
🚧 Files skipped from review as they are similar to previous changes (2)
- internal/sentinel/sentinel_test.go
- internal/sentinel/sentinel.go
… check in SetTraceContext()
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/telemetry/otel.go`:
- Around line 41-42: The exporter selection currently only checks
ENV_OTEL_EXPORTER_OTLP_ENDPOINT; update the gating logic to also consider
ENV_OTEL_EXPORTER_OTLP_TRACES_ENDPOINT and prefer the traces-specific variable
when present so otlptracehttp.New is attempted for traces; modify the check
around otlptracehttp.New (references: ENV_OTEL_EXPORTER_OTLP_ENDPOINT,
ENV_OTEL_EXPORTER_OTLP_TRACES_ENDPOINT, otlptracehttp.New, exporter, err) to
look for the traces endpoint first and fall back to the generic OTLP endpoint
before defaulting to stdout.
```go
if otlpEndpoint := os.Getenv(ENV_OTEL_EXPORTER_OTLP_ENDPOINT); otlpEndpoint != "" {
	exporter, err = otlptracehttp.New(ctx)
```
🧩 Analysis chain
🏁 Script executed:
```shell
cat -n pkg/telemetry/otel.go | head -80
```

Repository: openshift-hyperfleet/hyperfleet-sentinel
Length of output: 3206
🌐 Web query:
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT OpenTelemetry specification environment variable traces
💡 Result:
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT is the OpenTelemetry OTLP exporter environment variable that sets the trace (spans) export endpoint URL only (it overrides the generic OTEL_EXPORTER_OTLP_ENDPOINT for traces). [1]
Key spec behaviors:
- Precedence: If
`OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` is set, it takes precedence over `OTEL_EXPORTER_OTLP_ENDPOINT` for traces. [1]
- OTLP/HTTP path handling: The value of
`OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` is used as-is (the exporter must not auto-append `/v1/traces`). Exception: if the URL has no path, `/` must be used. [1]
- In contrast, when only
`OTEL_EXPORTER_OTLP_ENDPOINT` is set (OTLP/HTTP), the exporter constructs the traces URL by appending `v1/traces` to the base endpoint. [1]
- Defaults (commonly documented):
- OTLP/gRPC:
http://localhost:4317 - OTLP/HTTP:
http://localhost:4318/v1/traces[2]
- OTLP/gRPC:
Sources:
[1] OpenTelemetry Specification – OTLP Exporter (Endpoint URLs and env vars) https://opentelemetry.io/docs/specs/otel/protocol/exporter/
[2] OpenTelemetry Docs – OTLP Exporter Configuration (defaults and examples) https://opentelemetry.io/docs/languages/sdk-configuration/otlp-exporter/
🌐 Web query:
OpenTelemetry Go otlptracehttp.New environment variables OTEL_EXPORTER_OTLP_ENDPOINT OTEL_EXPORTER_OTLP_TRACES_ENDPOINT
💡 Result:
For the Go OTLP/HTTP trace exporter (go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp.New), the two env vars behave like this:
-
`OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` (per-signal) takes precedence over `OTEL_EXPORTER_OTLP_ENDPOINT` (base). If `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` is set, it's used for traces instead of the base endpoint. The Go exporter docs also note `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` overrides `OTEL_EXPORTER_OTLP_ENDPOINT`. [1][2][3]
Path handling differs:
- If you set
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT, the URL is used as-is (no automatic/v1/tracesappend). So for OTLP/HTTP you typically include the full trace path yourself (e.g.http://collector:4318/v1/traces). [1][2] - If you set only
OTEL_EXPORTER_OTLP_ENDPOINT, it’s treated as a base URL, and the exporter constructs the trace URL by appendingv1/traces(e.g.http://collector:4318→http://collector:4318/v1/traces). [1][2][3]
- If you set
-
Code options can override env: in Go, options like
`WithEndpoint`, `WithEndpointURL`, `WithInsecure` can override env-based configuration. [3]
Examples
- Single base endpoint (auto path):
OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4318- traces →
http://collector:4318/v1/traces[1][2]
- Per-traces endpoint (no auto path):
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://collector:4318/v1/traces- traces → exactly that URL [1][2]
Sources
[1] OpenTelemetry OTLP/HTTP exporter spec (endpoint construction & precedence) (opentelemetry.io)
[2] OpenTelemetry “OTLP Exporter Configuration” (env var meanings/defaults) (opentelemetry.io)
[3] Go otlptracehttp package docs (env vars + override behavior) (pkg.go.dev)
Citations:
- 1: https://opentelemetry.io/docs/specs/otel/protocol/exporter/?utm_source=openai
- 2: https://opentelemetry.io/docs/languages/sdk-configuration/otlp-exporter/?utm_source=openai
- 3: https://pkg.go.dev/go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp?utm_source=openai
Check for the traces-specific OTLP endpoint when selecting the exporter.
The current gating logic at line 41 only checks `OTEL_EXPORTER_OTLP_ENDPOINT`. Per the OpenTelemetry specification, `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` is a valid configuration that takes precedence for traces. If a user sets only this traces-specific endpoint (a valid configuration), the condition fails and the code silently falls back to stdout, never instantiating the OTLP exporter, even though `otlptracehttp.New()` would handle that endpoint correctly.
🔧 Proposed fix
const (
SAMPLER_ALWAYS_ON = "always_on"
SAMPLER_ALWAYS_OFF = "always_off"
SAMPLER_TRACE_ID_RATIO = "traceidratio"
ENV_OTEL_TRACES_SAMPLER = "OTEL_TRACES_SAMPLER"
ENV_OTEL_TRACES_SAMPLER_ARG = "OTEL_TRACES_SAMPLER_ARG"
ENV_OTEL_EXPORTER_OTLP_ENDPOINT = "OTEL_EXPORTER_OTLP_ENDPOINT"
+ ENV_OTEL_EXPORTER_OTLP_TRACES_ENDPOINT = "OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"
)
@@
- if otlpEndpoint := os.Getenv(ENV_OTEL_EXPORTER_OTLP_ENDPOINT); otlpEndpoint != "" {
+ if os.Getenv(ENV_OTEL_EXPORTER_OTLP_ENDPOINT) != "" ||
+ os.Getenv(ENV_OTEL_EXPORTER_OTLP_TRACES_ENDPOINT) != "" {
          exporter, err = otlptracehttp.New(ctx)

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
const (
	SAMPLER_ALWAYS_ON                      = "always_on"
	SAMPLER_ALWAYS_OFF                     = "always_off"
	SAMPLER_TRACE_ID_RATIO                 = "traceidratio"
	ENV_OTEL_TRACES_SAMPLER                = "OTEL_TRACES_SAMPLER"
	ENV_OTEL_TRACES_SAMPLER_ARG            = "OTEL_TRACES_SAMPLER_ARG"
	ENV_OTEL_EXPORTER_OTLP_ENDPOINT        = "OTEL_EXPORTER_OTLP_ENDPOINT"
	ENV_OTEL_EXPORTER_OTLP_TRACES_ENDPOINT = "OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"
)

	if os.Getenv(ENV_OTEL_EXPORTER_OTLP_ENDPOINT) != "" ||
		os.Getenv(ENV_OTEL_EXPORTER_OTLP_TRACES_ENDPOINT) != "" {
		exporter, err = otlptracehttp.New(ctx)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/telemetry/otel.go` around lines 41-42: the exporter selection currently
only checks ENV_OTEL_EXPORTER_OTLP_ENDPOINT. Update the gating logic to also
consider ENV_OTEL_EXPORTER_OTLP_TRACES_ENDPOINT, preferring the traces-specific
variable when present, so that otlptracehttp.New is attempted for traces. Modify
the check around otlptracehttp.New (references: ENV_OTEL_EXPORTER_OTLP_ENDPOINT,
ENV_OTEL_EXPORTER_OTLP_TRACES_ENDPOINT, otlptracehttp.New, exporter, err) to
look for the traces endpoint first and fall back to the generic OTLP endpoint
before defaulting to stdout.
Summary
Implements comprehensive OpenTelemetry tracing instrumentation in Sentinel to enable distributed tracing across the HyperFleet orchestration pipeline.
Span hierarchy:
sentinel.poll (root)
Resolves
HYPERFLEET-542
Changed
Tested
Summary by CodeRabbit
New Features
Tests