[SLES-2666] handle_end_invocation race condition #1024
pablomartinezbernardo wants to merge 3 commits into main from
Conversation
```rust
            Ok(r) => r,
            Err(e) => {
                error!("Failed to extract request body: {e}");
                return (StatusCode::OK, json!({}).to_string()).into_response();
```
handle_end_invocation returns StatusCode::OK on error but handle_start_invocation returns StatusCode::BAD_REQUEST (line 129). Should these be consistent?
If this was intentional, consider leaving a comment.
This was intentional, in the sense that it is the behavior that does not introduce regressions: handle_end_invocation was never able to return anything other than 200, and consumers of this endpoint may not be expecting something other than 200. But yeah, a comment explaining why (or better yet, making consumers aware that this may not return 200) is a good idea.
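For reference, a minimal sketch of the two error responses side by side; the helper names and JSON bodies here are illustrative, not the repo's actual code:

```rust
use axum::{
    http::StatusCode,
    response::{IntoResponse, Response},
};
use serde_json::json;

// handle_start_invocation rejects a malformed request outright...
fn start_invocation_error_response() -> Response {
    (StatusCode::BAD_REQUEST, json!({}).to_string()).into_response()
}

// ...while handle_end_invocation deliberately keeps answering 200: this endpoint
// has historically never returned anything else, so existing consumers may not
// expect (or handle) a non-200 status.
fn end_invocation_error_response() -> Response {
    (StatusCode::OK, json!({}).to_string()).into_response()
}
```

Either way, documenting the choice in the handler seems like the low-risk option.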
```rust
    State((invocation_processor_handle, _, tasks)): State<ListenerState>,
    request: Request,
) -> Response {
    let (parts, body) = match extract_request_body(request).await {
```
Consider leaving a comment to explain why this should not be async (so somebody doesn't come and try to "optimize" it again later).
Something like:
extract_request_body must complete BEFORE returning 200 OK
to avoid a race condition. See SLES-2666.
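For context, a sketch of the handler shape this suggestion points at, assuming axum 0.7; `extract_request_body` is a simplified stand-in here and the real handler also takes the `State<ListenerState>` shown above:

```rust
use axum::{
    body::{to_bytes, Body, Bytes},
    http::{request::Parts, Request, StatusCode},
    response::{IntoResponse, Response},
};
use serde_json::json;
use tracing::error;

// Simplified stand-in for the repo's extractor.
async fn extract_request_body(request: Request<Body>) -> Result<(Parts, Bytes), String> {
    let (parts, body) = request.into_parts();
    let bytes = to_bytes(body, usize::MAX).await.map_err(|e| e.to_string())?;
    Ok((parts, bytes))
}

async fn handle_end_invocation(request: Request<Body>) -> Response {
    // extract_request_body must complete BEFORE returning 200 OK
    // to avoid a race condition. See SLES-2666.
    let (_parts, _body) = match extract_request_body(request).await {
        Ok(r) => r,
        Err(e) => {
            error!("Failed to extract request body: {e}");
            return (StatusCode::OK, json!({}).to_string()).into_response();
        }
    };
    // ... the body is processed and universal_instrumentation_end is sent here,
    // before the 200 below hands control back to the tracer ...
    (StatusCode::OK, json!({}).to_string()).into_response()
}
```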
Updating branch and creating an RC from it
Overview
- `handle_end_invocation` offloads the body to an `anonymousTask`, then immediately returns 200 so the tracer can continue
- the `anonymousTask` is busy with a complex body in `extract_request_body` for the time being
- `PlatformRuntimeDone` is processed (`initialization_type: SnapStart`), then `PlatformRuntimeDone` tries to pair `_platform_runtime_done_event`, which is `None` because the `anonymousTask` is still busy with the body
- the `anonymousTask` finally completes, but that's irrelevant because `send_ctx_spans` is only run on `PlatformRuntimeDone`, which assumes `universal_instrumentation_end` has already been sent (a toy reproduction is sketched right after this list)
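To make the sequence concrete, here is a toy tokio reproduction of the ordering bug; all names are illustrative stand-ins, not the extension's actual code, and the "complex body" is simulated with a sleep:

```rust
use std::sync::Arc;
use tokio::sync::Mutex;
use tokio::time::{sleep, Duration};

#[tokio::main]
async fn main() {
    // Stand-in for "universal_instrumentation_end has been sent for this invocation".
    let end_invocation_seen: Arc<Mutex<Option<String>>> = Arc::new(Mutex::new(None));

    // handle_end_invocation (pre-fix shape): offload the slow body read to an
    // anonymous task and answer 200 right away.
    let state = end_invocation_seen.clone();
    tokio::spawn(async move {
        sleep(Duration::from_millis(50)).await; // complex body in extract_request_body
        *state.lock().await = Some("universal_instrumentation_end".to_string());
    });
    println!("200 OK returned to the tracer immediately");

    // The runtime finishes quickly, so PlatformRuntimeDone is processed next.
    sleep(Duration::from_millis(5)).await;
    if end_invocation_seen.lock().await.is_some() {
        println!("PlatformRuntimeDone: paired with the end invocation, spans can be flushed");
    } else {
        println!("PlatformRuntimeDone: nothing to pair with (None) -- the spans are not sent");
    }
    // main (the "lambda") can even exit here before the spawned task ever finishes,
    // mirroring the customer run described below.
}
```

Awaiting the body read before replying, as this PR does, removes the window between the 200 and the state update.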
Why this looks likely

In the customer's logs we can see:
- `05:11:48.463` datadog.trace.agent.core.DDSpan - Finished span (WRITTEN): DDSpan [ t_id=2742542901019652192
- `05:11:48.489` PlatformRuntimeDone received
- `05:11:48.630` REPORT RequestId 1db22159-7200-43c8-bec1-11b89df4f099 (last log emitted in an execution)
- `05:11:53.784` START RequestId: 8c801767-e21b-43f7-bd11-078bb64bc430 (new request id, 5s later)
- `05:11:53.789` Received end invocation request from headers: {"x-datadog-trace-id": "2742542901019652192"... -> we are now trying to finish the span after the request is long gone 🙃

In this specific run, the lambda even had time to stop before continuing with the anonymous task from `handle_end_invocation`.

Performance
This PR makes the reading of the body synchronous with the response. This will delay handing over execution to outside the extension until the body is read. But that is irrelevant because it is a requirement to read the body and send `universal_instrumentation_end` before relinquishing control.

Testing
Suggestions very welcome
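If it helps, one possible shape for a regression test over the ordering guarantee; everything here (route, flag, body) is a self-contained stand-in rather than the repo's actual handler or test helpers, and it assumes axum 0.7 plus tower and tokio as dev-dependencies:

```rust
use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};

use axum::{
    body::{to_bytes, Body},
    http::{Request, StatusCode},
    routing::post,
    Router,
};
use tower::ServiceExt; // for `oneshot`

#[tokio::test]
async fn body_is_read_before_200_is_returned() {
    // Stand-in for "universal_instrumentation_end was sent".
    let end_sent = Arc::new(AtomicBool::new(false));
    let flag = end_sent.clone();

    let app = Router::new().route(
        "/lambda/end-invocation",
        post(move |request: Request<Body>| {
            let flag = flag.clone();
            async move {
                // Equivalent of extract_request_body + sending the end event.
                let _body = to_bytes(request.into_body(), usize::MAX).await.unwrap();
                flag.store(true, Ordering::SeqCst);
                StatusCode::OK
            }
        }),
    );

    let response = app
        .oneshot(
            Request::post("/lambda/end-invocation")
                .body(Body::from(r#"{"big":"payload"}"#))
                .unwrap(),
        )
        .await
        .unwrap();

    assert_eq!(response.status(), StatusCode::OK);
    // The handler awaited the body before responding, so by the time the
    // response future resolves the end event has been recorded.
    assert!(end_sent.load(Ordering::SeqCst));
}
```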