Note: This issue was debugged with the assistance of Claude. I
have verified the analysis to the best of my knowledge.
Version: 4.0.0
What happens
The hackney pool GenServer crashes with a MatchError when a checkout races a
connection closing. hackney_sup restarts the pool immediately (one_for_one,
permanent) and checkout/4's try/catch converts the exit into {error, ...}
for the caller, so the system recovers. The practical impact is a single failed
checkout and log noise.
Noticed this when using
:broadway_sqs, which polls SQS
regularly. AWS closes idle connections on their end, triggering the race.
Have had similar experiences with other HTTP clients where AWS closing
connections needs handling in application or client library level.
** (MatchError) no match of right hand side value: {:error, :invalid_state}
(hackney 4.0.0) hackney_pool.erl:450: :hackney_pool.handle_call/3
(stdlib 7.3) gen_server.erl:2470: :gen_server.try_handle_call/4
(stdlib 7.3) gen_server.erl:2499: :gen_server.handle_msg/3
(stdlib 7.3) proc_lib.erl:333: :proc_lib.init_p_do_apply/3
Last message: {:checkout, {"sqs.eu-west-2.amazonaws.com", 443, :hackney_tcp}, #PID<...>, [recv_timeout: 30000]}
Suspected root cause
hackney_pool.erl:450 hard-matches the return of hackney_conn:set_owner/2:
ok = hackney_conn:set_owner(Pid, Requester),
find_available does call hackney_conn:is_ready/1 and only proceeds when it
gets {ok, connected}. It skips a connection in closed state.
Maybe: is_ready and set_owner are two separate synchronous
gen_statem calls, which may leave a window where a tcp_closed info message
arrives in the connection process mailbox between them:
- Pool calls
is_ready. gen_statem responds {ok, connected}, mailbox empty
- TCP stack delivers
tcp_closed to the gen_statem mailbox because AWS closed the connection?
- Pool receives
{ok, connected} and dispatches set_owner
- gen_statem processes
tcp_closed first (queued earlier). Transitions to closed
- gen_statem processes
set_owner, hits the catch-all. Returns {error, invalid_state}
The set_owner call would then hit the handle_common catch-all in hackney_conn.erl:1587:
handle_common({call, From}, _, _State, _Data) ->
{keep_state_and_data, [{reply, From, {error, invalid_state}}]};
The connection process stays alive for CLOSED_GRACE_MS (50 ms) before
terminating, so maybe the pool would not yet have received the 'DOWN' message to
remove the Pid from available?
The -spec for set_owner/2 only declares -> ok, and not {error, invalid_state}.
Possible fix
Maybe handle the {error, invalid_state}, regardless of how it happens?
Since it is part of the return path. Handle the error in hackney_pool.erl and
fall through to a fresh connection rather than crashing:
{ok, Pid, Available2} ->
case hackney_conn:set_owner(Pid, Requester) of
ok ->
InUse2 = maps:put(Pid, Key, InUse),
{reply, {ok, Pid}, State#state{available=Available2, in_use=InUse2}};
{error, _} ->
%% Connection closed between find_available and set_owner - start a fresh one
case start_connection(Key, Requester, Opts, State#state{available=Available2}) of
{ok, Pid2, State2} ->
InUse2 = maps:put(Pid2, Key, State2#state.in_use),
{reply, {ok, Pid2}, State2#state{in_use=InUse2}};
{error, Reason} ->
{reply, {error, Reason}, State#state{available=Available2}}
end
end;
The spec for set_owner/2 updated:
-spec set_owner(pid(), pid()) -> ok | {error, invalid_state}.
I can create a PR if it is a issue and not expected behaviour.
Version: 4.0.0
What happens
The hackney pool GenServer crashes with a
MatchErrorwhen a checkout races aconnection closing.
hackney_suprestarts the pool immediately (one_for_one,permanent) andcheckout/4'stry/catchconverts the exit into{error, ...}for the caller, so the system recovers. The practical impact is a single failed
checkout and log noise.
Noticed this when using
:broadway_sqs, which polls SQS
regularly. AWS closes idle connections on their end, triggering the race.
Have had similar experiences with other HTTP clients where AWS closing
connections needs handling in application or client library level.
Suspected root cause
hackney_pool.erl:450hard-matches the return ofhackney_conn:set_owner/2:find_availabledoes callhackney_conn:is_ready/1and only proceeds when itgets
{ok, connected}. It skips a connection inclosedstate.Maybe:
is_readyandset_ownerare two separate synchronousgen_statemcalls, which may leave a window where atcp_closedinfo messagearrives in the connection process mailbox between them:
is_ready. gen_statem responds{ok, connected}, mailbox emptytcp_closedto the gen_statem mailbox because AWS closed the connection?{ok, connected}and dispatchesset_ownertcp_closedfirst (queued earlier). Transitions toclosedset_owner, hits the catch-all. Returns{error, invalid_state}The
set_ownercall would then hit thehandle_commoncatch-all inhackney_conn.erl:1587:The connection process stays alive for
CLOSED_GRACE_MS(50 ms) beforeterminating, so maybe the pool would not yet have received the
'DOWN'message toremove the Pid from
available?The
-specforset_owner/2only declares-> ok, and not{error, invalid_state}.Possible fix
Maybe handle the
{error, invalid_state}, regardless of how it happens?Since it is part of the return path. Handle the error in
hackney_pool.erlandfall through to a fresh connection rather than crashing:
{ok, Pid, Available2} -> case hackney_conn:set_owner(Pid, Requester) of ok -> InUse2 = maps:put(Pid, Key, InUse), {reply, {ok, Pid}, State#state{available=Available2, in_use=InUse2}}; {error, _} -> %% Connection closed between find_available and set_owner - start a fresh one case start_connection(Key, Requester, Opts, State#state{available=Available2}) of {ok, Pid2, State2} -> InUse2 = maps:put(Pid2, Key, State2#state.in_use), {reply, {ok, Pid2}, State2#state{in_use=InUse2}}; {error, Reason} -> {reply, {error, Reason}, State#state{available=Available2}} end end;The spec for
set_owner/2updated:I can create a PR if it is a issue and not expected behaviour.