Intermittent Pool GenServer crash on `hackney_conn:set_owner/2`  failing to match `{error, invalid_state}` if the connection is closed

> **Note:** This issue was debugged with the assistance of Claude. I
> have verified the analysis to the best of my knowledge.

**Version:** 4.0.0

## What happens

The hackney pool GenServer crashes with a `MatchError` when a checkout races a
connection closing. `hackney_sup` restarts the pool immediately (`one_for_one`,
`permanent`) and `checkout/4`'s `try/catch` converts the exit into `{error, ...}`
for the caller, so the system recovers. **The practical impact is a single failed
checkout and log noise.**

Noticed this when using
[:broadway_sqs](https://github.com/dashbitco/broadway_sqs), which polls SQS
regularly. AWS closes idle connections on their end, triggering the race. 

Have had similar experiences with other HTTP clients where AWS closing
connections needs handling in application or client library level.

```
** (MatchError) no match of right hand side value: {:error, :invalid_state}
    (hackney 4.0.0) hackney_pool.erl:450: :hackney_pool.handle_call/3
    (stdlib 7.3) gen_server.erl:2470: :gen_server.try_handle_call/4
    (stdlib 7.3) gen_server.erl:2499: :gen_server.handle_msg/3
    (stdlib 7.3) proc_lib.erl:333: :proc_lib.init_p_do_apply/3
Last message: {:checkout, {"sqs.eu-west-2.amazonaws.com", 443, :hackney_tcp}, #PID<...>, [recv_timeout: 30000]}
```

## Suspected root cause

`hackney_pool.erl:450` hard-matches the return of `hackney_conn:set_owner/2`:

```erlang
ok = hackney_conn:set_owner(Pid, Requester),
```

`find_available` does call `hackney_conn:is_ready/1` and only proceeds when it
gets `{ok, connected}`. It skips a connection in `closed` state.

Maybe: `is_ready` and `set_owner` are two separate synchronous
`gen_statem` calls, which may leave a window where a `tcp_closed` info message
arrives in the connection process mailbox between them:

1. Pool calls `is_ready`. gen_statem responds `{ok, connected}`, mailbox empty
2. TCP stack delivers `tcp_closed` to the gen_statem mailbox because AWS closed the connection?
3. Pool receives `{ok, connected}` and dispatches `set_owner`
4. gen_statem processes `tcp_closed` first (queued earlier). Transitions to `closed`
5. gen_statem processes `set_owner`, hits the catch-all. Returns `{error, invalid_state}`

The `set_owner` call would then hit the `handle_common` catch-all in `hackney_conn.erl:1587`:

```erlang
handle_common({call, From}, _, _State, _Data) ->
    {keep_state_and_data, [{reply, From, {error, invalid_state}}]};
```

The connection process stays alive for `CLOSED_GRACE_MS` (50 ms) before
terminating, so maybe the pool would not yet have received the `'DOWN'` message to
remove the Pid from `available`?

The `-spec` for `set_owner/2` only declares `-> ok`, and not `{error,
invalid_state}`. 

## Possible fix

Maybe handle the `{error, invalid_state}`, regardless of how it happens?
Since it is part of the return path. Handle the error in `hackney_pool.erl` and
fall through to a fresh connection rather than crashing:

```erlang
{ok, Pid, Available2} ->
    case hackney_conn:set_owner(Pid, Requester) of
        ok ->
            InUse2 = maps:put(Pid, Key, InUse),
            {reply, {ok, Pid}, State#state{available=Available2, in_use=InUse2}};
        {error, _} ->
            %% Connection closed between find_available and set_owner - start a fresh one
            case start_connection(Key, Requester, Opts, State#state{available=Available2}) of
                {ok, Pid2, State2} ->
                    InUse2 = maps:put(Pid2, Key, State2#state.in_use),
                    {reply, {ok, Pid2}, State2#state{in_use=InUse2}};
                {error, Reason} ->
                    {reply, {error, Reason}, State#state{available=Available2}}
            end
    end;
```

The spec for `set_owner/2` updated:

```erlang
-spec set_owner(pid(), pid()) -> ok | {error, invalid_state}.
```

I can create a PR if it is a issue and not expected behaviour.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Intermittent Pool GenServer crash on `hackney_conn:set_owner/2` failing to match `{error, invalid_state}` if the connection is closed #850

What happens

Suspected root cause

Possible fix

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

Intermittent Pool GenServer crash on hackney_conn:set_owner/2 failing to match {error, invalid_state} if the connection is closed #850

Description

What happens

Suspected root cause

Possible fix

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

Intermittent Pool GenServer crash on `hackney_conn:set_owner/2` failing to match `{error, invalid_state}` if the connection is closed #850