Persistent broken pipe errors (~1-2% of requests) #2077

@sid-stripe

Description

We're seeing persistent "WriteFailedException: Failed to write request to socket [broken pipe]" errors on roughly 1-2% of requests, across all Lambda instances. Bref recovers by restarting FPM, but the invocation that hit the error still returns a 500.

The error pattern on a single Lambda instance:

  1. Request N completes normally (e.g., 220ms)
  2. Request N+1 arrives ~6ms later
  3. WriteFailedException is thrown immediately: the FPM socket is already dead
  4. Bref restarts FPM, next request works

Errors also spontaneously drop to near-zero (~0.01%) for 7-14 hours without any deployment, then climb back to ~1-2%. This suggests state corruption that self-heals through Lambda instance recycling.

How to reproduce

We haven't found a reliable reproduction — it's intermittent and only manifests in production under real traffic. It has persisted since our initial Bref migration and across PHP 8.1 and 8.4.

  • Bref: 2.4.18 (Docker images, not layers)
  • Docker image: bref/php-84-fpm:2
  • PHP: 8.4.18
  • AWS service: Lambda (via API Gateway HTTP API)
  • Lambda config: 1024 MB memory, 28s timeout
  • FPM config: Bref defaults (pm=static, max_children=1, log_limit=8192)
  • Extensions: redis (Predis pure PHP client), intl, opcache, pcntl, posix, pdo_mysql
  • Framework: Laravel 10

What we've ruled out

  • SIGPIPE (#1854, "PHP-FPM children die with SIGPIPE since 8.3.10"): zero matches in CloudWatch
  • JIT segfaults (#842, "FastCgiCommunicationFailed on every second request"): opcache.jit = disable, verified from image
  • OOM — memory well within Lambda limits
  • Excimer profiler — removed from Dockerfile entirely, errors persist
  • Sentry tracing/profiling — fully disabled, errors persist
  • Redis/Valkey — Predis runs in child only, master never touches it
  • Large responses — no correlation between response size and errors
  • stderr log_limit — max log message is ~2KB, well under the 8192 limit
  • Cold starts — orders of magnitude fewer than error count
  • PHP version — reproduced on both 8.1 and 8.4

Questions

  • We noticed that the catch block in FpmHandler.php (line 162) calls $this->stop() before capturing
    proc_get_status($this->fpm), so the exit code and termination signal are lost. Would you accept a PR that
    captures that status first and logs the exit code and signal on the failure path?
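
To make the proposal concrete, here is a minimal sketch of the kind of logging we have in mind. It is not Bref's actual code: summarizeProcStatus() is a hypothetical helper, and the `sh -c 'exit 3'` child is only a stand-in for the FPM master process. The array keys (pid, running, signaled, termsig, exitcode) are the ones proc_get_status() returns. The sketch reads the status before any teardown, because on PHP versions before 8.3 the exit code is only reported by the first proc_get_status() call after the process has exited.

```php
<?php

/**
 * Hypothetical helper, not part of Bref: summarize a proc_get_status()
 * result so the exit code or terminating signal can be logged.
 */
function summarizeProcStatus(array $status): string
{
    $pid = (int) ($status['pid'] ?? -1);

    if (!empty($status['running'])) {
        return sprintf('pid %d is still running', $pid);
    }
    if (!empty($status['signaled'])) {
        return sprintf('pid %d was terminated by signal %d', $pid, (int) ($status['termsig'] ?? -1));
    }

    return sprintf('pid %d exited with code %d', $pid, (int) ($status['exitcode'] ?? -1));
}

// Demo: a short-lived child stands in for the FPM master process (assumption).
$proc = proc_open(['sh', '-c', 'exit 3'], [], $pipes);
if (is_resource($proc)) {
    // Poll until the child has exited, capped at roughly one second.
    for ($i = 0; $i < 100; $i++) {
        $status = proc_get_status($proc);
        if (!$status['running']) {
            break;
        }
        usleep(10000);
    }

    // Capture and log the status first, and only then tear the process down.
    error_log('FPM stand-in: ' . summarizeProcStatus($status));
    proc_close($proc);
}
```

In FpmHandler the equivalent change would be to read proc_get_status($this->fpm) at the top of the catch block, log the summary, and only then call $this->stop(). A termination signal in that log line would tell us whether FPM was killed or exited on its own.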
