Confirm this is a feature request for the Python library and not the underlying OpenAI API.
Describe the feature or improvement you're requesting
Hey, I'm writing on behalf of Baseten.
When a client times out, the server has no way of knowing, and chaining cancellation across services has caveats (over HTTP/1, for example). The server keeps doing expensive work (inference, token generation) that nobody will ever receive. The only signal today is a TCP disconnect, which is reactive rather than proactive. In the future, the same header could also let the server fail a request early if the work cannot realistically be completed within the given time.
Proposal
Send a Request-Timeout-Ms header on every request so that nginx, Go contexts, and load balancers can cancel in-flight work proactively when the deadline elapses. No client disconnect is needed, and a cancellation chain can act proactively. Internal services can also convert this into Request-Deadline-Ms, an absolute deadline in milliseconds since the Unix epoch, which allows server-side verification in distributed systems.
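To make the relative-to-absolute conversion concrete, here is a minimal sketch of what an internal service could do. The function names are hypothetical, and it assumes header names have already been lowercased and that the clocks of the participating services are reasonably in sync.

```python
import time
from typing import Mapping, Optional


def deadline_ms_from_timeout(
    headers: Mapping[str, str], now_ms: Optional[int] = None
) -> Optional[int]:
    """Turn a relative Request-Timeout-Ms into an absolute epoch deadline (ms).

    Returns None when the header is missing or not a positive integer,
    so callers fall back to their normal behavior.
    """
    raw = headers.get("request-timeout-ms")  # assumes lowercased header names
    if raw is None:
        return None
    try:
        timeout_ms = int(raw)
    except ValueError:
        return None
    if timeout_ms <= 0:
        return None
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return now_ms + timeout_ms


def remaining_ms(deadline_ms: int, now_ms: Optional[int] = None) -> int:
    """Milliseconds left before the absolute deadline, never negative."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return max(0, deadline_ms - now_ms)
```

A downstream service that receives the absolute deadline can call remaining_ms() before starting expensive work and skip it when the result is 0.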
Why not just use x-stainless-read-timeout?
timeout.read is a per-chunk silence threshold, not a wall-clock budget. It resets on every received chunk, so it is the wrong value to drive server-side cancellation: a healthy long-running stream would be killed incorrectly.
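A small simulation illustrates the difference. This is only a sketch with hypothetical helper names: chunk arrival times are given in seconds since the request started, and the two functions model a per-chunk read timeout and a wall-clock budget respectively.

```python
from typing import Sequence


def read_timeout_trips(chunk_times: Sequence[float], read_timeout: float) -> bool:
    """True if any gap between consecutive chunks exceeds the read timeout."""
    last = 0.0  # request start
    for t in chunk_times:
        if t - last > read_timeout:
            return True
        last = t  # the silence timer resets on every chunk
    return False


def wall_clock_trips(chunk_times: Sequence[float], budget: float) -> bool:
    """True if the stream is still delivering chunks after the total budget."""
    return bool(chunk_times) and chunk_times[-1] > budget


# A healthy stream: one chunk per second for a minute.
healthy_stream = [float(i) for i in range(1, 61)]

client_is_happy = not read_timeout_trips(healthy_stream, read_timeout=5.0)
server_would_kill = wall_clock_trips(healthy_stream, budget=5.0)
```

With a 5 second read timeout the client never times out on this stream, but a server that treated the same 5 seconds as a total budget would cancel it after the fifth chunk.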
What's a valid value?
Only a plain float timeout (e.g. OpenAI(timeout=20.0)) is a true wall-clock budget for end-to-end time. httpx.Timeout objects have no equivalent field, so we should not send the header for those. Sending a wrong value is worse than sending no header, because the server could cancel work that the client is still waiting for.
Proposed Implementation
# _build_headers(), _base_client.py
if "request-timeout-ms" not in lower_custom_headers:
    timeout = self.timeout if isinstance(options.timeout, NotGiven) else options.timeout
    if not isinstance(timeout, Timeout) and timeout is not None:
        headers["request-timeout-ms"] = str(int(timeout * 1000))
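The same decision can be sketched as a standalone function that is easy to unit test. This is not the library's code: StructuredTimeout is a hypothetical stand-in for httpx.Timeout so the example runs without third-party imports, and the extra guards against non-finite or non-positive values are an assumption on top of the snippet above.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class StructuredTimeout:
    """Hypothetical stand-in for httpx.Timeout (per-phase timeouts only)."""

    connect: Optional[float] = None
    read: Optional[float] = None


def request_timeout_ms(timeout: object) -> Optional[str]:
    """Return the header value for a plain numeric timeout, else None.

    None means "do not send the header": there is no single wall-clock
    budget to report for None or for a structured timeout object.
    """
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(timeout, bool) or not isinstance(timeout, (int, float)):
        return None
    # Assumption: infinite, NaN, zero, or negative timeouts carry no usable budget.
    if not math.isfinite(timeout) or timeout <= 0:
        return None
    return str(int(timeout * 1000))
```

For example, request_timeout_ms(20.0) yields "20000", while a StructuredTimeout instance or None yields no header at all.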
Prior Art
- gRPC: grpc-timeout propagates deadlines end to end across all services and is the canonical example of this pattern. Middleware can decrease the value as it passes the request along.
- Envoy: x-envoy-upstream-rq-timeout-ms has the exact same semantics and is widely adopted in service meshes. Unfortunately, the name is not vendor-agnostic.
- Google Maps/Cloud API: X-Server-Timeout is used for deadline propagation, unfortunately in seconds rather than milliseconds.
- Stainless SDKs: already send x-stainless-read-timeout for observability. This proposal builds on that foundation with correct cancellation semantics.
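As a sketch of how middleware could decrease the budget before forwarding a request, in the spirit of the gRPC pattern above (forward_budget_ms and its safety_margin_ms parameter are hypothetical names, not part of any existing proxy or SDK):

```python
import time
from typing import Optional


def forward_budget_ms(
    incoming_ms: int,
    started_at: float,
    safety_margin_ms: int = 0,
    now: Optional[float] = None,
) -> int:
    """Budget to pass downstream: incoming budget minus time already spent here.

    started_at and now are time.monotonic() readings taken in this process;
    safety_margin_ms reserves time for the response to travel back upstream.
    """
    if now is None:
        now = time.monotonic()
    elapsed_ms = int((now - started_at) * 1000)
    return max(0, incoming_ms - elapsed_ms - safety_margin_ms)
```

A result of 0 means the deadline has already passed at this hop, so the middleware can fail fast instead of forwarding the request.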
It would be great to have a vendor-agnostic name that a range of LLM projects could adopt. The Stainless OpenAI API is IMO the best proxy for this. I believe a header we can rely on would help us save a ton of compute. Please don't make the header name contain openai or stainless.
Additional context