
readline_timed drops trace stream events under load (uses isc_info_svc_line per-call instead of isc_info_svc_to_eof bulk drain) #70

@fdcastel

Description

Summary

Server.readline_timed (the documented API for consuming a Firebird trace stream from Python) drops the majority of trace events under sustained load. The cause is the request mode it sends to the services API: isc_info_svc_line (= SrvInfoCode.LINE, value 62), which requests one line per round-trip. Under realistic OLTP throughput the per-call overhead caps the consumer at roughly 480 trace events per second, while Firebird's fbtracemgr.exe (the reference C++ consumer shipped with the engine) keeps up with the full stream because it uses isc_info_svc_to_eof (= SrvInfoCode.TO_EOF, value 63) and drains many lines per round-trip.
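
For a quick sanity check of the two request items at a Python REPL (a sketch; the SrvInfoCode import path is an assumption and may vary by driver version, values are as quoted above):

from firebird.driver.types import SrvInfoCode  # assumed import path

assert SrvInfoCode.LINE == 62    # isc_info_svc_line: one line per round-trip
assert SrvInfoCode.TO_EOF == 63  # isc_info_svc_to_eof: bulk drain per round-trip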

I have evidence from a five-experiment baseline (matched workload, same python3-driver build, only the request mode varying) showing the gap is purely down to the request-mode choice: a Python reader using _svc.query(..., bytes([SrvInfoCode.TO_EOF]), ...) against the same trace session captures exactly as many events as fbtracemgr does.

Reproduction

  1. Start a Firebird 5.0.x server with a workload that emits trace events at a non-trivial rate. I used HammerDB's TPC-C driver (200-call transaction mix run via tclsh from https://github.com/fdcastel/HammerDB/tree/feature/firebird-support), but the same effect should appear with gbak, a sysbench-style workload, or even a tight loop of EXECUTE_STATEMENT_* events.
  2. Open a long-running trace session via python3-driver:
    import sys
    import firebird.driver as fb

    conn = fb.connect_server(server="localhost/3050", user="SYSDBA", password="masterkey")
    # EVERYTHING_CONFIG is the trace configuration string (all event types enabled).
    session_id = conn.trace.start(config=EVERYTHING_CONFIG, name="repro")
    while True:
        line = conn.readline_timed(1)
        if line is None:
            break
        if isinstance(line, str):
            sys.stdout.write(line)
        # a non-str result is the TIMEOUT sentinel: idle, keep polling
  3. Run the same trace.conf against the same workload using fbtracemgr -se localhost:service_mgr -user SYSDBA -password masterkey -start -name repro2 -config trace.conf > fbtracemgr.out.
  4. Count START_TRANSACTION (or any frequent event type) in each capture; a counting sketch follows this list.
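
For step 4, a minimal counting script (illustrative only; the event tag strings follow Firebird's trace output format):

import sys
from collections import Counter

def count_events(path, events=("START_TRANSACTION",)):
    """Count occurrences of the given trace event tags in a capture file."""
    counts = Counter()
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            for ev in events:
                if ev in line:
                    counts[ev] += 1
    return counts

if __name__ == "__main__":
    # e.g.: python count_events.py python.out fbtracemgr.out
    for path in sys.argv[1:]:
        print(path, dict(count_events(path)))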

Expected: identical counts. Observed: fbtracemgr captures roughly 6× more events on a runtxn-only workload, and ~17× more when the trace session is also observing a heavy bulk-load phase.

Quantitative evidence

Five experiments under matched conditions (Firebird 5.0.3 on Windows Server 2022; HammerDB TPC-C 200-call transaction mix against a pre-loaded 1-warehouse schema; trace config = EVERYTHING_CONFIG mirrored from the project's defaults):

| Experiment | Consumer | Request mode | START_TX captured | Total events | Trace bytes |
|---|---|---|---|---|---|
| E1 | fbtracemgr | isc_info_svc_to_eof | 756 (load+run) | 1 209 620 | 1.34 GB |
| E2 | fbtracemgr | isc_info_svc_to_eof | 201 (run only) | 10 170 | 8.1 MB |
| E3 | python3-driver via Server.readline_timed | isc_info_svc_line | 58 (run only) | 2 605 | 2.1 MB |
| E4 | python3-driver bare readline_timed loop | isc_info_svc_line | 58 (run only) | 2 977 | 2.4 MB |
| E5 | python3-driver low-level _svc.query(..., TO_EOF, ...) | isc_info_svc_to_eof | 201 (run only) | 10 226 | 8.4 MB |

E5 is the same Python process, same connection class, same Firebird session — only the SPB request item differs. It captures the full stream identically to fbtracemgr. Per-call verbose instrumentation on E3/E4 confirmed the Python reader is not throughput-bound on its own: it processes incoming lines at ~10 k lines/sec consistently. The events it's missing are dropped server-side by Firebird's trace buffer because the LINE-mode round-trips can't drain it fast enough; events accumulate until the server discards them to protect itself.

The harness produced raw traces, per-call timeline logs, and response-tag histograms. I can share specific bytes or a self-contained reproducer if helpful.
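
The per-call instrumentation was in this spirit (a sketch of the idea, not the real harness, which is more involved):

import time
from collections import Counter

def instrumented_loop(conn, timeout=1, report_every=10_000):
    """Time readline_timed round-trips and histogram the result kinds."""
    kinds, calls, t0 = Counter(), 0, time.perf_counter()
    while True:
        line = conn.readline_timed(timeout)
        calls += 1
        kinds["eof" if line is None
              else "line" if isinstance(line, str)
              else "timeout"] += 1
        if line is None:
            break
        if calls % report_every == 0:
            rate = calls / (time.perf_counter() - t0)
            print(f"{calls} calls, {rate:,.0f} calls/sec, {dict(kinds)}")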

Root cause (single line)

Server.readline_timed sends bytes([SrvInfoCode.LINE]) as the request item. Compare with Firebird's own reference trace consumer at src/utilities/fbtracemgr/traceMgrMain.cpp:216: const char query[] = {isc_info_svc_to_eof, isc_info_end};. Server-side handling for the two request items goes through the same get() function in src/jrd/svc.cpp but with different flags (GET_LINE vs GET_EOF), and the bulk-drain path is the only one fast enough for a busy trace stream.

A secondary issue lives in the same method: when the server returns isc_info_data_not_ready (transient buffer-empty, but session still active), readline_timed returns None. Callers conventionally treat None as EOF and tear down the iterator, which means a brief idle moment can prematurely terminate trace capture. The behavior is consistent with the current code:

# End of the current Server.readline_timed implementation:
data = self.response.read_sized_string(encoding=self.encoding, errors=self.encoding_errors)
if self.response.get_tag() == SrvInfoCode.TIMEOUT:
    return TIMEOUT              # server-side timeout: "no data yet, keep polling"
return data if data else None   # empty payload collapses to None, read by callers as EOF

SrvInfoCode.TIMEOUT is checked explicitly but isc_info_data_not_ready (value 4) is not — both should signal "no data right now, session still alive" rather than EOF. fbtracemgr's loop treats both indicators identically: it keeps polling until either Ctrl-C or a genuine end-of-service (which the server signals by ending the response with isc_info_end without a trailing indicator).
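
A minimal sketch of the indicator fix in isolation (same names as the snippet above; the numeric value 4 for isc_info_data_not_ready is as quoted in this issue):

data = self.response.read_sized_string(encoding=self.encoding,
                                       errors=self.encoding_errors)
tag = self.response.get_tag()
# Both tags mean "session alive, nothing buffered right now" -- not EOF.
if tag in (SrvInfoCode.TIMEOUT, 4):  # 4 == isc_info_data_not_ready
    return TIMEOUT
return data if data else None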

Proposed fix

Adding a new public method is the least disruptive shape — existing readline_timed callers (notably Server.readline, stdin handling, and any third-party code relying on per-line semantics) keep their current behavior.

def readlines_to_eof_timed(self, timeout: int) -> str | Sentinel | None:
    """Drain available trace/services text from the server in one
    round-trip. Returns the (possibly multi-line) chunk, `TIMEOUT` when
    the server-side timeout elapsed with no payload, or `None` when the
    service has reported `isc_info_end` (end of stream).

    Distinct from `readline_timed` in that:
      - the request item is `isc_info_svc_to_eof`, not `isc_info_svc_line`;
        the server returns up to MAXBUF bytes per call instead of one line.
      - the trailing indicator is inspected: `isc_info_svc_timeout` and
        `isc_info_data_not_ready` both return `TIMEOUT` (idle, keep going);
        `isc_info_truncated` is silently absorbed (more data coming);
        only `isc_info_end` (no trailing indicator) returns `None`.

    Designed for trace-session consumers that need fbtracemgr-equivalent
    throughput; falls back to `readline_timed` for one-line semantics.
    """
    self.response.clear()
    self._svc.query(self._make_request(timeout),
                    bytes([SrvInfoCode.TO_EOF]),
                    self.response.raw)
    if (tag := self.response.get_tag()) != SrvInfoCode.TO_EOF:
        raise InterfaceError(f"Service responded with error code: {tag}")
    data = self.response.read_sized_string(encoding=self.encoding,
                                           errors=self.encoding_errors)
    trailing = self.response.get_tag()
    if trailing == ISC_INFO_END:  # value 1
        return data if data else None
    if data:
        return data
    return TIMEOUT

Server.readline could optionally call this internally and split the chunk into per-line yields, preserving its existing signature while picking up the throughput improvement automatically; the per-line buffering pattern would resemble the line-buffer in the existing Server.readline flow.
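
A hedged sketch of that wrapper (the buffering logic here is illustrative, not the driver's existing readline implementation; TIMEOUT is the driver sentinel already used by readline_timed, assumed in scope):

def iter_trace_lines(conn, timeout: int = 1):
    """Yield complete trace lines, draining in TO_EOF-sized chunks."""
    buf = ""
    while True:
        chunk = conn.readlines_to_eof_timed(timeout)
        if chunk is None:      # isc_info_end: service stream finished
            break
        if chunk is TIMEOUT:   # idle; session still alive, poll again
            continue
        buf += chunk
        *complete, buf = buf.split("\n")
        for line in complete:
            yield line
    if buf:                    # flush an unterminated final fragment
        yield buf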

Tests should cover at least:

  • a busy mock services connection where _svc.query returns multi-line responses with the various trailing indicators
  • the isc_info_end termination case (final round with no payload, no trailing indicator)
  • the isc_info_data_not_ready idle case (does NOT return None; see the sketch below)
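
For the last case, a hypothetical test shape: FakeResponse/FakeServer are stand-ins for the driver internals, readlines_to_eof_timed is the proposed method above called as a plain function on the stub, and the import paths are assumptions that may differ by driver version.

import types
from firebird.driver.types import SrvInfoCode  # assumed import path
from firebird.driver import TIMEOUT            # assumed import path

DATA_NOT_READY = 4  # isc_info_data_not_ready

class FakeResponse:
    """Replays a canned tag sequence and payload."""
    def __init__(self, tags, payload=""):
        self._tags, self._payload = iter(tags), payload
        self.raw = bytearray()
    def clear(self): pass
    def get_tag(self): return next(self._tags)
    def read_sized_string(self, **kwargs): return self._payload

class FakeServer:
    encoding, encoding_errors = "utf-8", "strict"
    def __init__(self, tags, payload=""):
        self.response = FakeResponse(tags, payload)
        self._svc = types.SimpleNamespace(query=lambda *args: None)
    def _make_request(self, timeout): return b""

def test_data_not_ready_is_idle_not_eof():
    # Empty payload + data_not_ready trailing tag must map to TIMEOUT, not None.
    srv = FakeServer([SrvInfoCode.TO_EOF, DATA_NOT_READY], payload="")
    assert readlines_to_eof_timed(srv, 1) is TIMEOUT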

I'm happy to send a PR with the implementation and tests if the maintainers prefer that route.

Context

I hit this while diagnosing low event-capture rates in a long-running Python trace consumer on a busy multi-host deployment. The five-experiment baseline above pinned the root cause to the request-mode choice. As a workaround I call _svc.query directly with the TO_EOF request item (private-API access in user code) and the throughput gap closes — the same Python process now captures the full trace stream identically to fbtracemgr. I would much rather depend on a public driver method than maintain the workaround.

Happy to provide more diagnostics, sample captures, or a draft PR — whichever shape the maintainers find easiest to move forward.

Thanks for the work on this library.
