[perf] Make PrettyPrinter format lazily so output can be budget-capped#14588
Draft
Pierre-Sassoulas wants to merge 3 commits into
35ed5b2 to
d4b901c
Compare
…apped
``_format`` and the per-type helpers now ``yield`` their output as a
stream of string chunks instead of writing to a file-like object, and
``pformat`` joins them. On top of that, ``pformat_lines`` pulls from the
formatter only until a budget is reached::

    pformat_lines(obj, max_lines=None, max_chars=None)

It stops on the first chunk that reaches *either* budget, so a huge
collection costs roughly O(budget) rather than O(N); the chunk that
crosses the budget is still produced whole. Either dimension may be
``None`` (unbounded); with both ``None`` the whole object is formatted.
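The budget loop described above can be sketched as follows. This is a minimal illustration and not the code in this PR: ``iter_chunks`` is a hypothetical stand-in for the lazy ``_format`` generator (it only handles lists), ``pformat_lines_sketch`` is a hypothetical name, and returning a list of lines is an assumption.

```python
def iter_chunks(obj, indent=0):
    """Hypothetical lazy formatter: yields string chunks, one item per line."""
    if isinstance(obj, list):
        yield "["
        for i, item in enumerate(obj):
            if i:
                # Separator chunk: this is where newlines come from.
                yield ",\n" + " " * (indent + 1)
            yield from iter_chunks(item, indent + 1)
        yield "]"
    else:
        yield repr(obj)


def pformat_lines_sketch(obj, max_lines=None, max_chars=None):
    """Pull chunks until either budget is reached (None means unbounded)."""
    parts = []
    lines = 1  # the first line exists before any newline is seen
    chars = 0
    for chunk in iter_chunks(obj):
        parts.append(chunk)
        chars += len(chunk)
        lines += chunk.count("\n")
        if max_lines is not None and lines >= max_lines:
            break  # stop pulling: the rest of obj is never formatted
        if max_chars is not None and chars >= max_chars:
            break
    return "".join(parts).splitlines()


head = pformat_lines_sketch(list(range(500_000)), max_lines=11)
print(len(head), head[0])
```

Because the generator is only advanced inside the loop, breaking out of it leaves the remaining elements unformatted, which is the whole source of the saving.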
Motivation
----------
Assertion diffs are truncated to a handful of lines/chars before being
shown. Formatting the whole of a large ``==`` comparison and then
throwing almost all of it away is pure waste. With a lazy formatter the
truncating caller simply stops pulling once it has enough.
Benchmark (``PrettyPrinter`` alone, width 80)::

    list(range(500_000)):
        pformat().splitlines()           ~805 ms
        pformat_lines(max_lines=11)      ~0.027 ms   (~30000x)
    [8 small ints] (common small diff):
        pformat().splitlines()           ~0.0133 ms
        pformat_lines(max_lines=11)      ~0.0185 ms  (+~5 us)
    ["x"*100_000] * 3 (flat, few huge elements):
        pformat_lines(max_chars=640)     stops after ~100_000 chars
                                         (one element) instead of 300_000

Why a lazy generator rather than a fast path + budget stream
------------------------------------------------------------
An earlier approach kept a cheap ``pformat().splitlines()`` fast path
guarded by ``len(obj) <= max_lines`` plus a flatness check, falling back
to a write-intercepting budget-stream class for the rest. Two problems:
* ``len(obj)`` is only a *lower* bound on the line count: one nested
  element (``[{...50 keys...}]``) expands to many lines, so the guard
  needed the flatness scan to stay correct, and even then it bounded
  only *lines*, never *chars*: a flat container of a few enormous
  strings has almost no lines but blows the char budget.
* it was two code paths plus a stream class plus an exception used for
  control flow.
Because the formatter is lazy, "stop pulling at the budget" is the whole
optimisation: correct regardless of how lines/chars are distributed
across elements, bounding both dimensions, with no ``len()`` proxy to
get wrong and no fast/slow branch. The common small-diff case costs only
~5 us more than the unbounded path (it is never the bottleneck — a
failing assertion isn't hot), while large comparisons drop by orders of
magnitude.
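The two ways the ``len(obj)`` guard goes wrong can be reproduced with the standard library ``pprint`` (used here only as a stand-in for the vendored printer, on the assumption that it lays these objects out the same way):

```python
import pprint

# One element, but it expands to many lines: len() underestimates lines.
nested = [{f"key_{i}": i for i in range(50)}]
nested_lines = pprint.pformat(nested, width=80).splitlines()
print(len(nested), "element ->", len(nested_lines), "lines")

# Flat and short by line count, but far over any sensible char budget.
flat = ["x" * 100_000] * 3
flat_text = pprint.pformat(flat, width=80)
flat_lines = flat_text.splitlines()
print(len(flat), "elements ->", len(flat_lines), "lines,", len(flat_text), "chars")
```

Neither case is caught by comparing ``len(obj)`` with ``max_lines``, whereas a budget applied to the emitted chunks sees both.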
``_pprint_set``/``_pprint_dict`` also try a plain ``sorted`` first and
fall back to the ``_safe_key`` wrapper only for unorderable mixes.
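The "plain ``sorted`` first" pattern might look like the sketch below. ``SafeKey`` and ``sorted_for_display`` are hypothetical names written for this illustration; ``SafeKey`` is modelled on the idea behind ``_safe_key`` and is not a copy of it.

```python
class SafeKey:
    """Hypothetical wrapper that makes unorderable values sortable."""

    __slots__ = ("obj",)

    def __init__(self, obj):
        self.obj = obj

    def __lt__(self, other):
        try:
            return self.obj < other.obj
        except TypeError:
            # Fall back to an arbitrary but consistent order by type name.
            return (str(type(self.obj)), id(self.obj)) < (
                str(type(other.obj)),
                id(other.obj),
            )


def sorted_for_display(items):
    """Try the cheap path first; wrap keys only when comparison fails."""
    try:
        return sorted(items)
    except TypeError:
        return sorted(items, key=SafeKey)


print(sorted_for_display({3, 1, 2}))
print(sorted_for_display([2, "a", 1]))
```

Homogeneous collections, which are the common case, never pay for the wrapper objects; only mixes that raise ``TypeError`` take the slower path.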
This diverges structurally from the upstream CPython ``pprint`` it was
vendored from; the module header notes it is no longer kept in sync.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
d4b901c to
133da41
Compare
In ``pformat_lines``'s budget loop, ``chunk.count("\n")`` ran on every
chunk, but most chunks (brackets, indentation, item reprs) contain no
newline. Guarding the call with ``"\n" in chunk`` skips it on those and
recovers part of the per-chunk budget-tracking overhead: formatting an
8-element list under a budget drops from ~0.0185 ms to ~0.0163 ms
(versus ~0.0132 ms for an uncapped ``pformat().splitlines()``, so the
budget overhead roughly halves, from ~+5 us to ~+3 us).
The win is small and only matters on the ``-v`` truncating path of a
failing assertion (the default path doesn't format the diff at all), so
this is kept as a separate commit — easy to drop if the extra branch
isn't judged worth it.
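The guard amounts to a membership test in front of the count. A minimal sketch, with ``count_newlines_guarded`` as a hypothetical helper name rather than anything in the PR:

```python
def count_newlines_guarded(chunks):
    """Count newlines across chunks, skipping count() when none are present."""
    total = 0
    for chunk in chunks:
        # Most chunks (brackets, indentation, item reprs) fail this test,
        # so the count() call is skipped for them.
        if "\n" in chunk:
            total += chunk.count("\n")
    return total


print(count_newlines_guarded(["[", "1", ",\n ", "2", "]"]))
```

The result is identical with or without the guard; only the number of ``count`` calls changes.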
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
133da41 to
f4bd109
Compare
Refactor required prior to #14523.