Skip to content

Conversation

@akx
Copy link
Member

@akx akx commented Mar 17, 2025

Inspired by #1199.

This PR improves parsing performance for a large PO file (https://projects.blender.org/blender/blender-manual-translations/raw/branch/main/fi/LC_MESSAGES/blender_manual.po) by approximately 1.4x on my machine.

On a corpus of 11,600 .po files (namely all that I had hanging around on my machine, heh) the speed-up is 1.30x:

$ hyperfine 'git checkout read-po-opt-nr && python readpobench_many.py' 'git checkout master && python readpobench_many.py'
Benchmark 1: git checkout read-po-opt-nr && python readpobench_many.py
  Time (mean ± σ):      8.176 s ±  0.075 s    [User: 7.683 s, System: 0.478 s]
  Range (min … max):    8.052 s …  8.311 s    10 runs

Benchmark 2: git checkout master && python readpobench_many.py
  Time (mean ± σ):     10.638 s ±  0.122 s    [User: 10.153 s, System: 0.469 s]
  Range (min … max):   10.491 s … 10.880 s    10 runs

Summary
  git checkout read-po-opt-nr && python readpobench_many.py ran
    1.30 ± 0.02 times faster than git checkout master && python readpobench_many.py

The only "breaking" change here is the one implemented by f988d79 – you can no longer pass a mixed iterable of byteses or strs into read_po. Typing-wise, you never could (AnyStr is either bytes or str, but not an union of them), but now it'll actually break.

Internally, _NormalizedString lost some of its lustre, but it's fine, since it is an internal helper no one should ever see.

@codecov
Copy link

codecov bot commented Mar 17, 2025

Codecov Report

Attention: Patch coverage is 98.73418% with 1 line in your changes missing coverage. Please review.

Project coverage is 91.71%. Comparing base (3ce1e61) to head (977b258).
Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
babel/messages/pofile.py 98.11% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1200      +/-   ##
==========================================
+ Coverage   91.69%   91.71%   +0.02%     
==========================================
  Files          27       27              
  Lines        4696     4685      -11     
==========================================
- Hits         4306     4297       -9     
+ Misses        390      388       -2     
Flag Coverage Δ
macos-14-3.10 90.75% <98.73%> (+0.02%) ⬆️
macos-14-3.11 90.69% <98.73%> (+0.02%) ⬆️
macos-14-3.12 90.90% <98.73%> (+0.02%) ⬆️
macos-14-3.13 90.90% <98.73%> (+0.02%) ⬆️
macos-14-3.8 90.62% <98.73%> (+0.02%) ⬆️
macos-14-3.9 90.68% <98.73%> (+0.02%) ⬆️
macos-14-pypy3.10 90.75% <98.73%> (+0.02%) ⬆️
ubuntu-24.04-3.10 90.77% <98.73%> (+0.02%) ⬆️
ubuntu-24.04-3.11 90.71% <98.73%> (+0.02%) ⬆️
ubuntu-24.04-3.12 90.92% <98.73%> (+0.02%) ⬆️
ubuntu-24.04-3.13 90.92% <98.73%> (+0.02%) ⬆️
ubuntu-24.04-3.8 90.64% <98.73%> (+0.02%) ⬆️
ubuntu-24.04-3.9 90.70% <98.73%> (+0.02%) ⬆️
ubuntu-24.04-pypy3.10 90.56% <98.73%> (+0.02%) ⬆️
windows-2022-3.10 90.76% <98.73%> (+0.02%) ⬆️
windows-2022-3.11 90.70% <98.73%> (+0.02%) ⬆️
windows-2022-3.12 90.91% <98.73%> (+0.02%) ⬆️
windows-2022-3.13 90.91% <98.73%> (+0.02%) ⬆️
windows-2022-3.8 90.74% <98.73%> (+0.02%) ⬆️
windows-2022-3.9 90.69% <98.73%> (+0.02%) ⬆️
windows-2022-pypy3.10 90.76% <98.73%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@akx akx requested a review from tomasr8 March 17, 2025 13:30
@tomasr8
Copy link
Member

tomasr8 commented Mar 17, 2025

🎊 1200 🎊

@akx akx marked this pull request as ready for review March 17, 2025 14:41
@akx
Copy link
Member Author

akx commented Mar 18, 2025

@AA-Turner Thanks for the suggestions! I measured things and turns out Explicit Is Better Than Implicit – I just made them regular old for loops...

@akx akx force-pushed the read-po-opt-nr branch from 61de603 to 3a77f9d Compare March 18, 2025 07:17
yield from iterable
return
seen = set()
for item in iter(iterable):
Copy link
Contributor

@AA-Turner AA-Turner Mar 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The itertools recipe suggests

Suggested change
for item in iter(iterable):
for item in filterfalse(seen.__contains__, iterable):

Another alternative would be:

yield from dict.fromkeys(iterable)

The language now guarantees that dictionaries preserve order, but this approach is non-iterable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dict.fromkeys(iterable) is clever! It's 5% faster than this in a microbenchmark.

What do you mean with "but this approach is non-iterable"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in, next(distinct(lst)) will only process one item currently, but the fromkeys approach eagerly de-duplicates the entire list. I don't know enough about babel's use to know if the lazy iteration is important here, or just a side effect of the original design.

A

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we don't expect there to be very many items, and all of the internal uses are list(distinct(...)) or list(distinct(... + ...)) – TBH, I think we could micro-optimize that a bit too with an internal-use-only _distinct_of(*iterables) -> list function...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See 72e25cf :)

@akx akx force-pushed the read-po-opt-nr branch from 3a77f9d to 72e25cf Compare March 20, 2025 07:10
@akx akx force-pushed the read-po-opt-nr branch from 72e25cf to 977b258 Compare March 21, 2025 06:24
@akx akx merged commit d7a7589 into master Apr 5, 2025
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants