Optimizations for read_po #1200
Codecov Report
Attention: Patch coverage is

@@ Coverage Diff @@
##           master    #1200      +/-   ##
==========================================
+ Coverage   91.69%   91.71%   +0.02%
==========================================
  Files          27       27
  Lines        4696     4685      -11
==========================================
- Hits         4306     4297       -9
+ Misses        390      388       -2
@AA-Turner Thanks for the suggestions! I measured things and turns out Explicit Is Better Than Implicit – I just made them regular old for loops...
        yield from iterable
        return
    seen = set()
    for item in iter(iterable):
The itertools recipe suggests:

-    for item in iter(iterable):
+    for item in filterfalse(seen.__contains__, iterable):
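For context, a sketch of what the suggested recipe-style `distinct` would look like as a whole (the function name matches the one under review; the surrounding module details are assumed):

```python
from itertools import filterfalse

def distinct(iterable):
    """Yield each item at most once, preserving first-seen order."""
    seen = set()
    # filterfalse skips items already in `seen`; because we add to `seen`
    # as we go, later duplicates are filtered out lazily
    for item in filterfalse(seen.__contains__, iterable):
        seen.add(item)
        yield item

print(list(distinct("abracadabra")))  # -> ['a', 'b', 'r', 'c', 'd']
```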
Another alternative would be:
yield from dict.fromkeys(iterable)
The language now guarantees that dictionaries preserve order, but this approach is non-iterable.
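The `dict.fromkeys` alternative, keeping the same generator interface, would be roughly:

```python
def distinct(iterable):
    # dict keys preserve insertion order (guaranteed since Python 3.7),
    # so fromkeys de-duplicates while keeping first-seen order
    yield from dict.fromkeys(iterable)
```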
dict.fromkeys(iterable) is clever! It's 5% faster than this in a microbenchmark.
What do you mean with "but this approach is non-iterable"?
As in, next(distinct(lst)) will only process one item currently, but the fromkeys approach eagerly de-duplicates the entire list. I don't know enough about babel's use to know if the lazy iteration is important here, or just a side effect of the original design.
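The laziness difference can be demonstrated with a small tracing generator (a sketch; the function names are illustrative):

```python
def lazy_distinct(iterable):
    # loop-based generator: pulls items from the source one at a time
    seen = set()
    for item in iterable:
        if item not in seen:
            seen.add(item)
            yield item

def eager_distinct(iterable):
    # dict.fromkeys consumes the entire input before yielding anything
    yield from dict.fromkeys(iterable)

def trace(items, log):
    # record each item as it is pulled from the source
    for x in items:
        log.append(x)
        yield x

log = []
next(lazy_distinct(trace([1, 2, 3], log)))
assert log == [1]          # only the first item was processed

log = []
next(eager_distinct(trace([1, 2, 3], log)))
assert log == [1, 2, 3]    # the whole source was drained up front
```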
Yeah, we don't expect there to be very many items, and all of the internal uses are list(distinct(...)) or list(distinct(... + ...)) – TBH, I think we could micro-optimize that a bit too with an internal-use-only _distinct_of(*iterables) -> list function...
See 72e25cf :)
Inspired by #1199.
This PR improves parsing performance for a large PO file (https://projects.blender.org/blender/blender-manual-translations/raw/branch/main/fi/LC_MESSAGES/blender_manual.po) by approximately 1.4x on my machine.
On a corpus of 11,600 .po files (namely all that I had hanging around on my machine, heh) the speed-up is 1.30x.
The only "breaking" change here is the one implemented by f988d79 – you can no longer pass a mixed iterable of bytes and str objects into read_po. Typing-wise, you never could (AnyStr is either bytes or str, but not a union of them), but now it'll actually break. Internally, _NormalizedString lost some of its lustre, but that's fine, since it's an internal helper no one should ever see.