Optimizations for read_po #1200
Codecov Report
Attention: Patch coverage is

@@ Coverage Diff @@
##           master    #1200      +/-   ##
==========================================
+ Coverage   91.69%   91.71%   +0.02%
==========================================
  Files          27       27
  Lines        4696     4685      -11
==========================================
- Hits         4306     4297       -9
+ Misses        390      388       -2
@AA-Turner Thanks for the suggestions! I measured things and turns out Explicit Is Better Than Implicit – I just made them regular old for loops...
        yield from iterable
        return
    seen = set()
    for item in iter(iterable):
The itertools recipe suggests:

-    for item in iter(iterable):
+    for item in filterfalse(seen.__contains__, iterable):
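For context, a sketch of what the suggested recipe-style `distinct` would look like as a whole (the function name matches the one under review; the surrounding module details are assumed):

```python
from itertools import filterfalse

def distinct(iterable):
    """Yield each item at most once, preserving first-seen order."""
    seen = set()
    # filterfalse skips items already in `seen`; because we add to `seen`
    # as we go, later duplicates are filtered out lazily
    for item in filterfalse(seen.__contains__, iterable):
        seen.add(item)
        yield item

print(list(distinct("abracadabra")))  # -> ['a', 'b', 'r', 'c', 'd']
```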
Another alternative would be:
yield from dict.fromkeys(iterable)
The language now guarantees that dictionaries preserve order, but this approach is non-iterable.
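The `dict.fromkeys` alternative, keeping the same generator interface, would be roughly:

```python
def distinct(iterable):
    # dict keys preserve insertion order (guaranteed since Python 3.7),
    # so fromkeys de-duplicates while keeping first-seen order
    yield from dict.fromkeys(iterable)
```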
dict.fromkeys(iterable) is clever! It's 5% faster than this in a microbenchmark.
What do you mean with "but this approach is non-iterable"?
As in, next(distinct(lst)) will only process one item currently, but the fromkeys approach eagerly de-duplicates the entire list. I don't know enough about babel's use to know if the lazy iteration is important here, or just a side effect of the original design.
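The laziness difference can be demonstrated with a small tracing generator (a sketch; the function names are illustrative):

```python
def lazy_distinct(iterable):
    # loop-based generator: pulls items from the source one at a time
    seen = set()
    for item in iterable:
        if item not in seen:
            seen.add(item)
            yield item

def eager_distinct(iterable):
    # dict.fromkeys consumes the entire input before yielding anything
    yield from dict.fromkeys(iterable)

def trace(items, log):
    # record each item as it is pulled from the source
    for x in items:
        log.append(x)
        yield x

log = []
next(lazy_distinct(trace([1, 2, 3], log)))
assert log == [1]          # only the first item was processed

log = []
next(eager_distinct(trace([1, 2, 3], log)))
assert log == [1, 2, 3]    # the whole source was drained up front
```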
Yeah, we don't expect there to be very many items, and all of the internal uses are list(distinct(...)) or list(distinct(... + ...)) – TBH, I think we could micro-optimize that a bit too with an internal-use-only _distinct_of(*iterables) -> list function...
See 72e25cf :)
Inspired by #1199.
This PR improves parsing performance for a large PO file (https://projects.blender.org/blender/blender-manual-translations/raw/branch/main/fi/LC_MESSAGES/blender_manual.po) by approximately 1.4x on my machine.
On a corpus of 11,600 .po files (namely all that I had hanging around on my machine, heh) the speed-up is 1.30x.
The only "breaking" change here is the one implemented by f988d79 – you can no longer pass a mixed iterable of bytes and str objects into read_po. Typing-wise, you never could (AnyStr is either bytes or str, but not a union of them), but now it'll actually break. Internally, _NormalizedString lost some of its lustre, but that's fine, since it's an internal helper no one should ever see.