
perf: use unordered_set for Name sets for better compile speed#8586

Open
Changqing-JING wants to merge 2 commits into WebAssembly:main from Changqing-JING:opt/compile-speed

Conversation

@Changqing-JING
Contributor

@Changqing-JING Changqing-JING commented Apr 9, 2026

wasm::Name has an O(1) pointer-based hash but operator< does O(n) memcmp, making std::set<Name> unnecessarily slow. On large workloads, ~35% of wasm-opt CPU time was spent in __memcmp_evex_movbe inside EffectAnalyzer::walk called from CodeFolding. Switching the four std::set<Name> fields in EffectAnalyzer, NameSet in branch-utils.h, and the local containers in CodeFolding to their unordered equivalents eliminates the bottleneck.

@Changqing-JING Changqing-JING marked this pull request as draft April 9, 2026 05:32
@Changqing-JING Changqing-JING requested a review from a team as a code owner April 9, 2026 05:32
@Changqing-JING Changqing-JING requested review from tlively and removed request for a team April 9, 2026 05:32
@Changqing-JING Changqing-JING marked this pull request as ready for review April 9, 2026 09:38
@Changqing-JING Changqing-JING marked this pull request as draft April 9, 2026 10:17
@Changqing-JING Changqing-JING marked this pull request as ready for review April 9, 2026 10:34
@kripken
Member

kripken commented Apr 9, 2026

@Changqing-JING what workloads did you test on?

I ran a test with a large Java testcase. Instruction counts, branches, and walltime were within noise.

If you are seeing 35% on this code, perhaps there is something special in your testcase? In general, the number of globals and break targets is very small, so a normal set can do well (by saving the time it takes to do hashing).

@Changqing-JING
Contributor Author

Changqing-JING commented Apr 10, 2026

@kripken Thank you for the review.

  1. Yes, it reproduces when a br_table has a large number of targets. The background is that AssemblyScript's GC uses a br_table over type_rtid to dispatch the GC visitor. In a large app with a large number of types, CodeFolding becomes very slow.
    It can be reproduced with this testcase:
    https://github.com/Changqing-JING/assemblyscript/blob/slow-compile/test.ts
time node ./bin/asc.js -O2 -o build/test.wasm ./test.ts 

real    6m27.324s
user    6m19.690s
sys     0m11.712s

To make the problem easier to understand, I created an example that emulates the AssemblyScript case:
https://github.com/Changqing-JING/BinaryenLearn/blob/binaryen-slow-pass/flamegraph.sh

  2. I can share the flamegraph:
    (flamegraph image)

  3. The flamegraph shows that even though set and map save the hashing time, operator< of wasm::Name costs more. The time a sorted container saves on hashing mainly benefits numeric keys such as wasm::Index, where comparing keys is cheap; with string keys that is not the case, especially when the strings are long and share a common prefix.

  4. Based on 3, the problem is hard to reproduce with wasm-opt alone, because wasm-opt uses auto-indexed label names like $block1. But when Binaryen is used as a library by a frontend compiler, names can be long, like $__inlined_func$~lib/rt/itcms/Object#unlink$81, and then the string comparison costs much more time.

