
20260218-aes-prefetch-optimize #9797

Open

douzzer wants to merge 2 commits into wolfSSL:master from douzzer:20260218-aes-prefetch-optimize

Conversation

@douzzer
Contributor

@douzzer douzzer commented Feb 18, 2026

wolfcrypt/src/aes.c: add static volatile word32 x_volatile to PreFetchTe() and PreFetchSBox(), to move volatile accesses out of the loop iteration, reclaiming some (but not all) of the performance regression from 299e7bd.

wolfssl/wolfcrypt/wc_port.h, wolfssl/wolfcrypt/sha256.h, wolfssl/wolfcrypt/sha512.h, wolfssl/wolfcrypt/sp.h, wolfssl/wolfcrypt/wc_mlkem.h: add WC_NO_INLINE.

tested with

wolfssl-multi-test.sh ...
check-source-text
quantum-safe-wolfssl-all-noasm-clang-tidy
benchmark-wolfcrypt-noasm-all
benchmark-wolfcrypt-noasm-all-gcc-latest
benchmark-wolfcrypt-noasm-all-clang

AES-GCM performance from nightly benchmark-wolfcrypt-noasm-all on gcc-16.0.1_p20260208-r1, before bikeshedding:

      AES-128-GCM-enc -- best trial 81.14% of baseline, Z+304.85
      AES-128-GCM-dec -- best trial 81.18% of baseline, Z+32.84
      AES-192-GCM-enc -- best trial 82.30% of baseline, Z+334.24
      AES-192-GCM-dec -- best trial 82.37% of baseline, Z+682.31
      AES-256-GCM-enc -- best trial 83.29% of baseline, Z+175.97
      AES-256-GCM-dec -- best trial 83.30% of baseline, Z+160.43

after bikeshedding:

      AES-128-GCM-enc -- best trial 85.81% of baseline, Z+216.84
      AES-128-GCM-dec -- best trial 85.84% of baseline, Z+23.37
      AES-192-GCM-enc -- best trial 86.25% of baseline, Z+247.85
      AES-192-GCM-dec -- best trial 86.23% of baseline, Z+508.96
      AES-256-GCM-enc -- best trial 87.41% of baseline, Z+126.30
      AES-256-GCM-dec -- best trial 87.39% of baseline, Z+115.42

Same for benchmark-wolfcrypt-noasm-all-clang on LLVM 23.0.0_pre20260214, for CBC, ECB, and XTS (other modes are fast enough after bikeshedding to pass the tests):

      AES-128-CBC-enc -- best trial 83.37% of baseline, Z+98.99
      AES-128-CBC-dec -- best trial 74.45% of baseline, Z+68.34
      AES-192-CBC-enc -- best trial 85.98% of baseline, Z+111.46
      AES-192-CBC-dec -- best trial 76.38% of baseline, Z+49.86
      AES-256-CBC-enc -- best trial 86.46% of baseline, Z+96.90
      AES-256-CBC-dec -- best trial 79.28% of baseline, Z+40.64
      AES-128-ECB-enc -- best trial 74.44% of baseline, Z+72.71
      AES-128-ECB-dec -- best trial 75.82% of baseline, Z+111.80
      AES-192-ECB-enc -- best trial 77.76% of baseline, Z+199.32
      AES-192-ECB-dec -- best trial 77.56% of baseline, Z+495.70
      AES-256-ECB-enc -- best trial 79.74% of baseline, Z+36.27
      AES-256-ECB-dec -- best trial 80.17% of baseline, Z+49.80
      AES-XTS-enc -- best trial 79.64% of baseline, Z+206.27
      AES-XTS-dec -- best trial 81.42% of baseline, Z+112.83

after (unshown modes were fast enough to pass):

      AES-128-CBC-dec -- best trial 73.91% of baseline, Z+70.30
      AES-192-CBC-dec -- best trial 76.07% of baseline, Z+50.74
      AES-256-CBC-dec -- best trial 79.06% of baseline, Z+41.18
      AES-128-ECB-dec -- best trial 76.04% of baseline, Z+110.47
      AES-192-ECB-dec -- best trial 77.58% of baseline, Z+495.24
      AES-256-ECB-dec -- best trial 80.19% of baseline, Z+49.74
      AES-XTS-dec -- best trial 81.25% of baseline, Z+114.08

@DrKatieL

Here's an example of the compiler behavior I'm seeing, along with what I'm reading, to let you decide whether this is an appropriate fix. I'll talk specifically about AesDecrypt_C's usage of PreFetchTd(), since that was the code I was analyzing when I found the bug; it should hold more generally for this issue.


What I'm reading:
First fix (`volatile word32 x = 0;`) - This tells the compiler that x may be read at any time and must always be correctly updated. It forces the fetches to happen in order, since each `x & Td[...]` must be individually computed and stored.

Second fix (current commit, volatile updated at end of loop) - This tells the compiler:
1. x might not be 0 at the beginning of the loop. This forces the result to be computed (in whatever way the compiler thinks is best), assuming the result is used.
2. Updating the volatile at the end of the loop forces the compiler to keep the result, as it might be used.


What I'm seeing:
I've added Ghidra screenshots below, using the -O3 flag on gcc version 13.3.0 aarch64-linux-gnu. (Note this is actually a wolfHSM build, if you're planning on duplicating the exact flags.)

Notice that before either fix, the entirety of the PreFetchTd() logic is gone. With either fix you see a loop that fetches Td at the beginning of the computation, though the two fixes implement the loop slightly differently. I'm not familiar enough with Arm instruction timings to comment on the performance differences between the approaches.


What decision should be made:
The first fix produces exactly the behavior described in the C code. The second fix's behavior will be compiler dependent, but will still probably force the Td array into cache by the time the main loop hits. I could see a compiler precomputing Td[0] & Td[1] & ... for the whole array and storing the result somewhere, since the & operation is associative, so this approach may not work everywhere. Assuming the second fix behaves nicely: I'm less familiar with timing attacks against software AES than against other algorithms (it's uncommon to both care about SCA and have to do AES in software), and the second strategy will produce timing results that need to be inspected at the assembly level to be fully understood. Are you primarily concerned with fetching at least once at some point before the main computation starts, or do you need control over the individual fetches?


Before any fix:
[Ghidra screenshot: beforefix]

After first fix:
[Ghidra screenshot: firstfix]

After second fix (note: I didn't pull this commit; I manually added the same volatile strategy this commit uses to PreFetchTd() so I could directly compare assembly):
[Ghidra screenshot: secondfix]
