
gh-144981: Make PyUnstable_Code_SetExtra/GetExtra thread-safe #144980

Open
yoney wants to merge 3 commits into python:main from yoney:extra

Conversation


@yoney yoney commented Feb 18, 2026

  • PyUnstable_Code_GetExtra and PyUnstable_Code_SetExtra were not thread-safe; concurrent calls could cause data races.
  • GetExtra is performance-sensitive for use cases such as JITs, so it uses a lock-free acquire load of co_extra and acquire loads of the individual slots (see the read-path sketch below).
  • SetExtra uses a per-object critical section. When the slot already exists, it updates in place with a release store. When the buffer needs to grow, the free-threaded build allocates a new buffer, copies the existing slots, publishes the new buffer with a release store, and reclaims the old one via _PyMem_FreeDelayed (QSBR). The GIL build simply uses realloc.
  • PyUnstable_Eval_RequestCodeExtraIndex is now protected by interp->code_state.mutex.
  • Added free-threading tests for concurrent RequestCodeExtraIndex, SetExtra, and GetExtra.

The new tests exercise the concurrent paths; on a free-threaded TSAN build without the fix, they crash or emit warnings.
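
For reference, a minimal sketch of the lock-free read path described above, assuming CPython's internal _PyCodeObjectExtra layout and the FT_ATOMIC_* wrappers (which reduce to plain loads on the GIL build). This is illustrative only, not the actual diff:

/* Illustrative sketch of the lock-free read path; struct and macro names
 * follow CPython internals (_PyCodeObjectExtra, FT_ATOMIC_LOAD_PTR_ACQUIRE)
 * but this is not the exact change in this PR. */
int
PyUnstable_Code_GetExtra(PyObject *code, Py_ssize_t index, void **extra)
{
    if (!PyCode_Check(code)) {
        PyErr_BadInternalCall();
        return -1;
    }
    PyCodeObject *o = (PyCodeObject *)code;

    /* Acquire load of the buffer pointer: pairs with the release store that
     * publishes a grown buffer in SetExtra, so the copied slots are visible
     * before we index into them. */
    _PyCodeObjectExtra *co_extra = FT_ATOMIC_LOAD_PTR_ACQUIRE(o->co_extra);

    if (co_extra == NULL || index < 0 || index >= co_extra->ce_size) {
        *extra = NULL;
        return 0;
    }

    /* Acquire load of the slot itself: pairs with the release store done by
     * an in-place SetExtra update of an existing slot. */
    *extra = FT_ATOMIC_LOAD_PTR_ACQUIRE(co_extra->ce_extras[index]);
    return 0;
}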

cc: @DinoV @colesbury

@yoney yoney changed the title Make PyUnstable_Code_SetExtra/GetExtra thread-safe gh-144981: Make PyUnstable_Code_SetExtra/GetExtra thread-safe Feb 18, 2026
@yoney yoney marked this pull request as ready for review February 18, 2026 23:22
@yoney yoney requested a review from markshannon as a code owner February 18, 2026 23:22

bedevere-app bot commented Feb 18, 2026

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.


yoney commented Feb 19, 2026

I benchmarked both the base and new implementations on GIL builds. Since SetExtra regressed significantly (from ~6 ns to ~104 ns), I refactored the implementation; I will post updated numbers after this comment.

Note: Benchmarks were run on an x86_64 (Skylake) system.

Base (GIL)

---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_CodeExtra_RequestIndex/iterations:250       40.9 ns         38.0 ns          250 items_per_second=26.3047M/s
BM_CodeExtra_SetExtra                          5.57 ns         5.56 ns    125068314 items_per_second=180.002M/s
BM_CodeExtra_GetExtra                          3.12 ns         3.12 ns    224233759 items_per_second=320.758M/s
BM_CodeExtra_SetGet                            7.98 ns         7.96 ns     87762271 items_per_second=125.585M/s

New Implementation (GIL)

---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_CodeExtra_RequestIndex/iterations:250       38.9 ns         36.4 ns          250 items_per_second=27.5028M/s
BM_CodeExtra_SetExtra                           104 ns          103 ns      5854292 items_per_second=9.66832M/s
BM_CodeExtra_GetExtra                          2.77 ns         2.77 ns    252889160 items_per_second=361.305M/s
BM_CodeExtra_SetGet                             105 ns          105 ns      6025338 items_per_second=9.54896M/s

@colesbury
Contributor

How are these used in practice? Can we leave synchronization of PyUnstable_Code_GetExtra()/PyUnstable_Code_SetExtra() calls on the same index up to the caller? Then we would only have to avoid races between calls on different indexes.

For example, something like:

  • PyUnstable_Eval_RequestCodeExtraIndex: stop-the-world on resize
  • PyUnstable_Code_GetExtra: no synchronization
  • PyUnstable_Code_SetExtra: no synchronization

Would that work?

@colesbury
Contributor

Oh, the stop the world on resize would be in PyUnstable_Code_SetExtra, not PyUnstable_Eval_RequestCodeExtraIndex.


yoney commented Feb 19, 2026

Benchmarks after the refactoring:

  • For GIL builds, use realloc.
  • For FT builds, allocate a new buffer and swap it in only when the buffer needs to grow (sketched below).
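
A rough sketch of the grow path under the per-object critical section, again assuming the internal _PyCodeObjectExtra layout, the FT_ATOMIC_* wrappers, and QSBR-deferred freeing; names and signatures are approximate (e.g. the exact _PyMem_FreeDelayed signature differs across versions), not the actual diff:

/* Illustrative only: grow co_extra to hold at least index + 1 slots.
 * Assumed to be called with the code object's critical section held. */
static int
grow_co_extra(PyCodeObject *o, Py_ssize_t index)
{
    _PyCodeObjectExtra *old = o->co_extra;
    Py_ssize_t old_size = (old != NULL) ? old->ce_size : 0;
    size_t nbytes = sizeof(_PyCodeObjectExtra) + (size_t)index * sizeof(void *);

#ifdef Py_GIL_DISABLED
    /* Free-threaded build: readers load co_extra without a lock, so the old
     * buffer must not be reallocated in place.  Allocate a new buffer, copy
     * the live slots, publish it with a release store, and defer freeing the
     * old buffer until no reader can still be using it (QSBR). */
    _PyCodeObjectExtra *new_extra = PyMem_Malloc(nbytes);
    if (new_extra == NULL) {
        PyErr_NoMemory();
        return -1;
    }
    for (Py_ssize_t i = 0; i < old_size; i++) {
        new_extra->ce_extras[i] = old->ce_extras[i];
    }
    for (Py_ssize_t i = old_size; i <= index; i++) {
        new_extra->ce_extras[i] = NULL;
    }
    new_extra->ce_size = index + 1;
    FT_ATOMIC_STORE_PTR_RELEASE(o->co_extra, new_extra);
    if (old != NULL) {
        _PyMem_FreeDelayed(old);  /* QSBR-deferred free; signature approximate */
    }
#else
    /* GIL build: no lock-free concurrent readers, so realloc in place. */
    _PyCodeObjectExtra *new_extra = PyMem_Realloc(old, nbytes);
    if (new_extra == NULL) {
        PyErr_NoMemory();
        return -1;
    }
    for (Py_ssize_t i = old_size; i <= index; i++) {
        new_extra->ce_extras[i] = NULL;
    }
    new_extra->ce_size = index + 1;
    o->co_extra = new_extra;
#endif
    return 0;
}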

GetExtra performance looks as good as the baseline.
SetExtra has additional overhead on FT builds, increasing from ~6 ns to ~21 ns (as expected).

Note: Benchmarks were run on an x86_64 (Skylake) system.

Base (GIL)

---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_CodeExtra_RequestIndex/iterations:250       40.9 ns         38.0 ns          250 items_per_second=26.3047M/s
BM_CodeExtra_SetExtra                          5.57 ns         5.56 ns    125068314 items_per_second=180.002M/s
BM_CodeExtra_GetExtra                          3.12 ns         3.12 ns    224233759 items_per_second=320.758M/s
BM_CodeExtra_SetGet                            7.98 ns         7.96 ns     87762271 items_per_second=125.585M/s

New Implementation (GIL)

---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_CodeExtra_RequestIndex/iterations:250       39.1 ns         35.6 ns          250 items_per_second=28.1215M/s
BM_CodeExtra_SetExtra                          6.08 ns         6.06 ns    114801349 items_per_second=165.076M/s
BM_CodeExtra_GetExtra                          2.78 ns         2.77 ns    252335083 items_per_second=360.554M/s
BM_CodeExtra_SetGet                            8.33 ns         8.31 ns     84310711 items_per_second=120.317M/s

New Implementation (FT)

---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_CodeExtra_RequestIndex/iterations:250       56.0 ns         52.5 ns          250 items_per_second=19.0346M/s
BM_CodeExtra_SetExtra                          21.6 ns         21.6 ns     32413782 items_per_second=46.3671M/s
BM_CodeExtra_GetExtra                          2.78 ns         2.78 ns    251634827 items_per_second=360.15M/s
BM_CodeExtra_SetGet                            22.9 ns         22.8 ns     24306594 items_per_second=43.7748M/s


yoney commented Feb 19, 2026

How are these used in practice?

For Cinderx, this is used to count the number of calls before making a JIT decision.

We implemented a workaround here: facebookincubator/cinderx@26f3e2b
The counting logic is here: facebookincubator/cinderx@8593c48
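
For readers unfamiliar with the API, a minimal call-counting consumer might look like the sketch below. The names (init_call_counting, bump_call_count) are hypothetical, not the cinderx implementation, and on free-threaded builds the increment itself would also need to be atomic:

#include <Python.h>

static Py_ssize_t call_count_index = -1;

/* Registered freefunc: releases the counter when its code object dies or
 * the slot is overwritten. */
static void
free_call_counter(void *counter)
{
    PyMem_RawFree(counter);
}

/* Run once at startup: reserve a co_extra slot index for this consumer. */
static int
init_call_counting(void)
{
    call_count_index = PyUnstable_Eval_RequestCodeExtraIndex(free_call_counter);
    return (call_count_index < 0) ? -1 : 0;
}

/* Called when a code object is entered; returns the updated count, which a
 * JIT could compare against its compilation threshold. */
static size_t
bump_call_count(PyCodeObject *code)
{
    void *extra = NULL;
    if (PyUnstable_Code_GetExtra((PyObject *)code, call_count_index, &extra) < 0) {
        return 0;  /* exception set by GetExtra */
    }
    size_t *counter = (size_t *)extra;
    if (counter == NULL) {
        counter = PyMem_RawCalloc(1, sizeof(size_t));
        if (counter == NULL) {
            return 0;
        }
        if (PyUnstable_Code_SetExtra((PyObject *)code, call_count_index, counter) < 0) {
            PyMem_RawFree(counter);
            return 0;
        }
    }
    return ++*counter;
}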


yoney commented Feb 19, 2026

I have made the requested changes; please review again.


bedevere-app bot commented Feb 19, 2026

Thanks for making the requested changes!

@DinoV: please review the changes made to this pull request.

@bedevere-app bedevere-app bot requested a review from DinoV February 19, 2026 18:27

yoney commented Feb 19, 2026

aarch64 benchmark run:

GetExtra performance looks as good as the baseline.
SetExtra has additional overhead on FT builds, increasing from ~3 ns to ~15 ns (as expected).

Base (GIL)

---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_CodeExtra_RequestIndex/iterations:250       19.7 ns         16.8 ns          250 items_per_second=59.6374M/s
BM_CodeExtra_SetExtra                          2.36 ns         2.36 ns    304021112 items_per_second=424.059M/s
BM_CodeExtra_GetExtra                          1.54 ns         1.53 ns    456420073 items_per_second=652.039M/s
BM_CodeExtra_SetGet                            3.34 ns         3.33 ns    223818527 items_per_second=300.594M/s

New Implementation (GIL)

---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_CodeExtra_RequestIndex/iterations:250       19.2 ns         15.9 ns          250 items_per_second=63.004M/s
BM_CodeExtra_SetExtra                          2.42 ns         2.40 ns    296821748 items_per_second=416.132M/s
BM_CodeExtra_GetExtra                          1.55 ns         1.53 ns    458869555 items_per_second=652.618M/s
BM_CodeExtra_SetGet                            3.38 ns         3.34 ns    220168435 items_per_second=299.219M/s

New Implementation (FT)

---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_CodeExtra_RequestIndex/iterations:250       32.0 ns         30.1 ns          250 items_per_second=33.2447M/s
BM_CodeExtra_SetExtra                          15.8 ns         15.8 ns     44310672 items_per_second=63.4835M/s
BM_CodeExtra_GetExtra                          1.55 ns         1.53 ns    458680585 items_per_second=653.316M/s
BM_CodeExtra_SetGet                            19.8 ns         19.7 ns     35630001 items_per_second=50.8699M/s


DinoV commented Feb 19, 2026

Oh, the stop the world on resize would be in PyUnstable_Code_SetExtra, not PyUnstable_Eval_RequestCodeExtraIndex.

I think stop-the-world on growth would probably be too expensive. If it happened on every initial allocation it would certainly be way too expensive, and even on an actual resize it could be really painful when someone adds a new index: there would be a burst of stop-the-world events as all the code objects get updated (consumers usually attach to a lot of objects).

But I do think we could make callers perform whatever synchronization they need to protect their own indexes; e.g. they might need to construct a (cheap and empty) value to be filled in, tolerate throwing one of those away, and then do a get to resolve the race. But I also think the minimal synchronization in the latest version is fine.
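
A rough illustration of that caller-side pattern (a hypothetical helper, not something this PR adds), assuming the per-index atomicity of the latest version so that concurrent SetExtra calls are serialized and a replaced value is released through the index's registered freefunc:

/* Hypothetical helper: lazily install a cheap, empty counter and resolve a
 * concurrent publish by re-reading the slot. */
static void *
get_or_create_extra(PyObject *code, Py_ssize_t index)
{
    void *value = NULL;
    if (PyUnstable_Code_GetExtra(code, index, &value) < 0) {
        return NULL;  /* exception set */
    }
    if (value != NULL) {
        return value;  /* already installed */
    }
    void *placeholder = PyMem_RawCalloc(1, sizeof(size_t));  /* cheap, empty value */
    if (placeholder == NULL) {
        return NULL;
    }
    if (PyUnstable_Code_SetExtra(code, index, placeholder) < 0) {
        PyMem_RawFree(placeholder);
        return NULL;
    }
    /* If another thread raced us, one of the two placeholders has already
     * been released by the freefunc when it was replaced; re-read to learn
     * which one ended up installed and use that. */
    if (PyUnstable_Code_GetExtra(code, index, &value) < 0) {
        return NULL;
    }
    return value;
}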

@DinoV DinoV left a comment

LGTM!
