Fix race condition with py::make_key_iterator in free threading#5971
Fix race condition with py::make_key_iterator in free threading#5971rwgk merged 13 commits intopybind:masterfrom
Conversation
The creation of the iterator class needs to be synchronized.
include/pybind11/detail/internals.h
Outdated
| struct internals { | ||
| #ifdef Py_GIL_DISABLED | ||
| pymutex mutex; | ||
| pyrecursive_mutex mutex; |
There was a problem hiding this comment.
@henryiii @rwgk - I could use some advice here. I probably should have used some sort of recursive mutex here from the start -- it's pretty difficult to do the locking for make_iterator_impl without it.
I think changing this would require bumping PYBIND11_INTERNALS_VERSION, at least for Py_GIL_DISABLED builds. Is that acceptable?
Alternatively, maybe we should use something like Py_BEGIN_CRITICAL_SECTION_MUTEX, which supports recursion and eliminates some potential lock ordering deadlocks if you call into Python. The downside is that it is 3.14+ only.
There was a problem hiding this comment.
I think changing this would require bumping
PYBIND11_INTERNALS_VERSION, at least forPy_GIL_DISABLEDbuilds. Is that acceptable?
After the v3.0.2 release, yes. We already have three other PRs that need an internals version bump.
I was planning to release today (PR #5969), but there was a show stopper. We're waiting for a fix for the segfault.
My thinking: the race isn't new, therefore it would seem reasonable to defer fixing it until after the v3.0.2 release, when we have a window of opportunity to bump the internals version.
Caveat: I don't have enough background to decide between the internals version bump and the critical section alternative.
There was a problem hiding this comment.
Py_BEGIN_CRITICAL_SECTION_MUTEX sounds fine, I'd be okay to deprecate and remove Python 3.13t support. Maybe we could just fix this bug on 3.14t, and then drop 3.13t in 3.1?
We are thinking about dropping 3.13t in cibuildwheel too.
There was a problem hiding this comment.
Also fine to make the lock recursive in 3.1, we have several ABI bumps coming up.
There was a problem hiding this comment.
Great, I think dropping 3.13t makes sense. I switched to PyCriticalSection_BeginMutex.
I saw there's also a py::scoped_critical_section. Not sure if you'd want to combine them.
There was a problem hiding this comment.
Could we have a guard with #if defined(Py_GIL_DISABLED) && defined(PY_VERSION_HEX) && PY_VERSION_HEX >= ... here? It is causing a compilation error with Python 3.13t. In addition, the function PyCriticalSection_BeginMutex was added in Python 3.14.0rc1 and Python 3.15.0a1.
There was a problem hiding this comment.
It is causing a compilation error with Python 3.13t.
Ah, looks like we were too efficient dropping 3.13t. We should keep it working at the level before this PR for the v3.0.2 release, then fully drop 3.13t support after the release. @XuehaiPan could you help starting a PR for that? Just the code changes, to be sure it works for you. I can help adding back one 3.13t job.
|
Thanks for working on this so quickly! I can try it with the full library that I discovered the issue with once it is in a ready-to-review state 😃 |
|
@ikrommyd - if you can run test with the full library, that would be great. |
|
@rwgk what should we do with 3.13t here? Okay to drop it in a patch release, or should we gate this for 3.14t+ only (fine to leave the bug unfixed on 3.13t), and drop in the next minor release instead? CPython 3.13t was always experimental, while 3.14t is not. |
Move vote: yes, drop I don't think 3.13t is worth the trouble doing anything special. |
| void unlock() { PyMutex_Unlock(&mutex); } | ||
| }; | ||
|
|
||
| class pycritical_section { |
There was a problem hiding this comment.
Do we need to be careful about copy/move here?
... give me a moment to look at this with the help of some LLM ...
There was a problem hiding this comment.
Result: be01d1e
(created with Cursor Composer 1 model)
The pycritical_section class is a RAII wrapper that manages a Python critical section lifecycle: - Acquires the critical section in the constructor via PyCriticalSection_BeginMutex - Releases it in the destructor via PyCriticalSection_End - Holds a reference to a pymutex Allowing copy or move operations would be dangerous: 1. Copy: Both the original and copied objects would call PyCriticalSection_End on the same PyCriticalSection object in their destructors, leading to double-unlock and undefined behavior. 2. Move: The moved-from object's destructor would still run and attempt to end the critical section, while the moved-to object would also try to end it, again causing double-unlock. This follows the same pattern used by other RAII lock guards in the codebase, such as gil_scoped_acquire and gil_scoped_release, which also explicitly delete copy/move operations to prevent similar issues. By explicitly deleting these operations, we prevent accidental misuse and ensure the critical section is properly managed by a single RAII object throughout its lifetime.
Python 3.13t was experimental, while Python 3.14t is not. This PR uses PyCriticalSection_BeginMutex which is only available in Python 3.14+, making Python 3.13t incompatible with the changes. Removed all Python 3.13t CI jobs: - ubuntu-latest, 3.13t (standard-large matrix) - macos-15-intel, 3.13t (standard-large matrix) - windows-latest, 3.13t (standard-large matrix) - manylinux job testing 3.13t This aligns with the decision to drop Python 3.13t support as discussed in PR pybind#5971.
After removing Python 3.13t support (incompatible with PyCriticalSection_BeginMutex which requires Python 3.14+), we're adding replacement jobs using Python 3.13 (default) to maintain test coverage in key dimensions: 1. ubuntu-latest, Python 3.13: C++20 + DISABLE_HANDLE_TYPE_NAME_DEFAULT_IMPLEMENTATION - Replaces: ubuntu-latest, 3.13t with same config - Maintains coverage for this specific configuration combination 2. macos-15-intel, Python 3.13: C++11 - Replaces: macos-15-intel, 3.13t with same config - Maintains macOS coverage for Python 3.13 3. manylinux (musllinux), Python 3.13: GIL testing - Replaces: manylinux, 3.13t job - Maintains manylinux/musllinux container testing coverage These additions are proposed to get feedback on which jobs should be kept to maintain appropriate test coverage without the experimental 3.13t builds.
While I was at it, I went ahead and added two commits: I'm not sure about the second commit: @henryiii which of these jobs would you add? |
|
This job failed: https://github.com/pybind/pybind11/actions/runs/21319724363/job/61367809291 I triggered a rerun to see if it's a flake. |
It was a flake, but maybe that's a problem for another day? Or do you suspect that the changes in this PR could play a role? |
|
I'm not sure what caused that timeout, but I don't think it's related to the PR. Mixing subinterpreters and free threading isn't very well tested currently. |
|
The failure below is reproducible (not a flake) and almost certainly related to commit 91189c9 here. For comparison, running the CI for current master comes back all green. I also verified that #5972 does not fix it (@b-pass @XuehaiPan for visibility), but there is a second failure of the same kind:
I'll undo 91189c9 in hopes that it'll allow us to get this PR merged. CI / Manylinux on 🐍 3.14t (pull_request)Failing after 6m |
This reverts commit 91189c9. Reason: pybind#5971 (comment)
| }; | ||
|
|
||
| class pycritical_section { | ||
| pymutex &mutex; |
There was a problem hiding this comment.
Okay, interesting to me the RAII doesn't own the mutex and the ctor takes a mutable reference.
There was a problem hiding this comment.
This looks correct to me, based on this Cursor-generated analysis:
This is the standard lock-guard pattern: the RAII object owns the critical-section lifetime, not the mutex. It takes a
non-const ref because locking mutates the mutex (and PyCriticalSection_BeginMutex needs a mutable `PyMutex). The mutex itself is owned by the internals and outlives the guard:
pybind11/include/pybind11/detail/internals.h
Lines 230 to 239 in 0080cae
This reverts commit f3197de. After pybind#5972 is/was merged, tests should pass (already tested under pybind#5980). See also pybind#5972 (comment)
…afety This exact state was already tested under pybind#5980 CI - https://github.com/pybind/pybind11/actions/runs/21575190538?pr=5980 CIBW - https://github.com/pybind/pybind11/actions/runs/21575190471?pr=5980
|
Not rerunning the CI, because the exact same state was tested under #5980 already. |


Description
The creation of the iterator class needs to be synchronized.
Fixes #5970
Suggested changelog entry:
py::make_key_iteratorwith free threaded Python