Security and robustness fixes from NIF audit#67
Merged
Conversation
All enif_binary_to_term calls now pass ERL_NIF_BIN2TERM_SAFE so attacker-influenced data (notably a Python "__etf__:<base64>" callback result) cannot mint new non-GC'd atoms and exhaust the atom table. Local pids/refs and existing atoms still round-trip; only brand-new atoms, remote-node terms, and external funs are rejected. Adds py_reentrant_SUITE:test_etf_decode_safe covering both the non-breaking round-trip and the rejected-novel-atom case.
Cap nesting depth in py_to_term/term_to_py so a deeply nested term or Python structure returns an error instead of overflowing the C stack and crashing the node. Bound the dict->map loop against mid-iteration mutation (use the actual filled count). NULL-check the argument PyTuple_New across the call/eval/resume paths (py_nif, py_event_loop, py_callback, py_exec); most sites were already guarded. Adds py_SUITE:test_conversion_depth_guard (1M-deep term -> clean error, node stays up).
py_buffer_write relocated and freed buf->data on growth even when a Python memoryview (PyBuffer_getbuffer) held a raw pointer into it, dangling the view (use-after-free, whole-node crash). Refuse to grow while view_count > 0; the write returns an error instead. Adds py_buffer_SUITE:buffer_grow_pinned_test (pinned grow -> error, release -> write succeeds).
Keep the worker resource alive while a callback is suspended (it could be GC'd mid-suspension -> use-after-free on resume); release it in the suspended-state destructor and NULL-check before the replay deref. Free any prior result_data before a resume stores a new one (result_data, not the toggling has_result flag, is the real pending-result indicator, so a duplicate/raced resume no longer leaks). Clear the pending-callback TLS at the worker request boundary so a reused worker thread doesn't trip the stale-TLS invariant. Run the callback-response pipe writes on dirty IO schedulers with non-blocking, deadline-bounded writes so a stalled reader or large payload can't wedge a scheduler or desync the framed protocol. Also Py_XDECREF callback_args in the GIL-held destructor branch. Adds py_reentrant_SUITE:test_reentrant_resume_stress.
A per-request dict allocation failure in a subinterpreter worker no longer breaks (and permanently kills) the worker command loop while holding the GIL; it now sends an error response and keeps serving. The owngil_* dispatch NIFs run on dirty IO schedulers and use the non-blocking, deadline-bounded read_with_timeout/write_all_with_deadline helpers (cmd_pipe write end set O_NONBLOCK), so a stalled or dead worker can't wedge a scheduler forever or desync the framed protocol. SuspensionRequired is now resolved from the current interpreter's erlang module (like ProcessError), avoiding cross-interpreter PyObject use under OWN_GIL. Adds py_owngil_features_SUITE:owngil_reentrant_multi_stress_test.
The asyncio reader/writer integration handed Python a raw fd_resource pointer cast to an integer, then cast it back and dereferenced it in remove/update/clear/release. A stale, duplicate, or fabricated id was a double-free or arbitrary-pointer deref that crashed the node. Hand out opaque ids tracked in a validating registry instead: producers register and return an id; removing consumers take (and release) the entry so a duplicate id is a no-op; non-removing consumers look up with a temporary keep. An unknown id is a safe no-op or a clean ValueError. fd_read/fd_write also moved to ERL_NIF_DIRTY_JOB_IO_BOUND. Adds py_fd_ops_SUITE:fd_registry_invalid_id_test.
py_state gains an optional max_state_entries cap (default infinity, so existing behavior is unchanged) enforced with atomic admission (insert_new + update_counter, ets:take on remove) so Python-driven state_set can't exhaust node memory. A reserved sentinel row holds the count and is protected from caller corruption and hidden from keys/fetch. The py:stream and setup_logging helpers that build Python source now validate module/function/kwarg names as identifiers (rejecting injection at positions where quoting is meaningless) and escape string-literal values, including newlines and other control bytes. Adds py_state_SUITE (cap accounting + sentinel) and py_stream_SUITE:test_stream_rejects_injection.
ensure_venv and dependency installation built shell command strings and
ran them through os:cmd, with quote/1 single-quote wrapping that did not
escape embedded quotes -- a venv path/requirements/extras value could
break out and inject. Run the executables via
open_port({spawn_executable, Exe}, [{args, Args}, ...]) with literal
argument lists instead. For uv, VIRTUAL_ENV is passed via the port
{env, ...} option rather than a shell prefix.
Adds py_venv_SUITE:test_venv_path_metacharacters.
Guard make_py_error against a NULL message/type when a Python exception's text isn't UTF-8-encodable (and against a NULL exception type). Reject embedded NULs in binary_to_string so a module/func/attr/code name can't be silently truncated at the NUL. Release the leaked split() method in ReactorBuffer_split. Drop a stray debug fprintf on the normal worker send path. Document that the worker-pool init/shutdown is gen_server serialized (no lock needed). Adds py_SUITE:test_embedded_nul_name_rejected.
owngil_reentrant_multi_stress_test (4 contexts x 20 iters x 6 deep) timed out on slow CI runners. Reduce to 2 contexts x 5 iters x depth 4 with a 120s budget; it still drives concurrent reentrant suspend/resume across distinct subinterpreters (the existing concurrent/nested reentrant tests cover the same paths).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes the NIF/security audit findings, one commit per area. Each adds a regression test.
binary_to_termdecodes withERL_NIF_BIN2TERM_SAFE(atom-exhaustion).PyTuple_New/PyList_NewNULL checks + dict-mutation bound.py_bufferwon't relocate while a Pythonmemoryviewpins it (UAF).owngil_*dispatch is dirty with bounded I/O;SuspensionRequiredresolved per-interpreter.fd_read/fd_writemoved to dirty schedulers.py_stateoptionalmax_state_entriescap (defaultinfinity) with atomic admission;py:stream/logging builders validate identifiers and escape literals.spawn_executablewith argv lists, notos:cmd.make_py_errorNULL guards, embedded-NUL rejection, a leakedsplitmethod, a stray debugfprintf.Notes: the
py_statecap is opt-in (defaultinfinity); Phase 4 frees the prior result on resume rather than guarding the togglinghas_result; the fd registry is global rather than per-loop.