Security and robustness fixes from NIF audit by benoitc · Pull Request #67 · benoitc/erlang-python

benoitc · 2026-05-30T09:31:15Z

Fixes the NIF/security audit findings, one commit per area. Each adds a regression test.

binary_to_term decodes with ERL_NIF_BIN2TERM_SAFE (atom-exhaustion).
Bounded conversion recursion + PyTuple_New/PyList_New NULL checks + dict-mutation bound.
Zero-copy py_buffer won't relocate while a Python memoryview pins it (UAF).
Callback suspend/resume lifetime: keep the worker, free prior result on resume, clear pending-callback TLS, dirty + bounded callback-pipe writes.
OWN_GIL worker: a per-request failure no longer kills the worker loop; owngil_* dispatch is dirty with bounded I/O; SuspensionRequired resolved per-interpreter.
Event-loop fd handles are opaque ids validated against a registry; fd_read/fd_write moved to dirty schedulers.
py_state optional max_state_entries cap (default infinity) with atomic admission; py:stream/logging builders validate identifiers and escape literals.
venv/installer commands run via spawn_executable with argv lists, not os:cmd.
Low-severity batch: make_py_error NULL guards, embedded-NUL rejection, a leaked split method, a stray debug fprintf.

Notes: the py_state cap is opt-in (default infinity); Phase 4 frees the prior result on resume rather than guarding the toggling has_result; the fd registry is global rather than per-loop.

All enif_binary_to_term calls now pass ERL_NIF_BIN2TERM_SAFE so attacker-influenced data (notably a Python "__etf__:<base64>" callback result) cannot mint new non-GC'd atoms and exhaust the atom table. Local pids/refs and existing atoms still round-trip; only brand-new atoms, remote-node terms, and external funs are rejected. Adds py_reentrant_SUITE:test_etf_decode_safe covering both the non-breaking round-trip and the rejected-novel-atom case.

Cap nesting depth in py_to_term/term_to_py so a deeply nested term or Python structure returns an error instead of overflowing the C stack and crashing the node. Bound the dict->map loop against mid-iteration mutation (use the actual filled count). NULL-check the argument PyTuple_New across the call/eval/resume paths (py_nif, py_event_loop, py_callback, py_exec); most sites were already guarded. Adds py_SUITE:test_conversion_depth_guard (1M-deep term -> clean error, node stays up).

py_buffer_write relocated and freed buf->data on growth even when a Python memoryview (PyBuffer_getbuffer) held a raw pointer into it, dangling the view (use-after-free, whole-node crash). Refuse to grow while view_count > 0; the write returns an error instead. Adds py_buffer_SUITE:buffer_grow_pinned_test (pinned grow -> error, release -> write succeeds).

Keep the worker resource alive while a callback is suspended (it could be GC'd mid-suspension -> use-after-free on resume); release it in the suspended-state destructor and NULL-check before the replay deref. Free any prior result_data before a resume stores a new one (result_data, not the toggling has_result flag, is the real pending-result indicator, so a duplicate/raced resume no longer leaks). Clear the pending-callback TLS at the worker request boundary so a reused worker thread doesn't trip the stale-TLS invariant. Run the callback-response pipe writes on dirty IO schedulers with non-blocking, deadline-bounded writes so a stalled reader or large payload can't wedge a scheduler or desync the framed protocol. Also Py_XDECREF callback_args in the GIL-held destructor branch. Adds py_reentrant_SUITE:test_reentrant_resume_stress.

A per-request dict allocation failure in a subinterpreter worker no longer breaks (and permanently kills) the worker command loop while holding the GIL; it now sends an error response and keeps serving. The owngil_* dispatch NIFs run on dirty IO schedulers and use the non-blocking, deadline-bounded read_with_timeout/write_all_with_deadline helpers (cmd_pipe write end set O_NONBLOCK), so a stalled or dead worker can't wedge a scheduler forever or desync the framed protocol. SuspensionRequired is now resolved from the current interpreter's erlang module (like ProcessError), avoiding cross-interpreter PyObject use under OWN_GIL. Adds py_owngil_features_SUITE:owngil_reentrant_multi_stress_test.

The asyncio reader/writer integration handed Python a raw fd_resource pointer cast to an integer, then cast it back and dereferenced it in remove/update/clear/release. A stale, duplicate, or fabricated id was a double-free or arbitrary-pointer deref that crashed the node. Hand out opaque ids tracked in a validating registry instead: producers register and return an id; removing consumers take (and release) the entry so a duplicate id is a no-op; non-removing consumers look up with a temporary keep. An unknown id is a safe no-op or a clean ValueError. fd_read/fd_write also moved to ERL_NIF_DIRTY_JOB_IO_BOUND. Adds py_fd_ops_SUITE:fd_registry_invalid_id_test.

py_state gains an optional max_state_entries cap (default infinity, so existing behavior is unchanged) enforced with atomic admission (insert_new + update_counter, ets:take on remove) so Python-driven state_set can't exhaust node memory. A reserved sentinel row holds the count and is protected from caller corruption and hidden from keys/fetch. The py:stream and setup_logging helpers that build Python source now validate module/function/kwarg names as identifiers (rejecting injection at positions where quoting is meaningless) and escape string-literal values, including newlines and other control bytes. Adds py_state_SUITE (cap accounting + sentinel) and py_stream_SUITE:test_stream_rejects_injection.

ensure_venv and dependency installation built shell command strings and ran them through os:cmd, with quote/1 single-quote wrapping that did not escape embedded quotes -- a venv path/requirements/extras value could break out and inject. Run the executables via open_port({spawn_executable, Exe}, [{args, Args}, ...]) with literal argument lists instead. For uv, VIRTUAL_ENV is passed via the port {env, ...} option rather than a shell prefix. Adds py_venv_SUITE:test_venv_path_metacharacters.

Guard make_py_error against a NULL message/type when a Python exception's text isn't UTF-8-encodable (and against a NULL exception type). Reject embedded NULs in binary_to_string so a module/func/attr/code name can't be silently truncated at the NUL. Release the leaked split() method in ReactorBuffer_split. Drop a stray debug fprintf on the normal worker send path. Document that the worker-pool init/shutdown is gen_server serialized (no lock needed). Adds py_SUITE:test_embedded_nul_name_rejected.

owngil_reentrant_multi_stress_test (4 contexts x 20 iters x 6 deep) timed out on slow CI runners. Reduce to 2 contexts x 5 iters x depth 4 with a 120s budget; it still drives concurrent reentrant suspend/resume across distinct subinterpreters (the existing concurrent/nested reentrant tests cover the same paths).

benoitc added 10 commits May 30, 2026 02:03

benoitc merged commit 80c2106 into main May 30, 2026
17 checks passed

benoitc deleted the security-fixes branch May 30, 2026 18:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Security and robustness fixes from NIF audit#67

Security and robustness fixes from NIF audit#67
benoitc merged 10 commits into
mainfrom
security-fixes

benoitc commented May 30, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

benoitc commented May 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

benoitc commented May 30, 2026 •

edited

Loading