Skip to content
Merged
61 changes: 61 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,67 @@

## Unreleased

### Fixed

- **NIF robustness hardening** - `make_py_error` no longer passes a NULL message/type
to `enif_make_string`/`enif_make_atom` when a Python exception's text isn't
UTF-8-encodable; `binary_to_string` rejects names/code containing an embedded NUL
(which would silently truncate a module/function/attr/code string) rather than
truncating; a leaked `split` method object in the reactor buffer is released; and a
stray debug `fprintf` on the normal worker send path is removed.

### Security

- **No shell for venv/installer commands** - `py:ensure_venv` and dependency
installation now run the executables via `open_port({spawn_executable, ...})` with an
argument list instead of building a shell string for `os:cmd`. Venv paths, requirement
files, and extras are passed literally, so shell metacharacters can't be injected. For
`uv`, `VIRTUAL_ENV` is passed via the port `{env, ...}` option rather than a shell prefix.
- **Bounded shared state + safe stream/log builders** - `py_state` gained an optional
`max_state_entries` cap (default `infinity`, unchanged behavior) enforced with atomic
admission so Python-driven `state_set` can't exhaust node memory, and its size counter
is protected from corruption. The `py:stream` and logging helpers that build Python
source now strictly validate module/function/kwarg names as identifiers (rejecting
injection at positions where quoting is meaningless) and escape string-literal values
including control characters.
- **Validated event-loop fd handles** - The asyncio reader/writer integration no longer
hands Python a raw `fd_resource` pointer as an integer key. Each handle is an opaque id
validated against a registry on every use, so a stale, duplicate, or fabricated id is a
safe no-op (or clean error) instead of a double-free or arbitrary-pointer dereference
that crashed the node. `fd_read`/`fd_write` also moved to dirty IO schedulers.
- **OWN_GIL worker robustness** (Python 3.14+) - A per-request allocation failure in
a subinterpreter worker no longer `break`s (and permanently kills) the worker command
loop; it returns an error and keeps serving. The `owngil_*` dispatch NIFs now run on
dirty IO schedulers and use non-blocking, deadline-bounded pipe reads and writes, so a
stalled or dead worker can't wedge a scheduler forever. The internal `SuspensionRequired`
exception is now looked up per-interpreter (like `ProcessError`), avoiding cross-
interpreter object use under OWN_GIL.
- **Callback suspend/resume lifetime hardening** - The worker resource is now kept
alive for the lifetime of a suspended callback (it could previously be GC'd mid-
suspension, causing a use-after-free on resume). A resume frees any prior result
before storing a new one (no leak/double-replay on a duplicate resume), the
pending-callback thread-local is cleared at the worker request boundary, and the
callback-response pipe writes run on dirty schedulers with non-blocking, deadline-
bounded writes so a stalled reader or large payload can't wedge a scheduler or
desync the framed protocol.
- **Zero-copy buffer pinning** - `py_buffer` no longer relocates (and frees) its
storage while a Python `memoryview` points into it. A write that would grow the
buffer while a view is held now returns an error instead of dangling the view into
freed memory (a use-after-free that crashed the whole node).
- **Bounded recursion in type conversion** - The Erlang<->Python converters now cap
nesting depth, so a deeply nested term (or Python structure) returns a clean error
instead of overflowing the C stack and crashing the whole node.
- **NULL-checked tuple allocation** - Argument-tuple allocations in the call/eval paths
are checked before use, and the Python->Erlang map conversion is bounded against
mid-iteration dict mutation, closing two ways an allocation failure or re-entrant
`__str__` could corrupt memory.
- **Safe term decoding at the NIF boundary** - All `enif_binary_to_term` calls now
pass `ERL_NIF_BIN2TERM_SAFE`, preventing attacker-influenced data (notably a Python
`"__etf__:<base64>"` callback result) from minting new, non-GC'd atoms and exhausting
the atom table. Local-node pids/refs and already-existing atoms still round-trip
unchanged; only brand-new atoms, remote-node pids/refs, and external funs in
Python-supplied payloads are now rejected.

### Changed

- **Support Erlang/OTP 28 and 29** - Validated builds and the full Common Test
Expand Down
8 changes: 8 additions & 0 deletions c_src/py_buffer.c
Original file line number Diff line number Diff line change
Expand Up @@ -152,6 +152,14 @@ int py_buffer_write(py_buffer_resource_t *buf, const unsigned char *data, size_t
/* Check if we need to grow the buffer */
size_t required = buf->write_pos + size;
if (required > buf->capacity) {
/* A live Python memoryview (from PyBuffer_getbuffer) holds a raw pointer
* into buf->data. Relocating/freeing the buffer now would leave that
* pointer dangling -> use-after-free that crashes the whole node. Refuse
* to grow while any view is pinned; the caller gets a write error. */
if (buf->view_count > 0) {
pthread_mutex_unlock(&buf->mutex);
return -1;
}
/* Calculate new capacity */
size_t new_capacity = buf->capacity;
while (new_capacity < required) {
Expand Down
65 changes: 59 additions & 6 deletions c_src/py_callback.c
Original file line number Diff line number Diff line change
Expand Up @@ -147,6 +147,33 @@ static PyObject *get_current_process_error(void) {
return exc_class;
}

/**
* Get the SuspensionRequired exception class from the current interpreter's
* erlang module. Under OWN_GIL subinterpreters each interpreter has its own
* erlang module/class, so raising the process-global object (which belongs to
* whichever interpreter initialized last) is cross-interpreter UB. Mirrors
* get_current_process_error().
*/
static PyObject *get_current_suspension_required(void) {
PyObject *erlang_module = PyImport_ImportModule("erlang");
if (erlang_module == NULL) {
PyErr_Clear();
return SuspensionRequiredException; /* Fallback to global */
}

PyObject *exc_class = PyObject_GetAttrString(erlang_module, "SuspensionRequired");
Py_DECREF(erlang_module);

if (exc_class == NULL) {
PyErr_Clear();
return SuspensionRequiredException; /* Fallback to global */
}

/* See get_current_process_error: decref and rely on the module keeping it alive. */
Py_DECREF(exc_class);
return exc_class;
}

/* ============================================================================
* Callback Name Registry
*
Expand Down Expand Up @@ -399,6 +426,14 @@ static suspended_state_t *create_suspended_state_ex(
} else {
state->worker = source->data.existing->worker;
}
/* Keep the worker resource alive for as long as the suspended state exists.
* Without this the worker can be GC'd while a callback is suspended, and
* nif_resume_callback_dirty would dereference a freed worker (use-after-free
* with the GIL held). Mirrors the enif_keep_resource(ctx) on the context path;
* suspended_state_destructor releases it. */
if (state->worker != NULL) {
enif_keep_resource(state->worker);
}

state->callback_id = PyLong_AsUnsignedLongLong(callback_id_obj);

Expand Down Expand Up @@ -977,7 +1012,7 @@ static PyObject *decode_etf_string(const char *str, Py_ssize_t len) {

/* Decode the ETF binary to an Erlang term */
ERL_NIF_TERM term;
if (enif_binary_to_term(tmp_env, (unsigned char *)bin_data, bin_len, &term, 0) == 0) {
if (enif_binary_to_term(tmp_env, (unsigned char *)bin_data, bin_len, &term, ERL_NIF_BIN2TERM_SAFE) == 0) {
/* Decoding failed */
enif_free_env(tmp_env);
Py_DECREF(decoded);
Expand Down Expand Up @@ -2109,7 +2144,7 @@ static PyObject *erlang_call_impl(PyObject *self, PyObject *args) {
Py_XSETREF(tl_pending_args, call_args);

/* Raise exception to abort Python execution */
PyErr_SetString(SuspensionRequiredException, "callback pending");
PyErr_SetString(get_current_suspension_required(), "callback pending");
return NULL;
}

Expand Down Expand Up @@ -2859,7 +2894,7 @@ static PyObject *erlang_channel_try_receive_impl(PyObject *self, PyObject *args)
}

ERL_NIF_TERM term;
if (enif_binary_to_term(tmp_env, data, size, &term, 0) == 0) {
if (enif_binary_to_term(tmp_env, data, size, &term, ERL_NIF_BIN2TERM_SAFE) == 0) {
enif_free(data);
enif_free_env(tmp_env);
PyErr_SetString(PyExc_RuntimeError, "failed to decode term");
Expand Down Expand Up @@ -2939,7 +2974,7 @@ static PyObject *erlang_channel_receive_impl(PyObject *self, PyObject *args) {
}

ERL_NIF_TERM term;
if (enif_binary_to_term(tmp_env, data, size, &term, 0) == 0) {
if (enif_binary_to_term(tmp_env, data, size, &term, ERL_NIF_BIN2TERM_SAFE) == 0) {
enif_free(data);
enif_free_env(tmp_env);
PyErr_SetString(PyExc_RuntimeError, "failed to decode term");
Expand Down Expand Up @@ -3251,7 +3286,7 @@ static PyObject *erlang_channel_wait_impl(PyObject *self, PyObject *args) {
}

ERL_NIF_TERM term;
if (enif_binary_to_term(tmp_env, data, msg_size, &term, 0) == 0) {
if (enif_binary_to_term(tmp_env, data, msg_size, &term, ERL_NIF_BIN2TERM_SAFE) == 0) {
enif_free(data);
enif_free_env(tmp_env);
PyErr_SetString(PyExc_RuntimeError, "failed to decode term");
Expand Down Expand Up @@ -4316,7 +4351,14 @@ static ERL_NIF_TERM nif_resume_callback(ErlNifEnv *env, int argc, const ERL_NIF_
/* Store the result in the suspended state */
pthread_mutex_lock(&state->mutex);

/* Copy result data */
/* Copy result data. Free any prior result first: a duplicate/raced resume
* would otherwise leak the previous buffer. (has_result is not a one-shot
* flag -- it toggles during nested replay -- so result_data is the real
* pending-result indicator.) */
if (state->result_data != NULL) {
enif_free(state->result_data);
state->result_data = NULL;
}
state->result_data = enif_alloc(result_bin.size);
if (state->result_data == NULL) {
pthread_mutex_unlock(&state->mutex);
Expand Down Expand Up @@ -4364,6 +4406,12 @@ static ERL_NIF_TERM nif_resume_callback_dirty(ErlNifEnv *env, int argc, const ER
return make_error(env, "no_result");
}

/* The worker is kept alive for the lifetime of the suspended state, but
* guard rather than dereference NULL in the replay below. */
if (state->worker == NULL) {
return make_error(env, "no_worker");
}

/* Set up thread-local state for replay */
tl_current_worker = state->worker;
tl_callback_env = env;
Expand Down Expand Up @@ -4430,6 +4478,11 @@ static ERL_NIF_TERM nif_resume_callback_dirty(ErlNifEnv *env, int argc, const ER
}

PyObject *args = PyTuple_New(args_len);
if (args == NULL) {
Py_DECREF(func);
result = make_error(env, "alloc_failed");
goto call_cleanup;
}
ERL_NIF_TERM head, tail = state->orig_args;
for (unsigned int i = 0; i < args_len; i++) {
enif_get_list_cell(state->orig_env, tail, &head, &tail);
Expand Down
6 changes: 3 additions & 3 deletions c_src/py_channel.c
Original file line number Diff line number Diff line change
Expand Up @@ -555,7 +555,7 @@ static ERL_NIF_TERM nif_channel_receive(ErlNifEnv *env, int argc, const ERL_NIF_
if (result == 0) {
/* Data available - convert back to term */
ERL_NIF_TERM term;
if (enif_binary_to_term(env, data, size, &term, 0) == 0) {
if (enif_binary_to_term(env, data, size, &term, ERL_NIF_BIN2TERM_SAFE) == 0) {
enif_free(data);
return make_error(env, "binary_to_term_failed");
}
Expand Down Expand Up @@ -590,7 +590,7 @@ static ERL_NIF_TERM nif_channel_try_receive(ErlNifEnv *env, int argc, const ERL_
if (result == 0) {
/* Data available - convert back to term */
ERL_NIF_TERM term;
if (enif_binary_to_term(env, data, size, &term, 0) == 0) {
if (enif_binary_to_term(env, data, size, &term, ERL_NIF_BIN2TERM_SAFE) == 0) {
enif_free(data);
return make_error(env, "binary_to_term_failed");
}
Expand Down Expand Up @@ -760,7 +760,7 @@ ERL_NIF_TERM nif_channel_wait(ErlNifEnv *env, int argc, const ERL_NIF_TERM argv[

/* Convert back to term */
ERL_NIF_TERM term;
if (enif_binary_to_term(env, data, msg_size, &term, 0) == 0) {
if (enif_binary_to_term(env, data, msg_size, &term, ERL_NIF_BIN2TERM_SAFE) == 0) {
enif_free(data);
return make_error(env, "binary_to_term_failed");
}
Expand Down
Loading
Loading