
feat(python): add configurable size guardrails#3429

Open
eyad-hazem-elmorsy wants to merge 3 commits into apache:main from eyad-hazem-elmorsy:feat/add-python-size-guardrails

Conversation

@eyad-hazem-elmorsy
Contributor

This commit:

  • provides configurable size guardrails for untrusted payloads (lists, sets, tuples, maps, and strings)
  • tests size guardrails

Why?

There are currently no configurable limits for payload-driven lengths. Untrusted string/map/list lengths can trigger large allocations and memory pressure.

What does this PR do?

  • Adds guardrail fields in Python runtime configuration.
  • Enforces length limits for lists, tuples, sets, dicts(maps), and string bytes.
  • Raises an exception when a configured limit is exceeded.
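
The guardrail idea described above can be sketched in standalone Python (illustrative only; `SizeLimitExceededError` and `check_declared_length` are hypothetical names, not the actual pyfury API):

```python
class SizeLimitExceededError(Exception):
    """Raised when a payload-declared length exceeds a configured guardrail."""


def check_declared_length(declared: int, limit: int, kind: str) -> None:
    # A limit of -1 means "no limit", matching the -1 defaults in this PR.
    if limit != -1 and declared > limit:
        raise SizeLimitExceededError(
            f"{kind} length {declared} exceeds configured limit {limit}"
        )


# A payload claiming a 10_000_000-element collection against a 1_000_000 limit:
try:
    check_declared_length(10_000_000, 1_000_000, "collection")
except SizeLimitExceededError as exc:
    print(exc)
```

The key design point is that the check runs on the *declared* length, before any allocation happens, so a hostile payload cannot drive memory pressure just by claiming a huge size.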

Related issues

"policy",
"max_collection_length",
"max_map_length",
"max_string_bytes_length",
Collaborator

Could we rename max_string_bytes_length to max_binary_size, and rename the other _length options to _size for consistency?

Collaborator

And we could use max_collection_size for the map size too, to reduce config-option overhead.


cls = serializer.type_ if hasattr(serializer, "type_") else None
if cls is str:
return buffer.read_string(self.max_string_bytes_length)
Collaborator

We don't need to validate string size, because the validation is only there to prevent malicious data from faking a huge size and causing a big allocation that incurs OOM. But for strings there is no preallocation: if the underlying buffer doesn't have that many bytes, the read will just fail, so there is no OOM issue.
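
The distinction above can be illustrated with a standalone Python sketch (not the actual pyfury buffer code): reading directly from the real buffer never allocates the declared size up front, so a faked length fails fast instead of causing OOM.

```python
import io


def read_string(buf: io.BytesIO, declared_len: int) -> str:
    # Reading directly from the buffer: io.BytesIO.read(n) only returns
    # the bytes actually present, so a malicious declared_len never
    # triggers a huge up-front allocation -- we just detect truncation.
    data = buf.read(declared_len)
    if len(data) != declared_len:
        raise ValueError("truncated payload")
    return data.decode("utf-8")


buf = io.BytesIO(b"hi")
try:
    read_string(buf, 2**31)  # malicious declared length
except ValueError as exc:
    print(exc)  # fails fast; nowhere near 2**31 bytes was allocated
```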

policy: DeserializationPolicy = None,
field_nullable: bool = False,
meta_compressor=None,
max_collection_length: int = -1,
Collaborator

Set to 1_000_000

meta_compressor=None,
max_collection_length: int = -1,
max_map_length: int = -1,
max_string_bytes_length: int = -1,
Collaborator

Set to 64 * 1024 * 1024

- Remove enforcing max_string_length.
- Use max_collection_size for size limits including maps(dicts).
- Modify tests.
@eyad-hazem-elmorsy
Contributor Author

You can take a look @chaokunyang

self.c_buffer.write_bytes(&data[0], length)

cpdef inline bytes read_bytes_and_size(self):
cpdef inline bytes read_bytes_and_size(self, int32_t max_binary_size=-1):
Collaborator
@chaokunyang Feb 28, 2026

This method doesn't do preallocation, so we don't need to check the size.

Contributor Author
Since we want to enforce max_binary_size for binary byte-length reads (as mentioned in the issue), should I move the size checking into the read_bytes() method, before this line?
cdef bytes py_bytes = PyBytes_FromStringAndSize(NULL, length)

Collaborator

So, we don't want to enforce binary byte-length limits in general; we only want to avoid a malicious length introducing OOM. If there is no preallocation, there should be no such check, because if there is not enough data to read from the buffer, the buffer read API will raise an exception.

Collaborator

Please remove the max_binary_size parameter to keep the API surface minimal.

Collaborator

You can even add comments to clarify that those options are used for preallocation checks only.

Contributor Author

Looks good, but I want to ask about the line I mentioned before:
cdef bytes py_bytes = PyBytes_FromStringAndSize(NULL, length) in the buffer read_bytes function.
I searched and found that this call may result in an OOM crash or MemoryError before any checks.
So we could check bounds or ensure readability first. What do you think?
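
The bounds check being discussed can be sketched in plain Python (illustrative only; read_bytes_safe is a hypothetical analog of the Cython read_bytes, not the actual implementation): verify that the buffer really contains the declared number of bytes *before* preallocating the result.

```python
def read_bytes_safe(data: bytes, offset: int, length: int) -> bytes:
    # Check readability before any allocation: a malicious declared
    # length larger than the remaining buffer is rejected up front,
    # so the big preallocation (the OOM concern above) never happens.
    remaining = len(data) - offset
    if length < 0 or length > remaining:
        raise ValueError(
            f"declared length {length} exceeds remaining {remaining} bytes"
        )
    return data[offset:offset + length]
```

This mirrors the "check bounds or ensure readability first" suggestion: the cost of the check is a single comparison, and it runs before any memory proportional to the untrusted length is reserved.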



Development

Successfully merging this pull request may close these issues.

[Python] configurable size guardrails for untrusted payloads

2 participants