(Improvement) improve performance of Vector type parsing #689
Pull request overview
This pull request implements comprehensive performance optimizations for VectorType deserialization across multiple layers of the Python driver stack. The changes introduce optimized deserialization paths using struct.unpack for small vectors, numpy.frombuffer for large vectors (when NumPy is available), and a new Cython DesVectorType deserializer that uses low-level C operations with ntohl/ntohs intrinsics for efficient byte-swapping.
Changes:
- Added Cython-based DesVectorType deserializer with optimized paths for float, double, int32, int64, and int16 vector types
- Enhanced Python-level VectorType deserialization with struct.unpack and numpy.frombuffer optimizations
- Extended NumpyParser to create 2D arrays for vector types, enabling efficient batch processing
- Optimized low-level byte-swap operations using ntohl/ntohs intrinsics and simplified varint_unpack using int.from_bytes
- Removed slice_buffer function in favor of simpler from_ptr_and_size for direct pointer manipulation
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| tests/unit/test_types.py | Adds comprehensive tests for Cython DesVectorType deserializer covering float, double, int32, int64, and int16 vectors |
| tests/unit/test_numpy_parser.py | New test suite for NumPy parser 2D array support for vectors with mixed column types and large dimensions |
| cassandra/cqltypes.py | Python-level VectorType optimizations using struct.unpack and numpy.frombuffer, plus improved variable-size vector handling |
| cassandra/deserializers.pyx | New Cython DesVectorType class with type-specific optimized deserialization methods |
| cassandra/numpy_parser.pyx | Enhanced to create 2D NumPy arrays for VectorType columns and pre-allocate arrays list |
| cassandra/cython_marshal.pyx | Optimized unpack_num to use ntohl/ntohs intrinsics and simplified varint_unpack |
| cassandra/ioutils.pyx | Optimized read_int to use ntohl directly |
| cassandra/marshal.py | Simplified varint_unpack using int.from_bytes |
| cassandra/buffer.pxd | Replaced slice_buffer with simpler from_ptr_and_size function |
| benchmarks/vector_deserialize.py | New comprehensive benchmark suite comparing different optimization strategies |
….from_bytes
Performance improvements to serialization/deserialization hot paths:
1. unpack_num(): Use ntohs()/ntohl() for 16-bit and 32-bit integer types instead of a byte-by-byte swapping loop. These compile to single bswap instructions on x86, providing more predictable performance.
2. read_int(): Simplify to use ntohl() directly instead of going through unpack_num() with a temporary Buffer.
3. varint_unpack(): Replace hex string conversion with int.from_bytes(). This eliminates string allocations and provides a 4-18x speedup for the function itself (larger gains for longer varints).
4. Remove slice_buffer() and replace it with direct assignment.
5. _unpack_len() is now implemented similarly to read_int().
Also removes unused 'start' and 'end' variables from unpack_num().
End-to-end benchmark shows ~4-5% improvement in row throughput.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
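The varint_unpack() change in item 3 can be illustrated with a minimal sketch. It assumes the payload is a big-endian two's-complement integer; `varint_unpack_hex` is a hypothetical stand-in for the older hex-string approach the commit message describes, not the driver's original code.

```python
def varint_unpack_hex(data: bytes) -> int:
    """Hypothetical older-style decoder: build a hex string, then fix the sign."""
    value = int(data.hex(), 16)
    if data[0] & 0x80:  # high bit set means a negative two's-complement value
        value -= 1 << (len(data) * 8)
    return value


def varint_unpack(data: bytes) -> int:
    """Decode a big-endian two's-complement integer without an intermediate string."""
    return int.from_bytes(data, byteorder="big", signed=True)


# Both decoders agree on positive, negative, and multi-byte inputs.
for sample in (b"\x00", b"\x7f", b"\x80", b"\xff", b"\x01\x00", b"\xff\x7f"):
    assert varint_unpack(sample) == varint_unpack_hex(sample)
```

The int.from_bytes() version avoids allocating a hex string, which is the saving the commit message attributes the speedup to.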
Use hardware byte-swap intrinsic for float unmarshaling instead of manual 4-iteration loop, providing 4-8x speedup on little-endian systems. All tests passing (609 total) [see next commit for a fix for existing Cython related issue!] Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Refactor deserializers.pyx to use from_ptr_and_size() consistently instead of manual Buffer field assignment for better code clarity and maintainability.
Changes:
- cassandra/deserializers.pyx: Refactor 4 locations to use helper
Tests: All Cython tests compile and pass (5 tests)
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Add comprehensive benchmark comparing different deserialization strategies for VectorType with various numeric types and vector sizes.
The benchmark measures:
- Current element-by-element baseline
- struct.unpack bulk deserialization
- numpy frombuffer with tolist()
- numpy frombuffer zero-copy approach
Tested with common ML/AI embedding dimensions:
- Small vectors: 3-4 elements
- Medium vectors: 128-384 elements
- Large vectors: 768-1536 elements
Usage:
export CASS_DRIVER_NO_CYTHON=1 # Test pure Python implementation
python benchmarks/vector_deserialize.py
Includes CPU pinning for consistent measurements and result verification to ensure correctness of all optimization approaches.
Baseline Performance (per-operation deserialization time):
Vector<float, 3> : 0.88 μs
Vector<float, 4> : 0.78 μs
Vector<float, 128> : 4.72 μs
Vector<float, 384> : 15.38 μs
Vector<float, 768> : 32.43 μs
Vector<float, 1536> : 63.74 μs
Vector<double, 128> : 4.83 μs
Vector<int, 128> : 2.27 μs
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
…ct.unpack
Add bulk deserialization using struct.unpack for common numeric vector types
instead of element-by-element deserialization. This provides significant
performance improvements, especially for small vectors and integer types.
Optimized types:
- FloatType ('>Nf' format)
- DoubleType ('>Nd' format)
- Int32Type ('>Ni' format)
- LongType ('>Nq' format)
- ShortType ('>Nh' format)
Performance improvements (measured with CASS_DRIVER_NO_CYTHON=1):
Small vectors (3-4 elements):
Vector<float, 3> : 0.88 μs → 0.25 μs (3.58x faster)
Vector<float, 4> : 0.78 μs → 0.28 μs (2.79x faster)
Medium vectors (128 elements):
Vector<float, 128> : 4.72 μs → 4.06 μs (1.16x faster)
Vector<double, 128> : 4.83 μs → 4.01 μs (1.20x faster)
Vector<int, 128> : 2.27 μs → 1.25 μs (1.82x faster)
Large vectors (384-1536 elements):
Vector<float, 384> : 15.38 μs → 14.67 μs (1.05x faster)
Vector<float, 768> : 32.43 μs → 30.72 μs (1.06x faster)
Vector<float, 1536> : 63.74 μs → 63.24 μs (1.01x faster)
The optimization is most effective for:
- Small vectors (3-4 elements): 2.8-3.6x speedup
- Integer vectors: 1.8x speedup
- Medium-sized float/double vectors: about 1.2x speedup
For very large vectors (384+ elements), the benefit is minimal as the
deserialization time is dominated by data copying rather than function
call overhead.
Variable-size subtypes and other numeric types continue to use the
element-by-element fallback path.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
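The bulk approach described in this commit can be sketched as follows. This is an illustrative comparison only: the function names and the `FORMAT_CHARS` table keyed by CQL type name are hypothetical, not the driver's actual code.

```python
import struct

# Hypothetical mapping from CQL subtype name to (struct format char, byte width).
FORMAT_CHARS = {
    "float": ("f", 4),
    "double": ("d", 8),
    "int": ("i", 4),
    "bigint": ("q", 8),
    "smallint": ("h", 2),
}


def unpack_vector_elementwise(data: bytes, subtype: str, size: int) -> list:
    """Baseline: one struct call per element."""
    char, width = FORMAT_CHARS[subtype]
    return [struct.unpack_from(">" + char, data, i * width)[0] for i in range(size)]


def unpack_vector_bulk(data: bytes, subtype: str, size: int) -> list:
    """Bulk: a single struct.unpack call with a '>N<char>' format string."""
    char, _ = FORMAT_CHARS[subtype]
    return list(struct.unpack(f">{size}{char}", data))


payload = struct.pack(">3f", 1.0, 2.5, -0.5)
assert unpack_vector_bulk(payload, "float", 3) == unpack_vector_elementwise(payload, "float", 3)
```

struct.unpack raises struct.error when the buffer length does not match the format, so the bulk path also rejects payloads of the wrong size.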
For vectors with 32 or more elements, use numpy.frombuffer() which provides 1.3-1.5x speedup for large vectors (128+ elements) compared to struct.unpack.
The hybrid approach:
- Small vectors (< 32 elements): struct.unpack (2.8-3.6x faster than baseline)
- Large vectors (>= 32 elements): numpy.frombuffer().tolist() (1.3-1.5x faster than struct.unpack)
Threshold of 32 elements balances code complexity with performance gains.
Benchmark results:
- float[128]: 2.15 μs → 1.87 μs (1.15x faster)
- float[384]: 6.17 μs → 4.44 μs (1.39x faster)
- float[768]: 12.25 μs → 8.45 μs (1.45x faster)
- float[1536]: 24.44 μs → 15.77 μs (1.55x faster)
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
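A minimal sketch of the hybrid dispatch for float vectors, assuming NumPy is optional and that the code falls back to struct.unpack when it is missing. The names `NUMPY_THRESHOLD` and `deserialize_float_vector` are hypothetical; only the threshold value of 32 comes from the commit message.

```python
import struct

try:
    import numpy as np
except ImportError:  # NumPy is treated as optional in this sketch
    np = None

NUMPY_THRESHOLD = 32  # element count at which the NumPy path takes over


def deserialize_float_vector(data: bytes, size: int) -> list:
    """Decode `size` big-endian float32 values into a Python list."""
    if len(data) != size * 4:
        raise ValueError(f"expected {size * 4} bytes, got {len(data)}")
    if np is not None and size >= NUMPY_THRESHOLD:
        # '>f4' is big-endian float32; tolist() converts to Python floats.
        return np.frombuffer(data, dtype=">f4", count=size).tolist()
    return list(struct.unpack(f">{size}f", data))


small = struct.pack(">3f", 1.0, 2.0, 3.0)
large = struct.pack(">64f", *[float(i) for i in range(64)])
assert deserialize_float_vector(small, 3) == [1.0, 2.0, 3.0]
assert deserialize_float_vector(large, 64) == [float(i) for i in range(64)]
```

Both paths return the same list for the same bytes, so callers do not need to know which one ran.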
…izer
Added DesVectorType Cython deserializer with C-level optimizations for improved performance in row parsing for vectors.
The deserializer uses:
- Direct C byte swapping (ntohl, ntohs) for numeric types
- Memory operations without Python object overhead
- Unified numpy path for large vectors (≥32 elements)
- struct.unpack fallback for small vectors (<32 elements)
Performance improvements:
- Small vectors (3-4 elements): 4.4-4.7x faster
- Medium vectors (128 elements): 1.0-1.5x faster
- Large vectors (384-1536 elements): 0.9-1.0x (marginal)
The Cython deserializer is automatically used by the row parser when available via find_deserializer().
Includes unit tests and benchmark code.
Follow-up commits will try to return NumPy arrays, and perhaps more.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Extend NumpyParser to handle VectorType columns by creating 2D NumPy arrays (rows × vector_dimension) instead of object arrays. This enables zero-copy parsing for vector embeddings in ML/AI workloads.
Features:
- Detects VectorType via vector_size and subtype attributes
- Creates 2D masked arrays for numeric vector subtypes (float, double, int32, int64, int16)
- Falls back to object arrays for unsupported vector subtypes
- Handles endianness conversion for both 1D and 2D arrays
- Pre-allocates result arrays for efficiency
Supported vector types:
- Vector<float> → 2D float32 array
- Vector<double> → 2D float64 array
- Vector<int> → 2D int32 array
- Vector<bigint> → 2D int64 array
- Vector<smallint> → 2D int16 array
Adds comprehensive test coverage for all supported vector types, mixed column queries, and large vector dimensions (384-element embeddings).
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Replace POSIX-specific arpa/inet.h with conditional compilation that uses winsock2.h on Windows and arpa/inet.h on POSIX systems. This ensures the driver can be compiled on Windows without modification.
Changes:
- cassandra/cython_marshal.pyx: Add platform detection for ntohs/ntohl
- cassandra/ioutils.pyx: Add platform detection for ntohl
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Add bounds checking to prevent buffer overruns and properly
handle CQL protocol value semantics in deserializers.
Changes:
- subelem(): Add bounds validation with protocol-compliant value handling
* Happy path: Check elemlen >= 0 and offset + elemlen <= buf.size
* Support NULL values (elemlen == -1) per CQL protocol
* Support "not set" values (elemlen == -2) per CQL protocol
* Reject invalid values (elemlen < -2) with clear error message
- _unpack_len(): Add bounds check before reading int32 length field
* Validates offset + 4 <= buf.size before pointer dereference
* Prevents reading beyond buffer boundaries
- DesTupleType: Add defensive bounds checking for tuple deserialization
* Check p + 4 <= buf.size before reading item length
* Check p + itemlen <= buf.size before reading item data
* Explicit NULL value handling (itemlen < 0)
* Clear error messages for buffer overruns
- DesCompositeType: Add bounds validation for composite type elements
* Check 2 + element_length + 1 <= buf.size (length + data + EOC byte)
* Prevents buffer overrun when reading composite elements
- DesVectorType._deserialize_generic(): Add size validation
* Verify buf.size == expected_size before processing
* Provides clear error message with expected vs actual sizes
Protocol specification reference:
[value] = [int] n, followed by n bytes if n >= 0
n == -1: NULL value
n == -2: not set value
n < -2: invalid (error)
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
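The [value] semantics and bounds checks listed in this commit can be sketched in pure Python. This is an illustration under assumptions, not the Cython implementation: `read_value` and the `NOT_SET` sentinel are hypothetical names.

```python
import struct

NOT_SET = object()  # hypothetical sentinel for the "not set" marker (n == -2)


def read_value(buf: bytes, offset: int):
    """Read one [value] at `offset`; return (value, offset after the value)."""
    # Bounds check before reading the 4-byte length field.
    if offset < 0 or offset + 4 > len(buf):
        raise ValueError(f"cannot read length at offset {offset}: buffer size {len(buf)}")
    (n,) = struct.unpack_from(">i", buf, offset)
    offset += 4
    if n == -1:
        return None, offset  # NULL value
    if n == -2:
        return NOT_SET, offset  # "not set" value
    if n < -2:
        raise ValueError(f"invalid value length {n}")
    # Bounds check before slicing out the payload.
    if offset + n > len(buf):
        raise ValueError(f"value of {n} bytes overruns buffer of size {len(buf)}")
    return buf[offset:offset + n], offset + n


value, end = read_value(struct.pack(">i", 3) + b"abc", 0)
assert (value, end) == (b"abc", 7)
```

Returning the new offset lets a caller walk consecutive values in a tuple or row without recomputing positions.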
Per discussion earlier, I've changed this to optimize only for float/double/int, since those are the most frequently used and are not troublesome protocol-wise. I'll submit a fixed version of this series.
Vector type is supported on Scylla 2025.4 and above. Enable the integration tests. Tested locally against both 2025.4.2 and 2026.1 and they pass. Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Multiple partially/mostly independent commits (if needed, most can be extracted from this series) to improve the parsing of vector arrays.
Across Python, Cython, and even NumPy array creation, this series improves the deserialization of vectors and includes both tests and a benchmark.
There are some prerequisites and a bug fix (which I've extracted to its own PR), but otherwise the series is mostly complete.
In a follow-up PR or later commits, I plan to add the same or similar optimizations to the serialization of vector types.