Conversation


@mykaul mykaul commented Feb 5, 2026

Multiple mostly independent commits (if needed, most can be extracted from this series) that improve the parsing of vector arrays.
Spanning Python, Cython, and even NumPy array creation, this series includes both tests and a benchmark to improve the deserialization of vectors.

There are some prerequisites and a bug fix (which I've extracted into its own PR), but otherwise the series is mostly complete.

In a follow-up PR or in subsequent commits, I plan to apply the same or similar optimizations to serialization of vector types.

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass tests.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have provided docstrings for the public items that I want to introduce.
  • I have adjusted the documentation in ./docs/source/.
  • I added appropriate Fixes: annotations to PR description.

@mykaul mykaul requested a review from Copilot February 5, 2026 21:46
@mykaul mykaul added the enhancement New feature or request label Feb 5, 2026
@mykaul mykaul marked this pull request as draft February 5, 2026 21:52

Copilot AI left a comment

Pull request overview

This pull request implements comprehensive performance optimizations for VectorType deserialization across multiple layers of the Python driver stack. The changes introduce optimized deserialization paths using struct.unpack for small vectors, numpy.frombuffer for large vectors (when NumPy is available), and a new Cython DesVectorType deserializer that uses low-level C operations with ntohl/ntohs intrinsics for efficient byte-swapping.

Changes:

  • Added Cython-based DesVectorType deserializer with optimized paths for float, double, int32, int64, and int16 vector types
  • Enhanced Python-level VectorType deserialization with struct.unpack and numpy.frombuffer optimizations
  • Extended NumpyParser to create 2D arrays for vector types, enabling efficient batch processing
  • Optimized low-level byte-swap operations using ntohl/ntohs intrinsics and simplified varint_unpack using int.from_bytes
  • Removed slice_buffer function in favor of simpler from_ptr_and_size for direct pointer manipulation

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 12 comments.

| File | Description |
| --- | --- |
| tests/unit/test_types.py | Adds comprehensive tests for the Cython DesVectorType deserializer covering float, double, int32, int64, and int16 vectors |
| tests/unit/test_numpy_parser.py | New test suite for NumPy parser 2D array support for vectors, with mixed column types and large dimensions |
| cassandra/cqltypes.py | Python-level VectorType optimizations using struct.unpack and numpy.frombuffer, plus improved variable-size vector handling |
| cassandra/deserializers.pyx | New Cython DesVectorType class with type-specific optimized deserialization methods |
| cassandra/numpy_parser.pyx | Enhanced to create 2D NumPy arrays for VectorType columns and pre-allocate the arrays list |
| cassandra/cython_marshal.pyx | Optimized unpack_num to use ntohl/ntohs intrinsics; simplified varint_unpack |
| cassandra/ioutils.pyx | Optimized read_int to use ntohl directly |
| cassandra/marshal.py | Simplified varint_unpack using int.from_bytes |
| cassandra/buffer.pxd | Replaced slice_buffer with a simpler from_ptr_and_size function |
| benchmarks/vector_deserialize.py | New comprehensive benchmark suite comparing different optimization strategies |


….from_bytes

Performance improvements to serialization/deserialization hot paths:

1. unpack_num(): Use ntohs()/ntohl() for 16-bit and 32-bit integer types
   instead of byte-by-byte swapping loop. These compile to single bswap
   instructions on x86, providing more predictable performance.

2. read_int(): Simplify to use ntohl() directly instead of going through
   unpack_num() with a temporary Buffer.

3. varint_unpack(): Replace hex string conversion with int.from_bytes().
   This eliminates string allocations and provides 4-18x speedup for the
   function itself (larger gains for longer varints).

4. Remove slice_buffer() and replace it with direct assignment.

5. _unpack_len() is now implemented similarly to read_int().

Also removes unused 'start' and 'end' variables from unpack_num().

End-to-end benchmark shows ~4-5% improvement in row throughput.
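Item 3 can be sketched in plain Python (a minimal illustration with a hypothetical helper name, not the driver's exact code):

```python
def varint_unpack_sketch(data: bytes) -> int:
    """Decode a big-endian, two's-complement varint in one C-level call.

    int.from_bytes replaces the old bytes -> hex string -> int round trip,
    avoiding the intermediate string allocations entirely.
    """
    return int.from_bytes(data, byteorder='big', signed=True)

print(varint_unpack_sketch(b'\x7f'))      # 127
print(varint_unpack_sketch(b'\xff'))      # -1
print(varint_unpack_sketch(b'\x01\x00'))  # 256
```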

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Use hardware byte-swap intrinsic for float unmarshaling instead of manual
4-iteration loop, providing 4-8x speedup on little-endian systems.
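The difference can be illustrated in pure Python, where struct's big-endian format plays the role of the ntohl-based path (an illustrative sketch, not the Cython code itself):

```python
import struct

def manual_swap_float(data: bytes) -> float:
    # Old approach in spirit: reverse the 4 bytes one at a time, then
    # reinterpret the little-endian result.
    swapped = bytes(data[3 - i] for i in range(4))
    return struct.unpack('<f', swapped)[0]

def intrinsic_style_float(data: bytes) -> float:
    # New approach in spirit: one C-level call does the whole byte swap,
    # as ntohl (a single bswap instruction on x86) does in the Cython path.
    return struct.unpack('>f', data)[0]

raw = struct.pack('>f', 3.5)
print(manual_swap_float(raw), intrinsic_style_float(raw))  # 3.5 3.5
```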

All tests passing (609 total). [See the next commit for a fix for an existing Cython-related issue!]

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Refactor deserializers.pyx to use from_ptr_and_size() consistently
instead of manual Buffer field assignment for better code clarity and
maintainability.

Changes:
- cassandra/deserializers.pyx: Refactor 4 locations to use helper

Tests: All Cython tests compile and pass (5 tests)

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Add comprehensive benchmark comparing different deserialization strategies
for VectorType with various numeric types and vector sizes.

The benchmark measures:
- Current element-by-element baseline
- struct.unpack bulk deserialization
- numpy frombuffer with tolist()
- numpy frombuffer zero-copy approach

Tested with common ML/AI embedding dimensions:
- Small vectors: 3-4 elements
- Medium vectors: 128-384 elements
- Large vectors: 768-1536 elements

Usage:
  export CASS_DRIVER_NO_CYTHON=1  # Test pure Python implementation
  python benchmarks/vector_deserialize.py

Includes CPU pinning for consistent measurements and result verification
to ensure correctness of all optimization approaches.

Baseline Performance (per-operation deserialization time):
  Vector<float, 3>     :  0.88 μs
  Vector<float, 4>     :  0.78 μs
  Vector<float, 128>   :  4.72 μs
  Vector<float, 384>   : 15.38 μs
  Vector<float, 768>   : 32.43 μs
  Vector<float, 1536>  : 63.74 μs
  Vector<double, 128>  :  4.83 μs
  Vector<int, 128>     :  2.27 μs

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
…ct.unpack

Add bulk deserialization using struct.unpack for common numeric vector types
instead of element-by-element deserialization. This provides significant
performance improvements, especially for small vectors and integer types.

Optimized types:
- FloatType  ('>Nf' format)
- DoubleType ('>Nd' format)
- Int32Type  ('>Ni' format)
- LongType   ('>Nq' format)
- ShortType  ('>Nh' format)
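A sketch of the bulk unpacking described above (the subtype mapping and helper name are illustrative; the real code derives the format from the vector's subtype and dimension):

```python
import struct

# Hypothetical mapping from CQL numeric subtype to struct format character.
_FMT = {'float': 'f', 'double': 'd', 'int': 'i', 'bigint': 'q', 'smallint': 'h'}

def unpack_vector(data: bytes, subtype: str, n: int) -> list:
    # '>' selects big-endian (network byte order); the repeat count N
    # decodes all elements in one call instead of a Python-level loop.
    return list(struct.unpack(f'>{n}{_FMT[subtype]}', data))

print(unpack_vector(struct.pack('>3f', 1.0, 2.0, 3.0), 'float', 3))
# [1.0, 2.0, 3.0]
```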

Performance improvements (measured with CASS_DRIVER_NO_CYTHON=1):

Small vectors (3-4 elements):
  Vector<float, 3>  : 0.88 μs → 0.25 μs  (3.58x faster)
  Vector<float, 4>  : 0.78 μs → 0.28 μs  (2.79x faster)

Medium vectors (128 elements):
  Vector<float, 128>  : 4.72 μs → 4.06 μs  (1.16x faster)
  Vector<double, 128> : 4.83 μs → 4.01 μs  (1.20x faster)
  Vector<int, 128>    : 2.27 μs → 1.25 μs  (1.82x faster)

Large vectors (384-1536 elements):
  Vector<float, 384>  : 15.38 μs → 14.67 μs  (1.05x faster)
  Vector<float, 768>  : 32.43 μs → 30.72 μs  (1.06x faster)
  Vector<float, 1536> : 63.74 μs → 63.24 μs  (1.01x faster)

The optimization is most effective for:
- Small vectors (3-4 elements): 2.8-3.6x speedup
- Integer vectors: 1.8x speedup
- Medium-sized float/double vectors: 1.2-1.3x speedup

For very large vectors (384+ elements), the benefit is minimal as the
deserialization time is dominated by data copying rather than function
call overhead.

Variable-size subtypes and other numeric types continue to use the
element-by-element fallback path.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
For vectors with 32 or more elements, use numpy.frombuffer() which provides
1.3-1.5x speedup for large vectors (128+ elements) compared to struct.unpack.

The hybrid approach:
- Small vectors (< 32 elements): struct.unpack (2.8-3.6x faster than baseline)
- Large vectors (>= 32 elements): numpy.frombuffer().tolist() (1.3-1.5x faster than struct.unpack)

Threshold of 32 elements balances code complexity with performance gains.
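Under the stated 32-element threshold, the hybrid dispatch for a float vector might look like this (a sketch assuming NumPy is installed; names are illustrative):

```python
import struct
import numpy as np

NUMPY_THRESHOLD = 32  # below this, struct.unpack's lower overhead wins

def deserialize_float_vector(data: bytes, n: int) -> list:
    if n < NUMPY_THRESHOLD:
        # Small vectors: one struct call, minimal per-call overhead.
        return list(struct.unpack(f'>{n}f', data))
    # Large vectors: '>f4' reads big-endian float32 without copying the
    # buffer, and tolist() converts to Python floats at C speed.
    return np.frombuffer(data, dtype='>f4', count=n).tolist()

data = struct.pack('>128f', *range(128))
print(deserialize_float_vector(data, 128)[:3])  # [0.0, 1.0, 2.0]
```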

Benchmark results:
- float[128]:  2.15 μs → 1.87 μs (1.15x faster)
- float[384]:  6.17 μs → 4.44 μs (1.39x faster)
- float[768]: 12.25 μs → 8.45 μs (1.45x faster)
- float[1536]: 24.44 μs → 15.77 μs (1.55x faster)

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
…izer

Added a DesVectorType Cython deserializer with C-level optimizations to
improve row-parsing performance for vectors.
The deserializer uses:
- Direct C byte swapping (ntohl, ntohs) for numeric types
- Memory operations without Python object overhead
- Unified numpy path for large vectors (≥32 elements)
- struct.unpack fallback for small vectors (<32 elements)

Performance improvements:
- Small vectors (3-4 elements): 4.4-4.7x faster
- Medium vectors (128 elements): 1.0-1.5x faster
- Large vectors (384-1536 elements): 0.9-1.0x (marginal)

The Cython deserializer is automatically used by the row parser when
available via find_deserializer().

Includes unit tests and benchmark code.

Follow-up commits will try to return NumPy arrays directly, and perhaps more.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Extend NumpyParser to handle VectorType columns by creating 2D NumPy
arrays (rows × vector_dimension) instead of object arrays. This enables
zero-copy parsing for vector embeddings in ML/AI workloads.

Features:
- Detects VectorType via vector_size and subtype attributes
- Creates 2D masked arrays for numeric vector subtypes (float, double,
  int32, int64, int16)
- Falls back to object arrays for unsupported vector subtypes
- Handles endianness conversion for both 1D and 2D arrays
- Pre-allocates result arrays for efficiency

Supported vector types:
- Vector<float> → 2D float32 array
- Vector<double> → 2D float64 array
- Vector<int> → 2D int32 array
- Vector<bigint> → 2D int64 array
- Vector<smallint> → 2D int16 array

Adds comprehensive test coverage for all supported vector types,
mixed column queries, and large vector dimensions (384-element embeddings).
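The 2D idea can be sketched as stacking each row's fixed-size vector bytes into one (rows × dimension) array (a simplified illustration assuming NumPy; the real parser reads straight from the response buffer and uses masked arrays for NULL handling):

```python
import struct
import numpy as np

def rows_to_2d(row_blobs, dim):
    # Concatenate the per-row big-endian float32 payloads, reinterpret them
    # as one flat array, then view it as (rows, dim) without a Python loop.
    flat = np.frombuffer(b''.join(row_blobs), dtype='>f4')
    return flat.reshape(len(row_blobs), dim)

rows = [struct.pack('>3f', 1, 2, 3), struct.pack('>3f', 4, 5, 6)]
arr = rows_to_2d(rows, 3)
print(arr.shape)  # (2, 3)
```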

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Replace POSIX-specific arpa/inet.h with conditional compilation that uses
winsock2.h on Windows and arpa/inet.h on POSIX systems.

This ensures the driver can be compiled on Windows without modification.

Changes:
- cassandra/cython_marshal.pyx: Add platform detection for ntohs/ntohl
- cassandra/ioutils.pyx: Add platform detection for ntohl

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Add bounds checking to prevent buffer overruns and properly
handle CQL protocol value semantics in deserializers.

Changes:
- subelem(): Add bounds validation with protocol-compliant value handling
  * Happy path: Check elemlen >= 0 and offset + elemlen <= buf.size
  * Support NULL values (elemlen == -1) per CQL protocol
  * Support "not set" values (elemlen == -2) per CQL protocol
  * Reject invalid values (elemlen < -2) with clear error message

- _unpack_len(): Add bounds check before reading int32 length field
  * Validates offset + 4 <= buf.size before pointer dereference
  * Prevents reading beyond buffer boundaries

- DesTupleType: Add defensive bounds checking for tuple deserialization
  * Check p + 4 <= buf.size before reading item length
  * Check p + itemlen <= buf.size before reading item data
  * Explicit NULL value handling (itemlen < 0)
  * Clear error messages for buffer overruns

- DesCompositeType: Add bounds validation for composite type elements
  * Check 2 + element_length + 1 <= buf.size (length + data + EOC byte)
  * Prevents buffer overrun when reading composite elements

- DesVectorType._deserialize_generic(): Add size validation
  * Verify buf.size == expected_size before processing
  * Provides clear error message with expected vs actual sizes

Protocol specification reference:
  [value] = [int] n, followed by n bytes if n >= 0
            n == -1: NULL value
            n == -2: not set value
            n < -2: invalid (error)
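The [value] rules above can be expressed as a pure-Python sketch (the function name is illustrative; the real checks live in the Cython deserializers):

```python
import struct

def read_value(buf: bytes, offset: int):
    """Read one CQL [value]: a signed int32 length n, then n bytes if n >= 0.

    Returns (payload_or_None, new_offset); NULL (-1) and "not set" (-2)
    both yield None here for brevity.
    """
    if offset + 4 > len(buf):
        raise ValueError("buffer too short for length field")
    (n,) = struct.unpack_from('>i', buf, offset)
    offset += 4
    if n in (-1, -2):               # NULL / not set
        return None, offset
    if n < -2:                      # protocol violation
        raise ValueError(f"invalid [value] length {n}")
    if offset + n > len(buf):       # bounds check before slicing
        raise ValueError("element extends past end of buffer")
    return buf[offset:offset + n], offset + n

print(read_value(struct.pack('>i', 3) + b'abc', 0))  # (b'abc', 7)
```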

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

mykaul commented Feb 9, 2026

Per the earlier discussion, I've changed the series to optimize only float/double/int, since those are the most frequently used types and are not troublesome protocol-wise.
Results:

| Benchmark | Master (μs) | int32_pack (μs) | Change (μs) | Change (%) |
| --- | --- | --- | --- | --- |
| Vector<float, 3> | 3.32 | 1.34 | -1.98 | -59.64% |
| Vector<float, 4> | 2.21 | 0.78 | -1.43 | -64.71% |
| Vector<float, 128> | 46.78 | 3.53 | -43.25 | -92.45% |
| Vector<float, 384> | 145.53 | 10.12 | -135.41 | -93.05% |
| Vector<float, 768> | 287.93 | 19.20 | -268.73 | -93.33% |
| Vector<float, 1536> | 579.66 | 37.84 | -541.82 | -93.48% |
| Vector<double, 128> | 47.88 | 3.80 | -44.08 | -92.07% |
| Vector<double, 768> | 294.95 | 19.72 | -275.23 | -93.31% |
| Vector<double, 1536> | 584.74 | 37.69 | -547.05 | -93.55% |
| Vector<int, 64> | 22.06 | 2.19 | -19.87 | -90.07% |
| Vector<int, 128> | 43.50 | 2.82 | -40.68 | -93.52% |

I'll submit a fixed version of this series.

Vector type is supported on Scylla 2025.4 and above.
Enable the integration tests.

Tested locally against both 2025.4.2 and 2026.1 and they pass.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>