Conversation


@mykaul mykaul commented Feb 5, 2026

Multiple mostly independent commits (if needed, most can be extracted from this series) that improve the parsing of vector arrays.
Spanning Python, Cython, and even NumPy array creation, this series includes both tests and a benchmark to improve the deserialization of vectors.

There are some prerequisites and a bug fix (which I've extracted into its own PR), but otherwise the series is mostly complete.

In a follow-up PR or in subsequent commits, I plan to apply the same or similar optimizations to serialization of vector types.

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass tests.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have provided docstrings for the public items that I want to introduce.
  • I have adjusted the documentation in ./docs/source/.
  • I added appropriate Fixes: annotations to PR description.

@mykaul mykaul requested a review from Copilot February 5, 2026 21:46
@mykaul mykaul added the enhancement New feature or request label Feb 5, 2026
@mykaul mykaul marked this pull request as draft February 5, 2026 21:52

Copilot AI left a comment

Pull request overview

This pull request implements comprehensive performance optimizations for VectorType deserialization across multiple layers of the Python driver stack. The changes introduce optimized deserialization paths using struct.unpack for small vectors, numpy.frombuffer for large vectors (when NumPy is available), and a new Cython DesVectorType deserializer that uses low-level C operations with ntohl/ntohs intrinsics for efficient byte-swapping.

Changes:

  • Added Cython-based DesVectorType deserializer with optimized paths for float, double, int32, int64, and int16 vector types
  • Enhanced Python-level VectorType deserialization with struct.unpack and numpy.frombuffer optimizations
  • Extended NumpyParser to create 2D arrays for vector types, enabling efficient batch processing
  • Optimized low-level byte-swap operations using ntohl/ntohs intrinsics and simplified varint_unpack using int.from_bytes
  • Removed slice_buffer function in favor of simpler from_ptr_and_size for direct pointer manipulation

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 12 comments.

| File | Description |
| --- | --- |
| tests/unit/test_types.py | Adds comprehensive tests for the Cython DesVectorType deserializer covering float, double, int32, int64, and int16 vectors |
| tests/unit/test_numpy_parser.py | New test suite for NumPy parser 2D array support for vectors, with mixed column types and large dimensions |
| cassandra/cqltypes.py | Python-level VectorType optimizations using struct.unpack and numpy.frombuffer, plus improved variable-size vector handling |
| cassandra/deserializers.pyx | New Cython DesVectorType class with type-specific optimized deserialization methods |
| cassandra/numpy_parser.pyx | Enhanced to create 2D NumPy arrays for VectorType columns and pre-allocate the arrays list |
| cassandra/cython_marshal.pyx | Optimized unpack_num to use ntohl/ntohs intrinsics; simplified varint_unpack |
| cassandra/ioutils.pyx | Optimized read_int to use ntohl directly |
| cassandra/marshal.py | Simplified varint_unpack using int.from_bytes |
| cassandra/buffer.pxd | Replaced slice_buffer with a simpler from_ptr_and_size function |
| benchmarks/vector_deserialize.py | New comprehensive benchmark suite comparing different optimization strategies |


….from_bytes

Performance improvements to serialization/deserialization hot paths:

1. unpack_num(): Use ntohs()/ntohl() for 16-bit and 32-bit integer types
   instead of byte-by-byte swapping loop. These compile to single bswap
   instructions on x86, providing more predictable performance.

2. read_int(): Simplify to use ntohl() directly instead of going through
   unpack_num() with a temporary Buffer.

3. varint_unpack(): Replace hex string conversion with int.from_bytes().
   This eliminates string allocations and provides 4-18x speedup for the
   function itself (larger gains for longer varints).

4. Remove slice_buffer() and replace it with direct assignment.

5. _unpack_len() is now implemented similarly to read_int().

Also removes unused 'start' and 'end' variables from unpack_num().

End-to-end benchmark shows ~4-5% improvement in row throughput.
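Item 3 can be sketched in plain Python (a minimal illustration with a hypothetical helper name, not the driver's exact code):

```python
def varint_unpack_sketch(data: bytes) -> int:
    """Decode a big-endian, two's-complement varint in one C-level call.

    int.from_bytes replaces the old bytes -> hex string -> int round trip,
    avoiding the intermediate string allocations entirely.
    """
    return int.from_bytes(data, byteorder='big', signed=True)

print(varint_unpack_sketch(b'\x7f'))      # 127
print(varint_unpack_sketch(b'\xff'))      # -1
print(varint_unpack_sketch(b'\x01\x00'))  # 256
```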

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Use hardware byte-swap intrinsic for float unmarshaling instead of manual
4-iteration loop, providing 4-8x speedup on little-endian systems.
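The difference can be illustrated in pure Python, where struct's big-endian format plays the role of the ntohl-based path (an illustrative sketch, not the Cython code itself):

```python
import struct

def manual_swap_float(data: bytes) -> float:
    # Old approach in spirit: reverse the 4 bytes one at a time, then
    # reinterpret the little-endian result.
    swapped = bytes(data[3 - i] for i in range(4))
    return struct.unpack('<f', swapped)[0]

def intrinsic_style_float(data: bytes) -> float:
    # New approach in spirit: one C-level call does the whole byte swap,
    # as ntohl (a single bswap instruction on x86) does in the Cython path.
    return struct.unpack('>f', data)[0]

raw = struct.pack('>f', 3.5)
print(manual_swap_float(raw), intrinsic_style_float(raw))  # 3.5 3.5
```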

All tests passing (609 total). [See the next commit for a fix for an existing Cython-related issue!]

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Refactor deserializers.pyx to use from_ptr_and_size() consistently
instead of manual Buffer field assignment for better code clarity and
maintainability.

Changes:
- cassandra/deserializers.pyx: Refactor 4 locations to use helper

Tests: All Cython tests compile and pass (5 tests)

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Add comprehensive benchmark comparing different deserialization strategies
for VectorType with various numeric types and vector sizes.

The benchmark measures:
- Current element-by-element baseline
- struct.unpack bulk deserialization
- numpy frombuffer with tolist()
- numpy frombuffer zero-copy approach

Tested with common ML/AI embedding dimensions:
- Small vectors: 3-4 elements
- Medium vectors: 128-384 elements
- Large vectors: 768-1536 elements

Usage:
  export CASS_DRIVER_NO_CYTHON=1  # Test pure Python implementation
  python benchmarks/vector_deserialize.py

Includes CPU pinning for consistent measurements and result verification
to ensure correctness of all optimization approaches.

Baseline Performance (per-operation deserialization time):
  Vector<float, 3>     :  0.88 μs
  Vector<float, 4>     :  0.78 μs
  Vector<float, 128>   :  4.72 μs
  Vector<float, 384>   : 15.38 μs
  Vector<float, 768>   : 32.43 μs
  Vector<float, 1536>  : 63.74 μs
  Vector<double, 128>  :  4.83 μs
  Vector<int, 128>     :  2.27 μs

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
…ct.unpack

Add bulk deserialization using struct.unpack for common numeric vector types
instead of element-by-element deserialization. This provides significant
performance improvements, especially for small vectors and integer types.

Optimized types:
- FloatType  ('>Nf' format)
- DoubleType ('>Nd' format)
- Int32Type  ('>Ni' format)
- LongType   ('>Nq' format)
- ShortType  ('>Nh' format)
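A sketch of the bulk unpacking described above (the subtype mapping and helper name are illustrative; the real code derives the format from the vector's subtype and dimension):

```python
import struct

# Hypothetical mapping from CQL numeric subtype to struct format character.
_FMT = {'float': 'f', 'double': 'd', 'int': 'i', 'bigint': 'q', 'smallint': 'h'}

def unpack_vector(data: bytes, subtype: str, n: int) -> list:
    # '>' selects big-endian (network byte order); the repeat count N
    # decodes all elements in one call instead of a Python-level loop.
    return list(struct.unpack(f'>{n}{_FMT[subtype]}', data))

print(unpack_vector(struct.pack('>3f', 1.0, 2.0, 3.0), 'float', 3))
# [1.0, 2.0, 3.0]
```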

Performance improvements (measured with CASS_DRIVER_NO_CYTHON=1):

Small vectors (3-4 elements):
  Vector<float, 3>  : 0.88 μs → 0.25 μs  (3.58x faster)
  Vector<float, 4>  : 0.78 μs → 0.28 μs  (2.79x faster)

Medium vectors (128 elements):
  Vector<float, 128>  : 4.72 μs → 4.06 μs  (1.16x faster)
  Vector<double, 128> : 4.83 μs → 4.01 μs  (1.20x faster)
  Vector<int, 128>    : 2.27 μs → 1.25 μs  (1.82x faster)

Large vectors (384-1536 elements):
  Vector<float, 384>  : 15.38 μs → 14.67 μs  (1.05x faster)
  Vector<float, 768>  : 32.43 μs → 30.72 μs  (1.06x faster)
  Vector<float, 1536> : 63.74 μs → 63.24 μs  (1.01x faster)

The optimization is most effective for:
- Small vectors (3-4 elements): 2.8-3.6x speedup
- Integer vectors: 1.8x speedup
- Medium-sized float/double vectors: 1.2-1.3x speedup

For very large vectors (384+ elements), the benefit is minimal as the
deserialization time is dominated by data copying rather than function
call overhead.

Variable-size subtypes and other numeric types continue to use the
element-by-element fallback path.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
For vectors with 32 or more elements, use numpy.frombuffer() which provides
1.3-1.5x speedup for large vectors (128+ elements) compared to struct.unpack.

The hybrid approach:
- Small vectors (< 32 elements): struct.unpack (2.8-3.6x faster than baseline)
- Large vectors (>= 32 elements): numpy.frombuffer().tolist() (1.3-1.5x faster than struct.unpack)

Threshold of 32 elements balances code complexity with performance gains.
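Under the stated 32-element threshold, the hybrid dispatch for a float vector might look like this (a sketch assuming NumPy is installed; names are illustrative):

```python
import struct
import numpy as np

NUMPY_THRESHOLD = 32  # below this, struct.unpack's lower overhead wins

def deserialize_float_vector(data: bytes, n: int) -> list:
    if n < NUMPY_THRESHOLD:
        # Small vectors: one struct call, minimal per-call overhead.
        return list(struct.unpack(f'>{n}f', data))
    # Large vectors: '>f4' reads big-endian float32 without copying the
    # buffer, and tolist() converts to Python floats at C speed.
    return np.frombuffer(data, dtype='>f4', count=n).tolist()

data = struct.pack('>128f', *range(128))
print(deserialize_float_vector(data, 128)[:3])  # [0.0, 1.0, 2.0]
```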

Benchmark results:
- float[128]:  2.15 μs → 1.87 μs (1.15x faster)
- float[384]:  6.17 μs → 4.44 μs (1.39x faster)
- float[768]: 12.25 μs → 8.45 μs (1.45x faster)
- float[1536]: 24.44 μs → 15.77 μs (1.55x faster)

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
…izer

Added a DesVectorType Cython deserializer with C-level optimizations to
improve row-parsing performance for vectors.
The deserializer uses:
- Direct C byte swapping (ntohl, ntohs) for numeric types
- Memory operations without Python object overhead
- Unified numpy path for large vectors (≥32 elements)
- struct.unpack fallback for small vectors (<32 elements)

Performance improvements:
- Small vectors (3-4 elements): 4.4-4.7x faster
- Medium vectors (128 elements): 1.0-1.5x faster
- Large vectors (384-1536 elements): 0.9-1.0x (marginal)

The Cython deserializer is automatically used by the row parser when
available via find_deserializer().

Includes unit tests and benchmark code.

Follow-up commits will try to return NumPy arrays directly, and perhaps more.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Extend NumpyParser to handle VectorType columns by creating 2D NumPy
arrays (rows × vector_dimension) instead of object arrays. This enables
zero-copy parsing for vector embeddings in ML/AI workloads.

Features:
- Detects VectorType via vector_size and subtype attributes
- Creates 2D masked arrays for numeric vector subtypes (float, double,
  int32, int64, int16)
- Falls back to object arrays for unsupported vector subtypes
- Handles endianness conversion for both 1D and 2D arrays
- Pre-allocates result arrays for efficiency

Supported vector types:
- Vector<float> → 2D float32 array
- Vector<double> → 2D float64 array
- Vector<int> → 2D int32 array
- Vector<bigint> → 2D int64 array
- Vector<smallint> → 2D int16 array

Adds comprehensive test coverage for all supported vector types,
mixed column queries, and large vector dimensions (384-element embeddings).
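The 2D idea can be sketched as stacking each row's fixed-size vector bytes into one (rows × dimension) array (a simplified illustration assuming NumPy; the real parser reads straight from the response buffer and uses masked arrays for NULL handling):

```python
import struct
import numpy as np

def rows_to_2d(row_blobs, dim):
    # Concatenate the per-row big-endian float32 payloads, reinterpret them
    # as one flat array, then view it as (rows, dim) without a Python loop.
    flat = np.frombuffer(b''.join(row_blobs), dtype='>f4')
    return flat.reshape(len(row_blobs), dim)

rows = [struct.pack('>3f', 1, 2, 3), struct.pack('>3f', 4, 5, 6)]
arr = rows_to_2d(rows, 3)
print(arr.shape)  # (2, 3)
```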

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Replace POSIX-specific arpa/inet.h with conditional compilation that uses
winsock2.h on Windows and arpa/inet.h on POSIX systems.

This ensures the driver can be compiled on Windows without modification.

Changes:
- cassandra/cython_marshal.pyx: Add platform detection for ntohs/ntohl
- cassandra/ioutils.pyx: Add platform detection for ntohl

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Add bounds checking to prevent buffer overruns and properly
handle CQL protocol value semantics in deserializers.

Changes:
- subelem(): Add bounds validation with protocol-compliant value handling
  * Happy path: Check elemlen >= 0 and offset + elemlen <= buf.size
  * Support NULL values (elemlen == -1) per CQL protocol
  * Support "not set" values (elemlen == -2) per CQL protocol
  * Reject invalid values (elemlen < -2) with clear error message

- _unpack_len(): Add bounds check before reading int32 length field
  * Validates offset + 4 <= buf.size before pointer dereference
  * Prevents reading beyond buffer boundaries

- DesTupleType: Add defensive bounds checking for tuple deserialization
  * Check p + 4 <= buf.size before reading item length
  * Check p + itemlen <= buf.size before reading item data
  * Explicit NULL value handling (itemlen < 0)
  * Clear error messages for buffer overruns

- DesCompositeType: Add bounds validation for composite type elements
  * Check 2 + element_length + 1 <= buf.size (length + data + EOC byte)
  * Prevents buffer overrun when reading composite elements

- DesVectorType._deserialize_generic(): Add size validation
  * Verify buf.size == expected_size before processing
  * Provides clear error message with expected vs actual sizes

Protocol specification reference:
  [value] = [int] n, followed by n bytes if n >= 0
            n == -1: NULL value
            n == -2: not set value
            n < -2: invalid (error)
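The [value] rules above can be expressed as a pure-Python sketch (the function name is illustrative; the real checks live in the Cython deserializers):

```python
import struct

def read_value(buf: bytes, offset: int):
    """Read one CQL [value]: a signed int32 length n, then n bytes if n >= 0.

    Returns (payload_or_None, new_offset); NULL (-1) and "not set" (-2)
    both yield None here for brevity.
    """
    if offset + 4 > len(buf):
        raise ValueError("buffer too short for length field")
    (n,) = struct.unpack_from('>i', buf, offset)
    offset += 4
    if n in (-1, -2):               # NULL / not set
        return None, offset
    if n < -2:                      # protocol violation
        raise ValueError(f"invalid [value] length {n}")
    if offset + n > len(buf):       # bounds check before slicing
        raise ValueError("element extends past end of buffer")
    return buf[offset:offset + n], offset + n

print(read_value(struct.pack('>i', 3) + b'abc', 0))  # (b'abc', 7)
```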

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

mykaul commented Feb 9, 2026

Per the earlier discussion, I've changed the series to optimize only float/double/int, since those are the most frequently used types and are not troublesome protocol-wise.
Results:

| Benchmark | Master (μs) | int32_pack (μs) | Change (μs) | Change (%) |
| --- | --- | --- | --- | --- |
| Vector<float, 3> | 3.32 | 1.34 | -1.98 | -59.64% |
| Vector<float, 4> | 2.21 | 0.78 | -1.43 | -64.71% |
| Vector<float, 128> | 46.78 | 3.53 | -43.25 | -92.45% |
| Vector<float, 384> | 145.53 | 10.12 | -135.41 | -93.05% |
| Vector<float, 768> | 287.93 | 19.20 | -268.73 | -93.33% |
| Vector<float, 1536> | 579.66 | 37.84 | -541.82 | -93.48% |
| Vector<double, 128> | 47.88 | 3.80 | -44.08 | -92.07% |
| Vector<double, 768> | 294.95 | 19.72 | -275.23 | -93.31% |
| Vector<double, 1536> | 584.74 | 37.69 | -547.05 | -93.55% |
| Vector<int, 64> | 22.06 | 2.19 | -19.87 | -90.07% |
| Vector<int, 128> | 43.50 | 2.82 | -40.68 | -93.52% |

I'll submit a fixed version of this series.

Vector type is supported on Scylla 2025.4 and above.
Enable the integration tests.

Tested locally against both 2025.4.2 and 2026.1 and they pass.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>