Skip to content

Fix: Add missing serial_size() method to ShortType#688

Closed
mykaul wants to merge 1 commit intoscylladb:masterfrom
mykaul:fix_short
Closed

Fix: Add missing serial_size() method to ShortType#688
mykaul wants to merge 1 commit intoscylladb:masterfrom
mykaul:fix_short

Conversation

@mykaul
Copy link

@mykaul mykaul commented Feb 5, 2026

ShortType (smallint) is a fixed-size type (2 bytes) but was missing the serial_size() classmethod. This prevented optimizations that depend on knowing the fixed size of elements, such as the struct.unpack optimization in VectorType deserialization.

All other fixed-size numeric types (Int32Type, LongType, FloatType, DoubleType, BooleanType) already have serial_size() defined.

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass test.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have provided docstrings for the public items that I want to introduce.
  • I have adjusted the documentation in ./docs/source/.
  • I added appropriate Fixes: annotations to PR description.

ShortType (smallint) is a fixed-size type (2 bytes) but was missing
the serial_size() classmethod. This prevented optimizations that
depend on knowing the fixed size of elements, such as the struct.unpack
optimization in VectorType deserialization.

All other fixed-size numeric types (Int32Type, LongType, FloatType,
DoubleType, BooleanType) already have serial_size() defined.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
@Lorak-mmk
Copy link

Lorak-mmk commented Feb 5, 2026

This is not a bug in the driver. Even better: this PR introduces a bug in the driver!
For some reason, Cassandra got the size of some types wrong when implementing vectors, and so smallint and tinyint are treated as unsized types. We now have to live with it.
We have the following function in Rust Driver:

impl NativeType {
    /// This function returns the size of the type as it is used by Cassandra
    /// for the purposes of serialization and deserialization of vectors,
    /// it is needed as the variable size types (`None`) are de/serialized
    /// differently than the fixed size types (`Some(size)`). Note that
    /// many fixed size types are treated as variable size by Cassandra.
    pub(crate) fn type_size(&self) -> Option<usize> {
        match self {
            NativeType::Ascii => None,
            NativeType::Boolean => Some(1),
            NativeType::Blob => None,
            NativeType::Counter => None,
            NativeType::Date => None,
            NativeType::Decimal => None,
            NativeType::Double => Some(8),
            NativeType::Duration => None,
            NativeType::Float => Some(4),
            NativeType::Int => Some(4),
            NativeType::BigInt => Some(8),
            NativeType::Text => None,
            NativeType::Timestamp => Some(8),
            NativeType::Inet => None,
            NativeType::SmallInt => None,
            NativeType::TinyInt => None,
            NativeType::Time => None,
            NativeType::Timeuuid => Some(16),
            NativeType::Uuid => Some(16),
            NativeType::Varint => None,
        }
    }
}

@mykaul
Copy link
Author

mykaul commented Feb 5, 2026

Wow. Seriously? And they did not bother to fix it? Wow. Why do we have to live with it? Why not do it correctly against ScyllaDB? Did Scylla follow and got it 'wrong' too ?

@mykaul mykaul marked this pull request as draft February 5, 2026 21:49
@mykaul
Copy link
Author

mykaul commented Feb 5, 2026

CC @swasik for awareness. Scary stuff - how does Scylla work in this case?

@Lorak-mmk
Copy link

Lorak-mmk commented Feb 5, 2026

And they did not bother to fix it? Wow. Why do we have to live with it?

That would require a new protocol version, or at least an extension.

Why not do it correctly against ScyllaDB? Did Scylla follow and got it 'wrong' too ?

We are protocol-compatible with Cassandra. If we did it differently, our drivers would not work with C*, and their drivers would not work with Scylla.

Seriously?

Yes, and it is not their only questionable design decision regarding vectors.

  1. For variable-sized vectors, length of each element is not encoded as e.g. 4-byte int or 2-byte int. Instead it is encoded as variable-size int! This makes efficient serialization more difficult. In Rust Driver we had some ideas, but didn't implement them, because it would be more complicated, and the canonical case for vector is vector of doubles and vector of floats (which are both constant-sized).
  2. In PREPARED and ROWS responses returned from server, vector type is not encoded in the binary form in columns metadata (as all other types are). See https://github.com/apache/cassandra/blob/00bf5380fe83dd8919e8f286edd2e6a65786ca6e/doc/native_protocol_v4.spec#L617-L659 for this binary encoding. Instead, "Custom" type (id: 0x00) is used, so the whole type needs to be parsed from string, instead of the binary representation. For that reason we had to include whole new parser in Rust Driver, because it had no need to parse syntax of "custom" types before.

Adding Vector support was 2nd most commented Rust Driver PR ever for good reason.
scylladb/scylla-rust-driver#1165

@mykaul
Copy link
Author

mykaul commented Feb 6, 2026

<bad idea>
Since vector is NOT part of protocol v4, we could implement the Right Thing(TM) and properly support the types with the correct fixed length perhaps?
</bad idea>

@Lorak-mmk , @swasik - WDYT?

@swasik
Copy link

swasik commented Feb 6, 2026

@Lorak-mmk , @swasik - WDYT?

But I think we want to keep the compatibility with Cassandra drivers - this way Cassandra integration (such as LangChain) will work out of the box without the need to change to Scylla drivers.

@Lorak-mmk
Copy link

Since vector is NOT part of protocol v4, we could implement the Right Thing(TM) and properly support the types with the correct fixed length perhaps?

The unfortunate truth is that the protocol specs are really bad w.r.t vectors.
Yes, vector is only mentioned in v5, but in reality it is implemented the exact same way in v4 as well, so any such changes would break it :(

Another example of breakage: For a vector with variable-length elements, the size of the elements will preced each element. Each element is the [bytes] representing the serialized value.. This is just false - in [bytes] the length is 4-byte integer. In vector elements it is variable-length integer. It is not mentioned anywhere.

We could fix vectors with protocol extension. Does it make sense? I don't think so. None of the issues affect vector<float, n>, vector<double, n>, which are afaik the main use cases for vectors.

@mykaul
Copy link
Author

mykaul commented Feb 6, 2026

Things are going from bad to worse. Wireshark CQL dissector doesn't parse custom fields in a response. So now I'm fixing that. But it doesn't compile, so now I'm bisecting that.
Hope to get both done over the weekend, so I can see the horrors in my own eyes.

@mykaul
Copy link
Author

mykaul commented Feb 6, 2026

We could fix vectors with protocol extension. Does it make sense? I don't think so. None of the issues affect vector<float, n>, vector<double, n>, which are afaik the main use cases for vectors.

I'm hoping that this is good enough. I think I'll focus on float optimization and let the rest be as it used to be.

@mykaul mykaul closed this Feb 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants