-
Notifications
You must be signed in to change notification settings - Fork 33
FIX: varchar columnsize does not account for utf8 conversion #392
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…less SQLGetData fallback
Also add columnSize == 0 check to FetchLobColumnData which was likely missed. That way FetchLobColumnData gets called directly instead of going through a SQLGetData call with an empty buffer first
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR addresses a buffer overflow issue when fetching VARCHAR columns containing characters that require more bytes in UTF-8 encoding than in the database's character encoding. The fix multiplies the fetch buffer size by 2 for CHAR/VARCHAR columns and removes fallback mechanisms that previously masked these issues.
Key Changes:
- Doubled buffer allocation for CHAR/VARCHAR columns to account for UTF-8 conversion overhead
- Removed
isLobfield from column info structures and simplified LOB detection logic - Replaced fallback mechanisms (FetchLobColumnData calls) with explicit error messages when buffers overflow
- Added test case validating special character handling (ß character that requires 2 bytes in UTF-8)
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| tests/test_004_cursor.py | Adds test for VARCHAR buffer sizing with special character (ß) that requires 2 bytes in UTF-8 |
| mssql_python/pybind/ddbc_bindings.h | Removes isLob field and FetchLobColumnData forward declaration; updates ProcessChar, ProcessWChar, and ProcessBinary to throw errors instead of falling back to LOB streaming |
| mssql_python/pybind/ddbc_bindings.cpp | Implements 2x buffer multiplier for VARCHAR columns in SQLGetData_wrap, SQLBindColums, FetchBatchData, and calculateRowSize; removes lobColumns parameter from FetchBatchData signature; adds columnSize == 0 check for WCHAR LOB detection |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@ffelixg Can you please have a look at why the tests are failing, and make the required changes |
|
I couldn't reproduce the failure locally, but I think that MSVC seems to be configured more restrictively than GCC regarding unused parameters, I removed hStmt from the columnProcessors Could you rerun the CI @jahnvi480? |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@ffelixg Can you please check on why the tests are only failing on Windows. |
|
@jahnvi480 It's the issues I've commented on in #391. I changed the tests so that they pass and instead execute a separate branch for windows where they assert that the problematic cases return exceptions or corrupt data. My original issue regarding the buffer size is fixed and everything works as I would expect on Linux now. Since the windows driver returns latin1 data, which can't have multiple bytes per character, that issue was actually only affecting Linux/Macos. Regarding the additional issues I've found while writing the tests, I think you should discuss internally about how you want to restructure the API. The current behavior would require the User to write platform specific code, which is not great. My suggestion would be requesting wide chars from the odbc driver, like I wrote in the issue. Actually addressing that is probably outside the scope of this PR/issue. Maybe we could sneak in a fix for fetchmany/fetchall ignoring setdecoding, that seems fairly straight forward. |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
📊 Code Coverage Report
Diff CoverageDiff: main...HEAD, staged and unstaged changes
Summary
mssql_python/pybind/ddbc_bindings.cppLines 2959-2970 0 ""0 ""0 ""1 "/usr/include/stdc-predef.h" 1 3 40 "" 21 ""Lines 3033-3044 0 ""0 ""0 ""1 "/usr/include/stdc-predef.h" 1 3 40 "" 21 ""Lines 3303-3314 0 ""0 ""0 ""1 "/usr/include/stdc-predef.h" 1 3 40 "" 21 ""mssql_python/pybind/ddbc_bindings.h📋 Files Needing Attention📉 Files with overall lowest coverage (click to expand)mssql_python.pybind.logger_bridge.hpp: 58.8%
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.row.py: 66.2%
mssql_python.helpers.py: 67.5%
mssql_python.pybind.ddbc_bindings.cpp: 69.3%
mssql_python.pybind.ddbc_bindings.h: 71.9%
mssql_python.pybind.connection.connection.cpp: 73.6%
mssql_python.ddbc_bindings.py: 79.6%
mssql_python.pybind.connection.connection_pool.cpp: 79.6%
mssql_python.connection.py: 84.1%🔗 Quick Links
|
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Work Item / Issue Reference
Summary
This PR enlarges the fetch buffer size for char/varchar columns by a factor of 4 to account for characters which take up more space in utf8 than in whichever encoding the database is using.
It also adds a test, which passes at each commit and therefore tracks how the interface changes. I removed some of the fallback mechanisms and I hope that the evolution of the error messages over the different iterations of the test shows why that's preferable.
For the SQLGetData path for wchars, a
columnSize == 0check is added, to avoid needing the fallback branch fornvarchar(max). Previously, SQLGetData was called once with a buffer of length 0 and only then did the real attempt to fetch data start, after entering the fallback branch.columnSize == 0was actually also the only case where the fallback branch behaved correctly, anything else discarded the firstcolumnSizecharacters.A test that covers all of latin1 and documents behavior with unmapped characters is added as well.
Note that if the database is using a utf8 collation then a buffer of size
columnSizewould be enough, as it tells us the number of bytes and not the number of characters (this distinction only matters for utf8).