Skip to content

Add ASAN #2274

Open
Kaushik Raina (k-raina) wants to merge 14 commits into master from dev_kip-932_queues-for-kafka_leak_tests

@k-raina

Member

What

Checklist

  • Contains customer facing changes? Including API/behavior changes
  • Did you add sufficient unit test and/or integration test coverage for this PR?
    • If not, please explain why it is not required

References

JIRA:

Test & Review

Open questions / Follow-ups

@confluent-cla-assistant


🎉 All Contributor License Agreements have been signed. Ready to merge.
Please push an empty commit if you would like to re-run the checks to verify CLA status for all contributors.

Kaushik Raina (k-raina) and others added 13 commits June 12, 2026 12:52
…-party leaks

The integration ASAN job installed requirements-tests.txt (pytest only), so
pytest collection failed while importing the Schema Registry (SR) client that
cluster_fixture loads at module level: ModuleNotFoundError: authlib. Switch to
requirements-tests-install.txt, which pulls in requirements-schemaregistry.txt
(authlib + cryptography) via -r, plus the vendored trivup 0.14.0.
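To check in advance that a requirements file actually reaches a package such as authlib through its nested -r includes, a small resolver like the one below could be run. This is a minimal sketch, not pip's parser: the function names are hypothetical, it only understands "-r file" / "--requirement file" (or "=file") lines and " #" comments, and it skips every other pip option.

```python
import os
import re


def expand_requirements(path, _seen=None):
    """Return requirement specifiers from `path`, following nested -r includes.

    Simplified sketch: handles `-r file` / `--requirement file` (or `=file`),
    strips `#` comments, and skips other pip options such as `-e` or `--hash`.
    """
    seen = set() if _seen is None else _seen
    path = os.path.abspath(path)
    if path in seen:  # guard against include cycles
        return []
    seen.add(path)
    base = os.path.dirname(path)
    specs = []
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            # A '#' at line start or after whitespace begins a comment.
            line = re.split(r"(?:^|\s)#", raw, maxsplit=1)[0].strip()
            if not line:
                continue
            m = re.match(r"(?:-r|--requirement)(?:\s+|=)(\S+)$", line)
            if m:
                # pip resolves a nested -r path relative to the including file.
                specs.extend(expand_requirements(os.path.join(base, m.group(1)), seen))
            elif not line.startswith("-"):
                specs.append(line)
    return specs


def _normalize(name):
    # PEP 503 normalization: runs of '-', '_' and '.' become '-', lowercased.
    return re.sub(r"[-_.]+", "-", name).lower()


def requires(path, package):
    """True if `package` appears anywhere in the expanded requirements of `path`."""
    wanted = _normalize(package)
    for spec in expand_requirements(path):
        m = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", spec)
        if m and _normalize(m.group(0)) == wanted:
            return True
    return False
```

Run from the repository root, `requires("requirements-tests-install.txt", "authlib")` would be expected to be true if the include chain is as described above, and false for requirements-tests.txt.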

Also suppress the one-time module-init allocations that LSAN flags in the
transitively pulled-in cryptography/cffi extensions, so they don't fail the job
(exitcode=1) once the tests actually run. The suppressions are scoped to those
.so files; the broker-free job never imports them.
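LeakSanitizer reads suppressions from a file named by the `suppressions` key of LSAN_OPTIONS; each `leak:<pattern>` line is matched against the function, source-file and module names in a leak's stack, which is what lets a suppression be scoped to specific .so files. The helper below is a hypothetical sketch of generating such a file and the matching options string; the module patterns in the usage note (`_cffi_backend`, `_rust`) are assumptions, not necessarily the entries this PR uses.

```python
import os


def write_lsan_suppressions(patterns, directory, filename="lsan.supp"):
    """Write a LeakSanitizer suppressions file, one `leak:<pattern>` per line.

    LSan matches each pattern against the function, source-file and module
    names in a leak's stack, so a shared-object name scopes the suppression
    to allocations made from that extension.
    """
    path = os.path.join(directory, filename)
    with open(path, "w", encoding="utf-8") as fh:
        for pattern in patterns:
            fh.write(f"leak:{pattern}\n")
    return path


def lsan_options(suppressions_path, print_suppressions=False, exitcode=1):
    """Build an LSAN_OPTIONS value; sanitizer options are ':'-separated."""
    return ":".join([
        f"suppressions={suppressions_path}",
        f"print_suppressions={int(print_suppressions)}",
        f"exitcode={exitcode}",
    ])
```

For example, `env["LSAN_OPTIONS"] = lsan_options(write_lsan_suppressions(["_cffi_backend", "_rust"], tmpdir))` would keep the scope to those two hypothetical extension names rather than to broad CPython allocator frames.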

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Scratch validation for the share-consumer sanitizer job. Adds two deliberate
leaks (a raw libc malloc and a lost GC dict) to the broker-free ASAN job and
turns on print_suppressions, to prove that the detector and the exitcode=1 gate
actually fire, and to check whether the broad _PyObject_GC_* suppressions mask
our own objects. Revert after reading the run: delete the canary file, restore
print_suppressions=0, and remove the pytest arg.
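A canary of this kind might look like the sketch below. It is an illustration, not the PR's canary file: the function and test names are hypothetical, it assumes a POSIX platform where `ctypes.CDLL(None)` exposes libc's `malloc`, and it only plants the leaks; whether LSan reports the dict is exactly what the run is meant to reveal.

```python
import ctypes


def plant_canary_leaks():
    """Deliberately leak one raw libc block and one Python dict (scratch only)."""
    libc = ctypes.CDLL(None)  # POSIX: symbols of the running process, incl. libc
    libc.malloc.restype = ctypes.c_void_p
    libc.malloc.argtypes = [ctypes.c_size_t]
    raw = libc.malloc(4096)  # never freed

    lost = {"canary": ["x"] * 256}
    # Take an extra strong reference that is never released, so the dict's
    # refcount cannot reach zero after this function returns.
    ctypes.pythonapi.Py_IncRef(ctypes.py_object(lost))
    return raw


def test_canary_leaks():
    # Scratch-only test: delete it once the sanitizer run has been read.
    assert plant_canary_leaks()
```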

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The canary run proved exitcode=1 does not propagate when libasan is
LD_PRELOADed into a non-instrumented CPython: LSAN detected the planted leaks
(raw malloc + a lost dict) and printed the report, but the process still
exited 0 and the job passed. So neither ASAN job was actually gating on leaks.

Add a post-pytest grep on the tee'd log for the LeakSanitizer/AddressSanitizer
signatures and fail the job when present (also catches ASAN memory errors).
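The gate itself is a grep in the CI script; a Python equivalent of the same check is sketched below for clarity. It keys on the `ERROR: LeakSanitizer:` / `ERROR: AddressSanitizer:` headers that open each sanitizer report, so fully suppressed leaks (which print no such header) do not trip it. The function names and the log path in the usage note are hypothetical.

```python
import re
import sys

# Sanitizer reports open with lines such as:
#   ==12345==ERROR: LeakSanitizer: detected memory leaks
#   ==12345==ERROR: AddressSanitizer: heap-use-after-free on address ...
REPORT = re.compile(r"ERROR: (?:LeakSanitizer|AddressSanitizer):")


def sanitizer_hits(log_text):
    """Return the log lines that open a LeakSanitizer/AddressSanitizer report."""
    return [line for line in log_text.splitlines() if REPORT.search(line)]


def gate(log_path):
    """Exit status for the CI step: 1 if the tee'd log contains a report, else 0."""
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        hits = sanitizer_hits(fh.read())
    for line in hits:
        print(line, file=sys.stderr)
    return 1 if hits else 0
```

Used after pytest as `sys.exit(gate("pytest-asan.log"))`, this fails the step on report text regardless of the process exit code, which is the point when libasan is preloaded into an uninstrumented interpreter.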
Canary + print_suppressions=1 kept for one confirming run (expect the
binding-layer job to go RED now); revert both after.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The canary run confirmed the fix: with the grep gate added, the planted leaks
flipped the binding-layer job to red (previously green on the identical
canary). Removing the scratch bits now: delete the canary test, restore
print_suppressions=0, and drop it from the pytest args. The real fix (gating on
the LeakSanitizer/AddressSanitizer report text in both ASAN jobs) stays.

Across the three TEMP commits, the only lasting change is the report-text gate.
Squash these before merge.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ng layer

Broker-free only: Memcheck's ~20-50x slowdown would break the integration
suite's 1s share-lock timing, so Valgrind runs just the fast
tests/test_ShareConsumer* suite. It adds the one class of bug ASAN misses:
reads of uninitialized memory in cimpl's own C code.

- build-librdkafka-branch.sh: new LIBRDKAFKA_DEBUG mode (--enable-devel
  --disable-optimization), a non-ASAN debug build (ASAN and Valgrind can't co-run).
- Vendor librdkafka's Valgrind suppressions (glibc-TLS/getaddrinfo/OpenSSL
  false positives) as .semaphore/librdkafka.suppressions.
- New strict per-PR block: PYTHONMALLOC=malloc plus memcheck with --track-origins;
  the gate greps the report and fails on any error or any definite/possible leak.
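The per-PR block above could be expressed roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, the Valgrind flags are standard Memcheck options rather than a copy of the PR's script, and the gate parses the standard `ERROR SUMMARY` and leak-summary lines instead of using grep. Only suppression files that exist are passed, since Valgrind treats an unreadable suppressions file as fatal.

```python
import os
import re

SUPPRESSION_FILES = [".semaphore/librdkafka.suppressions",
                     ".semaphore/valgrind-confluent.supp"]


def valgrind_env(base=None):
    env = dict(os.environ if base is None else base)
    # Route CPython allocations through libc malloc so Memcheck tracks them.
    env["PYTHONMALLOC"] = "malloc"
    return env


def valgrind_cmd(test_paths, python="python"):
    cmd = [
        "valgrind", "--tool=memcheck",
        "--track-origins=yes",             # say where uninitialised values came from
        "--leak-check=full",
        "--show-leak-kinds=definite,possible",
        "--gen-suppressions=all",          # print a suppression stanza per report
    ]
    cmd += [f"--suppressions={f}" for f in SUPPRESSION_FILES if os.path.exists(f)]
    return cmd + [python, "-m", "pytest", *test_paths]


ERROR_SUMMARY = re.compile(r"ERROR SUMMARY: ([\d,]+) errors?")
LOST = re.compile(r"(?:definitely|possibly) lost: ([\d,]+) bytes")


def valgrind_failed(log_text):
    """True on any Memcheck error or any definite/possible leak."""
    counts = ERROR_SUMMARY.findall(log_text) + LOST.findall(log_text)
    return any(int(c.replace(",", "")) > 0 for c in counts)
```

Because the argument list is not passed through a shell, the `tests/test_ShareConsumer*` pattern would need expanding first, e.g. `valgrind_cmd(sorted(glob.glob("tests/test_ShareConsumer*")))`.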

First runs may surface CPython interpreter noise that needs suppressing: harvest
entries from the --gen-suppressions output into .semaphore/valgrind-confluent.supp
until the job is green.
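Harvesting can be partly mechanized. With --gen-suppressions=all, Memcheck prints each stanza as unprefixed lines from `{` to `}`, with a name placeholder and a `Memcheck:<kind>` line first. The hypothetical helper below collects and de-duplicates those stanzas from a log; each one should still be reviewed (and ideally renamed and trimmed) before it goes into the .supp file, since blindly appending everything would also hide real bugs.

```python
def harvest_suppressions(log_text):
    """Collect unique Memcheck suppression stanzas printed by --gen-suppressions."""
    stanzas, current = [], None
    for line in log_text.splitlines():
        stripped = line.strip()
        if current is None:
            if stripped == "{":
                current = ["{"]
            continue
        if stripped.startswith("=="):
            # Valgrind-prefixed output inside a "stanza": it was a stray brace.
            current = None
            continue
        if stripped == "}":
            current.append("}")
            # Keep only well-formed stanzas: "{", name, "Memcheck:<kind>", frames...
            if len(current) > 3 and current[2].strip().startswith("Memcheck:"):
                stanzas.append("\n".join(current))
            current = None
        else:
            current.append(line.rstrip())
    return list(dict.fromkeys(stanzas))  # dedupe, keeping first-seen order
```

The result can be written out with `"\n".join(harvest_suppressions(log))` for review.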

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The ASAN/Valgrind share-consumer jobs ran 'artifact push workflow artifacts/' unguarded in their epilogue, so a failed log upload failed the whole job even when the tests and the gate passed cleanly. Add '|| true', matching the cp on the line above.
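The `|| true` guard makes the upload best-effort. For readers scripting the same step in Python, a rough equivalent is sketched below; `best_effort` is a hypothetical helper, not part of the PR. Like `|| true`, it also tolerates the command being missing entirely.

```python
import subprocess
import sys


def best_effort(cmd):
    """Run cmd like `cmd || true`: warn on failure but never raise or fail the step.

    Returns the exit code, or None if the command could not be started.
    """
    try:
        result = subprocess.run(cmd, check=False)
    except OSError as exc:  # e.g. the CLI is not installed on this machine
        print(f"warning: could not start {cmd[0]}: {exc}", file=sys.stderr)
        return None
    if result.returncode != 0:
        print(f"warning: {' '.join(cmd)} exited with {result.returncode}; ignoring",
              file=sys.stderr)
    return result.returncode
```

In the epilogue this would be `best_effort(["artifact", "push", "workflow", "artifacts/"])`, leaving the job's status to the tests and the gate.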

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@sonarqube-confluent


Quality Gate failed

Failed conditions
1.1% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube

Base automatically changed from dev_kip-932_queues-for-kafka to master June 23, 2026 08:53

Labels

None yet

Projects

None yet

1 participant