fix(codeql): batch-4 medium-sev + warnings + log-injection defense#102
Conversation
Addresses 22 of 60+ open warning/medium-sev CodeQL alerts: Medium severity (security): - py/stack-trace-exposure: drop exception message from streaming JSONL error response in nemo_platform_plugin.jobs.api_factory (keep type + line number for diagnostics, drop str(e) which can leak internals). - py/stack-trace-exposure x3: dismissed in vendored switchyard snapshot. - py/bind-socket-all-network-interfaces x3: quickstart port-availability probes and docker reconciler port check bind 127.0.0.1 instead of 0.0.0.0/empty-host — same intent for the local dev/test use case, no security risk surface. - js/http-to-file-access: dismissed — Studio build script with explicit CDN host allowlist + regex-validated version. - js/missing-origin-check: dismissed — auto-generated MSW worker. Warnings (code-quality): - py/redundant-comparison x2: drop dead 'mid != base_id' check after earlier base-branch handles it (provider_reconciler); drop dead MoE branch in heuristics.py reached only on dense path. - py/comparison-of-identical-expressions x2: bind to a local before comparing so cache-identity assertions remain visible to analyzers. - py/file-not-closed x4: convert open().read() to with-statement in copyright_fixer.py. - py/file-not-closed x1: convert yaml.safe_load(open(...)) to with-statement in test_config.py. - py/file-not-closed x2: dismissed (intentional process-long lock fd + pytest.raises in test that never returns an fd). - py/exit-from-finally: move 'return result' out of finally in the customizer training runner so KeyboardInterrupt / SystemExit aren't swallowed. - py/implicit-string-concatenation-in-list x2: wrap multi-line shell command literal in parens so the implicit concat is explicit. - py/regex/duplicate-in-character-class: simplify pytest-marker regex in mcp-dev-tools (the 'andornt' chars were already covered by [a-z]). - py/unnecessary-pass x2: remove dead 'pass' statements after a logger.debug call in two except blocks. - py/unnecessary-delete x2: replace 'del unused_var' with ARG noqa on the parameter list in test helpers. - py/use-of-exit-or-quit x2: replace exit() with raise SystemExit() in two scripts. - js/trivial-conditional: drop redundant 'chipContainer &&' check after the earlier guard in docs api-filter.js. - py/multiple-definition: drop unused initial 'end = len(lines)' in nemo_nb sugar (loop's else branch returns early). Signed-off-by: mschwab <mschwab@nvidia.com>
…efense) Adds _sanitize_log_strings processor that replaces CR/LF/NEL/LS/PS in string fields. Central defense for the 177 py/log-injection CodeQL alerts (each dismissed with reference to this fix). 8 unit tests. Signed-off-by: mschwab <mschwab@nvidia.com>
Documentation preview is readyPreview: https://nvidia-nemo.github.io/nemo-platform/pr-preview/pr-102/pr-102/ Built from This preview is deployed from this PR branch, updates when docs changes are pushed, and will be removed when the PR closes. |
|
- Move _sanitize_log_strings after format_exc_info in the structlog processor chain so exception-message log-forgery vectors are also sanitized (Marta P0; format_exc_info expands exc_info into a multi-line 'exception' field that the old position bypassed). - Replace raw U+2028/U+2029 bytes in _LOG_NEWLINE_CHARS with explicit \u2028\u2029 escapes so the source is editor-stable. - Revert 127.0.0.1 port probes to 0.0.0.0 with S104/B104 suppressions; Docker publishes 0.0.0.0 by default and a loopback probe misses cross-interface conflicts that would later break the container. - Guard self._write_result in the customizer training runner's finally block so a write failure doesn't replace the TrainingResult with an IOError. - Add three observability tests: sanitizer covers the 'event' key, sanitizer runs on multi-line 'exception' fields after format_exc_info, and initialize_logging() actually wires _sanitize_log_strings into the processor chain (regression canary for the 177 dismissed alerts). Signed-off-by: mschwab <mschwab@nvidia.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughPR applies targeted refactoring and hardening across 23 files: adds log-string sanitization to prevent log-forgery, standardizes process termination to use SystemExit, improves error logging, simplifies conditionals, ensures context-managed I/O, tightens tests, and updates port-binding to explicit wildcard addresses. ChangesObservability, Error Handling, and Code Quality
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/nemo_platform_ext/src/nemo_platform_ext/quickstart/preflight.py`:
- Around line 197-200: The probe socket is not closed if sock.bind raises;
change the socket creation to use a context manager so it is closed on both
success and failure—replace "sock = socket.socket(...); try: sock.bind(...);
sock.close()" with "with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as
sock: sock.bind(("0.0.0.0", self.config.host_port))" (keep the same socket
flags/comments) so socket.socket, sock.bind and sock.close are handled
automatically.
In `@packages/nemo_platform_ext/src/nemo_platform_ext/quickstart/validators.py`:
- Around line 217-220: The socket created for port probing (socket.socket(...)
assigned to sock) can leak a file descriptor if bind(...) raises OSError before
sock.close() is reached; change the probe in quickstart/validators.py to use a
context manager or ensure sock is closed in a finally block (e.g., with
socket.socket(...) as sock:) around the bind/close logic so the socket is always
closed even on exceptions.
In `@packages/nmp_common/tests/observability/test_structured_logging.py`:
- Around line 61-77: The test test_initialize_logging_wires_sanitizer_into_chain
mutates global logging state by calling initialize_logging() but only clears
handlers in the finally block; update the test to capture the root logger's
pre-test state (e.g., prior_handlers = root.handlers.copy() and prior_level =
root.level) before calling initialize_logging(), then in the finally block
restore that state by setting root.handlers = prior_handlers (or clearing and
extending) and root.setLevel(prior_level) so the original handlers and level are
fully restored after the test.
In `@services/customizer/src/nmp/customizer/tasks/training/runner.py`:
- Around line 152-156: The current try/except around self._write_result(result)
swallows write failures and still returns the TrainingResult, masking a hard
failure; change this so write errors are not treated as success — either remove
the try/except so the exception propagates, or log the error and re-raise
immediately after logging (do not return result). Update the block around
self._write_result and the return of result to ensure any exception from
_write_result surfaces to callers (or mark result as failed and then raise) so
downstream consumers are not misled.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 11de54ad-9d4d-4d5f-9887-bef2eaa0ccac
📒 Files selected for processing (23)
docs/evaluator/test_doc_examples.pydocs/javascripts/api-filter.jspackages/nemo_evaluator_sdk/tests/execution/test_benchmark_execution.pypackages/nemo_nb/nemo_nb/sphinx.pypackages/nemo_nb/nemo_nb/sugar.pypackages/nemo_platform_ext/src/nemo_platform_ext/quickstart/preflight.pypackages/nemo_platform_ext/src/nemo_platform_ext/quickstart/validators.pypackages/nemo_platform_ext/tests/config/test_config.pypackages/nemo_platform_plugin/src/nemo_platform_plugin/jobs/api_factory.pypackages/nemo_platform_plugin/tests/test_commands.pypackages/nmp_common/src/nmp/common/observability/structured_logging.pypackages/nmp_common/tests/observability/test_structured_logging.pyplugins/nemo-auditor/tests/test_sdk_resources.pyscript/copyright_fixer.pyservices/core/files/src/nmp/core/files/app/file_lock.pyservices/core/jobs/tests/controllers/test_subprocess_backend.pyservices/core/models/src/nmp/core/models/controllers/backends/docker/creation_reconciler.pyservices/core/models/src/nmp/core/models/controllers/provider_reconciler.pyservices/core/models/src/nmp/core/models/parallelism/hueristics.pyservices/customizer/src/nmp/customizer/tasks/training/runner.pyservices/customizer/tests/tasks/training/test_errors.pyservices/intake/scripts/seed-entries.pytools/mcp-dev-tools/nmp_dev_mcp.py
💤 Files with no reviewable changes (4)
- packages/nemo_nb/nemo_nb/sphinx.py
- packages/nemo_nb/nemo_nb/sugar.py
- services/core/files/src/nmp/core/files/app/file_lock.py
- packages/nemo_platform_plugin/src/nemo_platform_plugin/jobs/api_factory.py
Signed-off-by: mschwab <mschwab@nvidia.com>
- preflight/validators: wrap socket creation in 'with' so OSError on bind doesn't leak the file descriptor. - test_structured_logging: chain-presence test now saves/restores root handlers and level instead of dropping them on tear-down. - customizer runner: when _write_result fails inside finally, mark the TrainingResult as failed with the write error rather than returning the unrecorded result as success. Signed-off-by: mschwab <mschwab@nvidia.com>
Signed-off-by: mschwab <mschwab@nvidia.com>
- runner.py: preserve original training error_message when _write_result also fails (append persist error instead of replace). Guard the finally mark-failed with is_coordinator so workers don't construct a failure result. - test_runner.py: add test_run_marks_result_failed_when_write_raises and test_run_appends_persist_error_to_existing_training_error. - test_structured_logging.py: use \u2028 escape literal (was raw byte), add \u2029 case, find ProcessorFormatter by isinstance rather than handlers[0]. - preflight.py, validators.py: setsockopt(SO_REUSEADDR) before bind to avoid Linux TIME_WAIT false positives on the port probe (matches the existing creation_reconciler pattern). Signed-off-by: mschwab <mschwab@nvidia.com>
Signed-off-by: mschwab <mschwab@nvidia.com>
…m-finally fix The py/exit-from-finally CodeQL alert only required moving 'return result' out of the finally block. The write-failure try/except + result-mutation added in review was scope creep beyond the alert; revert to the minimal dedent. _write_result failures now propagate as before. Signed-off-by: mschwab <mschwab@nvidia.com>
- test_benchmark_execution: drop the # noqa: ARG002 by renaming the positional-only stub arg to _increment (matches the dummy-variable convention already used for _kwargs in this file). - api_factory: add a static 'detail' to the streamed validation error so clients get an actionable reason without echoing the raw exception text (full message + traceback still logged server-side). Signed-off-by: mschwab <mschwab@nvidia.com>
…t compat The streamed JSONL validation error previously returned error.message; keep that key (renamed from the temporary detail) so existing clients parsing error.message keep working. Signed-off-by: mschwab <mschwab@nvidia.com>
Summary
Phase 4 of the CodeQL cleanup. Addresses 199 alerts in total — 22 medium-sev + warning fixes in-tree, 177 log-injection alerts closed via a central log-sanitization processor.
Alerts addressed
py/stack-trace-exposurenemo_platform_plugin.jobs.api_factory— dropstr(e)from streaming JSONL error response (keep type + line number for diagnostics). 3 vendored switchyard alerts dismissed (snapshot, edits get overwritten).py/bind-socket-all-network-interfaces127.0.0.1instead of0.0.0.0/empty-host — same semantics for the local dev/test use case, no listening surface.js/http-to-file-accessjs/missing-origin-checkpy/log-injection_sanitize_log_stringsprocessor innmp.common.observability.structured_logging— replaces CR/LF/NEL/LS/PS in string fields before rendering. JSON renderer already escapes these; this guards the plain-text renderer too. 177 alerts dismissed with a back-reference to this fix. 8 unit tests added.py/redundant-comparisonmid != base_idafter earlier base-branch handles it; drop dead MoE branch in heuristics reached only on dense path.py/comparison-of-identical-expressionspy/file-not-closedopen().read()towith-statement (copyright_fixer.py×4,test_config.py×1). 2 FPs dismissed (intentional process-long lock fd +pytest.raisestest that never returns an fd).py/exit-from-finallyreturn resultout offinallyso KeyboardInterrupt/SystemExit aren't swallowed in the customizer training runner.py/implicit-string-concatenation-in-listpy/regex/duplicate-in-character-classn/achars were already covered by[a-z].py/unnecessary-passpassafterlogger.debugin two except blocks.py/unnecessary-deletedel unused_varwith# noqa: ARGon the parameter list in test helpers.py/use-of-exit-or-quitexit()withraise SystemExit()in two scripts.js/trivial-conditionalchipContainer &&check after the earlier guard.py/multiple-definitionend = len(lines)in nemo_nb sugar (loop's else branch returns early). 14 sites still open for a follow-up.Notable: log-injection central defense
177
py/log-injectionsites all funnel through the structlog handler chain innmp.common.observability.structured_logging. Rather than edit 177 individual log statements (each of which would bef"… {workspace}/{name} …"-style fields constrained byNAME_PATTERNand therefore unlikely to contain newlines in practice), the new_sanitize_log_stringsprocessor strips newline-class characters from any string value in the event dict before the renderer runs.The JSON renderer already escapes control characters, so this is primarily defense-in-depth for the plain-text console renderer. Each alert was dismissed with a back-reference to this commit.
Test plan
uv run pytest packages/nmp_common/tests/observability/test_structured_logging.py— 8 pass (new tests)uv run pytest plugins/nemo-auditor/tests/test_sdk_resources.py— 2 pass (cache-identity tests still verify what they intended)ruff checkon all touched files — cleanStacking
Branch is off
main. PRs #91/#98/#100 ahead; no conflicts expected.Summary by CodeRabbit
New Features
Bug Fixes
Tests
Refactor