
Remote execution implementation #939

Open

tmckayus wants to merge 8 commits into NVIDIA:main from tmckayus:grpc-server-v4

Conversation

@tmckayus
Contributor

@tmckayus tmckayus commented Mar 6, 2026

This change replaces the solve_lp_remote and solve_mip_remote stubs with real routines that use an embedded gRPC client to communicate with a remote cuOpt server.

Two documents are included, GRPC_ARCHITECTURE.md and SERVER_ARCHITECTURE.md, that act as developer guides.

The server is built by build.sh and is called cuopt_grpc_server.

Remote execution is still enabled via the env vars CUOPT_REMOTE_HOST and CUOPT_REMOTE_PORT.

To try this feature, check out the branch/PR and do the following:

$ build.sh
$ cuopt_grpc_server  # defaults are fine; run with -h for help
$ CUOPT_REMOTE_HOST=localhost CUOPT_REMOTE_PORT=8765 cuopt_cli myproblem.mps

All cuopt APIs pick up remote execution when the env vars are set: cuopt_cli, the C API, and the Python API will all solve problems on the server as long as it is running. Just use cuopt tools and APIs as you normally would for a local solve.
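As a sketch of how that env-var routing might look inside an API entry point (illustrative only; `remote_target_t` and `remote_target_from_env()` are not the PR's actual symbols):

```cpp
#include <cstdlib>
#include <optional>
#include <string>

// Hypothetical helper: decide local vs. remote solve from the env vars
// the PR description names (CUOPT_REMOTE_HOST, CUOPT_REMOTE_PORT).
struct remote_target_t {
  std::string host;
  int port;
};

std::optional<remote_target_t> remote_target_from_env()
{
  const char* host = std::getenv("CUOPT_REMOTE_HOST");
  const char* port = std::getenv("CUOPT_REMOTE_PORT");
  if (host == nullptr || port == nullptr) {
    return std::nullopt;  // either var missing: solve locally
  }
  return remote_target_t{host, std::atoi(port)};  // both set: solve remotely
}
```

An entry point like solve_lp could then branch on whether the optional is populated, which matches the "just use the tools as you normally would" behavior described above.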

@tmckayus tmckayus requested review from a team as code owners March 6, 2026 14:56
@tmckayus tmckayus requested review from Iroy30, aliceb-nv, bdice and rg20 March 6, 2026 14:56
@tmckayus tmckayus added feature request New feature or request non-breaking Introduces a non-breaking change labels Mar 6, 2026
@coderabbitai

coderabbitai bot commented Mar 6, 2026

📝 Walkthrough

Adds a complete gRPC remote execution subsystem: protobuf/gRPC schemas and codegen, a gRPC client and server (multi-process workers, shared memory and pipe IPC, chunked upload/download), bidirectional mappers, large test suites (unit + integration with TLS/mTLS), build/CI/conda packaging changes, and removal of previous remote stub implementations.

Changes

- **Build & Packaging** (`build.sh`, `cpp/CMakeLists.txt`, `python/libcuopt/CMakeLists.txt`, `cpp/src/pdlp/CMakeLists.txt`, `cpp/tests/linear_programming/CMakeLists.txt`, `cpp/tests/linear_programming/grpc/CMakeLists.txt`, `dependencies.yaml`, `conda/recipes/libcuopt/recipe.yaml`, `conda/environments/*`, `ci/utils/install_protobuf_grpc.sh`, `ci/build_wheel_libcuopt.sh`): Expose the new `cuopt_grpc_server` target; wire gRPC/protobuf codegen and linking; add a CI script to build/install protobuf/gRPC/abseil; add packaging and conda deps (openssl, c-ares, libuuid); test wiring for the server.
- **Protobuf Definitions** (`cpp/src/grpc/cuopt_remote.proto`, `cpp/src/grpc/cuopt_remote_service.proto`): Add protobuf schema for problems/results/settings/chunked-transfer metadata and the full `CuOptRemoteService` gRPC service with job lifecycle and chunk APIs.
- **Client Library & Remote Entrypoints** (`cpp/src/grpc/client/grpc_client.hpp`, `cpp/src/grpc/client/grpc_client.cpp`, `cpp/src/grpc/client/solve_remote.cpp`): New `grpc_client_t` public API and implementation (TLS, unary and chunked upload/download, submit/poll/get-result, log/incumbent streaming) and `solve_*_remote` wrappers used by local APIs.
- **Mapping Utilities** (`cpp/src/grpc/grpc_problem_mapper.{hpp,cpp}`, `cpp/src/grpc/grpc_settings_mapper.{hpp,cpp}`, `cpp/src/grpc/grpc_solution_mapper.{hpp,cpp}`, `cpp/src/grpc/grpc_service_mapper.{hpp,cpp}`): Bidirectional mappers between CPU types and protobufs; header builders and array-chunk builders; settings/status conversions and explicit template instantiations.
- **Server Core Types & IO** (`cpp/src/grpc/server/grpc_server_types.hpp`, `cpp/src/grpc/server/grpc_pipe_io.cpp`, `cpp/src/grpc/server/grpc_pipe_serialization.hpp`, `cpp/src/grpc/server/grpc_field_element_size.hpp`, `cpp/src/grpc/server/grpc_incumbent_proto.hpp`): Shared-memory job/result queue types, pipe read/write helpers, chunked request/result wire-format utilities, an array element-size helper, and incumbent proto helpers.
- **Server Entry & Worker Infra** (`cpp/src/grpc/server/grpc_server_main.cpp`, `cpp/src/grpc/server/grpc_worker_infra.cpp`, `cpp/src/grpc/server/grpc_worker.cpp`): Server main (CLI, TLS, worker spawning), per-worker pipe creation and lifecycle, the `worker_process` implementation for LP/MIP solves, incumbent forwarding, and result publication.
- **Server Threads & Job Management** (`cpp/src/grpc/server/grpc_server_threads.cpp`, `cpp/src/grpc/server/grpc_job_management.cpp`): Background threads (worker monitor, result/incumbent retrieval, session reaper); async submission, chunked-job handling, job tracking, cancellation, and log management.
- **gRPC Service Implementation** (`cpp/src/grpc/server/grpc_service_impl.cpp`, `cpp/src/grpc/server/grpc_server_main.cpp`): Full `CuOptRemoteService` implementation: SubmitJob, chunked upload/download RPCs, GetResult/CheckStatus/Cancel/Delete/Wait, StreamLogs, GetIncumbents, session state machines, validation, and a factory function.
- **Pipe Serialization & Tests** (`cpp/src/grpc/server/grpc_pipe_serialization.hpp`, `cpp/tests/linear_programming/grpc/grpc_pipe_serialization_test.cpp`): Wire-format helpers and unit tests validating chunked request/result serialization and multi-chunk assembly over pipes.
- **Tests & Test Utilities** (`cpp/tests/linear_programming/grpc/*`, `cpp/tests/linear_programming/grpc/grpc_client_test.cpp`, `cpp/tests/linear_programming/grpc/grpc_integration_test.cpp`, `cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp`, `cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp`, `cpp/tests/linear_programming/c_api_tests/c_api_tests.cpp`, `python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`): Large additions: client unit tests with mock stubs; integration tests that spawn `cuopt_grpc_server` (including TLS/mTLS); test log capture utilities; test helpers to inject mock stubs; C API and Python tests updated to use live server fixtures.
- **Server Service Mapper & Helpers** (`cpp/src/grpc/grpc_service_mapper.{hpp,cpp}`, `cpp/src/grpc/grpc_settings_mapper.{hpp,cpp}`, `cpp/src/grpc/grpc_solution_mapper.{hpp,cpp}`): Builders for SubmitJob requests and settings/solution conversions; implementations and explicit instantiations used by client and service.
- **Removed Old Remote Stubs** (`cpp/src/pdlp/solve_remote.cu`): Deleted the previous placeholder remote-stub implementations and template instantiations; remote execution paths now use the gRPC client/server.
- **Documentation** (`GRPC_INTERFACE.md`, `GRPC_QUICK_START.md`, `GRPC_SERVER_ARCHITECTURE.md`): New documentation for the gRPC API, quick-start (including TLS/mTLS), server architecture, chunked protocol, and configuration.
- **Packaging / Conda Manifests** (`conda/...`, `dependencies.yaml`): Conda/build manifest updates to include openssl and c-ares; recipe updated to build/install protobuf/gRPC and include `cuopt_grpc_server` in package outputs.
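The chunked upload/download described in the walkthrough amounts to splitting a large array into fixed-size slices that the receiver reassembles in order. A minimal sketch of that splitting (`make_chunks` is a hypothetical helper, not the PR's protobuf machinery):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Split `data` into contiguous slices of at most `chunk_size` bytes.
// The last chunk may be shorter; the receiver concatenates them in order.
std::vector<std::vector<uint8_t>> make_chunks(const std::vector<uint8_t>& data,
                                              size_t chunk_size)
{
  std::vector<std::vector<uint8_t>> chunks;
  for (size_t off = 0; off < data.size(); off += chunk_size) {
    const size_t len = std::min(chunk_size, data.size() - off);
    chunks.emplace_back(data.begin() + off, data.begin() + off + len);
  }
  return chunks;
}
```

In the PR the chunk size is presumably bounded by the gRPC max-message limit, which is why the client exposes both a chunk-size and a max-message-bytes knob.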

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 13

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py (1)

152-163: ⚠️ Potential issue | 🟠 Major

Strengthen this warmstart test to assert numerical correctness, not just non-None outputs.

After adding CUOPT_PRESOLVE at Line 152, the test still only checks that a primal solution exists (Line 162). Please also validate warmstart numerical correctness (e.g., objective consistency/improvement within tolerance).

Proposed test-strengthening diff
     sol1 = linear_programming.Solve(dm, settings)
     ws = sol1.get_pdlp_warm_start_data()
+    obj1 = sol1.get_primal_objective()

     if ws is not None:
         settings.set_pdlp_warm_start_data(ws)
         settings.set_parameter(CUOPT_ITERATION_LIMIT, 200)
         sol2 = linear_programming.Solve(dm, settings)
         assert sol2.get_primal_solution() is not None
+        obj2 = sol2.get_primal_objective()
+        assert obj2 is not None
+        assert abs(obj2 - obj1) / max(1.0, abs(obj1)) < 1e-6

As per coding guidelines, **/*test*.{cpp,cu,py} must “Write tests validating numerical correctness of optimization results (not just 'runs without error').”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`
around lines 152 - 163, The test currently only asserts
sol2.get_primal_solution() is not None; strengthen it by checking numerical
correctness: retrieve both primal solutions (via sol1.get_primal_solution() and
sol2.get_primal_solution()), compute their objective values (either via
solX.get_primal_objective() if available or by forming the dot product of the
primal vector with the problem cost from dm), then assert the objective values
are consistent within a small tolerance (e.g., abs(obj1 - obj2) <= 1e-6) or that
sol2's objective is no worse (<=) than sol1's given the increased
CUOPT_ITERATION_LIMIT; keep tolerance and exact assertion choice explicit in the
test.
🟠 Major comments (18)
cpp/src/grpc/server/grpc_worker.cpp-56-74 (1)

56-74: ⚠️ Potential issue | 🟠 Major

Potential race condition in result queue access.

store_simple_result iterates through result_queue looking for an empty slot without synchronization. If multiple workers call this concurrently, they could both find the same slot with !result_queue[i].ready and overwrite each other's results.

Consider using an atomic compare-exchange on ready similar to the job claiming pattern (lines 86-87), or protect access with a mutex.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_worker.cpp` around lines 56 - 74,
store_simple_result currently scans result_queue and sets a slot when
result_queue[i].ready is false without synchronization, causing a race where
multiple callers can claim the same slot; update this function to claim a slot
atomically (e.g., change result_queue[i].ready to an std::atomic<bool> and
perform an atomic compare-exchange to flip false->true before writing the slot)
or surround the scan-and-write with a mutex; ensure you only write the other
fields (job_id, status, data_size, worker_index, error_message, retrieved) after
successfully claiming the slot to avoid concurrent overwrites.
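The atomic slot-claiming pattern this comment suggests can be sketched as follows (`result_slot_t`, `MAX_RESULTS`, and `claim_slot` are illustrative stand-ins, not the PR's actual types; in the PR the queue lives in shared memory, where this works as long as `std::atomic<bool>` is lock-free):

```cpp
#include <array>
#include <atomic>

// Illustrative stand-in for the PR's shared-memory result queue slot.
struct result_slot_t {
  std::atomic<bool> claimed{false};
  int job_id = -1;
};

constexpr int MAX_RESULTS = 8;
std::array<result_slot_t, MAX_RESULTS> result_queue;

// Claim a free slot with compare-exchange so only one caller can flip
// `claimed` false->true; returns the slot index, or -1 if the queue is full.
int claim_slot(int job_id)
{
  for (int i = 0; i < MAX_RESULTS; ++i) {
    bool expected = false;
    if (result_queue[i].claimed.compare_exchange_strong(expected, true)) {
      result_queue[i].job_id = job_id;  // safe: the slot is now exclusively ours
      return i;
    }
  }
  return -1;  // queue full; caller must handle
}
```

Only after the compare-exchange succeeds are the remaining fields written, so two workers can never publish into the same slot.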
cpp/src/grpc/server/grpc_server_types.hpp-55-64 (1)

55-64: 🛠️ Refactor suggestion | 🟠 Major

Avoid using directives in headers.

The using declarations (lines 55-61) and using namespace directive (line 63) pollute the global namespace for any file that includes this header. This can cause unexpected name collisions.

Suggested fix: Use qualified names instead

Remove these lines and use fully qualified names (e.g., grpc::Server, grpc::Status) in the implementation files that include this header. Alternatively, move these declarations inside the #ifdef CUOPT_ENABLE_GRPC guard's associated .cpp files only.

-using grpc::Server;
-using grpc::ServerBuilder;
-using grpc::ServerContext;
-using grpc::ServerReaderWriter;
-using grpc::ServerWriter;
-using grpc::Status;
-using grpc::StatusCode;
-
-using namespace cuopt::linear_programming;
-// Note: NOT using "using namespace cuopt::remote" to avoid JobStatus enum conflict
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_types.hpp` around lines 55 - 64, Remove the
header-level using declarations for grpc types (grpc::Server,
grpc::ServerBuilder, grpc::ServerContext, grpc::ServerReaderWriter,
grpc::ServerWriter, grpc::Status, grpc::StatusCode) and the using namespace
cuopt::linear_programming; to avoid leaking names (and the JobStatus enum
conflict). Replace usages in this header with fully-qualified names (e.g.,
grpc::Server, cuopt::linear_programming::Whatever) or move the using
declarations into the corresponding .cpp files (or inside the CUOPT_ENABLE_GRPC
guard in .cpp) so only implementation files import those namespaces; ensure any
references in grpc_server_types.hpp are updated to the qualified symbols.
cpp/src/grpc/server/grpc_worker_infra.cpp-194-205 (1)

194-205: ⚠️ Potential issue | 🟠 Major

Handle full result_queue when marking worker failures.

If all result slots are occupied, this path drops the terminal failure record. A waiting client can then stall even though the job already failed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_worker_infra.cpp` around lines 194 - 205, The
current loop that writes a terminal failure into result_queue stops if every
slot is already marked ready, which can drop the failure; modify the failure
handling logic so that if no slot with !ready is found it will (1) search for a
slot with ready && retrieved (i.e., already consumed) and reuse it, using the
same writes (copy_cstr to job_id/error_message, set status, data_size,
worker_index, retrieved=false, ready=true), and (2) if none exist, as a last
resort overwrite the oldest/first slot (e.g., index 0) to guarantee the client
gets a terminal response; update the block around result_queue/MAX_RESULTS and
reuse the existing copy_cstr calls so the behavior is deterministic.
cpp/src/grpc/server/grpc_server_main.cpp-50-60 (1)

50-60: ⚠️ Potential issue | 🟠 Major

Harden numeric CLI parsing and range validation.

Line 53/55/57/59 use stoi/stoll without exception handling or range checks. Invalid input or negative values (e.g., workers/port) can crash startup or create invalid runtime config.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 50 - 60, The CLI
numeric parsing loop that assigns config.port, config.num_workers,
config.max_message_mb and config.max_message_b uses std::stoi/std::stoll without
error handling or range checks; wrap each conversion in a try/catch for
std::invalid_argument and std::out_of_range, validate values (e.g., port > 0 &&
port <= 65535, num_workers > 0, max_message_mb >= 0, max_message_b >= 4096) and
on invalid input log an error via the existing logger (or std::cerr) and
return/exit with non‑zero status or fall back to safe defaults; update the
parsing branches that set config.port, config.num_workers, config.max_message_mb
and the --max-message-bytes branch that sets config.max_message_b to use these
guarded conversions and checks.
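One way to get the guarded conversion this comment asks for is `std::from_chars`, which is exception-free and reports both parse failures and out-of-range input (the helper name and bounds below are illustrative, not the PR's code):

```cpp
#include <charconv>
#include <optional>
#include <string_view>

// Parse an integer CLI argument, rejecting junk, trailing characters,
// and anything outside [lo, hi]; returns nullopt instead of throwing.
std::optional<int> parse_int_in_range(std::string_view s, int lo, int hi)
{
  int value = 0;
  auto [ptr, ec] = std::from_chars(s.data(), s.data() + s.size(), value);
  if (ec != std::errc{} || ptr != s.data() + s.size()) {
    return std::nullopt;  // not a number, trailing junk, or overflow
  }
  if (value < lo || value > hi) {
    return std::nullopt;  // e.g. a port must be in [1, 65535]
  }
  return value;
}
```

The caller can then print a usage message and exit non-zero on `nullopt` rather than crashing in `std::stoi`.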
cpp/src/grpc/server/grpc_worker_infra.cpp-120-124 (1)

120-124: ⚠️ Potential issue | 🟠 Major

Close child-end FDs on fork failure to prevent descriptor leaks.

On fork failure, only server-side pipe ends are closed. Parent-owned child ends (worker_read_fd, worker_write_fd, worker_incumbent_write_fd) remain open.

🔧 Minimal fix
   if (pid < 0) {
     std::cerr << "[Server] Failed to fork " << (is_replacement ? "replacement worker " : "worker ")
               << worker_id << "\n";
     close_worker_pipes_server(worker_id);
+    close_worker_pipes_child_ends(worker_id);
     return -1;
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_worker_infra.cpp` around lines 120 - 124, On fork
failure in grpc_worker_infra.cpp (the pid < 0 branch), close the parent-owned
child-end file descriptors to avoid leaks: ensure worker_read_fd,
worker_write_fd, and worker_incumbent_write_fd are closed (if valid) in addition
to calling close_worker_pipes_server(worker_id); update the pid < 0 error path
in the code that creates workers so it explicitly closes those descriptors
before returning -1.
cpp/src/grpc/client/solve_remote.cu-55-57 (1)

55-57: ⚠️ Potential issue | 🟠 Major

Validate env overrides before applying chunk/message sizes.

CUOPT_CHUNK_SIZE and CUOPT_MAX_MESSAGE_BYTES are accepted as-is; zero/negative/too-small values can break chunk sizing and message-limit behavior.

✅ Minimal validation example
 static void apply_env_overrides(grpc_client_config_t& config)
 {
-  config.chunk_size_bytes  = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
-  config.max_message_bytes = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
+  auto chunk_size = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
+  auto max_bytes  = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
+
+  config.chunk_size_bytes = std::max<int64_t>(4096, chunk_size);
+  config.max_message_bytes =
+    (max_bytes <= 0) ? config.max_message_bytes : std::max<int64_t>(4096, max_bytes);
   config.enable_debug_log  = (parse_env_int64("CUOPT_GRPC_DEBUG", 0) != 0);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/solve_remote.cu` around lines 55 - 57, The code currently
applies CUOPT_CHUNK_SIZE and CUOPT_MAX_MESSAGE_BYTES directly which can allow
zero/negative/too-small values to break chunking; after calling parse_env_int64
for these into config.chunk_size_bytes and config.max_message_bytes, validate
each parsed value against sensible thresholds (e.g., >0 and >= a defined
MIN_CHUNK_SIZE / MIN_MESSAGE_BYTES constant) and only assign the override if it
passes validation; if invalid, keep the existing default and emit a warning/log
mentioning the env var name and rejected value (use the same symbols
config.chunk_size_bytes, config.max_message_bytes, parse_env_int64 and add
MIN_CHUNK_SIZE / MIN_MESSAGE_BYTES constants or inline thresholds).
cpp/src/grpc/client/solve_remote.cu-153-166 (1)

153-166: ⚠️ Potential issue | 🟠 Major

Don’t forward incumbent objective as the callback bound.

Line 164 sets bound_copy equal to incumbent objective. That can make gap-sensitive callback logic interpret the incumbent as already tightly bounded, which is semantically wrong when a true best bound is unavailable.

Based on learnings: objective direction is encoded by objective-sense metadata, and callback bounds should use direction-appropriate “no bound” values when the true bound is not available.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/solve_remote.cu` around lines 153 - 166, The incumbent
callback is incorrectly using the incumbent objective as the callback bound
(bound_copy), which can mislead gap-sensitive logic; instead, set bound_copy to
a direction-appropriate "no bound" sentinel (e.g., +infinity for minimization,
-infinity for maximization) rather than objective. In the
config.incumbent_callback lambda (referencing mip_callbacks and
get_solution_callback_t::get_solution), query the objective sense from the
callback or problem metadata (e.g., a get_objective_sense() or equivalent) and
assign bound_copy = std::numeric_limits<double>::infinity() for minimization or
-std::numeric_limits<double>::infinity() for maximization before calling
get_solution; do not forward the incumbent objective as the bound.
cpp/src/grpc/server/grpc_server_threads.cpp-253-263 (1)

253-263: ⚠️ Potential issue | 🟠 Major

Synchronize worker_pipes access across polling and worker-restart paths.

This thread reads worker_pipes (lines 254–255) while the monitor thread replaces dead workers via spawn_single_worker() and close_worker_pipes_server(), which modify the same entries without a shared lock. This is a data race: the vector can reallocate during concurrent modifications, and file descriptors can be closed while the incumbent thread polls them, causing undefined behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_threads.cpp` around lines 253 - 263, The
polling code reads shared worker_pipes without synchronization, causing a data
race with spawn_single_worker() / close_worker_pipes_server() which mutate that
vector; fix by introducing and using a shared mutex (e.g., worker_pipes_mutex)
that both the polling thread and the worker-restart code lock when
accessing/modifying worker_pipes, and change the polling path (the loop that
builds pfds) to take the mutex, copy the needed file descriptors into a local
vector<int> (or local pfds) while holding the lock, then release the lock before
calling poll() so the poll is not done while holding the mutex; also ensure
close_worker_pipes_server() and spawn_single_worker() acquire the same mutex
when closing/removing entries to avoid closing fds being polled.
cpp/src/grpc/grpc_settings_mapper.cu-144-146 (1)

144-146: ⚠️ Potential issue | 🟠 Major

Guard iteration_limit narrowing conversion.

Line 145 casts protobuf int64 directly into i_t. Oversized values can overflow/truncate and corrupt solver limits.

Proposed fix
-  if (pb_settings.iteration_limit() >= 0) {
-    settings.iteration_limit = static_cast<i_t>(pb_settings.iteration_limit());
-  }
+  if (pb_settings.iteration_limit() >= 0) {
+    const auto iter_limit = pb_settings.iteration_limit();
+    if (iter_limit > static_cast<int64_t>(std::numeric_limits<i_t>::max())) {
+      settings.iteration_limit = std::numeric_limits<i_t>::max();
+    } else {
+      settings.iteration_limit = static_cast<i_t>(iter_limit);
+    }
+  }

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero, and use epsilon comparisons for floating-point equality checks."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_settings_mapper.cu` around lines 144 - 146, The code
currently casts pb_settings.iteration_limit() directly to i_t (setting
settings.iteration_limit) which can overflow; update the assignment in the block
that checks pb_settings.iteration_limit() to first compare the int64 value
against the representable range of i_t (use std::numeric_limits<i_t>::max() and
min()/0 as appropriate), and then either clamp to that range or return/log an
error before assigning to settings.iteration_limit; reference
pb_settings.iteration_limit(), settings.iteration_limit, and the i_t type when
making this change so the narrowing conversion is guarded.
cpp/src/grpc/server/grpc_job_management.cpp-109-113 (1)

109-113: ⚠️ Potential issue | 🟠 Major

Cap incumbent payload size before data.resize(size).

Line 111 trusts wire-provided size and can allocate arbitrarily large memory.

Proposed fix
 bool recv_incumbent_pipe(int fd, std::vector<uint8_t>& data)
 {
   uint64_t size;
   if (!read_from_pipe(fd, &size, sizeof(size))) return false;
+  constexpr uint64_t kMaxIncumbentBytes = 64ULL * 1024 * 1024;
+  if (size > kMaxIncumbentBytes) {
+    std::cerr << "[Server] Incumbent payload too large: " << size << "\n";
+    return false;
+  }
   data.resize(size);
   if (size > 0 && !read_from_pipe(fd, data.data(), size)) return false;
   return true;
 }

As per coding guidelines "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 109 - 113, The code
reads a wire-provided uint64_t size and calls data.resize(size) without
validation; before calling data.resize (in the block that uses read_from_pipe,
size and data), validate and cap size against a defined maximum payload (e.g.
kMaxPayloadSize) and reject sizes that are zero or exceed the cap by returning
false; use the same symbols from the snippet (uint64_t size, read_from_pipe(fd,
&size,...), data.resize(size)) so the check occurs immediately after reading
size and before any allocation or further read_from_pipe calls.
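The check-before-allocate pattern this comment describes looks roughly like the sketch below; `kMaxPayloadBytes` and the in-memory `wire` buffer are illustrative (the PR reads from a pipe instead), and host byte order is assumed since writer and reader run on the same machine:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr uint64_t kMaxPayloadBytes = 64ull * 1024 * 1024;

// Read a u64 length prefix, reject oversized or truncated payloads,
// and only then allocate and copy the body into `out`.
bool parse_length_prefixed(const std::vector<uint8_t>& wire,
                           std::vector<uint8_t>& out)
{
  if (wire.size() < sizeof(uint64_t)) { return false; }
  uint64_t size = 0;
  std::memcpy(&size, wire.data(), sizeof(size));
  if (size > kMaxPayloadBytes) { return false; }            // cap before resize
  if (wire.size() - sizeof(size) < size) { return false; }  // truncated payload
  const auto begin = wire.begin() + static_cast<std::ptrdiff_t>(sizeof(size));
  out.assign(begin, begin + static_cast<std::ptrdiff_t>(size));
  return true;
}
```

The key ordering is that the cap check happens before any `resize`/`assign`, so a hostile or corrupted length can never drive a multi-gigabyte allocation.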
cpp/src/grpc/server/grpc_service_impl.cpp-699-716 (1)

699-716: ⚠️ Potential issue | 🟠 Major

StreamLogs can block indefinitely for terminal jobs without a log file.

The wait loop only exits on file existence or NOT_FOUND; a completed/failed/cancelled job with no log file never terminates the RPC.

Proposed fix
       if (waited_ms >= 2000) {
         std::string msg;
         JobStatus s = check_job_status(job_id, msg);
         if (s == JobStatus::NOT_FOUND) {
           if (config.verbose) {
             std::cout << "[gRPC] StreamLogs job not found: " << job_id << std::endl;
           }
           return Status(grpc::StatusCode::NOT_FOUND, "Job not found: " + job_id);
         }
+        if (s == JobStatus::COMPLETED || s == JobStatus::FAILED || s == JobStatus::CANCELLED) {
+          cuopt::remote::LogMessage done;
+          done.set_line("");
+          done.set_byte_offset(from_byte);
+          done.set_job_complete(true);
+          writer->Write(done);
+          return Status::OK;
+        }
         waited_ms = 0;
       }

As per coding guidelines "Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 699 - 716, The
StreamLogs loop can hang for terminal jobs that never produce a log file; modify
the loop (in grpc_service_impl::StreamLogs) to call check_job_status(job_id,
msg) when the file is still missing and, if the returned JobStatus is a terminal
state (e.g., JobStatus::COMPLETED, JobStatus::FAILED, JobStatus::CANCELLED or
other terminal enums your code defines), break out and return a gRPC error (for
example Status(grpc::StatusCode::NOT_FOUND, "Log file not available for job: " +
job_id)) or an appropriate Status; keep the existing verbose logging path
(config.verbose) and context->IsCancelled() checks, and retain the waited_ms
retry/backoff behavior but ensure terminal-state handling resets/short-circuits
the wait to avoid indefinite blocking.
cpp/src/grpc/server/grpc_service_impl.cpp-189-193 (1)

189-193: ⚠️ Potential issue | 🟠 Major

SendArrayChunk allows unbounded duplicate/overlapping payload growth.

meta.received_bytes and state.total_bytes are incremented unconditionally; repeated overlapping chunks can inflate memory usage until OOM.

Proposed guard
-    meta.received_bytes += static_cast<int64_t>(raw.size());
+    if (meta.received_bytes + static_cast<int64_t>(raw.size()) > array_bytes) {
+      return Status(StatusCode::INVALID_ARGUMENT, "Received bytes exceed declared array size");
+    }
+    meta.received_bytes += static_cast<int64_t>(raw.size());
     state.total_bytes += static_cast<int64_t>(raw.size());

As per coding guidelines "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 189 - 193,
SendArrayChunk currently unconditionally increments meta.received_bytes and
state.total_bytes and pushes state.chunks, allowing duplicate/overlapping chunks
to inflate memory; modify SendArrayChunk to first validate the incoming chunk
against already-received ranges (e.g., track per-array received byte ranges in
state or use an interval set) and skip or trim duplicate/overlapping payloads
before modifying meta.received_bytes, state.total_bytes, state.chunks, and
state.total_chunks; also enforce a global per-array max size check and reject or
cap chunks that would push state.total_bytes beyond that limit to prevent
unbounded growth.
cpp/src/grpc/grpc_settings_mapper.cu-154-154 (1)

154-154: ⚠️ Potential issue | 🟠 Major

Validate protobuf presolver value before casting to presolver_t.

Line 154 and Line 221 trust unvalidated wire values; invalid integers become unsupported internal enum states.

As per coding guidelines "Validate algorithm correctness in optimization logic: simplex pivots, branch-and-bound decisions, routing heuristics, and constraint/objective handling must produce correct results."

Also applies to: 221-221

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_settings_mapper.cu` at line 154, The code casts the
protobuf integer pb_settings.presolver() directly into presolver_t (see
assignment to settings.presolver and the similar use at the other spot), which
can produce invalid enum states for out-of-range wire values; before
static_casting to presolver_t, validate the integer is within the defined
presolver_t range (e.g., compare against the enum's min/max or use a switch/case
for known values), handle unknown values by logging an error/warning and
selecting a safe default (or returning an error), and then assign the validated
value to settings.presolver to avoid unsupported internal enum states.
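A switch over known values is one simple way to do the validation this comment asks for; the enum values below are illustrative stand-ins for the real `presolver_t`, and the point is that the wire integer is checked before any `static_cast`:

```cpp
// Hypothetical enum values; the real presolver_t lives in the cuOpt headers.
enum class presolver_t : int { none = 0, fast = 1, full = 2 };

// Returns false for unknown wire values so the caller can log a warning
// and keep its existing default instead of storing an invalid enum state.
bool to_presolver(int wire_value, presolver_t& out)
{
  switch (wire_value) {
    case 0: out = presolver_t::none; return true;
    case 1: out = presolver_t::fast; return true;
    case 2: out = presolver_t::full; return true;
    default: return false;
  }
}
```

A switch (rather than a min/max range check) also stays correct if the enum ever gains gaps in its numbering.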
cpp/tests/linear_programming/grpc/grpc_integration_test.cpp-673-699 (1)

673-699: ⚠️ Potential issue | 🟠 Major

SolveInfeasibleLP can pass without actually validating infeasibility behavior.

The assertion is gated by if (result.success && result.solution), so transport or solve regressions can still produce a passing test.

Proposed fix
   auto result = client->solve_lp(problem, settings);
-  if (result.success && result.solution) {
-    EXPECT_NE(result.solution->get_termination_status(), pdlp_termination_status_t::Optimal);
-  }
+  ASSERT_TRUE(result.success) << result.error_message;
+  ASSERT_NE(result.solution, nullptr);
+  EXPECT_NE(result.solution->get_termination_status(), pdlp_termination_status_t::Optimal);

As per coding guidelines "**/*test*.{cpp,cu,py}: Write tests validating numerical correctness of optimization results (not just 'runs without error'); test degenerate cases (infeasible, unbounded, empty, singleton problems)."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp` around lines 673
- 699, The test SolveInfeasibleLP currently skips assertions when the solver
failed or returned no solution; replace the guarded if-check with hard
assertions so the test fails on transport/solve regressions: assert that
result.success is true (ASSERT_TRUE(result.success)), assert that
result.solution is non-null (ASSERT_NE(result.solution, nullptr) or
ASSERT_TRUE(result.solution)), and then assert the termination status equals
pdlp_termination_status_t::Infeasible
(ASSERT_EQ(result.solution->get_termination_status(),
pdlp_termination_status_t::Infeasible)) so the test verifies infeasibility
behavior unconditionally.
cpp/src/grpc/grpc_problem_mapper.cu-175-182 (1)

175-182: ⚠️ Potential issue | 🟠 Major

Validate lower/upper bound array sizes before applying both.

Line 175 only checks constraint_lower_bounds_size() > 0; a malformed payload with mismatched upper-size still gets applied.

Proposed guard
-  if (pb_problem.constraint_lower_bounds_size() > 0) {
+  if (pb_problem.constraint_lower_bounds_size() > 0 ||
+      pb_problem.constraint_upper_bounds_size() > 0) {
+    if (pb_problem.constraint_lower_bounds_size() != pb_problem.constraint_upper_bounds_size()) {
+      return;  // or propagate an explicit mapping error
+    }
     std::vector<f_t> con_lb(pb_problem.constraint_lower_bounds().begin(),
                             pb_problem.constraint_lower_bounds().end());

As per coding guidelines "Ensure variables and constraints are accessed from the correct problem context (original vs presolve vs folded vs postsolve); verify index mapping consistency across problem transformations."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_problem_mapper.cu` around lines 175 - 182, The code applies
constraint lower and upper bounds when only constraint_lower_bounds_size() > 0
is checked, which allows mismatched arrays; update the guard to validate both
pb_problem.constraint_lower_bounds_size() and
pb_problem.constraint_upper_bounds_size() are > 0 and equal before calling
cpu_problem.set_constraint_lower_bounds(...) and
cpu_problem.set_constraint_upper_bounds(...); if sizes differ or one is zero,
skip applying them (or log/return an error) to avoid out-of-bounds/misaligned
mappings and ensure index consistency across transformations.
cpp/src/grpc/grpc_problem_mapper.cu-454-461 (1)

454-461: ⚠️ Potential issue | 🟠 Major

Require complete CSR/Q triplets before setting matrix data.

Line 454 and Line 526 gate only on values array presence; applying matrices without matching indices/offsets can create malformed models.

Proposed guard pattern
-  if (!a_values.empty()) {
+  if (!a_values.empty() || !a_indices.empty() || !a_offsets.empty()) {
+    if (a_values.empty() || a_indices.empty() || a_offsets.empty()) { return; }
     cpu_problem.set_csr_constraint_matrix(...);
   }
-  if (!q_values.empty()) {
+  if (!q_values.empty() || !q_indices.empty() || !q_offsets.empty()) {
+    if (q_values.empty() || q_indices.empty() || q_offsets.empty()) { return; }
     cpu_problem.set_quadratic_objective_matrix(...);
   }

As per coding guidelines "Validate algorithm correctness in optimization logic: simplex pivots, branch-and-bound decisions, routing heuristics, and constraint/objective handling must produce correct results."

Also applies to: 526-533

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_problem_mapper.cu` around lines 454 - 461, The current
guard only checks a_values before calling cpu_problem.set_csr_constraint_matrix,
which can yield malformed CSR data if a_indices or a_offsets are missing or
their sizes don't match; update the condition to require a_values, a_indices,
and a_offsets to be non-empty and verify their sizes are consistent (e.g.,
a_indices.size() matches a_values.size() and a_offsets.size() matches expected
row count + 1) before calling cpu_problem.set_csr_constraint_matrix, and apply
the same strengthened guard/validation to the other matrix-setting block
referenced around the second call (the similar call later in the file).
cpp/src/grpc/server/grpc_service_impl.cpp-823-824 (1)

823-824: ⚠️ Potential issue | 🟠 Major

Fix incumbent pagination cursor calculation.

next_index is always set to available, which skips data when max_count is smaller than remaining incumbents.

Proposed fix
-    response->set_next_index(available);
+    response->set_next_index(from_index + count);

As per coding guidelines "Validate algorithm correctness in optimization logic: simplex pivots, branch-and-bound decisions, routing heuristics, and constraint/objective handling must produce correct results."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 823 - 824, The
pagination logic incorrectly sets response->set_next_index(available) which
skips items when max_count < remaining; change it to compute the next cursor as
the start index plus the number of items actually returned (e.g., start +
returned_count) or equivalently start + std::min(max_count, available - start),
then call response->set_next_index(...) with that value and set done when that
next index >= available; update the code around
response->set_next_index(available) and the subsequent bool done = ...
expression accordingly.
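Under the assumption that `from_index`, `max_count`, and `available` behave as described in the finding above, the corrected cursor arithmetic can be sketched in isolation (the struct and function names here are illustrative, not the server's actual symbols):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring the proposed fix: compute how many items this
// page returns and where the next request's cursor should start.
struct Page {
  int64_t count;       // items returned in this page
  int64_t next_index;  // cursor for the next request
  bool done;           // no more items after this page
};

inline Page paginate(int64_t from_index, int32_t max_count, int64_t available)
{
  int64_t remaining = std::max<int64_t>(0, available - from_index);
  // max_count <= 0 means "no limit", matching the IncumbentRequest comment.
  int64_t count = (max_count > 0) ? std::min<int64_t>(max_count, remaining) : remaining;
  int64_t next  = from_index + count;  // the cursor advances by items returned, NOT to `available`
  return Page{count, next, next >= available};
}
```

With 10 available incumbents and pages of 4, the cursor advances 0 → 4 → 8 → 10; setting it straight to `available` (the buggy behavior) would drop items 4-9 after the first page.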
cpp/src/grpc/client/grpc_client.cu-525-531 (1)

525-531: ⚠️ Potential issue | 🟠 Major

Use compute_chunk_size() to clamp chunk payload to negotiated message-size limits.

chunk_data_budget is derived only from config_.chunk_size_bytes without enforcing server/client limits. If the user's configured value exceeds the server's capabilities, chunk RPC calls will fail with message-size errors. Apply the existing compute_chunk_size() helper at both occurrences (lines 525-531 and 694-698):

🔧 Proposed fix
-  int64_t chunk_data_budget = config_.chunk_size_bytes;
-  if (chunk_data_budget <= 0) { chunk_data_budget = 1LL * 1024 * 1024; }
+  int64_t chunk_data_budget =
+    compute_chunk_size(server_max_message_bytes_, config_.max_message_bytes, config_.chunk_size_bytes);

   const int64_t proto_overhead = 64;
   if (chunk_data_budget > proto_overhead) { chunk_data_budget -= proto_overhead; }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 525 - 531, The code sets
chunk_data_budget directly from config_.chunk_size_bytes then subtracts
proto_overhead before calling build_array_chunk_requests, which can exceed
negotiated gRPC limits; replace the direct assignment with a call to
compute_chunk_size(config_.chunk_size_bytes) (or the appropriate
compute_chunk_size overload) to clamp to negotiated message-size limits, then
subtract proto_overhead and pass that clamped chunk_data_budget into
build_array_chunk_requests; apply the exact same change to the other place in
this file where chunk_data_budget is computed later (the second occurrence) so
both chunk payload computations use compute_chunk_size().
🟡 Minor comments (10)
python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py-152-152 (1)

152-152: ⚠️ Potential issue | 🟡 Minor

Clarify the presolve mode and add numerical validation to the warmstart test.

At line 152, the literal 0 for CUOPT_PRESOLVE obscures intent—there is no PresolverMode enum exported to Python, so add an inline comment explaining that 0 disables presolve (required for warmstart). More importantly, the test at line 162 only checks that the solution is non-None; it should validate numerical correctness of the warmstart result by comparing it against a baseline solve or checking solution bounds.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py` at
line 152, Replace the magic literal by documenting intent: add an inline comment
next to settings.set_parameter(CUOPT_PRESOLVE, 0) stating that 0 disables
presolve (required for warmstart) since no PresolverMode enum is exposed to
Python. Then extend the warmstart test (the test that currently only asserts
solution is not None) to perform numerical validation: run a baseline solve
without warmstart (or use known optimal objective/variable bounds) and assert
the warmstarted solution’s objective and key variable values are within a tight
tolerance (e.g., via pytest.approx) of the baseline; reference the same settings
and solution objects used in the test to obtain values for comparison.
cpp/src/grpc/server/grpc_incumbent_proto.hpp-30-32 (1)

30-32: ⚠️ Potential issue | 🟡 Minor

Unchecked serialization return value.

SerializeToArray can fail and returns a boolean, but the return value is ignored. This could lead to returning an invalid/incomplete buffer silently.

🛡️ Proposed fix
   std::vector<uint8_t> buffer(msg.ByteSizeLong());
-  msg.SerializeToArray(buffer.data(), static_cast<int>(buffer.size()));
+  if (!msg.SerializeToArray(buffer.data(), static_cast<int>(buffer.size()))) {
+    return {};  // Return empty vector on failure
+  }
   return buffer;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_incumbent_proto.hpp` around lines 30 - 32, The code
ignores the boolean result of msg.SerializeToArray which can fail; update the
serialization block to check the return value of
msg.SerializeToArray(buffer.data(), static_cast<int>(buffer.size())) and handle
failure (e.g., throw a std::runtime_error or return an empty/optional buffer)
rather than returning a potentially incomplete vector; keep the existing use of
msg.ByteSizeLong() and the buffer variable but ensure you validate
SerializeToArray's result and produce a clear error/alternative return before
returning buffer.
cpp/src/grpc/server/grpc_worker.cpp-248-251 (1)

248-251: ⚠️ Potential issue | 🟡 Minor

Missing CUDA error checking on cudaMemcpy.

Per coding guidelines, CUDA errors should be caught and mapped to meaningful error codes. These cudaMemcpy calls could fail (e.g., device out of memory) but errors are not checked.

🛡️ Proposed fix pattern
-        cudaMemcpy(host_solution.data(),
-                   device_solution.data(),
-                   device_solution.size() * sizeof(double),
-                   cudaMemcpyDeviceToHost);
+        cudaError_t err = cudaMemcpy(host_solution.data(),
+                                     device_solution.data(),
+                                     device_solution.size() * sizeof(double),
+                                     cudaMemcpyDeviceToHost);
+        if (err != cudaSuccess) {
+          throw std::runtime_error(std::string("cudaMemcpy failed: ") + cudaGetErrorString(err));
+        }

Apply the same pattern to the other cudaMemcpy calls on lines 296-307.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_worker.cpp` around lines 248 - 251, The cudaMemcpy
calls (e.g., the call copying device_solution -> host_solution) lack CUDA error
checks; wrap each cudaMemcpy (including the other calls around lines 296-307) to
capture the returned cudaError_t, check if it is not cudaSuccess, and convert it
to a meaningful error/Status (use/create a helper like mapCudaErrorToStatus or
mapCudaErrorToGrpcStatus) and return/log that error instead of proceeding;
update the cudaMemcpy calls that reference device_solution, host_solution, and
any other device/host buffers to use this pattern so failures (OOM, invalid
value, etc.) are detected and mapped to the existing error-handling flow.
cpp/src/grpc/client/test_grpc_client.cpp-401-403 (1)

401-403: ⚠️ Potential issue | 🟡 Minor

Unhandled exception from std::stod.

If the user provides an invalid --time-limit value (e.g., non-numeric), std::stod throws std::invalid_argument or std::out_of_range, resulting in an unhandled exception with a cryptic error message.

🛡️ Proposed fix
     } else if (arg == "--time-limit" && arg_idx + 1 < argc) {
-      time_limit = std::stod(argv[++arg_idx]);
+      try {
+        time_limit = std::stod(argv[++arg_idx]);
+      } catch (const std::exception& e) {
+        std::cerr << "ERROR: Invalid time limit value\n";
+        return 1;
+      }
       arg_idx++;

The same issue exists for std::stoll on line 531.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/test_grpc_client.cpp` around lines 401 - 403, Wrap the
std::stod call that parses --time-limit (and the std::stoll call at the other
location) in a try/catch that catches std::invalid_argument and
std::out_of_range, then log a clear error via the existing logging/exit path
and/or set a safe default; specifically, around the parsing of argv in the
branch handling "--time-limit" (variable time_limit and std::stod) catch parsing
exceptions, call the same error-handling used elsewhere (print usage or error
and exit with non-zero) and do the same for the std::stoll parse to avoid
uncaught exceptions and cryptic crashes.
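The try/catch pattern above generalizes to any numeric CLI flag; a hedged sketch (the helper name is illustrative, not an existing cuOpt symbol):

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>

// Parse a double from a CLI argument, returning nullopt instead of letting
// std::invalid_argument / std::out_of_range escape to the caller.
inline std::optional<double> parse_double_arg(const std::string& s)
{
  try {
    size_t pos = 0;
    double v   = std::stod(s, &pos);
    if (pos != s.size()) { return std::nullopt; }  // trailing garbage, e.g. "1.5x"
    return v;
  } catch (const std::exception&) {
    return std::nullopt;  // non-numeric or out of range
  }
}
```

The `pos` check catches partially numeric input that `std::stod` would otherwise silently accept; the caller can then print a usage message and exit non-zero instead of crashing.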
cpp/src/grpc/cuopt_remote_service.proto-314-318 (1)

314-318: ⚠️ Potential issue | 🟡 Minor

Inconsistent job_id type: bytes here vs string elsewhere.

IncumbentRequest.job_id is defined as bytes (line 315), but all other request messages (e.g., StatusRequest, GetResultRequest, CancelRequest, DeleteRequest) use string for job_id. This inconsistency could cause confusion and require different handling on client/server.

Suggested fix for consistency
 message IncumbentRequest {
-  bytes job_id = 1;
+  string job_id = 1;
   int64 from_index = 2;        // Return incumbents starting from this index
   int32 max_count = 3;         // Optional limit (0 or negative => no limit)
 }

Similarly, Incumbent.job_id at line 324 should also be string for consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/cuopt_remote_service.proto` around lines 314 - 318,
IncumbentRequest.job_id (and matching Incumbent.job_id) are declared as bytes
while all other request messages use string; change the field type from bytes to
string in the proto for IncumbentRequest and Incumbent so they match
StatusRequest/GetResultRequest/CancelRequest/DeleteRequest, then
recompile/regenerate the gRPC/protobuf stubs (look for the IncumbentRequest and
Incumbent message definitions) to ensure client/server code uses the consistent
string type.
cpp/src/grpc/server/grpc_server_types.hpp-263-265 (1)

263-265: ⚠️ Potential issue | 🟡 Minor

Hardcoded shared memory names may cause conflicts with concurrent server instances.

The shared memory segment names (/cuopt_job_queue, /cuopt_result_queue, /cuopt_control) are hardcoded constants. If multiple server instances run on the same host, they will conflict.

Consider incorporating the port number or a unique identifier into the segment names.

Suggested approach
// Instead of inline constants, generate names dynamically:
inline std::string get_shm_job_queue_name(int port) {
  return "/cuopt_job_queue_" + std::to_string(port);
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_types.hpp` around lines 263 - 265, The three
hardcoded shared-memory name constants (SHM_JOB_QUEUE, SHM_RESULT_QUEUE,
SHM_CONTROL) will collide across concurrent server instances; change them from
inline const char* values to functions that generate unique names at runtime
(e.g., append the server port, PID, or UUID) and update call sites to use the
generator functions (e.g., get_shm_job_queue_name(port),
get_shm_result_queue_name(port), get_shm_control_name(port)) so each server
instance uses distinct segment names.
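A minimal sketch of the suggested per-instance naming (function names follow the review's suggestion; they are not existing cuOpt symbols, and keying on the port assumes one server per port):

```cpp
#include <cassert>
#include <string>

// Generate POSIX shared-memory segment names scoped to a server instance.
// Appending the PID as well would additionally disambiguate restarts that
// reuse the same port.
inline std::string get_shm_job_queue_name(int port)
{
  return "/cuopt_job_queue_" + std::to_string(port);
}
inline std::string get_shm_result_queue_name(int port)
{
  return "/cuopt_result_queue_" + std::to_string(port);
}
inline std::string get_shm_control_name(int port)
{
  return "/cuopt_control_" + std::to_string(port);
}
```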
GRPC_ARCHITECTURE.md-298-304 (1)

298-304: ⚠️ Potential issue | 🟡 Minor

Document required remote endpoint env vars in the Environment Variables section.

This table currently lists only transfer tuning variables. Please also include CUOPT_REMOTE_HOST and CUOPT_REMOTE_PORT, since remote execution depends on them.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@GRPC_ARCHITECTURE.md` around lines 298 - 304, The Environment Variables table
is missing the remote endpoint variables: add two rows for CUOPT_REMOTE_HOST and
CUOPT_REMOTE_PORT in the "Environment Variables" section, specifying sensible
defaults (e.g., blank or "localhost" for CUOPT_REMOTE_HOST and a default port
like 50051 for CUOPT_REMOTE_PORT) and clear descriptions such as "Remote
execution host" and "Remote execution port" so remote execution is documented
alongside CUOPT_CHUNK_SIZE and CUOPT_MAX_MESSAGE_BYTES.
GRPC_ARCHITECTURE.md-16-39 (1)

16-39: ⚠️ Potential issue | 🟡 Minor

Specify languages on fenced code blocks to satisfy markdown lint.

These blocks are missing a language identifier (MD040), which currently fails markdown lint.

📝 Example fix pattern
-```
+```text
 cpp/src/grpc/
 ...
-```
+```

Also applies to: 105-120, 136-151, 359-371

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@GRPC_ARCHITECTURE.md` around lines 16 - 39, The markdown fenced code blocks
in GRPC_ARCHITECTURE.md are missing language identifiers (MD040); update each
triple-backtick block (including the shown cpp/src/grpc/ tree block and the
other blocks at ranges 105-120, 136-151, 359-371) to include a language token
such as "text" (i.e., change ``` to ```text) so linting passes; ensure all
similar block openings (```...) in the file are consistently annotated.
cpp/src/grpc/grpc_solution_mapper.cu-559-560 (1)

559-560: ⚠️ Potential issue | 🟡 Minor

Warm-start presence check is too narrow.

Warm-start reconstruction only triggers when RESULT_WS_CURRENT_PRIMAL is non-empty; other warm-start fields can be present and still be dropped.

Suggested adjustment
-  auto ws_primal = bytes_to_typed<f_t>(arrays, cuopt::remote::RESULT_WS_CURRENT_PRIMAL);
-  if (!ws_primal.empty()) {
+  auto ws_primal = bytes_to_typed<f_t>(arrays, cuopt::remote::RESULT_WS_CURRENT_PRIMAL);
+  auto ws_dual   = bytes_to_typed<f_t>(arrays, cuopt::remote::RESULT_WS_CURRENT_DUAL);
+  auto ws_init_p = bytes_to_typed<f_t>(arrays, cuopt::remote::RESULT_WS_INITIAL_PRIMAL_AVG);
+  auto ws_init_d = bytes_to_typed<f_t>(arrays, cuopt::remote::RESULT_WS_INITIAL_DUAL_AVG);
+  if (!ws_primal.empty() || !ws_dual.empty() || !ws_init_p.empty() || !ws_init_d.empty()) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_solution_mapper.cu` around lines 559 - 560, The code only
triggers warm-start reconstruction when ws_primal (from
bytes_to_typed<f_t>(arrays, cuopt::remote::RESULT_WS_CURRENT_PRIMAL)) is
non-empty, which misses cases where other warm-start fields exist; change the
presence check to detect any warm-start data key in arrays (e.g.,
cuopt::remote::RESULT_WS_CURRENT_PRIMAL, RESULT_WS_CURRENT_DUAL,
RESULT_WS_PREV_PRIMAL, RESULT_WS_PREV_DUAL or other RESULT_WS_* entries) before
skipping reconstruction. Implement a small helper or boolean that queries arrays
for any of those RESULT_WS_* keys (or calls bytes_to_typed for each and checks
non-empty) and use that combined condition instead of only ws_primal.empty() so
reconstruction runs when any warm-start field is present.
cpp/tests/linear_programming/grpc/grpc_integration_test.cpp-53-57 (1)

53-57: ⚠️ Potential issue | 🟡 Minor

Add <fcntl.h> include for open() and flag constants.

Line 120 uses open() with O_WRONLY|O_CREAT|O_TRUNC, but <fcntl.h> is not explicitly included. While some systems may include it transitively through unistd.h, this is not guaranteed by the POSIX standard and creates a portability risk.

File handle cleanup is correct; fd is properly closed at line 125.

Proposed fix
 #include <signal.h>
+#include <fcntl.h>
 #include <sys/types.h>
 #include <sys/wait.h>
 #include <unistd.h>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp` around lines 53
- 57, The test uses open() with flags O_WRONLY|O_CREAT|O_TRUNC (call site:
open(..., O_WRONLY|O_CREAT|O_TRUNC)) but does not include <fcntl.h>, which can
break portability; add #include <fcntl.h> to the top includes so the O_* flag
constants and open declaration are defined and the code compiles reliably across
platforms.
🧹 Nitpick comments (10)
cpp/src/grpc/server/grpc_worker.cpp (1)

149-161: Duplicate error handling blocks.

Lines 149-151 and 153-161 both handle !read_success. The first block logs to stderr, while the second handles the failure logic. These could be consolidated.

♻️ Proposed consolidation
-    if (!read_success) {
-      std::cerr << "[Worker " << worker_id << "] Failed to read job data from pipe\n";
-    }
-
     if (!read_success) {
+      std::cerr << "[Worker " << worker_id << "] Failed to read job data from pipe\n";
       store_simple_result(job_id, worker_id, 1, "Failed to read job data");
       job.worker_pid   = 0;
       // ... rest of cleanup
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_worker.cpp` around lines 149 - 161, The duplicate
handling for !read_success should be consolidated into a single block: remove
the first standalone stderr log and merge its message into the existing failure
branch so that when read_success is false you both log the error (including
worker_id) and execute the failure logic (call store_simple_result(job_id,
worker_id, 1, "..."), reset job fields job.worker_pid, job.worker_index,
job.data_sent, job.ready, job.claimed, and continue). Update the block around
the read_success check in grpc_worker.cpp to perform logging and the cleanup
exactly once to avoid duplicate branches.
cpp/src/grpc/server/grpc_incumbent_proto.hpp (1)

44-49: Consider clearing the output vector before populating.

The assignment vector is passed by reference but not cleared before push_back calls. If a caller reuses the same vector across multiple calls, stale data will accumulate.

♻️ Proposed fix
   job_id    = incumbent_msg.job_id();
   objective = incumbent_msg.objective();
+  assignment.clear();
   assignment.reserve(incumbent_msg.assignment_size());
   for (int i = 0; i < incumbent_msg.assignment_size(); ++i) {
     assignment.push_back(incumbent_msg.assignment(i));
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_incumbent_proto.hpp` around lines 44 - 49, The
assignment vector is populated via push_back without being cleared, so reuse
will accumulate stale entries; before calling assignment.reserve(...) and the
for-loop that pushes from incumbent_msg.assignment(i), call assignment.clear()
(or use assignment.assign(...) from the repeated field) to ensure the vector is
emptied first—update the code around the symbols assignment and incumbent_msg in
grpc_incumbent_proto.hpp to clear assignment before reserving/pushing.
cpp/src/grpc/server/grpc_pipe_serialization.hpp (1)

93-100: Verify semantic difference between result and request deserialization.

For result blobs (line 94), total_elements is used directly as a byte count for resize. For request blobs (lines 151-153), total_elements * elem_size is used. This asymmetry suggests total_elements has different semantics in result vs request contexts; consider adding a clarifying comment.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 93 - 100, The
resize logic for result deserialization uses ac.total_elements() directly as
bytes while request deserialization multiplies total_elements by element size,
so confirm whether ac.total_elements() represents bytes or element count and
make semantics consistent: either change the result-path allocation (the
dest.resize call) to use total_bytes = ac.total_elements() * ac.elem_size() (or
equivalent accessor) before resizing and memcpy, or add a clear comment by the
result-path block documenting that total_elements already includes element size
in bytes; update any related uses (element_offset(), chunk_data.size(), memcpy)
to match the chosen unit to avoid off-by-factor bugs.
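The risk described is an off-by-factor allocation; one way to make the unit explicit is a tiny named helper (illustrative only, not verified against the header's actual accessors):

```cpp
#include <cassert>
#include <cstddef>

// If total_elements counts elements, the destination byte buffer must be
// scaled by the element size; treating the count as bytes would
// under-allocate by a factor of elem_size.
inline size_t dest_buffer_bytes(size_t total_elements, size_t elem_size)
{
  return total_elements * elem_size;
}
```

Routing both the result and request paths through one such helper would make a future unit mismatch a compile-visible choice rather than a silent memcpy overrun.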
cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp (1)

16-20: Consider reordering includes for consistency.

Standard convention places system/library headers before local headers. The <memory> include should come before the local "grpc_client.hpp" include.

Suggested reorder
 #include <cuopt_remote_service.grpc.pb.h>
 #include <grpcpp/grpcpp.h>
-#include "grpc_client.hpp"

 #include <memory>
+
+#include "grpc_client.hpp"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp` around lines
16 - 20, Reorder the includes so system/library headers come before local
headers: move the <memory> include above the "grpc_client.hpp" include in
grpc_client_test_helper.hpp; ensure the block now lists
<cuopt_remote_service.grpc.pb.h>, <grpcpp/grpcpp.h>, <memory> and then
"grpc_client.hpp" to follow the standard include ordering convention.
cpp/tests/linear_programming/grpc/CMakeLists.txt (1)

24-33: Minor: Redundant GTest library linkage.

GTest::gmock_main already includes GTest::gmock, and GTest::gtest_main includes GTest::gtest. You can simplify to just the _main variants.

Simplified linkage
 target_link_libraries(GRPC_CLIENT_TEST
     PRIVATE
     cuopt
-    GTest::gmock
     GTest::gmock_main
-    GTest::gtest
     GTest::gtest_main
     gRPC::grpc++
     protobuf::libprotobuf
 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/CMakeLists.txt` around lines 24 - 33,
Remove redundant GTest link entries in the target_link_libraries call for
GRPC_CLIENT_TEST: keep only the main variants (GTest::gmock_main and
GTest::gtest_main) and remove GTest::gmock and GTest::gtest to avoid duplicate
linkage; update the target_link_libraries block that references
GRPC_CLIENT_TEST, cuopt, gRPC::grpc++, and protobuf::libprotobuf accordingly so
only the necessary _main symbols remain.
cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp (2)

359-360: Remove unused server_logs_ member.

The server_logs_ vector is declared but never populated or used. Server logs are read directly from the file system via get_server_logs(). This dead code should be removed to avoid confusion.

Remove unused member
  private:
   mutable std::mutex mutex_;
   std::vector<LogEntry> client_logs_;
-  std::vector<LogEntry> server_logs_;
   std::string server_log_path_;
   std::streampos server_log_start_pos_ = 0;  // Position in server log file when test started
   bool test_start_marked_              = false;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp` around lines 359
- 360, Remove the dead member server_logs_ from the class: delete the
declaration of std::vector<LogEntry> server_logs_ (it is never populated or
used; server logs are accessed via get_server_logs()). After removing the
member, run a quick search for server_logs_ to ensure there are no remaining
references and update any tests or constructors that may have initialized it.

123-126: Callback captures this pointer — document lifetime requirement.

The lambda returned by client_callback() captures this by pointer. If the callback is invoked after the GrpcTestLogCapture instance is destroyed, this will cause undefined behavior. Consider adding a brief note in the documentation that the GrpcTestLogCapture instance must outlive any client using this callback.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp` around lines 123
- 126, The lambda returned by GrpcTestLogCapture::client_callback() captures
this by pointer and can dangle if the GrpcTestLogCapture instance is destroyed,
so add a clear lifetime note: update the documentation/comment for the
GrpcTestLogCapture class and/or the client_callback() method to state that the
GrpcTestLogCapture instance must outlive any client that retains or invokes the
returned callback; reference the client_callback() function name in the comment
so callers know the requirement and consider mentioning alternative designs
(e.g., providing a weak_ptr-based factory) if you want to avoid the strict
lifetime requirement.
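The weak_ptr-based factory mentioned above could look roughly like this, assuming the capture object is heap-allocated via shared_ptr (class and method names are illustrative stand-ins, not the test helper's real API):

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <vector>

// Illustrative stand-in for a log-capture object.
struct LogCapture {
  std::vector<std::string> lines;

  // Return a callback holding only a weak reference, so invoking it after the
  // capture object is destroyed becomes a no-op instead of undefined behavior.
  static std::function<void(const std::string&)> make_callback(
    const std::shared_ptr<LogCapture>& self)
  {
    std::weak_ptr<LogCapture> weak = self;
    return [weak](const std::string& line) {
      if (auto locked = weak.lock()) { locked->lines.push_back(line); }
    };
  }
};
```

The trade-off is an atomic refcount bump per invocation; for a test-only log sink that cost is negligible next to the dangling-`this` risk it removes.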
SERVER_ARCHITECTURE.md (1)

15-46: Add language specifier to fenced code block.

Per markdownlint, fenced code blocks should have a language specified. For ASCII diagrams, use text or plaintext.

Fix markdown lint warning
-```
+```text
 ┌────────────────────────────────────────────────────────────────────┐
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@SERVER_ARCHITECTURE.md` around lines 15 - 46, The fenced ASCII diagram block
in SERVER_ARCHITECTURE.md is missing a language specifier; change the opening
fence from ``` to ```text (or ```plaintext) so the block becomes ```text and
satisfies markdownlint for the diagram that starts with the
"┌────────────────────────────────────────────────────────────────────┐" line.
cpp/src/grpc/client/solve_remote.cu (1)

69-74: Mark currently-unused LP flags explicitly (or forward them).

problem_checking and use_pdlp_solver_mode are currently unused. Marking them explicitly avoids silent interface drift and warning churn.

Small cleanup option
 std::unique_ptr<lp_solution_interface_t<i_t, f_t>> solve_lp_remote(
   cpu_optimization_problem_t<i_t, f_t> const& cpu_problem,
   pdlp_solver_settings_t<i_t, f_t> const& settings,
-  bool problem_checking,
-  bool use_pdlp_solver_mode)
+  [[maybe_unused]] bool problem_checking,
+  [[maybe_unused]] bool use_pdlp_solver_mode)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/solve_remote.cu` around lines 69 - 74, The function
solve_lp_remote currently accepts parameters problem_checking and
use_pdlp_solver_mode but never uses them, which can cause warnings and interface
drift; update the function (solve_lp_remote) to explicitly mark these parameters
as intentionally unused (e.g., cast to void or annotate with [[maybe_unused]])
or forward them to the underlying call that needs them, ensuring the symbols
problem_checking and use_pdlp_solver_mode appear in the implementation so the
intent is clear and compiler warnings are silenced.
cpp/tests/linear_programming/grpc/grpc_client_test.cpp (1)

1086-1340: Add at least one degenerate solve-case assertion in this suite.

The end-to-end mock solve tests are strong, but adding explicit infeasible/unbounded/empty (or singleton edge) solve assertions would better protect status/result mapping regressions.

As per coding guidelines: "Write tests validating numerical correctness of optimization results (not just 'runs without error'); test degenerate cases (infeasible, unbounded, empty, singleton problems)".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_client_test.cpp` around lines 1086 -
1340, Add a new degenerate-case unit test (e.g., TEST_F(GrpcClientTest,
SolveLP_Infeasible)) that uses the existing GrpcClientTest fixture and
MockCuOptStub to exercise grpc_client_t::solve_lp for an infeasible LP: have
SubmitJob(...) return a job_id, have CheckStatus/WaitForCompletion indicate
COMPLETED (or skip if use_wait), and have GetResult(...) return a
cuopt::remote::LPSolution with termination_status set to PDLP_INFEASIBLE (and
resp->set_status(cuopt::remote::SUCCESS) if appropriate); then assert the
returned result indicates failure (result.success == false) and that
result.error_message contains a clear indicator of infeasibility (e.g.,
"INFEASIBLE" or "infeasible"); reuse symbols SubmitJob,
CheckStatus/WaitForCompletion, GetResult, grpc_client_t::solve_lp, and
GrpcClientTest to locate where to add the test.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6984e3cb-4874-432a-8307-c8f3a64fd9ba

📥 Commits

Reviewing files that changed from the base of the PR and between 1c6f86b and 6cafe37.

📒 Files selected for processing (39)
  • GRPC_ARCHITECTURE.md
  • SERVER_ARCHITECTURE.md
  • build.sh
  • cpp/CMakeLists.txt
  • cpp/cuopt_grpc_server.cpp
  • cpp/src/grpc/client/grpc_client.cu
  • cpp/src/grpc/client/grpc_client.hpp
  • cpp/src/grpc/client/solve_remote.cu
  • cpp/src/grpc/client/test_grpc_client.cpp
  • cpp/src/grpc/cuopt_remote.proto
  • cpp/src/grpc/cuopt_remote_service.proto
  • cpp/src/grpc/grpc_problem_mapper.cu
  • cpp/src/grpc/grpc_problem_mapper.hpp
  • cpp/src/grpc/grpc_service_mapper.cu
  • cpp/src/grpc/grpc_service_mapper.hpp
  • cpp/src/grpc/grpc_settings_mapper.cu
  • cpp/src/grpc/grpc_settings_mapper.hpp
  • cpp/src/grpc/grpc_solution_mapper.cu
  • cpp/src/grpc/grpc_solution_mapper.hpp
  • cpp/src/grpc/server/grpc_field_element_size.hpp
  • cpp/src/grpc/server/grpc_incumbent_proto.hpp
  • cpp/src/grpc/server/grpc_job_management.cpp
  • cpp/src/grpc/server/grpc_pipe_serialization.hpp
  • cpp/src/grpc/server/grpc_server_main.cpp
  • cpp/src/grpc/server/grpc_server_threads.cpp
  • cpp/src/grpc/server/grpc_server_types.hpp
  • cpp/src/grpc/server/grpc_service_impl.cpp
  • cpp/src/grpc/server/grpc_worker.cpp
  • cpp/src/grpc/server/grpc_worker_infra.cpp
  • cpp/src/pdlp/CMakeLists.txt
  • cpp/src/pdlp/solve_remote.cu
  • cpp/tests/linear_programming/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/grpc_client_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp
  • cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
  • python/cuopt/cuopt/routing/utils.py
  • python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
💤 Files with no reviewable changes (2)
  • cpp/src/pdlp/CMakeLists.txt
  • cpp/src/pdlp/solve_remote.cu

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 14

♻️ Duplicate comments (9)
cpp/src/grpc/server/grpc_server_main.cpp (2)

186-212: ⚠️ Potential issue | 🔴 Critical

Add teardown on TLS config failures after workers/threads are started.

These early returns occur after runtime startup and bypass shutdown/join/cleanup paths.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 186 - 212, The TLS
validation branches that currently do "return 1" (around checks using config,
ssl_opts and read_file_to_string for cert/key/root and require_client) can occur
after workers/threads have been started; replace those early returns with calls
to the centralized shutdown/teardown routine (e.g., call your existing server
stop/cleanup functions such as StopServer(), shutdownWorkers()/joinThreads(), or
a new cleanup_resources() that performs shutdown and joins) and then return the
error code; ensure every failure path that currently returns from inside the TLS
setup invokes that teardown routine before exiting so workers/threads and other
resources are properly cleaned up.

234-251: ⚠️ Potential issue | 🔴 Critical

Guard BuildAndStart() failure before waiting on the server pointer.

server->Wait() is called unconditionally; if startup fails and returns null, this dereferences a null pointer.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 234 - 251,
BuildAndStart() may return nullptr so avoid unguarded dereference of server;
after std::unique_ptr<Server> server(builder.BuildAndStart()), check if (server)
before calling server->Wait() and starting the shutdown logic that assumes a
live server. If BuildAndStart() fails, log an error (or print to cerr) and
exit/return non-zero. Also ensure the shutdown_thread's capture/Shutdown call
remains safe by only invoking server->Shutdown() when server is non-null; move
creation of shutdown_thread and the server->Wait() call inside the if (server)
block and handle the failure path explicitly.
cpp/src/grpc/server/grpc_job_management.cpp (1)

147-158: ⚠️ Potential issue | 🔴 Critical

Synchronize all job_queue access; current slot claim/update path is racy.

submit_job_async still does unsynchronized check-then-write on shared queue state, and check_job_status/cancel_job concurrently read/write the same fields. This can corrupt queue ownership and status transitions under concurrent requests.

As per coding guidelines "Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state."

Also applies to: 195-196, 267-270

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 147 - 158, The
submit_job_async path performs an unsynchronized check-then-write on shared
job_queue entries (e.g., checking job_queue[i].ready/claimed then writing fields
like job_id, ready, claimed), which races with check_job_status and cancel_job;
protect all accesses to job_queue (both scans and per-entry reads/writes) by
introducing a synchronization primitive (e.g., a std::mutex or per-entry mutex
array) and acquire the lock(s) around the loop that scans MAX_JOBS and before
modifying fields (job_queue[i].ready, claimed, cancelled, etc.), and also ensure
check_job_status and cancel_job acquire the same lock(s) before reading or
updating those fields so ownership and state transitions are atomic and
race-free.
cpp/src/grpc/client/grpc_client.cu (4)

258-266: ⚠️ Potential issue | 🔴 Critical

Guard public RPC methods before dereferencing impl_->stub.

These methods can be called before connect(). Dereferencing impl_->stub without a null check can crash.

🔧 Suggested guard pattern
 job_status_result_t grpc_client_t::check_status(const std::string& job_id)
 {
   job_status_result_t result;
+  if (!impl_ || !impl_->stub) {
+    result.error_message = "Not connected to server";
+    return result;
+  }

   grpc::ClientContext context;
   auto request = build_status_request(job_id);
   cuopt::remote::StatusResponse response;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 258 - 266, The public RPC
method grpc_client_t::check_status dereferences impl_->stub without ensuring
impl_ is initialized or stub is non-null (can be called before connect()), which
can crash; add a null-check guard at the start of check_status (and other public
RPC methods) that verifies impl_ and impl_->stub are valid, and return an
appropriate job_status_result_t error/failed state (or set result.error and
result.code) when the client is not connected instead of dereferencing; refer to
grpc_client_t::connect and impl_->stub when adding the guard so the check
mirrors connect()'s initialization contract.

994-995: ⚠️ Potential issue | 🔴 Critical

Validate poll_interval_ms before using it as a divisor.

max_polls divides by config_.poll_interval_ms without checking <= 0, which can crash.

🔧 Suggested validation
+  if (config_.poll_interval_ms <= 0) {
+    stop_log_streaming();
+    result.error_message = "poll_interval_ms must be > 0";
+    return result;
+  }
   int poll_count = 0;
   int max_polls  = (config_.timeout_seconds * 1000) / config_.poll_interval_ms;

Also applies to: 1132-1132

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 994 - 995, The code computes
max_polls by dividing by config_.poll_interval_ms without validating it; update
the logic that sets max_polls (the line initializing int max_polls) to first
validate config_.poll_interval_ms is > 0 (and also guard against extremely
large/small values), returning an error or using a safe default/policy (e.g.,
set poll_interval_ms = 1 or return a failure status) when it's <= 0 to avoid
division-by-zero and undefined behavior; apply the same validation where
max_polls is computed elsewhere (the other occurrence around the code that uses
config_.poll_interval_ms).

703-712: ⚠️ Potential issue | 🔴 Critical

Validate chunk metadata and chunk bounds before allocation/copy.

elem_size, total_elems, and elements_in_chunk are trusted too early. This can lead to division by zero, overflowed allocation sizes, or out-of-bounds memcpy.

🔧 Suggested hardening
   for (const auto& arr_desc : header->arrays()) {
     auto field_id       = arr_desc.field_id();
     int64_t total_elems = arr_desc.total_elements();
     int64_t elem_size   = arr_desc.element_size_bytes();
     if (total_elems <= 0) continue;
+    if (elem_size <= 0) {
+      last_error_ = "Invalid chunk metadata: non-positive element size";
+      return false;
+    }
+    if (total_elems > std::numeric_limits<int64_t>::max() / elem_size) {
+      last_error_ = "Invalid chunk metadata: byte-size overflow";
+      return false;
+    }
+    int64_t total_bytes = total_elems * elem_size;
+    if (total_bytes < 0 ||
+        static_cast<uint64_t>(total_bytes) > static_cast<uint64_t>(std::numeric_limits<size_t>::max())) {
+      last_error_ = "Invalid chunk metadata: byte size exceeds addressable memory";
+      return false;
+    }

     int64_t elems_per_chunk = chunk_data_budget / elem_size;
     if (elems_per_chunk <= 0) elems_per_chunk = 1;

-    std::vector<uint8_t> array_bytes(static_cast<size_t>(total_elems * elem_size));
+    std::vector<uint8_t> array_bytes(static_cast<size_t>(total_bytes));

     for (int64_t elem_offset = 0; elem_offset < total_elems; elem_offset += elems_per_chunk) {
       int64_t elems_wanted = std::min(elems_per_chunk, total_elems - elem_offset);
@@
       int64_t elems_received = chunk_resp.elements_in_chunk();
       const auto& data       = chunk_resp.data();

       if (static_cast<int64_t>(data.size()) != elems_received * elem_size) {
         last_error_ = "GetResultChunk: data size mismatch";
         return false;
       }
+      if (elems_received < 0 || elem_offset + elems_received > total_elems) {
+        last_error_ = "GetResultChunk: elements_in_chunk out of bounds";
+        return false;
+      }

       std::memcpy(array_bytes.data() + elem_offset * elem_size, data.data(), data.size());

Also applies to: 731-740

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 703 - 712, Validate and
guard all chunk metadata before allocating or copying: ensure
arr_desc.element_size_bytes() (elem_size) > 0 and arr_desc.total_elements()
(total_elems) >= 0 before using them; ensure chunk_data_budget is >0 before
computing elems_per_chunk and handle division safely; check for
size_t/multiplication overflow when computing total_elems * elem_size before
constructing array_bytes and use a safe cap; when iterating chunks validate each
elements_in_chunk is within [0, total_elems] and compute copy_size =
elements_in_chunk * elem_size with overflow checks before any memcpy; add clear
error handling/logging and skip/abort the chunk on invalid metadata rather than
performing the allocation or memcpy (refer to symbols: arr_desc, field_id,
elem_size, total_elems, chunk_data_budget, elems_per_chunk, elements_in_chunk,
array_bytes, memcpy).

247-251: ⚠️ Potential issue | 🔴 Critical

stop_log_streaming() can hang waiting on a blocked reader thread.

Setting stop_logs_ does not unblock a thread already stuck in stream read, so join() can block indefinitely.

🔧 Suggested fix
 void grpc_client_t::start_log_streaming(const std::string& job_id)
 {
   if (!config_.stream_logs || !config_.log_callback) return;

   stop_logs_.store(false);
   log_thread_ = std::make_unique<std::thread>([this, job_id]() {
     grpc::ClientContext context;
+    {
+      std::lock_guard<std::mutex> lk(log_context_mutex_);
+      active_log_context_ = &context;
+    }
     auto request = build_stream_logs_request(job_id, 0);
     auto reader  = impl_->stub->StreamLogs(&context, request);

     cuopt::remote::LogMessage log_msg;
     while (reader->Read(&log_msg)) {
       if (stop_logs_.load()) {
-        context.TryCancel();
         break;
       }
       if (config_.log_callback) { config_.log_callback(log_msg.line()); }
       if (log_msg.job_complete()) { break; }
     }
     reader->Finish();
+    {
+      std::lock_guard<std::mutex> lk(log_context_mutex_);
+      active_log_context_ = nullptr;
+    }
   });
 }

 void grpc_client_t::stop_log_streaming()
 {
   stop_logs_.store(true);
+  {
+    std::lock_guard<std::mutex> lk(log_context_mutex_);
+    if (active_log_context_) { active_log_context_->TryCancel(); }
+  }
   if (log_thread_ && log_thread_->joinable()) { log_thread_->join(); }
   log_thread_.reset();
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 247 - 251,
stop_log_streaming currently only sets stop_logs_ and then joins log_thread_,
which can hang if the thread is blocked in a gRPC read; change the shutdown
sequence to (1) set stop_logs_, (2) explicitly cancel/close the gRPC stream used
by the reader (e.g. call the grpc::ClientContext::TryCancel() or cancel/Finish
the ClientReader/ReaderInterface instance used for log streaming), (3) ensure
any CompletionQueue or async reader is shutdown/drained so the blocked read
returns, and only then join log_thread_ (use a timed join as a fallback and log
an error if join times out); update stop_log_streaming to reference stop_logs_,
log_thread_ and the stream/reader/client-context member (the object that
performs the stream read) when implementing the cancel/cleanup steps.
cpp/src/grpc/grpc_problem_mapper.cu (1)

399-411: ⚠️ Potential issue | 🔴 Critical

Validate byte alignment before memcpy in chunked array decoders.

n is computed with truncating integer division, but the subsequent memcpy copies the payload's full byte count, so a payload whose size is not a multiple of the element size writes past the allocated vector.

🔧 Suggested fix
   auto get_doubles = [&](int32_t field_id) -> std::vector<f_t> {
     auto it = arrays.find(field_id);
     if (it == arrays.end() || it->second.empty()) return {};
+    if (it->second.size() % sizeof(double) != 0) return {};
     size_t n = it->second.size() / sizeof(double);
@@
   auto get_ints = [&](int32_t field_id) -> std::vector<i_t> {
     auto it = arrays.find(field_id);
     if (it == arrays.end() || it->second.empty()) return {};
+    if (it->second.size() % sizeof(int32_t) != 0) return {};
     size_t n = it->second.size() / sizeof(int32_t);

Also applies to: 414-426

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_problem_mapper.cu` around lines 399 - 411, The lambda
get_doubles uses n = it->second.size() / sizeof(double) but then memcpy's the
full byte size, which can overflow if the payload isn't aligned; validate that
it->second.size() is an exact multiple of sizeof(double) before copying, compute
expected_bytes = n * sizeof(double) and only memcpy expected_bytes (or
return/throw on misaligned size), and when converting to f_t ensure you only
read expected_bytes into the temporary vector; apply the same check/fix to the
other chunked-array decoder blocks that perform memcpy (the other double-to-f_t
conversion in this file).
cpp/src/grpc/server/grpc_service_impl.cpp (1)

71-73: ⚠️ Potential issue | 🔴 Critical

Publish ready before releasing claimed to avoid queue slot races.

Current store order can allow another submitter to reserve the same slot before readiness is published.

🔧 Suggested ordering fix
-    job_queue[job_idx].claimed.store(false);
-    job_queue[job_idx].ready.store(true);
+    job_queue[job_idx].ready.store(true, std::memory_order_release);
+    job_queue[job_idx].claimed.store(false, std::memory_order_release);
As per coding guidelines "Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state."

Also applies to: 265-267

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 71 - 73, The code
currently does job_queue[job_idx].claimed.store(false) before
job_queue[job_idx].ready.store(true), which can let another submitter claim the
slot before readiness is published; swap the two stores so ready.store(true) is
executed before claimed.store(false) for the job_queue entry, and apply the same
fix to the other identical location where claimed and ready are updated (the
later occurrence around the second claim/release sequence) to prevent the race.
🧹 Nitpick comments (5)
cpp/src/grpc/client/test_grpc_client.cpp (1)

401-403: Consider graceful handling for invalid numeric CLI input instead of process abort.

std::stod/std::stoll at lines 402 and 531 throw on malformed input (e.g., --time-limit abc or invalid incumbent index). While this is test code, a try-catch wrapper would provide a friendlier CLI experience instead of silently crashing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/test_grpc_client.cpp` around lines 401 - 403, Replace
direct calls to std::stod and std::stoll in the CLI parsing logic with
exception-safe parsing: wrap the std::stod(std::string) used to set time_limit
and the std::stoll used to set incumbent_index in try-catch blocks that catch
std::invalid_argument and std::out_of_range, print a clear error message
indicating the offending flag and value (e.g., "--time-limit" or
"--incumbent-index"), and exit with a non-zero status instead of letting the
process throw; update the parsing branches that reference time_limit and
incumbent_index so they validate the parsed numeric ranges if applicable and
fall back or exit cleanly on parse failure.
cpp/src/grpc/cuopt_remote_service.proto (1)

9-9: Standardize job_id field type across RPC schema.

Most request messages use string job_id (StatusRequest, GetResultRequest, DeleteRequest, CancelRequest, WaitRequest, StreamLogsRequest, etc.), but IncumbentRequest.job_id and Incumbent.job_id are defined as bytes. Similarly, cuopt_remote.proto uses bytes job_id in SubmitResponse. This type inconsistency can cause friction in client libraries expecting uniform scalar types across the API contract. Either standardize on string for consistency with the majority of the schema, or use bytes uniformly if avoiding UTF-8 validation is a requirement.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/cuopt_remote_service.proto` at line 9, IncumbentRequest.job_id,
Incumbent.job_id and SubmitResponse.job_id are defined as bytes while most RPC
messages (StatusRequest, GetResultRequest, DeleteRequest, CancelRequest,
WaitRequest, StreamLogsRequest, etc.) use string; change those fields to type
string to standardize the schema, update any proto message definitions where
job_id is bytes to string (IncumbentRequest, Incumbent, SubmitResponse), run
proto generation to refresh client/server stubs, and adjust any
serialization/handler code that expects bytes to handle string instead.
cpp/src/grpc/grpc_solution_mapper.cu (2)

93-101: Consider bulk operations for repeated fields.

The element-by-element add_*() calls work correctly but may be slower for large solutions. Protobuf's mutable_primal_solution()->Reserve(primal.size()) followed by Add() or using mutable_primal_solution()->Assign(...) could improve performance for large-scale problems.

💡 Example using Reserve for primal_solution
   // Solution vectors - CPU solution already has data in host memory
   const auto& primal       = solution.get_primal_solution_host();
   const auto& dual         = solution.get_dual_solution_host();
   const auto& reduced_cost = solution.get_reduced_cost_host();

+  pb_solution->mutable_primal_solution()->Reserve(primal.size());
   for (const auto& v : primal) {
     pb_solution->add_primal_solution(static_cast<double>(v));
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_solution_mapper.cu` around lines 93 - 101, The per-element
protobuf adds (pb_solution->add_primal_solution, add_dual_solution,
add_reduced_cost) should be replaced with bulk operations to improve performance
for large vectors: call
pb_solution->mutable_primal_solution()->Reserve(primal.size()) and then use
Add() or assign via mutable_primal_solution()->Assign(...) to copy the primal
data (cast to double as needed), and do the same for dual and reduced_cost using
mutable_dual_solution()->Reserve(dual.size())/Assign and
mutable_reduced_cost()->Reserve(reduced_cost.size())/Assign; update the casts so
the bulk copy converts elements to double before Assign/adding.

559-598: Warm-start presence detection relies on empty array check.

The warm-start data presence is inferred from ws_primal.empty(). This works correctly as long as any valid warm-start data always includes the primal solution. If there's ever a case where warm-start exists but current_primal is empty, this would miss it. Consider checking the header for a dedicated warm-start flag if available.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_solution_mapper.cu` around lines 559 - 598, The code
currently detects warm-start presence by testing ws_primal.empty() in the block
that constructs cpu_pdlp_warm_start_data_t (see bytes_to_typed,
cpu_pdlp_warm_start_data_t, and the if (!ws_primal.empty()) guard); change that
condition to prefer a dedicated header flag (e.g., use h.has_warm_start() or
h.ws_present() if such a field exists) and fall back to the primal-array
emptiness check if the flag is unavailable, e.g., replace if
(!ws_primal.empty()) with if (h.has_warm_start() || !ws_primal.empty()) so
warm-starts with an empty current_primal are still recognized.
cpp/src/grpc/client/grpc_client.hpp (1)

236-240: Non-movable semantics are intentional but could be revisited.

The class is non-copyable and non-movable due to std::atomic<bool> and std::thread members. std::atomic is neither copyable nor movable, and although std::thread is movable, the PIMPL pattern and the background log thread make non-movability a reasonable choice here. If movability becomes needed, consider wrapping the thread in a nullable type and re-initializing stop_logs_ after the move.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.hpp` around lines 236 - 240, The class
grpc_client_t is explicitly non-copyable and non-movable due to an
std::atomic<bool> (stop_logs_) and an internal std::thread member; if you need
movability, add a noexcept move constructor and move assignment that transfer
ownership of the PIMPL, move the std::thread (std::move on the thread member)
and ensure the source thread is left in a benign state (e.g., set to not
joinable or detached), and transfer/reset the atomic flag (use
stop_logs_.store(false) or std::exchange to set a sensible value in the
moved-from object). Update grpc_client_t::grpc_client_t(grpc_client_t&&) and
operator=(grpc_client_t&&) to move the pimpl pointer, move the thread, and
handle stop_logs_ correctly; ensure proper synchronization when transferring
thread ownership to avoid races.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/src/grpc/client/solve_remote.cu`:
- Around line 55-57: The env overrides for CUOPT_CHUNK_SIZE and
CUOPT_MAX_MESSAGE_BYTES must be validated before assignment: call
parse_env_int64 for each (as currently done), then only assign to
config.chunk_size_bytes and config.max_message_bytes when the parsed value is a
sane positive integer (e.g., > 0) and within expected bounds; otherwise ignore
the override (leave existing config value) and emit a warning via
process/logging (use config.enable_debug_log or existing logger). In short,
replace the direct assignments with guarded checks around parse_env_int64
results and clamp or reject values that are zero/negative or exceed your defined
upper limits before setting config.chunk_size_bytes and
config.max_message_bytes.
- Around line 164-166: The incumbent callback incorrectly uses the incumbent
objective as the "bound" (bound_copy = objective) when calling
get_callback->get_solution; change the flow so the actual best bound from the
remote solver (use get_solution_bound()) is forwarded through the incumbent
callback protocol and used as bound_copy when invoking
get_callback->get_solution; update the gRPC incumbent callback message/handler
(grpc_client.hpp protocol and the server-side code that calls the callback) to
include a bound field, populate it from get_solution_bound(), and adjust the
client-side call site in solve_remote.cu to read that bound and pass it instead
of objective to ensure callbacks receive the solver's best bound.

In `@cpp/src/grpc/cuopt_remote_service.proto`:
- Line 315: The proto uses two different types for job_id (string in
SubmitJobResponse, StatusRequest, GetResultRequest, StartChunkedDownloadRequest,
DeleteRequest, CancelRequest, WaitRequest, StreamLogsRequest vs bytes in
IncumbentRequest and Incumbent), so unify them: choose one canonical type
(prefer string for UTF-8 identifiers or bytes if binary IDs are required) and
update the job_id field in the outlier messages (IncumbentRequest and Incumbent
if switching to string, or update the eight string-typed messages if switching
to bytes) to match; after changing the proto make sure to regenerate language
bindings and add any necessary UTF-8 validation or encoding/decoding notes in
the RPC handling code.

In `@cpp/src/grpc/cuopt_remote.proto`:
- Line 6: The proto package declaration "package cuopt.remote" in
cuopt_remote.proto doesn't match Buf's package-directory rules; either change
the package line to match the file's directory-based package (so the package
mirrors the proto file location) or move cuopt_remote.proto into a directory
structure that matches "cuopt/remote"; update the package declaration (package
cuopt.remote) accordingly so the package and directory structure are consistent
and the PACKAGE_DIRECTORY_MATCH lint passes.

In `@cpp/src/grpc/grpc_settings_mapper.cu`:
- Around line 144-146: The protobuf iteration_limit can exceed the range of the
target type i_t and overflow when statically cast; before assigning
pb_settings.iteration_limit() to settings.iteration_limit (the
static_cast<i_t>), check that the value lies within numeric_limits<i_t>::min()
and numeric_limits<i_t>::max(); if it is out of range, either clamp to the
appropriate min/max or return/log an error and refuse the assignment, ensuring
you reference pb_settings.iteration_limit() and settings.iteration_limit when
implementing the guard to prevent narrowing overflow.

In `@cpp/src/grpc/server/grpc_job_management.cpp`:
- Around line 109-113: The code currently reads a uint64_t size from the pipe
and blindly calls data.resize(size) which permits unbounded allocation; before
resizing, validate and cap the incoming size (e.g. define a
MAX_INCUMBENT_PAYLOAD or MAX_INCUMBENT_SIZE constant) and return false if size
is zero or exceeds that limit, and ensure you convert/cast size to size_t
safely; update the recv_incumbent_pipe logic around the read_from_pipe(fd,
&size, ...) and data.resize(...) calls (and any callers using fd/data) to
enforce this limit and prevent resource exhaustion.

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp`:
- Line 90: The loop currently treats any ParseDelimitedFromCodedStream(&ac,
&coded, &clean_eof) == false as a benign break and later returns true, which
accepts partial/corrupt payloads; change the handling so that when
ParseDelimitedFromCodedStream returns false you check clean_eof and only break
on a true clean_eof (clean EOF), otherwise return false to indicate a parse
error. Update both occurrences (the line with ParseDelimitedFromCodedStream(&ac,
&coded, &clean_eof) and the other similar occurrence) so parse failures produce
an immediate false return instead of falling through and returning true.
- Around line 151-154: The resize uses ac.total_elements() * elem_size without
validation which can overflow or trigger huge allocations; before calling
dest.resize validate that ac.total_elements() and elem_size are non-negative and
compute the product using a larger unsigned type (e.g., uint64_t) then check
product <= std::numeric_limits<size_t>::max() and <= a configured maximum
allocation threshold, and only then cast to size_t and call dest.resize; if the
checks fail, return/throw an error instead of resizing. Also apply the same
validation logic for the other occurrence around lines 158-163 (same variables:
ac, elem_size, array_field_element_size(ac.field_id()), dest).

In `@cpp/src/grpc/server/grpc_server_main.cpp`:
- Around line 50-80: The CLI parsing loop in main (grpc_server_main.cpp)
currently silently ignores unknown flags and missing values; update the for-loop
that inspects argv/argc to validate every flag: for each option that requires a
value (e.g., "-p"/"--port", "-w"/"--workers", "--max-message-mb",
"--max-message-bytes", "--tls-cert", "--tls-key", "--tls-root") check i+1 < argc
and if not, print an error to std::cerr (include the flag name), call
print_usage(argv[0]) and return a non-zero exit code; for unrecognized args (the
final else case) do the same (error + usage + non-zero exit); also wrap
stoi/stoll conversions in try/catch to handle invalid numbers and fail similarly
with an explanatory error. Ensure you reference and update the same config
fields (config.port, config.num_workers, config.max_message_mb,
config.max_message_b, config.tls_cert_path, config.tls_key_path,
config.tls_root_path) and keep behavior for boolean flags unchanged.
- Around line 157-167: The code currently calls memset on JobQueueEntry and
ResultQueueEntry which contain std::atomic members (job_queue and result_queue
initialization loop); replace the raw memset with explicit field initialization
or construction: for each JobQueueEntry/ResultQueueEntry set every member
explicitly (e.g., initialize non-atomic fields directly and initialize atomics
via .store(...) or use placement-new / a helper function that constructs the
object in shared memory) to avoid undefined behavior; update the loops that call
memset(&job_queue[i], 0, ...) and memset(&result_queue[i], 0, ...) to perform
per-field initialization instead, ensuring worker_index, ready, claimed,
cancelled, retrieved, etc. are correctly constructed and stored.

In `@cpp/src/grpc/server/grpc_server_types.hpp`:
- Around line 300-307: signal_handler currently calls non-async-signal-safe
functions (std::cout and result_cv.notify_all()), which must be removed; change
signal_handler to only perform async-signal-safe writes: set keep_running to
false using a sig_atomic_t/atomic<bool> and, if needed, set
shm_ctrl->shutdown_requested using an atomic or sig_atomic_t field, and return
immediately (no I/O or condition_variable calls). Add a separate watcher
thread/function (e.g., monitor_shutdown or shutdown_watcher) that polls/blocks
on keep_running becoming false and then performs the logging (std::cout) and
calls result_cv.notify_all() and any other cleanup; update signal_handler,
keep_running's type, and shm_ctrl to use signal-safe atomics so the watcher can
safely observe the change.

In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp`:
- Around line 213-218: The setters and other methods access server_log_path_,
server_log_start_pos_, and test_start_marked_ without synchronization, causing
races; add a member mutex (e.g., server_log_mutex_) and use it to guard all
reads and writes of these fields (wrap accesses in std::lock_guard<std::mutex>),
update set_server_log_path to lock the mutex when updating server_log_path_,
server_log_start_pos_, and test_start_marked_, and ensure every other method
that reads or writes these members (the other server-log readers/pollers
mentioned) also locks the same mutex before accessing them so the state is
consistently protected.

In `@GRPC_ARCHITECTURE.md`:
- Around line 9-10: The documentation sentence claiming “no custom serialization
logic is implemented” is too absolute; update the statement to acknowledge the
custom pipe blob framing helpers introduced in grpc_pipe_serialization.hpp (the
pipe blob framing helpers) and clarify scope by limiting the claim to gRPC wire
payloads (e.g., “All gRPC wire payload serialization uses protocol buffers
generated by protoc and grpc_cpp_plugin; separate pipe blob framing helpers in
grpc_pipe_serialization.hpp are used only for local pipe framing, not for gRPC
wire serialization.”).

In `@SERVER_ARCHITECTURE.md`:
- Around line 15-46: The markdown has untyped fenced code blocks (the ASCII
diagram block and other fenced regions) which trigger MD040; update each
triple-backtick fence (including the large ASCII diagram block shown and the
other blocks noted) to include an explicit language token such as text (e.g.,
change ``` to ```text) or another appropriate language (bash/cpp) so
markdownlint MD040 is satisfied across the file.

---

Duplicate comments:
In `@cpp/src/grpc/client/grpc_client.cu`:
- Around line 258-266: The public RPC method grpc_client_t::check_status
dereferences impl_->stub without ensuring impl_ is initialized or stub is
non-null (can be called before connect()), which can crash; add a null-check
guard at the start of check_status (and other public RPC methods) that verifies
impl_ and impl_->stub are valid, and return an appropriate job_status_result_t
error/failed state (or set result.error and result.code) when the client is not
connected instead of dereferencing; refer to grpc_client_t::connect and
impl_->stub when adding the guard so the check mirrors connect()'s
initialization contract.
- Around line 994-995: The code computes max_polls by dividing by
config_.poll_interval_ms without validating it; update the logic that sets
max_polls (the line initializing int max_polls) to first validate
config_.poll_interval_ms is > 0 (and also guard against extremely large/small
values), returning an error or using a safe default/policy (e.g., set
poll_interval_ms = 1 or return a failure status) when it's <= 0 to avoid
division-by-zero and undefined behavior; apply the same validation where
max_polls is computed elsewhere (the other occurrence around the code that uses
config_.poll_interval_ms).
- Around line 703-712: Validate and guard all chunk metadata before allocating
or copying: ensure arr_desc.element_size_bytes() (elem_size) > 0 and
arr_desc.total_elements() (total_elems) >= 0 before using them; ensure
chunk_data_budget is >0 before computing elems_per_chunk and handle division
safely; check for size_t/multiplication overflow when computing total_elems *
elem_size before constructing array_bytes and use a safe cap; when iterating
chunks validate each elements_in_chunk is within [0, total_elems] and compute
copy_size = elements_in_chunk * elem_size with overflow checks before any
memcpy; add clear error handling/logging and skip/abort the chunk on invalid
metadata rather than performing the allocation or memcpy (refer to symbols:
arr_desc, field_id, elem_size, total_elems, chunk_data_budget, elems_per_chunk,
elements_in_chunk, array_bytes, memcpy).
- Around line 247-251: stop_log_streaming currently only sets stop_logs_ and
then joins log_thread_, which can hang if the thread is blocked in a gRPC read;
change the shutdown sequence to (1) set stop_logs_, (2) explicitly cancel/close
the gRPC stream used by the reader (e.g. call the
grpc::ClientContext::TryCancel() or cancel/Finish the
ClientReader/ReaderInterface instance used for log streaming), (3) ensure any
CompletionQueue or async reader is shutdown/drained so the blocked read returns,
and only then join log_thread_ (use a timed join as a fallback and log an error
if join times out); update stop_log_streaming to reference stop_logs_,
log_thread_ and the stream/reader/client-context member (the object that
performs the stream read) when implementing the cancel/cleanup steps.

In `@cpp/src/grpc/grpc_problem_mapper.cu`:
- Around line 399-411: The lambda get_doubles uses n = it->second.size() /
sizeof(double) but then memcpy's the full byte size, which can overflow if the
payload isn't aligned; validate that it->second.size() is an exact multiple of
sizeof(double) before copying, compute expected_bytes = n * sizeof(double) and
only memcpy expected_bytes (or return/throw on misaligned size), and when
converting to f_t ensure you only read expected_bytes into the temporary vector;
apply the same check/fix to the other chunked-array decoder blocks that perform
memcpy (the other double-to-f_t conversion in this file).

In `@cpp/src/grpc/server/grpc_job_management.cpp`:
- Around line 147-158: The submit_job_async path performs an unsynchronized
check-then-write on shared job_queue entries (e.g., checking
job_queue[i].ready/claimed then writing fields like job_id, ready, claimed),
which races with check_job_status and cancel_job; protect all accesses to
job_queue (both scans and per-entry reads/writes) by introducing a
synchronization primitive (e.g., a std::mutex or per-entry mutex array) and
acquire the lock(s) around the loop that scans MAX_JOBS and before modifying
fields (job_queue[i].ready, claimed, cancelled, etc.), and also ensure
check_job_status and cancel_job acquire the same lock(s) before reading or
updating those fields so ownership and state transitions are atomic and
race-free.

In `@cpp/src/grpc/server/grpc_server_main.cpp`:
- Around line 186-212: The TLS validation branches that currently do "return 1"
(around checks using config, ssl_opts and read_file_to_string for cert/key/root
and require_client) can occur after workers/threads have been started; replace
those early returns with calls to the centralized shutdown/teardown routine
(e.g., call your existing server stop/cleanup functions such as StopServer(),
shutdownWorkers()/joinThreads(), or a new cleanup_resources() that performs
shutdown and joins) and then return the error code; ensure every failure path
that currently returns from inside the TLS setup invokes that teardown routine
before exiting so workers/threads and other resources are properly cleaned up.
- Around line 234-251: BuildAndStart() may return nullptr so avoid unguarded
dereference of server; after std::unique_ptr<Server>
server(builder.BuildAndStart()), check if (server) before calling server->Wait()
and starting the shutdown logic that assumes a live server. If BuildAndStart()
fails, log an error (or print to cerr) and exit/return non-zero. Also ensure the
shutdown_thread's capture/Shutdown call remains safe by only invoking
server->Shutdown() when server is non-null; move creation of shutdown_thread and
the server->Wait() call inside the if (server) block and handle the failure path
explicitly.

In `@cpp/src/grpc/server/grpc_service_impl.cpp`:
- Around line 71-73: The code currently does
job_queue[job_idx].claimed.store(false) before
job_queue[job_idx].ready.store(true), which can let another submitter claim the
slot before readiness is published; swap the two stores so ready.store(true) is
executed before claimed.store(false) for the job_queue entry, and apply the same
fix to the other identical location where claimed and ready are updated (the
later occurrence around the second claim/release sequence) to prevent the race.

---

Nitpick comments:
In `@cpp/src/grpc/client/grpc_client.hpp`:
- Around line 236-240: The class grpc_client_t is explicitly non-copyable and
non-movable due to an std::atomic<bool> (stop_logs_) and an internal std::thread
member; if you need movability, add a noexcept move constructor and move
assignment that transfer ownership of the PIMPL, move the std::thread (std::move
on the thread member) and ensure the source thread is left in a benign state
(e.g., set to not joinable or detached), and transfer/reset the atomic flag (use
stop_logs_.store(false) or std::exchange to set a sensible value in the
moved-from object). Update grpc_client_t::grpc_client_t(grpc_client_t&&) and
operator=(grpc_client_t&&) to move the pimpl pointer, move the thread, and
handle stop_logs_ correctly; ensure proper synchronization when transferring
thread ownership to avoid races.

In `@cpp/src/grpc/client/test_grpc_client.cpp`:
- Around line 401-403: Replace direct calls to std::stod and std::stoll in the
CLI parsing logic with exception-safe parsing: wrap the std::stod(std::string)
used to set time_limit and the std::stoll used to set incumbent_index in
try-catch blocks that catch std::invalid_argument and std::out_of_range, print a
clear error message indicating the offending flag and value (e.g.,
"--time-limit" or "--incumbent-index"), and exit with a non-zero status instead
of letting the process throw; update the parsing branches that reference
time_limit and incumbent_index so they validate the parsed numeric ranges if
applicable and fall back or exit cleanly on parse failure.

In `@cpp/src/grpc/cuopt_remote_service.proto`:
- Line 9: IncumbentRequest.job_id, Incumbent.job_id and SubmitResponse.job_id
are defined as bytes while most RPC messages (StatusRequest, GetResultRequest,
DeleteRequest, CancelRequest, WaitRequest, StreamLogsRequest, etc.) use string;
change those fields to type string to standardize the schema, update any proto
message definitions where job_id is bytes to string (IncumbentRequest,
Incumbent, SubmitResponse), run proto generation to refresh client/server stubs,
and adjust any serialization/handler code that expects bytes to handle string
instead.

In `@cpp/src/grpc/grpc_solution_mapper.cu`:
- Around line 93-101: The per-element protobuf adds
(pb_solution->add_primal_solution, add_dual_solution, add_reduced_cost) should
be replaced with bulk operations to improve performance for large vectors: call
pb_solution->mutable_primal_solution()->Reserve(primal.size()) and then use
Add() or assign via mutable_primal_solution()->Assign(...) to copy the primal
data (cast to double as needed), and do the same for dual and reduced_cost using
mutable_dual_solution()->Reserve(dual.size())/Assign and
mutable_reduced_cost()->Reserve(reduced_cost.size())/Assign; update the casts so
the bulk copy converts elements to double before Assign/adding.
- Around line 559-598: The code currently detects warm-start presence by testing
ws_primal.empty() in the block that constructs cpu_pdlp_warm_start_data_t (see
bytes_to_typed, cpu_pdlp_warm_start_data_t, and the if (!ws_primal.empty())
guard); change that condition to prefer a dedicated header flag (e.g., use
h.has_warm_start() or h.ws_present() if such a field exists) and fall back to
the primal-array emptiness check if the flag is unavailable, e.g., replace if
(!ws_primal.empty()) with if (h.has_warm_start() || !ws_primal.empty()) so
warm-starts with an empty current_primal are still recognized.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2a43112f-b25e-4054-b00f-1fee3a598df9

📥 Commits

Reviewing files that changed from the base of the PR and between 6cafe37 and 612de9a.

📒 Files selected for processing (38)
  • GRPC_ARCHITECTURE.md
  • SERVER_ARCHITECTURE.md
  • build.sh
  • cpp/CMakeLists.txt
  • cpp/cuopt_grpc_server.cpp
  • cpp/src/grpc/client/grpc_client.cu
  • cpp/src/grpc/client/grpc_client.hpp
  • cpp/src/grpc/client/solve_remote.cu
  • cpp/src/grpc/client/test_grpc_client.cpp
  • cpp/src/grpc/cuopt_remote.proto
  • cpp/src/grpc/cuopt_remote_service.proto
  • cpp/src/grpc/grpc_problem_mapper.cu
  • cpp/src/grpc/grpc_problem_mapper.hpp
  • cpp/src/grpc/grpc_service_mapper.cu
  • cpp/src/grpc/grpc_service_mapper.hpp
  • cpp/src/grpc/grpc_settings_mapper.cu
  • cpp/src/grpc/grpc_settings_mapper.hpp
  • cpp/src/grpc/grpc_solution_mapper.cu
  • cpp/src/grpc/grpc_solution_mapper.hpp
  • cpp/src/grpc/server/grpc_field_element_size.hpp
  • cpp/src/grpc/server/grpc_incumbent_proto.hpp
  • cpp/src/grpc/server/grpc_job_management.cpp
  • cpp/src/grpc/server/grpc_pipe_serialization.hpp
  • cpp/src/grpc/server/grpc_server_main.cpp
  • cpp/src/grpc/server/grpc_server_threads.cpp
  • cpp/src/grpc/server/grpc_server_types.hpp
  • cpp/src/grpc/server/grpc_service_impl.cpp
  • cpp/src/grpc/server/grpc_worker.cpp
  • cpp/src/grpc/server/grpc_worker_infra.cpp
  • cpp/src/pdlp/CMakeLists.txt
  • cpp/src/pdlp/solve_remote.cu
  • cpp/tests/linear_programming/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/grpc_client_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp
  • cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
  • python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
💤 Files with no reviewable changes (2)
  • cpp/src/pdlp/CMakeLists.txt
  • cpp/src/pdlp/solve_remote.cu
🚧 Files skipped from review as they are similar to previous changes (8)
  • cpp/tests/linear_programming/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/CMakeLists.txt
  • cpp/src/grpc/server/grpc_incumbent_proto.hpp
  • cpp/src/grpc/server/grpc_worker_infra.cpp
  • cpp/src/grpc/server/grpc_server_threads.cpp
  • cpp/src/grpc/server/grpc_field_element_size.hpp
  • cpp/src/grpc/server/grpc_worker.cpp
  • python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

♻️ Duplicate comments (24)
cpp/src/grpc/cuopt_remote_service.proto (1)

314-325: ⚠️ Potential issue | 🟠 Major

Use one job_id scalar type across service messages.

Line 315 and Line 324 use bytes, while most service request/response messages use string for job_id. This inconsistency creates avoidable cross-language conversion edge cases.

Schema alignment sketch
 message IncumbentRequest {
-  bytes job_id = 1;
+  string job_id = 1;
   int64 from_index = 2;
   int32 max_count = 3;
 }

 message Incumbent {
   int64 index = 1;
   double objective = 2;
   repeated double assignment = 3;
-  bytes job_id = 4;
+  string job_id = 4;
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/cuopt_remote_service.proto` around lines 314 - 325, The
IncumbentRequest.job_id and Incumbent.job_id fields use bytes while the rest of
the API uses string for job_id; change both fields in messages IncumbentRequest
and Incumbent from bytes to string to ensure a single scalar type for job_id
across the service, and update any serialization/usage code that expects bytes
to accept a string (or perform explicit encoding/decoding at the callsite) so
cross-language conversions remain consistent.
GRPC_ARCHITECTURE.md (1)

9-10: ⚠️ Potential issue | 🟡 Minor

Scope the serialization claim to gRPC wire payloads.

The claim on lines 9-10 is too absolute; local pipe framing helpers exist. Please clarify that it applies specifically to gRPC wire serialization.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@GRPC_ARCHITECTURE.md` around lines 9 - 10, The statement "All serialization
uses protocol buffers generated by `protoc` and `grpc_cpp_plugin` — no custom
serialization logic is implemented." is too absolute; update it to explicitly
scope serialization to gRPC wire payloads (e.g., "All gRPC wire payload
serialization uses protocol buffers generated by `protoc` and
`grpc_cpp_plugin`") and add a brief note acknowledging that local pipe framing
helpers (custom framing/unframing code) are used outside the gRPC wire format;
keep the original wording's intent but clarify that custom framing helpers exist
for local pipes.
SERVER_ARCHITECTURE.md (1)

15-46: ⚠️ Potential issue | 🟡 Minor

Add explicit fence languages to these code blocks.

markdownlint MD040 is still triggered by untyped fences in these sections. Use ```text for ASCII diagrams and table-like flows.

Also applies to: 110-118, 122-135, 139-148, 154-168, 196-198, 207-217

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@SERVER_ARCHITECTURE.md` around lines 15 - 46, The ASCII-art code fences
(e.g., the block starting with the "Main Server Process" diagram that includes
"Shared Memory Queues", "Job Queue", "Result Queue", and worker boxes) are
untyped and trigger markdownlint MD040; update each such fence to use an
explicit text language (replace ``` with ```text) for that diagram and the other
similar untyped blocks (the blocks containing table/flow-style ASCII diagrams
and blocks starting with the same diagram-like contents referenced in the
comment). Ensure every ASCII/table-like fenced block (including the ones around
the ranges called out in the review) is changed to ```text so markdownlint stops
flagging MD040.
cpp/src/grpc/server/grpc_worker_infra.cpp (1)

165-168: ⚠️ Potential issue | 🔴 Critical

Validate PID and handle EINTR in wait_for_workers().

The reap loop at Line 165-Line 168 should skip non-positive PIDs and retry interrupted waits; otherwise waitpid(0, ...)/signal interruption can lead to incorrect child reaping behavior.

Suggested fix
 void wait_for_workers()
 {
   for (pid_t pid : worker_pids) {
+    if (pid <= 0) continue;
     int status;
-    waitpid(pid, &status, 0);
+    while (waitpid(pid, &status, 0) < 0 && errno == EINTR) {}
   }
   worker_pids.clear();
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_worker_infra.cpp` around lines 165 - 168, In
wait_for_workers(), the reap loop currently calls waitpid(pid, &status, 0) for
every pid in worker_pids without validating pid or handling EINTR; update the
loop to skip non-positive PIDs (pid <= 0) and wrap waitpid(pid, &status, 0) in a
retry loop that re-invokes waitpid when it returns -1 with errno == EINTR, and
only break on a successful return or an error other than EINTR; reference the
worker_pids iteration, the pid variable, the waitpid call and the status/errno
checks when making the change.
cpp/src/grpc/cuopt_remote.proto (1)

6-6: ⚠️ Potential issue | 🟠 Major

Align proto package and directory to satisfy Buf lint.

package cuopt.remote at Line 6 does not match the current file directory (cpp/src/grpc), which triggers PACKAGE_DIRECTORY_MATCH.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/cuopt_remote.proto` at line 6, The proto package declaration in
cuopt_remote.proto ("package cuopt.remote;") does not match the file's directory
and triggers PACKAGE_DIRECTORY_MATCH; update the package line in
cuopt_remote.proto to reflect the directory structure (e.g., "package
cpp.src.grpc;" or another package path that mirrors the directory segments) so
the package-to-directory mapping aligns, then re-run buf lint to verify the
PACKAGE_DIRECTORY_MATCH error is resolved.
cpp/src/grpc/server/grpc_server_main.cpp (5)

157-167: ⚠️ Potential issue | 🔴 Critical

Replace raw memset on queue entries that include atomics.

JobQueueEntry / ResultQueueEntry contain std::atomic members; byte-wise zeroing those objects is undefined behavior.

As per coding guidelines "Prevent thread-unsafe use of global and static variables; use proper mutex/synchronization in server code accessing shared solver state."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 157 - 167, The code
currently zeroes JobQueueEntry and ResultQueueEntry with memset (affecting
job_queue and result_queue), which is undefined because those structs contain
std::atomic members; replace the memset calls by value-initializing or
explicitly initializing each member (e.g., construct entries using
JobQueueEntry() or assign default-constructed instances) and then set the atomic
flags via their store(...) calls (retain the existing store(false)/store(-1)
lines for ready/claimed/cancelled/worker_index and retrieved), ensuring no
byte-wise writes touch atomics; update the initialization loop to use assignment
or placement-new of JobQueueEntry/ResultQueueEntry rather than memset to avoid
UB.

50-80: ⚠️ Potential issue | 🟠 Major

Reject unknown or incomplete CLI flags explicitly.

The parser currently falls through for unrecognized flags and for value-requiring flags missing a value, so the server can continue with unintended defaults.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 50 - 80, The CLI
parser in the main argument loop accepts unknown flags and silently ignores
missing values for flags that require them; update the argument parsing logic
(the for-loop handling argv in grpc_server_main.cpp and the handling of flags
like
"--port"/"--workers"/"--max-message-mb"/"--max-message-bytes"/"--tls-cert"/"--tls-key"/"--tls-root")
so that when a flag that requires a value is encountered and i+1 >= argc you
call print_usage(argv[0]) (or log an explicit error) and return a non-zero exit
code, and when an unrecognized argument starting with '-' is seen you also call
print_usage(argv[0]) (or emit an error) and return non-zero—this ensures unknown
or incomplete CLI flags are rejected instead of silently falling back to
defaults.

89-90: ⚠️ Potential issue | 🔴 Critical

Do not wire POSIX signals to the current async-unsafe handler.

signal_handler (in cpp/src/grpc/server/grpc_server_types.hpp) uses operations like condition-variable notification from signal context, which is not async-signal-safe.

As per coding guidelines "Ensure race conditions are absent in multi-threaded server implementations; verify proper synchronization of shared state."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 89 - 90, The current
wiring of POSIX signals via signal(SIGINT, signal_handler) / signal(SIGTERM,
signal_handler) is unsafe because signal_handler performs async-unsafe
operations (e.g., condition-variable notification); instead, change to a safe
signal handling pattern: install a minimal async-signal-safe handler that only
sets a std::sig_atomic_t or std::atomic_bool shutdown flag (or writes a byte to
a dedicated self-pipe) and return, and move any condition-variable notifications
or complex shutdown logic out of signal_handler into the main thread or a
dedicated watcher that polls the flag or reads the pipe; update references in
grpc_server_main.cpp that call signal(...) and the implementation of
signal_handler in grpc_server_types.hpp to follow this pattern so all
non-async-safe actions occur off the signal context.

234-251: ⚠️ Potential issue | 🔴 Critical

Guard BuildAndStart() failure before calling server->Wait().

BuildAndStart() can return null, but Line 250 unconditionally dereferences server.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 234 - 251,
BuildAndStart() may return nullptr so avoid unguarded dereference of server;
after calling std::unique_ptr<Server> server(builder.BuildAndStart()), check if
(!server) and handle the failure (log an error via std::cerr or similar and
return/exit with non-zero) before starting the shutdown_thread or calling
server->Wait(); also ensure the shutdown_thread lambda still checks server
before calling server->Shutdown() and join the thread if started (use a flag to
indicate thread started or create the thread only when server is valid).

185-212: ⚠️ Potential issue | 🔴 Critical

TLS error branches leak workers/threads/shared memory after startup.

These early returns execute after spawn_workers() and thread creation, so teardown is skipped on TLS configuration/read failures.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 185 - 212, The TLS
error branches return early after spawn_workers()/thread/shared-memory setup and
thus leak resources; either move the TLS validation block to run before
spawn_workers(), or ensure every early-return path calls the teardown routines
(e.g., shutdown_workers(), join_worker_threads(), release_shared_memory() or
whatever the existing cleanup functions are) before returning. Locate the TLS
block (uses config, ssl_opts, read_file_to_string) and change it so that
validation happens prior to calling spawn_workers(), or add calls to the
project's worker/shutdown/cleanup functions immediately before each return in
this block to properly stop threads and free shared memory.
cpp/src/grpc/server/grpc_pipe_serialization.hpp (2)

90-90: ⚠️ Potential issue | 🔴 Critical

Return failure on malformed chunk parse instead of breaking and succeeding.

A parse error currently breaks the loop and still returns true, which accepts partial/corrupt payloads as valid.

🛠️ Minimal fix
-    if (!google::protobuf::util::ParseDelimitedFromCodedStream(&ac, &coded, &clean_eof)) break;
+    if (!google::protobuf::util::ParseDelimitedFromCodedStream(&ac, &coded, &clean_eof)) {
+      if (clean_eof) { break; }
+      return false;
+    }
...
-    if (!google::protobuf::util::ParseDelimitedFromCodedStream(&ac, &coded, &clean_eof)) break;
+    if (!google::protobuf::util::ParseDelimitedFromCodedStream(&ac, &coded, &clean_eof)) {
+      if (clean_eof) { break; }
+      return false;
+    }

As per coding guidelines "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files."

Also applies to: 148-148

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` at line 90, The loop
currently treats any false from
google::protobuf::util::ParseDelimitedFromCodedStream(&ac, &coded, &clean_eof)
as a harmless break and ultimately returns true, which accepts corrupt partial
payloads; change the logic so that when ParseDelimitedFromCodedStream returns
false you check clean_eof and only break if clean_eof is true (clean EOF),
otherwise immediately return false to signal a malformed chunk error; update
this behavior for the ParseDelimitedFromCodedStream call at the shown line and
the duplicate occurrence (around the other occurrence near line 148) so
malformed parses fail instead of succeeding.

93-95: ⚠️ Potential issue | 🟠 Major

Validate total_elements size math before resize().

The current resize paths trust unvalidated total_elements (and multiplication with elem_size), which can overflow or trigger excessive allocation on malformed input.

As per coding guidelines "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems."

Also applies to: 151-154

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 93 - 95, The
resize call trusts ac.total_elements() (and the later elem_size * total_elements
math) and can overflow or cause huge allocations; before calling dest.resize()
(both at the dest.empty() check and the duplicate block at the later
occurrence), validate ac.total_elements() and the computed byte_count: cast
total_elements to a fixed-width unsigned type (e.g. uint64_t), check
total_elements > 0 is within a sane upper bound and that elem_size *
total_elements does not overflow size_t (e.g. byte_count = uint64_t(elem_size) *
total_elements; if (byte_count > std::numeric_limits<size_t>::max() ||
byte_count > MAX_ALLOWED_BYTES) return/error), then perform resize using the
checked size_t value; if the validation fails, return an error/handle gracefully
instead of resizing.
cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp (1)

213-218: ⚠️ Potential issue | 🟠 Major

Synchronize server-log state reads/writes consistently.

server_log_path_, server_log_start_pos_, and test_start_marked_ are accessed without a common lock across setters/readers. This can race when log polling and test setup run concurrently.

Suggested fix pattern
   void set_server_log_path(const std::string& path)
   {
+    std::lock_guard<std::mutex> lock(mutex_);
     server_log_path_      = path;
     server_log_start_pos_ = 0;
     test_start_marked_    = false;
   }

   std::string get_server_logs(bool since_test_start = true) const
   {
-    if (server_log_path_.empty()) { return ""; }
+    std::string path;
+    std::streampos start_pos = 0;
+    bool marked = false;
+    {
+      std::lock_guard<std::mutex> lock(mutex_);
+      path = server_log_path_;
+      start_pos = server_log_start_pos_;
+      marked = test_start_marked_;
+    }
+    if (path.empty()) { return ""; }

-    std::ifstream file(server_log_path_);
+    std::ifstream file(path);
     if (!file.is_open()) { return ""; }

-    if (since_test_start && test_start_marked_ && server_log_start_pos_ > 0) {
-      file.seekg(server_log_start_pos_);
+    if (since_test_start && marked && start_pos > 0) {
+      file.seekg(start_pos);
     }

As per coding guidelines: "Prevent thread-unsafe use of global and static variables; use proper mutex/synchronization in server code accessing shared solver state."

Also applies to: 229-244, 332-355

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp` around lines 213
- 218, The three server-log fields (server_log_path_, server_log_start_pos_,
test_start_marked_) are accessed without synchronization; modify
set_server_log_path to acquire a dedicated mutex (e.g., server_log_mutex_) and
set all three fields while holding the lock, and update all other
readers/writers that touch these members (the other setter/getter/polling
functions referenced in the diff ranges) to also lock the same mutex when
reading or writing those fields to prevent races.
cpp/src/grpc/server/grpc_job_management.cpp (2)

109-113: ⚠️ Potential issue | 🟠 Major

Cap incumbent payload size before resizing the buffer.

This path trusts the size prefix and directly calls data.resize(size). A malformed payload can force large allocations and exhaust memory.

Suggested guardrail
 bool recv_incumbent_pipe(int fd, std::vector<uint8_t>& data)
 {
   uint64_t size;
   if (!read_from_pipe(fd, &size, sizeof(size))) return false;
+  if (size > static_cast<uint64_t>(config.max_message_bytes)) {
+    std::cerr << "[Worker] Incumbent payload too large: " << size << "\n";
+    return false;
+  }
   data.resize(size);
   if (size > 0 && !read_from_pipe(fd, data.data(), size)) return false;
   return true;
 }

As per coding guidelines: "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 109 - 113, The code
currently trusts the size prefix and calls data.resize(size) after
read_from_pipe; introduce a validated upper bound (e.g., a constexpr size_t
kMaxJobPayloadSize) and check that size is non-negative and <=
kMaxJobPayloadSize before calling data.resize(size) and before the subsequent
read_from_pipe; if the size is out of range, return false (or handle as a
protocol error) to prevent unbounded allocation and potential memory exhaustion.
Use the existing symbols size, data.resize, and read_from_pipe to locate and
modify the logic.

147-158: ⚠️ Potential issue | 🔴 Critical

Fix racy queue-slot reservation in async submit.

Slot selection currently performs a check-then-write on shared queue state, so concurrent submissions can reserve the same slot.

Suggested approach (atomic claim)
   for (size_t i = 0; i < MAX_JOBS; ++i) {
-    if (!job_queue[i].ready && !job_queue[i].claimed) {
+    if (job_queue[i].ready.load()) { continue; }
+    bool expected = false;
+    if (!job_queue[i].claimed.compare_exchange_strong(expected, true)) { continue; }
       copy_cstr(job_queue[i].job_id, job_id);
       job_queue[i].problem_type = is_mip ? 1 : 0;
       ...
-      job_queue[i].claimed      = false;
-      job_queue[i].cancelled    = false;
-      job_queue[i].ready        = true;
+      job_queue[i].cancelled.store(false);
+      job_queue[i].ready.store(true, std::memory_order_release);
+      job_queue[i].claimed.store(false, std::memory_order_release);

As per coding guidelines: "Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 147 - 158, The loop
in submit that does check-then-write on job_queue slots is racy: change
reservation to an atomic claim (e.g., make job_queue[].claimed an
std::atomic<bool> or std::atomic_flag) and perform an atomic test-and-set
(compare_exchange_strong or test_and_set) to reserve the slot before writing any
fields; once the atomic claim succeeds, initialize job_queue[i].job_id,
problem_type, data_size, worker_* etc., then set job_queue[i].ready = true
(non-atomic write) only after fully initialized; ensure headers (<atomic>) and
any needed memory-ordering are used and that no other code resets claimed
without clearing ready in the reverse order.
cpp/src/grpc/client/solve_remote.cu (2)

53-57: ⚠️ Potential issue | 🟠 Major

Guard invalid env overrides before applying gRPC sizes.

CUOPT_CHUNK_SIZE and CUOPT_MAX_MESSAGE_BYTES are applied without sanity checks. Zero/negative values can break transfer behavior at runtime.

Suggested fix
 static void apply_env_overrides(grpc_client_config_t& config)
 {
-  config.chunk_size_bytes  = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
-  config.max_message_bytes = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
+  const auto chunk_size = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
+  const auto max_msg    = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
+  if (chunk_size > 0) { config.chunk_size_bytes = chunk_size; }
+  if (max_msg > 0) { config.max_message_bytes = max_msg; }
   config.enable_debug_log  = (parse_env_int64("CUOPT_GRPC_DEBUG", 0) != 0);

As per coding guidelines: "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/solve_remote.cu` around lines 53 - 57, The env overrides
in apply_env_overrides currently accept any parsed value from parse_env_int64
for chunk_size_bytes and max_message_bytes; update apply_env_overrides to
validate the returned values before assignment: call parse_env_int64 for
"CUOPT_CHUNK_SIZE" and "CUOPT_MAX_MESSAGE_BYTES", then only assign to
config.chunk_size_bytes and config.max_message_bytes if the parsed value is > 0
and within sane bounds (e.g., enforce a minimum of 1 and a sensible upper cap to
avoid resource exhaustion); leave enable_debug_log assignment using
parse_env_int64 as-is. Ensure you reference apply_env_overrides,
config.chunk_size_bytes, config.max_message_bytes, and parse_env_int64 when
making the change.

163-166: ⚠️ Potential issue | 🟠 Major

Forward the real best bound to incumbent callbacks.

The current forwarding path sets bound_copy from the incumbent objective, which can report an artificial zero gap to user callbacks. This should carry the solver’s actual best bound from the remote pipeline.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/solve_remote.cu` around lines 163 - 166, The code
currently sets bound_copy from objective, so get_callback->get_solution(...)
forwards the incumbent objective as the bound and can report a zero gap; change
the initialization of bound_copy to use the solver’s actual best-bound value
from the remote pipeline (replace "bound_copy = objective" with the variable
that holds the solver's best bound), then call
get_callback->get_solution(solution_copy.data(), &obj_copy, &bound_copy,
callback->get_user_data()); this ensures callbacks receive the real best bound
instead of the incumbent objective.
cpp/src/grpc/grpc_settings_mapper.cu (1)

144-146: ⚠️ Potential issue | 🟠 Major

Range-check iteration_limit before narrowing cast.

A large protobuf iteration_limit (int64_t) can overflow when cast to i_t (which may be int32_t), producing an invalid internal limit.

🔧 Suggested guard
   // proto3 defaults numeric fields to 0; treat negative iteration_limit as "unset"
   // so the server keeps the library default (typically max()).
   if (pb_settings.iteration_limit() >= 0) {
-    settings.iteration_limit = static_cast<i_t>(pb_settings.iteration_limit());
+    const auto limit = pb_settings.iteration_limit();
+    if (limit <= static_cast<int64_t>(std::numeric_limits<i_t>::max())) {
+      settings.iteration_limit = static_cast<i_t>(limit);
+    } else {
+      settings.iteration_limit = std::numeric_limits<i_t>::max();
+    }
   }

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_settings_mapper.cu` around lines 144 - 146, The narrowing
cast from pb_settings.iteration_limit() to settings.iteration_limit (type i_t)
can overflow; before the static_cast in the block that sets
settings.iteration_limit, range-check the int64 value against the numeric limits
of i_t and either clamp to std::numeric_limits<i_t>::max()/min() or
return/report an error if out-of-range—update the code around the check of
pb_settings.iteration_limit() to perform this validation and only then assign
via static_cast<i_t>.
cpp/src/grpc/client/grpc_client.cu (4)

258-266: ⚠️ Potential issue | 🔴 Critical

Add null stub guard and deadline to RPC methods.

Public RPC methods like check_status dereference impl_->stub without checking if a connection was established. Additionally, the ClientContext lacks a deadline, allowing indefinite blocking on network issues. The connection probe at line 191 correctly demonstrates both patterns.

🔧 Proposed fix pattern (apply to all RPC methods)
 job_status_result_t grpc_client_t::check_status(const std::string& job_id)
 {
   job_status_result_t result;
+  if (!impl_->stub) {
+    result.error_message = "Not connected to server";
+    return result;
+  }

   grpc::ClientContext context;
+  if (config_.timeout_seconds > 0) {
+    context.set_deadline(std::chrono::system_clock::now() +
+                         std::chrono::seconds(config_.timeout_seconds));
+  }
   auto request = build_status_request(job_id);

Apply the same pattern to: wait_for_completion, cancel_job, delete_job, get_incumbents, stream_logs, and all chunked upload/download methods.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 258 - 266, The method
grpc_client_t::check_status currently dereferences impl_->stub and creates a
grpc::ClientContext without a deadline; add a null-stub guard at the start
(check impl_ and impl_->stub and return a job_status_result_t error/result
indicating "not connected") before calling build_status_request/CheckStatus, and
set a deadline on the grpc::ClientContext (use std::chrono::... now()+configured
RPC timeout) so the RPC won't block indefinitely; follow the same pattern for
wait_for_completion, cancel_job, delete_job, get_incumbents, stream_logs and all
chunked upload/download methods, using the same error/result handling style used
by the connection probe at line 191.

991-995: ⚠️ Potential issue | 🔴 Critical

Validate poll_interval_ms before division.

Line 994 computes max_polls by dividing by config_.poll_interval_ms. If this value is <= 0, the division will either crash or produce invalid polling behavior. The same issue exists at line 1132.

🔧 Proposed validation
   } else {
     CUOPT_LOG_INFO("[grpc_client] Using polling (CheckStatus) for job %s", job_id.c_str());
+    if (config_.poll_interval_ms <= 0) {
+      stop_log_streaming();
+      result.error_message = "Invalid configuration: poll_interval_ms must be > 0";
+      return result;
+    }
     int poll_count = 0;
     int max_polls  = (config_.timeout_seconds * 1000) / config_.poll_interval_ms;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 991 - 995, The code computes
max_polls by dividing by config_.poll_interval_ms without validating it; add a
guard in the polling branch (where CUOPT_LOG_INFO logs "Using polling
(CheckStatus) for job" and any other similar polling block around the second
occurrence) to check config_.poll_interval_ms > 0 and handle invalid values: log
an error via CUOPT_LOG_ERROR (include job_id/context), set a safe default (e.g.,
poll_interval_ms = 1) or return/fail early, then compute max_polls using the
validated/clamped value; update both locations that reference
config_.poll_interval_ms to use the validated variable.

702-712: ⚠️ Potential issue | 🔴 Critical

Validate chunked result metadata before arithmetic operations.

element_size_bytes is used at line 708 without checking for <= 0, causing division by zero. Additionally, total_elems * elem_size at line 711 can overflow before allocation, making subsequent memcpy operations unsafe.

🔧 Proposed bounds validation
   for (const auto& arr_desc : header->arrays()) {
     auto field_id       = arr_desc.field_id();
     int64_t total_elems = arr_desc.total_elements();
     int64_t elem_size   = arr_desc.element_size_bytes();
     if (total_elems <= 0) continue;
+    if (elem_size <= 0) {
+      last_error_ = "Invalid chunk metadata: non-positive element_size_bytes";
+      return false;
+    }
+    // Check for overflow before allocation
+    if (total_elems > std::numeric_limits<int64_t>::max() / elem_size) {
+      last_error_ = "Invalid chunk metadata: total byte size overflow";
+      return false;
+    }
+    int64_t total_bytes = total_elems * elem_size;
+    if (static_cast<uint64_t>(total_bytes) > std::numeric_limits<size_t>::max()) {
+      last_error_ = "Invalid chunk metadata: exceeds addressable memory";
+      return false;
+    }

     int64_t elems_per_chunk = chunk_data_budget / elem_size;

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 702 - 712, The loop over
header->arrays() uses arr_desc.element_size_bytes() and multiplies it by
arr_desc.total_elements() without validation; add checks that
element_size_bytes() > 0 (return/continue or log and skip if not) before
computing elems_per_chunk to avoid division by zero, compute elems_per_chunk =
max(1, chunk_data_budget / elem_size) only after that check, and validate that
total_elements() * element_size_bytes() cannot overflow size_t before allocating
array_bytes (e.g., if total_elems > SIZE_MAX / elem_size then handle as
error/skip or cap/stream allocation); reference arr_desc.element_size_bytes(),
arr_desc.total_elements(), chunk_data_budget, elems_per_chunk, and the
allocation of array_bytes to implement these guards.

221-252: ⚠️ Potential issue | 🔴 Critical

Log streaming thread can hang indefinitely on shutdown.

The stop_log_streaming() method sets the stop_logs_ flag and joins the thread, but reader->Read() at line 232 blocks indefinitely if the server stops sending messages. The TryCancel() call at line 234 only executes after Read() returns, not before. You need to call context.TryCancel() from the main thread before joining.

🔧 Proposed fix using context cancellation from stop_log_streaming

Add member variables to the class (in grpc_client.hpp):

std::mutex log_context_mutex_;
grpc::ClientContext* active_log_context_ = nullptr;

Then update the methods:

 void grpc_client_t::start_log_streaming(const std::string& job_id)
 {
   if (!config_.stream_logs || !config_.log_callback) return;

   stop_logs_.store(false);
   log_thread_ = std::make_unique<std::thread>([this, job_id]() {
     grpc::ClientContext context;
+    {
+      std::lock_guard<std::mutex> lk(log_context_mutex_);
+      active_log_context_ = &context;
+    }
     auto request = build_stream_logs_request(job_id, 0);
     auto reader  = impl_->stub->StreamLogs(&context, request);

     cuopt::remote::LogMessage log_msg;
     while (reader->Read(&log_msg)) {
       if (stop_logs_.load()) {
-        context.TryCancel();
         break;
       }
       if (config_.log_callback) { config_.log_callback(log_msg.line()); }
       if (log_msg.job_complete()) { break; }
     }
     reader->Finish();
+    {
+      std::lock_guard<std::mutex> lk(log_context_mutex_);
+      active_log_context_ = nullptr;
+    }
   });
 }

 void grpc_client_t::stop_log_streaming()
 {
   stop_logs_.store(true);
+  {
+    std::lock_guard<std::mutex> lk(log_context_mutex_);
+    if (active_log_context_) { active_log_context_->TryCancel(); }
+  }
   if (log_thread_ && log_thread_->joinable()) { log_thread_->join(); }
   log_thread_.reset();
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 221 - 252, The log streaming
thread can block in reader->Read() so add a cancellable context pointer
protected by a mutex: introduce members std::mutex log_context_mutex_ and
grpc::ClientContext* active_log_context_ = nullptr; in grpc_client_t, then in
start_log_streaming() create the local grpc::ClientContext context as before but
immediately register it by locking log_context_mutex_ and setting
active_log_context_ = &context before calling impl_->stub->StreamLogs(...);
after the read/Finish() loop clear the pointer (lock and set to nullptr). In
stop_log_streaming() lock log_context_mutex_, if active_log_context_ != nullptr
call active_log_context_->TryCancel() (to wake up reader->Read()) before joining
the thread, then proceed to join/reset log_thread_ and set stop_logs_. Ensure
all accesses to active_log_context_ are mutex-protected to avoid race conditions
and that the pointer only points to the stack context while the thread is
running.
cpp/src/grpc/server/grpc_server_types.hpp (1)

300-308: ⚠️ Potential issue | 🔴 Critical

Remove non-async-signal-safe operations from signal handler.

std::cout and condition_variable::notify_all() are not async-signal-safe and can deadlock or corrupt state when called from a POSIX signal handler. The signal handler should only perform atomic writes.

🔧 Suggested approach
 inline void signal_handler(int signal)
 {
   if (signal == SIGINT || signal == SIGTERM) {
-    std::cout << "\n[gRPC Server] Received shutdown signal\n";
     keep_running = false;
     if (shm_ctrl) { shm_ctrl->shutdown_requested = true; }
-    result_cv.notify_all();
   }
 }

Move logging and result_cv.notify_all() to a dedicated shutdown-watcher thread that waits for keep_running to become false and then performs the cleanup.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_types.hpp` around lines 300 - 308, The
signal_handler currently performs non-async-signal-safe operations (std::cout
and result_cv.notify_all()), which must be removed; change signal_handler to
only perform atomic-safe actions: set keep_running = false and, if shm_ctrl
exists, set shm_ctrl->shutdown_requested = true (both already present), and
return immediately; then implement a separate shutdown-watcher thread (e.g.,
spawned at server startup) that polls or waits on an atomic/condition to detect
keep_running == false and from that thread perform logging (std::cout or logging
framework), call result_cv.notify_all(), and any other cleanup work — locate
signal_handler, keep_running, shm_ctrl, and result_cv to update behavior and add
the watcher thread to the server lifecycle.
cpp/src/grpc/grpc_problem_mapper.cu (1)

399-427: ⚠️ Potential issue | 🔴 Critical

Add alignment validation before memcpy in chunked array decoding.

The get_doubles and get_ints lambdas compute element count via integer division but then copy the full byte array. If the payload size isn't aligned to element size, memcpy will copy garbage or partial data.

🔧 Proposed alignment checks
   auto get_doubles = [&](int32_t field_id) -> std::vector<f_t> {
     auto it = arrays.find(field_id);
     if (it == arrays.end() || it->second.empty()) return {};
+    if (it->second.size() % sizeof(double) != 0) return {};  // Misaligned data
     size_t n = it->second.size() / sizeof(double);
     if constexpr (std::is_same_v<f_t, double>) {
       std::vector<double> v(n);
-      std::memcpy(v.data(), it->second.data(), it->second.size());
+      std::memcpy(v.data(), it->second.data(), n * sizeof(double));
       return v;
     } else {
       // ...
     }
   };

   auto get_ints = [&](int32_t field_id) -> std::vector<i_t> {
     auto it = arrays.find(field_id);
     if (it == arrays.end() || it->second.empty()) return {};
+    if (it->second.size() % sizeof(int32_t) != 0) return {};  // Misaligned data
     size_t n = it->second.size() / sizeof(int32_t);
     // ...
   };

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_problem_mapper.cu` around lines 399 - 427, The lambdas
get_doubles and get_ints must validate byte alignment before memcpy: compute
size_t bytes = it->second.size(); if bytes == 0 return {}; if (bytes %
sizeof(double) != 0) (for get_doubles) or bytes % sizeof(int32_t) != 0 (for
get_ints) then handle the misaligned payload (log/return error/throw) instead of
blindly memcpy; use size_t n = bytes / sizeof(...) and pass bytes (or n *
sizeof(...)) to memcpy rather than it->second.size(); update references in
get_doubles/get_ints and ensure the branch where types differ still copies only
the aligned byte count and converts only n elements.
🧹 Nitpick comments (3)
cpp/src/grpc/server/grpc_field_element_size.hpp (1)

17-29: Avoid silent fallback for unknown ArrayFieldId.

At Line 28, default: return 8; can mask unsupported/new enum values and mis-size chunk parsing. Prefer an explicit invalid path so callers can reject the request.

Suggested change
 inline int64_t array_field_element_size(cuopt::remote::ArrayFieldId field_id)
 {
   switch (field_id) {
@@
-    default: return 8;
+    default: return -1;  // unknown field id; caller should fail validation
   }
 }
As per coding guidelines: "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_field_element_size.hpp` around lines 17 - 29, The
function array_field_element_size(cuopt::remote::ArrayFieldId) silently falls
back to return 8 for unknown enum values; change it to explicitly handle
unsupported/new enums by logging/propagating an error or failing fast (e.g.,
throw std::invalid_argument or assert) instead of returning 8 so callers can
reject invalid requests and avoid mis-sized parsing; update
array_field_element_size to remove the silent default and return/throw a clear
failure for unexpected cuopt::remote::ArrayFieldId values and ensure callers
handle that error path.
build.sh (1)

7-7: Enable unbound-variable checks in the script preamble.

At Line 7, prefer set -eu to catch unset variable usage early.

Suggested change
-set -e
+set -eu
Based on learnings: "In this repository, prefer using 'set -u' in Bash scripts to detect unbound variables and rely on the default unbound-variable error messages rather than implementing explicit guards with custom error messages."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@build.sh` at line 7, The script preamble currently uses "set -e"; change this
to enable unbound-variable checks by replacing that occurrence with "set -eu"
(i.e., add the -u flag) so the shell will error on unset variables; audit any
places that intentionally rely on unset variables and initialize or guard them
accordingly (locate the "set -e" token in build.sh and update it).
cpp/src/grpc/server/grpc_server_threads.cpp (1)

166-182: Deserialize outside tracker_mutex to reduce contention.

deserialize_result_pipe_blob(...) can be expensive; parsing before taking the lock and only committing parsed state under lock will improve concurrency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_threads.cpp` around lines 166 - 182,
deserialize_result_pipe_blob(...) is done while holding tracker_mutex causing
unnecessary contention; move the expensive deserialize step out of the critical
section by first copying result_data and success locally, call
deserialize_result_pipe_blob(result_data.data(), result_data.size(), hdr,
arrays) before acquiring tracker_mutex, then acquire lock(tracker_mutex), find
job_tracker entry by job_id, verify it still exists and update it: set status to
COMPLETED or FAILED (JobStatus), set error_message on failure, or assign
result_header, result_arrays and result_size_bytes on success; ensure hdr and
arrays are moved into it->second to avoid copying and that result_size_bytes is
set from result_data.size().

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d12476d8-baa0-42cb-9006-1d214ffa2716

📥 Commits

Reviewing files that changed from the base of the PR and between 612de9a and d228718.

📒 Files selected for processing (38)
  • GRPC_ARCHITECTURE.md
  • SERVER_ARCHITECTURE.md
  • build.sh
  • cpp/CMakeLists.txt
  • cpp/cuopt_grpc_server.cpp
  • cpp/src/grpc/client/grpc_client.cu
  • cpp/src/grpc/client/grpc_client.hpp
  • cpp/src/grpc/client/solve_remote.cu
  • cpp/src/grpc/client/test_grpc_client.cpp
  • cpp/src/grpc/cuopt_remote.proto
  • cpp/src/grpc/cuopt_remote_service.proto
  • cpp/src/grpc/grpc_problem_mapper.cu
  • cpp/src/grpc/grpc_problem_mapper.hpp
  • cpp/src/grpc/grpc_service_mapper.cu
  • cpp/src/grpc/grpc_service_mapper.hpp
  • cpp/src/grpc/grpc_settings_mapper.cu
  • cpp/src/grpc/grpc_settings_mapper.hpp
  • cpp/src/grpc/grpc_solution_mapper.cu
  • cpp/src/grpc/grpc_solution_mapper.hpp
  • cpp/src/grpc/server/grpc_field_element_size.hpp
  • cpp/src/grpc/server/grpc_incumbent_proto.hpp
  • cpp/src/grpc/server/grpc_job_management.cpp
  • cpp/src/grpc/server/grpc_pipe_serialization.hpp
  • cpp/src/grpc/server/grpc_server_main.cpp
  • cpp/src/grpc/server/grpc_server_threads.cpp
  • cpp/src/grpc/server/grpc_server_types.hpp
  • cpp/src/grpc/server/grpc_service_impl.cpp
  • cpp/src/grpc/server/grpc_worker.cpp
  • cpp/src/grpc/server/grpc_worker_infra.cpp
  • cpp/src/pdlp/CMakeLists.txt
  • cpp/src/pdlp/solve_remote.cu
  • cpp/tests/linear_programming/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/grpc_client_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp
  • cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
  • python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
💤 Files with no reviewable changes (2)
  • cpp/src/pdlp/CMakeLists.txt
  • cpp/src/pdlp/solve_remote.cu
🚧 Files skipped from review as they are similar to previous changes (5)
  • cpp/src/grpc/server/grpc_incumbent_proto.hpp
  • python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
  • cpp/src/grpc/server/grpc_service_impl.cpp
  • cpp/src/grpc/server/grpc_worker.cpp
  • cpp/src/grpc/grpc_solution_mapper.cu

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 13

♻️ Duplicate comments (25)
cpp/src/grpc/server/grpc_job_management.cpp (2)

147-158: ⚠️ Potential issue | 🔴 Critical

submit_job_async still uses racy check-then-write slot acquisition.

Lines [147]-[158] can let concurrent submitters claim/overwrite the same queue slot. Use atomic reservation (compare_exchange) and publish ordering.

Proposed fix
-  for (size_t i = 0; i < MAX_JOBS; ++i) {
-    if (!job_queue[i].ready && !job_queue[i].claimed) {
+  for (size_t i = 0; i < MAX_JOBS; ++i) {
+    if (job_queue[i].ready.load()) { continue; }
+    bool expected_claimed = false;
+    if (job_queue[i].claimed.compare_exchange_strong(expected_claimed, true)) {
       copy_cstr(job_queue[i].job_id, job_id);
       job_queue[i].problem_type = is_mip ? 1 : 0;
       job_queue[i].data_size    = request_data.size();
       job_queue[i].worker_pid   = 0;
       job_queue[i].worker_index = -1;
-      job_queue[i].data_sent    = false;
-      job_queue[i].claimed      = false;
-      job_queue[i].cancelled    = false;
-      job_queue[i].ready        = true;
+      job_queue[i].data_sent.store(false);
+      job_queue[i].cancelled.store(false);
+      job_queue[i].ready.store(true, std::memory_order_release);
+      job_queue[i].claimed.store(false, std::memory_order_release);

As per coding guidelines "Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 147 - 158,
submit_job_async currently uses a racy check-then-write loop over job_queue
(iterating i from 0..MAX_JOBS) checking job_queue[i].ready and
job_queue[i].claimed then writing the slot; change this to atomically reserve a
slot using an atomic CAS on a dedicated slot state (e.g., atomically
compare_exchange the claimed/empty flag for job_queue[i]) before writing any
fields, and use release semantics when publishing ready (store with
memory_order_release) and acquire semantics when consumers read
(memory_order_acquire); update submit_job_async to attempt compare_exchange on
job_queue[i].claimed (or a new atomic state) to transition from empty->reserved,
only write the job fields after successful CAS, and then set ready=true with
appropriate publish ordering so concurrent submitters cannot overwrite the same
slot.

109-113: ⚠️ Potential issue | 🟠 Major

Bound incumbent payload size before resize to prevent memory exhaustion.

Line [111] trusts a pipe-provided size and allocates directly. A malformed size prefix can force unbounded allocation.

Proposed fix
 bool recv_incumbent_pipe(int fd, std::vector<uint8_t>& data)
 {
   uint64_t size;
   if (!read_from_pipe(fd, &size, sizeof(size))) return false;
+  if (size > static_cast<uint64_t>(config.max_message_bytes)) {
+    std::cerr << "[Worker] Incumbent payload too large: " << size << "\n";
+    return false;
+  }
   data.resize(size);
   if (size > 0 && !read_from_pipe(fd, data.data(), size)) return false;
   return true;
 }

As per coding guidelines "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 109 - 113, The code
reads an untrusted uint64_t size from a pipe and calls data.resize(size)
directly; bound the incoming size before resizing to prevent memory exhaustion
by introducing a sensible constant (e.g. JOB_PAYLOAD_MAX or MAX_PAYLOAD_SIZE)
and rejecting sizes > that limit (return false) or otherwise handling them; also
validate zero/negative semantics if needed and use that check right after
read_from_pipe(fd, &size, ...) and before data.resize(size) (referencing the
local variable size, function read_from_pipe, and the vector/data.resize call).
cpp/src/grpc/cuopt_remote.proto (1)

6-6: ⚠️ Potential issue | 🟠 Major

Buf package-directory mismatch is still unresolved.

Line [6] (package cuopt.remote;) remains in a path that violates Buf PACKAGE_DIRECTORY_MATCH for this module layout, so lint/CI will continue to fail until package/path (or Buf roots) are aligned.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/cuopt_remote.proto` at line 6, The proto package declaration in
cuopt_remote.proto ("package cuopt.remote;") doesn't match the file path,
triggering Buf's PACKAGE_DIRECTORY_MATCH rule; fix by either updating the
package to reflect the file's directory structure (e.g., change package to match
the actual path) or move cuopt_remote.proto into a directory layout matching
"cuopt/remote" (or adjust buf.yaml roots/module configuration to include the
current path). Ensure the package token "cuopt.remote" and the filesystem layout
are aligned so Buf lint passes.
cpp/src/grpc/server/grpc_service_impl.cpp (1)

71-73: ⚠️ Potential issue | 🔴 Critical

Publish ready before releasing claimed to prevent queue-slot double allocation.

At Line [71] and Line [265], releasing claimed before publishing ready re-opens a claim window for another submitter on the same slot.

Proposed fix (both locations)
-    job_queue[job_idx].claimed.store(false);
-    job_queue[job_idx].ready.store(true);
+    job_queue[job_idx].ready.store(true, std::memory_order_release);
+    job_queue[job_idx].claimed.store(false, std::memory_order_release);

As per coding guidelines "Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state".

Also applies to: 265-267

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 71 - 73, The code
currently clears job_queue[job_idx].claimed before publishing
job_queue[job_idx].ready, which allows another submitter to claim the same slot
between those two stores; swap the operations so you set
job_queue[job_idx].ready.store(true) before
job_queue[job_idx].claimed.store(false) (and apply the same swap at the other
occurrence around lines 265-267) to ensure the slot is marked available only
after the ready flag is visible; if using explicit memory orders, use release
semantics for the store to ready to avoid the race.
SERVER_ARCHITECTURE.md (1)

15-46: ⚠️ Potential issue | 🟡 Minor

Add fence languages to satisfy markdownlint MD040

Several fenced blocks are untyped; markdownlint MD040 will keep warning/failing until those fences specify a language (e.g., text).

Proposed patch
-```
+```text
 ┌────────────────────────────────────────────────────────────────────┐
 │                        Main Server Process                          │
 ...
 └─────────────────┘  └─────────────────┘  └─────────────────┘
-```
+```

@@
-```
+```text
 Client                     Server                      Worker
 ...
-```
+```

@@
-```
+```text
 Client                     Server                      Worker
 ...
-```
+```

@@
-```
+```text
 Client                     Server                      Worker
 ...
-```
+```

@@
-```
+```text
 Client                     Worker
 ...
-```
+```

@@
-```
+```text
 ┌─────────┐  submit   ┌───────────┐  claim   ┌────────────┐
 ...
 └───────────┘          └─────────┘
-```
+```

Also applies to: 110-118, 122-136, 139-148, 154-168, 207-217

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@SERVER_ARCHITECTURE.md` around lines 15 - 46, The markdown has multiple
untyped fenced code blocks (the large ASCII "Main Server Process" diagram and
the smaller "Client / Server / Worker" and "submit / claim" pipeline diagrams)
causing markdownlint MD040 failures; update each triple-backtick fence around
these ASCII diagrams in SERVER_ARCHITECTURE.md to specify a language (use
"text") so the fences read ```text ... ```; look for the big box diagram
containing "Main Server Process", the repeated "Client / Server / Worker"
blocks, and the "submit / claim" pipeline block, and add the language token to
each opening fence.
cpp/src/grpc/server/grpc_server_main.cpp (3)

156-168: ⚠️ Potential issue | 🔴 Critical

Don’t memset shared-memory structs that contain std::atomic

memset() on an object with std::atomic members is undefined behavior in C++: it bypasses the atomics' initialization, and a subsequent .store() does not repair an object whose lifetime never properly began.

Safer initialization pattern
   for (size_t i = 0; i < MAX_JOBS; ++i) {
-    memset(&job_queue[i], 0, sizeof(JobQueueEntry));
+    new (&job_queue[i]) JobQueueEntry{};
     job_queue[i].ready.store(false);
     job_queue[i].claimed.store(false);
     job_queue[i].cancelled.store(false);
     job_queue[i].worker_index.store(-1);
   }

   for (size_t i = 0; i < MAX_RESULTS; ++i) {
-    memset(&result_queue[i], 0, sizeof(ResultQueueEntry));
+    new (&result_queue[i]) ResultQueueEntry{};
     result_queue[i].ready.store(false);
     result_queue[i].retrieved.store(false);
   }

As per coding guidelines, “Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 156 - 168, The code
currently calls memset on job_queue[i] and result_queue[i] which is undefined
for types containing std::atomic; remove the memset calls and instead properly
construct/value-initialize each entry (e.g., use placement new: new
(&job_queue[i]) JobQueueEntry(); new (&result_queue[i]) ResultQueueEntry();) and
then set the atomic fields via their .store(...) calls (worker_index.store(-1),
ready.store(false), claimed.store(false), cancelled.store(false),
retrieved.store(false)); update the loops that touch job_queue and result_queue
(and references to MAX_JOBS / MAX_RESULTS) to use this safe initialization
pattern.
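A minimal, self-contained version of the suggested initialization pattern, assuming a `JobQueueEntry`-like layout (the real struct lives in the PR); placement new begins each object's lifetime correctly where `memset` would not:

```cpp
#include <atomic>
#include <cstddef>
#include <new>

// Stand-in for the PR's queue entry; only the atomic members matter here.
struct JobQueueEntry {
  std::atomic<bool> ready{false};
  std::atomic<bool> claimed{false};
  std::atomic<int>  worker_index{-1};
};

// Raw storage standing in for a freshly mapped shared-memory region.
alignas(JobQueueEntry) static unsigned char storage[sizeof(JobQueueEntry) * 4];

inline JobQueueEntry* init_queue(unsigned char* base, std::size_t n)
{
  auto* entries = reinterpret_cast<JobQueueEntry*>(base);
  for (std::size_t i = 0; i < n; ++i) {
    // Placement new value-initializes each entry, including the atomics;
    // unlike memset, this is well-defined for types with std::atomic members.
    new (&entries[i]) JobQueueEntry{};
  }
  return entries;
}
```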

173-212: ⚠️ Potential issue | 🔴 Critical

Avoid std::terminate on TLS error paths; guard BuildAndStart() failure

TLS validation errors currently return after starting workers/threads (joinable thread dtors will call std::terminate). Separately, BuildAndStart() can return null and server->Wait() will crash. Both need a single cleanup/teardown path.

Minimal direction: introduce a shutdown helper and use it on all early-exit paths
   spawn_workers();

   std::thread result_thread(result_retrieval_thread);
   std::thread incumbent_thread(incumbent_retrieval_thread);
   std::thread monitor_thread(worker_monitor_thread);
   std::thread reaper_thread(session_reaper_thread);

+  auto shutdown_runtime = [&]() {
+    keep_running                 = false;
+    shm_ctrl->shutdown_requested = true;
+    result_cv.notify_all();
+    if (result_thread.joinable()) result_thread.join();
+    if (incumbent_thread.joinable()) incumbent_thread.join();
+    if (monitor_thread.joinable()) monitor_thread.join();
+    if (reaper_thread.joinable()) reaper_thread.join();
+    wait_for_workers();
+    cleanup_shared_memory();
+  };
@@
   if (config.enable_tls) {
     if (config.tls_cert_path.empty() || config.tls_key_path.empty()) {
       std::cerr << "[Server] TLS enabled but --tls-cert/--tls-key not provided\n";
-      return 1;
+      shutdown_runtime();
+      return 1;
     }
@@
       if (ssl_opts.pem_root_certs.empty()) {
         std::cerr << "[Server] --require-client-cert requires --tls-root\n";
-        return 1;
+        shutdown_runtime();
+        return 1;
       }
@@
   std::unique_ptr<Server> server(builder.BuildAndStart());
+  if (!server) {
+    std::cerr << "[gRPC Server] Failed to start server on " << server_address << "\n";
+    shutdown_runtime();
+    return 1;
+  }
@@
   server->Wait();

As per coding guidelines, “Prevent thread-unsafe use of global and static variables; use proper mutex/synchronization in server code accessing shared solver state.”

Also applies to: 234-251

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 173 - 212, Spawned
worker threads and early TLS/BuildAndStart error paths can return while threads
are still joinable (causing std::terminate) and BuildAndStart() can return null
causing a crash; introduce a single shutdown helper (e.g.,
shutdown_server_and_workers()) that stops workers, signals threads to exit,
joins result_retrieval_thread, incumbent_retrieval_thread,
worker_monitor_thread, session_reaper_thread and cleans up any resources and the
grpc::Server pointer, call this helper on all early-exit/TLS error paths and
when builder.BuildAndStart() returns nullptr before returning non-zero, and
replace ad-hoc returns with calls to this helper; additionally ensure shared
solver/global state access in server handlers uses proper mutex/synchronization
(e.g., std::mutex/lock_guard around the solver state) to avoid thread-unsafe
access.

50-80: ⚠️ Potential issue | 🟠 Major

CLI parsing should reject unknown/incomplete flags and handle std::stoi failures

Right now unknown flags and missing values are silently ignored, and std::stoi/std::stoll can throw and abort the server. Also, --chunk-timeout is documented but not parsed at all.

Example patch sketch (apply pattern to all flags)
   for (int i = 1; i < argc; i++) {
     std::string arg = argv[i];
     if (arg == "-p" || arg == "--port") {
-      if (i + 1 < argc) { config.port = std::stoi(argv[++i]); }
+      if (i + 1 >= argc) {
+        std::cerr << "ERROR: " << arg << " requires a value\n";
+        print_usage(argv[0]);
+        return 1;
+      }
+      try {
+        config.port = std::stoi(argv[++i]);
+      } catch (const std::exception&) {
+        std::cerr << "ERROR: Invalid port value for " << arg << ": '" << argv[i] << "'\n";
+        return 1;
+      }
     } else if (arg == "--max-message-bytes") {
-      if (i + 1 < argc) { config.max_message_b = std::max(4096LL, std::stoll(argv[++i])); }
+      if (i + 1 >= argc) {
+        std::cerr << "ERROR: " << arg << " requires a value\n";
+        return 1;
+      }
+      try {
+        config.max_message_b = std::max<int64_t>(4096LL, std::stoll(argv[++i]));
+      } catch (const std::exception&) {
+        std::cerr << "ERROR: Invalid value for " << arg << ": '" << argv[i] << "'\n";
+        return 1;
+      }
+    } else if (arg == "--chunk-timeout") {
+      if (i + 1 >= argc) {
+        std::cerr << "ERROR: " << arg << " requires a value\n";
+        return 1;
+      }
+      try {
+        config.chunk_timeout_seconds = std::stoi(argv[++i]);
+      } catch (const std::exception&) {
+        std::cerr << "ERROR: Invalid value for " << arg << ": '" << argv[i] << "'\n";
+        return 1;
+      }
     } else if (arg == "-h" || arg == "--help") {
       print_usage(argv[0]);
       return 0;
+    } else {
+      std::cerr << "ERROR: Unknown option '" << arg << "'\n";
+      print_usage(argv[0]);
+      return 1;
     }
   }

As per coding guidelines, “Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 50 - 80, The CLI
parsing loop in main currently silently ignores unknown flags and missing
values, and calls std::stoi/std::stoll without catching exceptions; also the
documented flag "--chunk-timeout" is not parsed. Update the argv parsing code
(the for loop handling argv[], touching config.port, config.num_workers,
config.max_message_mb, config.max_message_b, config.enable_transfer_hash,
config.enable_tls, config.tls_cert_path, config.tls_key_path,
config.tls_root_path, config.require_client, config.log_to_console,
config.verbose) to: 1) validate that flags requiring a value check (i + 1 <
argc) and on failure print_usage(argv[0]) and return a non-zero error code; 2)
wrap std::stoi/std::stoll usages in try/catch(std::exception&) to log a clear
error and exit non-zero on parse failure; 3) handle unknown flags by printing an
error + usage and exiting non-zero; and 4) add parsing for "--chunk-timeout" to
populate the appropriate config field (e.g., config.chunk_timeout) following the
same validation and exception-handling pattern. Ensure error messages include
the offending flag for easier debugging.
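The try/catch pattern above works; `std::from_chars` is a non-throwing alternative worth considering for the same validation (helper name is illustrative):

```cpp
#include <charconv>
#include <cstdint>
#include <optional>
#include <string_view>

// Parses a full string as a base-10 int64. Returns nullopt on malformed,
// partial, or out-of-range input -- no exceptions to catch.
inline std::optional<int64_t> parse_int64(std::string_view s)
{
  int64_t value = 0;
  auto [ptr, ec] = std::from_chars(s.data(), s.data() + s.size(), value);
  if (ec != std::errc{} || ptr != s.data() + s.size()) { return std::nullopt; }
  return value;
}
```

A flag handler then becomes `auto v = parse_int64(argv[++i]); if (!v) { /* print error, return 1 */ }`, with range checks (e.g., valid port) applied to `*v`.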
build.sh (1)

390-398: ⚠️ Potential issue | 🟠 Major

cuopt_grpc_server fast path still captures mixed-target builds

As written, `build.sh cuopt_grpc_server cuopt` will still take the server-only branch and skip the requested install target(s). That’s a surprising foot-gun for devs/CI.

Proposed patch
-    if hasArg cuopt_grpc_server && ! hasArg libcuopt && ! buildAll; then
+    if hasArg cuopt_grpc_server && ! hasArg libcuopt && ! buildAll \
+       && ! hasArg cuopt && ! hasArg cuopt_server && ! hasArg cuopt_mps_parser \
+       && ! hasArg cuopt_sh_client && ! hasArg docs && ! hasArg deb \
+       && ! hasArg libmps_parser; then
         # Build only the gRPC server (ninja resolves libcuopt as a dependency)
         cmake --build "${LIBCUOPT_BUILD_DIR}" --target cuopt_grpc_server ${VERBOSE_FLAG} ${JFLAG}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@build.sh` around lines 390 - 398, The fast-path condition that builds only
cuopt_grpc_server (the if branch using hasArg cuopt_grpc_server && ! hasArg
libcuopt && ! buildAll) incorrectly triggers when other targets (e.g., cuopt)
are also requested; change the guard so it only takes the server-only branch
when cuopt_grpc_server is the sole target requested (for example replace the
condition with a new predicate like hasOnlyArg cuopt_grpc_server or add explicit
negations for all other possible target args / check INSTALL_TARGET equals
cuopt_grpc_server), updating the if in build.sh that surrounds the cmake --build
"${LIBCUOPT_BUILD_DIR}" --target cuopt_grpc_server block so mixed-target
invocations fall through to the normal install/build branch.
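A sole-target predicate like the suggested `hasOnlyArg` could look like this sketch; `ARGS` stands in for build.sh's space-separated argument list, and the leading-dash flag convention is an assumption:

```shell
# True only when $1 is the single non-flag target present in $ARGS.
hasOnlyArg() {
    local target="$1" tok count=0 match=0
    for tok in $ARGS; do
        case "$tok" in
            -*) continue ;;   # skip flags like -v / -n
        esac
        count=$((count + 1))
        if [ "$tok" = "$target" ]; then match=1; fi
    done
    [ "$count" -eq 1 ] && [ "$match" -eq 1 ]
}
```

The fast-path guard then reads `if hasOnlyArg cuopt_grpc_server; then ...`, so mixed-target invocations fall through to the normal build branch without enumerating every other target name.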
cpp/src/grpc/client/test_grpc_client.cpp (1)

381-412: ⚠️ Potential issue | 🟠 Major

Harden numeric argument parsing and validate --time-limit / incumbent index

std::stod/std::stoll throw on bad input, and static_cast<int>(time_limit) can be undefined if the value is out of range. This tool should fail with a clean CLI error instead of terminating.

Proposed patch
   while (arg_idx < argc && argv[arg_idx][0] == '-') {
     std::string arg = argv[arg_idx];
@@
     } else if (arg == "--time-limit" && arg_idx + 1 < argc) {
-      time_limit = std::stod(argv[++arg_idx]);
+      const char* value = argv[++arg_idx];
+      try {
+        time_limit = std::stod(value);
+      } catch (const std::exception&) {
+        std::cerr << "ERROR: Invalid value for --time-limit: '" << value << "'\n";
+        return 1;
+      }
+      if (!std::isfinite(time_limit) || time_limit <= 0.0) {
+        std::cerr << "ERROR: --time-limit must be a finite positive number (got '" << value
+                  << "')\n";
+        return 1;
+      }
       arg_idx++;
@@
   grpc_client_config_t config;
   config.server_address  = server_address;
-  config.timeout_seconds = static_cast<int>(time_limit) + 120;
+  if (time_limit > static_cast<double>(std::numeric_limits<int>::max() - 120)) {
+    std::cerr << "ERROR: --time-limit too large (max "
+              << (std::numeric_limits<int>::max() - 120) << ")\n";
+    return 1;
+  }
+  config.timeout_seconds = static_cast<int>(time_limit) + 120;
@@
   } else if (mode == "incumbent") {
@@
     std::string job_id = argv[arg_idx++];
-    int64_t from_index = (arg_idx < argc) ? std::stoll(argv[arg_idx]) : 0;
+    int64_t from_index = 0;
+    if (arg_idx < argc) {
+      const char* value = argv[arg_idx];
+      try {
+        from_index = std::stoll(value);
+      } catch (const std::exception&) {
+        std::cerr << "ERROR: Invalid incumbent index: '" << value << "'\n";
+        return 1;
+      }
+      if (from_index < 0) {
+        std::cerr << "ERROR: incumbent index must be >= 0 (got '" << value << "')\n";
+        return 1;
+      }
+    }
     return mode_incumbent(client, job_id, from_index);
   }

Also applies to: 425-426, 531-532

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/test_grpc_client.cpp` around lines 381 - 412, The CLI
currently calls std::stod (for time_limit) and std::stoll (for the incumbent
index elsewhere) which throw on invalid input and can lead to undefined behavior
when casting large values; update the argument parsing in the main option loop
(handling --time-limit / variable time_limit) to validate input robustly: parse
numeric arguments using non-throwing parsing (e.g., std::from_chars for integers
and a safe check for floating parsing or wrap std::stod in try/catch), detect
conversion failure or out-of-range values, emit a clear error message via
std::cerr and return a non-zero exit code, and before doing
static_cast<int>(time_limit) or casting the stoll result ensure the parsed value
fits the target range to avoid undefined behavior; apply the same
validation/failure handling to the other occurrences parsing integers (the
std::stoll sites noted in the review).
cpp/src/grpc/client/solve_remote.cu (2)

164-166: ⚠️ Potential issue | 🟠 Major

Do not pass incumbent objective as best-bound in callbacks.

Lines 164–166 set bound_copy equal to objective, which reports a fake zero-gap state to callback consumers.

A safe interim behavior is to pass NaN (unknown bound) until the incumbent protocol includes a real bound field end-to-end.

🔧 Interim mitigation in this file
+#include <limits>
...
-          double bound_copy                 = objective;  // Use objective as bound for incumbent
+          double bound_copy = std::numeric_limits<double>::quiet_NaN();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/solve_remote.cu` around lines 164 - 166, The callback
currently passes the incumbent objective as the best-bound (bound_copy =
objective) to get_callback->get_solution, which falsely reports a zero-gap;
change the interim behavior to pass an unknown bound (use NaN) instead of
objective when calling get_callback->get_solution (affecting bound_copy,
get_callback->get_solution(..., &bound_copy, callback->get_user_data()) and
related solution_copy/obj_copy variables) so callbacks receive NaN until the
protocol supplies a real bound end-to-end.
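The NaN convention above only helps if downstream gap math treats NaN as "unknown" rather than zero; a sketch of that contract (function names are illustrative, not from the PR):

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// The interim "unknown bound" sentinel.
inline double unknown_bound() { return std::numeric_limits<double>::quiet_NaN(); }

// Gap computation that propagates "unknown" instead of fabricating zero gap.
inline double relative_gap(double objective, double bound)
{
  if (std::isnan(bound)) { return std::numeric_limits<double>::quiet_NaN(); }
  double denom = std::max(std::abs(objective), 1e-10);  // avoid divide-by-zero
  return std::abs(objective - bound) / denom;
}
```

Consumers should check `std::isnan()` on the gap before acting on it (e.g., before an early-termination-on-gap decision), since any comparison with NaN is false.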

55-57: ⚠️ Potential issue | 🟠 Major

Validate env sizing overrides before applying them.

Lines 55–57 still accept zero/negative values for CUOPT_CHUNK_SIZE and CUOPT_MAX_MESSAGE_BYTES, which can destabilize transfer behavior.

🔧 Proposed guardrails
-  config.chunk_size_bytes  = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
-  config.max_message_bytes = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
+  auto chunk_size = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
+  auto max_msg    = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
+  if (chunk_size > 0) { config.chunk_size_bytes = chunk_size; }
+  if (max_msg > 0) { config.max_message_bytes = max_msg; }

As per coding guidelines: "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/solve_remote.cu` around lines 55 - 57, The env overrides
for CUOPT_CHUNK_SIZE and CUOPT_MAX_MESSAGE_BYTES may be zero/negative; update
the code that calls parse_env_int64 to validate the returned value before
assigning to config.chunk_size_bytes and config.max_message_bytes (e.g., call
parse_env_int64 for each into a temp variable, check temp > 0 and optionally cap
to a safe maximum, then assign to config.*; if invalid, retain the existing
default and emit a debug/warn via config.enable_debug_log or logger). Ensure you
reference parse_env_int64, config.chunk_size_bytes, config.max_message_bytes,
and the CUOPT_CHUNK_SIZE / CUOPT_MAX_MESSAGE_BYTES env names when implementing
the guard.
cpp/CMakeLists.txt (1)

279-294: ⚠️ Potential issue | 🟠 Major

Resolve protoc via $<TARGET_FILE:protobuf::protoc> to support in-tree protobuf targets.

Lines 279–294 still assume protobuf::protoc has IMPORTED_LOCATION*. That fails for non-imported targets (e.g., FetchContent/add_subdirectory), causing false fatal errors.

🔧 Proposed CMake adjustment
 if(TARGET protobuf::protoc)
-  get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION_RELEASE)
-  if(NOT _PROTOBUF_PROTOC)
-    get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION)
-  endif()
+  set(_PROTOBUF_PROTOC $<TARGET_FILE:protobuf::protoc>)
 else()
   find_package(protobuf CONFIG REQUIRED)
-  get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION_RELEASE)
-  if(NOT _PROTOBUF_PROTOC)
-    get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION)
-  endif()
+  if(TARGET protobuf::protoc)
+    set(_PROTOBUF_PROTOC $<TARGET_FILE:protobuf::protoc>)
+  endif()
 endif()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/CMakeLists.txt` around lines 279 - 294, The current logic assumes
protobuf::protoc has IMPORTED_LOCATION properties and can fail for in-tree
targets; change the resolution to first check if the target exists and set
_PROTOBUF_PROTOC to the generator expression $<TARGET_FILE:protobuf::protoc>
when TARGET protobuf::protoc is true, otherwise fall back to
get_target_property(... IMPORTED_LOCATION_RELEASE/IMPORTED_LOCATION) or the
PROTOBUF_PROTOC_EXECUTABLE cache variable; update any add_custom_command(...)
references to use ${_PROTOBUF_PROTOC} and keep the existing fatal error check if
_PROTOBUF_PROTOC is still empty.
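For reference, the `$<TARGET_FILE:...>` approach can be condensed as below; `GENERATED_SRCS` and `PROTO_FILE` are placeholder variables, not names from the PR:

```cmake
# $<TARGET_FILE:...> resolves at build time for both imported and in-tree
# targets, so no IMPORTED_LOCATION* property probing is needed.
if(NOT TARGET protobuf::protoc)
  find_package(protobuf CONFIG REQUIRED)
endif()
set(_PROTOBUF_PROTOC $<TARGET_FILE:protobuf::protoc>)

add_custom_command(
  OUTPUT ${GENERATED_SRCS}
  COMMAND ${_PROTOBUF_PROTOC} --cpp_out=${CMAKE_CURRENT_BINARY_DIR} ${PROTO_FILE}
  DEPENDS protobuf::protoc ${PROTO_FILE})
```

Listing the target in `DEPENDS` also ensures an in-tree protoc is built before the codegen step runs.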
cpp/src/grpc/server/grpc_pipe_serialization.hpp (3)

96-100: ⚠️ Potential issue | 🔴 Critical

Validate element_offset and overflow before pointer arithmetic in result deserialization.

Lines 96–100 still allow negative offsets and unchecked offset + chunk_size arithmetic, which can underflow/overflow bounds checks and lead to an unsafe memcpy.

🔧 Proposed fix
-    int64_t offset         = ac.element_offset();
-    const auto& chunk_data = ac.data();
-    if (offset + static_cast<int64_t>(chunk_data.size()) <= static_cast<int64_t>(dest.size())) {
-      std::memcpy(dest.data() + offset, chunk_data.data(), chunk_data.size());
-    }
+    int64_t offset         = ac.element_offset();
+    const auto& chunk_data = ac.data();
+    if (offset < 0) { return false; }
+    size_t uoffset = static_cast<size_t>(offset);
+    if (uoffset > dest.size()) { return false; }
+    if (chunk_data.size() > dest.size() - uoffset) { return false; }
+    std::memcpy(dest.data() + uoffset, chunk_data.data(), chunk_data.size());

As per coding guidelines: "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 96 - 100, The
current deserialization uses offset = ac.element_offset() and then does pointer
arithmetic without validating offset or preventing overflow/underflow before
memcpy; update the logic in grpc_pipe_serialization.hpp around the block using
offset, chunk_data and dest so you first validate element_offset() is
non-negative and within dest.size(), compute size_t chunk_size =
chunk_data.size(), and perform a safe bounds check using size_t conversions
(e.g., offset <= dest.size() && offset + chunk_size <= dest.size()) while
guarding against overflow on offset + chunk_size; only call
std::memcpy(dest.data() + offset, ...) when those checks pass and otherwise
handle the error/return early.
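The proposed checks factor cleanly into a standalone predicate, which also makes them unit-testable (the name is illustrative):

```cpp
#include <cstddef>
#include <cstdint>

// Returns true iff a chunk of `len` bytes at element_offset `offset` fits
// entirely inside a destination buffer of `dest_size` bytes. The ordering
// of the checks avoids signed/unsigned underflow and overflow.
inline bool chunk_fits(int64_t offset, std::size_t len, std::size_t dest_size)
{
  if (offset < 0) { return false; }
  std::size_t uoffset = static_cast<std::size_t>(offset);
  if (uoffset > dest_size) { return false; }
  return len <= dest_size - uoffset;  // subtraction cannot underflow here
}
```

The copy site then reduces to `if (!chunk_fits(...)) return false; std::memcpy(...);`.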

90-90: ⚠️ Potential issue | 🔴 Critical

Return failure on non-EOF parse errors in both deserializers.

Lines 90 and 148 still break on any parse failure and eventually return true, which can accept malformed/truncated payloads as valid.

🔧 Proposed fix
-    if (!google::protobuf::util::ParseDelimitedFromCodedStream(&ac, &coded, &clean_eof)) break;
+    if (!google::protobuf::util::ParseDelimitedFromCodedStream(&ac, &coded, &clean_eof)) {
+      if (clean_eof) { break; }  // normal end-of-stream
+      return false;              // malformed/truncated message
+    }

As per coding guidelines: "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files."

Also applies to: 148-148

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` at line 90, The loops
calling google::protobuf::util::ParseDelimitedFromCodedStream (using variables
ac, coded, clean_eof) currently use "break" on parse failure which lets the
function return true for truncated/malformed input; change those breaks to
return false on a parse failure unless the failure was a clean EOF. Concretely,
at both call sites (around the ParseDelimitedFromCodedStream lines referenced)
check the clean_eof flag after a failed parse and return false for non-EOF
failures (only allow breaking/ending the loop when clean_eof is true).

151-163: ⚠️ Potential issue | 🟠 Major

Guard total_elements * elem_size and byte-offset math before resize/copy.

Lines 151–163 still perform an unchecked multiplication and resize driven by untrusted chunk metadata. A malformed payload can overflow the size math or trigger an excessive allocation.

🔧 Proposed fix
+    if (ac.total_elements() < 0) { return false; }
     if (dest.empty() && ac.total_elements() > 0) {
       int64_t elem_size = array_field_element_size(ac.field_id());
-      dest.resize(static_cast<size_t>(ac.total_elements() * elem_size), 0);
+      if (elem_size <= 0) { return false; }
+      uint64_t total_bytes = static_cast<uint64_t>(ac.total_elements()) *
+                             static_cast<uint64_t>(elem_size);
+      if (total_bytes > static_cast<uint64_t>(std::numeric_limits<size_t>::max())) {
+        return false;
+      }
+      dest.resize(static_cast<size_t>(total_bytes), 0);
     }

As per coding guidelines: "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 151 - 163, The
code in the block that resizes dest and copies chunk bytes must validate sizes
to avoid integer overflow and huge allocations: before calling
dest.resize(ac.total_elements() * elem_size) and before computing byte_offset,
check that ac.total_elements() and computed elem_size are > 0 and that
static_cast<size_t>(ac.total_elements()) <= SIZE_MAX /
static_cast<size_t>(elem_size); bound the requested allocation against a safe
maximum (or return an error) to prevent resource exhaustion; verify
chunk_data.size() is <= dest.size() and that byte_offset >= 0 and byte_offset <=
dest.size() and that byte_offset + chunk_data.size() does not overflow (use
size_t or checked arithmetic) before calling std::memcpy; refer to symbols dest,
ac, array_field_element_size(), total_elements(), element_offset(), data(),
elem_size, byte_offset, resize, and std::memcpy when making these checks.
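The overflow guard above amounts to one checked multiplication; a sketch with illustrative names:

```cpp
#include <cstdint>
#include <limits>

// Computes total_elements * elem_size into `out` only when both inputs are
// sane and the product fits in size_t; returns false otherwise.
inline bool total_bytes_checked(int64_t total_elements, int64_t elem_size, uint64_t& out)
{
  if (total_elements < 0 || elem_size <= 0) { return false; }
  uint64_t n = static_cast<uint64_t>(total_elements);
  uint64_t e = static_cast<uint64_t>(elem_size);
  if (n != 0 && e > std::numeric_limits<uint64_t>::max() / n) { return false; }
  out = n * e;
  return out <= static_cast<uint64_t>(std::numeric_limits<std::size_t>::max());
}
```

A further cap against a deployment-specific maximum (to reject absurd-but-representable sizes) is worth adding on top; the right limit depends on the server's memory budget, so it is not shown here.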
cpp/src/grpc/server/grpc_server_types.hpp (1)

300-308: ⚠️ Potential issue | 🔴 Critical

Remove async-signal-unsafe operations from signal_handler().

std::cout and condition_variable::notify_all() are not async-signal-safe and can deadlock/corrupt state when invoked from a POSIX signal handler.

Suggested minimal signal-safe handler
 inline void signal_handler(int signal)
 {
   if (signal == SIGINT || signal == SIGTERM) {
-    std::cout << "\n[gRPC Server] Received shutdown signal\n";
     keep_running = false;
     if (shm_ctrl) { shm_ctrl->shutdown_requested = true; }
-    result_cv.notify_all();
   }
 }

Then have a normal thread (or the main loop) observe keep_running == false and do logging + result_cv.notify_all() outside the signal context.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_types.hpp` around lines 300 - 308,
signal_handler currently performs async-unsafe operations (std::cout and
result_cv.notify_all()); modify it to only perform signal-safe actions: set a
sig_atomic_t/atomic<bool> flag (keep_running) and, if needed, set
shm_ctrl->shutdown_requested (ensure that shutdown_requested is signal-safe or
updated via atomic/sig_atomic_t), then return immediately—remove std::cout and
result_cv.notify_all() from signal_handler. Update the main loop or a dedicated
watcher thread to observe keep_running (or the atomic) and then perform logging
and call result_cv.notify_all() outside the signal context (use the symbols
signal_handler, keep_running, shm_ctrl, result_cv to locate changes).
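The minimal signal-safe handler then reduces to a flag write; `volatile std::sig_atomic_t` is the portable type for this (a lock-free `std::atomic<bool>` also works), with all logging and condition-variable traffic moved to a normal thread that polls the flag:

```cpp
#include <csignal>

// Only async-signal-safe work happens here: a single flag store.
// No iostreams, no mutexes, no condition_variable::notify_all().
static volatile std::sig_atomic_t g_shutdown = 0;

extern "C" void shutdown_handler(int) { g_shutdown = 1; }

// A watcher thread or the main loop later observes g_shutdown != 0 and
// performs the logging + result_cv.notify_all() outside signal context.
```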
cpp/src/grpc/client/grpc_client.cu (5)

702-745: ⚠️ Potential issue | 🔴 Critical

Harden chunked-result metadata validation before division/alloc/memcpy.

element_size_bytes and total_elements arrive from the wire unvalidated. elem_size <= 0 triggers a division by zero (Line 708), and total_elems * elem_size can overflow size_t before allocation, making the memcpy (Line 739) unsafe.

Proposed bounds checks
   for (const auto& arr_desc : header->arrays()) {
     auto field_id       = arr_desc.field_id();
     int64_t total_elems = arr_desc.total_elements();
     int64_t elem_size   = arr_desc.element_size_bytes();
     if (total_elems <= 0) continue;
+    if (elem_size <= 0) {
+      last_error_ = "Invalid chunk metadata: non-positive element_size_bytes";
+      return false;
+    }
+    if (total_elems > std::numeric_limits<int64_t>::max() / elem_size) {
+      last_error_ = "Invalid chunk metadata: total byte size overflow";
+      return false;
+    }
+    const int64_t total_bytes = total_elems * elem_size;
+    if (total_bytes < 0 ||
+        static_cast<uint64_t>(total_bytes) >
+          static_cast<uint64_t>(std::numeric_limits<size_t>::max())) {
+      last_error_ = "Invalid chunk metadata: byte size exceeds addressable memory";
+      return false;
+    }

     int64_t elems_per_chunk = chunk_data_budget / elem_size;
     if (elems_per_chunk <= 0) elems_per_chunk = 1;

-    std::vector<uint8_t> array_bytes(static_cast<size_t>(total_elems * elem_size));
+    std::vector<uint8_t> array_bytes(static_cast<size_t>(total_bytes));

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero, and use epsilon comparisons for floating-point equality checks".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 702 - 745, The code trusts
element_size_bytes() and total_elements() and must validate them before any
division, allocation, or memcpy: check that elem_size =
arr_desc.element_size_bytes() > 0 and total_elems = arr_desc.total_elements() >=
0, ensure elems_per_chunk = chunk_data_budget / elem_size is computed only after
the elem_size > 0 check, and verify that total_elems * elem_size does not
overflow size_t before constructing array_bytes; also validate that elems_wanted
and elems_received are non-negative and that data.size() equals elems_received *
elem_size and fits within array_bytes bounds before calling std::memcpy
(references: arr_desc.field_id(), arr_desc.total_elements(),
arr_desc.element_size_bytes(), elems_per_chunk, array_bytes,
chunk_resp.elements_in_chunk(), chunk_resp.data(), and the std::memcpy call).

258-266: ⚠️ Potential issue | 🔴 Critical

Guard all public RPC methods against impl_->stub being null.

check_status(), wait_for_completion(), cancel_job(), delete_job(), get_incumbents(), and stream_logs() all dereference impl_->stub directly; calling these before connect() will crash.

Proposed guard pattern
 job_status_result_t grpc_client_t::check_status(const std::string& job_id)
 {
   job_status_result_t result;
+  if (!impl_ || !impl_->stub) {
+    result.error_message = "Not connected to server";
+    return result;
+  }

   grpc::ClientContext context;
   auto request = build_status_request(job_id);
   cuopt::remote::StatusResponse response;
   auto status = impl_->stub->CheckStatus(&context, request, &response);

Also applies to: 293-303, 325-333, 354-361, 381-395, 419-430

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 258 - 266, All public RPC
methods must guard against impl_ or impl_->stub being null to avoid crashes when
called before connect(); update grpc_client_t::check_status,
wait_for_completion, cancel_job, delete_job, get_incumbents, and stream_logs to
early-return a sensible error result if impl_ == nullptr or impl_->stub ==
nullptr (e.g., set job_status_result_t to an error/failed state or populate the
method's error/status fields and avoid calling stub->*), and add a clear logged
message indicating the client is not connected; ensure you use the same guard
pattern in the other listed methods (the ones around the ranges noted:
check_status, wait_for_completion, cancel_job, delete_job, get_incumbents,
stream_logs) so no method dereferences impl_->stub without this null check.

221-252: ⚠️ Potential issue | 🔴 Critical

stop_log_streaming() can hang indefinitely (blocked reader->Read()).

Setting stop_logs_ doesn’t unblock a synchronous server-streaming Read(), so join() can deadlock if the server stops emitting logs (or the stream is otherwise idle). This is still the same failure mode previously called out.

Proposed fix (make stop cancel the active ClientContext)
 void grpc_client_t::start_log_streaming(const std::string& job_id)
 {
   if (!config_.stream_logs || !config_.log_callback) return;

   stop_logs_.store(false);
   log_thread_ = std::make_unique<std::thread>([this, job_id]() {
-    grpc::ClientContext context;
+    grpc::ClientContext context;
+    {
+      std::lock_guard<std::mutex> lk(log_context_mutex_);
+      active_log_context_ = &context;
+    }
     auto request = build_stream_logs_request(job_id, 0);
     auto reader  = impl_->stub->StreamLogs(&context, request);

     cuopt::remote::LogMessage log_msg;
     while (reader->Read(&log_msg)) {
       if (stop_logs_.load()) {
-        context.TryCancel();
         break;
       }
       if (config_.log_callback) { config_.log_callback(log_msg.line()); }
       if (log_msg.job_complete()) { break; }
     }
     reader->Finish();
+    {
+      std::lock_guard<std::mutex> lk(log_context_mutex_);
+      active_log_context_ = nullptr;
+    }
   });
 }

 void grpc_client_t::stop_log_streaming()
 {
   stop_logs_.store(true);
+  {
+    std::lock_guard<std::mutex> lk(log_context_mutex_);
+    if (active_log_context_) { active_log_context_->TryCancel(); }
+  }
   if (log_thread_ && log_thread_->joinable()) { log_thread_->join(); }
   log_thread_.reset();
 }

You’ll also need matching members on grpc_client_t (likely in cpp/src/grpc/client/grpc_client.hpp):

mutable std::mutex log_context_mutex_;
grpc::ClientContext* active_log_context_ = nullptr;

As per coding guidelines "Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 221 - 252, The
stop_log_streaming() can block because reader->Read() is synchronous; modify
start_log_streaming() and stop_log_streaming() to track and cancel the active
grpc::ClientContext: add a mutable std::mutex log_context_mutex_ and
grpc::ClientContext* active_log_context_ member, then in start_log_streaming()
create the ClientContext, lock log_context_mutex_, set active_log_context_ =
&context before calling impl_->stub->StreamLogs, unlock, and clear
active_log_context_ (under the mutex) after reader->Finish(); in
stop_log_streaming() lock log_context_mutex_, if active_log_context_ is non-null
call active_log_context_->TryCancel(), then set stop_logs_ and join/reset the
thread; keep the existing context.TryCancel() check inside the thread but ensure
all access to active_log_context_ is synchronized to avoid races.

258-266: ⚠️ Potential issue | 🟠 Major

Add deadlines to operational RPCs (current code can block forever on network stalls).

Only the connection probe sets a deadline; most other RPC calls create a grpc::ClientContext without set_deadline(). That bypasses config_.timeout_seconds and can hang indefinitely (including inside polling loops).

Proposed helper + per-RPC use
+static inline void set_context_deadline(grpc::ClientContext& ctx, int timeout_seconds)
+{
+  if (timeout_seconds <= 0) return;
+  ctx.set_deadline(std::chrono::system_clock::now() + std::chrono::seconds(timeout_seconds));
+}
+
 job_status_result_t grpc_client_t::check_status(const std::string& job_id)
 {
   job_status_result_t result;

   grpc::ClientContext context;
+  set_context_deadline(context, config_.timeout_seconds);
   auto request = build_status_request(job_id);
   cuopt::remote::StatusResponse response;
   auto status = impl_->stub->CheckStatus(&context, request, &response);

Apply similarly to WaitForCompletion, CancelJob, DeleteResult, GetIncumbents, GetResult, and all chunked upload/download RPCs.

Also applies to: 293-303, 325-333, 354-361, 387-395, 419-443, 468-470, 509-510, 540-541, 561-562, 588-589, 625-626, 674-675, 724-725, 1000-1004, 1172-1176

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 258 - 266, The RPCs in
grpc_client_t (e.g., grpc_client_t::check_status) create grpc::ClientContext
without deadlines and can block indefinitely; add a helper (e.g.,
make_client_context_with_deadline or set_deadline_on_context) that constructs a
grpc::ClientContext and calls set_deadline(std::chrono::system_clock::now() +
std::chrono::seconds(config_.timeout_seconds)) (or uses config_.timeout_seconds
as a std::chrono duration) and use that helper everywhere you create a context
for RPCs (check_status, WaitForCompletion, CancelJob, DeleteResult,
GetIncumbents, GetResult, all chunked upload/download calls, and the other
locations noted in the review) so each RPC honors config_.timeout_seconds and
cannot hang forever.

992-995: ⚠️ Potential issue | 🔴 Critical

Validate poll_interval_ms before dividing (division-by-zero).

Both LP and MIP polling paths compute max_polls by dividing by config_.poll_interval_ms without guarding <= 0.

Proposed validation
   } else {
     CUOPT_LOG_INFO("[grpc_client] Using polling (CheckStatus) for job %s", job_id.c_str());
+    if (config_.poll_interval_ms <= 0) {
+      stop_log_streaming();
+      result.error_message = "poll_interval_ms must be > 0";
+      return result;
+    }
     int poll_count = 0;
     int max_polls  = (config_.timeout_seconds * 1000) / config_.poll_interval_ms;

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero, and use epsilon comparisons for floating-point equality checks".

Also applies to: 1130-1133

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 992 - 995, Before computing
max_polls in the LP and MIP polling paths, validate config_.poll_interval_ms > 0
to avoid division-by-zero: if config_.poll_interval_ms <= 0, emit a clear error
log (e.g., CUOPT_LOG_ERROR) that includes the invalid value and either return an
error status or substitute a safe fallback (e.g., set poll_interval_ms = 1)
before computing max_polls = (config_.timeout_seconds * 1000) /
config_.poll_interval_ms; apply this check around the existing uses of
poll_count and max_polls so both the LP and MIP paths use the validated value.
cpp/src/grpc/grpc_problem_mapper.cu (1)

399-427: ⚠️ Potential issue | 🔴 Critical

Fix potential out-of-bounds writes in get_doubles/get_ints (misaligned payload sizes).

n is computed with floor division, but memcpy copies it->second.size() bytes. If the byte payload isn’t an exact multiple of the element size, this can overflow the destination vector.

Proposed fix (validate alignment and copy only validated byte count)
   auto get_doubles = [&](int32_t field_id) -> std::vector<f_t> {
     auto it = arrays.find(field_id);
     if (it == arrays.end() || it->second.empty()) return {};
+    if (it->second.size() % sizeof(double) != 0) return {};
     size_t n = it->second.size() / sizeof(double);
     if constexpr (std::is_same_v<f_t, double>) {
       std::vector<double> v(n);
-      std::memcpy(v.data(), it->second.data(), it->second.size());
+      std::memcpy(v.data(), it->second.data(), n * sizeof(double));
       return v;
     } else {
       std::vector<double> tmp(n);
-      std::memcpy(tmp.data(), it->second.data(), it->second.size());
+      std::memcpy(tmp.data(), it->second.data(), n * sizeof(double));
       return std::vector<f_t>(tmp.begin(), tmp.end());
     }
   };

   auto get_ints = [&](int32_t field_id) -> std::vector<i_t> {
     auto it = arrays.find(field_id);
     if (it == arrays.end() || it->second.empty()) return {};
+    if (it->second.size() % sizeof(int32_t) != 0) return {};
     size_t n = it->second.size() / sizeof(int32_t);
     if constexpr (std::is_same_v<i_t, int32_t>) {
       std::vector<int32_t> v(n);
-      std::memcpy(v.data(), it->second.data(), it->second.size());
+      std::memcpy(v.data(), it->second.data(), n * sizeof(int32_t));
       return v;
     } else {
       std::vector<int32_t> tmp(n);
-      std::memcpy(tmp.data(), it->second.data(), it->second.size());
+      std::memcpy(tmp.data(), it->second.data(), n * sizeof(int32_t));
       return std::vector<i_t>(tmp.begin(), tmp.end());
     }
   };

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero, and use epsilon comparisons for floating-point equality checks".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_problem_mapper.cu` around lines 399 - 427, The lambdas
get_doubles and get_ints compute element count n via floor division but then
memcpy the full it->second.size() bytes, risking out-of-bounds when payload size
isn't a multiple of element size; fix by validating alignment (check
it->second.size() % sizeof(double) / sizeof(int32_t) == 0) and compute a safe
byte_count = n * sizeof(element_type), resize the destination vector to n, then
memcpy only byte_count (or return an error/empty vector if misaligned); update
both get_doubles (use sizeof(double) and f_t conversions) and get_ints (use
sizeof(int32_t) and i_t conversions) accordingly so copying never exceeds the
destination buffer.
cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp (1)

213-218: ⚠️ Potential issue | 🟠 Major

Synchronize server-log state reads/writes consistently (data race).

server_log_path_, server_log_start_pos_, and test_start_marked_ are accessed without a common lock in set_server_log_path(), get_server_logs(), dump_logs(), and the accessors. This can race if tests poll logs while another thread resets paths/markers.

Proposed approach (copy shared state under lock, then do I/O unlocked)
 void set_server_log_path(const std::string& path)
 {
-  server_log_path_      = path;
-  server_log_start_pos_ = 0;
-  test_start_marked_    = false;
+  std::lock_guard<std::mutex> lock(mutex_);
+  server_log_path_      = path;
+  server_log_start_pos_ = 0;
+  test_start_marked_    = false;
 }

 std::string get_server_logs(bool since_test_start = true) const
 {
-  if (server_log_path_.empty()) { return ""; }
+  std::string path;
+  std::streampos start_pos = 0;
+  bool marked              = false;
+  {
+    std::lock_guard<std::mutex> lock(mutex_);
+    path      = server_log_path_;
+    start_pos = server_log_start_pos_;
+    marked    = test_start_marked_;
+  }
+
+  if (path.empty()) { return ""; }

-  std::ifstream file(server_log_path_);
+  std::ifstream file(path);
   if (!file.is_open()) { return ""; }

-  if (since_test_start && test_start_marked_ && server_log_start_pos_ > 0) {
-    file.seekg(server_log_start_pos_);
+  if (since_test_start && marked && start_pos > 0) {
+    file.seekg(start_pos);
   }

Consider also making server_log_path() return a copy (std::string) instead of a reference if you need it to be thread-safe.

As per coding guidelines "Prevent thread-unsafe use of global and static variables; use proper mutex/synchronization in server code accessing shared solver state".

Also applies to: 229-244, 332-345, 350-356

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp` around lines 213
- 218, There is a data race on server_log_path_, server_log_start_pos_, and
test_start_marked_: protect all reads/writes with a single mutex (e.g., add
std::mutex server_log_mutex_) and update set_server_log_path(),
get_server_logs(), dump_logs(), and the accessors to acquire the lock when
accessing those members; for I/O work, copy the guarded state (make
server_log_path() return a std::string copy rather than a const ref) while
holding the lock then release and perform file reading/writing unlocked to avoid
long holds; ensure all places that currently read the fields directly use the
mutex-protected copy to eliminate races.
cpp/src/grpc/grpc_solution_mapper.cu (1)

512-535: ⚠️ Potential issue | 🔴 Critical

Harden bytes_to_typed() against misaligned byte payloads (potential OOB write).

All branches compute n with floor division and then copy raw.size() bytes. If raw.size() is not an exact multiple of the expected element size, this overflows tmp/v.

Proposed fix
   const auto& raw = it->second;
   if constexpr (std::is_same_v<T, float>) {
+    if (raw.size() % sizeof(double) != 0) return {};
     size_t n = raw.size() / sizeof(double);
     std::vector<double> tmp(n);
-    std::memcpy(tmp.data(), raw.data(), raw.size());
+    std::memcpy(tmp.data(), raw.data(), n * sizeof(double));
     return std::vector<T>(tmp.begin(), tmp.end());
   } else if constexpr (std::is_same_v<T, double>) {
+    if (raw.size() % sizeof(double) != 0) return {};
     size_t n = raw.size() / sizeof(double);
     std::vector<double> v(n);
-    std::memcpy(v.data(), raw.data(), raw.size());
+    std::memcpy(v.data(), raw.data(), n * sizeof(double));
     return v;
   } else {
+    if (raw.size() % sizeof(T) != 0) return {};
     size_t n = raw.size() / sizeof(T);
     std::vector<T> v(n);
-    std::memcpy(v.data(), raw.data(), raw.size());
+    std::memcpy(v.data(), raw.data(), n * sizeof(T));
     return v;
   }

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero, and use epsilon comparisons for floating-point equality checks".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_solution_mapper.cu` around lines 512 - 535, The
bytes_to_typed() implementation can write out-of-bounds because it computes n =
raw.size() / elem_size with floor division but then memcpy's the full
raw.size() bytes; fix by validating raw.size() % elem_size == 0 up front
(where elem_size is sizeof(double) in the float and double branches, which
receive doubles on the wire, and sizeof(T) in the generic branch), returning an
empty vector or logging a warning on misalignment, and copying only n *
elem_size bytes into the destination buffers; apply the same change to the
float branch (tmp) and the double/other branches (v) so no copy can exceed the
destination size.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 53823cec-ccad-4fbb-b90f-ec216b29d2e2

📥 Commits

Reviewing files that changed from the base of the PR and between d228718 and e46028c.

📒 Files selected for processing (39)
  • GRPC_ARCHITECTURE.md
  • SERVER_ARCHITECTURE.md
  • build.sh
  • conda/recipes/libcuopt/recipe.yaml
  • cpp/CMakeLists.txt
  • cpp/cuopt_grpc_server.cpp
  • cpp/src/grpc/client/grpc_client.cu
  • cpp/src/grpc/client/grpc_client.hpp
  • cpp/src/grpc/client/solve_remote.cu
  • cpp/src/grpc/client/test_grpc_client.cpp
  • cpp/src/grpc/cuopt_remote.proto
  • cpp/src/grpc/cuopt_remote_service.proto
  • cpp/src/grpc/grpc_problem_mapper.cu
  • cpp/src/grpc/grpc_problem_mapper.hpp
  • cpp/src/grpc/grpc_service_mapper.cu
  • cpp/src/grpc/grpc_service_mapper.hpp
  • cpp/src/grpc/grpc_settings_mapper.cu
  • cpp/src/grpc/grpc_settings_mapper.hpp
  • cpp/src/grpc/grpc_solution_mapper.cu
  • cpp/src/grpc/grpc_solution_mapper.hpp
  • cpp/src/grpc/server/grpc_field_element_size.hpp
  • cpp/src/grpc/server/grpc_incumbent_proto.hpp
  • cpp/src/grpc/server/grpc_job_management.cpp
  • cpp/src/grpc/server/grpc_pipe_serialization.hpp
  • cpp/src/grpc/server/grpc_server_main.cpp
  • cpp/src/grpc/server/grpc_server_threads.cpp
  • cpp/src/grpc/server/grpc_server_types.hpp
  • cpp/src/grpc/server/grpc_service_impl.cpp
  • cpp/src/grpc/server/grpc_worker.cpp
  • cpp/src/grpc/server/grpc_worker_infra.cpp
  • cpp/src/pdlp/CMakeLists.txt
  • cpp/src/pdlp/solve_remote.cu
  • cpp/tests/linear_programming/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/grpc_client_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp
  • cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
  • python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
💤 Files with no reviewable changes (2)
  • cpp/src/pdlp/solve_remote.cu
  • cpp/src/pdlp/CMakeLists.txt
✅ Files skipped from review due to trivial changes (1)
  • cpp/tests/linear_programming/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (4)
  • GRPC_ARCHITECTURE.md
  • cpp/src/grpc/server/grpc_worker_infra.cpp
  • python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
  • cpp/src/grpc/grpc_settings_mapper.cu

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (3)
cpp/src/grpc/client/test_grpc_client.cpp (1)

340-340: Print the client error on log-stream failure for parity with other modes.

At Line 340, a failed stream_logs returns non-zero but drops the error detail. Emit client.get_last_error() before returning to keep failure diagnosis consistent.

Proposed fix
-  return success ? 0 : 1;
+  if (!success) {
+    std::cerr << "Log stream failed: " << client.get_last_error() << "\n";
+    return 1;
+  }
+  return 0;

As per coding guidelines: "Verify error propagation from CUDA to user-facing APIs is complete; ensure CUDA errors are caught and mapped to meaningful user error codes".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/test_grpc_client.cpp` at line 340, The return path after
stream_logs currently returns non-zero on failure but drops the error detail;
update the failure branch surrounding stream_logs in test_grpc_client.cpp to
call and emit client.get_last_error() (or equivalent accessor) to the log/stream
before returning the non-zero exit code so the client error is logged for parity
with other modes and aids diagnosis (locate the code around stream_logs and the
return of success ? 0 : 1 and insert a process/log call that prints
client.get_last_error() when success is false).
cpp/src/grpc/server/grpc_pipe_serialization.hpp (1)

24-27: Consider making kPipeChunkBytes configurable.

The fixed 64 MiB chunk cap is a hard resource limit; exposing it via server config/env (with this as default) would improve deploy-time tuning.

As per coding guidelines: "Check that hard-coded GPU device IDs and resource limits are made configurable; abstract multi-backend support for different CUDA versions."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 24 - 27,
kPipeChunkBytes is a hard-coded constexpr limit; make it configurable by
replacing the fixed constexpr with a runtime-configured value (e.g., a
GetPipeChunkBytes() or PipeConfig::pipe_chunk_bytes() accessor) that reads an
environment variable or server config and falls back to 64ULL * 1024 * 1024 as
the default; update references to kPipeChunkBytes to call the accessor and
validate the parsed value (positive, reasonable upper bound) and document the
new env var name (e.g., PIPE_CHUNK_BYTES) and default behavior.
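An env-backed accessor along these lines would keep the 64 MiB default while allowing deploy-time tuning. This is a sketch only: the function names, the `CUOPT_PIPE_CHUNK_BYTES` variable, and the clamping bounds are illustrative, not part of the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Sketch: parse a chunk-size override, falling back to the compiled-in
// default when the value is missing, malformed, or out of range.
inline std::size_t parse_pipe_chunk_bytes(const char* env)
{
  constexpr std::size_t kDefault = 64ULL * 1024 * 1024;   // current hard-coded cap
  constexpr std::size_t kMin     = 4096;                  // one page, illustrative
  constexpr std::size_t kMax     = 1024ULL * 1024 * 1024; // 1 GiB, illustrative
  if (env == nullptr) return kDefault;
  char* end            = nullptr;
  unsigned long long v = std::strtoull(env, &end, 10);
  if (end == env || *end != '\0') return kDefault;  // not a clean number
  if (v < kMin || v > kMax) return kDefault;        // reject unreasonable sizes
  return static_cast<std::size_t>(v);
}

// Hypothetical replacement for the kPipeChunkBytes constant.
inline std::size_t get_pipe_chunk_bytes()
{
  return parse_pipe_chunk_bytes(std::getenv("CUOPT_PIPE_CHUNK_BYTES"));
}
```

Callers would swap `kPipeChunkBytes` for `get_pipe_chunk_bytes()`; caching the parsed value in a function-local static would avoid re-reading the environment on every chunk.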
cpp/tests/linear_programming/grpc/grpc_integration_test.cpp (1)

957-982: Assert the log-callback path in SolveMIPWithLogCallback.

received_logs is populated but never asserted, so this test can pass even if callback streaming regresses.

💡 Suggested assertion
   auto result = client->solve_mip(problem, settings, false);
   EXPECT_TRUE(result.success) << result.error_message;
+  EXPECT_FALSE(received_logs.empty()) << "Expected at least one streamed log line";
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp` around lines 957
- 982, The test registers a log callback (received_logs with log_mutex) but
never asserts anything, so add assertions after the call to client->solve_mip in
SolveMIPWithLogCallback to verify the callback was invoked and logs look
correct: acquire lock(log_mutex) and assert !received_logs.empty() (and
optionally that at least one entry contains an expected substring like
"objective" or "MIP" to ensure meaningful log content). If solve_mip is
asynchronous/streams logs, ensure you wait briefly or poll until received_logs
is non-empty before asserting to avoid flakes; reference received_logs,
log_mutex, log_callback, SolveMIPWithLogCallback, and client->solve_mip when
adding the checks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/CMakeLists.txt`:
- Around line 297-304: The CMake snippet only checks for the non-namespaced
target grpc_cpp_plugin; update the check to first test for the namespaced target
gRPC::grpc_cpp_plugin (like how gRPC::grpc++ is handled) and use its target file
if present, otherwise fall back to checking TARGET grpc_cpp_plugin and then
find_program for the executable; adjust references to set
_GRPC_CPP_PLUGIN_EXECUTABLE from "$<TARGET_FILE:gRPC::grpc_cpp_plugin>" when
available to ensure installed gRPC packages are recognized.

In `@cpp/tests/linear_programming/grpc/grpc_client_test.cpp`:
- Around line 1387-1402: The test currently expects GetResult to be called even
though CheckStatus returns PROCESSING (not complete); update the test so
GetResult is not expected for PROCESSING jobs: remove or replace the
EXPECT_CALL(*mock_stub_, GetResult(...)).WillOnce(...) with an expectation that
it is not called (e.g., EXPECT_CALL(*mock_stub_, GetResult(_, _, _)).Times(0))
or simply delete the GetResult expectation, keeping the CheckStatus stub that
sets resp->set_job_status(cuopt::remote::PROCESSING) so the test asserts the
intended "do not call GetResult for non-complete status" behavior.

In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp`:
- Around line 275-283: The server_log_count function can infinite-loop when
substring is empty; add a guard at the start of server_log_count (which calls
get_server_logs()) to return 0 (or handle as appropriate) if substring.empty()
before entering the while loop so pos will never rely on substring.length() == 0
to advance.
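The guarded counter could look like this sketch (the helper name is illustrative; the PR's server_log_count wraps get_server_logs() around the same loop):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Sketch: count non-overlapping occurrences of needle in haystack.
// An empty needle returns 0 up front; without the guard, find("") matches
// at every position and pos += needle.length() never advances the loop.
inline std::size_t count_occurrences(const std::string& haystack,
                                     const std::string& needle)
{
  if (needle.empty()) return 0;
  std::size_t count = 0;
  std::size_t pos   = 0;
  while ((pos = haystack.find(needle, pos)) != std::string::npos) {
    ++count;
    pos += needle.length();
  }
  return count;
}
```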

---

Duplicate comments:
In `@cpp/CMakeLists.txt`:
- Around line 279-294: The current CMake logic relies on IMPORTED_LOCATION*
which can be empty for non-imported targets; update the resolution to use
$<TARGET_FILE:protobuf::protoc> when the target exists: in the branch that tests
if(TARGET protobuf::protoc) set _PROTOBUF_PROTOC to the generator expression
$<TARGET_FILE:protobuf::protoc> (instead of querying IMPORTED_LOCATION*),
otherwise fall back to find_package(protobuf CONFIG REQUIRED) and then use
IMPORTED_LOCATION_RELEASE/IMPORTED_LOCATION as a fallback; keep the existing
fatal check for empty _PROTOBUF_PROTOC. Ensure you reference the symbols
protobuf::protoc and _PROTOBUF_PROTOC in the change so the target-file generator
expression is used for in-tree/FetchContent builds.

In `@cpp/src/grpc/client/grpc_client.cu`:
- Around line 258-265: grpc_client_t::check_status (and similar RPC helpers)
currently dereferences impl_->stub without checking connection state which can
crash if called before connect(); add a null-check for impl_ and impl_->stub at
the start of grpc_client_t::check_status and for the same pattern in the other
RPC helper methods (the ones calling impl_->stub->...). If either is null,
return an appropriate error job_status_result_t (populate error code/message or
a failed state consistent with existing error handling) instead of calling the
stub; otherwise proceed to build the request and call CheckStatus as before.
- Around line 994-995: The code computes max_polls by dividing by
config_.poll_interval_ms without checking for <=0; add a guard before any
division (both where max_polls is computed and the other occurrence around the
1132 region) to validate config_.poll_interval_ms > 0, and if not either (a)
return or raise a clear error/invalid-argument from the surrounding function, or
(b) log an error and use a safe default (e.g., 1 ms) before computing max_polls;
ensure you reference config_.poll_interval_ms and update the computations that
set max_polls to avoid divide-by-zero crashes.
- Around line 221-251: The read loop in start_log_streaming blocks on
reader->Read() so stop_log_streaming can hang; to fix, add a mutex-protected
pointer member (log_context_mutex_ and active_log_context_) and set
active_log_context_ = &context (protected by log_context_mutex_) right after
creating grpc::ClientContext context in start_log_streaming, clear it (set
nullptr) after reader->Finish(); then update stop_log_streaming to lock
log_context_mutex_ and if active_log_context_ is non-null call
active_log_context_->TryCancel() before setting stop_logs_ and joining the
thread so the blocking StreamLogs read is cancelled safely.
- Around line 703-712: Validate elem_size and guard against overflow before
dividing/allocating: check elem_size > 0 before computing elems_per_chunk =
chunk_data_budget / elem_size (return/continue or log and skip if zero), ensure
total_elems > 0 (already present) and verify total_elems <= SIZE_MAX / elem_size
(or use safe_mul_check) before computing total_elems * elem_size and allocating
array_bytes, and if the allocation would exceed a configured budget or SIZE_MAX
also error/skip; apply the same checks when computing per-chunk sizes and before
any memcpy/copy operations. Reference symbols: elem_size, total_elems,
chunk_data_budget, elems_per_chunk, array_bytes, arr_desc.total_elements(),
arr_desc.element_size_bytes() (also update the similar block around the other
occurrence).
- Around line 262-265: The CheckStatus RPC (and other RPCs that create
grpc::ClientContext, e.g., the sites calling impl_->stub->CheckStatus and the
other stub->* calls listed) currently create a ClientContext with no deadline;
fix by setting a per-RPC deadline on the context before the call using
context.set_deadline(std::chrono::system_clock::now() + configured_timeout).
Prefer reusing or adding a small helper (e.g., set_deadline(context, timeout_ms)
or a method on impl_ that reads a configured RPC timeout) and apply it to the
Call sites such as the CheckStatus call (impl_->stub->CheckStatus) and the other
grpc::ClientContext creation sites flagged in the review so all RPCs
consistently use the same configured deadline.

In `@cpp/src/grpc/client/grpc_client.hpp`:
- Around line 53-99: The grpc_client_config_t allows invalid values for
intervals/sizes (poll_interval_ms, incumbent_poll_interval_ms, timeout_seconds,
chunk_size_bytes, max_message_bytes, chunked_array_threshold_bytes) which can
break runtime behavior; add a validation method (e.g.,
grpc_client_config_t::Validate() or a constructor/static factory) that enforces
and documents invariants (positive non-zero ints for intervals/timeout, sensible
min/max for chunk_size_bytes and max_message_bytes, and explicit handling for
chunked_array_threshold_bytes semantics (-1 for auto, 0 for force-chunked, >0 as
threshold)), make the client call Validate() early (e.g., before using in
grpc_client constructors or SubmitJob path) and return/throw a clear error on
invalid config or clamp values where safe; update comments to list accepted
ranges for each field.
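A validation method in that spirit might be sketched as follows. The struct below is a hypothetical mirror of grpc_client_config_t's numeric fields (names follow the review; defaults and bounds are illustrative, not the PR's actual values):

```cpp
#include <cassert>
#include <string>

// Sketch of a config with a validate() enforcing the invariants above.
struct client_config_sketch_t {
  int poll_interval_ms                    = 100;
  int incumbent_poll_interval_ms          = 100;
  int timeout_seconds                     = 300;
  long long chunk_size_bytes              = 4LL * 1024 * 1024;
  long long max_message_bytes             = 64LL * 1024 * 1024;
  long long chunked_array_threshold_bytes = -1;  // -1 auto, 0 force-chunked, >0 threshold

  // Returns empty string on success, else a description of the first problem.
  std::string validate() const
  {
    if (poll_interval_ms <= 0) return "poll_interval_ms must be > 0";
    if (incumbent_poll_interval_ms <= 0) return "incumbent_poll_interval_ms must be > 0";
    if (timeout_seconds <= 0) return "timeout_seconds must be > 0";
    if (chunk_size_bytes <= 0) return "chunk_size_bytes must be > 0";
    if (max_message_bytes <= 0) return "max_message_bytes must be > 0";
    if (chunk_size_bytes > max_message_bytes)
      return "chunk_size_bytes must not exceed max_message_bytes";
    if (chunked_array_threshold_bytes < -1)
      return "chunked_array_threshold_bytes must be -1, 0, or positive";
    return "";
  }
};
```

The client would call validate() once at construction (or at the top of the submit path) and surface the returned message through its existing error-reporting channel.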

In `@cpp/src/grpc/client/solve_remote.cu`:
- Around line 55-57: The env-derived overrides for config.chunk_size_bytes and
config.max_message_bytes are applied without validation (using parse_env_int64),
allowing zero/negative/oversized values; update the logic where
config.chunk_size_bytes, config.max_message_bytes (and optionally
config.enable_debug_log) are set so that after calling parse_env_int64 you
validate the result against sane bounds (e.g., >0 and <= a defined
MAX_MESSAGE_BYTES/CHUNK_SIZE constant), ignore or clamp invalid values, and emit
a debug/warning log via the existing logger; specifically modify the section
using parse_env_int64 to perform the check and fallback to existing config
values when the parsed value is out-of-range.
- Around line 164-166: The callback is being passed the incumbent objective as
the bound (bound_copy = objective) which can show a false zero gap; change the
code so bound_copy is set to the solver's actual best bound variable (the
maintained global/best bound used by the B&B engine) instead of 'objective'
before calling get_callback->get_solution(...); locate where the solver stores
the current best bound (e.g., variables named best_bound, global_bound,
current_bound, or similar in the surrounding code/branch-and-bound context) and
use that identifier (preserving obj_copy, solution_copy, and
callback->get_user_data() usage).

In `@cpp/src/grpc/client/test_grpc_client.cpp`:
- Around line 401-403: The parsing of numeric CLI arguments uses std::stod and
std::stoll without guards and converts a double to int unsafely (variable
time_limit and the static_cast<int>(time_limit)); wrap each std::stod/std::stoll
call in a try/catch (catch std::invalid_argument and std::out_of_range),
validate the parsed value for positivity and reasonable bounds, and return/exit
with a clear error on bad input; before converting time_limit to int (the
static_cast<int>(time_limit) usage) check that time_limit is finite, >= 0 and <=
std::numeric_limits<int>::max() (or clamp/handle overflow) to avoid UB; apply
the same defensive checks to the stoll usages (ensure values fit in target
integer types) and use the same error path if validation fails.
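The defensive parse-and-narrow path might be sketched like this (the helper name and the non-negative/INT_MAX bounds are illustrative; the real CLI may want different limits):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <stdexcept>
#include <string>

// Sketch: parse a time limit from a CLI argument into whole seconds.
// Returns false on malformed, out-of-range, negative, or non-finite input
// instead of letting std::stod throw or a narrowing cast invoke UB.
inline bool parse_time_limit(const std::string& arg, int& out_seconds)
{
  double v = 0.0;
  try {
    std::size_t consumed = 0;
    v = std::stod(arg, &consumed);
    if (consumed != arg.size()) return false;  // trailing junk after the number
  } catch (const std::invalid_argument&) {
    return false;
  } catch (const std::out_of_range&) {
    return false;
  }
  if (!std::isfinite(v) || v < 0.0) return false;
  if (v > static_cast<double>(std::numeric_limits<int>::max())) return false;
  out_seconds = static_cast<int>(v);  // now provably in range
  return true;
}
```

The std::stoll call sites would get the same treatment, with the range check against the target integer type instead of int.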

In `@cpp/src/grpc/cuopt_remote.proto`:
- Line 6: The proto package declaration "package cuopt.remote" in
cuopt_remote.proto conflicts with its directory path (cpp/src/grpc) and triggers
Buf PACKAGE_DIRECTORY_MATCH; fix by making the package name match the file path
or by moving the file to match the package. Either change the package line in
cpp/src/grpc/cuopt_remote.proto from "package cuopt.remote" to "package
cpp.src.grpc" (matching the directory segments as dot-separated identifiers) or
relocate cuopt_remote.proto into a cuopt/remote directory so the existing
"package cuopt.remote" matches the filesystem; update any import/usage
references accordingly and re-run buf lint to verify.

In `@cpp/src/grpc/grpc_problem_mapper.cu`:
- Around line 175-182: The code assumes constraint_upper_bounds exists when
pb_problem.constraint_lower_bounds_size() > 0; instead validate both arrays and
their sizes before calling cpu_problem.set_constraint_lower_bounds and
cpu_problem.set_constraint_upper_bounds: check
pb_problem.constraint_lower_bounds_size() > 0 &&
pb_problem.constraint_upper_bounds_size() > 0 and that their sizes match, and
only then build con_lb/con_ub and call the two set_* methods; if sizes mismatch
or one is missing, handle the case (skip setting bounds or raise/log an error)
to avoid creating an inconsistent problem state.
- Around line 435-446: get_string_list can advance s past s_end when memchr
returns null (no trailing NUL); fix by computing remaining = s_end - s, call
memchr over remaining, and if memchr returns null push the final string using
the remaining length and then break (do not set s = nul + 1). Update the loop in
get_string_list to use it->second.data() and it->second.size() to compute s and
s_end, use the computed remaining length for std::memchr, and on null result
construct names.emplace_back(s, remaining) and break instead of setting s = nul
+ 1.
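The bounded split described in that comment can be sketched as below (the free function and its signature are illustrative stand-ins for the PR's get_string_list lambda):

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Sketch: split a NUL-separated byte payload into strings without ever
// reading past the buffer. A payload missing its trailing NUL yields the
// remaining bytes as a final name, then stops.
inline std::vector<std::string> split_nul_separated(const char* data, std::size_t size)
{
  std::vector<std::string> names;
  const char* s     = data;
  const char* s_end = data + size;
  while (s < s_end) {
    const std::size_t remaining = static_cast<std::size_t>(s_end - s);
    const char* nul = static_cast<const char*>(std::memchr(s, '\0', remaining));
    if (nul == nullptr) {
      names.emplace_back(s, remaining);  // no trailing NUL: take the rest
      break;                             // never advance past s_end
    }
    names.emplace_back(s, static_cast<std::size_t>(nul - s));
    s = nul + 1;
  }
  return names;
}
```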
- Around line 399-411: The lambdas (notably get_doubles) compute element count
as it->second.size() / sizeof(double) then memcpy the full byte payload, which
can overflow if the byte length is not a multiple of sizeof(double); add an
explicit alignment/length check before any typed memcpy: verify
it->second.size() % sizeof(double) == 0 (or the equivalent check for the other
lambda’s element type), and if the check fails either return an empty vector or
log/throw an error; only then allocate the destination vector with n =
it->second.size() / sizeof(double) and perform memcpy; apply the same fix to the
other similar lambda at the noted region (around 414-426) to avoid misaligned
copies.

In `@cpp/src/grpc/grpc_service_mapper.cu`:
- Around line 23-55: Chunked wire format is inconsistent for floats: update
chunk_typed_array (and the client-side build_array_chunk_requests that calls it)
to always emit floating arrays as IEEE double on the wire (so server get_doubles
can continue to interpret chunks as sizeof(double)). Concretely, when T ==
float, convert the input vector to a temporary std::vector<double> and use that
buffer for chunking (set elem_size = sizeof(double) and point raw at the double
buffer) so byte offsets/counts match get_doubles; for non-floating types keep
the existing behavior. Ensure build_array_chunk_requests callers use the
adjusted chunk_typed_array semantics and do not send raw float bytes anymore.

In `@cpp/src/grpc/grpc_solution_mapper.cu`:
- Around line 520-534: The bytes_to_typed template is copying raw.size() bytes
into buffers sized for n elements, causing overflow when raw.size() isn't an
exact multiple of the element byte-size; update each branch (the
std::is_same_v<T,float> branch that builds tmp as double, the
std::is_same_v<T,double> branch that builds v as double, and the generic branch)
to: compute n = raw.size() / sizeof(ElemType), check that raw.size() %
sizeof(ElemType) == 0 and handle (throw/log/return error) if not, and use
memcpy(dst.data(), raw.data(), n * sizeof(ElemType)) instead of memcpy(...,
raw.size()) so you only copy the exact bytes that fit the destination (refer to
function bytes_to_typed and the template T branches).

In `@cpp/src/grpc/server/grpc_incumbent_proto.hpp`:
- Around line 46-49: The assignment vector is not cleared before appending
parsed values, so repeated calls to parse_incumbent_proto will accumulate stale
entries; fix by calling assignment.clear() (or assign empty) before
assignment.reserve(...) in the code that iterates over incumbent_msg
(referencing the assignment variable and the loop that uses
incumbent_msg.assignment_size() and incumbent_msg.assignment(i)) so the vector
starts empty each parse.
- Around line 30-32: The serialization ignores SerializeToArray's boolean return
and narrows ByteSizeLong() (size_t) to int unsafely: update the code that
constructs buffer and calls msg.SerializeToArray to first capture size_t sz =
msg.ByteSizeLong(), validate sz <= static_cast<size_t>(INT_MAX) before casting,
allocate buffer with that size, call bool ok =
msg.SerializeToArray(buffer.data(), static_cast<int>(sz)), and handle a false
result (e.g., throw or return an error) instead of ignoring it; reference the
buffer variable, msg.ByteSizeLong(), and msg.SerializeToArray(...) when making
the change.

In `@cpp/src/grpc/server/grpc_job_management.cpp`:
- Around line 107-113: The recv_incumbent_pipe function trusts the incoming size
prefix and can allocate arbitrarily large memory; add a sanity check before
resizing by comparing the read size (variable size) against a defined cap (e.g.,
a new constant like kMaxIncumbentSize or MAX_INCUMBENT_SIZE) and reject/return
false if size is 0 or exceeds that cap, then only call data.resize(size) and
read_from_pipe when the size is within limits; update recv_incumbent_pipe to
enforce this limit and document the constant choice.
- Around line 147-158: The loop in submit_job_async is racy because threads
perform a check-then-write on job_queue[i] (checking ready/claimed then
initializing), so change the slot acquisition to an atomic claim: use an atomic
flag (e.g., make job_queue[].claimed an std::atomic_bool or add a per-slot
std::atomic<int> state) and perform a compare_exchange_strong to transition from
false->true to reserve the slot; only the thread that succeeds should populate
the slot fields (job_id, problem_type, data_size, worker_pid, worker_index,
data_sent, cancelled) and set job_queue[i].ready = true as the final publish
step. Ensure submit_job_async uses this atomic CAS on job_queue[i].claimed and
does not set ready until after all other fields are initialized.

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp`:
- Around line 93-100: Validate and sanitize the metadata before resizing or
memcpy: check ac.total_elements() and ac.element_offset() are non-negative and
within reasonable caps, ensure chunk_data.size() + offset does not overflow and
fits within a bounded/validated size_t, and guard against pathological
total_elements to avoid huge allocations; in the functions using ac (e.g., where
dest.resize(...) and std::memcpy(...) are called), convert to size_t only after
validation, return or throw an error on invalid/too-large values instead of
resizing/copying, and ensure the memcpy condition uses the validated,
overflow-checked bounds.
- Line 90: The deserializers currently treat ParseDelimitedFromCodedStream
failures by breaking out and eventually returning true, accepting malformed
payloads; change the error handling around
google::protobuf::util::ParseDelimitedFromCodedStream(...) (the calls that pass
ac, coded, clean_eof) so that on a parse failure you immediately fail closed and
return false (or propagate an error) instead of break/continuing; apply the same
change to both deserializer sites that call ParseDelimitedFromCodedStream to
ensure truncated/malformed messages are rejected.

In `@cpp/src/grpc/server/grpc_server_main.cpp`:
- Around line 234-250: Builder::BuildAndStart() can return nullptr so avoid
dereferencing server without checking; after calling builder.BuildAndStart()
(unique symbols: builder.BuildAndStart(), std::unique_ptr<Server> server,
server->Wait(), shutdown_thread, keep_running), check if server is null and if
so log an error/print to stderr, ensure shutdown_thread is signaled/joined or
cleaned up, and exit or return with a non-zero status; if server is valid
continue to start the shutdown_thread and call server->Wait() as currently
implemented.
- Around line 89-90: The signal handler registered via signal(SIGINT,
signal_handler) performs non-async-signal-safe operations (std::cout and
result_cv.notify_all()) — change it to a minimal, async-signal-safe handler:
have signal_handler set a std::sig_atomic_t or std::atomic<bool> flag (e.g.,
shutdown_requested) and (optionally) write a byte to a dedicated pipe to wake
the main loop; then remove any std::cout or condition_variable use from
signal_handler and move the actual logging and result_cv.notify_all() into the
main thread’s event loop where you check shutdown_requested (or the pipe) and
call result_cv.notify_all() and perform printing. Update grpc_server_types.hpp
to expose the atomic flag (or pipe FD) and ensure only async-signal-safe
functions are called from signal_handler.
- Around line 156-168: The loops in grpc_server_main.cpp must not call memset on
JobQueueEntry and ResultQueueEntry because they contain std::atomic members;
remove the memset calls and instead default-construct or placement-new each
element and then explicitly initialize fields. Replace memset(&job_queue[i], 0,
sizeof(JobQueueEntry)) with either new (&job_queue[i]) JobQueueEntry() or ensure
the array is value-initialized, then call job_queue[i].ready.store(false),
.claimed.store(false), .cancelled.store(false) and
job_queue[i].worker_index.store(-1); do the same for result_queue elements
(value-initialize or placement-new, then .ready.store(false) and
.retrieved.store(false)). Ensure JobQueueEntry and ResultQueueEntry constructors
handle non-atomic fields if you prefer constructor-based initialization.
- Around line 186-212: The early-return TLS error paths currently exit
immediately after spawn_workers() and creating threads, leaking workers,
threads, and shared memory; change these branches to perform proper cleanup
before returning: after calling spawn_workers() and after launching the threads
(referencing spawn_workers() and the container holding created std::thread
objects), call the corresponding shutdown/stop routine for workers (e.g.,
stop_workers() or signal shutdown on the worker pool), join all threads in the
thread container, and release any shared memory/IPC resources (the handles
created earlier, and any variables like ssl_opts or pem_root_certs can be left
untouched) before returning an error code. Ensure every early-return path that
occurs after worker spawn or thread creation follows this cleanup sequence so no
threads or shared resources remain leaked.
- Around line 50-80: The CLI parsing loop currently in main (the for-loop that
processes argv and assigns to config) silently ignores unknown flags and missing
values; update it to validate every flag and required value: for each branch
that expects a value (flags that call std::stoi/std::stoll or assign argv[++i]
to config fields like config.port, config.num_workers, config.max_message_mb,
config.max_message_b, config.tls_cert_path, config.tls_key_path,
config.tls_root_path) ensure you check i+1<argc and on failure call
print_usage(argv[0]) and return a non-zero exit code; add an else branch for
unrecognized args (arg starts with '-') that also prints usage and exits
non-zero; additionally wrap stoi/stoll conversions in try/catch to detect
invalid numeric inputs and treat them as errors that print usage and exit
non-zero so the server cannot start with unintended defaults.

In `@cpp/src/grpc/server/grpc_server_threads.cpp`:
- Around line 166-229: When deserialize_result_pipe_blob(...) fails you mark
it->second.status = JobStatus::FAILED but you leave the local success flag true,
causing waiter->success to be set incorrectly; update the logic so that after a
deserialization failure you set success = false (or set waiter->success based on
it->second.status != JobStatus::FAILED) before assigning waiter->success in the
waiting_threads block; modify the code paths around
deserialize_result_pipe_blob, job_tracker, and the waiter assignment
(waiter->success) so waiters observe the actual failed state and error_message
from it->second.error_message.

In `@cpp/src/grpc/server/grpc_server_types.hpp`:
- Around line 300-307: signal_handler currently calls non-async-signal-safe APIs
(std::cout and result_cv.notify_all); change it to only perform
async-signal-safe operations: remove the std::cout and result_cv.notify_all
calls and instead set an atomic/sig_atomic_t flag (ensure keep_running is an
std::atomic_bool or set a sig_atomic_t atomic_shutdown) and, if needed, set
shm_ctrl->shutdown_requested using a sig_atomic_t field; add a normal runtime
watcher thread that observes keep_running/shutdown flag and performs logging and
calls result_cv.notify_all when it detects shutdown; refer to signal_handler,
keep_running, shm_ctrl, shutdown_requested and result_cv when making these
changes.

In `@cpp/src/grpc/server/grpc_service_impl.cpp`:
- Around line 71-73: The code releases job_queue[job_idx].claimed before
publishing job_queue[job_idx].ready, creating a race where another producer can
claim the slot before consumers see ready; invert the operations so you
store(true) to job_queue[job_idx].ready before you store(false) to
job_queue[job_idx].claimed (apply same fix at the other occurrence around lines
with job_queue[...] claimed/ready, e.g., the site referenced as 265-267) to
ensure ready is visible prior to allowing re-claim.
- Around line 633-642: Hold on to waiter->mutex only while inspecting/modifying
waiter state; do not call check_job_status() while holding waiter->mutex to
avoid lock-order inversion with tracker_mutex used in cancel paths (e.g.,
cancel_job). In WaitForCompletion, restructure the loop so you acquire
unique_lock waiter's mutex, check waiter->ready and context->IsCancelled (and
adjust waiter->waiters on cancel) but before invoking check_job_status(job_id,
msg) release the lock (lock.unlock()), call check_job_status(), then re-lock
(lock.lock()) to re-evaluate waiter->ready and continue/wait; alternatively
adopt a consistent global lock order (acquire tracker_mutex before waiter->mutex
everywhere) — prefer the unlock/call/relock pattern around the call site to
quickly eliminate the deadlock risk.
- Around line 189-192: The code currently blindly adds each chunk to
meta.received_bytes, state.total_bytes, state.chunks and increments
state.total_chunks without enforcing a cumulative bound, so add a pre-check that
rejects or truncates a chunk when meta.received_bytes + raw.size() (or
state.total_bytes + raw.size()) would exceed the declared/expected field size
(use the existing expected-size field on meta or state), and only update
meta.received_bytes, state.total_bytes, state.chunks.push_back(ac) and
++state.total_chunks after the check succeeds; on failure return an error/status
indicating oversized upload to prevent unbounded memory growth.

In `@cpp/src/grpc/server/grpc_worker_infra.cpp`:
- Around line 120-124: The fork failure branch only calls
close_worker_pipes_server(worker_id) leaving child-end FDs open and leaking;
update the pid < 0 handling to also close the child-side pipe FDs (e.g., call
the corresponding cleanup function for child ends such as
close_worker_pipes_child(worker_id) or otherwise close the child file
descriptors associated with worker_id) before returning -1, ensuring both server
and child pipe endpoints are cleaned up when fork() fails; reference symbols:
fork, pid, worker_id, is_replacement, close_worker_pipes_server.
- Around line 165-168: The loop in wait_for_workers is calling waitpid
unconditionally for every entry in worker_pids; skip any non-positive PIDs (pid
<= 0) and call waitpid only for valid PIDs, and handle EINTR by retrying the
wait in a loop until waitpid succeeds or returns a permanent error; also handle
-1 results such that if errno == ECHILD you can break/continue safely and for
other errors log/propagate as appropriate. Locate the loop that iterates over
worker_pids and update it to validate pid > 0, wrap the waitpid call in a retry
loop that checks errno == EINTR, and handle other error cases (e.g., ECHILD)
instead of proceeding blindly.

In `@cpp/src/grpc/server/grpc_worker.cpp`:
- Around line 56-74: store_simple_result currently scans result_queue for an
entry with !ready and writes multiple non-atomic fields, causing a race; change
the slot allocation to atomically claim a slot before writing (e.g., add or
reuse an atomic claim flag on result_queue[i] and use
compare_exchange_strong/compare_exchange_weak to transition from UNCLAIMED to
CLAIMED), only proceed to copy job_id/status/data_size/error_message and set
retrieved/ready after the CAS succeeds, and ensure you set the atomic ready flag
last so other threads see a fully-initialized entry; update store_simple_result
to loop until a CAS claim succeeds (or handle full-queue) and reference
result_queue, MAX_RESULTS, and the ready/claimed flags in your changes.
- Around line 44-48: Check and handle the return value of
send_incumbent_pipe(fd_, buffer): if it indicates failure (e.g., broken pipe or
write error) log an error including job_id_ and the returned errno/details and
avoid further work or retry/cleanup as appropriate; only emit the incumbent
logging (currently using std::cout with job_id_, objective, assignment.size())
when the send succeeded or gate it behind a debug/log-level flag or rate-limit
to prevent unbounded output. Locate this logic around the call sites using
send_incumbent_pipe, build_incumbent_proto, and the variables job_id_,
objective, assignment, and fd_ to add the error check and conditional logging.
- Around line 248-252: The cudaMemcpy call copying device_solution to
host_solution currently ignores CUDA return codes; change it to capture the
cudaMemcpy return value (cudaError_t err = cudaMemcpy(...)) and if err !=
cudaSuccess throw an exception (e.g., std::runtime_error) with a descriptive
message including cudaGetErrorString(err) so the existing try/catch surfaces a
clear failure to the client; apply the same pattern to all other cudaMemcpy
sites in this file (including the similar copies near lines 296-307) and
reference the same symbols (device_solution, host_solution, cudaMemcpy) when
making the fixes.
- Around line 341-363: The loop that finds an empty slot in result_queue and
then fills it (using result_slot, ResultQueueEntry, copy_cstr, MAX_RESULTS,
worker_id) has a race: another thread may grab the same slot after the ready
check but before setting ready, and if no slot is found the result is silently
dropped; fix by protecting the selection-and-write with the same synchronization
used in store_simple_result (e.g., a mutex or atomic compare-and-set on
result_queue[i].ready) so the check-and-set is atomic, and add explicit handling
for the case result_slot remains -1 (log an error via the worker logger /
std::cerr and return or propagate a failure code) so lost results are reported.

In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp`:
- Around line 53-57: The test file uses open() and O_WRONLY/O_CREAT/O_TRUNC but
does not include <fcntl.h>, which is required on some platforms; add a `#include`
<fcntl.h> to the top of grpc_integration_test.cpp alongside the existing system
headers (near the includes for <signal.h>, <sys/types.h>, <sys/wait.h>,
<unistd.h>) so that the open() call and O_* constants used by the test compile
portably.

In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp`:
- Around line 213-218: The setter set_server_log_path writes server_log_path_,
server_log_start_pos_, and test_start_marked_ without synchronization; protect
these shared members by acquiring mutex_ (e.g., std::lock_guard or
std::unique_lock on mutex_) at the start of set_server_log_path and similarly in
every reader/callback that touches server_log_path_, server_log_start_pos_, or
test_start_marked_ (the read blocks around lines ~229-244 and ~332-355). Under
the lock, perform updates or read-and-copy the values, and for readers prefer
copying the protected values into local variables while holding mutex_ then
release the lock before doing any IO/long operations to avoid holding mutex_
during blocking work.

In `@GRPC_ARCHITECTURE.md`:
- Around line 9-10: The phrase claiming "no custom serialization logic is
implemented" is too absolute; update the line to clarify the scope by stating
that protocol buffers generated by protoc and grpc_cpp_plugin are used for gRPC
wire payloads, and explicitly note the exception that
grpc_pipe_serialization.hpp provides custom pipe blob framing helpers (i.e., the
custom framing is outside gRPC wire payload serialization). Locate the sentence
in GRPC_ARCHITECTURE.md and reword it to mention "for gRPC wire payloads" and
add a parenthetical or following sentence referencing
grpc_pipe_serialization.hpp as the known custom framing helper.

---

Nitpick comments:
In `@cpp/src/grpc/client/test_grpc_client.cpp`:
- Line 340: The return path after stream_logs currently returns non-zero on
failure but drops the error detail; update the failure branch surrounding
stream_logs in test_grpc_client.cpp to call and emit client.get_last_error() (or
equivalent accessor) to the log/stream before returning the non-zero exit code
so the client error is logged for parity with other modes and aids diagnosis
(locate the code around stream_logs and the return of success ? 0 : 1 and insert
a process/log call that prints client.get_last_error() when success is false).

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp`:
- Around line 24-27: kPipeChunkBytes is a hard-coded constexpr limit; make it
configurable by replacing the fixed constexpr with a runtime-configured value
(e.g., a GetPipeChunkBytes() or PipeConfig::pipe_chunk_bytes() accessor) that
reads an environment variable or server config and falls back to 64ULL * 1024 *
1024 as the default; update references to kPipeChunkBytes to call the accessor
and validate the parsed value (positive, reasonable upper bound) and document
the new env var name (e.g., PIPE_CHUNK_BYTES) and default behavior.

In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp`:
- Around line 957-982: The test registers a log callback (received_logs with
log_mutex) but never asserts anything, so add assertions after the call to
client->solve_mip in SolveMIPWithLogCallback to verify the callback was invoked
and logs look correct: acquire lock(log_mutex) and assert !received_logs.empty()
(and optionally that at least one entry contains an expected substring like
"objective" or "MIP" to ensure meaningful log content). If solve_mip is
asynchronous/streams logs, ensure you wait briefly or poll until received_logs
is non-empty before asserting to avoid flakes; reference received_logs,
log_mutex, log_callback, SolveMIPWithLogCallback, and client->solve_mip when
adding the checks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8bfed585-0c3e-4c04-9fc0-af2be140a6f2

📥 Commits

Reviewing files that changed from the base of the PR and between e46028c and 6c06a40.

📒 Files selected for processing (39)
  • GRPC_ARCHITECTURE.md
  • SERVER_ARCHITECTURE.md
  • build.sh
  • conda/recipes/libcuopt/recipe.yaml
  • cpp/CMakeLists.txt
  • cpp/cuopt_grpc_server.cpp
  • cpp/src/grpc/client/grpc_client.cu
  • cpp/src/grpc/client/grpc_client.hpp
  • cpp/src/grpc/client/solve_remote.cu
  • cpp/src/grpc/client/test_grpc_client.cpp
  • cpp/src/grpc/cuopt_remote.proto
  • cpp/src/grpc/cuopt_remote_service.proto
  • cpp/src/grpc/grpc_problem_mapper.cu
  • cpp/src/grpc/grpc_problem_mapper.hpp
  • cpp/src/grpc/grpc_service_mapper.cu
  • cpp/src/grpc/grpc_service_mapper.hpp
  • cpp/src/grpc/grpc_settings_mapper.cu
  • cpp/src/grpc/grpc_settings_mapper.hpp
  • cpp/src/grpc/grpc_solution_mapper.cu
  • cpp/src/grpc/grpc_solution_mapper.hpp
  • cpp/src/grpc/server/grpc_field_element_size.hpp
  • cpp/src/grpc/server/grpc_incumbent_proto.hpp
  • cpp/src/grpc/server/grpc_job_management.cpp
  • cpp/src/grpc/server/grpc_pipe_serialization.hpp
  • cpp/src/grpc/server/grpc_server_main.cpp
  • cpp/src/grpc/server/grpc_server_threads.cpp
  • cpp/src/grpc/server/grpc_server_types.hpp
  • cpp/src/grpc/server/grpc_service_impl.cpp
  • cpp/src/grpc/server/grpc_worker.cpp
  • cpp/src/grpc/server/grpc_worker_infra.cpp
  • cpp/src/pdlp/CMakeLists.txt
  • cpp/src/pdlp/solve_remote.cu
  • cpp/tests/linear_programming/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/grpc_client_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp
  • cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
  • python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
💤 Files with no reviewable changes (2)
  • cpp/src/pdlp/CMakeLists.txt
  • cpp/src/pdlp/solve_remote.cu
🚧 Files skipped from review as they are similar to previous changes (5)
  • build.sh
  • cpp/tests/linear_programming/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/CMakeLists.txt
  • python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
  • cpp/src/grpc/grpc_settings_mapper.cu

@tmckayus tmckayus force-pushed the grpc-server-v4 branch 2 times, most recently from a94be38 to 67e1796 Compare March 6, 2026 21:20
cuopt uses embedded grpc client to solve problems on a remote server
@anandhkb anandhkb added this to the 26.04 milestone Mar 8, 2026
@tmckayus tmckayus force-pushed the grpc-server-v4 branch 5 times, most recently from 5fea430 to 92c17f0 Compare March 10, 2026 14:18
@tmckayus
Contributor Author

/ok to test 718d938


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

♻️ Duplicate comments (6)
cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp (1)

339-357: ⚠️ Potential issue | 🟠 Major

dump_logs() and is_test_start_marked() still read test_start_marked_ unlocked.

The writes are synchronized, but Line 344 and Line 357 read the same flag directly. With the log callback and the test thread running concurrently, that is still a data race and can make the log window nondeterministic.

🛠️ Minimal fix
   void dump_logs(std::ostream& os = std::cout, bool include_all_server_logs = false) const
   {
+    const bool marked = is_test_start_marked();
     os << "=== Client Logs ===\n";
     os << get_client_logs();
     os << "\n=== Server Logs";
-    if (test_start_marked_ && !include_all_server_logs) {
+    if (marked && !include_all_server_logs) {
       os << " (since test start)";
     } else {
       os << " (all)";
@@
-  bool is_test_start_marked() const { return test_start_marked_; }
+  bool is_test_start_marked() const
+  {
+    std::lock_guard<std::mutex> lock(mutex_);
+    return test_start_marked_;
+  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp` around lines 339
- 357, dump_logs() and is_test_start_marked() read test_start_marked_ without
locking, causing a data race with the log callback writer; fix by protecting
those reads with the same mutex used to synchronize log writes (e.g., wrap the
reads in dump_logs() and is_test_start_marked() in a std::lock_guard<std::mutex>
using the existing logs mutex such as logs_mutex_ or mutex_), or alternatively
change test_start_marked_ to std::atomic<bool> and update writers/readers
accordingly so all accesses are thread-safe (make the change in the dump_logs,
is_test_start_marked, and the writer that sets test_start_marked_).
cpp/src/grpc/server/grpc_worker_infra.cpp (1)

147-154: ⚠️ Potential issue | 🟡 Minor

Add PID validation before waitpid to avoid undefined behavior.

The loop iterates worker_pids without checking for non-positive values. If a worker failed to spawn (returning -1 from spawn_worker) but was still pushed to the vector, or if the vector contains stale zero values, calling waitpid(0, ...) waits for any child in the process group, which is unintended.

Proposed fix
 void wait_for_workers()
 {
   for (pid_t pid : worker_pids) {
+    if (pid <= 0) continue;
     int status;
     while (waitpid(pid, &status, 0) < 0 && errno == EINTR) {}
   }
   worker_pids.clear();
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_worker_infra.cpp` around lines 147 - 154, In
wait_for_workers(), validate PIDs from the worker_pids vector before calling
waitpid to avoid undefined behavior: skip any non-positive entries (<= 0) that
may come from failed spawn_worker() calls or stale zeros, only call waitpid(pid,
&status, 0) for pid > 0 (and optionally log or collect invalid PIDs), and then
clear worker_pids as before; ensure the check is applied inside the for-loop
that iterates worker_pids so stray 0 or -1 values are not passed to waitpid.
cpp/src/grpc/cuopt_remote_service.proto (1)

320-325: ⚠️ Potential issue | 🟠 Major

Include the best bound in Incumbent updates.

Local GET_SOLUTION callbacks receive both the incumbent objective and the current best bound, but this message only carries the objective and assignment. The client wrapper has to synthesize or omit the bound today, so remote callbacks lose gap information and can't preserve local stop heuristics.

📦 Schema addition
 message Incumbent {
   int64 index = 1;
   double objective = 2;
   repeated double assignment = 3;
   string job_id = 4;
+  double solution_bound = 5;
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/cuopt_remote_service.proto` around lines 320 - 325, The
Incumbent proto message is missing the current best bound; add a new double
field (e.g. best_bound = 5) to message Incumbent so remote GET_SOLUTION
callbacks carry both objective and bound, then regenerate gRPC/protobuf stubs
and update any producers/consumers that construct or read Incumbent (places
referencing Incumbent, its index/objective/assignment/job_id handling) to
populate and consume the new best_bound field accordingly so gap information is
preserved end-to-end.
cpp/tests/linear_programming/grpc/grpc_client_test.cpp (1)

1347-1353: ⚠️ Potential issue | 🟡 Minor

Stop the test at PROCESSING instead of expecting GetResult.

The comment above this block says get_lp_result() should not attempt GetResult() once CheckStatus() reports PROCESSING, but the mock still requires that RPC. This locks in the opposite behavior.

Suggested fix
-  EXPECT_CALL(*mock_stub_, GetResult(_, _, _))
-    .WillOnce([](grpc::ClientContext*,
-                 const cuopt::remote::GetResultRequest&,
-                 cuopt::remote::ResultResponse*) {
-      return grpc::Status(grpc::StatusCode::UNAVAILABLE, "Result not ready");
-    });
+  EXPECT_CALL(*mock_stub_, GetResult(_, _, _)).Times(0);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_client_test.cpp` around lines 1347 -
1353, The test currently sets an EXPECT_CALL on mock_stub_->GetResult which
contradicts the intended behavior that get_lp_result() should stop when
CheckStatus reports PROCESSING; remove or change that expectation so GetResult
is not required to be called (e.g., delete the EXPECT_CALL for GetResult or
change it to .Times(0)), ensure the mock for CheckStatus returns PROCESSING, and
verify the test asserts that get_lp_result() stops after CheckStatus rather than
invoking GetResult; references: mock_stub_, GetResult, CheckStatus,
get_lp_result.
cpp/src/grpc/server/grpc_job_management.cpp (1)

50-55: ⚠️ Potential issue | 🟠 Major

Bound the incumbent payload before resize().

size comes straight off the pipe, so data.resize(size) lets a malformed worker force an arbitrary allocation here. Please reject anything above an explicit cap before resizing.

🔧 Suggested guard
 bool recv_incumbent_pipe(int fd, std::vector<uint8_t>& data)
 {
   uint64_t size;
   if (!read_from_pipe(fd, &size, sizeof(size))) return false;
+  if (size > static_cast<uint64_t>(config.max_message_bytes)) {
+    std::cerr << "[Worker] Incumbent payload too large: " << size << "\n";
+    return false;
+  }
   data.resize(size);
   if (size > 0 && !read_from_pipe(fd, data.data(), size)) return false;
   return true;
 }
As per coding guidelines "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 50 - 55, The
recv_incumbent_pipe function currently calls data.resize(size) with size read
from the pipe (via read_from_pipe) allowing a remote peer to force unbounded
allocation; guard against this by validating the uint64_t size before
resizing—define and use an explicit max payload constant (e.g.,
MAX_INCUMBENT_PAYLOAD or similar within this compilation unit), check that size
<= MAX_INCUMBENT_PAYLOAD and size is non-negative/sensible, reject the read
(return false) if the check fails, then proceed to data.resize(size) and the
subsequent read_from_pipe call; update any related error paths or comments in
recv_incumbent_pipe accordingly.
cpp/src/grpc/grpc_problem_mapper.cpp (1)

440-450: ⚠️ Potential issue | 🟠 Major

Stop once memchr() falls off the blob.

When memchr() misses, nul becomes s_end and Line 450 still forms nul + 1. That walks past the buffer on malformed chunk payloads.

🔧 Suggested fix
     while (s < s_end) {
       const char* nul = static_cast<const char*>(std::memchr(s, '\0', s_end - s));
       if (!nul) nul = s_end;
       names.emplace_back(s, nul);
-      s = nul + 1;
+      if (nul == s_end) break;
+      s = nul + 1;
     }
As per coding guidelines "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_problem_mapper.cpp` around lines 440 - 450, In
get_string_list, when std::memchr returns nullptr we set nul = s_end but then do
s = nul + 1 which advances past the buffer; change the loop to detect the "no
NUL found" case and stop safely: after computing nul (from std::memchr), if nul
== s_end handle the final fragment (push a string from s to s_end if non-empty)
and break the loop instead of doing s = nul + 1; only advance s = nul + 1 when
nul < s_end. This prevents walking past the blob and fixes
arrays/deserialization safety in get_string_list.
🟠 Major comments (23)
python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py-77-83 (1)

77-83: ⚠️ Potential issue | 🟠 Major

Clear inherited CUOPT_TLS_* variables in _cpu_only_env.

This helper copies the caller environment verbatim, so a pre-set TLS configuration from the shell or another fixture can leak into the plaintext-server tests and make them fail nondeterministically. Please explicitly remove the TLS keys here so each test class fully owns its transport settings.

Suggested fix
 def _cpu_only_env(port):
     """Return an env dict that hides all GPUs and enables remote mode."""
     env = os.environ.copy()
+    for key in (
+        "CUOPT_TLS_ENABLED",
+        "CUOPT_TLS_ROOT_CERT",
+        "CUOPT_TLS_CLIENT_CERT",
+        "CUOPT_TLS_CLIENT_KEY",
+    ):
+        env.pop(key, None)
     env["CUDA_VISIBLE_DEVICES"] = ""
     env["CUOPT_REMOTE_HOST"] = "localhost"
     env["CUOPT_REMOTE_PORT"] = str(port)
     return env

As per coding guidelines "Ensure test isolation: prevent GPU state, cached memory, and global variables from leaking between test cases; verify each test independently initializes its environment".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`
around lines 77 - 83, The _cpu_only_env helper currently copies the caller
environment verbatim which can leak TLS settings; modify the function
(_cpu_only_env) to explicitly remove any keys that start with "CUOPT_TLS_" from
the env dict (e.g., iterate over keys or pop specific known CUOPT_TLS_* names)
before returning so plaintext-server tests run with a clean transport
configuration and deterministic behavior.
python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py-312-338 (1)

312-338: ⚠️ Potential issue | 🟠 Major

Assert remote numerical correctness, not just subprocess success.

These remote-path tests still pass when the gRPC/CLI path returns wrong numbers with the right shapes or simply exits 0. Since afiro_original.mps and bb_optimality.mps are known instances, please assert objective/gap correctness on the CPU-only, CLI, TLS, and mTLS paths as well.

As per coding guidelines "Write tests validating numerical correctness of optimization results (not just 'runs without error'); test degenerate cases (infeasible, unbounded, empty, singleton problems)".

Also applies to: 432-442, 560-565, 625-635

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`
around lines 312 - 338, The tests currently only check subprocess exit codes;
update the test functions (test_lp_solve_cpu_only,
test_lp_dual_solution_cpu_only, test_mip_solve_cpu_only,
test_warmstart_cpu_only) and their underlying helpers (_impl_lp_solve_cpu_only,
_impl_lp_dual_solution_cpu_only, _impl_mip_solve_cpu_only,
_impl_warmstart_cpu_only) to assert numerical correctness: load the known
instances (afiro_original.mps, bb_optimality.mps) and assert the returned
objective value, primal/dual vectors, reduced costs and optimality gap are
within a tight tolerance (e.g. absolute/relative eps) rather than only checking
returncode; apply the same numerical assertions to the CLI/TLS/mTLS remote paths
covered by the other failing ranges (around 432-442, 560-565, 625-635) and
include tests for degenerate cases (infeasible/unbounded/empty/singleton)
verifying expected statuses and gap behavior.
python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py-110-120 (1)

110-120: ⚠️ Potential issue | 🟠 Major

Add a subjectAltName to the generated server certificate.

The server certificate only sets CN=localhost. Modern TLS stacks, particularly BoringSSL (used in gRPC C-core), do not accept hostname verification via common name and require subjectAltName instead. CN-only certificates will cause test failures on systems that enforce this requirement.

Suggested fix
     server_key = os.path.join(cert_dir, "server.key")
     server_csr = os.path.join(cert_dir, "server.csr")
     server_crt = os.path.join(cert_dir, "server.crt")
+    server_ext = os.path.join(cert_dir, "server.ext")
+    with open(server_ext, "w", encoding="utf-8") as f:
+        f.write("subjectAltName=DNS:localhost,IP:127.0.0.1\n")
     if not _run(
         f"openssl req -newkey rsa:2048 -keyout {server_key} -out {server_csr} "
         f"-nodes -subj '/CN=localhost' 2>/dev/null"
     ):
         return False
     if not _run(
         f"openssl x509 -req -in {server_csr} -CA {ca_crt} -CAkey {ca_key} "
-        f"-CAcreateserial -out {server_crt} -days 1 2>/dev/null"
+        f"-CAcreateserial -out {server_crt} -days 1 "
+        f"-extfile {server_ext} 2>/dev/null"
     ):
         return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`
around lines 110 - 120, The server certificate generation uses _run to call
openssl and only sets CN=localhost (server_csr/server_crt), which fails hostname
verification on BoringSSL—update the second openssl invocation that creates
server_crt (the openssl x509 -req ... call referencing {server_csr}, {ca_crt},
{ca_key}) to include a subjectAltName for DNS:localhost and IP:127.0.0.1 (via
-addext "subjectAltName=DNS:localhost,IP:127.0.0.1" or by supplying an
-extfile/-extensions config) so the generated server_crt includes SAN entries
accepted by modern TLS stacks.
cpp/src/grpc/server/grpc_server_types.hpp-254-256 (1)

254-256: ⚠️ Potential issue | 🟠 Major

Namespace the shared-memory objects per server instance.

With fixed names like /cuopt_job_queue, /cuopt_result_queue, and /cuopt_control, two cuopt_grpc_server processes on the same host will attach to the same IPC objects even if they listen on different ports. That can cause cross-talk between jobs and one instance unlinking the other's shared memory during cleanup. Please derive these names from an instance identifier such as port/PID and pass them through to workers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_types.hpp` around lines 254 - 256, The fixed
shared-memory name constants SHM_JOB_QUEUE, SHM_RESULT_QUEUE and SHM_CONTROL
must be replaced with functions that derive per-instance names (e.g., using the
server's listening port and/or PID) so each cuopt_grpc_server gets unique IPC
names; update the code that creates/opens/unlinks these objects to call the new
name generators and thread/worker spawn points to pass the resolved names into
worker initialization so workers use the same instance-specific names rather
than the previous global constants.
GRPC_QUICK_START.md-197-205 (1)

197-205: ⚠️ Potential issue | 🟠 Major

The "C API" section shows C++ code, not the public C API.

Lines 200–204 use C++ headers (solve.hpp, cpu_optimization_problem.hpp), the auto keyword, and the cuopt::linear_programming:: namespace. This will not compile in a C translation unit. Replace this snippet with the actual public C API, which uses cuOptSolve() with opaque handles (cuOptOptimizationProblem, cuOptSolverSettings, cuOptSolution*). Alternatively, rename the section to "C++" if the intent is to show internal C++ usage.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@GRPC_QUICK_START.md` around lines 197 - 205, The section labeled "C API"
currently shows C++ code; either change the section title to "C++" or replace
the snippet with the real public C API: show usage of cuOptSolve with the opaque
handle types (cuOptOptimizationProblem, cuOptSolverSettings, cuOptSolution*) and
the proper C function call (cuOptSolve) and cleanup patterns so the example
compiles in a C translation unit; ensure the example does not use C++ headers,
namespaces, or keywords like auto and instead demonstrates the correct C handles
and function signatures.
cpp/src/grpc/server/grpc_pipe_io.cpp-32-57 (1)

32-57: ⚠️ Potential issue | 🟠 Major

timeout_ms should cover the entire data transfer, and buffered bytes should still be drained when POLLIN and POLLHUP arrive together.

The initial poll() enforces the timeout, but the loop at line 59 falls back to blocking read() calls without timeout protection. If a writer sends a size header, then stalls before sending the payload, the reader blocks indefinitely on the second read_from_pipe() call (e.g., at line 38 of grpc_job_management.cpp).

Additionally, line 54 treats POLLERR | POLLHUP | POLLNVAL as immediate failures. When poll() returns with both POLLIN and POLLHUP set—indicating the peer closed but buffered bytes remain—the code returns false and drops those buffered bytes instead of draining them. Use an absolute deadline and poll until no more readable bytes remain before treating hangup as fatal.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_pipe_io.cpp` around lines 32 - 57, The
read_from_pipe function must enforce timeout across the entire transfer and not
drop buffered bytes on POLLHUP; change the logic in read_from_pipe so you
compute an absolute deadline from timeout_ms, then loop calling
poll(&pfd,1,remaining_ms) (handling EINTR) before each read and update
remaining_ms from the deadline; only treat POLLNVAL/POLLERR as immediate
failure, and if pfd.revents has POLLIN (even if POLLHUP is also set) perform
read() to drain available bytes into ptr/remaining until read returns 0 or
remaining==0, only treating a pure POLLHUP with no POLLIN as EOF, and return
false on poll timeout once the absolute deadline passes. Ensure you reference
the existing symbols ptr, remaining, pfd and timeout_ms when making these
changes.
cpp/src/grpc/server/grpc_pipe_serialization.hpp-192-200 (1)

192-200: ⚠️ Potential issue | 🟠 Major

Same validation needed in read_result_from_pipe.

Apply the same total_bytes bounds check here to prevent resource exhaustion from malformed result payloads.

Proposed fix
   for (uint32_t i = 0; i < num_arrays; ++i) {
     int32_t field_id;
     uint64_t total_bytes;
     if (!read_from_pipe(fd, &field_id, sizeof(field_id))) return false;
     if (!read_from_pipe(fd, &total_bytes, sizeof(total_bytes))) return false;
+    if (total_bytes > kMaxArrayBytes) return false;
     auto& dest = arrays_out[field_id];
     dest.resize(static_cast<size_t>(total_bytes));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 192 - 200, In
read_result_from_pipe, validate the uint64_t total_bytes read from the pipe
before calling dest.resize() and read_from_pipe(): check total_bytes is within
an expected/sane upper bound (e.g. compare against a MAX_ALLOWED_BYTES constant
or a per-array limit and ensure it fits into size_t) and return false if it
exceeds the limit or would cause allocation overflow; apply this check in the
same loop that reads field_id/total_bytes and references arrays_out, so you
never resize or read using a maliciously large total_bytes.
cpp/src/grpc/server/grpc_pipe_serialization.hpp-140-149 (1)

140-149: ⚠️ Potential issue | 🟠 Major

Add validation for total_bytes before resizing to prevent resource exhaustion.

The read_chunked_request_from_pipe function reads total_bytes from untrusted input and directly uses it to resize a vector without bounds checking. A malicious or corrupted payload could specify an extremely large value, causing allocation failure or resource exhaustion.

Proposed fix
+// Maximum allowable array size to prevent resource exhaustion (e.g., 4 GiB)
+static constexpr uint64_t kMaxArrayBytes = 4ULL * 1024 * 1024 * 1024;
+
 inline bool read_chunked_request_from_pipe(int fd,
                                            cuopt::remote::ChunkedProblemHeader& header_out,
                                            std::map<int32_t, std::vector<uint8_t>>& arrays_out)
 {
   if (!read_protobuf_from_pipe(fd, header_out)) return false;

   uint32_t num_arrays;
   if (!read_from_pipe(fd, &num_arrays, sizeof(num_arrays))) return false;

   // Read each field's raw bytes directly into the output map, keyed by field_id.
   for (uint32_t i = 0; i < num_arrays; ++i) {
     int32_t field_id;
     uint64_t total_bytes;
     if (!read_from_pipe(fd, &field_id, sizeof(field_id))) return false;
     if (!read_from_pipe(fd, &total_bytes, sizeof(total_bytes))) return false;
+    if (total_bytes > kMaxArrayBytes) return false;
     auto& dest = arrays_out[field_id];
     dest.resize(static_cast<size_t>(total_bytes));

As per coding guidelines: "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 140 - 149, The
loop in read_chunked_request_from_pipe currently trusts total_bytes from the
pipe and calls dest.resize(total_bytes); validate total_bytes before resizing:
ensure field_id is within arrays_out bounds, check total_bytes is not negative
(already unsigned but still validate), enforce a sane upper limit (e.g. compare
against a MAX_ALLOWED_CHUNK or the destination's max_size/remaining budget) and
fail (return false) if it exceeds that limit or would cause memory growth beyond
system limits; only then call dest.resize(...) and proceed to read_from_pipe
into dest.data(). Use the symbols read_chunked_request_from_pipe, num_arrays,
field_id, total_bytes, arrays_out, dest.resize, and read_from_pipe when locating
the code to implement these checks.
GRPC_SERVER_ARCHITECTURE.md-97-106 (1)

97-106: ⚠️ Potential issue | 🟠 Major

Full upload buffering in the coordinator is an OOM vector.

This design keeps chunked uploads in main-process memory until FinishChunkedUpload, so memory usage scales with the entire problem size rather than the in-flight chunk size. A few concurrent “multi-GB” uploads can exhaust the coordinator before pipe backpressure helps. Consider streaming directly to the worker or enforcing hard per-upload and global memory quotas.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@GRPC_SERVER_ARCHITECTURE.md` around lines 97 - 106, Coordinator currently
buffers entire uploads in ChunkedUploadState and pending_chunked_data until
FinishChunkedUpload, which risks OOM; instead modify the chunk handling so each
incoming chunk is either (a) streamed immediately to the worker pipe (forward
inside the same handler used by FinishChunkedUpload) and not retained in
ChunkedUploadState, or (b) if buffering is required, enforce strict per-upload
and global memory quotas and refuse/slow new chunks when limits are reached;
update the code paths around FinishChunkedUpload, ChunkedUploadState, and
pending_chunked_data to stop accumulating full uploads (or add quota checks and
accounting and evictions), and ensure backpressure is propagated to gRPC
handlers so the coordinator never holds multi-GB of data in memory.
GRPC_SERVER_ARCHITECTURE.md-52-57 (1)

52-57: ⚠️ Potential issue | 🟠 Major

Namespace the shared-memory objects per server instance.

/cuopt_job_queue, /cuopt_result_queue, and /cuopt_control are global names. Two cuopt_grpc_server processes on the same host will collide even if they listen on different ports, because the later instance can unlink/recreate the first instance’s IPC objects. Add a per-instance suffix such as port, PID, or UUID and pass it to workers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@GRPC_SERVER_ARCHITECTURE.md` around lines 52 - 57, The documented
shared-memory names (/cuopt_job_queue, /cuopt_result_queue, /cuopt_control) are
global and will collide across multiple server instances; update the server and
worker startup to generate a per-instance suffix (e.g., port, PID, or UUID) and
append it to those base names to form unique IPC identifiers, propagate that
suffix into worker launch/config so workers use the same suffixed names, and
update all references in documentation and code (search for /cuopt_job_queue,
/cuopt_result_queue, /cuopt_control) so unlink/create/attach operations use the
suffixed names.
cpp/src/grpc/server/grpc_server_main.cpp-225-230 (1)

225-230: ⚠️ Potential issue | 🟠 Major

Fail startup if no worker actually came up.

spawn_workers() skips failed spawns, but main() never verifies the result before binding the gRPC port. With --workers 0 or repeated spawn_worker() failures, the server accepts jobs that can never be processed.

💡 Minimal guard
   spawn_workers();
+  if (worker_pids.empty()) {
+    std::cerr << "[Server] Failed to start any worker processes\n";
+    cleanup_shared_memory();
+    return 1;
+  }
 
   std::thread result_thread(result_retrieval_thread);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 225 - 230,
spawn_workers() currently skips failed spawns but main() proceeds to bind the
gRPC port and start threads; change main() to verify that at least one worker
was started (either by making spawn_workers() return the number of successful
spawns or by checking the global worker container/worker_count after
spawn_workers()) and if zero, log an error and exit non-zero before creating
result_retrieval_thread, incumbent_retrieval_thread, worker_monitor_thread,
session_reaper_thread or binding the gRPC server; update
spawn_workers()/spawn_worker() signatures if needed to surface success count so
main() can perform this check.
cpp/src/grpc/grpc_settings_mapper.cpp-140-149 (1)

140-149: ⚠️ Potential issue | 🟠 Major

iteration_limit loses its "unset" state when omitted in proto3 decode.

The iteration_limit field is a plain proto3 scalar that defaults to 0 when omitted. The encode side uses -1 as a sentinel to represent "unset" (library default), but this convention only works if all clients consistently follow it. External clients or future versions unaware of this sentinel will send omitted fields as 0, which the decode side interprets as "explicitly request 0 iterations" rather than "use server default". This RPC contract is brittle and violates the principle of forward compatibility. Use optional int64 iteration_limit to add proper field presence semantics in proto3.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_settings_mapper.cpp` around lines 140 - 149, The proto3
scalar iteration_limit must become optional to preserve presence semantics:
change the .proto to "optional int64 iteration_limit" and update encoding so it
omits the field when the library default/unset is intended (stop writing -1 as a
sentinel). In the decoder (grpc_settings_mapper.cpp) replace the current check
that relies on a numeric sentinel (pb_settings.iteration_limit() >= 0) with a
presence check (pb_settings.has_iteration_limit()) and only then clamp and
assign to settings.iteration_limit (preserving the existing max-clamp logic);
reference pb_settings, settings.iteration_limit and the mapping code that sets
iteration_limit.
cpp/src/grpc/client/solve_remote.cpp-125-159 (1)

125-159: ⚠️ Potential issue | 🟠 Major

Forward problem_checking and use_pdlp_solver_mode to the remote solver.

solve_lp_remote() accepts both parameters but discards them before calling client.solve_lp(cpu_problem, settings). The local PDLP path respects these flags—problem_checking enables problem validation (line 1325), and use_pdlp_solver_mode controls hyperparameter selection (line 1458). Callers at cpp/src/pdlp/solve.cu:1675 explicitly pass these flags expecting remote execution to honor them, creating an API-level inconsistency.

Add these flags to the proto PDLPSolverSettings message and forward them through grpc_client_t::solve_lp() so remote and local paths behave consistently.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/solve_remote.cpp` around lines 125 - 159, The
solve_lp_remote function currently ignores problem_checking and
use_pdlp_solver_mode; update the flow by adding these two bools to the
PDLPSolverSettings proto (e.g., PDLPSolverSettings.problem_checking and
PDLPSolverSettings.use_pdlp_solver_mode), regenerate the gRPC stubs, extend
pdlp_solver_settings_t to carry those fields, update grpc_client_t::solve_lp
signature/impl to accept the extended settings (or ensure it serializes the new
fields), and then call client.solve_lp(cpu_problem, settings_with_flags) from
solve_lp_remote so the remote solver receives and honors problem_checking and
use_pdlp_solver_mode. Ensure any serialization/deserialization and tests are
updated to reflect the new proto fields.
cpp/src/grpc/server/grpc_service_impl.cpp-208-221 (1)

208-221: ⚠️ Potential issue | 🟠 Major

Reject truncated chunked uploads before returning a job ID.

FinishChunkedUpload() clears field_meta and enqueues the job without checking that each uploaded field received its declared byte count. A client can omit the tail of an array and still get a successful SubmitJobResponse, pushing a malformed payload into the worker path instead of failing fast here.

As per coding guidelines "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 208 - 221,
FinishChunkedUpload currently clears state.field_meta and enqueues a job without
verifying that each field actually received its declared byte count; before
constructing PendingChunkedUpload and calling submit_chunked_job_async, iterate
the entries in state.field_meta (or the header) and verify received bytes in
state.chunks (or state.total_bytes/total_chunks per field) match the declared
sizes, and if any mismatch reject the request (return an error
status/SubmitJobResponse) instead of proceeding; keep references to
PendingChunkedUpload, state.header, state.chunks, state.field_meta,
submit_chunked_job_async and ensure the validation happens prior to
std::move(state.header)/std::move(state.chunks) and before clearing
state.field_meta so you can include useful diagnostics in the error.
cpp/src/grpc/server/grpc_service_impl.cpp-486-503 (1)

486-503: ⚠️ Potential issue | 🟠 Major

Don’t orphan running jobs in DeleteResult.

This erases job_tracker[job_id] unconditionally. If the job is still queued or processing, the worker keeps running but clients lose status/cancel/result handles and the eventual completion no longer has authoritative tracker state. Restrict deletion to terminal jobs, or cancel and purge the backing worker/queue state before erasing the tracker.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 486 - 503, The
current DeleteResult unconditionally erases job_tracker[job_id], which can
orphan running jobs; change DeleteResult to first look up the tracker entry
under tracker_mutex (use job_tracker.find(job_id) and tracker_mutex) and inspect
the job state (e.g., job_entry.status or job_entry->state) and only erase if the
job is in a terminal state (SUCCEEDED/FAILED/CANCELLED). If the job is queued or
running, either return ERROR_BAD_STATE to the client or initiate a controlled
cancellation flow: call the existing cancellation API (e.g., CancelJob(job_id)
or job_entry->cancel()/queue.cancel(job_id)), wait or force-stop the
worker/queue backing state, confirm the entry is terminal, then call
delete_log_file(job_id) and erase from job_tracker; keep config.verbose logging
for each branch.
cpp/src/grpc/server/grpc_service_impl.cpp-351-368 (1)

351-368: ⚠️ Potential issue | 🟠 Major

Share result buffers across download sessions instead of copying them.

StartChunkedDownload() deep-copies result_arrays into every ChunkedDownloadState. Large completed jobs can therefore multiply memory usage by the number of concurrent download sessions and exhaust the server. A shared immutable snapshot/reference-counted buffer is safer here than per-session copies.

As per coding guidelines "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems"

Also applies to: 377-385

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 351 - 368,
StartChunkedDownload currently deep-copies it->second.result_arrays into each
ChunkedDownloadState causing memory bloat; instead change ChunkedDownloadState
to hold a reference-counted/shared_ptr to the immutable result buffers and
assign that shared_ptr from the tracked job (create or expose
it->second.result_arrays as a shared_ptr inside the job tracker) so the state
just points to the shared data without copying. Update the assignment in
StartChunkedDownload (and the similar block at 377-385) to set
state.raw_arrays_shared = it->second.result_arrays_shared (or construct a
shared_ptr wrapper around the existing container) and ensure thread-safety by
keeping the shared_ptr copy under the tracker_mutex while leaving the actual
heavy buffers shared and immutable after lock release.
cpp/tests/linear_programming/c_api_tests/c_api_tests.cpp-420-477 (1)

420-477: ⚠️ Potential issue | 🟠 Major

Capture the original env before any early return.

SetUpTestSuite() returns from the binary lookup / fork() / startup-timeout paths before orig_* and *_was_set_ are populated, but TearDownTestSuite() still unsets CUDA_VISIBLE_DEVICES and CUOPT_REMOTE_*. A skipped suite can therefore erase preexisting process env for later tests. Snapshot the env at the top of SetUpTestSuite(), or guard teardown behind a snapshot_taken flag.

Based on learnings "Applies to **/test.{cpp,cu,py} : Ensure test isolation: prevent GPU state, cached memory, and global variables from leaking between test cases; verify each test independently initializes its environment"

Also applies to: 480-496

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/c_api_tests/c_api_tests.cpp` around lines 420 -
477, SetUpTestSuite currently snapshots original env vars only after server
startup, so early returns can leave TearDownTestSuite unsetting envs that were
never saved; fix by capturing CUDA_VISIBLE_DEVICES, CUOPT_REMOTE_HOST, and
CUOPT_REMOTE_PORT into orig_cuda_visible_, orig_remote_host_, orig_remote_port_
and setting cuda_was_set_, host_was_set_, port_was_set_ at the very start of
SetUpTestSuite (or set a snapshot_taken boolean true after doing so), and then
in TearDownTestSuite guard any restoration/unset logic behind that
snapshot_taken flag (or check the *_was_set_ flags) so teardown never clobbers
preexisting environment when SetUpTestSuite returned early.
cpp/tests/linear_programming/c_api_tests/c_api_tests.cpp-428-463 (1)

428-463: ⚠️ Potential issue | 🟠 Major

Reserve a free port instead of probing a fixed default.

If Line 428 picks a port that is already occupied, the child can fail to bind while tcp_connect_check() still succeeds against the unrelated listener. The suite then exports CUOPT_REMOTE_HOST/PORT for the wrong process and becomes nondeterministic. Reserve an ephemeral free port first, or at least verify the forked child is the process that became ready before setting the env vars.

Based on learnings "Applies to **/test.{cpp,cu,py} : Ensure test isolation: prevent GPU state, cached memory, and global variables from leaking between test cases; verify each test independently initializes its environment"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/c_api_tests/c_api_tests.cpp` around lines 428 -
463, The test currently picks a fixed port in port_ then forks and probes
readiness with tcp_connect_check which can hit an unrelated process; to fix,
reserve an ephemeral port before forking by creating a temporary socket, bind it
to loopback with port 0, obtain the assigned port via getsockname and set port_,
then close the socket so the child can bind that same port; keep using
server_pid_ and server_path_ as before and then call tcp_connect_check(port_,
...); this ensures the child can bind the reserved ephemeral port and avoids
racing with unrelated listeners.
cpp/src/grpc/grpc_problem_mapper.cpp-597-686 (1)

597-686: ⚠️ Potential issue | 🟠 Major

Don't build the entire chunked upload in memory first.

build_array_chunk_requests() materializes every SendArrayChunkRequest before the first RPC is sent, so a large "chunked" upload still allocates roughly another full copy of the problem in protobuf strings. On the workloads that trigger chunking, that can OOM the client before upload even starts. Emit/send chunks incrementally instead.

As per coding guidelines "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_problem_mapper.cpp` around lines 597 - 686,
build_array_chunk_requests currently materializes all SendArrayChunkRequest
objects into a vector before any RPC is sent, which can OOM for large problems;
change the design so chunks are emitted incrementally instead of accumulated.
Replace build_array_chunk_requests (or add an alternative like
stream_array_chunk_requests) to accept a sender
callback/std::function<void(const cuopt::remote::SendArrayChunkRequest&)> or an
output iterator and invoke chunk_typed_array and chunk_byte_blob to call that
sender for each generated chunk rather than appending to a requests vector;
update callers to use the streaming version (or pass a lambda that forwards to
the RPC stream) and remove the in-memory accumulation in
build_array_chunk_requests. Ensure functions referenced
(build_array_chunk_requests, chunk_typed_array, chunk_byte_blob) are adapted to
support emitting via the provided callback/iterator and that any temporary byte
blobs (e.g., var_names_blob/row_names_blob, vt_bytes/rt_bytes) are created and
sent one-at-a-time to avoid full duplication.
cpp/src/grpc/client/grpc_client.cpp-485-517 (1)

485-517: ⚠️ Potential issue | 🟠 Major

Return a retryable "too large" outcome from submit_unary().

submit_lp() / submit_mip() need to distinguish "unary request exceeded the message cap" from ordinary transport failures; otherwise near-threshold requests just fail even though the chunked upload path is already available. Mirroring the download-side fallback would make this path much more robust.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cpp` around lines 485 - 517, In
grpc_client_t::submit_unary detect a "request too large" outcome by checking
status.error_code() == grpc::StatusCode::RESOURCE_EXHAUSTED (and optionally
status.error_message() for message-size text) and, instead of treating it as a
generic failure, set a distinct retryable indicator (e.g. last_error_ =
"TOO_LARGE_REQUEST: " + status.error_message()) and return false; update callers
(submit_lp/submit_mip) to look for that sentinel (or a new helper like
grpc_client_t::is_too_large_error()) and trigger the chunked upload fallback
when seen.
cpp/src/grpc/grpc_problem_mapper.cpp-484-503 (1)

484-503: ⚠️ Potential issue | 🟠 Major

Keep chunked constraint decoding consistent with the unary path.

map_proto_to_problem() only applies lower/upper bounds as a matched pair and otherwise falls back to b + row_types, but this chunked path sets con_lb, con_ub, and b independently. The same logical payload can therefore reconstruct differently depending on transport, or with half-initialized constraint state if one field is missing.

As per coding guidelines "Validate correct initialization of variable bounds, constraint coefficients, and algorithm state before solving; ensure reset when transitioning between algorithm phases (presolve, simplex, diving, crossover)."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_problem_mapper.cpp` around lines 484 - 503, The chunked
decoding path sets constraint lower/upper bounds and b independently which can
lead to inconsistent state; modify the logic around
get_doubles(cuopt::remote::FIELD_CONSTRAINT_LOWER_BOUNDS),
get_doubles(cuopt::remote::FIELD_CONSTRAINT_UPPER_BOUNDS), and
get_doubles(cuopt::remote::FIELD_B) so that you only apply
constraint_lower_bounds and constraint_upper_bounds to cpu_problem when both
con_lb and con_ub are present (matched pair), otherwise skip setting them and
fall back to the unary behavior: set_constraint_bounds (b_vec) and set_row_types
only as the unary path does; additionally ensure you clear or reset any existing
constraint bounds on cpu_problem when the matched pair is absent to avoid
half-initialized state.
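The matched-pair rule can be sketched as follows (`cpu_problem_sketch` and `apply_constraints` are illustrative names, not the actual cuOpt types; the real code would call set_constraint_bounds / set_row_types):

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Toy stand-in for the CPU-side problem; field names are illustrative only.
struct cpu_problem_sketch {
  std::vector<double> con_lb, con_ub, b;
  std::vector<char> row_types;
  bool bounds_set = false;
};

// Apply explicit constraint bounds only as a matched pair; otherwise clear
// any stale bounds and fall back to b + row_types, as the unary path does.
void apply_constraints(cpu_problem_sketch& p,
                       const std::optional<std::vector<double>>& lb,
                       const std::optional<std::vector<double>>& ub,
                       const std::vector<double>& b,
                       const std::vector<char>& row_types)
{
  if (lb && ub && lb->size() == ub->size()) {
    p.con_lb     = *lb;
    p.con_ub     = *ub;
    p.bounds_set = true;
    return;
  }
  // Matched pair absent: reset so no half-initialized state survives.
  p.con_lb.clear();
  p.con_ub.clear();
  p.bounds_set = false;
  p.b          = b;
  p.row_types  = row_types;
}
```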
cpp/src/grpc/grpc_solution_mapper.cpp-151-229 (1)

151-229: ⚠️ Potential issue | 🟠 Major

Don't drop solver error text on the read path.

The write side serializes error_message / mip_error_message, but neither the unary readers nor chunked_result_to_*() restore it. A remote solve can therefore lose its actual failure reason even though the transport succeeded. If cpu_*_solution_t cannot carry free-form text, please surface it through the surrounding result wrapper instead of silently discarding it.

Also applies to: 259-277, 545-638

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_solution_mapper.cpp` around lines 151 - 229,
map_proto_to_lp_solution is currently discarding the serialized solver text
fields (error_message / mip_error_message) from cuopt::remote::LPSolution; add
logic to extract pb_solution.error_message() and pb_solution.mip_error_message()
and propagate them into the returned result (either by populating new fields on
cpu_lp_solution_t<i_t,f_t> or, if cpu_lp_solution_t cannot carry free-form text,
return/populate the surrounding result wrapper with these strings). Do the same
fix for the other readers mentioned (the reader at lines ~259-277 and the
chunked_result_to_* paths at ~545-638): locate the functions that construct
cpu_*_solution_t from proto and ensure they copy pb_solution.error_message() and
pb_solution.mip_error_message() into the appropriate fields of cpu_*_solution_t
or the outer response wrapper so solver failure text is not lost.
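If the solution type itself cannot carry free-form text, the wrapper approach looks roughly like this (all names here are illustrative, not the actual cuOpt types):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Minimal stand-in for a cpu_*_solution_t that has no text field.
struct lp_solution_sketch {
  std::vector<double> primal;
};

// The surrounding result wrapper carries the solver text instead.
struct remote_result_sketch {
  lp_solution_sketch solution;
  std::string error_message;  // restored from the proto on the read path
};

remote_result_sketch decode(const std::vector<double>& primal,
                            const std::string& proto_error_message)
{
  remote_result_sketch r;
  r.solution.primal = primal;
  r.error_message   = proto_error_message;  // do not silently discard
  return r;
}
```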
cpp/tests/linear_programming/grpc/grpc_integration_test.cpp-1761-1818 (1)

1761-1818: ⚠️ Potential issue | 🟠 Major

These chunk-validation cases leak upload state across the shared server.

start_upload() creates server-side sessions, but this fixture keeps one server alive for the whole suite and never finishes or explicitly cleans those sessions up. That can bleed partially uploaded state into later cases and make failures order-dependent.

Based on learnings "Ensure test isolation: prevent GPU state, cached memory, and global variables from leaking between test cases; verify each test independently initializes its environment."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp` around lines
1761 - 1818, Tests leak server-side upload sessions because SetUpTestSuite
starts a single s_server_ for the whole suite while start_upload() creates
per-test upload sessions that are never finished; either (A) make the server
lifecycle per-test by moving server start/stop from
SetUpTestSuite/TearDownTestSuite into SetUp/TearDown so each test gets a fresh
s_server_, or (B) keep the shared server but track upload IDs from
start_upload() and ensure each test calls the appropriate cleanup RPC (e.g.,
FinishChunkedUpload or Abort/CancelChunkedUpload) via stub_ in TearDown to
remove server-side session state before the next test; update SetUp/ TearDown
and start_upload/send_chunk helpers accordingly to record and clean up upload
IDs.
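Option (B) is commonly implemented with an RAII guard so cleanup runs even when an assertion fails mid-test. A sketch (the cancel callback stands in for a hypothetical cancel/abort-upload RPC; the class name is illustrative):

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Cancels a server-side upload session on destruction unless the test
// explicitly finished it first.
class upload_session_guard {
 public:
  explicit upload_session_guard(std::function<void()> cancel)
    : cancel_(std::move(cancel)) {}
  ~upload_session_guard()
  {
    if (!finished_ && cancel_) cancel_();
  }
  void mark_finished() { finished_ = true; }

 private:
  std::function<void()> cancel_;
  bool finished_ = false;
};
```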
🟡 Minor comments (4)
python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py-637-649 (1)

637-649: ⚠️ Potential issue | 🟡 Minor

Make the negative mTLS assertion failure-specific.

result.returncode != 0 also passes for unrelated failures like a missing dataset, parser import error, or server startup race. Please assert on a TLS/auth-specific stderr/stdout fragment so this test actually proves client-certificate enforcement.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`
around lines 637 - 649, Update the test_mtls_rejects_no_client_cert to make the
failure specific to TLS/auth: after calling _run_in_subprocess with
_impl_lp_solve_cpu_only, assert both that result.returncode != 0 and that
result.stderr (or result.stdout if stderr is empty) contains a TLS-specific
substring such as "certificate", "client certificate", "SSL", "TLS", or
"authentication failed"; use the same test helper names (_run_in_subprocess,
_impl_lp_solve_cpu_only, _tls_env, mtls_server_info) to locate where to add the
additional assert and prefer checking stderr first and falling back to stdout if
needed.
cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp-38-47 (1)

38-47: ⚠️ Potential issue | 🟡 Minor

Add the missing <iostream> include.

dump_logs() uses std::cout in its default argument at line 339, but <iostream> is never included. This makes the header dependent on transitive includes and will fail if includes are reordered.

Minimal fix
 #include <fstream>
 #include <functional>
+#include <iostream>
 #include <mutex>
 #include <regex>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp` around lines 38
- 47, The header is missing the <iostream> include required by dump_logs()'s
default parameter that uses std::cout; add #include <iostream> to the include
block at the top of grpc_test_log_capture.hpp so the file no longer relies on
transitive includes and dump_logs()'s default argument (std::cout) is valid;
update the include list near the other standard headers (alongside <string>,
<sstream>, etc.) to ensure compilation when includes are reordered.
cpp/src/grpc/server/grpc_service_impl.cpp-791-808 (1)

791-808: ⚠️ Potential issue | 🟡 Minor

Advance the completion offset after flushing the last line.

When the terminal-state path drains one final log line, the sentinel message still publishes the pre-drain current_offset. Resuming from done.byte_offset will re-read that tail line.

Suggested fix
         if (std::getline(in, line)) {
           std::streampos after2 = in.tellg();
           int64_t next_offset2  = current_offset + static_cast<int64_t>(line.size());
           if (after2 >= 0) { next_offset2 = static_cast<int64_t>(after2); }
           cuopt::remote::LogMessage m;
           m.set_line(line);
           m.set_byte_offset(next_offset2);
           m.set_job_complete(false);
           writer->Write(m);
+          current_offset = next_offset2;
         }

         cuopt::remote::LogMessage done;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 791 - 808, The
sentinel "done" message is using current_offset (pre-drain) so resuming re-reads
the last line; after you compute next_offset2 (or after2) when a final line is
read, advance current_offset (or set the done byte offset to next_offset2)
before constructing cuopt::remote::LogMessage done so done.set_byte_offset(...)
uses the post-drain offset; update the code around current_offset/next_offset2
(and use after2 if available) prior to writer->Write(done) to ensure the
completion offset points after the flushed line.
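The offset bookkeeping above can be isolated into a small helper (illustrative sketch, not the server's actual function): return the offset *after* the drained line so a resume never re-reads it.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Drain one line starting at current_offset and return the byte offset
// just past it; a caller resuming from the returned value will not
// re-read the tail line.
int64_t drain_line(std::istream& in, int64_t current_offset, std::string& line)
{
  if (!std::getline(in, line)) return current_offset;
  // Fallback mirrors the patch above: advance by the line length...
  int64_t next = current_offset + static_cast<int64_t>(line.size());
  // ...but prefer the stream's own position when it is available.
  std::streampos after = in.tellg();
  if (after >= 0) next = static_cast<int64_t>(after);
  return next;
}
```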
cpp/tests/linear_programming/grpc/grpc_integration_test.cpp-1020-1045 (1)

1020-1045: ⚠️ Potential issue | 🟡 Minor

Prove the incumbent callback actually triggered cancellation.

EXPECT_LT(elapsed.count(), 25) can still pass if the callback never fires and this instance just solves quickly. Please assert that the callback ran at least twice and that the terminal state reflects cancellation/non-success.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp` around lines
1020 - 1045, The test must assert the incumbent callback actually fired and that
the solver returned a cancelled/non-success state: after calling
client->solve_mip(problem, settings, true) check that callback_count >= 2 (the
lambda attached to config.incumbent_callback incremented it) and assert the
returned result (the variable result from client->solve_mip) indicates
cancellation/non-success by checking the appropriate result field (e.g.,
result.status or result.termination_status) rather than relying solely on
elapsed time.
🧹 Nitpick comments (1)
cpp/tests/linear_programming/grpc/grpc_pipe_serialization_test.cpp (1)

140-286: Add malformed-chunk coverage for the assembly path.

These tests cover well-formed round trips only. The serializer in cpp/src/grpc/server/grpc_pipe_serialization.hpp also has failure-prone paths for overlaps, gaps, inconsistent total_elements, and invalid byte counts; without negative tests, silent corruption in the pipe format can slip through. As per coding guidelines "Test with free variables, singleton problems, and extreme problem dimensions near resource limits to validate edge case handling."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_pipe_serialization_test.cpp` around
lines 140 - 286, Add negative unit tests in
cpp/tests/linear_programming/grpc/grpc_pipe_serialization_test.cpp that exercise
the assembly failure paths by constructing malformed ArrayChunk sequences and
asserting read_chunked_request_from_pipe returns false or fails as expected;
create tests such as ChunkedRequest_Malformed_Overlap (two chunks for the same
field with overlapping element ranges), ChunkedRequest_Malformed_Gap (chunks
that leave a gap so total_elements mismatches assembly),
ChunkedRequest_Malformed_InconsistentTotalElements (chunks for same field with
conflicting total_elements values), and
ChunkedRequest_Malformed_InvalidByteCount (chunk.data length not equal to
element_count * array_field_element_size); use PipePair,
write_chunked_request_to_pipe(pp.write_fd(), header, chunks) to write and
read_chunked_request_from_pipe(pp.read_fd(), header_out, arrays_out) to verify
the reader rejects/returns false for each case, building chunks via
make_partial_chunk/make_whole_chunk or by setting ArrayChunk fields directly
(field_id, element_offset, total_elements, data) to craft the malformed
scenarios.
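The assembly-side validation those negative tests would exercise can be sketched in isolation (illustrative names; the real serializer works on ArrayChunk messages and a pipe fd): reject out-of-range, overlapping, ill-sized, and gapped chunks instead of skipping them.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Illustrative chunk: element_offset/data mirror the ArrayChunk fields
// the tests would craft; elem_size is bytes per element.
struct chunk_sketch {
  int64_t element_offset;
  std::vector<uint8_t> data;
};

// Assemble chunks into a buffer of total_bytes, failing loudly on any
// malformed input rather than silently producing a corrupt array.
bool assemble(const std::vector<chunk_sketch>& chunks,
              size_t elem_size, size_t total_bytes,
              std::vector<uint8_t>& out)
{
  out.assign(total_bytes, 0);
  std::vector<bool> covered(total_bytes, false);
  for (const auto& c : chunks) {
    if (c.element_offset < 0) return false;
    size_t byte_off = static_cast<size_t>(c.element_offset) * elem_size;
    if (c.data.size() % elem_size != 0) return false;         // invalid byte count
    if (byte_off + c.data.size() > total_bytes) return false;  // out of range
    for (size_t i = 0; i < c.data.size(); ++i) {
      if (covered[byte_off + i]) return false;                 // overlap
      covered[byte_off + i] = true;
    }
    std::memcpy(out.data() + byte_off, c.data.data(), c.data.size());
  }
  for (bool b : covered)
    if (!b) return false;                                      // gap
  return true;
}
```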

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: aaca6118-22d1-4b1b-9a91-5dbc62d00790

📥 Commits

Reviewing files that changed from the base of the PR and between 6c06a40 and eebd7d8.

📒 Files selected for processing (50)
  • GRPC_INTERFACE.md
  • GRPC_QUICK_START.md
  • GRPC_SERVER_ARCHITECTURE.md
  • build.sh
  • ci/build_wheel_libcuopt.sh
  • ci/utils/install_protobuf_grpc.sh
  • conda/environments/all_cuda-129_arch-aarch64.yaml
  • conda/environments/all_cuda-129_arch-x86_64.yaml
  • conda/environments/all_cuda-131_arch-aarch64.yaml
  • conda/environments/all_cuda-131_arch-x86_64.yaml
  • conda/recipes/libcuopt/recipe.yaml
  • cpp/CMakeLists.txt
  • cpp/cuopt_grpc_server.cpp
  • cpp/src/grpc/client/grpc_client.cpp
  • cpp/src/grpc/client/grpc_client.hpp
  • cpp/src/grpc/client/solve_remote.cpp
  • cpp/src/grpc/cuopt_remote.proto
  • cpp/src/grpc/cuopt_remote_service.proto
  • cpp/src/grpc/grpc_problem_mapper.cpp
  • cpp/src/grpc/grpc_problem_mapper.hpp
  • cpp/src/grpc/grpc_service_mapper.cpp
  • cpp/src/grpc/grpc_service_mapper.hpp
  • cpp/src/grpc/grpc_settings_mapper.cpp
  • cpp/src/grpc/grpc_settings_mapper.hpp
  • cpp/src/grpc/grpc_solution_mapper.cpp
  • cpp/src/grpc/grpc_solution_mapper.hpp
  • cpp/src/grpc/server/grpc_field_element_size.hpp
  • cpp/src/grpc/server/grpc_incumbent_proto.hpp
  • cpp/src/grpc/server/grpc_job_management.cpp
  • cpp/src/grpc/server/grpc_pipe_io.cpp
  • cpp/src/grpc/server/grpc_pipe_serialization.hpp
  • cpp/src/grpc/server/grpc_server_main.cpp
  • cpp/src/grpc/server/grpc_server_threads.cpp
  • cpp/src/grpc/server/grpc_server_types.hpp
  • cpp/src/grpc/server/grpc_service_impl.cpp
  • cpp/src/grpc/server/grpc_worker.cpp
  • cpp/src/grpc/server/grpc_worker_infra.cpp
  • cpp/src/pdlp/CMakeLists.txt
  • cpp/src/pdlp/solve_remote.cu
  • cpp/tests/linear_programming/CMakeLists.txt
  • cpp/tests/linear_programming/c_api_tests/c_api_tests.cpp
  • cpp/tests/linear_programming/grpc/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/grpc_client_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp
  • cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_pipe_serialization_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
  • dependencies.yaml
  • python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
  • python/libcuopt/CMakeLists.txt
💤 Files with no reviewable changes (2)
  • cpp/src/pdlp/solve_remote.cu
  • cpp/src/pdlp/CMakeLists.txt
✅ Files skipped from review due to trivial changes (1)
  • GRPC_INTERFACE.md
🚧 Files skipped from review as they are similar to previous changes (3)
  • cpp/src/grpc/server/grpc_field_element_size.hpp
  • cpp/src/grpc/server/grpc_incumbent_proto.hpp
  • cpp/tests/linear_programming/grpc/CMakeLists.txt

Comment on lines +21 to +24
ssize_t written = ::write(fd, ptr, remaining);
if (written <= 0) {
  if (errno == EINTR) continue;
  return false;

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Expect at least one startup path to ignore or block SIGPIPE before pipe writes happen.
rg -nP 'SIGPIPE|sigaction\s*\(\s*SIGPIPE|signal\s*\(\s*SIGPIPE' -C2

Repository: NVIDIA/cuopt

Length of output: 38


🏁 Script executed:

cat cpp/src/grpc/server/grpc_pipe_io.cpp

Repository: NVIDIA/cuopt

Length of output: 2085


🏁 Script executed:

# Search for alternative SIGPIPE mitigation: MSG_NOSIGNAL flag, SO_NOSIGPIPE option, signal masks
rg -n 'MSG_NOSIGNAL|SO_NOSIGPIPE|signal.*mask|sigprocmask' -C2

Repository: NVIDIA/cuopt

Length of output: 38


🏁 Script executed:

# Look for signal() or sigaction() calls anywhere in the codebase
rg -n 'signal\s*\(' -C2 | head -50

Repository: NVIDIA/cuopt

Length of output: 4013


🏁 Script executed:

# Check for any SIGPIPE references even in comments or strings
rg -in 'sigpipe' -C2

Repository: NVIDIA/cuopt

Length of output: 38


Add SIGPIPE signal handler in server startup to prevent unexpected process termination.

Line 21 uses plain ::write() on a pipe. When the peer closes its read end, SIGPIPE will be raised and terminate the process before this helper can return false. No signal-level SIGPIPE handling exists anywhere in the C++ codebase. Add signal(SIGPIPE, SIG_IGN) or equivalent sigaction() call in the gRPC server initialization before any pipe I/O operations occur.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_pipe_io.cpp` around lines 21 - 24, The pipe write
helper in grpc_pipe_io.cpp uses ::write(...) which can raise SIGPIPE and kill
the process before the helper returns false; to fix this, register SIGPIPE to be
ignored in the gRPC server startup (before any pipe I/O occurs) by calling
signal(SIGPIPE, SIG_IGN) or installing an equivalent sigaction() in the server
initialization path (e.g., in the function that starts/initializes the gRPC
server), ensuring this handler is set once at startup so the ::write loop in
grpc_pipe_io.cpp no longer causes process termination when the peer closes the
read end.
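The effect of the one-time `signal(SIGPIPE, SIG_IGN)` can be demonstrated in isolation (POSIX/Linux sketch; function name is illustrative): with SIGPIPE ignored, a write to a pipe whose read end is closed fails with EPIPE instead of terminating the process, so the helper can return false as intended.

```cpp
#include <cassert>
#include <cerrno>
#include <csignal>
#include <unistd.h>

// Returns true when the failed write reports EPIPE rather than raising
// SIGPIPE (which would kill the process before we could check errno).
bool write_to_closed_pipe_returns_epipe()
{
  std::signal(SIGPIPE, SIG_IGN);  // one-time setup, e.g. in server startup

  int fds[2];
  if (pipe(fds) != 0) return false;
  close(fds[0]);  // peer closes the read end

  char byte       = 0;
  ssize_t written = write(fds[1], &byte, 1);
  bool got_epipe  = (written == -1 && errno == EPIPE);
  close(fds[1]);
  return got_epipe;
}
```

Without the SIG_IGN line, the same write would deliver SIGPIPE and, under the default disposition, terminate the process.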

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
ci/utils/install_protobuf_grpc.sh (1)

72-79: ⚠️ Potential issue | 🔴 Critical

Canonicalize the user-controlled paths before the cleanup/build flow.

The current guard only rejects a literal /. --prefix=/usr/.. still passes and makes the cleanup target /lib and /bin, and a fresh custom --build-dir still fails because nothing creates it before git clone.

🛡️ Minimal hardening
+PREFIX="$(realpath -m -- "${PREFIX}")"
+BUILD_DIR="$(realpath -m -- "${BUILD_DIR}")"
+
 if [[ -z "$PREFIX" || "$PREFIX" == "/" ]]; then
     echo "ERROR: Invalid PREFIX: '$PREFIX'" >&2
     exit 1
 fi
 if [[ -z "$BUILD_DIR" || "$BUILD_DIR" == "/" ]]; then
@@
 GRPC_SRC="${BUILD_DIR}/grpc-src"
 ABSL_BUILD="${BUILD_DIR}/absl-build"
 PROTOBUF_BUILD="${BUILD_DIR}/protobuf-build"
 GRPC_BUILD="${BUILD_DIR}/grpc-build"
 
-rm -rf "${GRPC_SRC}" "${ABSL_BUILD}" "${PROTOBUF_BUILD}" "${GRPC_BUILD}"
-mkdir -p "${PREFIX}"
+mkdir -p -- "${PREFIX}" "${BUILD_DIR}"
+rm -rf -- "${GRPC_SRC}" "${ABSL_BUILD}" "${PROTOBUF_BUILD}" "${GRPC_BUILD}"

Also applies to: 124-158

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/utils/install_protobuf_grpc.sh` around lines 72 - 79, Canonicalize and
validate user-controlled paths before using them: resolve PREFIX and BUILD_DIR
with realpath -m (or readlink -f where available) into canonical variables
(e.g., resolved_prefix and resolved_build_dir), then reject if resolved_prefix
is empty or equals "/" and reject empty resolved_build_dir; also ensure
resolved_build_dir is created (mkdir -p "$resolved_build_dir") before git clone
and any cleanup steps. Replace raw uses of PREFIX and BUILD_DIR in the
cleanup/build flow (including the other similar block) with the canonical
variables to prevent paths like --prefix=/usr/.. from escaping to root and to
ensure the build directory exists.
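Why the literal `"/"` guard is insufficient can be shown with purely lexical normalization, the C++ analogue of `realpath -m` on a nonexistent path (helper name is illustrative):

```cpp
#include <cassert>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// A path like "/usr/.." names the root directory even though the raw
// string does not equal "/", so a string comparison alone cannot catch it.
std::string normalized(const std::string& p)
{
  return fs::path(p).lexically_normal().string();
}
```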
🧹 Nitpick comments (7)
python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py (5)

662-663: Consider adding a function whitelist for subprocess entry point.

The globals()[sys.argv[1]]() pattern works but could be hardened with an explicit allowlist to prevent accidental execution of unintended functions:

♻️ Optional hardening
 if __name__ == "__main__":
-    globals()[sys.argv[1]]()
+    _ALLOWED_IMPL_FUNCS = {
+        "_impl_lp_solve_cpu_only",
+        "_impl_lp_dual_solution_cpu_only",
+        "_impl_mip_solve_cpu_only",
+        "_impl_warmstart_cpu_only",
+    }
+    func_name = sys.argv[1]
+    if func_name not in _ALLOWED_IMPL_FUNCS:
+        raise ValueError(f"Unknown test function: {func_name}")
+    globals()[func_name]()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`
around lines 662 - 663, Replace the dynamic call via globals()[sys.argv[1]]() in
the module entrypoint with an explicit allowlist: define a dict or set (e.g.,
ALLOWED_ENTRIES = {"run_test_a": run_test_a, "run_test_b": run_test_b}) mapping
permitted names to callables, then look up sys.argv[1] against ALLOWED_ENTRIES
and invoke the matched function; if the name is missing, raise a clear error or
print usage. This hardens the __main__ entrypoint and references the existing
globals()[sys.argv[1]]() pattern, sys.argv, and the module-level test functions
to locate where to change the logic.

96-102: Avoid shell=True with string interpolation.

Using shell=True with f-string command construction is a security anti-pattern. While cert_dir originates from pytest's temp directory, this pattern is risky if copied elsewhere. Consider using list arguments with shell=False.

♻️ Suggested refactor using list arguments
     def _run(cmd):
-        return (
-            subprocess.run(
-                cmd, shell=True, capture_output=True, timeout=30
-            ).returncode
-            == 0
-        )
+        return (
+            subprocess.run(
+                cmd, capture_output=True, timeout=30
+            ).returncode
+            == 0
+        )

Then update callers to pass list arguments:

# Example for CA generation:
if not _run([
    "openssl", "req", "-x509", "-newkey", "rsa:2048",
    "-keyout", ca_key, "-out", ca_crt,
    "-days", "1", "-nodes", "-subj", "/CN=TestCA"
]):
    return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`
around lines 96 - 102, The helper _run currently invokes subprocess.run with
shell=True and a string command which is unsafe; change _run to accept a
sequence (list/tuple) command and call subprocess.run(cmd, capture_output=True,
timeout=30) without shell=True (or explicitly shell=False), and update all
callers that pass interpolated strings to pass list arguments (e.g., ["openssl",
"req", "-x509", ...]) so argument quoting is handled safely and no shell
interpolation occurs; keep the same return semantics (returncode == 0) and
preserve timeout/capture_output behavior in the function named _run and in any
tests calling it.

184-227: Consider adding numerical correctness validation to CPU-only tests.

These test implementations only verify vector sizes, not numerical accuracy. Per coding guidelines, tests should "validat[e] numerical correctness of optimization results (not just 'runs without error')". Since these tests verify remote execution fidelity, consider adding objective value checks similar to TestSolutionInterfacePolymorphism:

# Example addition to _impl_lp_solve_cpu_only:
obj = solution.get_primal_objective()
_AFIRO_OBJ = -464.7531428571
assert abs(obj - _AFIRO_OBJ) / abs(_AFIRO_OBJ) < 0.01, (
    f"Objective {obj} too far from expected {_AFIRO_OBJ}"
)

This ensures remote execution produces numerically correct results, not just correctly-sized outputs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`
around lines 184 - 227, Add numerical correctness assertions to the CPU-only
tests: in _impl_lp_solve_cpu_only, after obtaining obj =
solution.get_primal_objective(), assert the objective is within a small relative
tolerance of the known Afiro reference value (e.g., _AFIRO_OBJ ≈
-464.7531428571) using a relative error check (< 0.01); in
_impl_lp_dual_solution_cpu_only, optionally add similar sanity checks for duals
or reduced costs (e.g., spot-check a few known entries or confirm primal
objective via solution.get_primal_objective() matches the same reference) so the
tests validate numerical accuracy not just vector sizes. Ensure you use the
existing solution methods (get_primal_objective, get_dual_solution,
get_reduced_cost) and compare with a relative tolerance assertion.

291-316: Consider extracting shared fixture logic.

The cpu_only_env_with_server fixture is nearly identical between TestCPUOnlyExecution (port +600) and TestCuoptCliCPUOnly (port +700). Consider extracting a parameterized helper or using conftest.py to reduce duplication.

♻️ Example extraction
def _start_grpc_server_fixture(port_offset):
    """Factory for class-scoped gRPC server fixtures."""
    @pytest.fixture(scope="class")
    def fixture_fn():
        server_bin = _find_grpc_server()
        if server_bin is None:
            pytest.skip("cuopt_grpc_server not found")

        port = int(os.environ.get("CUOPT_TEST_PORT_BASE", "18000")) + port_offset
        proc = subprocess.Popen(
            [server_bin, "--port", str(port), "--workers", "1"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # ... rest of setup/teardown logic
        yield _cpu_only_env(port)
        # ... cleanup
    return fixture_fn
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`
around lines 291 - 316, The cpu_only_env_with_server fixture in
TestCPUOnlyExecution is almost identical to the one in TestCuoptCliCPUOnly;
extract the duplicated logic into a reusable factory (e.g.,
_start_grpc_server_fixture) that accepts a port_offset and returns a
class-scoped fixture, then replace the two fixtures with calls to that factory
(or register them in conftest.py) so cpu_only_env_with_server and the other
fixture reuse the same start/teardown logic and only differ by the port offset
(+600 vs +700).

1-14: Consider adding edge case tests for remote execution.

Per coding guidelines, tests should cover "infeasible, unbounded, empty, singleton problems" and "free variables, singleton problems, and extreme problem dimensions." While this file appropriately focuses on connectivity (CPU-only, TLS, mTLS), consider adding tests that exercise the remote execution path with degenerate or edge-case problems to ensure the gRPC layer correctly handles and propagates solver results for unusual inputs.

This could be added in a follow-up PR or tracked as a separate issue.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`
around lines 1 - 14, Add remote execution edge-case tests that run through the
same CPU-only gRPC test harness (the shared cuopt_grpc_server used by
TestSolutionInterfacePolymorphism) to assert correct solver status and
propagated messages for degenerate problems: include tests that submit
infeasible problems, unbounded problems, empty/zero-variable problems, singleton
problems, and problems with free variables and extreme dimensions; for each test
reuse the existing server setup/teardown and the client call path used by the
CPU-only TLS/mTLS tests so you validate gRPC serialization and status
propagation (check solver status codes, objective values when applicable, and
error messages returned by the remote solver).
cpp/src/grpc/server/grpc_pipe_serialization.hpp (2)

37-44: Consider validating message size before serialization.

ByteSizeLong() returns size_t but is cast to uint32_t. While protobuf enforces a ~2 GiB internal limit, adding an explicit check would catch issues earlier with a clearer error path.

♻️ Suggested defensive check
 inline bool write_protobuf_to_pipe(int fd, const google::protobuf::MessageLite& msg)
 {
-  uint32_t size = static_cast<uint32_t>(msg.ByteSizeLong());
+  size_t byte_size = msg.ByteSizeLong();
+  if (byte_size > std::numeric_limits<uint32_t>::max()) return false;
+  uint32_t size = static_cast<uint32_t>(byte_size);
   if (!write_to_pipe(fd, &size, sizeof(size))) return false;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 37 - 44, The
write_protobuf_to_pipe function casts msg.ByteSizeLong() (size_t) to uint32_t
without checking for overflow; add a defensive check using the original size_t
value (from ByteSizeLong()) and if it exceeds UINT32_MAX (or some configured
max), fail early (return false or handle error) before allocating the
std::vector and calling SerializeToArray; this prevents silent
truncation/overflow and clarifies the error path when messages are larger than
uint32_t can hold.

215-222: Consider adding size validation for consistency.

SerializeToArray takes an int for the size parameter. While the comment notes this is "always well under 2 GiB," adding an explicit check would provide defense-in-depth consistent with other serialization paths.

♻️ Suggested defensive check
 inline std::vector<uint8_t> serialize_submit_request_to_pipe(
   const cuopt::remote::SubmitJobRequest& request)
 {
   size_t byte_size = request.ByteSizeLong();
+  if (byte_size > static_cast<size_t>(std::numeric_limits<int>::max())) {
+    return {};  // Return empty on oversized request
+  }
   std::vector<uint8_t> blob(byte_size);
   request.SerializeToArray(blob.data(), static_cast<int>(byte_size));
   return blob;
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 215 - 222,
serialize_submit_request_to_pipe currently casts request.ByteSizeLong() to int
before calling request.SerializeToArray; add a defensive check that byte_size is
<= std::numeric_limits<int>::max() (and optionally > 0) before the static_cast
to avoid overflow/truncation, and handle the error case (throw
std::runtime_error or return an empty vector with a logged error) so
SerializeToArray is only called with a safe int size; refer to
serialize_submit_request_to_pipe and the call to SerializeToArray for where to
place this validation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ci/utils/install_protobuf_grpc.sh`:
- Around line 116-119: The error message inside the tools-check loop (for tool
in git cmake ninja) incorrectly suggests the unsupported flag
"--skip-deps=false"; update the echo in that block (the line that prints "Error:
Required tool '$tool' not found...") to remove the unsupported flag and instead
give a correct instruction such as prompting the user to install the missing
tool or consult the script's README/usage, so the message only references valid
recovery actions.

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp`:
- Around line 46-53: The function read_protobuf_from_pipe reads a uint32_t size
from the pipe and allocates a buffer without any upper bound check, enabling
potential resource exhaustion; add validation of size against the existing
constant kMaxPipeArrayBytes (and ensure size>0) before allocating the
std::vector and before the second read_from_pipe call, returning false if size
is zero or exceeds kMaxPipeArrayBytes; update error path to avoid allocation
when invalid and keep using read_from_pipe and msg.ParseFromArray as-is.
- Around line 118-127: The loop over fi.chunks currently skips chunks whose
byte_offset/size validation fails, which can hide data corruption; update the
logic in the loop that iterates over fi.chunks (using ac->element_offset(),
elem_size, chunk_data, assembled) to treat invalid offsets as an error: when
byte_offset < 0 or byte_offset + chunk_data.size() > fi.total_bytes, immediately
return false (or at minimum log an error) instead of silently continuing, so
that write_to_pipe(fd, assembled.data(), assembled.size()) is only called when
all chunks validated and copied successfully.

---

Duplicate comments:
In `@ci/utils/install_protobuf_grpc.sh`:
- Around line 72-79: Canonicalize and validate user-controlled paths before
using them: resolve PREFIX and BUILD_DIR with realpath -m (or readlink -f where
available) into canonical variables (e.g., resolved_prefix and
resolved_build_dir), then reject if resolved_prefix is empty or equals "/" and
reject empty resolved_build_dir; also ensure resolved_build_dir is created
(mkdir -p "$resolved_build_dir") before git clone and any cleanup steps. Replace
raw uses of PREFIX and BUILD_DIR in the cleanup/build flow (including the other
similar block) with the canonical variables to prevent paths like
--prefix=/usr/.. from escaping to root and to ensure the build directory exists.

---

Nitpick comments:
In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp`:
- Around line 37-44: The write_protobuf_to_pipe function casts
msg.ByteSizeLong() (size_t) to uint32_t without checking for overflow; add a
defensive check using the original size_t value (from ByteSizeLong()) and if it
exceeds UINT32_MAX (or some configured max), fail early (return false or handle
error) before allocating the std::vector and calling SerializeToArray; this
prevents silent truncation/overflow and clarifies the error path when messages
are larger than uint32_t can hold.
- Around line 215-222: serialize_submit_request_to_pipe currently casts
request.ByteSizeLong() to int before calling request.SerializeToArray; add a
defensive check that byte_size is <= std::numeric_limits<int>::max() (and
optionally > 0) before the static_cast to avoid overflow/truncation, and handle
the error case (throw std::runtime_error or return an empty vector with a logged
error) so SerializeToArray is only called with a safe int size; refer to
serialize_submit_request_to_pipe and the call to SerializeToArray for where to
place this validation.
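The guard this comment asks for can be sketched without protobuf by validating the `size_t` byte count before narrowing it to the `int` that `SerializeToArray` expects. The helper name below is illustrative, not the PR's actual code:

```cpp
#include <cstddef>
#include <limits>
#include <stdexcept>
#include <string>

// Validate a size_t byte count before narrowing to int. A real caller would
// pass msg.ByteSizeLong() and only call SerializeToArray on success.
inline int checked_proto_size(std::size_t byte_size)
{
  if (byte_size == 0 ||
      byte_size > static_cast<std::size_t>(std::numeric_limits<int>::max())) {
    throw std::runtime_error("protobuf message size out of range: " +
                             std::to_string(byte_size));
  }
  return static_cast<int>(byte_size);
}
```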

In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`:
- Around line 662-663: Replace the dynamic call via globals()[sys.argv[1]]() in
the module entrypoint with an explicit allowlist: define a dict or set (e.g.,
ALLOWED_ENTRIES = {"run_test_a": run_test_a, "run_test_b": run_test_b}) mapping
permitted names to callables, then look up sys.argv[1] against ALLOWED_ENTRIES
and invoke the matched function; if the name is missing, raise a clear error or
print usage. This hardens the __main__ entrypoint and references the existing
globals()[sys.argv[1]]() pattern, sys.argv, and the module-level test functions
to locate where to change the logic.
- Around line 96-102: The helper _run currently invokes subprocess.run with
shell=True and a string command which is unsafe; change _run to accept a
sequence (list/tuple) command and call subprocess.run(cmd, capture_output=True,
timeout=30) without shell=True (or explicitly shell=False), and update all
callers that pass interpolated strings to pass list arguments (e.g., ["openssl",
"req", "-x509", ...]) so argument quoting is handled safely and no shell
interpolation occurs; keep the same return semantics (returncode == 0) and
preserve timeout/capture_output behavior in the function named _run and in any
tests calling it.
- Around line 184-227: Add numerical correctness assertions to the CPU-only
tests: in _impl_lp_solve_cpu_only, after obtaining obj =
solution.get_primal_objective(), assert the objective is within a small relative
tolerance of the known Afiro reference value (e.g., _AFIRO_OBJ ≈
-464.7531428571) using a relative error check (< 0.01); in
_impl_lp_dual_solution_cpu_only, optionally add similar sanity checks for duals
or reduced costs (e.g., spot-check a few known entries or confirm primal
objective via solution.get_primal_objective() matches the same reference) so the
tests validate numerical accuracy not just vector sizes. Ensure you use the
existing solution methods (get_primal_objective, get_dual_solution,
get_reduced_cost) and compare with a relative tolerance assertion.
- Around line 291-316: The cpu_only_env_with_server fixture in
TestCPUOnlyExecution is almost identical to the one in TestCuoptCliCPUOnly;
extract the duplicated logic into a reusable factory (e.g.,
_start_grpc_server_fixture) that accepts a port_offset and returns a
class-scoped fixture, then replace the two fixtures with calls to that factory
(or register them in conftest.py) so cpu_only_env_with_server and the other
fixture reuse the same start/teardown logic and only differ by the port offset
(+600 vs +700).
- Around line 1-14: Add remote execution edge-case tests that run through the
same CPU-only gRPC test harness (the shared cuopt_grpc_server used by
TestSolutionInterfacePolymorphism) to assert correct solver status and
propagated messages for degenerate problems: include tests that submit
infeasible problems, unbounded problems, empty/zero-variable problems, singleton
problems, and problems with free variables and extreme dimensions; for each test
reuse the existing server setup/teardown and the client call path used by the
CPU-only TLS/mTLS tests so you validate gRPC serialization and status
propagation (check solver status codes, objective values when applicable, and
error messages returned by the remote solver).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a9ff9e87-fb6b-4f35-bdfb-ecfd2c053f90

📥 Commits

Reviewing files that changed from the base of the PR and between eebd7d8 and a97a468.

📒 Files selected for processing (10)
  • GRPC_QUICK_START.md
  • ci/utils/install_protobuf_grpc.sh
  • cpp/src/grpc/client/grpc_client.cpp
  • cpp/src/grpc/grpc_problem_mapper.cpp
  • cpp/src/grpc/server/grpc_job_management.cpp
  • cpp/src/grpc/server/grpc_pipe_serialization.hpp
  • cpp/src/grpc/server/grpc_server_main.cpp
  • cpp/src/grpc/server/grpc_worker_infra.cpp
  • cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
  • python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • GRPC_QUICK_START.md
  • cpp/src/grpc/server/grpc_worker_infra.cpp
  • cpp/src/grpc/server/grpc_job_management.cpp

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py (1)

474-479: ⚠️ Potential issue | 🟡 Minor

Missing executable check for CONDA_PREFIX path (same issue as _find_grpc_server).

The CONDA_PREFIX branch doesn't verify executable permission, inconsistent with the earlier checks.

Proposed fix
         conda_prefix = os.environ.get("CONDA_PREFIX", "")
         if conda_prefix:
             p = os.path.join(conda_prefix, "bin", "cuopt_cli")
-            if os.path.isfile(p):
+            if os.path.isfile(p) and os.access(p, os.X_OK):
                 return p
         return None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`
around lines 474 - 479, The CONDA_PREFIX branch currently only checks
os.path.isfile(p) and should also verify the file is executable like the earlier
_find_grpc_server logic; update the check for the constructed path variable p to
ensure it exists and is executable (e.g., replace or augment os.path.isfile(p)
with an executable check such as os.access(p, os.X_OK) or equivalent) so the
function returns the path only when the cuopt_cli binary is executable.
♻️ Duplicate comments (3)
cpp/src/grpc/client/grpc_client.cpp (1)

89-94: ⚠️ Potential issue | 🟠 Major

The new deadline helper still ignores config_.timeout_seconds.

set_rpc_deadline() is parameterized, but every finite RPC still passes the hard-coded 60-second constant. That means large submits/uploads/downloads can time out even when the client is configured for a longer timeout. Either plumb config_.timeout_seconds through these call sites or introduce a separate configurable short-RPC timeout knob.

Also applies to: 265-266, 333-334, 359-360, 397-397, 489-490, 821-821, 856-856, 875-875, 910-910, 948-948, 996-996, 1059-1059, 1102-1102

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cpp` around lines 89 - 94, The helper
set_rpc_deadline(ctx, timeout_seconds) exists but call sites still pass the
hard-coded 60s constant, so plumb the configured timeout through: replace
literal 60 at each RPC call site with config_.timeout_seconds (or, if you want a
separate short-RPC knob, add and use a new member like
config_.short_rpc_timeout_seconds) so that functions using set_rpc_deadline
(references around the hard-coded uses noted in the review and the helper
itself) honor the client configuration; ensure default behavior preserves
current semantics when config_ value is zero or negative.
cpp/src/grpc/server/grpc_pipe_serialization.hpp (1)

37-55: ⚠️ Potential issue | 🟠 Major

The protobuf size guard still allows near-4 GiB allocations.

size is a uint32_t, so size > kMaxPipeArrayBytes only rejects values the prefix cannot represent anyway. A malformed frame can still drive read_protobuf_from_pipe() into allocating gigabytes. Please switch these helpers to a dedicated protobuf-message cap and enforce it symmetrically on both read and write.

🛡️ Minimal cap tightening
+static constexpr uint32_t kMaxProtobufMessageBytes = 256U * 1024 * 1024;
+
 inline bool write_protobuf_to_pipe(int fd, const google::protobuf::MessageLite& msg)
 {
   size_t byte_size = msg.ByteSizeLong();
-  if (byte_size > UINT32_MAX) return false;
+  if (byte_size == 0 || byte_size > kMaxProtobufMessageBytes) return false;
   uint32_t size = static_cast<uint32_t>(byte_size);
   if (!write_to_pipe(fd, &size, sizeof(size))) return false;
   std::vector<uint8_t> buf(size);
@@
 inline bool read_protobuf_from_pipe(int fd, google::protobuf::MessageLite& msg)
 {
   uint32_t size;
   if (!read_from_pipe(fd, &size, sizeof(size))) return false;
-  if (size == 0 || size > kMaxPipeArrayBytes) return false;
+  if (size == 0 || size > kMaxProtobufMessageBytes) return false;
   std::vector<uint8_t> buf(size);
   if (!read_from_pipe(fd, buf.data(), size)) return false;
   return msg.ParseFromArray(buf.data(), static_cast<int>(size));
 }
As per coding guidelines: "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 37 - 55, The
helpers write_protobuf_to_pipe and read_protobuf_from_pipe currently allow
near-4GiB allocations because they only compare a uint32_t size against
kMaxPipeArrayBytes; introduce a dedicated protobuf message cap (e.g.,
kMaxProtobufMessageBytes) and enforce it symmetrically: in
write_protobuf_to_pipe check msg.ByteSizeLong() against the new cap and return
false if it exceeds the cap before casting to uint32_t, and in
read_protobuf_from_pipe validate the read uint32_t size is non‑zero and <=
kMaxProtobufMessageBytes (not kMaxPipeArrayBytes) before allocating the vector
or parsing; update any related comments.
cpp/src/grpc/server/grpc_server_main.cpp (1)

25-45: ⚠️ Potential issue | 🟠 Major

--chunk-timeout is documented but currently has no effect.

print_usage() advertises the flag, but the parser never handles it, and unknown options still fall through silently. Operators can think they set a per-chunk timeout while the server keeps the default behavior. Please either parse it or reject it explicitly as unsupported.

Also applies to: 49-81

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 25 - 45, print_usage
documents the --chunk-timeout flag but the argument parser currently ignores it
(and lets unknown options pass silently); update the CLI parsing logic (where
flags like --max-message-mb / --tls-* are handled) to either (A) implement
--chunk-timeout by parsing its integer value into the server variable (e.g.,
chunk_timeout_seconds or similar), validate it (non-negative integer; 0 =
disabled), and wire that variable into the streaming timeout behavior, or (B) if
not supported, explicitly detect --chunk-timeout and exit with an error message
stating it is unsupported; additionally make the parser treat unknown options as
errors instead of silently ignoring them so flags advertised by print_usage are
enforced.
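The fail-fast parsing described above can be sketched as follows; `options`, `parse_args`, and the single `--chunk-timeout` flag are stand-ins for the server's real symbols, and the point is that a missing value, a non-numeric or negative value, or an unknown flag all cause an explicit rejection rather than a silent default:

```cpp
#include <cstring>
#include <cstdlib>
#include <limits>

struct options { int chunk_timeout_seconds = 0; };

// Returns true on success; false means the caller should print usage and exit.
inline bool parse_args(int argc, const char* const argv[], options& opt)
{
  for (int i = 1; i < argc; ++i) {
    if (std::strcmp(argv[i], "--chunk-timeout") == 0) {
      if (i + 1 >= argc) return false;  // missing value: reject, don't default
      char* end = nullptr;
      long v = std::strtol(argv[++i], &end, 10);
      if (end == argv[i] || *end != '\0' || v < 0 ||
          v > std::numeric_limits<int>::max()) {
        return false;                   // non-numeric, negative, or too large
      }
      opt.chunk_timeout_seconds = static_cast<int>(v);
    } else {
      return false;                     // unknown option: reject, don't ignore
    }
  }
  return true;
}
```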
🧹 Nitpick comments (2)
python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py (2)

29-31: Hardcoded default dataset path.

The default /home/datasets/cuopt may not exist in all environments. Tests do handle missing files gracefully via pytest.skip, but consider documenting this expectation or using a more portable default like ./datasets.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`
around lines 29 - 31, The default dataset path RAPIDS_DATASET_ROOT_DIR is
hardcoded to "/home/datasets/cuopt"; change its fallback to a more portable path
such as "./datasets" (or document the expectation) by updating the
os.environ.get call that defines RAPIDS_DATASET_ROOT_DIR in
test_cpu_only_execution.py so tests still call pytest.skip when files are
missing but will use the local ./datasets directory by default; ensure the
variable name RAPIDS_DATASET_ROOT_DIR is the only symbol you modify and keep the
existing skip behavior intact.

369-394: Consider extracting duplicate fixture logic.

The cpu_only_env_with_server fixtures in TestCPUOnlyExecution (lines 369-394) and TestCuoptCliCPUOnly (lines 436-461) are nearly identical, differing only in port offset. Consider extracting a shared helper or parameterized fixture to reduce duplication.

Example refactor
def _start_grpc_server_fixture(port_offset):
    """Factory for server-backed environment fixtures."""
    server_bin = _find_grpc_server()
    if server_bin is None:
        pytest.skip("cuopt_grpc_server not found")

    port = int(os.environ.get("CUOPT_TEST_PORT_BASE", "18000")) + port_offset
    proc = subprocess.Popen(
        [server_bin, "--port", str(port), "--workers", "1"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if not _wait_for_port(port, timeout=15):
        proc.kill()
        proc.wait()
        pytest.fail("cuopt_grpc_server failed to start within 15s")
    return proc, _cpu_only_env(port)


def _stop_grpc_server(proc):
    """Cleanup helper for server process."""
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=5)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()

Then use in fixtures:

@pytest.fixture(scope="class")
def cpu_only_env_with_server(self):
    proc, env = _start_grpc_server_fixture(port_offset=600)
    yield env
    _stop_grpc_server(proc)

Also applies to: 436-461

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`
around lines 369 - 394, Extract the duplicated server-backed fixture setup into
a shared helper and cleanup functions and update both cpu_only_env_with_server
(in TestCPUOnlyExecution) and the corresponding fixture in TestCuoptCliCPUOnly
to call them; specifically, create a factory function (e.g.,
_start_grpc_server_fixture(port_offset)) that locates the server via
_find_grpc_server, computes port using CUOPT_TEST_PORT_BASE + port_offset,
launches the subprocess with the same args, waits with _wait_for_port, and
returns (proc, _cpu_only_env(port)), and a _stop_grpc_server(proc) that sends
SIGTERM and falls back to kill on TimeoutExpired, then replace the fixture
bodies (cpu_only_env_with_server and the other class fixture) to call the
factory, yield the env, and invoke _stop_grpc_server(proc) for cleanup.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/src/grpc/client/grpc_client.cpp`:
- Around line 204-215: start_log_streaming currently overwrites log_thread_ and
will destroy a joinable std::thread if called twice; add a lifecycle guard: if
log_thread_ exists and is joinable, set stop_logs_.store(true), call join() on
log_thread_ (or return early to reject reentry), then reset log_thread_ before
clearing stop flag and creating the new thread. Use the existing symbols
start_log_streaming, stop_logs_, log_thread_, and stop_log_streaming (if
available) to implement this sequence so you never assign a new std::thread over
a joinable one.
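The join-before-reassign sequence described above can be sketched with the gRPC streaming loop replaced by a trivial sleep loop; `stop_logs_` and `log_thread_` mirror the member names quoted in the comment, but the class itself is a stand-in:

```cpp
#include <atomic>
#include <chrono>
#include <thread>

class log_streamer {
 public:
  void start_log_streaming()
  {
    stop_log_streaming();  // never assign over a joinable std::thread
    stop_logs_.store(false);
    log_thread_ = std::thread([this] {
      while (!stop_logs_.load()) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
      }
    });
  }

  void stop_log_streaming()
  {
    stop_logs_.store(true);
    if (log_thread_.joinable()) log_thread_.join();
  }

  ~log_streamer() { stop_log_streaming(); }

 private:
  std::atomic<bool> stop_logs_{false};
  std::thread log_thread_;
};
```

Assigning a new `std::thread` over a joinable one calls `std::terminate`, which is why the guard must run before every reassignment, not just in the destructor.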

In `@cpp/src/grpc/grpc_problem_mapper.cpp`:
- Around line 216-224: The loop that maps var_types_str to var_types currently
maps unknown chars to var_t::CONTINUOUS, which silently changes model semantics;
instead detect any unexpected tag in the switch (refer to var_types_str,
var_types, var_t and cpu_problem.set_variable_types) and abort parsing by
returning/propagating an error or throwing an exception with a clear message
(include the bad char and context) rather than pushing CONTINUOUS; make the same
change in the other decode path referenced (lines ~512-523) so both decoders
validate tags and fail on unknown values.
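A strict decode of the kind requested might look like the sketch below. The tag characters (`'C'` for continuous, `'I'` for integer) and the `var_t` enum are illustrative stand-ins for the mapper's actual types; the point is that an unknown tag aborts parsing instead of silently becoming `CONTINUOUS`:

```cpp
#include <stdexcept>
#include <string>
#include <vector>

enum class var_t { CONTINUOUS, INTEGER };

inline std::vector<var_t> decode_var_types(const std::string& var_types_str)
{
  std::vector<var_t> var_types;
  var_types.reserve(var_types_str.size());
  for (char c : var_types_str) {
    switch (c) {
      case 'C': var_types.push_back(var_t::CONTINUOUS); break;
      case 'I': var_types.push_back(var_t::INTEGER); break;
      default:
        // Fail loudly with the offending tag instead of guessing a type.
        throw std::runtime_error(std::string("unknown variable type tag '") + c + "'");
    }
  }
  return var_types;
}
```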

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp`:
- Around line 115-128: Validate inputs before writing into assembled: compute
elem_size only after confirming fi.total_bytes and
fi.chunks[0]->total_elements() divide evenly (use division with remainder check)
and fail if not; for each ac in fi.chunks validate ac->element_offset() is
non-negative and that chunk_data.size() is a multiple of elem_size
(chunk_data.size() % elem_size == 0), compute num_elements = chunk_data.size() /
elem_size and check element_offset + num_elements does not exceed
fi.chunks[0]->total_elements() before multiplying to get a byte_offset, and
instead of pre-zeroing and blindly memcpying keep a bitmap or byte-coverage
counter to detect gaps/overlaps and only allow success if the union of all chunk
byte ranges exactly covers fi.total_bytes (otherwise return false).
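Stripped of the pipe and protobuf plumbing, the coverage check described above reduces to: every chunk's element range is in bounds, no two chunks overlap, and together they cover every element exactly once. `chunk_t` and its field names are assumptions standing in for the PR's types:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct chunk_t {
  std::int64_t element_offset;
  std::size_t num_elements;
};

inline bool validate_chunk_coverage(const std::vector<chunk_t>& chunks,
                                    std::size_t total_elements)
{
  std::vector<bool> covered(total_elements, false);
  for (const auto& c : chunks) {
    if (c.element_offset < 0) return false;
    std::size_t off = static_cast<std::size_t>(c.element_offset);
    // Bounds check before any offset arithmetic, phrased to avoid overflow.
    if (off > total_elements || c.num_elements > total_elements - off) return false;
    for (std::size_t i = off; i < off + c.num_elements; ++i) {
      if (covered[i]) return false;  // overlap
      covered[i] = true;
    }
  }
  for (bool b : covered)
    if (!b) return false;            // gap
  return true;
}
```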

In `@GRPC_QUICK_START.md`:
- Around line 68-118: Add a missing server-certificate issuance step: generate a
server key and CSR, sign the CSR with the private CA (ca.crt / ca.key) to
produce server.crt, and ensure the CSR includes a SAN that matches the hostname
clients will use (so the server cert presented to clients is actually signed by
the same CA and contains the correct subjectAltName); update the docs around the
cuopt_grpc_server invocation and the environment variables (--tls-cert
server.crt, --tls-key server.key, --tls-root ca.crt) to state explicitly that
server.crt must be the CA-signed certificate with matching SAN.

In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`:
- Around line 57-61: The CONDA_PREFIX branch sets p = os.path.join(conda_prefix,
"bin", "cuopt_grpc_server") and only checks os.path.isfile(p); update that
branch to also verify the file is executable using os.access(p, os.X_OK)
(consistent with the other branches) before returning p, so replace the single
isfile check with a combined isfile and os.access check for executability.

---

Outside diff comments:
In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`:
- Around line 474-479: The CONDA_PREFIX branch currently only checks
os.path.isfile(p) and should also verify the file is executable like the earlier
_find_grpc_server logic; update the check for the constructed path variable p to
ensure it exists and is executable (e.g., replace or augment os.path.isfile(p)
with an executable check such as os.access(p, os.X_OK) or equivalent) so the
function returns the path only when the cuopt_cli binary is executable.

---

Duplicate comments:
In `@cpp/src/grpc/client/grpc_client.cpp`:
- Around line 89-94: The helper set_rpc_deadline(ctx, timeout_seconds) exists
but call sites still pass the hard-coded 60s constant, so plumb the configured
timeout through: replace literal 60 at each RPC call site with
config_.timeout_seconds (or, if you want a separate short-RPC knob, add and use
a new member like config_.short_rpc_timeout_seconds) so that functions using
set_rpc_deadline (references around the hard-coded uses noted in the review and
the helper itself) honor the client configuration; ensure default behavior
preserves current semantics when config_ value is zero or negative.

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp`:
- Around line 37-55: The helpers write_protobuf_to_pipe and
read_protobuf_from_pipe currently allow near-4GiB allocations because they only
compare a uint32_t size against kMaxPipeArrayBytes; introduce a dedicated
protobuf message cap (e.g., kMaxProtobufMessageBytes) and enforce it
symmetrically: in write_protobuf_to_pipe check msg.ByteSizeLong() against the
new cap and return false if it exceeds the cap before casting to uint32_t, and
in read_protobuf_from_pipe validate the read uint32_t size is non‑zero and <=
kMaxProtobufMessageBytes (not kMaxPipeArrayBytes) before allocating the vector
or parsing; update any related comments.

In `@cpp/src/grpc/server/grpc_server_main.cpp`:
- Around line 25-45: print_usage documents the --chunk-timeout flag but the
argument parser currently ignores it (and lets unknown options pass silently);
update the CLI parsing logic (where flags like --max-message-mb / --tls-* are
handled) to either (A) implement --chunk-timeout by parsing its integer value
into the server variable (e.g., chunk_timeout_seconds or similar), validate it
(non-negative integer; 0 = disabled), and wire that variable into the streaming
timeout behavior, or (B) if not supported, explicitly detect --chunk-timeout and
exit with an error message stating it is unsupported; additionally make the
parser treat unknown options as errors instead of silently ignoring them so
flags advertised by print_usage are enforced.

---

Nitpick comments:
In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`:
- Around line 29-31: The default dataset path RAPIDS_DATASET_ROOT_DIR is
hardcoded to "/home/datasets/cuopt"; change its fallback to a more portable path
such as "./datasets" (or document the expectation) by updating the
os.environ.get call that defines RAPIDS_DATASET_ROOT_DIR in
test_cpu_only_execution.py so tests still call pytest.skip when files are
missing but will use the local ./datasets directory by default; ensure the
variable name RAPIDS_DATASET_ROOT_DIR is the only symbol you modify and keep the
existing skip behavior intact.
- Around line 369-394: Extract the duplicated server-backed fixture setup into a
shared helper and cleanup functions and update both cpu_only_env_with_server (in
TestCPUOnlyExecution) and the corresponding fixture in TestCuoptCliCPUOnly to
call them; specifically, create a factory function (e.g.,
_start_grpc_server_fixture(port_offset)) that locates the server via
_find_grpc_server, computes port using CUOPT_TEST_PORT_BASE + port_offset,
launches the subprocess with the same args, waits with _wait_for_port, and
returns (proc, _cpu_only_env(port)), and a _stop_grpc_server(proc) that sends
SIGTERM and falls back to kill on TimeoutExpired, then replace the fixture
bodies (cpu_only_env_with_server and the other class fixture) to call the
factory, yield the env, and invoke _stop_grpc_server(proc) for cleanup.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 21c442e4-fc3d-439d-aaa7-8c0bbcf57bcd

📥 Commits

Reviewing files that changed from the base of the PR and between a97a468 and 026627a.

📒 Files selected for processing (10)
  • GRPC_QUICK_START.md
  • ci/utils/install_protobuf_grpc.sh
  • cpp/src/grpc/client/grpc_client.cpp
  • cpp/src/grpc/grpc_problem_mapper.cpp
  • cpp/src/grpc/server/grpc_job_management.cpp
  • cpp/src/grpc/server/grpc_pipe_serialization.hpp
  • cpp/src/grpc/server/grpc_server_main.cpp
  • cpp/src/grpc/server/grpc_worker_infra.cpp
  • cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
  • python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • cpp/src/grpc/server/grpc_worker_infra.cpp
  • ci/utils/install_protobuf_grpc.sh

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 8

♻️ Duplicate comments (3)
cpp/src/grpc/server/grpc_job_management.cpp (1)

245-259: ⚠️ Potential issue | 🔴 Critical

cancel_job() still has a slot-reuse race.

This path matches job_id once, then writes cancelled and may kill() a worker based on queue-slot state that can be recycled concurrently by the dispatcher/result threads. If the slot is reused between those operations, cancellation can affect the wrong job. Pin the slot, or revalidate that the same job_id still owns it immediately before mutating the slot and signaling the PID. As per coding guidelines "Ensure race conditions are absent in multi-threaded server implementations; verify proper synchronization of shared state".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 245 - 259, The
cancel_job() path can race with slot reuse; after you find a matching slot
(checking job_queue[i].ready and strcmp on job_queue[i].job_id), revalidate/lock
the slot immediately before mutating or signaling: either acquire the slot-level
mutex (or a global queue mutex) around reading worker_pid/claimed and then set
job_queue[i].cancelled and call kill(), or atomically re-check that
strcmp(job_queue[i].job_id, job_id.c_str()) still matches (and
job_queue[i].ready is still true) right before setting cancelled/killing; ensure
you reference the same symbols (MAX_JOBS, job_queue[i].job_id,
job_queue[i].ready, job_queue[i].claimed, job_queue[i].cancelled, worker_pid,
kill()) so the slot cannot be reused between the match and the mutation/signal.
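The revalidate-under-lock pattern can be sketched as below: the slot is matched and mutated under the same lock, so it cannot be recycled between the check and the write. `job_slot_t` and `MAX_JOBS` are stand-ins; the real queue lives in shared memory (where a process-shared mutex or equivalent would be needed) and signals a worker PID:

```cpp
#include <array>
#include <cstring>
#include <mutex>

constexpr int MAX_JOBS = 4;

struct job_slot_t {
  bool ready = false;
  bool cancelled = false;
  char job_id[64] = {};
};

struct job_queue_t {
  std::mutex mtx;
  std::array<job_slot_t, MAX_JOBS> slots;

  // Returns true only if the job still owns its slot at mutation time.
  bool cancel_job(const char* job_id)
  {
    std::lock_guard<std::mutex> lock(mtx);
    for (auto& slot : slots) {
      if (slot.ready && std::strcmp(slot.job_id, job_id) == 0) {
        slot.cancelled = true;  // match and mutation under the same lock
        // kill(worker_pid, SIGTERM) would also go here, still under the lock
        return true;
      }
    }
    return false;  // slot was recycled or job never existed
  }
};
```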
cpp/src/grpc/server/grpc_pipe_serialization.hpp (1)

120-140: ⚠️ Potential issue | 🔴 Critical

bytes_covered is not a real integrity check.

Two overlapping chunks and one gap can still satisfy bytes_covered == fi.total_bytes, so this path accepts a zero-padded/corrupted field as valid. It also multiplies element_offset * elem_size before proving the element range is safe, which can overflow on malformed input. Track covered ranges/bytes and validate the element interval before converting it to byte offsets. As per coding guidelines "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 120 - 140, The
current assembly loop uses bytes_covered and multiplies element_offset *
elem_size before validating element ranges, which allows overflow and accepts
overlaps/gaps; fix by first validating elem_size > 0 and that chunk_data.size()
% elem_size == 0, compute chunk_elements = chunk_data.size() / elem_size,
validate element_offset >= 0 and element_offset <= total_elements -
chunk_elements (use int64 safe checks to avoid overflow) before computing
byte_offset, and replace the bytes_covered check with a per-element coverage
bitmap/vector<bool> (size total_elements) that marks each element index covered
by a chunk and detects any overlap (already-marked bit) or final gaps (any false
bit) before memcpy and write_to_pipe; ensure all bounds and arithmetic use
checked types to prevent overflows.
cpp/src/grpc/server/grpc_server_main.cpp (1)

49-86: ⚠️ Potential issue | 🟠 Major

Still reject missing option values instead of falling through.

Flags that require an argument (--port, --workers, --max-message-*, --chunk-timeout, and the TLS path flags) still just skip parsing when the value is absent, so the server can continue with unintended defaults. Please fail fast with usage output here, and validate basic numeric ranges before continuing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 49 - 86, The CLI
parsing loop in grpc_server_main.cpp currently ignores missing values for flags
that require arguments, allowing silent fallback to defaults; update the parsing
for flags that take values (those setting config.port, config.num_workers,
config.max_message_bytes via --max-message-mb/--max-message-bytes,
config.chunk_timeout_seconds, and TLS path flags
config.tls_cert_path/config.tls_key_path/config.tls_root_path) to check i+1 <
argc and if missing call print_usage(argv[0]) and return 1 (fail fast), and
after parsing validate numeric ranges (e.g. port in 1–65535, num_workers > 0,
max_message_bytes >= 4096, chunk_timeout_seconds >= 0) and on invalid values
also print_usage and return 1; keep using the same parsing block and
print_usage/return flow so behavior is consistent.
🧹 Nitpick comments (3)
python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py (3)

630-635: Duplicate shutdown logic could be extracted.

This shutdown block duplicates _stop_grpc_server. Consider reusing the helper.

♻️ Suggested refactor
-        proc.send_signal(signal.SIGTERM)
-        try:
-            proc.wait(timeout=5)
-        except subprocess.TimeoutExpired:
-            proc.kill()
-            proc.wait()
+        _stop_grpc_server(proc)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`
around lines 630 - 635, The shutdown block that sends SIGTERM and falls back to
kill duplicates the existing helper _stop_grpc_server; replace the inline
sequence (proc.send_signal(signal.SIGTERM)...proc.wait()) with a call to the
helper _stop_grpc_server(proc) (or refactor that helper to accept the proc
object if necessary), ensuring the helper implements the same timeout and
fallback kill behavior and reuses existing logging/error handling so the test
uses the central shutdown logic.

88-100: Consider logging failures for easier debugging.

The inner _run helper silently discards stdout/stderr on failure. If certificate generation fails, diagnosing the root cause will be difficult.

♻️ Optional: capture and log failures
     def _run(cmd):
-        return (
-            subprocess.run(cmd, capture_output=True, timeout=30).returncode
-            == 0
-        )
+        result = subprocess.run(cmd, capture_output=True, timeout=30, text=True)
+        if result.returncode != 0:
+            print(f"openssl command failed: {' '.join(cmd)}", file=sys.stderr)
+            if result.stderr:
+                print(result.stderr, file=sys.stderr)
+        return result.returncode == 0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`
around lines 88 - 100, The helper _run inside _generate_test_certs currently
swallows stdout/stderr which makes failures hard to diagnose; modify _run to
capture subprocess.run's CompletedProcess, and on non-zero returncode log the
command, returncode, stdout and stderr (use the module logger via
logging.getLogger or the existing test logger) before returning False, so
callers can see detailed failure information while keeping the same boolean
success/failure return behavior.

307-325: Add numerical correctness validation for MIP test.

The test validates solution size but not correctness. Per coding guidelines, tests should validate "numerical correctness of optimization results (not just 'runs without error')". Consider adding objective value or feasibility validation.

♻️ Suggested improvement
     solution = linear_programming.Solve(dm, settings)
     vals = solution.get_primal_solution()
     assert len(vals) == n_vars, f"solution size {len(vals)} != n_vars {n_vars}"
+
+    obj = solution.get_primal_objective()
+    assert obj is not None, "objective is None"
+    
+    # bb_optimality.mps has a known optimal value; validate against it
+    stats = solution.get_milp_stats()
+    assert "mip_gap" in stats, "MIP stats missing mip_gap"
+    assert stats["mip_gap"] >= 0, f"Negative MIP gap: {stats['mip_gap']}"

As per coding guidelines: "Write tests validating numerical correctness of optimization results (not just 'runs without error')".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`
around lines 307 - 325, Add numerical correctness checks to
_impl_mip_solve_cpu_only: compute the objective from
dm.get_objective_coefficients() dotted with the returned
solution.get_primal_solution() and assert it matches
solution.get_objective_value() within a small tolerance; additionally validate
feasibility by evaluating the constraint matrix and RHS (use
dm.get_constraint_matrix(), dm.get_rhs(), dm.get_sense() or equivalent
accessors) against vals to ensure all constraints satisfy their senses (<=, =,
>=) within tolerances. Use n_vars, dm, solution, vals and explicit tolerances in
the assertions so the test verifies both objective accuracy and constraint
feasibility rather than just solution size.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/src/grpc/client/grpc_client.cpp`:
- Around line 683-688: The calculation of max_polls can overflow because
config_.timeout_seconds * 1000 is done in int; change the arithmetic to 64-bit
before multiplication (e.g., use int64_t or long long) to compute
(timeout_seconds * 1000) / poll_ms safely, then clamp or cast the final value
into an int as needed; update the computation that assigns max_polls
(referencing poll_ms, max_polls, and
config_.timeout_seconds/config_.poll_interval_ms) to perform 64-bit math and
handle the case where the result exceeds std::numeric_limits<int>::max().
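A minimal sketch of the overflow-safe computation (the helper name is hypothetical; the real code computes max_polls inline in grpc_client.cpp):

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper mirroring the max_polls computation: widen to 64-bit
// before multiplying so large timeouts cannot overflow int, then clamp the
// result back into int range.
int compute_max_polls(int timeout_seconds, int poll_interval_ms)
{
  if (poll_interval_ms <= 0) { poll_interval_ms = 1; }  // guard divide-by-zero
  int64_t polls = (static_cast<int64_t>(timeout_seconds) * 1000) / poll_interval_ms;
  if (polls > std::numeric_limits<int>::max()) { return std::numeric_limits<int>::max(); }
  return static_cast<int>(polls);
}
```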
- Around line 292-295: The member server_max_message_bytes_ of class
grpc_client_t is being accessed concurrently from multiple RPC methods
(check_status, get_result_or_download, download_chunked_result,
upload_chunked_arrays) causing a data race; fix by making
server_max_message_bytes_ an std::atomic<int64_t> (replace its declaration) and
update all writes to use store() and reads to use load() (use
memory_order_relaxed unless stricter ordering is required), or alternatively
document in the class comment that grpc_client_t is not thread-safe and must
only be used from a single thread—pick one approach and apply it consistently
across all usages referenced above.
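The atomic variant might look like the following sketch (class and member names follow the review comment; the accessors are illustrative):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Sketch: the cached server limit is read and written from several RPC paths,
// so it is stored as std::atomic<int64_t> with relaxed ordering, which is
// sufficient for a monotonic cache of a size hint.
class grpc_client_t {
 public:
  void note_server_limit(int64_t bytes)  // writer side (e.g. check_status)
  {
    server_max_message_bytes_.store(bytes, std::memory_order_relaxed);
  }
  int64_t server_limit() const  // reader side (e.g. upload/download paths)
  {
    return server_max_message_bytes_.load(std::memory_order_relaxed);
  }

 private:
  std::atomic<int64_t> server_max_message_bytes_{4 * 1024 * 1024};
};
```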

In `@cpp/src/grpc/grpc_problem_mapper.cpp`:
- Around line 113-119: The encoding loop over var_types that maps var_t to chars
(pushing into var_types_str) currently treats unknown enum values as CONTINUOUS
('C'); change the default case in that switch to detect unexpected var_t values
and fail loudly (e.g., throw std::runtime_error or call LOG(ERROR)/assert)
instead of pushing 'C', including the offending enum value in the message;
reference the enum var_t, the var_types container and var_types_str vector so
you update that switch branch to mirror the decoder's behavior (i.e., explicit
handling for known cases and error on unknown cases).
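A sketch of the fail-loud encoder (the enum values and function name here are illustrative stand-ins for the mapper's var_t handling):

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Illustrative enum: known variable types map to 'C'/'I'; any unexpected enum
// value throws with the offending value instead of being silently encoded as
// continuous.
enum class var_t : char { CONTINUOUS = 0, INTEGER = 1 };

std::string encode_var_types(const std::vector<var_t>& var_types)
{
  std::string var_types_str;
  var_types_str.reserve(var_types.size());
  for (auto v : var_types) {
    switch (v) {
      case var_t::CONTINUOUS: var_types_str.push_back('C'); break;
      case var_t::INTEGER: var_types_str.push_back('I'); break;
      default:
        throw std::runtime_error("unexpected var_t value: " +
                                 std::to_string(static_cast<int>(v)));
    }
  }
  return var_types_str;
}
```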
- Around line 647-652: The switch over var_t in the loop currently maps unknown
enum values silently to CONTINUOUS; update the default path in the block that
fills vt_bytes (loop iterating over var_types and writing to vt_bytes) to match
the handling used in map_problem_to_proto: detect unexpected var_t values, emit
a clear error/log entry including the offending enum value (and context), and
encode a distinct marker (e.g., 'U' for unknown) or return/propagate an error
instead of silently falling back to 'C' so unknown enums are not silently
converted; ensure you touch the switch that references var_t, var_types, and
vt_bytes to implement this consistent behavior.

In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp`:
- Around line 70-81: The clear() method resets client_logs_ but incorrectly sets
server_log_start_pos_ to 0 and test_start_marked_ to false, which causes
subsequent get_server_logs() to return the entire server log (including stale
entries); change clear() to mark the server log start for the next test by
setting server_log_start_pos_ to the current end of the server log (e.g.,
compute current file size/position used by get_server_logs()) and set
test_start_marked_ = true (or call the existing mark/test-start helper) so
get_server_logs() only returns entries written after clear(); update references
in clear(), server_log_start_pos_, and test_start_marked_ accordingly.
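The fix can be modeled on a toy buffer (the real class tracks a position in the server log file; the member names follow the review comment, the rest is illustrative):

```cpp
#include <cassert>
#include <string>

// Toy model of the log-capture fix: clear() must advance the server-log start
// marker to the current end of the log so later reads return only entries
// written after the clear, not the whole stale log.
class log_capture {
 public:
  void append_server_log(const std::string& s) { server_log_ += s; }
  void clear()
  {
    client_logs_.clear();
    server_log_start_pos_ = server_log_.size();  // mark start for the next test
    test_start_marked_    = true;
  }
  std::string get_server_logs() const { return server_log_.substr(server_log_start_pos_); }

 private:
  std::string client_logs_;
  std::string server_log_;
  std::size_t server_log_start_pos_ = 0;
  bool test_start_marked_           = false;
};
```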
- Around line 40-47: The header is not self-contained because
dump_logs(std::ostream& out = std::cout) uses std::ostream and std::cout but
<iostream> is not included; add `#include` <iostream> at the top of
grpc_test_log_capture.hpp so std::ostream and std::cout are declared and the
header no longer relies on transitive includes (affects the dump_logs
declaration and any callers relying on default std::cout).

In `@GRPC_QUICK_START.md`:
- Line 134: Duplicate step number: change the heading "4. Configure an
authorized client:" to "5. Configure an authorized client:" so the sequence
follows the previous "4." step; update the numeric prefix in that Markdown line
(the string "4. Configure an authorized client:") to "5." to fix the numbering.

---

Duplicate comments:
In `@cpp/src/grpc/server/grpc_job_management.cpp`:
- Around line 245-259: The cancel_job() path can race with slot reuse; after you
find a matching slot (checking job_queue[i].ready and strcmp on
job_queue[i].job_id), revalidate/lock the slot immediately before mutating or
signaling: either acquire the slot-level mutex (or a global queue mutex) around
reading worker_pid/claimed and then set job_queue[i].cancelled and call kill(),
or atomically re-check that strcmp(job_queue[i].job_id, job_id.c_str()) still
matches (and job_queue[i].ready is still true) right before setting
cancelled/killing; ensure you reference the same symbols (MAX_JOBS,
job_queue[i].job_id, job_queue[i].ready, job_queue[i].claimed,
job_queue[i].cancelled, worker_pid, kill()) so the slot cannot be reused between
the match and the mutation/signal.
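The re-validation pattern might be sketched as follows (field names follow the review comment; MAX_JOBS and the worker signalling are illustrative, and the real code also sends a signal to the worker pid):

```cpp
#include <cassert>
#include <cstring>
#include <mutex>
#include <string>

// Sketch: after locating a slot by job_id, take the queue mutex and re-check
// the id/ready flags before mutating, so a slot recycled between the lookup
// and the cancel is left alone.
constexpr int MAX_JOBS = 4;

struct job_slot {
  char job_id[64] = {0};
  bool ready      = false;
  bool cancelled  = false;
};

job_slot job_queue[MAX_JOBS];
std::mutex queue_mutex;

bool cancel_job(const std::string& job_id)
{
  for (int i = 0; i < MAX_JOBS; ++i) {
    if (job_queue[i].ready && std::strcmp(job_queue[i].job_id, job_id.c_str()) == 0) {
      std::lock_guard<std::mutex> lock(queue_mutex);
      // Re-validate under the lock: the slot may have been reused since the match.
      if (!job_queue[i].ready || std::strcmp(job_queue[i].job_id, job_id.c_str()) != 0) {
        return false;
      }
      job_queue[i].cancelled = true;  // real code would also kill() the worker pid
      return true;
    }
  }
  return false;
}
```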

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp`:
- Around line 120-140: The current assembly loop uses bytes_covered and
multiplies element_offset * elem_size before validating element ranges, which
allows overflow and accepts overlaps/gaps; fix by first validating elem_size > 0
and that chunk_data.size() % elem_size == 0, compute chunk_elements =
chunk_data.size() / elem_size, validate element_offset >= 0 and element_offset
<= total_elements - chunk_elements (use int64 safe checks to avoid overflow)
before computing byte_offset, and replace the bytes_covered check with a
per-element coverage bitmap/vector<bool> (size total_elements) that marks each
element index covered by a chunk and detects any overlap (already-marked bit) or
final gaps (any false bit) before memcpy and write_to_pipe; ensure all bounds
and arithmetic use checked types to prevent overflows.
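A self-contained sketch of the validated assembly (types and names are illustrative stand-ins for the pipe-serialization code): sizes are checked before any offset arithmetic, the math is done in 64-bit, and a per-element coverage bitmap rejects overlapping chunks and final gaps.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

struct chunk_t {
  int64_t element_offset;
  std::vector<uint8_t> data;
};

bool assemble_chunks(const std::vector<chunk_t>& chunks,
                     int64_t total_elements,
                     int64_t elem_size,
                     std::vector<uint8_t>& out)
{
  if (elem_size <= 0 || total_elements < 0) { return false; }
  out.assign(static_cast<size_t>(total_elements * elem_size), 0);
  std::vector<bool> covered(static_cast<size_t>(total_elements), false);

  for (const auto& c : chunks) {
    if (c.data.size() % static_cast<size_t>(elem_size) != 0) { return false; }
    int64_t chunk_elements = static_cast<int64_t>(c.data.size()) / elem_size;
    // Range check in 64-bit before computing any byte offset.
    if (c.element_offset < 0 || c.element_offset > total_elements - chunk_elements) {
      return false;
    }
    for (int64_t e = 0; e < chunk_elements; ++e) {
      size_t idx = static_cast<size_t>(c.element_offset + e);
      if (covered[idx]) { return false; }  // overlap: element delivered twice
      covered[idx] = true;
    }
    std::memcpy(out.data() + c.element_offset * elem_size, c.data.data(), c.data.size());
  }
  for (bool b : covered) {
    if (!b) { return false; }  // gap: some element never arrived
  }
  return true;
}
```

Note that if chunk_elements exceeds total_elements, the right-hand side of the range check goes negative and any non-negative offset is rejected.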

In `@cpp/src/grpc/server/grpc_server_main.cpp`:
- Around line 49-86: The CLI parsing loop in grpc_server_main.cpp currently
ignores missing values for flags that require arguments, allowing silent
fallback to defaults; update the parsing for flags that take values (those
setting config.port, config.num_workers, config.max_message_bytes via
--max-message-mb/--max-message-bytes, config.chunk_timeout_seconds, and TLS path
flags config.tls_cert_path/config.tls_key_path/config.tls_root_path) to check
i+1 < argc and if missing call print_usage(argv[0]) and return 1 (fail fast),
and after parsing validate numeric ranges (e.g. port in 1–65535, num_workers >
0, max_message_bytes >= 4096, chunk_timeout_seconds >= 0) and on invalid values
also print_usage and return 1; keep using the same parsing block and
print_usage/return flow so behavior is consistent.
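A minimal sketch of the fail-fast pattern for a subset of the flags (config fields mirror the review comment; the parsing helper itself is illustrative):

```cpp
#include <cassert>
#include <cstring>
#include <string>

struct server_config {
  int port        = 8765;
  int num_workers = 2;
};

// Returns true on success; on failure the caller would print_usage() and
// return 1. A flag that takes a value checks i+1 < argc instead of silently
// keeping the default, and numeric ranges are validated after parsing.
bool parse_args(int argc, const char* const argv[], server_config& config)
{
  for (int i = 1; i < argc; ++i) {
    if (std::strcmp(argv[i], "--port") == 0) {
      if (i + 1 >= argc) { return false; }  // missing value: fail fast
      config.port = std::stoi(argv[++i]);
    } else if (std::strcmp(argv[i], "--num-workers") == 0) {
      if (i + 1 >= argc) { return false; }
      config.num_workers = std::stoi(argv[++i]);
    } else {
      return false;  // unknown flag
    }
  }
  // Range validation after parsing.
  if (config.port < 1 || config.port > 65535) { return false; }
  if (config.num_workers <= 0) { return false; }
  return true;
}
```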

---

Nitpick comments:
In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`:
- Around line 630-635: The shutdown block that sends SIGTERM and falls back to
kill duplicates the existing helper _stop_grpc_server; replace the inline
sequence (proc.send_signal(signal.SIGTERM)...proc.wait()) with a call to the
helper _stop_grpc_server(proc) (or refactor that helper to accept the proc
object if necessary), ensuring the helper implements the same timeout and
fallback kill behavior and reuses existing logging/error handling so the test
uses the central shutdown logic.
- Around line 88-100: The helper _run inside _generate_test_certs currently
swallows stdout/stderr which makes failures hard to diagnose; modify _run to
capture subprocess.run's CompletedProcess, and on non-zero returncode log the
command, returncode, stdout and stderr (use the module logger via
logging.getLogger or the existing test logger) before returning False, so
callers can see detailed failure information while keeping the same boolean
success/failure return behavior.
- Around line 307-325: Add numerical correctness checks to
_impl_mip_solve_cpu_only: compute the objective from
dm.get_objective_coefficients() dotted with the returned
solution.get_primal_solution() and assert it matches
solution.get_objective_value() within a small tolerance; additionally validate
feasibility by evaluating the constraint matrix and RHS (use
dm.get_constraint_matrix(), dm.get_rhs(), dm.get_sense() or equivalent
accessors) against vals to ensure all constraints satisfy their senses (<=, =,
>=) within tolerances. Use n_vars, dm, solution, vals and explicit tolerances in
the assertions so the test verifies both objective accuracy and constraint
feasibility rather than just solution size.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2293365d-c3e4-47b6-b111-6b48e153b0c0

📥 Commits

Reviewing files that changed from the base of the PR and between 026627a and 6e01d60.

📒 Files selected for processing (11)
  • GRPC_QUICK_START.md
  • ci/utils/install_protobuf_grpc.sh
  • cpp/src/grpc/client/grpc_client.cpp
  • cpp/src/grpc/grpc_problem_mapper.cpp
  • cpp/src/grpc/server/grpc_job_management.cpp
  • cpp/src/grpc/server/grpc_pipe_serialization.hpp
  • cpp/src/grpc/server/grpc_server_main.cpp
  • cpp/src/grpc/server/grpc_server_types.hpp
  • cpp/src/grpc/server/grpc_worker_infra.cpp
  • cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
  • python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • ci/utils/install_protobuf_grpc.sh
  • cpp/src/grpc/server/grpc_worker_infra.cpp

Labels

feature request New feature or request non-breaking Introduces a non-breaking change
