Conversation
📝 Walkthrough

Adds a complete gRPC remote execution subsystem: protobuf/gRPC schemas and codegen, a gRPC client and server (multi-process workers, shared memory and pipe IPC, chunked upload/download), bidirectional mappers, large test suites (unit + integration with TLS/mTLS), build/CI/conda packaging changes, and removal of previous remote stub implementations.
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Actionable comments posted: 13
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py (1)
152-163: ⚠️ Potential issue | 🟠 Major

Strengthen this warmstart test to assert numerical correctness, not just non-None outputs.

After adding `CUOPT_PRESOLVE` at Line 152, the test still only checks that a primal solution exists (Line 162). Please also validate warmstart numerical correctness (e.g., objective consistency/improvement within tolerance).

Proposed test-strengthening diff
```diff
 sol1 = linear_programming.Solve(dm, settings)
 ws = sol1.get_pdlp_warm_start_data()
+obj1 = sol1.get_primal_objective()
 if ws is not None:
     settings.set_pdlp_warm_start_data(ws)
     settings.set_parameter(CUOPT_ITERATION_LIMIT, 200)
     sol2 = linear_programming.Solve(dm, settings)
     assert sol2.get_primal_solution() is not None
+    obj2 = sol2.get_primal_objective()
+    assert obj2 is not None
+    assert abs(obj2 - obj1) / max(1.0, abs(obj1)) < 1e-6
```

As per coding guidelines, `**/*test*.{cpp,cu,py}` must "Write tests validating numerical correctness of optimization results (not just 'runs without error')."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py` around lines 152 - 163, The test currently only asserts sol2.get_primal_solution() is not None; strengthen it by checking numerical correctness: retrieve both primal solutions (via sol1.get_primal_solution() and sol2.get_primal_solution()), compute their objective values (either via solX.get_primal_objective() if available or by forming the dot product of the primal vector with the problem cost from dm), then assert the objective values are consistent within a small tolerance (e.g., abs(obj1 - obj2) <= 1e-6) or that sol2's objective is no worse (<=) than sol1's given the increased CUOPT_ITERATION_LIMIT; keep tolerance and exact assertion choice explicit in the test.
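The relative-tolerance check used in the proposed diff generalizes beyond this one test. A minimal C++ sketch of the same normalization (the helper names here are hypothetical illustrations, not part of the cuOpt API):

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helpers mirroring the suggested check: the objective gap is
// normalized by max(1, |baseline|) so the comparison stays meaningful for
// both near-zero and large objective values.
inline double relative_gap(double baseline, double candidate)
{
  return std::abs(candidate - baseline) / std::max(1.0, std::abs(baseline));
}

inline bool objectives_consistent(double baseline, double candidate, double tol = 1e-6)
{
  return relative_gap(baseline, candidate) < tol;
}
```

The `max(1.0, ...)` denominator avoids dividing by a near-zero baseline objective, which is why the proposed diff uses it instead of a plain relative error.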
🟠 Major comments (18)
cpp/src/grpc/server/grpc_worker.cpp-56-74 (1)
56-74: ⚠️ Potential issue | 🟠 Major

Potential race condition in result queue access.

`store_simple_result` iterates through `result_queue` looking for an empty slot without synchronization. If multiple workers call this concurrently, they could both find the same slot with `!result_queue[i].ready` and overwrite each other's results.

Consider using an atomic compare-exchange on `ready`, similar to the job claiming pattern (lines 86-87), or protect access with a mutex.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_worker.cpp` around lines 56 - 74, store_simple_result currently scans result_queue and sets a slot when result_queue[i].ready is false without synchronization, causing a race where multiple callers can claim the same slot; update this function to claim a slot atomically (e.g., change result_queue[i].ready to an std::atomic<bool> and perform an atomic compare-exchange to flip false->true before writing the slot) or surround the scan-and-write with a mutex; ensure you only write the other fields (job_id, status, data_size, worker_index, retrieved) after successfully claiming the slot to avoid concurrent overwrites.

cpp/src/grpc/server/grpc_server_types.hpp-55-64 (1)
55-64: 🛠️ Refactor suggestion | 🟠 Major

Avoid `using` directives in headers.

The `using` declarations (lines 55-61) and the `using namespace` directive (line 63) pollute the global namespace for any file that includes this header. This can cause unexpected name collisions.

Suggested fix: use qualified names instead

Remove these lines and use fully qualified names (e.g., `grpc::Server`, `grpc::Status`) in the implementation files that include this header. Alternatively, move these declarations into the `#ifdef CUOPT_ENABLE_GRPC` guard's associated `.cpp` files only.

```diff
-using grpc::Server;
-using grpc::ServerBuilder;
-using grpc::ServerContext;
-using grpc::ServerReaderWriter;
-using grpc::ServerWriter;
-using grpc::Status;
-using grpc::StatusCode;
-
-using namespace cuopt::linear_programming;
-// Note: NOT using "using namespace cuopt::remote" to avoid JobStatus enum conflict
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_types.hpp` around lines 55 - 64, Remove the header-level using declarations for grpc types (grpc::Server, grpc::ServerBuilder, grpc::ServerContext, grpc::ServerReaderWriter, grpc::ServerWriter, grpc::Status, grpc::StatusCode) and the using namespace cuopt::linear_programming; to avoid leaking names (and the JobStatus enum conflict). Replace usages in this header with fully-qualified names (e.g., grpc::Server, cuopt::linear_programming::Whatever) or move the using declarations into the corresponding .cpp files (or inside the CUOPT_ENABLE_GRPC guard in .cpp) so only implementation files import those namespaces; ensure any references in grpc_server_types.hpp are updated to the qualified symbols.

cpp/src/grpc/server/grpc_worker_infra.cpp-194-205 (1)
194-205: ⚠️ Potential issue | 🟠 Major

Handle a full `result_queue` when marking worker failures.

If all result slots are occupied, this path drops the terminal failure record. A waiting client can then stall even though the job already failed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_worker_infra.cpp` around lines 194 - 205, The current loop that writes a terminal failure into result_queue stops if every slot is already marked ready, which can drop the failure; modify the failure handling logic so that if no slot with !ready is found it will (1) search for a slot with ready && retrieved (i.e., already consumed) and reuse it, using the same writes (copy_cstr to job_id/error_message, set status, data_size, worker_index, retrieved=false, ready=true), and (2) if none exist, as a last resort overwrite the oldest/first slot (e.g., index 0) to guarantee the client gets a terminal response; update the block around result_queue/MAX_RESULTS and reuse the existing copy_cstr calls so the behavior is deterministic.

cpp/src/grpc/server/grpc_server_main.cpp-50-60 (1)
50-60: ⚠️ Potential issue | 🟠 Major

Harden numeric CLI parsing and range validation.

Lines 53, 55, 57, and 59 use `stoi`/`stoll` without exception handling or range checks. Invalid input or negative values (e.g., workers/port) can crash startup or create an invalid runtime config.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 50 - 60, The CLI numeric parsing loop that assigns config.port, config.num_workers, config.max_message_mb and config.max_message_b uses std::stoi/std::stoll without error handling or range checks; wrap each conversion in a try/catch for std::invalid_argument and std::out_of_range, validate values (e.g., port > 0 && port <= 65535, num_workers > 0, max_message_mb >= 0, max_message_b >= 4096) and on invalid input log an error via the existing logger (or std::cerr) and return/exit with non‑zero status or fall back to safe defaults; update the parsing branches that set config.port, config.num_workers, config.max_message_mb and the --max-message-bytes branch that sets config.max_message_b to use these guarded conversions and checks.

cpp/src/grpc/server/grpc_worker_infra.cpp-120-124 (1)
120-124: ⚠️ Potential issue | 🟠 Major

Close child-end FDs on fork failure to prevent descriptor leaks.

On fork failure, only server-side pipe ends are closed. Parent-owned child ends (`worker_read_fd`, `worker_write_fd`, `worker_incumbent_write_fd`) remain open.

🔧 Minimal fix

```diff
 if (pid < 0) {
   std::cerr << "[Server] Failed to fork "
             << (is_replacement ? "replacement worker " : "worker ") << worker_id << "\n";
   close_worker_pipes_server(worker_id);
+  close_worker_pipes_child_ends(worker_id);
   return -1;
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_worker_infra.cpp` around lines 120 - 124, On fork failure in grpc_worker_infra.cpp (the pid < 0 branch), close the parent-owned child-end file descriptors to avoid leaks: ensure worker_read_fd, worker_write_fd, and worker_incumbent_write_fd are closed (if valid) in addition to calling close_worker_pipes_server(worker_id); update the pid < 0 error path in the code that creates workers so it explicitly closes those descriptors before returning -1.

cpp/src/grpc/client/solve_remote.cu-55-57 (1)
55-57: ⚠️ Potential issue | 🟠 Major

Validate env overrides before applying chunk/message sizes.

`CUOPT_CHUNK_SIZE` and `CUOPT_MAX_MESSAGE_BYTES` are accepted as-is; zero/negative/too-small values can break chunk sizing and message-limit behavior.

✅ Minimal validation example

```diff
 static void apply_env_overrides(grpc_client_config_t& config)
 {
-  config.chunk_size_bytes  = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
-  config.max_message_bytes = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
+  auto chunk_size = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
+  auto max_bytes  = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
+
+  config.chunk_size_bytes = std::max<int64_t>(4096, chunk_size);
+  config.max_message_bytes =
+    (max_bytes <= 0) ? config.max_message_bytes : std::max<int64_t>(4096, max_bytes);
   config.enable_debug_log = (parse_env_int64("CUOPT_GRPC_DEBUG", 0) != 0);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/solve_remote.cu` around lines 55 - 57, The code currently applies CUOPT_CHUNK_SIZE and CUOPT_MAX_MESSAGE_BYTES directly which can allow zero/negative/too-small values to break chunking; after calling parse_env_int64 for these into config.chunk_size_bytes and config.max_message_bytes, validate each parsed value against sensible thresholds (e.g., >0 and >= a defined MIN_CHUNK_SIZE / MIN_MESSAGE_BYTES constant) and only assign the override if it passes validation; if invalid, keep the existing default and emit a warning/log mentioning the env var name and rejected value (use the same symbols config.chunk_size_bytes, config.max_message_bytes, parse_env_int64 and add MIN_CHUNK_SIZE / MIN_MESSAGE_BYTES constants or inline thresholds).

cpp/src/grpc/client/solve_remote.cu-153-166 (1)
153-166: ⚠️ Potential issue | 🟠 Major

Don't forward the incumbent objective as the callback bound.

Line 164 sets `bound_copy` equal to the incumbent objective. That can make gap-sensitive callback logic interpret the incumbent as already tightly bounded, which is semantically wrong when a true best bound is unavailable.

Based on learnings: objective direction is encoded by objective-sense metadata, and callback bounds should use direction-appropriate "no bound" values when the true bound is not available.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/solve_remote.cu` around lines 153 - 166, The incumbent callback is incorrectly using the incumbent objective as the callback bound (bound_copy), which can mislead gap-sensitive logic; instead, set bound_copy to a direction-appropriate "no bound" sentinel (e.g., +infinity for minimization, -infinity for maximization) rather than objective. In the config.incumbent_callback lambda (referencing mip_callbacks and get_solution_callback_t::get_solution), query the objective sense from the callback or problem metadata (e.g., a get_objective_sense() or equivalent) and assign bound_copy = std::numeric_limits<double>::infinity() for minimization or -std::numeric_limits<double>::infinity() for maximization before calling get_solution; do not forward the incumbent objective as the bound.

cpp/src/grpc/server/grpc_server_threads.cpp-253-263 (1)
253-263: ⚠️ Potential issue | 🟠 Major

Synchronize `worker_pipes` access across polling and worker-restart paths.

This thread reads `worker_pipes` (lines 254-255) while the monitor thread replaces dead workers via `spawn_single_worker()` → `close_worker_pipes_server()`, which modifies the same entries without a shared lock. This is a data race: the vector can reallocate during concurrent modifications, and file descriptors can be closed while the incumbent thread polls them, causing undefined behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_threads.cpp` around lines 253 - 263, The polling code reads shared worker_pipes without synchronization, causing a data race with spawn_single_worker() / close_worker_pipes_server() which mutate that vector; fix by introducing and using a shared mutex (e.g., worker_pipes_mutex) that both the polling thread and the worker-restart code lock when accessing/modifying worker_pipes, and change the polling path (the loop that builds pfds) to take the mutex, copy the needed file descriptors into a local vector<int> (or local pfds) while holding the lock, then release the lock before calling poll() so the poll is not done while holding the mutex; also ensure close_worker_pipes_server() and spawn_single_worker() acquire the same mutex when closing/removing entries to avoid closing fds being polled.

cpp/src/grpc/grpc_settings_mapper.cu-144-146 (1)
144-146: ⚠️ Potential issue | 🟠 Major

Guard the `iteration_limit` narrowing conversion.

Line 145 casts the protobuf `int64` directly into `i_t`. Oversized values can overflow/truncate and corrupt solver limits.

Proposed fix

```diff
-  if (pb_settings.iteration_limit() >= 0) {
-    settings.iteration_limit = static_cast<i_t>(pb_settings.iteration_limit());
-  }
+  if (pb_settings.iteration_limit() >= 0) {
+    const auto iter_limit = pb_settings.iteration_limit();
+    if (iter_limit > static_cast<int64_t>(std::numeric_limits<i_t>::max())) {
+      settings.iteration_limit = std::numeric_limits<i_t>::max();
+    } else {
+      settings.iteration_limit = static_cast<i_t>(iter_limit);
+    }
+  }
```

As per coding guidelines: "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero, and use epsilon comparisons for floating-point equality checks."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/grpc_settings_mapper.cu` around lines 144 - 146, The code currently casts pb_settings.iteration_limit() directly to i_t (setting settings.iteration_limit) which can overflow; update the assignment in the block that checks pb_settings.iteration_limit() to first compare the int64 value against the representable range of i_t (use std::numeric_limits<i_t>::max() and min()/0 as appropriate), and then either clamp to that range or return/log an error before assigning to settings.iteration_limit; reference pb_settings.iteration_limit(), settings.iteration_limit, and the i_t type when making this change so the narrowing conversion is guarded.

cpp/src/grpc/server/grpc_job_management.cpp-109-113 (1)
109-113: ⚠️ Potential issue | 🟠 Major

Cap incumbent payload size before `data.resize(size)`.

Line 111 trusts the wire-provided size and can allocate arbitrarily large memory.

Proposed fix

```diff
 bool recv_incumbent_pipe(int fd, std::vector<uint8_t>& data)
 {
   uint64_t size;
   if (!read_from_pipe(fd, &size, sizeof(size))) return false;
+  constexpr uint64_t kMaxIncumbentBytes = 64ULL * 1024 * 1024;
+  if (size > kMaxIncumbentBytes) {
+    std::cerr << "[Server] Incumbent payload too large: " << size << "\n";
+    return false;
+  }
   data.resize(size);
   if (size > 0 && !read_from_pipe(fd, data.data(), size)) return false;
   return true;
 }
```

As per coding guidelines: "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 109 - 113, The code reads a wire-provided uint64_t size and calls data.resize(size) without validation; before calling data.resize (in the block that uses read_from_pipe, size and data), validate and cap size against a defined maximum payload (e.g. kMaxPayloadSize) and reject sizes that are zero or exceed the cap by returning false; use the same symbols from the snippet (uint64_t size, read_from_pipe(fd, &size,...), data.resize(size)) so the check occurs immediately after reading size and before any allocation or further read_from_pipe calls.

cpp/src/grpc/server/grpc_service_impl.cpp-699-716 (1)
699-716: ⚠️ Potential issue | 🟠 Major

`StreamLogs` can block indefinitely for terminal jobs without a log file.

The wait loop only exits on file existence or NOT_FOUND; a completed/failed/cancelled job with no log file never terminates the RPC.

Proposed fix

```diff
 if (waited_ms >= 2000) {
   std::string msg;
   JobStatus s = check_job_status(job_id, msg);
   if (s == JobStatus::NOT_FOUND) {
     if (config.verbose) {
       std::cout << "[gRPC] StreamLogs job not found: " << job_id << std::endl;
     }
     return Status(grpc::StatusCode::NOT_FOUND, "Job not found: " + job_id);
   }
+  if (s == JobStatus::COMPLETED || s == JobStatus::FAILED || s == JobStatus::CANCELLED) {
+    cuopt::remote::LogMessage done;
+    done.set_line("");
+    done.set_byte_offset(from_byte);
+    done.set_job_complete(true);
+    writer->Write(done);
+    return Status::OK;
+  }
   waited_ms = 0;
 }
```

As per coding guidelines: "Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 699 - 716, The StreamLogs loop can hang for terminal jobs that never produce a log file; modify the loop (in grpc_service_impl::StreamLogs) to call check_job_status(job_id, msg) when the file is still missing and, if the returned JobStatus is a terminal state (e.g., JobStatus::COMPLETED, JobStatus::FAILED, JobStatus::CANCELLED or other terminal enums your code defines), break out and return a gRPC error (for example Status(grpc::StatusCode::NOT_FOUND, "Log file not available for job: " + job_id)) or an appropriate Status; keep the existing verbose logging path (config.verbose) and context->IsCancelled() checks, and retain the waited_ms retry/backoff behavior but ensure terminal-state handling resets/short-circuits the wait to avoid indefinite blocking.

cpp/src/grpc/server/grpc_service_impl.cpp-189-193 (1)
189-193: ⚠️ Potential issue | 🟠 Major

`SendArrayChunk` allows unbounded duplicate/overlapping payload growth.

`meta.received_bytes` and `state.total_bytes` are incremented unconditionally; repeated overlapping chunks can inflate memory usage until OOM.

Proposed guard

```diff
-  meta.received_bytes += static_cast<int64_t>(raw.size());
+  if (meta.received_bytes + static_cast<int64_t>(raw.size()) > array_bytes) {
+    return Status(StatusCode::INVALID_ARGUMENT, "Received bytes exceed declared array size");
+  }
+  meta.received_bytes += static_cast<int64_t>(raw.size());
   state.total_bytes += static_cast<int64_t>(raw.size());
```

As per coding guidelines: "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 189 - 193, SendArrayChunk currently unconditionally increments meta.received_bytes and state.total_bytes and pushes state.chunks, allowing duplicate/overlapping chunks to inflate memory; modify SendArrayChunk to first validate the incoming chunk against already-received ranges (e.g., track per-array received byte ranges in state or use an interval set) and skip or trim duplicate/overlapping payloads before modifying meta.received_bytes, state.total_bytes, state.chunks, and state.total_chunks; also enforce a global per-array max size check and reject or cap chunks that would push state.total_bytes beyond that limit to prevent unbounded growth.

cpp/src/grpc/grpc_settings_mapper.cu-154-154 (1)
154-154: ⚠️ Potential issue | 🟠 Major

Validate the protobuf `presolver` value before casting to `presolver_t`.

Line 154 and Line 221 trust unvalidated wire values; invalid integers become unsupported internal enum states.
As per coding guidelines "Validate algorithm correctness in optimization logic: simplex pivots, branch-and-bound decisions, routing heuristics, and constraint/objective handling must produce correct results."
Also applies to: 221-221
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/grpc_settings_mapper.cu` at line 154, The code casts the protobuf integer pb_settings.presolver() directly into presolver_t (see assignment to settings.presolver and the similar use at the other spot), which can produce invalid enum states for out-of-range wire values; before static_casting to presolver_t, validate the integer is within the defined presolver_t range (e.g., compare against the enum's min/max or use a switch/case for known values), handle unknown values by logging an error/warning and selecting a safe default (or returning an error), and then assign the validated value to settings.presolver to avoid unsupported internal enum states.

cpp/tests/linear_programming/grpc/grpc_integration_test.cpp-673-699 (1)
673-699: ⚠️ Potential issue | 🟠 Major

`SolveInfeasibleLP` can pass without actually validating infeasibility behavior.

The assertion is gated by `if (result.success && result.solution)`, so transport or solve regressions can still produce a passing test.

Proposed fix

```diff
 auto result = client->solve_lp(problem, settings);
-if (result.success && result.solution) {
-  EXPECT_NE(result.solution->get_termination_status(), pdlp_termination_status_t::Optimal);
-}
+ASSERT_TRUE(result.success) << result.error_message;
+ASSERT_NE(result.solution, nullptr);
+EXPECT_NE(result.solution->get_termination_status(), pdlp_termination_status_t::Optimal);
```

As per coding guidelines: "`**/*test*.{cpp,cu,py}`: Write tests validating numerical correctness of optimization results (not just 'runs without error'); test degenerate cases (infeasible, unbounded, empty, singleton problems)."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp` around lines 673 - 699, The test SolveInfeasibleLP currently skips assertions when the solver failed or returned no solution; replace the guarded if-check with hard assertions so the test fails on transport/solve regressions: assert that result.success is true (ASSERT_TRUE(result.success)), assert that result.solution is non-null (ASSERT_NE(result.solution, nullptr) or ASSERT_TRUE(result.solution)), and then assert the termination status equals pdlp_termination_status_t::Infeasible (ASSERT_EQ(result.solution->get_termination_status(), pdlp_termination_status_t::Infeasible)) so the test verifies infeasibility behavior unconditionally.

cpp/src/grpc/grpc_problem_mapper.cu-175-182 (1)
175-182: ⚠️ Potential issue | 🟠 Major

Validate lower/upper bound array sizes before applying both.

Line 175 only checks `constraint_lower_bounds_size() > 0`; a malformed payload with a mismatched upper-bound size still gets applied.

Proposed guard

```diff
-  if (pb_problem.constraint_lower_bounds_size() > 0) {
+  if (pb_problem.constraint_lower_bounds_size() > 0 ||
+      pb_problem.constraint_upper_bounds_size() > 0) {
+    if (pb_problem.constraint_lower_bounds_size() != pb_problem.constraint_upper_bounds_size()) {
+      return;  // or propagate an explicit mapping error
+    }
     std::vector<f_t> con_lb(pb_problem.constraint_lower_bounds().begin(),
                             pb_problem.constraint_lower_bounds().end());
```

As per coding guidelines: "Ensure variables and constraints are accessed from the correct problem context (original vs presolve vs folded vs postsolve); verify index mapping consistency across problem transformations."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/grpc_problem_mapper.cu` around lines 175 - 182, The code applies constraint lower and upper bounds when only constraint_lower_bounds_size() > 0 is checked, which allows mismatched arrays; update the guard to validate both pb_problem.constraint_lower_bounds_size() and pb_problem.constraint_upper_bounds_size() are > 0 and equal before calling cpu_problem.set_constraint_lower_bounds(...) and cpu_problem.set_constraint_upper_bounds(...); if sizes differ or one is zero, skip applying them (or log/return an error) to avoid out-of-bounds/misaligned mappings and ensure index consistency across transformations.

cpp/src/grpc/grpc_problem_mapper.cu-454-461 (1)
454-461: ⚠️ Potential issue | 🟠 Major

Require complete CSR/Q triplets before setting matrix data.

Line 454 and Line 526 gate only on values-array presence; applying matrices without matching indices/offsets can create malformed models.

Proposed guard pattern

```diff
-  if (!a_values.empty()) {
+  if (!a_values.empty() || !a_indices.empty() || !a_offsets.empty()) {
+    if (a_values.empty() || a_indices.empty() || a_offsets.empty()) { return; }
     cpu_problem.set_csr_constraint_matrix(...);
   }

-  if (!q_values.empty()) {
+  if (!q_values.empty() || !q_indices.empty() || !q_offsets.empty()) {
+    if (q_values.empty() || q_indices.empty() || q_offsets.empty()) { return; }
     cpu_problem.set_quadratic_objective_matrix(...);
   }
```

As per coding guidelines: "Validate algorithm correctness in optimization logic: simplex pivots, branch-and-bound decisions, routing heuristics, and constraint/objective handling must produce correct results."
Also applies to: 526-533
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/grpc_problem_mapper.cu` around lines 454 - 461, The current guard only checks a_values before calling cpu_problem.set_csr_constraint_matrix, which can yield malformed CSR data if a_indices or a_offsets are missing or their sizes don't match; update the condition to require a_values, a_indices, and a_offsets to be non-empty and verify their sizes are consistent (e.g., a_indices.size() matches a_values.size() and a_offsets.size() matches expected row count + 1) before calling cpu_problem.set_csr_constraint_matrix, and apply the same strengthened guard/validation to the other matrix-setting block referenced around the second call (the similar call later in the file).

cpp/src/grpc/server/grpc_service_impl.cpp-823-824 (1)
823-824: ⚠️ Potential issue | 🟠 Major

Fix incumbent pagination cursor calculation.

`next_index` is always set to `available`, which skips data when `max_count` is smaller than the number of remaining incumbents.

Proposed fix

```diff
-  response->set_next_index(available);
+  response->set_next_index(from_index + count);
```

As per coding guidelines: "Validate algorithm correctness in optimization logic: simplex pivots, branch-and-bound decisions, routing heuristics, and constraint/objective handling must produce correct results."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 823 - 824, The pagination logic incorrectly sets response->set_next_index(available) which skips items when max_count < remaining; change it to compute the next cursor as the start index plus the number of items actually returned (e.g., start + returned_count) or equivalently start + std::min(max_count, available - start), then call response->set_next_index(...) with that value and set done when that next index >= available; update the code around response->set_next_index(available) and the subsequent bool done = ... expression accordingly.

cpp/src/grpc/client/grpc_client.cu-525-531 (1)
525-531: ⚠️ Potential issue | 🟠 Major

Use `compute_chunk_size()` to clamp chunk payload to negotiated message-size limits.

`chunk_data_budget` is derived only from `config_.chunk_size_bytes` without enforcing server/client limits. If the user config exceeds server capabilities, chunk RPC calls will fail with message-size errors. Apply the existing `compute_chunk_size()` helper at both occurrences (lines 525-531 and 694-698):

🔧 Proposed fix

```diff
-  int64_t chunk_data_budget = config_.chunk_size_bytes;
-  if (chunk_data_budget <= 0) { chunk_data_budget = 1LL * 1024 * 1024; }
+  int64_t chunk_data_budget = compute_chunk_size(
+    server_max_message_bytes_, config_.max_message_bytes, config_.chunk_size_bytes);
   const int64_t proto_overhead = 64;
   if (chunk_data_budget > proto_overhead) { chunk_data_budget -= proto_overhead; }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/grpc_client.cu` around lines 525 - 531, The code sets chunk_data_budget directly from config_.chunk_size_bytes then subtracts proto_overhead before calling build_array_chunk_requests, which can exceed negotiated gRPC limits; replace the direct assignment with a call to compute_chunk_size(config_.chunk_size_bytes) (or the appropriate compute_chunk_size overload) to clamp to negotiated message-size limits, then subtract proto_overhead and pass that clamped chunk_data_budget into build_array_chunk_requests; apply the exact same change to the other place in this file where chunk_data_budget is computed later (the second occurrence) so both chunk payload computations use compute_chunk_size().
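For reference, the clamping behavior this comment asks for can be sketched as follows. The actual `compute_chunk_size()` implementation is not shown in this review, so the helper name, parameters, and limit constants below are assumptions:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical sketch: the effective chunk budget is the smallest positive
// limit among the server-advertised maximum, the client-configured maximum,
// and the configured chunk size, with a floor so chunking never degenerates,
// minus a fixed allowance for protobuf framing overhead.
inline int64_t clamped_chunk_budget(int64_t server_max_bytes,
                                    int64_t client_max_bytes,
                                    int64_t configured_chunk_bytes)
{
  constexpr int64_t kDefault       = 1LL * 1024 * 1024;  // 1 MiB fallback
  constexpr int64_t kFloor         = 4096;               // keep chunks usable
  constexpr int64_t kProtoOverhead = 64;                 // per-message framing

  int64_t budget = configured_chunk_bytes > 0 ? configured_chunk_bytes : kDefault;
  if (server_max_bytes > 0) budget = std::min(budget, server_max_bytes);
  if (client_max_bytes > 0) budget = std::min(budget, client_max_bytes);
  budget = std::max(budget, kFloor);
  return budget - kProtoOverhead;
}
```

With this shape, a 16 MiB user-configured chunk against a 4 MiB server limit clamps to 4 MiB minus overhead instead of triggering message-size failures on the chunk RPCs.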
🟡 Minor comments (10)
python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py-152-152 (1)
152-152: ⚠️ Potential issue | 🟡 Minor

Clarify the presolve mode and add numerical validation to the warmstart test.

At line 152, the literal `0` for `CUOPT_PRESOLVE` obscures intent: there is no `PresolverMode` enum exported to Python, so add an inline comment explaining that `0` disables presolve (required for warmstart). More importantly, the test at line 162 only checks that the solution is non-None; it should validate numerical correctness of the warmstart result by comparing it against a baseline solve or checking solution bounds.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py` at line 152, Replace the magic literal by documenting intent: add an inline comment next to settings.set_parameter(CUOPT_PRESOLVE, 0) stating that 0 disables presolve (required for warmstart) since no PresolverMode enum is exposed to Python. Then extend the warmstart test (the test that currently only asserts solution is not None) to perform numerical validation: run a baseline solve without warmstart (or use known optimal objective/variable bounds) and assert the warmstarted solution's objective and key variable values are within a tight tolerance (e.g., via pytest.approx) of the baseline; reference the same settings and solution objects used in the test to obtain values for comparison.

cpp/src/grpc/server/grpc_incumbent_proto.hpp-30-32 (1)
30-32: ⚠️ Potential issue | 🟡 Minor

Unchecked serialization return value.

`SerializeToArray` can fail and returns a boolean, but the return value is ignored. This could lead to silently returning an invalid/incomplete buffer.

🛡️ Proposed fix

```diff
 std::vector<uint8_t> buffer(msg.ByteSizeLong());
-msg.SerializeToArray(buffer.data(), static_cast<int>(buffer.size()));
+if (!msg.SerializeToArray(buffer.data(), static_cast<int>(buffer.size()))) {
+  return {};  // Return empty vector on failure
+}
 return buffer;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_incumbent_proto.hpp` around lines 30 - 32, The code ignores the boolean result of msg.SerializeToArray which can fail; update the serialization block to check the return value of msg.SerializeToArray(buffer.data(), static_cast<int>(buffer.size())) and handle failure (e.g., throw a std::runtime_error or return an empty/optional buffer) rather than returning a potentially incomplete vector; keep the existing use of msg.ByteSizeLong() and the buffer variable but ensure you validate SerializeToArray's result and produce a clear error/alternative return before returning buffer.

cpp/src/grpc/server/grpc_worker.cpp-248-251 (1)
248-251: ⚠️ Potential issue | 🟡 Minor
Missing CUDA error checking on `cudaMemcpy`.

Per coding guidelines, CUDA errors should be caught and mapped to meaningful error codes. These `cudaMemcpy` calls could fail (e.g., device out of memory) but errors are not checked.

🛡️ Proposed fix pattern

```diff
- cudaMemcpy(host_solution.data(),
-            device_solution.data(),
-            device_solution.size() * sizeof(double),
-            cudaMemcpyDeviceToHost);
+ cudaError_t err = cudaMemcpy(host_solution.data(),
+                              device_solution.data(),
+                              device_solution.size() * sizeof(double),
+                              cudaMemcpyDeviceToHost);
+ if (err != cudaSuccess) {
+   throw std::runtime_error(std::string("cudaMemcpy failed: ") + cudaGetErrorString(err));
+ }
```

Apply the same pattern to the other `cudaMemcpy` calls on lines 296-307.
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_worker.cpp` around lines 248 - 251, The cudaMemcpy calls (e.g., the call copying device_solution -> host_solution) lack CUDA error checks; wrap each cudaMemcpy (including the other calls around lines 296-307) to capture the returned cudaError_t, check if it is not cudaSuccess, and convert it to a meaningful error/Status (use/create a helper like mapCudaErrorToStatus or mapCudaErrorToGrpcStatus) and return/log that error instead of proceeding; update the cudaMemcpy calls that reference device_solution, host_solution, and any other device/host buffers to use this pattern so failures (OOM, invalid value, etc.) are detected and mapped to the existing error-handling flow.

cpp/src/grpc/client/test_grpc_client.cpp-401-403 (1)
401-403: ⚠️ Potential issue | 🟡 Minor
Unhandled exception from `std::stod`.

If the user provides an invalid `--time-limit` value (e.g., non-numeric), `std::stod` throws `std::invalid_argument` or `std::out_of_range`, resulting in an unhandled exception with a cryptic error message.

🛡️ Proposed fix

```diff
  } else if (arg == "--time-limit" && arg_idx + 1 < argc) {
-   time_limit = std::stod(argv[++arg_idx]);
+   try {
+     time_limit = std::stod(argv[++arg_idx]);
+   } catch (const std::exception& e) {
+     std::cerr << "ERROR: Invalid time limit value\n";
+     return 1;
+   }
    arg_idx++;
```

The same issue exists for `std::stoll` on line 531.
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/test_grpc_client.cpp` around lines 401 - 403, Wrap the std::stod call that parses --time-limit (and the std::stoll call at the other location) in a try/catch that catches std::invalid_argument and std::out_of_range, then log a clear error via the existing logging/exit path and/or set a safe default; specifically, around the parsing of argv in the branch handling "--time-limit" (variable time_limit and std::stod) catch parsing exceptions, call the same error-handling used elsewhere (print usage or error and exit with non-zero) and do the same for the std::stoll parse to avoid uncaught exceptions and cryptic crashes.

cpp/src/grpc/cuopt_remote_service.proto-314-318 (1)
314-318: ⚠️ Potential issue | 🟡 Minor
Inconsistent `job_id` type: `bytes` here vs `string` elsewhere.

`IncumbentRequest.job_id` is defined as `bytes` (line 315), but all other request messages (e.g., `StatusRequest`, `GetResultRequest`, `CancelRequest`, `DeleteRequest`) use `string` for `job_id`. This inconsistency could cause confusion and require different handling on client/server.

Suggested fix for consistency

```diff
 message IncumbentRequest {
-  bytes job_id = 1;
+  string job_id = 1;
   int64 from_index = 2;  // Return incumbents starting from this index
   int32 max_count  = 3;  // Optional limit (0 or negative => no limit)
 }
```

Similarly, `Incumbent.job_id` at line 324 should also be `string` for consistency.
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/cuopt_remote_service.proto` around lines 314 - 318, IncumbentRequest.job_id (and matching Incumbent.job_id) are declared as bytes while all other request messages use string; change the field type from bytes to string in the proto for IncumbentRequest and Incumbent so they match StatusRequest/GetResultRequest/CancelRequest/DeleteRequest, then recompile/regenerate the gRPC/protobuf stubs (look for the IncumbentRequest and Incumbent message definitions) to ensure client/server code uses the consistent string type.

cpp/src/grpc/server/grpc_server_types.hpp-263-265 (1)
263-265: ⚠️ Potential issue | 🟡 Minor
Hardcoded shared memory names may cause conflicts with concurrent server instances.

The shared memory segment names (`/cuopt_job_queue`, `/cuopt_result_queue`, `/cuopt_control`) are hardcoded constants. If multiple server instances run on the same host, they will conflict.

Consider incorporating the port number or a unique identifier into the segment names.

Suggested approach

```cpp
// Instead of inline constants, generate names dynamically:
inline std::string get_shm_job_queue_name(int port)
{
  return "/cuopt_job_queue_" + std::to_string(port);
}
```
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_types.hpp` around lines 263 - 265, The three hardcoded shared-memory name constants (SHM_JOB_QUEUE, SHM_RESULT_QUEUE, SHM_CONTROL) will collide across concurrent server instances; change them from inline const char* values to functions that generate unique names at runtime (e.g., append the server port, PID, or UUID) and update call sites to use the generator functions (e.g., get_shm_job_queue_name(port), get_shm_result_queue_name(port), get_shm_control_name(port)) so each server instance uses distinct segment names.

GRPC_ARCHITECTURE.md-298-304 (1)
298-304: ⚠️ Potential issue | 🟡 Minor
Document required remote endpoint env vars in the Environment Variables section.

This table currently lists only transfer tuning variables. Please also include `CUOPT_REMOTE_HOST` and `CUOPT_REMOTE_PORT`, since remote execution depends on them.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@GRPC_ARCHITECTURE.md` around lines 298 - 304, The Environment Variables table is missing the remote endpoint variables: add two rows for CUOPT_REMOTE_HOST and CUOPT_REMOTE_PORT in the "Environment Variables" section, specifying sensible defaults (e.g., blank or "localhost" for CUOPT_REMOTE_HOST and a default port like 50051 for CUOPT_REMOTE_PORT) and clear descriptions such as "Remote execution host" and "Remote execution port" so remote execution is documented alongside CUOPT_CHUNK_SIZE and CUOPT_MAX_MESSAGE_BYTES.

GRPC_ARCHITECTURE.md-16-39 (1)
16-39: ⚠️ Potential issue | 🟡 Minor
Specify languages on fenced code blocks to satisfy markdown lint.

These blocks are missing a language identifier (MD040), which is currently failing lint hygiene.

📝 Example fix pattern

````diff
-```
+```text
 cpp/src/grpc/
 ...
-```
+```
````

Also applies to: 105-120, 136-151, 359-371
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@GRPC_ARCHITECTURE.md` around lines 16 - 39, The markdown fenced code blocks in GRPC_ARCHITECTURE.md are missing language identifiers (MD040); update each triple-backtick block (including the shown cpp/src/grpc/ tree block and the other blocks at ranges 105-120, 136-151, 359-371) to include a language token such as "text" (i.e., change ``` to ```text) so linting passes; ensure all similar block openings (```...) in the file are consistently annotated.

cpp/src/grpc/grpc_solution_mapper.cu-559-560 (1)
559-560: ⚠️ Potential issue | 🟡 Minor
Warm-start presence check is too narrow.

Warm-start reconstruction only triggers when `RESULT_WS_CURRENT_PRIMAL` is non-empty; other warm-start fields can be present and still be dropped.

Suggested adjustment

```diff
- auto ws_primal = bytes_to_typed<f_t>(arrays, cuopt::remote::RESULT_WS_CURRENT_PRIMAL);
- if (!ws_primal.empty()) {
+ auto ws_primal = bytes_to_typed<f_t>(arrays, cuopt::remote::RESULT_WS_CURRENT_PRIMAL);
+ auto ws_dual   = bytes_to_typed<f_t>(arrays, cuopt::remote::RESULT_WS_CURRENT_DUAL);
+ auto ws_init_p = bytes_to_typed<f_t>(arrays, cuopt::remote::RESULT_WS_INITIAL_PRIMAL_AVG);
+ auto ws_init_d = bytes_to_typed<f_t>(arrays, cuopt::remote::RESULT_WS_INITIAL_DUAL_AVG);
+ if (!ws_primal.empty() || !ws_dual.empty() || !ws_init_p.empty() || !ws_init_d.empty()) {
```
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/grpc_solution_mapper.cu` around lines 559 - 560, The code only triggers warm-start reconstruction when ws_primal (from bytes_to_typed<f_t>(arrays, cuopt::remote::RESULT_WS_CURRENT_PRIMAL)) is non-empty, which misses cases where other warm-start fields exist; change the presence check to detect any warm-start data key in arrays (e.g., cuopt::remote::RESULT_WS_CURRENT_PRIMAL, RESULT_WS_CURRENT_DUAL, RESULT_WS_PREV_PRIMAL, RESULT_WS_PREV_DUAL or other RESULT_WS_* entries) before skipping reconstruction. Implement a small helper or boolean that queries arrays for any of those RESULT_WS_* keys (or calls bytes_to_typed for each and checks non-empty) and use that combined condition instead of only ws_primal.empty() so reconstruction runs when any warm-start field is present.

cpp/tests/linear_programming/grpc/grpc_integration_test.cpp-53-57 (1)
53-57: ⚠️ Potential issue | 🟡 Minor
Add `<fcntl.h>` include for `open()` and flag constants.

Line 120 uses `open()` with `O_WRONLY|O_CREAT|O_TRUNC`, but `<fcntl.h>` is not explicitly included. While some systems may include it transitively through `unistd.h`, this is not guaranteed by the POSIX standard and creates a portability risk.

File handle cleanup is correct; `fd` is properly closed at line 125.

Proposed fix

```diff
 #include <signal.h>
+#include <fcntl.h>
 #include <sys/types.h>
 #include <sys/wait.h>
 #include <unistd.h>
```
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp` around lines 53 - 57, The test uses open() with flags O_WRONLY|O_CREAT|O_TRUNC (call site: open(..., O_WRONLY|O_CREAT|O_TRUNC)) but does not include <fcntl.h>, which can break portability; add `#include` <fcntl.h> to the top includes so the O_* flag constants and open declaration are defined and the code compiles reliably across platforms.
🧹 Nitpick comments (10)
cpp/src/grpc/server/grpc_worker.cpp (1)
149-161: Duplicate error handling blocks.

Lines 149-151 and 153-161 both handle `!read_success`. The first block logs to stderr, while the second handles the failure logic. These could be consolidated.

♻️ Proposed consolidation

```diff
- if (!read_success) {
-   std::cerr << "[Worker " << worker_id << "] Failed to read job data from pipe\n";
- }
  if (!read_success) {
+   std::cerr << "[Worker " << worker_id << "] Failed to read job data from pipe\n";
    store_simple_result(job_id, worker_id, 1, "Failed to read job data");
    job.worker_pid = 0;
    // ... rest of cleanup
```
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_worker.cpp` around lines 149 - 161, The duplicate handling for !read_success should be consolidated into a single block: remove the first standalone stderr log and merge its message into the existing failure branch so that when read_success is false you both log the error (including worker_id) and execute the failure logic (call store_simple_result(job_id, worker_id, 1, "..."), reset job fields job.worker_pid, job.worker_index, job.data_sent, job.ready, job.claimed, and continue). Update the block around the read_success check in grpc_worker.cpp to perform logging and the cleanup exactly once to avoid duplicate branches.

cpp/src/grpc/server/grpc_incumbent_proto.hpp (1)
44-49: Consider clearing the output vector before populating.

The `assignment` vector is passed by reference but not cleared before `push_back` calls. If a caller reuses the same vector across multiple calls, stale data will accumulate.

♻️ Proposed fix

```diff
  job_id    = incumbent_msg.job_id();
  objective = incumbent_msg.objective();
+ assignment.clear();
  assignment.reserve(incumbent_msg.assignment_size());
  for (int i = 0; i < incumbent_msg.assignment_size(); ++i) {
    assignment.push_back(incumbent_msg.assignment(i));
  }
```
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_incumbent_proto.hpp` around lines 44 - 49, The assignment vector is populated via push_back without being cleared, so reuse will accumulate stale entries; before calling assignment.reserve(...) and the for-loop that pushes from incumbent_msg.assignment(i), call assignment.clear() (or use assignment.assign(...) from the repeated field) to ensure the vector is emptied first—update the code around the symbols assignment and incumbent_msg in grpc_incumbent_proto.hpp to clear assignment before reserving/pushing.

cpp/src/grpc/server/grpc_pipe_serialization.hpp (1)
93-100: Verify semantic difference between result and request deserialization.

For result blobs (line 94), `total_elements` is used directly as byte count for resize. For request blobs (lines 151-153), `total_elements * elem_size` is used. This asymmetry suggests `total_elements` has different semantics in result vs request contexts; consider adding a clarifying comment.
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 93 - 100, The resize logic for result deserialization uses ac.total_elements() directly as bytes while request deserialization multiplies total_elements by element size, so confirm whether ac.total_elements() represents bytes or element count and make semantics consistent: either change the result-path allocation (the dest.resize call) to use total_bytes = ac.total_elements() * ac.elem_size() (or equivalent accessor) before resizing and memcpy, or add a clear comment by the result-path block documenting that total_elements already includes element size in bytes; update any related uses (element_offset(), chunk_data.size(), memcpy) to match the chosen unit to avoid off-by-factor bugs.

cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp (1)
16-20: Consider reordering includes for consistency.

Standard convention places system/library headers before local headers. The `<memory>` include should come before the local `"grpc_client.hpp"` include.

Suggested reorder

```diff
 #include <cuopt_remote_service.grpc.pb.h>
 #include <grpcpp/grpcpp.h>
-#include "grpc_client.hpp"
 #include <memory>
+
+#include "grpc_client.hpp"
```
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp` around lines 16 - 20, Reorder the includes so system/library headers come before local headers: move the <memory> include above the "grpc_client.hpp" include in grpc_client_test_helper.hpp; ensure the block now lists <cuopt_remote_service.grpc.pb.h>, <grpcpp/grpcpp.h>, <memory> and then "grpc_client.hpp" to follow the standard include ordering convention.

cpp/tests/linear_programming/grpc/CMakeLists.txt (1)
24-33: Minor: Redundant GTest library linkage.

`GTest::gmock_main` already includes `GTest::gmock`, and `GTest::gtest_main` includes `GTest::gtest`. You can simplify to just the `_main` variants.

Simplified linkage

```diff
 target_link_libraries(GRPC_CLIENT_TEST
   PRIVATE cuopt
-          GTest::gmock
           GTest::gmock_main
-          GTest::gtest
           GTest::gtest_main
           gRPC::grpc++
           protobuf::libprotobuf
 )
```
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/linear_programming/grpc/CMakeLists.txt` around lines 24 - 33, Remove redundant GTest link entries in the target_link_libraries call for GRPC_CLIENT_TEST: keep only the main variants (GTest::gmock_main and GTest::gtest_main) and remove GTest::gmock and GTest::gtest to avoid duplicate linkage; update the target_link_libraries block that references GRPC_CLIENT_TEST, cuopt, gRPC::grpc++, and protobuf::libprotobuf accordingly so only the necessary _main symbols remain.

cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp (2)
359-360: Remove unused `server_logs_` member.

The `server_logs_` vector is declared but never populated or used. Server logs are read directly from the file system via `get_server_logs()`. This dead code should be removed to avoid confusion.

Remove unused member

```diff
 private:
  mutable std::mutex mutex_;
  std::vector<LogEntry> client_logs_;
- std::vector<LogEntry> server_logs_;
  std::string server_log_path_;
  std::streampos server_log_start_pos_ = 0;  // Position in server log file when test started
  bool test_start_marked_ = false;
```
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp` around lines 359 - 360, Remove the dead member server_logs_ from the class: delete the declaration of std::vector<LogEntry> server_logs_ (it is never populated or used; server logs are accessed via get_server_logs()). After removing the member, run a quick search for server_logs_ to ensure there are no remaining references and update any tests or constructors that may have initialized it.
123-126: Callback captures `this` pointer; document lifetime requirement.

The lambda returned by `client_callback()` captures `this` by pointer. If the callback is invoked after the `GrpcTestLogCapture` instance is destroyed, this will cause undefined behavior. Consider adding a brief note in the documentation that the `GrpcTestLogCapture` instance must outlive any client using this callback.
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp` around lines 123 - 126, The lambda returned by GrpcTestLogCapture::client_callback() captures this by pointer and can dangle if the GrpcTestLogCapture instance is destroyed, so add a clear lifetime note: update the documentation/comment for the GrpcTestLogCapture class and/or the client_callback() method to state that the GrpcTestLogCapture instance must outlive any client that retains or invokes the returned callback; reference the client_callback() function name in the comment so callers know the requirement and consider mentioning alternative designs (e.g., providing a weak_ptr-based factory) if you want to avoid the strict lifetime requirement.

SERVER_ARCHITECTURE.md (1)
15-46: Add language specifier to fenced code block.

Per markdownlint, fenced code blocks should have a language specified. For ASCII diagrams, use `text` or `plaintext`.

Fix markdown lint warning

````diff
-```
+```text
 ┌────────────────────────────────────────────────────────────────────┐
````
Verify each finding against the current code and only fix it if needed. In `@SERVER_ARCHITECTURE.md` around lines 15 - 46, The fenced ASCII diagram block in SERVER_ARCHITECTURE.md is missing a language specifier; change the opening fence from ``` to ```text (or ```plaintext) so the block becomes ```text and satisfies markdownlint for the diagram that starts with the "┌────────────────────────────────────────────────────────────────────┐" line.

cpp/src/grpc/client/solve_remote.cu (1)
69-74: Mark currently-unused LP flags explicitly (or forward them).

`problem_checking` and `use_pdlp_solver_mode` are currently unused. Marking them explicitly avoids silent interface drift and warning churn.

Small cleanup option

```diff
 std::unique_ptr<lp_solution_interface_t<i_t, f_t>> solve_lp_remote(
   cpu_optimization_problem_t<i_t, f_t> const& cpu_problem,
   pdlp_solver_settings_t<i_t, f_t> const& settings,
-  bool problem_checking,
-  bool use_pdlp_solver_mode)
+  [[maybe_unused]] bool problem_checking,
+  [[maybe_unused]] bool use_pdlp_solver_mode)
```
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/solve_remote.cu` around lines 69 - 74, The function solve_lp_remote currently accepts parameters problem_checking and use_pdlp_solver_mode but never uses them, which can cause warnings and interface drift; update the function (solve_lp_remote) to explicitly mark these parameters as intentionally unused (e.g., cast to void or annotate with [[maybe_unused]]) or forward them to the underlying call that needs them, ensuring the symbols problem_checking and use_pdlp_solver_mode appear in the implementation so the intent is clear and compiler warnings are silenced.

cpp/tests/linear_programming/grpc/grpc_client_test.cpp (1)
1086-1340: Add at least one degenerate solve-case assertion in this suite.

The end-to-end mock solve tests are strong, but adding explicit infeasible/unbounded/empty (or singleton edge) solve assertions would better protect status/result mapping regressions.
As per coding guidelines: "Write tests validating numerical correctness of optimization results (not just 'runs without error'); test degenerate cases (infeasible, unbounded, empty, singleton problems)".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/linear_programming/grpc/grpc_client_test.cpp` around lines 1086 - 1340, Add a new degenerate-case unit test (e.g., TEST_F(GrpcClientTest, SolveLP_Infeasible)) that uses the existing GrpcClientTest fixture and MockCuOptStub to exercise grpc_client_t::solve_lp for an infeasible LP: have SubmitJob(...) return a job_id, have CheckStatus/WaitForCompletion indicate COMPLETED (or skip if use_wait), and have GetResult(...) return a cuopt::remote::LPSolution with termination_status set to PDLP_INFEASIBLE (and resp->set_status(cuopt::remote::SUCCESS) if appropriate); then assert the returned result indicates failure (result.success == false) and that result.error_message contains a clear indicator of infeasibility (e.g., "INFEASIBLE" or "infeasible"); reuse symbols SubmitJob, CheckStatus/WaitForCompletion, GetResult, grpc_client_t::solve_lp, and GrpcClientTest to locate where to add the test.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6984e3cb-4874-432a-8307-c8f3a64fd9ba
📒 Files selected for processing (39)
- GRPC_ARCHITECTURE.md
- SERVER_ARCHITECTURE.md
- build.sh
- cpp/CMakeLists.txt
- cpp/cuopt_grpc_server.cpp
- cpp/src/grpc/client/grpc_client.cu
- cpp/src/grpc/client/grpc_client.hpp
- cpp/src/grpc/client/solve_remote.cu
- cpp/src/grpc/client/test_grpc_client.cpp
- cpp/src/grpc/cuopt_remote.proto
- cpp/src/grpc/cuopt_remote_service.proto
- cpp/src/grpc/grpc_problem_mapper.cu
- cpp/src/grpc/grpc_problem_mapper.hpp
- cpp/src/grpc/grpc_service_mapper.cu
- cpp/src/grpc/grpc_service_mapper.hpp
- cpp/src/grpc/grpc_settings_mapper.cu
- cpp/src/grpc/grpc_settings_mapper.hpp
- cpp/src/grpc/grpc_solution_mapper.cu
- cpp/src/grpc/grpc_solution_mapper.hpp
- cpp/src/grpc/server/grpc_field_element_size.hpp
- cpp/src/grpc/server/grpc_incumbent_proto.hpp
- cpp/src/grpc/server/grpc_job_management.cpp
- cpp/src/grpc/server/grpc_pipe_serialization.hpp
- cpp/src/grpc/server/grpc_server_main.cpp
- cpp/src/grpc/server/grpc_server_threads.cpp
- cpp/src/grpc/server/grpc_server_types.hpp
- cpp/src/grpc/server/grpc_service_impl.cpp
- cpp/src/grpc/server/grpc_worker.cpp
- cpp/src/grpc/server/grpc_worker_infra.cpp
- cpp/src/pdlp/CMakeLists.txt
- cpp/src/pdlp/solve_remote.cu
- cpp/tests/linear_programming/CMakeLists.txt
- cpp/tests/linear_programming/grpc/CMakeLists.txt
- cpp/tests/linear_programming/grpc/grpc_client_test.cpp
- cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp
- cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
- cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
- python/cuopt/cuopt/routing/utils.py
- python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
💤 Files with no reviewable changes (2)
- cpp/src/pdlp/CMakeLists.txt
- cpp/src/pdlp/solve_remote.cu
Actionable comments posted: 14
♻️ Duplicate comments (9)
cpp/src/grpc/server/grpc_server_main.cpp (2)
186-212: ⚠️ Potential issue | 🔴 Critical
Add teardown on TLS config failures after workers/threads are started.
These early returns occur after runtime startup and bypass shutdown/join/cleanup paths.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 186 - 212, The TLS validation branches that currently do "return 1" (around checks using config, ssl_opts and read_file_to_string for cert/key/root and require_client) can occur after workers/threads have been started; replace those early returns with calls to the centralized shutdown/teardown routine (e.g., call your existing server stop/cleanup functions such as StopServer(), shutdownWorkers()/joinThreads(), or a new cleanup_resources() that performs shutdown and joins) and then return the error code; ensure every failure path that currently returns from inside the TLS setup invokes that teardown routine before exiting so workers/threads and other resources are properly cleaned up.
234-251: ⚠️ Potential issue | 🔴 Critical
Guard `BuildAndStart()` failure before waiting on the server pointer.

`server->Wait()` is called unconditionally; if startup fails and returns null, this dereferences a null pointer.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 234 - 251, BuildAndStart() may return nullptr so avoid unguarded dereference of server; after std::unique_ptr<Server> server(builder.BuildAndStart()), check if (server) before calling server->Wait() and starting the shutdown logic that assumes a live server. If BuildAndStart() fails, log an error (or print to cerr) and exit/return non-zero. Also ensure the shutdown_thread's capture/Shutdown call remains safe by only invoking server->Shutdown() when server is non-null; move creation of shutdown_thread and the server->Wait() call inside the if (server) block and handle the failure path explicitly.

cpp/src/grpc/server/grpc_job_management.cpp (1)
147-158: ⚠️ Potential issue | 🔴 Critical
Synchronize all `job_queue` access; current slot claim/update path is racy.

`submit_job_async` still does unsynchronized check-then-write on shared queue state, and `check_job_status`/`cancel_job` concurrently read/write the same fields. This can corrupt queue ownership and status transitions under concurrent requests.

As per coding guidelines "Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state."
Also applies to: 195-196, 267-270
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 147 - 158, The submit_job_async path performs an unsynchronized check-then-write on shared job_queue entries (e.g., checking job_queue[i].ready/claimed then writing fields like job_id, ready, claimed), which races with check_job_status and cancel_job; protect all accesses to job_queue (both scans and per-entry reads/writes) by introducing a synchronization primitive (e.g., a std::mutex or per-entry mutex array) and acquire the lock(s) around the loop that scans MAX_JOBS and before modifying fields (job_queue[i].ready, claimed, cancelled, etc.), and also ensure check_job_status and cancel_job acquire the same lock(s) before reading or updating those fields so ownership and state transitions are atomic and race-free.

cpp/src/grpc/client/grpc_client.cu (4)
258-266: ⚠️ Potential issue | 🔴 Critical
Guard public RPC methods before dereferencing `impl_->stub`.

These methods can be called before `connect()`. Dereferencing `impl_->stub` without a null check can crash.

🔧 Suggested guard pattern

```diff
 job_status_result_t grpc_client_t::check_status(const std::string& job_id)
 {
   job_status_result_t result;
+  if (!impl_ || !impl_->stub) {
+    result.error_message = "Not connected to server";
+    return result;
+  }
   grpc::ClientContext context;
   auto request = build_status_request(job_id);
   cuopt::remote::StatusResponse response;
```
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/grpc_client.cu` around lines 258 - 266, The public RPC method grpc_client_t::check_status dereferences impl_->stub without ensuring impl_ is initialized or stub is non-null (can be called before connect()), which can crash; add a null-check guard at the start of check_status (and other public RPC methods) that verifies impl_ and impl_->stub are valid, and return an appropriate job_status_result_t error/failed state (or set result.error and result.code) when the client is not connected instead of dereferencing; refer to grpc_client_t::connect and impl_->stub when adding the guard so the check mirrors connect()'s initialization contract.
994-995: ⚠️ Potential issue | 🔴 Critical
Validate `poll_interval_ms` before using it as a divisor.

`max_polls` divides by `config_.poll_interval_ms` without checking `<= 0`, which can crash.

🔧 Suggested validation

```diff
+ if (config_.poll_interval_ms <= 0) {
+   stop_log_streaming();
+   result.error_message = "poll_interval_ms must be > 0";
+   return result;
+ }
  int poll_count = 0;
  int max_polls  = (config_.timeout_seconds * 1000) / config_.poll_interval_ms;
```

Also applies to: 1132-1132
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/grpc_client.cu` around lines 994 - 995, The code computes max_polls by dividing by config_.poll_interval_ms without validating it; update the logic that sets max_polls (the line initializing int max_polls) to first validate config_.poll_interval_ms is > 0 (and also guard against extremely large/small values), returning an error or using a safe default/policy (e.g., set poll_interval_ms = 1 or return a failure status) when it's <= 0 to avoid division-by-zero and undefined behavior; apply the same validation where max_polls is computed elsewhere (the other occurrence around the code that uses config_.poll_interval_ms).
703-712: ⚠️ Potential issue | 🔴 Critical
Validate chunk metadata and chunk bounds before allocation/copy.

`elem_size`, `total_elems`, and `elements_in_chunk` are trusted too early. This can lead to division by zero, overflowed allocation sizes, or out-of-bounds `memcpy`.

🔧 Suggested hardening

```diff
 for (const auto& arr_desc : header->arrays()) {
   auto field_id       = arr_desc.field_id();
   int64_t total_elems = arr_desc.total_elements();
   int64_t elem_size   = arr_desc.element_size_bytes();
   if (total_elems <= 0) continue;
+  if (elem_size <= 0) {
+    last_error_ = "Invalid chunk metadata: non-positive element size";
+    return false;
+  }
+  if (total_elems > std::numeric_limits<int64_t>::max() / elem_size) {
+    last_error_ = "Invalid chunk metadata: byte-size overflow";
+    return false;
+  }
+  int64_t total_bytes = total_elems * elem_size;
+  if (total_bytes < 0 ||
+      static_cast<uint64_t>(total_bytes) >
+        static_cast<uint64_t>(std::numeric_limits<size_t>::max())) {
+    last_error_ = "Invalid chunk metadata: byte size exceeds addressable memory";
+    return false;
+  }
   int64_t elems_per_chunk = chunk_data_budget / elem_size;
   if (elems_per_chunk <= 0) elems_per_chunk = 1;
-  std::vector<uint8_t> array_bytes(static_cast<size_t>(total_elems * elem_size));
+  std::vector<uint8_t> array_bytes(static_cast<size_t>(total_bytes));
   for (int64_t elem_offset = 0; elem_offset < total_elems; elem_offset += elems_per_chunk) {
     int64_t elems_wanted = std::min(elems_per_chunk, total_elems - elem_offset);
@@
     int64_t elems_received = chunk_resp.elements_in_chunk();
     const auto& data       = chunk_resp.data();
     if (static_cast<int64_t>(data.size()) != elems_received * elem_size) {
       last_error_ = "GetResultChunk: data size mismatch";
       return false;
     }
+    if (elems_received < 0 || elem_offset + elems_received > total_elems) {
+      last_error_ = "GetResultChunk: elements_in_chunk out of bounds";
+      return false;
+    }
     std::memcpy(array_bytes.data() + elem_offset * elem_size, data.data(), data.size());
```

Also applies to: 731-740
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/grpc_client.cu` around lines 703 - 712, Validate and guard all chunk metadata before allocating or copying: ensure arr_desc.element_size_bytes() (elem_size) > 0 and arr_desc.total_elements() (total_elems) >= 0 before using them; ensure chunk_data_budget is >0 before computing elems_per_chunk and handle division safely; check for size_t/multiplication overflow when computing total_elems * elem_size before constructing array_bytes and use a safe cap; when iterating chunks validate each elements_in_chunk is within [0, total_elems] and compute copy_size = elements_in_chunk * elem_size with overflow checks before any memcpy; add clear error handling/logging and skip/abort the chunk on invalid metadata rather than performing the allocation or memcpy (refer to symbols: arr_desc, field_id, elem_size, total_elems, chunk_data_budget, elems_per_chunk, elements_in_chunk, array_bytes, memcpy).
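The validation sequence the prompt above describes can be factored into two small helpers. This is a std-only sketch: the names `checked_array_bytes` and `chunk_in_bounds` are illustrative, not from the codebase, but the checks mirror the ones the review asks for (positive element size, overflow-safe byte count, chunk range inside the declared total).

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: validate chunk metadata before allocating.
// Returns the total byte size, or -1 if the metadata is invalid
// (non-positive element size, negative count, or int64 overflow).
inline int64_t checked_array_bytes(int64_t total_elems, int64_t elem_size)
{
  if (elem_size <= 0 || total_elems < 0) return -1;
  // Reject products that would overflow int64_t *before* multiplying.
  if (total_elems > std::numeric_limits<int64_t>::max() / elem_size) return -1;
  return total_elems * elem_size;
}

// Hypothetical helper: validate one received chunk against the header.
// A chunk is acceptable only if [elem_offset, elem_offset + elems_received)
// lies inside [0, total_elems).
inline bool chunk_in_bounds(int64_t elem_offset, int64_t elems_received, int64_t total_elems)
{
  return elems_received >= 0 && elem_offset >= 0 &&
         elem_offset <= total_elems - elems_received;
}
```

Calling `checked_array_bytes` once per array descriptor and `chunk_in_bounds` once per received chunk removes every trusted-input arithmetic step from the hot path.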
247-251: ⚠️ Potential issue | 🔴 Critical

`stop_log_streaming()` can hang waiting on a blocked reader thread.

Setting `stop_logs_` does not unblock a thread already stuck in a stream read, so `join()` can block indefinitely.

🔧 Suggested fix
```diff
 void grpc_client_t::start_log_streaming(const std::string& job_id)
 {
   if (!config_.stream_logs || !config_.log_callback) return;
   stop_logs_.store(false);
   log_thread_ = std::make_unique<std::thread>([this, job_id]() {
     grpc::ClientContext context;
+    {
+      std::lock_guard<std::mutex> lk(log_context_mutex_);
+      active_log_context_ = &context;
+    }
     auto request = build_stream_logs_request(job_id, 0);
     auto reader  = impl_->stub->StreamLogs(&context, request);
     cuopt::remote::LogMessage log_msg;
     while (reader->Read(&log_msg)) {
       if (stop_logs_.load()) {
-        context.TryCancel();
         break;
       }
       if (config_.log_callback) { config_.log_callback(log_msg.line()); }
       if (log_msg.job_complete()) { break; }
     }
     reader->Finish();
+    {
+      std::lock_guard<std::mutex> lk(log_context_mutex_);
+      active_log_context_ = nullptr;
+    }
   });
 }

 void grpc_client_t::stop_log_streaming()
 {
   stop_logs_.store(true);
+  {
+    std::lock_guard<std::mutex> lk(log_context_mutex_);
+    if (active_log_context_) { active_log_context_->TryCancel(); }
+  }
   if (log_thread_ && log_thread_->joinable()) { log_thread_->join(); }
   log_thread_.reset();
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/grpc_client.cu` around lines 247 - 251, stop_log_streaming currently only sets stop_logs_ and then joins log_thread_, which can hang if the thread is blocked in a gRPC read; change the shutdown sequence to (1) set stop_logs_, (2) explicitly cancel/close the gRPC stream used by the reader (e.g. call the grpc::ClientContext::TryCancel() or cancel/Finish the ClientReader/ReaderInterface instance used for log streaming), (3) ensure any CompletionQueue or async reader is shutdown/drained so the blocked read returns, and only then join log_thread_ (use a timed join as a fallback and log an error if join times out); update stop_log_streaming to reference stop_logs_, log_thread_ and the stream/reader/client-context member (the object that performs the stream read) when implementing the cancel/cleanup steps.

cpp/src/grpc/grpc_problem_mapper.cu (1)
399-411: ⚠️ Potential issue | 🔴 Critical

Validate byte alignment before `memcpy` in chunked array decoders.

`n` is computed with integer division, but `memcpy` copies the full byte count. Misaligned payload sizes can write past allocated vectors.

🔧 Suggested fix
```diff
 auto get_doubles = [&](int32_t field_id) -> std::vector<f_t> {
   auto it = arrays.find(field_id);
   if (it == arrays.end() || it->second.empty()) return {};
+  if (it->second.size() % sizeof(double) != 0) return {};
   size_t n = it->second.size() / sizeof(double);
@@
 auto get_ints = [&](int32_t field_id) -> std::vector<i_t> {
   auto it = arrays.find(field_id);
   if (it == arrays.end() || it->second.empty()) return {};
+  if (it->second.size() % sizeof(int32_t) != 0) return {};
   size_t n = it->second.size() / sizeof(int32_t);
```

Also applies to: 414-426
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/grpc_problem_mapper.cu` around lines 399 - 411, The lambda get_doubles uses n = it->second.size() / sizeof(double) but then memcpy's the full byte size, which can overflow if the payload isn't aligned; validate that it->second.size() is an exact multiple of sizeof(double) before copying, compute expected_bytes = n * sizeof(double) and only memcpy expected_bytes (or return/throw on misaligned size), and when converting to f_t ensure you only read expected_bytes into the temporary vector; apply the same check/fix to the other chunked-array decoder blocks that perform memcpy (the other double-to-f_t conversion in this file).

cpp/src/grpc/server/grpc_service_impl.cpp (1)
71-73: ⚠️ Potential issue | 🔴 Critical

Publish `ready` before releasing `claimed` to avoid queue slot races.

The current store order can allow another submitter to reserve the same slot before readiness is published. As per coding guidelines: "Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state."

🔧 Suggested ordering fix

```diff
- job_queue[job_idx].claimed.store(false);
- job_queue[job_idx].ready.store(true);
+ job_queue[job_idx].ready.store(true, std::memory_order_release);
+ job_queue[job_idx].claimed.store(false, std::memory_order_release);
```

Also applies to: 265-267
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 71 - 73, The code currently does job_queue[job_idx].claimed.store(false) before job_queue[job_idx].ready.store(true), which can let another submitter claim the slot before readiness is published; swap the two stores so ready.store(true) is executed before claimed.store(false) for the job_queue entry, and apply the same fix to the other identical location where claimed and ready are updated (the later occurrence around the second claim/release sequence) to prevent the race.
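The ordering argument above can be demonstrated with a self-contained sketch. The slot layout (`JobSlot`, `publish_job`) is hypothetical, not the server's actual `JobQueueEntry`, but it shows the invariant: because a submitter scans for `claimed == false`, readiness and the payload must be published with release semantics before the claim is dropped.

```cpp
#include <atomic>
#include <cassert>
#include <cstring>

// Hypothetical slot: a scanner considers the slot free when claimed == false,
// so everything it will read afterwards must already be visible.
struct JobSlot {
  std::atomic<bool> claimed{false};
  std::atomic<bool> ready{false};
  char job_id[64]{};
};

inline void publish_job(JobSlot& slot, const char* id)
{
  std::strncpy(slot.job_id, id, sizeof(slot.job_id) - 1);
  // 1. Publish readiness first; release ordering makes the payload write
  //    above visible to any thread that acquires `ready`.
  slot.ready.store(true, std::memory_order_release);
  // 2. Only then release the claim: a scanner that observes
  //    claimed == false also observes ready == true.
  slot.claimed.store(false, std::memory_order_release);
}
```

Swapping the two stores reintroduces the race: a scanner could reserve the slot in the window where `claimed` is false but `ready` is still false.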
🧹 Nitpick comments (5)
cpp/src/grpc/client/test_grpc_client.cpp (1)
401-403: Consider graceful handling for invalid numeric CLI input instead of process abort.
`std::stod`/`std::stoll` at lines 402 and 531 throw on malformed input (e.g., `--time-limit abc` or an invalid incumbent index). While this is test code, a try-catch wrapper would provide a friendlier CLI experience instead of crashing with an unhandled exception.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/test_grpc_client.cpp` around lines 401 - 403, Replace direct calls to std::stod and std::stoll in the CLI parsing logic with exception-safe parsing: wrap the std::stod(std::string) used to set time_limit and the std::stoll used to set incumbent_index in try-catch blocks that catch std::invalid_argument and std::out_of_range, print a clear error message indicating the offending flag and value (e.g., "--time-limit" or "--incumbent-index"), and exit with a non-zero status instead of letting the process throw; update the parsing branches that reference time_limit and incumbent_index so they validate the parsed numeric ranges if applicable and fall back or exit cleanly on parse failure.

cpp/src/grpc/cuopt_remote_service.proto (1)
9-9: Standardize `job_id` field type across RPC schema.

Most request messages use `string job_id` (StatusRequest, GetResultRequest, DeleteRequest, CancelRequest, WaitRequest, StreamLogsRequest, etc.), but `IncumbentRequest.job_id` and `Incumbent.job_id` are defined as `bytes`. Similarly, `cuopt_remote.proto` uses `bytes job_id` in SubmitResponse. This type inconsistency can cause friction in client libraries expecting uniform scalar types across the API contract. Either standardize on `string` for consistency with the majority of the schema, or use `bytes` uniformly if avoiding UTF-8 validation is a requirement.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/cuopt_remote_service.proto` at line 9, IncumbentRequest.job_id, Incumbent.job_id and SubmitResponse.job_id are defined as bytes while most RPC messages (StatusRequest, GetResultRequest, DeleteRequest, CancelRequest, WaitRequest, StreamLogsRequest, etc.) use string; change those fields to type string to standardize the schema, update any proto message definitions where job_id is bytes to string (IncumbentRequest, Incumbent, SubmitResponse), run proto generation to refresh client/server stubs, and adjust any serialization/handler code that expects bytes to handle string instead.

cpp/src/grpc/grpc_solution_mapper.cu (2)
93-101: Consider bulk operations for repeated fields.

The element-by-element `add_*()` calls work correctly but may be slower for large solutions. Protobuf's `mutable_primal_solution()->Reserve(primal.size())` followed by `Add()`, or using `mutable_primal_solution()->Assign(...)`, could improve performance for large-scale problems.

💡 Example using Reserve for primal_solution
```diff
 // Solution vectors - CPU solution already has data in host memory
 const auto& primal       = solution.get_primal_solution_host();
 const auto& dual         = solution.get_dual_solution_host();
 const auto& reduced_cost = solution.get_reduced_cost_host();
+pb_solution->mutable_primal_solution()->Reserve(primal.size());
 for (const auto& v : primal) {
   pb_solution->add_primal_solution(static_cast<double>(v));
 }
```
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/grpc_solution_mapper.cu` around lines 93 - 101, The per-element protobuf adds (pb_solution->add_primal_solution, add_dual_solution, add_reduced_cost) should be replaced with bulk operations to improve performance for large vectors: call pb_solution->mutable_primal_solution()->Reserve(primal.size()) and then use Add() or assign via mutable_primal_solution()->Assign(...) to copy the primal data (cast to double as needed), and do the same for dual and reduced_cost using mutable_dual_solution()->Reserve(dual.size())/Assign and mutable_reduced_cost()->Reserve(reduced_cost.size())/Assign; update the casts so the bulk copy converts elements to double before Assign/adding.
559-598: Warm-start presence detection relies on an empty-array check.

Warm-start presence is inferred from `ws_primal.empty()`. This works correctly as long as any valid warm-start data always includes the primal solution. If a warm start could ever exist while `current_primal` is empty, this check would miss it. Consider checking the header for a dedicated warm-start flag if one is available.
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/grpc_solution_mapper.cu` around lines 559 - 598, The code currently detects warm-start presence by testing ws_primal.empty() in the block that constructs cpu_pdlp_warm_start_data_t (see bytes_to_typed, cpu_pdlp_warm_start_data_t, and the if (!ws_primal.empty()) guard); change that condition to prefer a dedicated header flag (e.g., use h.has_warm_start() or h.ws_present() if such a field exists) and fall back to the primal-array emptiness check if the flag is unavailable, e.g., replace if (!ws_primal.empty()) with if (h.has_warm_start() || !ws_primal.empty()) so warm-starts with an empty current_primal are still recognized.

cpp/src/grpc/client/grpc_client.hpp (1)
236-240: Non-movable semantics are intentional but could be revisited.

The class is non-copyable and non-movable due to its `std::atomic<bool>` and `std::thread` members. Note that `std::atomic` itself is neither copyable nor movable; together with the `std::thread` member and the PIMPL pattern, disabling moves is reasonable. If movability becomes needed, consider wrapping the thread in a nullable type and resetting `stop_logs_` after a move.
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/grpc_client.hpp` around lines 236 - 240, The class grpc_client_t is explicitly non-copyable and non-movable due to an std::atomic<bool> (stop_logs_) and an internal std::thread member; if you need movability, add a noexcept move constructor and move assignment that transfer ownership of the PIMPL, move the std::thread (std::move on the thread member) and ensure the source thread is left in a benign state (e.g., set to not joinable or detached), and transfer/reset the atomic flag (use stop_logs_.store(false) or std::exchange to set a sensible value in the moved-from object). Update grpc_client_t::grpc_client_t(grpc_client_t&&) and operator=(grpc_client_t&&) to move the pimpl pointer, move the thread, and handle stop_logs_ correctly; ensure proper synchronization when transferring thread ownership to avoid races.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/src/grpc/client/solve_remote.cu`:
- Around line 55-57: The env overrides for CUOPT_CHUNK_SIZE and
CUOPT_MAX_MESSAGE_BYTES must be validated before assignment: call
parse_env_int64 for each (as currently done), then only assign to
config.chunk_size_bytes and config.max_message_bytes when the parsed value is a
sane positive integer (e.g., > 0) and within expected bounds; otherwise ignore
the override (leave existing config value) and emit a warning via
process/logging (use config.enable_debug_log or existing logger). In short,
replace the direct assignments with guarded checks around parse_env_int64
results and clamp or reject values that are zero/negative or exceed your defined
upper limits before setting config.chunk_size_bytes and
config.max_message_bytes.
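A std-only sketch of the guarded override described above. The helper name `env_override_or` and the bounds policy are illustrative (the codebase's `parse_env_int64` is not shown here): the point is that a malformed, non-positive, or out-of-range value leaves the existing config untouched.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical guard for env overrides such as CUOPT_CHUNK_SIZE: parse the
// variable, but keep `current` unless the value is a positive integer in
// [1, max_allowed]. Returns the value the config should use.
inline int64_t env_override_or(const char* name, int64_t current, int64_t max_allowed)
{
  const char* raw = std::getenv(name);
  if (raw == nullptr || *raw == '\0') return current;
  char* end    = nullptr;
  long long v  = std::strtoll(raw, &end, 10);
  // Reject empty parses and trailing garbage ("12abc").
  if (end == raw || *end != '\0') return current;
  // Reject zero, negatives, and values above the sanity cap.
  if (v <= 0 || v > max_allowed) return current;
  return static_cast<int64_t>(v);
}
```

At the call site this becomes `config.chunk_size_bytes = env_override_or("CUOPT_CHUNK_SIZE", config.chunk_size_bytes, kMaxChunkBytes);`, optionally logging a warning when the returned value equals the old one despite the variable being set.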
- Around line 164-166: The incumbent callback incorrectly uses the incumbent
objective as the "bound" (bound_copy = objective) when calling
get_callback->get_solution; change the flow so the actual best bound from the
remote solver (use get_solution_bound()) is forwarded through the incumbent
callback protocol and used as bound_copy when invoking
get_callback->get_solution; update the gRPC incumbent callback message/handler
(grpc_client.hpp protocol and the server-side code that calls the callback) to
include a bound field, populate it from get_solution_bound(), and adjust the
client-side call site in solve_remote.cu to read that bound and pass it instead
of objective to ensure callbacks receive the solver's best bound.
In `@cpp/src/grpc/cuopt_remote_service.proto`:
- Line 315: The proto uses two different types for job_id (string in
SubmitJobResponse, StatusRequest, GetResultRequest, StartChunkedDownloadRequest,
DeleteRequest, CancelRequest, WaitRequest, StreamLogsRequest vs bytes in
IncumbentRequest and Incumbent), so unify them: choose one canonical type
(prefer string for UTF-8 identifiers or bytes if binary IDs are required) and
update the job_id field in the outlier messages (IncumbentRequest and Incumbent
if switching to string, or update the eight string-typed messages if switching
to bytes) to match; after changing the proto make sure to regenerate language
bindings and add any necessary UTF-8 validation or encoding/decoding notes in
the RPC handling code.
In `@cpp/src/grpc/cuopt_remote.proto`:
- Line 6: The proto package declaration "package cuopt.remote" in
cuopt_remote.proto doesn't match Buf's package-directory rules; either change
the package line to match the file's directory-based package (so the package
mirrors the proto file location) or move cuopt_remote.proto into a directory
structure that matches "cuopt/remote"; update the package declaration (package
cuopt.remote) accordingly so the package and directory structure are consistent
and the PACKAGE_DIRECTORY_MATCH lint passes.
In `@cpp/src/grpc/grpc_settings_mapper.cu`:
- Around line 144-146: The protobuf iteration_limit can exceed the range of the
target type i_t and overflow when statically cast; before assigning
pb_settings.iteration_limit() to settings.iteration_limit (the
static_cast<i_t>), check that the value lies within numeric_limits<i_t>::min()
and numeric_limits<i_t>::max(); if it is out of range, either clamp to the
appropriate min/max or return/log an error and refuse the assignment, ensuring
you reference pb_settings.iteration_limit() and settings.iteration_limit when
implementing the guard to prevent narrowing overflow.
In `@cpp/src/grpc/server/grpc_job_management.cpp`:
- Around line 109-113: The code currently reads a uint64_t size from the pipe
and blindly calls data.resize(size) which permits unbounded allocation; before
resizing, validate and cap the incoming size (e.g. define a
MAX_INCUMBENT_PAYLOAD or MAX_INCUMBENT_SIZE constant) and return false if size
is zero or exceeds that limit, and ensure you convert/cast size to size_t
safely; update the recv_incumbent_pipe logic around the read_from_pipe(fd,
&size, ...) and data.resize(...) calls (and any callers using fd/data) to
enforce this limit and prevent resource exhaustion.
In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp`:
- Line 90: The loop currently treats any ParseDelimitedFromCodedStream(&ac,
&coded, &clean_eof) == false as a benign break and later returns true, which
accepts partial/corrupt payloads; change the handling so that when
ParseDelimitedFromCodedStream returns false you check clean_eof and only break
on a true clean_eof (clean EOF), otherwise return false to indicate a parse
error. Update both occurrences (the line with ParseDelimitedFromCodedStream(&ac,
&coded, &clean_eof) and the other similar occurrence) so parse failures produce
an immediate false return instead of falling through and returning true.
- Around line 151-154: The resize uses ac.total_elements() * elem_size without
validation which can overflow or trigger huge allocations; before calling
dest.resize validate that ac.total_elements() and elem_size are non-negative and
compute the product using a larger unsigned type (e.g., uint64_t) then check
product <= std::numeric_limits<size_t>::max() and <= a configured maximum
allocation threshold, and only then cast to size_t and call dest.resize; if the
checks fail, return/throw an error instead of resizing. Also apply the same
validation logic for the other occurrence around lines 158-163 (same variables:
ac, elem_size, array_field_element_size(ac.field_id()), dest).
In `@cpp/src/grpc/server/grpc_server_main.cpp`:
- Around line 50-80: The CLI parsing loop in main (grpc_server_main.cpp)
currently silently ignores unknown flags and missing values; update the for-loop
that inspects argv/argc to validate every flag: for each option that requires a
value (e.g., "-p"/"--port", "-w"/"--workers", "--max-message-mb",
"--max-message-bytes", "--tls-cert", "--tls-key", "--tls-root") check i+1 < argc
and if not, print an error to std::cerr (include the flag name), call
print_usage(argv[0]) and return a non-zero exit code; for unrecognized args (the
final else case) do the same (error + usage + non-zero exit); also wrap
stoi/stoll conversions in try/catch to handle invalid numbers and fail similarly
with an explanatory error. Ensure you reference and update the same config
fields (config.port, config.num_workers, config.max_message_mb,
config.max_message_b, config.tls_cert_path, config.tls_key_path,
config.tls_root_path) and keep behavior for boolean flags unchanged.
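The "value required plus safe numeric parse" pattern can be captured in one helper. A sketch under stated assumptions: `parse_flag_value` is hypothetical, and a real CLI loop would print the offending flag name and call `print_usage` before exiting non-zero.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical helper: argv[i] is a flag that requires a value. Returns the
// parsed number, or nullopt when the value is missing, non-numeric, has
// trailing garbage, or is out of range for long long.
std::optional<long long> parse_flag_value(int argc, char** argv, int i)
{
  if (i + 1 >= argc) return std::nullopt;  // e.g. "--port" given with no value
  const std::string value = argv[i + 1];
  try {
    size_t pos   = 0;
    long long v  = std::stoll(value, &pos);
    if (pos != value.size()) return std::nullopt;  // trailing garbage ("80x")
    return v;
  } catch (const std::exception&) {
    return std::nullopt;  // std::invalid_argument or std::out_of_range
  }
}
```

The final `else` branch for unrecognized flags then mirrors the `nullopt` path: error message, usage, non-zero exit, so nothing is silently ignored.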
- Around line 157-167: The code currently calls memset on JobQueueEntry and
ResultQueueEntry which contain std::atomic members (job_queue and result_queue
initialization loop); replace the raw memset with explicit field initialization
or construction: for each JobQueueEntry/ResultQueueEntry set every member
explicitly (e.g., initialize non-atomic fields directly and initialize atomics
via .store(...) or use placement-new / a helper function that constructs the
object in shared memory) to avoid undefined behavior; update the loops that call
memset(&job_queue[i], 0, ...) and memset(&result_queue[i], 0, ...) to perform
per-field initialization instead, ensuring worker_index, ready, claimed,
cancelled, retrieved, etc. are correctly constructed and stored.
In `@cpp/src/grpc/server/grpc_server_types.hpp`:
- Around line 300-307: signal_handler currently calls non-async-signal-safe
functions (std::cout and result_cv.notify_all()), which must be removed; change
signal_handler to only perform async-signal-safe writes: set keep_running to
false using a sig_atomic_t/atomic<bool> and, if needed, set
shm_ctrl->shutdown_requested using an atomic or sig_atomic_t field, and return
immediately (no I/O or condition_variable calls). Add a separate watcher
thread/function (e.g., monitor_shutdown or shutdown_watcher) that polls/blocks
on keep_running becoming false and then performs the logging (std::cout) and
calls result_cv.notify_all() and any other cleanup; update signal_handler,
keep_running's type, and shm_ctrl to use signal-safe atomics so the watcher can
safely observe the change.
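The handler/watcher split described above can be sketched in a few lines. The names `keep_running` and `shutdown_watcher` follow the review's suggestions; the watcher's polling interval is an illustrative choice.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <csignal>
#include <thread>

std::atomic<bool> keep_running{true};

// Async-signal-safe: only flips an atomic flag and returns. No iostreams,
// no mutexes, no condition variables inside the handler.
void signal_handler(int) { keep_running.store(false); }

// Runs on an ordinary thread, so once it observes the flag change it may
// safely log, take locks, and call result_cv.notify_all()-style cleanup.
void shutdown_watcher()
{
  while (keep_running.load()) {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }
  // ...log shutdown and notify waiting condition variables here...
}
```

The watcher thread is started alongside the server; the handler itself never touches anything that is not async-signal-safe.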
In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp`:
- Around line 213-218: The setters and other methods access server_log_path_,
server_log_start_pos_, and test_start_marked_ without synchronization, causing
races; add a member mutex (e.g., server_log_mutex_) and use it to guard all
reads and writes of these fields (wrap accesses in std::lock_guard<std::mutex>),
update set_server_log_path to lock the mutex when updating server_log_path_,
server_log_start_pos_, and test_start_marked_, and ensure every other method
that reads or writes these members (the other server-log readers/pollers
mentioned) also locks the same mutex before accessing them so the state is
consistently protected.
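A minimal sketch of the guarded state the review describes (the class name is hypothetical; only two of the accessors are shown). The key points: one mutex covers all three fields, and getters return by value so no reference escapes the lock.

```cpp
#include <cassert>
#include <mutex>
#include <string>

// Hypothetical holder for the server-log fields in grpc_test_log_capture.
class ServerLogState {
 public:
  void set_server_log_path(const std::string& path)
  {
    std::lock_guard<std::mutex> lock(server_log_mutex_);
    server_log_path_      = path;
    server_log_start_pos_ = 0;
    test_start_marked_    = false;
  }

  std::string server_log_path() const
  {
    std::lock_guard<std::mutex> lock(server_log_mutex_);
    return server_log_path_;  // by value: no reference escapes the lock
  }

 private:
  mutable std::mutex server_log_mutex_;
  std::string server_log_path_;
  long server_log_start_pos_{0};
  bool test_start_marked_{false};
};
```

Every other reader/poller method gets the same `std::lock_guard` on `server_log_mutex_` before touching any of the three fields.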
In `@GRPC_ARCHITECTURE.md`:
- Around line 9-10: The documentation sentence claiming “no custom serialization
logic is implemented” is too absolute; update the statement to acknowledge the
custom pipe blob framing helpers introduced in grpc_pipe_serialization.hpp (the
pipe blob framing helpers) and clarify scope by limiting the claim to gRPC wire
payloads (e.g., “All gRPC wire payload serialization uses protocol buffers
generated by protoc and grpc_cpp_plugin; separate pipe blob framing helpers in
grpc_pipe_serialization.hpp are used only for local pipe framing, not for gRPC
wire serialization.”).
In `@SERVER_ARCHITECTURE.md`:
- Around line 15-46: The markdown has untyped fenced code blocks (the ASCII
diagram block and other fenced regions) which trigger MD040; update each
triple-backtick fence (including the large ASCII diagram block shown and the
other blocks noted) to include an explicit language token such as text (e.g.,
change ``` to ```text) or another appropriate language (bash/cpp) so
markdownlint MD040 is satisfied across the file.
---
Duplicate comments:
In `@cpp/src/grpc/client/grpc_client.cu`:
- Around line 258-266: The public RPC method grpc_client_t::check_status
dereferences impl_->stub without ensuring impl_ is initialized or stub is
non-null (can be called before connect()), which can crash; add a null-check
guard at the start of check_status (and other public RPC methods) that verifies
impl_ and impl_->stub are valid, and return an appropriate job_status_result_t
error/failed state (or set result.error and result.code) when the client is not
connected instead of dereferencing; refer to grpc_client_t::connect and
impl_->stub when adding the guard so the check mirrors connect()'s
initialization contract.
- Around line 994-995: The code computes max_polls by dividing by
config_.poll_interval_ms without validating it; update the logic that sets
max_polls (the line initializing int max_polls) to first validate
config_.poll_interval_ms is > 0 (and also guard against extremely large/small
values), returning an error or using a safe default/policy (e.g., set
poll_interval_ms = 1 or return a failure status) when it's <= 0 to avoid
division-by-zero and undefined behavior; apply the same validation where
max_polls is computed elsewhere (the other occurrence around the code that uses
config_.poll_interval_ms).
- Around line 703-712: Validate and guard all chunk metadata before allocating
or copying: ensure arr_desc.element_size_bytes() (elem_size) > 0 and
arr_desc.total_elements() (total_elems) >= 0 before using them; ensure
chunk_data_budget is >0 before computing elems_per_chunk and handle division
safely; check for size_t/multiplication overflow when computing total_elems *
elem_size before constructing array_bytes and use a safe cap; when iterating
chunks validate each elements_in_chunk is within [0, total_elems] and compute
copy_size = elements_in_chunk * elem_size with overflow checks before any
memcpy; add clear error handling/logging and skip/abort the chunk on invalid
metadata rather than performing the allocation or memcpy (refer to symbols:
arr_desc, field_id, elem_size, total_elems, chunk_data_budget, elems_per_chunk,
elements_in_chunk, array_bytes, memcpy).
- Around line 247-251: stop_log_streaming currently only sets stop_logs_ and
then joins log_thread_, which can hang if the thread is blocked in a gRPC read;
change the shutdown sequence to (1) set stop_logs_, (2) explicitly cancel/close
the gRPC stream used by the reader (e.g. call the
grpc::ClientContext::TryCancel() or cancel/Finish the
ClientReader/ReaderInterface instance used for log streaming), (3) ensure any
CompletionQueue or async reader is shutdown/drained so the blocked read returns,
and only then join log_thread_ (use a timed join as a fallback and log an error
if join times out); update stop_log_streaming to reference stop_logs_,
log_thread_ and the stream/reader/client-context member (the object that
performs the stream read) when implementing the cancel/cleanup steps.
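The shutdown ordering this prompt asks for — set the flag, unblock the reader, only then join — can be demonstrated without gRPC. In this std-only sketch a condition variable stands in for the blocking `reader->Read()` and `stop()`'s wakeup plays the role of `ClientContext::TryCancel()`; the class and its members are illustrative, not the client's real API.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

class LogReader {
 public:
  void start()
  {
    thread_ = std::thread([this] {
      std::unique_lock<std::mutex> lock(mutex_);
      // Blocks like reader->Read(); wakes when stop() signals.
      cv_.wait(lock, [this] { return stop_; });
    });
  }

  void stop()
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stop_ = true;          // 1. publish the stop flag under the lock
    }
    cv_.notify_all();        // 2. unblock the reader (TryCancel analogue)
    if (thread_.joinable()) {
      thread_.join();        // 3. join is now guaranteed to return
    }
  }

  bool running() const { return thread_.joinable(); }

 private:
  bool stop_ = false;
  std::condition_variable cv_;
  mutable std::mutex mutex_;
  std::thread thread_;
};
```

Joining first, or setting the flag without a wakeup, reproduces the hang the review describes: the reader never re-checks the flag while blocked.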
In `@cpp/src/grpc/grpc_problem_mapper.cu`:
- Around line 399-411: The lambda get_doubles uses n = it->second.size() /
sizeof(double) but then memcpy's the full byte size, which can overflow if the
payload isn't aligned; validate that it->second.size() is an exact multiple of
sizeof(double) before copying, compute expected_bytes = n * sizeof(double) and
only memcpy expected_bytes (or return/throw on misaligned size), and when
converting to f_t ensure you only read expected_bytes into the temporary vector;
apply the same check/fix to the other chunked-array decoder blocks that perform
memcpy (the other double-to-f_t conversion in this file).
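The alignment check this prompt asks for fits in a single decoder helper. A sketch (the function name is illustrative): reject any byte payload whose size is not an exact multiple of the element size instead of truncating the element count and copying the larger byte span.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical decoder: returns {} for empty or misaligned payloads, and
// copies exactly n * sizeof(double) bytes otherwise, so the memcpy can
// never write past the allocated vector.
inline std::vector<double> decode_doubles(const std::vector<uint8_t>& bytes)
{
  if (bytes.empty() || bytes.size() % sizeof(double) != 0) return {};
  std::vector<double> out(bytes.size() / sizeof(double));
  std::memcpy(out.data(), bytes.data(), out.size() * sizeof(double));
  return out;
}
```

The same shape applies to the int32 decoder; only the element type changes.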
In `@cpp/src/grpc/server/grpc_job_management.cpp`:
- Around line 147-158: The submit_job_async path performs an unsynchronized
check-then-write on shared job_queue entries (e.g., checking
job_queue[i].ready/claimed then writing fields like job_id, ready, claimed),
which races with check_job_status and cancel_job; protect all accesses to
job_queue (both scans and per-entry reads/writes) by introducing a
synchronization primitive (e.g., a std::mutex or per-entry mutex array) and
acquire the lock(s) around the loop that scans MAX_JOBS and before modifying
fields (job_queue[i].ready, claimed, cancelled, etc.), and also ensure
check_job_status and cancel_job acquire the same lock(s) before reading or
updating those fields so ownership and state transitions are atomic and
race-free.
In `@cpp/src/grpc/server/grpc_server_main.cpp`:
- Around line 186-212: The TLS validation branches that currently do "return 1"
(around checks using config, ssl_opts and read_file_to_string for cert/key/root
and require_client) can occur after workers/threads have been started; replace
those early returns with calls to the centralized shutdown/teardown routine
(e.g., call your existing server stop/cleanup functions such as StopServer(),
shutdownWorkers()/joinThreads(), or a new cleanup_resources() that performs
shutdown and joins) and then return the error code; ensure every failure path
that currently returns from inside the TLS setup invokes that teardown routine
before exiting so workers/threads and other resources are properly cleaned up.
- Around line 234-251: BuildAndStart() may return nullptr so avoid unguarded
dereference of server; after std::unique_ptr<Server>
server(builder.BuildAndStart()), check if (server) before calling server->Wait()
and starting the shutdown logic that assumes a live server. If BuildAndStart()
fails, log an error (or print to cerr) and exit/return non-zero. Also ensure the
shutdown_thread's capture/Shutdown call remains safe by only invoking
server->Shutdown() when server is non-null; move creation of shutdown_thread and
the server->Wait() call inside the if (server) block and handle the failure path
explicitly.
In `@cpp/src/grpc/server/grpc_service_impl.cpp`:
- Around line 71-73: The code currently does
job_queue[job_idx].claimed.store(false) before
job_queue[job_idx].ready.store(true), which can let another submitter claim the
slot before readiness is published; swap the two stores so ready.store(true) is
executed before claimed.store(false) for the job_queue entry, and apply the same
fix to the other identical location where claimed and ready are updated (the
later occurrence around the second claim/release sequence) to prevent the race.
---
Nitpick comments:
In `@cpp/src/grpc/client/grpc_client.hpp`:
- Around line 236-240: The class grpc_client_t is explicitly non-copyable and
non-movable due to an std::atomic<bool> (stop_logs_) and an internal std::thread
member; if you need movability, add a noexcept move constructor and move
assignment that transfer ownership of the PIMPL, move the std::thread (std::move
on the thread member) and ensure the source thread is left in a benign state
(e.g., set to not joinable or detached), and transfer/reset the atomic flag (use
stop_logs_.store(false) or std::exchange to set a sensible value in the
moved-from object). Update grpc_client_t::grpc_client_t(grpc_client_t&&) and
operator=(grpc_client_t&&) to move the pimpl pointer, move the thread, and
handle stop_logs_ correctly; ensure proper synchronization when transferring
thread ownership to avoid races.
In `@cpp/src/grpc/client/test_grpc_client.cpp`:
- Around line 401-403: Replace direct calls to std::stod and std::stoll in the
CLI parsing logic with exception-safe parsing: wrap the std::stod(std::string)
used to set time_limit and the std::stoll used to set incumbent_index in
try-catch blocks that catch std::invalid_argument and std::out_of_range, print a
clear error message indicating the offending flag and value (e.g.,
"--time-limit" or "--incumbent-index"), and exit with a non-zero status instead
of letting the process throw; update the parsing branches that reference
time_limit and incumbent_index so they validate the parsed numeric ranges if
applicable and fall back or exit cleanly on parse failure.
In `@cpp/src/grpc/cuopt_remote_service.proto`:
- Line 9: IncumbentRequest.job_id, Incumbent.job_id and SubmitResponse.job_id
are defined as bytes while most RPC messages (StatusRequest, GetResultRequest,
DeleteRequest, CancelRequest, WaitRequest, StreamLogsRequest, etc.) use string;
change those fields to type string to standardize the schema, update any proto
message definitions where job_id is bytes to string (IncumbentRequest,
Incumbent, SubmitResponse), run proto generation to refresh client/server stubs,
and adjust any serialization/handler code that expects bytes to handle string
instead.
In `@cpp/src/grpc/grpc_solution_mapper.cu`:
- Around line 93-101: The per-element protobuf adds
(pb_solution->add_primal_solution, add_dual_solution, add_reduced_cost) should
be replaced with bulk operations to improve performance for large vectors: call
pb_solution->mutable_primal_solution()->Reserve(primal.size()) and then use
Add() or assign via mutable_primal_solution()->Assign(...) to copy the primal
data (cast to double as needed), and do the same for dual and reduced_cost using
mutable_dual_solution()->Reserve(dual.size())/Assign and
mutable_reduced_cost()->Reserve(reduced_cost.size())/Assign; update the casts so
the bulk copy converts elements to double before Assign/adding.
- Around line 559-598: The code currently detects warm-start presence by testing
ws_primal.empty() in the block that constructs cpu_pdlp_warm_start_data_t (see
bytes_to_typed, cpu_pdlp_warm_start_data_t, and the if (!ws_primal.empty())
guard); change that condition to prefer a dedicated header flag (e.g., use
h.has_warm_start() or h.ws_present() if such a field exists) and fall back to
the primal-array emptiness check if the flag is unavailable, e.g., replace if
(!ws_primal.empty()) with if (h.has_warm_start() || !ws_primal.empty()) so
warm-starts with an empty current_primal are still recognized.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2a43112f-b25e-4054-b00f-1fee3a598df9
📒 Files selected for processing (38)
- GRPC_ARCHITECTURE.md
- SERVER_ARCHITECTURE.md
- build.sh
- cpp/CMakeLists.txt
- cpp/cuopt_grpc_server.cpp
- cpp/src/grpc/client/grpc_client.cu
- cpp/src/grpc/client/grpc_client.hpp
- cpp/src/grpc/client/solve_remote.cu
- cpp/src/grpc/client/test_grpc_client.cpp
- cpp/src/grpc/cuopt_remote.proto
- cpp/src/grpc/cuopt_remote_service.proto
- cpp/src/grpc/grpc_problem_mapper.cu
- cpp/src/grpc/grpc_problem_mapper.hpp
- cpp/src/grpc/grpc_service_mapper.cu
- cpp/src/grpc/grpc_service_mapper.hpp
- cpp/src/grpc/grpc_settings_mapper.cu
- cpp/src/grpc/grpc_settings_mapper.hpp
- cpp/src/grpc/grpc_solution_mapper.cu
- cpp/src/grpc/grpc_solution_mapper.hpp
- cpp/src/grpc/server/grpc_field_element_size.hpp
- cpp/src/grpc/server/grpc_incumbent_proto.hpp
- cpp/src/grpc/server/grpc_job_management.cpp
- cpp/src/grpc/server/grpc_pipe_serialization.hpp
- cpp/src/grpc/server/grpc_server_main.cpp
- cpp/src/grpc/server/grpc_server_threads.cpp
- cpp/src/grpc/server/grpc_server_types.hpp
- cpp/src/grpc/server/grpc_service_impl.cpp
- cpp/src/grpc/server/grpc_worker.cpp
- cpp/src/grpc/server/grpc_worker_infra.cpp
- cpp/src/pdlp/CMakeLists.txt
- cpp/src/pdlp/solve_remote.cu
- cpp/tests/linear_programming/CMakeLists.txt
- cpp/tests/linear_programming/grpc/CMakeLists.txt
- cpp/tests/linear_programming/grpc/grpc_client_test.cpp
- cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp
- cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
- cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
- python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
💤 Files with no reviewable changes (2)
- cpp/src/pdlp/CMakeLists.txt
- cpp/src/pdlp/solve_remote.cu
🚧 Files skipped from review as they are similar to previous changes (8)
- cpp/tests/linear_programming/CMakeLists.txt
- cpp/tests/linear_programming/grpc/CMakeLists.txt
- cpp/src/grpc/server/grpc_incumbent_proto.hpp
- cpp/src/grpc/server/grpc_worker_infra.cpp
- cpp/src/grpc/server/grpc_server_threads.cpp
- cpp/src/grpc/server/grpc_field_element_size.hpp
- cpp/src/grpc/server/grpc_worker.cpp
- python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
Actionable comments posted: 5
♻️ Duplicate comments (24)
cpp/src/grpc/cuopt_remote_service.proto (1)
314-325: ⚠️ Potential issue | 🟠 Major — Use one `job_id` scalar type across service messages.

Line 315 and Line 324 use `bytes`, while most service request/response messages use `string` for `job_id`. This inconsistency creates avoidable cross-language conversion edge cases.

Schema alignment sketch

```diff
 message IncumbentRequest {
-  bytes job_id = 1;
+  string job_id = 1;
   int64 from_index = 2;
   int32 max_count = 3;
 }

 message Incumbent {
   int64 index = 1;
   double objective = 2;
   repeated double assignment = 3;
-  bytes job_id = 4;
+  string job_id = 4;
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/cuopt_remote_service.proto` around lines 314 - 325, The IncumbentRequest.job_id and Incumbent.job_id fields use bytes while the rest of the API uses string for job_id; change both fields in messages IncumbentRequest and Incumbent from bytes to string to ensure a single scalar type for job_id across the service, and update any serialization/usage code that expects bytes to accept a string (or perform explicit encoding/decoding at the callsite) so cross-language conversions remain consistent.GRPC_ARCHITECTURE.md (1)
9-10: ⚠️ Potential issue | 🟡 Minor — Scope the serialization claim to gRPC wire payloads.

Lines 9-10 are too absolute; local pipe framing helpers exist. Please clarify that the claim applies specifically to gRPC wire serialization.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@GRPC_ARCHITECTURE.md` around lines 9 - 10, The statement "All serialization uses protocol buffers generated by `protoc` and `grpc_cpp_plugin` — no custom serialization logic is implemented." is too absolute; update it to explicitly scope serialization to gRPC wire payloads (e.g., "All gRPC wire payload serialization uses protocol buffers generated by `protoc` and `grpc_cpp_plugin`") and add a brief note acknowledging that local pipe framing helpers (custom framing/unframing code) are used outside the gRPC wire format; keep the original wording's intent but clarify that custom framing helpers exist for local pipes.SERVER_ARCHITECTURE.md (1)
15-46: ⚠️ Potential issue | 🟡 Minor — Add explicit fence languages to these code blocks.

markdownlint MD040 is still triggered by untyped fences in these sections. Use `text` as the fence language for ASCII diagrams and table-like flows.

Also applies to: 110-118, 122-135, 139-148, 154-168, 196-198, 207-217
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@SERVER_ARCHITECTURE.md` around lines 15 - 46, The ASCII-art code fences (e.g., the block starting with the "Main Server Process" diagram that includes "Shared Memory Queues", "Job Queue", "Result Queue", and worker boxes) are untyped and trigger markdownlint MD040; update each such fence to use an explicit text language (replace ``` with ```text) for that diagram and the other similar untyped blocks (the blocks containing table/flow-style ASCII diagrams and blocks starting with the same diagram-like contents referenced in the comment). Ensure every ASCII/table-like fenced block (including the ones around the ranges called out in the review) is changed to ```text so markdownlint stops flagging MD040.cpp/src/grpc/server/grpc_worker_infra.cpp (1)
165-168: ⚠️ Potential issue | 🔴 Critical — Validate PID and handle `EINTR` in `wait_for_workers()`.

The reap loop at lines 165-168 should skip non-positive PIDs and retry interrupted waits; otherwise `waitpid(0, ...)` semantics or signal interruption can lead to incorrect child reaping behavior.

Suggested fix

```diff
 void wait_for_workers()
 {
   for (pid_t pid : worker_pids) {
+    if (pid <= 0) continue;
     int status;
-    waitpid(pid, &status, 0);
+    while (waitpid(pid, &status, 0) < 0 && errno == EINTR) {}
   }
   worker_pids.clear();
 }
```
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_worker_infra.cpp` around lines 165 - 168, In wait_for_workers(), the reap loop currently calls waitpid(pid, &status, 0) for every pid in worker_pids without validating pid or handling EINTR; update the loop to skip non-positive PIDs (pid <= 0) and wrap waitpid(pid, &status, 0) in a retry loop that re-invokes waitpid when it returns -1 with errno == EINTR, and only break on a successful return or an error other than EINTR; reference the worker_pids iteration, the pid variable, the waitpid call and the status/errno checks when making the change.cpp/src/grpc/cuopt_remote.proto (1)
6-6: ⚠️ Potential issue | 🟠 Major — Align proto package and directory to satisfy Buf lint.

`package cuopt.remote` at Line 6 does not match the current file directory (`cpp/src/grpc`), which triggers `PACKAGE_DIRECTORY_MATCH`.
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/cuopt_remote.proto` at line 6, The proto package declaration in cuopt_remote.proto ("package cuopt.remote;") does not match the file's directory and triggers PACKAGE_DIRECTORY_MATCH; update the package line in cuopt_remote.proto to reflect the directory structure (e.g., "package cpp.src.grpc;" or another package path that mirrors the directory segments) so the package-to-directory mapping aligns, then re-run buf lint to verify the PACKAGE_DIRECTORY_MATCH error is resolved.cpp/src/grpc/server/grpc_server_main.cpp (5)
157-167: ⚠️ Potential issue | 🔴 Critical — Replace raw `memset` on queue entries that include atomics.

`JobQueueEntry`/`ResultQueueEntry` contain `std::atomic` members; byte-wise zeroing those objects is undefined behavior.

As per coding guidelines "Prevent thread-unsafe use of global and static variables; use proper mutex/synchronization in server code accessing shared solver state."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 157 - 167, The code currently zeroes JobQueueEntry and ResultQueueEntry with memset (affecting job_queue and result_queue), which is undefined because those structs contain std::atomic members; replace the memset calls by value-initializing or explicitly initializing each member (e.g., construct entries using JobQueueEntry() or assign default-constructed instances) and then set the atomic flags via their store(...) calls (retain the existing store(false)/store(-1) lines for ready/claimed/cancelled/worker_index and retrieved), ensuring no byte-wise writes touch atomics; update the initialization loop to use assignment or placement-new of JobQueueEntry/ResultQueueEntry rather than memset to avoid UB.
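A minimal sketch of the safe-initialization pattern, with a simplified `JobQueueEntry` standing in for the real shared-memory struct (the actual entry has more fields, and cross-process use additionally requires the atomics to be lock-free):

```cpp
#include <atomic>
#include <cstring>

// Simplified stand-in for the shared-memory JobQueueEntry.
struct JobQueueEntry {
  char job_id[64]{};  // plain data: value-initialized to zero
  int problem_type{0};
  std::atomic<bool> ready{false};    // atomics must be initialized/reset through
  std::atomic<bool> claimed{false};  // their API, never via memset on the object
  std::atomic<int> worker_index{-1};
};

// Reset one entry without byte-wise zeroing the whole object: assign plain
// members directly and use store() on the atomic members.
void reset_entry(JobQueueEntry& e)
{
  std::memset(e.job_id, 0, sizeof(e.job_id));  // zeroing a char array is fine
  e.problem_type = 0;
  e.ready.store(false);
  e.claimed.store(false);
  e.worker_index.store(-1);
}
```

Equivalently, assigning a default-constructed `JobQueueEntry` would work if the struct were copy-assignable, but `std::atomic` members delete the copy assignment, which is exactly why per-member resets (or placement-new) are the safe options here.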
50-80: ⚠️ Potential issue | 🟠 Major — Reject unknown or incomplete CLI flags explicitly.
The parser currently falls through for unrecognized flags and for value-requiring flags missing a value, so the server can continue with unintended defaults.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 50 - 80, The CLI parser in the main argument loop accepts unknown flags and silently ignores missing values for flags that require them; update the argument parsing logic (the for-loop handling argv in grpc_server_main.cpp and the handling of flags like "--port"/"--workers"/"--max-message-mb"/"--max-message-bytes"/"--tls-cert"/"--tls-key"/"--tls-root") so that when a flag that requires a value is encountered and i+1 >= argc you call print_usage(argv[0]) (or log an explicit error) and return a non-zero exit code, and when an unrecognized argument starting with '-' is seen you also call print_usage(argv[0]) (or emit an error) and return non-zero—this ensures unknown or incomplete CLI flags are rejected instead of silently falling back to defaults.
89-90: ⚠️ Potential issue | 🔴 Critical — Do not wire POSIX signals to the current async-unsafe handler.

`signal_handler` (in `cpp/src/grpc/server/grpc_server_types.hpp`) uses operations like condition-variable notification from signal context, which is not async-signal-safe.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 89 - 90, The current wiring of POSIX signals via signal(SIGINT, signal_handler) / signal(SIGTERM, signal_handler) is unsafe because signal_handler performs async-unsafe operations (e.g., condition-variable notification); instead, change to a safe signal handling pattern: install a minimal async-signal-safe handler that only sets a std::sig_atomic_t or std::atomic_bool shutdown flag (or writes a byte to a dedicated self-pipe) and return, and move any condition-variable notifications or complex shutdown logic out of signal_handler into the main thread or a dedicated watcher that polls the flag or reads the pipe; update references in grpc_server_main.cpp that call signal(...) and the implementation of signal_handler in grpc_server_types.hpp to follow this pattern so all non-async-safe actions occur off the signal context.
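The safe pattern described in the prompt — a handler that only sets a flag, with all shutdown work done outside signal context — can be sketched as below. The helper names are illustrative, not from the codebase:

```cpp
#include <csignal>

// The only state an async-signal-safe handler may touch: a lock-free flag.
static volatile std::sig_atomic_t g_shutdown_requested = 0;

extern "C" void minimal_signal_handler(int) { g_shutdown_requested = 1; }

// Install the minimal handler. All real shutdown work (condition-variable
// notification, server->Shutdown(), joining threads) is performed by the main
// thread after it observes the flag -- never inside signal context.
void install_shutdown_handler()
{
  std::signal(SIGINT, minimal_signal_handler);
  std::signal(SIGTERM, minimal_signal_handler);
}

bool shutdown_requested() { return g_shutdown_requested != 0; }
```

The main loop (or a dedicated watcher thread) polls `shutdown_requested()` and then runs the existing teardown path; a self-pipe written from the handler is an equivalent alternative when blocking waits must be woken immediately.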
234-251: ⚠️ Potential issue | 🔴 Critical — Guard `BuildAndStart()` failure before calling `server->Wait()`.

`BuildAndStart()` can return null, but Line 250 unconditionally dereferences `server`.
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 234 - 251, BuildAndStart() may return nullptr so avoid unguarded dereference of server; after calling std::unique_ptr<Server> server(builder.BuildAndStart()), check if (!server) and handle the failure (log an error via std::cerr or similar and return/exit with non-zero) before starting the shutdown_thread or calling server->Wait(); also ensure the shutdown_thread lambda still checks server before calling server->Shutdown() and join the thread if started (use a flag to indicate thread started or create the thread only when server is valid).
185-212: ⚠️ Potential issue | 🔴 Critical — TLS error branches leak workers/threads/shared memory after startup.

These early returns execute after `spawn_workers()` and thread creation, so teardown is skipped on TLS configuration/read failures.
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 185 - 212, The TLS error branches return early after spawn_workers()/thread/shared-memory setup and thus leak resources; either move the TLS validation block to run before spawn_workers(), or ensure every early-return path calls the teardown routines (e.g., shutdown_workers(), join_worker_threads(), release_shared_memory() or whatever the existing cleanup functions are) before returning. Locate the TLS block (uses config, ssl_opts, read_file_to_string) and change it so that validation happens prior to calling spawn_workers(), or add calls to the project's worker/shutdown/cleanup functions immediately before each return in this block to properly stop threads and free shared memory.cpp/src/grpc/server/grpc_pipe_serialization.hpp (2)
90-90: ⚠️ Potential issue | 🔴 Critical — Return failure on malformed chunk parse instead of breaking and succeeding.

A parse error currently breaks the loop and still returns `true`, which accepts partial/corrupt payloads as valid.

🛠️ Minimal fix

```diff
-    if (!google::protobuf::util::ParseDelimitedFromCodedStream(&ac, &coded, &clean_eof)) break;
+    if (!google::protobuf::util::ParseDelimitedFromCodedStream(&ac, &coded, &clean_eof)) {
+      if (clean_eof) { break; }
+      return false;
+    }
 ...
-    if (!google::protobuf::util::ParseDelimitedFromCodedStream(&ac, &coded, &clean_eof)) break;
+    if (!google::protobuf::util::ParseDelimitedFromCodedStream(&ac, &coded, &clean_eof)) {
+      if (clean_eof) { break; }
+      return false;
+    }
```

As per coding guidelines "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files."
Also applies to: 148-148
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` at line 90, The loop currently treats any false from google::protobuf::util::ParseDelimitedFromCodedStream(&ac, &coded, &clean_eof) as a harmless break and ultimately returns true, which accepts corrupt partial payloads; change the logic so that when ParseDelimitedFromCodedStream returns false you check clean_eof and only break if clean_eof is true (clean EOF), otherwise immediately return false to signal a malformed chunk error; update this behavior for the ParseDelimitedFromCodedStream call at the shown line and the duplicate occurrence (around the other occurrence near line 148) so malformed parses fail instead of succeeding.
93-95: ⚠️ Potential issue | 🟠 Major — Validate `total_elements` size math before `resize()`.

The current resize paths trust unvalidated `total_elements` (and its multiplication with `elem_size`), which can overflow or trigger excessive allocation on malformed input.

As per coding guidelines "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems."
Also applies to: 151-154
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 93 - 95, The resize call trusts ac.total_elements() (and the later elem_size * total_elements math) and can overflow or cause huge allocations; before calling dest.resize() (both at the dest.empty() check and the duplicate block at the later occurrence), validate ac.total_elements() and the computed byte_count: cast total_elements to a fixed-width unsigned type (e.g. uint64_t), check total_elements > 0 is within a sane upper bound and that elem_size * total_elements does not overflow size_t (e.g. byte_count = uint64_t(elem_size) * total_elements; if (byte_count > std::numeric_limits<size_t>::max() || byte_count > MAX_ALLOWED_BYTES) return/error), then perform resize using the checked size_t value; if the validation fails, return an error/handle gracefully instead of resizing.cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp (1)
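The overflow check described above can be factored into a small helper like the sketch below; `kMaxPayloadBytes` is an illustrative cap, not a value from the codebase (the real bound would come from server config such as `max_message_bytes`):

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Illustrative cap on a single deserialized array (assumption: 2 GiB).
constexpr std::uint64_t kMaxPayloadBytes = 1ull << 31;

// Return the validated byte count for total_elements * elem_size, or nullopt
// when the input is negative/zero-sized-element, would overflow, or exceeds
// the cap. The result is then safe to pass to std::vector::resize().
std::optional<std::size_t> checked_byte_count(std::int64_t total_elements, std::int64_t elem_size)
{
  if (total_elements < 0 || elem_size <= 0) { return std::nullopt; }
  const auto n = static_cast<std::uint64_t>(total_elements);
  const auto s = static_cast<std::uint64_t>(elem_size);
  // Multiplication overflow check without performing the multiplication first.
  if (n != 0 && s > std::numeric_limits<std::uint64_t>::max() / n) { return std::nullopt; }
  const std::uint64_t bytes = n * s;
  if (bytes > kMaxPayloadBytes) { return std::nullopt; }
  return static_cast<std::size_t>(bytes);
}
```

Both resize paths flagged in the comment could call this helper and fail the parse (rather than allocate) when it returns `nullopt`.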
213-218: ⚠️ Potential issue | 🟠 Major — Synchronize server-log state reads/writes consistently.

`server_log_path_`, `server_log_start_pos_`, and `test_start_marked_` are accessed without a common lock across setters/readers. This can race when log polling and test setup run concurrently.

Suggested fix pattern

```diff
 void set_server_log_path(const std::string& path)
 {
+  std::lock_guard<std::mutex> lock(mutex_);
   server_log_path_      = path;
   server_log_start_pos_ = 0;
   test_start_marked_    = false;
 }

 std::string get_server_logs(bool since_test_start = true) const
 {
-  if (server_log_path_.empty()) { return ""; }
+  std::string path;
+  std::streampos start_pos = 0;
+  bool marked = false;
+  {
+    std::lock_guard<std::mutex> lock(mutex_);
+    path      = server_log_path_;
+    start_pos = server_log_start_pos_;
+    marked    = test_start_marked_;
+  }
+  if (path.empty()) { return ""; }
-  std::ifstream file(server_log_path_);
+  std::ifstream file(path);
   if (!file.is_open()) { return ""; }
-  if (since_test_start && test_start_marked_ && server_log_start_pos_ > 0) {
-    file.seekg(server_log_start_pos_);
+  if (since_test_start && marked && start_pos > 0) {
+    file.seekg(start_pos);
   }
```

As per coding guidelines: "Prevent thread-unsafe use of global and static variables; use proper mutex/synchronization in server code accessing shared solver state."
Also applies to: 229-244, 332-355
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp` around lines 213 - 218, The three server-log fields (server_log_path_, server_log_start_pos_, test_start_marked_) are accessed without synchronization; modify set_server_log_path to acquire a dedicated mutex (e.g., server_log_mutex_) and set all three fields while holding the lock, and update all other readers/writers that touch these members (the other setter/getter/polling functions referenced in the diff ranges) to also lock the same mutex when reading or writing those fields to prevent races.cpp/src/grpc/server/grpc_job_management.cpp (2)
109-113: ⚠️ Potential issue | 🟠 Major — Cap incumbent payload size before resizing the buffer.

This path trusts the size prefix and directly calls `data.resize(size)`. A malformed payload can force large allocations and exhaust memory.

Suggested guardrail

```diff
 bool recv_incumbent_pipe(int fd, std::vector<uint8_t>& data)
 {
   uint64_t size;
   if (!read_from_pipe(fd, &size, sizeof(size))) return false;
+  if (size > static_cast<uint64_t>(config.max_message_bytes)) {
+    std::cerr << "[Worker] Incumbent payload too large: " << size << "\n";
+    return false;
+  }
   data.resize(size);
   if (size > 0 && !read_from_pipe(fd, data.data(), size)) return false;
   return true;
 }
```

As per coding guidelines: "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 109 - 113, The code currently trusts the size prefix and calls data.resize(size) after read_from_pipe; introduce a validated upper bound (e.g., a constexpr size_t kMaxJobPayloadSize) and check that size is non-negative and <= kMaxJobPayloadSize before calling data.resize(size) and before the subsequent read_from_pipe; if the size is out of range, return false (or handle as a protocol error) to prevent unbounded allocation and potential memory exhaustion. Use the existing symbols size, data.resize, and read_from_pipe to locate and modify the logic.
147-158: ⚠️ Potential issue | 🔴 Critical — Fix racy queue-slot reservation in async submit.
Slot selection currently performs a check-then-write on shared queue state, so concurrent submissions can reserve the same slot.
Suggested approach (atomic claim)
```diff
 for (size_t i = 0; i < MAX_JOBS; ++i) {
-  if (!job_queue[i].ready && !job_queue[i].claimed) {
+  if (job_queue[i].ready.load()) { continue; }
+  bool expected = false;
+  if (!job_queue[i].claimed.compare_exchange_strong(expected, true)) { continue; }
   copy_cstr(job_queue[i].job_id, job_id);
   job_queue[i].problem_type = is_mip ? 1 : 0;
   ...
-  job_queue[i].claimed   = false;
-  job_queue[i].cancelled = false;
-  job_queue[i].ready     = true;
+  job_queue[i].cancelled.store(false);
+  job_queue[i].ready.store(true, std::memory_order_release);
+  job_queue[i].claimed.store(false, std::memory_order_release);
```

As per coding guidelines: "Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 147 - 158, The loop in submit that does check-then-write on job_queue slots is racy: change reservation to an atomic claim (e.g., make job_queue[].claimed an std::atomic<bool> or std::atomic_flag) and perform an atomic test-and-set (compare_exchange_strong or test_and_set) to reserve the slot before writing any fields; once the atomic claim succeeds, initialize job_queue[i].job_id, problem_type, data_size, worker_* etc., then set job_queue[i].ready = true (non-atomic write) only after fully initialized; ensure headers (<atomic>) and any needed memory-ordering are used and that no other code resets claimed without clearing ready in the reverse order.cpp/src/grpc/client/solve_remote.cu (2)
53-57: ⚠️ Potential issue | 🟠 Major — Guard invalid env overrides before applying gRPC sizes.

`CUOPT_CHUNK_SIZE` and `CUOPT_MAX_MESSAGE_BYTES` are applied without sanity checks. Zero/negative values can break transfer behavior at runtime.

Suggested fix

```diff
 static void apply_env_overrides(grpc_client_config_t& config)
 {
-  config.chunk_size_bytes  = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
-  config.max_message_bytes = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
+  const auto chunk_size = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
+  const auto max_msg    = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
+  if (chunk_size > 0) { config.chunk_size_bytes = chunk_size; }
+  if (max_msg > 0) { config.max_message_bytes = max_msg; }
   config.enable_debug_log = (parse_env_int64("CUOPT_GRPC_DEBUG", 0) != 0);
```

As per coding guidelines: "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/solve_remote.cu` around lines 53 - 57, The env overrides in apply_env_overrides currently accept any parsed value from parse_env_int64 for chunk_size_bytes and max_message_bytes; update apply_env_overrides to validate the returned values before assignment: call parse_env_int64 for "CUOPT_CHUNK_SIZE" and "CUOPT_MAX_MESSAGE_BYTES", then only assign to config.chunk_size_bytes and config.max_message_bytes if the parsed value is > 0 and within sane bounds (e.g., enforce a minimum of 1 and a sensible upper cap to avoid resource exhaustion); leave enable_debug_log assignment using parse_env_int64 as-is. Ensure you reference apply_env_overrides, config.chunk_size_bytes, config.max_message_bytes, and parse_env_int64 when making the change.
163-166: ⚠️ Potential issue | 🟠 Major — Forward the real best bound to incumbent callbacks.

The current forwarding path sets `bound_copy` from the incumbent objective, which can report an artificial zero gap to user callbacks. This should carry the solver's actual best bound from the remote pipeline.
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/solve_remote.cu` around lines 163 - 166, The code currently sets bound_copy from objective, so get_callback->get_solution(...) forwards the incumbent objective as the bound and can report a zero gap; change the initialization of bound_copy to use the solver’s actual best-bound value from the remote pipeline (replace "bound_copy = objective" with the variable that holds the solver's best bound), then call get_callback->get_solution(solution_copy.data(), &obj_copy, &bound_copy, callback->get_user_data()); this ensures callbacks receive the real best bound instead of the incumbent objective.cpp/src/grpc/grpc_settings_mapper.cu (1)
144-146: ⚠️ Potential issue | 🟠 Major — Range-check `iteration_limit` before narrowing cast.

A large protobuf `iteration_limit` (int64_t) can overflow when cast to `i_t` (which may be int32_t), producing an invalid internal limit.

🔧 Suggested guard

```diff
 // proto3 defaults numeric fields to 0; treat negative iteration_limit as "unset"
 // so the server keeps the library default (typically max()).
 if (pb_settings.iteration_limit() >= 0) {
-  settings.iteration_limit = static_cast<i_t>(pb_settings.iteration_limit());
+  const auto limit = pb_settings.iteration_limit();
+  if (limit <= static_cast<int64_t>(std::numeric_limits<i_t>::max())) {
+    settings.iteration_limit = static_cast<i_t>(limit);
+  } else {
+    settings.iteration_limit = std::numeric_limits<i_t>::max();
+  }
 }
```

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/grpc_settings_mapper.cu` around lines 144 - 146, The narrowing cast from pb_settings.iteration_limit() to settings.iteration_limit (type i_t) can overflow; before the static_cast in the block that sets settings.iteration_limit, range-check the int64 value against the numeric limits of i_t and either clamp to std::numeric_limits<i_t>::max()/min() or return/report an error if out-of-range—update the code around the check of pb_settings.iteration_limit() to perform this validation and only then assign via static_cast<i_t>.cpp/src/grpc/client/grpc_client.cu (4)
258-266: ⚠️ Potential issue | 🔴 Critical — Add null stub guard and deadline to RPC methods.

Public RPC methods like `check_status` dereference `impl_->stub` without checking if a connection was established. Additionally, the `ClientContext` lacks a deadline, allowing indefinite blocking on network issues. The connection probe at line 191 correctly demonstrates both patterns.

🔧 Proposed fix pattern (apply to all RPC methods)

```diff
 job_status_result_t grpc_client_t::check_status(const std::string& job_id)
 {
   job_status_result_t result;
+  if (!impl_->stub) {
+    result.error_message = "Not connected to server";
+    return result;
+  }
   grpc::ClientContext context;
+  if (config_.timeout_seconds > 0) {
+    context.set_deadline(std::chrono::system_clock::now() +
+                         std::chrono::seconds(config_.timeout_seconds));
+  }
   auto request = build_status_request(job_id);
```

Apply the same pattern to: `wait_for_completion`, `cancel_job`, `delete_job`, `get_incumbents`, `stream_logs`, and all chunked upload/download methods.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/grpc_client.cu` around lines 258 - 266, The method grpc_client_t::check_status currently dereferences impl_->stub and creates a grpc::ClientContext without a deadline; add a null-stub guard at the start (check impl_ and impl_->stub and return a job_status_result_t error/result indicating "not connected") before calling build_status_request/CheckStatus, and set a deadline on the grpc::ClientContext (use std::chrono::... now()+configured RPC timeout) so the RPC won't block indefinitely; follow the same pattern for wait_for_completion, cancel_job, delete_job, get_incumbents, stream_logs and all chunked upload/download methods, using the same error/result handling style used by the connection probe at line 191.
991-995: ⚠️ Potential issue | 🔴 Critical — Validate `poll_interval_ms` before division.

Line 994 computes `max_polls` by dividing by `config_.poll_interval_ms`. If this value is `<= 0`, the division will either crash or produce invalid polling behavior. The same issue exists at line 1132.

🔧 Proposed validation

```diff
 } else {
   CUOPT_LOG_INFO("[grpc_client] Using polling (CheckStatus) for job %s", job_id.c_str());
+  if (config_.poll_interval_ms <= 0) {
+    stop_log_streaming();
+    result.error_message = "Invalid configuration: poll_interval_ms must be > 0";
+    return result;
+  }
   int poll_count = 0;
   int max_polls  = (config_.timeout_seconds * 1000) / config_.poll_interval_ms;
```
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/grpc_client.cu` around lines 991 - 995, The code computes max_polls by dividing by config_.poll_interval_ms without validating it; add a guard in the polling branch (where CUOPT_LOG_INFO logs "Using polling (CheckStatus) for job" and any other similar polling block around the second occurrence) to check config_.poll_interval_ms > 0 and handle invalid values: log an error via CUOPT_LOG_ERROR (include job_id/context), set a safe default (e.g., poll_interval_ms = 1) or return/fail early, then compute max_polls using the validated/clamped value; update both locations that reference config_.poll_interval_ms to use the validated variable.
702-712: ⚠️ Potential issue | 🔴 Critical — Validate chunked result metadata before arithmetic operations.

`element_size_bytes` is used at line 708 without checking for `<= 0`, causing division by zero. Additionally, `total_elems * elem_size` at line 711 can overflow before allocation, making subsequent `memcpy` operations unsafe.

🔧 Proposed bounds validation

```diff
 for (const auto& arr_desc : header->arrays()) {
   auto field_id       = arr_desc.field_id();
   int64_t total_elems = arr_desc.total_elements();
   int64_t elem_size   = arr_desc.element_size_bytes();
   if (total_elems <= 0) continue;
+  if (elem_size <= 0) {
+    last_error_ = "Invalid chunk metadata: non-positive element_size_bytes";
+    return false;
+  }
+  // Check for overflow before allocation
+  if (total_elems > std::numeric_limits<int64_t>::max() / elem_size) {
+    last_error_ = "Invalid chunk metadata: total byte size overflow";
+    return false;
+  }
+  int64_t total_bytes = total_elems * elem_size;
+  if (static_cast<uint64_t>(total_bytes) > std::numeric_limits<size_t>::max()) {
+    last_error_ = "Invalid chunk metadata: exceeds addressable memory";
+    return false;
+  }
   int64_t elems_per_chunk = chunk_data_budget / elem_size;
```

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/grpc_client.cu` around lines 702 - 712, The loop over header->arrays() uses arr_desc.element_size_bytes() and multiplies it by arr_desc.total_elements() without validation; add checks that element_size_bytes() > 0 (return/continue or log and skip if not) before computing elems_per_chunk to avoid division by zero, compute elems_per_chunk = max(1, chunk_data_budget / elem_size) only after that check, and validate that total_elements() * element_size_bytes() cannot overflow size_t before allocating array_bytes (e.g., if total_elems > SIZE_MAX / elem_size then handle as error/skip or cap/stream allocation); reference arr_desc.element_size_bytes(), arr_desc.total_elements(), chunk_data_budget, elems_per_chunk, and the allocation of array_bytes to implement these guards.
221-252: ⚠️ Potential issue | 🔴 Critical
Log streaming thread can hang indefinitely on shutdown.
The `stop_log_streaming()` method sets the `stop_logs_` flag and joins the thread, but `reader->Read()` at line 232 blocks indefinitely if the server stops sending messages. The `TryCancel()` call at line 234 only executes after `Read()` returns, not before. You need to call `context.TryCancel()` from the main thread before joining.

🔧 Proposed fix using context cancellation from stop_log_streaming
Add member variables to the class (in grpc_client.hpp):
```cpp
std::mutex log_context_mutex_;
grpc::ClientContext* active_log_context_ = nullptr;
```

Then update the methods:
```diff
 void grpc_client_t::start_log_streaming(const std::string& job_id)
 {
   if (!config_.stream_logs || !config_.log_callback) return;
   stop_logs_.store(false);
   log_thread_ = std::make_unique<std::thread>([this, job_id]() {
     grpc::ClientContext context;
+    {
+      std::lock_guard<std::mutex> lk(log_context_mutex_);
+      active_log_context_ = &context;
+    }
     auto request = build_stream_logs_request(job_id, 0);
     auto reader  = impl_->stub->StreamLogs(&context, request);
     cuopt::remote::LogMessage log_msg;
     while (reader->Read(&log_msg)) {
       if (stop_logs_.load()) {
-        context.TryCancel();
         break;
       }
       if (config_.log_callback) { config_.log_callback(log_msg.line()); }
       if (log_msg.job_complete()) { break; }
     }
     reader->Finish();
+    {
+      std::lock_guard<std::mutex> lk(log_context_mutex_);
+      active_log_context_ = nullptr;
+    }
   });
 }

 void grpc_client_t::stop_log_streaming()
 {
   stop_logs_.store(true);
+  {
+    std::lock_guard<std::mutex> lk(log_context_mutex_);
+    if (active_log_context_) { active_log_context_->TryCancel(); }
+  }
   if (log_thread_ && log_thread_->joinable()) { log_thread_->join(); }
   log_thread_.reset();
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/grpc_client.cu` around lines 221 - 252, The log streaming thread can block in reader->Read() so add a cancellable context pointer protected by a mutex: introduce members std::mutex log_context_mutex_ and grpc::ClientContext* active_log_context_ = nullptr; in grpc_client_t, then in start_log_streaming() create the local grpc::ClientContext context as before but immediately register it by locking log_context_mutex_ and setting active_log_context_ = &context before calling impl_->stub->StreamLogs(...); after the read/Finish() loop clear the pointer (lock and set to nullptr). In stop_log_streaming() lock log_context_mutex_, if active_log_context_ != nullptr call active_log_context_->TryCancel() (to wake up reader->Read()) before joining the thread, then proceed to join/reset log_thread_ and set stop_logs_. Ensure all accesses to active_log_context_ are mutex-protected to avoid race conditions and that the pointer only points to the stack context while the thread is running.

cpp/src/grpc/server/grpc_server_types.hpp (1)
300-308: ⚠️ Potential issue | 🔴 Critical
Remove non-async-signal-safe operations from signal handler.
`std::cout` and `condition_variable::notify_all()` are not async-signal-safe and can deadlock or corrupt state when called from a POSIX signal handler. The signal handler should only perform atomic writes.

🔧 Suggested approach
```diff
 inline void signal_handler(int signal)
 {
   if (signal == SIGINT || signal == SIGTERM) {
-    std::cout << "\n[gRPC Server] Received shutdown signal\n";
     keep_running = false;
     if (shm_ctrl) { shm_ctrl->shutdown_requested = true; }
-    result_cv.notify_all();
   }
 }
```

Move logging and `result_cv.notify_all()` to a dedicated shutdown-watcher thread that polls `keep_running == false` and then performs the cleanup.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_types.hpp` around lines 300 - 308, The signal_handler currently performs non-async-signal-safe operations (std::cout and result_cv.notify_all()), which must be removed; change signal_handler to only perform atomic-safe actions: set keep_running = false and, if shm_ctrl exists, set shm_ctrl->shutdown_requested = true (both already present), and return immediately; then implement a separate shutdown-watcher thread (e.g., spawned at server startup) that polls or waits on an atomic/condition to detect keep_running == false and from that thread perform logging (std::cout or logging framework), call result_cv.notify_all(), and any other cleanup work — locate signal_handler, keep_running, shm_ctrl, and result_cv to update behavior and add the watcher thread to the server lifecycle.

cpp/src/grpc/grpc_problem_mapper.cu (1)
399-427: ⚠️ Potential issue | 🔴 Critical
Add alignment validation before `memcpy` in chunked array decoding.
The `get_doubles` and `get_ints` lambdas compute element count via integer division but then copy the full byte array. If the payload size isn't aligned to element size, `memcpy` will copy garbage or partial data.

🔧 Proposed alignment checks
```diff
 auto get_doubles = [&](int32_t field_id) -> std::vector<f_t> {
   auto it = arrays.find(field_id);
   if (it == arrays.end() || it->second.empty()) return {};
+  if (it->second.size() % sizeof(double) != 0) return {};  // Misaligned data
   size_t n = it->second.size() / sizeof(double);
   if constexpr (std::is_same_v<f_t, double>) {
     std::vector<double> v(n);
-    std::memcpy(v.data(), it->second.data(), it->second.size());
+    std::memcpy(v.data(), it->second.data(), n * sizeof(double));
     return v;
   } else {
     // ...
   }
 };

 auto get_ints = [&](int32_t field_id) -> std::vector<i_t> {
   auto it = arrays.find(field_id);
   if (it == arrays.end() || it->second.empty()) return {};
+  if (it->second.size() % sizeof(int32_t) != 0) return {};  // Misaligned data
   size_t n = it->second.size() / sizeof(int32_t);
   // ...
 };
```
As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/grpc_problem_mapper.cu` around lines 399 - 427, The lambdas get_doubles and get_ints must validate byte alignment before memcpy: compute size_t bytes = it->second.size(); if bytes == 0 return {}; if (bytes % sizeof(double) != 0) (for get_doubles) or bytes % sizeof(int32_t) != 0 (for get_ints) then handle the misaligned payload (log/return error/throw) instead of blindly memcpy; use size_t n = bytes / sizeof(...) and pass bytes (or n * sizeof(...)) to memcpy rather than it->second.size(); update references in get_doubles/get_ints and ensure the branch where types differ still copies only the aligned byte count and converts only n elements.
🧹 Nitpick comments (3)
cpp/src/grpc/server/grpc_field_element_size.hpp (1)
17-29: Avoid silent fallback for unknown `ArrayFieldId`.
default: return 8;can mask unsupported/new enum values and mis-size chunk parsing. Prefer an explicit invalid path so callers can reject the request.As per coding guidelines: "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files."Suggested change
inline int64_t array_field_element_size(cuopt::remote::ArrayFieldId field_id) { switch (field_id) { @@ - default: return 8; + default: return -1; // unknown field id; caller should fail validation } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_field_element_size.hpp` around lines 17 - 29, The function array_field_element_size(cuopt::remote::ArrayFieldId) silently falls back to return 8 for unknown enum values; change it to explicitly handle unsupported/new enums by logging/propagating an error or failing fast (e.g., throw std::invalid_argument or assert) instead of returning 8 so callers can reject invalid requests and avoid mis-sized parsing; update array_field_element_size to remove the silent default and return/throw a clear failure for unexpected cuopt::remote::ArrayFieldId values and ensure callers handle that error path.build.sh (1)
7-7: Enable unbound-variable checks in the script preamble.At Line 7, prefer
set -euto catch unset variable usage early.Based on learnings: "In this repository, prefer using 'set -u' in Bash scripts to detect unbound variables and rely on the default unbound-variable error messages rather than implementing explicit guards with custom error messages."Suggested change
-set -e +set -eu🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@build.sh` at line 7, The script preamble currently uses "set -e"; change this to enable unbound-variable checks by replacing that occurrence with "set -eu" (i.e., add the -u flag) so the shell will error on unset variables; audit any places that intentionally rely on unset variables and initialize or guard them accordingly (locate the "set -e" token in build.sh and update it).cpp/src/grpc/server/grpc_server_threads.cpp (1)
166-182: Deserialize outsidetracker_mutexto reduce contention.
deserialize_result_pipe_blob(...)can be expensive; parsing before taking the lock and only committing parsed state under lock will improve concurrency.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_threads.cpp` around lines 166 - 182, deserialize_result_pipe_blob(...) is done while holding tracker_mutex causing unnecessary contention; move the expensive deserialize step out of the critical section by first copying result_data and success locally, call deserialize_result_pipe_blob(result_data.data(), result_data.size(), hdr, arrays) before acquiring tracker_mutex, then acquire lock(tracker_mutex), find job_tracker entry by job_id, verify it still exists and update it: set status to COMPLETED or FAILED (JobStatus), set error_message on failure, or assign result_header, result_arrays and result_size_bytes on success; ensure hdr and arrays are moved into it->second to avoid copying and that result_size_bytes is set from result_data.size().
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d12476d8-baa0-42cb-9006-1d214ffa2716
📒 Files selected for processing (38)
GRPC_ARCHITECTURE.mdSERVER_ARCHITECTURE.mdbuild.shcpp/CMakeLists.txtcpp/cuopt_grpc_server.cppcpp/src/grpc/client/grpc_client.cucpp/src/grpc/client/grpc_client.hppcpp/src/grpc/client/solve_remote.cucpp/src/grpc/client/test_grpc_client.cppcpp/src/grpc/cuopt_remote.protocpp/src/grpc/cuopt_remote_service.protocpp/src/grpc/grpc_problem_mapper.cucpp/src/grpc/grpc_problem_mapper.hppcpp/src/grpc/grpc_service_mapper.cucpp/src/grpc/grpc_service_mapper.hppcpp/src/grpc/grpc_settings_mapper.cucpp/src/grpc/grpc_settings_mapper.hppcpp/src/grpc/grpc_solution_mapper.cucpp/src/grpc/grpc_solution_mapper.hppcpp/src/grpc/server/grpc_field_element_size.hppcpp/src/grpc/server/grpc_incumbent_proto.hppcpp/src/grpc/server/grpc_job_management.cppcpp/src/grpc/server/grpc_pipe_serialization.hppcpp/src/grpc/server/grpc_server_main.cppcpp/src/grpc/server/grpc_server_threads.cppcpp/src/grpc/server/grpc_server_types.hppcpp/src/grpc/server/grpc_service_impl.cppcpp/src/grpc/server/grpc_worker.cppcpp/src/grpc/server/grpc_worker_infra.cppcpp/src/pdlp/CMakeLists.txtcpp/src/pdlp/solve_remote.cucpp/tests/linear_programming/CMakeLists.txtcpp/tests/linear_programming/grpc/CMakeLists.txtcpp/tests/linear_programming/grpc/grpc_client_test.cppcpp/tests/linear_programming/grpc/grpc_client_test_helper.hppcpp/tests/linear_programming/grpc/grpc_integration_test.cppcpp/tests/linear_programming/grpc/grpc_test_log_capture.hpppython/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
💤 Files with no reviewable changes (2)
- cpp/src/pdlp/CMakeLists.txt
- cpp/src/pdlp/solve_remote.cu
🚧 Files skipped from review as they are similar to previous changes (5)
- cpp/src/grpc/server/grpc_incumbent_proto.hpp
- python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
- cpp/src/grpc/server/grpc_service_impl.cpp
- cpp/src/grpc/server/grpc_worker.cpp
- cpp/src/grpc/grpc_solution_mapper.cu
Actionable comments posted: 13
♻️ Duplicate comments (25)
cpp/src/grpc/server/grpc_job_management.cpp (2)
147-158: ⚠️ Potential issue | 🔴 Critical
`submit_job_async` still uses racy check-then-write slot acquisition.
Lines 147-158 can let concurrent submitters claim/overwrite the same queue slot. Use atomic reservation (`compare_exchange`) and publish ordering.

Proposed fix
```diff
-  for (size_t i = 0; i < MAX_JOBS; ++i) {
-    if (!job_queue[i].ready && !job_queue[i].claimed) {
+  for (size_t i = 0; i < MAX_JOBS; ++i) {
+    if (job_queue[i].ready.load()) { continue; }
+    bool expected_claimed = false;
+    if (job_queue[i].claimed.compare_exchange_strong(expected_claimed, true)) {
       copy_cstr(job_queue[i].job_id, job_id);
       job_queue[i].problem_type = is_mip ? 1 : 0;
       job_queue[i].data_size    = request_data.size();
       job_queue[i].worker_pid   = 0;
       job_queue[i].worker_index = -1;
-      job_queue[i].data_sent = false;
-      job_queue[i].claimed   = false;
-      job_queue[i].cancelled = false;
-      job_queue[i].ready     = true;
+      job_queue[i].data_sent.store(false);
+      job_queue[i].cancelled.store(false);
+      job_queue[i].ready.store(true, std::memory_order_release);
+      job_queue[i].claimed.store(false, std::memory_order_release);
```
As per coding guidelines "Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 147 - 158, submit_job_async currently uses a racy check-then-write loop over job_queue (iterating i from 0..MAX_JOBS) checking job_queue[i].ready and job_queue[i].claimed then writing the slot; change this to atomically reserve a slot using an atomic CAS on a dedicated slot state (e.g., atomically compare_exchange the claimed/empty flag for job_queue[i]) before writing any fields, and use release semantics when publishing ready (store with memory_order_release) and acquire semantics when consumers read (memory_order_acquire); update submit_job_async to attempt compare_exchange on job_queue[i].claimed (or a new atomic state) to transition from empty->reserved, only write the job fields after successful CAS, and then set ready=true with appropriate publish ordering so concurrent submitters cannot overwrite the same slot.
109-113: ⚠️ Potential issue | 🟠 Major
Bound incumbent payload size before `resize` to prevent memory exhaustion.
Line 111 trusts a pipe-provided size and allocates directly. A malformed size prefix can force unbounded allocation.

Proposed fix
```diff
 bool recv_incumbent_pipe(int fd, std::vector<uint8_t>& data)
 {
   uint64_t size;
   if (!read_from_pipe(fd, &size, sizeof(size))) return false;
+  if (size > static_cast<uint64_t>(config.max_message_bytes)) {
+    std::cerr << "[Worker] Incumbent payload too large: " << size << "\n";
+    return false;
+  }
   data.resize(size);
   if (size > 0 && !read_from_pipe(fd, data.data(), size)) return false;
   return true;
 }
```
As per coding guidelines "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 109 - 113, The code reads an untrusted uint64_t size from a pipe and calls data.resize(size) directly; bound the incoming size before resizing to prevent memory exhaustion by introducing a sensible constant (e.g. JOB_PAYLOAD_MAX or MAX_PAYLOAD_SIZE) and rejecting sizes > that limit (return false) or otherwise handling them; also validate zero/negative semantics if needed and use that check right after read_from_pipe(fd, &size, ...) and before data.resize(size) (referencing the local variable size, function read_from_pipe, and the vector/data.resize call).cpp/src/grpc/cuopt_remote.proto (1)
6-6: ⚠️ Potential issue | 🟠 Major
Buf package-directory mismatch is still unresolved.
Line 6 (`package cuopt.remote;`) remains in a path that violates Buf `PACKAGE_DIRECTORY_MATCH` for this module layout, so lint/CI will continue to fail until the package/path (or Buf roots) are aligned.
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/cuopt_remote.proto` at line 6, The proto package declaration in cuopt_remote.proto ("package cuopt.remote;") doesn't match the file path, triggering Buf's PACKAGE_DIRECTORY_MATCH rule; fix by either updating the package to reflect the file's directory structure (e.g., change package to match the actual path) or move cuopt_remote.proto into a directory layout matching "cuopt/remote" (or adjust buf.yaml roots/module configuration to include the current path). Ensure the package token "cuopt.remote" and the filesystem layout are aligned so Buf lint passes.cpp/src/grpc/server/grpc_service_impl.cpp (1)
71-73: ⚠️ Potential issue | 🔴 Critical
Publish `ready` before releasing `claimed` to prevent queue-slot double allocation.
At Line 71 and Line 265, releasing `claimed` before publishing `ready` re-opens a claim window for another submitter on the same slot.

Proposed fix (both locations)
```diff
-  job_queue[job_idx].claimed.store(false);
-  job_queue[job_idx].ready.store(true);
+  job_queue[job_idx].ready.store(true, std::memory_order_release);
+  job_queue[job_idx].claimed.store(false, std::memory_order_release);
```
As per coding guidelines "Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state".
Also applies to: 265-267
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 71 - 73, The code currently clears job_queue[job_idx].claimed before publishing job_queue[job_idx].ready, which allows another submitter to claim the same slot between those two stores; swap the operations so you set job_queue[job_idx].ready.store(true) before job_queue[job_idx].claimed.store(false) (and apply the same swap at the other occurrence around lines 265-267) to ensure the slot is marked available only after the ready flag is visible; if using explicit memory orders, use release semantics for the store to ready to avoid the race.

SERVER_ARCHITECTURE.md (1)
15-46: ⚠️ Potential issue | 🟡 Minor
Add fence languages to satisfy markdownlint MD040
Several fenced blocks are untyped; markdownlint MD040 will keep warning/failing until those fences specify a language (e.g., `text`).

Proposed patch
````diff
-```
+```text
 ┌────────────────────────────────────────────────────────────────────┐
 │                        Main Server Process                         │
 ...
 └─────────────────┘  └─────────────────┘  └─────────────────┘
-```
+```
@@
-```
+```text
 Client                          Server                         Worker
 ...
-```
+```
@@
-```
+```text
 Client                          Server                         Worker
 ...
-```
+```
@@
-```
+```text
 Client                          Server                         Worker
 ...
-```
+```
@@
-```
+```text
 Client                          Worker
 ...
-```
+```
@@
-```
+```text
 ┌─────────┐  submit   ┌───────────┐  claim   ┌────────────┐
 ...
 └───────────┘             └─────────┘
-```
+```
````

Also applies to: 110-118, 122-136, 139-148, 154-168, 207-217
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@SERVER_ARCHITECTURE.md` around lines 15 - 46, The markdown has multiple untyped fenced code blocks (the large ASCII "Main Server Process" diagram and the smaller "Client / Server / Worker" and "submit / claim" pipeline diagrams) causing markdownlint MD040 failures; update each triple-backtick fence around these ASCII diagrams in SERVER_ARCHITECTURE.md to specify a language (use "text") so the fences read ```text ... ```; look for the big box diagram containing "Main Server Process", the repeated "Client Server Worker" blocks, and the "submit claim" pipeline block and add the language token to each opening fence.

cpp/src/grpc/server/grpc_server_main.cpp (3)
156-168: ⚠️ Potential issue | 🔴 Critical
Don’t `memset` shared-memory structs that contain `std::atomic`
`memset()` on a type with `std::atomic` members is undefined behavior in C++. Even if you immediately `.store()` afterwards, the atomic object’s lifetime/rules are already violated.

Safer initialization pattern
```diff
 for (size_t i = 0; i < MAX_JOBS; ++i) {
-  memset(&job_queue[i], 0, sizeof(JobQueueEntry));
+  new (&job_queue[i]) JobQueueEntry{};
   job_queue[i].ready.store(false);
   job_queue[i].claimed.store(false);
   job_queue[i].cancelled.store(false);
   job_queue[i].worker_index.store(-1);
 }
 for (size_t i = 0; i < MAX_RESULTS; ++i) {
-  memset(&result_queue[i], 0, sizeof(ResultQueueEntry));
+  new (&result_queue[i]) ResultQueueEntry{};
   result_queue[i].ready.store(false);
   result_queue[i].retrieved.store(false);
 }
```
As per coding guidelines, “Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 156 - 168, The code currently calls memset on job_queue[i] and result_queue[i] which is undefined for types containing std::atomic; remove the memset calls and instead properly construct/value-initialize each entry (e.g., use placement new: new (&job_queue[i]) JobQueueEntry(); new (&result_queue[i]) ResultQueueEntry();) and then set the atomic fields via their .store(...) calls (worker_index.store(-1), ready.store(false), claimed.store(false), cancelled.store(false), retrieved.store(false)); update the loops that touch job_queue and result_queue (and references to MAX_JOBS / MAX_RESULTS) to use this safe initialization pattern.
173-212: ⚠️ Potential issue | 🔴 Critical
Avoid `std::terminate` on TLS error paths; guard `BuildAndStart()` failure
TLS validation errors currently `return` after starting workers/threads (joinable thread dtors will call `std::terminate`). Separately, `BuildAndStart()` can return null and `server->Wait()` will crash. Both need a single cleanup/teardown path.

Minimal direction: introduce a shutdown helper and use it on all early-exit paths
```diff
   spawn_workers();
   std::thread result_thread(result_retrieval_thread);
   std::thread incumbent_thread(incumbent_retrieval_thread);
   std::thread monitor_thread(worker_monitor_thread);
   std::thread reaper_thread(session_reaper_thread);
+  auto shutdown_runtime = [&]() {
+    keep_running                 = false;
+    shm_ctrl->shutdown_requested = true;
+    result_cv.notify_all();
+    if (result_thread.joinable()) result_thread.join();
+    if (incumbent_thread.joinable()) incumbent_thread.join();
+    if (monitor_thread.joinable()) monitor_thread.join();
+    if (reaper_thread.joinable()) reaper_thread.join();
+    wait_for_workers();
+    cleanup_shared_memory();
+  };
@@
   if (config.enable_tls) {
     if (config.tls_cert_path.empty() || config.tls_key_path.empty()) {
       std::cerr << "[Server] TLS enabled but --tls-cert/--tls-key not provided\n";
-      return 1;
+      shutdown_runtime();
+      return 1;
     }
@@
     if (ssl_opts.pem_root_certs.empty()) {
       std::cerr << "[Server] --require-client-cert requires --tls-root\n";
-      return 1;
+      shutdown_runtime();
+      return 1;
     }
@@
   std::unique_ptr<Server> server(builder.BuildAndStart());
+  if (!server) {
+    std::cerr << "[gRPC Server] Failed to start server on " << server_address << "\n";
+    shutdown_runtime();
+    return 1;
+  }
@@
   server->Wait();
```
As per coding guidelines, “Prevent thread-unsafe use of global and static variables; use proper mutex/synchronization in server code accessing shared solver state.”
Also applies to: 234-251
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 173 - 212, Spawned worker threads and early TLS/BuildAndStart error paths can return while threads are still joinable (causing std::terminate) and BuildAndStart() can return null causing a crash; introduce a single shutdown helper (e.g., shutdown_server_and_workers()) that stops workers, signals threads to exit, joins result_retrieval_thread, incumbent_retrieval_thread, worker_monitor_thread, session_reaper_thread and cleans up any resources and the grpc::Server pointer, call this helper on all early-exit/TLS error paths and when builder.BuildAndStart() returns nullptr before returning non-zero, and replace ad-hoc returns with calls to this helper; additionally ensure shared solver/global state access in server handlers uses proper mutex/synchronization (e.g., std::mutex/lock_guard around the solver state) to avoid thread-unsafe access.
50-80: ⚠️ Potential issue | 🟠 Major
CLI parsing should reject unknown/incomplete flags and handle `std::stoi` failures
Right now unknown flags and missing values are silently ignored, and `std::stoi`/`std::stoll` can throw and abort the server. Also, `--chunk-timeout` is documented but not parsed at all.

Example patch sketch (apply pattern to all flags)
```diff
 for (int i = 1; i < argc; i++) {
   std::string arg = argv[i];
   if (arg == "-p" || arg == "--port") {
-    if (i + 1 < argc) { config.port = std::stoi(argv[++i]); }
+    if (i + 1 >= argc) {
+      std::cerr << "ERROR: " << arg << " requires a value\n";
+      print_usage(argv[0]);
+      return 1;
+    }
+    try {
+      config.port = std::stoi(argv[++i]);
+    } catch (const std::exception&) {
+      std::cerr << "ERROR: Invalid port value for " << arg << ": '" << argv[i] << "'\n";
+      return 1;
+    }
   } else if (arg == "--max-message-bytes") {
-    if (i + 1 < argc) { config.max_message_b = std::max(4096LL, std::stoll(argv[++i])); }
+    if (i + 1 >= argc) {
+      std::cerr << "ERROR: " << arg << " requires a value\n";
+      return 1;
+    }
+    try {
+      config.max_message_b = std::max<int64_t>(4096LL, std::stoll(argv[++i]));
+    } catch (const std::exception&) {
+      std::cerr << "ERROR: Invalid value for " << arg << ": '" << argv[i] << "'\n";
+      return 1;
+    }
+  } else if (arg == "--chunk-timeout") {
+    if (i + 1 >= argc) {
+      std::cerr << "ERROR: " << arg << " requires a value\n";
+      return 1;
+    }
+    try {
+      config.chunk_timeout_seconds = std::stoi(argv[++i]);
+    } catch (const std::exception&) {
+      std::cerr << "ERROR: Invalid value for " << arg << ": '" << argv[i] << "'\n";
+      return 1;
+    }
   } else if (arg == "-h" || arg == "--help") {
     print_usage(argv[0]);
     return 0;
+  } else {
+    std::cerr << "ERROR: Unknown option '" << arg << "'\n";
+    print_usage(argv[0]);
+    return 1;
   }
 }
```
As per coding guidelines, “Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 50 - 80, The CLI parsing loop in main currently silently ignores unknown flags and missing values, and calls std::stoi/std::stoll without catching exceptions; also the documented flag "--chunk-timeout" is not parsed. Update the argv parsing code (the for loop handling argv[], touching config.port, config.num_workers, config.max_message_mb, config.max_message_b, config.enable_transfer_hash, config.enable_tls, config.tls_cert_path, config.tls_key_path, config.tls_root_path, config.require_client, config.log_to_console, config.verbose) to: 1) validate that flags requiring a value check (i + 1 < argc) and on failure print_usage(argv[0]) and return a non-zero error code; 2) wrap std::stoi/std::stoll usages in try/catch(std::exception&) to log a clear error and exit non-zero on parse failure; 3) handle unknown flags by printing an error + usage and exiting non-zero; and 4) add parsing for "--chunk-timeout" to populate the appropriate config field (e.g., config.chunk_timeout) following the same validation and exception-handling pattern. Ensure error messages include the offending flag for easier debugging.

build.sh (1)
390-398: ⚠️ Potential issue | 🟠 Major
`cuopt_grpc_server` fast path still captures mixed-target builds
As written, `build.sh cuopt_grpc_server cuopt` will still take the server-only branch and skip the requested install target(s). That’s a surprising foot-gun for devs/CI.

Proposed patch
```diff
-  if hasArg cuopt_grpc_server && ! hasArg libcuopt && ! buildAll; then
+  if hasArg cuopt_grpc_server && ! hasArg libcuopt && ! buildAll \
+     && ! hasArg cuopt && ! hasArg cuopt_server && ! hasArg cuopt_mps_parser \
+     && ! hasArg cuopt_sh_client && ! hasArg docs && ! hasArg deb \
+     && ! hasArg libmps_parser; then
     # Build only the gRPC server (ninja resolves libcuopt as a dependency)
     cmake --build "${LIBCUOPT_BUILD_DIR}" --target cuopt_grpc_server ${VERBOSE_FLAG} ${JFLAG}
```
Verify each finding against the current code and only fix it if needed. In `@build.sh` around lines 390 - 398, The fast-path condition that builds only cuopt_grpc_server (the if branch using hasArg cuopt_grpc_server && ! hasArg libcuopt && ! buildAll) incorrectly triggers when other targets (e.g., cuopt) are also requested; change the guard so it only takes the server-only branch when cuopt_grpc_server is the sole target requested (for example replace the condition with a new predicate like hasOnlyArg cuopt_grpc_server or add explicit negations for all other possible target args / check INSTALL_TARGET equals cuopt_grpc_server), updating the if in build.sh that surrounds the cmake --build "${LIBCUOPT_BUILD_DIR}" --target cuopt_grpc_server block so mixed-target invocations fall through to the normal install/build branch.

cpp/src/grpc/client/test_grpc_client.cpp (1)
381-412: ⚠️ Potential issue | 🟠 Major
Harden numeric argument parsing and validate `--time-limit` / incumbent index
`std::stod`/`std::stoll` throw on bad input, and `static_cast<int>(time_limit)` can be undefined if the value is out of range. This tool should fail with a clean CLI error instead of terminating.

Proposed patch
```diff
   while (arg_idx < argc && argv[arg_idx][0] == '-') {
     std::string arg = argv[arg_idx];
@@
     } else if (arg == "--time-limit" && arg_idx + 1 < argc) {
-      time_limit = std::stod(argv[++arg_idx]);
+      const char* value = argv[++arg_idx];
+      try {
+        time_limit = std::stod(value);
+      } catch (const std::exception&) {
+        std::cerr << "ERROR: Invalid value for --time-limit: '" << value << "'\n";
+        return 1;
+      }
+      if (!std::isfinite(time_limit) || time_limit <= 0.0) {
+        std::cerr << "ERROR: --time-limit must be a finite positive number (got '" << value
+                  << "')\n";
+        return 1;
+      }
       arg_idx++;
@@
   grpc_client_config_t config;
   config.server_address = server_address;
-  config.timeout_seconds = static_cast<int>(time_limit) + 120;
+  if (time_limit > static_cast<double>(std::numeric_limits<int>::max() - 120)) {
+    std::cerr << "ERROR: --time-limit too large (max "
+              << (std::numeric_limits<int>::max() - 120) << ")\n";
+    return 1;
+  }
+  config.timeout_seconds = static_cast<int>(time_limit) + 120;
@@
   } else if (mode == "incumbent") {
@@
     std::string job_id = argv[arg_idx++];
-    int64_t from_index = (arg_idx < argc) ? std::stoll(argv[arg_idx]) : 0;
+    int64_t from_index = 0;
+    if (arg_idx < argc) {
+      const char* value = argv[arg_idx];
+      try {
+        from_index = std::stoll(value);
+      } catch (const std::exception&) {
+        std::cerr << "ERROR: Invalid incumbent index: '" << value << "'\n";
+        return 1;
+      }
+      if (from_index < 0) {
+        std::cerr << "ERROR: incumbent index must be >= 0 (got '" << value << "')\n";
+        return 1;
+      }
+    }
     return mode_incumbent(client, job_id, from_index);
   }
```
Also applies to: 425-426, 531-532
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/test_grpc_client.cpp` around lines 381 - 412, The CLI currently calls std::stod (for time_limit) and std::stoll (for the incumbent index elsewhere) which throw on invalid input and can lead to undefined behavior when casting large values; update the argument parsing in the main option loop (handling --time-limit / variable time_limit) to validate input robustly: parse numeric arguments using non-throwing parsing (e.g., std::from_chars for integers and a safe check for floating parsing or wrap std::stod in try/catch), detect conversion failure or out-of-range values, emit a clear error message via std::cerr and return a non-zero exit code, and before doing static_cast<int>(time_limit) or casting the stoll result ensure the parsed value fits the target range to avoid undefined behavior; apply the same validation/failure handling to the other occurrences parsing integers (the std::stoll sites noted in the review).

cpp/src/grpc/client/solve_remote.cu (2)
164-166: ⚠️ Potential issue | 🟠 Major

Do not pass incumbent objective as best-bound in callbacks.

Line 164–166 sets `bound_copy` equal to `objective`, which reports a fake zero-gap state to callback consumers. A safe interim behavior is to pass `NaN` (unknown bound) until the incumbent protocol includes a real bound field end-to-end.

🔧 Interim mitigation in this file

```diff
+#include <limits>
 ...
-  double bound_copy = objective;  // Use objective as bound for incumbent
+  double bound_copy = std::numeric_limits<double>::quiet_NaN();
```
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/solve_remote.cu` around lines 164 - 166, The callback currently passes the incumbent objective as the best-bound (bound_copy = objective) to get_callback->get_solution, which falsely reports a zero-gap; change the interim behavior to pass an unknown bound (use NaN) instead of objective when calling get_callback->get_solution (affecting bound_copy, get_callback->get_solution(..., &bound_copy, callback->get_user_data()) and related solution_copy/obj_copy variables) so callbacks receive NaN until the protocol supplies a real bound end-to-end.
55-57: ⚠️ Potential issue | 🟠 Major

Validate env sizing overrides before applying them.

Line 55–57 still accepts zero/negative values for `CUOPT_CHUNK_SIZE` and `CUOPT_MAX_MESSAGE_BYTES`, which can destabilize transfer behavior.

🔧 Proposed guardrails

```diff
-  config.chunk_size_bytes  = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
-  config.max_message_bytes = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
+  auto chunk_size = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
+  auto max_msg    = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
+  if (chunk_size > 0) { config.chunk_size_bytes = chunk_size; }
+  if (max_msg > 0) { config.max_message_bytes = max_msg; }
```

As per coding guidelines: "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/solve_remote.cu` around lines 55 - 57, The env overrides for CUOPT_CHUNK_SIZE and CUOPT_MAX_MESSAGE_BYTES may be zero/negative; update the code that calls parse_env_int64 to validate the returned value before assigning to config.chunk_size_bytes and config.max_message_bytes (e.g., call parse_env_int64 for each into a temp variable, check temp > 0 and optionally cap to a safe maximum, then assign to config.*; if invalid, retain the existing default and emit a debug/warn via config.enable_debug_log or logger). Ensure you reference parse_env_int64, config.chunk_size_bytes, config.max_message_bytes, and the CUOPT_CHUNK_SIZE / CUOPT_MAX_MESSAGE_BYTES env names when implementing the guard.cpp/CMakeLists.txt (1)
279-294: ⚠️ Potential issue | 🟠 Major

Resolve `protoc` via `$<TARGET_FILE:protobuf::protoc>` to support in-tree protobuf targets.

Line 279–294 still assumes `protobuf::protoc` has `IMPORTED_LOCATION*`. That fails for non-imported targets (e.g., FetchContent/add_subdirectory), causing false fatal errors.

🔧 Proposed CMake adjustment

```diff
 if(TARGET protobuf::protoc)
-  get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION_RELEASE)
-  if(NOT _PROTOBUF_PROTOC)
-    get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION)
-  endif()
+  set(_PROTOBUF_PROTOC $<TARGET_FILE:protobuf::protoc>)
 else()
   find_package(protobuf CONFIG REQUIRED)
-  get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION_RELEASE)
-  if(NOT _PROTOBUF_PROTOC)
-    get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION)
-  endif()
+  if(TARGET protobuf::protoc)
+    set(_PROTOBUF_PROTOC $<TARGET_FILE:protobuf::protoc>)
+  endif()
 endif()
```

In CMake, does `get_target_property(... IMPORTED_LOCATION)` work for non-imported targets (e.g., created via add_subdirectory/FetchContent), and is `$<TARGET_FILE:...>` the recommended way to reference executables in `add_custom_command(COMMAND ...)`?
Verify each finding against the current code and only fix it if needed. In `@cpp/CMakeLists.txt` around lines 279 - 294, The current logic assumes protobuf::protoc has IMPORTED_LOCATION properties and can fail for in-tree targets; change the resolution to first check if the target exists and set _PROTOBUF_PROTOC to the generator expression $<TARGET_FILE:protobuf::protoc> when TARGET protobuf::protoc is true, otherwise fall back to get_target_property(... IMPORTED_LOCATION_RELEASE/IMPORTED_LOCATION) or the PROTOBUF_PROTOC_EXECUTABLE cache variable; update any add_custom_command(...) references to use ${_PROTOBUF_PROTOC} and keep the existing fatal error check if _PROTOBUF_PROTOC is still empty.cpp/src/grpc/server/grpc_pipe_serialization.hpp (3)
96-100: ⚠️ Potential issue | 🔴 Critical

Validate `element_offset` and overflow before pointer arithmetic in result deserialization.

Line 96–100 still allows negative offsets and unchecked `offset + chunk_size` arithmetic, which can underflow/overflow bounds checks and lead to unsafe `memcpy`.

🔧 Proposed fix

```diff
-  int64_t offset         = ac.element_offset();
-  const auto& chunk_data = ac.data();
-  if (offset + static_cast<int64_t>(chunk_data.size()) <= static_cast<int64_t>(dest.size())) {
-    std::memcpy(dest.data() + offset, chunk_data.data(), chunk_data.size());
-  }
+  int64_t offset         = ac.element_offset();
+  const auto& chunk_data = ac.data();
+  if (offset < 0) { return false; }
+  size_t uoffset = static_cast<size_t>(offset);
+  if (uoffset > dest.size()) { return false; }
+  if (chunk_data.size() > dest.size() - uoffset) { return false; }
+  std::memcpy(dest.data() + uoffset, chunk_data.data(), chunk_data.size());
```

As per coding guidelines: "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 96 - 100, The current deserialization uses offset = ac.element_offset() and then does pointer arithmetic without validating offset or preventing overflow/underflow before memcpy; update the logic in grpc_pipe_serialization.hpp around the block using offset, chunk_data and dest so you first validate element_offset() is non-negative and within dest.size(), compute size_t chunk_size = chunk_data.size(), and perform a safe bounds check using size_t conversions (e.g., offset <= dest.size() && offset + chunk_size <= dest.size()) while guarding against overflow on offset + chunk_size; only call std::memcpy(dest.data() + offset, ...) when those checks pass and otherwise handle the error/return early.
90-90: ⚠️ Potential issue | 🔴 Critical

Return failure on non-EOF parse errors in both deserializers.

Line 90 and Line 148 still `break` on any parse failure and eventually return `true`, which can accept malformed/truncated payloads as valid.

🔧 Proposed fix

```diff
-    if (!google::protobuf::util::ParseDelimitedFromCodedStream(&ac, &coded, &clean_eof)) break;
+    if (!google::protobuf::util::ParseDelimitedFromCodedStream(&ac, &coded, &clean_eof)) {
+      if (clean_eof) { break; }  // normal end-of-stream
+      return false;              // malformed/truncated message
+    }
```

As per coding guidelines: "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files."

Also applies to: 148-148
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` at line 90, The loops calling google::protobuf::util::ParseDelimitedFromCodedStream (using variables ac, coded, clean_eof) currently use "break" on parse failure which lets the function return true for truncated/malformed input; change those breaks to return false on a parse failure unless the failure was a clean EOF. Concretely, at both call sites (around the ParseDelimitedFromCodedStream lines referenced) check the clean_eof flag after a failed parse and return false for non-EOF failures (only allow breaking/ending the loop when clean_eof is true).
151-163: ⚠️ Potential issue | 🟠 Major

Guard `total_elements * elem_size` and byte-offset math before resize/copy.

Line 151–163 still performs unchecked multiplication and resize from untrusted chunk metadata. A malformed payload can overflow size math or trigger excessive allocation.

🔧 Proposed fix

```diff
+  if (ac.total_elements() < 0) { return false; }
   if (dest.empty() && ac.total_elements() > 0) {
     int64_t elem_size = array_field_element_size(ac.field_id());
-    dest.resize(static_cast<size_t>(ac.total_elements() * elem_size), 0);
+    if (elem_size <= 0) { return false; }
+    uint64_t total_bytes = static_cast<uint64_t>(ac.total_elements()) *
+                           static_cast<uint64_t>(elem_size);
+    if (total_bytes > static_cast<uint64_t>(std::numeric_limits<size_t>::max())) {
+      return false;
+    }
+    dest.resize(static_cast<size_t>(total_bytes), 0);
   }
```

As per coding guidelines: "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 151 - 163, The code in the block that resizes dest and copies chunk bytes must validate sizes to avoid integer overflow and huge allocations: before calling dest.resize(ac.total_elements() * elem_size) and before computing byte_offset, check that ac.total_elements() and computed elem_size are > 0 and that static_cast<size_t>(ac.total_elements()) <= SIZE_MAX / static_cast<size_t>(elem_size); bound the requested allocation against a safe maximum (or return an error) to prevent resource exhaustion; verify chunk_data.size() is <= dest.size() and that byte_offset >= 0 and byte_offset <= dest.size() and that byte_offset + chunk_data.size() does not overflow (use size_t or checked arithmetic) before calling std::memcpy; refer to symbols dest, ac, array_field_element_size(), total_elements(), element_offset(), data(), elem_size, byte_offset, resize, and std::memcpy when making these checks.cpp/src/grpc/server/grpc_server_types.hpp (1)
300-308: ⚠️ Potential issue | 🔴 Critical

Remove async-signal-unsafe operations from `signal_handler()`.

`std::cout` and `condition_variable::notify_all()` are not async-signal-safe and can deadlock/corrupt state when invoked from a POSIX signal handler.

Suggested minimal signal-safe handler

```diff
 inline void signal_handler(int signal)
 {
   if (signal == SIGINT || signal == SIGTERM) {
-    std::cout << "\n[gRPC Server] Received shutdown signal\n";
     keep_running = false;
     if (shm_ctrl) { shm_ctrl->shutdown_requested = true; }
-    result_cv.notify_all();
   }
 }
```

Then have a normal thread (or the main loop) observe `keep_running == false` and do logging + `result_cv.notify_all()` outside the signal context.
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_types.hpp` around lines 300 - 308, signal_handler currently performs async-unsafe operations (std::cout and result_cv.notify_all()); modify it to only perform signal-safe actions: set a sig_atomic_t/atomic<bool> flag (keep_running) and, if needed, set shm_ctrl->shutdown_requested (ensure that shutdown_requested is signal-safe or updated via atomic/sig_atomic_t), then return immediately—remove std::cout and result_cv.notify_all() from signal_handler. Update the main loop or a dedicated watcher thread to observe keep_running (or the atomic) and then perform logging and call result_cv.notify_all() outside the signal context (use the symbols signal_handler, keep_running, shm_ctrl, result_cv to locate changes).cpp/src/grpc/client/grpc_client.cu (5)
702-745: ⚠️ Potential issue | 🔴 Critical

Harden chunked-result metadata validation before division/alloc/memcpy.

`element_size_bytes` and `total_elements` are trusted. `elem_size <= 0` triggers division-by-zero (Line 708), and `total_elems * elem_size` can overflow `size_t` before allocation, making the `memcpy` (Line 739) unsafe.

Proposed bounds checks

```diff
 for (const auto& arr_desc : header->arrays()) {
   auto field_id       = arr_desc.field_id();
   int64_t total_elems = arr_desc.total_elements();
   int64_t elem_size   = arr_desc.element_size_bytes();
   if (total_elems <= 0) continue;
+  if (elem_size <= 0) {
+    last_error_ = "Invalid chunk metadata: non-positive element_size_bytes";
+    return false;
+  }
+  if (total_elems > std::numeric_limits<int64_t>::max() / elem_size) {
+    last_error_ = "Invalid chunk metadata: total byte size overflow";
+    return false;
+  }
+  const int64_t total_bytes = total_elems * elem_size;
+  if (total_bytes < 0 ||
+      static_cast<uint64_t>(total_bytes) >
+        static_cast<uint64_t>(std::numeric_limits<size_t>::max())) {
+    last_error_ = "Invalid chunk metadata: byte size exceeds addressable memory";
+    return false;
+  }
   int64_t elems_per_chunk = chunk_data_budget / elem_size;
   if (elems_per_chunk <= 0) elems_per_chunk = 1;
-  std::vector<uint8_t> array_bytes(static_cast<size_t>(total_elems * elem_size));
+  std::vector<uint8_t> array_bytes(static_cast<size_t>(total_bytes));
```

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero, and use epsilon comparisons for floating-point equality checks".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/grpc_client.cu` around lines 702 - 745, The code trusts element_size_bytes() and total_elements() and must validate them before any division, allocation, or memcpy: check that elem_size = arr_desc.element_size_bytes() > 0 and total_elems = arr_desc.total_elements() >= 0, ensure elems_per_chunk = chunk_data_budget / elem_size is computed only after the elem_size > 0 check, and verify that total_elems * elem_size does not overflow size_t before constructing array_bytes; also validate that elems_wanted and elems_received are non-negative and that data.size() equals elems_received * elem_size and fits within array_bytes bounds before calling std::memcpy (references: arr_desc.field_id(), arr_desc.total_elements(), arr_desc.element_size_bytes(), elems_per_chunk, array_bytes, chunk_resp.elements_in_chunk(), chunk_resp.data(), and the std::memcpy call).
258-266: ⚠️ Potential issue | 🔴 Critical

Guard all public RPC methods against `impl_->stub` being null.

`check_status()`, `wait_for_completion()`, `cancel_job()`, `delete_job()`, `get_incumbents()`, and `stream_logs()` all dereference `impl_->stub` directly; calling these before `connect()` will crash.

Proposed guard pattern

```diff
 job_status_result_t grpc_client_t::check_status(const std::string& job_id)
 {
   job_status_result_t result;
+  if (!impl_ || !impl_->stub) {
+    result.error_message = "Not connected to server";
+    return result;
+  }
   grpc::ClientContext context;
   auto request = build_status_request(job_id);
   cuopt::remote::StatusResponse response;
   auto status = impl_->stub->CheckStatus(&context, request, &response);
```

Also applies to: 293-303, 325-333, 354-361, 381-395, 419-430
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/grpc_client.cu` around lines 258 - 266, All public RPC methods must guard against impl_ or impl_->stub being null to avoid crashes when called before connect(); update grpc_client_t::check_status, wait_for_completion, cancel_job, delete_job, get_incumbents, and stream_logs to early-return a sensible error result if impl_ == nullptr or impl_->stub == nullptr (e.g., set job_status_result_t to an error/failed state or populate the method's error/status fields and avoid calling stub->*), and add a clear logged message indicating the client is not connected; ensure you use the same guard pattern in the other listed methods (the ones around the ranges noted: check_status, wait_for_completion, cancel_job, delete_job, get_incumbents, stream_logs) so no method dereferences impl_->stub without this null check.
221-252: ⚠️ Potential issue | 🔴 Critical

`stop_log_streaming()` can hang indefinitely (blocked `reader->Read()`).

Setting `stop_logs_` doesn't unblock a synchronous server-streaming `Read()`, so `join()` can deadlock if the server stops emitting logs (or the stream is otherwise idle). This is still the same failure mode previously called out.

Proposed fix (make stop cancel the active ClientContext)

```diff
 void grpc_client_t::start_log_streaming(const std::string& job_id)
 {
   if (!config_.stream_logs || !config_.log_callback) return;
   stop_logs_.store(false);
   log_thread_ = std::make_unique<std::thread>([this, job_id]() {
-    grpc::ClientContext context;
+    grpc::ClientContext context;
+    {
+      std::lock_guard<std::mutex> lk(log_context_mutex_);
+      active_log_context_ = &context;
+    }
     auto request = build_stream_logs_request(job_id, 0);
     auto reader  = impl_->stub->StreamLogs(&context, request);
     cuopt::remote::LogMessage log_msg;
     while (reader->Read(&log_msg)) {
       if (stop_logs_.load()) {
-        context.TryCancel();
         break;
       }
       if (config_.log_callback) { config_.log_callback(log_msg.line()); }
       if (log_msg.job_complete()) { break; }
     }
     reader->Finish();
+    {
+      std::lock_guard<std::mutex> lk(log_context_mutex_);
+      active_log_context_ = nullptr;
+    }
   });
 }

 void grpc_client_t::stop_log_streaming()
 {
   stop_logs_.store(true);
+  {
+    std::lock_guard<std::mutex> lk(log_context_mutex_);
+    if (active_log_context_) { active_log_context_->TryCancel(); }
+  }
   if (log_thread_ && log_thread_->joinable()) { log_thread_->join(); }
   log_thread_.reset();
 }
```

You'll also need matching members on `grpc_client_t` (likely in `cpp/src/grpc/client/grpc_client.hpp`):

```cpp
mutable std::mutex log_context_mutex_;
grpc::ClientContext* active_log_context_ = nullptr;
```

As per coding guidelines "Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state".
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/grpc_client.cu` around lines 221 - 252, The stop_log_streaming() can block because reader->Read() is synchronous; modify start_log_streaming() and stop_log_streaming() to track and cancel the active grpc::ClientContext: add a mutable std::mutex log_context_mutex_ and grpc::ClientContext* active_log_context_ member, then in start_log_streaming() create the ClientContext, lock log_context_mutex_, set active_log_context_ = &context before calling impl_->stub->StreamLogs, unlock, and clear active_log_context_ (under the mutex) after reader->Finish(); in stop_log_streaming() lock log_context_mutex_, if active_log_context_ is non-null call active_log_context_->TryCancel(), then set stop_logs_ and join/reset the thread; keep the existing context.TryCancel() check inside the thread but ensure all access to active_log_context_ is synchronized to avoid races.
258-266: ⚠️ Potential issue | 🟠 Major

Add deadlines to operational RPCs (current code can block forever on network stalls).

Only the connection probe sets a deadline; most other RPC calls create a `grpc::ClientContext` without `set_deadline()`. That bypasses `config_.timeout_seconds` and can hang indefinitely (including inside polling loops).

Proposed helper + per-RPC use

```diff
+static inline void set_context_deadline(grpc::ClientContext& ctx, int timeout_seconds)
+{
+  if (timeout_seconds <= 0) return;
+  ctx.set_deadline(std::chrono::system_clock::now() + std::chrono::seconds(timeout_seconds));
+}
+
 job_status_result_t grpc_client_t::check_status(const std::string& job_id)
 {
   job_status_result_t result;
   grpc::ClientContext context;
+  set_context_deadline(context, config_.timeout_seconds);
   auto request = build_status_request(job_id);
   cuopt::remote::StatusResponse response;
   auto status = impl_->stub->CheckStatus(&context, request, &response);
```

Apply similarly to `WaitForCompletion`, `CancelJob`, `DeleteResult`, `GetIncumbents`, `GetResult`, and all chunked upload/download RPCs.

Also applies to: 293-303, 325-333, 354-361, 387-395, 419-443, 468-470, 509-510, 540-541, 561-562, 588-589, 625-626, 674-675, 724-725, 1000-1004, 1172-1176
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/grpc_client.cu` around lines 258 - 266, The RPCs in grpc_client_t (e.g., grpc_client_t::check_status) create grpc::ClientContext without deadlines and can block indefinitely; add a helper (e.g., make_client_context_with_deadline or set_deadline_on_context) that constructs a grpc::ClientContext and calls set_deadline(std::chrono::system_clock::now() + std::chrono::seconds(config_.timeout_seconds)) (or uses config_.timeout_seconds as a std::chrono duration) and use that helper everywhere you create a context for RPCs (check_status, WaitForCompletion, CancelJob, DeleteResult, GetIncumbents, GetResult, all chunked upload/download calls, and the other locations noted in the review) so each RPC honors config_.timeout_seconds and cannot hang forever.
992-995: ⚠️ Potential issue | 🔴 Critical

Validate `poll_interval_ms` before dividing (division-by-zero).

Both LP and MIP polling paths compute `max_polls` by dividing by `config_.poll_interval_ms` without guarding `<= 0`.

Proposed validation

```diff
 } else {
   CUOPT_LOG_INFO("[grpc_client] Using polling (CheckStatus) for job %s", job_id.c_str());
+  if (config_.poll_interval_ms <= 0) {
+    stop_log_streaming();
+    result.error_message = "poll_interval_ms must be > 0";
+    return result;
+  }
   int poll_count = 0;
   int max_polls  = (config_.timeout_seconds * 1000) / config_.poll_interval_ms;
```

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero, and use epsilon comparisons for floating-point equality checks".

Also applies to: 1130-1133
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/grpc_client.cu` around lines 992 - 995, Before computing max_polls in the LP and MIP polling paths, validate config_.poll_interval_ms > 0 to avoid division-by-zero: if config_.poll_interval_ms <= 0, emit a clear error log (e.g., CUOPT_LOG_ERROR) that includes the invalid value and either return an error status or substitute a safe fallback (e.g., set poll_interval_ms = 1) before computing max_polls = (config_.timeout_seconds * 1000) / config_.poll_interval_ms; apply this check around the existing uses of poll_count and max_polls so both the LP and MIP paths use the validated value.cpp/src/grpc/grpc_problem_mapper.cu (1)
399-427: ⚠️ Potential issue | 🔴 Critical

Fix potential out-of-bounds writes in `get_doubles`/`get_ints` (misaligned payload sizes).

`n` is computed with floor division, but `memcpy` copies `it->second.size()` bytes. If the byte payload isn't an exact multiple of the element size, this can overflow the destination vector.

Proposed fix (validate alignment and copy only validated byte count)

```diff
 auto get_doubles = [&](int32_t field_id) -> std::vector<f_t> {
   auto it = arrays.find(field_id);
   if (it == arrays.end() || it->second.empty()) return {};
+  if (it->second.size() % sizeof(double) != 0) return {};
   size_t n = it->second.size() / sizeof(double);
   if constexpr (std::is_same_v<f_t, double>) {
     std::vector<double> v(n);
-    std::memcpy(v.data(), it->second.data(), it->second.size());
+    std::memcpy(v.data(), it->second.data(), n * sizeof(double));
     return v;
   } else {
     std::vector<double> tmp(n);
-    std::memcpy(tmp.data(), it->second.data(), it->second.size());
+    std::memcpy(tmp.data(), it->second.data(), n * sizeof(double));
     return std::vector<f_t>(tmp.begin(), tmp.end());
   }
 };

 auto get_ints = [&](int32_t field_id) -> std::vector<i_t> {
   auto it = arrays.find(field_id);
   if (it == arrays.end() || it->second.empty()) return {};
+  if (it->second.size() % sizeof(int32_t) != 0) return {};
   size_t n = it->second.size() / sizeof(int32_t);
   if constexpr (std::is_same_v<i_t, int32_t>) {
     std::vector<int32_t> v(n);
-    std::memcpy(v.data(), it->second.data(), it->second.size());
+    std::memcpy(v.data(), it->second.data(), n * sizeof(int32_t));
     return v;
   } else {
     std::vector<int32_t> tmp(n);
-    std::memcpy(tmp.data(), it->second.data(), it->second.size());
+    std::memcpy(tmp.data(), it->second.data(), n * sizeof(int32_t));
     return std::vector<i_t>(tmp.begin(), tmp.end());
   }
 };
```

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero, and use epsilon comparisons for floating-point equality checks".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/grpc_problem_mapper.cu` around lines 399 - 427, The lambdas get_doubles and get_ints compute element count n via floor division but then memcpy the full it->second.size() bytes, risking out-of-bounds when payload size isn't a multiple of element size; fix by validating alignment (check it->second.size() % sizeof(double) / sizeof(int32_t) == 0) and compute a safe byte_count = n * sizeof(element_type), resize the destination vector to n, then memcpy only byte_count (or return an error/empty vector if misaligned); update both get_doubles (use sizeof(double) and f_t conversions) and get_ints (use sizeof(int32_t) and i_t conversions) accordingly so copying never exceeds the destination buffer.cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp (1)
213-218: ⚠️ Potential issue | 🟠 Major

Synchronize server-log state reads/writes consistently (data race).

`server_log_path_`, `server_log_start_pos_`, and `test_start_marked_` are accessed without a common lock in `set_server_log_path()`, `get_server_logs()`, `dump_logs()`, and the accessors. This can race if tests poll logs while another thread resets paths/markers.

Proposed approach (copy shared state under lock, then do I/O unlocked)

```diff
 void set_server_log_path(const std::string& path)
 {
-  server_log_path_      = path;
-  server_log_start_pos_ = 0;
-  test_start_marked_    = false;
+  std::lock_guard<std::mutex> lock(mutex_);
+  server_log_path_      = path;
+  server_log_start_pos_ = 0;
+  test_start_marked_    = false;
 }

 std::string get_server_logs(bool since_test_start = true) const
 {
-  if (server_log_path_.empty()) { return ""; }
+  std::string path;
+  std::streampos start_pos = 0;
+  bool marked = false;
+  {
+    std::lock_guard<std::mutex> lock(mutex_);
+    path      = server_log_path_;
+    start_pos = server_log_start_pos_;
+    marked    = test_start_marked_;
+  }
+
+  if (path.empty()) { return ""; }

-  std::ifstream file(server_log_path_);
+  std::ifstream file(path);
   if (!file.is_open()) { return ""; }

-  if (since_test_start && test_start_marked_ && server_log_start_pos_ > 0) {
-    file.seekg(server_log_start_pos_);
+  if (since_test_start && marked && start_pos > 0) {
+    file.seekg(start_pos);
   }
```

Consider also making `server_log_path()` return a copy (`std::string`) instead of a reference if you need it to be thread-safe.

As per coding guidelines "Prevent thread-unsafe use of global and static variables; use proper mutex/synchronization in server code accessing shared solver state".

Also applies to: 229-244, 332-345, 350-356
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp` around lines 213 - 218, There is a data race on server_log_path_, server_log_start_pos_, and test_start_marked_: protect all reads/writes with a single mutex (e.g., add std::mutex server_log_mutex_) and update set_server_log_path(), get_server_logs(), dump_logs(), and the accessors to acquire the lock when accessing those members; for I/O work, copy the guarded state (make server_log_path() return a std::string copy rather than a const ref) while holding the lock then release and perform file reading/writing unlocked to avoid long holds; ensure all places that currently read the fields directly use the mutex-protected copy to eliminate races.cpp/src/grpc/grpc_solution_mapper.cu (1)
512-535: ⚠️ Potential issue | 🔴 Critical

Harden `bytes_to_typed()` against misaligned byte payloads (potential OOB write).

All branches compute `n` with floor division and then copy `raw.size()` bytes. If `raw.size()` is not an exact multiple of the expected element size, this overflows `tmp`/`v`.

Proposed fix

```diff
 const auto& raw = it->second;
 if constexpr (std::is_same_v<T, float>) {
+  if (raw.size() % sizeof(double) != 0) return {};
   size_t n = raw.size() / sizeof(double);
   std::vector<double> tmp(n);
-  std::memcpy(tmp.data(), raw.data(), raw.size());
+  std::memcpy(tmp.data(), raw.data(), n * sizeof(double));
   return std::vector<T>(tmp.begin(), tmp.end());
 } else if constexpr (std::is_same_v<T, double>) {
+  if (raw.size() % sizeof(double) != 0) return {};
   size_t n = raw.size() / sizeof(double);
   std::vector<double> v(n);
-  std::memcpy(v.data(), raw.data(), raw.size());
+  std::memcpy(v.data(), raw.data(), n * sizeof(double));
   return v;
 } else {
+  if (raw.size() % sizeof(T) != 0) return {};
   size_t n = raw.size() / sizeof(T);
   std::vector<T> v(n);
-  std::memcpy(v.data(), raw.data(), raw.size());
+  std::memcpy(v.data(), raw.data(), n * sizeof(T));
   return v;
 }
```

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero, and use epsilon comparisons for floating-point equality checks".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/grpc_solution_mapper.cu` around lines 512 - 535, The bytes_to_typed() implementation can write out-of-bounds because it computes n = raw.size()/sizeof(...) but then memcpy's raw.size() bytes; fix by always using sizeof(T) (not sizeof(double) in the float branch), compute size_t n = raw.size() / sizeof(T), and memcpy only n * sizeof(T) bytes into the destination buffers (and return the n-length vector); if raw.size() % sizeof(T) != 0, silently ignore trailing bytes or log a warning—ensure this same change is applied in the float branch (tmp) and the double/other branches (v) so all copies use n * sizeof(T) and use sizeof(T) to compute n.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 53823cec-ccad-4fbb-b90f-ec216b29d2e2
📒 Files selected for processing (39)
- GRPC_ARCHITECTURE.md
- SERVER_ARCHITECTURE.md
- build.sh
- conda/recipes/libcuopt/recipe.yaml
- cpp/CMakeLists.txt
- cpp/cuopt_grpc_server.cpp
- cpp/src/grpc/client/grpc_client.cu
- cpp/src/grpc/client/grpc_client.hpp
- cpp/src/grpc/client/solve_remote.cu
- cpp/src/grpc/client/test_grpc_client.cpp
- cpp/src/grpc/cuopt_remote.proto
- cpp/src/grpc/cuopt_remote_service.proto
- cpp/src/grpc/grpc_problem_mapper.cu
- cpp/src/grpc/grpc_problem_mapper.hpp
- cpp/src/grpc/grpc_service_mapper.cu
- cpp/src/grpc/grpc_service_mapper.hpp
- cpp/src/grpc/grpc_settings_mapper.cu
- cpp/src/grpc/grpc_settings_mapper.hpp
- cpp/src/grpc/grpc_solution_mapper.cu
- cpp/src/grpc/grpc_solution_mapper.hpp
- cpp/src/grpc/server/grpc_field_element_size.hpp
- cpp/src/grpc/server/grpc_incumbent_proto.hpp
- cpp/src/grpc/server/grpc_job_management.cpp
- cpp/src/grpc/server/grpc_pipe_serialization.hpp
- cpp/src/grpc/server/grpc_server_main.cpp
- cpp/src/grpc/server/grpc_server_threads.cpp
- cpp/src/grpc/server/grpc_server_types.hpp
- cpp/src/grpc/server/grpc_service_impl.cpp
- cpp/src/grpc/server/grpc_worker.cpp
- cpp/src/grpc/server/grpc_worker_infra.cpp
- cpp/src/pdlp/CMakeLists.txt
- cpp/src/pdlp/solve_remote.cu
- cpp/tests/linear_programming/CMakeLists.txt
- cpp/tests/linear_programming/grpc/CMakeLists.txt
- cpp/tests/linear_programming/grpc/grpc_client_test.cpp
- cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp
- cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
- cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
- python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
💤 Files with no reviewable changes (2)
- cpp/src/pdlp/solve_remote.cu
- cpp/src/pdlp/CMakeLists.txt
✅ Files skipped from review due to trivial changes (1)
- cpp/tests/linear_programming/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (4)
- GRPC_ARCHITECTURE.md
- cpp/src/grpc/server/grpc_worker_infra.cpp
- python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
- cpp/src/grpc/grpc_settings_mapper.cu
Actionable comments posted: 3
🧹 Nitpick comments (3)
cpp/src/grpc/client/test_grpc_client.cpp (1)
340-340: Print the client error on log-stream failure for parity with other modes.

At Line 340, a failed `stream_logs` returns non-zero but drops the error detail. Emit `client.get_last_error()` before returning to keep failure diagnosis consistent.

Proposed fix

```diff
- return success ? 0 : 1;
+ if (!success) {
+   std::cerr << "Log stream failed: " << client.get_last_error() << "\n";
+   return 1;
+ }
+ return 0;
```

As per coding guidelines: "Verify error propagation from CUDA to user-facing APIs is complete; ensure CUDA errors are caught and mapped to meaningful user error codes".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/test_grpc_client.cpp` at line 340, The return path after stream_logs currently returns non-zero on failure but drops the error detail; update the failure branch surrounding stream_logs in test_grpc_client.cpp to call and emit client.get_last_error() (or equivalent accessor) to the log/stream before returning the non-zero exit code so the client error is logged for parity with other modes and aids diagnosis (locate the code around stream_logs and the return of success ? 0 : 1 and insert a process/log call that prints client.get_last_error() when success is false).cpp/src/grpc/server/grpc_pipe_serialization.hpp (1)
24-27: Consider making `kPipeChunkBytes` configurable. The fixed 64 MiB chunk cap is a hard resource limit; exposing it via server config/env (with this as default) would improve deploy-time tuning.
As per coding guidelines: "Check that hard-coded GPU device IDs and resource limits are made configurable; abstract multi-backend support for different CUDA versions."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 24 - 27, kPipeChunkBytes is a hard-coded constexpr limit; make it configurable by replacing the fixed constexpr with a runtime-configured value (e.g., a GetPipeChunkBytes() or PipeConfig::pipe_chunk_bytes() accessor) that reads an environment variable or server config and falls back to 64ULL * 1024 * 1024 as the default; update references to kPipeChunkBytes to call the accessor and validate the parsed value (positive, reasonable upper bound) and document the new env var name (e.g., PIPE_CHUNK_BYTES) and default behavior.cpp/tests/linear_programming/grpc/grpc_integration_test.cpp (1)
957-982: Assert the log-callback path in `SolveMIPWithLogCallback`.

`received_logs` is populated but never asserted, so this test can pass even if callback streaming regresses.

💡 Suggested assertion

```diff
  auto result = client->solve_mip(problem, settings, false);
  EXPECT_TRUE(result.success) << result.error_message;
+ EXPECT_FALSE(received_logs.empty()) << "Expected at least one streamed log line";
```
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp` around lines 957 - 982, The test registers a log callback (received_logs with log_mutex) but never asserts anything, so add assertions after the call to client->solve_mip in SolveMIPWithLogCallback to verify the callback was invoked and logs look correct: acquire lock(log_mutex) and assert !received_logs.empty() (and optionally that at least one entry contains an expected substring like "objective" or "MIP" to ensure meaningful log content). If solve_mip is asynchronous/streams logs, ensure you wait briefly or poll until received_logs is non-empty before asserting to avoid flakes; reference received_logs, log_mutex, log_callback, SolveMIPWithLogCallback, and client->solve_mip when adding the checks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/CMakeLists.txt`:
- Around line 297-304: The CMake snippet only checks for the non-namespaced
target grpc_cpp_plugin; update the check to first test for the namespaced target
gRPC::grpc_cpp_plugin (like how gRPC::grpc++ is handled) and use its target file
if present, otherwise fall back to checking TARGET grpc_cpp_plugin and then
find_program for the executable; adjust references to set
_GRPC_CPP_PLUGIN_EXECUTABLE from "$<TARGET_FILE:gRPC::grpc_cpp_plugin>" when
available to ensure installed gRPC packages are recognized.
In `@cpp/tests/linear_programming/grpc/grpc_client_test.cpp`:
- Around line 1387-1402: The test currently expects GetResult to be called even
though CheckStatus returns PROCESSING (not complete); update the test so
GetResult is not expected for PROCESSING jobs: remove or replace the
EXPECT_CALL(*mock_stub_, GetResult(...)).WillOnce(...) with an expectation that
it is not called (e.g., EXPECT_CALL(*mock_stub_, GetResult(_, _, _)).Times(0))
or simply delete the GetResult expectation, keeping the CheckStatus stub that
sets resp->set_job_status(cuopt::remote::PROCESSING) so the test asserts the
intended "do not call GetResult for non-complete status" behavior.
In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp`:
- Around line 275-283: The server_log_count function can infinite-loop when
substring is empty; add a guard at the start of server_log_count (which calls
get_server_logs()) to return 0 (or handle as appropriate) if substring.empty()
before entering the while loop so pos will never rely on substring.length() == 0
to advance.
---
Duplicate comments:
In `@cpp/CMakeLists.txt`:
- Around line 279-294: The current CMake logic relies on IMPORTED_LOCATION*
which can be empty for non-imported targets; update the resolution to use
$<TARGET_FILE:protobuf::protoc> when the target exists: in the branch that tests
if(TARGET protobuf::protoc) set _PROTOBUF_PROTOC to the generator expression
$<TARGET_FILE:protobuf::protoc> (instead of querying IMPORTED_LOCATION*),
otherwise fall back to find_package(protobuf CONFIG REQUIRED) and then use
IMPORTED_LOCATION_RELEASE/IMPORTED_LOCATION as a fallback; keep the existing
fatal check for empty _PROTOBUF_PROTOC. Ensure you reference the symbols
protobuf::protoc and _PROTOBUF_PROTOC in the change so the target-file generator
expression is used for in-tree/FetchContent builds.
In `@cpp/src/grpc/client/grpc_client.cu`:
- Around line 258-265: grpc_client_t::check_status (and similar RPC helpers)
currently dereferences impl_->stub without checking connection state which can
crash if called before connect(); add a null-check for impl_ and impl_->stub at
the start of grpc_client_t::check_status and for the same pattern in the other
RPC helper methods (the ones calling impl_->stub->...). If either is null,
return an appropriate error job_status_result_t (populate error code/message or
a failed state consistent with existing error handling) instead of calling the
stub; otherwise proceed to build the request and call CheckStatus as before.
- Around line 994-995: The code computes max_polls by dividing by
config_.poll_interval_ms without checking for <=0; add a guard before any
division (both where max_polls is computed and the other occurrence around the
1132 region) to validate config_.poll_interval_ms > 0, and if not either (a)
return or raise a clear error/invalid-argument from the surrounding function, or
(b) log an error and use a safe default (e.g., 1 ms) before computing max_polls;
ensure you reference config_.poll_interval_ms and update the computations that
set max_polls to avoid divide-by-zero crashes.
- Around line 221-251: The read loop in start_log_streaming blocks on
reader->Read() so stop_log_streaming can hang; to fix, add a mutex-protected
pointer member (log_context_mutex_ and active_log_context_) and set
active_log_context_ = &context (protected by log_context_mutex_) right after
creating grpc::ClientContext context in start_log_streaming, clear it (set
nullptr) after reader->Finish(); then update stop_log_streaming to lock
log_context_mutex_ and if active_log_context_ is non-null call
active_log_context_->TryCancel() before setting stop_logs_ and joining the
thread so the blocking StreamLogs read is cancelled safely.
- Around line 703-712: Validate elem_size and guard against overflow before
dividing/allocating: check elem_size > 0 before computing elems_per_chunk =
chunk_data_budget / elem_size (return/continue or log and skip if zero), ensure
total_elems > 0 (already present) and verify total_elems <= SIZE_MAX / elem_size
(or use safe_mul_check) before computing total_elems * elem_size and allocating
array_bytes, and if the allocation would exceed a configured budget or SIZE_MAX
also error/skip; apply the same checks when computing per-chunk sizes and before
any memcpy/copy operations. Reference symbols: elem_size, total_elems,
chunk_data_budget, elems_per_chunk, array_bytes, arr_desc.total_elements(),
arr_desc.element_size_bytes() (also update the similar block around the other
occurrence).
- Around line 262-265: The CheckStatus RPC (and other RPCs that create
grpc::ClientContext, e.g., the sites calling impl_->stub->CheckStatus and the
other stub->* calls listed) currently create a ClientContext with no deadline;
fix by setting a per-RPC deadline on the context before the call using
context.set_deadline(std::chrono::system_clock::now() + configured_timeout).
Prefer reusing or adding a small helper (e.g., set_deadline(context, timeout_ms)
or a method on impl_ that reads a configured RPC timeout) and apply it to the
Call sites such as the CheckStatus call (impl_->stub->CheckStatus) and the other
grpc::ClientContext creation sites flagged in the review so all RPCs
consistently use the same configured deadline.
In `@cpp/src/grpc/client/grpc_client.hpp`:
- Around line 53-99: The grpc_client_config_t allows invalid values for
intervals/sizes (poll_interval_ms, incumbent_poll_interval_ms, timeout_seconds,
chunk_size_bytes, max_message_bytes, chunked_array_threshold_bytes) which can
break runtime behavior; add a validation method (e.g.,
grpc_client_config_t::Validate() or a constructor/static factory) that enforces
and documents invariants (positive non-zero ints for intervals/timeout, sensible
min/max for chunk_size_bytes and max_message_bytes, and explicit handling for
chunked_array_threshold_bytes semantics (-1 for auto, 0 for force-chunked, >0 as
threshold)), make the client call Validate() early (e.g., before using in
grpc_client constructors or SubmitJob path) and return/throw a clear error on
invalid config or clamp values where safe; update comments to list accepted
ranges for each field.
In `@cpp/src/grpc/client/solve_remote.cu`:
- Around line 55-57: The env-derived overrides for config.chunk_size_bytes and
config.max_message_bytes are applied without validation (using parse_env_int64),
allowing zero/negative/oversized values; update the logic where
config.chunk_size_bytes, config.max_message_bytes (and optionally
config.enable_debug_log) are set so that after calling parse_env_int64 you
validate the result against sane bounds (e.g., >0 and <= a defined
MAX_MESSAGE_BYTES/CHUNK_SIZE constant), ignore or clamp invalid values, and emit
a debug/warning log via the existing logger; specifically modify the section
using parse_env_int64 to perform the check and fallback to existing config
values when the parsed value is out-of-range.
- Around line 164-166: The callback is being passed the incumbent objective as
the bound (bound_copy = objective) which can show a false zero gap; change the
code so bound_copy is set to the solver's actual best bound variable (the
maintained global/best bound used by the B&B engine) instead of 'objective'
before calling get_callback->get_solution(...); locate where the solver stores
the current best bound (e.g., variables named best_bound, global_bound,
current_bound, or similar in the surrounding code/branch-and-bound context) and
use that identifier (preserving obj_copy, solution_copy, and
callback->get_user_data() usage).
In `@cpp/src/grpc/client/test_grpc_client.cpp`:
- Around line 401-403: The parsing of numeric CLI arguments uses std::stod and
std::stoll without guards and converts a double to int unsafely (variable
time_limit and the static_cast<int>(time_limit)); wrap each std::stod/std::stoll
call in a try/catch (catch std::invalid_argument and std::out_of_range),
validate the parsed value for positivity and reasonable bounds, and return/exit
with a clear error on bad input; before converting time_limit to int (the
static_cast<int>(time_limit) usage) check that time_limit is finite, >= 0 and <=
std::numeric_limits<int>::max() (or clamp/handle overflow) to avoid UB; apply
the same defensive checks to the stoll usages (ensure values fit in target
integer types) and use the same error path if validation fails.
In `@cpp/src/grpc/cuopt_remote.proto`:
- Line 6: The proto package declaration "package cuopt.remote" in
cuopt_remote.proto conflicts with its directory path (cpp/src/grpc) and triggers
Buf PACKAGE_DIRECTORY_MATCH; fix by making the package name match the file path
or by moving the file to match the package. Either change the package line in
cpp/src/grpc/cuopt_remote.proto from "package cuopt.remote" to "package
cpp.src.grpc" (matching the directory segments as dot-separated identifiers) or
relocate cuopt_remote.proto into a cuopt/remote directory so the existing
"package cuopt.remote" matches the filesystem; update any import/usage
references accordingly and re-run buf lint to verify.
In `@cpp/src/grpc/grpc_problem_mapper.cu`:
- Around line 175-182: The code assumes constraint_upper_bounds exists when
pb_problem.constraint_lower_bounds_size() > 0; instead validate both arrays and
their sizes before calling cpu_problem.set_constraint_lower_bounds and
cpu_problem.set_constraint_upper_bounds: check
pb_problem.constraint_lower_bounds_size() > 0 &&
pb_problem.constraint_upper_bounds_size() > 0 and that their sizes match, and
only then build con_lb/con_ub and call the two set_* methods; if sizes mismatch
or one is missing, handle the case (skip setting bounds or raise/log an error)
to avoid creating an inconsistent problem state.
- Around line 435-446: get_string_list can advance s past s_end when memchr
returns null (no trailing NUL); fix by computing remaining = s_end - s, call
memchr over remaining, and if memchr returns null push the final string using
the remaining length and then break (do not set s = nul + 1). Update the loop in
get_string_list to use it->second.data() and it->second.size() to compute s and
s_end, use the computed remaining length for std::memchr, and on null result
construct names.emplace_back(s, remaining) and break instead of setting s = nul
+ 1.
- Around line 399-411: The lambdas (notably get_doubles) compute element count
as it->second.size() / sizeof(double) then memcpy the full byte payload, which
can overflow if the byte length is not a multiple of sizeof(double); add an
explicit alignment/length check before any typed memcpy: verify
it->second.size() % sizeof(double) == 0 (or the equivalent check for the other
lambda’s element type), and if the check fails either return an empty vector or
log/throw an error; only then allocate the destination vector with n =
it->second.size() / sizeof(double) and perform memcpy; apply the same fix to the
other similar lambda at the noted region (around 414-426) to avoid misaligned
copies.
In `@cpp/src/grpc/grpc_service_mapper.cu`:
- Around line 23-55: Chunked wire format is inconsistent for floats: update
chunk_typed_array (and the client-side build_array_chunk_requests that calls it)
to always emit floating arrays as IEEE double on the wire (so server get_doubles
can continue to interpret chunks as sizeof(double)). Concretely, when T ==
float, convert the input vector to a temporary std::vector<double> and use that
buffer for chunking (set elem_size = sizeof(double) and point raw at the double
buffer) so byte offsets/counts match get_doubles; for non-floating types keep
the existing behavior. Ensure build_array_chunk_requests callers use the
adjusted chunk_typed_array semantics and do not send raw float bytes anymore.
In `@cpp/src/grpc/grpc_solution_mapper.cu`:
- Around line 520-534: The bytes_to_typed template is copying raw.size() bytes
into buffers sized for n elements, causing overflow when raw.size() isn't an
exact multiple of the element byte-size; update each branch (the
std::is_same_v<T,float> branch that builds tmp as double, the
std::is_same_v<T,double> branch that builds v as double, and the generic branch)
to: compute n = raw.size() / sizeof(ElemType), check that raw.size() %
sizeof(ElemType) == 0 and handle (throw/log/return error) if not, and use
memcpy(dst.data(), raw.data(), n * sizeof(ElemType)) instead of memcpy(...,
raw.size()) so you only copy the exact bytes that fit the destination (refer to
function bytes_to_typed and the template T branches).
In `@cpp/src/grpc/server/grpc_incumbent_proto.hpp`:
- Around line 46-49: The assignment vector is not cleared before appending
parsed values, so repeated calls to parse_incumbent_proto will accumulate stale
entries; fix by calling assignment.clear() (or assign empty) before
assignment.reserve(...) in the code that iterates over incumbent_msg
(referencing the assignment variable and the loop that uses
incumbent_msg.assignment_size() and incumbent_msg.assignment(i)) so the vector
starts empty each parse.
- Around line 30-32: The serialization ignores SerializeToArray's boolean return
and narrows ByteSizeLong() (size_t) to int unsafely: update the code that
constructs buffer and calls msg.SerializeToArray to first capture size_t sz =
msg.ByteSizeLong(), validate sz <= static_cast<size_t>(INT_MAX) before casting,
allocate buffer with that size, call bool ok =
msg.SerializeToArray(buffer.data(), static_cast<int>(sz)), and handle a false
result (e.g., throw or return an error) instead of ignoring it; reference the
buffer variable, msg.ByteSizeLong(), and msg.SerializeToArray(...) when making
the change.
In `@cpp/src/grpc/server/grpc_job_management.cpp`:
- Around line 107-113: The recv_incumbent_pipe function trusts the incoming size
prefix and can allocate arbitrarily large memory; add a sanity check before
resizing by comparing the read size (variable size) against a defined cap (e.g.,
a new constant like kMaxIncumbentSize or MAX_INCUMBENT_SIZE) and reject/return
false if size is 0 or exceeds that cap, then only call data.resize(size) and
read_from_pipe when the size is within limits; update recv_incumbent_pipe to
enforce this limit and document the constant choice.
- Around line 147-158: The loop in submit_job_async is racy because threads
perform a check-then-write on job_queue[i] (checking ready/claimed then
initializing), so change the slot acquisition to an atomic claim: use an atomic
flag (e.g., make job_queue[].claimed an std::atomic_bool or add a per-slot
std::atomic<int> state) and perform a compare_exchange_strong to transition from
false->true to reserve the slot; only the thread that succeeds should populate
the slot fields (job_id, problem_type, data_size, worker_pid, worker_index,
data_sent, cancelled) and set job_queue[i].ready = true as the final publish
step. Ensure submit_job_async uses this atomic CAS on job_queue[i].claimed and
does not set ready until after all other fields are initialized.
In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp`:
- Around line 93-100: Validate and sanitize the metadata before resizing or
memcpy: check ac.total_elements() and ac.element_offset() are non-negative and
within reasonable caps, ensure chunk_data.size() + offset does not overflow and
fits within a bounded/validated size_t, and guard against pathological
total_elements to avoid huge allocations; in the functions using ac (e.g., where
dest.resize(...) and std::memcpy(...) are called), convert to size_t only after
validation, return or throw an error on invalid/too-large values instead of
resizing/ copying, and ensure the memcpy condition uses the validated,
overflow-checked bounds.
- Line 90: The deserializers currently treat ParseDelimitedFromCodedStream
failures by breaking out and eventually returning true, accepting malformed
payloads; change the error handling around
google::protobuf::util::ParseDelimitedFromCodedStream(...) (the calls that pass
ac, coded, clean_eof) so that on a parse failure you immediately fail closed and
return false (or propagate an error) instead of break/continuing; apply the same
change to both deserializer sites that call ParseDelimitedFromCodedStream to
ensure truncated/malformed messages are rejected.
In `@cpp/src/grpc/server/grpc_server_main.cpp`:
- Around line 234-250: Builder::BuildAndStart() can return nullptr so avoid
dereferencing server without checking; after calling builder.BuildAndStart()
(unique symbols: builder.BuildAndStart(), std::unique_ptr<Server> server,
server->Wait(), shutdown_thread, keep_running), check if server is null and if
so log an error/print to stderr, ensure shutdown_thread is signaled/joined or
cleaned up, and exit or return with a non-zero status; if server is valid
continue to start the shutdown_thread and call server->Wait() as currently
implemented.
- Around line 89-90: The signal handler registered via signal(SIGINT,
signal_handler) performs non-async-signal-safe operations (std::cout and
result_cv.notify_all()) — change it to a minimal, async-signal-safe handler:
have signal_handler set a std::sig_atomic_t or std::atomic<bool> flag (e.g.,
shutdown_requested) and (optionally) write a byte to a dedicated pipe to wake
the main loop; then remove any std::cout or condition_variable use from
signal_handler and move the actual logging and result_cv.notify_all() into the
main thread’s event loop where you check shutdown_requested (or the pipe) and
call result_cv.notify_all() and perform printing. Update grpc_server_types.hpp
to expose the atomic flag (or pipe FD) and ensure only async-signal-safe
functions are called from signal_handler.
- Around line 156-168: The loops in grpc_server_main.cpp must not call memset on
JobQueueEntry and ResultQueueEntry because they contain std::atomic members;
remove the memset calls and instead default-construct or placement-new each
element and then explicitly initialize fields. Replace memset(&job_queue[i], 0,
sizeof(JobQueueEntry)) with either new (&job_queue[i]) JobQueueEntry() or ensure
the array is value-initialized, then call job_queue[i].ready.store(false),
.claimed.store(false), .cancelled.store(false) and
job_queue[i].worker_index.store(-1); do the same for result_queue elements
(value-initialize or placement-new, then .ready.store(false) and
.retrieved.store(false)). Ensure JobQueueEntry and ResultQueueEntry constructors
handle non-atomic fields if you prefer constructor-based initialization.
- Around line 186-212: The early-return TLS error paths currently exit
immediately after spawn_workers() and creating threads, leaking workers,
threads, and shared memory; change these branches to perform proper cleanup
before returning: after calling spawn_workers() and after launching the threads
(referencing spawn_workers() and the container holding created std::thread
objects), call the corresponding shutdown/stop routine for workers (e.g.,
stop_workers() or signal shutdown on the worker pool), join all threads in the
thread container, and release any shared memory/IPC resources (the handles
created earlier, and any variables like ssl_opts or pem_root_certs can be left
untouched) before returning an error code. Ensure every early-return path that
occurs after worker spawn or thread creation follows this cleanup sequence so no
threads or shared resources remain leaked.
- Around line 50-80: The CLI parsing loop currently in main (the for-loop that
processes argv and assigns to config) silently ignores unknown flags and missing
values; update it to validate every flag and required value: for each branch
that expects a value (flags that call std::stoi/std::stoll or assign argv[++i]
to config fields like config.port, config.num_workers, config.max_message_mb,
config.max_message_b, config.tls_cert_path, config.tls_key_path,
config.tls_root_path) ensure you check i+1<argc and on failure call
print_usage(argv[0]) and return a non-zero exit code; add an else branch for
unrecognized args (arg starts with '-') that also prints usage and exits
non-zero; additionally wrap stoi/stoll conversions in try/catch to detect
invalid numeric inputs and treat them as errors that print usage and exit
non-zero so the server cannot start with unintended defaults.
In `@cpp/src/grpc/server/grpc_server_threads.cpp`:
- Around line 166-229: When deserialize_result_pipe_blob(...) fails you mark
it->second.status = JobStatus::FAILED but you leave the local success flag true,
causing waiter->success to be set incorrectly; update the logic so that after a
deserialization failure you set success = false (or set waiter->success based on
it->second.status != JobStatus::FAILED) before assigning waiter->success in the
waiting_threads block; modify the code paths around
deserialize_result_pipe_blob, job_tracker, and the waiter assignment
(waiter->success) so waiters observe the actual failed state and error_message
from it->second.error_message.
In `@cpp/src/grpc/server/grpc_server_types.hpp`:
- Around line 300-307: signal_handler currently calls non-async-signal-safe APIs
(std::cout and result_cv.notify_all); change it to only perform
async-signal-safe operations: remove the std::cout and result_cv.notify_all
calls and instead set an atomic/sig_atomic_t flag (ensure keep_running is an
std::atomic_bool or set a sig_atomic_t atomic_shutdown) and, if needed, set
shm_ctrl->shutdown_requested using a sig_atomic_t field; add a normal runtime
watcher thread that observes keep_running/shutdown flag and performs logging and
calls result_cv.notify_all when it detects shutdown; refer to signal_handler,
keep_running, shm_ctrl, shutdown_requested and result_cv when making these
changes.
In `@cpp/src/grpc/server/grpc_service_impl.cpp`:
- Around line 71-73: The code releases job_queue[job_idx].claimed before
publishing job_queue[job_idx].ready, creating a race where another producer can
claim the slot before consumers see ready; invert the operations so you
store(true) to job_queue[job_idx].ready before you store(false) to
job_queue[job_idx].claimed (apply same fix at the other occurrence around lines
with job_queue[...] claimed/ready, e.g., the site referenced as 265-267) to
ensure ready is visible prior to allowing re-claim.
- Around line 633-642: Hold on to waiter->mutex only while inspecting/modifying
waiter state; do not call check_job_status() while holding waiter->mutex to
avoid lock-order inversion with tracker_mutex used in cancel paths (e.g.,
cancel_job). In WaitForCompletion, restructure the loop so you acquire
unique_lock waiter's mutex, check waiter->ready and context->IsCancelled (and
adjust waiter->waiters on cancel) but before invoking check_job_status(job_id,
msg) release the lock (lock.unlock()), call check_job_status(), then re-lock
(lock.lock()) to re-evaluate waiter->ready and continue/wait; alternatively
adopt a consistent global lock order (acquire tracker_mutex before waiter->mutex
everywhere) — prefer the unlock/call/relock pattern around the call site to
quickly eliminate the deadlock risk.
- Around line 189-192: The code currently blindly adds each chunk to
meta.received_bytes, state.total_bytes, state.chunks and increments
state.total_chunks without enforcing a cumulative bound, so add a pre-check that
rejects or truncates a chunk when meta.received_bytes + raw.size() (or
state.total_bytes + raw.size()) would exceed the declared/expected field size
(use the existing expected-size field on meta or state), and only update
meta.received_bytes, state.total_bytes, state.chunks.push_back(ac) and
++state.total_chunks after the check succeeds; on failure return an error/status
indicating oversized upload to prevent unbounded memory growth.
In `@cpp/src/grpc/server/grpc_worker_infra.cpp`:
- Around line 120-124: The fork failure branch only calls
close_worker_pipes_server(worker_id) leaving child-end FDs open and leaking;
update the pid < 0 handling to also close the child-side pipe FDs (e.g., call
the corresponding cleanup function for child ends such as
close_worker_pipes_child(worker_id) or otherwise close the child file
descriptors associated with worker_id) before returning -1, ensuring both server
and child pipe endpoints are cleaned up when fork() fails; reference symbols:
fork, pid, worker_id, is_replacement, close_worker_pipes_server.
- Around line 165-168: The loop in wait_for_workers is calling waitpid
unconditionally for every entry in worker_pids; skip any non-positive PIDs (pid
<= 0) and call waitpid only for valid PIDs, and handle EINTR by retrying the
wait in a loop until waitpid succeeds or returns a permanent error; also handle
-1 results such that if errno == ECHILD you can break/continue safely and for
other errors log/propagate as appropriate. Locate the loop that iterates over
worker_pids and update it to validate pid > 0, wrap the waitpid call in a retry
loop that checks errno == EINTR, and handle other error cases (e.g., ECHILD)
instead of proceeding blindly.
In `@cpp/src/grpc/server/grpc_worker.cpp`:
- Around line 56-74: store_simple_result currently scans result_queue for an
entry with !ready and writes multiple non-atomic fields, causing a race; change
the slot allocation to atomically claim a slot before writing (e.g., add or
reuse an atomic claim flag on result_queue[i] and use
compare_exchange_strong/compare_exchange_weak to transition from UNCLAIMED to
CLAIMED), only proceed to copy job_id/status/data_size/error_message and set
retrieved/ready after the CAS succeeds, and ensure you set the atomic ready flag
last so other threads see a fully-initialized entry; update store_simple_result
to loop until a CAS claim succeeds (or handle full-queue) and reference
result_queue, MAX_RESULTS, and the ready/claimed flags in your changes.
- Around line 44-48: Check and handle the return value of
send_incumbent_pipe(fd_, buffer): if it indicates failure (e.g., broken pipe or
write error) log an error including job_id_ and the returned errno/details and
avoid further work or retry/cleanup as appropriate; only emit the incumbent
logging (currently using std::cout with job_id_, objective, assignment.size())
when the send succeeded or gate it behind a debug/log-level flag or rate-limit
to prevent unbounded output. Locate this logic around the call sites using
send_incumbent_pipe, build_incumbent_proto, and the variables job_id_,
objective, assignment, and fd_ to add the error check and conditional logging.
- Around line 248-252: The cudaMemcpy call copying device_solution to
host_solution currently ignores CUDA return codes; change it to capture the
cudaMemcpy return value (cudaError_t err = cudaMemcpy(...)) and if err !=
cudaSuccess throw an exception (e.g., std::runtime_error) with a descriptive
message including cudaGetErrorString(err) so the existing try/catch surfaces a
clear failure to the client; apply the same pattern to all other cudaMemcpy
sites in this file (including the similar copies near lines 296-307) and
reference the same symbols (device_solution, host_solution, cudaMemcpy) when
making the fixes.
- Around line 341-363: The loop that finds an empty slot in result_queue and
then fills it (using result_slot, ResultQueueEntry, copy_cstr, MAX_RESULTS,
worker_id) has a race: another thread may grab the same slot after the ready
check but before setting ready, and if no slot is found the result is silently
dropped; fix by protecting the selection-and-write with the same synchronization
used in store_simple_result (e.g., a mutex or atomic compare-and-set on
result_queue[i].ready) so the check-and-set is atomic, and add explicit handling
for the case result_slot remains -1 (log an error via the worker logger /
std::cerr and return or propagate a failure code) so lost results are reported.
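The atomic claim-then-publish scheme both worker prompts call for can be sketched as below. The array layout, `MAX_RESULTS` value, and `int` payload are simplified stand-ins for the real shared-memory `ResultQueueEntry`, not the project's actual types:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

constexpr int MAX_RESULTS = 8;  // stand-in for the real queue capacity

struct ResultQueueEntry {
  std::atomic<bool> claimed{false};  // flipped first, via CAS, to own the slot
  std::atomic<bool> ready{false};    // published last, once the payload is complete
  int payload{0};
};

ResultQueueEntry result_queue[MAX_RESULTS];

// Returns the claimed slot index, or -1 when every slot is taken.
int store_result(int payload)
{
  for (int i = 0; i < MAX_RESULTS; ++i) {
    bool expected = false;
    // Only one thread can flip `claimed` from false to true for a slot,
    // so the check and the write can no longer interleave across threads.
    if (result_queue[i].claimed.compare_exchange_strong(expected, true)) {
      result_queue[i].payload = payload;
      // Publish `ready` last so readers never see a half-written entry.
      result_queue[i].ready.store(true, std::memory_order_release);
      return i;
    }
  }
  return -1;  // full queue: the caller must report this, not drop the result
}
```

A reader would pair this with an acquire load of `ready` before consuming a slot, and the `-1` return is the hook for the explicit full-queue error path the prompt asks for.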
In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp`:
- Around line 53-57: The test file uses open() and O_WRONLY/O_CREAT/O_TRUNC but
does not include <fcntl.h>, which is required on some platforms; add a
`#include <fcntl.h>` to the top of grpc_integration_test.cpp alongside the existing system
headers (near the includes for <signal.h>, <sys/types.h>, <sys/wait.h>,
<unistd.h>) so that the open() call and O_* constants used by the test compile
portably.
In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp`:
- Around line 213-218: The setter set_server_log_path writes server_log_path_,
server_log_start_pos_, and test_start_marked_ without synchronization; protect
these shared members by acquiring mutex_ (e.g., std::lock_guard or
std::unique_lock on mutex_) at the start of set_server_log_path and similarly in
every reader/callback that touches server_log_path_, server_log_start_pos_, or
test_start_marked_ (the read blocks around lines ~229-244 and ~332-355). Under
the lock, perform updates or read-and-copy the values, and for readers prefer
copying the protected values into local variables while holding mutex_ then
release the lock before doing any IO/long operations to avoid holding mutex_
during blocking work.
In `@GRPC_ARCHITECTURE.md`:
- Around line 9-10: The phrase claiming "no custom serialization logic is
implemented" is too absolute; update the line to clarify the scope by stating
that protocol buffers generated by protoc and grpc_cpp_plugin are used for gRPC
wire payloads, and explicitly note the exception that
grpc_pipe_serialization.hpp provides custom pipe blob framing helpers (i.e., the
custom framing is outside gRPC wire payload serialization). Locate the sentence
in GRPC_ARCHITECTURE.md and reword it to mention "for gRPC wire payloads" and
add a parenthetical or following sentence referencing
grpc_pipe_serialization.hpp as the known custom framing helper.
---
Nitpick comments:
In `@cpp/src/grpc/client/test_grpc_client.cpp`:
- Line 340: The return path after stream_logs currently returns non-zero on
failure but drops the error detail; update the failure branch surrounding
stream_logs in test_grpc_client.cpp to call and emit client.get_last_error() (or
equivalent accessor) to the log/stream before returning the non-zero exit code
so the client error is logged for parity with other modes and aids diagnosis
(locate the code around stream_logs and the return of success ? 0 : 1 and insert
a process/log call that prints client.get_last_error() when success is false).
In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp`:
- Around line 24-27: kPipeChunkBytes is a hard-coded constexpr limit; make it
configurable by replacing the fixed constexpr with a runtime-configured value
(e.g., a GetPipeChunkBytes() or PipeConfig::pipe_chunk_bytes() accessor) that
reads an environment variable or server config and falls back to 64ULL * 1024 *
1024 as the default; update references to kPipeChunkBytes to call the accessor
and validate the parsed value (positive, reasonable upper bound) and document
the new env var name (e.g., PIPE_CHUNK_BYTES) and default behavior.
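A runtime accessor along these lines would do; the env var name `CUOPT_PIPE_CHUNK_BYTES` and the 1 GiB upper bound are assumptions for illustration, not existing knobs:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical replacement for the hard-coded kPipeChunkBytes constant.
// Falls back to 64 MiB when the variable is unset, unparsable, zero, or
// above a sanity cap.
inline uint64_t get_pipe_chunk_bytes()
{
  constexpr uint64_t kDefault = 64ULL * 1024 * 1024;    // current constant
  constexpr uint64_t kMax     = 1024ULL * 1024 * 1024;  // reject absurd values
  const char* env = std::getenv("CUOPT_PIPE_CHUNK_BYTES");
  if (!env) return kDefault;
  char* end            = nullptr;
  unsigned long long v = std::strtoull(env, &end, 10);
  if (end == env || *end != '\0' || v == 0 || v > kMax) return kDefault;
  return static_cast<uint64_t>(v);
}
```

Re-reading the environment on every call keeps the sketch testable; a production version might cache the parsed value once at startup instead.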
In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp`:
- Around line 957-982: The test registers a log callback (received_logs with
log_mutex) but never asserts anything, so add assertions after the call to
client->solve_mip in SolveMIPWithLogCallback to verify the callback was invoked
and logs look correct: acquire lock(log_mutex) and assert !received_logs.empty()
(and optionally that at least one entry contains an expected substring like
"objective" or "MIP" to ensure meaningful log content). If solve_mip is
asynchronous/streams logs, ensure you wait briefly or poll until received_logs
is non-empty before asserting to avoid flakes; reference received_logs,
log_mutex, log_callback, SolveMIPWithLogCallback, and client->solve_mip when
adding the checks.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8bfed585-0c3e-4c04-9fc0-af2be140a6f2
📒 Files selected for processing (39)
- GRPC_ARCHITECTURE.md
- SERVER_ARCHITECTURE.md
- build.sh
- conda/recipes/libcuopt/recipe.yaml
- cpp/CMakeLists.txt
- cpp/cuopt_grpc_server.cpp
- cpp/src/grpc/client/grpc_client.cu
- cpp/src/grpc/client/grpc_client.hpp
- cpp/src/grpc/client/solve_remote.cu
- cpp/src/grpc/client/test_grpc_client.cpp
- cpp/src/grpc/cuopt_remote.proto
- cpp/src/grpc/cuopt_remote_service.proto
- cpp/src/grpc/grpc_problem_mapper.cu
- cpp/src/grpc/grpc_problem_mapper.hpp
- cpp/src/grpc/grpc_service_mapper.cu
- cpp/src/grpc/grpc_service_mapper.hpp
- cpp/src/grpc/grpc_settings_mapper.cu
- cpp/src/grpc/grpc_settings_mapper.hpp
- cpp/src/grpc/grpc_solution_mapper.cu
- cpp/src/grpc/grpc_solution_mapper.hpp
- cpp/src/grpc/server/grpc_field_element_size.hpp
- cpp/src/grpc/server/grpc_incumbent_proto.hpp
- cpp/src/grpc/server/grpc_job_management.cpp
- cpp/src/grpc/server/grpc_pipe_serialization.hpp
- cpp/src/grpc/server/grpc_server_main.cpp
- cpp/src/grpc/server/grpc_server_threads.cpp
- cpp/src/grpc/server/grpc_server_types.hpp
- cpp/src/grpc/server/grpc_service_impl.cpp
- cpp/src/grpc/server/grpc_worker.cpp
- cpp/src/grpc/server/grpc_worker_infra.cpp
- cpp/src/pdlp/CMakeLists.txt
- cpp/src/pdlp/solve_remote.cu
- cpp/tests/linear_programming/CMakeLists.txt
- cpp/tests/linear_programming/grpc/CMakeLists.txt
- cpp/tests/linear_programming/grpc/grpc_client_test.cpp
- cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp
- cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
- cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
- python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
💤 Files with no reviewable changes (2)
- cpp/src/pdlp/CMakeLists.txt
- cpp/src/pdlp/solve_remote.cu
🚧 Files skipped from review as they are similar to previous changes (5)
- build.sh
- cpp/tests/linear_programming/CMakeLists.txt
- cpp/tests/linear_programming/grpc/CMakeLists.txt
- python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
- cpp/src/grpc/grpc_settings_mapper.cu
cuopt uses embedded grpc client to solve problems on a remote server
/ok to test 718d938
Actionable comments posted: 6
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
♻️ Duplicate comments (6)
cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp (1)
339-357:⚠️ Potential issue | 🟠 Major
`dump_logs()` and `is_test_start_marked()` still read `test_start_marked_` unlocked. The writes are synchronized, but Line 344 and Line 357 read the same flag directly. With the log callback and the test thread running concurrently, that is still a data race and can make the log window nondeterministic.
🛠️ Minimal fix
```diff
   void dump_logs(std::ostream& os = std::cout, bool include_all_server_logs = false) const
   {
+    const bool marked = is_test_start_marked();
     os << "=== Client Logs ===\n";
     os << get_client_logs();
     os << "\n=== Server Logs";
-    if (test_start_marked_ && !include_all_server_logs) {
+    if (marked && !include_all_server_logs) {
       os << " (since test start)";
     } else {
       os << " (all)";
     }
@@
-  bool is_test_start_marked() const { return test_start_marked_; }
+  bool is_test_start_marked() const
+  {
+    std::lock_guard<std::mutex> lock(mutex_);
+    return test_start_marked_;
+  }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp` around lines 339 - 357, dump_logs() and is_test_start_marked() read test_start_marked_ without locking, causing a data race with the log callback writer; fix by protecting those reads with the same mutex used to synchronize log writes (e.g., wrap the reads in dump_logs() and is_test_start_marked() in a std::lock_guard<std::mutex> using the existing logs mutex such as logs_mutex_ or mutex_), or alternatively change test_start_marked_ to std::atomic<bool> and update writers/readers accordingly so all accesses are thread-safe (make the change in the dump_logs, is_test_start_marked, and the writer that sets test_start_marked_).cpp/src/grpc/server/grpc_worker_infra.cpp (1)
147-154: ⚠️ Potential issue | 🟡 Minor
Add PID validation before `waitpid` to avoid undefined behavior. The loop iterates `worker_pids` without checking for non-positive values. If a worker failed to spawn (returning -1 from `spawn_worker`) but was still pushed to the vector, or if the vector contains stale zero values, calling `waitpid(0, ...)` waits for any child in the process group, which is unintended.
Proposed fix
```diff
 void wait_for_workers()
 {
   for (pid_t pid : worker_pids) {
+    if (pid <= 0) continue;
     int status;
     while (waitpid(pid, &status, 0) < 0 && errno == EINTR) {}
   }
   worker_pids.clear();
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_worker_infra.cpp` around lines 147 - 154, In wait_for_workers(), validate PIDs from the worker_pids vector before calling waitpid to avoid undefined behavior: skip any non-positive entries (<= 0) that may come from failed spawn_worker() calls or stale zeros, only call waitpid(pid, &status, 0) for pid > 0 (and optionally log or collect invalid PIDs), and then clear worker_pids as before; ensure the check is applied inside the for-loop that iterates worker_pids so stray 0 or -1 values are not passed to waitpid.cpp/src/grpc/cuopt_remote_service.proto (1)
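The validated reaping loop from the fix above can be exercised standalone; `reap_workers` here is an illustrative free function, not the server's actual `wait_for_workers`:

```cpp
#include <cassert>
#include <cerrno>
#include <sys/wait.h>
#include <unistd.h>
#include <vector>

// Reap only valid child PIDs, skipping sentinels left behind by failed
// spawns; returns how many children were actually waited on.
int reap_workers(std::vector<pid_t>& pids)
{
  int reaped = 0;
  for (pid_t pid : pids) {
    if (pid <= 0) continue;  // waitpid(0, ...) would wait on the whole process group
    int status = 0;
    pid_t r;
    do {
      r = waitpid(pid, &status, 0);
    } while (r < 0 && errno == EINTR);  // restart when interrupted by a signal
    if (r == pid) ++reaped;
  }
  pids.clear();
  return reaped;
}
```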
320-325:⚠️ Potential issue | 🟠 MajorInclude the best bound in
Incumbentupdates.Local
`GET_SOLUTION` callbacks receive both the incumbent objective and the current best bound, but this message only carries the objective and assignment. The client wrapper has to synthesize or omit the bound today, so remote callbacks lose gap information and can't preserve local stop heuristics.
📦 Schema addition
```diff
 message Incumbent {
   int64 index = 1;
   double objective = 2;
   repeated double assignment = 3;
   string job_id = 4;
+  double solution_bound = 5;
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/cuopt_remote_service.proto` around lines 320 - 325, The Incumbent proto message is missing the current best bound; add a new double field (e.g. best_bound = 5) to message Incumbent so remote GET_SOLUTION callbacks carry both objective and bound, then regenerate gRPC/protobuf stubs and update any producers/consumers that construct or read Incumbent (places referencing Incumbent, its index/objective/assignment/job_id handling) to populate and consume the new best_bound field accordingly so gap information is preserved end-to-end.cpp/tests/linear_programming/grpc/grpc_client_test.cpp (1)
1347-1353:⚠️ Potential issue | 🟡 MinorStop the test at
`PROCESSING` instead of expecting `GetResult`.
The comment above this block says `get_lp_result()` should not attempt `GetResult()` once `CheckStatus()` reports `PROCESSING`, but the mock still requires that RPC. This locks in the opposite behavior.
Suggested fix
```diff
-  EXPECT_CALL(*mock_stub_, GetResult(_, _, _))
-    .WillOnce([](grpc::ClientContext*,
-                 const cuopt::remote::GetResultRequest&,
-                 cuopt::remote::ResultResponse*) {
-      return grpc::Status(grpc::StatusCode::UNAVAILABLE, "Result not ready");
-    });
+  EXPECT_CALL(*mock_stub_, GetResult(_, _, _)).Times(0);
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/linear_programming/grpc/grpc_client_test.cpp` around lines 1347 - 1353, The test currently sets an EXPECT_CALL on mock_stub_->GetResult which contradicts the intended behavior that get_lp_result() should stop when CheckStatus reports PROCESSING; remove or change that expectation so GetResult is not required to be called (e.g., delete the EXPECT_CALL for GetResult or change it to .Times(0)), ensure the mock for CheckStatus returns PROCESSING, and verify the test asserts that get_lp_result() stops after CheckStatus rather than invoking GetResult; references: mock_stub_, GetResult, CheckStatus, get_lp_result.cpp/src/grpc/server/grpc_job_management.cpp (1)
50-55:⚠️ Potential issue | 🟠 MajorBound the incumbent payload before
resize().
`size` comes straight off the pipe, so `data.resize(size)` lets a malformed worker force an arbitrary allocation here. Please reject anything above an explicit cap before resizing.
As per coding guidelines "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files."
🔧 Suggested guard
```diff
 bool recv_incumbent_pipe(int fd, std::vector<uint8_t>& data)
 {
   uint64_t size;
   if (!read_from_pipe(fd, &size, sizeof(size))) return false;
+  if (size > static_cast<uint64_t>(config.max_message_bytes)) {
+    std::cerr << "[Worker] Incumbent payload too large: " << size << "\n";
+    return false;
+  }
   data.resize(size);
   if (size > 0 && !read_from_pipe(fd, data.data(), size)) return false;
   return true;
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 50 - 55, The recv_incumbent_pipe function currently calls data.resize(size) with size read from the pipe (via read_from_pipe) allowing a remote peer to force unbounded allocation; guard against this by validating the uint64_t size before resizing—define and use an explicit max payload constant (e.g., MAX_INCUMBENT_PAYLOAD or similar within this compilation unit), check that size <= MAX_INCUMBENT_PAYLOAD and size is non-negative/ sensible, reject the read (return false) if the check fails, then proceed to data.resize(size) and the subsequent read_from_pipe call; update any related error paths or comments in recv_incumbent_pipe accordingly.cpp/src/grpc/grpc_problem_mapper.cpp (1)
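The size-capped receive pattern can be tried against an ordinary pipe; `kMaxIncumbentBytes` and the helper names below are illustrative, not the server's real configuration:

```cpp
#include <cassert>
#include <cstdint>
#include <unistd.h>
#include <vector>

constexpr uint64_t kMaxIncumbentBytes = 16ULL * 1024 * 1024;  // assumed cap

// Read exactly `len` bytes, retrying short reads; false on EOF or error.
bool read_exact(int fd, void* buf, size_t len)
{
  auto* p = static_cast<uint8_t*>(buf);
  while (len > 0) {
    ssize_t n = read(fd, p, len);
    if (n <= 0) return false;
    p += n;
    len -= static_cast<size_t>(n);
  }
  return true;
}

// Length-prefixed receive that rejects an oversized header *before*
// allocating, so a malformed peer cannot force a huge resize().
bool recv_bounded(int fd, std::vector<uint8_t>& data)
{
  uint64_t size = 0;
  if (!read_exact(fd, &size, sizeof(size))) return false;
  if (size > kMaxIncumbentBytes) return false;
  data.resize(static_cast<size_t>(size));
  return size == 0 || read_exact(fd, data.data(), size);
}
```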
440-450:⚠️ Potential issue | 🟠 MajorStop once
memchr()falls off the blob.When
memchr()misses,nulbecomess_endand Line 450 still formsnul + 1. That walks past the buffer on malformed chunk payloads.As per coding guidelines "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files."🔧 Suggested fix
```diff
 while (s < s_end) {
   const char* nul = static_cast<const char*>(std::memchr(s, '\0', s_end - s));
   if (!nul) nul = s_end;
   names.emplace_back(s, nul);
-  s = nul + 1;
+  if (nul == s_end) break;
+  s = nul + 1;
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/grpc_problem_mapper.cpp` around lines 440 - 450, In get_string_list, when std::memchr returns nullptr we set nul = s_end but then do s = nul + 1 which advances past the buffer; change the loop to detect the "no NUL found" case and stop safely: after computing nul (from std::memchr), if nul == s_end handle the final fragment (push a string from s to s_end if non-empty) and break the loop instead of doing s = nul + 1; only advance s = nul + 1 when nul < s_end. This prevents walking past the blob and fixes arrays/deserialization safety in get_string_list.
🟠 Major comments (23)
python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py-77-83 (1)
77-83:⚠️ Potential issue | 🟠 MajorClear inherited
CUOPT_TLS_*variables in_cpu_only_env.This helper copies the caller environment verbatim, so a pre-set TLS configuration from the shell or another fixture can leak into the plaintext-server tests and make them fail nondeterministically. Please explicitly remove the TLS keys here so each test class fully owns its transport settings.
Suggested fix
```diff
 def _cpu_only_env(port):
     """Return an env dict that hides all GPUs and enables remote mode."""
     env = os.environ.copy()
+    for key in (
+        "CUOPT_TLS_ENABLED",
+        "CUOPT_TLS_ROOT_CERT",
+        "CUOPT_TLS_CLIENT_CERT",
+        "CUOPT_TLS_CLIENT_KEY",
+    ):
+        env.pop(key, None)
     env["CUDA_VISIBLE_DEVICES"] = ""
     env["CUOPT_REMOTE_HOST"] = "localhost"
     env["CUOPT_REMOTE_PORT"] = str(port)
     return env
```
As per coding guidelines "Ensure test isolation: prevent GPU state, cached memory, and global variables from leaking between test cases; verify each test independently initializes its environment".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py` around lines 77 - 83, The _cpu_only_env helper currently copies the caller environment verbatim which can leak TLS settings; modify the function (_cpu_only_env) to explicitly remove any keys that start with "CUOPT_TLS_" from the env dict (e.g., iterate over keys or pop specific known CUOPT_TLS_* names) before returning so plaintext-server tests run with a clean transport configuration and deterministic behavior.python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py-312-338 (1)
312-338:⚠️ Potential issue | 🟠 MajorAssert remote numerical correctness, not just subprocess success.
These remote-path tests still pass when the gRPC/CLI path returns wrong numbers with the right shapes or simply exits 0. Since
`afiro_original.mps` and `bb_optimality.mps` are known instances, please assert objective/gap correctness on the CPU-only, CLI, TLS, and mTLS paths as well.
Also applies to: 432-442, 560-565, 625-635
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py` around lines 312 - 338, The tests currently only check subprocess exit codes; update the test functions (test_lp_solve_cpu_only, test_lp_dual_solution_cpu_only, test_mip_solve_cpu_only, test_warmstart_cpu_only) and their underlying helpers (_impl_lp_solve_cpu_only, _impl_lp_dual_solution_cpu_only, _impl_mip_solve_cpu_only, _impl_warmstart_cpu_only) to assert numerical correctness: load the known instances (afiro_original.mps, bb_optimality.mps) and assert the returned objective value, primal/dual vectors, reduced costs and optimality gap are within a tight tolerance (e.g. absolute/relative eps) rather than only checking returncode; apply the same numerical assertions to the CLI/TLS/mTLS remote paths covered by the other failing ranges (around 432-442, 560-565, 625-635) and include tests for degenerate cases (infeasible/unbounded/empty/singleton) verifying expected statuses and gap behavior.python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py-110-120 (1)
110-120:⚠️ Potential issue | 🟠 MajorAdd a
subjectAltNameto the generated server certificate.The server certificate only sets
CN=localhost. Modern TLS stacks, particularly BoringSSL (used in gRPC C-core), do not accept hostname verification via common name and requiresubjectAltNameinstead. CN-only certificates will cause test failures on systems that enforce this requirement.Suggested fix
```diff
     server_key = os.path.join(cert_dir, "server.key")
     server_csr = os.path.join(cert_dir, "server.csr")
     server_crt = os.path.join(cert_dir, "server.crt")
+    server_ext = os.path.join(cert_dir, "server.ext")
+    with open(server_ext, "w", encoding="utf-8") as f:
+        f.write("subjectAltName=DNS:localhost,IP:127.0.0.1\n")
     if not _run(
         f"openssl req -newkey rsa:2048 -keyout {server_key} -out {server_csr} "
         f"-nodes -subj '/CN=localhost' 2>/dev/null"
     ):
         return False
     if not _run(
         f"openssl x509 -req -in {server_csr} -CA {ca_crt} -CAkey {ca_key} "
-        f"-CAcreateserial -out {server_crt} -days 1 2>/dev/null"
+        f"-CAcreateserial -out {server_crt} -days 1 "
+        f"-extfile {server_ext} 2>/dev/null"
     ):
         return False
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py` around lines 110 - 120, The server certificate generation uses _run to call openssl and only sets CN=localhost (server_csr/server_crt), which fails hostname verification on BoringSSL—update the second openssl invocation that creates server_crt (the openssl x509 -req ... call referencing {server_csr}, {ca_crt}, {ca_key}) to include a subjectAltName for DNS:localhost and IP:127.0.0.1 (via -addext "subjectAltName=DNS:localhost,IP:127.0.0.1" or by supplying an -extfile/-extensions config) so the generated server_crt includes SAN entries accepted by modern TLS stacks.cpp/src/grpc/server/grpc_server_types.hpp-254-256 (1)
254-256:⚠️ Potential issue | 🟠 MajorNamespace the shared-memory objects per server instance.
With fixed names like
`/cuopt_job_queue`, `/cuopt_result_queue`, and `/cuopt_control`, two `cuopt_grpc_server` processes on the same host will attach to the same IPC objects even if they listen on different ports. That can cause cross-talk between jobs and one instance unlinking the other's shared memory during cleanup. Please derive these names from an instance identifier such as port/PID and pass them through to workers.
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_types.hpp` around lines 254 - 256, The fixed shared-memory name constants SHM_JOB_QUEUE, SHM_RESULT_QUEUE and SHM_CONTROL must be replaced with functions that derive per-instance names (e.g., using the server's listening port and/or PID) so each cuopt_grpc_server gets unique IPC names; update the code that creates/opens/unlinks these objects to call the new name generators and thread/worker spawn points to pass the resolved names into worker initialization so workers use the same instance-specific names rather than the previous global constants.GRPC_QUICK_START.md-197-205 (1)
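Deriving the names is straightforward; suffixing with port and PID is one possible scheme, not the project's settled convention:

```cpp
#include <cassert>
#include <string>
#include <sys/types.h>

// Build a per-instance POSIX shared-memory name such as
// "/cuopt_job_queue_50051_1234" so two servers on one host cannot
// attach to (or unlink) each other's objects.
std::string instance_shm_name(const std::string& base, int port, pid_t pid)
{
  return base + "_" + std::to_string(port) + "_" +
         std::to_string(static_cast<long>(pid));
}
```

The server would pass the resolved names to forked workers (for example via argv or environment) so both sides call `shm_open` with the same suffixed string.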
197-205:⚠️ Potential issue | 🟠 MajorThe "C API" section shows C++ code, not the public C API.
Lines 200–204 use C++ headers (
`solve.hpp`, `cpu_optimization_problem.hpp`), the `auto` keyword, and the `cuopt::linear_programming::` namespace. This will not compile in a C translation unit. Replace this snippet with the actual public C API, which uses `cuOptSolve()` with opaque handles (`cuOptOptimizationProblem`, `cuOptSolverSettings`, `cuOptSolution*`). Alternatively, rename the section to "C++" if the intent is to show internal C++ usage.
Verify each finding against the current code and only fix it if needed. In `@GRPC_QUICK_START.md` around lines 197 - 205, The section labeled "C API" currently shows C++ code; either change the section title to "C++" or replace the snippet with the real public C API: show usage of cuOptSolve with the opaque handle types (cuOptOptimizationProblem, cuOptSolverSettings, cuOptSolution*) and the proper C function call (cuOptSolve) and cleanup patterns so the example compiles in a C translation unit; ensure the example does not use C++ headers, namespaces, or keywords like auto and instead demonstrates the correct C handles and function signatures.cpp/src/grpc/server/grpc_pipe_io.cpp-32-57 (1)
32-57:⚠️ Potential issue | 🟠 Major
`timeout_ms` should cover the entire data transfer, and respect `POLLIN | POLLHUP` state. The initial `poll()` enforces the timeout, but the loop at line 59 falls back to blocking `read()` calls without timeout protection. If a writer sends a size header, then stalls before sending the payload, the reader blocks indefinitely on the second `read_from_pipe()` call (e.g., at line 38 of `grpc_job_management.cpp`). Additionally, line 54 treats `POLLERR | POLLHUP | POLLNVAL` as immediate failures. When `poll()` returns with both `POLLIN` and `POLLHUP` set (indicating the peer closed but buffered bytes remain), the code returns false and drops those buffered bytes instead of draining them. Use an absolute deadline and poll until no more readable bytes remain before treating hangup as fatal.
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_pipe_io.cpp` around lines 32 - 57, The read_from_pipe function must enforce timeout across the entire transfer and not drop buffered bytes on POLLHUP; change the logic in read_from_pipe so you compute an absolute deadline from timeout_ms, then loop calling poll(&pfd,1,remaining_ms) (handling EINTR) before each read and update remaining_ms from the deadline; only treat POLLNVAL/POLLERR as immediate failure, and if pfd.revents has POLLIN (even if POLLHUP is also set) perform read() to drain available bytes into ptr/remaining until read returns 0 or remaining==0, only treating a pure POLLHUP with no POLLIN as EOF, and return false on poll timeout once the absolute deadline passes. Ensure you reference the existing symbols ptr, remaining, pfd and timeout_ms when making these changes.cpp/src/grpc/server/grpc_pipe_serialization.hpp-192-200 (1)
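One way to combine an absolute deadline with hangup draining is sketched below; this is an illustrative helper, not the file's actual `read_from_pipe`:

```cpp
#include <cassert>
#include <cerrno>
#include <chrono>
#include <cstdint>
#include <poll.h>
#include <unistd.h>

// Read `len` bytes with one absolute deadline covering the whole transfer,
// draining buffered bytes even when POLLHUP accompanies POLLIN.
bool read_with_deadline(int fd, void* buf, size_t len, int timeout_ms)
{
  const auto deadline =
    std::chrono::steady_clock::now() + std::chrono::milliseconds(timeout_ms);
  auto* p = static_cast<uint8_t*>(buf);
  while (len > 0) {
    const auto remaining = std::chrono::duration_cast<std::chrono::milliseconds>(
                             deadline - std::chrono::steady_clock::now())
                             .count();
    if (remaining <= 0) return false;  // overall deadline exceeded
    struct pollfd pfd = {fd, POLLIN, 0};
    int rc = poll(&pfd, 1, static_cast<int>(remaining));
    if (rc < 0 && errno == EINTR) continue;  // retry; same deadline applies
    if (rc <= 0) return false;               // poll error or timeout
    if (pfd.revents & (POLLERR | POLLNVAL)) return false;
    // POLLHUP with buffered data still yields readable bytes; read() only
    // returns 0 once the pipe is drained, which is the true EOF signal.
    ssize_t n = read(fd, p, len);
    if (n <= 0) return false;
    p += n;
    len -= static_cast<size_t>(n);
  }
  return true;
}
```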
192-200:⚠️ Potential issue | 🟠 MajorSame validation needed in
read_result_from_pipe.Apply the same
`total_bytes` bounds check here to prevent resource exhaustion from malformed result payloads.
Proposed fix
```diff
 for (uint32_t i = 0; i < num_arrays; ++i) {
   int32_t field_id;
   uint64_t total_bytes;
   if (!read_from_pipe(fd, &field_id, sizeof(field_id))) return false;
   if (!read_from_pipe(fd, &total_bytes, sizeof(total_bytes))) return false;
+  if (total_bytes > kMaxArrayBytes) return false;
   auto& dest = arrays_out[field_id];
   dest.resize(static_cast<size_t>(total_bytes));
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 192 - 200, In read_result_from_pipe, validate the uint64_t total_bytes read from the pipe before calling dest.resize() and read_from_pipe(): check total_bytes is within an expected/sane upper bound (e.g. compare against a MAX_ALLOWED_BYTES constant or a per-array limit and ensure it fits into size_t) and return false if it exceeds the limit or would cause allocation overflow; apply this check in the same loop that reads field_id/total_bytes and references arrays_out, so you never resize or read using a maliciously large total_bytes.cpp/src/grpc/server/grpc_pipe_serialization.hpp-140-149 (1)
140-149:⚠️ Potential issue | 🟠 MajorAdd validation for
total_bytesbefore resizing to prevent resource exhaustion.The
`read_chunked_request_from_pipe` function reads `total_bytes` from untrusted input and directly uses it to resize a vector without bounds checking. A malicious or corrupted payload could specify an extremely large value, causing allocation failure or resource exhaustion.
Proposed fix
```diff
+// Maximum allowable array size to prevent resource exhaustion (e.g., 4 GiB)
+static constexpr uint64_t kMaxArrayBytes = 4ULL * 1024 * 1024 * 1024;
+
 inline bool read_chunked_request_from_pipe(int fd,
                                            cuopt::remote::ChunkedProblemHeader& header_out,
                                            std::map<int32_t, std::vector<uint8_t>>& arrays_out)
 {
   if (!read_protobuf_from_pipe(fd, header_out)) return false;
   uint32_t num_arrays;
   if (!read_from_pipe(fd, &num_arrays, sizeof(num_arrays))) return false;
   // Read each field's raw bytes directly into the output map, keyed by field_id.
   for (uint32_t i = 0; i < num_arrays; ++i) {
     int32_t field_id;
     uint64_t total_bytes;
     if (!read_from_pipe(fd, &field_id, sizeof(field_id))) return false;
     if (!read_from_pipe(fd, &total_bytes, sizeof(total_bytes))) return false;
+    if (total_bytes > kMaxArrayBytes) return false;
     auto& dest = arrays_out[field_id];
     dest.resize(static_cast<size_t>(total_bytes));
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 140 - 149, The loop in read_chunked_request_from_pipe currently trusts total_bytes from the pipe and calls dest.resize(total_bytes); validate total_bytes before resizing: ensure field_id is within arrays_out bounds, check total_bytes is not negative (already unsigned but still validate), enforce a sane upper limit (e.g. compare against a MAX_ALLOWED_CHUNK or the destination's max_size/remaining budget) and fail (return false) if it exceeds that limit or would cause memory growth beyond system limits; only then call dest.resize(...) and proceed to read_from_pipe into dest.data(). Use the symbols read_chunked_request_from_pipe, num_arrays, field_id, total_bytes, arrays_out, dest.resize, and read_from_pipe when locating the code to implement these checks.GRPC_SERVER_ARCHITECTURE.md-97-106 (1)
97-106:⚠️ Potential issue | 🟠 MajorFull upload buffering in the coordinator is an OOM vector.
This design keeps chunked uploads in main-process memory until
`FinishChunkedUpload`, so memory usage scales with the entire problem size rather than the in-flight chunk size. A few concurrent "multi-GB" uploads can exhaust the coordinator before pipe backpressure helps. Consider streaming directly to the worker or enforcing hard per-upload and global memory quotas.
Verify each finding against the current code and only fix it if needed. In `@GRPC_SERVER_ARCHITECTURE.md` around lines 97 - 106, Coordinator currently buffers entire uploads in ChunkedUploadState and pending_chunked_data until FinishChunkedUpload, which risks OOM; instead modify the chunk handling so each incoming chunk is either (a) streamed immediately to the worker pipe (forward inside the same handler used by FinishChunkedUpload) and not retained in ChunkedUploadState, or (b) if buffering is required, enforce strict per-upload and global memory quotas and refuse/slow new chunks when limits are reached; update the code paths around FinishChunkedUpload, ChunkedUploadState, and pending_chunked_data to stop accumulating full uploads (or add quota checks and accounting and evictions), and ensure backpressure is propagated to gRPC handlers so the coordinator never holds multi-GB of data in memory.GRPC_SERVER_ARCHITECTURE.md-52-57 (1)
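One way to enforce the quota side of this is a shared budget that admits a chunk only when the global total stays under a limit; the class and its numbers are illustrative, not an existing cuOpt type:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Global accounting for buffered upload bytes: reserve() atomically admits
// a chunk only if the running total stays under the cap, and release()
// hands the bytes back once the upload is forwarded to a worker or aborted.
class UploadBudget {
 public:
  explicit UploadBudget(uint64_t cap) : cap_(cap) {}

  bool reserve(uint64_t bytes)
  {
    uint64_t cur = used_.load();
    do {
      if (cur + bytes > cap_) return false;  // would exceed the global cap
    } while (!used_.compare_exchange_weak(cur, cur + bytes));
    return true;
  }

  void release(uint64_t bytes) { used_.fetch_sub(bytes); }
  uint64_t used() const { return used_.load(); }

 private:
  const uint64_t cap_;
  std::atomic<uint64_t> used_{0};
};
```

A rejected `reserve()` maps naturally onto failing the gRPC chunk call with a resource-exhausted status, which propagates backpressure to the client.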
52-57:⚠️ Potential issue | 🟠 MajorNamespace the shared-memory objects per server instance.
`/cuopt_job_queue`, `/cuopt_result_queue`, and `/cuopt_control` are global names. Two `cuopt_grpc_server` processes on the same host will collide even if they listen on different ports, because the later instance can unlink/recreate the first instance's IPC objects. Add a per-instance suffix such as port, PID, or UUID and pass it to workers.
Verify each finding against the current code and only fix it if needed. In `@GRPC_SERVER_ARCHITECTURE.md` around lines 52 - 57, The documented shared-memory names (/cuopt_job_queue, /cuopt_result_queue, /cuopt_control) are global and will collide across multiple server instances; update the server and worker startup to generate a per-instance suffix (e.g., port, PID, or UUID) and append it to those base names to form unique IPC identifiers, propagate that suffix into worker launch/config so workers use the same suffixed names, and update all references in documentation and code (search for /cuopt_job_queue, /cuopt_result_queue, /cuopt_control) so unlink/create/attach operations use the suffixed names.cpp/src/grpc/server/grpc_server_main.cpp-225-230 (1)
225-230:⚠️ Potential issue | 🟠 MajorFail startup if no worker actually came up.
`spawn_workers()` skips failed spawns, but `main()` never verifies the result before binding the gRPC port. With `--workers 0` or repeated `spawn_worker()` failures, the server accepts jobs that can never be processed.
💡 Minimal guard
```diff
   spawn_workers();
+  if (worker_pids.empty()) {
+    std::cerr << "[Server] Failed to start any worker processes\n";
+    cleanup_shared_memory();
+    return 1;
+  }
   std::thread result_thread(result_retrieval_thread);
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 225 - 230, spawn_workers() currently skips failed spawns but main() proceeds to bind the gRPC port and start threads; change main() to verify that at least one worker was started (either by making spawn_workers() return the number of successful spawns or by checking the global worker container/worker_count after spawn_workers()) and if zero, log an error and exit non-zero before creating result_retrieval_thread, incumbent_retrieval_thread, worker_monitor_thread, session_reaper_thread or binding the gRPC server; update spawn_workers()/spawn_worker() signatures if needed to surface success count so main() can perform this check.cpp/src/grpc/grpc_settings_mapper.cpp-140-149 (1)
140-149:⚠️ Potential issue | 🟠 Major
`iteration_limit` loses its "unset" state when omitted in proto3 decode.
The `iteration_limit` field is a plain proto3 scalar that defaults to `0` when omitted. The encode side uses `-1` as a sentinel to represent "unset" (library default), but this convention only works if all clients consistently follow it. External clients or future versions unaware of this sentinel will send omitted fields as `0`, which the decode side interprets as "explicitly request 0 iterations" rather than "use the server default". This RPC contract is brittle and violates the principle of forward compatibility. Use `optional int64 iteration_limit` to add proper field presence semantics in proto3.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/grpc_settings_mapper.cpp` around lines 140 - 149, The proto3 scalar iteration_limit must become optional to preserve presence semantics: change the .proto to "optional int64 iteration_limit" and update encoding so it omits the field when the library default/unset is intended (stop writing -1 as a sentinel). In the decoder (grpc_settings_mapper.cpp) replace the current check that relies on a numeric sentinel (pb_settings.iteration_limit() >= 0) with a presence check (pb_settings.has_iteration_limit()) and only then clamp and assign to settings.iteration_limit (preserving the existing max-clamp logic); reference pb_settings, settings.iteration_limit and the mapping code that sets iteration_limit.
cpp/src/grpc/client/solve_remote.cpp-125-159 (1)
125-159:⚠️ Potential issue | 🟠 Major
Forward `problem_checking` and `use_pdlp_solver_mode` to the remote solver.
`solve_lp_remote()` accepts both parameters but discards them before calling `client.solve_lp(cpu_problem, settings)`. The local PDLP path respects these flags: `problem_checking` enables problem validation (line 1325), and `use_pdlp_solver_mode` controls hyperparameter selection (line 1458). Callers at `cpp/src/pdlp/solve.cu:1675` explicitly pass these flags expecting remote execution to honor them, creating an API-level inconsistency.
Add these flags to the proto `PDLPSolverSettings` message and forward them through `grpc_client_t::solve_lp()` so remote and local paths behave consistently.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/solve_remote.cpp` around lines 125 - 159, The solve_lp_remote function currently ignores problem_checking and use_pdlp_solver_mode; update the flow by adding these two bools to the PDLPSolverSettings proto (e.g., PDLPSolverSettings.problem_checking and PDLPSolverSettings.use_pdlp_solver_mode), regenerate the gRPC stubs, extend pdlp_solver_settings_t to carry those fields, update grpc_client_t::solve_lp signature/impl to accept the extended settings (or ensure it serializes the new fields), and then call client.solve_lp(cpu_problem, settings_with_flags) from solve_lp_remote so the remote solver receives and honors problem_checking and use_pdlp_solver_mode. Ensure any serialization/deserialization and tests are updated to reflect the new proto fields.
cpp/src/grpc/server/grpc_service_impl.cpp-208-221 (1)
208-221:⚠️ Potential issue | 🟠 Major
Reject truncated chunked uploads before returning a job ID.
`FinishChunkedUpload()` clears `field_meta` and enqueues the job without checking that each uploaded field received its declared byte count. A client can omit the tail of an array and still get a successful `SubmitJobResponse`, pushing a malformed payload into the worker path instead of failing fast here.
As per coding guidelines "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems"
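The missing check is small. Below is a Python sketch of the idea; the `field_meta`/`chunks` shapes and the function name are hypothetical stand-ins for the server's actual C++ types, not the project's API:

```python
def validate_upload_complete(field_meta, chunks):
    """Return (ok, error) after checking every declared field arrived in full.

    field_meta: {field_id: declared_total_bytes}
    chunks:     {field_id: [bytes, ...]} as received from the client
    """
    for field_id, declared_bytes in field_meta.items():
        received = sum(len(c) for c in chunks.get(field_id, []))
        if received != declared_bytes:
            return False, (f"field {field_id}: declared {declared_bytes} bytes, "
                           f"received {received}")
    return True, ""

# A truncated tail must be rejected before a job ID is handed out.
meta = {"objective": 24, "rhs": 16}
ok, _ = validate_upload_complete(meta, {"objective": [b"x" * 24], "rhs": [b"y" * 16]})
assert ok
ok, err = validate_upload_complete(meta, {"objective": [b"x" * 24], "rhs": [b"y" * 8]})
assert not ok and "rhs" in err
```

Running this validation before clearing `field_meta` keeps the failure on the RPC boundary, where the client can still see it.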
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 208 - 221, FinishChunkedUpload currently clears state.field_meta and enqueues a job without verifying that each field actually received its declared byte count; before constructing PendingChunkedUpload and calling submit_chunked_job_async, iterate the entries in state.field_meta (or the header) and verify received bytes in state.chunks (or state.total_bytes/total_chunks per field) match the declared sizes, and if any mismatch reject the request (return an error status/SubmitJobResponse) instead of proceeding; keep references to PendingChunkedUpload, state.header, state.chunks, state.field_meta, submit_chunked_job_async and ensure the validation happens prior to std::move(state.header)/std::move(state.chunks) and before clearing state.field_meta so you can include useful diagnostics in the error.
cpp/src/grpc/server/grpc_service_impl.cpp-486-503 (1)
486-503:⚠️ Potential issue | 🟠 Major
Don't orphan running jobs in `DeleteResult`.
This erases `job_tracker[job_id]` unconditionally. If the job is still queued or processing, the worker keeps running but clients lose status/cancel/result handles, and the eventual completion no longer has authoritative tracker state. Restrict deletion to terminal jobs, or cancel and purge the backing worker/queue state before erasing the tracker.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 486 - 503, The current DeleteResult unconditionally erases job_tracker[job_id], which can orphan running jobs; change DeleteResult to first look up the tracker entry under tracker_mutex (use job_tracker.find(job_id) and tracker_mutex) and inspect the job state (e.g., job_entry.status or job_entry->state) and only erase if the job is in a terminal state (SUCCEEDED/FAILED/CANCELLED). If the job is queued or running, either return ERROR_BAD_STATE to the client or initiate a controlled cancellation flow: call the existing cancellation API (e.g., CancelJob(job_id) or job_entry->cancel()/queue.cancel(job_id)), wait or force-stop the worker/queue backing state, confirm the entry is terminal, then call delete_log_file(job_id) and erase from job_tracker; keep config.verbose logging for each branch.
cpp/src/grpc/server/grpc_service_impl.cpp-351-368 (1)
351-368:⚠️ Potential issue | 🟠 Major
Share result buffers across download sessions instead of copying them.
`StartChunkedDownload()` deep-copies `result_arrays` into every `ChunkedDownloadState`. Large completed jobs can therefore multiply memory usage by the number of concurrent download sessions and exhaust the server. A shared immutable snapshot/reference-counted buffer is safer here than per-session copies.
As per coding guidelines "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems"
Also applies to: 377-385
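The aliasing idea is easy to see in a Python sketch (class names are illustrative, standing in for a C++ `std::shared_ptr<const ...>` snapshot rather than the server's real types):

```python
class CompletedJob:
    def __init__(self, result_arrays):
        # Freeze once; every download session shares this snapshot.
        self.result_arrays = tuple(bytes(a) for a in result_arrays)

class ChunkedDownloadState:
    def __init__(self, job):
        # Hold a reference to the shared snapshot instead of deep-copying it.
        self.arrays = job.result_arrays  # no per-session copy

job = CompletedJob([b"\x00" * 1024])
sessions = [ChunkedDownloadState(job) for _ in range(8)]
# All sessions alias the same immutable buffers, so memory does not
# scale with the number of concurrent downloads.
assert all(s.arrays is job.result_arrays for s in sessions)
assert sessions[0].arrays[0] is sessions[7].arrays[0]
```

Because the snapshot is immutable after job completion, no extra locking is needed once each session has taken its reference.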
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 351 - 368, StartChunkedDownload currently deep-copies it->second.result_arrays into each ChunkedDownloadState causing memory bloat; instead change ChunkedDownloadState to hold a reference-counted/shared_ptr to the immutable result buffers and assign that shared_ptr from the tracked job (create or expose it->second.result_arrays as a shared_ptr inside the job tracker) so the state just points to the shared data without copying. Update the assignment in StartChunkedDownload (and the similar block at 377-385) to set state.raw_arrays_shared = it->second.result_arrays_shared (or construct a shared_ptr wrapper around the existing container) and ensure thread-safety by keeping the shared_ptr copy under the tracker_mutex while leaving the actual heavy buffers shared and immutable after lock release.
cpp/tests/linear_programming/c_api_tests/c_api_tests.cpp-420-477 (1)
420-477:⚠️ Potential issue | 🟠 Major
Capture the original env before any early return.
`SetUpTestSuite()` returns from the binary lookup / `fork()` / startup-timeout paths before `orig_*` and `*_was_set_` are populated, but `TearDownTestSuite()` still unsets `CUDA_VISIBLE_DEVICES` and `CUOPT_REMOTE_*`. A skipped suite can therefore erase preexisting process env for later tests. Snapshot the env at the top of `SetUpTestSuite()`, or guard teardown behind a `snapshot_taken` flag.
Based on learnings "Applies to **/test.{cpp,cu,py} : Ensure test isolation: prevent GPU state, cached memory, and global variables from leaking between test cases; verify each test independently initializes its environment"
Also applies to: 480-496
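The snapshot-first pattern can be sketched in Python (the class and flag names are hypothetical; the C++ fixture would do the same with member variables):

```python
import os

class EnvSnapshot:
    """Capture selected env vars up front; restore only if captured."""
    def __init__(self, names):
        self.saved = {n: os.environ.get(n) for n in names}
        self.taken = True  # set before any early-return path can fire

    def restore(self):
        if not getattr(self, "taken", False):
            return  # never snapshotted -> never touch the process env
        for name, value in self.saved.items():
            if value is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = value

os.environ["CUOPT_REMOTE_HOST"] = "preexisting"
snap = EnvSnapshot(["CUOPT_REMOTE_HOST"])
os.environ["CUOPT_REMOTE_HOST"] = "test-server"   # the test mutates env
snap.restore()
assert os.environ["CUOPT_REMOTE_HOST"] == "preexisting"
```

Taking the snapshot as the very first statement guarantees teardown can never unset values it did not record.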
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/linear_programming/c_api_tests/c_api_tests.cpp` around lines 420 - 477, SetUpTestSuite currently snapshots original env vars only after server startup, so early returns can leave TearDownTestSuite unsetting envs that were never saved; fix by capturing CUDA_VISIBLE_DEVICES, CUOPT_REMOTE_HOST, and CUOPT_REMOTE_PORT into orig_cuda_visible_, orig_remote_host_, orig_remote_port_ and setting cuda_was_set_, host_was_set_, port_was_set_ at the very start of SetUpTestSuite (or set a snapshot_taken boolean true after doing so), and then in TearDownTestSuite guard any restoration/unset logic behind that snapshot_taken flag (or check the *_was_set_ flags) so teardown never clobbers preexisting environment when SetUpTestSuite returned early.
cpp/tests/linear_programming/c_api_tests/c_api_tests.cpp-428-463 (1)
428-463:⚠️ Potential issue | 🟠 Major
Reserve a free port instead of probing a fixed default.
If Line 428 picks a port that is already occupied, the child can fail to bind while `tcp_connect_check()` still succeeds against the unrelated listener. The suite then exports `CUOPT_REMOTE_HOST`/`PORT` for the wrong process and becomes nondeterministic. Reserve an ephemeral free port first, or at least verify the forked child is the process that became ready before setting the env vars.
Based on learnings "Applies to **/test.{cpp,cu,py} : Ensure test isolation: prevent GPU state, cached memory, and global variables from leaking between test cases; verify each test independently initializes its environment"
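The reservation trick is the same in any language: bind port 0 on loopback and read back the kernel-assigned port. A Python sketch (the C++ fixture would use `bind`/`getsockname` directly):

```python
import socket

def reserve_free_port():
    """Ask the OS for an unused TCP port by binding port 0 on loopback."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]   # kernel-assigned free port

port = reserve_free_port()
assert 0 < port < 65536
# The port is released on close, so the forked server child can bind it.
# A small race window remains between close and the child's bind, but it
# is far safer than probing a fixed default that another process may own.
```

Passing the reserved port to the spawned server (and then connect-checking that same port) ties the readiness probe to the intended process.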
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/linear_programming/c_api_tests/c_api_tests.cpp` around lines 428 - 463, The test currently picks a fixed port in port_ then forks and probes readiness with tcp_connect_check which can hit an unrelated process; to fix, reserve an ephemeral port before forking by creating a temporary socket, bind it to loopback with port 0, obtain the assigned port via getsockname and set port_, then close the socket so the child can bind that same port; keep using server_pid_ and server_path_ as before and then call tcp_connect_check(port_, ...); this ensures the child can bind the reserved ephemeral port and avoids racing with unrelated listeners.
cpp/src/grpc/grpc_problem_mapper.cpp-597-686 (1)
597-686:⚠️ Potential issue | 🟠 Major
Don't build the entire chunked upload in memory first.
`build_array_chunk_requests()` materializes every `SendArrayChunkRequest` before the first RPC is sent, so a large "chunked" upload still allocates roughly another full copy of the problem in protobuf strings. On the workloads that trigger chunking, that can OOM the client before upload even starts. Emit/send chunks incrementally instead.
As per coding guidelines "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems."
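A generator-style sketch of incremental emission (field names and the tiny chunk size are illustrative; the C++ version would invoke the stream's `Write()` per chunk via a callback instead of yielding):

```python
def chunk_stream(arrays, chunk_bytes=4):
    """Yield (field_id, offset, chunk) lazily instead of materializing every
    request up front; peak extra memory is one chunk, not a full problem copy."""
    for field_id, payload in arrays.items():
        for off in range(0, len(payload), chunk_bytes):
            yield field_id, off, payload[off:off + chunk_bytes]

arrays = {"objective": b"abcdefgh", "rhs": b"xyz"}
sent = list(chunk_stream(arrays))          # a real client sends as it goes
assert sent[0] == ("objective", 0, b"abcd")
assert sent[1] == ("objective", 4, b"efgh")
assert sent[2] == ("rhs", 0, b"xyz")

# Reassembly on the receiver recovers the originals:
rebuilt = {}
for fid, off, chunk in sent:
    rebuilt[fid] = rebuilt.get(fid, b"") + chunk
assert rebuilt == arrays
```

The design point is that the producer owns only one in-flight chunk at a time, so upload memory no longer scales with problem size.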
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/grpc_problem_mapper.cpp` around lines 597 - 686, build_array_chunk_requests currently materializes all SendArrayChunkRequest objects into a vector before any RPC is sent, which can OOM for large problems; change the design so chunks are emitted incrementally instead of accumulated. Replace build_array_chunk_requests (or add an alternative like stream_array_chunk_requests) to accept a sender callback/std::function<void(const cuopt::remote::SendArrayChunkRequest&)> or an output iterator and invoke chunk_typed_array and chunk_byte_blob to call that sender for each generated chunk rather than appending to a requests vector; update callers to use the streaming version (or pass a lambda that forwards to the RPC stream) and remove the in-memory accumulation in build_array_chunk_requests. Ensure functions referenced (build_array_chunk_requests, chunk_typed_array, chunk_byte_blob) are adapted to support emitting via the provided callback/iterator and that any temporary byte blobs (e.g., var_names_blob/row_names_blob, vt_bytes/rt_bytes) are created and sent one-at-a-time to avoid full duplication.
cpp/src/grpc/client/grpc_client.cpp-485-517 (1)
485-517:⚠️ Potential issue | 🟠 Major
Return a retryable "too large" outcome from `submit_unary()`.
`submit_lp()`/`submit_mip()` need to distinguish "unary request exceeded the message cap" from ordinary transport failures, otherwise near-threshold requests just fail even though the chunked upload path is already available. Mirroring the download-side fallback would make this path much more robust.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/grpc_client.cpp` around lines 485 - 517, In grpc_client_t::submit_unary detect a "request too large" outcome by checking status.error_code() == grpc::StatusCode::RESOURCE_EXHAUSTED (and optionally status.error_message() for message-size text) and, instead of treating it as a generic failure, set a distinct retryable indicator (e.g. last_error_ = "TOO_LARGE_REQUEST: " + status.error_message()) and return false; update callers (submit_lp/submit_mip) to look for that sentinel (or a new helper like grpc_client_t::is_too_large_error()) and trigger the chunked upload fallback when seen.
cpp/src/grpc/grpc_problem_mapper.cpp-484-503 (1)
484-503:⚠️ Potential issue | 🟠 Major
Keep chunked constraint decoding consistent with the unary path.
`map_proto_to_problem()` only applies lower/upper bounds as a matched pair and otherwise falls back to `b` + `row_types`, but this chunked path sets `con_lb`, `con_ub`, and `b` independently. The same logical payload can therefore reconstruct differently depending on transport, or with half-initialized constraint state if one field is missing.
As per coding guidelines "Validate correct initialization of variable bounds, constraint coefficients, and algorithm state before solving; ensure reset when transitioning between algorithm phases (presolve, simplex, diving, crossover)."
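The matched-pair rule reduces to a small decision, sketched here in Python with hypothetical field names (the real decoder works on proto fields, not dicts):

```python
def apply_constraint_bounds(fields, problem):
    """Apply lb/ub only as a matched pair; otherwise fall back to b + row_types,
    mirroring the unary decode path so both transports rebuild the same model."""
    lb, ub = fields.get("con_lb"), fields.get("con_ub")
    if lb is not None and ub is not None:
        problem["constraint_lower_bounds"] = lb
        problem["constraint_upper_bounds"] = ub
    else:
        problem["b"] = fields.get("b", [])
        problem["row_types"] = fields.get("row_types", [])
    return problem

# Pair present -> bounds form is used.
p1 = apply_constraint_bounds({"con_lb": [0.0], "con_ub": [5.0]}, {})
assert "constraint_lower_bounds" in p1 and "b" not in p1

# Half a pair -> fall back rather than leave half-initialized bounds.
p2 = apply_constraint_bounds({"con_lb": [0.0], "b": [5.0], "row_types": ["L"]}, {})
assert "constraint_lower_bounds" not in p2 and p2["b"] == [5.0]
```

Whatever branch is chosen, the key property is that the decision depends only on the payload, never on which transport delivered it.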
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/grpc_problem_mapper.cpp` around lines 484 - 503, The chunked decoding path sets constraint lower/upper bounds and b independently which can lead to inconsistent state; modify the logic around get_doubles(cuopt::remote::FIELD_CONSTRAINT_LOWER_BOUNDS), get_doubles(cuopt::remote::FIELD_CONSTRAINT_UPPER_BOUNDS), and get_doubles(cuopt::remote::FIELD_B) so that you only apply constraint_lower_bounds and constraint_upper_bounds to cpu_problem when both con_lb and con_ub are present (matched pair), otherwise skip setting them and fall back to the unary behavior: set_constraint_bounds (b_vec) and set_row_types only as the unary path does; additionally ensure you clear or reset any existing constraint bounds on cpu_problem when the matched pair is absent to avoid half-initialized state.
cpp/src/grpc/grpc_solution_mapper.cpp-151-229 (1)
151-229:⚠️ Potential issue | 🟠 Major
Don't drop solver error text on the read path.
The write side serializes `error_message`/`mip_error_message`, but neither the unary readers nor `chunked_result_to_*()` restore it. A remote solve can therefore lose its actual failure reason even though the transport succeeded. If `cpu_*_solution_t` cannot carry free-form text, please surface it through the surrounding result wrapper instead of silently discarding it.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/grpc_solution_mapper.cpp` around lines 151 - 229, map_proto_to_lp_solution is currently discarding the serialized solver text fields (error_message / mip_error_message) from cuopt::remote::LPSolution; add logic to extract pb_solution.error_message() and pb_solution.mip_error_message() and propagate them into the returned result (either by populating new fields on cpu_lp_solution_t<i_t,f_t> or, if cpu_lp_solution_t cannot carry free-form text, return/populate the surrounding result wrapper with these strings). Do the same fix for the other readers mentioned (the reader at lines ~259-277 and the chunked_result_to_* paths at ~545-638): locate the functions that construct cpu_*_solution_t from proto and ensure they copy pb_solution.error_message() and pb_solution.mip_error_message() into the appropriate fields of cpu_*_solution_t or the outer response wrapper so solver failure text is not lost.
cpp/tests/linear_programming/grpc/grpc_integration_test.cpp-1761-1818 (1)
1761-1818:⚠️ Potential issue | 🟠 Major
These chunk-validation cases leak upload state across the shared server.
`start_upload()` creates server-side sessions, but this fixture keeps one server alive for the whole suite and never finishes or explicitly cleans those sessions up. That can bleed partially uploaded state into later cases and make failures order-dependent.
Based on learnings "Ensure test isolation: prevent GPU state, cached memory, and global variables from leaking between test cases; verify each test independently initializes its environment."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp` around lines 1761 - 1818, Tests leak server-side upload sessions because SetUpTestSuite starts a single s_server_ for the whole suite while start_upload() creates per-test upload sessions that are never finished; either (A) make the server lifecycle per-test by moving server start/stop from SetUpTestSuite/TearDownTestSuite into SetUp/TearDown so each test gets a fresh s_server_, or (B) keep the shared server but track upload IDs from start_upload() and ensure each test calls the appropriate cleanup RPC (e.g., FinishChunkedUpload or Abort/CancelChunkedUpload) via stub_ in TearDown to remove server-side session state before the next test; update SetUp/ TearDown and start_upload/send_chunk helpers accordingly to record and clean up upload IDs.
🟡 Minor comments (4)
python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py-637-649 (1)
637-649:⚠️ Potential issue | 🟡 Minor
Make the negative mTLS assertion failure-specific.
`result.returncode != 0` also passes for unrelated failures like a missing dataset, parser import error, or server startup race. Please assert on a TLS/auth-specific stderr/stdout fragment so this test actually proves client-certificate enforcement.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py` around lines 637 - 649, Update the test_mtls_rejects_no_client_cert to make the failure specific to TLS/auth: after calling _run_in_subprocess with _impl_lp_solve_cpu_only, assert both that result.returncode != 0 and that result.stderr (or result.stdout if stderr is empty) contains a TLS-specific substring such as "certificate", "client certificate", "SSL", "TLS", or "authentication failed"; use the same test helper names (_run_in_subprocess, _impl_lp_solve_cpu_only, _tls_env, mtls_server_info) to locate where to add the additional assert and prefer checking stderr first and falling back to stdout if needed.
cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp-38-47 (1)
38-47:⚠️ Potential issue | 🟡 Minor
Add the missing `<iostream>` include.
`dump_logs()` uses `std::cout` in its default argument at line 339, but `<iostream>` is never included. This makes the header dependent on transitive includes and will fail if includes are reordered.
Minimal fix
```diff
 #include <fstream>
 #include <functional>
+#include <iostream>
 #include <mutex>
 #include <regex>
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp` around lines 38 - 47, The header is missing the <iostream> include required by dump_logs()'s default parameter that uses std::cout; add `#include` <iostream> to the include block at the top of grpc_test_log_capture.hpp so the file no longer relies on transitive includes and dump_logs()'s default argument (std::cout) is valid; update the include list near the other standard headers (alongside <string>, <sstream>, etc.) to ensure compilation when includes are reordered.
cpp/src/grpc/server/grpc_service_impl.cpp-791-808 (1)
791-808:⚠️ Potential issue | 🟡 Minor
Advance the completion offset after flushing the last line.
When the terminal-state path drains one final log line, the sentinel message still publishes the pre-drain `current_offset`. Resuming from `done.byte_offset` will re-read that tail line.
Suggested fix
```diff
   if (std::getline(in, line)) {
     std::streampos after2 = in.tellg();
     int64_t next_offset2 = current_offset + static_cast<int64_t>(line.size());
     if (after2 >= 0) { next_offset2 = static_cast<int64_t>(after2); }
     cuopt::remote::LogMessage m;
     m.set_line(line);
     m.set_byte_offset(next_offset2);
     m.set_job_complete(false);
     writer->Write(m);
+    current_offset = next_offset2;
   }
   cuopt::remote::LogMessage done;
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 791 - 808, The sentinel "done" message is using current_offset (pre-drain) so resuming re-reads the last line; after you compute next_offset2 (or after2) when a final line is read, advance current_offset (or set the done byte offset to next_offset2) before constructing cuopt::remote::LogMessage done so done.set_byte_offset(...) uses the post-drain offset; update the code around current_offset/next_offset2 (and use after2 if available) prior to writer->Write(done) to ensure the completion offset points after the flushed line.
cpp/tests/linear_programming/grpc/grpc_integration_test.cpp-1020-1045 (1)
1020-1045:⚠️ Potential issue | 🟡 Minor
Prove the incumbent callback actually triggered cancellation.
`EXPECT_LT(elapsed.count(), 25)` can still pass if the callback never fires and this instance just solves quickly. Please assert that the callback ran at least twice and that the terminal state reflects cancellation/non-success.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp` around lines 1020 - 1045, The test must assert the incumbent callback actually fired and that the solver returned a cancelled/non-success state: after calling client->solve_mip(problem, settings, true) check that callback_count >= 2 (the lambda attached to config.incumbent_callback incremented it) and assert the returned result (the variable result from client->solve_mip) indicates cancellation/non-success by checking the appropriate result field (e.g., result.status or result.termination_status) rather than relying solely on elapsed time.
🧹 Nitpick comments (1)
cpp/tests/linear_programming/grpc/grpc_pipe_serialization_test.cpp (1)
140-286: Add malformed-chunk coverage for the assembly path.
These tests cover well-formed round trips only. The serializer in `cpp/src/grpc/server/grpc_pipe_serialization.hpp` also has failure-prone paths for overlaps, gaps, inconsistent `total_elements`, and invalid byte counts; without negative tests, silent corruption in the pipe format can slip through. As per coding guidelines "Test with free variables, singleton problems, and extreme problem dimensions near resource limits to validate edge case handling."
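The classifier that such negative tests would exercise is compact. A Python sketch (chunk tuples and return labels are hypothetical; the C++ reader would return false with a diagnostic instead):

```python
def validate_field_chunks(chunks, total_elements):
    """Classify one field's chunk set: reject overlaps, gaps, and
    element-count mismatches. Each chunk is (element_offset, element_count)."""
    covered = 0
    for offset, count in sorted(chunks):
        if offset < covered:
            return "overlap"
        if offset > covered:
            return "gap"
        covered += count
    if covered != total_elements:
        return "size-mismatch"
    return "ok"

assert validate_field_chunks([(0, 4), (4, 4)], 8) == "ok"
assert validate_field_chunks([(0, 4), (2, 4)], 8) == "overlap"
assert validate_field_chunks([(0, 4), (6, 2)], 8) == "gap"
assert validate_field_chunks([(0, 4)], 8) == "size-mismatch"
```

Each of the four labels corresponds to one negative test case the suite is currently missing.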
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/linear_programming/grpc/grpc_pipe_serialization_test.cpp` around lines 140 - 286, Add negative unit tests in cpp/tests/linear_programming/grpc/grpc_pipe_serialization_test.cpp that exercise the assembly failure paths by constructing malformed ArrayChunk sequences and asserting read_chunked_request_from_pipe returns false or fails as expected; create tests such as ChunkedRequest_Malformed_Overlap (two chunks for the same field with overlapping element ranges), ChunkedRequest_Malformed_Gap (chunks that leave a gap so total_elements mismatches assembly), ChunkedRequest_Malformed_InconsistentTotalElements (chunks for same field with conflicting total_elements values), and ChunkedRequest_Malformed_InvalidByteCount (chunk.data length not equal to element_count * array_field_element_size); use PipePair, write_chunked_request_to_pipe(pp.write_fd(), header, chunks) to write and read_chunked_request_from_pipe(pp.read_fd(), header_out, arrays_out) to verify the reader rejects/returns false for each case, building chunks via make_partial_chunk/make_whole_chunk or by setting ArrayChunk fields directly (field_id, element_offset, total_elements, data) to craft the malformed scenarios.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: aaca6118-22d1-4b1b-9a91-5dbc62d00790
📒 Files selected for processing (50)
- GRPC_INTERFACE.md
- GRPC_QUICK_START.md
- GRPC_SERVER_ARCHITECTURE.md
- build.sh
- ci/build_wheel_libcuopt.sh
- ci/utils/install_protobuf_grpc.sh
- conda/environments/all_cuda-129_arch-aarch64.yaml
- conda/environments/all_cuda-129_arch-x86_64.yaml
- conda/environments/all_cuda-131_arch-aarch64.yaml
- conda/environments/all_cuda-131_arch-x86_64.yaml
- conda/recipes/libcuopt/recipe.yaml
- cpp/CMakeLists.txt
- cpp/cuopt_grpc_server.cpp
- cpp/src/grpc/client/grpc_client.cpp
- cpp/src/grpc/client/grpc_client.hpp
- cpp/src/grpc/client/solve_remote.cpp
- cpp/src/grpc/cuopt_remote.proto
- cpp/src/grpc/cuopt_remote_service.proto
- cpp/src/grpc/grpc_problem_mapper.cpp
- cpp/src/grpc/grpc_problem_mapper.hpp
- cpp/src/grpc/grpc_service_mapper.cpp
- cpp/src/grpc/grpc_service_mapper.hpp
- cpp/src/grpc/grpc_settings_mapper.cpp
- cpp/src/grpc/grpc_settings_mapper.hpp
- cpp/src/grpc/grpc_solution_mapper.cpp
- cpp/src/grpc/grpc_solution_mapper.hpp
- cpp/src/grpc/server/grpc_field_element_size.hpp
- cpp/src/grpc/server/grpc_incumbent_proto.hpp
- cpp/src/grpc/server/grpc_job_management.cpp
- cpp/src/grpc/server/grpc_pipe_io.cpp
- cpp/src/grpc/server/grpc_pipe_serialization.hpp
- cpp/src/grpc/server/grpc_server_main.cpp
- cpp/src/grpc/server/grpc_server_threads.cpp
- cpp/src/grpc/server/grpc_server_types.hpp
- cpp/src/grpc/server/grpc_service_impl.cpp
- cpp/src/grpc/server/grpc_worker.cpp
- cpp/src/grpc/server/grpc_worker_infra.cpp
- cpp/src/pdlp/CMakeLists.txt
- cpp/src/pdlp/solve_remote.cu
- cpp/tests/linear_programming/CMakeLists.txt
- cpp/tests/linear_programming/c_api_tests/c_api_tests.cpp
- cpp/tests/linear_programming/grpc/CMakeLists.txt
- cpp/tests/linear_programming/grpc/grpc_client_test.cpp
- cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp
- cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
- cpp/tests/linear_programming/grpc/grpc_pipe_serialization_test.cpp
- cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
- dependencies.yaml
- python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
- python/libcuopt/CMakeLists.txt
💤 Files with no reviewable changes (2)
- cpp/src/pdlp/solve_remote.cu
- cpp/src/pdlp/CMakeLists.txt
✅ Files skipped from review due to trivial changes (1)
- GRPC_INTERFACE.md
🚧 Files skipped from review as they are similar to previous changes (3)
- cpp/src/grpc/server/grpc_field_element_size.hpp
- cpp/src/grpc/server/grpc_incumbent_proto.hpp
- cpp/tests/linear_programming/grpc/CMakeLists.txt
```cpp
ssize_t written = ::write(fd, ptr, remaining);
if (written <= 0) {
  if (errno == EINTR) continue;
  return false;
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Expect at least one startup path to ignore or block SIGPIPE before pipe writes happen.
rg -nP 'SIGPIPE|sigaction\s*\(\s*SIGPIPE|signal\s*\(\s*SIGPIPE' -C2
Repository: NVIDIA/cuopt
Length of output: 38
🏁 Script executed:
cat cpp/src/grpc/server/grpc_pipe_io.cpp
Repository: NVIDIA/cuopt
Length of output: 2085
🏁 Script executed:
# Search for alternative SIGPIPE mitigation: MSG_NOSIGNAL flag, SO_NOSIGPIPE option, signal masks
rg -n 'MSG_NOSIGNAL|SO_NOSIGPIPE|signal.*mask|sigprocmask' -C2
Repository: NVIDIA/cuopt
Length of output: 38
🏁 Script executed:
# Look for signal() or sigaction() calls anywhere in the codebase
rg -n 'signal\s*\(' -C2 | head -50
Repository: NVIDIA/cuopt
Length of output: 4013
🏁 Script executed:
# Check for any SIGPIPE references even in comments or strings
rg -in 'sigpipe' -C2
Repository: NVIDIA/cuopt
Length of output: 38
Add SIGPIPE signal handler in server startup to prevent unexpected process termination.
Line 21 uses plain ::write() on a pipe. When the peer closes its read end, SIGPIPE will be raised and terminate the process before this helper can return false. No signal-level SIGPIPE handling exists anywhere in the C++ codebase. Add signal(SIGPIPE, SIG_IGN) or equivalent sigaction() call in the gRPC server initialization before any pipe I/O operations occur.
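The behavioral difference is easy to demonstrate from Python, which (unlike a default C++ process) already ignores SIGPIPE at interpreter startup; the explicit `signal()` call below mirrors what the C++ server initialization would do once:

```python
import errno
import os
import signal

# Ignore SIGPIPE so a write to a closed pipe surfaces as an error code
# (EPIPE) instead of killing the process. A C++ server must do this
# explicitly, e.g. signal(SIGPIPE, SIG_IGN), before any pipe I/O.
signal.signal(signal.SIGPIPE, signal.SIG_IGN)

r, w = os.pipe()
os.close(r)  # the peer closes its read end
failed_errno = None
try:
    os.write(w, b"payload")
except OSError as e:
    failed_errno = e.errno  # recoverable error path, not process death
finally:
    os.close(w)

assert failed_errno == errno.EPIPE
```

With SIGPIPE ignored, the `::write()` loop's `return false` branch can actually run and the server degrades gracefully when a worker dies mid-transfer.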
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/server/grpc_pipe_io.cpp` around lines 21 - 24, The pipe write
helper in grpc_pipe_io.cpp uses ::write(...) which can raise SIGPIPE and kill
the process before the helper returns false; to fix this, register SIGPIPE to be
ignored in the gRPC server startup (before any pipe I/O occurs) by calling
signal(SIGPIPE, SIG_IGN) or installing an equivalent sigaction() in the server
initialization path (e.g., in the function that starts/initializes the gRPC
server), ensuring this handler is set once at startup so the ::write loop in
grpc_pipe_io.cpp no longer causes process termination when the peer closes the
read end.
Actionable comments posted: 3
♻️ Duplicate comments (1)
ci/utils/install_protobuf_grpc.sh (1)
72-79:⚠️ Potential issue | 🔴 Critical
Canonicalize the user-controlled paths before the cleanup/build flow.
The current guard only rejects a literal `/`. `--prefix=/usr/..` still passes and makes the cleanup target `/lib` and `/bin`, and a fresh custom `--build-dir` still fails because nothing creates it before `git clone`.
🛡️ Minimal hardening
```diff
+PREFIX="$(realpath -m -- "${PREFIX}")"
+BUILD_DIR="$(realpath -m -- "${BUILD_DIR}")"
+
 if [[ -z "$PREFIX" || "$PREFIX" == "/" ]]; then
   echo "ERROR: Invalid PREFIX: '$PREFIX'" >&2
   exit 1
 fi
 if [[ -z "$BUILD_DIR" || "$BUILD_DIR" == "/" ]]; then
@@
 GRPC_SRC="${BUILD_DIR}/grpc-src"
 ABSL_BUILD="${BUILD_DIR}/absl-build"
 PROTOBUF_BUILD="${BUILD_DIR}/protobuf-build"
 GRPC_BUILD="${BUILD_DIR}/grpc-build"
-rm -rf "${GRPC_SRC}" "${ABSL_BUILD}" "${PROTOBUF_BUILD}" "${GRPC_BUILD}"
-mkdir -p "${PREFIX}"
+mkdir -p -- "${PREFIX}" "${BUILD_DIR}"
+rm -rf -- "${GRPC_SRC}" "${ABSL_BUILD}" "${PROTOBUF_BUILD}" "${GRPC_BUILD}"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ci/utils/install_protobuf_grpc.sh` around lines 72 - 79, Canonicalize and validate user-controlled paths before using them: resolve PREFIX and BUILD_DIR with realpath -m (or readlink -f where available) into canonical variables (e.g., resolved_prefix and resolved_build_dir), then reject if resolved_prefix is empty or equals "/" and reject empty resolved_build_dir; also ensure resolved_build_dir is created (mkdir -p "$resolved_build_dir") before git clone and any cleanup steps. Replace raw uses of PREFIX and BUILD_DIR in the cleanup/build flow (including the other similar block) with the canonical variables to prevent paths like --prefix=/usr/.. from escaping to root and to ensure the build directory exists.
🧹 Nitpick comments (7)
python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py (5)
662-663: Consider adding a function whitelist for the subprocess entry point.

The `globals()[sys.argv[1]]()` pattern works but could be hardened with an explicit allowlist to prevent accidental execution of unintended functions:

♻️ Optional hardening
```diff
 if __name__ == "__main__":
-    globals()[sys.argv[1]]()
+    _ALLOWED_IMPL_FUNCS = {
+        "_impl_lp_solve_cpu_only",
+        "_impl_lp_dual_solution_cpu_only",
+        "_impl_mip_solve_cpu_only",
+        "_impl_warmstart_cpu_only",
+    }
+    func_name = sys.argv[1]
+    if func_name not in _ALLOWED_IMPL_FUNCS:
+        raise ValueError(f"Unknown test function: {func_name}")
+    globals()[func_name]()
```
Verify each finding against the current code and only fix it if needed. In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py` around lines 662 - 663, Replace the dynamic call via globals()[sys.argv[1]]() in the module entrypoint with an explicit allowlist: define a dict or set (e.g., ALLOWED_ENTRIES = {"run_test_a": run_test_a, "run_test_b": run_test_b}) mapping permitted names to callables, then look up sys.argv[1] against ALLOWED_ENTRIES and invoke the matched function; if the name is missing, raise a clear error or print usage. This hardens the __main__ entrypoint and references the existing globals()[sys.argv[1]]() pattern, sys.argv, and the module-level test functions to locate where to change the logic.
96-102: Avoid `shell=True` with string interpolation.

Using `shell=True` with f-string command construction is a security anti-pattern. While `cert_dir` originates from pytest's temp directory, this pattern is risky if copied elsewhere. Consider using list arguments with `shell=False`.

♻️ Suggested refactor using list arguments
```diff
 def _run(cmd):
-    return (
-        subprocess.run(
-            cmd, shell=True, capture_output=True, timeout=30
-        ).returncode
-        == 0
-    )
+    return (
+        subprocess.run(
+            cmd, capture_output=True, timeout=30
+        ).returncode
+        == 0
+    )
```

Then update callers to pass list arguments:
```python
# Example for CA generation:
if not _run([
    "openssl", "req", "-x509", "-newkey", "rsa:2048",
    "-keyout", ca_key, "-out", ca_crt,
    "-days", "1", "-nodes", "-subj", "/CN=TestCA"
]):
    return False
```
Verify each finding against the current code and only fix it if needed. In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py` around lines 96 - 102, The helper _run currently invokes subprocess.run with shell=True and a string command which is unsafe; change _run to accept a sequence (list/tuple) command and call subprocess.run(cmd, capture_output=True, timeout=30) without shell=True (or explicitly shell=False), and update all callers that pass interpolated strings to pass list arguments (e.g., ["openssl", "req", "-x509", ...]) so argument quoting is handled safely and no shell interpolation occurs; keep the same return semantics (returncode == 0) and preserve timeout/capture_output behavior in the function named _run and in any tests calling it.
184-227: Consider adding numerical correctness validation to CPU-only tests.

These test implementations only verify vector sizes, not numerical accuracy. Per coding guidelines, tests should "validat[e] numerical correctness of optimization results (not just 'runs without error')". Since these tests verify remote execution fidelity, consider adding objective value checks similar to `TestSolutionInterfacePolymorphism`:

```python
# Example addition to _impl_lp_solve_cpu_only:
obj = solution.get_primal_objective()
_AFIRO_OBJ = -464.7531428571
assert abs(obj - _AFIRO_OBJ) / abs(_AFIRO_OBJ) < 0.01, (
    f"Objective {obj} too far from expected {_AFIRO_OBJ}"
)
```

This ensures remote execution produces numerically correct results, not just correctly-sized outputs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py` around lines 184 - 227, Add numerical correctness assertions to the CPU-only tests: in _impl_lp_solve_cpu_only, after obtaining obj = solution.get_primal_objective(), assert the objective is within a small relative tolerance of the known Afiro reference value (e.g., _AFIRO_OBJ ≈ -464.7531428571) using a relative error check (< 0.01); in _impl_lp_dual_solution_cpu_only, optionally add similar sanity checks for duals or reduced costs (e.g., spot-check a few known entries or confirm primal objective via solution.get_primal_objective() matches the same reference) so the tests validate numerical accuracy not just vector sizes. Ensure you use the existing solution methods (get_primal_objective, get_dual_solution, get_reduced_cost) and compare with a relative tolerance assertion.
291-316: Consider extracting shared fixture logic.

The `cpu_only_env_with_server` fixture is nearly identical between `TestCPUOnlyExecution` (port +600) and `TestCuoptCliCPUOnly` (port +700). Consider extracting a parameterized helper or using `conftest.py` to reduce duplication.

♻️ Example extraction
```python
def _start_grpc_server_fixture(port_offset):
    """Factory for class-scoped gRPC server fixtures."""

    @pytest.fixture(scope="class")
    def fixture_fn():
        server_bin = _find_grpc_server()
        if server_bin is None:
            pytest.skip("cuopt_grpc_server not found")
        port = int(os.environ.get("CUOPT_TEST_PORT_BASE", "18000")) + port_offset
        proc = subprocess.Popen(
            [server_bin, "--port", str(port), "--workers", "1"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # ... rest of setup/teardown logic
        yield _cpu_only_env(port)
        # ... cleanup

    return fixture_fn
```
Verify each finding against the current code and only fix it if needed. In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py` around lines 291 - 316, The cpu_only_env_with_server fixture in TestCPUOnlyExecution is almost identical to the one in TestCuoptCliCPUOnly; extract the duplicated logic into a reusable factory (e.g., _start_grpc_server_fixture) that accepts a port_offset and returns a class-scoped fixture, then replace the two fixtures with calls to that factory (or register them in conftest.py) so cpu_only_env_with_server and the other fixture reuse the same start/teardown logic and only differ by the port offset (+600 vs +700).
1-14: Consider adding edge case tests for remote execution.

Per coding guidelines, tests should cover "infeasible, unbounded, empty, singleton problems" and "free variables, singleton problems, and extreme problem dimensions." While this file appropriately focuses on connectivity (CPU-only, TLS, mTLS), consider adding tests that exercise the remote execution path with degenerate or edge-case problems to ensure the gRPC layer correctly handles and propagates solver results for unusual inputs.
This could be added in a follow-up PR or tracked as a separate issue.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py` around lines 1 - 14, Add remote execution edge-case tests that run through the same CPU-only gRPC test harness (the shared cuopt_grpc_server used by TestSolutionInterfacePolymorphism) to assert correct solver status and propagated messages for degenerate problems: include tests that submit infeasible problems, unbounded problems, empty/zero-variable problems, singleton problems, and problems with free variables and extreme dimensions; for each test reuse the existing server setup/teardown and the client call path used by the CPU-only TLS/mTLS tests so you validate gRPC serialization and status propagation (check solver status codes, objective values when applicable, and error messages returned by the remote solver).

cpp/src/grpc/server/grpc_pipe_serialization.hpp (2)
37-44: Consider validating message size before serialization.

`ByteSizeLong()` returns `size_t` but is cast to `uint32_t`. While protobuf enforces a ~2 GiB internal limit, adding an explicit check would catch issues earlier with a clearer error path.

♻️ Suggested defensive check
```diff
 inline bool write_protobuf_to_pipe(int fd, const google::protobuf::MessageLite& msg)
 {
-  uint32_t size = static_cast<uint32_t>(msg.ByteSizeLong());
+  size_t byte_size = msg.ByteSizeLong();
+  if (byte_size > std::numeric_limits<uint32_t>::max()) return false;
+  uint32_t size = static_cast<uint32_t>(byte_size);
   if (!write_to_pipe(fd, &size, sizeof(size))) return false;
```
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 37 - 44, The write_protobuf_to_pipe function casts msg.ByteSizeLong() (size_t) to uint32_t without checking for overflow; add a defensive check using the original size_t value (from ByteSizeLong()) and if it exceeds UINT32_MAX (or some configured max), fail early (return false or handle error) before allocating the std::vector and calling SerializeToArray; this prevents silent truncation/overflow and clarifies the error path when messages are larger than uint32_t can hold.
215-222: Consider adding size validation for consistency.

`SerializeToArray` takes an `int` for the size parameter. While the comment notes this is "always well under 2 GiB," adding an explicit check would provide defense-in-depth consistent with other serialization paths.

♻️ Suggested defensive check
```diff
 inline std::vector<uint8_t> serialize_submit_request_to_pipe(
   const cuopt::remote::SubmitJobRequest& request)
 {
   size_t byte_size = request.ByteSizeLong();
+  if (byte_size > static_cast<size_t>(std::numeric_limits<int>::max())) {
+    return {};  // Return empty on oversized request
+  }
   std::vector<uint8_t> blob(byte_size);
   request.SerializeToArray(blob.data(), static_cast<int>(byte_size));
   return blob;
 }
```
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 215 - 222, serialize_submit_request_to_pipe currently casts request.ByteSizeLong() to int before calling request.SerializeToArray; add a defensive check that byte_size is <= std::numeric_limits<int>::max() (and optionally > 0) before the static_cast to avoid overflow/truncation, and handle the error case (throw std::runtime_error or return an empty vector with a logged error) so SerializeToArray is only called with a safe int size; refer to serialize_submit_request_to_pipe and the call to SerializeToArray for where to place this validation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ci/utils/install_protobuf_grpc.sh`:
- Around line 116-119: The error message inside the tools-check loop (for tool
in git cmake ninja) incorrectly suggests the unsupported flag
"--skip-deps=false"; update the echo in that block (the line that prints "Error:
Required tool '$tool' not found...") to remove the unsupported flag and instead
give a correct instruction such as prompting the user to install the missing
tool or consult the script's README/usage, so the message only references valid
recovery actions.
In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp`:
- Around line 46-53: The function read_protobuf_from_pipe reads a uint32_t size
from the pipe and allocates a buffer without any upper bound check, enabling
potential resource exhaustion; add validation of size against the existing
constant kMaxPipeArrayBytes (and ensure size>0) before allocating the
std::vector and before the second read_from_pipe call, returning false if size
is zero or exceeds kMaxPipeArrayBytes; update error path to avoid allocation
when invalid and keep using read_from_pipe and msg.ParseFromArray as-is.
- Around line 118-127: The loop over fi.chunks currently skips chunks whose
byte_offset/size validation fails, which can hide data corruption; update the
logic in the loop that iterates over fi.chunks (using ac->element_offset(),
elem_size, chunk_data, assembled) to treat invalid offsets as an error: when
byte_offset < 0 or byte_offset + chunk_data.size() > fi.total_bytes, immediately
return false (or at minimum log an error) instead of silently continuing, so
that write_to_pipe(fd, assembled.data(), assembled.size()) is only called when
all chunks validated and copied successfully.
---
Duplicate comments:
In `@ci/utils/install_protobuf_grpc.sh`:
- Around line 72-79: Canonicalize and validate user-controlled paths before
using them: resolve PREFIX and BUILD_DIR with realpath -m (or readlink -f where
available) into canonical variables (e.g., resolved_prefix and
resolved_build_dir), then reject if resolved_prefix is empty or equals "/" and
reject empty resolved_build_dir; also ensure resolved_build_dir is created
(mkdir -p "$resolved_build_dir") before git clone and any cleanup steps. Replace
raw uses of PREFIX and BUILD_DIR in the cleanup/build flow (including the other
similar block) with the canonical variables to prevent paths like
--prefix=/usr/.. from escaping to root and to ensure the build directory exists.
---
Nitpick comments:
In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp`:
- Around line 37-44: The write_protobuf_to_pipe function casts
msg.ByteSizeLong() (size_t) to uint32_t without checking for overflow; add a
defensive check using the original size_t value (from ByteSizeLong()) and if it
exceeds UINT32_MAX (or some configured max), fail early (return false or handle
error) before allocating the std::vector and calling SerializeToArray; this
prevents silent truncation/overflow and clarifies the error path when messages
are larger than uint32_t can hold.
- Around line 215-222: serialize_submit_request_to_pipe currently casts
request.ByteSizeLong() to int before calling request.SerializeToArray; add a
defensive check that byte_size is <= std::numeric_limits<int>::max() (and
optionally > 0) before the static_cast to avoid overflow/truncation, and handle
the error case (throw std::runtime_error or return an empty vector with a logged
error) so SerializeToArray is only called with a safe int size; refer to
serialize_submit_request_to_pipe and the call to SerializeToArray for where to
place this validation.
In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`:
- Around line 662-663: Replace the dynamic call via globals()[sys.argv[1]]() in
the module entrypoint with an explicit allowlist: define a dict or set (e.g.,
ALLOWED_ENTRIES = {"run_test_a": run_test_a, "run_test_b": run_test_b}) mapping
permitted names to callables, then look up sys.argv[1] against ALLOWED_ENTRIES
and invoke the matched function; if the name is missing, raise a clear error or
print usage. This hardens the __main__ entrypoint and references the existing
globals()[sys.argv[1]]() pattern, sys.argv, and the module-level test functions
to locate where to change the logic.
- Around line 96-102: The helper _run currently invokes subprocess.run with
shell=True and a string command which is unsafe; change _run to accept a
sequence (list/tuple) command and call subprocess.run(cmd, capture_output=True,
timeout=30) without shell=True (or explicitly shell=False), and update all
callers that pass interpolated strings to pass list arguments (e.g., ["openssl",
"req", "-x509", ...]) so argument quoting is handled safely and no shell
interpolation occurs; keep the same return semantics (returncode == 0) and
preserve timeout/capture_output behavior in the function named _run and in any
tests calling it.
- Around line 184-227: Add numerical correctness assertions to the CPU-only
tests: in _impl_lp_solve_cpu_only, after obtaining obj =
solution.get_primal_objective(), assert the objective is within a small relative
tolerance of the known Afiro reference value (e.g., _AFIRO_OBJ ≈
-464.7531428571) using a relative error check (< 0.01); in
_impl_lp_dual_solution_cpu_only, optionally add similar sanity checks for duals
or reduced costs (e.g., spot-check a few known entries or confirm primal
objective via solution.get_primal_objective() matches the same reference) so the
tests validate numerical accuracy not just vector sizes. Ensure you use the
existing solution methods (get_primal_objective, get_dual_solution,
get_reduced_cost) and compare with a relative tolerance assertion.
- Around line 291-316: The cpu_only_env_with_server fixture in
TestCPUOnlyExecution is almost identical to the one in TestCuoptCliCPUOnly;
extract the duplicated logic into a reusable factory (e.g.,
_start_grpc_server_fixture) that accepts a port_offset and returns a
class-scoped fixture, then replace the two fixtures with calls to that factory
(or register them in conftest.py) so cpu_only_env_with_server and the other
fixture reuse the same start/teardown logic and only differ by the port offset
(+600 vs +700).
- Around line 1-14: Add remote execution edge-case tests that run through the
same CPU-only gRPC test harness (the shared cuopt_grpc_server used by
TestSolutionInterfacePolymorphism) to assert correct solver status and
propagated messages for degenerate problems: include tests that submit
infeasible problems, unbounded problems, empty/zero-variable problems, singleton
problems, and problems with free variables and extreme dimensions; for each test
reuse the existing server setup/teardown and the client call path used by the
CPU-only TLS/mTLS tests so you validate gRPC serialization and status
propagation (check solver status codes, objective values when applicable, and
error messages returned by the remote solver).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a9ff9e87-fb6b-4f35-bdfb-ecfd2c053f90
📒 Files selected for processing (10)
- GRPC_QUICK_START.md
- ci/utils/install_protobuf_grpc.sh
- cpp/src/grpc/client/grpc_client.cpp
- cpp/src/grpc/grpc_problem_mapper.cpp
- cpp/src/grpc/server/grpc_job_management.cpp
- cpp/src/grpc/server/grpc_pipe_serialization.hpp
- cpp/src/grpc/server/grpc_server_main.cpp
- cpp/src/grpc/server/grpc_worker_infra.cpp
- cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
- python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
🚧 Files skipped from review as they are similar to previous changes (3)
- GRPC_QUICK_START.md
- cpp/src/grpc/server/grpc_worker_infra.cpp
- cpp/src/grpc/server/grpc_job_management.cpp
a97a468 to 026627a
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py (1)
474-479: ⚠️ Potential issue | 🟡 Minor
Missing executable check for CONDA_PREFIX path (same issue as `_find_grpc_server`).

The CONDA_PREFIX branch doesn't verify executable permission, inconsistent with the earlier checks.
Proposed fix
```diff
 conda_prefix = os.environ.get("CONDA_PREFIX", "")
 if conda_prefix:
     p = os.path.join(conda_prefix, "bin", "cuopt_cli")
-    if os.path.isfile(p):
+    if os.path.isfile(p) and os.access(p, os.X_OK):
         return p
 return None
```
Verify each finding against the current code and only fix it if needed. In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py` around lines 474 - 479, The CONDA_PREFIX branch currently only checks os.path.isfile(p) and should also verify the file is executable like the earlier _find_grpc_server logic; update the check for the constructed path variable p to ensure it exists and is executable (e.g., replace or augment os.path.isfile(p) with an executable check such as os.access(p, os.X_OK) or equivalent) so the function returns the path only when the cuopt_cli binary is executable.
♻️ Duplicate comments (3)
cpp/src/grpc/client/grpc_client.cpp (1)
89-94: ⚠️ Potential issue | 🟠 Major
The new deadline helper still ignores `config_.timeout_seconds`.

`set_rpc_deadline()` is parameterized, but every finite RPC still passes the hard-coded 60-second constant. That means large submits/uploads/downloads can time out even when the client is configured for a longer timeout. Either plumb `config_.timeout_seconds` through these call sites or introduce a separate configurable short-RPC timeout knob.

Also applies to: 265-266, 333-334, 359-360, 397-397, 489-490, 821-821, 856-856, 875-875, 910-910, 948-948, 996-996, 1059-1059, 1102-1102
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/grpc_client.cpp` around lines 89 - 94, The helper set_rpc_deadline(ctx, timeout_seconds) exists but call sites still pass the hard-coded 60s constant, so plumb the configured timeout through: replace literal 60 at each RPC call site with config_.timeout_seconds (or, if you want a separate short-RPC knob, add and use a new member like config_.short_rpc_timeout_seconds) so that functions using set_rpc_deadline (references around the hard-coded uses noted in the review and the helper itself) honor the client configuration; ensure default behavior preserves current semantics when config_ value is zero or negative.

cpp/src/grpc/server/grpc_pipe_serialization.hpp (1)
37-55: ⚠️ Potential issue | 🟠 Major
The protobuf size guard still allows near-4 GiB allocations.

`size` is a `uint32_t`, so `size > kMaxPipeArrayBytes` only rejects values the prefix cannot represent anyway. A malformed frame can still drive `read_protobuf_from_pipe()` into allocating gigabytes. Please switch these helpers to a dedicated protobuf-message cap and enforce it symmetrically on both read and write.

As per coding guidelines: "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files."

🛡️ Minimal cap tightening
```diff
+static constexpr uint32_t kMaxProtobufMessageBytes = 256U * 1024 * 1024;
+
 inline bool write_protobuf_to_pipe(int fd, const google::protobuf::MessageLite& msg)
 {
   size_t byte_size = msg.ByteSizeLong();
-  if (byte_size > UINT32_MAX) return false;
+  if (byte_size == 0 || byte_size > kMaxProtobufMessageBytes) return false;
   uint32_t size = static_cast<uint32_t>(byte_size);
   if (!write_to_pipe(fd, &size, sizeof(size))) return false;
   std::vector<uint8_t> buf(size);
@@
 inline bool read_protobuf_from_pipe(int fd, google::protobuf::MessageLite& msg)
 {
   uint32_t size;
   if (!read_from_pipe(fd, &size, sizeof(size))) return false;
-  if (size == 0 || size > kMaxPipeArrayBytes) return false;
+  if (size == 0 || size > kMaxProtobufMessageBytes) return false;
   std::vector<uint8_t> buf(size);
   if (!read_from_pipe(fd, buf.data(), size)) return false;
   return msg.ParseFromArray(buf.data(), static_cast<int>(size));
 }
```
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 37 - 55, The helpers write_protobuf_to_pipe and read_protobuf_from_pipe currently allow near-4GiB allocations because they only compare a uint32_t size against kMaxPipeArrayBytes; introduce a dedicated protobuf message cap (e.g., kMaxProtobufMessageBytes) and enforce it symmetrically: in write_protobuf_to_pipe check msg.ByteSizeLong() against the new cap and return false if it exceeds the cap before casting to uint32_t, and in read_protobuf_from_pipe validate the read uint32_t size is non‑zero and <= kMaxProtobufMessageBytes (not kMaxPipeArrayBytes) before allocating the vector or parsing; update any related comments.cpp/src/grpc/server/grpc_server_main.cpp (1)
25-45: ⚠️ Potential issue | 🟠 Major
`--chunk-timeout` is documented but currently has no effect.

`print_usage()` advertises the flag, but the parser never handles it, and unknown options still fall through silently. Operators can think they set a per-chunk timeout while the server keeps the default behavior. Please either parse it or reject it explicitly as unsupported.

Also applies to: 49-81
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 25 - 45, print_usage documents the --chunk-timeout flag but the argument parser currently ignores it (and lets unknown options pass silently); update the CLI parsing logic (where flags like --max-message-mb / --tls-* are handled) to either (A) implement --chunk-timeout by parsing its integer value into the server variable (e.g., chunk_timeout_seconds or similar), validate it (non-negative integer; 0 = disabled), and wire that variable into the streaming timeout behavior, or (B) if not supported, explicitly detect --chunk-timeout and exit with an error message stating it is unsupported; additionally make the parser treat unknown options as errors instead of silently ignoring them so flags advertised by print_usage are enforced.
🧹 Nitpick comments (2)
python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py (2)
29-31: Hardcoded default dataset path.

The default `/home/datasets/cuopt` may not exist in all environments. Tests do handle missing files gracefully via `pytest.skip`, but consider documenting this expectation or using a more portable default like `./datasets`.
Verify each finding against the current code and only fix it if needed. In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py` around lines 29 - 31, The default dataset path RAPIDS_DATASET_ROOT_DIR is hardcoded to "/home/datasets/cuopt"; change its fallback to a more portable path such as "./datasets" (or document the expectation) by updating the os.environ.get call that defines RAPIDS_DATASET_ROOT_DIR in test_cpu_only_execution.py so tests still call pytest.skip when files are missing but will use the local ./datasets directory by default; ensure the variable name RAPIDS_DATASET_ROOT_DIR is the only symbol you modify and keep the existing skip behavior intact.
369-394: Consider extracting duplicate fixture logic.

The `cpu_only_env_with_server` fixtures in `TestCPUOnlyExecution` (lines 369-394) and `TestCuoptCliCPUOnly` (lines 436-461) are nearly identical, differing only in port offset. Consider extracting a shared helper or parameterized fixture to reduce duplication.

Example refactor
```python
def _start_grpc_server_fixture(port_offset):
    """Factory for server-backed environment fixtures."""
    server_bin = _find_grpc_server()
    if server_bin is None:
        pytest.skip("cuopt_grpc_server not found")
    port = int(os.environ.get("CUOPT_TEST_PORT_BASE", "18000")) + port_offset
    proc = subprocess.Popen(
        [server_bin, "--port", str(port), "--workers", "1"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if not _wait_for_port(port, timeout=15):
        proc.kill()
        proc.wait()
        pytest.fail("cuopt_grpc_server failed to start within 15s")
    return proc, _cpu_only_env(port)


def _stop_grpc_server(proc):
    """Cleanup helper for server process."""
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=5)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
```

Then use in fixtures:
```python
@pytest.fixture(scope="class")
def cpu_only_env_with_server(self):
    proc, env = _start_grpc_server_fixture(port_offset=600)
    yield env
    _stop_grpc_server(proc)
```

Also applies to: 436-461
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py` around lines 369 - 394, Extract the duplicated server-backed fixture setup into a shared helper and cleanup functions and update both cpu_only_env_with_server (in TestCPUOnlyExecution) and the corresponding fixture in TestCuoptCliCPUOnly to call them; specifically, create a factory function (e.g., _start_grpc_server_fixture(port_offset)) that locates the server via _find_grpc_server, computes port using CUOPT_TEST_PORT_BASE + port_offset, launches the subprocess with the same args, waits with _wait_for_port, and returns (proc, _cpu_only_env(port)), and a _stop_grpc_server(proc) that sends SIGTERM and falls back to kill on TimeoutExpired, then replace the fixture bodies (cpu_only_env_with_server and the other class fixture) to call the factory, yield the env, and invoke _stop_grpc_server(proc) for cleanup.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/src/grpc/client/grpc_client.cpp`:
- Around line 204-215: start_log_streaming currently overwrites log_thread_ and
will destroy a joinable std::thread if called twice; add a lifecycle guard: if
log_thread_ exists and is joinable, set stop_logs_.store(true), call join() on
log_thread_ (or return early to reject reentry), then reset log_thread_ before
clearing stop flag and creating the new thread. Use the existing symbols
start_log_streaming, stop_logs_, log_thread_, and stop_log_streaming (if
available) to implement this sequence so you never assign a new std::thread over
a joinable one.
In `@cpp/src/grpc/grpc_problem_mapper.cpp`:
- Around line 216-224: The loop that maps var_types_str to var_types currently
maps unknown chars to var_t::CONTINUOUS, which silently changes model semantics;
instead detect any unexpected tag in the switch (refer to var_types_str,
var_types, var_t and cpu_problem.set_variable_types) and abort parsing by
returning/propagating an error or throwing an exception with a clear message
(include the bad char and context) rather than pushing CONTINUOUS; make the same
change in the other decode path referenced (lines ~512-523) so both decoders
validate tags and fail on unknown values.
In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp`:
- Around line 115-128: Validate inputs before writing into assembled: compute
elem_size only after confirming fi.total_bytes and
fi.chunks[0]->total_elements() divide evenly (use division with remainder check)
and fail if not; for each ac in fi.chunks validate ac->element_offset() is
non-negative and that chunk_data.size() is a multiple of elem_size
(chunk_data.size() % elem_size == 0), compute num_elements = chunk_data.size() /
elem_size and check element_offset + num_elements does not exceed
fi.chunks[0]->total_elements() before multiplying to get a byte_offset, and
instead of pre-zeroing and blindly memcpying keep a bitmap or byte-coverage
counter to detect gaps/overlaps and only allow success if the union of all chunk
byte ranges exactly covers fi.total_bytes (otherwise return false).
In `@GRPC_QUICK_START.md`:
- Around line 68-118: Add a missing server-certificate issuance step: generate a
server key and CSR, sign the CSR with the private CA (ca.crt / ca.key) to
produce server.crt, and ensure the CSR includes a SAN that matches the hostname
clients will use (so the server cert presented to clients is actually signed by
the same CA and contains the correct subjectAltName); update the docs around the
cuopt_grpc_server invocation and the environment variables (--tls-cert
server.crt, --tls-key server.key, --tls-root ca.crt) to state explicitly that
server.crt must be the CA-signed certificate with matching SAN.
In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`:
- Around line 57-61: The CONDA_PREFIX branch sets p = os.path.join(conda_prefix,
"bin", "cuopt_grpc_server") and only checks os.path.isfile(p); update that
branch to also verify the file is executable using os.access(p, os.X_OK)
(consistent with the other branches) before returning p, so replace the single
isfile check with a combined isfile and os.access check for executability.
---
Outside diff comments:
In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`:
- Around line 474-479: The CONDA_PREFIX branch currently only checks
os.path.isfile(p) and should also verify the file is executable like the earlier
_find_grpc_server logic; update the check for the constructed path variable p to
ensure it exists and is executable (e.g., replace or augment os.path.isfile(p)
with an executable check such as os.access(p, os.X_OK) or equivalent) so the
function returns the path only when the cuopt_cli binary is executable.
---
Duplicate comments:
In `@cpp/src/grpc/client/grpc_client.cpp`:
- Around line 89-94: The helper set_rpc_deadline(ctx, timeout_seconds) exists
but call sites still pass the hard-coded 60s constant, so plumb the configured
timeout through: replace literal 60 at each RPC call site with
config_.timeout_seconds (or, if you want a separate short-RPC knob, add and use
a new member like config_.short_rpc_timeout_seconds) so that functions using
set_rpc_deadline (references around the hard-coded uses noted in the review and
the helper itself) honor the client configuration; ensure default behavior
preserves current semantics when config_ value is zero or negative.
In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp`:
- Around line 37-55: The helpers write_protobuf_to_pipe and
read_protobuf_from_pipe currently allow near-4GiB allocations because they only
compare a uint32_t size against kMaxPipeArrayBytes; introduce a dedicated
protobuf message cap (e.g., kMaxProtobufMessageBytes) and enforce it
symmetrically: in write_protobuf_to_pipe check msg.ByteSizeLong() against the
new cap and return false if it exceeds the cap before casting to uint32_t, and
in read_protobuf_from_pipe validate the read uint32_t size is non‑zero and <=
kMaxProtobufMessageBytes (not kMaxPipeArrayBytes) before allocating the vector
or parsing; update any related comments.
In `@cpp/src/grpc/server/grpc_server_main.cpp`:
- Around line 25-45: print_usage documents the --chunk-timeout flag but the
argument parser currently ignores it (and lets unknown options pass silently);
update the CLI parsing logic (where flags like --max-message-mb / --tls-* are
handled) to either (A) implement --chunk-timeout by parsing its integer value
into the server variable (e.g., chunk_timeout_seconds or similar), validate it
(non-negative integer; 0 = disabled), and wire that variable into the streaming
timeout behavior, or (B) if not supported, explicitly detect --chunk-timeout and
exit with an error message stating it is unsupported; additionally make the
parser treat unknown options as errors instead of silently ignoring them so
flags advertised by print_usage are enforced.
---
Nitpick comments:
In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`:
- Around line 29-31: The default dataset path RAPIDS_DATASET_ROOT_DIR is
hardcoded to "/home/datasets/cuopt"; change its fallback to a more portable path
such as "./datasets" (or document the expectation) by updating the
os.environ.get call that defines RAPIDS_DATASET_ROOT_DIR in
test_cpu_only_execution.py so tests still call pytest.skip when files are
missing but will use the local ./datasets directory by default; ensure the
variable name RAPIDS_DATASET_ROOT_DIR is the only symbol you modify and keep the
existing skip behavior intact.
- Around line 369-394: Extract the duplicated server-backed fixture setup into a
shared helper and cleanup functions and update both cpu_only_env_with_server (in
TestCPUOnlyExecution) and the corresponding fixture in TestCuoptCliCPUOnly to
call them; specifically, create a factory function (e.g.,
_start_grpc_server_fixture(port_offset)) that locates the server via
_find_grpc_server, computes port using CUOPT_TEST_PORT_BASE + port_offset,
launches the subprocess with the same args, waits with _wait_for_port, and
returns (proc, _cpu_only_env(port)), and a _stop_grpc_server(proc) that sends
SIGTERM and falls back to kill on TimeoutExpired, then replace the fixture
bodies (cpu_only_env_with_server and the other class fixture) to call the
factory, yield the env, and invoke _stop_grpc_server(proc) for cleanup.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 21c442e4-fc3d-439d-aaa7-8c0bbcf57bcd
📒 Files selected for processing (10)
- GRPC_QUICK_START.md
- ci/utils/install_protobuf_grpc.sh
- cpp/src/grpc/client/grpc_client.cpp
- cpp/src/grpc/grpc_problem_mapper.cpp
- cpp/src/grpc/server/grpc_job_management.cpp
- cpp/src/grpc/server/grpc_pipe_serialization.hpp
- cpp/src/grpc/server/grpc_server_main.cpp
- cpp/src/grpc/server/grpc_worker_infra.cpp
- cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
- python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
🚧 Files skipped from review as they are similar to previous changes (2)
- cpp/src/grpc/server/grpc_worker_infra.cpp
- ci/utils/install_protobuf_grpc.sh
Actionable comments posted: 8
♻️ Duplicate comments (3)
cpp/src/grpc/server/grpc_job_management.cpp (1)
245-259:⚠️ Potential issue | 🔴 Critical
`cancel_job()` still has a slot-reuse race. This path matches `job_id` once, then writes `cancelled` and may `kill()` a worker based on queue-slot state that can be recycled concurrently by the dispatcher/result threads. If the slot is reused between those operations, cancellation can affect the wrong job. Pin the slot, or revalidate that the same `job_id` still owns it immediately before mutating the slot and signaling the PID. As per coding guidelines "Ensure race conditions are absent in multi-threaded server implementations; verify proper synchronization of shared state".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 245 - 259, The cancel_job() path can race with slot reuse; after you find a matching slot (checking job_queue[i].ready and strcmp on job_queue[i].job_id), revalidate/lock the slot immediately before mutating or signaling: either acquire the slot-level mutex (or a global queue mutex) around reading worker_pid/claimed and then set job_queue[i].cancelled and call kill(), or atomically re-check that strcmp(job_queue[i].job_id, job_id.c_str()) still matches (and job_queue[i].ready is still true) right before setting cancelled/killing; ensure you reference the same symbols (MAX_JOBS, job_queue[i].job_id, job_queue[i].ready, job_queue[i].claimed, job_queue[i].cancelled, worker_pid, kill()) so the slot cannot be reused between the match and the mutation/signal.
cpp/src/grpc/server/grpc_pipe_serialization.hpp (1)
120-140:⚠️ Potential issue | 🔴 Critical
`bytes_covered` is not a real integrity check. Two overlapping chunks and one gap can still satisfy `bytes_covered == fi.total_bytes`, so this path accepts a zero-padded/corrupted field as valid. It also multiplies `element_offset * elem_size` before proving the element range is safe, which can overflow on malformed input. Track covered ranges/bytes and validate the element interval before converting it to byte offsets. As per coding guidelines "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 120 - 140, The current assembly loop uses bytes_covered and multiplies element_offset * elem_size before validating element ranges, which allows overflow and accepts overlaps/gaps; fix by first validating elem_size > 0 and that chunk_data.size() % elem_size == 0, compute chunk_elements = chunk_data.size() / elem_size, validate element_offset >= 0 and element_offset <= total_elements - chunk_elements (use int64 safe checks to avoid overflow) before computing byte_offset, and replace the bytes_covered check with a per-element coverage bitmap/vector<bool> (size total_elements) that marks each element index covered by a chunk and detects any overlap (already-marked bit) or final gaps (any false bit) before memcpy and write_to_pipe; ensure all bounds and arithmetic use checked types to prevent overflows.
cpp/src/grpc/server/grpc_server_main.cpp (1)
49-86:⚠️ Potential issue | 🟠 Major
Still reject missing option values instead of falling through. Flags that require an argument (`--port`, `--workers`, `--max-message-*`, `--chunk-timeout`, and the TLS path flags) still just skip parsing when the value is absent, so the server can continue with unintended defaults. Please fail fast with usage output here, and validate basic numeric ranges before continuing.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 49 - 86, The CLI parsing loop in grpc_server_main.cpp currently ignores missing values for flags that require arguments, allowing silent fallback to defaults; update the parsing for flags that take values (those setting config.port, config.num_workers, config.max_message_bytes via --max-message-mb/--max-message-bytes, config.chunk_timeout_seconds, and TLS path flags config.tls_cert_path/config.tls_key_path/config.tls_root_path) to check i+1 < argc and if missing call print_usage(argv[0]) and return 1 (fail fast), and after parsing validate numeric ranges (e.g. port in 1–65535, num_workers > 0, max_message_bytes >= 4096, chunk_timeout_seconds >= 0) and on invalid values also print_usage and return 1; keep using the same parsing block and print_usage/return flow so behavior is consistent.
🧹 Nitpick comments (3)
python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py (3)
630-635: Duplicate shutdown logic could be extracted. This shutdown block duplicates `_stop_grpc_server`. Consider reusing the helper.
♻️ Suggested refactor
```diff
-        proc.send_signal(signal.SIGTERM)
-        try:
-            proc.wait(timeout=5)
-        except subprocess.TimeoutExpired:
-            proc.kill()
-            proc.wait()
+        _stop_grpc_server(proc)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py` around lines 630 - 635, The shutdown block that sends SIGTERM and falls back to kill duplicates the existing helper _stop_grpc_server; replace the inline sequence (proc.send_signal(signal.SIGTERM)...proc.wait()) with a call to the helper _stop_grpc_server(proc) (or refactor that helper to accept the proc object if necessary), ensuring the helper implements the same timeout and fallback kill behavior and reuses existing logging/error handling so the test uses the central shutdown logic.
88-100: Consider logging failures for easier debugging. The inner `_run` helper silently discards stdout/stderr on failure. If certificate generation fails, diagnosing the root cause will be difficult.
♻️ Optional: capture and log failures
```diff
     def _run(cmd):
-        return (
-            subprocess.run(cmd, capture_output=True, timeout=30).returncode
-            == 0
-        )
+        result = subprocess.run(cmd, capture_output=True, timeout=30, text=True)
+        if result.returncode != 0:
+            print(f"openssl command failed: {' '.join(cmd)}", file=sys.stderr)
+            if result.stderr:
+                print(result.stderr, file=sys.stderr)
+        return result.returncode == 0
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py` around lines 88 - 100, The helper _run inside _generate_test_certs currently swallows stdout/stderr which makes failures hard to diagnose; modify _run to capture subprocess.run's CompletedProcess, and on non-zero returncode log the command, returncode, stdout and stderr (use the module logger via logging.getLogger or the existing test logger) before returning False, so callers can see detailed failure information while keeping the same boolean success/failure return behavior.
307-325: Add numerical correctness validation for MIP test. The test validates solution size but not correctness. Per coding guidelines, tests should validate "numerical correctness of optimization results (not just 'runs without error')". Consider adding objective value or feasibility validation.
♻️ Suggested improvement
```diff
     solution = linear_programming.Solve(dm, settings)
     vals = solution.get_primal_solution()
     assert len(vals) == n_vars, f"solution size {len(vals)} != n_vars {n_vars}"
+
+    obj = solution.get_primal_objective()
+    assert obj is not None, "objective is None"
+
+    # bb_optimality.mps has a known optimal value; validate against it
+    stats = solution.get_milp_stats()
+    assert "mip_gap" in stats, "MIP stats missing mip_gap"
+    assert stats["mip_gap"] >= 0, f"Negative MIP gap: {stats['mip_gap']}"
```
As per coding guidelines: "Write tests validating numerical correctness of optimization results (not just 'runs without error')".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py` around lines 307 - 325, Add numerical correctness checks to _impl_mip_solve_cpu_only: compute the objective from dm.get_objective_coefficients() dotted with the returned solution.get_primal_solution() and assert it matches solution.get_objective_value() within a small tolerance; additionally validate feasibility by evaluating the constraint matrix and RHS (use dm.get_constraint_matrix(), dm.get_rhs(), dm.get_sense() or equivalent accessors) against vals to ensure all constraints satisfy their senses (<=, =, >=) within tolerances. Use n_vars, dm, solution, vals and explicit tolerances in the assertions so the test verifies both objective accuracy and constraint feasibility rather than just solution size.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/src/grpc/client/grpc_client.cpp`:
- Around line 683-688: The calculation of max_polls can overflow because
config_.timeout_seconds * 1000 is done in int; change the arithmetic to 64-bit
before multiplication (e.g., use int64_t or long long) to compute
(timeout_seconds * 1000) / poll_ms safely, then clamp or cast the final value
into an int as needed; update the computation that assigns max_polls
(referencing poll_ms, max_polls, and
config_.timeout_seconds/config_.poll_interval_ms) to perform 64-bit math and
handle the case where the result exceeds std::numeric_limits<int>::max().
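The widened-then-clamped computation reads roughly like this sketch (Python ints never overflow, so only the widen-before-multiply and clamp structure carries over to the C++ fix; names are illustrative):

```python
INT_MAX = 2**31 - 1  # std::numeric_limits<int>::max()

def compute_max_polls(timeout_seconds, poll_interval_ms):
    """Widen before multiplying, then clamp into the int range the caller expects."""
    if poll_interval_ms <= 0:
        poll_interval_ms = 1
    total_ms = timeout_seconds * 1000  # done in int64_t in the C++ fix
    return min(total_ms // poll_interval_ms, INT_MAX)
```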
- Around line 292-295: The member server_max_message_bytes_ of class
grpc_client_t is being accessed concurrently from multiple RPC methods
(check_status, get_result_or_download, download_chunked_result,
upload_chunked_arrays) causing a data race; fix by making
server_max_message_bytes_ an std::atomic<int64_t> (replace its declaration) and
update all writes to use store() and reads to use load() (use
memory_order_relaxed unless stricter ordering is required), or alternatively
document in the class comment that grpc_client_t is not thread-safe and must
only be used from a single thread—pick one approach and apply it consistently
across all usages referenced above.
In `@cpp/src/grpc/grpc_problem_mapper.cpp`:
- Around line 113-119: The encoding loop over var_types that maps var_t to chars
(pushing into var_types_str) currently treats unknown enum values as CONTINUOUS
('C'); change the default case in that switch to detect unexpected var_t values
and fail loudly (e.g., throw std::runtime_error or call LOG(ERROR)/assert)
instead of pushing 'C'), including the offending enum value in the message;
reference the enum var_t, the var_types container and var_types_str vector so
you update that switch branch to mirror the decoder's behavior (i.e., explicit
handling for known cases and error on unknown cases).
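The fail-loud encoder behavior could be sketched as follows (the mapping and enum spellings are illustrative stand-ins, not the actual `var_t` values):

```python
VAR_TYPE_TO_CHAR = {"CONTINUOUS": "C", "INTEGER": "I"}

def encode_var_types(var_types):
    """Encode known variable types; raise on anything unexpected instead of
    silently emitting 'C'."""
    out = []
    for vt in var_types:
        if vt not in VAR_TYPE_TO_CHAR:
            raise ValueError(f"unexpected var_t value: {vt!r}")
        out.append(VAR_TYPE_TO_CHAR[vt])
    return "".join(out)
```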
- Around line 647-652: The switch over var_t in the loop currently maps unknown
enum values silently to CONTINUOUS; update the default path in the block that
fills vt_bytes (loop iterating over var_types and writing to vt_bytes) to match
the handling used in map_problem_to_proto: detect unexpected var_t values, emit
a clear error/log entry including the offending enum value (and context), and
encode a distinct marker (e.g., 'U' for unknown) or return/propagate an error
instead of silently falling back to 'C' so unknown enums are not silently
converted; ensure you touch the switch that references var_t, var_types, and
vt_bytes to implement this consistent behavior.
In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp`:
- Around line 70-81: The clear() method resets client_logs_ but incorrectly sets
server_log_start_pos_ to 0 and test_start_marked_ to false, which causes
subsequent get_server_logs() to return the entire server log (including stale
entries); change clear() to mark the server log start for the next test by
setting server_log_start_pos_ to the current end of the server log (e.g.,
compute current file size/position used by get_server_logs()) and set
test_start_marked_ = true (or call the existing mark/test-start helper) so
get_server_logs() only returns entries written after clear(); update references
in clear(), server_log_start_pos_, and test_start_marked_ accordingly.
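The intended `clear()` semantics can be modeled with an in-memory stand-in (a list in place of the server log file; member names mirror the comment but the class shape is illustrative):

```python
class LogCapture:
    """Returns only server-log entries written after the last clear()."""

    def __init__(self):
        self._server_log = []  # stands in for the server's log file
        self._start = 0

    def append(self, line):
        self._server_log.append(line)

    def clear(self):
        # Mark the current end of the log instead of resetting to 0,
        # so stale entries from earlier tests are excluded from later reads.
        self._start = len(self._server_log)

    def get_server_logs(self):
        return self._server_log[self._start:]
```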
- Around line 40-47: The header is not self-contained because
dump_logs(std::ostream& out = std::cout) uses std::ostream and std::cout but
<iostream> is not included; add `#include` <iostream> at the top of
grpc_test_log_capture.hpp so std::ostream and std::cout are declared and the
header no longer relies on transitive includes (affects the dump_logs
declaration and any callers relying on default std::cout).
In `@GRPC_QUICK_START.md`:
- Line 134: Duplicate step number: change the heading "4. Configure an
authorized client:" to "5. Configure an authorized client:" so the sequence
follows the previous "4." step; update the numeric prefix in that Markdown line
(the string "4. Configure an authorized client:") to "5." to fix the numbering.
---
Duplicate comments:
In `@cpp/src/grpc/server/grpc_job_management.cpp`:
- Around line 245-259: The cancel_job() path can race with slot reuse; after you
find a matching slot (checking job_queue[i].ready and strcmp on
job_queue[i].job_id), revalidate/lock the slot immediately before mutating or
signaling: either acquire the slot-level mutex (or a global queue mutex) around
reading worker_pid/claimed and then set job_queue[i].cancelled and call kill(),
or atomically re-check that strcmp(job_queue[i].job_id, job_id.c_str()) still
matches (and job_queue[i].ready is still true) right before setting
cancelled/killing; ensure you reference the same symbols (MAX_JOBS,
job_queue[i].job_id, job_queue[i].ready, job_queue[i].claimed,
job_queue[i].cancelled, worker_pid, kill()) so the slot cannot be reused between
the match and the mutation/signal.
In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp`:
- Around line 120-140: The current assembly loop uses bytes_covered and
multiplies element_offset * elem_size before validating element ranges, which
allows overflow and accepts overlaps/gaps; fix by first validating elem_size > 0
and that chunk_data.size() % elem_size == 0, compute chunk_elements =
chunk_data.size() / elem_size, validate element_offset >= 0 and element_offset
<= total_elements - chunk_elements (use int64 safe checks to avoid overflow)
before computing byte_offset, and replace the bytes_covered check with a
per-element coverage bitmap/vector<bool> (size total_elements) that marks each
element index covered by a chunk and detects any overlap (already-marked bit) or
final gaps (any false bit) before memcpy and write_to_pipe; ensure all bounds
and arithmetic use checked types to prevent overflows.
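The coverage-bitmap validation described above can be sketched in Python (the C++ fix would use `std::vector<bool>` and checked `int64` arithmetic; the function shape here is illustrative):

```python
def assemble_chunks(total_elements, elem_size, chunks):
    """Validate and assemble (element_offset, data) chunks.

    Returns the assembled bytes, or None if any chunk is misaligned,
    out of range, overlaps another chunk, or leaves a gap.
    """
    if elem_size <= 0:
        return None
    covered = [False] * total_elements
    buf = bytearray(total_elements * elem_size)
    for element_offset, data in chunks:
        if len(data) % elem_size != 0:
            return None  # misaligned chunk
        n = len(data) // elem_size
        # Range check in element units, before any byte arithmetic.
        if element_offset < 0 or element_offset > total_elements - n:
            return None
        for i in range(element_offset, element_offset + n):
            if covered[i]:
                return None  # overlap
            covered[i] = True
        byte_offset = element_offset * elem_size  # now provably in range
        buf[byte_offset:byte_offset + len(data)] = data
    if not all(covered):
        return None  # gap
    return bytes(buf)
```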
In `@cpp/src/grpc/server/grpc_server_main.cpp`:
- Around line 49-86: The CLI parsing loop in grpc_server_main.cpp currently
ignores missing values for flags that require arguments, allowing silent
fallback to defaults; update the parsing for flags that take values (those
setting config.port, config.num_workers, config.max_message_bytes via
--max-message-mb/--max-message-bytes, config.chunk_timeout_seconds, and TLS path
flags config.tls_cert_path/config.tls_key_path/config.tls_root_path) to check
i+1 < argc and if missing call print_usage(argv[0]) and return 1 (fail fast),
and after parsing validate numeric ranges (e.g. port in 1–65535, num_workers >
0, max_message_bytes >= 4096, chunk_timeout_seconds >= 0) and on invalid values
also print_usage and return 1; keep using the same parsing block and
print_usage/return flow so behavior is consistent.
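A fail-fast parsing loop of the kind described could look like this Python sketch (the real server parses `argv` in C++; the flag set and numeric ranges below are illustrative):

```python
FLAGS = {  # flag -> (config key, min, max)
    "--port": ("port", 1, 65535),
    "--workers": ("num_workers", 1, 10**6),
    "--chunk-timeout": ("chunk_timeout_seconds", 0, 10**9),
}

def parse_args(argv):
    config = {"port": 8765, "num_workers": 4, "chunk_timeout_seconds": 0}
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg not in FLAGS:
            raise SystemExit(f"unknown option: {arg}")    # no silent ignore
        if i + 1 >= len(argv):
            raise SystemExit(f"missing value for {arg}")  # fail fast
        key, lo, hi = FLAGS[arg]
        try:
            val = int(argv[i + 1])
        except ValueError:
            raise SystemExit(f"non-integer value for {arg}: {argv[i + 1]}")
        if not (lo <= val <= hi):
            raise SystemExit(f"{arg} out of range: {val}")
        config[key] = val
        i += 2
    return config
```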
---
Nitpick comments:
In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`:
- Around line 630-635: The shutdown block that sends SIGTERM and falls back to
kill duplicates the existing helper _stop_grpc_server; replace the inline
sequence (proc.send_signal(signal.SIGTERM)...proc.wait()) with a call to the
helper _stop_grpc_server(proc) (or refactor that helper to accept the proc
object if necessary), ensuring the helper implements the same timeout and
fallback kill behavior and reuses existing logging/error handling so the test
uses the central shutdown logic.
- Around line 88-100: The helper _run inside _generate_test_certs currently
swallows stdout/stderr which makes failures hard to diagnose; modify _run to
capture subprocess.run's CompletedProcess, and on non-zero returncode log the
command, returncode, stdout and stderr (use the module logger via
logging.getLogger or the existing test logger) before returning False, so
callers can see detailed failure information while keeping the same boolean
success/failure return behavior.
- Around line 307-325: Add numerical correctness checks to
_impl_mip_solve_cpu_only: compute the objective from
dm.get_objective_coefficients() dotted with the returned
solution.get_primal_solution() and assert it matches
solution.get_objective_value() within a small tolerance; additionally validate
feasibility by evaluating the constraint matrix and RHS (use
dm.get_constraint_matrix(), dm.get_rhs(), dm.get_sense() or equivalent
accessors) against vals to ensure all constraints satisfy their senses (<=, =,
>=) within tolerances. Use n_vars, dm, solution, vals and explicit tolerances in
the assertions so the test verifies both objective accuracy and constraint
feasibility rather than just solution size.
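The objective/feasibility assertions could be factored into a small checker like this sketch (dense rows and `'L'`/`'G'`/`'E'` sense codes are assumptions for illustration; the real test would pull these from the data model accessors):

```python
def check_lp_solution(c, A, senses, rhs, x, tol=1e-6):
    """Assert feasibility of x against rows of A, then return its objective.

    senses holds one of 'L' (<=), 'G' (>=), 'E' (=) per row.
    """
    for row, sense, b in zip(A, senses, rhs):
        lhs = sum(a * xi for a, xi in zip(row, x))
        if sense == "L":
            assert lhs <= b + tol, f"row violates <= {b}: {lhs}"
        elif sense == "G":
            assert lhs >= b - tol, f"row violates >= {b}: {lhs}"
        else:
            assert abs(lhs - b) <= tol, f"row violates == {b}: {lhs}"
    return sum(ci * xi for ci, xi in zip(c, x))
```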
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2293365d-c3e4-47b6-b111-6b48e153b0c0
📒 Files selected for processing (11)
- GRPC_QUICK_START.md
- ci/utils/install_protobuf_grpc.sh
- cpp/src/grpc/client/grpc_client.cpp
- cpp/src/grpc/grpc_problem_mapper.cpp
- cpp/src/grpc/server/grpc_job_management.cpp
- cpp/src/grpc/server/grpc_pipe_serialization.hpp
- cpp/src/grpc/server/grpc_server_main.cpp
- cpp/src/grpc/server/grpc_server_types.hpp
- cpp/src/grpc/server/grpc_worker_infra.cpp
- cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
- python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
🚧 Files skipped from review as they are similar to previous changes (2)
- ci/utils/install_protobuf_grpc.sh
- cpp/src/grpc/server/grpc_worker_infra.cpp
This change replaces the solve_lp_remote and solve_mip_remote stubs with real routines that use an embedded gRPC client to communicate with a remote cuopt server.
There are two documents included, GRPC_ARCHITECTURE.md and SERVER_ARCHITECTURE.md that act as developer guides.
The server is built by build.sh and is called cuopt_grpc_server.
Remote execution is still enabled via env vars CUOPT_REMOTE_HOST and CUOPT_REMOTE_PORT
To try this feature, check out the branch/PR and do the following:
```shell
$ build.sh
$ cuopt_grpc_server   # defaults are fine, run with -h for help
$ CUOPT_REMOTE_HOST=localhost CUOPT_REMOTE_PORT=8765 cuopt_cli myproblem.mps
```
All cuopt APIs will pick up remote execution if the env vars are set. So cuopt_cli, the C API, the Python API, all will solve problems on the server if the env vars are set and the server is running. Just use cuopt tools and APIs as you normally would for a local solve.
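For script-driven runs, the same env-var mechanism can be applied per invocation. A sketch (the solve line is commented out because it needs a running server and a real MPS file; `myproblem.mps` is a placeholder):

```python
import os
import subprocess

# Point cuopt entry points at the remote server via the env vars
# named above; host/port values match the shell example.
env = dict(os.environ,
           CUOPT_REMOTE_HOST="localhost",
           CUOPT_REMOTE_PORT="8765")

# subprocess.run(["cuopt_cli", "myproblem.mps"], env=env, check=True)
```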