
Conversation

@SurajAralihalli
Contributor

Dataproc ML images (2.3-ml-ubuntu) come with NVIDIA drivers and Spark RAPIDS preinstalled by default, so the init script should only update the RAPIDS JAR and skip all other setup steps.

This PR:

  • Introduces `install_gpu_xgboost` and a new function to check for existing RAPIDS JARs.
  • Introduces `remove_spark_rapids_jar` to clean up existing JARs before installation, using a wildcard to match RAPIDS JAR filenames for more flexible removal (see the sketch below).
  • Updates the main function to ensure the correct RAPIDS version is installed, replacing any existing JAR and skipping NVIDIA driver installation when the instance already includes the Spark RAPIDS JAR (applies to ML images).
  • Improves the overall structure and readability of the script and ensures the main function is properly terminated with a trailing newline.
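A minimal sketch of how those JAR helpers might look — the Spark JAR directory and the `rapids-4-spark_*.jar` filename pattern are assumptions for illustration, not the PR's exact code:

```bash
# Sketch only: paths and patterns are assumptions, not the PR's actual diff.
readonly SPARK_JARS_DIR=/usr/lib/spark/jars

# Succeeds (exit 0) if any Spark RAPIDS JAR is already present,
# which is the case on 2.3-ml images.
check_spark_rapids_jar() {
  compgen -G "${SPARK_JARS_DIR}/rapids-4-spark_*.jar" > /dev/null
}

# Remove any pre-installed Spark RAPIDS JARs so the requested version
# can be installed cleanly.
remove_spark_rapids_jar() {
  rm -f "${SPARK_JARS_DIR}"/rapids-4-spark_*.jar
}
```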
@gemini-code-assist
Contributor

Summary of Changes

Hello @SurajAralihalli, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the Spark RAPIDS initialization script to intelligently handle Dataproc 2.3 ML images. By introducing checks for existing Spark RAPIDS installations and providing mechanisms to update them, the script ensures that the correct RAPIDS version is deployed efficiently, avoiding unnecessary driver installations on environments where they are already present. This streamlines the setup process for Spark RAPIDS on Dataproc ML clusters.

Highlights

  • Function Refactoring: The original install_spark_rapids function has been refactored. The logic for installing GPU XGBoost has been moved into a new function named install_gpu_xgboost, while install_spark_rapids now exclusively handles the Spark RAPIDS JAR installation.
  • JAR Detection and Removal Utilities: Two new utility functions, check_spark_rapids_jar and remove_spark_rapids_jar, have been introduced. These functions allow the script to detect if a Spark RAPIDS JAR is already present and to remove existing JARs, respectively.
  • Optimized Main Function Logic: The main function now includes conditional logic to optimize installation on Dataproc ML images. If a Spark RAPIDS JAR is detected (indicating a pre-installed ML image), the script will remove the existing JAR and install the specified version, skipping redundant NVIDIA driver installations. Otherwise, it proceeds with a full installation including GPU drivers and both Spark RAPIDS and XGBoost JARs.
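The conditional flow described in the last highlight could be sketched roughly as follows — illustrative only; it reuses the function names mentioned in this thread and elides the driver, YARN, and restart steps:

```bash
main() {
  if check_spark_rapids_jar; then
    # Pre-installed (e.g. 2.3-ml image): only swap the JAR for the
    # requested RAPIDS version; skip NVIDIA driver installation.
    remove_spark_rapids_jar
    install_spark_rapids
  else
    # Full installation: GPU drivers plus Spark RAPIDS and XGBoost JARs.
    setup_gpu_yarn
    install_spark_rapids
    install_gpu_xgboost
  fi
}
```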


@SurajAralihalli
Contributor Author

SurajAralihalli commented Jan 10, 2026

@cjac @jayadeep-jayaraman @codelixir for reviews

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the Spark RAPIDS installation script to support Dataproc ML images, which come with Spark RAPIDS pre-installed. The approach of checking for an existing JAR and only updating it is sound. I've identified a high-severity issue where YARN services are not restarted after the JAR is updated on ML images, which would prevent the changes from taking effect. I've also included a medium-severity suggestion to improve the robustness of the script.

@codelixir
Member

/gcbrun

@SurajAralihalli
Contributor Author

@codelixir can this PR be merged?

@cjac
Contributor

cjac commented Feb 6, 2026

Sorry for the delay here; I've been caught up in other things.

/gcbrun

@cjac self-assigned this on Feb 6, 2026
@cjac
Contributor

cjac commented Feb 6, 2026

My first review of the change indicates that it treats the -ml image type differently from non-ml images. This accelerates the delivery pipeline by skipping much of the installation.

These changes should be generalized to detect component installation status independently of the image type, so that a custom image (#120) built from any OS image with the assets pre-installed can be used as an alternative.
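A rough sketch of what such image-independent detection could look like — all paths and helper names here are assumptions for illustration:

```bash
# Decide what to (re)install by probing the node, not by matching image names.
nvidia_driver_installed() {
  command -v nvidia-smi > /dev/null && nvidia-smi -L > /dev/null 2>&1
}

spark_rapids_jar_installed() {
  compgen -G "/usr/lib/spark/jars/rapids-4-spark_*.jar" > /dev/null
}

if nvidia_driver_installed && spark_rapids_jar_installed; then
  echo "GPU stack already present; only the RAPIDS JAR needs updating."
else
  echo "Missing components detected; running the full installation."
fi
```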

@cjac
Contributor

cjac commented Feb 6, 2026

@SurajAralihalli would you like me to implement the changes I recommended and patch your PR for your review?

@SurajAralihalli
Contributor Author

> @SurajAralihalli would you like me to implement the changes I recommended and patch your PR for your review?

Sure @cjac!

@cjac
Contributor

cjac commented Feb 8, 2026

Okay, it's in progress. I will try to finish it this weekend, but I can't promise; we'll see how far I get.

Right now it captures information about packages installed at the system and Conda levels.

I am working on getting better diagnostics on Rocky, but it's not very well supported yet.

I have the evaluation printing the information in all of the different starting states, including a fresh base image, a custom image with Secure Boot, a custom image with Secure Boot and the GPU driver installer run to completion, and 2.3-ml.

I have yet to try anything earlier than the 2.2 image.

@cjac
Contributor

cjac commented Feb 8, 2026

Journal Entry: 2026-02-06 - PR #1372 Analysis

Project: Dataproc Initialization Actions - Spark RAPIDS
Objective: Understand and plan improvements for PR #1372.
Task: Review the PR's intent and identify areas for generalization.

Actions:

Outcome: Clear understanding of the task to refactor the spark-rapids init action to dynamically detect the environment.


Journal Entry: 2026-02-07 - GPU Environment Detection Script Development

Project: Dataproc Initialization Actions - Spark RAPIDS
Objective: Create a set of functions to detect GPU hardware and software components.
Task: Implement initial evaluate_gpu_setup and helper functions.

Actions:

  • Created tmp/spark-rapids.sh_function-gpu_detect.
  • Implemented functions:
    • is_gpu_instance: Checking metadata for GPU presence.
    • detect_nvidia_driver: Checking for nvidia-smi, loaded modules, and /dev/nvidia0.
    • get_nvidia_driver_version: Parsing nvidia-smi.
    • detect_cuda_toolkit: Checking for nvcc and /usr/local/cuda.
    • get_cuda_version: Parsing nvcc --version.
    • find_spark_rapids_jars: Searching for RAPIDS JARs.
    • detect_conda_gpu_packages: Basic check for key GPU packages in Conda.
    • evaluate_gpu_setup: Orchestrator function to call detectors and return a status.
  • Initial testing revealed issues with Conda package detection and CUDA version parsing.

Outcome: First version of GPU detection functions created. Identified need for refinement in Conda and CUDA checks.
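A hedged sketch of some of the detectors listed in this entry; the actual implementation in tmp/spark-rapids.sh_function-gpu_detect may differ in details:

```bash
detect_nvidia_driver() {
  # Driver is considered present if nvidia-smi exists, the kernel module
  # is loaded, and the first device node has been created.
  command -v nvidia-smi > /dev/null \
    && lsmod | grep -q '^nvidia' \
    && [[ -e /dev/nvidia0 ]]
}

get_nvidia_driver_version() {
  nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n1
}

detect_cuda_toolkit() {
  command -v nvcc > /dev/null || [[ -d /usr/local/cuda ]]
}

get_cuda_version() {
  # Parses e.g. "Cuda compilation tools, release 12.4, V12.4.131".
  nvcc --version 2>/dev/null | sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}
```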


Journal Entry: 2026-02-07 - Refining Conda and CUDA Detection

Project: Dataproc Initialization Actions - Spark RAPIDS
Objective: Improve accuracy of Conda package and CUDA version detection.
Task: Modify detection functions based on test results.

Actions:

  • Updated detect_conda_gpu_packages:
    • Initially added checks for include-pytorch metadata, but realized this was too restrictive.
    • Removed the include-pytorch gate to always check the Conda env.
    • Ensured conda.sh is sourced correctly.
    • Switched from printing the full conda list to checking for specific packages (pytorch, tensorflow, rapids, xgboost, cudatoolkit, cudnn) and logging findings.
  • Updated get_cuda_version: Added 2>/dev/null to handle cases where nvcc might fail.
  • Updated evaluate_gpu_setup to differentiate between ML_BASE and ML_BASE_WITH_CONDA.

Outcome: Conda package detection is more reliable and less verbose. CUDA version detection is more robust.
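A simplified sketch of the refined Conda check; the conda.sh path assumes the standard Dataproc Miniconda layout and is itself an assumption:

```bash
detect_conda_gpu_packages() {
  local conda_sh=/opt/conda/miniconda3/etc/profile.d/conda.sh
  [[ -f "${conda_sh}" ]] || return 1
  # shellcheck source=/dev/null
  source "${conda_sh}"

  # Check for specific packages instead of dumping the full listing.
  local listing pkg
  listing="$(conda list 2>/dev/null)" || return 1
  for pkg in pytorch tensorflow rapids xgboost cudatoolkit cudnn; do
    if grep -qiE "^${pkg}([[:space:]]|$)" <<< "${listing}"; then
      echo "found conda package: ${pkg}"
    fi
  done
}
```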


Journal Entry: 2026-02-07 - Enhancing JAR and System Information Detection

Project: Dataproc Initialization Actions - Spark RAPIDS
Objective: Improve JAR detection and add system-level checks.
Task: Update JAR finding, add Secure Boot and Proxy detection.

Actions:

  • Renamed find_spark_rapids_jars to find_gpu_spark_jars.
  • Enhanced find_gpu_spark_jars to extract version numbers from JAR filenames and include patterns for XGBoost JARs.
  • Added detect_secure_boot_status using mokutil --sb-state.
  • Added detect_proxy_settings to check common environment variables and GCE metadata for proxies.
  • Improved get_metadata_attribute to fall back to project metadata.
  • Integrated these new checks into evaluate_gpu_setup.
  • Refactored evaluate_gpu_setup output to be a concise summary.
  • Fixed metadata fetching to handle non-200 responses, preventing HTML output.

Outcome: More comprehensive environment detection, including JAR versions, Secure Boot, and Proxy settings. Output is now a summary.
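Sketches of the system-level probes added in this pass; the metadata fallback follows the usual GCE instance-then-project pattern, and the real script's behaviour may differ:

```bash
detect_secure_boot_status() {
  # mokutil prints "SecureBoot enabled" or "SecureBoot disabled".
  mokutil --sb-state 2>/dev/null | grep -qi 'enabled'
}

get_metadata_attribute() {
  local key="$1" default="${2:-}"
  local base=http://metadata.google.internal/computeMetadata/v1
  # -f makes curl fail on non-200 responses instead of printing HTML.
  curl -fsS -H 'Metadata-Flavor: Google' "${base}/instance/attributes/${key}" 2>/dev/null \
    || curl -fsS -H 'Metadata-Flavor: Google' "${base}/project/attributes/${key}" 2>/dev/null \
    || echo "${default}"
}

detect_proxy_settings() {
  local var
  for var in http_proxy https_proxy no_proxy HTTP_PROXY HTTPS_PROXY NO_PROXY; do
    [[ -n "${!var:-}" ]] && echo "${var}=${!var}"
  done
  return 0
}
```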


Journal Entry: 2026-02-07 - Improving GPU Hardware Detection

Project: Dataproc Initialization Actions - Spark RAPIDS
Objective: Make GPU hardware detection more reliable than metadata alone.
Task: Add PCI device checks.

Actions:

  • Created has_nvidia_pci_device to check /sys/bus/pci/devices/*/uevent for NVIDIA Vendor ID (10de).
  • Updated is_gpu_instance to call has_nvidia_pci_device first, making metadata checks a fallback.
  • Added get_gpu_count and get_gpu_models functions, using nvidia-smi with lspci fallback.
  • Incorporated GPU count and model reporting into evaluate_gpu_setup summary.

Outcome: GPU hardware detection is now more robust by directly checking PCI devices.
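A sketch of the PCI-level check (NVIDIA's PCI vendor ID is 10de, which each device exposes in its uevent file); the lspci fallback is simplified:

```bash
has_nvidia_pci_device() {
  # Each PCI device's uevent contains a line like PCI_ID=10DE:2230 for NVIDIA.
  grep -qi 'PCI_ID=10DE:' /sys/bus/pci/devices/*/uevent 2>/dev/null
}

get_gpu_count() {
  if command -v nvidia-smi > /dev/null; then
    nvidia-smi -L 2>/dev/null | wc -l
  else
    # Fallback: count NVIDIA PCI functions (may include audio controllers).
    lspci -d 10de: 2>/dev/null | wc -l
  fi
}
```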


Journal Entry: 2026-02-08 - JSON Output Generation

Project: Dataproc Initialization Actions - Spark RAPIDS
Objective: Output the evaluation results as a structured JSON.
Task: Develop functions to build and format JSON output.

Actions:

  • Introduced add_json_part helper function to build JSON elements in a bash array.
  • Created finalize_json to assemble the parts and pipe the result to `jq .` for pretty printing.
  • Developed a Perl one-liner to compact the available_versions arrays onto a single line within the JSON output.
  • Iteratively refined the Perl script and its integration into finalize_json to handle escaping and newlines correctly.
  • Corrected add_json_part to avoid word splitting issues.

Outcome: evaluate_gpu_setup now produces a well-formatted JSON output with key order preserved and arrays compacted.
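A simplified sketch of the JSON assembly; the Perl compaction pass for the available_versions arrays is omitted, and the example keys and values are purely illustrative:

```bash
declare -a JSON_PARTS=()

add_json_part() {
  # Append one pre-formatted "key": value fragment; quoting is the caller's
  # responsibility, which avoids word-splitting surprises.
  JSON_PARTS+=("$1")
}

finalize_json() {
  local IFS=','
  # Join the fragments, wrap them in braces, and pretty-print with jq.
  printf '{%s}' "${JSON_PARTS[*]}" | jq .
}

# Illustrative usage:
add_json_part '"os_id": "ubuntu"'
add_json_part '"secure_boot": false'
finalize_json
```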


Journal Entry: 2026-02-08 - System Package Version Detection

Project: Dataproc Initialization Actions - Spark RAPIDS
Objective: Add available version detection for system packages (debs/rpms).
Task: Implement functions to query apt/dnf.

Actions:

  • Created get_apt_available_versions using apt-cache search and apt-cache policy.
  • Created get_dnf_available_versions using dnf repoquery.
  • Updated get_system_package_details to use these functions.
  • Refined package patterns in check_system_gpu_packages for cudatoolkit and cudnn.
  • Hardcoded system package details for rapids, tensorflow, and pytorch to "N/A" as they are not typically system-installed.
  • Corrected associative array declaration in check_system_gpu_packages.

Outcome: available_versions in the JSON output for system packages is now populated.
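Sketches of the availability queries (simplified; the real functions also use apt-cache search and richer package patterns):

```bash
get_apt_available_versions() {
  local pkg="$1"
  # "Candidate:" is the version apt would install from the configured repos.
  apt-cache policy "${pkg}" 2>/dev/null | awk '/Candidate:/ {print $2}'
}

get_dnf_available_versions() {
  local pkg="$1"
  dnf repoquery --qf '%{version}-%{release}' "${pkg}" 2>/dev/null | sort -u
}
```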


Journal Entry: 2026-02-08 - Script Splitting

Project: Dataproc Initialization Actions - Spark RAPIDS
Objective: Split the original spark-rapids.sh into smaller, manageable parts.
Task: Divide the script into logical files.

Actions:

  • Multiple attempts were made to split the file, initially with manual copying, then with sed.
  • Encountered issues with line ranges and unintended modifications.
  • Finally, used split -l to ensure a faithful, order-preserved split.
  • Re-ran the split command after the user corrected the source file content.
  • The user then requested a more logical split based on function groups.
  • Developed a 12-part plan for splitting, grouping related functions.
  • Executed the split using sed with corrected contiguous line ranges.

Outcome: The initialization-actions/spark-rapids/spark-rapids.sh script is now split into 12 parts in the initialization-actions/spark-rapids/spark-rapids.parts/ directory, preserving the original code and order.

@cjac
Contributor

cjac commented Feb 8, 2026

Plan for Integrating Environment Probe - File by File

Goal: Integrate the evaluate_gpu_setup scripts from tmp/spark-rapids.parts/ into the main spark-rapids.sh concatenated script in initialization-actions/spark-rapids/spark-rapids.parts/, minimizing line count by refactoring and removing redundancies.

Method: All files in the target directory are concatenated in alphanumeric order to produce initialization-actions/spark-rapids/spark-rapids.sh.
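Concretely, that concatenation step is just shell glob ordering (a sketch):

```bash
# Shell globs expand in sorted (alphanumeric) order, so the parts
# concatenate back into the full init action deterministically.
cat initialization-actions/spark-rapids/spark-rapids.parts/*.sh \
  > initialization-actions/spark-rapids/spark-rapids.sh
```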

Milestones:

  1. Milestone 1: Copy Evaluation Script Components: - COMPLETED

    • Copy contents of tmp/spark-rapids.parts/02-evaluate_gpu_setup-* files into appropriately named new files within initialization-actions/spark-rapids/spark-rapids.parts/.
    • Merge the contents of tmp/spark-rapids.parts/02-evaluate_gpu_setup-00-header.sh into initialization-actions/spark-rapids/spark-rapids.parts/00-header.sh.
  2. Milestone 2: Consolidate Package Management in Target: - COMPLETED

    • 2.1: Move execute_with_retries: Ensure the improved execute_with_retries function is in 01-os-utils.sh.
    • 2.2: Create 00a-package-management.sh: Define install_required_packages function.
    • 2.3: Update 99-main.sh: Call install_required_packages at the start of main.
    • 2.4: Remove Redundant Installations: Remove package installs from other .parts files.
  3. Milestone 3: Refactor Target .parts files to Use Evaluation Output:

    • 3.1: Add Evaluation Call and Define EVALUATION_OUTPUT_FILE:

      • In 00-header.sh, add readonly EVALUATION_OUTPUT_FILE="/tmp/gpu_evaluation.json".
      • In 99-main.sh, add a call to evaluate_gpu_setup immediately after install_required_packages.
      • Test: Run test_spark_rapids.py. Expected to pass as no existing logic is changed yet.
    • 3.2: Refactor OS and Secure Boot Checks in 99-main.sh:

      • Remove the check_os_and_secure_boot function definition.
      • Replace the call to check_os_and_secure_boot with jq queries on ${EVALUATION_OUTPUT_FILE} to perform the equivalent checks:
        • Read os.id, os.version.
        • Read secure_boot.
        • Read private_secret_name_set.
        • Read dataproc_image_version.
        • Replicate the conditional logic using these values.
      • Remove the SECURE_BOOT variable initialization from 04-apt-helpers.sh.
      • Remove OS_NAME and DATAPROC_IMAGE_VERSION initializations from 05-install-helpers.sh.
      • Test: Run test_spark_rapids.py.
    • 3.3: Refactor install_nvidia_gpu_driver to use JSON - Part 1 (OS, Versions):

      • Modify sections of 06-agent-helpers.sh, 07-configure.sh, 08-configure-gpu-mig.sh that form install_nvidia_gpu_driver.
      • Replace uses of is_debian, is_ubuntu, is_rocky, is_debian10, is_ubuntu18 with checks against $(jq -r .os.id "${EVALUATION_OUTPUT_FILE}") and $(jq -r .os.version "${EVALUATION_OUTPUT_FILE}").
      • Fetch CUDA_VERSION, NVIDIA_DRIVER_VERSION, CUDA_VERSION_MAJOR, and shortname by querying ${EVALUATION_OUTPUT_FILE} with jq inside the function, instead of relying on global variables.
        • NVIDIA_DRIVER_VERSION=$(get_metadata_attribute 'driver-version' "$(jq -r .nvidia_driver_version "${EVALUATION_OUTPUT_FILE}" || echo "550.54.15")") (and similar for CUDA_VERSION). This allows metadata override.
        • Reconstruct shortname based on the OS info from JSON.
      • Remove global initializations of these variables from 05-install-helpers.sh.
      • Test: Run test_spark_rapids.py.
    • 3.4: Refactor setup_gpu_yarn to use JSON:

      • In 09-install-nvidia.sh, modify setup_gpu_yarn.
      • Replace OS_NAME checks with jq queries on .os.id.
      • Replace INSTALL_GPU_AGENT global variable with $(jq -r .install_gpu_agent "${EVALUATION_OUTPUT_FILE}"). Remove global from 04-apt-helpers.sh.
      • Replace ROLE global variable with $(jq -r .dataproc_role "${EVALUATION_OUTPUT_FILE}"). Remove global from 05-install-helpers.sh.
      • The call to install_nvidia_gpu_driver now uses values fetched from JSON within that function.
      • Test: Run test_spark_rapids.py.
    • 3.5: Refactor main function's core logic:

      • In 99-main.sh, update the main else block.
      • Replace DATAPROC_IMAGE_VERSION check with jq query.
      • Replace RUNTIME check with $(jq -r .rapids_runtime "${EVALUATION_OUTPUT_FILE}"). Remove global from 05-install-helpers.sh.
      • Test: Run test_spark_rapids.py.
  4. Milestone 4: Remove Redundant Functions/Variables in Target:

    • 4.1: Remove OS check functions: Delete is_debian, is_ubuntu, is_rocky and their versioned variants (e.g., is_debian10) from 01-os-utils.sh.
    • 4.2: Remove remaining global variables: Thoroughly check all .parts files (01-09) and remove any remaining global variable definitions that are now dynamically determined via the JSON file (e.g., MASTER, HADOOP_CONF_DIR, etc. if they are not needed globally).
    • Test: Run test_spark_rapids.py.
  5. Milestone 5: Review and Clean: Final check for unused code and formatting.

    • Test: Run test_spark_rapids.py.

@cjac
Contributor

cjac commented Feb 10, 2026

/gcbrun

@cjac force-pushed the spark_rapids_for_ml_images branch from b56bb63 to 95a7510 on February 10, 2026 at 03:06
@cjac
Contributor

cjac commented Feb 10, 2026

/gcbrun

@cjac force-pushed the spark_rapids_for_ml_images branch from 95a7510 to df0af13 on February 10, 2026 at 03:10
@cjac
Contributor

cjac commented Feb 10, 2026

/gcbrun

@cjac force-pushed the spark_rapids_for_ml_images branch from df0af13 to d5f16e5 on February 10, 2026 at 03:17
This commit introduces a comprehensive set of functions to detect and evaluate the GPU environment on a Dataproc node. These functions are designed to replace the existing OS/version checks and hardcoded assumptions within the spark-rapids initialization action.

The new `evaluate_gpu_setup` function and its helpers will:
- Detect GPU hardware, NVIDIA drivers, and CUDA toolkit versions.
- Check for the presence of key system packages (dkms, headers, etc.).
- Inspect Conda environments for GPU-related packages (TensorFlow, PyTorch, XGBoost, etc.).
- Verify Spark RAPIDS JAR installations.
- Check YARN and Spark configurations related to GPU resources.
- Assess secure boot status and MOK key enrollment.

The output is a JSON file (`/tmp/gpu_evaluation.json`) summarizing the environment, which will be used by subsequent refactored parts of the init action to make informed decisions about driver installation, package management, and configuration.

This change is the first step in refactoring the spark-rapids init action to be more robust and less dependent on image-specific details.
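Tying the pieces together, the orchestrator described in this commit message might look roughly like the following — it reuses the helper sketches from the earlier comments, and everything beyond the evaluate_gpu_setup name and the /tmp/gpu_evaluation.json path is an assumption:

```bash
evaluate_gpu_setup() {
  # Run the individual detectors and emit one JSON summary for later
  # parts of the init action to consume.
  JSON_PARTS=()
  add_json_part "\"gpu_present\": $(has_nvidia_pci_device && echo true || echo false)"
  add_json_part "\"nvidia_driver\": $(detect_nvidia_driver && echo true || echo false)"
  add_json_part "\"cuda_version\": \"$(get_cuda_version)\""
  add_json_part "\"secure_boot\": $(detect_secure_boot_status && echo true || echo false)"
  finalize_json > /tmp/gpu_evaluation.json
}
```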
@cjac force-pushed the spark_rapids_for_ml_images branch from d5f16e5 to eb6f87d on February 10, 2026 at 03:32
@cjac
Contributor

cjac commented Feb 10, 2026

/gcbrun

