Http proxy 20260604#1395
Conversation
…and integration tests Introduces a new initialization action `http-proxy/http-proxy.sh` to configure global HTTP/HTTPS proxy settings on Dataproc cluster nodes. Features: - Configures global proxy environment variables (`http_proxy`, `https_proxy`, `no_proxy`) in `/etc/environment` and `/etc/profile.d/proxy.sh`. - Bypasses proxying for Google Cloud APIs and internal GCP domains (e.g., `metadata.google.internal`, `.googleapis.com`, local cluster hostnames) using a robust default `no_proxy` list. - Automatically appends custom bypass hosts from the `no-proxy` metadata. - Configures `gcloud` CLI proxy settings to align with the environment. - Installs the proxy's PEM CA certificate (if provided via `http-proxy-pem-uri`) to the OS, Java, and Conda trust stores. - Configures system package managers (`apt`/`dnf`) and `dirmngr` to fetch packages through the proxy. - Configures `boto.cfg` (used by `gsutil`) to use the proxy. - Ensures idempotency by skipping configuration if already present. Documentation: - Adds `http-proxy/README.md` detailing parameters, usage, and the critical compatibility requirement (`dataproc:dataproc.master.custom.init.actions.mode=RUN_BEFORE_SERVICES`). - Updates root `README.md` to link to the new action. Testing: - Adds integration test `http-proxy/test_http_proxy.py` to verify the action exits cleanly when no proxy metadata is provided. - Registers the test target in the root `BUILD` file. Build Infrastructure: - Adds `MODULE.bazel` with `rules_python` dependency to support Bazel Bzlmod builds. - Updates root `BUILD` file to load python rules explicitly. TAG=agy CONV=b274b565-1bd6-43f1-b4db-31f3d89d087b
…and integration tests
Introduces a new initialization action to configure global HTTP/HTTPS proxy
settings on Dataproc cluster nodes, along with comprehensive integration tests
and build infrastructure updates.
Features:
- Configures global proxy environment variables (http_proxy, https_proxy,
no_proxy) in /etc/environment and /etc/profile.d/proxy.sh
- Bypasses proxying for Google Cloud APIs and internal GCP domains (e.g.,
metadata.google.internal, .googleapis.com, local cluster hostnames) by default
- Automatically appends custom bypass hosts from the no-proxy metadata
- Configures gcloud CLI proxy settings to align with the environment
- Installs the proxy's PEM CA certificate (if provided via http-proxy-pem-uri)
to the OS, Java, and Conda trust stores
- Configures system package managers (apt/dnf) and dirmngr to fetch packages
through the proxy
- Configures boto.cfg (used by gsutil) to use the proxy, with robust section
deduplication and universe_domain variable resolution
- Validates proxy egress connectivity at boot via nc and curl
Documentation:
- Adds http-proxy/README.md which details parameters, usage, and the critical
compatibility requirement (dataproc:dataproc.master.custom.init.actions.mode=RUN_BEFORE_SERVICES)
- Updates the root README.md file to link to the new action
Testing and Build Infrastructure:
- Adds integration test http-proxy/test_http_proxy.py that verifies the clean
exit path (no proxy) and the proxy-enabled path
- The proxy-enabled test path automatically detects provisioned Secure Web
Proxy (SWP) resources in the project, harvests the CA certificate from
Private CA, stages it to GCS, and provisions the cluster in a custom VPC
network segment
- Registers the new test target :test_http_proxy in the root BUILD file
- Fixes python rules loading in integration_tests/BUILD and the root BUILD file
- Adds MODULE.bazel with rules_python and abseil-py declarations to support
modern Bazel Bzlmod builds
- Modifies integration_tests/dataproc_test_case.py to perform the following:
* Fix a class instantiation bug in setUpClass
* Add network and subnet arguments to createCluster
* Clean PYTHONPATH and PYTHONSAFEPATH in subprocess environments, preventing
Bazel's sandboxed python from breaking external CLI tools (gcloud, gsutil)
TAG=agy
CONV=b274b565-1bd6-43f1-b4db-31f3d89d087b
|
@saravind7 I would appreciate your assistance |
There was a problem hiding this comment.
Code Review
This pull request introduces a new initialization action to configure global HTTP and HTTPS proxy settings on Google Cloud Dataproc clusters, along with associated documentation and integration tests. The review feedback highlights several critical and high-severity issues in the shell script, including the use of an undefined function (version_ge), a bootstrapping order issue where gsutil is used to download a CA certificate before the proxy is configured in boto.cfg, and portability/idempotency concerns regarding the use of nc and Java/Conda truststore certificate imports. Additionally, improvements were suggested for robust whitespace parsing in proxy bypass lists and graceful handling of missing test configuration files.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| local gcloud_version | ||
| local -r min_gcloud_proxy_ver="547.0.0" | ||
| gcloud_version=$(gcloud version --format="value(google_cloud_sdk)" 2>/dev/null || echo "0.0.0") | ||
| if version_ge "${gcloud_version}" "${min_gcloud_proxy_ver}"; then |
There was a problem hiding this comment.
The function version_ge is called here but is not defined anywhere in the script. Since set -euo pipefail is enabled, this will cause the script to fail immediately with a 'command not found' error during execution.
Since version_le is defined, you can achieve the same comparison by reversing the arguments.
| if version_ge "${gcloud_version}" "${min_gcloud_proxy_ver}"; then | |
| if version_le "${min_gcloud_proxy_ver}" "${gcloud_version}"; then |
| if [[ -f "/etc/environment" ]]; then | ||
| JAVA_HOME="$(awk -F= '/^JAVA_HOME=/ {print $2}' /etc/environment)" | ||
| if [[ -n "${JAVA_HOME:-}" && -f "${JAVA_HOME}/bin/keytool" ]]; then | ||
| "${JAVA_HOME}/bin/keytool" -import -cacerts -storepass changeit -noprompt -alias swp_ca -file "${proxy_ca_pem}" | ||
| fi | ||
| fi |
There was a problem hiding this comment.
The current implementation only attempts to import the CA certificate into the Java truststore if JAVA_HOME is explicitly defined in /etc/environment. If it is not defined there (but Java is installed and keytool is in the PATH), the import is skipped, which will cause Java applications to fail SSL handshakes through the proxy.
Additionally, if JAVA_HOME in /etc/environment contains quotes, the file check -f "${JAVA_HOME}/bin/keytool" will fail. Finally, keytool will fail and exit with an error if the alias swp_ca already exists, making the script non-idempotent.
We should make this robust by stripping quotes, falling back to keytool in PATH, and deleting the existing alias before importing.
| if [[ -f "/etc/environment" ]]; then | |
| JAVA_HOME="$(awk -F= '/^JAVA_HOME=/ {print $2}' /etc/environment)" | |
| if [[ -n "${JAVA_HOME:-}" && -f "${JAVA_HOME}/bin/keytool" ]]; then | |
| "${JAVA_HOME}/bin/keytool" -import -cacerts -storepass changeit -noprompt -alias swp_ca -file "${proxy_ca_pem}" | |
| fi | |
| fi | |
| # Add to Java trust store | |
| if [[ -f "/etc/environment" ]]; then | |
| JAVA_HOME="$(awk -F= '/^JAVA_HOME=/ {print $2}' /etc/environment | tr -d '"')" | |
| fi | |
| local keytool_bin="" | |
| if [[ -n "${JAVA_HOME:-}" && -f "${JAVA_HOME}/bin/keytool" ]]; then | |
| keytool_bin="${JAVA_HOME}/bin/keytool" | |
| elif command -v keytool &> /dev/null; then | |
| keytool_bin="keytool" | |
| fi | |
| if [[ -n "${keytool_bin}" ]]; then | |
| # Delete existing alias to ensure idempotency and avoid failure if already imported | |
| "${keytool_bin}" -delete -cacerts -storepass changeit -alias swp_ca 2>/dev/null || true | |
| "${keytool_bin}" -import -cacerts -storepass changeit -noprompt -alias swp_ca -file "${proxy_ca_pem}" || echo "WARNING: Failed to import CA cert into Java truststore" | |
| fi |
| echo "DEBUG: set_proxy: Testing TCP connection to proxy ${proxy_host}:${proxy_port}..." | ||
| if ! nc -zv -w 5 "${proxy_host}" "${proxy_port}"; then | ||
| echo "ERROR: Failed to establish TCP connection to proxy ${proxy_host}:${proxy_port}." | ||
| exit 1 | ||
| fi |
There was a problem hiding this comment.
nc (netcat) is not guaranteed to be installed on all Linux distributions (especially minimal Rocky Linux installations). If nc is missing, the script will fail.
You can use Bash's built-in /dev/tcp device file to test TCP connectivity. This is completely portable, has zero external dependencies, and works on all supported Dataproc OS images.
| echo "DEBUG: set_proxy: Testing TCP connection to proxy ${proxy_host}:${proxy_port}..." | |
| if ! nc -zv -w 5 "${proxy_host}" "${proxy_port}"; then | |
| echo "ERROR: Failed to establish TCP connection to proxy ${proxy_host}:${proxy_port}." | |
| exit 1 | |
| fi | |
| echo "DEBUG: set_proxy: Testing TCP connection to proxy ${proxy_host}:${proxy_port}..." | |
| if ! timeout 5 bash -c "cat < /dev/null > /dev/tcp/${proxy_host}/${proxy_port}" 2>/dev/null; then | |
| echo "ERROR: Failed to establish TCP connection to proxy ${proxy_host}:${proxy_port}." | |
| exit 1 | |
| fi |
| set_proxy | ||
| repair_boto |
There was a problem hiding this comment.
set_proxy downloads the CA certificate using gsutil cp (lines 221 and 228).
If the cluster network requires all egress traffic (including GCS) to go through the proxy, gsutil will fail because repair_boto (which configures the proxy in /etc/boto.cfg) is called after set_proxy.
Calling repair_boto before set_proxy resolves this bootstrapping issue, ensuring gsutil is fully configured with the proxy before it is used to download the certificate.
| set_proxy | |
| repair_boto | |
| repair_boto | |
| set_proxy |
| local conda_cert_file="/opt/conda/default/ssl/cacert.pem" | ||
| if [[ -f "${conda_cert_file}" ]]; then | ||
| local ca_subject=$(openssl crl2pkcs7 -nocrl -certfile "${proxy_ca_pem}" | openssl pkcs7 -print_certs -noout | grep ^subject) | ||
| openssl crl2pkcs7 -nocrl -certfile "${conda_cert_file}" | openssl pkcs7 -print_certs -noout | grep -Fxq "${ca_subject}" || { | ||
| cat "${proxy_ca_pem}" >> "${conda_cert_file}" | ||
| } | ||
| fi |
There was a problem hiding this comment.
Using openssl crl2pkcs7 and parsing the entire Conda certificate bundle to check if the CA is already appended is slow and prone to formatting/parsing issues.
A much simpler, faster, and more robust way to ensure idempotency is to append a unique comment (e.g., # Dataproc SWP CA) and check for its presence using grep.
| local conda_cert_file="/opt/conda/default/ssl/cacert.pem" | |
| if [[ -f "${conda_cert_file}" ]]; then | |
| local ca_subject=$(openssl crl2pkcs7 -nocrl -certfile "${proxy_ca_pem}" | openssl pkcs7 -print_certs -noout | grep ^subject) | |
| openssl crl2pkcs7 -nocrl -certfile "${conda_cert_file}" | openssl pkcs7 -print_certs -noout | grep -Fxq "${ca_subject}" || { | |
| cat "${proxy_ca_pem}" >> "${conda_cert_file}" | |
| } | |
| fi | |
| local conda_cert_file="/opt/conda/default/ssl/cacert.pem" | |
| if [[ -f "${conda_cert_file}" ]]; then | |
| if ! grep -q "# Dataproc SWP CA" "${conda_cert_file}"; then | |
| echo -e "\n# Dataproc SWP CA" >> "${conda_cert_file}" | |
| cat "${proxy_ca_pem}" >> "${conda_cert_file}" | |
| fi | |
| fi |
| env_json_path = os.path.abspath(os.path.join(real_test_dir, "../env.json")) | ||
|
|
||
| with open(env_json_path, "r") as f: | ||
| env_data = json.load(f) |
There was a problem hiding this comment.
If env.json is missing (e.g., when running tests locally or in a clean environment), the test will fail with a raw FileNotFoundError.
It is much cleaner to check if the file exists and skip the test with a descriptive message if it is missing.
| env_json_path = os.path.abspath(os.path.join(real_test_dir, "../env.json")) | |
| with open(env_json_path, "r") as f: | |
| env_data = json.load(f) | |
| env_json_path = os.path.abspath(os.path.join(real_test_dir, "../env.json")) | |
| if not os.path.exists(env_json_path): | |
| self.skipTest(f"env.json not found at {env_json_path}. Skipping enabled proxy test.") | |
| with open(env_json_path, "r") as f: | |
| env_data = json.load(f) |
| if [[ -n "${user_no_proxy}" ]]; then | ||
| IFS=',' read -r -a user_no_proxy_list <<< "${user_no_proxy// /,}" | ||
| fi |
There was a problem hiding this comment.
If user_no_proxy contains spaces after commas (e.g., "host1, host2"), the replacement ${user_no_proxy// /,} will result in double commas ("host1,,host2"), which creates empty elements in the parsed array.
Using ${user_no_proxy//[[:space:]]/} to strip all whitespace before splitting is much more robust and avoids empty elements.
| if [[ -n "${user_no_proxy}" ]]; then | |
| IFS=',' read -r -a user_no_proxy_list <<< "${user_no_proxy// /,}" | |
| fi | |
| if [[ -n "${user_no_proxy}" ]]; then | |
| IFS=',' read -r -a user_no_proxy_list <<< "${user_no_proxy//[[:space:]]/}" | |
| fi |
Description
This PR introduces a new initialization action for configuring global HTTP and HTTPS proxy settings on Dataproc cluster nodes. It also adds a comprehensive integration test suite, updates the build infrastructure to support Bazel Bzlmod, and fixes several critical bugs in the testing framework.
Why are these changes needed?
Clusters running in private network environments often require all egress traffic to be routed through a Secure Web Proxy (SWP) or gateway. This initialization action provides a robust, production-ready solution to configure system-wide proxy settings, install custom CA certificates, and ensure all critical cluster services (including HDFS, YARN, and the gcloud/gsutil CLIs) can successfully communicate through the proxy during and after boot.
Key Changes
1. HTTP Proxy Initialization Action
http-proxy/http-proxy.sh/etc/environmentand/etc/profile.d/proxy.shno_proxy) for local hostnames, metadata server, and Google APIs by default, with support for custom user-defined bypasses viano-proxymetadataapton Debian/Ubuntu,dnfon Rocky Linux) anddirmngrto route through the proxy/etc/boto.cfgto ensuregsutilworks seamlesslyhttp-proxy-pem-uri) to system, Java, and Conda trust stores2. Integration Tests
http-proxy/test_http_proxy.py3. Build & Test Infrastructure
MODULE.bazelto declarerules_pythonandabseil-pydependencies, fixing compatibility with modern Bazel versions:test_http_proxyin the rootBUILDfileintegration_tests/dataproc_test_case.pyto:setUpClassnetworkandsubnetparameters increateClusterPYTHONPATHandPYTHONSAFEPATHin subprocess environments, preventing Bazel's sandboxed python from breaking external CLI tools (gcloud,gsutil)How was this tested?