Skip to content

Http proxy 20260604#1395

Draft
cjac wants to merge 4 commits into
GoogleCloudDataproc:mainfrom
LLC-Technologies-Collier:http-proxy-20260604
Draft

Http proxy 20260604#1395
cjac wants to merge 4 commits into
GoogleCloudDataproc:mainfrom
LLC-Technologies-Collier:http-proxy-20260604

Conversation

@cjac
Copy link
Copy Markdown
Contributor

@cjac cjac commented Jun 5, 2026

Description

This PR introduces a new initialization action for configuring global HTTP and HTTPS proxy settings on Dataproc cluster nodes. It also adds a comprehensive integration test suite, updates the build infrastructure to support Bazel Bzlmod, and fixes several critical bugs in the testing framework.

Why are these changes needed?

Clusters running in private network environments often require all egress traffic to be routed through a Secure Web Proxy (SWP) or gateway. This initialization action provides a robust, production-ready solution to configure system-wide proxy settings, install custom CA certificates, and ensure all critical cluster services (including HDFS, YARN, and the gcloud/gsutil CLIs) can successfully communicate through the proxy during and after boot.

Key Changes

1. HTTP Proxy Initialization Action

  • Script: http-proxy/http-proxy.sh
  • Configures system-wide proxy environment variables in /etc/environment and /etc/profile.d/proxy.sh
  • Sets up proxy bypasses (no_proxy) for local hostnames, metadata server, and Google APIs by default, with support for custom user-defined bypasses via no-proxy metadata
  • Configures system package managers (apt on Debian/Ubuntu, dnf on Rocky Linux) and dirmngr to route through the proxy
  • Deduplicates and configures /etc/boto.cfg to ensure gsutil works seamlessly
  • Installs custom PEM CA certificates (via http-proxy-pem-uri) to system, Java, and Conda trust stores
  • Validates proxy connectivity at boot time

2. Integration Tests

  • Test Suite: http-proxy/test_http_proxy.py
  • Skip Path: Verifies that the script exits cleanly when no proxy metadata is provided
  • Enabled Path: Automatically detects if SWP resources (gateway, CA pool, certificate) are provisioned in the GCP project. If available, it harvests the root CA certificate from Private CA, stages it to GCS, provisions a cluster in a custom VPC network segment, and verifies outbound connectivity through the proxy

3. Build & Test Infrastructure

  • Bzlmod Support: Adds MODULE.bazel to declare rules_python and abseil-py dependencies, fixing compatibility with modern Bazel versions
  • Target Registration: Registers :test_http_proxy in the root BUILD file
  • Testing Framework Fixes: Modifies integration_tests/dataproc_test_case.py to:
    • Fix a class instantiation bug in setUpClass
    • Add support for custom network and subnet parameters in createCluster
    • Clean PYTHONPATH and PYTHONSAFEPATH in subprocess environments, preventing Bazel's sandboxed python from breaking external CLI tools (gcloud , gsutil )

How was this tested?

  • Verified locally using Bazel:
    bazel test :test_http_proxy --test_arg="--image_version=2.2-debian12" --test_output=all
  • Tested both the skip path and the proxy-enabled path (utilizing pre-provisioned SWP resources in the test project).

cjac added 2 commits June 4, 2026 21:14
…and integration tests

Introduces a new initialization action `http-proxy/http-proxy.sh` to configure
global HTTP/HTTPS proxy settings on Dataproc cluster nodes.

Features:
- Configures global proxy environment variables (`http_proxy`, `https_proxy`,
  `no_proxy`) in `/etc/environment` and `/etc/profile.d/proxy.sh`.
- Bypasses proxying for Google Cloud APIs and internal GCP domains (e.g.,
  `metadata.google.internal`, `.googleapis.com`, local cluster hostnames)
  using a robust default `no_proxy` list.
- Automatically appends custom bypass hosts from the `no-proxy` metadata.
- Configures `gcloud` CLI proxy settings to align with the environment.
- Installs the proxy's PEM CA certificate (if provided via `http-proxy-pem-uri`)
  to the OS, Java, and Conda trust stores.
- Configures system package managers (`apt`/`dnf`) and `dirmngr` to fetch
  packages through the proxy.
- Configures `boto.cfg` (used by `gsutil`) to use the proxy.
- Ensures idempotency by skipping configuration if already present.

Documentation:
- Adds `http-proxy/README.md` detailing parameters, usage, and the critical
  compatibility requirement (`dataproc:dataproc.master.custom.init.actions.mode=RUN_BEFORE_SERVICES`).
- Updates root `README.md` to link to the new action.

Testing:
- Adds integration test `http-proxy/test_http_proxy.py` to verify the action
  exits cleanly when no proxy metadata is provided.
- Registers the test target in the root `BUILD` file.

Build Infrastructure:
- Adds `MODULE.bazel` with `rules_python` dependency to support Bazel Bzlmod builds.
- Updates root `BUILD` file to load python rules explicitly.

TAG=agy
CONV=b274b565-1bd6-43f1-b4db-31f3d89d087b
…and integration tests

Introduces a new initialization action to configure global HTTP/HTTPS proxy
settings on Dataproc cluster nodes, along with comprehensive integration tests
and build infrastructure updates.

Features:
- Configures global proxy environment variables (http_proxy, https_proxy,
  no_proxy) in /etc/environment and /etc/profile.d/proxy.sh
- Bypasses proxying for Google Cloud APIs and internal GCP domains (e.g.,
  metadata.google.internal, .googleapis.com, local cluster hostnames) by default
- Automatically appends custom bypass hosts from the no-proxy metadata
- Configures gcloud CLI proxy settings to align with the environment
- Installs the proxy's PEM CA certificate (if provided via http-proxy-pem-uri)
  to the OS, Java, and Conda trust stores
- Configures system package managers (apt/dnf) and dirmngr to fetch packages
  through the proxy
- Configures boto.cfg (used by gsutil) to use the proxy, with robust section
  deduplication and universe_domain variable resolution
- Validates proxy egress connectivity at boot via nc and curl

Documentation:
- Adds http-proxy/README.md which details parameters, usage, and the critical
  compatibility requirement (dataproc:dataproc.master.custom.init.actions.mode=RUN_BEFORE_SERVICES)
- Updates the root README.md file to link to the new action

Testing and Build Infrastructure:
- Adds integration test http-proxy/test_http_proxy.py that verifies the clean
  exit path (no proxy) and the proxy-enabled path
- The proxy-enabled test path automatically detects provisioned Secure Web
  Proxy (SWP) resources in the project, harvests the CA certificate from
  Private CA, stages it to GCS, and provisions the cluster in a custom VPC
  network segment
- Registers the new test target :test_http_proxy in the root BUILD file
- Fixes python rules loading in integration_tests/BUILD and the root BUILD file
- Adds MODULE.bazel with rules_python and abseil-py declarations to support
  modern Bazel Bzlmod builds
- Modifies integration_tests/dataproc_test_case.py to perform the following:
  * Fix a class instantiation bug in setUpClass
  * Add network and subnet arguments to createCluster
  * Clean PYTHONPATH and PYTHONSAFEPATH in subprocess environments, preventing
    Bazel's sandboxed python from breaking external CLI tools (gcloud, gsutil)

TAG=agy
CONV=b274b565-1bd6-43f1-b4db-31f3d89d087b
@cjac cjac self-assigned this Jun 5, 2026
@cjac
Copy link
Copy Markdown
Contributor Author

cjac commented Jun 5, 2026

@saravind7 I would appreciate your assistance

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new initialization action to configure global HTTP and HTTPS proxy settings on Google Cloud Dataproc clusters, along with associated documentation and integration tests. The review feedback highlights several critical and high-severity issues in the shell script, including the use of an undefined function (version_ge), a bootstrapping order issue where gsutil is used to download a CA certificate before the proxy is configured in boto.cfg, and portability/idempotency concerns regarding the use of nc and Java/Conda truststore certificate imports. Additionally, improvements were suggested for robust whitespace parsing in proxy bypass lists and graceful handling of missing test configuration files.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread http-proxy/http-proxy.sh
local gcloud_version
local -r min_gcloud_proxy_ver="547.0.0"
gcloud_version=$(gcloud version --format="value(google_cloud_sdk)" 2>/dev/null || echo "0.0.0")
if version_ge "${gcloud_version}" "${min_gcloud_proxy_ver}"; then
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The function version_ge is called here but is not defined anywhere in the script. Since set -euo pipefail is enabled, this will cause the script to fail immediately with a 'command not found' error during execution.

Since version_le is defined, you can achieve the same comparison by reversing the arguments.

Suggested change
if version_ge "${gcloud_version}" "${min_gcloud_proxy_ver}"; then
if version_le "${min_gcloud_proxy_ver}" "${gcloud_version}"; then

Comment thread http-proxy/http-proxy.sh
Comment on lines +236 to +241
if [[ -f "/etc/environment" ]]; then
JAVA_HOME="$(awk -F= '/^JAVA_HOME=/ {print $2}' /etc/environment)"
if [[ -n "${JAVA_HOME:-}" && -f "${JAVA_HOME}/bin/keytool" ]]; then
"${JAVA_HOME}/bin/keytool" -import -cacerts -storepass changeit -noprompt -alias swp_ca -file "${proxy_ca_pem}"
fi
fi
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current implementation only attempts to import the CA certificate into the Java truststore if JAVA_HOME is explicitly defined in /etc/environment. If it is not defined there (but Java is installed and keytool is in the PATH), the import is skipped, which will cause Java applications to fail SSL handshakes through the proxy.

Additionally, if JAVA_HOME in /etc/environment contains quotes, the file check -f "${JAVA_HOME}/bin/keytool" will fail. Finally, keytool will fail and exit with an error if the alias swp_ca already exists, making the script non-idempotent.

We should make this robust by stripping quotes, falling back to keytool in PATH, and deleting the existing alias before importing.

Suggested change
if [[ -f "/etc/environment" ]]; then
JAVA_HOME="$(awk -F= '/^JAVA_HOME=/ {print $2}' /etc/environment)"
if [[ -n "${JAVA_HOME:-}" && -f "${JAVA_HOME}/bin/keytool" ]]; then
"${JAVA_HOME}/bin/keytool" -import -cacerts -storepass changeit -noprompt -alias swp_ca -file "${proxy_ca_pem}"
fi
fi
# Add to Java trust store
if [[ -f "/etc/environment" ]]; then
JAVA_HOME="$(awk -F= '/^JAVA_HOME=/ {print $2}' /etc/environment | tr -d '"')"
fi
local keytool_bin=""
if [[ -n "${JAVA_HOME:-}" && -f "${JAVA_HOME}/bin/keytool" ]]; then
keytool_bin="${JAVA_HOME}/bin/keytool"
elif command -v keytool &> /dev/null; then
keytool_bin="keytool"
fi
if [[ -n "${keytool_bin}" ]]; then
# Delete existing alias to ensure idempotency and avoid failure if already imported
"${keytool_bin}" -delete -cacerts -storepass changeit -alias swp_ca 2>/dev/null || true
"${keytool_bin}" -import -cacerts -storepass changeit -noprompt -alias swp_ca -file "${proxy_ca_pem}" || echo "WARNING: Failed to import CA cert into Java truststore"
fi

Comment thread http-proxy/http-proxy.sh
Comment on lines +258 to +262
echo "DEBUG: set_proxy: Testing TCP connection to proxy ${proxy_host}:${proxy_port}..."
if ! nc -zv -w 5 "${proxy_host}" "${proxy_port}"; then
echo "ERROR: Failed to establish TCP connection to proxy ${proxy_host}:${proxy_port}."
exit 1
fi
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

nc (netcat) is not guaranteed to be installed on all Linux distributions (especially minimal Rocky Linux installations). If nc is missing, the script will fail.

You can use Bash's built-in /dev/tcp device file to test TCP connectivity. This is completely portable, has zero external dependencies, and works on all supported Dataproc OS images.

Suggested change
echo "DEBUG: set_proxy: Testing TCP connection to proxy ${proxy_host}:${proxy_port}..."
if ! nc -zv -w 5 "${proxy_host}" "${proxy_port}"; then
echo "ERROR: Failed to establish TCP connection to proxy ${proxy_host}:${proxy_port}."
exit 1
fi
echo "DEBUG: set_proxy: Testing TCP connection to proxy ${proxy_host}:${proxy_port}..."
if ! timeout 5 bash -c "cat < /dev/null > /dev/tcp/${proxy_host}/${proxy_port}" 2>/dev/null; then
echo "ERROR: Failed to establish TCP connection to proxy ${proxy_host}:${proxy_port}."
exit 1
fi

Comment thread http-proxy/http-proxy.sh
Comment on lines +363 to +364
set_proxy
repair_boto
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

set_proxy downloads the CA certificate using gsutil cp (lines 221 and 228).

If the cluster network requires all egress traffic (including GCS) to go through the proxy, gsutil will fail because repair_boto (which configures the proxy in /etc/boto.cfg) is called after set_proxy.

Calling repair_boto before set_proxy resolves this bootstrapping issue, ensuring gsutil is fully configured with the proxy before it is used to download the certificate.

Suggested change
set_proxy
repair_boto
repair_boto
set_proxy

Comment thread http-proxy/http-proxy.sh
Comment on lines +243 to +249
local conda_cert_file="/opt/conda/default/ssl/cacert.pem"
if [[ -f "${conda_cert_file}" ]]; then
local ca_subject=$(openssl crl2pkcs7 -nocrl -certfile "${proxy_ca_pem}" | openssl pkcs7 -print_certs -noout | grep ^subject)
openssl crl2pkcs7 -nocrl -certfile "${conda_cert_file}" | openssl pkcs7 -print_certs -noout | grep -Fxq "${ca_subject}" || {
cat "${proxy_ca_pem}" >> "${conda_cert_file}"
}
fi
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using openssl crl2pkcs7 and parsing the entire Conda certificate bundle to check if the CA is already appended is slow and prone to formatting/parsing issues.

A much simpler, faster, and more robust way to ensure idempotency is to append a unique comment (e.g., # Dataproc SWP CA) and check for its presence using grep.

Suggested change
local conda_cert_file="/opt/conda/default/ssl/cacert.pem"
if [[ -f "${conda_cert_file}" ]]; then
local ca_subject=$(openssl crl2pkcs7 -nocrl -certfile "${proxy_ca_pem}" | openssl pkcs7 -print_certs -noout | grep ^subject)
openssl crl2pkcs7 -nocrl -certfile "${conda_cert_file}" | openssl pkcs7 -print_certs -noout | grep -Fxq "${ca_subject}" || {
cat "${proxy_ca_pem}" >> "${conda_cert_file}"
}
fi
local conda_cert_file="/opt/conda/default/ssl/cacert.pem"
if [[ -f "${conda_cert_file}" ]]; then
if ! grep -q "# Dataproc SWP CA" "${conda_cert_file}"; then
echo -e "\n# Dataproc SWP CA" >> "${conda_cert_file}"
cat "${proxy_ca_pem}" >> "${conda_cert_file}"
fi
fi

Comment thread http-proxy/test_http_proxy.py Outdated
Comment on lines +115 to +118
env_json_path = os.path.abspath(os.path.join(real_test_dir, "../env.json"))

with open(env_json_path, "r") as f:
env_data = json.load(f)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If env.json is missing (e.g., when running tests locally or in a clean environment), the test will fail with a raw FileNotFoundError.

It is much cleaner to check if the file exists and skip the test with a descriptive message if it is missing.

Suggested change
env_json_path = os.path.abspath(os.path.join(real_test_dir, "../env.json"))
with open(env_json_path, "r") as f:
env_data = json.load(f)
env_json_path = os.path.abspath(os.path.join(real_test_dir, "../env.json"))
if not os.path.exists(env_json_path):
self.skipTest(f"env.json not found at {env_json_path}. Skipping enabled proxy test.")
with open(env_json_path, "r") as f:
env_data = json.load(f)

Comment thread http-proxy/http-proxy.sh
Comment on lines +146 to +148
if [[ -n "${user_no_proxy}" ]]; then
IFS=',' read -r -a user_no_proxy_list <<< "${user_no_proxy// /,}"
fi
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If user_no_proxy contains spaces after commas (e.g., "host1, host2"), the replacement ${user_no_proxy// /,} will result in double commas ("host1,,host2"), which creates empty elements in the parsed array.

Using ${user_no_proxy//[[:space:]]/} to strip all whitespace before splitting is much more robust and avoids empty elements.

Suggested change
if [[ -n "${user_no_proxy}" ]]; then
IFS=',' read -r -a user_no_proxy_list <<< "${user_no_proxy// /,}"
fi
if [[ -n "${user_no_proxy}" ]]; then
IFS=',' read -r -a user_no_proxy_list <<< "${user_no_proxy//[[:space:]]/}"
fi

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant