15 changes: 15 additions & 0 deletions BUILD
@@ -1,3 +1,5 @@
load("@rules_python//python:defs.bzl", "py_test", "py_library")

package(default_visibility = ["//visibility:public"])

test_suite(
@@ -6,6 +8,7 @@ test_suite(
        ":test_cloud_sql_proxy",
        ":test_dr_elephant",
        ":test_hive_hcatalog",
        ":test_http_proxy",
        ":test_spark_rapids",
        ":test_starburst_presto",
        "//alluxio:test_alluxio",
@@ -151,3 +154,15 @@ py_test(
        "@io_abseil_py//absl/testing:parameterized",
    ],
)

py_test(
    name = "test_http_proxy",
    size = "enormous",
    srcs = ["http-proxy/test_http_proxy.py"],
    data = ["http-proxy/http-proxy.sh"],
    local = True,
    deps = [
        "//integration_tests:dataproc_test_case",
        "@io_abseil_py//absl/testing:parameterized",
    ],
)
2 changes: 2 additions & 0 deletions MODULE.bazel
@@ -0,0 +1,2 @@
bazel_dep(name = "rules_python", version = "1.7.0")
bazel_dep(name = "abseil-py", version = "2.1.0", repo_name = "io_abseil_py")
1 change: 1 addition & 0 deletions README.md
@@ -66,6 +66,7 @@ This repository currently offers the following actions for use with Dataproc clu
* Configure the environment
* Configure a *nice* shell environment
* To switch to Python 3, use the conda initialization action
* [HTTP Proxy](http-proxy/README.md)
* Connect to Google Cloud Platform services
* Install alternate versions of the [Cloud Storage and BigQuery connectors](https://github.com/GoogleCloudPlatform/bigdata-interop/releases). [Specific versions](https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-versions) of these connectors come pre-installed on Cloud Dataproc clusters.
* Share a [Cloud SQL](https://cloud.google.com/sql/) Hive Metastore, or simply read/write data from Cloud SQL.
57 changes: 57 additions & 0 deletions http-proxy/README.md
@@ -0,0 +1,57 @@
# HTTP Proxy Configuration

This initialization action configures global HTTP and HTTPS proxy settings on every node in a [Google Cloud Dataproc](https://cloud.google.com/dataproc) cluster.

It is intended for clusters in private networks whose outbound traffic must pass through a secure web proxy or gateway.

## Features

- Configures global proxy environment variables (`http_proxy`, `https_proxy`, `no_proxy` and their uppercase variants) in `/etc/environment`.
- Persists proxy settings for all shell sessions via `/etc/profile.d/proxy.sh`.
- Bypasses the proxy for Google Cloud APIs and internal GCP domains (e.g. `metadata.google.internal`, `169.254.169.254`, `.googleapis.com`, local cluster hostnames) through a default `no_proxy` list.
- Appends any custom bypass hosts given in the `no-proxy` metadata key to that default list.
- Configures the `gcloud` CLI proxy settings to align with the environment.
- Installs the proxy's PEM CA certificate (if provided) to the OS, Java, and Conda trust stores.
- Configures system package managers (`apt`/`dnf`) and `dirmngr` to fetch packages through the proxy.
- Configures `boto.cfg` (used by `gsutil`) to use the proxy.
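
The way the default `no_proxy` list and the custom `no-proxy` hosts combine can be sketched as below. This is a minimal illustration, not the logic in `http-proxy.sh`: the function name `build_no_proxy` is hypothetical, and the defaults are limited to the three entries named in the list above, so the script's real list is likely longer.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not taken from http-proxy.sh): joins a default bypass
# list with an optional comma-separated list of custom hosts.
build_no_proxy() {
  # Assumption: only the entries named in this README; the real defaults may differ.
  defaults="metadata.google.internal,169.254.169.254,.googleapis.com"
  # ${1:+,$1} expands to ",<custom hosts>" only when a non-empty argument is given.
  printf '%s%s\n' "${defaults}" "${1:+,$1}"
}

# Example: append one custom host, as the `no-proxy` metadata key would.
build_no_proxy "my-onprem-service.corp.internal"
```

Calling the function with no argument prints the defaults unchanged, so an unset `no-proxy` key does not leave a trailing comma in the result.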

## Parameters

You configure the proxy settings using VM metadata:

| Metadata Key | Description |
|---|---|
| `http-proxy` | (Optional) The HTTP proxy host and port (e.g. `10.0.0.1:8080` or `vzproxy.verizon.com:9290`). |
| `https-proxy` | (Optional) The HTTPS proxy host and port. |
| `proxy-uri` | (Optional) A single proxy host and port to use when the HTTP and HTTPS proxies are the same. Used as the fallback when `http-proxy` or `https-proxy` is not set. |
| `no-proxy` | (Optional) A comma-separated list of additional hosts/domains that should bypass the proxy. |
| `http-proxy-pem-uri` | (Optional) A Cloud Storage URI (e.g. `gs://my-bucket/proxy_ca.crt`) containing the PEM-encoded CA certificate for the proxy. Required if the proxy inspects SSL traffic. |
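
The fallback from `http-proxy` and `https-proxy` to `proxy-uri` described in the table can be sketched as follows. This is an illustration under assumptions rather than the script's actual code: `first_non_empty` is a hypothetical helper, and the three `*_meta` variables hold made-up values standing in for the metadata keys.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints the first non-empty argument, or fails if none.
first_non_empty() {
  for value in "$@"; do
    if [ -n "${value}" ]; then
      printf '%s\n' "${value}"
      return 0
    fi
  done
  return 1
}

# Assumed example values standing in for the VM metadata keys.
http_proxy_meta=""                # `http-proxy` not set
https_proxy_meta="10.0.0.2:3129"  # `https-proxy` set explicitly
proxy_uri_meta="10.0.0.1:8080"    # `proxy-uri` fallback

# Each protocol prefers its own key and falls back to `proxy-uri`.
effective_http="$(first_non_empty "${http_proxy_meta}" "${proxy_uri_meta}")"
effective_https="$(first_non_empty "${https_proxy_meta}" "${proxy_uri_meta}")"
echo "http=${effective_http} https=${effective_https}"
```

With these values the HTTP proxy resolves to the `proxy-uri` value while the HTTPS proxy keeps its explicit setting.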

## Usage

### ⚠️ CRITICAL COMPATIBILITY REQUIREMENT ⚠️

For Dataproc internal components (like HDFS NameNode) to successfully initialize and access the Google Cloud metadata server during boot, **this initialization action must run before system services start.**

You **must** set the following cluster property:
`dataproc:dataproc.master.custom.init.actions.mode=RUN_BEFORE_SERVICES`

### Example

Use the `gcloud` command to create a new cluster with this initialization action:

```bash
PROJECT_ID="my-project-id"
REGION="us-east4"
CLUSTER_NAME="my-proxy-cluster"
PROXY_HOST_PORT="vzproxy.verizon.com:9290"
CA_CERT_URI="gs://my-secure-bucket/proxy_ca.crt"

gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/http-proxy/http-proxy.sh \
    --properties "dataproc:dataproc.master.custom.init.actions.mode=RUN_BEFORE_SERVICES" \
    --metadata "proxy-uri=${PROXY_HOST_PORT}" \
    --metadata "http-proxy-pem-uri=${CA_CERT_URI}" \
    --metadata "no-proxy=my-onprem-service.corp.internal"
```