15 changes: 4 additions & 11 deletions .github/workflows/docs.yaml
@@ -40,17 +40,11 @@ jobs:
ref: asf-site
path: asf-site

- name: Setup Python
uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0
with:
python-version: "3.12"
- name: Setup uv
uses: astral-sh/setup-uv@f0ec1fc3b38f5e7cd731bb6ce540c5af426746bb # v6.1.0
Review comment (Member): Same as below, do we need to specify a specific SHA here?


- name: Install dependencies
run: |
set -x
python3 -m venv venv
source venv/bin/activate
pip install -r docs/requirements.txt
run: uv sync --project docs
- name: Install dependency graph tooling
run: |
set -x
@@ -61,9 +55,8 @@ jobs:
- name: Build docs
run: |
set -x
source venv/bin/activate
cd docs
./build.sh
uv run --project ../docs ./build.sh

- name: Copy & push the generated HTML
run: |
15 changes: 4 additions & 11 deletions .github/workflows/docs_pr.yaml
@@ -44,16 +44,10 @@ jobs:
with:
submodules: true
fetch-depth: 1
- name: Setup Python
uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0
with:
python-version: "3.12"
- name: Setup uv
uses: astral-sh/setup-uv@f0ec1fc3b38f5e7cd731bb6ce540c5af426746bb # v6.1.0
Review comment (Member): Is the specific commit a requirement? I think astral-sh/setup-uv@v6 is pretty stable.

Reply (Contributor Author): No, but it is generally encouraged to pin to commits in GitHub Actions: a commit provides an immutable reference to "safe" code, while a tag is mutable. If a malicious actor gains control of an action repository, they can upload a new v6 and infect everyone; if everyone is pinned to a commit, they can't force the malicious code into everyone's CI unless you opt in by updating the hash. TL;DR: because there are no lockfiles for CI, and because CI is a critical vector for supply-chain attacks, it's best to pin to a hash.

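To audit this, one can scan workflow files for `uses:` references that are not full commit SHAs. A minimal sketch — the helper name and regex are illustrative, not part of this PR:

```shell
# Sketch: flag `uses:` lines in a GitHub Actions workflow that reference a
# mutable tag instead of a full 40-character commit SHA.
check_pinned() {
  # Print any `uses:` line whose ref after `@` is not a 40-hex-char SHA.
  grep -n 'uses:' "$1" | grep -Ev '@[0-9a-f]{40}' || true
}
```

Running it over `.github/workflows/*.yaml` before merging gives a cheap guard against accidentally reverting to tag references.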
- name: Install doc dependencies
run: |
set -x
python3 -m venv venv
source venv/bin/activate
pip install -r docs/requirements.txt
run: uv sync --project docs
- name: Install dependency graph tooling
run: |
set -x
@@ -63,6 +57,5 @@ jobs:
- name: Build docs html and check for warnings
run: |
set -x
source venv/bin/activate
cd docs
./build.sh # fails on errors
uv run --project ../docs ./build.sh # fails on errors
140 changes: 8 additions & 132 deletions benchmarks/bench.sh
@@ -42,7 +42,7 @@ DATAFUSION_DIR=${DATAFUSION_DIR:-$SCRIPT_DIR/..}
DATA_DIR=${DATA_DIR:-$SCRIPT_DIR/data}
CARGO_COMMAND=${CARGO_COMMAND:-"cargo run --release"}
PREFER_HASH_JOIN=${PREFER_HASH_JOIN:-true}
VIRTUAL_ENV=${VIRTUAL_ENV:-$SCRIPT_DIR/venv}
UV_PROJECT=${UV_PROJECT:-$SCRIPT_DIR}

usage() {
echo "
@@ -144,7 +144,7 @@ CARGO_COMMAND command that runs the benchmark binary
DATAFUSION_DIR directory to use (default $DATAFUSION_DIR)
RESULTS_NAME folder where the benchmark files are stored
PREFER_HASH_JOIN Prefer hash join algorithm (default true)
VENV_PATH Python venv to use for compare and venv commands (default ./venv, override by <your-venv>/bin/activate)
UV_PROJECT Path to the benchmarks project for uv (default $SCRIPT_DIR)
DATAFUSION_* Set the given datafusion configuration
"
exit 1
@@ -708,7 +708,7 @@ run_compile_profile() {
local data_path="${DATA_DIR}/tpch_sf1"

echo "Running compile profile benchmark..."
local cmd=(python3 "${runner}" --data "${data_path}")
local cmd=(uv run --project "${UV_PROJECT}" python3 "${runner}" --data "${data_path}")
if [ ${#profiles[@]} -gt 0 ]; then
cmd+=(--profiles "${profiles[@]}")
fi
@@ -923,151 +923,27 @@ data_h2o() {
SIZE=${1:-"SMALL"}
DATA_FORMAT=${2:-"CSV"}

# Function to compare Python versions
version_ge() {
[ "$(printf '%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}
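The `sort -V` comparison in the removed helper can be checked in isolation; a small sketch assuming a `sort` that supports version ordering (GNU coreutils or BSD):

```shell
# version_ge succeeds when $1 >= $2 under version-number ordering.
# `sort -V` sorts version strings numerically (so 3.9 < 3.10), and the
# check holds when the required version $2 sorts first.
version_ge() {
  [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

version_ge "3.12" "3.10" && echo "3.12 >= 3.10"
version_ge "3.9" "3.10" || echo "3.9 < 3.10"
```

This is exactly the kind of manual interpreter discovery that delegating environment management to uv makes unnecessary.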

export PYO3_USE_ABI3_FORWARD_COMPATIBILITY=1

# Find the highest available Python version (3.10 or higher)
REQUIRED_VERSION="3.10"
PYTHON_CMD=$(command -v python3 || true)

if [ -n "$PYTHON_CMD" ]; then
PYTHON_VERSION=$($PYTHON_CMD -c "import sys; print(f'{sys.version_info.major}.{sys.version_info.minor}')")
if version_ge "$PYTHON_VERSION" "$REQUIRED_VERSION"; then
echo "Found Python version $PYTHON_VERSION, which is suitable."
else
echo "Python version $PYTHON_VERSION found, but version $REQUIRED_VERSION or higher is required."
PYTHON_CMD=""
fi
fi

# Search for suitable Python versions if the default is unsuitable
if [ -z "$PYTHON_CMD" ]; then
# Loop through all available Python3 commands on the system
for CMD in $(compgen -c | grep -E '^python3(\.[0-9]+)?$'); do
if command -v "$CMD" &> /dev/null; then
PYTHON_VERSION=$($CMD -c "import sys; print(f'{sys.version_info.major}.{sys.version_info.minor}')")
if version_ge "$PYTHON_VERSION" "$REQUIRED_VERSION"; then
PYTHON_CMD="$CMD"
echo "Found suitable Python version: $PYTHON_VERSION ($CMD)"
break
fi
fi
done
fi

# If no suitable Python version found, exit with an error
if [ -z "$PYTHON_CMD" ]; then
echo "Python 3.10 or higher is required. Please install it."
return 1
fi

echo "Using Python command: $PYTHON_CMD"

# Install falsa and other dependencies
echo "Installing falsa..."

# Set virtual environment directory
VIRTUAL_ENV="${PWD}/venv"

# Create a virtual environment using the detected Python command
$PYTHON_CMD -m venv "$VIRTUAL_ENV"

# Activate the virtual environment and install dependencies
source "$VIRTUAL_ENV/bin/activate"

# Ensure 'falsa' is installed (avoid unnecessary reinstall)
pip install --quiet --upgrade falsa

# Create directory if it doesn't exist
H2O_DIR="${DATA_DIR}/h2o"
mkdir -p "${H2O_DIR}"

# Generate h2o test data
echo "Generating h2o test data in ${H2O_DIR} with size=${SIZE} and format=${DATA_FORMAT}"
falsa groupby --path-prefix="${H2O_DIR}" --size "${SIZE}" --data-format "${DATA_FORMAT}"

# Deactivate virtual environment after completion
deactivate
uv run --project "${UV_PROJECT}" falsa groupby --path-prefix="${H2O_DIR}" --size "${SIZE}" --data-format "${DATA_FORMAT}"
}
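The `${1:-"SMALL"}` form used by `data_h2o` and `data_h2o_join` is bash default-parameter expansion: positional arguments fall back to defaults when omitted. A standalone illustration (`gen_args` is a hypothetical stand-in, not a function from bench.sh):

```shell
# Positional parameters with fallbacks, as in data_h2o's SIZE/DATA_FORMAT.
gen_args() {
  SIZE=${1:-"SMALL"}
  DATA_FORMAT=${2:-"CSV"}
  echo "size=${SIZE} format=${DATA_FORMAT}"
}

gen_args                  # prints: size=SMALL format=CSV
gen_args MEDIUM PARQUET   # prints: size=MEDIUM format=PARQUET
```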

data_h2o_join() {
# Default values for size and data format
SIZE=${1:-"SMALL"}
DATA_FORMAT=${2:-"CSV"}

# Function to compare Python versions
version_ge() {
[ "$(printf '%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

export PYO3_USE_ABI3_FORWARD_COMPATIBILITY=1

# Find the highest available Python version (3.10 or higher)
REQUIRED_VERSION="3.10"
PYTHON_CMD=$(command -v python3 || true)

if [ -n "$PYTHON_CMD" ]; then
PYTHON_VERSION=$($PYTHON_CMD -c "import sys; print(f'{sys.version_info.major}.{sys.version_info.minor}')")
if version_ge "$PYTHON_VERSION" "$REQUIRED_VERSION"; then
echo "Found Python version $PYTHON_VERSION, which is suitable."
else
echo "Python version $PYTHON_VERSION found, but version $REQUIRED_VERSION or higher is required."
PYTHON_CMD=""
fi
fi

# Search for suitable Python versions if the default is unsuitable
if [ -z "$PYTHON_CMD" ]; then
# Loop through all available Python3 commands on the system
for CMD in $(compgen -c | grep -E '^python3(\.[0-9]+)?$'); do
if command -v "$CMD" &> /dev/null; then
PYTHON_VERSION=$($CMD -c "import sys; print(f'{sys.version_info.major}.{sys.version_info.minor}')")
if version_ge "$PYTHON_VERSION" "$REQUIRED_VERSION"; then
PYTHON_CMD="$CMD"
echo "Found suitable Python version: $PYTHON_VERSION ($CMD)"
break
fi
fi
done
fi

# If no suitable Python version found, exit with an error
if [ -z "$PYTHON_CMD" ]; then
echo "Python 3.10 or higher is required. Please install it."
return 1
fi

echo "Using Python command: $PYTHON_CMD"

# Install falsa and other dependencies
echo "Installing falsa..."

# Set virtual environment directory
VIRTUAL_ENV="${PWD}/venv"

# Create a virtual environment using the detected Python command
$PYTHON_CMD -m venv "$VIRTUAL_ENV"

# Activate the virtual environment and install dependencies
source "$VIRTUAL_ENV/bin/activate"

# Ensure 'falsa' is installed (avoid unnecessary reinstall)
pip install --quiet --upgrade falsa

# Create directory if it doesn't exist
H2O_DIR="${DATA_DIR}/h2o"
mkdir -p "${H2O_DIR}"

# Generate h2o test data
echo "Generating h2o test data in ${H2O_DIR} with size=${SIZE} and format=${DATA_FORMAT}"
falsa join --path-prefix="${H2O_DIR}" --size "${SIZE}" --data-format "${DATA_FORMAT}"

# Deactivate virtual environment after completion
deactivate
uv run --project "${UV_PROJECT}" falsa join --path-prefix="${H2O_DIR}" --size "${SIZE}" --data-format "${DATA_FORMAT}"
}

# Runner for h2o groupby benchmark
@@ -1269,7 +1145,7 @@ compare_benchmarks() {
echo "--------------------"
echo "Benchmark ${BENCH}"
echo "--------------------"
PATH=$VIRTUAL_ENV/bin:$PATH python3 "${SCRIPT_DIR}"/compare.py $OPTS "${RESULTS_FILE1}" "${RESULTS_FILE2}"
uv run --project "${UV_PROJECT}" python3 "${SCRIPT_DIR}"/compare.py $OPTS "${RESULTS_FILE1}" "${RESULTS_FILE2}"
else
echo "Note: Skipping ${RESULTS_FILE1} as ${RESULTS_FILE2} does not exist"
fi
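The removed invocation used the `VAR=value command` form, which sets an environment variable for a single child process only; `uv run` achieves similar scoping by managing the environment itself. A standalone illustration of the one-shot form (the variable name is illustrative):

```shell
# One-shot environment override: GREETING exists only for the child process;
# the parent shell never sees it.
GREETING=hello sh -c 'echo "$GREETING"'   # prints: hello
echo "${GREETING:-unset}"                 # prints: unset
```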
@@ -1385,8 +1261,8 @@ run_clickbench_sorted() {
}

setup_venv() {
python3 -m venv "$VIRTUAL_ENV"
PATH=$VIRTUAL_ENV/bin:$PATH python3 -m pip install -r requirements.txt
echo "Setting up Python environment via uv..."
uv sync --project "${UV_PROJECT}"
}

# And start the process up
5 changes: 5 additions & 0 deletions benchmarks/pyproject.toml
@@ -0,0 +1,5 @@
[project]
name = "datafusion-benchmarks"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = ["rich", "falsa"]
18 changes: 0 additions & 18 deletions benchmarks/requirements.txt

This file was deleted.

5 changes: 5 additions & 0 deletions dev/pyproject.toml
@@ -0,0 +1,5 @@
[project]
name = "datafusion-dev"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = ["tomlkit", "PyGithub", "requests"]
6 changes: 3 additions & 3 deletions dev/release/README.md
@@ -178,10 +178,10 @@ We maintain a [changelog] so our users know what has been changed between releases

The changelog is generated using a Python script.

To run the script, you will need a GitHub Personal Access Token (described in the prerequisites section) and the `PyGitHub` library. First install the `PyGitHub` dependency via `pip`:
To run the script, you will need a GitHub Personal Access Token (described in the prerequisites section) and the `PyGitHub` library. First install the dev dependencies via `uv`:

```shell
pip3 install PyGitHub
uv sync --group dev
```

To generate the changelog, set the `GITHUB_TOKEN` environment variable and then run `./dev/release/generate-changelog.py`
@@ -199,7 +199,7 @@ to generate a change log of all changes between the `50.3.0` tag and `branch-51`

```shell
export GITHUB_TOKEN=<your-token-here>
./dev/release/generate-changelog.py 50.3.0 branch-51 51.0.0 > dev/changelog/51.0.0.md
uv run --group dev ./dev/release/generate-changelog.py 50.3.0 branch-51 51.0.0 > dev/changelog/51.0.0.md
```

This script creates a changelog from GitHub PRs based on the labels associated with them as well as looking for
2 changes: 0 additions & 2 deletions dev/requirements.txt

This file was deleted.

2 changes: 1 addition & 1 deletion dev/update_arrow_deps.py
@@ -19,7 +19,7 @@
# Script that updates the arrow dependencies in datafusion locally
#
# installation:
# pip install tomlkit requests
# uv sync --project dev
#
# pin all arrow crates deps to a specific version:
#
2 changes: 1 addition & 1 deletion dev/update_datafusion_versions.py
@@ -19,7 +19,7 @@
# Script that updates versions for datafusion crates, locally
#
# dependencies:
# pip install tomlkit
# uv sync --project dev

import re
import argparse
15 changes: 4 additions & 11 deletions docs/README.md
@@ -25,19 +25,12 @@ https://datafusion.apache.org/ as part of the release process.

## Dependencies

It's recommended to install build dependencies and build the documentation
inside a Python virtualenv.
Install build dependencies and build the documentation using
[uv](https://docs.astral.sh/uv/):

```sh
python3 -m venv venv
pip install -r requirements.txt
```

If using [uv](https://docs.astral.sh/uv/) the script can be run like so without
needing to create a virtual environment:

```sh
uv run --with-requirements requirements.txt bash build.sh
uv sync --project docs
uv run --project docs bash build.sh
```

The docs build regenerates the workspace dependency graph via
13 changes: 13 additions & 0 deletions docs/pyproject.toml
@@ -0,0 +1,13 @@
[project]
name = "datafusion-docs"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
"sphinx>=9,<10",
"sphinx-reredirects>=1.1,<2",
"pydata-sphinx-theme>=0.16,<1",
"myst-parser>=5,<6",
"maturin>=1.11,<2",
"jinja2>=3.1,<4",
"setuptools>=82,<83",
]
24 changes: 0 additions & 24 deletions docs/requirements.txt

This file was deleted.
