Add RAT license validation#47

Open
janhoy wants to merge 11 commits into
apache:mainfrom
janhoy:feature/apache-rat-license-audit

@janhoy janhoy commented May 30, 2026

Closes #8. Adds Apache RAT license audit infrastructure so we can satisfy ASF release policy.

Changes

  • make rat — downloads RAT 0.18 JAR to ~/.solr-orbit/cache/apache-rat/ via scripts/download-rat.py (pure Python, no curl), verifies SHA-512, then runs the audit. Errors if java is not on PATH.
  • .rat-excludes — excludes files that legitimately carry no header (bytecode, config/data files, test fixtures, binaries, docs, CI metadata).
  • make release-checks — calls scripts/release-checks.sh; runs RAT as the first pre-flight step.
  • 62 pre-fork files — were missing headers entirely; now carry an SPDX line, OpenSearch attribution, and full ASF Apache-2.0 boilerplate. Files that already had a header are untouched.
  • DEVELOPER_GUIDE.md — new Release section documenting the above.

The header added to Python files that previously had no header is:

# SPDX-License-Identifier: Apache-2.0
#
# Originally developed by OpenSearch Contributors; licensed under the Apache License, Version 2.0.
# License header was absent in the original source; added when adopted into Apache Solr Orbit.
# Modified by Apache Solr contributors; see git log for details.
#
# Licensed to the Apache Software Foundation (ASF) under one
[... rest of standard Apache header ...]

How to review

  1. Run make rat — should complete with 0 unapproved licenses.
  2. Check .rat-excludes for any patterns that seem too broad.
  3. Spot-check a few of the 62 re-headered files (e.g. solrorbit/aggregator.py, solrorbit/builder/launchers/docker_launcher.py) to confirm the header is correct.
  4. Review scripts/download-rat.py for the download + verify logic.
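For step 3, a reviewer could automate the spot-check instead of opening files by hand. The following is a minimal sketch, not part of this PR: `has_expected_header` and `EXPECTED_PREFIX` are hypothetical names, the expected lines are copied from the header quoted above, and the shebang allowance is an assumption.

```python
from pathlib import Path

# Assumption: the leading header lines exactly as quoted in the PR description.
EXPECTED_PREFIX = [
    "# SPDX-License-Identifier: Apache-2.0",
    "#",
    "# Originally developed by OpenSearch Contributors; licensed under the Apache License, Version 2.0.",
    "# License header was absent in the original source; added when adopted into Apache Solr Orbit.",
    "# Modified by Apache Solr contributors; see git log for details.",
    "#",
    "# Licensed to the Apache Software Foundation (ASF) under one",
]


def has_expected_header(path: Path) -> bool:
    """Return True if the file's leading lines match EXPECTED_PREFIX exactly."""
    lines = path.read_text(encoding="utf-8").splitlines()
    # Hypothetical allowance: ignore a shebang line if one precedes the header.
    if lines and lines[0].startswith("#!"):
        lines = lines[1:]
    return lines[: len(EXPECTED_PREFIX)] == EXPECTED_PREFIX
```

Calling `has_expected_header(Path("solrorbit/aggregator.py"))` from the repository root would then report whether that file starts with the quoted lines.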

janhoy added 5 commits May 30, 2026 11:33
- Add .rat-excludes with exclusion patterns for config files, binaries,
  docs, caches, empty __init__.py namespace markers, and IDE files
- Add `make rat` target that auto-downloads the RAT 0.18 JAR to
  ~/.cache/apache-rat/ on first use (requires curl + java)
- Add scripts/release-checks.sh (was missing despite being referenced
  in the Makefile); currently runs the RAT audit
- Update `make release-checks` to invoke scripts/release-checks.sh
- Add license headers to 62 Python source files that were missing them:
  - Files with first commit before fork point get the OSB-lineage header
    (SPDX + "Modifications by Apache Solr" + OpenSearch Contributors)
  - Empty __init__.py namespace markers are excluded from RAT instead
- Document `make rat` and `make release-checks` in DEVELOPER_GUIDE.md
- Makefile: replace deprecated RAT flags -E/-d with
  --input-exclude-file/-- (deprecated since RAT 0.17)
- Makefile: verify SHA-512 checksum of downloaded RAT JAR
- scripts/release-checks.sh: harden with set -euo pipefail
  and cd to project root via dirname guard
- DEVELOPER_GUIDE.md: replace X.Y.Z+1 placeholder with
  concrete version example (0.9.2 / 0.9.3)
Apache's .sha512 file contains a bare hex hash with no filename,
not the `<hash>  <file>` format shasum -c expects. Fix by:
- Downloading the tarball to a temp file first
- Constructing the shasum -c input inline via echo
- Extracting the JAR from the verified tarball
- Cleaning up the temp tarball and .sha512 file
The 62 pre-fork files added in the RAT commit had only the short OSB
snippet as a placeholder. Replace with a header that:
- Credits original OpenSearch Contributors authorship
- Notes the license header was absent in the original source
- Includes the full ASF Apache-2.0 boilerplate

Files that already carried the full Elasticsearch/ASF header are untouched.
- Replace curl+shasum shell gymnastics with scripts/download-rat.py:
  uses urllib.request + hashlib.sha512 + tarfile, no curl dependency
- Move JAR cache from ~/.cache/apache-rat/ to ~/.solr-orbit/cache/apache-rat/
  to keep all project runtime state under one directory
- Add java availability check in the rat target with a clear error message
- Update DEVELOPER_GUIDE.md to reflect the new cache path and Python requirement
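The verify-then-extract flow described in this commit can be sketched with the same standard-library modules it names (`hashlib` and `tarfile`). This is an illustration only and not the actual contents of scripts/download-rat.py: the function name `extract_jar_if_verified`, its signature, and the choice to return the first `.jar` member are assumptions.

```python
import hashlib
import io
import tarfile


def extract_jar_if_verified(tarball: bytes, expected_sha512: str) -> tuple[str, bytes]:
    """Check the tarball's SHA-512, then return (member name, bytes) of the first .jar inside."""
    actual = hashlib.sha512(tarball).hexdigest()
    if actual != expected_sha512.strip().lower():
        raise ValueError(f"SHA-512 mismatch: expected {expected_sha512}, got {actual}")
    # "r:*" lets tarfile detect the compression (gzip, bz2, xz, or none) on its own.
    with tarfile.open(fileobj=io.BytesIO(tarball), mode="r:*") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(".jar"):
                return member.name, tar.extractfile(member).read()
    raise FileNotFoundError("no .jar member found in tarball")
```

Verifying the digest before opening the archive means an unverified download is never parsed or unpacked.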

This comment was marked as resolved.

janhoy added 6 commits May 31, 2026 00:11
Replace the overly broad **/__init__.py glob with specific patterns
covering only the 0-byte package markers. Non-empty __init__.py files
carry a full license header and will now be checked by RAT as intended.
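One way to arrive at such a list of specific patterns is to enumerate the 0-byte markers directly. The sketch below is an assumption about how that could be done, not a description of how the patterns in this PR were produced; `empty_init_files` is a hypothetical helper, and its output would still need to be written in whatever pattern syntax .rat-excludes uses.

```python
from pathlib import Path


def empty_init_files(root: Path) -> list[str]:
    """Return sorted POSIX-style relative paths of 0-byte __init__.py files under root."""
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("__init__.py")
        if p.is_file() and p.stat().st_size == 0
    )
```

Any `__init__.py` that is not in the returned list has content and should therefore stay visible to RAT.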
…s 404

downloads.apache.org only hosts the current release; older pinned versions
move to archive.apache.org. Add download_with_fallback() that retries on
the archive mirror so 'make rat' keeps working after a new RAT release.
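A minimal sketch of the fallback idea follows. Only the name `download_with_fallback` comes from the commit message; the signature, the injectable `fetch` parameter, and the rule of falling through only on HTTP 404 are assumptions made for illustration.

```python
import urllib.error
import urllib.request
from typing import Callable, Sequence


def _fetch(url: str) -> bytes:
    """Default fetcher: read the whole response body."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def download_with_fallback(
    urls: Sequence[str], fetch: Callable[[str], bytes] = _fetch
) -> bytes:
    """Try each URL in order; move to the next one only when the server answers 404."""
    last_error = None
    for url in urls:
        try:
            return fetch(url)
        except urllib.error.HTTPError as err:
            if err.code != 404:
                raise  # other HTTP errors are not a sign the file has moved
            last_error = err
    if last_error is None:
        raise ValueError("no URLs given")
    raise last_error
```

A caller would pass the downloads.apache.org URL first and the archive.apache.org URL second, so the archive is contacted only after the primary host reports the file as missing.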
Apache .sha512 files are not bare hex strings; they use either GNU
coreutils format ("<hex>  <filename>") or BSD format
("SHA512 (<file>) = <hex>"). Extract the hex token before comparing
so SHA-512 verification does not always fail on a fresh machine.
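Extracting the hex token can be sketched as below. This is an illustrative approach, not necessarily the one used in scripts/download-rat.py: `extract_sha512_hex` is a hypothetical name, and the sketch relies on a SHA-512 digest always being 128 hex characters, so it does not need to tell the GNU and BSD layouts apart.

```python
import re

# A SHA-512 digest is 512 bits, i.e. 128 hexadecimal characters.
_HEX128 = re.compile(r"\b[0-9a-fA-F]{128}\b")


def extract_sha512_hex(text: str) -> str:
    """Return the 128-character hex digest found in a .sha512 file body, lowercased."""
    match = _HEX128.search(text)
    if match is None:
        raise ValueError("no SHA-512 hex digest found")
    return match.group(0).lower()
```

The returned value can be compared directly with `hashlib.sha512(data).hexdigest()`, which is always lowercase.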
…able

With set -euo pipefail, assigning $1 / $2 when the script is called
without arguments aborts before the -z guard can print the usage message.
Use ${1:-} / ${2:-} so the shell treats missing args as empty strings
and lets the existing -z check handle the error path.
File is redundant — LICENSE and NOTICE already cover dependency
attribution per ASF policy. Also had a duplicate row for solr-orbit.
Ruff (F541) flagged the f-string in download_with_fallback(); drop the
f prefix since the string contains no interpolation.
#
# Originally developed by OpenSearch Contributors; licensed under the Apache License, Version 2.0.
# License header was absent in the original source; added when adopted into Apache Solr Orbit.
# Modified by Apache Solr contributors; see git log for details.

Please help decide which license header we should use for Python files that had no header at all in the OpenSearch Benchmark project, like this one. I suggest this one: the normal Apache header with a three-line notice on top saying the file was originally authored by OSB and then modified by Solr.

I felt it was overkill to add the full "The OpenSearch project requires contributors..." text and the Rally stuff. These three lines should be enough to let people know the origin of the file, and the git history reveals the rest.

Development

Successfully merging this pull request may close these issues.

Add Apache RAT for release-time license audit
