Skip to content

feat: add ecosystem domain allowlists from gh-aw#213

Merged
jamesadevine merged 12 commits intomainfrom
feat/comprehensive-allow-list
Apr 15, 2026
Merged

feat: add ecosystem domain allowlists from gh-aw#213
jamesadevine merged 12 commits intomainfrom
feat/comprehensive-allow-list

Conversation

@jamesadevine
Copy link
Copy Markdown
Collaborator

Summary

Adds ecosystem identifier support for the network.allow front matter field, aligning with gh-aw's network allowlisting approach. Users can now reference ecosystem names (e.g., python, rust, node) that expand to curated domain lists instead of manually listing individual domains.

Before

network:
  allow:
    - "pypi.org"
    - "pip.pypa.io"
    - "*.pythonhosted.org"
    - "files.pythonhosted.org"
    - "bootstrap.pypa.io"

After

network:
  allow:
    - python
    - "api.custom.com"

Changes

  • src/data/ecosystem_domains.json — New file: 30+ ecosystem categories sourced from gh-aw's ecosystem_domains.json, embedded at compile time
  • src/ecosystem_domains.rs — New module: loads/parses the JSON, provides get_ecosystem_domains(), is_ecosystem_identifier(), and is_known_ecosystem() functions, supports compound ecosystems (e.g., default-safe-outputs = defaults + dev-tools + github + local)
  • src/compile/standalone.rs — Updated generate_allowed_domains() to resolve ecosystem identifiers in both network.allow and network.blocked. Unknown identifiers emit a compile-time warning. Added 5 integration tests.
  • src/main.rs — Register new ecosystem_domains module
  • .github/workflows/update-awf-version.md — Extended the dependency updater workflow to sync ecosystem_domains.json from gh-aw upstream (max PRs bumped 3→4)
  • AGENTS.md — Documented ecosystem identifiers with available ecosystem table, updated architecture diagram, and updated {{ allowed_domains }} marker docs

Design Decisions

  1. Ecosystem vs domain heuristic: Ecosystem identifiers are lowercase alphanumeric + hyphens with no dots (python, linux-distros); domain names contain dots (pypi.org). Invalid strings (spaces, special chars) fall through to existing DNS validation.
  2. Backward compatible: Raw domain strings in network.allow continue to work exactly as before. Ecosystem identifiers are purely additive.
  3. Core hosts unchanged: CORE_ALLOWED_HOSTS in allowed_hosts.rs is always included regardless of network: config — ecosystems are an additional layer.
  4. Reuse existing updater: Added ecosystem_domains.json sync to the existing dependency updater workflow rather than creating a separate one.

Testing

  • 856 tests pass (15 new: 10 unit tests in ecosystem_domains.rs + 5 integration tests in standalone.rs)
  • No new clippy warnings
  • Existing test for invalid host ("bad host!") continues to work correctly — the tightened identifier heuristic rejects it

jamesadevine and others added 5 commits April 15, 2026 12:15
Add ecosystem identifier support for the network.allow front matter
field. Users can now reference ecosystem names (e.g., python, rust,
node) that expand to curated domain lists, matching gh-aw's approach.

Changes:
- Add src/data/ecosystem_domains.json sourced from gh-aw with 30+
  ecosystem categories
- Add src/ecosystem_domains.rs module with lookup, validation, and
  compound ecosystem support
- Update generate_allowed_domains() to resolve ecosystem identifiers
  in both network.allow and network.blocked
- Extend dependency updater workflow to sync ecosystem_domains.json
  from gh-aw upstream
- Update AGENTS.md with ecosystem identifier documentation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move Lean runtime domains from hardcoded LEAN_REQUIRED_HOSTS constant
into ecosystem_domains.json. The Lean extension now returns the
ecosystem identifier 'lean' from required_hosts(), and
generate_allowed_domains() resolves it via the JSON like any other
ecosystem. Extension hosts now support ecosystem identifiers too.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add reservoir.lean-lang.org (Lake package registry) and
static.lean-lang.org (toolchain binary downloads) to the lean
ecosystem entry. Update LEAN_REQUIRED_HOSTS constant to match.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The canonical lean domain list is now in ecosystem_domains.json.
The LeanExtension returns the "lean" ecosystem identifier, making
this constant dead code.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The sync workflow previously replaced the local file with upstream
verbatim, which would delete ado-aw-specific entries like 'lean'.
Updated instructions to merge upstream changes while preserving
local-only keys.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown
Contributor

🔍 Rust PR Review

Summary: Needs changes — one data bug that would silently allow Rust crate registry access in Python-only pipelines, plus a doc typo that would break copy-paste examples.

Findings

🐛 Bugs / Logic Issues

  • src/data/ecosystem_domains.jsonpython entry contains Rust registry domains (critical data bug)

    The python ecosystem list includes three crates.io domains that clearly do not belong there:

    "python": [
        "pypi.org", "pip.pypa.io", ...
        "crates.io",         // ← Rust registry, not Python
        "index.crates.io",   // ← Rust registry
        "static.crates.io"   // ← Rust registry
    ]

    These are already correctly present in the rust entry. This looks like a copy-paste error in the JSON. Any pipeline that uses network.allow: [python] would silently receive Rust registry access — violating least-privilege network isolation. A test like assert!(!python_domains.contains("crates.io")) would have caught this.

  • src/ecosystem_domains.rs:22 — doc comment uses allowed: instead of allow:

    The module-level YAML example says allowed: but the actual front matter field is allow::

    // current (wrong):
    network:
      allowed:     // ← wrong key
        - python
    
    // correct:
    network:
      allow:
        - python

    Any user copying this example would get silent no-op behavior (unknown YAML key is ignored by serde default).

⚠️ Suggestions

  • src/data/ecosystem_domains.jsonlean silently gained two new domains vs the old LEAN_REQUIRED_HOSTS constant

    Old constant had 3 hosts: elan.lean-lang.org, leanprover.github.io, lean-lang.org.
    New JSON entry adds reservoir.lean-lang.org and static.lean-lang.org. Existing Lean pipelines will have their network allowlist expanded on next recompile without any visible indication. Worth a comment in the PR or a mention in the changelog.

  • src/compile/standalone.rs:356 — consider using else instead of continue for clarity

    The continue after handling ecosystem identifiers is correct, but an else block would make the mutually-exclusive branching more immediately legible without relying on continue to skip the validation below. Minor.

✅ What Looks Good

  • The is_ecosystem_identifier heuristic (no dots → ecosystem, has dots → domain) is clean and handles all the edge cases in the tests correctly. Unknown ecosystem identifiers are warn-and-drop (safe-fail direction is right).
  • The HashSet-based deduplication means adding an ecosystem via both runtimes: lean: true and network.allow: [lean] is safe.
  • LazyLock for the embedded JSON parse is the correct approach — fail-fast at startup on a bad embed rather than silently returning empty at runtime.
  • Compound ecosystem support is well-structured and easy to extend.
  • Test coverage for the new module is solid (10 unit + 5 integration tests).

Generated by Rust PR Reviewer for issue #213 · ● 563.7K ·

jamesadevine and others added 3 commits April 15, 2026 12:51
Align field names with gh-aw conventions by renaming:
- network.allowed → network.allow
- network.blocked → network.block

Updates front matter parsing, compiler, tests, documentation,
and all prompt/agent files to use the new field names.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Recompile lock files to reflect updated network field names
in update-awf-version workflow.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace continue-after-if with an if/else block for clearer
mutually-exclusive branching between ecosystem identifiers and
raw domain validation. Also fix remaining stale "network.allowed"
references in error messages, comments, and test descriptions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown
Contributor

🔍 Rust PR Review

Summary: Good feature overall, but there's a data bug causing Rust domains to leak into the Python ecosystem, and a breaking rename that will silently drop custom allowlists from recompiled pipelines.

Findings

🐛 Bugs / Logic Issues

  • src/data/ecosystem_domains.json (python entry) — The python ecosystem array ends with "crates.io", "index.crates.io", and "static.crates.io" — these are Rust crates registry domains. They're already correctly listed under rust. Any agent that specifies network.allowed: [python] will silently gain access to the Rust package registry, which is an unexpected expansion of the allowlist.

    "python": [
      ...,
      "repo.anaconda.com",
      "crates.io",        // ← belongs in "rust" only
      "index.crates.io",  // ← belongs in "rust" only
      "static.crates.io"  // ← belongs in "rust" only
    ],

    The test test_get_python_domains only asserts that pypi.org and pip.pypa.io are present — it doesn't assert that Rust domains are absent, so it won't catch this.

  • src/compile/types.rs:600 — Renaming the field from allow to allowed is a silent breaking change. Any existing pipeline source using network:\n allow:\n - "*.mycompany.com" will compile without error (serde ignores unknown fields by default), but the entire allowlist will be silently dropped from the generated pipeline YAML. If users recompile existing agents without noticing the rename in the docs, their AWF domain allowlists will disappear.

    Mitigation options: keep allow as a #[serde(alias = "allow")] on the new allowed field, or add a #[serde(deny_unknown_fields)] + migration error. A silent field rename in a compiler's DSL is particularly dangerous since the generated YAML is the actual security boundary.

⚠️ Suggestions

  • src/compile/standalone.rs — warning for unknown ecosystem identifier: The warning is only emitted when domains.is_empty() && !is_known_ecosystem(host). If a known ecosystem has an empty domain list in JSON (hypothetical), no warning is emitted and no domains are added, with no signal to the user. Consider warning whenever domains.is_empty() regardless of whether the key exists:

    if domains.is_empty() {
        eprintln!("warning: ecosystem '{}' resolved to no domains", host);
    }
  • src/ecosystem_domains.rsexpect on embedded JSON: The LazyLock uses .expect("embedded ecosystem_domains.json is invalid") which will panic if the JSON is malformed. Since this runs in a user-facing compiler (not a server), a panic on startup is jarring. Consider surfacing parse errors through the compilation result instead. That said, this is a compile-time-embedded file, so malformed JSON should be caught by CI — low real-world risk.

✅ What Looks Good

  • Clean LazyLock<HashMap> design for the domain map — zero-cost after first access, no unnecessary locking.
  • The is_ecosystem_identifier heuristic (no dot → ecosystem, has dot → domain) is simple and effective. The fallthrough to DNS validation for unrecognized strings preserves backward compatibility perfectly.
  • Blocking ecosystem identifiers correctly expands them before removal — network.blocked: [python] removes all Python domains, not just the literal string "python".
  • The extension-declared host expansion (e.g., lean returning an ecosystem identifier that is then expanded) is a clean layered design.
  • New reservoir.lean-lang.org and static.lean-lang.org in the lean ecosystem are genuine additions needed for Lean's package registry (Reservoir).

Generated by Rust PR Reviewer for issue #213 · ● 1.8M ·

jamesadevine and others added 3 commits April 15, 2026 13:10
Add two tests to ecosystem_domains::tests:

- test_embedded_json_parses_as_expected_schema: validates the
  compile-time-embedded JSON deserializes correctly as
  HashMap<String, Vec<String>> and every ecosystem has a non-empty
  domain list.

- test_malformed_json_rejected: confirms serde_json rejects schema
  mismatches (string instead of array, non-string array elements,
  invalid JSON syntax), validating the safety of the .expect() guard
  on the LazyLock initializer.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… rename

Add #[serde(deny_unknown_fields)] to NetworkConfig so that the old
field name (network.allow) produces a compile-time error instead of
being silently ignored. This prevents users from losing their AWF
domain allowlists after the rename to network.allowed.

Also fix stale doc comments referencing the old field name.

Tests added:
- test_network_config_rejects_old_allow_field
- test_network_config_accepts_allowed_field
- test_network_config_rejects_arbitrary_unknown_field

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Rename remaining occurrences of network.allow to network.allowed in:
- src/compile/standalone.rs: doc comments and warning/error messages
- tests/compiler_tests.rs: doc comments and test description strings

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown
Contributor

🔍 Rust PR Review

Summary: Good feature addition, but contains one unannounced breaking change and a data quality issue in the embedded JSON that needs attention before merging.


Findings

🐛 Bugs / Logic Issues

  • src/data/ecosystem_domains.json lines 1076–1078 — The "python" ecosystem array contains three Rust/crates.io domains: "crates.io", "index.crates.io", "static.crates.io". This appears to be a copy-paste error in the upstream gh-aw JSON. Any agent using network.allowed: [python] will inadvertently also be allowed to reach the Rust crate registry. Worth checking the upstream source and either filing a bug there or stripping these out locally until it's fixed upstream. They already appear correctly under "rust" on line 1082.

  • src/compile/extensions.rs:499LeanExtension::required_hosts() now returns vec!["lean"] (an ecosystem identifier) instead of concrete domain names. The expansion in generate_allowed_domains() handles this correctly, but there's no warning or assertion if the ecosystem identifier lookup returns empty. If a future update to ecosystem_domains.json accidentally removes the "lean" key, the Lean runtime silently gets zero network hosts — no compilation error, no warning. The user-host path warns on unknown ecosystems, but the extension path (lines 554–564 of standalone.rs) does not. Consider adding a similar guard:

    if is_ecosystem_identifier(&host) {
        let domains = get_ecosystem_domains(&host);
        if domains.is_empty() {
            eprintln!("warning: extension '{}' requires unknown ecosystem '{}'; no domains added", ext.name(), host);
        }
        // ...
    }

🔒 Security Concerns

  • src/compile/types.rs:766#[serde(deny_unknown_fields)] on NetworkConfig combined with renaming allowallowed is a hard-breaking change (see ⚠️ below). Beyond migration: deny_unknown_fields is a good hardening addition and the intent is correct.

⚠️ Suggestions

  • src/compile/types.rs:771–773 — The field rename from allow to allowed is a silent breaking change for all existing users. Any pipeline definition currently using network:\n allow: will now fail compilation with unknown field 'allow' rather than working or getting a deprecation warning. The test test_network_config_rejects_old_allow_field confirms this is intentional, but there's no mention in the PR description and no migration path provided.

    The minimal backward-compatible fix is a serde alias:

    #[serde(default, alias = "allow")]
    pub allowed: Vec<String>,

    Alternatively, if this is an intentional hard break, it should be called out explicitly in the PR description and CHANGELOG as a breaking change so downstream users know to update their .md files. The deny_unknown_fields attribute prevents the alias + new-field coexistence from being silent, which is actually the right UX — but the current state gives users no guidance on what to change it to.

  • src/ecosystem_domains.rs:152–162 — The recursive get_ecosystem_domains() for compound ecosystems has no cycle-guard. Currently safe (only one compound entry with direct components), but if a future entry references another compound ecosystem it could stack-overflow. A simple visited-set or depth limit would make this robust:

    fn get_ecosystem_domains_inner(identifier: &str, depth: u8) -> Vec<String> {
        if depth > 8 { return vec![]; }  // cycle/depth guard
        // ...
    }

✅ What Looks Good

  • The is_ecosystem_identifier() heuristic (no dots, lowercase alphanumeric + hyphens) is clean and correctly distinguishes ecosystem names from DNS names. The existing "bad host!" test correctly falls through to DNS validation.
  • Embedding the JSON with include_str! + LazyLock + expect() is the right pattern for static data that must be valid at compile time — the test_embedded_json_parses_as_expected_schema test provides good coverage.
  • Test coverage is solid: 10 unit tests + 5 integration tests covering expansion, blocking, mixed use, and unknown identifiers.
  • The deny_unknown_fields addition on NetworkConfig is good hardening regardless of the migration concern above.
  • Removing LEAN_REQUIRED_HOSTS constant and replacing with the ecosystem JSON entry is a cleaner architecture — single source of truth for domain lists.

Generated by Rust PR Reviewer for issue #213 · ● 1.1M ·

- Add warning in generate_allowed_domains() when an extension requires
  an unknown ecosystem identifier, matching the existing guard on the
  user-host path.

- Add depth guard (max 8) to get_ecosystem_domains() to prevent stack
  overflow from circular compound ecosystem references.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jamesadevine jamesadevine merged commit 22a2069 into main Apr 15, 2026
4 checks passed
@jamesadevine jamesadevine deleted the feat/comprehensive-allow-list branch April 15, 2026 12:40
@github-actions
Copy link
Copy Markdown
Contributor

🔍 Rust PR Review

Summary: Well-implemented feature with one breaking change that needs awareness, and a subtle silent-failure path in the validation logic.

Findings

🐛 Bugs / Logic Issues

  • src/compile/types.rs:605 + #[serde(deny_unknown_fields)] — The rename from allowallowed paired with deny_unknown_fields is a hard breaking change with no migration path. Any existing pipeline with network:\n allow: will now fail at parse time with "unknown field allow". There is no #[serde(alias = "allow")] to ease the transition. The test test_network_config_rejects_old_allow_field explicitly confirms the old field is rejected. Downstream users with existing pipelines will hit an opaque error on the next ado-aw compile run without any hint to rename the field. A deprecation alias + a warning on use would have been safer for a v0.x release series — but at minimum this needs a prominent callout in the changelog/release notes.

  • src/compile/standalone.rs:340 — When a user specifies an ecosystem-like string (no dots, all lowercase+hyphens) that is not a known ecosystem, the entry is silently dropped from the allowlist with only a eprintln! warning. Example: network.allowed: ["pytohn"] (typo) generates a warning but the Python domains are never added and the pipeline runs anyway. This is a regression from the previous behaviour where an invalid entry would cause a hard bail!. A compile-time error for unrecognised ecosystem identifiers would be safer, since the alternative — a runtime networking failure inside the AWF sandbox — is much harder to diagnose. At minimum the warning message should recommend using bail! or at least be output through the compiler's structured warning system rather than bare eprintln!.

🔒 Security Concerns

  • src/ecosystem_domains.rs:97 — The heuristic is_ecosystem_identifier (no dot, all [a-z0-9-]) means any short, all-lowercase hostname without a dot (e.g. "intranet", "corp", "internal", "localhost") is now classified as an ecosystem identifier and silently dropped rather than being validated and added as a raw domain. In the old code "localhost" would pass DNS-character validation and be added; now it produces only a warning. Users relying on bare hostnames in their allowlist will silently lose them. The local ecosystem is the intended substitute for localhost, but the silent drop is surprising.

⚠️ Suggestions

  • src/ecosystem_domains.rs:53 — The .expect("embedded ecosystem_domains.json is invalid") inside LazyLock::new will panic at the first call site if the JSON ever becomes malformed (e.g., after a botched automated sync). Since this fires at runtime, it produces an unfriendly panic rather than a compile error or a structured anyhow::bail!. Wrapping with a Result-based initialisation and propagating the error through get_ecosystem_domains as a Result<Vec<String>> would be more consistent with the rest of the codebase's anyhow::Result pattern, though the test_embedded_json_parses_as_expected_schema test does catch this in CI.

  • src/compile/standalone.rs:320LeanExtension::required_hosts() now returns vec!["lean"] (the ecosystem identifier) rather than the concrete domain list. This works, but it couples the extension system to the ecosystem-identifier concept in a way that isn't reflected in the CompilerExtension trait documentation. The trait's required_hosts() return type implies raw host strings; returning an ecosystem identifier is a semantic overloading. A separate required_ecosystems() method, or explicit documentation on the trait, would make the contract clearer for future extension authors.

✅ What Looks Good

  • LazyLock for one-time JSON initialisation is the right pattern — zero-overhead at compile time, loaded once at first access.
  • Depth guard (max 8) on compound ecosystem recursion is a solid defensive addition, especially given the recursive get_ecosystem_domains_inner design.
  • #[serde(deny_unknown_fields)] on NetworkConfig is strictly correct hygiene — it will catch typos in operator configs that previously silently took effect with default values.
  • test_embedded_json_parses_as_expected_schema is an excellent test: it validates the embedded JSON's schema and asserts no ecosystem has an empty domain list, providing a safety net for the automated sync workflow.
  • The blocked-host ecosystem expansion (allowing blocked: [python] to remove all Python domains) is symmetric and well-tested.

Generated by Rust PR Reviewer for issue #213 · ● 617.5K ·

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant