diff --git a/docs/adr/IC-ADR-001_computing_elements.md b/docs/adr/IC-ADR-001_computing_elements.md new file mode 100644 index 0000000..8a240bb --- /dev/null +++ b/docs/adr/IC-ADR-001_computing_elements.md @@ -0,0 +1,964 @@ +# IC-ADR-001: Computing Element interfaces and the DIRAC to interCEde migration + +## Metadata + +- **Created By:** Alexandre Boyer +- **Date:** 2026-06-30 +- **Status:** Draft +- **Decision Maker(s):** DIRACGrid maintainers +- **Stakeholders:** DIRAC/DiracX SiteDirector, PushJobAgent and Pilot maintainers; extension + communities operating Computing Elements and batch systems; interCEde contributors + +> **Scope and altitude.** This is a *direction-setting* ADR. It commits to the **shape** of +> interCEde's Computing Element interfaces and records what is ported from DIRAC's +> `Resources/Computing` and what is deliberately left out. Illustrative code sketches pin down the +> shape; the exhaustive method-by-method signatures, the wire/serialisation details, and the +> per-backend porting mechanics are deferred to follow-up ADRs and to the implementation itself. +> interCEde is a standalone library in the [DIRACGrid](https://github.com/DIRACGrid) ecosystem +> designed to back the Computing Element layer used by +> [DiracX](https://github.com/DIRACGrid/diracx); this ADR is intended to be read alongside the +> DiracX transition ADRs but governs interCEde's own API. (Why a standalone library rather than a +> `diracx` subpackage is recorded in Rejected Ideas.) +> +> **How to read this.** To review the decision, read the Abstract, Motivation, §1–§4 and §7. +> The rest is supporting reference and can be skimmed: §8 (migration map), §9 (consumer sketches), +> the Rejected-Ideas entries you don't want to contest, and the Open Issues backlog. 
+ +## Abstract + +interCEde gives the workload management layer one payload-agnostic way to **submit, monitor and +retrieve** units of work on heterogeneous Computing Elements (CEs) and batch systems, validated +against containerised backends. This ADR fixes the shape of that interface with five decisions: + +1. **The contract is a set of small typed interfaces (`typing.Protocol`s), not a base class.** + Backends are defined by structural conformance, implement only the capabilities they have, and + can be mocked without inheriting interCEde types. +2. **Access and scheduling compose instead of multiplying.** A `Transport` (how you reach the + resource) and a `Scheduler` (how work is queued) combine in one generic `BatchBackend`, so N + transports × M schedulers costs N+M classes; genuinely monolithic backends (ARC, HTCondor-CE, + Cloud) implement the contract directly. +3. **DIRAC's in-process CEs (`InProcess`, `Singularity`, `Pool`) leave the library.** They belong + to the pilot/worker-node domain and get their own `Runner`/`ExecutorPool` subsystem, + disconnected from this contract. +4. **The data plane handles arbitrary input/output sandboxes**, not only stdout/stderr — the + direct consequence of never interpreting the payload. +5. **The extension surface is explicit and three-tiered:** stable contracts third parties + implement, provided implementations they may use but not subclass, and internals with no + stability guarantee. + +## Motivation + +### What DIRAC has today + +DIRAC's CE machinery lives in `DIRAC/Resources/Computing/`. A base class `ComputingElement` is +subclassed by nine concrete CEs (`AREX`, `HTCondorCE`, `SSH`, `SSHBatch`, `Local`, `Cloud`, +`InProcess`, `Singularity`, `Pool`), instantiated through a `ComputingElementFactory` that +string-loads modules via `ObjectLoader`. 
A parallel `BatchSystems/` package (`SLURM`, `Condor`, +`LSF`, `GE`, `OAR`, `Torque`, `Host`) provides scheduler plugins that the `SSH`/`Local` CEs +compose via `loadBatchSystem()`. + +### Why it needs to change + +1. **The base class is a fat class that conflates contract with policy.** `ComputingElement` + mixes the job-lifecycle contract (submit/status/output/kill) with cross-cutting policy: + CPU-slot accounting (`available()`), proxy renewal (`_monitorProxy`), and a layered + configuration hierarchy. Every CE drags this along whether it needs it or not. + +2. **The contract is "abstract by convention", and unenforced.** There is no `abc.ABC`, no + `@abstractmethod`, no Protocol — it is a plain `class ComputingElement:`. What stubs exist + (`submitJob`, `getCEStatus`) merely `return S_ERROR("…should be implemented in a subclass")`; + the rest of the lifecycle (`getJobStatus`, `getJobOutput`, `killJob`) is not declared on the + base **at all**, so nothing — not even a stub — stops a subclass from silently omitting + `getJobOutput`. + +3. **The interface has drifted across implementations.** `submitJob` alone appears as + `(executableFile, proxy, numberOfJobs, inputs, outputs)` (AREX), `(…, numberOfJobs=1)` + (HTCondorCE/SSH), `(…, proxy=None, inputs=None, **kwargs)` (InProcess/Pool), and + `(…, proxy=None, **kwargs)` (Singularity); `proxy` is required in some and optional in others. + `getJobOutput` appears with four different signatures. Not all CEs implement the same methods. + +4. **Three incompatible semantics are crammed under one base.** The base docstring itself + distinguishes **Remote** (async, pollable, multi-job), **Inner** (synchronous, blocking, one + job), and **Inner Pool** CEs. They serve *different actors* — Remote CEs are driven by the + central SiteDirector/PushJobAgent, Inner CEs by the JobAgent inside a pilot — and are never + substituted for one another. 
Their *mechanics* don't even overlap (`CGroups2.systemCall` + + periodic proxy refresh on the inner side vs batch submit/poll/REST on the remote side). The + shared supertype is a false abstraction. + +5. **The good idea — composition — is applied inconsistently.** `SSH`/`Local` compose a transport + with a `BatchSystems/*` plugin, but `AREX`/`HTCondorCE`/`Cloud` hard-code their scheduler + interaction, and the `BatchSystems/*` plugins are a separate hierarchy with their own ad-hoc, + string-generating interface. + +6. **The data plane is too narrow.** Only `AREXComputingElement` retrieves arbitrary output files + (it lists the ARC session directory and streams every file to disk — though even it still + *returns* only stdout/stderr in memory, `S_OK((output, error))`); every other CE handles only + stdout/stderr, returning their contents in memory (SSH/Local can instead write them to a + directory and return paths, but nothing generalises to an arbitrary sandbox). Input staging is + similarly limited. A payload-agnostic library cannot assume payloads emit only stdout/stderr. + +7. **Untyped results and discovery.** Everything is `S_OK`/`S_ERROR` dicts; backends are + discovered by string-to-module loading. Neither is checkable or pluggable in a typed, + third-party-friendly way. + +### Drivers + +- **Payload-agnostic.** interCEde must not care whether the payload is a DIRAC job, a DIRAC + pilot, or anything else. It submits an opaque executable + sandbox and lets the caller + poll/fetch/kill. +- **Orchestration-agnostic.** Whether work is *pulled* (SiteDirector) or *pushed* (PushJobAgent) + is a workload management system concern above interCEde. +- **One contract, many backends**, with *combinations* (`SSH + Slurm`, `SSH + HTCondor`, + `Local + HTCondor`) as first-class. +- **Tested against the real thing** — backends verified against containerised schedulers, so the + contract must be trivially implementable and mockable. 
+- **Extensible by third parties** — VOs must be able to add their own transports, schedulers and + CEs without forking interCEde. +- **Typed and async** — DiracX is async; results and errors should be typed. + +## Specification + +### 1. Scope — what interCEde is and is not + +interCEde is the **delegate-and-poll** side of running work on a resource: *submit a payload to a +scheduler/resource, get a handle, then poll it, fetch its sandbox, or kill it.* The boundary with +the rest of the system, stated without any pilot/job vocabulary: + +- **In scope:** reaching a resource (local shell, SSH, REST/ARC, HTCondor-CE, cloud API); + queueing work on a scheduler; tracking a remote handle; staging input/output sandboxes. +- **Out of scope — execute-here-and-now:** running a payload *in this process* on a worker node + (DIRAC's `InProcess`/`Singularity`/`Pool`). This is the pilot/worker-node domain (see §6). +- **Out of scope — orchestration:** pull vs push, matching, pilot lifecycle. interCEde exposes + mechanism; DiracX/DIRAC decides policy. +- **Out of scope — payload interpretation:** interCEde never inspects payload contents. +- **Out of scope — persistence:** interCEde keeps no store of jobs or handles; the caller + persists handles and owns idempotency/reconciliation (see §9). +- **Out of scope — other resource types:** Storage Elements, File Catalogs, and the like are + *different contracts* (data movement; namespace/metadata) with different backends, testing, and + consumers. They are **sibling libraries, not part of interCEde** — see Rejected Ideas. + +The submitted thing is a **`SubmissionSpec`** — the description of an opaque payload (name +provisional; see Open Issues). Submitting a spec, optionally `count` identical times, returns one +**`Submission`**, which bundles one **`JobHandle`** per created job plus any per-copy failures. 
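As a rough illustration of the vocabulary above, the sketch below shows one way a `Submission` could bundle handles with per-copy failures, and how a caller might persist a handle as text and restore it later. The field names (`job_id`, `backend`, `routing`, `handles`, `failures`), the JSON encoding and the example host are assumptions made for this sketch only; the ADR defers the actual field set and serialisation to follow-up work.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class JobHandle:
    """Hypothetical handle: the scheduler's job id plus the routing needed to find it again."""

    job_id: str                                             # scheduler's id, not a DIRAC job id
    backend: str                                            # which backend created the job
    routing: dict[str, str] = field(default_factory=dict)   # e.g. the target host

    def to_json(self) -> str:
        # Sorted keys give a stable text form the caller can store and compare.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> JobHandle:
        return cls(**json.loads(raw))


@dataclass(frozen=True)
class Submission:
    """Hypothetical result of submit(spec, count): handles plus per-copy failures."""

    handles: tuple[JobHandle, ...]
    failures: tuple[str, ...] = ()                          # one message per copy not created

    @property
    def complete(self) -> bool:
        return not self.failures


handle = JobHandle("1234.0", "ssh+htcondor", {"host": "ce01.example.org"})
restored = JobHandle.from_json(handle.to_json())            # what a later process would do
partial = Submission(handles=(handle,), failures=("copy 2: submit rejected",))
```

Because the handle carries its own routing, a later process can replay it without consulting any store inside interCEde, which is consistent with the persistence boundary stated above.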
+ +**Throughout this ADR, "job" means the *scheduler's* job (a Slurm/Condor job) — explicitly *not* +a DIRAC Job.** The naming deliberately reserves `Job*` for the backend handle and avoids +`Payload`, which collides with the pilot-side payload concept (see Open Issues). + +A `JobHandle` is the durable, serialisable identity of one job: the caller persists it and later +replays it into status/output/kill calls. It must therefore be stable across processes and carry +whatever *routing* the backend needs to find the job again (e.g. the target host that DIRAC's +`sshcondor://…` scheme used to encode — see §8). + +### 2. The contract — capability-segmented Protocols + +The caller-facing contract is a set of small `typing.Protocol`s, and **every operation works on +many jobs at once**: submit fans a single spec into `count` identical jobs; status/output/kill/ +purge take a sequence of ids and return a mapping keyed by `JobID`, so every per-item outcome is +addressable (see the partial-failure rule below). + +The contract is sized by its consumers. DiracX splits the old monolithic SiteDirector into +independent, separately-scheduled **tasks** — one submits, one polls status, one fetches outputs +(§9 sketches them). The **essential** capabilities of a CE are two — *submit a payload (with +inputs)* and *get the jobs' status* — each driven by its own task. *Retrieving outputs* is a +third task and its own protocol, but it is **optional, not essential**: some backends genuinely +cannot serve a pull-style sandbox retrieval — the canonical case is **Cloud**, where a booted VM +has no server-side filesystem interCEde controls, so exporting outputs is the payload's own +responsibility (e.g. pushing them to an object store) — and the ADR's rule is *never force a +backend to stub a capability it lacks*. So `OutputRetriever` sits with the optional capabilities, +and the output task narrows to it structurally. 
+
+```python
+StatusMap = Mapping[JobID, JobStatus]   # named alias, stable (Tier A)
+
+@runtime_checkable
+class Submitter(Protocol):   # the submission task
+    async def submit(self, spec: SubmissionSpec, count: int = 1) -> Submission: ...
+    # submit the SAME spec `count` times (DIRAC's numberOfJobs); returns one Submission
+    # bundling one JobHandle per successful copy, plus per-copy failures
+
+@runtime_checkable
+class Monitor(Protocol):   # the status task
+    async def status(self, ids: Sequence[JobID]) -> StatusMap: ...
+    # ids the backend no longer knows come back as JobStatus.UNKNOWN — never dropped
+
+@runtime_checkable
+class JobBackend(Submitter, Monitor, Protocol): ...
+```
+
+`OutputRetriever` and the other capabilities are additive and optional (below). `JobBackend` is
+the composition of the two essential protocols; it names "a complete backend" and is what the
+registry returns and validates. But **consumers depend on the narrow protocol they use, not on
+`JobBackend`** — the submission task takes a `Submitter`, the status task a `Monitor`, the output
+task an `OutputRetriever` (which it must confirm structurally, since a backend may lack it). A
+backend implements the slices it can, and each task is handed the same object but sees only its
+slice.
+
+> **Naming — why `JobBackend`, not `ComputingElement`.** interCEde types name *what interCEde
+> provides* — a backend the WMS drives — never the external resource. The composed contract is
+> `JobBackend` (it pairs with `JobHandle`/`JobID`/`JobStatus`); a Slurm-over-SSH `BatchBackend`
+> is emphatically *not* a Computing Element in the WLCG sense, and the concrete drivers
+> (`ARCBackend`, `HTCondorCEBackend`, `CloudBackend`) are clients *of* resources, not the
+> resources themselves. The domain term "Computing Element / CE" stays in prose and docstrings.
+> The one retained "CE" is `HTCondorCEBackend`: there "CE" names the external product
+> (HTCondor-CE, a grid gateway distinct from the local `HTCondorScheduler`), not a claim that the
+> class *is* a CE.
+
+> **The registry gate is a smoke-test, not the drift guard.** `isinstance` against a
+> `@runtime_checkable` Protocol checks method **presence only, never signatures** — a backend
+> whose `submit` has an extra required argument, or that forgot `async`, passes the gate and
+> fails only at the first call. The real defence against DIRAC-style signature drift is (a)
+> static typing for first-party code and (b) the **container conformance suite**, which is
+> therefore a *requirement* of registration, not optional (see §5, Evolution). (Two PEP 544
+> mechanics for implementers: keep `Protocol` in the base list when composing protocols, or the
+> subclass silently becomes a concrete class; and `@runtime_checkable` is not inherited — the
+> composed `JobBackend` must carry the decorator itself.)
+
+**Everything else is optional** — additive, independent protocols, *not* part of `JobBackend`,
+each justified by a real caller and by backends that have it and backends that don't. All are
+bulk and return a per-`JobID` outcome map (`kill`/`purge`/`fetch_output` never return bare
+`None`, so a partial-batch failure is always reportable):
+
+```python
+@runtime_checkable
+class OutputRetriever(Protocol):   # the (on-demand) output task — OPTIONAL
+    # Bulk: materialise each job's whole output sandbox (incl. the CE/scheduler log) into
+    # `dest`. No separate log fetch — the log is a manifest member. Retrieval is IDEMPOTENT
+    # by default (ARC/SSH/Local keep remote state and can be re-fetched); a backend whose
+    # retrieval is physically one-shot DECLARES it via `destructive` — e.g. HTCondor
+    # (>= 25.8), where a completed spooled job leaves the queue after `condor_transfer_data`
+    # (overridable via `leave_in_queue`). interCEde never *forces* destructiveness on a
+    # re-fetchable backend; the consumer reads `destructive` to know whether `dest` is a
+    # hard commit point or a re-fetchable cache.
+    destructive: bool   # False for ARC/SSH/Local; True for HTCondor (>= 25.8)
+    async def fetch_output(self, ids: Sequence[JobID], dest: Path) -> Mapping[JobID, JobOutput]: ...
+
+@runtime_checkable
+class Cancellable(Protocol):
+    async def kill(self, ids: Sequence[JobID]) -> Mapping[JobID, OpOutcome]: ...
+
+@runtime_checkable
+class Purgeable(Protocol):
+    # Delete a job's outputs/remote state WITHOUT fetching them — the counterpart to the
+    # destructive fetch. Needed on its own where a site mandates scratch/spool cleanup
+    # even for outputs no one will retrieve.
+    async def purge(self, ids: Sequence[JobID]) -> Mapping[JobID, OpOutcome]: ...
+
+@runtime_checkable
+class LoadReporter(Protocol):
+    # Aggregate counts of *our* submitted jobs — NOT a capacity number, NOT "available
+    # slots" (that is the caller's throttling policy). HTCondor-CE cannot provide this at
+    # all (`getCEStatus()` returns an explicit `S_ERROR("… not supported")`).
+    async def counts(self) -> JobCounts: ...   # {running, waiting}
+
+@runtime_checkable
+class SupportsLiveDiagnostics(Protocol):
+    # An independent, repeatable diagnostics fetch *before* completion — ARC is the
+    # clear case (its `diagnose/errors` endpoint); whether HTCondorCE's spool log-peek
+    # qualifies is an Open Issue. The *final* log is always a member of the
+    # fetch_output manifest, so there is no standalone "get the final log" method.
+    async def diagnostics(self, ids: Sequence[JobID]) -> Mapping[JobID, Diagnostics]: ...
+```
+
+The comment on `OutputRetriever.destructive` above is the **single normative statement** of the
+destructive-retrieval model; every other mention in this ADR cross-references it.
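To make the consumer side of the optional capabilities concrete, here is a minimal sketch of an output task that first confirms `OutputRetriever` structurally and then reads `destructive` to decide how to treat `dest`. Everything outside the protocol shape is an assumption for illustration: `JobID` is taken to be a plain string, the two `Fake*` classes are test doubles rather than real backends, plain strings stand in for the `JobOutput` manifest, and `output_plan` is a hypothetical helper name.

```python
import asyncio
from pathlib import Path
from typing import Mapping, Optional, Protocol, Sequence, runtime_checkable

JobID = str  # assumption: the ADR does not fix the concrete type of JobID


@runtime_checkable
class OutputRetriever(Protocol):
    destructive: bool

    async def fetch_output(
        self, ids: Sequence[JobID], dest: Path
    ) -> Mapping[JobID, Optional[str]]: ...


class FakeSpoolBackend:
    """Test double whose retrieval is one-shot, like the spooled HTCondor case."""

    destructive = True

    def __init__(self) -> None:
        self._spool = {"1.0": "stdout-of-1.0"}

    async def fetch_output(self, ids, dest):
        # pop() models the job leaving the queue: a second fetch finds nothing.
        return {job_id: self._spool.pop(job_id, None) for job_id in ids}


class FakeCloudBackend:
    """Test double that claims no output capability at all."""


def output_plan(backend: object) -> str:
    # Presence check only: isinstance() never validates signatures.
    if not isinstance(backend, OutputRetriever):
        return "skip"
    return "commit-once" if backend.destructive else "re-fetchable"


spool = FakeSpoolBackend()
first = asyncio.run(spool.fetch_output(["1.0"], Path("out")))
second = asyncio.run(spool.fetch_output(["1.0"], Path("out")))
```

The second fetch returning nothing for the same id is the behaviour a consumer has to plan for when `destructive` is true: the first successful write into `dest` is the only copy.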
+ +**Partial failure is per item, not per batch.** A bulk call raises a typed exception only for a +whole-operation failure (transport down, auth rejected); anything that can succeed for some ids +and fail for others reports it in the returned map — `status` yields `JobStatus.UNKNOWN` for +unknown/expired ids, and `kill`/`purge`/`fetch_output` yield a per-id `OpOutcome` (ok / error). +Callers never have to guess which ids a partial batch touched. + +**Backends have an explicit lifecycle, because consumers cache them.** DIRAC already proves both +halves: `QueueCECache` caches CE objects keyed on a hash of their parameters, reuses them across +agent cycles, and calls `shutdown()` on eviction (with an explicit "eviction must never fail" +guard), and the SSH CE's `shutdown()` closes per-host connections *and the ssh jump-gateway* — +yet the base class declares no lifecycle contract at all, so eviction relies on `shutdown()` +being "polymorphic by convention". interCEde backends and transports hold the same kind of state +(SSH connections, HTTP sessions), so the lifecycle is part of the Tier-A contract: **backends and +transports are async context managers** — construction is cheap and side-effect-free (no I/O +before first use), close is idempotent, and a consumer that caches backends owns closing them on +eviction. This is also what makes the statelessness rule (§9) honest: the *only* resources a +backend holds are connections, released deterministically, never job state. + +**Decision rule for segmentation:** one protocol per capability a *specific caller wants in +isolation*. The DiracX-task split makes that concrete — submit and status are two separate tasks +(so two essential protocols), and output-retrieval is a third task narrowed to an *optional* +`OutputRetriever` (an earlier draft bundled status+fetch into one "operator"; the task topology +shows that was too coarse). 
+`kill` and `purge` are management actions some flows never use; the
+submission task's throttling wants `LoadReporter`; the WMS matcher wants none of this. Do not
+segment further than a caller exists for — that discipline is what prevents drift back into a
+god-interface.
+
+```mermaid
+classDiagram
+    direction LR
+    class Submitter {
+        <>
+        +submit(spec, count) Submission
+    }
+    class Monitor {
+        <>
+        +status(ids) StatusMap
+    }
+    class OutputRetriever {
+        <>
+        +fetch_output(ids, dest) OutputMap
+    }
+    class JobBackend {
+        <>
+    }
+    Submitter <|-- JobBackend
+    Monitor <|-- JobBackend
+    class Cancellable {
+        <>
+        +kill(ids) OutcomeMap
+    }
+    class Purgeable {
+        <>
+        +purge(ids) OutcomeMap
+    }
+    class LoadReporter {
+        <>
+        +counts() JobCounts
+    }
+    class SupportsLiveDiagnostics {
+        <>
+        +diagnostics(ids) DiagnosticsMap
+    }
+    class Transport {
+        <>
+        +run(argv) CommandResult
+        +put(local, remote)
+        +get(remote, local)
+    }
+    class Scheduler {
+        <>
+        +stages_own_files bool
+        +submit_cmd(spec) Command
+        +parse_status(raw) StatusMap
+        +kill_cmd(ids) Command
+    }
+    class BatchBackend
+    JobBackend <|.. BatchBackend
+    OutputRetriever <|.. BatchBackend
+    Cancellable <|.. BatchBackend
+    Purgeable <|.. BatchBackend
+    LoadReporter <|.. BatchBackend
+    BatchBackend o-- Transport : uses
+    BatchBackend o-- Scheduler : uses
+    Transport <|.. SSHTransport
+    Transport <|.. LocalTransport
+    Scheduler <|.. Slurm
+    Scheduler <|.. HTCondorScheduler
+    class ARCBackend
+    class HTCondorCEBackend
+    class CloudBackend
+    JobBackend <|.. ARCBackend
+    JobBackend <|.. HTCondorCEBackend
+    JobBackend <|.. CloudBackend
+    OutputRetriever <|.. ARCBackend
+    OutputRetriever <|.. HTCondorCEBackend
+    Cancellable <|.. ARCBackend
+    Purgeable <|.. ARCBackend
+    LoadReporter <|.. ARCBackend
+    SupportsLiveDiagnostics <|.. ARCBackend
+    Cancellable <|.. HTCondorCEBackend
+```
+
+(The two essential protocols compose into `JobBackend`; output-retrieval and the rest are
+optional and picked up à la carte.
+`ARCBackend` adds `OutputRetriever` plus all four optional
+capabilities. `HTCondorCEBackend` adds `OutputRetriever` and `Cancellable` only — it reports no
+counts, offers no independent diagnostics, and its retrieval is destructive (§2).
+`CloudBackend` implements only the essential `Submitter`+`Monitor` — it deliberately does *not*
+claim `OutputRetriever` (§2).)
+
+### 3. Composition — `Transport` × `Scheduler`
+
+Most backends are really two orthogonal choices: **access/transport** (how you reach the
+resource) and **scheduler** (what queues the work). `Transport` and `Scheduler` are
+*collaborator* protocols — they are **not** sub-protocols of `JobBackend`. Their relationship to
+the contract is realised by composition:
+
+```python
+class BatchBackend:   # satisfies JobBackend
+    def __init__(self, transport: Transport, scheduler: Scheduler): ...
+```
+
+This one class covers `SSH + Slurm`, `SSH + HTCondor`, `Local + Slurm`, … — the matrix collapses
+from *transports × schedulers* to *transports + schedulers*. Backends that are not
+"transport + scheduler" (ARC is a REST service fronting an opaque batch system; Cloud boots a VM
+and has no scheduler) implement `JobBackend` directly. The Protocol is what lets composed and
+monolithic implementations sit behind one caller-facing type.
+
+Cloud is the clearest case of *why* composition does not always apply. Batch has **two
+independent axes** — any `Scheduler` behind any `Transport` — which is what makes the N×M → N+M
+collapse worthwhile. Cloud has **one** axis, the provider (OpenStack/EC2/OpenNebula/…), and "how
+you reach it" and "what it is" are *fused* into a single Apache Libcloud driver — there is no
+matrix to collapse. So `CloudBackend` is **not** `BatchBackend(libcloud, opennebula)`: Libcloud
+is not a `Transport` (no shell to `run` argv, no filesystem to `put`/`get`) and a provider like
+OpenNebula is not a `Scheduler` (no queue, no submit/parse-status commands).
The provider axis +*is* a composition, but Libcloud already owns it (its `get_driver`/`set_driver` registry); +`CloudBackend` is a monolithic `JobBackend` that composes Libcloud **internally** (Tier C, §8), +which is why the provider drivers do not appear as collaborators in the class diagram. + +**Staging is a Transport × Scheduler interaction, and the `Scheduler` carries it.** Who moves the +sandbox is not purely a transport choice: HTCondor declares its file transfer *in the submit +description at submit time* (`should_transfer_files`, `transfer_output_files`) and stages +worker→schedd itself, whereas a Slurm-over-SSH combination declares nothing to the scheduler and +interCEde stages the sandbox over the transport (`put`/`get`) at fetch time. `BatchBackend` must +not hard-code this per scheduler, so the `Scheduler` protocol exposes it as data — a +`stages_own_files` flag (and, where needed, `stage_inputs`/`collect_outputs` hooks). +`BatchBackend.submit()` consults the flag instead of branching on scheduler identity, which is +what keeps "add a scheduler by implementing the protocol, no core change" true even for a +scheduler whose transfer model differs from the reference one. + +**Shared batch-system knowledge.** A backend family's command construction and output parsing — +building a `condor_submit` description, parsing `condor_q`/`condor_history` ClassAds, mapping +native states to `JobStatus` — lives in *one* internal module (`_htcondor`, `_slurm`, …) consumed +by **both** the `Scheduler` adapter and any related monolithic CE. `HTCondorScheduler` (used by +`BatchBackend` for SSH/Local + HTCondor) and `HTCondorCEBackend` (a remote schedd reached over a +grid/CE transport, with tokens and the grid universe) share that core and differ only in +transport/auth and submit-description routing. 
That overlap is itself a hint that +`HTCondorCEBackend` might eventually be expressed as +`BatchBackend(HTCondorCETransport, HTCondorScheduler)` rather than a monolith (see Open Issues). +The rule is the same one from §6: **share implementation as functions/internal modules, never by +making one a subtype of the other.** + +**No interpreter on the remote host.** DIRAC's SSH/Local CEs work by *shipping* a Python driver +to the resource (`BatchSystems/.py` concatenated with `executeBatch.py`, uploaded and executed +by whatever Python the host provides — the file itself warns "support for py2 and py3 is +necessary"), which is why those drivers are stdlib-only, DIRAC-free, and stuck on 2014-era +conventions. interCEde deliberately drops the ship-and-exec model: `Scheduler`s build commands +*locally* and only argv (plus staged sandbox files) crosses the `Transport`. The remote host +needs the scheduler CLI and a shell — no Python, no shipped code, no py2 constraint. This is a +portability gain worth advertising and a behavioural change worth testing: the integration suite +should assert it with a batch-host variant that has no usable interpreter (IC-ADR-002 §7). + +### 4. Data plane — payload and sandbox + +DIRAC's stdout/stderr-only model is generalised to **arbitrary input/output sandboxes**. This is +not a feature bolt-on; it is the *same requirement* as payload-agnosticism — if interCEde does +not interpret the payload, it cannot assume the payload emits only stdout/stderr. stdout/stderr +become two well-known **members** of the output sandbox; the executable becomes one distinguished +member of the input sandbox. 
+
+```python
+@dataclass
+class SubmissionSpec:   # the opaque payload (name provisional — see Open Issues)
+    executable: str
+    arguments: list[str]
+    input_sandbox: list[FileRef]      # files/dirs (or streams) staged in before the run
+    output_sandbox: OutputSpec        # explicit names | globs | "everything produced"
+    resources: Resources
+    environment: Mapping[str, str]
+    tag: str | None = None            # consumer-set token, reserved for post-crash
+                                      # reconciliation (list-by-tag — see Open Issues)
+```
+
+Rules:
+
+- **Declared at submit, materialised at fetch.** HTCondor needs `transfer_output_files` *in the
+  submit description*, so the output-sandbox spec lives in `SubmissionSpec`, not only as a
+  fetch-time argument. Permissive backends (ARC lists the session dir; SSH globs the remote
+  workdir) may also discover outputs at fetch time, so `OutputSpec` expresses both explicit lists
+  and globs/"all".
+- **Caller-owned inputs are consumed at submit.** `submit()` stages (or copies) the executable
+  and every `input_sandbox` member *during the call* and retains no reference to caller-owned
+  paths after it returns — the caller may delete its temp files immediately. This rule removes a
+  DIRAC wart by construction: today the SiteDirector deletes the submitted executable *unless*
+  the CE returns `ExecutableToKeep` (HTCondorCE does, because it still needs the file on disk
+  afterwards), a hidden ownership handshake no other CE participates in.
+- **Return paths, not contents — and stream.** `fetch_output(ids, dest)` streams each member into
+  `dest` and returns, per job, a **manifest** (`JobOutput` with `.stdout`, `.stderr`,
+  `.files[...]`, and `.log`); it never loads file contents into memory. This is the one model
+  that survives multi-GB outputs.
+- **Staging is delegated or synthesised.** "Stage a sandbox" is either delegated to the backend's + native mechanism (ARC session dir, HTCondor file transfer) or synthesised by interCEde over the + transport (`Transport.put`/`get`), selected by the scheduler's `stages_own_files` flag (§3). + The composed CEs get arbitrary sandboxes *for free* because the transport already provides + `put`/`get`. +- **One manifest, one retrieval.** `fetch_output` materialises stdout, stderr, output files + **and** the CE/scheduler log together into `dest`, as members of a single manifest. The + manifest separates **CE/scheduler artifacts** (the log — infra-level, interCEde's to expose) + from **payload artifacts** (fetched, never interpreted). Whether the fetch also *releases* the + remote state is the backend's declared `destructive` property (§2) — when it is declared, the + consumer must treat the local `dest` as the commit point (§9). There is no standalone "get the + final log" call; a *while-running* diagnostics fetch exists only where it is genuinely + independent and non-destructive (`SupportsLiveDiagnostics`, §2). `purge(ids)` on its own still + deletes a job's outputs/remote state *without* fetching — the path a site takes when it + mandates scratch cleanup for outputs no one will retrieve (`Purgeable`, §2). +- **Bounded materialisation, enforced where interCEde does the copy.** "Everything produced" plus + arbitrary filenames from a payload interCEde never inspects means `fetch_output` needs a size + ceiling, a file-count ceiling, a per-transfer timeout, and path containment (every member lands + **under** `dest`; `..`/absolute members are rejected). 
How much is enforceable inline depends + on who does the copy: where interCEde streams file-by-file itself (ARC, SSH) all four limits + apply mid-stream; where the backend's own tool does the transfer (HTCondor) enforcement is + coarser — a submit-time allow-list, a timeout on the transfer command, and a post-transfer + size/count check. The limits are stated per backend rather than assumed uniform. + +Results are typed dataclasses/models; whole-operation failures raise a **typed exception +hierarchy** (replacing `S_OK`/`S_ERROR`) while per-item outcomes are returned in the bulk map +(§2). The contract is **`async`, and async is the single source of truth**: the backends are all +I/O-bound (SSH, ARC/cloud REST, subprocess) and the contract is bulk, so async is the natural +model for polling/fetching many jobs concurrently. A sync-only backend library (the `htcondor` +bindings, Libcloud, a batch CLI) is wrapped in a thread internally (`asyncio.to_thread`), never +surfaced as a second contract — backends implement only the async protocols. Sync *callers* +(DIRAC during the transition) are served by a thin sync **facade** that runs the async calls to +completion, **not** by a parallel sync API (see Open Issues); a sync caller still benefits from +the bulk fan-out, which runs concurrently *inside* each run-to-completion call. + +### 5. Discovery — registry and entry points + +`ComputingElementFactory` + `ObjectLoader` string-to-module loading is replaced by a typed +**registry** populated through Python **entry points**. 
+Three groups, one per pluggable kind:
+
+```toml
+# a backend package's pyproject.toml
+[project.entry-points."intercede.backends"]
+arc = "intercede.backend.arc:ARCBackend"
+htcondor-ce = "intercede.backend.htcondor:HTCondorCEBackend"
+
+[project.entry-points."intercede.transports"]
+ssh = "intercede.transport.ssh:SSHTransport"
+local = "intercede.transport.local:LocalTransport"
+
+[project.entry-points."intercede.schedulers"]
+slurm = "intercede.scheduler.slurm:Slurm"
+htcondor = "intercede.scheduler.htcondor:HTCondorScheduler"
+```
+
+How it resolves:
+
+- **Lazy, typed lookup.** The registry reads the entry-point groups on first use but imports only
+  the *one* target a request names — no eager import of every backend. Each resolved object is
+  checked for structural conformance (`isinstance` against the relevant `@runtime_checkable`
+  protocol) at the boundary, so a misregistered class fails fast with a clear error instead of at
+  the first method call.
+- **A CE is described by data, not a class name.** A request is a small typed config —
+  `{"type": "arc", ...}` for a monolithic backend, or `{"transport": "ssh", "scheduler": "slurm",
+  ...}` for a composed one. For the composed form the registry instantiates the named transport
+  and scheduler and wraps them in `BatchBackend`; the *combination* never needs its own
+  registered class, which is what keeps the matrix at N+M.
+- **Third parties register without forking.** A VO ships a package advertising any of the three
+  groups (an in-house scheduler, a site-specific transport, a bespoke CE); it is discovered
+  automatically once installed in the same environment. Nothing in interCEde is edited, and the
+  package depends only on Tier-A protocols.
+- **Discoverable and versioned.** Unlike `ObjectLoader`, the mapping from a short stable name + (`"slurm"`) to an implementation is owned by the providing package's metadata, versioned with + it, and enumerable (`importlib.metadata.entry_points`) for tooling and diagnostics ("what + backends does this environment offer?"). +- **Explicit override precedence.** Name collisions resolve by a documented rule (e.g. a + configured allow-list / last-installed-wins) so a site can shadow a built-in scheduler with its + own without patching interCEde. + +### 6. The severed "inner" subsystem (out of scope, recorded for completeness) + +`InProcess`, `Singularity` and `Pool` move to the **pilot/worker-node domain** with their own +contract, completely disconnected from `JobBackend`, and deliberately **not** modelled as +backends at all (in DIRAC they were lumped under the same `ComputingElement` base, and that +shared name is what created the false unity). Their relationship is a **depth-1 aggregator** — +*not* the full (recursive) Composite pattern, because pools-of-pools are unwanted: + +- `Runner`: execute one payload. `HostRunner` (was `InProcess`) runs it directly; + `ContainerRunner` (was `Singularity`) is a **decorator** that wraps another `Runner`'s command + in a container. Decoration is a *fixed-depth* `Runner → Runner` edge (a container around a host + runner), not aggregation — it composes exactly one runner, never a collection. +- `ExecutorPool` (was `Pool`): aggregates a collection of `Runner`s and adds bounded concurrency. + The restriction that matters is that **`ExecutorPool` is not itself a `Runner`** (the child + edge is typed `ExecutorPool → Runner`); since a pool is not a runner, it cannot be a child of + another pool. That is what makes pools-of-pools unrepresentable — without forbidding the + bounded, non-recursive `ContainerRunner`-over-`HostRunner` decoration. 
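The depth-1 aggregation rule above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pilot-side design: the payload is simplified to an argv tuple, `PayloadResult` carries only the command and exit code, and `run_all` plus the round-robin placement are hypothetical names and choices made for the example. The point it shows is that `ExecutorPool` has no `run` method, so it does not structurally satisfy `Runner` and cannot be nested inside another pool.

```python
import asyncio
from dataclasses import dataclass
from typing import List, Protocol, Sequence, Tuple, runtime_checkable

Command = Tuple[str, ...]


@dataclass(frozen=True)
class PayloadResult:
    """Hypothetical result type: the command that ran and its exit code."""
    command: Command
    exit_code: int


@runtime_checkable
class Runner(Protocol):
    """Execute one payload (simplified here to an argv tuple)."""

    async def run(self, command: Command) -> PayloadResult: ...


class HostRunner:
    """Runs the command directly on the host (the role `InProcess` had)."""

    async def run(self, command: Command) -> PayloadResult:
        proc = await asyncio.create_subprocess_exec(*command)
        return PayloadResult(tuple(command), await proc.wait())


class ContainerRunner:
    """Decorator: wraps exactly one inner Runner's command in a container prefix."""

    def __init__(self, inner: Runner, prefix: Sequence[str]) -> None:
        self._inner = inner
        self._prefix = tuple(prefix)

    async def run(self, command: Command) -> PayloadResult:
        return await self._inner.run(self._prefix + tuple(command))


class ExecutorPool:
    """Aggregates Runners with bounded concurrency. Deliberately NOT a Runner."""

    def __init__(self, runners: Sequence[Runner], limit: int) -> None:
        self._runners = list(runners)
        self._limit = limit

    async def run_all(self, commands: Sequence[Command]) -> List[PayloadResult]:
        slots = asyncio.Semaphore(self._limit)  # created inside the running loop

        async def one(index: int, command: Command) -> PayloadResult:
            async with slots:
                # Round-robin placement is an assumption of this sketch.
                runner = self._runners[index % len(self._runners)]
                return await runner.run(command)

        return list(await asyncio.gather(*(one(i, c) for i, c in enumerate(commands))))
```

With these definitions, `isinstance(pool, Runner)` is false for any `ExecutorPool`, while a `ContainerRunner` around a `HostRunner` remains a valid `Runner`, which mirrors the two edges drawn in the diagram.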
+
+```mermaid
+classDiagram
+    direction LR
+    class Runner {
+        <<interface>>
+        +run(payload) PayloadResult
+    }
+    class HostRunner
+    class ContainerRunner
+    class ExecutorPool {
+        -limit: int
+    }
+    Runner <|.. HostRunner
+    Runner <|.. ContainerRunner
+    ContainerRunner o-- Runner : wraps (fixed depth)
+    ExecutorPool o-- Runner : aggregates (pool is not a Runner)
+```
+
+This subsystem is tracked as separate work (likely against the Pilot repo); it is named here only
+so the migration map is complete. One scope warning for that work, from the consumer analysis:
+the JobAgent-side contract is richer than `run(payload)`. It also covers
+*description-for-matching* (`getDescription()` — where Pool returns a **list** of CE dicts
+implementing the MultiProcessor tag strategy), *filling-mode accounting* (`setCPUTimeLeft()`),
+*asynchronous result harvesting* (the mutable `taskResults` dict plus the `AsyncSubmission`
+flag), and Pool's per-job writes to `pilot.cfg`. The sibling ADR must scope all four, not just
+execution.
+
+### 7. Extension surface — what is extendable, and what is not
+
+interCEde is a library implemented and mocked by others, so its public surface is defined
+**explicitly** and governed by SemVer. Every public module declares `__all__`; anything not in a
+public module's `__all__`, and any module or name prefixed with `_`, is internal and may change
+without notice. Three tiers:
+
+**Tier A — Stable contracts (implement these).** The extension points third parties build
+against. Breaking changes are SemVer-major.
+
+- The protocols: the essential `Submitter`, `Monitor` and their composition `JobBackend`; the
+  collaborators `Transport`, `Scheduler`; and the optional capability protocols
+  (`OutputRetriever`, `Cancellable`, `Purgeable`, `LoadReporter`, `SupportsLiveDiagnostics`, …).
+- The data types and enums: `SubmissionSpec` (name provisional — see Open Issues), `Submission` + (bundles `.handles` and `.failures`), `JobHandle` (durable, serialisable, routing-carrying) and + its identity `JobID`, `JobStatus` (incl. `UNKNOWN`), `StatusMap`, `JobOutput`, `OpOutcome`, + `JobCounts`, `Diagnostics`, `Resources`, `FileRef`, `OutputSpec`. +- The exception hierarchy (a single rootable `InterCEdeError`). +- The registry API and entry-point group names. +- The lifecycle: backends and transports are async context managers (§2) — cheap, + side-effect-free construction; idempotent close; consumers that cache them own eviction-time + closing. + +**Tier B — Provided implementations (instantiate and register; do not subclass).** Public to +*use*, not a subclassing contract. Extend by **implementing a protocol or composing**, never by +subclassing these. + +- `BatchBackend`; the concrete transports (`SSHTransport`, `LocalTransport`, …); the concrete + schedulers (`Slurm`, `HTCondorScheduler`, `LSF`, `SGE`, `OAR`, `Torque`, `Direct`); the + monolithic CEs (`ARCBackend`, `HTCondorCEBackend`, `CloudBackend`). +- An **optional** convenience base, `BaseBackend(ABC)`, may carry *shared mechanics only* (config + normalisation, logging, retry/timeout helpers) and exposes a small, documented set of + overridable hooks. It is **not** the contract: a fully-conformant CE can be written without it, + and policy (slot accounting, credential renewal) is **not** placed in it. 
Because "no policy, + by convention" is exactly the DIRAC weakness this ADR indicts, two guardrails keep it from + re-accreting into a fat base: (1) the helpers it may contain are a **closed, enumerated set**, + and a unit test asserts it exposes *no* lifecycle method (`submit`/`status`/`fetch_output`/ + `kill`) and *no* policy method (anything like DIRAC's `available()`/`_monitorProxy`); (2) + interCEde's **own** monolithic CEs do **not** subclass it — so it can never become load-bearing + inside interCEde the way DIRAC's base became load-bearing in nine places. It exists for + third-party convenience and is tested against the same conformance suite as any other backend. + +**Tier C — Internal (do not import).** Command builders, output parsers, transport internals, +materialisation/retry machinery, anything under a `_`-prefixed module. No stability guarantee. + +The guiding rule, and the single most important lesson from DIRAC's fat base: **the preferred +extension mechanism is composition + structural typing, not inheritance.** You add a backend by +implementing a Tier-A protocol and registering it — not by subclassing a Tier-B class. + +### 8. DIRAC → interCEde migration map + +The map records *where each DIRAC piece lands*. Method-by-method porting mechanics (DIRAC's +job-ref formats, `S_OK` payload keys, CS parameter names) belong to the migration mapping +document owned by the sync-facade work (see Open Issues), not to this table. 
+
+| DIRAC (`Resources/Computing`) | interCEde destination | Notes |
+| --- | --- | --- |
+| `AREXComputingElement` | `ARCBackend` (`JobBackend` + all four optional capabilities + `OutputRetriever`) | REST; reference implementation of the full sandbox model; `killJob` → `kill`, `cleanJob` → `purge`; the `diagnose` endpoint → `SupportsLiveDiagnostics` |
+| `HTCondorCEComputingElement` | `HTCondorCEBackend` (monolithic; `JobBackend` + `OutputRetriever` + `Cancellable`) | native file transfer; destructive fetch (§2); no `LoadReporter` (reports no counts); shares the internal `_htcondor` core with `HTCondorScheduler` — composition candidate (Open Issues) |
+| `SSHComputingElement` | `BatchBackend(SSHTransport, <Scheduler>)` | composition; sandbox via `Transport.put`/`get` |
+| `SSHBatchComputingElement` | `BatchBackend(SSHMultiHostTransport, Direct)` | host routing moves into `JobHandle`; host-spreading is placement *policy* (Open Issues) |
+| `LocalComputingElement` | `BatchBackend(LocalTransport, <Scheduler>)` | remote-family despite the name; its spoofed `ssh://` job-ID hack is dropped — routing lives in `JobHandle` (§1) |
+| `CloudComputingElement` | `CloudBackend` (monolithic; essential protocols only) | no `OutputRetriever` (§2); `cleanupPilots` → `purge` |
+| `CloudProviders/{OpenNebula,…}` | custom Apache Libcloud `NodeDriver`s — Tier-C internals of `CloudBackend` | provider selection stays delegated to Libcloud (§3, Rejected Ideas); VO-pluggability open |
+| `BatchSystems/{SLURM,Condor,LSF,GE,OAR,Torque,Host}` | `Scheduler` implementations (`Slurm`, `HTCondorScheduler`, `LSF`, `SGE`, `OAR`, `Torque`, `Direct`) | promoted from ad-hoc string plugins to the typed protocol; ship-a-python-driver model dropped (§3) |
+| `BatchSystems/TimeLeft/*` | **out of scope as an interface** → pilot/worker-node side | queried from *inside* an allocation — a different vantage than the submission-side `Scheduler` (Open Issues) |
+| `ComputingElementFactory` (`ObjectLoader`) | typed registry + entry points (§5) | |
+| `submitJob` (all remote CEs) | essential `Submitter.submit` | inputs consumed at submit — the `ExecutableToKeep` handshake disappears (§4) |
+| `getJobStatus` (all remote CEs) | essential `Monitor.status` | |
+| `getJobOutput` (all remote CEs) | **optional** `OutputRetriever.fetch_output` | optional because some backends cannot serve pull-retrieval (§2) |
+| `killJob` / `cleanJob` (remote CEs) | optional `Cancellable.kill` / `Purgeable.purge` | not essential — some flows never use them |
+| `ComputingElement` base — `getCEStatus()` counts | optional `LoadReporter` (counts only) | the *availability* computation (counts vs `Max*Jobs`) is **caller policy**, removed from interCEde |
+| `ComputingElement` base — `shutdown()` (+ `QueueCECache` eviction) | async context-manager lifecycle (§2) | |
+| `ComputingElement` base — `setProxy`/`setToken`/`_monitorProxy` | **split**: backend auth → [IC-ADR-003](IC-ADR-003_credentials.md); *payload* proxy renewal → pilot/runner side | |
+| `ComputingElement` base — config hierarchy | a loader utility (Tier C), not a base-class responsibility | |
+| `InProcessComputingElement` | **out of scope** → `HostRunner` (pilot/runner subsystem, §6) | severed |
+| `SingularityComputingElement` | **out of scope** → `ContainerRunner` (decorator over a runner, §6) | severed |
+| `PoolComputingElement` | **out of scope** → `ExecutorPool` (depth-1 aggregator, §6) | severed |
+
+### 9. Consumer interface — how DiracX uses interCEde
+
+The consumer surface is deliberately small: the Tier-A protocols plus the registry factory.
+DiracX splits the old monolithic SiteDirector into **separate tasks**, and each one is typed to
+the *narrowest* protocol it needs — never to `JobBackend`, never to a concrete class. The
+registry resolves the same backend for all of them; each task sees only its slice.
One sketch —
+the submission task; the status and output tasks follow the same resolve-narrow-drive pattern:
+
+```python
+from intercede import registry, Submitter, LoadReporter
+
+# `policy` and `store` are DiracX-side collaborators, shown as free names in this sketch.
+async def submission_task(resource, want, make_payload):
+    ce: Submitter = registry.backend(resource)  # isinstance-checked at the boundary
+    if isinstance(ce, LoadReporter):  # optional -> structural narrowing
+        want = min(want, policy.slots(await ce.counts()))  # throttling policy is DiracX's
+    spec = make_payload()  # -> one SubmissionSpec
+    sub = await ce.submit(spec, count=want)  # same spec, `want` identical copies
+    await store.record(sub.handles)  # DiracX owns this store
+    # sub.failures carries any copies the backend rejected, addressable per copy
+```
+
+- **The status task** depends on `Monitor` only: it polls and records (`store.update(await
+  ce.status(handles))`), with unknown ids coming back as `JobStatus.UNKNOWN` so nothing is
+  silently lost. It literally *cannot* call `fetch_output` — which both documents intent and
+  keeps output transfers out of the polling loop.
+- **The output task** first confirms the backend has the *optional* `OutputRetriever` at all
+  (`isinstance` narrowing; Cloud lacks it, §2) and fetches into a durable `dest`. When the
+  backend declares `destructive` (§2), the remote copy is gone once the call returns, so `dest`
+  is the commit point — upload from there, and on a crash resume from `dest`, never re-fetch.
+
+What this fixes on the contract side:
+
+- **Restart re-drives handles; durability is DiracX's.** Tasks re-resolve the backend and drive
+  their slice on handles **reloaded from the DiracX store** — they never submitted anything in
+  this process, and interCEde never stored anything. This is why `JobHandle` must be serialisable
+  and self-contained (§1).
+- **The submit→record window is the consumer's to close (interCEde is stateless).** A crash after + `submit` returns but before `store.record` commits leaves jobs on the backend that DiracX has + no handle for. This is not new — DIRAC's SiteDirector has the identical window today — and + interCEde cannot fix it, having no store; the candidate closing mechanism (tag-at-submit + + list-by-tag, using the reserved `tag` field from §4) is an Open Issue. +- **Pull vs push is invisible to interCEde.** A pull SiteDirector and a push PushJobAgent run the + *same* submission task; only *where the payload comes from and when* differs, and that lives in + the consumer. +- **Policy is the consumer's, mechanism is interCEde's.** Every `policy.*` call (throttling, + expiry, retries, "what is a pilot") is consumer-side; interCEde supplies only the verbs. + Optional capabilities are reached by structural narrowing, never assumed. + +## Rationale + +- **Protocols over a fat base class.** The thing callers depend on is a *contract*; a Protocol + expresses it without coupling implementations to interCEde's base, which is exactly what a + library implemented by third parties and mocked against containers needs. DIRAC's pain came + from the opposite: a fat base plus "abstract by convention" stubs with no enforcement and + drifting signatures. `@runtime_checkable` keeps an `isinstance` gate at the registry boundary; + a real `@abstractmethod` ABC is available where forcing an override is genuinely wanted. +- **Composition over inheritance for access × scheduler.** Inheritance forces N×M leaf classes or + fragile multiple inheritance; composition makes it N+M and is the literal expression of the + project's "backends combine" promise. DIRAC already proved the bone is sound (`SSH` + + `BatchSystems`); interCEde just applies it uniformly and gives the scheduler side a real typed + contract. 
This is well-trodden ground: HTCondor's **blahp** translates one protocol into + per-LRMS submit/status/cancel scripts, and ALICE's JAliEn has per-batch `BatchQueue` drivers — + the `Scheduler` protocol is the same idea, in typed Python. Their async + request-id/`RESULTS`-poll model is also why interCEde's contract is async, bulk, and poll-based + (§2) rather than blocking (see Rejected Ideas for why we reimplement rather than reuse them). +- **Capability segmentation, sized by real callers.** Interface Segregation, with the DiracX-task + split as the evidence: a status task wants `Monitor` and an output task wants `OutputRetriever`, + each *without* `submit` and without each other — so these are separate protocols, not one + "operator". `submit` and `status` are the two *essential* ones (every usable backend has them); + output-retrieval is a real task but optional, because a backend that cannot serve + pull-retrieval must not be forced to stub it — the same reasoning that makes `LoadReporter` and + independent diagnostics optional. Segment to match real callers, no further — and keep *policy* + (the counts → "is there room?" decision) in the caller, not the backend. (Counts-only was + checked against DIRAC: no remote-family caller consumes more than `{running, waiting}` — the + processors-cap math in the base `available()` is fed only by the severed inner CEs, so it moves + with the pilot-side subsystem.) +- **Severing the inner CEs.** Liskov is the test: no caller ever holds a `JobBackend` and uses it + without knowing whether it is remote or inner, because remote and inner serve *different + actors* with disjoint mechanics. The shared base bought only the factory and some config + boilerplate — both replaceable. Removing it deletes a false abstraction at near-zero cost. +- **Sandbox generalisation.** Payload-agnostic ⟹ sandbox-general. 
Returning a streamed manifest + of paths (not in-memory contents) is the only model that scales to real output, and it unifies + stdout/stderr/output-files/CE-log behind one retrieval. +- **Explicit extension surface.** Uncontrolled subclassing of concrete classes is how the DIRAC + fat base became load-bearing in nine places. A defined three-tier surface lets interCEde evolve + Tier-B/C freely while third parties depend only on Tier-A — which is what makes SemVer + meaningful for a plugin ecosystem. + +## Evolution & non-conforming backends + +No abstraction is eternal; the design's job is to make change cheap and localised, not to predict +the future. Three cases: + +1. **A new backend that fits the shape — trivial, by construction.** A scheduler/transport/CE + that is still "delegate-and-poll" (most new batch systems and CEs are) is added by + implementing the relevant protocol, registering an entry point, and shipping a container + conformance test. No core change — this is the designed-for case, and the reason discovery is + data-driven. + +2. **A backend that does not fit the shape — branch, don't contort.** A future resource may be + fundamentally different: serverless/FaaS with no persistent handle, a callback/push resource + that calls the consumer, a long-lived interactive session, or something with no output-sandbox + notion. The rule — the same one applied when this ADR severed the inner CEs — is **when the + semantics differ, add a sibling abstraction; never overload `JobBackend`.** Additive + differences become optional capability protocols; a genuinely different lifecycle gets its own + protocol and its own registry group. The monolithic-CE escape hatch already lets a + weird-but-pollable backend (e.g. Kubernetes: submit a Job object, poll, fetch logs, delete) + implement the contract directly with whatever internals it needs. + +3. 
**The contract itself stops making sense — manage it deliberately.** This is what ADRs, SemVer + and the narrow surface are for: + - Consumers depend only on small Tier-A protocols resolved by data, so implementations + (Tier B/C) can be rewritten freely — schedulers moving from CLI to REST changes only + `Transport`/`Scheduler` bodies, not the contract. + - A breaking change is a **new ADR superseding this one** plus a SemVer-major with a + deprecation window. + - Protocols are structural and versionable: introduce `JobBackendV2`, expose both through the + registry, bridge with adapters, and migrate backends one at a time — old and new run side by + side. + - **Prefer additive evolution** — a new optional capability protocol over a core change. The + core is kept deliberately small precisely to minimise what can break. + - The container conformance suite is the early-warning system: change an interface and the + per-backend tests show what breaks before any user does. + + The meta-principle: keep the core minimal, push variation into optional capabilities and + composition, and treat "this does not fit" as a signal to branch a new abstraction rather than + bend the old one. + +## Rejected Ideas + +- **Keep DIRAC's single fat `ComputingElement` base.** Conflates contract and policy, enforces + nothing, and is the documented source of the signature drift and the Remote/Inner/Pool + confusion. +- **One god-interface with every method mandatory.** Forces backends that lack a capability to + ship `S_ERROR`/`NotImplementedError` stubs — exactly what DIRAC's + `HTCondorCEComputingElement.getCEStatus()` already is. Additive optional protocols replace it — + even output-retrieval is optional (§2). +- **A separate `fetch_log` retrieval operation.** Breaks on destructive-fetch backends (§2), + where retrieval deletes the spool — the log must come out in the *same* operation as the + output. 
The log is therefore a manifest member of the single `fetch_output`; only ARC's + genuinely independent, non-destructive `diagnose` endpoint justifies a separate (optional) + diagnostics call. +- **Model access × scheduler with inheritance** (N×M classes, or a mixin lattice). Combinatorial + and fragile; composition is strictly better here. +- **Keep the inner CEs under `JobBackend`** (even as a sibling branch). They are never + substituted for remote CEs and share no mechanics; the common supertype is fictitious. +- **Use the full recursive Composite pattern for the pool.** Its defining feature — a composite + containing components, hence composites-of-composites — yields pools-of-pools, which we do not + want. A depth-1 aggregator with a leaf-typed child edge is the right restriction. +- **stdout/stderr-only output (DIRAC's default).** Incompatible with payload-agnosticism; ARC + already shows the general model is feasible, and HTCondor/SSH can do it natively or via the + transport. +- **Subclassing concrete CEs as the extension mechanism.** Re-creates the fat-base coupling; the + registry + protocol path keeps the provided implementations free to change. +- **`ObjectLoader` string-to-module discovery.** Untyped and undiscoverable; entry points give + typed, packageable, third-party-friendly registration. +- **Keep the CE layer inside DIRAC/diracx (no standalone library).** Rejected. A standalone + library must serve **both** DIRAC (sync, throughout the migration) and DiracX (async) from one + contract — an in-`diracx` subpackage cannot back the sync DIRAC side — and it lets + backend/CE-version support ship on its own cadence, without a DiracX release. It also isolates + the containerised test matrix from DiracX's CI, gives third-party VOs an entry-point extension + path that does not fork DiracX, and turns the Tier-A/B/C boundary into a packaging fact rather + than a convention. 
The cost — a second release train, SemVer ceremony, and diracx↔intercede + version skew — is accepted in exchange. (Were DiracX ever the sole consumer *and* the DIRAC + transition complete, folding it back in-tree would be worth revisiting.) +- **One repo for all DIRAC resources — or adding Storage/Catalog to interCEde.** Rejected: the + library boundary is a *coherent contract + testing domain*, so the repo count follows the + number of distinct contracts — contract-driven, not taxonomy-driven — and that count is small + and stable. + - *submit / monitor / retrieve* is one contract with one collaborator model + (`Transport` × `Scheduler`), one test rig (containerised schedulers), one consumer (the WMS). + Storage Elements (`put`/`get`/space, over XRootD/S3/gsiftp) and File Catalogs + (`register`/`query`/metadata, over DFC/Rucio) are *different contracts* with different + backends, test rigs and consumers (the Data Management System) — **sibling libraries**, not + tenants of interCEde. + - Merging them is the repo-level form of the fat-base false-unity this ADR rejects ("when + semantics differ, add a sibling abstraction, never overload the type" — §6, Evolution). DIRAC + itself never unified them: `Resources/{Computing,Storage,Catalog,…}` have distinct base + classes and factories, with no common `Resource` supertype. + - What siblings *may* share is **plumbing, never a contract** — the entry-point registry + mechanism, the typed-error base, and the async/sync-facade conventions could later factor + into a small `dirac-resources-core`. That factoring is a separate, deferrable decision, not a + reason to merge contracts now. +- **Reuse an existing job-submission abstraction instead of defining our own.** Surveyed and + rejected — nothing covers interCEde's target set (SSH+batch, ARC-CE REST, HTCondor-CE, Cloud) + as a modern async, typed Python library. 
HTCondor's blahp/GAHP is a process protocol, not a + library, and has no SSH story; PanDA/Harvester has the right plugin shape but is inseparable + from the PanDA server; JAliEn's batchqueue layer is Java, unreleased and licence-unclear; and + **none** of the generic scheduler libraries (RADICAL-SAGA, PSI/J, DRMAA, Parsl, AiiDA, + Dask-Jobqueue) speaks ARC at all, in any interface generation. The genuinely reusable layer is + the WLCG grid-CE ecosystem itself plus first-party per-backend clients (`pyarcrest`, the + `htcondor` bindings, Apache Libcloud), which interCEde composes behind a typed async contract. + The per-tool evidence is in the [reuse survey note](../notes/reuse-survey.md); blahp's per-LRMS + scripts and async poll model are worth learning from (see Rationale). + +## Open Issues + +Items tagged **(blocking)** need an answer before this ADR moves to Accepted; **(deferred)** +items are follow-up work that does not gate acceptance. + +- **Naming at the API boundary (blocking).** Keep `Job*` only for the *scheduler* handle + (`JobHandle`, `JobID`, `JobStatus` — accurate and namespaced). The submitted description is + provisionally `SubmissionSpec` — `JobSpec` clashes with DiracX Jobs, `Payload` with the + pilot-side payload. Confirm, or pick another neutral name. +- **Destructive-fetch confirmation (blocking).** Confirm the HTCondor ≥ 25.8 one-shot spool + behaviour (§2) against the release notes, whether setting `leave_in_queue` restores ARC-like + re-fetchability, and that every consumer reads `destructive` before assuming it can re-fetch. +- **HTCondorCE live pilot-log fetch (blocking).** DIRAC operators fetch running-pilot logs from + both AREX *and* HTCondorCE (`dirac-admin-get-pilot-logging-info`, via the CE's `getJobLog`). 
+ Under this ADR the final log is a `fetch_output` manifest member and only ARC gets + `SupportsLiveDiagnostics` — which would regress the operator flow for running HTCondorCE + pilots, because HTCondorCE's log fetch rides the same spool-transfer machinery as output + retrieval. Decide: implement `SupportsLiveDiagnostics` on `HTCondorCEBackend` over that + machinery (documenting that a fetch against a *completed* spooled job shares the destructive + semantics, §2), or record the regression as accepted. +- **Submit idempotency / list-by-tag (deferred).** The submit→record crash window (§9) is the + consumer's to close; the realistic mechanism is **tag-at-submit + list-by-tag** — stamp a token + on the jobs at submit, list jobs carrying it after a crash, reconcile before re-submitting. The + `tag` field is already reserved on `SubmissionSpec` (§4, additive and non-breaking); the + deferrable part is the enumeration capability (a future optional protocol). `LoadReporter`'s + aggregate counts cannot substitute for it. +- **Cloud staging (deferred).** Whether `CloudBackend` later grows a limited `OutputRetriever` + (a transport *into* the VM), and whether "declared staging capability" deserves its own + optional protocol (§2 records why it has none today). +- **Cloud-provider drivers (deferred).** interCEde keeps *composing* Libcloud + (`get_driver`/`set_driver` stays the provider-selection mechanism) rather than defining its own + driver protocol (§3). Open: whether VO-pluggable clouds warrant a thin + `intercede.cloud_drivers` entry-point group over Libcloud's registry, and whether the + OpenNebula driver is maintained in-tree (Tier C) or upstreamed to Libcloud. +- **Home of the severed subsystem (deferred).** Where `Runner`/`ExecutorPool` live (Pilot repo vs + a new small library) — a separate ADR/tracking item (§6 lists the scope warnings for it). 
+- **`HTCondorCEBackend` as composition (deferred).** It shares the `_htcondor` core with + `HTCondorScheduler`; expressing it as `BatchBackend(HTCondorCETransport, HTCondorScheduler)` + would need a transport-level notion of a "destructive get", and the CE fronts a *different* + scheduler than a raw schedd (its own blahp hides the site's real batch system). Share the + internals now; revisit the shape later. +- **Multi-host SSH placement (deferred).** DIRAC's `SSHBatch` spreads a submission across hosts + by free slots and encodes the host in the job ref. The routing belongs in `JobHandle` (§8); the + spreading decision is placement **policy**. Decide when this backend is scheduled: caller + policy (consistent with the mechanism/policy split) vs a declared multi-host transport + capability. +- **`get_time_left()` placement (deferred — confirm).** It is queried from *inside* a running + allocation — a different vantage and consumer than the submission-side `Scheduler`. Kept + **off** the `Scheduler` protocol to preserve role cohesion; the batch-system parsing is shared + internal code, exposed via a separate pilot-facing capability if the pilot needs it. Confirm + this split (and payload-proxy renewal, likewise pilot-side). +- **DIRAC compatibility shim (deferred).** DIRAC is synchronous and `S_OK`/`S_ERROR`-based; the + resolution is a sync **facade** generated over the async core (run-to-completion + + `S_OK`/`S_ERROR` translation), usable only by sync callers outside an event loop — **not** a + hand-maintained parallel sync protocol, which would re-introduce the two-surfaces drift this + ADR fights. One design constraint is fixed now, because getting it wrong breaks cached + backends: the facade must own a **single long-lived background event loop** in a dedicated + thread and dispatch every call onto it (`run_coroutine_threadsafe`) — never per-call + `asyncio.run()`. 
Backend connection state (HTTP sessions, SSH connections) is bound to the + event loop it was created on, so a fresh loop per call breaks any backend reused across agent + cycles (the §2 lifecycle/caching model); the facade likewise exposes a sync `close()` over the + async close so cache eviction works. Sync callers inside DIRAC's Tornado-based services are + expected to be safe (handlers run in worker threads, with no running loop in the calling + thread) — verify once. Confirm the facade's lifetime (transition-only vs kept for sync + third-party tools). The facade work also owns the **migration mapping document** — the + translation table from DIRAC's load-bearing conventions (job-ref formats, `PilotStampDict`, + `S_OK` payload keys, CS parameter names, `Tag: Token[:vo]` vs IC-ADR-003 requirements) to + interCEde's. The mapping document lives in interCEde docs; the adapter code lives DIRAC-side. +- **Heterogeneous submit batches (deferred).** `submit(spec, count)` submits `count` *identical* + copies, which map to native array submission (HTCondor `queue N`, Slurm `--array`). A + payload-agnostic push consumer (PushJobAgent) may want a *heterogeneous* batch of distinct + specs. Decide whether to add a `submit(specs: Sequence[SubmissionSpec])` overload (which loses + native-array efficiency on most backends) or leave heterogeneous batches as N separate calls. +- **Staging as a third collaborator (deferred).** Staging currently rides on `Scheduler` + (`stages_own_files` + hooks, §3), and the HTCondorCE-as-composition idea would need a + "destructive get" on the *transport* — two cross-axis concerns before any code exists. + Evaluate promoting staging to its own collaborator protocol (a `Stager` strategy composed + alongside `Transport`/`Scheduler`). +- **Capability declaration vs structural sniffing (deferred).** Dispatch narrows by `isinstance` + against `@runtime_checkable` protocols, which check method *presence only* (§2). 
Evaluate an + explicit capability declaration (registry metadata, or a `capabilities()` set) as a sturdier + gate, keeping `isinstance` as the smoke-test. +- **Loud partial-failure results (deferred).** Bulk verbs return per-`JobID` outcome maps, which + a caller can silently ignore (unlike a raised exception). Consider a result type that makes + undrained failures loud (a `.raise_for_failures()` helper, or logging on drop) plus single-job + convenience wrappers over the bulk verbs. + +Note: the backend credential/auth model was an open issue here and is now its own ADR — +[IC-ADR-003](IC-ADR-003_credentials.md) (typed credentials, backend-declared requirements, +provider-based supply). Only *backend* auth is interCEde's; payload credential renewal stays +pilot-side. diff --git a/docs/adr/IC-ADR-002_integration_tests.md b/docs/adr/IC-ADR-002_integration_tests.md new file mode 100644 index 0000000..9ef4c7a --- /dev/null +++ b/docs/adr/IC-ADR-002_integration_tests.md @@ -0,0 +1,398 @@ +# IC-ADR-002: Integration testing against containerized backends + +## Metadata + +- **Created By:** Alexandre Boyer +- **Date:** 2026-07-02 +- **Status:** Draft +- **Decision Maker(s):** DIRACGrid maintainers +- **Stakeholders:** interCEde contributors and maintainers; DIRACGrid CI maintainers +- **Depends on:** IC-ADR-001 (core architecture: Protocols, Transport × Scheduler composition, capability segmentation) + +> **Scope and altitude.** Direction-setting for the *test architecture*: what runs (real backend +> daemons in per-stack compose environments), where (GitHub-hosted CI, on every PR), and how +> versions are managed (pins + Renovate + support windows). The contents of individual tests, the +> exact Dockerfiles, and the CI YAML are implementation, governed by the architecture decided +> here. 
+ +## Abstract + +interCEde's contract is expressed as structural `Protocol`s exercised against fakes, which structurally cannot catch the bugs interCEde exists to absorb: version-specific daemon behaviour, destructive output retrieval, and status-mapping quirks. This ADR decides to add integration tests that run **real backend daemons** — in the combinations interCEde supports — in GitHub-hosted CI on every pull request, within a standard runner. The unit of everything is a **stack**: a named, self-contained backend environment defined by one `docker compose` file plus versioned configuration, declared in a single `stacks.yml` manifest from which the CI matrix is generated. A single, backend-agnostic contract suite is dispatched across stacks by capability (protocol narrowing), mirroring how the library itself handles capability variance. Backend images build from pinned RPMs, are prebuilt and pushed to GHCR, and every version — image tags, Dockerfile `ARG`s, Python dependencies — is pinned and tracked by Renovate, so an upstream release that breaks a contract surfaces as a red, bisectable Renovate PR rather than tribal knowledge. Credentials are generated ephemerally per run and never committed. A non-blocking weekly canary runs the same matrix against unpinned tags for early warning. Adding a backend or configuration is a manifest entry plus files, with no workflow edits. + +## Motivation + +interCEde's core architecture is deliberately mock-friendly: everything is a `Protocol`, backends +are composed, and the unit test suite exercises contracts against fakes. That is also its greatest +testing risk. A structural protocol plus a fake will happily agree with each other forever while +the real ARC REST endpoint, the real `condor_submit -spool` handshake, or the real +`sbatch`-over-SSH quoting rules drift away underneath. 
The bugs interCEde exists to absorb — +version-specific daemon behaviour, destructive output retrieval, status-mapping quirks — are +precisely the bugs that unit tests structurally cannot see. + +DIRAC's history makes the point concretely: HTCondor ≥ 25.8 silently changed output retrieval for +spooled jobs into a one-shot operation. No unit test anywhere would have caught that; only a test +that submits a real job to a real schedd and fetches its output twice would go red. + +We therefore need integration tests that: + +1. run real backend daemons, in the combinations interCEde actually supports; +2. run in GitHub-hosted CI, on every pull request, within the resources of a standard runner + (4 vCPU / 16 GB on public repositories); +3. surface **new upstream versions** (ARC, HTCondor, Slurm, OpenSSH, and our own Python + dependencies) as reviewable Renovate pull requests whose CI result *is* the compatibility + verdict; +4. support **multiple configurations per backend combination** eventually (token vs. proxy auth, + shared vs. non-shared filesystem, alternative queue setups) while starting with exactly one + basic configuration each. + +### Initial target combinations + +| Stack ID | Backend under test | interCEde components exercised | +|---|---|---| +| `arc-slurm` | ARC 7 CE (REST) → Slurm LRMS | `ARCBackend` (AREX REST), end-to-end through a real LRMS | +| `arc-condor` | ARC 7 CE (REST) → HTCondor LRMS | `ARCBackend` against the other major LRMS | +| `htcondor` | HTCondor-CE → HTCondor pool | `HTCondorCEBackend`; plain-schedd Condor scheduler tests reuse the same pool | +| `ssh-slurm` | sshd → Slurm | SSH transport × Slurm scheduler | +| `ssh-condor` | sshd → HTCondor (schedd) | SSH transport × Condor scheduler | +| `local-slurm` | Slurm, tests run *inside* the container | Local transport × Slurm scheduler | + +`ssh-*` and `local-*` share the same images — the local variant simply executes pytest inside the +batch container instead of connecting over port 22. 
Which of `ssh-slurm` / `local-slurm` lands +first is an implementation detail; the architecture treats them identically. Local × HTCondor +(first-class in IC-ADR-001's driver list) is deliberately not an initial stack: the Local +transport is exercised by `local-slurm` and the Condor scheduler by `ssh-condor`, so the +combination adds no new axis; it can be added later as a manifest entry reusing the `ssh-condor` +stack (`exec_in_container: true`). + +### Facts constraining the design + +- **HTCondor** publishes official role images on Docker Hub (`htcondor/mini`, `htcondor/cm`, + `htcondor/execute`, `htcondor/submit`), version-tagged (`-el9`, `lts-el9`). These are + Renovate-trackable out of the box. +- **HTCondor-CE** has no first-party image with clean semantic tags (the OSG images use date tags + and OSG-series semantics); an in-repo Dockerfile installing a pinned `htcondor-ce` RPM is more + Renovate-friendly than chasing OSG tags. +- **ARC** has no maintained official Docker image. The supported path is installing pinned + `nordugrid-arc-*` RPMs on EL9 in our own Dockerfile. interCEde targets **ARC 7 and later + only** — ARC 6 is out of scope (see §8). Crucially, ARC ships *zero configuration*: + a minimal working CE with a **Test-CA and host certificate generated at install time** + (`arcctl test-ca`), which removes the entire grid-PKI problem from CI. +- **Slurm** has no official image; community images are unmaintained to varying degrees. An + in-repo Dockerfile with a pinned Slurm package version (single node running `slurmctld` + + `slurmd` + `munged`) is small and fully under our control. +- **Renovate** is already configured in this repository (pip/pep621/github-actions managers). It + additionally supports `dockerfile` and `docker-compose` managers (image tags), and + `customManagers` (regex) for versions embedded in Dockerfiles as `ARG`s — the standard + `# renovate: datasource=... depName=...` comment convention. 
+- GitHub Actions **service containers** cannot express what we need (no compose, no build step, + no ordered startup, poor log access). Docker and `docker compose` are preinstalled on + `ubuntu-latest`, and compose v2 supports `up --wait` gating on healthchecks. + +## Specification + +### 1. A *stack* is the unit of everything + +A **stack** is a named, self-contained backend environment: + +``` +tests/integration/ +├── stacks.yml # the manifest: single source of truth for the CI matrix +├── conftest.py # stack-aware fixtures, capability-based skipping +├── test_submit.py # ONE backend-agnostic contract suite … +├── test_status.py +├── test_output.py # … including the fetch-twice destructiveness test +├── test_kill.py +└── stacks/ + ├── _images/ # Dockerfiles shared across stacks + │ ├── arc/Dockerfile # EL9 + pinned nordugrid-arc RPMs (ARG ARC_VERSION) + │ ├── slurm/Dockerfile # EL9 + pinned slurm packages (ARG SLURM_VERSION) + │ └── htcondor-ce/Dockerfile # EL9 + pinned htcondor-ce RPM (ARG HTCONDOR_CE_VERSION) + ├── arc-slurm/ + │ ├── compose.yml + │ └── config/ + │ └── basic/ # arc.conf, slurm.conf, intercede-client.toml + ├── arc-condor/… + ├── htcondor/… + └── ssh-slurm/… # sshd enabled; local-slurm reuses this stack +``` + +Rules: + +- One `compose.yml` per stack. Shared plumbing (network, credential volume, healthcheck blocks) + is factored with YAML anchors or compose `include:`, **not** with a generated mega-file. + A stack must be runnable locally with exactly + `docker compose -f tests/integration/stacks/arc-slurm/compose.yml up --wait` — CI does nothing + a developer cannot do on a laptop. +- Every configurable surface lives under `config//`. Today each stack has only + `config/basic/`. A *configuration* is mounted into the containers (arc.conf, condor config, + slurm.conf) **and** consumed by the test client (`intercede-client.toml` describing endpoint, + credential type, queue names). 
Backend config and client config change together, atomically, + under one name — this is what makes future configuration-matrix growth mechanical rather than + a refactor. +- `stacks.yml` is the manifest the CI matrix is generated from: + + ```yaml + stacks: + - id: arc-slurm + configs: [basic] + versions: [latest] # ARC 7.x leading edge (support window in §8) + markers: "remote and arc" + - id: htcondor + configs: [basic] + versions: [lts, latest] # HTCondor 24.0 LTS anchor + leading edge + markers: "remote and htcondor" + - id: ssh-slurm + configs: [basic] + markers: "scheduler and slurm" + exec_in_container: false + - id: local-slurm + stack: ssh-slurm # reuses the ssh-slurm compose stack + configs: [basic] + markers: "scheduler and slurm" + exec_in_container: true + ``` + + `versions` names the entries of the backend's support window to run (§8); a bare or absent + `versions` means the single leading-edge pin. Adding a stack, a configuration, or a supported + version is one manifest entry plus files — no workflow edits. + +### 2. One contract test suite, capability-dispatched + +There is a **single** integration test suite, not one per backend. Tests are written against the +interCEde protocols and parameterized by the stack's client configuration. Capability variance is +handled exactly the way the library itself handles it — protocol narrowing: + +```python +async def test_output_refetch(backend, submitted_job, tmp_path): + if not isinstance(backend, OutputRetriever): + pytest.skip("backend has no output retrieval") + first = await backend.fetch_output([submitted_job], dest=tmp_path) + ... +``` + +This symmetry is deliberate: the integration suite is the executable form of the contract. If a +capability check works in the test harness, it works for DiracX; if a backend's structural typing +lies (the `runtime_checkable` presence-only problem from IC-ADR-001), the integration suite is +where the lie is caught against a real daemon. Backend-specific quirks (e.g. 
the HTCondor +one-shot retrieval semantics) get dedicated marker-selected tests rather than conditionals inside +generic ones. + +Markers (`remote`, `scheduler`, `arc`, `htcondor`, `slurm`, `ssh`, `destructive_fetch`) select +the applicable subset per stack via the manifest's `markers` expression. + +### 3. Prebuilt stack images on GHCR + +ARC and Slurm images build from RPMs; rebuilding them on every PR wastes 5–10 minutes per job. A +separate `build-images` workflow builds and pushes +`ghcr.io/diracgrid/intercede-testenv/{arc,slurm,htcondor-ce}:` whenever the Dockerfiles +change (including when Renovate bumps a pinned `ARG`), plus a weekly rebuild for base-image +security updates. PR CI **pulls** by digest. + +Daemons run in the foreground under a minimal supervisor (or a plain entrypoint script), **never +systemd** — privileged systemd containers are fragile on GitHub runners and hide daemon logs from +`docker logs`. + +### 4. Ephemeral credentials, generated per run + +No credential is ever committed: + +- **ARC stacks**: the container entrypoint runs `arcctl test-ca init` etc., generating the CA, a host + certificate, and a client credential; client credentials are written to a shared + `credentials` volume that the test harness reads. Token auth is added later as a second + configuration (`config/token/`), not baked into `basic`. +- **HTCondor / HTCondor-CE**: `IDTOKENS` auth; the CE entrypoint mints a token into the shared + volume. No GSI, no VOMS in `basic`. +- **SSH stacks**: `ssh-keygen` in a job step, public key mounted into the container's + `authorized_keys`, known-hosts pinned from `ssh-keyscan`. + +The shared credentials volume *is* the interface between stack and harness; its layout +(`credentials//…`) is part of the stack contract. + +### 5. 
Manifest-driven GitHub Actions matrix + +``` +.github/workflows/integration.yml +├── job: matrix — reads stacks.yml, emits JSON via fromJSON() +└── job: integration — matrix: {stack × config × version}, fail-fast: false + ├── docker compose up --wait (healthchecks gate readiness; version → image tag) + ├── pixi run pytest tests/integration -m "" --stack= --config= + │ (or `docker compose exec` for exec_in_container stacks) + └── on failure: `docker compose logs` + backend log dirs uploaded as artifacts +``` + +- Each stack runs in its **own job**: parallel wall-clock, isolated failure blast radius, and each + stays comfortably inside a 4 vCPU runner. There is no combined all-backends job. +- Healthchecks are mandatory in every compose service: ARC (`curl -k` the REST endpoint's info + URL), HTCondor (`condor_status -limit 1` / `condor_ce_status`), Slurm (`sinfo` reporting the + node up), sshd (TCP probe). `up --wait` then replaces every hand-rolled sleep-and-retry loop. +- The integration workflow is `workflow_call`-reusable, invoked from `ci.yml` after unit tests, + and required for merge. Log artifacts on failure are non-negotiable: a red integration job + without daemon logs is a re-run generator, not a signal. + +### 6. Renovate as the version radar — two lanes + +**Lane 1 — pinned, blocking (PRs).** Everything CI runs against is pinned: + +- image tags in `compose.yml` files → `docker-compose` manager (enable in `renovate.json`); +- base images and pinned package versions in Dockerfiles → `dockerfile` manager plus + `customManagers` regex on annotated `ARG`s: + + ```dockerfile + # renovate: datasource=repology depName=fedora_epel_9/nordugrid-arc + ARG ARC_VERSION=7.1.1 + ``` + +- Python dependencies → existing pep621 manager, unchanged. 
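
The annotated-`ARG` convention in Lane 1 amounts to a two-line pattern match. The following is a minimal Python sketch of the kind of pattern a `customManagers` regex entry might encode, assuming the annotation comment sits on the line directly above the `ARG`. The names `PINNED_ARG` and `find_pinned_args` are hypothetical, and the pattern is illustrative rather than Renovate's own implementation.

```python
import re

# Illustrative pattern (an assumption, not Renovate's built-in regex):
# a "# renovate:" annotation immediately followed by an ARG *_VERSION line.
PINNED_ARG = re.compile(
    r"#[ \t]*renovate:[ \t]*datasource=(?P<datasource>\S+)[ \t]+depName=(?P<dep_name>\S+)[ \t]*\n"
    r"ARG[ \t]+(?P<arg>\w+_VERSION)=(?P<version>\S+)"
)


def find_pinned_args(dockerfile_text: str) -> list[dict[str, str]]:
    """Return one dict per annotated ARG found in the Dockerfile text."""
    return [m.groupdict() for m in PINNED_ARG.finditer(dockerfile_text)]


# The Dockerfile fragment shown in this section, reused as sample input.
SAMPLE = (
    "# renovate: datasource=repology depName=fedora_epel_9/nordugrid-arc\n"
    "ARG ARC_VERSION=7.1.1\n"
)

pins = find_pinned_args(SAMPLE)
```

An `ARG` without the annotation is deliberately not matched, so only versions that carry an explicit datasource are picked up for tracking.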
+ +For a backend with a support window (§8), Renovate tracks **only the leading-edge ARG**; the older +anchor ARGs are pinned and constrained (`packageRule` `allowedVersions`/`matchCurrentVersion`, or +`enabled: false`) so they accept only patch bumps within their major and never jump it. + +A Renovate bump PR (e.g. HTCondor 25.7 → 25.8) triggers image rebuild + full integration matrix. +**A red Renovate PR is the feature**: it is the earliest, cheapest, fully-reproducible signal +that an upstream release broke an interCEde contract — the 25.8 destructive-fetch change would +have surfaced as exactly this. Backend bumps are grouped per backend (`arc`, `htcondor`, +`slurm`), never mixed, so a red PR names its culprit. + +**Lane 2 — unpinned canary (scheduled, non-blocking).** A weekly scheduled workflow runs the same +matrix against `latest`/nightly backend tags. It cannot block merges; it exists to catch +upstream changes *before* they reach the stable tags Renovate tracks. Failures open a +deduplicated issue via workflow automation. + +### 7. Configuration growth path + +The full matrix is `(stack × config × version)`, starting as `(6 × basic × leading-edge)`. Planned +configuration axes, added as `config//` directories and manifest entries — each an additive +change: + +- auth variants (token vs. proxy for ARC; token lifetimes for HTCondor-CE); +- filesystem variants (shared vs. non-shared session directories — the `stages_own_files` axis); +- queue topology (multiple queues, per-queue limits); +- resource constraints (memory/CPU limits surfacing scheduler translation bugs); +- a batch-host variant with **no usable Python interpreter**, asserting IC-ADR-001 §3's + no-interpreter-on-the-remote-host property (DIRAC's SSH CE shipped a Python driver to the host; + interCEde must never need one). + +PR CI runs `basic` only; the scheduled workflow runs the full configuration set. This keeps PR +latency flat as the configuration matrix grows. + +### 8. 
Backend version support window + +interCEde is a *client* that must interoperate with backend daemons whose version it does not +control. Two kinds of "version" therefore live in this repo and must not be conflated: + +- **Versions interCEde ships** — Python dependencies, base images. We own these; "bump to latest" + is correct and Renovate does exactly that (§6). +- **Backend server versions interCEde interoperates with** — ARC, HTCondor(-CE), Slurm daemons. + Here a pinned version is a *test fixture standing in for a version some site runs*, not a + dependency we upgrade. Real sites run a spread — HTCondor LTS years after release, older ARC 7 + where the latest hasn't rolled out — so testing only the newest daemon answers the wrong + question. + +Each backend therefore declares a **support window**: the set of upstream versions interCEde +claims to work against, expressed as an *oldest/LTS anchor* plus the *leading edge*. Version is a +matrix axis alongside `config` (the `versions:` key in `stacks.yml`, §1). + +- **Renovate tracks only the leading edge.** The leading-edge `_VERSION` ARG bumps + normally — this is Lane 1's radar unchanged: a new upstream release arrives as a red-or-green PR + that names its culprit. +- **Older anchors are human-managed.** Each anchor ARG is pinned and constrained (§6) so Renovate + offers only patch bumps within the anchor's major. **Moving the window — dropping an old major, + adopting a new one — is a deliberate human PR**, because dropping support for a version sites + still run is a decision, not an auto-merge. +- **PR latency stays flat.** As with `config`, PR CI runs a representative subset (leading edge, + plus one anchor for the backend most exposed to version skew); the scheduled workflow runs the + full version cross-product. 
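
The `(stack × config × version)` cross-product can be derived mechanically from the manifest. The sketch below assumes the manifest has already been parsed into Python dicts (the real file is YAML), that a bare or absent `versions` key falls back to the single `latest` entry, and that `stack` defaults to the entry's own `id`. The function name `expand_matrix` and the shape of the emitted job dicts are hypothetical.

```python
from itertools import product


def expand_matrix(manifest: dict) -> list[dict]:
    """Expand manifest entries into one CI job per (stack, config, version)."""
    jobs = []
    for entry in manifest["stacks"]:
        configs = entry.get("configs", ["basic"])
        # A bare or absent `versions` means the single leading-edge pin.
        versions = entry.get("versions") or ["latest"]
        for config, version in product(configs, versions):
            jobs.append({
                "id": entry["id"],
                # `stack` lets an entry reuse another entry's compose stack.
                "stack": entry.get("stack", entry["id"]),
                "config": config,
                "version": version,
                "markers": entry.get("markers", ""),
                "exec_in_container": entry.get("exec_in_container", False),
            })
    return jobs


# Two entries taken from the stacks.yml example in §1.
MANIFEST = {
    "stacks": [
        {"id": "htcondor", "configs": ["basic"], "versions": ["lts", "latest"],
         "markers": "remote and htcondor"},
        {"id": "local-slurm", "stack": "ssh-slurm", "configs": ["basic"],
         "markers": "scheduler and slurm", "exec_in_container": True},
    ]
}

jobs = expand_matrix(MANIFEST)
```

Under these assumptions the `htcondor` entry yields two jobs (one per version) and `local-slurm` yields one that points at the `ssh-slurm` compose stack. The list could then be serialised with `json.dumps` for the matrix job to emit.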
+ +Multi-version testing is also what *justifies* version-conditional client code: the HTCondor +≥ 25.8 one-shot-retrieval shim is exactly such a case, and the window is where it earns its +regression coverage. + +**ARC is supported from version 7 only.** ARC 6 (pre-REST era: GridFTP/EMI-ES data staging, LDAP +infosystem) and ARC 7 (REST-first, `arcctl`) are close to different backends, and interCEde's AREX +path is REST-native. ARC 6 is explicitly out of scope, so the ARC window's floor is 7.x, with no +v6 anchor. + +## Rationale + +The design follows directly from the drivers above: real daemons over fakes (to catch what +structural typing cannot), per-stack isolation over a shared environment (attributable failures +inside a small runner), pinned-plus-Renovate over unpinned tracking (every version change attached +to a reviewable PR), and manifest-plus-files growth over workflow surgery (so the configuration +matrix scales without CI edits). The resulting benefits and the trade-offs deliberately accepted +in exchange: + +**Benefits** + +- Contract violations against real daemons are caught at PR time, per backend, with logs. +- Upstream version compatibility becomes a reviewed, bisectable git history of Renovate PRs + instead of tribal knowledge. +- Stacks double as reproducible local development environments — `compose up`, point your client + config at localhost, develop against a real ARC. +- Adding backends (future: Cloud via a mock EC2/OpenStack endpoint) or configurations is + manifest-plus-files, no CI surgery. +- The destructive-fetch semantics, the sorest point of IC-ADR-001, get a permanent regression test + (`fetch twice, assert contract`) on every backend. + +**Trade-offs and accepted costs** + +- We own three Dockerfiles for daemons whose packaging we don't control; EL9 repo layout changes + will occasionally break image builds (contained to the `build-images` workflow). 
+- Integration jobs add ~5–8 minutes wall-clock per PR (parallel across stacks). Accepted; unit + tests remain the fast inner loop. +- Single-node Slurm/HTCondor pools do not exercise multi-node scheduling behaviour. Out of scope: + interCEde's contract ends at the CE/scheduler interface; scheduling fidelity beyond it is the + backend's problem. +- The version support window (§8) multiplies images and CI jobs per backend. Contained by running + only the leading edge (plus one anchor) on PRs and the full cross-product on the schedule, and by + keeping the window to a small anchor set rather than every historical release. +- The canary lane will produce noise when upstream nightlies are broken through no fault of ours. + Mitigated by non-blocking status and issue deduplication. +- GitHub-hosted runners only (linux/amd64). No arm64, no macOS backend testing. Revisit only if a + consumer materializes there. + +## Rejected Ideas + +**Kubernetes (kind/k3s) as the orchestration layer.** Buys multi-node realism and Helm-chart +reuse (diracx-charts) at the price of cluster boot time, YAML volume, and debuggability inside a +CI job. Nothing in the CE contract requires more than "daemon reachable on a port"; compose +delivers that in one file per stack. Revisit if interCEde ever tests in-cluster deployment +concerns, which today it explicitly doesn't have. + +**testcontainers-python / per-test containers.** Excellent for a Postgres; wrong for daemons with +30–60 s startup (ARC, slurmctld+munge). Session-scoped stacks amortize startup once per job. +Per-test isolation is recovered logically (unique job tags/working dirs per test), not by +container churn. + +**One mega-compose with every backend.** Single job, shared fate: one flaky daemon reds the +world, resource ceilings are shared, and log spelunking spans six daemons. Parallel per-stack +jobs are faster and attribute failures for free. 
+ +**GitHub Actions `services:`.** No compose semantics, no build, no startup ordering beyond +health of individual containers, awkward log retrieval. Fine for a Redis sidecar; not for a CE +and its LRMS. + +**Installing backends directly on the runner (apt/yum in the job).** Fast to prototype, +impossible to pin properly (Ubuntu's Slurm is whatever the distro ships), nothing is reusable +locally, and ARC on Ubuntu is not a supported combination. Containers or nothing. + +**Reusing DIRAC's certification/integration environment.** It exists, but it drags in the full +DIRAC server stack and its configuration system — the exact coupling interCEde was extracted to +escape. interCEde must be testable by someone who has never installed DIRAC. + +**Tracking upstream only via the canary (no pinning).** Green-today-red-tomorrow CI with no +diff to point at. Pinning plus Renovate keeps every version change attached to a reviewable PR; +the canary is an early-warning supplement, not the mechanism of record. + +**Nightly-only integration tests (nothing on PR).** Decouples breakage from the change that +caused it and lets contract violations merge. The whole value proposition is the red X *on the +PR that introduces the problem* — including Renovate's version-bump PRs. + +## Open Issues + +- Exact Renovate datasource for EL9 RPM pins (`repology` vs. a custom endpoint for the NorduGrid + repo) — to be settled when the ARC Dockerfile lands. +- Whether `htcondor` stack `basic` uses `htcondor/mini` behind the CE or a cm/execute/submit + trio; `mini` is preferred until a test needs role separation. +- Whether Cloud backend testing (mock OpenStack/EC2) becomes a seventh stack or stays unit-level; + deferred until the Cloud backend is scheduled. 
diff --git a/docs/adr/IC-ADR-003_credentials.md b/docs/adr/IC-ADR-003_credentials.md new file mode 100644 index 0000000..b65d627 --- /dev/null +++ b/docs/adr/IC-ADR-003_credentials.md @@ -0,0 +1,301 @@ +# IC-ADR-003: Backend credentials — typed, declared, provider-supplied + +## Metadata + +- **Created By:** Alexandre Boyer +- **Date:** 2026-07-03 +- **Status:** Draft +- **Decision Maker(s):** DIRACGrid maintainers +- **Stakeholders:** interCEde backend implementers; DIRAC/DiracX SiteDirector and PushJobAgent maintainers; DiracX token-issuance maintainers; site/VO operators +- **Depends on:** IC-ADR-001 (contract shape, lifecycle, Tier A/B/C surface) + +> **Scope and altitude.** Direction-setting, like IC-ADR-001. This ADR decides how *backend* +> authentication credentials — bearer tokens and X.509 proxies — are **typed, declared, supplied, +> refreshed, scoped and materialised**. It does not decide token *issuance* (DiracX's), *payload* +> credential renewal (pilot-side, DIRAC's `_monitorProxy` lineage), or the exact dataclass fields +> (implementation). Long-lived transport secrets (SSH keys) are configuration, not credentials, and +> are only delimited here. + +## Abstract + +Backend auth is on the critical path for the ARC and HTCondor-CE backends, and DIRAC's incumbent +model does not survive the move to a stateless, cached, async library: credentials are **mutated +onto** long-lived CE objects (`setProxy`/`setToken`), freshness policy is **duplicated across +callers**, token opt-in is a **stringly-typed CS tag**, and audience is a bare attribute. 
This ADR +replaces that with three pieces: **typed, immutable credential values** (`BearerToken`, +`X509Proxy`, grouped in a `CredentialSet`); **backend-declared `CredentialRequirements`** (which +kinds it accepts, for which audience/scopes — data computed from the backend's resolved config, +replacing `Tag: Token[:vo]` and `audienceName`); and **provider-based supply** — backends pull a +fresh `CredentialSet` from a caller-supplied `CredentialProvider` when theirs is missing or near +expiry, so freshness *mechanics* live in one place while issuance/renewal *policy* stays entirely +in the consumer (DiracX token machinery, DIRAC's proxy manager, or a static file). Materialisation +(token/proxy files, env injection for CLI backends) is internal (Tier C). ARC's proxy delegation +remains backend-internal mechanics consuming a provider-supplied proxy. + +## Motivation + +What the DIRAC code actually does today (all verified against the current codebase): + +1. **Mutating setters on cached objects.** `ComputingElement.setProxy`/`setToken` mutate CE + instances that `QueueCECache` caches and reuses across agent cycles. Credential state and + connection state are entangled on the same long-lived object — exactly the hazard for a library + whose backends are cached and context-managed (IC-ADR-001 §2, lifecycle). +2. **Caller-side renewal policy, duplicated.** `SiteDirector._setCredentials` inspects + `ce.proxy.getRemainingSecs()` and re-supplies; `WMSUtilities.setPilotCredentials` reimplements + the same gate for the pilot-kill/log service path. Two copies of the same freshness loop is the + documented drift pattern IC-ADR-001 exists to end. +3. **Stringly-typed opt-in.** A CE accepts tokens iff `"Token"` or `f"Token:{vo}"` appears in its + CS `Tag` list. A *capability declaration* is encoded in a free-form tag shared with scheduling + metadata. +4. 
**Audience as a bare attribute.** Callers mint tokens with `audience=ce.audienceName` — an + untyped per-CE string (AREX: `https://:`; HTCondorCE: `:9619`) with no + accompanying scopes/kind information. +5. **Per-backend materialisation, ad hoc.** HTCondorCE writes the token to a temp file and injects + `_CONDOR_*` env vars around each CLI call; the base writes proxies to files and exports + `X509_USER_PROXY`; Cloud reads a `cloud.auth` ini with a magic `PROXY` secret that dumps the + pilot proxy. +6. **ARC needs more than a header.** AREX supports bearer tokens *and* X.509 proxy delegations + (create a delegation via CSR, sign with the proxy chain, upload, renew) — and can need **both at + once** (`AlwaysIncludeProxy`). +7. **HTCondorCE needs both at once — today, in production.** Even when SCITOKENS authenticates the + channel, DIRAC's submit description *unconditionally* contains `use_x509userproxy = true`, and + the code says why: *"For now, we still need to include a proxy in the submit file — HTCondor + extracts VOMS attribute from it for the sites"* — the proxy rides along for **site-side + consumption** (per-VO attribution in site accounting, i.e. the APEL pipeline), not for channel + auth. `_executeCondorCommand` even refuses to run with neither token nor proxy, and token mode + sets `_CONDOR_DELEGATE_JOB_GSI_CREDENTIALS=false` to work around a condor 24.4 delegation bug + (HTCONDOR-2904). Together with ARC's pairing, any model that assumes "one credential per + backend" is wrong on day one — twice. + +### Drivers + +- **Stateless and cache-safe.** Credentials must not be hidden mutable state on cached backends. +- **Policy with the consumer.** Where credentials *come from* (DiracX token service, proxy + download, a file) and *when they are renewed* is consumer policy; interCEde owns only the + mechanics of asking at the right time and using them correctly. +- **Typed declaration.** A consumer must be able to ask a backend "what do you need?" 
and get data + — not parse tags. +- **One supply path for one and many credentials.** Token-only, proxy-only, and token+proxy + backends use the same machinery. +- **Testable in containers.** The model must work with IC-ADR-002's ephemeral per-run credentials + (arcctl test-CA, IDTOKENS, ssh-keygen). + +## Specification + +### 1. Credential values (Tier A) + +Immutable, typed values with an expiry the machinery can read: + +```python +@dataclass(frozen=True) +class BearerToken: + value: str # opaque to interCEde; never logged + expires_at: datetime | None + +@dataclass(frozen=True) +class X509Proxy: + pem: bytes # full chain, PEM; opaque to interCEde beyond expiry + expires_at: datetime | None + +Credential = BearerToken | X509Proxy + +@dataclass(frozen=True) +class CredentialSet: # what a provider returns; may satisfy >1 kind + def get(self, kind: type[Credential]) -> Credential | None: ... +``` + +`CredentialSet` exists because of ARC's token+proxy case: requirements may name several kinds, and +one provider call returns everything needed, atomically. interCEde never inspects credential +*contents* (no VOMS parsing, no JWT decoding) beyond expiry bookkeeping. + +### 2. Requirements — the backend declares, as data (Tier A) + +```python +@dataclass(frozen=True) +class CredentialRequirements: + kinds: frozenset[type[Credential]] # the kinds needed TOGETHER (all-of, not a menu) + audience: str | None = None # token audience, if kinds include BearerToken + scopes: frozenset[str] = frozenset() # e.g. compute.create/compute.read (WLCG profile) +``` + +A backend computes its `CredentialRequirements` from its resolved configuration (it knows its +endpoint, hence its audience) and exposes them as a read-only property. This single piece of data +replaces both the `Tag: Token[:vo]` opt-in *and* `audienceName`: the consumer reads the +requirements and mints/fetches accordingly. 
Configuration may still *narrow* a backend that +accepts several kinds ("this site: proxy only"); it can never widen beyond what the backend +declares. + +Two semantic rules, both forced by the grid CEs: + +- **`kinds` means "all of these, together".** Requirements are a concrete ask for the backend's + *active configuration*, never a menu of alternatives — both first-party grid CEs need a *set* + (ARC `AlwaysIncludeProxy`; HTCondorCE token + proxy, Motivation 7). A backend that supports + alternative modes (token-only *or* proxy-only) exposes the mode as configuration, and its + requirements reflect the configured mode. +- **A required kind need not authenticate the channel.** HTCondorCE's proxy is materialised into + the submit description (`use_x509userproxy`) for *site-side* consumption — VOMS attributes + feeding the site's accounting (APEL) — while the token authenticates the channel. The provider + sees no difference (it supplies the set); what each member is *for* is backend mechanics + (Tier C). + +### 3. Supply — a provider the backend pulls from (Tier A) + +```python +@runtime_checkable +class CredentialProvider(Protocol): + async def get(self, requirements: CredentialRequirements) -> CredentialSet: ... +``` + +- The provider is part of the backend's construction config (registry request). A trivial + `StaticCredentials(CredentialSet)` provider covers files/fixed tokens — and the integration + stacks. +- **The backend pulls; the consumer implements.** Before an operation, if the backend's current + set is missing or within a freshness margin of expiry, it awaits `provider.get(...)` — once, in + one place, inside the library. What the provider *does* (call DiracX's token issuer, download a + proxy, read a refreshed file) is consumer code. This keeps the mechanism/policy split of + IC-ADR-001 §9 and deletes the SiteDirector/WMSUtilities duplication by construction. 
+- Provider calls are per-backend, not per-job: one credential authenticates the *channel/CE*, not
+  each submission.
+- Failures raise the typed auth branch of `InterCEdeError` as whole-operation failures (IC-ADR-001
+  §2 partial-failure rule: auth failure is never a per-id outcome).
+
+### 4. Materialisation (Tier C)
+
+Backends that drive CLIs or need files get internal helpers, never a public contract: secure temp
+files (0600, private dir), env injection (`X509_USER_PROXY`, HTCondor `_CONDOR_*`/SciTokens env),
+scoped to the operation and cleaned up deterministically (the backend's context-manager lifecycle
+from IC-ADR-001 §2 is the natural cleanup boundary). Credential values never appear in logs or
+exception messages.
+
+### 5. Delegation is backend mechanics (Tier C)
+
+ARC proxy delegation (CSR → sign with the provider-supplied `X509Proxy` → upload; renew before
+expiry; reuse across submissions where valid) is `ARCBackend`-internal. The provider supplies the
+proxy; everything downstream — delegation ids, renewal, `AlwaysIncludeProxy`-style pairing with a
+token — is invisible to the consumer.
+
+### 6. Boundaries
+
+- **SSH keys are transport configuration**, not rotating credentials: long-lived, file/agent-based,
+  no audience. They stay in `SSHTransport` config; revisit only if key rotation becomes a real
+  requirement.
+- **Payload credentials are pilot-side.** Renewing the proxy *of the running payload* (DIRAC's
+  `_monitorProxy`, `GENERIC_PILOT` branch) belongs to the severed pilot/runner domain, not here.
+- **Issuance is the consumer's.** interCEde never holds refresh tokens, client secrets, or CA
+  material; it asks a provider and uses what it gets.
+- **Cloud provider auth** (Libcloud driver keys, the `cloud.auth` ini) is `CloudBackend` internal
+  config; whether it adopts the provider model is deferred with the Cloud backend itself.
+
+### 7. Usage sketches (informative)
+
+**(a) The consumer implements the provider once; every backend pulls from it.** A DiracX
+submission task wires its token/proxy machinery into one object and never polices freshness
+again (this is the code that today exists twice, in `SiteDirector._setCredentials` and
+`WMSUtilities.setPilotCredentials`):
+
+```python
+class DiracXPilotCredentials:                 # consumer-side; satisfies CredentialProvider
+    async def get(self, req: CredentialRequirements) -> CredentialSet:
+        creds = []
+        if BearerToken in req.kinds:          # mint scoped to what the backend declared
+            tok = await token_service.mint(audience=req.audience, scopes=req.scopes)
+            creds.append(BearerToken(tok.value, tok.expires_at))
+        if X509Proxy in req.kinds:            # gProxyManager lineage, behind the provider
+            pem = await proxy_store.download(pilot_dn, pilot_group, lifetime=86400)
+            creds.append(X509Proxy(pem, expires_at=proxy_expiry(pem)))
+        return CredentialSet(creds)
+
+backend = registry.backend(
+    {"type": "htcondor-ce", "endpoint": "ce01.example.org:9619", ...},
+    credentials=DiracXPilotCredentials(),
+)
+async with backend:                           # lifecycle from IC-ADR-001 §2
+    sub = await backend.submit(spec, count=50)
+    # before the operation the backend compared its cached CredentialSet against
+    # backend.credential_requirements and awaited provider.get(...) only if stale
+```
+
+**(b) What the HTCondorCE backend declares, and what it does with the set (Tier C).** The
+requirements make the Motivation-7 pairing explicit and typed; the materialisation reproduces
+DIRAC's mechanics without the caller knowing any of it:
+
+```python
+backend.credential_requirements == CredentialRequirements(
+    kinds=frozenset({BearerToken, X509Proxy}),   # all-of: token for the channel,
+    audience="ce01.example.org:9619",            # proxy for site-side VOMS/APEL accounting
+)
+# internally, per operation:
+#   BearerToken -> 0600 temp file; _CONDOR_SEC_CLIENT_AUTHENTICATION_METHODS=SCITOKENS,
+#                  _CONDOR_SCITOKENS_FILE=
+#   X509Proxy   -> 0600 temp file; X509_USER_PROXY=, referenced by the submit
+#                  description's `use_x509userproxy = true`
+#   both cleaned up at the operation/lifecycle boundary; values never logged
+```
+
+**(c) Integration stacks and standalone use — static files, no issuance machinery.** The
+IC-ADR-002 stacks mint ephemeral credentials into the shared volume; the test harness (or any
+non-DiracX user) wraps them statically:
+
+```python
+provider = StaticCredentials(CredentialSet([
+    BearerToken(Path("credentials/htcondor/idtoken").read_text().strip(), expires_at=None),
+]))
+backend = registry.backend({"type": "htcondor-ce", ...}, credentials=provider)
+```
+
+## Rationale
+
+- **Provider-pull over mutating setters.** Setters put hidden state on cached objects and force
+  every caller to police freshness (two DIRAC copies prove the cost). A pull model puts the
+  *check* in one library-side place while leaving the *source and policy* in consumer code — the
+  same mechanism/policy line IC-ADR-001 draws for throttling.
+- **Declared requirements over tag sniffing.** `Tag: Token` conflates scheduling metadata with an
+  auth capability and is invisible to type checkers and tooling. A typed declaration is
+  enumerable, testable, and lets the conformance suite assert that a backend's declared kinds are
+  the ones it actually uses.
+- **`CredentialSet` over single credential.** ARC's token+proxy pairing is a first-party
+  requirement, not an edge case; modelling it from day one avoids a v2 of the provider protocol.
+- **Immutability.** Frozen values make "refresh" a *replacement*, never an in-place mutation —
+  cache-safe and race-free under concurrent bulk operations.
+
+## Rejected Ideas
+
+- **DIRAC-style `set_credential()` mutators.** Hidden state on cached backends; freshness policy
+  smeared across callers; the incumbent model this ADR exists to replace.
+- **Backend-driven issuance/refresh** (backend holds a refresh token or talks to the IdP).
+  Couples the library to DiracX/IdP specifics, embeds secrets in the library, and moves policy
+  inside — the opposite of the IC-ADR-001 §9 mechanism/policy split.
+- **Credentials in the `SubmissionSpec`** (name provisional — IC-ADR-001 Open Issues). Auth is
+  per-backend/channel, not per-job; putting it on
+  the spec would force every task (status, fetch, kill) to re-thread it and would leak payload vs
+  backend credential confusion back in.
+- **A single `Credential` per backend (no set).** Breaks on ARC's token+proxy pairing.
+- **Keeping the `Tag: Token[:vo]` opt-in.** Stringly-typed, CS-coupled, invisible to types and
+  tooling.
+- **An interCEde-owned credential store/daemon.** interCEde is stateless (IC-ADR-001 §1); caching
+  beyond the in-memory current set is the consumer's business.
+
+## Open Issues
+
+- **Freshness margin.** Fixed library default vs per-backend/per-provider configuration; ARC
+  delegation renewal wants a larger margin than a bearer-token header.
+- **Requirements granularity.** Whether `CredentialRequirements` stays per-backend or needs
+  per-operation variance (e.g. a backend whose *fetch* endpoint needs a different scope than
+  *submit*). Start per-backend; split only on evidence.
+- **Proxy representation.** Raw PEM bytes (current spec) vs a `cryptography` object; and whether
+  delegation needs key material the consumer must supply alongside the chain (DIRAC signs the
+  delegation CSR with the proxy's own key — implies the provider hands over key+chain, which
+  `X509Proxy.pem` as "full chain" must be explicit about).
+- **Multi-VO backends.** One backend instance serving several VOs would need per-VO credential
+  sets; today DIRAC instantiates per-queue/VO CEs, and interCEde's per-backend provider assumes
+  the same. Confirm with DiracX's multi-VO design.
+- **DiracX alignment.** The provider implementation on the DiracX side (token service, scopes,
+  pilot-credential flows) — tracked with the DiracX transition ADRs, not here.
+- **Proxy-for-accounting lifetime.** HTCondorCE's proxy requirement is a WLCG-transition artifact
+  (sites attribute usage via the proxy's VOMS attributes, the APEL pipeline). When sites account
+  on token claims instead, the backend's declared requirements shrink to token-only — and because
+  requirements are *data*, that is a backend-version/configuration change, not an API break.
+  Track the WLCG token-transition timeline before hard-coding the pairing as permanent.
+- **Conformance coverage.** IC-ADR-002 already plans token-vs-proxy configuration axes for the
+  ARC stack; extend the conformance suite with a "requirements honesty" check (backend declares X,
+  suite verifies it authenticates with exactly X).
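The pull rule from §3 ("missing or within a freshness margin of expiry") and the immutability point in the Rationale can be pictured with a small sketch. Everything below is illustrative and not part of the specification: `CredentialCache`, the `members` field, `earliest_expiry()` and the five-minute default margin are hypothetical names and values chosen for the example (the margin itself is still an Open Issue), and the credential types are reduced to the minimum needed to run.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Protocol, Tuple


@dataclass(frozen=True)
class BearerToken:
    value: str
    expires_at: Optional[datetime]  # None: no known expiry (e.g. static test tokens)


@dataclass(frozen=True)
class CredentialRequirements:
    kinds: frozenset
    audience: Optional[str] = None


@dataclass(frozen=True)
class CredentialSet:
    members: Tuple[BearerToken, ...]  # field name is an assumption for this sketch

    def earliest_expiry(self) -> Optional[datetime]:
        expiries = [m.expires_at for m in self.members if m.expires_at is not None]
        return min(expiries) if expiries else None


class CredentialProvider(Protocol):
    async def get(self, requirements: CredentialRequirements) -> CredentialSet: ...


class CredentialCache:
    """Hypothetical backend-internal helper: the one place freshness is checked."""

    def __init__(self, provider: CredentialProvider, requirements: CredentialRequirements,
                 margin: timedelta = timedelta(minutes=5)) -> None:
        self._provider = provider
        self._requirements = requirements
        self._margin = margin  # assumed default; the real policy is undecided
        self._current: Optional[CredentialSet] = None
        self._lock = asyncio.Lock()  # serialises refreshes under concurrent operations

    def _is_stale(self, now: datetime) -> bool:
        if self._current is None:
            return True
        expiry = self._current.earliest_expiry()
        return expiry is not None and expiry - now <= self._margin

    async def current(self) -> CredentialSet:
        async with self._lock:
            if self._is_stale(datetime.now(timezone.utc)):
                # Refresh replaces the frozen set; nothing is mutated in place.
                self._current = await self._provider.get(self._requirements)
            return self._current
```

Under these assumptions a backend would await `cache.current()` at the start of each operation and hand the returned set to its materialisation helpers; the lock is one possible way to keep concurrent bulk operations from triggering duplicate provider calls.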
diff --git a/docs/adr/IC-ADR-XXX_template.md b/docs/adr/IC-ADR-XXX_template.md
new file mode 100644
index 0000000..fa37760
--- /dev/null
+++ b/docs/adr/IC-ADR-XXX_template.md
@@ -0,0 +1,45 @@
+# IC-ADR-\[NUMBER\]: [Title of Decision]
+
+## Metadata
+
+- **Created By:** [Name]
+- **Date:** [YYYY-MM-DD]
+- **Status:** [Draft | Accepted | Rejected | Deprecated by IC-ADR-YYY | Supersedes IC-ADR-XXX]
+- **Decision Maker(s):** [Name(s)]
+- **Stakeholders:** [Name(s) / Role(s), only used for decisions which affect a subset of communities]
+- **Depends on:** [IC-ADR-YYY — optional; upstream ADRs this decision builds on]
+
+> **Scope and altitude.** [Optional but recommended, especially for long ADRs. Two to four
+> sentences: what kind of decision this is (direction-setting vs detailed), what is deliberately
+> deferred and to where, and which sections a reviewer must read versus may skim.]
+
+## Abstract
+
+A short (~200 word) summary of the decision being made and why it matters. Prefer one
+plain-language sentence per decision over a dense paragraph — this is the only part many
+stakeholders will read.
+
+## Motivation
+
+Why is this decision needed now? What problem or limitation in the current system does it address? What are the functional and non-functional drivers?
+
+## Specification
+
+Describe the chosen solution in concrete detail — APIs, interfaces, configuration, behaviour. This is the "what we're building" section.
+
+## Rationale
+
+Explain *why* the chosen design looks the way it does. Why these trade-offs? Why this level of abstraction? Connect specific design choices back to the drivers in Motivation.
+
+## Evolution
+
+Optional. How the decision accommodates future change: what can be added without breaking
+(additive extensions), and what would require superseding this ADR.
+
+## Rejected Ideas
+
+Why were the non-chosen options ultimately set aside? This is distinct from the pros/cons listing above — it's the narrative of what tipped the scales. Include any ideas that came up in discussion but weren't even promoted to full options, and why.
+
+## Open Issues
+
+Any points still being decided or discussed. Remove this section once the status moves to Accepted.
diff --git a/docs/adr/index.md b/docs/adr/index.md
new file mode 100644
index 0000000..2a53f1b
--- /dev/null
+++ b/docs/adr/index.md
@@ -0,0 +1,13 @@
+# Architecture Decision Records
+
+Architecture Decision Records (ADRs) document significant design decisions made in the interCEde project.
+
+| ADR | Title | Status |
+| --- | ----- | ------ |
+| [IC-ADR-001](IC-ADR-001_computing_elements.md) | Computing Element interfaces and the DIRAC to interCEde migration | Draft |
+| [IC-ADR-002](IC-ADR-002_integration_tests.md) | Integration testing against containerized backends | Draft |
+| [IC-ADR-003](IC-ADR-003_credentials.md) | Backend credentials — typed, declared, provider-supplied | Draft |
+
+## Template
+
+New ADRs should follow the [ADR template](IC-ADR-XXX_template.md).
diff --git a/docs/notes/reuse-survey.md b/docs/notes/reuse-survey.md
new file mode 100644
index 0000000..a7a9504
--- /dev/null
+++ b/docs/notes/reuse-survey.md
@@ -0,0 +1,68 @@
+# Prior-art survey: existing job-submission abstractions
+
+> Companion note to [IC-ADR-001](../adr/IC-ADR-001_computing_elements.md) (Rejected Ideas,
+> *"Reuse an existing job-submission abstraction instead of defining our own"*). This records the
+> survey evidence behind that rejection; the ADR keeps only the conclusion. Surveyed 2026-06/07 —
+> per-tool facts here go stale independently of the decision.
+
+## What interCEde needs
+
+A modern async, typed Python library covering, together: SSH + local batch (Slurm, HTCondor,
+LSF, …), ARC-CE over its REST interface, HTCondor-CE, and cloud (Libcloud-style providers). No
+surveyed tool covers this set.
+
+## HTCondor GAHP / blahp
+
+Not a library but a *protocol* — no C/Python binding; you spawn a helper binary and speak ASCII
+over stdin/stdout — and a *family* of binaries: `blahp` covers only **local** batch
+(Slurm/PBS/LSF/SGE/Condor), while ARC and cloud are separate `arc_gahp`/`ec2_gahp`/`gce_gahp`/
+`azure_gahp`. blahp has **no SSH** — remote-cluster access is a `condor_gridmanager` +
+`condor_remote_cluster`/BOSCO feature — so "SSH + Slurm" would drag in the whole HTCondor grid
+stack. The protocol is internal, self-described as "SECOND DRAFT", with no back-compat guarantee.
+
+Apache-2.0. Its per-LRMS submit/status/cancel scripts and its async request-id/`RESULTS`-poll
+model are worth *learning from* — they are cited in IC-ADR-001's Rationale as prior art for the
+`Scheduler` protocol and the async, bulk, poll-based contract.
+
+The reverse direction — HTCondor reusing interCEde — is a non-goal: its gridmanager is C++ and
+speaks a process protocol, and HTCondor-CE is a backend interCEde *targets*, so the layering runs
+the other way.
+
+## ATLAS PanDA / Harvester
+
+Python, Apache-2.0, and has the exact submitter/monitor/sweeper plugin shape interCEde wants
+(HTCondor, ARC via **aCT**, Slurm/LSF/PBS, cloud, k8s, HPC) — but its communicator talks only to
+the PanDA server and its data model is PanDA specs, so it is not an importable standalone
+library. Notably, ATLAS built Harvester *because* aCT was too tied to ARC-CE for US HPCs — the
+same "generic beats CE-specific" lesson IC-ADR-001 applies.
+
+## ALICE JAliEn
+
+Its `alien.site.batchqueue` layer (ARC, HTCondor, direct Slurm/PBS, NERSC SuperFacility) is the
+closest HEP analogue and does *both* grid-CE and direct-batch — but it is Java, never released as
+an artifact (CVMFS-only), coupled to JAliEn's LDAP/central/token machinery, and of unconfirmed
+licence.
+
+## Generic libraries
+
+RADICAL-SAGA, PSI/J, DRMAA, Parsl, AiiDA, Dask-Jobqueue are batch/HPC-scheduler oriented: the
+ARC-CE REST **+** HTCondor-CE **+** cloud triple never co-occurs in any one of them. In fact
+**none of the six speaks ARC at all**, in either the modern ARC-CE REST or the classic
+ARC0/EMI-ES interface (RADICAL-SAGA never shipped a NorduGrid adaptor in any release — the Python
+`radical.saga` is distinct from the old C++/Java SAGA; AiiDA has no `aiida-arc` plugin in core or
+its registry). Where they touch "HTCondor" it is vanilla HTCondor/Condor-G, not the HTCondor-CE
+grid gateway.
+
+ARC-CE REST is itself well-supported — HTCondor's `arc_gahp` speaks it in C++ (libcurl against
+`/arex/rest/1.0`), and interCEde uses the Python `pyarcrest` — but only through ARC-specific
+clients, never through one of these generic scheduler abstractions.
+
+Concurrency is *not* the differentiator: most of the six are async (futures/callbacks or a
+non-blocking submit + poll/wait); DRMAA is the lone synchronous-style one, and it is unmaintained.
+
+## Conclusion
+
+Every HEP experiment builds its own CE abstraction and none is a reusable standalone library. The
+genuinely reusable layer is the WLCG grid-CE ecosystem itself (HTCondor-CE, ARC-CE) plus
+first-party per-backend clients (`pyarcrest`, the `htcondor` bindings, Apache Libcloud).
+interCEde composes those behind a typed, async contract rather than adopting any monolith.