From 3a07186dd1b7a025b05c789677a8308f3b0bcb7d Mon Sep 17 00:00:00 2001 From: "alejandro.gonzalez" Date: Mon, 25 May 2026 13:06:12 +0200 Subject: [PATCH 1/2] docs(sca): add SCA developer reference docs Adds docs/sca/ with two reference documents: gradle-setup.md covering the WriteProperties pattern for embedding pom.properties in JARs, and reachability.md covering ClassFileTransformer invariants, the three-condition gate, version matching with ComparableVersion, and the dual periodic-action telemetry model. --- AGENTS.md | 1 + docs/sca/AGENTS.md | 6 +++ docs/sca/gradle-setup.md | 95 +++++++++++++++++++++++++++++++++++ docs/sca/reachability.md | 105 +++++++++++++++++++++++++++++++++++++++ 4 files changed, 207 insertions(+) create mode 100644 docs/sca/AGENTS.md create mode 100644 docs/sca/gradle-setup.md create mode 100644 docs/sca/reachability.md diff --git a/AGENTS.md b/AGENTS.md index e09dff0d473..7b390aaef68 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -39,6 +39,7 @@ docs/ Developer documentation (see below) | Working with Gradle | [docs/how_to_work_with_gradle.md](docs/how_to_work_with_gradle.md) | | Bootstrap/premain constraints | [docs/bootstrap_design_guidelines.md](docs/bootstrap_design_guidelines.md) | | CI/CD workflows | [.github/workflows/README.md](.github/workflows/README.md) | +| SCA: Gradle pom.properties, reachability architecture | [docs/sca/](docs/sca/) | **When working on a topic above, read the linked file first** — they are the source of truth maintained by humans. diff --git a/docs/sca/AGENTS.md b/docs/sca/AGENTS.md new file mode 100644 index 00000000000..958d0f608dd --- /dev/null +++ b/docs/sca/AGENTS.md @@ -0,0 +1,6 @@ +# SCA — Developer Reference + +| Topic | File | +|---|---| +| Embedding pom.properties via Gradle `WriteProperties` task | [gradle-setup.md](gradle-setup.md) | +| Reachability architecture, `ClassFileTransformer`, version matching | [reachability.md](reachability.md) | diff --git a/docs/sca/gradle-setup.md b/docs/sca/gradle-setup.md new file mode 100644 index 00000000000..9ab63d14e7c --- /dev/null +++ b/docs/sca/gradle-setup.md @@ -0,0 +1,95 @@ +# SCA Gradle Setup — Embedding pom.properties + +How to embed `META-INF/maven/.../pom.properties` in a JAR so that `DependencyResolver` reports it +as a Maven artifact in SCA telemetry. + +## How `DependencyResolver` identifies a Maven artifact + +`DependencyResolver.resolve(URI)` scans inside each JAR for +`META-INF/maven/{groupId}/{artifactId}/pom.properties`. When found, it builds a `Dependency` with +`name = "{groupId}:{artifactId}"` and the version from the file. The `hash` field is `null` in +this case -- SHA-1 is only computed as a fallback when neither manifest nor pom.properties is +present. + +## Canonical `WriteProperties` pattern + +```groovy +def pomPropertiesDir = project.layout.buildDirectory.dir("generated/maven-metadata") +def pomPropertiesFileTree = fileTree(pomPropertiesDir) + +tasks.named("processResources") { dependsOn(pomPropertiesFileTree) } +tasks.named("sourcesJar") { dependsOn(pomPropertiesFileTree) } // required -- see below + +sourceSets { + main.resources.srcDirs(includedAgentDir, pomPropertiesDir) // srcDirs plural, not srcDir +} + +def generatePomProperties = tasks.register('generatePomProperties', WriteProperties) { + destinationFile = pomPropertiesDir.map { + it.file("META-INF/maven/com.datadoghq/dd-java-agent/pom.properties") + } + property("groupId", "com.datadoghq") + property("artifactId", "dd-java-agent") + property("version", project.providers.provider { project.version.toString() }) +} +pomPropertiesFileTree.builtBy(generatePomProperties) +``` + +## Invariants + +### `sourcesJar` also needs `dependsOn` + +`sourcesJar` uses `main.resources` as its source set. Without `dependsOn(pomPropertiesFileTree)`, +running `sourcesJar` alone fails with "srcDir does not exist" because the generated directory +has not been created yet. + +### `srcDirs` (plural) when more than one generated directory exists + +`main.resources.srcDir(a)` replaces the existing srcDir. When adding a second generated directory, +switch to `srcDirs(a, b)`. + +### Use `providers.provider { }` for the version value + +`project.version` evaluated directly in the task configuration block may not be resolved yet if +the versioning plugin runs later. Wrap it in `project.providers.provider { project.version.toString() }` +to defer evaluation to task execution time. + +### `fileTree.builtBy(task)` placement + +`pomPropertiesFileTree.builtBy(generatePomProperties)` must come after the `generatePomProperties` +variable is assigned. Gradle evaluates build scripts top-to-bottom in the configuration phase. + +### Do not use `dependsOn(generatePomProperties)` directly + +The repo uses `dependsOn(fileTree)` + `fileTree.builtBy(task)`. Keep this consistent. + +## Testing `DependencyResolver` + +```groovy +void 'jar with pom.properties resolves as Maven artifact'() { + given: + File file = new File(testDir, 'artifact.jar') + new ZipOutputStream(new FileOutputStream(file)).with { + putNextEntry(new ZipEntry('META-INF/maven/com.example/artifact/pom.properties')) + write('groupId=com.example\nartifactId=artifact\nversion=1.2.3\n'.getBytes('UTF-8')) + closeEntry() + close() + } + + when: + List deps = DependencyResolver.resolve(file.toURI()) + + then: + deps[0].name == 'com.example:artifact' + deps[0].version == '1.2.3' + deps[0].hash == null // hash is null when pom.properties is present +} +``` + +The `hash == null` assertion is a required invariant: SHA-1 is not computed when +`pom.properties` resolves the artifact. + +## Smoke test verification + +`AbstractServerSmokeTest.missingDependencyNames` verifies end-to-end that SCA telemetry reports +a set of known dependencies. To add a new artifact to the expected set, add it to that `Set`. diff --git a/docs/sca/reachability.md b/docs/sca/reachability.md new file mode 100644 index 00000000000..c6b5b1b4f6c --- /dev/null +++ b/docs/sca/reachability.md @@ -0,0 +1,105 @@ +# SCA Reachability — Architecture and Invariants + +Design constraints and non-obvious rules for the SCA Reachability feature. + +## Three-condition gate (fundamental invariant) + +A class appearing in `sca_cves.json` is NOT reported just because it is in the index. All three +conditions must hold: + +1. Class is in the index (`database.entriesForClass()` returns entries) +2. Artifact is loaded: `DependencyResolver.resolve(jarUrl)` returns a `Dependency` with + `name == entry.artifact()` +3. Version is vulnerable: `entry.isVersionVulnerable(dep.version)` is true + +The index exists for O(1) lookup. Artifact + version filtering always runs in `processClass()`. +Classes in `pendingRetransform` still pass through the full filter -- if the version has not yet +resolved, `transform()` returns `null` without reporting. + +## `ClassFileTransformer` contract + +- `className` in `transform()` uses internal slash format: `"com/foo/Bar"`. The lookup map must + use this format as its key. +- Return `null` to leave bytecode unmodified (correct). Do not return the original buffer. +- `protectionDomain` may be null (bootstrap classes). `protectionDomain.getCodeSource()` may also + be null (runtime-generated classes). Both null checks are mandatory. +- `transform()` is called concurrently -- use immutable or concurrent data structures. +- Register with `addTransformer(transformer, true)` to allow `retransformClasses()`. + +## Two periodic actions and their relationship + +**`DependencyPeriodicAction`** (always active): drains `DependencyService` and reports ALL detected +JARs. With SCA enabled it emits `metadata:[]` on each dep as a signal that SCA is monitoring. +This is the complete inventory -- it knows nothing about CVEs. + +**`ScaReachabilityPeriodicAction`** (only when SCA enabled): drains `ScaReachabilityDependencyRegistry` +and re-emits only deps with pending CVE state, with `metadata:[{id, reached:[...]}]`. + +**Duplicate invariant**: within the same heartbeat window, the same dependency may arrive twice: + +1. From `DependencyPeriodicAction`: `{snakeyaml, metadata:[]}` -- dep just detected +2. From `ScaReachabilityPeriodicAction`: `{snakeyaml, metadata:[{cve, reached:[callsite]}]}` -- hit registered + +This happens when a dep is detected AND has a CVE hit in the same heartbeat window (common with +the 60s default). The backend must merge by `name:version`, taking the richer entry. This is not +a bug. + +## Version matching + +`ComparableVersion` in `internal-api/src/main/java/datadog/trace/util/ComparableVersion.java` is +a backport of Apache Maven 3.9.9. It supports 4-part versions and pre-release qualifiers. + +- `isWithin(start, end)` is half-open: `[start, end)` -- inclusive start, exclusive end. +- GHSA range formats: `"< 2.6.7.3"`, `">= 2.7.0, < 2.7.9.5"`, `"= 9.5.0"`. +- `< X` → `compareTo(X) < 0`. `= X` → `compareTo(X) == 0`. `>= A, < B` → `isWithin(A, B)`. + +## Class-level vs method-level symbols + +**Class-level (`method: null`)**: the class load IS the reachability signal. Use only when no +specific method is instrumentable or when class presence in memory is sufficient. + +**Risk with mixed class-level + method-level in the same entry**: if the same CVE entry has both, +the class-level hit registers `` in the registry and first-hit-wins blocks subsequent +method-level hits. Result: `reached=[{}]` instead of the application callsite. + +**Rule**: for libraries that load at startup (snakeyaml, xstream), remove the class-level symbol +and keep only method-level symbols. The class load triggers `registerCve()` with `reached:[]` +and the subsequent method call records the correct callsite. + +## Startup wiring + +- `AppSecSystem.start()` does NOT receive `Instrumentation`. Only `SubscriptionService` and + `SharedCommunicationObjects` are passed in. +- The correct pattern is a separate `maybeStartScaReachability(Instrumentation, ...)` method in + `Agent.java`, gated by `Config.get().isAppSecScaEnabled()`, called after `maybeStartAppSec()`. + +## Hard constraints (never violate) + +- Never use `java.nio.*` in premain code -- this includes `StandardCharsets.UTF_8`. Use the + string literal `"UTF-8"` instead. See `docs/bootstrap_design_guidelines.md`. +- Never modify bytecode for class-level hits (always `return null`). For method-level hits, + bytecode modification is required. +- Never throw from `transform()`. +- Never do blocking I/O in the `transform()` hot path. Use a + `ConcurrentHashMap>` cache. Do not cache empty results. +- Never report the same `(vulnId, artifact, symbolName)` more than once. Dedup: class-level in + the transformer, method-level in bootstrap `ScaReachabilityCallback.reported`. +- Never write to spans or traces -- the RFC explicitly forbids it. +- Never use `String.split(String)` or `String.split(String, int)` (forbidden API). Use a + `static final Pattern` field with `Pattern.compile(...).split(str)`. + +## GHSA data format + +- Each file is a JSON array: `[{...}]`. Process only `"language": "jvm"` entries. +- Full class name = `value + "." + name` from `ecosystem_specific.imports[].symbols[]`. +- Version ranges in `package[].version_range[]` (array of strings, treated as OR). +- The GHSA ID is the identifier -- extracted from the filename (`GHSA-xxx.json` → `"GHSA-xxx"`). + No CVE ID is present in the file. + +## Telemetry serialization + +- `Dependency` is a `public final class` with exactly 4 fields: `name`, `version`, `source`, + `hash` + optional `reachabilityMetadata: List` (nullable). `source` is NOT serialized. +- Serialization uses **Moshi `JsonWriter`** in `TelemetryRequestBody.writeDependency()`. Not Jackson. +- `null` metadata = SCA disabled (field omitted in JSON). `[]` = SCA active, no CVEs. `[{...}]` = CVE state. +- The `metadata.value` field in the RFC MUST be a JSON string (not an object) -- stringify explicitly. From 67eb4b43a7c5475a2cea8b6a9a469d58a4f353d3 Mon Sep 17 00:00:00 2001 From: "alejandro.gonzalez" Date: Wed, 27 May 2026 11:06:38 +0200 Subject: [PATCH 2/2] docs(sca): remove reachability.md based on unmerged PR #11352 SCA Reachability content will be added once PR #11352 is merged. --- docs/sca/AGENTS.md | 1 - docs/sca/reachability.md | 105 --------------------------------------- 2 files changed, 106 deletions(-) delete mode 100644 docs/sca/reachability.md diff --git a/docs/sca/AGENTS.md b/docs/sca/AGENTS.md index 958d0f608dd..5db479e52cd 100644 --- a/docs/sca/AGENTS.md +++ b/docs/sca/AGENTS.md @@ -3,4 +3,3 @@ | Topic | File | |---|---| | Embedding pom.properties via Gradle `WriteProperties` task | [gradle-setup.md](gradle-setup.md) | -| Reachability architecture, `ClassFileTransformer`, version matching | [reachability.md](reachability.md) | diff --git a/docs/sca/reachability.md b/docs/sca/reachability.md deleted file mode 100644 index c6b5b1b4f6c..00000000000 --- a/docs/sca/reachability.md +++ /dev/null @@ -1,105 +0,0 @@ -# SCA Reachability — Architecture and Invariants - -Design constraints and non-obvious rules for the SCA Reachability feature. - -## Three-condition gate (fundamental invariant) - -A class appearing in `sca_cves.json` is NOT reported just because it is in the index. All three -conditions must hold: - -1. Class is in the index (`database.entriesForClass()` returns entries) -2. Artifact is loaded: `DependencyResolver.resolve(jarUrl)` returns a `Dependency` with - `name == entry.artifact()` -3. Version is vulnerable: `entry.isVersionVulnerable(dep.version)` is true - -The index exists for O(1) lookup. Artifact + version filtering always runs in `processClass()`. -Classes in `pendingRetransform` still pass through the full filter -- if the version has not yet -resolved, `transform()` returns `null` without reporting. - -## `ClassFileTransformer` contract - -- `className` in `transform()` uses internal slash format: `"com/foo/Bar"`. The lookup map must - use this format as its key. -- Return `null` to leave bytecode unmodified (correct). Do not return the original buffer. -- `protectionDomain` may be null (bootstrap classes). `protectionDomain.getCodeSource()` may also - be null (runtime-generated classes). Both null checks are mandatory. -- `transform()` is called concurrently -- use immutable or concurrent data structures. -- Register with `addTransformer(transformer, true)` to allow `retransformClasses()`. - -## Two periodic actions and their relationship - -**`DependencyPeriodicAction`** (always active): drains `DependencyService` and reports ALL detected -JARs. With SCA enabled it emits `metadata:[]` on each dep as a signal that SCA is monitoring. -This is the complete inventory -- it knows nothing about CVEs. - -**`ScaReachabilityPeriodicAction`** (only when SCA enabled): drains `ScaReachabilityDependencyRegistry` -and re-emits only deps with pending CVE state, with `metadata:[{id, reached:[...]}]`. - -**Duplicate invariant**: within the same heartbeat window, the same dependency may arrive twice: - -1. From `DependencyPeriodicAction`: `{snakeyaml, metadata:[]}` -- dep just detected -2. From `ScaReachabilityPeriodicAction`: `{snakeyaml, metadata:[{cve, reached:[callsite]}]}` -- hit registered - -This happens when a dep is detected AND has a CVE hit in the same heartbeat window (common with -the 60s default). The backend must merge by `name:version`, taking the richer entry. This is not -a bug. - -## Version matching - -`ComparableVersion` in `internal-api/src/main/java/datadog/trace/util/ComparableVersion.java` is -a backport of Apache Maven 3.9.9. It supports 4-part versions and pre-release qualifiers. - -- `isWithin(start, end)` is half-open: `[start, end)` -- inclusive start, exclusive end. -- GHSA range formats: `"< 2.6.7.3"`, `">= 2.7.0, < 2.7.9.5"`, `"= 9.5.0"`. -- `< X` → `compareTo(X) < 0`. `= X` → `compareTo(X) == 0`. `>= A, < B` → `isWithin(A, B)`. - -## Class-level vs method-level symbols - -**Class-level (`method: null`)**: the class load IS the reachability signal. Use only when no -specific method is instrumentable or when class presence in memory is sufficient. - -**Risk with mixed class-level + method-level in the same entry**: if the same CVE entry has both, -the class-level hit registers `` in the registry and first-hit-wins blocks subsequent -method-level hits. Result: `reached=[{}]` instead of the application callsite. - -**Rule**: for libraries that load at startup (snakeyaml, xstream), remove the class-level symbol -and keep only method-level symbols. The class load triggers `registerCve()` with `reached:[]` -and the subsequent method call records the correct callsite. - -## Startup wiring - -- `AppSecSystem.start()` does NOT receive `Instrumentation`. Only `SubscriptionService` and - `SharedCommunicationObjects` are passed in. -- The correct pattern is a separate `maybeStartScaReachability(Instrumentation, ...)` method in - `Agent.java`, gated by `Config.get().isAppSecScaEnabled()`, called after `maybeStartAppSec()`. - -## Hard constraints (never violate) - -- Never use `java.nio.*` in premain code -- this includes `StandardCharsets.UTF_8`. Use the - string literal `"UTF-8"` instead. See `docs/bootstrap_design_guidelines.md`. -- Never modify bytecode for class-level hits (always `return null`). For method-level hits, - bytecode modification is required. -- Never throw from `transform()`. -- Never do blocking I/O in the `transform()` hot path. Use a - `ConcurrentHashMap>` cache. Do not cache empty results. -- Never report the same `(vulnId, artifact, symbolName)` more than once. Dedup: class-level in - the transformer, method-level in bootstrap `ScaReachabilityCallback.reported`. -- Never write to spans or traces -- the RFC explicitly forbids it. -- Never use `String.split(String)` or `String.split(String, int)` (forbidden API). Use a - `static final Pattern` field with `Pattern.compile(...).split(str)`. - -## GHSA data format - -- Each file is a JSON array: `[{...}]`. Process only `"language": "jvm"` entries. -- Full class name = `value + "." + name` from `ecosystem_specific.imports[].symbols[]`. -- Version ranges in `package[].version_range[]` (array of strings, treated as OR). -- The GHSA ID is the identifier -- extracted from the filename (`GHSA-xxx.json` → `"GHSA-xxx"`). - No CVE ID is present in the file. - -## Telemetry serialization - -- `Dependency` is a `public final class` with exactly 4 fields: `name`, `version`, `source`, - `hash` + optional `reachabilityMetadata: List` (nullable). `source` is NOT serialized. -- Serialization uses **Moshi `JsonWriter`** in `TelemetryRequestBody.writeDependency()`. Not Jackson. -- `null` metadata = SCA disabled (field omitted in JSON). `[]` = SCA active, no CVEs. `[{...}]` = CVE state. -- The `metadata.value` field in the RFC MUST be a JSON string (not an object) -- stringify explicitly.