Merged

826 changes: 707 additions & 119 deletions apps/elf-eval/src/bin/real_world_job_benchmark.rs

Large diffs are not rendered by default.

282 changes: 263 additions & 19 deletions apps/elf-eval/tests/real_world_job_benchmark.rs

Large diffs are not rendered by default.

94 changes: 94 additions & 0 deletions docs/evidence/2026-06-27-authority-recovery-drill-drift-audit.md
@@ -0,0 +1,94 @@
---
type: Drift Audit
title: "Authority Recovery Drill Drift Audit"
description: "Drift audit for production-ops authority recovery drill benchmark artifacts and reports."
resource: docs/evidence/2026-06-27-authority-recovery-drill-drift-audit.md
status: active
authority: evidence
owner: docs
last_verified: 2026-06-27
tags:
- docs
- evidence
- benchmarking
- production-ops
source_refs:
- https://linear.app/hackink/issue/XY-1119
code_refs:
- apps/elf-eval/src/bin/real_world_job_benchmark.rs
- apps/elf-eval/fixtures/real_world_memory/production_ops/authority_plane_recovery_drill.json
- docs/spec/real_world_agent_memory_benchmark_v1.md
- docs/runbook/benchmarking/real_world_agent_memory_benchmark.md
related:
- docs/spec/real_world_agent_memory_benchmark_v1.md
- docs/runbook/benchmarking/real_world_agent_memory_benchmark.md
drift_watch:
- apps/elf-eval/src/bin/real_world_job_benchmark.rs
- apps/elf-eval/fixtures/real_world_memory/production_ops/
- docs/spec/real_world_agent_memory_benchmark_v1.md
---
# Authority Recovery Drill Drift Audit

Purpose: Anchor the production-ops authority recovery drill report contract to the
runner, fixture, and documentation surfaces.
Read this when: You need evidence for backup/PITR, idempotent outbox replay, Qdrant
rebuild completeness, degraded read, migration repair, dead-letter handling, and
RPO/RTO reporting in the real-world memory benchmark.
Not this document: Live production restore proof, private-corpus quality, hosted HA,
or multi-region failover evidence.

## Watched Claims

- `elf.authority_recovery_drill/v1` is a benchmark artifact under
`adapter_response.answer.recovery_drills[]`.
- The runner validates drill topology, failure injections, backup/PITR restored
evidence, degraded-read labels with visible source-of-truth records, RPO/RTO
measurements that meet targets, matching authority record counts for source,
journal, memory, knowledge, proposal, trace, and audit planes, preserved source
refs and lifecycle history, idempotent outbox replay without duplicate writes,
Qdrant rebuild completeness without missing vectors or errors, applied migration
repair, and dead-letter handling.
- Reports expose those drill counts through
`operational_evidence.authority_recovery`, including backup/PITR restored,
record-count preservation, and predicate-gated drill pass counters.
- The checked-in fixture is local synthetic evidence only. It does not prove private
corpus quality, provider-backed behavior, hosted HA, standby failover, or
multi-region SLA.
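
The record-count claim above can be illustrated with a minimal Rust sketch. It is not the runner's code: `PlaneCount`, its `before`/`after` fields, and `mismatched_planes` are hypothetical names chosen for illustration. Only the seven plane names come from this audit.

```rust
/// The seven authority planes named in this audit.
const PLANES: [&str; 7] = [
    "source", "journal", "memory", "knowledge", "proposal", "trace", "audit",
];

/// Hypothetical before/after record count for one plane (illustrative shape).
struct PlaneCount {
    plane: &'static str,
    before: u64,
    after: u64,
}

/// Returns every plane that is missing from `counts` or whose record count
/// changed across the drill. An empty result means counts were preserved.
fn mismatched_planes(counts: &[PlaneCount]) -> Vec<&'static str> {
    PLANES
        .iter()
        .copied()
        .filter(|plane| match counts.iter().find(|c| c.plane == *plane) {
            // Present: preserved only if before and after agree.
            Some(c) => c.before != c.after,
            // Absent: a plane with no counts cannot be shown as preserved.
            None => true,
        })
        .collect()
}
```

Treating a missing plane as a mismatch is a design assumption of this sketch: it keeps an incomplete artifact from passing by omission.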

## Evidence Anchors

- `apps/elf-eval/src/bin/real_world_job_benchmark.rs` defines and validates
`AuthorityRecoveryDrillArtifact` and aggregates
`OperationalAuthorityRecoveryReport`.
- `apps/elf-eval/fixtures/real_world_memory/production_ops/authority_plane_recovery_drill.json`
encodes one production-ops job with topology, degraded-read labels, RPO/RTO,
matching before/after authority record counts, replay, rebuild, migration repair,
and dead-letter evidence.
- `docs/spec/real_world_agent_memory_benchmark_v1.md` defines the artifact schema and
production-ops/report semantics.
- `docs/runbook/benchmarking/real_world_agent_memory_benchmark.md` routes operators to
the production-ops command and describes the authority recovery drill coverage.

## Reverse Checks

- Run `cargo make real-world-memory-production-ops` to parse the fixture and render
the production-ops report.
- Run `cargo make check-docs` after docs changes.

## Verdict

pass

## Required Updates

- If recovery drill fields change, update the runner structs, fixture, benchmark
spec, runbook, and this audit together.
- If a live Docker recovery drill is added later, preserve the fixture/local evidence
boundary and add separate live evidence instead of reclassifying this fixture.

## Citations

- `apps/elf-eval/src/bin/real_world_job_benchmark.rs`
- `apps/elf-eval/fixtures/real_world_memory/production_ops/authority_plane_recovery_drill.json`
- `docs/spec/real_world_agent_memory_benchmark_v1.md`
- `docs/runbook/benchmarking/real_world_agent_memory_benchmark.md`
2 changes: 2 additions & 0 deletions docs/evidence/index.md
@@ -27,5 +27,7 @@ Routes to: Drift audits and evidence concepts under `docs/evidence/`.
suppression boundaries.
- `2026-06-27-work-journal-drift-audit.md`: Drift audit for Work Journal
source-adjacent capture, readback, redaction, and promotion-boundary behavior.
- `2026-06-27-authority-recovery-drill-drift-audit.md`: Drift audit for
production-ops authority recovery drill benchmark artifacts and reports.
- `external_memory_pattern_radar_latest.md`: Latest weekly external memory pattern
radar summary.
4 changes: 4 additions & 0 deletions docs/log.md
@@ -140,3 +140,7 @@ logs.
Work Journal oracle fields, report rates, and hard-fail counters for redaction,
rejected-option, inferred-step, journal-authority, and janitor false-promotion
boundaries.
- Added the XY-1119 authority recovery drill production-ops slice, defining
`elf.authority_recovery_drill/v1` report artifacts, validating topology, degraded
reads, RPO/RTO, authority record counts, idempotent outbox replay, Qdrant rebuild,
migration repair, and dead-letter handling, and linking the drift audit.
34 changes: 25 additions & 9 deletions docs/runbook/benchmarking/real_world_agent_memory_benchmark.md
@@ -6,7 +6,7 @@ resource: docs/runbook/benchmarking/real_world_agent_memory_benchmark.md
status: active
authority: procedural
owner: runbook
-last_verified: 2026-06-23
+last_verified: 2026-06-27
tags:
- docs
- runbook
@@ -192,10 +192,12 @@ including the retrieval-quality slice below. The suite currently encodes:
source-id preservation, evidence binding, no secret leakage, and fixture-backed
capture/integration boundary classification.
- `production_ops`: interrupted generated backfill resume, backup/restore plus
-cold-start readback, resource-envelope interpretation, public-proxy
-production-private addendum readback, pinned OpenViking local embedding
-runtime/wrong-result classification, missing private manifest `blocked`
-classification, and provider credential boundary `blocked` classification.
+cold-start readback, recoverable authority-plane drill evidence over source,
+journal, memory, knowledge, proposal, trace, and audit records,
+resource-envelope interpretation, public-proxy production-private addendum readback,
+pinned OpenViking local embedding runtime/wrong-result classification, missing
+private manifest `blocked` classification, and provider credential boundary
+`blocked` classification.
- `personalization`: scoped stable preference correction without temporary or
cross-project preference leakage.
- `core_archival_memory`: core block attachment, scope, provenance, stale-core
@@ -705,10 +707,24 @@ The production-ops fixtures live under
`apps/elf-eval/fixtures/real_world_memory/production_ops/`. They encode user-job
readback over existing public benchmark and restore evidence: interrupted backfill
resume from checkpoint, clean-run comparison, backup/restore readback, Qdrant rebuild
-from Postgres-held vectors, cold-start search recovery, and resource-envelope
-interpretation. The P4 slice also encodes the operator-approved public-proxy
-production-private addendum and emits `elf.operational_evidence_gates/v1` so local
-fixture, public-proxy, private-corpus, and provider-backed evidence remain separate.
+from Postgres-held vectors, cold-start search recovery, recoverable authority-plane
+drills, and resource-envelope interpretation. Authority recovery drills use
+`elf.authority_recovery_drill/v1` under `adapter_response.answer.recovery_drills[]`
+to report topology, failure injection, backup/PITR, degraded-read labels, RPO/RTO
+targets and measurements, matching before/after authority record counts, idempotent
+outbox replay, Qdrant rebuild completeness, migration repair, and dead-letter
+handling. The runner fails drills whose predicates are false: backup/PITR must be
+restored, source-of-truth records must stay visible during degraded reads, RPO/RTO
+measurements must meet targets, authority counts/source refs/lifecycle history must
+be preserved, outbox replay must be idempotent without duplicate writes, Qdrant
+rebuilds must complete without missing vectors or errors, migration repair must be
+applied, and dead-letter rows must be handled. The generated
+`operational_evidence.authority_recovery` report includes backup/PITR restored,
+record-count preservation, and per-predicate recovery counters; drill pass counts
+require both a passing job and successful recovery predicates. The P4 slice also
+encodes the operator-approved public-proxy production-private addendum and emits
+`elf.operational_evidence_gates/v1` so local fixture, public-proxy, private-corpus,
+and provider-backed evidence remain separate.
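
The pass-count rule in this paragraph (a drill counts as passed only when the job passes and the recovery predicates succeed) can be sketched in Rust as below. This is an illustrative aggregation under stated assumptions, not the contents of `OperationalAuthorityRecoveryReport`: `DrillOutcome`, `AuthorityRecoveryCounters`, `aggregate`, and every field name are hypothetical.

```rust
/// Hypothetical per-drill outcome; field names are illustrative only.
#[derive(Clone, Copy)]
struct DrillOutcome {
    job_passed: bool,
    backup_pitr_restored: bool,
    record_counts_preserved: bool,
    /// Stands in for the remaining predicates (RPO/RTO, replay, rebuild, ...).
    other_predicates_ok: bool,
}

/// Hypothetical counters resembling the report section described above.
#[derive(Debug, Default, PartialEq)]
struct AuthorityRecoveryCounters {
    drill_count: usize,
    backup_pitr_restored_count: usize,
    record_counts_preserved_count: usize,
    drill_pass_count: usize,
}

/// Folds drill outcomes into counters. Per-predicate counters are tracked
/// independently; the pass counter is gated on the job AND all predicates.
fn aggregate(outcomes: &[DrillOutcome]) -> AuthorityRecoveryCounters {
    let mut counters = AuthorityRecoveryCounters::default();
    for outcome in outcomes {
        counters.drill_count += 1;
        if outcome.backup_pitr_restored {
            counters.backup_pitr_restored_count += 1;
        }
        if outcome.record_counts_preserved {
            counters.record_counts_preserved_count += 1;
        }
        let predicates_ok = outcome.backup_pitr_restored
            && outcome.record_counts_preserved
            && outcome.other_predicates_ok;
        if outcome.job_passed && predicates_ok {
            counters.drill_pass_count += 1;
        }
    }
    counters
}
```

Keeping the per-predicate counters separate from the gated pass counter lets a report show which predicate failed even when the overall drill does not count as a pass.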

The same slice deliberately keeps non-pass boundaries typed. A missing private
production manifest is `blocked`, unavailable provider credentials are `blocked`, and
54 changes: 47 additions & 7 deletions docs/spec/real_world_agent_memory_benchmark_v1.md
@@ -6,15 +6,15 @@ resource: docs/spec/real_world_agent_memory_benchmark_v1.md
status: active
authority: normative
owner: spec
-last_verified: 2026-06-23
+last_verified: 2026-06-27
tags:
- docs
- spec
source_refs: []
code_refs:
- Makefile.toml
- apps/elf-eval/src/bin/real_world_job_benchmark.rs
-- apps/elf-eval/fixtures/real_world_memory/
+- apps/elf-eval/fixtures/real_world_memory/production_ops/authority_plane_recovery_drill.json
related: []
drift_watch:
- docs/spec/real_world_agent_memory_benchmark_v1.md
@@ -451,6 +451,40 @@ untraced section count. Rebuild results are acceptable only when repeated output
deterministic enough for regression comparison or every allowed variance is explicitly
reported.

### Optional `adapter_response.answer.recovery_drills`

Production-ops fixtures MAY include authority recovery drill artifacts in
`corpus.adapter_response.answer.recovery_drills[]`. These artifacts use schema
`elf.authority_recovery_drill/v1` and are fixture/report evidence, not proof of a
multi-region or hosted HA topology.

Each recovery drill MUST include:

- `drill_id`, `contract_schema`, and `generated_at`;
- `topology` with the authority store, derived indexes, adapters, and failover
boundary;
- one or more `failure_injections` with target, fault, timestamps, and evidence refs;
- `backup_pitr` with backup reference, PITR target, `restored = true`, and evidence
refs;
- `degraded_read` with unavailable derived indexes or adapters labeled separately
from visible source-of-truth records, and `source_of_truth_visible = true`;
- `rpo` and `rto` targets and measured seconds with evidence refs, where measured
seconds are less than or equal to the target seconds;
- `authority_record_counts` for `source`, `journal`, `memory`, `knowledge`,
`proposal`, `trace`, and `audit`, including matching before/after counts plus
`source_refs_preserved = true` and `lifecycle_history_preserved = true`;
- `outbox_replay` with `idempotent = true`, zero duplicate writes, and evidence refs;
- `qdrant_rebuild` with `complete = true`, zero missing vectors, zero errors, and
evidence refs;
- `migration_repair` with `applied = true` and evidence refs;
- `dead_letter` with handled count greater than or equal to dead-letter count and
evidence refs.

A recovery drill MUST NOT claim failover unless a standby or replacement authority
service is actually part of the topology. Qdrant and document indexes remain derived
and rebuildable; degraded read must label unavailable derived indexes or adapters
without hiding Postgres source-of-truth records.
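
The scalar requirements above can be read as a set of predicates. The following Rust sketch shows one way to evaluate them; it is an assumption-laden illustration, not the `AuthorityRecoveryDrillArtifact` definition. The flattened `Drill` struct, all of its field names, the predicate labels, and `failed_predicates` are hypothetical, and authority record counts and evidence refs are left out for brevity.

```rust
/// Hypothetical flattened view of one drill; the real artifact is nested and
/// its field names may differ.
struct Drill {
    backup_restored: bool,
    source_of_truth_visible: bool,
    rpo_target_seconds: u64,
    rpo_measured_seconds: u64,
    rto_target_seconds: u64,
    rto_measured_seconds: u64,
    outbox_idempotent: bool,
    outbox_duplicate_writes: u64,
    qdrant_complete: bool,
    qdrant_missing_vectors: u64,
    qdrant_errors: u64,
    migration_applied: bool,
    dead_letter_count: u64,
    dead_letter_handled: u64,
    claims_failover: bool,
    topology_has_standby: bool,
}

/// Returns the label of every predicate that does not hold, in check order.
fn failed_predicates(d: &Drill) -> Vec<&'static str> {
    let checks = [
        ("backup_pitr", d.backup_restored),
        ("degraded_read", d.source_of_truth_visible),
        // Measured seconds must be less than or equal to the target.
        ("rpo", d.rpo_measured_seconds <= d.rpo_target_seconds),
        ("rto", d.rto_measured_seconds <= d.rto_target_seconds),
        ("outbox_replay", d.outbox_idempotent && d.outbox_duplicate_writes == 0),
        (
            "qdrant_rebuild",
            d.qdrant_complete && d.qdrant_missing_vectors == 0 && d.qdrant_errors == 0,
        ),
        ("migration_repair", d.migration_applied),
        ("dead_letter", d.dead_letter_handled >= d.dead_letter_count),
        // Failover may be claimed only when the topology has a standby or
        // replacement authority service.
        ("failover_claim", !d.claims_failover || d.topology_has_standby),
    ];
    checks
        .iter()
        .filter(|(_, ok)| !*ok)
        .map(|(name, _)| *name)
        .collect()
}
```

Returning the failed labels rather than a single boolean is a choice made for this sketch; it mirrors the per-predicate counters that reports are expected to expose.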

### `negative_traps`

Negative traps MUST be explicit so systems are tested against realistic memory failure
@@ -638,7 +672,7 @@ Suite ids are stable public names. Each suite MUST contain at least one
| `source_library` | Preserve long-form source records and citable excerpts without silently promoting them to memory. | Capture a long document; hydrate a source_ref excerpt; preserve a social/thread source boundary. | Source ids, canonical source metadata, source_ref hydration pointers, verified excerpts, explicit no-autopromotion boundary. | answer_correctness, evidence_grounding, lifecycle_behavior, trap_avoidance. | PageIndex, ELF. |
| `operator_debugging_ux` | Show whether a wrong or ambiguous memory result can be debugged without raw store spelunking. | Explain why a result ranked first; inspect a trace; identify which stage dropped expected evidence. | Trace bundle, retrieval trajectory, candidate metrics, viewer or CLI readback. | debuggability, evidence_grounding, workflow_helpfulness, answer_correctness. | claude-mem, qmd, agentmemory, ELF. |
| `capture_integration` | Evaluate how accurately work observations become usable memory across agents and tools. | Capture a session decision; exclude private spans; import external agent observations. | Hook/import logs, write policy audits, excluded spans, resulting note ids. | answer_correctness, evidence_grounding, trap_avoidance, lifecycle_behavior. | agentmemory, claude-mem, memsearch, mem0. |
-| `production_ops` | Prove safe operation under backup, restore, backfill, cold start, resource, and credential boundaries. | Resume interrupted import; restore from backup; report missing private manifest as bounded caveat. | Command/report artifacts, resource envelope, checkpoint state, failure guard evidence. | lifecycle_behavior, latency_resource, uncertainty_handling, evidence_grounding. | ELF, qmd, memsearch, LangGraph. |
+| `production_ops` | Prove safe operation under backup, restore, backfill, cold start, authority recovery, resource, and credential boundaries. | Resume interrupted import; restore from backup; report missing private manifest as bounded caveat; report authority-plane degraded read and replay drills. | Command/report artifacts, resource envelope, checkpoint state, failure guard evidence, authority record counts, RPO/RTO measurements, degraded-read labels. | lifecycle_behavior, latency_resource, uncertainty_handling, evidence_grounding. | ELF, qmd, memsearch, LangGraph. |
| `personalization` | Apply user/project preferences correctly without leaking across scopes or overfitting stale preferences. | Remember preferred response style; avoid using another project tenant's note; update a preference. | Scoped memory ids, preference versions, tenant/project/agent context, negative cross-scope traps. | personalization_fit, trap_avoidance, evidence_grounding, answer_correctness. | mem0, Letta, agentmemory, ELF. |
| `core_archival_memory` | Verify always-loaded core memory behavior separately from archival note search and derived retrieval indexes. | Read an attached core block; enforce core block scope; detect stale core state from archival evidence; fall back to archival notes; recover a decision from core routing plus archival rationale. | Core block ids, attachment ids, read_profile/scope metadata, source_ref and audit history, archival note evidence ids, stale-core traps, and explicit no-Qdrant-core-block boundary evidence. | answer_correctness, evidence_grounding, trap_avoidance, lifecycle_behavior, workflow_helpfulness. | Letta, ELF. |
| `context_trajectory` | Measure staged context trajectory, hierarchy selection, and recursive/context expansion without converting setup or retrieval preconditions into trajectory wins. | Explain whether a staged trajectory can be scored; identify selected hierarchy nodes; report recursive expansion paths and pruned branches. | Same-corpus expected evidence ids, matched/missing evidence ids, stage artifacts, selected hierarchy nodes, rejected siblings or decoys, expansion paths, pruned branches, comparable ELF trace/session artifacts when a comparison is claimed. | answer_correctness, evidence_grounding, trap_avoidance, debuggability, workflow_helpfulness. | OpenViking, ELF, qmd. |
@@ -690,10 +724,16 @@ Reports MUST include:
separating `local_fixture`, `public_proxy`, `private_corpus`, and
`provider_backed` tiers. The gates MUST report tier status, job counts, pass and
typed non-pass counts, mean latency, cost summary, resource-envelope counts,
-cold-start/restore/Qdrant-rebuild counts, typed blocker reasons, and explicit
-booleans for whether private-corpus or provider-backed pass claims are allowed.
-Local fixture and public-proxy passes MUST NOT satisfy private-corpus or
-provider-backed proof.
+cold-start/restore/Qdrant-rebuild counts, authority recovery drill counts where a
+pass requires the job to pass and every drill predicate above to succeed,
+topology coverage, failure-injection counts, degraded-read label counts, visible
+source-of-truth counts, backup/PITR restored counts, RPO/RTO target and met counts,
+authority record-count preservation counts, source-ref and lifecycle preservation
+counts, idempotent replay counts, complete Qdrant rebuild counts, migration repair
+counts, dead-letter handling counts, typed blocker reasons, and explicit booleans
+for whether private-corpus or provider-backed pass claims are allowed. Local
+fixture and public-proxy passes MUST NOT satisfy private-corpus or provider-backed
+proof.
- run id, runner version, corpus profile, job ids, suite ids, project adapter metadata;
- per-job status, normalized score, hard-fail hits, evidence ids used, trap ids used;
- per-job `answer_type`, required caveat/refusal flags, and whether an unknown answer