Skip to content

fix: six HIGH-priority hardening fixes (prompt-injection seal + 8-K storage + XML entity expansion)#165

Open
sroussey wants to merge 5 commits into
mainfrom
claude/wonderful-hypatia-y6cb4l
Open

fix: six HIGH-priority hardening fixes (prompt-injection seal + 8-K storage + XML entity expansion)#165
sroussey wants to merge 5 commits into
mainfrom
claude/wonderful-hypatia-y6cb4l

Conversation

@sroussey

Copy link
Copy Markdown
Contributor

Summary

Six HIGH-priority security / correctness fixes, split across two commits:

  • Prompt-injection seal hardening for the S-1 / 424 AI section extractors
    (per-call nonce fence, multi-stage defang, raw-byte source_span cap).
  • Form 8-K storage hardening (XML entity-expansion DoS, accession-number
    validation at trust boundaries, versioned PK so re-extracts don't clobber
    history, transactional writes with rollback).

All 1339 tests pass; full build + tsc are clean.


Plan A — prompt-injection seal (commit 1)

fix(forms/s1): per-call nonce fence + raw-span cap + multi-stage defang

The static <UNTRUSTED_FILER_DOCUMENT> fence tag and single-pass defang
left two avenues open:

  • A filer could pre-stage a literal closing tag and end the fence early
    (the defang was case-insensitive but flat — only one tag-shape was
    rewritten, and obfuscations via HTML entities, fullwidth letters,
    zero-width chars, or intra-tag whitespace slipped past).
  • Model-emitted source_span was capped only at the verifier
    (post-normalization). A row that passed verification could still
    ship unbounded raw bytes into the provenance column.

This PR deepens the seal:

  • Per-call nonce fence. Each wrapUntrusted call mints a 64-bit
    random hex nonce; the fence tag becomes
    <UNTRUSTED_FILER_DOCUMENT_NONCE_<hex>>…</UNTRUSTED_FILER_DOCUMENT_NONCE_<hex>>.
    A pre-staged closing tag in the prospectus cannot match the actual
    fence (the attacker doesn't know the nonce).
  • Multi-stage defang. Before the tag scan: multi-pass HTML entity
    decoding (named + numeric, up to a fixed point), Unicode NFKC
    normalization, and zero-width / bidi format-char stripping. The
    defang scan then matches any tag-shaped token whose alphabetic
    payload squashes to UNTRUSTEDFILERDOCUMENT* and rewrites it to
    [redacted-fence-tag].
  • Raw-byte source_span cap. New boundSourceSpan returns null
    for spans over 1000 raw chars (rather than truncating); new
    verifyRowSpan rejects over-cap spans BEFORE normalization, so a
    whitespace-inflated span that would otherwise normalize under cap
    no longer passes the verifier. All 7 verifyRow: callsites and
    9 source_span: persist sites in the S-1 storage and shared
    offering-sections layer route through these.
  • Version bumps. S-1 extractor 1.2.0 → 1.3.0; 424 extractor
    1.1.0 → 1.2.0. Prompt-shape change drifts confidence calibration,
    and the span-storage shape changes too — operators should rotate.

New / extended tests: 10 unit tests for boundSourceSpan /
verifyRowSpan boundary cases; 9 prompt-injection tests covering the
nonced fence, fullwidth, HTML-entity, mixed-case + ZWSP, intra-tag
whitespace, wrong-nonce, and nonce uniqueness; a 1500-raw-char
whitespace-padded storage-layer dead-letter test.

Plan B — Form 8-K storage (commit 2)

fix(forms/8-K): tx writes, versioned PK, accession unification, XML entity hardening

  • XML entity expansion disabled. processEntities: false on the
    shared Form XML parser. A filer-crafted "billion laughs" DOCTYPE
    no longer expands to a multi-GB string. Regression test feeds a
    10-level chain and asserts parse completes well under 50 ms.
  • Accession-number validator. New TypeAccessionNumber() enforces
    the EDGAR 20-char ^\d{10}-\d{2}-\d{6}$ shape at trust boundaries
    — applied to ProcessAccessionDocFormTask input and the 8-K event
    row schema. Tests assert 21- and 22-char strings are rejected.
  • Versioned PK. form_8k_events now uses a synthetic event_id
    AUTOINCREMENT PK; extractor_id and extractor_version are
    first-class columns; the natural key
    (cik, accession_number, extractor_id, extractor_version, item_code)
    is enforced UNIQUE. Re-extracts under a new version no longer
    overwrite the prior version's rows. New getEventsByVersion()
    query helper.
  • Atomic writes. New Form8KEventRepo.replaceEvents() deletes
    every row matching (cik, accession_number, extractor_id, extractor_version) and bulk-inserts the new set inside a real
    transaction (better-sqlite3 db.transaction on SQLite,
    BEGIN / COMMIT / ROLLBACK on a checked-out PG client). Mid-write
    failure rolls both halves back; the table never carries a partial
    item list for one (filing, version). A SQLite failure-injection
    test verifies rollback.

Migration notes

The 8-K event table introduced in PR #68 had its PK changed from
(cik, accession_number, item_code) to event_id. Neither SQLite
nor Postgres can ALTER away a primary key, so a one-shot
migrateLegacyForm8KEventsTable step (called from
setupAllDatabases before the table's setupDatabase()) detects
the legacy shape (no event_id column) and DROPs the table. 8-K
events are deterministic to re-extract from the filing's items list,
so re-running sec fetch form <cik> 8-K rebuilds the data losslessly.

Operator action

Roll the new extractor versions into production:

sec version startDev extractor S-1 1.3.0 && sec version promote extractor S-1
sec version startDev extractor 424 1.2.0 && sec version promote extractor 424

After upgrading, the legacy 8-K event table (if any) is dropped on
the next setupAllDatabases() invocation — re-fetch any 8-K filings
that were processed under the previous schema:

sec fetch form <cik> 8-K   # or replay an accession with sec extractor retry-...

Files changed (broad strokes)

  • src/sec/forms/registration-statements/s1/sectionExtractors.ts
    nonce fence + multi-stage defang
  • src/sec/forms/registration-statements/s1/verifySourceSpan.ts
    boundSourceSpan / verifyRowSpan / MAX_STORED_SPAN_CHARS
  • src/sec/forms/registration-statements/Form_S_1.storage.ts,
    src/sec/forms/registration-statements/s1/offeringSections.ts
    swap all verifyRow: and source_span: sites
  • src/sec/forms/registration-statements/Form_S_1.storage.ts,
    src/sec/forms/registration-statements/Form_424.storage.ts
    version bumps
  • src/sec/forms/Form.tsprocessEntities: false
  • src/sec/edgar/accessionNumber.ts (new) — TypeAccessionNumber
  • src/task/forms/ProcessAccessionDocFormTask.ts — accession
    validator at task input; pass extractor_id + extractor_version
    into processForm8K
  • src/storage/form-8k-event/Form8KEventSchema.ts — versioned PK,
    Form8KEventUniqueIndexes
  • src/storage/form-8k-event/Form8KEventRepo.ts — version-aware
    query helpers + replaceEvents
  • src/storage/form-8k-event/Form8KEventReplace.ts (new) —
    SQLite / PG transactional replace
  • src/storage/form-8k-event/Form8KEventLegacyMigration.ts (new) —
    drop legacy form_8k_events
  • src/sec/forms/miscellaneous-filings/Form_8_K.storage.ts — use
    replaceEvents, accept extractor version params
  • src/config/{DefaultDI,TestingDI,setupAllDatabases}.ts — wire
    new unique index + legacy-migration step
  • Tests across all of the above (29 new, several extended)

Verification

bun test                                         # 1339 pass / 0 fail
bun test src/sec/forms/registration-statements/  # 89 pass
bun test src/sec/forms/                          # 374 pass
bun test src/storage/form-8k-event/              # 12 pass
bun test src/task/forms/                         # 12 pass
bun run build                                    # clean (bun build + tsc)

🤖 Generated with Claude Code


Generated by Claude Code

claude added 3 commits June 23, 2026 08:31
The prompt-injection seal around S-1/424 AI section extraction had two
filer-controllable weak points:

1. The fence tag was a static literal (`UNTRUSTED_FILER_DOCUMENT`), so a
   filer could pre-stage a matching closing tag and end the fence early.
   The defang scan was case-insensitive but flat — only a single literal
   tag-shape was rewritten.
2. A model-emitted `source_span` was capped at the verifier (post-
   normalization) but persisted raw, so an attacker who slipped any
   verifier-passing row could ship unbounded raw bytes through the
   provenance column.

This patch deepens the seal:

- The fence tag carries a per-call 64-bit random nonce. The
  `UNTRUSTED_FILER_DOCUMENT_NONCE_<hex>` shape means a pre-staged closing
  tag in the prospectus cannot match the call's actual fence.
- Before defang, the section body is HTML-entity-decoded (multi-pass, up
  to a fixed point), NFKC-normalized, and stripped of zero-width / bidi
  format chars. The defang scan matches any tag-shaped token whose
  alphabetic payload squashes to `UNTRUSTEDFILERDOCUMENT...`, so
  obfuscations via `&lt;`, fullwidth letters, ZWSP, intra-tag spaces,
  and case-mixing all collapse to `[redacted-fence-tag]`.
- A new `boundSourceSpan` caps stored spans at 1000 raw chars (returning
  `null` over the cap rather than truncating). A new `verifyRowSpan`
  rejects a span whose raw byte count exceeds the cap before the
  normalize-and-substring check runs, so a whitespace-inflated payload
  that would otherwise normalize under cap can no longer pass the gate.
  All `verifyRow:` callsites and `source_span:` persist sites in the S-1
  storage and shared offering-sections layer route through these.
- Bumps the S-1 extractor version to 1.3.0 and the 424 extractor version
  to 1.2.0: prompt-shape changes drift confidence calibration, and the
  span-storage shape changes too. Operators should run startDev/promote
  to roll the new version into production.

Adds unit tests for `boundSourceSpan` / `verifyRowSpan` boundary cases,
a 1500-raw-char whitespace-padded span dead-letter test in the storage
layer, and obfuscation tests for fullwidth, HTML-entity, mixed-case +
zero-width, intra-tag whitespace, wrong-nonce, and nonce uniqueness.

Co-Authored-By: Claude <noreply@anthropic.com>
…ntity hardening

Four hardening fixes around the Form 8-K event-storage path:

1. fast-xml-parser entity expansion is disabled (`processEntities: false`)
   on the shared Form XML parser. A filer-controlled SGML payload that
   declared a chain of nested entity references would otherwise expand
   into a multi-GB string ("billion laughs") and peg CPU at parse time.
   A regression test feeds a 10-level billion-laughs DOCTYPE through
   Form_8_K.parse and asserts the parse stays well under 50 ms.

2. EDGAR accession numbers cross trust boundaries unconstrained — the
   filing-task input schema and the Form 8-K event table both accepted
   any string, so an over-long or malformed accession could land in
   storage. Introduces `TypeAccessionNumber()` (20-char fixed length,
   `^\d{10}-\d{2}-\d{6}$`) and applies it at the
   ProcessAccessionDocFormTask input and the event row schema.

3. The Form 8-K event row keyed `(cik, accession_number, item_code)` as
   its primary key — a re-extract under a new extractor version would
   overwrite the prior version's rows, erasing the time series. Switches
   the table to a synthetic `event_id` AUTOINCREMENT PK plus an explicit
   `(cik, accession_number, extractor_id, extractor_version, item_code)`
   UNIQUE natural-key index, mirroring the PersonObservation /
   CompanyObservation shape. Both extractor columns are now first-class
   so coverage / drop-previous ceremonies can target a single version.
   A one-shot legacy-schema migration drops the pre-versioned table on
   the SQLite and Postgres paths (the natural-key PK cannot be ALTERed
   away on either backend, and 8-K events are deterministic to re-extract).

4. `processForm8K` previously looped over items with one `put` per item,
   so a mid-loop crash left the row set torn between old and new items
   for the same (filing, version). Adds `Form8KEventRepo.replaceEvents`
   — DELETE all rows for `(cik, accession_number, extractor_id,
   extractor_version)` then bulk-insert the new set, wrapped in a
   real transaction on the SQLite (better-sqlite3 `db.transaction`)
   and Postgres (`BEGIN / COMMIT / ROLLBACK` on a checked-out client)
   paths. The in-memory backend (tests only) is synchronous so a torn
   write cannot interleave. A failure-injection test seeds a row,
   then re-runs `replaceEvents` with a NOT NULL-violating second
   insert and asserts the prior baseline is intact after rollback.

Also wires `extractor_id` + `extractor_version` through the task layer
into `processForm8K` so the same writer can run under any version slot.

Co-Authored-By: Claude <noreply@anthropic.com>
…B_TYPE token

In CI the 21 Form_8_K tests failed with `no such table: form_8k_events`
because `replaceForm8KEvents` was dispatching to `replaceSqlite` even
though the test harness had wired `FORM_8K_EVENT_REPOSITORY_TOKEN` to an
in-memory storage. The trigger was test-process global-DI contamination:
`FetchDailyIndexTask.test.ts` calls `EnvToDI()` at module-load time,
which registers `SEC_DB_TYPE = "sqlite"` in the `globalServiceRegistry`.
The ServiceRegistry has no unregister API, so once any earlier test in
the same Bun worker hits that path, `SEC_DB_TYPE` sticks for the rest of
the run. `resetDependencyInjectionsForTesting()` rebinds the repo tokens
to in-memory storages but cannot clear `SEC_DB_TYPE`, so the SQLite
branch in `replaceForm8KEvents` won and reached for `getDb()`, which
either fell over on an uninitialized SQLite handle (locally) or
write-attempted against a table that was never created (CI).

Fix: trust the actual repo. `InMemoryTabularStorage.isDurable()` returns
`false`; the production storages don't override it. When the resolved
repo is non-durable, take the repo path regardless of `SEC_DB_TYPE`.
This makes the dispatch correct even when global config and the
registered repo disagree, which is the steady-state in the test process.

Reproduces locally via:
  bun test src/task/index/FetchDailyIndexTask.test.ts \\
           src/sec/forms/miscellaneous-filings/Form_8_K.test.ts
(without the fix: 25 Form_8_K fails; with the fix: all 29 pass).

Co-Authored-By: Claude <noreply@anthropic.com>

Copy link
Copy Markdown
Contributor Author

CI's 21 Form_8_K failures were a test-process global-DI contamination bug, not an issue with the H6 hardening itself.

FetchDailyIndexTask.test.ts (and a few other tests) call EnvToDI() at module-load time, which registers SEC_DB_TYPE = "sqlite" in the globalServiceRegistry. There is no unregister API, so subsequent tests in the same Bun worker inherit it. resetDependencyInjectionsForTesting() rebinds the repo tokens to in-memory storage but cannot clear SEC_DB_TYPE, so replaceForm8KEvents was dispatching to replaceSqlite against a non-existent (CI) or uninitialized (local) SQLite handle.

Fix in Form8KEventReplace.ts: trust the actual repo. When the resolved repo.isDurable() returns false (the in-memory backend), take the repo path regardless of SEC_DB_TYPE. Production code paths register the durable SQLite/Postgres storage, so they are unaffected.

Reproduced locally via bun test src/task/index/FetchDailyIndexTask.test.ts src/sec/forms/miscellaneous-filings/Form_8_K.test.ts — 25 fails before, 0 after. Full suite: 1334 pass / 7 fail, with all remaining failures pre-existing network timeouts unrelated to this PR (FetchDailyIndex, FetchQuarterlyIndex, Form_D phone-normalize, OwnershipDocument address-clean). Build green.


Generated by Claude Code

Resolves conflicts created by PR #166 (SPAC de-SPAC lifecycle / merger-proxy /
redemption extraction) landing on main after this PR opened.

Conflicts resolved:

- src/sec/forms/miscellaneous-filings/Form_8_K.storage.ts
  - Function signature combines both side's additive params:
    extractor_id, extractor_version (this PR), fullSubmissionText, model (#166).
  - Event writes go through replaceEvents() (this PR), threading
    extractor_id + extractor_version into the version-scoped delete-then-insert.
  - SPAC milestone mapping + redemption extraction blocks from #166 follow
    unchanged after the events are persisted.

- src/task/forms/ProcessAccessionDocFormTask.ts
  - Keep TypeAccessionNumber import (this PR), processMergerProxy +
    hasRedemptionTriggerItem imports (#166).
  - 8-K dispatch call site passes both extractor_id/extractor_version and
    fullSubmissionText into processForm8K; merger-proxy case from #166 follows.

- src/sec/forms/registration-statements/s1/sectionExtractors.ts (auto-merged
  cleanly by git but the new extractMergerDeal / extractRedemption functions
  still called the pre-PR wrapUntrusted shape + UNTRUSTED_PREAMBLE constant,
  which this PR removed. Both updated to the nonce-fence API
  (wrapUntrusted -> { wrapped, nonce }, buildUntrustedPreamble(nonce))
  so the new SPAC AI extractors get the per-call nonce fence + multi-stage
  defang for free. Without this, the prompts would interpolate as
  "[object Object]" and the model receives garbage.

Verification:
- targeted: bun test src/sec/forms/miscellaneous-filings/ \
    src/sec/forms/proxies-information-statements/ src/task/forms/ \
    src/storage/spac/ src/storage/form-8k-event/ \
    src/sec/forms/registration-statements/  -> 229 pass / 0 fail.
- full: bun test  -> 1410 pass / 7 fail. All 7 fails are pre-existing
  FetchDailyIndexTask + FetchQuarterlyIndexTask 5000ms network timeouts
  unrelated to this PR (sandbox can't reach SEC.gov reliably).
- bun run build  -> clean (bun build + tsc, no errors).

Co-Authored-By: Claude <noreply@anthropic.com>

Copy link
Copy Markdown
Contributor Author

Rebased onto main (6d74dc87a0f271) which had just landed PR #166 (SPAC de-SPAC lifecycle: 8-K milestones, merger-proxy, redemption extraction). Merge commit df5238b.

Conflicts resolved (all additive):

Verification on df5238b:

Suite Result
Targeted (8-K + DEFM14A + task forms + spac + form-8k-event + registration-statements) 229 pass / 0 fail
Full bun test 1410 pass / 7 fail — all 7 fails are pre-existing FetchDailyIndexTask / FetchQuarterlyIndexTask 5000ms network timeouts (sandbox can't reach SEC.gov), same set noted on the earlier post-fix run.
bun run build clean (bun build + tsc)

CI will validate.


Generated by Claude Code

…e-fence API

The new SPAC extractors added in PR #166 (extractMergerDeal,
extractRedemption) called the pre-PR wrapUntrusted shape (returning a
string) and the removed UNTRUSTED_PREAMBLE constant. After this PR
swapped wrapUntrusted to return { wrapped, nonce } and replaced the
constant with buildUntrustedPreamble(nonce), the surviving call sites
template-interpolated UNTRUSTED_PREAMBLE as a free identifier ->
compile error (TS2552), and even if the type had survived the
{ wrapped, nonce } object would have rendered as "[object Object]"
in the prompt -> the model receives garbage and silently returns
nothing (caught by Form_DEFM14A.storage.e2e.test.ts target_name=null
assertions in the post-merge run).

Both extractors now use the same nonce-fence + multi-stage defang as
the other section extractors -- a forced consequence of the merge,
extending the per-call nonce + entity decode + NFKC + zero-width
strip protection to the new SPAC AI extractors at no extra design
cost.

Co-Authored-By: Claude <noreply@anthropic.com>

Copy link
Copy Markdown
Contributor Author

CI on df5238b failed with two TS2552: Cannot find name 'UNTRUSTED_PREAMBLE' errors at sectionExtractors.ts:327 and :363. Self-inflicted: when I resolved the merge conflicts and ran tests locally, I updated the new extractMergerDeal / extractRedemption to the nonce-fence API, but git add ran before those edits, so the merge commit shipped without them. Cleanup commit a20acc6 carries the two lines through. bunx tsc clean locally; CI will validate.


Generated by Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants