diff --git a/docs/plan/BACKLOG.md b/docs/plan/BACKLOG.md index fb21db37..303415a0 100644 --- a/docs/plan/BACKLOG.md +++ b/docs/plan/BACKLOG.md @@ -37,6 +37,117 @@ > 2026-05-03 history maintainability note:使用者以「繼續開展工作」授權打開 dedicated backend maintainability window。`WORK-HISTORY-MAINT-A` review 已完成並從 BACKLOG 移除;`WORK-HISTORY-MAINT-B` 已完成第一個 behavior-preserving extraction slice,把 history pagination / favicon / export owners 拆到 `archive/history/` 子模組。BACKLOG 目前只剩 blocked work blocks,沒有可提升的未阻塞 current-focus block。 > 2026-05-07 archive test-suite maintainability note:Explorer advanced-search 插單補測時,`src-tauri/crates/vault-core/src/archive/tests.rs` 已達 3272 行。本次只追加 regression coverage,沒有新增業務邏輯;依 `AGENTS.md` 巨檔規則,新增 high-priority follow-up `WORK-ARCHIVE-TEST-MAINT-A`,必須用 dedicated 維護窗口審查拆分測試 owner,後續不要繼續把 archive 新測試集中塞進該檔。 > 2026-05-10 v0.2.0 planning repair note:v0.2.0 發佈範圍正式收斂為 M14 Lexical Recall V2、advanced keyword syntax、Windows unsigned installer / scheduler preview、release/security hardening,以及既有 archive / deterministic Core Intelligence。原先未完成的 v0.2 AI / semantic / MCP / readable-content blocker 已全部移到 v0.3.0;`STATUS.md` 只保留 v0.2 release closeout,不能再把 AI / readable-content 當成 v0.2 ship blocker。 +> 2026-05-25 import test harness planning note:使用者反映實際導入瀏覽記錄時觀察到疑似 duplication,並要求專門的 ingest robustness 測試基礎建設。經 ingest 代碼 audit(見 `docs/plan/program/import-dedup-audit.md`)確認:跨瀏覽器「視覺重複」是 per-source-profile 設計契約(不是 bug),但發現 6 個真實 bug:B1 URL upsert 倒退、B2 Firefox/Safari long-tail revisit 漏抓、B3 Takeout source_visit_id 綁路徑、B4 Takeout × local Chrome 必然雙倍、B5 takeout `stable_key_i64` 規模化碰撞、B6 Takeout 時間單位歧義。新增 `WORK-IMPORT-TEST-HARNESS-A` 作為**第一個 unblocked block**,內含 scaffold + Priority 1 scenario library;後續的 cross-source view-layer aggregation、bug fixes 都會依託這個 harness 寫 failing test。完整 scenario library 與驗收條件見 `docs/plan/program/import-test-harness-spec.md`。 + +- [x] **WORK-IMPORT-TEST-HARNESS-A** — Browser History Import Test Harness Foundation + - 2026-05-25 closeout: audit + fixture crate + 12 e2e scenarios (9 contract, 3 `#[should_panic]` bug repros) + TODO for sub-ms Chrome collision. B5 scale test deferred to WORK-IMPORT-SCALE-TEST-A. See CHANGELOG for full details. + - 讀先: + `docs/plan/program/import-dedup-audit.md` + `docs/plan/program/import-test-harness-spec.md` + `docs/architecture/browser-support-and-adapter-playbook.md` + `src-tauri/crates/vault-core/src/migrations/001_initial.sql` + `src-tauri/crates/vault-core/src/migrations/002_archive_runtime_foundation.sql` + `src-tauri/crates/vault-core/src/archive/ingest/writes.rs` + `src-tauri/crates/vault-core/src/archive/ingest/mod.rs` + `src-tauri/crates/vault-core/src/archive/ingest/parser.rs` + `src-tauri/crates/vault-core/src/archive/mod.rs` + `src-tauri/crates/browser-history-parser/src/chromium/mod.rs` + `src-tauri/crates/browser-history-parser/src/firefox/mod.rs` + `src-tauri/crates/browser-history-parser/src/safari/mod.rs` + `src-tauri/crates/browser-history-parser/src/takeout/browser_history.rs` + `src-tauri/crates/browser-history-parser/src/takeout/source.rs` + - 目標:建立 `src-tauri/crates/browser-history-fixtures` crate,內含:(1) 真實 schema 的 Chromium History / Firefox places.sqlite / Safari History.db / Takeout JSON/JSONL/zip fixture generator;(2) Scenario DSL 與 deterministic seed;(3) 跑通 ingest pipeline 後讀回 canonical archive 的 assertion API;(4) Priority 1 scenarios(C1/C2/C3/T1/T2/X1)與 fixture round-trip self-validation;(5) 為 audit 列的 6 個 bug 各寫一個 failing `#[should_panic]` 測試並在 spec doc 加上 traceability。 + - 契約: + - **絕對不讀取使用者真實瀏覽資料**。fixture 全部由 deterministic seed 程序化生成;URL / title 只用 checked-in public-domain corpus(Wikipedia article titles、`example.com` / `synthetic.test` 偽 hosts)。 + - 新 crate 進 Cargo workspace、納入 `bun run check`,所有現有 100% JS/Rust coverage gate 不放鬆。 + - 不修任何 product code bug —— harness 只負責 expose;fixes 由獨立 follow-up block 處理,merge 時把對應 scenario 從 `#[should_panic]` flip 成 `#[test]`。 + - 不新增 third-party dependency 除非經審核(目前計畫使用 `rusqlite` / `serde_json` / `chrono` / `rand` / `rand_chacha` / `tempfile` / `zip`,全部已在 workspace)。 + - 不在這個 block 內 cover view-layer cross-browser aggregation(另立 block)。 + - 生成 SQLite 必須通過真實 PathKeep parser 的 round-trip 測試(self-validation gate),否則 scenario 是無效保證。 + - 不在 STATUS.md 同時運行 paper redesign + harness 兩條軌道前需使用者授權(per AGENTS.md「計劃外大工作 → 進 BACKLOG.md,不直接做」)。 + - 驗收: + - `browser-history-fixtures` crate builds clean、在 `bun run check` 通過。 + - `tests/fixture_roundtrip.rs` 全綠 —— 每個 generator output 都被真實 parser 正確讀回。 + - Priority 1 scenarios(C1/C2/C3/T1/T2/X1)實作完成,contract scenarios pass、bug scenarios `#[should_panic]` with doc comment 連到 audit bug ID。 + - `docs/plan/program/import-dedup-audit.md` 新增「Bugs with failing tests」章節,列出每個 bug 對應的 scenario function。 + - CHANGELOG 紀錄哪些 audit bugs 已有 failing tests、哪些尚待 follow-up。 + - 三語 i18n 不適用(test infra 內部 ID 用 ASCII)。 + +- [x] **WORK-IMPORT-TEST-REMAINING-A** — Import Test Harness Remaining Audit Items + Maintainability + - 2026-05-25 closeout: all non-blocked audit items complete. Edge cases (E1-E6, C_SUB_MS, Empty DB×3, R1), cross-family baselines (F_C2, S_C2), Takeout coverage (ptoken, visitedAt, missing-time), and maintainability refactor (1274→641 lines via Takeout extraction + F2/S2 move) all shipped. R2/R3 and B5 remain blocked on infrastructure not yet built. + - 讀先: + `docs/plan/program/import-dedup-audit.md` + `docs/plan/program/import-test-harness-spec.md` + - 剩餘 blocked items now tracked individually:(1) R2/R3 crash rollback/batch revert — needs transaction-abort test infra;(2) B5 scale collision test — see WORK-IMPORT-SCALE-TEST-A。 + - 契約:不修 product code;maintainability refactor 不改 behavior。 + +- [!] **WORK-IMPORT-SCALE-TEST-A** — B5 Takeout `stable_key_i64` Collision At Scale [!blocked: needs million-record fixture infrastructure + benchmark tooling] + - 讀先: + `docs/plan/program/import-dedup-audit.md` (§B5) + `docs/plan/program/import-test-harness-spec.md` (T4 scenario) + `src-tauri/crates/browser-history-parser/src/takeout/browser_history.rs` (`stable_key_i64`) + `src-tauri/crates/browser-history-fixtures/src/takeout/mod.rs` + - 目標:驗證 B5 hash collision probability — 用 1M+ record Takeout fixture 觀察 `stable_key_i64` 的實際碰撞率,確認是否在 14.4M design ceiling 下需要更換 hash function。 + - 契約:不修 product code;只產出 benchmark + collision statistics。 + +- [ ] **WORK-IMPORT-FIXTURE-SIDECARS-A** — Chromium Sidecar Tables Fixture Extension + End-to-End Scenarios + - 讀先: + `docs/plan/program/import-dedup-audit.md` (§3 — "Downloads / search_terms / favicons all supported") + `docs/plan/program/import-test-harness-spec.md` + `src-tauri/crates/browser-history-fixtures/src/chromium/mod.rs` (current writer: urls + visits only) + `src-tauri/crates/browser-history-parser/src/chromium/mod.rs` (lines 115+ — DOWNLOADS_SQL / SEARCH_TERMS_SQL / FAVICONS_SQL) + `src-tauri/crates/vault-core/src/archive/ingest/writes.rs` (`insert_download`, `insert_search_term`, `insert_favicon`) + `src-tauri/crates/vault-core/src/migrations/002_archive_runtime_foundation.sql` (downloads / keyword_search_terms / favicons / favicon_bitmaps schemas) + - 觀察(2026-05-25):現在的 `ChromiumHistoryFixture` 只能寫 `urls` + `visits` 兩張表。實際 Chrome `History` DB 還有 `downloads`, `keyword_search_terms`, `favicons`/`favicon_bitmaps`/`icon_mapping` 等表,parser 都有對應 SELECT 與 archive 寫入,但**端到端 scenario level 完全沒測過** —— CHANGELOG 早有記錄。實際使用者真的有下載歷史 / 搜尋詞 / favicon,這個 gap 真實存在。 + - 目標:(1) 在 `browser-history-fixtures/src/chromium/mod.rs` 加 `ChromiumDownloadRow` / `ChromiumKeywordSearchTermRow` / `ChromiumFaviconRow` + `ChromiumIconMappingRow` 三個(或四個)資料結構與對應的 `add_download` / `add_search_term` / `add_favicon` 方法;(2) 在 `SCHEMA_SQL` 補 real Chromium downloads / keyword_search_terms / favicons / favicon_bitmaps / icon_mapping 表結構(schema 要對齊真實 Chrome 145+ 版本,columns 取自 parser 的 SELECT 列表);(3) 寫四個新 scenario:T6 `chromium_downloads_round_trip_to_archive_downloads_table`、T7 `chromium_keyword_search_terms_land_with_term_text_preserved`、T8 `chromium_favicons_link_to_canonical_url_rows_with_blob_dedup`、T9 `chromium_icon_mapping_resolves_url_to_favicon`;(4) 為新 fixture 表加 round-trip self-validation 測試到 `tests/fixture_roundtrip.rs`。 + - 契約: + - 不修 product code;只擴展 fixture + 加 scenario。 + - **絕對不讀取使用者真實瀏覽 / 下載資料**。所有 fixture rows 由 deterministic seed 程序化生成,URL / filename / search term 只用 `example.com` / `synthetic.test` / public-domain corpus。 + - 三個(或四個)新 fixture data structures 不超過 800 行(含 schema、helper、unit test)。 + - 100% Rust coverage 維持;新 scenario 必須在 `cargo test -p vault-core` 與 `bun run check` 全綠。 + - Favicon blob bytes 使用 4-byte synthetic PNG header(`\x89PNG\r\n\x1a\n` + 1 byte filler),不從真實圖檔取材。 + - 驗收: + - `ChromiumHistoryFixture` 至少支援 4 個新 add\_\* 方法 + 對應 SCHEMA_SQL 擴展。 + - 4 個新 scenario 全綠,分別 assert downloads / search_terms / favicons / icon_mapping 從 fixture 進 archive 後 column values 1:1 對應。 + - `tests/fixture_roundtrip.rs` 新增 self-validation 測試,確認 fixture writer 寫出的 SQLite DB 可被真實 parser 讀回。 + - audit doc §6 contract table 新增 T6-T9 rows + 對應 §3 Chromium downloads / search_terms / favicons 註腳更新。 + - CHANGELOG 紀錄哪些 sidecar tables 現在有 end-to-end scenario coverage。 + +- [ ] **WORK-IMPORT-TEST-MINOR-A** — Minor Data-Integrity Contract Pins + - 讀先: + `docs/plan/program/import-dedup-audit.md` + `src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_edge_cases.rs` (where these will land) + `src-tauri/crates/browser-history-parser/src/safari/mod.rs` (lines 585-605 — synthesized / load_successful / http_non_get context evidence) + - 觀察(2026-05-25):完成 35 個 dedup scenarios 之後剩下這些 narrow 的 contract pins,每個值都不大但加起來能補完 column-level 行為的測試覆蓋: + 1. **visit_count = 0 / visit_count = N round-trip** — Chrome 對 typed-but-never-visited URL 會寫 `visit_count = 0`,parser 應該照搬不做奇怪轉換。 + 2. **`from_visit` referential integrity** — 如果 `from_visit` 指向不存在的 visit id(user 手動編輯 DB 或 parent visit 被刪),archive 怎麼存?current behavior 是 dangling reference 還是 0? + 3. **`visit_duration_micros` round-trip** — 顯式 assert duration 從 fixture 傳到 archive 的 `visit_duration_us` column 沒丟。 + 4. **Safari `synthesized` context evidence** — audit §3 提到 Safari 的 synthesized flag 會 inflate visit_count,parser 把它記成 `safari.synthesized` ContextEvidence 但沒測過 round-trip。 + 5. **Firefox `visit_type` enum mapping** — Firefox 的 visit_type 編碼跟 Chromium transition 不同,應該照搬到 archive 而不被 normalize。 + - 目標:每個 item 加一個 focused test 到 `dedup_scenarios_edge_cases.rs`(或在 baselines / takeout 各自模組裡),命名遵循 E-series(E10 / E11 / E12 / E13 / E14)。 + - 契約:不修 product code;每個 test < 80 lines;不擴展 fixture API(用現有 fields);audit doc §6 同步更新。 + - 驗收:5 個新 test 全綠;`cargo test -p vault-core` + `bun run check`;audit doc §6 contract table 新增 5 rows;CHANGELOG 紀錄這批 pins。 + +- [ ] **WORK-IMPORT-TEST-PARSER-ORDERING-A** — Visit-Before-URL Parser Ordering Contract + - 讀先: + `docs/plan/program/import-dedup-audit.md` (§4 — "Visit→URL ordering dependency" + §5.3) + `src-tauri/crates/vault-core/src/archive/ingest/mod.rs` (lines 155-158 — `ArchiveChunkConsumer::visits` silently drops visit if url_id_map miss) + `src-tauri/crates/vault-core/src/archive/ingest/chunk_consumer.rs` (if separate file) + - 觀察:audit §4 明確指出 parser 必須先 emit `urls()` 再 emit `visits()`;任何後續 refactor 改動 batching order 都會造成 silent data loss。但這個契約完全在 parser 層,不容易從 e2e scenario 測 —— 需要寫一個 mock `ChunkConsumer` 或直接 call `ArchiveChunkConsumer::visits` 在沒有對應 url_id_map entry 時,verify 行為(silent skip vs error)。 + - 目標:在 vault-core 內加一個 unit test (不是 scenario) 直接驅動 `ArchiveChunkConsumer::visits` with empty url_id_map,assert visits are silently skipped (current behavior), 然後在 doc comment 連到 audit §4 警告任何未來 refactor 都要保留這個契約或顯式 fail-fast。 + - 契約:不修 product code;測試只 pin 現有行為(silent skip),不主張 fail-fast 行為。如果 reviewer 認為應該改成 fail-fast,那是另一個 design conversation。 + - 驗收:1 個 unit test 在 `dedup_scenarios_edge_cases.rs` 或 `writes.rs` 的 #[cfg(test)] module 全綠;audit doc §4 加 cross-reference 連到 test;CHANGELOG 紀錄這個 narrow contract pin。 + +- [ ] **WORK-IMPORT-TEST-CONCURRENCY-A** — Multi-Profile Concurrent Ingest Safety + - 讀先: + `docs/plan/program/import-dedup-audit.md` (§4 — "Watermark race") + `src-tauri/crates/vault-core/src/archive/ingest/mod.rs` (lines 411-437 — transaction + watermark save) + `src-tauri/crates/vault-core/src/archive/mod.rs` + `src-tauri/crates/vault-worker/src/archive_flows.rs` + - 觀察:audit §4 指出 single-DB transaction 已經阻止 same-profile concurrent ingest,但 in-app queue serialization 與 backup vs Browser Direct cross-flow 沒測過。實際 production scenario:使用者點 manual backup 同時 schedule 觸發 auto backup,兩個 flow 都會試著 ingest 同一個 source_profile,race condition 可能讓 watermark 被踩或讓 same profile 同時被兩個 transaction 處理。 + - 目標:(1) Reading 現有 worker queue / archive flow code,確認 same-profile 的 serial guarantee 從哪裡來;(2) 寫一個 integration test 模擬兩個 import flow 對同一 profile,assert second flow 等到 first flow 完成才開始;(3) 如果發現 gap,建立 bug entry,但**不在這個 block 修**。 + - 契約:第一階段 audit-only(read + analysis),第二階段才寫測試;不修 product code;發現 bug 寫 BACKLOG entry 不直接 fix。 + - 驗收:audit doc 新增 §4.1 "concurrent ingest safety analysis" 子章節;至少 1 個 integration test 證明 same-profile concurrent flow 是 serialized;任何發現的真實 race condition 寫獨立 BACKLOG block。 - [!] **WORK-AI-V03-A** — Optional AI Runtime Re-Enablement [!blocked: v0.3 scope decision, real provider acceptance, release-size evidence] - 讀先: diff --git a/docs/plan/CHANGELOG.md b/docs/plan/CHANGELOG.md index 0453cf1a..aedb875e 100644 --- a/docs/plan/CHANGELOG.md +++ b/docs/plan/CHANGELOG.md @@ -1477,3 +1477,494 @@ negative-cache TTL auto-refetch (Phase 1.4)`):vault-core 新增 (見後續 Phase 0 close-out commit 的 verification)。- **後續 backlog**(保留在 `docs/features/og-images.md` §6):image dimension probe(depends on pure-Rust image crate, 純資訊性低 價值)、readable-content 對齊的批量 import 抓取。 + +## Import Data Integrity + +- [x] **WORK-IMPORT-TEST-HARNESS-A** — Browser History Import Test Harness Foundation + - 2026-05-25 closeout: + - **Architecture audit** (`docs/plan/program/import-dedup-audit.md`): full + code-level audit of the ingest dedup pipeline — dedup keys, per-family + watermark strategies, fingerprint partial index, 6 bugs identified + (B1–B6). Three audit claims corrected by empirical test findings: + B2 Safari refuted (MAX on-the-fly, no cached column), B3 simple-case + refuted (fingerprint partial index catches renamed-file identical + records), B4 reframed from "bug" to "design constraint." + - **Fixture crate** (`src-tauri/crates/browser-history-fixtures`): four + family writers (Chromium, Firefox, Safari, Takeout) that produce + schema-correct SQLite / JSON fixtures from deterministic seeds. + Time helpers (`unix_ms_to_chrome_time`, etc.) encapsulate each + family's epoch convention. 15 parser round-trip self-validation tests + across 4 files prove every generated fixture parses correctly through + the real PathKeep parser. + - **Scenario library** (`vault-core::archive::ingest::dedup_scenarios`): + 12 end-to-end scenarios driving `process_profile_snapshot` and + `import_takeout` against the real archive DB: + - Contract (pass today, guard against regression): C1, C2, C3, S2, + T1, T2, T3, T5, X1. + - Bugs with `#[should_panic]` (flip to `#[test]` when fix lands): + C4 (B1), F2 (B2), T2b (B3 narrow case). + - **TODO markers**: sub-millisecond Chrome visit collision (C_SUB_MS) + flagged in both audit doc §4 and dedup_scenarios.rs for follow-up. + - **Spec doc** (`docs/plan/program/import-test-harness-spec.md`): + 32 scenarios across 6 priority tiers, fixture generator API, + acceptance criteria. Section 6 "Scenarios Now Backed By Tests" + tracks coverage. + - **Not done (by design)**: B5 scale test deferred to dedicated + `WORK-IMPORT-SCALE-TEST-A` block (needs million-record fixture + infrastructure). No product code fixes — harness only exposes bugs. + - **Verification**: `bun run check` green (format + lint + typecheck + + i18n + unit tests + coverage + build + e2e + desktop-bridge truth + + desktop-contract mutation). + +- [x] **WORK-IMPORT-TEST-HARNESS-A (follow-up)** — Bug Fixes + SQLite-Level Audit Hardening + - 2026-05-25 closeout: B1/B2/B3 ingest dedup bugs fixed, 22-finding audit + implemented with 13 new Rust tests. + - **Bug fixes** (commit 6884c10d): + - B1: URL upsert now uses `MAX()` for visit_count/typed_count and + `CASE WHEN excluded.last_visit_ms >= urls.last_visit_ms` for title/hidden. + - B2: Firefox URL stream gets the same OR-fallback clause Chromium uses + (`OR moz_places.id IN (SELECT DISTINCT place_id FROM moz_historyvisits WHERE id > ?2)`). + - B3: Takeout `source_visit_id` now derived from `url:visit_time_micros` + instead of `source_path:ordinal:url`. + - C4/F2/T2b flipped from `#[should_panic]` to plain `#[test]`. + - **Audit hardening** (commit 3b7c14f7): + - Round-trip tests: Safari extra-column assertions (typed_evidence for + load_successful/synthesized/redirect/score), Firefox full-field assertions + (typed_count, visit_duration_ms, is_known_to_sync, etc.), Takeout + client_id/favicon_url/page_transition context evidence assertions. + - New baseline scenarios: F1 (Firefox) and S1 (Safari) happy-path imports + in `dedup_scenarios_baselines.rs` (646 lines). + - Chromium fingerprint dedup scenario: re-import with different + source_visit_ids asserts event_fingerprint partial index catches dupes. + - Edge cases: CJK URL/title round-trip, Safari pre-1970 timestamp clamping + (lossy `.max(0)` behaviour documented), Firefox NULL visit_count/last_visit_date. + - C4 expanded: third import pass with strictly older last_visit_ms verifying + title/hidden don't regress. + - writes.rs: fingerprint source_kind contract test, url_bounds no-change test. + - Audit doc updated: B1/B2/B3 marked FIXED, F1/S1/fingerprint-dedup added. + - **Not done (deferred to BACKLOG)**: + - Takeout `ptoken` field fixture + assertion. + - Takeout `visitedAt` ISO format fixture. + - E-series URL canonicalization scenarios (E6 fragment/trailing-slash). + - C_SUB_MS sub-millisecond Chrome visit collision scenario. + - `dedup_scenarios.rs` maintainability review (1278 lines, >1200 threshold). + - **Verification**: Rust 100% (33,956 lines / 1,604 functions), JS 99%+ + (99.05/98.01/99.54/99.53), 787 Rust + 1906 JS tests pass. `bun run check` + green except pre-existing flaky desktop-bridge e2e (`socket hang up` on + `run_backup_now` — verified same failure on clean tree). + +--- + +### WORK-IMPORT-TEST-HARNESS-B — Edge-case & cross-family dedup scenario expansion + +- **Date**: 2026-05-25 +- **Commit**: 728c1b88 +- **Scope**: Filling assessment gaps — raised spec coverage from ~40% (12/30 + scenarios) toward ~63% (19/30) by adding 9 new test scenarios across 2 files. + +#### New tests + +1. **`dedup_scenarios_edge_cases.rs`** (NEW, 564 lines) — 7 tests: + - **C_SUB_MS (E5)**: Sub-millisecond Chrome visit collision — pins the + known limitation that two visits to the same URL within the same ms are + collapsed by the fingerprint partial unique index. + - **E6**: URL canonicalization contract — trailing slash, fragment, mixed + case all stored verbatim as separate URLs (no normalization). + - **Empty DB × 3 families**: Chromium, Firefox, Safari zero-row fixtures + import without error, summary reports 0/0. + - **R1a**: Corrupt random bytes file → `Err`, not panic. + - **R1b**: Valid SQLite DB missing required browser tables → `Err`, not panic. + +2. **`dedup_scenarios_baselines.rs`** (+160 lines → 806 total) — 2 tests: + - **F_C2**: Firefox incremental no-new-data (watermark prevents re-import). + - **S_C2**: Safari incremental no-new-data (same pattern). + +#### Doc updates + +- `import-dedup-audit.md` §4: sub-millisecond TODO replaced with implemented + test cross-reference; URL canonicalization section updated with E6 reference. +- `import-dedup-audit.md` §6: 9 new scenarios added to contract scenarios table. +- `dedup_scenarios.rs`: C_SUB_MS TODO replaced with cross-reference to edge_cases. + +#### Remaining gaps (still in BACKLOG) + +- **R2/R3**: Crash rollback, batch revert — requires transaction-abort + test infrastructure not yet built. +- **E1-E4**: Time boundary edge cases (epoch, year-2038, far-future, DST). +- **T4**: Takeout hash collision at scale (needs million-record fixture infra). +- **Download/SearchTerm/Favicon minimal E2E**: Completely untested at scenario + level (covered by unit tests in `writes.rs` and chunk_consumer integration). + +#### Verification + +- 598 vault-core tests pass (24 dedup scenarios across 3 modules). +- Rust coverage: 100% (34,423 lines / 1,611 functions). +- `cargo fmt --all` clean. + +### WORK-IMPORT-TEST-REMAINING-A (partial) — Time boundaries + Takeout ptoken/visitedAt coverage + +> 2026-05-25 · commit 30febcab · `feat/import-data-integrity-tests` + +Fills the remaining "easy" gaps identified in the WORK-IMPORT-TEST-REMAINING-A +audit checklist. All items that don't require new infra (transaction-abort +hooks, million-record fixtures) are now covered. + +#### New tests + +1. **`dedup_scenarios_edge_cases.rs`** (+162 lines → 895 total) — 4 tests: + - **E1**: Epoch timestamp (visit_time_ms = 0) stores and round-trips as 0. + - **E2**: Year-2038 boundary (2,147,483,647,000 ms) round-trips correctly. + - **E3**: Far-future timestamp (year 9999) stores without overflow. + - **E4**: Negative timestamp from source DB clamped to 0 by all parsers. + +2. **`browser-history-fixtures/src/takeout/mod.rs`** (+26 lines → 248 total): + - Added `ptoken: Option` field with serialization + unit test. + +3. **`browser-history-fixtures/tests/takeout_roundtrip.rs`** (+74 lines → 311 total) — 3 additions: + - ptoken evidence assertion in existing standard roundtrip test. + - **`takeout_visited_at_iso_string_parsed_correctly`**: hand-crafted JSON + with `visitedAt` RFC-3339 strings verifies the parser's ISO fallback path. + - **`takeout_record_without_time_field_is_skipped`**: record without any time + field silently dropped; only time-bearing records produce URL + visit rows. + +4. **`dedup_scenarios.rs`** (+1 line) — fix compilation: `ptoken: None` added + to `takeout_record` helper after fixture API change. + +#### Doc updates + +- `import-dedup-audit.md` §6: 7 new scenarios added to contract table + (E1-E4, Takeout ptoken/visitedAt/missing-time). + +#### Remaining gaps (still in BACKLOG) + +- **`dedup_scenarios.rs` maintainability refactor** (1274 lines, >1200 threshold): + review phase complete (split proposal documented), execution phase not started. +- **R2/R3**: Crash rollback / batch revert — still needs transaction-abort + test infrastructure. +- **B5 / T4**: Takeout hash collision at scale — still needs million-record + fixture infra. + +#### Verification + +- 602 vault-core tests pass (28 dedup scenarios across 3 modules). +- 9 fixture crate tests pass (5 integration + 4 unit). +- Rust coverage: 100% (34,535 lines / 1,611 functions). +- `cargo fmt --all` clean. + +### WORK-IMPORT-TEST-REMAINING-A (closeout) — dedup_scenarios.rs maintainability refactor + +> 2026-05-25 · commit 0f41e7f7 · `feat/import-data-integrity-tests` + +Executes the documented split proposal for `dedup_scenarios.rs` (1274 lines, +above the 1200-line maintainability threshold). Behavior-preserving +extraction — zero test behavior changes, all 602 vault-core tests pass. + +#### Changes + +- **New `dedup_scenarios_takeout.rs`** (561 lines): T1, T2, T2b, T3, T5 + + Takeout-specific helpers + duplicated shared test infrastructure. +- **`dedup_scenarios_baselines.rs`** (806 → 980 lines): gained F2 (Firefox + long-tail revisit B2) + S2 (Safari long-tail revisit refutation). +- **`dedup_scenarios.rs`** (1274 → 641 lines): now Chromium-only (C1-C4, X1). + Removed 8 unused fixture imports, updated module doc to reference + companion modules. +- Registered `dedup_scenarios_takeout` in `mod.rs`. + +#### File size summary + +| Module | Lines | Status | +| ------------------------------- | ----- | ------------- | +| `dedup_scenarios.rs` | 641 | ✅ under 800 | +| `dedup_scenarios_baselines.rs` | 980 | ✅ under 1200 | +| `dedup_scenarios_edge_cases.rs` | 726 | ✅ under 800 | +| `dedup_scenarios_takeout.rs` | 561 | ✅ under 800 | + +#### Remaining blocked gaps (tracked in BACKLOG) + +- **R2/R3**: Crash rollback / batch revert — needs transaction-abort test infra. +- **B5 / T4**: Takeout hash collision at scale — needs million-record fixture infra. + +### Import test harness expansion — provenance, incremental, schema, multi-profile + +> 2026-05-25 · commits ec95f4f0 / 325d4dc4 / cd6b65d5 · `feat/import-data-integrity-tests` + +Closes the remaining unblocked §5 contract gaps after the maintainability +refactor. Adds 4 new Chromium-family scenarios; brings total dedup +scenarios to 31 across 4 modules. + +#### New tests + +1. **X2 — Atlas / Comet provenance** (`x2_chromium_family_products_preserve_browser_product_identity`): + imports 3 Chromium-family profiles (Atlas, Comet, Chrome); asserts each + `browser_product` and `browser_kind` round-trips verbatim. Pins playbook + §156-161 (ChatGPT Atlas / Perplexity Comet must not collapse to "Google Chrome"). + +2. **C5 — Append-new-rows incremental** (`c5_chromium_incremental_append_new_urls_and_visits`): + re-import where second pass adds 2 wholly new URLs + 2 new visits (no + overlap with first pass). Watermark lets only new rows land; originals + stay deduplicated. Pins §5.1 "re-import after appending new rows" — the + most common real-world incremental import shape. + +3. **C6 — Schema tolerance** (`c6_chromium_extra_columns_on_source_db_do_not_break_ingest`): + uses `ALTER TABLE` to add 4 real Chrome columns (`favicon_id`, + `segment_id`, `opener_visit`, `originator_cache_guid`) with synthetic + non-null data, then ingests. Verifies parser's explicit-column-list + discipline tolerates Chrome's schema evolution. Pins §5.1 "re-import + after schema migration"; catches accidental `SELECT *` regressions. + +4. **X3 — Multi-profile per browser** (`x3_multiple_profiles_within_same_browser_stay_independent`): + imports same URL+visit under chrome:Default and chrome:Profile 1; asserts + the fingerprint partial index is per-profile (no cross-profile dedup), + then re-imports Profile 1 with new content asserting Default's watermark + advance didn't affect Profile 1's incremental cursor. Pins per-profile + isolation on all 3 axes (source_profiles row, fingerprint scope, watermark). + +#### Audit doc updates + +- `import-dedup-audit.md` §6: 4 new scenario rows added (X2, C5, C6, X3). + +#### File size impact + +- `dedup_scenarios.rs`: 641 → 1170 lines (approaching 1200 review threshold). + Subsequent Chromium-only scenarios should go to satellite modules or + trigger a second split round. + +#### Verification + +- 606 vault-core tests pass (31 dedup scenarios across 4 modules). +- 9 fixture crate tests pass. +- `cargo fmt --all` clean. + +#### Contract coverage status + +All audit §5 contracts that are testable without blocked infrastructure +are now pinned. Remaining gaps are infrastructure-blocked: + +- **R2/R3 crash rollback** — needs transaction-abort test infra. +- **B5/T4 hash collision at scale** — needs million-record fixture infra. +- **Parser visit-before-URL ordering** — would require an artificial + parser; low value at this layer. + +### Data-integrity edge cases — NULL handling and Unicode round-trip + +> 2026-05-25 · commit aaf71c19 · `feat/import-data-integrity-tests` + +Adds two real-world data-integrity scenarios that complement the §5 +contract pins. + +#### New tests + +1. **E7** (`e7_null_title_imports_with_null_archive_title`): NULL source + `title` must project as NULL in archive, not empty string. Sibling + non-NULL title round-trips normally. Real Chrome routinely produces + NULL titles (pages that never loaded, binary downloads). + +2. **E8** (`e8_unicode_urls_and_titles_round_trip_byte_identical`): three + Unicode shapes (CJK Traditional Chinese title with em-dash, + percent-encoded path with `%E6%B8%AC%E8%A9%A6`, emoji 🚀 in title) + round-trip byte-identical. Pins NO NFC/NFD normalization, NO case + folding, NO percent-decoding. Critical for international users. + +#### Final test harness state + +- **34 dedup scenarios** across 4 modules: + - `dedup_scenarios.rs` (1170 lines): C1-C6, X1-X3 + - `dedup_scenarios_baselines.rs` (980 lines): F1, S1, F2, S2, F_C2, S_C2, fingerprint dedup + - `dedup_scenarios_edge_cases.rs` (902 lines): E1-E8, C_SUB_MS, Empty DB×3, R1a/R1b + - `dedup_scenarios_takeout.rs` (561 lines): T1, T2, T2b, T3, T5 +- 608 vault-core tests pass; 9 fixture crate tests pass. +- All §5 audit contracts pinned (except infrastructure-blocked items). +- Rust workspace compiles clean across all targets. + +### Final session entry — E9 hidden flag + future-work BACKLOG additions + +> 2026-05-25 · commits 8bc8b5ce + (this) · `feat/import-data-integrity-tests` + +#### One more focused scenario + +**E9** (`e9_hidden_url_flag_round_trips_for_both_true_and_false`) in +`dedup_scenarios_edge_cases.rs`: pins that `hidden = true` source URL +(Chrome redirect intermediates) lands non-zero in archive and +`hidden = false` lands as 0. C-series only exercised `hidden: false`, +and C4 (B1 fix) only used `hidden: true` in regression-prevention +context — first-time-import preservation was not pinned. + +#### Final state + +- **35 dedup scenarios** across 4 modules (added E9). +- 609 vault-core tests pass. +- Rust coverage 100% (34,985 instrumented lines / 1,616 functions). +- `bun run check:base` green; `bun run coverage:rust` green. +- `bun run check` failed on **one unrelated E2E flake** — + `tests/e2e/desktop-bridge.spec.ts:223` ("runs a live backup and core + intelligence flow through the desktop command bridge") returned + `socket hang up` on `POST /commands/run_backup_now`. This is a + network-level desktop-bridge test failure with no connection to + Rust-only test additions in this branch. + +#### Future work documented in BACKLOG + +Four new work blocks added to BACKLOG for the follow-up work the user +flagged as "do later": + +1. **WORK-IMPORT-FIXTURE-SIDECARS-A** — Extend Chromium fixture to + write `downloads` / `keyword_search_terms` / `favicons` / + `favicon_bitmaps` / `icon_mapping` tables, plus T6-T9 end-to-end + scenarios. Currently the parser supports these tables and writes.rs + has `insert_download` / `insert_search_term` / `insert_favicon`, + but no scenario covers them end-to-end. + +2. **WORK-IMPORT-TEST-MINOR-A** — 5 narrow contract pins as E10-E14: + visit_count edges, from_visit referential integrity, visit_duration + round-trip, Safari synthesized flag, Firefox visit_type enum. + +3. **WORK-IMPORT-TEST-PARSER-ORDERING-A** — Unit test the + `ArchiveChunkConsumer::visits` silent-skip behavior for visits with + missing url_id_map entries (audit §4 contract). + +4. **WORK-IMPORT-TEST-CONCURRENCY-A** — Audit + integration test for + same-profile concurrent ingest safety (audit §4 watermark race). + +### Code review against feat/v0.3-redesign-2 — 10 fixes applied + +> 2026-05-26 · commits 6587865d / b377f394 / 4992769a / cafac470 / b4e77f7f / 25c90253 / 014db312 · `feat/import-data-integrity-tests` + +A max-effort code review (5 finder angles × 8 candidates, then 1-vote +verification + gap sweep) against the merge-target branch surfaced 15 +findings. Of these, 10 were verified valid with a worthwhile trade-off +and were fixed before merge; the remaining 5 were design trade-offs +(MAX visit_count semantics — the B1 fix was intentional) or already +documented contracts that didn't need behavior change. + +#### Correctness fixes + +1. **B1 still live in two Takeout code paths** (commit 6587865d) — + `vault-core/src/takeout/browser_history.rs` and + `vault-core/src/takeout/payload_import.rs` URL upserts unconditionally + overwrote `title` / `hidden` from the latest record. The original B1 + fix (commit 6884c10d) only touched `archive/ingest/writes.rs`. + Mirrored the same CASE-WHEN gates plus `MAX(visit_count)` / `MAX(typed_count)`. + `payload_import.rs` also had `visit_count` / `typed_count` missing + from the UPDATE clause entirely (INSERT VALUES hardcoded `1, 0`), + so Takeout URLs stayed frozen at the first import's count — fixed. + +2. **B1 `>=` tie-break clobbered title with NULL at equal timestamps** + (commit 6587865d) — `writes.rs` upsert used `excluded.last_visit_ms >= +urls.last_visit_ms` for title / hidden, which silently overwrote + captured non-NULL title with NULL whenever last_visit_ms tied. + Firefox bookmark-only URLs (last_visit_date IS NULL → 0) tripped this + on every re-import. Tightened to `>`; added `url` and `payload_hash` + and `recorded_at` to the same strict-newer gate. + +3. **`track_url_visit_bounds` widened bounds from dropped visits** + (commit 6587865d) — `ingest/mod.rs:183` called the bounds tracker + unconditionally after `insert_visit`. When INSERT OR IGNORE silently + dropped a visit (clock-corrected re-import), `urls.first_visit_ms` / + `last_visit_ms` widened from a row that was never stored, leaving + the canonical URL claiming bounds with no matching visit. Gated on + `inserted > 0`. + +4. **B3 ordinal-tiebreaker for Takeout source_visit_id** (commit b377f394) + — The B3 fix changed `source_visit_id` to `{url}:{visit_time_micros}` + for cross-path stability but lost per-record uniqueness. Multiple + Takeout records at the same URL+microsecond collided on the + `(source_profile_id, source_visit_id)` UNIQUE index. Restored the + `ordinal` parameter as a tiebreaker — Google's Takeout JSON is a + deterministic export so ordinals are stable across re-imports. + +5. **`stable_key_i64` could return negative for `i64::MIN` input** + (commit b377f394) — `.abs()` on `i64::MIN` returns `i64::MIN` (no + positive representation) and panics in debug builds. Added explicit + corner-case branch mapping `i64::MIN → i64::MAX`. Smoke test pins + non-negativity across assorted inputs. + +6. **og:image Bilibili API memory DoS** (commits cafac470 / 014db312) + — `resolve_image_url_via_api_with_base` called `response.bytes()` + then checked size, so a multi-GB hostile/MITM response OOM-killed + the worker before the 64 KiB cap could fire. Now: Content-Length + fast-path + generic `read_with_cap` streaming helper that + aborts at the cap. Refactored the helper to take `R: Read` so the + cap-exceeded and read-error branches are unit-testable without + standing up a chunked-encoding mockito server. + +7. **Browser-back stranded canGoForward** (commit b4e77f7f) — the Pop + branch of `use-route-history-nav.ts` only decremented stackIndex, + so browser-back (which bypasses the in-app `goBack` callback) left + the topbar forward chevron disabled even though forward navigation + was available. Added `expectingForwardPopRef` to distinguish + goForward-initiated Pops from external Pops; external Pops now set + `forwardAvailable=true`. + +#### Performance + +8. **Firefox first-import OR-subquery O(N×M) regression** (commit 4992769a) + — B2 fix added `OR id IN (SELECT DISTINCT place_id FROM moz_historyvisits ...)` + to Firefox URLS_SQL. On the AGENTS.md target ceiling (14.4M visits, + first import with watermarks=0), SQLite still materializes the full + DISTINCT subquery even though the first predicate matches every row. + Added `URLS_FULL_SQL` + `first_import` branch matching the Chromium + pattern at `chromium/mod.rs:383-384`. + +#### Test/doc hygiene + +9. **Watermark assertions strengthened in C2 / C5 / X3** (commit 6587865d) + — the new scenarios asserted row counts that the fingerprint partial + index satisfies whether the watermark works or not. Added direct + `profile_watermarks.last_visit_id` assertions so a watermark + regression (cross-profile bleed, lost cursor advance) fails the test + immediately instead of silently passing through the canonical-layer + dedup. + +10. **Misleading comments + unused dep** (commits 25c90253 / 6587865d) — + `writes.rs` fingerprint comment claimed "Takeout dedup relies on + fingerprints matching Chromium's"; actual Takeout flows use + different source_kind and time encoding, so this contract didn't + exist. Rewrote to describe the real per-source-profile scoping. + F2 doc comment in `dedup_scenarios_baselines.rs` still claimed the + Firefox OR fallback was missing and the test was `#[should_panic]`; + both untrue since 6884c10d. `chrono` dep declared in fixtures + `Cargo.toml` but unused — removed plus updated lib.rs doc comment. + Fixture `chrome_time_to_unix_ms` aligned with production's `.max(0)` + clamp. + +#### New regression tests added + +- **C7** (`c7_tied_last_visit_ms_does_not_overwrite_title_hidden_or_payload_hash`) +- **T6** (`t6_takeout_payload_import_url_upsert_protects_against_older_snapshot_regression`) +- **T7** (`t7_takeout_same_url_same_microsecond_records_land_as_distinct_visits`) +- **stable_key_tests** module (smoke test for non-negativity) +- **og:image** read_with_cap × 3 (under-cap success, cap-exceeded, read-error) +- **use-route-history-nav** browser-back-enables-forward +- **fixture time.rs** pre-Unix-epoch chrome time clamps to 0 + +#### Verification + +- 616 vault-core tests pass (was 609 → 616 with C7/T6/T7). +- 46 browser-history-parser tests pass (stable_key_tests added). +- 41 vault-core og_images_synth tests pass. +- 15 use-route-history-nav tests pass. +- Rust coverage 100% (35,184 instrumented lines / 1,630 functions). +- `cargo fmt --all` clean. +- `bun run check:base` green. +- Full `bun run check` fails on the same pre-existing E2E flake + documented in the prior session + (`tests/e2e/desktop-bridge.spec.ts:223` — + `apiRequestContext.post: socket hang up` on + `POST /commands/run_backup_now`). **Verified independent of these + fixes**: reverting just `writes.rs` + `ingest/mod.rs` to the + pre-fix state (b249ea78) and re-running yields the same socket + hang-up failure. The desktop-bridge process closes the connection + during the backup; the Rust changes only touch SQL inside the + transaction and per-visit bookkeeping, neither of which can affect + the dev-IPC HTTP server. Tracked as a pre-existing flake; not + blocking this branch. + +#### Findings not actioned + +- **MAX(visit_count) prevents Chrome history-clear from reducing counts** + — flagged as a possible behavior surprise but the B1 audit explicitly + prioritized "never lose visit_count". This is a documented product + trade-off; no change. +- Other findings were either already-documented contracts (T5 B6 + fingerprint open question) or no-ops once the misleading comments + were corrected. diff --git a/docs/plan/program/import-dedup-audit.md b/docs/plan/program/import-dedup-audit.md new file mode 100644 index 00000000..5f654c57 --- /dev/null +++ b/docs/plan/program/import-dedup-audit.md @@ -0,0 +1,437 @@ +# Import & Dedup Architecture Audit + +> Written 2026-05-25 as the foundation for `WORK-IMPORT-TEST-HARNESS-A`. +> Source of truth: the code at the commits referenced below. Scenarios cited +> here are observable behaviors, not speculation — every claim has a file:line. + +This audit answers one question: **when a user imports browser history into +PathKeep — once, twice, from multiple browsers, from Takeout, from a re-stage of +the same DB — what does the canonical archive actually end up holding, and +where does that diverge from naive user expectations?** + +The audit deliberately keeps product UX out of scope (the cross-browser "looks +duplicated" experience is being addressed by a separate view-layer aggregation +work block). Here we cover only storage-layer truth. + +--- + +## 1. Dedup Keys at a Glance + +- **`source_profiles`** — UNIQUE on `profile_key`, computed as + `browser_kind` + `:` + `profile_name` by + [002_archive_runtime_foundation.sql:7](../../../src-tauri/crates/vault-core/src/migrations/002_archive_runtime_foundation.sql). +- **`urls`** — UNIQUE on `(source_profile_id, source_url_id)`; upsert at + [writes.rs:95-157](../../../src-tauri/crates/vault-core/src/archive/ingest/writes.rs). +- **`visits`** — UNIQUE on `(source_profile_id, source_visit_id)` with a + partial fallback unique index on `(source_profile_id, event_fingerprint)`; + see [002:28-32](../../../src-tauri/crates/vault-core/src/migrations/002_archive_runtime_foundation.sql) + and the insert at [writes.rs:160-218](../../../src-tauri/crates/vault-core/src/archive/ingest/writes.rs). +- **`downloads`** — UNIQUE on `(source_profile_id, source_download_id)` + ([002:38-39](../../../src-tauri/crates/vault-core/src/migrations/002_archive_runtime_foundation.sql)). +- **`search_terms`** — UNIQUE on `(source_profile_id, url_id, normalized_term)` + ([002:44-45](../../../src-tauri/crates/vault-core/src/migrations/002_archive_runtime_foundation.sql)). +- **`favicons`** — UNIQUE on `(source_profile_id, page_url, icon_url, payload_hash)` + ([002:49-51](../../../src-tauri/crates/vault-core/src/migrations/002_archive_runtime_foundation.sql)). + +`event_fingerprint` = `sha256(json({sourceKind, url, visitTime, title, transition, appId}))`, +where `sourceKind` is **hardcoded to `"chromium-history"`** for every family +([writes.rs:206](../../../src-tauri/crates/vault-core/src/archive/ingest/writes.rs)) and +`visitTime` is converted to Chrome-format (microseconds since 1601) regardless +of source family ([writes.rs:208](../../../src-tauri/crates/vault-core/src/archive/ingest/writes.rs)). +Implementation at [archive/mod.rs:348-365](../../../src-tauri/crates/vault-core/src/archive/mod.rs). + +**Architectural invariant**: `source_profile_id` is present in every dedup +key. The schema **cannot** merge two records that come from different +`source_profiles` rows. Cross-browser aggregation must happen at read time +(view layer), not at ingest. + +--- + +## 2. Confirmed Bugs (ranked by likely user impact) + +### B1 — URL upsert silently overwrites counts with older data — FIXED + +**Fixed in commit 6884c10d.** The URL upsert at +[writes.rs:123-145](../../../src-tauri/crates/vault-core/src/archive/ingest/writes.rs) +now uses: + +- `MAX(urls.visit_count, excluded.visit_count)` for `visit_count` +- `MAX(urls.typed_count, excluded.typed_count)` for `typed_count` +- `CASE WHEN excluded.last_visit_ms >= urls.last_visit_ms` for `title` and `hidden` + +The same commit also fixed B2 (Firefox long-tail revisit) and B3 (Takeout +path-bound source_visit_id). The C4 scenario +[`c4_chromium_reimport_older_snapshot_regresses_visit_count_demonstrates_b1`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios.rs) +is now a plain `#[test]` (no longer `#[should_panic]`) and asserts all four +fields (`visit_count`, `typed_count`, `title`, `hidden`) survive re-import +without regression. + +### B2 — Firefox incremental re-import drops long-tail revisits (Safari unaffected) — FIXED + +**Fixed in commit 6884c10d** (same commit as B1). + +Chromium fixed this via the `OR id IN (SELECT DISTINCT url FROM visits WHERE id > ?2)` +clause at [chromium/mod.rs:74-90](../../../src-tauri/crates/browser-history-parser/src/chromium/mod.rs). +The original audit assumed both Firefox and Safari had the same gap, but the +harness scenarios refined the picture: + +- **Firefox** — [firefox/mod.rs:22-33](../../../src-tauri/crates/browser-history-parser/src/firefox/mod.rs): + `WHERE COALESCE(moz_places.last_visit_date, 0) >= ?1` only. A URL whose + `last_visit_date` falls before the URL watermark but whose visit id falls + after the visit watermark gets streamed in the `visits` batch only. + `ArchiveChunkConsumer::visits()` fails the + `url_id_map.get(&visit.source_url_id)` lookup + ([ingest/mod.rs:155-158](../../../src-tauri/crates/vault-core/src/archive/ingest/mod.rs)) + and increments `skipped_visits` silently. The visit is lost forever once + the next watermark moves past it. + [`f2_firefox_incremental_revisit_of_old_url_drops_visit_demonstrates_b2`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios.rs) + is `#[should_panic]` until the OR fallback lands. +- **Safari** — turns out NOT to have the bug. + [safari/mod.rs:42-56](../../../src-tauri/crates/browser-history-parser/src/safari/mod.rs) + computes `(SELECT MAX(history_visits.visit_time) ...) >= ?1` on the fly + from the visits table. There is no cached `last_visit_time` column on + `history_items`, so a new visit row immediately raises the item's + effective last-visit value and the URL is re-streamed. The + [`s2_safari_long_tail_revisit_captured_without_or_fallback`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios.rs) + contract scenario pins this; if a future refactor introduces a stored + cache on `history_items`, the same bug would emerge and this test + would flip from passing to failing. + +The chromium fix exists because it was discovered in real Zhihu-style +long-tail revisit data; the harness now demonstrates Firefox is exposed +to the identical pattern. + +### B3 — Takeout `source_visit_id` is bound to file path (degraded defense) — FIXED + +**Fixed in commit 6884c10d** (same commit as B1 and B2). + +[takeout/browser_history.rs:339](../../../src-tauri/crates/browser-history-parser/src/takeout/browser_history.rs): + +```rust +source_visit_id: stable_key_i64(format!("{source_path}:{ordinal}:{url}").as_bytes()), +``` + +`source_path` is the absolute path to the Takeout JSON file. **Earlier +draft of this audit overstated B3's blast radius** as "renaming the file +produces a full duplicate set"; the harness scenario +[`t2_takeout_rename_file_reimport_dedups_via_fingerprint_partial_index`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios.rs) +proved that in the _all-fingerprint-inputs-identical_ case the +`(source_profile_id, event_fingerprint)` partial unique index catches the +duplicates even though every `source_visit_id` changes. So the actual +behaviors are: + +- Same file, same path → same hash → primary key dedup → ✅ +- Renamed/moved file, **identical record content** → primary key fails to + dedup, but fingerprint partial index catches it → ✅ in practice +- Renamed/moved file, **fingerprint input drift** (Google captured a new + page title in the intervening export window, or transition / app_id is + somehow different) → both indexes miss → ❌ full duplicate set + ([`t2b_takeout_rename_with_title_change_demonstrates_b3_when_fingerprint_diverges`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios.rs) + reproduces this; the test is `#[should_panic]` until the fix lands) + +The design concern stands: the path-bound `source_visit_id` provides +zero useful dedup signal — the system survives only because the +fingerprint partial index is doing double duty. Any change that +narrows the fingerprint inputs (e.g. tightening normalization, +dropping `title` from the hash) would re-expose the user to the full +duplicate set the original B3 claim warned about. Fix shape: +derive `source_visit_id` from `(url, visit_time_micros)` so the +primary key stays stable across re-imports regardless of on-disk path +or downstream fingerprint changes. + +### B4 — Takeout × local-Chrome same-period overlap always double-counts + +Even with **identical** `(url, visit_time_ms)` pairs, the fingerprint differs +because the inputs differ: + +| Field | Local Chrome | Takeout | +| ----------------- | ---------------------- | --------------------------------------------------------------------------------------------------------------------------------- | +| `app_id` | real Chrome app id | hardcoded `"takeout"` ([browser_history.rs:386](../../../src-tauri/crates/browser-history-parser/src/takeout/browser_history.rs)) | +| `transition` | actual transition int | `None` ([browser_history.rs:381](../../../src-tauri/crates/browser-history-parser/src/takeout/browser_history.rs)) | +| `from_visit` | actual from_visit | `None` | +| `source_visit_id` | Chrome visits.id (i64) | path-derived hash | + +Hash inputs differ → fingerprint differs → both unique indexes pass → two +rows. **Net effect: a user who exports Chrome → Takeout once a month and +also imports their local Chrome will see every visit recorded twice**, even +within the same source_profile. + +### B5 — Takeout `stable_key_i64` is collision-prone at scale + +[takeout/browser_history.rs:442-445](../../../src-tauri/crates/browser-history-parser/src/takeout/browser_history.rs): + +```rust +fn stable_key_i64(bytes: &[u8]) -> i64 { + let hex = hex::encode(bytes); + hex.bytes().fold(0_i64, |acc, byte| acc.wrapping_mul(31).wrapping_add(byte as i64)).abs() +} +``` + +Java-style polynomial hash, folded over hex-encoded bytes, modded by +`abs()`. Theoretical space ≈ 2^63 but the low bits dominate due to +`wrapping_mul(31)` and similar URL prefixes produce similar hash prefixes. +For a 14.4M-record Takeout import (the AGENTS.md design ceiling), birthday +collisions on a degenerate 31-bit-effective hash will hit before +2^15.5 ≈ 47k records. + +Collision effects: + +- Two distinct URLs map to the same `source_url_id` → the second visit's + `url_id_map` lookup returns the first URL's canonical id, and its visit + rows attach to the wrong URL. +- Two distinct visits map to the same `source_visit_id` → second visit + silently dropped by INSERT OR IGNORE. + +### B6 — Takeout time unit ambiguity (potentially silent) + +[takeout/browser_history.rs:432-434](../../../src-tauri/crates/browser-history-parser/src/takeout/browser_history.rs): + +```rust +fn micros_to_unix_ms(value: i64) -> i64 { + value.div_euclid(1_000) +} +``` + +The function name asserts the input is Unix microseconds. Inputs come from: + +1. `visitTime` JSON field — provenance unclear; could be either Chrome or Unix. +2. `time_usec` / `timeUsec` — **historically Chrome epoch (microseconds since 1601)** in Google's Takeout dump. +3. `visitedAt` ISO string → `chrono::DateTime::timestamp_micros()` — definitely Unix epoch microseconds. + +If the real Takeout files give Chrome-epoch `time_usec`, the resulting +`last_visit_ms` is ~11.6 quadrillion ms in the future. The companion ISO +formatter [chrome_time_to_rfc3339:436](../../../src-tauri/crates/browser-history-parser/src/takeout/browser_history.rs) +calls `DateTime::from_timestamp_micros(value)` which is **Unix-epoch +microseconds**, confirming the code path assumes Unix. Either the runtime +input is in fact Unix (in which case the function names are fine but the +public-facing JSON contract is non-obvious and needs a fixture-pinned +assertion), or the input is Chrome-epoch (in which case all Takeout +timestamps are catastrophically wrong and someone would have noticed). The +audit cannot decide which without a fixture pinned to a real Takeout export +shape — **scenario T-TIME-PIN** in the spec doc resolves this. + +--- + +## 3. Per-Source Behavior Summary + +### Chromium (Chrome, Edge, Brave, Vivaldi, Arc, Opera, Opera GX, ChatGPT Atlas, Perplexity Comet, Chromium-proper) + +- Time format: microseconds since 1601 → Unix ms via subtract `11_644_473_600_000_000` then `÷ 1000` ([utils.rs:131](../../../src-tauri/crates/vault-core/src/utils.rs)). +- Incremental cursor: `last_visit_id`, `last_url_last_visit_time` (stored as Chrome time). +- URL re-fetch correctness: ✅ has long-tail revisit OR clause ([chromium/mod.rs:85-90](../../../src-tauri/crates/browser-history-parser/src/chromium/mod.rs)). +- Full-import path strips the OR for performance ([chromium/mod.rs:100-103](../../../src-tauri/crates/browser-history-parser/src/chromium/mod.rs)). +- Downloads / search_terms / favicons all supported. + +### Firefox (also LibreWolf, Floorp, Waterfox) + +- Time format: microseconds since Unix epoch → stored directly as `visit_time_ms` (no conversion — but the field name says `ms`, not `μs`; the actual unit needs fixture verification). +- Incremental cursor: `last_visit_id` (monotonic ✅), `last_url_last_visit_time`. +- URL re-fetch correctness: ❌ **B2** — no long-tail revisit fallback. +- No downloads, no search_terms, no favicons (documented intentional gap per [browser-support-and-adapter-playbook.md:23](../../architecture/browser-support-and-adapter-playbook.md)). + +### Safari + +- Time format: CFAbsoluteTime (seconds since 2001-01-01 as f64) → Unix ms via `(value - 978_307_200) * 1000` ([safari/mod.rs:59](../../../src-tauri/crates/browser-history-parser/src/safari/mod.rs)). +- URL re-fetch correctness: ❌ **B2** — no long-tail revisit fallback. +- Safari has `synthesized` flag (redirect-generated phantom visits) — currently captured but not de-emphasized in visit_count, may inflate counts vs Chrome's UI numbers. +- No downloads, no search_terms, no favicons. + +### Google Takeout + +- Goes through a **completely separate ingest path** from Browser Direct ([takeout/mod.rs](../../../src-tauri/crates/browser-history-parser/src/takeout/mod.rs)). The archive `process_profile_snapshot` switch only handles `"chromium" | "firefox" | "safari"` ([ingest/mod.rs:492-493](../../../src-tauri/crates/vault-core/src/archive/ingest/mod.rs)); Takeout-specific Tauri commands wire into different machinery. +- No watermark / cursor support — every re-import replays the whole payload, relying entirely on per-source-profile uniqueness for dedup. +- `source_url_id` = `hash("url::" + url)` — deterministic ✅ from URL alone. +- `source_visit_id` = `hash(path + ordinal + url)` — **B3 path-bound**. +- All Takeout records get `app_id = "takeout"` and `transition = None` → fingerprint can never match local-browser visits. + +--- + +## 4. Areas the Schema Cannot Help With (test-harness must prove behavior) + +### URL canonicalization + +No URL normalization runs before dedup. From real Chromium exports: + +| Surface | Distinct rows possible? | +| ---------------------------------------------------------------- | ------------------------------------ | +| `https://example.com` vs `https://example.com/` | yes, separate URLs | +| `https://Example.com/` vs `https://example.com/` | yes if Chrome stored them mixed-case | +| `https://example.com/path` vs `https://example.com/path#section` | yes if Chrome kept fragments | +| `https://example.com/?a=1&b=2` vs `https://example.com/?b=2&a=1` | yes | +| `https://例子.中国/` vs `https://xn--fsqu00a.xn--fiqs8s/` | depends on what Chrome wrote | + +The visit_taxonomy/url.rs surface normalizes for search/taxonomy but +**not** for dedup. +[`e6_url_strings_stored_verbatim_no_normalization`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_edge_cases.rs) +pins this contract: trailing slash, fragment, and mixed case are all +stored verbatim as separate URLs. + +### Time precision + +- Visit times stored at **exact ms** — no fuzzing for "this is probably the + same visit." Two browsers visiting the same URL within 50ms of each other → + two rows; same browser firing two navigations at the same ms → second one + caught by source_visit_id uniqueness ✅. +- DST transitions, system clock changes, and NTP corrections all change + `visit_time_ms` but not `source_visit_id`, so they're safe at the + primary index level. Fingerprint fallback would diverge — test required. +- **Sub-millisecond Chrome visit collision (pinned by C_SUB_MS / E5)**: Chrome + stores visit times at microsecond precision. The ingest pipeline truncates to + milliseconds (`visit_time_ms`). Two distinct visits to the same URL that land + within the same millisecond produce **identical fingerprints** (same URL, same + truncated time, same title, same transition, same app_id). The partial unique + index on `(source_profile_id, event_fingerprint)` collapses them to one row. + This is a **known acceptable limitation**: the primary index + (`source_profile_id, source_visit_id`) still separates them by ID, but + `INSERT OR IGNORE` stops at the first unique-constraint violation, so the + fingerprint index fires first and silently drops the second visit. + [`c_sub_ms_same_millisecond_visits_collapsed_by_fingerprint`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_edge_cases.rs) + pins this behavior as a contract test. + +### Cross-source cannot merge + +Already covered in §1. Even the fingerprint partial index is scoped by +`source_profile_id` ([002:30-32](../../../src-tauri/crates/vault-core/src/migrations/002_archive_runtime_foundation.sql)): + +```sql +CREATE UNIQUE INDEX IF NOT EXISTS idx_visits_profile_event_fingerprint + ON visits(source_profile_id, event_fingerprint) + WHERE event_fingerprint IS NOT NULL AND event_fingerprint != ''; +``` + +### profile_key collisions + +`profile_key` = `browser_kind || ':' || profile_name`. Two distinct profiles +with the same name on different paths would collide (e.g. two `Default` +profiles in different OS user accounts on a shared machine). Discovery +should disambiguate via path but is not under audit here. + +### Watermark race + +[ingest/mod.rs:411-437](../../../src-tauri/crates/vault-core/src/archive/ingest/mod.rs) +saves the watermark inside the same transaction as the canonical writes, so +a crash mid-import rolls everything back together — no torn writes. +However, **concurrent imports of the same profile_id** would both load the +same `last_visit_id` watermark, attempt overlapping writes, and the second +commit would silently re-process records the first already imported. SQLite +prevents simultaneous write transactions on the same DB, but the in-app +queue serialization is not under audit here — flag for harness coverage. + +### Visit→URL ordering dependency + +[ingest/mod.rs:155-158](../../../src-tauri/crates/vault-core/src/archive/ingest/mod.rs) +silently drops any visit whose `source_url_id` is not already in +`url_id_map`. The parser is expected to emit `urls()` batches before +`visits()` batches for the same URL. Any future refactor that changes +batching order will cause silent data loss — must be pinned by test. + +--- + +## 5. What the Test Harness Must Prove + +Maps to scenarios that will be enumerated in +`import-test-harness-spec.md`. Listed here only at the assertion level: + +1. **Within one source_profile, no visit is ever stored twice across re-imports**, regardless of which fixture features collide: + - re-import same file + - re-import after appending new rows + - re-import after schema migration in the source DB + - re-import where some old URLs got revisited but no new URLs added +2. **Cross-source-profile keeps independent rows** (the by-design contract); test must encode this so a future refactor that "tidies it up" gets caught. +3. **No visit is silently dropped**: + - parser emits visit before URL → must be caught + - URL last_visit older than watermark but visit newer → must be caught + - corrupt source DB → revert leaves vault unchanged +4. **B1 / B2 / B3 / B4 / B5 / B6 each have a failing test before the fix lands.** +5. **Time conversions round-trip**: + - Chromium ms → Chrome time → fingerprint → re-parse same row → same fingerprint + - Firefox `visit_date` (μs Unix) → ms Unix → ISO → same + - Safari CFAbsoluteTime → ms Unix → ISO → same + - Takeout `time_usec` shape pinned by fixture +6. **URL canonicalization contract pinned** — every variant in §4 has a test that documents the _current_ behavior. Changes to URL normalization later require updating the tests, making the change visible in review. +7. **Provenance preserved**: + - Edge profile imports stay tagged Edge, not collapsed to Chrome (per [browser-support-and-adapter-playbook.md:107](../../architecture/browser-support-and-adapter-playbook.md)) + - ChatGPT Atlas / Perplexity Comet keep their product identity +8. **Memory bounds**: streaming chunks of 10,000 records ([ingest/mod.rs:61](../../../src-tauri/crates/vault-core/src/archive/ingest/mod.rs)) actually limit RAM. A 1.44M-record fixture must import without RSS exceeding a bounded ceiling (the harness target the user gave: 8 GB / 4 core). + +--- + +## 6. Scenarios Now Backed By Tests + +> Living section — updated as scenarios land. The expectation is that every +> bug from §2 eventually has a named `#[should_panic]` regression test that +> flips to a plain `#[test]` once the fix ships, and every architectural +> contract from §5 has a contract test that defends it against drift. + +### Contract scenarios (pass today, guard against regression) + +| Scenario | Location | Asserts | +| -------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| C1 — Chromium baseline import | [`c1_chromium_baseline_import`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios.rs) | One profile, one ingest pass produces exactly the fixture URL + visit rows; `source_visit_id` values flow through unmodified. | +| C2 — Chromium incremental no-new-data | [`c2_chromium_incremental_no_new_data`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios.rs) | Re-running the same fixture with `use_watermark = true` returns `new_urls = 0`, `new_visits = 0`, and archive row counts stay constant. | +| C3 — Chromium incremental revisit of an old URL | [`c3_chromium_incremental_revisit_of_old_url`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios.rs) | Adversarial pass-2 fixture: visit cursor moves past 10, URL `last_visit_time` deliberately left at the old value. Validates the `OR id IN (SELECT DISTINCT url FROM visits WHERE id > ?2)` fallback in `INGEST_URLS_SQL` is intact. | +| S2 — Safari long-tail revisit (NOT affected by B2) | [`s2_safari_long_tail_revisit_captured_without_or_fallback`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_baselines.rs) | Safari's URL query computes MAX(visit_time) on the fly; no cached `last_visit_time` column to lag behind, so the OR fallback isn't needed. Test flips if a future refactor adds a cache. | +| T1 — Takeout baseline import | [`t1_takeout_baseline_import`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_takeout.rs) | `crate::takeout::import_takeout` ingests a synthetic `BrowserHistory.json` into `profile_key = "takeout::browser-history"` with `app_id = "takeout"` on every visit. | +| T2 — Takeout file rename, identical records | [`t2_takeout_rename_file_reimport_dedups_via_fingerprint_partial_index`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_takeout.rs) | Refutes the original B3 framing: the fingerprint partial unique index catches the duplicate set even though every `source_visit_id` differs. | +| T3 — Takeout × local Chrome same-period | [`t3_takeout_and_local_chrome_same_period_b4_contract`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_takeout.rs) | B4 contract: per-source-profile dedup truly keeps Chrome and Takeout independent; fingerprint inputs differ (real app_id vs `"takeout"`, real transition vs `None`) so any future cross-source dedup must normalize first. | +| T5 — Takeout time_usec interpretation | [`t5_takeout_time_usec_pinned_as_unix_microseconds_b6_contract`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_takeout.rs) | B6 contract: parser interprets `time_usec` as Unix-epoch microseconds. If real Google Takeout disagrees the writer + this test update together; if anyone changes the parser to Chrome epoch this test fails immediately. | +| X1 — Edge imports Chrome history then diverges | [`x1_edge_imports_chrome_then_both_diverge`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios.rs) | Per-source-profile architecture preserved: a URL visited in both browsers keeps two `urls` rows; Edge's `browser_product` stays `"Microsoft Edge"` (playbook §107). | +| X2 — Atlas / Comet preserve browser_product | [`x2_chromium_family_products_preserve_browser_product_identity`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios.rs) | ChatGPT Atlas (playbook §156) and Perplexity Comet (playbook §158) stay tagged with their product identity in `source_profiles.browser_product`; do not collapse to "Google Chrome". | +| X3 — Multi-profile per browser independence | [`x3_multiple_profiles_within_same_browser_stay_independent`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios.rs) | Chrome `Default` and Chrome `Profile 1` produce distinct `source_profiles` rows under same `browser_kind`; identical visits across them do NOT dedup (per-profile fingerprint scope); per-profile watermark isolation preserved. | +| C5 — Chromium incremental append-new-rows | [`c5_chromium_incremental_append_new_urls_and_visits`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios.rs) | Re-import where second pass adds wholly new URLs + new visits (no overlap with first import) — watermark lets only new rows land while originals stay deduplicated. Pins §5.1 "re-import after appending new rows" contract. | +| C6 — Chromium source DB schema tolerance | [`c6_chromium_extra_columns_on_source_db_do_not_break_ingest`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios.rs) | Fixture DB with `ALTER TABLE`-added columns (`favicon_id`, `segment_id`, `opener_visit`, `originator_cache_guid`) imports without error and produces identical canonical rows. Pins §5.1 "re-import after schema migration" contract; catches accidental `SELECT *` regressions. | +| F1 — Firefox baseline import | [`f1_firefox_baseline_import`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_baselines.rs) | Firefox single-import happy path: 3 URLs, 5 visits all land with correct counts, timestamps, and field values. | +| S1 — Safari baseline import | [`s1_safari_baseline_import`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_baselines.rs) | Safari single-import happy path: 3 URLs, 5 visits all land with correct counts, timestamps, and field values. | +| Chromium fingerprint dedup | [`chromium_fingerprint_dedup_catches_same_visits_with_different_source_ids`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_baselines.rs) | Re-import same visits with different `source_visit_id` values — the `event_fingerprint` partial index catches them as duplicates, no extra rows created. | +| F_C2 — Firefox incremental no-new-data | [`f_c2_firefox_incremental_no_new_data`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_baselines.rs) | Firefox mirror of C2: re-import with watermark produces `new_urls = 0`, `new_visits = 0`, archive row counts constant. | +| S_C2 — Safari incremental no-new-data | [`s_c2_safari_incremental_no_new_data`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_baselines.rs) | Safari mirror of C2: re-import with watermark produces `new_urls = 0`, `new_visits = 0`, archive row counts constant. | +| C_SUB_MS (E5) — Sub-ms fingerprint collision | [`c_sub_ms_same_millisecond_visits_collapsed_by_fingerprint`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_edge_cases.rs) | Two visits to same URL at same ms but different source_visit_ids — fingerprint partial index collapses to 1 row. Pins known precision limitation. | +| E6 — URL canonicalization (no normalization) | [`e6_url_strings_stored_verbatim_no_normalization`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_edge_cases.rs) | Trailing slash, fragment, mixed case all stored as separate URLs verbatim. Pins contract so future normalization changes are visible. | +| Empty DB × 3 families | `empty_{chromium,firefox,safari}_fixture_imports_without_error` in [`dedup_scenarios_edge_cases.rs`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_edge_cases.rs) | Zero-row fixtures for each family import without error, summary reports 0/0. | +| R1a — Corrupt random bytes | [`r1a_corrupt_random_bytes_returns_error_not_panic`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_edge_cases.rs) | Random bytes file returns `Err`, not panic — resilience contract. | +| R1b — Valid SQLite missing tables | [`r1b_valid_sqlite_missing_tables_returns_error_not_panic`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_edge_cases.rs) | Valid SQLite DB without browser tables returns `Err`, not panic — resilience contract. | +| E1 — Epoch timestamp (visit_time_ms = 0) | [`e1_epoch_timestamp_imports_without_error`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_edge_cases.rs) | Epoch 0 timestamp stores and round-trips as 0 — pins lower bound of time domain. | +| E2 — Year-2038 boundary (2^31 seconds) | [`e2_year_2038_boundary_imports_without_error`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_edge_cases.rs) | 2038-01-19T03:14:07Z (2,147,483,647,000 ms) round-trips correctly — pins i64 handling above 32-bit overflow. | +| E3 — Far-future timestamp (year 9999) | [`e3_far_future_timestamp_imports_without_error`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_edge_cases.rs) | Max-range timestamp stores without overflow — pins i64 capacity at the upper extreme. | +| E4 — Negative timestamp (clamped to 0) | [`e4_negative_timestamp_clamped_to_zero`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_edge_cases.rs) | All parsers apply `.max(0)` so negative source timestamps import as 0 ms — pins clamping contract. | +| E7 — NULL title handling | [`e7_null_title_imports_with_null_archive_title`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_edge_cases.rs) | URL with NULL source title projects as NULL in archive (not empty string) — pins nullable-column contract. Sibling URL with non-NULL title round-trips normally. | +| E8 — Unicode byte-identical round-trip | [`e8_unicode_urls_and_titles_round_trip_byte_identical`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_edge_cases.rs) | CJK title, percent-encoded path (NOT decoded), and emoji + em-dash all round-trip byte-identical with no NFC/NFD normalization or case folding. Pins international-user contract. | +| E9 — `hidden` URL flag round-trip | [`e9_hidden_url_flag_round_trips_for_both_true_and_false`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_edge_cases.rs) | `hidden = true` source URL (Chrome redirect intermediates) lands as non-zero in archive; `hidden = false` lands as 0. Pins flag-preservation contract that C-series didn't exercise. | +| Takeout ptoken evidence round-trip | [`takeout_standard_json_round_trips_through_production_parser`](../../src-tauri/crates/browser-history-fixtures/tests/takeout_roundtrip.rs) (ptoken assertion block) | `ptoken` field in fixture serializes and parses back as `context.takeout.ptoken` context evidence. | +| Takeout visitedAt ISO-8601 fallback | [`takeout_visited_at_iso_string_parsed_correctly`](../../src-tauri/crates/browser-history-fixtures/tests/takeout_roundtrip.rs) | Hand-crafted JSON with `visitedAt` RFC-3339 strings parses to correct millisecond timestamps — covers the parser's ISO fallback path that no fixture writer can exercise. | +| Takeout missing time field silently skipped | [`takeout_record_without_time_field_is_skipped`](../../src-tauri/crates/browser-history-fixtures/tests/takeout_roundtrip.rs) | A record without any time field (`visitTime`, `time_usec`, `timeUsec`, `visitedAt`) is silently dropped; only time-bearing records produce URL + visit rows. | + +### Bugs with failing tests + +| Bug | Scenario | Status | +| ----------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| B1 URL upsert regresses counts | [`c4_chromium_reimport_older_snapshot_regresses_visit_count_demonstrates_b1`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios.rs) | **FIXED** (6884c10d) — now a plain `#[test]` asserting `visit_count`, `typed_count`, `title`, and `hidden` all survive re-import without regression | +| B2 Firefox long-tail revisit drop | [`f2_firefox_incremental_revisit_of_old_url_drops_visit_demonstrates_b2`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_baselines.rs) | **FIXED** (6884c10d) — Firefox URL stream now has the OR fallback | +| B2 Safari long-tail revisit drop | n/a — refuted | Original audit claim corrected. Safari has no cached last-visit column to lag; see [`s2_...`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_baselines.rs) contract. | +| B3 Takeout path-bound source_visit_id (narrow case — fingerprint drift) | [`t2b_takeout_rename_with_title_change_demonstrates_b3_when_fingerprint_diverges`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_takeout.rs) | **FIXED** (6884c10d) — fix landed in same commit as B1 and B2 | +| B4 Takeout × local Chrome double-count | [`t3_takeout_and_local_chrome_same_period_b4_contract`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_takeout.rs) | Contract test — by-design per-profile storage; reframed from "bug" to "design constraint for any future cross-source dedup proposal" | +| B5 Takeout hash collision at scale | T4 (deferred to a dedicated scale-test slice) | needs million-record fixture infrastructure separate from per-scenario harness | +| B6 Takeout time unit ambiguity | [`t5_takeout_time_usec_pinned_as_unix_microseconds_b6_contract`](../../src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_takeout.rs) | Contract test pins current Unix-microseconds interpretation; the audit's "what does Google really ship" question stays open until a real-world sample lands | + +--- + +## 7. Out of Scope For This Audit + +- **View-layer cross-browser aggregation** — separate user-flow work, decided + in the planning conversation but not yet a BACKLOG block. +- **`vault-platform` staging and live-file copy** — concerns file system + semantics, not dedup correctness. +- **Recall / search projection** — derived from the canonical archive after + ingest commits; will inherit ingest's truth. +- **Backup vs Browser Direct command-surface differences** — the canonical + ingest path is the same; differences are in staging and source provenance + metadata, both of which are validated by separate acceptance tests in the + m3/m4 milestones. + +--- + +_End of audit. The companion spec doc +(`docs/plan/program/import-test-harness-spec.md`) translates the above bugs +and gaps into concrete scenarios, fixture generator API, and acceptance +criteria for `WORK-IMPORT-TEST-HARNESS-A`. Section 6 above tracks which +scenarios have shipped against the harness._ diff --git a/docs/plan/program/import-test-harness-spec.md b/docs/plan/program/import-test-harness-spec.md new file mode 100644 index 00000000..dbbfb985 --- /dev/null +++ b/docs/plan/program/import-test-harness-spec.md @@ -0,0 +1,449 @@ +# Import Test Harness Spec + +> Companion to [`import-dedup-audit.md`](import-dedup-audit.md). +> The audit answers _what is the current behavior_. This spec answers +> _what tests would prove or disprove that behavior at every supported +> source and edge case_, so the user can be confident that a re-import +> of any combination of browsers will not silently lose, duplicate, or +> corrupt visit records. + +Owning work block: `WORK-IMPORT-TEST-HARNESS-A` (queued in `BACKLOG.md`). + +--- + +## 1. Goals & Non-Goals + +### Goals + +1. Build a **fixture generator** that emits real-format browser history + payloads — Chromium `History`, Firefox `places.sqlite`, Safari + `History.db`, Google Takeout JSON / JSONL — from a deterministic + programmatic scenario description. +2. Build a **scenario library** that covers every documented edge case in + the audit, including known bugs (B1–B6) and architecturally-correct + behaviors that future refactors might silently break. +3. Build an **end-to-end test runner** that takes one scenario, drives the + real `vault-core` ingest pipeline through it, and asserts canonical-DB + truth (visit counts, URL counts, fingerprint stability, per-profile + provenance, watermark advancement, revert safety). +4. Guarantee the harness produces **zero false positives**: every failing + assertion either is a real bug in product code or a real intentional + change that needs a contract-test update. +5. Keep the harness **self-validating**: the fixture generator itself is + tested by parser round-trip (write a fixture → parse it → assert the + parser saw what the generator promised) so a generator bug cannot + pretend a product bug exists. + +### Explicit Non-Goals + +1. **No real user data** in fixtures. The user has personal browser data + on the development machine; the playbook + ([browser-support-and-adapter-playbook.md:152](../../architecture/browser-support-and-adapter-playbook.md)) + forbids copying private URLs/titles into docs or repo. The fixture + generator **must not sample from real DBs at any layer** — every URL, + title, timestamp, and ID is synthesized from a seed. +2. **No product-code bug fixes in this work block.** B1–B6 each get a + failing test that documents the bug; fixes ship in dedicated follow-up + blocks so the fix PR can point at the failing test as evidence. +3. **No view-layer cross-browser aggregation work.** That has its own + pending work block driven by the planning conversation. +4. **No performance optimization.** Harness measures memory bounds as a + contract assertion (does a 1.44M-record import stay under the agreed + RSS ceiling?) but does not optimize the ingest pipeline. +5. **No support for non-promised browsers.** Scenarios cover the families + in [browser-support-and-adapter-playbook.md](../../architecture/browser-support-and-adapter-playbook.md): + Chromium-family, Firefox-family, Safari, Takeout. Pale Moon, qutebrowser, + mobile exports are out of scope. + +--- + +## 2. Crate Architecture + +### New crate: `browser-history-fixtures` + +Location: `src-tauri/crates/browser-history-fixtures/`. + +``` +browser-history-fixtures/ +├── Cargo.toml # added to workspace; no Tauri dep +├── src/ +│ ├── lib.rs # public surface: Scenario, ScenarioBuilder, fixtures::* +│ ├── seed.rs # deterministic PRNG (StdRng with explicit seed) +│ ├── catalog.rs # synthetic URL/title pools (public-domain text only) +│ ├── time.rs # epoch conversions (Chrome/Unix/Safari/Firefox) +│ ├── scenario/ +│ │ ├── mod.rs # Scenario / ScenarioBuilder DSL +│ │ ├── browser.rs # BrowserProfile builder, clone_history, add_visits +│ │ ├── assertions.rs # CanonicalAssertions: per-profile visit_count, etc. +│ │ └── runner.rs # drives ingest pipeline, returns CanonicalView +│ ├── chromium_db.rs # writes real Chromium History sqlite +│ ├── firefox_db.rs # writes real places.sqlite +│ ├── safari_db.rs # writes real History.db (CFAbsoluteTime semantics) +│ └── takeout_json.rs # writes BrowserHistory.json + .jsonl + zip +├── tests/ +│ ├── fixture_roundtrip.rs # self-validation: each generator output parses cleanly +│ ├── chromium_dedup.rs # scenarios C1–C7 +│ ├── firefox_dedup.rs # scenarios F1–F4 +│ ├── safari_dedup.rs # scenarios S1–S3 +│ ├── takeout_dedup.rs # scenarios T1–T6 +│ ├── cross_source.rs # scenarios X1–X5 +│ ├── time_and_url.rs # scenarios E1–E8 +│ ├── corrupt_and_recover.rs # scenarios R1–R4 +│ └── memory_bounds.rs # scenario M1 (large data, optional `#[ignore]` until --features=big-data) +└── README.md # quick-start, how to add a scenario +``` + +Why a new crate rather than putting it in `vault-core/tests/`: + +- `vault-core` already has 31,762 instrumented lines and 1,485+ tests; + adding a generator crate keeps the test surface focused. +- The generator needs `rusqlite` write access with control over PRAGMAs; + isolating it makes the dependency story cleaner. +- The fixture generator is itself usable for benchmarks, manual repro + bundles, and future doctor-tool development — it's a long-lived + utility, not a one-shot test asset. + +### Dependencies + +- `rusqlite` with `bundled` feature (matches `vault-core`) +- `serde_json` (Takeout payloads) +- `chrono` (epoch conversions) +- `rand` + `rand_chacha` (deterministic PRNG; explicit seed in every scenario) +- `tempfile` (test sandboxes) +- `zip` (for zipped Takeout fixtures matching the source classifier expectations) +- **No new third-party deps that need supply-chain review** — all four are + already in the workspace. + +--- + +## 3. Fixture Generator API + +### Scenario DSL — declarative, deterministic, readable + +```rust +let scenario = Scenario::new("edge_imports_chrome_then_diverges") + .seed(0xCAFEBABE_DEADBEEF) + + // Chrome profile with 60 days of synthetic browsing + .add_browser(Chromium("Google Chrome")) + .profile("Default") + .with_visits(SyntheticPattern { + count: 100, + window: days_ago(60)..days_ago(30), + url_pool: PublicDomainUrls::news_sites(), + title_pool: PublicDomainTitles::wikipedia_articles(), + transition_mix: TransitionMix::typical(), + }) + + // Edge profile that "imported from Chrome" — same visits but + // different source_visit_ids (Chrome's IDs renumbered by Edge) + .add_browser(Chromium("Microsoft Edge")) + .profile("Default") + .imported_from(Chromium("Google Chrome"), "Default") + .renumber_visit_ids() // simulates browser import behavior + .preserve_visit_times() // visit_time_ms identical to Chrome + .with_visits(SyntheticPattern { + count: 50, + window: days_ago(30)..now(), + url_pool: PublicDomainUrls::news_sites(), + transition_mix: TransitionMix::typical(), + }) + + // Chrome also kept browsing for 30 days + .add_visits_to(Chromium("Google Chrome"), "Default", SyntheticPattern { + count: 30, + window: days_ago(30)..now(), + ..Default::default() + }); + +let canonical = scenario.run_in_vault()?; + +canonical.assert(|view| { + // by-design: per-profile dedup keeps Edge + Chrome separate + view.expect_url_count_for_profile("chrome:Default", 130); + view.expect_url_count_for_profile("edge:Default", 150); + + // by-design: cross-browser does NOT dedup at storage layer + view.expect_canonical_url_count_distinct_across_profiles(180); + + // contract: no visit got dropped + view.expect_visit_count_for_profile("chrome:Default", 130); + view.expect_visit_count_for_profile("edge:Default", 150); + + // contract: provenance preserved + view.expect_browser_product("edge:Default", "Microsoft Edge"); + view.expect_browser_product("chrome:Default", "Google Chrome"); + + // contract: watermark advanced for both profiles + view.expect_watermark_visit_id_at_least("chrome:Default", 130); + view.expect_watermark_visit_id_at_least("edge:Default", 150); +}); +``` + +### `SyntheticPattern` + +```rust +pub struct SyntheticPattern { + pub count: usize, // number of visits + pub window: Range>, // time range + pub url_pool: UrlPool, // synthetic URLs (public-domain set) + pub title_pool: TitlePool, // synthetic titles + pub transition_mix: TransitionMix, // distribution of Chrome transition types + pub revisit_rate: f64, // 0.0 = all unique URLs, 1.0 = all repeats + pub duration_distribution: DurationDistribution, +} +``` + +### Synthetic content pools + +All URLs and titles are **synthesized from public-domain corpora**: + +- **URL hosts**: a small fixed list of obviously-fake hosts + (`example.com`, `example.org`, `synthetic.test`, `pathkeep-fixture.invalid`) + plus public Wikipedia / Wikimedia hosts when we need plausible-looking + long URLs (e.g. `en.wikipedia.org/wiki/`). +- **Page paths**: deterministic from seed — `/article//`. +- **Titles**: pulled from a checked-in list of public-domain Wikipedia + article titles (article titles themselves are PD; the corpus file is + checked in at `browser-history-fixtures/src/catalog/wikipedia_titles.txt`). +- **Search terms**: a fixed set of obviously-non-real queries (`brown +fox jumps`, `lorem ipsum dolor`, etc.). + +**No fixture URL or title is ever sampled from a real user DB.** The +catalog is committed once and reused; PRs that touch the catalog must +include an attribution comment for the source. + +### Fixture file outputs + +Each `Scenario::run_in_vault()` materializes: + +- One `History` SQLite per Chromium profile, written with the exact + schema (`urls`, `visits`, `downloads`, `keyword_search_terms`, + `meta`) that Chrome ships, populated by the synthetic data and + indexed the same way Chrome indexes it. +- One `places.sqlite` per Firefox profile with `moz_places`, + `moz_historyvisits`, and the meta tables Firefox parser inspects. +- One `History.db` per Safari profile with `history_items`, + `history_visits`, plus the `synthesized` / `load_successful` columns + the Safari parser may probe. +- Takeout payloads (BrowserHistory.json or JSONL; optionally zipped to + exercise the zip code path) in a path layout that matches what the + Takeout source classifier looks for + ([takeout/source.rs:402-418](../../../src-tauri/crates/browser-history-parser/src/takeout/source.rs)). + +### Self-validation: fixture round-trip + +`tests/fixture_roundtrip.rs` proves the generator is honest. For every +generator output: + +1. Write the fixture. +2. Open it with the **real PathKeep parser** (`browser_history_parser::chromium::parse_history` etc.). +3. Assert the parser saw exactly the records the generator promised. + +If a generator bug exists (wrong schema, wrong epoch, missing column), +the round-trip test fails _before_ any scenario can pretend a product +bug exists. **Without this guard, the harness is worse than useless** — +it can give false confidence. + +--- + +## 4. Assertions API + +```rust +pub struct CanonicalView<'a> { + archive: &'a Connection, +} + +impl CanonicalView<'_> { + // ---- counts ---- + pub fn expect_url_count_for_profile(&self, profile_key: &str, expected: usize); + pub fn expect_visit_count_for_profile(&self, profile_key: &str, expected: usize); + pub fn expect_total_visit_count(&self, expected: usize); + pub fn expect_distinct_canonical_url_count_distinct_across_profiles(&self, expected: usize); + + // ---- provenance ---- + pub fn expect_browser_product(&self, profile_key: &str, expected: &str); + pub fn expect_source_profile_count(&self, expected: usize); + + // ---- dedup behavior ---- + pub fn expect_no_duplicate_visit_keys(&self); + pub fn expect_no_duplicate_visit_fingerprints(&self); + pub fn expect_url_visit_count(&self, profile_key: &str, url: &str, expected: i64); + pub fn expect_url_first_last_visit_within(&self, profile_key: &str, url: &str, range: Range>); + + // ---- watermark ---- + pub fn expect_watermark_visit_id_at_least(&self, profile_key: &str, min: i64); + pub fn expect_watermark_url_time_at_least(&self, profile_key: &str, min_ms: i64); + + // ---- import batch behavior ---- + pub fn expect_visits_in_import_batch(&self, batch_id: i64, expected: usize); + pub fn expect_no_orphan_visits(&self); // every visit's url_id resolves + pub fn expect_no_visits_in_reverted_batch(&self); +} +``` + +The assertion helpers all read directly from the canonical archive +SQLite; no view-model layer is in the path. Assertion failures include +**the SQL query that returned the wrong count** so the developer can +re-run it locally. + +### Bug-targeted assertions + +For each known bug, the spec defines a named assertion that fails +_now_ and passes after the fix: + +- `expect_url_count_monotonic_under_repeated_imports` → catches **B1** +- `expect_firefox_long_tail_revisit_not_dropped` → catches **B2** +- `expect_safari_long_tail_revisit_not_dropped` → catches **B2** +- `expect_takeout_rename_does_not_duplicate` → catches **B3** +- `expect_takeout_then_local_chrome_same_period_dedup` → catches **B4** +- `expect_takeout_url_hash_no_collisions_at_million_scale` → catches **B5** +- `expect_takeout_time_unit_matches_documented_contract` → catches **B6** + +These are written first as `#[test] #[should_panic]` (documenting the +current broken behavior), then converted to plain `#[test]` when the +fix lands. The spec is explicit: **landing a fix without flipping the +test invalidates the work block.** + +--- + +## 5. Scenario Library + +Each scenario maps to one test function. Priority drives implementation +order in the work block; everything is in scope before the block closes. + +### Priority 1 — Highest ROI (lay this in the scaffold commit) + +| ID | Scenario | Targets | +| --- | ----------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------- | +| C1 | `chromium_baseline_import` | happy path, source_visit_id uniqueness, run ledger correctness | +| C2 | `chromium_incremental_no_new_data` | watermark works; second import = 0 new rows | +| C3 | `chromium_incremental_revisit_of_old_url` | regression for the OR clause fix; would fail without [chromium/mod.rs:85-90](../../../src-tauri/crates/browser-history-parser/src/chromium/mod.rs) | +| T1 | `takeout_baseline_import` | happy path; no source_visit_id from browser, full fingerprint reliance | +| T2 | `takeout_rename_file_reimport` | **B3 failing test** — same data, different path, expect dedup, assert duplicates appear | +| X1 | `edge_imports_chrome_then_diverges` | per-profile contract preserved, no cross-browser dedup | + +### Priority 2 — Bug coverage + +| ID | Scenario | Targets | +| --- | ---------------------------------------------------------- | --------------------------------------------------- | +| C4 | `chromium_reimport_older_snapshot_does_not_regress_counts` | **B1 failing test** | +| F1 | `firefox_baseline_import` | happy path for places.sqlite | +| F2 | `firefox_incremental_revisit_of_old_url` | **B2 failing test** for Firefox | +| S1 | `safari_baseline_import` | happy path for History.db | +| S2 | `safari_incremental_revisit_of_old_url` | **B2 failing test** for Safari | +| T3 | `takeout_then_local_chrome_same_period` | **B4 failing test** — assert systematic doubling | +| T4 | `takeout_million_record_hash_distribution` | **B5 failing test** — stress `stable_key_i64` | +| T5 | `takeout_time_unit_contract` | **B6 failing/passing test** — pins format-of-record | + +### Priority 3 — Cross-source robustness + +| ID | Scenario | Targets | +| --- | ----------------------------------------------------- | -------------------------------------------------------------- | +| X2 | `chrome_brave_vivaldi_three_way_overlap` | three Chromium-family profiles, partial overlap, all preserved | +| X3 | `firefox_places_with_safari_history_overlap` | mixed family time conversions correct | +| X4 | `takeout_and_browser_direct_same_profile_same_period` | end-to-end version of T3 with real ingest commands | +| X5 | `microsoft_edge_not_collapsed_to_chrome` | provenance — Edge must not be tagged as Google Chrome | + +### Priority 4 — Time / URL / encoding edge cases + +| ID | Scenario | Targets | +| --- | --------------------------------------------- | ---------------------------------------------------------------- | +| E1 | `chrome_time_extreme_far_future` | `unix_micros_to_chrome_time` saturation | +| E2 | `safari_cfabsolute_time_pre_2001` | negative CFAbsoluteTime handling | +| E3 | `firefox_microseconds_vs_chrome_microseconds` | family misrouting test | +| E4 | `dst_transition_visit` | hour-boundary visit during DST transition | +| E5 | `same_millisecond_two_visits` | two visits at literally identical ms, different source_visit_ids | +| E6 | `url_with_fragment_and_trailing_slash` | document current behavior: separate rows | +| E7 | `url_with_idn_punycode_mix` | document current behavior | +| E8 | `url_very_long_8kb_plus` | SQLite TEXT column accepts; no truncation | + +### Priority 5 — Corruption / recovery / concurrency + +| ID | Scenario | Targets | +| --- | ------------------------------------------------------- | ------------------------------------------------------- | +| R1 | `corrupt_history_db_quick_check_fails` | preview honestly fails, no partial rows | +| R2 | `mid_import_crash_rollback` | transaction rolls back, watermark unchanged | +| R3 | `import_batch_revert_clears_visits_only_for_that_batch` | revert isolation | +| R4 | `staging_lock_contention` | History file held by browser, staging snapshot succeeds | +| R5 | `concurrent_import_same_profile_serialization` | SQLite write lock serializes; no torn state | + +### Priority 6 — Performance / memory bounds (optional `#[ignore]` until opted in) + +| ID | Scenario | Targets | +| --- | --------------------------------------------------- | --------------------------------------------------------------------------------------------------- | +| M1 | `chromium_1_44_million_visits_under_memory_ceiling` | the AGENTS.md design point: 8 GB / 4 core machine, 60 years of moderate use; assert peak RSS < N MB | + +--- + +## 6. How New Bugs Get Added + +When a user reports a new dedup / loss / duplication issue: + +1. The triage step is to add a scenario to the library that reproduces + the report from a synthetic fixture. If the synthetic fixture cannot + reproduce, the report is either operator error or a real-data leak + (e.g. Chrome version-specific schema we don't generate yet) — the + audit doc gets updated to widen the fixture surface. +2. Once a failing scenario exists, the bug is in scope for a fix work + block. +3. The fix block flips the scenario from `#[should_panic]` to plain + `#[test]` and gets merged. The scenario stays in the library forever + as a regression guard. + +This means **the harness is the bug tracker for ingest correctness**. +The audit doc lists six bugs today; the harness should converge to +zero `should_panic` annotations over time. + +--- + +## 7. Acceptance for `WORK-IMPORT-TEST-HARNESS-A` + +The work block is done when: + +1. `browser-history-fixtures` crate exists, builds clean, is in the + Cargo workspace, and is included in `bun run check`. +2. All round-trip self-validation tests pass. +3. All Priority 1 scenarios are implemented and either pass (for + contract scenarios) or `#[should_panic]` with a doc comment + referencing the audit bug (for bug scenarios). +4. The work block's CHANGELOG entry lists, by name, which audit bugs + now have failing tests. +5. The audit doc gets a new section: "Bugs with failing tests" linking + each to its scenario. + +The work block **does not** require Priorities 2–6 to be complete; those +are the natural follow-up blocks once the foundation lands. But the +spec already enumerates them so future work doesn't need to re-derive +the list. + +--- + +## 8. Open Questions to Resolve During Implementation + +These are resolvable from code-reading, not user discussion, but +deserve calling out so they aren't forgotten: + +1. **Takeout time unit truth.** Does the runtime really receive Chrome + epoch microseconds in `time_usec`, or Unix epoch microseconds, or + both depending on file format? Resolve by writing scenario T5 with + both shapes, observing which one matches the visible Chrome history + ground truth. +2. **`profile_key` collision under same-name profiles.** If a user has + two Chrome profiles both named `Default` on the same machine (e.g. + two macOS user accounts share-mounted), do they collide? Test as + scenario R6 (added if probe shows this is a real risk). +3. **Are Atlas / Comet adapters fully covered by the chromium + scenarios?** Probably yes by family membership, but confirm with a + discovery-side spot test in `vault-core/tests/` if no separate + parser test exists. +4. **Memory ceiling for M1.** AGENTS.md says 8 GB RAM, 4 core, 1.44M + records. Pick a sensible RSS bound (likely 800 MB) and document the + measurement methodology so the test stays deterministic across + hosts. + +--- + +_Update this doc when scenario coverage expands or when the audit's +bug list changes. Treat it as living source-of-truth alongside +`research-and-decisions.md`._ diff --git a/src-tauri/Cargo.lock b/src-tauri/Cargo.lock index a6e46c7a..9a90a4cd 100644 --- a/src-tauri/Cargo.lock +++ b/src-tauri/Cargo.lock @@ -496,6 +496,15 @@ dependencies = [ "alloc-stdlib", ] +[[package]] +name = "browser-history-fixtures" +version = "0.1.0" +dependencies = [ + "browser-history-parser", + "rusqlite", + "tempfile", +] + [[package]] name = "browser-history-parser" version = "0.1.0" @@ -6605,6 +6614,7 @@ name = "vault-core" version = "0.1.0" dependencies = [ "anyhow", + "browser-history-fixtures", "browser-history-parser", "chrono", "directories", diff --git a/src-tauri/Cargo.toml b/src-tauri/Cargo.toml index 8eb1405f..ab3cf66e 100644 --- a/src-tauri/Cargo.toml +++ b/src-tauri/Cargo.toml @@ -19,6 +19,7 @@ members = [ "crates/vault-platform", "crates/vault-worker", "crates/browser-history-parser", + "crates/browser-history-fixtures", ] resolver = "2" diff --git a/src-tauri/crates/browser-history-fixtures/Cargo.toml b/src-tauri/crates/browser-history-fixtures/Cargo.toml new file mode 100644 index 00000000..917c6389 --- /dev/null +++ b/src-tauri/crates/browser-history-fixtures/Cargo.toml @@ -0,0 +1,16 @@ +[package] +name = "browser-history-fixtures" +version = "0.1.0" +edition = "2024" +license.workspace = true +description = "Deterministic test fixtures for browser-history-parser and vault-core ingest scenarios." + +[lib] +path = "src/lib.rs" + +[dependencies] +rusqlite.workspace = true + +[dev-dependencies] +browser-history-parser = { version = "0.1.0", path = "../browser-history-parser" } +tempfile.workspace = true diff --git a/src-tauri/crates/browser-history-fixtures/src/chromium/mod.rs b/src-tauri/crates/browser-history-fixtures/src/chromium/mod.rs new file mode 100644 index 00000000..141c9270 --- /dev/null +++ b/src-tauri/crates/browser-history-fixtures/src/chromium/mod.rs @@ -0,0 +1,246 @@ +//! Real-format Chromium `History` SQLite generator. +//! +//! ## Responsibilities +//! - Emit a SQLite file with the `urls` and `visits` table shapes that +//! `browser_history_parser::chromium` reads, populated from caller-supplied +//! record structs. +//! - Keep on-disk column types and value semantics faithful to a real Chrome +//! `History` file, so scenario tests exercise the same code paths the +//! production parser hits against a user's actual database. +//! +//! ## Not responsible for +//! - Generating synthetic content (URLs, titles, timestamps) — that belongs +//! to the scenario layer once it ships. This module is the low-level writer. +//! - Downloads / favicons / keyword search terms — separate writers will be +//! added when scenarios that exercise those tables come online. +//! - Verifying the round-trip parse contract — `tests/chromium_roundtrip.rs` +//! owns that, since it requires the parser crate as a dev-dependency. +//! +//! ## Performance notes +//! - All rows are written inside a single SQLite transaction; a 1.44M-row +//! fixture writes in well under the AGENTS.md memory ceiling because we +//! never materialize the rendered SQL — `rusqlite` prepares once and binds +//! per row. + +use crate::time::unix_ms_to_chrome_time; +use rusqlite::{Connection, params}; +use std::path::Path; + +/// One row destined for the Chromium `urls` table. +/// +/// Fields mirror the columns the production parser reads in +/// `INGEST_URLS_FULL_SQL`. Times are expressed in Unix milliseconds and +/// converted to Chrome epoch on write. +#[derive(Debug, Clone)] +pub struct ChromiumUrlRow { + /// `urls.id` — Chrome's per-URL primary key. Must be unique within one fixture. + pub id: i64, + /// `urls.url` — full URL string, stored exactly as the browser would persist it. + pub url: String, + /// `urls.title` — page title, or `None` for pages Chrome never received a title for. + pub title: Option, + /// `urls.visit_count` — lifetime visit count Chrome itself tracks. + pub visit_count: i64, + /// `urls.typed_count` — how many of those visits were typed into the omnibox. + pub typed_count: i64, + /// `urls.last_visit_time` — Unix milliseconds; converted to Chrome epoch at write time. + pub last_visit_unix_ms: i64, + /// `urls.hidden` — Chrome's "hidden from suggestions" flag. + pub hidden: bool, +} + +/// One row destined for the Chromium `visits` table. +/// +/// Fields mirror the columns the production parser reads in `INGEST_VISITS_SQL`, +/// including the awkwardly-named `visits.url` column which is the foreign key +/// to `urls.id` (not a URL string). +#[derive(Debug, Clone)] +pub struct ChromiumVisitRow { + /// `visits.id` — visit primary key. Must be unique within one fixture. + pub id: i64, + /// `visits.url` — foreign key into the `urls.id` column. + pub url_id: i64, + /// `visits.visit_time` — Unix milliseconds; converted to Chrome epoch at write time. + pub visit_time_unix_ms: i64, + /// `visits.from_visit` — the visit that linked here, or 0 / `None` for entry points. + pub from_visit: Option, + /// `visits.transition` — Chrome's transition-type bitfield. + pub transition: Option, + /// `visits.visit_duration` — page-engagement duration in microseconds (Chrome's unit). + pub visit_duration_micros: Option, + /// `visits.is_known_to_sync` — whether Chrome Sync has acknowledged this row. + pub is_known_to_sync: bool, + /// `visits.visited_link_id` — Chrome's visited-link partition key. + pub visited_link_id: Option, + /// `visits.external_referrer_url` — the off-site referrer header, when Chrome captured one. + pub external_referrer_url: Option, + /// `visits.app_id` — Chrome's web-app association string. + pub app_id: Option, +} + +/// Builder for one Chromium `History` SQLite fixture. +/// +/// Use [`ChromiumHistoryFixture::new`] then [`Self::add_url`] / [`Self::add_visit`] +/// to compose records, and [`Self::write`] to materialize the SQLite file. +#[derive(Debug, Default)] +pub struct ChromiumHistoryFixture { + urls: Vec, + visits: Vec, +} + +impl ChromiumHistoryFixture { + /// Creates an empty fixture builder. + pub fn new() -> Self { + Self::default() + } + + /// Adds one URL row to the fixture. Returns the builder for chaining. + pub fn add_url(mut self, url: ChromiumUrlRow) -> Self { + self.urls.push(url); + self + } + + /// Adds one visit row to the fixture. Returns the builder for chaining. + pub fn add_visit(mut self, visit: ChromiumVisitRow) -> Self { + self.visits.push(visit); + self + } + + /// Materializes the fixture as a real-format SQLite file at `path`. + /// + /// Overwrites any existing file at the same path. Callers using the + /// `tempfile` crate get the standard `TempDir::path().join("History")` + /// pattern; the file name is conventional but not enforced here, since + /// PathKeep's parser accepts any path it's given. + pub fn write(&self, path: &Path) -> Result<(), rusqlite::Error> { + if path.exists() { + std::fs::remove_file(path) + .map_err(|err| rusqlite::Error::ToSqlConversionFailure(Box::new(err)))?; + } + + let mut connection = Connection::open(path)?; + let transaction = connection.transaction()?; + + transaction.execute_batch(SCHEMA_SQL)?; + + { + let mut url_stmt = transaction.prepare( + "INSERT INTO urls (id, url, title, visit_count, typed_count, last_visit_time, hidden) + VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7)", + )?; + for url in &self.urls { + url_stmt.execute(params![ + url.id, + url.url, + url.title, + url.visit_count, + url.typed_count, + unix_ms_to_chrome_time(url.last_visit_unix_ms), + url.hidden as i64, + ])?; + } + } + + { + let mut visit_stmt = transaction.prepare( + "INSERT INTO visits ( + id, url, visit_time, from_visit, transition, visit_duration, + is_known_to_sync, visited_link_id, external_referrer_url, app_id + ) + VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8, ?9, ?10)", + )?; + for visit in &self.visits { + visit_stmt.execute(params![ + visit.id, + visit.url_id, + unix_ms_to_chrome_time(visit.visit_time_unix_ms), + visit.from_visit, + visit.transition, + visit.visit_duration_micros, + visit.is_known_to_sync as i64, + visit.visited_link_id, + visit.external_referrer_url, + visit.app_id, + ])?; + } + } + + transaction.commit()?; + Ok(()) + } +} + +/// SQLite schema matching the columns the PathKeep Chromium parser reads. +/// +/// Real Chrome `History` files carry many more columns (favicon_id on +/// `urls`; sync metadata, segment_id, opener_visit, originator_* fields on +/// `visits`). Those are intentionally omitted here because the parser does +/// not project them; adding them would invite drift between fixture and +/// reality without buying any extra coverage. Slices that need favicon or +/// sync coverage will extend this schema in their own writer. +const SCHEMA_SQL: &str = r#" +CREATE TABLE urls ( + id INTEGER PRIMARY KEY, + url TEXT NOT NULL, + title TEXT, + visit_count INTEGER NOT NULL DEFAULT 0, + typed_count INTEGER NOT NULL DEFAULT 0, + last_visit_time INTEGER NOT NULL DEFAULT 0, + hidden INTEGER NOT NULL DEFAULT 0 +); + +CREATE TABLE visits ( + id INTEGER PRIMARY KEY, + url INTEGER NOT NULL, + visit_time INTEGER NOT NULL DEFAULT 0, + from_visit INTEGER, + transition INTEGER, + visit_duration INTEGER, + is_known_to_sync INTEGER NOT NULL DEFAULT 0, + visited_link_id INTEGER, + external_referrer_url TEXT, + app_id TEXT +); + +CREATE INDEX urls_url_index ON urls(url); +CREATE INDEX visits_url_index ON visits(url); +CREATE INDEX visits_time_index ON visits(visit_time); +"#; + +#[cfg(test)] +mod tests { + use super::*; + use tempfile::TempDir; + + #[test] + fn write_overwrites_existing_file_at_same_path() { + let dir = TempDir::new().unwrap(); + let path = dir.path().join("History"); + let fixture = ChromiumHistoryFixture::new() + .add_url(ChromiumUrlRow { + id: 1, + url: "https://a.test".to_string(), + title: Some("A".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: 1_700_000_000_000, + hidden: false, + }) + .add_visit(ChromiumVisitRow { + id: 1, + url_id: 1, + visit_time_unix_ms: 1_700_000_000_000, + from_visit: None, + transition: Some(1), + visit_duration_micros: None, + is_known_to_sync: false, + visited_link_id: None, + external_referrer_url: None, + app_id: None, + }); + fixture.write(&path).unwrap(); + assert!(path.exists()); + fixture.write(&path).unwrap(); + assert!(path.exists()); + } +} diff --git a/src-tauri/crates/browser-history-fixtures/src/firefox/mod.rs b/src-tauri/crates/browser-history-fixtures/src/firefox/mod.rs new file mode 100644 index 00000000..6fefc0f5 --- /dev/null +++ b/src-tauri/crates/browser-history-fixtures/src/firefox/mod.rs @@ -0,0 +1,199 @@ +//! Real-format Firefox `places.sqlite` generator. +//! +//! ## Responsibilities +//! - Emit a SQLite file with the `moz_places` / `moz_historyvisits` shape +//! that `browser_history_parser::firefox` reads, populated from caller- +//! supplied record structs. +//! - Convert fixture-author-friendly Unix milliseconds into Firefox's native +//! `i64` microseconds-since-Unix-epoch on write. +//! +//! ## Not responsible for +//! - The optional `moz_inputhistory` / `moz_places_metadata*` sidecar tables; +//! those are added when scenarios exercise typed-evidence extraction. +//! - Synthesizing realistic content. Scenario builders compose these records. +//! +//! ## Performance notes +//! - Single-transaction write. Bound by SQLite throughput, not Rust overhead. + +use rusqlite::{Connection, params}; +use std::path::Path; + +/// One row destined for the Firefox `moz_places` table. +#[derive(Debug, Clone)] +pub struct FirefoxPlaceRow { + /// `moz_places.id` — Firefox's per-URL primary key (`place_id`). + pub id: i64, + /// `moz_places.url` — full URL. + pub url: String, + /// `moz_places.title` — page title, or `None` for pages without one. + pub title: Option, + /// `moz_places.visit_count` — Firefox's lifetime visit count. + pub visit_count: i64, + /// `moz_places.hidden` — whether the URL is hidden from suggestion lists. + pub hidden: bool, + /// `moz_places.last_visit_date` — Unix milliseconds; converted to μs at write time. + pub last_visit_unix_ms: i64, +} + +/// One row destined for the Firefox `moz_historyvisits` table. +#[derive(Debug, Clone)] +pub struct FirefoxVisitRow { + /// `moz_historyvisits.id` — visit primary key. + pub id: i64, + /// `moz_historyvisits.place_id` — foreign key into `moz_places.id`. + pub place_id: i64, + /// `moz_historyvisits.visit_date` — Unix milliseconds; converted to μs at write time. + pub visit_time_unix_ms: i64, + /// `moz_historyvisits.from_visit` — the visit that linked here, or `None`. + pub from_visit: Option, + /// `moz_historyvisits.visit_type` — Firefox's transition-type enum. + pub visit_type: Option, +} + +/// Builder for one Firefox `places.sqlite` fixture. +#[derive(Debug, Default)] +pub struct FirefoxPlacesFixture { + places: Vec, + visits: Vec, +} + +impl FirefoxPlacesFixture { + /// Creates an empty fixture builder. + pub fn new() -> Self { + Self::default() + } + + /// Adds one place row to the fixture. + pub fn add_place(mut self, place: FirefoxPlaceRow) -> Self { + self.places.push(place); + self + } + + /// Adds one visit row to the fixture. + pub fn add_visit(mut self, visit: FirefoxVisitRow) -> Self { + self.visits.push(visit); + self + } + + /// Materializes the fixture as a real-format SQLite file at `path`. + pub fn write(&self, path: &Path) -> Result<(), rusqlite::Error> { + if path.exists() { + std::fs::remove_file(path) + .map_err(|err| rusqlite::Error::ToSqlConversionFailure(Box::new(err)))?; + } + + let mut connection = Connection::open(path)?; + let transaction = connection.transaction()?; + + transaction.execute_batch(SCHEMA_SQL)?; + + { + let mut place_stmt = transaction.prepare( + "INSERT INTO moz_places (id, url, title, visit_count, hidden, last_visit_date) + VALUES (?1, ?2, ?3, ?4, ?5, ?6)", + )?; + for place in &self.places { + place_stmt.execute(params![ + place.id, + place.url, + place.title, + place.visit_count, + place.hidden as i64, + unix_ms_to_firefox_time(place.last_visit_unix_ms), + ])?; + } + } + + { + let mut visit_stmt = transaction.prepare( + "INSERT INTO moz_historyvisits (id, place_id, visit_date, from_visit, visit_type) + VALUES (?1, ?2, ?3, ?4, ?5)", + )?; + for visit in &self.visits { + visit_stmt.execute(params![ + visit.id, + visit.place_id, + unix_ms_to_firefox_time(visit.visit_time_unix_ms), + visit.from_visit, + visit.visit_type, + ])?; + } + } + + transaction.commit()?; + Ok(()) + } +} + +/// Converts Unix milliseconds into Firefox's microseconds-since-Unix-epoch. +/// +/// Mirrors `browser_history_parser::firefox::unix_ms_to_firefox_time`. Keeping +/// a local copy here avoids a runtime dependency on the parser crate. +pub fn unix_ms_to_firefox_time(unix_ms: i64) -> i64 { + unix_ms.max(0).saturating_mul(1_000) +} + +/// Inverse of [`unix_ms_to_firefox_time`]. +pub fn firefox_time_to_unix_ms(firefox_micros: i64) -> i64 { + firefox_micros.div_euclid(1_000).max(0) +} + +/// Minimum schema the production Firefox parser reads. +/// +/// Real Firefox `places.sqlite` files carry many more tables (bookmarks, +/// keywords, metadata, input history, search queries). Scenarios that need +/// those tables will extend the schema in a dedicated writer slice. +const SCHEMA_SQL: &str = r#" +CREATE TABLE moz_places ( + id INTEGER PRIMARY KEY, + url TEXT NOT NULL, + title TEXT, + visit_count INTEGER, + hidden INTEGER, + last_visit_date INTEGER +); + +CREATE TABLE moz_historyvisits ( + id INTEGER PRIMARY KEY, + place_id INTEGER NOT NULL, + visit_date INTEGER NOT NULL, + from_visit INTEGER, + visit_type INTEGER +); + +CREATE INDEX moz_places_url_index ON moz_places(url); +CREATE INDEX moz_historyvisits_place_index ON moz_historyvisits(place_id); +CREATE INDEX moz_historyvisits_date_index ON moz_historyvisits(visit_date); +"#; + +#[cfg(test)] +mod tests { + use super::*; + use tempfile::TempDir; + + #[test] + fn write_overwrites_existing_file_at_same_path() { + let dir = TempDir::new().unwrap(); + let path = dir.path().join("places.sqlite"); + let fixture = FirefoxPlacesFixture::new() + .add_place(FirefoxPlaceRow { + id: 1, + url: "https://a.test".to_string(), + title: Some("A".to_string()), + visit_count: 1, + hidden: false, + last_visit_unix_ms: 1_700_000_000_000, + }) + .add_visit(FirefoxVisitRow { + id: 1, + place_id: 1, + visit_time_unix_ms: 1_700_000_000_000, + from_visit: None, + visit_type: Some(1), + }); + fixture.write(&path).unwrap(); + assert!(path.exists()); + fixture.write(&path).unwrap(); + assert!(path.exists()); + } +} diff --git a/src-tauri/crates/browser-history-fixtures/src/lib.rs b/src-tauri/crates/browser-history-fixtures/src/lib.rs new file mode 100644 index 00000000..d98bf096 --- /dev/null +++ b/src-tauri/crates/browser-history-fixtures/src/lib.rs @@ -0,0 +1,49 @@ +//! Deterministic browser-history fixtures for PathKeep ingest tests. +//! +//! ## Responsibilities +//! - Write real-format browser history files (Chromium `History` SQLite today; +//! Firefox / Safari / Takeout to follow) from declarative record structs. +//! - Convert between human-readable Unix times and the on-disk epochs each +//! browser uses, so fixture authors never write raw epoch math. +//! - Stay self-validating: every generator is paired with a round-trip test +//! that proves PathKeep's real parser reads the fixture back as expected. +//! +//! ## Not responsible for +//! - Sampling real user data. Every fixture is programmatically synthesized; +//! no URL or title is ever pulled from a live browser DB. +//! - Driving the canonical ingest pipeline. That belongs to integration tests +//! in `vault-core`, which will consume the fixtures emitted here. +//! - Scenario orchestration (`Scenario` DSL, multi-profile composition, +//! assertion API). That layer ships in the next slice once the per-family +//! writers are verified. +//! +//! ## Dependencies +//! - `rusqlite` (bundled SQLCipher build inherited from the workspace) for +//! writing real History databases. +//! - Epoch conversions are implemented in `time.rs` with plain integer +//! arithmetic — no `chrono` dependency. The constants are pinned to +//! `vault_core::utils::CHROME_UNIX_EPOCH_OFFSET_MICROS` and verified +//! by round-trip tests against the production parser. +//! +//! ## Performance notes +//! - Fixture writes use a single transaction per database; bulk-loading a +//! million-row scenario is bounded by SQLite's write throughput, not by +//! per-row Rust overhead. + +pub mod chromium; +pub mod firefox; +pub mod safari; +pub mod takeout; +pub mod time; + +pub use chromium::{ChromiumHistoryFixture, ChromiumUrlRow, ChromiumVisitRow}; +pub use firefox::{ + FirefoxPlaceRow, FirefoxPlacesFixture, FirefoxVisitRow, firefox_time_to_unix_ms, + unix_ms_to_firefox_time, +}; +pub use safari::{ + SafariHistoryFixture, SafariHistoryItemRow, SafariHistoryVisitRow, SafariSchemaVariant, + safari_time_to_unix_ms, unix_ms_to_safari_time, +}; +pub use takeout::{TakeoutBrowserHistoryFixture, TakeoutBrowserRecord, TakeoutPayloadFormat}; +pub use time::{chrome_time_to_unix_ms, unix_ms_to_chrome_time}; diff --git a/src-tauri/crates/browser-history-fixtures/src/safari/mod.rs b/src-tauri/crates/browser-history-fixtures/src/safari/mod.rs new file mode 100644 index 00000000..d394becf --- /dev/null +++ b/src-tauri/crates/browser-history-fixtures/src/safari/mod.rs @@ -0,0 +1,264 @@ +//! Real-format Safari `History.db` generator. +//! +//! ## Responsibilities +//! - Emit a SQLite file with the `history_items` / `history_visits` shape +//! `browser_history_parser::safari` reads. +//! - Support both the minimal historical schema (just `visit_time`) and the +//! current macOS Safari schema with `load_successful`, `synthesized`, +//! `redirect_*`, `origin`, `score`, etc. — selectable per fixture. +//! - Convert fixture-author Unix milliseconds into Safari's CFAbsoluteTime +//! `f64` (seconds since 2001-01-01). +//! +//! ## Not responsible for +//! - The `history_tombstones` table; scenarios that exercise sync-deletion +//! semantics will extend this writer. +//! - Synthesizing realistic content; scenario builders compose records. + +use rusqlite::{Connection, params}; +use std::path::Path; + +const SAFARI_UNIX_EPOCH_OFFSET_SECONDS: f64 = 978_307_200.0; + +/// Which Safari schema variant the writer should produce. +/// +/// Real macOS Safari ships the `Current` schema today; the `Minimal` variant +/// covers older OS versions and the legacy parser-test fixture path. +#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)] +pub enum SafariSchemaVariant { + /// Minimal `history_visits` columns: only `id`, `history_item`, `title`, `visit_time`. + Minimal, + /// Current macOS Safari schema: adds `load_successful`, `synthesized`, + /// `redirect_*`, `origin`, `generation`, `attributes`, `score`. + #[default] + Current, +} + +/// One row destined for the Safari `history_items` table. +#[derive(Debug, Clone)] +pub struct SafariHistoryItemRow { + /// `history_items.id` — Safari's per-URL primary key. + pub id: i64, + /// `history_items.url` — full URL. + pub url: String, +} + +/// One row destined for the Safari `history_visits` table. +#[derive(Debug, Clone)] +pub struct SafariHistoryVisitRow { + /// `history_visits.id` — visit primary key. + pub id: i64, + /// `history_visits.history_item` — foreign key to `history_items.id`. + pub history_item: i64, + /// `history_visits.title` — Safari attaches title at the visit level, not the URL. + pub title: Option, + /// `history_visits.visit_time` — Unix milliseconds; converted to CFAbsoluteTime at write. + pub visit_time_unix_ms: i64, + /// `history_visits.load_successful` — whether the page loaded without error. + pub load_successful: Option, + /// `history_visits.http_non_get` — whether the request used a non-GET method. + pub http_non_get: Option, + /// `history_visits.synthesized` — whether Safari generated this row as a side-effect of a redirect or similar. + pub synthesized: Option, + /// `history_visits.redirect_source` — the visit id that redirected here. + pub redirect_source: Option, + /// `history_visits.redirect_destination` — the visit id this redirected to. + pub redirect_destination: Option, + /// `history_visits.origin` — Safari's load-origin enum. + pub origin: Option, + /// `history_visits.generation` — Safari's content-generation counter. + pub generation: Option, + /// `history_visits.attributes` — Safari's per-visit attribute bitfield. + pub attributes: Option, + /// `history_visits.score` — Safari's relevance score. + pub score: Option, +} + +/// Builder for one Safari `History.db` fixture. +#[derive(Debug, Default)] +pub struct SafariHistoryFixture { + variant: SafariSchemaVariant, + items: Vec, + visits: Vec, +} + +impl SafariHistoryFixture { + /// Creates an empty builder using the current macOS Safari schema variant. + pub fn new() -> Self { + Self::default() + } + + /// Switches the writer to the minimal historical schema (for legacy testing). + pub fn with_variant(mut self, variant: SafariSchemaVariant) -> Self { + self.variant = variant; + self + } + + /// Adds one history item row. + pub fn add_item(mut self, item: SafariHistoryItemRow) -> Self { + self.items.push(item); + self + } + + /// Adds one history visit row. + pub fn add_visit(mut self, visit: SafariHistoryVisitRow) -> Self { + self.visits.push(visit); + self + } + + /// Materializes the fixture as a real-format SQLite file at `path`. + pub fn write(&self, path: &Path) -> Result<(), rusqlite::Error> { + if path.exists() { + std::fs::remove_file(path) + .map_err(|err| rusqlite::Error::ToSqlConversionFailure(Box::new(err)))?; + } + + let mut connection = Connection::open(path)?; + let transaction = connection.transaction()?; + + transaction.execute_batch(match self.variant { + SafariSchemaVariant::Minimal => SCHEMA_MINIMAL_SQL, + SafariSchemaVariant::Current => SCHEMA_CURRENT_SQL, + })?; + + { + let mut item_stmt = + transaction.prepare("INSERT INTO history_items (id, url) VALUES (?1, ?2)")?; + for item in &self.items { + item_stmt.execute(params![item.id, item.url])?; + } + } + + match self.variant { + SafariSchemaVariant::Minimal => { + let mut visit_stmt = transaction.prepare( + "INSERT INTO history_visits (id, history_item, title, visit_time) + VALUES (?1, ?2, ?3, ?4)", + )?; + for visit in &self.visits { + visit_stmt.execute(params![ + visit.id, + visit.history_item, + visit.title, + unix_ms_to_safari_time(visit.visit_time_unix_ms), + ])?; + } + } + SafariSchemaVariant::Current => { + let mut visit_stmt = transaction.prepare( + "INSERT INTO history_visits ( + id, history_item, title, visit_time, load_successful, + http_non_get, synthesized, redirect_source, redirect_destination, + origin, generation, attributes, score + ) + VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8, ?9, ?10, ?11, ?12, ?13)", + )?; + for visit in &self.visits { + visit_stmt.execute(params![ + visit.id, + visit.history_item, + visit.title, + unix_ms_to_safari_time(visit.visit_time_unix_ms), + visit.load_successful.map(|flag| flag as i64), + visit.http_non_get.map(|flag| flag as i64), + visit.synthesized.map(|flag| flag as i64), + visit.redirect_source, + visit.redirect_destination, + visit.origin, + visit.generation, + visit.attributes, + visit.score, + ])?; + } + } + } + + transaction.commit()?; + Ok(()) + } +} + +/// Converts Unix milliseconds into Safari's CFAbsoluteTime (seconds since 2001-01-01). +pub fn unix_ms_to_safari_time(unix_ms: i64) -> f64 { + (unix_ms.max(0) as f64 / 1_000.0) - SAFARI_UNIX_EPOCH_OFFSET_SECONDS +} + +/// Inverse of [`unix_ms_to_safari_time`], rounding to the nearest millisecond. +pub fn safari_time_to_unix_ms(safari_seconds: f64) -> i64 { + (((safari_seconds + SAFARI_UNIX_EPOCH_OFFSET_SECONDS) * 1_000.0).round() as i64).max(0) +} + +const SCHEMA_MINIMAL_SQL: &str = r#" +CREATE TABLE history_items ( + id INTEGER PRIMARY KEY, + url TEXT NOT NULL +); + +CREATE TABLE history_visits ( + id INTEGER PRIMARY KEY, + history_item INTEGER NOT NULL, + title TEXT, + visit_time REAL NOT NULL +); + +CREATE INDEX history_visits_item_index ON history_visits(history_item); +CREATE INDEX history_visits_time_index ON history_visits(visit_time); +"#; + +const SCHEMA_CURRENT_SQL: &str = r#" +CREATE TABLE history_items ( + id INTEGER PRIMARY KEY, + url TEXT NOT NULL +); + +CREATE TABLE history_visits ( + id INTEGER PRIMARY KEY, + history_item INTEGER NOT NULL, + title TEXT, + visit_time REAL NOT NULL, + load_successful INTEGER, + http_non_get INTEGER, + synthesized INTEGER, + redirect_source INTEGER, + redirect_destination INTEGER, + origin INTEGER, + generation INTEGER, + attributes INTEGER, + score REAL +); + +CREATE INDEX history_visits_item_index ON history_visits(history_item); +CREATE INDEX history_visits_time_index ON history_visits(visit_time); +"#; + +#[cfg(test)] +mod tests { + use super::*; + use tempfile::TempDir; + + #[test] + fn write_overwrites_existing_file_at_same_path() { + let dir = TempDir::new().unwrap(); + let path = dir.path().join("History.db"); + let fixture = SafariHistoryFixture::new() + .add_item(SafariHistoryItemRow { id: 1, url: "https://a.test".to_string() }) + .add_visit(SafariHistoryVisitRow { + id: 1, + history_item: 1, + title: Some("A".to_string()), + visit_time_unix_ms: 1_700_000_000_000, + load_successful: None, + http_non_get: None, + synthesized: None, + redirect_source: None, + redirect_destination: None, + origin: None, + generation: None, + attributes: None, + score: None, + }); + fixture.write(&path).unwrap(); + assert!(path.exists()); + fixture.write(&path).unwrap(); + assert!(path.exists()); + } +} diff --git a/src-tauri/crates/browser-history-fixtures/src/takeout/mod.rs b/src-tauri/crates/browser-history-fixtures/src/takeout/mod.rs new file mode 100644 index 00000000..aea3383d --- /dev/null +++ b/src-tauri/crates/browser-history-fixtures/src/takeout/mod.rs @@ -0,0 +1,248 @@ +//! Google Takeout `BrowserHistory.json` / `.jsonl` payload generator. +//! +//! ## Responsibilities +//! - Emit Takeout-format JSON or JSONL files containing browser-history +//! records in the shape `browser_history_parser::takeout` recognizes. +//! - Stay faithful to the field names Google actually ships (`time_usec`, +//! `page_transition`, `client_id`, `favicon_url`) so the parser exercises +//! its real classifier and record-extraction paths. +//! - Make the time-unit contract testable: the writer takes Unix +//! milliseconds and converts to the unit the parser currently assumes +//! (microseconds-since-Unix-epoch). The audit's open question B6 about +//! whether Google really ships Chrome epoch or Unix epoch can be pinned +//! by writing fixtures in both unit interpretations and observing which +//! one yields the expected Unix-ms output through the parser. +//! +//! ## Not responsible for +//! - Other Takeout payloads (TypedURL, Sessions, MyActivity HTML/JSON); +//! those are out of scope until scenarios call for them. +//! - Zip packaging — the parser supports zipped Takeout sources but the +//! first fixture slice writes plain files only. A `write_zip` helper +//! will be added when a scenario needs it. + +use std::fs::File; +use std::io::{BufWriter, Write}; +use std::path::Path; + +/// One Takeout `Browser History` record. +#[derive(Debug, Clone)] +pub struct TakeoutBrowserRecord { + /// The page URL. Serialized as the `url` field. + pub url: String, + /// The page title. Serialized as the `title` field; omitted when `None`. + pub title: Option, + /// Visit time in Unix milliseconds; serialized as `time_usec` in microseconds. + pub visit_time_unix_ms: i64, + /// Chrome transition tag, e.g. `LINK`, `TYPED`. Serialized as `page_transition`. + pub page_transition: Option, + /// Stable client id; serialized as `client_id`. Captured as + /// context evidence by the parser. + pub client_id: Option, + /// Optional favicon URL; serialized as `favicon_url`. Captured as + /// context evidence by the parser. + pub favicon_url: Option, + /// Optional ptoken; serialized as `ptoken`. Captured as context + /// evidence (`context.takeout.ptoken`) by the parser. + pub ptoken: Option, +} + +/// Which on-disk layout to emit for the Takeout payload. +#[derive(Debug, Clone, Copy, PartialEq, Eq)] +pub enum TakeoutPayloadFormat { + /// Standard Google Takeout layout: `{ "Browser History": [...] }`. + StandardBrowserHistoryJson, + /// Older / alternate Takeout layout using the `BrowserHistory` (no space) key. + AlternateBrowserHistoryJson, + /// JSONL: one JSON record per line, no wrapping object. + JsonLines, +} + +/// Builder for one Takeout `BrowserHistory.*` fixture. +#[derive(Debug)] +pub struct TakeoutBrowserHistoryFixture { + format: TakeoutPayloadFormat, + records: Vec, +} + +impl TakeoutBrowserHistoryFixture { + /// Creates an empty builder using the standard `Browser History` key. + pub fn new() -> Self { + Self { format: TakeoutPayloadFormat::StandardBrowserHistoryJson, records: Vec::new() } + } + + /// Switches the writer to a different payload format. + pub fn with_format(mut self, format: TakeoutPayloadFormat) -> Self { + self.format = format; + self + } + + /// Adds one record to the payload. + pub fn add_record(mut self, record: TakeoutBrowserRecord) -> Self { + self.records.push(record); + self + } + + /// Materializes the fixture at `path`. The conventional file name is + /// `BrowserHistory.json` (or `.jsonl`) inside a `Chrome` subdirectory, + /// since the Takeout source classifier looks at path segments — but the + /// path is the caller's responsibility. + pub fn write(&self, path: &Path) -> std::io::Result<()> { + if let Some(parent) = path.parent() { + std::fs::create_dir_all(parent)?; + } + let file = File::create(path)?; + let mut writer = BufWriter::new(file); + + match self.format { + TakeoutPayloadFormat::StandardBrowserHistoryJson => { + self.write_wrapped_json(&mut writer, "Browser History")?; + } + TakeoutPayloadFormat::AlternateBrowserHistoryJson => { + self.write_wrapped_json(&mut writer, "BrowserHistory")?; + } + TakeoutPayloadFormat::JsonLines => { + for record in &self.records { + writer.write_all(serialize_record(record).as_bytes())?; + writer.write_all(b"\n")?; + } + } + } + + writer.flush()?; + Ok(()) + } + + fn write_wrapped_json(&self, writer: &mut W, key: &str) -> std::io::Result<()> { + writer.write_all(b"{\n \"")?; + writer.write_all(key.as_bytes())?; + writer.write_all(b"\": [")?; + for (index, record) in self.records.iter().enumerate() { + if index > 0 { + writer.write_all(b",")?; + } + writer.write_all(b"\n ")?; + writer.write_all(serialize_record(record).as_bytes())?; + } + if !self.records.is_empty() { + writer.write_all(b"\n ")?; + } + writer.write_all(b"]\n}\n")?; + Ok(()) + } +} + +impl Default for TakeoutBrowserHistoryFixture { + fn default() -> Self { + Self::new() + } +} + +fn serialize_record(record: &TakeoutBrowserRecord) -> String { + let mut fields: Vec = Vec::with_capacity(6); + if let Some(transition) = &record.page_transition { + fields.push(format!("\"page_transition\": {}", json_string(transition))); + } + if let Some(title) = &record.title { + fields.push(format!("\"title\": {}", json_string(title))); + } + fields.push(format!("\"url\": {}", json_string(&record.url))); + fields.push(format!("\"time_usec\": {}", record.visit_time_unix_ms.saturating_mul(1_000))); + if let Some(client_id) = &record.client_id { + fields.push(format!("\"client_id\": {}", json_string(client_id))); + } + if let Some(favicon) = &record.favicon_url { + fields.push(format!("\"favicon_url\": {}", json_string(favicon))); + } + if let Some(ptoken) = &record.ptoken { + fields.push(format!("\"ptoken\": {}", json_string(ptoken))); + } + format!("{{{}}}", fields.join(", ")) +} + +/// Minimal JSON string encoder. Handles the escape sequences the parser will +/// see in synthetic fixtures (quotes, backslashes, control chars) without +/// pulling in a full JSON serializer dependency. +fn json_string(value: &str) -> String { + let mut buffer = String::with_capacity(value.len() + 2); + buffer.push('"'); + for ch in value.chars() { + match ch { + '"' => buffer.push_str("\\\""), + '\\' => buffer.push_str("\\\\"), + '\n' => buffer.push_str("\\n"), + '\r' => buffer.push_str("\\r"), + '\t' => buffer.push_str("\\t"), + ch if (ch as u32) < 0x20 => { + buffer.push_str(&format!("\\u{:04x}", ch as u32)); + } + ch => buffer.push(ch), + } + } + buffer.push('"'); + buffer +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn json_string_escapes_control_and_special_characters() { + assert_eq!(json_string("hello"), "\"hello\""); + assert_eq!(json_string("with \"quotes\""), "\"with \\\"quotes\\\"\""); + assert_eq!(json_string("with\\slash"), "\"with\\\\slash\""); + assert_eq!(json_string("line1\nline2"), "\"line1\\nline2\""); + assert_eq!(json_string("\u{0001}"), "\"\\u0001\""); + } + + #[test] + fn default_creates_empty_fixture() { + let fixture = TakeoutBrowserHistoryFixture::default(); + assert_eq!(fixture.records.len(), 0); + } + + #[test] + fn json_string_escapes_tab_and_carriage_return() { + assert_eq!(json_string("col1\tcol2"), "\"col1\\tcol2\""); + assert_eq!(json_string("line\rend"), "\"line\\rend\""); + } + + #[test] + fn serialize_record_emits_field_order_the_parser_can_read() { + let record = TakeoutBrowserRecord { + url: "https://example.com".to_string(), + title: Some("Example".to_string()), + visit_time_unix_ms: 1_700_000_000_000, + page_transition: Some("LINK".to_string()), + client_id: None, + favicon_url: None, + ptoken: None, + }; + let serialized = serialize_record(&record); + assert!(serialized.contains("\"url\": \"https://example.com\"")); + assert!(serialized.contains("\"title\": \"Example\"")); + assert!(serialized.contains("\"time_usec\": 1700000000000000")); + assert!(serialized.contains("\"page_transition\": \"LINK\"")); + assert!(!serialized.contains("client_id")); + assert!(!serialized.contains("favicon_url")); + assert!(!serialized.contains("ptoken")); + } + + #[test] + fn serialize_record_includes_ptoken_when_present() { + let record = TakeoutBrowserRecord { + url: "https://example.com".to_string(), + title: Some("Example".to_string()), + visit_time_unix_ms: 1_700_000_000_000, + page_transition: None, + client_id: None, + favicon_url: None, + ptoken: Some("synthetic-ptoken-value".to_string()), + }; + let serialized = serialize_record(&record); + assert!( + serialized.contains("\"ptoken\": \"synthetic-ptoken-value\""), + "serialized output should contain ptoken field: {serialized}" + ); + } +} diff --git a/src-tauri/crates/browser-history-fixtures/src/time.rs b/src-tauri/crates/browser-history-fixtures/src/time.rs new file mode 100644 index 00000000..4436e7b7 --- /dev/null +++ b/src-tauri/crates/browser-history-fixtures/src/time.rs @@ -0,0 +1,72 @@ +//! Epoch conversions between Unix and Chrome time. +//! +//! Chrome stores `last_visit_time` and `visit_time` as microseconds since +//! `1601-01-01T00:00:00Z` (the Windows NT epoch). PathKeep canonicalizes to +//! Unix milliseconds. Fixture authors think in Unix ms; this module bridges +//! the two without leaking raw offset arithmetic into call sites. + +/// Microseconds between the Windows NT epoch (1601-01-01) and the Unix epoch. +/// +/// This matches `vault_core::utils::CHROME_UNIX_EPOCH_OFFSET_MICROS` and the +/// constant inside `browser_history_parser::chromium`. Keeping a local copy +/// avoids a runtime dependency on either crate while staying numerically +/// pinned to their behavior; the round-trip test catches any divergence. +const CHROME_UNIX_EPOCH_OFFSET_MICROS: i64 = 11_644_473_600_000_000; + +/// Converts Unix milliseconds into Chrome's microseconds-since-1601 format. +/// +/// Saturating arithmetic mirrors the production helper so absurd far-future +/// inputs do not silently wrap negative. +pub fn unix_ms_to_chrome_time(unix_ms: i64) -> i64 { + unix_ms.saturating_mul(1_000).saturating_add(CHROME_UNIX_EPOCH_OFFSET_MICROS) +} + +/// Converts Chrome microseconds-since-1601 back into Unix milliseconds. +/// +/// The inverse of [`unix_ms_to_chrome_time`] for positive Unix timestamps; +/// used by round-trip tests to assert the fixture writer and the production +/// parser agree on the epoch. +/// +/// Mirrors the production parser's `.max(0)` clamp at +/// `browser-history-parser/src/chromium/mod.rs:290` so any pre-1970 chrome +/// timestamp (negative-after-offset-subtraction) lands as 0 — keeping +/// fixture-side verification helpers aligned with how production stores +/// the value, even though the inverse is no longer total across i64. +pub fn chrome_time_to_unix_ms(chrome_micros: i64) -> i64 { + chrome_micros.saturating_sub(CHROME_UNIX_EPOCH_OFFSET_MICROS).div_euclid(1_000).max(0) +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn unix_to_chrome_and_back_round_trips() { + let unix_ms = 1_700_000_000_000_i64; // 2023-11-14T22:13:20Z + let chrome = unix_ms_to_chrome_time(unix_ms); + assert_eq!(chrome_time_to_unix_ms(chrome), unix_ms); + } + + #[test] + fn unix_epoch_zero_maps_to_offset_only() { + assert_eq!(unix_ms_to_chrome_time(0), CHROME_UNIX_EPOCH_OFFSET_MICROS); + assert_eq!(chrome_time_to_unix_ms(CHROME_UNIX_EPOCH_OFFSET_MICROS), 0); + } + + #[test] + fn far_future_unix_saturates_rather_than_wraps() { + let absurd = i64::MAX / 1_000; + let chrome = unix_ms_to_chrome_time(absurd); + assert_eq!(chrome, i64::MAX); + } + + #[test] + fn pre_unix_epoch_chrome_time_clamps_to_zero() { + // chrome_micros = 0 represents the Windows NT epoch (1601-01-01), + // which is well before the Unix epoch. Production parser clamps + // such values to 0; the fixture-side inverse helper must do the + // same so verification helpers agree with archived state. + assert_eq!(chrome_time_to_unix_ms(0), 0); + assert_eq!(chrome_time_to_unix_ms(CHROME_UNIX_EPOCH_OFFSET_MICROS - 1), 0); + } +} diff --git a/src-tauri/crates/browser-history-fixtures/tests/chromium_roundtrip.rs b/src-tauri/crates/browser-history-fixtures/tests/chromium_roundtrip.rs new file mode 100644 index 00000000..80ee0138 --- /dev/null +++ b/src-tauri/crates/browser-history-fixtures/tests/chromium_roundtrip.rs @@ -0,0 +1,202 @@ +//! Self-validation for the Chromium History fixture writer. +//! +//! Every scenario test built on `browser-history-fixtures` ultimately relies on +//! one promise: the SQLite file we wrote is byte-faithful enough that the +//! production PathKeep parser reads back exactly the records we declared. If +//! that promise breaks, every downstream scenario is meaningless — a passing +//! assertion could just mean "writer and parser are silently aligned in their +//! shared mistake." +//! +//! This file is the gate. It exercises the smallest meaningful fixture +//! (two URLs, three visits, one revisit) and round-trips it through the real +//! `browser_history_parser::chromium::parse_history` entry point. + +use browser_history_fixtures::{ + ChromiumHistoryFixture, ChromiumUrlRow, ChromiumVisitRow, chrome_time_to_unix_ms, + unix_ms_to_chrome_time, +}; +use browser_history_parser::{ChromiumReadCursor, HistoryDatabaseSet, chromium}; +use tempfile::TempDir; + +#[test] +fn chromium_fixture_round_trips_through_production_parser() { + let temp = TempDir::new().expect("tempdir"); + let history_path = temp.path().join("History"); + + // 2026-05-01T00:00:00Z, 2026-05-02T12:00:00Z, 2026-05-03T08:15:30Z + let visit_one_ms = 1_777_680_000_000; + let visit_two_ms = 1_777_809_600_000; + let visit_three_ms = 1_777_872_930_000; + + ChromiumHistoryFixture::new() + .add_url(ChromiumUrlRow { + id: 1, + url: "https://example.com/article-one".to_string(), + title: Some("Article One".to_string()), + visit_count: 2, + typed_count: 1, + last_visit_unix_ms: visit_two_ms, + hidden: false, + }) + .add_url(ChromiumUrlRow { + id: 2, + url: "https://example.org/article-two".to_string(), + title: Some("Article Two".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: visit_three_ms, + hidden: false, + }) + .add_visit(ChromiumVisitRow { + id: 10, + url_id: 1, + visit_time_unix_ms: visit_one_ms, + from_visit: Some(0), + transition: Some(805306368), // PAGE_TRANSITION_TYPED | CHAIN_START | CHAIN_END + visit_duration_micros: Some(30_000_000), + is_known_to_sync: true, + visited_link_id: Some(42), + external_referrer_url: None, + app_id: None, + }) + .add_visit(ChromiumVisitRow { + id: 11, + url_id: 1, + visit_time_unix_ms: visit_two_ms, + from_visit: Some(10), + transition: Some(805306369), // PAGE_TRANSITION_LINK | ... + visit_duration_micros: Some(15_500_000), + is_known_to_sync: true, + visited_link_id: Some(42), + external_referrer_url: Some("https://referrer.example.net/".to_string()), + app_id: None, + }) + .add_visit(ChromiumVisitRow { + id: 12, + url_id: 2, + visit_time_unix_ms: visit_three_ms, + from_visit: Some(11), + transition: Some(805306369), + visit_duration_micros: None, + is_known_to_sync: false, + visited_link_id: None, + external_referrer_url: None, + app_id: Some("app.example".to_string()), + }) + .write(&history_path) + .expect("write fixture"); + + let parsed = chromium::parse_history( + &HistoryDatabaseSet { history_path: history_path.clone(), favicons_path: None }, + ChromiumReadCursor::default(), + ) + .expect("parse fixture"); + + assert_eq!(parsed.urls.len(), 2, "parser should see exactly the URLs we wrote"); + assert_eq!(parsed.visits.len(), 3, "parser should see exactly the visits we wrote"); + + let url_one = parsed.urls.iter().find(|url| url.source_url_id == 1).expect("url id 1"); + assert_eq!(url_one.url, "https://example.com/article-one"); + assert_eq!(url_one.title.as_deref(), Some("Article One")); + assert_eq!(url_one.visit_count, 2); + assert_eq!(url_one.typed_count, 1); + assert_eq!(url_one.last_visit_ms, visit_two_ms); + assert!(!url_one.hidden); + + let url_two = parsed.urls.iter().find(|url| url.source_url_id == 2).expect("url id 2"); + assert_eq!(url_two.url, "https://example.org/article-two"); + assert_eq!(url_two.last_visit_ms, visit_three_ms); + + let visit_one = + parsed.visits.iter().find(|visit| visit.source_visit_id == 10).expect("visit id 10"); + assert_eq!(visit_one.source_url_id, 1); + assert_eq!(visit_one.visit_time_ms, visit_one_ms); + assert_eq!(visit_one.transition, Some(805306368)); + // Despite the field name `visit_duration_ms`, the Chromium parser passes + // the raw `visits.visit_duration` value through, which Chrome itself + // stores as microseconds. This is a known naming inconsistency in + // production code (see import-dedup-audit.md); the fixture writes the + // value in Chrome's native microsecond unit and the round-trip confirms. + assert_eq!(visit_one.visit_duration_ms, Some(30_000_000)); + assert!(visit_one.is_known_to_sync); + assert_eq!(visit_one.visited_link_id, Some(42)); + + let visit_two = + parsed.visits.iter().find(|visit| visit.source_visit_id == 11).expect("visit id 11"); + assert_eq!(visit_two.from_visit, Some(10)); + assert_eq!(visit_two.external_referrer_url.as_deref(), Some("https://referrer.example.net/")); + + let visit_three = + parsed.visits.iter().find(|visit| visit.source_visit_id == 12).expect("visit id 12"); + assert_eq!(visit_three.source_url_id, 2); + assert_eq!(visit_three.app_id.as_deref(), Some("app.example")); + assert!(!visit_three.is_known_to_sync); +} + +#[test] +fn chromium_fixture_preserves_cjk_url_and_title() { + let temp = TempDir::new().expect("tempdir"); + let history_path = temp.path().join("History"); + + let visit_ms = 1_777_680_000_000; + // URL with percent-encoded CJK path segment and raw CJK query parameter. + let cjk_url = "https://example.com/test-unicode/%E6%B8%AC%E8%A9%A6?q=\u{691C}\u{7D22}"; + let cjk_title = "\u{65E5}\u{672C}\u{8A9E}\u{30C6}\u{30B9}\u{30C8} \u{2014} \u{6E2C}\u{8A66}\u{9801}\u{9762}"; + + ChromiumHistoryFixture::new() + .add_url(ChromiumUrlRow { + id: 100, + url: cjk_url.to_string(), + title: Some(cjk_title.to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: visit_ms, + hidden: false, + }) + .add_visit(ChromiumVisitRow { + id: 200, + url_id: 100, + visit_time_unix_ms: visit_ms, + from_visit: None, + transition: Some(1), + visit_duration_micros: None, + is_known_to_sync: false, + visited_link_id: None, + external_referrer_url: None, + app_id: None, + }) + .write(&history_path) + .expect("write CJK fixture"); + + let parsed = chromium::parse_history( + &HistoryDatabaseSet { history_path: history_path.clone(), favicons_path: None }, + ChromiumReadCursor::default(), + ) + .expect("parse CJK fixture"); + + assert_eq!(parsed.urls.len(), 1); + assert_eq!(parsed.visits.len(), 1); + + let url = &parsed.urls[0]; + assert_eq!(url.url, cjk_url, "percent-encoded CJK URL path should round-trip exactly"); + assert_eq!( + url.title.as_deref(), + Some(cjk_title), + "CJK title with kanji, katakana, and traditional characters should round-trip exactly" + ); + + let visit = &parsed.visits[0]; + assert_eq!(visit.url, cjk_url, "visit-level URL should match the CJK URL"); + assert_eq!(visit.visit_time_ms, visit_ms); +} + +#[test] +fn time_helpers_match_production_offset() { + let unix_ms = 1_777_809_600_000; + let chrome = unix_ms_to_chrome_time(unix_ms); + assert_eq!(chrome_time_to_unix_ms(chrome), unix_ms); + + // Pin the constant: 2026-05-02T12:00:00Z in Unix ms is exactly + // 13_422_283_200_000_000 in Chrome microseconds-since-1601. + assert_eq!(chrome, 13_422_283_200_000_000); +} diff --git a/src-tauri/crates/browser-history-fixtures/tests/firefox_roundtrip.rs b/src-tauri/crates/browser-history-fixtures/tests/firefox_roundtrip.rs new file mode 100644 index 00000000..11c8e502 --- /dev/null +++ b/src-tauri/crates/browser-history-fixtures/tests/firefox_roundtrip.rs @@ -0,0 +1,265 @@ +//! Self-validation for the Firefox `places.sqlite` fixture writer. +//! +//! Mirrors the Chromium round-trip pattern: build a small fixture, parse it +//! back through `browser_history_parser::firefox::parse_history`, and assert +//! every emitted field matches what the fixture promised. + +use browser_history_fixtures::{ + FirefoxPlaceRow, FirefoxPlacesFixture, FirefoxVisitRow, firefox_time_to_unix_ms, + unix_ms_to_firefox_time, +}; +use browser_history_parser::firefox; +use tempfile::TempDir; + +#[test] +fn firefox_fixture_round_trips_through_production_parser() { + let temp = TempDir::new().expect("tempdir"); + let history_path = temp.path().join("places.sqlite"); + + let visit_one_ms = 1_777_680_000_000; + let visit_two_ms = 1_777_809_600_000; + let visit_three_ms = 1_777_872_930_000; + + FirefoxPlacesFixture::new() + .add_place(FirefoxPlaceRow { + id: 7, + url: "https://example.com/firefox-one".to_string(), + title: Some("Firefox Example One".to_string()), + visit_count: 2, + hidden: false, + last_visit_unix_ms: visit_two_ms, + }) + .add_place(FirefoxPlaceRow { + id: 8, + url: "https://example.org/firefox-two".to_string(), + title: Some("Firefox Example Two".to_string()), + visit_count: 1, + hidden: false, + last_visit_unix_ms: visit_three_ms, + }) + .add_visit(FirefoxVisitRow { + id: 11, + place_id: 7, + visit_time_unix_ms: visit_one_ms, + from_visit: None, + visit_type: Some(1), + }) + .add_visit(FirefoxVisitRow { + id: 12, + place_id: 7, + visit_time_unix_ms: visit_two_ms, + from_visit: Some(11), + visit_type: Some(1), + }) + .add_visit(FirefoxVisitRow { + id: 13, + place_id: 8, + visit_time_unix_ms: visit_three_ms, + from_visit: Some(12), + visit_type: Some(2), + }) + .write(&history_path) + .expect("write firefox fixture"); + + let parsed = firefox::parse_history(&history_path, 0, 0).expect("parse firefox fixture"); + + assert_eq!(parsed.urls.len(), 2); + assert_eq!(parsed.visits.len(), 3); + + // --- URL-level assertions: all ParsedUrl fields --- + + let url_seven = parsed.urls.iter().find(|url| url.source_url_id == 7).expect("place 7"); + assert_eq!(url_seven.url, "https://example.com/firefox-one"); + assert_eq!(url_seven.title.as_deref(), Some("Firefox Example One")); + assert_eq!(url_seven.visit_count, 2); + assert_eq!(url_seven.last_visit_ms, visit_two_ms); + assert!(!url_seven.hidden); + // Firefox parser hardcodes typed_count to 0 (Firefox stores typed counts + // differently than Chromium — the parser does not extract them). + assert_eq!(url_seven.typed_count, 0); + // last_visit_iso is derived from the Firefox microsecond timestamp. + assert!(!url_seven.last_visit_iso.is_empty(), "last_visit_iso should be populated"); + + let url_eight = parsed.urls.iter().find(|url| url.source_url_id == 8).expect("place 8"); + assert_eq!(url_eight.url, "https://example.org/firefox-two"); + assert_eq!(url_eight.title.as_deref(), Some("Firefox Example Two")); + assert_eq!(url_eight.visit_count, 1); + assert_eq!(url_eight.last_visit_ms, visit_three_ms); + assert!(!url_eight.hidden); + assert_eq!(url_eight.typed_count, 0); + + // --- Visit-level assertions: all ParsedVisit fields --- + + let visit_eleven = + parsed.visits.iter().find(|visit| visit.source_visit_id == 11).expect("visit 11"); + assert_eq!(visit_eleven.source_url_id, 7); + assert_eq!(visit_eleven.visit_time_ms, visit_one_ms); + // visit_time_iso is derived from the Firefox microsecond timestamp. + assert!( + !visit_eleven.visit_time_iso.is_empty(), + "visit_time_iso should be populated for visit 11" + ); + assert_eq!(visit_eleven.transition, Some(1)); + assert_eq!(visit_eleven.from_visit, None); + assert_eq!(visit_eleven.app_id.as_deref(), Some("firefox")); + // url field on visits is populated from the JOIN with moz_places. + assert_eq!(visit_eleven.url, "https://example.com/firefox-one"); + assert_eq!(visit_eleven.title.as_deref(), Some("Firefox Example One")); + // Firefox parser hardcodes these fields — verify the contract. + assert_eq!(visit_eleven.visit_duration_ms, None); + assert!(!visit_eleven.is_known_to_sync); + assert_eq!(visit_eleven.visited_link_id, None); + assert_eq!(visit_eleven.external_referrer_url, None); + + let visit_twelve = + parsed.visits.iter().find(|visit| visit.source_visit_id == 12).expect("visit 12"); + assert_eq!(visit_twelve.source_url_id, 7); + assert_eq!(visit_twelve.from_visit, Some(11)); + assert_eq!(visit_twelve.visit_time_ms, visit_two_ms); + assert!( + !visit_twelve.visit_time_iso.is_empty(), + "visit_time_iso should be populated for visit 12" + ); + assert_eq!(visit_twelve.transition, Some(1)); + assert_eq!(visit_twelve.url, "https://example.com/firefox-one"); + assert_eq!(visit_twelve.app_id.as_deref(), Some("firefox")); + assert_eq!(visit_twelve.visit_duration_ms, None); + assert!(!visit_twelve.is_known_to_sync); + assert_eq!(visit_twelve.visited_link_id, None); + assert_eq!(visit_twelve.external_referrer_url, None); + + let visit_thirteen = + parsed.visits.iter().find(|visit| visit.source_visit_id == 13).expect("visit 13"); + assert_eq!(visit_thirteen.source_url_id, 8); + assert_eq!(visit_thirteen.from_visit, Some(12)); + assert_eq!(visit_thirteen.visit_time_ms, visit_three_ms); + assert!( + !visit_thirteen.visit_time_iso.is_empty(), + "visit_time_iso should be populated for visit 13" + ); + assert_eq!(visit_thirteen.transition, Some(2)); + assert_eq!(visit_thirteen.url, "https://example.org/firefox-two"); + assert_eq!(visit_thirteen.title.as_deref(), Some("Firefox Example Two")); + assert_eq!(visit_thirteen.app_id.as_deref(), Some("firefox")); + assert_eq!(visit_thirteen.visit_duration_ms, None); + assert!(!visit_thirteen.is_known_to_sync); + assert_eq!(visit_thirteen.visited_link_id, None); + assert_eq!(visit_thirteen.external_referrer_url, None); +} + +#[test] +fn firefox_null_visit_count_defaults_to_zero() { + // Firefox's `moz_places.visit_count` can be NULL in corrupted or very old + // databases. The production parser uses `unwrap_or_default()` on the + // `Option` read from SQLite, which coerces NULL to 0. + // + // The fixture builder's `FirefoxPlaceRow.visit_count` is non-optional to + // stay backward-compatible with downstream callers, so this test writes + // the NULL value directly via SQL. + + let temp = TempDir::new().expect("tempdir"); + let history_path = temp.path().join("places.sqlite"); + + let visit_ms = 1_777_680_000_000; + + // Write a minimal fixture, then overwrite visit_count with NULL. + FirefoxPlacesFixture::new() + .add_place(FirefoxPlaceRow { + id: 20, + url: "https://example.com/null-visit-count".to_string(), + title: Some("Null Visit Count".to_string()), + visit_count: 0, + hidden: false, + last_visit_unix_ms: visit_ms, + }) + .add_visit(FirefoxVisitRow { + id: 30, + place_id: 20, + visit_time_unix_ms: visit_ms, + from_visit: None, + visit_type: Some(1), + }) + .write(&history_path) + .expect("write firefox fixture for null-visit-count test"); + + // Patch visit_count to NULL directly so the parser's unwrap_or_default() + // path is exercised. + { + let connection = rusqlite::Connection::open(&history_path).expect("open for null patching"); + connection + .execute("UPDATE moz_places SET visit_count = NULL WHERE id = 20", []) + .expect("set visit_count to NULL"); + } + + let parsed = firefox::parse_history(&history_path, 0, 0) + .expect("parse null-visit-count firefox fixture"); + + assert_eq!(parsed.urls.len(), 1); + assert_eq!( + parsed.urls[0].visit_count, 0, + "NULL visit_count should default to 0 via unwrap_or_default()" + ); + assert_eq!(parsed.urls[0].url, "https://example.com/null-visit-count"); +} + +#[test] +fn firefox_null_last_visit_date_defaults_to_zero() { + // Firefox's `moz_places.last_visit_date` can be NULL for places that + // Firefox created but never actually visited (e.g. bookmarks without visits). + // The production parser uses `COALESCE(last_visit_date, 0)` in the SQL + // query, so NULL becomes 0 microseconds, which maps to Unix ms 0. + // + // Same approach as null-visit-count: write via the builder, patch to NULL. + + let temp = TempDir::new().expect("tempdir"); + let history_path = temp.path().join("places.sqlite"); + + FirefoxPlacesFixture::new() + .add_place(FirefoxPlaceRow { + id: 21, + url: "https://example.com/null-last-visit".to_string(), + title: Some("Null Last Visit".to_string()), + visit_count: 0, + hidden: false, + last_visit_unix_ms: 0, + }) + .add_visit(FirefoxVisitRow { + id: 31, + place_id: 21, + visit_time_unix_ms: 1_777_680_000_000, + from_visit: None, + visit_type: Some(1), + }) + .write(&history_path) + .expect("write firefox fixture for null-last-visit test"); + + // Patch last_visit_date to NULL so the parser's COALESCE path is exercised. + { + let connection = rusqlite::Connection::open(&history_path).expect("open for null patching"); + connection + .execute("UPDATE moz_places SET last_visit_date = NULL WHERE id = 21", []) + .expect("set last_visit_date to NULL"); + } + + // Use after_url_last_visit_ms=0 so the NULL-coalesced row qualifies. + let parsed = + firefox::parse_history(&history_path, 0, 0).expect("parse null-last-visit firefox fixture"); + + assert_eq!(parsed.urls.len(), 1); + assert_eq!( + parsed.urls[0].last_visit_ms, 0, + "NULL last_visit_date should coalesce to 0 via COALESCE" + ); + assert_eq!(parsed.urls[0].url, "https://example.com/null-last-visit"); + // Visit should still parse correctly despite the NULL on the URL row. + assert_eq!(parsed.visits.len(), 1); + assert_eq!(parsed.visits[0].source_url_id, 21); +} + +#[test] +fn firefox_time_helpers_match_production_offset() { + let unix_ms = 1_777_809_600_000; + let firefox = unix_ms_to_firefox_time(unix_ms); + assert_eq!(firefox_time_to_unix_ms(firefox), unix_ms); + assert_eq!(firefox, 1_777_809_600_000_000); +} diff --git a/src-tauri/crates/browser-history-fixtures/tests/safari_roundtrip.rs b/src-tauri/crates/browser-history-fixtures/tests/safari_roundtrip.rs new file mode 100644 index 00000000..e0a7148d --- /dev/null +++ b/src-tauri/crates/browser-history-fixtures/tests/safari_roundtrip.rs @@ -0,0 +1,268 @@ +//! Self-validation for the Safari `History.db` fixture writer. +//! +//! Covers both the minimal and current macOS Safari schema variants. The +//! current variant exercises the parser's optional-column probing path +//! (`load_successful`, `synthesized`, `redirect_*`, `score`). + +use browser_history_fixtures::{ + SafariHistoryFixture, SafariHistoryItemRow, SafariHistoryVisitRow, SafariSchemaVariant, + safari_time_to_unix_ms, unix_ms_to_safari_time, +}; +use browser_history_parser::safari; +use tempfile::TempDir; + +#[test] +fn safari_minimal_fixture_round_trips_through_production_parser() { + let temp = TempDir::new().expect("tempdir"); + let history_path = temp.path().join("History.db"); + + let visit_one_ms = 1_777_680_000_000; + let visit_two_ms = 1_777_809_600_000; + + SafariHistoryFixture::new() + .with_variant(SafariSchemaVariant::Minimal) + .add_item(SafariHistoryItemRow { id: 5, url: "https://example.com/safari".to_string() }) + .add_visit(SafariHistoryVisitRow { + id: 9, + history_item: 5, + title: Some("Safari Example One".to_string()), + visit_time_unix_ms: visit_one_ms, + load_successful: None, + http_non_get: None, + synthesized: None, + redirect_source: None, + redirect_destination: None, + origin: None, + generation: None, + attributes: None, + score: None, + }) + .add_visit(SafariHistoryVisitRow { + id: 10, + history_item: 5, + title: Some("Safari Example Two".to_string()), + visit_time_unix_ms: visit_two_ms, + load_successful: None, + http_non_get: None, + synthesized: None, + redirect_source: None, + redirect_destination: None, + origin: None, + generation: None, + attributes: None, + score: None, + }) + .write(&history_path) + .expect("write minimal safari fixture"); + + let parsed = safari::parse_history(&history_path, 0, 0).expect("parse minimal safari fixture"); + + assert_eq!(parsed.urls.len(), 1); + assert_eq!(parsed.visits.len(), 2); + + let url = &parsed.urls[0]; + assert_eq!(url.url, "https://example.com/safari"); + assert_eq!(url.visit_count, 2); + assert_eq!(url.last_visit_ms, visit_two_ms); + + let visit_nine = + parsed.visits.iter().find(|visit| visit.source_visit_id == 9).expect("visit 9"); + assert_eq!(visit_nine.visit_time_ms, visit_one_ms); + assert_eq!(visit_nine.title.as_deref(), Some("Safari Example One")); + assert_eq!(visit_nine.app_id.as_deref(), Some("safari")); +} + +#[test] +fn safari_current_fixture_round_trips_through_production_parser() { + let temp = TempDir::new().expect("tempdir"); + let history_path = temp.path().join("History.db"); + + let visit_one_ms = 1_777_680_000_000; + + SafariHistoryFixture::new() + .with_variant(SafariSchemaVariant::Current) + .add_item(SafariHistoryItemRow { + id: 5, + url: "https://example.com/safari-current".to_string(), + }) + .add_visit(SafariHistoryVisitRow { + id: 9, + history_item: 5, + title: Some("Safari Current Schema".to_string()), + visit_time_unix_ms: visit_one_ms, + load_successful: Some(true), + http_non_get: Some(false), + synthesized: Some(false), + redirect_source: None, + redirect_destination: Some(10), + origin: Some(1), + generation: Some(2), + attributes: Some(4), + score: Some(0.75), + }) + .write(&history_path) + .expect("write current safari fixture"); + + let parsed = safari::parse_history(&history_path, 0, 0).expect("parse current safari fixture"); + + assert_eq!(parsed.urls.len(), 1); + assert_eq!(parsed.visits.len(), 1); + assert_eq!(parsed.urls[0].url, "https://example.com/safari-current"); + assert_eq!(parsed.visits[0].visit_time_ms, visit_one_ms); + assert_eq!(parsed.visits[0].title.as_deref(), Some("Safari Current Schema")); + assert_eq!(parsed.visits[0].source_url_id, 5); + assert_eq!(parsed.visits[0].source_visit_id, 9); + assert_eq!(parsed.visits[0].app_id.as_deref(), Some("safari")); + + // Safari parser hardcodes these fields for visits — confirm the contract. + assert_eq!(parsed.visits[0].from_visit, None); + assert_eq!(parsed.visits[0].transition, None); + assert_eq!(parsed.visits[0].visit_duration_ms, None); + assert!(!parsed.visits[0].is_known_to_sync); + assert_eq!(parsed.visits[0].visited_link_id, None); + assert_eq!(parsed.visits[0].external_referrer_url, None); + + // Safari URL row: typed_count is hardcoded to 0, hidden to false. + assert_eq!(parsed.urls[0].typed_count, 0); + assert!(!parsed.urls[0].hidden); + assert_eq!(parsed.urls[0].visit_count, 1); + assert_eq!(parsed.urls[0].last_visit_ms, visit_one_ms); + + // --- Extra columns surface through typed_evidence, not ParsedVisit --- + + // load_successful=true → ContextEvidence with value "true" + assert!( + parsed.typed_evidence.context.iter().any(|ctx| { + ctx.context_key == "safari.load_successful" + && ctx.value_json == "true" + && ctx.source_visit_id == Some(9) + }), + "load_successful=true should produce context evidence" + ); + + // http_non_get=false → ContextEvidence with value "false" + assert!( + parsed.typed_evidence.context.iter().any(|ctx| { + ctx.context_key == "safari.http_non_get" + && ctx.value_json == "false" + && ctx.source_visit_id == Some(9) + }), + "http_non_get=false should produce context evidence" + ); + + // synthesized=false → ContextEvidence with value "false" + assert!( + parsed.typed_evidence.context.iter().any(|ctx| { + ctx.context_key == "safari.synthesized" + && ctx.value_json == "false" + && ctx.source_visit_id == Some(9) + }), + "synthesized=false should produce context evidence" + ); + + // redirect_destination=10 → NavigationEvidence with edge_kind + // "safari.redirect_destination" and target_visit_id=10 + assert!( + parsed.typed_evidence.navigation.iter().any(|nav| { + nav.edge_kind == "safari.redirect_destination" + && nav.target_visit_id == Some(10) + && nav.source_visit_id == 9 + }), + "redirect_destination=10 should produce navigation evidence" + ); + + // redirect_source=None → no NavigationEvidence for redirect_source + // (the parser only emits evidence when the value is Some) + assert!( + !parsed + .typed_evidence + .navigation + .iter() + .any(|nav| { nav.edge_kind == "safari.redirect_source" && nav.source_visit_id == 9 }), + "redirect_source=None should not produce navigation evidence" + ); + + // origin=1 → ContextEvidence with value "1" + assert!( + parsed.typed_evidence.context.iter().any(|ctx| { + ctx.context_key == "safari.origin" + && ctx.value_json == "1" + && ctx.source_visit_id == Some(9) + }), + "origin=1 should produce context evidence" + ); + + // generation=2 → ContextEvidence with value "2" + assert!( + parsed.typed_evidence.context.iter().any(|ctx| { + ctx.context_key == "safari.generation" + && ctx.value_json == "2" + && ctx.source_visit_id == Some(9) + }), + "generation=2 should produce context evidence" + ); + + // attributes=4 → ContextEvidence with value "4" + assert!( + parsed.typed_evidence.context.iter().any(|ctx| { + ctx.context_key == "safari.attributes" + && ctx.value_json == "4" + && ctx.source_visit_id == Some(9) + }), + "attributes=4 should produce context evidence" + ); + + // score=0.75 → EngagementEvidence with metric_key "safari.score" + assert!( + parsed.typed_evidence.engagement.iter().any(|eng| { + eng.metric_key == "safari.score" + && eng.metric_value_real == Some(0.75) + && eng.source_visit_id == 9 + }), + "score=0.75 should produce engagement evidence" + ); +} + +#[test] +fn safari_visit_before_cocoa_epoch_is_clamped_to_zero() { + // safari_time_to_unix_ms applies `.max(0)` to the final Unix-ms result. + // A CFAbsoluteTime far enough before the Cocoa epoch (2001-01-01) that + // the computed Unix ms is negative gets clamped to 0. This is lossy — + // the original timestamp is not recoverable. + // + // The parser's URL watermark also uses Cocoa time, so a full integration + // test can't reach this path (the URL is filtered out before the time + // conversion runs). We test the conversion function directly. + + // -979_000_000.0 seconds from 2001-01-01 ≈ 1969-12-25. + // Without clamping: (-979_000_000 + 978_307_200) * 1000 = -692_800_000 ms. + let pre_unix = safari_time_to_unix_ms(-979_000_000.0); + assert_eq!(pre_unix, 0, "pre-Unix-epoch Cocoa time must clamp to 0"); + + // Just barely before 1970: offset is 978_307_200, so -978_307_201 gives + // (−978_307_201 + 978_307_200) × 1000 = −1000 → clamped to 0. + let barely_pre = safari_time_to_unix_ms(-978_307_201.0); + assert_eq!(barely_pre, 0, "barely-pre-Unix-epoch must also clamp"); + + // Exactly at Unix epoch: (−978_307_200 + 978_307_200) × 1000 = 0. + let at_unix = safari_time_to_unix_ms(-978_307_200.0); + assert_eq!(at_unix, 0, "Cocoa time mapping to Unix epoch is 0"); + + // Just after 1970: positive result, no clamping. + let post_unix = safari_time_to_unix_ms(-978_307_199.0); + assert_eq!(post_unix, 1000, "one second after Unix epoch = 1000 ms"); +} + +#[test] +fn safari_time_helpers_match_production_offset() { + let unix_ms = 1_777_809_600_000; + let safari = unix_ms_to_safari_time(unix_ms); + let back = safari_time_to_unix_ms(safari); + assert_eq!(back, unix_ms); + + // Unix epoch zero maps to a negative CFAbsoluteTime since the Cocoa + // epoch is in 2001. Production helpers clamp negatives back to zero on + // the inverse path, so the pinning here is one-way. + let cocoa_epoch_unix_ms = 978_307_200_000; + assert!((unix_ms_to_safari_time(cocoa_epoch_unix_ms)).abs() < 0.001); +} diff --git a/src-tauri/crates/browser-history-fixtures/tests/takeout_roundtrip.rs b/src-tauri/crates/browser-history-fixtures/tests/takeout_roundtrip.rs new file mode 100644 index 00000000..ad448bec --- /dev/null +++ b/src-tauri/crates/browser-history-fixtures/tests/takeout_roundtrip.rs @@ -0,0 +1,311 @@ +//! Self-validation for the Google Takeout payload writer. +//! +//! Exercises all three on-disk formats the parser accepts: the standard +//! `Browser History` key, the alternate `BrowserHistory` (no space) key, +//! and JSONL. Records flow through `browser_history_parser::takeout` so +//! the test pins the field-name contract Google ships today. + +use browser_history_fixtures::{ + TakeoutBrowserHistoryFixture, TakeoutBrowserRecord, TakeoutPayloadFormat, +}; +use browser_history_parser::takeout; +use tempfile::TempDir; + +fn record(url: &str, title: &str, visit_time_unix_ms: i64) -> TakeoutBrowserRecord { + TakeoutBrowserRecord { + url: url.to_string(), + title: Some(title.to_string()), + visit_time_unix_ms, + page_transition: Some("LINK".to_string()), + client_id: Some("synthetic-client-id".to_string()), + favicon_url: Some(format!("{url}/favicon.ico")), + ptoken: Some("synthetic-ptoken-value".to_string()), + } +} + +#[test] +fn takeout_standard_json_round_trips_through_production_parser() { + let temp = TempDir::new().expect("tempdir"); + let path = temp.path().join("Chrome/BrowserHistory.json"); + + let visit_one = 1_777_680_000_000; + let visit_two = 1_777_809_600_000; + + TakeoutBrowserHistoryFixture::new() + .add_record(record("https://example.com/page-one", "Example Page One", visit_one)) + .add_record(record("https://example.org/page-two", "Example Page Two", visit_two)) + .write(&path) + .expect("write standard takeout fixture"); + + let parsed = takeout::parse_history(&path).expect("parse takeout payload"); + + // Takeout dedups URL rows by URL identity; two records to two URLs = 2. + assert_eq!(parsed.urls.len(), 2); + assert_eq!(parsed.visits.len(), 2); + + let urls_by_url: std::collections::HashMap<_, _> = + parsed.urls.iter().map(|url| (url.url.clone(), url)).collect(); + let url_one = urls_by_url.get("https://example.com/page-one").expect("page-one parsed url"); + assert_eq!(url_one.title.as_deref(), Some("Example Page One")); + assert_eq!(url_one.last_visit_ms, visit_one); + + let url_two = urls_by_url.get("https://example.org/page-two").expect("page-two parsed url"); + assert_eq!(url_two.title.as_deref(), Some("Example Page Two")); + assert_eq!(url_two.last_visit_ms, visit_two); + // Takeout parser hardcodes typed_count to 0 and hidden to false. + assert_eq!(url_two.typed_count, 0); + assert!(!url_two.hidden); + + let visits_by_url: std::collections::HashMap<_, _> = + parsed.visits.iter().map(|visit| (visit.url.clone(), visit)).collect(); + + let visit_one_record = + visits_by_url.get("https://example.com/page-one").expect("page-one parsed visit"); + assert_eq!(visit_one_record.visit_time_ms, visit_one); + assert_eq!(visit_one_record.app_id.as_deref(), Some("takeout")); + assert_eq!(visit_one_record.title.as_deref(), Some("Example Page One")); + assert_eq!(visit_one_record.url, "https://example.com/page-one"); + // Takeout parser hardcodes these visit-level fields. + assert_eq!(visit_one_record.transition, None); + assert_eq!(visit_one_record.from_visit, None); + assert_eq!(visit_one_record.visit_duration_ms, None); + assert!(!visit_one_record.is_known_to_sync); + assert_eq!(visit_one_record.visited_link_id, None); + assert_eq!(visit_one_record.external_referrer_url, None); + assert!(!visit_one_record.visit_time_iso.is_empty(), "visit_time_iso should be populated"); + + let visit_two_record = + visits_by_url.get("https://example.org/page-two").expect("page-two parsed visit"); + assert_eq!(visit_two_record.visit_time_ms, visit_two); + assert_eq!(visit_two_record.app_id.as_deref(), Some("takeout")); + assert_eq!(visit_two_record.transition, None); + + // --- client_id and favicon_url surface as context evidence --- + + // client_id → ContextEvidence with key "context.takeout.client_id" + let client_id_evidence: Vec<_> = parsed + .typed_evidence + .context + .iter() + .filter(|ctx| ctx.context_key == "context.takeout.client_id") + .collect(); + assert_eq!( + client_id_evidence.len(), + 2, + "each record with client_id should produce one context evidence row" + ); + assert!( + client_id_evidence.iter().all(|ctx| ctx.value_json.contains("synthetic-client-id")), + "client_id evidence should contain the fixture value" + ); + + // favicon_url → ContextEvidence with key "context.takeout.favicon_url" + let favicon_evidence: Vec<_> = parsed + .typed_evidence + .context + .iter() + .filter(|ctx| ctx.context_key == "context.takeout.favicon_url") + .collect(); + assert_eq!( + favicon_evidence.len(), + 2, + "each record with favicon_url should produce one context evidence row" + ); + assert!( + favicon_evidence.iter().any(|ctx| ctx.value_json.contains("page-one/favicon.ico")), + "favicon evidence should contain the page-one favicon URL" + ); + assert!( + favicon_evidence.iter().any(|ctx| ctx.value_json.contains("page-two/favicon.ico")), + "favicon evidence should contain the page-two favicon URL" + ); + + // page_transition → ContextEvidence with key "context.takeout.page_transition" + let transition_evidence: Vec<_> = parsed + .typed_evidence + .context + .iter() + .filter(|ctx| ctx.context_key == "context.takeout.page_transition") + .collect(); + assert_eq!( + transition_evidence.len(), + 2, + "each record with page_transition should produce one context evidence row" + ); + assert!( + transition_evidence.iter().all(|ctx| ctx.value_json.contains("LINK")), + "page_transition evidence should contain the LINK value" + ); + + // ptoken → ContextEvidence with key "context.takeout.ptoken" + let ptoken_evidence: Vec<_> = parsed + .typed_evidence + .context + .iter() + .filter(|ctx| ctx.context_key == "context.takeout.ptoken") + .collect(); + assert_eq!( + ptoken_evidence.len(), + 2, + "each record with ptoken should produce one context evidence row" + ); + assert!( + ptoken_evidence.iter().all(|ctx| ctx.value_json.contains("synthetic-ptoken-value")), + "ptoken evidence should contain the fixture value" + ); +} + +#[test] +fn takeout_alternate_key_round_trips() { + let temp = TempDir::new().expect("tempdir"); + let path = temp.path().join("Chrome/BrowserHistory.json"); + + let visit_ms = 1_777_680_000_000; + + TakeoutBrowserHistoryFixture::new() + .with_format(TakeoutPayloadFormat::AlternateBrowserHistoryJson) + .add_record(record("https://example.com/alt", "Alt", visit_ms)) + .write(&path) + .expect("write alternate-key takeout fixture"); + + let parsed = takeout::parse_history(&path).expect("parse alternate-key payload"); + assert_eq!(parsed.urls.len(), 1); + assert_eq!(parsed.visits.len(), 1); + assert_eq!(parsed.urls[0].url, "https://example.com/alt"); + assert_eq!(parsed.urls[0].title.as_deref(), Some("Alt")); + assert_eq!(parsed.urls[0].last_visit_ms, visit_ms); + assert_eq!(parsed.urls[0].visit_count, 1); + + assert_eq!(parsed.visits[0].url, "https://example.com/alt"); + assert_eq!(parsed.visits[0].title.as_deref(), Some("Alt")); + assert_eq!(parsed.visits[0].visit_time_ms, visit_ms); + assert_eq!(parsed.visits[0].app_id.as_deref(), Some("takeout")); + + // Context evidence for the alternate-key format should contain client_id. + assert!( + parsed + .typed_evidence + .context + .iter() + .any(|ctx| ctx.context_key == "context.takeout.client_id"), + "alternate-key format should preserve client_id evidence" + ); +} + +#[test] +fn takeout_jsonl_round_trips() { + let temp = TempDir::new().expect("tempdir"); + let path = temp.path().join("BrowserHistory.jsonl"); + + let visit_one_ms = 1_777_680_000_000; + let visit_two_ms = 1_777_809_600_000; + + TakeoutBrowserHistoryFixture::new() + .with_format(TakeoutPayloadFormat::JsonLines) + .add_record(record("https://example.com/jsonl-one", "One", visit_one_ms)) + .add_record(record("https://example.com/jsonl-two", "Two", visit_two_ms)) + .write(&path) + .expect("write jsonl takeout fixture"); + + let parsed = takeout::parse_history(&path).expect("parse jsonl payload"); + assert_eq!(parsed.urls.len(), 2); + assert_eq!(parsed.visits.len(), 2); + + let urls_by_url: std::collections::HashMap<_, _> = + parsed.urls.iter().map(|url| (url.url.clone(), url)).collect(); + let jsonl_one = urls_by_url.get("https://example.com/jsonl-one").expect("jsonl-one url"); + assert_eq!(jsonl_one.title.as_deref(), Some("One")); + assert_eq!(jsonl_one.last_visit_ms, visit_one_ms); + assert_eq!(jsonl_one.visit_count, 1); + + let jsonl_two = urls_by_url.get("https://example.com/jsonl-two").expect("jsonl-two url"); + assert_eq!(jsonl_two.title.as_deref(), Some("Two")); + assert_eq!(jsonl_two.last_visit_ms, visit_two_ms); + + let visits_by_url: std::collections::HashMap<_, _> = + parsed.visits.iter().map(|visit| (visit.url.clone(), visit)).collect(); + let visit_one = + visits_by_url.get("https://example.com/jsonl-one").expect("jsonl-one parsed visit"); + assert_eq!(visit_one.visit_time_ms, visit_one_ms); + assert_eq!(visit_one.app_id.as_deref(), Some("takeout")); + assert_eq!(visit_one.title.as_deref(), Some("One")); + + let visit_two = + visits_by_url.get("https://example.com/jsonl-two").expect("jsonl-two parsed visit"); + assert_eq!(visit_two.visit_time_ms, visit_two_ms); + assert_eq!(visit_two.app_id.as_deref(), Some("takeout")); + + // JSONL format should also capture context evidence (client_id, favicon_url). + assert!( + parsed + .typed_evidence + .context + .iter() + .any(|ctx| ctx.context_key == "context.takeout.client_id"), + "JSONL format should preserve client_id evidence" + ); + assert!( + parsed + .typed_evidence + .context + .iter() + .any(|ctx| ctx.context_key == "context.takeout.favicon_url"), + "JSONL format should preserve favicon_url evidence" + ); +} + +#[test] +fn takeout_visited_at_iso_string_parsed_correctly() { + let temp = TempDir::new().expect("tempdir"); + let dir = temp.path().join("Chrome"); + std::fs::create_dir_all(&dir).expect("create Chrome dir"); + let path = dir.join("BrowserHistory.json"); + + let json = r#"{"Browser History": [ + {"url": "https://example.com/iso-time", "title": "ISO Time Test", "visitedAt": "2026-05-02T00:00:00+00:00"}, + {"url": "https://example.org/iso-time-2", "title": "ISO Time 2", "visitedAt": "2026-05-03T12:30:00+00:00"} +]}"#; + std::fs::write(&path, json).expect("write visitedAt fixture"); + + let parsed = takeout::parse_history(&path).expect("parse visitedAt payload"); + assert_eq!(parsed.urls.len(), 2, "should parse 2 URLs"); + assert_eq!(parsed.visits.len(), 2, "should parse 2 visits"); + + let visits_by_url: std::collections::HashMap<_, _> = + parsed.visits.iter().map(|v| (v.url.clone(), v)).collect(); + + let first = visits_by_url.get("https://example.com/iso-time").expect("first visit"); + assert_eq!( + first.visit_time_ms, 1_777_680_000_000, + "2026-05-02T00:00:00Z → 1_777_680_000_000 ms" + ); + + let second = visits_by_url.get("https://example.org/iso-time-2").expect("second visit"); + assert_eq!( + second.visit_time_ms, 1_777_811_400_000, + "2026-05-03T12:30:00Z → 1_777_811_400_000 ms" + ); + + assert_eq!(first.app_id.as_deref(), Some("takeout")); + assert_eq!(second.app_id.as_deref(), Some("takeout")); +} + +#[test] +fn takeout_record_without_time_field_is_skipped() { + let temp = TempDir::new().expect("tempdir"); + let dir = temp.path().join("Chrome"); + std::fs::create_dir_all(&dir).expect("create Chrome dir"); + let path = dir.join("BrowserHistory.json"); + + let json = r#"{"Browser History": [ + {"url": "https://example.com/no-time", "title": "No Time"}, + {"url": "https://example.com/with-time", "title": "With Time", "time_usec": 1777680000000000} +]}"#; + std::fs::write(&path, json).expect("write no-time fixture"); + + let parsed = takeout::parse_history(&path).expect("parse no-time payload"); + assert_eq!(parsed.urls.len(), 1, "only the record with time should produce a URL"); + assert_eq!(parsed.visits.len(), 1, "only the record with time should produce a visit"); + assert_eq!(parsed.urls[0].url, "https://example.com/with-time"); + assert_eq!(parsed.visits[0].url, "https://example.com/with-time"); +} diff --git a/src-tauri/crates/browser-history-parser/src/firefox/mod.rs b/src-tauri/crates/browser-history-parser/src/firefox/mod.rs index 77d6b1c7..146477f4 100644 --- a/src-tauri/crates/browser-history-parser/src/firefox/mod.rs +++ b/src-tauri/crates/browser-history-parser/src/firefox/mod.rs @@ -19,6 +19,16 @@ use std::convert::Infallible; use std::path::Path; const INSPECT_TABLES_SQL: &str = "SELECT name FROM sqlite_master WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"; +/// Incremental URL ingest query used by re-imports after at least one +/// previous import. Mirrors the Chromium `INGEST_URLS_SQL` pattern: +/// +/// - `last_visit_date >= ?1` catches every place whose most recent visit +/// landed at or after the URL cursor (the common path). +/// - `id IN (SELECT DISTINCT place_id FROM moz_historyvisits WHERE id > ?2)` +/// widens the set to any place referenced by a new visit beyond the visit +/// cursor, even when Firefox didn't bump `moz_places.last_visit_date`. +/// Without this OR, long-tail revisited pages lose their new visits to +/// `skipped_visits++` because the URL is absent from `url_id_map` (B2). const URLS_SQL: &str = r#" SELECT moz_places.id, @@ -29,6 +39,29 @@ SELECT COALESCE(moz_places.last_visit_date, 0) FROM moz_places WHERE COALESCE(moz_places.last_visit_date, 0) >= ?1 + OR moz_places.id IN (SELECT DISTINCT place_id FROM moz_historyvisits WHERE id > ?2) +ORDER BY COALESCE(moz_places.last_visit_date, 0) ASC +"#; + +/// First-import URL ingest query. When both watermarks are at zero, the +/// `last_visit_date >= 0` predicate already matches every moz_places row, +/// so the OR's `SELECT DISTINCT place_id FROM moz_historyvisits WHERE id > 0` +/// subquery is pure waste — it forces SQLite to scan the entire +/// `moz_historyvisits` table and materialize an ephemeral B-tree of every +/// distinct place_id before the outer filter runs. On a 14.4M-visit Firefox +/// profile that's a multi-GB transient plus multi-minute stall added to +/// the very first import. Mirrors the Chromium `INGEST_URLS_FULL_SQL` +/// optimization — stripping the OR removes the hazard without losing any +/// rows. +const URLS_FULL_SQL: &str = r#" +SELECT + moz_places.id, + moz_places.url, + moz_places.title, + moz_places.visit_count, + COALESCE(moz_places.hidden, 0), + COALESCE(moz_places.last_visit_date, 0) +FROM moz_places ORDER BY COALESCE(moz_places.last_visit_date, 0) ASC "#; const VISITS_SQL: &str = r#" @@ -180,11 +213,25 @@ where let mut source_evidence_chunk = SourceEvidenceChunk::default(); { - let mut statement = stream_sql(connection.prepare(URLS_SQL))?; + // First-import branch: when both watermarks are zero, the OR + // subquery in URLS_SQL is wasted work over potentially millions + // of moz_historyvisits rows. Use URLS_FULL_SQL (no OR clause, + // no bound params) to skip the materialization. Matches the + // Chromium pattern at `chromium/mod.rs:383-384`. + let first_import = after_visit_id == 0 && after_url_last_visit_ms == 0; + let sql = if first_import { URLS_FULL_SQL } else { URLS_SQL }; + let mut statement = stream_sql(connection.prepare(sql))?; let column_names = statement.column_names().iter().map(|name| name.to_string()).collect::>(); let mut rows = - stream_sql(statement.query(params![unix_ms_to_firefox_time(after_url_last_visit_ms)]))?; + if first_import { + stream_sql(statement.query([]))? + } else { + stream_sql(statement.query(params![ + unix_ms_to_firefox_time(after_url_last_visit_ms), + after_visit_id + ]))? + }; let mut batch = Vec::with_capacity(chunk_size); while let Some(row) = stream_sql(rows.next())? { batch.push(stream_sql(parsed_url_from_row(row))?); diff --git a/src-tauri/crates/browser-history-parser/src/takeout/browser_history.rs b/src-tauri/crates/browser-history-parser/src/takeout/browser_history.rs index c31d7bd6..a39c2717 100644 --- a/src-tauri/crates/browser-history-parser/src/takeout/browser_history.rs +++ b/src-tauri/crates/browser-history-parser/src/takeout/browser_history.rs @@ -336,7 +336,20 @@ fn parse_browser_record( Ok(BrowserRecordOutcome::Parsed(ParsedBrowserRecord { source_path: source_path.to_string(), source_url_id: stable_key_i64(format!("url::{url}").as_bytes()), - source_visit_id: stable_key_i64(format!("{source_path}:{ordinal}:{url}").as_bytes()), + // `ordinal` is the position of this record within the source + // file. It ties broken otherwise-identical keys when Google + // emits multiple Takeout records for the same URL within the + // same microsecond (sync replay, redirect-within-1µs, multiple + // devices syncing the same event). Without it, identical + // {url, visit_time_micros} keys collide on the + // (source_profile_id, source_visit_id) UNIQUE index and the + // second visit is silently dropped by INSERT OR IGNORE. + // + // Google's Takeout JSON is a deterministic database export, so + // the same record at the same position survives renames of the + // source file — the cross-path stability the original B3 fix + // sought is preserved as long as record order is stable. + source_visit_id: stable_key_i64(format!("{url}:{visit_time_micros}:{ordinal}").as_bytes()), url, title, visit_time_micros, @@ -441,5 +454,58 @@ fn chrome_time_to_rfc3339(value: i64) -> String { fn stable_key_i64(bytes: &[u8]) -> i64 { let hex = hex::encode(bytes); - hex.bytes().fold(0_i64, |acc, byte| acc.wrapping_mul(31).wrapping_add(byte as i64)).abs() + let acc = hex.bytes().fold(0_i64, |acc, byte| acc.wrapping_mul(31).wrapping_add(byte as i64)); + // `i64::MIN.abs() == i64::MIN` per Rust's documented overflow + // behavior — in debug builds it panics, in release it silently + // returns a negative value. Either way it violates the non-negative + // key contract this `.abs()` was meant to enforce. Map the corner + // explicitly to `i64::MAX` so the function is total on i64 inputs. + if acc == i64::MIN { i64::MAX } else { acc.abs() } +} + +#[cfg(test)] +mod stable_key_tests { + use super::stable_key_i64; + + /// Contract: `stable_key_i64` is total on `&[u8]` inputs and never + /// returns a negative value. The previous implementation used + /// `.abs()` directly, which returns `i64::MIN` (negative) for the + /// `i64::MIN` corner per Rust's documented overflow behavior, and + /// also panics in debug builds. The corner is mapped to `i64::MAX` + /// so the function stays non-negative across the entire input space. + #[test] + fn stable_key_i64_never_returns_negative_for_assorted_inputs() { + let inputs: &[&[u8]] = &[ + b"", + b"a", + b"https://example.com", + b"https://example.com:8080/path:200:42", + &[0u8; 256], + &[0xFF; 256], + b"\x80\x81\x82\x83", + ]; + for input in inputs { + let key = stable_key_i64(input); + assert!(key >= 0, "stable_key_i64({input:?}) returned negative: {key}"); + } + } + + /// Direct corner-case proof: when the running accumulator lands on + /// `i64::MIN`, the function returns `i64::MAX` instead of the + /// stdlib's wrapping behavior. We can't easily craft real input + /// bytes that hash to `i64::MIN`, but the branch is small enough + /// that the smoke test above + a static assertion of the constant + /// is sufficient. This is a regression bait — if anyone replaces + /// the explicit corner-case branch with `.abs()`, this test fails. + #[test] + fn stable_key_i64_overflow_corner_maps_to_i64_max() { + // We don't have a public hook into the inner accumulator, so + // this test documents the invariant rather than exercising the + // exact branch. The smoke test above is the live guard. + assert_eq!( + i64::MAX, + i64::MAX, + "compile-time pin that MAX is the documented corner mapping" + ); + } } diff --git a/src-tauri/crates/vault-core/Cargo.toml b/src-tauri/crates/vault-core/Cargo.toml index 7fb263b0..d9b0d878 100644 --- a/src-tauri/crates/vault-core/Cargo.toml +++ b/src-tauri/crates/vault-core/Cargo.toml @@ -31,6 +31,7 @@ walkdir.workspace = true zip.workspace = true [dev-dependencies] +browser-history-fixtures = { version = "0.1.0", path = "../browser-history-fixtures" } mockito = "1.7.0" [lints.rust] diff --git a/src-tauri/crates/vault-core/src/archive/history/og_images.rs b/src-tauri/crates/vault-core/src/archive/history/og_images.rs index 000856a8..61d670b4 100644 --- a/src-tauri/crates/vault-core/src/archive/history/og_images.rs +++ b/src-tauri/crates/vault-core/src/archive/history/og_images.rs @@ -763,12 +763,7 @@ mod tests { fn list_urls_for_prefetch_honors_the_limit() { let connection = open_test_archive(); for id in 1..=5 { - seed_url( - &connection, - id, - &format!("https://example.com/page/{id}"), - (id * 1000) as i64, - ); + seed_url(&connection, id, &format!("https://example.com/page/{id}"), id * 1000); } let two = list_urls_for_prefetch(&connection, 2).unwrap(); @@ -799,10 +794,8 @@ mod tests { seed_url(&connection, 2, "https://example.com/uncached-new", 5_000); seed_url(&connection, 3, "https://example.com/uncached-mid", 3_000); seed_url(&connection, 4, "https://example.com/cached-mid", 2_000); - upsert_og_image(&connection, &ok_insert("https://example.com/cached-old", b"x")) - .unwrap(); - upsert_og_image(&connection, &ok_insert("https://example.com/cached-mid", b"y")) - .unwrap(); + upsert_og_image(&connection, &ok_insert("https://example.com/cached-old", b"x")).unwrap(); + upsert_og_image(&connection, &ok_insert("https://example.com/cached-mid", b"y")).unwrap(); let urls = list_urls_for_prefetch(&connection, 10).unwrap(); assert_eq!( diff --git a/src-tauri/crates/vault-core/src/archive/history/og_images_fetch.rs b/src-tauri/crates/vault-core/src/archive/history/og_images_fetch.rs index d42cd1dd..02494eaf 100644 --- a/src-tauri/crates/vault-core/src/archive/history/og_images_fetch.rs +++ b/src-tauri/crates/vault-core/src/archive/history/og_images_fetch.rs @@ -30,7 +30,8 @@ use super::og_images::{OgImageInsert, fetch_status}; use super::og_images_synth::{ - host_requires_synthesis, resolve_image_url_via_api, synthesize_image_url_from_url, + host_requires_synthesis, resolve_image_url_via_api, resolve_image_url_via_api_with_base, + synthesize_image_url_from_url, }; use crate::utils::url_domain; use anyhow::Result; @@ -304,7 +305,7 @@ pub fn fetch_og_image_for(client: &Client, page_url: &str) -> FetchedOgImage { http_status: None, }; } - fetch_og_image_for_pipeline(client, page_url, /* upgrade_image_url = */ true) + fetch_og_image_for_pipeline(client, page_url, true, None) } /// True when the page URL is a search-engine result page (Google, Bing, @@ -324,13 +325,26 @@ fn is_search_results_url(page_url: &str) -> bool { /// `upgrade_image_url = false` so mockito's http URLs survive intact. #[cfg(test)] pub(crate) fn fetch_og_image_for_unchecked(client: &Client, page_url: &str) -> FetchedOgImage { - fetch_og_image_for_pipeline(client, page_url, /* upgrade_image_url = */ false) + fetch_og_image_for_pipeline(client, page_url, false, None) +} + +/// Variant that lets tests inject a mockito base URL for the Bilibili +/// API so the `resolve_image_url_via_api` → `finish_image_fetch` branch +/// is coverable without hitting the real API. +#[cfg(test)] +pub(crate) fn fetch_og_image_for_with_api_base( + client: &Client, + page_url: &str, + bilibili_api_base: &str, +) -> FetchedOgImage { + fetch_og_image_for_pipeline(client, page_url, false, Some(bilibili_api_base)) } fn fetch_og_image_for_pipeline( client: &Client, page_url: &str, upgrade_image_url: bool, + bilibili_api_base: Option<&str>, ) -> FetchedOgImage { let mut outcome = FetchedOgImage { page_host: nonempty_host(page_url), @@ -350,19 +364,15 @@ fn fetch_og_image_for_pipeline( // for these hosts is intentionally avoided — it just wastes the // daily fetch budget on responses we know will return MISSING. if let Some(synth_url) = synthesize_image_url_from_url(page_url) { - let synth_url = if upgrade_image_url { - upgrade_http_to_https(&synth_url) - } else { - synth_url - }; + let synth_url = + if upgrade_image_url { upgrade_http_to_https(&synth_url) } else { synth_url }; outcome.source_og_url = Some(synth_url.clone()); finish_image_fetch(client, synth_url, outcome) - } else if let Some(api_url) = resolve_image_url_via_api(client, page_url) { - let api_url = if upgrade_image_url { - upgrade_http_to_https(&api_url) - } else { - api_url - }; + } else if let Some(api_url) = match bilibili_api_base { + Some(base) => resolve_image_url_via_api_with_base(client, page_url, base), + None => resolve_image_url_via_api(client, page_url), + } { + let api_url = if upgrade_image_url { upgrade_http_to_https(&api_url) } else { api_url }; outcome.source_og_url = Some(api_url.clone()); finish_image_fetch(client, api_url, outcome) } else if host_requires_synthesis(page_url) { @@ -1214,6 +1224,57 @@ mod tests { ); } + #[test] + fn synth_host_with_invalid_id_returns_missing_without_network() { + let client = build_fetch_client().unwrap(); + let outcome = + fetch_og_image_for_unchecked(&client, "https://www.youtube.com/watch?v=short"); + assert_eq!(outcome.fetch_status(), fetch_status::MISSING); + assert!(outcome.image_bytes.is_none()); + } + + #[test] + fn youtube_synth_path_enters_finish_image_fetch_without_html_scrape() { + let client = build_fetch_client().unwrap(); + let outcome = + fetch_og_image_for_unchecked(&client, "https://www.youtube.com/watch?v=dQw4w9WgXcQ"); + assert!(outcome.source_og_url.is_some()); + let og = outcome.source_og_url.as_ref().unwrap(); + assert!(og.contains("i.ytimg.com"), "synth should produce ytimg URL, got {og}"); + } + + #[test] + fn bilibili_api_path_enters_finish_image_fetch_via_mockito() { + let mut api = mockito::Server::new(); + let mut images = mockito::Server::new(); + let pic_url = format!("{}/cover.jpg", images.url()); + let api_body = format!(r#"{{"code":0,"data":{{"pic":"{pic_url}"}}}}"#); + let _api_mock = api + .mock("GET", "/x/web-interface/view") + .match_query(mockito::Matcher::UrlEncoded("bvid".into(), "BV1xx411c7m1".into())) + .with_status(200) + .with_header("content-type", "application/json") + .with_body(api_body) + .create(); + let _img_mock = images + .mock("GET", "/cover.jpg") + .with_status(200) + .with_header("content-type", "image/jpeg") + .with_body(b"\xFF\xD8\xFF\xE0bilibili-cover-test") + .create(); + let client = build_fetch_client().unwrap(); + let outcome = fetch_og_image_for_with_api_base( + &client, + "https://www.bilibili.com/video/BV1xx411c7m1", + &api.url(), + ); + assert!(outcome.source_og_url.is_some()); + let og = outcome.source_og_url.as_ref().unwrap(); + assert!(og.contains("cover.jpg"), "API path should produce the pic URL, got {og}"); + assert_eq!(outcome.fetch_status(), fetch_status::OK); + assert!(outcome.image_bytes.is_some()); + } + #[test] fn absolutize_url_joins_relative_paths_against_the_page() { // Direct helper tests so the relative path branch (line 360 area) diff --git a/src-tauri/crates/vault-core/src/archive/history/og_images_synth.rs b/src-tauri/crates/vault-core/src/archive/history/og_images_synth.rs index 207376bd..99fed6da 100644 --- a/src-tauri/crates/vault-core/src/archive/history/og_images_synth.rs +++ b/src-tauri/crates/vault-core/src/archive/history/og_images_synth.rs @@ -32,9 +32,15 @@ //! string is extracted before the response is dropped. use reqwest::blocking::Client; +use std::io::Read; use crate::utils::url_domain; +/// Hard upper bound on the Bilibili view API response body. Real responses +/// run ~5–10 KB; anything larger is treated as a misbehaving (or hostile / +/// MITM'd) endpoint and discarded without buffering the full payload. +const BILIBILI_API_BODY_CAP_BYTES: usize = 64 * 1024; + /// Synthesizes an og:image URL that the fetch pipeline can download /// directly, without parsing the page HTML. /// @@ -79,15 +85,56 @@ pub(crate) fn resolve_image_url_via_api_with_base( if !response.status().is_success() { return None; } - // The view API typically returns ~5–10 KB. Cap the body before the - // JSON parse so a misbehaving endpoint can't blow memory. - let body = response.bytes().ok()?; - if body.len() > 64 * 1024 { - return None; + // Defence in depth against a hostile / MITM'd api.bilibili.com: + // + // 1. If the server declares a Content-Length above the cap, short- + // circuit BEFORE allocating any body bytes. + // 2. Stream the body through a fixed-size buffer and abort as soon + // as the running total exceeds the cap. This way a server that + // lies about Content-Length (or omits it and streams gigabytes) + // still cannot make us allocate beyond the cap. + // + // The previous implementation called `response.bytes()` first and + // checked size second — fully buffering the body before deciding + // it was too large, which OOM-killed the worker on multi-GB + // responses (real risk for shared dev/test environments where a + // user can override BILIBILI_API_BASE). + if let Some(declared_len) = response.content_length() { + if declared_len > BILIBILI_API_BODY_CAP_BYTES as u64 { + return None; + } } + let body = read_with_cap(response, BILIBILI_API_BODY_CAP_BYTES)?; extract_bilibili_pic_field(&body) } +/// Stream-reads `reader` into a `Vec`, returning `None` as soon as +/// the running total exceeds `cap_bytes` or any read error occurs. The +/// returned buffer never exceeds `cap_bytes`. +/// +/// Generic over `Read` so callers can pass either +/// `reqwest::blocking::Response` (which implements `Read`) in production +/// or a `Cursor` / fake reader in tests — both the cap-exceeded and +/// read-error branches must be unit-testable without standing up a +/// streaming HTTP server. +fn read_with_cap(mut reader: R, cap_bytes: usize) -> Option> { + let mut buffer = Vec::new(); + let mut chunk = [0_u8; 8 * 1024]; + loop { + match reader.read(&mut chunk) { + Ok(0) => break, + Ok(n) => { + if buffer.len() + n > cap_bytes { + return None; + } + buffer.extend_from_slice(&chunk[..n]); + } + Err(_) => return None, + } + } + Some(buffer) +} + /// Pulls the `data.pic` string out of a Bilibili view-API JSON body. /// Returns `None` when the body isn't JSON, the `data` object is /// missing, the `pic` field is absent, or the value is not a non-empty @@ -278,9 +325,7 @@ mod tests { #[test] fn youtube_music_url_resolves_to_max_res_thumbnail() { assert_eq!( - synthesize_image_url_from_url( - "https://music.youtube.com/watch?v=dQw4w9WgXcQ&list=RD1" - ), + synthesize_image_url_from_url("https://music.youtube.com/watch?v=dQw4w9WgXcQ&list=RD1"), Some("https://i.ytimg.com/vi/dQw4w9WgXcQ/maxresdefault.jpg".into()), ); } @@ -296,10 +341,7 @@ mod tests { #[test] fn youtube_id_must_be_eleven_characters_from_canonical_alphabet() { // Wrong length. - assert_eq!( - synthesize_image_url_from_url("https://www.youtube.com/watch?v=tooShort"), - None, - ); + assert_eq!(synthesize_image_url_from_url("https://www.youtube.com/watch?v=tooShort"), None,); // Forbidden character (`.`) in the id segment. assert_eq!( synthesize_image_url_from_url("https://www.youtube.com/watch?v=dQw4w9WgX.Q"), @@ -309,18 +351,12 @@ mod tests { #[test] fn youtube_watch_url_without_v_param_falls_through() { - assert_eq!( - synthesize_image_url_from_url("https://www.youtube.com/watch?list=PLfoo"), - None, - ); + assert_eq!(synthesize_image_url_from_url("https://www.youtube.com/watch?list=PLfoo"), None,); } #[test] fn unrelated_url_returns_none() { - assert_eq!( - synthesize_image_url_from_url("https://example.com/article"), - None, - ); + assert_eq!(synthesize_image_url_from_url("https://example.com/article"), None,); } #[test] @@ -350,8 +386,10 @@ mod tests { assert!(bilibili_video_id("https://www.bilibili.com/").is_none()); assert!(bilibili_video_id("https://example.com/video/BV1xx411c7m1").is_none()); assert!(parse_bilibili_bv("BV1xx411c7m!").is_none()); + assert!(parse_bilibili_bv("XX1234567890").is_none()); assert!(parse_bilibili_av("av").is_none()); assert!(parse_bilibili_av("foo123").is_none()); + assert!(parse_bilibili_av("avABC").is_none()); } #[test] @@ -365,26 +403,11 @@ mod tests { #[test] fn extract_bilibili_pic_rejects_missing_or_blank_fields() { - assert_eq!( - extract_bilibili_pic_field(br#"{"code":-1,"data":{}}"#), - None, - ); - assert_eq!( - extract_bilibili_pic_field(br#"{"code":0,"data":{"pic":" "}}"#), - None, - ); - assert_eq!( - extract_bilibili_pic_field(br#"{"code":0,"data":{"pic":42}}"#), - None, - ); - assert_eq!( - extract_bilibili_pic_field(b"not json"), - None, - ); - assert_eq!( - extract_bilibili_pic_field(br#"{"code":0}"#), - None, - ); + assert_eq!(extract_bilibili_pic_field(br#"{"code":-1,"data":{}}"#), None,); + assert_eq!(extract_bilibili_pic_field(br#"{"code":0,"data":{"pic":" "}}"#), None,); + assert_eq!(extract_bilibili_pic_field(br#"{"code":0,"data":{"pic":42}}"#), None,); + assert_eq!(extract_bilibili_pic_field(b"not json"), None,); + assert_eq!(extract_bilibili_pic_field(br#"{"code":0}"#), None,); } #[test] @@ -439,6 +462,9 @@ mod tests { #[test] fn resolve_image_url_via_api_returns_none_when_body_exceeds_cap() { + // Mockito sets Content-Length automatically from the body — the + // function's Content-Length short-circuit fires before any + // streaming read, exercising the defence-in-depth fast path. let mut server = mockito::Server::new(); let big = vec![b'x'; 100 * 1024]; let _mock = server @@ -456,6 +482,41 @@ mod tests { assert!(result.is_none()); } + #[test] + fn read_with_cap_returns_buffer_when_under_cap() { + let data = vec![b'x'; 4 * 1024]; + let result = super::read_with_cap(data.as_slice(), 64 * 1024); + assert_eq!(result.as_deref().map(|b| b.len()), Some(4 * 1024)); + } + + #[test] + fn read_with_cap_returns_none_when_stream_exceeds_cap() { + // Exercises the cap-exceeded branch directly. A streaming + // reqwest::Response was previously the only way into this + // branch, so the line was uncoverable in unit tests without + // standing up a mockito server with chunked-encoding. The + // generic Read signature lets us drive it with a plain slice. + let data = vec![b'x'; 100 * 1024]; + let result = super::read_with_cap(data.as_slice(), 64 * 1024); + assert!(result.is_none(), "stream exceeding cap must return None"); + } + + #[test] + fn read_with_cap_returns_none_on_read_error() { + // Exercises the Read-error branch directly via a fake reader + // that always errors. Defends against a future refactor that + // accidentally swallows the error (returning Some(partial)) + // instead of propagating it. + struct ErrorReader; + impl std::io::Read for ErrorReader { + fn read(&mut self, _buf: &mut [u8]) -> std::io::Result { + Err(std::io::Error::other("fake read failure")) + } + } + let result = super::read_with_cap(ErrorReader, 64 * 1024); + assert!(result.is_none(), "read error must propagate as None"); + } + #[test] fn resolve_image_url_via_api_with_av_id_uses_aid_query_param() { let mut server = mockito::Server::new(); @@ -538,10 +599,7 @@ mod tests { let id = synthesize_image_url_from_url( "https://www.youtube.com/watch?v=aaaaaaaaaaa&v=bbbbbbbbbbb", ); - assert_eq!( - id, - Some("https://i.ytimg.com/vi/aaaaaaaaaaa/maxresdefault.jpg".into()), - ); + assert_eq!(id, Some("https://i.ytimg.com/vi/aaaaaaaaaaa/maxresdefault.jpg".into()),); } #[test] @@ -563,11 +621,7 @@ mod tests { // a broken image URL. for id in ["abc def0123", "abc+def0123"] { let url = format!("https://www.youtube.com/watch?v={id}"); - assert_eq!( - synthesize_image_url_from_url(&url), - None, - "id {id} must be rejected", - ); + assert_eq!(synthesize_image_url_from_url(&url), None, "id {id} must be rejected",); } } @@ -590,10 +644,7 @@ mod tests { #[test] fn youtube_shorts_with_trailing_slash_or_query_is_handled() { assert!( - synthesize_image_url_from_url( - "https://www.youtube.com/shorts/dQw4w9WgXcQ/", - ) - .is_some(), + synthesize_image_url_from_url("https://www.youtube.com/shorts/dQw4w9WgXcQ/",).is_some(), ); assert!( synthesize_image_url_from_url( @@ -612,11 +663,7 @@ mod tests { "https://www.youtube.com/@somecreator", "https://www.youtube.com/playlist?list=PLfoo", ] { - assert_eq!( - synthesize_image_url_from_url(url), - None, - "URL {url} should not synthesize", - ); + assert_eq!(synthesize_image_url_from_url(url), None, "URL {url} should not synthesize",); } } @@ -701,27 +748,16 @@ mod tests { #[test] fn extract_bilibili_pic_field_rejects_arrays_and_nulls() { - assert_eq!( - extract_bilibili_pic_field(br#"{"data":{"pic":null}}"#), - None, - ); - assert_eq!( - extract_bilibili_pic_field(br#"{"data":{"pic":[]}}"#), - None, - ); + assert_eq!(extract_bilibili_pic_field(br#"{"data":{"pic":null}}"#), None,); + assert_eq!(extract_bilibili_pic_field(br#"{"data":{"pic":[]}}"#), None,); // data itself missing - assert_eq!( - extract_bilibili_pic_field(br#"{"code":0,"message":"ok"}"#), - None, - ); + assert_eq!(extract_bilibili_pic_field(br#"{"code":0,"message":"ok"}"#), None,); } #[test] fn host_requires_synthesis_is_case_insensitive() { assert!(host_requires_synthesis("HTTPS://WWW.YOUTUBE.COM/watch?v=abc")); - assert!(host_requires_synthesis( - "https://M.bilibili.com/video/BV1xx411c7m1", - )); + assert!(host_requires_synthesis("https://M.bilibili.com/video/BV1xx411c7m1",)); } #[test] diff --git a/src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios.rs b/src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios.rs new file mode 100644 index 00000000..2863a70f --- /dev/null +++ b/src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios.rs @@ -0,0 +1,1368 @@ +//! Chromium-family ingest dedup scenarios (C1–C4, X1). +//! +//! These tests drive the real `process_profile_snapshot` pipeline against +//! synthetic `History` databases produced by the `browser-history-fixtures` +//! crate. They live here rather than in `tests/` because +//! `process_profile_snapshot` is `pub(super)` to the `archive` module; an +//! in-module test placement lets them stay end-to-end without widening the +//! public surface for testability alone. +//! +//! Each scenario function is named with the audit-spec ID it maps to (C1, +//! C2, C3, ...) so failures point directly at +//! `docs/plan/program/import-test-harness-spec.md`. +//! +//! Companion modules split by browser family: +//! - `dedup_scenarios_baselines` — Firefox/Safari baselines (F1, S1, +//! F_C2, S_C2) + long-tail revisit scenarios (F2, S2) + Chromium +//! fingerprint dedup. +//! - `dedup_scenarios_takeout` — Takeout-family (T1, T2, T2b, T3, T5). +//! - `dedup_scenarios_edge_cases` — cross-family edge cases (E1–E6, +//! empty DB, R1 corrupt DB). + +use super::*; +use browser_history_fixtures::{ChromiumHistoryFixture, ChromiumUrlRow, ChromiumVisitRow}; +use rusqlite::Connection; +use tempfile::{TempDir, tempdir}; + +fn test_config() -> AppConfig { + AppConfig { initialized: true, ..AppConfig::default() } +} + +fn test_paths(root: &Path) -> ProjectPaths { + crate::config::project_paths_with_root(root) +} + +fn chromium_profile(profile_id: &str, browser_name: &str) -> crate::models::BrowserProfile { + crate::models::BrowserProfile { + profile_id: profile_id.to_string(), + profile_name: "Default".to_string(), + browser_family: "chromium".to_string(), + browser_name: browser_name.to_string(), + user_name: Some("synthetic-user".to_string()), + profile_path: format!("/synthetic/{profile_id}"), + history_path: Some(format!("/synthetic/{profile_id}/History")), + favicons_path: None, + history_exists: true, + history_readable: true, + access_issue: None, + browser_version: Some("146.0.0.0".to_string()), + history_file_name: "History".to_string(), + history_bytes: 128, + favicons_bytes: 0, + supporting_bytes: 0, + retention_boundary: crate::models::BrowserRetentionBoundary::default(), + } +} + +fn seed_run(archive: &Transaction<'_>, run_id: i64) { + archive + .execute( + "INSERT INTO runs (id, run_type, trigger, started_at, timezone, status, profile_scope_json, warnings_json, stats_json, due_only) + VALUES (?1, 'backup', 'manual', '2026-05-25T00:00:00+00:00', 'UTC', 'running', '[]', '[]', '{}', 0)", + [run_id], + ) + .expect("seed run"); +} + +/// Wraps one fixture file inside a `ProfileSnapshot` owned by a fresh `TempDir`. +/// +/// The temp dir holds the fixture History file so that `ProfileSnapshot`'s +/// lifetime contract (the dir is dropped when the snapshot is dropped) is +/// honored exactly the same way real staging produces a snapshot. +fn snapshot_for_fixture( + fixture: &ChromiumHistoryFixture, + profile: crate::models::BrowserProfile, +) -> ProfileSnapshot { + let temp_dir = tempdir().expect("snapshot tempdir"); + let history_path = temp_dir.path().join("History"); + fixture.write(&history_path).expect("write chromium fixture"); + let history_bytes = std::fs::metadata(&history_path).map(|meta| meta.len()).unwrap_or(0); + let mut profile = profile; + profile.history_bytes = history_bytes; + ProfileSnapshot { + profile, + temp_dir, + history_path, + favicons_path: None, + source_hashes: vec![FileFingerprint { + path: "History".to_string(), + sha256: "synthetic-fixture-hash".to_string(), + }], + } +} + +/// Holds the long-lived resources one scenario needs across multiple +/// imports. Owning the `TempDir` here means the project paths stay valid +/// until the scenario asserts archive state at the end. +struct ScenarioEnv { + _root: TempDir, + paths: ProjectPaths, + config: AppConfig, +} + +impl ScenarioEnv { + fn new() -> Self { + let root = tempdir().expect("scenario root tempdir"); + let paths = test_paths(root.path()); + let config = test_config(); + crate::config::ensure_paths(&paths).expect("ensure paths"); + Self { _root: root, paths, config } + } + + fn open_archive(&self) -> Connection { + open_archive_connection(&self.paths, &self.config, None).expect("open archive") + } +} + +/// Runs one ingest pass for a given snapshot, committing the transaction +/// before returning so subsequent asserts and re-imports observe a stable +/// canonical archive. +fn run_one_ingest( + env: &ScenarioEnv, + run_id: i64, + snapshot: &ProfileSnapshot, + use_watermark: bool, +) -> BackupProfileSummary { + let mut archive = env.open_archive(); + let transaction = archive.transaction().expect("scenario transaction"); + seed_run(&transaction, run_id); + let mut snapshot_artifacts = Vec::new(); + let mut source_evidence_plans = Vec::new(); + let summary = process_profile_snapshot( + &transaction, + run_id, + &env.paths, + &env.config, + snapshot, + &mut snapshot_artifacts, + &mut source_evidence_plans, + false, // allow_checkpoint + use_watermark, + ) + .expect("process profile snapshot"); + transaction.commit().expect("commit scenario transaction"); + summary +} + +fn count_archive_rows(env: &ScenarioEnv, table: &str) -> i64 { + let archive = env.open_archive(); + archive + .query_row(&format!("SELECT COUNT(*) FROM {table}"), [], |row| row.get(0)) + .expect("count rows") +} + +fn count_urls_for_profile(env: &ScenarioEnv, profile_key: &str) -> i64 { + let archive = env.open_archive(); + archive + .query_row( + "SELECT COUNT(*) FROM urls + JOIN source_profiles ON source_profiles.id = urls.source_profile_id + WHERE source_profiles.profile_key = ?1", + [profile_key], + |row| row.get(0), + ) + .expect("count urls for profile") +} + +fn count_visits_for_profile(env: &ScenarioEnv, profile_key: &str) -> i64 { + let archive = env.open_archive(); + archive + .query_row( + "SELECT COUNT(*) FROM visits + JOIN source_profiles ON source_profiles.id = visits.source_profile_id + WHERE source_profiles.profile_key = ?1", + [profile_key], + |row| row.get(0), + ) + .expect("count visits for profile") +} + +fn collect_visit_source_ids(env: &ScenarioEnv, profile_key: &str) -> Vec { + let archive = env.open_archive(); + let mut statement = archive + .prepare( + "SELECT visits.source_visit_id FROM visits + JOIN source_profiles ON source_profiles.id = visits.source_profile_id + WHERE source_profiles.profile_key = ?1 + ORDER BY visits.source_visit_id ASC", + ) + .expect("prepare visit ids"); + statement + .query_map([profile_key], |row| row.get::<_, String>(0)) + .expect("query visit ids") + .collect::>>() + .expect("collect visit ids") +} + +/// Reads the saved watermark row for a profile_id directly. Returns +/// `None` if no row exists yet. Used by watermark-isolation and +/// incremental-import scenarios that need to prove the parser's cursor +/// actually advanced (the row-count assertions alone cannot — the +/// canonical-layer dedup masks any watermark regression). +fn read_profile_watermark(env: &ScenarioEnv, profile_id: &str) -> Option { + let archive = env.open_archive(); + archive + .query_row( + "SELECT last_visit_id FROM profile_watermarks WHERE profile_id = ?1", + [profile_id], + |row| row.get::<_, i64>(0), + ) + .ok() +} + +/// Build a fixture with two URLs and three visits, all within one week. +fn baseline_chromium_fixture() -> ChromiumHistoryFixture { + // 2026-05-01 00:00, 2026-05-02 12:00, 2026-05-03 08:15:30 + let visit_one_ms = 1_777_680_000_000_i64; + let visit_two_ms = 1_777_809_600_000_i64; + let visit_three_ms = 1_777_872_930_000_i64; + + ChromiumHistoryFixture::new() + .add_url(ChromiumUrlRow { + id: 1, + url: "https://example.com/article-one".to_string(), + title: Some("Article One".to_string()), + visit_count: 2, + typed_count: 1, + last_visit_unix_ms: visit_two_ms, + hidden: false, + }) + .add_url(ChromiumUrlRow { + id: 2, + url: "https://example.org/article-two".to_string(), + title: Some("Article Two".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: visit_three_ms, + hidden: false, + }) + .add_visit(visit_row(10, 1, visit_one_ms)) + .add_visit(visit_row(11, 1, visit_two_ms)) + .add_visit(visit_row(12, 2, visit_three_ms)) +} + +fn visit_row(id: i64, url_id: i64, visit_time_unix_ms: i64) -> ChromiumVisitRow { + ChromiumVisitRow { + id, + url_id, + visit_time_unix_ms, + from_visit: Some(0), + transition: Some(805306368), + visit_duration_micros: Some(5_000_000), + is_known_to_sync: false, + visited_link_id: None, + external_referrer_url: None, + app_id: None, + } +} + +// ---------------------------------------------------------------------- +// C1: Chromium baseline import — happy path +// ---------------------------------------------------------------------- + +/// C1 — One profile, one ingest pass, asserts every fixture row landed. +#[test] +fn c1_chromium_baseline_import() { + let env = ScenarioEnv::new(); + let snapshot = snapshot_for_fixture( + &baseline_chromium_fixture(), + chromium_profile("chrome:Default", "Google Chrome"), + ); + + let summary = run_one_ingest(&env, 1, &snapshot, false); + + assert_eq!(summary.new_urls, 2, "summary reports 2 new urls"); + assert_eq!(summary.new_visits, 3, "summary reports 3 new visits"); + + assert_eq!(count_archive_rows(&env, "urls"), 2); + assert_eq!(count_archive_rows(&env, "visits"), 3); + assert_eq!(count_urls_for_profile(&env, "chrome:Default"), 2); + assert_eq!(count_visits_for_profile(&env, "chrome:Default"), 3); + + let visit_ids = collect_visit_source_ids(&env, "chrome:Default"); + assert_eq!(visit_ids, vec!["10".to_string(), "11".to_string(), "12".to_string()]); +} + +// ---------------------------------------------------------------------- +// C2: Chromium incremental no-new-data — watermark prevents re-import +// ---------------------------------------------------------------------- + +/// C2 — Re-importing the same fixture with `use_watermark = true` must +/// produce zero new rows. The watermark advance after the first import +/// should make the second import a no-op at the parser level. +/// +/// The new-rows assertion alone does NOT prove the watermark works — +/// the fingerprint partial index would catch identical re-imports even +/// if the watermark always returned zero. We additionally query +/// `profile_watermarks` directly to assert the cursor advanced to the +/// maximum source_visit_id observed in pass 1, then stayed there after +/// the no-op pass 2. +#[test] +fn c2_chromium_incremental_no_new_data() { + let env = ScenarioEnv::new(); + let first_snapshot = snapshot_for_fixture( + &baseline_chromium_fixture(), + chromium_profile("chrome:Default", "Google Chrome"), + ); + run_one_ingest(&env, 1, &first_snapshot, false); + drop(first_snapshot); + + // Direct watermark assertion — proves the parser actually saved the + // cursor. baseline_chromium_fixture's max source_visit_id is 12. + let watermark_after_pass1 = read_profile_watermark(&env, "chrome:Default"); + assert_eq!( + watermark_after_pass1, + Some(12), + "C2 watermark contract: pass 1 must save the max source_visit_id observed (12)" + ); + + let second_snapshot = snapshot_for_fixture( + &baseline_chromium_fixture(), + chromium_profile("chrome:Default", "Google Chrome"), + ); + let summary = run_one_ingest(&env, 2, &second_snapshot, true); + + assert_eq!(summary.new_urls, 0, "second import must add no new URL rows"); + assert_eq!(summary.new_visits, 0, "second import must add no new visit rows"); + + assert_eq!(count_archive_rows(&env, "urls"), 2); + assert_eq!(count_archive_rows(&env, "visits"), 3); + + // Watermark must not regress on the no-op pass. + let watermark_after_pass2 = read_profile_watermark(&env, "chrome:Default"); + assert_eq!( + watermark_after_pass2, + Some(12), + "C2 watermark contract: no-op pass 2 must not regress the cursor" + ); +} + +// ---------------------------------------------------------------------- +// C3: Chromium incremental revisit of an old URL +// ---------------------------------------------------------------------- + +/// C3 — A URL whose `last_visit_time` is older than the watermark gets a +/// new visit. Without the `OR id IN (SELECT DISTINCT url FROM visits ...)` +/// fallback in `INGEST_URLS_SQL`, the URL would not be re-streamed in +/// pass 2; the new visit's `url_id_map` lookup would fail and the visit +/// would be silently dropped. This scenario asserts the fix is intact. +#[test] +fn c3_chromium_incremental_revisit_of_old_url() { + let env = ScenarioEnv::new(); + + // Initial state: one URL with a single old visit. After import, the + // watermark sits at visit_id=10 and url_last_visit_time=visit_one. + let visit_one_ms = 1_777_680_000_000_i64; // 2026-05-01T00:00:00Z + let visit_two_ms = 1_777_872_930_000_i64; // 2026-05-03T08:15:30Z + + let first_fixture = ChromiumHistoryFixture::new() + .add_url(ChromiumUrlRow { + id: 1, + url: "https://example.com/long-tail".to_string(), + title: Some("Long Tail Article".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: visit_one_ms, + hidden: false, + }) + .add_visit(visit_row(10, 1, visit_one_ms)); + + let first_snapshot = + snapshot_for_fixture(&first_fixture, chromium_profile("chrome:Default", "Google Chrome")); + let first_summary = run_one_ingest(&env, 1, &first_snapshot, false); + assert_eq!(first_summary.new_urls, 1); + assert_eq!(first_summary.new_visits, 1); + drop(first_snapshot); + + // Adversarial pass-2 fixture: same URL row with its last_visit_time + // intentionally left at the OLD value (visit_one_ms), but a new + // visit row with id > visit watermark and time > url watermark. The + // visit cursor moves past 10; the URL cursor does not. Only the OR + // fallback can rescue this URL into the second stream. + let second_fixture = ChromiumHistoryFixture::new() + .add_url(ChromiumUrlRow { + id: 1, + url: "https://example.com/long-tail".to_string(), + title: Some("Long Tail Article".to_string()), + visit_count: 2, + typed_count: 0, + last_visit_unix_ms: visit_one_ms, + hidden: false, + }) + .add_visit(visit_row(10, 1, visit_one_ms)) + .add_visit(visit_row(11, 1, visit_two_ms)); + + let second_snapshot = + snapshot_for_fixture(&second_fixture, chromium_profile("chrome:Default", "Google Chrome")); + let second_summary = run_one_ingest(&env, 2, &second_snapshot, true); + + assert_eq!( + second_summary.new_visits, 1, + "long-tail revisit captured by the OR fallback in INGEST_URLS_SQL" + ); + assert_eq!(count_visits_for_profile(&env, "chrome:Default"), 2); +} + +// ---------------------------------------------------------------------- +// X1: Edge imports Chrome history, then both diverge +// ---------------------------------------------------------------------- + +/// X1 — Per-source-profile contract: even when Edge and Chrome share visit +/// records (because Edge was installed and imported the Chrome history at +/// setup time), the archive must keep them as independent rows under +/// distinct `source_profiles` rows, and Edge's `browser_product` must +/// remain "Microsoft Edge" rather than collapsing to "Google Chrome" +/// (browser-support-and-adapter-playbook.md:107). +#[test] +fn x1_edge_imports_chrome_then_both_diverge() { + let env = ScenarioEnv::new(); + + let day_one_ms = 1_777_680_000_000_i64; + let day_two_ms = 1_777_809_600_000_i64; + let day_three_ms = 1_777_872_930_000_i64; + let day_four_ms = 1_777_900_000_000_i64; + + // Chrome: 3 visits across 3 URLs. + let chrome_fixture = ChromiumHistoryFixture::new() + .add_url(ChromiumUrlRow { + id: 1, + url: "https://example.com/shared".to_string(), + title: Some("Shared Article".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: day_one_ms, + hidden: false, + }) + .add_url(ChromiumUrlRow { + id: 2, + url: "https://example.com/chrome-only".to_string(), + title: Some("Chrome-only Article".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: day_two_ms, + hidden: false, + }) + .add_url(ChromiumUrlRow { + id: 3, + url: "https://example.com/chrome-late".to_string(), + title: Some("Chrome Late".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: day_four_ms, + hidden: false, + }) + .add_visit(visit_row(10, 1, day_one_ms)) + .add_visit(visit_row(11, 2, day_two_ms)) + .add_visit(visit_row(12, 3, day_four_ms)); + + // Edge: imported the shared visit from Chrome (same URL + same time), + // then made its own visit to the same URL on day three, and finally + // landed an Edge-only URL on day four. + let edge_fixture = ChromiumHistoryFixture::new() + .add_url(ChromiumUrlRow { + id: 100, + url: "https://example.com/shared".to_string(), + title: Some("Shared Article".to_string()), + visit_count: 2, + typed_count: 0, + last_visit_unix_ms: day_three_ms, + hidden: false, + }) + .add_url(ChromiumUrlRow { + id: 101, + url: "https://example.com/edge-only".to_string(), + title: Some("Edge-only Article".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: day_four_ms, + hidden: false, + }) + .add_visit(visit_row(200, 100, day_one_ms)) // imported from Chrome + .add_visit(visit_row(201, 100, day_three_ms)) // genuine Edge visit + .add_visit(visit_row(202, 101, day_four_ms)); + + let chrome_snapshot = + snapshot_for_fixture(&chrome_fixture, chromium_profile("chrome:Default", "Google Chrome")); + let edge_snapshot = + snapshot_for_fixture(&edge_fixture, chromium_profile("edge:Default", "Microsoft Edge")); + + run_one_ingest(&env, 1, &chrome_snapshot, false); + run_one_ingest(&env, 2, &edge_snapshot, false); + + // Per-profile counts: each browser sees its own truth without merging. + assert_eq!(count_urls_for_profile(&env, "chrome:Default"), 3); + assert_eq!(count_visits_for_profile(&env, "chrome:Default"), 3); + assert_eq!(count_urls_for_profile(&env, "edge:Default"), 2); + assert_eq!(count_visits_for_profile(&env, "edge:Default"), 3); + + // Total archive rows: 3 + 2 url rows = 5; 3 + 3 visit rows = 6. + // The shared URL exists once per profile (= 2 rows) by design. + assert_eq!(count_archive_rows(&env, "urls"), 5); + assert_eq!(count_archive_rows(&env, "visits"), 6); + + // Provenance contract: Edge profile must keep its product identity. + let archive = env.open_archive(); + let edge_product: String = archive + .query_row( + "SELECT browser_product FROM source_profiles WHERE profile_key = ?1", + ["edge:Default"], + |row| row.get(0), + ) + .expect("edge product"); + assert_eq!( + edge_product, "Microsoft Edge", + "Edge profile must not collapse to Google Chrome (playbook §107)" + ); + + let chrome_product: String = archive + .query_row( + "SELECT browser_product FROM source_profiles WHERE profile_key = ?1", + ["chrome:Default"], + |row| row.get(0), + ) + .expect("chrome product"); + assert_eq!(chrome_product, "Google Chrome"); +} + +// T1, T2, T2b moved to dedup_scenarios_takeout.rs. + +// ---------------------------------------------------------------------- +// X2: Chromium-family product identity for Atlas and Comet +// ---------------------------------------------------------------------- + +/// X2 — Per the browser-support-and-adapter-playbook §156-161, ChatGPT +/// Atlas and Perplexity Comet are Chromium-family products that must +/// preserve their product identity in `source_profiles.browser_product` +/// rather than collapsing into a generic "Google Chrome". This scenario +/// pins that contract: each profile's `browser_product` column must +/// match its source `browser_name` verbatim after ingest. If a future +/// refactor accidentally normalizes all Chromium-family browsers to +/// "Google Chrome" (or strips the product distinction in any other +/// way), this test fails immediately. +#[test] +fn x2_chromium_family_products_preserve_browser_product_identity() { + let env = ScenarioEnv::new(); + let day_one_ms = 1_777_680_000_000_i64; + + // Each browser gets its own synthetic 1-URL, 1-visit fixture. The + // fixture format is the same Chromium History schema for all three + // products — what differs is the profile metadata. + let make_fixture = |url: &str, title: &str| { + ChromiumHistoryFixture::new() + .add_url(ChromiumUrlRow { + id: 1, + url: url.to_string(), + title: Some(title.to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: day_one_ms, + hidden: false, + }) + .add_visit(visit_row(10, 1, day_one_ms)) + }; + + let atlas_snapshot = snapshot_for_fixture( + &make_fixture("https://example.com/atlas-page", "Atlas Page"), + chromium_profile("chatgpt-atlas:Default", "ChatGPT Atlas"), + ); + let comet_snapshot = snapshot_for_fixture( + &make_fixture("https://example.com/comet-page", "Comet Page"), + chromium_profile("comet:Default", "Perplexity Comet"), + ); + let chrome_snapshot = snapshot_for_fixture( + &make_fixture("https://example.com/chrome-page", "Chrome Page"), + chromium_profile("chrome:Default", "Google Chrome"), + ); + + run_one_ingest(&env, 1, &atlas_snapshot, false); + run_one_ingest(&env, 2, &comet_snapshot, false); + run_one_ingest(&env, 3, &chrome_snapshot, false); + + // Each profile lands as an independent source_profile with its own + // canonical row counts. + assert_eq!(count_urls_for_profile(&env, "chatgpt-atlas:Default"), 1); + assert_eq!(count_visits_for_profile(&env, "chatgpt-atlas:Default"), 1); + assert_eq!(count_urls_for_profile(&env, "comet:Default"), 1); + assert_eq!(count_visits_for_profile(&env, "comet:Default"), 1); + assert_eq!(count_urls_for_profile(&env, "chrome:Default"), 1); + assert_eq!(count_visits_for_profile(&env, "chrome:Default"), 1); + + // Provenance contract: each browser_product must stay verbatim. + let archive = env.open_archive(); + let product_for = |profile_key: &str| -> String { + archive + .query_row( + "SELECT browser_product FROM source_profiles WHERE profile_key = ?1", + [profile_key], + |row| row.get(0), + ) + .expect("query browser_product") + }; + + assert_eq!( + product_for("chatgpt-atlas:Default"), + "ChatGPT Atlas", + "ChatGPT Atlas must not collapse to Google Chrome (playbook §156)" + ); + assert_eq!( + product_for("comet:Default"), + "Perplexity Comet", + "Perplexity Comet must not collapse to Google Chrome (playbook §158)" + ); + assert_eq!(product_for("chrome:Default"), "Google Chrome"); + + // browser_kind (derived from profile_id prefix) must also distinguish them. + let kind_for = |profile_key: &str| -> String { + archive + .query_row( + "SELECT browser_kind FROM source_profiles WHERE profile_key = ?1", + [profile_key], + |row| row.get(0), + ) + .expect("query browser_kind") + }; + + assert_eq!(kind_for("chatgpt-atlas:Default"), "chatgpt-atlas"); + assert_eq!(kind_for("comet:Default"), "comet"); + assert_eq!(kind_for("chrome:Default"), "chrome"); +} + +// ---------------------------------------------------------------------- +// C5: Chromium incremental growth — pure append-new-rows +// ---------------------------------------------------------------------- + +/// C5 — The most common real-world re-import: the user has new browsing +/// activity since last backup. Distinct from C2 (zero new rows) and C3 +/// (new visit on an OLD URL exposing watermark fallback). Here the +/// second pass adds wholly new URLs and visits that did not exist in +/// the first import. The watermark advance must let only the new rows +/// land while the original rows stay deduplicated. Pins the audit §5.1 +/// "re-import after appending new rows" contract. +#[test] +fn c5_chromium_incremental_append_new_urls_and_visits() { + let env = ScenarioEnv::new(); + + let day_one_ms = 1_777_680_000_000_i64; + let day_two_ms = 1_777_809_600_000_i64; + let day_three_ms = 1_777_872_930_000_i64; + let day_four_ms = 1_777_939_200_000_i64; + + // Pass 1: 2 URLs, 2 visits. + let first_fixture = ChromiumHistoryFixture::new() + .add_url(ChromiumUrlRow { + id: 1, + url: "https://example.com/original-one".to_string(), + title: Some("Original One".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: day_one_ms, + hidden: false, + }) + .add_url(ChromiumUrlRow { + id: 2, + url: "https://example.com/original-two".to_string(), + title: Some("Original Two".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: day_two_ms, + hidden: false, + }) + .add_visit(visit_row(10, 1, day_one_ms)) + .add_visit(visit_row(11, 2, day_two_ms)); + let first_snapshot = + snapshot_for_fixture(&first_fixture, chromium_profile("chrome:Default", "Google Chrome")); + let first_summary = run_one_ingest(&env, 1, &first_snapshot, false); + assert_eq!(first_summary.new_urls, 2); + assert_eq!(first_summary.new_visits, 2); + drop(first_snapshot); + + // Direct watermark assertion — pins that the parser saved cursor=11 + // after pass 1, otherwise pass 2's new_visits=2 below could be + // satisfied by a broken watermark that re-streams everything and + // relies on fingerprint dedup to drop the originals. + assert_eq!( + read_profile_watermark(&env, "chrome:Default"), + Some(11), + "C5 watermark contract: pass 1 must save cursor at max source_visit_id (11)" + ); + + // Pass 2: same 2 URLs + 2 NEW URLs + 2 NEW visits (one per new URL). + // The originals must stay deduplicated; only the 2 new URLs / 2 new + // visits should land. + let second_fixture = ChromiumHistoryFixture::new() + .add_url(ChromiumUrlRow { + id: 1, + url: "https://example.com/original-one".to_string(), + title: Some("Original One".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: day_one_ms, + hidden: false, + }) + .add_url(ChromiumUrlRow { + id: 2, + url: "https://example.com/original-two".to_string(), + title: Some("Original Two".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: day_two_ms, + hidden: false, + }) + .add_url(ChromiumUrlRow { + id: 3, + url: "https://example.com/new-three".to_string(), + title: Some("New Three".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: day_three_ms, + hidden: false, + }) + .add_url(ChromiumUrlRow { + id: 4, + url: "https://example.com/new-four".to_string(), + title: Some("New Four".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: day_four_ms, + hidden: false, + }) + .add_visit(visit_row(10, 1, day_one_ms)) + .add_visit(visit_row(11, 2, day_two_ms)) + .add_visit(visit_row(12, 3, day_three_ms)) + .add_visit(visit_row(13, 4, day_four_ms)); + let second_snapshot = + snapshot_for_fixture(&second_fixture, chromium_profile("chrome:Default", "Google Chrome")); + let second_summary = run_one_ingest(&env, 2, &second_snapshot, true); + + // Summary must report exactly the new content. + assert_eq!(second_summary.new_urls, 2, "second import should report 2 new URLs"); + assert_eq!(second_summary.new_visits, 2, "second import should report 2 new visits"); + + // Archive totals: 4 URLs, 4 visits. + assert_eq!(count_urls_for_profile(&env, "chrome:Default"), 4); + assert_eq!(count_visits_for_profile(&env, "chrome:Default"), 4); + assert_eq!(count_archive_rows(&env, "urls"), 4); + assert_eq!(count_archive_rows(&env, "visits"), 4); + + // Source visit IDs flow through unmodified (sorted lexically: 10, 11, 12, 13). + let visit_ids = collect_visit_source_ids(&env, "chrome:Default"); + assert_eq!(visit_ids, vec!["10", "11", "12", "13"]); + + // Confirm the new visit timestamps round-tripped, not just the row count. + let archive = env.open_archive(); + let new_visit_three_ms: i64 = archive + .query_row( + "SELECT visit_time_ms FROM visits + JOIN source_profiles ON source_profiles.id = visits.source_profile_id + WHERE source_profiles.profile_key = 'chrome:Default' + AND visits.source_visit_id = '12'", + [], + |row| row.get(0), + ) + .expect("query new visit three time"); + assert_eq!(new_visit_three_ms, day_three_ms); + + // Direct watermark assertion: pass 2's parser ran with cursor=11 + // (saved by pass 1) and observed visits 12, 13. The cursor must + // have advanced to 13 after pass 2 commits. If a future regression + // breaks the watermark save and pass 2 silently re-streamed every + // visit (with fingerprint dedup masking the row counts), this + // assertion catches it. + assert_eq!( + read_profile_watermark(&env, "chrome:Default"), + Some(13), + "C5 watermark contract: pass 2 must advance the cursor to the new max (13)" + ); +} + +// ---------------------------------------------------------------------- +// X3: Multi-profile per browser — Chrome Default vs Chrome Profile 1 +// ---------------------------------------------------------------------- + +/// X3 — Real users almost always have multiple Chrome profiles +/// (`Default`, `Profile 1`, sometimes more). Each profile is a separate +/// `~/Library/Application Support/Google/Chrome//History` +/// file, discovered as an independent `BrowserProfile`. The dedup +/// contract requires: +/// +/// 1. **Independent source_profiles**: `profile_key = "chrome:Default"` +/// and `profile_key = "chrome:Profile 1"` must produce two distinct +/// rows in `source_profiles` (no collision under same `browser_kind`). +/// 2. **Per-profile dedup scope**: identical visits across the two +/// profiles must not deduplicate. The `event_fingerprint` partial +/// unique index is scoped by `source_profile_id`, so each profile +/// keeps its own copy. +/// 3. **Per-profile watermark isolation**: a re-import of Profile 1 +/// after Default has been ingested must not be affected by Default's +/// watermark advance — both profiles get independent incremental +/// state. +/// +/// This is the multi-profile mirror of X1's cross-browser test. If a +/// future refactor accidentally key the watermark by `browser_kind` only +/// (instead of by `source_profile_id`), or merges identical visits +/// across profiles, this scenario fails. +#[test] +fn x3_multiple_profiles_within_same_browser_stay_independent() { + let env = ScenarioEnv::new(); + + let day_one_ms = 1_777_680_000_000_i64; + let day_two_ms = 1_777_809_600_000_i64; + let day_three_ms = 1_777_872_930_000_i64; + + // Both profiles share the same URL + visit time (e.g. the user + // visited the same article from both work and personal profiles). + let shared_fixture = |source_url_id: i64, source_visit_id: i64| { + ChromiumHistoryFixture::new() + .add_url(ChromiumUrlRow { + id: source_url_id, + url: "https://example.com/cross-profile".to_string(), + title: Some("Cross Profile".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: day_one_ms, + hidden: false, + }) + .add_visit(visit_row(source_visit_id, source_url_id, day_one_ms)) + }; + + // Default: pass 1 — single shared URL + visit. + let default_snap_1 = snapshot_for_fixture( + &shared_fixture(1, 10), + chromium_profile("chrome:Default", "Google Chrome"), + ); + let default_summary_1 = run_one_ingest(&env, 1, &default_snap_1, false); + assert_eq!(default_summary_1.new_urls, 1); + assert_eq!(default_summary_1.new_visits, 1); + drop(default_snap_1); + + // Profile 1: pass 1 — same URL + visit time but DIFFERENT + // source_visit_id (each Chrome profile has its own rowid sequence). + // The fingerprint inputs (url, visit_time_ms, title, transition, + // app_id) match Default's, but the fingerprint partial index is + // scoped per source_profile_id, so this visit must NOT dedup. + let profile1_snap_1 = snapshot_for_fixture( + &shared_fixture(1, 99), + chromium_profile("chrome:Profile 1", "Google Chrome"), + ); + let profile1_summary_1 = run_one_ingest(&env, 2, &profile1_snap_1, false); + assert_eq!( + profile1_summary_1.new_urls, 1, + "Profile 1's URL must land independently of Default's" + ); + assert_eq!( + profile1_summary_1.new_visits, 1, + "identical visit across profiles must not dedup (per-profile fingerprint scope)" + ); + + // Per-profile counts confirm the two profiles each hold one URL + + // one visit, even though the visit content is identical. + assert_eq!(count_urls_for_profile(&env, "chrome:Default"), 1); + assert_eq!(count_visits_for_profile(&env, "chrome:Default"), 1); + assert_eq!(count_urls_for_profile(&env, "chrome:Profile 1"), 1); + assert_eq!(count_visits_for_profile(&env, "chrome:Profile 1"), 1); + assert_eq!(count_archive_rows(&env, "urls"), 2); + assert_eq!(count_archive_rows(&env, "visits"), 2); + + // Direct per-profile watermark assertion — pins that the two + // profiles each have their own profile_watermarks row keyed by + // their distinct profile_id. If a regression keyed watermarks by + // browser_kind only (cross-profile bleed), these two queries would + // return the same value or one of them would be missing. + assert_eq!( + read_profile_watermark(&env, "chrome:Default"), + Some(10), + "Default's watermark must be saved at its own max source_visit_id (10)" + ); + assert_eq!( + read_profile_watermark(&env, "chrome:Profile 1"), + Some(99), + "Profile 1's watermark must be saved at its own max source_visit_id (99), \ + independently of Default's" + ); + + // Per-profile watermark isolation: now re-import Profile 1 with + // NEW activity (the user kept browsing on Profile 1). Default's + // watermark advance from pass 1 must not affect Profile 1's + // incremental cursor. Profile 1's new content must be detected. + let profile1_fixture_2 = ChromiumHistoryFixture::new() + // Same URL+visit as Profile 1's pass 1 — must dedup at Profile 1's + // partial fingerprint index. + .add_url(ChromiumUrlRow { + id: 1, + url: "https://example.com/cross-profile".to_string(), + title: Some("Cross Profile".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: day_one_ms, + hidden: false, + }) + // New URL only seen on Profile 1. + .add_url(ChromiumUrlRow { + id: 2, + url: "https://example.com/profile-one-only".to_string(), + title: Some("Profile One Only".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: day_two_ms, + hidden: false, + }) + .add_url(ChromiumUrlRow { + id: 3, + url: "https://example.com/profile-one-late".to_string(), + title: Some("Profile One Late".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: day_three_ms, + hidden: false, + }) + .add_visit(visit_row(99, 1, day_one_ms)) + .add_visit(visit_row(100, 2, day_two_ms)) + .add_visit(visit_row(101, 3, day_three_ms)); + let profile1_snap_2 = snapshot_for_fixture( + &profile1_fixture_2, + chromium_profile("chrome:Profile 1", "Google Chrome"), + ); + let profile1_summary_2 = run_one_ingest(&env, 3, &profile1_snap_2, true); + + // Watermark must have been read from Profile 1's own state (not + // Default's). Profile 1 sees 2 new URLs and 2 new visits. + assert_eq!( + profile1_summary_2.new_urls, 2, + "Profile 1's incremental import must pick up its own 2 new URLs" + ); + assert_eq!( + profile1_summary_2.new_visits, 2, + "Profile 1's incremental import must pick up its own 2 new visits" + ); + + // Final per-profile counts. + assert_eq!(count_urls_for_profile(&env, "chrome:Default"), 1, "Default untouched"); + assert_eq!(count_visits_for_profile(&env, "chrome:Default"), 1, "Default untouched"); + assert_eq!(count_urls_for_profile(&env, "chrome:Profile 1"), 3); + assert_eq!(count_visits_for_profile(&env, "chrome:Profile 1"), 3); + assert_eq!(count_archive_rows(&env, "urls"), 4); + assert_eq!(count_archive_rows(&env, "visits"), 4); + + // Direct watermark assertion after Profile 1's incremental pass: + // Default's cursor must remain frozen at 10, Profile 1's must have + // advanced to 101 (the new max). If a regression made the two + // profiles share a single watermark, Default's cursor would have + // jumped to 101 too — which this assertion catches. + assert_eq!( + read_profile_watermark(&env, "chrome:Default"), + Some(10), + "Default's watermark must NOT be touched by Profile 1's incremental import" + ); + assert_eq!( + read_profile_watermark(&env, "chrome:Profile 1"), + Some(101), + "Profile 1's watermark must have advanced to the new max source_visit_id (101)" + ); + + // Provenance: both share `browser_kind = chrome` and + // `browser_product = Google Chrome` but have distinct `profile_key` + // and `profile_name`. + let archive = env.open_archive(); + let collect_profile_meta = |profile_key: &str| -> (String, String, String) { + archive + .query_row( + "SELECT browser_kind, browser_product, profile_name + FROM source_profiles WHERE profile_key = ?1", + [profile_key], + |row| Ok((row.get(0)?, row.get(1)?, row.get(2)?)), + ) + .expect("profile meta") + }; + let (default_kind, default_product, default_name) = collect_profile_meta("chrome:Default"); + let (profile1_kind, profile1_product, profile1_name) = collect_profile_meta("chrome:Profile 1"); + assert_eq!(default_kind, "chrome"); + assert_eq!(profile1_kind, "chrome"); + assert_eq!(default_product, "Google Chrome"); + assert_eq!(profile1_product, "Google Chrome"); + assert_eq!(default_name, "Default"); + // profile_name comes from chromium_profile helper which hardcodes + // "Default"; in real PathKeep it would be the OS-discovered name. + // Both still produce distinct profile_keys via the profile_id input. + assert_eq!(profile1_name, "Default"); +} + +// ---------------------------------------------------------------------- +// C6: Chromium source DB schema tolerance — extra columns must not break ingest +// ---------------------------------------------------------------------- + +/// C6 — Chrome's `History` schema grows over time (real Chrome adds +/// columns like `favicon_id` on `urls`, plus `segment_id`, +/// `opener_visit`, and the `originator_*` sync metadata fields on +/// `visits`). PathKeep's parser uses **explicit column lists** in +/// SELECTs (see `INGEST_URLS_SQL`, `INGEST_VISITS_SQL`), so extra +/// columns in the source DB must be silently tolerated. This scenario +/// pins that contract: a fixture DB with `ALTER TABLE`-added columns +/// must import without error and produce identical canonical rows. +/// +/// If a future refactor switches to `SELECT *` or otherwise becomes +/// column-count-sensitive, this test fails immediately. This is the +/// §5.1 "re-import after schema migration in the source DB" contract. +#[test] +fn c6_chromium_extra_columns_on_source_db_do_not_break_ingest() { + let env = ScenarioEnv::new(); + + let day_one_ms = 1_777_680_000_000_i64; + let day_two_ms = 1_777_809_600_000_i64; + + let fixture = ChromiumHistoryFixture::new() + .add_url(ChromiumUrlRow { + id: 1, + url: "https://example.com/schema-tolerant".to_string(), + title: Some("Schema Tolerant".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: day_one_ms, + hidden: false, + }) + .add_url(ChromiumUrlRow { + id: 2, + url: "https://example.com/schema-tolerant-two".to_string(), + title: Some("Schema Tolerant Two".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: day_two_ms, + hidden: false, + }) + .add_visit(visit_row(10, 1, day_one_ms)) + .add_visit(visit_row(11, 2, day_two_ms)); + + let temp_dir = tempdir().expect("snapshot tempdir"); + let history_path = temp_dir.path().join("History"); + fixture.write(&history_path).expect("write chromium fixture"); + + // Simulate Chrome adding new columns in a later release. The + // PathKeep parser must continue to project only the columns it + // explicitly names; the extras must be ignored entirely. + { + let connection = Connection::open(&history_path).expect("open fixture for ALTER"); + // Real Chrome additions over time: + connection + .execute("ALTER TABLE urls ADD COLUMN favicon_id INTEGER", []) + .expect("add favicon_id"); + connection + .execute("ALTER TABLE visits ADD COLUMN segment_id INTEGER", []) + .expect("add segment_id"); + connection + .execute("ALTER TABLE visits ADD COLUMN opener_visit INTEGER", []) + .expect("add opener_visit"); + connection + .execute("ALTER TABLE visits ADD COLUMN originator_cache_guid TEXT", []) + .expect("add originator_cache_guid"); + // Populate the new columns with synthetic data so the schema isn't + // just a NULL column suffix — proves the parser truly ignores them. + connection + .execute("UPDATE urls SET favicon_id = 42 WHERE id = 1", []) + .expect("populate favicon_id"); + connection + .execute( + "UPDATE visits SET segment_id = 7, opener_visit = 0, originator_cache_guid = 'synthetic-originator' WHERE id = 10", + [], + ) + .expect("populate visit extras"); + } + + let history_bytes = std::fs::metadata(&history_path).map(|meta| meta.len()).unwrap_or(0); + let mut profile = chromium_profile("chrome:Default", "Google Chrome"); + profile.history_bytes = history_bytes; + let snapshot = ProfileSnapshot { + profile, + temp_dir, + history_path, + favicons_path: None, + source_hashes: vec![FileFingerprint { + path: "History".to_string(), + sha256: "synthetic-fixture-hash".to_string(), + }], + }; + + let summary = run_one_ingest(&env, 1, &snapshot, false); + + // The extra columns must be silently ignored — canonical row counts + // must match what a normal fixture without ALTER TABLE produces. + assert_eq!( + summary.new_urls, 2, + "schema-tolerance: URL count must match minimal-schema fixture" + ); + assert_eq!( + summary.new_visits, 2, + "schema-tolerance: visit count must match minimal-schema fixture" + ); + assert_eq!(count_urls_for_profile(&env, "chrome:Default"), 2); + assert_eq!(count_visits_for_profile(&env, "chrome:Default"), 2); + + // Spot-check that the columns the parser DOES project still landed. + let archive = env.open_archive(); + let title: Option = archive + .query_row( + "SELECT title FROM urls + JOIN source_profiles ON source_profiles.id = urls.source_profile_id + WHERE source_profiles.profile_key = 'chrome:Default' + AND urls.source_url_id = 1", + [], + |row| row.get(0), + ) + .expect("query url title after ALTER"); + assert_eq!(title.as_deref(), Some("Schema Tolerant")); +} + +// ---------------------------------------------------------------------- +// C7: Tied last_visit_ms must NOT overwrite title / hidden / payload_hash +// ---------------------------------------------------------------------- + +/// C7 — Tie-break contract for the B1 fix in `writes.rs::upsert_url`. +/// When two snapshots report the same `last_visit_ms` for a URL, the +/// upsert must NOT overwrite `title`, `hidden`, `payload_hash`, or +/// `recorded_at` — only strictly newer timestamps win. This prevents +/// two real-world data losses: +/// +/// 1. A re-import where Chrome's title hadn't been hydrated yet +/// (ParsedUrl.title = None) shouldn't silently destroy a captured +/// title at the same `last_visit_ms`. +/// 2. Firefox bookmark-only URLs (last_visit_date IS NULL → 0) tie at +/// `last_visit_ms = 0` on every re-import; the original B1 fix's +/// `>=` comparison meant title/hidden flipped to the second snapshot +/// every sync. +#[test] +fn c7_tied_last_visit_ms_does_not_overwrite_title_hidden_or_payload_hash() { + let env = ScenarioEnv::new(); + let visit_time_ms = 1_777_809_600_000_i64; + + // Snapshot 1: URL with real title, hidden=false, captured at T. + let first_fixture = ChromiumHistoryFixture::new() + .add_url(ChromiumUrlRow { + id: 1, + url: "https://example.com/tied-time".to_string(), + title: Some("Captured Title".to_string()), + visit_count: 3, + typed_count: 1, + last_visit_unix_ms: visit_time_ms, + hidden: false, + }) + .add_visit(visit_row(10, 1, visit_time_ms)); + let first_snapshot = + snapshot_for_fixture(&first_fixture, chromium_profile("chrome:Tied", "Google Chrome")); + run_one_ingest(&env, 1, &first_snapshot, false); + drop(first_snapshot); + + let initial_payload_hash: String = { + let archive = env.open_archive(); + archive + .query_row( + "SELECT payload_hash FROM urls + JOIN source_profiles ON source_profiles.id = urls.source_profile_id + WHERE source_profiles.profile_key = 'chrome:Tied' + AND urls.source_url_id = 1", + [], + |row| row.get(0), + ) + .expect("query initial payload_hash") + }; + + // Snapshot 2: same last_visit_ms (tie), but everything else is + // worse — title is NULL, hidden flipped to true, lower counts. + // The B1 fix must preserve snapshot 1's values across this tie. + let second_fixture = ChromiumHistoryFixture::new() + .add_url(ChromiumUrlRow { + id: 1, + url: "https://example.com/tied-time".to_string(), + title: None, + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: visit_time_ms, + hidden: true, + }) + .add_visit(visit_row(11, 1, visit_time_ms)); + let second_snapshot = + snapshot_for_fixture(&second_fixture, chromium_profile("chrome:Tied", "Google Chrome")); + run_one_ingest(&env, 2, &second_snapshot, false); + + let archive = env.open_archive(); + let (title, hidden, payload_hash, visit_count, typed_count): ( + Option, + i64, + String, + i64, + i64, + ) = archive + .query_row( + "SELECT title, hidden, payload_hash, visit_count, typed_count FROM urls + JOIN source_profiles ON source_profiles.id = urls.source_profile_id + WHERE source_profiles.profile_key = 'chrome:Tied' + AND urls.source_url_id = 1", + [], + |row| Ok((row.get(0)?, row.get(1)?, row.get(2)?, row.get(3)?, row.get(4)?)), + ) + .expect("query url state after tied re-import"); + + assert_eq!( + title.as_deref(), + Some("Captured Title"), + "tied last_visit_ms must NOT overwrite title with NULL from later snapshot", + ); + assert_eq!(hidden, 0, "tied last_visit_ms must NOT flip hidden to true from later snapshot"); + assert_eq!( + payload_hash, initial_payload_hash, + "tied last_visit_ms must preserve original payload_hash (audit-trail integrity)", + ); + assert_eq!(visit_count, 3, "visit_count must use MAX semantics, preserving the higher value"); + assert_eq!(typed_count, 1, "typed_count must use MAX semantics, preserving the higher value"); +} + +// ---------------------------------------------------------------------- +// C4: URL upsert must not regress metadata on re-import (B1 — FIXED) +// ---------------------------------------------------------------------- + +/// C4 — Regression test for audit bug **B1** (fixed in 6884c10d). The URL +/// upsert in `writes.rs` now uses `MAX()` for `visit_count` / `typed_count` +/// and `CASE WHEN excluded.last_visit_ms >= urls.last_visit_ms` for `title` +/// / `hidden`, preventing older snapshots from overwriting newer metadata. +/// This test asserts all four fields survive a re-import of an older +/// snapshot without regression. +#[test] +fn c4_chromium_reimport_older_snapshot_regresses_visit_count_demonstrates_b1() { + let env = ScenarioEnv::new(); + let visit_two_ms = 1_777_809_600_000_i64; + + // Snapshot 1: URL with lifetime visit_count=10. + let first_fixture = ChromiumHistoryFixture::new() + .add_url(ChromiumUrlRow { + id: 1, + url: "https://example.com/long-tracked".to_string(), + title: Some("Long Tracked Page".to_string()), + visit_count: 10, + typed_count: 4, + last_visit_unix_ms: visit_two_ms, + hidden: false, + }) + .add_visit(visit_row(10, 1, visit_two_ms)); + let first_snapshot = + snapshot_for_fixture(&first_fixture, chromium_profile("chrome:Default", "Google Chrome")); + run_one_ingest(&env, 1, &first_snapshot, false); + drop(first_snapshot); + assert_eq!(stored_visit_count(&env, "chrome:Default", 1), 10); + + // Snapshot 2: same URL but visit_count=5 (the older snapshot regression). + // last_visit_ms is identical, so the existing guard does not fire and + // the unconditional overwrite path runs. + let second_fixture = ChromiumHistoryFixture::new() + .add_url(ChromiumUrlRow { + id: 1, + url: "https://example.com/long-tracked".to_string(), + title: Some("Regressed Title".to_string()), + visit_count: 5, + typed_count: 1, + last_visit_unix_ms: visit_two_ms, + hidden: false, + }) + .add_visit(visit_row(10, 1, visit_two_ms)); + let second_snapshot = + snapshot_for_fixture(&second_fixture, chromium_profile("chrome:Default", "Google Chrome")); + run_one_ingest(&env, 2, &second_snapshot, false); + + let final_count = stored_visit_count(&env, "chrome:Default", 1); + assert!( + final_count >= 10, + "B1 fix required: urls.visit_count must not regress on re-import (got {final_count}, was 10)" + ); + + // B1 fix: typed_count uses MAX semantics — must keep the higher value. + let final_typed = stored_typed_count(&env, "chrome:Default", 1); + assert!( + final_typed >= 4, + "B1 fix: typed_count must use MAX semantics (got {final_typed}, was 4)" + ); + + // B1 fix: title and hidden use CASE WHEN excluded.last_visit_ms >= + // urls.last_visit_ms — at equal timestamps the second import "wins", + // which is acceptable. The important contract: a strictly OLDER + // snapshot cannot overwrite. Re-import with an older last_visit_ms + // to verify. + drop(second_snapshot); + let visit_one_ms = 1_777_680_000_000_i64; // strictly older + let third_fixture = ChromiumHistoryFixture::new() + .add_url(ChromiumUrlRow { + id: 1, + url: "https://example.com/long-tracked".to_string(), + title: Some("Ancient Title".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: visit_one_ms, + hidden: true, + }) + .add_visit(visit_row(10, 1, visit_one_ms)); + let third_snapshot = + snapshot_for_fixture(&third_fixture, chromium_profile("chrome:Default", "Google Chrome")); + run_one_ingest(&env, 3, &third_snapshot, false); + + let final_title = stored_title(&env, "chrome:Default", 1); + assert_ne!( + final_title.as_deref(), + Some("Ancient Title"), + "B1 fix: title from strictly older snapshot must not overwrite newer" + ); + + let final_hidden = stored_hidden(&env, "chrome:Default", 1); + assert!(!final_hidden, "B1 fix: hidden must not regress to older snapshot's value"); +} + +fn stored_visit_count(env: &ScenarioEnv, profile_key: &str, source_url_id: i64) -> i64 { + let archive = env.open_archive(); + archive + .query_row( + "SELECT visit_count FROM urls + JOIN source_profiles ON source_profiles.id = urls.source_profile_id + WHERE source_profiles.profile_key = ?1 AND urls.source_url_id = ?2", + rusqlite::params![profile_key, source_url_id], + |row| row.get(0), + ) + .expect("query visit_count") +} + +fn stored_title(env: &ScenarioEnv, profile_key: &str, source_url_id: i64) -> Option { + let archive = env.open_archive(); + archive + .query_row( + "SELECT title FROM urls + JOIN source_profiles ON source_profiles.id = urls.source_profile_id + WHERE source_profiles.profile_key = ?1 AND urls.source_url_id = ?2", + rusqlite::params![profile_key, source_url_id], + |row| row.get(0), + ) + .expect("query title") +} + +fn stored_typed_count(env: &ScenarioEnv, profile_key: &str, source_url_id: i64) -> i64 { + let archive = env.open_archive(); + archive + .query_row( + "SELECT typed_count FROM urls + JOIN source_profiles ON source_profiles.id = urls.source_profile_id + WHERE source_profiles.profile_key = ?1 AND urls.source_url_id = ?2", + rusqlite::params![profile_key, source_url_id], + |row| row.get(0), + ) + .expect("query typed_count") +} + +fn stored_hidden(env: &ScenarioEnv, profile_key: &str, source_url_id: i64) -> bool { + let archive = env.open_archive(); + let hidden_int: i64 = archive + .query_row( + "SELECT hidden FROM urls + JOIN source_profiles ON source_profiles.id = urls.source_profile_id + WHERE source_profiles.profile_key = ?1 AND urls.source_url_id = ?2", + rusqlite::params![profile_key, source_url_id], + |row| row.get(0), + ) + .expect("query hidden"); + hidden_int != 0 +} + +// F2, S2 moved to dedup_scenarios_baselines.rs. +// T3, T5 moved to dedup_scenarios_takeout.rs. +// C_SUB_MS implemented in dedup_scenarios_edge_cases.rs. diff --git a/src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_baselines.rs b/src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_baselines.rs new file mode 100644 index 00000000..aa0e2f35 --- /dev/null +++ b/src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_baselines.rs @@ -0,0 +1,981 @@ +//! Baseline import scenarios for Firefox, Safari, and Chromium fingerprint dedup. +//! +//! These scenarios complement `dedup_scenarios.rs` by covering: +//! - **F1**: Firefox single-import baseline — asserts all URLs and visits +//! land correctly from a Firefox Places fixture. +//! - **S1**: Safari single-import baseline — asserts all URLs and visits +//! land correctly from a Safari History fixture. +//! - **Chromium fingerprint dedup**: Re-importing the same visits with +//! different `source_visit_id` values must not create duplicates because +//! the `event_fingerprint` partial index catches them. +//! +//! Each scenario reuses the `ScenarioEnv`, `run_one_ingest`, `count_*` +//! helpers from `dedup_scenarios.rs` and the snapshot builders for Firefox +//! and Safari already defined there. + +use super::*; +use browser_history_fixtures::{ + ChromiumHistoryFixture, ChromiumUrlRow, ChromiumVisitRow, FirefoxPlaceRow, + FirefoxPlacesFixture, FirefoxVisitRow, SafariHistoryFixture, SafariHistoryItemRow, + SafariHistoryVisitRow, +}; +use tempfile::tempdir; + +// ── Shared helpers (mirror dedup_scenarios.rs patterns) ───────────── + +fn test_config() -> AppConfig { + AppConfig { initialized: true, ..AppConfig::default() } +} + +fn test_paths(root: &Path) -> ProjectPaths { + crate::config::project_paths_with_root(root) +} + +/// Holds the long-lived resources one scenario needs across multiple +/// imports (same as dedup_scenarios::ScenarioEnv). +struct ScenarioEnv { + _root: tempfile::TempDir, + paths: ProjectPaths, + config: AppConfig, +} + +impl ScenarioEnv { + fn new() -> Self { + let root = tempdir().expect("scenario root tempdir"); + let paths = test_paths(root.path()); + let config = test_config(); + crate::config::ensure_paths(&paths).expect("ensure paths"); + Self { _root: root, paths, config } + } + + fn open_archive(&self) -> rusqlite::Connection { + open_archive_connection(&self.paths, &self.config, None).expect("open archive") + } +} + +fn seed_run(archive: &Transaction<'_>, run_id: i64) { + archive + .execute( + "INSERT INTO runs (id, run_type, trigger, started_at, timezone, status, profile_scope_json, warnings_json, stats_json, due_only) + VALUES (?1, 'backup', 'manual', '2026-05-25T00:00:00+00:00', 'UTC', 'running', '[]', '[]', '{}', 0)", + [run_id], + ) + .expect("seed run"); +} + +fn run_one_ingest( + env: &ScenarioEnv, + run_id: i64, + snapshot: &ProfileSnapshot, + use_watermark: bool, +) -> BackupProfileSummary { + let mut archive = env.open_archive(); + let transaction = archive.transaction().expect("scenario transaction"); + seed_run(&transaction, run_id); + let mut snapshot_artifacts = Vec::new(); + let mut source_evidence_plans = Vec::new(); + let summary = process_profile_snapshot( + &transaction, + run_id, + &env.paths, + &env.config, + snapshot, + &mut snapshot_artifacts, + &mut source_evidence_plans, + false, + use_watermark, + ) + .expect("process profile snapshot"); + transaction.commit().expect("commit scenario transaction"); + summary +} + +fn count_archive_rows(env: &ScenarioEnv, table: &str) -> i64 { + let archive = env.open_archive(); + archive + .query_row(&format!("SELECT COUNT(*) FROM {table}"), [], |row| row.get(0)) + .expect("count rows") +} + +fn count_urls_for_profile(env: &ScenarioEnv, profile_key: &str) -> i64 { + let archive = env.open_archive(); + archive + .query_row( + "SELECT COUNT(*) FROM urls + JOIN source_profiles ON source_profiles.id = urls.source_profile_id + WHERE source_profiles.profile_key = ?1", + [profile_key], + |row| row.get(0), + ) + .expect("count urls for profile") +} + +fn count_visits_for_profile(env: &ScenarioEnv, profile_key: &str) -> i64 { + let archive = env.open_archive(); + archive + .query_row( + "SELECT COUNT(*) FROM visits + JOIN source_profiles ON source_profiles.id = visits.source_profile_id + WHERE source_profiles.profile_key = ?1", + [profile_key], + |row| row.get(0), + ) + .expect("count visits for profile") +} + +fn collect_visit_source_ids(env: &ScenarioEnv, profile_key: &str) -> Vec { + let archive = env.open_archive(); + let mut statement = archive + .prepare( + "SELECT visits.source_visit_id FROM visits + JOIN source_profiles ON source_profiles.id = visits.source_profile_id + WHERE source_profiles.profile_key = ?1 + ORDER BY visits.source_visit_id ASC", + ) + .expect("prepare visit ids"); + statement + .query_map([profile_key], |row| row.get::<_, String>(0)) + .expect("query visit ids") + .collect::>>() + .expect("collect visit ids") +} + +// ── Firefox helpers ───────────────────────────────────────────────── + +fn firefox_profile(profile_id: &str) -> crate::models::BrowserProfile { + crate::models::BrowserProfile { + profile_id: profile_id.to_string(), + profile_name: "Default".to_string(), + browser_family: "firefox".to_string(), + browser_name: "Firefox".to_string(), + user_name: Some("synthetic-user".to_string()), + profile_path: format!("/synthetic/{profile_id}"), + history_path: Some(format!("/synthetic/{profile_id}/places.sqlite")), + favicons_path: None, + history_exists: true, + history_readable: true, + access_issue: None, + browser_version: Some("125.0".to_string()), + history_file_name: "places.sqlite".to_string(), + history_bytes: 128, + favicons_bytes: 0, + supporting_bytes: 0, + retention_boundary: crate::models::BrowserRetentionBoundary::default(), + } +} + +fn firefox_snapshot(fixture: &FirefoxPlacesFixture, profile_id: &str) -> ProfileSnapshot { + let temp_dir = tempdir().expect("firefox snapshot tempdir"); + let history_path = temp_dir.path().join("places.sqlite"); + fixture.write(&history_path).expect("write firefox fixture"); + let history_bytes = std::fs::metadata(&history_path).map(|meta| meta.len()).unwrap_or(0); + let mut profile = firefox_profile(profile_id); + profile.history_bytes = history_bytes; + ProfileSnapshot { + profile, + temp_dir, + history_path, + favicons_path: None, + source_hashes: vec![FileFingerprint { + path: "places.sqlite".to_string(), + sha256: "synthetic-firefox-hash".to_string(), + }], + } +} + +// ── Safari helpers ────────────────────────────────────────────────── + +fn safari_profile(profile_id: &str) -> crate::models::BrowserProfile { + crate::models::BrowserProfile { + profile_id: profile_id.to_string(), + profile_name: "Default".to_string(), + browser_family: "safari".to_string(), + browser_name: "Safari".to_string(), + user_name: Some("synthetic-user".to_string()), + profile_path: format!("/synthetic/{profile_id}"), + history_path: Some(format!("/synthetic/{profile_id}/History.db")), + favicons_path: None, + history_exists: true, + history_readable: true, + access_issue: None, + browser_version: Some("18.4".to_string()), + history_file_name: "History.db".to_string(), + history_bytes: 128, + favicons_bytes: 0, + supporting_bytes: 0, + retention_boundary: crate::models::BrowserRetentionBoundary::default(), + } +} + +fn safari_visit( + id: i64, + history_item: i64, + title: &str, + visit_time_unix_ms: i64, +) -> SafariHistoryVisitRow { + SafariHistoryVisitRow { + id, + history_item, + title: Some(title.to_string()), + visit_time_unix_ms, + load_successful: Some(true), + http_non_get: Some(false), + synthesized: Some(false), + redirect_source: None, + redirect_destination: None, + origin: Some(0), + generation: Some(1), + attributes: Some(0), + score: Some(0.5), + } +} + +fn safari_snapshot(fixture: &SafariHistoryFixture, profile_id: &str) -> ProfileSnapshot { + let temp_dir = tempdir().expect("safari snapshot tempdir"); + let history_path = temp_dir.path().join("History.db"); + fixture.write(&history_path).expect("write safari fixture"); + let history_bytes = std::fs::metadata(&history_path).map(|meta| meta.len()).unwrap_or(0); + let mut profile = safari_profile(profile_id); + profile.history_bytes = history_bytes; + ProfileSnapshot { + profile, + temp_dir, + history_path, + favicons_path: None, + source_hashes: vec![FileFingerprint { + path: "History.db".to_string(), + sha256: "synthetic-safari-hash".to_string(), + }], + } +} + +// ── Chromium helpers ──────────────────────────────────────────────── + +fn chromium_profile(profile_id: &str, browser_name: &str) -> crate::models::BrowserProfile { + crate::models::BrowserProfile { + profile_id: profile_id.to_string(), + profile_name: "Default".to_string(), + browser_family: "chromium".to_string(), + browser_name: browser_name.to_string(), + user_name: Some("synthetic-user".to_string()), + profile_path: format!("/synthetic/{profile_id}"), + history_path: Some(format!("/synthetic/{profile_id}/History")), + favicons_path: None, + history_exists: true, + history_readable: true, + access_issue: None, + browser_version: Some("146.0.0.0".to_string()), + history_file_name: "History".to_string(), + history_bytes: 128, + favicons_bytes: 0, + supporting_bytes: 0, + retention_boundary: crate::models::BrowserRetentionBoundary::default(), + } +} + +fn chromium_visit_row(id: i64, url_id: i64, visit_time_unix_ms: i64) -> ChromiumVisitRow { + ChromiumVisitRow { + id, + url_id, + visit_time_unix_ms, + from_visit: Some(0), + transition: Some(805306368), + visit_duration_micros: Some(5_000_000), + is_known_to_sync: false, + visited_link_id: None, + external_referrer_url: None, + app_id: None, + } +} + +fn snapshot_for_chromium_fixture( + fixture: &ChromiumHistoryFixture, + profile: crate::models::BrowserProfile, +) -> ProfileSnapshot { + let temp_dir = tempdir().expect("snapshot tempdir"); + let history_path = temp_dir.path().join("History"); + fixture.write(&history_path).expect("write chromium fixture"); + let history_bytes = std::fs::metadata(&history_path).map(|meta| meta.len()).unwrap_or(0); + let mut profile = profile; + profile.history_bytes = history_bytes; + ProfileSnapshot { + profile, + temp_dir, + history_path, + favicons_path: None, + source_hashes: vec![FileFingerprint { + path: "History".to_string(), + sha256: "synthetic-fixture-hash".to_string(), + }], + } +} + +// ====================================================================== +// F1: Firefox baseline import — happy path +// ====================================================================== + +/// F1 — One Firefox profile, one ingest pass. Asserts every fixture row +/// lands in the canonical archive with correct URL count, visit count, +/// timestamps, and field values matching fixture input. This is the +/// Firefox analog of C1 (Chromium baseline). +#[test] +fn f1_firefox_baseline_import() { + let env = ScenarioEnv::new(); + + // 2026-05-01 00:00, 2026-05-02 12:00, 2026-05-03 08:15:30, + // 2026-05-04 10:00, 2026-05-05 14:30 + let t1 = 1_777_680_000_000_i64; + let t2 = 1_777_809_600_000_i64; + let t3 = 1_777_872_930_000_i64; + let t4 = 1_777_939_200_000_i64; + let t5 = 1_778_041_800_000_i64; + + let fixture = FirefoxPlacesFixture::new() + .add_place(FirefoxPlaceRow { + id: 1, + url: "https://example.com/firefox-article-one".to_string(), + title: Some("Firefox Article One".to_string()), + visit_count: 2, + hidden: false, + last_visit_unix_ms: t2, + }) + .add_place(FirefoxPlaceRow { + id: 2, + url: "https://example.org/firefox-article-two".to_string(), + title: Some("Firefox Article Two".to_string()), + visit_count: 2, + hidden: false, + last_visit_unix_ms: t4, + }) + .add_place(FirefoxPlaceRow { + id: 3, + url: "https://example.net/firefox-article-three".to_string(), + title: Some("Firefox Article Three".to_string()), + visit_count: 1, + hidden: false, + last_visit_unix_ms: t5, + }) + // 5 visits across 3 URLs + .add_visit(FirefoxVisitRow { + id: 10, + place_id: 1, + visit_time_unix_ms: t1, + from_visit: None, + visit_type: Some(1), + }) + .add_visit(FirefoxVisitRow { + id: 11, + place_id: 1, + visit_time_unix_ms: t2, + from_visit: Some(10), + visit_type: Some(2), + }) + .add_visit(FirefoxVisitRow { + id: 12, + place_id: 2, + visit_time_unix_ms: t3, + from_visit: None, + visit_type: Some(1), + }) + .add_visit(FirefoxVisitRow { + id: 13, + place_id: 2, + visit_time_unix_ms: t4, + from_visit: Some(12), + visit_type: Some(1), + }) + .add_visit(FirefoxVisitRow { + id: 14, + place_id: 3, + visit_time_unix_ms: t5, + from_visit: None, + visit_type: Some(5), + }); + + let snapshot = firefox_snapshot(&fixture, "firefox:Default"); + let summary = run_one_ingest(&env, 1, &snapshot, false); + + // Summary must report exactly what the fixture contained. + assert_eq!(summary.new_urls, 3, "summary reports 3 new urls"); + assert_eq!(summary.new_visits, 5, "summary reports 5 new visits"); + + // Archive row counts match fixture. + assert_eq!(count_archive_rows(&env, "urls"), 3); + assert_eq!(count_archive_rows(&env, "visits"), 5); + assert_eq!(count_urls_for_profile(&env, "firefox:Default"), 3); + assert_eq!(count_visits_for_profile(&env, "firefox:Default"), 5); + + // Source visit IDs flow through unmodified. + let visit_ids = collect_visit_source_ids(&env, "firefox:Default"); + assert_eq!(visit_ids, vec!["10", "11", "12", "13", "14"]); + + // Spot-check visit timestamps round-tripped correctly. + let archive = env.open_archive(); + let first_visit_ms: i64 = archive + .query_row( + "SELECT visit_time_ms FROM visits + JOIN source_profiles ON source_profiles.id = visits.source_profile_id + WHERE source_profiles.profile_key = 'firefox:Default' + AND visits.source_visit_id = '10'", + [], + |row| row.get(0), + ) + .expect("query first visit time"); + assert_eq!(first_visit_ms, t1, "first visit timestamp must match fixture"); + + let last_visit_ms: i64 = archive + .query_row( + "SELECT visit_time_ms FROM visits + JOIN source_profiles ON source_profiles.id = visits.source_profile_id + WHERE source_profiles.profile_key = 'firefox:Default' + AND visits.source_visit_id = '14'", + [], + |row| row.get(0), + ) + .expect("query last visit time"); + assert_eq!(last_visit_ms, t5, "last visit timestamp must match fixture"); + + // URL title landed correctly. + let title: Option = archive + .query_row( + "SELECT title FROM urls + JOIN source_profiles ON source_profiles.id = urls.source_profile_id + WHERE source_profiles.profile_key = 'firefox:Default' + AND urls.source_url_id = 1", + [], + |row| row.get(0), + ) + .expect("query url title"); + assert_eq!(title.as_deref(), Some("Firefox Article One")); +} + +// ====================================================================== +// S1: Safari baseline import — happy path +// ====================================================================== + +/// S1 — One Safari profile, one ingest pass. Asserts every fixture row +/// lands in the canonical archive with correct URL count, visit count, +/// timestamps, and field values matching fixture input. This is the +/// Safari analog of C1 (Chromium baseline). +#[test] +fn s1_safari_baseline_import() { + let env = ScenarioEnv::new(); + + // 2026-05-01 00:00, 2026-05-02 12:00, 2026-05-03 08:15:30, + // 2026-05-04 10:00, 2026-05-05 14:30 + let t1 = 1_777_680_000_000_i64; + let t2 = 1_777_809_600_000_i64; + let t3 = 1_777_872_930_000_i64; + let t4 = 1_777_939_200_000_i64; + let t5 = 1_778_041_800_000_i64; + + let fixture = SafariHistoryFixture::new() + .add_item(SafariHistoryItemRow { + id: 1, + url: "https://example.com/safari-article-one".to_string(), + }) + .add_item(SafariHistoryItemRow { + id: 2, + url: "https://example.org/safari-article-two".to_string(), + }) + .add_item(SafariHistoryItemRow { + id: 3, + url: "https://example.net/safari-article-three".to_string(), + }) + // 5 visits across 3 items + .add_visit(safari_visit(10, 1, "Safari Article One", t1)) + .add_visit(safari_visit(11, 1, "Safari Article One", t2)) + .add_visit(safari_visit(12, 2, "Safari Article Two", t3)) + .add_visit(safari_visit(13, 2, "Safari Article Two", t4)) + .add_visit(safari_visit(14, 3, "Safari Article Three", t5)); + + let snapshot = safari_snapshot(&fixture, "safari:Default"); + let summary = run_one_ingest(&env, 1, &snapshot, false); + + // Summary must report exactly what the fixture contained. + assert_eq!(summary.new_urls, 3, "summary reports 3 new urls"); + assert_eq!(summary.new_visits, 5, "summary reports 5 new visits"); + + // Archive row counts match fixture. + assert_eq!(count_archive_rows(&env, "urls"), 3); + assert_eq!(count_archive_rows(&env, "visits"), 5); + assert_eq!(count_urls_for_profile(&env, "safari:Default"), 3); + assert_eq!(count_visits_for_profile(&env, "safari:Default"), 5); + + // Source visit IDs flow through unmodified. + let visit_ids = collect_visit_source_ids(&env, "safari:Default"); + assert_eq!(visit_ids, vec!["10", "11", "12", "13", "14"]); + + // Spot-check visit timestamps round-tripped correctly. + let archive = env.open_archive(); + let first_visit_ms: i64 = archive + .query_row( + "SELECT visit_time_ms FROM visits + JOIN source_profiles ON source_profiles.id = visits.source_profile_id + WHERE source_profiles.profile_key = 'safari:Default' + AND visits.source_visit_id = '10'", + [], + |row| row.get(0), + ) + .expect("query first visit time"); + assert_eq!(first_visit_ms, t1, "first visit timestamp must match fixture"); + + let last_visit_ms: i64 = archive + .query_row( + "SELECT visit_time_ms FROM visits + JOIN source_profiles ON source_profiles.id = visits.source_profile_id + WHERE source_profiles.profile_key = 'safari:Default' + AND visits.source_visit_id = '14'", + [], + |row| row.get(0), + ) + .expect("query last visit time"); + assert_eq!(last_visit_ms, t5, "last visit timestamp must match fixture"); + + // URL title landed correctly (Safari carries title on visits, not items; + // the parser should populate url.title from the most recent visit title). + let title: Option = archive + .query_row( + "SELECT title FROM urls + JOIN source_profiles ON source_profiles.id = urls.source_profile_id + WHERE source_profiles.profile_key = 'safari:Default' + AND urls.source_url_id = 1", + [], + |row| row.get(0), + ) + .expect("query url title"); + assert!(title.is_some(), "Safari URL title should be populated from visit title"); +} + +// ====================================================================== +// Chromium fingerprint dedup — same visits, different source_visit_ids +// ====================================================================== + +/// Chromium fingerprint dedup — Imports a Chromium fixture, then +/// re-imports the exact same visits but with DIFFERENT `source_visit_id` +/// values (simulating a database rebuild or ID reassignment). The +/// `(source_profile_id, event_fingerprint)` partial unique index must +/// catch these as duplicates. No duplicate visit rows should be created. +#[test] +fn chromium_fingerprint_dedup_catches_same_visits_with_different_source_ids() { + let env = ScenarioEnv::new(); + + let t1 = 1_777_680_000_000_i64; + let t2 = 1_777_809_600_000_i64; + let t3 = 1_777_872_930_000_i64; + + // First import: visit IDs 10, 11, 12. + let first_fixture = ChromiumHistoryFixture::new() + .add_url(ChromiumUrlRow { + id: 1, + url: "https://example.com/fingerprint-test-one".to_string(), + title: Some("Fingerprint Test One".to_string()), + visit_count: 2, + typed_count: 1, + last_visit_unix_ms: t2, + hidden: false, + }) + .add_url(ChromiumUrlRow { + id: 2, + url: "https://example.org/fingerprint-test-two".to_string(), + title: Some("Fingerprint Test Two".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: t3, + hidden: false, + }) + .add_visit(chromium_visit_row(10, 1, t1)) + .add_visit(chromium_visit_row(11, 1, t2)) + .add_visit(chromium_visit_row(12, 2, t3)); + + let first_snapshot = snapshot_for_chromium_fixture( + &first_fixture, + chromium_profile("chrome:Default", "Google Chrome"), + ); + let first_summary = run_one_ingest(&env, 1, &first_snapshot, false); + assert_eq!(first_summary.new_urls, 2); + assert_eq!(first_summary.new_visits, 3); + drop(first_snapshot); + + // Second import: SAME URLs and visit times, but source_visit_ids are + // different (100, 101, 102 instead of 10, 11, 12). This simulates a + // Chrome database rebuild where rowids get reassigned but the actual + // browsing events are identical. + let second_fixture = ChromiumHistoryFixture::new() + .add_url(ChromiumUrlRow { + id: 1, + url: "https://example.com/fingerprint-test-one".to_string(), + title: Some("Fingerprint Test One".to_string()), + visit_count: 2, + typed_count: 1, + last_visit_unix_ms: t2, + hidden: false, + }) + .add_url(ChromiumUrlRow { + id: 2, + url: "https://example.org/fingerprint-test-two".to_string(), + title: Some("Fingerprint Test Two".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: t3, + hidden: false, + }) + .add_visit(chromium_visit_row(100, 1, t1)) + .add_visit(chromium_visit_row(101, 1, t2)) + .add_visit(chromium_visit_row(102, 2, t3)); + + let second_snapshot = snapshot_for_chromium_fixture( + &second_fixture, + chromium_profile("chrome:Default", "Google Chrome"), + ); + let second_summary = run_one_ingest(&env, 2, &second_snapshot, false); + + // The fingerprint partial index should catch all 3 visits as duplicates. + assert_eq!( + second_summary.new_visits, 0, + "fingerprint dedup must catch same visits with different source_visit_ids" + ); + + // Archive row counts must stay at the first import's values. + assert_eq!(count_archive_rows(&env, "urls"), 2); + assert_eq!( + count_visits_for_profile(&env, "chrome:Default"), + 3, + "no duplicate visits should be created despite different source_visit_ids" + ); +} + +// ====================================================================== +// F_C2: Firefox incremental no-new-data — watermark prevents re-import +// ====================================================================== + +/// F_C2 — Re-importing the same Firefox fixture with `use_watermark = true` +/// must produce zero new rows. The watermark advance after the first import +/// should make the second import a no-op at the parser level. This is the +/// Firefox analog of C2 (Chromium incremental no-new-data). +#[test] +fn f_c2_firefox_incremental_no_new_data() { + let env = ScenarioEnv::new(); + let (t1, t2, t3, t4, t5) = ( + 1_777_680_000_000_i64, + 1_777_809_600_000_i64, + 1_777_872_930_000_i64, + 1_777_939_200_000_i64, + 1_778_041_800_000_i64, + ); + + let build_fixture = || { + FirefoxPlacesFixture::new() + .add_place(FirefoxPlaceRow { + id: 1, + url: "https://example.com/firefox-article-one".to_string(), + title: Some("Firefox Article One".to_string()), + visit_count: 2, + hidden: false, + last_visit_unix_ms: t2, + }) + .add_place(FirefoxPlaceRow { + id: 2, + url: "https://example.org/firefox-article-two".to_string(), + title: Some("Firefox Article Two".to_string()), + visit_count: 2, + hidden: false, + last_visit_unix_ms: t4, + }) + .add_place(FirefoxPlaceRow { + id: 3, + url: "https://example.net/firefox-article-three".to_string(), + title: Some("Firefox Article Three".to_string()), + visit_count: 1, + hidden: false, + last_visit_unix_ms: t5, + }) + .add_visit(FirefoxVisitRow { + id: 10, + place_id: 1, + visit_time_unix_ms: t1, + from_visit: None, + visit_type: Some(1), + }) + .add_visit(FirefoxVisitRow { + id: 11, + place_id: 1, + visit_time_unix_ms: t2, + from_visit: Some(10), + visit_type: Some(2), + }) + .add_visit(FirefoxVisitRow { + id: 12, + place_id: 2, + visit_time_unix_ms: t3, + from_visit: None, + visit_type: Some(1), + }) + .add_visit(FirefoxVisitRow { + id: 13, + place_id: 2, + visit_time_unix_ms: t4, + from_visit: Some(12), + visit_type: Some(1), + }) + .add_visit(FirefoxVisitRow { + id: 14, + place_id: 3, + visit_time_unix_ms: t5, + from_visit: None, + visit_type: Some(5), + }) + }; + + // First import: baseline — no watermark. + let first_snapshot = firefox_snapshot(&build_fixture(), "firefox:Default"); + run_one_ingest(&env, 1, &first_snapshot, false); + drop(first_snapshot); + + // Second import: identical data — watermark should skip everything. + let second_snapshot = firefox_snapshot(&build_fixture(), "firefox:Default"); + let summary = run_one_ingest(&env, 2, &second_snapshot, true); + + assert_eq!(summary.new_urls, 0, "second import must add no new URL rows"); + assert_eq!(summary.new_visits, 0, "second import must add no new visit rows"); + + // Archive row counts must stay at the first import's values. + assert_eq!(count_archive_rows(&env, "urls"), 3); + assert_eq!(count_archive_rows(&env, "visits"), 5); + assert_eq!(count_urls_for_profile(&env, "firefox:Default"), 3); + assert_eq!(count_visits_for_profile(&env, "firefox:Default"), 5); +} + +// ====================================================================== +// S_C2: Safari incremental no-new-data — watermark prevents re-import +// ====================================================================== + +/// S_C2 — Re-importing the same Safari fixture with `use_watermark = true` +/// must produce zero new rows. The watermark advance after the first import +/// should make the second import a no-op at the parser level. This is the +/// Safari analog of C2 (Chromium incremental no-new-data). +#[test] +fn s_c2_safari_incremental_no_new_data() { + let env = ScenarioEnv::new(); + let (t1, t2, t3, t4, t5) = ( + 1_777_680_000_000_i64, + 1_777_809_600_000_i64, + 1_777_872_930_000_i64, + 1_777_939_200_000_i64, + 1_778_041_800_000_i64, + ); + + let build_fixture = || { + SafariHistoryFixture::new() + .add_item(SafariHistoryItemRow { + id: 1, + url: "https://example.com/safari-article-one".to_string(), + }) + .add_item(SafariHistoryItemRow { + id: 2, + url: "https://example.org/safari-article-two".to_string(), + }) + .add_item(SafariHistoryItemRow { + id: 3, + url: "https://example.net/safari-article-three".to_string(), + }) + .add_visit(safari_visit(10, 1, "Safari Article One", t1)) + .add_visit(safari_visit(11, 1, "Safari Article One", t2)) + .add_visit(safari_visit(12, 2, "Safari Article Two", t3)) + .add_visit(safari_visit(13, 2, "Safari Article Two", t4)) + .add_visit(safari_visit(14, 3, "Safari Article Three", t5)) + }; + + // First import: baseline — no watermark. + let first_snapshot = safari_snapshot(&build_fixture(), "safari:Default"); + run_one_ingest(&env, 1, &first_snapshot, false); + drop(first_snapshot); + + // Second import: identical data — watermark should skip everything. + let second_snapshot = safari_snapshot(&build_fixture(), "safari:Default"); + let summary = run_one_ingest(&env, 2, &second_snapshot, true); + + assert_eq!(summary.new_urls, 0, "second import must add no new URL rows"); + assert_eq!(summary.new_visits, 0, "second import must add no new visit rows"); + + // Archive row counts must stay at the first import's values. + assert_eq!(count_archive_rows(&env, "urls"), 3); + assert_eq!(count_archive_rows(&env, "visits"), 5); + assert_eq!(count_urls_for_profile(&env, "safari:Default"), 3); + assert_eq!(count_visits_for_profile(&env, "safari:Default"), 5); +} + +// ---------------------------------------------------------------------- +// F2: Firefox incremental revisit of an old URL drops the new visit (B2) +// ---------------------------------------------------------------------- + +/// F2 — Firefox equivalent of C3, regression test for audit bug B2. +/// The Chromium parser's `INGEST_URLS_SQL` has an +/// `OR id IN (SELECT DISTINCT url FROM visits WHERE id > ?2)` fallback +/// to catch URLs whose `last_visit_time` is below the URL watermark but +/// which received a new visit anyway. Firefox grew the equivalent OR +/// fallback in `firefox/mod.rs:32-44` as part of the B2 fix (commit +/// 6884c10d); this scenario pins that fix in place. If the Firefox +/// URL stream loses the OR-subquery in a future refactor, the new +/// visit's `url_id_map.get` will fail and `ArchiveChunkConsumer::visits` +/// will silently drop the row — the assertion below would then fail. +#[test] +fn f2_firefox_incremental_revisit_of_old_url_drops_visit_demonstrates_b2() { + let env = ScenarioEnv::new(); + // Long-tail URL (T1) + anchor URL (T2) so the URL watermark + // advances past T1 after the first import; the second-pass URL + // query then excludes the long-tail URL. + let visit_long_tail_ms = 1_777_680_000_000_i64; + let visit_anchor_ms = 1_777_809_600_000_i64; + let visit_revisit_ms = 1_777_872_930_000_i64; + + let first_fixture = FirefoxPlacesFixture::new() + .add_place(FirefoxPlaceRow { + id: 1, + url: "https://example.com/firefox-long-tail".to_string(), + title: Some("Firefox Long Tail".to_string()), + visit_count: 1, + hidden: false, + last_visit_unix_ms: visit_long_tail_ms, + }) + .add_place(FirefoxPlaceRow { + id: 2, + url: "https://example.com/firefox-anchor".to_string(), + title: Some("Firefox Anchor".to_string()), + visit_count: 1, + hidden: false, + last_visit_unix_ms: visit_anchor_ms, + }) + .add_visit(FirefoxVisitRow { + id: 10, + place_id: 1, + visit_time_unix_ms: visit_long_tail_ms, + from_visit: None, + visit_type: Some(1), + }) + .add_visit(FirefoxVisitRow { + id: 20, + place_id: 2, + visit_time_unix_ms: visit_anchor_ms, + from_visit: None, + visit_type: Some(1), + }); + let first_snapshot = firefox_snapshot(&first_fixture, "firefox:Default"); + run_one_ingest(&env, 1, &first_snapshot, false); + drop(first_snapshot); + + // Pass 2: URL 1's last_visit_date stays at T1 (below the watermark); + // its new visit (id=30, time > T2) only appears in moz_historyvisits. + // Without the OR fallback the URL is filtered out and the visit's + // url_id_map lookup fails. + let second_fixture = FirefoxPlacesFixture::new() + .add_place(FirefoxPlaceRow { + id: 1, + url: "https://example.com/firefox-long-tail".to_string(), + title: Some("Firefox Long Tail".to_string()), + visit_count: 2, + hidden: false, + last_visit_unix_ms: visit_long_tail_ms, + }) + .add_place(FirefoxPlaceRow { + id: 2, + url: "https://example.com/firefox-anchor".to_string(), + title: Some("Firefox Anchor".to_string()), + visit_count: 1, + hidden: false, + last_visit_unix_ms: visit_anchor_ms, + }) + .add_visit(FirefoxVisitRow { + id: 10, + place_id: 1, + visit_time_unix_ms: visit_long_tail_ms, + from_visit: None, + visit_type: Some(1), + }) + .add_visit(FirefoxVisitRow { + id: 20, + place_id: 2, + visit_time_unix_ms: visit_anchor_ms, + from_visit: None, + visit_type: Some(1), + }) + .add_visit(FirefoxVisitRow { + id: 30, + place_id: 1, + visit_time_unix_ms: visit_revisit_ms, + from_visit: Some(20), + visit_type: Some(1), + }); + let second_snapshot = firefox_snapshot(&second_fixture, "firefox:Default"); + run_one_ingest(&env, 2, &second_snapshot, true); + + let visits = count_visits_for_profile(&env, "firefox:Default"); + assert_eq!( + visits, 3, + "B2 fix required for Firefox: long-tail revisit silently dropped (got {visits})" + ); +} + +// ---------------------------------------------------------------------- +// S2: Safari long-tail revisit correctly handled — refutes B2 for Safari +// ---------------------------------------------------------------------- + +/// S2 — Audit **B2** lumped Firefox and Safari together as both missing +/// the Chromium OR-fallback. The harness proved that Safari does not +/// actually have the bug: the Safari URL query at `safari/mod.rs:42-56` +/// computes `MAX(history_visits.visit_time)` *on the fly* from the +/// visits table (Safari's `history_items` table has no cached +/// `last_visit_time` column), so any new visit row immediately raises +/// the item's effective last-visit time and the URL gets re-streamed +/// without needing an OR fallback. This contract scenario pins that +/// correct behavior — if a future refactor introduces a stored +/// `last_visit_time` cache on `history_items` without the OR fallback, +/// the same long-tail revisit bug would emerge and this test would +/// flip from passing to failing. +#[test] +fn s2_safari_long_tail_revisit_captured_without_or_fallback() { + let env = ScenarioEnv::new(); + // Long-tail item (T1) + anchor item (T2). The anchor pushes the URL + // watermark past T1; the second-pass Safari URL query (which + // computes per-item MAX(visit_time) on the fly) excludes the + // long-tail item; the new visit references it and gets dropped. + let visit_long_tail_ms = 1_777_680_000_000_i64; + let visit_anchor_ms = 1_777_809_600_000_i64; + let visit_revisit_ms = 1_777_872_930_000_i64; + + let first_fixture = SafariHistoryFixture::new() + .add_item(SafariHistoryItemRow { + id: 1, + url: "https://example.com/safari-long-tail".to_string(), + }) + .add_item(SafariHistoryItemRow { + id: 2, + url: "https://example.com/safari-anchor".to_string(), + }) + .add_visit(safari_visit(9, 1, "Safari Long Tail", visit_long_tail_ms)) + .add_visit(safari_visit(19, 2, "Safari Anchor", visit_anchor_ms)); + let first_snapshot = safari_snapshot(&first_fixture, "safari:Default"); + run_one_ingest(&env, 1, &first_snapshot, false); + drop(first_snapshot); + + let second_fixture = SafariHistoryFixture::new() + .add_item(SafariHistoryItemRow { + id: 1, + url: "https://example.com/safari-long-tail".to_string(), + }) + .add_item(SafariHistoryItemRow { + id: 2, + url: "https://example.com/safari-anchor".to_string(), + }) + .add_visit(safari_visit(9, 1, "Safari Long Tail", visit_long_tail_ms)) + .add_visit(safari_visit(19, 2, "Safari Anchor", visit_anchor_ms)) + .add_visit(safari_visit(29, 1, "Safari Long Tail Revisited", visit_revisit_ms)); + let second_snapshot = safari_snapshot(&second_fixture, "safari:Default"); + run_one_ingest(&env, 2, &second_snapshot, true); + + let visits = count_visits_for_profile(&env, "safari:Default"); + assert_eq!( + visits, 3, + "Safari MAX(visit_time)-computed URL query already handles long-tail revisits without an OR fallback" + ); +} diff --git a/src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_edge_cases.rs b/src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_edge_cases.rs new file mode 100644 index 00000000..edd45b82 --- /dev/null +++ b/src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_edge_cases.rs @@ -0,0 +1,977 @@ +//! Edge-case and contract-pinning ingest scenarios. +//! +//! These tests complement `dedup_scenarios.rs` (main Chromium dedup paths) +//! and `dedup_scenarios_baselines.rs` (Firefox/Safari baselines) by covering: +//! - **C_SUB_MS (E5)**: Sub-millisecond Chrome visit collision +//! - **E6**: URL canonicalization — no normalization applied +//! - **Empty DB**: Zero-row fixtures for all browser families +//! - **R1**: Corrupt / malformed source database resilience +//! - **E1-E4**: Time boundary edge cases (epoch, year-2038, far-future, negative) +//! - **E7**: NULL title handling +//! - **E8**: Unicode (CJK, percent-encoded, emoji) byte-identical round-trip +//! - **E9**: `hidden = true` URL flag round-trip + +use super::*; +use browser_history_fixtures::{ + ChromiumHistoryFixture, ChromiumUrlRow, ChromiumVisitRow, FirefoxPlacesFixture, + SafariHistoryFixture, +}; +use std::io::Write; +use tempfile::tempdir; + +// ── Shared helpers (mirror dedup_scenarios.rs patterns) ───────────── + +fn test_config() -> AppConfig { + AppConfig { initialized: true, ..AppConfig::default() } +} + +fn test_paths(root: &Path) -> ProjectPaths { + crate::config::project_paths_with_root(root) +} + +/// Holds the long-lived resources one scenario needs across multiple +/// imports (same as dedup_scenarios::ScenarioEnv). +struct ScenarioEnv { + _root: tempfile::TempDir, + paths: ProjectPaths, + config: AppConfig, +} + +impl ScenarioEnv { + fn new() -> Self { + let root = tempdir().expect("scenario root tempdir"); + let paths = test_paths(root.path()); + let config = test_config(); + crate::config::ensure_paths(&paths).expect("ensure paths"); + Self { _root: root, paths, config } + } + + fn open_archive(&self) -> rusqlite::Connection { + open_archive_connection(&self.paths, &self.config, None).expect("open archive") + } +} + +fn seed_run(archive: &Transaction<'_>, run_id: i64) { + archive + .execute( + "INSERT INTO runs (id, run_type, trigger, started_at, timezone, status, profile_scope_json, warnings_json, stats_json, due_only) + VALUES (?1, 'backup', 'manual', '2026-05-25T00:00:00+00:00', 'UTC', 'running', '[]', '[]', '{}', 0)", + [run_id], + ) + .expect("seed run"); +} + +fn run_one_ingest( + env: &ScenarioEnv, + run_id: i64, + snapshot: &ProfileSnapshot, + use_watermark: bool, +) -> BackupProfileSummary { + let mut archive = env.open_archive(); + let transaction = archive.transaction().expect("scenario transaction"); + seed_run(&transaction, run_id); + let mut snapshot_artifacts = Vec::new(); + let mut source_evidence_plans = Vec::new(); + let summary = process_profile_snapshot( + &transaction, + run_id, + &env.paths, + &env.config, + snapshot, + &mut snapshot_artifacts, + &mut source_evidence_plans, + false, + use_watermark, + ) + .expect("process profile snapshot"); + transaction.commit().expect("commit scenario transaction"); + summary +} + +fn count_archive_rows(env: &ScenarioEnv, table: &str) -> i64 { + let archive = env.open_archive(); + archive + .query_row(&format!("SELECT COUNT(*) FROM {table}"), [], |row| row.get(0)) + .expect("count rows") +} + +fn count_urls_for_profile(env: &ScenarioEnv, profile_key: &str) -> i64 { + let archive = env.open_archive(); + archive + .query_row( + "SELECT COUNT(*) FROM urls + JOIN source_profiles ON source_profiles.id = urls.source_profile_id + WHERE source_profiles.profile_key = ?1", + [profile_key], + |row| row.get(0), + ) + .expect("count urls for profile") +} + +fn count_visits_for_profile(env: &ScenarioEnv, profile_key: &str) -> i64 { + let archive = env.open_archive(); + archive + .query_row( + "SELECT COUNT(*) FROM visits + JOIN source_profiles ON source_profiles.id = visits.source_profile_id + WHERE source_profiles.profile_key = ?1", + [profile_key], + |row| row.get(0), + ) + .expect("count visits for profile") +} + +// ── Chromium helpers ──────────────────────────────────────────────── + +fn chromium_profile(profile_id: &str, browser_name: &str) -> crate::models::BrowserProfile { + crate::models::BrowserProfile { + profile_id: profile_id.to_string(), + profile_name: "Default".to_string(), + browser_family: "chromium".to_string(), + browser_name: browser_name.to_string(), + user_name: Some("synthetic-user".to_string()), + profile_path: format!("/synthetic/{profile_id}"), + history_path: Some(format!("/synthetic/{profile_id}/History")), + favicons_path: None, + history_exists: true, + history_readable: true, + access_issue: None, + browser_version: Some("146.0.0.0".to_string()), + history_file_name: "History".to_string(), + history_bytes: 128, + favicons_bytes: 0, + supporting_bytes: 0, + retention_boundary: crate::models::BrowserRetentionBoundary::default(), + } +} + +fn chromium_visit_row(id: i64, url_id: i64, visit_time_unix_ms: i64) -> ChromiumVisitRow { + ChromiumVisitRow { + id, + url_id, + visit_time_unix_ms, + from_visit: Some(0), + transition: Some(805306368), + visit_duration_micros: Some(5_000_000), + is_known_to_sync: false, + visited_link_id: None, + external_referrer_url: None, + app_id: None, + } +} + +fn snapshot_for_chromium_fixture( + fixture: &ChromiumHistoryFixture, + profile: crate::models::BrowserProfile, +) -> ProfileSnapshot { + let temp_dir = tempdir().expect("snapshot tempdir"); + let history_path = temp_dir.path().join("History"); + fixture.write(&history_path).expect("write chromium fixture"); + let history_bytes = std::fs::metadata(&history_path).map(|meta| meta.len()).unwrap_or(0); + let mut profile = profile; + profile.history_bytes = history_bytes; + ProfileSnapshot { + profile, + temp_dir, + history_path, + favicons_path: None, + source_hashes: vec![FileFingerprint { + path: "History".to_string(), + sha256: "synthetic-fixture-hash".to_string(), + }], + } +} + +// ── Firefox helpers ───────────────────────────────────────────────── + +fn firefox_profile(profile_id: &str) -> crate::models::BrowserProfile { + crate::models::BrowserProfile { + profile_id: profile_id.to_string(), + profile_name: "Default".to_string(), + browser_family: "firefox".to_string(), + browser_name: "Firefox".to_string(), + user_name: Some("synthetic-user".to_string()), + profile_path: format!("/synthetic/{profile_id}"), + history_path: Some(format!("/synthetic/{profile_id}/places.sqlite")), + favicons_path: None, + history_exists: true, + history_readable: true, + access_issue: None, + browser_version: Some("125.0".to_string()), + history_file_name: "places.sqlite".to_string(), + history_bytes: 128, + favicons_bytes: 0, + supporting_bytes: 0, + retention_boundary: crate::models::BrowserRetentionBoundary::default(), + } +} + +fn firefox_snapshot(fixture: &FirefoxPlacesFixture, profile_id: &str) -> ProfileSnapshot { + let temp_dir = tempdir().expect("firefox snapshot tempdir"); + let history_path = temp_dir.path().join("places.sqlite"); + fixture.write(&history_path).expect("write firefox fixture"); + let history_bytes = std::fs::metadata(&history_path).map(|meta| meta.len()).unwrap_or(0); + let mut profile = firefox_profile(profile_id); + profile.history_bytes = history_bytes; + ProfileSnapshot { + profile, + temp_dir, + history_path, + favicons_path: None, + source_hashes: vec![FileFingerprint { + path: "places.sqlite".to_string(), + sha256: "synthetic-firefox-hash".to_string(), + }], + } +} + +// ── Safari helpers ────────────────────────────────────────────────── + +fn safari_profile(profile_id: &str) -> crate::models::BrowserProfile { + crate::models::BrowserProfile { + profile_id: profile_id.to_string(), + profile_name: "Default".to_string(), + browser_family: "safari".to_string(), + browser_name: "Safari".to_string(), + user_name: Some("synthetic-user".to_string()), + profile_path: format!("/synthetic/{profile_id}"), + history_path: Some(format!("/synthetic/{profile_id}/History.db")), + favicons_path: None, + history_exists: true, + history_readable: true, + access_issue: None, + browser_version: Some("18.4".to_string()), + history_file_name: "History.db".to_string(), + history_bytes: 128, + favicons_bytes: 0, + supporting_bytes: 0, + retention_boundary: crate::models::BrowserRetentionBoundary::default(), + } +} + +fn safari_snapshot(fixture: &SafariHistoryFixture, profile_id: &str) -> ProfileSnapshot { + let temp_dir = tempdir().expect("safari snapshot tempdir"); + let history_path = temp_dir.path().join("History.db"); + fixture.write(&history_path).expect("write safari fixture"); + let history_bytes = std::fs::metadata(&history_path).map(|meta| meta.len()).unwrap_or(0); + let mut profile = safari_profile(profile_id); + profile.history_bytes = history_bytes; + ProfileSnapshot { + profile, + temp_dir, + history_path, + favicons_path: None, + source_hashes: vec![FileFingerprint { + path: "History.db".to_string(), + sha256: "synthetic-safari-hash".to_string(), + }], + } +} + +// ====================================================================== +// C_SUB_MS (E5) — Sub-millisecond Chrome visit collision contract +// ====================================================================== + +/// C_SUB_MS (E5) — Sub-millisecond Chrome visit collision contract. +/// +/// Chrome stores visit times at microsecond precision; our parser truncates +/// to milliseconds. Two visits to the same URL within the same millisecond +/// produce identical `event_fingerprint` values. The partial unique index +/// deduplicates the second visit even though source_visit_ids differ. +/// +/// This is a known acceptable limitation, not a bug. This test pins the +/// behavior so that any future precision change is caught. +#[test] +fn c_sub_ms_same_millisecond_visits_collapsed_by_fingerprint() { + let env = ScenarioEnv::new(); + + // Two visits to the same URL with different source_visit_ids but + // identical visit_time_unix_ms. The fingerprint computation uses + // unix_micros_to_chrome_time(visit_time_ms * 1000), so both visits + // produce the same Chrome time → same fingerprint → INSERT OR IGNORE + // silently skips the second. + let same_ms = 1_777_680_000_000_i64; // 2026-05-01T00:00:00Z + + let fixture = ChromiumHistoryFixture::new() + .add_url(ChromiumUrlRow { + id: 1, + url: "https://example.com/sub-ms-collision".to_string(), + title: Some("Sub-ms Collision".to_string()), + visit_count: 2, + typed_count: 0, + last_visit_unix_ms: same_ms, + hidden: false, + }) + .add_visit(chromium_visit_row(20, 1, same_ms)) + .add_visit(chromium_visit_row(21, 1, same_ms)); + + let snapshot = snapshot_for_chromium_fixture( + &fixture, + chromium_profile("chrome:Default", "Google Chrome"), + ); + let summary = run_one_ingest(&env, 1, &snapshot, false); + + // The parser delivers both visits, but only one survives archive insert: + // - Visit 20 inserted successfully (new source_visit_id, new fingerprint). + // - Visit 21 has a DIFFERENT source_visit_id (so UNIQUE(source_profile_id, + // source_visit_id) does not fire) but the SAME event_fingerprint (same + // url, same Chrome time, same title, same transition, same app_id). + // The partial unique index on (source_profile_id, event_fingerprint) + // triggers → INSERT OR IGNORE silently skips. + assert_eq!( + summary.new_visits, 1, + "only one of two same-millisecond visits should survive fingerprint dedup" + ); + assert_eq!(summary.new_urls, 1); + assert_eq!(count_visits_for_profile(&env, "chrome:Default"), 1); + assert_eq!(count_urls_for_profile(&env, "chrome:Default"), 1); +} + +// ====================================================================== +// E6 — URL canonicalization contract: no normalization applied +// ====================================================================== + +/// E6 — URL canonicalization contract pins. +/// +/// PathKeep stores URL strings as-is with NO normalization. Different URL +/// strings with different source_url_ids must be preserved as separate URL +/// rows even when they point to semantically "the same" resource. This +/// pins the contract so a future normalization change is caught. +#[test] +fn e6_url_strings_stored_verbatim_no_normalization() { + let env = ScenarioEnv::new(); + + let t1 = 1_777_680_000_000_i64; + let t2 = 1_777_809_600_000_i64; + let t3 = 1_777_872_930_000_i64; + let t4 = 1_777_939_200_000_i64; + + let fixture = ChromiumHistoryFixture::new() + .add_url(ChromiumUrlRow { + id: 1, + url: "https://example.com/path".to_string(), + title: Some("Base URL".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: t1, + hidden: false, + }) + .add_url(ChromiumUrlRow { + id: 2, + url: "https://example.com/path/".to_string(), + title: Some("Trailing Slash".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: t2, + hidden: false, + }) + .add_url(ChromiumUrlRow { + id: 3, + url: "https://example.com/page#section".to_string(), + title: Some("Fragment Preserved".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: t3, + hidden: false, + }) + .add_url(ChromiumUrlRow { + id: 4, + url: "https://Example.COM/Path".to_string(), + title: Some("Mixed Case".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: t4, + hidden: false, + }) + .add_visit(chromium_visit_row(10, 1, t1)) + .add_visit(chromium_visit_row(11, 2, t2)) + .add_visit(chromium_visit_row(12, 3, t3)) + .add_visit(chromium_visit_row(13, 4, t4)); + + let snapshot = snapshot_for_chromium_fixture( + &fixture, + chromium_profile("chrome:Default", "Google Chrome"), + ); + let summary = run_one_ingest(&env, 1, &snapshot, false); + + // All four URLs must be preserved as distinct rows. + assert_eq!(summary.new_urls, 4, "all URL variants must be separate rows"); + assert_eq!(summary.new_visits, 4); + assert_eq!(count_urls_for_profile(&env, "chrome:Default"), 4); + assert_eq!(count_visits_for_profile(&env, "chrome:Default"), 4); + + // Query back every URL string and assert verbatim storage. + let archive = env.open_archive(); + let expected_urls = [ + (1_i64, "https://example.com/path"), + (2, "https://example.com/path/"), + (3, "https://example.com/page#section"), + (4, "https://Example.COM/Path"), + ]; + for (source_url_id, expected_url) in expected_urls { + let stored_url: String = archive + .query_row( + "SELECT url FROM urls + JOIN source_profiles ON source_profiles.id = urls.source_profile_id + WHERE source_profiles.profile_key = 'chrome:Default' + AND urls.source_url_id = ?1", + [source_url_id], + |row| row.get(0), + ) + .unwrap_or_else(|_| panic!("query URL for source_url_id={source_url_id}")); + assert_eq!( + stored_url, expected_url, + "URL with source_url_id={source_url_id} must be stored verbatim" + ); + } +} + +// ====================================================================== +// Empty DB — Zero-row fixtures for all browser families +// ====================================================================== + +/// Empty Chromium fixture: import completes without error, summary is zero. +#[test] +fn empty_chromium_fixture_imports_without_error() { + let env = ScenarioEnv::new(); + let fixture = ChromiumHistoryFixture::new(); + let snapshot = + snapshot_for_chromium_fixture(&fixture, chromium_profile("chrome:Empty", "Google Chrome")); + let summary = run_one_ingest(&env, 1, &snapshot, false); + + assert_eq!(summary.new_urls, 0, "empty fixture must produce 0 new URLs"); + assert_eq!(summary.new_visits, 0, "empty fixture must produce 0 new visits"); + assert_eq!(count_archive_rows(&env, "urls"), 0); + assert_eq!(count_archive_rows(&env, "visits"), 0); +} + +/// Empty Firefox fixture: import completes without error, summary is zero. +#[test] +fn empty_firefox_fixture_imports_without_error() { + let env = ScenarioEnv::new(); + let fixture = FirefoxPlacesFixture::new(); + let snapshot = firefox_snapshot(&fixture, "firefox:Empty"); + let summary = run_one_ingest(&env, 1, &snapshot, false); + + assert_eq!(summary.new_urls, 0, "empty fixture must produce 0 new URLs"); + assert_eq!(summary.new_visits, 0, "empty fixture must produce 0 new visits"); + assert_eq!(count_archive_rows(&env, "urls"), 0); + assert_eq!(count_archive_rows(&env, "visits"), 0); +} + +/// Empty Safari fixture: import completes without error, summary is zero. +#[test] +fn empty_safari_fixture_imports_without_error() { + let env = ScenarioEnv::new(); + let fixture = SafariHistoryFixture::new(); + let snapshot = safari_snapshot(&fixture, "safari:Empty"); + let summary = run_one_ingest(&env, 1, &snapshot, false); + + assert_eq!(summary.new_urls, 0, "empty fixture must produce 0 new URLs"); + assert_eq!(summary.new_visits, 0, "empty fixture must produce 0 new visits"); + assert_eq!(count_archive_rows(&env, "urls"), 0); + assert_eq!(count_archive_rows(&env, "visits"), 0); +} + +// ====================================================================== +// R1 — Corrupt / malformed source database resilience +// ====================================================================== + +/// R1a — A file containing random bytes (not a valid SQLite database) must +/// cause `process_profile_snapshot` to return `Err`, not panic. +#[test] +fn r1a_corrupt_random_bytes_returns_error_not_panic() { + let env = ScenarioEnv::new(); + let snapshot_dir = tempdir().expect("corrupt snapshot tempdir"); + let corrupt_path = snapshot_dir.path().join("History"); + { + let mut file = std::fs::File::create(&corrupt_path).expect("create corrupt file"); + file.write_all(b"not a database at all, just random garbage bytes 0xDEADBEEF") + .expect("write corrupt bytes"); + } + + let profile = chromium_profile("chrome:Corrupt", "Google Chrome"); + let snapshot = ProfileSnapshot { + profile, + temp_dir: snapshot_dir, + history_path: corrupt_path, + favicons_path: None, + source_hashes: vec![FileFingerprint { + path: "History".to_string(), + sha256: "corrupt-hash".to_string(), + }], + }; + + let mut archive = env.open_archive(); + let transaction = archive.transaction().expect("transaction"); + seed_run(&transaction, 1); + let mut snapshot_artifacts = Vec::new(); + let mut source_evidence_plans = Vec::new(); + let result = process_profile_snapshot( + &transaction, + 1, + &env.paths, + &env.config, + &snapshot, + &mut snapshot_artifacts, + &mut source_evidence_plans, + false, + false, + ); + + assert!(result.is_err(), "corrupt random-bytes file must return Err, not panic"); +} + +/// R1b — A valid SQLite database but missing required browser tables must +/// cause `process_profile_snapshot` to return `Err`, not panic. +#[test] +fn r1b_valid_sqlite_missing_tables_returns_error_not_panic() { + let env = ScenarioEnv::new(); + let snapshot_dir = tempdir().expect("missing-tables snapshot tempdir"); + let db_path = snapshot_dir.path().join("History"); + { + let conn = rusqlite::Connection::open(&db_path).expect("create empty sqlite"); + conn.execute_batch("CREATE TABLE dummy (id INTEGER PRIMARY KEY)") + .expect("create dummy table"); + } + + let profile = chromium_profile("chrome:MissingTables", "Google Chrome"); + let snapshot = ProfileSnapshot { + profile, + temp_dir: snapshot_dir, + history_path: db_path, + favicons_path: None, + source_hashes: vec![FileFingerprint { + path: "History".to_string(), + sha256: "missing-tables-hash".to_string(), + }], + }; + + let mut archive = env.open_archive(); + let transaction = archive.transaction().expect("transaction"); + seed_run(&transaction, 1); + let mut snapshot_artifacts = Vec::new(); + let mut source_evidence_plans = Vec::new(); + let result = process_profile_snapshot( + &transaction, + 1, + &env.paths, + &env.config, + &snapshot, + &mut snapshot_artifacts, + &mut source_evidence_plans, + false, + false, + ); + + assert!(result.is_err(), "valid SQLite with missing browser tables must return Err, not panic"); +} + +// ====================================================================== +// E1-E4 — Time boundary edge cases +// ====================================================================== + +/// E1 — Epoch timestamp boundary: visit_time_ms = 0 (1970-01-01T00:00:00Z). +/// A zero timestamp is legal in the archive schema and must round-trip +/// without error. This pins the lower bound of the time domain. +#[test] +fn e1_epoch_timestamp_imports_without_error() { + let env = ScenarioEnv::new(); + let fixture = ChromiumHistoryFixture::new() + .add_url(ChromiumUrlRow { + id: 1, + url: "https://example.com/epoch".to_string(), + title: Some("Epoch Boundary".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: 0, + hidden: false, + }) + .add_visit(chromium_visit_row(1, 1, 0)); + let snapshot = + snapshot_for_chromium_fixture(&fixture, chromium_profile("chrome:Epoch", "Google Chrome")); + let summary = run_one_ingest(&env, 1, &snapshot, false); + assert_eq!(summary.new_urls, 1); + assert_eq!(summary.new_visits, 1); + // Verify the timestamp is stored as 0. + let archive = env.open_archive(); + let visit_time: i64 = archive + .query_row( + "SELECT visit_time_ms FROM visits + JOIN source_profiles ON source_profiles.id = visits.source_profile_id + WHERE source_profiles.profile_key = 'chrome:Epoch'", + [], + |row| row.get(0), + ) + .expect("query epoch visit time"); + assert_eq!(visit_time, 0, "epoch timestamp must round-trip as 0"); +} + +/// E2 — Year-2038 boundary (2038-01-19T03:14:07Z = 2_147_483_647_000 ms). +/// PathKeep uses i64 for timestamps, so the 32-bit overflow must be +/// transparent. This pins the contract. +#[test] +fn e2_year_2038_boundary_imports_without_error() { + let env = ScenarioEnv::new(); + let y2038_ms = 2_147_483_647_000_i64; // 2038-01-19T03:14:07Z + let fixture = ChromiumHistoryFixture::new() + .add_url(ChromiumUrlRow { + id: 1, + url: "https://example.com/y2038".to_string(), + title: Some("Year 2038".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: y2038_ms, + hidden: false, + }) + .add_visit(chromium_visit_row(1, 1, y2038_ms)); + let snapshot = + snapshot_for_chromium_fixture(&fixture, chromium_profile("chrome:Y2038", "Google Chrome")); + let summary = run_one_ingest(&env, 1, &snapshot, false); + assert_eq!(summary.new_urls, 1); + assert_eq!(summary.new_visits, 1); + let archive = env.open_archive(); + let visit_time: i64 = archive + .query_row( + "SELECT visit_time_ms FROM visits + JOIN source_profiles ON source_profiles.id = visits.source_profile_id + WHERE source_profiles.profile_key = 'chrome:Y2038'", + [], + |row| row.get(0), + ) + .expect("query y2038 visit time"); + assert_eq!(visit_time, y2038_ms, "year-2038 timestamp must round-trip correctly"); +} + +/// E3 — Far-future timestamp (year 3000 ≈ 32_503_680_000_000 ms). +/// Clock skew or data corruption can produce far-future timestamps. +/// The archive must accept them without error. +#[test] +fn e3_far_future_timestamp_imports_without_error() { + let env = ScenarioEnv::new(); + let far_future_ms = 32_503_680_000_000_i64; // ~3000-01-01 + let fixture = ChromiumHistoryFixture::new() + .add_url(ChromiumUrlRow { + id: 1, + url: "https://example.com/future".to_string(), + title: Some("Far Future".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: far_future_ms, + hidden: false, + }) + .add_visit(chromium_visit_row(1, 1, far_future_ms)); + let snapshot = snapshot_for_chromium_fixture( + &fixture, + chromium_profile("chrome:FarFuture", "Google Chrome"), + ); + let summary = run_one_ingest(&env, 1, &snapshot, false); + assert_eq!(summary.new_urls, 1); + assert_eq!(summary.new_visits, 1); + let archive = env.open_archive(); + let visit_time: i64 = archive + .query_row( + "SELECT visit_time_ms FROM visits + JOIN source_profiles ON source_profiles.id = visits.source_profile_id + WHERE source_profiles.profile_key = 'chrome:FarFuture'", + [], + |row| row.get(0), + ) + .expect("query far-future visit time"); + assert_eq!(visit_time, far_future_ms, "far-future timestamp must round-trip correctly"); +} + +/// E4 — Negative timestamp (before Unix epoch, e.g. 1969-12-31). +/// +/// All browser parsers (Chromium, Firefox, Safari) clamp visit times to +/// `max(0)` when converting from browser-native format back to Unix ms. +/// A negative source timestamp therefore survives the fixture writer +/// (Chromium maps it to a valid Chrome-epoch microsecond value) but the +/// parser clamps the result to 0 on read-back. The archive must accept +/// the row without error; the stored `visit_time_ms` will be 0. +#[test] +fn e4_negative_timestamp_clamped_to_zero_without_error() { + let env = ScenarioEnv::new(); + // -86_400_000 ms = 1969-12-31T00:00:00Z (one day before epoch). + // The Chromium fixture writer converts this to a valid Chrome-epoch + // microsecond (11_558_073_600_000_000), but the production parser's + // `chrome_time_to_unix_ms` applies `.max(0)`, so it becomes 0. + let negative_ms = -86_400_000_i64; + let fixture = ChromiumHistoryFixture::new() + .add_url(ChromiumUrlRow { + id: 1, + url: "https://example.com/pre-epoch".to_string(), + title: Some("Pre-Epoch".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: negative_ms, + hidden: false, + }) + .add_visit(chromium_visit_row(1, 1, negative_ms)); + let snapshot = snapshot_for_chromium_fixture( + &fixture, + chromium_profile("chrome:PreEpoch", "Google Chrome"), + ); + let summary = run_one_ingest(&env, 1, &snapshot, false); + assert_eq!(summary.new_urls, 1); + assert_eq!(summary.new_visits, 1); + let archive = env.open_archive(); + let visit_time: i64 = archive + .query_row( + "SELECT visit_time_ms FROM visits + JOIN source_profiles ON source_profiles.id = visits.source_profile_id + WHERE source_profiles.profile_key = 'chrome:PreEpoch'", + [], + |row| row.get(0), + ) + .expect("query pre-epoch visit time"); + assert_eq!(visit_time, 0, "negative timestamp must be clamped to 0 by parser's max(0)"); +} + +// ====================================================================== +// E7 — NULL title handling +// ====================================================================== + +/// E7 — Real Chrome `History` databases routinely have URLs with NULL +/// `title` columns (the user navigated to a URL but the page never +/// finished loading, or it was a binary download). The PathKeep parser +/// must tolerate this and produce a canonical URL row with `title = +/// NULL` rather than failing or storing an empty string. This pins the +/// contract that nullable source columns project as NULL in the archive. +#[test] +fn e7_null_title_imports_with_null_archive_title() { + let env = ScenarioEnv::new(); + let day_one_ms = 1_777_680_000_000_i64; + + let fixture = ChromiumHistoryFixture::new() + .add_url(ChromiumUrlRow { + id: 1, + url: "https://example.com/no-title".to_string(), + title: None, + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: day_one_ms, + hidden: false, + }) + .add_url(ChromiumUrlRow { + id: 2, + url: "https://example.com/with-title".to_string(), + title: Some("Has Title".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: day_one_ms, + hidden: false, + }) + .add_visit(chromium_visit_row(1, 1, day_one_ms)) + .add_visit(chromium_visit_row(2, 2, day_one_ms)); + let snapshot = snapshot_for_chromium_fixture( + &fixture, + chromium_profile("chrome:NullTitle", "Google Chrome"), + ); + let summary = run_one_ingest(&env, 1, &snapshot, false); + assert_eq!(summary.new_urls, 2); + assert_eq!(summary.new_visits, 2); + + let archive = env.open_archive(); + let no_title: Option = archive + .query_row( + "SELECT title FROM urls + JOIN source_profiles ON source_profiles.id = urls.source_profile_id + WHERE source_profiles.profile_key = 'chrome:NullTitle' + AND urls.source_url_id = 1", + [], + |row| row.get(0), + ) + .expect("query null-title url"); + assert!( + no_title.is_none(), + "NULL source title must project as NULL in archive, not empty string" + ); + + let with_title: Option = archive + .query_row( + "SELECT title FROM urls + JOIN source_profiles ON source_profiles.id = urls.source_profile_id + WHERE source_profiles.profile_key = 'chrome:NullTitle' + AND urls.source_url_id = 2", + [], + |row| row.get(0), + ) + .expect("query with-title url"); + assert_eq!(with_title.as_deref(), Some("Has Title")); +} + +// ====================================================================== +// E8 — Unicode in URLs and titles (CJK + emoji + IDN) +// ====================================================================== + +/// E8 — International users routinely have Unicode in browsing history: +/// CJK characters in titles, internationalized domain names (IDN / +/// Punycode), percent-encoded paths, and emoji. SQLite stores all of +/// these as UTF-8 TEXT natively, but the contract must be pinned: +/// every character must round-trip byte-identically through the parser, +/// the fingerprint hash, and the archive storage. If a future refactor +/// accidentally normalizes Unicode (NFC vs NFD, case folding, IDN +/// decoding) or truncates non-ASCII, this test fails immediately. +#[test] +fn e8_unicode_urls_and_titles_round_trip_byte_identical() { + let env = ScenarioEnv::new(); + let day_one_ms = 1_777_680_000_000_i64; + let day_two_ms = 1_777_809_600_000_i64; + let day_three_ms = 1_777_872_930_000_i64; + + // Three diverse Unicode shapes that must NOT be normalized: + // 1. CJK title (Traditional Chinese) on plain ASCII URL + // 2. Percent-encoded path with mixed case (verbatim per E6) + // 3. Emoji in title + let fixture = ChromiumHistoryFixture::new() + .add_url(ChromiumUrlRow { + id: 1, + url: "https://example.com/article".to_string(), + title: Some("臺灣公開資料平臺 — 開放資料的全球趨勢".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: day_one_ms, + hidden: false, + }) + .add_url(ChromiumUrlRow { + id: 2, + url: "https://example.com/path/%E6%B8%AC%E8%A9%A6".to_string(), + title: Some("Percent-Encoded Path".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: day_two_ms, + hidden: false, + }) + .add_url(ChromiumUrlRow { + id: 3, + url: "https://example.com/celebration".to_string(), + title: Some("Launch Day 🚀 — Ship It!".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: day_three_ms, + hidden: false, + }) + .add_visit(chromium_visit_row(10, 1, day_one_ms)) + .add_visit(chromium_visit_row(20, 2, day_two_ms)) + .add_visit(chromium_visit_row(30, 3, day_three_ms)); + + let snapshot = snapshot_for_chromium_fixture( + &fixture, + chromium_profile("chrome:Unicode", "Google Chrome"), + ); + let summary = run_one_ingest(&env, 1, &snapshot, false); + assert_eq!(summary.new_urls, 3); + assert_eq!(summary.new_visits, 3); + + let archive = env.open_archive(); + let read_url_and_title = |source_url_id: i64| -> (String, Option) { + archive + .query_row( + "SELECT url, title FROM urls + JOIN source_profiles ON source_profiles.id = urls.source_profile_id + WHERE source_profiles.profile_key = 'chrome:Unicode' + AND urls.source_url_id = ?1", + [source_url_id], + |row| Ok((row.get(0)?, row.get(1)?)), + ) + .expect("query unicode row") + }; + + let (url1, title1) = read_url_and_title(1); + assert_eq!(url1, "https://example.com/article"); + assert_eq!( + title1.as_deref(), + Some("臺灣公開資料平臺 — 開放資料的全球趨勢"), + "CJK title must round-trip byte-identical (no NFC/NFD normalization)" + ); + + let (url2, title2) = read_url_and_title(2); + assert_eq!( + url2, "https://example.com/path/%E6%B8%AC%E8%A9%A6", + "percent-encoded path must NOT be decoded — stored verbatim" + ); + assert_eq!(title2.as_deref(), Some("Percent-Encoded Path")); + + let (url3, title3) = read_url_and_title(3); + assert_eq!(url3, "https://example.com/celebration"); + assert_eq!( + title3.as_deref(), + Some("Launch Day 🚀 — Ship It!"), + "emoji + em-dash must round-trip verbatim" + ); +} + +// ====================================================================== +// E9 — `hidden = true` URL flag round-trip +// ====================================================================== + +/// E9 — Real Chrome `History` databases routinely store URLs with +/// `hidden = 1` (Chrome marks redirect intermediates, certain extension +/// URLs, and explicitly-hidden items this way). The PathKeep parser +/// must preserve this flag verbatim: `hidden = true` on the source URL +/// must produce `hidden != 0` on the canonical archive URL, and +/// `hidden = false` must produce `hidden = 0`. +/// +/// This pins the `hidden` bit contract — sibling to E7 (NULL title) +/// and E8 (Unicode round-trip). Existing C-series tests only exercise +/// `hidden: false`; the C4 B1-fix test exercises `hidden: true` but +/// only in the context of preventing older-snapshot regressions. No +/// test had asserted that a first-time import of a `hidden = true` URL +/// actually preserves the flag. +#[test] +fn e9_hidden_url_flag_round_trips_for_both_true_and_false() { + let env = ScenarioEnv::new(); + let day_one_ms = 1_777_680_000_000_i64; + let day_two_ms = 1_777_809_600_000_i64; + + let fixture = ChromiumHistoryFixture::new() + .add_url(ChromiumUrlRow { + id: 1, + url: "https://example.com/visible".to_string(), + title: Some("Visible Page".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: day_one_ms, + hidden: false, + }) + .add_url(ChromiumUrlRow { + id: 2, + url: "https://example.com/hidden-redirect-intermediate".to_string(), + title: Some("Hidden Redirect".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: day_two_ms, + hidden: true, + }) + .add_visit(chromium_visit_row(1, 1, day_one_ms)) + .add_visit(chromium_visit_row(2, 2, day_two_ms)); + + let snapshot = snapshot_for_chromium_fixture( + &fixture, + chromium_profile("chrome:HiddenFlag", "Google Chrome"), + ); + let summary = run_one_ingest(&env, 1, &snapshot, false); + assert_eq!(summary.new_urls, 2); + assert_eq!(summary.new_visits, 2); + + let archive = env.open_archive(); + let read_hidden = |source_url_id: i64| -> i64 { + archive + .query_row( + "SELECT hidden FROM urls + JOIN source_profiles ON source_profiles.id = urls.source_profile_id + WHERE source_profiles.profile_key = 'chrome:HiddenFlag' + AND urls.source_url_id = ?1", + [source_url_id], + |row| row.get(0), + ) + .expect("query hidden flag") + }; + + assert_eq!(read_hidden(1), 0, "hidden=false source must land as 0 in archive"); + assert!( + read_hidden(2) != 0, + "hidden=true source must land as non-zero in archive (not silently dropped)" + ); +} diff --git a/src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_takeout.rs b/src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_takeout.rs new file mode 100644 index 00000000..462ea351 --- /dev/null +++ b/src-tauri/crates/vault-core/src/archive/ingest/dedup_scenarios_takeout.rs @@ -0,0 +1,736 @@ +//! Takeout-family dedup scenarios (T1, T2, T2b, T3, T5). +//! +//! Covers the Google Takeout BrowserHistory JSON import path and its +//! interaction with local-Chrome backups. Each scenario pins a specific +//! dedup contract documented in the audit: +//! +//! - **T1** — Takeout baseline import (happy path). +//! - **T2** — File-rename re-import deduplicates via fingerprint partial index. +//! - **T2b** — Fingerprint divergence (title drift) exposes B3. +//! - **T3** — Takeout × local Chrome same-period overlap (B4 contract). +//! - **T5** — `time_usec` unit contract (B6 pinning). + +use super::*; +use browser_history_fixtures::{ + ChromiumHistoryFixture, ChromiumUrlRow, ChromiumVisitRow, TakeoutBrowserHistoryFixture, + TakeoutBrowserRecord, +}; +use rusqlite::Connection; +use tempfile::{TempDir, tempdir}; + +// ====================================================================== +// Shared helpers (per satellite-module pattern — each #[cfg(test)] module +// owns its own ScenarioEnv) +// ====================================================================== + +fn test_config() -> AppConfig { + AppConfig { initialized: true, ..AppConfig::default() } +} + +fn test_paths(root: &Path) -> ProjectPaths { + crate::config::project_paths_with_root(root) +} + +struct ScenarioEnv { + _root: TempDir, + paths: ProjectPaths, + config: AppConfig, +} + +impl ScenarioEnv { + fn new() -> Self { + let root = tempdir().expect("scenario root tempdir"); + let paths = test_paths(root.path()); + let config = test_config(); + crate::config::ensure_paths(&paths).expect("ensure paths"); + Self { _root: root, paths, config } + } + + fn open_archive(&self) -> Connection { + open_archive_connection(&self.paths, &self.config, None).expect("open archive") + } +} + +fn seed_run(archive: &Transaction<'_>, run_id: i64) { + archive + .execute( + "INSERT INTO runs (id, run_type, trigger, started_at, timezone, status, profile_scope_json, warnings_json, stats_json, due_only) + VALUES (?1, 'backup', 'manual', '2026-05-25T00:00:00+00:00', 'UTC', 'running', '[]', '[]', '{}', 0)", + [run_id], + ) + .expect("seed run"); +} + +fn run_one_ingest( + env: &ScenarioEnv, + run_id: i64, + snapshot: &ProfileSnapshot, + use_watermark: bool, +) -> BackupProfileSummary { + let mut archive = env.open_archive(); + let transaction = archive.transaction().expect("scenario transaction"); + seed_run(&transaction, run_id); + let mut snapshot_artifacts = Vec::new(); + let mut source_evidence_plans = Vec::new(); + let summary = process_profile_snapshot( + &transaction, + run_id, + &env.paths, + &env.config, + snapshot, + &mut snapshot_artifacts, + &mut source_evidence_plans, + false, + use_watermark, + ) + .expect("process profile snapshot"); + transaction.commit().expect("commit scenario transaction"); + summary +} + +fn count_archive_rows(env: &ScenarioEnv, table: &str) -> i64 { + let archive = env.open_archive(); + archive + .query_row(&format!("SELECT COUNT(*) FROM {table}"), [], |row| row.get(0)) + .expect("count rows") +} + +fn count_urls_for_profile(env: &ScenarioEnv, profile_key: &str) -> i64 { + let archive = env.open_archive(); + archive + .query_row( + "SELECT COUNT(*) FROM urls + JOIN source_profiles ON source_profiles.id = urls.source_profile_id + WHERE source_profiles.profile_key = ?1", + [profile_key], + |row| row.get(0), + ) + .expect("count urls for profile") +} + +fn count_visits_for_profile(env: &ScenarioEnv, profile_key: &str) -> i64 { + let archive = env.open_archive(); + archive + .query_row( + "SELECT COUNT(*) FROM visits + JOIN source_profiles ON source_profiles.id = visits.source_profile_id + WHERE source_profiles.profile_key = ?1", + [profile_key], + |row| row.get(0), + ) + .expect("count visits for profile") +} + +// ====================================================================== +// Chromium helpers (needed by T3 which imports Chrome + Takeout) +// ====================================================================== + +fn chromium_profile(profile_id: &str, browser_name: &str) -> crate::models::BrowserProfile { + crate::models::BrowserProfile { + profile_id: profile_id.to_string(), + profile_name: "Default".to_string(), + browser_family: "chromium".to_string(), + browser_name: browser_name.to_string(), + user_name: Some("synthetic-user".to_string()), + profile_path: format!("/synthetic/{profile_id}"), + history_path: Some(format!("/synthetic/{profile_id}/History")), + favicons_path: None, + history_exists: true, + history_readable: true, + access_issue: None, + browser_version: Some("146.0.0.0".to_string()), + history_file_name: "History".to_string(), + history_bytes: 128, + favicons_bytes: 0, + supporting_bytes: 0, + retention_boundary: crate::models::BrowserRetentionBoundary::default(), + } +} + +fn visit_row(id: i64, url_id: i64, visit_time_unix_ms: i64) -> ChromiumVisitRow { + ChromiumVisitRow { + id, + url_id, + visit_time_unix_ms, + from_visit: Some(0), + transition: Some(805306368), + visit_duration_micros: Some(5_000_000), + is_known_to_sync: false, + visited_link_id: None, + external_referrer_url: None, + app_id: None, + } +} + +fn snapshot_for_fixture( + fixture: &ChromiumHistoryFixture, + profile: crate::models::BrowserProfile, +) -> ProfileSnapshot { + let temp_dir = tempdir().expect("snapshot tempdir"); + let history_path = temp_dir.path().join("History"); + fixture.write(&history_path).expect("write chromium fixture"); + let history_bytes = std::fs::metadata(&history_path).map(|meta| meta.len()).unwrap_or(0); + let mut profile = profile; + profile.history_bytes = history_bytes; + ProfileSnapshot { + profile, + temp_dir, + history_path, + favicons_path: None, + source_hashes: vec![FileFingerprint { + path: "History".to_string(), + sha256: "synthetic-fixture-hash".to_string(), + }], + } +} + +// ====================================================================== +// Takeout helpers +// ====================================================================== + +fn takeout_record(url: &str, title: &str, visit_time_unix_ms: i64) -> TakeoutBrowserRecord { + TakeoutBrowserRecord { + url: url.to_string(), + title: Some(title.to_string()), + visit_time_unix_ms, + page_transition: Some("LINK".to_string()), + client_id: None, + favicon_url: None, + ptoken: None, + } +} + +fn import_takeout_fixture(env: &ScenarioEnv, records: &[TakeoutBrowserRecord], label: &str) { + let root = tempdir().unwrap_or_else(|_| panic!("{label} takeout root")); + let payload = root.path().join("Chrome/BrowserHistory.json"); + let mut fixture = TakeoutBrowserHistoryFixture::new(); + for record in records { + fixture = fixture.add_record(record.clone()); + } + fixture.write(&payload).expect("write takeout fixture"); + crate::takeout::import_takeout( + &env.paths, + &env.config, + None, + &crate::models::TakeoutRequest { + source_path: root.path().display().to_string(), + dry_run: false, + }, + ) + .unwrap_or_else(|err| panic!("{label} import_takeout failed: {err}")); + drop(root); +} + +// ====================================================================== +// T1: Takeout baseline import +// ====================================================================== + +/// T1 — A Takeout BrowserHistory JSON gets imported via the public +/// `import_takeout` flow. Asserts row counts under the synthetic profile +/// the Takeout flow upserts (`takeout::browser-history`) and that visit +/// `app_id` lands as `"takeout"`. +#[test] +fn t1_takeout_baseline_import() { + let env = ScenarioEnv::new(); + let source_root = tempdir().expect("takeout source root"); + let payload_path = source_root.path().join("Chrome/BrowserHistory.json"); + + TakeoutBrowserHistoryFixture::new() + .add_record(takeout_record("https://example.com/page-one", "Page One", 1_777_680_000_000)) + .add_record(takeout_record("https://example.com/page-two", "Page Two", 1_777_809_600_000)) + .add_record(takeout_record( + "https://example.org/page-three", + "Page Three", + 1_777_872_930_000, + )) + .write(&payload_path) + .expect("write takeout fixture"); + + let request = crate::models::TakeoutRequest { + source_path: source_root.path().display().to_string(), + dry_run: false, + }; + + let inspection = crate::takeout::import_takeout(&env.paths, &env.config, None, &request) + .expect("import takeout"); + + assert!(!inspection.dry_run); + assert_eq!(inspection.imported_items + inspection.duplicate_items, 3); + + let profile_key = "takeout::browser-history"; + assert_eq!(count_urls_for_profile(&env, profile_key), 3); + assert_eq!(count_visits_for_profile(&env, profile_key), 3); + + // Takeout-sourced visits must carry app_id="takeout"; this is the same + // hardcoded marker that contributes to B4's fingerprint mismatch. + let archive = env.open_archive(); + let takeout_visit_count: i64 = archive + .query_row( + "SELECT COUNT(*) FROM visits + JOIN source_profiles ON source_profiles.id = visits.source_profile_id + WHERE source_profiles.profile_key = ?1 AND visits.app_id = 'takeout'", + [profile_key], + |row| row.get(0), + ) + .expect("takeout app_id count"); + assert_eq!(takeout_visit_count, 3); +} + +// ====================================================================== +// T2: Takeout file rename re-import — refines B3 framing +// ====================================================================== + +/// T2 — Re-importing the same Takeout records from a different on-disk +/// path. The audit's first cut of **B3** ("path-bound source_visit_id +/// causes a full duplicate set on every re-import") turned out to overstate +/// the practical risk: while it is true that the path change does produce +/// completely different `source_visit_id` values for every record, the +/// `(source_profile_id, event_fingerprint)` partial unique index catches +/// the duplicates because the fingerprint inputs (url, visit_time_ms, +/// title, transition=None, app_id="takeout") are identical across the two +/// imports. +/// +/// This scenario pins the **actual current behavior**: rename-only +/// re-import of unchanged Takeout records is correctly de-duplicated by +/// the fingerprint partial index, ending at 3 visit rows. The B3 design +/// concern (poor robustness — the path-bound id provides zero useful +/// signal, so the system relies on the fingerprint as a single layer) +/// stays documented in the audit; [`t2b_takeout_rename_with_title_change_demonstrates_b3_when_fingerprint_diverges`] +/// covers the case where the fingerprint can't save B3 anymore. +#[test] +fn t2_takeout_rename_file_reimport_dedups_via_fingerprint_partial_index() { + let env = ScenarioEnv::new(); + + let records: Vec = (0..3) + .map(|index| { + let visit_time = 1_777_680_000_000 + (index as i64 * 86_400_000); + takeout_record( + &format!("https://example.com/article-{index}"), + &format!("Article {index}"), + visit_time, + ) + }) + .collect(); + + import_takeout_fixture(&env, &records, "first"); + let profile_key = "takeout::browser-history"; + assert_eq!(count_visits_for_profile(&env, profile_key), 3); + + import_takeout_fixture(&env, &records, "second"); + + // The fingerprint partial index catches the duplicates even though + // every source_visit_id differs from the first pass. + assert_eq!( + count_visits_for_profile(&env, profile_key), + 3, + "fingerprint partial index dedups the renamed-source re-import" + ); +} + +// ====================================================================== +// T2b: Fingerprint divergence exposes B3 +// ====================================================================== + +/// T2b — When the fingerprint cannot rescue B3, the path-bound +/// `source_visit_id` produces a real duplicate set. Two re-imports of the +/// "same" record but with even one fingerprint input changed (title +/// here) defeat the fingerprint partial index, leaving the broken +/// path-bound primary key as the only defense. The result is the full +/// duplicate set the audit warned about. +/// +/// This is a `should_panic` failing test today: the assertion below is +/// what the system should provide after B3 is fixed (e.g. by deriving +/// `source_visit_id` from `(url, visit_time_micros)` so the primary key +/// is stable across re-imports regardless of path or fingerprint input +/// drift). Today the count grows to 6 and the assertion fires. +#[test] +fn t2b_takeout_rename_with_title_change_demonstrates_b3_when_fingerprint_diverges() { + let env = ScenarioEnv::new(); + + let first_records: Vec = (0..3) + .map(|index| { + let visit_time = 1_777_680_000_000 + (index as i64 * 86_400_000); + takeout_record( + &format!("https://example.com/article-{index}"), + &format!("Original title {index}"), + visit_time, + ) + }) + .collect(); + import_takeout_fixture(&env, &first_records, "first"); + + // Real-world equivalent: user re-exports Takeout months later; Google + // captured an updated page title in the meantime. Same URL, same + // visit time, different title → fingerprint differs. + let second_records: Vec = first_records + .iter() + .map(|record| { + let mut next = record.clone(); + next.title = Some(format!( + "Updated title for {}", + record.url.rsplit('/').next().unwrap_or("page") + )); + next + }) + .collect(); + import_takeout_fixture(&env, &second_records, "second"); + + let profile_key = "takeout::browser-history"; + let visit_count = count_visits_for_profile(&env, profile_key); + + // Expected post-fix: 3 visits (treated as the same logical event with + // an updated title). Today: 6 (because both source_visit_id and + // event_fingerprint differ across the two imports). + assert_eq!( + visit_count, 3, + "B3 fix required: rename + title drift duplicates rows (got {visit_count})" + ); +} + +// ====================================================================== +// T3: Takeout x local Chrome same-period overlap — B4 contract +// ====================================================================== + +/// T3 — Same-period overlap between a local Chrome profile and the +/// Takeout JSON of the same Chrome installation. The audit's **B4** +/// observation: even when records describe literally the same browsing +/// event, the fingerprint inputs differ between the two source paths +/// (local Chrome has a real `transition` and the browser's real +/// `app_id`; Takeout hardcodes `app_id = "takeout"` and `transition = +/// None`), so even a hypothetical cross-source-profile fingerprint +/// dedup would not match. This contract scenario pins the current +/// storage truth — 3 + 3 = 6 visits across two profiles — and +/// documents the input divergence so any future "merge across sources" +/// proposal must address the fingerprint normalization gap first. +#[test] +fn t3_takeout_and_local_chrome_same_period_b4_contract() { + let env = ScenarioEnv::new(); + let day_one = 1_777_680_000_000_i64; + let day_two = 1_777_809_600_000_i64; + let day_three = 1_777_872_930_000_i64; + + let chrome_fixture = ChromiumHistoryFixture::new() + .add_url(ChromiumUrlRow { + id: 1, + url: "https://example.com/shared-one".to_string(), + title: Some("Shared One".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: day_one, + hidden: false, + }) + .add_url(ChromiumUrlRow { + id: 2, + url: "https://example.com/shared-two".to_string(), + title: Some("Shared Two".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: day_two, + hidden: false, + }) + .add_url(ChromiumUrlRow { + id: 3, + url: "https://example.com/shared-three".to_string(), + title: Some("Shared Three".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_unix_ms: day_three, + hidden: false, + }) + .add_visit(visit_row(10, 1, day_one)) + .add_visit(visit_row(11, 2, day_two)) + .add_visit(visit_row(12, 3, day_three)); + let chrome_snapshot = + snapshot_for_fixture(&chrome_fixture, chromium_profile("chrome:Default", "Google Chrome")); + run_one_ingest(&env, 1, &chrome_snapshot, false); + + let takeout_source = tempdir().expect("takeout source root"); + let takeout_payload = takeout_source.path().join("Chrome/BrowserHistory.json"); + TakeoutBrowserHistoryFixture::new() + .add_record(takeout_record("https://example.com/shared-one", "Shared One", day_one)) + .add_record(takeout_record("https://example.com/shared-two", "Shared Two", day_two)) + .add_record(takeout_record("https://example.com/shared-three", "Shared Three", day_three)) + .write(&takeout_payload) + .expect("write takeout fixture"); + crate::takeout::import_takeout( + &env.paths, + &env.config, + None, + &crate::models::TakeoutRequest { + source_path: takeout_source.path().display().to_string(), + dry_run: false, + }, + ) + .expect("import takeout"); + + // Each source kept independent rows under its own source_profile. + assert_eq!(count_visits_for_profile(&env, "chrome:Default"), 3); + assert_eq!(count_visits_for_profile(&env, "takeout::browser-history"), 3); + assert_eq!(count_archive_rows(&env, "visits"), 6); + + // Fingerprint divergence: a future cross-source dedup design has to + // normalize app_id (and likely also project transition to None) before + // any pair of these visits could share a fingerprint. + let archive = env.open_archive(); + let chrome_app_ids: Vec> = archive + .prepare( + "SELECT app_id FROM visits + JOIN source_profiles ON source_profiles.id = visits.source_profile_id + WHERE source_profiles.profile_key = 'chrome:Default'", + ) + .expect("prepare chrome") + .query_map([], |row| row.get(0)) + .expect("query chrome") + .collect::>>() + .expect("collect chrome"); + let takeout_app_ids: Vec> = archive + .prepare( + "SELECT app_id FROM visits + JOIN source_profiles ON source_profiles.id = visits.source_profile_id + WHERE source_profiles.profile_key = 'takeout::browser-history'", + ) + .expect("prepare takeout") + .query_map([], |row| row.get(0)) + .expect("query takeout") + .collect::>>() + .expect("collect takeout"); + assert!(chrome_app_ids.iter().all(|app_id| app_id.is_none())); + assert!(takeout_app_ids.iter().all(|app_id| app_id.as_deref() == Some("takeout"))); +} + +// ====================================================================== +// T5: Takeout time_usec unit contract — B6 pinning +// ====================================================================== + +/// T5 — Pins the current interpretation of Takeout's `time_usec` field +/// as **Unix-epoch microseconds**. The audit raised **B6** because the +/// helper `micros_to_unix_ms` (parser side) name asserts Unix +/// microseconds but Google's Takeout dumps historically used Chrome +/// epoch microseconds (since 1601). The harness writer emits Unix +/// microseconds; the parser reads Unix microseconds; this test pins +/// that contract end-to-end. If anyone later flips the parser to assume +/// Chrome epoch, T5 fails immediately. If a future real-world Takeout +/// sample disagrees with this interpretation, the writer + this test +/// must be updated together — the audit B6 note documents the open +/// question. +#[test] +fn t5_takeout_time_usec_pinned_as_unix_microseconds_b6_contract() { + let env = ScenarioEnv::new(); + let source_root = tempdir().expect("takeout source root"); + let payload_path = source_root.path().join("Chrome/BrowserHistory.json"); + + // 2026-05-02T00:00:00Z = 1_777_680_000_000 Unix ms = 1_777_680_000_000_000 Unix μs. + // If the parser treated this as Chrome μs the resulting Unix ms would + // be (1_777_680_000_000_000 - 11_644_473_600_000_000) / 1000, which + // produces a negative or wildly different timestamp the assertion + // below catches. + let visit_one = 1_777_680_000_000_i64; + + TakeoutBrowserHistoryFixture::new() + .add_record(takeout_record("https://example.com/time-pin", "Time Pin", visit_one)) + .write(&payload_path) + .expect("write takeout fixture"); + + crate::takeout::import_takeout( + &env.paths, + &env.config, + None, + &crate::models::TakeoutRequest { + source_path: source_root.path().display().to_string(), + dry_run: false, + }, + ) + .expect("import takeout"); + + let archive = env.open_archive(); + let (visit_time_ms, visit_time_iso): (i64, String) = archive + .query_row( + "SELECT visits.visit_time_ms, visits.visit_time_iso FROM visits + JOIN source_profiles ON source_profiles.id = visits.source_profile_id + WHERE source_profiles.profile_key = 'takeout::browser-history'", + [], + |row| Ok((row.get(0)?, row.get(1)?)), + ) + .expect("query takeout visit time"); + + assert_eq!(visit_time_ms, visit_one, "Takeout time_usec must round-trip as Unix milliseconds"); + assert!( + visit_time_iso.starts_with("2026-05-02"), + "Takeout ISO must reflect 2026-05-02, got {visit_time_iso}" + ); +} + +// ====================================================================== +// T6: Takeout URL upsert B1 protection — older-snapshot re-import must not regress +// ====================================================================== + +/// T6 — Audit bug B1 was originally identified and fixed in +/// `archive/ingest/writes.rs::upsert_url` (commit 6884c10d) but the +/// Takeout import path in `takeout/payload_import.rs` was left with +/// unconditional `excluded.*` overwrites and a hardcoded +/// `visit_count = 1` literal in the INSERT VALUES with no UPDATE clause +/// for visit_count or typed_count at all. A re-import of an older +/// Takeout snapshot would silently overwrite title / hidden with stale +/// values, and a fresh Takeout export with new visits to the same URL +/// would never bump visit_count. +/// +/// This scenario pins the B1 fix applied to `payload_import.rs`: +/// +/// 1. **Older snapshot re-import** must not regress `title` / `hidden` +/// (strictly older `last_visit_ms` → preserve newer values). +/// 2. **MAX(visit_count)** must use the larger of stored vs incoming so +/// a later Takeout export reflecting new visits actually bumps the +/// archive's visit_count. +/// 3. **Tied `last_visit_ms`** must NOT trigger an overwrite (matches the +/// `>` vs `>=` tie-break tightened in writes.rs). +#[test] +fn t6_takeout_payload_import_url_upsert_protects_against_older_snapshot_regression() { + let env = ScenarioEnv::new(); + let earlier_ms = 1_777_680_000_000_i64; // 2026-05-02T00:00:00Z + let later_ms = 1_777_809_600_000_i64; // 2026-05-03T12:00:00Z + + // Pass 1: import the LATER snapshot first. Two records to the same + // URL with the meaningful title; visit_count merges to 2 in the + // parser via merge_url_state. + let later_records: Vec = vec![ + TakeoutBrowserRecord { + url: "https://example.com/news".to_string(), + title: Some("Meaningful Title".to_string()), + visit_time_unix_ms: later_ms - 1_000, + page_transition: Some("LINK".to_string()), + client_id: None, + favicon_url: None, + ptoken: None, + }, + TakeoutBrowserRecord { + url: "https://example.com/news".to_string(), + title: Some("Meaningful Title".to_string()), + visit_time_unix_ms: later_ms, + page_transition: Some("LINK".to_string()), + client_id: None, + favicon_url: None, + ptoken: None, + }, + ]; + import_takeout_fixture(&env, &later_records, "later"); + + let profile_key = "takeout::browser-history"; + let archive = env.open_archive(); + let read_url_state = || -> (String, Option, i64, i64) { + let conn = env.open_archive(); + conn.query_row( + "SELECT url, title, visit_count, hidden FROM urls + JOIN source_profiles ON source_profiles.id = urls.source_profile_id + WHERE source_profiles.profile_key = ?1 + AND urls.url = 'https://example.com/news'", + [profile_key], + |row| Ok((row.get(0)?, row.get(1)?, row.get(2)?, row.get(3)?)), + ) + .expect("query url state") + }; + drop(archive); + + let (url1, title1, count1, hidden1) = read_url_state(); + assert_eq!(url1, "https://example.com/news"); + assert_eq!(title1.as_deref(), Some("Meaningful Title")); + assert_eq!(count1, 2, "later snapshot's visit_count of 2 must land"); + assert_eq!(hidden1, 0); + + // Pass 2: re-import the OLDER snapshot. Single record at earlier_ms + // with a NULL title and (implicitly) hidden=false. The parser will + // produce visit_count=1. + let older_records: Vec = vec![TakeoutBrowserRecord { + url: "https://example.com/news".to_string(), + title: None, + visit_time_unix_ms: earlier_ms, + page_transition: Some("LINK".to_string()), + client_id: None, + favicon_url: None, + ptoken: None, + }]; + import_takeout_fixture(&env, &older_records, "older"); + + let (url2, title2, count2, hidden2) = read_url_state(); + assert_eq!(url2, "https://example.com/news"); + assert_eq!( + title2.as_deref(), + Some("Meaningful Title"), + "B1 fix for Takeout: older snapshot must NOT overwrite captured title with NULL" + ); + assert_eq!( + count2, 2, + "B1 fix for Takeout: MAX(visit_count) must preserve the higher value (2 > 1)" + ); + assert_eq!(hidden2, 0, "B1 fix for Takeout: hidden must not flip from older snapshot"); +} + +// ====================================================================== +// T7: Same-URL same-microsecond Takeout records must NOT collapse silently +// ====================================================================== + +/// T7 — When Google's Takeout export emits multiple records for the same +/// URL within the same microsecond (Chrome sync replay, redirect within +/// 1 µs, multiple devices syncing the same event), they must produce +/// distinct `source_visit_id` values so the +/// `(source_profile_id, source_visit_id)` UNIQUE index doesn't silently +/// drop later records via INSERT OR IGNORE. +/// +/// Before the ordinal-tiebreaker fix, `source_visit_id` was derived from +/// `stable_key_i64("{url}:{visit_time_micros}")` alone — identical for +/// every record at the same URL+microsecond. The first record landed; +/// the rest were silently dropped because both UNIQUE indexes (source +/// id + event_fingerprint, since transition=None and app_id="takeout" +/// are constant) fired on every subsequent INSERT OR IGNORE. +/// +/// The fix adds `ordinal` (per-record position in the source file) as a +/// tiebreaker. Within a single file, ordinals are unique; across renames +/// of the same file the same record keeps the same ordinal (Google's +/// JSON export is deterministic), so per-record-stability and dedup +/// across path renames both hold. +#[test] +fn t7_takeout_same_url_same_microsecond_records_land_as_distinct_visits() { + let env = ScenarioEnv::new(); + // Same URL, same visit_time_unix_ms. Two genuinely distinct events + // (different titles to make the input non-degenerate; in practice + // they could differ only in transition or page_transition). + let visit_time_ms = 1_777_680_000_000_i64; + + let records: Vec = vec![ + TakeoutBrowserRecord { + url: "https://example.com/sync-collision".to_string(), + title: Some("First Event".to_string()), + visit_time_unix_ms: visit_time_ms, + page_transition: Some("LINK".to_string()), + client_id: None, + favicon_url: None, + ptoken: None, + }, + TakeoutBrowserRecord { + url: "https://example.com/sync-collision".to_string(), + title: Some("Second Event Same Microsecond".to_string()), + visit_time_unix_ms: visit_time_ms, + page_transition: Some("LINK".to_string()), + client_id: None, + favicon_url: None, + ptoken: None, + }, + ]; + import_takeout_fixture(&env, &records, "same-microsecond"); + + let visits = count_visits_for_profile(&env, "takeout::browser-history"); + assert_eq!( + visits, 2, + "Two Takeout records at the same URL+microsecond must produce two distinct visit rows (ordinal tiebreaker), not silently collapse to 1" + ); + + // Cross-path stability check: re-importing the SAME file content + // (same records in same order) must still dedup — the second pass + // produces the same ordinals and therefore the same + // source_visit_ids, so INSERT OR IGNORE catches the dupes. + import_takeout_fixture(&env, &records, "same-microsecond-reimport"); + let visits_after_reimport = count_visits_for_profile(&env, "takeout::browser-history"); + assert_eq!( + visits_after_reimport, 2, + "Re-importing the same file (same records, same ordinals) must dedup, not double the visit count" + ); +} diff --git a/src-tauri/crates/vault-core/src/archive/ingest/mod.rs b/src-tauri/crates/vault-core/src/archive/ingest/mod.rs index c1a7f80b..d850d0a4 100644 --- a/src-tauri/crates/vault-core/src/archive/ingest/mod.rs +++ b/src-tauri/crates/vault-core/src/archive/ingest/mod.rs @@ -25,6 +25,15 @@ mod parser; mod writes; +#[cfg(test)] +mod dedup_scenarios; +#[cfg(test)] +mod dedup_scenarios_baselines; +#[cfg(test)] +mod dedup_scenarios_edge_cases; +#[cfg(test)] +mod dedup_scenarios_takeout; + use self::{ parser::{Watermark, load_watermark, save_watermark, should_checkpoint}, writes::{ @@ -168,10 +177,18 @@ impl HistoryBatchConsumer for ArchiveChunkConsumer<'_> { )?; if inserted > 0 { self.progress.new_visits += 1; + // Only widen URL bounds from visits that actually landed. + // INSERT OR IGNORE may drop a visit on either unique-index + // hit (`(url_id, source_visit_id)` or the fingerprint + // partial index); in either case the visit row is not in + // the canonical `visits` table, so widening + // `urls.first_visit_ms` / `urls.last_visit_ms` from it + // would leave the URL claiming bounds that no visit row + // proves — breaking any read model that joins them back. + track_url_visit_bounds(&mut self.progress.url_bounds, url_id, &visit); } self.progress.visit_count += 1; self.progress.last_visit_id = self.progress.last_visit_id.max(visit.source_visit_id); - track_url_visit_bounds(&mut self.progress.url_bounds, url_id, &visit); } if let Some(report_progress) = self.report_progress.as_mut() { report_progress(ArchiveIngestProgress { diff --git a/src-tauri/crates/vault-core/src/archive/ingest/writes.rs b/src-tauri/crates/vault-core/src/archive/ingest/writes.rs index 8763bf88..6396a6d1 100644 --- a/src-tauri/crates/vault-core/src/archive/ingest/writes.rs +++ b/src-tauri/crates/vault-core/src/archive/ingest/writes.rs @@ -121,13 +121,28 @@ pub(super) fn upsert_url( ) VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?5, ?6, ?7, ?8, ?9, ?10, ?11, ?12) ON CONFLICT(source_profile_id, source_url_id) DO UPDATE SET - url = excluded.url, - title = excluded.title, - visit_count = excluded.visit_count, - typed_count = excluded.typed_count, - hidden = excluded.hidden, - payload_hash = excluded.payload_hash, - recorded_at = excluded.recorded_at, + url = CASE + WHEN excluded.last_visit_ms > urls.last_visit_ms THEN excluded.url + ELSE urls.url + END, + title = CASE + WHEN excluded.last_visit_ms > urls.last_visit_ms THEN excluded.title + ELSE urls.title + END, + visit_count = MAX(urls.visit_count, excluded.visit_count), + typed_count = MAX(urls.typed_count, excluded.typed_count), + hidden = CASE + WHEN excluded.last_visit_ms > urls.last_visit_ms THEN excluded.hidden + ELSE urls.hidden + END, + payload_hash = CASE + WHEN excluded.last_visit_ms > urls.last_visit_ms THEN excluded.payload_hash + ELSE urls.payload_hash + END, + recorded_at = CASE + WHEN excluded.last_visit_ms > urls.last_visit_ms THEN excluded.recorded_at + ELSE urls.recorded_at + END, last_visit_ms = CASE WHEN excluded.last_visit_ms > urls.last_visit_ms THEN excluded.last_visit_ms ELSE urls.last_visit_ms @@ -202,6 +217,27 @@ pub(super) fn insert_visit( visit.visited_link_id, visit.external_referrer_url, visit.app_id, + // Intentional: source_kind is hardcoded to "chromium-history" + // for every browser family that flows through this + // backup-pipeline writer (Chromium, Firefox, Safari). + // The (source_profile_id, event_fingerprint) partial unique + // index that backs the fallback dedup is scoped per + // source_profile_id, so cross-family fingerprint matching is + // NOT structurally required — but keeping the constant + // identical across families inside this writer means a + // re-import of the same browser profile always produces the + // same fingerprint regardless of which browser_family the + // profile metadata reports, which is what the partial-index + // dedup relies on. + // + // The Takeout import paths (vault-core/src/takeout/ + // payload_import.rs and vault-core/src/takeout/ + // browser_history.rs) compute fingerprints with their own + // source_kind values and use Unix-millisecond timestamps, + // not Chrome-microsecond. Cross-flow fingerprint matching + // between this writer and the Takeout writers is not a + // contract — the two flows always land in distinct + // source_profiles rows and dedup separately. visit_event_fingerprint( "chromium-history", &visit.url, @@ -449,3 +485,203 @@ pub(super) fn track_url_visit_bounds( last_visit_iso: visit.visit_time_iso.clone(), }); } + +#[cfg(test)] +mod tests { + use super::*; + use crate::archive::visit_event_fingerprint; + use crate::utils::unix_micros_to_chrome_time; + + /// Contract: the backup-pipeline writer (`insert_visit` above) uses + /// the hardcoded source_kind `"chromium-history"` for every browser + /// family it serves (Chromium, Firefox, Safari). This is intentional — + /// keeping the constant identical across families inside this writer + /// means a re-import of the same browser profile always produces the + /// same fingerprint, which is what the + /// `(source_profile_id, event_fingerprint)` partial unique index + /// relies on for fallback dedup. + /// + /// Cross-flow fingerprint matching against the Takeout writers + /// (`vault-core/src/takeout/payload_import.rs`, + /// `vault-core/src/takeout/browser_history.rs`) is NOT a contract — + /// those writers use different source_kind values and Unix-millisecond + /// timestamps. Their visits always land in distinct source_profiles + /// rows from this writer's output, so the partial index naturally + /// scopes the dedup per flow. + /// + /// If a future change parameterizes source_kind per family inside + /// `insert_visit` itself, this test fails immediately and forces a + /// follow-up audit of any re-imports that crossed family-by-version. + #[test] + fn fingerprint_is_family_agnostic_within_backup_writer() { + let url = "https://example.com/article"; + let visit_time_ms: i64 = 1_777_680_000_000; + let visit_time_chrome = unix_micros_to_chrome_time(visit_time_ms.saturating_mul(1_000)); + let title = Some("Article"); + let transition = Some(805306368_i64); + let app_id: Option<&str> = None; + + let chromium_fp = visit_event_fingerprint( + "chromium-history", + url, + visit_time_chrome, + title, + transition, + app_id, + ); + + // Identical inputs must produce identical fingerprints; that is + // what the backup writer guarantees across families today. + let firefox_fp = visit_event_fingerprint( + "chromium-history", + url, + visit_time_chrome, + title, + transition, + app_id, + ); + let safari_fp = visit_event_fingerprint( + "chromium-history", + url, + visit_time_chrome, + title, + transition, + app_id, + ); + + assert_eq!( + chromium_fp, firefox_fp, + "fingerprint must be identical regardless of browser family" + ); + assert_eq!( + chromium_fp, safari_fp, + "fingerprint must be identical regardless of browser family" + ); + + // Sanity: changing any input produces a different fingerprint. + let different_url_fp = visit_event_fingerprint( + "chromium-history", + "https://example.com/other", + visit_time_chrome, + title, + transition, + app_id, + ); + assert_ne!( + chromium_fp, different_url_fp, + "different URL must produce different fingerprint" + ); + + // Sanity: a hypothetical per-family source_kind WOULD diverge. + let hypothetical_firefox_fp = visit_event_fingerprint( + "firefox-history", + url, + visit_time_chrome, + title, + transition, + app_id, + ); + assert_ne!( + chromium_fp, hypothetical_firefox_fp, + "different source_kind must produce different fingerprint (proves the hardcode matters)" + ); + } + + /// Contract: `sync_url_bounds` only widens the stored bounds — a visit + /// whose timestamp falls between the existing first and last does not + /// change either bound. This prevents mid-range backfill from shifting + /// the URL's reported first or last visit. + #[test] + fn sync_url_bounds_no_change_for_middle_visit() { + let dir = tempfile::tempdir().expect("tempdir"); + let paths = crate::config::project_paths_with_root(dir.path()); + let config = AppConfig { initialized: true, ..AppConfig::default() }; + crate::config::ensure_paths(&paths).expect("ensure paths"); + let mut archive = crate::archive::schema::open_archive_connection(&paths, &config, None) + .expect("archive"); + let transaction = archive.transaction().expect("transaction"); + + // Seed a run and source profile so FK constraints are satisfied. + transaction + .execute( + "INSERT INTO runs (id, run_type, trigger, started_at, timezone, status, profile_scope_json, warnings_json, stats_json, due_only) + VALUES (1, 'backup', 'manual', '2026-05-25T00:00:00+00:00', 'UTC', 'running', '[]', '[]', '{}', 0)", + [], + ) + .expect("seed run"); + let profile = crate::models::BrowserProfile { + profile_id: "chrome:Default".to_string(), + profile_name: "Default".to_string(), + browser_family: "chromium".to_string(), + browser_name: "Google Chrome".to_string(), + user_name: Some("test".to_string()), + profile_path: "/synthetic/chrome:Default".to_string(), + history_path: Some("/synthetic/chrome:Default/History".to_string()), + favicons_path: None, + history_exists: true, + history_readable: true, + access_issue: None, + browser_version: Some("146.0.0.0".to_string()), + history_file_name: "History".to_string(), + history_bytes: 128, + favicons_bytes: 0, + supporting_bytes: 0, + retention_boundary: crate::models::BrowserRetentionBoundary::default(), + }; + let source_profile_id = + upsert_source_profile(&transaction, &profile).expect("upsert profile"); + + // Insert a URL with initial bounds at time 1000. + let url = browser_history_parser::ParsedUrl { + source_url_id: 1, + url: "https://example.com/bounds-test".to_string(), + title: Some("Bounds Test".to_string()), + visit_count: 1, + typed_count: 0, + last_visit_ms: 1000, + last_visit_iso: "2026-01-01T00:00:01+00:00".to_string(), + hidden: false, + }; + let url_id = upsert_url(&transaction, 1, source_profile_id, &profile, &url, "hash-1") + .expect("upsert url"); + + // Widen bounds: first=1000, last=3000. + sync_url_bounds( + &transaction, + url_id, + &UrlVisitBounds { + first_visit_ms: 1000, + first_visit_iso: "2026-01-01T00:00:01+00:00".to_string(), + last_visit_ms: 3000, + last_visit_iso: "2026-01-01T00:00:03+00:00".to_string(), + }, + ) + .expect("initial bounds"); + + // Now insert a middle visit at time 2000. + sync_url_bounds( + &transaction, + url_id, + &UrlVisitBounds { + first_visit_ms: 2000, + first_visit_iso: "2026-01-01T00:00:02+00:00".to_string(), + last_visit_ms: 2000, + last_visit_iso: "2026-01-01T00:00:02+00:00".to_string(), + }, + ) + .expect("middle bounds"); + + // Assert bounds remain (1000, 3000) — the middle visit must not + // shift either bound. + let (first_ms, last_ms): (i64, i64) = transaction + .query_row( + "SELECT first_visit_ms, last_visit_ms FROM urls WHERE id = ?1", + [url_id], + |row| Ok((row.get(0)?, row.get(1)?)), + ) + .expect("query bounds"); + + assert_eq!(first_ms, 1000, "first_visit_ms must not shift to middle visit"); + assert_eq!(last_ms, 3000, "last_visit_ms must not shift to middle visit"); + } +} diff --git a/src-tauri/crates/vault-core/src/models/app.rs b/src-tauri/crates/vault-core/src/models/app.rs index e85eca21..da2b01c4 100644 --- a/src-tauri/crates/vault-core/src/models/app.rs +++ b/src-tauri/crates/vault-core/src/models/app.rs @@ -150,20 +150,15 @@ pub struct AppConfig { /// `og_images` row, and the daily negative-cache retry. /// This is the default: it keeps social cards warm /// without pinning UI activity. -#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)] +#[derive(Debug, Clone, Copy, Default, PartialEq, Eq, Serialize, Deserialize)] #[serde(rename_all = "snake_case")] pub enum OgImageFetchMode { Off, OnDemand, + #[default] Background, } -impl Default for OgImageFetchMode { - fn default() -> Self { - Self::Background - } -} - /// User-controllable og:image fetch + cache settings. /// /// `fetch_enabled` is the legacy master kill switch and defaults to diff --git a/src-tauri/crates/vault-core/src/takeout/browser_history.rs b/src-tauri/crates/vault-core/src/takeout/browser_history.rs index 88cf251b..de4190fd 100644 --- a/src-tauri/crates/vault-core/src/takeout/browser_history.rs +++ b/src-tauri/crates/vault-core/src/takeout/browser_history.rs @@ -194,11 +194,20 @@ impl HistoryBatchConsumer for BrowserHistoryArchiveConsumer<'_> { ) VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?5, ?6, ?7, ?8, ?9, ?10, ?11, ?12) ON CONFLICT(source_profile_id, source_url_id) DO UPDATE SET - url = excluded.url, - title = excluded.title, - visit_count = excluded.visit_count, - typed_count = excluded.typed_count, - hidden = excluded.hidden, + url = CASE + WHEN excluded.last_visit_ms > urls.last_visit_ms THEN excluded.url + ELSE urls.url + END, + title = CASE + WHEN excluded.last_visit_ms > urls.last_visit_ms THEN excluded.title + ELSE urls.title + END, + visit_count = MAX(urls.visit_count, excluded.visit_count), + typed_count = MAX(urls.typed_count, excluded.typed_count), + hidden = CASE + WHEN excluded.last_visit_ms > urls.last_visit_ms THEN excluded.hidden + ELSE urls.hidden + END, last_visit_ms = CASE WHEN excluded.last_visit_ms > urls.last_visit_ms THEN excluded.last_visit_ms ELSE urls.last_visit_ms @@ -207,8 +216,14 @@ impl HistoryBatchConsumer for BrowserHistoryArchiveConsumer<'_> { WHEN excluded.last_visit_ms > urls.last_visit_ms THEN excluded.last_visit_iso ELSE urls.last_visit_iso END, - payload_hash = excluded.payload_hash, - recorded_at = excluded.recorded_at + payload_hash = CASE + WHEN excluded.last_visit_ms > urls.last_visit_ms THEN excluded.payload_hash + ELSE urls.payload_hash + END, + recorded_at = CASE + WHEN excluded.last_visit_ms > urls.last_visit_ms THEN excluded.recorded_at + ELSE urls.recorded_at + END RETURNING id", params![ url.url, diff --git a/src-tauri/crates/vault-core/src/takeout/payload_import.rs b/src-tauri/crates/vault-core/src/takeout/payload_import.rs index 7e07aa2a..712f5ede 100644 --- a/src-tauri/crates/vault-core/src/takeout/payload_import.rs +++ b/src-tauri/crates/vault-core/src/takeout/payload_import.rs @@ -133,11 +133,22 @@ impl HistoryBatchConsumer for TakeoutArchiveChunkConsumer<'_> { payload_hash, recorded_at ) - VALUES (?1, ?2, 1, 0, ?3, ?4, ?3, ?4, ?5, ?6, ?7, ?8, ?9, ?10) + VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?5, ?6, ?7, ?8, ?9, ?10, ?11, ?12) ON CONFLICT(source_profile_id, source_url_id) DO UPDATE SET - url = excluded.url, - title = excluded.title, - hidden = excluded.hidden, + url = CASE + WHEN excluded.last_visit_ms > urls.last_visit_ms THEN excluded.url + ELSE urls.url + END, + title = CASE + WHEN excluded.last_visit_ms > urls.last_visit_ms THEN excluded.title + ELSE urls.title + END, + visit_count = MAX(urls.visit_count, excluded.visit_count), + typed_count = MAX(urls.typed_count, excluded.typed_count), + hidden = CASE + WHEN excluded.last_visit_ms > urls.last_visit_ms THEN excluded.hidden + ELSE urls.hidden + END, last_visit_ms = CASE WHEN excluded.last_visit_ms > urls.last_visit_ms THEN excluded.last_visit_ms ELSE urls.last_visit_ms @@ -146,12 +157,20 @@ impl HistoryBatchConsumer for TakeoutArchiveChunkConsumer<'_> { WHEN excluded.last_visit_ms > urls.last_visit_ms THEN excluded.last_visit_iso ELSE urls.last_visit_iso END, - payload_hash = excluded.payload_hash, - recorded_at = excluded.recorded_at + payload_hash = CASE + WHEN excluded.last_visit_ms > urls.last_visit_ms THEN excluded.payload_hash + ELSE urls.payload_hash + END, + recorded_at = CASE + WHEN excluded.last_visit_ms > urls.last_visit_ms THEN excluded.recorded_at + ELSE urls.recorded_at + END RETURNING id", params![ url.url, url.title, + url.visit_count.max(1), + url.typed_count.max(0), url.last_visit_ms, url.last_visit_iso, self.source_profile_id, diff --git a/src-tauri/crates/vault-worker/src/archive_flows.rs b/src-tauri/crates/vault-worker/src/archive_flows.rs index a734fe44..48eadbe3 100644 --- a/src-tauri/crates/vault-worker/src/archive_flows.rs +++ b/src-tauri/crates/vault-worker/src/archive_flows.rs @@ -314,10 +314,7 @@ fn try_prefetch_new_visit_og_images( let paths = vault_core::project_paths()?; let config = load_unlocked_config(&paths)?; use vault_core::OgImageFetchMode; - if !matches!( - config.og_image.effective_mode(), - OgImageFetchMode::Background - ) { + if !matches!(config.og_image.effective_mode(), OgImageFetchMode::Background) { return Ok((0, 0)); } if budget == 0 { @@ -1285,14 +1282,7 @@ mod tests { let blocked: Vec = vec!["blocked.test".to_string()]; let (sender, receiver) = std::sync::mpsc::channel(); let started = Instant::now(); - let flow = drain_one_worker_url( - &work, - &host_state, - &client, - &blocked, - &sender, - interval, - ); + let flow = drain_one_worker_url(&work, &host_state, &client, &blocked, &sender, interval); let elapsed = started.elapsed(); assert!(matches!(flow, std::ops::ControlFlow::Continue(()))); // Sleep arm ran — total elapsed should reflect the throttle wait. @@ -1586,6 +1576,39 @@ mod tests { let result = prefetch_og_images_on_demand(None, 100); assert_eq!(result.expect("on-demand prefetch empty"), (0, 0)); + // Seed one URL so the non-empty path (enqueue + refetch) runs. + { + let connection = + vault_core::archive::open_archive_connection(&paths, &config, None).expect("conn"); + connection + .execute( + "INSERT INTO runs \ + (id, run_type, trigger, started_at, finished_at, timezone, status, profile_scope_json, warnings_json, stats_json, due_only) \ + VALUES (1, 'backup', 'manual', '2026-01-01T00:00:00Z', '2026-01-01T00:00:01Z', 'UTC', 'success', '[]', '[]', '{}', 0)", + [], + ) + .expect("seed run"); + connection + .execute( + "INSERT OR IGNORE INTO source_profiles \ + (id, browser_kind, profile_name, profile_path, discovered_at, enabled, profile_key, browser_family, browser_product) \ + VALUES (1, 'chrome', 'Default', '/tmp', '2026-01-01T00:00:00Z', 1, 'chrome:Default', 'chromium', 'chrome')", + [], + ) + .expect("seed profile"); + connection + .execute( + "INSERT INTO urls \ + (id, url, visit_count, typed_count, first_visit_ms, first_visit_iso, last_visit_ms, last_visit_iso, source_profile_id, created_by_run_id) \ + VALUES (1, 'https://127.0.0.1:1/nonexistent', 1, 0, 1700000000000, '2023-11-14T22:13:20Z', 1700000000000, '2023-11-14T22:13:20Z', 1, 1)", + [], + ) + .expect("seed url"); + } + let result = prefetch_og_images_on_demand(None, 10); + let (enqueued, _succeeded) = result.expect("on-demand prefetch with url"); + assert_eq!(enqueued, 1); + restore_env_var(PROJECT_ROOT_OVERRIDE_ENV, original_project_root.as_deref()); restore_env_var(TEST_KEYRING_OVERRIDE_ENV, original_keyring.as_deref()); } @@ -1603,10 +1626,7 @@ mod tests { assert!(warnings.iter().any(|w| w.contains("4 succeeded"))); // Error case surfaces a warning with the message text. - append_og_image_prefetch_result( - &mut warnings, - Err(anyhow::anyhow!("network outage")), - ); + append_og_image_prefetch_result(&mut warnings, Err(anyhow::anyhow!("network outage"))); assert!(warnings.iter().any(|w| w.contains("network outage"))); } @@ -1616,10 +1636,7 @@ mod tests { assert_eq!(clamp_budget(100), 100); assert_eq!(clamp_budget(PER_TICK_BUDGET_HARD_CAP), PER_TICK_BUDGET_HARD_CAP as usize); // Above the cap: clamps down. - assert_eq!( - clamp_budget(PER_TICK_BUDGET_HARD_CAP + 1), - PER_TICK_BUDGET_HARD_CAP as usize, - ); + assert_eq!(clamp_budget(PER_TICK_BUDGET_HARD_CAP + 1), PER_TICK_BUDGET_HARD_CAP as usize,); // Arbitrarily large value still caps. assert_eq!(clamp_budget(u32::MAX), PER_TICK_BUDGET_HARD_CAP as usize); } @@ -1630,10 +1647,7 @@ mod tests { let default = OgImageSettings::default(); assert_eq!(default.effective_mode(), OgImageFetchMode::Background); - let mut off = OgImageSettings { - fetch_enabled: false, - ..OgImageSettings::default() - }; + let mut off = OgImageSettings { fetch_enabled: false, ..OgImageSettings::default() }; assert_eq!(off.effective_mode(), OgImageFetchMode::Off); // Even when fetch_mode is explicitly Background, the kill switch diff --git a/src-tauri/crates/vault-worker/src/lib.rs b/src-tauri/crates/vault-worker/src/lib.rs index 01935344..ddff23c8 100644 --- a/src-tauri/crates/vault-worker/src/lib.rs +++ b/src-tauri/crates/vault-worker/src/lib.rs @@ -41,13 +41,12 @@ pub use self::{ import_browser_history_source_with_progress, import_takeout_source, import_takeout_source_with_progress, inspect_browser_history_source, inspect_takeout_source, load_history_favicons, load_history_og_images, - mark_og_images_shown, og_image_storage_stats, preview_import_batch_detail, - preview_remote_backup_bundle, preview_retention_plan, preview_snapshot_restore_plan, - prefetch_og_images_on_demand, query_history, refetch_og_images, repair_health, - restore_import_batch_detail, - revert_import_batch_detail, run_backup_now, run_backup_now_with_progress, - run_og_image_cleanup, run_retention_plan, run_snapshot_restore_plan, - upload_remote_backup_bundle, verify_remote_backup_bundle, + mark_og_images_shown, og_image_storage_stats, prefetch_og_images_on_demand, + preview_import_batch_detail, preview_remote_backup_bundle, preview_retention_plan, + preview_snapshot_restore_plan, query_history, refetch_og_images, repair_health, + restore_import_batch_detail, revert_import_batch_detail, run_backup_now, + run_backup_now_with_progress, run_og_image_cleanup, run_retention_plan, + run_snapshot_restore_plan, upload_remote_backup_bundle, verify_remote_backup_bundle, }, cli::run_worker_cli, intelligence::{ diff --git a/src-tauri/src/worker_bridge/archive.rs b/src-tauri/src/worker_bridge/archive.rs index 95bda030..9f6a52d0 100644 --- a/src-tauri/src/worker_bridge/archive.rs +++ b/src-tauri/src/worker_bridge/archive.rs @@ -143,10 +143,7 @@ pub(crate) fn prefetch_og_images_impl( budget: u32, session_database_key: Option<&str>, ) -> Result<(u32, u32), String> { - worker_result(vault_worker::prefetch_og_images_on_demand( - session_database_key, - budget, - )) + worker_result(vault_worker::prefetch_og_images_on_demand(session_database_key, budget)) } #[cfg_attr(test, allow(dead_code))] diff --git a/src-tauri/src/worker_bridge/mod.rs b/src-tauri/src/worker_bridge/mod.rs index d823b5aa..e4e26ee0 100644 --- a/src-tauri/src/worker_bridge/mod.rs +++ b/src-tauri/src/worker_bridge/mod.rs @@ -999,6 +999,16 @@ mod tests { .expect("refetch with fetch_enabled=false"); assert_eq!(disabled, 0); + // Re-enable fetch for prefetch_og_images_impl coverage — + // budget=0 short-circuits before any network IO. + let re_enabled = initialized_config(); + save_config_impl(re_enabled, session_key(&session).as_deref()) + .expect("re-enable og config"); + let (enqueued, _succeeded) = + super::prefetch_og_images_impl(0, session_key(&session).as_deref()) + .expect("prefetch with zero budget"); + assert_eq!(enqueued, 0); + unsafe { std::env::remove_var(PROJECT_ROOT_OVERRIDE_ENV); std::env::remove_var(CHROME_USER_DATA_OVERRIDE_ENV); diff --git a/src/app/shell.test.tsx b/src/app/shell.test.tsx index 3e2e42a4..8b73614e 100644 --- a/src/app/shell.test.tsx +++ b/src/app/shell.test.tsx @@ -142,7 +142,9 @@ describe('AppShell (paper redesign)', () => { const user = userEvent.setup() renderShell({}, '/') const topbar = screen.getByTestId('pk-topbar') - const paletteTrigger = topbar.querySelector('button[data-testid="pk-topbar-palette"]') + const paletteTrigger = topbar.querySelector( + 'button[data-testid="pk-topbar-palette"]', + ) expect(paletteTrigger).not.toBeNull() if (!paletteTrigger) throw new Error('palette trigger missing') await user.click(paletteTrigger) @@ -189,7 +191,9 @@ describe('AppShell (paper redesign)', () => { renderShell({}, '/') const paletteTrigger = screen .getByTestId('pk-topbar') - .querySelector('button[data-testid="pk-topbar-palette"]') + .querySelector( + 'button[data-testid="pk-topbar-palette"]', + ) if (!paletteTrigger) throw new Error('palette trigger missing') await user.click(paletteTrigger) const input = await screen.findByPlaceholderText(/Find a page/i) @@ -220,7 +224,9 @@ describe('AppShell (paper redesign)', () => { renderShell({}, '/') const paletteTrigger = screen .getByTestId('pk-topbar') - .querySelector('button[data-testid="pk-topbar-palette"]') + .querySelector( + 'button[data-testid="pk-topbar-palette"]', + ) if (!paletteTrigger) throw new Error('palette trigger missing') await user.click(paletteTrigger) const input = await screen.findByPlaceholderText(/Find a page/i) @@ -258,7 +264,9 @@ describe('AppShell (paper redesign)', () => { renderShell({}, '/') const paletteTrigger = screen .getByTestId('pk-topbar') - .querySelector('button[data-testid="pk-topbar-palette"]') + .querySelector( + 'button[data-testid="pk-topbar-palette"]', + ) if (!paletteTrigger) throw new Error('palette trigger missing') await user.click(paletteTrigger) const input = await screen.findByPlaceholderText(/Find a page/i) @@ -297,7 +305,9 @@ describe('AppShell (paper redesign)', () => { renderShell({}, '/') const paletteTrigger = screen .getByTestId('pk-topbar') - .querySelector('button[data-testid="pk-topbar-palette"]') + .querySelector( + 'button[data-testid="pk-topbar-palette"]', + ) if (!paletteTrigger) throw new Error('palette trigger missing') await user.click(paletteTrigger) const input = await screen.findByPlaceholderText(/Find a page/i) @@ -338,7 +348,9 @@ describe('AppShell (paper redesign)', () => { renderShell({}, '/') const paletteTrigger = screen .getByTestId('pk-topbar') - .querySelector('button[data-testid="pk-topbar-palette"]') + .querySelector( + 'button[data-testid="pk-topbar-palette"]', + ) if (!paletteTrigger) throw new Error('palette trigger missing') await user.click(paletteTrigger) const input = await screen.findByPlaceholderText(/Find a page/i) @@ -413,7 +425,9 @@ describe('AppShell (paper redesign)', () => { renderShell({}, '/') const paletteTrigger = screen .getByTestId('pk-topbar') - .querySelector('button[data-testid="pk-topbar-palette"]') + .querySelector( + 'button[data-testid="pk-topbar-palette"]', + ) if (!paletteTrigger) throw new Error('palette trigger missing') await user.click(paletteTrigger) await screen.findByPlaceholderText(/Find a page/i) diff --git a/src/components/explorer-paper/paper-browse-primitives.test.tsx b/src/components/explorer-paper/paper-browse-primitives.test.tsx index 5a7fcda3..535fddbe 100644 --- a/src/components/explorer-paper/paper-browse-primitives.test.tsx +++ b/src/components/explorer-paper/paper-browse-primitives.test.tsx @@ -349,9 +349,9 @@ describe('PaperContactFrame', () => { ) // sanitizeExplorerDisplayText / strip-www is case-insensitive on // the prefix only; the rest stays untouched. - expect( - screen.getByTestId('frame-case-fallback').textContent, - ).toContain('GitHub.com') + expect(screen.getByTestId('frame-case-fallback').textContent).toContain( + 'GitHub.com', + ) }) test('fallback panel renders the time chip even when title is absent', () => { diff --git a/src/components/explorer-paper/paper-contact-sheet.tsx b/src/components/explorer-paper/paper-contact-sheet.tsx index a12443ac..35491367 100644 --- a/src/components/explorer-paper/paper-contact-sheet.tsx +++ b/src/components/explorer-paper/paper-contact-sheet.tsx @@ -25,13 +25,7 @@ * - Paper Browse primitives + DayNavControl + CalendarPopover. */ -import { - useEffect, - useMemo, - useRef, - useState, - type ReactNode, -} from 'react' +import { useEffect, useMemo, useRef, useState, type ReactNode } from 'react' import { cn } from '@/lib/cn' import type { HistoryEntry } from '@/lib/types/archive' import type { PaperBlock, PaperDay } from '@/pages/explorer/paper/group-entries' diff --git a/src/components/explorer-paper/paper-day-insights-helpers.test.ts b/src/components/explorer-paper/paper-day-insights-helpers.test.ts index 76626e95..417997dc 100644 --- a/src/components/explorer-paper/paper-day-insights-helpers.test.ts +++ b/src/components/explorer-paper/paper-day-insights-helpers.test.ts @@ -452,9 +452,10 @@ describe('aggregateDayInsights', () => { }), ] const insights = aggregateDayInsights(dayFromEntries('2026-05-21', visits)) - expect(insights.topSearchQueries.map((row) => row.query).sort()).toEqual( - ['naked', 'with-www'], - ) + expect(insights.topSearchQueries.map((row) => row.query).sort()).toEqual([ + 'naked', + 'with-www', + ]) }) test('search-engine subdomain we have not mapped is ignored', () => { @@ -497,11 +498,7 @@ describe('aggregateDayInsights', () => { ] const insights = aggregateDayInsights(dayFromEntries('2026-05-21', visits)) const queries = insights.topSearchQueries.map((row) => row.query).sort() - expect(queries).toEqual([ - 'baidu-query', - 'google-query', - 'yahoo-query', - ]) + expect(queries).toEqual(['baidu-query', 'google-query', 'yahoo-query']) }) test('totalPages tally still counts even when no queries are extracted', () => { diff --git a/src/components/explorer-paper/paper-day-insights-helpers.ts b/src/components/explorer-paper/paper-day-insights-helpers.ts index 49d4c5a4..e8191c5a 100644 --- a/src/components/explorer-paper/paper-day-insights-helpers.ts +++ b/src/components/explorer-paper/paper-day-insights-helpers.ts @@ -118,10 +118,7 @@ export function aggregateDayInsights(day: PaperDay): DayInsights { string, { url: string; title: string | null; visits: number } >() - const searchQueryCounts = new Map< - string, - { query: string; count: number } - >() + const searchQueryCounts = new Map() const hourBuckets = new Array(24).fill(0) let totalPages = 0 let typedCount = 0 diff --git a/src/components/shell/use-route-history-nav.test.tsx b/src/components/shell/use-route-history-nav.test.tsx index 9023f701..5a156f9c 100644 --- a/src/components/shell/use-route-history-nav.test.tsx +++ b/src/components/shell/use-route-history-nav.test.tsx @@ -36,6 +36,17 @@ function NavHarness({ + {/* Simulates the browser's back arrow — `navigate(-1)` fires a + Pop without going through the hook's `goBack` callback (which + would normally set forwardAvailable=true on the in-app path). + This is the path that exposed the bug where browser-back left + canGoForward stranded at false. */} + {api.canGoBack ? 'y' : 'n'} {api.canGoForward ? 'y' : 'n'} @@ -87,7 +98,7 @@ describe('useRouteHistoryNav', () => { vi.useRealTimers() }) - test('starts disabled at history root and enables back after a push', async () => { + test('starts disabled at history root and enables back after a push', () => { const calls: ReturnType[] = [] render( @@ -97,58 +108,93 @@ describe('useRouteHistoryNav', () => { expect(screen.getByTestId('harness-can-back')).toHaveTextContent('n') expect(screen.getByTestId('harness-can-forward')).toHaveTextContent('n') - await act(async () => { + act(() => { screen.getByTestId('harness-push').click() }) expect(screen.getByTestId('harness-can-back')).toHaveTextContent('y') expect(screen.getByTestId('harness-can-forward')).toHaveTextContent('n') }) - test('goBack arms forward and goForward clears it again', async () => { + test('goBack arms forward and goForward clears it again', () => { render( {}} /> , ) - await act(async () => { + act(() => { screen.getByTestId('harness-push').click() }) - await act(async () => { + act(() => { screen.getByTestId('harness-back').click() }) expect(screen.getByTestId('harness-can-forward')).toHaveTextContent('y') - await act(async () => { + act(() => { screen.getByTestId('harness-forward').click() }) expect(screen.getByTestId('harness-can-forward')).toHaveTextContent('n') }) - test('goBack is a no-op at history root', async () => { + test('goBack is a no-op at history root', () => { render( {}} /> , ) - await act(async () => { + act(() => { screen.getByTestId('harness-back').click() }) // No new render side-effects; still at the disabled baseline. expect(screen.getByTestId('harness-can-back')).toHaveTextContent('n') }) - test('goForward is a no-op when there is no forward branch', async () => { + test('goForward is a no-op when there is no forward branch', () => { render( {}} /> , ) - await act(async () => { + act(() => { + screen.getByTestId('harness-forward').click() + }) + expect(screen.getByTestId('harness-can-forward')).toHaveTextContent('n') + }) + + test('browser-back (Pop bypassing goBack) enables canGoForward', () => { + // The in-app `goBack` callback sets forwardAvailable=true before + // calling navigate(-1). The browser's back arrow also fires a Pop + // event but doesn't invoke the callback — previously this left + // canGoForward stranded at false even though forward navigation + // was actually available. The Pop branch in the effect now mirrors + // the same forwardAvailable=true behavior. + render( + + {}} /> + , + ) + act(() => { + screen.getByTestId('harness-push').click() + }) + expect(screen.getByTestId('harness-can-back')).toHaveTextContent('y') + expect(screen.getByTestId('harness-can-forward')).toHaveTextContent('n') + + // Browser-back: navigate(-1) directly, bypassing goBack(). + act(() => { + screen.getByTestId('harness-browser-back').click() + }) + expect(screen.getByTestId('harness-can-back')).toHaveTextContent('n') + expect( + screen.getByTestId('harness-can-forward'), + 'browser-back must enable canGoForward so the topbar forward chevron reflects the browser state', + ).toHaveTextContent('y') + + // goForward then consumes forwardAvailable as usual. + act(() => { screen.getByTestId('harness-forward').click() }) expect(screen.getByTestId('harness-can-forward')).toHaveTextContent('n') }) - test('Cmd+[ fires goBack on Mac platforms', async () => { + test('Cmd+[ fires goBack on Mac platforms', () => { setPlatform('MacIntel') setUserAgent('Mozilla/5.0 (Macintosh)') render( @@ -156,16 +202,16 @@ describe('useRouteHistoryNav', () => { {}} /> , ) - await act(async () => { + act(() => { screen.getByTestId('harness-push').click() }) - await act(async () => { + act(() => { fireEvent.keyDown(document, { key: '[', metaKey: true }) }) expect(screen.getByTestId('harness-can-forward')).toHaveTextContent('y') }) - test('Ctrl+] fires goForward on non-Mac platforms after a back step', async () => { + test('Ctrl+] fires goForward on non-Mac platforms after a back step', () => { setPlatform('Linux x86_64') setUserAgent('Mozilla/5.0 (X11; Linux x86_64)') render( @@ -173,20 +219,20 @@ describe('useRouteHistoryNav', () => { {}} /> , ) - await act(async () => { + act(() => { screen.getByTestId('harness-push').click() }) - await act(async () => { + act(() => { fireEvent.keyDown(document, { key: '[', ctrlKey: true }) }) expect(screen.getByTestId('harness-can-forward')).toHaveTextContent('y') - await act(async () => { + act(() => { fireEvent.keyDown(document, { key: ']', ctrlKey: true }) }) expect(screen.getByTestId('harness-can-forward')).toHaveTextContent('n') }) - test('keyboard shortcut is ignored while focus is in an editable target', async () => { + test('keyboard shortcut is ignored while focus is in an editable target', () => { setPlatform('MacIntel') setUserAgent('Mozilla/5.0 (Macintosh)') render( @@ -194,19 +240,19 @@ describe('useRouteHistoryNav', () => { {}} /> , ) - await act(async () => { + act(() => { screen.getByTestId('harness-push').click() }) - const input = screen.getByTestId('harness-input') as HTMLInputElement + const input = screen.getByTestId('harness-input') input.focus() - await act(async () => { + act(() => { fireEvent.keyDown(input, { key: '[', metaKey: true }) }) // Editable focus suppressed the shortcut → still no forward branch. expect(screen.getByTestId('harness-can-forward')).toHaveTextContent('n') }) - test('keyboard shortcut requires the platform-specific modifier', async () => { + test('keyboard shortcut requires the platform-specific modifier', () => { setPlatform('MacIntel') setUserAgent('Mozilla/5.0 (Macintosh)') render( @@ -214,20 +260,20 @@ describe('useRouteHistoryNav', () => { {}} /> , ) - await act(async () => { + act(() => { screen.getByTestId('harness-push').click() }) // On Mac, Ctrl+[ should be ignored — only Cmd (meta) counts. - await act(async () => { + act(() => { fireEvent.keyDown(document, { key: '[', ctrlKey: true }) }) expect(screen.getByTestId('harness-can-forward')).toHaveTextContent('n') // Alt/Shift modifiers disqualify even with the correct base mod. - await act(async () => { + act(() => { fireEvent.keyDown(document, { key: '[', metaKey: true, altKey: true }) }) expect(screen.getByTestId('harness-can-forward')).toHaveTextContent('n') - await act(async () => { + act(() => { fireEvent.keyDown(document, { key: '[', metaKey: true, @@ -236,13 +282,13 @@ describe('useRouteHistoryNav', () => { }) expect(screen.getByTestId('harness-can-forward')).toHaveTextContent('n') // Unrelated key never fires either branch. - await act(async () => { + act(() => { fireEvent.keyDown(document, { key: 'a', metaKey: true }) }) expect(screen.getByTestId('harness-can-forward')).toHaveTextContent('n') }) - test('non-Mac platforms reject Cmd+[ — only Ctrl counts', async () => { + test('non-Mac platforms reject Cmd+[ — only Ctrl counts', () => { setPlatform('Linux x86_64') setUserAgent('Mozilla/5.0 (X11; Linux x86_64)') render( @@ -250,10 +296,10 @@ describe('useRouteHistoryNav', () => { {}} /> , ) - await act(async () => { + act(() => { screen.getByTestId('harness-push').click() }) - await act(async () => { + act(() => { fireEvent.keyDown(document, { key: '[', metaKey: true }) }) expect(screen.getByTestId('harness-can-forward')).toHaveTextContent('n') @@ -292,7 +338,7 @@ describe('useRouteHistoryNav', () => { expect(screen.getByTestId('harness-modifier')).toHaveTextContent('⌘') }) - test('keyboard shortcut bails out when the target is contenteditable', async () => { + test('keyboard shortcut bails out when the target is contenteditable', () => { setPlatform('MacIntel') setUserAgent('Mozilla/5.0 (Macintosh)') render( @@ -301,17 +347,17 @@ describe('useRouteHistoryNav', () => {
, ) - await act(async () => { + act(() => { screen.getByTestId('harness-push').click() }) const editable = screen.getByTestId('harness-editable') - await act(async () => { + act(() => { fireEvent.keyDown(editable, { key: '[', metaKey: true }) }) expect(screen.getByTestId('harness-can-forward')).toHaveTextContent('n') }) - test('keyboard shortcut tolerates non-element keydown targets', async () => { + test('keyboard shortcut tolerates non-element keydown targets', () => { setPlatform('MacIntel') setUserAgent('Mozilla/5.0 (Macintosh)') render( @@ -319,7 +365,7 @@ describe('useRouteHistoryNav', () => { {}} /> , ) - await act(async () => { + act(() => { screen.getByTestId('harness-push').click() }) const event = new KeyboardEvent('keydown', { @@ -328,7 +374,7 @@ describe('useRouteHistoryNav', () => { bubbles: true, }) Object.defineProperty(event, 'target', { value: null }) - await act(async () => { + act(() => { document.dispatchEvent(event) }) expect(screen.getByTestId('harness-can-forward')).toHaveTextContent('y') diff --git a/src/components/shell/use-route-history-nav.ts b/src/components/shell/use-route-history-nav.ts index 938432ea..c2c3717f 100644 --- a/src/components/shell/use-route-history-nav.ts +++ b/src/components/shell/use-route-history-nav.ts @@ -56,10 +56,7 @@ const isMacLike = (): boolean => { const modifierLabelForPlatform = (): string => (isMacLike() ? '⌘' : 'Ctrl+') -const shortcutMatches = ( - event: KeyboardEvent, - key: '[' | ']', -): boolean => { +const shortcutMatches = (event: KeyboardEvent, key: '[' | ']'): boolean => { if (event.key !== key) return false // Avoid hijacking shortcuts the OS / browser owns (e.g. window switch // shortcuts on Linux use Alt/Super). Match either Meta (Cmd) on macOS @@ -110,6 +107,15 @@ export function useRouteHistoryNav(): RouteHistoryNav { // the initial mount (always `Pop` per react-router) would underflow // the stack to -1 → 0. const lastKeyRef = useRef(null) + // `goForward` calls `navigate(1)` which fires a Pop event. We need to + // distinguish that Pop (the user consumed the forward branch — should + // leave forwardAvailable=false) from a browser-back-initiated Pop + // (the user just stepped backwards — should set forwardAvailable=true + // so the topbar forward chevron reflects the browser's actual state). + // React-router does not expose the delta direction on Pop events, so + // we tag the in-app goForward path explicitly and have the effect + // consume the tag on the next Pop. + const expectingForwardPopRef = useRef(false) useEffect(() => { if (lastKeyRef.current === location.key) return @@ -121,13 +127,38 @@ export function useRouteHistoryNav(): RouteHistoryNav { } lastKeyRef.current = location.key if (navigationType === NavigationType.Push) { + // Synchronizing React state with an external system (the router's + // navigation events) is exactly what useEffect is for, even + // though react-hooks/set-state-in-effect cannot distinguish this + // case from the antipattern it targets (derive-on-render leaks). + // The setState is gated on `lastKeyRef.current` changing, so it + // runs at most once per actual navigation, not per render. + // eslint-disable-next-line react-hooks/set-state-in-effect setStackIndex((index) => index + 1) // A Push wipes any in-flight forward branch, mirroring browser // behaviour. Otherwise a back-then-link-click would still leave // the forward arrow lit. setForwardAvailable(false) } else if (navigationType === NavigationType.Pop) { + // Same justification as the Push branch above — Pop is also a + // router-driven external event we forward into local stack + // state. The rule only fires once per effect body, so no extra + // eslint-disable is needed here. setStackIndex((index) => Math.max(0, index - 1)) + if (expectingForwardPopRef.current) { + // This Pop is the tail of an in-app `goForward` → navigate(1). + // goForward already set forwardAvailable=false before + // triggering the navigation; consume the tag and leave the + // state alone. + expectingForwardPopRef.current = false + } else { + // External-initiated Pop (browser back arrow, history.go(-N), + // or the in-app goBack which also flowed through this path). + // In every one of those cases the user just stepped backwards, + // so forward navigation is now available — enable the topbar + // forward chevron to mirror the browser's actual forward state. + setForwardAvailable(true) + } } // NavigationType.Replace intentionally does not move the counter — // a redirect / canonicalisation should not arm the back button. @@ -145,6 +176,10 @@ export function useRouteHistoryNav(): RouteHistoryNav { const goForward = useCallback(() => { if (!forwardAvailable) return setForwardAvailable(false) + // Tag the upcoming Pop so the effect doesn't re-enable + // forwardAvailable from underneath us. See the matching consumer + // in the Pop branch above. + expectingForwardPopRef.current = true void navigate(1) }, [forwardAvailable, navigate]) diff --git a/src/lib/explorer-preferences.test.ts b/src/lib/explorer-preferences.test.ts index 34cb534c..29909de6 100644 --- a/src/lib/explorer-preferences.test.ts +++ b/src/lib/explorer-preferences.test.ts @@ -4,14 +4,26 @@ * @module lib/explorer-preferences */ -import { describe, expect, test } from 'vitest' +import { afterEach, describe, expect, test, vi } from 'vitest' import { + CLOCK_FORMAT_EVENT, + defaultClockFormat, defaultExplorerBackgroundPrefetchPages, + defaultExplorerViewMode, explorerBackgroundPrefetchPageOptions, maxExplorerBackgroundPrefetchPages, normalizeExplorerBackgroundPrefetchPages, + persistClockFormat, + persistExplorerViewMode, + readClockFormat, + readExplorerViewMode, } from './explorer-preferences' +afterEach(() => { + window.localStorage.clear() + vi.restoreAllMocks() +}) + describe('Explorer background prefetch preferences', () => { test('normalizes invalid, low, high, and fractional values', () => { expect(normalizeExplorerBackgroundPrefetchPages(null)).toBe( @@ -33,3 +45,131 @@ describe('Explorer background prefetch preferences', () => { ]) }) }) + +// ── Browse view-mode persistence ────────────────────────────────────── + +describe('readExplorerViewMode', () => { + test('returns "cards" when localStorage is empty', () => { + expect(readExplorerViewMode()).toBe('cards') + }) + + test('returns "list" when stored value is "list"', () => { + window.localStorage.setItem('pathkeep.explorerViewMode', 'list') + expect(readExplorerViewMode()).toBe('list') + }) + + test('returns "cards" for unrecognised stored values', () => { + window.localStorage.setItem('pathkeep.explorerViewMode', 'grid') + expect(readExplorerViewMode()).toBe('cards') + }) + + test('returns default when localStorage.getItem throws', () => { + vi.spyOn(Storage.prototype, 'getItem').mockImplementation(() => { + throw new Error('storage disabled') + }) + expect(readExplorerViewMode()).toBe(defaultExplorerViewMode) + }) +}) + +describe('persistExplorerViewMode', () => { + test('writes mode to localStorage', () => { + persistExplorerViewMode('list') + expect(window.localStorage.getItem('pathkeep.explorerViewMode')).toBe( + 'list', + ) + }) + + test('skips write when current mode already matches', () => { + window.localStorage.setItem('pathkeep.explorerViewMode', 'list') + const spy = vi.spyOn(Storage.prototype, 'setItem') + persistExplorerViewMode('list') + expect(spy).not.toHaveBeenCalled() + }) + + test('swallows localStorage.setItem errors', () => { + vi.spyOn(Storage.prototype, 'setItem').mockImplementation(() => { + throw new Error('quota exceeded') + }) + expect(() => persistExplorerViewMode('list')).not.toThrow() + }) +}) + +// ── Clock format persistence ────────────────────────────────────────── + +describe('readClockFormat', () => { + test('returns "12h" when localStorage is empty', () => { + expect(readClockFormat()).toBe('12h') + }) + + test('returns "24h" when stored value is "24h"', () => { + window.localStorage.setItem('pathkeep.clockFormat', '24h') + expect(readClockFormat()).toBe('24h') + }) + + test('returns default for unrecognised stored values', () => { + window.localStorage.setItem('pathkeep.clockFormat', 'military') + expect(readClockFormat()).toBe(defaultClockFormat) + }) + + test('returns default when localStorage.getItem throws', () => { + vi.spyOn(Storage.prototype, 'getItem').mockImplementation(() => { + throw new Error('storage disabled') + }) + expect(readClockFormat()).toBe(defaultClockFormat) + }) +}) + +describe('persistClockFormat', () => { + test('writes format to localStorage and dispatches event', () => { + const events: string[] = [] + const listener = (e: Event) => { + const detail = (e as CustomEvent<{ format: string }>).detail + events.push(detail.format) + } + window.addEventListener(CLOCK_FORMAT_EVENT, listener) + try { + persistClockFormat('24h') + expect(window.localStorage.getItem('pathkeep.clockFormat')).toBe('24h') + expect(events).toEqual(['24h']) + } finally { + window.removeEventListener(CLOCK_FORMAT_EVENT, listener) + } + }) + + test('skips write when current format already matches', () => { + window.localStorage.setItem('pathkeep.clockFormat', '24h') + const spy = vi.spyOn(Storage.prototype, 'setItem') + persistClockFormat('24h') + expect(spy).not.toHaveBeenCalled() + }) + + test('swallows localStorage.setItem errors but still dispatches event', () => { + vi.spyOn(Storage.prototype, 'setItem').mockImplementation(() => { + throw new Error('quota exceeded') + }) + const events: string[] = [] + const listener = (e: Event) => { + const detail = (e as CustomEvent<{ format: string }>).detail + events.push(detail.format) + } + window.addEventListener(CLOCK_FORMAT_EVENT, listener) + try { + expect(() => persistClockFormat('24h')).not.toThrow() + expect(events).toEqual(['24h']) + } finally { + window.removeEventListener(CLOCK_FORMAT_EVENT, listener) + } + }) + + test('swallows CustomEvent dispatch errors', () => { + const original = window.dispatchEvent.bind(window) + window.dispatchEvent = vi.fn(() => { + throw new Error('dispatchEvent unsupported') + }) + try { + expect(() => persistClockFormat('24h')).not.toThrow() + } finally { + window.dispatchEvent = original + } + }) +}) diff --git a/src/lib/i18n/catalog/settings-core-and-platform.ts b/src/lib/i18n/catalog/settings-core-and-platform.ts index a24b43f7..6988f4c7 100644 --- a/src/lib/i18n/catalog/settings-core-and-platform.ts +++ b/src/lib/i18n/catalog/settings-core-and-platform.ts @@ -116,8 +116,7 @@ export const settingsCoreAndPlatformNamespace = { linkPreviewsRebuildAction: 'Rebuild now ({budget})', linkPreviewsRebuildHint: 'Sweeps up to {budget} of the most recently visited URLs without a cached preview (worker hard-caps any single pass at {cap}).', - linkPreviewsRebuildSummary: - 'Enqueued {enqueued}, succeeded {succeeded}.', + linkPreviewsRebuildSummary: 'Enqueued {enqueued}, succeeded {succeeded}.', linkPreviewsStatsLabel: 'Cache footprint', linkPreviewsStatsRows: '{rows} rows · {blobs} blobs · {bytes}', linkPreviewsStatsEmpty: 'No previews cached yet.', @@ -344,7 +343,8 @@ export const settingsCoreAndPlatformNamespace = { linkPreviewsFetchModeOnDemand: '按需', linkPreviewsFetchModeOnDemandHint: '只在卡片滚入视口时抓取。', linkPreviewsFetchModeBackground: '后台', - linkPreviewsFetchModeBackgroundHint: '按需 + 每次备份预抓 + 每日重试。推荐。', + linkPreviewsFetchModeBackgroundHint: + '按需 + 每次备份预抓 + 每日重试。推荐。', linkPreviewsBudgetsLabel: '每次备份预算', linkPreviewsBudgetsHint: '限制每日重试和新访问预抓单次入队的 URL 数量上限,避免短时间内大量对外请求。设为 0 即停用该项。', diff --git a/src/lib/paper-preferences.test.ts b/src/lib/paper-preferences.test.ts index 3b1cdb1c..bd71f92c 100644 --- a/src/lib/paper-preferences.test.ts +++ b/src/lib/paper-preferences.test.ts @@ -124,4 +124,42 @@ describe('applyPaperPreferences', () => { document.documentElement.style.getPropertyValue('--vignette-opacity'), ).toBe('0') }) + + test('dispatches PAPER_PREFERENCES_EVENT with the resolved prefs', () => { + const events: PaperPreferences[] = [] + const listener = (e: Event) => { + const detail = (e as CustomEvent<{ preferences: PaperPreferences }>) + .detail + events.push(detail.preferences) + } + window.addEventListener('pathkeep.paperPreferencesChanged', listener) + try { + const candidate: PaperPreferences = { + theme: 'dark', + fonts: 'system', + density: 'compact', + paperTexture: false, + } + applyPaperPreferences(candidate) + expect(events).toHaveLength(1) + expect(events[0]).toEqual(candidate) + } finally { + window.removeEventListener('pathkeep.paperPreferencesChanged', listener) + } + }) + + test('persists and returns the resolved bundle', () => { + const candidate: PaperPreferences = { + theme: 'dark', + fonts: 'system', + density: 'compact', + paperTexture: true, + } + const result = applyPaperPreferences(candidate) + expect(result).toEqual(candidate) + expect(window.localStorage.getItem('pathkeep.theme')).toBe('dark') + expect(window.localStorage.getItem('pathkeep.fonts')).toBe('system') + expect(window.localStorage.getItem('pathkeep.density')).toBe('compact') + expect(window.localStorage.getItem('pathkeep.paperTexture')).toBe('on') + }) }) diff --git a/src/pages/explorer/paper-view.test.tsx b/src/pages/explorer/paper-view.test.tsx index b833c558..62d68fdc 100644 --- a/src/pages/explorer/paper-view.test.tsx +++ b/src/pages/explorer/paper-view.test.tsx @@ -440,9 +440,7 @@ describe('PaperExplorerView', () => { ), ).toBe(true) expect( - Array.from(rows).some((row) => - row.textContent?.includes('arxiv paper'), - ), + Array.from(rows).some((row) => row.textContent?.includes('arxiv paper')), ).toBe(true) }) diff --git a/src/pages/settings/appearance-section.test.tsx b/src/pages/settings/appearance-section.test.tsx index b8b30067..f95fddde 100644 --- a/src/pages/settings/appearance-section.test.tsx +++ b/src/pages/settings/appearance-section.test.tsx @@ -127,6 +127,37 @@ describe('AppearanceSection', () => { } }) + test('the appearance card reflows when PAPER_PREFERENCES_EVENT fires from a peer surface', async () => { + render( + + + , + ) + const light = screen.getByRole('radio', { name: /Paper · light/i }) + const dark = screen.getByRole('radio', { name: /Darkroom · dark/i }) + expect(light.getAttribute('aria-checked')).toBe('true') + expect(dark.getAttribute('aria-checked')).toBe('false') + + await import('@testing-library/react').then(({ act }) => + act(() => { + window.dispatchEvent( + new CustomEvent('pathkeep.paperPreferencesChanged', { + detail: { + preferences: { + theme: 'dark', + fonts: 'bundled', + density: 'comfortable', + paperTexture: true, + }, + }, + }), + ) + }), + ) + expect(dark.getAttribute('aria-checked')).toBe('true') + expect(light.getAttribute('aria-checked')).toBe('false') + }) + test('the appearance card reflows when CLOCK_FORMAT_EVENT fires from a peer surface', async () => { render( @@ -155,4 +186,33 @@ describe('AppearanceSection', () => { expect(twentyFour.getAttribute('aria-checked')).toBe('true') expect(twelve.getAttribute('aria-checked')).toBe('false') }) + + test('peer events with missing detail do not crash or change state', async () => { + render( + + + , + ) + const light = screen.getByRole('radio', { name: /Paper · light/i }) + const twelve = screen.getByRole('radio', { name: /12-hour/i }) + expect(light.getAttribute('aria-checked')).toBe('true') + expect(twelve.getAttribute('aria-checked')).toBe('true') + + await import('@testing-library/react').then(({ act }) => + act(() => { + window.dispatchEvent( + new CustomEvent('pathkeep.paperPreferencesChanged', { + detail: {}, + }), + ) + window.dispatchEvent( + new CustomEvent('pathkeep.clockFormatChanged', { + detail: {}, + }), + ) + }), + ) + expect(light.getAttribute('aria-checked')).toBe('true') + expect(twelve.getAttribute('aria-checked')).toBe('true') + }) }) diff --git a/src/pages/settings/link-previews-section.test.tsx b/src/pages/settings/link-previews-section.test.tsx index 9dde4075..dd5500f5 100644 --- a/src/pages/settings/link-previews-section.test.tsx +++ b/src/pages/settings/link-previews-section.test.tsx @@ -272,9 +272,7 @@ describe('LinkPreviewsSection', () => { totalBytes: 0, oldestFetchedAt: null, }) - render( - withShell({ ogImageFetchEnabled: true, fetchMode: 'on_demand' }), - ) + render(withShell({ ogImageFetchEnabled: true, fetchMode: 'on_demand' })) expect( screen .getByTestId('link-previews-fetch-mode-on_demand') @@ -339,23 +337,11 @@ describe('LinkPreviewsSection', () => { totalBytes: 0, oldestFetchedAt: null, }) - render( - withShell({ ogImageFetchEnabled: false, fetchMode: 'background' }), - ) + render(withShell({ ogImageFetchEnabled: false, fetchMode: 'background' })) + expect(screen.getByTestId('link-previews-fetch-mode-off')).toBeDisabled() expect( - ( - screen.getByTestId( - 'link-previews-fetch-mode-off', - ) as HTMLButtonElement - ).disabled, - ).toBe(true) - expect( - ( - screen.getByTestId( - 'link-previews-fetch-mode-background', - ) as HTMLButtonElement - ).disabled, - ).toBe(true) + screen.getByTestId('link-previews-fetch-mode-background'), + ).toBeDisabled() }) test('daily refetch budget renders the snapshot value', () => { @@ -386,13 +372,12 @@ describe('LinkPreviewsSection', () => { }) const saveConfig = vi.fn().mockResolvedValue(undefined) render(withShell({ ogImageFetchEnabled: true, saveConfig })) - fireEvent.change( - screen.getByTestId('link-previews-daily-refetch-budget'), - { target: { value: '250' } }, + fireEvent.change(screen.getByTestId('link-previews-daily-refetch-budget'), { + target: { value: '250' }, + }) + expect(saveConfig.mock.calls.at(-1)?.[0].ogImage.dailyRefetchBudget).toBe( + 250, ) - expect( - saveConfig.mock.calls.at(-1)?.[0].ogImage.dailyRefetchBudget, - ).toBe(250) }) test('daily refetch budget clamps above the maximum (5000)', () => { @@ -404,13 +389,12 @@ describe('LinkPreviewsSection', () => { }) const saveConfig = vi.fn().mockResolvedValue(undefined) render(withShell({ ogImageFetchEnabled: true, saveConfig })) - fireEvent.change( - screen.getByTestId('link-previews-daily-refetch-budget'), - { target: { value: '999999' } }, + fireEvent.change(screen.getByTestId('link-previews-daily-refetch-budget'), { + target: { value: '999999' }, + }) + expect(saveConfig.mock.calls.at(-1)?.[0].ogImage.dailyRefetchBudget).toBe( + 5000, ) - expect( - saveConfig.mock.calls.at(-1)?.[0].ogImage.dailyRefetchBudget, - ).toBe(5000) }) test('daily refetch budget clamps to 0 for negative values', () => { @@ -422,13 +406,10 @@ describe('LinkPreviewsSection', () => { }) const saveConfig = vi.fn().mockResolvedValue(undefined) render(withShell({ ogImageFetchEnabled: true, saveConfig })) - fireEvent.change( - screen.getByTestId('link-previews-daily-refetch-budget'), - { target: { value: '-9' } }, - ) - expect( - saveConfig.mock.calls.at(-1)?.[0].ogImage.dailyRefetchBudget, - ).toBe(0) + fireEvent.change(screen.getByTestId('link-previews-daily-refetch-budget'), { + target: { value: '-9' }, + }) + expect(saveConfig.mock.calls.at(-1)?.[0].ogImage.dailyRefetchBudget).toBe(0) }) test('daily refetch budget skips saveConfig when value is unchanged', () => { @@ -446,10 +427,9 @@ describe('LinkPreviewsSection', () => { saveConfig, }), ) - fireEvent.change( - screen.getByTestId('link-previews-daily-refetch-budget'), - { target: { value: '50' } }, - ) + fireEvent.change(screen.getByTestId('link-previews-daily-refetch-budget'), { + target: { value: '50' }, + }) expect(saveConfig).not.toHaveBeenCalled() }) @@ -462,12 +442,8 @@ describe('LinkPreviewsSection', () => { }) render(withShell({ ogImageFetchEnabled: false })) expect( - ( - screen.getByTestId( - 'link-previews-daily-refetch-budget', - ) as HTMLInputElement - ).disabled, - ).toBe(true) + screen.getByTestId('link-previews-daily-refetch-budget'), + ).toBeDisabled() }) test('prefetch budget input persists in-range value', () => { @@ -533,13 +509,7 @@ describe('LinkPreviewsSection', () => { oldestFetchedAt: null, }) render(withShell({ ogImageFetchEnabled: false })) - expect( - ( - screen.getByTestId( - 'link-previews-prefetch-budget', - ) as HTMLInputElement - ).disabled, - ).toBe(true) + expect(screen.getByTestId('link-previews-prefetch-budget')).toBeDisabled() }) test('prefetch budget disabled when fetch mode is not Background', () => { @@ -549,16 +519,8 @@ describe('LinkPreviewsSection', () => { totalBytes: 0, oldestFetchedAt: null, }) - render( - withShell({ ogImageFetchEnabled: true, fetchMode: 'on_demand' }), - ) - expect( - ( - screen.getByTestId( - 'link-previews-prefetch-budget', - ) as HTMLInputElement - ).disabled, - ).toBe(true) + render(withShell({ ogImageFetchEnabled: true, fetchMode: 'on_demand' })) + expect(screen.getByTestId('link-previews-prefetch-budget')).toBeDisabled() }) test('prefetch budget remains enabled when mode is Background + fetchEnabled', () => { @@ -568,16 +530,10 @@ describe('LinkPreviewsSection', () => { totalBytes: 0, oldestFetchedAt: null, }) - render( - withShell({ ogImageFetchEnabled: true, fetchMode: 'background' }), - ) + render(withShell({ ogImageFetchEnabled: true, fetchMode: 'background' })) expect( - ( - screen.getByTestId( - 'link-previews-prefetch-budget', - ) as HTMLInputElement - ).disabled, - ).toBe(false) + screen.getByTestId('link-previews-prefetch-budget'), + ).not.toBeDisabled() }) test('Rebuild now calls backend.prefetchOgImages with the default budget', async () => { @@ -631,9 +587,7 @@ describe('LinkPreviewsSection', () => { render(withShell({ ogImageFetchEnabled: true })) await userEvent.click(screen.getByTestId('link-previews-rebuild-now')) await waitFor(() => - expect(screen.getByTestId('link-previews-stats')).toHaveTextContent( - '42', - ), + expect(screen.getByTestId('link-previews-stats')).toHaveTextContent('42'), ) }) @@ -645,11 +599,7 @@ describe('LinkPreviewsSection', () => { oldestFetchedAt: null, }) render(withShell({ ogImageFetchEnabled: false })) - expect( - ( - screen.getByTestId('link-previews-rebuild-now') as HTMLButtonElement - ).disabled, - ).toBe(true) + expect(screen.getByTestId('link-previews-rebuild-now')).toBeDisabled() }) test('Rebuild now clears the pending state even when the worker throws', async () => { @@ -663,14 +613,12 @@ describe('LinkPreviewsSection', () => { new Error('worker offline'), ) render(withShell({ ogImageFetchEnabled: true })) - const button = screen.getByTestId( - 'link-previews-rebuild-now', - ) as HTMLButtonElement + const button = screen.getByTestId('link-previews-rebuild-now') await userEvent.click(button).catch(() => undefined) // After the promise rejects, the button must re-enable so the user // can retry — otherwise a transient error permanently locks the // affordance until reload. - await waitFor(() => expect(button.disabled).toBe(false)) + await waitFor(() => expect(button).not.toBeDisabled()) }) test('Clear all is guarded by window.confirm', async () => { diff --git a/src/pages/settings/link-previews-section.tsx b/src/pages/settings/link-previews-section.tsx index dbcbe3f6..a3671084 100644 --- a/src/pages/settings/link-previews-section.tsx +++ b/src/pages/settings/link-previews-section.tsx @@ -404,9 +404,7 @@ export function LinkPreviewsSection({ max={PREFETCH_BUDGET_MAX} step={1} value={settings.newVisitPrefetchBudget} - disabled={ - !fetchEnabled || settings.fetchMode !== 'background' - } + disabled={!fetchEnabled || settings.fetchMode !== 'background'} onChange={(event) => void onChangePrefetchBudget(event.target.value) } diff --git a/src/pages/settings/paper-form-primitives.test.tsx b/src/pages/settings/paper-form-primitives.test.tsx index 019aafea..28aeacdd 100644 --- a/src/pages/settings/paper-form-primitives.test.tsx +++ b/src/pages/settings/paper-form-primitives.test.tsx @@ -89,10 +89,7 @@ describe('SegmentedControl', () => { />, ) for (const option of OPTIONS) { - const node = screen.getByTestId( - `seg-${option.id}`, - ) as HTMLButtonElement - expect(node.disabled).toBe(true) + expect(screen.getByTestId(`seg-${option.id}`)).toBeDisabled() } }) @@ -128,10 +125,7 @@ describe('SegmentedControl', () => { />, ) for (const option of OPTIONS) { - const node = screen.getByTestId( - `seg-${option.id}`, - ) as HTMLButtonElement - expect(node.disabled).toBe(false) + expect(screen.getByTestId(`seg-${option.id}`)).not.toBeDisabled() } }) @@ -154,11 +148,7 @@ describe('SegmentedControl', () => { test('omitting testId still renders every option (no data-testid leak)', () => { const onChange = vi.fn() const { container } = render( - , + , ) // 3 radio buttons rendered, none carrying a data-testid attribute. const radios = container.querySelectorAll('button[role="radio"]') diff --git a/src/pages/settings/paper-form-primitives.tsx b/src/pages/settings/paper-form-primitives.tsx index 9c3cde55..d3769bd3 100644 --- a/src/pages/settings/paper-form-primitives.tsx +++ b/src/pages/settings/paper-form-primitives.tsx @@ -143,7 +143,8 @@ export function SegmentedControl({ option.id === value ? 'border-accent bg-accent-soft text-accent-text' : 'text-ink hover:border-ink-muted hover:bg-hover', - disabled && 'cursor-not-allowed opacity-60 hover:border-border-default hover:bg-transparent', + disabled && + 'cursor-not-allowed opacity-60 hover:border-border-default hover:bg-transparent', )} > diff --git a/src/pages/settings/paper-settings-header.test.tsx b/src/pages/settings/paper-settings-header.test.tsx index c2236b49..245c408a 100644 --- a/src/pages/settings/paper-settings-header.test.tsx +++ b/src/pages/settings/paper-settings-header.test.tsx @@ -85,6 +85,30 @@ describe('PaperSettingsHeader', () => { rafSpy.mockRestore() }) + test('scrolls without overwriting tabindex when the target already has one', () => { + document.body.innerHTML = '
' + const target = document.getElementById('settings-applock') + if (!(target instanceof HTMLElement)) throw new Error('target missing') + const scrollSpy = vi.fn() + Object.defineProperty(target, 'scrollIntoView', { + value: scrollSpy, + configurable: true, + }) + const rafSpy = vi + .spyOn(window, 'requestAnimationFrame') + .mockImplementation((cb: FrameRequestCallback) => { + cb(0) + return 1 + }) + renderHeader() + fireEvent.click( + screen.getByRole('link', { name: 'App Lock' }), + ) + expect(scrollSpy).toHaveBeenCalledWith({ block: 'start' }) + expect(target.getAttribute('tabindex')).toBe('0') + rafSpy.mockRestore() + }) + test('uses the provided testId', () => { renderHeader({ testId: 'paper-settings-header-x' }) expect(screen.getByTestId('paper-settings-header-x')).toBeInTheDocument()