feat: zstd compression #618

Draft
0xbrayo wants to merge 4 commits into ActivityWatch:master from 0xbrayo:zstd-compression

@0xbrayo

0xbrayo commented Jun 13, 2026

Member

closes #493

…Watch#493)

Add an opt-in 'compression-zstd' feature that transparently compresses the
events.data column with zstd. ActivityWatch events are tiny and their
redundancy is across rows (repeated app names, JSON keys, window titles), so a
dictionary is trained on a sample of the database's own events, stored once, and
used to compress every row. On a real ~545k-event database this shrinks the
stored event JSON by ~47% (file after VACUUM: 77.4 MB -> 52.3 MB) with
negligible read overhead (~0.5 us/event).

Details:
- New compression module: dictionary-compressed blobs are marked with a 0xCC
  prefix; rows that would not shrink (or before a dictionary exists) are stored
  as raw JSON, so a row is never larger than plain JSON. A single reusable
  compressor/decompressor is kept per connection.
- Event data column migrated from TEXT to BLOB (db v6) so compressed binary
  data is stored without any encoding overhead.
- The dictionary is trained once at startup when the database has enough events
  (and on upgrade), then existing rows are recompressed in a single
  transaction; new events are compressed on insert.
- Feature is off by default; a database with compressed rows requires a build
  with the feature enabled to read it.
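
A minimal sketch of the storage rule described in the first bullet, assuming the 0xCC marker named there. The `compress` closure is a hypothetical stand-in for the per-connection dictionary compressor (it is not the PR's code and not the zstd crate's API), and the function names are illustrative only:

```rust
/// Marker byte for dictionary-compressed blobs, as named in the description.
const COMPRESSED_MARKER: u8 = 0xCC;

/// Encode one event's JSON for storage.
///
/// `compress` is a hypothetical stand-in for the per-connection dictionary
/// compressor; it returns `None` while no dictionary exists yet.
fn encode_row<F>(json: &[u8], compress: F) -> Vec<u8>
where
    F: Fn(&[u8]) -> Option<Vec<u8>>,
{
    if let Some(body) = compress(json) {
        // Keep the compressed form only if marker + body is strictly smaller.
        if body.len() + 1 < json.len() {
            let mut blob = Vec::with_capacity(body.len() + 1);
            blob.push(COMPRESSED_MARKER);
            blob.extend_from_slice(&body);
            return blob;
        }
    }
    // No dictionary, or no gain: store the JSON bytes unchanged.
    json.to_vec()
}

/// True if the stored blob carries the compression marker.
/// Assumes raw event JSON is an object, so its first byte is `{` (0x7B).
fn is_compressed(blob: &[u8]) -> bool {
    blob.first() == Some(&COMPRESSED_MARKER)
}
```

Because the raw fallback is taken whenever the marked blob would not be strictly smaller, a stored row can never exceed the size of its plain JSON.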

Also enable foreign_keys and add ON DELETE CASCADE to events.bucketrow so
deleting a bucket removes its events; the v6 migration drops pre-existing orphan
events whose bucket no longer exists.

Tested with and without the feature, including a full migrate+train+backfill
roundtrip on a real database.
@greptile-apps

greptile-apps Bot commented Jun 13, 2026

Greptile Summary

Adds transparent zstd dictionary compression for the event data column behind an opt-in compression-zstd build feature. The approach trains a 64 KiB shared dictionary on a corpus of the database's own events (once the row count crosses 2,000) and recompresses all existing rows in a single transaction, yielding ~47% size reduction on real data.

  • Schema v6 migration (_migrate_v5_to_v6): converts the data column from TEXT to BLOB, replaces endtime with a computed duration column, adds ON DELETE CASCADE to the events FK, and creates the compression_dict table. Migration runs unconditionally (regardless of the feature flag), so it is a one-way schema change for all users.
  • Startup flow: db_version is now re-read after _create_tables so ensure_compression sees the post-migration version on the first open after an upgrade; decompression failures are surfaced as rusqlite row errors (logged as warnings) rather than panics.
  • Correctness: rows are stored uncompressed when compression would not shrink them; the zstd magic-byte prefix (0x28 0xB5 0x2F 0xFD) is used as a reliable discriminator so compressed and uncompressed rows can coexist transparently.
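
The startup decision summarized in these bullets (and drawn in the flowchart further down) can be sketched as a pure function. This is an illustration under stated assumptions, not the PR's code: the `StartupAction` enum, the function name, and the constant names are hypothetical, while the version 6 and 2,000-event thresholds come from the summary and flowchart.

```rust
/// Hypothetical outcome names for the startup check; not taken from the PR.
#[derive(Debug, PartialEq, Eq)]
enum StartupAction {
    /// Leave rows as they are and run without a dictionary.
    NoCompression,
    /// A stored dictionary exists: build the compression context from it.
    LoadDictionary,
    /// No dictionary yet, but enough events: train one and recompress rows.
    TrainAndBackfill,
}

const MIN_DB_VERSION: i32 = 6;
const MIN_EVENTS_FOR_TRAINING: u64 = 2_000;

/// `db_version` must be the value re-read after migrations have run,
/// otherwise the first open after a v5 -> v6 upgrade would see version 5.
fn startup_action(
    db_version: i32,
    dict_found: bool,
    migrate_enabled: bool,
    event_count: u64,
) -> StartupAction {
    if db_version < MIN_DB_VERSION {
        return StartupAction::NoCompression;
    }
    if dict_found {
        return StartupAction::LoadDictionary;
    }
    if migrate_enabled && event_count >= MIN_EVENTS_FOR_TRAINING {
        StartupAction::TrainAndBackfill
    } else {
        StartupAction::NoCompression
    }
}
```

Passing a stale pre-migration `db_version` of 5 yields `NoCompression`, which is the behaviour the re-read after `_create_tables` is meant to avoid.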

Confidence Score: 5/5

Safe to merge. The two previously-flagged blocking issues (db_version stale after migration, panic on decompression failure) are both properly fixed in this revision.

The migration is a one-way schema change but it is correct and tested. Decompression errors are now surfaced as rusqlite row errors and logged as warnings rather than panicking. The db_version re-read after _create_tables ensures compression setup runs on the first open post-upgrade. Remaining notes are efficiency and logging concerns that do not affect correctness.

aw-datastore/src/datastore.rs — the backfill memory footprint and the (starttime + duration) predicate are worth revisiting for large databases, but neither blocks merge.

Important Files Changed

| Filename | Overview |
|---|---|
| aw-datastore/src/compression.rs | New module: zstd dictionary-compressed event storage with transparent compress/decompress, magic-byte detection, and proper feature gating. Well-structured with tests. |
| aw-datastore/src/datastore.rs | Schema v6 migration (TEXT→BLOB, endtime→duration), db_version re-read after migration, ensure_compression on startup, decompression errors properly surfaced as rusqlite errors. Memory concern in train_and_backfill and index-sargability regression are outstanding P2 notes. |
| aw-datastore/src/worker.rs | Adds foreign_keys PRAGMA ON before migrations, enforcing ON DELETE CASCADE for bucket-event relationships. |
| aw-datastore/tests/datastore.rs | Three new integration tests: dictionary roundtrip, first-open migration training, and bad-row skip-without-panic. Good regression coverage for the key scenarios. |
| aw-datastore/Cargo.toml | Adds optional zstd 0.13 dependency behind compression-zstd feature flag. |
| aw-server/Cargo.toml | Propagates compression-zstd feature flag from aw-datastore up to aw-server. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[DatastoreInstance::new] --> B[_create_tables: run migrations up to v6]
    B --> C[Re-read db_version post-migration]
    C --> D{db_version >= 6?}
    D -- No --> E[Return: no compression]
    D -- Yes --> F[_load_dictionary from DB]
    F --> G{Dict found?}
    G -- Yes --> H[CompressionContext::from_dictionary]
    H --> I{compression-zstd enabled?}
    I -- Yes --> J[Load compressor + decompressor with dict]
    I -- No --> K[Empty context - reads will error on zstd rows]
    G -- No --> L{migrate_enabled AND event_count >= 2000?}
    L -- No --> E
    L -- Yes --> M[train_and_backfill]
    M --> N[SELECT id, data FROM events]
    N --> O[Decompress all rows to JSON]
    O --> P[zstd::dict::from_samples - train 64 KiB dictionary]
    P --> Q[BEGIN IMMEDIATE TRANSACTION]
    Q --> R[INSERT dict into compression_dict]
    R --> S[UPDATE every event row with compressed blob]
    S --> T{Success?}
    T -- Yes --> U[COMMIT]
    T -- No --> V[ROLLBACK - revert to empty context]

    subgraph Read path
        W[get_events / get_event] --> X[Read data BLOB from SQLite]
        X --> Y{Starts with 0x28B52FFD zstd magic?}
        Y -- No --> Z[Return raw UTF-8 JSON]
        Y -- Yes --> AA{feature enabled + dict loaded?}
        AA -- Yes --> AB[Decompress with dict - return JSON string]
        AA -- No --> AC[Return Err - skip row with warning]
    end
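
The read path in the subgraph above can be sketched as follows. The four magic bytes are the standard zstd frame magic number in on-disk order; everything else is an assumption for illustration: `read_row`, `ReadError`, and the `decompress` callback are hypothetical names, and the callback stands in for a real dictionary decompressor.

```rust
/// zstd frame magic number as it appears on disk (little-endian 0xFD2FB528).
const ZSTD_MAGIC: [u8; 4] = [0x28, 0xB5, 0x2F, 0xFD];

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum ReadError {
    /// Compressed row, but no usable decompressor (feature off or no dictionary).
    Unreadable,
    /// The resulting bytes were not valid UTF-8.
    InvalidUtf8,
}

/// Turn a stored blob back into a JSON string.
///
/// `decompress` is `None` when the feature is disabled or no dictionary is
/// loaded; otherwise it is a stand-in for the real dictionary decompressor.
fn read_row(
    blob: &[u8],
    decompress: Option<&dyn Fn(&[u8]) -> Option<Vec<u8>>>,
) -> Result<String, ReadError> {
    let bytes = if blob.starts_with(&ZSTD_MAGIC) {
        // Compressed row: fail with an error instead of panicking.
        let decompress = decompress.ok_or(ReadError::Unreadable)?;
        decompress(blob).ok_or(ReadError::Unreadable)?
    } else {
        // Raw row: the blob already is the JSON text.
        blob.to_vec()
    };
    String::from_utf8(bytes).map_err(|_| ReadError::InvalidUtf8)
}
```

Raw JSON objects start with `{` (0x7B), so under that assumption they cannot be mistaken for a zstd frame, which is what lets compressed and uncompressed rows share one column.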

Reviews (2): Last reviewed commit: "test(datastore): make compression test c..." | Re-trigger Greptile

@codecov

codecov Bot commented Jun 13, 2026

Codecov Report

❌ Patch coverage is 87.05882% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.26%. Comparing base (656f3c9) to head (a9cea74).
⚠️ Report is 63 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| aw-datastore/src/datastore.rs | 81.57% | 7 Missing ⚠️ |
| aw-datastore/tests/datastore.rs | 92.10% | 3 Missing ⚠️ |
| aw-datastore/src/compression.rs | 87.50% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #618      +/-   ##
==========================================
+ Coverage   70.81%   77.26%   +6.45%     
==========================================
  Files          51       63      +12     
  Lines        2916     5024    +2108     
==========================================
+ Hits         2065     3882    +1817     
- Misses        851     1142     +291     

- Refresh db_version after running migrations so ensure_compression sets up
  compression on the first startup after a v5->v6 upgrade, instead of waiting
  for the next restart.
- Stop panicking when a compressed row can't be decompressed (missing
  dictionary or feature disabled): get_events now skips the row with a warning
  and get_event returns an error, instead of from_utf8_lossy + unwrap.
- Add regression tests: dictionary trained on first open after upgrade, and an
  unreadable compressed row is skipped without panicking.
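
A minimal sketch of the skip-with-warning behaviour described for get_events, assuming each row has already been decoded into a `Result`. The function name and the use of `String` as the error type are illustrative only; the commit message says the real code surfaces these failures as rusqlite row errors.

```rust
/// Collect the rows that decoded successfully, logging and skipping the rest.
///
/// `rows` stands in for the per-row decode results of a query; the error
/// string is a hypothetical placeholder for the real error type.
fn collect_readable<I>(rows: I) -> Vec<String>
where
    I: IntoIterator<Item = Result<String, String>>,
{
    rows.into_iter()
        .filter_map(|row| match row {
            Ok(json) => Some(json),
            Err(reason) => {
                // One bad row must not abort the whole query.
                eprintln!("warning: skipping unreadable event row: {}", reason);
                None
            }
        })
        .collect()
}
```

A single-row lookup such as get_event has nothing to fall back to, so there the same `Err` is returned to the caller rather than being skipped.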
@0xbrayo

0xbrayo commented Jun 13, 2026

Member Author

🤖 Claude, on behalf of @0xbrayo

@greptile review

The trailing remove_file() in the new compression tests could fail on Windows,
where the datastore worker may still hold the database file handle briefly after
close(). The test logic already passed; only the hard-unwrapped cleanup was
flaky. Use best-effort removal, matching the existing encrypted-datastore test.
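
Best-effort cleanup of this kind can be sketched with the standard library alone. The helper name is hypothetical and this is not the test code from the PR; it only shows the pattern of reporting a failed removal instead of unwrapping it.

```rust
use std::fs;
use std::path::Path;

/// Try to delete a test database file without failing the test.
///
/// Returns `true` if the file was removed. A failure (for example, another
/// thread still holding the handle on Windows) is reported and ignored.
fn remove_file_best_effort(path: &Path) -> bool {
    match fs::remove_file(path) {
        Ok(()) => true,
        Err(err) => {
            eprintln!("note: could not remove {}: {}", path.display(), err);
            false
        }
    }
}
```

The trade-off is that a leftover file in the temporary directory is tolerated so that cleanup timing cannot turn a passing test into a failure.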
@0xbrayo

0xbrayo commented Jun 13, 2026

Member Author

🤖 Claude, on behalf of @0xbrayo

Benchmark: dictionary compression on a real-world database

Test data: byte-for-byte snapshot of a live database, 544,673 events across 19 buckets (~12.5 months of aw-watcher-window/afk/etc data).

Methodology:

  1. Started from the v4 snapshot and migrated it to v5 using the pre-compression code (commit 6f6c64a, i.e. current master schema), then ran VACUUM to get a clean baseline size.
  2. Took that fresh v5 database and ran it through this branch's v5→v6 migration (--features compression-zstd), then VACUUMed again for a clean "after" size.

Results:

| | size |
|---|---|
| v5 (master, uncompressed) | 72,458,240 bytes (69.1 MiB) |
| v6 (this PR, dictionary-compressed) | 46,047,232 bytes (43.9 MiB) |
| reduction | 36.4% |

Other notes:

  • 431,310 / 544,673 events (79%) ended up dictionary-compressed; the rest were small enough that storing them raw was already smaller, so they're kept as-is.
  • One-time migration + dictionary training (64 KiB dict, level 6) took ~3 seconds for all 544,673 events, wrapped in a single transaction.
  • Verified correctness: read events back through the REST API after migration — data round-trips intact (app/title fields, durations, timestamps all correct).
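
The derived figures in this benchmark follow directly from the raw byte and event counts. A small sketch of the arithmetic, with hypothetical helper names:

```rust
/// `part` as a percentage of `whole`.
fn percent(part: u64, whole: u64) -> f64 {
    part as f64 / whole as f64 * 100.0
}

/// Bytes converted to mebibytes (1 MiB = 1024 * 1024 bytes).
fn mib(bytes: u64) -> f64 {
    bytes as f64 / (1024.0 * 1024.0)
}

/// Size reduction from `before` to `after`, as a percentage of `before`.
fn reduction_percent(before: u64, after: u64) -> f64 {
    percent(before - after, before)
}
```

With the numbers above, `reduction_percent(72_458_240, 46_047_232)` is about 36.4, `mib(72_458_240)` is about 69.1, `mib(46_047_232)` is about 43.9, and `percent(431_310, 544_673)` is about 79.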

@0xbrayo

0xbrayo commented Jun 13, 2026

Member Author

🤖 Claude, on behalf of @0xbrayo

Why a custom implementation instead of sqlite-zstd?

sqlite-zstd (phiresky) implements essentially the same idea — dictionary-trained zstd compression of a column — and was the inspiration for this approach in #493. We didn't pull it in directly for a few reasons specific to this codebase:

  • License: sqlite-zstd is LGPL-2.0; this project's other dependencies are permissively licensed, so avoiding it sidesteps that friction.
  • Schema fit: it works by moving the target column into a shadow _zstd table and replacing the original table with a view (plus triggers). That would conflict with our existing hand-rolled migration system (_migrate_vX_to_vY, the ON DELETE CASCADE FK, the (bucketrow, starttime) index) and complicate the query layer.
  • Operational simplicity: sqlite-zstd expects periodic zstd_incremental_maintenance() calls to retrain dictionaries and vacuum shadow tables. Here, the dictionary is trained once on first upgrade and reused — no background maintenance task needed.
  • Android: this implementation is just the zstd crate operating on raw bytes in a normal column, with no SQLite virtual tables/extension registration — the simplest thing to guarantee works with the bundled SQLite used on Android.

The underlying technique (train a shared dictionary once, compress every row against it) is the same; it's just fit directly into this project's existing schema/migration model rather than taken as a generic extension.

0xbrayo mentioned this pull request Jun 13, 2026
0xbrayo marked this pull request as draft June 14, 2026 08:35
Development

Successfully merging this pull request may close these issues.

Transparent compression of database

1 participant