feat: zstd compression by 0xbrayo · Pull Request #618 · ActivityWatch/aw-server-rust

0xbrayo · 2026-06-13T10:22:25Z

closes #493

…Watch#493)

Add an opt-in 'compression-zstd' feature that transparently compresses the events.data column with zstd. ActivityWatch events are tiny and their redundancy is across rows (repeated app names, JSON keys, window titles), so a dictionary is trained on a sample of the database's own events, stored once, and used to compress every row. On a real ~545k-event database this shrinks the stored event JSON by ~47% (file after VACUUM: 77.4 MB -> 52.3 MB) with negligible read overhead (~0.5 us/event).

Details:

- New compression module: dictionary-compressed blobs are marked with a 0xCC prefix; rows that would not shrink (or that are written before a dictionary exists) are stored as raw JSON, so a row is never larger than plain JSON. A single reusable compressor/decompressor is kept per connection.
- Event data column migrated from TEXT to BLOB (db v6) so compressed binary data is stored without any encoding overhead.
- The dictionary is trained once at startup when the database has enough events (and on upgrade), then existing rows are recompressed in a single transaction; new events are compressed on insert.
- Feature is off by default; a database with compressed rows requires a build with the feature enabled to read it.

Also enable foreign_keys and add ON DELETE CASCADE to events.bucketrow so deleting a bucket removes its events; the v6 migration drops pre-existing orphan events whose bucket no longer exists.

Tested with and without the feature, including a full migrate+train+backfill roundtrip on a real database.
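The marker-and-fallback rule from the details above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `encode_row`, `decode_row` and the `compress`/`decompress` closures are hypothetical names, and the closures stand in for the dictionary-backed zstd codec so the sketch needs only the standard library.

```rust
/// Marker byte that, per the description above, prefixes dictionary-compressed blobs.
const COMPRESSED_MARKER: u8 = 0xCC;

/// Encode one event's JSON for storage.
/// `compress` is a hypothetical stand-in for the dictionary compressor; it
/// returns `None` when no dictionary is available yet.
pub fn encode_row<F>(json: &[u8], compress: F) -> Vec<u8>
where
    F: Fn(&[u8]) -> Option<Vec<u8>>,
{
    if let Some(packed) = compress(json) {
        // Account for the marker byte so the stored blob is never larger
        // than the raw JSON.
        if packed.len() + 1 < json.len() {
            let mut blob = Vec::with_capacity(packed.len() + 1);
            blob.push(COMPRESSED_MARKER);
            blob.extend_from_slice(&packed);
            return blob;
        }
    }
    // Fallback: store the JSON unchanged.
    json.to_vec()
}

/// Decode a stored blob back to JSON bytes.
/// `decompress` is a hypothetical stand-in for the dictionary decompressor.
pub fn decode_row<F>(blob: &[u8], decompress: F) -> Result<Vec<u8>, String>
where
    F: Fn(&[u8]) -> Result<Vec<u8>, String>,
{
    match blob.split_first() {
        // Marked blob: strip the marker and decompress the payload.
        Some((&first, payload)) if first == COMPRESSED_MARKER => decompress(payload),
        // Anything else is treated as raw JSON.
        _ => Ok(blob.to_vec()),
    }
}
```

A JSON document cannot begin with the byte 0xCC, so raw and compressed rows can share one column without ambiguity.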

greptile-apps · 2026-06-13T10:26:54Z

Greptile Summary

Adds transparent zstd dictionary compression for the event data column behind an opt-in compression-zstd build feature. The approach trains a 64 KiB shared dictionary on a corpus of the database's own events (once the row count reaches 2,000) and recompresses all existing rows in a single transaction, yielding ~47% size reduction on real data.

- Schema v6 migration (_migrate_v5_to_v6): converts the data column from TEXT to BLOB, replaces endtime with a computed duration column, adds ON DELETE CASCADE to the events FK, and creates the compression_dict table. Migration runs unconditionally (regardless of the feature flag), so it is a one-way schema change for all users.
- Startup flow: db_version is now re-read after _create_tables so ensure_compression sees the post-migration version on the first open after an upgrade; decompression failures are surfaced as rusqlite row errors (logged as warnings) rather than panics.
- Correctness: rows are stored uncompressed when compression would not shrink them; the zstd magic-byte prefix (0x28 0xB5 0x2F 0xFD) is used as a reliable discriminator so compressed and uncompressed rows can coexist transparently.
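The magic-byte check in the last bullet might look something like the sketch below. `StoredData` and `classify` are hypothetical names chosen for illustration; only the four magic bytes come from the summary above (they are the little-endian layout of the Zstandard frame magic number 0xFD2FB528).

```rust
/// Zstandard frame magic number 0xFD2FB528 as it appears on disk (little-endian).
const ZSTD_MAGIC: [u8; 4] = [0x28, 0xB5, 0x2F, 0xFD];

/// How a stored `data` blob should be interpreted (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum StoredData<'a> {
    /// Plain UTF-8 JSON, stored as-is.
    RawJson(&'a str),
    /// A zstd frame that still has to be decompressed with the dictionary.
    ZstdFrame(&'a [u8]),
}

/// Classify a blob by its leading bytes without copying it.
pub fn classify(blob: &[u8]) -> Result<StoredData<'_>, std::str::Utf8Error> {
    if blob.starts_with(&ZSTD_MAGIC) {
        Ok(StoredData::ZstdFrame(blob))
    } else {
        // Strict UTF-8 validation: corrupt rows become an error, not a panic.
        std::str::from_utf8(blob).map(StoredData::RawJson)
    }
}
```

The first magic byte, 0x28, is the ASCII character `(`, which no JSON document starts with, so the two forms cannot be confused.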

Confidence Score: 5/5

Safe to merge. The two previously-flagged blocking issues (db_version stale after migration, panic on decompression failure) are both properly fixed in this revision.

The migration is a one-way schema change but it is correct and tested. Decompression errors are now surfaced as rusqlite row errors and logged as warnings rather than panicking. The db_version re-read after _create_tables ensures compression setup runs on the first open post-upgrade. Remaining notes are efficiency and logging concerns that do not affect correctness.

aw-datastore/src/datastore.rs — the backfill memory footprint and the (starttime + duration) predicate are worth revisiting for large databases, but neither blocks merge.

Important Files Changed

| Filename | Overview |
|---|---|
| aw-datastore/src/compression.rs | New module: zstd dictionary-compressed event storage with transparent compress/decompress, magic-byte detection, and proper feature gating. Well-structured with tests. |
| aw-datastore/src/datastore.rs | Schema v6 migration (TEXT→BLOB, endtime→duration), db_version re-read after migration, ensure_compression on startup, decompression errors properly surfaced as rusqlite errors. Memory concern in train_and_backfill and index-sargability regression are outstanding P2 notes. |
| aw-datastore/src/worker.rs | Adds foreign_keys PRAGMA ON before migrations, enforcing ON DELETE CASCADE for bucket-event relationships. |
| aw-datastore/tests/datastore.rs | Three new integration tests: dictionary roundtrip, first-open migration training, and bad-row skip-without-panic. Good regression coverage for the key scenarios. |
| aw-datastore/Cargo.toml | Adds optional zstd 0.13 dependency behind compression-zstd feature flag. |
| aw-server/Cargo.toml | Propagates compression-zstd feature flag from aw-datastore up to aw-server. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[DatastoreInstance::new] --> B[_create_tables: run migrations up to v6]
    B --> C[Re-read db_version post-migration]
    C --> D{db_version >= 6?}
    D -- No --> E[Return: no compression]
    D -- Yes --> F[_load_dictionary from DB]
    F --> G{Dict found?}
    G -- Yes --> H[CompressionContext::from_dictionary]
    H --> I[compression-zstd enabled?]
    I -- Yes --> J[Load compressor + decompressor with dict]
    I -- No --> K[Empty context - reads will error on zstd rows]
    G -- No --> L{migrate_enabled AND event_count >= 2000?}
    L -- No --> E
    L -- Yes --> M[train_and_backfill]
    M --> N[SELECT id, data FROM events]
    N --> O[Decompress all rows to JSON]
    O --> P[zstd::dict::from_samples - train 64 KiB dictionary]
    P --> Q[BEGIN IMMEDIATE TRANSACTION]
    Q --> R[INSERT dict into compression_dict]
    R --> S[UPDATE every event row with compressed blob]
    S --> T{Success?}
    T -- Yes --> U[COMMIT]
    T -- No --> V[ROLLBACK - revert to empty context]

    subgraph Read path
        W[get_events / get_event] --> X[Read data BLOB from SQLite]
        X --> Y{Starts with 0x28B52FFD zstd magic?}
        Y -- No --> Z[Return raw UTF-8 JSON]
        Y -- Yes --> AA{feature enabled + dict loaded?}
        AA -- Yes --> AB[Decompress with dict - return JSON string]
        AA -- No --> AC[Return Err - skip row with warning]
    end
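
The startup branch of the flowchart can be read as one small decision function. The sketch below is illustrative only: `StartupAction`, `DbState` and `startup_action` are hypothetical names rather than the PR's code, and the exact meaning of `migrate_enabled` is not spelled out in this thread, so it is passed in as an opaque flag. The version threshold (6) and event threshold (2,000) are taken from the summary above.

```rust
/// What the datastore does about compression at startup (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum StartupAction {
    /// Old schema, or too few events to train on: leave rows as raw JSON.
    NoCompression,
    /// A dictionary exists and the feature is compiled in: load the codec.
    LoadDictionary,
    /// A dictionary exists but the feature is off: zstd rows will fail to read.
    EmptyContext,
    /// No dictionary yet and enough data: train one and recompress rows.
    TrainAndBackfill,
}

/// The facts the decision depends on (hypothetical type).
pub struct DbState {
    pub db_version: u32,
    pub has_dictionary: bool,
    pub event_count: u64,
}

const MIN_DB_VERSION: u32 = 6;
const MIN_TRAINING_EVENTS: u64 = 2_000;

/// Mirror the flowchart's startup branch as a pure function.
pub fn startup_action(
    state: &DbState,
    feature_enabled: bool,
    migrate_enabled: bool,
) -> StartupAction {
    if state.db_version < MIN_DB_VERSION {
        return StartupAction::NoCompression;
    }
    if state.has_dictionary {
        return if feature_enabled {
            StartupAction::LoadDictionary
        } else {
            StartupAction::EmptyContext
        };
    }
    if migrate_enabled && state.event_count >= MIN_TRAINING_EVENTS {
        StartupAction::TrainAndBackfill
    } else {
        StartupAction::NoCompression
    }
}
```

Keeping the decision free of I/O makes each branch of the chart easy to cover with a unit test.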

Reviews (2): Last reviewed commit: "test(datastore): make compression test c..."

codecov · 2026-06-13T10:30:11Z

Codecov Report

❌ Patch coverage is 87.05882% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.26%. Comparing base (656f3c9) to head (a9cea74).
⚠️ Report is 63 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| aw-datastore/src/datastore.rs | 81.57% | 7 Missing ⚠️ |
| aw-datastore/tests/datastore.rs | 92.10% | 3 Missing ⚠️ |
| aw-datastore/src/compression.rs | 87.50% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #618      +/-   ##
==========================================
+ Coverage   70.81%   77.26%   +6.45%     
==========================================
  Files          51       63      +12     
  Lines        2916     5024    +2108     
==========================================
+ Hits         2065     3882    +1817     
- Misses        851     1142     +291


- Refresh db_version after running migrations so ensure_compression sets up compression on the first startup after a v5->v6 upgrade, instead of waiting for the next restart.
- Stop panicking when a compressed row can't be decompressed (missing dictionary or feature disabled): get_events now skips the row with a warning and get_event returns an error, instead of from_utf8_lossy + unwrap.
- Add regression tests: dictionary trained on first open after upgrade, and an unreadable compressed row is skipped without panicking.
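
The skip-with-warning behaviour in the second item could be sketched along these lines. The function names (`decode_strict`, `collect_readable`) and the `(id, blob)` row shape are assumptions made for the example; the real code works on rusqlite rows and a dictionary-aware decoder.

```rust
/// Decode a stored blob as strict UTF-8 JSON (hypothetical helper).
/// Unlike `from_utf8_lossy`, invalid data is reported instead of silently patched.
pub fn decode_strict(blob: &[u8]) -> Result<String, String> {
    String::from_utf8(blob.to_vec()).map_err(|e| format!("data is not valid UTF-8: {}", e))
}

/// Multi-row read: keep every row that decodes, skip the rest with a warning.
pub fn collect_readable<F>(rows: &[(i64, Vec<u8>)], decode: F) -> Vec<(i64, String)>
where
    F: Fn(&[u8]) -> Result<String, String>,
{
    let mut events = Vec::with_capacity(rows.len());
    for (id, blob) in rows {
        match decode(blob.as_slice()) {
            Ok(json) => events.push((*id, json)),
            // One unreadable row must not abort the whole query.
            Err(err) => eprintln!("warning: skipping event {}: {}", id, err),
        }
    }
    events
}
```

A single-row lookup would instead return the `Err` from `decode` to the caller, which matches the described split between get_events and get_event.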

0xbrayo · 2026-06-13T10:57:59Z

🤖 Claude, on behalf of @0xbrayo

@greptile review

The trailing remove_file() in the new compression tests could fail on Windows, where the datastore worker may still hold the database file handle briefly after close(). The test logic already passed; only the hard-unwrapped cleanup was flaky. Use best-effort removal, matching the existing encrypted-datastore test.
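
A best-effort cleanup of this kind might be written as below. `cleanup_best_effort` is a hypothetical helper name; the only real API used is `std::fs::remove_file`.

```rust
use std::fs;
use std::path::Path;

/// Remove a test database file without failing the test if removal fails.
/// Returns `true` when the file was actually removed.
pub fn cleanup_best_effort(path: &Path) -> bool {
    match fs::remove_file(path) {
        Ok(()) => true,
        Err(err) => {
            // On Windows the worker may still hold the handle; note it and move on.
            eprintln!("note: could not remove {}: {}", path.display(), err);
            false
        }
    }
}
```

Compared with `fs::remove_file(path).unwrap()`, a lingering file handle leaves a stray temp file behind instead of turning a passing test into a failure.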

0xbrayo · 2026-06-13T11:30:15Z

🤖 Claude, on behalf of @0xbrayo

Benchmark: dictionary compression on a real-world database

Test data: byte-for-byte snapshot of a live database, 544,673 events across 19 buckets (~12.5 months of aw-watcher-window/afk/etc data).

Methodology:

Started from the v4 snapshot and migrated it to v5 using the pre-compression code (commit 6f6c64a, i.e. current master schema), then ran VACUUM to get a clean baseline size.
Took that fresh v5 database and ran it through this branch's v5→v6 migration (--features compression-zstd), then VACUUMed again for a clean "after" size.

Results:

| Database | Size |
|---|---|
| v5 (master, uncompressed) | 72,458,240 bytes (69.1 MiB) |
| v6 (this PR, dictionary-compressed) | 46,047,232 bytes (43.9 MiB) |
| reduction | 36.4% |
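
For reference, the reduction and MiB figures follow directly from the two byte counts. The helper names below are made up for this worked example.

```rust
/// Percentage by which `after` is smaller than `before`.
pub fn reduction_percent(before: u64, after: u64) -> f64 {
    (before as f64 - after as f64) / before as f64 * 100.0
}

/// Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes).
pub fn to_mib(bytes: u64) -> f64 {
    bytes as f64 / (1024.0 * 1024.0)
}

/// Format the benchmark summary line from the two VACUUMed file sizes.
pub fn summary(before: u64, after: u64) -> String {
    format!(
        "{:.1} MiB -> {:.1} MiB ({:.1}% smaller)",
        to_mib(before),
        to_mib(after),
        reduction_percent(before, after)
    )
}
```

Calling `summary(72_458_240, 46_047_232)` reproduces the 69.1 MiB, 43.9 MiB and 36.4% values reported in the results.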

Other notes:

431,310 / 544,673 events (79%) ended up dictionary-compressed; the rest were small enough that storing them raw was already smaller, so they're kept as-is.
One-time migration + dictionary training (64 KiB dict, level 6) took ~3 seconds for all 544,673 events, wrapped in a single transaction.
Verified correctness: read events back through the REST API after migration — data round-trips intact (app/title fields, durations, timestamps all correct).

0xbrayo · 2026-06-13T11:51:19Z

🤖 Claude, on behalf of @0xbrayo

Why a custom implementation instead of `sqlite-zstd`?

sqlite-zstd (phiresky) implements essentially the same idea — dictionary-trained zstd compression of a column — and was the inspiration for this approach in #493. We didn't pull it in directly for a few reasons specific to this codebase:

License: sqlite-zstd is LGPL-2.0; this project's other dependencies are permissively licensed, so avoiding it sidesteps that friction.
Schema fit: it works by moving the target column into a shadow _zstd table and replacing the original table with a view (plus triggers). That would conflict with our existing hand-rolled migration system (_migrate_vX_to_vY, the ON DELETE CASCADE FK, the (bucketrow, starttime) index) and complicate the query layer.
Operational simplicity: sqlite-zstd expects periodic zstd_incremental_maintenance() calls to retrain dictionaries and vacuum shadow tables. Here, the dictionary is trained once on first upgrade and reused — no background maintenance task needed.
Android: this implementation is just the zstd crate operating on raw bytes in a normal column, with no SQLite virtual tables/extension registration — the simplest thing to guarantee works with the bundled SQLite used on Android.

The underlying technique (train a shared dictionary once, compress every row against it) is the same; it's just fit directly into this project's existing schema/migration model rather than taken as a generic extension.

0xbrayo mentioned this pull request Jun 13, 2026

Transparent compression of database #493

Open

docs(datastore): document one-time migration cost for compression-zst…

a9cea74

…d feature

0xbrayo mentioned this pull request Jun 13, 2026

Fix cargo warnings #619

Draft

0xbrayo marked this pull request as draft June 14, 2026 08:35

0xbrayo wants to merge 4 commits into ActivityWatch:master from 0xbrayo:zstd-compression
