feat: zstd compression #618
…Watch#493)

Add an opt-in 'compression-zstd' feature that transparently compresses the events.data column with zstd. ActivityWatch events are tiny and their redundancy is across rows (repeated app names, JSON keys, window titles), so a dictionary is trained on a sample of the database's own events, stored once, and used to compress every row.

On a real ~545k-event database this shrinks the stored event JSON by ~47% (file after VACUUM: 77.4 MB -> 52.3 MB) with negligible read overhead (~0.5 us/event).

Details:
- New compression module: dictionary-compressed blobs are marked with a 0xCC prefix; rows that would not shrink (or that are written before a dictionary exists) are stored as raw JSON, so a row is never larger than plain JSON. A single reusable compressor/decompressor is kept per connection.
- Event data column migrated from TEXT to BLOB (db v6) so compressed binary is stored without any encoding overhead.
- The dictionary is trained once at startup when the database has enough events (and on upgrade), then existing rows are recompressed in a single transaction; new events are compressed on insert.
- Feature is off by default; a database with compressed rows requires a build with the feature enabled to read it.

Also enable foreign_keys and add ON DELETE CASCADE to events.bucketrow so deleting a bucket removes its events; the v6 migration drops pre-existing orphan events whose bucket no longer exists.

Tested with and without the feature, including a full migrate+train+backfill roundtrip on a real database.
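The blob framing described in the details above can be sketched roughly as follows. This is a std-only illustration, not the code from this PR: `encode_row` and `decode_row` are hypothetical names, and the `compress`/`decompress` closures stand in for a zstd dictionary compressor. Only the 0xCC marker and the "store raw JSON when the row would not shrink" rule come from the commit message.

```rust
/// Marker byte for dictionary-compressed blobs (0xCC, per the commit message).
const COMPRESSED_MARKER: u8 = 0xCC;

/// Encode one event's JSON for storage.
/// `compress` is a hypothetical stand-in for a zstd dictionary compressor;
/// it returns `None` when no dictionary is available yet.
fn encode_row<F>(json: &str, compress: F) -> Vec<u8>
where
    F: Fn(&[u8]) -> Option<Vec<u8>>,
{
    let raw = json.as_bytes();
    if let Some(packed) = compress(raw) {
        // The marker costs one byte; keep the compressed form only if the
        // framed blob is still strictly smaller than the plain JSON.
        if packed.len() + 1 < raw.len() {
            let mut blob = Vec::with_capacity(packed.len() + 1);
            blob.push(COMPRESSED_MARKER);
            blob.extend_from_slice(&packed);
            return blob;
        }
    }
    // No dictionary, or no gain: store the JSON bytes unchanged.
    raw.to_vec()
}

/// Decode a stored blob back to JSON text.
/// `decompress` is a hypothetical stand-in for the zstd dictionary decompressor.
fn decode_row<F>(blob: &[u8], decompress: F) -> Result<String, String>
where
    F: Fn(&[u8]) -> Result<Vec<u8>, String>,
{
    let bytes = if blob.first() == Some(&COMPRESSED_MARKER) {
        // Strip the marker and hand the payload to the decompressor.
        decompress(&blob[1..])?
    } else {
        // Raw JSON row: nothing to undo.
        blob.to_vec()
    };
    String::from_utf8(bytes).map_err(|e| e.to_string())
}
```

A single marker byte is enough here because 0xCC is a UTF-8 lead byte for a combining character, so a valid JSON document cannot begin with it.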
Greptile Summary
Adds transparent zstd dictionary compression for the event
Confidence Score: 5/5
Safe to merge. The two previously-flagged blocking issues (db_version stale after migration, panic on decompression failure) are both properly fixed in this revision.
The migration is a one-way schema change but it is correct and tested. Decompression errors are now surfaced as rusqlite row errors and logged as warnings rather than panicking. The db_version re-read after _create_tables ensures compression setup runs on the first open post-upgrade. Remaining notes are efficiency and logging concerns that do not affect correctness.
aw-datastore/src/datastore.rs: the backfill memory footprint and the (starttime + duration) predicate are worth revisiting for large databases, but neither blocks merge.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[DatastoreInstance::new] --> B[_create_tables: run migrations up to v6]
B --> C[Re-read db_version post-migration]
C --> D{db_version >= 6?}
D -- No --> E[Return: no compression]
D -- Yes --> F[_load_dictionary from DB]
F --> G{Dict found?}
G -- Yes --> H[CompressionContext::from_dictionary]
H --> I[compression-zstd enabled?]
I -- Yes --> J[Load compressor + decompressor with dict]
I -- No --> K[Empty context - reads will error on zstd rows]
G -- No --> L{migrate_enabled AND event_count >= 2000?}
L -- No --> E
L -- Yes --> M[train_and_backfill]
M --> N[SELECT id, data FROM events]
N --> O[Decompress all rows to JSON]
O --> P[zstd::dict::from_samples - train 64 KiB dictionary]
P --> Q[BEGIN IMMEDIATE TRANSACTION]
Q --> R[INSERT dict into compression_dict]
R --> S[UPDATE every event row with compressed blob]
S --> T{Success?}
T -- Yes --> U[COMMIT]
T -- No --> V[ROLLBACK - revert to empty context]
subgraph Read path
W[get_events / get_event] --> X[Read data BLOB from SQLite]
X --> Y{Starts with 0x28B52FFD zstd magic?}
Y -- No --> Z[Return raw UTF-8 JSON]
Y -- Yes --> AA{feature enabled + dict loaded?}
AA -- Yes --> AB[Decompress with dict - return JSON string]
AA -- No --> AC[Return Err - skip row with warning]
end
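The startup branch of the flowchart above can be read as a small decision function. The sketch below is only an illustration of that reading: `StartupAction`, `plan_startup`, and the parameter names `dict_found` and `feature_enabled` are hypothetical, while `db_version`, `migrate_enabled`, `event_count`, and the thresholds (version 6, 2000 events) are taken from the diagram.

```rust
/// What the datastore does about compression when it opens a database.
/// Hypothetical type, named after the flowchart's leaf nodes.
#[derive(Debug, PartialEq)]
enum StartupAction {
    /// Return without setting up compression.
    NoCompression,
    /// Load compressor and decompressor with the stored dictionary.
    LoadDictionary,
    /// Dictionary exists but the feature is off: reads of compressed rows will error.
    EmptyContext,
    /// Train a dictionary and recompress existing rows.
    TrainAndBackfill,
}

const MIN_DB_VERSION: u32 = 6;
const MIN_TRAINING_EVENTS: u64 = 2000;

fn plan_startup(
    db_version: u32,
    dict_found: bool,
    feature_enabled: bool,
    migrate_enabled: bool,
    event_count: u64,
) -> StartupAction {
    // Databases older than v6 still have a TEXT column; nothing to do.
    if db_version < MIN_DB_VERSION {
        return StartupAction::NoCompression;
    }
    // An existing dictionary is always preferred over training a new one.
    if dict_found {
        return if feature_enabled {
            StartupAction::LoadDictionary
        } else {
            StartupAction::EmptyContext
        };
    }
    // No dictionary yet: train only when allowed and when there is enough data.
    if migrate_enabled && event_count >= MIN_TRAINING_EVENTS {
        StartupAction::TrainAndBackfill
    } else {
        StartupAction::NoCompression
    }
}
```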
Reviews (2): Last reviewed commit: "test(datastore): make compression test c..."
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #618      +/-   ##
==========================================
+ Coverage   70.81%   77.26%   +6.45%
==========================================
  Files          51       63      +12
  Lines        2916     5024    +2108
==========================================
+ Hits         2065     3882    +1817
- Misses        851     1142     +291
- Refresh db_version after running migrations so ensure_compression sets up compression on the first startup after a v5->v6 upgrade, instead of waiting for the next restart.
- Stop panicking when a compressed row can't be decompressed (missing dictionary or feature disabled): get_events now skips the row with a warning and get_event returns an error, instead of from_utf8_lossy + unwrap.
- Add regression tests: dictionary trained on first open after upgrade, and an unreadable compressed row is skipped without panicking.
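The "skip with a warning" versus "return an error" split in the second point might look something like this in isolation. It is a std-only sketch under assumptions: `decode_without_dictionary`, `decode_many`, and `decode_one` are hypothetical names, rows are modelled as `(id, blob)` pairs, and the real code surfaces the failure as a rusqlite row error rather than a `String`.

```rust
/// Hypothetical decoder for a build with no dictionary loaded:
/// raw JSON rows are readable, 0xCC-marked rows are not.
fn decode_without_dictionary(blob: &[u8]) -> Result<String, String> {
    if blob.first() == Some(&0xCC) {
        return Err("compressed row but no dictionary loaded".to_string());
    }
    String::from_utf8(blob.to_vec()).map_err(|e| e.to_string())
}

/// Bulk read: an unreadable row is logged and skipped, never a panic.
fn decode_many<F>(rows: &[(i64, Vec<u8>)], decode: F) -> Vec<(i64, String)>
where
    F: Fn(&[u8]) -> Result<String, String>,
{
    rows.iter()
        .filter_map(|(id, blob)| match decode(blob.as_slice()) {
            Ok(json) => Some((*id, json)),
            Err(err) => {
                eprintln!("warning: skipping event {id}: {err}");
                None
            }
        })
        .collect()
}

/// Single-row read: the caller asked for this exact row, so report the failure.
fn decode_one<F>(row: &(i64, Vec<u8>), decode: F) -> Result<(i64, String), String>
where
    F: Fn(&[u8]) -> Result<String, String>,
{
    let (id, blob) = row;
    decode(blob.as_slice())
        .map(|json| (*id, json))
        .map_err(|err| format!("event {id}: {err}"))
}
```

Skipping in the bulk path keeps one bad row from hiding every other event in the query, while the single-row path has nothing else to return and so propagates the error.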
@greptile review
The trailing remove_file() in the new compression tests could fail on Windows, where the datastore worker may still hold the database file handle briefly after close(). The test logic already passed; only the hard-unwrapped cleanup was flaky. Use best-effort removal, matching the existing encrypted-datastore test.
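For reference, best-effort removal can be as small as the sketch below. `remove_file_best_effort` is a hypothetical helper, not the one used in the tests; the only point it illustrates is that a failed `std::fs::remove_file` is reported and ignored instead of being unwrapped.

```rust
use std::fs;
use std::path::Path;

/// Try to delete a test database file and report whether it worked.
/// A failure (for example, a handle still open on Windows) is logged
/// and otherwise ignored, so it cannot fail the test that calls it.
fn remove_file_best_effort(path: &Path) -> bool {
    match fs::remove_file(path) {
        Ok(()) => true,
        Err(err) => {
            eprintln!("note: could not remove {}: {err}", path.display());
            false
        }
    }
}
```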
Benchmark: dictionary compression on a real-world database
Test data: byte-for-byte snapshot of a live database, 544,673 events across 19 buckets (~12.5 months of
Methodology:
Results:
Other notes:
Why a custom implementation instead of
closes #493