Add capture event transport and server-side write classification#112
jspahn80134 wants to merge 39 commits into
Conversation
Update the ignored PostgreSQL integration test to assert the rich events schema columns and fix timestamp/JSONB parameter casts used by the capture insert path. Verified against the AWS PostgreSQL database with event_capture_inserts_rich_schema_event_into_db.
macOS CI occasionally delivered the loadfile acknowledgement just after the old two-second harness timeout. Increase the shared browser message wait to five seconds so test_client does not fail on that timing edge.
The first CI rerun passed the original test_client wait but exposed the same timing issue in test_client_updates while waiting for the autosave content update. Use the client response window as the shared browser test wait budget.
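A minimal sketch of the shared wait budget described in the two commits above, using only the standard library. The names `BROWSER_MESSAGE_WAIT`, `wait_for_message`, and `delayed_ack` are hypothetical and do not come from the harness; the real tests wait on browser messages rather than on an in-process channel.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Hypothetical shared wait budget for browser-test messages
/// (five seconds, the value named in the commit message).
const BROWSER_MESSAGE_WAIT: Duration = Duration::from_secs(5);

/// Wait for the next harness message, failing only once `budget` is spent.
fn wait_for_message(
    rx: &Receiver<String>,
    budget: Duration,
) -> Result<String, RecvTimeoutError> {
    rx.recv_timeout(budget)
}

/// Simulate an acknowledgement that arrives after `delay`
/// (an illustrative stand-in for a slow CI runner).
fn delayed_ack(delay: Duration) -> Receiver<String> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        thread::sleep(delay);
        // Ignore the send error if the receiver has already given up.
        let _ = tx.send("loadfile-ack".to_string());
    });
    rx
}
```

Routing every wait through one constant means a later timing adjustment touches a single value instead of each test.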
The overall browser tests share one WebDriver endpoint and were running concurrently inside the same test binary. This was causing test_client_updates to miss its autosave content update on CI, especially macOS/Safari. Guard the harness with a shared async mutex so each browser session runs in isolation.
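The serialization described in the commit above can be sketched as follows. This is an assumption-laden illustration: it swaps in a blocking `std::sync::Mutex` for the harness's async mutex so the example needs no async runtime, and `BROWSER_LOCK`, `with_browser_session`, and `run_sessions` are hypothetical names. The two atomic counters exist only to show that sessions never overlap.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;
use std::time::Duration;

/// Hypothetical process-wide lock; a blocking stand-in for the async mutex.
static BROWSER_LOCK: Mutex<()> = Mutex::new(());

/// Instrumentation only: how many sessions are running, and the peak seen.
static ACTIVE_SESSIONS: AtomicUsize = AtomicUsize::new(0);
static MAX_CONCURRENT: AtomicUsize = AtomicUsize::new(0);

/// Run `body` while holding the shared lock, so only one browser session
/// talks to the single WebDriver endpoint at a time.
fn with_browser_session<T>(body: impl FnOnce() -> T) -> T {
    // A poisoned lock only means an earlier test panicked; keep serializing.
    let _guard = BROWSER_LOCK.lock().unwrap_or_else(|e| e.into_inner());
    let active = ACTIVE_SESSIONS.fetch_add(1, Ordering::SeqCst) + 1;
    MAX_CONCURRENT.fetch_max(active, Ordering::SeqCst);
    let result = body();
    ACTIVE_SESSIONS.fetch_sub(1, Ordering::SeqCst);
    result
}

/// Start `count` concurrent "tests" and report the peak number of
/// sessions that were ever active at once.
fn run_sessions(count: usize) -> usize {
    let handles: Vec<_> = (0..count)
        .map(|_| {
            thread::spawn(|| {
                with_browser_session(|| thread::sleep(Duration::from_millis(10)))
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    MAX_CONCURRENT.load(Ordering::SeqCst)
}
```

In an async test harness the guard would typically be held across `.await` points, which is why an async-aware mutex is the natural choice there.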
bjones1
left a comment
Here are some initial comments on the PR, mainly questions -- I'd like to hear your thoughts. I'll continue to review.
Use generated Rust-backed capture wire/status types in the VS Code extension. Restore the explanatory extension comments and the current-file update after LoadFile. Keep study lifecycle commands available for automation while removing them from the Command Palette.
Resolve conflicts in the VS Code extension, translation capture path, and overall test harness. Keep upstream CursorPosition/WebDriver updates while preserving capture instrumentation and serialized browser test timing.
bjones1
left a comment
Good progress!
If there's some discussion or a question you answer, don't resolve it -- this helps me find and read your responses. When everything's already resolved, it's hard for me to find/think about discussions.
    @@ -0,0 +1,193 @@
    -- CodeChat capture event schema for dissertation analysis.
I assume this will be moved to the other repo you mentioned focused on analysis?
I would keep the capture schema here because the server writes to this contract; analysis/export scripts can move to the analysis repo. Leaving this open for your preference.
    pub file_hash: Option<String>,
    /// Whether the path was sent plainly, hashed, or omitted.
    pub path_privacy: Option<String>,
    /// Client timestamp, in milliseconds since Unix epoch.
Why do we have both a client timestamp and a server timestamp?
Done in 3b94efd. I removed the separate server_timestamp_ms column. timestamp is now the server receive/record time; client_timestamp_ms remains optional client-observed time for ordering/latency analysis. Leaving this open for the design discussion.
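To make the two remaining time fields concrete, here is a simplified sketch under stated assumptions. `EventTimes`, `record_event`, and `apparent_latency_ms` are hypothetical names, and the server time is modeled as integer milliseconds rather than the `DateTime<Utc>` used by the real `timestamp` field, so the example needs no external crates.

```rust
/// Hypothetical, simplified stand-in for the capture event's time fields.
#[derive(Debug, Clone, PartialEq)]
struct EventTimes {
    /// Server receive/record time, in milliseconds since the Unix epoch.
    timestamp_ms: u64,
    /// Optional client-observed time, in milliseconds since the Unix epoch.
    client_timestamp_ms: Option<u64>,
}

/// Stamp an incoming event with the server's clock, keeping whatever
/// client time (if any) arrived with it.
fn record_event(client_timestamp_ms: Option<u64>, server_now_ms: u64) -> EventTimes {
    EventTimes {
        timestamp_ms: server_now_ms,
        client_timestamp_ms,
    }
}

/// Apparent client-to-server delay. Returns `None` when the client sent no
/// timestamp. The result is signed because the two clocks may be skewed.
fn apparent_latency_ms(times: &EventTimes) -> Option<i64> {
    times
        .client_timestamp_ms
        .map(|client| times.timestamp_ms as i64 - client as i64)
}
```

A negative result is a sign of clock skew rather than a real delay, which is one reason to treat the server time as the authoritative ordering key and the client time as supplementary.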
    // participant/date mappings instead of being configured by students.
    // * `session_id`, `event_id`, `sequence_number`, `schema_version` – event
    // integrity and versioning metadata.
    // * `file_path` – logical path of the file being edited.
I really like the idea of a file hash instead of a file path, to avoid capturing PII. Do you think the analysis will suffer from not knowing the file name, e.g. when looking at a #include <foo.h>? My thought is that the potential PII risk outweighs the benefits we might obtain.
If so, what do you think about removing file_path entirely?
Agreed in 3b94efd. Raw file_path is no longer accepted or stored; capture stores only file_hash. I think the PII risk outweighs filename-analysis value. Leaving this open for your privacy/design check.
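As an illustration of storing only a hash, the sketch below derives an opaque identifier from a path and discards the path itself. Everything here is an assumption: the PR does not say which hash or salt the capture path uses, `file_hash`, `CaptureFileRef`, and `capture_file_ref` are hypothetical names, and `DefaultHasher` is a non-cryptographic standard-library stand-in whose output is not guaranteed stable across Rust releases. A real deployment would use a cryptographic hash instead.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical helper: map a path to an opaque 16-digit hex identifier.
/// NOTE: `DefaultHasher` is NOT cryptographic; it is used here only so the
/// sketch runs without external crates.
fn file_hash(path: &str, salt: &str) -> String {
    let mut hasher = DefaultHasher::new();
    // Mixing in a per-study salt makes guessing common file names harder.
    salt.hash(&mut hasher);
    path.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

/// Hypothetical stored form: only the hash survives, never the raw path.
#[derive(Debug, PartialEq)]
struct CaptureFileRef {
    file_hash: Option<String>,
}

/// Build the stored reference; the raw path is dropped after hashing.
fn capture_file_ref(path: Option<&str>, salt: &str) -> CaptureFileRef {
    CaptureFileRef {
        file_hash: path.map(|p| file_hash(p, salt)),
    }
}
```

Because the same path and salt always yield the same identifier, events for one file can still be grouped during analysis without the file name being recorded.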
    #[serde(rename_all = "snake_case")]
    #[ts(export)]
    pub enum CaptureEventType {
    /// Server-classified edit to documentation/prose.
Does this include edits to code inside blocks, or is this classified as WriteCode?
Clarified in 3b94efd: WriteDoc means prose/doc text, including CodeChat doc blocks; code inside blocks/fences is classified as WriteCode. Leaving open in case you want to tune wording/classification.
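The rule in the reply above can be sketched as a small mapping. This is not the server's classifier: `EditRegion`, `classify_write`, and `wire_name` are hypothetical, the enum shows only the two write variants discussed here, and `wire_name` hand-writes the strings that `rename_all = "snake_case"` would produce so the example needs no serde dependency.

```rust
/// The two write variants discussed in this thread (the real enum in the
/// PR may contain more).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CaptureEventType {
    WriteDoc,
    WriteCode,
}

impl CaptureEventType {
    /// Wire name as `rename_all = "snake_case"` would serialize it.
    fn wire_name(self) -> &'static str {
        match self {
            CaptureEventType::WriteDoc => "write_doc",
            CaptureEventType::WriteCode => "write_code",
        }
    }
}

/// Hypothetical description of where an edit landed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EditRegion {
    /// Plain prose in a document.
    Prose,
    /// A CodeChat doc block.
    DocBlock,
    /// Code in a code block.
    CodeBlock,
    /// Code inside a fenced block within a document.
    CodeFence,
}

/// Prose and doc blocks count as documentation; code in blocks or fences
/// counts as code, following the wording of the reply.
fn classify_write(region: EditRegion) -> CaptureEventType {
    match region {
        EditRegion::Prose | EditRegion::DocBlock => CaptureEventType::WriteDoc,
        EditRegion::CodeBlock | EditRegion::CodeFence => CaptureEventType::WriteCode,
    }
}
```

Keeping the decision in one exhaustive `match` means a new region kind cannot be added without the compiler asking how it should be classified.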
    /// Canonical type of the captured event.
    pub event_type: CaptureEventType,
    /// When the event occurred, in UTC.
    pub timestamp: DateTime<Utc>,
Why a timestamp, client_timestamp, and server_timestamp?
Same cleanup in 3b94efd: removed server_timestamp_ms. timestamp is the server-side record time; client_timestamp_ms is the optional client-observed event time. Leaving open for discussion.
Summary:
Validation: