Allow admins to customize the time gap between sessions by mihow · Pull Request #1292 · RolnickLab/antenna

mihow · 2026-05-05T00:31:10Z

Closes #1157. Closes #1158. Partially addresses #906 (full fix tracked in #904 — see "What this PR proves on its own").

Summary

The session_time_gap_seconds project setting was added in #918 but was never read by anything — group_images_into_events() still hardcoded a 120-minute gap, so changing the field had no effect. This PR wires the setting through and turns the regroup action into a proper Job so admins can actually see what's happening.

What changes in this PR:

group_images_into_events() now reads deployment.project.session_time_gap_seconds when no explicit gap is passed, with a clamp on invalid values (None/0/negative → 120 min default, logged).
A new RegroupEventsJob JobType — admins / MLDataManager / ProjectManager can trigger regrouping from the API, the admin, or programmatically and see progress, logs, and stage params in the Jobs UI.
DataStorageSyncJob now runs grouping as an explicit second stage. Previously the regroup happened silently inside Deployment.save() — if it raised, the sync Job recorded SUCCESS anyway. Closes Expose session regrouping as a job stage when syncing images #1157 and Expose session regrouping as a job stage with stats #1158.
Per-deployment idempotency lock around group_images_into_events() itself (token-based uuid release, so a longer-than-TTL run never clobbers a newer owner). All four call sites — the new Job, the sync-stage regroup, Deployment.save autoregroup, and the bare Celery task — collapse to a single in-flight run.
New POST /api/v2/deployments/<pk>/regroup-sessions/ action that creates a RegroupEventsJob. Mirrors the existing sync action shape (returns job_id, not a Celery task id).
Two new custom permissions: regroup_sessions_deployment (gates the API action; granted to ProjectManager) and run_regroup_events_job (gates create/run/cancel on the new JobType; granted to MLDataManager, which ProjectManager inherits from). Migration 0089_add_regroup_permissions registers both and grants them to existing role groups.
Django admin "Regroup captures into sessions" and "Sync captures" bulk actions now create Jobs instead of bare Celery tasks, so the admin path matches the API path.

Drive-by fixes in the same area:

events_created stage param was counting events visited rather than events newly created (the bool from get_or_create was being discarded). Fixed to use the was_created return.
The pre-existing duplicate-timestamp warning logged len(values) (string length of the newline-joined report) instead of the duplicate count. Pre-existing since Storage and retrieval of project settings #918's neighbouring code (2024-09-12) but sits inside the very block whose stats this PR now surfaces, so fixed here rather than left to drift.

What is intentionally not included:

No UI form field for the setting. The earlier draft of this PR added a numeric input on the Processing settings page; that's been pulled. The setting stays admin-only until we decide whether it belongs on Project or Deployment, and whether the unit should change from seconds to minutes/hours. See "Follow-ups" below.
No new code paths for unauthenticated or non-member users. Permissions match the existing sync_deployment shape.

One DB migration (0089_add_regroup_permissions, rebased onto main's 0088). The session_time_gap_seconds field itself already existed with default 7200 seconds (= 2 hours = 120 minutes), matching the previous hardcoded behavior.

Use case this fixes

A user reported that monitoring schedules with multi-burst nights (e.g. 2 hours on, 2 hours off, 2 hours on) sometimes group into one Event and sometimes into two, depending on whether the off-window crosses midnight. With this PR, raising the gap setting past the longest off-window collapses cross-midnight bursts into a single Event consistently.

A second pain came out of a recent workshop: sync jobs would complete successfully, but if the post-sync regroup raised, there was no Job-level signal. By the time anyone noticed the captures had no sessions, it took grepping Celery logs to find the failure. The new regroup stage on DataStorageSyncJob surfaces those failures directly in the Jobs UI — a sync-then-regroup failure now flips the Job to FAILURE.

API usage

Request

curl -X POST \
  -H "Authorization: Token <your-token>" \
  -H "Content-Type: application/json" \
  https://your-antenna-host/api/v2/deployments/<deployment_pk>/regroup-sessions/

No request body required.

Response

202 Accepted:

{
  "job_id": 12345,
  "deployment_id": 74,
  "project_id": 24
}

The response returns immediately; regrouping runs as a RegroupEventsJob in the background. Track progress at GET /api/v2/jobs/<job_id>/ or in the Jobs UI — you'll see a "Regroup sessions" stage with events_created, events_touched, events_deleted_empty, captures_grouped, duplicate_timestamps, ungrouped_captures, and no_timestamp_captures stage params on completion.

If a regroup is already in progress for the same deployment, subsequent enqueues still return 202 with a fresh job_id, but the Job's regroup stage will short-circuit at the worker (logged as a warning on the Job) and return early without touching the DB. The lock is keyed on the deployment, not the Job, so duplicate Jobs are visible in the Jobs UI but cannot run overlapping regroups on the same rows.

Error responses

404 Not Found — deployment pk does not exist or is not visible to the requesting user.
401 Unauthorized / 403 Forbidden — token missing or user lacks the regroup_sessions_deployment permission on the deployment's project (granted to the ProjectManager role and superusers).

Caveats — what this PR does and doesn't change

The setting is now wired through, but Event keying still collapses some timestamp groups. Below are the scenarios actually tested and what they show.

Production data — 5-setting cycle on a real deployment

Cycled session_time_gap_seconds through five values via the new endpoint, on a deployment with ~1.2 k captures currently grouped into 69 Events at the default 2 h gap. Each Job completed in 14–22 s. (Run was against the original regroup_events_task.delay() path before the Job framework wrapping; the same numbers re-verified via the new path — see "End-to-end validation" below.)

Setting	Observed events	Change vs default
30 min	69	none
1 hour	69	none
2 hour (default)	69	—
4 hour	69	none
24 hour	60	merged 9 cross-midnight nights

Reading: tightening below ~24 h produced no observable Event-count change on this deployment. Loosening to 24 h merged 9 cross-midnight nights (−13 %).

Why tightening doesn't always change Event count

group_images_into_events() keys Events by group_by = start_date.date() and uses Event.objects.get_or_create(deployment, group_by). Two timestamp groups produced by a tighter gap that start on the same calendar date collide back into a single Event. So the gap setting only changes Event count when:

The off-window between bursts crosses midnight, and
The new gap value is longer than that off-window (so the bursts merge into one timestamp group whose start date stays on the earlier calendar day).

The existing comment in models.py already flags this: "#904 is expected to rework this reuse path more thoroughly." A separate cleaner fix — keying Events by a noon-anchored "monitoring night" instead of calendar date — is parked as a follow-up.

Tests added in this PR

ami/main/tests.py::TestImageGrouping:

Test	Pattern	Setting	Result
`test_cross_midnight_bursts_split_by_short_gap`	Bursts at 22:00 day N and 01:00 day N+1, off-window 2 h 35 min	2 h gap	2 Events (split — user's pain)
↑ same	↑ same	4 h gap	1 Event (this PR's fix)
`test_same_date_bursts_merge_regardless_of_gap`	Two bursts at 20:00 and 22:00 same calendar date, off-window 1 h 35 min	1 h gap	1 Event in DB (date collision masks gap — known limitation)
`test_session_time_gap_seconds_is_used_when_no_explicit_gap`	Cross-midnight bursts	project setting 7200 s	2 Events
↑ same	↑ same	project setting 14400 s	1 Event
`test_invalid_session_time_gap_falls_back_to_default`	Cross-midnight bursts	project setting 0 / -1 / -7200	2 Events (default-gap split, not 12)
`test_regroup_is_idempotent_under_concurrent_calls`	Lock pre-taken before second call	—	Second call returns empty list; no Events created
`test_regroup_lock_release_does_not_clobber_a_newer_owner`	Newer owner takes lock while ours is held	—	Newer owner's lock survives our `finally` block

ami/jobs/tests/test_jobs.py:

Test	What it covers
`TestRegroupEventsJob.test_regroup_job_runs_end_to_end_and_reports_stats`	Job runs to SUCCESS, stage params (`events_created`, `captures_grouped`, etc.) populated, real Events exist in DB
`TestRegroupEventsJob.test_regroup_job_failure_propagates`	A `group_images_into_events` exception propagates from `JobType.run()` so the Celery wrapper records FAILURE
`TestDataStorageSyncJobIncludesRegroupStage.test_sync_job_adds_regroup_stage`	Sync Job exposes a `regroup_sessions` stage with stats after sync completes
`TestDataStorageSyncJobIncludesRegroupStage.test_sync_job_regroup_failure_propagates`	A regroup failure during a sync Job propagates so the sync Job ends FAILURE rather than silently succeeding (fixes the workshop pain in #1157)

End-to-end validation against a real Antenna stack

Run via the new manage.py test_regroup_job_e2e {regroup|sync|concurrent} command (added in this PR; mirrors test_ml_job_e2e). Stack was the dev box's main Antenna compose stack, bind-mounted onto this branch's worktree. Target deployment: PFS_Moth_Wall (id 74, project 24, 1263 captures, 69 events at the 2 h default).

#	Scenario	Command	Result
1	`RegroupEventsJob` on real captures	`test_regroup_job_e2e regroup --deployment 74`	Job SUCCESS in 18 s. Stage params: `captures_grouped=1188, events_created=0, events_touched=69, events_deleted_empty=0, duplicate_timestamps=57, ungrouped_captures=0, no_timestamp_captures=0`. Event count stable (Δ 0). Reproduces the 2 h-baseline row of the production-cycle table via the new code path.
2	`DataStorageSyncJob` two-stage chain	`test_regroup_job_e2e sync --deployment 74`	Job SUCCESS in 20 s. Two stages registered: `data_storage_sync` (`total_files=857, failed=0`) and `regroup_sessions` with same regroup stats as #1. Closes #1157 / #1158.
3	Sync→regroup failure propagation	(covered by unit test)	`test_sync_job_regroup_failure_propagates` mocks `group_images_into_events` to raise; the `RuntimeError` propagates from `JobType.run()` so the Celery wrapper records FAILURE. The stack-level run would re-exercise the same propagation path. Skipped to avoid a code edit just to inject a fault; happy to add a chaos hook if reviewers want it.
4	Concurrent lock idempotency	`test_regroup_job_e2e concurrent --deployment 74`	Both Jobs reached SUCCESS in 18 s. Job A: `captures_grouped=1188` (real work). Job B: `captures_grouped=0` (short-circuit; logged the lock-warning per the body). Event count unchanged from the single-run baseline. Lock semantics held end-to-end.

The 5-setting production cycle (30 min / 1 h / 2 h / 4 h / 24 h) in the "Caveats" table was originally run against the pre-Job-framework regroup_events_task.delay() path. Re-running it through RegroupEventsJob would produce identical Event counts (same underlying group_images_into_events call) and was not re-done end-to-end; the single-setting regroup run above is the live-stack proof that the Job-framework path drives that function correctly.

Headless validation pass (post-takeaway review)

A second pass exercised the API permission paths, setting-change UX, admin bulk action edge cases, and the lock-miss log surface on the same dev stack — these are the paths the unit tests stub out or the original E2E command doesn't cover. All against deployment 74.

#	Scenario	How	Result
5	Superuser POST `/regroup-sessions/` happy path	`curl -X POST` with superuser token	202 `{"job_id": 1587, "deployment_id": 74, "project_id": 24}`. Job ran to SUCCESS in 19 s. Stage params surfaced with both human-readable `name` (e.g. "Captures grouped") and slugified `key` (`captures_grouped`) for UI lookup.
6	Non-member POST	`curl` with token from a user not in project 24	403 `"You do not have permission to perform this action."` (object-level perm gate via `check_custom_permission(action="regroup_sessions") → regroup_sessions_deployment`).
7	POST against non-existent deployment	`curl` with superuser token, pk = 99999999	404 `"Not found."`
8	Unauthenticated POST	`curl` with no auth header	401 `"Authentication credentials were not provided."`
9	Setting change does not auto-enqueue regroup	Save `Project.session_time_gap_seconds` 7200 → 14400 in shell, then count new `RegroupEventsJob`s and check Event count	RegroupEventsJob delta = 0. Event count for project = 294 → 294. Matches the body's documented behavior (admin must trigger manually).
10	Admin bulk action with a project-less deployment in the selection	Build queryset `[74, orphan]` (orphan = `Deployment(project=None)`), invoke `DeploymentAdmin.regroup_events`	Message: `Queued RegroupEventsJob for 1 deployments: [1594] — skipped: #297 ... (no project)`. Orphan skipped cleanly, valid one enqueued.
11	Lock-miss warning surfaces in the Job log	2 sequential `POST /regroup-sessions/` calls < 1 s apart	Job A 1591: real work, 18 s. Jobs B 1592 + C 1593: started 9 s into A's run, both finished in 12 ms with `captures_grouped=0`. Job B/C `.logs.stdout` contains `WARNING group_images_into_events skipped for deployment 74: another regroup is in progress.` — an admin watching the Job sees the gap directly.

Not run on this stack (need fixtures not available locally):

#	Scenario	Fixture needed
12	Cancel mid-run	Deployment with ~50 k+ SourceImages so the regroup runs ≥ 30 s — long enough to hit the cancel API before it finishes. Exercises Celery cancel + the `finally`-block lock release, not this PR's new code.
13	Concurrent at scale	Same ≥ 50 k-capture fixture. Lock cost is constant per call, so the mechanism shouldn't differ from #11 — just confirms wallclock under contention.
14	Admin list at prod-shape size	Synthetic ≥ 1 000 `Deployment` rows with cached count fields populated. With the new denorm-field admin, projection is roughly 2–3 s warm at 1 000 rows (vs 0.6 s on 251 today); linear in row count, not capture count.

Post-review hardening (commit `c56a03b`)

An independent takeaway review surfaced four issues, all addressed in c56a03bb:

"Captures grouped" stat now counts the SourceImage rows actually assigned to an Event, rather than the number of distinct timestamps. When two captures share a timestamp (tracked separately as duplicate_timestamps), both get assigned to the same Event, so the old distinct-timestamp count understated the captures grouped. The E2E table figures above were produced before this change and will read slightly higher under the corrected count.
Lock TTL and the matching Celery time limits dropped to 10 min soft / 11 min hard (was 1 h). Regroup finishes in seconds, so a short TTL caps how long the lock stays held if a worker is killed before its finally block releases it.
Documented the grouping-only scope of RegroupEventsJob and the bare regroup_events task: propagating a moved deployment's project_id to its children stays on Deployment.save() via update_children(), not on the regroup path.
Lock-miss now logs a clear "regroup skipped" message instead of a misleading "now has 0 events", so an operator reading the worker log can tell a short-circuited duplicate run from a genuine reset.

What this PR proves on its own

The setting is wired through (24 h → 60 events vs default 69 events confirms group_images_into_events reads it).
The new API action and Job framework path both succeed end-to-end against a real stack (E2Es Bump cypress-io/github-action from 2 to 5 #1, Bump pre-commit/action from 2.0.0 to 3.0.0 #2, Bump docker/setup-buildx-action from 1 to 2 #4).
Concurrent enqueues for the same deployment don't run overlapping regroups (E2E Bump docker/setup-buildx-action from 1 to 2 #4), and the lock release doesn't clobber a newer owner (unit test).
A regroup failure during a sync Job now flips the Job to FAILURE instead of silently succeeding (unit test; closes Expose session regrouping as a job stage when syncing images #1157 / Expose session regrouping as a job stage with stats #1158).
True per-session granularity below the day boundary is blocked on Fix incorrect session grouping #904, which removes the group_by field and replaces date-keyed reuse with time-overlap matching. This PR is the prerequisite that lets Fix incorrect session grouping #904's reuse-path rework actually act on a configurable threshold. Make max_time_gap a configurable Project setting #906 is therefore partially addressed (setting now respected end-to-end); the deeper "sessions don't split inside a calendar day" case remains for Fix incorrect session grouping #904.

Related PRs and issues

Fix incorrect session grouping #904 — "Fix incorrect session grouping". Removes group_by field, adds use_existing flag, replaces unique constraint. Complementary to this PR: Fix incorrect session grouping #904 makes the gap setting visibly meaningful below the day boundary.
Make max_time_gap a configurable Project setting #906 — "Allow users to customize the time gap between sessions". Setting is now respected end-to-end; the calendar-date Event reuse limitation remains, blocked on Fix incorrect session grouping #904.
Expose session regrouping as a job stage when syncing images #1157 — "Expose session regrouping as a job stage when syncing images". Closed by the new regroup stage on DataStorageSyncJob.
Expose session regrouping as a job stage with stats #1158 — "Expose session regrouping as a job stage with stats". Closed by the stage params on both RegroupEventsJob and the sync Job's regroup stage.

Test plan

yarn build
docker compose build django ui
Pre-commit hooks
manage.py test ami.main.tests.TestImageGrouping — 12/12 pass
manage.py test ami.jobs.tests.test_jobs.TestRegroupEventsJob ami.jobs.tests.test_jobs.TestDataStorageSyncJobIncludesRegroupStage — 4/4 pass
manage.py test ami.main.tests.TestRolePermissions — 4/4 pass (no regression from the two new permissions)
Functional verification end-to-end against the live Antenna dev stack: test_regroup_job_e2e regroup, sync, and concurrent against deployment 74 — see "End-to-end validation" table above.
Headless API + admin verification on the same stack: superuser POST, non-member POST, bad-pk POST, unauthenticated POST, setting-change does-not-enqueue check, admin bulk action with project-less deployment in selection, lock-miss warning visible in Job log — see "Headless validation pass" rows Bump docker/build-push-action from 2 to 3 #5–Bump @types/node from 18.11.9 to 18.11.12 in /frontend #11.

Notes / follow-ups

The setting lives on Project because that's where it was added in Storage and retrieval of project settings #918. Moving it to Deployment is probably the right long-term shape — different monitoring stations on the same project can run different burst schedules. A separate PR with a data migration.
Surface a "Regroup sessions" button on the Stations list / project settings — separate UI follow-up. When that lands, the form input should be minutes or hours, not seconds (per @loppear's review comment on the original draft).
deployment_events_need_update() does not detect a setting change as a reason to regroup, so admins must hit the new API action (or the Django admin bulk action) after editing the gap (also per @loppear's review comment — "expect this setting to apply"). The field's help_text now explicitly points users to the Deployments admin "Regroup captures into sessions" bulk action so the manual-regroup step is discoverable from the form itself. Auto-detection on setting change remains a possible follow-up.
A project-level "Regroup all deployments" admin action (mirroring the existing per-deployment bulk action one level up) would save the click-through-to-Deployments step. Out of scope here; the multi-select bulk action already covers the use case once the user finds it.

Screenshots

Co-author

Co-Authored-By: Claude noreply@anthropic.com

Summary by CodeRabbit

New Features
- Regroup sessions endpoint for deployments and a dedicated "Regroup sessions" background job
- Jobs UI now exposes a regroup stage with summary stats (events created, captures grouped, etc.)
Improvements
- Configurable session time gap with sensible fallback
- Better concurrency safety to avoid duplicate regroup work
- Admin actions now enqueue background jobs instead of running inline
Permissions
- New permissions to control who can run regroup jobs and trigger regroup on deployments

netlify · 2026-05-05T00:31:15Z

✅ Deploy Preview for antenna-preview canceled.

Name	Link
🔨 Latest commit	`c56a03b`
🔍 Latest deploy log	https://app.netlify.com/projects/antenna-preview/deploys/6a289e8ac60c290008ed6842

netlify · 2026-05-05T00:31:16Z

✅ Deploy Preview for antenna-ssec canceled.

Name	Link
🔨 Latest commit	`c56a03b`
🔍 Latest deploy log	https://app.netlify.com/projects/antenna-ssec/deploys/6a289e8ac8746900088b6d37

coderabbitai · 2026-05-05T00:31:18Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds per-deployment regroup cache locking and project-driven session gap handling to core grouping; introduces RegroupEventsJob and an explicit regroup stage in DataStorageSyncJob; switches admin/API to enqueue jobs; adds permissions, role updates, unit/integration tests, and an e2e test command.

Changes

Regroup sessions feature

Layer / File(s)	Summary
Core grouping refactor: cache lock, configurable gap, job params `ami/main/models.py`	Add `REGROUP_LOCK_TTL_SECONDS` and `_regroup_lock`; make `group_images_into_events` use per-deployment cache lock, accept optional `max_time_gap` derived from `deployment.project.session_time_gap_seconds` with 120-minute fallback, compute `events_created` and `events_deleted_empty`, sample duplicate timestamps, and emit extended job stage params when called with `job`/`stage_key`. Update `delete_empty_events` to return an `int`. Add `regroup_after` flag to `Deployment.sync_captures` and new Project permission constants.
Jobs: regroup stage and RegroupEventsJob `ami/jobs/models.py`	Introduce `REGROUP_STAGE_PARAM_NAMES`; add `regroup_stage_key`/`regroup_stage_name` to `DataStorageSyncJob` and refactor its `run` to call `deployment.sync_captures(regroup_after=False)` then explicitly run `group_images_into_events` into a dedicated regroup stage. Add `RegroupEventsJob` job type, implement its `run`, and register it in `VALID_JOB_TYPES`.
Unit tests: image grouping and defaults `ami/main/tests.py`	Add `_create_burst` helper and tests for cross-midnight splitting vs merging under different `max_time_gap` settings, same-date merging regardless of gap, using `deployment.project.session_time_gap_seconds` when `max_time_gap` omitted, and fallback for invalid session_time_gap values.
Unit tests: idempotency and lock semantics `ami/main/tests.py`	Add tests that pre-seed the per-deployment regroup cache lock to assert short-circuit behavior (no Events created) and tests ensuring token-based lock cleanup does not clobber newer owners.
Job integration tests `ami/jobs/tests/test_jobs.py`	Add end-to-end tests for `RegroupEventsJob` and for `DataStorageSyncJob` exposing regroup as a separate stage; assert regroup stage reaches SUCCESS with progress and stage params and that failures in `group_images_into_events` propagate.
API and admin: enqueue via Job infrastructure `ami/main/admin.py`, `ami/main/api/views.py`	Refactor `DeploymentAdmin` display helpers to use denormalized fields for ordering; rewrite `sync_captures` and `regroup_events` admin actions to create/enqueue `Job` records and report queued/skipped deployments; add `DeploymentViewSet.regroup_sessions` POST detail action to enqueue `RegroupEventsJob` and return `202 Accepted` with job/deployment/project ids.
End-to-end test harness `ami/jobs/management/commands/test_regroup_job_e2e.py`	Add management command with modes `regroup`, `sync`, and `concurrent` to enqueue/poll jobs, assert SUCCESS, dump regroup stage params, and validate lock short-circuiting across concurrent runs.
Permissions and role assignments `ami/main/migrations/0089_add_regroup_permissions.py`, `ami/users/roles.py`	Add migration to create and grant two new `Project` permissions (`regroup_sessions_deployment`, `run_regroup_events_job`) to role groups and update `Project` model `permissions`; add these permission codenames to MLDataManager and ProjectManager role sets.
Task documentation `ami/tasks.py`	Expand `regroup_events` task comment to note it is a thin Celery wrapper around `Deployment.save()` autoregroup and that idempotency/locking is handled by `_regroup_lock` in `group_images_into_events`.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related issues

Expose session regrouping as a job stage when syncing images #1157: Adds explicit regroup stage and surfaces regroup stats — aligns with this PR's regroup stage and stats changes.
Expose session regrouping as a job stage with stats #1158: Requests reporting and a regroup stage; this PR implements those job-stage and reporting changes.
Surface a "captures have no session, click to regroup" affordance to users #1267: Asks for job-based regrouping and visibility; this PR adds RegroupEventsJob, admin/API enqueue, and locking.

Possibly related PRs

RolnickLab/antenna#1159: Also modifies DataStorageSyncJob.run stage/progress handling; related at the job-stage level.
RolnickLab/antenna#1268: Also changes regroup core behavior around events/duration and overlaps with group_images_into_events changes.

Suggested labels

backend

Suggested reviewers

annavik

Poem

🐰 A cache lock held tight in the meadow of code,
Sessions regrouped as the timestamps reload,
Jobs queued with a hop and a soft little thrum,
Stats counted and logged — now the pipelines hum,
Hooray for clear stages and locks that keep calm!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 37.25% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the primary feature: allowing customization of the session time gap setting. It directly addresses the main change without being vague or overly broad.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The PR description is comprehensive and well-structured, covering all required sections from the template including summary, list of changes, related issues, detailed description, testing instructions, deployment notes, and a complete checklist.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/configurable-session-gap

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ami/main/api/views.py`:
- Around line 320-323: The view currently enqueues regroup_events_task
unconditionally allowing overlapping runs; modify the handler that calls
self.get_object() and regroup_events_task.delay(deployment.pk) to acquire a
per-deployment in-flight lock (e.g., a Redis/cache-based lock keyed by
f"regroup:{deployment.pk}") before queuing; if the lock cannot be obtained
return an early HTTP 409 (or similar) Response and log that a regroup is already
running, otherwise enqueue the task, keep the lock for the duration of the task
(or set a TTL and have the task release the lock) and ensure regroup_events_task
(or its wrapper) releases the lock on completion/failure so duplicate triggers
are prevented.

In `@ui/src/pages/project-details/processing-form.tsx`:
- Around line 30-35: The sessionTimeGapSeconds field currently validates but
will submit a string because React Hook Form doesn't coerce number inputs;
update the field's rules for sessionTimeGapSeconds to include a setValueAs that
converts the incoming value to a number (e.g., Number or parseInt) so the form
value is stored as a numeric type; modify the rules object for
sessionTimeGapSeconds to add setValueAs and keep required/min validations
intact.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 80284157-fc1e-41e0-8398-96a13aa37201

📥 Commits

Reviewing files that changed from the base of the PR and between 183f487 and 55cef8b.

📒 Files selected for processing (4)

ami/main/api/views.py
ami/main/models.py
ami/main/tests.py
ui/src/pages/project-details/processing-form.tsx

Copilot

Pull request overview

This PR wires the existing per-project session_time_gap_seconds setting into backend session (Event) grouping, exposes the setting in the UI, and adds an API action to trigger regrouping asynchronously so admins can re-cluster historical captures after changing the threshold.

Changes:

Backend: group_images_into_events() now derives its default max_time_gap from deployment.project.session_time_gap_seconds when not explicitly provided.
API: adds POST /api/v2/deployments/<pk>/regroup-sessions/ to enqueue the existing ami.tasks.regroup_events Celery task.
UI + tests: exposes the setting as a numeric input on the project Processing settings form and adds unit tests covering cross-midnight behavior and the project-setting fallback.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File	Description
`ami/main/models.py`	Uses the project’s `session_time_gap_seconds` as the default grouping gap instead of a hardcoded 120 minutes.
`ami/main/api/views.py`	Adds a deployment-level action endpoint to queue regrouping in Celery.
`ui/src/pages/project-details/processing-form.tsx`	Adds a numeric form field to edit `sessionTimeGapSeconds` in the Processing settings UI.
`ami/main/tests.py`	Adds tests documenting cross-midnight split/merge behavior and verifying the project setting is used when no explicit gap is passed.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…oup task Addresses takeaway-review feedback on PR #1292: - Add `regroup_sessions_deployment` custom Project permission so non-superusers with the ProjectManager role can hit `POST /deployments/<pk>/regroup-sessions/` (mirrors `sync_deployment`). Without this, `BaseModel.check_custom_permission` computes the codename `regroup_sessions_deployment` and the perm doesn't exist, so the action 403s for everyone except superusers. - Wrap `ami.tasks.regroup_events` in a per-deployment cache lock. Concurrent enqueues (double-clicks, sync_captures auto-regroup racing with manual trigger) collapse to a single run instead of overlapping on the same rows. Lower-level than the view so all callers (Deployment.save, admin bulk action, new API action) are protected. - Defensive clamp on `session_time_gap_seconds` — fall back to the historical 120-minute default if the value is 0 / negative, with a warning log. Stops pathological "every gap = new event" behavior when admins enter bad values. - Migration 0085 adds the new perm to `Project.Meta.permissions` and grants it to existing `ProjectManager` role groups via a data migration (mirrors how 0084 revoked `delete_job`). - Tests: - test_invalid_session_time_gap_falls_back_to_default exercises 0 / -1 / -7200 - test_regroup_events_task_is_idempotent_under_concurrent_calls pre-takes the cache lock, asserts a second call skips group_images_into_events Co-Authored-By: Claude <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ami/tasks.py`:
- Around line 106-119: The current lock uses cache.add(lock_key, 1, ...) and
unconditionally cache.delete(lock_key) which can remove a newer owner's lock;
change to an ownership token pattern: when acquiring the lock (where cache.add
is called) generate a unique token (e.g. UUID) and store that token as the cache
value instead of 1 (use REGROUP_EVENTS_LOCK_TTL as before), keep the token in a
local variable, and on release only remove the key if the stored cache value
equals your token (fetch with cache.get(lock_key) and only call
cache.delete(lock_key) when it matches) or use the cache backend's atomic
compare-and-delete if available; update the lock handling around lock_key,
cache.add, cache.get and cache.delete to use this token ownership check.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 251316be-c276-4473-a3fc-36a150e056ed

📥 Commits

Reviewing files that changed from the base of the PR and between 55cef8b and 50d618e.

📒 Files selected for processing (5)

ami/main/migrations/0085_add_regroup_sessions_deployment_permission.py
ami/main/models.py
ami/main/tests.py
ami/tasks.py
ami/users/roles.py

✅ Files skipped from review due to trivial changes (1)

ami/users/roles.py

🚧 Files skipped from review as they are similar to previous changes (1)

ami/main/models.py

mihow · 2026-05-07T06:20:35Z

Claude says: Push 50d618e addresses the takeaway-review feedback from CodeRabbit and Copilot:

regroup_sessions_deployment custom permission (Copilot) — added to Project.Permissions and Project.Meta.permissions, granted to existing ProjectManager role groups via data migration 0085. Mirrors how sync_deployment is wired. Without this, the action 403'd for all non-superusers.
Per-deployment idempotency lock on regroup_events (CodeRabbit) — cache.add() SETNX-style with a 1 h TTL, applied inside the Celery task itself rather than the view. So all callers (the new API action, Deployment.save autoregroup, the admin bulk action) are protected, not just the new endpoint. Concurrent enqueues collapse to a single run.
Clamp non-positive session_time_gap_seconds (Copilot) — fall back to the historical 120-min default with a warning log for 0 or negative values, preventing pathological per-image splits.
Form numeric coercion (CodeRabbit + Copilot) — left as-is. <input type="number"> returning a string for an integer field is the established pattern across processing-form, capture-set form, deployment-location form, and job-details form. DRF IntegerField parses strings transparently. Worth standardizing the helper later, not in this PR.

New tests covering the lock and the clamp:

test_invalid_session_time_gap_falls_back_to_default — exercises 0 / -1 / -7200
test_regroup_events_task_is_idempotent_under_concurrent_calls — pre-takes the lock and asserts the second call exits early without invoking group_images_into_events

11/11 TestImageGrouping and 13/13 permission tests green locally.

annavik

Nice!!

I still think this setting would make more sense in a new section, instead of in Project->Settings->Processing. I know you said this can affect the logic for occurrences once tracking is enabled, but I think this is a bit far-fetched. I'm thinking, if I were about to look for this setting as a user, where would I first go? I'm not sure the processing section would be the place. Maybe this belongs more with upcoming sampling schedule configuration we have discussed?

I don't have very strong opinions here though, so if you feel this is where it belongs, go for it! :)

loppear

Looks like it works for my case. I agree with @annavik that the Processing settings form isn't the first place I looked for this, even having looked at the PR, would make sense to me on General settings (although maybe eventually moved to a less-likely settings tab there, and not opposed to it staying where it is for the time being.)

EDIT: I think regroup_events should probably happen on setting change, not sure how expensive it is or what the conditions where you wouldn't want to regroup (yet?) but I think I'm going to expect this setting to apply - they're going to be regrouped to the new setting anyways on the next sync right?

loppear · 2026-05-08T14:20:04Z

+  sessionTimeGapSeconds: {
+    label: 'Maximum time gap between sessions (default)',
+    description:
+      'Maximum time gap (in seconds) between consecutive images before a new session is started. Default is 7200 seconds (2 hours).',


Might be nice if this was presented to the user as minutes (or hours) rather than seconds.

Claude says: Good point — units in seconds are awkward in a settings UI. The UI form field has been removed entirely in 76b52af; the setting is now admin-only (via Django admin or the API). When we surface it again it should be minutes or hours, and probably at the deployment level rather than the project level (raised in the PR description's "Follow-ups" section).

Claude says: Captured in the body's follow-up section — Notes / follow-ups now reads "the form input should be minutes or hours, not seconds (per @loppear's review comment on the original draft)" alongside the related point about auto-regroup on setting change. The UI form input was pulled from this PR (no UI field ships here), so the unit choice lands with the follow-up UI work rather than gating this PR. Leaving the thread for your call on resolving.

mihow · 2026-06-04T22:16:26Z

Claude says: Pushed 76b52af which restructures the PR based on @mihow's takeaway-review feedback.

Scope shift — out: the UI form field for session_time_gap_seconds. The setting is admin-only for now (Django admin or API). When we re-surface it in the UI it should be minutes/hours rather than seconds, and probably attached to Deployment rather than Project — both noted in "Follow-ups".

Scope shift — in:

RegroupEventsJob JobType. The new POST /api/v2/deployments/<pk>/regroup-sessions/ action and the Django admin bulk action both enqueue Job.objects.create(job_type_key=RegroupEventsJob.key).enqueue() now, so admins see regroup progress, logs, and stats in the Jobs UI like every other long-running task. API response shape is now {job_id, deployment_id, project_id} — matches the existing sync action on the same viewset (previously returned a Celery task_id).
DataStorageSyncJob gets an explicit "Regroup sessions" second stage. Previously regroup ran silently inside Deployment.save(), so a regroup failure after a successful sync left the Job recorded as SUCCESS — the workshop pain Mike flagged. Now a sync-then-regroup failure flips the whole Job to FAILURE and the regroup logs land on the same Job row. Closes Expose session regrouping as a job stage when syncing images #1157 and Expose session regrouping as a job stage with stats #1158.
Idempotency lock moves into group_images_into_events() itself. Previously the lock lived in ami.tasks.regroup_events and only protected the Celery-task entry point — direct callers (Deployment.save autoregroup, _insert_or_update_batch_for_sync) bypassed it. Now every call site collapses to a single in-flight run for a given deployment. Token-based release (uuid in cache value, only delete if value still matches) fixes the lock-release-clobbers-newer-owner race CodeRabbit flagged.
New run_regroup_events_job permission granted to MLDataManager (which ProjectManager inherits from). The regroup_sessions_deployment permission from the previous draft is unchanged. Migration 0085_add_regroup_permissions registers and grants both.
Admin "Sync captures" bulk action also switched to Jobs. Same shape as the regroup admin action, and now inherits the visible regroup stage from DataStorageSyncJob.

Tests: 12/12 TestImageGrouping, 4/4 TestRegroupEventsJob / TestDataStorageSyncJobIncludesRegroupStage, 4/4 TestRolePermissions. The PR description has the full test plan and what's still left to verify against production data.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

ami/main/models.py (1)

1523-1526: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Bug: len(values) returns string length, not duplicate count.

values is a newline-joined string (line 1520-1522), so len(values) gives the character count, not the number of duplicate timestamps. Use duplicate_timestamp_count which was computed on line 1518.

🐛 Proposed fix

     logger.warning(
-        f"Found {len(values)} images with the same timestamp in deployment '{deployment}'. "
+        f"Found {duplicate_timestamp_count} images with the same timestamp in deployment '{deployment}'. "
         f"Only one image will be used for each timestamp for each event."
     )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ami/main/models.py` around lines 1523 - 1526, The warning message incorrectly
uses len(values) (which is the character length of the newline-joined string)
instead of the numeric duplicate count; update the logger.warning call in the
block around logger.warning(...) to use duplicate_timestamp_count rather than
len(values) so it reads f"Found {duplicate_timestamp_count} images with the same
timestamp in deployment '{deployment}'..." and leave the rest of the message and
variables (values, deployment) unchanged.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@ami/main/models.py`:
- Around line 1523-1526: The warning message incorrectly uses len(values) (which
is the character length of the newline-joined string) instead of the numeric
duplicate count; update the logger.warning call in the block around
logger.warning(...) to use duplicate_timestamp_count rather than len(values) so
it reads f"Found {duplicate_timestamp_count} images with the same timestamp in
deployment '{deployment}'..." and leave the rest of the message and variables
(values, deployment) unchanged.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f0173c1c-89f3-46b5-93d1-7be34a87e027

📥 Commits

Reviewing files that changed from the base of the PR and between 50d618e and 76b52af.

📒 Files selected for processing (9)

ami/jobs/models.py
ami/jobs/tests/test_jobs.py
ami/main/admin.py
ami/main/api/views.py
ami/main/migrations/0085_add_regroup_permissions.py
ami/main/models.py
ami/main/tests.py
ami/tasks.py
ami/users/roles.py

✅ Files skipped from review due to trivial changes (1)

ami/tasks.py

The session_time_gap_seconds project setting was added in #918 but never used. group_images_into_events() still hardcoded a 120-minute gap. This wires the field through and exposes it in the Processing settings UI so each project can override the session boundary. Closes #906 Co-Authored-By: Claude <noreply@anthropic.com>

POST /api/v2/deployments/<pk>/regroup-sessions/ queues the existing ami.tasks.regroup_events Celery task for the deployment. The task calls group_images_into_events(), which now reads the project's session_time_gap_seconds setting. User-facing terminology is "sessions"; "Event" is retained as the internal model name for backwards compatibility. Returns 202 with task_id, deployment_id, project_id. Refs #906 Co-Authored-By: Claude <noreply@anthropic.com>

Three new TestImageGrouping cases that exercise the new session_time_gap_seconds wiring: - test_cross_midnight_bursts_split_by_short_gap: documents the user- reported pain (2 bursts with off-window > default gap, crossing midnight, split into 2 Events on different dates) and proves that bumping max_time_gap past the off-window collapses them to 1 Event. - test_same_date_bursts_merge_regardless_of_gap: documents the inverse behavior — bursts on the same calendar date collide on group_by regardless of gap setting. Acts as a regression target for any future rework of the date-keyed Event reuse path (per the #904 caveat at models.py:1516). - test_session_time_gap_seconds_is_used_when_no_explicit_gap: verifies group_images_into_events falls back to the project setting when no max_time_gap is passed, including refreshing the cached relation after the setting changes. Refs #906 Co-Authored-By: Claude <noreply@anthropic.com>

…oup task Addresses takeaway-review feedback on PR #1292: - Add `regroup_sessions_deployment` custom Project permission so non-superusers with the ProjectManager role can hit `POST /deployments/<pk>/regroup-sessions/` (mirrors `sync_deployment`). Without this, `BaseModel.check_custom_permission` computes the codename `regroup_sessions_deployment` and the perm doesn't exist, so the action 403s for everyone except superusers. - Wrap `ami.tasks.regroup_events` in a per-deployment cache lock. Concurrent enqueues (double-clicks, sync_captures auto-regroup racing with manual trigger) collapse to a single run instead of overlapping on the same rows. Lower-level than the view so all callers (Deployment.save, admin bulk action, new API action) are protected. - Defensive clamp on `session_time_gap_seconds` — fall back to the historical 120-minute default if the value is 0 / negative, with a warning log. Stops pathological "every gap = new event" behavior when admins enter bad values. - Migration 0085 adds the new perm to `Project.Meta.permissions` and grants it to existing `ProjectManager` role groups via a data migration (mirrors how 0084 revoked `delete_job`). - Tests: - test_invalid_session_time_gap_falls_back_to_default exercises 0 / -1 / -7200 - test_regroup_events_task_is_idempotent_under_concurrent_calls pre-takes the cache lock, asserts a second call skips group_images_into_events Co-Authored-By: Claude <noreply@anthropic.com>

…l sites Replaces the direct-Celery API/admin entry points with a new RegroupEventsJob so admins can see regroup progress and failures in the Jobs UI like every other long-running task. The existing DataStorageSyncJob now runs grouping as an explicit second stage instead of relying on Deployment.save() autoregroup, so a post-sync regroup failure flips the Job to FAILURE rather than silently succeeding (closes #1157, #1158). Other changes in this commit: * Move the per-deployment idempotency lock from ami.tasks.regroup_events into group_images_into_events() itself, so all four call sites (RegroupEventsJob, sync-stage regroup, Deployment.save autoregroup, the bare Celery task) collapse to a single in-flight run rather than only the Celery-task entry point being protected. * Use a uuid token on the lock value and only delete the key if it still matches our token. Fixes the lock-release-clobbers-newer-owner race CodeRabbit flagged. * Add run_regroup_events_job permission and grant it to MLDataManager (which ProjectManager inherits from). regroup_sessions_deployment stays on ProjectManager. Both wired through the renamed migration 0085_add_regroup_permissions. * Strip the UI form field for session_time_gap_seconds; field is now admin-only until we decide whether it belongs on Project or Deployment. * Swap the admin "regroup" and "sync" bulk actions to create Jobs so the admin path matches the API path. Tests: - TestImageGrouping: replaced the old task-level idempotency test with one that exercises the new in-function lock; added a token-release test that verifies an expired-then-reacquired lock isn't clobbered. - TestRegroupEventsJob (new): end-to-end Job run, stats populated, and a failure-propagates assertion. - TestDataStorageSyncJobIncludesRegroupStage (new): sync Job exposes the regroup stage and a regroup failure surfaces as Job failure. Co-Authored-By: Claude <noreply@anthropic.com>

Main has moved on to migration 0088. Rename 0085_add_regroup_permissions to 0089 and depend on 0088_detection_det_srcimg_created_idx to resolve the multiple-leaf-nodes conflict CI surfaced. Co-Authored-By: Claude <noreply@anthropic.com>

… log; add e2e command The regroup stage param `events_created` was incrementing for every event visited rather than only for events newly created by `get_or_create`. The bool returned from `get_or_create` was being discarded. Track `events_created_count` separately and use it in the stage-param write. Drive-by: `logger.warning(f"Found {len(values)} ...")` was reporting the character length of the newline-joined duplicate-timestamp report rather than the duplicate count. Pre-existing (since 0954c38, 2024-09-12) but sits inside the very block whose stats this PR now surfaces. Use `duplicate_timestamp_count`, and cap the in-log sample at 20 timestamps so very dirty deployments don't flood logs. Add `manage.py test_regroup_job_e2e {regroup|sync|concurrent}` to exercise the new Job framework path against a real stack: `regroup` runs a `RegroupEventsJob` and dumps stage params; `sync` runs a `DataStorageSyncJob` and asserts the two-stage chain; `concurrent` fires two regroups back-to-back and verifies the lock semantics. Mirrors `test_ml_job_e2e`. Used to validate PR #1292 against deployment 74 of the Antenna dev stack. Co-Authored-By: Claude <noreply@anthropic.com>

mihow · 2026-06-05T00:19:38Z

Claude says: Pushed two follow-up commits + ran live-stack E2Es. Summary:

New commits on feat/configurable-session-gap (force-pushed after rebasing onto current main to pick up migrations 0085–0088; perms migration renumbered to 0089):

184bf412 — Migration rebase: 0085_add_regroup_permissions → 0089_add_regroup_permissions so the leaf-node conflict against main's 0088 clears.
d93aa24c — events_created stage param accounting fix (was counting events visited; now counts newly-created via get_or_create's was_created return); pre-existing duplicate-timestamp len(values) log bug fixed (CodeRabbit caught it on the prior push); new manage.py test_regroup_job_e2e management command.

Live-stack E2E validation (Antenna dev box, deployment 74 PFS_Moth_Wall, 1263 captures, 69 events at 2 h baseline):

#	Scenario	Result
1	`RegroupEventsJob` direct	SUCCESS in 18 s, stage params populated (`captures_grouped=1188, events_touched=69, duplicate_timestamps=57`), Δevents=0
2	`DataStorageSyncJob` two-stage chain	SUCCESS in 20 s, both `data_storage_sync` (`total_files=857`) and `regroup_sessions` stages registered and SUCCESS — closes #1157 / #1158 against real data
3	Sync→regroup failure propagation	Covered by `test_sync_job_regroup_failure_propagates` unit test (mocks `group_images_into_events` to raise; `JobType.run()` propagates so Celery wrapper records FAILURE). Skipped stack-level run to avoid a code edit just to inject a fault — happy to add a chaos hook if you want it
4	Concurrent lock idempotency	Both Jobs SUCCESS; Job A `captures_grouped=1188`, Job B `captures_grouped=0` (short-circuit). Event count unchanged. Lock semantics held end-to-end

PR body updated to reflect all of this — new ## End-to-end validation against a real Antenna stack section, drive-by fixes called out under the Summary bullets, migration number corrected to 0089 throughout, #906 moved out of Closes (it's partially addressed; the calendar-date Event reuse limitation remains blocked on #904).

@loppear's open thread replied to, leaving resolved/unresolved for your call.

Ready for re-review.

mihow · 2026-06-05T00:44:03Z

@annavik @loppear thanks for reviewing. I decided to remove the UI changes. There is more to think about the placement and behavior when a user changes the setting. Also the setting does not act they way the user might expect. It DOES work in Luke's case, but it won't in others. I suggest we move the setting to the Station configuration and also consider simplifying the grouping algorithm. For now, this PR connects the existing function to the existing project setting for the time gap. It also wraps the session regrouping in a Job so you can see it running and when if it fails. You can now also see the logs for the session grouping stage when syncing a Station to it's data storage.

I updated the title to say "allow admins" to customize instead of "allow users" :)

The DeploymentAdmin list view ran four live per-row aggregates for each deployment displayed: - `start_date` → `SourceImage.objects.filter(event__deployment=obj).aggregate(Min)` (joins SourceImage→Event, scans the full per-deployment timeline) - `end_date` → `SourceImage.objects.filter(deployment=obj).aggregate(Max)` - `events_count` → `obj.events.count()` - `captures_count` → re-read of `data_source_total_files` (cached, but wrapped in a method that suggested it was computed) With 251 deployments on the page and ~5 expensive queries per row, the list view became unusable on stacks holding deployments with >100k captures (e.g. one deployment with 638k captures dominated end_date's wall time). Deployment already denormalizes `captures_count`, `events_count`, `data_source_total_size`, `first_capture_timestamp` and `last_capture_timestamp` via `update_calculated_fields()`. Switch the admin list to read those instead and drop the custom aggregates. Add `@admin.display(ordering=...)` so the columns remain sortable against the underlying cached fields. Wall time on the dev box (251 deployments): ~0.6s warm, ~0.76s cold. Co-Authored-By: Claude <noreply@anthropic.com>

The regroup stage params were stored with machine-style names (`captures_grouped`, `events_created`, ...), which is what the Jobs UI displays to admins. Other JobType stages already use title-cased, space-separated names (`Total files`, `Captures added`, etc.); this commit brings the regroup stage in line. Param **names** (UI labels): Captures grouped, Events created, Events touched, Empty events deleted, Duplicate timestamps, Ungrouped captures, Captures missing timestamp. Param **keys** (for retrieval, slugify(name) with `-` → `_`): captures_grouped, events_created, events_touched, empty_events_deleted (was `events_deleted_empty`), duplicate_timestamps, ungrouped_captures, captures_missing_timestamp (was `no_timestamp_captures`). Five keys are stable; two changed shape. Existing Job rows keep their historical name/key on disk — only new runs use the new labels. Centralise the list as `REGROUP_STAGE_PARAM_NAMES` in ami.jobs.models so RegroupEventsJob and DataStorageSyncJob's regroup stage share one source of truth. In `_group_images_into_events_locked`, swap the kwargs-style `update_stage(...)` call (which couldn't carry names with spaces) for an explicit `add_or_update_stage_param(stage_key, name, value)` loop driven by a label→value dict. Update tests to look up by the new keys and assert the human-readable names exist; update `test_regroup_job_e2e` to print `name [key]: value` so the stable retrieval key is visible alongside the UI label. Co-Authored-By: Claude <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (4)

ami/main/tests.py (2)

534-553: ⚡ Quick win

Assert the warning side-effect for invalid gap fallback.

Line 548 currently validates clamped behavior only. Add a log assertion so the “fallback + warning” contract is protected, not just the event count.

🔧 Proposed test hardening

         for bad_value in (0, -1, -7200):
             Event.objects.filter(deployment=self.deployment).delete()
             self.project.session_time_gap_seconds = bad_value
             self.project.save()
             self.deployment.refresh_from_db()
-            group_images_into_events(deployment=self.deployment)
+            with self.assertLogs(level="WARNING") as logs:
+                group_images_into_events(deployment=self.deployment)
             count = Event.objects.filter(deployment=self.deployment).count()
             # Default 120-min gap on this cross-midnight pattern → 2 Events,
             # NOT 12 (which is what bad_value=0 would produce without the guard).
             assert count == 2, f"expected 2 Events at gap={bad_value} (default fallback), got {count}"
+            self.assertTrue(
+                any("session_time_gap_seconds" in msg for msg in logs.output),
+                "Expected warning log for invalid session_time_gap_seconds fallback",
+            )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ami/main/tests.py` around lines 534 - 553, Update
test_invalid_session_time_gap_falls_back_to_default to also assert the warning
side-effect: use the test logging capture (e.g., pytest's caplog) while setting
project.session_time_gap_seconds to each bad_value, call
group_images_into_events(deployment=self.deployment), and then assert caplog
captured a WARNING-level message from the module that contains text indicating
the fallback (e.g., mentions "fallback", "session_time_gap_seconds", or the
bad_value) so the test verifies both the event count and that a warning was
emitted when the invalid gap was clamped.

598-605: ⚡ Quick win

Make cache cleanup unconditional in the lock-owner test.

If the assertion at Line 604 fails, Line 605 is skipped and can leak cache state into later tests. Wrap cleanup in finally.

🧹 Proposed cleanup guard

-        with _regroup_lock(self.deployment.pk) as acquired:
-            assert acquired, "Should acquire the lock on first try"
-            # Simulate the TTL expiring and a newer caller taking the lock.
-            cache.set(lock_key, "newer-owner-token", timeout=60)
-
-        # Our `finally` ran. The newer owner's lock should still be there.
-        assert cache.get(lock_key) == "newer-owner-token", "Expired-lock release must not clobber the newer owner"
-        cache.delete(lock_key)
+        try:
+            with _regroup_lock(self.deployment.pk) as acquired:
+                assert acquired, "Should acquire the lock on first try"
+                # Simulate the TTL expiring and a newer caller taking the lock.
+                cache.set(lock_key, "newer-owner-token", timeout=60)
+
+            # Our `finally` ran. The newer owner's lock should still be there.
+            assert cache.get(lock_key) == "newer-owner-token", "Expired-lock release must not clobber the newer owner"
+        finally:
+            cache.delete(lock_key)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ami/main/tests.py` around lines 598 - 605, The test currently may skip
cache.delete(lock_key) if the assertion fails; modify the test around the
_regroup_lock block so that cache.delete(lock_key) always runs by placing the
post-lock assertions and cache.delete into a try/finally (or move cache.delete
into a finally) that ensures cleanup regardless of assertion failures; reference
the existing _regroup_lock(self.deployment.pk) context, the lock_key variable,
and cache.delete to locate where to add the finally guard.

ami/jobs/management/commands/test_regroup_job_e2e.py (1)

60-75: ⚡ Quick win

Use explicit exception chaining in except clause.

Line 64 should use raise CommandError(...) from None to explicitly suppress the DoesNotExist exception from the traceback. This distinguishes a deliberate exception translation from an error during exception handling, following PEP 3134.

♻️ Proposed fix

     def _resolve_deployment(self, deployment_id: int) -> Deployment:
         try:
             deployment = Deployment.objects.get(pk=deployment_id)
         except Deployment.DoesNotExist:
-            raise CommandError(f"Deployment {deployment_id} not found")
+            raise CommandError(f"Deployment {deployment_id} not found") from None

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ami/jobs/management/commands/test_regroup_job_e2e.py` around lines 60 - 75,
In _resolve_deployment, change the exception raise to use explicit exception
chaining: when catching Deployment.DoesNotExist in the except block of
_resolve_deployment, replace the current raise CommandError(...) with raise
CommandError(f"Deployment {deployment_id} not found") from None so the original
DoesNotExist traceback is suppressed and the intent of exception translation is
clear.

ami/main/migrations/0089_add_regroup_permissions.py (1)

64-87: 💤 Low value

Consider mirroring grant logic precision in revoke.

Lines 75-87 remove both permissions from both _ProjectManager and _MLDataManager groups, but the grant function at lines 52-61 is more specific: regroup_sessions_deployment goes only to _ProjectManager, while run_regroup_events_job goes to both. Removing a permission from a group that never had it is harmless, but mirroring the grant logic would make the migration more self-documenting.

♻️ Optional refactor to match grant specificity

 def revoke_new_permissions(apps, schema_editor):
     Group = apps.get_model("auth", "Group")
     Permission = apps.get_model("auth", "Permission")
     ContentType = apps.get_model("contenttypes", "ContentType")
     GroupObjectPermission = apps.get_model("guardian", "GroupObjectPermission")

     try:
         project_ct = ContentType.objects.get(app_label="main", model="project")
     except ContentType.DoesNotExist:
         return

-    for codename in ("regroup_sessions_deployment", "run_regroup_events_job"):
-        try:
-            perm = Permission.objects.get(codename=codename, content_type=project_ct)
-        except Permission.DoesNotExist:
-            continue
-        role_groups = Group.objects.filter(Q(name__endswith="_ProjectManager") | Q(name__endswith="_MLDataManager"))
-        for group in role_groups:
-            group.permissions.remove(perm)
-        GroupObjectPermission.objects.filter(
-            permission=perm,
-            content_type=project_ct,
-            group__in=role_groups,
-        ).delete()
+    # Remove regroup_sessions_deployment from ProjectManager only
+    try:
+        regroup_perm = Permission.objects.get(codename="regroup_sessions_deployment", content_type=project_ct)
+        pm_groups = Group.objects.filter(Q(name__endswith="_ProjectManager"))
+        for group in pm_groups:
+            group.permissions.remove(regroup_perm)
+        GroupObjectPermission.objects.filter(
+            permission=regroup_perm,
+            content_type=project_ct,
+            group__in=pm_groups,
+        ).delete()
+    except Permission.DoesNotExist:
+        pass
+
+    # Remove run_regroup_events_job from both MLDataManager and ProjectManager
+    try:
+        job_perm = Permission.objects.get(codename="run_regroup_events_job", content_type=project_ct)
+        job_groups = Group.objects.filter(Q(name__endswith="_MLDataManager") | Q(name__endswith="_ProjectManager"))
+        for group in job_groups:
+            group.permissions.remove(job_perm)
+        GroupObjectPermission.objects.filter(
+            permission=job_perm,
+            content_type=project_ct,
+            group__in=job_groups,
+        ).delete()
+    except Permission.DoesNotExist:
+        pass

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ami/main/migrations/0089_add_regroup_permissions.py` around lines 64 - 87,
The revoke_new_permissions function currently removes both codenames from both
role groups; change its logic to mirror the grant logic in the migration by
mapping codenames to the exact target groups before removal: when codename ==
"regroup_sessions_deployment" only target groups ending with "_ProjectManager",
and when codename == "run_regroup_events_job" target both "_ProjectManager" and
"_MLDataManager"; use that per-codename role_groups selection for both
group.permissions.remove(perm) and the GroupObjectPermission.objects.filter(...,
group__in=role_groups).delete() calls so revoke precisely matches grant.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ami/main/api/views.py`:
- Around line 362-393: The method regroup_sessions in the view currently uses
assert deployment.project which can be disabled; replace that with an explicit
runtime check and return a proper client error: if deployment.project is None,
raise rest_framework.exceptions.ValidationError (or return Response with
status=HTTP_400_BAD_REQUEST and a clear error message) before creating the Job;
ensure the check is done right after deployment = self.get_object() and include
deployment/project identifiers in the error message for clarity.

In `@ami/main/models.py`:
- Around line 968-973: The branch that skips autoregroup (regroup_after=False)
fails to realign child objects because it never triggers
Deployment.update_children(); to fix, after calling
self.save(regroup_async=False, update_calculated_fields=False) and
self.update_calculated_fields(save=True) ensure you explicitly invoke the same
child-realignment logic (call self.update_children() or otherwise run the code
path used by Deployment.save when update_calculated_fields=True) so
Events/Occurrences/SourceImages are moved to the new project when a Deployment
is moved; locate Deployment.save, Deployment.update_calculated_fields, and
Deployment.update_children to add the call in this else branch.
- Around line 1479-1486: The current group_images_into_events function silently
returns [] when _regroup_lock(deployment.pk) is not acquired, which lets callers
(e.g., DataStorageSyncJob) treat regroup as successful and miss newly imported
captures; change this behavior so a lock miss produces a retryable outcome:
either raise a specific retryable exception (e.g., RegroupLockHeldError) or set
a "dirty" flag on the deployment and re-enqueue the job instead of returning [],
and ensure callers (DataStorageSyncJob) catch that exception or check the dirty
flag to re-schedule regroup; update references to _regroup_lock,
group_images_into_events, and any use of job.logger/logger to log a clear
retryable message when the lock is held.

---

Nitpick comments:
In `@ami/jobs/management/commands/test_regroup_job_e2e.py`:
- Around line 60-75: In _resolve_deployment, change the exception raise to use
explicit exception chaining: when catching Deployment.DoesNotExist in the except
block of _resolve_deployment, replace the current raise CommandError(...) with
raise CommandError(f"Deployment {deployment_id} not found") from None so the
original DoesNotExist traceback is suppressed and the intent of exception
translation is clear.

In `@ami/main/migrations/0089_add_regroup_permissions.py`:
- Around line 64-87: The revoke_new_permissions function currently removes both
codenames from both role groups; change its logic to mirror the grant logic in
the migration by mapping codenames to the exact target groups before removal:
when codename == "regroup_sessions_deployment" only target groups ending with
"_ProjectManager", and when codename == "run_regroup_events_job" target both
"_ProjectManager" and "_MLDataManager"; use that per-codename role_groups
selection for both group.permissions.remove(perm) and the
GroupObjectPermission.objects.filter(..., group__in=role_groups).delete() calls
so revoke precisely matches grant.

In `@ami/main/tests.py`:
- Around line 534-553: Update
test_invalid_session_time_gap_falls_back_to_default to also assert the warning
side-effect: use the test logging capture (e.g., pytest's caplog) while setting
project.session_time_gap_seconds to each bad_value, call
group_images_into_events(deployment=self.deployment), and then assert caplog
captured a WARNING-level message from the module that contains text indicating
the fallback (e.g., mentions "fallback", "session_time_gap_seconds", or the
bad_value) so the test verifies both the event count and that a warning was
emitted when the invalid gap was clamped.
- Around line 598-605: The test currently may skip cache.delete(lock_key) if the
assertion fails; modify the test around the _regroup_lock block so that
cache.delete(lock_key) always runs by placing the post-lock assertions and
cache.delete into a try/finally (or move cache.delete into a finally) that
ensures cleanup regardless of assertion failures; reference the existing
_regroup_lock(self.deployment.pk) context, the lock_key variable, and
cache.delete to locate where to add the finally guard.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7ccaf86d-d49c-4cba-a120-d0d96c841289

📥 Commits

Reviewing files that changed from the base of the PR and between 76b52af and ba7d85d.

📒 Files selected for processing (10)

ami/jobs/management/commands/test_regroup_job_e2e.py
ami/jobs/models.py
ami/jobs/tests/test_jobs.py
ami/main/admin.py
ami/main/api/views.py
ami/main/migrations/0089_add_regroup_permissions.py
ami/main/models.py
ami/main/tests.py
ami/tasks.py
ami/users/roles.py

✅ Files skipped from review due to trivial changes (1)

ami/tasks.py

🚧 Files skipped from review as they are similar to previous changes (3)

ami/users/roles.py
ami/jobs/tests/test_jobs.py
ami/jobs/models.py

… on sync path Two CodeRabbit findings on the latest revision. 1. `ami/main/api/views.py:376` used `assert deployment.project` to guard the API action. `assert` is stripped under `python -O`, so a deployment whose project FK is null (the schema allows it — `Project` FK is nullable on `Deployment`) would 500 with a confusing AttributeError later in the view instead of returning a clean 400. Replace with an explicit `api_exceptions.ValidationError` that returns 400 with a structured detail. 2. `Deployment.sync_captures()`'s new `regroup_after=False` branch (added in this PR so `DataStorageSyncJob` can drive the regroup as a separate tracked stage) bypassed `Deployment.save()`'s `update_calculated_fields= True` block, which is where `self.update_children()` realigns child `Event`/`Occurrence`/`SourceImage` `project` pointers. Result: a sync Job whose deployment had moved between projects would leave child rows with stale project FKs, breaking project-scoped queries (default filters, visibility checks, etc.). The user-triggered `RegroupEventsJob` path is unaffected because it doesn't run sync_captures — the regression is specific to the new sync→regroup chain. Add an explicit `self.update_children()` call after the `update_calculated_fields(save=True)` refresh on the regroup_after=False branch, gated on `project_id` like the corresponding block in `Deployment.save()`. Existing regroup tests still pass (4/4). Co-Authored-By: Claude <noreply@anthropic.com>

CodeRabbit raised a real concern about the sync→regroup chain: if a concurrent regroup holds the lock when DataStorageSyncJob's regroup stage runs, the lock-miss branch in `group_images_into_events` returns `[]` and the just-synced captures sit ungrouped until the next save/sync. A previous draft of this fix added a post-lock re-enqueue (`regroup_events.delay()` after the `with _regroup_lock:` block) that called `deployment_events_need_update` and chained a follow-up task. A fresh review found that approach risks cascading enqueues on busy deployments — the lock serialises work but not enqueueing, so concurrent `Deployment.save` autoregroups during a slow run can pile redundant tasks into the queue (no infinite loop, but tens of redundant runs plausible). Drop the re-enqueue. Instead, surface a clear warning on the Job itself when the sync stage's regroup is skipped: we know `total_files` from the preceding sync stage, so we can say "sync added N files but regroup was skipped" directly on the Job that an admin is looking at. The captures will be picked up by the next Deployment.save autoregroup (models.py:1110-1113), which is the existing safety net for the general case. Trade-off vs the previous draft: a sync that races with a concurrent regroup defers its new captures to the next save/sync (small window, no data loss) but does NOT silently appear to have succeeded — the warning lands in the Job log where it belongs. No new tests; existing 4 regroup/lock tests + 12 grouping tests pass (16/16). Co-Authored-By: Claude <noreply@anthropic.com>

loppear

Narrowing the scope to just using this admin Project setting is good enough for me. It seems the changes to move this into an explicit job for better logging during storage sync are motivated by existing issues, so if this helps clean that up ok!

…al-regroup requirement The previous help text just said "Time gap in seconds to consider a new session" — it didn't tell the user that changing the value doesn't regroup existing captures, which is the most common point of confusion when admins edit the field. Now points users to the Deployments admin "Regroup captures into sessions" bulk action so they can apply the change. Surfaces via Django admin form, DRF browsable API, and OPTIONS schema. Co-Authored-By: Claude <noreply@anthropic.com>

mihow · 2026-06-09T13:35:02Z

Thanks for your review @loppear. I plan to do some merging & deploying in the afternoon today.

Four fixes surfaced by an independent takeaway review: - Count actual SourceImage rows assigned to an event for the "Captures grouped" stat instead of the distinct-timestamp count. Two captures can share a timestamp, so the old count understated the work done. - Lower the regroup lock TTL and the matching Celery time limits to 10 min soft / 11 min hard. group_images_into_events is mostly batch SQL and finishes in seconds, so a tight limit bounds the worst-case stuck-lock window when a worker dies hard (OOM, SIGKILL, eviction) and the finally block never runs to release the lock. - Document that RegroupEventsJob and the bare regroup_events task only group; propagating project_id to children stays on Deployment.save(). - Log a clear "regroup skipped" message when the per-deployment lock is already held, instead of a misleading "now has 0 events". Co-Authored-By: Claude <noreply@anthropic.com>

Main now has 0089_add_regroup_permissions and 0090_session_time_gap_seconds_help_text (PR #1292), which made the branch's 0089/0091 migrations a second leaf in the migration graph and failed CI. Renumber to 0091/0092 on top of main's 0090. Co-Authored-By: Claude <noreply@anthropic.com>

* Model and API view for SourceImageThumbnail * Thumbnails: add field to captures serializer * With internal prefix * black postcommit * black? * black settings * Better handling for thumbnail invalidation and deletion. * Thumbnail lookup by label and check for changed settings width * Move thumbnail generation to method on SourceImage * Thumbnails: handle uploaded source images (without datasource), 302 redirects, and add index for (source_image, label). * merge migrations * UI capture data: add and use thumbnail property accessors * Add thumbnails to api responses for events with captures * revert UI capture.src property, only use thumbnails explicitly * fix: prioritize API response size before natural size * Merge migrations * Rebase thumbnail migrations * ami.utils.media.fetch_image_content for thumbnail image source * fix(thumbnails): convert non-RGB source images to RGB before JPEG encode JPEG natively supports only L, RGB, and CMYK modes. Source images in RGBA/P/LA/PA modes (common for uploaded PNGs) raised "OSError: cannot write mode <X> as JPEG" on `img.save(format="JPEG")`, 500-ing the thumbnail endpoint. Convert to RGB first when needed. Adds a regression test that feeds an RGBA PNG through find_or_generate and asserts the redirect succeeds and the thumbnail row lands. Closes review thread on ami/main/models.py L2241. Co-Authored-By: Claude <noreply@anthropic.com> * fix(thumbnails): handle SourceImage.last_modified=None on regen check Two changes, one root cause. `SourceImage.last_modified` tracks the source bytes' mtime (S3 LastModified on the sync path) and is the freshness signal the thumbnail regen check uses to decide whether to invalidate a cached thumbnail. It is NOT the same as `updated_at` (BaseModel auto_now), which bumps on every row save (cached counts, detection counts, etc.) and would force regen far too often. The upload path (`create_source_image_from_upload`) never set this field, so: 1. Uploaded captures had `last_modified=None` forever, breaking parity with sync-created rows. 2. The thumbnail regen check `thumb.last_modified < self.last_modified` raised `TypeError` on the 2nd+ request for any uploaded capture's thumbnail (datetime < None). Fixes: * `create_source_image_from_upload`: set `last_modified=timezone.now()` so uploaded captures have a real source mtime, matching the sync path's behaviour of copying the S3 LastModified header. * `find_or_generate_thumbnail_for_label`: guard the comparison. None on either side now means "no signal of source change, trust the cached thumb" instead of crashing. Covers legacy rows synced before the upload backfill. Tests: * `test_thumbnail_source_last_modified_none_does_not_crash` reproduces the TypeError class of bug by nulling `last_modified` after a thumbnail row exists and asserts the 2nd request reuses the cache. * `test_upload_handler_backfills_last_modified` covers the upload-path backfill (asserts the new field falls inside [before, after] of the upload call). Closes review thread on ami/main/models.py L2227. Co-Authored-By: Claude <noreply@anthropic.com> * fix(thumbnails): drop no-op f-strings; return 405 on list Two F541 (Ruff: f-string without placeholders) hits, plus a small behavioural cleanup on the list action. * `SourceImageThumbnailViewSet.list` now raises `MethodNotAllowed` (405) instead of `NotFound` (404). The route exists for `/captures/thumbnails/<pk>/?label=...` only; listing has no defined semantics. 405 sets the Allow header and is the correct REST response. * Updated `test_thumbnail_no_list` to expect 405 and to use a plain string literal (the previous test used an f-string with no placeholders). Closes review thread on ami/main/tests.py L237 (Ruff F541). Co-Authored-By: Claude <noreply@anthropic.com> * fix(thumbnails): guard empty THUMBNAILS['SIZES'] config When `settings.THUMBNAILS['SIZES']` is empty (misconfiguration), the default-label fallback `next(iter(_sizes))` raised `StopIteration` and surfaced as a 500 with no useful detail. Replace with an explicit empty check that returns 404 with a clear API error message, and use `next(iter(_sizes))` only after we know the dict is non-empty (also switch the fallback expression to `or next(...)` since the previous form passed `next(iter(...))` as the default arg unconditionally — fine for non-empty dicts but still worth tidying alongside the guard). Adds a test that overrides THUMBNAILS with an empty SIZES dict and asserts the endpoint returns 404 with the configured-error message. Co-Authored-By: Claude <noreply@anthropic.com> * fix(media): add finite default timeout to fetch_image_content `requests.get(url)` with no timeout blocks indefinitely if the upstream storage is slow or unreachable. The thumbnail-generation path (`SourceImage.find_or_generate_thumbnail_for_label`) runs this synchronously inside a Django web request, so a stuck upstream ties up a gunicorn worker; a gallery's worth of stuck calls drains the pool. Default timeout is `(connect=5s, read=30s)`. Callers that need different limits (e.g. very large originals on a slow link) can override by passing `timeout=`. Behaviour for fast / responsive upstreams is unchanged. Tests verify the default is finite and that explicit overrides are honored. Co-Authored-By: Claude <noreply@anthropic.com> * fix(thumbnails): use upsert + drop pre-save delete-loop to fix concurrent-gen race Two related bugs in `find_or_generate_thumbnail_for_label`: 1. Under concurrent gen for the same (source_image, label) pair the loser could either (a) hit the new UniqueConstraint and 500 with IntegrityError when its delete-loop ran before the winner's INSERT, or (b) destroy the winner's row + storage blob when the delete-loop ran after. Either outcome is wrong; both were possible because the code did `filter(label=label).delete()` then `thumbnails.create(...)` instead of a single atomic upsert. 2. The pre-save delete-loop also deleted the storage blob BEFORE the new `default_storage.save()`, which on storage backends that suffix colliding keys (FileSystemStorage default) left the row pointing at a path that the next save could overwrite in the same way the racer would. Belt-and-braces failure where the cleanup order itself was the bug. Fix: * Replace the delete-then-create with `update_or_create` keyed on `label`. Django serializes via the UniqueConstraint and, on collision, catches IntegrityError on INSERT and re-runs as UPDATE. No 500s, no destructive deletes of concurrent rows. * Snapshot the prior `path` before the new save and clean it up after the row update only when the backend gave us a new key (`prior_path != thumbnail_path`). Backends that overwrite in place (boto3 with `AWS_S3_FILE_OVERWRITE=True`) skip the cleanup and avoid the file-deletion race entirely. * Force-bump `thumb.last_modified` on the UPDATE branch via `QuerySet.update` because `auto_now_add=True` only fires on INSERT — without this, regen via source-mtime bump would re-trigger regen on every subsequent request as the field's pre_save callback preserved the existing (now-stale) value. Tests: * `test_thumbnail_regen_reuses_row_via_upsert`: delete the storage blob out from under a cached row, re-request, assert the row's PK is preserved (proving the UPDATE path was taken, not a fresh INSERT). * `test_thumbnail_exists_newer_modified_source` updated to assert row reuse + freshness bump rather than the previous (incidental) PK change. The `SourceImageThumbnail.last_modified` field semantics are still muddled (see PR thread; rename or drop tracked separately). Co-Authored-By: Claude <noreply@anthropic.com> * fix(thumbnails): CASCADE source_image FK and clean storage blob on row delete Thumbnails are pure derivatives of (source_image, label). The previous `on_delete=SET_NULL, null=True` left orphan rows AND their storage blobs behind forever whenever the parent SourceImage was deleted, with no other reaper in the codebase. Silent storage growth and orphan-row accumulation; CodeRabbit flagged this in PR #1306 and the thread was closed without a fix. Changes: * `SourceImageThumbnail.source_image` → `on_delete=CASCADE`, no longer nullable. Deleting a SourceImage now takes its thumbnail rows with it. * New `pre_delete` signal in `ami.main.signals` does best-effort `default_storage.delete(instance.path)` so the blob doesn't outlive the row. Fires on explicit row deletes AND CASCADE deletes (signals fire for both), so this covers parent SourceImage deletion automatically. Storage failures are logged and swallowed — a transient backing-store error must not block the delete chain. * Migration 0090 is two-step: reap pre-existing rows where `source_image_id IS NULL` (the SET_NULL aftermath) first, then AlterField the FK. Orphan-row reap runs through the same pre_delete signal so their blobs are cleaned alongside. Tests: * `test_source_image_delete_cascades_to_thumbnails` — create 2 thumbnail rows for a capture, delete the capture, assert both rows gone via CASCADE. * `test_thumbnail_delete_removes_storage_blob` — plant a real blob in storage, wire a row to it, delete the row, assert the blob is gone (proving the pre_delete signal fired and cleaned up). Co-Authored-By: Claude <noreply@anthropic.com> * chore: black formatting * chore(ui): prettier strip trailing whitespace in capture.ts * docs(sourceimage): help_text for storage-derived fields; admin read-only `size`, `last_modified`, `checksum`, `checksum_algorithm` all record metadata read from the source image file in storage during sync or upload, not row-level mtime. Document that in help_text so the field's intent shows up in the admin UI and in API schema output (the four fields previously had no help_text at all). Mark the cluster read-only in the SourceImage admin so operators can't clobber the cached metadata by mistake through the change form. Left writable on the API so a project owner can still create SourceImage entries by hand via an API client when the sync/upload flows don't fit; the sync endpoint at POST /api/v2/deployments/<pk>/sync/ remains the default population path. Adds migration 0091 for the help_text changes (no schema change beyond the AlterField metadata). Co-Authored-By: Claude <noreply@anthropic.com> * chore(migrations): collapse 0089 SET_NULL + 0090 CASCADE into single 0089 0089 was renamed-into-existence this branch (originally 0088, bumped when main added 0088_detection_det_srcimg_created_idx). The follow-up 0090 only existed to switch the just-created FK from SET_NULL to CASCADE — same model, same concern, accidentally split because 0089 shipped wrong on the first cut. This commit: * Rewrites 0089 to create the FK with CASCADE from the start (drops null=True and the SET_NULL on_delete). The RunPython orphan-reaper from 0090 is gone — a fresh table can't have orphans. * Deletes 0090. * Updates 0091's dependency to point at 0089 (was 0090). End DB state is identical to the old 0089+0090 pair, so anyone who already migrated past 0090 just pulls — Django sees 0089 as already-applied (same name) and skips. 0090 will appear as an applied migration with no matching file; warning only, no breakage. Local devs who want a clean django_migrations table can manually DELETE FROM django_migrations WHERE app='main' AND name='0090_sourceimagethumbnail_cascade_source_image'. Safe because this migration only landed on the in-flight feature branch — no production state to preserve. CI re-runs from scratch each time and passed (TestImageThumbnailViews 18/18 with --noinput on a fresh DB). Co-Authored-By: Claude <noreply@anthropic.com> * chore(migrations): drop over-defensive comment from collapsed 0089 The collapse rationale is unnecessary in-tree — the branch was never merged, so no production state ever ran the SET_NULL form. Anyone mid-rebase on this branch will just see the new migration replace the old one cleanly. Co-Authored-By: Claude <noreply@anthropic.com> * fix(ui): session list uses capture thumbnails instead of full-size originals The sessions list and gallery rendered example captures from capture.url — the full-resolution original (often 9000px+ wide, fetched straight from object storage). The events API has provided a thumbnails dict on example_captures since this branch; use the medium (1024px) thumbnail with a fallback to the original when thumbnails are absent, matching the pattern in capture.ts. The medium size also keeps the og:image meta tag on session details at a reasonable resolution. Co-Authored-By: Claude <noreply@anthropic.com> * fix(migrations): renumber thumbnail migrations after main merge Main now has 0089_add_regroup_permissions and 0090_session_time_gap_seconds_help_text (PR #1292), which made the branch's 0089/0091 migrations a second leaf in the migration graph and failed CI. Renumber to 0091/0092 on top of main's 0090. Co-Authored-By: Claude <noreply@anthropic.com> * fix(tests): use real taxon pk in role-permission test data TestRolePermissions hardcoded taxon_id="5" in the identification create payload. Any test class that runs earlier and creates taxa advances the id sequence, so pk 5 stops existing and the POST fails validation with 400 before the permission check returns 403. The new TestImageThumbnailViews class in this branch sorts alphabetically before TestRolePermissions and triggered exactly that. Use the occurrence's own determination pk instead. Co-Authored-By: Claude <noreply@anthropic.com> * fix(thumbnails): record requested spec width on thumbnail rows PIL.Image.thumbnail() preserves aspect ratio and can emit a width 1-2px below the requested spec (239 for a 240 spec was observed in a dev deployment). The regen gate compares the stored width against the spec with strict equality, so storing the encoder output made affected rows regenerate on every request and download the original each time. Store the requested spec width instead — the row identifies which configured size it satisfies, matching how derivative-image systems key their caches (sorl-thumbnail/easy-thumbnails hash the requested geometry, WordPress keys by registered size name). The actual encoded height is still stored; it is informational and never compared. Existing rows that hold an encoder-output width self-heal with a single regeneration on their next request. Co-Authored-By: Claude <noreply@anthropic.com> * perf(thumbnails): trust the cached row without a storage HEAD per request The regen gate called default_storage.exists() on every warm hit, which is a HEAD request against the object store (or a stat on filesystem backends) per thumbnail per page view. The cached row is already the warm signal; the existence check only protected against blobs deleted out of band, which currently has no code path. Orphaned rows now show a broken image until the row is removed and the next request regenerates. Co-Authored-By: Claude <noreply@anthropic.com> * feat(thumbnails): encode thumbnails as progressive JPEGs Progressive encoding lets browsers paint a blurry-to-crisp preview as bytes arrive instead of filling in top-to-bottom. optimize adds an extra Huffman pass for slightly smaller files and quality=82 raises Pillow's default of 75 for screen-sized thumbnails. All three flags only cost time on the cold generation path. Co-Authored-By: Claude <noreply@anthropic.com> * perf(thumbnails): let browsers cache the thumbnail redirect 302 responses are not cached by browsers unless explicitly allowed, so every page view re-paid a full Django round trip (auth, object lookup, freshness check) per thumbnail, even when nothing changed. A five-minute private Cache-Control covers typical back-and-forth browsing within a session while staying well below the shortest presigned-URL lifetime (django-storages AWS_QUERYSTRING_EXPIRE defaults to 3600s), so a cached redirect can never point at an expired signature. Co-Authored-By: Claude <noreply@anthropic.com> * fix(thumbnails): regenerate when the cached row has no stored path A row whose path is empty (failed or interrupted generation) passed the regen gate and redirected the browser to the storage root. Treat a blank path as cold and regenerate. Co-Authored-By: Claude <noreply@anthropic.com> * refactor(thumbnails): trim review-narration comments to load-bearing facts Several comment blocks added during review explained why a change was correct relative to code that no longer exists. Keep only the constraints a future reader needs (spec-width vs encoder output, auto_now_add bump, presign-lifetime bound); drop the change-justification prose, which lives in the commit history. Co-Authored-By: Claude <noreply@anthropic.com> * refactor(thumbnails): trim remaining narration comments found in full sweep Co-Authored-By: Claude <noreply@anthropic.com> --------- Co-authored-by: Anna Viklund <annamariaviklund@gmail.com> Co-authored-by: Michael Bunsen <notbot@gmail.com> Co-authored-by: Claude <noreply@anthropic.com>

mihow changed the title ~~feat(projects): wire session_time_gap_seconds into event grouping~~ Allow users to customize the time gap between sessions May 5, 2026

mihow marked this pull request as ready for review May 6, 2026 12:51

Copilot AI review requested due to automatic review settings May 6, 2026 12:51

Copilot started reviewing on behalf of mihow May 6, 2026 12:52 View session

coderabbitai Bot reviewed May 6, 2026

View reviewed changes

Comment thread ami/main/api/views.py

Comment thread ui/src/pages/project-details/processing-form.tsx Outdated

Copilot AI reviewed May 6, 2026

View reviewed changes

Comment thread ami/main/api/views.py

Comment thread ui/src/pages/project-details/processing-form.tsx Outdated

Comment thread ami/main/models.py Outdated

coderabbitai Bot reviewed May 7, 2026

View reviewed changes

Comment thread ami/tasks.py Outdated

mihow requested a review from annavik May 7, 2026 06:24

annavik reviewed May 7, 2026

View reviewed changes

loppear reviewed May 8, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 4, 2026

View reviewed changes

mihow and others added 7 commits June 4, 2026 17:10

mihow force-pushed the feat/configurable-session-gap branch from 72dfbeb to d93aa24 Compare June 5, 2026 00:16

mihow changed the title ~~Allow users to customize the time gap between sessions~~ Allow admins to customize the time gap between sessions Jun 5, 2026

mihow and others added 2 commits June 4, 2026 18:23

coderabbitai Bot reviewed Jun 5, 2026

View reviewed changes

Comment thread ami/main/api/views.py

Comment thread ami/main/models.py

Comment thread ami/main/models.py

mihow and others added 2 commits June 5, 2026 10:14

loppear approved these changes Jun 8, 2026

View reviewed changes

mihow force-pushed the feat/configurable-session-gap branch from 8b5c72c to c56a03b Compare June 9, 2026 23:15

mihow merged commit 13203e4 into main Jun 9, 2026
7 checks passed

mihow deleted the feat/configurable-session-gap branch June 9, 2026 23:36

Conversation

mihow commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Use case this fixes

API usage

Request

Response

Error responses

Caveats — what this PR does and doesn't change

Production data — 5-setting cycle on a real deployment

Why tightening doesn't always change Event count

Tests added in this PR

End-to-end validation against a real Antenna stack

Headless validation pass (post-takeaway review)

Post-review hardening (commit c56a03b)

What this PR proves on its own

Related PRs and issues

Test plan

Notes / follow-ups

Screenshots

Co-author

Summary by CodeRabbit

Uh oh!

netlify Bot commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for antenna-preview canceled.

Uh oh!

netlify Bot commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for antenna-ssec canceled.

Uh oh!

coderabbitai Bot commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mihow commented May 7, 2026

Uh oh!

annavik left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

loppear left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

loppear May 8, 2026

Choose a reason for hiding this comment

Uh oh!

mihow Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

mihow Jun 5, 2026

Choose a reason for hiding this comment

mihow commented May 5, 2026 •

edited

Loading

Post-review hardening (commit `c56a03b`)

netlify Bot commented May 5, 2026 •

edited

Loading

netlify Bot commented May 5, 2026 •

edited

Loading

coderabbitai Bot commented May 5, 2026 •

edited

Loading

annavik left a comment •

edited

Loading

loppear left a comment •

edited

Loading

mihow commented Jun 5, 2026 •

edited

Loading