Skip to content

Allow admins to customize the time gap between sessions#1292

Merged
mihow merged 13 commits into
mainfrom
feat/configurable-session-gap
Jun 9, 2026
Merged

Allow admins to customize the time gap between sessions#1292
mihow merged 13 commits into
mainfrom
feat/configurable-session-gap

Conversation

@mihow

@mihow mihow commented May 5, 2026

Copy link
Copy Markdown
Collaborator

Closes #1157. Closes #1158. Partially addresses #906 (full fix tracked in #904 — see "What this PR proves on its own").

Summary

The session_time_gap_seconds project setting was added in #918 but was never read by anything — group_images_into_events() still hardcoded a 120-minute gap, so changing the field had no effect. This PR wires the setting through and turns the regroup action into a proper Job so admins can actually see what's happening.

What changes in this PR:

  • group_images_into_events() now reads deployment.project.session_time_gap_seconds when no explicit gap is passed, with a clamp on invalid values (None/0/negative → 120 min default, logged).
  • A new RegroupEventsJob JobType — admins / MLDataManager / ProjectManager can trigger regrouping from the API, the admin, or programmatically and see progress, logs, and stage params in the Jobs UI.
  • DataStorageSyncJob now runs grouping as an explicit second stage. Previously the regroup happened silently inside Deployment.save() — if it raised, the sync Job recorded SUCCESS anyway. Closes Expose session regrouping as a job stage when syncing images #1157 and Expose session regrouping as a job stage with stats #1158.
  • Per-deployment idempotency lock around group_images_into_events() itself (token-based uuid release, so a longer-than-TTL run never clobbers a newer owner). All four call sites — the new Job, the sync-stage regroup, Deployment.save autoregroup, and the bare Celery task — collapse to a single in-flight run.
  • New POST /api/v2/deployments/<pk>/regroup-sessions/ action that creates a RegroupEventsJob. Mirrors the existing sync action shape (returns job_id, not a Celery task id).
  • Two new custom permissions: regroup_sessions_deployment (gates the API action; granted to ProjectManager) and run_regroup_events_job (gates create/run/cancel on the new JobType; granted to MLDataManager, which ProjectManager inherits from). Migration 0089_add_regroup_permissions registers both and grants them to existing role groups.
  • Django admin "Regroup captures into sessions" and "Sync captures" bulk actions now create Jobs instead of bare Celery tasks, so the admin path matches the API path.

Drive-by fixes in the same area:

  • events_created stage param was counting events visited rather than events newly created (the bool from get_or_create was being discarded). Fixed to use the was_created return.
  • The pre-existing duplicate-timestamp warning logged len(values) (string length of the newline-joined report) instead of the duplicate count. Pre-existing since Storage and retrieval of project settings #918's neighbouring code (2024-09-12) but sits inside the very block whose stats this PR now surfaces, so fixed here rather than left to drift.

What is intentionally not included:

  • No UI form field for the setting. The earlier draft of this PR added a numeric input on the Processing settings page; that's been pulled. The setting stays admin-only until we decide whether it belongs on Project or Deployment, and whether the unit should change from seconds to minutes/hours. See "Follow-ups" below.
  • No new code paths for unauthenticated or non-member users. Permissions match the existing sync_deployment shape.

One DB migration (0089_add_regroup_permissions, rebased onto main's 0088). The session_time_gap_seconds field itself already existed with default 7200 seconds (= 2 hours = 120 minutes), matching the previous hardcoded behavior.

Use case this fixes

A user reported that monitoring schedules with multi-burst nights (e.g. 2 hours on, 2 hours off, 2 hours on) sometimes group into one Event and sometimes into two, depending on whether the off-window crosses midnight. With this PR, raising the gap setting past the longest off-window collapses cross-midnight bursts into a single Event consistently.

A second pain came out of a recent workshop: sync jobs would complete successfully, but if the post-sync regroup raised, there was no Job-level signal. By the time anyone noticed the captures had no sessions, it took grepping Celery logs to find the failure. The new regroup stage on DataStorageSyncJob surfaces those failures directly in the Jobs UI — a sync-then-regroup failure now flips the Job to FAILURE.

API usage

Request

curl -X POST \
  -H "Authorization: Token <your-token>" \
  -H "Content-Type: application/json" \
  https://your-antenna-host/api/v2/deployments/<deployment_pk>/regroup-sessions/

No request body required.

Response

202 Accepted:

{
  "job_id": 12345,
  "deployment_id": 74,
  "project_id": 24
}

The response returns immediately; regrouping runs as a RegroupEventsJob in the background. Track progress at GET /api/v2/jobs/<job_id>/ or in the Jobs UI — you'll see a "Regroup sessions" stage with events_created, events_touched, events_deleted_empty, captures_grouped, duplicate_timestamps, ungrouped_captures, and no_timestamp_captures stage params on completion.

If a regroup is already in progress for the same deployment, subsequent enqueues still return 202 with a fresh job_id, but the Job's regroup stage will short-circuit at the worker (logged as a warning on the Job) and return early without touching the DB. The lock is keyed on the deployment, not the Job, so duplicate Jobs are visible in the Jobs UI but cannot run overlapping regroups on the same rows.

Error responses

  • 404 Not Found — deployment pk does not exist or is not visible to the requesting user.
  • 401 Unauthorized / 403 Forbidden — token missing or user lacks the regroup_sessions_deployment permission on the deployment's project (granted to the ProjectManager role and superusers).

Caveats — what this PR does and doesn't change

The setting is now wired through, but Event keying still collapses some timestamp groups. Below are the scenarios actually tested and what they show.

Production data — 5-setting cycle on a real deployment

Cycled session_time_gap_seconds through five values via the new endpoint, on a deployment with ~1.2 k captures currently grouped into 69 Events at the default 2 h gap. Each Job completed in 14–22 s. (Run was against the original regroup_events_task.delay() path before the Job framework wrapping; the same numbers re-verified via the new path — see "End-to-end validation" below.)

Setting Observed events Change vs default
30 min 69 none
1 hour 69 none
2 hour (default) 69
4 hour 69 none
24 hour 60 merged 9 cross-midnight nights

Reading: tightening below ~24 h produced no observable Event-count change on this deployment. Loosening to 24 h merged 9 cross-midnight nights (−13 %).

Why tightening doesn't always change Event count

group_images_into_events() keys Events by group_by = start_date.date() and uses Event.objects.get_or_create(deployment, group_by). Two timestamp groups produced by a tighter gap that start on the same calendar date collide back into a single Event. So the gap setting only changes Event count when:

  1. The off-window between bursts crosses midnight, and
  2. The new gap value is longer than that off-window (so the bursts merge into one timestamp group whose start date stays on the earlier calendar day).

The existing comment in models.py already flags this: "#904 is expected to rework this reuse path more thoroughly." A separate cleaner fix — keying Events by a noon-anchored "monitoring night" instead of calendar date — is parked as a follow-up.

Tests added in this PR

ami/main/tests.py::TestImageGrouping:

Test Pattern Setting Result
test_cross_midnight_bursts_split_by_short_gap Bursts at 22:00 day N and 01:00 day N+1, off-window 2 h 35 min 2 h gap 2 Events (split — user's pain)
↑ same ↑ same 4 h gap 1 Event (this PR's fix)
test_same_date_bursts_merge_regardless_of_gap Two bursts at 20:00 and 22:00 same calendar date, off-window 1 h 35 min 1 h gap 1 Event in DB (date collision masks gap — known limitation)
test_session_time_gap_seconds_is_used_when_no_explicit_gap Cross-midnight bursts project setting 7200 s 2 Events
↑ same ↑ same project setting 14400 s 1 Event
test_invalid_session_time_gap_falls_back_to_default Cross-midnight bursts project setting 0 / -1 / -7200 2 Events (default-gap split, not 12)
test_regroup_is_idempotent_under_concurrent_calls Lock pre-taken before second call Second call returns empty list; no Events created
test_regroup_lock_release_does_not_clobber_a_newer_owner Newer owner takes lock while ours is held Newer owner's lock survives our finally block

ami/jobs/tests/test_jobs.py:

Test What it covers
TestRegroupEventsJob.test_regroup_job_runs_end_to_end_and_reports_stats Job runs to SUCCESS, stage params (events_created, captures_grouped, etc.) populated, real Events exist in DB
TestRegroupEventsJob.test_regroup_job_failure_propagates A group_images_into_events exception propagates from JobType.run() so the Celery wrapper records FAILURE
TestDataStorageSyncJobIncludesRegroupStage.test_sync_job_adds_regroup_stage Sync Job exposes a regroup_sessions stage with stats after sync completes
TestDataStorageSyncJobIncludesRegroupStage.test_sync_job_regroup_failure_propagates A regroup failure during a sync Job propagates so the sync Job ends FAILURE rather than silently succeeding (fixes the workshop pain in #1157)

End-to-end validation against a real Antenna stack

Run via the new manage.py test_regroup_job_e2e {regroup|sync|concurrent} command (added in this PR; mirrors test_ml_job_e2e). Stack was the dev box's main Antenna compose stack, bind-mounted onto this branch's worktree. Target deployment: PFS_Moth_Wall (id 74, project 24, 1263 captures, 69 events at the 2 h default).

# Scenario Command Result
1 RegroupEventsJob on real captures test_regroup_job_e2e regroup --deployment 74 Job SUCCESS in 18 s. Stage params: captures_grouped=1188, events_created=0, events_touched=69, events_deleted_empty=0, duplicate_timestamps=57, ungrouped_captures=0, no_timestamp_captures=0. Event count stable (Δ 0). Reproduces the 2 h-baseline row of the production-cycle table via the new code path.
2 DataStorageSyncJob two-stage chain test_regroup_job_e2e sync --deployment 74 Job SUCCESS in 20 s. Two stages registered: data_storage_sync (total_files=857, failed=0) and regroup_sessions with same regroup stats as #1. Closes #1157 / #1158.
3 Sync→regroup failure propagation (covered by unit test) test_sync_job_regroup_failure_propagates mocks group_images_into_events to raise; the RuntimeError propagates from JobType.run() so the Celery wrapper records FAILURE. The stack-level run would re-exercise the same propagation path. Skipped to avoid a code edit just to inject a fault; happy to add a chaos hook if reviewers want it.
4 Concurrent lock idempotency test_regroup_job_e2e concurrent --deployment 74 Both Jobs reached SUCCESS in 18 s. Job A: captures_grouped=1188 (real work). Job B: captures_grouped=0 (short-circuit; logged the lock-warning per the body). Event count unchanged from the single-run baseline. Lock semantics held end-to-end.

The 5-setting production cycle (30 min / 1 h / 2 h / 4 h / 24 h) in the "Caveats" table was originally run against the pre-Job-framework regroup_events_task.delay() path. Re-running it through RegroupEventsJob would produce identical Event counts (same underlying group_images_into_events call) and was not re-done end-to-end; the single-setting regroup run above is the live-stack proof that the Job-framework path drives that function correctly.

Headless validation pass (post-takeaway review)

A second pass exercised the API permission paths, setting-change UX, admin bulk action edge cases, and the lock-miss log surface on the same dev stack — these are the paths the unit tests stub out or the original E2E command doesn't cover. All against deployment 74.

# Scenario How Result
5 Superuser POST /regroup-sessions/ happy path curl -X POST with superuser token 202 {"job_id": 1587, "deployment_id": 74, "project_id": 24}. Job ran to SUCCESS in 19 s. Stage params surfaced with both human-readable name (e.g. "Captures grouped") and slugified key (captures_grouped) for UI lookup.
6 Non-member POST curl with token from a user not in project 24 403 "You do not have permission to perform this action." (object-level perm gate via check_custom_permission(action="regroup_sessions") → regroup_sessions_deployment).
7 POST against non-existent deployment curl with superuser token, pk = 99999999 404 "Not found."
8 Unauthenticated POST curl with no auth header 401 "Authentication credentials were not provided."
9 Setting change does not auto-enqueue regroup Save Project.session_time_gap_seconds 7200 → 14400 in shell, then count new RegroupEventsJobs and check Event count RegroupEventsJob delta = 0. Event count for project = 294 → 294. Matches the body's documented behavior (admin must trigger manually).
10 Admin bulk action with a project-less deployment in the selection Build queryset [74, orphan] (orphan = Deployment(project=None)), invoke DeploymentAdmin.regroup_events Message: Queued RegroupEventsJob for 1 deployments: [1594] — skipped: #297 ... (no project). Orphan skipped cleanly, valid one enqueued.
11 Lock-miss warning surfaces in the Job log 2 sequential POST /regroup-sessions/ calls < 1 s apart Job A 1591: real work, 18 s. Jobs B 1592 + C 1593: started 9 s into A's run, both finished in 12 ms with captures_grouped=0. Job B/C .logs.stdout contains WARNING group_images_into_events skipped for deployment 74: another regroup is in progress. — an admin watching the Job sees the gap directly.

Not run on this stack (need fixtures not available locally):

# Scenario Fixture needed
12 Cancel mid-run Deployment with ~50 k+ SourceImages so the regroup runs ≥ 30 s — long enough to hit the cancel API before it finishes. Exercises Celery cancel + the finally-block lock release, not this PR's new code.
13 Concurrent at scale Same ≥ 50 k-capture fixture. Lock cost is constant per call, so the mechanism shouldn't differ from #11 — just confirms wallclock under contention.
14 Admin list at prod-shape size Synthetic ≥ 1 000 Deployment rows with cached count fields populated. With the new denorm-field admin, projection is roughly 2–3 s warm at 1 000 rows (vs 0.6 s on 251 today); linear in row count, not capture count.

Post-review hardening (commit c56a03b)

An independent takeaway review surfaced four issues, all addressed in c56a03bb:

  1. "Captures grouped" stat now counts the SourceImage rows actually assigned to an Event, rather than the number of distinct timestamps. When two captures share a timestamp (tracked separately as duplicate_timestamps), both get assigned to the same Event, so the old distinct-timestamp count understated the captures grouped. The E2E table figures above were produced before this change and will read slightly higher under the corrected count.
  2. Lock TTL and the matching Celery time limits dropped to 10 min soft / 11 min hard (was 1 h). Regroup finishes in seconds, so a short TTL caps how long the lock stays held if a worker is killed before its finally block releases it.
  3. Documented the grouping-only scope of RegroupEventsJob and the bare regroup_events task: propagating a moved deployment's project_id to its children stays on Deployment.save() via update_children(), not on the regroup path.
  4. Lock-miss now logs a clear "regroup skipped" message instead of a misleading "now has 0 events", so an operator reading the worker log can tell a short-circuited duplicate run from a genuine reset.

What this PR proves on its own

Related PRs and issues

Test plan

  • yarn build
  • docker compose build django ui
  • Pre-commit hooks
  • manage.py test ami.main.tests.TestImageGrouping — 12/12 pass
  • manage.py test ami.jobs.tests.test_jobs.TestRegroupEventsJob ami.jobs.tests.test_jobs.TestDataStorageSyncJobIncludesRegroupStage — 4/4 pass
  • manage.py test ami.main.tests.TestRolePermissions — 4/4 pass (no regression from the two new permissions)
  • Functional verification end-to-end against the live Antenna dev stack: test_regroup_job_e2e regroup, sync, and concurrent against deployment 74 — see "End-to-end validation" table above.
  • Headless API + admin verification on the same stack: superuser POST, non-member POST, bad-pk POST, unauthenticated POST, setting-change does-not-enqueue check, admin bulk action with project-less deployment in selection, lock-miss warning visible in Job log — see "Headless validation pass" rows Bump docker/build-push-action from 2 to 3 #5Bump @types/node from 18.11.9 to 18.11.12 in /frontend #11.

Notes / follow-ups

  • The setting lives on Project because that's where it was added in Storage and retrieval of project settings #918. Moving it to Deployment is probably the right long-term shape — different monitoring stations on the same project can run different burst schedules. A separate PR with a data migration.
  • Surface a "Regroup sessions" button on the Stations list / project settings — separate UI follow-up. When that lands, the form input should be minutes or hours, not seconds (per @loppear's review comment on the original draft).
  • deployment_events_need_update() does not detect a setting change as a reason to regroup, so admins must hit the new API action (or the Django admin bulk action) after editing the gap (also per @loppear's review comment — "expect this setting to apply"). The field's help_text now explicitly points users to the Deployments admin "Regroup captures into sessions" bulk action so the manual-regroup step is discoverable from the form itself. Auto-detection on setting change remains a possible follow-up.
  • A project-level "Regroup all deployments" admin action (mirroring the existing per-deployment bulk action one level up) would save the click-through-to-Deployments step. Out of scope here; the multi-select bulk action already covers the use case once the user finds it.

Screenshots

image image image image

Co-author

Co-Authored-By: Claude noreply@anthropic.com

Summary by CodeRabbit

  • New Features

    • Regroup sessions endpoint for deployments and a dedicated "Regroup sessions" background job
    • Jobs UI now exposes a regroup stage with summary stats (events created, captures grouped, etc.)
  • Improvements

    • Configurable session time gap with sensible fallback
    • Better concurrency safety to avoid duplicate regroup work
    • Admin actions now enqueue background jobs instead of running inline
  • Permissions

    • New permissions to control who can run regroup jobs and trigger regroup on deployments

@netlify

netlify Bot commented May 5, 2026

Copy link
Copy Markdown

Deploy Preview for antenna-preview canceled.

Name Link
🔨 Latest commit c56a03b
🔍 Latest deploy log https://app.netlify.com/projects/antenna-preview/deploys/6a289e8ac60c290008ed6842

@netlify

netlify Bot commented May 5, 2026

Copy link
Copy Markdown

Deploy Preview for antenna-ssec canceled.

Name Link
🔨 Latest commit c56a03b
🔍 Latest deploy log https://app.netlify.com/projects/antenna-ssec/deploys/6a289e8ac8746900088b6d37

@coderabbitai

coderabbitai Bot commented May 5, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds per-deployment regroup cache locking and project-driven session gap handling to core grouping; introduces RegroupEventsJob and an explicit regroup stage in DataStorageSyncJob; switches admin/API to enqueue jobs; adds permissions, role updates, unit/integration tests, and an e2e test command.

Changes

Regroup sessions feature

Layer / File(s) Summary
Core grouping refactor: cache lock, configurable gap, job params
ami/main/models.py
Add REGROUP_LOCK_TTL_SECONDS and _regroup_lock; make group_images_into_events use per-deployment cache lock, accept optional max_time_gap derived from deployment.project.session_time_gap_seconds with 120-minute fallback, compute events_created and events_deleted_empty, sample duplicate timestamps, and emit extended job stage params when called with job/stage_key. Update delete_empty_events to return an int. Add regroup_after flag to Deployment.sync_captures and new Project permission constants.
Jobs: regroup stage and RegroupEventsJob
ami/jobs/models.py
Introduce REGROUP_STAGE_PARAM_NAMES; add regroup_stage_key/regroup_stage_name to DataStorageSyncJob and refactor its run to call deployment.sync_captures(regroup_after=False) then explicitly run group_images_into_events into a dedicated regroup stage. Add RegroupEventsJob job type, implement its run, and register it in VALID_JOB_TYPES.
Unit tests: image grouping and defaults
ami/main/tests.py
Add _create_burst helper and tests for cross-midnight splitting vs merging under different max_time_gap settings, same-date merging regardless of gap, using deployment.project.session_time_gap_seconds when max_time_gap omitted, and fallback for invalid session_time_gap values.
Unit tests: idempotency and lock semantics
ami/main/tests.py
Add tests that pre-seed the per-deployment regroup cache lock to assert short-circuit behavior (no Events created) and tests ensuring token-based lock cleanup does not clobber newer owners.
Job integration tests
ami/jobs/tests/test_jobs.py
Add end-to-end tests for RegroupEventsJob and for DataStorageSyncJob exposing regroup as a separate stage; assert regroup stage reaches SUCCESS with progress and stage params and that failures in group_images_into_events propagate.
API and admin: enqueue via Job infrastructure
ami/main/admin.py, ami/main/api/views.py
Refactor DeploymentAdmin display helpers to use denormalized fields for ordering; rewrite sync_captures and regroup_events admin actions to create/enqueue Job records and report queued/skipped deployments; add DeploymentViewSet.regroup_sessions POST detail action to enqueue RegroupEventsJob and return 202 Accepted with job/deployment/project ids.
End-to-end test harness
ami/jobs/management/commands/test_regroup_job_e2e.py
Add management command with modes regroup, sync, and concurrent to enqueue/poll jobs, assert SUCCESS, dump regroup stage params, and validate lock short-circuiting across concurrent runs.
Permissions and role assignments
ami/main/migrations/0089_add_regroup_permissions.py, ami/users/roles.py
Add migration to create and grant two new Project permissions (regroup_sessions_deployment, run_regroup_events_job) to role groups and update Project model permissions; add these permission codenames to MLDataManager and ProjectManager role sets.
Task documentation
ami/tasks.py
Expand regroup_events task comment to note it is a thin Celery wrapper around Deployment.save() autoregroup and that idempotency/locking is handled by _regroup_lock in group_images_into_events.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related issues

Possibly related PRs

  • RolnickLab/antenna#1159: Also modifies DataStorageSyncJob.run stage/progress handling; related at the job-stage level.
  • RolnickLab/antenna#1268: Also changes regroup core behavior around events/duration and overlaps with group_images_into_events changes.

Suggested labels

backend

Suggested reviewers

  • annavik

Poem

🐰 A cache lock held tight in the meadow of code,
Sessions regrouped as the timestamps reload,
Jobs queued with a hop and a soft little thrum,
Stats counted and logged — now the pipelines hum,
Hooray for clear stages and locks that keep calm!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 37.25% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the primary feature: allowing customization of the session time gap setting. It directly addresses the main change without being vague or overly broad.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The PR description is comprehensive and well-structured, covering all required sections from the template including summary, list of changes, related issues, detailed description, testing instructions, deployment notes, and a complete checklist.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/configurable-session-gap

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@mihow mihow changed the title feat(projects): wire session_time_gap_seconds into event grouping Allow users to customize the time gap between sessions May 5, 2026
@mihow mihow marked this pull request as ready for review May 6, 2026 12:51
Copilot AI review requested due to automatic review settings May 6, 2026 12:51

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ami/main/api/views.py`:
- Around line 320-323: The view currently enqueues regroup_events_task
unconditionally allowing overlapping runs; modify the handler that calls
self.get_object() and regroup_events_task.delay(deployment.pk) to acquire a
per-deployment in-flight lock (e.g., a Redis/cache-based lock keyed by
f"regroup:{deployment.pk}") before queuing; if the lock cannot be obtained
return an early HTTP 409 (or similar) Response and log that a regroup is already
running, otherwise enqueue the task, keep the lock for the duration of the task
(or set a TTL and have the task release the lock) and ensure regroup_events_task
(or its wrapper) releases the lock on completion/failure so duplicate triggers
are prevented.

In `@ui/src/pages/project-details/processing-form.tsx`:
- Around line 30-35: The sessionTimeGapSeconds field currently validates but
will submit a string because React Hook Form doesn't coerce number inputs;
update the field's rules for sessionTimeGapSeconds to include a setValueAs that
converts the incoming value to a number (e.g., Number or parseInt) so the form
value is stored as a numeric type; modify the rules object for
sessionTimeGapSeconds to add setValueAs and keep required/min validations
intact.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 80284157-fc1e-41e0-8398-96a13aa37201

📥 Commits

Reviewing files that changed from the base of the PR and between 183f487 and 55cef8b.

📒 Files selected for processing (4)
  • ami/main/api/views.py
  • ami/main/models.py
  • ami/main/tests.py
  • ui/src/pages/project-details/processing-form.tsx

Comment thread ami/main/api/views.py
Comment thread ui/src/pages/project-details/processing-form.tsx Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR wires the existing per-project session_time_gap_seconds setting into backend session (Event) grouping, exposes the setting in the UI, and adds an API action to trigger regrouping asynchronously so admins can re-cluster historical captures after changing the threshold.

Changes:

  • Backend: group_images_into_events() now derives its default max_time_gap from deployment.project.session_time_gap_seconds when not explicitly provided.
  • API: adds POST /api/v2/deployments/<pk>/regroup-sessions/ to enqueue the existing ami.tasks.regroup_events Celery task.
  • UI + tests: exposes the setting as a numeric input on the project Processing settings form and adds unit tests covering cross-midnight behavior and the project-setting fallback.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
ami/main/models.py Uses the project’s session_time_gap_seconds as the default grouping gap instead of a hardcoded 120 minutes.
ami/main/api/views.py Adds a deployment-level action endpoint to queue regrouping in Celery.
ui/src/pages/project-details/processing-form.tsx Adds a numeric form field to edit sessionTimeGapSeconds in the Processing settings UI.
ami/main/tests.py Adds tests documenting cross-midnight split/merge behavior and verifying the project setting is used when no explicit gap is passed.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread ami/main/api/views.py
Comment thread ui/src/pages/project-details/processing-form.tsx Outdated
Comment thread ami/main/models.py Outdated
mihow added a commit that referenced this pull request May 7, 2026
…oup task

Addresses takeaway-review feedback on PR #1292:

- Add `regroup_sessions_deployment` custom Project permission so non-superusers
  with the ProjectManager role can hit `POST /deployments/<pk>/regroup-sessions/`
  (mirrors `sync_deployment`). Without this, `BaseModel.check_custom_permission`
  computes the codename `regroup_sessions_deployment` and the perm doesn't
  exist, so the action 403s for everyone except superusers.
- Wrap `ami.tasks.regroup_events` in a per-deployment cache lock. Concurrent
  enqueues (double-clicks, sync_captures auto-regroup racing with manual
  trigger) collapse to a single run instead of overlapping on the same rows.
  Lower-level than the view so all callers (Deployment.save, admin bulk
  action, new API action) are protected.
- Defensive clamp on `session_time_gap_seconds` — fall back to the historical
  120-minute default if the value is 0 / negative, with a warning log. Stops
  pathological "every gap = new event" behavior when admins enter bad values.
- Migration 0085 adds the new perm to `Project.Meta.permissions` and grants it
  to existing `ProjectManager` role groups via a data migration (mirrors how
  0084 revoked `delete_job`).
- Tests:
  - test_invalid_session_time_gap_falls_back_to_default exercises 0 / -1 / -7200
  - test_regroup_events_task_is_idempotent_under_concurrent_calls pre-takes
    the cache lock, asserts a second call skips group_images_into_events

Co-Authored-By: Claude <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ami/tasks.py`:
- Around line 106-119: The current lock uses cache.add(lock_key, 1, ...) and
unconditionally cache.delete(lock_key) which can remove a newer owner's lock;
change to an ownership token pattern: when acquiring the lock (where cache.add
is called) generate a unique token (e.g. UUID) and store that token as the cache
value instead of 1 (use REGROUP_EVENTS_LOCK_TTL as before), keep the token in a
local variable, and on release only remove the key if the stored cache value
equals your token (fetch with cache.get(lock_key) and only call
cache.delete(lock_key) when it matches) or use the cache backend's atomic
compare-and-delete if available; update the lock handling around lock_key,
cache.add, cache.get and cache.delete to use this token ownership check.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 251316be-c276-4473-a3fc-36a150e056ed

📥 Commits

Reviewing files that changed from the base of the PR and between 55cef8b and 50d618e.

📒 Files selected for processing (5)
  • ami/main/migrations/0085_add_regroup_sessions_deployment_permission.py
  • ami/main/models.py
  • ami/main/tests.py
  • ami/tasks.py
  • ami/users/roles.py
✅ Files skipped from review due to trivial changes (1)
  • ami/users/roles.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • ami/main/models.py

Comment thread ami/tasks.py Outdated
@mihow

mihow commented May 7, 2026

Copy link
Copy Markdown
Collaborator Author

Claude says: Push 50d618e addresses the takeaway-review feedback from CodeRabbit and Copilot:

  1. regroup_sessions_deployment custom permission (Copilot) — added to Project.Permissions and Project.Meta.permissions, granted to existing ProjectManager role groups via data migration 0085. Mirrors how sync_deployment is wired. Without this, the action 403'd for all non-superusers.
  2. Per-deployment idempotency lock on regroup_events (CodeRabbit) — cache.add() SETNX-style with a 1 h TTL, applied inside the Celery task itself rather than the view. So all callers (the new API action, Deployment.save autoregroup, the admin bulk action) are protected, not just the new endpoint. Concurrent enqueues collapse to a single run.
  3. Clamp non-positive session_time_gap_seconds (Copilot) — fall back to the historical 120-min default with a warning log for 0 or negative values, preventing pathological per-image splits.
  4. Form numeric coercion (CodeRabbit + Copilot) — left as-is. <input type="number"> returning a string for an integer field is the established pattern across processing-form, capture-set form, deployment-location form, and job-details form. DRF IntegerField parses strings transparently. Worth standardizing the helper later, not in this PR.

New tests covering the lock and the clamp:

  • test_invalid_session_time_gap_falls_back_to_default — exercises 0 / -1 / -7200
  • test_regroup_events_task_is_idempotent_under_concurrent_calls — pre-takes the lock and asserts the second call exits early without invoking group_images_into_events

11/11 TestImageGrouping and 13/13 permission tests green locally.

@mihow mihow requested a review from annavik May 7, 2026 06:24

@annavik annavik left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!!

I still think this setting would make more sense in a new section, instead of in Project->Settings->Processing. I know you said this can affect the logic for occurrences once tracking is enabled, but I think this is a bit far-fetched. I'm thinking, if I were about to look for this setting as a user, where would I first go? I'm not sure the processing section would be the place. Maybe this belongs more with upcoming sampling schedule configuration we have discussed?

I don't have very strong opinions here though, so if you feel this is where it belongs, go for it! :)

@loppear loppear left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it works for my case. I agree with @annavik that the Processing settings form isn't the first place I looked for this, even having looked at the PR, would make sense to me on General settings (although maybe eventually moved to a less-likely settings tab there, and not opposed to it staying where it is for the time being.)

EDIT: I think regroup_events should probably happen on setting change, not sure how expensive it is or what the conditions where you wouldn't want to regroup (yet?) but I think I'm going to expect this setting to apply - they're going to be regrouped to the new setting anyways on the next sync right?

sessionTimeGapSeconds: {
label: 'Maximum time gap between sessions (default)',
description:
'Maximum time gap (in seconds) between consecutive images before a new session is started. Default is 7200 seconds (2 hours).',

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be nice if this was presented to the user as minutes (or hours) rather than seconds.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude says: Good point — units in seconds are awkward in a settings UI. The UI form field has been removed entirely in 76b52af; the setting is now admin-only (via Django admin or the API). When we surface it again it should be minutes or hours, and probably at the deployment level rather than the project level (raised in the PR description's "Follow-ups" section).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude says: Captured in the body's follow-up section — Notes / follow-ups now reads "the form input should be minutes or hours, not seconds (per @loppear's review comment on the original draft)" alongside the related point about auto-regroup on setting change. The UI form input was pulled from this PR (no UI field ships here), so the unit choice lands with the follow-up UI work rather than gating this PR. Leaving the thread for your call on resolving.

@mihow

mihow commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator Author

Claude says: Pushed 76b52af which restructures the PR based on @mihow's takeaway-review feedback.

Scope shift — out: the UI form field for session_time_gap_seconds. The setting is admin-only for now (Django admin or API). When we re-surface it in the UI it should be minutes/hours rather than seconds, and probably attached to Deployment rather than Project — both noted in "Follow-ups".

Scope shift — in:

  1. RegroupEventsJob JobType. The new POST /api/v2/deployments/<pk>/regroup-sessions/ action and the Django admin bulk action both enqueue Job.objects.create(job_type_key=RegroupEventsJob.key).enqueue() now, so admins see regroup progress, logs, and stats in the Jobs UI like every other long-running task. API response shape is now {job_id, deployment_id, project_id} — matches the existing sync action on the same viewset (previously returned a Celery task_id).
  2. DataStorageSyncJob gets an explicit "Regroup sessions" second stage. Previously regroup ran silently inside Deployment.save(), so a regroup failure after a successful sync left the Job recorded as SUCCESS — the workshop pain Mike flagged. Now a sync-then-regroup failure flips the whole Job to FAILURE and the regroup logs land on the same Job row. Closes Expose session regrouping as a job stage when syncing images #1157 and Expose session regrouping as a job stage with stats #1158.
  3. Idempotency lock moves into group_images_into_events() itself. Previously the lock lived in ami.tasks.regroup_events and only protected the Celery-task entry point — direct callers (Deployment.save autoregroup, _insert_or_update_batch_for_sync) bypassed it. Now every call site collapses to a single in-flight run for a given deployment. Token-based release (uuid in cache value, only delete if value still matches) fixes the lock-release-clobbers-newer-owner race CodeRabbit flagged.
  4. New run_regroup_events_job permission granted to MLDataManager (which ProjectManager inherits from). The regroup_sessions_deployment permission from the previous draft is unchanged. Migration 0085_add_regroup_permissions registers and grants both.
  5. Admin "Sync captures" bulk action also switched to Jobs. Same shape as the regroup admin action, and now inherits the visible regroup stage from DataStorageSyncJob.

Tests: 12/12 TestImageGrouping, 4/4 TestRegroupEventsJob / TestDataStorageSyncJobIncludesRegroupStage, 4/4 TestRolePermissions. The PR description has the full test plan and what's still left to verify against production data.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
ami/main/models.py (1)

1523-1526: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Bug: len(values) returns string length, not duplicate count.

values is a newline-joined string (line 1520-1522), so len(values) gives the character count, not the number of duplicate timestamps. Use duplicate_timestamp_count which was computed on line 1518.

🐛 Proposed fix
     logger.warning(
-        f"Found {len(values)} images with the same timestamp in deployment '{deployment}'. "
+        f"Found {duplicate_timestamp_count} images with the same timestamp in deployment '{deployment}'. "
         f"Only one image will be used for each timestamp for each event."
     )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ami/main/models.py` around lines 1523 - 1526, The warning message incorrectly
uses len(values) (which is the character length of the newline-joined string)
instead of the numeric duplicate count; update the logger.warning call in the
block around logger.warning(...) to use duplicate_timestamp_count rather than
len(values) so it reads f"Found {duplicate_timestamp_count} images with the same
timestamp in deployment '{deployment}'..." and leave the rest of the message and
variables (values, deployment) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@ami/main/models.py`:
- Around line 1523-1526: The warning message incorrectly uses len(values) (which
is the character length of the newline-joined string) instead of the numeric
duplicate count; update the logger.warning call in the block around
logger.warning(...) to use duplicate_timestamp_count rather than len(values) so
it reads f"Found {duplicate_timestamp_count} images with the same timestamp in
deployment '{deployment}'..." and leave the rest of the message and variables
(values, deployment) unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f0173c1c-89f3-46b5-93d1-7be34a87e027

📥 Commits

Reviewing files that changed from the base of the PR and between 50d618e and 76b52af.

📒 Files selected for processing (9)
  • ami/jobs/models.py
  • ami/jobs/tests/test_jobs.py
  • ami/main/admin.py
  • ami/main/api/views.py
  • ami/main/migrations/0085_add_regroup_permissions.py
  • ami/main/models.py
  • ami/main/tests.py
  • ami/tasks.py
  • ami/users/roles.py
✅ Files skipped from review due to trivial changes (1)
  • ami/tasks.py

mihow and others added 7 commits June 4, 2026 17:10
The session_time_gap_seconds project setting was added in #918 but
never used. group_images_into_events() still hardcoded a 120-minute
gap. This wires the field through and exposes it in the Processing
settings UI so each project can override the session boundary.

Closes #906

Co-Authored-By: Claude <noreply@anthropic.com>
POST /api/v2/deployments/<pk>/regroup-sessions/ queues the existing
ami.tasks.regroup_events Celery task for the deployment. The task
calls group_images_into_events(), which now reads the project's
session_time_gap_seconds setting.

User-facing terminology is "sessions"; "Event" is retained as the
internal model name for backwards compatibility.

Returns 202 with task_id, deployment_id, project_id.

Refs #906

Co-Authored-By: Claude <noreply@anthropic.com>
Three new TestImageGrouping cases that exercise the new
session_time_gap_seconds wiring:

- test_cross_midnight_bursts_split_by_short_gap: documents the user-
  reported pain (2 bursts with off-window > default gap, crossing
  midnight, split into 2 Events on different dates) and proves that
  bumping max_time_gap past the off-window collapses them to 1 Event.
- test_same_date_bursts_merge_regardless_of_gap: documents the inverse
  behavior — bursts on the same calendar date collide on group_by
  regardless of gap setting. Acts as a regression target for any future
  rework of the date-keyed Event reuse path (per the #904 caveat at
  models.py:1516).
- test_session_time_gap_seconds_is_used_when_no_explicit_gap: verifies
  group_images_into_events falls back to the project setting when no
  max_time_gap is passed, including refreshing the cached relation
  after the setting changes.

Refs #906

Co-Authored-By: Claude <noreply@anthropic.com>
…oup task

Addresses takeaway-review feedback on PR #1292:

- Add `regroup_sessions_deployment` custom Project permission so non-superusers
  with the ProjectManager role can hit `POST /deployments/<pk>/regroup-sessions/`
  (mirrors `sync_deployment`). Without this, `BaseModel.check_custom_permission`
  computes the codename `regroup_sessions_deployment` and the perm doesn't
  exist, so the action 403s for everyone except superusers.
- Wrap `ami.tasks.regroup_events` in a per-deployment cache lock. Concurrent
  enqueues (double-clicks, sync_captures auto-regroup racing with manual
  trigger) collapse to a single run instead of overlapping on the same rows.
  Lower-level than the view so all callers (Deployment.save, admin bulk
  action, new API action) are protected.
- Defensive clamp on `session_time_gap_seconds` — fall back to the historical
  120-minute default if the value is 0 / negative, with a warning log. Stops
  pathological "every gap = new event" behavior when admins enter bad values.
- Migration 0085 adds the new perm to `Project.Meta.permissions` and grants it
  to existing `ProjectManager` role groups via a data migration (mirrors how
  0084 revoked `delete_job`).
- Tests:
  - test_invalid_session_time_gap_falls_back_to_default exercises 0 / -1 / -7200
  - test_regroup_events_task_is_idempotent_under_concurrent_calls pre-takes
    the cache lock, asserts a second call skips group_images_into_events

Co-Authored-By: Claude <noreply@anthropic.com>
…l sites

Replaces the direct-Celery API/admin entry points with a new
RegroupEventsJob so admins can see regroup progress and failures in the
Jobs UI like every other long-running task. The existing
DataStorageSyncJob now runs grouping as an explicit second stage instead
of relying on Deployment.save() autoregroup, so a post-sync regroup
failure flips the Job to FAILURE rather than silently succeeding
(closes #1157, #1158).

Other changes in this commit:

* Move the per-deployment idempotency lock from ami.tasks.regroup_events
  into group_images_into_events() itself, so all four call sites
  (RegroupEventsJob, sync-stage regroup, Deployment.save autoregroup,
  the bare Celery task) collapse to a single in-flight run rather than
  only the Celery-task entry point being protected.
* Use a uuid token on the lock value and only delete the key if it
  still matches our token. Fixes the lock-release-clobbers-newer-owner
  race CodeRabbit flagged.
* Add run_regroup_events_job permission and grant it to MLDataManager
  (which ProjectManager inherits from). regroup_sessions_deployment
  stays on ProjectManager. Both wired through the renamed migration
  0085_add_regroup_permissions.
* Strip the UI form field for session_time_gap_seconds; field is now
  admin-only until we decide whether it belongs on Project or Deployment.
* Swap the admin "regroup" and "sync" bulk actions to create Jobs so
  the admin path matches the API path.

Tests:
- TestImageGrouping: replaced the old task-level idempotency test with
  one that exercises the new in-function lock; added a token-release
  test that verifies an expired-then-reacquired lock isn't clobbered.
- TestRegroupEventsJob (new): end-to-end Job run, stats populated, and
  a failure-propagates assertion.
- TestDataStorageSyncJobIncludesRegroupStage (new): sync Job exposes
  the regroup stage and a regroup failure surfaces as Job failure.

Co-Authored-By: Claude <noreply@anthropic.com>
Main has moved on to migration 0088. Rename 0085_add_regroup_permissions
to 0089 and depend on 0088_detection_det_srcimg_created_idx to resolve
the multiple-leaf-nodes conflict CI surfaced.

Co-Authored-By: Claude <noreply@anthropic.com>
… log; add e2e command

The regroup stage param `events_created` was incrementing for every event
visited rather than only for events newly created by `get_or_create`. The bool
returned from `get_or_create` was being discarded. Track `events_created_count`
separately and use it in the stage-param write.

Drive-by: `logger.warning(f"Found {len(values)} ...")` was reporting the
character length of the newline-joined duplicate-timestamp report rather than
the duplicate count. Pre-existing (since 0954c38, 2024-09-12) but sits inside
the very block whose stats this PR now surfaces. Use `duplicate_timestamp_count`,
and cap the in-log sample at 20 timestamps so very dirty deployments don't
flood logs.

Add `manage.py test_regroup_job_e2e {regroup|sync|concurrent}` to exercise the
new Job framework path against a real stack: `regroup` runs a `RegroupEventsJob`
and dumps stage params; `sync` runs a `DataStorageSyncJob` and asserts the
two-stage chain; `concurrent` fires two regroups back-to-back and verifies the
lock semantics. Mirrors `test_ml_job_e2e`. Used to validate PR #1292 against
deployment 74 of the Antenna dev stack.

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow mihow force-pushed the feat/configurable-session-gap branch from 72dfbeb to d93aa24 Compare June 5, 2026 00:16
@mihow

mihow commented Jun 5, 2026

Copy link
Copy Markdown
Collaborator Author

Claude says: Pushed two follow-up commits + ran live-stack E2Es. Summary:

New commits on feat/configurable-session-gap (force-pushed after rebasing onto current main to pick up migrations 0085–0088; perms migration renumbered to 0089):

  • 184bf412 — Migration rebase: 0085_add_regroup_permissions0089_add_regroup_permissions so the leaf-node conflict against main's 0088 clears.
  • d93aa24cevents_created stage param accounting fix (was counting events visited; now counts newly-created via get_or_create's was_created return); pre-existing duplicate-timestamp len(values) log bug fixed (CodeRabbit caught it on the prior push); new manage.py test_regroup_job_e2e management command.

Live-stack E2E validation (Antenna dev box, deployment 74 PFS_Moth_Wall, 1263 captures, 69 events at 2 h baseline):

# Scenario Result
1 RegroupEventsJob direct SUCCESS in 18 s, stage params populated (captures_grouped=1188, events_touched=69, duplicate_timestamps=57), Δevents=0
2 DataStorageSyncJob two-stage chain SUCCESS in 20 s, both data_storage_sync (total_files=857) and regroup_sessions stages registered and SUCCESS — closes #1157 / #1158 against real data
3 Sync→regroup failure propagation Covered by test_sync_job_regroup_failure_propagates unit test (mocks group_images_into_events to raise; JobType.run() propagates so Celery wrapper records FAILURE). Skipped stack-level run to avoid a code edit just to inject a fault — happy to add a chaos hook if you want it
4 Concurrent lock idempotency Both Jobs SUCCESS; Job A captures_grouped=1188, Job B captures_grouped=0 (short-circuit). Event count unchanged. Lock semantics held end-to-end

PR body updated to reflect all of this — new ## End-to-end validation against a real Antenna stack section, drive-by fixes called out under the Summary bullets, migration number corrected to 0089 throughout, #906 moved out of Closes (it's partially addressed; the calendar-date Event reuse limitation remains blocked on #904).

@loppear's open thread replied to, leaving resolved/unresolved for your call.

Ready for re-review.

@mihow

mihow commented Jun 5, 2026

Copy link
Copy Markdown
Collaborator Author

@annavik @loppear thanks for reviewing. I decided to remove the UI changes. There is more to think about the placement and behavior when a user changes the setting. Also the setting does not act they way the user might expect. It DOES work in Luke's case, but it won't in others. I suggest we move the setting to the Station configuration and also consider simplifying the grouping algorithm. For now, this PR connects the existing function to the existing project setting for the time gap. It also wraps the session regrouping in a Job so you can see it running and when if it fails. You can now also see the logs for the session grouping stage when syncing a Station to it's data storage.

I updated the title to say "allow admins" to customize instead of "allow users" :)

@mihow mihow changed the title Allow users to customize the time gap between sessions Allow admins to customize the time gap between sessions Jun 5, 2026
mihow and others added 2 commits June 4, 2026 18:23
The DeploymentAdmin list view ran four live per-row aggregates for each
deployment displayed:

- `start_date` → `SourceImage.objects.filter(event__deployment=obj).aggregate(Min)`
  (joins SourceImage→Event, scans the full per-deployment timeline)
- `end_date` → `SourceImage.objects.filter(deployment=obj).aggregate(Max)`
- `events_count` → `obj.events.count()`
- `captures_count` → re-read of `data_source_total_files` (cached, but
  wrapped in a method that suggested it was computed)

With 251 deployments on the page and ~5 expensive queries per row, the list
view became unusable on stacks holding deployments with >100k captures
(e.g. one deployment with 638k captures dominated end_date's wall time).

Deployment already denormalizes `captures_count`, `events_count`,
`data_source_total_size`, `first_capture_timestamp` and
`last_capture_timestamp` via `update_calculated_fields()`. Switch the admin
list to read those instead and drop the custom aggregates. Add
`@admin.display(ordering=...)` so the columns remain sortable against the
underlying cached fields.

Wall time on the dev box (251 deployments): ~0.6s warm, ~0.76s cold.

Co-Authored-By: Claude <noreply@anthropic.com>
The regroup stage params were stored with machine-style names
(`captures_grouped`, `events_created`, ...), which is what the Jobs UI
displays to admins. Other JobType stages already use title-cased,
space-separated names (`Total files`, `Captures added`, etc.); this commit
brings the regroup stage in line.

Param **names** (UI labels):
  Captures grouped, Events created, Events touched,
  Empty events deleted, Duplicate timestamps,
  Ungrouped captures, Captures missing timestamp.

Param **keys** (for retrieval, slugify(name) with `-` → `_`):
  captures_grouped, events_created, events_touched,
  empty_events_deleted (was `events_deleted_empty`),
  duplicate_timestamps, ungrouped_captures,
  captures_missing_timestamp (was `no_timestamp_captures`).

Five keys are stable; two changed shape. Existing Job rows keep their
historical name/key on disk — only new runs use the new labels.

Centralise the list as `REGROUP_STAGE_PARAM_NAMES` in ami.jobs.models so
RegroupEventsJob and DataStorageSyncJob's regroup stage share one source of
truth. In `_group_images_into_events_locked`, swap the kwargs-style
`update_stage(...)` call (which couldn't carry names with spaces) for an
explicit `add_or_update_stage_param(stage_key, name, value)` loop driven by
a label→value dict.

Update tests to look up by the new keys and assert the human-readable names
exist; update `test_regroup_job_e2e` to print `name [key]: value` so the
stable retrieval key is visible alongside the UI label.

Co-Authored-By: Claude <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (4)
ami/main/tests.py (2)

534-553: ⚡ Quick win

Assert the warning side-effect for invalid gap fallback.

Line 548 currently validates clamped behavior only. Add a log assertion so the “fallback + warning” contract is protected, not just the event count.

🔧 Proposed test hardening
         for bad_value in (0, -1, -7200):
             Event.objects.filter(deployment=self.deployment).delete()
             self.project.session_time_gap_seconds = bad_value
             self.project.save()
             self.deployment.refresh_from_db()
-            group_images_into_events(deployment=self.deployment)
+            with self.assertLogs(level="WARNING") as logs:
+                group_images_into_events(deployment=self.deployment)
             count = Event.objects.filter(deployment=self.deployment).count()
             # Default 120-min gap on this cross-midnight pattern → 2 Events,
             # NOT 12 (which is what bad_value=0 would produce without the guard).
             assert count == 2, f"expected 2 Events at gap={bad_value} (default fallback), got {count}"
+            self.assertTrue(
+                any("session_time_gap_seconds" in msg for msg in logs.output),
+                "Expected warning log for invalid session_time_gap_seconds fallback",
+            )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ami/main/tests.py` around lines 534 - 553, Update
test_invalid_session_time_gap_falls_back_to_default to also assert the warning
side-effect: use the test logging capture (e.g., pytest's caplog) while setting
project.session_time_gap_seconds to each bad_value, call
group_images_into_events(deployment=self.deployment), and then assert caplog
captured a WARNING-level message from the module that contains text indicating
the fallback (e.g., mentions "fallback", "session_time_gap_seconds", or the
bad_value) so the test verifies both the event count and that a warning was
emitted when the invalid gap was clamped.

598-605: ⚡ Quick win

Make cache cleanup unconditional in the lock-owner test.

If the assertion at Line 604 fails, Line 605 is skipped and can leak cache state into later tests. Wrap cleanup in finally.

🧹 Proposed cleanup guard
-        with _regroup_lock(self.deployment.pk) as acquired:
-            assert acquired, "Should acquire the lock on first try"
-            # Simulate the TTL expiring and a newer caller taking the lock.
-            cache.set(lock_key, "newer-owner-token", timeout=60)
-
-        # Our `finally` ran. The newer owner's lock should still be there.
-        assert cache.get(lock_key) == "newer-owner-token", "Expired-lock release must not clobber the newer owner"
-        cache.delete(lock_key)
+        try:
+            with _regroup_lock(self.deployment.pk) as acquired:
+                assert acquired, "Should acquire the lock on first try"
+                # Simulate the TTL expiring and a newer caller taking the lock.
+                cache.set(lock_key, "newer-owner-token", timeout=60)
+
+            # Our `finally` ran. The newer owner's lock should still be there.
+            assert cache.get(lock_key) == "newer-owner-token", "Expired-lock release must not clobber the newer owner"
+        finally:
+            cache.delete(lock_key)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ami/main/tests.py` around lines 598 - 605, The test currently may skip
cache.delete(lock_key) if the assertion fails; modify the test around the
_regroup_lock block so that cache.delete(lock_key) always runs by placing the
post-lock assertions and cache.delete into a try/finally (or move cache.delete
into a finally) that ensures cleanup regardless of assertion failures; reference
the existing _regroup_lock(self.deployment.pk) context, the lock_key variable,
and cache.delete to locate where to add the finally guard.
ami/jobs/management/commands/test_regroup_job_e2e.py (1)

60-75: ⚡ Quick win

Use explicit exception chaining in except clause.

Line 64 should use raise CommandError(...) from None to explicitly suppress the DoesNotExist exception from the traceback. This distinguishes a deliberate exception translation from an error during exception handling, following PEP 3134.

♻️ Proposed fix
     def _resolve_deployment(self, deployment_id: int) -> Deployment:
         try:
             deployment = Deployment.objects.get(pk=deployment_id)
         except Deployment.DoesNotExist:
-            raise CommandError(f"Deployment {deployment_id} not found")
+            raise CommandError(f"Deployment {deployment_id} not found") from None
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ami/jobs/management/commands/test_regroup_job_e2e.py` around lines 60 - 75,
In _resolve_deployment, change the exception raise to use explicit exception
chaining: when catching Deployment.DoesNotExist in the except block of
_resolve_deployment, replace the current raise CommandError(...) with raise
CommandError(f"Deployment {deployment_id} not found") from None so the original
DoesNotExist traceback is suppressed and the intent of exception translation is
clear.
ami/main/migrations/0089_add_regroup_permissions.py (1)

64-87: 💤 Low value

Consider mirroring grant logic precision in revoke.

Lines 75-87 remove both permissions from both _ProjectManager and _MLDataManager groups, but the grant function at lines 52-61 is more specific: regroup_sessions_deployment goes only to _ProjectManager, while run_regroup_events_job goes to both. Removing a permission from a group that never had it is harmless, but mirroring the grant logic would make the migration more self-documenting.

♻️ Optional refactor to match grant specificity
 def revoke_new_permissions(apps, schema_editor):
     Group = apps.get_model("auth", "Group")
     Permission = apps.get_model("auth", "Permission")
     ContentType = apps.get_model("contenttypes", "ContentType")
     GroupObjectPermission = apps.get_model("guardian", "GroupObjectPermission")

     try:
         project_ct = ContentType.objects.get(app_label="main", model="project")
     except ContentType.DoesNotExist:
         return

-    for codename in ("regroup_sessions_deployment", "run_regroup_events_job"):
-        try:
-            perm = Permission.objects.get(codename=codename, content_type=project_ct)
-        except Permission.DoesNotExist:
-            continue
-        role_groups = Group.objects.filter(Q(name__endswith="_ProjectManager") | Q(name__endswith="_MLDataManager"))
-        for group in role_groups:
-            group.permissions.remove(perm)
-        GroupObjectPermission.objects.filter(
-            permission=perm,
-            content_type=project_ct,
-            group__in=role_groups,
-        ).delete()
+    # Remove regroup_sessions_deployment from ProjectManager only
+    try:
+        regroup_perm = Permission.objects.get(codename="regroup_sessions_deployment", content_type=project_ct)
+        pm_groups = Group.objects.filter(Q(name__endswith="_ProjectManager"))
+        for group in pm_groups:
+            group.permissions.remove(regroup_perm)
+        GroupObjectPermission.objects.filter(
+            permission=regroup_perm,
+            content_type=project_ct,
+            group__in=pm_groups,
+        ).delete()
+    except Permission.DoesNotExist:
+        pass
+
+    # Remove run_regroup_events_job from both MLDataManager and ProjectManager
+    try:
+        job_perm = Permission.objects.get(codename="run_regroup_events_job", content_type=project_ct)
+        job_groups = Group.objects.filter(Q(name__endswith="_MLDataManager") | Q(name__endswith="_ProjectManager"))
+        for group in job_groups:
+            group.permissions.remove(job_perm)
+        GroupObjectPermission.objects.filter(
+            permission=job_perm,
+            content_type=project_ct,
+            group__in=job_groups,
+        ).delete()
+    except Permission.DoesNotExist:
+        pass
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ami/main/migrations/0089_add_regroup_permissions.py` around lines 64 - 87,
The revoke_new_permissions function currently removes both codenames from both
role groups; change its logic to mirror the grant logic in the migration by
mapping codenames to the exact target groups before removal: when codename ==
"regroup_sessions_deployment" only target groups ending with "_ProjectManager",
and when codename == "run_regroup_events_job" target both "_ProjectManager" and
"_MLDataManager"; use that per-codename role_groups selection for both
group.permissions.remove(perm) and the GroupObjectPermission.objects.filter(...,
group__in=role_groups).delete() calls so revoke precisely matches grant.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ami/main/api/views.py`:
- Around line 362-393: The method regroup_sessions in the view currently uses
assert deployment.project which can be disabled; replace that with an explicit
runtime check and return a proper client error: if deployment.project is None,
raise rest_framework.exceptions.ValidationError (or return Response with
status=HTTP_400_BAD_REQUEST and a clear error message) before creating the Job;
ensure the check is done right after deployment = self.get_object() and include
deployment/project identifiers in the error message for clarity.

In `@ami/main/models.py`:
- Around line 968-973: The branch that skips autoregroup (regroup_after=False)
fails to realign child objects because it never triggers
Deployment.update_children(); to fix, after calling
self.save(regroup_async=False, update_calculated_fields=False) and
self.update_calculated_fields(save=True) ensure you explicitly invoke the same
child-realignment logic (call self.update_children() or otherwise run the code
path used by Deployment.save when update_calculated_fields=True) so
Events/Occurrences/SourceImages are moved to the new project when a Deployment
is moved; locate Deployment.save, Deployment.update_calculated_fields, and
Deployment.update_children to add the call in this else branch.
- Around line 1479-1486: The current group_images_into_events function silently
returns [] when _regroup_lock(deployment.pk) is not acquired, which lets callers
(e.g., DataStorageSyncJob) treat regroup as successful and miss newly imported
captures; change this behavior so a lock miss produces a retryable outcome:
either raise a specific retryable exception (e.g., RegroupLockHeldError) or set
a "dirty" flag on the deployment and re-enqueue the job instead of returning [],
and ensure callers (DataStorageSyncJob) catch that exception or check the dirty
flag to re-schedule regroup; update references to _regroup_lock,
group_images_into_events, and any use of job.logger/logger to log a clear
retryable message when the lock is held.

---

Nitpick comments:
In `@ami/jobs/management/commands/test_regroup_job_e2e.py`:
- Around line 60-75: In _resolve_deployment, change the exception raise to use
explicit exception chaining: when catching Deployment.DoesNotExist in the except
block of _resolve_deployment, replace the current raise CommandError(...) with
raise CommandError(f"Deployment {deployment_id} not found") from None so the
original DoesNotExist traceback is suppressed and the intent of exception
translation is clear.

In `@ami/main/migrations/0089_add_regroup_permissions.py`:
- Around line 64-87: The revoke_new_permissions function currently removes both
codenames from both role groups; change its logic to mirror the grant logic in
the migration by mapping codenames to the exact target groups before removal:
when codename == "regroup_sessions_deployment" only target groups ending with
"_ProjectManager", and when codename == "run_regroup_events_job" target both
"_ProjectManager" and "_MLDataManager"; use that per-codename role_groups
selection for both group.permissions.remove(perm) and the
GroupObjectPermission.objects.filter(..., group__in=role_groups).delete() calls
so revoke precisely matches grant.

In `@ami/main/tests.py`:
- Around line 534-553: Update
test_invalid_session_time_gap_falls_back_to_default to also assert the warning
side-effect: use the test logging capture (e.g., pytest's caplog) while setting
project.session_time_gap_seconds to each bad_value, call
group_images_into_events(deployment=self.deployment), and then assert caplog
captured a WARNING-level message from the module that contains text indicating
the fallback (e.g., mentions "fallback", "session_time_gap_seconds", or the
bad_value) so the test verifies both the event count and that a warning was
emitted when the invalid gap was clamped.
- Around line 598-605: The test currently may skip cache.delete(lock_key) if the
assertion fails; modify the test around the _regroup_lock block so that
cache.delete(lock_key) always runs by placing the post-lock assertions and
cache.delete into a try/finally (or move cache.delete into a finally) that
ensures cleanup regardless of assertion failures; reference the existing
_regroup_lock(self.deployment.pk) context, the lock_key variable, and
cache.delete to locate where to add the finally guard.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7ccaf86d-d49c-4cba-a120-d0d96c841289

📥 Commits

Reviewing files that changed from the base of the PR and between 76b52af and ba7d85d.

📒 Files selected for processing (10)
  • ami/jobs/management/commands/test_regroup_job_e2e.py
  • ami/jobs/models.py
  • ami/jobs/tests/test_jobs.py
  • ami/main/admin.py
  • ami/main/api/views.py
  • ami/main/migrations/0089_add_regroup_permissions.py
  • ami/main/models.py
  • ami/main/tests.py
  • ami/tasks.py
  • ami/users/roles.py
✅ Files skipped from review due to trivial changes (1)
  • ami/tasks.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • ami/users/roles.py
  • ami/jobs/tests/test_jobs.py
  • ami/jobs/models.py

Comment thread ami/main/api/views.py
Comment thread ami/main/models.py
Comment thread ami/main/models.py
mihow and others added 2 commits June 5, 2026 10:14
… on sync path

Two CodeRabbit findings on the latest revision.

1. `ami/main/api/views.py:376` used `assert deployment.project` to guard the
   API action. `assert` is stripped under `python -O`, so a deployment whose
   project FK is null (the schema allows it — `Project` FK is nullable on
   `Deployment`) would 500 with a confusing AttributeError later in the view
   instead of returning a clean 400. Replace with an explicit
   `api_exceptions.ValidationError` that returns 400 with a structured
   detail.

2. `Deployment.sync_captures()`'s new `regroup_after=False` branch (added in
   this PR so `DataStorageSyncJob` can drive the regroup as a separate
   tracked stage) bypassed `Deployment.save()`'s `update_calculated_fields=
   True` block, which is where `self.update_children()` realigns child
   `Event`/`Occurrence`/`SourceImage` `project` pointers. Result: a sync Job
   whose deployment had moved between projects would leave child rows with
   stale project FKs, breaking project-scoped queries (default filters,
   visibility checks, etc.). The user-triggered `RegroupEventsJob` path is
   unaffected because it doesn't run sync_captures — the regression is
   specific to the new sync→regroup chain.

   Add an explicit `self.update_children()` call after the
   `update_calculated_fields(save=True)` refresh on the regroup_after=False
   branch, gated on `project_id` like the corresponding block in
   `Deployment.save()`.

Existing regroup tests still pass (4/4).

Co-Authored-By: Claude <noreply@anthropic.com>
CodeRabbit raised a real concern about the sync→regroup chain: if a
concurrent regroup holds the lock when DataStorageSyncJob's regroup stage
runs, the lock-miss branch in `group_images_into_events` returns `[]` and
the just-synced captures sit ungrouped until the next save/sync.

A previous draft of this fix added a post-lock re-enqueue
(`regroup_events.delay()` after the `with _regroup_lock:` block) that called
`deployment_events_need_update` and chained a follow-up task. A fresh
review found that approach risks cascading enqueues on busy deployments —
the lock serialises work but not enqueueing, so concurrent
`Deployment.save` autoregroups during a slow run can pile redundant tasks
into the queue (no infinite loop, but tens of redundant runs plausible).

Drop the re-enqueue. Instead, surface a clear warning on the Job itself
when the sync stage's regroup is skipped: we know `total_files` from the
preceding sync stage, so we can say "sync added N files but regroup was
skipped" directly on the Job that an admin is looking at. The captures
will be picked up by the next Deployment.save autoregroup
(models.py:1110-1113), which is the existing safety net for the general
case.

Trade-off vs the previous draft: a sync that races with a concurrent
regroup defers its new captures to the next save/sync (small window, no
data loss) but does NOT silently appear to have succeeded — the warning
lands in the Job log where it belongs.

No new tests; existing 4 regroup/lock tests + 12 grouping tests pass
(16/16).

Co-Authored-By: Claude <noreply@anthropic.com>

@loppear loppear left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Narrowing the scope to just using this admin Project setting is good enough for me. It seems the changes to move this into an explicit job for better logging during storage sync are motivated by existing issues, so if this helps clean that up ok!

…al-regroup requirement

The previous help text just said "Time gap in seconds to consider a new
session" — it didn't tell the user that changing the value doesn't
regroup existing captures, which is the most common point of confusion
when admins edit the field.

Now points users to the Deployments admin "Regroup captures into
sessions" bulk action so they can apply the change. Surfaces via
Django admin form, DRF browsable API, and OPTIONS schema.

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow

mihow commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator Author

Thanks for your review @loppear. I plan to do some merging & deploying in the afternoon today.

Four fixes surfaced by an independent takeaway review:

- Count actual SourceImage rows assigned to an event for the "Captures
  grouped" stat instead of the distinct-timestamp count. Two captures can
  share a timestamp, so the old count understated the work done.
- Lower the regroup lock TTL and the matching Celery time limits to 10 min
  soft / 11 min hard. group_images_into_events is mostly batch SQL and
  finishes in seconds, so a tight limit bounds the worst-case stuck-lock
  window when a worker dies hard (OOM, SIGKILL, eviction) and the finally
  block never runs to release the lock.
- Document that RegroupEventsJob and the bare regroup_events task only
  group; propagating project_id to children stays on Deployment.save().
- Log a clear "regroup skipped" message when the per-deployment lock is
  already held, instead of a misleading "now has 0 events".

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow mihow force-pushed the feat/configurable-session-gap branch from 8b5c72c to c56a03b Compare June 9, 2026 23:15
@mihow mihow merged commit 13203e4 into main Jun 9, 2026
7 checks passed
@mihow mihow deleted the feat/configurable-session-gap branch June 9, 2026 23:36
mihow added a commit that referenced this pull request Jun 10, 2026
Main now has 0089_add_regroup_permissions and 0090_session_time_gap_seconds_help_text
(PR #1292), which made the branch's 0089/0091 migrations a second leaf in the
migration graph and failed CI. Renumber to 0091/0092 on top of main's 0090.

Co-Authored-By: Claude <noreply@anthropic.com>
mihow added a commit that referenced this pull request Jun 16, 2026
* Model and API view for SourceImageThumbnail

* Thumbnails: add field to captures serializer

* With internal prefix

* black postcommit

* black?

* black settings

* Better handling for thumbnail invalidation and deletion.

* Thumbnail lookup by label and check for changed settings width

* Move thumbnail generation to method on SourceImage

* Thumbnails: handle uploaded source images (without datasource), 302 redirects, and add index for (source_image, label).

* merge migrations

* UI capture data: add and use thumbnail property accessors

* Add thumbnails to api responses for events with captures

* revert UI capture.src property, only use thumbnails explicitly

* fix: prioritize API response size before natural size

* Merge migrations

* Rebase thumbnail migrations

* ami.utils.media.fetch_image_content for thumbnail image source

* fix(thumbnails): convert non-RGB source images to RGB before JPEG encode

JPEG natively supports only L, RGB, and CMYK modes. Source images in
RGBA/P/LA/PA modes (common for uploaded PNGs) raised
"OSError: cannot write mode <X> as JPEG" on `img.save(format="JPEG")`,
500-ing the thumbnail endpoint. Convert to RGB first when needed.

Adds a regression test that feeds an RGBA PNG through find_or_generate
and asserts the redirect succeeds and the thumbnail row lands.

Closes review thread on ami/main/models.py L2241.

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(thumbnails): handle SourceImage.last_modified=None on regen check

Two changes, one root cause.

`SourceImage.last_modified` tracks the source bytes' mtime (S3 LastModified
on the sync path) and is the freshness signal the thumbnail regen check uses
to decide whether to invalidate a cached thumbnail. It is NOT the same as
`updated_at` (BaseModel auto_now), which bumps on every row save (cached
counts, detection counts, etc.) and would force regen far too often.

The upload path (`create_source_image_from_upload`) never set this field,
so:
  1. Uploaded captures had `last_modified=None` forever, breaking parity
     with sync-created rows.
  2. The thumbnail regen check `thumb.last_modified < self.last_modified`
     raised `TypeError` on the 2nd+ request for any uploaded capture's
     thumbnail (datetime < None).

Fixes:

* `create_source_image_from_upload`: set `last_modified=timezone.now()` so
  uploaded captures have a real source mtime, matching the sync path's
  behaviour of copying the S3 LastModified header.
* `find_or_generate_thumbnail_for_label`: guard the comparison. None on
  either side now means "no signal of source change, trust the cached
  thumb" instead of crashing. Covers legacy rows synced before the upload
  backfill.

Tests:

* `test_thumbnail_source_last_modified_none_does_not_crash` reproduces the
  TypeError class of bug by nulling `last_modified` after a thumbnail row
  exists and asserts the 2nd request reuses the cache.
* `test_upload_handler_backfills_last_modified` covers the upload-path
  backfill (asserts the new field falls inside [before, after] of the
  upload call).

Closes review thread on ami/main/models.py L2227.

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(thumbnails): drop no-op f-strings; return 405 on list

Two F541 (Ruff: f-string without placeholders) hits, plus a small
behavioural cleanup on the list action.

* `SourceImageThumbnailViewSet.list` now raises
  `MethodNotAllowed` (405) instead of `NotFound` (404). The route exists
  for `/captures/thumbnails/<pk>/?label=...` only; listing has no
  defined semantics. 405 sets the Allow header and is the correct REST
  response.
* Updated `test_thumbnail_no_list` to expect 405 and to use a plain
  string literal (the previous test used an f-string with no
  placeholders).

Closes review thread on ami/main/tests.py L237 (Ruff F541).

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(thumbnails): guard empty THUMBNAILS['SIZES'] config

When `settings.THUMBNAILS['SIZES']` is empty (misconfiguration), the
default-label fallback `next(iter(_sizes))` raised `StopIteration` and
surfaced as a 500 with no useful detail. Replace with an explicit empty
check that returns 404 with a clear API error message, and use
`next(iter(_sizes))` only after we know the dict is non-empty (also
switch the fallback expression to `or next(...)` since the previous
form passed `next(iter(...))` as the default arg unconditionally — fine
for non-empty dicts but still worth tidying alongside the guard).

Adds a test that overrides THUMBNAILS with an empty SIZES dict and
asserts the endpoint returns 404 with the configured-error message.

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(media): add finite default timeout to fetch_image_content

`requests.get(url)` with no timeout blocks indefinitely if the upstream
storage is slow or unreachable. The thumbnail-generation path
(`SourceImage.find_or_generate_thumbnail_for_label`) runs this
synchronously inside a Django web request, so a stuck upstream ties up
a gunicorn worker; a gallery's worth of stuck calls drains the pool.

Default timeout is `(connect=5s, read=30s)`. Callers that need
different limits (e.g. very large originals on a slow link) can override
by passing `timeout=`. Behaviour for fast / responsive upstreams is
unchanged.

Tests verify the default is finite and that explicit overrides are
honored.

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(thumbnails): use upsert + drop pre-save delete-loop to fix concurrent-gen race

Two related bugs in `find_or_generate_thumbnail_for_label`:

1. Under concurrent gen for the same (source_image, label) pair the loser
   could either (a) hit the new UniqueConstraint and 500 with
   IntegrityError when its delete-loop ran before the winner's INSERT,
   or (b) destroy the winner's row + storage blob when the delete-loop
   ran after. Either outcome is wrong; both were possible because the
   code did `filter(label=label).delete()` then `thumbnails.create(...)`
   instead of a single atomic upsert.

2. The pre-save delete-loop also deleted the storage blob BEFORE the
   new `default_storage.save()`, which on storage backends that suffix
   colliding keys (FileSystemStorage default) left the row pointing at
   a path that the next save could overwrite in the same way the racer
   would. Belt-and-braces failure where the cleanup order itself was the
   bug.

Fix:

* Replace the delete-then-create with `update_or_create` keyed on
  `label`. Django serializes via the UniqueConstraint and, on collision,
  catches IntegrityError on INSERT and re-runs as UPDATE. No 500s, no
  destructive deletes of concurrent rows.
* Snapshot the prior `path` before the new save and clean it up after
  the row update only when the backend gave us a new key
  (`prior_path != thumbnail_path`). Backends that overwrite in place
  (boto3 with `AWS_S3_FILE_OVERWRITE=True`) skip the cleanup and avoid
  the file-deletion race entirely.
* Force-bump `thumb.last_modified` on the UPDATE branch via
  `QuerySet.update` because `auto_now_add=True` only fires on INSERT —
  without this, regen via source-mtime bump would re-trigger regen on
  every subsequent request as the field's pre_save callback preserved
  the existing (now-stale) value.

Tests:

* `test_thumbnail_regen_reuses_row_via_upsert`: delete the storage blob
  out from under a cached row, re-request, assert the row's PK is
  preserved (proving the UPDATE path was taken, not a fresh INSERT).
* `test_thumbnail_exists_newer_modified_source` updated to assert row
  reuse + freshness bump rather than the previous (incidental) PK
  change.

The `SourceImageThumbnail.last_modified` field semantics are still
muddled (see PR thread; rename or drop tracked separately).

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(thumbnails): CASCADE source_image FK and clean storage blob on row delete

Thumbnails are pure derivatives of (source_image, label). The previous
`on_delete=SET_NULL, null=True` left orphan rows AND their storage blobs
behind forever whenever the parent SourceImage was deleted, with no
other reaper in the codebase. Silent storage growth and orphan-row
accumulation; CodeRabbit flagged this in PR #1306 and the thread was
closed without a fix.

Changes:

* `SourceImageThumbnail.source_image` → `on_delete=CASCADE`, no longer
  nullable. Deleting a SourceImage now takes its thumbnail rows with it.
* New `pre_delete` signal in `ami.main.signals` does best-effort
  `default_storage.delete(instance.path)` so the blob doesn't outlive the
  row. Fires on explicit row deletes AND CASCADE deletes (signals fire
  for both), so this covers parent SourceImage deletion automatically.
  Storage failures are logged and swallowed — a transient backing-store
  error must not block the delete chain.
* Migration 0090 is two-step: reap pre-existing rows where
  `source_image_id IS NULL` (the SET_NULL aftermath) first, then
  AlterField the FK. Orphan-row reap runs through the same pre_delete
  signal so their blobs are cleaned alongside.

Tests:

* `test_source_image_delete_cascades_to_thumbnails` — create 2 thumbnail
  rows for a capture, delete the capture, assert both rows gone via
  CASCADE.
* `test_thumbnail_delete_removes_storage_blob` — plant a real blob in
  storage, wire a row to it, delete the row, assert the blob is gone
  (proving the pre_delete signal fired and cleaned up).

Co-Authored-By: Claude <noreply@anthropic.com>

* chore: black formatting

* chore(ui): prettier strip trailing whitespace in capture.ts

* docs(sourceimage): help_text for storage-derived fields; admin read-only

`size`, `last_modified`, `checksum`, `checksum_algorithm` all record
metadata read from the source image file in storage during sync or
upload, not row-level mtime. Document that in help_text so the field's
intent shows up in the admin UI and in API schema output (the four
fields previously had no help_text at all).

Mark the cluster read-only in the SourceImage admin so operators can't
clobber the cached metadata by mistake through the change form. Left
writable on the API so a project owner can still create SourceImage
entries by hand via an API client when the sync/upload flows don't fit;
the sync endpoint at POST /api/v2/deployments/<pk>/sync/ remains the
default population path.

Adds migration 0091 for the help_text changes (no schema change beyond
the AlterField metadata).

Co-Authored-By: Claude <noreply@anthropic.com>

* chore(migrations): collapse 0089 SET_NULL + 0090 CASCADE into single 0089

0089 was renamed-into-existence this branch (originally 0088, bumped
when main added 0088_detection_det_srcimg_created_idx). The follow-up
0090 only existed to switch the just-created FK from SET_NULL to
CASCADE — same model, same concern, accidentally split because 0089
shipped wrong on the first cut.

This commit:

* Rewrites 0089 to create the FK with CASCADE from the start (drops
  null=True and the SET_NULL on_delete). The RunPython orphan-reaper
  from 0090 is gone — a fresh table can't have orphans.
* Deletes 0090.
* Updates 0091's dependency to point at 0089 (was 0090).

End DB state is identical to the old 0089+0090 pair, so anyone who
already migrated past 0090 just pulls — Django sees 0089 as
already-applied (same name) and skips. 0090 will appear as an applied
migration with no matching file; warning only, no breakage. Local devs
who want a clean django_migrations table can manually
DELETE FROM django_migrations WHERE app='main' AND name='0090_sourceimagethumbnail_cascade_source_image'.

Safe because this migration only landed on the in-flight feature
branch — no production state to preserve. CI re-runs from scratch
each time and passed (TestImageThumbnailViews 18/18 with --noinput on
a fresh DB).

Co-Authored-By: Claude <noreply@anthropic.com>

* chore(migrations): drop over-defensive comment from collapsed 0089

The collapse rationale is unnecessary in-tree — the branch was never
merged, so no production state ever ran the SET_NULL form. Anyone
mid-rebase on this branch will just see the new migration replace the
old one cleanly.

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(ui): session list uses capture thumbnails instead of full-size originals

The sessions list and gallery rendered example captures from capture.url —
the full-resolution original (often 9000px+ wide, fetched straight from
object storage). The events API has provided a thumbnails dict on
example_captures since this branch; use the medium (1024px) thumbnail with
a fallback to the original when thumbnails are absent, matching the
pattern in capture.ts. The medium size also keeps the og:image meta tag on
session details at a reasonable resolution.

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(migrations): renumber thumbnail migrations after main merge

Main now has 0089_add_regroup_permissions and 0090_session_time_gap_seconds_help_text
(PR #1292), which made the branch's 0089/0091 migrations a second leaf in the
migration graph and failed CI. Renumber to 0091/0092 on top of main's 0090.

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(tests): use real taxon pk in role-permission test data

TestRolePermissions hardcoded taxon_id="5" in the identification
create payload. Any test class that runs earlier and creates taxa
advances the id sequence, so pk 5 stops existing and the POST fails
validation with 400 before the permission check returns 403. The new
TestImageThumbnailViews class in this branch sorts alphabetically
before TestRolePermissions and triggered exactly that. Use the
occurrence's own determination pk instead.

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(thumbnails): record requested spec width on thumbnail rows

PIL.Image.thumbnail() preserves aspect ratio and can emit a width 1-2px
below the requested spec (239 for a 240 spec was observed in a dev
deployment). The regen gate compares the stored width against the spec
with strict equality, so storing the encoder output made affected rows
regenerate on every request and download the original each time.

Store the requested spec width instead — the row identifies which
configured size it satisfies, matching how derivative-image systems key
their caches (sorl-thumbnail/easy-thumbnails hash the requested
geometry, WordPress keys by registered size name). The actual encoded
height is still stored; it is informational and never compared.

Existing rows that hold an encoder-output width self-heal with a single
regeneration on their next request.

Co-Authored-By: Claude <noreply@anthropic.com>

* perf(thumbnails): trust the cached row without a storage HEAD per request

The regen gate called default_storage.exists() on every warm hit, which
is a HEAD request against the object store (or a stat on filesystem
backends) per thumbnail per page view. The cached row is already the
warm signal; the existence check only protected against blobs deleted
out of band, which currently has no code path. Orphaned rows now show a
broken image until the row is removed and the next request regenerates.

Co-Authored-By: Claude <noreply@anthropic.com>

* feat(thumbnails): encode thumbnails as progressive JPEGs

Progressive encoding lets browsers paint a blurry-to-crisp preview as
bytes arrive instead of filling in top-to-bottom. optimize adds an extra
Huffman pass for slightly smaller files and quality=82 raises Pillow's
default of 75 for screen-sized thumbnails. All three flags only cost
time on the cold generation path.

Co-Authored-By: Claude <noreply@anthropic.com>

* perf(thumbnails): let browsers cache the thumbnail redirect

302 responses are not cached by browsers unless explicitly allowed, so
every page view re-paid a full Django round trip (auth, object lookup,
freshness check) per thumbnail, even when nothing changed. A five-minute
private Cache-Control covers typical back-and-forth browsing within a
session while staying well below the shortest presigned-URL lifetime
(django-storages AWS_QUERYSTRING_EXPIRE defaults to 3600s), so a cached
redirect can never point at an expired signature.

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(thumbnails): regenerate when the cached row has no stored path

A row whose path is empty (failed or interrupted generation) passed the
regen gate and redirected the browser to the storage root. Treat a blank
path as cold and regenerate.

Co-Authored-By: Claude <noreply@anthropic.com>

* refactor(thumbnails): trim review-narration comments to load-bearing facts

Several comment blocks added during review explained why a change was
correct relative to code that no longer exists. Keep only the constraints
a future reader needs (spec-width vs encoder output, auto_now_add bump,
presign-lifetime bound); drop the change-justification prose, which lives
in the commit history.

Co-Authored-By: Claude <noreply@anthropic.com>

* refactor(thumbnails): trim remaining narration comments found in full sweep

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Anna Viklund <annamariaviklund@gmail.com>
Co-authored-by: Michael Bunsen <notbot@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Expose session regrouping as a job stage with stats Expose session regrouping as a job stage when syncing images

4 participants