Allow admins to customize the time gap between sessions#1292
Conversation
✅ Deploy Preview for antenna-preview canceled.
|
✅ Deploy Preview for antenna-ssec canceled.
|
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds per-deployment regroup cache locking and project-driven session gap handling to core grouping; introduces RegroupEventsJob and an explicit regroup stage in DataStorageSyncJob; switches admin/API to enqueue jobs; adds permissions, role updates, unit/integration tests, and an e2e test command. ChangesRegroup sessions feature
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Possibly related issues
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@ami/main/api/views.py`:
- Around line 320-323: The view currently enqueues regroup_events_task
unconditionally allowing overlapping runs; modify the handler that calls
self.get_object() and regroup_events_task.delay(deployment.pk) to acquire a
per-deployment in-flight lock (e.g., a Redis/cache-based lock keyed by
f"regroup:{deployment.pk}") before queuing; if the lock cannot be obtained
return an early HTTP 409 (or similar) Response and log that a regroup is already
running, otherwise enqueue the task, keep the lock for the duration of the task
(or set a TTL and have the task release the lock) and ensure regroup_events_task
(or its wrapper) releases the lock on completion/failure so duplicate triggers
are prevented.
In `@ui/src/pages/project-details/processing-form.tsx`:
- Around line 30-35: The sessionTimeGapSeconds field currently validates but
will submit a string because React Hook Form doesn't coerce number inputs;
update the field's rules for sessionTimeGapSeconds to include a setValueAs that
converts the incoming value to a number (e.g., Number or parseInt) so the form
value is stored as a numeric type; modify the rules object for
sessionTimeGapSeconds to add setValueAs and keep required/min validations
intact.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 80284157-fc1e-41e0-8398-96a13aa37201
📒 Files selected for processing (4)
ami/main/api/views.pyami/main/models.pyami/main/tests.pyui/src/pages/project-details/processing-form.tsx
There was a problem hiding this comment.
Pull request overview
This PR wires the existing per-project session_time_gap_seconds setting into backend session (Event) grouping, exposes the setting in the UI, and adds an API action to trigger regrouping asynchronously so admins can re-cluster historical captures after changing the threshold.
Changes:
- Backend:
group_images_into_events()now derives its defaultmax_time_gapfromdeployment.project.session_time_gap_secondswhen not explicitly provided. - API: adds
POST /api/v2/deployments/<pk>/regroup-sessions/to enqueue the existingami.tasks.regroup_eventsCelery task. - UI + tests: exposes the setting as a numeric input on the project Processing settings form and adds unit tests covering cross-midnight behavior and the project-setting fallback.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
ami/main/models.py |
Uses the project’s session_time_gap_seconds as the default grouping gap instead of a hardcoded 120 minutes. |
ami/main/api/views.py |
Adds a deployment-level action endpoint to queue regrouping in Celery. |
ui/src/pages/project-details/processing-form.tsx |
Adds a numeric form field to edit sessionTimeGapSeconds in the Processing settings UI. |
ami/main/tests.py |
Adds tests documenting cross-midnight split/merge behavior and verifying the project setting is used when no explicit gap is passed. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…oup task Addresses takeaway-review feedback on PR #1292: - Add `regroup_sessions_deployment` custom Project permission so non-superusers with the ProjectManager role can hit `POST /deployments/<pk>/regroup-sessions/` (mirrors `sync_deployment`). Without this, `BaseModel.check_custom_permission` computes the codename `regroup_sessions_deployment` and the perm doesn't exist, so the action 403s for everyone except superusers. - Wrap `ami.tasks.regroup_events` in a per-deployment cache lock. Concurrent enqueues (double-clicks, sync_captures auto-regroup racing with manual trigger) collapse to a single run instead of overlapping on the same rows. Lower-level than the view so all callers (Deployment.save, admin bulk action, new API action) are protected. - Defensive clamp on `session_time_gap_seconds` — fall back to the historical 120-minute default if the value is 0 / negative, with a warning log. Stops pathological "every gap = new event" behavior when admins enter bad values. - Migration 0085 adds the new perm to `Project.Meta.permissions` and grants it to existing `ProjectManager` role groups via a data migration (mirrors how 0084 revoked `delete_job`). - Tests: - test_invalid_session_time_gap_falls_back_to_default exercises 0 / -1 / -7200 - test_regroup_events_task_is_idempotent_under_concurrent_calls pre-takes the cache lock, asserts a second call skips group_images_into_events Co-Authored-By: Claude <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@ami/tasks.py`:
- Around line 106-119: The current lock uses cache.add(lock_key, 1, ...) and
unconditionally cache.delete(lock_key) which can remove a newer owner's lock;
change to an ownership token pattern: when acquiring the lock (where cache.add
is called) generate a unique token (e.g. UUID) and store that token as the cache
value instead of 1 (use REGROUP_EVENTS_LOCK_TTL as before), keep the token in a
local variable, and on release only remove the key if the stored cache value
equals your token (fetch with cache.get(lock_key) and only call
cache.delete(lock_key) when it matches) or use the cache backend's atomic
compare-and-delete if available; update the lock handling around lock_key,
cache.add, cache.get and cache.delete to use this token ownership check.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 251316be-c276-4473-a3fc-36a150e056ed
📒 Files selected for processing (5)
ami/main/migrations/0085_add_regroup_sessions_deployment_permission.pyami/main/models.pyami/main/tests.pyami/tasks.pyami/users/roles.py
✅ Files skipped from review due to trivial changes (1)
- ami/users/roles.py
🚧 Files skipped from review as they are similar to previous changes (1)
- ami/main/models.py
|
Claude says: Push 50d618e addresses the takeaway-review feedback from CodeRabbit and Copilot:
New tests covering the lock and the clamp:
11/11 |
There was a problem hiding this comment.
Nice!!
I still think this setting would make more sense in a new section, instead of in Project->Settings->Processing. I know you said this can affect the logic for occurrences once tracking is enabled, but I think this is a bit far-fetched. I'm thinking, if I were about to look for this setting as a user, where would I first go? I'm not sure the processing section would be the place. Maybe this belongs more with upcoming sampling schedule configuration we have discussed?
I don't have very strong opinions here though, so if you feel this is where it belongs, go for it! :)
There was a problem hiding this comment.
Looks like it works for my case. I agree with @annavik that the Processing settings form isn't the first place I looked for this, even having looked at the PR, would make sense to me on General settings (although maybe eventually moved to a less-likely settings tab there, and not opposed to it staying where it is for the time being.)
EDIT: I think regroup_events should probably happen on setting change, not sure how expensive it is or what the conditions where you wouldn't want to regroup (yet?) but I think I'm going to expect this setting to apply - they're going to be regrouped to the new setting anyways on the next sync right?
| sessionTimeGapSeconds: { | ||
| label: 'Maximum time gap between sessions (default)', | ||
| description: | ||
| 'Maximum time gap (in seconds) between consecutive images before a new session is started. Default is 7200 seconds (2 hours).', |
There was a problem hiding this comment.
Might be nice if this was presented to the user as minutes (or hours) rather than seconds.
There was a problem hiding this comment.
Claude says: Good point — units in seconds are awkward in a settings UI. The UI form field has been removed entirely in 76b52af; the setting is now admin-only (via Django admin or the API). When we surface it again it should be minutes or hours, and probably at the deployment level rather than the project level (raised in the PR description's "Follow-ups" section).
There was a problem hiding this comment.
Claude says: Captured in the body's follow-up section — Notes / follow-ups now reads "the form input should be minutes or hours, not seconds (per @loppear's review comment on the original draft)" alongside the related point about auto-regroup on setting change. The UI form input was pulled from this PR (no UI field ships here), so the unit choice lands with the follow-up UI work rather than gating this PR. Leaving the thread for your call on resolving.
|
Claude says: Pushed 76b52af which restructures the PR based on @mihow's takeaway-review feedback. Scope shift — out: the UI form field for Scope shift — in:
Tests: 12/12 |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
ami/main/models.py (1)
1523-1526:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winBug:
len(values)returns string length, not duplicate count.
valuesis a newline-joined string (line 1520-1522), solen(values)gives the character count, not the number of duplicate timestamps. Useduplicate_timestamp_countwhich was computed on line 1518.🐛 Proposed fix
logger.warning( - f"Found {len(values)} images with the same timestamp in deployment '{deployment}'. " + f"Found {duplicate_timestamp_count} images with the same timestamp in deployment '{deployment}'. " f"Only one image will be used for each timestamp for each event." )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ami/main/models.py` around lines 1523 - 1526, The warning message incorrectly uses len(values) (which is the character length of the newline-joined string) instead of the numeric duplicate count; update the logger.warning call in the block around logger.warning(...) to use duplicate_timestamp_count rather than len(values) so it reads f"Found {duplicate_timestamp_count} images with the same timestamp in deployment '{deployment}'..." and leave the rest of the message and variables (values, deployment) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@ami/main/models.py`:
- Around line 1523-1526: The warning message incorrectly uses len(values) (which
is the character length of the newline-joined string) instead of the numeric
duplicate count; update the logger.warning call in the block around
logger.warning(...) to use duplicate_timestamp_count rather than len(values) so
it reads f"Found {duplicate_timestamp_count} images with the same timestamp in
deployment '{deployment}'..." and leave the rest of the message and variables
(values, deployment) unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f0173c1c-89f3-46b5-93d1-7be34a87e027
📒 Files selected for processing (9)
ami/jobs/models.pyami/jobs/tests/test_jobs.pyami/main/admin.pyami/main/api/views.pyami/main/migrations/0085_add_regroup_permissions.pyami/main/models.pyami/main/tests.pyami/tasks.pyami/users/roles.py
✅ Files skipped from review due to trivial changes (1)
- ami/tasks.py
The session_time_gap_seconds project setting was added in #918 but never used. group_images_into_events() still hardcoded a 120-minute gap. This wires the field through and exposes it in the Processing settings UI so each project can override the session boundary. Closes #906 Co-Authored-By: Claude <noreply@anthropic.com>
POST /api/v2/deployments/<pk>/regroup-sessions/ queues the existing ami.tasks.regroup_events Celery task for the deployment. The task calls group_images_into_events(), which now reads the project's session_time_gap_seconds setting. User-facing terminology is "sessions"; "Event" is retained as the internal model name for backwards compatibility. Returns 202 with task_id, deployment_id, project_id. Refs #906 Co-Authored-By: Claude <noreply@anthropic.com>
Three new TestImageGrouping cases that exercise the new session_time_gap_seconds wiring: - test_cross_midnight_bursts_split_by_short_gap: documents the user- reported pain (2 bursts with off-window > default gap, crossing midnight, split into 2 Events on different dates) and proves that bumping max_time_gap past the off-window collapses them to 1 Event. - test_same_date_bursts_merge_regardless_of_gap: documents the inverse behavior — bursts on the same calendar date collide on group_by regardless of gap setting. Acts as a regression target for any future rework of the date-keyed Event reuse path (per the #904 caveat at models.py:1516). - test_session_time_gap_seconds_is_used_when_no_explicit_gap: verifies group_images_into_events falls back to the project setting when no max_time_gap is passed, including refreshing the cached relation after the setting changes. Refs #906 Co-Authored-By: Claude <noreply@anthropic.com>
…oup task Addresses takeaway-review feedback on PR #1292: - Add `regroup_sessions_deployment` custom Project permission so non-superusers with the ProjectManager role can hit `POST /deployments/<pk>/regroup-sessions/` (mirrors `sync_deployment`). Without this, `BaseModel.check_custom_permission` computes the codename `regroup_sessions_deployment` and the perm doesn't exist, so the action 403s for everyone except superusers. - Wrap `ami.tasks.regroup_events` in a per-deployment cache lock. Concurrent enqueues (double-clicks, sync_captures auto-regroup racing with manual trigger) collapse to a single run instead of overlapping on the same rows. Lower-level than the view so all callers (Deployment.save, admin bulk action, new API action) are protected. - Defensive clamp on `session_time_gap_seconds` — fall back to the historical 120-minute default if the value is 0 / negative, with a warning log. Stops pathological "every gap = new event" behavior when admins enter bad values. - Migration 0085 adds the new perm to `Project.Meta.permissions` and grants it to existing `ProjectManager` role groups via a data migration (mirrors how 0084 revoked `delete_job`). - Tests: - test_invalid_session_time_gap_falls_back_to_default exercises 0 / -1 / -7200 - test_regroup_events_task_is_idempotent_under_concurrent_calls pre-takes the cache lock, asserts a second call skips group_images_into_events Co-Authored-By: Claude <noreply@anthropic.com>
…l sites Replaces the direct-Celery API/admin entry points with a new RegroupEventsJob so admins can see regroup progress and failures in the Jobs UI like every other long-running task. The existing DataStorageSyncJob now runs grouping as an explicit second stage instead of relying on Deployment.save() autoregroup, so a post-sync regroup failure flips the Job to FAILURE rather than silently succeeding (closes #1157, #1158). Other changes in this commit: * Move the per-deployment idempotency lock from ami.tasks.regroup_events into group_images_into_events() itself, so all four call sites (RegroupEventsJob, sync-stage regroup, Deployment.save autoregroup, the bare Celery task) collapse to a single in-flight run rather than only the Celery-task entry point being protected. * Use a uuid token on the lock value and only delete the key if it still matches our token. Fixes the lock-release-clobbers-newer-owner race CodeRabbit flagged. * Add run_regroup_events_job permission and grant it to MLDataManager (which ProjectManager inherits from). regroup_sessions_deployment stays on ProjectManager. Both wired through the renamed migration 0085_add_regroup_permissions. * Strip the UI form field for session_time_gap_seconds; field is now admin-only until we decide whether it belongs on Project or Deployment. * Swap the admin "regroup" and "sync" bulk actions to create Jobs so the admin path matches the API path. Tests: - TestImageGrouping: replaced the old task-level idempotency test with one that exercises the new in-function lock; added a token-release test that verifies an expired-then-reacquired lock isn't clobbered. - TestRegroupEventsJob (new): end-to-end Job run, stats populated, and a failure-propagates assertion. - TestDataStorageSyncJobIncludesRegroupStage (new): sync Job exposes the regroup stage and a regroup failure surfaces as Job failure. Co-Authored-By: Claude <noreply@anthropic.com>
Main has moved on to migration 0088. Rename 0085_add_regroup_permissions to 0089 and depend on 0088_detection_det_srcimg_created_idx to resolve the multiple-leaf-nodes conflict CI surfaced. Co-Authored-By: Claude <noreply@anthropic.com>
… log; add e2e command
The regroup stage param `events_created` was incrementing for every event
visited rather than only for events newly created by `get_or_create`. The bool
returned from `get_or_create` was being discarded. Track `events_created_count`
separately and use it in the stage-param write.
Drive-by: `logger.warning(f"Found {len(values)} ...")` was reporting the
character length of the newline-joined duplicate-timestamp report rather than
the duplicate count. Pre-existing (since 0954c38, 2024-09-12) but sits inside
the very block whose stats this PR now surfaces. Use `duplicate_timestamp_count`,
and cap the in-log sample at 20 timestamps so very dirty deployments don't
flood logs.
Add `manage.py test_regroup_job_e2e {regroup|sync|concurrent}` to exercise the
new Job framework path against a real stack: `regroup` runs a `RegroupEventsJob`
and dumps stage params; `sync` runs a `DataStorageSyncJob` and asserts the
two-stage chain; `concurrent` fires two regroups back-to-back and verifies the
lock semantics. Mirrors `test_ml_job_e2e`. Used to validate PR #1292 against
deployment 74 of the Antenna dev stack.
Co-Authored-By: Claude <noreply@anthropic.com>
72dfbeb to
d93aa24
Compare
|
Claude says: Pushed two follow-up commits + ran live-stack E2Es. Summary: New commits on
Live-stack E2E validation (Antenna dev box, deployment 74
PR body updated to reflect all of this — new @loppear's open thread replied to, leaving resolved/unresolved for your call. Ready for re-review. |
|
@annavik @loppear thanks for reviewing. I decided to remove the UI changes. There is more to think about the placement and behavior when a user changes the setting. Also the setting does not act they way the user might expect. It DOES work in Luke's case, but it won't in others. I suggest we move the setting to the Station configuration and also consider simplifying the grouping algorithm. For now, this PR connects the existing function to the existing project setting for the time gap. It also wraps the session regrouping in a Job so you can see it running and when if it fails. You can now also see the logs for the session grouping stage when syncing a Station to it's data storage. I updated the title to say "allow admins" to customize instead of "allow users" :) |
The DeploymentAdmin list view ran four live per-row aggregates for each deployment displayed: - `start_date` → `SourceImage.objects.filter(event__deployment=obj).aggregate(Min)` (joins SourceImage→Event, scans the full per-deployment timeline) - `end_date` → `SourceImage.objects.filter(deployment=obj).aggregate(Max)` - `events_count` → `obj.events.count()` - `captures_count` → re-read of `data_source_total_files` (cached, but wrapped in a method that suggested it was computed) With 251 deployments on the page and ~5 expensive queries per row, the list view became unusable on stacks holding deployments with >100k captures (e.g. one deployment with 638k captures dominated end_date's wall time). Deployment already denormalizes `captures_count`, `events_count`, `data_source_total_size`, `first_capture_timestamp` and `last_capture_timestamp` via `update_calculated_fields()`. Switch the admin list to read those instead and drop the custom aggregates. Add `@admin.display(ordering=...)` so the columns remain sortable against the underlying cached fields. Wall time on the dev box (251 deployments): ~0.6s warm, ~0.76s cold. Co-Authored-By: Claude <noreply@anthropic.com>
The regroup stage params were stored with machine-style names (`captures_grouped`, `events_created`, ...), which is what the Jobs UI displays to admins. Other JobType stages already use title-cased, space-separated names (`Total files`, `Captures added`, etc.); this commit brings the regroup stage in line. Param **names** (UI labels): Captures grouped, Events created, Events touched, Empty events deleted, Duplicate timestamps, Ungrouped captures, Captures missing timestamp. Param **keys** (for retrieval, slugify(name) with `-` → `_`): captures_grouped, events_created, events_touched, empty_events_deleted (was `events_deleted_empty`), duplicate_timestamps, ungrouped_captures, captures_missing_timestamp (was `no_timestamp_captures`). Five keys are stable; two changed shape. Existing Job rows keep their historical name/key on disk — only new runs use the new labels. Centralise the list as `REGROUP_STAGE_PARAM_NAMES` in ami.jobs.models so RegroupEventsJob and DataStorageSyncJob's regroup stage share one source of truth. In `_group_images_into_events_locked`, swap the kwargs-style `update_stage(...)` call (which couldn't carry names with spaces) for an explicit `add_or_update_stage_param(stage_key, name, value)` loop driven by a label→value dict. Update tests to look up by the new keys and assert the human-readable names exist; update `test_regroup_job_e2e` to print `name [key]: value` so the stable retrieval key is visible alongside the UI label. Co-Authored-By: Claude <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (4)
ami/main/tests.py (2)
534-553: ⚡ Quick winAssert the warning side-effect for invalid gap fallback.
Line 548 currently validates clamped behavior only. Add a log assertion so the “fallback + warning” contract is protected, not just the event count.
🔧 Proposed test hardening
for bad_value in (0, -1, -7200): Event.objects.filter(deployment=self.deployment).delete() self.project.session_time_gap_seconds = bad_value self.project.save() self.deployment.refresh_from_db() - group_images_into_events(deployment=self.deployment) + with self.assertLogs(level="WARNING") as logs: + group_images_into_events(deployment=self.deployment) count = Event.objects.filter(deployment=self.deployment).count() # Default 120-min gap on this cross-midnight pattern → 2 Events, # NOT 12 (which is what bad_value=0 would produce without the guard). assert count == 2, f"expected 2 Events at gap={bad_value} (default fallback), got {count}" + self.assertTrue( + any("session_time_gap_seconds" in msg for msg in logs.output), + "Expected warning log for invalid session_time_gap_seconds fallback", + )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ami/main/tests.py` around lines 534 - 553, Update test_invalid_session_time_gap_falls_back_to_default to also assert the warning side-effect: use the test logging capture (e.g., pytest's caplog) while setting project.session_time_gap_seconds to each bad_value, call group_images_into_events(deployment=self.deployment), and then assert caplog captured a WARNING-level message from the module that contains text indicating the fallback (e.g., mentions "fallback", "session_time_gap_seconds", or the bad_value) so the test verifies both the event count and that a warning was emitted when the invalid gap was clamped.
598-605: ⚡ Quick winMake cache cleanup unconditional in the lock-owner test.
If the assertion at Line 604 fails, Line 605 is skipped and can leak cache state into later tests. Wrap cleanup in
finally.🧹 Proposed cleanup guard
- with _regroup_lock(self.deployment.pk) as acquired: - assert acquired, "Should acquire the lock on first try" - # Simulate the TTL expiring and a newer caller taking the lock. - cache.set(lock_key, "newer-owner-token", timeout=60) - - # Our `finally` ran. The newer owner's lock should still be there. - assert cache.get(lock_key) == "newer-owner-token", "Expired-lock release must not clobber the newer owner" - cache.delete(lock_key) + try: + with _regroup_lock(self.deployment.pk) as acquired: + assert acquired, "Should acquire the lock on first try" + # Simulate the TTL expiring and a newer caller taking the lock. + cache.set(lock_key, "newer-owner-token", timeout=60) + + # Our `finally` ran. The newer owner's lock should still be there. + assert cache.get(lock_key) == "newer-owner-token", "Expired-lock release must not clobber the newer owner" + finally: + cache.delete(lock_key)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ami/main/tests.py` around lines 598 - 605, The test currently may skip cache.delete(lock_key) if the assertion fails; modify the test around the _regroup_lock block so that cache.delete(lock_key) always runs by placing the post-lock assertions and cache.delete into a try/finally (or move cache.delete into a finally) that ensures cleanup regardless of assertion failures; reference the existing _regroup_lock(self.deployment.pk) context, the lock_key variable, and cache.delete to locate where to add the finally guard.ami/jobs/management/commands/test_regroup_job_e2e.py (1)
60-75: ⚡ Quick winUse explicit exception chaining in except clause.
Line 64 should use
raise CommandError(...) from Noneto explicitly suppress theDoesNotExistexception from the traceback. This distinguishes a deliberate exception translation from an error during exception handling, following PEP 3134.♻️ Proposed fix
def _resolve_deployment(self, deployment_id: int) -> Deployment: try: deployment = Deployment.objects.get(pk=deployment_id) except Deployment.DoesNotExist: - raise CommandError(f"Deployment {deployment_id} not found") + raise CommandError(f"Deployment {deployment_id} not found") from None🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ami/jobs/management/commands/test_regroup_job_e2e.py` around lines 60 - 75, In _resolve_deployment, change the exception raise to use explicit exception chaining: when catching Deployment.DoesNotExist in the except block of _resolve_deployment, replace the current raise CommandError(...) with raise CommandError(f"Deployment {deployment_id} not found") from None so the original DoesNotExist traceback is suppressed and the intent of exception translation is clear.ami/main/migrations/0089_add_regroup_permissions.py (1)
64-87: 💤 Low valueConsider mirroring grant logic precision in revoke.
Lines 75-87 remove both permissions from both
_ProjectManagerand_MLDataManagergroups, but the grant function at lines 52-61 is more specific:regroup_sessions_deploymentgoes only to_ProjectManager, whilerun_regroup_events_jobgoes to both. Removing a permission from a group that never had it is harmless, but mirroring the grant logic would make the migration more self-documenting.♻️ Optional refactor to match grant specificity
def revoke_new_permissions(apps, schema_editor): Group = apps.get_model("auth", "Group") Permission = apps.get_model("auth", "Permission") ContentType = apps.get_model("contenttypes", "ContentType") GroupObjectPermission = apps.get_model("guardian", "GroupObjectPermission") try: project_ct = ContentType.objects.get(app_label="main", model="project") except ContentType.DoesNotExist: return - for codename in ("regroup_sessions_deployment", "run_regroup_events_job"): - try: - perm = Permission.objects.get(codename=codename, content_type=project_ct) - except Permission.DoesNotExist: - continue - role_groups = Group.objects.filter(Q(name__endswith="_ProjectManager") | Q(name__endswith="_MLDataManager")) - for group in role_groups: - group.permissions.remove(perm) - GroupObjectPermission.objects.filter( - permission=perm, - content_type=project_ct, - group__in=role_groups, - ).delete() + # Remove regroup_sessions_deployment from ProjectManager only + try: + regroup_perm = Permission.objects.get(codename="regroup_sessions_deployment", content_type=project_ct) + pm_groups = Group.objects.filter(Q(name__endswith="_ProjectManager")) + for group in pm_groups: + group.permissions.remove(regroup_perm) + GroupObjectPermission.objects.filter( + permission=regroup_perm, + content_type=project_ct, + group__in=pm_groups, + ).delete() + except Permission.DoesNotExist: + pass + + # Remove run_regroup_events_job from both MLDataManager and ProjectManager + try: + job_perm = Permission.objects.get(codename="run_regroup_events_job", content_type=project_ct) + job_groups = Group.objects.filter(Q(name__endswith="_MLDataManager") | Q(name__endswith="_ProjectManager")) + for group in job_groups: + group.permissions.remove(job_perm) + GroupObjectPermission.objects.filter( + permission=job_perm, + content_type=project_ct, + group__in=job_groups, + ).delete() + except Permission.DoesNotExist: + pass🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ami/main/migrations/0089_add_regroup_permissions.py` around lines 64 - 87, The revoke_new_permissions function currently removes both codenames from both role groups; change its logic to mirror the grant logic in the migration by mapping codenames to the exact target groups before removal: when codename == "regroup_sessions_deployment" only target groups ending with "_ProjectManager", and when codename == "run_regroup_events_job" target both "_ProjectManager" and "_MLDataManager"; use that per-codename role_groups selection for both group.permissions.remove(perm) and the GroupObjectPermission.objects.filter(..., group__in=role_groups).delete() calls so revoke precisely matches grant.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@ami/main/api/views.py`:
- Around line 362-393: The method regroup_sessions in the view currently uses
assert deployment.project which can be disabled; replace that with an explicit
runtime check and return a proper client error: if deployment.project is None,
raise rest_framework.exceptions.ValidationError (or return Response with
status=HTTP_400_BAD_REQUEST and a clear error message) before creating the Job;
ensure the check is done right after deployment = self.get_object() and include
deployment/project identifiers in the error message for clarity.
In `@ami/main/models.py`:
- Around line 968-973: The branch that skips autoregroup (regroup_after=False)
fails to realign child objects because it never triggers
Deployment.update_children(); to fix, after calling
self.save(regroup_async=False, update_calculated_fields=False) and
self.update_calculated_fields(save=True) ensure you explicitly invoke the same
child-realignment logic (call self.update_children() or otherwise run the code
path used by Deployment.save when update_calculated_fields=True) so
Events/Occurrences/SourceImages are moved to the new project when a Deployment
is moved; locate Deployment.save, Deployment.update_calculated_fields, and
Deployment.update_children to add the call in this else branch.
- Around line 1479-1486: The current group_images_into_events function silently
returns [] when _regroup_lock(deployment.pk) is not acquired, which lets callers
(e.g., DataStorageSyncJob) treat regroup as successful and miss newly imported
captures; change this behavior so a lock miss produces a retryable outcome:
either raise a specific retryable exception (e.g., RegroupLockHeldError) or set
a "dirty" flag on the deployment and re-enqueue the job instead of returning [],
and ensure callers (DataStorageSyncJob) catch that exception or check the dirty
flag to re-schedule regroup; update references to _regroup_lock,
group_images_into_events, and any use of job.logger/logger to log a clear
retryable message when the lock is held.
---
Nitpick comments:
In `@ami/jobs/management/commands/test_regroup_job_e2e.py`:
- Around line 60-75: In _resolve_deployment, change the exception raise to use
explicit exception chaining: when catching Deployment.DoesNotExist in the except
block of _resolve_deployment, replace the current raise CommandError(...) with
raise CommandError(f"Deployment {deployment_id} not found") from None so the
original DoesNotExist traceback is suppressed and the intent of exception
translation is clear.
In `@ami/main/migrations/0089_add_regroup_permissions.py`:
- Around line 64-87: The revoke_new_permissions function currently removes both
codenames from both role groups; change its logic to mirror the grant logic in
the migration by mapping codenames to the exact target groups before removal:
when codename == "regroup_sessions_deployment" only target groups ending with
"_ProjectManager", and when codename == "run_regroup_events_job" target both
"_ProjectManager" and "_MLDataManager"; use that per-codename role_groups
selection for both group.permissions.remove(perm) and the
GroupObjectPermission.objects.filter(..., group__in=role_groups).delete() calls
so revoke precisely matches grant.
In `@ami/main/tests.py`:
- Around line 534-553: Update
test_invalid_session_time_gap_falls_back_to_default to also assert the warning
side-effect: use the test logging capture (e.g., pytest's caplog) while setting
project.session_time_gap_seconds to each bad_value, call
group_images_into_events(deployment=self.deployment), and then assert caplog
captured a WARNING-level message from the module that contains text indicating
the fallback (e.g., mentions "fallback", "session_time_gap_seconds", or the
bad_value) so the test verifies both the event count and that a warning was
emitted when the invalid gap was clamped.
- Around line 598-605: The test currently may skip cache.delete(lock_key) if the
assertion fails; modify the test around the _regroup_lock block so that
cache.delete(lock_key) always runs by placing the post-lock assertions and
cache.delete into a try/finally (or move cache.delete into a finally) that
ensures cleanup regardless of assertion failures; reference the existing
_regroup_lock(self.deployment.pk) context, the lock_key variable, and
cache.delete to locate where to add the finally guard.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7ccaf86d-d49c-4cba-a120-d0d96c841289
📒 Files selected for processing (10)
ami/jobs/management/commands/test_regroup_job_e2e.pyami/jobs/models.pyami/jobs/tests/test_jobs.pyami/main/admin.pyami/main/api/views.pyami/main/migrations/0089_add_regroup_permissions.pyami/main/models.pyami/main/tests.pyami/tasks.pyami/users/roles.py
✅ Files skipped from review due to trivial changes (1)
- ami/tasks.py
🚧 Files skipped from review as they are similar to previous changes (3)
- ami/users/roles.py
- ami/jobs/tests/test_jobs.py
- ami/jobs/models.py
… on sync path Two CodeRabbit findings on the latest revision. 1. `ami/main/api/views.py:376` used `assert deployment.project` to guard the API action. `assert` is stripped under `python -O`, so a deployment whose project FK is null (the schema allows it — `Project` FK is nullable on `Deployment`) would 500 with a confusing AttributeError later in the view instead of returning a clean 400. Replace with an explicit `api_exceptions.ValidationError` that returns 400 with a structured detail. 2. `Deployment.sync_captures()`'s new `regroup_after=False` branch (added in this PR so `DataStorageSyncJob` can drive the regroup as a separate tracked stage) bypassed `Deployment.save()`'s `update_calculated_fields= True` block, which is where `self.update_children()` realigns child `Event`/`Occurrence`/`SourceImage` `project` pointers. Result: a sync Job whose deployment had moved between projects would leave child rows with stale project FKs, breaking project-scoped queries (default filters, visibility checks, etc.). The user-triggered `RegroupEventsJob` path is unaffected because it doesn't run sync_captures — the regression is specific to the new sync→regroup chain. Add an explicit `self.update_children()` call after the `update_calculated_fields(save=True)` refresh on the regroup_after=False branch, gated on `project_id` like the corresponding block in `Deployment.save()`. Existing regroup tests still pass (4/4). Co-Authored-By: Claude <noreply@anthropic.com>
CodeRabbit raised a real concern about the sync→regroup chain: if a concurrent regroup holds the lock when DataStorageSyncJob's regroup stage runs, the lock-miss branch in `group_images_into_events` returns `[]` and the just-synced captures sit ungrouped until the next save/sync. A previous draft of this fix added a post-lock re-enqueue (`regroup_events.delay()` after the `with _regroup_lock:` block) that called `deployment_events_need_update` and chained a follow-up task. A fresh review found that approach risks cascading enqueues on busy deployments — the lock serialises work but not enqueueing, so concurrent `Deployment.save` autoregroups during a slow run can pile redundant tasks into the queue (no infinite loop, but tens of redundant runs plausible). Drop the re-enqueue. Instead, surface a clear warning on the Job itself when the sync stage's regroup is skipped: we know `total_files` from the preceding sync stage, so we can say "sync added N files but regroup was skipped" directly on the Job that an admin is looking at. The captures will be picked up by the next Deployment.save autoregroup (models.py:1110-1113), which is the existing safety net for the general case. Trade-off vs the previous draft: a sync that races with a concurrent regroup defers its new captures to the next save/sync (small window, no data loss) but does NOT silently appear to have succeeded — the warning lands in the Job log where it belongs. No new tests; existing 4 regroup/lock tests + 12 grouping tests pass (16/16). Co-Authored-By: Claude <noreply@anthropic.com>
loppear
left a comment
There was a problem hiding this comment.
Narrowing the scope to just using this admin Project setting is good enough for me. It seems the changes to move this into an explicit job for better logging during storage sync are motivated by existing issues, so if this helps clean that up ok!
…al-regroup requirement The previous help text just said "Time gap in seconds to consider a new session" — it didn't tell the user that changing the value doesn't regroup existing captures, which is the most common point of confusion when admins edit the field. Now points users to the Deployments admin "Regroup captures into sessions" bulk action so they can apply the change. Surfaces via Django admin form, DRF browsable API, and OPTIONS schema. Co-Authored-By: Claude <noreply@anthropic.com>
|
Thanks for your review @loppear. I plan to do some merging & deploying in the afternoon today. |
Four fixes surfaced by an independent takeaway review: - Count actual SourceImage rows assigned to an event for the "Captures grouped" stat instead of the distinct-timestamp count. Two captures can share a timestamp, so the old count understated the work done. - Lower the regroup lock TTL and the matching Celery time limits to 10 min soft / 11 min hard. group_images_into_events is mostly batch SQL and finishes in seconds, so a tight limit bounds the worst-case stuck-lock window when a worker dies hard (OOM, SIGKILL, eviction) and the finally block never runs to release the lock. - Document that RegroupEventsJob and the bare regroup_events task only group; propagating project_id to children stays on Deployment.save(). - Log a clear "regroup skipped" message when the per-deployment lock is already held, instead of a misleading "now has 0 events". Co-Authored-By: Claude <noreply@anthropic.com>
8b5c72c to
c56a03b
Compare
Main now has 0089_add_regroup_permissions and 0090_session_time_gap_seconds_help_text (PR #1292), which made the branch's 0089/0091 migrations a second leaf in the migration graph and failed CI. Renumber to 0091/0092 on top of main's 0090. Co-Authored-By: Claude <noreply@anthropic.com>
* Model and API view for SourceImageThumbnail
* Thumbnails: add field to captures serializer
* With internal prefix
* black postcommit
* black?
* black settings
* Better handling for thumbnail invalidation and deletion.
* Thumbnail lookup by label and check for changed settings width
* Move thumbnail generation to method on SourceImage
* Thumbnails: handle uploaded source images (without datasource), 302 redirects, and add index for (source_image, label).
* merge migrations
* UI capture data: add and use thumbnail property accessors
* Add thumbnails to api responses for events with captures
* revert UI capture.src property, only use thumbnails explicitly
* fix: prioritize API response size before natural size
* Merge migrations
* Rebase thumbnail migrations
* ami.utils.media.fetch_image_content for thumbnail image source
* fix(thumbnails): convert non-RGB source images to RGB before JPEG encode
JPEG natively supports only L, RGB, and CMYK modes. Source images in
RGBA/P/LA/PA modes (common for uploaded PNGs) raised
"OSError: cannot write mode <X> as JPEG" on `img.save(format="JPEG")`,
500-ing the thumbnail endpoint. Convert to RGB first when needed.
Adds a regression test that feeds an RGBA PNG through find_or_generate
and asserts the redirect succeeds and the thumbnail row lands.
Closes review thread on ami/main/models.py L2241.
Co-Authored-By: Claude <noreply@anthropic.com>
* fix(thumbnails): handle SourceImage.last_modified=None on regen check
Two changes, one root cause.
`SourceImage.last_modified` tracks the source bytes' mtime (S3 LastModified
on the sync path) and is the freshness signal the thumbnail regen check uses
to decide whether to invalidate a cached thumbnail. It is NOT the same as
`updated_at` (BaseModel auto_now), which bumps on every row save (cached
counts, detection counts, etc.) and would force regen far too often.
The upload path (`create_source_image_from_upload`) never set this field,
so:
1. Uploaded captures had `last_modified=None` forever, breaking parity
with sync-created rows.
2. The thumbnail regen check `thumb.last_modified < self.last_modified`
raised `TypeError` on the 2nd+ request for any uploaded capture's
thumbnail (datetime < None).
Fixes:
* `create_source_image_from_upload`: set `last_modified=timezone.now()` so
uploaded captures have a real source mtime, matching the sync path's
behaviour of copying the S3 LastModified header.
* `find_or_generate_thumbnail_for_label`: guard the comparison. None on
either side now means "no signal of source change, trust the cached
thumb" instead of crashing. Covers legacy rows synced before the upload
backfill.
Tests:
* `test_thumbnail_source_last_modified_none_does_not_crash` reproduces the
TypeError class of bug by nulling `last_modified` after a thumbnail row
exists and asserts the 2nd request reuses the cache.
* `test_upload_handler_backfills_last_modified` covers the upload-path
backfill (asserts the new field falls inside [before, after] of the
upload call).
Closes review thread on ami/main/models.py L2227.
Co-Authored-By: Claude <noreply@anthropic.com>
* fix(thumbnails): drop no-op f-strings; return 405 on list
Two F541 (Ruff: f-string without placeholders) hits, plus a small
behavioural cleanup on the list action.
* `SourceImageThumbnailViewSet.list` now raises
`MethodNotAllowed` (405) instead of `NotFound` (404). The route exists
for `/captures/thumbnails/<pk>/?label=...` only; listing has no
defined semantics. 405 sets the Allow header and is the correct REST
response.
* Updated `test_thumbnail_no_list` to expect 405 and to use a plain
string literal (the previous test used an f-string with no
placeholders).
Closes review thread on ami/main/tests.py L237 (Ruff F541).
Co-Authored-By: Claude <noreply@anthropic.com>
* fix(thumbnails): guard empty THUMBNAILS['SIZES'] config
When `settings.THUMBNAILS['SIZES']` is empty (misconfiguration), the
default-label fallback `next(iter(_sizes))` raised `StopIteration` and
surfaced as a 500 with no useful detail. Replace with an explicit empty
check that returns 404 with a clear API error message, and use
`next(iter(_sizes))` only after we know the dict is non-empty (also
switch the fallback expression to `or next(...)` since the previous
form passed `next(iter(...))` as the default arg unconditionally — fine
for non-empty dicts but still worth tidying alongside the guard).
Adds a test that overrides THUMBNAILS with an empty SIZES dict and
asserts the endpoint returns 404 with the configured-error message.
Co-Authored-By: Claude <noreply@anthropic.com>
* fix(media): add finite default timeout to fetch_image_content
`requests.get(url)` with no timeout blocks indefinitely if the upstream
storage is slow or unreachable. The thumbnail-generation path
(`SourceImage.find_or_generate_thumbnail_for_label`) runs this
synchronously inside a Django web request, so a stuck upstream ties up
a gunicorn worker; a gallery's worth of stuck calls drains the pool.
Default timeout is `(connect=5s, read=30s)`. Callers that need
different limits (e.g. very large originals on a slow link) can override
by passing `timeout=`. Behaviour for fast / responsive upstreams is
unchanged.
Tests verify the default is finite and that explicit overrides are
honored.
Co-Authored-By: Claude <noreply@anthropic.com>
* fix(thumbnails): use upsert + drop pre-save delete-loop to fix concurrent-gen race
Two related bugs in `find_or_generate_thumbnail_for_label`:
1. Under concurrent gen for the same (source_image, label) pair the loser
could either (a) hit the new UniqueConstraint and 500 with
IntegrityError when its delete-loop ran before the winner's INSERT,
or (b) destroy the winner's row + storage blob when the delete-loop
ran after. Either outcome is wrong; both were possible because the
code did `filter(label=label).delete()` then `thumbnails.create(...)`
instead of a single atomic upsert.
2. The pre-save delete-loop also deleted the storage blob BEFORE the
new `default_storage.save()`, which on storage backends that suffix
colliding keys (FileSystemStorage default) left the row pointing at
a path that the next save could overwrite in the same way the racer
would. Belt-and-braces failure where the cleanup order itself was the
bug.
Fix:
* Replace the delete-then-create with `update_or_create` keyed on
`label`. Django serializes via the UniqueConstraint and, on collision,
catches IntegrityError on INSERT and re-runs as UPDATE. No 500s, no
destructive deletes of concurrent rows.
* Snapshot the prior `path` before the new save and clean it up after
the row update only when the backend gave us a new key
(`prior_path != thumbnail_path`). Backends that overwrite in place
(boto3 with `AWS_S3_FILE_OVERWRITE=True`) skip the cleanup and avoid
the file-deletion race entirely.
* Force-bump `thumb.last_modified` on the UPDATE branch via
`QuerySet.update` because `auto_now_add=True` only fires on INSERT —
without this, regen via source-mtime bump would re-trigger regen on
every subsequent request as the field's pre_save callback preserved
the existing (now-stale) value.
Tests:
* `test_thumbnail_regen_reuses_row_via_upsert`: delete the storage blob
out from under a cached row, re-request, assert the row's PK is
preserved (proving the UPDATE path was taken, not a fresh INSERT).
* `test_thumbnail_exists_newer_modified_source` updated to assert row
reuse + freshness bump rather than the previous (incidental) PK
change.
The `SourceImageThumbnail.last_modified` field semantics are still
muddled (see PR thread; rename or drop tracked separately).
Co-Authored-By: Claude <noreply@anthropic.com>
* fix(thumbnails): CASCADE source_image FK and clean storage blob on row delete
Thumbnails are pure derivatives of (source_image, label). The previous
`on_delete=SET_NULL, null=True` left orphan rows AND their storage blobs
behind forever whenever the parent SourceImage was deleted, with no
other reaper in the codebase. Silent storage growth and orphan-row
accumulation; CodeRabbit flagged this in PR #1306 and the thread was
closed without a fix.
Changes:
* `SourceImageThumbnail.source_image` → `on_delete=CASCADE`, no longer
nullable. Deleting a SourceImage now takes its thumbnail rows with it.
* New `pre_delete` signal in `ami.main.signals` does best-effort
`default_storage.delete(instance.path)` so the blob doesn't outlive the
row. Fires on explicit row deletes AND CASCADE deletes (signals fire
for both), so this covers parent SourceImage deletion automatically.
Storage failures are logged and swallowed — a transient backing-store
error must not block the delete chain.
* Migration 0090 is two-step: reap pre-existing rows where
`source_image_id IS NULL` (the SET_NULL aftermath) first, then
AlterField the FK. Orphan-row reap runs through the same pre_delete
signal so their blobs are cleaned alongside.
Tests:
* `test_source_image_delete_cascades_to_thumbnails` — create 2 thumbnail
rows for a capture, delete the capture, assert both rows gone via
CASCADE.
* `test_thumbnail_delete_removes_storage_blob` — plant a real blob in
storage, wire a row to it, delete the row, assert the blob is gone
(proving the pre_delete signal fired and cleaned up).
Co-Authored-By: Claude <noreply@anthropic.com>
* chore: black formatting
* chore(ui): prettier strip trailing whitespace in capture.ts
* docs(sourceimage): help_text for storage-derived fields; admin read-only
`size`, `last_modified`, `checksum`, `checksum_algorithm` all record
metadata read from the source image file in storage during sync or
upload, not row-level mtime. Document that in help_text so the field's
intent shows up in the admin UI and in API schema output (the four
fields previously had no help_text at all).
Mark the cluster read-only in the SourceImage admin so operators can't
clobber the cached metadata by mistake through the change form. Left
writable on the API so a project owner can still create SourceImage
entries by hand via an API client when the sync/upload flows don't fit;
the sync endpoint at POST /api/v2/deployments/<pk>/sync/ remains the
default population path.
Adds migration 0091 for the help_text changes (no schema change beyond
the AlterField metadata).
Co-Authored-By: Claude <noreply@anthropic.com>
* chore(migrations): collapse 0089 SET_NULL + 0090 CASCADE into single 0089
0089 was renamed-into-existence this branch (originally 0088, bumped
when main added 0088_detection_det_srcimg_created_idx). The follow-up
0090 only existed to switch the just-created FK from SET_NULL to
CASCADE — same model, same concern, accidentally split because 0089
shipped wrong on the first cut.
This commit:
* Rewrites 0089 to create the FK with CASCADE from the start (drops
null=True and the SET_NULL on_delete). The RunPython orphan-reaper
from 0090 is gone — a fresh table can't have orphans.
* Deletes 0090.
* Updates 0091's dependency to point at 0089 (was 0090).
End DB state is identical to the old 0089+0090 pair, so anyone who
already migrated past 0090 just pulls — Django sees 0089 as
already-applied (same name) and skips. 0090 will appear as an applied
migration with no matching file; warning only, no breakage. Local devs
who want a clean django_migrations table can manually
DELETE FROM django_migrations WHERE app='main' AND name='0090_sourceimagethumbnail_cascade_source_image'.
Safe because this migration only landed on the in-flight feature
branch — no production state to preserve. CI re-runs from scratch
each time and passed (TestImageThumbnailViews 18/18 with --noinput on
a fresh DB).
Co-Authored-By: Claude <noreply@anthropic.com>
* chore(migrations): drop over-defensive comment from collapsed 0089
The collapse rationale is unnecessary in-tree — the branch was never
merged, so no production state ever ran the SET_NULL form. Anyone
mid-rebase on this branch will just see the new migration replace the
old one cleanly.
Co-Authored-By: Claude <noreply@anthropic.com>
* fix(ui): session list uses capture thumbnails instead of full-size originals
The sessions list and gallery rendered example captures from capture.url —
the full-resolution original (often 9000px+ wide, fetched straight from
object storage). The events API has provided a thumbnails dict on
example_captures since this branch; use the medium (1024px) thumbnail with
a fallback to the original when thumbnails are absent, matching the
pattern in capture.ts. The medium size also keeps the og:image meta tag on
session details at a reasonable resolution.
Co-Authored-By: Claude <noreply@anthropic.com>
* fix(migrations): renumber thumbnail migrations after main merge
Main now has 0089_add_regroup_permissions and 0090_session_time_gap_seconds_help_text
(PR #1292), which made the branch's 0089/0091 migrations a second leaf in the
migration graph and failed CI. Renumber to 0091/0092 on top of main's 0090.
Co-Authored-By: Claude <noreply@anthropic.com>
* fix(tests): use real taxon pk in role-permission test data
TestRolePermissions hardcoded taxon_id="5" in the identification
create payload. Any test class that runs earlier and creates taxa
advances the id sequence, so pk 5 stops existing and the POST fails
validation with 400 before the permission check returns 403. The new
TestImageThumbnailViews class in this branch sorts alphabetically
before TestRolePermissions and triggered exactly that. Use the
occurrence's own determination pk instead.
Co-Authored-By: Claude <noreply@anthropic.com>
* fix(thumbnails): record requested spec width on thumbnail rows
PIL.Image.thumbnail() preserves aspect ratio and can emit a width 1-2px
below the requested spec (239 for a 240 spec was observed in a dev
deployment). The regen gate compares the stored width against the spec
with strict equality, so storing the encoder output made affected rows
regenerate on every request and download the original each time.
Store the requested spec width instead — the row identifies which
configured size it satisfies, matching how derivative-image systems key
their caches (sorl-thumbnail/easy-thumbnails hash the requested
geometry, WordPress keys by registered size name). The actual encoded
height is still stored; it is informational and never compared.
Existing rows that hold an encoder-output width self-heal with a single
regeneration on their next request.
Co-Authored-By: Claude <noreply@anthropic.com>
* perf(thumbnails): trust the cached row without a storage HEAD per request
The regen gate called default_storage.exists() on every warm hit, which
is a HEAD request against the object store (or a stat on filesystem
backends) per thumbnail per page view. The cached row is already the
warm signal; the existence check only protected against blobs deleted
out of band, which currently has no code path. Orphaned rows now show a
broken image until the row is removed and the next request regenerates.
Co-Authored-By: Claude <noreply@anthropic.com>
* feat(thumbnails): encode thumbnails as progressive JPEGs
Progressive encoding lets browsers paint a blurry-to-crisp preview as
bytes arrive instead of filling in top-to-bottom. optimize adds an extra
Huffman pass for slightly smaller files and quality=82 raises Pillow's
default of 75 for screen-sized thumbnails. All three flags only cost
time on the cold generation path.
Co-Authored-By: Claude <noreply@anthropic.com>
* perf(thumbnails): let browsers cache the thumbnail redirect
302 responses are not cached by browsers unless explicitly allowed, so
every page view re-paid a full Django round trip (auth, object lookup,
freshness check) per thumbnail, even when nothing changed. A five-minute
private Cache-Control covers typical back-and-forth browsing within a
session while staying well below the shortest presigned-URL lifetime
(django-storages AWS_QUERYSTRING_EXPIRE defaults to 3600s), so a cached
redirect can never point at an expired signature.
Co-Authored-By: Claude <noreply@anthropic.com>
* fix(thumbnails): regenerate when the cached row has no stored path
A row whose path is empty (failed or interrupted generation) passed the
regen gate and redirected the browser to the storage root. Treat a blank
path as cold and regenerate.
Co-Authored-By: Claude <noreply@anthropic.com>
* refactor(thumbnails): trim review-narration comments to load-bearing facts
Several comment blocks added during review explained why a change was
correct relative to code that no longer exists. Keep only the constraints
a future reader needs (spec-width vs encoder output, auto_now_add bump,
presign-lifetime bound); drop the change-justification prose, which lives
in the commit history.
Co-Authored-By: Claude <noreply@anthropic.com>
* refactor(thumbnails): trim remaining narration comments found in full sweep
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: Anna Viklund <annamariaviklund@gmail.com>
Co-authored-by: Michael Bunsen <notbot@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Closes #1157. Closes #1158. Partially addresses #906 (full fix tracked in #904 — see "What this PR proves on its own").
Summary
The
session_time_gap_secondsproject setting was added in #918 but was never read by anything —group_images_into_events()still hardcoded a 120-minute gap, so changing the field had no effect. This PR wires the setting through and turns the regroup action into a proper Job so admins can actually see what's happening.What changes in this PR:
group_images_into_events()now readsdeployment.project.session_time_gap_secondswhen no explicit gap is passed, with a clamp on invalid values (None/0/negative → 120 min default, logged).RegroupEventsJobJobType — admins /MLDataManager/ProjectManagercan trigger regrouping from the API, the admin, or programmatically and see progress, logs, and stage params in the Jobs UI.DataStorageSyncJobnow runs grouping as an explicit second stage. Previously the regroup happened silently insideDeployment.save()— if it raised, the sync Job recorded SUCCESS anyway. Closes Expose session regrouping as a job stage when syncing images #1157 and Expose session regrouping as a job stage with stats #1158.group_images_into_events()itself (token-based uuid release, so a longer-than-TTL run never clobbers a newer owner). All four call sites — the new Job, the sync-stage regroup,Deployment.saveautoregroup, and the bare Celery task — collapse to a single in-flight run.POST /api/v2/deployments/<pk>/regroup-sessions/action that creates aRegroupEventsJob. Mirrors the existingsyncaction shape (returnsjob_id, not a Celery task id).regroup_sessions_deployment(gates the API action; granted toProjectManager) andrun_regroup_events_job(gates create/run/cancel on the new JobType; granted toMLDataManager, whichProjectManagerinherits from). Migration0089_add_regroup_permissionsregisters both and grants them to existing role groups.Drive-by fixes in the same area:
events_createdstage param was counting events visited rather than events newly created (the bool fromget_or_createwas being discarded). Fixed to use thewas_createdreturn.len(values)(string length of the newline-joined report) instead of the duplicate count. Pre-existing since Storage and retrieval of project settings #918's neighbouring code (2024-09-12) but sits inside the very block whose stats this PR now surfaces, so fixed here rather than left to drift.What is intentionally not included:
ProjectorDeployment, and whether the unit should change from seconds to minutes/hours. See "Follow-ups" below.sync_deploymentshape.One DB migration (
0089_add_regroup_permissions, rebased onto main's 0088). Thesession_time_gap_secondsfield itself already existed with default7200seconds (= 2 hours = 120 minutes), matching the previous hardcoded behavior.Use case this fixes
A user reported that monitoring schedules with multi-burst nights (e.g. 2 hours on, 2 hours off, 2 hours on) sometimes group into one Event and sometimes into two, depending on whether the off-window crosses midnight. With this PR, raising the gap setting past the longest off-window collapses cross-midnight bursts into a single Event consistently.
A second pain came out of a recent workshop: sync jobs would complete successfully, but if the post-sync regroup raised, there was no Job-level signal. By the time anyone noticed the captures had no sessions, it took grepping Celery logs to find the failure. The new regroup stage on
DataStorageSyncJobsurfaces those failures directly in the Jobs UI — a sync-then-regroup failure now flips the Job to FAILURE.API usage
Request
No request body required.
Response
202 Accepted:{ "job_id": 12345, "deployment_id": 74, "project_id": 24 }The response returns immediately; regrouping runs as a
RegroupEventsJobin the background. Track progress atGET /api/v2/jobs/<job_id>/or in the Jobs UI — you'll see a "Regroup sessions" stage withevents_created,events_touched,events_deleted_empty,captures_grouped,duplicate_timestamps,ungrouped_captures, andno_timestamp_capturesstage params on completion.If a regroup is already in progress for the same deployment, subsequent enqueues still return
202with a freshjob_id, but the Job's regroup stage will short-circuit at the worker (logged as a warning on the Job) and return early without touching the DB. The lock is keyed on the deployment, not the Job, so duplicate Jobs are visible in the Jobs UI but cannot run overlapping regroups on the same rows.Error responses
404 Not Found— deployment pk does not exist or is not visible to the requesting user.401 Unauthorized/403 Forbidden— token missing or user lacks theregroup_sessions_deploymentpermission on the deployment's project (granted to theProjectManagerrole and superusers).Caveats — what this PR does and doesn't change
The setting is now wired through, but Event keying still collapses some timestamp groups. Below are the scenarios actually tested and what they show.
Production data — 5-setting cycle on a real deployment
Cycled
session_time_gap_secondsthrough five values via the new endpoint, on a deployment with ~1.2 k captures currently grouped into 69 Events at the default 2 h gap. Each Job completed in 14–22 s. (Run was against the originalregroup_events_task.delay()path before the Job framework wrapping; the same numbers re-verified via the new path — see "End-to-end validation" below.)Reading: tightening below ~24 h produced no observable Event-count change on this deployment. Loosening to 24 h merged 9 cross-midnight nights (−13 %).
Why tightening doesn't always change Event count
group_images_into_events()keys Events bygroup_by = start_date.date()and usesEvent.objects.get_or_create(deployment, group_by). Two timestamp groups produced by a tighter gap that start on the same calendar date collide back into a single Event. So the gap setting only changes Event count when:The existing comment in
models.pyalready flags this: "#904 is expected to rework this reuse path more thoroughly." A separate cleaner fix — keying Events by a noon-anchored "monitoring night" instead of calendar date — is parked as a follow-up.Tests added in this PR
ami/main/tests.py::TestImageGrouping:test_cross_midnight_bursts_split_by_short_gaptest_same_date_bursts_merge_regardless_of_gaptest_session_time_gap_seconds_is_used_when_no_explicit_gaptest_invalid_session_time_gap_falls_back_to_defaulttest_regroup_is_idempotent_under_concurrent_callstest_regroup_lock_release_does_not_clobber_a_newer_ownerfinallyblockami/jobs/tests/test_jobs.py:TestRegroupEventsJob.test_regroup_job_runs_end_to_end_and_reports_statsevents_created,captures_grouped, etc.) populated, real Events exist in DBTestRegroupEventsJob.test_regroup_job_failure_propagatesgroup_images_into_eventsexception propagates fromJobType.run()so the Celery wrapper records FAILURETestDataStorageSyncJobIncludesRegroupStage.test_sync_job_adds_regroup_stageregroup_sessionsstage with stats after sync completesTestDataStorageSyncJobIncludesRegroupStage.test_sync_job_regroup_failure_propagatesEnd-to-end validation against a real Antenna stack
Run via the new
manage.py test_regroup_job_e2e {regroup|sync|concurrent}command (added in this PR; mirrorstest_ml_job_e2e). Stack was the dev box's main Antenna compose stack, bind-mounted onto this branch's worktree. Target deployment:PFS_Moth_Wall(id 74, project 24, 1263 captures, 69 events at the 2 h default).RegroupEventsJobon real capturestest_regroup_job_e2e regroup --deployment 74captures_grouped=1188, events_created=0, events_touched=69, events_deleted_empty=0, duplicate_timestamps=57, ungrouped_captures=0, no_timestamp_captures=0. Event count stable (Δ 0). Reproduces the 2 h-baseline row of the production-cycle table via the new code path.DataStorageSyncJobtwo-stage chaintest_regroup_job_e2e sync --deployment 74data_storage_sync(total_files=857, failed=0) andregroup_sessionswith same regroup stats as #1. Closes #1157 / #1158.test_sync_job_regroup_failure_propagatesmocksgroup_images_into_eventsto raise; theRuntimeErrorpropagates fromJobType.run()so the Celery wrapper records FAILURE. The stack-level run would re-exercise the same propagation path. Skipped to avoid a code edit just to inject a fault; happy to add a chaos hook if reviewers want it.test_regroup_job_e2e concurrent --deployment 74captures_grouped=1188(real work). Job B:captures_grouped=0(short-circuit; logged the lock-warning per the body). Event count unchanged from the single-run baseline. Lock semantics held end-to-end.The 5-setting production cycle (30 min / 1 h / 2 h / 4 h / 24 h) in the "Caveats" table was originally run against the pre-Job-framework
regroup_events_task.delay()path. Re-running it throughRegroupEventsJobwould produce identical Event counts (same underlyinggroup_images_into_eventscall) and was not re-done end-to-end; the single-settingregrouprun above is the live-stack proof that the Job-framework path drives that function correctly.Headless validation pass (post-takeaway review)
A second pass exercised the API permission paths, setting-change UX, admin bulk action edge cases, and the lock-miss log surface on the same dev stack — these are the paths the unit tests stub out or the original E2E command doesn't cover. All against deployment 74.
/regroup-sessions/happy pathcurl -X POSTwith superuser token{"job_id": 1587, "deployment_id": 74, "project_id": 24}. Job ran to SUCCESS in 19 s. Stage params surfaced with both human-readablename(e.g. "Captures grouped") and slugifiedkey(captures_grouped) for UI lookup.curlwith token from a user not in project 24"You do not have permission to perform this action."(object-level perm gate viacheck_custom_permission(action="regroup_sessions") → regroup_sessions_deployment).curlwith superuser token, pk = 99999999"Not found."curlwith no auth header"Authentication credentials were not provided."Project.session_time_gap_seconds7200 → 14400 in shell, then count newRegroupEventsJobs and check Event count[74, orphan](orphan =Deployment(project=None)), invokeDeploymentAdmin.regroup_eventsQueued RegroupEventsJob for 1 deployments: [1594] — skipped: #297 ... (no project). Orphan skipped cleanly, valid one enqueued.POST /regroup-sessions/calls < 1 s apartcaptures_grouped=0. Job B/C.logs.stdoutcontainsWARNING group_images_into_events skipped for deployment 74: another regroup is in progress.— an admin watching the Job sees the gap directly.Not run on this stack (need fixtures not available locally):
finally-block lock release, not this PR's new code.Deploymentrows with cached count fields populated. With the new denorm-field admin, projection is roughly 2–3 s warm at 1 000 rows (vs 0.6 s on 251 today); linear in row count, not capture count.Post-review hardening (commit c56a03b)
An independent takeaway review surfaced four issues, all addressed in
c56a03bb:duplicate_timestamps), both get assigned to the same Event, so the old distinct-timestamp count understated the captures grouped. The E2E table figures above were produced before this change and will read slightly higher under the corrected count.finallyblock releases it.RegroupEventsJoband the bareregroup_eventstask: propagating a moved deployment'sproject_idto its children stays onDeployment.save()viaupdate_children(), not on the regroup path.What this PR proves on its own
group_images_into_eventsreads it).group_byfield and replaces date-keyed reuse with time-overlap matching. This PR is the prerequisite that lets Fix incorrect session grouping #904's reuse-path rework actually act on a configurable threshold. Make max_time_gap a configurable Project setting #906 is therefore partially addressed (setting now respected end-to-end); the deeper "sessions don't split inside a calendar day" case remains for Fix incorrect session grouping #904.Related PRs and issues
group_byfield, addsuse_existingflag, replaces unique constraint. Complementary to this PR: Fix incorrect session grouping #904 makes the gap setting visibly meaningful below the day boundary.DataStorageSyncJob.RegroupEventsJoband the sync Job's regroup stage.Test plan
yarn builddocker compose build django uimanage.py test ami.main.tests.TestImageGrouping— 12/12 passmanage.py test ami.jobs.tests.test_jobs.TestRegroupEventsJob ami.jobs.tests.test_jobs.TestDataStorageSyncJobIncludesRegroupStage— 4/4 passmanage.py test ami.main.tests.TestRolePermissions— 4/4 pass (no regression from the two new permissions)test_regroup_job_e2e regroup,sync, andconcurrentagainst deployment 74 — see "End-to-end validation" table above.Notes / follow-ups
Projectbecause that's where it was added in Storage and retrieval of project settings #918. Moving it toDeploymentis probably the right long-term shape — different monitoring stations on the same project can run different burst schedules. A separate PR with a data migration.deployment_events_need_update()does not detect a setting change as a reason to regroup, so admins must hit the new API action (or the Django admin bulk action) after editing the gap (also per @loppear's review comment — "expect this setting to apply"). The field'shelp_textnow explicitly points users to the Deployments admin "Regroup captures into sessions" bulk action so the manual-regroup step is discoverable from the form itself. Auto-detection on setting change remains a possible follow-up.Screenshots
Co-author
Co-Authored-By: Claude noreply@anthropic.com
Summary by CodeRabbit
New Features
Improvements
Permissions