
Move ML recap analysis to heavy-processing Celery worker#4571

Merged
patrick91 merged 6 commits into main from feat/celery-recap-analysis
Feb 9, 2026

Conversation

@patrick91
Member

Summary

  • Offload compute_similar_talks and compute_topic_clusters to the heavy_processing Celery queue to prevent OOM in Gunicorn web workers
  • Admin view now checks the cache first, dispatches a Celery task on a miss, and returns {"status": "processing"} for the frontend to poll (see the sketch after this list)
  • Frontend JS polls every 3s (up to ~2 min timeout) until results are ready
  • Follows the same pattern as video_uploads/admin.py (apply_async(queue="heavy_processing") + check_pending_heavy_processing_work.delay())
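
A minimal sketch of the cache-first flow, assuming hypothetical names (recap_analysis_view, compute_recap_analysis, the key format); the real admin code also hashes submission content into the key:

```python
from django.core.cache import cache
from django.http import JsonResponse

from reviews.tasks import compute_recap_analysis  # hypothetical task name
from video_uploads.tasks import check_pending_heavy_processing_work  # assumed path


def recap_analysis_view(request, conference_id):
    cache_key = f"recap:{conference_id}"  # real key is content-hashed
    cached = cache.get(cache_key)
    if cached is not None:
        return JsonResponse(cached)
    # Cache miss: push the ML work to the heavy-processing queue and
    # return immediately so the Gunicorn worker stays lightweight.
    compute_recap_analysis.apply_async(
        args=[conference_id, cache_key],
        queue="heavy_processing",
    )
    check_pending_heavy_processing_work.delay()
    return JsonResponse({"status": "processing"})
```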

Test plan

  • All 11 recap tests pass (reviews/tests/test_recap.py)
  • Verify cache-hit path returns results immediately
  • Verify cache-miss dispatches task and returns {"status": "processing"}
  • Verify recompute flag bypasses cache and dispatches task
  • Verify frontend polling renders results when task completes

Offload compute_similar_talks and compute_topic_clusters to the
heavy_processing Celery queue to prevent OOM in Gunicorn web workers.
The admin view now checks cache first, dispatches a Celery task on miss,
and returns {"status": "processing"} for the frontend to poll.
@vercel

vercel bot commented Feb 9, 2026

The latest updates on your projects:

pycon: Ready (Preview), updated Feb 9, 2026 6:36pm UTC

@claude
Contributor

claude bot commented Feb 9, 2026

Summary

This PR offloads ML-based recap analysis (compute_similar_talks and compute_topic_clusters) to a Celery heavy-processing queue to prevent OOM in Gunicorn web workers. The admin view now checks cache first, dispatches a Celery task on miss, and returns {"status": "processing"} for the frontend to poll. Comprehensive test coverage (11 tests) validates cache hits, task dispatch, lock management, and error handling.

Issues Found

Security: MD5 Usage for Cache Keys

backend/reviews/cache_keys.py:18

Using MD5 for cache keys is not a security issue here, since the hash is not used for cryptographic purposes, but MD5 is flagged by security scanners. Consider using hashlib.sha256() instead for better compatibility with security policies.
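
A hypothetical sketch of the suggested swap; the real helper lives in backend/reviews/cache_keys.py and its signature may differ:

```python
import hashlib


def get_cache_key(prefix: str, payload: str) -> str:
    # sha256 avoids the security-scanner warnings that md5 triggers,
    # at negligible cost for cache-key purposes.
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"{prefix}:{digest}"
```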

Potential Race Condition in Lock Management

backend/reviews/admin.py:478-488

There is a subtle race condition between checking the stale lock and acquiring it: two threads could both dispatch tasks after one of them deletes a stale lock. This is minor, since the lock timeout (300s) and the polling mechanism will eventually resolve it, but consider a compare-and-delete operation if your cache backend supports one.
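
Django's cache API has no compare-and-delete primitive, so a sketch like this assumes direct access to a Redis client (names are illustrative):

```python
import redis

r = redis.Redis()  # assumes the cache backend is Redis and reachable here

# Delete the lock only if it still holds the stale task id we observed,
# closing the gap between the staleness check and the delete.
COMPARE_AND_DELETE = """
if redis.call('GET', KEYS[1]) == ARGV[1] then
    return redis.call('DEL', KEYS[1])
end
return 0
"""


def delete_if_still_stale(lock_key: str, observed_task_id: str) -> bool:
    script = r.register_script(COMPARE_AND_DELETE)
    return bool(script(keys=[lock_key], args=[observed_task_id]))
```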

Error Handling: Missing Database Connectivity Check

backend/reviews/tasks.py:22-29

The task catches Conference.DoesNotExist, but if the database is temporarily unavailable, it will raise an unhandled exception. Consider adding a broader exception handler to handle database connectivity issues gracefully.
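
An illustrative shape for the broader handler; ERROR_CACHE_TTL and the 2-minute value come from this PR's later commits, while the task and import paths are assumptions:

```python
from celery import shared_task
from django.core.cache import cache
from django.db import OperationalError

from conferences.models import Conference  # assumed import path

ERROR_CACHE_TTL = 120  # seconds; the PR caches error state for 2 minutes


@shared_task
def compute_recap_analysis(conference_id, cache_key):
    try:
        conference = Conference.objects.get(id=conference_id)
        # ... run compute_similar_talks / compute_topic_clusters on the
        # conference's accepted submissions, then cache the result ...
    except Conference.DoesNotExist:
        return
    except OperationalError:
        # Database temporarily unavailable: cache an error state so the
        # frontend shows the failure instead of polling to timeout.
        cache.set(cache_key, {"status": "error"}, timeout=ERROR_CACHE_TTL)
        raise
```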

Architecture: Cache Key Calculation in Request Handler

backend/reviews/admin.py:456

The view converts the entire queryset to a list just to compute the cache key. For conferences with many submissions, this loads all submission data into memory in the web worker, which contradicts the goal of keeping web workers lightweight. Consider computing a simpler cache key based only on conference ID and submission count/last-modified timestamp.
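
A sketch of that suggestion, assuming the submission model has a modified timestamp and reusing the PR's get_accepted_submissions() helper (import path is a guess):

```python
from django.db.models import Count, Max

from reviews.queries import get_accepted_submissions  # actual module may differ


def get_lightweight_cache_key(conference):
    # One aggregate query instead of materializing every submission.
    stats = get_accepted_submissions(conference).aggregate(
        n=Count("id"), latest=Max("modified")
    )
    return f"recap:{conference.id}:{stats['n']}:{stats['latest']}"
```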

Testing: Missing Test for Duplicate Simultaneous Requests

The test suite covers stale locks and active locks, but does not test what happens when multiple simultaneous requests arrive at exactly the same time before any lock is set. Consider adding a concurrency test that simulates this scenario.
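
One possible shape for such a test, assuming pytest-mock, a thread-safe cache backend, and placeholder URL/task paths:

```python
import threading


def test_simultaneous_requests_dispatch_once(admin_client, mocker):
    dispatched = mocker.patch(
        "reviews.admin.compute_recap_analysis.apply_async"  # hypothetical path
    )
    barrier = threading.Barrier(2)

    def hit():
        barrier.wait()  # release both requests at the same instant
        admin_client.get("/admin/reviews/recap-analysis/")  # placeholder URL

    threads = [threading.Thread(target=hit) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # cache.add should let exactly one request through to dispatch
    assert dispatched.call_count == 1
```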

@codecov

codecov bot commented Feb 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.52%. Comparing base (4883f6c) to head (77c48b7).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4571      +/-   ##
==========================================
+ Coverage   92.50%   92.52%   +0.02%     
==========================================
  Files         355      357       +2     
  Lines       10658    10690      +32     
  Branches      812      812              
==========================================
+ Hits         9859     9891      +32     
  Misses        687      687              
  Partials      112      112              

…ential backoff

- Extract get_accepted_submissions() to remove duplicated query between
  admin.py and tasks.py
- Add try/except in Celery task that caches error state so the frontend
  can show the actual failure instead of a timeout
- Frontend handles {"status": "error"} responses from cached error state
- Replace fixed 3s polling with exponential backoff (1s, 2s, 3s, 5s...)
…ndling, tests

- Add cache.add lock to prevent duplicate task dispatch on concurrent requests
- Pass combined_cache_key from view to task to avoid key mismatch from
  race conditions between dispatch and execution
- Handle Conference.DoesNotExist in task for deleted conferences
- Clean up computing lock in finally block
- Align frontend poll timeout (3min) with error cache TTL (2min)
- Add integration tests: task cache population, error caching, missing conference
- Add stampede prevention test (cache.add returns False)
…ation

- Move get_cache_key and get_embedding_text to reviews/cache_keys.py
  (lightweight module with no ML imports) so the admin view can import
  it without triggering torch/sentence-transformers loading (see the
  sketch after this list)
- Skip individual function caching in task (conference_id=None) since
  the combined result is cached; avoids triple cache storage
- Remove _get_cache_key mock from tests — real hash computation now
  exercises the cache key generation logic
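
For illustration, the module boundary might look like this; function bodies and field names are guesses, and only the no-ML-imports split is the point:

```python
# backend/reviews/cache_keys.py -- no torch/sentence-transformers imports,
# so the admin view can build cache keys without loading ML libraries.
import hashlib


def get_embedding_text(submission) -> str:
    # Field names are assumptions about the submission model.
    return f"{submission.title}\n{submission.abstract}"


def get_cache_key(prefix: str, submissions) -> str:
    payload = "\n".join(get_embedding_text(s) for s in submissions)
    digest = hashlib.md5(payload.encode("utf-8")).hexdigest()
    return f"{prefix}:{digest}"
```
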
…lignment

- Store Celery task ID in computing lock and check AsyncResult state to
  detect stale locks from crashed workers before dispatching duplicates
- Clean up computing lock on Conference.DoesNotExist early return
- Increase frontend poll timeout to 360s to exceed backend 300s lock TTL
- Extend polling backoff intervals to 15s for large conferences
- Add .order_by("id") to get_accepted_submissions for consistent ordering
- Extract RESULT_CACHE_TTL / ERROR_CACHE_TTL constants
- Add tests for stale lock detection, active lock preservation, and
  DoesNotExist lock cleanup
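
A rough sketch of the stale-lock check, assuming the lock value stores the dispatched task's id; note Celery reports PENDING for unknown task ids, so this is a heuristic:

```python
from celery.result import AsyncResult


def lock_is_stale(lock_value: str) -> bool:
    state = AsyncResult(lock_value).state
    # Terminal states mean the worker finished or died without clearing
    # the lock; in-flight states mean work is still running.
    return state in ("SUCCESS", "FAILURE", "REVOKED")
```
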
patrick91 merged commit bed7f4c into main on Feb 9, 2026
8 checks passed
patrick91 deleted the feat/celery-recap-analysis branch on February 9, 2026 at 19:37