Move ML recap analysis to heavy-processing Celery worker #4571
Conversation
Offload compute_similar_talks and compute_topic_clusters to the
heavy_processing Celery queue to prevent OOM in Gunicorn web workers.
The admin view now checks cache first, dispatches a Celery task on miss,
and returns {"status": "processing"} for the frontend to poll.
Summary

This PR offloads ML-based recap analysis (`compute_similar_talks` and `compute_topic_clusters`) to a Celery heavy-processing queue to prevent OOM in Gunicorn web workers. The admin view now checks the cache first, dispatches a Celery task on a miss, and returns `{"status": "processing"}` for the frontend to poll. Comprehensive test coverage (11 tests) validates cache hits, task dispatch, lock management, and error handling.

Issues Found

Security: MD5 Usage for Cache Keys (backend/reviews/cache_keys.py:18)
Using MD5 is not a security issue here, since it serves only cache key generation rather than any cryptographic purpose, but MD5 is flagged by security scanners. Consider using hashlib.sha256() instead for better compatibility with security policies.

Potential Race Condition in Lock Management (backend/reviews/admin.py:478-488)
There is a subtle race condition between checking the stale lock and acquiring it: two threads could both end up dispatching tasks after one deletes a stale lock. This is minor, since the lock timeout (300s) and the polling mechanism will eventually resolve it, but consider a compare-and-delete operation if the cache backend supports one.

Error Handling: Missing Database Connectivity Check (backend/reviews/tasks.py:22-29)
The task catches Conference.DoesNotExist, but if the database is temporarily unavailable, it raises an unhandled exception. Consider adding a broader exception handler to handle database connectivity issues gracefully.

Architecture: Cache Key Calculation in Request Handler (backend/reviews/admin.py:456)
The view converts the entire queryset to a list just to compute the cache key. For conferences with many submissions, this loads all submission data into memory in the web worker, which contradicts the goal of keeping web workers lightweight. Consider computing a simpler cache key based only on conference ID plus submission count or last-modified timestamp (sketched below).

Testing: Missing Test for Duplicate Simultaneous Requests
The test suite covers stale locks and active locks, but not multiple simultaneous requests arriving at exactly the same time before any lock is set. Consider adding a concurrency test that simulates this scenario.
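A hedged sketch combining the reviewer's last two structural suggestions: a cache key derived from cheap aggregate metadata rather than full submission data, hashed with sha256 instead of MD5. The `submissions` related name and `modified` field are assumptions about the schema:

```python
import hashlib


def get_light_cache_key(conference):
    # Fingerprint built from cheap aggregates instead of serializing every
    # submission, so the web worker never loads full submission data.
    submissions = conference.submissions.all()
    fingerprint = "{}:{}:{}".format(
        conference.id,
        submissions.count(),
        submissions.latest("modified").modified.isoformat(),
    )
    # sha256 rather than MD5, as suggested, to keep security scanners quiet.
    return "recap-analysis:" + hashlib.sha256(fingerprint.encode()).hexdigest()
```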
Codecov Report

✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main    #4571      +/-   ##
==========================================
+ Coverage   92.50%   92.52%   +0.02%
==========================================
  Files         355      357       +2
  Lines       10658    10690      +32
  Branches      812      812
==========================================
+ Hits         9859     9891      +32
  Misses        687      687
  Partials      112      112
…ential backoff
- Extract get_accepted_submissions() to remove duplicated query between
admin.py and tasks.py
- Add try/except in Celery task that caches error state so the frontend
  can show the actual failure instead of a timeout (see the sketch after
  this list)
- Frontend handles {"status": "error"} responses from cached error state
- Replace fixed 3s polling with exponential backoff (1s, 2s, 3s, 5s...)
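The error-caching change from this commit, as a minimal sketch. The task name, the `run_recap_analysis` helper, and the result TTL are assumptions; the actual TTL constants are extracted in a later commit, and the 2 min error TTL comes from the next commit message:

```python
from celery import shared_task
from django.core.cache import cache

RESULT_CACHE_TTL = 600   # illustrative; the real constant is extracted later
ERROR_CACHE_TTL = 120    # 2 min, per the error cache TTL mentioned below


@shared_task(queue="heavy_processing")
def analyze_recap_task(conference_id, cache_key):
    try:
        result = run_recap_analysis(conference_id)   # hypothetical ML helper
        cache.set(cache_key, result, timeout=RESULT_CACHE_TTL)
    except Exception as exc:
        # Cache the failure so the frontend can show the actual error
        # instead of polling until its timeout expires.
        cache.set(
            cache_key,
            {"status": "error", "detail": str(exc)},
            timeout=ERROR_CACHE_TTL,
        )
        raise
```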
…ndling, tests
- Add cache.add lock to prevent duplicate task dispatch on concurrent
  requests
- Pass combined_cache_key from view to task to avoid key mismatch from
  race conditions between dispatch and execution
- Handle Conference.DoesNotExist in task for deleted conferences
- Clean up computing lock in finally block
- Align frontend poll timeout (3 min) with error cache TTL (2 min)
- Add integration tests: task cache population, error caching, missing
  conference
- Add stampede prevention test (cache.add returns False)
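The cache.add lock from the first bullet, sketched under the assumption of a Django cache backend with an atomic add (e.g. Redis or memcached); the function name and lock-key suffix are illustrative:

```python
from django.core.cache import cache

from reviews.tasks import analyze_recap_task   # hypothetical task name

LOCK_TTL = 300  # seconds; matches the 300s lock TTL mentioned below


def dispatch_analysis_once(conference_id, cache_key):
    # cache.add is atomic: it succeeds only if the key is absent, so exactly
    # one of several concurrent requests wins the lock and dispatches.
    lock_key = f"{cache_key}:computing"
    if cache.add(lock_key, "locked", timeout=LOCK_TTL):
        analyze_recap_task.apply_async(
            args=[conference_id, cache_key], queue="heavy_processing"
        )
    # Winners and losers alike return {"status": "processing"} and poll.
```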
…ation
- Move get_cache_key and get_embedding_text to reviews/cache_keys.py
  (lightweight module with no ML imports) so admin view can import
  without triggering torch/sentence-transformers loading
- Skip individual function caching in task (conference_id=None) since
  the combined result is cached; avoids triple cache storage
- Remove _get_cache_key mock from tests — real hash computation now
  exercises the cache key generation logic
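A sketch of the lightweight reviews/cache_keys.py module this commit describes: stdlib imports only, so the admin view can use it without pulling in torch or sentence-transformers. The fields used by `get_embedding_text` and the `get_cache_key` signature are assumptions:

```python
# reviews/cache_keys.py (sketch) — stdlib only, safe to import from the
# admin view without triggering any ML library loading.
import hashlib


def get_embedding_text(submission):
    # The text fed to the embedding model; the exact fields are assumed here.
    return f"{submission.title}\n{submission.abstract}"


def get_cache_key(prefix, submissions):
    payload = "\n".join(get_embedding_text(s) for s in submissions)
    # MD5 is used purely as a cache-key hash; the review above flags this
    # and suggests sha256 for scanner compatibility.
    digest = hashlib.md5(payload.encode()).hexdigest()
    return f"{prefix}:{digest}"
```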
…lignment
- Store Celery task ID in computing lock and check AsyncResult state to
  detect stale locks from crashed workers before dispatching duplicates
  (see the sketch after this list)
- Clean up computing lock on Conference.DoesNotExist early return
- Increase frontend poll timeout to 360s to exceed backend 300s lock TTL
- Extend polling backoff intervals to 15s for large conferences
- Add .order_by("id") to get_accepted_submissions for consistent ordering
- Extract RESULT_CACHE_TTL / ERROR_CACHE_TTL constants
- Add tests for stale lock detection, active lock preservation, and
DoesNotExist lock cleanup
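A minimal sketch of the stale-lock check from the first bullet above: the lock stores the dispatched task's ID, and AsyncResult is consulted before treating the lock as live. The lock payload shape is an assumption:

```python
from celery.result import AsyncResult
from django.core.cache import cache


def lock_is_stale(lock_value):
    # The lock payload is assumed to be {"task_id": "..."} as stored at
    # dispatch time; a FAILURE or REVOKED state means the worker died and
    # the lock can be deleted and re-acquired.
    state = AsyncResult(lock_value["task_id"]).state
    return state in ("FAILURE", "REVOKED")


def maybe_clear_stale_lock(lock_key):
    lock_value = cache.get(lock_key)
    if lock_value is not None and lock_is_stale(lock_value):
        cache.delete(lock_key)
```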
Summary

- Offload `compute_similar_talks` and `compute_topic_clusters` to the `heavy_processing` Celery queue to prevent OOM in Gunicorn web workers
- The admin view checks cache first, dispatches a Celery task on miss, and returns `{"status": "processing"}` for the frontend to poll
- Dispatch follows the existing pattern in `video_uploads/admin.py` (`apply_async(queue="heavy_processing")` + `check_pending_heavy_processing_work.delay()`)

Test plan

- Added tests in `reviews/tests/test_recap.py`
- Verified the view returns `{"status": "processing"}` while analysis is pending
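To round out the test plan, a hedged sketch of the stampede-prevention case mentioned in an earlier commit (the admin URL, patched module paths, and task name are all assumptions):

```python
from unittest import mock

from django.test import TestCase


class RecapDispatchTests(TestCase):
    def test_no_duplicate_dispatch_when_lock_is_held(self):
        with mock.patch("reviews.admin.cache") as cache_mock, mock.patch(
            "reviews.tasks.analyze_recap_task.apply_async"
        ) as dispatch:
            cache_mock.get.return_value = None    # no cached result yet
            cache_mock.add.return_value = False   # another request holds the lock
            response = self.client.get("/admin/reviews/recap-analysis/1/")
        self.assertEqual(response.json(), {"status": "processing"})
        dispatch.assert_not_called()
```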