
Recap screen #4567

Merged

patrick91 merged 11 commits into main from recap-screen on Feb 9, 2026

Conversation

@patrick91
Member

  • Rename recap to shortlist
  • Recap screen
  • Add similar talks POC
  • WIP
  • Order things
  • Add color-coded similarity score badges
  • Defer similar talks & topic clusters to button-triggered load

Orange for 60-79% similarity, red for 80%+ to highlight
potentially overlapping talks at a glance.
Move ML computation (similar talks, topic clusters) behind a
"Compute Analysis" button to avoid slow page loads on cache miss.
Adds a JSON endpoint that returns both datasets, rendered client-side.
@vercel

vercel bot commented Feb 9, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project: pycon | Deployment: Ready | Actions: Preview | Updated (UTC): Feb 9, 2026 11:23am

@claude
Contributor

claude bot commented Feb 9, 2026

This PR adds a recap screen with ML-based similar talks detection and topic clustering for accepted submissions. It also renames "recap" to "shortlist" throughout the adapter interface.

Issues Found

Architecture & Design

admin.py:369-379 - Missing tenant scoping in query (CRITICAL SECURITY ISSUE)

def _get_accepted_submissions(self, conference):
    return (
        Submission.objects.filter(conference=conference)
        .filter(...)
    )

This query in _get_accepted_submissions is used by both review_recap_view and review_recap_compute_analysis_view. The query filters by conference but doesn't verify that the requesting user has permission to view this conference's data. While there's a permission check review_session.user_can_review(request.user) at the view level, this relies on the review_session object being correctly retrieved. If there's any way to manipulate the review_session_id parameter, a user could potentially access another conference's submissions.

Recommendation: The user_can_review check should explicitly verify the user has permission for the specific conference, not just the review session. Consider adding an explicit conference permission check at backend/reviews/admin.py:381.
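
A rough sketch of what that explicit check could look like (the ReviewSession model name and the user_is_conference_organizer helper are assumptions; the actual permission API in this codebase may differ):

from django.core.exceptions import PermissionDenied
from django.shortcuts import get_object_or_404

def review_recap_view(self, request, review_session_id):
    review_session = get_object_or_404(ReviewSession, pk=review_session_id)

    # Existing check: the user may take part in this review session.
    if not review_session.user_can_review(request.user):
        raise PermissionDenied()

    # Suggested extra check: the user is explicitly allowed to see this
    # conference's data, regardless of how review_session_id was obtained.
    if not user_is_conference_organizer(request.user, review_session.conference):  # hypothetical helper
        raise PermissionDenied()

    submissions = self._get_accepted_submissions(review_session.conference)
    # ... build context and render as before ...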

similar_talks.py:250 - Insecure hash function for cache keys

content_hash = hashlib.md5()

Using MD5 for cache key generation. While this isn't directly exploitable for cache poisoning since it's only used for cache keys (not cryptographic purposes), it's a code smell that could become a vulnerability if this pattern is copied elsewhere.

Recommendation: Use hashlib.sha256() instead. MD5 has known collision vulnerabilities.
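
For illustration, a drop-in replacement under the same keying scheme (the exact key layout below is illustrative, not the PR's):

import hashlib

content_hash = hashlib.sha256()
for submission in submissions_list:
    # Hash the same text that feeds the embeddings, just with SHA-256.
    content_hash.update(get_embedding_text(submission).encode("utf-8"))
cache_key = f"similar-talks:{content_hash.hexdigest()}"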

similar_talks.py - Heavy ML dependencies without resource limits
The new ML functionality (sentence-transformers, BERTopic, scikit-learn) will load large models into memory and perform CPU-intensive operations. There are no resource limits, timeouts, or queue management:

  • get_embedding_model() loads a 90MB+ model into memory and caches it globally
  • compute_similar_talks() and compute_topic_clusters() process all accepted submissions synchronously
  • No pagination or batch size limits for processing submissions

Recommendations:

  1. Run the ML computation in a Celery async task rather than in the synchronous HTTP request at backend/reviews/admin.py:442-492 (see the sketch after this list)
  2. Add timeout to prevent requests from hanging indefinitely
  3. Consider implementing batch processing for large conferences (>500 submissions)
  4. Add model memory monitoring/limits
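
A minimal sketch of recommendations 1 and 2 combined, assuming a Celery worker is available (the task name, cache key, and the _get_accepted_submissions_for_session helper are illustrative):

from celery import shared_task
from django.core.cache import cache

@shared_task(soft_time_limit=300, time_limit=360)  # bound how long the ML work may run
def compute_recap_analysis(review_session_id):
    submissions = _get_accepted_submissions_for_session(review_session_id)  # hypothetical helper
    result = {
        "similar_talks": compute_similar_talks(submissions),
        "topic_clusters": compute_topic_clusters(submissions),
    }
    cache.set(f"recap-analysis:{review_session_id}", result, timeout=60 * 60 * 24)

The view would then enqueue compute_recap_analysis.delay(review_session_id) and the client would poll for the cached result, instead of blocking the request.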

Testing & Coverage

test_similar_talks.py - Insufficient edge case coverage
Tests only cover happy path and basic mocking. Missing:

  • What happens when model loading fails? (a test sketch for this case follows the list)
  • What happens with very large input text (>512 tokens, the model's max)?
  • What happens when embeddings fail to encode?
  • Cache key collision scenarios
  • Language detection edge cases (mixed languages, unsupported languages)
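
One possible shape for the model-loading case, using unittest.mock (the patch target, call signature, and expected behaviour are assumptions about the module layout):

from unittest import mock

import pytest

from reviews.similar_talks import compute_similar_talks

def test_compute_similar_talks_when_model_loading_fails():
    submissions = [mock.Mock()]  # placeholder submissions; real tests would use factories
    with mock.patch(
        "reviews.similar_talks.get_embedding_model",
        side_effect=OSError("model files missing"),
    ):
        with pytest.raises(OSError):
            compute_similar_talks(submissions)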

test_recap.py - Missing negative permission tests
The test suite at backend/reviews/tests/test_recap.py:141-155 only tests PermissionDenied for non-reviewers. Missing:

  • User from different conference trying to access recap (a test sketch for this case follows the list)
  • Non-staff user attempting access
  • Review session in wrong state for analysis
  • Cross-tenant access attempts
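
For example, a cross-conference check might look roughly like this (the factory fixtures and the URL name are assumptions about the existing test setup):

import pytest
from django.urls import reverse

pytestmark = pytest.mark.django_db

def test_reviewer_from_another_conference_cannot_open_recap(client, user_factory, review_session_factory):
    review_session = review_session_factory()
    outsider = user_factory()  # reviewer on a different conference, no rights on this one

    client.force_login(outsider)
    response = client.get(
        reverse("admin:review-recap", kwargs={"review_session_id": review_session.id})
    )

    assert response.status_code in (302, 403)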

Missing integration tests
No tests verify the full flow:

  • Trigger analysis computation → wait for cache → retrieve cached results
  • Multiple concurrent requests computing analysis (race conditions)
  • Analysis with actual database submissions (not mocks)

Error Handling

admin.py:442-492 - No error handling for ML computation failures

def review_recap_compute_analysis_view(self, request, review_session_id):
    # ... permission checks ...
    similar_talks = compute_similar_talks(...)  # Can fail
    topic_clusters = compute_topic_clusters(...)  # Can fail
    return JsonResponse({...})

If either ML function raises an exception (model loading fails, encoding fails, clustering fails), the user gets a 500 error with no helpful message. The functions can fail for many reasons:

  • Model files corrupted/missing
  • Out of memory
  • NLTK data missing
  • Invalid embeddings

Recommendation: Wrap ML calls in try-except and return user-friendly error messages at backend/reviews/admin.py:455-467.
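
A sketch of the suggested wrapping (the JSON error shape is a suggestion, not part of the PR):

import logging

from django.http import JsonResponse

logger = logging.getLogger(__name__)

def review_recap_compute_analysis_view(self, request, review_session_id):
    # ... permission checks as in the PR ...
    try:
        similar_talks = compute_similar_talks(accepted_submissions)
        topic_clusters = compute_topic_clusters(accepted_submissions)
    except Exception:
        logger.exception("Recap analysis failed for review session %s", review_session_id)
        return JsonResponse(
            {"error": "Analysis could not be computed. Please try again later."},
            status=500,
        )
    return JsonResponse({"similar_talks": similar_talks, "topic_clusters": topic_clusters})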

similar_talks.py:186-187 - Silent failure on NLTK download

except LookupError:
    logger.info("Downloading NLTK stopwords...")
    nltk.download("stopwords", quiet=True)

If the download fails (no internet, disk full, permission denied), the function continues silently and the subsequent nltk.corpus.stopwords.words() call will fail with a confusing error.

Recommendation: Add error handling for failed NLTK downloads at backend/reviews/similar_talks.py:186-188.
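
One way to fail loudly instead (nltk.download returns False on failure, so the result can be checked; the helper name is illustrative):

import logging

import nltk
from nltk.corpus import stopwords

logger = logging.getLogger(__name__)

def _get_stopwords(language):
    try:
        return stopwords.words(language)
    except LookupError:
        logger.info("Downloading NLTK stopwords...")
        if not nltk.download("stopwords", quiet=True):
            raise RuntimeError("Could not download NLTK stopwords; topic clustering needs them.")
        return stopwords.words(language)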

Performance

admin.py:399 - N+1 query potential

accepted_submissions = self._get_accepted_submissions(conference)
# Later in template context:
for s in accepted_submissions

While the query uses prefetch_related("languages"), when building submissions_data the code accesses s.speaker.display_name which will trigger a query per submission if speakers aren't prefetched.

Correction: the query on line 377 already includes .select_related("speaker", ...), so this is actually fine. No change needed.

similar_talks.py:289 - Entire dataset loaded into memory

texts = [get_embedding_text(s) for s in submissions_list]
embeddings = model.encode(texts)

For a conference with 500+ submissions, this loads all text and all embeddings into memory at once. The sentence-transformers model.encode() can accept batches, but no batch size is specified.

Recommendation: Add batch processing for conferences with >200 submissions at backend/reviews/similar_talks.py:289-290.
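
Since model.encode already accepts a batch size, a small change keeps peak memory bounded (32 is a guess, not a tuned value):

texts = [get_embedding_text(s) for s in submissions_list]
embeddings = model.encode(texts, batch_size=32, show_progress_bar=False)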

similar_talks.py:292 - O(n²) cosine similarity computation

similarity_matrix = cosine_similarity(embeddings)

This computes similarity for every pair of submissions. For 500 submissions, that's 250,000 comparisons. While cosine_similarity is optimized, it still requires O(n²) space for the matrix.

Recommendation: If memory becomes an issue, consider computing similarities on-demand or using approximate nearest neighbor search (e.g., FAISS).
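
If the full matrix ever becomes a problem, a FAISS index over normalised embeddings avoids materialising all n² scores at once (FAISS is not a dependency of this PR; this is only an illustration):

import faiss
import numpy as np

vectors = np.asarray(embeddings, dtype="float32")
faiss.normalize_L2(vectors)  # with normalised vectors, inner product == cosine similarity

index = faiss.IndexFlatIP(vectors.shape[1])  # exact search; IndexHNSWFlat would be approximate
index.add(vectors)

# Top 5 most similar talks for each submission (the first hit is the talk itself).
scores, neighbours = index.search(vectors, k=6)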

Other Issues

Missing index on UserReview.review_session_id
The queries in get_shortlist_items_queryset filter by review_session_id frequently. Ensure there's a database index on UserReview.review_session_id and UserReview.proposal_id/UserReview.grant_id.
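
If the indexes are missing, a sketch of what they could look like on the model (field names are inferred from the *_id columns mentioned above):

from django.db import models

class UserReview(models.Model):
    # ... existing fields ...

    class Meta:
        indexes = [
            models.Index(fields=["review_session"]),
            models.Index(fields=["review_session", "proposal"]),
            models.Index(fields=["review_session", "grant"]),
        ]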

similar_talks.py:17 - Hardcoded cache timeout

CACHE_TIMEOUT = 60 * 60 * 24  # 24 hours

This should be configurable via Django settings rather than hardcoded.
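
For example, with a settings override and the current value as the default (the setting name is a suggestion):

from django.conf import settings

CACHE_TIMEOUT = getattr(settings, "SIMILAR_TALKS_CACHE_TIMEOUT", 60 * 60 * 24)  # 24 hours by default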

Inconsistent naming: "recap" vs "shortlist"
The PR renames "recap" to "shortlist" in adapters.py but then adds a new "recap" screen in admin.py. This is confusing:

  • shortlist_template property renamed from recap_template
  • But review_recap_view is a new view separate from shortlist
  • Three different screens now: review → shortlist → recap

Recommendation: Add documentation clarifying the three-stage flow and what each screen is for.

@codecov

codecov bot commented Feb 9, 2026

Codecov Report

❌ Patch coverage is 73.00613% with 44 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.50%. Comparing base (42c679b) to head (3ff0cf7).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4567      +/-   ##
==========================================
- Coverage   92.70%   92.50%   -0.20%     
==========================================
  Files         354      355       +1     
  Lines       10510    10658     +148     
  Branches      780      812      +32     
==========================================
+ Hits         9743     9859     +116     
- Misses        665      687      +22     
- Partials      102      112      +10     

- Add force_recompute param to compute_similar_talks/compute_topic_clusters
- Add ?recompute=1 query param support to compute-analysis endpoint
- Show "Recompute (ignore cache)" button after initial results load
- Replace globals with functools.cache, simplify _get_submission_languages
- Remove unused language mappings, keep only English and Italian
patrick91 merged commit b06be65 into main on Feb 9, 2026
8 checks passed
patrick91 deleted the recap-screen branch on February 9, 2026 at 11:30