Skip to content

Optimize provenance lookup query in simple page retrieval #1243

@dkliban

Description

@dkliban

Problem

The package provenance lookup query in the PyPI Simple API retrieve method generates an inefficient nested subquery pattern that is a significant contributor to database load in production. During peak load on 2026-05-20 at 16:00 UTC, 4 concurrent instances of this query pattern consumed 2.64 avg active sessions on the RDS instance (db.m7g.4xlarge, 16 vCPUs). Total DB load reached 21.9 avg active sessions, exceeding the CPU count.

Source

pulp_python/app/pypi/views.py lines 387-388:

provenances = PackageProvenance.objects.filter(package__in=local_packages).values_list(
    "package__filename", flat=True
)

SQL pattern observed (from RDS Performance Insights)

SELECT "python_pythonpackagecontent"."filename" AS "package__filename"
FROM "python_packageprovenance"
INNER JOIN "python_pythonpackagecontent"
    ON ("python_packageprovenance"."package_id" = "python_pythonpackagecontent"."content_ptr_id")
WHERE "python_packageprovenance"."package_id" IN (
    SELECT V0."content_ptr_id"
    FROM "python_pythonpackagecontent" V0
    WHERE V0."content_ptr_id" IN (
        SELECT U0."pulp_id"
        FROM "core_content" U0
        WHERE U0."pulp_id" IN ('...uuid...', '...uuid...', ...)
    )
)

The nested IN (SELECT ... IN (SELECT ...)) pattern is inefficient. The wait event breakdown during the spike was primarily CPU (17.9 avg active sessions), indicating compute-bound query execution (likely hash joins or repeated subquery evaluation).

Suggestions

  • Combine with the metadata query: Instead of two separate queries (one for package metadata, one for provenance), use a single annotated queryset with Exists() or Subquery() to check provenance in the same query
  • Add an index on python_packageprovenance.package_id if one doesn't exist
  • Simplify the filter: The package__in=local_packages produces the nested subquery. Consider using package_id__in=local_packages.values_list("pk", flat=True) to flatten the subquery, or passing an explicit list of PKs
  • Cache provenance existence: If provenance data changes infrequently, cache which packages have provenance records

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions