CRC of objects with references is not comparable across separate databases #74

@SkowronskiAndrew

Summary

objects.crc32 is meant to be a content fingerprint, and the comparing-builds workflow expects that you can analyze two builds into two separate databases and diff the CRCs to find which objects changed. That works for leaf assets with no references (Texture2D, Mesh, AudioClip), but it is broken for any object that contains references (Materials, prefabs/GameObjects, MonoBehaviours, etc.): identical content produces different CRCs in two separate analyze runs.

Cause

When PPtrAndCrcProcessor.ExtractPPtr folds a reference into the CRC, it uses the resolved analyzer/database object id returned by the callback, not the PPtr's own identity:

var refId = m_Callback(m_ObjectId, fileId, pathId, ...);   // analyzer db id
m_Crc32 = Crc32Algorithm.Append(m_Crc32, <refId bytes>);

That id comes from ObjectIdProvider.GetId((m_LocalToDbFileId[fileId], pathId)), and both the serialized-file id and the object id are assigned sequentially per analyze run. So the same logical object gets different ids in db1 vs db2 → different CRC for identical content → cross-database comparison reports spurious differences for every object that has references.
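The effect can be reproduced with a small model. This is a sketch in Python, not the tool's code: the `ObjectIdProvider` class here is a hypothetical stand-in that only mimics "ids are handed out sequentially in first-seen order", and `zlib.crc32` stands in for `Crc32Algorithm.Append`. The file names and pathIds are made up.

```python
import struct
import zlib


class ObjectIdProvider:
    """Hypothetical stand-in: hands out sequential ids in first-seen order."""

    def __init__(self):
        self._ids = {}

    def get_id(self, key):
        # A new key gets the next sequential id; a known key keeps its id.
        return self._ids.setdefault(key, len(self._ids) + 1)


def crc_with_resolved_ref(payload, target, provider):
    """CRC of the object's bytes, then of the resolved database id of its reference."""
    crc = zlib.crc32(payload)
    ref_id = provider.get_id(target)
    return zlib.crc32(struct.pack("<q", ref_id), crc)


material = b"material-bytes"          # identical content in both builds
texture = ("shared.assets", 42)       # (target file name, pathId), made-up values
other = ("other.assets", 7)

# Run 1: the texture is the first object the analyzer sees, so it gets id 1.
run1 = ObjectIdProvider()
run1.get_id(texture)
crc_db1 = crc_with_resolved_ref(material, texture, run1)

# Run 2: another object is seen first, so the texture gets id 2.
run2 = ObjectIdProvider()
run2.get_id(other)
run2.get_id(texture)
crc_db2 = crc_with_resolved_ref(material, texture, run2)

print(crc_db1 == crc_db2)  # False: same content, different fingerprint
```

A leaf object never reaches the reference branch, which is why its CRC stays comparable across runs.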

Why we can't just hash the raw PPtr (the tradeoff)

The obvious fix is to hash the raw on-disk PPtr (fileId + pathId) instead of the resolved id. But the resolved id is currently what makes within-database duplicate detection (view_potential_duplicates) work across bundles: two copies of the same object in different bundles reference the same target, and resolving through m_LocalToDbFileId (keyed by filename) + pathId normalizes them to the same id → same CRC → detected as duplicates.

fileId is a local index into a serialized file's external-reference list, so two copies of an object in different bundles can have different fileId values for the same target. Hashing the raw PPtr would therefore weaken duplicate detection. Deduplication is an important feature and is probably not well covered by tests yet, so we don't want to risk regressing it.
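To make the tradeoff concrete, here is a minimal Python sketch of two copies of the same object living in different bundles. Everything in it is illustrative: the `externals` dicts model each file's external-reference list as a local fileId to target file name mapping, `db_ids` models the resolved analyzer ids, and the names and numbers are invented.

```python
import struct
import zlib


def crc_raw_pptr(payload, file_id, path_id):
    """Fold the on-disk PPtr (local fileId + pathId) into the CRC."""
    return zlib.crc32(struct.pack("<iq", file_id, path_id), zlib.crc32(payload))


def crc_resolved(payload, file_id, path_id, externals, db_ids):
    """Fold in the database id looked up by (target file name, pathId)."""
    target = (externals[file_id], path_id)
    return zlib.crc32(struct.pack("<q", db_ids[target]), zlib.crc32(payload))


material = b"material-bytes"

# Both copies reference pathId 42 in shared.assets, but the target sits at a
# different position in each bundle's external-reference list.
externals_a = {1: "shared.assets"}
externals_b = {1: "common.assets", 2: "shared.assets"}
db_ids = {("shared.assets", 42): 5}   # one id per target within a single run

raw_a = crc_raw_pptr(material, 1, 42)
raw_b = crc_raw_pptr(material, 2, 42)
resolved_a = crc_resolved(material, 1, 42, externals_a, db_ids)
resolved_b = crc_resolved(material, 2, 42, externals_b, db_ids)

print(raw_a == raw_b)            # False: the duplicate would be missed
print(resolved_a == resolved_b)  # True: the duplicate is detected
```

Within one database the resolved id normalizes the two copies; the raw PPtr does not, because it still encodes the per-file index.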

Options to evaluate

  1. Raw PPtr (fileId + pathId) — simplest; fixes cross-db comparison in the common case; risks weakening view_potential_duplicates (local fileId differs between bundles).
  2. Stable target identity + pathId — resolve fileId to a stable identifier for the target file and hash that + pathId, so it is independent of the local index. This fixes cross-db comparison AND preserves cross-bundle duplicate detection, but the "stable identifier" differs by source:
    • Build output external references carry a path (e.g. archive:/CAB-...), not a GUID.
    • Editor / Library references carry a GUID (the source asset's GUID).
    So the CRC needs to mix in whichever of ExternalReference.Path / ExternalReference.Guid is populated (and a fixed marker for local refs, fileId == 0). This relies on those fields being present and stable.
    It also means more code: the external-reference info from sf.ExternalReferences has to be threaded into the CRC computation.
  3. Status quo — cross-db comparison stays broken for referenced objects.
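Option 2 could look roughly like the following. This is a Python sketch under stated assumptions, not a proposed implementation: `ExternalRef` is a hypothetical mirror of an external-reference entry, an unset field is assumed to be an empty string, the one-byte `L`/`G`/`P` tags are an invented way to keep the three cases distinct, and the path value is made up.

```python
import struct
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ExternalRef:
    """Hypothetical mirror of an external-reference entry; either field may be empty."""
    path: str = ""
    guid: str = ""


def stable_target(file_id, externals):
    """Return bytes identifying the target file independently of the local index."""
    if file_id == 0:
        return b"L"  # fixed marker: reference into the same serialized file
    ext = externals[file_id]
    if ext.guid:
        return b"G" + ext.guid.encode("utf-8")
    if ext.path:
        return b"P" + ext.path.encode("utf-8")
    raise ValueError(f"external reference {file_id} has neither guid nor path")


def crc_stable(payload, file_id, path_id, externals):
    """CRC of the object's bytes, the stable target identity, then the pathId."""
    crc = zlib.crc32(payload)
    crc = zlib.crc32(stable_target(file_id, externals), crc)
    return zlib.crc32(struct.pack("<q", path_id), crc)


material = b"material-bytes"
shared = ExternalRef(path="archive:/example-target")   # made-up path

# Same target, different local index in each bundle.
externals_a = {1: shared}
externals_b = {1: ExternalRef(path="archive:/example-other"), 2: shared}

crc_a = crc_stable(material, 1, 42, externals_a)
crc_b = crc_stable(material, 2, 42, externals_b)
print(crc_a == crc_b)  # True
```

Because nothing in `crc_stable` depends on per-run state, the same value would come out of two separate analyze runs, and because the local index is replaced by the target's identity, the two bundle copies still match. Raising on an entry with neither field populated is one possible policy; falling back to the raw fileId would be another.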

Prerequisite

Add test coverage for view_potential_duplicates / cross-bundle deduplication before changing the CRC, so a fix can be validated to not regress it.

Context

Discovered while reviewing #73 / #70. Note that this is independent of the CRC changes made there (the ManagedReferenceData size fix, the ComputeCRC chunking fix, and the cah:/ stream hashing) — those also change CRC values vs. older tool versions, so CRCs are not comparable across tool versions regardless.

Related: #44 (refs table).
