Summary
objects.crc32 is meant to be a content fingerprint, and the comparing-builds workflow assumes you can analyze two builds into two separate databases and diff the CRCs to find which objects changed. That works for leaf assets with no references (Texture2D/Mesh/AudioClip), but it is broken for any object that contains references (Materials, prefabs/GameObjects, MonoBehaviours, etc.): identical content produces different CRCs in two separate analyze runs.
Cause
When PPtrAndCrcProcessor.ExtractPPtr folds a reference into the CRC, it uses the resolved analyzer/database object id returned by the callback, not the PPtr's own identity:
var refId = m_Callback(m_ObjectId, fileId, pathId, ...); // analyzer db id
m_Crc32 = Crc32Algorithm.Append(m_Crc32, <refId bytes>);
That id comes from ObjectIdProvider.GetId((m_LocalToDbFileId[fileId], pathId)), and both the serialized-file id and the object id are assigned sequentially per analyze run. So the same logical object gets different ids in db1 vs db2 → different CRC for identical content → cross-database comparison reports spurious differences for every object that has references.
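The effect can be reproduced with a small model. This is a sketch in Python rather than the tool's C#, and every name in it is hypothetical: the provider class only mimics the "sequential id per analyze run" behaviour described above, and zlib.crc32 stands in for Crc32Algorithm.Append.

```python
import struct
import zlib


class SequentialIdProvider:
    """Hypothetical stand-in for a per-run id table: ids follow first-seen order."""

    def __init__(self):
        self._ids = {}

    def get_id(self, key):
        # The first time a key is seen it receives the next integer.
        return self._ids.setdefault(key, len(self._ids) + 1)


def crc_with_resolved_ref(payload, ref_key, provider):
    """CRC over the object's bytes, then over the resolved db id of its reference."""
    crc = zlib.crc32(payload)
    ref_id = provider.get_id(ref_key)
    return zlib.crc32(struct.pack("<q", ref_id), crc)


material = b"material-bytes"        # identical content in both builds
texture = ("CAB-textures", 42)      # (target file name, pathId), made up
unrelated = ("CAB-other", 7)

# Run 1: the texture is the first object the provider sees.
run1 = SequentialIdProvider()
crc_run1 = crc_with_resolved_ref(material, texture, run1)

# Run 2: an unrelated object happens to be registered first.
run2 = SequentialIdProvider()
run2.get_id(unrelated)
crc_run2 = crc_with_resolved_ref(material, texture, run2)

print(hex(crc_run1), hex(crc_run2), crc_run1 == crc_run2)
```

The material bytes and the reference target are the same in both runs; only the order in which ids were handed out differs, and that alone changes the CRC.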
Why we can't just hash the raw PPtr (the tradeoff)
The obvious fix is to hash the raw on-disk PPtr (fileId + pathId) instead of the resolved id. But the resolved id is currently what makes within-database duplicate detection (view_potential_duplicates) work across bundles: two copies of the same object in different bundles reference the same target, and resolving through m_LocalToDbFileId (keyed by filename) + pathId normalizes them to the same id → same CRC → detected as duplicates.
fileId is a local index into a serialized file's external-reference list, so two copies of an object in different bundles can have different fileId values for the same target. Hashing the raw PPtr would therefore weaken duplicate detection. Deduplication is an important feature and is probably not well covered by tests yet, so we don't want to risk regressing it.
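A minimal sketch of that tradeoff, again as a Python model with hypothetical names. It assumes a non-zero fileId is a 1-based index into the file's external-reference list, and it uses a plain dict keyed by (file name, pathId) to imitate the normalization that m_LocalToDbFileId + pathId provides.

```python
import struct
import zlib


def crc_raw_pptr(payload, file_id, path_id):
    """Fold the on-disk PPtr (fileId, pathId) into the CRC unchanged."""
    crc = zlib.crc32(payload)
    return zlib.crc32(struct.pack("<iq", file_id, path_id), crc)


def crc_resolved(payload, file_id, path_id, externals, db_ids):
    """Fold in a db id looked up by (target file name, pathId)."""
    # Assumption: fileId 1..n indexes the external list; local refs not modelled.
    key = (externals[file_id - 1], path_id)
    ref_id = db_ids.setdefault(key, len(db_ids) + 1)
    crc = zlib.crc32(payload)
    return zlib.crc32(struct.pack("<q", ref_id), crc)


material = b"material-bytes"   # the same object, copied into two bundles

# Each bundle lists its external files in its own order, so the same
# target file sits at a different local index in each.
externals_a = ["CAB-shared", "CAB-textures"]   # texture file is fileId 2
externals_b = ["CAB-textures"]                 # texture file is fileId 1

crc_raw_a = crc_raw_pptr(material, 2, 42)
crc_raw_b = crc_raw_pptr(material, 1, 42)

db_ids = {}   # one table shared by both bundles within a single analyze run
crc_res_a = crc_resolved(material, 2, 42, externals_a, db_ids)
crc_res_b = crc_resolved(material, 1, 42, externals_b, db_ids)

print("raw PPtr equal:", crc_raw_a == crc_raw_b)
print("resolved equal:", crc_res_a == crc_res_b)
```

In this model the resolved-id CRCs match, which is what lets the two copies be detected as duplicates, while the raw-PPtr CRCs do not.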
Options to evaluate
- Raw PPtr (fileId + pathId): simplest; fixes cross-db comparison in the common case; risks weakening view_potential_duplicates (local fileId differs between bundles).
- Stable target identity + pathId: resolve fileId to a stable identifier for the target file and hash that + pathId, so it is independent of the local index. This fixes cross-db comparison AND preserves cross-bundle duplicate detection, but the "stable identifier" differs by source:
  - Build output external references carry a path (e.g. archive:/CAB-...), not a GUID.
  - Editor / Library references carry a GUID (the source asset's GUID).
  So the CRC needs to mix in whichever of ExternalReference.Path / ExternalReference.Guid is populated (and a fixed marker for local refs, fileId == 0). Relies on those fields being present and stable. More code: thread the external-reference info from sf.ExternalReferences into the CRC.
- Status quo: cross-db comparison stays broken for referenced objects.
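The "stable target identity" option could look roughly like the following. This is a Python sketch under stated assumptions, not the tool's implementation: ExternalRef is a hypothetical stand-in for ExternalReference, the marker bytes and the "guid:"/"path:" prefixes are invented, preferring the GUID when both fields are set is an arbitrary choice, and the bundle paths are made up.

```python
import struct
import zlib
from dataclasses import dataclass

LOCAL_MARKER = b"\x00local"   # hypothetical fixed marker for fileId == 0


@dataclass(frozen=True)
class ExternalRef:
    """Hypothetical stand-in for an external-reference entry."""
    path: str = ""
    guid: str = ""


def stable_target_key(file_id, externals):
    """Bytes identifying the target file independently of its local index."""
    if file_id == 0:
        return LOCAL_MARKER
    ext = externals[file_id - 1]   # assumption: 1-based index into the list
    if ext.guid:
        return b"guid:" + ext.guid.encode("utf-8")
    if ext.path:
        return b"path:" + ext.path.encode("utf-8")
    raise ValueError("external reference has neither a path nor a GUID")


def crc_stable(payload, file_id, path_id, externals):
    """CRC over the payload, the stable target key, then pathId."""
    crc = zlib.crc32(payload)
    crc = zlib.crc32(stable_target_key(file_id, externals), crc)
    return zlib.crc32(struct.pack("<q", path_id), crc)


material = b"material-bytes"
bundle_a = [ExternalRef(path="archive:/CAB-shared"),
            ExternalRef(path="archive:/CAB-textures")]
bundle_b = [ExternalRef(path="archive:/CAB-textures")]

crc_a = crc_stable(material, 2, 42, bundle_a)   # texture file is fileId 2 here
crc_b = crc_stable(material, 1, 42, bundle_b)   # and fileId 1 here
print("stable key equal:", crc_a == crc_b)
```

Because no per-run id table is involved, the result depends only on the object's bytes and on the target's path or GUID, so it is the same across bundles and across analyze runs. A real implementation would also need to decide what counts as a "populated" GUID (for example, whether an all-zero GUID should fall through to the path).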
Prerequisite
Add test coverage for view_potential_duplicates / cross-bundle deduplication before changing the CRC, so a fix can be validated to not regress it.
Context
Discovered while reviewing #73 / #70. Note that this is independent of the CRC changes made there (the ManagedReferenceData size fix, the ComputeCRC chunking fix, and the cah:/ stream hashing) — those also change CRC values vs. older tool versions, so CRCs are not comparable across tool versions regardless.
Related: #44 (refs table).