[cDAC] Implement stack reference walking and GC stress verification #125505
Open
max-charlamb wants to merge 48 commits into dotnet:main
Tagging subscribers to this area: @steveisok, @tommcdon, @dotnet/dotnet-diag
… support

Squash of cdac-stackreferences branch changes onto main:
- Implement stack reference enumeration (EnumerateStackRefs)
- Add GC scanning support (GcScanner, GcScanContext, BitStreamReader)
- Add exception handling for stack walks (ExceptionHandling)
- Add IsFunclet/IsFilterFunclet to execution manager
- Add EH clause retrieval for ReadyToRun
- Add data types: EEILExceptionClause, CorCompileExceptionClause, CorCompileExceptionLookupEntry, LastReportedFuncletInfo
- Update datadescriptor.inc with new type layouts
- Update SOSDacImpl with improved stack walk support

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Port the native GcInfoDecoder::EnumerateLiveSlots to managed code:
- Add FindSafePoint for partially-interruptible safe point lookup
- Handle partially-interruptible path (1-bit-per-slot and RLE encoded)
- Handle indirect live state table with pointer offset indirection
- Handle fully-interruptible path with chunk-based lifetime transitions (couldBeLive bitvectors, final state bits, transition offsets)
- Report untracked slots (always live unless suppressed by flags)
- Add InterruptibleRanges/SlotTable decode points for lazy decoding
- Save safe point and live state bit offsets during body decode
- Add POINTER_SIZE_ENCBASE, LIVESTATE_RLE_*, NUM_NORM_CODE_OFFSETS_* constants to IGCInfoTraits (same across all platforms)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix IsFrameless: use StackWalkState.SW_FRAMELESS check
- Fix EnumGcRefs call: pass CodeManagerFlags parameter (was missing)
- Add public access modifier to GetMethodRegionInfo in ExecutionManager_1/2
- Fix redundant equality (== false) in ExecutionManagerCore
- Suppress unused parameter/variable analyzer errors in GcScanner stub

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Wire GcScanner to use IGCInfoDecoder.EnumerateLiveSlots
- Add LiveSlotCallback delegate and EnumerateLiveSlots to IGCInfoDecoder
- Add interface implementation in GcInfoDecoder that wraps the generic method
- Translate register slots to values via IPlatformAgnosticContext
- Translate stack slots using SP/FP base + offset addressing
- Add StackBaseRegister accessor to GcInfoDecoder
- Report live slots to GcScanContext.GCEnumCallback with proper flags
- Add GcScanFlags.None value

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add StackReferenceData public data class in Abstractions
- Change IStackWalk.WalkStackReferences to return IReadOnlyList<StackReferenceData>
- Update StackWalk_1.WalkStackReferences to convert and return results
- Add ISOSStackRefEnum, ISOSStackRefErrorEnum COM interfaces with GUIDs
- Add SOSStackRefData, SOSStackRefError structs, SOSStackSourceType enum
- Add SOSStackRefEnum class implementing ISOSStackRefEnum (follows SOSHandleEnum pattern)
- Wire up SOSDacImpl.GetStackReferences: find thread by OS ID, walk stack references, convert to SOSStackRefData[], return via COM enumerator
- Remove Console.WriteLine debug output from WalkStackReferences

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add three test classes for stack reference enumeration:
- StackReferenceDumpTests: Basic tests using StackWalk debuggee (WalkStackReferences returns without throwing, refs have valid source info)
- GCRootsStackReferenceDumpTests: Tests using GCRoots debuggee which keeps objects alive on stack via GC.KeepAlive (finds refs, refs point to valid objects)
- PInvokeFrameStackReferenceDumpTests: Tests using PInvokeStub debuggee which has InlinedCallFrame on the stack (non-frameless Frame path)

The PInvokeStub tests exercise the Frame::GcScanRoots path which is not yet implemented (empty else block in WalkStackReferences).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add native C++ changes needed for the data descriptor entries:
- Add friend cdac_data<ExInfo> to ExceptionFlags for m_flags access
- Add LastReportedFuncletInfo struct and field to ExInfo
- Add cdac_data<PatchpointInfo> specialization for LocalCount
- Use cdac_data<ExInfo>::ExceptionFlagsValue for ExceptionFlags offset

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add ScanFrameRoots method that dispatches based on frame type name. Most frame types use the base Frame::GcScanRoots_Impl which is a no-op.

Key findings documented in the code:
- GCFrame is NOT part of the Frame chain and the DAC does not scan it
- Stub frames (StubDispatch, External, CallCounting, Dynamic, CLRToCOM) call PromoteCallerStack to report method arguments (not yet implemented)
- InlinedCallFrame, SoftwareExceptionFrame, etc. use the base no-op

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ation

- Fix GcScanSlotLocation register for stack slots: was hardcoded to 0, now correctly maps GC_SP_REL→RSP(4), GC_FRAMEREG_REL→stackBaseRegister
- Update GetStackReferences debug block to use set-based comparison (match by Address) instead of index-based, since ref ordering may differ
- Validate Object, SourceType, Source, and Flags for each matched ref

Known issue: Some refs have different computed addresses between cDAC and legacy DAC due to stack slot address computation differences. Needs further investigation of SP/FP handling during stack walk context management.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
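The set-based comparison mentioned in this commit can be sketched roughly as follows. This is a minimal C++ illustration, not the code in `GetStackReferences`: `StackRef` and `CompareRefs` are hypothetical names, only a few fields are modeled, and it assumes each address appears at most once per list.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for one reported stack reference.
struct StackRef {
    uint64_t address;  // location of the slot holding the reference
    uint64_t object;   // object pointer stored in that slot
    uint32_t flags;    // GC flags reported for the slot
};

// Compare two ref lists without assuming they are reported in the same order.
// Returns human-readable mismatches; an empty result means the lists agree.
std::vector<std::string> CompareRefs(const std::vector<StackRef>& cdac,
                                     const std::vector<StackRef>& legacy) {
    std::vector<std::string> mismatches;

    // Index the legacy refs by address (assumed unique within a list).
    std::map<uint64_t, StackRef> byAddress;
    for (const StackRef& r : legacy) byAddress[r.address] = r;

    for (const StackRef& r : cdac) {
        auto it = byAddress.find(r.address);
        if (it == byAddress.end()) {
            mismatches.push_back("extra ref at " + std::to_string(r.address));
            continue;
        }
        if (it->second.object != r.object || it->second.flags != r.flags)
            mismatches.push_back("field mismatch at " + std::to_string(r.address));
        byAddress.erase(it);  // matched; whatever remains was never reported by cdac
    }
    for (const auto& kv : byAddress)
        mismatches.push_back("missing ref at " + std::to_string(kv.first));
    return mismatches;
}
```

Matching by address rather than by index means a pure reordering between the two implementations is not reported as a failure, while extra, missing, and field-level differences still are.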
Fix two bugs in the GCInfoDecoder slot table decoder that caused wrong slots to be reported as live:
1. When previous slot had non-zero flags, subsequent slots use a FULL offset (STACK_SLOT_ENCBASE) not a delta. The managed code incorrectly used STACK_SLOT_DELTA_ENCBASE for this case.
2. When previous slot had zero flags, subsequent slots use an unsigned delta (DecodeVarLengthUnsigned) with no +1 adjustment. The managed code incorrectly used DecodeVarLengthSigned with +1.

Both bugs affected tracked and untracked stack slot sections.

Verified with DOTNET_ENABLE_CDAC=1 and cdb against three debuggee dumps: all refs now match the legacy DAC exactly (count, Address, Object, Source, SourceType, Flags, Register, Offset for every ref).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fix two bugs found via deep comparison with native GCInfoDecoder:
1. ARM64GCInfoTraits.DenormalizeStackBaseRegister used 0x29 (41 decimal) instead of 29 decimal. ARM64's frame pointer is X29, so the native XORs with 29. This would produce wrong addresses for all ARM64 stack-base-relative GC slots.
2. When ExecutionAborted and instruction offset is not in any interruptible range, the native code jumps to ExitSuccess (skips all reporting). The managed code incorrectly jumped to ReportUntracked, which would over-report untracked slots for aborted frames.

Also documented the missing scratch register/slot filtering as a known gap (TODO in ReportSlot). The native ReportSlotToGC checks IsScratchRegister/IsScratchStackSlot for non-leaf frames; the cDAC currently reports all slots unconditionally.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
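To make the first bug concrete, here is a small C++ sketch of the XOR-based normalization the commit describes. The function names are illustrative rather than copied from the runtime; the only facts taken from the commit are that the native code XORs with decimal 29 and that the managed code used `0x29`.

```cpp
#include <cassert>
#include <cstdint>

// ARM64's frame pointer is X29; XOR with 29 maps register 29 to 0 and back.
constexpr uint32_t kArm64FramePointer = 29;

constexpr uint32_t NormalizeStackBaseRegister(uint32_t reg) {
    return reg ^ kArm64FramePointer;
}

constexpr uint32_t DenormalizeStackBaseRegister(uint32_t encoded) {
    return encoded ^ kArm64FramePointer;
}

// The buggy variant: the hex literal 0x29 is 41 in decimal, so an encoded
// value of 0 decodes to 41, which is not a valid ARM64 general register.
constexpr uint32_t BuggyDenormalize(uint32_t encoded) {
    return encoded ^ 0x29;
}
```

Because XOR is its own inverse, the correct pair round-trips every register number, whereas the buggy decoder returns a different register for every input.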
- Match native safe point skip: always skip numSafePoints * numTracked bits in the else branch, matching the native behavior. The indirect table case (numBitsPerOffset != 0) combined with interruptible ranges is unreachable in practice.
- Add TODO for FindSafePoint binary search optimization (perf only, no correctness impact).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add scratch register filtering to match native ReportSlotToGC behavior:
- Add IsScratchRegister to IGCInfoTraits with per-platform implementations:
  - AMD64: preserved = rbx, rbp, rsi, rdi, r12-r15 (Windows ABI)
  - ARM64: preserved = x19-x28; scratch = x0-x17, x29-x30
  - ARM: preserved = r4-r11; scratch = r0-r3, r12, r14
  - Interpreter: no scratch registers
- Add scratch filtering in ReportSlot: skip scratch registers for non-leaf frames (when ActiveStackFrame is not set)
- Add ReportFPBasedSlotsOnly filtering: skip register slots and non-FP-relative stack slots when flag is set
- Add IsScratchStackSlot check: skip SP-relative slots in the outgoing/scratch area for non-leaf frames
- Set ActiveStackFrame flag for the first frameless frame in WalkStackReferences (matching native GetCodeManagerFlags behavior)

Verified with DOTNET_ENABLE_CDAC=1 against three debuggee dumps: all refs match the legacy DAC exactly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
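A minimal sketch of the AMD64 filtering rule, written in C++ for illustration. It assumes the GC info uses the usual x86-64 register encoding (rax=0, rcx=1, rdx=2, rbx=3, rsp=4, rbp=5, rsi=6, rdi=7, r8-r15=8-15); the preserved set is the Windows ABI list from the commit message, and `ShouldReportRegisterSlot` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>

// Preserved (callee-saved) registers per the commit message:
// rbx(3), rbp(5), rsi(6), rdi(7), r12-r15.
constexpr uint32_t kPreservedMask =
    (1u << 3) | (1u << 5) | (1u << 6) | (1u << 7) |
    (1u << 12) | (1u << 13) | (1u << 14) | (1u << 15);

// A register counts as scratch when it is not in the preserved set.
constexpr bool IsScratchRegister(uint32_t reg) {
    return reg < 16 && ((kPreservedMask >> reg) & 1u) == 0;
}

// Scratch registers are only meaningful in the active (leaf) frame; in a
// caller frame they may have been overwritten by the callee, so skip them.
constexpr bool ShouldReportRegisterSlot(uint32_t reg, bool isActiveFrame) {
    return isActiveFrame || !IsScratchRegister(reg);
}
```

For example, a live reference in rax (0) would be reported for the leaf frame but skipped for any caller frame, while one in r12 would be reported in both.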
Fix 5 issues from PR dotnet#125075 review:
1. datadescriptor.inc: Fix EHInfo type annotation from /*uint16*/ to /*pointer*/; phdrJitEHInfo is PTR_EE_ILEXCEPTION, not uint16.
2. StackWalk.md: Update GetMethodDescPtr(IStackDataFrameHandle) docs to describe InlinedCallFrame special case for interop MethodDesc reporting at SW_SKIPPED_FRAME positions.
3. BitStreamReader: Replace static host-dependent BitsPerSize (IntPtr.Size * 8) with instance-based _bitsPerSize (target.PointerSize * 8) for correct cross-architecture analysis.
4. SOSDacImpl: Restore GetMethodDescPtrFromFrame implementation that was incorrectly stubbed with E_FAIL. Restores the cDAC implementation with debug validation against legacy DAC.
5. ReadyToRunJitManager: Fix GetEHClauses clause address computation to include entry.ExceptionInfoRva; it was computing from imageBase directly, missing the RVA offset to the exception info section.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fix several bugs in the cDAC's stack reference walking that caused mismatches against the legacy DAC during GC stress testing:
- Fix GC_CALLER_SP_REL using wrong base address: GcScanner used the current context's StackPointer for GC_CALLER_SP_REL slots instead of the actual caller SP. Fixed by computing the caller SP via clone+unwind, with lazy caching to avoid repeated unwinds.
- Fix IsFirst/ActiveStackFrame tracking: The cDAC used a simple isFirstFramelessFrame boolean to determine active frame status. Replaced with an IsFirst state machine in StackWalkData matching native CrawlFrame::isFirst semantics: starts true, set false after frameless frames, restored to true after FRAME_ATTR_RESUMABLE frames (ResumableFrame, RedirectedThreadFrame, HijackFrame).
- Fix FaultingExceptionFrame incorrectly treated as resumable: FaultingExceptionFrame has FRAME_ATTR_FAULTED but NOT FRAME_ATTR_RESUMABLE. Including it in the resumable check caused IsFirst=true on the wrong managed frame, producing spurious scratch register refs.
- Skip Frames below initial context SP in CreateStackWalk: Matches the native DAC behavior where StackWalkFrames with a profiler filter context skips Frames at lower SP (pushed more recently). Without this, RedirectedThreadFrame from GC stress redirect incorrectly set IsFirst=true for non-leaf managed frames.
- Refactor scratch stack slot detection into IsScratchStackSlot on platform traits (AMD64, ARM64, ARM), matching the native GcInfoDecoder per-platform IsScratchStackSlot pattern.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
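The IsFirst state machine described in this commit can be illustrated with a short C++ sketch. `FrameKind` and `IsFirstTracker` are hypothetical names; the transitions follow the commit text (starts true, cleared after a frameless frame, restored after a resumable Frame). Leaving the state unchanged for other explicit Frames is an assumption of this sketch.

```cpp
#include <cassert>

enum class FrameKind {
    Frameless,       // managed method frame
    ResumableFrame,  // ResumableFrame, RedirectedThreadFrame, HijackFrame
    OtherFrame       // any other explicit Frame, e.g. FaultingExceptionFrame
};

class IsFirstTracker {
public:
    // Returns whether the frame being visited should be treated as the active
    // frame, then updates the state that the next frame will observe.
    bool Visit(FrameKind kind) {
        const bool wasFirst = isFirst_;
        switch (kind) {
        case FrameKind::Frameless:
            isFirst_ = false;  // everything above a managed frame is a caller
            break;
        case FrameKind::ResumableFrame:
            isFirst_ = true;   // the next managed frame was interrupted mid-execution
            break;
        case FrameKind::OtherFrame:
            break;             // assumption: not resumable, state unchanged
        }
        return wasFirst;
    }

private:
    bool isFirst_ = true;  // the walk starts at the leaf
};
```

With this model, classifying FaultingExceptionFrame as `OtherFrame` instead of `ResumableFrame` is what keeps the managed frame above it from being treated as active, which matches the spurious scratch register refs the commit describes.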
The initial Frame skip used the leaf's SP as the threshold, which missed active InlinedCallFrames whose address was above the leaf SP but below the caller SP. These Frames would be processed as SW_FRAME, causing UpdateContextFromFrame to restore the IP to the P/Invoke return address within the same method and producing duplicate GC refs.

Use the caller SP (computed by unwinding the initial managed frame) as the skip threshold, matching the native CheckForSkippedFrames which uses EnsureCallerContextIsValid + GetSP(pCallerContext). This correctly skips all Frames between the managed frame and its caller, including both RedirectedThreadFrame and active InlinedCallFrames.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Newer fields added to RealCodeHeader (EHInfo), ReadyToRunInfo (ExceptionInfoSection), and ExceptionInfo (ExceptionFlags, StackLowBound, StackHighBound, PassNumber, CSFEHClause, CSFEnclosingClause, CallerOfActualHandlerFrame, LastReportedFuncletInfo) may not exist in older contract versions. Guard each with type.Fields.ContainsKey and default to safe values to prevent KeyNotFoundException when analyzing older dumps.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
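The guard pattern is language-independent; the following C++ sketch shows the same idea with a plain map standing in for the type's field table. `FieldMap`, `TryGetFieldOffset`, and `ReadOptionalField` are hypothetical names for illustration and do not correspond to cDAC APIs.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a contract type description: field name -> offset.
using FieldMap = std::unordered_map<std::string, uint32_t>;

// Look up a field offset without throwing when an older descriptor lacks it.
std::optional<uint32_t> TryGetFieldOffset(const FieldMap& fields,
                                          const std::string& name) {
    auto it = fields.find(name);
    if (it == fields.end()) return std::nullopt;
    return it->second;
}

// Read a 32-bit field if the descriptor defines it; otherwise use the fallback.
uint32_t ReadOptionalField(const FieldMap& fields, const std::string& name,
                           const uint8_t* base, uint32_t fallback) {
    std::optional<uint32_t> offset = TryGetFieldOffset(fields, name);
    if (!offset) return fallback;  // field absent in this contract version
    uint32_t value;
    std::memcpy(&value, base + *offset, sizeof value);  // alignment-safe read
    return value;
}
```

The fallback has to be chosen per field so that downstream logic behaves as it did before the field existed, which is what "default to safe values" implies.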
rcj1 reviewed on Mar 16, 2026
rcj1 reviewed on Mar 17, 2026
Comment on lines 57 to 61:

      TargetPointer firstNestedException = TargetPointer.Null;
    - if (address != TargetPointer.Null)
    + if (thread.ExceptionTracker != TargetPointer.Null)
      {
    -     Data.ExceptionInfo exceptionInfo = _target.ProcessedData.GetOrAdd<Data.ExceptionInfo>(address);
    +     Data.ExceptionInfo exceptionInfo = _target.ProcessedData.GetOrAdd<Data.ExceptionInfo>(thread.ExceptionTracker);
          firstNestedException = exceptionInfo.PreviousNestedInfo;
Comment on lines +130 to +134:

    private TargetPointer GetCurrentExceptionTracker(StackDataFrameHandle handle)
    {
        Data.Thread thread = _target.ProcessedData.GetOrAdd<Data.Thread>(handle.ThreadData.ThreadAddress);
        return thread.ExceptionTracker;
    }
Comment on lines +54 to +66:

    public TargetPointer GetRegisterValue(uint registerNumber)
    {
        if (!s_registerNumberToField.TryGetValue((int)registerNumber, out FieldInfo? field))
            throw new ArgumentOutOfRangeException(nameof(registerNumber), $"Register number {registerNumber} not found in {typeof(T).Name}");

        object? value = field.GetValue(Context);
        return value switch
        {
            ulong ul => new TargetPointer(ul),
            uint ui => new TargetPointer(ui),
            _ => throw new InvalidOperationException($"Unexpected register field type {field.FieldType} for register {registerNumber}"),
        };
    }
Comment on lines +47 to +49:

    $coreRoot = "$repoRoot\artifacts\tests\coreclr\windows.x64.$Configuration\Tests\Core_Root"
    $testDir = "$repoRoot\artifacts\tests\coreclr\windows.x64.$Configuration\Tests\cdacgcstresstest"
noahfalk (Member) approved these changes on Mar 17, 2026

A few nits but overall this looks good to me. I didn't spend any effort trying to reason about the correctness of the GC reference reporting algo as I'm expecting the testing will do a far better job than I could by code inspection.
…2-with-stress

# Conflicts:
#	src/native/managed/cdac/Microsoft.Diagnostics.DataContractReader.Contracts/Contracts/StackWalk/Context/ContextHolder.cs
#	src/native/managed/cdac/Microsoft.Diagnostics.DataContractReader.Contracts/Contracts/StackWalk/Context/IPlatformAgnosticContext.cs
Break out of the while loop when Split-Path -Parent returns the same path (filesystem root), preventing infinite iteration on Windows where C:\ is its own parent.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
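The same termination condition, sketched in C++ with `std::filesystem` rather than PowerShell. `Ancestors` is a hypothetical helper, not part of the script; it only manipulates path strings and never touches the disk.

```cpp
#include <cassert>
#include <filesystem>
#include <vector>

// Collect a directory and all of its ancestors, stopping at the root.
std::vector<std::filesystem::path> Ancestors(std::filesystem::path dir) {
    std::vector<std::filesystem::path> result;
    while (true) {
        result.push_back(dir);
        std::filesystem::path parent = dir.parent_path();
        // A root path is its own parent, and a bare relative name has an
        // empty parent; without this check the loop would never terminate.
        if (parent.empty() || parent == dir) break;
        dir = parent;
    }
    return result;
}
```

A search for a repository root would test each element for a marker file and stop at the first hit; the explicit "parent equals self" check is what guarantees an exit when no marker exists.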
RegisterNumber on RegisterAttribute is no longer needed since PR dotnet#125621 added explicit TrySetRegister(int)/TryReadRegister(int) switch dispatch directly on each context struct. Revert these files to match main.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment on lines +63 to +73:

    TargetPointer baseAddr = spBase switch
    {
        1 => context.StackPointer, // GC_SP_REL
        2 => ReadRegisterValue(context, (int)stackBaseRegister), // GC_FRAMEREG_REL
        0 => GetCallerSP(context, ref callerSP), // GC_CALLER_SP_REL
        _ => throw new InvalidOperationException($"Unknown stack slot base: {spBase}"),
    };

    TargetPointer addr = new(baseAddr.Value + (ulong)(long)spOffset);
    GcScanSlotLocation loc = new((int)spBase, spOffset, true);
    scanContext.GCEnumCallback(addr, scanFlags, loc);
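The address arithmetic in this snippet relies on sign-extending the 32-bit offset before adding it to an unsigned base. A minimal C++ sketch of the same computation follows; the base values 0, 1, and 2 are taken from the comments in the snippet, while `FrameBases` and `ResolveSlotAddress` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Values as annotated in the snippet under review.
enum GcStackSlotBase : uint32_t {
    GC_CALLER_SP_REL = 0,
    GC_SP_REL = 1,
    GC_FRAMEREG_REL = 2,
};

// Hypothetical bundle of the three possible base addresses for one frame.
struct FrameBases {
    uint64_t sp;        // current stack pointer
    uint64_t frameReg;  // value of the stack base (frame) register
    uint64_t callerSp;  // stack pointer of the caller, obtained by unwinding
};

uint64_t ResolveSlotAddress(GcStackSlotBase base, int32_t offset,
                            const FrameBases& f) {
    uint64_t baseAddr;
    switch (base) {
    case GC_SP_REL:        baseAddr = f.sp;       break;
    case GC_FRAMEREG_REL:  baseAddr = f.frameReg; break;
    case GC_CALLER_SP_REL: baseAddr = f.callerSp; break;
    default: throw std::invalid_argument("unknown stack slot base");
    }
    // Sign-extend to 64 bits first, then add modulo 2^64, so a negative
    // offset moves the address down instead of adding roughly 4 GiB.
    return baseAddr + static_cast<uint64_t>(static_cast<int64_t>(offset));
}
```

Casting a negative `int32_t` directly to an unsigned 64-bit type via `uint32_t` would zero-extend instead, which is why the intermediate signed 64-bit cast matters.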
Comment on lines +88 to +90:

    See [tests/gcstress/known-issues.md](tests/gcstress/known-issues.md) for the full list.
    Key gaps include dynamic method (IL Stub) GC refs, frame duplication on deep stacks,
    and unimplemented `PromoteCallerStack` for stub frames. Current pass rate: ~99.7%.
…ToRunInfo fields

Add missing fields to test mock type descriptors:
- ExceptionInfoSection in ReadyToRunInfoFields
- EHInfo in RealCodeHeaderFields (increase RealCodeHeaderSize to 0x38)
- ExceptionFlags, StackLowBound, StackHighBound, PassNumber, CSFEHClause, CSFEnclosingClause, CallerOfActualHandlerFrame in ExceptionInfoFields

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This PR implements the cDAC's `WalkStackReferences` API (the managed equivalent of the runtime's GC stack crawl for reporting GC references) and adds an in-process GC stress verification tool to validate correctness.

Stack Reference Walking (cDAC implementation)

Implements the full stack reference reporting pipeline in the cDAC:

- … (`GCInfoDecoder.cs`): `EnumerateLiveSlots` decodes GC slot tables (register and stack slots), lifetime transitions, and untracked slots to report live GC references at a given instruction offset. Supports `CodeManagerFlags` for active frame, execution-aborted, funclet parent, and FP-based-only filtering.
- … (`GcScanner.cs`): Orchestrates per-frame GC reference scanning: resolves slot locations to target addresses using the register context and stack pointer bases (caller-SP-relative, SP-relative, frame-register-relative).
- … (`ExceptionHandling.cs`): Implements funclet skip logic (`HasFrameBeenUnwoundByAnyActiveException`) by walking the ExInfo chain to determine which frames have been unwound by active exceptions, matching the runtime's `ExceptionTracker` behavior.
- … `WalkStackReferences` to iterate managed frames, apply GC scanning, handle resumable frames, track caller SP for proper slot resolution, and filter funclet frames.
- … `GetRegisterByNumber` and `SetRegisterByNumber` to AMD64/ARM64/ARM contexts for register-indexed GC slot access.
- … `IsFilterFunclet` and `GetExceptionClauses` APIs for exception clause lookup during stack walking.
- … `ExInfo.ExceptionFlags`, `ExInfo.ScannedStackRange`, `PatchpointInfo.NumberOfLocals`, `RealCodeHeader.ExceptionLookupTable`/`NumExceptionClauses` fields.
- … `StackReferenceDumpTests` with a dedicated `StackRefs` debuggee that validates the cDAC's `WalkStackReferences` output matches the legacy DAC.

GC Stress Verification Tool (`GCSTRESS_CDAC=0x20`)

Adds a new GC stress mode that loads the cDAC in-process during GC stress points and compares its stack reference output against the runtime's actual GC reporting.

How it works

- … (`CdacGcStressInit`): When `DOTNET_GCStress=0x20` is set, the runtime loads the cDAC NativeAOT library and creates an `IXCLRDataProcess` instance with an in-process data target.
- … (`VerifyAtStressPoint` in `gccover.cpp`): The tool collects GC references from both the cDAC (`WalkStackReferences`) and the runtime's own stack crawl, then compares them.
- … (`DOTNET_GCStressCdacFailFast=1`).
- … `DOTNET_GCStressCdacLogFile` with per-stress-point pass/fail tags for automated analysis.

Thread safety

A `CrstStatic` lock serializes access to the shared cDAC instance during stress points. Future work could use per-thread cDAC instances.

Test infrastructure

- `test-cdac-gcstress.ps1`: PowerShell script that builds and runs the GC stress verification against the GCSimulator test.
- `known-issues.md`: Documents known discrepancy categories (~1.7% failure rate):
  - … `PromoteCallerStack`; neither cDAC nor DAC implement this.
  - … (`DynamicResolver`) lack GC info accessible to the cDAC.
  - … `InlinedCallFrame` is active, the cDAC may report both the managed frame's refs and the Frame's refs.

Other changes

- … `[cDAC]` contract dependency annotations to `gcinfotypes.h` (`GcSlotFlags`, `GcStackSlotBase`, `ReturnKind`) and `corhdr.h` (`CorExceptionFlag`).
- … `IsFilterFunclet` documentation to `ExecutionManager.md`.
- … `StackReferenceData` into `IStackWalk.cs`.
- … `cdac_data<PatchpointInfo>` friend declaration and `cdac_data<ExInfo>` offset specialization for exception flags.

Current status and known limitations

This is the initial implementation targeting AMD64 (Windows and Linux). It has been validated using GC stress with the GCSimulator test, achieving a ~98.3% match rate between cDAC and runtime GC reference reporting. The remaining ~1.7% of stress points have known discrepancy categories documented in `known-issues.md`:

- … `PromoteCallerStack`. Neither the cDAC nor the legacy DAC implement this; it is runtime-only behavior.
- … (`DynamicResolver`) lack GC info accessible to the cDAC.
- … `InlinedCallFrame` is active at a stress point, the cDAC may double-report refs from both the managed frame and the Frame.

These are not regressions: they represent parity gaps with the runtime that also exist (or would exist) in the legacy DAC. Follow-up PRs will: …

Configuration

- `DOTNET_GCStress=0x20`
- `DOTNET_GCStressCdacFailFast=1`
- `DOTNET_GCStressCdacLogFile=<path>`