Bound the GPFIFO-space wait in nvWriteGpEntry()#1192
Open
Lumzdas wants to merge 1 commit into
Problem
nvWriteGpEntry() waits for a free GPFIFO entry in a loop with no timeout - it only exits when the GPU consumes an entry or a channel error is flagged. If a channel stalls without a flagged error, the loop spins forever in kernel context. When the spinning code path holds the nvkms lock, every subsequent modeset blocks in uninterruptible D state and the desktop deadlocks until reboot.
I hit this reproducibly via DIFR prefetch after resume from suspend; it appears to be the same failure as #1167 (same trace, also 595.71.05, suspend-triggered) and #1177 (same trace on 580.159.03, triggered during normal use).
Observed failure
Environment:
NVreg_PreserveVideoMemoryAllocations enabled, nvidia-suspend/resume services active
Resume completes cleanly at the kernel PM level (PM: suspend exit, nvidia-resume.service finished, no errors), but the screen stays frozen - the compositor's first DRM atomic commit after resume never returns:
The hung task detector identifies the lock holder:
The nvidia-modeset/kthread stays in R state burning exactly 100% of one core in kernel mode (stime in /proc/274/stat grows at 100 ticks/s). A sysrq-l NMI backtrace taken 29 minutes after resume still catches it at the same spot, confirming the loop never exits:
So: the DIFR prefetch channel comes out of suspend with a full GPFIFO that the GPU never consumes, no channel error is flagged, and nvWriteGpEntry() spins forever inside DifrPrefetchEventDeferredWork with the nvkms lock held. The only recovery is a reboot - the compositor can't even be killed.
Fix
Bound the wait the same way IdleChannel() already does: poll nvPushImportGetMilliSeconds(), honor the channel's noTimeout flag, and on expiry of NV_PUSH_NOTIFIER_SHORT_TIMEOUT log an error and return FALSE.
Kickoff() already handles nvWriteGpEntry() returning FALSE (the channel-error path) and does not advance putOffset. PrefetchSingleSurface() times out and reports NV2080_CTRL_LPWR_DIFR_PREFETCH_FAIL_CE_HW_ERROR, after which (per the comment there) "DIFR will remain disabled until next driver load". That designed failure path was unreachable only because the kickoff one line above it could spin forever first.
With the timeout, the suspend-stall scenario degrades to: one 3-second wait, one logged error, DIFR disabled until next driver load, desktop keeps working.
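For readers who want the shape of the change without opening the diff, here is a minimal standalone sketch of a bounded GPFIFO-space wait. It is not the patch itself: every name prefixed with sketch_ or SKETCH_ is hypothetical, the clock is simulated rather than read through nvPushImportGetMilliSeconds(), the ring-full test is a generic one, and the 3000 ms value is only an assumption taken from the "3-second wait" mentioned above (the real constant is NV_PUSH_NOTIFIER_SHORT_TIMEOUT, whose definition is not shown here).

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for NV_PUSH_NOTIFIER_SHORT_TIMEOUT (assumed 3 s). */
#define SKETCH_SHORT_TIMEOUT_MS 3000u

/* Hypothetical, heavily simplified channel state; not the driver's struct. */
typedef struct {
    uint32_t numGpFifoEntries; /* ring size */
    uint32_t gpPut;            /* next entry the CPU writes */
    uint32_t gpGet;            /* next entry the GPU consumes (never moves here) */
    bool     channelError;     /* set when a channel error has been flagged */
    bool     noTimeout;        /* TRUE keeps the old unbounded behavior */
    uint32_t putOffset;        /* pushbuffer offset, advanced only on success */
    uint64_t fakeNowMs;        /* simulated clock: one tick per poll */
} SketchChannel;

/* Simulated millisecond clock; each call advances time by 1 ms. */
static uint64_t sketch_now_ms(SketchChannel *ch)
{
    return ch->fakeNowMs++;
}

/* Generic ring-buffer test: full when advancing put would reach get. */
static bool sketch_gpfifo_full(const SketchChannel *ch)
{
    return ((ch->gpPut + 1u) % ch->numGpFifoEntries) == ch->gpGet;
}

/* Wait for a free entry, but give up on channel error or on timeout. */
static bool sketch_write_gp_entry(SketchChannel *ch)
{
    const uint64_t start = sketch_now_ms(ch);

    while (sketch_gpfifo_full(ch)) {
        if (ch->channelError) {
            return false;
        }
        if (!ch->noTimeout &&
            sketch_now_ms(ch) - start > SKETCH_SHORT_TIMEOUT_MS) {
            fprintf(stderr, "sketch: timed out waiting for a free GPFIFO entry\n");
            return false;
        }
    }

    ch->gpPut = (ch->gpPut + 1u) % ch->numGpFifoEntries;
    return true;
}

/* Caller in the style described above: putOffset is left alone on failure. */
static bool sketch_kickoff(SketchChannel *ch, uint32_t newPutOffset)
{
    if (!sketch_write_gp_entry(ch)) {
        return false;
    }
    ch->putOffset = newPutOffset;
    return true;
}
```

Calling sketch_kickoff() on a channel whose ring is full and whose gpGet never moves returns false after the simulated timeout and leaves putOffset untouched, which mirrors the degraded-but-alive behavior described above. With noTimeout set to true the same call would spin forever, as the old code does.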
Notes
__nvPushMakeRoom() has a similar unbounded wait. It is left untouched here: it returns void, is reached from many push sites, and once the first prefetch fails DIFR is disabled, so it is no longer reachable from this bug.
[...] noTimeout = FALSE, so it opts into this timeout; channels created with noTimeout = TRUE keep the old behavior.