feature: detect MPI desynchronizations#4091
Conversation
| #ifdef GEOS_USE_MPI_DESYNC_DETECTION | ||
| /** | ||
| * @struct MpiDesyncGuard | ||
| * @brief RAII helper to detect MPI desynchronizations from MPI collective operations. | ||
| */ | ||
| struct MpiDesyncGuard |
There was a problem hiding this comment.
is that useful to expose, or should it better be in internal namespace?
| #ifdef GEOS_USE_MPI | ||
| // Find the min dt across processes | ||
| real64 dt_global; | ||
| MPI_Allreduce( &m_dt, &dt_global, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_GEOS ); | ||
| m_dt = dt_global; | ||
| m_dt = MpiWrapper::min( m_dt, MPI_COMM_GEOS ); | ||
| #endif |
There was a problem hiding this comment.
Thanks for these small refactors! Just one small remark, the #if may no longer be useful.
There was a problem hiding this comment.
(I have the same for other files, HaltEvent.cpp, PeriodicEvent.cpp and SiloFile.cpp)
There was a problem hiding this comment.
Talking about SiloFile.cpp, a lot a its calls seem to be centralizable in the MpiWrapper. It seem to be the very last file doing unjustified direct-accesses.
| explicit MpiDesyncGuard( MPI_Comm const & comm ) | ||
| : m_comm( comm ) | ||
| { | ||
| saveStackTrace(); // Here every rank saves a stacktrace TODO modify |
There was a problem hiding this comment.
This is costly for an happy path.
There was a problem hiding this comment.
What do you think of caching only the stack-frames for later, with something like:
private:
/// raw stack-frames addresses. Critical: fixed-size, no dynamic allocation
static stdArray< void *, maxFrames > m_frames{};
/// @brief save the stacktrace stack-frames state for laters
void saveStackTrace()
{
#if defined(__GLIBC__) || defined(__APPLE__)
m_size = backtrace( m_frames.data(), maxFrames ); // naive approach, maybe it would be better to expose `LvArray::system::collect()`
#else
m_size = 0;
#endif
}
/// @return the symbolized, demangled stacktrace (expensive, cold path only).
string symbolizeStackTrace() const
{
std::ostringstream oss;
for( void const * frame : m_frames )
{ // same, naive approach, maybe better to expose a `stackTrace( bool const location, std::array<void*,N> frames )` from LvArray
oss << "Frame " << i << ": ";
Dl_info dli;
const bool dl_ok = dladdr( array[ i ], &dli );
oss << ( dl_ok ?
( dli.dli_sname ? LvArray::system::demangle( dli.dli_sname ) : dli.dli_fname ) :
"Unknown" ) << "\n";
}
oss << "=====\n";
return oss.str();
}
There was a problem hiding this comment.
the "static" suggestion is linked to another comment on stacked guards.
| explicit MpiDesyncGuard( MPI_Comm const & comm ) | ||
| : m_comm( comm ) | ||
| { | ||
| saveStackTrace(); // Here every rank saves a stacktrace TODO modify | ||
| } |
There was a problem hiding this comment.
(and keep one lines bodies in one line
| explicit MpiDesyncGuard( MPI_Comm const & comm ) | |
| : m_comm( comm ) | |
| { | |
| saveStackTrace(); // Here every rank saves a stacktrace TODO modify | |
| } | |
| explicit MpiDesyncGuard( MPI_Comm const & comm ) | |
| : m_comm( comm ) | |
| { saveStackTrace(); } |
and the comment goes in the @brief or @note)
| explicit MpiDesyncGuard( MPI_Comm const & comm ) | ||
| : m_comm( comm ) |
There was a problem hiding this comment.
What happens when guards are stacked?
| struct MpiDesyncGuard | ||
| { | ||
| MPI_Comm const & m_comm; | ||
| std::atomic<bool> m_collectiveOperationSuccess{ false }; |
The goal of this PR is to add a way to detect MPI desynchronizations.
MPI desynchronizations occur when an MPI operation is not called on all ranks (for example, due to an
ifcondition or aforloop). When this happens, subsequent MPI operations may become non-blocking, potentially leading to incorrect results.Identifying whether a desynchronization has occurred, and locating it, can be long.
This PR proposes mechanism to detect such desynchronizations. It also outputs the current stack trace when such desynchronization occurs.
Implementation
The current approach adds a
barrier(which should remain blocking even if a desynchronization occurs) after every collective MPI call, combined with atimeout()to detect infinite hangs. If the timeout is exceeded (there is an infinite hang), it indicates that not every ranks of the current communicator participated in the collective MPI operation we are monitoring, indicating a desynchronization.Compile option
This feature is currently behind a compile option (
-DGEOS_ENABLE_MPI_DESYNC_DETECTION=ON).All code related to the detection is behind guards, and doesn't get compiled if GEOS is not built with the compile option.
I am not anchored on this choice, neither for the implementation. I'm open to any proposition.