Until now, the results of most CISM simulations have depended on the number of processors. This is mostly because CISM computes many global sums, especially in the velocity solvers. Floating-point addition is not associative, so MPI global sums of real(8) variables can depend on the order of summation, which changes with the processor count and domain decomposition. The general solution in CESM shared code is contained in shr_reprosum_mod.F90, written by Pat Worley of ORNL. The basic idea is to convert real(8) variables to integer(8) variables, sum the integers globally, and convert the result back to real(8).
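
For illustration, here is a minimal sketch of the fixed-point idea. This is not the shr_reprosum_mod algorithm, which distributes each value across several integer 'levels' and handles overflow and dynamic range automatically; the subroutine name, the fixed scale factor, and the assumption that abs(x) < 1 are all just for this sketch.

    subroutine naive_reprosum(x, comm, gsum)
       ! Minimal sketch only: scale each real(8) value to an integer(8),
       ! sum the integers (exact and order-independent), then scale back.
       ! Assumes abs(x) < 1 and few enough summands that the integer sum cannot overflow.
       use mpi
       implicit none
       integer,  parameter   :: dp = kind(1.d0), i8 = selected_int_kind(18)
       real(dp), intent(in)  :: x(:)          ! locally owned values on this task
       integer,  intent(in)  :: comm          ! MPI communicator
       real(dp), intent(out) :: gsum          ! reproducible global sum
       real(dp), parameter   :: rscale = 2.0_dp**52
       integer(i8) :: isum_local, isum_global
       integer :: ierr
       isum_local = sum(int(x*rscale, i8))    ! real(8) -> integer(8)
       call MPI_Allreduce(isum_local, isum_global, 1, MPI_INTEGER8, MPI_SUM, comm, ierr)
       gsum = real(isum_global, dp) / rscale  ! integer(8) -> real(8)
    end subroutine naive_reprosum

Because integer addition is exact, the MPI reduction returns the same result for any task count.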

This PR implements reproducible sums in CISM, using a new module called cism_reprosum_mod. This module is based on Pat's module (a 2023 version, more recent than the current shr_reprosum_mod in CESM), with some minor differences. It uses CISM modules instead of CESM shared modules. I removed Pat's code for dealing with infinities and NaNs, which would require adding a new infnan module to CISM.

Global sums are now handled as follows: To sum a floating-point array, we call either parallel_global_sum (for the scalar grid), parallel_global_sum_stagger (for the velocity grid), or parallel_global_sum_patch (to take many sums over subsets, or 'patches', of the global domain, such as individual glaciers). These functions are in parallel_mpi.F90. There are several versions (real_2d, real_3d, etc.) of each. Each function contains a switch to either (1) call the standard parallel_reduce_sum function to compute the global sum, or (2) call a new subroutine called parallel_reduce_reprosum, which in turn calls subroutine cism_reprosum_calc and returns the reproducible sum.
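
Schematically, a caller that previously combined a local sum with parallel_reduce_sum now calls one of the new interfaces instead. The argument lists below are abbreviated and partly guessed; only the interface names are from this PR.

    ! old pattern: local sum followed by a standard MPI reduction
    ! (order of additions varies with the decomposition; halo handling omitted)
    global_volume = parallel_reduce_sum(sum(thck)) * dew * dns

    ! new pattern: goes through the reprosum-aware interface
    global_volume = parallel_global_sum(thck, parallel) * dew * dns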

The flux-routing hydrology scheme needed special treatment, since it routes subglacial water across processor boundaries over many iterations. I changed several variables in one subroutine from real(8) to integer(8).

I verified that CISM results are now reproducible with many physics and solver options (including both DIVA and BP) for the dome test and several Greenland, Antarctic, and Alpine glacier tests. The most notable exceptions are the local and global tridiagonal preconditioners (PCs), which have unavoidable dependence on the number of processors. Runs with reproducible sums enabled should use either the diagonal PC or (for the 3d BP solver) the SIA-based PC. It wasn't possible to test every option, but for Greenland runs I used a config file similar to the standard CESM configuration, so Greenland results in coupled CESM runs should now be reproducible.

To generate reproducible sums, the user simply sets reproducible_sums = .true. in the config file. The default setting is false.
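
For example, the config file would contain something like the following (the section name here is illustrative; the option name and value are as described above):

    [options]
    reproducible_sums = .true.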

Turning on reproducible sums significantly slows the standalone code, by more than 1.5x for GrIS runs and by more than 2x for AIS and Alps runs with the DIVA solver. Part of the slowdown comes from slower global sums, and part comes from using the diagonal PC instead of the local tridiagonal PC. I checked that for a GrIS test I ran before making these changes, the new code is about as fast as the previous code (about 45 min for 1000 yr with 256 cores), if all the config options are the same (including the local tridiagonal PC with reproducible sums turned off). I think we can tolerate a slower version of CISM for CESM coupled runs, since CISM will still run much faster per simulated year than other CESM components.

For temp_init = 1, 2 or 3, the initial ice temperature profile depends on artm,
which therefore needs to be set correctly before initializing the temperature.

For artm_input_function = ARTM_INPUT_FUNCTION_XY = 0 (the default),
artm is read from the input file at startup with the correct value.
But for the other artm_input_function options (XY_GRADZ, XYZ, XY_LAPSE),
artm is derived from input fields such as artm_ref, artm_gradz and artm_3d.
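
For instance, the XY_LAPSE case amounts to a lapse-rate correction of a reference field, schematically (the exact formula and the names t_lapse and usrf_ref here are illustrative, not the actual CISM code):

    artm(:,:) = artm_ref(:,:) - t_lapse * (usrf(:,:) - usrf_ref(:,:))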

With this commit, CISM computes artm (if needed) at initialization, before
computing the ice temperature profile. Without this computation, the temperature
was initialized based on an erroneous artm = 0 and was too warm.

I also added an SMB computation at initialization. This is not strictly needed
but can be a helpful diagnostic.

I moved the downscaling computations to new subroutines called downscale_artm
and downscale_smb to avoid repeating code.

Until now, CISM results have varied with the number of processors.
This is because CISM computes many global sums, not just for diagnostics but
also for prognostic calculations (e.g., in the PCG velocity solver).
Two runs with different processor counts will have roundoff-level differences
in the results of global sums, which propagate to the rest of the code.
We want CISM to be able to compute global sums in a reproducible way,
independent of processor count, like other CESM components.

Reproducible sums are not yet operational, but this commit adds most of the infrastructure:

* There is a new config option called 'reproducible_sums', which is false by default.
  To activate reproducible sums, the user simply sets this to true.

* The parallel derived type now includes a logical variable called reprosum, whose value
  is set at initialization based on the config option 'reproducible_sums'.

* There are two new files in libglimmer: cism_reprosum_mod.F90 and cism_infnan_mod.F90.
  These are adapted, with minor changes, from the files shr_reprosum_mod.F90 and shr_infnan_mod.F90
  in the CESM shared code written by Pat Worley.

* Many global sums now pass through an interface that supports reproducible sums:
  parallel_global_sum, parallel_global_sum_stagger, or parallel_global_sum_patch.
  These interfaces contain functions with reprosum logic.
  If parallel%reprosum is true, these functions will call subroutine cism_reprosum_calc,
  which converts sums of floating-point variables to sums involving integers only,
  and in this way computes global sums that are independent of processor count.
  If parallel%reprosum is false, these subroutines do a local sum and then call
  parallel_reduce_sum, as before.

Note the following changes:

* Replaced the interface parallel_global_sum_staggered (consisting of four subroutines)
  with parallel_global_sum_stagger (consisting of four functions).
  The new functions yield the same results as the old subroutines if reprosum = F.

* Replaced many calls to parallel_reduce_sum with calls to function parallel_global_sum,
  which can call either cism_reprosum_calc or parallel_reduce_sum.
  The remaining direct calls to parallel_reduce_sum are mostly sums over integers,
  which are independent of processor count.

* Added an interface called parallel_global_sum_patch.
  The parallel_global_sum_patch functions compute the global sum of a given field
  on an arbitrary number of 'patches', each of which has a patch ID denoted
  by a positive integer. These functions are useful for computing sums over
  (1) glaciers and (2) ocean basins, where each glacier or basin is considered a patch.
  (See the sketch at the end of this list.)

* Added a subroutine parallel_reprosum_calc in the cism_parallel module; still to be tested.

* Removed subroutines get_area_vol and calc_iareaf_iareag in glide_mask.F90
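
As noted in the parallel_global_sum_patch item above, here is a serial sketch of the per-patch accumulation. Array names and shapes are assumptions; the real parallel_global_sum_patch functions also reduce the per-patch sums across tasks, reproducibly when parallel%reprosum is true.

    function patch_local_sums(field, patch_id, npatches) result(psum)
       ! Serial sketch of the per-patch accumulation
       implicit none
       integer, parameter   :: dp = kind(1.d0)
       real(dp), intent(in) :: field(:,:)
       integer,  intent(in) :: patch_id(:,:)   ! 0 = no patch; 1..npatches = glacier or basin ID
       integer,  intent(in) :: npatches
       real(dp) :: psum(npatches)
       integer :: i, j, ip
       psum = 0.0_dp
       do j = 1, size(field,2)
          do i = 1, size(field,1)
             ip = patch_id(i,j)
             if (ip >= 1 .and. ip <= npatches) psum(ip) = psum(ip) + field(i,j)
          end do
       end do
    end function patch_local_sums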

Collectively, these commits are answer-changing at the roundoff level for both prognostic
and diagnostic variables.

A future commit will add the calls to cism_reprosum_calc and verify that the sums are reproducible.

This commit changes subroutines distributed_gather_var_* to gather_var_*,
and similarly changes distributed_scatter_var_* to scatter_var_*.
This prevents some compilers from complaining about long subroutine names.

The 'distributed' prefix in subroutine names goes back a decade or so
to when CISM was first parallelized. We might at some point want to remove
it from other subroutine names.

This commit continues the work toward a reproducible sum capability:
an option to compute sums such that results on different numbers
of processors are BFB.

The basic capability is now working. In the various parallel_global_sum routines
(with interfaces parallel_global_sum, parallel_global_sum_stagger and
parallel_global_sum_patch), there is an option to call subroutine
parallel_reduce_reprosum in lieu of parallel_reduce_sum.
Subroutine parallel_reduce_reprosum calls cism_reprosum_calc, which
converts dp variables to i8 and does the sums.

I verified that sums are now reproducible for some test problems, including
a Greenland configuration similar to what we are using for CESM3.
One issue was that the velocity solver was using local copies of
variables xVertex and yVertex for its node geometry, which led to a dependence
on processor count. Computing these variables based on the global coordinates
model%general%x0 and model%general%y0 solves the problem, although it leads
to slightly larger roundoff errors for some calculations.

I fixed another subtle issue by rewriting two equations of this form:
  c1 = c1 + a*b
  c2 = c2 + a*b
with equations of this form:
  f = a*b
  c1 = c1 + f
  c2 = c2 + f
I don't know why this was necessary.

Some results still depend on processor count: for instance, the scheme that routes
subglacial water downhill to neighboring cells. I will work on that in a future commit.

Other changes:

* Fixed a bug in the way the rmse for surface velocity is computed.

* Added a subroutine called double_to_binary in glimmer_utils.
  Given an input double-precision floating-point number, the output is a
  character string corresponding to the 64-bit internal representation of that number
  (1 bit for sign, 11 for exponent, and 52 for fraction).
  This is useful for debugging. (A sketch of such a function appears after this list.)

* Added two 'write_array_to_file' subroutines (real8_2d and real8_3d) in glissade_utils.
  These subroutines take an input array of dp variables, convert each variable
  to a character string corresponding to the 64-bit binary representation,
  and write that string to a file. This can be useful for debugging if we suspect
  that the values of some array elements depend on processor count, but we
  don't know which elements. The real8_3d version has an optional argument to make
  it work for either (k,i,j) arrays or (i,j,k) arrays.
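
As noted in the double_to_binary item above, here is a sketch of what such a function might look like. The actual interface in glimmer_utils may differ; only the name and the 64-bit layout are from this commit.

    function double_to_binary(x) result(bits)
       ! Reinterpret the 8 bytes of x as an integer and test each bit.
       ! Bit 63 (the sign bit) lands in position 1 of the string, followed by
       ! the 11 exponent bits and the 52 fraction bits.
       implicit none
       integer, parameter   :: dp = kind(1.d0), i8 = selected_int_kind(18)
       real(dp), intent(in) :: x
       character(len=64)    :: bits
       integer(i8) :: ival
       integer     :: n
       ival = transfer(x, ival)
       do n = 0, 63
          if (btest(ival, n)) then
             bits(64-n:64-n) = '1'
          else
             bits(64-n:64-n) = '0'
          end if
       end do
    end function double_to_binary

A write_array_to_file routine can then write one such string per array element, e.g. write(unit,'(a)') double_to_binary(thck(i,j)).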

This commit makes a very small change in glissade_velo_higher.F90,
subroutine basal_sliding_bc_2d. A recent commit swaps the following lines:

    Auu(i,j,m) = Auu(i,j,m) + (dx*dy/vol0) * beta(i,j)
    Avv(i,j,m) = Avv(i,j,m) + (dx*dy/vol0) * beta(i,j)

for these:

   increment = (dx*dy/vol0) * beta(i,j)
   Auu(i,j,m) = Auu(i,j,m) + increment
   Avv(i,j,m) = Avv(i,j,m) + increment

Before the swap, the reprosum capability was not working for a Greenland test case.
The results differ for a 1-year run with 1 task v. 4 tasks.
After the swap, the results for 1 task v. 4 tasks are BFB.
I discovered this more or less by accident. I don't see why such a change
would matter for the reprosum algorithm.

This commit backs out the change above.
With the change backed out, the 1-task and 4-task results differ.
A working hypothesis is that the differences are due to a logic error
in the original reprosum code by Pat Worley.
In 2023, Pat introduced some corrections. The next step is to implement
Pat's corrections and see if they fix the reprosum capability.
This commit replaces the old cism_reprosum_mod - which was based on the current copy
of shr_reprosum_mod in the CESM repo - with a new cism_reprosum_mod which is based on
a current copy of shr_reprosum_mod in the E3SM repo. The E3SM version includes
changes made by Pat Worley in 2023.

Pat's changes are contained in three PRs described here:
* E3SM-Project/E3SM#5534
* E3SM-Project/E3SM#5549
* E3SM-Project/E3SM#5560

The new copy of shr_reprosum_mod is here:
* https://github.com/E3SM-Project/E3SM/blob/master/share/util/shr_reprosum_mod.F90

Many of Pat's changes were cosmetic, but he also changed some logic to prevent
potential failures of reproducibility. I had hoped that these changes would fix
the problem described in the previous commit ('Change for reprosum testing').
However, I am getting the same answers as with the previous reprosum code.
The runs with 1 and 4 tasks still fail the BFB test with the unswapped code
in glissade_velo_higher.F90, while passing the BFB test with the swapped code.
This commit restores the code swap. I still don't understand why this swap is needed.

Thanks to Bill Sacks for pointing me to Pat's changes.
This commit modifies the calculation of the glacier boundary mask to support reproducible sums.
Previously, this mask (which enforces high values of Cp at vertices that border two different glaciers)
could be computed incorrectly near block boundaries.

Other changes:

* I added an optional argument, zero_global_boundary_no_ice_bc, to subroutine parallel_halo_real8_2d.
  By default, this subroutine zeroes out the values of scalars in cells adjacent to the global boundary.
  (The reasons are a bit complicated; they are related to the issue of communicating scalar values
  to the halo cells of diagonal block neighbors when some blocks are inactive.)
  Passing in '.false.' for this argument preserves the original value in these cells.
  For topg and relx, it is better to keep the correct physical values than to set them to zero.

* The call to glissade_glacier_init now takes place after thck, topg, lsrf, and usrf have been computed
  in halo cells. This prevents erroneous nonzero thickness targets near block boundaries.

* I added a subroutine called parallel_halo_extrapolate_1d. This subroutine is used to set
  the correct values of (x0,y0) and (x1,y1) in halo cells.
  Note: CISM requires (x1,y1) in the input file. If (x0,y0) are not present, they are
        computed from (x1,y1). (See the sketch at the end of this list.)

* If glacier%length_scale_factor /= 1, then (x0,y0) and (x1,y1) are now scaled at initialization,
  along with dew and dns.

* Removed circular dependencies by deleting 'use cism_parallel' and 'use glimmer_log' from the profile module

* Extended the write_array_to_file subroutines to allow either floating-point or binary-string output

* Eliminated some glacier diagnostic output for the case that the diagnostic cell belongs to no glacier
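
As noted above for the (x0,y0) computation, here is a sketch of the reconstruction, assuming the usual CISM staggering in which velocity-grid points lie midway between adjacent scalar-grid points (the loop form is illustrative only):

    ! x1 has one more point than x0 in each direction (scalar vs. velocity grid)
    do i = 1, size(x1) - 1
       x0(i) = 0.5d0 * (x1(i) + x1(i+1))
    end do
    ! ...and similarly for y0 from y1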

This commit modifies the routing algorithm of the subglacial flow-routing scheme
to support reproducible sums.

The challenge is that as water fluxes are routed downstream, they often pass from
one processor to another. This means that fluxes must be stored temporarily in halo cells,
then routed further downstream on a subsequent call to the routing subroutine.
Depending on the number of tasks, fluxes can enter a given cell in a different order
as they are accumulated. This means the final fluxes can differ at roundoff level.

The fix here is to convert the fluxes and related floating-point arrays (e.g., the weights that
determine how much water is refrozen) from dp to i8 variables before passing them to the routing subroutine.
Since the real variables are O(1) or less, they are multiplied by a factor of 1.e16 to conserve water
and make the results as close as possible to those without reprosums.
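
Schematically, the conversion looks like this (variable names are assumptions; the 1.e16 factor is the one mentioned above, chosen so that O(1) fluxes stay far below huge(1_i8) ~ 9.2e18):

    integer,  parameter :: dp = kind(1.d0), i8 = selected_int_kind(18)
    real(dp), parameter :: flux_scale = 1.0e16_dp
    integer(i8), dimension(:,:), allocatable :: iflux

    iflux = nint(flux * flux_scale, i8)    ! dp -> i8 before the routing call
    ! ... route iflux downstream; integer addition is exact, so the accumulated
    !     totals do not depend on the order in which neighbors contribute ...
    flux  = real(iflux, dp) / flux_scale   ! i8 -> dp after routing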

I was not able to make this work for (1) flux-routing schemes other than D8 or (2) the option
with refreezing enabled. If either of these options is turned on, the fluxes are multiplied
by floating-point numbers during each step of the routing, breaking reproducibility.
I added some code to check these options at startup and to switch, if needed, to options
consistent with reproducible sums.

I encapsulated most of the reprosum-related code in a subroutine called route_flux_to_margin_or_halo.
There are now two versions of this subroutine: one that works with reprosum-safe i8 arrays
and one that uses dp arrays as in the original code.

To support the new integer(8) variables and subroutine, I added subroutines parallel_halo_integer8_2d,
parallel_global_sum_integer8_2d, and parallel_reduce_sum_integer8 in the cism_parallel module.

For an AIS test with subglacial hydrology turned on (with D8 routing and without refreezing),
I verified that the results are now BFB with different numbers of tasks.
The results agree at roundoff level with runs without reprosums.

This commit replaces btemp_scale, a temperature scale in the flux-routing hydrology,
with two distinct scales that have different functions.
The two new scales are (1) btemp_flow_scale and (2) btemp_freeze_scale.

btemp_flow_scale is used to weigh the likelihood of alternative downstream paths,
as computed in subroutine get_flux_fraction. The scale determines a weighting factor:
  btemp_weight_flow = exp(-delta_Tb/btemp_flow_scale),
  where delta_Tb is the difference between PMP temperature and bed temperature.
A low value of btemp_weight_flow means that water is more likely to be routed to a
neighboring grid cell instead of passing through a frozen one.

btemp_freeze_scale is used to compute how much of the incoming water flux refreezes
locally rather than passing through. The weight is computed as:
   btemp_weight_freeze = exp(-delta_Tb/btemp_freeze_scale).
A low value of btemp_weight_freeze means that more water refreezes in place.

For both scales, a larger value makes the hydrology less sensitive to bed temperature.
A zero value (the default) means that the scaling is ignored, and the flow is independent
of bed temperature.

Since the refreezing option is not supported for reproducible sums, I will try running
the hydrology with the flow scale alone. If this works well, the refreezing option
might be removed later.

This commit makes a small change in the way coulomb_c and powerlaw_c
are initialized. The old logic, called from glissade_initialise after
input files are read in, was to set C_c to a constant on initialization
based on the following logic:

    if (model%options%which_ho_coulomb_c == HO_COULOMB_C_CONSTANT .or. &
        model%options%is_restart == NO_RESTART) then

        if (model%options%elevation_based_coulomb_c) then
            [stuff]
        else
            model%basal_physics%coulomb_c = model%basal_physics%coulomb_c_const
        endif
    endif
And similarly for powerlaw_c.

This is problematic for CESM coupled runs, which may read C_c from an input file
created by a previous CISM standalone spinup.
When the CESM run is not a restart run, the logic above sets C_c to a constant
instead of using the values from the input file.

I changed the logic so that
(1) For option HO_COULOMB_C_CONSTANT, coulomb_c is set to coulomb_c_const.
(2) For elevation-based coulomb_c, check whether coulomb_c_hi or coulomb_c_lo = 0
    everywhere. If so, these fields are set to prescribed constants, and then
    elevation-based coulomb_c is computed.
(3) For coulomb_c not based on elevation, check whether coulomb_c = 0 everywhere.
    If so, coulomb_c is set to coulomb_c_const.

In this way, we avoid overwriting coulomb_c (or coulomb_c_hi and coulomb_c_lo)
if these fields have already been read in.
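
In schematic form, the new logic looks like this (variable names are shortened from the model% derived types, the 'zero everywhere' checks are global checks in the real code, and the hi/lo constants are placeholder names):

    if (which_ho_coulomb_c == HO_COULOMB_C_CONSTANT) then
        coulomb_c = coulomb_c_const
    elseif (elevation_based_coulomb_c) then
        if (all(coulomb_c_hi == 0.0d0)) coulomb_c_hi = coulomb_c_hi_const   ! placeholder name
        if (all(coulomb_c_lo == 0.0d0)) coulomb_c_lo = coulomb_c_lo_const   ! placeholder name
        ! ... then compute elevation-based coulomb_c from coulomb_c_hi and coulomb_c_lo ...
    else
        if (all(coulomb_c == 0.0d0)) coulomb_c = coulomb_c_const
    endif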

This commit adds logic to abort the run if the user requests local tridiagonal
or global tridiagonal preconditioning along with reproducible sums.
I looked at both preconditioners and concluded that it's not feasible
to make either of them independent of the number of processors.

For the global tridiag preconditioner, the issue is that the algorithm
computes coefficients that connect adjacent processors on a given row or column.
If the block layout is different, then these coefficients will be different
and answers will differ at roundoff level, even though we are solving
the same global equations.

For the local tridiag preconditioner, a different block layout means that
we are solving an entirely different set of tridiagonal systems,
ensuring that answers will differ.

With reprosums enabled, the user will therefore need to choose the simpler
diagonal preconditioner (or the SIA preconditioner for the 3D solvers).
This could result in slower throughput, but not by enough to be a big problem,
especially in CESM coupled runs where the ice sheet model is much cheaper
than other components.

I also fixed a minor bug in the subroutine that computes glacier AARs (accumulation-area ratios).
I introduced the bug when I inserted calls to the parallel_global_sum_patch
subroutine. The AARs are now correct again. This fix doesn't change
prognostic results, since the AAR is just a diagnostic.

Pat Worley's original code uses the shared module shr_infnan_mod.
This module supports logic in the reprosum algorithm to ignore infinities
and NaNs in the input arrays. When implementing reprosums in CISM,
I carried along a CISM version of this module, cism_infnan_mod.

However, CISM arrays passed to the reprosum module should not contain
infinities or NaNs, and I would prefer not to carry along a module
that is never used. This commit removes cism_infnan_mod.F90, along
with the infnan-related logic and use statements in cism_reprosum_mod.
Tony Craig did the same thing for the CICE reprosum module:
https://github.com/ESCOMP/CICE/blob/main/cicecore/cicedyn/infrastructure/comm/mpi/ice_reprosum.F90

The CISM version of the reprosum module still contains some subroutines
that are not currently called from CISM. I am leaving these subroutines in place
in case they turn out to be useful later.

This commit fixes a bug in function parallel_global_sum_stagger_real8_3d_nflds.
The function now works correctly.

I found the bug while running Greenland tests with the BP solver,
using various combinations of preconditioners and sparse PCG solvers
(e.g., both diagonal and SIA preconditioners, with the standard
and Chronopoulos-Gear PCG algorithms). All the BP configurations
I tested are now reproducible.

I also cleaned up some diagnostics, making the output consistent
across related subroutines.

Several years ago, I implemented local and global tridiagonal preconditioners
in module glissade_velo_higher_pcg. For some reason, I coded and tested them
in the Chronopoulos-Gear PCG subroutines (which_ho_sparse = 3) but not the
standard PCG subroutines (which_ho_sparse = 2).

This commit adds the two tridiagonal preconditioners to both standard PCG
subroutines (2d and 3d). I tested the 2d version for DIVA and the 3d version
for BP, verifying that answers agree with other preconditioners, within roundoff.

This commit simply moves the 2D subroutines before the 3D subroutines
in glissade_velo_higher_pcg.F90. It seems more logical to put the
2D subroutines first. Might do something similar in glissade_velo_higher.F90.

This commit implements reproducible sums for computations of the L2 norm
(and related quantities) of the residual vector A*x - b in the velocity solver.

The summed L2 norm is compared to a target value to check whether the solver
has converged. Reproducible sums are needed only if the L2 norm is very close
to the target, such that the norm exceeds the target for (say) 128 processors
and falls below the target for 256 processors, due to roundoff differences.
This is unlikely but not impossible, so it's better to be on the safe side.
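
Schematically, the convergence test now looks something like this (the names are assumptions; only the idea of routing the norm through the reproducible-sum interface is from this commit):

    ! L2 norm of the residual r = A*x - b, summed reproducibly over all tasks
    L2_resid = sqrt(parallel_global_sum(resid*resid, parallel))
    if (L2_resid <= L2_target) exit    ! converged; same decision on any task count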

This commit simply moves subroutines linked to the 2d velocity solver
above subroutines linked to the 3d velocity solver, as done earlier
for the PCG subroutine.

Historically, Blatter-Pattyn was coded before SSA/L1L2/DIVA, which
is why the 3d subroutines came first.