@slevis-lmwg
Contributor

@slevis-lmwg slevis-lmwg commented Aug 19, 2025

Description of changes

Update the contents of .gitmodules and update the submodules themselves.
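A minimal sketch of how the tag changes in such an update could be listed by comparing two versions of `.gitmodules`. The key name `fxtag`, the sample paths, and the helper names are assumptions for illustration; the tag values are the ones quoted in the merge commit message for this PR.

```python
import configparser


def read_submodule_tags(gitmodules_text, tag_key="fxtag"):
    """Map submodule name -> pinned tag from .gitmodules-style text.

    The key name "fxtag" is an assumption; adjust it to whatever key
    your .gitmodules uses to record the pinned version.
    """
    # Strip indentation so indented keys are never read as continuations.
    cleaned = "\n".join(line.strip() for line in gitmodules_text.splitlines())
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cleaned)
    tags = {}
    for section in parser.sections():
        # Section headers look like: submodule "ccs_config"
        name = section.split('"')[1] if '"' in section else section
        if parser.has_option(section, tag_key):
            tags[name] = parser.get(section, tag_key)
    return tags


def changed_tags(old_text, new_text):
    """Return {name: (old_tag, new_tag)} for submodules whose tag differs."""
    old, new = read_submodule_tags(old_text), read_submodule_tags(new_text)
    return {name: (old.get(name), new.get(name))
            for name in sorted(set(old) | set(new))
            if old.get(name) != new.get(name)}


# Hypothetical file contents; only the tag values come from this PR.
OLD = """
[submodule "ccs_config"]
    path = ccs_config
    fxtag = ccs_config_cesm1.0.48
[submodule "cime"]
    path = cime
    fxtag = cime6.1.112
"""
NEW = OLD.replace("cesm1.0.48", "cesm1.0.56").replace("cime6.1.112", "cime6.1.113")

for name, (old_tag, new_tag) in changed_tags(OLD, NEW).items():
    print(f"{name}: {old_tag} -> {new_tag}")
```

Listing the changed tags this way makes it easier to bisect an answer change to a single submodule tag, as was done later in this conversation for ccs_config.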

Specific notes

Contributors other than yourself, if any:
@ekluzek

CTSM Issues Fixed (include github issue #):
Resolves #3411
Resolves #3180

Are answers expected to change (and if so in what way)?
No

Documentation

  • Update ChangeLog/Sum (DIFFs roundoff)

Testing
./run_sys_tests -s aux_clm -c ctsm5.3.072 -g ctsm5.3.073

  • izumi tests_0826-153515iz
  • derecho tests_0826-152952de

@slevis-lmwg slevis-lmwg self-assigned this Aug 19, 2025
@slevis-lmwg slevis-lmwg added labels Aug 19, 2025: bfb (bit-for-bit); PR status: awaiting review (Work on this PR is paused while waiting for review.); test: aux_clm (Pass aux_clm suite before merging)
@slevis-lmwg slevis-lmwg requested a review from ekluzek August 19, 2025 22:57
@slevis-lmwg
Contributor Author

slevis-lmwg commented Aug 19, 2025

./run_sys_tests -s aux_clm -c ctsm5.3.066 --skip-generate

  • izumi OK
  • derecho FAIL (discussed below)

@slevis-lmwg slevis-lmwg added this to the cesm3_0_beta08 milestone Aug 19, 2025
@slevis-lmwg
Contributor Author

slevis-lmwg commented Aug 20, 2025

I'm seeing widespread DIFFs from ctsm5.3.066 in /glade/derecho/scratch/slevis/tests_0819-165935de but not on izumi:

UPDATE
The derecho baseline is fine (sanity check)
./create_test SMS.f45_f45_mg37.I2000Clm60FatesSpRsGs.derecho_nvhpc.clm-FatesColdSatPhen -c /glade/campaign/cgd/tss/ctsm_baselines/ctsm5.3.066
PASS with vanilla ctsm5.3.066 (testid 20250820_142914_4t5p4o)
PASS with vanilla b4b-dev (testid 20250820_150939_u5ubrr)
FAIL with ctsm5.3.066-21-g760cdd4ee i.e. this PR (testid 20250820_172330_t8sxmt)

@slevis-lmwg
Contributor Author

@ekluzek @jedwards4b
These submodule updates are answer-changing on derecho (not on izumi).
Please share thoughts on how to proceed.

@ekluzek
Collaborator

ekluzek commented Aug 21, 2025

@slevis-lmwg looking at the changes in the testdb, I see that ccs_config_cesm1.0.52 updates the gnu compiler. Try updating only to ccs_config_cesm1.0.51, the tag just before it, and see whether answers remain the same. Are the answer diffs only for the gnu compiler, or do answers change for intel as well?

We'll probably need a tag on master that updates to ccs_config_cesm1.0.52 for this.

@samsrabin
Member

@jedwards4b asks:

If you are answer changing in the gnu compiler but not in intel do you really need an additional tag for that? Especially since the change is expected based on a compiler version change?

We might be able to avoid a new master tag if we show that the gnu compiler change is the only reason for the diffs. (We would also need to mark the previous baseline as old and generate a new one.)

However, that's a bit more work than just making a new tag. @jedwards4b, is there some cost of making a new tag that we're not considering?

@github-project-automation github-project-automation bot moved this to Ready to start (or start again) in CTSM: Upcoming tags Aug 22, 2025
@ekluzek ekluzek moved this from Ready to start (or start again) to In progress - b4b-dev in CTSM: Upcoming tags Aug 22, 2025
@samsrabin
Member

Per SE meeting today: This PR will only bring ccs_config up to ccs_config_cesm1.0.51, which should result in no answer changes. The update to ccs_config_cesm1.0.52, with its answer changes, will happen on master.

@slevis-lmwg
Contributor Author

slevis-lmwg commented Aug 22, 2025

aux_clm (not done, yet) STILL SHOWS DIFFs on derecho (tests_0822-140545de) and not on izumi:

FAIL SMS_D_Ln1.f10_f10_mg37.I2000Clm50BgcCropQianRs.derecho_intel.clm-run_self_tests BASELINE ctsm5.3.066: DIFF
FAIL SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.derecho_nvhpc.clm-crop BASELINE ctsm5.3.066: DIFF
FAIL SMS.f10_f10_mg37.I2000Clm50BgcCrop.derecho_nvhpc.clm-crop BASELINE ctsm5.3.066: DIFF
FAIL SMS.f45_f45_mg37.I2000Clm60FatesSpRsGs.derecho_nvhpc.clm-FatesColdSatPhen BASELINE ctsm5.3.066: DIFF

Looking back at my previous aux_clm run (tests_0819-165935de, from before I reverted ccs_config_cesm1.0.56 to ccs_config_cesm1.0.51), I find almost the same:

FAIL SMS.f10_f10_mg37.I2000Clm50BgcCrop.derecho_nvhpc.clm-crop BASELINE ctsm5.3.066: DIFF
FAIL SMS.f45_f45_mg37.I2000Clm60FatesSpRsGs.derecho_nvhpc.clm-FatesColdSatPhen BASELINE ctsm5.3.066: DIFF

...so this cannot make it into b4b-dev, yet.

The first of the top FAILs ("run_self_tests") is expected, as I can tell from
/glade/campaign/cgd/tss/ctsm_baselines/ctsm5.3.071/SMS_D_Ln1.f10_f10_mg37.I2000Clm50BgcCropQianRs.derecho_intel.clm-run_self_tests/TestStatus
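To answer the earlier question of which compilers show diffs, the FAIL lines can be grouped by compiler. This is a minimal sketch that assumes the fourth dot-separated field of a test name is `<machine>_<compiler>`, as in the names listed above; the function name is hypothetical.

```python
from collections import defaultdict


def failures_by_compiler(status_lines):
    """Group failing test names by the compiler encoded in the test name."""
    groups = defaultdict(list)
    for line in status_lines:
        parts = line.split()
        if len(parts) < 2 or parts[0] != "FAIL":
            continue  # only FAIL lines are of interest here
        test_name = parts[1]
        fields = test_name.split(".")
        if len(fields) < 4:
            continue  # name does not follow the assumed layout
        # Assumed layout: fields[3] == "<machine>_<compiler>"
        _machine, _, compiler = fields[3].rpartition("_")
        groups[compiler].append(test_name)
    return dict(groups)


# FAIL lines copied from the derecho run reported above.
LINES = [
    "FAIL SMS_D_Ln1.f10_f10_mg37.I2000Clm50BgcCropQianRs.derecho_intel.clm-run_self_tests BASELINE ctsm5.3.066: DIFF",
    "FAIL SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.derecho_nvhpc.clm-crop BASELINE ctsm5.3.066: DIFF",
    "FAIL SMS.f10_f10_mg37.I2000Clm50BgcCrop.derecho_nvhpc.clm-crop BASELINE ctsm5.3.066: DIFF",
    "FAIL SMS.f45_f45_mg37.I2000Clm60FatesSpRsGs.derecho_nvhpc.clm-FatesColdSatPhen BASELINE ctsm5.3.066: DIFF",
]

for compiler, tests in sorted(failures_by_compiler(LINES).items()):
    print(compiler, len(tests))
```

For the four lines above this reports one intel failure (the expected run_self_tests case) and three nvhpc failures.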

@slevis-lmwg slevis-lmwg added labels Aug 22, 2025: blocked: answer changing (Can't be resolved until we're ready for answer changes on master); next (this should get some attention in the next week or two. Normally each Thursday SE meeting.)
@slevis-lmwg
Contributor Author

At standup we agreed (@ekluzek @samsrabin) that I should open a new issue stating that we do not trust nvhpc to be free of diffs, and mark the corresponding tests as expected failures.

@samsrabin
Member

And also that this PR should just go onto master so we don't spend any more time trying to figure it out. ccs_config can thus be put back to ccs_config_cesm1.0.56.

@slevis-lmwg
Contributor Author

slevis-lmwg commented Aug 25, 2025

We changed our minds as per @samsrabin's post:
Go back to the original plan in this PR and merge to master as answer-changing in both gnu (expected) and nvhpc (unexpected but not concerning).

@ekluzek ekluzek moved this from In progress - b4b-dev to In progress - master in CTSM: Upcoming tags Aug 25, 2025
@ekluzek ekluzek changed the title from "Update .gitmodules to cesm3_0_alpha07c" to "ctsm5.3.073: Update .gitmodules to cesm3_0_alpha07c" Aug 25, 2025
@slevis-lmwg
Contributor Author

OK, the change in answers for nvhpc came in with ccs_config_cesm1.0.49 with a compiler version change (in this case moving back a little bit). The nvhpc compiler has been really problematic, so worrying about changes in answers for it isn't really important. But, having answer changes for nvhpc on b4b-dev would be disruptive to the workflow there.

@ekluzek I interpret your comment to mean that the nvhpc diffs are also expected, so I will not open an issue unnecessarily. Please correct me if I misunderstood.

@ekluzek
Collaborator

ekluzek commented Aug 25, 2025

Yep that's right. The nvhpc answer changes are also expected. So no new issue is needed.

@slevis-lmwg
Contributor Author

While I wait for aux_clm to finish so that I can make final updates to ChangeLog/ChangeSum, I will request a PR review/approval.

Collaborator

@ekluzek ekluzek left a comment

Looks great. I suggest an addition to the ChangeLog. And it will be good to confirm that only GNU and NVHPC change answers on Derecho. And nice that we already have that confirmed for Izumi.

I also looked to see if there were any updates in the new versions that we should highlight in the ChangeLog and didn't spot anything that CTSM users would want to know about.

@slevis-lmwg
Contributor Author

slevis-lmwg commented Aug 27, 2025

./run_sys_tests -s aux_clm -c ctsm5.3.072 -g ctsm5.3.073 on derecho had these unexpected failures:

FAIL MKSURFDATAESMF_P128x1.f10_f10_mg37.I1850Clm50BgcCrop.derecho_intel NLCOMP
FAIL SUBSETDATAPOINT_Ld5_D_Mmpi-serial.CLM_USRDAT.I2000Clm60BgcCropCrujra.derecho_intel.clm-default NLCOMP
FAIL SUBSETDATAREGION_Ld5_D_Mmpi-serial.CLM_USRDAT.I2000Clm60BgcCropCrujra.derecho_intel.clm-default NLCOMP
FAIL SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.derecho_nvhpc.clm-crop RUN

Starting with the last of these: both resubmitting and rerunning manually failed the same way. This appears related to the nvhpc flakiness that we have been discussing. I could mark this test as EXPECTED FAILURE or remove it altogether. Preferences?
UPDATE due to additional information. From meeting with @ekluzek, document the following, and we're good:

PASS SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.derecho_nvhpc.clm-crop SETUP (UNEXPECTED: expected FAIL)
FAIL SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.derecho_nvhpc.clm-crop RUN

The first three picked up NLCOMP diffs (I do not know why) but did not change answers, so I think I can let them pass.

@jedwards4b
Contributor

@slevis-lmwg you should be able to find the NLCOMP failures in the TestStatus.log file for the case.

@slevis-lmwg
Contributor Author

I did and saw the NLCOMP diffs and confirmed that these diffs did not change answers. By "I do not know why" I meant that I was not expecting NLCOMP diffs.

@jedwards4b
Contributor

@slevis-lmwg wouldn't it be good for completeness to show what the NLCOMP differences are in the PR so that the community is aware and could raise issues or justify the differences?

@slevis-lmwg
Contributor Author

The NLCOMP diffs that I was not expecting

MKSURFDATAESMF_P128x1.f10_f10_mg37.I1850Clm50BgcCrop.derecho_intel

2025-08-26 15:32:26: NLCOMP
Comparison failed between '/glade/derecho/scratch/slevis/tests_0826-152952de/MKSURFDATAESMF_P128x1.f10_f10_mg37.I1850Clm50BgcCrop.derecho_intel.GC.0826-152952de_int/CaseDocs/lnd_in' with '/glade/campaign/cgd/tss/ctsm_baselines/ctsm5.3.072/MKSURFDATAESMF_P128x1.f10_f10_mg37.I1850Clm50BgcCrop.derecho_intel/CaseDocs/lnd_in'
  BASE: fsurdat = surfdata_10x15_hist_1850_78pfts_c250825.nc'
  COMP: fsurdat = surfdata_10x15_hist_1850_78pfts_c240908.nc'

2025-08-26 16:56:42: NLCOMP
Comparison failed between '/glade/derecho/scratch/slevis/tests_0826-152952de/MKSURFDATAESMF_P128x1.f10_f10_mg37.I1850Clm50BgcCrop.derecho_intel.GC.0826-152952de_int/CaseDocs/lnd_in' with '/glade/campaign/cgd/tss/ctsm_baselines/ctsm5.3.072/MKSURFDATAESMF_P128x1.f10_f10_mg37.I1850Clm50BgcCrop.derecho_intel/CaseDocs/lnd_in'
  BASE: fsurdat = surfdata_10x15_hist_1850_78pfts_c250825.nc'
  COMP: fsurdat = surfdata_10x15_hist_1850_78pfts_c250826.nc'

SUBSETDATAPOINT_Ld5_D_Mmpi-serial.CLM_USRDAT.I2000Clm60BgcCropCrujra.derecho_intel.clm-default

2025-08-26 16:59:37: NLCOMP
Comparison failed between '/glade/derecho/scratch/slevis/tests_0826-152952de/SUBSETDATAPOINT_Ld5_D_Mmpi-serial.CLM_USRDAT.I2000Clm60BgcCropCrujra.derecho_intel.clm-default.GC.0826-152952de_int/CaseDocs/lnd_in' with '/glade/campaign/cgd/tss/ctsm_baselines/ctsm5.3.072/SUBSETDATAPOINT_Ld5_D_Mmpi-serial.CLM_USRDAT.I2000Clm60BgcCropCrujra.derecho_intel.clm-default/CaseDocs/lnd_in'
  BASE: fsurdat = surfdata_-92.798085_45.402252_hist_2000_78pfts_c250825.nc'
  COMP: fsurdat = surfdata_-92.798085_45.402252_hist_2000_78pfts_c250826.nc'
  BASE: flanduse_timeseries = landuse.timeseries_-92.798085_45.402252_SSP2-4.5_1850-2100_78pfts_c250825.nc'
  COMP: flanduse_timeseries = landuse.timeseries_-92.798085_45.402252_SSP2-4.5_1850-2100_78pfts_c250826.nc'

SUBSETDATAREGION_Ld5_D_Mmpi-serial.CLM_USRDAT.I2000Clm60BgcCropCrujra.derecho_intel.clm-default

2025-08-26 17:02:41: NLCOMP
Comparison failed between '/glade/derecho/scratch/slevis/tests_0826-152952de/SUBSETDATAREGION_Ld5_D_Mmpi-serial.CLM_USRDAT.I2000Clm60BgcCropCrujra.derecho_intel.clm-default.GC.0826-152952de_int/CaseDocs/lnd_in' with '/glade/campaign/cgd/tss/ctsm_baselines/ctsm5.3.072/SUBSETDATAREGION_Ld5_D_Mmpi-serial.CLM_USRDAT.I2000Clm60BgcCropCrujra.derecho_intel.clm-default/CaseDocs/lnd_in'
  BASE: fsurdat = surfdata_291.0-293.0_-9.0--7.0_hist_2000_78pfts_c250825.nc'
  COMP: fsurdat = surfdata_291.0-293.0_-9.0--7.0_hist_2000_78pfts_c250826.nc'
  BASE: flanduse_timeseries = landuse.timeseries_291.0-293.0_-9.0--7.0_SSP2-4.5_1850-2100_78pfts_c250825.nc'
  COMP: flanduse_timeseries = landuse.timeseries_291.0-293.0_-9.0--7.0_SSP2-4.5_1850-2100_78pfts_c250826.nc'
Comparison failed between '/glade/derecho/scratch/slevis/tests_0826-152952de/SUBSETDATAREGION_Ld5_D_Mmpi-serial.CLM_USRDAT.I2000Clm60BgcCropCrujra.derecho_intel.clm-default.GC.0826-152952de_int/CaseDocs/datm_in' with '/glade/campaign/cgd/tss/ctsm_baselines/ctsm5.3.072/SUBSETDATAREGION_Ld5_D_Mmpi-serial.CLM_USRDAT.I2000Clm60BgcCropCrujra.derecho_intel.clm-default/CaseDocs/datm_in'
Differences in namelist 'datm_nml':
  BASE: model_maskfile = domain.lnd.fv0.9x1.25_gx1v7_291.0-293.0_-9.0--7.0_c250825_ESMF_UNSTRUCTURED_MESH.nc'
  COMP: model_maskfile = domain.lnd.fv0.9x1.25_gx1v7_291.0-293.0_-9.0--7.0_c250826_ESMF_UNSTRUCTURED_MESH.nc'
  BASE: model_meshfile = domain.lnd.fv0.9x1.25_gx1v7_291.0-293.0_-9.0--7.0_c250825_ESMF_UNSTRUCTURED_MESH.nc'
  COMP: model_meshfile = domain.lnd.fv0.9x1.25_gx1v7_291.0-293.0_-9.0--7.0_c250826_ESMF_UNSTRUCTURED_MESH.nc'

@slevis-lmwg
Contributor Author

@samsrabin could the above NLCOMP diffs be related to changes that you brought in recently?

@jedwards4b
Contributor

They all look like updates except the first one which looks like a regression - was this intentional?

  BASE: fsurdat = surfdata_10x15_hist_1850_78pfts_c250825.nc'
  COMP: fsurdat = surfdata_10x15_hist_1850_78pfts_c240908.nc'
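
The update-versus-regression reading above can be checked mechanically. This is a minimal sketch that pairs each BASE line with the COMP line that follows it and compares the `cYYMMDD` creation stamps; the function name and verdict strings are hypothetical, and the stamp pattern is an assumption based on the file names in this log.

```python
import re
from datetime import datetime

STAMP = re.compile(r"_c(\d{6})")  # assumed creation-date stamp, e.g. _c250825


def classify_pairs(log_text):
    """Return a list of (variable, base_stamp, comp_stamp, verdict) tuples."""
    results = []
    base = None
    for raw in log_text.splitlines():
        line = raw.strip()
        if line.startswith("BASE:"):
            base = line[len("BASE:"):].strip()
        elif line.startswith("COMP:") and base is not None:
            comp = line[len("COMP:"):].strip()
            variable = base.split("=")[0].strip()
            b, c = STAMP.search(base), STAMP.search(comp)
            same_apart_from_stamp = STAMP.sub("", base) == STAMP.sub("", comp)
            if not (b and c) or not same_apart_from_stamp:
                verdict = "other difference"
            else:
                base_date = datetime.strptime(b.group(1), "%y%m%d")
                comp_date = datetime.strptime(c.group(1), "%y%m%d")
                if comp_date > base_date:
                    verdict = "COMP newer"
                elif comp_date < base_date:
                    verdict = "COMP older"
                else:
                    verdict = "same stamp"
            results.append((variable,
                            b.group(1) if b else None,
                            c.group(1) if c else None,
                            verdict))
            base = None
    return results


# Two pairs copied from the MKSURFDATAESMF log entries in this PR.
LOG = """
  BASE: fsurdat = surfdata_10x15_hist_1850_78pfts_c250825.nc'
  COMP: fsurdat = surfdata_10x15_hist_1850_78pfts_c240908.nc'
  BASE: fsurdat = surfdata_10x15_hist_1850_78pfts_c250825.nc'
  COMP: fsurdat = surfdata_10x15_hist_1850_78pfts_c250826.nc'
"""

for row in classify_pairs(LOG):
    print(row)
```

The first pair is flagged "COMP older" (the apparent regression) and the second "COMP newer", matching the manual reading of the log.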

@slevis-lmwg
Contributor Author

Investigating this with @samsrabin. I'm comparing baselines and found the first instance of one of these diffs:
diff ctsm5.3.069/MKSURFDATAESMF_P128x1.f10_f10_mg37.I1850Clm50BgcCrop.derecho_intel/CaseDocs/lnd_in ctsm5.3.068/MKSURFDATAESMF_P128x1.f10_f10_mg37.I1850Clm50BgcCrop.derecho_intel/CaseDocs/.
The SUBSETDATA* tests were newly introduced in the same baseline (ctsm5.3.069).

@slevis-lmwg
Contributor Author

Also the NLCOMP diffs continued to appear in every tag subsequent to 069 but were ignored. In some cases there were simultaneous - expected - NLCOMP diffs, which may explain why the unexpected ones slipped under the radar.

@slevis-lmwg
Contributor Author

@samsrabin this is the diff I'm seeing in lnd_in between 069 and 068:

<  fsurdat = '/glade/derecho/scratch/samrabin/tests_0811-170941de/MKSURFDATAESMF_P128x1.f10_f10_mg37.I1850Clm50BgcCrop.derecho_intel.GC.0811-170941de_int/surfdata_10x15_hist_1850_78pfts_c250811.nc'
---
>  fsurdat = '/glade/campaign/cesm/cesmdata/inputdata/lnd/clm2/surfdata_esmf/ctsm5.3.0/surfdata_10x15_hist_1850_78pfts_c240908.nc'

This makes me wonder whether you told the tests to make temporary copies of the fsurdat files that they use...

@samsrabin
Member

samsrabin commented Aug 27, 2025

@jedwards4b I think this is related to the problem you alerted me to with cime6.1.112.

Generating a MKSURFDATAESMF_P128x1.f10_f10_mg37.I1850Clm50BgcCrop.derecho_intel baseline at ctsm5.3.069 gives CaseDocs/lnd_in with fsurdat pointing at a file generated during that test (ending in cYYMMDD.nc, where YYMMDD is today's date). Reverting CIME to cime6.1.111 changes it so that the baseline namelist file is saved earlier: CaseDocs/lnd_in has fsurdat pointing at the default file, ending in c240908.nc.

So cime6.1.112 causes the baseline namelist files to be saved later in the process. I think this is actually correct for this particular test, because the test does actually run with the file it generates. But it looks like the namelist comparison is still happening before the lnd_in namelist has been updated, so it still has the default c240908.nc file.

Fixing this test would require a CIME update that makes it so the namelist comparison happens after the lnd_in namelist has been updated—ideally, immediately before the baseline is saved. (We would then end up with a different problem, like what we see for the SUBSETDATA tests here, but that's easily fixable.)

Like I said, I think the baseline namelist files being saved later in the process is correct for this particular test. I don't know if that also applies for the E3SM test @jgfouca highlighted as having a related (?) problem.
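
The ordering problem described in this comment can be illustrated with a toy model. This is a sketch only, not CIME code: the function names and the dictionary standing in for lnd_in are hypothetical, and the only modeled steps are "compare namelist", "test rewrites fsurdat", and "save baseline namelist".

```python
DEFAULT_FSURDAT = "surfdata_10x15_hist_1850_78pfts_c240908.nc"


def generated_fsurdat(date_stamp):
    """File name the toy test 'generates', stamped with the run date."""
    return f"surfdata_10x15_hist_1850_78pfts_c{date_stamp}.nc"


def run_test(date_stamp, baseline, compare_after_update):
    """Toy test run; returns (nlcomp_passed, namelist_saved_as_baseline)."""
    namelist = {"fsurdat": DEFAULT_FSURDAT}
    early_result = namelist == baseline  # comparison before lnd_in is updated
    # The test points lnd_in at the file it just generated.
    namelist["fsurdat"] = generated_fsurdat(date_stamp)
    late_result = namelist == baseline   # comparison after lnd_in is updated
    passed = late_result if compare_after_update else early_result
    # The baseline namelist is saved after the update (the later-save behavior).
    return passed, dict(namelist)


# Generate a baseline on one day, then rerun under both comparison orders.
_, baseline = run_test("250825", baseline={}, compare_after_update=False)

early, _ = run_test("250826", baseline, compare_after_update=False)
late_same_day, _ = run_test("250825", baseline, compare_after_update=True)
late_next_day, _ = run_test("250826", baseline, compare_after_update=True)

print("compare before update:", early)
print("compare after update, same day:", late_same_day)
print("compare after update, next day:", late_next_day)
```

In this model the early comparison always fails because it sees the default file while the baseline holds a generated one. Moving the comparison after the update removes that mismatch but leaves a date-stamp mismatch on any later day, which is the "different problem" seen in the SUBSETDATA tests.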

@slevis-lmwg
Contributor Author

@samsrabin thank you for figuring out how the problem will get solved.
@jedwards4b thank you for suggesting that I post more information about this.

@slevis-lmwg
Contributor Author

From software meeting:

  • Open two issues and mark NLCOMP as EXPECTED FAILUREs
  • Merge and tag

@slevis-lmwg slevis-lmwg merged commit cf41209 into ESCOMP:master Aug 28, 2025
4 checks passed
@github-project-automation github-project-automation bot moved this from In progress - master to Done (non release/external) in CTSM: Upcoming tags Aug 28, 2025
@slevis-lmwg slevis-lmwg deleted the upd_gitmodules_to_cesm3_0_alpha07c branch August 28, 2025 19:23
slevis-lmwg added a commit to mvalmartin/ctsm that referenced this pull request Aug 29, 2025
Update .gitmodules to cesm3_0_alpha07c

 Update ccs_config_cesm1.0.48 to ccs_config_cesm1.0.56.
 Update cime6.1.112 to cime6.1.113.
 Answers change in gnu and nvhpc tests on derecho (details in the PR).
 New bugs found and reported in issues
 ESCOMP#3453 FAIL MKSURFDATAESMF_...intel NLCOMP
 ESCOMP#3454 FAIL SUBSETDATA* tests NLCOMP

 PR ESCOMP#3422
@slevis-lmwg slevis-lmwg removed the label PR status: awaiting review (Work on this PR is paused while waiting for review.) Sep 10, 2025
djk2120 added a commit to djk2120/CTSM that referenced this pull request Sep 10, 2025
Update .gitmodules to cesm3_0_alpha07c

 Update ccs_config_cesm1.0.48 to ccs_config_cesm1.0.56.
 Update cime6.1.112 to cime6.1.113.
 Answers change in gnu and nvhpc tests on derecho (details in the PR).
 New bugs found and reported in issues
 ESCOMP#3453 FAIL MKSURFDATAESMF_...intel NLCOMP
 ESCOMP#3454 FAIL SUBSETDATA* tests NLCOMP

 PR ESCOMP#3422
Preparing beta version of tether utility

Labels

  • non-bfb: Changes answers (incl. adding tests)
  • test: aux_clm: Pass aux_clm suite before merging

Projects

Status: Done (non release/external)

Development

Successfully merging this pull request may close these issues.

  • Update cime and ccs_config versions to those used in cesm3_0_alpha07c
  • nvhpc module setup problem in ctsm5.3.050 with ccs_config1.0.43

4 participants