fix(training): scaler memory usage #391
Conversation
HCookie left a comment
A great refactor, thanks
@japols, do you have some runs/plots from before and after the fix that you can post to confirm it's working as expected?
Is the bottom line that this was not caught in the benchmarking because the memory usage was very small at n320?
ssmmnn11 left a comment
good!
Description
Avoid explicitly materialising a single large combined scaling tensor in memory; instead, apply the scalers iteratively via broadcasted multiplication.
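A minimal sketch of the two patterns (illustrative only; the tensor names and shapes below are assumptions, not the actual `ScaleTensor` code):

```python
import torch

# Illustrative sketch only: tensor names and shapes are assumptions,
# not the actual ScaleTensor implementation in anemoi-training.
node_scaler = torch.rand(1, 2048, 1)      # one weight per grid point
variable_scaler = torch.rand(1, 1, 100)   # one weight per variable
loss = torch.rand(4, 2048, 100)           # (batch, grid points, variables)

# Old pattern: materialise one combined scaling tensor first. Its broadcast
# shape is (1, grid points, variables); at high resolution this is a large
# extra allocation held alongside the loss in the forward/backward pass.
combined = node_scaler * variable_scaler  # shape (1, 2048, 100)
scaled_old = loss * combined

# New pattern: multiply each small scaler onto the loss directly via
# broadcasting, so the combined tensor is never materialised.
scaled_new = loss
for scaler in (node_scaler, variable_scaler):
    scaled_new = scaled_new * scaler

assert torch.allclose(scaled_old, scaled_new)
```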
What problem does this change solve?
Combining all scalers in `ScaleTensor.get_scaler()` can lead to big allocations, up to 17 GB when training at 4 km resolution.
Additional notes
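A rough back-of-the-envelope for why the combined tensor gets this large (the grid-point and variable counts are illustrative assumptions, not figures taken from the PR):

```python
# Back-of-the-envelope estimate; the counts below are assumptions.
grid_points = 40_000_000   # order of magnitude for a ~4 km global grid
variables = 100            # rough number of scaled variables
bytes_per_float32 = 4

combined_bytes = grid_points * variables * bytes_per_float32
print(f"~{combined_bytes / 1e9:.0f} GB")  # ~16 GB, the same order as the reported 17 GB
```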
Memory snapshot of the loss (fwd/bwd) when training at 9 km, before (left) and after (right) the changes:
Small test run of old version vs new: mlflow
Tagging @HCookie @sahahner
As a contributor to the Anemoi framework, please ensure that your changes include unit tests, updates to any affected dependencies and documentation, and have been tested in a parallel setting (i.e., with multiple GPUs). As a reviewer, you are also responsible for verifying these aspects and requesting changes if they are not adequately addressed. For guidelines about those please refer to https://anemoi.readthedocs.io/en/latest/
By opening this pull request, I affirm that all authors agree to the Contributor License Agreement.