fix: mlflow hp params limit #424

anaprietonem · 2025-07-21T07:40:42Z

Description

When too many hp_params are logged to the mlflow server, the runs and the experiment take a long time to load. This PR sets a configurable limit to control this: if a run has more hp_params than the limit, the code fails and that run is not logged. The limit can be set in the config; if no limit is passed, the default value of 2000 is used.
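A minimal sketch of the kind of check described above, not the actual logger code. The names `DEFAULT_MAX_PARAMS` and `check_params_limit` are hypothetical, and whether the real comparison is inclusive or exclusive is an assumption here.

```python
from typing import Any, Mapping

# Default taken from the PR description; the constant name is hypothetical.
DEFAULT_MAX_PARAMS = 2000


def check_params_limit(params: Mapping[str, Any], max_params: int = DEFAULT_MAX_PARAMS) -> None:
    """Raise ValueError if the (already flattened) params exceed the limit.

    Assumption: exactly `max_params` entries is still allowed.
    """
    n_params = len(params)
    if n_params > max_params:
        msg = (
            f"Too many hyperparameters to log: {n_params} > {max_params}. "
            "Reduce the logged parameters or raise the limit in the config."
        )
        raise ValueError(msg)


# A small dict passes silently; an oversized one fails before anything is logged.
check_params_limit({"lr": 1e-3, "batch_size": 4})
try:
    check_params_limit({f"p{i}": i for i in range(2001)})
except ValueError as err:
    print(err)
```

Failing early, before any call to the tracking server, keeps oversized runs from being created there at all.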

Additionally, this PR modifies the W&B schema so that, when W&B is not enabled, the identity is updated to None, so it doesn't crash or complain about the '?'.
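The schema behaviour could look roughly like the sketch below. It uses a plain dataclass as a stand-in for the real schema, and it assumes that the "identity" above corresponds to the W&B `entity` setting; the class name `WandbConfig` and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WandbConfig:
    """Hypothetical stand-in for the W&B part of the diagnostics schema."""

    enabled: bool = False
    entity: Optional[str] = None

    def __post_init__(self) -> None:
        # When W&B is disabled, drop any placeholder so later validation
        # has nothing to complain about.
        if not self.enabled:
            self.entity = None


# A disabled config with a placeholder value ends up with entity=None.
print(WandbConfig(enabled=False, entity="?"))
```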

What problem does this change solve?

What issue or task does this change relate to?

Additional notes

As a contributor to the Anemoi framework, please ensure that your changes include unit tests, updates to any affected dependencies and documentation, and have been tested in a parallel setting (i.e., with multiple GPUs). As a reviewer, you are also responsible for verifying these aspects and requesting changes if they are not adequately addressed. For guidelines about those please refer to https://anemoi.readthedocs.io/en/latest/

By opening this pull request, I affirm that all authors agree to the Contributor License Agreement.

for more information, see https://pre-commit.ci

anaprietonem · 2025-07-21T14:12:00Z

Update PR so that log_model uses a default of False -> Done ✅

training/src/anemoi/training/schemas/diagnostics.py

training/src/anemoi/training/diagnostics/mlflow/logger.py

training/src/anemoi/training/schemas/diagnostics.py

anaprietonem · 2025-07-24T06:57:13Z

Note - while testing this PR we noticed that logging models into mlflow is currently not working. This is unrelated to the changes in this PR, as mentioned in mlflow/mlflow#15111.

Users who need 'log_model: True' would have to pin pytorch-lightning to 2.5.0 or earlier, as this is currently a bug in its latest version.
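As an illustration of that pin, the sketch below checks a version string against the 2.5.0 ceiling mentioned above. The helper names are hypothetical and the parsing is deliberately simple (leading numeric release components only).

```python
import re


def parse_release(version: str) -> tuple[int, ...]:
    """Return the leading numeric components, e.g. '2.5.0.post0' -> (2, 5, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def log_model_supported(pl_version: str, max_release: tuple[int, ...] = (2, 5, 0)) -> bool:
    """True if the given pytorch-lightning version is at or below the pin."""
    return parse_release(pl_version)[:3] <= max_release


# In practice the version string could come from
# importlib.metadata.version("pytorch-lightning").
print(log_model_supported("2.5.0"), log_model_supported("2.5.1"))
```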

anaprietonem · 2025-07-30T11:32:49Z

@gmertes I believe I have addressed the comments above, please let me know if there is anything else!

gmertes

Thanks!

🤖 Automated Release PR

This PR was created by `release-please` to prepare the next release. Once merged:

1. A new version tag will be created
2. A GitHub release will be published
3. The changelog will be updated

Changes to be included in the next release:

---

<details><summary>training: 0.6.0</summary>

## [0.6.0](training-0.5.1...training-0.6.0) (2025-08-01)

### ⚠ BREAKING CHANGES

* for schemas of data processors ([#433](#433))
* BaseGraphModule and tasks introduced in anemoi-core ([#399](#399))

### Features

* Add metadata back to pl checkpoint. ([#303](#303)) ([0193b28](0193b28))
* BaseGraphModule and tasks introduced in anemoi-core ([#399](#399)) ([f8ab962](f8ab962))
* **deps:** Use mlflow-skinny instead of mlflow ([#418](#418)) ([6a8beb3](6a8beb3))
* Log FTT2 loss + Fourier Correlation loss ([#148](#148)) ([345b0ab](345b0ab))
* **model:** Postprocessors for leaky boundings ([#315](#315)) ([b54562b](b54562b))
* **models:** Checkpointed Mapper Chunking ([#406](#406)) ([8577772](8577772))
* **models:** Mapper edge sharding ([#366](#366)) ([326751d](326751d))
* Variable filtering ([#208](#208)) ([fba5e47](fba5e47))

### Bug Fixes

* Dropping 3.9 ([#436](#436)) ([f6c0214](f6c0214))
* For schemas of data processors ([#433](#433)) ([539939b](539939b))
* Mlflow hp params limit ([#424](#424)) ([138bc3a](138bc3a))
* Mlflowlogger duplicated key ([#414](#414)) ([cb64a1c](cb64a1c))
* **models,traininig:** Hierarchical model + integration test ([#400](#400)) ([71dfd89](71dfd89))
* **models:** Add removed sharded_input_key in PR400 ([#425](#425)) ([089fe6f](089fe6f))
* New checkpoint ([#445](#445)) ([a25df93](a25df93))
* Plotting error when precip related params are not diagnostic ([#369](#369)) ([010cfa3](010cfa3))
* **training:** Address issues with [#208](#208) ([#417](#417)) ([665f462](665f462))
* **training:** Scaler memory usage ([#391](#391)) ([a9d30e1](a9d30e1))
* Update import mflow utils unit tests ([#427](#427)) ([70ecdd9](70ecdd9))
* Update level retrieval logic ([#405](#405)) ([f393bc3](f393bc3))
* Use transforms: Variable for ExtractVariableGroupAndLevel ([#321](#321)) ([7649f4f](7649f4f))
* Warm restart ([#443](#443)) ([ff96236](ff96236))

### Documentation

* **graphs:** Documenting some missing features ([#423](#423)) ([8addbd8](8addbd8))

</details>

<details><summary>graphs: 0.6.3</summary>

## [0.6.3](graphs-0.6.2...graphs-0.6.3) (2025-08-01)

### Features

* **graphs:** Add lat weighted attribute ([#223](#223)) ([5dd32ca](5dd32ca))
* **graphs:** Support to export edges to npz ([#395](#395)) ([e21738f](e21738f))

### Bug Fixes

* Dropping 3.9 ([#436](#436)) ([f6c0214](f6c0214))
* **graphs:** Revert PR [#379](#379) ([#409](#409)) ([d51219f](d51219f))
* **graphs:** Throw error instead of raising warning when graph exists. ([#379](#379)) ([6ec6c18](6ec6c18))
* **graphs:** Undo masking when torch-cluster is installed ([#375](#375)) ([9f75c06](9f75c06))

### Documentation

* **graphs:** Documenting some missing features ([#423](#423)) ([8addbd8](8addbd8))

</details>

<details><summary>models: 0.9.0</summary>

## [0.9.0](models-0.8.1...models-0.9.0) (2025-08-01)

### ⚠ BREAKING CHANGES

* for schemas of data processors ([#433](#433))

### Features

* **model:** Postprocessors for leaky boundings ([#315](#315)) ([b54562b](b54562b))
* **models:** Checkpointed Mapper Chunking ([#406](#406)) ([8577772](8577772))
* **models:** Mapper edge sharding ([#366](#366)) ([326751d](326751d))

### Bug Fixes

* Dropping 3.9 ([#436](#436)) ([f6c0214](f6c0214))
* For schemas of data processors ([#433](#433)) ([539939b](539939b))
* **models,traininig:** Hierarchical model + integration test ([#400](#400)) ([71dfd89](71dfd89))
* **models:** Remove repeated lines ([#377](#377)) ([1f0b861](1f0b861))
* **models:** Uneven channel sharding ([#385](#385)) ([dd095c4](dd095c4))
* Pydantic model validator not working in transformer schema ([#422](#422)) ([42f437a](42f437a))
* Remove dead code and fix typo ([#357](#357)) ([8c615ba](8c615ba))

</details>

---

> [!IMPORTANT]
> Please do not change the PR title, manifest file, or any other automatically generated content in this PR unless you understand the implications. Changes here can break the release process.
>
> ⚠️ Merging this PR will:
> - Create a new release
> - Trigger deployment pipelines
> - Update package versions

**Before merging:**

- Ensure all tests pass
- Review the changelog carefully
- Get required approvals

[Release-please documentation](https://github.com/googleapis/release-please)

This reverts commit 138bc3a.

anaprietonem added 6 commits July 11, 2025 06:20

test ptl changes

9e0d052

update tests

83f9498

address wandb schema

ed0c009

address wandb schema

fb7cfb7

fix type hint

a92947b

test ptl changes

8e09862

github-actions bot added the training label Jul 21, 2025

Merge branch 'main' into fix/mlflow_hp_params_limit

2a765f5

anaprietonem changed the title ~~Fix/mlflow hp params limit~~ fix: mlflow hp params limit Jul 21, 2025

anaprietonem and others added 2 commits July 21, 2025 07:44

update default

37b14c5

[pre-commit.ci] auto fixes from pre-commit.com hooks

541b3ab

for more information, see https://pre-commit.ci

anaprietonem added the ATS Approval Not Needed label Jul 21, 2025

anaprietonem self-assigned this Jul 21, 2025

anaprietonem added this to Anemoi-dev Jul 21, 2025

github-project-automation bot moved this to Now In Progress in Anemoi-dev Jul 21, 2025

anaprietonem added 3 commits July 21, 2025 07:51

fix schema

68cb44f

clean

7db96b7

clean

fc2cc42

anaprietonem requested review from gmertes and theissenhelen July 21, 2025 07:55

anaprietonem marked this pull request as ready for review July 21, 2025 08:50

anaprietonem and others added 5 commits July 22, 2025 06:26

test ptl changes

a02cfa2

log_model entry update

b761124

merge

2dbbac6

update

32482fa

Merge branch 'main' into fix/mlflow_hp_params_limit

767c662

gmertes reviewed Jul 23, 2025

training/src/anemoi/training/schemas/diagnostics.py (resolved)

training/src/anemoi/training/diagnostics/mlflow/logger.py (outdated, resolved)

training/src/anemoi/training/schemas/diagnostics.py (outdated, resolved)

anaprietonem added 2 commits July 23, 2025 20:15

entry for tracking_uri

97658c4

extend logging error message

d81e784

simplify logic

83610a5

anaprietonem requested a review from gmertes July 24, 2025 07:14

Merge branch 'main' into fix/mlflow_hp_params_limit

033b5dc

gmertes approved these changes Jul 30, 2025

Merge branch 'main' into fix/mlflow_hp_params_limit

9f5c532

gmertes merged commit 138bc3a into main Jul 30, 2025
18 checks passed

gmertes deleted the fix/mlflow_hp_params_limit branch July 30, 2025 12:19

github-project-automation bot moved this from Now In Progress to Done in Anemoi-dev Jul 30, 2025

DeployDuck mentioned this pull request Jul 30, 2025

chore: Release main #380

Merged

matschreiner added a commit to matschreiner/anemoi-core that referenced this pull request Aug 6, 2025

Revert "fix: mlflow hp params limit (ecmwf#424)"

e8a1cca

This reverts commit 138bc3a.

3 participants
