@anaprietonem
Contributor

@anaprietonem anaprietonem commented Jul 21, 2025

Description

When too many hp_params are logged to the mlflow server, the runs and the experiment take a long time to load. This PR adds a configurable limit to control this: if a run has more hp_params than the limit, the code fails, so that run is not logged. The limit can be set in the config; if no limit is passed, the default value of 2000 is used.
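A minimal sketch of how such a limit check could look, for illustration only. The names `flatten_params`, `check_param_limit`, and `max_params` are hypothetical and not taken from this PR's diff; only the default of 2000 comes from the description above. Counting one parameter per flattened leaf key is also an assumption.

```python
from typing import Any, Dict, Optional

DEFAULT_MAX_PARAMS = 2000  # default value quoted in the PR description


def flatten_params(params: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into dotted keys, one entry per logged parameter."""
    flat: Dict[str, Any] = {}
    for key, value in params.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten_params(value, name))
        else:
            flat[name] = value
    return flat


def check_param_limit(
    params: Dict[str, Any], max_params: Optional[int] = None
) -> Dict[str, Any]:
    """Return the flattened params, or raise if there are more than the limit.

    `max_params` is a hypothetical config value; None falls back to the default.
    """
    limit = DEFAULT_MAX_PARAMS if max_params is None else max_params
    flat = flatten_params(params)
    if len(flat) > limit:
        raise ValueError(
            f"Too many hp_params to log: {len(flat)} exceeds the limit of {limit}."
        )
    return flat
```

Raising before anything is sent to the server matches the behaviour described above: a run that exceeds the limit fails early instead of being logged.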

Additionally, this PR modifies the W&B schema so that, when W&B is not enabled, the identity is updated to None and it no longer crashes or complains about the '?'.

What problem does this change solve?

What issue or task does this change relate to?

Additional notes

As a contributor to the Anemoi framework, please ensure that your changes include unit tests, updates to any affected dependencies and documentation, and have been tested in a parallel setting (i.e., with multiple GPUs). As a reviewer, you are also responsible for verifying these aspects and requesting changes if they are not adequately addressed. For guidelines about those please refer to https://anemoi.readthedocs.io/en/latest/

By opening this pull request, I affirm that all authors agree to the Contributor License Agreement.

@anaprietonem anaprietonem changed the title Fix/mlflow hp params limit fix: mlflow hp params limit Jul 21, 2025
@anaprietonem anaprietonem added the ATS Approval Not Needed No approval needed by ATS label Jul 21, 2025
@anaprietonem anaprietonem self-assigned this Jul 21, 2025
@github-project-automation github-project-automation bot moved this to Now In Progress in Anemoi-dev Jul 21, 2025
@anaprietonem anaprietonem marked this pull request as ready for review July 21, 2025 08:50
@anaprietonem
Contributor Author

anaprietonem commented Jul 21, 2025

  • Update PR so that log_model defaults to False -> Done ✅

@anaprietonem
Contributor Author

anaprietonem commented Jul 24, 2025

Note: while testing this PR we noticed that logging models to mlflow is currently not working. This is unrelated to the changes in this PR, as mentioned in mlflow/mlflow#15111.

Users who need 'log_model: True' should pin pytorch-lightning to 2.5.0 or earlier, since the latest version is currently affected by this bug.
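For reference, the pin can be written as the requirement specifier `pytorch-lightning<=2.5.0`. The sketch below shows one way a script could check the installed version before enabling 'log_model: True'. The helper names `release_tuple` and `log_model_is_safe` are hypothetical and not part of anemoi or mlflow. The comparison only looks at the numeric release components, so a post-release such as `2.5.0.post0` is treated like `2.5.0`; that is an assumption, not something verified against the bug report.

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple

MAX_WORKING_VERSION = (2, 5, 0)  # upper bound quoted in the comment above


def release_tuple(version_string: str) -> Tuple[int, ...]:
    """Extract the leading numeric release components of a version string."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"Cannot parse version: {version_string!r}")
    return tuple(int(part) for part in match.group().split("."))


def log_model_is_safe(installed: Optional[str] = None) -> bool:
    """Return True if the pytorch-lightning version is 2.5.0 or earlier.

    If `installed` is None, the version is looked up from the environment;
    a missing package is reported as not safe.
    """
    if installed is None:
        try:
            installed = version("pytorch-lightning")
        except PackageNotFoundError:
            return False
    return release_tuple(installed) <= MAX_WORKING_VERSION
```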

@anaprietonem anaprietonem requested a review from gmertes July 24, 2025 07:14
@anaprietonem
Contributor Author

@gmertes I believe I have addressed the comments above, please let me know if there is anything else!

Member

@gmertes gmertes left a comment

Thanks!

@gmertes gmertes merged commit 138bc3a into main Jul 30, 2025
18 checks passed
@gmertes gmertes deleted the fix/mlflow_hp_params_limit branch July 30, 2025 12:19
@github-project-automation github-project-automation bot moved this from Now In Progress to Done in Anemoi-dev Jul 30, 2025
@DeployDuck DeployDuck mentioned this pull request Jul 30, 2025
anaprietonem pushed a commit that referenced this pull request Aug 4, 2025
🤖 Automated Release PR

This PR was created by `release-please` to prepare the next release.
Once merged:

1. A new version tag will be created
2. A GitHub release will be published
3. The changelog will be updated

Changes to be included in the next release:
---


<details><summary>training: 0.6.0</summary>

##
[0.6.0](training-0.5.1...training-0.6.0)
(2025-08-01)


### ⚠ BREAKING CHANGES

* for schemas of data processors
([#433](#433))
* BaseGraphModule and tasks introduced in anemoi-core
([#399](#399))

### Features

* Add metadata back to pl checkpoint.
([#303](#303))
([0193b28](0193b28))
* BaseGraphModule and tasks introduced in anemoi-core
([#399](#399))
([f8ab962](f8ab962))
* **deps:** Use mlflow-skinny instead of mlflow
([#418](#418))
([6a8beb3](6a8beb3))
* Log FTT2 loss + Fourier Correlation loss
([#148](#148))
([345b0ab](345b0ab))
* **model:** Postprocessors for leaky boundings
([#315](#315))
([b54562b](b54562b))
* **models:** Checkpointed Mapper Chunking
([#406](#406))
([8577772](8577772))
* **models:** Mapper edge sharding
([#366](#366))
([326751d](326751d))
* Variable filtering
([#208](#208))
([fba5e47](fba5e47))


### Bug Fixes

* Dropping 3.9 ([#436](#436))
([f6c0214](f6c0214))
* For schemas of data processors
([#433](#433))
([539939b](539939b))
* Mlflow hp params limit
([#424](#424))
([138bc3a](138bc3a))
* Mlflowlogger duplicated key
([#414](#414))
([cb64a1c](cb64a1c))
* **models,traininig:** Hierarchical model + integration test
([#400](#400))
([71dfd89](71dfd89))
* **models:** Add removed sharded_input_key in PR400
([#425](#425))
([089fe6f](089fe6f))
* New checkpoint
([#445](#445))
([a25df93](a25df93))
* Plotting error when precip related params are not diagnostic
([#369](#369))
([010cfa3](010cfa3))
* **training:** Address issues with
[#208](#208)
([#417](#417))
([665f462](665f462))
* **training:** Scaler memory usage
([#391](#391))
([a9d30e1](a9d30e1))
* Update import mflow utils unit tests
([#427](#427))
([70ecdd9](70ecdd9))
* Update level retrieval logic
([#405](#405))
([f393bc3](f393bc3))
* Use transforms: Variable for ExtractVariableGroupAndLevel
([#321](#321))
([7649f4f](7649f4f))
* Warm restart ([#443](#443))
([ff96236](ff96236))


### Documentation

* **graphs:** Documenting some missing features
([#423](#423))
([8addbd8](8addbd8))
</details>

<details><summary>graphs: 0.6.3</summary>

##
[0.6.3](graphs-0.6.2...graphs-0.6.3)
(2025-08-01)


### Features

* **graphs:** Add lat weighted attribute
([#223](#223))
([5dd32ca](5dd32ca))
* **graphs:** Support to export edges to npz
([#395](#395))
([e21738f](e21738f))


### Bug Fixes

* Dropping 3.9 ([#436](#436))
([f6c0214](f6c0214))
* **graphs:** Revert PR
[#379](#379)
([#409](#409))
([d51219f](d51219f))
* **graphs:** Throw error instead of raising warning when graph exists.
([#379](#379))
([6ec6c18](6ec6c18))
* **graphs:** Undo masking when torch-cluster is installed
([#375](#375))
([9f75c06](9f75c06))


### Documentation

* **graphs:** Documenting some missing features
([#423](#423))
([8addbd8](8addbd8))
</details>

<details><summary>models: 0.9.0</summary>

##
[0.9.0](models-0.8.1...models-0.9.0)
(2025-08-01)


### ⚠ BREAKING CHANGES

* for schemas of data processors
([#433](#433))

### Features

* **model:** Postprocessors for leaky boundings
([#315](#315))
([b54562b](b54562b))
* **models:** Checkpointed Mapper Chunking
([#406](#406))
([8577772](8577772))
* **models:** Mapper edge sharding
([#366](#366))
([326751d](326751d))


### Bug Fixes

* Dropping 3.9 ([#436](#436))
([f6c0214](f6c0214))
* For schemas of data processors
([#433](#433))
([539939b](539939b))
* **models,traininig:** Hierarchical model + integration test
([#400](#400))
([71dfd89](71dfd89))
* **models:** Remove repeated lines
([#377](#377))
([1f0b861](1f0b861))
* **models:** Uneven channel sharding
([#385](#385))
([dd095c4](dd095c4))
* Pydantic model validator not working in transformer schema
([#422](#422))
([42f437a](42f437a))
* Remove dead code and fix typo
([#357](#357))
([8c615ba](8c615ba))
</details>

---
> [!IMPORTANT]
> Please do not change the PR title, manifest file, or any other
automatically generated content in this PR unless you understand the
implications. Changes here can break the release process.
> 
> ⚠️ Merging this PR will:
> - Create a new release
> - Trigger deployment pipelines
> - Update package versions

 **Before merging:**
 - Ensure all tests pass
 - Review the changelog carefully
 - Get required approvals

[Release-please
documentation](https://github.com/googleapis/release-please)
matschreiner added a commit to matschreiner/anemoi-core that referenced this pull request Aug 6, 2025
Labels

ATS Approval Not Needed No approval needed by ATS training

Projects

Status: Done

3 participants