@icedoom888
Contributor

@icedoom888 icedoom888 commented Apr 10, 2025

Description

Introduces Autoencoder training in Anemoi.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

Issue Number

Closes #171
Reopens #172

Code Compatibility

  • I have performed a self-review of my code

Code Performance and Testing

  • I have added tests that prove my fix is effective or that my feature works
  • I ran the complete Pytest test suite locally, and they pass
  • I have tested the changes on a single GPU
  • I have tested the changes on multiple GPUs / multi-node setups
  • I have run the Benchmark Profiler against the old version of the code
  • If the new feature introduces modifications at the config level, I have made sure to update Pydantic Schemas and default configs accordingly

Dependencies

  • I have ensured that the code is still pip-installable after the changes and runs
  • I have tested that new dependencies themselves are pip-installable.
  • I have not introduced new dependencies in the inference portion of the pipeline

Documentation

  • My code follows the style guidelines of this project
  • I have updated the documentation and docstrings to reflect the changes
  • I have added comments to my code, particularly in hard-to-understand areas

Additional Notes


📚 Documentation preview 📚: https://anemoi-training--252.org.readthedocs.build/en/252/


📚 Documentation preview 📚: https://anemoi-graphs--252.org.readthedocs.build/en/252/


📚 Documentation preview 📚: https://anemoi-models--252.org.readthedocs.build/en/252/

@mchantry
Member

@icedoom888 please can you push the branch to this repo too, so we can all run the integration tests. Many thanks.

@icedoom888
Contributor Author

@mchantry integration tests passing here: https://github.com/ecmwf/anemoi-core/actions/runs/18498383834

nightly-integration-tests-hpc-gpu

@mchantry
Member

Great, thanks so much.

@mchantry
Member

@icedoom888 sorry if you have already discussed this. Have you tried using the current forecasting dataset/datamodule but with rollout=0? I believe this will give you the time slices that you get from the singledataset setup, without needing to create a new class. Not adding a new class will help when implementing multiple-datasets for anemoi.

@Rilwan-Adewoyin
Member

@icedoom888 Can you add a config for the hierarchicalautoencoder? Currently, I believe there are only example configs for the autoencoder.
I suspect you would only need to add a file here: training/src/anemoi/training/config/hierarchical_autoencoder.yaml

@icedoom888
Contributor Author

@icedoom888 Can you add a config for the hierarchicalautoencoder? Currently, I believe there are only example configs for the autoencoder. I suspect you would only need to add a file here: training/src/anemoi/training/config/hierarchical_autoencoder.yaml

Done!

@Rilwan-Adewoyin
Member

Rilwan-Adewoyin commented Oct 15, 2025

In this PR there are great plot visualisation changes you've added as mentioned by @mc4117 (https://github.com/ecmwf/anemoi-core/pull/252/files#r2426852227)

I've noted the following two aspects:

  1. Improved visualisation when plotting smaller non-global regions (def lambert_conformal_from_latlon_points)
  2. Ability for Map Plot based callbacks to only plot over a subset of grid points (FocusArea)
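The second point, restricting a map plot to a subset of grid points, can be illustrated with a minimal sketch. It assumes the focus area is a plain latitude/longitude bounding box that does not cross the antimeridian. The helper `focus_mask` and the sample coordinates are hypothetical and are not the `FocusArea` implementation from this PR.

```python
def focus_mask(latitudes, longitudes, lat_range, lon_range):
    """Return one boolean per grid point: True if it lies inside the box.

    Hypothetical helper for illustration only; assumes both coordinate
    lists use the same longitude convention as lon_range.
    """
    lat_min, lat_max = lat_range
    lon_min, lon_max = lon_range
    return [
        lat_min <= lat <= lat_max and lon_min <= lon <= lon_max
        for lat, lon in zip(latitudes, longitudes)
    ]


# Made-up grid points: three near the Alps, one far away.
lats = [46.5, 47.2, 10.0, 45.9]
lons = [7.5, 8.9, 120.0, 6.1]

mask = focus_mask(lats, lons, lat_range=(45.0, 48.0), lon_range=(5.0, 11.0))
subset = [i for i, keep in enumerate(mask) if keep]
print(subset)  # indices of the grid points that would be plotted
```

A plotting callback could then index its fields with `subset` before drawing, leaving the rest of the plotting code unchanged.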

It seems like this set of changes may be best placed in a second PR.

This second PR could be reviewed by some of the primary contributors to the existing plotting logic, who may have useful suggestions for improving it or ensuring it extends to more use cases; currently the changes only apply to the callbacks that plot maps.

Outside of this, I think the other plot functions essential for reconstruction plotting can stay in this PR.

@icedoom888
Contributor Author

@icedoom888 sorry if you have already discussed this. Have you tried using the current forecasting dataset/datamodule but with rollout=0? I believe this will give you the time slices that you get from the singledataset setup, without needing to create a new class. Not adding a new class will help when implementing multiple-datasets for anemoi.

        # Fallback if max is None or rollout_cfg is missing
        rollout_value = rollout_start
        if rollout_cfg and rollout_epoch_increment > 0 and rollout_max is not None:
            rollout_value = rollout_max
        else:
            LOGGER.warning(
                "Falling back rollout to: %s",
                rollout_value,
            )

        rollout = max(rollout_value, val_rollout)

This code, from /users/apennino/anemoi-core/training/src/anemoi/training/data/datamodule/singledatamodule.py, forces rollout to be the maximum of rollout_value and validation_rollout, which by schema has to be greater than 1. So rollout can never reach 0 here, even if the training rollout is set to 0.
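The effect of the final `max` can be reproduced in isolation. This is a minimal sketch, assuming a hypothetical stand-alone function `effective_rollout` that mirrors the snippet above. The function name and its signature are illustrative and do not exist in anemoi.

```python
import logging

LOGGER = logging.getLogger(__name__)


def effective_rollout(rollout_start, rollout_epoch_increment, rollout_max, val_rollout, rollout_cfg=True):
    """Hypothetical mirror of the datamodule logic quoted above."""
    # Fallback if max is None or rollout_cfg is missing
    rollout_value = rollout_start
    if rollout_cfg and rollout_epoch_increment > 0 and rollout_max is not None:
        rollout_value = rollout_max
    else:
        LOGGER.warning("Falling back rollout to: %s", rollout_value)
    # The validation rollout acts as a lower bound on the result.
    return max(rollout_value, val_rollout)


# A training rollout of 0 is overridden by a validation rollout of 1.
print(effective_rollout(rollout_start=0, rollout_epoch_increment=0, rollout_max=None, val_rollout=1))
```

As long as the validation rollout is positive, the returned value is positive too, regardless of the training rollout.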

@icedoom888
Contributor Author

@mchantry after changing every rollout schema to NonNegativeInt so that rollout can be 0, I now get:

[rank0]: IndexError: Caught IndexError in DataLoader worker process 0.
[rank0]: Original Traceback (most recent call last):
[rank0]:   File "/users/apennino/anaconda3/envs/anemoi/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 351, in _worker_loop
[rank0]:     data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
[rank0]:   File "/users/apennino/anaconda3/envs/anemoi/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 33, in fetch
[rank0]:     data.append(next(self.dataset_iter))
[rank0]:   File "/users/apennino/anemoi-core/training/src/anemoi/training/data/dataset/singledataset.py", line 284, in __iter__
[rank0]:     timeincrement = self.relative_date_indices[1] - self.relative_date_indices[0]
[rank0]: IndexError: list index out of range

This happens because the normal dataset and dataloader expect a list of date indices, not just one. This shows the need for my implementation.
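The failure can be reproduced without the dataset class. This is a minimal sketch, assuming that a rollout of 0 leaves `relative_date_indices` with a single entry. The function `time_increment` is a hypothetical wrapper around the expression shown in the traceback.

```python
def time_increment(relative_date_indices):
    # Same expression as the failing line in the traceback above.
    return relative_date_indices[1] - relative_date_indices[0]


# Two relative date indices (input step plus one target step): works.
print(time_increment([0, 1]))

# One relative date index (assumed rollout=0 case): index 1 does not exist.
try:
    time_increment([0])
except IndexError as err:
    print("IndexError:", err)
```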

@icedoom888
Contributor Author

@Rilwan-Adewoyin Thanks for reviewing!
I use the callbacks for these plots in the default autoencoder configurations. In particular, PlotReconstruction is an important part of training because it visualizes the output. How do you suggest we handle this?

@mc4117
Member

mc4117 commented Oct 20, 2025

@mchantry after changing every rollout schema to NonNegativeInt so that rollout can be 0, I now get:

[rank0]: IndexError: Caught IndexError in DataLoader worker process 0.
[rank0]: Original Traceback (most recent call last):
[rank0]:   File "/users/apennino/anaconda3/envs/anemoi/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 351, in _worker_loop
[rank0]:     data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
[rank0]:   File "/users/apennino/anaconda3/envs/anemoi/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 33, in fetch
[rank0]:     data.append(next(self.dataset_iter))
[rank0]:   File "/users/apennino/anemoi-core/training/src/anemoi/training/data/dataset/singledataset.py", line 284, in __iter__
[rank0]:     timeincrement = self.relative_date_indices[1] - self.relative_date_indices[0]
[rank0]: IndexError: list index out of range

This happens because the normal dataset and dataloader expect a list of date indices, not just one. This shows the need for my implementation.

I ran some tests this morning, and I think all that is needed is an if statement: if len(relative_date_indices) == 1, set timeincrement = 1, and then this should run. Let me know if this works for you too.
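The suggested guard might look roughly like the following. This is a minimal sketch, with a hypothetical function `safe_time_increment` standing in for the relevant lines of `__iter__`; it is not the code that was actually tested.

```python
def safe_time_increment(relative_date_indices):
    """Hypothetical guard around the expression from the traceback."""
    if len(relative_date_indices) == 1:
        # Only one time slice per sample: fall back to an increment of 1.
        return 1
    return relative_date_indices[1] - relative_date_indices[0]


print(safe_time_increment([0]))     # single index, no IndexError
print(safe_time_increment([0, 2]))  # usual case, difference of the first two indices
```

With this fallback, the single-index case no longer raises, while the multi-index case behaves as before.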

Projects

Status: Under Review

Development

Successfully merging this pull request may close these issues.

Autoencoder Training

6 participants