feat(models,training): multi dataset integration #594
base: main
Various changes, see git history. --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…ure/multi_ds_dict
for more information, see https://pre-commit.ci
Hi @mchantry, I'm planning on doing some training with this branch. Could you let me know what I should expect to work and not work?
    # ALWAYS override dataset from dataloader config (ignore dummy in graph config)
    if hasattr(graph_config.nodes.data.node_builder, "dataset"):
        graph_config.nodes.data.node_builder.dataset = dataset_path
Would a similar mechanism be required for ICON node builders? See also #627.
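For readers following the thread, the override in the diff above can be sketched in isolation. This is only an illustration: `override_graph_dataset` is a hypothetical helper name, and `SimpleNamespace` stands in for the real graph config object, which may behave differently under `hasattr`.

```python
from types import SimpleNamespace


def override_graph_dataset(graph_config, dataset_path):
    """Point the data node builder at the dataloader's dataset, if it has one.

    Returns True when the override was applied, False when the node builder
    has no `dataset` attribute (the case the ICON question is about).
    """
    node_builder = graph_config.nodes.data.node_builder
    if hasattr(node_builder, "dataset"):
        node_builder.dataset = dataset_path
        return True
    return False


def make_config(**node_builder_fields):
    """Build a hypothetical stand-in for the nested graph config."""
    node_builder = SimpleNamespace(**node_builder_fields)
    return SimpleNamespace(nodes=SimpleNamespace(data=SimpleNamespace(node_builder=node_builder)))


# A node builder with a dummy dataset gets overridden.
graph_config = make_config(dataset="dummy.zarr")
overridden = override_graph_dataset(graph_config, "real.zarr")

# A node builder without a `dataset` field (hypothetical `grid` field) is left alone.
no_dataset_config = make_config(grid="some-grid")
skipped = override_graph_dataset(no_dataset_config, "real.zarr")
```

Node builders that read their source from a differently named field would be skipped silently by this check, which is why a separate mechanism might be needed for them.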
    dataset: aifs-ea-an-oper-0001-mars-${data.resolution}-1979-2023-6h-v8.zarr

    # Secondary dataset for ERA5_copy (using same file for debugging)
    dataset_b: cerra-rr-an-oper-0001-mars-5p5km-1984-2022-6h-v3-hmsi.zarr
Is this working currently? I've been trying to run with two datasets with non-identical datetimes and am getting errors.
By this I mean that the datetimes of one dataset are a subset of the other's, as is the case in the example above.
What do you mean by non-identical datetimes radi?
The first implementation of multiple datasets aims to use datasets that are fully time aligned and feature the same datetimes.
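To make the distinction in this thread concrete, here is a minimal sketch of how one might check whether two datasets are fully time aligned or whether one is only a subset of the other. It uses plain `datetime` lists rather than the real dataset objects; `date_range` and `compare_datetimes` are hypothetical helpers, and the short date ranges are made up for illustration.

```python
from datetime import datetime, timedelta


def date_range(start, end, step):
    """Return all datetimes from start to end (inclusive) at a fixed step."""
    dates = []
    current = start
    while current <= end:
        dates.append(current)
        current += step
    return dates


def compare_datetimes(dates_a, dates_b):
    """Classify how two collections of datetimes relate to each other."""
    set_a, set_b = set(dates_a), set(dates_b)
    if set_a == set_b:
        return "aligned"
    if set_a < set_b or set_b < set_a:
        return "subset"
    if set_a & set_b:
        return "partial overlap"
    return "disjoint"


six_hours = timedelta(hours=6)
# Illustrative ranges: the second dataset starts one day later than the first.
dates_long = date_range(datetime(1979, 1, 1), datetime(1979, 1, 3), six_hours)
dates_short = date_range(datetime(1979, 1, 2), datetime(1979, 1, 3), six_hours)

relation = compare_datetimes(dates_long, dates_short)
common = sorted(set(dates_long) & set(dates_short))
```

In the subset case, restricting both datasets to `common` before training would be one way to obtain the fully aligned inputs that the first implementation expects, assuming the data loading allows such a restriction.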
## Description

Adaptation of plotting callbacks to multiple datasets. We add a parameter `datasets` to the callback configuration that allows specifying which datasets to plot. All plots will be based on the same configuration, including the choice of parameters to plot. **In order to plot different datasets with different parameters, the user can configure multiple callbacks of the same type with different parameters.**

Callbacks included:
- PlotLoss
- PlotSample
- PlotSpectrum
- PlotHistogram
- GraphTrainableFeaturesPlot

Callbacks not included in this PR:
- ensemble plots
- LongRolloutPlots

Here are MLFlow runs for a [single dataset](https://mlflow.ecmwf.int/#/experiments/420/runs/8dca1935a3194edea16a21d4ed4d6a13/artifacts) and a [two dataset](https://mlflow.ecmwf.int/#/experiments/420/runs/ea59f113b0d34061b487038552d9467c/artifacts) use case.

## The config interface for plotting

Most datasets will have somewhat different parameters, and hence different parameters to plot. The config interface implemented in this PR means that users will either need to plot parameters shared by all datasets, or configure multiple callbacks of the same type for different datasets.

The reason for choosing this interface as a first draft is not that it is optimal for multiple datasets, but rather that:
- it keeps plots and plotting configs for the single dataset case backwards compatible. Avoiding regression on plotting for existing use cases was the main requirement.
- it requires fewer changes to the callbacks than making them more configurable.
- it provides _some functionality_ to plot multiple datasets.
- we don't know yet how multiple datasets will mostly be used, and it might make sense to delay the design of the interface a bit. This will give us more time to rethink plotting callbacks more generally, rather than settling on an interface now.
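As a rough illustration of the "multiple callbacks of the same type" pattern described above, the sketch below models two callback entries as plain Python dicts and resolves which parameters get plotted for which dataset. Only the callback name `PlotSample`, the `datasets` parameter and its default `["data"]` come from this description; the keys `callback` and `parameters`, the dataset name `dataset_b` and the parameter names are hypothetical, and the real config schema may look different.

```python
# Hypothetical callback entries: one PlotSample per dataset, each with its own parameters.
plot_callbacks = [
    {"callback": "PlotSample", "datasets": ["data"], "parameters": ["2t", "10u"]},
    {"callback": "PlotSample", "datasets": ["dataset_b"], "parameters": ["2t"]},
]


def parameters_by_dataset(callbacks):
    """Collect the parameters that each dataset will be plotted with."""
    plan = {}
    for entry in callbacks:
        # Entries without `datasets` fall back to the single-dataset default.
        for name in entry.get("datasets", ["data"]):
            plan.setdefault(name, []).extend(entry["parameters"])
    return plan


plan = parameters_by_dataset(plot_callbacks)

# A legacy single-dataset entry with no `datasets` key keeps working unchanged.
legacy_plan = parameters_by_dataset([{"callback": "PlotLoss", "parameters": ["2t"]}])
```

The fallback to `["data"]` mirrors the backwards-compatibility goal: an existing single-dataset config that never mentions `datasets` resolves to the same plots as before.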
## Other decisions

- We will adapt pydantic schemas for plotting (and set a default for the new parameter `datasets = ["data"]`) as part of fixing schemas on the multiple datasets branch.
- I'd suggest updating the remaining plotting callbacks separately, but I'm open to thoughts.
Description
Supports multiple time-aligned datasets as inputs and outputs for training.
e.g.
where inputs/outputs each use their own encoder/decoder.
As a contributor to the Anemoi framework, please ensure that your changes include unit tests, updates to any affected dependencies and documentation, and have been tested in a parallel setting (i.e., with multiple GPUs). As a reviewer, you are also responsible for verifying these aspects and requesting changes if they are not adequately addressed. For guidelines about those please refer to https://anemoi.readthedocs.io/en/latest/
By opening this pull request, I affirm that all authors agree to the Contributor License Agreement.