Member

@mchantry commented Oct 8, 2025

Description

Supports multiple time-aligned datasets as inputs and outputs for training.
e.g.

era_t     |         era_{t+1}
          |  ->
cerra_t   |         cerra_{t+1}

era_t     |         era_{t+1}
          |  ->
          |         cerra_{t+1}

era_t     |
          |  ->
cerra_t   |         cerra_{t+1}

where each input dataset uses its own encoder and each output dataset uses its own decoder.
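
As a rough illustration of this routing (not the code in this PR), the sketch below gives each input dataset its own encoder and each output dataset its own decoder around a shared latent state. The class `MultiDatasetModel`, the `make_scaler` stand-ins, and the element-wise sum used to fuse the latents are all assumptions made for the example; the real encoders and decoders are learned components.

```python
from typing import Callable, Dict, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_scaler(factor: float) -> Layer:
    """Hypothetical stand-in for a learned encoder or decoder."""
    return lambda x: [factor * v for v in x]


class MultiDatasetModel:
    """Toy model: one encoder per input dataset, one decoder per output dataset."""

    def __init__(self, encoders: Dict[str, Layer], decoders: Dict[str, Layer]) -> None:
        self.encoders = encoders
        self.decoders = decoders

    def forward(self, batch: Dict[str, Vector]) -> Dict[str, Vector]:
        # Encode every input dataset with its own encoder.
        latents = [self.encoders[name](x) for name, x in batch.items()]
        # Fuse into one shared latent state (assumption: element-wise sum).
        shared = [sum(values) for values in zip(*latents)]
        # Decode only the datasets that are requested as outputs.
        return {name: decode(shared) for name, decode in self.decoders.items()}


# Third case from the diagram: era_t and cerra_t in, cerra_{t+1} out.
model = MultiDatasetModel(
    encoders={"era": make_scaler(1.0), "cerra": make_scaler(2.0)},
    decoders={"cerra": make_scaler(0.5)},
)
prediction = model.forward({"era": [1.0, 2.0], "cerra": [3.0, 4.0]})
```

The other two cases in the diagram correspond to changing which keys appear in `encoders` and `decoders`.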

As a contributor to the Anemoi framework, please ensure that your changes include unit tests, updates to any affected dependencies and documentation, and have been tested in a parallel setting (i.e., with multiple GPUs). As a reviewer, you are also responsible for verifying these aspects and requesting changes if they are not adequately addressed. For guidelines about those please refer to https://anemoi.readthedocs.io/en/latest/

By opening this pull request, I affirm that all authors agree to the Contributor License Agreement.

@github-project-automation (bot) moved this to "To be triaged" in Anemoi-dev on Oct 8, 2025
@mchantry changed the title from "Feature/multi dataset integration" to "feat(models,training): multi dataset integration" on Oct 8, 2025
@mchantry added the "ATS Approved" (Approved by ATS) label on Oct 9, 2025
@dnerini moved this from "To be triaged" to "Now In Progress" in Anemoi-dev on Oct 21, 2025
@radiradev
Contributor

Hi @mchantry, I'm planning on doing some training with this branch. Could you let me know what I should expect to work and not work?

Comment on lines +176 to +178
# ALWAYS override dataset from dataloader config (ignore dummy in graph config)
if hasattr(graph_config.nodes.data.node_builder, "dataset"):
    graph_config.nodes.data.node_builder.dataset = dataset_path
Contributor

Would a similar mechanism be required for ICON Node builders? see also #627
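
To make the question concrete, here is a minimal sketch of the guard in the snippet above, using `types.SimpleNamespace` as a stand-in for the real config object. The helper names `override_dataset` and `make_config`, and the `grid_filename` field, are hypothetical and chosen only for illustration; the point is that a node builder without a `dataset` attribute is skipped by the `hasattr` check.

```python
from types import SimpleNamespace


def override_dataset(graph_config: SimpleNamespace, dataset_path: str) -> bool:
    """Point the data node builder at `dataset_path`; return True if overridden."""
    node_builder = graph_config.nodes.data.node_builder
    if hasattr(node_builder, "dataset"):
        node_builder.dataset = dataset_path
        return True
    # Builders without a `dataset` attribute are left untouched.
    return False


def make_config(**builder_fields: str) -> SimpleNamespace:
    """Hypothetical helper that builds a nested config for the example."""
    builder = SimpleNamespace(**builder_fields)
    return SimpleNamespace(nodes=SimpleNamespace(data=SimpleNamespace(node_builder=builder)))


with_dataset = make_config(dataset="dummy.zarr")
without_dataset = make_config(grid_filename="grid.nc")  # hypothetical field name

overridden = override_dataset(with_dataset, "real.zarr")
skipped = override_dataset(without_dataset, "real.zarr")
```

In this sketch `overridden` is `True` and `skipped` is `False`, so any builder that reads its data from a differently named field would need its own handling.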

dataset: aifs-ea-an-oper-0001-mars-${data.resolution}-1979-2023-6h-v8.zarr

# Secondary dataset for ERA5_copy (using same file for debugging)
dataset_b: cerra-rr-an-oper-0001-mars-5p5km-1984-2022-6h-v3-hmsi.zarr
Contributor

Is this working currently? I've been trying to run with two datasets with non-identical datetimes and am getting errors.

Contributor

By this I mean that one dataset's datetimes are a subset of the other's, as is the case in the example above.

Member Author

What do you mean by non-identical datetimes, radi?

The first implementation of multiple datasets aims to use datasets that are fully time-aligned and feature the same datetimes.
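
One way to check this requirement before training is sketched below, assuming each dataset's datetimes are available as a plain list. The helpers `six_hourly`, `is_time_aligned`, and `missing_dates` are hypothetical names for this example and are not part of the branch.

```python
from datetime import datetime, timedelta
from typing import List, Sequence, Set


def six_hourly(start: datetime, steps: int) -> List[datetime]:
    """Build `steps` datetimes at a 6-hour frequency, starting at `start`."""
    return [start + timedelta(hours=6 * i) for i in range(steps)]


def is_time_aligned(a: Sequence[datetime], b: Sequence[datetime]) -> bool:
    """True only when both datasets have exactly the same datetimes, in order."""
    return list(a) == list(b)


def missing_dates(reference: Sequence[datetime], other: Sequence[datetime]) -> Set[datetime]:
    """Datetimes present in `reference` but absent from `other`."""
    return set(reference) - set(other)


era_dates = six_hourly(datetime(1984, 1, 1), steps=8)
cerra_dates = era_dates[2:]  # a strict subset, as in the case raised above

aligned = is_time_aligned(era_dates, cerra_dates)
gap = missing_dates(era_dates, cerra_dates)
# Restricting both datasets to the common datetimes restores alignment.
common = sorted(set(era_dates) & set(cerra_dates))
```

Here `aligned` is `False` because the second list lacks the first two timestamps, while trimming both datasets to `common` gives identical datetimes.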

JPXKQX and others added 5 commits November 18, 2025 14:21
## Description

Adaptation of plotting callbacks to multiple datasets. We add a
parameter `datasets` to the callback configuration that allows
specifying which datasets to plot. All plots will be based on the same
configuration, including the choice of parameters to plot.

**In order to plot different datasets with different parameters, the
user can configure multiple callbacks of the same type with different
parameters.**

Callbacks included:
- PlotLoss
- PlotSample
- PlotSpectrum
- PlotHistogram
- GraphTrainableFeaturesPlot

Callbacks not included in this PR:
- ensemble plots
- LongRolloutPlots

Here are MLFlow runs for a [single
dataset](https://mlflow.ecmwf.int/#/experiments/420/runs/8dca1935a3194edea16a21d4ed4d6a13/artifacts)
and a [two
dataset](https://mlflow.ecmwf.int/#/experiments/420/runs/ea59f113b0d34061b487038552d9467c/artifacts)
use case.

## The config interface for plotting
Most datasets will have somewhat different parameters, and hence
different parameters to plot. The config interface implemented in this
PR means that users will need either to plot parameters shared by all
datasets, or to configure multiple callbacks of the same type for
different datasets.
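
As a hedged sketch of what this could look like, the example below writes the callback entries as Python dictionaries. Only the `datasets` key and its `["data"]` default come from this description; the `callback` and `parameters` keys, the dataset names, the parameter names, and the `callbacks_for` helper are assumptions made for illustration.

```python
from typing import Any, Dict, List

DEFAULT_DATASETS = ["data"]  # default for `datasets` mentioned in this description

# Hypothetical callback entries; keys other than `datasets` are illustrative.
plot_callbacks: List[Dict[str, Any]] = [
    {"callback": "PlotSample", "datasets": ["era"], "parameters": ["2t", "10u"]},
    {"callback": "PlotSample", "datasets": ["cerra"], "parameters": ["2t"]},
    {"callback": "PlotLoss"},  # no `datasets` key, so the default applies
]


def callbacks_for(dataset: str, callbacks: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return the callback entries that should plot the given dataset."""
    return [cb for cb in callbacks if dataset in cb.get("datasets", DEFAULT_DATASETS)]


era_callbacks = callbacks_for("era", plot_callbacks)
default_callbacks = callbacks_for("data", plot_callbacks)
```

Two `PlotSample` entries with different `parameters` cover the two datasets, while the entry without a `datasets` key only applies to the default dataset name.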
 
The reason for choosing this interface as a first draft is not that it
is optimal for multiple datasets, but rather that:
- it keeps plots and plotting configs for the single dataset case
backwards compatible. Avoiding regression on plotting for existing use
cases was the main requirement.
- it requires fewer changes to the callbacks than making them more
configurable
- it provides _some functionality_ to plot multiple datasets.
- we don't know yet how multiple datasets will mostly be used and it
might make sense to delay the design of the interface a bit. This will
give us more time to rethink plotting callbacks more generally, rather
than settling on an interface now.

## Other decisions
- We will adapt pydantic schemas for plotting (and set a default for the
new parameter `datasets = ["data"]`) as part of fixing schemas on the
multiple datasets branch
- I'd suggest updating the remaining plotting callbacks separately, but
I'm open to thoughts.