Commit 0922a52

OriolAbril, sethaxen and ahartikainen committed
address comments
Co-authored-by: Seth Axen <[email protected]> Co-authored-by: Ari Hartikainen <[email protected]>
1 parent ed54b65 commit 0922a52

1 file changed

+19
-12
lines changed

doc/source/schema/schema.md

Lines changed: 19 additions & 12 deletions
@@ -15,7 +15,7 @@ Currently there are **two beta implementations** of this design:
  [CmdStanPy](https://mc-stan.org/cmdstanpy/)
  - [TensorFlow Probability](https://www.tensorflow.org/probability)
  * [ArviZ.jl](https://github.com/arviz-devs/ArviZ.jl) in **Julia** which integrates with:
- - [CmdStan.jl](https://github.com/StanJulia/CmdStan.jl), [StanSample.jl](https://github.com/StanJulia/StanSample.jl) and [Stan.jl](https://github.com/StanJulia/Stan.jl)
+ - [CmdStan.jl](https://github.com/StanJulia/CmdStan.jl), [Soss.jl](https://cscherrer.github.io/Soss.jl/stable/), [StanSample.jl](https://github.com/StanJulia/StanSample.jl) and [Stan.jl](https://github.com/StanJulia/Stan.jl)
  - [Turing.jl](https://turing.ml/dev/) and indirectly any package using [MCMCChains.jl](https://github.com/TuringLang/MCMCChains.jl) to store results

  ## Terminology
@@ -39,11 +39,12 @@ There are also some extensions particular to the {ref}`InferenceData <xarray_fo
  `InferenceData` stores all quantities that are relevant to fulfilling its goals in different groups. Different groups generally distinguish conceptually different quantities in Bayesian inference, however, convenience in {ref}`creation <creating_InferenceData>` and {ref}`usage <working_with_InferenceData>` of `InferenceData` objects also plays a role. In general, each quantity (such as posterior distribution or observed data) will be represented by several multidimensional labeled variables.

  ### Rules
- Below are a few rules which should be followed:
+ Below are a few rules that should be followed:
  * Each group should have one entry per variable and each variable should be named.
- * Dimension names `chain`, `draw`, `sample` and `post_pred_id` are reserved for
+ * Dimension names `chain`, `draw`, `sample` and `pred_id` are reserved for
  InferenceData use to indicate sample dimensions.
- * Dimensions in InferenceData (including sample dimensions) should be identified by name only.
+ * Dimensions in InferenceData (including sample dimensions) should be identified by name only. The
+ dimension order does not matter, only their names.
  * For groups like `observed_data` or `constant_data`, all sample dimensions can be
  omitted. For groups like `prior`, `posterior` or `posterior_predictive` either `sample` has to be
  present or both `chain` and `draw` dimensions need to be present. Any combinations that follow
@@ -60,17 +61,20 @@ However, it is recommended to store the following fields when relevant:
  but they are all tied to a single model. The model identifier can be added as metadata
  to simplify the calls to model comparison functions.
  * `created_at`: the date of creation of the group.
- * `arviz_version`: the version of the ArviZ library that generated the InferenceData
- * `arviz_language`: the programming language from which ArviZ was used to create the InferenceData
+ * `creation_library`: the library used to create the InferenceData (might not necessarily be ArviZ)
+ * `creation_library_version`: the version of `creation_library` that generated the InferenceData
+ * `creation_library_language`: the programming language from which `creation_library` was used to create the InferenceData
  * `inference_library`: the library used to run the inference.
  * `inference_library_version`: version of the inference library used.

+ Metadata can be stored at the whole `InferenceData` level but also at group level when needed.
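The metadata fields in this hunk can be illustrated with a minimal sketch. It uses plain dictionaries rather than any real ArviZ or xarray object. The helper `make_metadata`, the layout of `idata_like`, and the library names `my_converter` and `my_sampler` are hypothetical, chosen only to show the same field names attached at both the whole-object level and the group level.

```python
from datetime import datetime, timezone


def make_metadata(creation_library, creation_library_version,
                  inference_library=None, inference_library_version=None):
    """Build a metadata dict using the field names recommended by the schema.

    Hypothetical helper: the schema names the fields, not this function.
    """
    attrs = {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "creation_library": creation_library,
        "creation_library_version": creation_library_version,
        "creation_library_language": "Python",
    }
    # Inference-related fields are only stored when relevant.
    if inference_library is not None:
        attrs["inference_library"] = inference_library
    if inference_library_version is not None:
        attrs["inference_library_version"] = inference_library_version
    return attrs


# Hypothetical in-memory layout: top-level attrs plus per-group attrs.
idata_like = {
    "attrs": make_metadata("my_converter", "0.1.0"),
    "groups": {
        "posterior": {
            "attrs": make_metadata("my_converter", "0.1.0", "my_sampler", "2.3"),
        },
        "observed_data": {"attrs": {}},
    },
}
```

Keeping the inference fields on the `posterior` group only reflects the "at group level when needed" wording: observed data was not produced by a sampler, so it carries no inference metadata in this sketch.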
+

  ### Relations between groups
- `InferenceData` data objects contain any combination of the groups described below. There are also some relations (detailed below) between the variables and dimensions of different groups. Hence, whenever related groups are present they should comply with these relations. Neither the presence of groups nor described below or the lack of some of the groups described below go against the schema.
+ `InferenceData` data objects contain any combination of the groups described below. There are also some relations (detailed below) between the variables and dimensions of different groups. Hence, whenever related groups are present they should comply with these relations. Neither the presence of groups not described below or the lack of some of the groups described below go against the schema.

  #### `posterior`
- Samples from the posterior distribution p(theta|y) in the parameter (also called constrained) space.
+ Samples from the posterior distribution $p(\theta|y)$ in the parameter (also called constrained) space.

  (schema/unconstrained_posterior)=
  #### `unconstrained_posterior`
@@ -81,15 +85,17 @@ Therefore, to get the samples for _all_ the variables in the unconstrained space
  variables should be taken from the `unconstrained_posterior` group if present,
  and if not, then the values from the variable in the `posterior` group should be used.

- Both samples and variables (for those present only) should match between the `posterior` and the `unconstrained_posterior` groups. Note that as defined above, matching samples and variables impose constraints on dimensions and coordinates, not on the values which will be different.
+ Samples should match between the `posterior` and the `unconstrained_posterior` groups.
+ All variables in `unconstrained_posterior` should have a counterpart in `posterior`
+ with the same name. However, they don't need to have the same dimensions nor shape.
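The lookup rule and the naming requirement in this hunk can be sketched as follows. Groups are modeled as plain dictionaries mapping variable names to lists of samples, a deliberate simplification. Both function names are hypothetical and not part of any library.

```python
import math


def missing_counterparts(posterior, unconstrained_posterior):
    """Names in `unconstrained_posterior` with no same-named variable in `posterior`."""
    return sorted(set(unconstrained_posterior) - set(posterior))


def unconstrained_samples(posterior, unconstrained_posterior):
    """Collect samples for all variables in the unconstrained space.

    A variable is taken from `unconstrained_posterior` if present there,
    otherwise its values from `posterior` are used.
    """
    return {
        name: unconstrained_posterior.get(name, values)
        for name, values in posterior.items()
    }


# Illustrative data: `sigma` is positive, so it is stored log-transformed
# in the unconstrained group; `mu` is already unconstrained.
posterior = {"mu": [0.1, 0.3], "sigma": [1.2, 0.8]}
unconstrained_posterior = {"sigma": [math.log(1.2), math.log(0.8)]}

combined = unconstrained_samples(posterior, unconstrained_posterior)
```

Only names are compared here. Dimensions and shapes are intentionally not checked, since the added lines state they may differ between the two groups.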

  :::{note}
  :class: dropdown

  Both InferenceData groups and variables can have metadata, which in the `unconstrained_posterior`
  case could be used to store the transformations each variable goes through to map between the
  constrained and unconstrained spaces. The schema leaves this completely up to the user
- and imposes on conventions or restrictions on such metadata.
+ and imposes no conventions or restrictions on such metadata.
  :::

  (schema/sample_stats)=
@@ -122,7 +128,7 @@ additive constant).
  the accepted proposal.
  * `max_energy_error`: The maximum absolute difference in Hamiltonian energy between the initial point and all possible samples in the proposed tree.
  * `int_time`: The total integration time (static HMC sampler)
- * `mass_matrix`: Mass matrix (also known as _metric_) used in HMC samplers for the computation of the Hamiltonian.
+ * `inv_metric`: Inverse metric (also known as inverse _mass matrix_) used in HMC samplers for the computation of the Hamiltonian.
  When it is constant, the resulting implementation is known as Euclidean HMC;
  in that case, the variable wouldn't need to have any sampling dimensions
  even if part of the `sample_stats` group.
@@ -181,9 +187,10 @@ any sample stats group.

  #### Warmup groups
  Samples generated during the adaptation/warmup phases of algorithms like HMC
- can also be stored in InferenData. In such cases, the data/samples
+ can also be stored in InferenceData. In such cases, the data/samples
  generated during the adaptation process should be stored in groups with
  the same name with the `warmup_` prefix, e.g. `warmup_posterior`, `warmup_sample_stats_prior`.
+ The `warmup_` prefix goes before other prefixes.
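The naming rule added here can be shown with a small sketch. The helper `warmup_group_name` is hypothetical. It prepends `warmup_` to the full group name, so the prefix ends up before any prefix the name already carries, such as `unconstrained_`.

```python
WARMUP_PREFIX = "warmup_"


def warmup_group_name(group):
    """Return the name of the warmup counterpart of `group`.

    The prefix is placed in front of the whole name, so it precedes
    any other prefix. Names that are already warmup groups are
    returned unchanged.
    """
    if group.startswith(WARMUP_PREFIX):
        return group
    return WARMUP_PREFIX + group


names = [warmup_group_name(g) for g in ("posterior", "sample_stats_prior", "unconstrained_posterior")]
```

Under this sketch the first two results reproduce the examples given in the text, `warmup_posterior` and `warmup_sample_stats_prior`.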

  #### Unconstrained groups
  Samples on the unconstrained space in cases where the samples need to be generated with
