Merged
283 changes: 283 additions & 0 deletions doc/architectural_decisions/007-output-file-archiving.md
@@ -0,0 +1,283 @@
# 7. Output file archiving

## Status

Draft

## Context

Our bluesky implementation contains bluesky callbacks which produce scientist-facing output files, for example:
- [Human-readable scan result files](/callbacks/file_writing): {py:obj}`HumanReadableFileCallback <ibex_bluesky_core.callbacks.HumanReadableFileCallback>`
- [Fitting results](/fitting/livefit_logger): {py:obj}`LiveFitLogger <ibex_bluesky_core.callbacks.LiveFitLogger>`
- [Plot PNGs](#plot_png_saver): {py:obj}`PlotPNGSaver <ibex_bluesky_core.callbacks.PlotPNGSaver>`

In addition, we have a [developer-facing callback for diagnostics](/callbacks/docs_logging_callback),
{py:obj}`DocLoggingCallback <ibex_bluesky_core.callbacks.DocLoggingCallback>`.

The above callbacks produce files on disk in response to a bluesky scan. These files contain valuable data and so we
need to consider how these files are archived for the long term. This must align with the
[ISIS Data Policy](https://www.isis.stfc.ac.uk/pages/data-policy.aspx). We should make an attempt to align with
[FAIR principles](https://www.go-fair.org/fair-principles/).

According to the definitions in the [ISIS Data Policy](https://www.isis.stfc.ac.uk/pages/data-policy.aspx), the data
generated by bluesky is generally either "facility generated reduced data" or "metadata".

This ADR is concerned with the location in which these bluesky output files are stored, and the archiving infrastructure
which is therefore used to keep these files for the long term.

---

At the time of writing this ADR, in June 2025, the scientist-facing files are being written to:
```
...\inst$\<instrument>\user\bluesky_scans\<rb_number>\
```

This location has some disadvantages:
- It is a network location, which means that a site network break will cause bluesky scans to fail to run
- It is not a location designed for long-term scientifically useful data - for example in terms of data integrity
- It is not necessarily accessible from downstream systems such as TopCat

Therefore, we would like to define a different, more suitable, location into which bluesky output files can be written.

---

Some representative use-cases are presented below, showing how data is expected to be used by scientists (click to
expand each use case):

<details>
<summary>1 Bluesky scan, no neutron runs (e.g. scanning against a block)</summary>

```{mermaid}
sequenceDiagram
actor PI
participant NDX
participant Archive
participant TopCat
note over PI:Start of RBNumber experiment
PI ->> NDX: Start bluesky scan
note over PI: Time Passes
note over NDX: Bluesky scan ends
note over NDX: creates scan.ascii and scan.nxs
NDX ->> Archive: Sends scan.ascii and scan.nxs
TopCat ->> Archive: Collects scan.ascii and scan.nxs
note over PI: 5 months later
PI ->> TopCat: Show me my data
TopCat ->> PI: Provides access to scan.ascii and scan.nxs
note over PI: 1 year later
PI ->> TopCat: Show me my data
TopCat ->> PI: Provides access to scan.nxs
```
</details>

<details>
<summary>1 Bluesky scan, aborted neutron runs</summary>

```{mermaid}
sequenceDiagram
actor PI
participant NDX
participant Archive
participant TopCat as Online Catalogue
note over PI:Start of RBNumber experiment
PI ->> NDX: Start bluesky scan
note over NDX: DAE run started by scan <br/> Time passes <br/> Required data gathered in scan documents <br/> Abort DAE run
note over NDX: DAE run started by scan <br/> Time passes <br/> Required data gathered in scan documents <br/> Abort DAE run
note over NDX: DAE run started by scan <br/> Time passes <br/> Required data gathered in scan documents <br/> Abort DAE run
note over NDX: Bluesky scan ends
note over NDX: creates scan.ascii and scan.nxs
NDX ->> Archive: Sends scan.ascii and scan.nxs
TopCat ->> Archive: Collects scan.ascii and scan.nxs
note over PI: 5 months later
PI ->> TopCat: Show me my data
TopCat ->> PI: Provides access to scan.ascii and scan.nxs
note over PI: 1 year later
PI ->> TopCat: Show me my data
TopCat ->> PI: Provides access to scan.nxs
```
</details>

<details>
<summary>1 Bluesky scan, one neutron run</summary>

```{mermaid}
sequenceDiagram
actor PI
participant NDX
participant Archive
participant TopCat
note over PI:Start of RBNumber experiment
PI ->> NDX: Start bluesky scan
note over NDX: Bluesky scan starts DAE run
note over PI: Time Passes
note over NDX: Bluesky scan ends DAE run <br/> Bluesky scan ends
par
note over NDX: creates runnumber.nxs with DAE and SE data
and
note over NDX: creates scan.ascii and scan.nxs
end
NDX ->> Archive: Sends runnumber.nxs, scan.ascii, and scan.nxs
TopCat ->> Archive: Collects runnumber.nxs, scan.ascii, and scan.nxs
note over PI: 5 months later
PI ->> TopCat: Show me my data
TopCat ->> PI: Provides access to runnumber.nxs, scan.ascii, and scan.nxs
note over PI: 1 year later
PI ->> TopCat: Show me my data
TopCat ->> PI: Provides access to runnumber.nxs and scan.nxs
```
</details>

<details>
<summary>1 Bluesky scan, N neutron runs</summary>

```{mermaid}
sequenceDiagram
actor PI
participant NDX
participant Archive
participant TopCat
note over PI:Start of RBNumber experiment
PI ->> NDX: Start bluesky scan
note over NDX: Bluesky scan starts DAE run
note over PI: Time Passes
note over NDX: Bluesky scan ends DAE run
note over NDX: creates runnumber.nxs with DAE and SE data
NDX ->> Archive: Sends runnumber.nxs
TopCat ->> Archive: Collects runnumber.nxs
note over PI: Time Passes
note over NDX: Bluesky scan starts DAE run
note over PI: Time Passes
note over NDX: Bluesky scan ends DAE run
note over NDX: creates runnumber+1.nxs with DAE and SE data
NDX ->> Archive: Sends runnumber+1.nxs
TopCat ->> Archive: Collects runnumber+1.nxs
note over NDX: Bluesky scan ends
NDX ->> Archive: Sends scan.ascii and scan.nxs
TopCat ->> Archive: Collects scan.ascii and scan.nxs
note over PI: 5 months later
PI ->> TopCat: Show me my data
TopCat ->> PI: Provides access to runnumber.nxs, runnumber+1.nxs, scan.ascii, and scan.nxs
note over PI: 1 year later
PI ->> TopCat: Show me my data
TopCat ->> PI: Provides access to runnumber.nxs, runnumber+1.nxs, and scan.nxs
```
</details>

<details>
<summary>1 Bluesky scan, neutron/muon runs on multiple instruments</summary>

```{mermaid}
sequenceDiagram
actor PI
participant NDX-A
participant NDX-B
participant NDX-C
participant Archive
participant TopCat
note over PI:Start of RBNumber experiment
PI ->> NDX-A: Start bluesky scan
NDX-A ->> NDX-B: Start DAE run
NDX-A ->> NDX-C: Start DAE run
note over PI: Time Passes
NDX-B ->> NDX-A: Provides summary run data
NDX-C ->> NDX-A: Provides summary run data
NDX-A ->> NDX-B: End DAE run
note over NDX-B: creates runnumberB.nxs with DAE and SE data
NDX-B ->> Archive: Sends runnumberB.nxs
TopCat ->> Archive: Collects runnumberB.nxs
NDX-A ->> NDX-C: End DAE run
note over NDX-C: creates runnumberC.nxs with DAE and SE data
NDX-C ->> Archive: Sends runnumberC.nxs
TopCat ->> Archive: Collects runnumberC.nxs
note over NDX-A: Bluesky scan ends
NDX-A ->> Archive: Sends scan.ascii and scan.nxs
TopCat ->> Archive: Collects scan.ascii and scan.nxs
note over PI: 5 months later
PI ->> TopCat: Show me my data
TopCat ->> PI: Provides access to runnumberB.nxs, runnumberC.nxs, scan.ascii, and scan.nxs
note over PI: 1 year later
PI ->> TopCat: Show me my data
TopCat ->> PI: Provides access to runnumberB.nxs, runnumberC.nxs, and scan.nxs
```
</details>

## Present

The following people have been involved in discussions leading up to this ADR:

- Tom
- Chris M-S
- George
- Kathryn
- Jack H
- CK (Reflectometry)

This document was additionally reviewed in a regular Thursday code-review slot by the whole IBEX team.

## Decisions

### File-writing location

Bluesky should write data into the `c:\data\RB<rb_number>\bluesky_scans\` folder during a scan.
File naming itself will keep its current scheme (timestamped files).

This location was chosen because it mirrors the archiving setup used by neutron cameras on IMAT.
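
As a minimal sketch of the folder layout described above, the output directory could be derived from the RB number as
follows. The helper name `scan_output_dir` and the `DATA_ROOT` constant are hypothetical, introduced only for
illustration; `PureWindowsPath` is used so that the example yields Windows-style paths on any platform.

```python
from pathlib import PureWindowsPath

# Root of the instrument data area, as named in this ADR.
DATA_ROOT = PureWindowsPath(r"c:\data")


def scan_output_dir(rb_number: str) -> PureWindowsPath:
    """Return the folder bluesky would write into for a given RB number.

    Hypothetical helper for illustration only.
    """
    return DATA_ROOT / f"RB{rb_number}" / "bluesky_scans"


print(scan_output_dir("12345"))
```

The timestamped file names themselves are unchanged by this decision, so they are not modelled here.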

### Attributes & checksums

Bluesky should mark files as read-only, using Windows file attributes, when it has finished writing them. This is so
that the archiving process can unambiguously tell whether a file has finished being written. It also reduces the
likelihood that a file is accidentally modified.
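
The read-only convention could be sketched as below, assuming two hypothetical helpers, `mark_finished` (called by the
writer once a file is complete) and `is_finished` (called by the archiving process). On Windows, `os.chmod` with
`stat.S_IREAD` sets the read-only file attribute; on POSIX systems it leaves only the owner-read permission bit set.

```python
import os
import stat
from pathlib import Path


def mark_finished(path: Path) -> None:
    """Flag a fully written file as read-only (hypothetical helper)."""
    os.chmod(path, stat.S_IREAD)


def is_finished(path: Path) -> bool:
    """Return True if the file's owner-write bit is cleared, i.e. it is read-only."""
    return not (path.stat().st_mode & stat.S_IWRITE)
```

A writer would call `mark_finished` as its last step, after closing the file, so that the archiving process never sees
a read-only file which is still being written.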

Checksums should be generated, either at the point when the data is initially generated, or by the archiving process
just before it first copies or moves a file.

We have agreed that checksums should be generated for bluesky data, as is already done for DAE data. Checksums make it
possible to detect data corruption, which might occur in transit, or in place on instrument computers or archive servers.
Several checksumming approaches have been considered, but none has been chosen yet. The options discussed
are:
- **Use Windows alternate file streams** (NTFS alternate data streams). This is how checksums are done in existing DAE
`.raw` files. It has the advantage that it is relatively simple to implement, but the disadvantage that alternate
streams do not map nicely onto Linux file systems.
- **Generate one checksum per file**, for example `file.txt` would also have an associated `file.sha1.txt` containing the
checksum. The advantage is that this is simple to implement and platform-agnostic. The disadvantage is that it doubles
the number of files visible in the archive area.
- **Generate a single checksum file** containing the checksums of all bluesky data, at a higher level of granularity (for
example by RB number or by cycle). It is currently unclear exactly how this approach would be implemented, and at what
point these checksums would be moved to the archive.
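
To make the second option concrete, a per-file "sidecar" checksum could look like the sketch below. This is an
illustration under stated assumptions rather than a chosen design: the function names are hypothetical, SHA-1 is used
only because the example file name above is `file.sha1.txt`, and the sidecar is assumed to contain just the hex digest.

```python
import hashlib
from pathlib import Path


def sidecar_path(data_file: Path) -> Path:
    """Map e.g. file.txt to file.sha1.txt, following the naming in the option above."""
    return data_file.with_name(f"{data_file.stem}.sha1.txt")


def sha1_of(data_file: Path) -> str:
    """Hash the file in chunks so that large files are not read into memory at once."""
    digest = hashlib.sha1()
    with data_file.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_sidecar(data_file: Path) -> Path:
    """Write the checksum next to the data file and return the sidecar's path."""
    sidecar = sidecar_path(data_file)
    sidecar.write_text(sha1_of(data_file) + "\n")
    return sidecar


def verify_sidecar(data_file: Path) -> bool:
    """Recompute the checksum and compare it with the stored one."""
    expected = sidecar_path(data_file).read_text().strip()
    return sha1_of(data_file) == expected
```

Because the sidecar is an ordinary file, it can be copied or moved with the same tooling as the data file, on Windows
or Linux, at the cost of doubling the file count as noted above.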

Review comment (Member): Checksums could be per file, or all files in a folder could be in a single checksum file. This may need some further discussion, depending on when we feel we archive from an RB number directory.

### Moving to the ISIS archive

An automated cron task will look for read-only Bluesky output files, and their associated checksums, in `c:\data` at
regular short intervals (for example, 1 minute), and will move them to:
Review comment (Member): If we have one checksum file per data file then it can move the checksum file; if we have a single file listing checksums it would be a bit more complicated.

Review comment (Tom-Willemsen, Member Author, Jul 3, 2025): I guess if it is done as an alternate file stream then the checksum "automatically" gets moved along with the file... although they're annoying to access from most programming languages and not very portable to e.g. Linux later.

Review comment (Member): Indeed, Linux is the main reason I'd try and avoid them...

Review comment: There are quite a few scenarios where the out-of-band checksum streams are useful currently (as they don't get included in the checksum calculation, even by accident). They are also useful to ensure immutability between the instrument computer and the current archive, and I have on several occasions run simple checks like these when the integrity of the file system is in question (e.g. disk errors on one archive server). A separate block of checksums in a different disk locality and directory structure might itself be suspect (note: the check of the checksum/file stream is local and two-way; one validates the other). A separate .zip of checksums might be the way to go, but it does bring its own maintenance issues (however, it does have a checksum itself, and file changes can be made whilst updating checksums if necessary, so it is not unlike the second file stream).

Review comment (Member Author): I have re-worded the checksumming section to reflect uncertainty in the exact technical approach, while acknowledging that we've decided that we do want to generate checksums.

- The ISIS data archive, under `autoreduced/bluesky_scans`. The `autoreduced` folder already exists on the archive.
- The data cache disk on the instrument, under `c:\data\Export only\RB<rb_number>\bluesky_scans`.

Data on the cache disk, under `Export only`, is kept on the instrument for a short period (usually 24 hours), and then
deleted by existing processes.

This is run as a cron task so that, if the network happens to be unavailable at the time when a scan ends, the copy
process will catch up when the network becomes available again. This cron task will only move files which sit within
a `bluesky_scans` folder, to prevent it from interfering with other non-bluesky files.
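
One pass of such a task might be sketched as follows. This is a simplified illustration, not the agreed
implementation: the function names are hypothetical, the archive destination is modelled as a single flat directory
standing in for `autoreduced/bluesky_scans`, and checksum verification is omitted. Any read-only file inside a
`bluesky_scans` folder is handled, which would include per-file checksum files if that option were chosen.

```python
import shutil
import stat
from pathlib import Path


def is_read_only(path: Path) -> bool:
    """A file counts as finished once its owner-write bit is cleared."""
    return path.is_file() and not (path.stat().st_mode & stat.S_IWRITE)


def archive_pass(data_root: Path, archive_scans_dir: Path) -> list[Path]:
    """Run one pass of the archiving task and return the source files handled."""
    handled = []
    # Only look inside RB<rb_number>/bluesky_scans so other files are never touched.
    for scans_dir in sorted(data_root.glob("RB*/bluesky_scans")):
        export_dir = data_root / "Export only" / scans_dir.parent.name / "bluesky_scans"
        for src in sorted(scans_dir.iterdir()):
            if not is_read_only(src):
                continue  # still being written; a later pass will pick it up
            try:
                archive_scans_dir.mkdir(parents=True, exist_ok=True)
                archive_copy = archive_scans_dir / src.name
                if not archive_copy.exists():
                    # Copy under a temporary name first, so that an interrupted
                    # copy never looks like a complete archived file.
                    partial = archive_copy.with_name(archive_copy.name + ".part")
                    shutil.copy2(src, partial)
                    partial.rename(archive_copy)
                export_dir.mkdir(parents=True, exist_ok=True)
                shutil.move(str(src), str(export_dir / src.name))
            except OSError:
                continue  # e.g. network unavailable; retry on the next pass
            handled.append(src)
    return handled
```

Because a failed copy leaves the source file in place, simply running the same pass again at the next interval gives
the catch-up behaviour described above.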

Creating a new `bluesky_scans` folder alongside the existing `autoreduced` folder was considered, but was felt to be
unachievable - it would require too much work relative to using the existing `autoreduced` folder.

### File formats

At present, our scan file output format is designed to be "human-readable"; indeed, the callback which
generates these files is called
{py:obj}`HumanReadableFileCallback <ibex_bluesky_core.callbacks.HumanReadableFileCallback>`.

[Issue 26](https://github.com/ISISComputingGroup/ibex_bluesky_core/issues/26) tracks the implementation of
machine-readable files, using a format such as `.hdf5` or `.nxs`. These files will sit alongside the existing
human-readable files: although machine-readable files are better from a data preservation and archiving standpoint,
we will need to retain the human-readable files so that scientists can browse them quickly without special software.

## Consequences

- Bluesky output data will be stored in a location suitable for long-term, scientifically useful, data. This includes
data integrity and availability concerns.
- Bluesky scans will no longer be reliant on a network location being available to run a scan
- The initial location where bluesky writes data (`c:\data\RB<rb_number>\bluesky_scans\`) will not be the same as its
final location (the `autoreduced` folder on the ISIS archive). This is also true for current DAE data, as generated by
the ISISICP.
1 change: 1 addition & 0 deletions doc/callbacks/plotting.md
@@ -83,6 +83,7 @@ Due to an implementation detail of {py:obj}`matplotlib.pyplot.pcolormesh`,
the plot will only appear once at least *two* rows of data have been collected.
:::

+{#plot_png_saver}
## Saving plots to PNG files

`ibex_bluesky_core` provides a {py:obj}`PlotPNGSaver<ibex_bluesky_core.callbacks.PlotPNGSaver>` callback which saves plots to PNG files on a run stop. Files are saved to the default output file location unless a filepath is explicitly given.
7 changes: 5 additions & 2 deletions doc/conf.py
@@ -29,7 +29,7 @@
("py:obj", r"^.*\.T.*_co$"),
]

-myst_enable_extensions = ["dollarmath", "strikethrough", "colon_fence"]
+myst_enable_extensions = ["dollarmath", "strikethrough", "colon_fence", "attrs_block"]
suppress_warnings = ["myst.strikethrough"]

extensions = [
@@ -43,7 +43,10 @@
"sphinx.ext.intersphinx",
# Add links to source code in API docs
"sphinx.ext.viewcode",
+# Mermaid diagrams
+"sphinxcontrib.mermaid",
]
+mermaid_d3_zoom = True
napoleon_google_docstring = True
napoleon_numpy_docstring = False

@@ -70,7 +73,7 @@
html_favicon = "favicon.svg"

autoclass_content = "both"
-myst_heading_anchors = 3
+myst_heading_anchors = 7
autodoc_preserve_defaults = True

intersphinx_mapping = {
1 change: 1 addition & 0 deletions pyproject.toml
@@ -59,6 +59,7 @@ doc = [
"sphinx_rtd_theme",
"myst_parser",
"sphinx-autobuild",
+"sphinxcontrib-mermaid",
]
dev = [
"ibex_bluesky_core[doc]",