diff --git a/doc/architectural_decisions/007-output-file-archiving.md b/doc/architectural_decisions/007-output-file-archiving.md new file mode 100644 index 00000000..f5a92921 --- /dev/null +++ b/doc/architectural_decisions/007-output-file-archiving.md @@ -0,0 +1,283 @@ +# 7. Output file archiving + +## Status + +Draft + +## Context + +Our bluesky implementation contains bluesky callbacks which produce scientist-facing output files, for example: +- [Human-readable scan result files](/callbacks/file_writing): {py:obj}`HumanReadableFileCallback ` +- [Fitting results](/fitting/livefit_logger): {py:obj}`LiveFitLogger ` +- [Plot PNGs](#plot_png_saver): {py:obj}`PlotPNGSaver ` + +In addition, we have a [developer-facing callback for diagnostics](/callbacks/docs_logging_callback), +{py:obj}`DocLoggingCallback `. + +The above callbacks produce files on disk in response to a bluesky scan. These files contain valuable data and so we +need to consider how these files are archived for the long term. This must align with the +[ISIS Data Policy](https://www.isis.stfc.ac.uk/pages/data-policy.aspx). We should make an attempt to align with +[FAIR principles](https://www.go-fair.org/fair-principles/). + +According to the definitions in the [ISIS Data Policy](https://www.isis.stfc.ac.uk/pages/data-policy.aspx), the data +generated by bluesky is generally either "facility generated reduced data" or "metadata". + +This ADR is concerned with the location in which these bluesky output files are stored, and the archiving infrastructure +which is therefore used to keep these files for the long term. 
+ +--- + +At the time of writing this ADR, in June 2025, the scientist-facing files are being written to +``` +...\inst$\\user\bluesky_scans\\ +``` + +This location has some disadvantages: +- It is a network location, which means that a site network break will cause bluesky scans to fail to run +- It is not a location designed for long-term scientifically useful data - for example in terms of data integrity +- It is not necessarily accessible from downstream systems such as Topcat + +Therefore, we would like to define a different, more suitable, location into which bluesky output files can be written. + +--- + +Some representative use-cases are presented below, showing how data is expected to be used by scientists (click to +expand each use case): + +
+1 Bluesky scan, no neutron runs (e.g. scanning against a block) + +```{mermaid} +sequenceDiagram +actor PI +participant NDX +participant Archive +participant TopCat +note over PI:Start of RBNumber experiment +PI ->> NDX: Start bluesky scan +note over PI: Time Passes +note over NDX: Bluesky scan ends +note over NDX: creates scan.ascii and scan.nxs +NDX ->> Archive: Sends scan.ascii and scan.nxs +TopCat ->> Archive: Collects scan.ascii and scan.nxs +note over PI: 5 months later +PI ->> TopCat: Show me my data +TopCat ->> PI: Provides access to scan.ascii and scan.nxs +note over PI: 1 year later +PI ->> TopCat: Show me my data +TopCat ->> PI: Provides access to scan.nxs +``` +
+ +
+1 Bluesky scan, aborted neutron runs + +```{mermaid} +sequenceDiagram +actor PI +participant NDX +participant Archive +participant TopCat as Online Catalogue +note over PI:Start of RBNumber experiment +PI ->> NDX: Start bluesky scan +note over NDX: DAE run started by scan<br/>Time passes<br/>Required data gathered in scan documents<br/>Abort DAE run +note over NDX: DAE run started by scan<br/>Time passes<br/>Required data gathered in scan documents<br/>Abort DAE run +note over NDX: DAE run started by scan<br/>Time passes<br/>Required data gathered in scan documents<br/>Abort DAE run +note over NDX: Bluesky scan ends +note over NDX: creates scan.ascii and scan.nxs +NDX ->> Archive: Sends scan.ascii and scan.nxs +TopCat ->> Archive: Collects scan.ascii and scan.nxs +note over PI: 5 months later +PI ->> TopCat: Show me my data +TopCat ->> PI: Provides access to scan.ascii and scan.nxs +note over PI: 1 year later +PI ->> TopCat: Show me my data +TopCat ->> PI: Provides access to scan.nxs +``` +
+ +
+1 Bluesky scan, one neutron run + +```{mermaid} +sequenceDiagram +actor PI +participant NDX +participant Archive +participant TopCat +note over PI:Start of RBNumber experiment +PI ->> NDX: Start bluesky scan +note over NDX: Bluesky scan starts DAE run +note over PI: Time Passes +note over NDX: Bluesky scan ends DAE run<br/>Bluesky scan ends +par +note over NDX: creates runnumber.nxs with DAE and SE data +and +note over NDX: creates scan.ascii and scan.nxs +end +NDX ->> Archive: Sends runnumber.nxs, scan.ascii, and scan.nxs +TopCat ->> Archive: Collects runnumber.nxs, scan.ascii, and scan.nxs +note over PI: 5 months later +PI ->> TopCat: Show me my data +TopCat ->> PI: Provides access to runnumber.nxs, scan.ascii, and scan.nxs +note over PI: 1 year later +PI ->> TopCat: Show me my data +TopCat ->> PI: Provides access to runnumber.nxs and scan.nxs +``` +
+ +
+1 Bluesky scan, N neutron runs + +```{mermaid} +sequenceDiagram +actor PI +participant NDX +participant Archive +participant TopCat +note over PI:Start of RBNumber experiment +PI ->> NDX: Start bluesky scan +note over NDX: Bluesky scan starts DAE run +note over PI: Time Passes +note over NDX: Bluesky scan ends DAE run +note over NDX: creates runnumber.nxs with DAE and SE data +NDX ->> Archive: Sends runnumber.nxs +TopCat ->> Archive: Collects runnumber.nxs +note over PI: Time Passes +note over NDX: Bluesky scan starts DAE run +note over PI: Time Passes +note over NDX: Bluesky scan ends DAE run +note over NDX: creates runnumber+1.nxs with DAE and SE data +NDX ->> Archive: Sends runnumber+1.nxs +TopCat ->> Archive: Collects runnumber+1.nxs +note over NDX: Bluesky scan ends +NDX ->> Archive: Sends scan.ascii and scan.nxs +TopCat ->> Archive: Collects scan.ascii and scan.nxs +note over PI: 5 months later +PI ->> TopCat: Show me my data +TopCat ->> PI: Provides access to runnumber.nxs, runnumber+1.nxs, scan.ascii, and scan.nxs +note over PI: 1 year later +PI ->> TopCat: Show me my data +TopCat ->> PI: Provides access to runnumber.nxs, runnumber+1.nxs, and scan.nxs +``` +
+ +
+1 Bluesky scan, neutron/muon runs on multiple instruments + +```{mermaid} +sequenceDiagram +actor PI +participant NDX-A +participant NDX-B +participant NDX-C +participant Archive +participant TopCat +note over PI:Start of RBNumber experiment +PI ->> NDX-A: Start bluesky scan +NDX-A ->> NDX-B: Start DAE run +NDX-A ->> NDX-C: Start DAE run +note over PI: Time Passes +NDX-B ->> NDX-A: Provides summary run data +NDX-C ->> NDX-A: Provides summary run data +NDX-A ->> NDX-B: End DAE run +note over NDX-B: creates runnumberB.nxs with DAE and SE data +NDX-B ->> Archive: Sends runnumberB.nxs +TopCat ->> Archive: Collects runnumberB.nxs +NDX-A ->> NDX-C: End DAE run +note over NDX-C: creates runnumberC.nxs with DAE and SE data +NDX-C ->> Archive: Sends runnumberC.nxs +TopCat ->> Archive: Collects runnumberC.nxs +note over NDX-A: Bluesky scan ends +NDX-A ->> Archive: Sends scan.ascii and scan.nxs +TopCat ->> Archive: Collects scan.ascii and scan.nxs +note over PI: 5 months later +PI ->> TopCat: Show me my data +TopCat ->> PI: Provides access to runnumberB.nxs, runnumberC.nxs, scan.ascii, and scan.nxs +note over PI: 1 year later +PI ->> TopCat: Show me my data +TopCat ->> PI: Provides access to runnumberB.nxs, runnumberC.nxs, and scan.nxs +``` +
+ +## Present + +The following people have been involved in discussions leading up to this ADR: + +- Tom +- Chris M-S +- George +- Kathryn +- Jack H +- CK (Reflectometry) + +This document was additionally reviewed in a regular Thursday code-review slot by the whole IBEX team. + +## Decisions + +### File-writing location + +Bluesky should write data into the `c:\data\RB\bluesky_scans\` folder during a scan. +File naming itself will keep its current scheme (timestamped files). + +This location was chosen because it mirrors the archiving setup used by neutron cameras on IMAT. + +### Attributes & checksums + +Bluesky should mark files as read-only, using Windows file attributes, when it has finished writing them. This is so +that the archiving process can unambiguously tell whether a file has finished being written. It also reduces the +likelihood that a file is accidentally modified. + +Checksums should be generated either when the data is first written, or by the archiving process +just before it first copies or moves a file. + +We have agreed that checksums should be generated for bluesky data, as is already done for DAE data. These checksums are +useful for detecting data corruption, which might occur in transit, or in place on instrument computers or archive servers. +A number of checksumming approaches have been considered, and no approach has been chosen yet. The options discussed +are: +- **Use Windows alternate data streams**. This is how checksums are done in existing DAE `.raw` files. It has the +advantage that it is relatively simple to implement, but the disadvantage that alternate streams do not map nicely onto Linux file +systems. +- **Generate one checksum per file**, for example `file.txt` would also have an associated `file.sha1.txt` containing the +checksum. The advantage is that this is simple to implement and platform-agnostic. The disadvantage is that it doubles +the number of files visible in the archive area. 
+- **Generate a single checksum file** containing the checksums of all bluesky data, at a coarser granularity (for +example by RB number or by cycle). It is currently unclear exactly how this approach would be implemented, and at what +point these checksums would be moved to the archive. + +### Moving to the ISIS archive + +An automated cron task will look for read-only Bluesky output files, and their associated checksums, in `c:\data` at +regular short intervals (for example, 1 minute), and will move them to: +- The ISIS data archive, under `autoreduced/bluesky_scans`. The `autoreduced` folder already exists on the archive. +- The data cache disk on the instrument, under `c:\data\Export only\RB`. + +[Issue 26](https://github.com/ISISComputingGroup/ibex_bluesky_core/issues/26) will implement +machine-readable files, using a format such as `.hdf5` or `.nxs`. These files will sit alongside the existing +human-readable files; it is acknowledged that while machine-readable files are better from a data preservation and +archiving standpoint, we will need to retain the human-readable files to support quick browsing by scientists without +using special software. + +## Consequences + +- Bluesky output data will be stored in a location suitable for long-term, scientifically useful data. This includes +data integrity and availability concerns. +- Bluesky scans will no longer be reliant on a network location being available to run a scan. +- The initial location where bluesky writes data (`c:\data\`) will not be the same as its final location (the +`autoreduced` folder on the ISIS archive). This is also true for current DAE data, as generated by the ISISICP. 
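Reviewer sketch (not part of the patch): the read-only marking and the "one checksum per file" option from the "Attributes & checksums" section could look roughly like the following. This is a minimal illustration under stated assumptions, not the chosen design: the ADR has not yet picked a checksumming approach, and the helper names `sha1_of`, `finalise` and `ready_for_archive` are hypothetical and do not exist in `ibex_bluesky_core`. SHA-1 is used only because the ADR's example sidecar is named `file.sha1.txt`.

```python
import hashlib
import os
import stat
from pathlib import Path


def sha1_of(path: Path) -> str:
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def finalise(path: Path) -> Path:
    """Hypothetical helper: write a sidecar checksum, then mark both files read-only.

    Follows the ADR's naming example: file.txt -> file.sha1.txt.
    """
    sidecar = path.with_name(f"{path.stem}.sha1.txt")
    sidecar.write_text(sha1_of(path))
    for p in (path, sidecar):
        # On Windows, os.chmod can only toggle the read-only attribute,
        # which is the attribute the ADR proposes to use.
        os.chmod(p, stat.S_IREAD)
    return sidecar


def ready_for_archive(folder: Path) -> list[Path]:
    """Hypothetical helper: list files in a folder whose owner-write bit is cleared."""
    return [
        p
        for p in folder.iterdir()
        if p.is_file() and not (p.stat().st_mode & stat.S_IWRITE)
    ]
```

An archiving task could call `ready_for_archive` on each polling interval and only move the files it returns, so that files still being written are never picked up. One limitation of this naming scheme is that two outputs sharing a stem (for example `scan.txt` and `scan.nxs`) would map to the same sidecar name, so a real implementation would need a different naming rule or one of the other options listed in the ADR.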
diff --git a/doc/callbacks/plotting.md b/doc/callbacks/plotting.md index ad1fe443..a9cd7384 100644 --- a/doc/callbacks/plotting.md +++ b/doc/callbacks/plotting.md @@ -83,6 +83,7 @@ Due to an implementation detail of {py:obj}`matplotlib.pyplot.pcolormesh`, the plot will only appear once at least *two* rows of data have been collected. ::: +{#plot_png_saver} ## Saving plots to PNG files `ibex_bluesky_core` provides a {py:obj}`PlotPNGSaver` callback to save plots on a run stop to PNG files, which by saves them to the default output file location unless a filepath is explicitly given. diff --git a/doc/conf.py b/doc/conf.py index 259f3772..35d258ea 100644 --- a/doc/conf.py +++ b/doc/conf.py @@ -29,7 +29,7 @@ ("py:obj", r"^.*\.T.*_co$"), ] -myst_enable_extensions = ["dollarmath", "strikethrough", "colon_fence"] +myst_enable_extensions = ["dollarmath", "strikethrough", "colon_fence", "attrs_block"] suppress_warnings = ["myst.strikethrough"] extensions = [ @@ -43,7 +43,10 @@ "sphinx.ext.intersphinx", # Add links to source code in API docs "sphinx.ext.viewcode", + # Mermaid diagrams + "sphinxcontrib.mermaid", ] +mermaid_d3_zoom = True napoleon_google_docstring = True napoleon_numpy_docstring = False @@ -70,7 +73,7 @@ html_favicon = "favicon.svg" autoclass_content = "both" -myst_heading_anchors = 3 +myst_heading_anchors = 7 autodoc_preserve_defaults = True intersphinx_mapping = { diff --git a/pyproject.toml b/pyproject.toml index ef266b1b..14ab3ea6 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -59,6 +59,7 @@ doc = [ "sphinx_rtd_theme", "myst_parser", "sphinx-autobuild", + "sphinxcontrib-mermaid", ] dev = [ "ibex_bluesky_core[doc]",