Draft ADR 7 (Document decisions relating to file archiving) #215
| Original file line number | Diff line number | Diff line change | 
|---|---|---|
| @@ -0,0 +1,283 @@ | ||
| # 7. Output file archiving | ||
| 
     | 
||
| ## Status | ||
| 
     | 
||
| Draft | ||
| 
     | 
||
| ## Context | ||
| 
     | 
||
| Our bluesky implementation contains bluesky callbacks which produce scientist-facing output files, for example: | ||
| - [Human-readable scan result files](/callbacks/file_writing): {py:obj}`HumanReadableFileCallback <ibex_bluesky_core.callbacks.HumanReadableFileCallback>` | ||
| - [Fitting results](/fitting/livefit_logger): {py:obj}`LiveFitLogger <ibex_bluesky_core.callbacks.LiveFitLogger>` | ||
| - [Plot PNGs](#plot_png_saver): {py:obj}`PlotPNGSaver <ibex_bluesky_core.callbacks.PlotPNGSaver>` | ||
| 
     | 
||
| In addition, we have a [developer-facing callback for diagnostics](/callbacks/docs_logging_callback), | ||
| {py:obj}`DocLoggingCallback <ibex_bluesky_core.callbacks.DocLoggingCallback>`. | ||
| 
     | 
||
| The above callbacks produce files on disk in response to a bluesky scan. These files contain valuable data and so we | ||
| need to consider how these files are archived for the long term. This must align with the | ||
| [ISIS Data Policy](https://www.isis.stfc.ac.uk/pages/data-policy.aspx). We should make an attempt to align with | ||
| [FAIR principles](https://www.go-fair.org/fair-principles/). | ||
| 
     | 
||
| According to the definitions in the [ISIS Data Policy](https://www.isis.stfc.ac.uk/pages/data-policy.aspx), the data | ||
| generated by bluesky is generally either "facility generated reduced data" or "metadata". | ||
| 
     | 
||
| This ADR is concerned with the location in which these bluesky output files are stored, and the archiving infrastructure | ||
| which is therefore used to keep these files for the long term. | ||
| 
     | 
||
| --- | ||
| 
     | 
||
| At the time of writing this ADR, in June 2025, the scientist-facing files are being written to | ||
| ``` | ||
| ...\inst$\<instrument>\user\bluesky_scans\<rb_number>\ | ||
| ``` | ||
| 
     | 
||
| This location has some disadvantages: | ||
| - It is a network location, which means that a site network break will cause bluesky scans to fail to run | ||
| - It is not a location designed for long-term scientifically useful data - for example in terms of data integrity | ||
| - It is not necessarily accessible from downstream systems such as TopCat | ||
| 
     | 
||
| Therefore, we would like to define a different, more suitable, location into which bluesky output files can be written. | ||
| 
     | 
||
| --- | ||
| 
     | 
||
| Some representative use-cases are presented below, showing how data is expected to be used by scientists (click to | ||
| expand each use case): | ||
| 
     | 
||
| <details> | ||
| <summary>1 Bluesky scan, no neutron runs (e.g. scanning against a block)</summary> | ||
| 
     | 
||
| ```{mermaid} | ||
| sequenceDiagram | ||
| actor PI | ||
| participant NDX | ||
| participant Archive | ||
| participant TopCat | ||
| note over PI:Start of RBNumber experiment | ||
| PI ->> NDX: Start bluesky scan | ||
| note over PI: Time Passes | ||
| note over NDX: Bluesky scan ends | ||
| note over NDX: creates scan.ascii and scan.nxs | ||
| NDX ->> Archive: Sends scan.ascii and scan.nxs | ||
| TopCat ->> Archive: Collects scan.ascii and scan.nxs | ||
| note over PI: 5 months later | ||
| PI ->> TopCat: Show me my data | ||
| TopCat ->> PI: Provides access to scan.ascii and scan.nxs | ||
| note over PI: 1 year later | ||
| PI ->> TopCat: Show me my data | ||
| TopCat ->> PI: Provides access to scan.nxs | ||
| ``` | ||
| </details> | ||
| 
     | 
||
| <details> | ||
| <summary>1 Bluesky scan, aborted neutron runs</summary> | ||
| 
     | 
||
| ```{mermaid} | ||
| sequenceDiagram | ||
| actor PI | ||
| participant NDX | ||
| participant Archive | ||
| participant TopCat as Online Catalogue | ||
| note over PI:Start of RBNumber experiment | ||
| PI ->> NDX: Start bluesky scan | ||
| note over NDX: DAE run started by scan <br/> Time passes <br/> Required data gathered in scan documents <br/> Abort DAE run | ||
| note over NDX: DAE run started by scan <br/> Time passes <br/> Required data gathered in scan documents <br/> Abort DAE run | ||
| note over NDX: DAE run started by scan <br/> Time passes <br/> Required data gathered in scan documents <br/> Abort DAE run | ||
| note over NDX: Bluesky scan ends | ||
| note over NDX: creates scan.ascii and scan.nxs | ||
| NDX ->> Archive: Sends scan.ascii and scan.nxs | ||
| TopCat ->> Archive: Collects scan.ascii and scan.nxs | ||
| note over PI: 5 months later | ||
| PI ->> TopCat: Show me my data | ||
| TopCat ->> PI: Provides access to scan.ascii and scan.nxs | ||
| note over PI: 1 year later | ||
| PI ->> TopCat: Show me my data | ||
| TopCat ->> PI: Provides access to scan.nxs | ||
| ``` | ||
| </details> | ||
| 
     | 
||
| <details> | ||
| <summary>1 Bluesky scan, one neutron run</summary> | ||
| 
     | 
||
| ```{mermaid} | ||
| sequenceDiagram | ||
| actor PI | ||
| participant NDX | ||
| participant Archive | ||
| participant TopCat | ||
| note over PI:Start of RBNumber experiment | ||
| PI ->> NDX: Start bluesky scan | ||
| note over NDX: Bluesky scan starts DAE run | ||
| note over PI: Time Passes | ||
| note over NDX: Bluesky scan ends DAE run <br/> Bluesky scan ends | ||
| par | ||
| note over NDX: creates runnumber.nxs with DAE and SE data | ||
| and | ||
| note over NDX: creates scan.ascii and scan.nxs | ||
| end | ||
| NDX ->> Archive: Sends runnumber.nxs, scan.ascii, and scan.nxs | ||
| TopCat ->> Archive: Collects runnumber.nxs, scan.ascii, and scan.nxs | ||
| note over PI: 5 months later | ||
| PI ->> TopCat: Show me my data | ||
| TopCat ->> PI: Provides access to runnumber.nxs, scan.ascii, and scan.nxs | ||
| note over PI: 1 year later | ||
| PI ->> TopCat: Show me my data | ||
| TopCat ->> PI: Provides access to runnumber.nxs and scan.nxs | ||
| ``` | ||
| </details> | ||
| 
     | 
||
| <details> | ||
| <summary>1 Bluesky scan, N neutron runs</summary> | ||
| 
     | 
||
| ```{mermaid} | ||
| sequenceDiagram | ||
| actor PI | ||
| participant NDX | ||
| participant Archive | ||
| participant TopCat | ||
| note over PI:Start of RBNumber experiment | ||
| PI ->> NDX: Start bluesky scan | ||
| note over NDX: Bluesky scan starts DAE run | ||
| note over PI: Time Passes | ||
| note over NDX: Bluesky scan ends DAE run | ||
| note over NDX: creates runnumber.nxs with DAE and SE data | ||
| NDX ->> Archive: Sends runnumber.nxs | ||
| TopCat ->> Archive: Collects runnumber.nxs | ||
| note over PI: Time Passes | ||
| note over NDX: Bluesky scan starts DAE run | ||
| note over PI: Time Passes | ||
| note over NDX: Bluesky scan ends DAE run | ||
| note over NDX: creates runnumber+1.nxs with DAE and SE data | ||
| NDX ->> Archive: Sends runnumber+1.nxs | ||
| TopCat ->> Archive: Collects runnumber+1.nxs | ||
| note over NDX: Bluesky scan ends | ||
| NDX ->> Archive: Sends scan.ascii and scan.nxs | ||
| TopCat ->> Archive: Collects scan.ascii and scan.nxs | ||
| note over PI: 5 months later | ||
| PI ->> TopCat: Show me my data | ||
| TopCat ->> PI: Provides access to runnumber.nxs, runnumber+1.nxs, scan.ascii, and scan.nxs | ||
| note over PI: 1 year later | ||
| PI ->> TopCat: Show me my data | ||
| TopCat ->> PI: Provides access to runnumber.nxs, runnumber+1.nxs, and scan.nxs | ||
| ``` | ||
| </details> | ||
| 
     | 
||
| <details> | ||
| <summary>1 Bluesky scan, neutron/muon runs on multiple instruments</summary> | ||
| 
     | 
||
| ```{mermaid} | ||
| sequenceDiagram | ||
| actor PI | ||
| participant NDX-A | ||
| participant NDX-B | ||
| participant NDX-C | ||
| participant Archive | ||
| participant TopCat | ||
| note over PI:Start of RBNumber experiment | ||
| PI ->> NDX-A: Start bluesky scan | ||
| NDX-A ->> NDX-B: Start DAE run | ||
| NDX-A ->> NDX-C: Start DAE run | ||
| note over PI: Time Passes | ||
| NDX-B ->> NDX-A: Provides summary run data | ||
| NDX-C ->> NDX-A: Provides summary run data | ||
| NDX-A ->> NDX-B: End DAE run | ||
| note over NDX-B: creates runnumberB.nxs with DAE and SE data | ||
| NDX-B ->> Archive: Sends runnumberB.nxs | ||
| TopCat ->> Archive: Collects runnumberB.nxs | ||
| NDX-A ->> NDX-C: End DAE run | ||
| note over NDX-C: creates runnumberC.nxs with DAE and SE data | ||
| NDX-C ->> Archive: Sends runnumberC.nxs | ||
| TopCat ->> Archive: Collects runnumberC.nxs | ||
| note over NDX-A: Bluesky scan ends | ||
| NDX-A ->> Archive: Sends scan.ascii and scan.nxs | ||
| TopCat ->> Archive: Collects scan.ascii and scan.nxs | ||
| note over PI: 5 months later | ||
| PI ->> TopCat: Show me my data | ||
| TopCat ->> PI: Provides access to runnumberB.nxs, runnumberC.nxs, scan.ascii, and scan.nxs | ||
| note over PI: 1 year later | ||
| PI ->> TopCat: Show me my data | ||
| TopCat ->> PI: Provides access to runnumberB.nxs, runnumberC.nxs, and scan.nxs | ||
| ``` | ||
| </details> | ||
| 
     | 
||
| ## Present | ||
| 
     | 
||
| The following people have been involved in discussions leading up to this ADR: | ||
| 
     | 
||
| - Tom | ||
| - Chris M-S | ||
| - George | ||
| - Kathryn | ||
| - Jack H | ||
| - CK (Reflectometry) | ||
| 
     | 
||
| This document was additionally reviewed in a regular Thursday code-review slot by the whole IBEX team. | ||
| 
     | 
||
| ## Decisions | ||
| 
     | 
||
| ### File-writing location | ||
| 
     | 
||
| Bluesky should write data into the `c:\data\RB<rb_number>\bluesky_scans\` folder during a scan. | ||
| File naming itself will keep its current scheme (timestamped files). | ||
| 
     | 
||
| This location was chosen because it mirrors the archiving setup used by neutron cameras on IMAT. | ||
| 
     | 
||
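As a minimal sketch of the folder layout described above, the output directory could be derived from the RB number as follows. The function name `scan_output_dir` is hypothetical and is not part of `ibex_bluesky_core`; `PureWindowsPath` is used so the example behaves the same on any platform.

```python
from pathlib import PureWindowsPath

# Root of the instrument data area, as named in this ADR.
DATA_ROOT = PureWindowsPath("c:/data")


def scan_output_dir(rb_number: str) -> PureWindowsPath:
    """Return c:\\data\\RB<rb_number>\\bluesky_scans for the given RB number."""
    return DATA_ROOT / f"RB{rb_number}" / "bluesky_scans"
```

The timestamped file names themselves are unchanged by this ADR, so only the directory is computed here.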
| ### Attributes & checksums | ||
| 
     | 
||
| Bluesky should mark files as read-only, using Windows file attributes, when it has finished writing them. This is so | ||
| that the archiving process can unambiguously tell whether a file has finished being written. It also reduces the | ||
| likelihood that a file is accidentally modified. | ||
| 
     | 
||
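One possible way to set and test the read-only marker from Python is sketched below. The helper names `mark_read_only` and `is_finished` are illustrative assumptions, not existing API. On Windows, `os.chmod` only controls the read-only attribute, which is the flag this ADR relies on; on other platforms the same call clears the write permission bits.

```python
import os
import stat
from pathlib import Path


def mark_read_only(path: Path) -> None:
    """Clear all write bits once a file has been fully written."""
    mode = path.stat().st_mode
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


def is_finished(path: Path) -> bool:
    """Treat a file as finished when its owner-write bit is cleared."""
    return not (path.stat().st_mode & stat.S_IWUSR)
```

A writer would call `mark_read_only` as its last step, and the archiving process would skip any file for which `is_finished` is false.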
| Checksums should be generated, either at the point when the data is initially generated, or by the archiving process | ||
| just before it first copies or moves a file. | ||
| 
     | 
||
| We have agreed on the desire to generate checksums for data, which is already done for DAE data. These checksums are | ||
| useful to check for data corruption, which might occur in transit, or in-place on instrument computers or archive servers. | ||
| A number of checksumming approaches have been considered, and no approach has been chosen yet. The options discussed | ||
| are: | ||
| - **Use Windows alternate data streams**. This is how checksums are done in existing DAE `.raw` files. This approach has | ||
| the advantage of being relatively simple to implement, but the disadvantage that alternate data streams do not map | ||
| nicely onto Linux file systems. | ||
| - **Generate one checksum per file**, for example `file.txt` would also have an associated `file.sha1.txt` containing the | ||
| checksum. The advantage is that this is simple to implement and platform-agnostic. The disadvantage is that it doubles | ||
| the number of files visible in the archive area. | ||
| - **Generate a single checksum file** containing the checksums of all bluesky data, at a higher level of granularity (for | ||
| example by RB number or by cycle). It is currently unclear exactly how this approach would be implemented, and at what | ||
| point these checksums would be moved to the archive. | ||
| 
     | 
||
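To illustrate the one-checksum-per-file option only (no approach has been chosen), a sidecar file could be written and verified roughly as follows. The function names are hypothetical; SHA-1 and the `file.sha1.txt` naming are taken from the example in the list above.

```python
import hashlib
from pathlib import Path


def sha1_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are not read into memory at once."""
    digest = hashlib.sha1()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sidecar_for(path: Path) -> Path:
    """Map file.txt to file.sha1.txt, following the naming in this ADR."""
    return path.with_name(path.stem + ".sha1.txt")


def write_sidecar(path: Path) -> Path:
    """Write the checksum of `path` next to it and return the sidecar path."""
    sidecar = sidecar_for(path)
    sidecar.write_text(sha1_of(path) + "\n")
    return sidecar


def verify(path: Path) -> bool:
    """True if the file still matches the checksum stored in its sidecar."""
    return sidecar_for(path).read_text().strip() == sha1_of(path)
```

Note that with this naming, two files that differ only in extension (for example `scan.txt` and `scan.png`) would share a sidecar name, so a real implementation would need a naming rule that avoids that collision.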
| ### Moving to the ISIS archive | ||
| 
     | 
||
| An automated cron task will look for read-only Bluesky output files, and their associated checksums, in `c:\data` at | ||
| regular short intervals (for example, 1 minute), and will move them to: | ||
| 
if we have one checksum file per data file then it can move checksum file, if we have a single file listing checksums it would be a bit more complicated
I guess if it is done as an alternative file stream then the checksum "automatically" gets moved along with the file... although they're annoying to access from most programming languages and not very portable to e.g. linux later.
indeed,
There are quite a few scenarios where the out of band checksum streams are useful currently (as they don't get included in the checksum calculation - even by accident). They are also useful to ensure immutability between the instrument computer and the current archive - and I have on several occasions, run simple checks like these when the integrity of the file system is in question (e.g. disk errors on one archive server). Having a separate block of checksums in a different disk locality and directory structure might themselves be suspect (note: the check of the checksum/file stream is local and two-way - one validates the other). A separate .zip of checksums might be the way to go, but it does provide its own maintenance issues (however it does have a checksum itself and file changes can be made whilst updating checksums if necessary - so not unlike the second file stream).
I have re-worded the checksumming section to reflect uncertainty in the exact technical approach, while acknowledging that we've decided that we do want to generate checksums.
     | 
||
| - The ISIS data archive, under `autoreduced/bluesky_scans`. The `autoreduced` folder already exists on the archive. | ||
| - The data cache disk on the instrument, under `c:\data\Export only\RB<rb_number>\bluesky_scans`. | ||
| 
     | 
||
| Data on the cache disk, under `Export only`, is kept on the instrument for a short period (usually 24 hours), and then | ||
| deleted by existing processes. | ||
| 
     | 
||
| This is run as a cron task so that, if the network happens to be unavailable at the time when a scan ends, the copy | ||
| process will catch up when the network becomes available again. This cron task will only move files which sit within | ||
| a `bluesky_scans` folder, to prevent it from interfering with other non-bluesky files. | ||
| 
     | 
||
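A single pass of such a task might look like the sketch below. This is an interpretation under stated assumptions, not the agreed implementation: it copies each finished file to the archive and then moves the original into the export area, preserving the path relative to the data root in both places. The function names and the relative-path layout are hypothetical, and checksum handling is omitted.

```python
import shutil
import stat
from pathlib import Path


def is_read_only(path: Path) -> bool:
    """True if the owner-write bit is cleared (the read-only attribute on Windows)."""
    return not (path.stat().st_mode & stat.S_IWRITE)


def find_ready_files(data_root: Path, export_root: Path) -> list[Path]:
    """Read-only files beneath a 'bluesky_scans' folder, excluding the export area."""
    return sorted(
        p
        for p in data_root.rglob("*")
        if p.is_file()
        and export_root not in p.parents
        and "bluesky_scans" in p.relative_to(data_root).parts[:-1]
        and is_read_only(p)
    )


def archive_once(data_root: Path, archive_root: Path, export_root: Path) -> list[Path]:
    """Copy each ready file to the archive, then move it to the export area."""
    handled = []
    for src in find_ready_files(data_root, export_root):
        rel = src.relative_to(data_root)
        try:
            (archive_root / rel).parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, archive_root / rel)
        except OSError:
            continue  # e.g. network unavailable: leave the file for the next pass
        (export_root / rel).parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(export_root / rel))
        handled.append(rel)
    return handled
```

Because a file is only moved out of the data area after the archive copy succeeds, a failed pass leaves it in place and a later pass picks it up again, which is the catch-up behaviour described above.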
| Creating a new `bluesky_scans` folder alongside the existing `autoreduced` folder was considered, but was felt to be | ||
| unachievable - it would require too much work relative to using the existing `autoreduced` folder. | ||
| 
     | 
||
| ### File formats | ||
| 
     | 
||
| At present, our scan file output format is explicitly designed to be "human-readable" (and, in fact, the callback which | ||
| generates these files is explicitly called | ||
| {py:obj}`HumanReadableFileCallback <ibex_bluesky_core.callbacks.HumanReadableFileCallback>`). | ||
| 
     | 
||
| We have [issue 26](https://github.com/ISISComputingGroup/ibex_bluesky_core/issues/26) which will implement | ||
| machine-readable files, using a format such as `.hdf5` or `.nxs`. These files will sit alongside the existing | ||
| human-readable files; it is acknowledged that while machine-readable files are better from a data preservation and | ||
| archiving standpoint, we will need to retain the human-readable files to support quick browsing by scientists without | ||
| using special software. | ||
| 
     | 
||
| ## Consequences | ||
| 
     | 
||
| - Bluesky output data will be stored in a location suitable for long-term, scientifically useful, data. This includes | ||
| data integrity and availability concerns. | ||
| - Bluesky scans will no longer rely on a network location being available. | ||
| - The initial location where bluesky writes data (`c:\data\RB<rb_number>\bluesky_scans\`) will not be the same as its final location (the | ||
| `autoreduced` folder on the ISIS archive). This is also true for current DAE data, as generated by the ISISICP. | ||
checksums could be per file, or all files in a folder could be in a single checksum file. May need some further discussion depending on when we feel we archive from an RB number directory