Draft ADR 7 (Document decisions relating to file archiving) #215
Conversation
alternative file stream, comparable to what is done for existing DAE data. These checksums
can be used to check for data corruption as the files are moved to the archive, and later replicated between the
archive servers.
Checksums could be per file, or all files in a folder could be listed in a single checksum file. This may need some further discussion, depending on when we decide to archive from an RB number directory.
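As a rough illustration of the two layouts mentioned above, here is a minimal sketch. It is not part of the ADR: the choice of SHA-256, the `.sha256` suffix, the `checksums.sha256` manifest name and the function names are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_sidecar(data_file: Path) -> Path:
    """Option 1: one checksum file per data file (hypothetical '.sha256' suffix)."""
    sidecar = data_file.with_name(data_file.name + ".sha256")
    sidecar.write_text(f"{sha256_of(data_file)}  {data_file.name}\n")
    return sidecar


def write_manifest(folder: Path, manifest_name: str = "checksums.sha256") -> Path:
    """Option 2: a single file listing a checksum for every data file in a folder."""
    manifest = folder / manifest_name
    # Skip existing checksum files (sidecars and the manifest itself).
    lines = [
        f"{sha256_of(p)}  {p.name}\n"
        for p in sorted(folder.iterdir())
        if p.is_file() and p.suffix != ".sha256"
    ]
    manifest.write_text("".join(lines))
    return manifest
```

The per-file layout keeps each checksum next to the file it describes. The manifest layout only makes sense once the folder's contents are final, which is why the timing of archiving from an RB number directory matters.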
### Moving to the ISIS archive

An automated cron task will look for read-only Bluesky output files, and their associated checksums, in `c:\data` at
regular short intervals (for example, 1 minute), and will move them to:
If we have one checksum file per data file, then the task can simply move the checksum file along with it. If we have a single file listing all checksums, it would be a bit more complicated.
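A minimal sketch of the one-checksum-file-per-data-file case, assuming SHA-256 and a hypothetical `<name>.sha256` sidecar whose first whitespace-separated token is the hex digest. The function names and the copy-verify-delete order are illustrative choices, not something the ADR specifies.

```python
import hashlib
import shutil
import stat
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def remove_even_if_read_only(path: Path) -> None:
    """Clear the read-only bit (needed on Windows) and then delete the file."""
    path.chmod(stat.S_IREAD | stat.S_IWRITE)
    path.unlink()


def move_with_sidecar(data_file: Path, dest_dir: Path) -> Path:
    """Copy a data file and its sidecar checksum, verify the copy, then delete the originals."""
    sidecar = data_file.with_name(data_file.name + ".sha256")
    expected = sidecar.read_text().split()[0]

    dest_dir.mkdir(parents=True, exist_ok=True)
    dest_file = dest_dir / data_file.name
    shutil.copy2(data_file, dest_file)

    # Verify the copy against the recorded checksum before deleting anything.
    if sha256_of(dest_file) != expected:
        remove_even_if_read_only(dest_file)
        raise ValueError(f"checksum mismatch for {data_file.name}")

    shutil.copy2(sidecar, dest_dir / sidecar.name)
    remove_even_if_read_only(sidecar)
    remove_even_if_read_only(data_file)
    return dest_file
```

With a single file listing checksums, the same task would also have to split or rewrite that list as individual files are moved, which is the extra complication noted above.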
I guess if it is done as an alternative file stream then the checksum "automatically" gets moved along with the file, although such streams are annoying to access from most programming languages and not very portable to e.g. Linux later.
Indeed, Linux is the main reason I'd try to avoid them.
There are quite a few scenarios where the out-of-band checksum streams are currently useful, as they don't get included in the checksum calculation, even by accident. They are also useful for ensuring immutability between the instrument computer and the current archive, and I have on several occasions run simple checks like these when the integrity of the file system was in question (e.g. disk errors on one archive server). A separate block of checksums in a different disk locality and directory structure might itself be suspect (note: the check of the checksum/file stream is local and two-way, so one validates the other). A separate .zip of checksums might be the way to go, but it brings its own maintenance issues (however, it does have a checksum itself, and file changes can be made whilst updating checksums if necessary, so it is not unlike the second file stream).
I have re-worded the checksumming section to reflect uncertainty in the exact technical approach, while acknowledging that we've decided that we do want to generate checksums.
Description of work
An initial pass at an ADR describing file archiving decisions.
Once we are happy with this internally, I will merge it as a draft ADR and present it to scientists at a regular scans library catchup for feedback, and then once any feedback is incorporated it can become an active ADR.
Ticket
#214
Labels
Add appropriate label(s) to this PR
Acceptance criteria
Documentation
See PR.