USFM Parsing and Translation
Eli C. Lowry edited this page Sep 18, 2025 · 10 revisions
Serval seeks to mirror how Paratext parses USFM files, both by supporting USFM as documented and by accommodating some non-standard usages.
- Unique Identifiers are generated to reference a specific text segment in a scripture text and act as a primary anchor point.
- The reference is serialized in the following format: [verse reference]/[path element 1]/[path element 2]/...
- Verse references follow the standard USFM identification and naming.
- Non-verse paths are identified by [localized instance #]:[USFM tag]
- For example, the reference for the section header that occurs directly after MAT 1:1 would be represented as MAT 1:1/1:s.
- Positions are 1-based (the position 0 is used when a position is not specified or unknown).
- Some non-verse text segments can be nested in another element.
- For example, a table cell might be represented as MAT 1:1/1:tr/1:tc1.
- Introductory material that occurs at the beginning of a book before the first verse is referenced by the 1:0 verse reference.
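The serialized format described above can be illustrated with a minimal Python sketch. The names `PathElement`, `parse_ref`, and `format_ref` are hypothetical and chosen for illustration only; they are not Serval's or the machine library's API, and the sketch assumes the verse reference itself never contains a "/".

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PathElement:
    """One non-verse path element (hypothetical type for illustration)."""
    position: int  # 1-based; 0 means "not specified or unknown"
    tag: str       # USFM tag without the backslash, e.g. "s" or "tc1"


def parse_ref(text: str) -> Tuple[str, List[PathElement]]:
    """Split 'MAT 1:1/1:tr/1:tc1' into a verse reference and its path elements."""
    verse_ref, *raw_path = text.split("/")
    path = []
    for raw in raw_path:
        # Each element has the form [localized instance #]:[USFM tag].
        position, _, tag = raw.partition(":")
        path.append(PathElement(int(position), tag))
    return verse_ref.strip(), path


def format_ref(verse_ref: str, path: List[PathElement]) -> str:
    """Inverse of parse_ref: join the verse reference and path with '/'."""
    return "/".join([verse_ref] + [f"{p.position}:{p.tag}" for p in path])
```

For instance, `parse_ref("MAT 1:1/1:tr/1:tc1")` yields the verse reference "MAT 1:1" with two path elements (the first table row, then the first tc1 cell inside it), and `format_ref` turns that pair back into the original string.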
When projects are read in, they are converted to the original versification. The source and target verse ranges are then merged with each other so that all translation and USFM output operate on the union of all verse ranges.
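As a rough sketch of the merging step, the union can be pictured as a set operation over verse keys. This is a simplification under stated assumptions: the `VerseKey` tuple and `merge_verse_refs` function are hypothetical, the inputs are assumed to be already mapped to the original versification, and verse ranges are represented as individual verses rather than as ranges.

```python
from typing import Iterable, List, Tuple

# Hypothetical key: (book, chapter, verse), already in original versification.
VerseKey = Tuple[str, int, int]


def merge_verse_refs(source: Iterable[VerseKey], target: Iterable[VerseKey]) -> List[VerseKey]:
    """Return the union of source and target verses, without duplicates.

    Sorting is by (book, chapter, verse); book order here is alphabetical,
    not canonical, which is sufficient for a single-book illustration.
    """
    return sorted(set(source) | set(target))


source = [("MAT", 1, 1), ("MAT", 1, 2)]
target = [("MAT", 1, 2), ("MAT", 1, 3)]
merged = merge_verse_refs(source, target)
```

Here `merged` contains MAT 1:1, 1:2, and 1:3, so a verse present in only one of the two projects still gets a slot in the output.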
The USFM standard has multiple types of markers including: paragraph, character and note (among others). These marker types do not always map to intended usage, especially when looking at translations. This is how Serval (and the underlying machine library) interprets these markers:
- ID, Chapter, Verse
- Examples: \id \c \v
- These markers signify the change of a VerseRef, the basic biblical structure. They form the frame that all other markers sit within.
- Non-Verse
- Examples: \s1, \mt2, \p, \tr, \th1, \tc2, \esb, \esbe, etc.
- These are section headers, introductory material, tables, sidebars, etc. They often sit outside of verses and are each translated separately. Paragraphs and tables, if they occur in a verse, are treated as a paragraph instead of a non-verse marker.
- Note that tables and paragraphs will be stripped out when inside of a verse (segment) or a footnote. Paragraph and table formatting will otherwise be preserved.
- Reference
- Examples: \r, \rem
- These markers are not translated as they will only contain references or remarks.
- Paragraph
- Examples: \q1, \q2, \p, etc.
- These markers signify the break-up of verse text, primarily as paragraphs and as poetry formatting. For AI drafting, the paragraph markers are removed and the entire verse's text is translated in one piece.
- Embed
- Examples: \f, \fe, \x, \fm, \fig
- Each of these (note, cross reference, figure) is treated as a section "embedded" into a verse or non-verse. The entire embed is either removed or passed through (according to configuration), with only the NoteText being AI translated. If it is passed through, it is moved to the end of the verse or non-verse text that contains it.
- EmbedPart
- Examples: any marker that begins with an f, x, or z.
- These occur within an Embed and demarcate a new part of the Embed.
- NoteText
- Examples: \ft
- All NoteText and EmbedParts (\fr, \xta, etc.) will be passed through and left untranslated.
- Nested
- Examples: any EmbedPart marker within a NoteText that begins with a "+", such as +xt
- This is treated as an embed "nested" within another embed.
- Style
- Examples: \nd, etc. - any character marker that is not an EmbedPart or Embed
- Text within a style marker is treated as part of the overall text, such as the word LORD (rather than distinct content embedded into a verse, such as a cross reference). The markers are either preserved and moved to the end of the text section or removed.
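The categories above can be summarized as a small classification sketch. It is illustrative only: `MarkerKind` and `classify` are hypothetical names, the marker sets contain just the examples listed above (the real sets are larger), and the handling of context through the `in_verse` and `in_embed` flags is an assumption about how the rules combine, not the machine library's implementation.

```python
from enum import Enum


class MarkerKind(Enum):
    STRUCTURE = "ID, Chapter, Verse"
    NON_VERSE = "Non-Verse"
    REFERENCE = "Reference"
    PARAGRAPH = "Paragraph"
    EMBED = "Embed"
    EMBED_PART = "EmbedPart"
    NOTE_TEXT = "NoteText"
    NESTED = "Nested"
    STYLE = "Style"


# Only the examples from the list above; assumed, not exhaustive.
STRUCTURE = {"id", "c", "v"}
REFERENCE = {"r", "rem"}
EMBED = {"f", "fe", "x", "fm", "fig"}
NOTE_TEXT = {"ft"}
POETRY = {"q1", "q2"}
PARAGRAPH_OR_TABLE = {"p", "tr", "th1", "tc2"}
NON_VERSE_ONLY = {"s1", "mt2", "esb", "esbe"}


def classify(marker: str, in_verse: bool = False, in_embed: bool = False) -> MarkerKind:
    """Classify a USFM marker such as '\\p' or '\\+xt' given its context."""
    tag = marker.lstrip("\\").rstrip("*")  # drop the backslash and any closing '*'
    if tag in STRUCTURE:
        return MarkerKind.STRUCTURE
    if tag in REFERENCE:
        return MarkerKind.REFERENCE
    if in_embed:
        if tag.startswith("+"):
            return MarkerKind.NESTED
        if tag in NOTE_TEXT:
            return MarkerKind.NOTE_TEXT
        if tag.startswith(("f", "x", "z")):
            return MarkerKind.EMBED_PART
    if tag in EMBED:
        return MarkerKind.EMBED
    if tag in POETRY:
        return MarkerKind.PARAGRAPH
    if tag in PARAGRAPH_OR_TABLE:
        # Paragraphs and tables inside a verse are treated as paragraphs.
        return MarkerKind.PARAGRAPH if in_verse else MarkerKind.NON_VERSE
    if tag in NON_VERSE_ONLY:
        return MarkerKind.NON_VERSE
    return MarkerKind.STYLE  # any other character marker, e.g. \nd
```

With these assumptions, `classify("\\p")` gives Non-Verse while `classify("\\p", in_verse=True)` gives Paragraph, and `classify("\\fr", in_embed=True)` gives EmbedPart. Making the context explicit as flags keeps the sketch short; a real parser would track that state while walking the token stream.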