327 changes: 323 additions & 4 deletions Linking.md
@@ -63,10 +63,6 @@ The "reloc." custom sections must come after the
["linking"](#linking-metadata-section) custom section in order to validate
relocation indices.

Any LEB128-encoded values should be maximally padded so that they can be
rewritten without affecting the position of any other bytes. For instance, the
function index 3 should be encoded as `0x83 0x80 0x80 0x80 0x00`.

Relocations contain the following fields:

| Field | Type | Description |
@@ -181,6 +177,51 @@ relocations applied to the CODE section, a relocation cannot straddle two
functions, and for the DATA section relocations must lie within a data element's
body.

### Additional validation rules

When performing validation on object files, care must be taken to ensure that
meaningless relocations are not present in the binary.

**Note**: A linker is not required to perform validation on its input object
files.

All LEB128-encoded values that are to be relocated must be maximally padded so
that they can be rewritten without affecting the position of any other bytes.
For instance, the function index 3 must be encoded as `0x83 0x80 0x80 0x80 0x00`.
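
The padding rule can be illustrated with a short Python sketch. The function name `encode_padded_leb128` and its parameters are hypothetical and are not part of this document or of any tool; the sketch only assumes that a 32-bit value is padded to 5 bytes and a 64-bit value to 10 bytes.

```python
def encode_padded_leb128(value: int, bits: int = 32, signed: bool = False) -> bytes:
    """Encode `value` as LEB128 padded to the maximum width for `bits`.

    Hypothetical helper for illustration only: 32-bit values take 5 bytes,
    64-bit values take 10 bytes.
    """
    width = (bits + 6) // 7  # 32 -> 5 bytes, 64 -> 10 bytes
    if signed:
        lo, hi = -(1 << (bits - 1)), 1 << (bits - 1)
    else:
        lo, hi = 0, 1 << bits
    if not lo <= value < hi:
        raise ValueError("value out of range for the requested encoding")
    out = bytearray()
    for i in range(width):
        # Python's >> is an arithmetic shift, so negative values sign-extend.
        byte = (value >> (7 * i)) & 0x7F
        if i < width - 1:
            byte |= 0x80  # continuation bit on every byte except the last
        out.append(byte)
    return bytes(out)


# Function index 3, as in the example above.
print(encode_padded_leb128(3).hex(" "))
```

Because the width is fixed, a linker can overwrite the field in place with any other value of the same type without moving the bytes that follow it.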

When relocations occur in the CODE section, only the following relocations may
occur:

| relocation type | condition on the value at the relocation offset |
|---------------------------------|------------------------------------------|
| `R_WASM_FUNCTION_INDEX_LEB` | must represent a `funcidx` |
| `R_WASM_TYPE_INDEX_LEB` | must represent a `typeidx` |
| `R_WASM_GLOBAL_INDEX_LEB` | must represent a `globalidx` |
| `R_WASM_EVENT_INDEX_LEB` | must represent a `tagidx` |
| `R_WASM_TABLE_NUMBER_LEB` | must represent a `tableidx` |
| `R_WASM_TABLE_INDEX_SLEB` | must represent an operand of `i32.const` |
| `R_WASM_TABLE_INDEX_SLEB64` | must represent an operand of `i64.const` |
| `R_WASM_MEMORY_ADDR_SLEB` | must represent an operand of `i32.const` |
| `R_WASM_MEMORY_ADDR_REL_SLEB` | must represent an operand of `i32.const` |
| `R_WASM_MEMORY_ADDR_TLS_SLEB` | must represent an operand of `i32.const` |
| `R_WASM_MEMORY_ADDR_SLEB64` | must represent an operand of `i64.const` |
| `R_WASM_MEMORY_ADDR_REL_SLEB64` | must represent an operand of `i64.const` |
| `R_WASM_MEMORY_ADDR_TLS_SLEB64` | must represent an operand of `i64.const` |
| `R_WASM_MEMORY_ADDR_LEB` | must represent the `offset` part of `memarg` where `memidx` references a 32-bit memory |
| `R_WASM_MEMORY_ADDR_LEB64` | must represent the `offset` part of `memarg` where `memidx` references a 64-bit memory |
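
A validator could hold the table above as a simple lookup. The following Python sketch is only one possible shape: the operand-kind strings (`"funcidx"`, `"i32.const"`, `"memarg.offset/32"`, and so on) and the function name are invented labels for this illustration, and it assumes the caller has already decoded which kind of operand sits at the relocation offset.

```python
# Relocation types permitted in the CODE section, mapped to the kind of
# operand the relocated bytes must represent. The string labels are
# hypothetical names chosen for this sketch.
CODE_RELOC_CONDITIONS = {
    "R_WASM_FUNCTION_INDEX_LEB": "funcidx",
    "R_WASM_TYPE_INDEX_LEB": "typeidx",
    "R_WASM_GLOBAL_INDEX_LEB": "globalidx",
    "R_WASM_EVENT_INDEX_LEB": "tagidx",
    "R_WASM_TABLE_NUMBER_LEB": "tableidx",
    "R_WASM_TABLE_INDEX_SLEB": "i32.const",
    "R_WASM_TABLE_INDEX_SLEB64": "i64.const",
    "R_WASM_MEMORY_ADDR_SLEB": "i32.const",
    "R_WASM_MEMORY_ADDR_REL_SLEB": "i32.const",
    "R_WASM_MEMORY_ADDR_TLS_SLEB": "i32.const",
    "R_WASM_MEMORY_ADDR_SLEB64": "i64.const",
    "R_WASM_MEMORY_ADDR_REL_SLEB64": "i64.const",
    "R_WASM_MEMORY_ADDR_TLS_SLEB64": "i64.const",
    "R_WASM_MEMORY_ADDR_LEB": "memarg.offset/32",
    "R_WASM_MEMORY_ADDR_LEB64": "memarg.offset/64",
}


def code_reloc_is_valid(reloc_type: str, operand_kind: str) -> bool:
    """True if `reloc_type` may target an operand of `operand_kind` in CODE."""
    return CODE_RELOC_CONDITIONS.get(reloc_type) == operand_kind
```

Any relocation type missing from the mapping is rejected, which matches the "only the following relocations may occur" wording.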

For `R_WASM_*_OFFSET_I*` relocations, the following conditions must hold for
the addend:

- If `index` references the CODE section, the addend must represent the first
byte of an instruction, or the byte after the last instruction.
- If `index` references the DATA section, the addend must represent a valid
offset into a data segment's data area.
- If `index` references the custom section, the addend must represent a valid
offset into that custom section's data area.
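
As a rough sketch, the addend checks might look like the Python below. Both function names are hypothetical, the caller is assumed to supply instruction start offsets in the same coordinate space as the addend, and "valid offset" is interpreted here as `0 <= addend < size`, which is an assumption the text above does not spell out.

```python
def code_addend_is_valid(addend: int, instruction_starts: set, code_end: int) -> bool:
    """CODE rule: the addend is an instruction start or one past the last one.

    `instruction_starts` and `code_end` are assumed to be precomputed by a
    decoder and expressed relative to the same base as the addend.
    """
    return addend == code_end or addend in instruction_starts


def area_addend_is_valid(addend: int, area_size: int) -> bool:
    """DATA / custom section rule under the assumption 0 <= addend < size."""
    return 0 <= addend < area_size


# A body with instructions starting at offsets 0, 1 and 3, ending at offset 4.
starts = {0, 1, 3}
print(code_addend_is_valid(3, starts, 4), code_addend_is_valid(2, starts, 4))
```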

All other relocations are considered invalid for the purposes of validation.

## Linking Metadata Section

A linking metadata section is a user-defined section with the name
@@ -322,6 +363,8 @@ For section symbols:
| ------------ | -------------- | ------------------------------------------- |
| section | `varuint32` | the index of the target section |

Section symbols may only reference the CODE section, the DATA section, or custom sections.
Member

Hm.. I'm not sure about this actually.

When you asked about documenting a limitation on sections I thought you were referring to the fact that relocations can only apply to certain section types.

"Which sections can have relocations within them" is a different concept to "which sections can be referred to by WASM_SYMBOL_TYPE_SECTION symbols".

I believe that WASM_SYMBOL_TYPE_SECTION symbols are only used by debug info, but my memory is a little foggy here.

Looking at the code it actually looks like these symbols might only be valid for custom sections: https://github.com/llvm/llvm-project/blob/38372df53fd7f6c8bd8c46bf720b676e12f481d9/lld/wasm/InputFiles.cpp#L697-L705.

Which would make sense if these are only used in debug info since all debug info is stored in custom sections.

Contributor Author

@feedab1e feedab1e Oct 20, 2025

I don't believe it can work like that, though, since R_WASM_SECTION_OFFSET_I32 relocations reference a section symbol, and for that to work as DWARF code addresses, the symbol that relocation references would have to reference the CODE section, while the relocation itself would have to target a place in the debug section.

Contributor Author

No, actually, that can absolutely work if WASM_*_OFFSET_* relocations actually resolve to offsets from the file start, like DWARF actually expects. I do think the current spec is not very clear on this and someone from LLVM should take a look at what actually happens there and adjust https://github.com/WebAssembly/tool-conventions/blob/main/Linking.md#processing-relocations accordingly.

Contributor Author

Updated the text format to reflect that section offset relocations may only reference custom sections for now.


The current set of valid flags for symbols are:

- `1 / WASM_SYM_BINDING_WEAK` - Indicating that this is a weak symbol. When
@@ -734,3 +777,279 @@ necessary for referencing such segments (e.g. in `data.drop` or `memory.init`
instruction) do not yet exist.
- There is currently no support for table element segments, either active or
passive.

# Text format

The text format for linking metadata is intended for WAT consumers that wish to
emit relocatable object files, and for WAT producers that wish to emit
human-readable relocation metadata for later creation of a relocatable object
file.

## Relocations

Relocations are represented as WebAssembly annotations of the form
Member

Same here? Should we just use Wasm?

```wat
(@reloc <format> <method> <modifier> <symbol-reference> <addend>)
```

Collaborator

Personally for syntax like this I like to try to avoid "extra layers of indirection" of a sort. Here one layer of indirection is the set of relocations themselves (e.g. R_WASM_*) and another layer of indirection is the syntax here of `<format> <method> <modifier>`. What would you think about dropping all the extra syntax and using the relocation names themselves? For example R_WASM_FUNCTION_INDEX_LEB would correspond to (@reloc function-index-leb ..). That would make it immediately clear which exact relocation this corresponds to and would avoid the need to consult the tables here and translate back/forth.

Contributor Author

@feedab1e feedab1e Oct 20, 2025

Yeah, I specifically decided against that since then it wouldn't be possible to abbreviate the relocation via that predefinition mechanism (so, call $foo (@reloc $via_other_sym) wouldn't be possible, would have to write call $foo (@reloc function-index-leb $via_other_sym)), and one would also have to remember which format goes with which instruction.

Contributor Author

There is also an issue of R_WASM_TABLE_INDEX_* vs R_WASM_TABLE_NUMBER_*, where clearly the choice of R_WASM_TABLE_INDEX_* for an index into a table was wrong and clashed with the relocation type for an index of a table, and I don't want to cement that mistake even further into the format.

Rectifying that error at the source would require me to patch LLVM in sync with this change, so like the other issue with WASM_SYM_NO_STRIP/WASM_SEG_FLAG_RETAIN/retain that @sbc100 raised in #258 (comment), I think this should be split into a different PR and handled later.

Member

Do you have a better idea for what to call R_WASM_TABLE_INDEX_*?

Member

@sbc100 sbc100 Oct 21, 2025

(By the way the history here is that prior to multi-table there was only one table, so R_WASM_TABLE_INDEX_ was unambiguous when it was created. I still think it's reasonable to think "what is the table index for a given function", meaning the index in the table at which that function lives, but I'm open to better ideas).

Member

Maybe R_WASM_TABLE_OFFSET_?

Contributor Author

I'd lean towards something like R_WASM_FUNC_TABLE_ELEM_*, expecting that some time in the future when (if) relocations for table elements arrive, there will be a more general R_WASM_TABLE_ELEM_* for tables other than the function table.

Contributor Author

(coincidentally this would align well with functable relocation method name)

Collaborator

The idea of just (@reloc $foo) sounds pretty nice to me yeah, although being able to support that would require contextual knowledge of what else is being parsed at that time. How about supporting (@reloc function-index-leb $foo) and then optionally supporting (@reloc $foo) in a few select locations as a "sugar" of sorts? That way there's the benefit of concise relocations as well as being able to disambiguate things if necessary?

Collaborator

Also, in terms of renaming things, in theory that can be done at any time right? The binary format of preexisting relocations isn't going to change. While work is "created" in the sense that LLVM should eventually update that's no breaking change in the sense that it's impossible to rename things, right?

Given that, I personally like @feedab1e's idea of repurposing R_WASM_TABLE_INDEX_LEB for "the index of the table itself" and using R_WASM_FUNC_TABLE_ELEM_I32 for "the address of this thing in a table"

- `format` determines the resulting format of a relocation

|`<format>`| corresponding relocation constants | interpretation |
|----------|------------------------------------|---------------------|
|`i32` | `R_WASM_*_I32` | 4-byte [uint32] |
|`i64` | `R_WASM_*_I64` | 8-byte [uint64] |
|`leb` | `R_WASM_*_LEB` | 5-byte [varuint32] |
|`sleb` | `R_WASM_*_SLEB` | 5-byte [varint32] |
|`leb64` | `R_WASM_*_LEB64` | 10-byte [varuint64] |
|`sleb64` | `R_WASM_*_SLEB64` | 10-byte [varint64] |

- `method` describes the type of relocation, so what kind of symbol we are relocating against and how to interpret that symbol.

| `<method>` | symbol kind | corresponding relocation constants | interpretation |
|-------------|-------------|------------------------------------|-----------------------------------|
| `tag` | event* | `R_WASM_EVENT_INDEX_*` | Final WebAssembly event index |
| `table` | table* | `R_WASM_TABLE_NUMBER_*` | Final WebAssembly table index (index of a table, not into one) |
| `global` | global* | `R_WASM_GLOBAL_INDEX_*` | Final WebAssembly global index |
| `func` | function* | `R_WASM_FUNCTION_INDEX_*` | Final WebAssembly function index |
| `functable` | function | `R_WASM_TABLE_INDEX_*` | Index into the dynamic function table, used for taking address of functions |
| `codeseg` | function | `R_WASM_FUNCTION_OFFSET` | Offset into the function body from the start of the function |
| `codesec` | function | `R_WASM_SECTION_OFFSET` | Offset into the code section |
| `datasec` | data | `R_WASM_SECTION_OFFSET` | Offset into the data section |
| `customsec` | N/A | `R_WASM_SECTION_OFFSET` | Offset into a custom section |
| `data` | data | `R_WASM_MEMORY_ADDR_*` | WebAssembly linear memory address |

Symbol kinds marked with `*` are considered *primary*.

- `modifier` describes the additional attributes that a relocation might have.

| `<modifier>` | corresponding relocation constants | interpretation |
|--------------|---------------------------------------|-------------------|
| nothing | nothing | Normal relocation |
| `pic` | `R_WASM_*_LOCREL_*`, `R_WASM_*_REL_*` | Address relative to `env.__memory_base` or `env.__table_base`, used for dynamic linking |
| `tls` | `R_WASM_*_TLS*` | Address relative to `env.__tls_base`, used for thread-local storage |
Member

Is there any reason not to reflect the entire list of relocation types like they are listed in the binary format and/or in llvm: https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/BinaryFormat/WasmRelocs.def

i.e. why create this new concept of a base type + a modifier that doesn't exist elsewhere yet? Why not just use type=R_WASM_EVENT_INDEX_XX in the text format? This would also make the format redundant since it's also part of the name of the relocation type.

Maybe this new method/format/modifier concept could be added more globally later once the initial version of the text format is added? But for v1 it seems like it would make sense to simply mirror the existing binary format enum.

Contributor Author

This was covered extensively in #258 (comment), and @alexcrichton expressed support for it here, but in short, that way there wouldn't be an option to elide parts of the relocation annotation (i.e. defaulting and predefining wouldn't work), so all relocations would be incredibly verbose (for example, call $foo would become call $foo (@reloc type=R_WASM_FUNC_INDEX_LEB) for no reason).

Member

@sbc100 sbc100 Oct 27, 2025

I think I don't see how specifying the full relocation type (e.g using R_WASM_FUNC_INDEX_LEB when a reloc is present) would prevent the whole relocation from being implicit / elided.

These seem like two orthogonal decisions, but I get that I must be missing something:

  1. Do we implicitly generate reloc entries for things like call instruction?
  2. When reloc is specified explicitly do we use the existing enum, or something new/different

I'm also not sure that reducing verbosity needs to be the highest priority since the plan is for this format to be mostly machine read and machine written, right?

Contributor Author

> I think I don't see how specifying the full relocation type (e.g using R_WASM_FUNC_INDEX_LEB when a reloc is present) would prevent the whole relocation from being implicit / elided.
>
> These seem like two orthogonal decisions, but I get that I must be missing something:
>
>   1. Do we implicitly generate reloc entries for things like call instruction?
>   2. When reloc is specified explicitly do we use the existing enum, or something new/different

Apart from fully elidable relocations, other types of relocations exist, like in memory (i32.load (@reloc $mem_sym)) and in constants (i32.const (@reloc data $mem_sym)) where a relocation is not entirely elided but is greatly abbreviated from the complete relocation type. Apart from that, specifying the complete relocation type would expose relocation type names (which are for now an implementation detail of LLVM) to the wider text format.

> I'm also not sure that reducing verbosity needs to be the highest priority since the plan is for this format to be mostly machine read and machine written, right?

Well, it needs to be human-readable, too, since it's a text format and humans are expected to read that too, like they usually read assembly, and likewise human-writable.

Member

> Apart from that, specifying the complete relocation type would expose relocation type names (which are for now an implementation detail of LLVM) to the wider text format.

The relocation type names are not intended to be LLVM specific. The list of 20 relocation types, along with their full names, are listed above in this very document.

This is designed to mirror the ELF relocation types that are defined in the ELF header and not specific to either LLVM or GCC but are used in both places.

I think it might be a good idea to reflect this precisely in text form, so we can avoid having two different ways to specify things.

Collaborator

To me it's more of a preference of per-relocation/symbol annotations over a top-level annotation. I agree that either system would work alright, and the main concern that I can think of is printing a wasm object file (binary-to-text) where it feels more natural to print @reloc or @sym-per-entry as opposed to verifying that everything has a relocation or symbol and then not printing anything. For a pure text-to-binary use case I'd agree that the top-level annotation is nicer to have.

I'd naively expect that with @sym and @reloc would be frequent enough in a file that it wouldn't take much of a visual scan myself, but I'm not personally too concerned about that myself.

I should also be clear that I'm happy to be overruled here. IMO text-format design is something that's worth bikeshedding but not endlessly, so I wouldn't want to hold up anything on my own behalf too much

Contributor Author

I’m still not sure there’s much value in either making all symbols/relocs explicit or having a toplevel annotation, since neither actually guarantee that an object file is produced. So the only effect of such an annotation would be to artificially restrict files that can be relocatable. Currently, any valid WASM module as well as any valid WAT module can be transformed into an object file and back by merely attaching or stripping the linking information. It's a nice property and I would like to keep it, especially since there is nothing stopping us from it.

If we really really want something (advisory) that says which features are intended to be used with which WAT files, then perhaps we should have a general annotation that applies to all features, not just linking.

That annotation, if decided upon, could in principle augment assembler flags same way as -DMACRO/#define MACRO works in C and C++ compilers

Contributor Author

There's also a separate concern of how explicit relocation annotations will fit in with other metadata that isn't aware of linking, like code metadata. Currently it's easy, code metadata info needs a funcidx, so it grabs the primary symbol for that funcidx, and places a relocation of that symbol, no annotations required.

If explicit relocations are mandated, then the standardized syntax for code metadata would simply no longer work, just like every annotation-based metadata feature that isn't explicitly aware of relocations.

Collaborator

This might come down to how I'm approaching this from a different perspective (I think) than you might be. I'm thinking about this from a tooling perspective of a text-to-binary or binary-to-text transform. I suspect you're coming from the direction of producing-the-text and/or reading-the-text (please correct me though!). I'm concerned with "the binary should have one canonical text form and vice versa" and I'm not interested in auto-injecting linking/reloc.* sections myself (e.g. via some sort of CLI flag and/or configuration outside of the text file).

Not sure if that helps, but figured I'd write down how my perspective may be a bit different.

Contributor Author

> This might come down to how I'm approaching this from a different perspective (I think) than you might be. I'm thinking about this from a tooling perspective of a text-to-binary or binary-to-text transform. I suspect you're coming from the direction of producing-the-text and/or reading-the-text (please correct me though!). I'm concerned with "the binary should have one canonical text form and vice versa" and I'm not interested in auto-injecting linking/reloc.* sections myself (e.g. via some sort of CLI flag and/or configuration outside of the text file).
>
> Not sure if that helps, but figured I'd write down how my perspective may be a bit different.

Well, from the transform PoV, mandating the annotations would actually break that property, since there would be binary object files that can no longer be converted to text (#258 (comment)), so our hands are tied, and we have to make this work with (mandatory) elision, or we have to redesign every other annotation to accommodate linking. Elision does not break the round-trip requirement either, since elided relocations have a unique and mandatory representation in a valid binary object file, and vice versa


- `addend` describes the additional components of a relocation.

| `<addend>` | interpretation | condition |
|--------------|----------------------|-----------------------------------------------|
| nothing | Zero addend | always |
| `+<integer>` | Positive byte offset | `method` allows addend |
| `-<integer>` | Negative byte offset | `method` allows addend and `format` is signed |
| `<labeluse>` | Byte offset to label | `method` is either `codeseg` or `*sec` |
Collaborator

What does <labeluse> correspond to in https://github.com/WebAssembly/tool-conventions/blob/main/Linking.md? I would expect this to be an integer-or-not, and I'd be worried about a bare integer 1 being confused with "+1" for example given the current syntax.

Contributor Author

<labeluse> corresponds to the use of a label, those too always have a dollar name, so there should be no ambiguity.

Collaborator

Hm ok I think I see why I was pretty confused by this. Speaking of overloaded terminology... When I read "label" here I thought that this was referring to wasm block labels. (e.g. block $foo ... br_if $foo ... end). I also looked for the term "label" in Linking.md and couldn't otherwise find it. But it looks like this is something where a symbol is bound to an offset in a data segment, custom section, or function?

I was also a little confused at this where the addend in Linking.md is just an integer and not necessarily an entity somewhere else. I may have misinterpreted though?

I think basically I'm confused about what this is. Do you have an example of where this would be used?

Contributor Author

Well, those aren't described in trunk, but are described here: https://github.com/feedab1e/tool-conventions/blob/main/Linking.md#labels.
They are basically used in conjunction with R_WASM_SECTION_OFFSET_* to represent DWARF code address relocations.

You stick this label into the instruction stream, and whenever that label is used as an addend, it will point to that place in the instruction stream.

(module
  (func $foo (param) (result)
    nop
    (@sym.label $between_nops)
    nop)
  (@custom "debug_whatever"
    (;suppose this would represent a DWARF code address;)
    "\00\00\00\00\00" (@reloc leb codesec $foo $between_nops)))

Collaborator

Aha ok I see! Two thoughts then. Maybe @sym.bind instead of @sym.label to avoid overloading terminology with the wasm labels?

Also, in terms of ambiguity in parsing, what I was worried about was something like (@sym.label) (without a $foo-name) introducing symbol "0" and then the relocation wants to refer to it and 0 would ambiguously be an offset of 0 bytes or the 0th symbol. As I write this out though this seems silly. The specification could instead be that all symbols are $-bound, right? (and if so could "1" be allowed as an offset to avoid needing "+1"?)

Member

I think both forms are valid (I assume they are both generated by llvm today). It's up to the compiler to perform these optimizations (constant folding?) but we shouldn't mandate one or the other here.

Member

IIRC there are some cases where these folds cannot always be performed. I vaguely remember something about the offset being limited to positive values causing issues, but I can't remember the exact issue.

There was some recent work in llvm to make more use of the offset: llvm/llvm-project#145829

Contributor Author

> I think both forms are valid (I assume they are both generated by llvm today). It's up to the compiler to perform these optimizations (constant folding?) but we shouldn't mandate one or the other here.

No, this is only concerning WAT notation for the instructions, no codegen changes.

Relocation on a memarg replaces the integer that would otherwise be specified as offset=<num>, so this number is basically unused for purposes of linking.

A potential issue would be in that this number is still specified in the main binary, so currently i32.load offset=3 (@reloc $datasym +5) is representable, but may not be if that change is adopted. A solution to that would be to default instead of predefining the addend component for the memarg case

Member

Oh I see, you are proposing a shorthand where the offset= is not ignored like it is in the binary format.

Seems like maybe too clever, at least for v1? Better to keep it simple for the first iteration, and then add sugar/simplification/possible-elision once we have v1 working?

Contributor Author

Sure, that also would work.


- `symbol` describes the symbol against which to perform relocation.
  - For the `codesec` relocation method, this is the function id, so that if
    the addend is zero, the relocation points to the first instruction of
    that function.
  - For the `datasec` relocation method, this is the data segment id, so that
    if the addend is zero, the relocation points to the first byte of data in
    that segment.
  - For the `customsec` relocation method, this is the name of the custom
    section, so that if the addend is zero, the relocation points to the
    first byte of data in that section.
  - For other relocation methods, this denotes the symbol in the scope of
    that symbol kind.
Comment on lines 795 to 803
Collaborator

While syntactically it's nice to reuse the same $foo used for a data segment or function that doesn't reflect reality where the symbol table itself has its own indexing namespace and relocations point within that rather than elsewhere. For me that leads to preferring one of two options:

  1. Have a new identifier namespace for symbols which these refer to. That would be a parallel identifier namespace to the function namespace for example.
  2. Have various validation predicates such as there can't be two symbols on any one wasm item because there's otherwise no way to assign a relocation to a particular symbol if there are multiple.

I don't really know what the best option is myself, but I'm overall a bit worried about the disconnect between the syntax proposed here and the binary format having its own index space.

Contributor Author

The current state of identifiers is described here: #258 (comment), I am writing a section in the doc describing this.

Collaborator

I fear that the overloaded terminology may be thwarting here again, so I'll outline what I'm thinking with an example:

(module
  (func $foo (@sym $bar) (result i32)
      call $foo (@reloc $bar)
  )
)

Should this work? Here from a pure wasm text format point of view, ignoring annotations, $foo is the identifier of the function here. What I was naively expecting is that (@sym $bar) is introducing a symbol with the text-format-identifier of $bar and this declaration is entirely unrelated to func $foo other than that's the index of the function it refers to. The (@reloc $bar) then resolves to the (@sym $bar) here.

Does that match your understanding? Or are you thinking something different?

Contributor Author

Yes, that's correct.


The relocation type is looked up from the combination of `format`, `method`,
and `modifier`. If no relocation type exists, an error is raised.
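
The lookup can be pictured as a table keyed by the three components. The Python sketch below is deliberately partial: it lists only a handful of combinations, the `lookup_reloc_type` name is hypothetical, and `None` is used here as a stand-in for "no modifier".

```python
# Partial, illustrative mapping from (format, method, modifier) to a
# relocation constant. A real implementation would cover every
# relocation type; this subset is an assumption made for brevity.
RELOC_TYPES = {
    ("leb", "func", None): "R_WASM_FUNCTION_INDEX_LEB",
    ("leb", "global", None): "R_WASM_GLOBAL_INDEX_LEB",
    ("leb", "table", None): "R_WASM_TABLE_NUMBER_LEB",
    ("sleb", "functable", None): "R_WASM_TABLE_INDEX_SLEB",
    ("i32", "functable", None): "R_WASM_TABLE_INDEX_I32",
    ("leb", "data", None): "R_WASM_MEMORY_ADDR_LEB",
    ("sleb", "data", None): "R_WASM_MEMORY_ADDR_SLEB",
    ("i32", "data", None): "R_WASM_MEMORY_ADDR_I32",
    ("sleb", "data", "pic"): "R_WASM_MEMORY_ADDR_REL_SLEB",
    ("sleb", "data", "tls"): "R_WASM_MEMORY_ADDR_TLS_SLEB",
}


def lookup_reloc_type(fmt, method, modifier=None):
    """Return the relocation constant, or raise if the combination is unknown."""
    try:
        return RELOC_TYPES[(fmt, method, modifier)]
    except KeyError:
        raise ValueError(
            f"no relocation type for format={fmt!r}, method={method!r}, "
            f"modifier={modifier!r}"
        ) from None


print(lookup_reloc_type("sleb", "data", "tls"))
```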

If a component of a relocation is predetermined, it must be skipped in the
annotation text.

If a component of a relocation is defaulted, it may be skipped in the
annotation text.

For example, a relocation into the function table by the index of `$foo` with a
predetermined `format` would look like the following:
```wat
(@reloc functable $foo)
```
If all components of a relocation annotation are skipped, the annotation may be
omitted.

### Instruction relocations

For every usage of `typeidx`, `funcidx`, `globalidx`, or `tagidx`, a relocation
annotation is added afterwards, with `format` predefined as `leb`, `method`
predefined as the *primary* method for that type, and `symbol` defaulted as the
*primary* symbol of that `idx`.

- For the `i32.const` instruction, a relocation annotation is added after the
integer literal operand, with `format` predefined as `sleb`, and `method` is
allowed to be either `data` or `functable`.
- For the `i64.const` instruction, a relocation annotation is added after the
integer literal operand, with `format` predefined as `sleb64`, and `method`
is allowed to be either `data` or `functable`.
- For the `i{32,64}.{load,store}*` instructions, a relocation annotation is
added after the offset operand, with `format` predefined as `leb` if the
*memory* being referenced is 32-bit, and `leb64` otherwise, and `method`
predefined as `data`.
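
As a rough sketch of the rules above, the fragment below annotates one use of each kind. The module, the function names, and the symbol names `$foo_sym` and `$bar_sym` are made up for illustration, and the `(@sym ...)` placement assumes the rules given in the Symbols section of this document.

```wat
(module
  (memory 1)
  ;; hypothetical data symbol $bar_sym covering 4 bytes of this segment
  (data (i32.const 0) (@sym $bar_sym name="bar" size=4) "\00\00\00\00")
  ;; hypothetical function symbol $foo_sym attached to $foo
  (func $foo (@sym $foo_sym) (result i32)
    i32.const 0)
  (func $user (result i32)
    ;; funcidx use: format (leb) and method are predefined, only the symbol is written
    call $foo (@reloc $foo_sym)
    drop
    ;; i32.const: format (sleb) is predefined, the method is written explicitly
    i32.const 0 (@reloc functable $foo_sym)
    drop
    ;; load: format and method (data) are predefined; the annotation follows the offset
    i32.const 0
    i32.load offset=0 (@reloc $bar_sym)))
```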

### Data relocations

In data segments, relocation annotations can be interleaved into the data
string sequence. When that happens, relocations are situated after the last
byte of the value being relocated.

For example, relocation of a 32-bit function pointer `$foo` and a 32-bit
reference to a data symbol `$bar` in a data segment of size 8 would look
like the following:
```wat
(data (i32.const 0) "\00\00\00\00" (@reloc i32 functable $foo) "\00\00\00\00" (@reloc i32 data $bar))
```

## Symbols

Symbols are represented as WebAssembly annotations of the form
```wat
(@sym <name> <qualifier>*)
```
Data imports are represented as WebAssembly annotations of the form
```wat
(@sym.import.data <name> <qualifier>*)
```

- `name` is the symbol name written as a WebAssembly `id`; it is the name by
  which relocation annotations reference the symbol. If it is not present, the
  symbol is considered the *primary* symbol for that WebAssembly object, and
  its name is taken from the related object.
  - There may only be one primary symbol for each WebAssembly object.
  - If a symbol is not associated with an object, it may not be the primary
    symbol.

- `qualifier` is one of the allowed qualifiers on a symbol declaration.
Qualifiers may not repeat.

| `<qualifier>` | effect |
|---------------------------|-----------------------------------------------|
| `binding=<binding>` | sets symbol flags according to `<binding>` |
| `visibility=<visibility>` | sets symbol flags according to `<visibility>` |
| `retain` | sets `WASM_SYM_NO_STRIP` symbol flag |
| `thread_local` | sets `WASM_SYM_TLS` symbol flag |
| `size=<int>` | sets symbol's `size` appropriately |
| `offset=<int>` | sets `WASM_SYM_ABSOLUTE` symbol flag, sets symbol's `offset` appropriately |
| `name=<string>` | sets `WASM_SYM_EXPLICIT_NAME` symbol flag, sets symbol's `name_len`, `name_data` appropriately |
| `priority=<int>` | adds symbol to `WASM_INIT_FUNCS` section with the given priority |
| `comdat=<id>` | adds symbol to a `comdat` with the given id |

| `<binding>` | flag |
|-------------|--------------------------|
| `global` | 0 |
| `local` | `WASM_SYM_BINDING_LOCAL` |
| `weak` | `WASM_SYM_BINDING_WEAK` |

| `<visibility>` | flag |
|----------------|------------------------------|
| `default` | 0 |
| `hidden` | `WASM_SYM_VISIBILITY_HIDDEN` |

Shorthands may be used in place of full qualifiers:

| shorthand | resulting qualifier |
|-----------|---------------------|
| `hidden` | `visibility=hidden` |
| `local` | `binding=local` |
| `weak` | `binding=weak` |

- The `priority` qualifier may only be applied to function symbols.
- The `size` and `offset` qualifiers may only be applied to data symbols.
- The `size` and `name` qualifiers are required on data symbols.
- The `name` qualifier is required on data imports.

If all components of a symbol annotation are skipped, the annotation may be
omitted.
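
To illustrate how qualifiers and shorthands combine, here is a small sketch. All function and symbol names are hypothetical, and the placement of `(@sym ...)` assumes the rules in the subsections that follow.

```wat
(module
  (memory 1)
  ;; shorthand form: weak binding, hidden visibility, WASM_SYM_NO_STRIP
  (func $f (@sym $f_short weak hidden retain) (result i32)
    i32.const 1)
  ;; the same flags written with full qualifiers
  (func $g (@sym $g_long binding=weak visibility=hidden retain) (result i32)
    i32.const 2)
  ;; priority may only be applied to function symbols
  (func $init (@sym $init_sym local priority=100))
  ;; data symbols carry both size and name
  (data (i32.const 0) (@sym $d name="d" size=4) "\00\00\00\00"))
```

Under the shorthand table above, `$f_short` and `$g_long` should end up with identical symbol flags.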

### WebAssembly object symbols

For symbols related to WebAssembly objects, the symbol annotation sequence
occurs after the optional `id` of the declaration.

For example, the following code:
```wat
(import "env" "foo" (func (@sym $a retain name="a") (@sym $b hidden name="b") (param) (result)))
```
declares 3 symbols: one primary symbol with the name of the index of the
function, one symbol with the name `$a`, and one symbol with the name `$b`.

Collaborator

Can you clarify what the "primary symbol" is here? As it relates to linking.md I would expect this to introduce 2 symbols, only $a and $b, and I'm not sure what "primary symbol" is in this case.

Contributor Author

This would probably need an explanation in the document.

First thing I want to mention is that for each relocatable WebAssembly object identifier context, there exists a corresponding identifier context for symbols, and every time a name is used in a @reloc annotation, the name is looked up from that symbol identifier context.

So in WABT there already exists an option to generate relocatable binaries, where WABT implicitly assigns a symbol for each relocatable WebAssembly object (so, functions, objects, tables, and tags), and whenever that object is used, a relocation to its symbol is produced. This allows creating a subset of relocatable binaries without needing to specify any annotations at all.

However, in the actual linking spec, multiple symbols can exist per relocatable WebAssembly object, which breaks WABT's assumptions about uniqueness of symbols.

To reconcile this, I introduced the notion of a primary symbol, which means basically "The symbol we use by default if no relocation annotation is specified". It inherits the name of the object, and if one wants to specify special properties for that symbol, they would write a @sym annotation without a name. All other symbols have a dollar name. They cannot have numeric names, since those are all already taken up by the primary symbols per inheritance.

Member

> multiple symbols can exist per relocatable WebAssembly object

I think we need some way here to disambiguate "object". In the context of linking this has an assumed meaning. Perhaps "element"?

Contributor Author

Well, I used entity before I saw object used in the same context in the spec, so if we don't like object as a term, I'd propose using that. The problem with element is that there is a WebAssembly meaning behind this term, so we can't use it either.

Member

entity sounds better than object to me yes. Naming is hard.

Collaborator

You'll have to forgive me if these questions I'm raising seem naive. I am not intimately familiar with the linking and reloc.* sections, nor exactly how LLVM uses all the bits and bobs everywhere. I'm familiar with everything at a high-level, however, and much of my thinking is projected from an assumed understanding of how things work.

One thing I've never understood is how symbols in the linker-sense relate to the name section in the text-format sense. For example is the symbol of (import "a" "b" (func $bar)) "a", "b", "bar", or just the index 0? Or does "a" have to be "env" and nothing else?

Another observation/thought, related to the discussion below, is that I would be surprised at the ability to omit (@sym). For example I would expect that the text-to-binary conversion for the module you pasted would not result in a linking or reloc.* section. I would assume that (@sym) and (@reloc ...) are required to get something into the custom sections, and otherwise nothing is emitted.

Member

@sbc100 Oct 21, 2025

I tend to agree with @alexcrichton here (assuming I understand the above correctly).

I think we should perhaps not allow omitting the annotations. Maybe we just require them in all cases, at least initially for the first draft. That way, the contents of the linker section and symbol table much more obviously correspond 1-to-1 with the annotations in the wat.

If we later decide it really is too verbose we could always add something later, some way to enable eliding of annotations. I can imagine several ways to opt into that more magical behaviour, but perhaps we should start with the simpler, explicit version?

Contributor Author

> You'll have to forgive me if these questions I'm raising seem naive. I am not intimately familiar with the linking and reloc.* sections, nor exactly how LLVM uses all the bits and bobs everywhere. I'm familiar with everything at a high-level, however, and much of my thinking is projected from an assumed understanding of how things work.

> One thing I've never understood is how symbols in the linker-sense relate to the name section in the text-format sense. For example is the symbol of (import "a" "b" (func $bar)) "a", "b", "bar", or just the index 0? Or does "a" have to be "env" and nothing else?

My understanding is that the linking spec is entirely unaware of the name section, and the only strings that participate in deciding on the linkage name for a symbol are the import's field and the explicit name, if present.

> Another observation/thought, related to the discussion below, is that I would be surprised at the ability to omit (@sym). For example I would expect that the text-to-binary conversion for the module you pasted would not result in a linking or reloc.* section. I would assume that (@sym) and (@reloc ...) are required to get something into the custom sections, and otherwise nothing is emitted.

> I think we should perhaps not allow omitting the annotations. Maybe we just require them in all cases, at least initially for the first draft. That way, the contents of the linker section and symbol table much more obviously correspond 1-to-1 with the annotations in the wat.

Sure, I can disable omission for just (@sym).

> If we later decide it really is too verbose we could always add something later, some way to enable eliding of annotations. I can imagine several ways to opt into that more magical behaviour, but perhaps we should start with the simpler, explicit version?

The main thing I'm concerned about is that that will still break WABT users that relied on this magic, but on the other hand I can't imagine that that many people relied on that feature in the first place, so it's probably fine.

Member

> The main thing I'm concerned about is that that will still break WABT users that relied on this magic, but on the other hand I can't imagine that that many people relied on that feature in the first place, so it's probably fine.

I wouldn't worry about that. The -r flag in wabt was always a big hack.

Also, if we want -r to continue to mean "generate auto-magic reloc information" in wabt then it can do that right?

Contributor Author

@feedab1e Oct 22, 2025

> > The main thing I'm concerned about is that that will still break WABT users that relied on this magic, but on the other hand I can't imagine that that many people relied on that feature in the first place, so it's probably fine.

> I wouldn't worry about that. The -r flag in wabt was always a big hack.

> Also, if we want -r to continue to mean "generate auto-magic reloc information" in wabt then it can do that right?

Yeah, probably, then I think --enable-linking would enable recognition of those annotations, while -r would trigger generation of symbols without (@sym), so that for the behaviour you want it would be wat2wasm --enable-annotations --enable-linking, for the behavior I proposed it would be wat2wasm --enable-annotations --enable-linking -r, and for the old WABT behavior it would be wat2wasm -r.

And in terms of wording, that would be something like
"Implementations may choose to allow omitting (@sym) annotations where all of their components are not present".

That said, I would still strongly advise against omitting (@reloc) even without -r, since those are very predictably required at every WebAssembly entity use, so it would be an error not to write them, multiple relocations will not appear on the same offset (meaning there will be no confusion like happened with (@sym)), and it would still be obvious that we are creating an object file because of the (@sym) we have on all symbols.


### Data symbols

Data symbol annotations can be interleaved into the data string sequence.
When that happens, each symbol annotation is situated before the first byte of
the value being defined.

For example, a declaration of a 32-bit global with the name `$foo` and linkage
name "foo" would look like the following:
```wat
(data (i32.const 0) (@sym $foo name="foo" size=4) "\00\00\00\00")
```

### Data imports

Data imports occur in the same place as module fields. Data imports are always
situated before data symbols.
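
A data import might be written as in the sketch below. The names `$ext_counter` and `$read` are hypothetical, and the use of `(@reloc data ...)` on `i32.const` assumes the instruction relocation rules described earlier in this document.

```wat
(module
  (memory 1)
  ;; data import declared in module-field position; the name qualifier is required
  (@sym.import.data $ext_counter name="ext_counter")
  (func $read (@sym $read_sym) (result i32)
    ;; address of the imported data, to be filled in by the linker
    i32.const 0 (@reloc data $ext_counter)
    i32.load))
```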

## COMDATs

COMDATs are represented as WebAssembly annotations of the form
```wat
(@comdat <id> <string>)
```
where `<id>` is the WebAssembly name of the COMDAT, and `<string>` supplies the
`name_len` and `name_str` of the `comdat`.

COMDAT declarations occur in the same place as module fields.
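
As an illustration, a COMDAT and a symbol that joins it through the `comdat` qualifier could look like the following sketch. The identifiers `$grp`, `$helper`, and `$helper_sym`, and the string `"inline_helpers"`, are invented for this example.

```wat
(module
  ;; COMDAT declared in module-field position; the string supplies name_len/name_str
  (@comdat $grp "inline_helpers")
  ;; a function symbol added to the COMDAT via the comdat qualifier
  (func $helper (@sym $helper_sym weak comdat=$grp) (result i32)
    i32.const 42))
```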

## Labels

For some relocation types, an offset into a section/function is necessary. For
these cases, labels exist.
Labels are represented as WebAssembly annotations of the form
```wat
(@sym.label <id>)
```

### Function labels
Function labels occur in the same place as instructions.
A label always denotes the first byte of the next instruction, or the byte
after the end of the function's instruction stream, if there isn't a next
instruction.

Function label names are local to the function in which they occur.
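
A minimal sketch of function labels, with hypothetical function and label names, might look like this:

```wat
(module
  (func $f
    nop
    ;; denotes the first byte of the second nop in $f
    (@sym.label $here)
    nop)
  (func $g
    ;; a separate label: names are local to each function, so $here can be reused
    (@sym.label $here)
    nop))
```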

### Data labels
Data labels can be interleaved into the data string sequence.
When that happens, relocations are situated after the last byte of the value
being relocated.

Data label names are local to the data segment in which they occur.

### Custom labels
Custom labels can be interleaved into the data string sequence.
When that happens, relocations are situated after the last byte of the value
being relocated.

Custom label names are local to the custom section in which they occur.

## Data segment flags
Data segment flags are represented as WebAssembly annotations of the form
```wat
(@sym.segment <qualifier>*)
```

- `qualifier` is one of the allowed qualifiers on a data segment declaration.
Qualifiers may not repeat.

| `<qualifier>` | effect |
|-----------------|------------------------------------------------------|
| `align=<int>` | sets segment's `alignment` appropriately |
| `name=<string>` | sets segment's `name_len`, `name_data` appropriately |
| `strings` | sets `WASM_SEGMENT_FLAG_STRINGS` segment flag |
| `thread_local` | sets `WASM_SEGMENT_FLAG_TLS` segment flag |
| `retain` | sets `WASM_SEG_FLAG_RETAIN` segment flag |

If `align` is not specified, it is given a default value of 1.
If `name` is not specified, it is given an empty default value.

If all components of segment flags are skipped, the annotation may be omitted.

Data segment annotation occurs after the optional `id` of the data segment
declaration.
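
For example, under the rules above a pair of segments could be sketched as follows. The segment ids, the segment name `".rodata.str"`, and the chosen `align` value are illustrative only.

```wat
(module
  (memory 1)
  ;; the annotation follows the optional id $strs
  (data $strs (@sym.segment align=4 name=".rodata.str" strings retain)
    (i32.const 0) "hi\00")
  ;; no qualifiers: align defaults to 1 and name to empty, so the annotation is omitted
  (data $plain (i32.const 16) "\00\00\00\00"))
```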