Notes On The Design Of An Ocap Kernel
@Chip Morningstar
This document has been posted primarily to make it available to the Ocap Kernel team, both to help my teammates stay aware of what I'm doing and to get their feedback. However, I still appreciate feedback from anybody else who has the time and energy to go down into the rabbit hole, regardless of who they are or what part of the organization they are in. That said, keep in mind that this is very much a work in progress. The content here is likely to vary wildly in form, substance, and coherence from one day to the next, as the writing here is not merely a record of my emerging design but very much a part of how I work out my thoughts as to what the design even is. If what you seek is some kind of global understanding of the design itself you might want to wait until it's more cooked; rest assured that I won't be quiet about it once some kind of good-enough-to-criticize threshold is reached (if you want to be actively notified at that point, just drop me a note). If what you seek is design details suitable for implementation or any kind of actual close scrutiny, you definitely should wait until it's more complete -- in its current state this is much more of a think piece than a specification, however much it might at times resemble the latter.
"The naming of names is a difficult matter, it isn't just one of your holiday games." -- not T. S. Eliot
We probably should pick a proper label for this thing we are creating. We've been referring to it as "The Ocap Kernel", but I think it's more than that. Our team is labeled "The Ocap Kernel Team" (and I suspect the higher level management discussions that led to our being organized and tasked with doing this probably used this language as well), but there's a distinction to be made between the platform as a whole and the kernel per se. In particular, I expect we are headed for some kind of kernel+vats architecture. This suggests that there will almost certainly be at least one component that is the trusted supervisory code that lives inside the vat but above any user code. This component is clearly part of our platform but definitionally is not part of the kernel.
(Though I will note in passing that "Ocap Kernel" conveniently abbreviates as "OK", which might be a source of amusement as we contemplate the Naming Of Things.)
A further complication is that each user who is hosting some set of vats will be running a separate instance of our platform, and such instances probably also need a term to label them. An object inside a vat may exchange messages not only with objects in other vats on its same platform instance but also with objects in vats hosted by other users elsewhere on the internet. Consequently, we'll probably want a word meaning "all the vats hosted by some user". Also, I'm being a little loose here with the term "user". Our default mental model is of vats being hosted inside some person's web browser, and it's natural to think of that person as "the user", but of course we also want to consider vats being hosted by independent servers in some data center somewhere whose purpose is to provide services unrelated to any particular human -- it's the internet, after all. You could think of these servers (or their operators) as "users" in some sense, though at that point the language gets a little strained. And of course that's before we even start talking about message traffic exchanged with entities entirely outside our paradigm (via, say, HTTP or whatever), even if those entities remain part of the ecosystem we're concerned with.
So I'm just going to start inventing terminology and making up names for things. Don't get overly fixated on the language that I've chosen (despite the fact that I'm obsessing over it quite a bit here myself) but rather consider all these words provisional, subject to revision, elimination, or addition as we refine our design and get clearer in our own minds what we are doing. In this document I'm going to try to follow the convention of writing any new bit of jargon in BOLD ALL CAPS the first time it appears (hopefully in the context of some kind of definition or explanation) and thereafter follow the Germanic convention of Capitalizing These Words when they're used.
Also, as I've been writing I've been encountering unanswered design questions that I don't already have a strong opinion on. These are marked with OPEN DESIGN QUESTION in the text.
Everything here takes as a given that we're building on a common base of
Hardened JavaScript, the ocap security model, and various related concepts such
as compartments, eventual messaging, and so on, as provided by the SES shim or
the XS engine, plus various supporting packages in the Agoric endo
repository
for things like bundling or marshaling. For purposes of this discussion, I'm
going to presume that everybody generally knows what these things are and is on
board with this approach, rather than stopping to explain or justify these bits,
except when there are nuances that merit deeper discussion (ultimately, of
course, there will be more outward reaching documentation that explains All The
Things, but that's for later).
At the bottom we have OBJECTS. Because we are building in Hardened JavaScript, these are all just JavaScript objects, but for our purposes we further divide them into three varieties: DATA OBJECTS, BEHAVIORAL OBJECTS, and ORDINARY OBJECTS. This distinction is relevant in the context of marshaling and serialization, and in the places those are used, namely messaging and persistence.
- Data Objects are passed by value; they are selfless (that is, they have no identity) and are described entirely by the bits from which they are composed. All own properties of Data Objects must be either (a) JavaScript primitive values (e.g., `number`, `string`, `boolean`), (b) References to Behavioral Objects, or (c) recursively nested Data Objects. Since Data Objects are passed by value, they cannot have methods (i.e., own properties that are functions) because these can't be serialized. The prototype of a Data Object must be `null` or the JavaScript built-in object `Object`.
- Behavioral Objects are passed by reference. They have identity and may have methods, but can have no visible state (i.e., no JavaScript own properties other than methods). An empty object (i.e., one with no own properties at all) is considered a Behavioral Object since References to empty object instances are extremely useful as passable tokens (i.e., things that have identity and essentially nothing else).
- Ordinary Objects are regular JavaScript objects that remain inside the Vat. They may have any mixture of JavaScript properties whatsoever but are not passable.
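To make the distinction concrete, here's a rough sketch in plain JavaScript (the particular names, and the `paymentsService` Reference it mentions, are purely illustrative, not part of any proposed API):

```js
// A Data Object: no identity, no methods; only primitives, References to
// Behavioral Objects, or nested Data Objects. Its prototype is Object or null.
const invoiceData = harden({
  amount: 3,
  memo: 'for services rendered',
  payee: paymentsService, // a Reference to some Behavioral Object obtained earlier
});

// A Behavioral Object: identity and methods, but no visible own state (any
// state it needs lives in the enclosing closure instead).
const makeCounter = () => {
  let count = 0;
  return harden({
    increment: () => (count += 1),
    read: () => count,
  });
};

// An empty Behavioral Object, useful purely as an identity-bearing token.
const token = harden({});

// An Ordinary Object: any mix of properties, but it stays inside the Vat and
// is not passable.
const scratchpad = { cache: new Map(), hitCount: 0 };
```

(`harden` here is the Hardened JavaScript freeze-to-the-depths primitive; in the Endo world, objects generally need to be hardened before they can be passed at all.)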
Each Object resides within some VAT, an isolated process that can send and receive messages. When a message is sent from an Object in a Vat to the world outside that Vat, the message may contain OBJECT REFERENCES that designate Behavioral Objects (which are often, but by no means always, contained within the Vat sending the message). When a message is received by a Vat from the outside world, the message may contain Object References that designate Behavioral Objects, quite often ones contained by other Vats. Such References can be used by code within the Vat as targets for future messages it might send.
Broadly speaking, a Vat consists of two parts: a SUPERVISOR and USER CODE.
The Supervisor is trusted, privileged code that is part of our platform. It is responsible for:
- receiving, deserializing, and queueing messages inbound from other Vats, then, when appropriate according to our code execution model, dequeueing, demarshaling, and dispatching these to User Code.
- accepting, marshaling, and queueing messages sent from User Code to other Vats, then, when appropriate, dequeueing, serializing, and transmitting them to the outside world (specifically, transmitting them via the Kernel; more on this shortly).
- overseeing persistent storage on behalf of the Vat, both for the Supervisor's own use (e.g., holding the inbound and outbound message queues) and for the storage of persistent objects belonging to User Code.
- providing other system level services to User Code as we see fit to provide, potentially including process termination, access to other communications channels (e.g., HTTP), clocks, timers, and so on. The list of possible services that might be included here is open ended and potentially large, though of course we'll want to constrain and thoroughly specify it in the final design. Some of these services may be provided by syscalls that the Supervisor makes to the Kernel.
User Code is a bundle of arbitrary, user-provided JavaScript that is executed within a compartment that the Supervisor creates and manages for that purpose. In general, User Code is untrusted, except insofar as we might have some kind of vetting and certification procedure in place for a limited number of special, privileged applications to be granted special endowments that provide capabilities to perform selected sensitive operations that are normally closely held (e.g., access to the clock or to the browser DOM).
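As a very rough sketch of how this division of labor might be wired together (everything here, names included, is hypothetical and just for discussion, with marshaling stubbed out as JSON):

```js
// A toy Supervisor: it owns the queues and the Kernel-facing transmission
// path, and hands User Code only a narrow `vatPowers` capability surface.
const makeSupervisor = (transmitToKernel) => {
  const inboundQueue = [];           // Object Messages received from the Kernel
  const exportedObjects = new Map(); // VRef -> Behavioral Object in User Code

  const vatPowers = {
    // User Code asks the Supervisor to send an Object Message; the Supervisor
    // marshals it and transmits it (via the Kernel) to the outside world.
    send: (targetVRef, method, args) =>
      transmitToKernel(JSON.stringify({ targetVRef, method, args })),
  };

  return {
    vatPowers,
    // User Code setup registers the objects it exposes to the outside world.
    registerExport: (vref, obj) => exportedObjects.set(vref, obj),
    // The Kernel hands an inbound message to the Vat; it is queued, not run.
    deliver: (serializedMessage) => inboundQueue.push(serializedMessage),
    // When the execution model says so, process exactly one queued message.
    runOneCrank: () => {
      const { targetVRef, method, args } = JSON.parse(inboundQueue.shift());
      exportedObjects.get(targetVRef)[method](...args);
    },
  };
};
```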
The Vats running on a given host are managed by a KERNEL. The Kernel and Vats together are called a CLUSTER (note: of all the vocabulary I've been defining here, "Cluster" is the word I'd most like to have a better alternative for). The Kernel is responsible for:
- creating, loading, starting, pausing, resuming, and stopping the Vats in its Cluster (note that a Vat might be created as the result of extrinsic user action or by the intrinsic action of another Vat via a service that the Kernel makes available to it for this purpose)
- managing the granting of powers to specific Vats under the control of the user whose Cluster it is
- providing a KERNEL MANAGEMENT API, enabling a user interface by which its associated user can control or configure the Kernel's operation and that of its Vats
- conveying messages between the Vats that it is managing
- receiving messages inbound from other Clusters and forwarding them to the appropriate Vats
- accepting messages outbound from Vats in its Cluster that are targeted at objects hosted in other Clusters, and transmitting them over the network to those Clusters
- translating inbound and outbound object references (contained in messages) between Vat-relative and global forms
- holding and managing the received message queues for any unsettled promises for which one of its Vats is the decider
A Cluster will be hosted either by a browser extension (where the primary UX will be presented in the browser itself via the web interface) or by a standalone Node executable (where the primary UX will most likely be command line arguments and configuration files). Whether it is presented as an interactive web page or as a CLI, we'll refer to the controlling entity as the CONSOLE. The Console governs the operation of the Cluster via the Kernel. A Console itself consists of two parts:
- A CONSOLE DAEMON, an independent process that maintains persistent console state independent of the execution state of the Kernel or any Vats. It interacts with the Kernel via the Kernel Management API. Note that the Console Daemon needs to be independent of the Kernel because one of its jobs will be launching the Cluster in the first place, prior to which there is no Kernel. The Console Daemon in turn presents an API to:
- A CONSOLE USER INTERFACE, a possibly ephemeral process (or a series of such processes over time) that actually implements the user facing controls for the cluster, whether via a web interface, a CLI, or something else.
One of the things that the Console Daemon maintains is the user's collection of PETNAMES, which are human readable, human meaningful, user assigned labels for those Objects within the Cluster's Vats that have been made available for the user's direct manipulation (using the Console User Interface) as the result of operations invoked from the Console.
We anticipate there will be a single Console Daemon implementation that serves for all purposes, but at least two and possibly more Console User Interface implementations that vary depending on the particulars of the type of control UX that is being realized.
Typically, a given machine will only host a single Cluster, with the notable exception of multiple Clusters on a machine in service of testing and debugging (i.e., not the normal use case but one that is still important to support as part of the DX).
The ultimate affordance of an Object Reference is its use as the target destination for a message send, though the use of Object References as identity bearing tokens (i.e., values that can be compared with `===` and `!==`) is also supported.
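Concretely, using the eventual-send support from the Endo packages mentioned earlier, code holding an Object Reference can exercise it in exactly those two ways (here `issuer` and `otherIssuer` stand in for References obtained from earlier message arguments, and the method name is made up):

```js
import { E } from '@endo/far';

// Use the Reference as the target of an eventual message send; the result is
// a promise for whatever the remote method eventually returns.
const balanceP = E(issuer).getBalance();

// Use the Reference as an identity-bearing token.
if (issuer === otherIssuer) {
  // Both References designate the very same Behavioral Object.
}
```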
At bottom, any given message is sent from one Object to another. However, to describe how things actually work we also need to consider the role of Vats and Clusters. In particular, a message sent to an Object in another Vat within the same Cluster as the sender needs to be handled rather differently from a message sent to an Object in a different Cluster.
Within a Cluster, a message sent by an Object in Vat A to a different Object in Vat B is transferred from the sender to the receiver via the Kernel. The Kernel needs to be able to determine, based solely on the target Object Reference itself, which Vat to deliver the message to, and in what form to deliver it such that the receiving Vat's Supervisor will understand which of its Objects is to be the actual recipient.
The primary way (aside from bootstrapping, about which more in a moment) that a Vat could know the identity of an Object outside itself is if the reference to that Object had previously been imported into the Vat in the arguments of an earlier message, which in turn means that that Object Reference had to have been exported by the originating Vat as part of (or prior to) sending that earlier message. Export preceding import means that the Kernel sees the Object Reference before the receiving Vat does, and is thus always able to associate any Object Reference found in a message with the Vat that exported it, and thus knows where to deliver any subsequent message addressed to that same Reference from some other Vat.
In order for this process of Reference begetting Reference not to lead to infinite regress, some kind of bootstrap mechanism is required, else there'd be no way for any Vat to ever know about anything outside itself in the first place and thus no way for it to ever send messages anywhere. To bootstrap the reference graph, whenever a Vat is created, its initialization creates within it an initial ROOT OBJECT, a Reference to which is exported to the creator as the result of the Vat creation operation. If the Vat was created by the action of User Code in another Vat, this Root Object Reference is delivered to the creator Object as the return value from its invocation of the Vat creation operation. If the Vat was created by user Console action, the Root Object Reference is placed into the user's Petname table for potential future use in operations commanded via the Console (see the discussion of Petnames below).
The logic for messages between Objects in separate Clusters does not map directly to that for messages between Objects in Vats within the same Cluster, although the principles involved are closely related. The first important difference is that an observer outside a given Cluster has no way to know which of that Cluster's Vats a given Object Reference feeds into. On the other hand, there's no particular reason for this outside observer to know or care about which specific Vat it is either. Instead, we consider the internal organization of a Cluster to be the Cluster's private business. Outside an Object's containing Cluster we only associate a Reference to that Object with the Cluster itself. Another way of putting this is: since you don't have visibility into the inner workings of a Cluster, you must treat all the Objects exported from the Cluster as if the Cluster were a single giant Vat; any finer-grained distinction within that Cluster that you might become aware of is an illusion presented to you by the Cluster's Kernel. The logic of References begetting References via message passing in the intra-Cluster (i.e., Vat-to-Vat) case is recapitulated in the inter-Cluster case, but at the granularity of Clusters rather than the granularity of Vats.
A second, and probably more important, difference is that between Clusters there is no single entity analogous to a Kernel that knows all the References that have been exported and who exported them. Instead, Clusters connect to each other in an arbitrary peer-to-peer network rather than via the star topology that the Kernel organizes its Vats into. This has important implications with respect to how connectivity between Objects is bootstrapped across Cluster boundaries.
Whereas Vats are created by other Vats via the Kernel or by the user via the Console, new Clusters are created strictly by the autonomous actions of users, external to any pre-existing Cluster (e.g., a user boots a new Cluster instance on their machine and connects it to the internet). In particular, the Cluster itself has no notion of a Root Object, since in its initial state a Cluster need not have any Vats at all. Once the Cluster does have Vats (either through later user action via the Console or as a consequence of configuration information provided as part of Cluster creation), these Vats (or rather, the Objects within them) have no a priori connection to anything, so initial connectivity between Clusters cannot be established purely by reference passing. Instead, we require some means for a Cluster's user, or software running on the user's behalf, to publish the equivalent of an Object Reference in a form that can be communicated out of band (i.e., in something other than an Object to Object message) to the users of other Clusters, who in turn need a means to accept these and convert them into regular Object References that they can use normally. To this end, the Kernel has the capability to translate between Object References, which only have meaning in the context of messages between Objects as represented in the implementation of the inter-object message passing scheme, and OCAP URLS, which are text strings that can be communicated via pretty much any means you like (email, SMS, PostIt notes, QR codes on the sides of buses, etc.). The two directions of translation (from Ocap URL to Object Reference and vice versa) are treated as capabilities that are made available to the user via the Console. From the Console these capabilities may be used directly or selectively granted to Objects to which the user has access.
Taken all together, the arrangements discussed above yield five different possible forms an Object Reference may take depending on context:
- A VREF (short for "vat reference") designates an Object within the scope of the Objects known to a particular Vat. It is used across the Kernel/Vat boundary in the marshaling of messages delivered into or sent by that Vat. A VRef is generated and assigned by the Kernel when importing an Object Reference into a Vat for the first time and by the Vat when exporting an Object Reference from it for the first time.
- A KREF (short for "kernel reference") designates an Object within the scope of the Kernel itself. It is used in the translation of References between one Vat and another. A KRef is generated and assigned by the Kernel whenever an Object Reference is imported into or exported from a Vat for the first time. KRefs are strictly internal to the Kernel implementation. (Note: in principle KRefs are not strictly required, but for practical purposes the differentiation between VRefs and KRefs is important because it enables the Kernel to maintain 2N Reference translation tables for N Vats rather than having to potentially maintain N² translation tables.)
- An RREF (short for "remote reference") designates an object within the scope of an established point-to-point communications Channel between two Clusters (more on Channels below). An RRef does not survive the Channel it is associated with. An RRef is generated when the Kernel for one Cluster exports an Object Reference into the Channel connecting it to another Cluster's Kernel.
- An OCAP URL is an externalizable string that designates an object globally in a form suitable for transmission out of band. It incorporates addressing information that allows a Kernel to identify the Cluster containing the Object it refers to and create a Channel to it if one does not already exist or to obtain the existing Channel if one does. It designates an Object in a way that is independent of whether the Cluster containing the Object is currently available or not and independent of the Channel used to establish connectivity. It cannot be used to address messages directly, but can be used to establish connectivity and ultimately obtain an RRef which does permit messaging.
- A PETNAME is a human readable, human meaningful, user assigned label that designates an Object within the scope of a given Cluster's Console. Internally, the Console maintains a table mapping these labels to KRefs. Petnames may manifest as textual names, such as a user might type or paste into a text input box on a web form, or draggable labels (with distinguishing text, icons, or both) that can be manipulated as part of an interactive Console GUI, or they might take some other form depending on the Console UX design, as long as they have the property of being directly meaningful to the user. One user's Petname for an Object is unrelated to another user's Petname for the same Object. (Note that the two users might happen to pick the same Petname string, perhaps deriving it from some functional role, known to both of them, that the Object plays in their world. However, such coincidences cannot be relied on and even when they do happen no deeper meaning can safely be inferred from them.)
In addition, this writeup will at times use EREF (short for "endpoint reference") as a generic term in contexts where we might be talking about either a VRef or an RRef, since it will be less cumbersome than repeating the phrase "VRef or RRef as appropriate" over and over again.
A CHANNEL is a communications connection between two Clusters that provides a context for message traffic between them. In particular, it establishes a scope for the interpretation of shared RRefs. A Channel is a sturdy version of a TCP-like session connection. It may, in fact, be transported over an ordinary TCP (actually, more likely TLS) connection, but it can survive loss and reestablishment of the underlying transport session, as well as termination and restart of the Kernel or Vat processes at either end, while preserving all of the Object References stretched over the connection and preserving unbroken message streams back and forth between the two endpoints.
Connecting one Cluster to another requires solving two important sub-problems: (a) knowing or learning where to make a network connection (i.e., what IP address and port number to connect to) in order to interact with the Cluster, and, (b) once having connected, verifying that what one has connected to is, in fact, the Cluster one actually intended to connect to (or, indeed, if it is a Cluster at all).
In addition, the actual mechanism of network connectivity needs to account for our intention to host Clusters inside users' web browsers. This is significant because browsers commonly run on machines that aren't in a position to directly accept incoming network connections due to network address translation, dynamic IP address assignment, firewalls, proxies, and all the other myriad ways that the IT organizations of the world have gummed up the proper functioning of TCP/IP networks over the years. Unless and until proven otherwise, I'm going to assume we can use one of the several pre-existing solutions to this problem that clever people have devised, rather than trying to invent our own workaround. (A couple of plausible candidates that various folks have pointed out are libp2p and p2panda; no doubt there are others as well). However, we do need to acknowledge that whatever solution we adopt will most likely end up as a key part of our network stack, since it will almost certainly need to be baked into our design at a fairly fundamental level. Consequently, for the time being I'm just going to treat this as a solved problem, making the naive but useful presumption that one computer can make a connection to another at will, while handwaving away all the complications.
An Ocap URL needs to contain several distinct pieces of information:
- Network addressing information that indicates how to contact the Cluster that contains the object it designates.
- Cluster identification information that enables the party establishing a Channel to verify that it is validly in contact with the proper Cluster.
- An opaque encoding of the object's identity in a way that can be decoded by the target Cluster but which by itself does not uniquely reveal the Object's identity. In particular, multiple different Ocap URLs may all encode the same Object Reference, and it should generally be considered a best practice to vend a new, unique Ocap URL each time an Object Reference is externalized from its containing Cluster.
There are a few different existing variations of Ocap URLs that we can point to, but all the ones I'm aware of share a common ancestry and a similar form. For example, the E version goes like this:
`cap://searchpath/vatID/objectID`

where

- `searchpath` is a semicolon-separated list of network addresses to contact to attempt to locate the appropriate Vat (a network address, in this context, following the familiar `host:port` convention)
- `vatID` identifies the Vat in the form of a base-64 encoded public key fingerprint
- `objectID` identifies the object within the Vat in the form of a base-64 encoded Swiss number
The URL scheme tag varies somewhat in the wild. E originally used `cap` (which was at one point registered with IANA for this purpose, though this registration seems to have lapsed). Newer versions of the CapTP protocol specification use `captp`, while the OCapN initiative appears to use `ocapn`. There are, of course, numerous design variations to consider. For example, one relatively trivial question is whether to encode things like key fingerprints or Swiss numbers in base-64 or hex. Another choice might be to represent the object ID as a simple integer that is salted and then encrypted with the host Vat's private key, frustrating some attacks that attempt to compare object IDs out of band and possibly also simplifying the internal lookup table that the host must maintain. (Note also that all the extant schemes I've looked at assume Vat-to-Vat connectivity, whereas we'll be connecting Clusters, but I don't think this difference will have a substantive impact on our fundamental design.)
Identifying a Vat using its public key fingerprint is a way to verify that a connection has been made to the proper endpoint as part of setting up a secure communications pathway. This is one of the jobs traditionally performed by TLS. However, in order to authenticate that a communications counterparty is the one expected, in normal operation TLS leverages DNS: when connecting to a server via TLS, the server provides a cryptographic certificate attesting to the association between the DNS name of the server and the public key used in the TLS handshake to generate the session key. This certificate is signed by a certificate authority (CA) whose validity is verified either by another certificate signed by a higher level CA (repeating this process recursively) or by being found in a collection of root CAs that the browser is pre-configured to automatically trust. The determination of the ultimate authenticity of the root CAs is highly centralized, leading to a number of points of vulnerability -- the authorities maintaining the root CA list could be compromised, individual CAs could be compromised without their parent CA's knowledge, revocation of a compromised CA's certificate might not have reached the browser (for many different possible reasons), or the root CA list held by the browser could itself be compromised (either within a given user's environment or at some point in the browser's implementation, build, and distribution path). Many of these failures have, in fact, occurred repeatedly in the history of the web. DNS itself is also a point of vulnerability: in the event that the operator of a server loses control over the DNS registry for its domain name, some other party may be positioned to substitute its own server address and corresponding TLS certificate chain. These sorts of compromises have also happened in practice. Because the root of the CA hierarchy is highly centralized, there remains a risk that some of these compromises could happen at scale. A further problem is that even without security failures, establishing the connection requires that the entity connected to possess the requisite certificates in the first place, which is itself administratively complex and highly error prone (and whether or not DNS is involved merely varies the modes of complexity and failure).
An alternative approach, which I want to advocate here, is for our Ocap URLs to employ something like Tyler Close's YURL scheme. While there are lots of details and possible variations, the basic idea behind them all is simple: embed the fingerprint of the contactee's public key into the URL itself, so that when a connection is made its authenticity can be validated directly. Tyler's documentation (just linked) shows how this can be accomplished with existing, mature TLS implementations, requiring only small and relatively simple changes to the way TLS is used. In the 20+ years since YURLs were first proposed they have seen only experimental adoption, in large part because of the huge entrenched installed base built around TLS, HTTPS, and the CA hierarchy, but also because YURLs and their brethren involve URLs that are large and unwieldy, giving them very poor ergonomics in many web use cases. However, because we are already handling Object References using Petnames combined with automation, our typical use cases do not generally require exposing representations of Object References to humans. YURLs do lend themselves very well to being handled via automation in such a fashion, which will allow us to have much better decentralization, avoid a large swath of potential security vulnerabilities, and bypass a lot of complicated, difficult to understand, and error prone administrative procedures (the latter being particularly important if we want this to be used by ordinary users in their browsers rather than just by professional IT people). Note that one of the main things that distinguishes the YURL approach from the `cap` URL scheme and its variants is that while both make use of a public key fingerprint embedded in the URL, YURLs leverage the preexisting authentication machinery of TLS, while `cap` URLs require the implementor to perform the Vat authentication step themselves. On the other hand, the `cap` scheme is a bit more flexible in its support for an explicit search path, though the value of this is perhaps questionable.
(OPEN DESIGN QUESTIONS given some of the above described variations and
tradeoffs: (1) should we simply adopt one of the existing URL schemes or are we
better off designing our own? In particular, might it be better to stick to some
convention for using, say, https
rather than using a specialized URL scheme?
(2) Should the host address in an Ocap URL be regarded as the actual site to
contact to reach the Object in question or should it be regarded as a hint as to
where one might look, with the URL providing entry into a directory service if
it doesn't reach the Object directly? In some of the work with E we referred to
this as a "redirectory", drawing analogy to a web server that might serve up a
page or might serve up an HTTP redirect.)
With a couple of exceptions (which will be described when we get to them), we conceive of the interfaces between the various communicating components of the overall system in terms of asynchronous messaging between loosely coupled actors. Note that use of the term "message" here can be a source of confusion, since among the most common and important kinds of things being transported by messages between components are messages between Objects, but these are "messages" at two different levels of abstraction. When helpful for clarity I will sometimes refer to SYSTEM MESSAGES or OBJECT MESSAGES to disambiguate what is being talked about. All the traffic on the interfaces described in this section consists of System Messages; some of these System Messages will carry Object Messages in their payloads.
Typically, connections between processes on the same machine will be carried by
interprocess pipes. Channels, which nominally connect across machine
boundaries, will be carried via a network protocol layered over TCP/IP. In any
case, these connections are always asynchronous and bidirectional. We will of
course hide all that under an abstraction layer so that the components
themselves are shielded from having to know the particulars of how they talk to
each other. (In particular, for clarity of exposition I'm deliberately
describing things here using Unix-like abstractions such as processes and pipes,
though in the browser environment we'll more likely be wrapping the
aforementioned abstraction layer around pages, iframes, web workers, and their
associated browser APIs such as `postMessage`.)
Between any two kinds of components we define an interface that specifies what the allowed messages in each direction are and what operations they invoke. Every component implements some set of methods, some of which may be made available on more than one of its interfaces. In this sense, methods and interfaces are slightly orthogonal, despite most methods being interface-specific. Consequently, some of the component descriptions below will present the methods the components provide (and the operations they realize) separately from listing which interfaces those methods are exposed on.
One operation that is needed in various places is message translation, wherein a marshaled message is converted from the form needed in one frame of reference to that needed in another. For example, transferring a message from the Kernel to some Vat requires converting all the KRefs it contains to that Vat's corresponding VRefs. While this minimally includes simply mapping the various Object Reference strings through a table, importing an Object into some context where it was previously unknown may also require generating a new Reference identifier for it in the receiving context.
Although this design is not yet committing to a specific scheme for marshaling Object Messages and related data values, any acceptable scheme will need to make it easy and efficient to find and translate Object References embedded in a message's encoded representation. Agoric's `capdata` and `smallcaps` schemes both satisfy this requirement, as does OCapN. For expository purposes I'll adopt the approach (and accompanying terminology) used by both of Agoric's schemes. These schemes break the representation of a Data Object into two parts, labeled `body` and `slots`.
- `slots` is an array of strings, each of which is one of the Object References found in the Data Object being marshaled (nominally deduplicated, though this is not strictly required).
- `body` is an encoded representation of the structure and content of the Data Object itself, in which each of the embedded Object References is encoded by an index into the `slots` array.
With this arrangement, Object Reference translation can be performed by simply mapping over the `slots` array without having to parse or otherwise examine the potentially much more complicated and idiosyncratic `body`.
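For example, a value marshaled in this style, and the slot-mapping step the Kernel would perform when delivering it into a Vat, might look like the following (the exact `body` encoding shown is just illustrative, and a real implementation would also allocate new VRefs for References the Vat hasn't seen before):

```js
// A marshaled Data Object: Object References appear only in `slots`; the
// `body` refers to them by index.
const capdata = {
  body: '{"amount":3,"payee":{"@qclass":"slot","index":0}}',
  slots: ['ko21'], // here expressed as KRefs
};

// Translating from Kernel scope into a particular Vat's scope touches only
// the slots array; the body passes through untouched.
const translateForVat = (data, krefToVref) => ({
  body: data.body,
  slots: data.slots.map((kref) => krefToVref.get(kref)),
});

const krefToVref = new Map([['ko21', 'vo-5']]);
const vatCapdata = translateForVat(capdata, krefToVref);
// vatCapdata.slots is now ['vo-5']
```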
An Object Message is represented by a Data Object that contains all the message elements in a standardized form. For our purposes, a message consists of:
- target - a singular Object Reference indicating the Object to which the message should be delivered
- method - a string labeling the method to be invoked on the message target
- arguments - an array of zero or more values that are the arguments to the method call. These values can be primitive data values, Object References, or more complex Data Objects that may have further Object References or Data Objects embedded within them.
- result - a singular Object Reference indicating a promise that, when settled, will be the return value (or failure exception) of the method invoked
We will sometimes collectively refer to all the message elements except the target as the message payload. It is useful to distinguish this as its own abstraction because, in some cases, a message may need to be generated, handled, or processed prior to the target being known. For example, we may want to reuse a given message payload in order to send the same message to multiple recipients.
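In JavaScript terms such a message might look like this (the particular Reference strings just use the formats described in the next section, and the method name and arguments are made up):

```js
const objectMessage = {
  target: 'ko42',    // the Object to which the message should be delivered
  method: 'deposit', // the method to invoke on the target
  arguments: [3, { memo: 'rent', payer: 'ko17' }], // primitives, References, Data Objects
  result: 'kp9',     // a promise to be settled with the return value (or failure)
};

// The payload is everything except the target, so the same payload can be
// aimed at several different recipients.
const { target, ...payload } = objectMessage;
```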
The format of Object References proposed here is lifted wholesale from that used by Agoric's SwingSet. I don't believe our requirements introduce any new complications that we need to account for, nor have we proposed abandoning any functionality that might enable us to simplify the scheme. These patterns have had a lot of operational shakedown and are known to be robust, so there's no need to reinvent the wheel here.
Object references in the Kernel distinguish between regular objects and promises, since the latter require special handling. The pattern for KRefs will be `kTN`, where `k` indicates this is a KRef, T is either `o` or `p` to indicate whether the referenced thing is an object or a promise, and N is an ordinal integer assigned by the kernel from a counter. Thus, for example, `ko47` is Kernel object number 47, while `kp3` is Kernel promise number 3. A single counter would suffice for both, so that the value of N would be sufficient to uniquely specify the thing referenced. However, by maintaining separate counters for promises and objects we halve the rate at which KRef size grows. Note also that the `k` is technically redundant, but has historically proven extremely helpful in debugging to distinguish KRefs from other kinds of References appearing in log output and debug messages. (In this sense the `p` and `o` are also redundant, but eliminating them would require a table lookup any time we needed to know what something was, and since that will happen a lot it's more convenient to have it in the reference string directly. Also, it too is helpful in debugging.)
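A minimal sketch of the corresponding kernel-side allocation (ignoring, for the moment, that the counters must live in persistent storage):

```js
const makeKRefAllocator = () => {
  let objectCount = 0;
  let promiseCount = 0;
  return {
    allocateObjectKRef: () => `ko${(objectCount += 1)}`,   // e.g. 'ko47'
    allocatePromiseKRef: () => `kp${(promiseCount += 1)}`, // e.g. 'kp3'
  };
};
```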
VRefs are used for communicating across the Kernel/Vat boundary, so they need to be intelligible to both sides. Unlike KRefs, VRefs can be generated on either side of this boundary. Vats generate them when exporting objects from the Vat to the Kernel, while the Kernel generates them when importing from the Kernel to the Vat. As with KRefs, these can be generated using counters, but we need a separate counter for each side and we need to be able to distinguish which is which. This leads to a slightly more complex scheme: `vTSN`, where `v` indicates this is a VRef, T is either `o` or `p` to indicate whether the reference is to an object or a promise, S is either `+` or `-` to indicate which party generated the identifier: `+` from the Vat and `-` from the Kernel, and N is an ordinal indicating which specific thing it is. Thus, for example, `vp+12` is the 12th promise reference generated by the Vat, while `vo-63` is the 63rd object reference generated by the Kernel (for that Vat). Similar to the story with KRefs, in some sense the `v`, `o`, and `p` are redundant, but regardless of that they're overwhelmingly helpful in practice.
RRefs are used for communicating between Clusters over a Channel. Like VRefs, references can be generated at either end, requiring support for separate allocation counters on each side. This leads to a scheme of the form `rTSN`, which is parallel to the pattern used for VRefs except that we prefix the References with an `r` instead of a `v`. However, unlike the Vat/Kernel relationship, the relationship between the two ends of a Channel is symmetric, so there's no a priori basis for assigning the `+` to side A and the `-` to side B. There are various symmetry breaking schemes one might devise that would require some kind of handshake as part of connection setup, but that is not actually required. Instead, each side uses `+` for entities that exist on its local end of the Channel and `-` for entities that exist on the remote end. We then follow the principle that each side is very polite and will always encode any messages it sends using the form understood by its counterpart. This is accomplished via asymmetry in the translation tables mapping between KRefs and RRefs, where each `+` in a key in the local-to-remote table is matched by a `-` in the corresponding value in the remote-to-local table, and vice versa (whereas for VRefs the analogous tables implement a pure bidirectional mapping). Thus when side A sends `ro+47` to side B, it is understood to refer to object #47 existing on side B, which will then be referred to as `ro-47` in any messages sent back from B to A that reference the same object.
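Here's a sketch of how one side of a Channel might realize this (for brevity it keeps a single table in the local frame and flips the sign at the boundary, which is equivalent to the pair of asymmetric tables described above; promise RRefs and persistence are omitted):

```js
// Flip the polarity sign in an RRef, e.g. 'ro+47' <-> 'ro-47'.
const flipPolarity = (rref) =>
  rref.includes('+') ? rref.replace('+', '-') : rref.replace('-', '+');

// Translation state for one side ("Local") of a Channel. RRefs are stored in
// Local's own frame: '+' means the object lives here, '-' means it lives at
// Remote. Since each sender politely speaks in its receiver's frame, inbound
// references arrive already in Local's frame, and outbound references are
// flipped into Remote's frame just before transmission.
const makeChannelTranslator = (allocateKRef) => {
  const krefToRRef = new Map(); // KRef -> RRef in Local's frame
  const rrefToKRef = new Map(); // RRef in Local's frame -> KRef
  let exportCount = 0;

  return {
    // Encode a KRef for transmission to Remote.
    encodeOutbound: (kref) => {
      if (!krefToRRef.has(kref)) {
        exportCount += 1;
        const rref = `ro+${exportCount}`; // first export of a local object
        krefToRRef.set(kref, rref);
        rrefToKRef.set(rref, kref);
      }
      return flipPolarity(krefToRRef.get(kref)); // speak in Remote's frame
    },
    // Decode an RRef received from Remote (already in Local's frame).
    decodeInbound: (rref) => {
      if (!rrefToKRef.has(rref)) {
        const kref = allocateKRef(); // first sighting of one of Remote's objects
        rrefToKRef.set(rref, kref);
        krefToRRef.set(kref, rref);
      }
      return rrefToKRef.get(rref);
    },
  };
};
```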
The interactions between a Kernel and its Vats, and between Clusters (which is to say, between Kernels) are mediated by a protocol known as Ken. (The name itself doesn't stand for anything in particular; it's an allusion to Tyler Close's Waterken system in which the underlying principles were first articulated.) Ken is described in more detail in this paper from HP Labs and this paper given at Usenix 2012.
The purpose of many communications protocols, such as TCP/IP or X.25, is to compensate for the unreliability of the underlying data communications link, which can fail in a myriad of ways. Ken is not really one of those; its job is to compensate for the unreliability of the endpoints themselves. This is particularly important with systems of asynchronously interacting autonomous processes which are attempting to cooperate despite not being under a common administrative umbrella. In point of fact, Ken is not really a protocol as such but more of a design pattern that describes how communicating processes should behave and interact so as to ensure, as the papers describing it say, "output-valid rollback-recovery", a kind of fault tolerance that maintains global consistency of system state despite having autonomous components that can fail unpredictably.
Ken works particularly well in systems composed of event-loop driven processes, which is what this design is proposing. The basic outline is as follows:
- When a message is received, record its contents in a persistent medium before processing it.
- After it has been persisted, acknowledge its receipt to the sender.
- Process each received message to completion, one at a time. Only persist state changes resulting from that processing once completion is reached.
- This persistent state must include the contents of any additional messages that were sent as part of this message processing activity.
- These sent messages are only transmitted after they have been persisted.
- Only after a transmitted message has been acknowledged by its recipient can the message be removed from the sender's persistent record.
The idea is that all permanent external consequences of processing an event -- both state changes and message output -- manifest atomically. In particular, when an event handler sends a message, what it is really doing is adding that message to a queue of messages that will be released in a batch when it is done handling the event. If the handling of the event is aborted during processing, whether deliberately or as the consequence of system failure, it is as if the event processing never happened.
When an aborted process is restarted, part of its recovery procedure is to reconnect to any other processes it had been in communications with. Since it was persisting messages as they were received, it knows where it was in each incoming message stream and can request its counterpart on that stream to retransmit any messages that might have been sent subsequent to the most recent message it knows about.
Each ENDPOINT of a connection between two processes can be both the source and the destination of messages. In particular, in the Ken framework each Endpoint is the source of an ordered stream of messages to its counterpart and the destination of a stream of messages going in the opposite direction. Each message in one of these streams contains a sequence number, an ordinal indicating the message's position in the stream, counting upwards monotonically from 1. Sequence number counters in each direction are scoped to the Endpoint and last for the life of the connection. Note that when we talk about the "connection" between Endpoints, we are talking about the communications relationship between them that Ken provides. The scope of this connection potentially (depending on implementation) extends beyond the life of the raw, low-level communications link between the two parties that the Ken relationship is built on top of. In such cases, that low-level link can be broken (either deliberately or as the result of network errors or process crashes) and then resumed, while the message streams that the connection carries remain intact.
For purposes of explanation, it's useful to see things from the perspective of a single process. We'll refer to the process itself as the LOCAL ENDPOINT (or simply as LOCAL to avoid verbosity) and other processes that it has connections to as REMOTE ENDPOINTS (or REMOTES).
For each connection, Local maintains the following data:
- `recvCount` -- The count of messages received from Remote
- `sendCount` -- The count of messages sent to Remote
- `sendAckCount` -- The sequence number of the last message sent to Remote for which acknowledgement has been received
- `sendQueue` -- A queue of messages to Remote that have been produced by Local computational activity but not yet transmitted
- `sentMsgs` -- A collection of messages that have been transmitted to Remote but not yet acknowledged
- `recvQueue` -- A queue of messages that have been received from Remote but not yet processed

At the start of the connection, `recvCount`, `sendCount`, and `sendAckCount` are all 0, while `sendQueue`, `sentMsgs`, and `recvQueue` are empty.
In addition, for each incoming message, Local maintains a notional `toSendQueue`, an ordered collection of messages that begins empty and accumulates outgoing messages as the corresponding incoming message is being processed.
All of these values are kept in persistent storage (i.e., they will survive crash and restart of the Local process). This writeup will begin with the simplifying assumption that this data is always immediately and synchronously written to persistent storage whenever it is updated. In the interest of performance, in practice we don't actually need to persist this aggressively, as long as certain rules are followed. However, the particulars of those rules complicate the story, so in the interest of clarity we will describe the persistence rules and related complications separately.
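In JavaScript terms the per-connection record might look like this (field names follow the list above; exactly how and when it gets persisted is the subject of the rules alluded to above):

```js
// Per-connection Ken state for one Remote; all of it must ultimately live in
// persistent storage so it survives crash and restart of the Local process.
const makeConnectionState = () => ({
  recvCount: 0,        // count of messages received from Remote
  sendCount: 0,        // count of messages sent to Remote
  sendAckCount: 0,     // highest sequence number Remote has acknowledged
  sendQueue: [],       // produced locally but not yet transmitted
  sentMsgs: new Map(), // seqnum -> message transmitted but not yet acknowledged
  recvQueue: [],       // received from Remote but not yet processed
});
```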
The logic of message exchange with Remote `rem` goes like this:

When a message `msg` arrives from `rem`:

- If `msg.seqnum` is <= `rem.recvCount`, discard the message and take no further action. Otherwise:
- If `msg.seqnum` is != `rem.recvCount + 1`, raise a MessageOutOfOrder error
- Push the received message `msg` onto `rem.recvQueue`
- Increment `rem.recvCount`
- Transmit to `rem` an acknowledgement of `rem.recvCount` (Note: in practice this acknowledgement doesn't actually need to be transmitted eagerly but can be nagled. It can wait to piggyback on top of the next communication one would be sending to `rem` anyway, or simply be deferred for a time.)

When `rem.sendQueue` is not empty:

- Pop `msg` off `rem.sendQueue`
- Increment `rem.sendCount`
- Add a new entry `rem.sentMsgs[rem.sendCount] = msg`
- Transmit `msg` to `rem`

When `rem.recvQueue` is not empty:

- Pop `rmsg` off `rem.recvQueue`
- Set `rmsg.toSendQueue` to an empty collection
- Process `rmsg` to completion; any messages sent during processing are pushed onto `rmsg.toSendQueue`
- (Atomically) While `rmsg.toSendQueue` is not empty:
  - Pop `smsg` off `rmsg.toSendQueue`
  - Push `smsg` onto `smsg.target.sendQueue`

When an acknowledgment of `seqnum` arrives from `rem`:

- Set `rem.sendAckCount` to `seqnum`
- Delete all entries `rem.sentMsgs[n]` where `n <= seqnum`

When a retransmit request for `seqnum` arrives from `rem`:

- If `seqnum` <= `rem.sendAckCount`, raise a SequenceFailure error
- If `rem.sendCount` < `seqnum`, raise a SequenceFailure error
- If `rem.sentMsgs[seqnum]` does not exist, raise a SequenceFailure error
- Transmit, to `rem`, each message `rem.sentMsgs[n]` where `n >= seqnum`

To re-establish connectivity with `rem` after an interruption in the underlying communications link or a restart of the Local process:

- Transmit to `rem` a retransmit request for `rem.recvCount`
- Compute `lastSent`, where `rem.sentMsgs[lastSent]` is the message most recently transmitted to `rem`
- Transmit a `lastSent` status update to `rem`

When a `lastSent` status update arrives from `rem`:

- If `lastSent > rem.recvCount`, transmit to `rem` a retransmit request for `rem.recvCount`
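Pulling the above together, here is a condensed (and decidedly non-production) JavaScript rendering of the same logic; `rem` is a connection-state record like the one sketched earlier plus a `transmit` function, and persistence is assumed to happen implicitly whenever the record changes:

```js
const onMessageArrived = (rem, msg) => {
  if (msg.seqnum <= rem.recvCount) return; // duplicate: discard
  if (msg.seqnum !== rem.recvCount + 1) throw Error('MessageOutOfOrder');
  rem.recvQueue.push(msg);
  rem.recvCount += 1;
  rem.transmit({ type: 'ack', seqnum: rem.recvCount }); // may be deferred/nagled
};

const transmitPending = (rem) => {
  while (rem.sendQueue.length > 0) {
    const msg = rem.sendQueue.shift();
    rem.sendCount += 1;
    rem.sentMsgs.set(rem.sendCount, msg);
    rem.transmit({ type: 'msg', seqnum: rem.sendCount, msg });
  }
};

const processNextReceived = (rem, processToCompletion) => {
  const rmsg = rem.recvQueue.shift();
  const toSendQueue = [];
  processToCompletion(rmsg, toSendQueue); // event handler; sends accumulate here
  // Atomically release the accumulated sends to their targets' send queues.
  for (const smsg of toSendQueue) {
    smsg.target.sendQueue.push(smsg);
  }
};

const onAckArrived = (rem, seqnum) => {
  rem.sendAckCount = seqnum;
  for (const n of [...rem.sentMsgs.keys()]) {
    if (n <= seqnum) rem.sentMsgs.delete(n);
  }
};

const onRetransmitRequest = (rem, seqnum) => {
  if (seqnum <= rem.sendAckCount) throw Error('SequenceFailure');
  if (rem.sendCount < seqnum) throw Error('SequenceFailure');
  if (!rem.sentMsgs.has(seqnum)) throw Error('SequenceFailure');
  for (let n = seqnum; n <= rem.sendCount; n += 1) {
    rem.transmit({ type: 'msg', seqnum: n, msg: rem.sentMsgs.get(n) });
  }
};
```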
The Kernel interacts with three different (kinds of) entities via three different interfaces:
- Console (via the Console Daemon) to Kernel: KERNEL MANAGEMENT API
- Vats (via a Vat's Supervisor) to Kernel: SYSCALL API
- Remote Cluster (via a Channel) to Kernel: REMOTE API
In the individual method descriptions, each method will be tagged with markers indicating which interface or interfaces expose it and how:
- C - Console: the method is part of the Kernel Management API
- V - Vat: the method is part of the Syscall API
- R - Remote: the method is part of the Remote API
- I - internal, meaning that in addition to its appearance in one or more of the regular Kernel interfaces, the method is also employed as a compositional building block in the realization of other methods
- O - Ocap, meaning the method may be made available to selected Vats as an endowment (i.e., provided to the Vat's User Code at the time the Vat is launched), in the form of a (potentially revocable) object capability that allows the method to be invoked from within the Vat (potentially by User Code) to which the capability has been specifically granted
Inbound to the Kernel, messages are consumed eagerly (though subject to rate limiting to protect against attacks on availability) from the connection upon which they arrive, then queued within the Kernel itself. The Kernel maintains a separate queue for each connection to simplify prioritization of traffic handling. In general, the Console has priority over Vats, which in turn have priority over Channels, while contending traffic from multiple Vats or multiple Channels is serviced round-robin. More complicated scheduling strategies are possible and might be worth exploring, but I think that at this stage in our design process we know too little about our requirements and our likely operating ecosystem to have actionable opinions about what, if any, alternative scheduling schemes might be more desirable.
Outbound from the Kernel, messages are transmitted immediately, though subject to whatever buffering the host operating system might employ. In the fullness of time we probably need to consider how flow control might back up into actually blocking further processing on the sending end of an interface, but that seems like an advanced consideration that should not hold up engineering progress right now.
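As a sketch of the prioritization just described (Console first, then Vats, then Channels, round-robin within each of the latter two groups):

```js
// Pick the next inbound System Message to service. `consoleQueue` is a single
// queue; `vatQueues` and `channelQueues` are arrays of per-connection queues.
const makeScheduler = (consoleQueue, vatQueues, channelQueues) => {
  let vatCursor = 0;
  let channelCursor = 0;

  const pickRoundRobin = (queues, cursor) => {
    for (let i = 0; i < queues.length; i += 1) {
      const idx = (cursor + i) % queues.length;
      if (queues[idx].length > 0) {
        return { msg: queues[idx].shift(), nextCursor: idx + 1 };
      }
    }
    return undefined;
  };

  return () => {
    if (consoleQueue.length > 0) return consoleQueue.shift();
    const fromVat = pickRoundRobin(vatQueues, vatCursor);
    if (fromVat) {
      vatCursor = fromVat.nextCursor;
      return fromVat.msg;
    }
    const fromChannel = pickRoundRobin(channelQueues, channelCursor);
    if (fromChannel) {
      channelCursor = fromChannel.nextCursor;
      return fromChannel.msg;
    }
    return undefined; // nothing pending anywhere
  };
};
```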
For expository purposes, the various Kernel methods are presented in functional groups. This grouping does not imply any deeper semantics.
- `StartCluster(storage)` * [C] -- Launch a new Kernel process for a Cluster
  - Establish a connection to Kernel's Console Daemon.
  - Interrogate Kernel's persistent storage to determine the state of its Cluster. If the Kernel is being launched as a part of the launch of a new Cluster (indicated by absence of such state), initialize persistent state with:
    - New Cluster ID
    - Empty extant Vats table
    - Empty extant Channels table
  - If there was already a running Kernel process with this Cluster's Cluster ID:
    - Send error feedback to the Console
    - Exit
  - For each extant Vat vatID:
    - If the corresponding Vat process is already running: connect to the running Vat process
    - If not: invoke `RestartVat(vatID)`
  - For each extant Channel channelID:
    - Invoke `RestartChannel(channelID)`
    - If this succeeds: resynchronize the message streams to and from each of the corresponding remote Clusters according to the Ken protocol.
    - If this fails:
      - Invoke `TerminateChannel(channelID)`
      - Send appropriate error feedback to the Console

\* Note: while this is a Kernel operation, it is not actually a method, and thus is not exposed on any Kernel interface, since it cannot be invoked by sending a message to the Kernel: by definition, at the time the operation is initiated there is no running Kernel process to send a message to. Instead, it is the operation implied by launching the Kernel application at the operating system level in the first place. Parameters governing its behavior are passed as command line arguments (or as environment variable settings) rather than as message arguments.
- `StopCluster()` [C] -- Shutdown all of a Cluster's processes in a way that can be restarted
  - For each extant Channel, stop that Channel
  - For each extant Vat:
    - Invoke `StopVat()` on that Vat
    - Await `ReceiveCrankCompletion` resulting from the `StopVat` method
  - Checkpoint Kernel state
  - Exit
- `TerminateCluster()` [C] -- Permanently shutdown all of a Cluster's processes and reclaim its resources
  - For each extant Channel channelID, invoke `TerminateChannel(channelID)`
  - For each extant Vat, invoke `TerminateVat()` for that Vat
  - Delete Kernel's persistent state
  - Exit
- `LaunchVat(isFirstTime, bundleSpec, config, endowments)` [C, O] -- Create a new Vat and start it running in a process
  - Start an instance of the Supervisor application, passing it:
    - a flag indicating that it should start a new Vat instance
    - a pathname or bundle that the Vat is to execute as its User Code
    - any additional configured initialization parameters for the Vat instance
    - endowments: any special ocaps that the party launching the Vat wishes it to have
  - Add the newly launched Vat to the Kernel's extant Vats table
- `StopVat(vatID)` [C] -- Shutdown a Vat process, retaining its persistent state
  - Invoke `StopVat()` on the Vat
  - Await `ReceiveCrankCompletion` resulting from the `StopVat` method
  - Update the Vat's run state in the extant Vats table to note that it is stopped
- `UpdateVat(vatID, bundleSpec, updateConfig)` [C, O] -- Replace a Vat's code without loss of persistent state
  - Start an instance of the Supervisor application, passing it:
    - a flag indicating that it should restart an existing Vat instance
    - a pathname or bundle that the Vat is to execute as its User Code
    - any additional configured update parameters for the Vat instance
  - Update the Vat's run state in the extant Vats table to note that it is running
- `RestartVat(vatID)` [C, I] -- Restart a previously shutdown Vat from its persistent state
  - Start an instance of the Supervisor application, passing it:
    - a flag indicating that it should restart an existing Vat instance
  - Update the Vat's run state in the extant Vats table to note that it is running
- `TerminateVat(vatID)` [C, I, O] -- Shutdown a Vat process and discard its persistent state
  - Invoke `StopVat()` on the Vat
  - For each Cluster the Vat has exported remote References to: invoke `DropExports([refs])` on all such References, via the corresponding Channel
  - For each Cluster the Vat has imported remote References from: invoke `DropImports([refs])` on all such References, via the corresponding Channel
  - Remove the Vat from the extant Vats table
- `StartChannel(clusterSpec)` [I] -- Create a Channel to some remote Cluster
  - If there is already a Channel to the Cluster, fail
  - Open a connection to the Cluster
  - If this succeeds:
    - Construct a new Channel with the connection
    - Add the new Channel to the Kernel's extant Channels table
  - If this fails, fail
- `StopChannel(channelID)` [I] -- Disconnect a Channel, retaining the option to reconnect it
  - Drop the Channel's connection
  - Update the Channel's state in the extant Channels table to note that it is closed
- RestartChannel(channelID) [I] -- Reconnect a stopped Channel
  - Open a connection to the Cluster
  - If this succeeds: update the Channel's state in the extant Channels table to note that it is open
  - If this fails, fail
- TerminateChannel(channelID) [I] -- Permanently disconnect a Channel with no option to reconnect it
  - Send a TerminateChannel message over the Channel to the remote Cluster
  - Drop the Channel's connection
  - For each Vat exporting References to the remote Cluster: invoke RetireExports([refs]) on all such exports from that Vat
  - For each Vat importing References from the remote Cluster: invoke RetireImports([refs]) on all such imports from the Vat
  - Remove the Channel from the extant Channels table

OPEN DESIGN QUESTION: Should these methods also be exposed on the Console interface (which in an earlier draft they were)? AGAINST: the management of Channels should be automatic and the Channels themselves should be transparent, so enabling them to be fiddled with externally seems questionable. FOR: being able to fiddle with them could be helpful in the context of development and debugging.
- OcapURLToObjectReference(url) [C, O] -- Given an Ocap URL, provide the corresponding Object Reference
- ObjectReferenceToOcapURL(ref) [C, O] -- Given an Object Reference, provide an equivalent Ocap URL
- DeliverObjectMessage(targetRef, message) [C, V, R] -- Deliver an Object Message to the appropriate Vat
- NotifyPromiseSettlements([[promiseRef, value], ...]) [C, V, R] -- Deliver a batch of promise resolutions/rejections to the appropriate Vat
- ReceiveCrankCompletion(status, completionRecord) [V] -- Accept the results from a Vat of processing an Object Message or promise settlement, consisting of: result status, Object Message sends, and promise settlements
- DropImports([refs]) [V, R, I] -- Note that the invoker can no longer reach the given Objects
- RetireImports([refs]) [V, R, I] -- Note that the invoker can no longer recognize the given Objects
- RetireExports([refs]) [V, R, I] -- Note that the invoker no longer has the given Objects

Note: While the signature of each of these methods is the same regardless of which interface it appears on, Object References (which are simply strings) contained within arguments are interpreted in the context of that interface:
- C (Kernel Management API): KRefs
- V, O (Syscall API): VRefs
- R (Remote API): RRefs
- I (Internal): inherited from whoever called it
- ListVats() [C] -- Provide a list of all extant Vats, with accompanying descriptive metadata
- ListChannels() [C] -- Provide a list of all extant Channels, with accompanying descriptive metadata
- DebugVat(vatID) [C] -- Command a Vat to make itself available to the/a JavaScript debugger
- DumpVatInfo(vatID) [C] -- Provide an info record about a Vat, including its c-lists and persistent state
- DumpKernelInfo() [C] -- Provide an info record about the Kernel, including its persistent state
- DumpChannelInfo(channelID) [C] -- Provide an info record about a Channel, including its c-lists
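To make the shape of these operations a bit more concrete, here is a minimal sketch (in JavaScript) of how StopCluster might be composed from the per-Vat and per-Channel operations. Every name in it -- extantVats, extantChannels, checkpointKernelState, and so on -- is hypothetical and purely illustrative, not a commitment to any particular implementation.

```js
// Structural sketch only -- every name here is hypothetical, not part of the design.

const extantVats = new Map();      // vatID -> { runState, supervisor, crankCompletion }
const extantChannels = new Map();  // channelID -> an object with a stop() method

async function checkpointKernelState() {
  // Placeholder: persist the Kernel's tables to whatever storage we settle on.
}

// StopVat(vatID): stop one Vat, retaining its persistent state.
async function stopVat(vatID) {
  const vat = extantVats.get(vatID);
  await vat.supervisor.stopVat();   // invoke StopVat() on the Vat
  await vat.crankCompletion;        // await the ReceiveCrankCompletion from StopVat
  vat.runState = 'stopped';         // update the extant Vats table
}

// StopCluster(): shut down all of the Cluster's processes in a restartable way.
async function stopCluster() {
  for (const channel of extantChannels.values()) {
    await channel.stop();           // stop each extant Channel
  }
  for (const vatID of extantVats.keys()) {
    await stopVat(vatID);           // stop each extant Vat
  }
  await checkpointKernelState();    // checkpoint Kernel state
  // ...then exit the Kernel process (environment-specific).
}
```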
The primary job of the Kernel is routing Object Messages to and from Objects in its Vats. Each message has a source and a destination, either of which could be in one of its Vats or in some remote Cluster. (One might ask about the case where both source and destination are in remote Clusters and the Kernel is simply acting as a relay without any of its Vats actually participating. I believe the logic of the Kernel's operation will handle this case transparently and automatically. However, in the interest of not piling on complications, I'm not going to spend much energy right now analyzing this. Even so, this case could prove important in the future, so let's make a note not to forget about it.)
For purposes of this discussion we will treat notifications of promise resolutions and rejections as if they were Object Messages. The way these get handled within the Endpoints is quite different from how Object Messages get handled, but the logic of communications and routing is identical.
From the Kernel's perspective, there are three kinds of Endpoints we are concerned with:
- remote Clusters, connected via Channels to the Kernel
- local Vats, which the Kernel is managing
- the Console, mostly trivial for purposes of this discussion, but needs to be accounted for
The Kernel manages its communications with these Endpoints following the rules of the Ken protocol, described above. However, note that for a remote Cluster, connections (and related data such as message sequence numbers) are scoped to the Channel; the Channel wraps a communications link that can come and go, but if the Channel itself is closed or lost and then subsequently a new Channel to the same Cluster is opened, the message streams and the associated sequence counters are reset.
In addition to the counters and queues required by Ken, for each Endpoint, the Kernel maintains the following data:
- erefToKref -- A table mapping from the Remote Endpoint's Object Reference namespace to the Kernel's
- krefToEref -- A table mapping from the Kernel's Object Reference namespace to the Remote Endpoint's

Initially, erefToKref and krefToEref are empty.
The Kernel also maintains a single global table that is independent of Endpoint:
- krefToEndpoint -- A table mapping each KRef to the Remote Endpoint from which the Object Reference it denotes was imported
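A minimal sketch of what this per-Endpoint bookkeeping might look like as data structures; all names and shapes are hypothetical:

```js
// Hypothetical per-Endpoint record, alongside the counters and queues Ken requires.
function makeEndpointRecord(endpointID) {
  return {
    endpointID,
    erefToKref: new Map(),  // Endpoint-namespace Reference -> Kernel-namespace Reference
    krefToEref: new Map(),  // Kernel-namespace Reference -> Endpoint-namespace Reference
    sendQueue: [],          // translated messages waiting to be transmitted (Ken)
    sentMsgs: [],           // transmitted but not yet acknowledged messages (Ken)
  };
}

// Global table, independent of Endpoint: which Endpoint each KRef was imported from.
const krefToEndpoint = new Map();
```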
The logic of message routing follows that of Ken generally, with a few important wrinkles:
When a message msg arrives from Endpoint ep:

- Validate the sequence number according to normal Ken rules
- Translate msg into kmsg by mapping it from the Endpoint namespace into the Kernel namespace, replacing each Reference r in msg (msg.target, msg.arguments.slots[], and msg.result) with ep.erefToKref[r]. If there is no mapping table entry for r then:
  - If r is msg.target, raise an UnknownMessageTarget error
  - Otherwise add a new export from ep to the Kernel:
    - Generate a new KRef k
    - Add a new entry krefToEndpoint[k] = ep
    - Add a new entry ep.erefToKref[r] = k
- Push the translated message kmsg onto krefToEndpoint[kmsg.target].sendQueue
- Handle sequence numbers and acknowledgement according to normal Ken rules
When ep.sendQueue is not empty:

- Pop kmsg off ep.sendQueue
- Translate kmsg into msg by mapping it from the Kernel namespace into the Endpoint namespace, replacing each Reference r in kmsg (kmsg.target, kmsg.arguments.slots[], and kmsg.result) with ep.krefToEref[r]. If there is no mapping table entry for r then:
  - If r is kmsg.target, raise an InternalConsistencyFailure error
  - Otherwise add a new import from the Kernel to ep:
    - Generate a new ERef e
    - Add a new entry ep.krefToEref[r] = e
- Add msg to ep.SentMsgs and transmit according to normal Ken rules
The difference, in summary, is that the Object References embedded in Object Messages undergo namespace translation appropriate for the Remote Endpoint.
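As a sanity check on the inbound half of this logic, here is a rough sketch of the translation and routing steps, reusing the hypothetical per-Endpoint record and krefToEndpoint table sketched earlier. The KRef allocation scheme, message shape, and error handling are illustrative only.

```js
// Illustrative KRef allocator; the real naming scheme is undecided.
let krefCounter = 0;
const makeKref = () => `k${(krefCounter += 1)}`;

// Translate an inbound msg from Endpoint ep's namespace into the Kernel's.
function translateInbound(ep, msg) {
  const mapRef = (r, isTarget) => {
    let k = ep.erefToKref.get(r);
    if (k === undefined) {
      if (isTarget) {
        // An unknown target cannot be a new export; it's an error.
        throw Error(`UnknownMessageTarget: ${r}`);
      }
      // Otherwise this Reference is a new export from ep to the Kernel.
      k = makeKref();
      krefToEndpoint.set(k, ep);
      ep.erefToKref.set(r, k);
    }
    return k;
  };
  return {
    target: mapRef(msg.target, true),
    arguments: { ...msg.arguments, slots: msg.arguments.slots.map((r) => mapRef(r, false)) },
    result: msg.result === undefined ? undefined : mapRef(msg.result, false),
  };
}

// Route the translated message to the Endpoint that exported its target.
function routeInbound(ep, msg) {
  const kmsg = translateInbound(ep, msg);
  krefToEndpoint.get(kmsg.target).sendQueue.push(kmsg);
}
```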
Promises were originally invented to enable messages to be sent to the result of an operation before that result has been determined, which allows for looser coupling of asynchronous processes. With suitable pipelining support, this can enable substantial speedup of distributed systems (by a factor of 100 to 10,000 as measured in some use cases) by reducing or eliminating delays due to network round trip latency. However, since messaging as such is not (yet) part of the JavaScript specification, when promises were added to the language in 2015, their design incorporated no direct support for this fundamental motivating use case. Endo implements mechanisms for eventual send and handled promises, which work hand in hand to allow us to realize this missing functionality. (Those of us who were involved in developing these building blocks are working to get them added to the language officially, but the TC39 process moves slowly, especially when, as here, the abstractions being proposed are unfamiliar to many members of the committee.)
The eventual send operator, E
, wraps a JavaScript object reference in a proxy
that will transform a function application into an asynchronous message send.
For example E(foo).fargulate(47, 'whacka-whacka')
will send the fargulate
message to the object designated by foo
with 47
and 'whacka-whacka'
as the
parameters. If the target of the message send is a regular JavaScript object,
the effect will be an asynchronous function application. If the target is a
JavaScript promise, the message will be captured in a then
closure, to be
transformed into a message send to the resolution of the promise (which in turn
will recursively be handled as a function application or further message send
depending on whether that result was an object or a promise).
All of the above works within the context of a single JavaScript process' address space. However, if the target of a message send is a Reference to a promise in another Vat, then further machinery is required. The handled promise class wraps a promise in a proxy-like abstraction that allows one to register a handler that can take arbitrary action when the promise is resolved or rejected or when a method is invoked on it. In particular, when a message is sent to an unresolved promise, instead of being delivered to a target object (which it can't be because that object is not yet determined), it is added to a queue of messages directed toward that promise. When the promise is resolved, the contents of this queue are delivered (in order) to the object that the promise resolved to. This requires us to hold this queue somewhere. Since messages sent from a Vat get serialized and passed to the Kernel, it makes sense for the Kernel to be the one who does the holding. Since the messages will be in their serialized form, they can be stored in persistent storage, enabling them to continue to exist (and eventually be delivered) even if the Vat that sent them exits.
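A rough sketch of the Kernel-side bookkeeping this implies -- a per-promise queue of serialized messages that gets drained, in order, when the promise settles. All names here are hypothetical.

```js
// Hypothetical Kernel-side queues of messages addressed to unresolved promises.
const promiseQueues = new Map();  // promise KRef -> array of serialized messages

// A message whose target is an unresolved promise is held rather than delivered.
function enqueueToPromise(promiseKref, serializedMsg) {
  if (!promiseQueues.has(promiseKref)) {
    promiseQueues.set(promiseKref, []);
  }
  promiseQueues.get(promiseKref).push(serializedMsg);  // could equally live in persistent storage
}

// When the promise resolves to an object, drain its queue to that object, in order.
function onPromiseResolved(promiseKref, resolutionKref, deliver) {
  const queued = promiseQueues.get(promiseKref) ?? [];
  promiseQueues.delete(promiseKref);
  for (const msg of queued) {
    deliver(resolutionKref, msg);  // e.g., route as an ordinary Object Message
  }
}
```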
As discussed above, when an Object Reference is delivered to a Vat in a message
for the first time, the Vat is said to import that Reference. Similarly, when
a Vat transmits an Object Reference in a message for the first time, the Vat is
said to export that Reference. (Note: In a possibly quixotic effort to avoid
some potential confusion, I'd like to highlight that these uses of the words
"import" and "export" are distinct from the parallel use of the very same words
in discussions of the JavaScript module system with its import
and export
keywords. If we had it to do all over again ... etc.)
Reference imports and exports are important in the context of garbage collection.
From the exporting Vat's point of view, the fact that a Reference has been exported means that some entity external to the Vat is potentially pointing to the referenced Object. The export therefore anchors the Object in memory and prevents it from being garbage collected even if there are no pointers to the Object within the User Code's memory (in practice this is realized by having the Supervisor maintain a reference to the Object in a Map).
When a Reference is imported into a Vat, the importing Vat is treated as if it
were pointing to the Object just as if some other Object in the exporting Vat
were doing so. The Kernel is responsible for keeping track of which other Vats
the reference has been imported into. When an imported object becomes
unreferenced inside the importing Vat, as part of garbage collection the Vat
signals to the Kernel (via a dropImports
message to the Kernel's syscall
interface) that it has dropped the Reference. When the Kernel notices that the
number of referencing Vats has fallen to zero, it in turn signals the original
exporting Vat (via a dropExports
message to the Supervisor), at which point
the Vat is free to garbage collect the Object if it continues to be unreferenced
within the Vat. The exact same logic applies to References transmitted between
Clusters, where Channels play the role of exporting or importing Vats.
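A toy sketch of the Kernel-side tracking this paragraph describes: a per-KRef set of importers, reduced on dropImports, with the exporter notified when the set empties. Names are hypothetical.

```js
// Hypothetical tracking of which Vats/Channels currently import each exported KRef.
const importers = new Map();  // kref -> Set of importer IDs (Vat IDs or Channel IDs)

function noteImport(kref, importerID) {
  if (!importers.has(kref)) {
    importers.set(kref, new Set());
  }
  importers.get(kref).add(importerID);
}

// Handle a dropImports from one importer; notify the exporter when nobody is left.
function dropImport(kref, importerID, notifyExporter) {
  const holders = importers.get(kref);
  holders.delete(importerID);
  if (holders.size === 0) {
    importers.delete(kref);
    // No Vat or Channel can reach this Object any more: tell the exporting side
    // (e.g., a dropExports message to the exporting Vat's Supervisor), which may
    // then garbage collect the Object if it is otherwise unreferenced.
    notifyExporter(kref);
  }
}
```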
Unfortunately, there are a couple of additional subtleties that further complicate this picture.
The first subtlety is that we've been speaking loosely about a Vat referencing an imported Object, as if "referencing" were a singular kind of activity. However, there are actually two different senses of "reference" that both need to be accounted for. An Object is said to be reachable if it is possible to send a message to it or to store a reference to it in a property of some other object; this is the conventional sense in which we think of object references in the context of garbage collection. An Object is said to be recognizable if a Reference to it has been used as a key in a weak collection of some kind (e.g., a WeakMap or WeakSet). If an Object is recognizable but not reachable, it means that the Vat has no hold on the Object (i.e., the recognizing Vat is not preventing the exporting Vat from garbage collecting the Object) but nevertheless, if a future message were to reintroduce the Reference to it into the importing Vat, the Reference should continue to be usable as a key into the aforementioned weak collection, meaning (1) the collection needs to retain the key (along with whatever value the key indexes) despite the fact that the key itself is nominally unreferenced and by the normal rules of weak collections should be garbage collectible, and (2) in order for this to work, the Kernel needs to reintroduce the Reference using the same VRef.
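One way to picture what recognizability demands of the Supervisor is a weak-ish store keyed by VRef rather than by object identity, so that a later reintroduction of the same Reference (under the same VRef) still finds the entry. This is only a conceptual sketch with made-up names, not a proposed mechanism:

```js
// Conceptual sketch only: a Supervisor-provided weak store keyed by VRef, so an
// imported Presence stays *recognizable* after it stops being *reachable*.
const valuesByVref = new Map();        // VRef -> stored value (retained despite "weakness")
const vrefOfPresence = new WeakMap();  // Presence object -> its VRef

function registerPresence(presence, vref) {
  vrefOfPresence.set(presence, vref);  // done by the Supervisor at import time
}

function weakStoreSet(presence, value) {
  valuesByVref.set(vrefOfPresence.get(presence), value);
}

function weakStoreGet(presence) {
  // Works even if `presence` is a brand new object reintroduced later by a message,
  // provided the Kernel reintroduces the Reference under the same VRef.
  return valuesByVref.get(vrefOfPresence.get(presence));
}
```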
Operations on the Kernel Syscall API and Vat Management API use the term drop in the names of messages used to communicate the loss of reachability, and the term retire in the names of messages used to communicate the loss of recognizability.
The second subtlety is that none of the logic described above really accounts for the additional machinery needed to properly garbage collect circular chains of reference that cross Vat boundaries. In practice, this requires significant additional bookkeeping for relatively little practical benefit. Operational experience has shown cyclic garbage to be relatively rare in practice, especially if a modest amount of care is taken to avoid it. It does remain a potential denial-of-service vulnerability, since without a cyclic GC solution it is possible for collaborating entities to construct uncollectable cycles on purpose. We discount this security concern somewhat because such malicious entities can easily overwhelm memory in other ways if costs are not imposed on them for doing so, and any such costs that are imposed should address cyclic garbage as well. Nevertheless, it remains possible to wind up with an unreferenced distributed cycle by mistake, and thus a potentially expensive storage leak, even when people are deliberately trying not to. This raises another OPEN DESIGN QUESTION: what priority should the cyclic GC problem have and how much do we care to invest in dealing with it?
A Vat is made up of the Supervisor and the User Code. Unlike most of the other components described here, the Supervisor and the User Code are together regarded as a single intentional unit, the Vat. While the User Code actually does whatever job that particular Vat has been put in place to do, the User Code itself is (usually) untrusted, since it can contain unknown code potentially coming from anywhere. The role of the Supervisor is to provide a trusted intermediary between the Kernel and the User Code.
Operation of a Vat involves three interfaces:
- Kernel to Supervisor: VAT MANAGEMENT API
- Supervisor to User Code: USER CODE INTERFACE
- User Code to Supervisor: VAT SERVICES API
The Vat Management API is how the outside world talks to a Vat. The Supervisor handles all interactions with the Kernel on behalf of its Vat. It receives System Messages from the Kernel via the Vat's Vat Management API, while it sends messages to the Kernel via the Kernel's Syscall API interface.
The User Code Interface and Vat Services API are strictly internal to the Vat.
Unlike almost all the other interfaces in this design, the User Code Interface is an asynchronous function call interface rather than a serialized data stream. The Supervisor dequeues an event from its run queue (put there by System Messages received via the Vat Management API) and, if it is an Object Message, dispatches it to the User Code, which then processes it to completion and returns to the Supervisor.
The Vat Services API is a synchronous function call interface that the Supervisor provides to the User Code as part of the Vat startup procedure. It is how the Supervisor provides various services to the User Code, such as the ability to send Object Messages.
All Supervisor activity aside from I/O (i.e., reading System Messages from the Kernel to Vat connection and capturing them in the Vat's run queue) happens when the Supervisor has agency, either between invocations of the User Code or during execution of one of the service functions in the Vat Services API. At any given time either the Supervisor has agency or the User Code does, but never both. This is important because while Supervisor code is running it cannot be interfered with by concurrently executing User Code. More importantly, once the User Code has completed processing an event the Supervisor can be assured that no additional User Code will run until such a time as the Supervisor chooses to allow it.
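A minimal sketch of this agency alternation, with entirely hypothetical names; the point is only that the Supervisor regains control before and after each dispatched event.

```js
// Hypothetical Supervisor run loop: the Supervisor has agency except while awaiting
// the User Code's handling of the single event it has dispatched.
async function supervisorRunLoop(runQueue, dispatchToUserCode) {
  for (;;) {
    const event = await runQueue.next();  // filled by System Messages from the Kernel
    // Supervisor has agency here: begin the crank, translate References, etc.
    if (event.type === 'objectMessage') {
      await dispatchToUserCode(event.message);  // User Code has agency until this settles
    }
    // Supervisor has agency again: finish the crank, report completion to the Kernel.
  }
}
```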
Unlike the Kernel's interfaces, the sets of methods exposed on the three Vat interfaces are disjoint. In other words, each interface presents a unique and distinct group of methods.
- LaunchVat(isFirstTime, storage, bundleSpec, config, endowments) * -- Start an instance of the Supervisor application, passing it:
  - a flag indicating whether it is launching a new Vat instance or restarting an existing Vat instance
  - access to the persistent storage that the running Vat will use
  - a pathname or bundle that the Vat is to execute as its User Code, or a null indicating that an existing Vat instance should deploy the User Code it already has
  - any additional configured initialization or update parameters for the Vat instance
  - any special ocaps that the party launching the Vat wishes it to have
* Note: while this operation is described as part of the Vat Management API, it is not actually a method (and thus not part of the literal API as such), since it cannot be invoked by sending a message to the Vat: by definition, at the time the operation is initiated there is no Vat to send a message to. Instead it is the operation implied by launching the Supervisor application at the operating system level. Parameters governing its operation are passed as command line arguments rather than as message arguments.
- DeliverObjectMessage(message) -- Deliver an Object Message to this Vat
- NotifyPromiseSettlements([[promiseRef, value], ...]) -- Deliver a batch of promise resolutions/rejections to this Vat
- DropExports([refs]) -- Note that the world outside this Vat can no longer reach the given Objects
- RetireExports([refs]) -- Note that the world outside this Vat can no longer recognize the given Objects
- RetireImports([refs]) -- Note that the world outside this Vat can no longer present the given Objects for recognition
- StopVat() -- Shutdown this Vat's process, retaining its persistent state
- TerminateVat() -- Shutdown this Vat's process and discard its persistent state
- DumpVatInfo() -- Provide an information record about this Vat, including its persistent state, suitable for use in debugging
- Init(startMode, syscall, configParameters) -- Initialize the User Code. The actual initialization that will be done is specific to whatever the User Code is implementing; the interface merely specifies a standard set of parameters including:
  - a flag indicating whether this is the initial launch, an ordinary restart, or a restart after upgrade
  - a reference to the Supervisor services (syscall) object
  - configured initialization/upgrade parameters (including endowments)
- Invoke(target, args) * -- Invoke a function or method inside the User Code
- Settle(promise, successFlag, value) * -- Resolve or reject a promise contained within the User Code's state, possibly triggering execution of then handlers registered on that promise

* Note: Invoke and Settle are not actually methods exposed on the User Code Interface per se. Instead, they represent the actual direct method invocation, or promise resolution/rejection, on the relevant targets. This is possible because the Supervisor and the User Code execute in the same address space. Externally initiated invocations or settlements are only possible for targets that have previously been exported by the User Code by earlier outbound messages or promise settlements.
- SendObjectMessage(target, args) -- Send an Object Message to some target external to the Vat
- SettlePromise(promise, successFlag, value) -- Notify the world outside the Vat of the settlement of some previously exported promise
- Exit(status) -- Permanently shutdown this Vat
- Persistent data store methods TBD
In all cases the actions initiated by User Code invocations of Vat Services API methods do not take place immediately. Instead, they are accumulated into the crank completion record for whatever crank the Vat is currently executing, and will be acted on collectively when the User Code finishes and surrenders agency to the Supervisor. At that time the actions will be executed by the Supervisor or added to the completion record that will be relayed to the Kernel via the Kernel's ReceiveCrankCompletion method as part of executing the Ken protocol on the Vat's behalf.
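A sketch of what this deferral might look like inside the Supervisor, with hypothetical names; the real record format and syscall plumbing are yet to be designed:

```js
// Hypothetical sketch of deferring Vat Services API actions to the end of a crank.
function makeVatServices(sendToKernelSyscall) {
  let crank = null;

  const beginCrank = () => {
    crank = { status: 'ok', sends: [], settlements: [] };
  };

  // Called by User Code during the crank; nothing leaves the Vat yet.
  const sendObjectMessage = (target, args) => {
    crank.sends.push({ target, args });
  };
  const settlePromise = (promise, successFlag, value) => {
    crank.settlements.push({ promise, successFlag, value });
  };

  // Called by the Supervisor after the User Code has surrendered agency.
  const endCrank = () => {
    sendToKernelSyscall('ReceiveCrankCompletion', crank.status, {
      sends: crank.sends,
      settlements: crank.settlements,
    });
    crank = null;
  };

  return { beginCrank, endCrank, services: { sendObjectMessage, settlePromise } };
}
```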
As described above, the Console consists of two components, the Console Daemon and the Console UI. In terms of the overall system design, we have little to say in this section about the Console UI aside from the interface that the Console Daemon exposes to it. This is not to say the Console UI design is not vitally important, but in considering its design we are principally concerned with usability rather than fundamental semantics. Moreover, as previously explained, we expect there to be multiple flavors of Console UI for different kinds of users and for different purposes, so even in a mature system there's no definitive design to specify.
The Console Daemon exists mainly to mediate access to the Cluster in terms of Petnames rather than raw, low-level Object References. It runs as a separate process so that the mapping table can be managed independently from the comings and goings of Console UI instances. OPEN DESIGN QUESTION: Might the Console Daemon be better conceptualized not as a background process but as a Kernel interface library (to be linked directly into individual Console UI implementations) that accesses a shared Petname table database? This might simplify some things, though it would require adopting a database that supports concurrent access by multiple readers and writers. I have an intuition that this latter approach is a potential footgun, and in particular is risky from a security perspective, but I don't at this time feel like I have an articulable justification for that intuition. However, for the time being I'm going to proceed with the daemon process approach until we have a chance to discuss the question more deeply.
The Console Daemon nominally presents two interfaces, but really only one:
- Console UI to Console Daemon: CONSOLE SERVICES INTERFACE
- Kernel to Console Daemon
The Console Services Interface is essentially a tamed version of the Kernel Management API, exposing those Kernel methods that are suitable for direct manipulation by a user in a form adapted to such use.
The Kernel to Console Daemon interface isn't really an interface in the sense of all the other interfaces we're defining here. There are no actual methods, as all information that flows from the Kernel to the Console does so in the context of responses to Kernel Management API methods rather than the Kernel initiating an operation itself -- in principle the Kernel should be able to run on its own for extended periods of time without any Console attached, so it only speaks to the Console when the Console speaks to it. However, as we transition to an actual implementation, the content of these method call responses will need to be worked out in detail and should be documented here.
Essentially all of the methods described above for the Kernel that are tagged with C are replicated on the Console Services Interface. The key difference is that Object References (as found, for example, in message deliveries) are expressed in terms of Petnames.
Note that most of these Kernel methods are quite inappropriate for direct exposure to human users, but are instead made available so that the Console UI can implement higher-level operations that are appropriate to give to people directly (without taking a position as to what those appropriate operations might or ought to be). The exception is debugging and development tools, which might provide affordances for doing all kinds of things manually that normally would only be done by code. However, since these are both powerful and potentially dangerous, the availability of such affordances should be gated by security checkpoints or at least buried deep in some menu, so that unsophisticated users (which is to say, most users) don't get entangled with them by accident when they don't know what they're doing, just as you wouldn't provide an ordinary user with a hex debugger that lets them poke directly into memory.
The remaining methods that are unique to the Console Services Interface are concerned with managing the Console's Petname table, which maps between user-provided Petnames and KRefs.
- NameObject -- Assign a Petname (and optionally a description) to an Object. Note that a given Object can have more than one Petname.
- RenameObject -- Change the Petname of an Object to a different string
- RemovePetname -- Remove an entry from the Petname table. Note that this does not remove the Object itself; it simply makes the Object inaccessible to direct manipulation via the name being dropped.
- ListPetnames -- Return a list of the currently extant Petnames (and optionally the descriptions that go with them)
The available options for persistent storage pose substantial design constraints on the architecture described here. If our scope was limited to environments such as Node, I'd simply pick SQLite as the underlying storage medium and then feel free to specify the details of the persistence subsystem entirely on the basis of our functional requirements. However, while we do anticipate running inside Node, our principal deployment environment is the browser, which limits us considerably. Each of the persistent storage options available inside the browser environment is unsatisfying in annoying ways, even given that the options available to browser extensions are more expansive than those afforded to regular web pages.
In particular, since we expect Vats to be handling valuable and delicate things like financial transactions, we'd very much like our persistent storage to have ACID properties, which for the most part the available storage options do not innately provide. While it is reasonably straightforward to achieve consistency and isolation with careful programming and a suitable data architecture, atomicity and durability remain problematic.
After eliminating obviously unsuitable persistence mechanisms such as cookies, we have three options:
- browser.storage.local -- This is a variant of the localStorage API that is provided for use by extensions. It differs from localStorage in being asynchronous, able to natively store values of a variety of JavaScript types rather than just strings, and somewhat less subject to capricious erasure of stored data due to disk space pressure or weird timeouts. Notably, this is the storage option used by the current version of MetaMask Snaps. It is a key-value store with semantics similar to a JavaScript Map. While its version of the set operation supports writes to multiple keys in a single call, nowhere in the available documentation have I been able to find any indication whether these are written atomically or not (my suspicion is that they're not atomic, since, if they were, I would expect that to be the kind of thing that would be advertised as a feature). Nor is there any indication as to whether one can assume that data is safely written to disk at the point that a write operation's result promise resolves. Since this interface has no concept of a database transaction as such, in the (highly speculative) event that a multiple-key set does execute atomically, we'd be forced to arrange things so that all the data updates produced in a crank can be structured to be written together (a small illustration of such a batched write follows this list). While this would not be a blocker, it would probably be a major annoyance. Bottom line: asynchronicity is not helpful for our use case, atomicity is likely a problem, and durability is uncertain.
- indexedDB -- This is the officially blessed web standard for web applications requiring long term or large capacity storage. It is an asynchronous NoSQL database. It may be more suited to our needs than storage.local, since it does provide for atomic transactions. However, this feature comes with a huge caveat: its transaction model has bizarre autocommit semantics that may make its transactions impossible for us to use effectively, since asynchronous non-database operations interleaved with a transaction can result in surprise premature commits. It appears that their transaction model exists simply to enable applications to perform multiple correlated data updates rather than to provide a generalized rollback mechanism that can be used in the event of arbitrary failures. Like localStorage, there are no documented guarantees about durability of writes. Bottom line: asynchronicity that's not helpful for our use case, atomicity may be effectively unattainable, and durability is uncertain.
- OPFS -- The Origin Private File System provides web applications with access to a chrooted file system with a standard set of operations for navigating the directory hierarchy, manipulating directories, and reading & writing files. In particular, it supports synchronous file I/O with significantly higher performance than you get from the various other asynchronous data storage options. The downside is that whatever storage logic you want to have on top of this must be provided by you, possibly at the cost of significant engineering work and certainly at the cost of making the install package bigger. However, a WASM-based SQLite3 implementation is now available in early form and might be suitable for our use. Bottom line: this might be the solution to all our persistence problems; on the other hand it might be a mirage.
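For concreteness, here is roughly what batching one crank's worth of updates into a single multi-key write would look like with browser.storage.local. Whether such a call is actually atomic or durable is exactly the open question raised above, and the key naming scheme here is invented for illustration.

```js
// Batch all of one crank's state updates into a single multi-key set() call, in the
// hope (not a documented guarantee) that the write is applied atomically.
async function commitCrank(vatID, crankNum, updates) {
  // `updates` maps storage keys to new values, e.g. { 'vat.v3.o+12.state': {...} }
  // -- the key naming scheme here is invented for illustration.
  await browser.storage.local.set({
    ...updates,
    [`vat.${vatID}.lastCrank`]: crankNum,
  });
  // Note: nothing in the documentation says the data is durably on disk when this resolves.
}
```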
A key design decision is whether Vats' access to persistent storage should be
implemented in the Kernel and provided to the Vats via syscalls, or whether it
should instead be implemented directly in the Supervisor. If we go with the
storage.local
option, it will necessarily be a Kernel service because there is
a single data store that belongs to the extension and the API is not available
to workers. In the case of indexedDB
or something like an OPFS Sqlite
database (or maybe some other OPFS-based solution), we have a choice.
Points in favor of centralizing all persistence functionality in the Kernel:
-
Reduced Supervisor size and thus reduced Vat memory footprint, since the database access code would otherwise have to be replicated in each Vat worker.
-
Centralizing operations is probably better from a logging and monitoring point of view, and in some cases may help with debugging (notably, puzzling out what is happening when multiple Vats try to engage in some kind of coordinated action).
-
Allows the Kernel to manage the Ken protocol on the Vats' behalf, essentially letting a Cluster be treated as a monolith with respect to distributed consistency and taking Kernel<->Vat messaging out of the Ken story entirely. This could be a short-term win if we're not doing intercluster communication at all to start with and therefore we could postpone all the Ken implementation work.
Points in favor of having Vats handle their own persistence:
-
Concurrently executing Vats can run more independently of the Kernel and of each other. In particular, Vats would not be contending for the Kernel's attention: a Vat that needs just a few storage operations won't get slowed down by some other Vat that's doing a lot of storage operations. This might also be a security improvement, since it can somewhat reduce the ability of Vats to use syscall timing information to sniff what other Vats might be doing (though this is a minimal benefit since I'd expect us to follow the practice of keeping the clock closely held).
-
Avoids having storage syscalls be a Kernel performance bottleneck in general.
-
Removes a bunch of operations from the syscall API, making the Kernel implementation generally smaller and simpler.
-
Maintaining separate per-Vat data stores is probably better for Vat isolation. In particular, garbage collecting dead Vats becomes immensely simpler.
-
Avoids the IPC cost of a very large number of syscall roundtrips between the Kernel and the Vats (Agoric's experience was that the overwhelming majority of traffic between the SwingSet kernel and its vats turned out to be vat storage requests).
My personal inclination strongly favors the non-centralized approach, but it's something we should discuss. In particular, since the centralized approach seems to be how Snaps works, it's worth considering whether or not changing this would be disruptive.
three party handoff - priority? necessity? threat or menace?