Wire format
===========
This document describes the Corda wire format. With the following information and an implementation of the AMQP/1.0
specification, you can read Corda serialised binary messages. Example implementations of AMQP/1.0 include Apache
Qpid Proton and Microsoft AMQP.NET Lite.
Header
------
All messages start with the 8 byte sequence ``corda\1\0\0``, that is, the string "corda" followed by a byte with the value 1 and then two
zero bytes. That means you can't feed a Corda message directly into an AMQP library. You must check the header and
then skip it.
The '1' byte indicates the major version of the format. It should always be set to 1. If it isn't, a backwards
incompatible serialisation format has been developed and you should abort. The two bytes that follow it are incremented when we make
extensions to the format. You can usually ignore these.
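A minimal Python sketch of the header check described above. The function name ``split_corda_header`` is invented for this illustration, and the code relies only on what this section says about the eight header bytes.

```python
from typing import Tuple

MAGIC = b"corda"  # the five-byte string that opens every message


def split_corda_header(message: bytes) -> Tuple[Tuple[int, int, int], bytes]:
    """Validate the 8 byte header and return (version bytes, remaining payload)."""
    if len(message) < 8 or message[:5] != MAGIC:
        raise ValueError("not a Corda serialised message")
    major, ext1, ext2 = message[5], message[6], message[7]
    if major != 1:
        # A different major version means an incompatible format: abort.
        raise ValueError(f"unsupported major version {major}")
    # The two extension bytes are returned but can usually be ignored.
    return (major, ext1, ext2), message[8:]
```

The returned payload is what would then be handed to an AMQP library.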
AMQP intro
----------
AMQP/1.0 (which is quite different to AMQP/0.9) is a protocol that contains a standardised binary encoding scheme, comparable to but
more advanced than Google protocol buffers. `The AMQP specification <https://docs.oasis-open.org/amqp/core/v1.0/os/amqp-core-types-v1.0-os.html>`_
is quite concise and easy to read: this document will reference it in many places. It also provides a variety of encoded examples
that can be used to understand each byte of a message.
The format specifies encodings for several 'primitive' types: numbers, strings, UUIDs, timestamps
and symbols (these can be thought of as enum entries). It also defines how to encode maps, lists and arrays. The difference
between the latter two is that arrays always contain a single type, whereas lists can contain elements of different types.
An AMQP byte stream is simply a repeated series of elements.
So far, so standard. However, AMQP goes further than most such tagged binary encodings by including the concept of
*described types*. This is a way to impose an application-level type system on top of the basic "bags of elements"
that low-level AMQP gives you. Any element in the stream can be prefixed with a *descriptor*, which is either a string
or a 64 bit value. Both types of label have a defined namespacing mechanism. This labelling scheme allows sophisticated
layerings to be added on top of the simple, interoperable core.
AMQP therefore also defines a type system and schema representation that allows you to create the app-level type layer.
Standard AMQP defines an XML based schema language. Fields can be grouped together using *composite types*. A composite
type is simply a described list, in which each list entry is one field of the composite. Composites are used to encode
language-level classes, records, structs etc.
You can also define a *restricted type*, which is a new type that is a specialisation or subset of
an existing one. For enumerations the choices can be listed in the schema.
Due to this design you can think of a serialised message as being interpretable at several levels of detail.
You can parse it just using the basic AMQP type system, which will give you nested lists and maps containing a few basic
types. This is similar to what JSON would give you. Or you can utilise the descriptors and map those containers to higher
level, more strongly typed structures.
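To make the two levels of interpretation concrete, here is a rough Python sketch of a decoder for a small subset of the AMQP/1.0 encodings: nulls, booleans, fixed-width integers, strings, symbols, binary, lists, maps and described values. The ``Described`` class and ``decode`` function are illustrative names, not part of any library. Arrays, floating point numbers, timestamps and UUIDs are deliberately left out, and a real reader should rely on a full AMQP library instead.

```python
import struct
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass
class Described:
    descriptor: Any  # a symbol (str) or an unsigned long (int)
    value: Any


# Format codes with no payload, and fixed-width codes mapped to struct formats.
_EMPTY = {0x40: None, 0x41: True, 0x42: False, 0x43: 0, 0x44: 0}
_FIXED = {
    0x56: ">?", 0x50: ">B", 0x60: ">H", 0x70: ">I", 0x80: ">Q",
    0x51: ">b", 0x61: ">h", 0x71: ">i", 0x81: ">q",
    0x52: ">B", 0x53: ">B", 0x54: ">b", 0x55: ">b",
}


def decode(buf: bytes, pos: int = 0) -> Tuple[Any, int]:
    """Decode one element starting at pos; return (value, position after it)."""
    code = buf[pos]
    pos += 1
    if code == 0x00:  # described value: descriptor first, then the value itself
        descriptor, pos = decode(buf, pos)
        value, pos = decode(buf, pos)
        return Described(descriptor, value), pos
    if code in _EMPTY:
        return _EMPTY[code], pos
    if code == 0x45:  # empty list
        return [], pos
    if code in _FIXED:
        fmt = _FIXED[code]
        return struct.unpack_from(fmt, buf, pos)[0], pos + struct.calcsize(fmt)
    if code in (0xA0, 0xA1, 0xA3, 0xB0, 0xB1, 0xB3):  # binary, string, symbol
        width = 1 if code < 0xB0 else 4
        length = int.from_bytes(buf[pos:pos + width], "big")
        pos += width
        raw = buf[pos:pos + length]
        pos += length
        kind = code & 0x0F
        if kind == 0:
            return bytes(raw), pos
        return raw.decode("utf-8" if kind == 1 else "ascii"), pos
    if code in (0xC0, 0xC1, 0xD0, 0xD1):  # list8, map8, list32, map32
        width = 1 if code < 0xD0 else 4
        pos += width  # skip the size field; the count is enough here
        count = int.from_bytes(buf[pos:pos + width], "big")
        pos += width
        items = []
        for _ in range(count):
            item, pos = decode(buf, pos)
            items.append(item)
        if code & 0x0F == 0:
            return items, pos
        # Map entries alternate key, value; this assumes hashable keys.
        return dict(zip(items[0::2], items[1::2])), pos
    raise ValueError(f"format code {code:#04x} is not handled by this sketch")
```

The output of ``decode`` is the "nested lists and maps" view; mapping the ``Described`` wrappers onto typed structures is the second, higher level.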
Extended AMQP
-------------
So far we've got collections that contain primitives or more collections, and any element can be labelled with a
string or numeric code. This is good, but compared to a format like JSON or XML it's not really self describing.
A class will be mapped to a list of field contents. Even if we know the name of that class, we still won't really know
what the fields mean without having access to the original code of the class that the message was generated from.
AMQP's type system can solve this. However, out of the box there are two problems:
1. Messages don't include their own schemas.
2. AMQP only defines an XML based representation for schemas.
We'd rather not embed XML inside a binary format designed to be digitally signed, so we have defined a straightforward
mapping from this schema notation to AMQP encoding itself. This makes our AMQP messages self describing, by embedding a
schema for each application or platform level type that is serialised. The schema provides information like field names,
annotations and type variables for generic types. The schema can of course be ignored in many interop cases: it's there
to enable version evolution of persisted data structures over time.
.. note:: It is a deliberate choice to sacrifice encoding efficiency for self-description: we prefer to pay more now than risk
   having data on the ledger later on that's hard to read due to loss of (old versions of) applications. The intention is
   that a mix of compression and separating the schema parts out when both sides already agree on what they are will return
   most of the lost efficiency.
Descriptors
-----------
Serialised messages use described types extensively. There are two types of descriptor:
1. 64 bit code. In Corda, the top 16 bits are always equal to 0xc562 which is R3's IANA assigned enterprise number. The
   low bits define various elements in our meta-schema (i.e. the way we describe the schemas of other messages).
2. String. These always start with "net.corda:" and are then followed by either a 'well known' type name, or
   a base64 encoded *fingerprint* of the underlying schema that was generated from the original class. They are
   encoded using the AMQP symbol type.
The fingerprint can be used to determine if the serialised message maps precisely to a holder type (class) you already
have in your environment. If you don't recognise the fingerprint, you may need to examine the schema data to figure out
a reasonable approximate mapping to a type you do have ... or you can give up and throw a parse error.
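As a sketch of how a reader might take a string descriptor apart, the Python below strips the ``net.corda:`` prefix and then guesses whether the remainder is a fingerprint or a well known type name. The guess (anything that is strictly valid base64 is treated as a fingerprint) is an assumption made for this example, not a rule of the format, and ``parse_symbol_descriptor`` is a hypothetical helper name.

```python
import base64
import binascii
from typing import Tuple, Union

PREFIX = "net.corda:"


def parse_symbol_descriptor(symbol: str) -> Tuple[str, Union[bytes, str]]:
    """Return ("fingerprint", raw bytes) or ("well-known", type name)."""
    if not symbol.startswith(PREFIX):
        raise ValueError(f"not a Corda string descriptor: {symbol!r}")
    rest = symbol[len(PREFIX):]
    try:
        # Heuristic: type names such as java.time.Instant contain '.',
        # which is not in the base64 alphabet, so strict decoding rejects them.
        return "fingerprint", base64.b64decode(rest, validate=True)
    except binascii.Error:
        return "well-known", rest
```

A reader would typically use the fingerprint bytes (or the whole symbol) as a lookup key into the set of types it already knows.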
The numeric codes are defined as follows (remember to mask out the top 16 bits first):
1. ENVELOPE
2. SCHEMA
3. OBJECT_DESCRIPTOR
4. FIELD
5. COMPOSITE_TYPE
6. RESTRICTED_TYPE
7. CHOICE
8. REFERENCED_OBJECT
9. TRANSFORM_SCHEMA
10. TRANSFORM_ELEMENT
11. TRANSFORM_ELEMENT_KEY
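The masking step can be sketched in Python as follows. The table mirrors the list above; the function name ``describe_numeric_descriptor`` is made up for the example.

```python
R3_ENTERPRISE_NUMBER = 0xC562  # occupies the top 16 bits of the 64 bit code

META_SCHEMA_CODES = {
    1: "ENVELOPE",
    2: "SCHEMA",
    3: "OBJECT_DESCRIPTOR",
    4: "FIELD",
    5: "COMPOSITE_TYPE",
    6: "RESTRICTED_TYPE",
    7: "CHOICE",
    8: "REFERENCED_OBJECT",
    9: "TRANSFORM_SCHEMA",
    10: "TRANSFORM_ELEMENT",
    11: "TRANSFORM_ELEMENT_KEY",
}


def describe_numeric_descriptor(descriptor: int) -> str:
    """Map a 64 bit descriptor to the name of its meta-schema element."""
    if descriptor >> 48 != R3_ENTERPRISE_NUMBER:
        raise ValueError(f"top 16 bits are not 0xc562: {descriptor:#018x}")
    code = descriptor & 0x0000FFFFFFFFFFFF  # mask out the top 16 bits
    if code not in META_SCHEMA_CODES:
        raise ValueError(f"unknown meta-schema code {code}")
    return META_SCHEMA_CODES[code]
```

For instance, the descriptor ``0xc562000000000005`` maps to ``COMPOSITE_TYPE``.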
In this document, the term "record" is used to mean an AMQP list described with a numeric code as enumerated
above. A record may represent an actual logical list of variable length, or be a fixed length list of fields. Our
encoding should really have used AMQP arrays for the case where the contents are of variable length and lists only for
representing object/class like things; unfortunately, it uses lists for both. The term "object" is used to mean a list
described with a string/symbolic descriptor that references a schema entry.
High level format
-----------------
Every Corda message is at the top level an *ENVELOPE* record containing three elements:
1. The top level message, which is described using a string (symbolic) descriptor.
2. A *SCHEMA* record.
3. A *TRANSFORM_SCHEMA* record.
The transform schema will usually be empty - it's used to describe how a data structure has evolved over time, so
making it easier to map to old/new code.
The *SCHEMA* record always contains a single element, which is itself another list containing *COMPOSITE_TYPE* and *RESTRICTED_TYPE* records.
Each *COMPOSITE_TYPE* record describes a single app-level type and has the following members:
1. Name: string
2. Label: nullable string
3. Provides: list of strings
4. Descriptor: An *OBJECT_DESCRIPTOR* record
5. Fields: A list of *FIELD* records
The label will typically be unused and left as null - it's here to match the AMQP specification and could in future contain
arbitrary unstructured text, e.g. a javadoc explaining more about the semantics of the field. The "provides list" is
a set of strings naming Java interfaces that the original type implements. It can be used to work with messages generically
in a strongly typed, safe manner. Rather than guessing whether a type is meant to be a Foo or Bar based on matching
with the field names, the schema itself declares what contracts it is intended to meet.
The descriptor record has two elements: the first is a string/symbol and the second is an unsigned long code. Typically
only one will be set. This record corresponds to the descriptor that will appear in the main message stream.
Finally, the fields are defined. Each *FIELD* record has the following members:
1. Name: string
2. Type: string
3. Requires: list of string
4. Default: nullable string
5. Label: nullable string
6. Mandatory: boolean
7. Multiple: boolean
The meanings of these are defined in the AMQP specification. The type string is a Java class name *with* generic parameters.
The other parts of the schema map to the AMQP XML schema spec in the same straightforward manner.
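The record layouts above can be sketched as plain Python data classes. This assumes some generic AMQP decoder has already produced a tree in which every described value is a ``Described(descriptor, value)`` pair; that class, and the names ``parse_envelope`` and ``parse_composite``, are assumptions of this example rather than anything Corda or an AMQP library provides. Restricted types are skipped for brevity.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple

FIELD, COMPOSITE_TYPE = 4, 5  # numeric codes from the Descriptors section


@dataclass
class Described:
    """Assumed shape of a described value handed over by a generic decoder."""
    descriptor: Any
    value: Any


@dataclass
class Field:
    name: str
    type: str
    requires: List[str]
    default: Optional[str]
    label: Optional[str]
    mandatory: bool
    multiple: bool


@dataclass
class CompositeType:
    name: str
    label: Optional[str]
    provides: List[str]
    descriptor_symbol: Optional[str]
    descriptor_code: Optional[int]
    fields: List[Field]


def code_of(record: Described) -> Optional[int]:
    """Low 48 bits of a numeric descriptor, or None for a symbolic one."""
    if isinstance(record.descriptor, int):
        return record.descriptor & 0xFFFFFFFFFFFF
    return None


def parse_composite(record: Described) -> CompositeType:
    name, label, provides, descriptor, fields = record.value
    symbol, numeric = descriptor.value  # the OBJECT_DESCRIPTOR record
    parsed = [Field(*f.value) for f in fields if code_of(f) == FIELD]
    return CompositeType(name, label, list(provides), symbol, numeric, parsed)


def parse_envelope(envelope: Described) -> Tuple[Any, List[CompositeType], Any]:
    message, schema, transforms = envelope.value
    (type_records,) = schema.value  # the schema holds exactly one list
    composites = [parse_composite(t) for t in type_records
                  if code_of(t) == COMPOSITE_TYPE]
    return message, composites, transforms
```

Keeping the field order exactly as listed means a *FIELD* record can be unpacked positionally, which is what ``Field(*f.value)`` does.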
Mapping JVM classes to composite types
--------------------------------------
Corda does not need or use a separate schema definition language. Instead, source code is used as a way to define schemas
via regular class definitions in any statically typed JVM-bytecode targeting language. This specification will thus
frequently refer to types whose only definitions are found in the Corda source code: these definitions are canonical and not
derived from any other kind of schema. Any class annotated as ``@CordaSerializable`` could appear in an AMQP message.
Whilst you don't need access to the original class files to decode the typed structure of a Corda message due to the embedded AMQP
schema, it will often be much more convenient to work with the original structures using JVM reflection. This is typically
very useful for code generators.
If you want to, you can nonetheless parse the Java .class file format using a variety of libraries. The format is a simple tagged
union style format and `can be parsed in about 300 lines of C <https://github.com/atcol/cfr/blob/master/src/class.c>`_. The only
part of the class file that actually matters for type information is the parameter list of the constructor, as that defines which fields
are stored to the wire.
Source code does not have a deterministic field ordering. Developers may re-arrange fields in their classes as they refactor
their code, which in a conventional serialisation scheme would break the wire format. Thus when mapping classes to AMQP schemas,
we alphabetically sort the fields. If a new field is added, it may thus appear in the middle of the composite type list rather than
at the end.
.. warning:: The above implies that you cannot handle format evolution by simply skipping fields you don't understand. Instead you
   must notice when the descriptors have changed from what you expect, and consult the schema to determine how to map the new message
   to a schema that you can work with.
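A small Python illustration of the ordering rule, using the constructor parameters of the ``Company`` class from the Examples section. Plain ``sorted`` is used here as a stand-in for the alphabetical sort; the exact comparison Corda applies (for instance to mixed-case names) is not specified in this document, and the ``founder`` field is hypothetical.

```python
from typing import List


def wire_field_order(constructor_params: List[str]) -> List[str]:
    """Order field names as they would appear in the composite type."""
    return sorted(constructor_params)


# Declaration order in the Kotlin source...
declared = ["name", "createdInYear", "logo", "departments", "historicalEvents"]
# ...versus the order of the fields on the wire.
on_wire = wire_field_order(declared)

# Adding a (hypothetical) field lands it in the middle, not at the end.
evolved = wire_field_order(declared + ["founder"])
```

Here ``on_wire`` starts with ``createdInYear`` and ends with ``name``, and ``founder`` ends up at index 2 of ``evolved``, shifting every later field by one position.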
Containers
----------
AMQP defines encodings for maps and lists, which are mapped to/from ``java.util.Map`` and ``java.util.List`` in JVM code. You don't need
any special support to read these if you don't care about the higher level type system.
In the binary schemas containers are represented as follows. A field in a composite type that is a list will look like this:
1. Name: "livingIn"
2. Type: "*"
3. Requires: [ "java.util.List<net.corda.tools.serialization.City>" ]
4. Default: NULL
5. Label: NULL
6. Mandatory: true
7. Multiple: false
The *requires* field is a list of *archetypes*. These are simply uninterpreted strings that refer to other schema elements, which
list the same string in their *provides* field. In this way a form of intersection typing is implemented. We use Java type names
with generics to link the field to the definition of a restricted type.
The list type will be defined as a restricted type, like so:
0. Name: "java.util.List<net.corda.tools.serialization.City>"
1. Label: NULL
2. Provides: []
3. Source: "list"
4. Descriptor: [
0. Symbol: net.corda:2A8U5kaXW/lD5ns+l0xPFg==
1. Numeric: NULL
]
5. Choices: []
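The link between a field and its restricted type can be sketched as a lookup. The Python below represents schema entries as plain dictionaries (an assumption of this example) and matches each *requires* string against both the name and the *provides* list of the candidate types; ``resolve_container`` is a hypothetical helper.

```python
from typing import List, Optional


def resolve_container(field: dict, restricted_types: List[dict]) -> Optional[dict]:
    """Find the restricted type behind a field whose type is "*"."""
    if field["type"] != "*":
        return None  # not a container field; the type string is direct
    for archetype in field["requires"]:
        for candidate in restricted_types:
            if archetype == candidate["name"] or archetype in candidate["provides"]:
                return candidate
    raise LookupError(f"no schema entry satisfies {field['requires']!r}")


city_list = {
    "name": "java.util.List<net.corda.tools.serialization.City>",
    "label": None,
    "provides": [],
    "source": "list",
    "descriptor": ("net.corda:2A8U5kaXW/lD5ns+l0xPFg==", None),
    "choices": [],
}
living_in = {
    "name": "livingIn",
    "type": "*",
    "requires": ["java.util.List<net.corda.tools.serialization.City>"],
}
resolved = resolve_container(living_in, [city_list])
```

In this case ``resolved`` is the ``city_list`` entry, whose ``source`` says the value is encoded as an AMQP list.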
Signed data
-----------
A common pattern in Corda is that an outer wrapper serialised message contains signatures and certificates for an inner
serialised message. The inner message is represented as 'binary', so it takes two passes to deserialise such a
message fully. This is intended as a form of security firebreak, because it means you can avoid processing any serialised
data until the signatures have been checked and provenance established. It also helps ensure everyone calculates a
signature over the same binary data without roundtripping issues appearing.
The following types are used for this in the current version of the protocol (correct as of Corda 4):
* ``net.corda.core.internal.SignedDataWithCert``, descriptor ``net.corda:VywzVs/TR8ztvQBpYFpnlQ==``. Fields:

  * raw: ``net.corda.core.serialization.SerializedBytes<?>``
  * sig: ``net.corda.core.internal.DigitalSignatureWithCert``

* ``net.corda.core.internal.DigitalSignatureWithCert``, descriptor ``net.corda:AJin3eE1QDfCwTiDWC5hJA==``. Fields:

  * by: ``java.security.cert.X509Certificate``
  * bytes: binary

The signature bytes are opaque and their format depends on the cryptographic scheme identified in the X.509 certificate.
For example, elliptic curve signatures use a standardised (non-AMQP) binary format that encodes the coordinates of the
point on the curve. The type ``java.security.cert.X509Certificate`` does not appear in the schema; it is parsed as a
special case and has the descriptor ``net.corda:java.security.cert.X509Certificate``. A field with this descriptor is
of type 'binary' and contains a certificate in the standard X.509 binary format (again, not AMQP).
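The two-pass pattern can be sketched as follows. The signature check and the inner parser are passed in as callables because both depend on libraries outside the scope of this document; ``open_signed_data``, ``verify`` and ``parse`` are placeholder names for this illustration only.

```python
from typing import Any, Callable


def open_signed_data(
    raw: bytes,
    signature: bytes,
    certificate: bytes,
    verify: Callable[[bytes, bytes, bytes], bool],
    parse: Callable[[bytes], Any],
) -> Any:
    """Check the signature over the raw bytes, then (and only then) parse them."""
    # First pass has already happened: the outer wrapper was decoded into
    # raw (the 'raw' field), signature ('bytes') and certificate ('by').
    if not verify(certificate, signature, raw):
        raise ValueError("signature check failed; inner message not parsed")
    # Second pass: the inner message is only touched once provenance is known.
    return parse(raw)
```

Because the signature is checked over ``raw`` exactly as received, no re-serialisation is needed and roundtripping differences cannot affect the result.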
Examples
--------
The following sample shows how a few lines of Kotlin code defining some sophisticated data structures map to an AMQP message.

.. sourcecode:: kotlin

    import java.time.Instant
    import net.corda.core.serialization.CordaSerializable
    import net.corda.core.utilities.OpaqueBytes

    @CordaSerializable
    data class Employee(val names: Pair<String, String>)

    @CordaSerializable
    data class Department(val name: String, val employees: List<Employee>)

    @CordaSerializable
    data class Company(
        val name: String,
        val createdInYear: Short,
        val logo: OpaqueBytes,
        val departments: List<Department>,
        val historicalEvents: Map<String, Instant>
    )

and here is an ad-hoc textual representation of what it turns into on the wire (this format is not stable or meaningful)::
    envelope [
        0. net.corda:XIBlQ9Yl/RlKGLjCMY1/Kg== [
            0. 2014: short
            1. net.corda:mCdn5Q/6wPrRd120wfv5og== [
                0. net.corda:J6fOfvKOUIhpLqSmzN2ecw== [
                    0. net.corda:KwaBqNRsTDOaXBrYdtDZpw== [
                        0. net.corda:zjQ3JQXiArQUxXuCcaWANw== [
                            0. net.corda:c0Lkwk4E63sshTPr2G60aQ== [
                                0. "Mike"
                                1. "Hearn"
                            ]
                        ]
                        1. net.corda:zjQ3JQXiArQUxXuCcaWANw== [
                            0. net.corda:c0Lkwk4E63sshTPr2G60aQ== [
                                0. "Richard"
                                1. "Brown"
                            ]
                        ]
                        2. net.corda:zjQ3JQXiArQUxXuCcaWANw== [
                            0. net.corda:c0Lkwk4E63sshTPr2G60aQ== [
                                0. "James"
                                1. "Carlyle"
                            ]
                        ]
                    ]
                    1. "Platform"
                ]
            ]
            2. net.corda:QXkG3ayKZNvF8dIEKbOTSw== {
                "First lab project proposal email" -> net.corda:java.time.Instant [
                    0. 1411596660: long
                    1. 0: int
                ]
                "Hired Mike" -> net.corda:java.time.Instant [
                    0. 1446552000: long
                    1. 0: int
                ]
            }
            3. net.corda:pgT0Kc3t/bvnzmgu/nb4Cg== [
                0. <binary of 1 bytes>
            ]
            4. "R3"
        ]
        1. schema [
            0. [
                0. composite type [
                    0. "net.corda.tools.serialization.Company"
                    1. NULL
                    2. []
                    3. object descriptor [
                        0. net.corda:XIBlQ9Yl/RlKGLjCMY1/Kg==: symbol
                        1. NULL
                    ]
                    4. [
                        0. field [
                            0. "createdInYear"
                            1. "short"
                            2. []
                            3. "0"
                            4. NULL
                            5. true
                            6. false
                        ]
                        1. field [
                            0. "departments"
                            1. "*"
                            2. [
                                0. "java.util.List<net.corda.tools.serialization.Department>"
                            ]
                            3. NULL
                            4. NULL
                            5. true
                            6. false
                        ]
                        2. field [
                            0. "historicalEvents"
                            1. "*"
                            2. [
                                0. "java.util.Map<string, java.time.Instant>"
                            ]
                            3. NULL
                            4. NULL
                            5. true
                            6. false
                        ]
                        3. field [
                            0. "logo"
                            1. "net.corda.core.utilities.OpaqueBytes"
                            2. []
                            3. NULL
                            4. NULL
                            5. true
                            6. false
                        ]
                        4. field [
                            0. "name"
                            1. "string"
                            2. []
                            3. NULL
                            4. NULL
                            5. true
                            6. false
                        ]
                    ]
                ]
                1. restricted type [
                    0. "java.util.List<net.corda.tools.serialization.Department>"
                    1. NULL
                    2. []
                    3. "list"
                    4. object descriptor [
                        0. net.corda:mCdn5Q/6wPrRd120wfv5og==: symbol
                        1. NULL
                    ]
                    5. []
                ]
                2. composite type [
                    0. "net.corda.tools.serialization.Department"
                    1. NULL
                    2. []
                    3. object descriptor [
                        0. net.corda:J6fOfvKOUIhpLqSmzN2ecw==: symbol
                        1. NULL
                    ]
                    4. [
                        0. field [
                            0. "employees"
                            1. "*"
                            2. [
                                0. "java.util.List<net.corda.tools.serialization.Employee>"
                            ]
                            3. NULL
                            4. NULL
                            5. true
                            6. false
                        ]
                        1. field [
                            0. "name"
                            1. "string"
                            2. []
                            3. NULL
                            4. NULL
                            5. true
                            6. false
                        ]
                    ]
                ]
                3. restricted type [
                    0. "java.util.List<net.corda.tools.serialization.Employee>"
                    1. NULL
                    2. []
                    3. "list"
                    4. object descriptor [
                        0. net.corda:KwaBqNRsTDOaXBrYdtDZpw==: symbol
                        1. NULL
                    ]
                    5. []
                ]
                4. composite type [
                    0. "net.corda.tools.serialization.Employee"
                    1. NULL
                    2. []
                    3. object descriptor [
                        0. net.corda:zjQ3JQXiArQUxXuCcaWANw==: symbol
                        1. NULL
                    ]
                    4. [
                        0. field [
                            0. "names"
                            1. "kotlin.Pair<string, string>"
                            2. []
                            3. NULL
                            4. NULL
                            5. true
                            6. false
                        ]
                    ]
                ]
                5. composite type [
                    0. "kotlin.Pair<string, string>"
                    1. NULL
                    2. []
                    3. object descriptor [
                        0. net.corda:c0Lkwk4E63sshTPr2G60aQ==: symbol
                        1. NULL
                    ]
                    4. [
                        0. field [
                            0. "first"
                            1. "string"
                            2. []
                            3. NULL
                            4. NULL
                            5. true
                            6. false
                        ]
                        1. field [
                            0. "second"
                            1. "string"
                            2. []
                            3. NULL
                            4. NULL
                            5. true
                            6. false
                        ]
                    ]
                ]
                6. restricted type [
                    0. "java.util.Map<string, java.time.Instant>"
                    1. NULL
                    2. []
                    3. "map"
                    4. object descriptor [
                        0. net.corda:QXkG3ayKZNvF8dIEKbOTSw==: symbol
                        1. NULL
                    ]
                    5. []
                ]
                7. composite type [
                    0. "net.corda.core.utilities.OpaqueBytes"
                    1. NULL
                    2. []
                    3. object descriptor [
                        0. net.corda:pgT0Kc3t/bvnzmgu/nb4Cg==: symbol
                        1. NULL
                    ]
                    4. [
                        0. field [
                            0. "bytes"
                            1. "binary"
                            2. []
                            3. NULL
                            4. NULL
                            5. true
                            6. false
                        ]
                    ]
                ]
            ]
        ]
        2. transform schema {
        }
    ]
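
One way to read the dump above is to look up each object's string descriptor in the schema and pair the list entries with the field names found there. The Python sketch below does this for the ``kotlin.Pair`` entries; the lookup table is hand-copied from the schema in the example, and ``name_fields`` is a hypothetical helper.

```python
from typing import Any, Dict, List

# Descriptor symbol -> field names in wire order, copied from the example schema.
FIELDS_BY_DESCRIPTOR: Dict[str, List[str]] = {
    "net.corda:c0Lkwk4E63sshTPr2G60aQ==": ["first", "second"],    # kotlin.Pair
    "net.corda:zjQ3JQXiArQUxXuCcaWANw==": ["names"],              # Employee
    "net.corda:J6fOfvKOUIhpLqSmzN2ecw==": ["employees", "name"],  # Department
}


def name_fields(descriptor: str, values: List[Any]) -> Dict[str, Any]:
    """Attach schema field names to the positional values of an object."""
    names = FIELDS_BY_DESCRIPTOR[descriptor]
    if len(names) != len(values):
        raise ValueError("object does not match the schema entry")
    return dict(zip(names, values))


pair = name_fields("net.corda:c0Lkwk4E63sshTPr2G60aQ==", ["Mike", "Hearn"])
employee = name_fields("net.corda:zjQ3JQXiArQUxXuCcaWANw==", [pair])
```

After this, ``employee["names"]["first"]`` is ``"Mike"``, which is the kind of named, nested structure the embedded schema is meant to make recoverable without the original classes.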