From 74354580b68409645122163e13652d3fee5c21e4 Mon Sep 17 00:00:00 2001 From: Mike Hearn Date: Wed, 7 Nov 2018 10:57:46 +0100 Subject: [PATCH] Add some documentation on the wire format. --- docs/source/serialization-index.rst | 1 + docs/source/serialization.rst | 3 +- docs/source/wire-format.rst | 501 ++++++++++++++++++++++++++++ 3 files changed, 504 insertions(+), 1 deletion(-) create mode 100644 docs/source/wire-format.rst diff --git a/docs/source/serialization-index.rst b/docs/source/serialization-index.rst index 6ec054bd0b..d9cbcc2810 100644 --- a/docs/source/serialization-index.rst +++ b/docs/source/serialization-index.rst @@ -9,3 +9,4 @@ Serialization serialization-default-evolution.rst serialization-enum-evolution.rst blob-inspector + wire-format.rst diff --git a/docs/source/serialization.rst b/docs/source/serialization.rst index b394754147..88d78a088c 100644 --- a/docs/source/serialization.rst +++ b/docs/source/serialization.rst @@ -73,7 +73,8 @@ It's reproduced here as an example of both ways you can do this for a couple of AMQP ---- -Corda uses an extended form of AMQP 1.0 as its binary wire protocol. +Corda uses an extended form of AMQP 1.0 as its binary wire protocol. You can learn more about the :doc:`wire-format` Corda +uses if you intend to parse Corda messages from non-JVM platforms. Corda serialisation is currently used for: diff --git a/docs/source/wire-format.rst b/docs/source/wire-format.rst new file mode 100644 index 0000000000..527726723c --- /dev/null +++ b/docs/source/wire-format.rst @@ -0,0 +1,501 @@ +Wire format +=========== + +This document describes the Corda wire format. With the following information and an implementation of the AMQP/1.0 +specification, you can read Corda serialised binary messages. An example implementation of AMQP/1.0 would be Apache +Qpid Proton, or Microsoft AMQP.NET Lite. 
+ +Header +------ + +All messages start with the 8 byte sequence ``corda\1\0\0``, that is, the string "corda" followed by a byte with the value one and then two +zero bytes. That means you can't directly feed a Corda message into an AMQP library. You must check the header string and +then skip it. + +The '1' byte indicates the major version of the format. It should always be set to 1. If it isn't, that implies a backwards +incompatible serialisation format has been developed and you should abort. The two bytes that follow it are incremented if we make +extensions to the format. You can usually ignore these. + +AMQP intro +---------- + +AMQP/1.0 (which is quite different to AMQP/0.9) is a protocol that contains a standardised binary encoding scheme, comparable to but +more advanced than Google protocol buffers. `The AMQP specification `_ +is quite concise and easy to read: this document will reference it in many places. It also provides a variety of encoded examples +that can be used to understand each byte of a message. + +The format specifies encodings for several 'primitive' types: numbers, strings, UUIDs, timestamps +and symbols (these can be thought of as enum entries). It also defines how to encode maps, lists and arrays. The difference +between the latter two is that arrays always contain a single type, whereas lists can contain elements of different types. +An AMQP byte stream is simply a repeated series of elements. + +So far, so standard. However, AMQP goes further than most such tagged binary encodings by including the concept of +*described types*. This is a way to impose an application-level type system on top of the basic "bags of elements" +that low-level AMQP gives you. Any element in the stream can be prefixed with a *descriptor*, which is either a string +or a 64 bit value. Both types of label have a defined namespacing mechanism. This labelling scheme allows sophisticated +layerings to be added on top of the simple, interoperable core.
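The header check described in the "Header" section can be sketched in a few lines of Kotlin. This is a minimal illustration, not code taken from Corda: the names ``CORDA_MAGIC``, ``CORDA_HEADER_LENGTH`` and ``stripCordaHeader`` are hypothetical, and the function assumes the whole message is already in memory as a byte array.

```kotlin
// Hypothetical helper names; only the byte layout comes from the text above.
val CORDA_MAGIC: ByteArray = "corda".toByteArray(Charsets.US_ASCII)
const val CORDA_HEADER_LENGTH = 8

/**
 * Checks the 8 byte header and returns the remaining bytes, which can then
 * be handed to an AMQP/1.0 decoder.
 */
fun stripCordaHeader(message: ByteArray): ByteArray {
    require(message.size >= CORDA_HEADER_LENGTH) { "Message is shorter than the 8 byte header" }
    for (i in CORDA_MAGIC.indices) {
        require(message[i] == CORDA_MAGIC[i]) { "Message does not start with the string 'corda'" }
    }
    // Byte 5 is the major version: anything other than 1 means an incompatible format.
    val major = message[5].toInt()
    require(major == 1) { "Unsupported major version $major" }
    // Bytes 6 and 7 are bumped for format extensions and are ignored in this sketch.
    return message.copyOfRange(CORDA_HEADER_LENGTH, message.size)
}
```

Throwing on an unexpected major version mirrors the advice to abort rather than attempt to parse a format that may be backwards incompatible.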
+ +AMQP therefore also defines a type system and schema representation that allows you to create the app-level type layer. +Standard AMQP defines an XML based schema language. Fields can be grouped together using *composite types*. A composite +type is simply a described list, in which each list entry is one field of the composite. Composites are used to encode +language-level classes, records, structs etc. + +You can also define a *restricted type*, which is a new type that is a specialisation or subset of +an existing one. For enumerations the choices can be listed in the schema. + +Due to this design you can think of a serialised message as being interpretable at several levels of detail. +You can parse it just using the basic AMQP type system, which will give you nested lists and maps containing a few basic +types. This is similar to what JSON would give you. Or you can utilise the descriptors and map those containers to higher +level, more strongly typed structures. + +Extended AMQP +------------- + +So far we've got collections that contain primitives or more collections, and any element can be labelled with a +string or numeric code. This is good, but compared to a format like JSON or XML it's not really self describing. +A class will be mapped to a list of field contents. Even if we know the name of that class, we still won't really know +what the fields mean without having access to the original code of the class that the message was generated from. + +AMQP's type system can solve this. However, out of the box there are two problems: + +1. Messages don't include their own schemas. +2. AMQP only defines an XML based representation for schemas. + +We'd rather not embed XML inside a binary format designed to be digitally signed, so we have defined a straightforward +mapping from this schema notation to AMQP encoding itself. 
This makes our AMQP messages self describing, by embedding a +schema for each application or platform level type that is serialised. The schema provides information like field names, +annotations and type variables for generic types. The schema can of course be ignored in many interop cases: it's there +to enable version evolution of persisted data structures over time. + +.. note:: It is a deliberate choice to sacrifice encoding efficiency for self-description: we prefer to pay more now than risk + having data on the ledger later on that's hard to read due to loss of (old versions of) applications. The intention is + that a mix of compression and separating the schema parts out when both sides already agree on what they are will return + most of the lost efficiency. + +Descriptors +----------- + +Serialised messages use described types extensively. There are two types of descriptor: + +1. 64 bit code. In Corda, the top 16 bits are always equal to 0xc562 which is R3's IANA assigned enterprise number. The + low bits define various elements in our meta-schema (i.e. the way we describe the schemas of other messages). +2. String. These always start with "net.corda:" and are then followed by either a 'well known' type name, or + a base64 encoded *fingerprint* of the underlying schema that was generated from the original class. They are + encoded using the AMQP symbol type. + +The fingerprint can be used to determine if the serialised message maps precisely to a holder type (class) you already +have in your environment. If you don't recognise the fingerprint, you may need to examine the schema data to figure out +a reasonable approximate mapping to a type you do have ... or you can give up and throw a parse error. + +The numeric codes are defined as follows (remember to mask out the top 16 bits first): + +1. ENVELOPE +2. SCHEMA +3. OBJECT_DESCRIPTOR +4. FIELD +5. COMPOSITE_TYPE +6. RESTRICTED_TYPE +7. CHOICE +8. REFERENCED_OBJECT +9. TRANSFORM_SCHEMA +10. 
TRANSFORM_ELEMENT +11. TRANSFORM_ELEMENT_KEY + +In this document, the term "record" is used to mean an AMQP list described with a numeric code as enumerated +above. A record may represent an actual logical list of variable length, or be a fixed length list of fields. Our +encoding should really have used AMQP arrays for the case where the contents are of variable length and lists only for +representing object/class like things. Unfortunately it uses lists for both. The term "object" is used to mean a list +described with a string/symbolic descriptor that references a schema entry. + +High level format +----------------- + +Every Corda message is at the top level an *ENVELOPE* record containing three elements: + +1. The top level message, which is described using a string (symbolic) descriptor. +2. A *SCHEMA* record. +3. A *TRANSFORM_SCHEMA* record. + +The transform schema will usually be empty - it's used to describe how a data structure has evolved over time, +making it easier to map to old/new code. + +The *SCHEMA* record always contains a single element, which is itself another list containing *COMPOSITE_TYPE* records +(and, as the example at the end of this document shows, *RESTRICTED_TYPE* records for containers). +Each *COMPOSITE_TYPE* record describes a single app-level type and has the following members: + +1. Name: string +2. Label: nullable string +3. Provides: list of strings +4. Descriptor: An *OBJECT_DESCRIPTOR* record +5. Fields: A list of *FIELD* records + +The label will typically be unused and left as null - it's here to match the AMQP specification and could in future contain +arbitrary unstructured text, e.g. a javadoc explaining more about the semantics of the field. The "provides list" is +a set of strings naming Java interfaces that the original type implements. It can be used to work with messages generically +in a strongly typed, safe manner. Rather than guessing whether a type is meant to be a Foo or Bar based on matching +with the field names, the schema itself declares what contracts it is intended to meet.
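As a rough sketch of how the numeric codes listed under "Descriptors" might be recognised, the Kotlin below checks the top 16 bits against 0xc562 and then looks up the remaining low bits. The names ``CordaDescriptorCode``, ``R3_ENTERPRISE_NUMBER`` and ``decodeNumericDescriptor`` are hypothetical, and the sketch assumes your AMQP library hands you the 64 bit descriptor as a signed ``Long``.

```kotlin
/** Numeric descriptor codes, in the order given in the "Descriptors" section. */
enum class CordaDescriptorCode(val code: Long) {
    ENVELOPE(1), SCHEMA(2), OBJECT_DESCRIPTOR(3), FIELD(4), COMPOSITE_TYPE(5),
    RESTRICTED_TYPE(6), CHOICE(7), REFERENCED_OBJECT(8), TRANSFORM_SCHEMA(9),
    TRANSFORM_ELEMENT(10), TRANSFORM_ELEMENT_KEY(11)
}

// R3's enterprise number, which this document says occupies the top 16 bits.
const val R3_ENTERPRISE_NUMBER = 0xc562L

/**
 * Returns the matching code, or null if the descriptor is not in R3's
 * namespace or its low bits are not one of the known codes.
 */
fun decodeNumericDescriptor(descriptor: Long): CordaDescriptorCode? {
    // Unsigned shift so that the sign bit does not smear into the result.
    val top16 = (descriptor ushr 48) and 0xffffL
    if (top16 != R3_ENTERPRISE_NUMBER) return null
    // Mask out the top 16 bits, leaving the meta-schema code.
    val low = descriptor and 0x0000_ffff_ffff_ffffL
    return CordaDescriptorCode.values().firstOrNull { it.code == low }
}
```

Returning null rather than throwing leaves it to the caller to decide whether an unknown descriptor is a parse error or simply a described type from another namespace.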
+ +The descriptor record has two elements: the first is a string/symbol and the second is an unsigned long code. Typically +only one will be set. This record corresponds to the descriptor that will appear in the main message stream. + +Finally, the fields are defined. Each *FIELD* record has the following members: + +1. Name: string +2. Type: string +3. Requires: list of strings +4. Default: nullable string +5. Label: nullable string +6. Mandatory: boolean +7. Multiple: boolean + +The meanings of these are defined in the AMQP specification. The type string is a Java class name *with* generic parameters. + +The other parts of the schema map to the AMQP XML schema spec in the same straightforward manner. + +Mapping JVM classes to composite types +-------------------------------------- + +Corda does not need or use a separate schema definition language. Instead, source code is used as a way to define schemas +via regular class definitions in any statically typed JVM-bytecode targeting language. This specification will thus +frequently refer to types whose only definitions are found in the Corda source code: these definitions are canonical and not +derived from any other kind of schema. Any class annotated as ``@CordaSerializable`` could appear in an AMQP message. +Whilst you don't need access to the original class files to decode the typed structure of a Corda message due to the embedded AMQP +schema, it will often be much more convenient to work with the original structures using JVM reflection. This is typically +very useful for code generators. + +If you want to, you can nonetheless parse the Java .class file format using a variety of libraries. The format is a simple tagged +union style format and `can be parsed in about 300 lines of C `_. The only +part of the class file that actually matters for type information is the parameter list of the constructor, as that defines which fields +are stored to the wire. + +Source code does not have a deterministic field ordering. 
Developers may re-arrange fields in their classes as they refactor +their code, which in a conventional serialisation scheme would break the wire format. Thus when mapping classes to AMQP schemas, +we alphabetically sort the fields. If a new field is added, it may thus appear in the middle of the composite type list rather than +at the end. + +.. warning:: The above implies that you cannot handle format evolution by simply skipping fields you don't understand. Instead you + must notice when the descriptors have changed from what you expect, and consult the schema to determine how to map the new message + to a schema that you can work with. + +Containers +---------- + +AMQP defines encodings for maps and lists, which are mapped to/from ``java.util.Map`` and ``java.util.List`` in JVM code. You don't need +any special support to read these if you don't care about the higher level type system. + +In the binary schemas containers are represented as follows. A field in a composite type that is a list will look like this: + +1. Name: "livingIn" +2. Type: "*" +3. Requires: [ "java.util.List" ] +4. Default: NULL +5. Label: NULL +6. Mandatory: true +7. Multiple: false + +The *requires* field is a list of *archetypes*. These are simply uninterpreted strings that refer to other schema elements, which +list the same string in their *provides* field. In this way a form of intersection typing is implemented. We use Java type names +with generics to link the field to the definition of a restricted type. + +The list type will be defined as a restricted type, like so: + +0. Name: "java.util.List" +1. Label: NULL +2. Provides: [] +3. Source: "list" +4. Descriptor: [ + 0. Symbol: net.corda:2A8U5kaXW/lD5ns+l0xPFg== + 1. Numeric: NULL + ] +5. Choices: [] + +Signed data +----------- + +A common pattern in Corda is that an outer wrapper serialised message contains signatures and certificates for an inner +serialised message. 
The inner message is represented as 'binary', so deserialising such a +message fully requires two passes. This is intended as a form of security firebreak, because it means you can avoid processing any serialised +data until the signatures have been checked and provenance established. It also helps ensure everyone calculates a +signature over the same binary data without roundtripping issues appearing. + +The following types are used for this in the current version of the protocol (correct as of Corda 4): + +* ``net.corda.core.internal.SignedDataWithCert``, descriptor ``net.corda:VywzVs/TR8ztvQBpYFpnlQ==``. Fields: + * raw: ``net.corda.core.serialization.SerializedBytes`` + * sig: ``net.corda.core.internal.DigitalSignatureWithCert`` +* ``net.corda.core.internal.DigitalSignatureWithCert``, descriptor ``net.corda:AJin3eE1QDfCwTiDWC5hJA==``. Fields: + * by: ``java.security.cert.X509Certificate`` + * bytes: binary + +The signature bytes are opaque and their format depends on the cryptographic scheme identified in the X.509 certificate. +For example, elliptic curve signatures use a standardised (non-AMQP) binary format that encodes the coordinates of the +point on the curve. The type ``java.security.cert.X509Certificate`` does not appear in the schema. It is parsed as a +special case and has the descriptor ``net.corda:java.security.cert.X509Certificate``. A field with this descriptor is +of type 'binary' and contains a certificate in the standard X.509 binary format (again, not AMQP). + +Examples +-------- + +The following sample shows how a few lines of Kotlin code defining some sophisticated data structures map to an AMQP message. + +.. 
sourcecode:: kotlin + + import java.time.Instant + import net.corda.core.serialization.CordaSerializable + import net.corda.core.utilities.OpaqueBytes + + @CordaSerializable + data class Employee(val names: Pair<String, String>) + + @CordaSerializable + data class Department(val name: String, val employees: List<Employee>) + + @CordaSerializable + data class Company( + val name: String, + val createdInYear: Short, + val logo: OpaqueBytes, + val departments: List<Department>, + val historicalEvents: Map<String, Instant> + ) + +and here is an ad-hoc textual representation of what it turns into on the wire (this format is not stable or meaningful):: + + envelope [ + 0. net.corda:XIBlQ9Yl/RlKGLjCMY1/Kg== [ + 0. 2014: short + 0. net.corda:J6fOfvKOUIhpLqSmzN2ecw== [ + 1. net.corda:mCdn5Q/6wPrRd120wfv5og== [ + 0. net.corda:KwaBqNRsTDOaXBrYdtDZpw== [ + 0. net.corda:c0Lkwk4E63sshTPr2G60aQ== [ + 0. net.corda:zjQ3JQXiArQUxXuCcaWANw== [ + 0. "Mike" + ] + 1. "Hearn" + ] + 0. net.corda:c0Lkwk4E63sshTPr2G60aQ== [ + 1. net.corda:zjQ3JQXiArQUxXuCcaWANw== [ + 0. "Richard" + ] + 1. "Brown" + ] + 0. net.corda:c0Lkwk4E63sshTPr2G60aQ== [ + 2. net.corda:zjQ3JQXiArQUxXuCcaWANw== [ + 0. "James" + ] + 1. "Carlyle" + ] + ] + 1. "Platform" + ] + ] + 2. net.corda:QXkG3ayKZNvF8dIEKbOTSw== { + "First lab project proposal email" -> net.corda:java.time.Instant [ + 0. 1411596660: long + 1. 0: int + ] + "Hired Mike" -> net.corda:java.time.Instant [ + 0. 1446552000: long + 1. 0: int + ] + } + 3. net.corda:pgT0Kc3t/bvnzmgu/nb4Cg== [ + 0. + ] + 4. "R3" + ] + 1. schema [ + 0. [ + 0. composite type [ + 0. "net.corda.tools.serialization.Company" + 1. NULL + 2. [] + 3. object descriptor [ + 0. net.corda:XIBlQ9Yl/RlKGLjCMY1/Kg==: symbol + 1. NULL + ] + 4. [ + 0. field [ + 0. "createdInYear" + 1. "short" + 2. [] + 3. "0" + 4. NULL + 5. true + 6. false + ] + 1. field [ + 0. "departments" + 1. "*" + 2. [ + 0. "java.util.List" + ] + 3. NULL + 4. NULL + 5. true + 6. false + ] + 2. field [ + 0. "historicalEvents" + 1. "*" + 2. [ + 0. "java.util.Map" + ] + 3. NULL + 4. NULL + 5. true + 6. false + ] + 3. field [ + 0. "logo" + 1. "net.corda.core.utilities.OpaqueBytes" + 2. [] + 3. NULL + 4. 
NULL + 5. true + 6. false + ] + 4. field [ + 0. "name" + 1. "string" + 2. [] + 3. NULL + 4. NULL + 5. true + 6. false + ] + ] + ] + 1. restricted type [ + 0. "java.util.List" + 1. NULL + 2. [] + 3. "list" + 4. object descriptor [ + 0. net.corda:mCdn5Q/6wPrRd120wfv5og==: symbol + 1. NULL + ] + 5. [] + ] + 2. composite type [ + 0. "net.corda.tools.serialization.Department" + 1. NULL + 2. [] + 3. object descriptor [ + 0. net.corda:J6fOfvKOUIhpLqSmzN2ecw==: symbol + 1. NULL + ] + 4. [ + 0. field [ + 0. "employees" + 1. "*" + 2. [ + 0. "java.util.List" + ] + 3. NULL + 4. NULL + 5. true + 6. false + ] + 1. field [ + 0. "name" + 1. "string" + 2. [] + 3. NULL + 4. NULL + 5. true + 6. false + ] + ] + ] + 3. restricted type [ + 0. "java.util.List" + 1. NULL + 2. [] + 3. "list" + 4. object descriptor [ + 0. net.corda:KwaBqNRsTDOaXBrYdtDZpw==: symbol + 1. NULL + ] + 5. [] + ] + 4. composite type [ + 0. "net.corda.tools.serialization.Employee" + 1. NULL + 2. [] + 3. object descriptor [ + 0. net.corda:zjQ3JQXiArQUxXuCcaWANw==: symbol + 1. NULL + ] + 4. [ + 0. field [ + 0. "names" + 1. "kotlin.Pair" + 2. [] + 3. NULL + 4. NULL + 5. true + 6. false + ] + ] + ] + 5. composite type [ + 0. "kotlin.Pair" + 1. NULL + 2. [] + 3. object descriptor [ + 0. net.corda:c0Lkwk4E63sshTPr2G60aQ==: symbol + 1. NULL + ] + 4. [ + 0. field [ + 0. "first" + 1. "string" + 2. [] + 3. NULL + 4. NULL + 5. true + 6. false + ] + 1. field [ + 0. "second" + 1. "string" + 2. [] + 3. NULL + 4. NULL + 5. true + 6. false + ] + ] + ] + 6. restricted type [ + 0. "java.util.Map" + 1. NULL + 2. [] + 3. "map" + 4. object descriptor [ + 0. net.corda:QXkG3ayKZNvF8dIEKbOTSw==: symbol + 1. NULL + ] + 5. [] + ] + 7. composite type [ + 0. "net.corda.core.utilities.OpaqueBytes" + 1. NULL + 2. [] + 3. object descriptor [ + 0. net.corda:pgT0Kc3t/bvnzmgu/nb4Cg==: symbol + 1. NULL + ] + 4. [ + 0. field [ + 0. "bytes" + 1. "binary" + 2. [] + 3. NULL + 4. NULL + 5. true + 6. false + ] + ] + ] + ] + ] + 2. 
transform schema { + } + ] \ No newline at end of file