From 74354580b68409645122163e13652d3fee5c21e4 Mon Sep 17 00:00:00 2001
From: Mike Hearn <mike@r3.com>
Date: Wed, 7 Nov 2018 10:57:46 +0100
Subject: [PATCH] Add some documentation on the wire format.

---
 docs/source/serialization-index.rst |   1 +
 docs/source/serialization.rst       |   3 +-
 docs/source/wire-format.rst         | 501 ++++++++++++++++++++++++++++
 3 files changed, 504 insertions(+), 1 deletion(-)
 create mode 100644 docs/source/wire-format.rst

diff --git a/docs/source/serialization-index.rst b/docs/source/serialization-index.rst
index 6ec054bd0b..d9cbcc2810 100644
--- a/docs/source/serialization-index.rst
+++ b/docs/source/serialization-index.rst
@@ -9,3 +9,4 @@ Serialization
    serialization-default-evolution.rst
    serialization-enum-evolution.rst
    blob-inspector
+   wire-format.rst
diff --git a/docs/source/serialization.rst b/docs/source/serialization.rst
index b394754147..88d78a088c 100644
--- a/docs/source/serialization.rst
+++ b/docs/source/serialization.rst
@@ -73,7 +73,8 @@ It's reproduced here as an example of both ways you can do this for a couple of
 AMQP
 ----
 
-Corda uses an extended form of AMQP 1.0 as its binary wire protocol.
+Corda uses an extended form of AMQP 1.0 as its binary wire protocol. You can learn more about the :doc:`wire-format` Corda
+uses if you intend to parse Corda messages from non-JVM platforms.
 
 Corda serialisation is currently used for:
 
diff --git a/docs/source/wire-format.rst b/docs/source/wire-format.rst
new file mode 100644
index 0000000000..527726723c
--- /dev/null
+++ b/docs/source/wire-format.rst
@@ -0,0 +1,501 @@
+Wire format
+===========
+
+This document describes the Corda wire format. With the following information and an implementation of the AMQP/1.0
+specification, you can read Corda serialised binary messages. An example implementation of AMQP/1.0 would be Apache
+Qpid Proton, or Microsoft AMQP.NET Lite.
+
+Header
+------
+
+All messages start with the 8 byte sequence ``corda\1\0\0``, that is, the string "corda" followed by a one byte and then two
+zero bytes. That means you can't directly feed a Corda message into an AMQP library. You must check the header string and
+then skip it.
+
+The '1' byte indicates the major version of the format. It should always be set to 1, if it isn't that implies a backwards
+incompatible serialisation format has been developed and you should abort. The second and third bytes are incremented if we make
+extensions to the format. You can usually ignore these.
+
+AMQP intro
+----------
+
+AMQP/1.0 (which is quite different to AMQP/0.9) is protocol that contains a standardised binary encoding scheme, comparable to but
+more advanced than Google protocol buffers. `The AMQP specification <https://docs.oasis-open.org/amqp/core/v1.0/os/amqp-core-types-v1.0-os.html>`_
+is quite concise and easy to read: this document will reference it in many places. It also provides a variety of encoded examples
+that can be used to understand each byte of a message.
+
+The format specifies encodings for several 'primitive' types: numbers, strings, UUIDs, timestamps
+and symbols (these can be thought of as enum entries). It also defines how to encode maps, lists and arrays. The difference
+between the latter two is that arrays always contain a single type, whereas lists can contain elements of different types.
+An AMQP byte stream is simply a repeated series of elements.
+
+So far, so standard. However AMQP goes further than most such tagged binary encodings by including the concept of
+*described types*. This is a way to impose an application-level type system on top of the basic "bags of elements"
+that low-level AMQP gives you. Any element in the stream can be prefixed with a *descriptor*, which is either a string
+or a 64 bit value. Both types of label have a defined namespacing mechanism. This labelling scheme allows sophisticated
+layerings to be added on top of the simple, interoperable core.
+
+AMQP therefore also defines a type system and schema representation, that allows you to create the app-level type layer.
+Standard AMQP defines an XML based schema language. Fields can be grouped together using *composite types*. A composite
+type is simply a described list, in which each list entry is one field of the composite. Composites are used to encode
+language-level classes, records, structs etc.
+
+You can also define in a *restricted type*, which can be used to define a new type that is a specialisation or subset of
+an existing one. For enumerations the choices can be listed in the schema.
+
+Due to this design you can think of a serialised message as being interpretable at several levels of detail.
+You can parse it just using the basic AMQP type system, which will give you nested lists and maps containing a few basic
+types. This is similar to what JSON would give you. Or you can utilise the descriptors and map those containers to higher
+level, more strongly typed structures.
+
+Extended AMQP
+-------------
+
+So far we've got collections that contain primitives or more collections, and any element can be labelled with a
+string or numeric code. This is good, but compared to a format like JSON or XML it's not really self describing.
+A class will be mapped to a list of field contents. Even if we know the name of that class, we still won't really know
+what the fields mean without having access to the original code of the class that the message was generated from.
+
+AMQP's type system can solve this, however, out of the box there are two problems:
+
+1. Messages don't include their own schemas.
+2. AMQP only defines an XML based representation for schemas.
+
+We'd rather not embed XML inside a binary format designed to be digitally signed, so we have defined a straightforward
+mapping from this schema notation to AMQP encoding itself. This makes our AMQP messages self describing, by embedding a
+schema for each application or platform level type that is serialised. The schema provides information like field names,
+annotations and type variables for generic types. The schema can of course be ignored in many interop cases: it's there
+to enable version evolution of persisted data structures over time.
+
+.. note:: It is a deliberate choice to sacrifice encoding efficiency for self-description: we prefer to pay more now than risk
+   having data on the ledger later on that's hard to read due to loss of (old versions of) applications. The intention is
+   that a mix of compression and separating the schema parts out when both sides already agree on what they are will return
+   most of the lost efficiency.
+
+Descriptors
+-----------
+
+Serialised messages use described types extensively. There are two types of descriptor:
+
+1. 64 bit code. In Corda, the top 16 bits are always equal to 0xc562 which is R3's IANA assigned enterprise number. The
+   low bits define various elements in our meta-schema (i.e. the way we describe the schemas of other messages).
+2. String. These always start with "net.corda:" and are then followed by either a 'well known' type name, or
+   a base64 encoded *fingerprint* of the underlying schema that was generated from the original class. They are
+   encoded using the AMQP symbol type.
+
+The fingerprint can be used to determine if the serialised message maps precisely to a holder type (class) you already
+have in your environment. If you don't recognise the fingerprint, you may need to examine the schema data to figure out
+a reasonable approximate mapping to a type you do have ... or you can give up and throw a parse error.
+
+The numeric codes are defined as follows (remember to mask out the top 16 bits first):
+
+1. ENVELOPE
+2. SCHEMA
+3. OBJECT_DESCRIPTOR
+4. FIELD
+5. COMPOSITE_TYPE
+6. RESTRICTED_TYPE
+7. CHOICE
+8. REFERENCED_OBJECT
+9. TRANSFORM_SCHEMA
+10. TRANSFORM_ELEMENT
+11. TRANSFORM_ELEMENT_KEY
+
+In this document, the term "record" is used to mean an AMQP list described with a numeric code as enumerated
+above. A record may represent an actual logical list of variable length, or be a fixed length list of fields. Our
+encoding should really have used AMQP arrays for the case where the contents are of variable length and lists only for
+representing object/class like things, unfortunately it uses lists for both. The term "object" is used to mean a list
+described with a string/symbolic descriptor that references a schema entry.
+
+High level format
+-----------------
+
+Every Corda message is at the top level an *ENVELOPE* record containing three elements:
+
+1. The top level message and is described using a string (symbolic) descriptor.
+2. A *SCHEMA* record.
+3. A *TRANSFORM_SCHEMA* record.
+
+The transform schema will usually be empty - it's used to describe how a data structure has evolved over time, so
+making it easier to map to old/new code.
+
+The *SCHEMA* record always contains a single element, which is itself another list containing *COMPOSITE_TYPE* records.
+Each *COMPOSITE_TYPE* record describes a single app-level type and has the following members:
+
+1. Name: string
+2. Label: nullable string
+3. Provides: list of strings
+4. Descriptor: An *OBJECT_DESCRIPTOR* record
+5. Fields: A list of *FIELD* records
+
+The label will typically be unused and left as null - it's here to match the AMQP specification and could in future contain
+arbitrary unstructured text, e.g. a javadoc explaining more about the semantics of the field. The "provides list" is
+a set of strings naming Java interfaces that the original type implements. It can be used to work with messages generically
+in a strongly typed, safe manner. Rather than guessing whether a type is meant to be a Foo or Bar based on matching
+with the field names, the schema itself declares what contracts it is intended to meet.
+
+The descriptor record has two elements, the first is a string/symbol and the second is an unsigned long code. Typically
+only one will be set. This record corresponds to the descriptor that will appear in the main message stream.
+
+Finally, the fields are defined. Each *FIELD* record has the following members:
+
+1. Name: string
+2. Type: string
+3. Requires: list of string
+4. Default: nullable string
+5. Label: nullable string
+6. Mandatory: boolean
+7. Multiple: boolean
+
+The meaning of these are defined in the AMQP specification. The type string is a Java class name *with* generic parameters.
+
+The other parts of the schema map to the AMQP XML schema spec in the same straightforward manner.
+
+Mapping JVM classes to composite types
+--------------------------------------
+
+Corda does not need or use a separate schema definition language. Instead, source code is used as a way to define schemas
+via regular class definitions in any statically typed JVM-bytecode targeting language. This specification will thus
+frequently to types whose only definitions are found in the Corda source code: these definitions are canonical and not
+derived from any other kind of schema. Any class annotated as ``@CordaSerializable`` could appear in an AMQP message.
+Whilst you don't need access to the original class files to decode the typed structure of a Corda message due to the embedded AMQP
+schema, it will often be much more convenient to work with the original structures using JVM reflection. This is typically
+very useful for code generators.
+
+If you want to you can nonetheless parse the Java .class file format using a variety of libraries. The format is a simple tagged
+union style format and `can be parsed in about 300 lines of C <https://github.com/atcol/cfr/blob/master/src/class.c>`_. The only
+part of the class file that actually matters for type information are the parameters to the constructor, as that defines which fields
+are stored to the wire.
+
+Source code does not have a deterministic field ordering. Developers may re-arrange fields in their classes as they refactor
+their code, which in a conventional serialisation scheme would break the wire format. Thus when mapping classes to AMQP schemas,
+we alphabetically sort the fields. If a new field is added, it may thus appear in the middle of the composite type list rather than
+at the end.
+
+.. warning:: The above implies that you cannot handle format evolution by simply skipping fields you don't understand. Instead you
+   must notice when the descriptors have changed from what you expect, and consult the schema to determine how to map the new message
+   to a schema that you can work with.
+
+Containers
+----------
+
+AMQP defines encodings for maps and lists, which are mapped to/from ``java.util.Map`` and ``java.util.List`` in JVM code. You don't need
+any special support to read these if you don't care about the higher level type system.
+
+In the binary schemas containers are represented as follows. A field in a composite type that is a list will look like this:
+
+1. Name: "livingIn"
+2. Type: "*"
+3. Requires: [ "java.util.List<net.corda.tools.serialization.City>" ]
+4. Default: NULL
+5. Label: NULL
+6. Mandatory: true
+7. Multiple: false
+
+The *requires* field is a list of *archetypes*. These are simply uninterpreted strings that refer to other schema elements, which
+list the same string in their *provides* field. In this way a form of intersection typing is implemented. We use Java type names
+with generics to link the field to the definition of a restricted type.
+
+The list type will be defined as a restricted type, like so:
+
+0. Name: "java.util.List<net.corda.tools.serialization.City>"
+1. Label: NULL
+2. Provides: []
+3. Source: "list"
+4. Descriptor: [
+     0. Symbol: net.corda:2A8U5kaXW/lD5ns+l0xPFg==
+     1. Numeric: NULL
+   ]
+5. Choices: []
+
+Signed data
+-----------
+
+A common pattern in Corda is that an outer wrapper serialised message contains signatures and certificates for an inner
+serialised message. The inner message is represented as 'binary', thus it requires two passes to deserialise such a
+message fully. This is intended as a form of security firebreak, because it means you can avoid processing any serialised
+data until the signatures have been checked and provenance established. It also helps ensure everyone calculates a
+signature over the same binary data without roundtripping issues appearing.
+
+The following types are used for this in the current version of the protocol (correct as of Corda 4):
+
+* ``net.corda.core.internal.SignedDataWithCert``, descriptor ``net.corda:VywzVs/TR8ztvQBpYFpnlQ==``. Fields:
+    * raw: ``net.corda.core.serialization.SerializedBytes<?>``
+    * sig: ``net.corda.core.internal.DigitalSignatureWithCert``
+* ``net.corda.core.internal.DigitalSignatureWithCert``, descriptor ``net.corda:AJin3eE1QDfCwTiDWC5hJA==``. Fields:
+    * by: ``java.security.cert.X509Certificate``
+    * bytes: binary
+
+The signature bytes are opaque and their format depends on the cryptographic scheme identified in the X.509 certificate,
+for example, elliptic curve signatures use a standardised (non-AMQP) binary format that encodes the coordinates of the
+point on the curve. The type ``java.security.cert.X509Certificate`` does not appear in the schema, it is parsed as a
+special case and has the descriptor ``net.corda:java.security.cert.X509Certificate``. A field with this descriptor is
+of type 'binary' and contains a certificate in the standard X.509 binary format (again, not AMQP).
+
+Examples
+--------
+
+The following sample shows how a few lines of Kotlin code defining some sophisticated data structures maps to an AMQP message.
+
+.. sourcecode:: kotlin
+
+   @CordaSerializable
+   data class Employee(val names: Pair<String, String>)
+
+   @CordaSerializable
+   data class Department(val name: String, val employees: List<Employee>)
+
+   @CordaSerializable
+   data class Company(
+           val name: String,
+           val createdInYear: Short,
+           val logo: OpaqueBytes,
+           val departments: List<Department>,
+           val historicalEvents: Map<String, Instant>
+   )
+
+and here is an ad-hoc textual representation of what it turns into on the wire (this format is not stable or meaningful)::
+
+    envelope [
+        0. net.corda:XIBlQ9Yl/RlKGLjCMY1/Kg== [
+               0. 2014: short
+                      0. net.corda:J6fOfvKOUIhpLqSmzN2ecw== [
+               1. net.corda:mCdn5Q/6wPrRd120wfv5og== [
+                             0. net.corda:KwaBqNRsTDOaXBrYdtDZpw== [
+                                           0. net.corda:c0Lkwk4E63sshTPr2G60aQ== [
+                                    0. net.corda:zjQ3JQXiArQUxXuCcaWANw== [
+                                                  0. "Mike"
+                                              ]
+                                                  1. "Hearn"
+                                       ]
+                                           0. net.corda:c0Lkwk4E63sshTPr2G60aQ== [
+                                    1. net.corda:zjQ3JQXiArQUxXuCcaWANw== [
+                                                  0. "Richard"
+                                              ]
+                                                  1. "Brown"
+                                       ]
+                                           0. net.corda:c0Lkwk4E63sshTPr2G60aQ== [
+                                    2. net.corda:zjQ3JQXiArQUxXuCcaWANw== [
+                                                  0. "James"
+                                              ]
+                                                  1. "Carlyle"
+                                       ]
+                                ]
+                             1. "Platform"
+                         ]
+                  ]
+               2. net.corda:QXkG3ayKZNvF8dIEKbOTSw== {
+                      "First lab project proposal email" -> net.corda:java.time.Instant [
+                          0. 1411596660: long
+                          1. 0: int
+                      ]
+                      "Hired Mike" -> net.corda:java.time.Instant [
+                          0. 1446552000: long
+                          1. 0: int
+                      ]
+                  }
+               3. net.corda:pgT0Kc3t/bvnzmgu/nb4Cg== [
+                      0. <binary of 1 bytes>
+                  ]
+               4. "R3"
+           ]
+        1. schema [
+               0. [
+                      0. composite type [
+                             0. "net.corda.tools.serialization.Company"
+                             1. NULL
+                             2. []
+                             3. object descriptor [
+                                    0. net.corda:XIBlQ9Yl/RlKGLjCMY1/Kg==: symbol
+                                    1. NULL
+                                ]
+                             4. [
+                                    0. field [
+                                           0. "createdInYear"
+                                           1. "short"
+                                           2. []
+                                           3. "0"
+                                           4. NULL
+                                           5. true
+                                           6. false
+                                       ]
+                                    1. field [
+                                           0. "departments"
+                                           1. "*"
+                                           2. [
+                                                  0. "java.util.List<net.corda.tools.serialization.Department>"
+                                              ]
+                                           3. NULL
+                                           4. NULL
+                                           5. true
+                                           6. false
+                                       ]
+                                    2. field [
+                                           0. "historicalEvents"
+                                           1. "*"
+                                           2. [
+                                                  0. "java.util.Map<string, java.time.Instant>"
+                                              ]
+                                           3. NULL
+                                           4. NULL
+                                           5. true
+                                           6. false
+                                       ]
+                                    3. field [
+                                           0. "logo"
+                                           1. "net.corda.core.utilities.OpaqueBytes"
+                                           2. []
+                                           3. NULL
+                                           4. NULL
+                                           5. true
+                                           6. false
+                                       ]
+                                    4. field [
+                                           0. "name"
+                                           1. "string"
+                                           2. []
+                                           3. NULL
+                                           4. NULL
+                                           5. true
+                                           6. false
+                                       ]
+                                ]
+                         ]
+                      1. restricted type [
+                             0. "java.util.List<net.corda.tools.serialization.Department>"
+                             1. NULL
+                             2. []
+                             3. "list"
+                             4. object descriptor [
+                                    0. net.corda:mCdn5Q/6wPrRd120wfv5og==: symbol
+                                    1. NULL
+                                ]
+                             5. []
+                         ]
+                      2. composite type [
+                             0. "net.corda.tools.serialization.Department"
+                             1. NULL
+                             2. []
+                             3. object descriptor [
+                                    0. net.corda:J6fOfvKOUIhpLqSmzN2ecw==: symbol
+                                    1. NULL
+                                ]
+                             4. [
+                                    0. field [
+                                           0. "employees"
+                                           1. "*"
+                                           2. [
+                                                  0. "java.util.List<net.corda.tools.serialization.Employee>"
+                                              ]
+                                           3. NULL
+                                           4. NULL
+                                           5. true
+                                           6. false
+                                       ]
+                                    1. field [
+                                           0. "name"
+                                           1. "string"
+                                           2. []
+                                           3. NULL
+                                           4. NULL
+                                           5. true
+                                           6. false
+                                       ]
+                                ]
+                         ]
+                      3. restricted type [
+                             0. "java.util.List<net.corda.tools.serialization.Employee>"
+                             1. NULL
+                             2. []
+                             3. "list"
+                             4. object descriptor [
+                                    0. net.corda:KwaBqNRsTDOaXBrYdtDZpw==: symbol
+                                    1. NULL
+                                ]
+                             5. []
+                         ]
+                      4. composite type [
+                             0. "net.corda.tools.serialization.Employee"
+                             1. NULL
+                             2. []
+                             3. object descriptor [
+                                    0. net.corda:zjQ3JQXiArQUxXuCcaWANw==: symbol
+                                    1. NULL
+                                ]
+                             4. [
+                                    0. field [
+                                           0. "names"
+                                           1. "kotlin.Pair<string, string>"
+                                           2. []
+                                           3. NULL
+                                           4. NULL
+                                           5. true
+                                           6. false
+                                       ]
+                                ]
+                         ]
+                      5. composite type [
+                             0. "kotlin.Pair<string, string>"
+                             1. NULL
+                             2. []
+                             3. object descriptor [
+                                    0. net.corda:c0Lkwk4E63sshTPr2G60aQ==: symbol
+                                    1. NULL
+                                ]
+                             4. [
+                                    0. field [
+                                           0. "first"
+                                           1. "string"
+                                           2. []
+                                           3. NULL
+                                           4. NULL
+                                           5. true
+                                           6. false
+                                       ]
+                                    1. field [
+                                           0. "second"
+                                           1. "string"
+                                           2. []
+                                           3. NULL
+                                           4. NULL
+                                           5. true
+                                           6. false
+                                       ]
+                                ]
+                         ]
+                      6. restricted type [
+                             0. "java.util.Map<string, java.time.Instant>"
+                             1. NULL
+                             2. []
+                             3. "map"
+                             4. object descriptor [
+                                    0. net.corda:QXkG3ayKZNvF8dIEKbOTSw==: symbol
+                                    1. NULL
+                                ]
+                             5. []
+                         ]
+                      7. composite type [
+                             0. "net.corda.core.utilities.OpaqueBytes"
+                             1. NULL
+                             2. []
+                             3. object descriptor [
+                                    0. net.corda:pgT0Kc3t/bvnzmgu/nb4Cg==: symbol
+                                    1. NULL
+                                ]
+                             4. [
+                                    0. field [
+                                           0. "bytes"
+                                           1. "binary"
+                                           2. []
+                                           3. NULL
+                                           4. NULL
+                                           5. true
+                                           6. false
+                                       ]
+                                ]
+                         ]
+                  ]
+           ]
+        2. transform schema {
+           }
+    ]
\ No newline at end of file