Mirror of https://github.com/corda/corda.git (synced 2025-04-07 19:34:41 +00:00)

Merge pull request #3156 from corda/ENT-1761/aslemmer-sgx-infrastructure-design
ENT-1761: Add design doc for SGX integration into the node
Commit: 1eb3c25967

docs/source/design/sgx-infrastructure/Example SGX deployment.png (binary, 74 KiB)
docs/source/design/sgx-infrastructure/decisions/certification.md



--------------------------------------------
Design Decision: CPU certification method
============================================

## Background / Context

Remote attestation is done in two main steps.

1. Certification of the CPU. This boils down to some kind of Intel signature over a key that only a specific enclave has
   access to.
2. Using the certified key to sign business logic specific enclave quotes and providing the full chain of trust to
   challengers.

This design question concerns the way we can manage a certification key. A more detailed description is
[here](../details/attestation.md).

## Options Analysis

### A. Use Intel's recommended protocol

This involves using aesmd and the Intel SDK to establish an opaque attestation key that transparently signs quotes.
Then for each enclave we need to do several roundtrips to IAS to get a revocation list (which we don't need) and request
a direct Intel signature over the quote (which we shouldn't need, as trust has already been established during the EPID
join).

#### Advantages

1. We have a PoC implemented that does this

#### Disadvantages

1. Frequent roundtrips to Intel infrastructure
2. Intel can reproduce the certifying private key
3. Involves unnecessary protocol steps and features we don't need (EPID)

### B. Use Intel's protocol to bootstrap our own certificate

This involves using Intel's current attestation protocol to have Intel sign over our own certifying enclave's
certificate that derives its certification key using the sealing fuse values.
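
The deterministic-derivation idea behind this option can be sketched as follows. This is a minimal illustration only: real SGX key derivation happens in hardware via EGETKEY, and the function name, salt label and inputs here are hypothetical stand-ins.

```python
import hashlib
import hmac

def derive_certification_key(sealing_fuse_values: bytes, measurement: bytes) -> bytes:
    """Illustrative only: derive a deterministic certification key from the
    CPU's sealing fuse values and the certifying enclave's measurement,
    HKDF-style. The real derivation is done in hardware via EGETKEY."""
    # Extract step: compress the fuse material into a fixed-size pseudorandom key.
    prk = hmac.new(b"certification-key-salt", sealing_fuse_values, hashlib.sha256).digest()
    # Expand step: bind the key to the certifying enclave's measurement.
    return hmac.new(prk, measurement + b"\x01", hashlib.sha256).digest()

# Same fuses + same measurement => same key (so certification survives restarts);
# a different CPU or a different enclave image yields an unrelated key.
k1 = derive_certification_key(b"fuses-cpu-A", b"mrenclave-cert-v1")
k2 = derive_certification_key(b"fuses-cpu-A", b"mrenclave-cert-v1")
k3 = derive_certification_key(b"fuses-cpu-B", b"mrenclave-cert-v1")
assert k1 == k2 and k1 != k3
```

The point of deriving from the sealing fuses (rather than letting Intel hand us a key) is that Intel cannot reproduce the result, which is exactly the advantage claimed below.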

#### Advantages

1. Certifying key not reproducible by Intel
2. Allows for our own CPU enrollment process, should we need one
3. Infrequent roundtrips to Intel infrastructure (only needed once per microcode update)

#### Disadvantages

1. Still uses the EPID protocol

### C. Intercept Intel's recommended protocol

This involves using Intel's current protocol as is, but instead of doing roundtrips to IAS to get signatures over quotes
we try to establish the chain of trust during EPID provisioning and reuse it later.

#### Advantages

1. Uses Intel's current protocol
2. Infrequent roundtrips to Intel infrastructure

#### Disadvantages

1. The provisioning protocol is underdocumented and it's hard to decipher how to construct the trust chain
2. The chain of trust is not a traditional certificate chain but rather a sequence of signed messages

## Recommendation and justification

Proceed with Option B. This is the most readily available and flexible option.
docs/source/design/sgx-infrastructure/decisions/enclave-language.md



--------------------------------------------
Design Decision: Enclave language of choice
============================================

## Background / Context

In the long run we would like to use the JVM for all enclave code. This is so that later on we can solve the problem of
side channel attacks on the bytecode level (e.g. oblivious RAM) rather than putting this burden on enclave functionality
implementors.

As we plan to use a JVM in the long run anyway and we already have an embedded Avian implementation, I think the best
course of action is to use this immediately together with the full JDK. To keep the native layer as minimal as possible
we should forward enclave calls with little to no marshalling to the embedded JVM. All subsequent sanity checks,
including ones currently handled by the edger8r generated code, should be done inside the JVM. Accessing native enclave
functionality (including OCALLs and reading memory from the untrusted heap) should go through a centrally defined JNI
interface. This way, when we switch away from Avian we have a very clear interface to code against, both from the hosted
code's side and from the ECALL/OCALL side.

The question remains what the thin native layer should be written in. Currently we use C++, but various alternatives
have popped up, most notably Rust.

## Options Analysis

### A. C++

#### Advantages

1. The Intel SDK is written in C++
2. [Reproducible binaries](https://wiki.debian.org/ReproducibleBuilds)
3. The native parts of Avian, HotSpot and SubstrateVM are written in C/C++

#### Disadvantages

1. Unsafe memory accesses (unless strictly adhering to modern C++)
2. Quirky build
3. Larger attack surface

### B. Rust

#### Advantages

1. Safe memory accesses
2. Easier to read/write code, easier to audit

#### Disadvantages

1. Does not currently produce reproducible binaries (but it's [planned](https://github.com/rust-lang/rust/issues/34902))
2. We would mostly be using it for unsafe things (raw pointers, calling C++ code)

## Recommendation and justification

Proceed with Option A (C++) and keep the native layer as small as possible. Rust currently doesn't produce reproducible
binary code, and we need the native layer mostly to handle raw pointers and call Intel SDK functions anyway, so we
wouldn't really leverage Rust's safe memory features.

Having said that, once Rust implements reproducible builds we may switch to it; in that case the thinness of the native
layer will be of great benefit.
docs/source/design/sgx-infrastructure/decisions/kv-store.md



--------------------------------------------
Design Decision: Key-value store implementation
============================================

This is a simple choice of technology.

## Options Analysis

### A. ZooKeeper

#### Advantages

1. Tried and tested
2. HA team already uses ZooKeeper

#### Disadvantages

1. Clunky API
2. No HTTP API
3. Hand-rolled protocol

### B. etcd

#### Advantages

1. Very simple API, UNIX philosophy
2. gRPC
3. Tried and tested
4. MVCC
5. Kubernetes uses it in the background already
6. "Successor" of ZooKeeper
7. Cross-platform, OSX and Windows support
8. Resiliency, supports backups for disaster recovery

#### Disadvantages

1. HA team uses ZooKeeper

### C. Consul

#### Advantages

1. End to end discovery including UIs

#### Disadvantages

1. Not very widespread
2. Need to store other metadata as well
3. HA team uses ZooKeeper

## Recommendation and justification

Proceed with Option B (etcd). It's practically a successor of ZooKeeper, the interface is quite simple, it focuses on
primitives (CAS, leases, watches etc.) and it is tried and tested by many heavily used applications, most notably
Kubernetes. In fact we have the option to use etcd indirectly by writing Kubernetes extensions; this would have the
advantage of getting readily available CLI and UI tools to manage an enclave cluster.
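
The CAS primitive mentioned above is the one we'd lean on for atomic metadata updates (e.g. an enclave's active channel set). A minimal sketch of that usage pattern, with an in-memory stand-in for etcd and a hypothetical key layout:

```python
# A compare-and-swap (CAS) retry loop over an in-memory stand-in for etcd's
# versioned key-value API; key names below are hypothetical.
class MockEtcd:
    def __init__(self):
        self._data = {}  # key -> (value, version)

    def get(self, key):
        return self._data.get(key, (None, 0))

    def compare_and_swap(self, key, expected_version, new_value):
        _, version = self.get(key)
        if version != expected_version:
            return False  # a concurrent writer got there first; caller retries
        self._data[key] = (new_value, version + 1)
        return True

def update_active_channels(store, sealing_id, mutate):
    # Read current metadata, apply the mutation, CAS it back; retry on conflict.
    key = f"/enclaves/{sealing_id}/active-channels"
    while True:
        value, version = store.get(key)
        if store.compare_and_swap(key, version, mutate(value or set())):
            return

store = MockEtcd()
update_active_channels(store, "cpu-1234", lambda s: s | {"provisioning/request-bootstrap-key"})
update_active_channels(store, "cpu-1234", lambda s: s | {"signing/sign-quote"})
value, version = store.get("/enclaves/cpu-1234/active-channels")
assert value == {"provisioning/request-bootstrap-key", "signing/sign-quote"}
assert version == 2
```

In real etcd the same shape is expressed with a transaction comparing the key's mod revision; the mock only shows the retry-loop structure.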

docs/source/design/sgx-infrastructure/decisions/roadmap.md



--------------------------------------------
Design Decision: Strategic SGX roadmap
============================================

## Background / Context

The statefulness of the enclave greatly affects the complexity of both the infrastructure and attestation.
The infrastructure needs to take care of tracking enclave state for request routing, and we need extra care if we want
to make sure that old keys cannot be used to reveal sealed secrets.

As the first step the easiest thing to do would be to provide an infrastructure for hosting *stateless* enclaves that
are only concerned with enclave to non-enclave attestation. This provides a framework to do provable computations,
without the headache of handling sealed state and the various implied upgrade paths.

In the first phase we want to facilitate the ease of rolling out full enclave images (JAR linked into the image)
regardless of what the enclaves are doing internally. The contract of an enclave is the host-enclave API (attestation
protocol) and the exposure of the static set of channels the enclave supports. Furthermore the infrastructure will allow
deployment in a cloud environment and trivial scalability of enclaves through starting them on demand.

The first phase will allow for a "fixed stateless provable computations as a service" product, e.g. provable builds or
RNG.

The question remains how we should proceed afterwards. In terms of infrastructure we have a choice of implementing
sealed state or focusing on dynamic loading of bytecode. We also have the option to delay this decision until the end of
the first phase.

## Options Analysis

### A. Implement sealed state

Implementing sealed state involves solving the routing problem; for this we can use the concept of active channel sets.
Furthermore we need to solve various additional security issues around guarding sealed secret provisioning, most notably
expiration checks. This would involve implementing a future-proof calendar time oracle, which may turn out to be
impossible, or not quite good enough. We may decide that we cannot actually provide strong privacy guarantees and need
to enforce epochs as mentioned [here](../details/time.md).

#### Advantages

1. We would solve long term secret persistence early, allowing for a longer timeframe for testing upgrades and
   reprovisioning before we integrate Corda
2. Allows a "fixed stateful provable computations as a service" product, e.g. HA encryption

#### Disadvantages

1. There are some unsolved issues (calendar time, sealing epochs)
2. It would delay non-stateful Corda integration

### B. Implement dynamic code loading

Implementing dynamic loading involves sandboxing of the bytecode, providing bytecode verification and perhaps
storage/caching of JARs (although it may be better to develop a more generic caching layer and use channels themselves
to do the upload). Doing bytecode verification is quite involved, as Avian does not support verification, so this
would mean switching to a different JVM. This JVM would be either HotSpot or SubstrateVM; we are doing some preliminary
exploratory work to assess their feasibility. If we choose this path it opens up the first true integration point with
Corda by enabling semi-validating notaries - these are non-validating notaries that check an SGX signature over the
transaction. It would also enable an entirely separate generic product for verifiable pure computation.

#### Advantages

1. Early adoption of Graal if we choose to go with it (the alternative is HotSpot)
2. Allows first integration with Corda (semi-validating notaries)
3. Allows a "generic stateless provable computation as a service" product, i.e. anything expressible as a JAR
4. Holds off on sealed state

#### Disadvantages

1. Too early Graal integration may result in a maintenance headache later

## Recommendation and justification

Proceed with Option B, dynamic code loading. It would make us very early adopters of Graal (with the implied ups and
downs), and most importantly kickstart collaboration between R3 and Oracle. We would also move away from Avian, which we
wanted to do anyway. It would also give us more time to think about the issues around sealed state, do exploratory work
on potential solutions, and there may be further development from Intel's side. Furthermore we need dynamic loading for
any fully fledged Corda integration, so we should finish this ASAP.

## Appendix: Proposed roadmap breakdown



docs/source/design/sgx-infrastructure/decisions/roadmap.png (binary, 98 KiB)
docs/source/design/sgx-infrastructure/design.md

# SGX Infrastructure design

.. important:: This design document describes a feature of Corda Enterprise.

This document is intended as a design description of the infrastructure around the hosting of SGX enclaves, interaction
with enclaves and storage of encrypted data. It assumes basic knowledge of SGX concepts, and some knowledge of
Kubernetes for parts specific to that.

## High level description

The main idea behind the infrastructure is to provide a highly available cluster of enclave services (hosts) which can
serve enclaves on demand. It provides an interface for enclave business logic that's agnostic with regards to the
infrastructure, similar to [serverless architectures](details/serverless.md). The enclaves will use an opaque reference
to other enclaves or services in the form of [enclave channels](details/channels.md). Channels hide attestation details
and provide a loose coupling between enclave/non-enclave functionality and specific enclave images/services implementing
it. This loose coupling allows easier upgrade of enclaves, relaxed trust (whitelisting), dynamic deployment, and
horizontal scaling, as we can spin up enclaves dynamically on demand when a channel is requested.

## Infrastructure components

Here are the major components of the infrastructure. Note that this doesn't include business logic specific
infrastructure pieces (like ORAM blob storage for Corda privacy model integration).

* [**Distributed key-value store**](details/kv-store.md):
  Responsible for maintaining metadata about enclaves, hosts, sealed secrets and CPU locality.

* [**Discovery service**](details/discovery.md):
  Responsible for resolving an enclave channel to a specific enclave image and a host that can serve it, using the
  metadata in the key-value store.

* [**Enclave host**](details/host.md):
  This is a service capable of serving enclaves and driving the underlying traffic. Third party components like Intel's
  SGX driver and aesmd also belong here.

* [**Enclave storage**](details/enclave-storage.md):
  Responsible for serving enclave images to hosts. This is a simple static content server.

* [**IAS proxy**](details/ias-proxy.md):
  This is an unfortunate necessity because of Intel's requirement to do mutual TLS with their services.

## Infrastructure interactions

* **Enclave deployment**:
  This includes uploading of the enclave image/container to enclave storage and adding of the enclave metadata to the
  key-value store.

* **Enclave usage**:
  This includes using the discovery service to find a specific enclave image and a host to serve it, then connecting to
  the host, authenticating (attestation) and proceeding with the needed functionality.

* **Ops**:
  This includes management of the cluster (Kubernetes/Kubespray) and management of the metadata relating to discovery to
  control enclave deployment (e.g. canary, incremental, rollback).
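
The discovery step in these interactions can be sketched as a lookup over the key-value store metadata. This is an illustrative mock only; the host names, measurements and channel strings are hypothetical:

```python
# In-memory stand-in for the metadata discovery would read from the key-value
# store: which hosts serve which enclave measurement, and which channels each
# hosted enclave currently has active.
HOSTS = {
    "host-a": {"measurement": "mrenclave-signing-v2",
               "active_channels": {"signing/sign-quote"}},
    "host-b": {"measurement": "mrenclave-signing-v2",
               "active_channels": {"provisioning/request-bootstrap-key"}},
}
# Channel -> measurements implementing it (filled in at enclave deployment).
CHANNEL_INDEX = {"signing/sign-quote": {"mrenclave-signing-v2"}}

def resolve(channel: str):
    """Return (host, measurement) pairs able to serve the channel, preferring
    hosts where the channel is already active over ones that would need a
    fresh enclave spun up on demand."""
    candidates = []
    for host, info in HOSTS.items():
        if info["measurement"] in CHANNEL_INDEX.get(channel, set()):
            needs_spin_up = channel not in info["active_channels"]
            candidates.append((needs_spin_up, host, info["measurement"]))
    return [(h, m) for _, h, m in sorted(candidates)]

assert resolve("signing/sign-quote")[0] == ("host-a", "mrenclave-signing-v2")
```

After resolution the client connects to the chosen host and proceeds with attestation as described above.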

## Decisions to be made

* [**Strategic roadmap**](decisions/roadmap.md)
* [**CPU certification method**](decisions/certification.md)
* [**Enclave language of choice**](decisions/enclave-language.md)
* [**Key-value store**](decisions/kv-store.md)

## Further details

* [**Attestation**](details/attestation.md)
* [**Calendar time for data at rest**](details/time.md)
* [**Enclave deployment**](details/enclave-deployment.md)

## Example deployment

This is an example of how two Corda parties may use the above infrastructure. In this example R3 is hosting the IAS
proxy and the enclave image store, and the parties host the rest of the infrastructure, aside from Intel components.

Note that this is flexible; the parties may decide to host their own proxies (as long as they whitelist their keys) or
the enclave image store (although R3 will need to have a repository of the signed enclaves somewhere).
We may also decide to go the other way and have R3 host the enclave hosts and the discovery service, shared between
parties (if e.g. they don't have access to/want to maintain SGX capable boxes).


docs/source/design/sgx-infrastructure/details/attestation.md

### Terminology recap

**measurement**: The hash of an enclave image, uniquely pinning the code and related configuration
**report**: A data structure produced by an enclave, including the measurement and other non-static properties of the
running enclave instance (like the security version number of the hardware)
**quote**: A signed report of an enclave, produced by Intel's quoting enclave.

# Attestation

The goal of attestation is to authenticate enclaves. We are concerned with two variants of this: enclave to non-enclave
attestation and enclave to enclave attestation.

In order to authenticate an enclave we need to establish a chain of trust rooted in an Intel signature certifying that a
report is coming from an enclave running on genuine Intel hardware.

Intel's recommended attestation protocol is split into two phases.

1. Provisioning
   The first phase's goal is to establish an Attestation Key (AK), aka EPID key, unique to the SGX installation.
   The establishment of this key uses an underdocumented protocol similar to the attestation protocol:
   - Intel provides a Provisioning Certification Enclave (PCE). This enclave has special privileges in that it can derive a
     key in a deterministic fashion based on the *provisioning* fuse values. Intel stores these values in their databases
     and can do the same derivation to later check a signature from the PCE.
   - Intel provides a separate enclave called the Provisioning Enclave (PvE), also privileged, which interfaces with the PCE
     (using local attestation) to certify the PvE's report and talks with a special Intel endpoint to join an EPID group
     anonymously. During the join Intel verifies the PCE's signature. Once the join has happened the PvE creates a related
     private key (the AK) that cannot be linked by Intel to a specific CPU. The PvE seals this key (also sometimes referred
     to as the "EPID blob") to MRSIGNER, which means it can only be unsealed by Intel enclaves.

2. Attestation
   - When a user wants to do attestation of their own enclave they need to do so through the Quoting Enclave (QE), also
     signed by Intel. This enclave can unseal the EPID blob and use the key to sign over user provided reports.
   - The signed quote in turn is sent to the Intel Attestation Service, which can check whether the quote was signed by a
     key in the EPID group. Intel also checks whether the QE was provided with an up-to-date revocation list.

The end result is a signature of Intel over a signature of the AK over the user enclave quote. Challengers can then
simply check this chain to make sure that the user provided data in the quote (probably another key) comes from a
genuine enclave.

All enclaves involved (PCE, PvE, QE) are owned by Intel, so this setup basically forces us to use Intel's infrastructure
during attestation (which in turn forces us to do e.g. mutual TLS, maintain our own proxies etc.). There are two ways we
can get around this.

1. Hook the provisioning phase. During the last step of provisioning the PvE constructs a chain of trust rooted in
   Intel. If we can extract some provable chain that allows proving of membership based on an EPID signature then we can
   essentially replicate what IAS does.
2. Bootstrap our own certification. This would involve deriving another certification key based on sealing fuse values
   and getting an Intel signature over it using the original IAS protocol. This signature would then serve the same
   purpose as the certificate in 1.

## Non-enclave to enclave channels

When a non-enclave connects to a "leaf" enclave the goal is to establish a secure channel between the non-enclave and
the enclave by authenticating the enclave and possibly authenticating the non-enclave. In addition we want to provide
secrecy of the non-enclave. To this end we can use SIGMA-I to do a Diffie-Hellman key exchange between the non-enclave
identity and the enclave identity.

The enclave proves the authenticity of its identity by providing a certificate chain rooted in Intel. If we do our own
enclave certification then the chain goes like this:

* Intel signs the quote of the certifying enclave containing the certifying key pair's public part.
* The certifying key signs the report of the leaf enclave containing the enclave's temporary identity.
* The enclave identity signs the relevant bits in the SIGMA protocol.
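
The three-link chain above can be sketched as follows. This is structure-only: HMAC stands in for real asymmetric signatures (so verification here uses the signing key rather than a public key), and all key material and field names are hypothetical.

```python
import hashlib
import hmac

def sign(key: bytes, message: bytes) -> bytes:
    # Stand-in for an asymmetric signature; a real chain would verify each
    # link with the signer's *public* key.
    return hmac.new(key, message, hashlib.sha256).digest()

def verify(key: bytes, message: bytes, signature: bytes) -> bool:
    return hmac.compare_digest(sign(key, message), signature)

# Link 1: Intel signs the certifying enclave's quote (contains the certifying key).
intel_key = b"intel-root-key"
certifying_key = b"certifying-key-pub"
certifying_quote = b"quote:" + certifying_key
intel_sig = sign(intel_key, certifying_quote)

# Link 2: the certifying key signs the leaf enclave's report (contains its
# temporary identity).
leaf_identity = b"leaf-temp-identity-pub"
leaf_report = b"report:" + leaf_identity
cert_sig = sign(certifying_key, leaf_report)

# Link 3: the leaf identity signs the SIGMA handshake transcript.
transcript = b"sigma-transcript"
leaf_sig = sign(leaf_identity, transcript)

def challenger_accepts() -> bool:
    # A challenger walks the chain from the Intel root down to the handshake.
    return (verify(intel_key, certifying_quote, intel_sig)
            and verify(certifying_key, leaf_report, cert_sig)
            and verify(leaf_identity, transcript, leaf_sig))

assert challenger_accepts()
```

Note how the caching described next maps onto the links: link 1 rarely changes (disk cache), link 2 lasts as long as the enclave's temporary identity (memory cache), and only link 3 is per-handshake.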

Intel's signature may be cached on disk, and the certifying enclave's signature over the temporary identity may be cached
in enclave memory.

We can provide various invalidations, e.g. the non-enclave won't accept the signature if X time has passed since Intel's
signature, or R3's whitelisting cert has expired, etc.

If the enclave needs to authorise the non-enclave the situation is a bit more complicated. Let's say the enclave holds
some secret that it should only reveal to authorised non-enclaves. Authorisation is expressed as a whitelisting
signature over the non-enclave identity. How do we check the expiration of the whitelisting key's certificate?

Calendar time inside enclaves deserves its own [document](time.md); the gist is that we simply don't have access to time
unless we trust a calendar time oracle.

Note however that we probably won't need in-enclave authorisation for *stateless* enclaves, as these have no secrets to
reveal at all. Authorisation would simply serve as access control, and we can solve access control in the hosting
infrastructure instead.

## Enclave to enclave channels

Doing remote attestation between enclaves is similar to enclave to non-enclave, only this time authentication involves
verifying the chain of trust on both sides. However note that this is also predicated on having access to a calendar
time oracle, as this time the expiration checks of the chain must be done in enclaves. So in a sense both enclave to
enclave and stateful enclave to non-enclave attestation force us to trust a calendar time oracle.

But note that remote enclave to enclave attestation is mostly required when there *is* sealed state (secrets to share
with the other enclave). One other use case is the reduction of audit surface, once it comes to that. We may be able to
split stateless enclaves into components that have different upgrade lifecycles. By doing so we ease the auditors' job
by reducing the enclaves' contracts and code size.
docs/source/design/sgx-infrastructure/details/channels.md

# Enclave channels

AWS Lambdas may be invoked by name, and are simple request-response type RPCs. The lambda's name abstracts the
specific JAR or code image that implements the functionality, which allows upgrading of a lambda without disrupting
the rest of the lambdas.

Any authentication required for the invocation is done by a different AWS service (IAM), and is assumed to be taken
care of by the time the lambda code is called.

Serverless enclaves also require ways to be addressed; let's call these "enclave channels". Each such channel may be
identified with a string, similar to Lambdas, however unlike Lambdas we need to incorporate authentication into the
concept of a channel in the form of attestation.

Furthermore, unlike Lambdas, we can implement a generic two-way communication channel. This reintroduces state into the
enclave logic. However note that this state is in-memory only, and because of the transient nature of enclaves (they
may be "lost" at any point) enclave authors are in general incentivised to either keep in-memory state minimal (by
sealing state) or make their functionality idempotent (allowing retries).

We should be able to determine an enclave's supported channels statically. Enclaves may store this data for example in a
specific ELF section or a separate file. The latter may be preferable, as it may be hard to have a central definition of
channels in an ELF section if we use JVM bytecode. Instead we could have a specific static JVM data structure that can be
extracted from the enclave statically during the build.

## Sealed state

Sealing keys tied to specific CPUs seem to throw a wrench in the requirement of statelessness. Routing a request to an
enclave that has associated sealed state cannot be the same as routing to one which doesn't. How can we transparently
scale enclaves like Lambdas if fresh enclaves by definition don't have associated sealed state?

Take key provisioning as an example: we want some key to be accessible by a number of enclaves; how do we
differentiate between enclaves that have the key provisioned versus ones that don't? We need to somehow expose an
opaque version of the enclave's sealed state to the hosting infrastructure for this.

The way we could do this is by expressing this state in terms of a changing set of "active" enclave channels. The
enclave can statically declare the channels it potentially supports, and start with some initial subset of them as
active. As the enclave's lifecycle (sealed state) evolves it may change this active set to something different,
thereby informing the hosting infrastructure that it shouldn't route certain requests there, or that it can route some
other ones.

Take the above key provisioning example. An enclave can be in two states, unprovisioned or provisioned. When it's
unprovisioned its set of active channels will be related to provisioning (for example, request to bootstrap key or
request from sibling enclave), when it's provisioned its active set will be related to the usage of the key and
provisioning of the key itself to unprovisioned enclaves.
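
The key-provisioning lifecycle can be sketched as a small state machine; the channel names here are hypothetical, and in the real system the sealed-state transition and the active-set update would be one atomic metadata operation:

```python
# Statically declared channel set (what the build-time extraction would emit),
# and an active subset that evolves with the enclave's sealed state.
STATIC_CHANNELS = {"provision/bootstrap", "provision/from-sibling",
                   "provision/to-sibling", "key/sign"}

class KeyEnclave:
    def __init__(self):
        self.key = None
        # Unprovisioned: only provisioning channels are routable. This is also
        # the initial set a freshly spun-up (stateless) instance exposes.
        self.active_channels = {"provision/bootstrap", "provision/from-sibling"}

    def provision(self, key: bytes):
        # Provisioned: the enclave now serves key usage, and can provision
        # siblings; the hosting infrastructure stops routing bootstrap requests.
        self.key = key
        self.active_channels = {"provision/to-sibling", "key/sign"}

enclave = KeyEnclave()
assert enclave.active_channels <= STATIC_CHANNELS
assert "key/sign" not in enclave.active_channels  # not routable yet

enclave.provision(b"shared-secret-key")
assert enclave.active_channels == {"provision/to-sibling", "key/sign"}
```

The hosting infrastructure only ever sees the active set, never the sealed state itself, which is exactly the opaque exposure described above.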

The enclave's initial set of active channels defines how enclaves may be scaled horizontally, as these are the
channels that will be active for freshly started enclaves without sealed state.

"Hold on" you might say, "this means we didn't solve the scalability of stateful enclaves!"

This is partly true. However in the above case we can force certain channels to be part of the initial active set! In
particular the channels that actually use the key (e.g. for signing) may be made "stateless" by lazily requesting
provisioning of the key from sibling enclaves. Enclaves may be spun up on demand, and as long as there is at least one
sibling enclave holding the key it will be provisioned as needed. This hints at a general pattern of hiding stateful
functionality behind stateless channels, if we want them to scale automatically.

Note that this doesn't mean we can't have external control over the provisioning of the key. For example we probably
want to enforce redundancy across N CPUs. This requires the looping in of the hosting infrastructure; we cannot
enforce this invariant purely in enclave code.

As we can see, the set of active enclave channels is inherently tied to the sealed state of the enclave, therefore we
should make updating both of them an atomic operation.

### Side note

Another way to think about enclaves using sealed state is like an actor model. The sealed state is the actor's state,
and state transitions may be executed by any enclave instance running on the same CPU. By transitioning the actor state
one can also transition the type of messages the actor can receive atomically (= active channel set).

## Potential gRPC integration

It may be desirable to expose a built-in serialisation and network protocol. This would tie us to a specific protocol,
but in turn it would ease development.

An obvious candidate for this is gRPC, as it supports streaming and a specific serialization protocol. We need to
investigate how we can integrate it so that channels are basically responsible for tunneling gRPC packets.
88
docs/source/design/sgx-infrastructure/details/discovery.md
Normal file
88
docs/source/design/sgx-infrastructure/details/discovery.md
Normal file
@ -0,0 +1,88 @@
# Discovery

In order to understand enclave discovery and routing we first need to understand the mappings between CPUs, VMs and
enclave hosts.

The cloud provider manages a number of physical machines (CPUs), each of which hosts a hypervisor, which in turn hosts
a number of guest VMs. Each VM in turn may host a number of enclave host containers (together with required supporting
software like aesmd) and the SGX device driver. Each enclave host in turn may host several enclave instances. For the
sake of simplicity let's assume that an enclave host may only host a single enclave instance per measurement.

We can figure out the identity of the CPU a VM is running on by using a dedicated enclave to derive a unique ID
specific to the CPU. For this we can use EGETKEY with pre-defined inputs to derive a seal key sealed to MRENCLAVE. This
provides a 128-bit value reproducible only on the same CPU when derived in this manner. Note that this is completely
safe, as the value won't be used for encryption and is specific to the measurement deriving it. With this ID we can
reason about the physical locality of enclaves without looping in the cloud provider.
Note: we should set OWNEREPOCH to a static value before doing this.
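As a rough illustration of the derivation above, here is a toy analogue in Python. The real derivation happens inside an enclave via the EGETKEY instruction; the HMAC below merely stands in for it, with `cpu_root_secret` playing the role of the per-CPU hardware secret and `measurement` the role of MRENCLAVE (all names are hypothetical).

```python
import hashlib
import hmac


def derive_cpu_id(cpu_root_secret: bytes, measurement: bytes) -> bytes:
    """Toy analogue of EGETKEY with pre-defined inputs: the result is
    reproducible only with the same (CPU secret, measurement) pair,
    mirroring 'same CPU, same MRENCLAVE'."""
    fixed_inputs = b"cpu-id-derivation-label" + measurement  # pre-defined derivation inputs
    return hmac.new(cpu_root_secret, fixed_inputs, hashlib.sha256).digest()[:16]  # 128-bit ID
```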

We don't need an explicit handle on the VM's identity; the mapping from VM to container will be handled by the
orchestration engine (Kubernetes).

Similarly to VM identity, the specific host container's identity (IP address / DNS A record) is also tracked by
Kubernetes; however, we do need access to this identity in order to implement discovery.

When an enclave instance seals a secret, that piece of data is tied to the measurement + CPU combination. The secret can
only be revealed to an enclave with the same measurement running on the same CPU. However the management of this secret
is tied to the enclave host container, and we may have several of these running on the same CPU, possibly all of them
hosting enclaves with the same measurement.

To solve this we can introduce a *sealing identity*. This is basically a generated ID/namespace for a collection of
secrets belonging to a specific CPU. It is generated when a fresh enclave host starts up, and subsequently the host will
store sealed secrets under this ID. These secrets should survive host death, so they will be persisted in etcd (together
with the associated active channel sets). Every host owns a single sealing identity, but not every sealing identity may
have an associated host (e.g. in case the host died).
## Mapping to Kubernetes

The following mapping of the above concepts to Kubernetes concepts is not yet fleshed out and requires further
investigation into Kubernetes capabilities.

VMs correspond to Nodes, and enclave hosts correspond to Pods. The host's identity is the same as the Pod's, which is
the Pod's IP address/DNS A record. From Kubernetes' point of view enclave hosts provide a uniform, stateless Headless
Service. This means we can use Kubernetes' scaling/autoscaling features to provide redundancy across hosts (to balance
load).

However we'll probably need to tweak the (federated?) ReplicaSet concept in order to provide redundancy across CPUs
(to be tolerant of CPU failures), or perhaps use the anti-affinity feature somehow; to be explored.

The concept of a sealing identity is very close to the stable identity of Pods in Kubernetes StatefulSets. However I
couldn't find a way to use this directly, as we need to tie the sealing identity to the CPU identity, which in
Kubernetes would translate to a requirement to pin stateful Pods to Nodes based on a dynamically determined identity. We
could however write an extension to handle this metadata.
## Registration

When an enclave host is started it first needs to establish its sealing identity. To this end it first checks whether
any sealing identities are available for the CPU it's running on. If not, it can generate a fresh one, lease it for a
period of time (updating the lease periodically) and atomically register its IP address in the process. If an existing
identity is available the host can take it over by leasing it. There may be existing Kubernetes functionality to handle
some of this.
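The lease-or-create step above can be sketched with a compare-and-swap transaction, which is the primitive etcd offers. The in-memory `KvStore` below is a stand-in for etcd and all names are hypothetical; the point is only that takeover and fresh registration are a single atomic operation.

```python
import threading
import time
import uuid


class KvStore:
    """Stand-in for etcd: compare-and-swap on (key, expected version)."""
    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}  # key -> (value, version)

    def cas(self, key, expected_version, value):
        with self._lock:
            current = self._data.get(key)
            version = current[1] if current else 0
            if version != expected_version:
                return False  # someone else won the race
            self._data[key] = (value, version + 1)
            return True

    def get(self, key):
        return self._data.get(key)


def register_host(kv, cpu_id, host_ip, lease_seconds=30):
    """Lease an existing sealing identity for this CPU, or generate a fresh
    one, atomically recording the leasing host's address."""
    key = "sealing-identity/" + cpu_id
    existing = kv.get(key)
    expected_version = existing[1] if existing else 0
    sealing_id = existing[0]["sealing_id"] if existing else str(uuid.uuid4())
    lease = {"sealing_id": sealing_id, "host": host_ip,
             "expires": time.time() + lease_seconds}
    if not kv.cas(key, expected_version, lease):
        raise RuntimeError("lost the registration race; retry")
    return sealing_id
```

A periodic refresh would re-run the same CAS with the current version to extend `expires`.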

Non-enclave services (like blob storage) could register similarly, but in this case we can take advantage of
Kubernetes' existing discovery infrastructure to abstract a service behind a Service cluster IP. We do need to provide
the metadata about supported channels though.
## Resolution

The enclave/service discovery problem boils down to:
"Given a channel, my trust model and my identity, give me an enclave/service that serves this channel, trusts me, and
that I trust".

This may be done in the following steps:

1. Resolve the channel to the set of measurements supporting it
2. Filter the measurements to ones we trust and ones that trust us
3. Pick one of the measurements randomly
4. Find an alive host that has the channel in its active set for that measurement

Step 1 may be done by maintaining a channel -> measurements map in etcd. This mapping would effectively define the
enclave deployment and would be the central place to control incremental rollouts or rollbacks.

Step 2 requires storing additional metadata per advertised channel, namely a datastructure describing the enclave's
trust predicate. A similar datastructure is provided by the discovering entity; these two predicates can then be used to
filter measurements based on trust.

Step 3 is where we may want to introduce more control if we want to support incremental rollout/canary deployments.

Step 4 is where various (non-MVP) optimisation considerations come to mind. We could add a loadbalancer, do autoscaling
based on load (although Kubernetes already provides support for this), or have a preference for looping back to the same
host to allow local attestation, or for hosts that have the enclave image cached locally or warmed up.
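The four resolution steps can be sketched directly over plain dictionaries standing in for the etcd-backed metadata (all names and data shapes here are hypothetical):

```python
import random


def resolve(channel, deployment, trust_meta, my_predicate, my_identity, hosts):
    """Steps 1-4 of resolution.

    deployment: channel -> set of measurements            (step 1)
    trust_meta: measurement -> trust predicate            (step 2)
    hosts:      list of {"alive", "measurement", "active_channels", "address"}
    """
    measurements = deployment.get(channel, set())                   # 1. resolve
    trusted = [m for m in measurements
               if my_predicate(m) and trust_meta[m](my_identity)]   # 2. mutual trust
    if not trusted:
        return None
    m = random.choice(trusted)                                      # 3. pick one
    for h in hosts:                                                 # 4. find a live host
        if h["alive"] and h["measurement"] == m and channel in h["active_channels"]:
            return h["address"]
    return None
```

Step 3 is the natural hook for canary logic: replace `random.choice` with a weighted choice driven by rollout metadata.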
# Enclave deployment

What happens when we roll out a new enclave image?

In production we need to sign the image directly with the R3 key as MRSIGNER (process to be designed), as well as create
any whitelisting signatures needed (e.g. from auditors) in order to allow existing enclaves to trust the new one.

We need to make the enclave build sources available to users - we can package this up as a single container pinning all
build dependencies and source code. Docker-style image layering/caching will come in handy here.

Once the image, build containers and related signatures are created we need to push them to the main R3 enclave storage.

Enclave infrastructure owners (e.g. Corda nodes) may then start using the images, depending on their upgrade policy.
This involves updating their key-value store so that new channel discovery requests resolve to the new measurement,
which in turn triggers the image download on demand on enclave hosts. We can potentially add pre-caching here to reduce
latency for first-time enclave users.
# Enclave storage

The enclave storage is a simple static content server. It should allow uploading and serving of enclave images based
on their measurement. We may also want to store metadata about the enclave build itself (e.g. a GitHub link/commit
hash).

We may need to extend its responsibilities to serve other SGX-related static content such as whitelisting signatures
over measurements.
docs/source/design/sgx-infrastructure/details/host.md
# Enclave host

An enclave host's responsibility is the orchestration of communication with hosted enclaves.

It is responsible for:
* Leasing a sealing identity
* Getting a CPU certificate in the form of an Intel-signed quote
* Downloading and starting requested enclaves
* Driving attestation and subsequent encrypted traffic
* Using discovery to connect to other enclaves/services
* Various caching layers (and their invalidation) for the CPU certificate, hosted enclave quotes and enclave images
docs/source/design/sgx-infrastructure/details/ias-proxy.md
# IAS proxy

The Intel Attestation Service proxy's responsibility is simply to forward requests to and from the IAS.

The reason we need this proxy is that Intel requires us to do mutual TLS with them for each attestation roundtrip.
For this we need an R3-maintained private key, and as we want third parties to be able to do attestation we need to
store this private key in these proxies.

Alternatively we may decide to circumvent this mutual TLS requirement completely by distributing the private key with
the host containers.
docs/source/design/sgx-infrastructure/details/kv-store.md
# Key-value store

To solve enclave-to-enclave and enclave-to-non-enclave communication we need a way to route requests correctly. There
are readily available discovery solutions out there, however we have some special requirements because of the inherent
statefulness of enclaves (route to an enclave with the correct state) and the dynamic nature of trust between them
(route to an enclave I can trust and that trusts me). To store metadata about discovery we need some kind of distributed
key-value store.

The key-value store needs to store information about the following entities:
* Enclave image: measurement and supported channels
* Sealing identity: the sealing ID, the corresponding CPU ID and the host leasing it (if any)
* Sealed secret: the sealing ID, the sealing measurement, the sealed secret and the corresponding active channel set
* Enclave deployment: mapping from channel to set of measurements
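The four entity kinds above can be written down as a schema sketch; the field names are hypothetical, but they make explicit which fields key which others (e.g. sealed secrets reference a sealing identity by ID).

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class EnclaveImage:
    measurement: bytes
    channels: FrozenSet[str]          # channels this image supports


@dataclass(frozen=True)
class SealingIdentity:
    sealing_id: str
    cpu_id: bytes                     # the CPU the secrets are sealed to
    leased_by_host: Optional[str]     # None if no live host holds the lease


@dataclass(frozen=True)
class SealedSecret:
    sealing_id: str                   # references a SealingIdentity
    sealing_measurement: bytes        # the measurement that sealed it
    sealed_blob: bytes
    active_channels: FrozenSet[str]   # channel set tied to this sealed state


# Enclave deployment: channel name -> measurements serving it
Deployment = Dict[str, FrozenSet[bytes]]
```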
docs/source/design/sgx-infrastructure/details/serverless.md
# Serverless architectures

In 2014 Amazon launched AWS Lambda, coining the term "serverless architecture". It essentially creates an abstraction
layer which hides the infrastructure details. Users provide "lambdas", which are stateless functions that may invoke
other lambdas, access other AWS services etc. Because lambdas are inherently stateless (any state they need must be
accessed through a service) they may be loaded and executed on demand. This is in contrast with microservices, which
are inherently stateful. Internally AWS caches the lambda images and even caches JIT-compiled/warmed-up code in order
to reduce latency. Furthermore the lambda invocation interface provides a convenient way to scale these lambdas: as the
functions are stateless, AWS can spin up new VMs to push lambda functions to. The user simply pays for CPU usage; all
the infrastructure pain is hidden by Amazon.

Google and Microsoft followed suit a couple of years later with Cloud Functions and Azure Functions.

This way of splitting hosting computation from hosted, restricted computation is not a new idea; examples are web
frameworks (web server vs application), MapReduce (Hadoop vs mappers/reducers), or even the cloud (hypervisors vs VMs)
and the operating system (kernel vs userspace). The common pattern is: the hosting layer hides some kind of complexity,
imposes some restriction on the guest layer (and provides a simpler interface in turn), and transparently multiplexes
a number of resources for it.

The relevant key features of serverless architectures are 1. on-demand scaling and 2. business logic independent of
hosting logic.

# Serverless SGX?

How are Amazon Lambdas relevant to SGX? Enclaves exhibit very similar features to lambdas: they are pieces of business
logic completely independent of the hosting functionality. Not only that, enclaves treat hosts as adversaries! This
provides a very clean separation of concerns which we can exploit.

If we could provide a similar infrastructure for enclaves as Amazon provides for Lambdas, it would not only allow easy
HA and scaling, it would also decouple the burden of maintaining the infrastructure from the enclave business logic.
Furthermore our plan of using the JVM within enclaves also aligns with the optimisations Amazon implemented (e.g.
keeping warmed-up enclaves around). Optimisations like upgrading to local attestation also become orthogonal to
enclave business logic. Enclave code can focus on the specific functionality at hand; everything else is taken care of.
docs/source/design/sgx-infrastructure/details/time.md
# Time in enclaves

In general we know that any one crypto algorithm will be broken in X years' time. The usual way to mitigate this is by
using certificate expiration. If a peer with an expired certificate tries to connect we reject it in order to enforce
freshness of their key.

In order to check certificate expiration we need some notion of calendar time. However in SGX's threat model the host
of the enclave is considered malicious, so we cannot rely on its notion of time. Intel provides trusted time through
their PSW, however this uses the Management Engine, which is known to be a proprietary and vulnerable piece of
architecture.

Therefore in order to check calendar time in general we need some kind of time oracle. We can burn the oracle's
identity into the enclave and request timestamped signatures from it. This already raises questions with regard to the
oracle's identity itself, however for the time being let's assume we have something like this in place.

### Timestamped nonces

The most straightforward way to implement calendar time checks is to generate a nonce *after* the DH exchange, send it
to the oracle and have it sign over it with a timestamp. The nonce is required to avoid replay attacks. A malicious host
may delay the delivery of the signature indefinitely, even until after the certificate expires. However note that the
DH happened before the nonce was generated, which means that even if an attacker can crack the expired key they would
not be able to steal the DH session, only try creating new ones, which will fail at the timestamp check.

This works, however note that it imposes a full roundtrip to an oracle *per DH exchange*.
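The nonce protocol above can be sketched as follows. This is a toy: an HMAC with a shared key stands in for the oracle's real public-key signature, and all names are hypothetical; the structure of the check (fresh nonce, oracle timestamp, comparison against the peer certificate's expiry) is the point.

```python
import hashlib
import hmac
import os
import struct
import time

ORACLE_KEY = b"toy-oracle-signing-key"  # stands in for the oracle's private key


def oracle_timestamp(nonce: bytes) -> bytes:
    """Oracle side: sign (nonce, now). The HMAC stands in for a signature."""
    ts = struct.pack(">Q", int(time.time()))
    return ts + hmac.new(ORACLE_KEY, nonce + ts, hashlib.sha256).digest()


def enclave_check(nonce: bytes, reply: bytes, not_after: int) -> bool:
    """Enclave side: verify the signature covers our fresh nonce and that the
    oracle's timestamp precedes the peer certificate's expiry."""
    ts, sig = reply[:8], reply[8:]
    expected = hmac.new(ORACLE_KEY, nonce + ts, hashlib.sha256).digest()
    return hmac.compare_digest(sig, expected) and struct.unpack(">Q", ts)[0] <= not_after


nonce = os.urandom(16)          # generated *after* the DH exchange
reply = oracle_timestamp(nonce)
```

A host delaying `reply` cannot help an attacker: the DH session predates the nonce, so a cracked expired key only lets them attempt *new* sessions, which fail this check.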

### Timestamp-encrypted channels

In order to reduce the roundtrips required for timestamp checking we can invert the responsibility of checking the
timestamp. We can do this by encrypting the channel traffic with an additional key generated by the enclave but that can
only be revealed by the time oracle. The enclave encrypts the encryption key with the oracle's public key, so the peer
trying to communicate with the enclave must forward the encrypted key to the oracle. The oracle in turn will check the
timestamp and reveal the contents (perhaps double-encrypted with a DH-derived key). The peer can cache the key and later
use the same encryption key with the enclave. It is then the peer's responsibility to get rid of the key after a while.

Note that this mitigates attacks where the attacker is a third party trying to exploit an expired key, but this method
does *not* mitigate against malicious peers that keep the encryption key around until after expiration (= they "become"
malicious).
### Oracle key break

So given an oracle we can secure a channel against expired keys and potentially improve performance by trusting
once-authorised enclave peers not to become malicious.

However what happens if the oracle key itself is broken? There's a chicken-and-egg problem: we can't check the
expiration of the time oracle's certificate itself! Once the oracle's key is broken an attacker can fake timestamping
replies (or decrypt the timestamp encryption key), which in turn allows them to bypass the expiration check.

The main issue with this is in relation to sealed secrets, and sealed secret provisioning between enclaves. If an
attacker can fake being e.g. an authorised enclave then it can extract old secrets. We have yet to come up with a
solution to this, and I don't think it's possible.

Instead, knowing that current crypto algorithms are bound to be broken at *some* point in the future, rather than trying
to make sealing future-proof we can become explicit about the time-boundedness of security guarantees.

### Sealing epochs

Let's call the time period in which a certain set of algorithms is considered safe a *sealing epoch*. During this
period sealed data at rest is considered to be secure. However once the epoch finishes, old sealed data is considered to
be potentially compromised. We can then think of sealed data as an append-only log of secrets with overlapping epoch
intervals, where the "breaking" of old epochs is constantly catching up with new ones.

In order to make sure that this works we need to enforce an invariant where secrets only flow from old epochs to newer
ones, never the other way around.
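The invariant can be stated as a one-line predicate (a sketch with hypothetical names): a secret may be re-provisioned only forward in time, and only while its source epoch is still within the safety window.

```python
def may_reseal(secret_epoch: int, target_epoch: int, oldest_safe_epoch: int) -> bool:
    """Secrets only flow from old epochs to newer ones, and only while the
    source epoch's algorithms are still considered unbroken."""
    return oldest_safe_epoch <= secret_epoch <= target_epoch
```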

This translates to the ledger nicely: data in old epochs is generally not valuable anymore, so it's safe to consider it
compromised. Note however that in the privacy model an epoch transition requires a full re-provisioning of the ledger
to the new set of algorithms/enclaves.

In any case this is an involved problem, and I think we should defer fleshing it out for now as we won't need it for
the first round of stateless enclaves.
docs/source/design/sgx-integration/SgxProvisioning.png
docs/source/design/sgx-integration/design.md
This document is intended as a design description of how we can go about integrating SGX with Corda. As the
infrastructure design of SGX is quite involved (detailed elsewhere) but otherwise flexible, we can discuss the possible
integration points separately, without delving into lower-level technical detail.

For the purposes of this document we can think of SGX as a way to provision secrets to a remote node with the
knowledge that only trusted code (= an enclave) will operate on them. Furthermore it provides a way to durably encrypt
data in a scalable way while also ensuring that the encryption key is never leaked (unless the encrypting enclave is
compromised).

Broadly speaking there are two dimensions to deciding how we can integrate SGX: *what* we store in the ledger and
*where* we store it.

The first dimension, the what, relates to what we have so far called the "integrity model" vs the "privacy model".

In the **integrity model** we rely on SGX to ensure the integrity of the ledger. Under this assumption we can cut off
the transaction body and only store an SGX-backed signature over filtered transactions. Namely, we would only store
information required for notarisation of the current and subsequent spending transactions. This seems neat at first
sight, however note that if we do this naively then an attacker who can impersonate an enclave gains write access to
the ledger, as the fake enclave can sign transactions as valid without having run verification.

In the **privacy model** we store the full transaction backchain (encrypted) and we keep provisioning it between nodes
on demand, just like in the current Corda implementation. This means we only rely on SGX for the privacy aspects - if
an enclave is compromised we only lose privacy; verification cannot be eluded by providing a fake signature.

The other dimension is the where: currently in non-SGX Corda the full transaction backchain is provisioned between
non-notary nodes, and is also provisioned to notaries if they are validating ones. With SGX + BFT notaries we have
the possibility to offload the storage of the encrypted ledger (or encrypted signatures thereof) to notary nodes (or
dedicated oracles) and only store the bookkeeping information required for further ledger updates in non-notary nodes.
The storage policy is very important: customers want control over the persistence of even encrypted data, and with the
introduction of recent regulation (GDPR) unrestricted provisioning of sensitive data will be illegal by law, even when
encrypted.

We'll explore the different combinations of choices below. Note that we don't need to marry any one of them; we may
decide to implement several.
## Privacy model + non-notary provisioning

Let's start with the model that's closest to the current Corda implementation, as this is an easy segue into the
possibilities with SGX. We also have a simple example and a corresponding neat diagram (thank you Kostas!!) we showed
to a member bank, Itau, to indicate in a semi-handwavy way what the integration will look like.

We have a cordapp X used by nodes A and B. The cordapp contains a flow XFlow and a (deterministic) contract XContract.
The two nodes are negotiating a transaction T2. T2 consumes a state that comes from transaction T1.

Let's assume that both A and B are happy with T2, except that node A hasn't established its validity yet. Our goal is
to prove the validity of T2 to A without revealing the details of T1.

The following diagram shows an overview of how this can be achieved. Note that the diagram is highly oversimplified
and is meant to communicate the high-level dataflow relevant to Corda.


* In order to validate T2, A asks its enclave whether T2 is valid.
* The enclave sees that T2 depends on T1, so it consults its sealed ledger to see whether it contains T1.
* If it does, then T1 has been verified already, so the enclave moves on to the verification of T2.
* If the ledger doesn't contain T1 then the enclave needs to retrieve it from node B.
* In order to do this, A's enclave needs to prove to B's enclave that it is indeed a trusted enclave B can provision T1
  to. This proof is what the attestation process provides.
* Attestation is done in the clear: (TODO attestation diagram)
    * A's enclave generates a keypair, the public part of which is sent to node B in a datastructure signed by Intel;
      this is called the quote (1).
    * Node B's XFlow may do various checks on this datastructure that cannot be performed by B's enclave, for example
      checking the timeliness of Intel's signature (2).
    * Node B's XFlow then forwards the quote to B's enclave, which will check Intel's signature and whether it trusts
      A's enclave. For the sake of simplicity we can assume this to be a strict check that A is running the exact same
      enclave as B.
    * At this point B's enclave has established trust in A's enclave, and has the public part of the key generated by
      A's enclave.
    * The nodes repeat the above process the other way around so that A's enclave establishes trust in B's and gets
      hold of B's public key (3).
    * Now they proceed to perform an ephemeral Diffie-Hellman key exchange using the keys in the quotes (4).
    * The ephemeral key is then used to encrypt further communication. Beyond this point the nodes' flows (and anything
      outside of the enclaves) have no way of seeing what data is being exchanged; all the nodes can do is forward the
      encrypted messages.
* Once attestation is done B's enclave provisions T1 to A's enclave using the DH key. If there are further
  dependencies those would be provisioned as well.
* A's enclave then proceeds to verify T1 using the embedded deterministic JVM to run XContract. The verified
  transaction is then sealed to disk (5). We repeat this for T2.
* If verification or attestation fails at any point the enclave returns to A's XFlow with a failure. Otherwise, if all
  is good, the enclave returns with a success. At this point A's XFlow knows that T2 is valid, but hasn't seen T1 in
  the clear.
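The key-exchange core of the steps above can be sketched as follows. This is a toy finite-field Diffie-Hellman with insecure parameters, purely to show the shape of the exchange; real enclaves would use a standard curve (e.g. X25519), with each public value carried inside an Intel-signed quote and the quote check gating the exchange.

```python
import hashlib
import secrets

# Toy DH parameters: a Mersenne prime and a small generator. NOT secure.
P = 2**127 - 1
G = 5


def keypair():
    """Each enclave generates an ephemeral keypair; the public part is the
    value that would be embedded in (and authenticated by) its quote."""
    priv = secrets.randbelow(P - 2) + 1
    return priv, pow(G, priv, P)


def session_key(my_priv: int, their_pub: int) -> bytes:
    """Both sides derive the same symmetric key from the shared DH secret."""
    shared = pow(their_pub, my_priv, P)
    return hashlib.sha256(shared.to_bytes(16, "big")).digest()


a_priv, a_pub = keypair()   # inside A's enclave
b_priv, b_pub = keypair()   # inside B's enclave
# After mutual attestation, both enclaves hold the same ephemeral key;
# the hosts only ever see a_pub, b_pub and ciphertext.
```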

(1) This is simplified; the actual protocol is a bit different. Namely, the quote is not generated every time A requires provisioning, but rather periodically.

(2) There is a way to do this check inside the enclave, however it requires switching on the Intel ME, which in general isn't available on machines in the cloud and is known to have vulnerabilities.

(3) We need symmetric trust even if the secrets seem to only flow from B to A. Node B may try to fake being an enclave to fish for information from A.

(4) The generated keys in the quotes are used to authenticate the respective parts of the DH key exchange.

(5) Sealing means encryption of data using a key unique to the enclave and CPU. The data may be subsequently unsealed (decrypted) by the enclave, even if the enclave was restarted. Also note that there is another layer of abstraction needed which we don't detail here, needed for redundancy of the encryption key.
To summarise, the journey of T1 is:

1. Initially it's sitting encrypted in B's storage.
2. B's enclave decrypts it using its seal key specific to B's enclave + CPU combination.
3. B's enclave encrypts it using the ephemeral DH key.
4. The encrypted transaction is sent to A. The safety of this (namely that A's enclave doesn't reveal the transaction to node A) hinges on B's enclave's trust in A's enclave, which is expressed as a check of A's enclave measurement during attestation, which in turn requires auditing of A's enclave code and reproducing the measurement.
5. A's enclave decrypts the transaction using the DH key.
6. A's enclave verifies the transaction using a deterministic JVM.
7. A's enclave encrypts the transaction using A's seal key specific to A's enclave + CPU combination.
8. The encrypted transaction is stored in A's storage.

As we can see, in this model each non-notary node runs its own SGX enclave and related storage. Validation of the
backchain happens by secure provisioning of it between enclaves, plus subsequent verification and storage. However
there is one important thing missing from the example (actually it has several, but those are mostly technical
details): the notary!
In reality we cannot establish the full validity of T2 at this point of the negotiation; we need to notarise it first.
This model gives us some flexibility in this regard: we can use a validating notary (also running SGX) or a
non-validating one. This indicates that the enclave API should be split in two, mirroring the signature check choice
in SignedTransaction.verify. Only when the transaction is fully signed and notarised should it be persisted (sealed).

This model has both advantages and disadvantages. On one hand it is the closest to what we have now - we (and users)
are familiar with this model, we can fairly easily nest it into the existing codebase, and it gives us flexibility with
regard to notary modes. On the other hand it is a compromise answer to the regulatory problem. If we use non-validating
notaries then backchain storage is restricted to participants; however consider the following example: if we have a
transaction X that parties A and B can process legally, but a later transaction Y that has X in its backchain is sent
for verification to party C, then C will process and store X as well, which may be illegal.
## Privacy model + notary provisioning

This model would work similarly to the previous one, except non-notary nodes wouldn't need to run SGX or care about
storage of the encrypted ledger; it would all be done in notary nodes. Nodes would connect to SGX-capable notary nodes,
and after attestation the nodes can be sure that the notary has run verification before signing.

This fixes the choice of using validating notaries, as notaries would be the only entities capable of verification:
only they have access to the full backchain inside enclaves.

Note that because we still provision the full backchain between notary members for verification, we don't necessarily
need a BFT consensus on validity - if an enclave is compromised, an invalid transaction will be detected at the next
backchain provisioning.

This model reduces the number of responsibilities of a non-notary node; in particular it wouldn't need to provide
storage for the backchain or verification, but could simply trust notary signatures. Also it wouldn't need to host SGX
enclaves, only partake in the DH exchange with notary enclaves. The node's responsibilities would be reduced to the
orchestration of ledger updates (flows) and related bookkeeping (vault, network map). This split would also enable us
to be flexible with regard to the update orchestration: trust in the validity of the ledger would cease to depend on
the transaction resolution currently embedded into flows - we could provide a from-scratch, light-weight implementation
of a "node" (say a mobile app) that doesn't use flows and related code at all; it just needs to be able to connect to
notary enclaves to notarise, and validity is taken care of by the notaries.

Note that although we wouldn't require validation checks from non-notary nodes, in theory it would be safe to allow
them to do so (if they want a stronger-than-BFT guarantee).

Of course this model has disadvantages too. From the regulatory point of view it is a strictly worse solution than the
non-notary provisioning model: the backchain would be provisioned between notary nodes not owned by the actual
participants in the backchain. It also prevents us from using non-validating notaries.
## Integrity model + non-notary provisioning
|
||||
|
||||
In this model we would trust SGX-backed signatures and related attestation data structures (a quote over the signature
key, signed by Intel) as proof of validity. When nodes A and B are negotiating a transaction it is enough to provision
SGX signatures over the dependency hashes to one another; there is no need to provision the full backchain.

This sounds very simple and efficient, and it is even more private than the privacy model, as we are only passing
signatures around, not transactions. However, there are a couple of issues that need addressing. If an SGX enclave is
compromised, a malicious node can provide a signature over an invalid transaction that checks out, and nobody will ever
know about it, because the original transaction will never be verified. One way we can mitigate this is by requiring a
BFT consensus signature; perhaps a threshold signature is enough. We could decouple verification into "verifying
oracles" which verify in SGX and return signatures over transaction hashes, and require a certain number of them to
convince the notary to notarise and subsequent nodes to trust validity. Another issue is enclave updates. If we find a
vulnerability in an enclave and update it, what happens to the already signed backchain? Historical transactions have
signatures that are rooted in SGX quotes belonging to old, untrusted enclave code. One option is to simply have a
cutoff date before which we accept old signatures. This requires a consensus-backed timestamp on the notary signature.
Another option would be to keep the old ledger around and re-verify it with the new enclaves. However, if we do this we
lose the benefits of the integrity model: we get back the regulatory issue, and we don't gain the performance benefits.

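The validating-oracle quorum check could look something like this. A simplified sketch under stated assumptions: the oracle key set, the threshold policy, and plain RSA signatures are all illustrative stand-ins, and the real attestation step (extracting oracle keys from verified SGX quotes) is omitted entirely.

```java
// Sketch: a notary accepts a transaction hash as valid only if at least
// `threshold` distinct verifying oracles have signed it. Oracle keys would
// in reality come from attested SGX quotes; here they are plain keypairs.
import java.security.*;
import java.util.*;

public class OracleThresholdCheck {
    // Returns true if signatures from at least `threshold` distinct trusted
    // oracle keys verify over `txHash`.
    static boolean hasQuorum(byte[] txHash,
                             Map<PublicKey, byte[]> signatures,
                             Set<PublicKey> trustedOracles,
                             int threshold) throws GeneralSecurityException {
        int valid = 0;
        for (Map.Entry<PublicKey, byte[]> e : signatures.entrySet()) {
            if (!trustedOracles.contains(e.getKey())) continue; // unknown oracle
            Signature verifier = Signature.getInstance("SHA256withRSA");
            verifier.initVerify(e.getKey());
            verifier.update(txHash);
            if (verifier.verify(e.getValue())) valid++;
        }
        return valid >= threshold;
    }

    public static void main(String[] args) throws Exception {
        KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
        gen.initialize(2048);
        byte[] txHash = MessageDigest.getInstance("SHA-256").digest("tx".getBytes());

        Set<PublicKey> oracles = new HashSet<>();
        Map<PublicKey, byte[]> sigs = new HashMap<>();
        for (int i = 0; i < 3; i++) {
            KeyPair kp = gen.generateKeyPair();
            oracles.add(kp.getPublic());
            if (i < 2) { // only 2 of the 3 oracles respond
                Signature s = Signature.getInstance("SHA256withRSA");
                s.initSign(kp.getPrivate());
                s.update(txHash);
                sigs.put(kp.getPublic(), s.sign());
            }
        }
        System.out.println(hasQuorum(txHash, sigs, oracles, 2)); // true
        System.out.println(hasQuorum(txHash, sigs, oracles, 3)); // false
    }
}
```

A production design would likely replace the signature count with a proper threshold signature scheme, so that the notary and subsequent nodes only ever store and verify a single aggregate signature.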
## Integrity model + notary provisioning

This is similar to the previous model, only once again non-notary nodes wouldn't need to care about verifying or
collecting proofs of validity before sending the transaction off for notarisation. All of the complexity would be
hidden by notary nodes, which may use validating oracles or perhaps combine consensus over validity with consensus
over spending. This model would be a very clean separation of concerns which solves the regulatory problem (almost)
and is quite efficient, as we don't need to keep provisioning the chain. One potential issue with regard to regulation
is the tip of the ledger (the transaction being notarised): this is sent to notaries, and although it is not stored, it
may still be against the law to receive it and hold it in volatile memory, even inside an enclave. I'm unfamiliar with
the legal details of whether this is good enough. If this is an issue, one way we could address it would be to scope
the validity checks required for notarisation within legal boundaries and only require "full" consensus on the
spentness check. Of course, this has the downside that ledger participants outside of the regulatory boundary need to
trust the BFT-SGX of the scope. I'm not sure whether it's possible to do any better; after all, we can't send the
transaction body outside the scope in any shape or form.

## Threat model

In all models we have the following actors, which may or may not overlap depending on the model:

* Notary quorum members
* Non-notary nodes/entities interacting with the ledger
* Identities owning the verifying enclave hosting infrastructure
* Identities owning the encrypted ledger/signature storage infrastructure
* R3 = enclave whitelisting identity
* Network Map = contract whitelisting identity
* Intel

We have two major ways of compromise:

* compromise of a non-enclave entity (notary, node, R3, Network Map, storage)
* compromise of an enclave.

In the case of **notaries**, compromise means malicious signatures; for **nodes** it's malicious transactions; for **R3**
it's signing malicious enclaves; for **Network Map** it's signing malicious contracts; for **storage** it's read-write
access to encrypted data; and for **Intel** it's forging of quotes or signing over invalid ones.

A compromise of an **enclave** means some form of access to the enclave's temporary identity key. This may happen
through direct hardware compromise (extraction of fuse values) and subsequent forging of a quote, or through leaking of
secrets via a weakness of the enclave-host boundary or other side channels like Spectre. In any case it allows an
adversary to impersonate an enclave, and therefore to intercept enclave traffic and forge signatures.

The actors relevant to SGX are enclave hosts, storage infrastructure owners, regular nodes and R3.

* **Enclave hosts**: enclave code is specifically written with malicious (compromised) hosts in mind. That said, we
  cannot be 100% secure against yet undiscovered side-channel attacks and other vulnerabilities, so we need to be
  prepared for the scenario where enclaves get compromised. The privacy model effectively solves this problem by
  always provisioning and re-verifying the backchain: an impersonated enclave may be able to see what's on the ledger,
  but tampering with it will not check out at the next provisioning. On the other hand, if a compromise happens in the
  integrity model, an attacker can forge a signature over validity. We can mitigate this with a BFT guarantee by
  requiring a consensus over validity. This way we effectively provide the same guarantee for validity as notaries
  provide with regard to double spends.

* **Storage infrastructure owner**:
  * A malicious actor would need to crack the encryption key to decrypt transactions or transaction signatures.
    Although this is highly unlikely, we can mitigate it by preparing for and forcing key updates (i.e. we won't
    provision new transactions to enclaves using old keys).
  * What an attacker *can* do is simply erase encrypted data (or perhaps re-encrypt it as part of ransomware), blocking
    subsequent resolution and verification. In the non-notary provisioning models we can't really mitigate this, as the
    tip of the ledger (or a signature over it) may only be stored by a single non-notary entity (assumed to be
    compromised). However, if we require consensus over validity between notary or non-notary entities (e.g. validating
    oracles), then this implicitly provides redundancy of storage.
  * Furthermore, storage owners can spy on the enclave's activity by observing access patterns to the encrypted blobs.
    We can mitigate this by implementing ORAM storage.

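The forced key-update mitigation could be expressed as a simple epoch policy inside the enclave. This is a hypothetical sketch: the `KeyEpochPolicy` class, the epoch numbering, and the grace window are illustrative choices, not an existing API.

```java
// Sketch of forced key rotation: the enclave tracks a current key epoch and
// refuses to provision new transactions encrypted under retired epochs.
// Epoch numbers and the grace window are hypothetical policy choices.
public class KeyEpochPolicy {
    private final int currentEpoch;
    private final int graceEpochs; // how many older epochs remain acceptable

    KeyEpochPolicy(int currentEpoch, int graceEpochs) {
        this.currentEpoch = currentEpoch;
        this.graceEpochs = graceEpochs;
    }

    // A blob sealed under an epoch older than (current - grace) is rejected,
    // forcing re-encryption under a fresh key before further provisioning.
    boolean acceptBlob(int blobEpoch) {
        return blobEpoch <= currentEpoch && blobEpoch >= currentEpoch - graceEpochs;
    }

    public static void main(String[] args) {
        KeyEpochPolicy policy = new KeyEpochPolicy(10, 2);
        System.out.println(policy.acceptBlob(10)); // current key: true
        System.out.println(policy.acceptBlob(8));  // within grace window: true
        System.out.println(policy.acceptBlob(7));  // retired key: false
    }
}
```

The grace window gives honest participants time to re-encrypt while bounding how long a cracked old key remains useful to an attacker.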
* **Regular nodes**: if a regular node is compromised, the attacker may gain access to the node's long-term key that
  allows them to Diffie-Hellman with an enclave, or get the ephemeral DH value calculated during attestation directly.
  This means they can man-in-the-middle between the node and the enclave. From the ledger's point of view we are
  prepared for this scenario, as we never leak sensitive information to the node from the enclave; however, it opens the
  possibility that the attacker can fake enclave replies (e.g. validity checks) and can sniff secrets flowing from
  the node to the enclave. We can mitigate the fake enclave replies by requiring an extra signature on messages.
  Sniffing cannot really be mitigated, but one could argue that if the transient DH key (which lives temporarily in
  volatile memory) or the long-term key (which probably lives in an HSM) was leaked, then the attacker has access to
  node secrets anyway.

* **R3**: the entity that's whitelisting enclaves effectively controls attestation trust, which means they could
  backdoor the ledger by whitelisting a secret-revealing or signature-forging enclave. One way to mitigate this is by
  requiring a threshold signature/consensus over new trusted enclave measurements. Another would be to use "canary"
  keys controlled by neutral parties. These parties' responsibility would simply be to publish enclave measurements
  (and perhaps the reproducing build) to the public before signing over them. The "publicity" and signature would be
  checked during attestation, so a quote with a non-public measurement would be rejected. Although this wouldn't
  prevent backdoors (unless the parties also do auditing), it would make them public.

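The "publicity" part of the canary check could be sketched like this. It is a deliberately minimal illustration: canary logs are modeled as plain sets of measurement strings, whereas a real deployment would use signed, append-only logs fetched from the canary parties, plus the signature verification the text describes.

```java
// Sketch of the "canary" publicity check: an enclave measurement is accepted
// during attestation only if every canary party has already published it.
// Canary logs are modeled as plain sets; real logs would be signed and
// append-only.
import java.util.List;
import java.util.Set;

public class CanaryCheck {
    // The measurement must appear in all canary publications to be trusted.
    static boolean isPublic(String measurement, List<Set<String>> canaryLogs) {
        return canaryLogs.stream().allMatch(log -> log.contains(measurement));
    }

    public static void main(String[] args) {
        List<Set<String>> logs = List.of(
                Set.of("mrenclave-1", "mrenclave-2"), // canary A's published list
                Set.of("mrenclave-1"));               // canary B's published list
        System.out.println(isPublic("mrenclave-1", logs)); // true: published by all
        System.out.println(isPublic("mrenclave-2", logs)); // false: canary B never published it
    }
}
```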
* **Intel**: there are two ways a compromised Intel can interact with the ledger maliciously; both provide a backdoor.
  * It can sign over invalid quotes. This can be mitigated by implementing our own attestation service. Intel told us
    we'll be able to do this in the future (by downloading a set of certificates tied to CPU+CPUSVN combos that may be
    used to check QE signatures).
  * It can produce valid quotes without an enclave. This is due to the fact that Intel stores one half of the
    SGX-specific fuse values in order to validate quotes flexibly. One way to circumvent this would be to only use the
    other half of the fuse values (the seal values), which they don't store (or so they claim). However, this requires
    our own "enrollment" process for CPUs, where we replicate the provisioning process based off of the seal values and
    verify manually that the provisioning public key comes from the CPU. And even if we do this, all we would have done
    is move the requirement of trust from Intel to R3.

Note, however, that even if an attacker compromises Intel and decides to plant a backdoor, they would need to connect
to the ledger participants in order to take advantage of it. The flow framework and the business network concept act as
a form of ACL on data that would make an Intel backdoor quite useless.

## Summary

As we can see, we have a number of options here; all of them have advantages and disadvantages.

#### Privacy + non-notary

**Pros**:
* Closest to our current non-SGX model
* Strong guarantee of validity
* Flexible with respect to notary modes

**Cons**:
* Regulatory problem with provisioning of the ledger
* Relies on ledger participants to do validation checks
* No redundancy across ledger participants

#### Privacy + notary

**Pros**:
* Strong guarantee of validity
* Separation of concerns, allows lightweight ledger participants
* Redundancy across notary nodes

**Cons**:
* Regulatory problem with provisioning of the ledger

#### Integrity + non-notary

**Pros**:
* Efficient validity checks
* No storage of the sensitive transaction body, only signatures

**Cons**:
* Enclave impersonation compromises the ledger (unless validation is done by consensus)
* Relies on ledger participants to do validation checks
* No redundancy across ledger participants

#### Integrity + notary

**Pros**:
* Efficient validity checks
* No storage of the sensitive transaction body, only signatures
* Separation of concerns, allows lightweight ledger participants
* Redundancy across notary nodes

**Cons**:
* Only a BFT guarantee over validity
* Temporary storage of the transaction in RAM may be against regulation

Personally I'm strongly leaning towards an integrity model where SGX compromise is mitigated by a BFT consensus over
validity (perhaps done by a validating oracle cluster). This would solve the regulatory problem, it would be efficient,
and the infrastructure would have a very clean separation of concerns between notary and non-notary nodes, allowing
lighter-weight interaction with the ledger.
@@ -64,6 +64,7 @@ We look forward to seeing what you can do with Corda!
    design/hadr/design.md
    design/kafka-notary/design.md
    design/monitoring-management/design.md
    design/sgx-integration/design.md

.. toctree::
   :caption: Participate