diff --git a/docs/source/api-contract-constraints.rst b/docs/source/api-contract-constraints.rst index 9ce088c679..ad00dbc65f 100644 --- a/docs/source/api-contract-constraints.rst +++ b/docs/source/api-contract-constraints.rst @@ -89,8 +89,6 @@ logic provided by the apps. Hash and zone whitelist constraints are left over from earlier Corda versions before Signature Constraints were implemented. They make it harder to upgrade applications than when using signature constraints, so they're best avoided. -Further information into the design of Signature Constraints can be found in its :doc:`design document `. - .. _signing_cordapps_for_use_with_signature_constraints: Signing CorDapps for use with Signature Constraints diff --git a/docs/source/app-upgrade-notes.rst b/docs/source/app-upgrade-notes.rst index d066403044..a914d22f0c 100644 --- a/docs/source/app-upgrade-notes.rst +++ b/docs/source/app-upgrade-notes.rst @@ -530,8 +530,7 @@ packages, they could call package-private methods, which may not be expected by and request ownership of your root package namespaces (e.g. ``com.megacorp.*``), with the signing keys you will be using to sign your app JARs. The zone operator can then add your signing key to the network parameters, and prevent attackers defining types in your own package namespaces. Whilst this feature is optional and not strictly required, it may be helpful to block attacks at the boundaries of a Corda based application -where type names may be taken "as read". You can learn more about this feature and the motivation for it by reading -":doc:`design/data-model-upgrades/package-namespace-ownership`". +where type names may be taken "as read". Step 11. Consider adding extension points to your flows ------------------------------------------------------- diff --git a/docs/source/conf.py b/docs/source/conf.py index a4fb0303b7..55ae2f0218 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -100,12 +100,6 @@ language = None # Else, today_fmt is used as the format for a strftime call. # today_fmt = '%B %d, %Y' -# List of patterns, relative to source directory, that match files and -# directories to ignore when looking for source files. -exclude_patterns = ['design/README.md'] -if tags.has('pdfmode'): - exclude_patterns = ['design', 'design/README.md'] - # The reST default role (used for this markup: `text`) to use for all # documents. # default_role = None diff --git a/docs/source/cordapp-advanced-concepts.rst b/docs/source/cordapp-advanced-concepts.rst index 3ae096ea23..a64bc25648 100644 --- a/docs/source/cordapp-advanced-concepts.rst +++ b/docs/source/cordapp-advanced-concepts.rst @@ -280,8 +280,8 @@ But if another CorDapp developer, `OrangeCo` bundles the `Fruit` library, they m This will create a `com.fruitcompany.Banana` @SignedBy_TheOrangeCo, so there could be two types of Banana states on the network, but "owned" by two different parties. This means that while they might have started using the same code, nothing stops these `Banana` contracts from diverging. Parties on the network receiving a `com.fruitcompany.Banana` will need to explicitly check the constraint to understand what they received. -In Corda 4, to help avoid this type of confusion, we introduced the concept of Package Namespace Ownership (see ":doc:`design/data-model-upgrades/package-namespace-ownership`"). -Briefly, it allows companies to claim namespaces and anyone who encounters a class in that package that is not signed by the registered key knows is invalid. 
+In Corda 4, to help avoid this type of confusion, we introduced the concept of Package Namespace Ownership. Briefly, it allows companies to claim namespaces +and anyone who encounters a class in that package that is not signed by the registered key knows it is invalid. This new feature can be used to solve the above scenario. If `TheFruitCo` claims package ownership of `com.fruitcompany`, it will prevent anyone from bundling its code because they will not be able to sign it with the right key. diff --git a/docs/source/design/README.md b/docs/source/design/README.md deleted file mode 100644 index f664a1a3ca..0000000000 --- a/docs/source/design/README.md +++ /dev/null @@ -1,40 +0,0 @@ -![Corda](https://www.corda.net/wp-content/uploads/2016/11/fg005_corda_b.png) - - - -# Design Documentation - -This directory should be used to version control Corda design documents. - -These should be written in [Markdown](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet) (a design template is provided for general guidance) and follow the design review process outlined below. It is recommended you use a Markdown editor such as [Typora](https://typora.io/), or an appropriate plugin for your favourite editor (eg. [Sublime Markdown editing theme](http://plaintext-productivity.net/2-04-how-to-set-up-sublime-text-for-markdown-editing.html)). - -## Design Review Process - -Please see the [design review process](design-review-process.md). - -* Feature request submission -* High level design -* Review / approve gate -* Technical design -* Review / approve gate -* Plan, prototype, implement, QA - -## Design Template - -Please copy this [directory](template) to a new location under `/docs/source/design` (use a meaningful short descriptive directory name) and use the [Design Template](template/design.md) contained within to guide writing your Design Proposal. Whilst the section headings may be treated as placeholders for guidance, you are expected to be able to answer any questions related to pertinent section headings (where relevant to your design) at the design review stage. Use the [Design Decision Template](template/decisions/decision.md) (as many times as needed) to record the pros and cons, and justification of any design decision recommendations where multiple options are available. These should be directly referenced from the *Design Decisions* section of the main design document. - -The design document may be completed in one or two iterations, by completing the following main two sections individually or singularly: - -* High level design - Where a feature requirement is specified at a high level, and multiple design solutions are possible, this section should be completed and circulated for review prior to completing the detailed technical design. - High level designs will often benefit from a formal meeting and discussion review amongst stakeholders to reach consensus on the preferred way to proceed. The design author will then incorporate all meeting outcome decisions back into a revision for final GitHub PR approval. -* Technical design - The technical design will consist of implementation specific details which require a deeper understanding of the Corda software stack, such as public API's and services, libraries, and associated middleware infrastructure (messaging,security, database persistence, serialization) used to realize these. - Technical designs should lead directly to a GitHub PR review process.
- -Once a design is approved using the GitHub PR process, please commit the PR to the GitHub repository with a meaningful version identifier (eg. my super design document - **V1.0**) - -## Design Repository - -All design documents will be version controlled under github under the directory `/docs/source/design`. -For designs that relate to Enterprise-only features (and that may contain proprietary IP), these should be stored under the [Enterprise Github repository](https://github.com/corda/enterprise). All other public designs should be stored under the [Open Source Github repository](https://github.com/corda/corda). diff --git a/docs/source/design/certificate-hierarchies/decisions/levels.md b/docs/source/design/certificate-hierarchies/decisions/levels.md deleted file mode 100644 index c34d078cf0..0000000000 --- a/docs/source/design/certificate-hierarchies/decisions/levels.md +++ /dev/null @@ -1,50 +0,0 @@ -Design Decision: Certificate hierarchy levels -============================================ - -## Background / Context - -The decision of how many levels to include is a key feature of the [proposed certificate hierarchy](../design.md). - -## Options Analysis - -### Option 1: 2-level hierarchy - -Under this option, intermediate CA certificates for key signing services (Doorman, Network Map, CRL) are generated as -direct children of the root certificate. - -![Current](../images/option1.png) - -#### Advantages - -- Simplest option -- Minimal change to existing structure - -#### Disadvantages - -- The Root CA certificate is used to sign both intermediate certificates and CRL. This may be considered as a drawback - as the Root CA should be used only to issue other certificates. - -### Option 2: 3-level hierarchy - -Under this option, an additional 'Company CA' cert is generated from the root CA cert, which is then used to generate -intermediate certificates. - -![Current](../images/option2.png) - -#### Advantages - -- Allows for option to remove the root CA from the network altogether and store in an offline medium - may be preferred by some stakeholders -- Allows (theoretical) revocation and replacement of the company CA cert without needing to replace the trust root. - -#### Disadvantages - -- Greater complexity - -## Recommendation and justification - -Proceed with option 1: 2-level hierarchy. - -No authoritative argument from a security standpoint has been made which would justify the added complexity of option 2. -Given the business impact of revoking the Company CA certificate, this must be considered an extremely unlikely event -with comparable implications to the revocation of the root certificate itself; hence no practical justification for the -addition of the third level is observed. \ No newline at end of file diff --git a/docs/source/design/certificate-hierarchies/decisions/tls-trust-root.md b/docs/source/design/certificate-hierarchies/decisions/tls-trust-root.md deleted file mode 100644 index dae762667b..0000000000 --- a/docs/source/design/certificate-hierarchies/decisions/tls-trust-root.md +++ /dev/null @@ -1,42 +0,0 @@ -Design Decision: Certificate Hierarchy -====================================== - -## Background / Context - -This document purpose is to make a decision on the certificate hierarchy. It is necessary to make this decision as it -affects development of features (e.g. Certificate Revocation List). - -## Options Analysis - -There are various options in how we structure the hierarchy above the node CA. 
- -### Option 1: Single trust root - -Under this option, TLS certificates are issued by the node CA certificate. - -#### Advantages - -- Existing design - -#### Disadvantages - -- The Root CA certificate is used to sign both intermediate certificates and CRL. This may be considered as a drawback as the Root CA should be used only to issue other certificates. - -### Option 2: Separate TLS vs. identity trust roots - -This option splits the hierarchy by introducing a separate trust root for TLS certificates. - -#### Advantages - -- Simplifies issuance of TLS certificates (implementation constraints beyond those of other certificates used by Corda - specifically, EdDSA keys are not yet widely supported for TLS certificates) -- Avoids requirement to specify accurate usage restrictions on node CA certificates to issue their own TLS certificates - -#### Disadvantages - -- Additional complexity - -## Recommendation and justification - -Proceed with option 1 (Single Trust Root) for current purposes. - -Feasibility of option 2 in the code should be further explored in due course. \ No newline at end of file diff --git a/docs/source/design/certificate-hierarchies/design.md b/docs/source/design/certificate-hierarchies/design.md deleted file mode 100644 index ca4ad66534..0000000000 --- a/docs/source/design/certificate-hierarchies/design.md +++ /dev/null @@ -1,84 +0,0 @@ -# Certificate hierarchies - -.. important:: This design doc applies to the main Corda network. Other networks may use different certificate hierarchies. - -## Overview - -A certificate hierarchy is proposed to enable effective key management in the context of managing Corda networks. -This includes certificate usage for the data signing process and certificate revocation process -in case of a key compromise. At the same time, result should remain compliant with -[OCSP](https://en.wikipedia.org/wiki/Online_Certificate_Status_Protocol) and [RFC 5280](https://www.ietf.org/rfc/rfc5280.txt) - -## Background - -Corda utilises public key cryptography for signing and authentication purposes, and securing communication -via TLS. As a result, every entity participating in a Corda network owns one or more cryptographic key pairs {*private, -public*}. Integrity and authenticity of an entity's public key is assured using digital certificates following the -[X.509 standard](https://tools.ietf.org/html/rfc5280), whereby the receiver’s identity is cryptographically bonded to -his or her public key. - -Certificate Revocation List (CRL) functionality interacts with the hierarchy of the certificates, as the revocation list -for any given certificate must be signed by the certificate's issuer. Therefore if we have a single doorman CA, the sole -CRL for node CA certificates would be maintained by that doorman CA, creating a bottleneck. Further, if that doorman CA -is compromised and its certificate revoked by the root certificate, the entire network is invalidated as a consequence. - -The current solution of a single intermediate CA is therefore too simplistic. - -Further, the split and location of intermediate CAs has impact on where long term infrastructure is hosted, as the CRLs -for certificates issued by these CAs must be hosted at the same URI for the lifecycle of the issued certificates. - -## Scope - -Goals: - -* Define effective certificate relationships between participants and Corda network services (i.e. nodes, notaries, network map, doorman). 
-* Enable compliance with both [OCSP](https://en.wikipedia.org/wiki/Online_Certificate_Status_Protocol) and [RFC 5280](https://www.ietf.org/rfc/rfc5280.txt) (CRL)-based revocation mechanisms -* Mitigate relevant security risks (keys being compromised, data privacy loss etc.) - -Non-goals: - -* Define an end-state mechanism for certificate revocation. - -## Requirements - -In case of a private key being compromised, or a certificate incorrectly issued, it must be possible for the issuer to -revoke the appropriate certificate(s). - -The solution needs to scale, keeping in mind that the list of revoked certificates from any given certificate authority -is likely to grow indefinitely. However for an initial deployment a temporary certificate authority may be used, and -given that it will not require to issue certificates in the long term, scaling issues are less of a concern in this -context. - -## Design Decisions - -.. toctree:: - :maxdepth: 2 - - decisions/levels.md - decisions/tls-trust-root.md - -## **Target** Solution - -![Target certificate structure](./images/cert_structure_v3.png) - -The design introduces discrete intermediate CAs below the network trust root for each logical service exposed by the doorman - specifically: - -1. Node CA certificate issuance -2. Network map signing -3. Certificate Revocation List (CRL) signing -4. OCSP revocation signing - -The use of discrete certificates in this way facilitates subsequent changes to the model, including retiring and replacing certificates as needed. - -Each of the above certificates will specify a CRL allowing the certificate to be revoked. The root CA operator -(primarily R3) will be required to maintain this CRL for the lifetime of the process. - -TLS certificates will remain issued under Node CA certificates (see [decision: TLS trust -root](./decisions/tls-trust-root.md)). - -Nodes will be able to specify CRL(s) for TLS certificates they issue; in general, they will be required to such CRLs for -the lifecycle of the TLS certificates. - -In the initial state, a single doorman intermediate CA will be used for issuing all node certificates. Further -intermediate CAs for issuance of node CA certificates may subsequently be added to the network, where appropriate, -potentially split by geographic region or otherwise. 
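To make the shape of these checks concrete, the following is a minimal sketch (not part of the design itself) of how a verifier could confirm that a service intermediate chains directly to the trust root, using only standard `java.security.cert` APIs. The function and file-loading helper names are hypothetical:

```kotlin
import java.io.FileInputStream
import java.security.cert.CertificateFactory
import java.security.cert.X509Certificate

// Hypothetical helper: each service intermediate (node CA issuance, network map
// signing, CRL signing, OCSP signing) should verify directly against the trust root.
fun checkServiceIntermediate(root: X509Certificate, intermediate: X509Certificate) {
    // Throws (e.g. SignatureException) if the certificate was not signed with the root's key.
    intermediate.verify(root.publicKey)
    require(intermediate.issuerX500Principal == root.subjectX500Principal) {
        "Service certificates must be issued directly by the trust root"
    }
    // For the intermediate that issues node CA certificates, the CA basic constraint must
    // additionally be present: X509Certificate.getBasicConstraints() returns -1 for non-CA
    // certificates, so that case would also be checked with basicConstraints >= 0.
}

fun loadCertificate(path: String): X509Certificate =
    FileInputStream(path).use {
        CertificateFactory.getInstance("X.509").generateCertificate(it) as X509Certificate
    }
```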
\ No newline at end of file diff --git a/docs/source/design/certificate-hierarchies/images/cert_structure_v2.png b/docs/source/design/certificate-hierarchies/images/cert_structure_v2.png deleted file mode 100644 index 7ea3361c20..0000000000 Binary files a/docs/source/design/certificate-hierarchies/images/cert_structure_v2.png and /dev/null differ diff --git a/docs/source/design/certificate-hierarchies/images/cert_structure_v3.png b/docs/source/design/certificate-hierarchies/images/cert_structure_v3.png deleted file mode 100644 index 2b872a1071..0000000000 Binary files a/docs/source/design/certificate-hierarchies/images/cert_structure_v3.png and /dev/null differ diff --git a/docs/source/design/certificate-hierarchies/images/current.png b/docs/source/design/certificate-hierarchies/images/current.png deleted file mode 100644 index a07f1983f1..0000000000 Binary files a/docs/source/design/certificate-hierarchies/images/current.png and /dev/null differ diff --git a/docs/source/design/certificate-hierarchies/images/option1.png b/docs/source/design/certificate-hierarchies/images/option1.png deleted file mode 100644 index 0265090b5e..0000000000 Binary files a/docs/source/design/certificate-hierarchies/images/option1.png and /dev/null differ diff --git a/docs/source/design/certificate-hierarchies/images/option2.png b/docs/source/design/certificate-hierarchies/images/option2.png deleted file mode 100644 index 517206da26..0000000000 Binary files a/docs/source/design/certificate-hierarchies/images/option2.png and /dev/null differ diff --git a/docs/source/design/certificate-hierarchies/images/option3.png b/docs/source/design/certificate-hierarchies/images/option3.png deleted file mode 100644 index 8f1fff6478..0000000000 Binary files a/docs/source/design/certificate-hierarchies/images/option3.png and /dev/null differ diff --git a/docs/source/design/data-model-upgrades/migrate-to-signature-constraint.md b/docs/source/design/data-model-upgrades/migrate-to-signature-constraint.md deleted file mode 100644 index 501ac8636f..0000000000 --- a/docs/source/design/data-model-upgrades/migrate-to-signature-constraint.md +++ /dev/null @@ -1,151 +0,0 @@ -# Migration from the hash constraint to the Signature constraint - - -## Background - -Corda pre-V4 only supports HashConstraints and the WhitelistedByZoneConstraint. -The default constraint, if no entry was added to the network parameters is the hash constraint. -Thus, it's very likely that most first states were created with the Hash constraint. - -When changes will be required to the contract, the only alternative is the explicit upgrade, which creates a new contract, but inherits the HashConstraint (with the hash of the new jar this time). - -**The current implementation of the explicit upgrade does not support changing the constraint.** - -It's very unlikely that these first deployments actually wanted a non-upgradeable version. - -This design doc is presenting a smooth migration path from the hash constraint to the signature constraint. - - -## Goals - -CorDapps that were released (states created) with the hash constraint should be able to transition to the signature constraint if the original developer decides to do that. - -A malicious party should not be able to attack this feature, by "taking ownership" of the original code. - - -## Non-Goals - -Migration from the whitelist constraint was already implemented. so will not be addressed. (The cordapp developer or owner just needs to sign the jar and whitelist the signed jar.) 
- -Also versioning is being addressed in different design docs. - - -## Design details - -### Requirements - -To migrate without disruption from the hash constraint, the jar that is attached to a spending transaction needs to satisfy both the hash constraint of the input state, as well as the signature constraint of the output state. - -Also, it needs to reassure future transaction verifiers - when doing transaction resolution - that this was a legitimate transition, and not a malicious attempt to replace the contract logic. - - -### Process - -To achieve the first part, we can create this convention: - -- Developer signs the original jar (that was used with the hash constraint). -- Nodes install it, thus whitelisting it. -- The HashConstraint.verify method will be modified to verify the hash with and without signatures. -- The nodes create normal transactions that spend an input state with the hashConstraint and output states with the signature constraint. No special spend-to-self transactions should be required. -- This transaction would validate correctly as both constraints will pass - the unsigned hash matches, and the signatures are there. -- This logic needs to be added to the constraint propagation transition matrix. This could be only enabled for states created pre-v4, when there was no alternative. - - -For the second part: - -- The developer needs to claim the package (See package ownership). This will give confidence to future verifiers that it was the actual developer that continues to be the owner of that jar. - - -To summarise, if a CorDapp developer wishes to migrate to the code it controls to the signature constraint for better flexibility: - -1. Claim the package. -2. Sign the jar and distribute it. -3. In time all states will naturally transition to the signature constraint. -4. Release new version as per the signature constraint. - - -A normal node would just download the signed jar using the normal process for that, and the platform will do the rest. - - -### Caveats - -#### Someone really wants to issue states with the HashConstraint, and ensure that can never change. - - - As mentioned above the transaction builder could only automatically transition states created pre-v4. - - - If this is the original developer of the cordapp, then they can just hardcode the check in the contract that the constraint must be the HashConstraint. - - - It is actually a third party that uses a contract it doesn't own, but wants to ensure that it's only that code that is used. - This should not be allowed, as it would clash with states created without this constraint (that might have higher versions), and create incompatible states. - The option in this case is to force such parties to actually create a new contract (maybe subclass the version they want), own it, and hardcode the check as above. - - -#### Some nodes haven't upgraded all their states by the time a new release is already being used on the network. - - - A transaction mixing an original HashConstraint state, and a v2 Signature constraint state will not pass. The only way out is to strongly "encourage" nodes to upgrade before the new release. - -The problem is that, in a transaction, the attachment needs to pass the constraint of all states. - -If the rightful owner took over the contract of states originally released with the HashConstraint, and started releasing new versions then the following might happen: - -- NodeA did not migrate all his states to the Signature Constraint. 
-- NodeB did, and already has states created with version 2. -- If they decide to trade, NodeA will add his HashConstraint state, and NodeB will add his version2 SignatureConstraint state to a new transaction. -This is an impossible transaction. Because if you select version1 of the contract you violate the non-downgrade rule. If you select version2 , you violate the initial HashConstraint. - - -Note: If we consider this to be a real problem, then we can implement a new NoOp Transaction type similar to the Notary change or the contract upgrade. The security implications need to be considered before such work is started. -Nodes could use this type of transaction to change the constraint of existing states without the need to transact. - - -### Implementation details - -- Create a function to return the hash of a signed jar after it stripped the signatures. -- Change the HashConstraint to check against any of these 2 hashes. -- Change the transaction builder logic to automatically transition constraints that are signed, owned, etc.. -- Change the constraint propagation transition logic to allow this. - - - -## Alternatives considered - - -### Migrating from the HashConstraint to the SignatureConstraint via the WhitelistConstraint. - -We already have a strategy to migrate from the WhitelistConstraint to the Signature contraint: - -- Original developer (owner) signs the last version of the jar, and whitelists the signed version. -- The platform allows transitioning to the SignatureConstraint as long as all the signers of the jar are in the SignatureConstraint. - - -We could attempt to extend this strategy with a HashConstraint -> WhitelistConstraint path. - -#### The process would be: - -- Original developer of the contract that used the hashConstraint will make a whitelist contract request, and provide both the original jar and the original jar but signed. -- The zone operator needs to make sure that this is the original developer who claims ownership of that corDapp. - -##### Option 1: Skip the WhitelistConstraint when spending states. (InputState = HashConstraint, OutputState = SignatureConstraint) - -- This is not possible as one of the 2 constraints will fail. -- Special constraint logic is needed which is risky. - - -##### Option 2: Go through the WhitelistConstraint when spending states - -- When a state is spent, the transaction builder sees there is a whitelist constraint, and selects the first entry. -- The transition matrix will allow the transition from hash to Whitelist. -- Next time the state is spent, it will transition from the Whitelist constraint to the signature constraint. - - -##### Advantage: - -- The tricky step of removing the signature from a jar to calculate the hash is no longer required. - - -##### Disadvantage: - -- The transition will happen in 2 steps, which will add another layer of surprise and of potential problems. -- The No-Op transaction will become mandatory for this, along with all the complexity it brings (all participants signing). It will need to be run twice. -- An unnecessary whitelist entry is added. If that developer also decides to claim the package (as probably most will do in the beginning), it will grow the network parameters and increase the workload on the Zone Operator. -- We create an unintended migration path from the HashConstraint to the WhitelistConstraint. 
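For illustration, the first bullet under "Implementation details" above - a function returning the hash of a signed jar after stripping its signatures - could be sketched as below. This is only one possible canonicalisation, not the final implementation: it digests the sorted non-signature entries, and assumes the manifest must also be ignored because `jarsigner` rewrites it. The helper name and regex are assumptions.

```kotlin
import java.io.File
import java.security.MessageDigest
import java.util.jar.JarFile

// Files added or rewritten by jarsigner: signature files, signature block files and the
// manifest. Assumption: ignoring these makes the signed and unsigned forms of the same
// jar digest identically, provided the jar was packed reproducibly.
private val SIGNING_RELATED = Regex("""META-INF/(.*\.(SF|DSA|RSA|EC)|MANIFEST\.MF)""", RegexOption.IGNORE_CASE)

fun hashIgnoringSignatures(jar: File): ByteArray {
    val digest = MessageDigest.getInstance("SHA-256")
    JarFile(jar).use { jarFile ->
        jarFile.entries().toList()
            .filterNot { it.isDirectory || SIGNING_RELATED.matches(it.name) }
            .sortedBy { it.name } // canonical order, so entry ordering cannot change the hash
            .forEach { entry ->
                digest.update(entry.name.toByteArray())
                jarFile.getInputStream(entry).use { stream -> digest.update(stream.readBytes()) }
            }
    }
    return digest.digest()
}
```

The modified `HashConstraint.verify` described above would then accept either the attachment's regular hash or this signature-stripped hash.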
diff --git a/docs/source/design/data-model-upgrades/package-namespace-ownership.md b/docs/source/design/data-model-upgrades/package-namespace-ownership.md deleted file mode 100644 index 2878222be1..0000000000 --- a/docs/source/design/data-model-upgrades/package-namespace-ownership.md +++ /dev/null @@ -1,110 +0,0 @@ -# Package namespace ownership - -This design document outlines a new Corda feature that allows a compatibility zone to give ownership of parts of the Java package namespace to certain users. - -"*There are only two hard problems in computer science: 1. Cache invalidation, 2. Naming things, 3. Off by one errors*" - - - -## Background - -Corda implements a decentralised database that can be unilaterally extended with new data types and logic by its users, without any involvement by the closest equivalent we have to administrators (the "zone operator"). Even informing them is not required. - -This design minimises the power zone operators have and ensures deploying new apps can be fast and cheap - it's limited only by the speed with which the users themselves can move. But it introduces problematic levels of namespace complexity which can make programming securely harder than in regular non-decentralised programming. - -#### Java namespaces - -A typical Java application, seen from the JVM level, has a flat namespace in which a single string name binds to a single class. In object oriented programming a class defines both a data structure and the code used to enforce various invariants like "a person's age may not be negative", so this allows a developer to reason about what the identifier `com.example.Person` really means throughout the lifetime of his program. - -More complex Java applications may have a nested namespace using classloaders, thus inside a JVM a class is actually a pair of (classloader pointer, class name) and this can be used to support tricks like having two different versions of the same class in use simultaneously. The downside is more complexity for the developer to deal with. When things get mixed up this can surface (in Java 8) as nonsensical error messages like "com.example.Person cannot be casted to com.example.Person". In Java 9 classloaders were finally given names so these errors make more sense. - - - -#### Corda namespaces - -Corda faces an extension of the Java namespace problem - we have a global namespace in which malicious adversaries might be choosing names to be deliberately confusing. Nothing forces an app developer to follow the standard conventions for Java package or class names - someone could make an app that uses the same class name as one of your own apps. Corda needs to keep these two different classes, from different origins, separated. - -On the core ledger this is done by associating each state with an _attachment_. The attachment is the JAR file that contains the class files used by states. To load a state, a classloader is defined that uses the attachments on a transaction, and then the state class is loaded via that classloader. - -With this infrastructure in place, the Corda node and JVM can internally keep two classes that share the same name separated. The name of the state is, in effect, a list of attachments (hashes of JAR files) combined with a regular class name. - - - -#### Namespaces and versioning - -Names and namespaces are a critical part of how platforms of any kind handle software evolution. If component A is verifying the precise content of component B, e.g. 
by hashing it, then there can be no agility - component B can never be upgraded. Sometimes this is what's wanted. But usually you want the indirection of a name or set of names that stands in for some behaviour. Exactly how that behaviour is provided is abstracted away behind the mapping of the namespace to concrete artifacts. - -Versioning and resistance to malicious attack are likewise heavily interrelated, because given two different codebases that export the same names, it's possible that one is a legitimate upgrade which changes the logic behind the names in beneficial ways, and the other is an imposter that changes the logic in malicious ways. It's important to keep the differences straight, which can be hard because by their very nature, two versions of the same app tend to be nearly identical. - - - -#### Namespace complexity - -Reasoning about namespaces is hard and has historically led to security flaws in many platforms. - -Although the Corda namespace system _can_ keep overlapping but distinct apps separated, that unfortunately doesn't mean that everywhere it actually does. In a few places Corda does not currently provide all the data needed to work with full state names, although we are adding this data to RPC in Corda 4. - -Even if Corda was sure to get every detail of this right in every area, a full ecosystem consists of many programs written by app developers - not just contracts and flows, but also RPC clients, bridges from internal systems and so on. It is unreasonable to expect developers to fully keep track of Corda compound names everywhere throughout the entire pipeline of tools and processes that may surround the node: some of them will lose track of the attachments list and end up with only a class name, and others will do things like serialise to JSON in which even type names go missing. - -Although we can work on improving our support and APIs for working with sophisticated compound names, we should also allow people to work with simpler namespaces again - like just Java class names. This involves a small sacrifice of decentralisation but the increase in security is probably worth it for most developers. - -## Goals - -* Provide a way to reduce the complexity of naming and working with names in Corda by allowing for a small amount of centralisation, balanced by a reduction in developer mental load. -* Keep it optional for both zones and developers. -* Allow most developers to work just with ordinary Java class names, without needing to consider the complexities of a decentralised namespace. - -## Non-goals - -* Directly make it easier to work with "decentralised names". This can be a project that comes later. - -## Design - -To make it harder to accidentally write insecure code, we would like to support a compromise configuration in which a compatibility zone can publish a map of Java package namespaces to public keys. An app/attachment JAR may only define a class in that namespace if it is signed by the given public key. Using this feature would make a zone slightly less decentralised, in order to obtain a significant reduction in mental overhead for developers. - -Example of how the network parameters would be extended, in pseudo-code: - -```kotlin -data class JavaPackageName(name: String) { - init { /* verify 'name' is a valid Java package name */ } -} - -data class NetworkParameters( - ... - val packageOwnership: Map<JavaPackageName, PublicKey> -) -``` - -Where the `PublicKey` object can be any of the algorithms supported by signature constraints.
The map defines a set of dotted package names like `com.foo.bar` where any class in that package or any sub-package of that package is considered to match (so `com.foo.bar.baz.boz.Bish` is a match but `com.foo.barrier` does not). - -When a class is loaded from an attachment or application JAR signature checking is enabled. If the package of the class matches one of the owned namespaces, the JAR must be have enough signatures to satisfy the PublicKey (there may need to be more than one if the PublicKey is composite). - -Please note the following: - -* It's OK to have unsigned JARs. -* It's OK to have JARs that are signed, but for which there are no claims in the network parameters. -* It's OK if entries in the map are removed (system becomes more open). If entries in the map are added, this could cause consensus failures if people are still using old unsigned versions of the app. -* The map specifies keys not certificate chains, therefore, the keys do not have to chain off the identity key of a zone member. App developers do not need to be members of a zone for their app to be used there. - -From a privacy and decentralisation perspective, the zone operator *may* learn who is developing apps in their zone or (in cases where a vendor makes a single app and thus it's obvious) which apps are being used. This is not ideal, but there are mitigations: - -* The privacy leak is optional. -* The zone operator still doesn't learn who is using which apps. -* There is no obligation for Java package namespaces to correlate obviously to real world identities or products. For example you could register a trivial "front" domain and claim ownership of that, then use it for your apps. The zone operator would see only a codename. - -#### Claiming a namespace - -The exact mechanism used to claim a namespace is up to the zone operator. A typical approach would be to accept an SSL certificate with the domain in it as proof of domain ownership, or to accept an email from that domain as long as the domain is using DKIM to prevent from header spoofing. - -#### The vault API - -The vault query API is an example of how tricky it can be to manage truly decentralised namespaces. The `Vault.Page` class does not include constraint information for a state. Therefore, if a generic app were to be storing states of many different types to the vault without having the specific apps installed, it might be possible for someone to create a confusing name e.g. an app created by MiniCorp could export a class named `com.megacorp.example.Token` and this would be mapped by the RPC deserialisation logic to the actual MegaCorp app - the RPC client would have no way to know this had happened, even if the user was correctly checking, which it's unlikely they would. - -The `StateMetadata` class can be easily extended to include constraint information, to make safely programming against a decentralised namespace possible. As part of this work this extension will be made. - -But the new field would still need to be used - a subtle detail that would be easy to overlook. Package namespace ownership ensures that if you have an app installed locally on the client side that implements `com.megacorp.example` , then that code is likely to match closely enough with the version that was verified by the node. 
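As a worked example of the matching rule above (an owned package `com.foo.bar` matches `com.foo.bar.baz.boz.Bish` but not `com.foo.barrier`), here is a small, hypothetical Kotlin sketch; `JavaPackageName` mirrors the pseudo-code in the Design section, and `owns`/`ownerOf` are assumed helper names rather than platform API:

```kotlin
import java.security.PublicKey

data class JavaPackageName(val name: String)

// A claim on "com.foo.bar" covers classes in that package and any sub-package.
// The trailing dot is what stops lookalikes such as "com.foo.barrier" matching.
fun JavaPackageName.owns(className: String): Boolean = className.startsWith("$name.")

// Returns the key whose signatures the attachment must satisfy, or null if the class
// name falls outside every owned namespace (unsigned jars remain acceptable then).
fun ownerOf(className: String, packageOwnership: Map<JavaPackageName, PublicKey>): PublicKey? =
    packageOwnership.entries.firstOrNull { (pkg, _) -> pkg.owns(className) }?.value
```

With a claim on `com.foo.bar` in the map, `ownerOf("com.foo.bar.baz.boz.Bish", ...)` returns the registered key, while `ownerOf("com.foo.barrier.Thing", ...)` returns `null`.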
- - - - - diff --git a/docs/source/design/data-model-upgrades/signature-constraints.md b/docs/source/design/data-model-upgrades/signature-constraints.md deleted file mode 100644 index 09bbeef5f1..0000000000 --- a/docs/source/design/data-model-upgrades/signature-constraints.md +++ /dev/null @@ -1,155 +0,0 @@ -# Signature constraints - -This design document outlines an additional kind of *contract constraint*, used for specifying inside a transaction what the set of allowable attached contract JARs can be for each state. - -## Background - -Contract constraints are a part of how Corda ensures the correct code is executed to verify transactions, and also how it manages application upgrades. There are two kinds of upgrade that can be applied to the ledger: - -* Explicit -* Implicit - -An *explicit* upgrade is when a special kind of transaction is used, the *contract upgrade transaction*, which has the power to suspend normal contract execution and validity checking. The new contract being upgraded-to must be willing to accept the old state and can replace it with a new one. Because this can cause arbitrary edits to the ledger, every participant in a state must sign the contract upgrade transaction for it to be considered valid. - -Note that in the case of single-participant states whilst you could unilaterally replace a token state with a different state, this would be a state controlled by an application that other users wouldn't recognise, so you cannot transmute a token into a private contract with yourself then transmute it back, because contracts will only upgrade states they created themselves. - -An *implicit* upgrade is when the creator of a state has pre-authorised upgrades, quite possibly including versions of the app that didn't exist when the state was first authored. Implicit upgrades don't require a manual approval step - the new code can start being used whenever the next transaction for a state is needed, as long as it meets the state's constraint. - -Our current set of constraints is quite small. We support: - -* `AlwaysAcceptContractConstraint` - any attachment can be used, effectively this disables ledger security. -* `HashAttachmentContractConstraint` - only an attachment of the specified hash can be used. This is the same as Bitcoin or Ethereum and means once the state is created, the code is locked in permanently. -* `WhitelistedByZoneContractConstraint` - the network parameters contains a map of state class name to allowable hashes for the attachments. - -The last constraint allows upgrades 'from the future' to be applied, without disabling ledger security. However it is awkward to use, because any new version of any app requires a new set of network parameters to be signed by the zone operator and accepted by all participants, which in turn requires a node restart. - -The problems of `WhitelistedByZone` were known at the time it was developed, however, the feature was implemented anyway to reduce schedule slip for the Corda 3.0 release, whilst still allowing some form of application upgrade. - -We would like a new kind of constraint that is more convenient and decentralised whilst still being secure. - - -## Goals - -* Improve usability by eliminating the need to change the network parameters. -* Improve decentralisation by allowing apps to be developed and upgraded without the zone operator knowing or being able to influence it. -* Eventually, phase out zone whitelisting constraints. - -## Non-goals - -* Preventing downgrade attacks. 
Downgrade attack prevention will be tackled in a different design effort. -* Phase out of hash constraints. If malicious app creators are in the users threat model then hash constraints are the way to go. -* Handling the case where third parties re-sign app jars. -* Package namespace ownership (a separate effort). -* Allowing the zone operator to override older constraints, to provide a non-explicit upgrade path. - -## Design details - -We propose being able to constrain to any attachments whose files are signed by a specified set of keys. - -This satisfies the usability requirement because the creation of a new application is as simple as invoking the `jarsigner` tool that comes with the JDK. This can be integrated with the build system via a Gradle or Maven task. For example, Gradle can use jarsigner via [the signjar task](https://ant.apache.org/manual/Tasks/signjar.html) ([example](https://gist.github.com/Lien/7150434)). - -This also satisfies the decentralisation requirement, because app developers can sign apps without the zone operator's involvement or knowledge. - -Using JDK style JAR code signing has several advantages over rolling our own: - -* Although a signing key is required, this can be set up once. It can be protected by a password, or Windows/Mac built in keychain security, a token that supports PIN /biometrics or an HSM. All these options are supported out of the box by the Java security architecture. -* JARs can be signed multiple times by different entities. The nature of this process means the signatures can be combined easily - there is no ordering requirement or complex collaboration tools needed. By implication this means that a signature constraint can use a composite key. -* APIs for verifying JAR signatures are included in the platform already. -* File hashes can be checked file-at-a-time, so random access is made easier e.g. from inside an SGX enclave. -* Although Gradle can make reproducible JARs quite easily, JAR signatures do not include irrelevant metadata like file ordering or timestamps, so they are robust to being unpacked and repacked. -* The signature can be timestamped using an RFC compliant timestamping server. Our notaries do not currently implement this protocol, but they could. -* JAR signatures are in-lined to the JAR itself and do not ride alongside it. This is a good fit for our current attachments capabilities. - -There are also some disadvantages: - -* JAR signatures do *not* have to cover every file in the JAR. It is possible to add files to the JAR later that are unsigned, and for the verification process to still pass, as verification is done on a per-file basis. This is unintuitive and requires special care. -* The JAR verification APIs do not validate that the certificate chain in the JAR is meaningful. Therefore you must validate the certificate chain yourself in every case where a JAR is being verified. -* JAR signing does not cover the MANIFEST.MF file or files that start with SIG- (case INsensitive). Storing sensitive data in the manifest could be a problem as a result. - -### Data structures - -The proposed data structure for the new constraint type is as follows: - -```kotlin -data class SignatureAttachmentConstraint( - val key: PublicKey -) : AttachmentConstraint -``` - -Therefore if a state advertises this constraint, along with a class name of `com.foo.Bar` then the definition of Bar must reside in an attachment with signatures sufficient to meet the given public key. 
Note that the `key` may be a `CompositeKey` which is fulfilled by multiple signers. Multiple signers of a JAR is useful for decentralised administration of an app that wishes to have a threat model in which one of the app developers may go bad, but not a majority of them. For example there could be a 2-of-3 threshold of {app developer, auditor, R3} in which R3 is legally bound to only sign an upgrade if the auditor is unavailable e.g. has gone bankrupt. However, we anticipate that most constraints will be one-of-one for now. - -We will add a `signers` field to the `ContractAttachment` class that will be filled out at load time if the JAR is signed. The signers will be computed by checking the certificate chain for every file in the JAR, and any unsigned files will cause an exception to be thrown. - -### Transaction building - -The `TransactionBuilder` class can select the right constraint given what it already knows. If it locates the attachment JAR and discovers it has signatures in it, it can automatically set an N-of-N constraint that requires all of them on any states that don't already have a constraint specified. If the developer wants a more sophisticated constraint, it is up to them to set that explicitly in the usual manner. - -### Tooling and workflow - -The primary tool required is of course `jarsigner`. In dev mode, the node will ignore missing signatures in attachment JARs and will simply log an error if no signature is present when a constraint requires one. - -To verify and print information about the signatures on a JAR, the `jarsigner` tool can be used again. In addition, we should add some new shell commands that do the same thing, but for a given attachment hash or transaction hash - these may be useful for debugging and analysis. Actually a new shell command should cover all aspects of inspecting attachments - not just signatures but what's inside them, simple way to save them to local disk etc. - -### Key structure - -There are no requirements placed on the keys used to sign JARs. In particular they do not have to be keys used on the Corda ledger, and they do not need a certificate chain that chains to the zone root. This is to ensure that app JARs are not specific to any particular zone. Otherwise app developers would need to go through the on-boarding process for a zone and that may not always be necessary or appropriate. - -The certificate hierarchy for the JAR signature can be a single self-signed cert. There is no need for the key to present a valid certificate chain. - -### Third party signing of JARs - -Consider an app written and signed by the hypothetical company MiniCorp™. It allows users to issue tokens of some sort. An issuer called MegaCorp™ decides that they do not completely trust MiniCorp to create new versions of the app, and they would like to retain some control, so they take the app jar and sign it themselves. Thus there are now two JARs in circulation for the same app. - -Out of the box, this situation will break when combining tokens using the original JAR and tokens from MegaCorp into a single transaction. The `TransactionBuilder` class will fail because it'll try to attach both JARs to satisfy both constraints, yet the JARs define classes with the same name. This violates the no-overlap rule (the no-overlap rule doesn't check for whether the files are actually identical in content). - -For now we will make this problem out of scope. It can be resolved in a future version of the platform. 
- -There are a couple of ways this could be addressed: - -1. Teach the node how to create a new JAR by combining two separately signed versions of the same JAR into a third. -2. Alter the no-overlap rule so when two files in two different attachments are identical they are not considered to overlap. - -### Upgrading from other constraints - -We anticipate that signature constraints will probably become the standard type of constraint, as it strikes a good balance between security and rigidity. - -The "explicit upgrade" mechanism using dedicated upgrade transactions already exists and can be used to move data from old constraints to new constraints, but this approach suffers from the usual problems associated with this form of upgrade (requires signatures from every participant, creating a new tx, manual approval of states to be upgraded etc). - -Alternatively, network parameters can be extended to support selective overrides of constraints to allow such upgrades in an announced and opt-in way. Designing such a mechanism is out of scope for the first cut of this feature however. - -## Alternatives considered - -### Out-of-line / external JAR signatures - -One obvious alternative is to sign the entire JAR instead of using the Java approach of signing a manifest file that in turn contains hashes of each file. The resulting signature would then ride alongside the JAR in a new set of transaction fields. - -The Java approach of signing a manifest in-line with the JAR itself is more complex, and complexity in cryptographic operations is rarely a good thing. In particular the Java approach means it's possible to have files in the JAR that aren't signed mixed with files that are. This could potentially be a useful source of flexibility but is more likely to be a foot-gun: we should reject attachments that contain a mix of signed and unsigned files. - -However, signing a full JAR as a raw byte stream has other downsides: - -* Would require a custom tool to create the detached signatures. Then it'd require new RPCs and more tools to upload and download the signatures separately from the JARs, and yet more tools to check the signatures. By bundling the signature inside the JAR, we preserve the single-artifact property of the current system, which is quite nice. -* Would require more fields to be added to the WireTransaction format, although we'll probably have to bite this bullet as part of adding attachment metadata eventually anyway. -* The signature ends up covering irrelevant metadata like file modification timestamps, file ordering, compression levels and so on. However, we need to move the ecosystem to producing reproducible JARs anyway for other reasons. -* JAR signature metadata is already exposed via the Java API, so attachments that are not covered by a constraint e.g. an attachment with holiday calendar text files in it, can also be signed, and contract code could check those signatures in the usual documented way. With out-of-line signatures there'd need to be custom APIs to do this. -* Inline JAR signatures have the property that they can be checked on a per file basis. This is potentially useful later for SGX enclaves, if they wish to do random access to JAR files too large to reasonably fit inside the rather restricted enclave memory environment. - -### Package name constraints - -Our goal is to communicate "I want an attachment created by party/parties $FOO". The obvious way to do this is specify the party in the constraint. 
But as part of other work we are considering introducing the concept of package hierarchy ownership - so `com.foobar.*` would be owned by the Foo Corporation of London and this ownership link between namespace glob and `Party` would be specified in the network parameters. - -If such an indirection were to be introduced then you could make the constraint simpler - it wouldn't need any contents at all. Rather, it would indicate that any attachment that legitimately exported the package name of the contract classname would be accepted. It'd be up to the platform to check that the signature on the JAR was by the same party that is listed in the network parameters as owning that package namespace. - -There are some further issues to think through here: - -1. Is this a fourth type of constraint (package name constraint) that we should support along with the other three? Or is it actually just a better design and should subsume this work? -2. Should it always be the package name of the contract class, or should it specify a package glob specifically? If so what happens if the package name of the contract class and the package name of the constraint don't match - is it OK if the latter is a subset of the former? -3. Indirecting through package names increases centralisation somewhat, because now the zone operator has to agree to you taking ownership of a part of the namespace. This is also a privacy leak, it may expose what apps are being used on the network. *However* what it really exposes is application *developers* and not actual apps, and the zone op doesn't get to veto specific apps once they approved an app developer. More problematically unless an additional indirection is added to the network parameters, every change to the package ownership list requires a "hard fork" acceptance of new parameters. - - -### Using X.500 names in the constraint instead of PublicKey - -We advertise a `PublicKey` (which may be a `CompositeKey`) in the constraint and *not* a set of `CordaX500Name` objects. This means that apps can be developed by entities that aren't in the network map (i.e. not a part of your zone), and it enables threshold keys, *but* the downside is there's no way to rotate or revoke a compromised key beyond adjusting the states themselves. We lose the indirection-through-identity. - -We could introduce such an indirection. This would disconnect the constraint from a particular public key. However then each zone an app is deployed to requires a new JAR signature by the creator, using a certificate issued by the zone operator. Because JARs can be signed by multiple certificates, this is OK, a JAR can be resigned N times if it's to be used in N zones. But it means that effectively zone operators get a power of veto over application developers, increasing centralisation and it increases required logistical efforts. - -In practice, as revoking on-ledger keys is not possible at the moment in Corda, changing a code signing key would require an explicit upgrade or the app to have a command that allows the constraint to be changed. \ No newline at end of file diff --git a/docs/source/design/design-docs-index.rst b/docs/source/design/design-docs-index.rst deleted file mode 100644 index fc1a6cc63c..0000000000 --- a/docs/source/design/design-docs-index.rst +++ /dev/null @@ -1,26 +0,0 @@ -Design Docs -=========== - -.. 
conditional-toctree:: - :maxdepth: 1 - :if_tag: htmlmode - - design-review-process.md - certificate-hierarchies/design.md - failure-detection-master-election/design.md - float/design.md - hadr/design.md - kafka-notary/design.md - monitoring-management/design.md - sgx-integration/design.md - reference-states/design.md - sgx-infrastructure/design.md - threat-model/corda-threat-model.md - data-model-upgrades/signature-constraints.md - data-model-upgrades/package-namespace-ownership.md - targetversion/design.md - data-model-upgrades/migrate-to-signature-constraint.md - versioning/contract-versioning.md - linear-pointer/design.md - maximus/design.md - accounts/design.md \ No newline at end of file diff --git a/docs/source/design/design-review-process.md b/docs/source/design/design-review-process.md deleted file mode 100644 index 8387eec086..0000000000 --- a/docs/source/design/design-review-process.md +++ /dev/null @@ -1,35 +0,0 @@ -# Design review process - -The Corda design review process defines a means of collaborating approving Corda design thinking in a consistent, -structured, easily accessible and open manner. - -The process has several steps: - -1. High level discussion with the community and developers on corda-dev. -2. Writing a design doc and submitting it for review via a PR to this directory. See other design docs and the - design doc template (below). -3. Respond to feedback on the github discussion. -4. You may be invited to a design review board meeting. This is a video conference in which design may be debated in - real time. Notes will be sent afterwards to corda-dev. -5. When the design is settled it will be approved and can be merged as normal. - -The following diagram illustrates the process flow: - -![Design Review Process](./designReviewProcess.png) - -At least some of the following people will take part in a DRB meeting: - -* Richard G Brown (CTO) -* James Carlyle (Chief Engineer) -* Mike Hearn (Lead Platform Engineer) -* Mark Oldfield (Lead Platform Architect) -* Jonathan Sartin (Information Security manager) -* Select external key contributors (directly involved in design process) - -The Corda Technical Advisory Committee may also be asked to review a design. - -Here's the outline of the design doc template: - -.. toctree:: - - template/design.md \ No newline at end of file diff --git a/docs/source/design/designReviewProcess.png b/docs/source/design/designReviewProcess.png deleted file mode 100644 index c694eea221..0000000000 Binary files a/docs/source/design/designReviewProcess.png and /dev/null differ diff --git a/docs/source/design/failure-detection-master-election/atomix.png b/docs/source/design/failure-detection-master-election/atomix.png deleted file mode 100644 index 9e71cdb134..0000000000 Binary files a/docs/source/design/failure-detection-master-election/atomix.png and /dev/null differ diff --git a/docs/source/design/failure-detection-master-election/design.md b/docs/source/design/failure-detection-master-election/design.md deleted file mode 100644 index 1d9a85a1ba..0000000000 --- a/docs/source/design/failure-detection-master-election/design.md +++ /dev/null @@ -1,118 +0,0 @@ -# Failure detection and master election - -.. important:: This design document describes a feature of Corda Enterprise. 
- -## Background - -Two key issues need to be resolved before Hot-Warm can be implemented: - -* Automatic failure detection (currently our Hot-Cold set-up requires a human observer to detect a failed node) -* Master election and node activation (currently done manually) - -This document proposes two solutions to the above-mentioned issues. The strengths and drawbacks of each solution are explored. - -## Constraints/Requirements - -Typical modern HA environments rely on a majority quorum of the cluster to be alive and operating normally in order to service requests. This means: - -* A cluster of 1 replica can tolerate 0 failures -* A cluster of 2 replicas can tolerate 0 failures -* A cluster of 3 replicas can tolerate 1 failure -* A cluster of 4 replicas can tolerate 1 failure -* A cluster of 5 replicas can tolerate 2 failures - -This already poses a challenge to us, as clients will most likely want to deploy the minimum possible number of R3 Corda nodes. Ideally that minimum would be 3, but a solution for only 2 nodes should be available (even if it provides a lesser degree of HA than 3, 5 or more nodes). The problem with having only two nodes in the cluster is that there is no distinction between failure and network partition. - -Users should be allowed to set a preference for which node is to be active in a hot-warm environment. This would probably be done with the help of a property (persisted in the DB in order to be changeable on the fly). This is important functionality, as users might want to have the active node on better hardware and switch to the back-ups and back as soon as possible. - -It would also be helpful for the chosen solution to not add deployment complexity. - -## Design decisions - -.. toctree:: - :maxdepth: 2 - - drb-meeting-20180131.md - -## Proposed solutions - -Based on what is needed for Hot-Warm, 1 active node and at least one passive node (started but in stand-by mode), and the constraints identified above (automatic failover with at least 2 nodes and master preference), two frameworks have been explored: Zookeeper and Atomix. Neither applies to our use cases perfectly, and both require some tinkering to solve our issues, especially the preferred master election. - -### Zookeeper - -![Zookeeper design](zookeeper.png) - -Preferred leader election - while the default algorithm does not take into account a leader preference, a custom algorithm can be implemented to suit our needs. - -Environment with 2 nodes - while this type of set-up can't distinguish between a node failure and a network partition, a workaround can be implemented by having 2 nodes and 3 Zookeeper instances (a 3rd would be needed to form a majority). - -Pros: -- Very well documented -- Widely used, hence a lot of cookbooks, recipes and solutions to all sorts of problems -- Supports custom leader election - -Cons: -- Added deployment complexity -- Bootstrapping a cluster is not very straightforward -- Too complex for our needs? - -### Atomix - -![](./atomix.png) - -Preferred leader election - cannot be implemented easily; a creative solution would be required. - -Environment with 2 nodes - using only embedded replicas, there's no solution; Atomix also comes as a standalone server which could be run outside the node as a 3rd entity to allow a quorum (see image above).
- -Pros: -- Easy to get started with -- Embedded, no added deployment complexity -- Already used partially (Atomix Catalyst) in the notary cluster - -Cons: -- Not as popular as Zookeeper, less used -- Documentation is underwhelming; no proper usage examples -- No easy way of influencing leader election; will require some creative use of Atomix functionality, either via distributed groups or other resources - -## Recommendations - -If Zookeeper is chosen, we would need to look into a solution for easy configuration and deployment (maybe Docker images). Custom leader election can be implemented by following one of the [examples](https://github.com/SainTechnologySolutions/allprogrammingtutorials/tree/master/apache-zookeeper/leader-election) available online. - -If Atomix is chosen, a solution to enforce some sort of preferred leader needs to be found. One way to do it would be to have the Corda cluster leader be a separate entity from the Atomix cluster leader. Implementing the election would then be done using the distributed resources made available by the framework. - -## Conclusions - -Whichever solution is chosen, using 2 nodes in a Hot-Warm environment is not ideal. A minimum of 3 is required to ensure proper failover. - -Almost every configuration option that these frameworks offer should be exposed through node.conf. - -We've looked into using Galera, which is currently used for the notary cluster for storing the committed state hashes. It offers multi-master read/write and certification-based replication, which is not leader based. It could be used to implement automatic failure detection and master election (similar to our current mutual exclusion). However, we found that it doesn't suit our needs because: - -- it adds to deployment complexity -- it is usable only with MySQL and the InnoDB storage engine -- we'd have to implement node failure detection and master election from scratch; in this regard both Atomix and Zookeeper are better suited - -Our preference would be Zookeeper, despite it not being as lightweight and deployment-friendly as Atomix. Its widespread use, proper documentation and the flexibility to use it not only for automatic failover and master election but also for configuration management (something we might consider moving forward) make it a better fit for our needs. \ No newline at end of file diff --git a/docs/source/design/failure-detection-master-election/drb-meeting-20180131.md b/docs/source/design/failure-detection-master-election/drb-meeting-20180131.md deleted file mode 100644 index 094fc153e4..0000000000 --- a/docs/source/design/failure-detection-master-election/drb-meeting-20180131.md +++ /dev/null @@ -1,104 +0,0 @@ -# Design Review Board Meeting Minutes - -**Date / Time:** Jan 31 2018, 11.00 - -## Attendees - -- Matthew Nesbit (MN) -- Bogdan Paunescu (BP) -- James Carlyle (JC) -- Mike Hearn (MH) -- Wawrzyniec Niewodniczanski (WN) -- Jonathan Sartin (JS) -- Gavin Thomas (GT) - - -## **Decision** - -Proceed with recommendation to use Zookeeper as the master selection solution - - -## **Primary Requirement of Design** - -- Client can run just 2 nodes, master and slave -- Current deployment model to not change significantly -- Prioritised mastering or be able to automatically elect a master.
Useful to allow clients to do rolling upgrades, or for use when a high-spec machine is used for the master -- Nice to have: use for flow sharding and soft locking - -## **Minutes** - -MN presented a high level summary of the options: -- Galera: - - Negative: does not have leader election and failover capability. - -- Atomix IO: - - Positive: does integrate into the node easily, can set up ports - - Negative: requires min 3 nodes, cannot manipulate election e.g. drop the master during rolling deployments / upgrades, cannot select the 'beefy' host for master where cost efficiencies have been used for the slave / DR, young library with limited functionality, poor documentation and examples - -- Zookeeper (recommended option): industry standard, widely used and trusted. May be able to leverage clients' incumbent Zookeeper infrastructure - - Positive: has flexibility for storage and a potential for future proofing; good permissioning capabilities; standalone cluster of Zookeeper servers allows a 2-node solution rather than 3 - - Negative: adds deployment complexity due to the need for a Zookeeper cluster split across data centers. Wrapper library choice for Zookeeper requires some analysis - -MH: predictable source of API for RAFT implementations and Zookeeper compared to Atomix. It would be better to have the master selector implemented as an abstraction - -MH: hybrid approach possible - 3rd node for oversight, i.e. 2 embedded in the node, 3rd is an observer. Zookeeper can have one node in the primary data centre, one in the secondary data centre and a 3rd as tie-breaker - -WN: why are we concerned about cost of 3 machines? MN: we're seeing / hearing clients wanting to run many nodes on one VM. Zookeeper is good for this since 1 Zookeeper cluster can serve 100+ nodes - -MH: terminology clarification required: what holds the master lock? Ideally would be good to see design thinking around split node and which bits need HA. MN: as a long-term vision, ideally have 1 database for many IDs and the flows for those IDs are load balanced. Regarding services internal to the node being suspended, this is being investigated. - -MH: regarding auto failover, in the event a database has its own perception of master and slave, how is this handled? The failure detector will need to grow or have a local-only schedule to confirm it is processing everything including connectivity between database and bus, i.e. implement a 'healthiness' concept - -MH: can you get into a situation where the node fails over but the database does not, but database traffic continues to be sent to the down node? MN: the database will go offline leading to an all-stop event. - -MH: can you have master affinity between node and database? MH: need watchdog / heartbeat solutions to confirm state of all components - -JC: how long will this solution live? MN: it will work for hot / hot flow sharding, multiple flow workers and soft locks, so this is a long-term solution. Service abstraction will be used so we are not wedded to Zookeeper; however, the abstraction work can be done later - -JC: does the implementation with Zookeeper have an impact on whether cloud or physical deployments are used? MN: it's an internal component, not part of the larger Corda network, therefore it can be either. For the customer they will have to deploy a separate Zookeeper solution, but this is the same for Atomix. - -WN: where Corda-as-a-service is being deployed with many nodes in the cloud, Zookeeper will be better suited to big providers.
- -WN: concern is the customer expects to get everything on a plate, therefore will need to be educated on how to implement Zookeeper, but this is the same for other master selection solutions. - -JC: is it possible to launch R3 Corda with a button on the Azure marketplace to commission a Zookeeper? Yes, if we can resource it. But the expectation is Zookeeper will be used by well-informed clients / implementers, so a one-click option is less relevant. - -MH: how does failover work with HSMs? - -MN: can replicate the realm so failover is trivial - -JC: how do we document Enterprise features? Publish design docs? Enterprise fact sheets? R3 Corda marketing material? Clear separation of documentation is required. GT: this is already achieved by having docs.corda.net for open source Corda and docs.corda.r3.com for enterprise R3 Corda - - -### Next Steps - -MN proposed the following steps: - -1) Determine who has experience in the team to help select the wrapper library -2) Build a container with Zookeeper for development -3) Demo hot / cold with the current R3 Corda Dev Preview release (writing a guide) -4) Turn nodes passive or active -5) Leader election -6) Failure detection and tooling -7) Edge case testing diff --git a/docs/source/design/failure-detection-master-election/zookeeper.png b/docs/source/design/failure-detection-master-election/zookeeper.png deleted file mode 100644 index 0259e85294..0000000000 Binary files a/docs/source/design/failure-detection-master-election/zookeeper.png and /dev/null differ diff --git a/docs/source/design/float/current-p2p-state.png b/docs/source/design/float/current-p2p-state.png deleted file mode 100644 index e33da2890f..0000000000 Binary files a/docs/source/design/float/current-p2p-state.png and /dev/null differ diff --git a/docs/source/design/float/decisions/drb-meeting-20171116.md b/docs/source/design/float/decisions/drb-meeting-20171116.md deleted file mode 100644 index a1271c80c0..0000000000 --- a/docs/source/design/float/decisions/drb-meeting-20171116.md +++ /dev/null @@ -1,147 +0,0 @@ -# Design Review Board Meeting Minutes - -**Date / Time:** 16/11/2017, 14:00 - -## Attendees - -- Mark Oldfield (MO) -- Matthew Nesbit (MN) -- Richard Gendal Brown (RGB) -- James Carlyle (JC) -- Mike Hearn (MH) -- Jose Coll (JoC) -- Rick Parker (RP) -- Andrey Bozhko (AB) -- Dave Hudson (DH) -- Nick Arini (NA) -- Ben Abineri (BA) -- Jonathan Sartin (JS) -- David Lee (DL) - -## Minutes - -MO opened the meeting, outlining the agenda and meeting review process, and clarifying that consensus on each design decision would be sought from RGB, JC and MH. - -MO set out ground rules for the meeting. RGB asked everyone to confirm they had read both documents; all present confirmed. - -MN outlined the motivation for a Float as responding to organisations' expectations of a 'fire break' protocol termination in the DMZ where manipulation and operation can be checked and monitored. - -The meeting was briefly interrupted by technical difficulties with the GoToMeeting conferencing system. - -MN continued to outline how the design was constrained by expected DMZ rules and influenced by currently perceived client expectations – e.g. making the float unidirectional. He gave a prelude to certain design decisions, e.g. the use of AMQP from the outset. - -MN went on to describe the target solution in detail, covering the handling of both inbound and outbound connections.
He highlighted implicit overlaps with the HA design – clustering support, queue names etc., and clarified that the local broker was not required to use AMQP. - -### [TLS termination](./ssl-termination.md) - -JC questioned where the TLS connection would terminate. MN outlined the pros and cons of termination on firewall vs. float, highlighting the consequence of float termination that access by the float to the private key was required, and that mechanisms may be needed to store that key securely. - -MH contended that the need to propagate TLS headers etc. through to the node (for reinforcing identity checks etc.) implied a need to terminate on the float. MN agreed but noted that in practice the current node design did not make much use of that feature. - -JC questioned how users would provision a TLS cert on a firewall – MN confirmed users would be able to do this themselves and were typically familiar with doing so. - -RGB highlighted the distinction between the signing key for the TLS vs. identity certificates, and that this needed to be made clear to users. MN agreed that TLS private keys could be argued to be less critical from a security perspective, particularly when revocation was enabled. - -MH noted the potential to issue sub-certs with key usage flags as an additional mitigating feature. - -RGB queried at what point in the flow a message would be regarded as trusted. MN set an expectation that the float would apply basic checks (e.g. stopping a connection talking on other topics etc.) but that subsequent sanitisation should happen in the internal trusted portion. - -RGB questioned whether the TLS key on the float could be re-used on the bridge to enable wrapped messages to be forwarded in an encrypted form – session migration. MH and MN maintained TLS forwarding could not work in that way, and this would not allow the 'fire break' requirement to inspect packets. - -RGB concluded the bridge must effectively trust the firewall or bridge on the origin of incoming messages. MN raised the possibility of SASL verification, but noted objections by MH (clumsy because of multiple handshakes etc.). - -JC queried whether SASL would allow passing of identity and hence termination at the firewall; MN confirmed this. - -MH contended that the TLS implementation was specific to Corda in several ways which may challenge implementation using firewalls, and that typical firewalls (using old OpenSSL etc.) were probably not more secure than R3's own solutions. RGB pointed out that the design was ultimately driven by client perception of security (MN: "security theatre") rather than objective assessment. MH added that implementations would be firewall-specific and not all devices would support forwarding, support for AMQP etc. - -RGB proposed messaging to clients that the option existed to terminate on the firewall if it supported the relevant requirements. - -MN re-raised the question of key management. RGB asked about the risk implied from the threat of a compromised float. MN said an attacker who compromised a float could establish TLS connections in the name of the compromised party, and could inspect and alter packets including readable business data (assuming AMQP serialisation). MH gave an example of a MITM attack where an attacker could swap in their own single-use key, allowing them to gain control of (e.g.) a cash asset; the TLS layer is the only current protection against that. - -RGB queried whether messages could be signed by senders.
MN raised the potential threat of traffic analysis, and stated E2E encryption was definitely possible but not for March-April. - -MH viewed the use-case for extra encryption as the consumer/SME market, where users would want to upload/download messages from a mailbox without needing to trust it – not the target market yet. MH maintained TLS was really strong and that assuming compromise of the float was not conceptually different from compromise of another device, e.g. the firewall. MN confirmed that use of an HSM would generally require signing on the HSM device for every session; MH observed this could be a bottleneck in the scenario of a restored node seeking to re-establish a large number of connections. It was observed that the float would still need access to a key that provisions access to the HSM, so this did not materially improve the security in a compromised float scenario. - -MH advised against offering clients support for their own firewall since it would likely require R3 effort to test support and help with customisations. - -MN described option 2b to tunnel through to the internal trusted portion of the float over a connection initiated from inside the internal network in order for the key to be loaded into memory at run-time; this would require a bit more code. - -MH advocated option 2c - just to accept risk and store on the file system – on the basis of time constraints, maintaining that TLS handshakes are complicated to code and hard to proxy. MH suggested upgrading to 2b or 2a later if needed. MH described how keys were managed at Google. - -**DECISION CONFIRMED**: Accept option 2b - Terminate on float, inject key from internal portion of the float (RGB, JC, MH agreed) - -### [E2E encryption](./e2e-encryption.md) - -DH proposed that E2E encryption would be much better but conceded the time limitations and agreed that the threat scenario of a compromised DMZ device was the same under the proposed options. MN agreed. - -MN argued for a placeholder vs. ignoring or scheduling work to build e2e encryption now. MH agreed, seeking more detailed proposals on what the placeholder was and how it would be used. - -MH queried whether e2e encryption would be done at the app level rather than the AMQP level, raising questions about what would happen on non-supporting nodes etc. - -MN highlighted the link to the AMQP serialisation work being done. - -**DECISION CONFIRMED:** Add placeholder, subject to more detailed design proposal (RGB, JC, MH agreed) - -### [AMQP vs. custom protocol](./p2p-protocol.md) - -MN described alternative options involving onion-routing etc. - -JoC questioned whether this would also allow support for load balancing; MN advised this would be too much change in direction in practice. - -MH outlined his original reasoning for AMQP (lots of e.g. manageability features, not all of which would be needed at the outset but possibly in future) vs. other options e.g. MQTT. - -MO questioned whether the broker would imply performance limitations. - -RGB argued there were two separate concerns: carrying messages from float to bridge and then bridge to node, with separate design options. - -JC proposed the decision could be deferred until later. MN pointed out changing the protocol would compromise wire stability. - -MH advocated sticking with AMQP for now and implementing a custom protocol later with suitable backwards-compatibility features when needed. - -RGB queried whether full AMQP implementation should be done in this phase. MN provided explanation.
- -**DECISION CONFIRMED:** Continue to use AMQP (RGB, JC, MH agreed) - -### [Pluggable broker prioritisation](./pluggable-broker.md) - -MN outlined arguments for deferring pluggable brokers, whilst describing how he'd go about implementing the functionality. MH agreed with prioritisation for later. - -JC queried whether broker providers could be asked to deliver the feature. AB mentioned that Solace seemed keen on working with R3 and could possibly be utilised. MH was sceptical, arguing that R3 resource would still be needed to support. - -JoC noted a distinction in scope for P2P and/or RPC. - -There was discussion of replacing the core protocol with JMS + plugins. RGB drew focus to the question of when to do so, rather than how. - -AB noted Solace have functionality with conceptual similarities to the float, and questioned to what degree the float could be considered non-core technology. MH argued the nature of Corda as a P2P network made the float pretty core to avoiding dedicated network infrastructure. - -**DECISION CONFIRMED:** Defer support for pluggable brokers until later, except in the event that a requirement to do so emerges from higher priority float / HA work. (RGB, JC, MH agreed) - -### Inbound only vs. inbound & outbound connections - -DL sought confirmation that the group was happy with the float to act as a Listener only. MN repeated the explanation of how outbound connections would be initiated through a SOCKS 4/5 proxy. No objections were raised. - -### Overall design and implementation plan - -MH requested more detailed proposals going forward on: - -1) To what degree logs from different components need to be integrated (consensus was no requirement at this stage) - -2) Bridge control protocols. - -3) Scalability of hashing network map entries to queue names - -4) Node admins' user experience – MH argued for documenting this in advance to validate the design - -5) Behaviour following termination of a remote node (retry frequency, back-off etc.)? - -6) Impact on standalone nodes (no float)? - -JC noted an R3 obligation with Microsoft to support AMQP-compliant Azure messaging. MN confirmed support for pluggable brokers should cover that. - -JC argued for documentation of procedures to be the next step as it is needed for the Project Agent Pilot phase. MH proposed sharing the advance documentation. - -JoC questioned whether the Bridge Manager locked the design to Artemis. MO highlighted the transitional elements of the design. - -RGB questioned the rationale for moving the broker out of the node. MN provided clarification. - -**DECISION CONFIRMED**: Design to proceed as discussed (RGB, JC, MH agreed) diff --git a/docs/source/design/float/decisions/e2e-encryption.md b/docs/source/design/float/decisions/e2e-encryption.md deleted file mode 100644 index debb9fd1fc..0000000000 --- a/docs/source/design/float/decisions/e2e-encryption.md +++ /dev/null @@ -1,55 +0,0 @@ -# Design Decision: End-to-end encryption - -## Background / Context - -End-to-end encryption is a desirable potential design feature for the [float](../design.md). - -## Options Analysis - -### 1. No end-to-end encryption - -#### Advantages - -1. Least effort -2. Easier to fault-find and manage - -#### Disadvantages - -1. With no placeholder, it is very hard to add support later and maintain wire stability. -2. May not get past security reviews of the Float. - -### 2. Placeholder only - -#### Advantages - -1. Allows wire stability when we have agreed an encrypted approach -2.
Shows that we are serious about security, even if this isn’t available yet. -3. Allows a later encrypted version to be an enterprise feature that can interoperate with OS versions. - -#### Disadvantages - -1. Doesn’t actually provide E2E, or define what an encrypted payload looks like. -2. Doesn’t address any crypto features that target protecting the AMQP headers. - -### 3. Implement end-to-end encryption - -#### Advantages - -1. Will protect the sensitive data fully. - -#### Disadvantages - -1. Lots of work. -2. Difficult to get right. -3. Re-inventing TLS. - -## Recommendation and justification - -Proceed with Option 2: Placeholder - -## Decision taken - -Proceed with Option 2 - Add placeholder, subject to more detailed design proposal (RGB, JC, MH agreed) - -.. toctree:: - - drb-meeting-20171116.md - diff --git a/docs/source/design/float/decisions/p2p-protocol.md b/docs/source/design/float/decisions/p2p-protocol.md deleted file mode 100644 index 618a46ecfc..0000000000 --- a/docs/source/design/float/decisions/p2p-protocol.md +++ /dev/null @@ -1,75 +0,0 @@ -# Design Decision: P2P Messaging Protocol - -## Background / Context - -Corda requires messages to be exchanged between nodes via a well-defined protocol. - -Determining this protocol is a critical upstream dependency for the design of key messaging components including the [float](../design.md). - -## Options Analysis - -### 1. Use AMQP - -Under this option, P2P messaging will follow the [Advanced Message Queuing Protocol](https://www.amqp.org/). - -#### Advantages - -1. As we have described in our marketing materials. -2. Well-defined standard. -3. Support for packet-level flow control and explicit delivery acknowledgement. -4. Will allow eventual swap-out of Artemis for other brokers. - -#### Disadvantages - -1. AMQP is a complex protocol with many layered state machines, for which it may prove hard to verify security properties. -2. No support for secure MAC in packet frames. -3. No defined encryption mode beyond creating custom payload encryption and custom headers. -4. No standardised support for queue creation/enumeration, or deletion. -5. Use of broker durable queues and autonomous bridge transfers does not align with checkpoint timing, so independent replication of the DB and Artemis data risks causing problems. (Writing to the DB doesn’t work currently and is probably also slow.) - -### 2. Develop a custom protocol - -This option would discard existing Artemis server/AMQP support for peer-to-peer communications in favour of a custom implementation of the Corda MessagingService, which takes direct responsibility for message retries and stores the pending messages into the node's database. The wire level of this service would be built on top of a fully encrypted MIX network which would not require a fully connected graph, but rather send messages on randomly selected paths over the dynamically managed network graph topology. - -Packet format would likely use the [SPHINX packet format](http://www0.cs.ucl.ac.uk/staff/G.Danezis/papers/sphinx-eprint.pdf), although with the body encryption updated to a modern AEAD scheme as in https://www.cs.ru.nl/~bmennink/pubs/16cans.pdf. In this scheme, nodes would be identified in the overlay network solely by Curve25519 public key addresses and floats would be dumb nodes that only run the MIX network code and don't act as message sources or sinks. Intermediate traffic would not be readable except by the intended waypoint and only the final node can read the payload.
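As a toy illustration only of the layered encryption just described (assuming pre-shared per-hop AES keys; the real scheme would derive them via Curve25519 key agreement as above, and SPHINX additionally fixes packet sizes to resist traffic analysis):

```kotlin
import java.security.SecureRandom
import javax.crypto.Cipher
import javax.crypto.SecretKey
import javax.crypto.spec.GCMParameterSpec

// Toy onion-wrapping sketch: the sender encrypts the payload once per hop,
// innermost layer first, so each relay can strip exactly one layer and learn
// nothing but the next hop's ciphertext.
fun wrapForRoute(payload: ByteArray, hopKeys: List<SecretKey>): ByteArray =
    hopKeys.foldRight(payload) { hopKey, inner -> encryptLayer(inner, hopKey) }

private fun encryptLayer(plaintext: ByteArray, key: SecretKey): ByteArray {
    val iv = ByteArray(12).also { SecureRandom().nextBytes(it) }
    val cipher = Cipher.getInstance("AES/GCM/NoPadding") // a modern AEAD mode, per the text above
    cipher.init(Cipher.ENCRYPT_MODE, key, GCMParameterSpec(128, iv))
    return iv + cipher.doFinal(plaintext) // prepend the IV so the hop can decrypt its layer
}
```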
- -Point-to-point links would be standard TLS and the network certificates would be whatever is acceptable to the host institutions, e.g. standard Verisign certs. It is assumed institutions would select partners to connect to that they trust and permission them individually in their firewalls. Inside the MIX network the nodes would be connected mostly in a static way and use standard HELLO packets to determine the liveness of neighbour routes, then use tunnelled gossip to distribute the signed/versioned Link topology messages. Nodes will also be allowed to advertise a public IP, so some dynamic links and publicly visible nodes would exist. Network map addresses would then be mappings from Legal Identity to these overlay network addresses, not to physical network locations. - -#### Advantages - -1. Can be defined with a very small message surface area that is amenable to security analysis. -2. Packet formats can follow best-practice cryptography from the start and be matched to Corda’s needs. -3. Doesn’t require a complete graph structure for the network if we have intermediate routing. -4. More closely aligns checkpointing and message delivery handling at the application level. - -#### Disadvantages - -1. Inconsistent with previous design statements published to external stakeholders. -2. Effort implications - starting from scratch. -3. Technical complexity in developing a P2P protocol which is attack-tolerant. - -## Recommendation and justification - -Proceed with Option 1 - -## Decision taken - -Proceed with Option 1 - Continue to use AMQP (RGB, JC, MH agreed) - -.. toctree:: - - drb-meeting-20171116.md diff --git a/docs/source/design/float/decisions/pluggable-broker.md b/docs/source/design/float/decisions/pluggable-broker.md deleted file mode 100644 index 3096360337..0000000000 --- a/docs/source/design/float/decisions/pluggable-broker.md +++ /dev/null @@ -1,62 +0,0 @@ -# Design Decision: Pluggable Broker prioritisation - -## Background / Context - -A decision on when to prioritise implementation of a pluggable broker has implications for delivery of key messaging components including the [float](../design.md). - -## Options Analysis - -### 1. Deliver pluggable brokers now - -#### Advantages - -1. Meshes with business opportunities from HPE and Solace Systems. -2. Would allow us to interface to existing Bank middleware. -3. Would allow us to switch away from Artemis if we need higher performance. -4. Makes our AMQP story stronger. - -#### Disadvantages - -1. More up-front work. -2. Might slow us down on other priorities. - -### 2. Defer development of pluggable brokers until later - -#### Advantages - -1. Still gets us where we want to go, just later. -2. Work can be progressed as resource is available, rather than right now. - -#### Disadvantages - -1. Have to take care that we have sufficient abstractions that things like CORE connections can be replaced later. -2. Leaves HPE and Solace hanging even longer. - - -### 3. Never enable pluggable brokers - -#### Advantages - -1. What we already have. - -#### Disadvantages - -1. Ties us to ArtemisMQ development speed. - -2. Not good for our relationship with HPE and Solace. - -3. Probably limits our maximum messaging performance longer term. - - -## Recommendation and justification - -Proceed with Option 2 (defer development of pluggable brokers until later) - -## Decision taken - -..
toctree:: - - drb-meeting-20171116.md - -Proceed with Option 2 - Defer support for pluggable brokers until later, except in the event that a requirement to do so emerges from higher priority float / HA work. (RGB, JC, MH agreed) diff --git a/docs/source/design/float/decisions/ssl-termination.md b/docs/source/design/float/decisions/ssl-termination.md deleted file mode 100644 index b42dd82111..0000000000 --- a/docs/source/design/float/decisions/ssl-termination.md +++ /dev/null @@ -1,91 +0,0 @@ -# Design Decision: TLS termination point - -## Background / Context - -Design of the [float](../design.md) is critically influenced by the decision of where TLS connections to the node should -be terminated. - -## Options Analysis - -### 1. Terminate TLS on Firewall - -#### Advantages - -1. Common practice for DMZ web solutions, often with an HSM associated with the Firewall and should be familiar for banks to setup. -2. Doesn’t expose our private key in the less trusted DMZ context. -3. Bugs in the firewall TLS engine will be patched frequently. -4. The DMZ float server would only require a self-signed certificate/private key to enable secure communications, so theft of this key has no impact beyond the compromised machine. - -#### Disadvantages - -1. May limit cryptography options to RSA, and prevent checking of X500 names (only the root certificate checked) - Corda certificates are not totally standard. -2. Doesn’t allow identification of the message source. -3. May require additional work and SASL support code to validate the ultimate origin of connections in the float. - -#### Variant option 1a: Include SASL connection checking - -##### Advantages - -1. Maintain authentication support -2. Can authenticate against keys held internally e.g. Legal Identity not just TLS. - -##### Disadvantages - -1. More work than the do-nothing approach -2. More protocol to design for sending across the inner firewall. - -### 2. Direct TLS Termination onto Float - -#### Advantages - -1. Validate our PKI certificates directly ourselves. -2. Allow messages to be reliably tagged with source. - -#### Disadvantages - -1. We don’t currently use the identity to check incoming packets, only for connection authentication anyway. -2. Management of Private Key a challenge requiring extra work and security implications. Options for this are presented below. - -#### Variant Option 2a: Float TLS certificate via direct HSM - -##### Advantages - -1. Key can’t be stolen (only access to signing operations) -2. Audit trail of signings. - -##### Disadvantages - -1. Accessing HSM from DMZ probably not allowed. -2. Breaks the inbound-connection-only rule of modern DMZ. - -#### Variant Option 2b: Tunnel signing requests to bridge manager - -##### Advantages - -1. No new connections involved from Float box. -2. No access to actual private key from DMZ. - -##### Disadvantages - -1. Requires implementation of a message protocol, in addition to a key provider that can be passed to the standard SSLEngine, but proxies signing requests. - -#### Variant Option 2c: Store key on local file system - -##### Advantages - -1. Simple with minimal extra code required. -2. Delegates access control to bank’s own systems. -3. Risks losing only the TLS private key, which can easily be revoked. This isn’t the legal identity key at all. - -##### Disadvantages - -1. Risks losing the TLS private key. -2. Probably not allowed. - -## Recommendation and justification - -Proceed with Variant option 1a: Terminate on firewall; include SASL connection checking. 
- -## Decision taken - -[DRB Meeting, 16/11/2017](./drb-meeting-20171116.md): Proceed with option 2b - Terminate on float, inject key from internal portion of the float (RGB, JC, MH agreed) diff --git a/docs/source/design/float/design.md b/docs/source/design/float/design.md deleted file mode 100644 index 36acc1475d..0000000000 --- a/docs/source/design/float/design.md +++ /dev/null @@ -1,256 +0,0 @@ -# Float Design - -.. important:: This design document describes a feature of Corda Enterprise. - -## Overview - -The role of the 'float' is to meet the requirements of organisations that will not allow direct incoming connections to their node, but would rather host a proxy component in a DMZ to achieve this. As such it needs to meet the requirements of modern DMZ security rules, which essentially assume that the entire machine in the DMZ may become compromised. At the same time, we expect that the Float can interoperate with directly connected nodes, possibly even those using open source Corda. - -### Background - -#### Current state of peer-to-peer messaging in Corda - -The diagram below illustrates the current mechanism for peer-to-peer messaging between Corda nodes. - -![Current P2P State](./current-p2p-state.png) - -When a flow running on a Corda node triggers a requirement to send a message to a peer node, it first checks for pre-existence of an applicable message queue for that peer. - -**If the relevant queue exists:** - -1. The node submits the message to the queue and continues after receiving acknowledgement. -2. The Core Bridge picks up the message and transfers it via a TLS socket to the inbox of the destination node. -3. A flow on the recipient receives the message from the peer and acknowledges consumption on the bus when the flow has checkpointed this progress. - -**If the queue does not exist (messaging a new peer):** - -1. The flow triggers creation of a new queue with a name encoding the identity of the intended recipient. -2. When the queue creation has completed the node sends the message to the queue. -3. The hosted Artemis server within the node has a queue creation hook which is called. -4. The queue name is used to look up the remote connection details and a new bridge is registered. -5. The client certificate of the peer is compared to the expected legal identity X500 Name. If this is OK, message flow proceeds as for a pre-existing queue (above). - -## Scope - -* Goals: - * Allow connection to a Corda node without requiring direct incoming connections from external participants. - * Allow connections to a Corda node without requiring the node itself to have a public IP address. Separate TLS connection handling from the MQ broker. -* Non-goals (out of scope): - * Support for MQ brokers other than Apache Artemis - -## Timeline -For delivery by end Q1 2018. - -## Requirements -Allow connectivity in compliance with DMZ constraints commonly imposed by modern financial institutions; namely: -1. Firewalls are required between the internet and any device in the DMZ, and between the DMZ and the internal network. -2. Data passing between the internet and the internal network via the DMZ should pass through a clear protocol break in the DMZ. -3. Only identified IPs and ports are permitted to access devices in the DMZ; this includes communications between devices co-located in the DMZ. -4. Only a limited number of ports are opened in the firewall (<5) to make firewall operation manageable. These ports must change slowly. -5.
Any DMZ machine is typically multi-homed, with separate network cards handling traffic through the institutional firewall vs. to the Internet. (There is usually a further hidden management interface card accessed via a jump box for managing the box and shipping audit trail information.) This requires that our software can bind listening ports to the correct network card, not just to 0.0.0.0. -6. No connections are to be initiated by DMZ devices towards the internal network. Communications should be initiated from the internal network to form a bidirectional channel with the proxy process. -7. No business data should be persisted on the DMZ box. -8. An audit log of all connection events is required to track breaches. Latency information should also be tracked to facilitate management of connectivity issues. -9. Processes on DMZ devices run as local accounts with no relationship to internal permission systems, or ability to enumerate devices on the internal network. -10. Communications in the DMZ should use modern TLS, often with local-only certificates/keys that hold no value outside of use in predefined links. -11. Where TLS is required to terminate on the firewall, provide a suitably secure key management mechanism (e.g. an HSM). -12. Any proxy in the DMZ should be subject to the same HA requirements as the devices it is servicing. -13. Any business data passing through the proxy should be separately encrypted, so that no data is in the clear in the program memory if the DMZ box is compromised. - -## Design Decisions - -The following design decisions fed into this design: - -.. toctree:: - :maxdepth: 2 - - decisions/p2p-protocol.md - decisions/ssl-termination.md - decisions/e2e-encryption.md - decisions/pluggable-broker.md - -## Target Solution - -The proposed solution introduces a reverse proxy component ("**float**") which may be sited in the DMZ, as illustrated in the diagram below. - -![Full Float Implementation](./full-float.png) - -The main role of the float is to forward incoming AMQP link packets from authenticated TLS links to the AMQP Bridge Manager, then echo back final delivery acknowledgements once the Bridge Manager has successfully inserted the messages. The Bridge Manager is responsible for rejecting inbound packets on queues that are not local inboxes to prevent e.g. 'cheating' messages onto management topics, faking outgoing messages etc. - -The float is linked to the internal AMQP Bridge Manager via a single AMQP/TLS connection, which can contain multiple logical AMQP links. This link is initiated at the socket level by the Bridge Manager towards the float. - -The float is a **listener only** and does not enable outgoing bridges (see Design Decisions, above). Outgoing bridge formation and message sending come directly from the internal Bridge Manager (possibly via a SOCKS 4/5 proxy, which is easy enough to enable in netty, or directly through the corporate firewall; initiating from the float gives rise to security concerns). - -The float is **not mandatory**; interoperability with older nodes, even those using direct AMQP from bridges in the node, is supported. - -**No state will be serialized on the float**, although suitably protected logs will be recorded of all float activities. - -**End-to-end encryption** of the payload is not delivered through this design (see Design Decisions, above). For current purposes, a header field indicating plaintext/encrypted payload is employed as a placeholder.
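Purely as a hypothetical sketch of what that placeholder might look like on the wire (the property name here is invented for illustration and is not a committed format), the flag could be carried as an AMQP application property via proton-j:

```kotlin
import org.apache.qpid.proton.Proton
import org.apache.qpid.proton.amqp.Binary
import org.apache.qpid.proton.amqp.messaging.ApplicationProperties
import org.apache.qpid.proton.amqp.messaging.Data

// Hypothetical: tag each message with a plaintext/encrypted marker so an
// end-to-end encrypted payload format can be introduced later without
// breaking wire stability. The property name is illustrative only.
fun taggedMessage(payload: ByteArray, encrypted: Boolean) =
    Proton.message().apply {
        applicationProperties = ApplicationProperties(mapOf("payload-encrypted" to encrypted))
        body = Data(Binary(payload))
    }
```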
- -**HA** is enabled (this should be easy as the bridge manager can choose which float to make active). Only fully connected DMZ floats should activate their listening port. - -Implementation of the float is expected to be based on existing AMQP Bridge Manager code - see Implementation Plan, below, for expected work stages. - -### Bridge control protocol - -The bridge control is designed to be as stateless as possible. Thus, nodes and bridges restarting must re-request/broadcast information to each other. Messages are sent to a 'bridge.control' address in Artemis as non-persistent messages with a non-durable queue. Each message should contain a duplicate-detection message ID, which is also re-used as the correlation id in replies. Relevant scenarios are described below (for illustration, a hypothetical sketch of these message shapes is given at the end of this document): - -#### On bridge start-up, or reconnection to Artemis -1. The bridge process should subscribe to 'bridge.control'. -2. The bridge should start sending QueueQuery messages which will contain a unique message id and an identifier for the bridge sending the message. -3. The bridge should continue to send these until at least one node replies with a matched QueueSnapshot message. -4. The QueueSnapshot message replies from the nodes contain a correlationId field set to the unique id of the QueueQuery query, or a null correlation id. The message payload is a list of inbox queue info items and a list of outbound queue info items. Each queue info item is a tuple of Legal X500 Name (as expected on the destination TLS certificates) and the queue name, which should have the form of "internal.peers."+hash key of legal identity (using the same algorithm as we use in the db to make the string). Note this queue name is a change from the current logic, but will be more portable to length-constrained topics and allow multiple inboxes on the same broker. -5. The bridge should process the QueueSnapshot, initiating links to the outgoing targets. It should also add expected inboxes to its in-bound permission list. -6. When an outgoing link is successfully formed the remote client certificate should be checked against the expected X500 name. Assuming the link is valid the bridge should subscribe to the related queue and start trying to forward the messages. - -#### On node start-up, or reconnection to Artemis -1. The node should subscribe to 'bridge.control'. -2. The node should enumerate the queues and identify which have well-known identities in the network map cache. The appropriate information about its own inboxes and any known outgoing queues should be compiled into an unsolicited QueueSnapshot message with a null correlation id. This should be broadcast to update any bridges that are running. -3. If any QueueQuery messages arrive these should be responded to with specific QueueSnapshot messages with the correlation id set. - -#### On network map updates -1. On receipt of any network map cache updates the information should be evaluated to see if any additional queues can now be mapped to a bridge. At this point a BridgeRequest packet should be sent which will contain the legal X500Name and queue name of the new update. - -#### On flow message to Peer -1. If a message is to be sent to a peer the code should (as it does now) check for queue existence in its cache and then on the broker. If it does exist it simply sends the message. -2. If the queue is not listed in its cache it should block until the queue is created (this should be safe versus race conditions with other nodes). -3.
Once the queue is created the original message and subsequent messages can now be sent. -4. In parallel a BridgeRequest packet should be sent to activate a new connection outwards. This will contain the legal X500Name and queue name of the new queue. -5. Future QueueSnapshot requests should be responded to with the new queue included in the list. - -### Behaviour with a Float portion in the DMZ - -1. On initial connection of an inbound bridge, AMQP is configured to run a SASL challenge response to (re-)validate the origin and confirm the client identity. (The most likely SASL mechanism for this is using https://tools.ietf.org/html/rfc3163 as this allows reuse of our PKI certificates in the challenge response. Potentially we could forward some bridge control messages to cover the SASL exchange to the internal Bridge Controller. This would allow us to keep the private keys internal to the organisation, so we may also require a SASLAuth message type as part of the bridge control protocol.) -2. The float restricts acceptable AMQP topics to the namespace appropriate for inbound messages only. Hence, there should be no way to tunnel messages to bridge control, or RPC topics on the bus. -3. On receipt of a message from the external network, the Float should append a header recording the source channel's X500 name, then create a Delivery for forwarding the message inwards. -4. The internal Bridge Control Manager process validates the message further to ensure that it is targeted at a legitimate inbox (i.e. not an outbound queue) and then forwards it to the bus. Once delivered to the broker, the Delivery acknowledgements are cascaded back. -5. On receiving a Delivery notification from the internal side, the Float acknowledges back the correlated original Delivery. -6. The Float should protect against excessive inbound messages by AMQP flow control and by refusing to accept excessive unacknowledged deliveries. -7. The Float only exposes its inbound server socket when activated by a valid AMQP link from the Bridge Control Manager to allow for a simple HA pool of DMZ Float processes. (Floats cannot run hot-hot as this would invalidate Corda's message ordering guarantees.) - -## Implementation plan - -### Proposed incremental steps towards a float - -1. First, I would like to more explicitly split the RPC and P2P MessagingService instances inside the Node. They can keep the same interface, but this would let us develop P2P and RPC at different rates if required. - -2. The current in-node design with Artemis Core bridges should first be replaced with an equivalent piece of code that initiates send-only bridges using an in-house wrapper over the proton-j library. Thus, the current Artemis message objects will be picked up from existing queues using the CORE protocol via an abstraction interface to allow later pluggable replacement. The specific subscribed queues are controlled as before and bridges started by the existing code path. The only difference is the bridges will be the new AMQP client code. The remote Artemis broker should accept transferred packets directly onto its own inbox queue and acknowledge receipt via standard AMQP Delivery notifications. This in turn will be acknowledged back to the Artemis Subscriber to permanently remove the message from the source Artemis queue. The headers for deduplication, address names, etc. will need to be mapped to the AMQP messages and we will have to take care about the message payload.
This should be an envelope that is capable in the future of being - end-to-end encrypted. Where possible we should stay close to the current Artemis mappings. - -3. We need to define a bridge control protocol, so that we can have an out of process float/bridge. The current process - is that on message send the node checks the target address to see if the target queue already exists. If the queue - doesn't exist it creates a new queue which includes an encoding of the PublicKey in its name. This is picked up by a - wrapper around the Artemis Server which is also hosted inside the node and can ask the network map cache for a - translation to a target host and port. This in turn allows a new bridge to be provisioned. At node restart the - re-population of the network map cache is followed to re-create the bridges to any unsent queues/messages. - -4. My proposal for a bridge control protocol is partly influenced by the fact that AMQP does not have a built-in - mechanism for queue creation/deletion/enumeration. Also, the flows cannot progress until they are sure that there is an - accepting queue. Finally, if one runs a local broker it should be fine to run multiple nodes without any bridge - processes. Therefore, I will leave the queue creation as the node's responsibility. Initially we can continue to use the - existing CORE protocol for this. The requirement to initiate a bridge will change from being implicit signalling via - server queue detection to being an explicit pub-sub message that requests bridge formation. This doesn't need - durability, or acknowledgements, because when a bridge process starts it should request a refresh of the required bridge - list. The typical create bridge messages should contain: - - 1. The queue name (ideally with the sha256 of the PublicKey, not the whole PublicKey as that may not work on brokers with queue name length constraints). - 2. The expected X500Name for the remote TLS certificate. - 3. The list of host and ports to attempt connection to. See separate section for more info. - -5. Once we have the bridge protocol in place and a bridge out of process the broker can move out of process too, which - is a requirement for clustering anyway. We can then start work on floating the bridge and making our broker pluggable. - - 1. At this point the bridge connection to the local queues should be upgraded to also be AMQP client, rather than CORE - protocol, which will give the ability for the P2P bridges to work with other broker products. - 2. An independent task is to look at making the Bridge process HA, probably using a similar hot-warm mastering solution - as the node, or atomix.io. The inactive node should track the control messages, but obviously doesn't initiate any - bridges. - 3. Another potentially parallel piece of development is to start to build a float, which is essentially just splitting - the bridge in two and putting in an intermediate hop AMQP/TLS link. The thin proxy in the DMZ zone should be as - stateless as possible in this. - 4. Finally, the node should use AMQP to talk to its local broker cluster, but this will have to remain partly tied - to Artemis, as queue creation will require sending management messages to the Artemis core, but we should be - able to abstract this. - -### Float evolution - -#### In-Process AMQP Bridging - -![In-Process AMQP Bridging](./in-process-amqp-bridging.png) - -In this phase of evolution we hook the same bridge creation code as before and use the same in-process data access to -network map cache. 
However, we now implement AMQP sender clients using proton-j and netty for the TLS layer and connection retry. This will also involve formalising the AMQP packet format of the Corda P2P protocol. Once a bridge makes a successful link to a remote node's Artemis broker it will subscribe to the associated local queue. The messages will be picked up from the local broker via an Artemis CORE consumer for simplicity of initial implementation. The queue consumer should be implemented with a simple generic interface as façade, to allow future replacement. The message will be sent across the AMQP protocol directly to the remote Artemis broker. Once acknowledgement of receipt is given with an AMQP Delivery notification the queue consumption will be acknowledged. This will remove the original item from the source queue. If delivery fails due to link loss the subscriber should be closed until a new link is established, to ensure messages are not consumed. If delivery fails for other reasons there should be some form of periodic retry over the AMQP link. For authentication checks the client cert returned from the remote server will be checked and the link dropped if it doesn't match expectations. - -#### Out of process Artemis Broker and Bridges -![Out of process Artemis Broker and Bridges](./out-of-proc-artemis-broker-bridges.png) - -Move the Artemis broker and bridge formation logic out of the node. This requires formalising the bridge creation requests, but allows clustered brokers, standardised AMQP usage and ultimately pluggable brokers. We should implement a netty socket server on the bridge and forward authenticated packets to the local Artemis broker inbound queues. An AMQP server socket is required for the float, although it should be transparent whether a NodeInfo refers to a bridge socket address or an Artemis broker. The queue names should use the SHA-256 of the PublicKey, not the full key. Also, the name should be used for in and out queues, so that multiple distinct nodes can coexist on the same broker. This will simplify development as developers just run a background broker and shouldn't need to restart it. To export the network map information and to initiate bridges a non-durable bridge control protocol will be needed (in blue). Essentially the messages declare the local queue names and target TLS link information. For in-bound messages only messages for known inbox targets will be acknowledged. It should not be hard to make the bridges active-passive HA as they contain no persisted message state and simple RPC can resync the state of the bridge. Queue creation will remain with the node as this must use non-AMQP mechanisms and because flows should be able to queue sent messages even if the bridge is temporarily down. In parallel, work can start to upgrade the local links to Artemis (i.e. the node-Artemis link and the Bridge Manager-Artemis link) to be AMQP clients as much as possible.
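To make the bridge control protocol described above more concrete, here is a hypothetical sketch of the message shapes and queue naming it implies (all names and fields are illustrative; the actual wire format was still to be settled):

```kotlin
import java.security.MessageDigest
import java.security.PublicKey

// Hypothetical shapes for the non-durable 'bridge.control' messages; the
// duplicate-detection message id doubles as the correlation id in replies.
data class QueueInfo(val legalX500Name: String, val queueName: String)
data class QueueQuery(val messageId: String, val bridgeId: String)
data class QueueSnapshot(
    val correlationId: String?, // null when broadcast unsolicited by a node
    val inboxes: List<QueueInfo>,
    val outbound: List<QueueInfo>
)
data class BridgeRequest(
    val legalX500Name: String, // expected on the remote TLS certificate
    val queueName: String,
    val targets: List<String>  // host:port candidates to attempt connection to
)

// Queue names embed a SHA-256 digest of the peer's PublicKey rather than the
// whole key, keeping them portable to brokers with name-length constraints.
fun peerQueueName(peerKey: PublicKey): String {
    val digest = MessageDigest.getInstance("SHA-256").digest(peerKey.encoded)
    return "internal.peers." + digest.joinToString("") { "%02x".format(it) }
}
```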
diff --git a/docs/source/design/float/full-float.png b/docs/source/design/float/full-float.png deleted file mode 100644 index bd7e676980..0000000000 Binary files a/docs/source/design/float/full-float.png and /dev/null differ diff --git a/docs/source/design/float/in-process-amqp-bridging.png b/docs/source/design/float/in-process-amqp-bridging.png deleted file mode 100644 index c8f309443d..0000000000 Binary files a/docs/source/design/float/in-process-amqp-bridging.png and /dev/null differ diff --git a/docs/source/design/float/out-of-proc-artemis-broker-bridges.png b/docs/source/design/float/out-of-proc-artemis-broker-bridges.png deleted file mode 100644 index f16b758050..0000000000 Binary files a/docs/source/design/float/out-of-proc-artemis-broker-bridges.png and /dev/null differ diff --git a/docs/source/design/hadr/decisions/crash-shell.md b/docs/source/design/hadr/decisions/crash-shell.md deleted file mode 100644 index e3cd0a1ae6..0000000000 --- a/docs/source/design/hadr/decisions/crash-shell.md +++ /dev/null @@ -1,50 +0,0 @@ -# Design Decision: Node starting & stopping - -## Background / Context - -The potential use of a crash shell is relevant to the high availability capabilities of nodes. - -## Options Analysis - -### 1. Use crash shell - -#### Advantages - -1. Already built into the node. -2. Potentially add custom commands. - -#### Disadvantages - -1. Won’t reliably work if the node is in an unstable state. -2. Not practical for running hundreds of nodes, as our customers are already trying to do. -3. Doesn’t mesh with the user access controls of the organisation. -4. Doesn’t interface to the existing monitoring and control systems, i.e. Nagios, Geneos ITRS, Docker Swarm, etc. - -### 2. Delegate to external tools - -#### Advantages - -1. Doesn’t require change from our customers -2. Will work even if the node is completely stuck -3. Allows scripted node restart schedules -4. Doesn’t raise questions about access control lists and audit - -#### Disadvantages - -1. More uncertainty about what customers do. -2. There might be more requirements on us to interact nicely with lots of different products. -3. Might mean we get blamed for faults in other people’s control software. -4. Doesn’t coordinate with the node for graceful shutdown. - -## Recommendation and justification - -Proceed with Option 2: Delegate to external tools - -## Decision taken - -Restarts should be handled by polite shutdown, followed by a hard clear. (RGB, JC, MH agreed) - -.. toctree:: - - drb-meeting-20171116.md diff --git a/docs/source/design/hadr/decisions/db-msg-store.md b/docs/source/design/hadr/decisions/db-msg-store.md deleted file mode 100644 index 18649eff31..0000000000 --- a/docs/source/design/hadr/decisions/db-msg-store.md +++ /dev/null @@ -1,46 +0,0 @@ -# Design Decision: Message storage - -## Background / Context - -Storage of messages by the message broker has implications for replication technologies which can be used to ensure both [high availability](../design.md) and disaster recovery of Corda nodes. - -## Options Analysis - -### 1. Storage in the file system - -#### Advantages - -1. Out-of-the-box configuration. -2. Recommended Artemis setup -3. Faster -4. Less likely to have interaction with DB Blob rules - -#### Disadvantages - -1. Unaligned capture time of journal data compared to DB checkpointing. -2. Replication options on Azure are limited.
Currently we may be forced to use the ‘Azure Files’ SMB mount, rather than the ‘Azure Data Disk’ option. This is still being evaluated. - -### 2. Storage in node database - -#### Advantages - -1. Single point of data capture and backup. -2. Consistent solution between VM and physical box solutions. - -#### Disadvantages - -1. Doesn’t work on H2 or SQL Server. From my own testing, LargeObject support is broken. The current Artemis code base does allow some pluggability, but not of the large object implementation, only of the SQL statements. We should lobby for someone to fix the implementations for SQL Server and H2. -2. Probably much slower, although this needs measuring. - -## Recommendation and justification - -Continue with Option 1: Storage in the file system - -## Decision taken - -Use storage in the file system (for now) - -.. toctree:: - - drb-meeting-20171116.md diff --git a/docs/source/design/hadr/decisions/drb-meeting-20171116.md b/docs/source/design/hadr/decisions/drb-meeting-20171116.md deleted file mode 100644 index d19ee14f71..0000000000 --- a/docs/source/design/hadr/decisions/drb-meeting-20171116.md +++ /dev/null @@ -1,118 +0,0 @@ -# Design Review Board Meeting Minutes - -**Date / Time:** 16/11/2017, 16:30 - -## Attendees - -- Mark Oldfield (MO) -- Matthew Nesbit (MN) -- Richard Gendal Brown (RGB) -- James Carlyle (JC) -- Mike Hearn (MH) -- Jose Coll (JoC) -- Rick Parker (RP) -- Andrey Bozhko (AB) -- Dave Hudson (DH) -- Nick Arini (NA) -- Ben Abineri (BA) -- Jonathan Sartin (JS) -- David Lee (DL) - -## Minutes - -The meeting re-opened following prior discussion of the float design. - -MN introduced the design for high availability, clarifying that the design did not include support for DR-implied features (asynchronous replication etc.). - -MN highlighted limitations in testability: Azure had confirmed support for geo replication but with limited control by the user and no testing facility; all R3 can do is test for impact on performance. - -The design was noted to be dependent on a lot of external dependencies for replication, with R3's testing capability limited to Azure. Agent banks may want to use SAN across dark fiber sites, redundant switches etc., which are not available to R3. - -MN noted that certain databases are not yet officially supported in Corda. - -### [Near-term-target](./near-term-target.md), [Medium-term target](./medium-term-target.md) - -Outlining the hot-cold design, MN highlighted the importance of ensuring only one node is active at one time. MN argued for having a tested hot-cold solution as a ‘backstop’. MN confirmed the work involved was to develop DB/SAN exclusion checkers and test appropriately. - -JC queried whether unknowns exist for hot-cold. MN described limitations of Azure file replication. - -JC noted there was optionality around both the replication mechanisms and the on-premises vs. cloud deployment. - -### [Message storage](./db-msg-store.md) - -Lack of support for storing Artemis messages via JDBC was raised, and the possibility for RedHat to provide an enhancement was discussed. - -MH raised the alternative of using Artemis’ inbuilt replication protocol - MN confirmed this was in scope for hot-warm, but not hot-cold. - -JC posited that file system/SAN replication should be OK for banks. - -**DECISION AGREED**: Use storage in the file system (for now) - -AB asked about protections against corruption; RGB highlighted the need for testing on this.
MH described previous testing activity, arguing for a performance cluster that repeatedly runs load tests, kills nodes, and checks they come back, etc. - -MN could not comment on the testing status of the current code. MH noted the notary hasn't been tested. - -AB queried how basic node recovery would work. MN explained, highlighting the limitation for RPC callbacks. - -JC proposed these limitations should be noted and explained to Finastra; move on. - -There was discussion of how RPC observables could be made to persist across node outages. MN argued that for most applications, a clear signal of the outage that triggered clients to resubscribe was preferable. This was agreed. - -JC argued for using Kafka. - -MN presented the Hot-warm solution as a target for March-April and provided clarifications on differences vs. hot-cold and hot-hot. - -JC highlighted that the clustered Artemis was an important intermediate step. MN highlighted other important features. - -MO noted that different banks may opt for different solutions. - -JoC raised the question of multi-IP per node. - -MN described the Hot-hot solution, highlighting that flows remained 'sticky' to a particular instance but could be picked up by another when needed. - -AB preferred the hot-hot solution. MN noted the many edge cases to be worked through. - -AB queried the DR story. MO stated this was out of scope at present. - -There was discussion of the implications of not having synchronous replication. - -MH questioned the need for a backup strategy that allows winding back the clock. MO stated this was out of scope at present. - -MO drew attention to the expectation that Corda would be considered part of larger solutions with controlled restore procedures under BCP. - -JC noted the variability in many elements as a challenge. - -MO argued for providing a 'shrink-wrapped' solution based around equipment R3 could test (e.g. Azure). - -JC argued for the need to manage testing of banks' infrastructure choices in order to reduce time to implementation. - -There was discussion around the semantic difference between HA and DR. MH argued for a definition based around rolling backups. MN and MO shared banks' view of what DR is. MH contrasted this with Google definitions. AB noted HA and DR have different SLAs. - -**DECISION AGREED:** Near-term target: Hot Cold; Medium-term target: Hot-warm (RGB, JC, MH agreed) - -RGB queried why Artemis couldn't be run in clustered mode now. MN explained. - -AB queried what Finastra asked for. MO implied nothing specific; MH maintained this would be needed anyway. - -### [Broker separation](./external-broker.md) - -MN outlined his rationale for Broker separation. - -JC queried whether this would affect demos. - -MN gave an assumption that HA was for enterprise only; RGB and JC pointed out that Enterprise might still be made available for non-production use. - -**DECISION AGREED**: The broker should only be separated if required by other features (e.g. the float), otherwise not. (RGB, JC, MH agreed). - -### [Load balancers and multi-IP](./ip-addressing.md) - -The topic was discussed. - -**DECISION AGREED**: The design can allow for optional load balancers to be implemented by clients. - -### [Crash shell](./crash-shell.md) - -MN provided an outline explanation. - -**DECISION AGREED**: Restarts should be handled by polite shutdown, followed by a hard clear.
(RGB, JC, MH agreed) - diff --git a/docs/source/design/hadr/decisions/external-broker.md b/docs/source/design/hadr/decisions/external-broker.md deleted file mode 100644 index e5c5720b01..0000000000 --- a/docs/source/design/hadr/decisions/external-broker.md +++ /dev/null @@ -1,48 +0,0 @@ -# Design Decision: Broker separation - -## Background / Context - -A decision of whether to extract the Artemis message broker as a separate component has implications for the design of -[high availability](../design.md) for nodes. - -## Options Analysis - -### 1. No change (leave broker embedded) - -#### Advantages - -1. Least change. - -#### Disadvantages - -1. Means that starting/stopping Corda is tightly coupled to starting/stopping Artemis instances. -2. Risks resource leaks from one system component affecting other components. -3. Not pluggable if we wish to have an alternative broker. - -### 2. External broker - -#### Advantages - -1. Separates concerns. -2. Allows future pluggability and standardisation on AMQP. -3. Separates life cycles of the components. -4. Makes Artemis deployment much more out of the box. -5. Allows easier tuning of VM resources for Flow processing workloads vs broker type workloads. -6. Allows a later encrypted version to be an enterprise feature that can interoperate with OS versions. - -#### Disadvantages - -1. More work. -2. Requires creating a protocol to control external bridge formation. - -## Recommendation and justification - -Proceed with Option 2: External broker - -## Decision taken - -The broker should only be separated if required by other features (e.g. the float), otherwise not. (RGB, JC, MH agreed). - -.. toctree:: - - drb-meeting-20171116.md diff --git a/docs/source/design/hadr/decisions/ip-addressing.md b/docs/source/design/hadr/decisions/ip-addressing.md deleted file mode 100644 index e8b34ab921..0000000000 --- a/docs/source/design/hadr/decisions/ip-addressing.md +++ /dev/null @@ -1,46 +0,0 @@ -# Design Decision: IP addressing mechanism (near-term) - -## Background / Context - -End-to-end encryption is a desirable potential design feature for the [high availability support](../design.md). - -## Options Analysis - -### 1. Via load balancer - -#### Advantages - -1. Standard technology in banks and on clouds, often for non-HA purposes. -2. Intended to allow us to wait for completion of network map work. - -#### Disadvantages - -1. We do need to support multiple IP address advertisements in the network map long term. -2. Might involve a small amount of code if we find Artemis doesn’t like the health probes. So far, though, testing of the Azure load balancer hasn’t needed this. -3. Won’t work over very large data centre separations, but that doesn’t work for HA/DR either. - -### 2. Via IP list in Network Map - -#### Advantages - -1. More flexible. -2. More deployment options. -3. We will need it one day. - -#### Disadvantages - -1. Have to write code to support it. -2. Configuration is more complicated, and the nodes become non-equivalent, so you can’t just copy the config to the backup. -3. Artemis has round robin and automatic failover, so we may have to expose a vendor-specific config flag in the network map. - -## Recommendation and justification - -Proceed with Option 1: Via Load Balancer - -## Decision taken - -The design can allow for optional load balancers to be implemented by clients. (RGB, JC, MH agreed) - -..
toctree:: - - drb-meeting-20171116.md diff --git a/docs/source/design/hadr/decisions/medium-term-target.md b/docs/source/design/hadr/decisions/medium-term-target.md deleted file mode 100644 index 8f7b779a95..0000000000 --- a/docs/source/design/hadr/decisions/medium-term-target.md +++ /dev/null @@ -1,49 +0,0 @@ -# Design Decision: Medium-term target for node HA - -## Background / Context - -Designing for high availability is a complex task which can only be delivered over an operationally-significant -timeline. It is therefore important to determine whether an intermediate state design (deliverable for around March -2018) is desirable as a precursor to longer term outcomes. - -## Options Analysis - -### 1. Hot-warm as interim state - -#### Advantages - -1. Simpler master/slave election logic. -2. Fewer edge cases with respect to messages being consumed by flows. -3. Naive solution of just stopping/starting the node code is simple to implement. - -#### Disadvantages - -1. Still probably requires the Artemis MQ outside of the node in a cluster. -2. May actually turn out more risky than hot-hot, because shutting down code is always prone to deadlocks and resource leakages. -3. Some work would have to be thrown away when we create a full hot-hot solution. - -### 2. Progress immediately to Hot-hot - -#### Advantages - -1. Horizontal scalability is what all our customers want. -2. It simplifies many deployments as nodes in a cluster are all equivalent. - -#### Disadvantages - -1. More complicated, especially regarding message routing. -2. Riskier to do this big-bang style. -3. Might not meet deadlines. - -## Recommendation and justification - -Proceed with Option 1: Hot-warm as interim state. - -## Decision taken - -Adopt option 1: Medium-term target: Hot Warm (RGB, JC, MH agreed) - -.. toctree:: - - drb-meeting-20171116.md - diff --git a/docs/source/design/hadr/decisions/near-term-target.md b/docs/source/design/hadr/decisions/near-term-target.md deleted file mode 100644 index 6461e18f27..0000000000 --- a/docs/source/design/hadr/decisions/near-term-target.md +++ /dev/null @@ -1,46 +0,0 @@ -# Design Decision: Near-term target for node HA - -## Background / Context - -Designing for high availability is a complex task which can only be delivered over an operationally-significant -timeline. It is therefore important to determine the target state in the near term as a precursor to longer term -outcomes. - -## Options Analysis - -### 1. No HA - -#### Advantages - -1. Reduces developer distractions. - -#### Disadvantages - -1. No backstop if we miss our targets for fuller HA. -2. No answer at all for simple DR modes. - -### 2. Hot-cold (see [HA design doc](../design.md)) - -#### Advantages - -1. Flushes out lots of basic deployment issues that will be of benefit later. -2. If stuff slips we at least have a backstop position with hot-cold. -3. For now, the only DR story we have is essentially a continuation of this mode. -4. The intent of decisions such as using a load balancer is to minimise code changes. - -#### Disadvantages - -1. Distracts from the work for more complete forms of HA. -2. Involves creating a few components that are not much use later, for instance the mutual exclusion lock. - -## Recommendation and justification - -Proceed with Option 2: Hot-cold. - -## Decision taken - -Adopt option 2: Near-term target: Hot Cold (RGB, JC, MH agreed) - -..
toctree:: - - drb-meeting-20171116.md diff --git a/docs/source/design/hadr/design.md b/docs/source/design/hadr/design.md deleted file mode 100644 index 59c787f40f..0000000000 --- a/docs/source/design/hadr/design.md +++ /dev/null @@ -1,284 +0,0 @@ -# High availability support - -.. important:: This design document describes a feature of Corda Enterprise. - -## Overview -### Background - -The term high availability (HA) is used in this document to refer to the ability to rapidly handle any single component -failure, whether due to physical issues (e.g. hard drive failure), network connectivity loss, or software faults. - -Expectations of HA in modern enterprise systems are for systems to recover normal operation in a few minutes at most, -while ensuring minimal/zero data loss. Whilst overall reliability is the overriding objective, it is desirable for Corda -to offer HA mechanisms which are both highly automated and transparent to node operators. HA mechanisms must not involve -any configuration changes that require more than an appropriate admin tool, or a simple start/stop of a process, as that -would need an Emergency Change Request. - -HA naturally grades into requirements for Disaster Recovery (DR), which requires that there is a tested procedure to -handle large scale multi-component failures, e.g. due to data centre flooding or acts of terrorism. DR processes are -permitted to involve significant manual intervention, although the complications of actually invoking a Business -Continuity Plan (BCP) mean that the less manual intervention, the more competitive Corda will be in the modern vendor -market. For modern financial institutions, maintaining comprehensive and effective BCP procedures is a legal -requirement which is generally tested at least once a year. - -However, until Corda is the system of record, or the primary system for transactions, we are unlikely to be required to -have any kind of fully automatic DR. In fact, we are likely to be restarted only once BCP has restored the most critical -systems. In contrast, typical financial institutions maintain large, complex technology landscapes in which individual -component failures can occur, such as: - -* Small scale software failures -* Mandatory data centre power cycles -* Operating system patching and restarts -* Short lived network outages -* Middleware queue build-up -* Machine failures - -Thus, HA is essential for enterprise Corda, as is providing administrators with the help necessary for rapid fault diagnosis. - -### Current node topology - -![Current (single process)](./no-ha.png) - -The current solution has a single integrated process running in one JVM including Artemis, the H2 database, the Flow State -Machine and P2P bridging. All storage is on the local file system. There is no HA capability other than manual restart of -the node following failure. - -#### Limitations - -- All sub-systems must be started and stopped together. -- Unable to handle partial failure, e.g. of Artemis. -- Artemis cannot use its in-built HA capability (clustered slave mode) as it is embedded. -- Cannot run the node with the flow state machine suspended. -- Cannot use alternative message brokers. -- Cannot run multiple nodes against the same broker. -- Cannot use alternative databases to H2. -- Cannot share the database across Corda nodes. -- RPC clients do have automatic reconnect but there is no clear solution for resynchronising on reconnect. -- The backup strategy is unclear.
- -## Requirements -### Goals - -* A logical Corda node should continue to function in the event of an individual component failure or (e.g.) restart. -* No loss, corruption or duplication of data on the ledger due to component outages. -* Ensure continuity of flows throughout any disruption. -* Support software upgrades in a live network. - -### Non-goals (out of scope for this design document) - -* Be able to distribute a node over more than two data centres. -* Be able to distribute a node between data centres that are very far apart latency-wise (unless you don't care about performance). -* Be able to tolerate arbitrary byzantine failures within a node cluster. -* DR, specifically in the case of the complete failure of a site/datacentre/cluster or region, will require a different - solution to that specified here. For now DR is only supported where performant synchronous replication is feasible, - i.e. sites only a few miles apart. - -## Timeline - -This design document outlines a range of topologies which will be enabled through progressive enhancements from the -short to long term. - -On the timescales available for the current production pilot deployments we clearly do not have time to reach the ideal -of a highly fault tolerant, horizontally scaled Corda. - -Instead, I suggest that we can achieve only the simplest state of a standby Corda installation by January 5th, and -even this is contingent on other enterprise features, such as external database and network map stabilisation, being -completed on this timescale, plus any issues raised by testing. - -For the Enterprise GA timeline, I hope that we can achieve a more fully automatic node failover state, with the Artemis -broker running as a cluster too. I include a diagram of a fully scaled Corda for completeness and so that I can discuss -what work is re-usable/throwaway. - -With regards to DR it is unclear how this would work where synchronous replication is not feasible. At this point we can -only investigate approaches as an aside to the main thrust of work for HA support. In the synchronous replication mode -it is assumed that the file and database replication can be used to ensure a cold DR backup. - -## Design Decisions - -The following design decisions are assumed by this design: - -.. toctree:: - :maxdepth: 1 - - decisions/near-term-target.md - decisions/medium-term-target.md - decisions/external-broker.md - decisions/db-msg-store.md - decisions/ip-addressing.md - decisions/crash-shell.md - -## Target Solution - -### Hot-Cold (minimum requirement) -![Hot-Cold (minimum requirement)](./hot-cold.png) - -Small scale software failures on a node are recovered from locally via restarting/re-setting the offending component by -the external (to JVM) "Health Watchdog" (HW) process. The HW process (e.g. a shell script or similar) would monitor -parameters for java processes by periodically querying them (sleep period of a few seconds). This may require the introduction of -a few monitoring 'hooks' into the Corda codebase or a "health" CorDapp the HW script can interface with. There would be -back-off logic to prevent continuous restarts in the case of persistent failure. - -We would provide a fully-functional sample HW script for Linux/Unix deployment platforms. - -The hot-cold design provides a backup VM and Corda deployment instance that can be manually started if the primary is -stopped. The failed primary must be killed to ensure it is fully stopped.
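To illustrate the watchdog behaviour sketched above, here is a minimal Kotlin version of the restart-with-back-off loop (the design suggests a shell script; the health URL and systemd unit name here are assumptions):

```kotlin
import java.net.HttpURLConnection
import java.net.URL

// Hypothetical health probe; a JMX query or a plain TCP connect would serve equally well.
fun isHealthy(url: String): Boolean = try {
    (URL(url).openConnection() as HttpURLConnection).run {
        connectTimeout = 2_000
        readTimeout = 2_000
        responseCode == 200
    }
} catch (e: Exception) {
    false
}

// Assumes the node runs under systemd as "corda-node"; adjust for the actual deployment.
fun restartNode() {
    ProcessBuilder("systemctl", "restart", "corda-node").start().waitFor()
}

fun main() {
    var backOffMs = 5_000L
    while (true) {
        Thread.sleep(backOffMs)
        if (isHealthy("http://localhost:8080/health")) {
            backOffMs = 5_000L                          // healthy: reset the back-off window
        } else {
            restartNode()
            backOffMs = minOf(backOffMs * 2, 300_000L)  // double, capped at 5 minutes
        }
    }
}
```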
- -For single-node deployment scenarios the simplest supported way to recover from failures is to re-start the entire set -of Corda Node processes or reboot the node OS. - -For a 2-node HA deployment scenario a load balancer determines which node is active and routes traffic to that node. The -load balancer will need to monitor the health of the primary and secondary nodes and automatically route traffic from -the public IP address to the only active end-point. An external solution is required for the load balancer and health -monitor. In the case of Azure cloud deployments, no custom code needs to be developed to support the health monitor. - -An additional component will be written to prevent accidental dual running, which is likely to make use of a database -heartbeat table. Code size should be minimal. - -#### Advantages - -- This approach minimises the need for new code so can be deployed quickly. -- Use of a load balancer in the short term avoids the need for new code and configuration management to support the alternative approach of multiple advertised addresses for a single legal identity. -- Configuration of the inactive mode should be a simple mirror of the primary. -- Assumes external monitoring and management of the nodes, e.g. the ability to identify node failure, so that Corda watchdog code will not be required (customer developed). - -#### Limitations - -- Slow failover as this is manually controlled. -- Requires external solutions for replication of database and Artemis journal data. -- Replication mechanism on agent banks with real servers not tested. -- Replication mechanism on Azure is under test but may prove to be too slow. -- Compatibility with external load balancers not tested. Only Azure configuration tested. -- Contingent on completion of database support and testing of replication. -- Failure of database (loss of connection) may not be supported or may require additional code. -- RPC clients assumed to make short-lived RPC requests, e.g. from a REST server, so no support for long term clients operating across failover. -- Replication time points of the database and Artemis message data are independent and may not fully synchronise (may work, subject to testing). -- Health reporting and process controls need to be developed by the customer. - -### Hot-Warm (Medium-term solution) -![Hot-Warm (Medium-term solution)](./hot-warm.png) - -Hot-warm aims to automate failover and provide failover of individual major components, e.g. Artemis. - -It involves two key changes to the hot-cold design: -1) Separation and clustering of the Artemis broker. -2) Start and stop of flow processing without JVM exit. - -The consequences of these changes are that peer to peer bridging is separated from the node and a bridge control -protocol must be developed. A leader election component is a precursor to load balancing – likely to be a combination -of custom code and a standard library and, in the short term, is likely to be via the database. Cleaner handling of -disconnects from the external components (Artemis and the database) will also be needed. - -#### Advantages - -- Faster failover as no manual intervention. -- We can use the Artemis replication protocol to replicate the message store. -- The approach is integrated with preliminary steps for the float. -- Able to handle loss of network connectivity to the database from one node. -- Extraction of the Artemis server allows a more standard Artemis deployment.
-- Prevents resource leakage in Artemis or the node from affecting the other component. -- VMs can be tuned to address different work load patterns of broker and node. -- Bridge work allows a chance to support multiple IP addresses without a load balancer. - -#### Limitations - -- This approach will require careful testing of resource management on partial shutdown. -- No horizontal scaling support. -- Deployment of master and slave may not be completely symmetric. -- Care must be taken with upgrades to ensure master/slave election operates across updates. -- Artemis clustering does require a designated master at start-up of its cluster, hence any restart involving changing - the primary node will require configuration management. -- The development effort is much more significant than the hot-cold configuration. - -### Hot-Hot (Long-term strategic solution) -![Hot-Hot (Long-term strategic solution)](./hot-hot.png) - -In this configuration, all nodes are actively processing work and share a clustered database. A mechanism for sharding -or distributing the work load will need to be developed. - -#### Advantages - -- Faster failover as flows are picked up by other active nodes. -- Rapid scaling by adding additional nodes. -- Node deployment is symmetric. -- Any broker that can support AMQP can be used. -- RPC can gracefully handle failover because responsibility for the flow can be migrated across nodes without the client being aware. - -#### Limitations - -- Very significant work with many edge cases during failure. -- Will require handling of more states than just checkpoints, e.g. soft locks and RPC subscriptions. -- Single flows will not be active on multiple nodes without future development work. - -## Implementation plan - -### Transitioning from Corda 2.0 to Manually Activated HA - -The current Corda is built to run as a fully contained single process with the Flow logic, H2 database and Artemis -broker all bundled together. This limits the options for automatic replication, or for handling subsystem failure. Thus, we must use -external mechanisms to replicate the data in the case of failure. We also should ensure that accidental dual start is -not possible in case of mistakes, or slow shutdown of the primary. - -Based on this situation, I suggest the following minimum development tasks are required for a tested HA deployment: - -1. Complete and merge JDBC support for an external clustered database. Azure SQL Server has been identified as the most - likely initial deployment. With this we should be able to point at an HA database instance for Ledger and Checkpoint data. -2. I am suggesting that for the near term we just use the Azure Load Balancer to hide the multiple machine addresses. - This does require allowing a health monitoring link to the Artemis broker, but so far testing indicates that this - operates without issue. Longer term we need to ensure that the network map and configuration support exists for the - system to work with multiple TCP/IP endpoints advertised to external nodes. Ideally this should be rolled into the - work for AMQP bridges and Floats. -3. Implement a very simple mutual exclusion feature, so that an enterprise node cannot start if another is running against - the same database. This can be via a simple heartbeat update in the database, or possibly some other library. This - feature should be enabled only when specified by configuration. -4. The replication of the Artemis Message Queues will have to be via an external mechanism.
On Azure we believe that the - only practical solution is the 'Azure Files' approach which maps a virtual Samba drive. We are testing this in case it - is too slow to work. The mounting of separate Data Disks is possible, but they can only be mounted to one VM at a - time, so they would not be compatible with the goal of no change requests for HA. -5. Improve health monitoring to better indicate fault failure. Extending the existing JMX and logging support should - achieve this, although we probably need to create a watchdog CorDapp that verifies that the State Machine and Artemis - messaging are able to process new work and to monitor flow latency. -6. Test the checkpointing mechanism and confirm that failures don't corrupt the data by deploying an HA setup on Azure - and driving flows through the system as we stop the node randomly and switch to the other node. If this reveals any - issues we will have to fix them. -7. Confirm that the behaviour of the RPC Client API is stable through these restarts, from the perspective of a stateless - REST server calling through to RPC. The RPC API should provide positive feedback to the application, so that it can - respond in a controlled fashion when disconnected. -8. Work on flow hospital tools where needed. - -### Moving Towards Automatic Failover HA - -To move towards more automatic failover handling we need to ensure that the node can be partially active, i.e. live -monitoring the health status and perhaps keeping major data structures in sync for faster activation, but not actually -processing flows. This needs to be reversible without leakage or destabilising the node, as it is common to use manually -driven master changes to help with software upgrades and to carry out regular node shutdown and maintenance. Also, to -reduce the risks associated with the uncoupled replication of the Artemis message data and the database I would -recommend that we move the Artemis broker out of the node to allow us to create a failover cluster. This is also in line -with the goal of creating AMQP bridges and Floats. - -To this end I would suggest packages of work that include: - -1. Move the broker out of the node, which will require having a protocol that can be used to signal bridge creation and - which decouples the network map. This is in line with the Flow work anyway. -2. Create a mastering solution, probably using Atomix.IO, although this might require a solution with a minimum of three - nodes to avoid split brain issues. Ideally this service should be extensible in the future to lead towards an eventual - state with Flow level sharding. Alternatively, we may be able to add a quick enterprise adaptor to ZooKeeper as - master selector if time is tight. This will inevitably impact upon configuration and deployment support. -3. Test the leakage when we repeatedly start-stop the Node class and fix any resource leaks, or deadlocks that occur at shutdown. -4. Switch the Artemis client code to be able to use the HA mode connection type and thus take advantage of the rapid - failover code. Also, ensure that we can support multiple public IP addresses reported in the network map. -5. Implement proper detection and handling of disconnect from the external database and/or Artemis broker, which should - immediately drop the master status of the node and flush any incomplete flows. -6. We should start looking at how to make RPC proxies recover from disconnect/failover, although this is probably not a - top priority.
However, it would be good to capture the missed results of completed flows and ensure the API allows - clients to unregister/re-register Observables. - -## The Future - -Hopefully, most of the work from the automatic failover mode can be adapted when we move to a full hot-hot sharding of -flows across nodes. The mastering solution will need to be modified to negotiate finer grained claims on individual -flows, rather than stopping the whole node. Also, the routing of messages will have to be thought about so that they -go to the correct node for processing, but fail over if the node dies. However, most of the other health monitoring and -operational aspects should be reusable. - -We also need to look at DR issues and in particular how we might handle asynchronous replication and possibly -alternative recovery/reconciliation mechanisms. diff --git a/docs/source/design/hadr/hot-cold.png b/docs/source/design/hadr/hot-cold.png deleted file mode 100644 index 4ef4a54de7..0000000000 Binary files a/docs/source/design/hadr/hot-cold.png and /dev/null differ diff --git a/docs/source/design/hadr/hot-hot.png b/docs/source/design/hadr/hot-hot.png deleted file mode 100644 index 7411d04239..0000000000 Binary files a/docs/source/design/hadr/hot-hot.png and /dev/null differ diff --git a/docs/source/design/hadr/hot-warm.png b/docs/source/design/hadr/hot-warm.png deleted file mode 100644 index c8be0e67d0..0000000000 Binary files a/docs/source/design/hadr/hot-warm.png and /dev/null differ diff --git a/docs/source/design/hadr/no-ha.png b/docs/source/design/hadr/no-ha.png deleted file mode 100644 index 18e14f4281..0000000000 Binary files a/docs/source/design/hadr/no-ha.png and /dev/null differ diff --git a/docs/source/design/kafka-notary/decisions/index-storage.md b/docs/source/design/kafka-notary/decisions/index-storage.md deleted file mode 100644 index 008025aeae..0000000000 --- a/docs/source/design/kafka-notary/decisions/index-storage.md +++ /dev/null @@ -1,50 +0,0 @@ -# Design Decision: Storage engine for committed state index - -## Background / Context - -The storage engine for the committed state index needs to support a single operation: "insert all values with unique -keys, or abort if any key conflict found". A wide range of solutions could be used for that, from embedded key-value -stores to full-fledged relational databases. However, since we don't need any extra features an RDBMS provides over a -simple key-value store, we'll only consider lightweight embedded solutions to avoid extra operational costs. - -Most RDBMSs are also generally optimised for read performance (using B-tree based storage engines like InnoDB and MyISAM). -Our workload is write-heavy and uses "random" primary keys (state references), which leads to particularly poor write -performance for those types of engines – as we have seen with our Galera-based notary service. One exception is the -MyRocks storage engine, which is based on RocksDB, can handle write workloads well, and is supported by Percona -Server and MariaDB. It is easier, however, to just use RocksDB directly. - -## Options Analysis - -### A. RocksDB - -An embedded key-value store based on log-structured merge-trees (LSM). It's highly configurable and provides lots of -options for performance tuning, e.g. it can be tuned to run on different hardware – flash, hard disks or -entirely in-memory. - -### B. LMDB - -An embedded key-value store using B+ trees, with ACID semantics and support for transactions. - -### C.
MapDB - -An embedded Java database engine, providing persistent collection implementations. Uses memory mapped files. Simple to -use, implements Java collection interfaces. Provides a HashMap implementation that we can use for storing committed -states. - -### D. MVStore - -An embedded log structured key-value store. Provides a simple persistent map abstraction. Supports multiple map -implementations (B-tree, R-tree, concurrent B-tree). - -## Recommendation and justification - -Performance test results when running on a MacBook Pro with Intel Core i7-4980HQ CPU @ 2.80GHz, 16 GB RAM, SSD: - -![Comparison](../images/store-comparison.png) - -Multiple tests were run with a varying number of transactions and input states per transaction: "1m x 1" denotes a million -transactions with one input state. - -Proceed with Option A, as RocksDB provides the most tuning options and achieves by far the best write performance. - -Note that the index storage engine can be replaced in the future with minimal changes required on the notary service. \ No newline at end of file diff --git a/docs/source/design/kafka-notary/decisions/replicated-storage.md b/docs/source/design/kafka-notary/decisions/replicated-storage.md deleted file mode 100644 index b0fd0f24b7..0000000000 --- a/docs/source/design/kafka-notary/decisions/replicated-storage.md +++ /dev/null @@ -1,144 +0,0 @@ -# Design Decision: Replication framework - -## Background / Context - -Multiple libraries/platforms exist for implementing fault-tolerant systems. In existing CFT notary implementations we -experimented with using a traditional relational database with active replication, as well as a pure state machine -replication approach based on CFT consensus algorithms. - -## Options Analysis - -### A. Atomix - -*Raft-based fault-tolerant distributed coordination framework.* - -Our first CFT notary implementation was based on Atomix. Atomix can be easily embedded into a Corda node and -provides abstractions for implementing custom replicated state machines. In our case the state machine manages committed -Corda contract states. When notarisation requests are sent to Atomix, they get forwarded to the leader node. The leader -persists the request to a log, and replicates it to all followers. Once the majority of followers acknowledge receipt, -it applies the request to the user-defined state machine. In our case we commit all input states in the request to a -JDBC-backed map, or return an error if conflicts occur. - -#### Advantages - -1. Lightweight, easy to integrate – embeds into the Corda node. -2. Uses Raft for replication – simpler and requires less code than other algorithms like Paxos. - -#### Disadvantages - -1. Not designed for storing large datasets. State is expected to be maintained in memory only. On restart, each replica re-reads the entire command log to reconstruct the state. This behaviour is not configurable and would require code changes. -2. Does not support batching, not optimised for performance. -3. Since version 2.0, only supports snapshot replication. This means that each replica has to periodically dump the entire commit log to disk, and replicas that fall behind have to download the _entire_ snapshot. -4. Limited tooling. - -### B. Permazen - -*Java persistence layer with a built-in Raft-based replicated key-value store.* - -Conceptually similar to Atomix, but persists the state machine instead of the request log. Built around an abstract -persistent key-value store: requests get cleaned up after replication and processing.
- -#### Advantages - -1. Lightweight, easy to integrate – embeds into the Corda node. -2. Uses Raft for replication – simpler and requires less code than other algorithms like Paxos. -3. Built around an (optionally) persistent key-value store – supports large datasets. - -#### Disadvantages - -1. Maintained by a single developer, used by a single company in production. Code quality and documentation look to be of a high standard though. -2. Not tested with large datasets. -3. Designed for read-write-delete workloads. Replicas that fall behind too much will have to download the entire state snapshot (similar to Atomix). -4. Does not support batching, not optimised for performance. -5. Limited tooling. - -### C. Apache Kafka - -*Paxos-based distributed streaming platform.* - -Atomix and Permazen implement both the replicated request log and the state machine, but Kafka only provides the log -component. In theory that means more complexity having to implement request log processing and state machine management, -but for our use case it's fairly straightforward: consume requests and insert input states into a database, marking the -position of the last processed request. If the database is lost, we can just replay the log from the beginning. The main -benefit of this approach is that it gives more granular control and performance tuning opportunities in different -parts of the system. - -#### Advantages - -1. Stable – used in production for many years. -2. Optimised for performance. Provides multiple configuration options for performance tuning. -3. Designed for managing large datasets (performance not affected by dataset size). - -#### Disadvantages - -1. Relatively complex to set up and operate, requires a Zookeeper cluster. Note that some hosting providers offer Kafka as-a-service (e.g. Confluent Cloud), so we could delegate the setup and management. -2. Dictates a more complex notary service architecture. - -### D. Custom Raft-based implementation - -For even more granular control, we could replace Kafka with our own replicated log implementation. Kafka was started -before the Raft consensus algorithm was introduced, and is using Zookeeper for coordination, which is based on Paxos for -consensus. Paxos is known to be complex to understand and implement, and the main driver behind Raft was to create a -much simpler algorithm with equivalent functionality. Hence, while reimplementing Zookeeper would be an onerous task, -building a Raft-based alternative from scratch is somewhat feasible. - -#### Advantages - -Most of the implementations above have many extra features our use-case does not require. We can implement a relatively -simple, clean, optimised solution that will most likely outperform others (Thomas Schroeter already built a prototype). - -#### Disadvantages - -Large effort required to make it highly performant and reliable. - -### E. Galera - -*Synchronous replication plugin for MySQL, uses certification-based replication.* - -All of the options discussed so far were based on abstract state machine replication. Another approach is simply using a -more traditional RDBMS with active replication support. Note that most relational databases support some form of -replication in general; however, very few provide strong consistency guarantees and ensure no data loss. Galera is a -plugin for MySQL enabling synchronous multi-master replication.
- -Galera uses certification-based replication, which operates on write-sets: a database server executes the (database) -transaction, and only performs replication if the transaction requires write operations. If it does, the transaction is -broadcast to all other servers (using atomic broadcast). On delivery, each server executes a deterministic -certification phase, which decides if the transaction can commit or must abort. If a conflict occurs, the entire cluster -rolls back the transaction. This type of technique is quite efficient in low-conflict situations and allows read scaling -(the latter is mostly irrelevant for our use case). - -#### Advantages - -1. Very little code required on the Corda side to implement. -2. Stable – used in production for many years. -3. Large tooling and support ecosystem. - -#### Disadvantages - -1. Certification-based replication is based on database transactions. A replication round is performed on every transaction commit, and batching is not supported. To improve performance, we need to combine the committing of multiple Corda transactions into a single database transaction, which gets complicated when conflicts occur. -2. Only supports the InnoDB storage engine, which is based on B-trees. It works well for reads, but performs _very_ poorly on write-intensive workloads with "random" primary keys. In tests we were only able to achieve up to 60 TPS throughput. Moreover, the performance steadily drops as more data is added. - -### F. CockroachDB - -*Distributed SQL database built on a transactional and strongly-consistent key-value store. Uses Raft-based replication.* - -On paper, CockroachDB looks like a great candidate, but it relies on sharding: data is automatically split into -partitions, and each partition is replicated using Raft. It performs well for single-shard database transactions, and -also natively supports cross-shard atomic commits. However, the majority of Corda transactions are likely to have more -than one input state, which means that most transaction commits will require cross-shard database transactions. In our -tests we were only able to achieve up to 30 TPS in a 3 DC deployment. - -#### Advantages - -1. Scales very well horizontally by sharding data. -2. Easy to set up and operate. - -#### Disadvantages - -1. Cross-shard atomic commits are slow. Since we expect most transactions to contain more than one input state, each transaction commit will very likely span multiple shards. -2. Fairly new, limited use in production so far. - -## Recommendation and justification - -Proceed with Option C. A Kafka-based solution strikes the best balance between performance and the required effort to -build a production-ready solution. \ No newline at end of file diff --git a/docs/source/design/kafka-notary/design.md b/docs/source/design/kafka-notary/design.md deleted file mode 100644 index 50d85de69f..0000000000 --- a/docs/source/design/kafka-notary/design.md +++ /dev/null @@ -1,236 +0,0 @@ -# High Performance CFT Notary Service - -.. important:: This design document describes a prototyped but not shipped feature of Corda Enterprise. There are presently no plans to ship this notary. - -## Overview - -This proposal describes the architecture and an implementation for a high performance crash fault-tolerant notary -service, operated by a single party. - -## Background - -For initial deployments, we expect to operate a single non-validating CFT notary service.
The current Raft and Galera -implementations cannot handle more than 100-200 TPS, which is likely to be a serious bottleneck in the near future. To -support our clients and compete with other platforms, we need a notary service that can handle throughput in the order of -1,000s of TPS. - -## Scope - -Goals: - -- A CFT non-validating notary service that can handle more than 1,000 TPS. Stretch goal: 10,000 TPS. -- Disaster recovery strategy and tooling. -- Deployment strategy. - -Out-of-scope: - -- Validating notary service. -- Byzantine fault-tolerance. - -## Requirements - -The notary service should be able to: - -- Notarise more than 1,000 transactions per second, with an average of 4 inputs per transaction. -- Notarise a single transaction within 1s (from the service perspective). -- Tolerate a single node crash without affecting service availability. -- Tolerate single data center failure. -- Tolerate single disk failure/corruption. - - -## Design Decisions - -.. toctree:: - :maxdepth: 2 - - decisions/replicated-storage.md - decisions/index-storage.md - -## Target Solution - -Having explored different solutions for implementing notaries, we propose the following architecture for a CFT notary, -consisting of two components: - -1. A central replicated request log, which orders and stores all notarisation requests. Efficient append-only log - storage can be used along with batched replication, making performance mainly dependent on network throughput. -2. Worker nodes that service clients and maintain a consumed state index. The state index is a simple key-value store - containing committed state references and pointers to the corresponding request positions in the log. If lost, it can be - reconstructed by replaying and applying request log entries. There is a range of fast key-value stores that can be used - for implementation. - -![High level architecture](./images/high-level.svg) - -At a high level, client notarisation requests first get forwarded to a central replicated request log. The requests are -then applied in order to the consumed state index in each worker to verify input state uniqueness. Each individual -request outcome (success/conflict) is then sent back to the initiating client by the worker responsible for it. To -emphasise, each worker will process _all_ notarisation requests, but only respond to the ones it received directly. - -Messages (requests) in the request log are persisted and retained forever. The state index has a relatively low -footprint and can in theory be kept entirely in memory. However, when a worker crashes, replaying the log to recover the -index may take too long depending on the SLAs. Additionally, we expect applying the requests to the index to be much -faster than consuming request batches even with persistence enabled. - -_Technically_, the request log can also be kept entirely in memory, and the cluster will still be able to tolerate up to -$f < n/2$ node failures. However, if for some reason the entire cluster is shut down (e.g. administrator error), all -requests will be forever lost! Therefore, we should avoid it. - -The request log does not need to be a separate cluster, and the worker nodes _could_ maintain the request log replicas -locally. This would allow workers to consume ordered requests from the local copy rather than from a leader node across -the network. It is hard to say, however, if this would have a significant performance impact without performing tests in -the specific network environment (e.g. the bottleneck could be the replication step).
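The worker-side apply loop can be sketched as follows; the types are simplified stand-ins for Corda's `StateRef` and `SecureHash`, and the real prototype kept this index in a persistent key-value store rather than in memory:

```kotlin
data class StateRef(val txhash: String, val index: Int)
data class NotarisationRecord(val txId: String, val inputs: List<StateRef>)

class CommittedStateIndex {
    // Input state -> (spending transaction id, request position in the log).
    private val committed = HashMap<StateRef, Pair<String, Long>>()
    private var lastAppliedPosition = -1L

    /** Applies one ordered log record: commits all inputs, or returns the conflicting entries. */
    fun apply(record: NotarisationRecord, position: Long): Map<StateRef, String> {
        if (position <= lastAppliedPosition) return emptyMap()   // replayed record: idempotent no-op
        val conflicts = record.inputs
            .mapNotNull { ref -> committed[ref]?.let { ref to it.first } }
            .filter { (_, txId) -> txId != record.txId }         // retries of the same transaction are benign
            .toMap()
        if (conflicts.isEmpty()) {
            record.inputs.forEach { committed[it] = record.txId to position }
        }
        lastAppliedPosition = position
        return conflicts
    }
}
```

Tracking `lastAppliedPosition` alongside the index is what allows a restarted worker to rewind the log to the right place, as described in the fault tolerance discussion below.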
- -One advantage of hosting the request log in a separate cluster is that it makes it easier to independently scale the -number of worker nodes. If, for example, transaction validation and resolution is required when receiving a -notarisation request, we might find that a significant number of receivers is required to generate enough incoming -traffic to the request log. On the flip side, increasing the number of workers adds additional consumers and load on the -request log, so a balance needs to be found. - -## Design Decisions - -As the design decision documents below discuss, the most suitable platform for managing the request log was chosen to be -[Apache Kafka](https://kafka.apache.org/), and [RocksDB](http://rocksdb.org/) as the storage engine for the committed -state index. - -| Heading | Recommendation | -| ---------------------------------------- | -------------- | -| [Replication framework](decisions/replicated-storage.md) | Option C | -| [Index storage engine](decisions/index-storage.md) | Option A | - -TECHNICAL DESIGN ---- - -## Functional - -A Kafka-based notary service does not deviate much from the high-level target solution architecture as described above. - -![Kafka overview](./images/kafka-high-level.svg) - -For our purposes we can view Kafka as a replicated durable queue we can push messages (_records_) to and consume from. -Consuming a record just increments the consumer's position pointer, and does not delete it. Old records eventually -expire and get cleaned up, but the expiry time can be set to "indefinite" so all data is retained (it's a supported -use-case). - -The main caveat is that Kafka does not allow consuming records from replicas directly – all communication has to be -routed via a single leader node. - -In Kafka, logical queues are called _topics_. Each topic can be split into multiple partitions. Topics are assigned a -_replication factor_, which specifies how many replicas Kafka should create for each partition. Each replicated -partition has an assigned leader node which producers and consumers can connect to. Partitioning topics and evenly -distributing partition leadership allows Kafka to scale well horizontally. - -In our use-case, however, we can only use a single-partition topic for notarisation requests, which limits the total -capacity and throughput to a single machine. Partitioning requests would break global transaction ordering guarantees -for consumers. There is a [proposal](#kafka-throughput-scaling-via-partitioning) from Rick Parker on how we _could_ use -partitioning to potentially avoid traffic contention on the single leader node. - -### Data model - -Each record stored in the Kafka topic contains: -1. Transaction Id -2. List of input state references -3. Requesting party X.500 name -4. Notarisation request signature - -The committed state index contains a map of: - -`Input state reference: StateRef -> ( Transaction Id: SecureHash, Kafka record position: Long )` - -It also stores a special key-value pair denoting the position of the last applied Kafka record. - -## Non-Functional - -### Fault tolerance, durability and consistency guarantees - -Let's have a closer look at what exactly happens when a client sends a notarisation request to a notary worker node. - -![Sequence diagram](./images/steps.svg) - -A small note on terminology: the "notary service" we refer to in this section is the internal long-running service in the Corda node. - -1. Client sends a notarisation request to the chosen Worker node.
The load balancing is handled on the client by Artemis (round-robin). -2. Worker acknowledges receipt and starts the service flow. The flow validates the request: verifies the transaction if needed, validates the timestamp and notarisation request signature. The flow then forwards the request to the notary service, and suspends waiting for a response. -3. The notary service wraps the request in a Kafka record and sends it to the global log via a Kafka producer. The sends are asynchronous from the service's perspective, and the producer is configured to buffer records and perform sends in batches. -4. The Kafka leader node responsible for the topic partition replicates the received records to followers. The producer also specifies "ack" settings, which control when the records are considered to be committed. Only committed records are available for consumers. Using the "all" setting ensures that the records are persisted to all replicas before they are available for consumption. **This ensures that no worker will consume a record that may later be lost if the Kafka leader crashes**. -7. The notary service maintains a separate thread that continuously attempts to pull new available batches of records from the Kafka leader node. It processes the received batches of notarisation requests – commits input states to a local persistent key-value store. Once a batch is processed, the last record position in the Kafka partition is also persisted locally. On restart, the consumption of records is started from the last recorded position. -9. Kafka also tracks consumer positions in Zookeeper, and provides the ability for consumers to commit the last consumed position either synchronously, or asynchronously. Since we don't require exactly-once delivery semantics, we opt for asynchronous position commits for performance reasons. -10. Once notarisation requests are processed, the notary service matches them against ones received by this particular worker node, and resumes the flows to send responses back to the clients. - -Now let's consider the possible failure scenarios and how they are handled: -* 2: Worker fails to acknowledge the request. The Artemis broker on the client will redirect the message to a different worker node. -* 3: Worker fails right after acknowledging the request, and nothing is sent to the Kafka request log. Without some heartbeat mechanism the client can't know if the worker has failed, or the request is simply taking a long time to process. For this reason clients have special logic to retry notarisation requests with different workers, if a response is not received before a specified timeout. -* 4: Kafka leader fails before replicating records. The producer does not receive an ack and the batch send fails. A new leader is elected and all producers and consumers switch to it. The producer retries sending with the new leader (it has to be configured to auto-retry). The lost records were not considered to be committed and therefore not made available for any consumers. Even if the producer did not re-send the batch to the new leader, client retries would fire and the requests would be reinserted into the "pipeline". -* 7: The worker fails after sending out a batch of requests. The requests will be replicated and processed by other worker nodes. However, other workers will not send back replies to clients that the failed worker was responsible for. - The client will retry with another worker.
That worker will have already processed the same request, and committing the input states will result in a conflict. Since the conflict is caused by the same Corda transaction, it will ignore it and send back a successful response. -* 8: The worker fails right after consuming a record batch. The consumer position is not recorded anywhere so it would re-consume the batch once it's back up again. -* 9: The worker fails right after committing input states, but before recording the last processed record position. On restart, it will re-consume the last batch of requests it had already processed. Committing input states is idempotent so re-processing the same request will succeed. Committing the consumer position to Kafka is strictly speaking not needed in our case, since we maintain it locally and manually "rewind" the partition to the last processed position on startup. -* 10: The worker fails just before sending back a response. The client will retry with another worker. - -The above discussion only considers crash failures which don't lead to data loss. What happens if the crash also results in disk corruption/failure? -* If a Kafka leader node fails and loses all data, the machine can be re-provisioned, and the Kafka node will reconnect to the cluster and automatically synchronise all data from one of the replicas. It can only become a leader again once it fully catches up. -* If a worker node fails and loses all data, it can replay the Kafka partition from the beginning to reconstruct the committed state index. To speed this up, periodical backups can be taken so the index can be restored from a more recent snapshot. - -One open question is flow handling on the worker node. If the notary service flow is checkpointed and the worker crashes while the flow is suspended and waiting for a response (the completion of a future), on restart the flow will re-issue the request to the notary service. The service will in turn forward it to the request log (Kafka) for processing. If the worker node was down long enough for the client to retry the request with a different worker, a single notarisation request will get processed 3 times. - -If the notary service flow is not checkpointed, the request won't be re-issued after restart, resulting in it being processed only twice. However, in the latter case, the client will need to wait for the entire duration until the timeout expires, and if the worker is down for only a couple of seconds, the first approach would result in a much faster response time. - -### Performance - -Kafka provides various configuration parameters that allow control over producer and consumer record batch size, compression, buffer size, ack synchrony and other aspects. There are also guidelines on optimal filesystem setup. - -RocksDB is highly tunable as well, providing different table format implementations, compression, bloom filters, compaction styles, and others. - -Initial prototype tests showed up to *15,000* TPS for single-input state transactions, or *40,000* IPS (inputs/sec) for 1,000-input transactions. No performance drop was observed even after 1.2m transactions were notarised. The tests were run on three 8 core, 28 GB RAM Azure VMs in separate data centers. - -With the recent introduction of notarisation request signatures the figures are likely to be much lower, as the request payload size is increased significantly. More tuning and testing required. - -### Scalability - -Not possible to scale beyond peak single machine throughput.
-
-## Operational
-
-As a general note, Kafka and Zookeeper are widely used in the industry and there are plenty of deployment guidelines and management tools available.
-
-### Deployment
-
-Different options are available. A single Kafka broker, a Zookeeper replica and a Corda notary worker node can be hosted on the same machine for simplicity and cost-saving. At the other extreme, every Kafka/Zookeeper/Corda node can be hosted on its own machine. The latter arguably provides more room for error, at the expense of extra operational cost and effort.
-
-### Management
-
-Kafka provides command-line tools for managing brokers and topics. Third-party UI-based tools are also available.
-
-### Monitoring
-
-Kafka exports a wide range of metrics via JMX. Datadog integration is available.
-
-### Disaster recovery
-
-Failure modes:
-1. **Single machine or data center failure**. No backup/restore procedures are needed – nodes can catch up with the cluster on start. The RocksDB-backed committed state index keeps a pointer to the position of the last applied Kafka record, so it can resume where it left off after a restart.
-2. **Multi-data center disaster leading to data loss**. Out of scope.
-3. **User error**. It is possible for an admin to accidentally delete a topic using the tools Kafka provides. However, topic deletion has to be explicitly enabled in the configuration (it is disabled by default). Keeping that option disabled should be a sufficient safeguard.
-4. **Protocol-level corruption**. This covers scenarios where data stored in Kafka gets corrupted and the corruption is replicated to healthy replicas. In general, this is extremely unlikely to happen since Kafka records are immutable. In practical terms, the only such corruption could happen due to record deletion during compaction, which would occur if the broker is misconfigured not to retain records indefinitely. However, compaction is performed asynchronously and is local to the broker. For all data to be lost, _all_ brokers would have to be misconfigured.
-
-It is not possible to recover without any data loss in the event of 3 or 4. We can only _minimise_ data loss. There are two options:
-1. Run a backup Kafka cluster. Kafka provides a tool that forwards messages from one cluster to another (asynchronously).
-2. Take periodic physical backups of the Kafka topic.
-
-In both scenarios the most recent requests will be lost. If data loss only occurs in Kafka, and the worker committed state indexes are intact, the notary could still function correctly and prevent double-spends of the transactions that were lost. However, in the non-validating notary scenario, the notarisation request signature and caller identity will be lost, and it will be impossible to trace the submitter of a fraudulent transaction. We could argue that the likelihood of request loss _and_ malicious transactions occurring at the same time is very low.
-
-## Security
-
-* **Communication**. Kafka supports SSL for both client-to-server and server-to-server communication. However, Zookeeper only supports SSL client-to-server, which means that running Zookeeper across data centers will require setting up a VPN. For simplicity, we can reuse the same VPN for the Kafka cluster as well. The notary worker nodes can talk to Kafka either via SSL or over the VPN.
-
-* **Data privacy**. No transaction contents or PII is revealed or stored.
-
-APPENDICES
----
-
-## Kafka throughput scaling via partitioning
-
-We have to use a single partition for global transaction ordering guarantees, but we could reduce the load on it by using it _just_ for ordering:
-
-* Have a single-partition `transactions` topic where all worker nodes send only the transaction id.
-* Have a separate _partitioned_ `payload` topic where workers send the entire notarisation request content: transaction id, input states, request signature (a single request can be around 1KB in size).
-
-Workers would need to consume from the `transactions` partition to obtain the ordering, and from all `payload` partitions for the actual notarisation requests. A request will not be processed until its global order is known. Since Kafka tries to distribute leaders for different partitions evenly across the cluster, we would avoid a single Kafka broker handling all of the traffic. Load-wise, nothing changes from the worker node's perspective – it still has to process all requests – but a larger number of worker nodes could be supported.
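-
-A sketch of how a worker might recombine the two topics, holding payloads until their global order is known (all names and types here are illustrative assumptions):
-
-```kotlin
-// Buffers payload records until the corresponding txId has appeared on the
-// single-partition `transactions` topic, which defines the global order.
-class OrderingBuffer {
-    private val payloads = mutableMapOf<String, ByteArray>() // txId -> request payload
-    private val ordered = ArrayDeque<String>()               // txIds in global order
-
-    fun onOrderingRecord(txId: String) = ordered.addLast(txId)
-    fun onPayloadRecord(txId: String, payload: ByteArray) { payloads[txId] = payload }
-
-    // Returns the next request whose global order and payload are both known.
-    fun nextProcessable(): Pair<String, ByteArray>? {
-        val txId = ordered.firstOrNull() ?: return null
-        val payload = payloads.remove(txId) ?: return null
-        ordered.removeFirst()
-        return txId to payload
-    }
-}
-```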
diff --git a/docs/source/design/kafka-notary/images/high-level.svg b/docs/source/design/kafka-notary/images/high-level.svg
deleted file mode 100644
index 8c3d0bae08..0000000000
--- a/docs/source/design/kafka-notary/images/high-level.svg
+++ /dev/null
@@ -1 +0,0 @@
-
\ No newline at end of file
diff --git a/docs/source/design/kafka-notary/images/kafka-high-level.svg b/docs/source/design/kafka-notary/images/kafka-high-level.svg
deleted file mode 100644
index a653f31b3a..0000000000
--- a/docs/source/design/kafka-notary/images/kafka-high-level.svg
+++ /dev/null
@@ -1 +0,0 @@
-
\ No newline at end of file
diff --git a/docs/source/design/kafka-notary/images/steps.svg b/docs/source/design/kafka-notary/images/steps.svg
deleted file mode 100644
index a7e3703209..0000000000
--- a/docs/source/design/kafka-notary/images/steps.svg
+++ /dev/null
@@ -1 +0,0 @@
-
\ No newline at end of file
diff --git a/docs/source/design/kafka-notary/images/store-comparison.png b/docs/source/design/kafka-notary/images/store-comparison.png
deleted file mode 100644
index db1dcec1f9..0000000000
Binary files a/docs/source/design/kafka-notary/images/store-comparison.png and /dev/null differ
diff --git a/docs/source/design/linear-pointer/design.md b/docs/source/design/linear-pointer/design.md
deleted file mode 100644
index 1e5fcf318a..0000000000
--- a/docs/source/design/linear-pointer/design.md
+++ /dev/null
@@ -1,144 +0,0 @@
-# StatePointer
-
-## Background
-
-Occasionally there is a need to create a link from one `ContractState` to another. This has the effect of creating a uni-directional "one-to-one" relationship between a pair of `ContractState`s.
-
-There are two ways to do this.
-
-### By `StateRef`
-
-Link one `ContractState` to another by including a `StateRef` or a `StateAndRef` as a property inside another `ContractState`:
-
-```kotlin
-// StateRef.
-data class FooState(val ref: StateRef) : ContractState
-// StateAndRef.
-data class FooState(val ref: StateAndRef<ContractState>) : ContractState
-```
-
-Linking to a `StateRef` or `StateAndRef` is only recommended if a specific version of a state is required in perpetuity. Clearly, adding a `StateAndRef` embeds the data directly. This type of pointer is compatible with any `ContractState` type.
-
-But what if the linked state is updated? The `StateRef` will be pointing to an older version of the data, and this could be a problem for the `ContractState` which contains the pointer.
-
-### By `linearId`
-
-To create a link to the most up-to-date version of a state, instead of linking to a specific `StateRef`, a `linearId` which references a `LinearState` can be used. This is because all `LinearState`s contain a `linearId` which refers to a particular lineage of `LinearState`. The vault can be used to look up the most recent state with the specified `linearId`.
-
-```kotlin
-// Link by LinearId.
-data class FooState(val ref: UniqueIdentifier) : ContractState
-```
-
-This type of pointer only works with `LinearState`s.
-
-### Resolving pointers
-
-The trade-off with pointing to data in another state is that the data being pointed to cannot be immediately seen. To see the data contained within the pointed-to state, it must be "resolved".
-
-## Design
-
-Introduce a `StatePointer` interface and two implementations of it: the `StaticPointer` and the `LinearPointer`. The `StatePointer` is defined as follows:
-
-```kotlin
-interface StatePointer {
-    val pointer: Any
-    fun resolve(services: ServiceHub): StateAndRef<ContractState>
-}
-```
-
-The `resolve` method facilitates the resolution of the `pointer` to a `StateAndRef`.
-
-The `StaticPointer` type requires developers to provide a `StateRef` which points to a specific state.
-
-```kotlin
-class StaticPointer(override val pointer: StateRef) : StatePointer {
-    override fun resolve(services: ServiceHub): StateAndRef<ContractState> {
-        @Suppress("UNCHECKED_CAST")
-        val transactionState = services.loadState(pointer) as TransactionState<ContractState>
-        return StateAndRef(transactionState, pointer)
-    }
-}
-```
-
-The `LinearPointer` type contains the `linearId` of the `LinearState` being pointed to and a `resolve` method. Resolving a `LinearPointer` returns a `StateAndRef` containing the latest version of the `LinearState` that the node calling `resolve` is aware of.
-
-```kotlin
-class LinearPointer(override val pointer: UniqueIdentifier) : StatePointer {
-    override fun resolve(services: ServiceHub): StateAndRef<LinearState> {
-        val query = QueryCriteria.LinearStateQueryCriteria(linearId = listOf(pointer))
-        val result = services.vaultService.queryBy<LinearState>(query).states
-        check(result.isNotEmpty()) { "LinearPointer $pointer cannot be resolved." }
-        return result.single()
-    }
-}
-```
-
-### Bi-directional link
-
-Symmetrical relationships can be modelled by embedding a `LinearPointer` in the pointed-to `LinearState` which points in the "opposite" direction. **Note:** this can only work if both states are `LinearState`s.
-
-## Use-cases
-
-It is important to note that this design only standardises a pattern which is currently possible with the platform. In other words, this design does not enable anything new.
-
-### Tokens
-
-Uncoupling token type definitions from the notion of ownership. Using the `LinearPointer`, `Token` states can include an `Amount` of some pointed-to type. The pointed-to type can evolve independently from the `Token` state, which should just be concerned with the question of ownership.
-
-## Issues and resolutions
-
-Some issues to be aware of and their resolutions (a sketch of the reference-state approach follows the table):
-
-| Problem | Resolution |
-| :--- | :--- |
-| If the node calling `resolve` has not seen the specified `StateRef`, then `resolve` will return `null`. Here, the node calling `resolve` might be missing some crucial data. | Use data distribution groups. Assuming the creator of the `ContractState` publishes it to a data distribution group, subscribing to that group ensures that the node calling `resolve` will eventually have the required data. |
-| The node calling `resolve` has seen and stored transactions containing a `LinearState` with the specified `linearId`. However, there is no guarantee the `StateAndRef` returned by `resolve` is the most recent version of the `LinearState`. | Embed the pointed-to `LinearState` in transactions containing the `LinearPointer` as a reference state. The reference states feature will ensure the pointed-to state is the latest version. |
-| The creator of the pointed-to `ContractState` exits the state from the ledger. If the pointed-to state is included as a reference state then notaries will reject transactions containing it. | Contract code can be used to make a state un-exitable. |
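-
-Assuming the reference states feature lands as described in the second resolution, a minimal sketch of how a flow might resolve a `LinearPointer` and pin the result into a transaction; `UsePointedToStateFlow` and its body are illustrative, not part of this design:
-
-```kotlin
-import co.paralleluniverse.fibers.Suspendable
-import net.corda.core.flows.FlowLogic
-import net.corda.core.transactions.TransactionBuilder
-
-class UsePointedToStateFlow(private val fooPointer: LinearPointer) : FlowLogic<Unit>() {
-    @Suspendable
-    override fun call() {
-        // Latest version of the pointed-to LinearState known to this node.
-        val pointedTo = fooPointer.resolve(serviceHub)
-        // Pin it into the transaction as a reference state, so the notary
-        // guarantees it is current without it being consumed.
-        val builder = TransactionBuilder(notary = pointedTo.state.notary)
-            .addReferenceState(pointedTo.referenced())
-        // ... add inputs/outputs/commands that rely on pointedTo.state.data ...
-    }
-}
-```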
-
-All of the noted resolutions rely on additional platform features:
-
-* Reference states, which will be available in V4
-* Data distribution groups, which are not currently available. However, there is an early prototype
-* An additional state interface
-
-### Additional concerns and responses
-
-#### Embedding reference states in transactions
-
-**Concern:** Embedding reference states for pointed-to states in transactions could cause transactions to increase by some unbounded size.
-
-**Response:** The introduction of this feature doesn't create a new platform capability. It merely formalises a pattern which is currently possible. Furthermore, there is a possibility that _any_ type of state can cause a transaction to increase by some unbounded size. It is also worth remembering that the maximum transaction size is 10MB.
-
-#### `StatePointer`s are not human readable
-
-**Concern:** Users won't know what sits behind the pointer.
-
-**Response:** When the state containing the pointer is used in a flow, the pointer can be easily resolved. When the state needs to be displayed on a UI, the pointer can be resolved via a vault query.
-
-#### This feature adds complexity to the platform
-
-**Concern:** This all seems quite complicated.
-
-**Response:** It's possible anyway. Use of this feature is optional.
-
-#### Coin selection will be slow
-
-**Concern:** We'll need to join on other tables to perform coin selection, making it slower. This applies when a `StatePointer` is used within a `FungibleState` or `FungibleAsset` type.
-
-**Response:** This is probably not true in most cases. Take the existing coin selection code from `CashSelectionH2Impl.kt`:
-
-```sql
-SELECT vs.transaction_id, vs.output_index, ccs.pennies, SET(@t, ifnull(@t,0)+ccs.pennies) total_pennies, vs.lock_id
-FROM vault_states AS vs, contract_cash_states AS ccs
-WHERE vs.transaction_id = ccs.transaction_id AND vs.output_index = ccs.output_index
-AND vs.state_status = 0
-AND vs.relevancy_status = 0
-AND ccs.ccy_code = ? and @t < ?
-AND (vs.lock_id = ? OR vs.lock_id is null)
-```
-
-Notice that the only property required which is not accessible from the `StatePointer` is the `ccy_code`. This is not necessarily a problem, though, as the `pointer` field of the `StatePointer` can be used as a proxy for the `ccy_code` or "token type".
diff --git a/docs/source/design/maximus/design.md b/docs/source/design/maximus/design.md
deleted file mode 100644
index 4eee260a29..0000000000
--- a/docs/source/design/maximus/design.md
+++ /dev/null
@@ -1,146 +0,0 @@
-# Validation of Maximus Scope and Future Work Proposal
-
-## Introduction
-
-The intent of this document is to ensure that the Tech Leads and Product Management are comfortable with the proposed direction of HA team future work.
-The term Maximus has been used widely across R3 and we wish to ensure that the scope is clearly understood and in alignment with wider delivery expectations.
-
-I hope to explain the successes and failures of our rapid POC work, so it is clearer what guides our decision making.
-
-Also, it will hopefully inform other teams of changes that may cross into their areas.
-
-## What is Maximus?
-
-Mike's original proposal for Maximus, made at CordaCon Tokyo 2018, was to use some sort of automation to start and stop node VMs in order to reduce runtime cost. In Mike's words this would allow 'huge numbers of identities', perhaps 'thousands'.
-
-The HA team and Andrey Brozhko have tried to stay close to this original definition: that Maximus is for managing hundreds to thousands of Enterprise Nodes, and that the goal of the project is to better manage costs, especially in cloud deployments and with low overall flow rates. However, this leads to the following assumptions:
-
-1. The overall rate of flows is low and users will accept some latency. The additional sharing of identities on a reduced physical footprint will inevitably reduce throughput compared to dedicated nodes, but this should not be a problem.
-
-2. At least in the earlier phases it is acceptable to statically manage identity keys/certificates for each individual identity. This will be scripted but will incur some effort/procedures/checking on the doorman side.
-
-3. Every identity has an associated 'DB schema', which might be on a shared database server, but the separation is managed at that level. This database is a fixed runtime cost per identity and will not be shared in the earlier phases of Maximus. It might be optionally shareable in future, but this is not a hard requirement for Corda 5 as it needs significant help from core to change the DB schemas. Also, our understanding is that the isolation is a positive feature in some deployments.
-
-4. Maximus may share infrastructure and possibly JVM memory between identities without breaking some customer requirement for isolation. In other words we are virtualizing the 'node', but CorDapps and peer nodes will be unaware of any changes.
-
-## What Maximus is not
-
-1. Maximus is not designed to handle millions of identities. That is firmly Marco Polo and possibly handled completely differently.
-
-2. Maximus should not be priced so as to undercut our own high-performance Enterprise nodes, or allow customers to run arbitrary numbers of nodes for free.
-
-3. Maximus is not a 'wallet' based solution. The nodes in Maximus are fully equivalent to the current Enterprise offering and have first-class identities. There is also no remoting of the signing operations.
-
-## The POC technologies we have tried
-
-The HA team has looked at several elements of the solution. Some approaches look promising, some do not.
-
-1. We have already started the work to share a common P2P Artemis between multiple nodes and a common bridge/float. This is the 'SNI header' work which has recently been through DRB review. This should be functionally complete soon and available in Corda 4.0. This work will reduce platform cost and simplify deployment of multiple nodes. For Maximus the main effect is that it should make the configuration much more consistent between nodes, and it means that where a node runs is immaterial, as the shared broker distributes messages and the Corda firewall handles the public communication.
-2. I looked at flattening the flow state machine, so that we could map Corda operations into combinations of state and messages in the style of a Map-Reduce pattern. Unfortunately, the work involved is extreme and not compatible with the Corda API. Therefore a pure 'flow worker' approach does not look viable any time soon, and in general full hot-hot is still a way off.
-
-3. Chris looked at reducing the essential service set in the node to those needed to support the public flow API and the StateMachine. Then we attached a simple start-flow messaging interface. This simple 'FlowRunner' class allowed exploration of several options in a gaffer-taped state.
-
-   1. We created a simple messaging interface between an RPC runner and a Flow Runner and showed that we can run standard flows.
-
-   2. We were able to POC combining two identities running side-by-side in a Flow Runner, which is in fact quite similar to many of our integration tests. We must address static variable leakage, but this should be feasible.
-
-   3. We were able to create an RPC worker that could handle several identities at once and start flows on the same/different flow runner harnesses.
-
-4. We then pushed forward looking into flow sharding. Here we made some progress, but the task started to get more and more complicated. It also highlighted that we don't have suitable headers on our messages and that the message header whitelist will make this difficult to change whilst maintaining wire compatibility. The conclusion from this is that hot-hot flow sharding will have to wait.
-
-5. We have been looking at resource/cost management technologies. The almost immediate conclusion is that whilst cloud providers do have automated VM/container-as-a-service offerings, they are not standardized. Instead, the only standardized approach is Kubernetes + Docker, which will charge dynamically according to active use levels.
-
-6. Looking at resource management in Kubernetes, we can dynamically scale relatively homogeneous pods, but the metrics approach cannot easily cope with identity injection. Instead we can scale the number of running pods, but they will have to self-organize the work balancing amongst themselves.
-
-## Maximus Work Proposal
-
-#### Current State
-
-![Current Enterprise State](./images/current_state.png)
-
-The current enterprise node solution in GA 3.1 is as above. This has dynamic HA failover available for the bridge/float using ZooKeeper as leader elector, but the node has to be hot-cold. There is some sharing support for the ZooKeeper cluster, but otherwise all of this infrastructure has to be replicated per identity. In addition, all elements of this have to have at least one resident instance to ensure that messages are captured and RPC clients have an endpoint to talk to.
-
-#### Corda 4.0 Agreed Target with SNI Shared Corda Firewalls
-
-![Corda 4.0 Enterprise State](./images/shared_bridge_float.png)
-
-Here, by sharing the P2P Artemis broker externally and through work on the messaging protocol, it should be possible to reuse the Corda firewall for multiple nodes. This means that the externally advertised address will be stable for the whole cluster, independent of the deployed identities. Also, the durable messaging sits outside the nodes, which means that we could theoretically schedule running the nodes only a few times a day if they only act in response to external peer messages. Mostly this is a prelude to greater sharing in the future Maximus state.
-
-#### Intermediate State Explored during POC
-
-![Maximus POC](./images/maximus_poc.png)
-
-During the POC we explored the model above, although none of the components were completed to a production standard. The key feature here is that the RPC side has been split out of the node and has API support for multiple identities built in. The flow and P2P elements of the node have been split out too, which means that the 'FlowWorker' start-up code can be simpler than the current AbstractNode, as it doesn't have to support the same testing framework. The actual service implementations are unchanged in this.
-
-The principal communication between the RPC worker and the FlowWorker concerns starting flows; completed work is broadcast as events. A message protocol will be defined to allow re-attachment and status querying if the RPC client is restarted. The vault RPC API will continue to go directly to the database in the RpcWorker and will not involve the FlowWorker. The scheduler service will live in the RPC service, as the FlowWorkers will potentially not yet be running when the due time occurs.
-
-#### Proposed Maximus Phase 1 State
-
-![Maximus Phase 1](./images/maximus_phase1.png)
-
-The productionised version of the above POC will introduce 'Max Nodes' that can load FlowWorkers on demand. We still require that only one runs at a time, and for this we will use ZooKeeper to ensure that FlowWorkers with capacity compete to process the work and only one wins. Based on trials, we can safely run a couple of identities at once inside the same Max Node, assuming the load is manageable. Idle identities can be dropped trivially, since the Hibernate and Artemis connections and the thread pools are owned by the Max Node, not the FlowWorkers. At this stage there is no dynamic management of the physical resources, but some sort of scheduler could control how many Max Nodes are running at once.
-
-#### Final State Maximus with Dynamic Resource Management
-
-![Maximus Final](./images/maximus_final.png)
-
-The final evolution is to add dynamic cost control to the system. As the Max Nodes are homogeneous, the RpcWorker can monitor the load and expose metrics to Kubernetes. This means that Max Nodes can be added and removed as required, potentially reducing the cost to zero. Ideally, separate work would begin in parallel to combine database data into a single schema, but that is possibly not required.
\ No newline at end of file
diff --git a/docs/source/design/maximus/images/current_state.png b/docs/source/design/maximus/images/current_state.png
deleted file mode 100644
index e8dd93aa31..0000000000
Binary files a/docs/source/design/maximus/images/current_state.png and /dev/null differ
diff --git a/docs/source/design/maximus/images/maximus_final.png b/docs/source/design/maximus/images/maximus_final.png
deleted file mode 100644
index 4850703e00..0000000000
Binary files a/docs/source/design/maximus/images/maximus_final.png and /dev/null differ
diff --git a/docs/source/design/maximus/images/maximus_phase1.png b/docs/source/design/maximus/images/maximus_phase1.png
deleted file mode 100644
index 369e347adf..0000000000
Binary files a/docs/source/design/maximus/images/maximus_phase1.png and /dev/null differ
diff --git a/docs/source/design/maximus/images/maximus_poc.png b/docs/source/design/maximus/images/maximus_poc.png
deleted file mode 100644
index 906a45dba4..0000000000
Binary files a/docs/source/design/maximus/images/maximus_poc.png and /dev/null differ
diff --git a/docs/source/design/maximus/images/shared_bridge_float.png b/docs/source/design/maximus/images/shared_bridge_float.png
deleted file mode 100644
index 8c8d7be9dd..0000000000
Binary files a/docs/source/design/maximus/images/shared_bridge_float.png and /dev/null differ
diff --git a/docs/source/design/monitoring-management/MonitoringLoggingOverview.png b/docs/source/design/monitoring-management/MonitoringLoggingOverview.png
deleted file mode 100644
index 768507853c..0000000000
Binary files a/docs/source/design/monitoring-management/MonitoringLoggingOverview.png and /dev/null differ
diff --git a/docs/source/design/monitoring-management/design.md b/docs/source/design/monitoring-management/design.md
deleted file mode 100644
index f36809a8ca..0000000000
--- a/docs/source/design/monitoring-management/design.md
+++ /dev/null
@@ -1,533 +0,0 @@
-# Monitoring and Logging Design
-
-## Overview
-
-The successful deployment and operation of Corda (and associated CorDapps) in a production environment requires a supporting monitoring and management capability to ensure that both a Corda node (and its supporting middleware infrastructure) and deployed CorDapps execute in a functionally correct and consistent manner. A proactive monitoring solution will enable immediate alerting of unexpected behaviours, and the associated management tooling should enable swift corrective action.
-
-This design defines the monitoring metrics and logging outputs, and the associated implementation approach, required to enable a proactive enterprise management and monitoring solution for Corda nodes and their associated CorDapps. This also includes a set of "liveness" checks to verify and validate correct functioning of a Corda node (and associated CorDapps).
-
-![MonitoringLoggingOverview](./MonitoringLoggingOverview.png)
-
-In the above diagram, the left-hand dotted box represents the components within scope for this design. It is anticipated that 3rd party enterprise-wide system management solutions will closely follow the architectural component breakdown in the right-hand box, and thus seamlessly integrate with the proposed Corda event generation and logging design. The interface between the two is de-coupled and based on textual log file parsing and adoption of industry-standard JMX MBean events.
-
-## Background
-
-Corda currently exposes several forms of monitorable content:
-
-* Application log files, written using the [SLF4J](https://www.slf4j.org/) (Simple Logging Facade for Java) API, which provides an abstraction over various concrete logging frameworks (several of which are used within other Corda-dependent 3rd party libraries). Corda itself uses the [Apache Log4j 2](https://logging.apache.org/log4j/2.x/) framework for logging output to a set of configured loggers (including a rolling file appender and the console). Currently the same set of rolling log files is used by both the node and the CorDapp(s) deployed to the node. The log file policy specifies a 60-day rolling period (but preserving the most recent 10Gb) with a maximum of 10 log files per day.
-
-* Industry-standard JMX-based metrics: both standard JVM and custom application metrics are exposed directly using the [Dropwizard.io](http://metrics.dropwizard.io/3.2.3/) *JmxReporter* facility. In addition, Corda also uses the [Jolokia](https://jolokia.org/) framework to make these accessible over an HTTP endpoint. Typically, these metrics are also collated by 3rd party tools to provide pro-active monitoring, visualisation and re-active management.
-
-  A full list of currently exposed metrics can be found in Appendix A.
-
-The Corda flow framework also has *placeholder* support for recording additional audit data in application flows using a simple *AuditService*. Audit event types are currently loosely defined and data is stored in string form (as a description and a contextual map of name-value pairs) together with a timestamp and principal name. This service does not currently have an implementation that writes the audit event data to a persistent store.
-
-The `ProgressTracker` component is used to report the progress of a flow throughout its business lifecycle, and is typically configured to report the start of a specific business workflow step (often before and after message send and receipt where other participants form part of a multi-staged business workflow). The progress tracking framework was designed to become a vital part of how exceptions, errors, and other faults are surfaced to human operators for investigation and resolution. It provides a means of exporting progress as a hierarchy of steps in a way that's both human readable and machine readable.
-
-In addition, in-house Corda networks at R3 use the following tools:
-
-* Standard [DataDog](https://docs.datadoghq.com/guides/overview/) probes are currently used to provide e-mail based alerting for running Corda nodes. [Telegraf](https://github.com/influxdata/telegraf) is used in conjunction with a [Jolokia agent](https://jolokia.org/agent.html) as a collector to parse emitted metric data and push these to DataDog.
-* Investigation is underway to evaluate [ELK](https://logz.io/learn/complete-guide-elk-stack/) as a mechanism for parsing, indexing, storing, searching, and visualising log file data.
-
-## Scope
-
-### Goals
-
-- Add new metrics at the level of a Corda node, individual CorDapps, and other supporting Corda components (float, bridge manager, doorman)
-- Support liveness checking of the node, deployed flows and services
-- Review logging groups and severities in the node
-- Separate application logging from node logging
-- Implement the audit framework that is currently only a stubbed-out API
-- Ensure that Corda can be used with third-party systems for monitoring, log collection and audit
-
-### Out of scope
-
-- Recommendation of a specific set of monitoring tools.
-- Monitoring of network infrastructure like the network map service.
-- Monitoring of liveness of peers.
-
-## Requirements
-
-Expanding on the first goal identified above, the following requirements have been identified:
-
-1. Node health
-   - Message queues: latency, number of queues/messages, backlog, bridging establishment and connectivity (success / failure)
-   - Database: connections (retries, errors), latency, query time
-   - RPC metrics: latency, authentication/authorisation checking (eg. number of successful / failed attempts)
-   - Signing performance (eg. signatures per sec)
-   - Deployed CorDapps
-   - Garbage collector and JVM statistics
-
-2. CorDapp health
-   - Number of flows broken down by type (including flow status and aging statistics: oldest, latest)
-   - Flow durations
-   - JDBC connections, latency/histograms
-
-3. Logging
-   - RPC logging
-   - Shell logging (user/command pairs)
-   - Message queue
-   - Traces
-   - Exception logging (including full stack traces)
-   - Crash dumps (full stack traces)
-   - Hardware Security Module (HSM) events
-   - Per-CorDapp logging
-
-4. Auditing
-   - Security: login authentication and authorisation
-   - Business event flow progress tracking
-   - System events (particularly failures)
-
-   Audit data should be stored in a secure storage medium. Audit data should include sufficient contextual information to enable optimal off-line analysis. Auditing should apply to all Corda node processes (running CorDapps, notaries, oracles).
-
-### Use Cases
-
-It is envisaged that operational management and support teams will use the metrics and information collated from this design, either directly or through an integrated enterprise-wide systems management platform, to perform the following:
-
-- Validate liveness and correctness of Corda nodes and deployed CorDapps, and the physical machine or VM they are hosted on.
-- Use logging to troubleshoot operational failures (in conjunction with other supporting failure information: eg. GC logs, stack traces).
-- Use reported metrics to fine-tune and tweak operational systems parameters (including dynamic setting of logging modules and severity levels to enable detailed logging).
-
-## Design Decisions
-
-The following design decisions are to be confirmed:
-
-1. JMX for metric eventing and SLF4J for logging.
-   Both of the above are widely adopted mechanisms that enable pluggability and seamless interoperability with other 3rd party enterprise-wide system management solutions.
-2. Continue or discontinue usage of Jolokia? (TBC - most likely yes, subject to read-only security lock-down)
-3. Separation of Corda Node and CorDapp log outputs (TBC)
-
-## Proposed Solution
-
-There are a number of activities and parts to the solution proposal:
-
-1. Extend JMX metric reporting through the Corda Monitoring Service (and the associated Jolokia conversion to REST/JSON) to cover all Corda services (vault, key management, transaction storage, network map, attachment storage, identity, cordapp provision) and subsystem components (state machine) - see implementation details.
-2. Review and extend Corda log4j2 coverage (see implementation details) to ensure:
-
-   - consistent use of severities according to situation
-   - consistent coverage across all modules and libraries
-   - consistent output format with all relevant contextual information (node identity, user/execution identity, flow session identity, version information)
-   - separation of Corda Node and CorDapp log outputs (TBC)
-     For consistent interleaving reasons, it may be desirable to continue using combined log output.
-
-   Publication of a *code style guide* to define when to use different severity levels.
-
-3. Implement a CorDapp to perform sanity checking of the flow framework, fundamental Corda services (vault, identity), and dependent middleware infrastructure (message broker, database).
-
-4. Revisit and enhance as necessary the [Audit service API](https://github.com/corda/corda/pull/620), and provide a persistence-backed implementation, to include:
-
-   - specification of Business Event Categories (eg. User authentication and authorisation, Flow-based triggering, Corda Service invocations, Oracle invocations, Flow-based send/receive calls, RPC invocations)
-   - auto-enablement of the Progress Tracker as a Business Event generator
-   - an RDBMS-backed persistent store (independent of the Corda database), with adequate security controls (authenticated access and read-only permissioning). Captured information should be consistent with standard logging, and it may be desirable to define auditable loggers within log4j2 to automatically redirect certain types of log events to the audit service.
-
-5. Ensure 3rd party middleware drivers (JDBC for database, MQ for messaging) and the JVM are correctly configured to export JMX metrics. Ensure the [JVM Hotspot VM command-line parameters](https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/clopts001.html) are tuned correctly to enable detailed troubleshooting upon failure. Many of these metrics are already automatically exposed to 3rd party profiling tools such as Yourkit.
-
-   Apache Artemis has a comprehensive [management API](https://activemq.apache.org/artemis/docs/latest/management.html) that allows a user to modify a server configuration, create new resources (e.g. addresses and queues), inspect these resources (e.g. how many messages are currently held in a queue) and interact with them (e.g. to remove messages from a queue). It exposes key metrics using JMX, with role-based authentication via Artemis's JAAS plug-in support to ensure Artemis cannot be controlled via JMX.
-
-### Restrictions
-
-As of Corda M11, Java serialisation in the Corda node has been restricted, meaning MBean access via the JMX port will no longer work.
-
-Usage of Jolokia requires bundling an associated *jolokia-agent-war* file on the classpath, and associated configuration to export JMX monitoring statistics and data over the Jolokia REST/JSON interface. An associated *jolokia-access.xml* configuration file defines role-based permissioning of HTTP operations.
-
-## Complementary solutions
-
-A number of 3rd party libraries and frameworks have been proposed which solve different parts of the end-to-end solution, albeit with most focusing on the Agent Collector (eg. collecting metrics from systems and outputting them to some backend storage), Event Storage and Search, and Visualization aspects of Systems Management and Monitoring.
-These include:
-
-| Solution | Type (OS/£) | Description |
-| --- | --- | --- |
-| [Splunk](https://www.splunk.com/en_us/products.html) | £ | General-purpose enterprise-wide system management solution which performs collection and indexing of data, searching, correlation and analysis, visualization and reporting, monitoring and alerting. |
-| [ELK](https://logz.io/learn/complete-guide-elk-stack/) | OS | The ELK stack is a collection of 3 open source products from Elastic which together provide an end-to-end enterprise-wide system management solution: Elasticsearch (a NoSQL database based on the Lucene search engine), Logstash (a log pipeline tool that accepts inputs from various sources, executes different transformations, and exports the data to various targets) and Kibana (a visualization layer that works on top of Elasticsearch). |
-| [ArcSight](https://software.microfocus.com/en-us/software/siem-security-information-event-management) | £ | Enterprise Security Manager. |
-| [Collectd](https://collectd.org/) | OS | Collector agent (written in C circa 2005). Data acquisition and storage handled by over 90 plugins. |
-| [Telegraf](https://github.com/influxdata/telegraf) | OS | Collector agent (written in Go, active community). |
-| [Graphite](https://graphiteapp.org/) | OS | Monitoring tool that stores, retrieves, shares, and visualizes time-series data. |
-| [StatsD](https://github.com/etsy/statsd) | OS | Collector daemon that runs on the [Node.js](http://nodejs.org/) platform and listens for statistics, like counters and timers, sent over [UDP](http://en.wikipedia.org/wiki/User_Datagram_Protocol) or [TCP](http://en.wikipedia.org/wiki/Transmission_Control_Protocol), and sends aggregates to one or more pluggable backend services (e.g., Graphite). |
-| [fluentd](https://www.fluentd.org/) | OS | Collector daemon which collects data directly from logs and databases. Often used to analyze event logs, application logs, and clickstreams (a series of mouse clicks). |
-| [Prometheus](https://prometheus.io/) | OS | End-to-end monitoring solution using time-series data (eg. metric name and a set of key-value pairs) which includes collection, storage, query and visualization. |
-| [NewRelic](https://newrelic.com/) | £ | Full-stack instrumentation for application monitoring and a real-time analytics solution. |
-
-Most of the above solutions are not within the scope of this design proposal, but should be capable of ingesting the outputs (logging and metrics) defined by this design.
-
-## Technical design
-
-In general, the requirements outlined in this design are cross-cutting concerns which affect the Corda codebase holistically, both for logging and for the capture/export of JMX metrics.
-
-### Interfaces
-
-* Public APIs impacted
-  * No public APIs are impacted.
-* Internal APIs impacted
-  * No identified internal APIs are impacted.
-* Services impacted:
-  * No change is anticipated to the following service:
-    * *Monitoring* - This service defines and uses the *Codahale* `MetricsRegistry`, which is used by all other Corda services.
-  * Changes are expected to:
-    * *AuditService* - This service has been specified but not implemented. The following event types have been defined (and may need reviewing):
-      * `FlowAppAuditEvent`: used in `FlowStateMachine`, exposed on `FlowLogic` (but never called)
-      * `FlowPermissionAuditEvent`: (as above)
-      * `FlowStartEvent` (unused)
-      * `FlowProgressAuditEvent` (unused)
-      * `FlowErrorAuditEvent` (unused)
-      * `SystemAuditEvent` (unused)
-* Modules impacted
-  * All modules packaged and shipped as part of a Corda distribution (as published to Artifactory / Maven): *core, node, node-api, node-driver, finance, confidential-identities, test-common, test-utils, webserver, jackson, jfx, mock, rpc*
-
-### Functional
-
-#### Health Checker
-
-The Health Checker is a CorDapp which verifies the health and liveness of the Corda node it is deployed and running within by performing the following activities (a sketch of the flow from item 3 follows this list):
-
-1. Corda network and middleware infrastructure connectivity checking:
-
-   - Database connectivity
-   - Message broker connectivity
-
-2. Network Map participants summary (count, list)
-
-   - Notary summary (type, number of cluster members)
-
-3. Flow framework verification
-
-   Implement a simple flow that performs a simple "in-node" (no external messaging to 3rd party processes) round trip, and by doing so exercises:
-
-   - flow checkpointing (including persistence to a relational data store)
-   - message subsystem verification (creation of a send-to-self queue for the purpose of routing)
-   - custom CordaService invocation (verify and validate the behaviour of an installed CordaService)
-   - vault querying (verify and validate the behaviour of the vault query mechanism)
-
-   [This CorDapp could perform a simple issuance of a fictional Corda token, a spend of the Corda token to self, a Corda token exit, plus a couple of vault queries in between: one using the VaultQuery API and the other using a custom query via a registered @CordaService.]
-
-4. RPC triggering
-
-   Auto-triggering of the above flow using RPC to exercise the following:
-
-   - messaging subsystem verification (RPC queuing)
-   - authentication and permissions checking (against the underlying configuration)
-
-The Health Checker may be deployed as part of a Corda distribution and automatically invoked upon start-up and/or manually triggered via JMX or the node's CRaSH shell (using the startFlow command).
-
-Please note that the Health Checker application is not responsible for determining the healthiness of a Corda network. This is the responsibility of the network operator, and may include verification checks such as:
-
-- correct functioning of the Network Map Service (registration, discovery)
-- correct functioning of the configured Notary
-- the remote messaging sub-system (including bridge creation)
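-
-A minimal sketch of what such a liveness flow might look like, assuming the standard flow API; `HealthCheckService`, `NodeHealthCheckFlow` and the specific checks shown are illustrative assumptions, not part of this design:
-
-```kotlin
-import co.paralleluniverse.fibers.Suspendable
-import net.corda.core.contracts.LinearState
-import net.corda.core.flows.FlowLogic
-import net.corda.core.flows.StartableByRPC
-import net.corda.core.node.AppServiceHub
-import net.corda.core.node.services.CordaService
-import net.corda.core.node.services.queryBy
-import net.corda.core.serialization.SingletonSerializeAsToken
-import java.time.Duration
-
-// Assumed custom service, exercised by the flow below.
-@CordaService
-class HealthCheckService(private val services: AppServiceHub) : SingletonSerializeAsToken() {
-    fun ping(): Boolean = true
-}
-
-@StartableByRPC
-class NodeHealthCheckFlow : FlowLogic<Boolean>() {
-    @Suspendable
-    override fun call(): Boolean {
-        sleep(Duration.ofMillis(1))                     // forces a flow checkpoint to be persisted
-        serviceHub.vaultService.queryBy<LinearState>()  // exercises the vault query mechanism
-        // Exercises invocation of an installed custom Corda service.
-        return serviceHub.cordaService(HealthCheckService::class.java).ping()
-    }
-}
-```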
-
-#### Metrics augmentation within Corda Subsystems and Components
-
-*Codahale* provides the following types of reportable metrics:
-
-- Gauge: an instantaneous measurement of a value.
-- Counter: a gauge for a numeric value (specifically of type `AtomicLong`) which can be incremented or decremented.
-- Meter: measures mean throughput (eg. the rate of events over time, e.g. "requests per second"). Also measures one-, five-, and fifteen-minute exponentially-weighted moving average throughputs.
-- Histogram: measures the statistical distribution of values in a stream of data (minimum, maximum, mean, median, 75th, 90th, 95th, 98th, 99th, and 99.9th percentiles).
-- Timer: measures both the rate at which a particular piece of code is called and the distribution of its duration (eg. rate of requests in requests per second).
-- Health checks: a means of centralizing service health checks (database, message broker).
-
-See Appendix A for a summary of the JMX metrics currently exported by the Corda codebase.
-
-The following table identifies additional metrics to report for a Corda node (a sketch of how a service might register such metrics follows the table):
-
-| Component / Subsystem | Proposed Metric(s) |
-| --- | --- |
-| Database | Connectivity (health check) |
-| Corda Persistence | Database configuration details:
Data source properties: JDBC driver, JDBC driver class name, URL
Database properties: isolation level, schema name, init database flag
Run-time metrics: total & in flight connection, session, transaction counts; committed / rolled back transaction (counter); transaction durations (metric) | -| Message Broker | Connectivity (health check) | -| Corda Messaging Client | | -| State Machine | Fiber thread pool queue size (counter), Live fibers (counter) , Fibers waiting for ledger commit (counter)
Flow Session Messages (counters): init, confirm, received, reject, normal end, error end, total received messages (for a given flow session, Id and state)
(in addition to existing metrics captured)
Flow error (count) | -| Flow State Machine | Initiated flows (counter)
For a given flow session (counters): initiated flows, send, sendAndReceive, receive, receiveAll, retries upon send
For flow messaging (timers) to determine round trip latencies between send/receive interactions with counterparties.
Flow suspension metrics (count, age, wait reason, cordapp) | -| RPC | For each RPC operation we should export metrics to report: calling user, round trip latency (timer), calling frequency (meter). Metric reporting should include the Corda RPC protocol version (should be the same as the node's Platform Version) in play.
Failed requests would be of particular interest for alerting. | -| Vault | round trip latency of Vault Queries (timer)
Soft locking counters for reserve, release (counter), elapsed times soft locks are held for per flow id (timer, histogram), list of soft locked flow ids and associated stateRefs.
attempt to soft lock fungible states for spending (timer) | -| Transaction Verification
(InMemoryTransactionVerifierService) | worker pool size (counter), verify duration (timer), verify throughput (meter), success (counter), failure (counter), in flight (counter) |
Counters for success, failures, failure types (conflict, invalid time window, invalid transaction, wrong notary), elapsed time (timer)
Ideally provide breakdown of latency across notarisation steps: state ref notary validation, signature checking, from sending to remote notary to receiving response | -| RAFT Notary Service
(awaiting choice of new RAFT implementation) | should include similar metrics to previous RAFT (see appendix). | -| SimpleNotaryService | success/failure uniqueness checking
success/failure time-window checking | -| ValidatingNotaryService | as above plus success/failure of transaction validation | -| RaftNonValidatingNotaryService | as `SimpleNotaryService`, plus timer for algorithmic execution latency | -| RaftValidatingNotaryService | as `ValidatingNotaryService`, plus timer for algorithmic execution latency | -| BFTNonValidatingNotaryService | as `RaftNonValidatingNotaryService` | -| CorDapps
(CordappProviderImpl, CordappImpl) | list of corDapps loaded in node, path used to load corDapp jars
Details per CorDapp: name, contract class names, initiated flows, rpc flows, service flows, schedulable flows, services, serialization whitelists, custom schemas, jar path | -| Doorman Server | TBC | -| KeyManagementService | signing requests (count), fresh key requests (count), fresh key and cert requests (count), number of loaded keys (count) | -| ContractUpgradeServiceImpl | number of authorisation upgrade requests (counter) | -| DBTransactionStorage | number of transactions in storage map (cache)
cache size (max. 1024), concurrency level (def. 8) | -| DBTransactionMappingStorage | as above | -| Network Map | TBC (following re-engineering) | -| Identity Service | number or parties, keys, principals (in cache)
Identity verification count & latency (count, metric) |
| Attachment Service | counters for open, import, checking requests
(in addition to existing attachment count) |
| Schema Service | list of registered schemas; schemaOptions per schema; table prefix. |
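-
-As an illustration of how these metric types might be wired in, a minimal sketch using the Codahale `MetricRegistry` (the `VaultMetrics` class and metric names are hypothetical):
-
-```kotlin
-import com.codahale.metrics.MetricRegistry
-
-class VaultMetrics(registry: MetricRegistry) {
-    // Timer also yields throughput (meter) and duration distribution (histogram).
-    private val queryLatency = registry.timer("VaultService.QueryLatency")
-    private val softLocksHeld = registry.counter("VaultService.SoftLocksHeld")
-
-    fun <T> recordQuery(query: () -> T): T = queryLatency.time { query() }
-    fun onSoftLockReserve(states: Int) = softLocksHeld.inc(states.toLong())
-    fun onSoftLockRelease(states: Int) = softLocksHeld.dec(states.toLong())
-}
-```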
-
-#### Logging augmentation within Corda Subsystems and Components
-
-We need to ensure that Log4j2 log messages within Corda code are correctly categorized according to the defined severities (from most specific to least):
-
-- ERROR: an error in the application, possibly recoverable.
-- WARNING: an event that might possibly lead to an error.
-- INFO: an event for informational purposes.
-- DEBUG: a general debugging event.
-- TRACE: a fine-grained debug message, typically capturing the flow through the application.
-
-A *logging style guide* will be published to answer questions such as what severity level should be used, and why, when:
-
-- A connection to a remote peer is unexpectedly terminated.
-- A database connection timed out but was successfully re-established.
-- A message was sent to a peer.
-
-It is also important that we capture the correct amount of contextual information to enable rapid identification and resolution of issues using log file output. Specifically, within Corda we should include the following information in logged messages (a sketch of one way to attach this context follows this list):
-
-- Node identifier
-- User name
-- Flow id (runId, also referred to as `StateMachineRunId`), if logging within a flow
-- Other contextual flow information (eg. counterparty), if logging within a flow
-- `FlowStackSnapshot` information for catastrophic flow failures.
-  Note: this information is not currently supposed to be used in production.
-- Session id information for RPC calls
-- CorDapp name, if logging from within a CorDapp
-
-See Appendix B for a summary of the current logging and progress tracker reporting coverage within the Corda codebase.
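-
-One possible way to attach this context to log lines, sketched here with Log4j2's `ThreadContext` (MDC); the helper and field names are illustrative assumptions, and a pattern layout such as `%X{nodeId} %X{flowId}` would render the fields:
-
-```kotlin
-import org.apache.logging.log4j.LogManager
-import org.apache.logging.log4j.ThreadContext
-
-private val log = LogManager.getLogger("net.corda.flow")
-
-fun <T> withLoggingContext(nodeId: String, flowId: String, block: () -> T): T {
-    ThreadContext.put("nodeId", nodeId)   // rendered via %X{nodeId} in the layout
-    ThreadContext.put("flowId", flowId)   // rendered via %X{flowId} in the layout
-    try {
-        return block()
-    } finally {
-        ThreadContext.remove("nodeId")
-        ThreadContext.remove("flowId")
-    }
-}
-
-// Usage: withLoggingContext(nodeId, runId) { log.info("Flow checkpoint persisted") }
-```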
-
-##### Custom logging for enhanced visibility and troubleshooting
-
-1. Database SQL logging is controlled via explicit configuration of the Hibernate log4j2 loggers, for example:
-
-```
-<Logger name="org.hibernate.SQL" level="debug"/>
-<Logger name="org.hibernate.type.descriptor.sql.BasicBinder" level="trace"/>
-```
-
-2. Message broker (Apache Artemis) advanced logging is enabled by configuring log4j2 for each of the 6 available [loggers defined](https://activemq.apache.org/artemis/docs/latest/logging.html). In general, Artemis logging is highly chatty, so default logging is actually toned down for one of the defined loggers, for example:
-
-```
-<Logger name="org.apache.activemq.artemis.core.server" level="debug"/>
-<Logger name="org.apache.activemq.artemis.journal" level="error"/>
-```
-
-3. Corda coin selection advanced logging, including display of prepared statement parameters (which are not displayed for certain database providers when enabling Hibernate debug logging), for example:
-
-```
-<Logger name="net.corda.finance.contracts.asset.cash.selection" level="trace"/>
-```
-
-#### Audit Service persistence implementation and enablement
-
-1. Implementation of the existing `AuditService` API to write to a (pluggable) secure destination (database, message queue, other).
-2. Identification of Business Events that we should audit, and instrumentation of code to ensure the AuditService is called with the correct event type for each Business Event. For Corda flows it would be a good idea to use the `ProgressTracker` component as a means of sending business audit events (see the sketch after this list). Refer [here](https://docs.corda.net/head/flow-state-machines.html?highlight=progress%20tracker#progress-tracking) for a detailed description of the ProgressTracker API.
-3. Identification of System Events that should be automatically audited.
-4. Specification of a database schema and an associated object-relational mapping implementation.
-5. Setup and configuration of a separate database and user account.
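-
-A minimal sketch of item 2, with `ProgressTracker` steps standing in for business audit events; `SettlementFlow` and the audit wiring are hypothetical, since the `AuditService` API is currently only stubbed out:
-
-```kotlin
-import co.paralleluniverse.fibers.Suspendable
-import net.corda.core.flows.FlowLogic
-import net.corda.core.utilities.ProgressTracker
-
-class SettlementFlow : FlowLogic<Unit>() {
-    companion object {
-        object VERIFYING : ProgressTracker.Step("Verifying settlement instruction")
-        object RECORDING : ProgressTracker.Step("Recording settlement")
-        fun tracker() = ProgressTracker(VERIFYING, RECORDING)
-    }
-
-    override val progressTracker = tracker()
-
-    @Suspendable
-    override fun call() {
-        progressTracker.currentStep = VERIFYING // could emit a FlowProgressAuditEvent
-        // ... verification logic ...
-        progressTracker.currentStep = RECORDING // could emit a FlowProgressAuditEvent
-    }
-}
-```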
-
-## Software Development Tools and Programming Standards to be adopted
-
-* Design patterns
-
-  [Michele] proposes the adoption of an [event-based propagation](https://r3-cev.atlassian.net/browse/ENT-1131) solution (and associated event-driven framework) based on separation of concerns (performance improvements through parallelisation, latency minimisation for the mainline execution thread): mainstream flow logic, business audit event triggering, and JMX metric reporting. This approach would continue to use the same libraries for JMX event triggering and file logging.
-
-* 3rd party libraries
-
-  [Jolokia](https://jolokia.org/) is a JMX-HTTP bridge giving access to the raw data and operations without connecting to the JMX port directly. Jolokia defines the JSON and REST formats for accessing MBeans, and provides client libraries to work with that protocol as well.
-
-  [Dropwizard Metrics](http://metrics.dropwizard.io/3.2.3/) (formerly Codahale) provides a toolkit of ways to measure the behavior of critical components in a production environment.
-
-* Supporting tools
-
-  [VisualVM](http://visualvm.github.io/) is a visual tool integrating command-line JDK tools and lightweight profiling capabilities.
-
-## Appendix A - Corda exposed JMX Metrics
-
-The following metrics are exposed directly by a Corda node at run-time:
-
-| Module | Metric | Description |
-| --- | --- | --- |
-| Attachment Service | Attachments | Counts the number of attachments persisted in the database. |
-| RAFT Uniqueness Provider | RaftCluster.ThisServerStatus | Gauge |
-| RAFT Uniqueness Provider | RaftCluster.MembersCount | Count |
-| RAFT Uniqueness Provider | RaftCluster.Members | Gauge, containing a list of members (by server address) |
-| State Machine Manager | Flows.InFlight | Gauge (number of instances of state machine manager) |
-| State Machine Manager | Flows.CheckpointingRate | Meter |
-| State Machine Manager | Flows.Started | Count |
-| State Machine Manager | Flows.Finished | Count |
-
-Additionally, JMX metrics are also generated within the Corda *node-driver* performance testing utilities. Specifically, `startPublishingFixedRateInjector` defines and exposes the `QueueSize` and `WorkDuration` metrics.
-
-## Appendix B - Corda Logging and Reporting coverage
-
-Primary node services exposed publicly via ServiceHub (SH) or internally by ServiceHubInternal (SHI):
-
-| Service | Type | Implementation | Logging summary |
-| --- | --- | --- | --- |
-| VaultService | SH | NodeVaultService | extensive coverage including Vault Query api calls using `HibernateQueryCriteriaParser` |
-| KeyManagementService | SH | PersistentKeyManagementService | none |
-| ContractUpgradeService | SH | ContractUpgradeServiceImpl | none |
-| TransactionStorage | SH | DBTransactionStorage | none |
-| NetworkMapCache | SH | NetworkMapCacheImpl | some logging (11x info, 1x warning) |
-| TransactionVerifierService | SH | InMemoryTransactionVerifierService | |
-| IdentityService | SH | PersistentIdentityService | some logging (error, debug) |
-| AttachmentStorage | SH | NodeAttachmentService | minimal logging (info) |
-| | | | |
-| TransactionStorage | SHI | DBTransactionStorage | see SH |
-| StateMachineRecordedTransactionMappingStorage | SHI | DBTransactionMappingStorage | none |
-| MonitoringService | SHI | MonitoringService | none |
-| SchemaService | SHI | NodeSchemaService | none |
-| NetworkMapCacheInternal | SHI | PersistentNetworkMapCache | see SH |
-| AuditService | SHI | | |
-| MessagingService | SHI | NodeMessagingClient | Good coverage (error, warning, info, trace) |
-| CordaPersistence | SHI | CordaPersistence | INFO coverage within `HibernateConfiguration` |
-| CordappProviderInternal | SHI | CordappProviderImpl | none |
-| VaultServiceInternal | SHI | NodeVaultService | see SH |
-
-Corda subsystem components:
-
-| Name | Implementation | Logging summary |
-| --- | --- | --- |
-| NotaryService | SimpleNotaryService | some logging (warn) via `TrustedAuthorityNotaryService` |
-| NotaryService | ValidatingNotaryService | as above |
-| NotaryService | RaftValidatingNotaryService | some coverage (info, debug) within `RaftUniquenessProvider` |
-| NotaryService | RaftNonValidatingNotaryService | as above |
-| NotaryService | BFTNonValidatingNotaryService | Logging coverage (info, debug) |
-| Doorman | DoormanServer (Enterprise only) | Some logging (info, warn, error), and use of `println` |
-
-Corda core flows:
-
-| Flow name | Logging | Exception handling | Progress Tracking |
-| --- | --- | --- | --- |
-| FinalityFlow | none | NotaryException | NOTARISING, BROADCASTING |
-| NotaryFlow | none | NotaryException (NotaryError types: TimeWindowInvalid, TransactionInvalid, WrongNotary), IllegalStateException, some via `check` assertions | REQUESTING, VALIDATING |
-| NotaryChangeFlow | none | StateReplacementException | SIGNING, NOTARY |
-| SendTransactionFlow | none | FetchDataFlow.HashNotFound (FlowException) | none |
-| ReceiveTransactionFlow | none | SignatureException, AttachmentResolutionException, TransactionResolutionException, TransactionVerificationException | none |
-| ResolveTransactionsFlow | none | FetchDataFlow.HashNotFound (FlowException), ExcessivelyLargeTransactionGraph (FlowException) | none |
-| FetchAttachmentsFlow | none | FetchDataFlow.HashNotFound | none |
-| FetchTransactionsFlow | none | FetchDataFlow.HashNotFound | none |
-| FetchDataFlow | some logging (info) |
-| FetchDataFlow | some logging (info) | FetchDataFlow.HashNotFound | none |
-| AbstractStateReplacementFlow.Instigator | none | StateReplacementException | SIGNING, NOTARY |
-| AbstractStateReplacementFlow.Acceptor | none | StateReplacementException | VERIFYING, APPROVING |
-| CollectSignaturesFlow | none | IllegalArgumentException via `require` assertions | COLLECTING, VERIFYING |
-| CollectSignatureFlow | none | as above | none |
-| SignTransactionFlow | none | FlowException, possibly other (general) Exception | RECEIVING, VERIFYING, SIGNING |
-| ContractUpgradeFlow | none | FlowException | none |
-
-Corda finance flows:
-
-| Flow name | Logging | Exception handling | Progress Tracking |
-| -------------------------- | ------- | ---------------------------------------- | ---------------------------------------- |
-| AbstractCashFlow | none | CashException (FlowException) | GENERATING_ID, GENERATING_TX, SIGNING_TX, FINALISING_TX |
-| CashIssueFlow | none | CashException (via call to `FinalityFlow`) | GENERATING_TX, SIGNING_TX, FINALISING_TX |
-| CashPaymentFlow | none | CashException (caused by `InsufficientBalanceException` or thrown by `FinalityFlow`), SwapIdentitiesException | GENERATING_ID, GENERATING_TX, SIGNING_TX, FINALISING_TX |
-| CashExitFlow | none | CashException (caused by `InsufficientBalanceException` or thrown by `FinalityFlow`) | GENERATING_TX, SIGNING_TX, FINALISING_TX |
-| CashIssueAndPaymentFlow | none | any thrown by `CashIssueFlow` and `CashPaymentFlow` | as `CashIssueFlow` and `CashPaymentFlow` |
-| TwoPartyDealFlow.Primary | none | | GENERATING_ID, SENDING_PROPOSAL |
-| TwoPartyDealFlow.Secondary | none | IllegalArgumentException via `require` assertions | RECEIVING, VERIFYING, SIGNING, COLLECTING_SIGNATURES, RECORDING |
-| TwoPartyTradeFlow.Seller | none | FlowException, IllegalArgumentException via `require` assertions | AWAITING_PROPOSAL, VERIFYING_AND_SIGNING |
-| TwoPartyTradeFlow.Buyer | none | IllegalArgumentException via `require` assertions, IllegalStateException | RECEIVING, VERIFYING, SIGNING, COLLECTING_SIGNATURES, RECORDING |
-
-Confidential identities flows:
-
-| Flow name | Logging | Exception handling | Progress Tracking |
-| ------------------------ | ------- | ---------------------------------------- | ---------------------------------------- |
-| SwapIdentitiesFlow | | | |
-| IdentitySyncFlow.Send | none | IllegalArgumentException via `require` assertions, IllegalStateException | SYNCING_IDENTITIES |
-| IdentitySyncFlow.Receive | none | CertificateExpiredException, CertificateNotYetValidException, InvalidAlgorithmParameterException | RECEIVING_IDENTITIES, RECEIVING_CERTIFICATES |
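-
-As context for the Progress Tracking columns above, the following is a minimal sketch (not taken from the codebase; the flow and step names are illustrative) of how a flow can pair `ProgressTracker` steps with explicit logging, addressing both gaps at once:
-
-```kotlin
-import co.paralleluniverse.fibers.Suspendable
-import net.corda.core.flows.FlowLogic
-import net.corda.core.utilities.ProgressTracker
-
-class AuditedExampleFlow : FlowLogic<Unit>() {
-    companion object {
-        object VERIFYING : ProgressTracker.Step("Verifying the proposal")
-        object SIGNING : ProgressTracker.Step("Signing the transaction")
-    }
-
-    override val progressTracker = ProgressTracker(VERIFYING, SIGNING)
-
-    @Suspendable
-    override fun call() {
-        progressTracker.currentStep = VERIFYING  // step changes are observable as business audit events
-        logger.info("Verifying proposal")
-        // ... verification logic elided ...
-        progressTracker.currentStep = SIGNING
-        logger.info("Signing transaction")
-    }
-}
-```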
-
-## Appendix C - Apache Artemis JMX Event types and Queuing Metrics
-
-The following table contains a list of Notification Types and the associated perceived importance of each to a Corda node at run-time:
-
-| Name | Code | Importance |
-| --------------------------------- | :--: | ---------- |
-| BINDING_ADDED | 0 | |
-| BINDING_REMOVED | 1 | |
-| CONSUMER_CREATED | 2 | Medium |
-| CONSUMER_CLOSED | 3 | Medium |
-| SECURITY_AUTHENTICATION_VIOLATION | 6 | Very high |
-| SECURITY_PERMISSION_VIOLATION | 7 | Very high |
-| DISCOVERY_GROUP_STARTED | 8 | |
-| DISCOVERY_GROUP_STOPPED | 9 | |
-| BROADCAST_GROUP_STARTED | 10 | N/A |
-| BROADCAST_GROUP_STOPPED | 11 | N/A |
-| BRIDGE_STARTED | 12 | High |
-| BRIDGE_STOPPED | 13 | High |
-| CLUSTER_CONNECTION_STARTED | 14 | Soon |
-| CLUSTER_CONNECTION_STOPPED | 15 | Soon |
-| ACCEPTOR_STARTED | 16 | |
-| ACCEPTOR_STOPPED | 17 | |
-| PROPOSAL | 18 | |
-| PROPOSAL_RESPONSE | 19 | |
-| CONSUMER_SLOW | 21 | High |
-
-The following table summarises the types of metrics associated with Message Queues:
-
-| Metric | Description |
-| ----------------- | ---------------------------------------- |
-| count | total number of messages added to the queue since the server started |
-| countDelta | number of messages added to the queue *since the last message counter update* |
-| messageCount | *current* number of messages in the queue |
-| messageCountDelta | *overall* number of messages added to/removed from the queue *since the last message counter update*. A positive value indicates more messages were added than removed; a negative value indicates the reverse. |
-| lastAddTimestamp | timestamp of the last time a message was added to the queue |
-| updateTimestamp | timestamp of the last message counter update |
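-
-For illustration, these attributes can be read over JMX with the standard `javax.management` API; note that the exact Artemis `ObjectName` layout below is an assumption and varies by broker version and configuration:
-
-```kotlin
-import java.lang.management.ManagementFactory
-import javax.management.ObjectName
-
-// Reads the current depth of a queue from a co-located broker's MBean server.
-fun messageCount(queue: String): Long {
-    val server = ManagementFactory.getPlatformMBeanServer()
-    val name = ObjectName(
-        "org.apache.activemq.artemis:broker=\"corda\",component=addresses," +
-            "address=\"$queue\",subcomponent=queues,routing-type=\"anycast\",queue=\"$queue\""
-    )
-    return server.getAttribute(name, "MessageCount") as Long
-}
-```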
diff --git a/docs/source/design/reference-states/design.md b/docs/source/design/reference-states/design.md
deleted file mode 100644
index 619bbdd094..0000000000
--- a/docs/source/design/reference-states/design.md
+++ /dev/null
@@ -1,175 +0,0 @@
-# Reference states
-
-## Overview
-
-See a prototype implementation here: https://github.com/corda/corda/pull/2889
-
-There is an increasing need for Corda to support use-cases which require reference data which is issued and updated by specific parties, but available for use, by reference, in transactions built by other parties.
-
-Why is this type of reference data required? A key benefit of blockchain systems is that everybody is sure they see the same as their counterpart - and for this to work in situations where accurate processing depends on reference data, everybody must be operating on the same reference data. This, in turn, requires any given piece of reference data to be uniquely identifiable, and requires that any given transaction must be certain to be operating on the most current version of that reference data. In cases where the latter condition applies, only the notary can attest to this fact, and this, in turn, means the reference data must be in the form of an unconsumed state.
-
-This document outlines the approach for adding support for this type of reference data to the Corda transaction model via a new approach called "reference input states".
-
-## Background
-
-Firstly, it is worth considering the types of reference data on Corda and how they are distributed:
-
-1. **Rarely changing universal reference data.** Such as currency codes and holiday calendars. This type of data can be added to transactions as attachments and referenced within contracts, if required. This data would only change based upon the decision of an international standards body, for example; therefore it is not critical to check the data is current each time it is used.
-2. **Constantly changing reference data.** Typically, this type of data must be collected and aggregated by a central party. Oracles can be used as a central source of truth for this type of constantly changing data. There are multiple examples of making transaction validity contingent on data provided by Oracles (the IRS demo and the SIMM demo). The Oracle asserts the data was valid at the time it was provided.
-3. **Periodically changing subjective reference data.** Reference data provided by entities such as bond issuers, where the data changes frequently enough to warrant users of the data checking that it is current.
-
-At present, periodically changing subjective data can only be provided via:
-
-* Oracles,
-* Attachments,
-* Regular contract states, or alternatively,
-* kept off-ledger entirely
-
-However, none of these solutions is optimal, for reasons discussed in later sections of this design document.
-
-As such, this design document introduces the concept of a "reference input state", which is a better way to serve "periodically changing subjective reference data" on Corda.
-
-A reference input state is a `ContractState` which can be referred to in a transaction by the contracts of input and output states, but whose contract is not executed as part of the transaction verification process, and which is not consumed when the transaction is committed to the ledger but _is_ checked for "current-ness". In other words, the contract logic is skipped only for the referencing transaction; a reference state is still a normal state when it occurs in an input or output position.
-
-Reference data states will enable many parties to "reuse" the same state in their transactions as reference data whilst still allowing the reference data state owner the capability to update the state. When data distribution groups are available, reference state owners will be able to distribute updates to subscribers more easily. Currently, distribution would have to be performed manually.
-
-Reference input states can be added to Corda by adding a new transaction component group that allows developers to add reference data `ContractState`s that are not consumed when the transaction is committed to the ledger. This eliminates the problems created by long chains of provenance and by contention, and allows developers to use any `ContractState` for reference data. The feature should allow developers to add _any_ `ContractState` available in their vault, even if they are not a `participant`, whilst nevertheless providing a guarantee that the state being used is the most recent version of that piece of information.
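-
-To make the concept concrete, here is a sketch of what a contract consulting a reference state could look like. The `SanctionsList` and `PaymentState` types are hypothetical, and `referenceInputsOfType` stands in for the `LedgerTransaction` helpers proposed in the Target Solution below:
-
-```kotlin
-import net.corda.core.contracts.Contract
-import net.corda.core.contracts.ContractState
-import net.corda.core.identity.AbstractParty
-import net.corda.core.identity.Party
-import net.corda.core.transactions.LedgerTransaction
-
-// Hypothetical reference state: a sanctions list maintained by its issuer.
-data class SanctionsList(val banned: Set<Party>, val issuer: Party) : ContractState {
-    override val participants: List<AbstractParty> get() = listOf(issuer)
-}
-
-data class PaymentState(val payer: Party, val payee: Party) : ContractState {
-    override val participants: List<AbstractParty> get() = listOf(payer, payee)
-}
-
-class PaymentContract : Contract {
-    override fun verify(tx: LedgerTransaction) {
-        // The sanctions list is only read: its own contract is not run, and the state is not consumed.
-        val sanctions = tx.referenceInputsOfType(SanctionsList::class.java).single()
-        require(tx.outputsOfType(PaymentState::class.java).none { it.payee in sanctions.banned }) {
-            "The payee must not appear on the referenced sanctions list"
-        }
-    }
-}
-```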
-
-## Scope
-
-Goals
-
-* Add the capability to Corda transactions to support reference states
-
-Non-goals (i.e. out of scope)
-
-* Data distribution groups are required to realise the full potential of reference data states. This design document does not discuss data distribution groups.
-
-## Requirements
-
-1. Reference states can be any `ContractState` created by one or more `Party`s and subsequently updated by those `Party`s. E.g. `Cash`, `CompanyData`, `InterestRateSwap`, `FxRate`. Reference states can be `OwnableState`s, but it's more likely they will be `LinearState`s.
-2. Any `Party` with a `StateRef` for a reference state should be able to add it to a transaction to be used as a reference, even if they are not a `participant` of the reference state.
-3. The contract code for reference states should not be executed. However, reference data states can be referred to by the contracts of `ContractState`s in the input and output lists.
-4. `ContractState`s should not be consumed when used as reference data.
-5. Reference data must be current; therefore, when reference data states are used in a transaction, notaries should check that they have not been consumed before.
-6. To ensure determinism of the contract verification process, reference data states must be in scope for the purposes of transaction resolution. This is because, whilst users of the reference data are not consuming the state, they must be sure that the series of transactions that created and evolved the state were executed validly.
-
-**Use-cases:**
-
-The canonical use-case for reference states: *KYC*
-
-* KYC data can be distributed as reference data states.
-* KYC data states are only updatable by the data owner.
-* Usable by any party - transaction verification can be conditional on this KYC/reference data.
-* The notary ensures the data is current.
-
-Collateral reporting:
-
-* Imagine a bank needs to provide evidence to another party (like a regulator) that it holds certain states, such as cash and collateral, for liquidity reporting purposes.
-* The regulator holds a liquidity reporting state that maintains a record of past collateral reports and automates the handling of current reports using some contract code.
-* To update the liquidity reporting state, the regulator needs to include the bank's cash/collateral states in a transaction - the contract code checks available collateral vs requirements. By doing this, the cash/collateral states would be consumed, which is not desirable.
-* Instead, what if those cash/collateral states could be referenced in a transaction but not consumed? And at the same time, the notary still checks whether the cash/collateral states are current or not (i.e. does the bank still own them).
-
-Other uses:
-
-* Distributing reference data for financial instruments. E.g. bond issuance details created, updated and distributed by the bond issuer rather than a third party.
-* Account level data included in cash payment transactions.
-
-## Design Decisions
-
-There are various other ways to implement reference data on Corda, discussed below:
-
-**Regular contract states**
-
-Currently, the transaction model is too cumbersome to support reference data as unconsumed states, for the following reasons:
-
-* Contract verification is required for the `ContractState`s used as reference data. This limits the use of states such as `Cash` as reference data (unless a special "reference" command is added which allows a "NOOP" state transaction to assert that no changes were made.)
-* As such, whenever an input state reference is added to a transaction as reference data, an output state must be added, otherwise the state will be extinguished. This results in long chains of unnecessarily duplicated data.
-* Long chains of provenance result in confidentiality breaches, as down-stream users of the reference data state see all the prior uses of the reference data in the chain of provenance. This is an important point: it means that two parties who have no business relationship and care little about each other's transactions nevertheless find themselves intimately bound: should one of them rely on a piece of common reference data in a transaction, the other one will not only need to be informed but will need to be furnished with a copy of the transaction.
-* Reference data states will likely be used by many parties, so they will become highly contended. Parties will "race" to use the reference data. The latest copy must be continually distributed to all that require it.
-
-**Attachments**
-
-Of course, attachments can be used to store and share reference data. This approach does solve the contention issue around reference data as regular contract states. However, attachments don't allow users to ascertain whether they are working on the most recent copy of the data. Given that it's crucial to know whether reference data is current, attachments cannot provide a workable solution here.
-
-The other issue with attachments is that they do not give an intrinsic "format" to data, like state objects do. This makes working with attachments much harder, as their contents are effectively bespoke. Whilst a data format tool could be written, it's more convenient to work with state objects.
-
-**Oracles**
-
-Whilst Oracles could provide a solution for periodically changing reference data, they introduce unnecessary centralisation and are onerous to implement for each class of reference data. Oracles don't feel like an optimal solution here.
-
-**Keeping reference data off-ledger**
-
-It makes sense to push as much verification as possible into the contract code, otherwise why bother having it? Performing verification inside flows is generally not a good idea, as the flows can be re-written by malicious developers. In almost all cases, it is much more difficult to change the contract code. If transaction verification can be conditional on reference data included in a transaction, as a state, then the result is a more robust and secure ledger (and audit trail).
-
-## Target Solution
-
-Changes required (see the builder sketch after this list for what items 5 and 8 imply):
-
-1. Add a `references` property of type `List<StateRef>` (and `List<StateAndRef<ContractState>>` for `FullTransaction`s) to all the transaction types.
-2. Add a `REFERENCE_STATES` component group.
-3. Amend the notary flows to check that reference states are current (but do not consume them).
-4. Add a `ReferencedStateAndRef` class that encapsulates a `StateAndRef`; this is so `TransactionBuilder.withItems` can delineate between `StateAndRef`s and state references.
-5. Add a `StateAndRef.referenced` method which wraps a `StateAndRef` in a `ReferencedStateAndRef`.
-6. Add helper methods to `LedgerTransaction` to get `references` by type, etc.
-7. Add a check to the transaction classes that asserts all references and inputs are on the same notary.
-8. Add a method to `TransactionBuilder` to add a reference state.
-9. Update the transaction resolution flow to resolve references.
-10. Update the transaction and ledger DSLs to support references.
-11. No changes are required to contract upgrade or notary change transactions.
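-
-A sketch of what items 5 and 8 imply for transaction building, reusing the hypothetical states from the earlier example (the API shown is the proposed one, not a shipped interface):
-
-```kotlin
-import net.corda.core.contracts.StateAndRef
-import net.corda.core.identity.Party
-import net.corda.core.transactions.TransactionBuilder
-
-fun buildPayment(notary: Party, sanctions: StateAndRef<SanctionsList>, payment: PaymentState) =
-    TransactionBuilder(notary)
-        .addReferenceState(sanctions.referenced())  // item 5 wraps the StateAndRef; item 8 adds it as a reference
-        .addOutputState(payment, PaymentContract::class.java.name)
-```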
-
-Implications:
-
-**Versioning**
-
-This can be done in a backwards compatible way. However, a minimum platform version must be mandated. Nodes running on an older version of Corda will not be able to verify transactions which include references. Indeed, contracts which refer to `references` will fail at run-time on older nodes.
-
-**Privacy**
-
-Reference states will be visible to all that possess a chain of provenance including them. There are potential implications from a data protection perspective here. Creators of reference data must be careful **not** to include sensitive personal data.
-
-Outstanding issues:
-
-**Oracle choice**
-
-If the party building a transaction is using a reference state which they do not own, they must move their states to the reference state's notary. If two or more reference states with different notaries are used, then the transaction cannot be committed, as there is no notary change solution that works without asking the reference state owner to change the notary.
-
-This can be mitigated by requesting that reference state owners distribute reference states for all notaries. This solution doesn't work for `OwnableState`s used as reference data, as `OwnableState`s should be unique. However, in most cases it is anticipated that the users of `OwnableState`s as reference data will be the owners of those states.
-
-This solution introduces a new issue, in that nodes may store the same piece of reference data under different linear IDs. `TransactionBuilder`s would also need to know the required notary before a reference state is added.
-
-**Syndication of reference states**
-
-In the absence of data distribution groups, reference data must be manually transmitted to those that require it. Pulling risks effectively DoS-attacking nodes that own reference data used by many frequent users. Pushing requires reference data owners to be aware of all current users of the reference data. A temporary solution is required before data distribution groups are implemented.
-
-Initial thoughts are that pushing reference states is the better approach.
-
-**Interaction with encumbrances**
-
-It is likely not possible to reference encumbered states unless the encumbrance state is also referenced. For example, a cash state referenced for collateral reporting purposes may have been "seized" and thus encumbered by a regulator, and so cannot be counted for the collateral report.
-
-**What happens if a state is added to a transaction as an input as well as an input reference state?**
-
-There is an edge case where a developer might erroneously add the same `StateRef` as an input state _and_ an input reference state. The effect is referring to reference data that immediately becomes out of date! This edge case should be prevented, as it is likely to confuse CorDapp developers.
-
-**Handling of update races**
-
-Usage of a referenced state may race with an update to it. This would cause a notarisation failure; however, the flow cannot simply loop and re-calculate the transaction, because it has not necessarily seen the updated transaction yet (it may be a slow broadcast).
-
-Therefore, it would make sense to extend the flows API with a new flow, call it `WithReferencedStatesFlow`, that is given a set of `LinearId`s and a factory that instantiates a subflow given a set of resolved `StateAndRef`s.
-
-It does the following:
-
-1. Checks that those linear IDs are in the vault and throws if not.
-2. Resolves the linear IDs to the tip `StateAndRef`s.
-3. Creates the subflow, passing the resolved `StateAndRef`s to the factory, and then invokes it.
-4. If the subflow throws a `NotaryException` because it tried to finalise and failed, that exception is caught and examined. If the failure was due to a conflict on a referenced state, the flow suspends until that state has been updated in the vault (there is already an API to wait for a transaction, but here the flow must wait for a state update).
-5. Then it re-does the initial calculation, re-creates the subflow with the new resolved tips using the factory, and re-runs it as a new subflow.
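-
-A sketch of that flow's retry loop is below. This is hypothetical: the resolution and vault-waiting helpers are stand-ins, and only the conflict-handling shape described in steps 1-5 is shown:
-
-```kotlin
-import co.paralleluniverse.fibers.Suspendable
-import net.corda.core.contracts.ContractState
-import net.corda.core.contracts.StateAndRef
-import net.corda.core.flows.FlowLogic
-import net.corda.core.flows.NotaryError
-import net.corda.core.flows.NotaryException
-
-class WithReferencedStatesFlow<T>(
-    private val resolveTips: () -> List<StateAndRef<ContractState>>,               // steps 1-2 (hypothetical helper)
-    private val subFlowFactory: (List<StateAndRef<ContractState>>) -> FlowLogic<T> // step 3: factory for the subflow
-) : FlowLogic<T>() {
-    @Suspendable
-    override fun call(): T {
-        while (true) {
-            val tips = resolveTips()
-            try {
-                return subFlow(subFlowFactory(tips))    // steps 3-4: run and watch for conflicts
-            } catch (e: NotaryException) {
-                val conflict = e.error as? NotaryError.Conflict ?: throw e
-                waitForUpdate(conflict)                 // step 4: suspend until the state is updated
-            }                                           // step 5: loop and re-run with fresh tips
-        }
-    }
-
-    @Suspendable
-    private fun waitForUpdate(conflict: NotaryError.Conflict) { /* vault-watching elided */ }
-}
-```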
-
-Care must be taken to handle progress tracking correctly in case of loops.
\ No newline at end of file
diff --git a/docs/source/design/sgx-infrastructure/ExampleSGXdeployment.png b/docs/source/design/sgx-infrastructure/ExampleSGXdeployment.png
deleted file mode 100644
index 1ccdcddc65..0000000000
Binary files a/docs/source/design/sgx-infrastructure/ExampleSGXdeployment.png and /dev/null differ
diff --git a/docs/source/design/sgx-infrastructure/decisions/certification.md b/docs/source/design/sgx-infrastructure/decisions/certification.md
deleted file mode 100644
index 7bc3e55034..0000000000
--- a/docs/source/design/sgx-infrastructure/decisions/certification.md
+++ /dev/null
@@ -1,69 +0,0 @@
-![Corda](https://www.corda.net/wp-content/uploads/2016/11/fg005_corda_b.png)
-
---------------------------------------------
-Design Decision: CPU certification method
-============================================
-
-## Background / Context
-
-Remote attestation is done in two main steps.
-
-1. Certification of the CPU. This boils down to some kind of Intel signature over a key that only a specific enclave has access to.
-2. Using the certified key to sign business-logic-specific enclave quotes and providing the full chain of trust to challengers.
-
-This design question concerns the way we can manage a certification key. A more detailed description can be found [here](../details/attestation.md).
-
-## Options Analysis
-
-### A. Use Intel's recommended protocol
-
-This involves using ``aesmd`` and the Intel SDK to establish an opaque attestation key that transparently signs quotes. Then for each enclave we need to do several round trips to IAS to get a revocation list (which we don't need) and request a direct Intel signature over the quote (which we shouldn't need, as the trust has already been established during the EPID join).
-
-#### Advantages
-
-1. We have a PoC implemented that does this
-
-#### Disadvantages
-
-1. Frequent round trips to Intel infrastructure
-2. Intel can reproduce the certifying private key
-3. Involves unnecessary protocol steps and features we don't need (EPID)
-
-### B. Use Intel's protocol to bootstrap our own certificate
-
-This involves using Intel's current attestation protocol to have Intel sign over our own certifying enclave's certificate, which derives its certification key using the sealing fuse values.
-
-#### Advantages
-
-1. Certifying key not reproducible by Intel
-2. Allows for our own CPU enrollment process, should we need one
-3. Infrequent round trips to Intel infrastructure (only needed once per microcode update)
-
-#### Disadvantages
-
-1. Still uses the EPID protocol
-
-### C. Intercept Intel's recommended protocol
-
-This involves using Intel's current protocol as is, but instead of doing round trips to IAS to get signatures over quotes, we try to establish the chain of trust during EPID provisioning and reuse it later.
-
-#### Advantages
-
-1. Uses Intel's current protocol
-2. Infrequent round trips to Intel infrastructure
-
-#### Disadvantages
-
-1. The provisioning protocol is underdocumented and it's hard to decipher how to construct the trust chain
-2. The chain of trust is not a traditional certificate chain but rather a sequence of signed messages
-
-## Recommendation and justification
-
-Proceed with Option B. This is the most readily available and flexible option.
diff --git a/docs/source/design/sgx-infrastructure/decisions/enclave-language.md b/docs/source/design/sgx-infrastructure/decisions/enclave-language.md
deleted file mode 100644
index 2226116d65..0000000000
--- a/docs/source/design/sgx-infrastructure/decisions/enclave-language.md
+++ /dev/null
@@ -1,59 +0,0 @@
-![Corda](https://www.corda.net/wp-content/uploads/2016/11/fg005_corda_b.png)
-
---------------------------------------------
-Design Decision: Enclave language of choice
-============================================
-
-## Background / Context
-
-In the long run we would like to use the JVM for all enclave code. This is so that later on we can solve the problem of side channel attacks at the bytecode level (e.g. oblivious RAM) rather than putting this burden on the implementors of enclave functionality.
-
-As we plan to use a JVM in the long run anyway, and we already have an embedded Avian implementation, I think the best course of action is to use this immediately, together with the full JDK. To keep the native layer as minimal as possible, we should forward enclave calls to the embedded JVM with little to no marshalling. All subsequent sanity checks, including the ones currently handled by the edger8r-generated code, should be done inside the JVM. Access to native enclave functionality (including OCALLs and reading memory from the untrusted heap) should go through a centrally defined JNI interface. This way, when we switch away from Avian, we have a very clear interface to code against, both from the hosted code's side and from the ECALL/OCALL side.
-
-The question remains what the thin native layer should be written in. Currently we use C++, but various alternatives have popped up, most notably Rust.
-
-## Options Analysis
-
-### A. C++
-
-#### Advantages
-
-1. The Intel SDK is written in C++
-2. [Reproducible binaries](https://wiki.debian.org/ReproducibleBuilds)
-3. The native parts of Avian, HotSpot and SubstrateVM are written in C/C++
-
-#### Disadvantages
-
-1. Unsafe memory accesses (unless strictly adhering to modern C++)
-2. Quirky build
-3. Larger attack surface
-
-### B. Rust
-
-#### Advantages
-
-1. Safe memory accesses
-2. Easier to read/write code, easier to audit
-
-#### Disadvantages
-
-1. Does not currently produce reproducible binaries (but it's [planned](https://github.com/rust-lang/rust/issues/34902))
-2. We would mostly be using it for unsafe things (raw pointers, calling C++ code)
-
-## Recommendation and justification
-
-Proceed with Option A (C++) and keep the native layer as small as possible. Rust currently doesn't produce reproducible binary code, and we need the native layer mostly to handle raw pointers and call Intel SDK functions anyway, so we wouldn't really leverage Rust's safe memory features.
-
-Having said that, once Rust implements reproducible builds we may switch to it; in that case the thinness of the native layer will be a big benefit.
diff --git a/docs/source/design/sgx-infrastructure/decisions/kv-store.md b/docs/source/design/sgx-infrastructure/decisions/kv-store.md
deleted file mode 100644
index 7c177b37a7..0000000000
--- a/docs/source/design/sgx-infrastructure/decisions/kv-store.md
+++ /dev/null
@@ -1,58 +0,0 @@
-![Corda](https://www.corda.net/wp-content/uploads/2016/11/fg005_corda_b.png)
-
---------------------------------------------
-Design Decision: Key-value store implementation
-============================================
-
-This is a simple choice of technology.
-
-## Options Analysis
-
-### A. ZooKeeper
-
-#### Advantages
-
-1. Tried and tested
-2. The HA team already uses ZooKeeper
-
-#### Disadvantages
-
-1. Clunky API
-2. No HTTP API
-3. Hand-rolled protocol
-
-### B. etcd
-
-#### Advantages
-
-1. Very simple API, UNIX philosophy
-2. gRPC
-3. Tried and tested
-4. MVCC
-5. Kubernetes already uses it in the background
-6. "Successor" of ZooKeeper
-7. Cross-platform, OSX and Windows support
-8. Resiliency, supports backups for disaster recovery
-
-#### Disadvantages
-
-1. The HA team uses ZooKeeper
-
-### C. Consul
-
-#### Advantages
-
-1. End-to-end discovery, including UIs
-
-#### Disadvantages
-
-1. Not very widespread
-2. Need to store other metadata as well
-3. The HA team uses ZooKeeper
-
-## Recommendation and justification
-
-Proceed with Option B (etcd). It's practically a successor of ZooKeeper; the interface is quite simple, it focuses on primitives (CAS, leases, watches, etc.) and it is tried and tested by many heavily used applications, most notably Kubernetes. In fact, we have the option to use etcd indirectly by writing Kubernetes extensions; this would have the advantage of getting readily available CLI and UI tools to manage an enclave cluster.
diff --git a/docs/source/design/sgx-infrastructure/decisions/roadmap.md b/docs/source/design/sgx-infrastructure/decisions/roadmap.md
deleted file mode 100644
index cbb7899881..0000000000
--- a/docs/source/design/sgx-infrastructure/decisions/roadmap.md
+++ /dev/null
@@ -1,81 +0,0 @@
-![Corda](https://www.corda.net/wp-content/uploads/2016/11/fg005_corda_b.png)
-
---------------------------------------------
-Design Decision: Strategic SGX roadmap
-============================================
-
-## Background / Context
-
-The statefulness of the enclave greatly affects the complexity of both the infrastructure and attestation. The infrastructure needs to take care of tracking enclave state for request routing, and we need extra care if we want to make sure that old keys cannot be used to reveal sealed secrets.
-
-As the first step, the easiest thing to do would be to provide an infrastructure for hosting *stateless* enclaves that are only concerned with enclave to non-enclave attestation. This provides a framework for doing provable computations, without the headache of handling sealed state and the various implied upgrade paths.
-
-In the first phase we want to facilitate the easy roll-out of full enclave images (JAR linked into the image), regardless of what the enclaves are doing internally. The contract of an enclave is the host-enclave API (attestation protocol) and the exposure of the static set of channels the enclave supports. Furthermore, the infrastructure will allow deployment in a cloud environment and trivial scalability of enclaves through starting them on demand.
-
-The first phase will allow for a "fixed stateless provable computations as a service" product, e.g. provable builds or RNG.
-
-The question remains how we should proceed afterwards. In terms of infrastructure we have a choice of implementing sealed state or focusing on dynamic loading of bytecode. We also have the option to delay this decision until the end of the first phase.
-
-## Options Analysis
-
-### A. Implement sealed state
-
-Implementing sealed state involves solving the routing problem, for which we can use the concept of active channel sets. Furthermore, we need to solve various additional security issues around guarding sealed secret provisioning, most notably expiration checks.
-This would involve implementing a future-proof calendar time oracle, which may turn out to be impossible, or not quite good enough. We may decide that we cannot actually provide strong privacy guarantees and need to enforce epochs as mentioned [here](../details/time.md).
-
-#### Advantages
-
-1. We would solve long-term secret persistence early, allowing a longer time frame for testing upgrades and reprovisioning before we integrate Corda
-2. Allows a "fixed stateful provable computations as a service" product, e.g. HA encryption
-
-#### Disadvantages
-
-1. There are some unsolved issues (calendar time, sealing epochs)
-2. It would delay non-stateful Corda integration
-
-### B. Implement dynamic code loading
-
-Implementing dynamic loading involves sandboxing of the bytecode, providing bytecode verification and perhaps storage/caching of JARs (although it may be better to develop a more generic caching layer and use channels themselves to do the upload). Doing bytecode verification is quite involved, as Avian does not support verification, so this would mean switching to a different JVM. This JVM would be either HotSpot or SubstrateVM; we are doing some preliminary exploratory work to assess their feasibility. If we choose this path, it opens up the first true integration point with Corda by enabling semi-validating notaries: non-validating notaries that check an SGX signature over the transaction. It would also enable an entirely separate generic product for verifiable pure computation.
-
-#### Advantages
-
-1. Early adoption of Graal if we choose to go with it (the alternative is HotSpot)
-2. Allows the first integration with Corda (semi-validating notaries)
-3. Allows a "generic stateless provable computation as a service" product, i.e. anything expressible as a JAR
-4. Holds off on sealed state
-
-#### Disadvantages
-
-1. Too-early Graal integration may result in a maintenance headache later
-
-## Recommendation and justification
-
-Proceed with Option B, dynamic code loading. It would make us very early adopters of Graal (with the implied ups and downs) and, most importantly, kickstart collaboration between R3 and Oracle. We would also move away from Avian, which we wanted to do anyway. It would also give us more time to think about the issues around sealed state and do exploratory work on potential solutions, and there may be further development from Intel's side. Furthermore, we need dynamic loading for any fully fledged Corda integration, so we should finish this ASAP.
-
-## Appendix: Proposed roadmap breakdown
-
-![Dynamic code loading first](roadmap.png)
\ No newline at end of file
diff --git a/docs/source/design/sgx-infrastructure/decisions/roadmap.png b/docs/source/design/sgx-infrastructure/decisions/roadmap.png
deleted file mode 100644
index 038f3430e6..0000000000
Binary files a/docs/source/design/sgx-infrastructure/decisions/roadmap.png and /dev/null differ
diff --git a/docs/source/design/sgx-infrastructure/design.md b/docs/source/design/sgx-infrastructure/design.md
deleted file mode 100644
index 0c3bf30ef1..0000000000
--- a/docs/source/design/sgx-infrastructure/design.md
+++ /dev/null
@@ -1,84 +0,0 @@
-# SGX Infrastructure design
-
-.. important:: This design document describes a feature of Corda Enterprise.
-
-This document is intended as a design description of the infrastructure around the hosting of SGX enclaves, interaction with enclaves and storage of encrypted data.
-It assumes basic knowledge of SGX concepts, and some knowledge of Kubernetes for the parts specific to that.
-
-## High level description
-
-The main idea behind the infrastructure is to provide a highly available cluster of enclave services (hosts) which can serve enclaves on demand. It provides an interface for enclave business logic that is agnostic with regard to the infrastructure, similar to serverless architectures. The enclaves will use an opaque reference to other enclaves or services in the form of enclave channels. Channels hide attestation details and provide a loose coupling between enclave/non-enclave functionality and the specific enclave images/services implementing it. This loose coupling allows easier upgrade of enclaves, relaxed trust (whitelisting), dynamic deployment, and horizontal scaling, as we can spin up enclaves dynamically on demand when a channel is requested.
-
-For more information see:
-
-.. toctree::
-   :maxdepth: 1
-
-   details/serverless.md
-   details/channels.md
-
-## Infrastructure components
-
-Here are the major components of the infrastructure. Note that this doesn't include business-logic-specific infrastructure pieces (like ORAM blob storage for Corda privacy model integration).
-
-.. toctree::
-   :maxdepth: 1
-
-   details/kv-store.md
-   details/discovery.md
-   details/host.md
-   details/enclave-storage.md
-   details/ias-proxy.md
-
-## Infrastructure interactions
-
-* **Enclave deployment**:
-  This includes uploading of the enclave image/container to enclave storage and adding of the enclave metadata to the key-value store.
-
-* **Enclave usage**:
-  This includes using the discovery service to find a specific enclave image and a host to serve it, then connecting to the host, authenticating (attestation) and proceeding with the needed functionality.
-
-* **Ops**:
-  This includes management of the cluster (Kubernetes/Kubespray) and management of the metadata relating to discovery to control enclave deployment (e.g. canary, incremental, rollback).
-
-## Decisions to be made
-
-.. toctree::
-   :maxdepth: 1
-
-   decisions/roadmap.md
-   decisions/certification.md
-   decisions/enclave-language.md
-   decisions/kv-store.md
-
-## Further details
-
-.. toctree::
-   :maxdepth: 1
-
-   details/attestation.md
-   details/time.md
-   details/enclave-deployment.md
-
-## Example deployment
-
-This is an example of how two Corda parties may use the above infrastructure. In this example R3 hosts the IAS proxy and the enclave image store, and the parties host the rest of the infrastructure, aside from the Intel components.
-
-Note that this is flexible: the parties may decide to host their own proxies (as long as they whitelist their keys) or the enclave image store (although R3 will need to have a repository of the signed enclaves somewhere). We may also decide to go the other way and have R3 host the enclave hosts and the discovery service, shared between parties (if e.g. they don't have access to, or don't want to maintain, SGX-capable boxes).
-
-![Example SGX deployment](ExampleSGXdeployment.png)
\ No newline at end of file
diff --git a/docs/source/design/sgx-infrastructure/details/attestation.md b/docs/source/design/sgx-infrastructure/details/attestation.md
deleted file mode 100644
index d7ba9bf5dc..0000000000
--- a/docs/source/design/sgx-infrastructure/details/attestation.md
+++ /dev/null
@@ -1,92 +0,0 @@
-### Terminology recap
-
-* **measurement**: The hash of an enclave image, uniquely pinning the code and related configuration
-* **report**: A data structure produced by an enclave, including the measurement and other non-static properties of the running enclave instance (like the security version number of the hardware)
-* **quote**: A signed report of an enclave, produced by Intel's quoting enclave
-
-# Attestation
-
-The goal of attestation is to authenticate enclaves. We are concerned with two variants of this: enclave to non-enclave attestation and enclave to enclave attestation.
-
-In order to authenticate an enclave we need to establish a chain of trust rooted in an Intel signature certifying that a report is coming from an enclave running on genuine Intel hardware.
-
-Intel's recommended attestation protocol is split into two phases.
-
-1. Provisioning
-
-   The first phase's goal is to establish an Attestation Key (AK), aka the EPID key, unique to the SGX installation. The establishment of this key uses an underdocumented protocol similar to the attestation protocol:
-
-   Intel provides a Provisioning Certification Enclave (PCE). This enclave has special privileges, in that it can derive a key in a deterministic fashion based on the *provisioning* fuse values. Intel stores these values in their databases and can do the same derivation to later check a signature from the PCE.
-
-   Intel provides a separate enclave called the Provisioning Enclave (PvE), also privileged, which interfaces with the PCE (using local attestation) to certify the PvE's report and talks to a special Intel endpoint to join an EPID group anonymously. During the join, Intel verifies the PCE's signature. Once the join has happened, the PvE creates a related private key (the AK) that cannot be linked by Intel to a specific CPU. The PvE seals this key (also sometimes referred to as the "EPID blob") to MRSIGNER, which means it can only be unsealed by Intel enclaves.
-
-2. Attestation
-
-   When users want to do attestation of their own enclave, they need to do so through the Quoting Enclave (QE), also signed by Intel. This enclave can unseal the EPID blob and use the key to sign over user-provided reports.
-
-   The signed quote in turn is sent to the Intel Attestation Service, which checks whether the quote was signed by a key in the EPID group. Intel also checks whether the QE was provided with an up-to-date revocation list.
-
-The end result is a signature of Intel over a signature of the AK over the user enclave quote. Challengers can then simply check this chain to make sure that the user-provided data in the quote (probably another key) comes from a genuine enclave.
-
-All the enclaves involved (PCE, PvE, QE) are owned by Intel, so this setup basically forces us to use Intel's infrastructure during attestation (which in turn forces us to do e.g. mutual TLS, maintain our own proxies, etc.). There are two ways we can get around this.
-
-1. Hook the provisioning phase. During the last step of provisioning, the PvE constructs a chain of trust rooted in Intel.
-   If we can extract some provable chain that allows proving of membership based on an EPID signature, then we can essentially replicate what IAS does.
-2. Bootstrap our own certification. This would involve deriving another certification key based on sealing fuse values and getting an Intel signature over it using the original IAS protocol. This signature would then serve the same purpose as the certificate in 1.
-
-## Non-enclave to enclave channels
-
-When a non-enclave connects to a "leaf" enclave, the goal is to establish a secure channel between the non-enclave and the enclave by authenticating the enclave and possibly authenticating the non-enclave. In addition, we want to provide secrecy for the non-enclave. To this end we can use SIGMA-I to do a Diffie-Hellman key exchange between the non-enclave identity and the enclave identity (see the sketch at the end of this document).
-
-The enclave proves the authenticity of its identity by providing a certificate chain rooted in Intel. If we do our own enclave certification, the chain goes like this:
-
-* Intel signs the quote of the certifying enclave containing the certifying key pair's public part.
-* The certifying key signs the report of the leaf enclave containing the enclave's temporary identity.
-* The enclave identity signs the relevant bits in the SIGMA protocol.
-
-Intel's signature may be cached on disk, and the certifying enclave's signature over the temporary identity may be cached in enclave memory.
-
-We can provide various invalidations, e.g. the non-enclave won't accept a signature if X time has passed since Intel's signature, or if R3's whitelisting cert has expired, etc.
-
-If the enclave needs to authorise the non-enclave, the situation is a bit more complicated. Let's say the enclave holds some secret that it should only reveal to authorised non-enclaves. Authorisation is expressed as a whitelisting signature over the non-enclave identity. How do we check the expiration of the whitelisting key's certificate?
-
-Calendar time inside enclaves deserves its own [document](time.md); the gist is that we simply don't have access to time unless we trust a calendar time oracle.
-
-Note, however, that we probably won't need in-enclave authorisation for *stateless* enclaves, as these have no secrets to reveal at all. Authorisation would simply serve as access control, and we can solve access control in the hosting infrastructure instead.
-
-## Enclave to enclave channels
-
-Doing remote attestation between enclaves is similar to enclave to non-enclave attestation, only this time authentication involves verifying the chain of trust on both sides. Note, however, that this is also predicated on having access to a calendar time oracle, as this time the expiration checks of the chain must be done in enclaves. So in a sense, both enclave to enclave and stateful enclave to non-enclave attestation force us to trust a calendar time oracle.
-
-But note that remote enclave to enclave attestation is mostly required when there *is* sealed state (secrets to share with the other enclave). One other use case is the reduction of audit surface, once it comes to that: we may be able to split stateless enclaves into components that have different upgrade lifecycles. By doing so we ease the auditors' job by reducing the enclaves' contracts and code size.
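-
-To make the key-exchange step concrete, below is a minimal sketch of the Diffie-Hellman agreement underpinning a SIGMA-style channel, using only the standard Java crypto API. The identity signatures and the Intel-rooted certificate checks described above are deliberately omitted.
-
-```kotlin
-import java.security.KeyPair
-import java.security.KeyPairGenerator
-import java.security.PrivateKey
-import java.security.PublicKey
-import javax.crypto.KeyAgreement
-
-fun agree(priv: PrivateKey, peer: PublicKey): ByteArray =
-    KeyAgreement.getInstance("ECDH").run {
-        init(priv)
-        doPhase(peer, true)
-        generateSecret()  // both sides derive the same shared secret
-    }
-
-fun main() {
-    val gen = KeyPairGenerator.getInstance("EC").apply { initialize(256) }
-    val enclave: KeyPair = gen.generateKeyPair()  // enclave's temporary identity
-    val client: KeyPair = gen.generateKeyPair()   // non-enclave identity
-    check(agree(enclave.private, client.public).contentEquals(agree(client.private, enclave.public)))
-}
-```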
diff --git a/docs/source/design/sgx-infrastructure/details/channels.md b/docs/source/design/sgx-infrastructure/details/channels.md
deleted file mode 100644
index 367ce6de3e..0000000000
--- a/docs/source/design/sgx-infrastructure/details/channels.md
+++ /dev/null
@@ -1,75 +0,0 @@
-# Enclave channels
-
-AWS Lambdas may be invoked by name, and are simple request-response type RPCs. The lambda's name abstracts the specific JAR or code image that implements the functionality, which allows upgrading of a lambda without disrupting the rest of the lambdas.
-
-Any authentication required for the invocation is done by a different AWS service (IAM), and is assumed to be taken care of by the time the lambda code is called.
-
-Serverless enclaves also require ways to be addressed; let's call these "enclave channels". Each such channel may be identified with a string, similar to Lambdas; however, unlike Lambdas, we need to incorporate authentication into the concept of a channel in the form of attestation.
-
-Furthermore, unlike Lambdas, we can implement a generic two-way communication channel. This reintroduces state into the enclave logic. Note, however, that this state is in-memory only, and because of the transient nature of enclaves (they may be "lost" at any point) enclave authors are in general incentivised either to keep in-memory state minimal (by sealing state) or to make their functionality idempotent (allowing retries).
-
-We should be able to determine an enclave's supported channels statically. Enclaves may store this data, for example, in a specific ELF section or a separate file. The latter may be preferable, as it may be hard to have a central definition of channels in an ELF section if we use JVM bytecode. Instead we could have a specific static JVM data structure that can be extracted from the enclave statically during the build.
-
-## Sealed state
-
-Sealing keys tied to specific CPUs seem to throw a wrench in the requirement of statelessness. Routing a request to an enclave that has associated sealed state cannot be the same as routing to one which doesn't. How can we transparently scale enclaves like Lambdas if fresh enclaves by definition don't have associated sealed state?
-
-Take key provisioning as an example: we want some key to be accessible by a number of enclaves; how do we differentiate between enclaves that have the key provisioned and ones that don't? We need to somehow expose an opaque version of the enclave's sealed state to the hosting infrastructure for this.
-
-The way we could do this is by expressing this state in terms of a changing set of "active" enclave channels. The enclave can statically declare the channels it potentially supports, and start with some initial subset of them as active. As the enclave's lifecycle (sealed state) evolves, it may change this active set to something different, thereby informing the hosting infrastructure that it shouldn't route certain requests there, or that it can route some other ones.
-
-Take the above key provisioning example. An enclave can be in two states: unprovisioned or provisioned. When it's unprovisioned, its set of active channels will be related to provisioning (for example, a request to bootstrap the key or a request from a sibling enclave); when it's provisioned, its active set will be related to the usage of the key and the provisioning of the key itself to unprovisioned enclaves. A sketch of this example follows below.
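-
-As an illustration, here is a hypothetical Kotlin shape for the static channel declaration and the evolving active set in the key-provisioning example above; all names are made up for this sketch.
-
-```kotlin
-enum class Channel { BOOTSTRAP_KEY, PROVISION_FROM_SIBLING, SIGN, PROVISION_TO_SIBLING }
-
-class KeyEnclave {
-    private var key: ByteArray? = null
-
-    // Declared statically; extractable from the enclave image at build time.
-    val supportedChannels: Set<Channel> = Channel.values().toSet()
-
-    // The active set the hosting infrastructure routes on; it flips when the state evolves.
-    val activeChannels: Set<Channel>
-        get() = if (key == null)
-            setOf(Channel.BOOTSTRAP_KEY, Channel.PROVISION_FROM_SIBLING)  // unprovisioned
-        else
-            setOf(Channel.SIGN, Channel.PROVISION_TO_SIBLING)             // provisioned
-
-    fun provision(newKey: ByteArray) { key = newKey }  // transition: sealed state and active set change together
-}
-```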
-
-The enclave's initial set of active channels defines how enclaves may be scaled horizontally, as these are the channels that will be active for freshly started enclaves without sealed state.
-
-"Hold on" you might say, "this means we didn't solve the scalability of stateful enclaves!".
-
-This is partly true. However, in the above case we can force certain channels to be part of the initial active set! In particular, the channels that actually use the key (e.g. for signing) may be made "stateless" by lazily requesting provisioning of the key from sibling enclaves. Enclaves may be spun up on demand, and as long as there is at least one sibling enclave holding the key, it will be provisioned as needed. This hints at a general pattern of hiding stateful functionality behind stateless channels if we want them to scale automatically.
-
-Note that this doesn't mean we can't have external control over the provisioning of the key. For example, we probably want to enforce redundancy across N CPUs. This requires looping in the hosting infrastructure; we cannot enforce this invariant purely in enclave code.
-
-As we can see, the set of active enclave channels is inherently tied to the sealed state of the enclave; therefore we should make updating the two an atomic operation.
-
-### Side note
-
-Another way to think about enclaves using sealed state is as an actor model. The sealed state is the actor's state, and state transitions may be executed by any enclave instance running on the same CPU. By transitioning the actor state one can also atomically transition the type of messages the actor can receive (= the active channel set).
-
-## Potential gRPC integration
-
-It may be desirable to expose a built-in serialisation and network protocol. This would tie us to a specific protocol, but in turn it would ease development.
-
-An obvious candidate for this is gRPC, as it supports streaming and a specific serialization protocol. We need to investigate how we can integrate it so that channels are basically responsible for tunneling gRPC packets.
diff --git a/docs/source/design/sgx-infrastructure/details/discovery.md b/docs/source/design/sgx-infrastructure/details/discovery.md
deleted file mode 100644
index 5b5bb899bd..0000000000
--- a/docs/source/design/sgx-infrastructure/details/discovery.md
+++ /dev/null
@@ -1,88 +0,0 @@
-# Discovery
-
-In order to understand enclave discovery and routing, we first need to understand the mappings between CPUs, VMs and enclave hosts.
-
-The cloud provider manages a number of physical machines (CPUs); each of those machines hosts a hypervisor, which in turn hosts a number of guest VMs. Each VM in turn may host a number of enclave host containers (together with required supporting software like aesmd) and the SGX device driver. Each enclave host in turn may host several enclave instances. For the sake of simplicity, let's assume that an enclave host may only host a single enclave instance per measurement.
-
-We can figure out the identity of the CPU a VM is running on by using a dedicated enclave to derive a unique ID specific to the CPU. For this we can use EGETKEY with pre-defined inputs to derive a seal key sealed to MRENCLAVE. This provides a 128-bit value reproducible only on the same CPU in this manner. Note that this is completely safe, as the value won't be used for encryption and is specific to the measurement doing this. With this ID we can reason about the physical locality of enclaves without looping in the cloud provider.
-Note: we should set OWNEREPOCH to a static value before doing this.
-
-We don't need an explicit handle on the VM's identity; the mapping from VM to container will be handled by the orchestration engine (Kubernetes).
-
-Similarly to VM identity, the specific host container's identity (IP address/DNS A record) is also tracked by Kubernetes; however, we do need access to this identity in order to implement discovery.
-
-When an enclave instance seals a secret, that piece of data is tied to the measurement+CPU combo. The secret can only be revealed to an enclave with the same measurement running on the same CPU. However, the management of this secret is tied to the enclave host container, and we may have several of those running on the same CPU, possibly all of them hosting enclaves with the same measurement.
-
-To solve this we can introduce a *sealing identity*. This is basically a generated ID/namespace for a collection of secrets belonging to a specific CPU. It is generated when a fresh enclave host starts up, and subsequently the host will store sealed secrets under this ID. These secrets should survive host death, so they will be persisted in etcd (together with the associated active channel sets). Every host owns a single sealing identity, but not every sealing identity may have an associated host (e.g. in case the host died).
-
-## Mapping to Kubernetes
-
-The following mapping of the above concepts to Kubernetes concepts is not yet fleshed out and requires further investigation into Kubernetes capabilities.
-
-VMs correspond to Nodes, and enclave hosts correspond to Pods. The host's identity is the same as the Pod's, which is the Pod's IP address/DNS A record. From Kubernetes's point of view, enclave hosts provide a uniform stateless Headless Service. This means we can use Kubernetes's scaling/autoscaling features to provide redundancy across hosts (to balance load).
-
-However, we'll probably need to tweak the (federated?) ReplicaSet concept in order to provide redundancy across CPUs (to be tolerant of CPU failures), or perhaps use the anti-affinity feature somehow; this is to be explored.
-
-The concept of a sealing identity is very close to the stable identity of Pods in Kubernetes StatefulSets. However, I couldn't find a way to use this directly, as we need to tie the sealing identity to the CPU identity, which in Kubernetes would translate to a requirement to pin stateful Pods to Nodes based on a dynamically determined identity. We could, however, write an extension to handle this metadata.
-
-## Registration
-
-When an enclave host is started, it first needs to establish its sealing identity. To this end it first needs to check whether there are any sealing identities available for the CPU it's running on. If not, it can generate a fresh one and lease it for a period of time (updating the lease periodically), atomically registering its IP address in the process. If an existing identity is available, the host can take it over by leasing it. There may be existing Kubernetes functionality to handle some of this.
-
-Non-enclave services (like blob storage) could register similarly, but in this case we can take advantage of Kubernetes's existing discovery infrastructure to abstract a service behind a Service cluster IP. We do need to provide the metadata about supported channels, though.
-
-## Resolution
-
-The enclave/service discovery problem boils down to: "Given a channel, my trust model and my identity, give me an enclave/service that serves this channel, trusts me, and that I trust".
-
-This may be done in the following steps:
-
-1. Resolve the channel to a set of measurements supporting it
-2. Filter the measurements to ones that we trust and that trust us
-3. Pick one of the measurements randomly
-4. Find an alive host that has the channel in its active set for the measurement
-
-Step 1 may be done by maintaining a channel -> measurements map in etcd. This mapping would effectively define the enclave deployment and would be the central place to control incremental roll-outs or rollbacks.
-
-Step 2 requires storing additional metadata per advertised channel, namely a data structure describing the enclave's trust predicate. A similar data structure is provided by the discovering entity; these two predicates can then be used to filter measurements based on trust.
-
-Step 3 is where we may want to introduce more control if we want to support incremental roll-out/canary deployments.
-
-Step 4 is where various (non-MVP) optimisation considerations come to mind. We could add a load balancer, do autoscaling based on load (although Kubernetes already provides support for this), prefer looping back to the same host to allow local attestation, or prefer hosts that have the enclave image cached locally or warmed up.
diff --git a/docs/source/design/sgx-infrastructure/details/enclave-deployment.md b/docs/source/design/sgx-infrastructure/details/enclave-deployment.md
deleted file mode 100644
index 905bab9930..0000000000
--- a/docs/source/design/sgx-infrastructure/details/enclave-deployment.md
+++ /dev/null
@@ -1,16 +0,0 @@
-# Enclave deployment
-
-What happens if we roll out a new enclave image?
-
-In production we need to sign the image directly with the R3 key as MRSIGNER (process to be designed), as well as create any whitelisting signatures needed (e.g. from auditors) in order to allow existing enclaves to trust the new one.
-
-We need to make the enclave build sources available to users; we can package this up as a single container pinning all build dependencies and source code. Docker-style image layering/caching will come in handy here.
-
-Once the image, build containers and related signatures are created, we need to push these to the main R3 enclave storage.
-
-Enclave infrastructure owners (e.g. Corda nodes) may then start using the images, depending on their upgrade policy. This involves updating their key-value store so that new channel discovery requests resolve to the new measurement, which in turn will trigger the image download on demand on enclave hosts. We can potentially add pre-caching here to reduce latency for first-time enclave users.
diff --git a/docs/source/design/sgx-infrastructure/details/enclave-storage.md b/docs/source/design/sgx-infrastructure/details/enclave-storage.md
deleted file mode 100644
index 85db88363b..0000000000
--- a/docs/source/design/sgx-infrastructure/details/enclave-storage.md
+++ /dev/null
@@ -1,7 +0,0 @@
-# Enclave storage
-
-The enclave storage is a simple static content server. It should allow uploading and serving of enclave images based on their measurement. We may also want to store metadata about the enclave build itself (e.g. a GitHub link/commit hash).
-
-We may need to extend its responsibilities to serve other SGX-related static content, such as whitelisting signatures over measurements.
diff --git a/docs/source/design/sgx-infrastructure/details/host.md b/docs/source/design/sgx-infrastructure/details/host.md
deleted file mode 100644
index 4c8475673e..0000000000
--- a/docs/source/design/sgx-infrastructure/details/host.md
+++ /dev/null
@@ -1,11 +0,0 @@
-# Enclave host
-
-An enclave host's responsibility is to orchestrate communication with hosted enclaves.
-
-It is responsible for:
-* Leasing a sealing identity
-* Getting a CPU certificate in the form of an Intel-signed quote
-* Downloading and starting requested enclaves
-* Driving attestation and subsequent encrypted traffic
-* Using discovery to connect to other enclaves/services
-* Maintaining (and invalidating) various caching layers for the CPU certificate, hosted enclave quotes and enclave
-  images
diff --git a/docs/source/design/sgx-infrastructure/details/ias-proxy.md b/docs/source/design/sgx-infrastructure/details/ias-proxy.md
deleted file mode 100644
index d2da6cc29b..0000000000
--- a/docs/source/design/sgx-infrastructure/details/ias-proxy.md
+++ /dev/null
@@ -1,10 +0,0 @@
-# IAS proxy
-
-The Intel Attestation Service proxy's responsibility is simply to forward requests to and from the IAS.
-
-The reason we need this proxy is that Intel requires us to do mutual TLS with them for each attestation round trip.
-For this we need an R3-maintained private key, and as we want third parties to be able to do attestation we need to
-store this private key in these proxies.
-
-Alternatively we may decide to circumvent this mutual TLS requirement completely by distributing the private key with
-the host containers.
\ No newline at end of file
diff --git a/docs/source/design/sgx-infrastructure/details/kv-store.md b/docs/source/design/sgx-infrastructure/details/kv-store.md
deleted file mode 100644
index fef06d7e03..0000000000
--- a/docs/source/design/sgx-infrastructure/details/kv-store.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Key-value store
-
-To solve enclave-to-enclave and enclave-to-non-enclave communication we need a way to route requests correctly. There
-are readily available discovery solutions out there; however, we have some special requirements because of the
-inherent statefulness of enclaves (route to the enclave with the correct state) and the dynamic nature of trust
-between them (route to an enclave I can trust and that trusts me). To store metadata about discovery we need some kind
-of distributed key-value store.
-
-The key-value store needs to store information about the following entities:
-* Enclave image: measurement and supported channels
-* Sealing identity: the sealing ID, the corresponding CPU ID and the host leasing it (if any)
-* Sealed secret: the sealing ID, the sealing measurement, the sealed secret and corresponding active channel set
-* Enclave deployment: mapping from channel to set of measurements
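-
-To make the schema concrete, these entities might look as follows as Kotlin types. This is a sketch only; the field
-types and representations (e.g. measurements and IDs as strings, channels by name) are guesses:
-
-```kotlin
-// Sketch of the discovery metadata as Kotlin types. Representations are assumptions.
-data class EnclaveImage(val measurement: String, val supportedChannels: Set<String>)
-
-data class SealingIdentity(val sealingId: String, val cpuId: String, val leasedByHost: String?)
-
-data class SealedSecret(
-    val sealingId: String,
-    val sealingMeasurement: String,
-    val sealedBlob: ByteArray,
-    val activeChannels: Set<String>
-)
-
-// The enclave deployment: which measurements currently serve each channel.
-typealias EnclaveDeployment = Map<String, Set<String>>
-```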
diff --git a/docs/source/design/sgx-infrastructure/details/serverless.md b/docs/source/design/sgx-infrastructure/details/serverless.md
deleted file mode 100644
index 42dbffb2dd..0000000000
--- a/docs/source/design/sgx-infrastructure/details/serverless.md
+++ /dev/null
@@ -1,33 +0,0 @@
-# Serverless architectures
-
-In 2014 Amazon launched AWS Lambda, for which they coined the term "serverless architecture". It essentially creates
-an abstraction layer which hides the infrastructure details. Users provide "lambdas", which are stateless functions
-that may invoke other lambdas, access other AWS services etc. Because Lambdas are inherently stateless (any state they
-need must be accessed through a service) they may be loaded and executed on demand. This is in contrast with
-microservices, which are inherently stateful. Internally AWS caches the lambda images and even caches JIT-compiled,
-warmed-up code in order to reduce latency. Furthermore the lambda invocation interface provides a convenient way to
-scale these lambdas: as the functions are stateless, AWS can spin up new VMs to push lambda functions to. The user
-simply pays for CPU usage; all the infrastructure pain is hidden by Amazon.
-
-Google and Microsoft followed suit within a couple of years with Cloud Functions and Azure Functions.
-
-This way of splitting hosting computation from a hosted restricted computation is not a new idea; examples include web
-frameworks (web server vs application), MapReduce (Hadoop vs mappers/reducers), or even the cloud (hypervisors vs VMs)
-and the operating system (kernel vs userspace). The common pattern is: the hosting layer hides some kind of
-complexity, imposes some restriction on the guest layer (and provides a simpler interface in turn), and transparently
-multiplexes a number of resources for them.
-
-The relevant key features of serverless architectures are 1. on-demand scaling and 2. business logic independent of
-hosting logic.
-
-# Serverless SGX?
-
-How are Amazon Lambdas relevant to SGX? Enclaves exhibit very similar features to Lambdas: they are pieces of business
-logic completely independent of the hosting functionality. Not only that, enclaves treat hosts as adversaries! This
-provides a very clean separation of concerns which we can exploit.
-
-If we could provide a similar infrastructure for enclaves as Amazon provides for Lambdas, it would not only allow easy
-HA and scaling, it would also decouple the burden of maintaining the infrastructure from the enclave business logic.
-Furthermore our plan of using the JVM within enclaves also aligns with the optimisations Amazon implemented (e.g.
-keeping warmed-up instances around). Optimisations like upgrading to local attestation also become orthogonal to
-enclave business logic. Enclave code can focus on the specific functionality at hand; everything else is taken care of.
diff --git a/docs/source/design/sgx-infrastructure/details/time.md b/docs/source/design/sgx-infrastructure/details/time.md
deleted file mode 100644
index 972a6d0f1a..0000000000
--- a/docs/source/design/sgx-infrastructure/details/time.md
+++ /dev/null
@@ -1,69 +0,0 @@
-# Time in enclaves
-
-In general we know that any one crypto algorithm will be broken in X years' time. The usual way to mitigate this is by
-using certificate expiration. If a peer with an expired certificate tries to connect we reject it in order to enforce
-freshness of their key.
-
-In order to check certificate expiration we need some notion of calendar time. However in SGX's threat model the host
-of the enclave is considered malicious, so we cannot rely on its notion of time. Intel provides trusted time through
-their PSW; however, this uses the Management Engine, which is known to be a proprietary and vulnerable piece of
-architecture.
-
-Therefore in order to check calendar time in general we need some kind of time oracle. We can burn the oracle's
-identity into the enclave and request timestamped signatures from it. This already raises questions with regards to
-the oracle's identity itself, however for the time being let's assume we have something like this in place.
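-
-As a sketch of what "burning in the oracle identity" might look like from the enclave's side, assuming the oracle
-signs over (nonce || timestamp) - the wire format and signature scheme here are illustrative only, and the next
-section covers why the nonce is needed:
-
-```kotlin
-import java.nio.ByteBuffer
-import java.security.PublicKey
-import java.security.Signature
-import java.time.Instant
-
-// The oracle's public key would be part of the measured enclave image.
-class TimeOracleVerifier(private val oracleKey: PublicKey) {
-    /** Verifies the oracle's signature over (nonce || epochMillis) and returns the asserted time. */
-    fun verifiedTime(nonce: ByteArray, epochMillis: Long, signature: ByteArray): Instant {
-        val sig = Signature.getInstance("SHA256withECDSA").apply {
-            initVerify(oracleKey)
-            update(nonce)
-            update(ByteBuffer.allocate(Long.SIZE_BYTES).putLong(epochMillis).array())
-        }
-        require(sig.verify(signature)) { "Bad time oracle signature" }
-        return Instant.ofEpochMilli(epochMillis)
-    }
-}
-```
-
-The enclave would then require the returned time to fall before the peer certificate's expiry.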
-
-### Timestamped nonces
-
-The most straightforward way to implement calendar time checks is to generate a nonce *after* the DH exchange, send it
-to the oracle and have it sign over it with a timestamp. The nonce is required to avoid replay attacks. A malicious
-host may delay the delivery of the signature indefinitely, even until after the certificate expires. However note that
-the DH happened before the nonce was generated, which means even if an attacker can crack the expired key they would
-not be able to steal the DH session, only try creating new ones, which will fail at the timestamp check.
-
-This works, but note that it would impose a full round trip to an oracle *per DH exchange*.
-
-### Timestamp-encrypted channels
-
-In order to reduce the round trips required for timestamp checking we can invert the responsibility of checking the
-timestamp. We can do this by encrypting the channel traffic with an additional key, generated by the enclave, that can
-only be revealed by the time oracle. The enclave encrypts the encryption key with the oracle's public key, so the peer
-trying to communicate with the enclave must forward the encrypted key to the oracle. The oracle in turn will check the
-timestamp and reveal the contents (perhaps double-encrypted with a DH-derived key). The peer can cache the key and
-later use the same encryption key with the enclave. It is then the peer's responsibility to get rid of the key after a
-while.
-
-Note that this mitigates attacks where the attacker is a third party trying to exploit an expired key, but this method
-does *not* mitigate against malicious peers that keep the encryption key around until after expiration (i.e. they
-"become" malicious).
-
-### Oracle key break
-
-So given an oracle we can secure a channel against expired keys and potentially improve performance by trusting
-once-authorized enclave peers not to become malicious.
-
-However what happens if the oracle key itself is broken? There's a chicken-and-egg problem where we can't check the
-expiration of the time oracle's certificate itself! Once the oracle's key is broken an attacker can fake timestamping
-replies (or decrypt the timestamp encryption key), which in turn allows it to bypass the expiration check.
-
-The main issue with this is in relation to sealed secrets, and sealed secret provisioning between enclaves. If an
-attacker can fake being e.g. an authorized enclave then it can extract old secrets. We have yet to come up with a
-solution to this, and I don't think it's possible.
-
-Knowing that current crypto algorithms are bound to be broken at *some* point in the future, instead of trying to make
-sealing future-proof we can be explicit about the time-boundedness of security guarantees.
-
-### Sealing epochs
-
-Let's call the time period in which a certain set of algorithms is considered safe a *sealing epoch*. During this
-period sealed data at rest is considered to be secure. However once the epoch finishes, old sealed data is considered
-to be potentially compromised. We can then think of sealed data as an append-only log of secrets with overlapping
-epoch intervals, where the "breaking" of old epochs is constantly catching up with new ones.
-
-In order to make sure that this works we need to enforce an invariant where secrets only flow from old epochs to newer
-ones, never the other way around.
-
-This translates to the ledger nicely: data in old epochs is generally not valuable anymore, so it's safe to consider
-it compromised.
-Note however that in the privacy model an epoch transition requires a full re-provisioning of the ledger to the new
-set of algorithms/enclaves.
-
-In any case this is an involved problem, and I think we should defer fleshing it out for now as we won't need it for
-the first round of stateless enclaves.
diff --git a/docs/source/design/sgx-integration/SgxProvisioning.png b/docs/source/design/sgx-integration/SgxProvisioning.png
deleted file mode 100644
index 2c52a3f180..0000000000
Binary files a/docs/source/design/sgx-integration/SgxProvisioning.png and /dev/null differ
diff --git a/docs/source/design/sgx-integration/design.md b/docs/source/design/sgx-integration/design.md
deleted file mode 100644
index 62ff6a14e0..0000000000
--- a/docs/source/design/sgx-integration/design.md
+++ /dev/null
@@ -1,317 +0,0 @@
-# SGX Integration
-
-This document is intended as a design description of how we can go about integrating SGX with Corda. As the
-infrastructure design of SGX is quite involved (detailed elsewhere) but otherwise flexible, we can discuss the
-possible integration points separately, without delving into lower-level technical detail.
-
-For the purposes of this document we can think of SGX as a way to provision secrets to a remote node with the
-knowledge that only trusted code (i.e. an enclave) will operate on them. Furthermore it provides a scalable way to
-durably encrypt data while also ensuring that the encryption key is never leaked (unless the encrypting enclave is
-compromised).
-
-Broadly speaking there are two dimensions to deciding how we can integrate SGX: *what* we store in the ledger and
-*where* we store it.
-
-The first dimension is the *what*: this relates to what we have so far called the "integrity model" vs the "privacy
-model".
-
-In the **integrity model** we rely on SGX to ensure the integrity of the ledger. Using this assumption we can cut off
-the transaction body and only store an SGX-backed signature over filtered transactions. Namely, we would only store
-information required for notarisation of the current and subsequent spending transactions. This seems neat at first
-sight; however, note that if we do this naively then an attacker who can impersonate an enclave gains write access to
-the ledger, as the fake enclave can sign transactions as valid without having run verification.
-
-In the **privacy model** we store the full transaction backchain (encrypted) and we keep provisioning it between nodes
-on demand, just like in the current Corda implementation. This means we only rely on SGX for the privacy aspects - if
-an enclave is compromised we only lose privacy; verification cannot be eluded by providing a fake signature.
-
-The other dimension is the *where*: currently in non-SGX Corda the full transaction backchain is provisioned between
-non-notary nodes, and is also provisioned to notaries in case they are validating ones. With SGX+BFT notaries we have
-the possibility to offload the storage of the encrypted ledger (or encrypted signatures thereof) to notary nodes (or
-dedicated oracles) and only store bookkeeping information required for further ledger updates in non-notary nodes. The
-storage policy is very important: customers want control over the persistence of even encrypted data, and with the
-introduction of recent regulation (GDPR) unrestricted provisioning of sensitive data will be illegal, even when
-encrypted.
-
-We'll explore the different combinations of choices below.
-Note that we don't need to marry any one of them; we may decide to implement several.
-
-## Privacy model + non-notary provisioning
-
-Let's start with the model that's closest to the current Corda implementation, as this is an easy segue into the
-possibilities with SGX. We also have a simple example and a corresponding neat diagram (thank you Kostas!!) we showed
-to a member bank, Itau, to indicate in a semi-handwavy way what the integration will look like.
-
-We have a CorDapp X used by nodes A and B. The CorDapp contains a flow XFlow and a (deterministic) contract XContract.
-The two nodes are negotiating a transaction T2. T2 consumes a state that comes from transaction T1.
-
-Let's assume that both A and B are happy with T2, except node A hasn't established its validity yet. Our goal is to
-prove the validity of T2 to A without revealing the details of T1.
-
-The following diagram shows an overview of how this can be achieved. Note that the diagram is highly oversimplified
-and is meant to communicate the high-level data flow relevant to Corda.
-
-![SGX Provisioning](SgxProvisioning.png "SGX Provisioning")
-
-* In order to validate T2, A asks its enclave whether T2 is valid.
-* The enclave sees that T2 depends on T1, so it consults its sealed ledger to see whether it contains T1.
-* If it does then this means T1 has been verified already, so the enclave moves on to the verification of T2.
-* If the ledger doesn't contain T1 then the enclave needs to retrieve it from node B.
-* In order to do this A's enclave needs to prove to B's enclave that it is indeed a trusted enclave B can provision T1
-  to. This proof is what the attestation process provides.
-* Attestation is done in the clear: (TODO attestation diagram)
-    * A's enclave generates a keypair, the public part of which is sent to node B in a datastructure signed by Intel;
-      this is called the quote (1).
-    * Node B's XFlow may do various checks on this datastructure that cannot be performed by B's enclave, for example
-      checking the timeliness of Intel's signature (2).
-    * Node B's XFlow then forwards the quote to B's enclave, which will check Intel's signature and whether it trusts
-      A's enclave. For the sake of simplicity we can assume this to be a strict check that A is running the exact same
-      enclave B is.
-    * At this point B's enclave has established trust in A's enclave, and has the public part of the key generated by
-      A's enclave.
-    * The nodes repeat the above process the other way around so that A's enclave establishes trust in B's and gets
-      hold of B's public key (3).
-    * Now they proceed to perform an ephemeral Diffie-Hellman key exchange using the keys in the quotes (4).
-    * The ephemeral key is then used to encrypt further communication. Beyond this point the nodes' flows (and
-      anything outside of the enclaves) have no way of seeing what data is being exchanged; all the nodes can do is
-      forward the encrypted messages.
-* Once attestation is done B's enclave provisions T1 to A's enclave using the DH key. If there are further
-  dependencies those would be provisioned as well.
-* A's enclave then proceeds to verify T1 using the embedded deterministic JVM to run XContract. The verified
-  transaction is then sealed to disk (5). We repeat this for T2.
-* If verification or attestation fails at any point the enclave returns to A's XFlow with a failure. Otherwise, if all
-  is good, the enclave returns with a success. At this point A's XFlow knows that T2 is valid, but hasn't seen T1 in
-  the clear.
-
-(1) This is simplified; the actual protocol is a bit different. Namely, the quote is not generated every time A
-requires provisioning, but rather periodically.
-
-(2) There is a way to do this check inside the enclave; however, it requires switching on the Intel ME, which in
-general isn't available on machines in the cloud and is known to have vulnerabilities.
-
-(3) We need symmetric trust even if the secrets seem to only flow from B to A. Node B may try to fake being an enclave
-to fish for information from A.
-
-(4) The generated keys in the quotes are used to authenticate the respective parts of the DH key exchange.
-
-(5) Sealing means encryption of data using a key unique to the enclave and CPU. The data may be subsequently unsealed
-(decrypted) by the enclave, even if the enclave was restarted. Also note that there is another layer of abstraction,
-not detailed here, needed for redundancy of the encryption key.
-
-To summarise, the journey of T1 is:
-
-1. Initially it's sitting encrypted in B's storage.
-2. B's enclave decrypts it using its seal key, specific to B's enclave + CPU combination.
-3. B's enclave encrypts it using the ephemeral DH key.
-4. The encrypted transaction is sent to A. The safety of this (namely that A's enclave doesn't reveal the transaction
-   to node A) hinges on B's enclave's trust in A's enclave, which is expressed as a check of A's enclave measurement
-   during attestation, which in turn requires auditing of A's enclave code and reproduction of the measurement.
-5. A's enclave decrypts the transaction using the DH key.
-6. A's enclave verifies the transaction using a deterministic JVM.
-7. A's enclave encrypts the transaction using A's seal key, specific to A's enclave + CPU combination.
-8. The encrypted transaction is stored in A's storage.
-
-As we can see, in this model each non-notary node runs its own SGX enclave and related storage. Validation of the
-backchain happens by securely provisioning it between enclaves, plus subsequent verification and storage. However
-there is one important thing missing from the example (actually it has several, but those are mostly technical
-detail): the notary!
-
-In reality we cannot establish the full validity of T2 at this point of negotiation; we first need to notarise it.
-This model gives us some flexibility in this regard: we can use a validating notary (also running SGX) or a
-non-validating one. This indicates that the enclave API should be split in two, mirroring the signature check choice
-in SignedTransaction.verify. Only when the transaction is fully signed and notarised should it be persisted (sealed).
-
-This model has both advantages and disadvantages. On one hand it is the closest to what we have now - we (and users)
-are familiar with this model, we can fairly easily nest it into the existing codebase, and it gives us flexibility
-with regards to notary modes. On the other hand it is only a partial answer to the regulatory problem. If we use
-non-validating notaries then backchain storage is restricted to participants; however, consider the following example:
-if we have a transaction X that parties A and B can process legally, but a later transaction Y that has X in its
-backchain is sent for verification to party C, then C will process and store X as well, which may be illegal.
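-
-For a rough feel of the attested key exchange described in the example above, here is one side of it in Kotlin. The
-`Quote` type, the trust check and the wire format are stand-ins; real quotes are Intel-defined structures and their
-verification is considerably more involved. Per footnote (3), the same dance happens in the other direction too.
-
-```kotlin
-import java.security.KeyFactory
-import java.security.KeyPair
-import java.security.KeyPairGenerator
-import java.security.PublicKey
-import java.security.Signature
-import java.security.spec.X509EncodedKeySpec
-import javax.crypto.KeyAgreement
-
-// A quote as seen by the peer: an Intel-signed structure certifying that `authKey`
-// was generated inside an enclave with the given measurement. Heavily simplified.
-class Quote(val authKey: PublicKey, val measurement: ByteArray /* + Intel's signature */)
-
-// One side of the attested key exchange. Trust check and transport are assumed.
-fun establishChannelKey(
-    ourAuthKeys: KeyPair,                      // the keypair certified by our own quote
-    peerQuote: Quote,
-    trustQuote: (Quote) -> Boolean,            // Intel signature + measurement trust check
-    send: (ByteArray, ByteArray) -> Unit,      // (ephemeral public key, signature over it)
-    receive: () -> Pair<ByteArray, ByteArray>  // peer's ephemeral public key + signature
-): ByteArray {
-    require(trustQuote(peerQuote)) { "Peer enclave failed attestation" }
-
-    // Fresh ephemeral DH half, authenticated with the quote-certified key (footnote 4).
-    val ephemeral = KeyPairGenerator.getInstance("EC").generateKeyPair()
-    val ourSig = Signature.getInstance("SHA256withECDSA").run {
-        initSign(ourAuthKeys.private)
-        update(ephemeral.public.encoded)
-        sign()
-    }
-    send(ephemeral.public.encoded, ourSig)
-
-    // Check the peer's ephemeral half against their quote-certified key.
-    val (peerEphemeralBytes, peerSig) = receive()
-    val sigOk = Signature.getInstance("SHA256withECDSA").run {
-        initVerify(peerQuote.authKey)
-        update(peerEphemeralBytes)
-        verify(peerSig)
-    }
-    require(sigOk) { "Bad signature over peer's ephemeral key" }
-    val peerEphemeral = KeyFactory.getInstance("EC").generatePublic(X509EncodedKeySpec(peerEphemeralBytes))
-
-    // Shared secret; in practice this would be fed through a KDF before use.
-    return KeyAgreement.getInstance("ECDH").run {
-        init(ephemeral.private)
-        doPhase(peerEphemeral, true)
-        generateSecret()
-    }
-}
-```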
-
-## Privacy model + notary provisioning
-
-This model would work similarly to the previous one, except non-notary nodes wouldn't need to run SGX or care about
-storage of the encrypted ledger; it would all be done in notary nodes. Nodes would connect to SGX-capable notary
-nodes, and after attestation the nodes can be sure that the notary has run verification before signing.
-
-This forces the choice of validating notaries, as notaries would be the only entities capable of verification: only
-they have access to the full backchain inside enclaves.
-
-Note that because we still provision the full backchain between notary members for verification, we don't necessarily
-need a BFT consensus on validity - if an enclave is compromised, an invalid transaction will be detected at the next
-backchain provisioning.
-
-This model reduces the number of responsibilities of a non-notary node; in particular it wouldn't need to provide
-storage for the backchain or verification, but could simply trust notary signatures. Also it wouldn't need to host SGX
-enclaves, only partake in the DH exchange with notary enclaves. The node's responsibilities would be reduced to the
-orchestration of ledger updates (flows) and related bookkeeping (vault, network map). This split would also enable us
-to be flexible with regards to the update orchestration: trust in the validity of the ledger would cease to depend on
-the transaction resolution currently embedded into flows - we could provide a from-scratch lightweight implementation
-of a "node" (say, a mobile app) that doesn't use flows and related code at all; it just needs to be able to connect to
-notary enclaves to notarise, and validity is taken care of by notaries.
-
-Note that although we wouldn't require validation checks from non-notary nodes, in theory it would be safe to allow
-them to do so (if they want a stronger-than-BFT guarantee).
-
-Of course this model has disadvantages too. From the regulatory point of view it is a strictly worse solution than the
-non-notary provisioning model: the backchain would be provisioned between notary nodes not owned by actual
-participants in the backchain. It also rules out non-validating notaries.
-
-## Integrity model + non-notary provisioning
-
-In this model we would trust SGX-backed signatures and related attestation datastructures (a quote over the signature
-key, signed by Intel) as proof of validity. When nodes A and B are negotiating a transaction it's enough to provision
-SGX signatures over the dependency hashes to one another; there's no need to provision the full backchain.
-
-This sounds very simple and efficient, and it's even more private than the privacy model as we're only passing
-signatures around, not transactions. However there are a couple of issues that need addressing. If an SGX enclave is
-compromised, a malicious node can provide a signature over an invalid transaction that checks out, and nobody will
-ever know about it, because the original transaction will never be verified. One way we can mitigate this is by
-requiring a BFT consensus signature, or perhaps a threshold signature is enough. We could decouple verification into
-"verifying oracles" which verify in SGX and return signatures over transaction hashes, and require a certain number of
-them to convince the notary to notarise and subsequent nodes to trust validity, as sketched below.
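-
-A sketch of the check a notary (or any later node) might run before trusting validity. The types and the pluggable
-signature check are assumptions:
-
-```kotlin
-import java.security.PublicKey
-
-// A verifying oracle's signature over a transaction hash, produced inside SGX.
-class OracleSignature(val oracleKey: PublicKey, val signature: ByteArray)
-
-// Does [txId] carry signatures from at least [threshold] distinct, trusted oracles?
-fun hasValidityQuorum(
-    txId: ByteArray,
-    signatures: List<OracleSignature>,
-    trustedOracles: Set<PublicKey>,
-    threshold: Int,
-    verify: (key: PublicKey, data: ByteArray, sig: ByteArray) -> Boolean  // signature check (assumed)
-): Boolean {
-    val validSigners = signatures
-        .filter { it.oracleKey in trustedOracles && verify(it.oracleKey, txId, it.signature) }
-        .map { it.oracleKey }
-        .toSet()  // de-duplicate: one vote per oracle
-    return validSigners.size >= threshold
-}
-```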
-
-Another issue is enclave updates. If we find a vulnerability in an enclave and update it, what happens to the already
-signed backchain? Historical transactions have signatures that are rooted in SGX quotes belonging to old, untrusted
-enclave code. One option is to simply have a cutoff date before which we accept old signatures. This requires a
-consensus-backed timestamp on the notary signature. Another option would be to keep the old ledger around and
-re-verify it with the new enclaves. However if we do this we lose the benefits of the integrity model - we get back
-the regulatory issue, and we don't gain the performance benefits.
-
-## Integrity model + notary provisioning
-
-This is similar to the previous model, only once again non-notary nodes wouldn't need to care about verifying or
-collecting proofs of validity before sending the transaction off for notarisation. All of the complexity would be
-hidden by notary nodes, which may use validating oracles or perhaps combine consensus over validity with consensus
-over spending. This model would be a very clean separation of concerns which solves the regulatory problem (almost)
-and is quite efficient, as we don't need to keep provisioning the chain. One potential issue with regards to
-regulation is the tip of the ledger (the transaction being notarised) - this is sent to notaries, and although it is
-not stored it may still be against the law to receive it and hold it in volatile memory, even inside an enclave. I'm
-unfamiliar with the legal details of whether this is good enough. If this is an issue, one way we could address it
-would be to scope the validity checks required for notarisation within legal boundaries and only require "full"
-consensus on the spentness check. Of course this has the downside that ledger participants outside of the regulatory
-boundary need to trust the BFT-SGX of the scope. I'm not sure whether it's possible to do any better; after all, we
-can't send the transaction body outside the scope in any shape or form.
-
-## Threat model
-
-In all models we have the following actors, which may or may not overlap depending on the model:
-
-* Notary quorum members
-* Non-notary nodes/entities interacting with the ledger
-* Identities owning the verifying enclave hosting infrastructure
-* Identities owning the encrypted ledger/signature storage infrastructure
-* R3 = enclave whitelisting identity
-* Network Map = contract whitelisting identity
-* Intel
-
-We have two major ways of compromise:
-
-* compromise of a non-enclave entity (notary, node, R3, Network Map, storage)
-* compromise of an enclave.
-
-In the case of **notaries** compromise means malicious signatures, for **nodes** it's malicious transactions, for
-**R3** it's signing malicious enclaves, for **Network Map** it's signing malicious contracts, for **storage** it's
-read-write access to encrypted data, and for **Intel** it's forging of quotes or signing over invalid ones.
-
-A compromise of an **enclave** means some form of access to the enclave's temporary identity key. This may happen
-through direct hardware compromise (extracting of fuse values) and subsequent forging of a quote, or through leaking
-of secrets via weaknesses of the enclave-host boundary or other side channels like Spectre (hacking). In any case it
-allows an adversary to impersonate an enclave and therefore to intercept enclave traffic and forge signatures.
-
-The actors relevant to SGX are enclave hosts, storage infrastructure owners, regular nodes and R3.
-
-* **Enclave hosts**: enclave code is specifically written with malicious (compromised) hosts in mind.
-  That said, we cannot be 100% secure against yet-undiscovered side channel attacks and other vulnerabilities, so we
-  need to be prepared for the scenario where enclaves get compromised. The privacy model effectively solves this
-  problem by always provisioning and re-verifying the backchain. An impersonated enclave may be able to see what's on
-  the ledger, but tampering with it will not check out at the next provisioning. On the other hand, if a compromise
-  happens in the integrity model an attacker can forge a signature over validity. We can mitigate this with a BFT
-  guarantee by requiring a consensus over validity. This way we effectively provide the same guarantee for validity as
-  notaries provide with regards to double spend.
-
-* **Storage infrastructure owner**:
-    * A malicious actor would need to crack the encryption key to decrypt transactions or transaction signatures.
-      Although this is highly unlikely, we can mitigate by preparing for and forcing key updates (i.e. we won't
-      provision new transactions to enclaves using old keys).
-    * What an attacker *can* do is simply erase encrypted data (or perhaps re-encrypt it as part of ransomware),
-      blocking subsequent resolution and verification. In the non-notary provisioning models we can't really mitigate
-      this, as the tip of the ledger (or the signature over it) may only be stored by a single non-notary entity
-      (assumed to be compromised). However if we require consensus over validity between notary or non-notary entities
-      (e.g. validating oracles) then this implicitly provides redundancy of storage.
-    * Furthermore storage owners can spy on the enclave's activity by observing access patterns to the encrypted
-      blobs. We can mitigate this by implementing ORAM storage.
-
-* **Regular nodes**: if a regular node is compromised the attacker may gain access to the node's long-term key, which
-  allows them to Diffie-Hellman with an enclave, or get the ephemeral DH value calculated during attestation directly.
-  This means they can man-in-the-middle between the node and the enclave. From the ledger's point of view we are
-  prepared for this scenario, as we never leak sensitive information to the node from the enclave; however, it opens
-  the possibility that the attacker can fake enclave replies (e.g. validity checks) and can sniff on secrets flowing
-  from the node to the enclave. We can mitigate the fake enclave replies by requiring an extra signature on messages.
-  Sniffing cannot really be mitigated, but one could argue that if the transient DH key (which lives temporarily in
-  volatile memory) or the long-term key (which probably lives in an HSM) was leaked, then the attacker has access to
-  node secrets anyway.
-
-* **R3**: the entity that's whitelisting enclaves effectively controls attestation trust, which means they can
-  backdoor the ledger by whitelisting a secret-revealing/signature-forging enclave. One way to mitigate this is by
-  requiring a threshold signature/consensus over new trusted enclave measurements. Another way would be to use
-  "canary" keys controlled by neutral parties. These parties' responsibility would simply be to publish enclave
-  measurements (and perhaps the reproducing build) to the public before signing over them. The "publicity" and
-  signature would be checked during attestation, so a quote with a non-public measurement would be rejected. Although
-  this wouldn't prevent backdoors (unless the parties also do auditing), it would make them public.
-
-* **Intel**: There are two ways a compromised Intel can interact with the ledger maliciously; both provide a backdoor.
-    * It can sign over invalid quotes. This can be mitigated by implementing our own attestation service. Intel told
-      us we'll be able to do this in the future (by downloading a set of certificates tied to CPU+CPUSVN combos that
-      may be used to check QE signatures).
-    * It can produce valid quotes without an enclave. This is due to the fact that they store one half of the
-      SGX-specific fuse values in order to validate quotes flexibly. One way to circumvent this would be to only use
-      the other half of the fuse values (the seal values), which they don't store (or so they claim). However this
-      requires our own "enrollment" process of CPUs, where we replicate the provisioning process based off of the seal
-      values and verify manually that the provisioning public key comes from the CPU. And even if we do this, all we
-      did was move the requirement of trust from Intel to R3.
-
-  Note however that even if an attacker compromises Intel and decides to backdoor, they would need to connect to the
-  ledger participants in order to take advantage. The flow framework and the business network concept act as a form of
-  ACL on data that would make an Intel backdoor quite useless.
-
-## Summary
-
-As we can see we have a number of options here; all of them have advantages and disadvantages.
-
-#### Privacy + non-notary
-
-**Pros**:
-* Closest to our current non-SGX model
-* Strong guarantee of validity
-* Flexible with respect to notary modes
-
-**Cons**:
-* Regulatory problem with provisioning of the ledger
-* Relies on ledger participants to do validation checks
-* No redundancy across ledger participants
-
-#### Privacy + notary
-
-**Pros**:
-* Strong guarantee of validity
-* Separation of concerns, allows lightweight ledger participants
-* Redundancy across notary nodes
-
-**Cons**:
-* Regulatory problem with provisioning of the ledger
-
-#### Integrity + non-notary
-
-**Pros**:
-* Efficient validity checks
-* No storage of sensitive transaction bodies, only signatures
-
-**Cons**:
-* Enclave impersonation compromises the ledger (unless validation is backed by consensus)
-* Relies on ledger participants to do validation checks
-* No redundancy across ledger participants
-
-#### Integrity + notary
-
-**Pros**:
-* Efficient validity checks
-* No storage of sensitive transaction bodies, only signatures
-* Separation of concerns, allows lightweight ledger participants
-* Redundancy across notary nodes
-
-**Cons**:
-* Only a BFT guarantee over validity
-* Temporary storage of the transaction in RAM may be against regulation
-
-Personally I'm strongly leaning towards an integrity model where SGX compromise is mitigated by a BFT consensus over
-validity (perhaps done by a validating oracle cluster). This would solve the regulatory problem, it would be
-efficient, and the infrastructure would have a very clean separation of concerns between notary and non-notary nodes,
-allowing lighter-weight interaction with the ledger.
diff --git a/docs/source/design/targetversion/design.md b/docs/source/design/targetversion/design.md
deleted file mode 100644
index a0de10d1db..0000000000
--- a/docs/source/design/targetversion/design.md
+++ /dev/null
@@ -1,90 +0,0 @@
-# CorDapp Minimum and Target Platform Version
-
-## Overview
-
-We want to give CorDapps the ability to specify which versions of the platform they support.
-This will make it easier for CorDapp developers to support multiple platform versions, and enable CorDapp developers
-to tweak behaviour and opt in to changes that might be breaking (e.g. sandboxing). Corda developers gain the ability
-to introduce changes to the implementation of the API that would otherwise break existing CorDapps.
-
-This document proposes that CorDapps will have metadata associated with them specifying a minimum platform version and
-a target platform version. The minimum platform version of a CorDapp would indicate that a Corda node would have to be
-running at least this version of the Corda platform in order to be able to run this CorDapp. The target platform
-version of a CorDapp would indicate that it was tested with this version of the Corda platform.
-
-## Background
-
-> Introduce target version and min platform version as app attributes
->
-> This is probably as simple as a couple of keys in a MANIFEST.MF file.
-> We should document what it means, make sure API implementations can always access the target version of the calling
-> CorDapp (i.e. by examining the flow, doing a stack walk or using Reflection.getCallerClass()) and do a simple test
-> of an API that acts differently depending on the target version of the app.
-> We should also implement checking at CorDapp load time that min platform version <= current platform version.
-
-([from CORDA-470](https://r3-cev.atlassian.net/browse/CORDA-470))
-
-### Definitions
-
-* *Platform version (Corda)* An integer representing the API version of the Corda platform
-
-> It starts at 1 and will increment by exactly 1 for each release which changes any of the publicly exposed APIs in
-> the entire platform. This includes public APIs on the node itself, the RPC system, messaging, serialisation, etc.
-> API backwards compatibility will always be maintained, with the use of deprecation to migrate away from old APIs.
-> In rare situations APIs may have to be removed, for example due to security issues. There is no relationship between
-> the Platform Version and the release version - a change in the major, minor or patch values may or may not increase
-> the Platform Version.
-
-([from the docs](https://docs.corda.net/head/versioning.html#versioning)).
-
-* *Platform version (Node)* The value of the Corda platform version that a node is running and advertising to the
-  network.
-
-* *Minimum platform version (Network)* The minimum platform version that nodes must run in order to be able to join
-  the network. Set by the network zone operator. The minimum platform version is distributed with the network
-  parameters as `minimumPlatformVersion` ([see docs](https://docs.corda.net/network-map.html#network-parameters)).
-
-* *Target platform version (CorDapp)* Introduced in this document. Indicates that a CorDapp was tested with this
-  version of the Corda platform and should be run at this API level if possible.
-
-* *Minimum platform version (CorDapp)* Introduced in this document. Indicates the minimum version of the Corda
-  platform that a Corda node has to run in order to be able to run the CorDapp.
-
-## Goals
-
-Define the semantics of the target platform version and minimum platform version attributes for CorDapps, and the
-minimum platform version for the Corda network. Describe how target and minimum versions would be specified by CorDapp
-developers. Define how these values can be accessed by the node and the CorDapp itself.
-
-## Non-goals
-
-In the future it might make sense to integrate the minimum and target versions into a Corda gradle plugin.
-Such a plugin is out of scope of this document.
-
-## Timeline
-
-This is intended as a long-term solution. The first iteration of the implementation will be part of platform version 4
-and contain the minimum and target platform version.
-
-## Requirements
-
-* The CorDapp's minimum and target platform version must be accessible to nodes at CorDapp load time.
-
-* At CorDapp load time there should be a check that the node's platform version is greater than or equal to the
-  CorDapp's minimum platform version.
-
-* API implementations must be able to access the target version of the calling CorDapp.
-
-* The node's platform version must be accessible to CorDapps.
-
-* The CorDapp's target platform version must be accessible to the node when running CorDapps.
-
-## Design
-
-### Testing
-
-When a new platform version is released, CorDapp developers can increase their CorDapp's target version and re-test
-their app. If the tests are successful, they can then release their CorDapp with the increased target version. This
-way they opt in to potentially breaking changes that were introduced in that version. If they choose to keep their
-current target version, their CorDapp will continue to work.
-
-### Implications for platform developers
-
-When new features or changes are introduced that require all nodes on the network to understand them (e.g. changes in
-the wire transaction format), they must be version-gated at the network level. This means that the new behaviour
-should only take effect if the minimum platform version of the network is equal to or greater than the version in
-which these changes were introduced. Failing that, the old behaviour must be used instead.
-
-Changes that risk breaking apps must be gated on targetVersion >= X, where X is the version in which the change was
-made, and the old behaviour must be preserved if that condition isn't met.
-
-## Technical Design
-
-The minimum and target platform version will be written to the manifest of the CorDapp's JAR, in fields called
-`Min-Platform-Version` and `Target-Platform-Version`. The node's CorDapp loader reads these values from the manifest
-when loading the CorDapp. If the CorDapp's minimum platform version is greater than the node's platform version, the
-node will not load the CorDapp and will log a warning. The CorDapp loader sets the minimum and target version in
-`net.corda.core.cordapp.Cordapp`, which can be obtained via the `CorDappContext` from the service hub.
-
-To make APIs caller-sensitive in cases where the service hub is not available, a different approach has to be used. It
-would be possible to do a stack walk and parse the manifest of each class on the stack to determine whether it belongs
-to a CorDapp, and if so, what its target version is. Alternatively, the mapping of classes to `Cordapp`s obtained by
-the CorDapp loader could be stored in a global singleton. This singleton would expose a lambda returning the current
-CorDapp's version information (e.g. `() -> Cordapp.Info`).
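-
-As a sketch of the load-time part described above (the manifest field names come from this design; the defaulting
-policy for missing or malformed attributes is an assumption):
-
-```kotlin
-import java.util.jar.JarFile
-
-data class CordappVersionInfo(val minPlatformVersion: Int, val targetPlatformVersion: Int)
-
-// Reads the version attributes from a CorDapp JAR, defaulting missing/garbled values to 1.
-fun readVersionInfo(jarPath: String): CordappVersionInfo = JarFile(jarPath).use { jar ->
-    val attrs = jar.manifest?.mainAttributes
-    fun attr(name: String) = attrs?.getValue(name)?.toIntOrNull() ?: 1
-    CordappVersionInfo(attr("Min-Platform-Version"), attr("Target-Platform-Version"))
-}
-
-// The load-time check: refuse to load (and log a warning) if the node is too old.
-fun isLoadable(nodePlatformVersion: Int, versions: CordappVersionInfo): Boolean =
-    versions.minPlatformVersion <= nodePlatformVersion
-```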
-
-Let's assume that we want to change `TimeWindow.Between` to make it inclusive, i.e. change
-`contains(instant: Instant) = instant >= fromTime && instant < untilTime` to
-`contains(instant: Instant) = instant >= fromTime && instant <= untilTime`. However, doing so will break existing
-CorDapps. We could then version-guard the change such that the new behaviour is only used if the target version of the
-CorDapp calling `contains` is equal to or greater than the platform version that contains this change. It would look
-similar to this:
-
- ```kotlin
- fun contains(instant: Instant): Boolean {
-     // Assuming the inclusive behaviour was introduced in platform version 42.
-     if (CorDappVersionResolver.resolve().targetVersion >= 42) {
-         return instant >= fromTime && instant <= untilTime
-     } else {
-         return instant >= fromTime && instant < untilTime
-     }
- }
- ```
-
-Version-gating API changes when the service hub is available would look similar to the above example; in that case the
-service hub's CorDapp provider would be used to determine whether this code is being called from a CorDapp and to
-obtain its target version information.
diff --git a/docs/source/design/template/decisions/decision.md b/docs/source/design/template/decisions/decision.md
deleted file mode 100644
index 9a2ca69069..0000000000
--- a/docs/source/design/template/decisions/decision.md
+++ /dev/null
@@ -1,39 +0,0 @@
-![Corda](https://www.corda.net/wp-content/uploads/2016/11/fg005_corda_b.png)
-
---------------------------------------------
-Design Decision:
-============================================
-
-## Background / Context
-
-Short outline of decision point.
-
-## Options Analysis
-
-### A.