@@ -89,8 +89,6 @@ logic provided by the apps.

Hash and zone whitelist constraints are left over from earlier Corda versions, before Signature Constraints were
implemented. They make applications harder to upgrade than signature constraints do, so they are best avoided.

Further information on the design of Signature Constraints can be found in its :doc:`design document <design/data-model-upgrades/signature-constraints>`.

.. _signing_cordapps_for_use_with_signature_constraints:

Signing CorDapps for use with Signature Constraints
@@ -530,8 +530,7 @@ packages, they could call package-private methods, which may not be expected by
and request ownership of your root package namespaces (e.g. ``com.megacorp.*``), with the signing keys you will be using to sign your app JARs.
The zone operator can then add your signing key to the network parameters, preventing attackers from defining types in your own package namespaces.
Whilst this feature is optional, it may be helpful to block attacks at the boundaries of a Corda based application
where type names may be taken "as read". You can learn more about this feature and the motivation for it by reading
":doc:`design/data-model-upgrades/package-namespace-ownership`".

Step 11. Consider adding extension points to your flows
-------------------------------------------------------
@@ -100,12 +100,6 @@ language = None
# Else, today_fmt is used as the format for a strftime call.
# today_fmt = '%B %d, %Y'

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
exclude_patterns = ['design/README.md']
if tags.has('pdfmode'):
    exclude_patterns = ['design', 'design/README.md']

# The reST default role (used for this markup: `text`) to use for all
# documents.
# default_role = None
@@ -280,8 +280,8 @@ But if another CorDapp developer, `OrangeCo` bundles the `Fruit` library, they m
This will create a `com.fruitcompany.Banana` signed by `TheOrangeCo`, so there could be two types of Banana states on the network,
but "owned" by two different parties. This means that while they might have started from the same code, nothing stops these `Banana` contracts from diverging.
Parties on the network receiving a `com.fruitcompany.Banana` will need to explicitly check the constraint to understand what they received.
In Corda 4, to help avoid this type of confusion, we introduced the concept of Package Namespace Ownership (see ":doc:`design/data-model-upgrades/package-namespace-ownership`").
Briefly, it allows companies to claim namespaces, so anyone who encounters a class in such a package that is not signed by the registered key knows it is invalid.

This new feature can be used to solve the above scenario. If `TheFruitCo` claims package ownership of `com.fruitcompany`, it will prevent anyone else
from bundling its code, because they will not be able to sign it with the right key.
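A minimal Kotlin sketch of such an explicit check is shown below. It is illustrative only: the ``expectedFruitCoKey`` parameter is a placeholder for whichever signing key you actually trust, not something provided by the platform.

.. sourcecode:: kotlin

    import net.corda.core.contracts.HashAttachmentConstraint
    import net.corda.core.contracts.SignatureAttachmentConstraint
    import net.corda.core.contracts.TransactionState
    import java.security.PublicKey

    // Inspect the constraint attached to a received state before trusting its type name.
    fun describeBanana(state: TransactionState<*>, expectedFruitCoKey: PublicKey): String =
        when (val constraint = state.constraint) {
            is SignatureAttachmentConstraint ->
                if (constraint.key == expectedFruitCoKey) "Banana issued against TheFruitCo's signed code"
                else "Banana signed by some other party - inspect before trusting"
            is HashAttachmentConstraint -> "Banana locked to a specific JAR hash"
            else -> "Banana with another constraint type: $constraint"
        }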
@@ -1,40 +0,0 @@
![Corda](https://www.corda.net/wp-content/uploads/2016/11/fg005_corda_b.png)

<a href="https://ci-master.corda.r3cev.com/viewType.html?buildTypeId=CordaEnterprise_Build&tab=buildTypeStatusDiv"><img src="https://ci.corda.r3cev.com/app/rest/builds/buildType:Corda_CordaBuild/statusIcon"/></a>

# Design Documentation

This directory is used to version control Corda design documents.

These should be written in [Markdown](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet) (a design template is provided for general guidance) and follow the design review process outlined below. It is recommended you use a Markdown editor such as [Typora](https://typora.io/), or an appropriate plugin for your favourite editor (e.g. the [Sublime Markdown editing theme](http://plaintext-productivity.net/2-04-how-to-set-up-sublime-text-for-markdown-editing.html)).

## Design Review Process

Please see the [design review process](design-review-process.md). It covers the following stages:

* Feature request submission
* High level design
* Review / approve gate
* Technical design
* Review / approve gate
* Plan, prototype, implement, QA

## Design Template

Please copy this [directory](template) to a new location under `/docs/source/design` (use a meaningful, short, descriptive directory name) and use the [Design Template](template/design.md) contained within to guide writing your design proposal. Whilst the section headings may be treated as placeholders for guidance, you are expected to be able to answer any questions related to pertinent section headings (where relevant to your design) at the design review stage. Use the [Design Decision Template](template/decisions/decision.md) (as many times as needed) to record the pros, cons and justification of any design decision recommendations where multiple options are available. These should be directly referenced from the *Design Decisions* section of the main design document.

The design document may be completed in one or two iterations, by completing the following two main sections together or separately:

* High level design
  Where a feature requirement is specified at a high level, and multiple design solutions are possible, this section should be completed and circulated for review prior to completing the detailed technical design.
  High level designs often benefit from a formal meeting and discussion review amongst stakeholders to reach consensus on the preferred way to proceed. The design author then incorporates all meeting outcome decisions back into a revision for final GitHub PR approval.
* Technical design
  The technical design consists of implementation-specific details which require a deeper understanding of the Corda software stack, such as public APIs and services, libraries, and the associated middleware infrastructure (messaging, security, database persistence, serialization) used to realize these.
  Technical designs should lead directly to a GitHub PR review process.

Once a design is approved via the GitHub PR process, please commit the PR to the GitHub repository with a meaningful version identifier (e.g. my super design document - **V1.0**).

## Design Repository

All design documents are version controlled in GitHub under the directory `/docs/source/design`.
For designs that relate to Enterprise-only features (and that may contain proprietary IP), these should be stored under the [Enterprise GitHub repository](https://github.com/corda/enterprise). All other public designs should be stored under the [Open Source GitHub repository](https://github.com/corda/corda).
@@ -1,50 +0,0 @@
Design Decision: Certificate hierarchy levels
=============================================

## Background / Context

The decision of how many levels to include is a key feature of the [proposed certificate hierarchy](../design.md).

## Options Analysis

### Option 1: 2-level hierarchy

Under this option, intermediate CA certificates for key signing services (Doorman, Network Map, CRL) are generated as
direct children of the root certificate.

![Current](../images/option1.png)

#### Advantages

- Simplest option
- Minimal change to existing structure

#### Disadvantages

- The root CA certificate is used to sign both intermediate certificates and the CRL. This may be considered a drawback,
  as the root CA should be used only to issue other certificates.

### Option 2: 3-level hierarchy

Under this option, an additional 'Company CA' certificate is generated from the root CA certificate, and is then used to generate the
intermediate certificates.

![Current](../images/option2.png)

#### Advantages

- Allows for the option to remove the root CA from the network altogether and store it in an offline medium - may be preferred by some stakeholders
- Allows (theoretical) revocation and replacement of the Company CA certificate without needing to replace the trust root.

#### Disadvantages

- Greater complexity

## Recommendation and justification

Proceed with option 1: 2-level hierarchy.

No authoritative argument from a security standpoint has been made which would justify the added complexity of option 2.
Given the business impact of revoking the Company CA certificate, this must be considered an extremely unlikely event
with comparable implications to the revocation of the root certificate itself; hence no practical justification for the
addition of the third level is observed.
@@ -1,42 +0,0 @@
Design Decision: Certificate Hierarchy
======================================

## Background / Context

The purpose of this document is to record the decision on the certificate hierarchy. This decision is necessary because it
affects the development of other features (e.g. the Certificate Revocation List).

## Options Analysis

There are various options for how we structure the hierarchy above the node CA.

### Option 1: Single trust root

Under this option, TLS certificates are issued by the node CA certificate.

#### Advantages

- Existing design

#### Disadvantages

- The root CA certificate is used to sign both intermediate certificates and the CRL. This may be considered a drawback, as the root CA should be used only to issue other certificates.

### Option 2: Separate TLS vs. identity trust roots

This option splits the hierarchy by introducing a separate trust root for TLS certificates.

#### Advantages

- Simplifies issuance of TLS certificates (TLS certificates carry implementation constraints beyond those of other certificates used by Corda - specifically, EdDSA keys are not yet widely supported for TLS certificates)
- Avoids the requirement to specify accurate usage restrictions on node CA certificates so that they can issue their own TLS certificates

#### Disadvantages

- Additional complexity

## Recommendation and justification

Proceed with option 1 (single trust root) for current purposes.

The feasibility of option 2 in the code should be further explored in due course.
@@ -1,84 +0,0 @@
# Certificate hierarchies

.. important:: This design doc applies to the main Corda network. Other networks may use different certificate hierarchies.

## Overview

A certificate hierarchy is proposed to enable effective key management in the context of managing Corda networks.
This includes certificate usage for the data signing process and the certificate revocation process
in case of a key compromise. At the same time, the result should remain compliant with
[OCSP](https://en.wikipedia.org/wiki/Online_Certificate_Status_Protocol) and [RFC 5280](https://www.ietf.org/rfc/rfc5280.txt).

## Background

Corda utilises public key cryptography for signing and authentication purposes, and for securing communication
via TLS. As a result, every entity participating in a Corda network owns one or more cryptographic key pairs {*private,
public*}. The integrity and authenticity of an entity's public key is assured using digital certificates following the
[X.509 standard](https://tools.ietf.org/html/rfc5280), whereby the receiver's identity is cryptographically bound to
their public key.

Certificate Revocation List (CRL) functionality interacts with the hierarchy of the certificates, as the revocation list
for any given certificate must be signed by the certificate's issuer. Therefore if we have a single doorman CA, the sole
CRL for node CA certificates would be maintained by that doorman CA, creating a bottleneck. Further, if that doorman CA
is compromised and its certificate revoked by the root certificate, the entire network is invalidated as a consequence.

The current solution of a single intermediate CA is therefore too simplistic.

Further, the split and location of intermediate CAs has an impact on where long term infrastructure is hosted, as the CRLs
for certificates issued by these CAs must be hosted at the same URI for the lifecycle of the issued certificates.

## Scope

Goals:

* Define effective certificate relationships between participants and Corda network services (i.e. nodes, notaries, network map, doorman).
* Enable compliance with both [OCSP](https://en.wikipedia.org/wiki/Online_Certificate_Status_Protocol) and [RFC 5280](https://www.ietf.org/rfc/rfc5280.txt) (CRL)-based revocation mechanisms
* Mitigate relevant security risks (keys being compromised, data privacy loss etc.)

Non-goals:

* Define an end-state mechanism for certificate revocation.

## Requirements

In case of a private key being compromised, or a certificate being incorrectly issued, it must be possible for the issuer to
revoke the appropriate certificate(s).

The solution needs to scale, keeping in mind that the list of revoked certificates from any given certificate authority
is likely to grow indefinitely. However, for an initial deployment a temporary certificate authority may be used, and
given that it will not be required to issue certificates in the long term, scaling issues are less of a concern in this
context.

## Design Decisions

.. toctree::
   :maxdepth: 2

   decisions/levels.md
   decisions/tls-trust-root.md

## **Target** Solution

![Target certificate structure](./images/cert_structure_v3.png)

The design introduces discrete intermediate CAs below the network trust root for each logical service exposed by the doorman - specifically:

1. Node CA certificate issuance
2. Network map signing
3. Certificate Revocation List (CRL) signing
4. OCSP revocation signing

The use of discrete certificates in this way facilitates subsequent changes to the model, including retiring and replacing certificates as needed.

Each of the above certificates will specify a CRL allowing the certificate to be revoked. The root CA operator
(primarily R3) will be required to maintain this CRL for the lifetime of the process.

TLS certificates will remain issued under node CA certificates (see [decision: TLS trust
root](./decisions/tls-trust-root.md)).

Nodes will be able to specify CRL(s) for the TLS certificates they issue; in general, they will be required to maintain such CRLs for
the lifecycle of the TLS certificates.

In the initial state, a single doorman intermediate CA will be used for issuing all node certificates. Further
intermediate CAs for issuance of node CA certificates may subsequently be added to the network, where appropriate,
potentially split by geographic region or otherwise.
@@ -1,151 +0,0 @@
# Migration from the hash constraint to the Signature constraint

## Background

Corda pre-V4 only supports the HashConstraint and the WhitelistedByZoneConstraint.
The default constraint, if no entry was added to the network parameters, is the HashConstraint.
Thus, it is very likely that most early states were created with the HashConstraint.

When changes to the contract are required, the only alternative is the explicit upgrade, which creates a new contract but inherits the HashConstraint (this time with the hash of the new JAR).

**The current implementation of the explicit upgrade does not support changing the constraint.**

It is very unlikely that these first deployments actually wanted a non-upgradeable version.

This design doc presents a smooth migration path from the hash constraint to the signature constraint.

## Goals

CorDapps that were released (states created) with the hash constraint should be able to transition to the signature constraint if the original developer decides to do that.

A malicious party should not be able to attack this feature by "taking ownership" of the original code.

## Non-Goals

Migration from the whitelist constraint has already been implemented, so it will not be addressed. (The CorDapp developer or owner just needs to sign the JAR and whitelist the signed JAR.)

Versioning is addressed in separate design docs.

## Design details

### Requirements

To migrate without disruption from the hash constraint, the JAR that is attached to a spending transaction needs to satisfy both the hash constraint of the input state and the signature constraint of the output state.

It also needs to reassure future transaction verifiers - when doing transaction resolution - that this was a legitimate transition, and not a malicious attempt to replace the contract logic.

### Process

To achieve the first part, we can establish the following convention:

- The developer signs the original JAR (the one that was used with the hash constraint).
- Nodes install it, thus whitelisting it.
- The `HashConstraint.verify` method will be modified to verify the hash both with and without signatures.
- Nodes create normal transactions that spend an input state with the HashConstraint and output states with the signature constraint. No special spend-to-self transactions should be required.
- Such a transaction validates correctly, as both constraints pass: the unsigned hash matches, and the signatures are there.
- This logic needs to be added to the constraint propagation transition matrix (a sketch of this rule follows the list). It could be enabled only for states created pre-V4, when there was no alternative.
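The sketch below illustrates the extra case that would be added to the constraint propagation rules: a HashConstraint input may produce a SignatureAttachmentConstraint output, provided the single attached JAR satisfies both. It is illustrative pseudo-logic under the modified `HashConstraint.verify` behaviour described above, not the actual Corda transition matrix, and the surrounding rules are deliberately simplified.

```kotlin
import net.corda.core.contracts.AttachmentConstraint
import net.corda.core.contracts.ContractAttachment
import net.corda.core.contracts.HashAttachmentConstraint
import net.corda.core.contracts.SignatureAttachmentConstraint

// Simplified sketch: is the input -> output constraint transition acceptable for this attachment?
fun transitionAllowed(
    input: AttachmentConstraint,
    output: AttachmentConstraint,
    attachment: ContractAttachment
): Boolean = when {
    // The migration case described above: the signed JAR must match the old hash (ignoring
    // signatures) *and* carry the signatures demanded by the new constraint.
    input is HashAttachmentConstraint && output is SignatureAttachmentConstraint ->
        input.isSatisfiedBy(attachment) && output.isSatisfiedBy(attachment)
    // Everything else collapsed to "constraint unchanged" for brevity.
    else -> input == output
}
```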
For the second part:

- The developer needs to claim the package (see the package namespace ownership design). This gives future verifiers confidence that it is the actual developer that continues to own that JAR.

To summarise, if a CorDapp developer wishes to migrate the code it controls to the signature constraint for better flexibility:

1. Claim the package.
2. Sign the JAR and distribute it.
3. In time, all states will naturally transition to the signature constraint.
4. Release new versions under the signature constraint.

A normal node just downloads the signed JAR using the normal process for that, and the platform does the rest.

### Caveats

#### Someone really wants to issue states with the HashConstraint, and ensure that can never change.

- As mentioned above, the transaction builder would only automatically transition states created pre-V4.

- If this is the original developer of the CorDapp, then they can simply hardcode a check in the contract that the constraint must be the HashConstraint.

- If it is actually a third party that uses a contract it doesn't own, but wants to ensure that only that code is ever used:
  this should not be allowed, as it would clash with states created without this constraint (which might have higher versions), and create incompatible states.
  The option in this case is to force such parties to create a new contract (perhaps subclassing the version they want), own it, and hardcode the check as above.

#### Some nodes haven't upgraded all their states by the time a new release is already being used on the network.

- A transaction mixing an original HashConstraint state and a v2 signature constraint state will not pass. The only way out is to strongly "encourage" nodes to upgrade before the new release.

The problem is that, in a transaction, the attachment needs to pass the constraint of every state.

If the rightful owner took over the contract of states originally released with the HashConstraint and started releasing new versions, then the following might happen:

- NodeA did not migrate all of its states to the signature constraint.
- NodeB did, and already has states created with version 2.
- If they decide to trade, NodeA will add its HashConstraint state and NodeB will add its version 2 SignatureConstraint state to a new transaction.
  This is an impossible transaction: selecting version 1 of the contract violates the non-downgrade rule, while selecting version 2 violates the original HashConstraint.

Note: If we consider this to be a real problem, then we can implement a new NoOp transaction type, similar to the notary change or contract upgrade transactions. The security implications need to be considered before such work is started.
Nodes could use this type of transaction to change the constraint of existing states without the need to transact.

### Implementation details

- Create a function to return the hash of a signed JAR after stripping its signatures (a sketch follows this list).
- Change the HashConstraint to check against either of these two hashes.
- Change the transaction builder logic to automatically transition constraints that are signed, owned, etc.
- Change the constraint propagation transition logic to allow this.
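As a rough illustration of the first bullet, the Kotlin sketch below computes a digest over a JAR's entries while skipping the artifacts added by `jarsigner` (the manifest and `META-INF` signature files), so a signed and an unsigned copy of the same JAR hash to the same value. It is an assumption-laden sketch (it hashes entry names plus contents in sorted order), not the actual Corda implementation.

```kotlin
import java.io.File
import java.security.MessageDigest
import java.util.jar.JarFile

// Entries produced by JAR signing that should not affect the "unsigned" hash.
private val SIGNING_ARTIFACTS = Regex("META-INF/(.*\\.(SF|DSA|RSA|EC)|MANIFEST\\.MF)", RegexOption.IGNORE_CASE)

fun hashIgnoringSignatures(jar: File): ByteArray {
    val digest = MessageDigest.getInstance("SHA-256")
    JarFile(jar).use { jarFile ->
        jarFile.entries().toList()
            .filterNot { it.isDirectory || SIGNING_ARTIFACTS.matches(it.name) }
            .sortedBy { it.name }                  // independent of the order entries appear in the JAR
            .forEach { entry ->
                digest.update(entry.name.toByteArray())
                jarFile.getInputStream(entry).use { input -> digest.update(input.readBytes()) }
            }
    }
    return digest.digest()
}
```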
## Alternatives considered

### Migrating from the HashConstraint to the SignatureConstraint via the WhitelistConstraint

We already have a strategy to migrate from the WhitelistConstraint to the SignatureConstraint:

- The original developer (owner) signs the last version of the JAR and whitelists the signed version.
- The platform allows transitioning to the SignatureConstraint as long as all the signers of the JAR are in the SignatureConstraint.

We could attempt to extend this strategy with a HashConstraint -> WhitelistConstraint path.

#### The process would be:

- The original developer of the contract that used the HashConstraint makes a whitelist contract request, providing both the original JAR and a signed copy of it.
- The zone operator needs to make sure that this really is the original developer claiming ownership of that CorDapp.

##### Option 1: Skip the WhitelistConstraint when spending states. (InputState = HashConstraint, OutputState = SignatureConstraint)

- This is not possible, as one of the two constraints will fail.
- Special constraint logic would be needed, which is risky.

##### Option 2: Go through the WhitelistConstraint when spending states

- When a state is spent, the transaction builder sees there is a whitelist constraint and selects the first entry.
- The transition matrix will allow the transition from hash to whitelist.
- The next time the state is spent, it will transition from the whitelist constraint to the signature constraint.

##### Advantage:

- The tricky step of removing the signature from a JAR to calculate the hash is no longer required.

##### Disadvantages:

- The transition happens in two steps, which adds another layer of surprise and of potential problems.
- The NoOp transaction becomes mandatory for this, along with all the complexity it brings (all participants signing). It would need to be run twice.
- An unnecessary whitelist entry is added. If that developer also decides to claim the package (as probably most will do in the beginning), it will grow the network parameters and increase the workload on the zone operator.
- We create an unintended migration path from the HashConstraint to the WhitelistConstraint.
@@ -1,110 +0,0 @@
# Package namespace ownership

This design document outlines a new Corda feature that allows a compatibility zone to give ownership of parts of the Java package namespace to certain users.

"*There are only two hard problems in computer science: 1. Cache invalidation, 2. Naming things, 3. Off by one errors*"

## Background

Corda implements a decentralised database that can be unilaterally extended with new data types and logic by its users, without any involvement by the closest equivalent we have to administrators (the "zone operator"). Even informing them is not required.

This design minimises the power zone operators have and ensures deploying new apps can be fast and cheap - it's limited only by the speed with which the users themselves can move. But it introduces problematic levels of namespace complexity which can make programming securely harder than in regular non-decentralised programming.

#### Java namespaces

A typical Java application, seen from the JVM level, has a flat namespace in which a single string name binds to a single class. In object oriented programming a class defines both a data structure and the code used to enforce various invariants like "a person's age may not be negative", so this allows a developer to reason about what the identifier `com.example.Person` really means throughout the lifetime of their program.

More complex Java applications may have a nested namespace using classloaders, so inside a JVM a class is actually a pair of (classloader pointer, class name), and this can be used to support tricks like having two different versions of the same class in use simultaneously. The downside is more complexity for the developer to deal with. When things get mixed up, this can surface (in Java 8) as nonsensical error messages like "com.example.Person cannot be cast to com.example.Person". In Java 9 classloaders were finally given names, so these errors make more sense.

#### Corda namespaces

Corda faces an extension of the Java namespace problem - we have a global namespace in which malicious adversaries might be choosing names to be deliberately confusing. Nothing forces an app developer to follow the standard conventions for Java package or class names - someone could make an app that uses the same class name as one of your own apps. Corda needs to keep these two different classes, from different origins, separated.

On the core ledger this is done by associating each state with an _attachment_. The attachment is the JAR file that contains the class files used by states. To load a state, a classloader is defined that uses the attachments on a transaction, and then the state class is loaded via that classloader.

With this infrastructure in place, the Corda node and JVM can internally keep two classes that share the same name separated. The name of the state is, in effect, a list of attachments (hashes of JAR files) combined with a regular class name.

#### Namespaces and versioning

Names and namespaces are a critical part of how platforms of any kind handle software evolution. If component A is verifying the precise content of component B, e.g. by hashing it, then there can be no agility - component B can never be upgraded. Sometimes this is what's wanted. But usually you want the indirection of a name or set of names that stands in for some behaviour. Exactly how that behaviour is provided is abstracted away behind the mapping of the namespace to concrete artifacts.

Versioning and resistance to malicious attack are likewise heavily interrelated, because given two different codebases that export the same names, it's possible that one is a legitimate upgrade which changes the logic behind the names in beneficial ways, and the other is an imposter that changes the logic in malicious ways. It's important to keep the differences straight, which can be hard because, by their very nature, two versions of the same app tend to be nearly identical.

#### Namespace complexity

Reasoning about namespaces is hard and has historically led to security flaws in many platforms.

Although the Corda namespace system _can_ keep overlapping but distinct apps separated, that unfortunately doesn't mean it does so everywhere. In a few places Corda does not currently provide all the data needed to work with full state names, although we are adding this data to RPC in Corda 4.

Even if Corda were sure to get every detail of this right in every area, a full ecosystem consists of many programs written by app developers - not just contracts and flows, but also RPC clients, bridges from internal systems and so on. It is unreasonable to expect developers to fully keep track of Corda compound names everywhere throughout the entire pipeline of tools and processes that may surround the node: some of them will lose track of the attachments list and end up with only a class name, and others will do things like serialise to JSON, in which even type names go missing.

Although we can work on improving our support and APIs for working with sophisticated compound names, we should also allow people to work with simpler namespaces again - like just Java class names. This involves a small sacrifice of decentralisation, but the increase in security is probably worth it for most developers.

## Goals

* Provide a way to reduce the complexity of naming and working with names in Corda by allowing for a small amount of centralisation, balanced by a reduction in developer mental load.
* Keep it optional for both zones and developers.
* Allow most developers to work just with ordinary Java class names, without needing to consider the complexities of a decentralised namespace.

## Non-goals

* Directly make it easier to work with "decentralised names". This can be a project that comes later.

## Design

To make it harder to accidentally write insecure code, we would like to support a compromise configuration in which a compatibility zone can publish a map of Java package namespaces to public keys. An app/attachment JAR may only define a class in such a namespace if it is signed by the given public key. Using this feature would make a zone slightly less decentralised, in order to obtain a significant reduction in mental overhead for developers.

Example of how the network parameters would be extended, in pseudo-code:

```kotlin
data class JavaPackageName(name: String) {
    init { /* verify 'name' is a valid Java package name */ }
}

data class NetworkParameters(
    ...
    val packageOwnership: Map<JavaPackageName, PublicKey>
)
```

The `PublicKey` object can be any of the algorithms supported by signature constraints. The map defines a set of dotted package names like `com.foo.bar`, where any class in that package or any sub-package of that package is considered to match (so `com.foo.bar.baz.boz.Bish` is a match but `com.foo.barrier` is not).
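A minimal sketch of that matching rule, assuming nothing beyond standard Kotlin, might look like this (the helper name `isInOwnedNamespace` is illustrative, not a platform API):

```kotlin
// True if `className` lives in `claimedPackage` or any of its sub-packages.
fun isInOwnedNamespace(claimedPackage: String, className: String): Boolean {
    val packageName = className.substringBeforeLast('.', missingDelimiterValue = "")
    return packageName == claimedPackage || packageName.startsWith("$claimedPackage.")
}

// isInOwnedNamespace("com.foo.bar", "com.foo.bar.baz.boz.Bish")  -> true
// isInOwnedNamespace("com.foo.bar", "com.foo.barrier.Thing")     -> false
```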
When a class is loaded from an attachment or application JAR, signature checking is enabled. If the package of the class matches one of the owned namespaces, the JAR must have enough signatures to satisfy the PublicKey (there may need to be more than one if the PublicKey is composite).

Please note the following:

* It's OK to have unsigned JARs.
* It's OK to have JARs that are signed, but for which there are no claims in the network parameters.
* It's OK if entries in the map are removed (the system becomes more open). If entries in the map are added, this could cause consensus failures if people are still using old unsigned versions of the app.
* The map specifies keys, not certificate chains; therefore the keys do not have to chain off the identity key of a zone member. App developers do not need to be members of a zone for their app to be used there.

From a privacy and decentralisation perspective, the zone operator *may* learn who is developing apps in their zone or (in cases where a vendor makes a single app and thus it's obvious) which apps are being used. This is not ideal, but there are mitigations:

* The privacy leak is optional.
* The zone operator still doesn't learn who is using which apps.
* There is no obligation for Java package namespaces to correlate obviously to real world identities or products. For example you could register a trivial "front" domain and claim ownership of that, then use it for your apps. The zone operator would see only a codename.

#### Claiming a namespace

The exact mechanism used to claim a namespace is up to the zone operator. A typical approach would be to accept an SSL certificate with the domain in it as proof of domain ownership, or to accept an email from that domain as long as the domain uses DKIM to prevent From-header spoofing.

#### The vault API

The vault query API is an example of how tricky it can be to manage truly decentralised namespaces. The `Vault.Page` class does not include constraint information for a state. Therefore, if a generic app were storing states of many different types to the vault without having the specific apps installed, it would be possible for someone to create a confusing name: e.g. an app created by MiniCorp could export a class named `com.megacorp.example.Token`, and this would be mapped by the RPC deserialisation logic to the actual MegaCorp app - the RPC client would have no way to know this had happened, even if the user were checking correctly, which it's unlikely they would be.

The `StateMetadata` class can easily be extended to include constraint information, to make safely programming against a decentralised namespace possible. As part of this work this extension will be made.

But the new field would still need to be used - a subtle detail that would be easy to overlook. Package namespace ownership ensures that if you have an app installed locally on the client side that implements `com.megacorp.example`, then that code is likely to match closely enough with the version that was verified by the node.
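A hedged sketch of how an RPC client might use that extension, assuming the constraint information is exposed on `StateMetadata` as a field along the lines of `constraintInfo` (the exact name and shape are assumptions here, not a confirmed API):

```kotlin
import net.corda.core.contracts.SignatureAttachmentConstraint
import net.corda.core.node.services.Vault
import java.security.PublicKey

// For each page entry, pair the state's class name with whether its constraint names the key we
// trust for that package. `trustedKey` is a placeholder supplied by the caller.
fun flagUntrustedStates(page: Vault.Page<*>, trustedKey: PublicKey): List<Pair<String, Boolean>> =
    page.states.zip(page.statesMetadata).map { (stateAndRef, metadata) ->
        val constraint = metadata.constraintInfo?.constraint()   // hypothetical accessor, see above
        val trusted = constraint is SignatureAttachmentConstraint && constraint.key == trustedKey
        stateAndRef.state.data.javaClass.name to trusted
    }
```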
@@ -1,155 +0,0 @@
# Signature constraints

This design document outlines an additional kind of *contract constraint*, used for specifying inside a transaction what the set of allowable attached contract JARs can be for each state.

## Background

Contract constraints are a part of how Corda ensures the correct code is executed to verify transactions, and also how it manages application upgrades. There are two kinds of upgrade that can be applied to the ledger:

* Explicit
* Implicit

An *explicit* upgrade is when a special kind of transaction is used, the *contract upgrade transaction*, which has the power to suspend normal contract execution and validity checking. The new contract being upgraded to must be willing to accept the old state and can replace it with a new one. Because this can cause arbitrary edits to the ledger, every participant in a state must sign the contract upgrade transaction for it to be considered valid.

Note that in the case of single-participant states, whilst you could unilaterally replace a token state with a different state, this would be a state controlled by an application that other users wouldn't recognise, so you cannot transmute a token into a private contract with yourself and then transmute it back, because contracts will only upgrade states they created themselves.

An *implicit* upgrade is when the creator of a state has pre-authorised upgrades, quite possibly including versions of the app that didn't exist when the state was first authored. Implicit upgrades don't require a manual approval step - the new code can start being used whenever the next transaction for a state is needed, as long as it meets the state's constraint.

Our current set of constraints is quite small. We support:

* `AlwaysAcceptContractConstraint` - any attachment can be used; effectively this disables ledger security.
* `HashAttachmentContractConstraint` - only an attachment of the specified hash can be used. This is the same as Bitcoin or Ethereum and means that once the state is created, the code is locked in permanently.
* `WhitelistedByZoneContractConstraint` - the network parameters contain a map of state class name to allowable hashes for the attachments.

The last constraint allows upgrades 'from the future' to be applied without disabling ledger security. However, it is awkward to use, because any new version of any app requires a new set of network parameters to be signed by the zone operator and accepted by all participants, which in turn requires a node restart.

The problems of `WhitelistedByZone` were known at the time it was developed; however, the feature was implemented anyway to reduce schedule slip for the Corda 3.0 release, whilst still allowing some form of application upgrade.

We would like a new kind of constraint that is more convenient and decentralised whilst still being secure.

## Goals

* Improve usability by eliminating the need to change the network parameters.
* Improve decentralisation by allowing apps to be developed and upgraded without the zone operator knowing or being able to influence it.
* Eventually, phase out zone whitelisting constraints.

## Non-goals

* Preventing downgrade attacks. Downgrade attack prevention will be tackled in a different design effort.
* Phasing out hash constraints. If malicious app creators are in the user's threat model then hash constraints are the way to go.
* Handling the case where third parties re-sign app JARs.
* Package namespace ownership (a separate effort).
* Allowing the zone operator to override older constraints, to provide a non-explicit upgrade path.

## Design details

We propose being able to constrain to any attachments whose files are signed by a specified set of keys.

This satisfies the usability requirement because the creation of a new application is as simple as invoking the `jarsigner` tool that comes with the JDK. This can be integrated with the build system via a Gradle or Maven task. For example, Gradle can use jarsigner via [the signjar task](https://ant.apache.org/manual/Tasks/signjar.html) ([example](https://gist.github.com/Lien/7150434)).
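A hedged sketch of how the `signjar` ant task could be wired into a Gradle Kotlin DSL build is shown below; the task name, JAR path, key alias, keystore path and password property are all placeholders, not project conventions.

```kotlin
// build.gradle.kts - illustrative wiring of ant's signjar task, not an official plugin.
tasks.register("signCordappJar") {
    dependsOn("jar")
    doLast {
        ant.withGroovyBuilder {
            "signjar"(
                "jar" to file("$buildDir/libs/my-cordapp.jar"),   // path to the built CorDapp JAR
                "alias" to "cordapp-signing-key",
                "keystore" to file("certs/cordapp.jks"),
                "storepass" to (findProperty("signingPassword") ?: "changeit")
            )
        }
    }
}
```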
This also satisfies the decentralisation requirement, because app developers can sign apps without the zone operator's involvement or knowledge.

Using JDK-style JAR code signing has several advantages over rolling our own:

* Although a signing key is required, this can be set up once. It can be protected by a password, the Windows/Mac built-in keychain security, a token that supports PIN/biometrics, or an HSM. All these options are supported out of the box by the Java security architecture.
* JARs can be signed multiple times by different entities. The nature of this process means the signatures can be combined easily - there is no ordering requirement or complex collaboration tooling needed. By implication this means that a signature constraint can use a composite key.
* APIs for verifying JAR signatures are included in the platform already.
* File hashes can be checked file-at-a-time, so random access is made easier, e.g. from inside an SGX enclave.
* Although Gradle can make reproducible JARs quite easily, JAR signatures do not include irrelevant metadata like file ordering or timestamps, so they are robust to being unpacked and repacked.
* The signature can be timestamped using an RFC-compliant timestamping server. Our notaries do not currently implement this protocol, but they could.
* JAR signatures are in-lined into the JAR itself and do not ride alongside it. This is a good fit for our current attachments capabilities.

There are also some disadvantages:

* JAR signatures do *not* have to cover every file in the JAR. It is possible to add files to the JAR later that are unsigned, and for the verification process to still pass, as verification is done on a per-file basis. This is unintuitive and requires special care.
* The JAR verification APIs do not validate that the certificate chain in the JAR is meaningful. Therefore you must validate the certificate chain yourself in every case where a JAR is being verified (see the sketch after this list).
* JAR signing does not cover the MANIFEST.MF file or files that start with SIG- (case-insensitive). Storing sensitive data in the manifest could be a problem as a result.
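As a rough, hedged illustration of both points above (per-entry verification and the need to examine signers yourself), the following Kotlin sketch reads every entry of a JAR and collects its code signers, failing if any non-manifest entry is unsigned. The certificate-chain policy check is left as a stub because what "meaningful" means is application-specific.

```kotlin
import java.io.File
import java.security.CodeSigner
import java.util.jar.JarFile

// Returns the signers common to every signed entry; throws if any content entry is unsigned.
fun requireFullySignedJar(jar: File): Set<CodeSigner> =
    JarFile(jar, true).use { jarFile ->
        val signersPerEntry = jarFile.entries().toList()
            .filterNot { it.isDirectory || it.name.startsWith("META-INF/") }
            .map { entry ->
                // Signature checks only happen once the entry's bytes have been fully read.
                jarFile.getInputStream(entry).use { it.readBytes() }
                entry.codeSigners?.toSet() ?: throw SecurityException("Unsigned entry: ${entry.name}")
            }
        // TODO: validate each signer's certificate chain against your own trust policy here.
        signersPerEntry.fold(signersPerEntry.firstOrNull() ?: emptySet()) { acc, s -> acc intersect s }
    }
```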
### Data structures

The proposed data structure for the new constraint type is as follows:

```kotlin
data class SignatureAttachmentConstraint(
    val key: PublicKey
) : AttachmentConstraint
```

Therefore, if a state advertises this constraint along with a class name of `com.foo.Bar`, then the definition of Bar must reside in an attachment with signatures sufficient to meet the given public key. Note that the `key` may be a `CompositeKey` which is fulfilled by multiple signers. Multiple signers of a JAR are useful for decentralised administration of an app that wishes to have a threat model in which one of the app developers may go bad, but not a majority of them. For example, there could be a 2-of-3 threshold of {app developer, auditor, R3} in which R3 is legally bound to only sign an upgrade if the auditor is unavailable, e.g. has gone bankrupt. However, we anticipate that most constraints will be one-of-one for now.
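A hedged Kotlin sketch of that 2-of-3 example, using Corda's `CompositeKey` builder (the three key parameters are placeholders for the real signers' keys):

```kotlin
import net.corda.core.contracts.SignatureAttachmentConstraint
import net.corda.core.crypto.CompositeKey
import java.security.PublicKey

// Any two of the three keys can together satisfy the resulting constraint.
fun twoOfThreeConstraint(devKey: PublicKey, auditorKey: PublicKey, r3Key: PublicKey): SignatureAttachmentConstraint {
    val compositeKey = CompositeKey.Builder()
        .addKeys(devKey, auditorKey, r3Key)   // each key keeps the default weight of 1
        .build(threshold = 2)
    return SignatureAttachmentConstraint(compositeKey)
}
```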
We will add a `signers` field to the `ContractAttachment` class that will be filled out at load time if the JAR is signed. The signers will be computed by checking the certificate chain for every file in the JAR, and any unsigned files will cause an exception to be thrown.

### Transaction building

The `TransactionBuilder` class can select the right constraint given what it already knows. If it locates the attachment JAR and discovers it has signatures in it, it can automatically set an N-of-N constraint that requires all of them on any states that don't already have a constraint specified. If the developer wants a more sophisticated constraint, it is up to them to set that explicitly in the usual manner.
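For the explicit case, a minimal sketch (the `builder`, `state`, `notary` and `appSigningKey` parameters are placeholders a flow would already have in scope) might look like:

```kotlin
import net.corda.core.contracts.ContractState
import net.corda.core.contracts.SignatureAttachmentConstraint
import net.corda.core.contracts.TransactionState
import net.corda.core.identity.Party
import net.corda.core.transactions.TransactionBuilder
import java.security.PublicKey

// Explicitly pin an output state to a signature constraint instead of relying on the
// TransactionBuilder's automatic selection. `contractId` is the contract class name.
fun addConstrainedOutput(
    builder: TransactionBuilder,
    state: ContractState,
    contractId: String,
    notary: Party,
    appSigningKey: PublicKey
) {
    builder.addOutputState(TransactionState(state, contractId, notary, constraint = SignatureAttachmentConstraint(appSigningKey)))
}
```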
### Tooling and workflow

The primary tool required is of course `jarsigner`. In dev mode, the node will ignore missing signatures in attachment JARs and will simply log an error if no signature is present when a constraint requires one.

To verify and print information about the signatures on a JAR, the `jarsigner` tool can be used again. In addition, we should add some new shell commands that do the same thing, but for a given attachment hash or transaction hash - these may be useful for debugging and analysis. In fact, a new shell command should cover all aspects of inspecting attachments - not just signatures but also what's inside them, a simple way to save them to local disk, etc.

### Key structure

There are no requirements placed on the keys used to sign JARs. In particular they do not have to be keys used on the Corda ledger, and they do not need a certificate chain that chains to the zone root. This is to ensure that app JARs are not specific to any particular zone. Otherwise app developers would need to go through the on-boarding process for a zone, and that may not always be necessary or appropriate.

The certificate hierarchy for the JAR signature can be a single self-signed certificate. There is no need for the key to present a valid certificate chain.

### Third party signing of JARs

Consider an app written and signed by the hypothetical company MiniCorp™. It allows users to issue tokens of some sort. An issuer called MegaCorp™ decides that they do not completely trust MiniCorp to create new versions of the app, and they would like to retain some control, so they take the app JAR and sign it themselves. Thus there are now two JARs in circulation for the same app.

Out of the box, this situation will break when combining tokens using the original JAR and tokens from MegaCorp into a single transaction. The `TransactionBuilder` class will fail because it will try to attach both JARs to satisfy both constraints, yet the JARs define classes with the same name. This violates the no-overlap rule (the no-overlap rule doesn't check whether the files are actually identical in content).

For now we will make this problem out of scope. It can be resolved in a future version of the platform.

There are a couple of ways this could be addressed:

1. Teach the node how to create a new JAR by combining two separately signed versions of the same JAR into a third.
2. Alter the no-overlap rule so that when two files in two different attachments are identical, they are not considered to overlap.

### Upgrading from other constraints

We anticipate that signature constraints will probably become the standard type of constraint, as they strike a good balance between security and rigidity.

The "explicit upgrade" mechanism using dedicated upgrade transactions already exists and can be used to move data from old constraints to new constraints, but this approach suffers from the usual problems associated with this form of upgrade (requires signatures from every participant, creating a new transaction, manual approval of states to be upgraded, etc.).

Alternatively, network parameters can be extended to support selective overrides of constraints to allow such upgrades in an announced and opt-in way. Designing such a mechanism is out of scope for the first cut of this feature, however.

## Alternatives considered

### Out-of-line / external JAR signatures

One obvious alternative is to sign the entire JAR instead of using the Java approach of signing a manifest file that in turn contains hashes of each file. The resulting signature would then ride alongside the JAR in a new set of transaction fields.

The Java approach of signing a manifest in-line with the JAR itself is more complex, and complexity in cryptographic operations is rarely a good thing. In particular, the Java approach makes it possible to have files in the JAR that aren't signed mixed with files that are. This could potentially be a useful source of flexibility but is more likely to be a foot-gun: we should reject attachments that contain a mix of signed and unsigned files.

However, signing a full JAR as a raw byte stream has other downsides:

* It would require a custom tool to create the detached signatures. It would then require new RPCs and more tools to upload and download the signatures separately from the JARs, and yet more tools to check the signatures. By bundling the signature inside the JAR, we preserve the single-artifact property of the current system, which is quite nice.
* It would require more fields to be added to the WireTransaction format, although we'll probably have to bite this bullet as part of adding attachment metadata eventually anyway.
* The signature ends up covering irrelevant metadata like file modification timestamps, file ordering, compression levels and so on. However, we need to move the ecosystem to producing reproducible JARs anyway for other reasons.
* JAR signature metadata is already exposed via the Java API, so attachments that are not covered by a constraint, e.g. an attachment with holiday calendar text files in it, can also be signed, and contract code could check those signatures in the usual documented way. With out-of-line signatures there would need to be custom APIs to do this.
* Inline JAR signatures have the property that they can be checked on a per-file basis. This is potentially useful later for SGX enclaves, if they wish to do random access to JAR files too large to reasonably fit inside the rather restricted enclave memory environment.

### Package name constraints

Our goal is to communicate "I want an attachment created by party/parties $FOO". The obvious way to do this is to specify the party in the constraint. But as part of other work we are considering introducing the concept of package hierarchy ownership - so `com.foobar.*` would be owned by the Foo Corporation of London, and this ownership link between namespace glob and `Party` would be specified in the network parameters.

If such an indirection were to be introduced then you could make the constraint simpler - it wouldn't need any contents at all. Rather, it would indicate that any attachment that legitimately exported the package name of the contract classname would be accepted. It would be up to the platform to check that the signature on the JAR was by the same party that is listed in the network parameters as owning that package namespace.

There are some further issues to think through here:

1. Is this a fourth type of constraint (package name constraint) that we should support along with the other three? Or is it actually just a better design that should subsume this work?
2. Should it always be the package name of the contract class, or should it specify a package glob explicitly? If so, what happens if the package name of the contract class and the package name of the constraint don't match - is it OK if the latter is a subset of the former?
3. Indirecting through package names increases centralisation somewhat, because now the zone operator has to agree to you taking ownership of a part of the namespace. This is also a privacy leak: it may expose which apps are being used on the network. *However*, what it really exposes is application *developers* and not actual apps, and the zone operator doesn't get to veto specific apps once they have approved an app developer. More problematically, unless an additional indirection is added to the network parameters, every change to the package ownership list requires a "hard fork" acceptance of new parameters.

### Using X.500 names in the constraint instead of PublicKey

We advertise a `PublicKey` (which may be a `CompositeKey`) in the constraint and *not* a set of `CordaX500Name` objects. This means that apps can be developed by entities that aren't in the network map (i.e. not a part of your zone), and it enables threshold keys, *but* the downside is there's no way to rotate or revoke a compromised key beyond adjusting the states themselves. We lose the indirection through identity.

We could introduce such an indirection. This would disconnect the constraint from a particular public key. However, each zone an app is deployed to would then require a new JAR signature by the creator, using a certificate issued by the zone operator. Because JARs can be signed by multiple certificates this is OK - a JAR can be re-signed N times if it's to be used in N zones. But it means that zone operators effectively get a power of veto over application developers, increasing centralisation, and it increases the required logistical effort.

In practice, as revoking on-ledger keys is not possible at the moment in Corda, changing a code signing key would require an explicit upgrade, or the app would need a command that allows the constraint to be changed.
@@ -1,26 +0,0 @@
Design Docs
===========

.. conditional-toctree::
   :maxdepth: 1
   :if_tag: htmlmode

   design-review-process.md
   certificate-hierarchies/design.md
   failure-detection-master-election/design.md
   float/design.md
   hadr/design.md
   kafka-notary/design.md
   monitoring-management/design.md
   sgx-integration/design.md
   reference-states/design.md
   sgx-infrastructure/design.md
   threat-model/corda-threat-model.md
   data-model-upgrades/signature-constraints.md
   data-model-upgrades/package-namespace-ownership.md
   targetversion/design.md
   data-model-upgrades/migrate-to-signature-constraint.md
   versioning/contract-versioning.md
   linear-pointer/design.md
   maximus/design.md
   accounts/design.md
@@ -1,35 +0,0 @@
# Design review process

The Corda design review process defines a means of collaborating on, and approving, Corda design thinking in a consistent,
structured, easily accessible and open manner.

The process has several steps:

1. High level discussion with the community and developers on corda-dev.
2. Writing a design doc and submitting it for review via a PR to this directory. See other design docs and the
   design doc template (below).
3. Responding to feedback on the GitHub discussion.
4. You may be invited to a design review board meeting. This is a video conference in which the design may be debated in
   real time. Notes will be sent afterwards to corda-dev.
5. When the design is settled it will be approved and can be merged as normal.

The following diagram illustrates the process flow:

![Design Review Process](./designReviewProcess.png)

At least some of the following people will take part in a DRB meeting:

* Richard G Brown (CTO)
* James Carlyle (Chief Engineer)
* Mike Hearn (Lead Platform Engineer)
* Mark Oldfield (Lead Platform Architect)
* Jonathan Sartin (Information Security manager)
* Select external key contributors (directly involved in the design process)

The Corda Technical Advisory Committee may also be asked to review a design.

Here's the outline of the design doc template:

.. toctree::

   template/design.md
@ -1,118 +0,0 @@
|
||||
# Failure detection and master election
|
||||
|
||||
.. important:: This design document describes a feature of Corda Enterprise.
|
||||
|
||||
## Background
|
||||
|
||||
Two key issues need to be resolved before Hot-Warm can be implemented:
|
||||
|
||||
* Automatic failure detection (currently our Hot-Cold set-up requires a human observer to detect a failed node)
|
||||
* Master election and node activation (currently done manually)
|
||||
|
||||
This document proposes two solutions to the above mentioned issues. The strengths and drawbacks of each solution are explored.
|
||||
|
||||
## Constraints/Requirements
|
||||
|
||||
Typical modern HA environments rely on a majority quorum of the cluster to be alive and operating normally in order to
|
||||
service requests. This means:
|
||||
|
||||
* A cluster of 1 replica can tolerate 0 failures
|
||||
* A cluster of 2 replicas can tolerate 0 failures
|
||||
* A cluster of 3 replicas can tolerate 1 failure
|
||||
* A cluster of 4 replicas can tolerate 1 failure
|
||||
* A cluster of 5 replicas can tolerate 2 failures
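(In general, a majority-quorum cluster of N replicas tolerates at most ⌊(N-1)/2⌋ failures.)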
|
||||
|
||||
This already poses a challenge to us as clients will most likely want to deploy the minimum possible number of R3 Corda
|
||||
nodes. Ideally that minimum would be 3 but a solution for only 2 nodes should be available (even if it provides a lesser
|
||||
degree of HA than 3, 5 or more nodes). The problem with having only two nodes in the cluster is that there is no distinction
|
||||
between failure and network partition.
|
||||
|
||||
Users should be allowed to set a preference for which node to be active in a hot-warm environment. This would probably
|
||||
be done with the help of a property (persisted in the DB so that it can be changed on the fly). This is important
|
||||
functionality as users might want to have the active node on better hardware and switch to the back-ups and back as soon
|
||||
as possible.
|
||||
|
||||
It would also be helpful for the chosen solution to not add deployment complexity.
|
||||
|
||||
## Design decisions
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
drb-meeting-20180131.md
|
||||
|
||||
## Proposed solutions
|
||||
|
||||
Based on what is needed for Hot-Warm, 1 active node and at least one passive node (started but in stand-by mode), and
|
||||
the constraints identified above (automatic failover with at least 2 nodes and master preference), two frameworks have
|
||||
been explored: Zookeeper and Atomix. Neither applies to our use cases perfectly, and both require some tinkering to solve our
|
||||
issues, especially the preferred master election.
|
||||
|
||||
### Zookeeper
|
||||
|
||||
![Zookeeper design](zookeeper.png)
|
||||
|
||||
Preferred leader election - while the default algorithm does not take into account a leader preference, a custom
|
||||
algorithm can be implemented to suit our needs.
|
||||
|
||||
Environment with 2 nodes - while this type of set-up can't distinguish between a node failure and network partition, a
|
||||
workaround can be implemented by having 2 nodes and 3 Zookeeper instances (the 3rd would be needed to form a majority).
|
||||
|
||||
Pros:
|
||||
- Very well documented
|
||||
- Widely used, hence a lot of cookbooks, recipes and solutions to all sorts of problems
|
||||
- Supports custom leader election
|
||||
|
||||
Cons:
|
||||
- Added deployment complexity
|
||||
- Bootstrapping a cluster is not very straightforward
|
||||
- Too complex for our needs?
|
||||
|
||||
### Atomix
|
||||
|
||||
![](./atomix.png)
|
||||
|
||||
Preferred leader election - cannot be implemented easily; a creative solution would be required.
|
||||
|
||||
Environment with 2 nodes - using only embedded replicas, there's no solution; Atomix also comes as a standalone server
|
||||
which could be run outside the node as a 3rd entity to allow a quorum (see image above).
|
||||
|
||||
Pros:
|
||||
- Easy to get started with
|
||||
- Embedded, no added deployment complexity
|
||||
- Already used partially (Atomix Catalyst) in the notary cluster
|
||||
|
||||
Cons:
|
||||
- Not as popular as Zookeeper, less used
|
||||
- Documentation is underwhelming; no proper usage examples
|
||||
- No easy way of influencing leader election; will require some creative use of Atomix functionality either via distributed groups or other resources
|
||||
|
||||
## Recommendations
|
||||
|
||||
If Zookeeper is chosen, we would need to look into a solution for easy configuration and deployment (maybe docker
|
||||
images). Custom leader election can be implemented by following one of the
|
||||
[examples](https://github.com/SainTechnologySolutions/allprogrammingtutorials/tree/master/apache-zookeeper/leader-election)
|
||||
available online.
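To make the shape of such a custom election concrete, here is a minimal, hypothetical sketch (not part of the actual proposal) of a priority-aware election built on ZooKeeper via Apache Curator. The class name, znode paths and the idea of encoding a per-node priority in the candidate znode are all assumptions made for illustration; connection handling and races are ignored for brevity:

```kotlin
import org.apache.curator.framework.CuratorFramework
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.framework.api.CuratorWatcher
import org.apache.curator.retry.ExponentialBackoffRetry
import org.apache.zookeeper.CreateMode

class PriorityLeaderElector(
    zkConnect: String,
    private val nodeId: String,
    private val priority: Int,                       // lower value = more preferred master
    private val onRoleChange: (isMaster: Boolean) -> Unit
) {
    private val electionPath = "/corda/leader-election"
    private val client: CuratorFramework =
        CuratorFrameworkFactory.newClient(zkConnect, ExponentialBackoffRetry(1000, 3))
    private lateinit var myZNode: String

    fun start() {
        client.start()
        // Register this candidate; the znode disappears automatically if this node dies,
        // which is what triggers re-election on the surviving candidates.
        myZNode = client.create()
            .creatingParentsIfNeeded()
            .withMode(CreateMode.EPHEMERAL_SEQUENTIAL)
            .forPath("$electionPath/candidate-", "$priority:$nodeId".toByteArray())
        evaluate()
    }

    private fun evaluate() {
        // Re-arm the watch on every evaluation so membership changes re-run the election.
        val children = client.children
            .usingWatcher(CuratorWatcher { evaluate() })
            .forPath(electionPath)
        // The live candidate with the best (priority, registration order) pair wins.
        val leader = children.minWithOrNull(
            compareBy<String>({ priorityOf("$electionPath/$it") }, { it })
        )
        onRoleChange(leader != null && "$electionPath/$leader" == myZNode)
    }

    private fun priorityOf(path: String): Int =
        String(client.data.forPath(path)).substringBefore(':').toInt()
}
```

A real implementation would also need to handle connection loss, the two-node/three-ZooKeeper-server layout discussed above, and the race where a candidate znode disappears between listing and reading it.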
|
||||
|
||||
If Atomix is chosen, a solution to enforce some sort of preferred leader needs to be found. One way to do it would be to
|
||||
have the Corda cluster leader be a separate entity from the Atomix cluster leader. Implementing the election would then
|
||||
be done using the distributed resources made available by the framework.
|
||||
|
||||
## Conclusions
|
||||
|
||||
Whichever solution is chosen, using 2 nodes in a Hot-Warm environment is not ideal. A minimum of 3 is required to ensure proper failover.
|
||||
|
||||
Almost every configuration option that these frameworks offer should be exposed through node.conf.
|
||||
|
||||
We've looked into using Galera which is currently used for the notary cluster for storing the committed state hashes. It
|
||||
offers multi-master read/write and certification-based replication which is not leader based. It could be used to
|
||||
implement automatic failure detection and master election (similar to our current mutual exclusion). However, we found
|
||||
that it doesn't suit our needs because:
|
||||
|
||||
- it adds to deployment complexity
|
||||
- usable only with MySQL and InnoDB storage engine
|
||||
- we'd have to implement node failure detection and master election from scratch; in this regard both Atomix and Zookeeper are better suited
|
||||
|
||||
Our preference would be Zookeeper despite not being as lightweight and deployment-friendly as Atomix. The widespread
|
||||
use, proper documentation and flexibility to use it not only for automatic failover and master election but also
|
||||
configuration management (something we might consider moving forward) make it a better fit for our needs.
|
@ -1,104 +0,0 @@
|
||||
# Design Review Board Meeting Minutes
|
||||
|
||||
**Date / Time:** Jan 31 2018, 11.00
|
||||
|
||||
## Attendees
|
||||
|
||||
- Matthew Nesbit (MN)
|
||||
- Bogdan Paunescu (BP)
|
||||
- James Carlyle (JC)
|
||||
- Mike Hearn (MH)
|
||||
- Wawrzyniec Niewodniczanski (WN)
|
||||
- Jonathan Sartin (JS)
|
||||
- Gavin Thomas (GT)
|
||||
|
||||
|
||||
## **Decision**
|
||||
|
||||
Proceed with recommendation to use Zookeeper as the master selection solution
|
||||
|
||||
|
||||
## **Primary Requirement of Design**
|
||||
|
||||
- Client can run just 2 nodes, master and slave
|
||||
- Current deployment model to not change significantly
|
||||
- Prioritised mastering, or the ability to automatically elect a master. Useful to allow clients to do rolling upgrades, or when a high-spec machine is used for the master
|
||||
- Nice to have: use for flow sharding and soft locking
|
||||
|
||||
## **Minutes**
|
||||
|
||||
MN presented a high level summary of the options:
|
||||
- Galera:
|
||||
- Negative: does not have leader election and failover capability.
|
||||
|
||||
- Atomix IO:
|
||||
- Positive: does integrate into node easily, can setup ports
|
||||
- Negative: requires a minimum of 3 nodes, cannot manipulate the election (e.g. drop the master for rolling deployments / upgrades), cannot select the 'beefy' host for master where cost efficiencies have been used for the slave / DR; young library with limited functionality, poor documentation and examples
|
||||
|
||||
- Zookeeper (recommended option): industry standard, widely used and trusted. May be able to leverage clients' incumbent Zookeeper infrastructure
|
||||
- Positive: has flexibility for storage and a potential for future proofing; good permissioning capabilities; standalone cluster of Zookeeper servers allows 2 nodes solution rather than 3
|
||||
- Negative: adds deployment complexity due to need for Zookeeper cluster split across data centers
|
||||
Wrapper library choice for Zookeeper requires some analysis
|
||||
|
||||
|
||||
MH: predictable source of API for RAFT implementations and Zookeeper compared to Atomix. It would be better to have the master
|
||||
selector implemented as an abstraction
|
||||
|
||||
MH: hybrid approach possible - 3rd node for oversight, i.e. 2 embedded in the node, 3rd is an observer. Zookeeper can
|
||||
have one node in primary data centre, one in secondary data centre and 3rd as tie-breaker
|
||||
|
||||
WN: why are we concerned about cost of 3 machines? MN: we're seeing / hearing clients wanting to run many nodes on one
|
||||
VM. Zookeeper is good for this since 1 Zookeeper cluster can serve 100+ nodes
|
||||
|
||||
MH: terminology clarification required: what holds the master lock? Ideally would be good to see design thinking around
|
||||
split node and which bits need HA. MB: as a long term vision, ideally have 1 database for many IDs and the flows for
|
||||
those IDs are load balanced. Regarding services internal to the node being suspended, this is being investigated.
|
||||
|
||||
MH: regarding auto failover, in the event a database has its own perception of master and slave, how is this handled?
|
||||
Failure detector will need to grow or have local only schedule to confirm it is processing everything including
|
||||
connectivity between database and bus, i.e. implement a 'healthiness' concept
|
||||
|
||||
MH: can you get into a situation where the node fails over but the database does not, but database traffic continues to
|
||||
be sent to down node? MB: database will go offline leading to an all-stop event.
|
||||
|
||||
MH: can you have master affinity between node and database? MH: need watchdog / heartbeat solutions to confirm state of
|
||||
all components
|
||||
|
||||
JC: how long will this solution live? MB: will work for hot / hot flow sharding, multiple flow workers and soft locks,
|
||||
then this is long term solution. Service abstraction will be used so we are not wedded to Zookeeper however the
|
||||
abstraction work can be done later
|
||||
|
||||
JC: does the implementation with Zookeeper have an impact on whether cloud or physical deployments are used? MB: it's an
|
||||
internal component, not part of the larger Corda network therefore can be either. For the customer they will have to
|
||||
deploy a separate Zookeeper solution, but this is the same for Atomix.
|
||||
|
||||
WN: where Corda as a service is being deployed with many nodes in the cloud, Zookeeper will be better suited to big
|
||||
providers.
|
||||
|
||||
WN: concern is the customer expects to get everything on a plate, therefore will need to be educated on how to implement
|
||||
Zookeeper, but this is the same for other master selection solutions.
|
||||
|
||||
JC: is it possible to launch R3 Corda with a button on Azure marketplace to commission a Zookeeper? Yes, if we can
|
||||
resource it. But expectation is Zookeeper will be used by well-informed clients / implementers so one-click option is
|
||||
less relevant.
|
||||
|
||||
MH: how does failover work with HSMs?
|
||||
|
||||
MN: can replicate realm so failover is trivial
|
||||
|
||||
JC: how do we document Enterprise features? Publish design docs? Enterprise fact sheets? R3 Corda marketing material?
|
||||
Clear separation of documentation is required. GT: this is already achieved by having docs.corda.net for open source
|
||||
Corda and docs.corda.r3.com for enterprise R3 Corda
|
||||
|
||||
|
||||
### Next Steps
|
||||
|
||||
MN proposed the following steps:
|
||||
|
||||
1) Determine who has experience in the team to help select wrapper library
|
||||
2) Build container with Zookeeper for development
|
||||
3) Demo hot / cold with current R3 Corda Dev Preview release (writing a guide)
|
||||
4) Turn nodes passive or active
|
||||
5) Leader election
|
||||
6) Failure detection and tooling
|
||||
7) Edge case testing
|
Before Width: | Height: | Size: 100 KiB |
Before Width: | Height: | Size: 119 KiB |
@ -1,147 +0,0 @@
|
||||
# Design Review Board Meeting Minutes
|
||||
|
||||
**Date / Time:** 16/11/2017, 14:00
|
||||
|
||||
## Attendees
|
||||
|
||||
- Mark Oldfield (MO)
|
||||
- Matthew Nesbit (MN)
|
||||
- Richard Gendal Brown (RGB)
|
||||
- James Carlyle (JC)
|
||||
- Mike Hearn (MH)
|
||||
- Jose Coll (JoC)
|
||||
- Rick Parker (RP)
|
||||
- Andrey Bozhko (AB)
|
||||
- Dave Hudson (DH)
|
||||
- Nick Arini (NA)
|
||||
- Ben Abineri (BA)
|
||||
- Jonathan Sartin (JS)
|
||||
- David Lee (DL)
|
||||
|
||||
## Minutes
|
||||
|
||||
MO opened the meeting, outlining the agenda and meeting review process, and clarifying that consensus on each design decision would be sought from RGB, JC and MH.
|
||||
|
||||
MO set out ground rules for the meeting. RGB asked everyone to confirm they had read both documents; all present confirmed.
|
||||
|
||||
MN outlined the motivation for a Float as responding to organisations’ expectation of a ‘fire break’ protocol termination in the DMZ where manipulation and operation can be checked and monitored.
|
||||
|
||||
The meeting was briefly interrupted by technical difficulties with the GoToMeeting conferencing system.
|
||||
|
||||
MN continued to outline how the design was constrained by expected DMZ rules and influenced by currently perceived client expectations – e.g. making the float unidirectional. He gave a prelude to certain design decisions e.g. the use of AMQP from the outset.
|
||||
|
||||
MN went on to describe the target solution in detail, covering the handling of both inbound and outbound connections. He highlighted implicit overlaps with the HA design – clustering support, queue names etc., and clarified that the local broker was not required to use AMQP.
|
||||
|
||||
### [TLS termination](./ssl-termination.md)
|
||||
|
||||
JC questioned where the TLS connection would terminate. MN outlined the pros and cons of termination on firewall vs. float, highlighting the consequence of float termination that access by the float to the private key was required, and that mechanisms may be needed to store that key securely.
|
||||
|
||||
MH contended that the need to propagate TLS headers etc. through to the node (for reinforcing identity checks etc.) implied a need to terminate on the float. MN agreed but noted that in practice the current node design did not make much use of that feature.
|
||||
|
||||
JC questioned how users would provision a TLS cert on a firewall – MN confirmed users would be able to do this themselves and were typically familiar with doing so.
|
||||
|
||||
RGB highlighted the distinction between the signing key for the TLS vs. identity certificates, and that this needed to be made clear to users. MN agreed that TLS private keys could be argued to be less critical from a security perspective, particularly when revocation was enabled.
|
||||
|
||||
MH noted potential to issue sub-certs with key usage flags as an additional mitigating feature.
|
||||
|
||||
RGB queried at what point in the flow a message would be regarded as trusted. MN set an expectation that the float would apply basic checks (e.g. stopping a connection talking on other topics etc.) but that subsequent sanitisation should happen in internal trusted portion.
|
||||
|
||||
RGB questioned whether the TLS key on the float could be re-used on the bridge to enable wrapped messages to be forwarded in an encrypted form – session migration. MH and MN maintained TLS forwarding could not work in that way, and this would not allow the ‘fire break’ requirement to inspect packets.
|
||||
|
||||
RGB concluded the bridge must effectively trust the firewall or bridge on the origin of incoming messages. MN raised the possibility of SASL verification, but noted objections by MH (clumsy because of multiple handshakes etc.).
|
||||
|
||||
JC queried whether SASL would allow passing of identity and hence termination at the firewall; MN confirmed this.
|
||||
|
||||
MH contended that the TLS implementation was specific to Corda in several ways which may challenge implementation using firewalls, and that typical firewalls (using old OpenSSL etc.) were probably not more secure than R3’s own solutions. RGB pointed out that the design was ultimately driven by client perception of security (MN: “security theatre”) rather than objective assessment. MH added that implementations would be firewall-specific and not all devices would support forwarding, support for AMQP etc.
|
||||
|
||||
RGB proposed messaging to clients that the option existed to terminate on the firewall if it supported the relevant requirements.
|
||||
|
||||
MN re-raised the question of key management. RGB asked about the risk implied from the threat of a compromised float. MN said an attacker who compromised a float could establish TLS connections in the name of the compromised party, and could inspect and alter packets including readable business data (assuming AMQP serialisation). MH gave an example of a MITM attack where an attacker could swap in their own single-use key allowing them to gain control of (e.g.) a cash asset; the TLS layer is the only current protection against that.
|
||||
|
||||
RGB queried whether messages could be signed by senders. MN raised potential threat of traffic analysis, and stated E2E encryption was definitely possible but not for March-April.
|
||||
|
||||
MH viewed the use-case for extra encryption as the consumer/SME market, where users would want to upload/download messages from a mailbox without needing to trust it – not the target market yet. MH maintained TLS was really strong and that assuming compromise of the float was not conceptually different from compromise of another device e.g. the firewall. MN confirmed that use of an HSM would generally require signing on the HSM device for every session; MH observed this could be a bottleneck in the scenario of a restored node seeking to re-establish a large number of connections. It was observed that the float would still need access to a key that provisions access to the HSM, so this did not materially improve the security in a compromised float scenario.
|
||||
|
||||
MH advised against offering clients support for their own firewall since it would likely require R3 effort to test support and help with customisations.
|
||||
|
||||
MN described option 2b to tunnel through to the internal trusted portion of the float over a connection initiated from inside the internal network in order for the key to be loaded into memory at run-time; this would require a bit more code.
|
||||
|
||||
MH advocated option 2c - just to accept risk and store on file system – on the basis of time constraints, maintaining that TLS handshakes are complicated to code and hard to proxy. MH suggested upgrading to 2b or 2a later if needed. MH described how keys were managed at Google.
|
||||
|
||||
**DECISION CONFIRMED**: Accept option 2b - Terminate on float, inject key from internal portion of the float (RGB, JC, MH agreed)
|
||||
|
||||
### [E2E encryption](./e2e-encryption.md)
|
||||
|
||||
DH proposed that E2E encryption would be much better but conceded the time limitations and agreed that the threat scenario of a compromised DMZ device was the same under the proposed options. MN agreed.
|
||||
|
||||
MN argued for a placeholder vs. ignoring or scheduling work to build e2e encryption now. MH agreed, seeking more detailed proposals on what the placeholder was and how it would be used.
|
||||
|
||||
MH queried whether e2e encryption would be done at the app level rather than the AMQP level, raising questions about what would happen on non-supporting nodes etc.
|
||||
|
||||
MN highlighted the link to AMQP serialisation work being done.
|
||||
|
||||
**DECISION CONFIRMED:** Add placeholder, subject to more detailed design proposal (RGB, JC, MH agreed)
|
||||
|
||||
### [AMQP vs. custom protocol](./p2p-protocol.md)
|
||||
|
||||
MN described alternative options involving onion-routing etc.
|
||||
|
||||
JoC questioned whether this would also allow support for load balancing; MN advised this would be too much change in direction in practice.
|
||||
|
||||
MH outlined his original reasoning for AMQP (lots of e.g. manageability features, not all of which would be needed at the outset but possibly in future) vs. other options e.g. MQTT.
|
||||
|
||||
MO questioned whether the broker would imply performance limitations.
|
||||
|
||||
RGB argued there were two separate concerns: Carrying messages from float to bridge and then bridge to node, with separate design options.
|
||||
|
||||
JC proposed the decision could be deferred until later. MN pointed out changing the protocol would compromise wire stability.
|
||||
|
||||
MH advocated sticking with AMQP for now and implementing a custom protocol later with suitable backwards-compatibility features when needed.
|
||||
|
||||
RGB queried whether full AMQP implementation should be done in this phase. MN provided explanation.
|
||||
|
||||
**DECISION CONFIRMED:** Continue to use AMQP (RGB, JC, MH agreed)
|
||||
|
||||
### [Pluggable broker prioritisation](./pluggable-broker.md)
|
||||
|
||||
MN outlined arguments for deferring pluggable brokers, whilst describing how he’d go about implementing the functionality. MH agreed with prioritisation for later.
|
||||
|
||||
JC queried whether broker providers could be asked to deliver the feature. AB mentioned that Solace seemed keen on working with R3 and could possibly be utilised. MH was sceptical, arguing that R3 resource would still be needed to support.
|
||||
|
||||
JoC noted a distinction in scope for P2P and/or RPC.
|
||||
|
||||
There was discussion of replacing the core protocol with JMS + plugins. RGB drew focus to the question of when to do so, rather than how.
|
||||
|
||||
AB noted Solace have functionality with conceptual similarities to the float, and questioned to what degree the float could be considered non-core technology. MH argued the nature of Corda as a P2P network made the float pretty core to avoiding dedicated network infrastructure.
|
||||
|
||||
**DECISION CONFIRMED:** Defer support for pluggable brokers until later, except in the event that a requirement to do so emerges from higher priority float / HA work. (RGB, JC, MH agreed)
|
||||
|
||||
### Inbound only vs. inbound & outbound connections
|
||||
|
||||
DL sought confirmation that the group was happy with the float to act as a listener only. MN repeated the explanation of how outbound connections would be initiated through a SOCKS 4/5 proxy. No objections were raised.
|
||||
|
||||
### Overall design and implementation plan
|
||||
|
||||
MH requested more detailed proposals going forward on:
|
||||
|
||||
1) To what degree logs from different components need to be integrated (consensus was no requirement at this stage)
|
||||
|
||||
2) Bridge control protocols.
|
||||
|
||||
3) Scalability of hashing network map entries to queue names
|
||||
|
||||
4) Node admins' user experience – MH argued for documenting this in advance to validate design
|
||||
|
||||
5) Behaviour following termination of a remote node (retry frequency, back-off etc.)?
|
||||
|
||||
6) Impact on standalone nodes (no float)?
|
||||
|
||||
JC noted an R3 obligation with Microsoft to support AMQP-compliant Azure messaging. MN confirmed support for pluggable brokers should cover that.
|
||||
|
||||
JC argued for documentation of procedures to be the next step as it is needed for the Project Agent Pilot phase. MH proposed sharing the advance documentation.
|
||||
|
||||
JoC questioned whether the Bridge Manager locked the design to Artemis; MO highlighted the transitional elements of the design.
|
||||
|
||||
RGB questioned the rationale for moving the broker out of the node. MN provided clarification.
|
||||
|
||||
**DECISION CONFIRMED**: Design to proceed as discussed (RGB, JC, MH agreed)
|
@ -1,55 +0,0 @@
|
||||
# Design Decision: End-to-end encryption
|
||||
|
||||
## Background / Context
|
||||
|
||||
End-to-end encryption is a desirable potential design feature for the [float](../design.md).
|
||||
|
||||
## Options Analysis
|
||||
|
||||
### 1. No end-to-end encryption
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. Least effort
|
||||
2. Easier to fault find and manage
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. With no placeholder, it is very hard to add support later and maintain wire stability.
|
||||
2. May not get past security reviews of Float.
|
||||
|
||||
### 2. Placeholder only
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. Allows wire stability when we have agreed an encrypted approach
|
||||
2. Shows that we are serious about security, even if this isn’t available yet.
|
||||
3. Allows later encrypted version to be an enterprise feature that can interoperate with OS versions.
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. Doesn’t actually provide E2E, or define what an encrypted payload looks like.
|
||||
2. Doesn’t address any crypto features that target protecting the AMQP headers.
|
||||
|
||||
### 3. Implement end-to-end encryption
|
||||
|
||||
#### Advantages

1. Will protect the sensitive data fully.
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. Lots of work.
|
||||
2. Difficult to get right.
|
||||
3. Re-inventing TLS.
|
||||
|
||||
## Recommendation and justification
|
||||
|
||||
Proceed with Option 2: Placeholder
|
||||
|
||||
## Decision taken
|
||||
|
||||
Proceed with Option 2 - Add placeholder, subject to more detailed design proposal (RGB, JC, MH agreed)
|
||||
|
||||
.. toctree::
|
||||
|
||||
drb-meeting-20171116.md
|
||||
|
@ -1,75 +0,0 @@
|
||||
# Design Decision: P2P Messaging Protocol
|
||||
|
||||
## Background / Context
|
||||
|
||||
Corda requires messages to be exchanged between nodes via a well-defined protocol.
|
||||
|
||||
Determining this protocol is a critical upstream dependency for the design of key messaging components including the [float](../design.md).
|
||||
|
||||
## Options Analysis
|
||||
|
||||
### 1. Use AMQP
|
||||
|
||||
Under this option, P2P messaging will follow the [Advanced Message Queuing Protocol](https://www.amqp.org/).
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. As we have described in our marketing materials.
|
||||
2. Well-defined standard.
|
||||
3. Support for packet level flow control and explicit delivery acknowledgement.
|
||||
4. Will allow eventual swap out of Artemis for other brokers.
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. AMQP is a complex protocol with many layered state machines, for which it may prove hard to verify security properties.
|
||||
2. No support for a secure MAC in packet frames.
|
||||
3. No defined encryption mode beyond creating custom payload encryption and custom headers.
|
||||
4. No standardised support for queue creation/enumeration, or deletion.
|
||||
5. Use of broker durable queues and autonomous bridge transfers does not align with checkpoint timing, so that independent replication of the DB and Artemis data risks causing problems. (Writing to the DB doesn’t work currently and is probably also slow).
|
||||
|
||||
### 2. Develop a custom protocol
|
||||
|
||||
This option would discard existing Artemis server/AMQP support for peer-to-peer communications in favour of a custom
|
||||
implementation of the Corda MessagingService, which takes direct responsibility for message retries and stores the
|
||||
pending messages into the node's database. The wire level of this service would be built on top of a fully encrypted MIX
|
||||
network which would not require a fully connected graph, but rather send messages on randomly selected paths over the
|
||||
dynamically managed network graph topology.
|
||||
|
||||
Packet format would likely use the [SPHINX packet format](http://www0.cs.ucl.ac.uk/staff/G.Danezis/papers/sphinx-eprint.pdf) although with the body encryption updated to
|
||||
a modern AEAD scheme as in https://www.cs.ru.nl/~bmennink/pubs/16cans.pdf . In this scheme, nodes would be identified in
|
||||
the overlay network solely by Curve25519 public key addresses and floats would be dumb nodes that only run the MIX
|
||||
network code and don't act as message sources, or sinks. Intermediate traffic would not be readable except by the
|
||||
intended waypoint and only the final node can read the payload.
|
||||
|
||||
Point to point links would be standard TLS and the network certificates would be whatever is acceptable to the host
|
||||
institutions e.g. standard Verisign certs. It is assumed institutions would select partners to connect to that they
|
||||
trust and permission them individually in their firewalls. Inside the MIX network the nodes would be connected mostly in
|
||||
a static way and use standard HELLO packets to determine the liveness of neighbour routes, then use tunnelled gossip to
|
||||
distribute the signed/versioned Link topology messages. Nodes will also be allowed to advertise a public IP, so some
|
||||
dynamic links and publicly visible nodes would exist. Network map addresses would then be mappings from Legal Identity
|
||||
to these overlay network addresses, not to physical network locations.
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. Can be defined with very small message surface area that is amenable to security analysis.
|
||||
2. Packet formats can follow best practice cryptography from the start and be matched to Corda’s needs.
|
||||
3. Doesn’t require a complete graph structure for network if we have intermediate routing.
|
||||
4. More closely aligns checkpointing and message delivery handling at the application level.
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. Inconsistent with previous design statements published to external stakeholders.
|
||||
2. Effort implications - starting from scratch
|
||||
3. Technical complexity in developing a P2P protocol which is attack-tolerant.
|
||||
|
||||
## Recommendation and justification
|
||||
|
||||
Proceed with Option 1
|
||||
|
||||
## Decision taken
|
||||
|
||||
Proceed with Option 1 - Continue to use AMQP (RGB, JC, MH agreed)
|
||||
|
||||
.. toctree::
|
||||
|
||||
drb-meeting-20171116.md
|
@ -1,62 +0,0 @@
|
||||
# Design Decision: Pluggable Broker prioritisation
|
||||
|
||||
## Background / Context
|
||||
|
||||
A decision on when to prioritise implementation of a pluggable broker has implications for delivery of key messaging
|
||||
components including the [float](../design.md).
|
||||
|
||||
## Options Analysis
|
||||
|
||||
### 1. Deliver pluggable brokers now
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. Meshes with business opportunities from HPE and Solace Systems.
|
||||
2. Would allow us to interface to existing Bank middleware.
|
||||
3. Would allow us to switch away from Artemis if we need higher performance.
|
||||
4. Makes our AMQP story stronger.
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. More up-front work.
|
||||
2. Might slow us down on other priorities.
|
||||
|
||||
### 2. Defer development of pluggable brokers until later
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. Still gets us where we want to go, just later.
|
||||
2. Work can be progressed as resource is available, rather than right now.
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. Have to take care that we have sufficient abstractions that things like CORE connections can be replaced later.
|
||||
2. Leaves HPE and Solace hanging even longer.
|
||||
|
||||
|
||||
### 3. Never enable pluggable brokers
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. What we already have.
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. Ties us to ArtemisMQ development speed.
|
||||
|
||||
2. Not good for our relationship with HPE and Solace.
|
||||
|
||||
3. Probably limits our maximum messaging performance longer term.
|
||||
|
||||
|
||||
## Recommendation and justification
|
||||
|
||||
Proceed with Option 2 (defer development of pluggable brokers until later)
|
||||
|
||||
## Decision taken
|
||||
|
||||
Proceed with Option 2 - Defer support for pluggable brokers until later, except in the event that a requirement to do so emerges from higher priority float / HA work. (RGB, JC, MH agreed)

.. toctree::

drb-meeting-20171116.md
|
@ -1,91 +0,0 @@
|
||||
# Design Decision: TLS termination point
|
||||
|
||||
## Background / Context
|
||||
|
||||
Design of the [float](../design.md) is critically influenced by the decision of where TLS connections to the node should
|
||||
be terminated.
|
||||
|
||||
## Options Analysis
|
||||
|
||||
### 1. Terminate TLS on Firewall
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. Common practice for DMZ web solutions, often with an HSM associated with the Firewall and should be familiar for banks to setup.
|
||||
2. Doesn’t expose our private key in the less trusted DMZ context.
|
||||
3. Bugs in the firewall TLS engine will be patched frequently.
|
||||
4. The DMZ float server would only require a self-signed certificate/private key to enable secure communications, so theft of this key has no impact beyond the compromised machine.
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. May limit cryptography options to RSA, and prevent checking of X500 names (only the root certificate checked) - Corda certificates are not totally standard.
|
||||
2. Doesn’t allow identification of the message source.
|
||||
3. May require additional work and SASL support code to validate the ultimate origin of connections in the float.
|
||||
|
||||
#### Variant option 1a: Include SASL connection checking
|
||||
|
||||
##### Advantages
|
||||
|
||||
1. Maintain authentication support
|
||||
2. Can authenticate against keys held internally e.g. Legal Identity not just TLS.
|
||||
|
||||
##### Disadvantages
|
||||
|
||||
1. More work than the do-nothing approach
|
||||
2. More protocol to design for sending across the inner firewall.
|
||||
|
||||
### 2. Direct TLS Termination onto Float
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. Validate our PKI certificates directly ourselves.
|
||||
2. Allow messages to be reliably tagged with source.
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. We don’t currently use the identity to check incoming packets, only for connection authentication anyway.
|
||||
2. Management of the private key is a challenge, requiring extra work and carrying security implications. Options for this are presented below.
|
||||
|
||||
#### Variant Option 2a: Float TLS certificate via direct HSM
|
||||
|
||||
##### Advantages
|
||||
|
||||
1. Key can’t be stolen (only access to signing operations)
|
||||
2. Audit trail of signings.
|
||||
|
||||
##### Disadvantages
|
||||
|
||||
1. Accessing HSM from DMZ probably not allowed.
|
||||
2. Breaks the inbound-connection-only rule of modern DMZ.
|
||||
|
||||
#### Variant Option 2b: Tunnel signing requests to bridge manager
|
||||
|
||||
##### Advantages
|
||||
|
||||
1. No new connections involved from Float box.
|
||||
2. No access to actual private key from DMZ.
|
||||
|
||||
##### Disadvantages
|
||||
|
||||
1. Requires implementation of a message protocol, in addition to a key provider that can be passed to the standard SSLEngine, but proxies signing requests.
|
||||
|
||||
#### Variant Option 2c: Store key on local file system
|
||||
|
||||
##### Advantages
|
||||
|
||||
1. Simple with minimal extra code required.
|
||||
2. Delegates access control to bank’s own systems.
|
||||
3. Risks losing only the TLS private key, which can easily be revoked. This isn’t the legal identity key at all.
|
||||
|
||||
##### Disadvantages
|
||||
|
||||
1. Risks losing the TLS private key.
|
||||
2. Probably not allowed.
|
||||
|
||||
## Recommendation and justification
|
||||
|
||||
Proceed with Variant option 1a: Terminate on firewall; include SASL connection checking.
|
||||
|
||||
## Decision taken
|
||||
|
||||
[DRB Meeting, 16/11/2017](./drb-meeting-20171116.md): Proceed with option 2b - Terminate on float, inject key from internal portion of the float (RGB, JC, MH agreed)
|
@ -1,256 +0,0 @@
|
||||
# Float Design
|
||||
|
||||
.. important:: This design document describes a feature of Corda Enterprise.
|
||||
|
||||
## Overview
|
||||
|
||||
The role of the 'float' is to meet the requirements of organisations that will not allow direct incoming connections to
|
||||
their node, but would rather host a proxy component in a DMZ to achieve this. As such it needs to meet the requirements
|
||||
of modern DMZ security rules, which essentially assume that the entire machine in the DMZ may become compromised. At
|
||||
the same time, we expect that the Float can interoperate with directly connected nodes, possibly even those using open
|
||||
source Corda.
|
||||
|
||||
### Background
|
||||
|
||||
#### Current state of peer-to-peer messaging in Corda
|
||||
|
||||
The diagram below illustrates the current mechanism for peer-to-peer messaging between Corda nodes.
|
||||
|
||||
![Current P2P State](./current-p2p-state.png)
|
||||
|
||||
When a flow running on a Corda node triggers a requirement to send a message to a peer node, it first checks for
|
||||
pre-existence of an applicable message queue for that peer.
|
||||
|
||||
**If the relevant queue exists:**
|
||||
|
||||
1. The node submits the message to the queue and continues after receiving acknowledgement.
|
||||
2. The Core Bridge picks up the message and transfers it via a TLS socket to the inbox of the destination node.
|
||||
3. A flow on the recipient node receives the message from the peer and acknowledges consumption on the bus when the flow has checkpointed this progress.
|
||||
|
||||
**If the queue does not exist (messaging a new peer):**
|
||||
|
||||
1. The flow triggers creation of a new queue with a name encoding the identity of the intended recipient.
|
||||
2. When the queue creation has completed the node sends the message to the queue.
|
||||
3. The hosted Artemis server within the node has a queue creation hook which is called.
|
||||
4. The queue name is used to lookup the remote connection details and a new bridge is registered.
|
||||
5. The client certificate of the peer is compared to the expected legal identity X500 Name. If this is OK, message flow proceeds as for a pre-existing queue (above).
|
||||
|
||||
## Scope
|
||||
|
||||
* Goals:
|
||||
* Allow connection to a Corda node without requiring direct incoming connections from external participants.
|
||||
* Allow connections to a Corda node without requiring the node itself to have a public IP address. Separate TLS connection handling from the MQ broker.
|
||||
* Non-goals (out of scope):
|
||||
* Support for MQ brokers other than Apache Artemis
|
||||
|
||||
## Timeline
|
||||
For delivery by end Q1 2018.
|
||||
|
||||
## Requirements
|
||||
Allow connectivity in compliance with DMZ constraints commonly imposed by modern financial institutions; namely:
|
||||
1. Firewalls required between the internet and any device in the DMZ, and between the DMZ and the internal network
|
||||
2. Data passing from the internet and the internal network via the DMZ should pass through a clear protocol break in the DMZ.
|
||||
3. Only identified IPs and ports are permitted to access devices in the DMZ; this includes communications between devices co-located in the DMZ.
|
||||
4. Only a limited number of ports are opened in the firewall (<5) to make firewall operation manageable. These ports must change slowly.
|
||||
5. Any DMZ machine is typically multi-homed, with separate network cards handling traffic through the institutional
|
||||
firewall vs. to the Internet. (There is usually a further hidden management interface card accessed via a jump box for
|
||||
managing the box and shipping audit trail information). This requires that our software can bind listening ports to the
|
||||
correct network card not just to 0.0.0.0.
|
||||
6. No connections to be initiated by DMZ devices towards the internal network. Communications should be initiated from
|
||||
the internal network to form a bidirectional channel with the proxy process.
|
||||
7. No business data should be persisted on the DMZ box.
|
||||
8. An audit log of all connection events is required to track breaches. Latency information should also be tracked to
|
||||
facilitate management of connectivity issues.
|
||||
9. Processes on DMZ devices run as local accounts with no relationship to internal permission systems, or ability to
|
||||
enumerate devices on the internal network.
|
||||
10. Communications in the DMZ should use modern TLS, often with local-only certificates/keys that hold no value outside of use in predefined links.
|
||||
11. Where TLS is required to terminate on the firewall, provide a suitably secure key management mechanism (e.g. an HSM).
|
||||
12. Any proxy in the DMZ should be subject to the same HA requirements as the devices it is servicing
|
||||
13. Any business data passing through the proxy should be separately encrypted, so that no data is in the clear in
|
||||
program memory if the DMZ box is compromised.
|
||||
|
||||
## Design Decisions
|
||||
|
||||
The following design decisions fed into this design:
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
decisions/p2p-protocol.md
|
||||
decisions/ssl-termination.md
|
||||
decisions/e2e-encryption.md
|
||||
decisions/pluggable-broker.md
|
||||
|
||||
## Target Solution
|
||||
|
||||
The proposed solution introduces a reverse proxy component ("**float**") which may be sited in the DMZ, as illustrated
|
||||
in the diagram below.
|
||||
|
||||
![Full Float Implementation](./full-float.png)
|
||||
|
||||
The main role of the float is to forward incoming AMQP link packets from authenticated TLS links to the AMQP Bridge
|
||||
Manager, then echo back final delivery acknowledgements once the Bridge Manager has successfully inserted the messages.
|
||||
The Bridge Manager is responsible for rejecting inbound packets on queues that are not local inboxes to prevent e.g.
|
||||
'cheating' messages onto management topics, faking outgoing messages etc.
|
||||
|
||||
The float is linked to the internal AMQP Bridge Manager via a single AMQP/TLS connection, which can contain multiple
|
||||
logical AMQP links. This link is initiated at the socket level by the Bridge Manager towards the float.
|
||||
|
||||
The float is a **listener only** and does not enable outgoing bridges (see Design Decisions, above). Outgoing bridge
|
||||
formation and message sending come directly from the internal Bridge Manager (possibly via a SOCKS 4/5 proxy, which is
|
||||
easy enough to enable in netty, or directly through the corporate firewall. Initiating from the float gives rise to
|
||||
security concerns.)
|
||||
|
||||
The float is **not mandatory**; interoperability with older nodes, even those using direct AMQP from bridges in the
|
||||
node, is supported.
|
||||
|
||||
**No state will be serialized on the float**, although suitably protected logs will be recorded of all float activities.
|
||||
|
||||
**End-to-end encryption** of the payload is not delivered through this design (see Design Decisions, above). For current
|
||||
purposes, a header field indicating plaintext/encrypted payload is employed as a placeholder.
|
||||
|
||||
**HA** is enabled (this should be easy as the bridge manager can choose which float to make active). Only fully
|
||||
connected DMZ floats should activate their listening port.
|
||||
|
||||
Implementation of the float is expected to be based on existing AMQP Bridge Manager code - see Implementation Plan,
|
||||
below, for expected work stages.
|
||||
|
||||
### Bridge control protocol
|
||||
|
||||
The bridge control is designed to be as stateless as possible. Thus, nodes and bridges restarting must
|
||||
re-request/broadcast information to each other. Messages are sent to a 'bridge.control' address in Artemis as
|
||||
non-persistent messages with a non-durable queue. Each message should contain a duplicate message ID, which is also
|
||||
re-used as the correlation id in replies. Relevant scenarios are described below:
|
||||
|
||||
#### On bridge start-up, or reconnection to Artemis
|
||||
1. The bridge process should subscribe to the 'bridge.control' address.
|
||||
2. The bridge should start sending QueueQuery messages which will contain a unique message id and an identifier for the bridge sending the message.
|
||||
3. The bridge should continue to send these until at least one node replies with a matched QueueSnapshot message.
|
||||
4. The QueueSnapshot message replies from the nodes contain a correlationId field set to the unique id of the QueueQuery query, or the correlation id is null. The message payload is a list of inbox queue info items and a list of outbound queue info items. Each queue info item is a tuple of Legal X500 Name (as expected on the destination TLS certificates) and the queue name, which should have the form of "internal.peers."+hash key of legal identity (using the same algorithm as we use in the db to make the string). Note this queue name is a change from the current logic, but will be more portable to length constrained topics and allow multiple inboxes on the same broker.
|
||||
5. The bridge should process the QueueSnapshot, initiating links to the outgoing targets. It should also add expected inboxes to its in-bound permission list.
|
||||
6. When an outgoing link is successfully formed the remote client certificate should be checked against the expected X500 name. Assuming the link is valid the bridge should subscribe to the related queue and start trying to forward the messages.
|
||||
|
||||
#### On node start-up, or reconnection to Artemis
|
||||
1. The node should subscribe to 'bridge.control'.
|
||||
2. The node should enumerate the queues and identify which have well-known identities in the network map cache. The appropriate information about its own inboxes and any known outgoing queues should be compiled into an unsolicited QueueSnapshot message with a null correlation id. This should be broadcast to update any bridges that are running.
|
||||
3. If any QueueQuery messages arrive these should be responded to with specific QueueSnapshot messages with the correlation id set.
|
||||
|
||||
#### On network map updates
|
||||
1. On receipt of any network map cache updates the information should be evaluated to see if any addition queues can now be mapped to a bridge. At this point a BridgeRequest packet should be sent which will contain the legal X500Name and queue name of the new update.
|
||||
|
||||
#### On flow message to Peer
|
||||
1. If a message is to be sent to a peer the code should (as it does now) check for queue existence in its cache and then on the broker. If it does exist it simply sends the message.
|
||||
2. If the queue is not listed in its cache it should block until the queue is created (this should be safe versus race conditions with other nodes).
|
||||
3. Once the queue is created the original message and subsequent messages can now be sent.
|
||||
4. In parallel a BridgeRequest packet should be sent to activate a new connection outwards. This will contain the legal X500Name and queue name of the new queue.
|
||||
5. Future QueueSnapshot requests should be responded to with the new queue included in the list.
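For illustration only, the control messages referenced in the scenarios above could be modelled roughly as the following Kotlin value types. The class and field names are assumptions, not the actual implementation, and the wire encoding onto the 'bridge.control' address is deliberately left out:

```kotlin
import java.util.UUID

/** One inbox or outbound queue, as advertised by a node. */
data class QueueInfo(
    val legalX500Name: String,   // expected on the destination TLS certificate
    val queueName: String        // e.g. "internal.peers." + hash of the legal identity key
)

/** Sent repeatedly by a bridge on start-up until at least one node answers. */
data class QueueQuery(val messageId: UUID = UUID.randomUUID(), val bridgeId: String)

/** Reply from a node; correlationId is null when broadcast unsolicited (e.g. on node restart). */
data class QueueSnapshot(
    val correlationId: UUID?,
    val inboxes: List<QueueInfo>,
    val outboundQueues: List<QueueInfo>
)

/** Asks bridges to form a new outgoing link when a new queue/peer becomes known. */
data class BridgeRequest(
    val legalX500Name: String,
    val queueName: String,
    val targetEndpoints: List<String>  // host:port candidates to try
)
```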
|
||||
|
||||
### Behaviour with a Float portion in the DMZ
|
||||
|
||||
1. On initial connection of an inbound bridge, AMQP is configured to run a SASL challenge response to (re-)validate the
|
||||
origin and confirm the client identity. (The most likely SASL mechanism for this is using https://tools.ietf.org/html/rfc3163
|
||||
as this allows reuse of our PKI certificates in the challenge response. Potentially we could forward some bridge control
|
||||
messages to cover the SASL exchange to the internal Bridge Controller. This would allow us to keep the private keys
|
||||
internal to the organisation, so we may also require a SASLAuth message type as part of the bridge control protocol.)
|
||||
2. The float restricts acceptable AMQP topics to the name space appropriate for inbound messages only. Hence, there
|
||||
should be no way to tunnel messages to bridge control, or RPC topics on the bus.
|
||||
3. On receipt of a message from the external network, the Float should append a header to link the source channel's X500
|
||||
name, then create a Delivery for forwarding the message inwards.
|
||||
4. The internal Bridge Control Manager process validates the message further to ensure that it is targeted at a legitimate
|
||||
inbox (i.e. not an outbound queue) and then forwards it to the bus. Once delivered to the broker, the Delivery
|
||||
acknowledgements are cascaded back.
|
||||
5. On receiving Delivery notification from the internal side, the Float acknowledges back the correlated original Delivery.
|
||||
6. The Float should protect against excessive inbound messages by AMQP flow control and refusing to accept excessive unacknowledged deliveries.
|
||||
7. The Float only exposes its inbound server socket when activated by a valid AMQP link from the Bridge Control Manager
|
||||
to allow for a simple HA pool of DMZ Float processes. (Floats cannot run hot-hot as this would invalidate Corda's
|
||||
message ordering guarantees.)
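As a rough sketch of the DMZ-side filtering described in points 2 and 3 above (the queue prefix and header name below are assumptions for illustration, not confirmed constants):

```kotlin
// Assumed constants for illustration only.
const val PEER_INBOX_PREFIX = "internal.peers."     // the inbound-only namespace
const val SOURCE_NAME_HEADER = "corda-source-x500"  // header linking the source channel's X500 name

/**
 * Decides whether an inbound message may be forwarded to the internal Bridge Manager.
 * Returns true and tags the message with the authenticated TLS identity of the sender,
 * or false if the destination is outside the permitted inbound namespace.
 */
fun filterInbound(
    destinationQueue: String,
    sourceTlsX500Name: String,
    headers: MutableMap<String, Any>
): Boolean {
    // Reject anything that could tunnel onto bridge control, RPC or outbound queues.
    if (!destinationQueue.startsWith(PEER_INBOX_PREFIX)) return false
    // Link the source channel's X500 name so the trusted internal portion can re-check it.
    headers[SOURCE_NAME_HEADER] = sourceTlsX500Name
    return true
}
```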
|
||||
|
||||
## Implementation plan
|
||||
|
||||
### Proposed incremental steps towards a float
|
||||
|
||||
1. First, I would like to more explicitly split the RPC and P2P MessagingService instances inside the Node. They can
|
||||
keep the same interface, but this would let us develop P2P and RPC at different rates if required.
|
||||
|
||||
2. The current in-node design with Artemis Core bridges should first be replaced with an equivalent piece of code that
|
||||
initiates send only bridges using an in-house wrapper over the proton-j library. Thus, the current Artemis message
|
||||
objects will be picked up from existing queues using the CORE protocol via an abstraction interface to allow later
|
||||
pluggable replacement. The specific subscribed queues are controlled as before and bridges started by the existing code
|
||||
path. The only difference is the bridges will be the new AMQP client code. The remote Artemis broker should accept
|
||||
transferred packets directly onto its own inbox queue and acknowledge receipt via standard AMQP Delivery notifications.
|
||||
This in turn will be acknowledged back to the Artemis Subscriber to permanently remove the message from the source
|
||||
Artemis queue. The headers for deduplication, address names, etc will need to be mapped to the AMQP messages and we will
|
||||
have to take care about the message payload. This should be an envelope that is capable in the future of being
|
||||
end-to-end encrypted. Where possible we should stay close to the current Artemis mappings.
|
||||
|
||||
3. We need to define a bridge control protocol, so that we can have an out of process float/bridge. The current process
|
||||
is that on message send the node checks the target address to see if the target queue already exists. If the queue
|
||||
doesn't exist it creates a new queue which includes an encoding of the PublicKey in its name. This is picked up by a
|
||||
wrapper around the Artemis Server which is also hosted inside the node and can ask the network map cache for a
|
||||
translation to a target host and port. This in turn allows a new bridge to be provisioned. At node restart the
|
||||
re-population of the network map cache is followed to re-create the bridges to any unsent queues/messages.
|
||||
|
||||
4. My proposal for a bridge control protocol is partly influenced by the fact that AMQP does not have a built-in
|
||||
mechanism for queue creation/deletion/enumeration. Also, the flows cannot progress until they are sure that there is an
|
||||
accepting queue. Finally, if one runs a local broker it should be fine to run multiple nodes without any bridge
|
||||
processes. Therefore, I will leave the queue creation as the node's responsibility. Initially we can continue to use the
|
||||
existing CORE protocol for this. The requirement to initiate a bridge will change from being implicit signalling via
|
||||
server queue detection to being an explicit pub-sub message that requests bridge formation. This doesn't need
|
||||
durability, or acknowledgements, because when a bridge process starts it should request a refresh of the required bridge
|
||||
list. The typical create bridge messages should contain:
|
||||
|
||||
1. The queue name (ideally with the sha256 of the PublicKey, not the whole PublicKey as that may not work on brokers with queue name length constraints).
|
||||
2. The expected X500Name for the remote TLS certificate.
|
||||
3. The list of host and ports to attempt connection to. See separate section for more info.
|
||||
|
||||
5. Once we have the bridge protocol in place and a bridge out of process the broker can move out of process too, which
|
||||
is a requirement for clustering anyway. We can then start work on floating the bridge and making our broker pluggable.
|
||||
|
||||
1. At this point the bridge connection to the local queues should be upgraded to also be AMQP client, rather than CORE
|
||||
protocol, which will give the ability for the P2P bridges to work with other broker products.
|
||||
2. An independent task is to look at making the Bridge process HA, probably using a similar hot-warm mastering solution
|
||||
as the node, or atomix.io. The inactive node should track the control messages, but obviously doesn't initiate any
|
||||
bridges.
|
||||
3. Another potentially parallel piece of development is to start to build a float, which is essentially just splitting
|
||||
the bridge in two and putting in an intermediate hop AMQP/TLS link. The thin proxy in the DMZ zone should be as
|
||||
stateless as possible in this.
|
||||
4. Finally, the node should use AMQP to talk to its local broker cluster, but this will have to remain partly tied
|
||||
to Artemis, as queue creation will require sending management messages to the Artemis core, but we should be
|
||||
able to abstract this.
|
||||
|
||||
### Float evolution
|
||||
|
||||
#### In-Process AMQP Bridging
|
||||
|
||||
![In-Process AMQP Bridging](./in-process-amqp-bridging.png)
|
||||
|
||||
In this phase of evolution we hook the same bridge creation code as before and use the same in-process data access to
|
||||
network map cache. However, we now implement AMQP sender clients using proton-j and netty for TLS layer and connection
|
||||
retry. This will also involve formalising the AMQP packet format of the Corda P2P protocol. Once a bridge makes a
|
||||
successful link to a remote node's Artemis broker it will subscribe to the associated local queue. The messages will be
|
||||
picked up from the local broker via an Artemis CORE consumer for simplicity of initial implementation. The queue
|
||||
consumer should be implemented with a simple generic interface as façade, to allow future replacement. The message will
|
||||
be sent across the AMQP protocol directly to the remote Artemis broker. Once acknowledgement of receipt is given with an
|
||||
AMQP Delivery notification the queue consumption will be acknowledged. This will remove the original item from the
|
||||
source queue. If delivery fails due to link loss the subscriber should be closed until a new link is established to
|
||||
ensure messages are not consumed. If delivery fails for other reasons there should be some form of periodic retry over
|
||||
the AMQP link. For authentication checks the client cert returned from the remote server will be checked and the link
|
||||
dropped if it doesn't match expectations.
|
||||
|
||||
#### Out of process Artemis Broker and Bridges
|
||||
![Out of process Artemis Broker and Bridges](./out-of-proc-artemis-broker-bridges.png)
|
||||
|
||||
Move the Artemis broker and bridge formation logic out of the node. This requires formalising the bridge creation
|
||||
requests, but allows clustered brokers, standardised AMQP usage and ultimately pluggable brokers. We should implement a
|
||||
netty socket server on the bridge and forward authenticated packets to the local Artemis broker inbound queues. An AMQP
|
||||
server socket is required for the float, although it should be transparent whether a NodeInfo refers to a bridge socket
|
||||
address, or an Artemis broker. The queue names should use the sha-256 of the PublicKey not the full key. Also, the name
|
||||
should be used for in and out queues, so that multiple distinct nodes can coexist on the same broker. This will simplify
|
||||
development as developers just run a background broker and shouldn't need to restart it. To export the network map
|
||||
information and to initiate bridges a non-durable bridge control protocol will be needed (in blue). Essentially the
|
||||
messages declare the local queue names and target TLS link information. For in-bound messages only messages for known
|
||||
inbox targets will be acknowledged. It should not be hard to make the bridges active-passive HA as they contain no
|
||||
persisted message state and simple RPC can resync the state of the bridge. Queue creation will remain with the node as
|
||||
this must use non-AMQP mechanisms and because flows should be able to queue sent messages even if the bridge is
|
||||
temporarily down. In parallel work can start to upgrade the local links to Artemis (i.e. the node-Artemis link and the
|
||||
Bridge Manager-Artemis link) to be AMQP clients as much as possible.
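As a rough illustration of the queue naming and the bridge control messages described above, a Kotlin sketch follows. The class and field names are invented for illustration and do not describe an actual wire format; only the use of the SHA-256 of the public key in queue names reflects the text.

```kotlin
import java.security.MessageDigest
import java.security.PublicKey

// Queue name keyed on the SHA-256 of the public key rather than the full key; the same name
// is used for in and out queues so that multiple distinct nodes can coexist on one broker.
fun peerQueueName(identityKey: PublicKey): String {
    val digest = MessageDigest.getInstance("SHA-256").digest(identityKey.encoded)
    return "internal.peers." + digest.joinToString("") { "%02x".format(it) }
}

// Hypothetical shape of the non-durable bridge control messages: the node declares its local
// queue names and the target TLS link information for each bridge it wants established.
data class BridgeRequest(
    val queueName: String,
    val targetEndpoints: List<String>,   // advertised P2P addresses of the counterparty
    val expectedLegalName: String        // compared against the client cert presented by the remote end
)

data class BridgeControlMessage(val sendingNode: String, val requestedBridges: List<BridgeRequest>)
```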
|
@ -1,50 +0,0 @@
|
||||
# Design Decision: Node starting & stopping
|
||||
|
||||
## Background / Context
|
||||
|
||||
The potential use of a crash shell is relevant to high availability capabilities of nodes.
|
||||
|
||||
## Options Analysis
|
||||
|
||||
### 1. Use crash shell
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. Already built into the node.
|
||||
2. Potentially add custom commands.
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. Won’t reliably work if the node is in an unstable state
|
||||
2. Not practical for running hundreds of nodes, as our customers are already trying to do.
|
||||
3. Doesn’t mesh with the user access controls of the organisation.
|
||||
4. Doesn’t interface to the existing monitoring and control systems i.e. Nagios, Geneos ITRS, Docker Swarm, etc.
|
||||
|
||||
### 2. Delegate to external tools
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. Doesn’t require change from our customers
|
||||
2. Will work even if node is completely stuck
|
||||
3. Allows scripted node restart schedules
|
||||
4. Doesn’t raise questions about access control lists and audit
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. More uncertainty about what customers do.
|
||||
2. Might be more requirements on us to interact nicely with lots of different products.
|
||||
3. Might mean we get blamed for faults in other people’s control software.
|
||||
4. Doesn’t coordinate with the node for graceful shutdown.
|
||||
5. Doesn’t address any crypto features that target protecting the AMQP headers.
|
||||
|
||||
## Recommendation and justification
|
||||
|
||||
Proceed with Option 2: Delegate to external tools
|
||||
|
||||
## Decision taken
|
||||
|
||||
Restarts should be handled by polite shutdown, followed by a hard clear. (RGB, JC, MH agreed)
|
||||
|
||||
.. toctree::
|
||||
|
||||
drb-meeting-20171116.md
|
@ -1,46 +0,0 @@
|
||||
# Design Decision: Message storage
|
||||
|
||||
## Background / Context
|
||||
|
||||
Storage of messages by the message broker has implications for replication technologies which can be used to ensure both
|
||||
[high availability](../design.md) and disaster recovery of Corda nodes.
|
||||
|
||||
## Options Analysis
|
||||
|
||||
### 1. Storage in the file system
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. Out of the box configuration.
|
||||
2. Recommended Artemis setup
|
||||
3. Faster
|
||||
4. Less likely to have interaction with DB Blob rules
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. Unaligned capture time of journal data compared to DB checkpointing.
|
||||
2. Replication options on Azure are limited. Currently we may be forced to the ‘Azure Files’ SMB mount, rather than the ‘Azure Data Disk’ option. This is still being evaluated
|
||||
|
||||
### 2. Storage in node database
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. Single point of data capture and backup
|
||||
2. Consistent solution between VM and physical box solutions
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. Doesn’t work on H2, or SQL Server. From my own testing LargeObject support is broken. The current Artemis code base does allow some pluggability, but not of the large object implementation, only of the SQL statements. We should lobby for someone to fix the implementations for SQLServer and H2.
|
||||
2. Probably much slower, although this needs measuring.
|
||||
|
||||
## Recommendation and justification
|
||||
|
||||
Continue with Option 1: Storage in the file system
|
||||
|
||||
## Decision taken
|
||||
|
||||
Use storage in the file system (for now)
|
||||
|
||||
.. toctree::
|
||||
|
||||
drb-meeting-20171116.md
|
@ -1,118 +0,0 @@
|
||||
# Design Review Board Meeting Minutes
|
||||
|
||||
**Date / Time:** 16/11/2017, 16:30
|
||||
|
||||
## Attendees
|
||||
|
||||
- Mark Oldfield (MO)
|
||||
- Matthew Nesbit (MN)
|
||||
- Richard Gendal Brown (RGB)
|
||||
- James Carlyle (JC)
|
||||
- Mike Hearn (MH)
|
||||
- Jose Coll (JoC)
|
||||
- Rick Parker (RP)
|
||||
- Andrey Bozhko (AB)
|
||||
- Dave Hudson (DH)
|
||||
- Nick Arini (NA)
|
||||
- Ben Abineri (BA)
|
||||
- Jonathan Sartin (JS)
|
||||
- David Lee (DL)
|
||||
|
||||
## Minutes
|
||||
|
||||
The meeting re-opened following prior discussion of the float design.
|
||||
|
||||
MN introduced the design for high availability, clarifying that the design did not include support for DR-implied features (asynchronous replication etc.).
|
||||
|
||||
MN highlighted limitations in testability: Azure had confirmed support for geo replication but with limited control by the user and no testing facility; all R3 can do is test for impact on performance.
|
||||
|
||||
The design was noted to be dependent on a lot of external components for replication, with R3's testing capability limited to Azure. Agent banks may want to use SAN across dark fiber sites, redundant switches etc. not available to R3.
|
||||
|
||||
MN noted that certain databases are not yet officially supported in Corda.
|
||||
|
||||
### [Near-term-target](./near-term-target.md), [Medium-term target](./medium-term-target.md)
|
||||
|
||||
Outlining the hot-cold design, MN highlighted importance of ensuring only one node is active at one time. MN argued for having a tested hot-cold solution as a ‘backstop’. MN confirmed the work involved was to develop DB/SAN exclusion checkers and test appropriately.
|
||||
|
||||
JC queried whether unknowns exist for hot-cold. MN described limitations of Azure file replication.
|
||||
|
||||
JC noted there was optionality around both the replication mechanisms and the on-premises vs. cloud deployment.
|
||||
|
||||
### [Message storage](./db-msg-store.md)
|
||||
|
||||
Lack of support for storing Artemis messages via JDBC was raised, and the possibility for RedHat to provide an enhancement was discussed.
|
||||
|
||||
MH raised the alternative of using Artemis’ inbuilt replication protocol - MN confirmed this was in scope for hot-warm, but not hot-cold.
|
||||
|
||||
JC posited that file system/SAN replication should be OK for banks.
|
||||
|
||||
**DECISION AGREED**: Use storage in the file system (for now)
|
||||
|
||||
AB asked about protections against corruption; RGB highlighted the need for testing on this. MH described previous testing activity, arguing for a performance cluster that repeatedly runs load tests, kills nodes, and checks they come back, etc.
|
||||
|
||||
MN could not comment on testing status of current code. MH noted the notary hasn't been tested.
|
||||
|
||||
AB queried how basic node recovery would work. MN explained, highlighting the limitation for RPC callbacks.
|
||||
|
||||
JC proposed these limitations should be noted and explained to Finastra; move on.
|
||||
|
||||
There was discussion of how RPC observables could be made to persist across node outages. MN argued that for most applications, a clear signal of the outage that triggered clients to resubscribe was preferable. This was agreed.
|
||||
|
||||
JC argued for using Kafka.
|
||||
|
||||
MN presented the Hot-warm solution as a target for March-April and provided clarifications on differences vs. hot-cold and hot-hot.
|
||||
|
||||
JC highlighted that the clustered Artemis was an important intermediate step. MN highlighted other important features.
|
||||
|
||||
MO noted that different banks may opt for different solutions.
|
||||
|
||||
JoC raised the question of multi-IP per node.
|
||||
|
||||
MN described the Hot-hot solution, highlighting that flows remained 'sticky' to a particular instance but could be picked up by another when needed.
|
||||
|
||||
AB preferred the hot-hot solution. MN noted the many edge cases to be worked through.
|
||||
|
||||
AB queried the DR story. MO stated this was out of scope at present.
|
||||
|
||||
There was discussion of the implications of not having synchronous replication.
|
||||
|
||||
MH questioned the need for a backup strategy that allows winding back the clock. MO stated this was out of scope at present.
|
||||
|
||||
MO drew attention to the expectation that Corda would be considered part of larger solutions with controlled restore procedures under BCP.
|
||||
|
||||
JC noted the variability in many elements as a challenge.
|
||||
|
||||
MO argued for providing a 'shrink-wrapped' solution based around equipment R3 could test (e.g. Azure)
|
||||
|
||||
JC argued for the need to manage testing of banks' infrastructure choices in order to reduce time to implementation.
|
||||
|
||||
There was discussion around the semantic difference between HA and DR. MH argued for a definition based around rolling backups. MN and MO shared banks' view of what DR is. MH contrasted this with Google definitions. AB noted HA and DR have different SLAs.
|
||||
|
||||
**DECISION AGREED:** Near-term target: Hot Cold; Medium-term target: Hot-warm (RGB, JC, MH agreed)
|
||||
|
||||
RGB queried why Artemis couldn't be run in clustered mode now. MN explained.
|
||||
|
||||
AB queried what Finastra asked for. MO implied nothing specific; MH maintained this would be needed anyway.
|
||||
|
||||
### [Broker separation](./external-broker.md)
|
||||
|
||||
MN outlined his rationale for Broker separation.
|
||||
|
||||
JC queried whether this would affect demos.
|
||||
|
||||
MN gave an assumption that HA was for enterprise only; RGB and JC pointed out that Enterprise might still be made available for non-production use.
|
||||
|
||||
**DECISION AGREED**: The broker should only be separated if required by other features (e.g. the float), otherwise not. (RGB, JC, MH agreed).
|
||||
|
||||
### [Load balancers and multi-IP](./ip-addressing.md)
|
||||
|
||||
The topic was discussed.
|
||||
|
||||
**DECISION AGREED**: The design can allow for optional load balancers to be implemented by clients.
|
||||
|
||||
### [Crash shell](./crash-shell.md)
|
||||
|
||||
MN provided outline explanation.
|
||||
|
||||
**DECISION AGREED**: Restarts should be handled by polite shutdown, followed by a hard clear. (RGB, JC, MH agreed)
|
||||
|
@ -1,48 +0,0 @@
|
||||
# Design Decision: Broker separation
|
||||
|
||||
## Background / Context
|
||||
|
||||
A decision of whether to extract the Artemis message broker as a separate component has implications for the design of
|
||||
[high availability](../design.md) for nodes.
|
||||
|
||||
## Options Analysis
|
||||
|
||||
### 1. No change (leave broker embedded)
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. Least change
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. Means that starting/stopping Corda is tightly coupled to starting/stopping Artemis instances.
|
||||
2. Risks resource leaks from one system component affecting other components.
|
||||
3. Not pluggable if we wish to have an alternative broker.
|
||||
|
||||
### 2. External broker
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. Separates concerns
|
||||
2. Allows future pluggability and standardisation on AMQP
|
||||
3. Separates life cycles of the components
|
||||
4. Makes Artemis deployment much more out of the box.
|
||||
5. Allows easier tuning of VM resources for Flow processing workloads vs broker type workloads.
|
||||
6. Allows later encrypted version to be an enterprise feature that can interoperate with OS versions.
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. More work
|
||||
2. Requires creating a protocol to control external bridge formation.
|
||||
|
||||
## Recommendation and justification
|
||||
|
||||
Proceed with Option 2: External broker
|
||||
|
||||
## Decision taken
|
||||
|
||||
The broker should only be separated if required by other features (e.g. the float), otherwise not. (RGB, JC, MH agreed).
|
||||
|
||||
.. toctree::
|
||||
|
||||
drb-meeting-20171116.md
|
@ -1,46 +0,0 @@
|
||||
# Design Decision: IP addressing mechanism (near-term)
|
||||
|
||||
## Background / Context
|
||||
|
||||
End-to-end encryption is a desirable potential design feature for the [high availability support](../design.md).
|
||||
|
||||
## Options Analysis
|
||||
|
||||
### 1. Via load balancer
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. Standard technology in banks and on clouds, often for non-HA purposes.
|
||||
2. Intended to allow us to wait for completion of network map work.
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. We do need to support multiple IP address advertisements in network map long term.
|
||||
2. Might involve a small amount of code if we find Artemis doesn’t like the health probes. So far, though, testing of the Azure Load Balancer hasn’t needed this.
|
||||
3. Won’t work over very large data centre separations, but that doesn’t work for HA/DR either
|
||||
|
||||
### 2. Via IP list in Network Map
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. More flexible
|
||||
2. More deployment options
|
||||
3. We will need it one day
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. Have to write code to support it.
|
||||
2. Configuration is more complicated and the nodes are now non-equivalent, so you can’t just copy the config to the backup.
|
||||
3. Artemis has round robin and automatic failover, so we may have to expose a vendor specific config flag in the network map.
|
||||
|
||||
## Recommendation and justification
|
||||
|
||||
Proceed with Option 1: Via Load Balancer
|
||||
|
||||
## Decision taken
|
||||
|
||||
The design can allow for optional load balancers to be implemented by clients. (RGB, JC, MH agreed)
|
||||
|
||||
.. toctree::
|
||||
|
||||
drb-meeting-20171116.md
|
@ -1,49 +0,0 @@
|
||||
# Design Decision: Medium-term target for node HA
|
||||
|
||||
## Background / Context
|
||||
|
||||
Designing for high availability is a complex task which can only be delivered over an operationally-significant
|
||||
timeline. It is therefore important to determine whether an intermediate state design (deliverable for around March
|
||||
2018) is desirable as a precursor to longer term outcomes.
|
||||
|
||||
## Options Analysis
|
||||
|
||||
### 1. Hot-warm as interim state
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. Simpler master/slave election logic
|
||||
2. Less edge cases with respect to messages being consumed by flows.
|
||||
3. Naive solution of just stopping/starting the node code is simple to implement.
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. Still probably requires the Artemis MQ outside of the node in a cluster.
|
||||
2. May actually turn out more risky than hot-hot, because shutting down code is always prone to deadlocks and resource leakages.
|
||||
3. Some work would have to be thrown away when we create a full hot-hot solution.
|
||||
|
||||
### 2. Progress immediately to Hot-hot
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. Horizontal scalability is what all our customers want.
|
||||
2. It simplifies many deployments as nodes in a cluster are all equivalent.
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. More complicated especially regarding message routing.
|
||||
2. Riskier to do this big-bang style.
|
||||
3. Might not meet deadlines.
|
||||
|
||||
## Recommendation and justification
|
||||
|
||||
Proceed with Option 1: Hot-warm as interim state.
|
||||
|
||||
## Decision taken
|
||||
|
||||
Adopt option 1: Medium-term target: Hot Warm (RGB, JC, MH agreed)
|
||||
|
||||
.. toctree::
|
||||
|
||||
drb-meeting-20171116.md
|
||||
|
@ -1,46 +0,0 @@
|
||||
# Design Decision: Near-term target for node HA
|
||||
|
||||
## Background / Context
|
||||
|
||||
Designing for high availability is a complex task which can only be delivered over an operationally-significant
|
||||
timeline. It is therefore important to determine the target state in the near term as a precursor to longer term
|
||||
outcomes.
|
||||
|
||||
## Options Analysis
|
||||
|
||||
### 1. No HA
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. Reduces developer distractions.
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. No backstop if we miss our targets for fuller HA.
|
||||
2. No answer at all for simple DR modes.
|
||||
|
||||
### 2. Hot-cold (see [HA design doc](../design.md))
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. Flushes out lots of basic deployment issues that will be of benefit later.
|
||||
2. If stuff slips we at least have a backstop position with hot-cold.
|
||||
3. For now, the only DR story we have is essentially a continuation of this mode
|
||||
4. The intent of decisions such as using a load balancer is to minimise code changes.
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. Distracts from the work for more complete forms of HA.
|
||||
2. Involves creating a few components that are not much use later, for instance the mutual exclusion lock.
|
||||
|
||||
## Recommendation and justification
|
||||
|
||||
Proceed with Option 2: Hot-cold.
|
||||
|
||||
## Decision taken
|
||||
|
||||
Adopt option 2: Near-term target: Hot Cold (RGB, JC, MH agreed)
|
||||
|
||||
.. toctree::
|
||||
|
||||
drb-meeting-20171116.md
|
@ -1,284 +0,0 @@
|
||||
# High availability support
|
||||
|
||||
.. important:: This design document describes a feature of Corda Enterprise.
|
||||
|
||||
## Overview
|
||||
### Background
|
||||
|
||||
The term high availability (HA) is used in this document to refer to the ability to rapidly handle any single component
|
||||
failure, whether due to physical issues (e.g. hard drive failure), network connectivity loss, or software faults.
|
||||
|
||||
Expectations of HA in modern enterprise systems are for systems to recover normal operation in a few minutes at most,
|
||||
while ensuring minimal/zero data loss. Whilst overall reliability is the overriding objective, it is desirable for Corda
|
||||
to offer HA mechanisms which are both highly automated and transparent to node operators. HA mechanism must not involve
|
||||
any configuration changes that require more than an appropriate admin tool, or a simple start/stop of a process as that
|
||||
would need an Emergency Change Request.
|
||||
|
||||
HA naturally grades into requirements for Disaster Recovery (DR), which requires that there is a tested procedure to
|
||||
handle large scale multi-component failures e.g. due to data centre flooding, acts of terrorism. DR processes are
|
||||
permitted to involve significant manual intervention, although the complications of actually invoking a Business
|
||||
Continuity Plan (BCP) mean that the less manual intervention, the more competitive Corda will be in the modern vendor
|
||||
market. For modern financial institutions, maintaining comprehensive and effective BCP procedures are a legal
|
||||
requirement which is generally tested at least once a year.
|
||||
|
||||
However, until Corda is the system of record, or the primary system for transactions we are unlikely to be required to
|
||||
have any kind of fully automatic DR. In fact, we are likely to be restarted only once BCP has restored the most critical
|
||||
systems. In contrast, typical financial institutions maintain large, complex technology landscapes in which individual
|
||||
component failures can occur, such as:
|
||||
|
||||
* Small scale software failures
|
||||
* Mandatory data centre power cycles
|
||||
* Operating system patching and restarts
|
||||
* Short lived network outages
|
||||
* Middleware queue build-up
|
||||
* Machine failures
|
||||
|
||||
Thus, HA is essential for enterprise Corda, as is providing administrators with the help necessary for rapid fault diagnosis.
|
||||
|
||||
### Current node topology
|
||||
|
||||
![Current (single process)](./no-ha.png)
|
||||
|
||||
The current solution has a single integrated process running in one JVM including Artemis, H2 database, Flow State
|
||||
Machine, P2P bridging. All storage is on the local file system. There is no HA capability other than manual restart of
|
||||
the node following failure.
|
||||
|
||||
#### Limitations
|
||||
|
||||
- All sub-systems must be started and stopped together.
|
||||
- Unable to handle partial failure e.g. Artemis.
|
||||
- Artemis cannot use its in-built HA capability (clustered slave mode) as it is embedded.
|
||||
- Cannot run the node with the flow state machine suspended.
|
||||
- Cannot use alternative message brokers.
|
||||
- Cannot run multiple nodes against the same broker.
|
||||
- Cannot use alternative databases to H2.
|
||||
- Cannot share the database across Corda nodes.
|
||||
- RPC clients do have automatic reconnect but there is no clear solution for resynchronising on reconnect.
|
||||
- The backup strategy is unclear.
|
||||
|
||||
## Requirements
|
||||
### Goals
|
||||
|
||||
* A logical Corda node should continue to function in the event of an individual component failure or (e.g.) restart.
|
||||
* No loss, corruption or duplication of data on the ledger due to component outages
|
||||
* Ensure continuity of flows throughout any disruption
|
||||
* Support software upgrades in a live network
|
||||
|
||||
### Non-goals (out of scope for this design document)
|
||||
|
||||
* Be able to distribute a node over more than two data centers.
|
||||
* Be able to distribute a node between data centers that are very far apart latency-wise (unless you don't care about performance).
|
||||
* Be able to tolerate arbitrary byzantine failures within a node cluster.
|
||||
* DR, specifically in the case of the complete failure of a site/datacentre/cluster or region will require a different
|
||||
solution to that specified here. For now DR is only supported where performant synchronous replication is feasible
|
||||
i.e. sites only a few miles apart.
|
||||
|
||||
## Timeline
|
||||
|
||||
This design document outlines a range of topologies which will be enabled through progressive enhancements from the
|
||||
short to long term.
|
||||
|
||||
On the timescales available for the current production pilot deployments we clearly do not have time to reach the ideal
|
||||
of a highly fault tolerant, horizontally scaled Corda.
|
||||
|
||||
Instead, I suggest that we can achieve only the simplest state of a standby Corda installation by January 5th, and
|
||||
even this is contingent on other enterprise features, such as external database and network map stabilisation being
|
||||
completed on this timescale, plus any issues raised by testing.
|
||||
|
||||
For the Enterprise GA timeline, I hope that we can achieve a more fully automatic node failover state, with the Artemis
|
||||
broker running as a cluster too. I include a diagram of a fully scaled Corda for completeness and so that I can discuss
|
||||
what work is re-usable/throw away.
|
||||
|
||||
With regards to DR it is unclear how this would work where synchronous replication is not feasible. At this point we can
|
||||
only investigate approaches as an aside to the main thrust of work for HA support. In the synchronous replication mode
|
||||
it is assumed that the file and database replication can be used to ensure a cold DR backup.
|
||||
|
||||
## Design Decisions
|
||||
|
||||
The following design decisions are assumed by this design:
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
||||
decisions/near-term-target.md
|
||||
decisions/medium-term-target.md
|
||||
decisions/external-broker.md
|
||||
decisions/db-msg-store.md
|
||||
decisions/ip-addressing.md
|
||||
decisions/crash-shell.md
|
||||
|
||||
## Target Solution
|
||||
|
||||
### Hot-Cold (minimum requirement)
|
||||
![Hot-Cold (minimum requirement)](./hot-cold.png)
|
||||
|
||||
Small scale software failures on a node are recovered from locally via restarting/re-setting the offending component by
|
||||
the external (to JVM) "Health Watchdog" (HW) process. The HW process (e.g. a shell script or similar) would monitor
|
||||
parameters for Java processes by periodically querying them (sleep period of a few seconds). This may require the introduction of
|
||||
a few monitoring 'hooks' into Corda codebase or a "health" CorDapp the HW script can interface with. There would be a
|
||||
back-off logic to prevent continuous restarts in the case of persistent failure.
|
||||
|
||||
We would provide a fully-functional sample HW script for Linux/Unix deployment platforms.
|
||||
|
||||
The hot-cold design provides a backup VM and Corda deployment instance that can be manually started if the primary is
|
||||
stopped. The failed primary must be killed to ensure it is fully stopped.
|
||||
|
||||
For single-node deployment scenarios the simplest supported way to recover from failures is to re-start the entire set
|
||||
of Corda Node processes or reboot the node OS.
|
||||
|
||||
For a 2-node HA deployment scenario a load balancer determines which node is active and routes traffic to that node. The
|
||||
load balancer will need to monitor the health of the primary and secondary nodes and automatically route traffic from
|
||||
the public IP address to the only active end-point. An external solution is required for the load balancer and health
|
||||
monitor. In the case of Azure cloud deployments, no custom code needs to be developed to support the health monitor.
|
||||
|
||||
An additional component will be written to prevent accidental dual running which is likely to make use of a database
|
||||
heartbeat table. Code size should be minimal.
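A minimal sketch of such a dual-running check is given below, assuming a single-row heartbeat table and H2-style SQL; the table name, schema and timeout are invented for illustration, and the real component would need proper configuration and error handling.

```kotlin
import java.sql.Connection
import java.time.Instant

// Hypothetical sketch of dual-running prevention via a single-row heartbeat table.
// Assumed schema: NODE_MUTUAL_EXCLUSION(ID INT PRIMARY KEY, HOLDER VARCHAR, LAST_SEEN TIMESTAMP)
class MutualExclusionLease(private val conn: Connection, private val me: String, private val staleAfterSeconds: Long = 60) {

    /** Returns true if this node may start, i.e. no other live node appears to hold the lease. */
    fun tryAcquire(): Boolean {
        conn.prepareStatement("SELECT HOLDER, LAST_SEEN FROM NODE_MUTUAL_EXCLUSION WHERE ID = 1").executeQuery().use { rs ->
            if (rs.next()) {
                val holder = rs.getString("HOLDER")
                val lastSeen = rs.getTimestamp("LAST_SEEN").toInstant()
                val stale = Instant.now().isAfter(lastSeen.plusSeconds(staleAfterSeconds))
                if (holder != me && !stale) return false   // another node appears to be running
            }
        }
        // H2-flavoured upsert; other databases would need dialect-specific SQL.
        conn.prepareStatement("MERGE INTO NODE_MUTUAL_EXCLUSION (ID, HOLDER, LAST_SEEN) KEY(ID) VALUES (1, ?, CURRENT_TIMESTAMP)").apply {
            setString(1, me)
        }.executeUpdate()
        return true
    }

    /** Called periodically while running so a backup node can detect liveness. */
    fun beat() {
        conn.prepareStatement("UPDATE NODE_MUTUAL_EXCLUSION SET LAST_SEEN = CURRENT_TIMESTAMP WHERE ID = 1 AND HOLDER = ?").apply {
            setString(1, me)
        }.executeUpdate()
    }
}
```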
|
||||
|
||||
#### Advantages
|
||||
|
||||
- This approach minimises the need for new code so can be deployed quickly.
|
||||
- Use of a load balancer in the short term avoids the need for new code and configuration management to support the alternative approach of multiple advertised addresses for a single legal identity.
|
||||
- Configuration of the inactive mode should be a simple mirror of the primary.
|
||||
- Assumes external monitoring and management of the nodes e.g. ability to identify node failure and that Corda watchdog code will not be required (customer developed).
|
||||
|
||||
#### Limitations
|
||||
|
||||
- Slow failover as this is manually controlled.
|
||||
- Requires external solutions for replication of database and Artemis journal data.
|
||||
- Replication mechanism on agent banks with real servers not tested.
|
||||
- Replication mechanism on Azure is under test but may prove to be too slow.
|
||||
- Compatibility with external load balancers not tested. Only Azure configuration tested.
|
||||
- Contingent on completion of database support and testing of replication.
|
||||
- Failure of database (loss of connection) may not be supported or may require additional code.
|
||||
- RPC clients assumed to make short lived RPC requests e.g. from Rest server so no support for long term clients operating across failover.
|
||||
- Replication time points of the database and Artemis message data are independent and may not fully synchronise (may work, subject to testing).
|
||||
- Health reporting and process controls need to be developed by the customer.
|
||||
|
||||
### Hot-Warm (Medium-term solution)
|
||||
![Hot-Warm (Medium-term solution)](./hot-warm.png)
|
||||
|
||||
Hot-warm aims to automate failover and provide failover of individual major components e.g. Artemis.
|
||||
|
||||
It involves two key changes to the hot-cold design:
|
||||
1) Separation and clustering of the Artemis broker.
|
||||
2) Start and stop of flow processing without JVM exit.
|
||||
|
||||
The consequences of these changes are that peer to peer bridging is separated from the node and a bridge control
|
||||
protocol must be developed. A leader election component is a precursor to load balancing – likely to be a combination
|
||||
of custom code and standard library and, in the short term, is likely to be via the database. Cleaner handling of
|
||||
disconnects from the external components (Artemis and the database) will also be needed.
|
||||
|
||||
#### Advantages
|
||||
|
||||
- Faster failover as no manual intervention.
|
||||
- We can use Artemis replication protocol to replicate the message store.
|
||||
- The approach is integrated with preliminary steps for the float.
|
||||
- Able to handle loss of network connectivity to the database from one node.
|
||||
- Extraction of Artemis server allows a more standard Artemis deployment.
|
||||
- Provides protection against resource leakage in Artemis or Node from affecting the other component.
|
||||
- VMs can be tuned to address different work load patterns of broker and node.
|
||||
- Bridge work allows chance to support multiple IP addresses without a load balancer.
|
||||
|
||||
#### Limitations
|
||||
|
||||
- This approach will require careful testing of resource management on partial shutdown.
|
||||
- No horizontal scaling support.
|
||||
- Deployment of master and slave may not be completely symmetric.
|
||||
- Care must be taken with upgrades to ensure master/slave election operates across updates.
|
||||
- Artemis clustering does require a designated master at start-up of its cluster hence any restart involving changing
|
||||
the primary node will require configuration management.
|
||||
- The development effort is much more significant than the hot-cold configuration.
|
||||
|
||||
### Hot-Hot (Long-term strategic solution)
|
||||
![Hot-Hot (Long-term strategic solution)](./hot-hot.png)
|
||||
|
||||
In this configuration, all nodes are actively processing work and share a clustered database. A mechanism for sharding
|
||||
or distributing the work load will need to be developed.
|
||||
|
||||
#### Advantages
|
||||
|
||||
- Faster failover as flows are picked up by other active nodes.
|
||||
- Rapid scaling by adding additional nodes.
|
||||
- Node deployment is symmetric.
|
||||
- Any broker that can support AMQP can be used.
|
||||
- RPC can gracefully handle failover because responsibility for the flow can be migrated across nodes without the client being aware.
|
||||
|
||||
#### Limitations
|
||||
|
||||
- Very significant work with many edge cases during failure.
|
||||
- Will require handling of more states than just checkpoints e.g. soft locks and RPC subscriptions.
|
||||
- Single flows will not be active on multiple nodes without future development work.
|
||||
|
||||
## Implementation plan
|
||||
|
||||
### Transitioning from Corda 2.0 to Manually Activated HA
|
||||
|
||||
The current Corda is built to run as a fully contained single process with the Flow logic, H2 database and Artemis
|
||||
broker all bundled together. This limits the options for automatic replication or handling of subsystem failure. Thus, we must use
|
||||
external mechanisms to replicate the data in the case of failure. We also should ensure that accidental dual start is
|
||||
not possible in case of mistakes, or slow shutdown of the primary.
|
||||
|
||||
Based on this situation, I suggest the following minimum development tasks are required for a tested HA deployment:
|
||||
|
||||
1. Complete and merge JDBC support for an external clustered database. Azure SQL Server has been identified as the most
|
||||
likely initial deployment. With this we should be able to point at an HA database instance for Ledger and Checkpoint data.
|
||||
2. I am suggesting that for the near term we just use the Azure Load Balancer to hide the multiple machine addresses.
|
||||
This does require allowing a health monitoring link to the Artemis broker, but so far testing indicates that this
|
||||
operates without issue. Longer term we need to ensure that the network map and configuration support exists for the
|
||||
system to work with multiple TCP/IP endpoints advertised to external nodes. Ideally this should be rolled into the
|
||||
work for AMQP bridges and Floats.
|
||||
3. Implement a very simple mutual exclusion feature, so that an enterprise node cannot start if another is running onto
|
||||
the same database. This can be via a simple heartbeat update in the database, or possibly some other library. This
|
||||
feature should be enabled only when specified by configuration.
|
||||
4. The replication of the Artemis Message Queues will have to be via an external mechanism. On Azure we believe that the
|
||||
only practical solution is the 'Azure Files' approach, which maps a virtual Samba drive. We are testing this in case it
|
||||
is too slow to work. The mounting of separate Data Disks is possible, but they can only be mounted to one VM at a
|
||||
time, so they would not be compatible with the goal of no change requests for HA.
|
||||
5. Improve health monitoring to better indicate fault failure. Extending the existing JMX and logging support should
|
||||
achieve this, although we probably need to create a watchdog CorDapp that verifies that the State Machine and Artemis
|
||||
messaging are able to process new work and to monitor flow latency.
|
||||
6. Test the checkpointing mechanism and confirm that failures don't corrupt the data by deploying an HA setup on Azure
|
||||
and driving flows through the system as we stop the node randomly and switch to the other node. If this reveals any
|
||||
issues we will have to fix them.
|
||||
7. Confirm that the behaviour of the RPC Client API is stable through these restarts, from the perspective of a stateless
|
||||
REST server calling through to RPC. The RPC API should provide positive feedback to the application, so that it can
|
||||
respond in a controlled fashion when disconnected.
|
||||
8. Work on flow hospital tools where needed
|
||||
|
||||
### Moving Towards Automatic Failover HA
|
||||
|
||||
To move towards more automatic failover handling we need to ensure that the node can be partially active i.e. live
|
||||
monitoring the health status and perhaps keeping major data structures in sync for faster activation, but not actually
|
||||
processing flows. This needs to be reversible without leakage, or destabilising the node as it is common to use manually
|
||||
driven master changes to help with software upgrades and to carry out regular node shutdown and maintenance. Also, to
|
||||
reduce the risks associated with the uncoupled replication of the Artemis message data and the database I would
|
||||
recommend that we move the Artemis broker out of the node to allow us to create a failover cluster. This is also in line
|
||||
with the goal of creating AMQP bridges and Floats.
|
||||
|
||||
To this end I would suggest packages of work that include:
|
||||
|
||||
1. Move the broker out of the node, which will require having a protocol that can be used to signal bridge creation and
|
||||
which decouples the network map. This is in line with the Flow work anyway.
|
||||
2. Create a mastering solution, probably using Atomix.IO although this might require a solution with a minimum of three
|
||||
nodes to avoid split brain issues. Ideally this service should be extensible in the future to lead towards an eventual
|
||||
state with Flow level sharding. Alternatively, we may be able to add a quick enterprise adaptor to ZooKeeper as
|
||||
master selector if time is tight (see the sketch after this list). This will inevitably impact upon configuration and deployment support.
|
||||
3. Test for leakage when we repeatedly start and stop the Node class, and fix any resource leaks or deadlocks that occur at shutdown.
|
||||
4. Switch the Artemis client code to be able to use the HA mode connection type and thus take advantage of the rapid
|
||||
failover code. Also, ensure that we can support multiple public IP addresses reported in the network map.
|
||||
5. Implement proper detection and handling of disconnect from the external database and/or Artemis broker, which should
|
||||
immediately drop the master status of the node and flush any incomplete flows.
|
||||
6. We should start looking at how to make RPC proxies recover from disconnect/failover, although this is probably not a
|
||||
top priority. However, it would be good to capture the missed results of completed flows and ensure the API allows
|
||||
clients to unregister/re-register Observables.
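Regarding package 2, if the ZooKeeper route were taken, the master selector could be little more than a Curator leader latch, as in the sketch below. The connection string, latch path and the becomeMaster/relinquishMaster hooks are assumptions for illustration; an Atomix-based solution would look different.

```kotlin
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.framework.recipes.leader.LeaderLatch
import org.apache.curator.framework.recipes.leader.LeaderLatchListener
import org.apache.curator.retry.ExponentialBackoffRetry

// Sketch of a ZooKeeper-backed master selector using Apache Curator's LeaderLatch recipe.
fun startMasterElection(zkConnect: String, nodeId: String, becomeMaster: () -> Unit, relinquishMaster: () -> Unit): LeaderLatch {
    val client = CuratorFrameworkFactory.newClient(zkConnect, ExponentialBackoffRetry(1000, 3))
    client.start()
    val latch = LeaderLatch(client, "/corda/ha/master", nodeId)
    latch.addListener(object : LeaderLatchListener {
        override fun isLeader() = becomeMaster()          // start flow processing, claim bridges, etc.
        override fun notLeader() = relinquishMaster()     // drop master status, flush incomplete flows
    })
    latch.start()
    return latch
}
```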
|
||||
|
||||
## The Future
|
||||
|
||||
Hopefully, most of the work from the automatic failover mode can be modified when we move to a full hot-hot sharding of
|
||||
flows across nodes. The mastering solution will need to be modified to negotiate finer grained claim on individual
|
||||
flows, rather than stopping the whole node. Also, the routing of messages will have to be thought about so that they
|
||||
go to the correct node for processing, but failover if the node dies. However, most of the other health monitoring and
|
||||
operational aspects should be reusable.
|
||||
|
||||
We also need to look at DR issues and in particular how we might handle asynchronous replication and possibly
|
||||
alternative recovery/reconciliation mechanisms.
|
@ -1,50 +0,0 @@
|
||||
# Design Decision: Storage engine for committed state index
|
||||
|
||||
## Background / Context
|
||||
|
||||
The storage engine for the committed state index needs to support a single operation: "insert all values with unique
|
||||
keys, or abort if any key conflict found". A wide range of solutions could be used for that, from embedded key-value
|
||||
stores to full-fledged relational databases. However, since we don't need any extra features a RDBMS provides over a
|
||||
simple key-value store, we'll only consider lightweight embedded solutions to avoid extra operational costs.
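Expressed as code against a deliberately abstract key-value store, the required operation is roughly the following; the `KeyValueStore` interface here is a placeholder for whichever engine is chosen, not a real API.

```kotlin
// The single operation the committed state index must support:
// "insert all values with unique keys, or abort if any key conflict is found".
interface KeyValueStore {
    fun get(key: ByteArray): ByteArray?
    fun putAll(entries: Map<ByteArray, ByteArray>)   // assumed to be an atomic batch write
}

class ConflictException(val conflictingKeys: List<ByteArray>) : Exception("Conflicting keys found")

fun commitUnique(store: KeyValueStore, entries: Map<ByteArray, ByteArray>) {
    val conflicts = entries.keys.filter { store.get(it) != null }
    if (conflicts.isNotEmpty()) throw ConflictException(conflicts)
    store.putAll(entries)
}
```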
|
||||
|
||||
Most RDBMSs are also generally optimised for read performance (use B-tree based storage engines like InnoDB, MyISAM).
|
||||
Our workload is write-heavy and uses "random" primary keys (state references), which leads to particularly poor write
|
||||
performance for those types of engines – as we have seen with our Galera-based notary service. One exception is the
|
||||
MyRocks storage engine, which is based on RocksDB and can handle write workloads well, and is supported by Percona
|
||||
Server and MariaDB. It is easier, however, to just use RocksDB directly.
|
||||
|
||||
## Options Analysis
|
||||
|
||||
### A. RocksDB
|
||||
|
||||
An embedded key-value store based on log-structured merge-trees (LSM). It's highly configurable, provides lots of
|
||||
configuration options for performance tuning. E.g. can be tuned to run on different hardware – flash, hard disks or
|
||||
entirely in-memory.
|
||||
|
||||
### B. LMDB
|
||||
|
||||
An embedded key-value store using B+ trees, has ACID semantics and support for transactions.
|
||||
|
||||
### C. MapDB
|
||||
|
||||
An embedded Java database engine, providing persistent collection implementations. Uses memory mapped files. Simple to
|
||||
use, implements Java collection interfaces. Provides a HashMap implementation that we can use for storing committed
|
||||
states.
|
||||
|
||||
### D. MVStore
|
||||
|
||||
An embedded log structured key-value store. Provides a simple persistent map abstraction. Supports multiple map
|
||||
implementations (B-tree, R-tree, concurrent B-tree).
|
||||
|
||||
## Recommendation and justification
|
||||
|
||||
Performance test results when running on a Macbook Pro with Intel Core i7-4980HQ CPU @ 2.80GHz, 16 GB RAM, SSD:
|
||||
|
||||
![Comparison](../images/store-comparison.png)
|
||||
|
||||
Multiple tests were run with varying number of transactions and input states per transaction: "1m x 1" denotes a million
|
||||
transactions with one input state.
|
||||
|
||||
Proceed with Option A, as RocksDB provides the most tuning options and achieves by far the best write performance.
|
||||
|
||||
Note that the index storage engine can be replaced in the future with minimal changes required on the notary service.
|
@ -1,144 +0,0 @@
|
||||
# Design Decision: Replication framework
|
||||
|
||||
## Background / Context
|
||||
|
||||
Multiple libraries/platforms exist for implementing fault-tolerant systems. In existing CFT notary implementations we
|
||||
experimented with using a traditional relational database with active replication, as well as a pure state machine
|
||||
replication approach based on CFT consensus algorithms.
|
||||
|
||||
## Options Analysis
|
||||
|
||||
### A. Atomix
|
||||
|
||||
*Raft-based fault-tolerant distributed coordination framework.*
|
||||
|
||||
Our first CFT notary implementation was based on Atomix. Atomix can be easily embedded into a Corda node and
|
||||
provides abstractions for implementing custom replicated state machines. In our case the state machine manages committed
|
||||
Corda contract states. When notarisation requests are sent to Atomix, they get forwarded to the leader node. The leader
|
||||
persists the request to a log, and replicates it to all followers. Once the majority of followers acknowledge receipt,
|
||||
it applies the request to the user-defined state machine. In our case we commit all input states in the request to a
|
||||
JDBC-backed map, or return an error if conflicts occur.
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. Lightweight, easy to integrate – embeds into Corda node.
|
||||
2. Uses Raft for replication – simpler and requires less code than other algorithms like Paxos.
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. Not designed for storing large datasets. State is expected to be maintained in memory only. On restart, each replica re-reads the entire command log to reconstruct the state. This behaviour is not configurable and would require code changes.
|
||||
2. Does not support batching, not optimised for performance.
|
||||
3. Since version 2.0, only supports snapshot replication. This means that each replica has to periodically dump the entire commit log to disk, and replicas that fall behind have to download the _entire_ snapshot.
|
||||
4. Limited tooling.
|
||||
|
||||
### B. Permazen
|
||||
|
||||
*Java persistence layer with a built-in Raft-based replicated key-value store.*
|
||||
|
||||
Conceptually similar to Atomix, but persists the state machine instead of the request log. Built around an abstract
|
||||
persistent key-value store: requests get cleaned up after replication and processing.
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. Lightweight, easy to integrate – embeds into Corda node.
|
||||
2. Uses Raft for replication – simpler and requires less code than other algorithms like Paxos.
|
||||
3. Built around a (optionally) persistent key-value store – supports large datasets.
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. Maintained by a single developer, used by a single company in production. Code quality and documentation looks to be of a high standard though.
|
||||
2. Not tested with large datasets.
|
||||
3. Designed for read-write-delete workloads. Replicas that fall behind too much will have to download the entire state snapshot (similar to Atomix).
|
||||
4. Does not support batching, not optimised for performance.
|
||||
5. Limited tooling.
|
||||
|
||||
### C. Apache Kafka
|
||||
|
||||
*Paxos-based distributed streaming platform.*
|
||||
|
||||
Atomix and Permazen implement both the replicated request log and the state machine, but Kafka only provides the log
|
||||
component. In theory that means more complexity having to implement request log processing and state machine management,
|
||||
but for our use case it's fairly straightforward: consume requests and insert input states into a database, marking the
|
||||
position of the last processed request. If the database is lost, we can just replay the log from the beginning. The main
|
||||
benefit of this approach is that it gives a more granular control and performance tuning opportunities in different
|
||||
parts of the system.
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. Stable – used in production for many years.
|
||||
2. Optimised for performance. Provides multiple configuration options for performance tuning.
|
||||
3. Designed for managing large datasets (performance not affected by dataset size).
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. Relatively complex to set up and operate, requires a Zookeeper cluster. Note that some hosting providers offer Kafka as-a-service (e.g. Confluent Cloud), so we could delegate the setup and management.
|
||||
2. Dictates a more complex notary service architecture.
|
||||
|
||||
### D. Custom Raft-based implementation
|
||||
|
||||
For even more granular control, we could replace Kafka with our own replicated log implementation. Kafka was started
|
||||
before the Raft consensus algorithm was introduced, and is using Zookeeper for coordination, which is based on Paxos for
|
||||
consensus. Paxos is known to be complex to understand and implement, and the main driver behind Raft was to create a
|
||||
much simpler algorithm with equivalent functionality. Hence, while reimplementing Zookeeper would be an onerous task,
|
||||
building a Raft-based alternative from scratch is somewhat feasible.
|
||||
|
||||
#### Advantages
|
||||
|
||||
Most of the implementations above have many extra features our use-case does not require. We can implement a relatively
|
||||
simple, clean, optimised solution that will most likely outperform others (Thomas Schroeter already built a prototype).
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
Large effort required to make it highly performant and reliable.
|
||||
|
||||
### E. Galera
|
||||
|
||||
*Synchronous replication plugin for MySQL, uses certification-based replication.*
|
||||
|
||||
All of the options discussed so far were based on abstract state machine replication. Another approach is simply using a
|
||||
more traditional RDBMS with active replication support. Note that most relational databases support some form of
|
||||
replication in general; however, very few provide strong consistency guarantees and ensure no data loss. Galera is a
|
||||
plugin for MySQL enabling synchronous multi-master replication.
|
||||
|
||||
Galera uses certification-based replication, which operates on write-sets: a database server executes the (database)
|
||||
transaction, and only performs replication if the transaction requires write operations. If it does, the transaction is
|
||||
broadcasted to all other servers (using atomic broadcast). On delivery, each server executes a deterministic
|
||||
certification phase, which decides if the transaction can commit or must abort. If a conflict occurs, the entire cluster
|
||||
rolls back the transaction. This type of technique is quite efficient in low-conflict situations and allows read scaling
|
||||
(the latter is mostly irrelevant for our use case).
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. Very little code required on Corda side to implement.
|
||||
2. Stable – used in production for many years.
|
||||
3. Large tooling and support ecosystem.
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. Certification-based replication is based on database transactions. A replication round is performed on every transaction commit, and batching is not supported. To improve performance, we need to combine the committing of multiple Corda transactions into a single database transaction, which gets complicated when conflicts occur.
|
||||
2. Only supports the InnoDB storage engine, which is based on B-trees. It works well for reads, but performs _very_ poorly on write-intensive workloads with "random" primary keys. In tests we were only able to achieve up to 60 TPS throughput. Moreover, the performance steadily drops with more data added.
|
||||
|
||||
### F. CockroachDB
|
||||
|
||||
*Distributed SQL database built on a transactional and strongly-consistent key-value store. Uses Raft-based replication.*
|
||||
|
||||
On paper, CockroachDB looks like a great candidate, but it relies on sharding: data is automatically split into
|
||||
partitions, and each partition is replicated using Raft. It performs great for single-shard database transactions, and
|
||||
also natively supports cross-shard atomic commits. However, the majority of Corda transactions are likely to have more
|
||||
than one input state, which means that most transaction commits will require cross-shard database transactions. In our
|
||||
tests we were only able to achieve up to 30 TPS in a 3 DC deployment.
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. Scales very well horizontally by sharding data.
|
||||
2. Easy to set up and operate.
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. Cross-shard atomic commits are slow. Since we expect most transactions to contain more than one input state, each transaction commit will very likely span multiple shards.
|
||||
2. Fairly new, limited use in production so far.
|
||||
|
||||
## Recommendation and justification
|
||||
|
||||
Proceed with Option C. A Kafka-based solution strikes the best balance between performance and the required effort to
|
||||
build a production-ready solution.
|
@ -1,236 +0,0 @@
|
||||
# High Performance CFT Notary Service
|
||||
|
||||
.. important:: This design document describes a prototyped but not shipped feature of Corda Enterprise. There are presently no plans to ship this notary.
|
||||
|
||||
## Overview
|
||||
|
||||
This proposal describes the architecture and an implementation for a high performance crash fault-tolerant notary
|
||||
service, operated by a single party.
|
||||
|
||||
## Background
|
||||
|
||||
For initial deployments, we expect to operate a single non-validating CFT notary service. The current Raft and Galera
|
||||
implementations cannot handle more than 100-200 TPS, which is likely to be a serious bottleneck in the near future. To
|
||||
support our clients and compete with other platforms we need a notary service that can handle TPS in the order of
|
||||
1,000s.
|
||||
|
||||
## Scope
|
||||
|
||||
Goals:
|
||||
|
||||
- A CFT non-validating notary service that can handle more than 1,000 TPS. Stretch goal: 10,000 TPS.
|
||||
- Disaster recovery strategy and tooling.
|
||||
- Deployment strategy.
|
||||
|
||||
Out-of-scope:
|
||||
|
||||
- Validating notary service.
|
||||
- Byzantine fault-tolerance.
|
||||
|
||||
## Requirements
|
||||
|
||||
The notary service should be able to:
|
||||
|
||||
- Notarise more than 1,000 transactions per second, with average 4 inputs per transaction.
|
||||
- Notarise a single transaction within 1s (from the service perspective).
|
||||
- Tolerate single node crash without affecting service availability.
|
||||
- Tolerate single data center failure.
|
||||
- Tolerate single disk failure/corruption.
|
||||
|
||||
|
||||
## Design Decisions
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
decisions/replicated-storage.md
|
||||
decisions/index-storage.md
|
||||
|
||||
## Target Solution
|
||||
|
||||
Having explored different solutions for implementing notaries we propose the following architecture for a CFT notary,
|
||||
consisting of two components:
|
||||
|
||||
1. A central replicated request log, which orders and stores all notarisation requests. Efficient append-only log
|
||||
storage can be used along with batched replication, making performance mainly dependent on network throughput.
|
||||
2. Worker nodes that service clients and maintain a consumed state index. The state index is a simple key-value store
|
||||
containing committed state references and pointers to the corresponding request positions in the log. If lost, it can be
|
||||
reconstructed by replaying and applying request log entries. There is a range of fast key-value stores that can be used
|
||||
for implementation.
|
||||
|
||||
![High level architecture](./images/high-level.svg)
|
||||
|
||||
At high level, client notarisation requests first get forwarded to a central replicated request log. The requests are
|
||||
then applied in order to the consumed state index in each worker to verify input state uniqueness. Each individual
|
||||
request outcome (success/conflict) is then sent back to the initiating client by the worker responsible for it. To
|
||||
emphasise, each worker will process _all_ notarisation requests, but only respond to the ones it received directly.
|
||||
|
||||
Messages (requests) in the request log are persisted and retained forever. The state index has a relatively low
|
||||
footprint and can in theory be kept entirely in memory. However, when a worker crashes, replaying the log to recover the
|
||||
index may take too long depending on the SLAs. Additionally, we expect applying the requests to the index to be much
|
||||
faster than consuming request batches even with persistence enabled.
|
||||
|
||||
_Technically_, the request log can also be kept entirely in memory, and the cluster will still be able to tolerate up to
|
||||
$f < n/2$ node failures. However, if for some reason the entire cluster is shut down (e.g. administrator error), all
|
||||
requests will be forever lost! Therefore, we should avoid it.
|
||||
|
||||
The request log does not need to be a separate cluster, and the worker nodes _could_ maintain the request log replicas
|
||||
locally. This would allow workers to consume ordered requests from the local copy rather than from a leader node across
|
||||
the network. It is hard to say, however, if this would have a significant performance impact without performing tests in
|
||||
the specific network environment (e.g. the bottleneck could be the replication step).
|
||||
|
||||
One advantage of hosting the request log in a separate cluster is that it makes it easier to independently scale the
|
||||
number of worker nodes. If, for example, transaction validation and resolution is required when receiving a
|
||||
notarisation request, we might find that a significant number of receivers is required to generate enough incoming
|
||||
traffic to the request log. On the flip side, increasing the number of workers adds additional consumers and load on the
|
||||
request log, so a balance needs to be found.
|
||||
|
||||
## Design Decisions
|
||||
|
||||
As the design decision documents below discuss, the most suitable platform for managing the request log was chosen to be
|
||||
[Apache Kafka](https://kafka.apache.org/), and [RocksDB](http://rocksdb.org/) as the storage engine for the committed
|
||||
state index.
|
||||
|
||||
| Heading | Recommendation |
|
||||
| ---------------------------------------- | -------------- |
|
||||
| [Replication framework](decisions/replicated-storage.md) | Option C |
|
||||
| [Index storage engine](decisions/index-storage.md) | Option A |
|
||||
|
||||
TECHNICAL DESIGN
|
||||
---
|
||||
|
||||
## Functional
|
||||
|
||||
A Kafka-based notary service does not deviate much from the high-level target solution architecture as described above.
|
||||
|
||||
![Kafka overview](./images/kafka-high-level.svg)
|
||||
|
||||
For our purposes we can view Kafka as a replicated durable queue we can push messages (_records_) to and consume from.
|
||||
Consuming a record just increments the consumer's position pointer, and does not delete it. Old records eventually
|
||||
expire and get cleaned up, but the expiry time can be set to "indefinite" so all data is retained (it's a supported
|
||||
use-case).
|
||||
|
||||
The main caveat is that Kafka does not allow consuming records from replicas directly – all communication has to be
|
||||
routed via a single leader node.
|
||||
|
||||
In Kafka, logical queues are called _topics_. Each topic can be split into multiple partitions. Topics are assigned a
|
||||
_replication factor_, which specifies how many replicas Kafka should create for each partition. Each replicated
|
||||
partition has an assigned leader node which producers and consumers can connect to. Partitioning topics and evenly
|
||||
distributing partition leadership allows Kafka to scale well horizontally.
|
||||
|
||||
In our use-case, however, we can only use a single-partition topic for notarisation requests, which limits the total
|
||||
capacity and throughput to a single machine. Partitioning requests would break global transaction ordering guarantees
|
||||
for consumers. There is a [proposal](#kafka-throughput-scaling-via-partitioning) from Rick Parker on how we _could_ use
|
||||
partitioning to potentially avoid traffic contention on the single leader node.
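A sketch of the corresponding Kafka configuration is shown below: a single-partition topic with a replication factor of three and indefinite retention, and a producer configured with `acks=all` and automatic retries as discussed later in this document. The broker addresses and topic name are placeholders.

```kotlin
import org.apache.kafka.clients.admin.AdminClient
import org.apache.kafka.clients.admin.NewTopic
import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.kafka.clients.producer.ProducerConfig
import org.apache.kafka.common.serialization.ByteArraySerializer
import java.util.Properties

// Sketch only: single partition preserves global request ordering; retention.ms=-1 keeps all records.
fun setUpRequestLog(bootstrap: String): KafkaProducer<ByteArray, ByteArray> {
    val admin = AdminClient.create(Properties().apply { put("bootstrap.servers", bootstrap) })
    val topic = NewTopic("notary-requests", 1, 3.toShort())      // 1 partition, replication factor 3
        .configs(mapOf("retention.ms" to "-1"))                  // retain records indefinitely
    admin.createTopics(listOf(topic))

    val producerProps = Properties().apply {
        put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap)
        put(ProducerConfig.ACKS_CONFIG, "all")                   // committed only once all replicas have the record
        put(ProducerConfig.RETRIES_CONFIG, Int.MAX_VALUE)        // retry sends after leader failover
        put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer::class.java)
        put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer::class.java)
    }
    // producer.send(...) is then used by the notary service to append notarisation requests.
    return KafkaProducer(producerProps)
}
```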
|
||||
|
||||
### Data model
|
||||
|
||||
Each record stored in the Kafka topic contains:
|
||||
1. Transaction Id
|
||||
2. List of input state references
|
||||
3. Requesting party X.500 name
|
||||
4. Notarisation request signature
|
||||
|
||||
The committed state index contains a map of:
|
||||
|
||||
`Input state reference: StateRef -> ( Transaction Id: SecureHash, Kafka record position: Long )`
|
||||
|
||||
It also stores a special key-value pair denoting the position of the last applied Kafka record.
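The record contents and index entries above map naturally onto Kotlin types, for example as below. `StateRef`, `SecureHash`, `CordaX500Name` and `DigitalSignature` are existing Corda core types; the two data classes themselves are illustrative only.

```kotlin
import net.corda.core.contracts.StateRef
import net.corda.core.crypto.DigitalSignature
import net.corda.core.crypto.SecureHash
import net.corda.core.identity.CordaX500Name

// One Kafka record in the request log (illustrative shape).
data class NotarisationRecord(
    val txId: SecureHash,
    val inputs: List<StateRef>,
    val requestingParty: CordaX500Name,
    val requestSignature: DigitalSignature
)

// Value stored in the committed state index against each input StateRef.
data class CommittedStateValue(
    val consumingTxId: SecureHash,
    val kafkaRecordPosition: Long
)
```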
|
||||
|
||||
## Non-Functional
|
||||
|
||||
### Fault tolerance, durability and consistency guarantees
|
||||
|
||||
Let's have a closer look at what exactly happens when a client sends a notarisation request to a notary worker node.
|
||||
|
||||
![Sequence diagram](./images/steps.svg)
|
||||
|
||||
A small note on terminology: the "notary service" we refer to in this section is the internal long-running service in the Corda node.
|
||||
|
||||
1. Client sends a notarisation request to the chosen Worker node. The load balancing is handled on the client by Artemis (round-robin).
|
||||
2. Worker acknowledges receipt and starts the service flow. The flow validates the request: verifies the transaction if needed, validates timestamp and notarisation request signature. The flow then forwards the request to the notary service, and suspends waiting for a response.
|
||||
3. The notary service wraps the request in a Kafka record and sends it to the global log via a Kafka producer. The sends are asynchronous from the service's perspective, and the producer is configured to buffer records and perform sends in batches.
|
||||
4. The Kafka leader node responsible for the topic partition replicates the received records to followers. The producer also specifies "ack" settings, which control when the records are considered to be committed. Only committed records are available for consumers. Using the "all" setting ensures that a record is persisted to all replicas before it becomes available for consumption (see the configuration sketch after this list). **This ensures that no worker will consume a record that may later be lost if the Kafka leader crashes**.
|
||||
7. The notary service maintains a separate thread that continuously attempts to pull new available batches of records from the Kafka leader node. It processes the received batches of notarisation requests – commits input states to a local persistent key-value store. Once a batch is processed, the last record position in the Kafka partition is also persisted locally. On restart, the consumption of records is started from the last recorded position.
|
||||
9. Kafka also tracks consumer positions in Zookeeper, and provides the ability for consumers to commit the last consumed position either synchronously, or asynchronously. Since we don't require exactly once delivery semantics, we opt for asynchronous position commits for performance reasons.
|
||||
10. Once notarisation requests are processed, the notary service matches them against ones received by this particular worker node, and resumes the flows to send responses back to the clients.
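
A hedged Kotlin sketch of the producer and consumer settings the steps above rely on: `acks=all` with asynchronous, batched sends, and a consumer that ignores Kafka-side offsets and rewinds to the position recorded in the local committed state index. The topic name, bootstrap addresses and the processing callback are placeholders, not the actual implementation.

```kotlin
import java.time.Duration
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.TopicPartition

const val TOPIC = "notarisation-requests"    // placeholder topic name

// Producer side (steps 3-4): batched asynchronous sends, acknowledged only once
// every in-sync replica has persisted the record.
fun buildProducer(bootstrap: String) = KafkaProducer<String, ByteArray>(Properties().apply {
    put("bootstrap.servers", bootstrap)
    put("acks", "all")                        // only fully replicated records count as committed
    put("retries", Int.MAX_VALUE.toString())  // re-send automatically after leader failover
    put("linger.ms", "5")                     // give batches a chance to form
    put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")
})

fun submit(producer: KafkaProducer<String, ByteArray>, txId: String, request: ByteArray) {
    producer.send(ProducerRecord(TOPIC, txId, request))   // async; the returned future completes on ack
}

// Consumer side (step 7): the single partition is assigned explicitly and rewound
// to just after the last position recorded in the local committed state index.
fun consumeFrom(bootstrap: String, lastAppliedPosition: Long, process: (ByteArray, Long) -> Unit) {
    val partition = TopicPartition(TOPIC, 0)
    val consumer = KafkaConsumer<String, ByteArray>(Properties().apply {
        put("bootstrap.servers", bootstrap)
        put("enable.auto.commit", "false")    // positions are tracked locally, not in Kafka
        put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
        put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
    })
    consumer.assign(listOf(partition))
    consumer.seek(partition, lastAppliedPosition + 1)
    while (true) {
        for (record in consumer.poll(Duration.ofSeconds(1))) {
            process(record.value(), record.offset())      // commit input states, then persist the offset
        }
    }
}
```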
|
||||
|
||||
Now let's consider the possible failure scenarios and how they are handled:
|
||||
* 2: Worker fails to acknowledge request. The Artemis broker on the client will redirect the message to a different worker node.
|
||||
* 3: Worker fails right after acknowledging the request, nothing is sent to the Kafka request log. Without some heartbeat mechanism the client can't know if the worker has failed, or the request is simply taking a long time to process. For this reason clients have special logic to retry notarisation requests with different workers, if a response is not received before a specified timeout.
|
||||
* 4: Kafka leader fails before replicating records. The producer does not receive an ack and the batch send fails. A new leader is elected and all producers and consumers switch to it. The producer retries sending with the new leader (it has to be configured to auto-retry). The lost records were not considered to be committed and therefore not made available for any consumers. Even if the producer did not re-send the batch to the new leader, client retries would fire and the requests would be reinserted into the "pipeline".
|
||||
* 7: The worker fails after sending out a batch of requests. The requests will be replicated and processed by other worker nodes. However, other workers will not send back replies to clients that the failed worker was responsible for.
|
||||
The client will retry with another worker. That worker will have already processed the same request, and committing the input states will result in a conflict. Since the conflict is caused by the same Corda transaction, it will ignore it and send back a successful response.
|
||||
* 8: The worker fails right after consuming a record batch. The consumer position is not recorded anywhere so it would re-consume the batch once it's back up again.
|
||||
* 9: The worker fails right after committing input states, but before recording last processed record position. On restart, it will re-consume the last batch of requests it had already processed. Committing input states is idempotent so re-processing the same request will succeed. Committing the consumer position to Kafka is strictly speaking not needed in our case, since we maintain it locally and manually "rewind" the partition to the last processed position on startup.
|
||||
* 10: The worker fails just before sending back a response. The client will retry with another worker.
|
||||
|
||||
The above discussion only considers crash failures which don't lead to data loss. What happens if the crash also results in disk corruption/failure?
|
||||
* If a Kafka leader node fails and loses all data, the machine can be re-provisioned, the Kafka node will reconnect to the cluster and automatically synchronise all data from one of the replicas. It can only become a leader again once it fully catches up.
|
||||
* If a worker node fails and loses all data, it can replay the Kafka partition from the beginning to reconstruct the committed state index. To speed this up, periodical backups can be taken so the index can be restored from a more recent snapshot.
|
||||
|
||||
One open question is flow handling on the worker node. If the notary service flow is checkpointed and the worker crashes while the flow is suspended waiting for a response (the completion of a future), then on restart the flow will re-issue the request to the notary service. The service will in turn forward it to the request log (Kafka) for processing. If the worker node was down long enough for the client to retry the request with a different worker, a single notarisation request will get processed 3 times.
|
||||
|
||||
If the notary service flow is not checkpointed, the request won't be re-issued after restart, resulting in it being processed only twice. However, in the latter case, the client will need to wait for the entire duration until the timeout expires, and if the worker is down for only a couple of seconds, the first approach would result in a much faster response time.
|
||||
|
||||
### Performance
|
||||
|
||||
Kafka provides various configuration parameters that allow control over producer and consumer record batch size, compression, buffer size, ack synchrony and other aspects. There are also guidelines on optimal filesystem setup.
|
||||
|
||||
RocksDB is highly tunable as well, providing different table format implementations, compression, bloom filters, compaction styles, and others.
|
||||
|
||||
Initial prototype tests showed up to *15,000* TPS for single-input state transactions, or *40,000* IPS (inputs/sec) for transactions with 1,000 inputs each. No performance drop was observed even after 1.2m transactions had been notarised. The tests were run on three 8-core, 28 GB RAM Azure VMs in separate data centers.
|
||||
|
||||
With the recent introduction of notarisation request signatures the figures are likely to be much lower, as the request payload size increases significantly. More tuning and testing is required.
|
||||
|
||||
### Scalability
|
||||
|
||||
It is not possible to scale beyond the peak throughput of a single machine. It is, however, possible to scale the number of worker nodes for transaction verification and signing.
|
||||
|
||||
## Operational
|
||||
|
||||
As a general note, Kafka and Zookeeper are widely used in the industry and there are plenty of deployment guidelines and management tools available.
|
||||
|
||||
### Deployment
|
||||
|
||||
Different options are available. A single Kafka broker, a Zookeeper replica and a Corda notary worker node can be hosted on the same machine for simplicity and cost-saving. At the other extreme, every Kafka/Zookeeper/Corda node can be hosted on its own machine. The latter arguably provides more room for error, at the expense of extra operational costs and effort.
|
||||
|
||||
### Management
|
||||
|
||||
Kafka provides command-line tools for managing brokers and topics. Third party UI-based tools are also available.
|
||||
|
||||
### Monitoring
|
||||
|
||||
Kafka exports a wide range of metrics via JMX. Datadog integration available.
|
||||
|
||||
### Disaster recovery
|
||||
|
||||
Failure modes:
|
||||
1. **Single machine or data center failure**. No backup/restore procedures are needed – nodes can catch up with the cluster on start. The RocksDB-backed committed state index keeps a pointer to the position of the last applied Kafka record, and it can resume where it left after restart.
|
||||
2. **Multi-data center disaster leading to data loss**. Out of scope.
|
||||
3. **User error**. It is possible for an admin to accidentally delete a topic using the tools Kafka provides. However, topic deletion has to be explicitly enabled in the broker configuration (it is disabled by default). Keeping that option disabled should be a sufficient safeguard.
|
||||
4. **Protocol-level corruption**. This covers scenarios where data stored in Kafka gets corrupted and the corruption is replicated to healthy replicas. In general, this is extremely unlikely to happen since Kafka records are immutable. The only such corruption in a practical sense could happen due to record deletion during compaction, which would occur if the broker is misconfigured to not retain records indefinitely. However, compaction is performed asynchronously and is local to the broker. In order for all data to be lost, _all_ brokers have to be misconfigured.
|
||||
|
||||
It is not possible to recover without any data loss in the event of 3 or 4. We can only _minimise_ data loss. There are two options:
|
||||
1. Run a backup Kafka cluster. Kafka provides a tool that forwards messages from one cluster to another (asynchronously).
|
||||
2. Take periodical physical backups of the Kafka topic.
|
||||
|
||||
In both scenarios the most recent requests will be lost. If data loss only occurs in Kafka, and the worker committed state indexes are intact, the notary could still function correctly and prevent double-spends of the transactions that were lost. However, in the non-validating notary scenario, the notarisation request signature and caller identity will be lost, and it will be impossible to trace the submitter of a fraudulent transaction. We could argue that the likelihood of request loss _and_ malicious transactions occurring at the same time is very low.
|
||||
|
||||
## Security
|
||||
|
||||
* **Communication**. Kafka supports SSL for both client-to-server and server-to-server communication. However, Zookeeper only supports SSL for client-to-server communication, which means that running Zookeeper across data centers will require setting up a VPN. For simplicity, we can reuse the same VPN for the Kafka cluster as well. The notary worker nodes can talk to Kafka either via SSL or the VPN.
|
||||
|
||||
* **Data privacy**. No transaction contents or PII is revealed or stored.
|
||||
|
||||
APPENDICES
|
||||
---
|
||||
|
||||
## Kafka throughput scaling via partitioning
|
||||
|
||||
We have to use a single partition for global transaction ordering guarantees, but we could reduce the load on it by using it _just_ for ordering:
|
||||
|
||||
* Have a single-partition `transactions` topic where all worker nodes send only the transaction id.
|
||||
* Have a separate _partitioned_ `payload` topic where workers send the entire notarisation request content: transaction id, input states, request signature (a single request can be around 1KB in size).
|
||||
|
||||
Workers would need to consume from the `transactions` partition to obtain the ordering, and from all `payload` partitions for the actual notarisation requests. A request will not be processed until its global order is known. Since Kafka tries to distribute leaders for different partitions evenly across the cluster, we would avoid a single Kafka broker handling all of the traffic. Load-wise, nothing changes from the worker node's perspective – it still has to process all requests – but a larger number of worker nodes could be supported.
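
A rough Kotlin sketch of the two-topic idea, reusing the Kafka producer API; the topic names and the empty ordering payload are illustrative only.

```kotlin
import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.kafka.clients.producer.ProducerRecord

// Global ordering comes from the single-partition "transactions" topic, which only
// carries transaction ids; the bulky request payload goes to a partitioned "payload"
// topic keyed by transaction id.
fun submitPartitioned(
    producer: KafkaProducer<String, ByteArray>,
    txId: String,
    serialisedRequest: ByteArray
) {
    producer.send(ProducerRecord("payload", txId, serialisedRequest))   // many partitions, keyed by txId
    producer.send(ProducerRecord("transactions", txId, ByteArray(0)))   // single partition, ordering only
}
```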
|
Before Width: | Height: | Size: 19 KiB |
Before Width: | Height: | Size: 16 KiB |
Before Width: | Height: | Size: 37 KiB |
Before Width: | Height: | Size: 40 KiB |
@ -1,144 +0,0 @@
|
||||
# StatePointer
|
||||
|
||||
## Background
|
||||
|
||||
Occasionally there is a need to create a link from one `ContractState` to another. This has the effect of creating a uni-directional "one-to-one" relationship between a pair of `ContractState`s.
|
||||
|
||||
There are two ways to do this.
|
||||
|
||||
### By `StateRef`
|
||||
|
||||
Link one `ContractState` to another by including a `StateRef` or a `StateAndRef<T>` as a property inside another `ContractState`:
|
||||
|
||||
```kotlin
|
||||
// StateRef.
|
||||
data class FooState(val ref: StateRef) : ContractState
|
||||
// StateAndRef.
|
||||
data class FooState(val ref: StateAndRef<BarState>) : ContractState
|
||||
```
|
||||
|
||||
Linking to a `StateRef` or `StateAndRef<T>` is only recommended if a specific version of a state is required in perpetuity. Clearly, adding a `StateAndRef` embeds the data directly. This type of pointer is compatible with any `ContractState` type.
|
||||
|
||||
But what if the linked state is updated? The `StateRef` will be pointing to an older version of the data and this could be a problem for the `ContractState` which contains the pointer.
|
||||
|
||||
### By `linearId`
|
||||
|
||||
To create a link to the most up-to-date version of a state, instead of linking to a specific `StateRef`, a `linearId` which references a `LinearState` can be used. This is because all `LinearState`s contain a `linearId` which refers to a particular lineage of `LinearState`. The vault can be used to look-up the most recent state with the specified `linearId`.
|
||||
|
||||
```kotlin
|
||||
// Link by LinearId.
|
||||
data class FooState(val ref: UniqueIdentifier) : ContractState
|
||||
```
|
||||
|
||||
This type of pointer only works with `LinearState`s.
|
||||
|
||||
### Resolving pointers
|
||||
|
||||
The trade-off with pointing to data in another state is that the data being pointed to cannot be immediately seen. To see the data contained within the pointed-to state, it must be "resolved".
|
||||
|
||||
## Design
|
||||
|
||||
Introduce a `StatePointer` interface and two implementations of it: the `StaticPointer` and the `LinearPointer`. The `StatePointer` is defined as follows:
|
||||
|
||||
```kotlin
|
||||
interface StatePointer {
|
||||
val pointer: Any
|
||||
fun resolve(services: ServiceHub): StateAndRef<ContractState>
|
||||
}
|
||||
```
|
||||
|
||||
The `resolve` method facilitates the resolution of the `pointer` to a `StateAndRef`.
|
||||
|
||||
The `StaticPointer` type requires developers to provide a `StateRef` which points to a specific state.
|
||||
|
||||
```kotlin
|
||||
class StaticPointer(override val pointer: StateRef) : StatePointer {
|
||||
override fun resolve(services: ServiceHub): StateAndRef<ContractState> {
|
||||
val transactionState = services.loadState(pointer)
|
||||
return StateAndRef(transactionState, pointer)
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
The `LinearPointer` type contains the `linearId` of the `LinearState` being pointed to and a `resolve` method. Resolving a `LinearPointer` returns a `StateAndRef<T>` containing the latest version of the `LinearState` that the node calling `resolve` is aware of.
|
||||
|
||||
```kotlin
|
||||
class LinearPointer(override val pointer: UniqueIdentifier) : StatePointer {
|
||||
override fun resolve(services: ServiceHub): StateAndRef<LinearState> {
|
||||
val query = QueryCriteria.LinearStateQueryCriteria(linearId = listOf(pointer))
|
||||
val result = services.vaultService.queryBy<LinearState>(query).states
|
||||
check(result.isNotEmpty()) { "LinearPointer $pointer cannot be resolved." }
|
||||
return result.single()
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Bi-directional link
|
||||
|
||||
Symmetrical relationships can be modelled by embedding a `LinearPointer` in the pointed-to `LinearState` which points in the "opposite" direction. **Note:** this can only work if both states are `LinearState`s.
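
An illustrative Kotlin sketch of the symmetric pattern, reusing the `LinearPointer` defined above; the `Employer`/`Employee` states are hypothetical.

```kotlin
import net.corda.core.contracts.LinearState
import net.corda.core.contracts.UniqueIdentifier
import net.corda.core.identity.AbstractParty

// Each side embeds a LinearPointer to the other side's linearId, producing a
// symmetric relationship between the two LinearState lineages.
data class Employer(
    override val linearId: UniqueIdentifier,
    val currentEmployee: LinearPointer,       // points to an Employee lineage
    override val participants: List<AbstractParty>
) : LinearState

data class Employee(
    override val linearId: UniqueIdentifier,
    val employer: LinearPointer,              // points back to the Employer lineage
    override val participants: List<AbstractParty>
) : LinearState
```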
|
||||
|
||||
## Use-cases
|
||||
|
||||
It is important to note that this design only standardises a pattern which is currently possible with the platform. In other words, this design does not enable anything new.
|
||||
|
||||
### Tokens
|
||||
|
||||
Uncoupling token type definitions from the notion of ownership. Using the `LinearPointer`, `Token` states can include an `Amount` of some pointed-to type. The pointed-to type can evolve independently from the `Token` state which should just be concerned with the question of ownership.
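
A hypothetical Kotlin sketch of this use-case, where the pointed-to type is referenced via the `LinearPointer` defined above; this is illustrative and not a specification of any actual token design.

```kotlin
import net.corda.core.contracts.Amount
import net.corda.core.contracts.ContractState
import net.corda.core.identity.AbstractParty

// The token type is referenced by pointer, so its definition (a LinearState
// maintained by the issuer) can evolve without reissuing every Token state.
data class Token(
    val amount: Amount<LinearPointer>,        // e.g. 100 units of the pointed-to type
    val owner: AbstractParty
) : ContractState {
    override val participants: List<AbstractParty> get() = listOf(owner)
}
```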
|
||||
|
||||
## Issues and resolutions
|
||||
|
||||
Some issues to be aware of and their resolutions:
|
||||
|
||||
| Problem | Resolution |
|
||||
| :----------------------------------------------------------- | ------------------------------------------------------------ |
|
||||
| If the node calling `resolve` has not seen the specified `StateRef`, then `resolve` will return `null`. Here, the node calling `resolve` might be missing some crucial data. | Use data distribution groups. Assuming the creator of the `ContractState` publishes it to a data distribution group, subscribing to that group ensures that the node calling resolve will eventually have the required data. |
|
||||
| The node calling `resolve` has seen and stored transactions containing a `LinearState` with the specified `linearId`. However, there is no guarantee the `StateAndRef<T>` returned by `resolve` is the most recent version of the `LinearState`. | Embed the pointed-to `LinearState` in transactions containing the `LinearPointer` as a reference state. The reference states feature will ensure the pointed-to state is the latest version. |
|
||||
| The creator of the pointed-to `ContractState` exits the state from the ledger. If the pointed-to state is included as a reference state then notaries will reject transactions containing it. | Contract code can be used to make a state un-exitable. |
|
||||
|
||||
All of the noted resolutions rely on additional platform features:
|
||||
|
||||
* Reference states which will be available in V4
|
||||
* Data distribution groups which are not currently available. However, there is an early prototype
|
||||
* Additional state interface
|
||||
|
||||
### Additional concerns and responses
|
||||
|
||||
#### Embedding reference states in transactions
|
||||
|
||||
**Concern:** Embedding reference states for pointed-to states in transactions could cause transactions to increase by some unbounded size.
|
||||
|
||||
**Response:** The introduction of this feature doesn't create a new platform capability. It merely formalises a pattern which is currently possible. Furthermore, there is a possibility that _any_ type of state can cause a transaction to increase by some unbounded size. It is also worth remembering that the maximum transaction size is 10MB.
|
||||
|
||||
#### `StatePointer`s are not human readable
|
||||
|
||||
**Concern:** Users won't know what sits behind the pointer.
|
||||
|
||||
**Response:** When the state containing the pointer is used in a flow, the pointer can be easily resolved. When the state needs to be displayed on a UI, the pointer can be resolved via vault query.
|
||||
|
||||
#### This feature adds complexity to the platform
|
||||
|
||||
**Concern:** This all seems quite complicated.
|
||||
|
||||
**Response:** It's possible anyway. Use of this feature is optional.
|
||||
|
||||
#### Coin selection will be slow
|
||||
|
||||
**Concern:** We'll need to join on other tables to perform coin selection, making it slower. This is the case when a `StatePointer` is used as the token type in a `FungibleState` or `FungibleAsset`.
|
||||
|
||||
**Response:** This is probably not true in most cases. Take the existing coinselection code from `CashSelectionH2Impl.kt`:
|
||||
|
||||
```sql
|
||||
SELECT vs.transaction_id, vs.output_index, ccs.pennies, SET(@t, ifnull(@t,0)+ccs.pennies) total_pennies, vs.lock_id
|
||||
FROM vault_states AS vs, contract_cash_states AS ccs
|
||||
WHERE vs.transaction_id = ccs.transaction_id AND vs.output_index = ccs.output_index
|
||||
AND vs.state_status = 0
|
||||
AND vs.relevancy_status = 0
|
||||
AND ccs.ccy_code = ? and @t < ?
|
||||
AND (vs.lock_id = ? OR vs.lock_id is null)
|
||||
```
|
||||
|
||||
Notice that the only required property which is not accessible from the `StatePointer` is the `ccy_code`. This is not necessarily a problem, though, as the `pointer` value inside the `StatePointer` can be used as a proxy for the `ccy_code` or "token type".
|
||||
|
||||
|
||||
|
||||
|
@ -1,146 +0,0 @@
|
||||
# Validation of Maximus Scope and Future Work Proposal
|
||||
|
||||
## Introduction
|
||||
|
||||
The intent of this document is to ensure that the Tech Leads and Product Management are comfortable with the proposed
|
||||
direction of HA team future work. The term Maximus has been used widely across R3 and we wish to ensure that the scope
|
||||
is clearly understood and in alignment with wider delivery expectations.
|
||||
|
||||
I hope to explain the successes and failures of our rapid POC work, so it is clearer what guides our decision making in
|
||||
this area.
|
||||
|
||||
Also, it will hopefully inform other teams of changes that may cross into their area.
|
||||
|
||||
## What is Maximus?
|
||||
|
||||
Mike’s original proposal for Maximus, made at CordaCon Tokyo 2018, was to start and stop node
|
||||
VMs using some sort of automation to reduce runtime cost. In Mike’s words this would allow ‘huge numbers of
|
||||
identities’, perhaps ‘thousands’.
|
||||
|
||||
The HA team and Andrey Brozhko have tried to stay close to this original definition that Maximus is for managing
|
||||
hundreds to thousands of Enterprise Nodes and that the goal of the project is to better manage costs, especially in cloud
|
||||
deployments and with low overall flow rates. However, this leads to the following assumptions:
|
||||
|
||||
1. The overall rate of flows is low and users will accept some latency. The additional sharing of identities on a
|
||||
reduced physical footprint will inevitably reduce throughput compared to dedicated nodes, but this should not be a problem.
|
||||
|
||||
2. At least in the earlier phases it is acceptable to statically manage identity keys/certificates for each individual
|
||||
identity. This will be scripted but will incur some effort/procedures/checking on the doorman side.
|
||||
|
||||
3. Every identity has an associated ‘DB schema’, which might be on a shared database server, but the separation is
|
||||
managed at that level. This database is a fixed runtime cost per identity and will not be shared in the earlier phases
|
||||
of Maximus. It might be optionally shareable in future, but this is not a hard requirement for Corda 5 as it needs
|
||||
significant help from core to change the DB schemas. Also, our understanding is that the isolation is a positive feature
|
||||
in some deployments.
|
||||
|
||||
4. Maximus may share infrastructure and possibly JVM memory between identities without breaking some customer
|
||||
requirement for isolation. In other words we are virtualizing the ‘node’, but CorDapps and peer nodes will be unaware of
|
||||
any changes.
|
||||
|
||||
## What Maximus is not
|
||||
|
||||
1. Maximus is not designed to handle millions of identities. That is firmly Marco Polo and possibly handled completely
|
||||
differently.
|
||||
|
||||
2. Maximus should not be priced so as to undercut our own high-performance Enterprise nodes, or to allow customers to run
|
||||
arbitrary numbers of nodes for free.
|
||||
|
||||
3. Maximus is not a ‘wallet’ based solution. The nodes in Maximus are fully equivalent to the current Enterprise
|
||||
offering and have first class identities. There is also no remoting of the signing operations.
|
||||
|
||||
## The POC technologies we have tried
|
||||
|
||||
The HA team has looked at several elements of the solution. Some approaches look promising, some do not.
|
||||
|
||||
1. We have already started the work to share a common P2P Artemis between multiple nodes and common bridge/float. This
|
||||
is the ‘SNI header’ work which has been through DRB recently. This should be functionally complete soon and available in Corda
|
||||
4.0. This work will reduce platform cost and simplify deployment of multiple nodes. For Maximus the main effect is that it
|
||||
should make the configuration much more consistent between nodes and it means that where a node runs is immaterial as
|
||||
the shared broker distributes messages and the Corda firewall handles the public communication.
|
||||
|
||||
2. I looked at flattening the flow state machine, so that we could map Corda operations onto a combination of state and
|
||||
messages in the style of a Map-Reduce pattern. Unfortunately, the work involved is extreme and not compatible with the
|
||||
Corda API. Therefore a pure ‘flow worker’ approach does not look viable any time soon and in general full hot-hot is
|
||||
still a way off.
|
||||
|
||||
3. Chris looked at reducing the essential service set in the node to those needed to support the public flow API and the
|
||||
StateMachine. Then we attached a simple start flow messaging interface. This simple ‘FlowRunner’ class allowed
|
||||
exploration of several options in a gaffer taped state.
|
||||
|
||||
1. We created a simple messaging interface between an RPC runner and a Flow Runner and showed that we can run
|
||||
standard flows.
|
||||
|
||||
2. We were able to POC combining two identities running side-by-side in a Flow Runner, which is in fact quite similar
|
||||
to many of our integration tests. We must address static variable leakage, but this should be feasible.
|
||||
|
||||
3. We were able to create an RPC worker that could handle several identities at once and start flows on the
|
||||
same/different flow runner harnesses.
|
||||
|
||||
4. We then pushed forward looking into flow sharding. Here we made some progress, but the task started to get more and more
|
||||
complicated. It also highlighted that we don’t have suitable headers on our messages and that the message header
|
||||
whitelist will make this difficult to change whilst maintaining wire compatibility. The conclusion from this is that
|
||||
hot-hot flow sharding will have to wait.
|
||||
|
||||
4. We have been looking at resource/cost management technologies. The almost immediate conclusion is that whilst cloud
|
||||
providers do have automated VM/container-as-a-service offerings, they are not standardized. Instead, the only standardized approach
|
||||
is Kubernetes + Docker, where costs scale dynamically according to active use levels.
|
||||
|
||||
5. Looking at resource management in Kubernetes, we can dynamically scale relatively homogeneous pods, but the metrics
|
||||
approach cannot easily cope with identity injection. Instead we can scale the number of running pods, but they will have
|
||||
to self-organize the work balancing amongst themselves.
|
||||
|
||||
## Maximus Work Proposal
|
||||
|
||||
#### Current State
|
||||
|
||||
![Current Enterprise State](./images/current_state.png)
|
||||
|
||||
The current enterprise node solution in GA 3.1 is as above. This has dynamic HA failover available for the bridge/float
|
||||
using ZooKeeper as leader elector, but the node has to be hot-cold. There is some sharing support for the ZooKeeper
|
||||
cluster, but otherwise all this infrastructure has to be replicated per identity. In addition, all elements of this have
|
||||
to have at least one resident instance to ensure that messages are captured and RPC clients have an endpoint to talk to.
|
||||
|
||||
#### Corda 4.0 Agreed Target with SNI Shared Corda Firewalls
|
||||
|
||||
![Corda 4.0 Enterprise State](./images/shared_bridge_float.png)
|
||||
|
||||
Here, by sharing the P2P Artemis broker externally and reworking the messaging protocol, it should be possible to reuse the Corda
|
||||
firewall for multiple nodes. This means that the externally advertised address will be stable for the whole cluster
|
||||
independent of the deployed identities. Also, the durable messaging is outside nodes, which means that we can
|
||||
theoretically schedule the nodes to run only a few times a day if they only act in response to external peer
|
||||
messages. Mostly this is a prelude to greater sharing in the future Maximus state.
|
||||
|
||||
#### Intermediate State Explored during POC
|
||||
|
||||
![Maximus POC](./images/maximus_poc.png)
|
||||
|
||||
During the POC we explored the model above, although none of the components were completed to a production standard. The
|
||||
key feature here is that the RPC side has been split out of the node and has API support for multiple identities built
|
||||
in. The flow and P2P elements of the node have been split out too, which means that the ‘FlowWorker’ start-up code can
|
||||
be simpler than the current AbstractNode as it doesn’t have to support the same testing framework. The actual service
|
||||
implementations are unchanged in this.
|
||||
|
||||
The principal communication between the RpcWorker and the FlowWorker is about starting flows, and completed work is broadcast as
|
||||
events. A message protocol will be defined to allow re-attachment and status querying if the RPC client is restarted.
|
||||
The vault RPC API will continue to go to the database directly in the RpcWorker and not involve the FlowWorker. The scheduler
|
||||
service will live in the RPC service as potentially the FlowWorkers will not yet be running when the due time occurs.
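
A purely hypothetical Kotlin sketch of the kind of message protocol this implies between the RpcWorker and a FlowWorker; none of these types exist in the codebase and the field choices are illustrative only.

```kotlin
import java.util.UUID
import net.corda.core.identity.CordaX500Name

// Requests flowing from the RpcWorker to a FlowWorker.
sealed class FlowWorkerRequest {
    abstract val identity: CordaX500Name                 // which hosted identity the work is for

    data class StartFlow(
        override val identity: CordaX500Name,
        val requestId: UUID,                             // lets a restarted RPC client re-attach
        val flowClassName: String,
        val serialisedArgs: ByteArray
    ) : FlowWorkerRequest()

    data class QueryStatus(
        override val identity: CordaX500Name,
        val requestId: UUID
    ) : FlowWorkerRequest()
}

// Events broadcast by FlowWorkers as work progresses and completes.
sealed class FlowWorkerEvent {
    data class FlowStarted(val requestId: UUID, val flowId: UUID) : FlowWorkerEvent()
    data class FlowCompleted(val requestId: UUID, val serialisedResult: ByteArray) : FlowWorkerEvent()
    data class FlowErrored(val requestId: UUID, val message: String) : FlowWorkerEvent()
}
```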
|
||||
|
||||
#### Proposed Maximus Phase 1 State
|
||||
|
||||
![Maximus Phase 1](./images/maximus_phase1.png)
|
||||
|
||||
The productionised version of the above POC will introduce ‘Max Nodes’ that can load FlowWorkers on demand. We still
|
||||
require that only one runs at once, but for this we will use ZooKeeper to ensure that FlowWorkers with capacity compete to
|
||||
process the work and only one wins. Based on trials we can safely run a couple of identities at once inside the same Max
|
||||
Node, assuming load is manageable. Idle identities will be dropped trivially, since the Hibernate and Artemis connections
|
||||
and thread pools will be owned by the Max Node, not the flow workers. At this stage there is no dynamic management of the
|
||||
physical resources, but some sort of scheduler could control how many Max Nodes are running at once.
|
||||
|
||||
#### Final State Maximus with Dynamic Resource Management
|
||||
|
||||
![Maximus Final](./images/maximus_final.png)
|
||||
|
||||
The final evolution is to add dynamic cost control to the system. As the Max Nodes are homogeneous the RpcWorker can
|
||||
monitor the load and signal metrics available to Kubernetes. This means that Max Nodes can be added and removed as
|
||||
required, potentially reducing the cost to zero. Ideally, separate work would begin in parallel to combine database data into a
|
||||
single schema, but that is possibly not required.
|
Before Width: | Height: | Size: 48 KiB |
Before Width: | Height: | Size: 99 KiB |
Before Width: | Height: | Size: 86 KiB |
Before Width: | Height: | Size: 72 KiB |
Before Width: | Height: | Size: 54 KiB |
Before Width: | Height: | Size: 63 KiB |
@ -1,533 +0,0 @@
|
||||
# Monitoring and Logging Design
|
||||
|
||||
## Overview
|
||||
|
||||
The successful deployment and operation of Corda (and associated CorDapps) in a production environment requires a
|
||||
supporting monitoring and management capability to ensure that both a Corda node (and its supporting middleware
|
||||
infrastructure) and deployed CorDapps execute in a functionally correct and consistent manner. A pro-active monitoring
|
||||
solution will enable the immediate alerting of unexpected behaviours and associated management tooling should enable
|
||||
swift corrective action.
|
||||
|
||||
This design defines the monitoring metrics and logging outputs, and associated implementation approach, required to
|
||||
enable a proactive enterprise management and monitoring solution of Corda nodes and their associated CorDapps. This also
|
||||
includes a set of "liveliness" checks to verify and validate correct functioning of a Corda node (and associated
|
||||
CorDapp).
|
||||
|
||||
![MonitoringLoggingOverview](./MonitoringLoggingOverview.png)
|
||||
|
||||
In the above diagram, the left hand side dotted box represents the components within scope for this design. It is
|
||||
anticipated that 3rd party enterprise-wide system management solutions will closely follow the architectural component
|
||||
breakdown in the right hand side box, and thus seamlessly integrate with the proposed Corda event generation and logging
|
||||
design. The interface between the two is de-coupled and based on textual log file parsing and adoption of industry
|
||||
standard JMX MBean events.
|
||||
|
||||
## Background
|
||||
|
||||
Corda currently exposes several forms of monitorable content:
|
||||
|
||||
* Application log files using the [SLF4J](https://www.slf4j.org/) (Simple Logging Facade for Java) which provides an
|
||||
abstraction over various concrete logging frameworks (several of which are used within other Corda dependent 3rd party
|
||||
libraries). Corda itself uses the [Apache Log4j 2](https://logging.apache.org/log4j/2.x/) framework for logging output
|
||||
to a set of configured loggers (to include a rolling file appender and the console). Currently the same set of rolling
|
||||
log files are used by both the node and CorDapp(s) deployed to the node. The log file policy specifies a 60 day
|
||||
rolling period (but preserving the most recent 10Gb) with a maximum of 10 log files per day.
|
||||
|
||||
* Industry standard exposed JMX-based metrics, both standard JVM and custom application metrics are exposed directly
|
||||
using the [Dropwizard.io](http://metrics.dropwizard.io/3.2.3/) *JmxReporter* facility. In addition Corda also uses the
|
||||
[Jolokia](https://jolokia.org/) framework to make these accessible over an HTTP endpoint. Typically, these metrics are
|
||||
also collated by 3rd party tools to provide pro-active monitoring, visualisation and re-active management.
|
||||
|
||||
A full list of currently exposed metrics can be found in Appendix A.
|
||||
|
||||
The Corda flow framework also has *placeholder* support for recording additional Audit data in application flows using a
|
||||
simple *AuditService*. Audit event types are currently loosely defined and data is stored in string form (as a
|
||||
description and contextual map of name-value pairs) together with a timestamp and principal name. This service does not
|
||||
currently have an implementation that writes the audit event data to a persistent store.
|
||||
|
||||
The `ProgressTracker` component is used to report the progress of a flow throughout its business lifecycle, and is
|
||||
typically configured to report the start of a specific business workflow step (often before and after message send and
|
||||
receipt where other participants form part of a multi-staged business workflow). The progress tracking framework was
|
||||
designed to become a vital part of how exceptions, errors, and other faults are surfaced to human operators for
|
||||
investigation and resolution. It provides a means of exporting progress as a hierarchy of steps in a way that’s both
|
||||
human readable and machine readable.
|
||||
|
||||
In addition, in-house Corda networks at R3 use the following tools:
|
||||
|
||||
* Standard [DataDog](https://docs.datadoghq.com/guides/overview/) probes are currently used to provide e-mail based
|
||||
alerting for running Corda nodes. [Telegraf](https://github.com/influxdata/telegraf) is used in conjunction with a
|
||||
[Jolokia agent](https://jolokia.org/agent.html) as a collector to parse emitted metric data and push these to DataDog.
|
||||
* Investigation is underway to evaluate [ELK](https://logz.io/learn/complete-guide-elk-stack/) as a mechanism for parsing,
|
||||
indexing, storing, searching, and visualising log file data.
|
||||
|
||||
## Scope
|
||||
|
||||
### Goals
|
||||
|
||||
- Add new metrics at the level of a Corda node, individual CorDapps, and other supporting Corda components (float, bridge manager, doorman)
|
||||
- Support liveness checking of the node, deployed flows and services
|
||||
- Review logging groups and severities in the node.
|
||||
- Separate application logging from node logging.
|
||||
- Implement the audit framework that is currently only a stubbed out API
|
||||
- Ensure that Corda can be used with third party systems for monitoring, log collection and audit
|
||||
|
||||
### Out of scope
|
||||
|
||||
- Recommendation of a specific set of monitoring tools.
|
||||
- Monitoring of network infrastructure like the network map service.
|
||||
- Monitoring of liveness of peers.
|
||||
|
||||
## Requirements
|
||||
|
||||
Expanding on the first goal identified above, the following requirements have been identified:
|
||||
|
||||
1. Node health
|
||||
- Message queues: latency, number of queues/messages, backlog, bridging establishment and connectivity (success / failure)
|
||||
- Database: connections (retries, errors), latency, query time
|
||||
- RPC metrics, latency, authentication/authorisation checking (eg. number of successful / failed attempts).
|
||||
- Signing performance (eg. signatures per sec).
|
||||
- Deployed CorDapps
|
||||
- Garbage collector and JVM statistics
|
||||
|
||||
2. CorDapp health
|
||||
- Number of flows broken down by type (including flow status and aging statistics: oldest, latest)
|
||||
- Flow durations
|
||||
- JDBC connections, latency/histograms
|
||||
|
||||
3. Logging
|
||||
- RPC logging
|
||||
- Shell logging (user/command pairs)
|
||||
- Message queue
|
||||
- Traces
|
||||
- Exception logging (including full stack traces)
|
||||
- Crash dumps (full stack traces)
|
||||
- Hardware Security Module (HSM) events.
|
||||
- per CorDapp logging
|
||||
|
||||
4. Auditing
|
||||
|
||||
- Security: login authentication and authorisation
|
||||
- Business Event flow progress tracking
|
||||
- System events (particularly failures)
|
||||
|
||||
Audit data should be stored in a secure storage medium.
|
||||
Audit data should include sufficient contextual information to enable optimal off-line analysis.
|
||||
Auditing should apply to all Corda node processes (running CorDapps, notaries, oracles).
|
||||
|
||||
### Use Cases
|
||||
|
||||
It is envisaged that operational management and support teams will use the metrics and information collated from this
|
||||
design, either directly or through an integrated enterprise-wide systems management platform, to perform the following:
|
||||
|
||||
- Validate liveness and correctness of Corda nodes and deployed CorDapps, and the physical machine or VM they are hosted on.
|
||||
|
||||
* Use logging to troubleshoot operational failures (in conjunction with other supporting failure information: eg. GC logs, stack traces)
|
||||
* Use reported metrics to fine-tune and tweak operational systems parameters (including dynamic setting of logging
|
||||
modules and severity levels to enable detailed logging).
|
||||
|
||||
## Design Decisions
|
||||
|
||||
The following design decisions are to be confirmed:
|
||||
|
||||
1. JMX for metric eventing and SLF4J for logging
|
||||
Both of the above are widely adopted mechanisms that enable pluggability and seamless interoperability with other 3rd party
|
||||
enterprise-wide system management solutions.
|
||||
2. Continue or discontinue usage of Jolokia? (TBC - most likely yes, subject to read-only security lock-down)
|
||||
3. Separation of Corda Node and CorDapp log outputs (TBC)
|
||||
|
||||
## Proposed Solution
|
||||
|
||||
There are a number of activities and parts to the solution proposal:
|
||||
|
||||
1. Extend JMX metric reporting through the Corda Monitoring Service (and the associated Jolokia conversion to REST/JSON)
|
||||
coverage (see implementation details) to include all Corda services (vault, key management, transaction storage,
|
||||
network map, attachment storage, identity, cordapp provision) & sub-system components (state machine)
|
||||
|
||||
2. Review and extend Corda log4j2 coverage (see implementation details) to ensure
|
||||
|
||||
- consistent use of severities according to situation
|
||||
- consistent coverage across all modules and libraries
|
||||
- consistent output format with all relevant contextual information (node identity, user/execution identity, flow
|
||||
session identity, version information)
|
||||
- separation of Corda Node and CorDapp log outputs (TBC)
|
||||
For reasons of consistent interleaving, it may be desirable to continue using combined log output.
|
||||
|
||||
Publication of a *code style guide* to define when to use different severity levels.
|
||||
|
||||
3. Implement a CorDapp to perform sanity checking of the flow framework, fundamental Corda services (vault, identity), and
|
||||
dependent middleware infrastructure (message broker, database).
|
||||
|
||||
4. Revisit and enhance as necessary the [Audit service API]( https://github.com/corda/corda/pull/620 ), and provide a
|
||||
persistent backed implementation, to include:
|
||||
|
||||
- specification of Business Event Categories (eg. User authentication and authorisation, Flow-based triggering, Corda
|
||||
Service invocations, Oracle invocations, Flow-based send/receive calls, RPC invocations)
|
||||
- auto-enabled with Progress Tracker as Business Event generator
|
||||
- RDBMS backed persistent store (independent of Corda database), with adequate security controls (authenticated access
|
||||
and read-only permissioning). Captured information should be consistent with standard logging, and it may be desirable
|
||||
to define auditable loggers within log4j2 to automatically redirect certain types of log events to the audit service.
|
||||
|
||||
5. Ensure 3rd party middleware drivers (JDBC for database, MQ for messaging) and the JVM are correctly configured to export
|
||||
JMX metrics. Ensure the [JVM Hotspot VM command-line parameters](https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/clopts001.html)
|
||||
are tuned correctly to enable detailed troubleshooting upon failure. Many of these metrics are already automatically
|
||||
exposed to 3rd party profiling tools such as Yourkit.
|
||||
|
||||
Apache Artemis has a comprehensive [management API](https://activemq.apache.org/artemis/docs/latest/management.html)
|
||||
that allows a user to modify a server configuration, create new resources (e.g. addresses and queues), inspect these
|
||||
resources (e.g. how many messages are currently held in a queue) and interact with it (e.g. to remove messages from a
|
||||
queue), and exposes key metrics using JMX (with role-based authentication via Artemis's JAAS plug-in support to
|
||||
ensure Artemis cannot be controlled via JMX).
|
||||
|
||||
### Restrictions
|
||||
|
||||
As of Corda M11, Java serialisation in the Corda node has been restricted, meaning MBean access via the JMX port will no longer work.
|
||||
|
||||
Usage of Jolokia requires bundling an associated *jolokia-agent-war* file on the classpath, and associated configuration
|
||||
to export JMX monitoring statistics and data over the Jolokia REST/JSON interface. An associated *jolokia-access.xml*
|
||||
configuration file defines role based permissioning to HTTP operations.
|
||||
|
||||
## Complementary solutions
|
||||
|
||||
A number of 3rd party libraries and frameworks have been proposed which solve different parts of the end to end
|
||||
solution, albeit with most focusing on the Agent Collector (eg. collecting metrics from systems and outputting them to some
|
||||
backend storage), Event Storage and Search, and Visualization aspects of Systems Management and Monitoring. These
|
||||
include:
|
||||
|
||||
| Solution | Type (OS/£) | Description |
|
||||
| ---------------------------------------- | ----------- | ---------------------------------------- |
|
||||
| [Splunk](https://www.splunk.com/en_us/products.html) | £ | General purpose enterprise-wide system management solution which performs collection and indexing of data, searching, correlation and analysis, visualization and reporting, monitoring and alerting. |
|
||||
| [ELK](https://logz.io/learn/complete-guide-elk-stack/) | OS | The ELK stack is a collection of 3 open source products from Elastic which provide an end to end enterprise-wide system management solution:<br />Elasticsearch: NoSQL database based on Lucene search engine<br />Logstash: is a log pipeline tool that accepts inputs from various sources, executes different transformations, and exports the data to various targets. Kibana: is a visualization layer that works on top of Elasticsearch. |
|
||||
| [ArcSight](https://software.microfocus.com/en-us/software/siem-security-information-event-management) | £ | Enterprise Security Manager |
|
||||
| [Collectd](https://collectd.org/) | OS | Collector agent (written in C circa 2005). Data acquisition and storage handled by over 90 plugins. |
|
||||
| [Telegraf](https://github.com/influxdata/telegraf) | OS | Collector agent (written in Go, active community) |
|
||||
| [Graphite](https://graphiteapp.org/) | OS | Monitoring tool that stores, retrieves, shares, and visualizes time-series data. |
|
||||
| [StatsD](https://github.com/etsy/statsd) | OS | Collector daemon that runs on the [Node.js](http://nodejs.org/) platform and listens for statistics, like counters and timers, sent over [UDP](http://en.wikipedia.org/wiki/User_Datagram_Protocol) or [TCP](http://en.wikipedia.org/wiki/Transmission_Control_Protocol) and sends aggregates to one or more pluggable backend services (e.g., Graphite). |
|
||||
| [fluentd](https://www.fluentd.org/) | OS | Collector daemon which collects data directly from logs and databases. Often used to analyze event logs, application logs, and clickstreams (a series of mouse clicks). |
|
||||
| [Prometheus](https://prometheus.io/) | OS | End to end monitoring solution using time-series data (eg. metric name and a set of key-value pairs) and includes collection, storage, query and visualization. |
|
||||
| [NewRelic](https://newrelic.com/) | £ | Full stack instrumentation for application monitoring and real-time analytics solution. |
|
||||
|
||||
Most of the above solutions are not within the scope of this design proposal, but should be capable of ingesting the outputs (logging and metrics) defined by this design.
|
||||
|
||||
## Technical design
|
||||
|
||||
In general, the requirements outlined in this design are cross-cutting concerns which affect the Corda codebase holistically, both for logging and capture/export of JMX metrics.
|
||||
|
||||
### Interfaces
|
||||
|
||||
* Public APIs impacted
|
||||
* No public APIs are impacted.
|
||||
* Internal APIs impacted
|
||||
* No identified internal APIs are impacted.
|
||||
* Services impacted:
|
||||
* No change is anticipated to the following service:
|
||||
* *Monitoring*
|
||||
This service defines and uses the *Codahale* `MetricRegistry`, which is used by all other Corda services.
|
||||
* Changes expected to:
|
||||
* *AuditService*
|
||||
This service has been specified but not implemented.
|
||||
The following event types have been defined (and may need reviewing):
|
||||
* `FlowAppAuditEvent`: used in `FlowStateMachine`, exposed on `FlowLogic` (but never called)
|
||||
* `FlowPermissionAuditEvent`: (as above)
|
||||
* `FlowStartEvent` (unused)
|
||||
* `FlowProgressAuditEvent` (unused)
|
||||
* `FlowErrorAuditEvent` (unused)
|
||||
* `SystemAuditEvent` (unused)
|
||||
* Modules impacted
|
||||
* All modules packaged and shipped as part of a Corda distribution (as published to Artifactory / Maven): *core, node, node-api, node-driver, finance, confidential-identities, test-common, test-utils, webserver, jackson, jfx, mock, rpc*
|
||||
|
||||
### Functional
|
||||
|
||||
#### Health Checker
|
||||
|
||||
The Health checker is a CorDapp which verifies the health and liveliness of the Corda node it is deployed and running within by performing the following activities:
|
||||
|
||||
1. Corda network and middleware infrastructure connectivity checking:
|
||||
|
||||
- Database connectivity
|
||||
- Message broker connectivity
|
||||
|
||||
2. Network Map participants summary (count, list)
|
||||
|
||||
- Notary summary (type, number of cluster members)
|
||||
|
||||
3. Flow framework verification
|
||||
|
||||
Implement a simple flow that performs an "in-node" (no external messaging to 3rd party processes) round trip, and by doing so, exercises:
|
||||
|
||||
- flow checkpointing (including persistence to relational data store)
|
||||
- message subsystem verification (creation of a send-to-self queue for purpose of routing)
|
||||
- custom CordaService invocation (verify and validate behaviour of an installed CordaService)
|
||||
- vault querying (verify and validate behaviour of vault query mechanism)
|
||||
|
||||
[This CorDapp could perform a simple issuance of a fictional Corda token, a spend of that token to self and a token exit, plus a couple of vault queries in between: one using the VaultQuery API and the other using a custom query via a registered @CordaService. A sketch of such a flow is given below.]
|
||||
|
||||
4. RPC triggering
|
||||
Auto-triggering of the above flow using RPC to exercise the following:
|
||||
|
||||
- messaging subsystem verification (RPC queuing)
|
||||
- authentication and permissions checking (against the underlying configuration)
|
||||
|
||||
|
||||
The Health checker may be deployed as part of a Corda distribution and automatically invoked upon start-up, and/or manually triggered via JMX or the node's associated Crash shell (using the startFlow command).
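
A minimal Kotlin sketch of what the flow framework verification step could look like; the flow below is a placeholder (it only forces a checkpoint and runs a vault query), not a specification of the actual Health checker.

```kotlin
import co.paralleluniverse.fibers.Suspendable
import java.time.Duration
import net.corda.core.contracts.ContractState
import net.corda.core.flows.FlowLogic
import net.corda.core.flows.StartableByRPC

// An "in-node" round trip: sleeping suspends the fiber and forces a checkpoint
// (exercising checkpoint persistence), and a vault query exercises the database
// and query mechanism. A full health checker would also issue, spend and exit a
// fictional token and call a registered @CordaService.
@StartableByRPC
class HealthCheckFlow : FlowLogic<Boolean>() {
    @Suspendable
    override fun call(): Boolean {
        sleep(Duration.ofMillis(1))                                  // suspend, forcing a checkpoint
        serviceHub.vaultService.queryBy(ContractState::class.java)   // verify vault access
        return true                                                  // reaching here means both checks ran
    }
}
```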
|
||||
|
||||
Please note that the Health checker application is not responsible for determining the healthiness of a Corda Network. This is the responsibility of the network operator, and may include verification checks such as:
|
||||
|
||||
- correct functioning of Network Map Service (registration, discovery)
|
||||
- correct functioning of configured Notary
|
||||
- remote messaging sub-system (including bridge creation)
|
||||
|
||||
#### Metrics augmentation within Corda Subsystems and Components
|
||||
|
||||
*Codahale* provides the following types of reportable metrics:
|
||||
|
||||
- Gauge: is an instantaneous measurement of a value.
|
||||
- Counter: is a gauge for a numeric value (specifically of type `AtomicLong`) which can be incremented or decremented.
|
||||
- Meter: measures mean throughput (eg. the rate of events over time, e.g., “requests per second”). Also measures one-, five-, and fifteen-minute exponentially-weighted moving average throughputs.
|
||||
- Histogram: measures the statistical distribution of values in a stream of data (minimum, maximum, mean, median, 75th, 90th, 95th, 98th, 99th, and 99.9th percentiles).
|
||||
- Timer: measures both the rate that a particular piece of code is called and the distribution of its duration (eg. rate of requests in requests per second).
|
||||
- Health checks: provides a means of centralizing service health checks (eg. database, message broker).
|
||||
|
||||
See Appendix B for summary of current JMX Metrics exported by the Corda codebase.
|
||||
|
||||
The following table identifies additional metrics to report for a Corda node:
|
||||
|
||||
| Component / Subsystem | Proposed Metric(s) |
|
||||
| ---------------------------------------- | ---------------------------------------- |
|
||||
| Database | Connectivity (health check) |
|
||||
| Corda Persistence | Database configuration details: <br />Data source properties: JDBC driver, JDBC driver class name, URL<br />Database properties: isolation level, schema name, init database flag<br />Run-time metrics: total & in flight connection, session, transaction counts; committed / rolled back transaction (counter); transaction durations (metric) |
|
||||
| Message Broker | Connectivity (health check) |
|
||||
| Corda Messaging Client | |
|
||||
| State Machine | Fiber thread pool queue size (counter), Live fibers (counter) , Fibers waiting for ledger commit (counter)<br />Flow Session Messages (counters): init, confirm, received, reject, normal end, error end, total received messages (for a given flow session, Id and state)<br />(in addition to existing metrics captured)<br />Flow error (count) |
|
||||
| Flow State Machine | Initiated flows (counter)<br />For a given flow session (counters): initiated flows, send, sendAndReceive, receive, receiveAll, retries upon send<br />For flow messaging (timers) to determine round trip latencies between send/receive interactions with counterparties.<br />Flow suspension metrics (count, age, wait reason, cordapp) |
|
||||
| RPC | For each RPC operation we should export metrics to report: calling user, round trip latency (timer), calling frequency (meter). Metric reporting should include the Corda RPC protocol version (should be the same as the node's Platform Version) in play. <br />Failed requests would be of particular interest for alerting. |
|
||||
| Vault | round trip latency of Vault Queries (timer)<br />Soft locking counters for reserve, release (counter), elapsed times soft locks are held for per flow id (timer, histogram), list of soft locked flow ids and associated stateRefs.<br />attempt to soft lock fungible states for spending (timer) |
|
||||
| Transaction Verification<br />(InMemoryTransactionVerifierService) | worker pool size (counter), verify duration (timer), verify throughput (meter), success (counter), failure counter), in flight (counter) |
|
||||
| Notarisation | Notary details (type, members in cluster)<br />Counters for success, failures, failure types (conflict, invalid time window, invalid transaction, wrong notary), elapsed time (timer)<br />Ideally provide breakdown of latency across notarisation steps: state ref notary validation, signature checking, from sending to remote notary to receiving response |
|
||||
| RAFT Notary Service<br />(awaiting choice of new RAFT implementation) | should include similar metrics to previous RAFT (see appendix). |
|
||||
| SimpleNotaryService | success/failure uniqueness checking<br />success/failure time-window checking |
|
||||
| ValidatingNotaryService | as above plus success/failure of transaction validation |
|
||||
| RaftNonValidatingNotaryService | as `SimpleNotaryService`, plus timer for algorithmic execution latency |
|
||||
| RaftValidatingNotaryService | as `ValidatingNotaryService`, plus timer for algorithmic execution latency |
|
||||
| BFTNonValidatingNotaryService | as `RaftNonValidatingNotaryService` |
|
||||
| CorDapps<br />(CordappProviderImpl, CordappImpl) | list of corDapps loaded in node, path used to load corDapp jars<br />Details per CorDapp: name, contract class names, initiated flows, rpc flows, service flows, schedulable flows, services, serialization whitelists, custom schemas, jar path |
|
||||
| Doorman Server | TBC |
|
||||
| KeyManagementService | signing requests (count), fresh key requests (count), fresh key and cert requests (count), number of loaded keys (count) |
|
||||
| ContractUpgradeServiceImpl | number of authorisation upgrade requests (counter) |
|
||||
| DBTransactionStorage | number of transactions in storage map (cache) <br />cache size (max. 1024), concurrency level (def. 8) |
|
||||
| DBTransactionMappingStorage | as above |
|
||||
| Network Map | TBC (following re-engineering) |
|
||||
| Identity Service | number or parties, keys, principals (in cache)<br />Identity verification count & latency (count, metric) |
|
||||
| Attachment Service | counters for open, import, checking requests<br />(in addition to exiting attachment count) |
|
||||
| Schema Service | list of registered schemas; schemaOptions per schema; table prefix. |
|
||||
|
||||
#### Logging augmentation within Corda Subsystems and Components
|
||||
|
||||
We need to ensure that Log4j2 log messages within Corda code are correctly categorized according to the defined severities (from most specific to least):
|
||||
|
||||
- ERROR: an error in the application, possibly recoverable.
|
||||
- WARNING: an event that might possibly lead to an error.
|
||||
- INFO: an event for informational purposes.
|
||||
- DEBUG: a general debugging event.
|
||||
- TRACE: a fine-grained debug message, typically capturing the flow through the application.
|
||||
|
||||
A *logging style guide* will be published to answer questions such as which severity level should be used, and why, when:
|
||||
|
||||
- A connection to a remote peer is unexpectedly terminated.
|
||||
- A database connection timed out but was successfully re-established.
|
||||
- A message was sent to a peer.
|
||||
|
||||
It is also important that we capture the correct amount of contextual information to enable rapid identification and resolution of issues using log file output. Specifically, within Corda we should include the following information in logged messages:
|
||||
|
||||
- Node identifier
|
||||
- User name
|
||||
- Flow id (runId, also referred to as `StateMachineRunId`), if logging within a flow
|
||||
- Other contextual Flow information (eg. counterparty), if logging within a flow
|
||||
- `FlowStackSnapshot` information for catastrophic flow failures.
|
||||
Note: this information is not currently intended for production use.
|
||||
- Session id information for RPC calls
|
||||
- CorDapp name, if logging from within a CorDapp
|
||||
|
||||
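As a minimal illustration of how such context might be attached to log lines, the sketch below uses Log4j2's `ThreadContext` (MDC). The key names and the surrounding function are hypothetical, and in a real flow the context would need to be re-established after a checkpoint restore:

```
import org.apache.logging.log4j.LogManager
import org.apache.logging.log4j.ThreadContext

private val log = LogManager.getLogger("net.corda.node.services")

// Illustrative only: the key names and values are hypothetical.
fun logIdentityVerification(nodeId: String, flowId: String, counterparty: String) {
    ThreadContext.put("nodeId", nodeId)         // Node identifier
    ThreadContext.put("flowId", flowId)         // StateMachineRunId of the running flow
    ThreadContext.put("counterparty", counterparty)
    try {
        // A pattern layout containing "%X{nodeId} %X{flowId}" would render these values on every line.
        log.info("Verified counterparty identity")
    } finally {
        ThreadContext.clearMap()                // Avoid leaking context onto unrelated work on this thread
    }
}
```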
See Appendix B for a summary of current Logging and Progress Tracker Reporting coverage within the Corda codebase.
|
||||
|
||||
##### Custom logging for enhanced visibility and troubleshooting:
|
||||
|
||||
1. Database SQL logging is controlled via explicit configuration of the Hibernate log4j2 logger as follows:
|
||||
|
||||
```
<Logger name="org.hibernate.SQL" level="debug" additivity="false">
    <AppenderRef ref="Console-Appender"/>
</Logger>
```
|
||||
|
||||
2. Message broker (Apache Artemis) advanced logging is enabled by configuring log4j2 for each of the six available [loggers defined](https://activemq.apache.org/artemis/docs/latest/logging.html). In general, Artemis logging is highly verbose, so the default configuration actually tones down one of the defined loggers:
|
||||
|
||||
```
<Logger name="org.apache.activemq.artemis.core.server" level="error" additivity="false">
    <AppenderRef ref="RollingFile-Appender"/>
</Logger>
```
|
||||
|
||||
3. Corda coin selection advanced logging - including display of prepared statement parameters (which are not displayed for certain database providers when enabling Hibernate debug logging):
|
||||
|
||||
```
<Logger name="net.corda.finance.contracts.asset.cash.selection" level="trace" additivity="false">
    <AppenderRef ref="Console-Appender"/>
</Logger>
```
|
||||
|
||||
#### Audit Service persistence implementation and enablement
|
||||
|
||||
1. Implementation of the existing `AuditService` API to write to a (pluggable) secure destination (database, message queue, other)
|
||||
2. Identification of Business Events that we should audit, and instrumentation of code to ensure the AuditService is called with the correct Event Type for each Business Event.
|
||||
For Corda Flows it would make sense to use the `ProgressTracker` component as a means of emitting Business audit events (see the sketch after this list). Refer [here](https://docs.corda.net/head/flow-state-machines.html?highlight=progress%20tracker#progress-tracking) for a detailed description of the ProgressTracker API.
|
||||
3. Identification of System Events that should be automatically audited.
|
||||
4. Specification of a database schema and associated object relational mapping implementation.
|
||||
5. Setup and configuration of separate database and user account.
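As a hedged sketch of point 2 above, a flow's `ProgressTracker` steps could double as business audit events. The flow and step names below are illustrative only:

```
import co.paralleluniverse.fibers.Suspendable
import net.corda.core.flows.FlowLogic
import net.corda.core.utilities.ProgressTracker

// Sketch only: the flow and its step names are illustrative. Each change of currentStep is
// observable by the node and could be forwarded to the AuditService as a business event.
class RecordTradeFlow : FlowLogic<Unit>() {
    companion object {
        object VALIDATING : ProgressTracker.Step("Validating trade details")
        object RECORDING : ProgressTracker.Step("Recording trade")

        fun tracker() = ProgressTracker(VALIDATING, RECORDING)
    }

    override val progressTracker = tracker()

    @Suspendable
    override fun call() {
        progressTracker.currentStep = VALIDATING
        // ... business validation would go here ...
        progressTracker.currentStep = RECORDING
        // ... transaction building and recording would go here ...
    }
}
```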
|
||||
|
||||
## Software Development Tools and Programming Standards to be adopted.
|
||||
|
||||
* Design patterns
|
||||
|
||||
[Michele] proposes the adoption of an [event-based propagation](https://r3-cev.atlassian.net/browse/ENT-1131) solution (and associated event-driven framework) that separates concerns between mainstream flow logic, business audit event triggering and JMX metric reporting, yielding performance improvements through parallelisation and minimising latency on the mainline execution thread. This approach would continue to use the same libraries for JMX event triggering and file logging.
|
||||
|
||||
* 3rd party libraries
|
||||
|
||||
[Jolokia](https://jolokia.org/) is a JMX-HTTP bridge giving access to the raw data and operations without connecting to the JMX port directly. Jolokia defines the JSON and REST formats for accessing MBeans, and provides client libraries to work with that protocol as well.
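As a minimal, hedged sketch of consuming that bridge, the snippet below reads an MBean attribute over Jolokia's documented REST convention (`GET /jolokia/read/<mbean>/<attribute>`); the host, port and context path are deployment-specific assumptions:

```
import java.net.HttpURLConnection
import java.net.URL

// Reads the JVM heap usage MBean through a Jolokia agent; the URL details are assumptions.
fun readHeapUsageViaJolokia(): String {
    val url = URL("http://localhost:7005/jolokia/read/java.lang:type=Memory/HeapMemoryUsage")
    val connection = url.openConnection() as HttpURLConnection
    return connection.inputStream.bufferedReader().use { it.readText() }   // JSON payload from Jolokia
}
```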
|
||||
|
||||
[Dropwizard Metrics](http://metrics.dropwizard.io/3.2.3/) (formerly Codahale) provides a toolkit of ways to measure the behavior of critical components in a production environment.
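The sketch below shows how such metrics could be registered and exposed over JMX with this library; the metric names are illustrative, not the ones the node actually uses:

```
import com.codahale.metrics.JmxReporter
import com.codahale.metrics.MetricRegistry

// Illustrative metric names only; registering a JmxReporter makes them visible over JMX
// (and hence to Jolokia). The JmxReporter import path is as of the 3.x line referenced above.
val registry = MetricRegistry()
val vaultQueryTimer = registry.timer("Vault.QueryLatency")
val rpcCallMeter = registry.meter("RPC.CallingFrequency")

fun <T> timedVaultQuery(query: () -> T): T {
    rpcCallMeter.mark()
    return vaultQueryTimer.time().use { query() }   // Timer.Context is Closeable, so `use` stops the timer
}

fun startJmxReporting() {
    JmxReporter.forRegistry(registry).build().start()
}
```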
|
||||
|
||||
* supporting tools
|
||||
|
||||
[VisualVM](http://visualvm.github.io/) is a visual tool integrating command-line JDK tools and lightweight profiling capabilities.
|
||||
|
||||
## Appendix A - Corda exposed JMX Metrics
|
||||
|
||||
The following metrics are exposed directly by a Corda Node at run-time:
|
||||
|
||||
| Module | Metric | Description |
|
||||
| ------------------------ | ---------------------------- | ---------------------------------------- |
|
||||
| Attachment Service | Attachments | Counts number of attachments persisted in database. |
|
||||
| RAFT Uniqueness Provider | RaftCluster.ThisServerStatus | Gauge |
|
||||
| RAFT Uniqueness Provider | RaftCluster.MembersCount | Count |
|
||||
| RAFT Uniqueness Provider | RaftCluster.Members | Gauge, containing a list of members (by server address) |
|
||||
| State Machine Manager | Flows.InFlight | Gauge (number of instances of state machine manager) |
|
||||
| State Machine Manager | Flows.CheckpointingRate | Meter |
|
||||
| State Machine Manager | Flows.Started | Count |
|
||||
| State Machine Manager | Flows.Finished | Count |
|
||||
|
||||
Additionally, JMX metrics are also generated within the Corda *node-driver* performance testing utilities. Specifically, the `startPublishingFixedRateInjector` defines and exposes `QueueSize` and `WorkDuration` metrics.
|
||||
|
||||
## Appendix B - Corda Logging and Reporting coverage
|
||||
|
||||
Primary node services exposed publicly via ServiceHub (SH) or internally by ServiceHubInternal (SHI):
|
||||
|
||||
| Service | Type | Implementation | Logging summary |
|
||||
| ---------------------------------------- | ---- | ---------------------------------- | ---------------------------------------- |
|
||||
| VaultService | SH | NodeVaultService | extensive coverage including Vault Query api calls using `HibernateQueryCriteriaParser` |
|
||||
| KeyManagementService | SH | PersistentKeyManagementService | none |
|
||||
| ContractUpgradeService | SH | ContractUpgradeServiceImpl | none |
|
||||
| TransactionStorage | SH | DBTransactionStorage | none |
|
||||
| NetworkMapCache | SH | NetworkMapCacheImpl | some logging (11x info, 1x warning) |
|
||||
| TransactionVerifierService | SH | InMemoryTransactionVerifierService | |
|
||||
| IdentityService | SH | PersistentIdentityService | some logging (error, debug) |
|
||||
| AttachmentStorage | SH | NodeAttachmentService | minimal logging (info) |
|
||||
| | | | |
|
||||
| TransactionStorage | SHI | DBTransactionStorage | see SH |
|
||||
| StateMachineRecordedTransactionMappingStorage | SHI | DBTransactionMappingStorage | none |
|
||||
| MonitoringService | SHI | MonitoringService | none |
|
||||
| SchemaService | SHI | NodeSchemaService | none |
|
||||
| NetworkMapCacheInternal | SHI | PersistentNetworkMapCache | see SH |
|
||||
| AuditService | SHI | <unimplemented> | |
|
||||
| MessagingService | SHI | NodeMessagingClient | Good coverage (error, warning, info, trace) |
|
||||
| CordaPersistence | SHI | CordaPersistence | INFO coverage within `HibernateConfiguration` |
|
||||
| CordappProviderInternal | SHI | CordappProviderImpl | none |
|
||||
| VaultServiceInternal | SHI | NodeVaultService | see SH |
|
||||
|
||||
Corda subsystem components:
|
||||
|
||||
| Name | Implementation | Logging summary |
|
||||
| -------------------------- | ---------------------------------------- | ---------------------------------------- |
|
||||
| NotaryService | SimpleNotaryService | some logging (warn) via `TrustedAuthorityNotaryService` |
|
||||
| NotaryService | ValidatingNotaryService | as above |
|
||||
| NotaryService | RaftValidatingNotaryService | some coverage (info, debug) within `RaftUniquenessProvider` |
|
||||
| NotaryService | RaftNonValidatingNotaryService | as above |
|
||||
| NotaryService | BFTNonValidatingNotaryService | Logging coverage (info, debug) |
|
||||
| Doorman | DoormanServer (Enterprise only) | Some logging (info, warn, error), and use of `println` |
|
||||
|
||||
Corda core flows:
|
||||
|
||||
| Flow name | Logging | Exception handling | Progress Tracking |
|
||||
| --------------------------------------- | ------------------- | ---------------------------------------- | ----------------------------- |
|
||||
| FinalityFlow | none | NotaryException | NOTARISING, BROADCASTING |
|
||||
| NotaryFlow | none | NotaryException (NotaryError types: TimeWindowInvalid, TransactionInvalid, WrongNotary), IllegalStateException, some via `check` assertions | REQUESTING, VALIDATING |
|
||||
| NotaryChangeFlow | none | StateReplacementException | SIGNING, NOTARY |
|
||||
| SendTransactionFlow | none | FetchDataFlow.HashNotFound (FlowException) | none |
|
||||
| ReceiveTransactionFlow | none | SignatureException, AttachmentResolutionException, TransactionResolutionException, TransactionVerificationException | none |
|
||||
| ResolveTransactionsFlow | none | FetchDataFlow.HashNotFound (FlowException), ExcessivelyLargeTransactionGraph (FlowException) | none |
|
||||
| FetchAttachmentsFlow | none | FetchDataFlow.HashNotFound | none |
|
||||
| FetchTransactionsFlow | none | FetchDataFlow.HashNotFound | none |
|
||||
| FetchDataFlow | some logging (info) | FetchDataFlow.HashNotFound | none |
|
||||
| AbstractStateReplacementFlow.Instigator | none | StateReplacementException | SIGNING, NOTARY |
|
||||
| AbstractStateReplacementFlow.Acceptor | none | StateReplacementException | VERIFYING, APPROVING |
|
||||
| CollectSignaturesFlow | none | IllegalArgumentException via `require` assertions | COLLECTING, VERIFYING |
|
||||
| CollectSignatureFlow | none | as above | none |
|
||||
| SignTransactionFlow | none | FlowException, possibly other (general) Exception | RECEIVING, VERIFYING, SIGNING |
|
||||
| ContractUpgradeFlow | none | FlowException | none |
|
||||
|
||||
Corda finance flows:
|
||||
|
||||
| Flow name | Logging | Exception handling | Progress Tracking |
|
||||
| -------------------------- | ------- | ---------------------------------------- | ---------------------------------------- |
|
||||
| AbstractCashFlow | none | CashException (FlowException) | GENERATING_ID, GENERATING_TX, SIGNING_TX, FINALISING_TX |
|
||||
| CashIssueFlow | none | CashException (via call to `FinalityFlow`) | GENERATING_TX, SIGNING_TX, FINALISING_TX |
|
||||
| CashPaymentFlow | none | CashException (caused by `InsufficientBalanceException` or thrown by `FinalityFlow`), SwapIdentitiesException | GENERATING_ID, GENERATING_TX, SIGNING_TX, FINALISING_TX |
|
||||
| CashExitFlow | none | CashException (caused by `InsufficientBalanceException` or thrown by `FinalityFlow`) | GENERATING_TX, SIGNING_TX, FINALISING_TX |
|
||||
| CashIssueAndPaymentFlow | none | any thrown by `CashIssueFlow` and `CashPaymentFlow` | as `CashIssueFlow` and `CashPaymentFlow` |
|
||||
| TwoPartyDealFlow.Primary | none | | GENERATING_ID, SENDING_PROPOSAL |
|
||||
| TwoPartyDealFlow.Secondary | none | IllegalArgumentException via `require` assertions | RECEIVING, VERIFYING, SIGNING, COLLECTING_SIGNATURES, RECORDING |
|
||||
| TwoPartyTradeFlow.Seller | none | FlowException, IllegalArgumentException via `require` assertions | AWAITING_PROPOSAL, VERIFYING_AND_SIGNING |
|
||||
| TwoPartyTradeFlow.Buyer | none | IllegalArgumentException via `require` assertions, IllegalStateException | RECEIVING, VERIFYING, SIGNING, COLLECTING_SIGNATURES, RECORDING |
|
||||
|
||||
Confidential identities flows:
|
||||
|
||||
| Flow name | Logging | Exception handling | Progress Tracking |
|
||||
| ------------------------ | ------- | ---------------------------------------- | ---------------------------------------- |
|
||||
| SwapIdentitiesFlow | | | |
|
||||
| IdentitySyncFlow.Send | none | IllegalArgumentException via `require` assertions, IllegalStateException | SYNCING_IDENTITIES |
|
||||
| IdentitySyncFlow.Receive | none | CertificateExpiredException, CertificateNotYetValidException, InvalidAlgorithmParameterException | RECEIVING_IDENTITIES, RECEIVING_CERTIFICATES |
|
||||
|
||||
## Appendix C - Apache Artemis JMX Event types and Queuing Metrics.
|
||||
|
||||
The following table contains a list of Notification Types and associated perceived importance to a Corda node at run-time:
|
||||
|
||||
| Name | Code | Importance |
|
||||
| --------------------------------- | :--: | ---------- |
|
||||
| BINDING_ADDED | 0 | |
|
||||
| BINDING_REMOVED | 1 | |
|
||||
| CONSUMER_CREATED | 2 | Medium |
|
||||
| CONSUMER_CLOSED | 3 | Medium |
|
||||
| SECURITY_AUTHENTICATION_VIOLATION | 6 | Very high |
|
||||
| SECURITY_PERMISSION_VIOLATION | 7 | Very high |
|
||||
| DISCOVERY_GROUP_STARTED | 8 | |
|
||||
| DISCOVERY_GROUP_STOPPED | 9 | |
|
||||
| BROADCAST_GROUP_STARTED | 10 | N/A |
|
||||
| BROADCAST_GROUP_STOPPED | 11 | N/A |
|
||||
| BRIDGE_STARTED | 12 | High |
|
||||
| BRIDGE_STOPPED | 13 | High |
|
||||
| CLUSTER_CONNECTION_STARTED | 14 | Soon |
|
||||
| CLUSTER_CONNECTION_STOPPED | 15 | Soon |
|
||||
| ACCEPTOR_STARTED | 16 | |
|
||||
| ACCEPTOR_STOPPED | 17 | |
|
||||
| PROPOSAL | 18 | |
|
||||
| PROPOSAL_RESPONSE | 19 | |
|
||||
| CONSUMER_SLOW | 21 | High |
|
||||
|
||||
The following table summarises the types of metrics associated with Message Queues:
|
||||
|
||||
| Metric | Description |
|
||||
| ----------------- | ---------------------------------------- |
|
||||
| count | total number of messages added to a queue since the server started |
|
||||
| countDelta | number of messages added to the queue *since the last message counter update* |
|
||||
| messageCount | *current* number of messages in the queue |
|
||||
| messageCountDelta | *overall* number of messages added/removed from the queue *since the last message counter update*. A positive value indicates more messages were added, a negative value that more were removed. |
|
||||
| lastAddTimestamp | timestamp of the last time a message was added to the queue |
|
||||
| updateTimestamp | timestamp of the last message counter update |
|
@ -1,175 +0,0 @@
|
||||
# Reference states
|
||||
|
||||
## Overview
|
||||
|
||||
See a prototype implementation here: https://github.com/corda/corda/pull/2889
|
||||
|
||||
There is an increasing need for Corda to support use-cases which require reference data which is issued and updated by specific parties, but available for use, by reference, in transactions built by other parties.
|
||||
|
||||
Why is this type of reference data required? A key benefit of blockchain systems is that everybody is sure they see the
|
||||
same as their counterparts. For this to work in situations where accurate processing depends on reference data,
|
||||
everybody must operate on the same reference data. This, in turn, requires any given piece of reference data
|
||||
to be uniquely identifiable and requires that any given transaction be certain to operate on the most current
|
||||
version of that reference data. In cases where the latter condition applies, only the notary can attest to this fact and
|
||||
this, in turn, means the reference data must be in the form of an unconsumed state.
|
||||
|
||||
This document outlines the approach for adding support for this type of reference data to the Corda transaction model
|
||||
via a new approach called "reference input states".
|
||||
|
||||
## Background
|
||||
|
||||
Firstly, it is worth considering the types of reference data on Corda and how they are distributed:
|
||||
|
||||
1. **Rarely changing universal reference data.** Such as currency codes and holiday calendars. This type of data can be added to transactions as attachments and referenced within contracts, if required. This data would only change based upon the decision of an international standards body, for example, so it is not critical to check the data is current each time it is used.
|
||||
2. **Constantly changing reference data.** Typically, this type of data must be collected and aggregated by a central party. Oracles can be used as a central source of truth for this type of constantly changing data. There are multiple examples of making transaction validity contingent on data provided by Oracles (IRS demo and SIMM demo). The Oracle asserts the data was valid at the time it was provided.
|
||||
3. **Periodically changing subjective reference data.** Reference data provided by entities such as bond issuers, where the data changes frequently enough to warrant users of the data checking that it is current.
|
||||
|
||||
At present, periodically changing subjective data can only be provided via:
|
||||
|
||||
* Oracles,
|
||||
* Attachments,
|
||||
* Regular contract states, or alternatively,
|
||||
* kept off-ledger entirely
|
||||
|
||||
However, none of these solutions is optimal, for reasons discussed in later sections of this design document.
|
||||
|
||||
As such, this design document introduces the concept of a "reference input state" which is a better way to serve "periodically changing subjective reference data" on Corda.
|
||||
|
||||
A reference input state is a `ContractState` which can be referred to in a transaction by the contracts of input and output states, but whose contract is not executed as part of the transaction verification process, which is not consumed when the transaction is committed to the ledger, but which _is_ checked for "current-ness". In other words, the contract logic is skipped only for the referencing transaction; the state remains a normal state when it occurs in an input or output position.
|
||||
|
||||
Reference data states will enable many parties to "reuse" the same state in their transactions as reference data whilst still allowing the reference data state owner the capability to update the state. When data distribution groups are available then reference state owners will be able to distribute updates to subscribers more easily. Currently, distribution would have to be performed manually.
|
||||
|
||||
Reference input states can be added to Corda by adding a new transaction component group that allows developers to add reference data `ContractState`s that are not consumed when the transaction is committed to the ledger. This eliminates the problems created by long chains of provenance, contention, and allows developers to use any `ContractState` for reference data. The feature should allow developers to add _any_ `ContractState` available in their vault, even if they are not a `participant` whilst nevertheless providing a guarantee that the state being used is the most recent version of that piece of information.
|
||||
|
||||
## Scope
|
||||
|
||||
Goals
|
||||
|
||||
* Add the capability to Corda transactions to support reference states
|
||||
|
||||
Non-goals (eg. out of scope)
|
||||
|
||||
* Data distribution groups are required to realise the full potential of reference data states. This design document does not discuss data distribution groups.
|
||||
|
||||
## Requirements
|
||||
|
||||
1. Reference states can be any `ContractState` created by one or more `Party`s and subsequently updated by those `Party`s. E.g. `Cash`, `CompanyData`, `InterestRateSwap`, `FxRate`. Reference states can be `OwnableState`s, but it's more likely they will be `LinearState`s.
|
||||
2. Any `Party` with a `StateRef` for a reference state should be able to add it to a transaction to be used as a reference, even if they are not a `participant` of the reference state.
|
||||
3. The contract code for reference states should not be executed. However, reference data states can be referred to by the contracts of `ContractState`s in the input and output lists.
|
||||
4. `ContractStates` should not be consumed when used as reference data.
|
||||
5. Reference data must be current, therefore when reference data states are used in a transaction, notaries should check that they have not been consumed before.
|
||||
6. To ensure determinism of the contract verification process, reference data states must be in scope for the purposes of transaction resolution. This is because whilst users of the reference data are not consuming the state, they must be sure that the series of transactions that created and evolved the state were executed validly.
|
||||
|
||||
**Use-cases:**
|
||||
|
||||
The canonical use-case for reference states: *KYC*
|
||||
|
||||
* KYC data can be distributed as reference data states.
|
||||
* KYC data states can only be updated by the data owner.
|
||||
* Usable by any party - transaction verification can be conditional on this KYC/reference data.
|
||||
* Notary ensures the data is current.
|
||||
|
||||
Collateral reporting:
|
||||
|
||||
* Imagine a bank needs to provide evidence to another party (like a regulator) that they hold certain states, such as cash and collateral, for liquidity reporting purposes
|
||||
* The regulator holds a liquidity reporting state that maintains a record of past collateral reports and automates the handling of current reports using some contract code
|
||||
* To update the liquidity reporting state, the regulator needs to include the bank’s cash/collateral states in a transaction – the contract code checks available collateral vs requirements. By doing this, the cash/collateral states would be consumed, which is not desirable
|
||||
* Instead, what if those cash/collateral states could be referenced in a transaction but not consumed? And at the same time, the notary still checks to see if the cash/collateral states are current, or not (i.e. does the bank still own them)
|
||||
|
||||
Other uses:
|
||||
|
||||
* Distributing reference data for financial instruments. E.g. Bond issuance details created, updated and distributed by the bond issuer rather than a third party.
|
||||
* Account level data included in cash payment transactions.
|
||||
|
||||
## Design Decisions
|
||||
|
||||
There are various other ways to implement reference data on Corda, discussed below:
|
||||
|
||||
**Regular contract states**
|
||||
|
||||
Currently, the transaction model is too cumbersome to support reference data as unconsumed states for the following reasons:
|
||||
|
||||
* Contract verification is required for the `ContractState`s used as reference data. This limits the use of states such as `Cash` as reference data (unless a special "reference" command is added which allows a "NOOP" state transaction to assert that no changes were made).
|
||||
* As such, whenever an input state reference is added to a transaction as reference data, an output state must be added, otherwise the state will be extinguished. This results in long chains of unnecessarily duplicated data.
|
||||
* Long chains of provenance result in confidentiality breaches as down-stream users of the reference data state see all the prior uses of the reference data in the chain of provenance. This is an important point: it means that two parties, who have no business relationship and care little about each other's transactions nevertheless find themselves intimately bound: should one of them rely on a piece of common reference data in a transaction, the other one will not only need to be informed but will need to be furnished with a copy of the transaction.
|
||||
* Reference data states will likely be used by many parties so they will become highly contended. Parties will "race" to use the reference data. The latest copy must be continually distributed to all that require it.
|
||||
|
||||
**Attachments**
|
||||
|
||||
Of course, attachments can be used to store and share reference data. This approach does solve the contention issue around reference data as regular contract states. However, attachments don't allow users to ascertain whether they are working on the most recent copy of the data. Given that it's crucial to know whether reference data is current, attachments cannot provide a workable solution here.
|
||||
|
||||
The other issue with attachments is that they do not give an intrinsic "format" to data, like state objects do. This makes working with attachments much harder as their contents are effectively bespoke. Whilst a data format tool could be written, it's more convenient to work with state objects.
|
||||
|
||||
**Oracles**
|
||||
|
||||
Whilst Oracles could provide a solution for periodically changing reference data, they introduce unnecessary centralisation and are onerous to implement for each class of reference data. Oracles don't feel like an optimal solution here.
|
||||
|
||||
**Keeping reference data off-ledger**
|
||||
|
||||
It makes sense to push as much verification as possible into the contract code, otherwise why bother having it? Performing verification inside flows is generally not a good idea as the flows can be re-written by malicious developers. In almost all cases, it is much more difficult to change the contract code. If transaction verification can be conditional on reference data included in a transaction, as a state, then the result is a more robust and secure ledger (and audit trail).
|
||||
|
||||
## Target Solution
|
||||
|
||||
Changes required:
|
||||
|
||||
1. Add a `references` property of type `List<StateRef>` and `List<StateAndRef>` (for `FullTransaction`s) to all the transaction types.
|
||||
2. Add a `REFERENCE_STATES` component group.
|
||||
3. Amend the notary flows to check that reference states are current (but do not consume them)
|
||||
4. Add a `ReferencedStateAndRef` class that encapsulates a `StateAndRef`, so that `TransactionBuilder.withItems` can distinguish between `StateAndRef`s and state references.
|
||||
5. Add a `StateAndRef.referenced` method which wraps a `StateAndRef` in a `ReferencedStateAndRef`.
|
||||
6. Add helper methods to `LedgerTransaction` to get `references` by type, etc.
|
||||
7. Add a check to the transaction classes that asserts all references and inputs are on the same notary.
|
||||
8. Add a method to `TransactionBuilder` to add a reference state (a usage sketch follows this list).
|
||||
9. Update the transaction resolution flow to resolve references.
|
||||
10. Update the transaction and ledger DSLs to support references.
|
||||
11. No changes are required to be made to contract upgrade or notary change transactions.
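To make the intended developer experience concrete, here is a hedged sketch of how the API proposed in the list above might be used. The `KycState` type is hypothetical and the exact method names could change during implementation:

```
import net.corda.core.contracts.Contract
import net.corda.core.contracts.ContractState
import net.corda.core.contracts.StateAndRef
import net.corda.core.identity.AbstractParty
import net.corda.core.identity.Party
import net.corda.core.transactions.LedgerTransaction
import net.corda.core.transactions.TransactionBuilder

// Hypothetical reference data state, used purely for illustration.
data class KycState(
    val subject: Party,
    val approved: Boolean,
    override val participants: List<AbstractParty> = listOf(subject)
) : ContractState

// Items 5 and 8: wrap the StateAndRef so the builder can tell a reference apart from an
// ordinary input, then add it to the proposed REFERENCE_STATES component group.
fun addKycReference(builder: TransactionBuilder, kyc: StateAndRef<KycState>) {
    builder.addReferenceState(kyc.referenced())
}

class ExampleContract : Contract {
    override fun verify(tx: LedgerTransaction) {
        // Item 6: reference states are visible to contract code via helper methods, but their
        // own contract is not executed and they are not consumed.
        val kyc = tx.referenceInputsOfType<KycState>().single()
        require(kyc.approved) { "Counterparty must have an approved KYC record" }
    }
}
```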
|
||||
|
||||
Implications:
|
||||
|
||||
**Versioning**
|
||||
|
||||
This can be done in a backwards compatible way. However, a minimum platform version must be mandated. Nodes running on an older version of Corda will not be able to verify transactions which include references. Indeed, contracts which refer to `references` will fail at run-time for older nodes.
|
||||
|
||||
**Privacy**
|
||||
|
||||
Reference states will be visible to all that possess a chain of provenance including them. There are potential implications from a data protection perspective here. Creators of reference data must be careful **not** to include sensitive personal data.
|
||||
|
||||
Outstanding issues:
|
||||
|
||||
**Oracle choice**
|
||||
|
||||
If the party building a transaction is using a reference state which they are not the owner of, they must move their states to the reference state's notary. If two or more reference states with different notaries are used, then the transaction cannot be committed, as there is no notary change solution that works without asking the reference state owner to change the notary.
|
||||
|
||||
This can be mitigated by requesting that reference state owners distribute reference states for all notaries. This solution doesn't work for `OwnableState`s used as reference data as `OwnableState`s should be unique. However, in most cases it is anticipated that the users of `OwnableState`s as reference data will be the owners of those states.
|
||||
|
||||
This solution introduces a new issue where nodes may store the same piece of reference data under different linear IDs. `TransactionBuilder`s would also need to know the required notary before a reference state is added.
|
||||
|
||||
**Syndication of reference states**
|
||||
|
||||
In the absence of data distribution groups, reference data must be manually transmitted to those that require it. Pulling might have the effect of DoS attacking nodes that own reference data used by many frequent users. Pushing requires reference data owners to be aware of all current users of the reference data. A temporary solution is required before data distribution groups are implemented.
|
||||
|
||||
Initial thoughts are that pushing reference states is the better approach.
|
||||
|
||||
**Interaction with encumbrances**
|
||||
|
||||
It is likely not possible to reference encumbered states unless the encumbrance state is also referenced. For example, a cash state referenced for collateral reporting purposes may have been "seized", and thus encumbered, by a regulator, so it cannot be counted for the collateral report.
|
||||
|
||||
**What happens if a state is added to a transaction as an input as well as an input reference state?**
|
||||
|
||||
A developer might erroneously add the same StateRef as an input state _and_ an input reference state. The effect would be referring to reference data that immediately becomes out of date. This edge case should be prevented as it is likely to confuse CorDapp developers.
|
||||
|
||||
**Handling of update races.**
|
||||
|
||||
Usage of a referenced state may race with an update to it. This would cause notarisation failure; however, the flow cannot simply loop and re-calculate the transaction because it has not necessarily seen the updated transaction yet (it may be a slow broadcast).
|
||||
|
||||
Therefore, it would make sense to extend the flows API with a new flow, call it WithReferencedStatesFlow, that is given a set of LinearIDs and a factory that instantiates a subflow from a set of resolved StateAndRefs. A simplified sketch follows the numbered steps below.
|
||||
|
||||
It does the following:
|
||||
|
||||
1. Checks that those linear IDs are in the vault and throws if not.
|
||||
2. Resolves the linear IDs to the tip StateAndRefs.
|
||||
3. Creates the subflow, passing in the resolved StateAndRefs to the factory, and then invokes it.
|
||||
4. If the subflow throws a NotaryException because it tried to finalise and failed, that exception is caught and examined. If the failure was due to a conflict on a referenced state, the flow suspends until that state has been updated in the vault (there is already an API to wait for a transaction, but here the flow must wait for a state update).
|
||||
5. Then it re-does the initial calculation, re-creates the subflow with the new resolved tips using the factory, and re-runs it as a new subflow.
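A simplified, hedged outline of that retry structure is sketched below. It is not the real implementation: the conflict inspection is elided and `waitForStateUpdate` is a placeholder for the "wait for a state update" API mentioned in step 4:

```
import co.paralleluniverse.fibers.Suspendable
import net.corda.core.contracts.ContractState
import net.corda.core.contracts.StateAndRef
import net.corda.core.contracts.UniqueIdentifier
import net.corda.core.flows.FlowLogic
import net.corda.core.flows.NotaryException
import net.corda.core.node.services.vault.QueryCriteria

// Simplified outline only: error inspection and progress tracking are omitted.
class WithReferencedStatesFlow<T>(
    private val linearIds: List<UniqueIdentifier>,
    private val flowFactory: (List<StateAndRef<ContractState>>) -> FlowLogic<T>
) : FlowLogic<T>() {
    @Suspendable
    override fun call(): T {
        while (true) {
            // Steps 1-2: resolve the linear IDs to their current tips, failing if any are unknown.
            val criteria = QueryCriteria.LinearStateQueryCriteria(linearId = linearIds)
            val tips = serviceHub.vaultService.queryBy(ContractState::class.java, criteria).states
            check(tips.size == linearIds.size) { "Not all referenced states are in the vault" }
            try {
                // Step 3: create the subflow from the resolved tips and run it.
                return subFlow(flowFactory(tips))
            } catch (e: NotaryException) {
                // Step 4: the real implementation would examine e.error and rethrow unless the
                // conflict involves one of the referenced states.
                waitForStateUpdate(tips)
                // Step 5: loop, re-resolve the tips and re-run a fresh subflow.
            }
        }
    }

    @Suspendable
    private fun waitForStateUpdate(tips: List<StateAndRef<ContractState>>) {
        // Placeholder: suspend until the conflicting reference state's successor is in the vault.
        TODO("wait for the updated reference state to arrive in the vault")
    }
}
```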
|
||||
|
||||
Care must be taken to handle progress tracking correctly in case of loops.
|
@ -1,69 +0,0 @@
|
||||
![Corda](https://www.corda.net/wp-content/uploads/2016/11/fg005_corda_b.png)
|
||||
|
||||
--------------------------------------------
|
||||
Design Decision: CPU certification method
|
||||
============================================
|
||||
|
||||
## Background / Context
|
||||
|
||||
Remote attestation is done in two main steps.
|
||||
1. Certification of the CPU. This boils down to some kind of Intel signature over a key that only a specific enclave has
|
||||
access to.
|
||||
2. Using the certified key to sign business logic specific enclave quotes and providing the full chain of trust to
|
||||
challengers.
|
||||
|
||||
This design question concerns the way we can manage a certification key. A more detailed description is
|
||||
[here](../details/attestation.md)
|
||||
|
||||
## Options Analysis
|
||||
|
||||
### A. Use Intel's recommended protocol
|
||||
|
||||
This involves using ``aesmd`` and the Intel SDK to establish an opaque attestation key that transparently signs quotes.
|
||||
Then for each enclave we need to do several round trips to IAS to get a revocation list (which we don't need) and request
|
||||
a direct Intel signature over the quote (which we shouldn't need as the trust has been established already during EPID
|
||||
join).
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. We have a PoC implemented that does this
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. Frequent round trips to Intel infrastructure
|
||||
2. Intel can reproduce the certifying private key
|
||||
3. Involves unnecessary protocol steps and features we don't need (EPID)
|
||||
|
||||
### B. Use Intel's protocol to bootstrap our own certificate
|
||||
|
||||
This involves using Intel's current attestation protocol to have Intel sign over our own certifying enclave's
|
||||
certificate that derives its certification key using the sealing fuse values.
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. Certifying key not reproducible by Intel
|
||||
2. Allows for our own CPU enrollment process, should we need one
|
||||
3. Infrequent round trips to Intel infrastructure (only needed once per microcode update)
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. Still uses the EPID protocol
|
||||
|
||||
### C. Intercept Intel's recommended protocol
|
||||
|
||||
This involves using Intel's current protocol as is but instead of doing round trips to IAS to get signatures over quotes
|
||||
we try to establish the chain of trust during EPID provisioning and reuse it later.
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. Uses Intel's current protocol
|
||||
2. Infrequent round trips to Intel infrastructure
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. The provisioning protocol is underdocumented and it's hard to decipher how to construct the trust chain
|
||||
2. The chain of trust is not a traditional certificate chain but rather a sequence of signed messages
|
||||
|
||||
## Recommendation and justification
|
||||
|
||||
Proceed with Option B. This is the most readily available and flexible option.
|
@ -1,59 +0,0 @@
|
||||
![Corda](https://www.corda.net/wp-content/uploads/2016/11/fg005_corda_b.png)
|
||||
|
||||
--------------------------------------------
|
||||
Design Decision: Enclave language of choice
|
||||
============================================
|
||||
|
||||
## Background / Context
|
||||
|
||||
In the long run we would like to use the JVM for all enclave code. This is so that later on we can solve the problem of
|
||||
side channel attacks on the bytecode level (e.g. oblivious RAM) rather than putting this burden on enclave functionality
|
||||
implementors.
|
||||
|
||||
As we plan to use a JVM in the long run anyway and we already have an embedded Avian implementation I think the best
|
||||
course of action is to immediately use this together with the full JDK. To keep the native layer as minimal as possible
|
||||
we should forward enclave calls with little to no marshalling to the embedded JVM. All subsequent sanity checks,
|
||||
including ones currently handled by the edger8r generated code, should be done inside the JVM. Accessing native enclave
|
||||
functionality (including OCALLs and reading memory from untrusted heap) should be through a centrally defined JNI
|
||||
interface. This way when we switch from Avian we have a very clear interface to code against both from the hosted code's
|
||||
side and from the ECALL/OCALL side.
|
||||
|
||||
The question remains what the thin native layer should be written in. Currently we use C++, but various alternatives
|
||||
have popped up, most notably Rust.
|
||||
|
||||
## Options Analysis
|
||||
|
||||
### A. C++
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. The Intel SDK is written in C++
|
||||
2. [Reproducible binaries](https://wiki.debian.org/ReproducibleBuilds)
|
||||
3. The native parts of Avian, HotSpot and SubstrateVM are written in C/C++
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. Unsafe memory accesses (unless strict adherence to modern C++)
|
||||
2. Quirky build
|
||||
3. Larger attack surface
|
||||
|
||||
### B. Rust
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. Safe memory accesses
|
||||
2. Easier to read/write code, easier to audit
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. Does not produce reproducible binaries currently (but it's [planned](https://github.com/rust-lang/rust/issues/34902))
|
||||
2. We would mostly be using it for unsafe things (raw pointers, calling C++ code)
|
||||
|
||||
## Recommendation and justification
|
||||
|
||||
Proceed with Option A (C++) and keep the native layer as small as possible. Rust currently doesn't produce reproducible
|
||||
binary code, and we need the native layer mostly to handle raw pointers and call Intel SDK functions anyway, so we
|
||||
wouldn't really leverage Rust's safe memory features.
|
||||
|
||||
Having said that, once Rust implements reproducible builds we may switch to it, in this case the thinness of the native
|
||||
layer will be of big benefit.
|
@ -1,58 +0,0 @@
|
||||
![Corda](https://www.corda.net/wp-content/uploads/2016/11/fg005_corda_b.png)
|
||||
|
||||
--------------------------------------------
|
||||
Design Decision: Key-value store implementation
|
||||
============================================
|
||||
|
||||
This is a simple choice of technology.
|
||||
|
||||
## Options Analysis
|
||||
|
||||
### A. ZooKeeper
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. Tried and tested
|
||||
2. HA team already uses ZooKeeper
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. Clunky API
|
||||
2. No HTTP API
|
||||
3. Hand-rolled protocol
|
||||
|
||||
### B. etcd
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. Very simple API, UNIX philosophy
|
||||
2. gRPC
|
||||
3. Tried and tested
|
||||
4. MVCC
|
||||
5. Kubernetes uses it in the background already
|
||||
6. "Successor" of ZooKeeper
|
||||
7. Cross-platform, OSX and Windows support
|
||||
8. Resiliency, supports backups for disaster recovery
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. HA team uses ZooKeeper
|
||||
|
||||
### C. Consul
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. End to end discovery including UIs
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. Not very well spread
|
||||
2. Need to store other metadata as well
|
||||
3. HA team uses ZooKeeper
|
||||
|
||||
## Recommendation and justification
|
||||
|
||||
Proceed with Option B (etcd). It's practically a successor of ZooKeeper, the interface is quite simple, it focuses on
|
||||
primitives (CAS, leases, watches etc) and is tried and tested by many heavily used applications, most notably
|
||||
Kubernetes. In fact, we have the option to use etcd indirectly by writing Kubernetes extensions; this would have the
|
||||
advantage of getting readily available CLI and UI tools to manage an enclave cluster.
|
@ -1,81 +0,0 @@
|
||||
![Corda](https://www.corda.net/wp-content/uploads/2016/11/fg005_corda_b.png)
|
||||
|
||||
--------------------------------------------
|
||||
Design Decision: Strategic SGX roadmap
|
||||
============================================
|
||||
|
||||
## Background / Context
|
||||
|
||||
The statefulness of the enclave affects the complexity of both the infrastructure and attestation greatly.
|
||||
The infrastructure needs to take care of tracking enclave state for request routing, and we need extra care if we want
|
||||
to make sure that old keys cannot be used to reveal sealed secrets.
|
||||
|
||||
As the first step the easiest thing to do would be to provide an infrastructure for hosting *stateless* enclaves that
|
||||
are only concerned with enclave to non-enclave attestation. This provides a framework to do provable computations,
|
||||
without the headache of handling sealed state and the various implied upgrade paths.
|
||||
|
||||
In the first phase we want to facilitate the ease of rolling out full enclave images (JAR linked into the image)
|
||||
regardless of what the enclaves are doing internally. The contract of an enclave is the host-enclave API (attestation
|
||||
protocol) and the exposure of the static set of channels the enclave supports. Furthermore the infrastructure will allow
|
||||
deployment in a cloud environment and trivial scalability of enclaves through starting them on-demand.
|
||||
|
||||
The first phase will allow for a "fixed stateless provable computations as a service" product, e.g. provable builds or
|
||||
RNG.
|
||||
|
||||
The question remains on how we should proceed afterwards. In terms of infrastructure we have a choice of implementing
|
||||
sealed state or focusing on dynamic loading of bytecode. We also have the option to delay this decision until the end of
|
||||
the first phase.
|
||||
|
||||
## Options Analysis
|
||||
|
||||
### A. Implement sealed state
|
||||
|
||||
Implementing sealed state involves solving the routing problem, for this we can use the concept of active channel sets.
|
||||
Furthermore we need to solve various additional security issues around guarding sealed secret provisioning, most notably
|
||||
expiration checks. This would involve implementing a future-proof calendar time oracle, which may turn out to be
|
||||
impossible, or not quite good enough. We may decide that we cannot actually provide strong privacy guarantees and need
|
||||
to enforce epochs as mentioned [here](../details/time.md).
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. We would solve long term secret persistence early, allowing for a longer time frame for testing upgrades and
|
||||
reprovisioning before we integrate Corda
|
||||
2. Allows "fixed stateful provable computations as a service" product, e.g. HA encryption
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. There are some unsolved issues (Calendar time, sealing epochs)
|
||||
2. It would delay non-stateful Corda integration
|
||||
|
||||
### B. Implement dynamic code loading
|
||||
|
||||
Implementing dynamic loading involves sandboxing of the bytecode, providing bytecode verification and perhaps
|
||||
storage/caching of JARs (although it may be better to develop a more generic caching layer and use channels themselves
|
||||
to do the upload). Doing bytecode verification is quite involved as Avian does not support verification, so this
|
||||
would mean switching to a different JVM. This JVM would either be HotSpot or SubstrateVM, we are doing some preliminary
|
||||
exploratory work to assess their feasibility. If we choose this path it opens up the first true integration point with
|
||||
Corda by enabling semi-validating notaries - these are non-validating notaries that check an SGX signature over the
|
||||
transaction. It would also enable an entirely separate generic product for verifiable pure computation.
|
||||
|
||||
#### Advantages
|
||||
|
||||
1. Early adoption of Graal if we choose to go with it (the alternative is HotSpot)
|
||||
2. Allows first integration with Corda (semi-validating notaries)
|
||||
3. Allows "generic stateless provable computation as a service" product, i.e. anything expressible as a JAR
|
||||
4. Holding off on sealed state
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1. Too early Graal integration may result in maintenance headache later
|
||||
|
||||
## Recommendation and justification
|
||||
|
||||
Proceed with Option B, dynamic code loading. It would make us very early adopters of Graal (with the implied ups and
|
||||
downs), and most importantly kickstart collaboration between R3 and Oracle. We would also move away from Avian which we
|
||||
wanted to do anyway. It would also give us more time to think about the issues around sealed state, do exploratory work
|
||||
on potential solutions, and there may be further development from Intel's side. Furthermore we need dynamic loading for
|
||||
any fully fledged Corda integration, so we should finish this ASAP.
|
||||
|
||||
## Appendix: Proposed roadmap breakdown
|
||||
|
||||
![Dynamic code loading first](roadmap.png)
|
@ -1,84 +0,0 @@
|
||||
# SGX Infrastructure design
|
||||
|
||||
.. important:: This design document describes a feature of Corda Enterprise.
|
||||
|
||||
This document is intended as a design description of the infrastructure around the hosting of SGX enclaves, interaction
|
||||
with enclaves and storage of encrypted data. It assumes basic knowledge of SGX concepts, and some knowledge of
|
||||
Kubernetes for parts specific to that.
|
||||
|
||||
## High level description
|
||||
|
||||
The main idea behind the infrastructure is to provide a highly available cluster of enclave services (hosts) which can
|
||||
serve enclaves on demand. It provides an interface for enclave business logic that's agnostic with regards to the
|
||||
infrastructure, similar to serverless architectures. The enclaves will use an opaque reference
|
||||
to other enclaves or services in the form of enclave channels. Channels hide attestation details
|
||||
and provide a loose coupling between enclave/non-enclave functionality and specific enclave images/services implementing
|
||||
it. This loose coupling allows easier upgrade of enclaves, relaxed trust (whitelisting), dynamic deployment, and
|
||||
horizontal scaling as we can spin up enclaves dynamically on demand when a channel is requested.
|
||||
|
||||
For more information see:
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
||||
details/serverless.md
|
||||
details/channels.md
|
||||
|
||||
## Infrastructure components
|
||||
|
||||
Here are the major components of the infrastructure. Note that this doesn't include business logic specific
|
||||
infrastructure pieces (like ORAM blob storage for Corda privacy model integration).
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
||||
details/kv-store.md
|
||||
details/discovery.md
|
||||
details/host.md
|
||||
details/enclave-storage.md
|
||||
details/ias-proxy.md
|
||||
|
||||
## Infrastructure interactions
|
||||
|
||||
* **Enclave deployment**:
|
||||
This includes uploading of the enclave image/container to enclave storage and adding of the enclave metadata to the
|
||||
key-value store.
|
||||
|
||||
* **Enclave usage**:
|
||||
This includes using the discovery service to find a specific enclave image and a host to serve it, then connecting to
|
||||
the host, authenticating (attestation) and proceeding with the needed functionality.
|
||||
|
||||
* **Ops**:
|
||||
This includes management of the cluster (Kubernetes/Kubespray) and management of the metadata relating to discovery to
|
||||
control enclave deployment (e.g. canary, incremental, rollback).
|
||||
|
||||
## Decisions to be made
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
||||
decisions/roadmap.md
|
||||
decisions/certification.md
|
||||
decisions/enclave-language.md
|
||||
decisions/kv-store.md
|
||||
|
||||
## Further details
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
||||
details/attestation.md
|
||||
details/time.md
|
||||
details/enclave-deployment.md
|
||||
|
||||
## Example deployment
|
||||
|
||||
This is an example of how two Corda parties may use the above infrastructure. In this example R3 is hosting the IAS
|
||||
proxy and the enclave image store and the parties host the rest of the infrastructure, aside from Intel components.
|
||||
|
||||
Note that this is flexible, the parties may decide to host their own proxies (as long as they whitelist their keys) or
|
||||
the enclave image store (although R3 will need to have a repository of the signed enclaves somewhere).
|
||||
We may also decide to go the other way and have R3 host the enclave hosts and the discovery service, shared between
|
||||
parties (if e.g. they don't have access to/want to maintain SGX capable boxes).
|
||||
|
||||
![Example SGX deployment](ExampleSGXdeployment.png)
|
@ -1,92 +0,0 @@
|
||||
### Terminology recap
|
||||
|
||||
* **measurement**: The hash of an enclave image, uniquely pinning the code and related configuration
|
||||
* **report**: A data structure produced by an enclave including the measurement and other non-static properties of the
|
||||
running enclave instance (like the security version number of the hardware)
|
||||
* **quote**: A signed report of an enclave produced by Intel's quoting enclave.
|
||||
|
||||
# Attestation
|
||||
|
||||
The goal of attestation is to authenticate enclaves. We are concerned with two variants of this, enclave to non-enclave
|
||||
attestation and enclave to enclave attestation.
|
||||
|
||||
In order to authenticate an enclave we need to establish a chain of trust rooted in an Intel signature certifying that a
|
||||
report is coming from an enclave running on genuine Intel hardware.
|
||||
|
||||
Intel's recommended attestation protocol is split into two phases.
|
||||
|
||||
1. Provisioning
|
||||
The first phase's goal is to establish an Attestation Key(AK) aka EPID key, unique to the SGX installation.
|
||||
The establishment of this key uses an underdocumented protocol similar to the attestation protocol:
|
||||
- Intel provides a Provisioning Certification Enclave(PCE). This enclave has special privileges in that it can derive a
|
||||
key in a deterministic fashion based on the *provisioning* fuse values. Intel stores these values in their databases
|
||||
and can do the same derivation to later check a signature from PCE.
|
||||
- Intel provides a separate enclave called the Provisioning Enclave(PvE), also privileged, which interfaces with PCE
|
||||
(using local attestation) to certify the PvE's report and talks with a special Intel endpoint to join an EPID group
|
||||
anonymously. During the join Intel verifies the PCE's signature. Once the join happened the PvE creates a related
|
||||
private key(the AK) that cannot be linked by Intel to a specific CPU. The PvE seals this key (also sometimes referred
|
||||
to as the "EPID blob") to MRSIGNER, which means it can only be unsealed by Intel enclaves.
|
||||
|
||||
2. Attestation
|
||||
- When a user wants to do attestation of their own enclave they need to do so through the Quoting Enclave(QE), also
|
||||
signed by Intel. This enclave can unseal the EPID blob and use the key to sign over user provided reports
|
||||
- The signed quote in turn is sent to the Intel Attestation Service, which can check whether the quote was signed by a
|
||||
key in the EPID group. Intel also checks whether the QE was provided with an up-to-date revocation list.
|
||||
|
||||
The end result is a signature of Intel over a signature of the AK over the user enclave quote. Challengers can then
|
||||
simply check this chain to make sure that the user provided data in the quote (probably another key) comes from a
|
||||
genuine enclave.
|
||||
|
||||
All enclaves involved (PCE, PvE, QE) are owned by Intel, so this setup basically forces us to use Intel's infrastructure
|
||||
during attestation (which in turn forces us to do e.g. MutualTLS, maintain our own proxies etc). There are two ways we
|
||||
can get around this.
|
||||
|
||||
1. Hook the provisioning phase. During the last step of provisioning the PvE constructs a chain of trust rooted in
|
||||
Intel. If we can extract some provable chain that allows proving of membership based on an EPID signature then we can
|
||||
essentially replicate what IAS does.
|
||||
2. Bootstrap our own certification. This would involve deriving another certification key based on sealing fuse values
|
||||
and getting an Intel signature over it using the original IAS protocol. This signature would then serve the same
|
||||
purpose as the certificate in 1.
|
||||
|
||||
## Non-enclave to enclave channels
|
||||
|
||||
When a non-enclave connects to a "leaf" enclave the goal is to establish a secure channel between the non-enclave and
|
||||
the enclave by authenticating the enclave and possibly authenticating the non-enclave. In addition we want to provide
|
||||
secrecy of the non-enclave. To this end we can use SIGMA-I to do a Diffie-Hellman key exchange between the non-enclave
|
||||
identity and the enclave identity.
|
||||
|
||||
The enclave proves the authenticity of its identity by providing a certificate chain rooted in Intel. If we do our own
|
||||
enclave certification then the chain goes like this:
|
||||
|
||||
* Intel signs quote of certifying enclave containing the certifying key pair's public part.
|
||||
* Certifying key signs report of leaf enclave containing the enclave's temporary identity.
|
||||
* Enclave identity signs the relevant bits in the SIGMA protocol.
|
||||
|
||||
Intel's signature may be cached on disk, and the certifying enclave signature over the temporary identity may be cached
|
||||
in enclave memory.
|
||||
|
||||
We can provide various invalidations, e.g. non-enclave won't accept signature if X time has passed since Intel's
|
||||
signature, or R3's whitelisting cert expired etc.
|
||||
|
||||
If the enclave needs to authorise the non-enclave the situation is a bit more complicated. Let's say the enclave holds
|
||||
some secret that it should only reveal to authorised non-enclaves. Authorisation is expressed as a whitelisting
|
||||
signature over the non-enclave identity. How do we check the expiration of the whitelisting key's certificate?
|
||||
|
||||
Calendar time inside enclaves deserves its own [document](time.md), the gist is that we simply don't have access to time
|
||||
unless we trust a calendar time oracle.
|
||||
|
||||
Note however that we probably won't need in-enclave authorisation for *stateless* enclaves, as these have no secrets to
|
||||
reveal at all. Authorisation would simply serve as access control, and we can solve access control in the hosting
|
||||
infrastructure instead.
|
||||
|
||||
## Enclave to enclave channels
|
||||
|
||||
Doing remote attestation between enclaves is similar to enclave to non-enclave, only this time authentication involves
|
||||
verifying the chain of trust on both sides. However note that this is also predicated on having access to a calendar
|
||||
time oracle, as this time the expiration checks of the chain must be done inside enclaves. So in a sense both enclave to enclave
|
||||
and stateful enclave to non-enclave attestation forces us to trust a calendar time oracle.
|
||||
|
||||
But note that remote enclave to enclave attestation is mostly required when there *is* sealed state (secrets to share
|
||||
with the other enclave). One other use case is the reduction of audit surface, once it comes to that. We may be able to
|
||||
split stateless enclaves into components that have different upgrade lifecycles. By doing so we ease the auditors' job
|
||||
by reducing the enclaves' contracts and code size.
|
@ -1,75 +0,0 @@
|
||||
# Enclave channels
|
||||
|
||||
AWS Lambdas may be invoked by name, and are simple request-response type RPCs. The lambda's name abstracts the
|
||||
specific JAR or code image that implements the functionality, which allows upgrading of a lambda without disrupting
|
||||
the rest of the lambdas.
|
||||
|
||||
Any authentication required for the invocation is done by a different AWS service (IAM), and is assumed to be taken
|
||||
care of by the time the lambda code is called.
|
||||
|
||||
Serverless enclaves also require ways to be addressed, let's call these "enclave channels". Each such channel may be
|
||||
identified with a string similar to Lambdas, however unlike lambdas we need to incorporate authentication into the
|
||||
concept of a channel in the form of attestation.
|
||||
|
||||
Furthermore unlike Lambdas we can implement a generic two-way communication channel. This reintroduces state into the
|
||||
enclave logic. However note that this state is in-memory only, and because of the transient nature of enclaves (they
|
||||
may be "lost" at any point) enclave authors are in general incentivised to either keep in-memory state minimal (by
|
||||
sealing state) or make their functionality idempotent (allowing retries).
|
||||
|
||||
We should be able to determine an enclave's supported channels statically. Enclaves may store this data for example in a
|
||||
specific ELF section or a separate file. The latter may be preferable as it may be hard to have a central definition of
|
||||
channels in an ELF section if we use JVM bytecode. Instead we could have a specific static JVM datastructure that can be
|
||||
extracted from the enclave statically during the build.
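Purely as an illustration of that idea (all names here are hypothetical), the static declaration could be as simple as a Kotlin object that a build-time tool extracts:

```
// Hypothetical sketch: a static, build-time-extractable declaration of an enclave's channels.
object EnclaveChannels {
    // Every channel this enclave image can ever serve (its static contract).
    val supported: Set<String> = setOf(
        "provisioning/request-bootstrap-key",
        "provisioning/request-from-sibling",
        "signing/sign-payload"
    )

    // Channels active for a fresh instance with no sealed state; the hosting infrastructure can
    // scale these horizontally (see the discussion of sealed state below).
    val initiallyActive: Set<String> = setOf(
        "provisioning/request-bootstrap-key",
        "provisioning/request-from-sibling"
    )
}
```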
|
||||
|
||||
## Sealed state
|
||||
|
||||
Sealing keys tied to specific CPUs seem to throw a wrench in the requirement of statelessness. Routing a request to an
|
||||
enclave that has associated sealed state cannot be the same as routing to one which doesn't. How can we transparently
|
||||
scale enclaves like Lambdas if fresh enclaves by definition don't have associated sealed state?
|
||||
|
||||
Take key provisioning as an example: we want some key to be accessible by a number of enclaves, how do we
|
||||
differentiate between enclaves that have the key provisioned versus ones that don't? We need to somehow expose an
|
||||
opaque version of the enclave's sealed state to the hosting infrastructure for this.
|
||||
|
||||
The way we could do this is by expressing this state in terms of a changing set of "active" enclave channels. The
|
||||
enclave can statically declare the channels it potentially supports, and start with some initial subset of them as
|
||||
active. As the enclave's lifecycle (sealed state) evolves it may change this active set to something different,
|
||||
thereby informing the hosting infrastructure that it shouldn't route certain requests there, or that it can route some
|
||||
other ones.
|
||||
|
||||
Take the above key provisioning example. An enclave can be in two states, unprovisioned or provisioned. When it's
|
||||
unprovisioned its set of active channels will be related to provisioning (for example, request to bootstrap key or
|
||||
request from sibling enclave), when it's provisioned its active set will be related to the usage of the key and
|
||||
provisioning of the key itself to unprovisioned enclaves.
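
A hedged sketch of that two-state lifecycle is shown below; `KeyState`, the channel names and the host-facing `activeChannels()` call are illustrative assumptions, not a proposed API.

```kotlin
sealed class KeyState {
    object Unprovisioned : KeyState()
    data class Provisioned(val keyMaterial: ByteArray) : KeyState()
}

class KeyEnclaveLifecycle {
    @Volatile private var state: KeyState = KeyState.Unprovisioned

    // The active channel set is derived from the (sealed) state and reported to the host,
    // which uses it to decide which requests may be routed here.
    fun activeChannels(): Set<String> = when (state) {
        is KeyState.Unprovisioned -> setOf("key/bootstrap", "key/provision-from-sibling")
        is KeyState.Provisioned -> setOf("key/sign", "key/provision-to-sibling")
    }

    // As noted later, the state transition and the advertised channel set should be
    // updated atomically together with resealing.
    fun onProvisioned(keyMaterial: ByteArray) {
        state = KeyState.Provisioned(keyMaterial)
    }
}
```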
|
||||
|
||||
The enclave's initial set of active channels defines how enclaves may be scaled horizontally, as these are the
|
||||
channels that will be active for the freshly started enclaves without sealed state.
|
||||
|
||||
"Hold on" you might say, "this means we didn't solve the scalability of stateful enclaves!".
|
||||
|
||||
This is partly true. However in the above case we can force certain channels to be part of the initial active set! In
|
||||
particular the channels that actually use the key (e.g. for signing) may be made "stateless" by lazily requesting
|
||||
provisioning of the key from sibling enclaves. Enclaves may be spun up on demand, and as long as there is at least one
|
||||
sibling enclave holding the key it will be provisioned as needed. This hints at a general pattern of hiding stateful
|
||||
functionality behind stateless channels, if we want them to scale automatically.
|
||||
|
||||
Note that this doesn't mean we can't have external control over the provisioning of the key. For example we probably
|
||||
want to enforce redundancy across N CPUs. This requires the looping in of the hosting infrastructure, we cannot
|
||||
enforce this invariant purely in enclave code.
|
||||
|
||||
As we can see, the set of active enclave channels is inherently tied to the sealed state of the enclave, therefore we
|
||||
should make updating both of them an atomic operation.
|
||||
|
||||
### Side note
|
||||
|
||||
Another way to think about enclaves using sealed state is like an actor model. The sealed state is the actor's state,
|
||||
and state transitions may be executed by any enclave instance running on the same CPU. By transitioning the actor state
|
||||
one can also transition the type of messages the actor can receive atomically (= active channel set).
|
||||
|
||||
## Potential gRPC integration
|
||||
|
||||
It may be desirable to expose a built-in serialisation and network protocol. This would tie us to a specific protocol,
|
||||
but in turn it would ease development.
|
||||
|
||||
An obvious candidate for this is gRPC as it supports streaming and a specific serialization protocol. We need to
|
||||
investigate how we can integrate it so that channels are basically responsible for tunneling gRPC packets.
|
@ -1,88 +0,0 @@
|
||||
# Discovery
|
||||
|
||||
In order to understand enclave discovery and routing we first need to understand the mappings between CPUs, VMs and
|
||||
enclave hosts.
|
||||
|
||||
The cloud provider manages a number of physical machines (CPUs), each of those machines hosts a hypervisor which in
|
||||
turn hosts a number of guest VMs. Each VM in turn may host a number of enclave host containers (together with required
|
||||
supporting software like aesmd) and the sgx device driver. Each enclave host in turn may host several enclave instances.
|
||||
For the sake of simplicity let's assume that an enclave host may only host a single enclave instance per measurement.
|
||||
|
||||
We can figure out the identity of the CPU the VM is running on by using a dedicated enclave to derive a unique ID
|
||||
specific to the CPU. For this we can use EGETKEY with pre-defined inputs to derive a seal key sealed to MRENCLAVE. This
|
||||
provides a 128bit value reproducible only on the same CPU in this manner. Note that this is completely safe as the
|
||||
value won't be used for encryption and is specific to the measurement doing this. With this ID we can reason about
|
||||
physical locality of enclaves without looping in the cloud provider.
|
||||
Note: we should set OWNEREPOCH to a static value before doing this.
|
||||
|
||||
We don't need an explicit handle on the VM's identity, the mapping from VM to container will be handled by the
|
||||
orchestration engine (Kubernetes).
|
||||
|
||||
Similarly to VM identity, the specific host container's identity (IP address/DNS A record) is also tracked by Kubernetes;
|
||||
however we do need access to this identity in order to implement discovery.
|
||||
|
||||
When an enclave instance seals a secret that piece of data is tied to the measurement+CPU combo. The secret can only be
|
||||
revealed to an enclave with the same measurement running on the same CPU. However the management of this secret is
|
||||
tied to the enclave host container, which we may have several of running on the same CPU, possibly all of them hosting
|
||||
enclaves with the same measurement.
|
||||
|
||||
To solve this we can introduce a *sealing identity*. This is basically a generated ID/namespace for a collection of
|
||||
secrets belonging to a specific CPU. It is generated when a fresh enclave host starts up and subsequently the host will
|
||||
store sealed secrets under this ID. These secrets should survive host death, so they will be persisted in etcd (together
|
||||
with the associated active channel sets). Every host owns a single sealing identity, but not every sealing identity may
|
||||
have an associated host (e.g. in case the host died).
|
||||
|
||||
## Mapping to Kubernetes
|
||||
|
||||
The following mapping of the above concepts to Kubernetes concepts is not yet fleshed out and requires further
|
||||
investigation into Kubernetes capabilities.
|
||||
|
||||
VMs correspond to Nodes, and enclave hosts correspond to Pods. The host's identity is the same as the Pod's, which is
|
||||
the Pod's IP address/DNS A record. From Kubernetes's point of view enclave hosts provide a uniform stateless Headless
|
||||
Service. This means we can use their scaling/autoscaling features to provide redundancy across hosts (to balance load).
|
||||
|
||||
However we'll probably need to tweak their (federated?) ReplicaSet concept in order to provide redundancy across CPUs
|
||||
(to be tolerant of CPU failures), or perhaps use their anti-affinity feature somehow, to be explored.
|
||||
|
||||
The concept of a sealing identity is very close to the stable identity of Pods in Kubernetes StatefulSets. However I
|
||||
couldn't find a way to use this directly as we need to tie the sealing identity to the CPU identity, which in Kubernetes
|
||||
would translate to a requirement to pin stateful Pods to Nodes based on a dynamically determined identity. We could
|
||||
however write an extension to handle this metadata.
|
||||
|
||||
## Registration
|
||||
|
||||
When an enclave host is started it first needs to establish its sealing identity. To this end first it needs to check
|
||||
whether there are any sealing identities available for the CPU it's running on. If not it can generate a fresh one and
|
||||
lease it for a period of time (and update the lease periodically) and atomically register its IP address in the process.
|
||||
If an existing identity is available the host can take over it by leasing it. There may be existing Kubernetes
|
||||
functionality to handle some of this.
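
A rough sketch of that registration logic follows, against a hypothetical `KeyValueStore` interface (the document does not prescribe a concrete client API, so both the interface and the key layout are assumptions):

```kotlin
import java.time.Duration
import java.util.UUID

interface KeyValueStore {
    /** Sealing identity IDs already recorded for this CPU. */
    fun sealingIdentitiesFor(cpuId: String): List<String>
    /** Atomically leases the identity and registers the host address; returns false if already leased. */
    fun tryLease(sealingId: String, hostAddress: String, ttl: Duration): Boolean
}

fun registerHost(store: KeyValueStore, cpuId: String, hostAddress: String, ttl: Duration): String {
    // Prefer taking over an existing, unleased sealing identity for this CPU.
    for (existing in store.sealingIdentitiesFor(cpuId)) {
        if (store.tryLease(existing, hostAddress, ttl)) return existing
    }
    // Otherwise generate a fresh identity and lease it; the lease must then be renewed periodically (not shown).
    val fresh = "sealing-${UUID.randomUUID()}"
    check(store.tryLease(fresh, hostAddress, ttl)) { "could not lease freshly generated sealing identity" }
    return fresh
}
```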
|
||||
|
||||
Non-enclave services (like blob storage) could register similarly, but in this case we can take advantage of Kubernetes'
|
||||
existing discovery infrastructure to abstract a service behind a Service cluster IP. We do need to provide the metadata
|
||||
about supported channels though.
|
||||
|
||||
## Resolution
|
||||
|
||||
The enclave/service discovery problem boils down to:
|
||||
"Given a channel, my trust model and my identity, give me an enclave/service that serves this channel, trusts me, and I
|
||||
trust them".
|
||||
|
||||
This may be done in the following steps:
|
||||
|
||||
1. Resolve the channel to a set of measurements supporting it
|
||||
2. Filter the measurements to trusted ones and ones that trust us
|
||||
3. Pick one of the measurements randomly
|
||||
4. Find an alive host that has the channel in its active set for the measurement
|
||||
|
||||
1 may be done by maintaining a channel -> measurements map in etcd. This mapping would effectively define the enclave
|
||||
deployment and would be the central place to control incremental roll-out or rollbacks.
|
||||
|
||||
2 requires storing of additional metadata per advertised channel, namely a datastructure describing the enclave's trust
|
||||
predicate. A similar datastructure is provided by the discovering entity - these two predicates can then be used to
|
||||
filter measurements based on trust.
|
||||
|
||||
3 is where we may want to introduce more control if we want to support incremental roll-out/canary deployments.
|
||||
|
||||
4 is where various (non-MVP) optimisation considerations come to mind. We could add a loadbalancer, do autoscaling based
|
||||
on load (although Kubernetes already provides support for this), could have a preference for looping back to the same
|
||||
host to allow local attestation, or ones that have the enclave image cached locally or warmed up.
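
For concreteness, the four steps might be sketched as below; `DiscoveryStore`, `Measurement` and `TrustPredicate` are placeholder types, and the trust predicates are reduced to simple measurement sets.

```kotlin
data class Measurement(val hash: String)
data class TrustPredicate(val trusted: Set<Measurement>)

interface DiscoveryStore {
    fun measurementsFor(channel: String): Set<Measurement>                                   // channel -> measurements map
    fun trustPredicateOf(measurement: Measurement): TrustPredicate                           // per-measurement trust metadata
    fun aliveHostsWithActiveChannel(measurement: Measurement, channel: String): List<String> // hosts advertising the channel
}

fun resolve(store: DiscoveryStore, channel: String, us: Measurement, ourTrust: TrustPredicate): String? {
    val mutuallyTrusted = store.measurementsFor(channel)                                     // 1. resolve the channel
        .filter { it in ourTrust.trusted && us in store.trustPredicateOf(it).trusted }       // 2. filter on mutual trust
    val chosen = mutuallyTrusted.shuffled().firstOrNull() ?: return null                     // 3. pick one randomly
    return store.aliveHostsWithActiveChannel(chosen, channel).firstOrNull()                  // 4. find an alive host
}
```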
|
@ -1,16 +0,0 @@
|
||||
# Enclave deployment
|
||||
|
||||
What happens if we roll out a new enclave image?
|
||||
|
||||
In production we need to sign the image directly with the R3 key as MRSIGNER (process to be designed), as well as create
|
||||
any whitelisting signatures needed (e.g. from auditors) in order to allow existing enclaves to trust the new one.
|
||||
|
||||
We need to make the enclave build sources available to users - we can package this up as a single container pinning all
|
||||
build dependencies and source code. Docker style image layering/caching will come in handy here.
|
||||
|
||||
Once the image, build containers and related signatures are created we need to push this to the main R3 enclave storage.
|
||||
|
||||
Enclave infrastructure owners (e.g. Corda nodes) may then start using the images depending on their upgrade policy. This
|
||||
involves updating their key value store so that new channel discovery requests resolve to the new measurement, which in
|
||||
turn will trigger the image download on demand on enclave hosts. We can potentially add pre-caching here to reduce
|
||||
latency for first-time enclave users.
|
@ -1,7 +0,0 @@
|
||||
# Enclave storage
|
||||
|
||||
The enclave storage is a simple static content server. It should allow uploading of and serving of enclave images based
|
||||
on their measurement. We may also want to store metadata about the enclave build itself (e.g. github link/commit hash).
|
||||
|
||||
We may need to extend its responsibilities to serve other SGX related static content such as whitelisting signatures
|
||||
over measurements.
|
@ -1,11 +0,0 @@
|
||||
# Enclave host
|
||||
|
||||
An enclave host's responsibility is the orchestration of the communication with hosted enclaves.
|
||||
|
||||
It is responsible for:
|
||||
* Leasing a sealing identity
|
||||
* Getting a CPU certificate in the form of an Intel-signed quote
|
||||
* Downloading and starting of requested enclaves
|
||||
* Driving attestation and subsequent encrypted traffic
|
||||
* Using discovery to connect to other enclaves/services
|
||||
* Various caching layers (and invalidation of) for the CPU certificate, hosted enclave quotes and enclave images
|
@ -1,10 +0,0 @@
|
||||
# IAS proxy
|
||||
|
||||
The Intel Attestation Service proxy's responsibility is simply to forward requests to and from the IAS.
|
||||
|
||||
The reason we need this proxy is that Intel requires us to do mutual TLS with them for each attestation round trip.
|
||||
For this we need an R3 maintained private key, and as we want third parties to be able to do attestation we need to
|
||||
store this private key in these proxies.
|
||||
|
||||
Alternatively we may decide to circumvent this mutual TLS requirement completely by distributing the private key with
|
||||
the host containers.
|
@ -1,13 +0,0 @@
|
||||
# Key-value store
|
||||
|
||||
To solve enclave to enclave and enclave to non-enclave communication we need a way to route requests correctly. There
|
||||
are readily available discovery solutions out there; however, we have some special requirements because of the inherent
|
||||
statefulness of enclaves (route to enclave with correct state) and the dynamic nature of trust between them (route to
|
||||
enclave I can trust and that trusts me). To store metadata about discovery we need some kind of distributed
|
||||
key-value store.
|
||||
|
||||
The key-value store needs to store information about the following entities:
|
||||
* Enclave image: measurement and supported channels
|
||||
* Sealing identity: the sealing ID, the corresponding CPU ID and the host leasing it (if any)
|
||||
* Sealed secret: the sealing ID, the sealing measurement, the sealed secret and corresponding active channel set
|
||||
* Enclave deployment: mapping from channel to set of measurements
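
For illustration only, these entities might be modelled roughly as follows (the field names are guesses, not a schema):

```kotlin
data class EnclaveImage(val measurement: String, val supportedChannels: Set<String>)
data class SealingIdentity(val sealingId: String, val cpuId: String, val leasedByHost: String?)
data class SealedSecret(
    val sealingId: String,
    val sealingMeasurement: String,
    val sealedBlob: ByteArray,
    val activeChannels: Set<String>
)
data class EnclaveDeployment(val channelToMeasurements: Map<String, Set<String>>)
```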
|
@ -1,33 +0,0 @@
|
||||
# Serverless architectures
|
||||
|
||||
In 2014 Amazon launched AWS Lambda, which they coined a "serverless architecture". It essentially creates an abstraction
|
||||
layer which hides the infrastructure details. Users provide "lambdas", which are stateless functions that may invoke
|
||||
other lambdas, access other AWS services etc. Because Lambdas are inherently stateless (any state they need must be
|
||||
accessed through a service) they may be loaded and executed on demand. This is in contrast with microservices, which
|
||||
are inherently stateful. Internally AWS caches the lambda images and even caches JIT compiled/warmed up code in order
|
||||
to reduce latency. Furthermore the lambda invocation interface provides a convenient way to scale these lambdas: as the
|
||||
functions are stateless, AWS can spin up new VMs to push lambda functions to. The user simply pays for CPU usage, all
|
||||
the infrastructure pain is hidden by Amazon.
|
||||
|
||||
Google and Microsoft followed suit in a couple of years with Cloud Functions and Azure Functions.
|
||||
|
||||
This way of splitting hosting computation from a hosted restricted computation is not a new idea, examples are web
|
||||
frameworks (web server vs application), MapReduce (Hadoop vs mappers/reducers), or even the cloud (hypervisors vs vms)
|
||||
and the operating system (kernel vs userspace). The common pattern is: the hosting layer hides some kind of complexity,
|
||||
imposes some restriction on the guest layer (and provides a simpler interface in turn), and transparently multiplexes
|
||||
a number of resources for them.
|
||||
|
||||
The relevant key features of serverless architectures are 1. on-demand scaling and 2. business logic independent of
|
||||
hosting logic.
|
||||
|
||||
# Serverless SGX?
|
||||
|
||||
How are Amazon Lambdas relevant to SGX? Enclaves exhibit very similar features to Lambdas: they are pieces of business
|
||||
logic completely independent of the hosting functionality. Not only that, enclaves treat hosts as adversaries! This
|
||||
provides a very clean separation of concerns which we can exploit.
|
||||
|
||||
If we could provide a similar infrastructure for enclaves as Amazon provides for Lambdas it would not only allow easy
|
||||
HA and scaling, it would also decouple the burden of maintaining the infrastructure from the enclave business logic.
|
||||
Furthermore our plan of using the JVM within enclaves also aligns with the optimizations Amazon implemented (e.g.
|
||||
keeping warmed up enclaves around). Optimizations like upgrading to local attestation also become orthogonal to
|
||||
enclave business logic. Enclave code can focus on the specific functionality at hand, everything else is taken care of.
|
@ -1,69 +0,0 @@
|
||||
# Time in enclaves
|
||||
|
||||
In general we know that any one crypto algorithm will be broken in X years time. The usual way to mitigate this is by
|
||||
using certificate expiration. If a peer with an expired certificate tries to connect we reject it in order to enforce
|
||||
freshness of their key.
|
||||
|
||||
In order to check certificate expiration we need some notion of calendar time. However in SGX's threat model the host
|
||||
of the enclave is considered malicious, so we cannot rely on their notion of time. Intel provides trusted time through
|
||||
their PSW, however this uses the Management Engine which is known to be a proprietary vulnerable piece of architecture.
|
||||
|
||||
Therefore in order to check calendar time in general we need some kind of time oracle. We can burn in the oracle's
|
||||
identity to the enclave and request timestamped signatures from it. This already raises questions with regards to the
|
||||
oracle's identity itself, however for the time being let's assume we have something like this in place.
|
||||
|
||||
### Timestamped nonces
|
||||
|
||||
The most straightforward way to implement calendar time checks is to generate a nonce *after* DH exchange, send it to
|
||||
the oracle and have it sign over it with a timestamp. The nonce is required to avoid replay attacks. A malicious host
|
||||
may delay the delivery of the signature indefinitely, even until after the certificate expires. However note that the
|
||||
DH happened before the nonce was generated, which means even if an attacker can crack the expired key they would not be
|
||||
able to steal the DH session, only try creating new ones, which will fail at the timestamp check.
|
||||
|
||||
This seems to work; note however that it would impose a full round trip to the oracle *per DH exchange*.
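
A rough sketch of the check on the enclave side is given below; the signature scheme, the message layout and the `TimeOracleVerifier` name are assumptions made for illustration, not a specification.

```kotlin
import java.security.PublicKey
import java.security.SecureRandom
import java.security.Signature
import java.time.Instant

class TimeOracleVerifier(private val oracleKey: PublicKey, private val peerCertExpiry: Instant) {
    // Generated *after* the DH exchange, so a cracked expired key cannot hijack an existing session.
    val nonce: ByteArray = ByteArray(32).also { SecureRandom().nextBytes(it) }

    fun checkOracleReply(timestampEpochMillis: Long, signature: ByteArray): Boolean {
        val verifier = Signature.getInstance("SHA256withECDSA")
        verifier.initVerify(oracleKey)
        verifier.update(nonce)
        verifier.update(timestampEpochMillis.toString().toByteArray())
        // The oracle signed (nonce || timestamp); reject if the peer's certificate had already expired at that time.
        return verifier.verify(signature) && Instant.ofEpochMilli(timestampEpochMillis).isBefore(peerCertExpiry)
    }
}
```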
|
||||
|
||||
### Timestamp-encrypted channels
|
||||
|
||||
In order to reduce the round trips required for timestamp checking we can invert the responsibility of checking of the
|
||||
timestamp. We can do this by encrypting the channel traffic with an additional key generated by the enclave but that can
|
||||
only be revealed by the time oracle. The enclave encrypts the encryption key with the oracle's public key so the peer
|
||||
trying to communicate with the enclave must forward the encrypted key to the oracle. The oracle in turn will check the
|
||||
timestamp and reveal the contents (perhaps double encrypted with a DH-derived key). The peer can cache the key and later
|
||||
use the same encryption key with the enclave. It is then the peer's responsibility to get rid of the key after a while.
|
||||
|
||||
Note that this mitigates attacks where the attacker is a third party trying to exploit an expired key, but this method
|
||||
does *not* mitigate against malicious peers that keep the encryption key around until after expiration (= they "become"
|
||||
malicious).
|
||||
|
||||
### Oracle key break
|
||||
|
||||
So given an oracle we can secure a channel against expired keys and potentially improve performance by trusting
|
||||
once-authorized enclave peers to not become malicious.
|
||||
|
||||
However what happens if the oracle key itself is broken? There's a chicken-and-egg problem where we can't check the
|
||||
expiration of the time oracle's certificate itself! Once the oracle's key is broken an attacker can fake timestamping
|
||||
replies (or decrypt the timestamp encryption key), which in turn allows it to bypass the expiration check.
|
||||
|
||||
The main issue with this is in relation to sealed secrets, and sealed secret provisioning between enclaves. If an
|
||||
attacker can fake being e.g. an authorized enclave then it can extract old secrets. We have yet to come up with a
|
||||
solution to this, and I don't think it's possible.
|
||||
|
||||
Knowing that current crypto algorithms are bound to be broken at *some* point in the future, instead of trying
|
||||
to make sealing future-proof we can become explicit about the time-boundness of security guarantees.
|
||||
|
||||
### Sealing epochs
|
||||
|
||||
Let's call the time period in which a certain set of algorithms are considered safe a *sealing epoch*. During this
|
||||
period sealed data at rest is considered to be secure. However once the epoch finishes old sealed data is considered to
|
||||
be potentially compromised. We can then think of sealed data as an append-only log of secrets with overlapping epoch
|
||||
intervals where the "breaking" of old epochs is constantly catching up with new ones.
|
||||
|
||||
In order to make sure that this works we need to enforce an invariant where secrets only flow from old epochs to newer
|
||||
ones, never the other way around.
|
||||
|
||||
This translates to the ledger nicely, data in old epochs are generally not valuable anymore, so it's safe to consider
|
||||
them compromised. Note however that in the privacy model an epoch transition requires a full re-provisioning of the
|
||||
ledger to the new set of algorithms/enclaves.
|
||||
|
||||
In any case this is an involved problem, and I think we should defer the fleshing out of it for now as we won't need it
|
||||
for the first round of stateless enclaves.
|
@ -1,317 +0,0 @@
|
||||
# SGX Integration
|
||||
|
||||
This document is intended as a design description of how we can go about integrating SGX with Corda. As the
|
||||
infrastructure design of SGX is quite involved (detailed elsewhere) but otherwise flexible we can discuss the possible
|
||||
integration points separately, without delving into lower level technical detail.
|
||||
|
||||
For the purposes of this document we can think of SGX as a way to provision secrets to a remote node with the
|
||||
knowledge that only trusted code(= enclave) will operate on it. Furthermore it provides a way to durably encrypt data
|
||||
in a scalable way while also ensuring that the encryption key is never leaked (unless the encrypting enclave is
|
||||
compromised).
|
||||
|
||||
Broadly speaking there are two dimensions to deciding how we can integrate SGX: *what* we store in the ledger and
|
||||
*where* we store it.
|
||||
|
||||
The first dimension is the what: this relates to what we so far called the "integrity model" vs the "privacy model".
|
||||
|
||||
In the **integrity model** we rely on SGX to ensure the integrity of the ledger. Using this assumption we can cut off
|
||||
the transaction body and only store an SGX-backed signature over filtered transactions. Namely we would only store
|
||||
information required for notarisation of the current and subsequent spending transactions. This seems neat on first
|
||||
sight, however note that if we do this naively then if an attacker can impersonate an enclave they'll gain write
|
||||
access to the ledger, as the fake enclave can sign transactions as valid without having run verification.
|
||||
|
||||
In the **privacy model** we store the full transaction backchain (encrypted) and we keep provisioning it between nodes
|
||||
on demand, just like in the current Corda implementation. This means we only rely on SGX for the privacy aspects - if
|
||||
an enclave is compromised we only lose privacy, the verification cannot be eluded by providing a fake signature.
|
||||
|
||||
The other dimension is the where: currently in non-SGX Corda the full transaction backchain is provisioned between non-
|
||||
notary nodes, and is also provisioned to notaries in the case they are validating ones. With SGX+BFT notaries we have
|
||||
the possibility to offload the storage of the encrypted ledger (or encrypted signatures thereof) to notary nodes (or
|
||||
dedicated oracles) and only store bookkeeping information required for further ledger updates in non-notary nodes. The
|
||||
storage policy is very important, customers want control over the persistence of even encrypted data, and with the
|
||||
introduction of recent regulation (GDPR) unrestricted provisioning of sensitive data will be illegal by law, even when
|
||||
encrypted.
|
||||
|
||||
We'll explore the different combinations of choices below. Note that we don't need to commit to any one of them; we may
|
||||
decide to implement several.
|
||||
|
||||
## Privacy model + non-notary provisioning
|
||||
|
||||
Let's start with the model that's closest to the current Corda implementation as this is an easy segue into the
|
||||
possibilities with SGX. We also have a simple example and a corresponding neat diagram (thank you Kostas!!) we showed
|
||||
to a member bank Itau to indicate in a semi-handwavy way what the integration will look like.
|
||||
|
||||
We have a cordapp X used by node A and B. The cordapp contains a flow XFlow and a (deterministic) contract XContract.
|
||||
The two nodes are negotiating a transaction T2. T2 consumes a state that comes from transaction T1.
|
||||
|
||||
Let's assume that both A and B are happy with T2, except Node A hasn't established the validity of it yet. Our goal is
|
||||
to prove the validity of T2 to A without revealing the details of T1.
|
||||
|
||||
The following diagram shows an overview of how this can be achieved. Note that the diagram is highly oversimplified
|
||||
and is meant to communicate the high-level data flow relevant to Corda.
|
||||
|
||||
![SGX Provisioning](SgxProvisioning.png "SGX Provisioning")
|
||||
|
||||
* In order to validate T2, A asks its enclave whether T2 is valid.
|
||||
* The enclave sees that T2 depends on T1, so it consults its sealed ledger whether it contains T1.
|
||||
* If it does then this means T1 has been verified already, so the enclave moves on to the verification of T2.
|
||||
* If the ledger doesn't contain T1 then the enclave needs to retrieve it from node B.
|
||||
* In order to do this A's enclave needs to prove to B's enclave that it is indeed a trusted enclave B can provision T1
|
||||
to. This proof is what the attestation process provides.
|
||||
* Attestation is done in the clear: (TODO attestation diagram)
|
||||
* A's enclave generates a keypair, the public part of which is sent to Node B in a datastructure signed by Intel,
|
||||
this is called the quote(1).
|
||||
* Node B's XFlow may do various checks on this datastructure that cannot be performed by B's enclave, for example
|
||||
checking of the timeliness of Intel's signature(2).
|
||||
* Node B's XFlow then forwards the quote to B's enclave, which will check Intel's signature and whether it trusts A's
|
||||
enclave. For the sake of simplicity we can assume this to be a strict check that A is running the exact same
|
||||
enclave B is.
|
||||
* At this point B's enclave has established trust in A's enclave, and has the public part of the key generated by A's
|
||||
enclave.
|
||||
* The nodes repeat the above process the other way around so that A's enclave establishes trust in B's and gets hold
|
||||
of B's public key(3).
|
||||
* Now they proceed to perform an ephemeral Diffie-Hellman key exchange using the keys in the quotes(4).
|
||||
* The ephemeral key is then used to encrypt further communication. Beyond this point the nodes' flows (and anything
|
||||
outside of the enclaves) have no way of seeing what data is being exchanged, all the nodes can do is forward the
|
||||
encrypted messages.
|
||||
* Once attestation is done B's enclave provisions T1 to A's enclave using the DH key. If there are further
|
||||
dependencies those would be provisioned as well.
|
||||
* A's enclave then proceeds to verify T1 using the embedded deterministic JVM to run XContract. The verified
|
||||
transaction is then sealed to disk(5). We repeat this for T2.
|
||||
* If verification or attestation fails at any point the enclave returns to A's XFlow with a failure. Otherwise if all
|
||||
is good the enclave returns with a success. At this point A's XFlow knows that T2 is valid, but hasn't seen T1 in
|
||||
the clear.
|
||||
|
||||
(1) This is simplified, the actual protocol is a bit different. Namely the quote is not generated every time A requires provisioning, but is rather generated periodically.
|
||||
|
||||
(2) There is a way to do this check inside the enclave, however it requires switching on of the Intel ME which in general isn't available on machines in the cloud and is known to have vulnerabilities.
|
||||
|
||||
(3) We need symmetric trust even if the secrets seem to only flow from B to A. Node B may try to fake being an enclave to fish for information from A.
|
||||
|
||||
(4) The generated keys in the quotes are used to authenticate the respective parts of the DH key exchange.
|
||||
|
||||
(5) Sealing means encryption of data using a key unique to the enclave and CPU. The data may be subsequently unsealed (decrypted) by the enclave, even if the enclave was restarted. Also note that there is another layer of abstraction needed which we don't detail here, needed for redundancy of the encryption key.
|
||||
|
||||
To summarise, the journey of T1 is:
|
||||
|
||||
1. Initially it's sitting encrypted in B's storage.
|
||||
2. B's enclave decrypts it using its seal key specific to B's enclave + CPU combination.
|
||||
3. B's enclave encrypts it using the ephemeral DH key.
|
||||
4. The encrypted transaction is sent to A. The safety of this (namely that A's enclave doesn't reveal the transaction to node A) hinges on B's enclave's trust in A's enclave, which is expressed as a check of A's enclave measurement during attestation, which in turn requires auditing of A's enclave code and reproducing of the measurement.
|
||||
5. A's enclave decrypts the transaction using the DH key.
|
||||
6. A's enclave verifies the transaction using a deterministic JVM.
|
||||
7. A's enclave encrypts the transaction using A's seal key specific to A's enclave + CPU combination.
|
||||
8. The encrypted transaction is stored in A's storage.
|
||||
|
||||
As we can see in this model each non-notary node runs its own SGX enclave and related storage. Validation of the
|
||||
backchain happens by secure provisioning of it between enclaves, plus subsequent verification and storage. However
|
||||
there is one important thing missing from the example (actually it has several, but those are mostly technical detail):
|
||||
the notary!
|
||||
|
||||
In reality we cannot establish the full validity of T2 at this point of negotiation, we need to first notarise it.
|
||||
This model gives us some flexibility in this regard: we can use a validating notary (also running SGX) or a
|
||||
non-validating one. This indicates that the enclave API should be split in two, mirroring the signature check choice
|
||||
in SignedTransaction.verify. Only when the transaction is fully signed and notarised should it be persisted (sealed).
|
||||
|
||||
This model has both advantages and disadvantages. On one hand it is the closest to what we have now - we (and users)
|
||||
are familiar with this model, we can fairly easily nest it into the existing codebase and it gives us flexibility with
|
||||
regards to notary modes. On the other hand it is only a compromise when it comes to the regulatory problem. If we use non-
|
||||
validating notaries then the backchain storage is restricted to participants, however consider the following example:
|
||||
if we have a transaction X that parties A and B can process legally, but a later transaction Y that has X in its
|
||||
backchain is sent for verification to party C, then C will process and store X as well, which may be illegal.
|
||||
|
||||
## Privacy model + notary provisioning
|
||||
|
||||
This model would work similarly to the previous one, except non-notary nodes wouldn't need to run SGX or care about
|
||||
storage of the encrypted ledger, it would all be done in notary nodes. Nodes would connect to SGX capable notary nodes,
|
||||
and after attestation the nodes can be sure that the notary has run verification before signing.
|
||||
|
||||
This fixes the choice of using validating notaries, as notaries would be the only entities capable of verification:
|
||||
only they have access to the full backchain inside enclaves.
|
||||
|
||||
Note that because we still provision the full backchain between notary members for verification, we don't necessarily
|
||||
need a BFT consensus on validity - if an enclave is compromised an invalid transaction will be detected at the next
|
||||
backchain provisioning.
|
||||
|
||||
This model reduces the number of responsibilities of a non-notary node, in particular it wouldn't need to provide
|
||||
storage for the backchain or verification, but could simply trust notary signatures. Also it wouldn't need to host SGX
|
||||
enclaves, only partake in the DH exchange with notary enclaves. The node's responsibilities would be reduced to the
|
||||
orchestration of ledger updates (flows) and related bookkeeping (vault, network map). This split would also enable us
|
||||
to be flexible with regards to the update orchestration: trust in the validity of the ledger would cease to depend on
|
||||
the transaction resolution currently embedded into flows - we could provide a from-scratch light-weight implementation
|
||||
of a "node" (say a mobile app) that doesn't use flows and related code at all, it just needs to be able to connect to
|
||||
notary enclaves to notarise, validity is taken care of by notaries.
|
||||
|
||||
Note that although we wouldn't require validation checks from non-notary nodes, in theory it would be safe to allow
|
||||
them to do so (if they want a stronger-than-BFT guarantee).
|
||||
|
||||
Of course this model has disadvantages too. From the regulatory point of view it is a strictly worse solution than the
|
||||
non-notary provisioning model: the backchain would be provisioned between notary nodes not owned by actual
|
||||
participants in the backchain. It also prevents us from using non-validating notaries.
|
||||
|
||||
## Integrity model + non-notary provisioning
|
||||
|
||||
In this model we would trust SGX-backed signatures and related attestation datastructures (quote over signature key
|
||||
signed by Intel) as proof of validity. When node A and B are negotiating a transaction it's enough to provision SGX
|
||||
signatures over the dependency hashes to one another, there's no need to provision the full backchain.
|
||||
|
||||
This sounds very simple and efficient, and it's even more private than the privacy model as we're only passing
|
||||
signatures around, not transactions. However there are a couple of issues that need addressing: If an SGX enclave is
|
||||
compromised a malicious node can provide a signature over an invalid transaction that checks out, and nobody will ever
|
||||
know about it, because the original transaction will never be verified. One way we can mitigate this is by requiring a
|
||||
BFT consensus signature, or perhaps a threshold signature is enough. We could decouple verification into "verifying
|
||||
oracles" which verify in SGX and return signatures over transaction hashes, and require a certain number of them to
|
||||
convince the notary to notarise and subsequent nodes to trust validity. Another issue is enclave updates. If we find a
|
||||
vulnerability in an enclave and update it, what happens to the already signed backchain? Historical transactions have
|
||||
signatures that are rooted in SGX quotes belonging to old untrusted enclave code. One option is to simply have a
|
||||
cutoff date before which we accept old signatures. This requires a consensus-backed timestamp on the notary signature.
|
||||
Another option would be to keep the old ledger around and re-verify it with the new enclaves. However if we do this we
|
||||
lose the benefits of the integrity model - we get back the regulatory issue, and we don't gain the performance benefits.
|
||||
|
||||
## Integrity model + notary provisioning
|
||||
|
||||
This is similar to the previous model, only once again non-notary nodes wouldn't need to care about verifying or
|
||||
collecting proofs of validity before sending the transaction off for notarisation. All of the complexity would be
|
||||
hidden by notary nodes, which may use validating oracles or perhaps combine consensus over validity with consensus
|
||||
over spending. This model would be a very clean separation of concerns which solves the regulatory problem (almost)
|
||||
and is quite efficient as we don't need to keep provisioning the chain. One potential issue with regards to regulation
|
||||
is the tip of the ledger (the transaction being notarised) - this is sent to notaries and although it is not stored it
|
||||
may still be against the law to receive it and hold it in volatile memory, even inside an enclave. I'm unfamiliar with
|
||||
the legal details of whether this is good enough. If this is an issue, one way we could address this would be to scope
|
||||
the validity checks required for notarisation within legal boundaries and only require "full" consensus on the
|
||||
spentness check. Of course this has the downside that ledger participants outside of the regulatory boundary need to
|
||||
trust the BFT-SGX of the scope. I'm not sure whether it's possible to do any better, after all we can't send the
|
||||
transaction body outside the scope in any shape or form.
|
||||
|
||||
## Threat model
|
||||
|
||||
In all models we have the following actors, which may or may not overlap depending on the model:
|
||||
|
||||
* Notary quorum members
|
||||
* Non-notary nodes/entities interacting with the ledger
|
||||
* Identities owning the verifying enclave hosting infrastructure
|
||||
* Identities owning the encrypted ledger/signature storage infrastructure
|
||||
* R3 = enclave whitelisting identity
|
||||
* Network Map = contract whitelisting identity
|
||||
* Intel
|
||||
|
||||
We have two major ways of compromise:
|
||||
|
||||
* compromise of a non-enclave entity (notary, node, R3, Network Map, storage)
|
||||
* compromise of an enclave.
|
||||
|
||||
In the case of **notaries** compromise means malicious signatures, for **nodes** it's malicious transactions, for **R3**
|
||||
it's signing malicious enclaves, for **Network Map** it's signing malicious contracts, for **storage** it's read-write
|
||||
access to encrypted data, and for **Intel** it's forging of quotes or signing over invalid ones.
|
||||
|
||||
A compromise of an **enclave** means some form of access to the enclave's temporary identity key. This may happen
|
||||
through direct hardware compromise (extracting of fuse values) and subsequent forging of a quote, or leaking of secrets
|
||||
through weaknesses of the enclave-host boundary or other side-channels like Spectre (hacking). In any case it allows an
|
||||
adversary to impersonate an enclave and therefore to intercept enclave traffic and forge signatures.
|
||||
|
||||
The actors relevant to SGX are enclave hosts, storage infrastructure owners, regular nodes and R3.
|
||||
|
||||
* **Enclave hosts**: enclave code is specifically written with malicious (compromised) hosts in mind. That said we
|
||||
cannot be 100% secure against yet undiscovered side channel attacks and other vulnerabilities, so we need to be
|
||||
prepared for the scenario where enclaves get compromised. The privacy model effectively solves this problem by
|
||||
always provisioning and re-verifying the backchain. An impersonated enclave may be able to see what's on the ledger,
|
||||
but tampering with it will not check out at the next provisioning. On the other hand if a compromise happens in the
|
||||
integrity model an attacker can forge a signature over validity. We can mitigate this with a BFT guarantee by
|
||||
requiring a consensus over validity. This way we effectively provide the same guarantee for validity as notaries
|
||||
provide with regards to double spend.
|
||||
|
||||
* **Storage infrastructure owner**:
|
||||
* A malicious actor would need to crack the encryption key to decrypt transactions
|
||||
or transaction signatures. Although this is highly unlikely, we can mitigate by preparing for and forcing of key
|
||||
updates (i.e. we won't provision new transactions to enclaves using old keys).
|
||||
* What an attacker *can* do is simply erase encrypted data (or perhaps re-encrypt as part of ransomware), blocking
|
||||
subsequent resolution and verification. In the non-notary provisioning models we can't really mitigate this as the
|
||||
tip of the ledger (or signature over) may only be stored by a single non-notary entity (assumed to be compromised).
|
||||
However if we require consensus over validity between notary or non-notary entities (e.g. validating oracles) then
|
||||
this implicitly provides redundancy of storage.
|
||||
* Furthermore storage owners can spy on the enclave's activity by observing access patterns to the encrypted blobs.
|
||||
We can mitigate by implementing ORAM storage.
|
||||
|
||||
* **Regular nodes**: if a regular node is compromised the attacker may gain access to the node's long term key that
|
||||
allows them to Diffie-Hellman with an enclave, or get the ephemeral DH value calculated during attestation directly.
|
||||
This means they can man-in-the-middle between the node and the enclave. From the ledger's point of view we are
|
||||
prepared for this scenario as we never leak sensitive information to the node from the enclave, however it opens the
|
||||
possibility that the attacker can fake enclave replies (e.g. validity checks) and can sniff on secrets flowing from
|
||||
the node to the enclave. We can mitigate the fake enclave replies by requiring an extra signature on messages.
|
||||
Sniffing cannot really be mitigated, but one could argue that if the transient DH key (that lives temporarily in
|
||||
volatile memory) or long term key (that probably lives in an HSM) was leaked then the attacker has access to node
|
||||
secrets anyway.
|
||||
|
||||
* **R3**: the entity that's whitelisting enclaves effectively controls attestation trust, which means they can
|
||||
backdoor the ledger by whitelisting a secret-revealing/signature-forging enclave. One way to mitigate this is by
|
||||
requiring a threshold signature/consensus over new trusted enclave measurements. Another way would be to use "canary"
|
||||
keys controlled by neutral parties. These parties' responsibility would simply be to publish enclave measurements (and
|
||||
perhaps the reproducing build) to the public before signing over them. The "publicity" and signature would be checked
|
||||
during attestation, so a quote with a non-public measurement would be rejected. Although this wouldn't prevent
|
||||
backdoors (unless the parties also do auditing), it would make them public.
|
||||
|
||||
* **Intel**: There are two ways a compromised Intel can interact with the ledger maliciously, both provide a backdoor.
|
||||
* It can sign over invalid quotes. This can be mitigated by implementing our own attestation service. Intel told us
|
||||
we'll be able to do this in the future (by downloading a set of certificates tied to CPU+CPUSVN combos that may be
|
||||
used to check QE signatures).
|
||||
* It can produce valid quotes without an enclave. This is due to the fact that they store one half of the SGX-
|
||||
specific fuse values in order to validate quotes flexibly. One way to circumvent this would be to only use the
|
||||
other half of the fuse values (the seal values) which they don't store (or so they claim). However this requires
|
||||
our own "enrollment" process of CPUs where we replicate the provisioning process based off of seal values and
|
||||
verify manually that the provisioning public key comes from the CPU. And even if we do this all we did was move
|
||||
the requirement of trust from Intel to R3.
|
||||
|
||||
Note however that even if an attacker compromises Intel and decides to backdoor they would need to connect to the
|
||||
ledger participants in order to take advantage. The flow framework and the business network concept act as a form of
|
||||
ACL on data that would make an Intel backdoor quite useless.
|
||||
|
||||
## Summary
|
||||
|
||||
As we can see we have a number of options here, all of them have advantages and disadvantages.
|
||||
|
||||
#### Privacy + non-notary
|
||||
|
||||
**Pros**:
|
||||
* Closest to our current non-SGX model
|
||||
* Strong guarantee of validity
|
||||
* Flexible with respect to notary modes
|
||||
|
||||
**Cons**:
|
||||
* Regulatory problem about provisioning of ledger
|
||||
* Relies on ledger participants to do validation checks
|
||||
* No redundancy across ledger participants
|
||||
|
||||
#### Privacy + notary
|
||||
|
||||
**Pros**:
|
||||
* Strong guarantee of validity
|
||||
* Separation of concerns, allows lightweight ledger participants
|
||||
* Redundancy across notary nodes
|
||||
|
||||
**Cons**:
|
||||
* Regulatory problem about provisioning of ledger
|
||||
|
||||
#### Integrity + non-notary
|
||||
|
||||
**Pros**:
|
||||
* Efficient validity checks
|
||||
* No storage of the sensitive transaction body, only signatures
|
||||
|
||||
**Cons**:
|
||||
* Enclave impersonation compromises ledger (unless consensus validation)
|
||||
* Relies on ledger participants to do validation checks
|
||||
* No redundancy across ledger participants
|
||||
|
||||
#### Integrity + notary
|
||||
|
||||
**Pros**:
|
||||
* Efficient validity check
|
||||
* No storage of the sensitive transaction body, only signatures
|
||||
* Separation of concerns, allows lightweight ledger participants
|
||||
* Redundancy across notary nodes
|
||||
|
||||
**Cons**:
|
||||
* Only BFT guarantee over validity
|
||||
* Temporary storage of transaction in RAM may be against regulation
|
||||
|
||||
Personally I'm strongly leaning towards an integrity model where SGX compromise is mitigated by a BFT consensus over validity (perhaps done by a validating oracle cluster). This would solve the regulatory problem, it would be efficient and the infrastructure would have a very clean separation of concerns between notary and non-notary nodes, allowing lighter-weight interaction with the ledger.
|
@ -1,90 +0,0 @@
|
||||
# CorDapp Minimum and Target Platform Version
|
||||
|
||||
## Overview
|
||||
|
||||
We want to give CorDapps the ability to specify which versions of the platform they support. This will make it easier for CorDapp developers to support multiple platform versions, and enable CorDapp developers to tweak behaviour and opt in to changes that might be breaking (e.g. sandboxing). Corda developers gain the ability to introduce changes to the implementation of the API that would otherwise break existing CorDapps.
|
||||
|
||||
This document proposes that CorDapps will have metadata associated with them specifying a minimum platform version and a target platform version. The minimum platform version of a CorDapp would indicate that a Corda node would have to be running at least this version of the Corda platform in order to be able to run this CorDapp. The target platform version of a CorDapp would indicate that it was tested for this version of the Corda platform.
|
||||
|
||||
## Background
|
||||
|
||||
> Introduce target version and min platform version as app attributes
|
||||
>
|
||||
> This is probably as simple as a couple of keys in a MANIFEST.MF file.
|
||||
> We should document what it means, make sure API implementations can always access the target version of the calling CorDapp (i.e. by examining the flow, doing a stack walk or using Reflection.getCallerClass()) and do a simple test of an API that acts differently depending on the target version of the app.
|
||||
> We should also implement checking at CorDapp load time that min platform version <= current platform version.
|
||||
|
||||
([from CORDA-470](https://r3-cev.atlassian.net/browse/CORDA-470))
|
||||
|
||||
### Definitions
|
||||
|
||||
* *Platform version (Corda)* An integer representing the API version of the Corda platform
|
||||
|
||||
> It starts at 1 and will increment by exactly 1 for each release which changes any of the publicly exposed APIs in the entire platform. This includes public APIs on the node itself, the RPC system, messaging, serialisation, etc. API backwards compatibility will always be maintained, with the use of deprecation to migrate away from old APIs. In rare situations APIs may have to be removed, for example due to security issues. There is no relationship between the Platform Version and the release version - a change in the major, minor or patch values may or may not increase the Platform Version.
|
||||
|
||||
([from the docs](https://docs.corda.net/head/versioning.html#versioning)).
|
||||
|
||||
* *Platform version (Node)* The value of the Corda platform version that a node is running and advertising to the network.
|
||||
|
||||
* *Minimum platform version (Network)* The minimum platform version that the nodes must run in order to be able to join the network. Set by the network zone operator. The minimum platform version is distributed with the network parameters as `minimumPlatformVersion`.
|
||||
([see docs:](https://docs.corda.net/network-map.html#network-parameters))
|
||||
|
||||
* *Target platform version (CorDapp)* Introduced in this document. Indicates that a CorDapp was tested with this version of the Corda Platform and should be run at this API level if possible.
|
||||
|
||||
* *Minimum platform version (CorDapp)* Introduced in this document. Indicates the minimum version of the Corda platform that a Corda Node has to run in order to be able to run a CorDapp.
|
||||
|
||||
|
||||
## Goals
|
||||
|
||||
Define the semantics of target platform version and minimum platform version attributes for CorDapps, and the minimum platform version for the Corda network. Describe how target and platform versions would be specified by CorDapp developers. Define how these values can be accessed by the node and the CorDapp itself.
|
||||
|
||||
## Non-goals
|
||||
|
||||
In the future it might make sense to integrate the minimum and target versions into a Corda gradle plugin. Such a plugin is out of scope of this document.
|
||||
|
||||
## Timeline
|
||||
|
||||
This is intended as a long-term solution. The first iteration of the implementation will be part of platform version 4 and contain the minimum and target platform version.
|
||||
|
||||
## Requirements
|
||||
|
||||
* The CorDapp's minimum and target platform version must be accessible to nodes at CorDapp load time.
|
||||
|
||||
* At CorDapp load time there should be a check that the node's platform version is greater or equal to the CorDapp's Minimum Platform version.
|
||||
|
||||
* API implementations must be able to access the target version of the calling CorDapp.
|
||||
|
||||
* The node's platform version must be accessible to CorDapps.
|
||||
|
||||
* The CorDapp's target platform version must be accessible to the node when running CorDapps.
|
||||
|
||||
## Design
|
||||
|
||||
### Testing
|
||||
|
||||
When a new platform version is released, CorDapp developers can increase their CorDapp's target version and re-test their app. If the tests are successful, they can then release their CorDapp with the increased target version. This way they would opt-in to potentially breaking changes that were introduced in that version. If they choose to keep their current target version, their CorDapp will continue to work.
|
||||
|
||||
### Implications for platform developers
|
||||
|
||||
When new features or changes are introduced that require all nodes on the network to understand them (e.g. changes in the wire transaction format), they must be version-gated on the network level. This means that the new behaviour should only take effect if the minimum platform version of the network is equal to or greater than the version in which these changes were introduced. Failing that, the old behaviour must be used instead.
|
||||
|
||||
Changes that risk breaking apps must be gated on targetVersion>=X where X is the version where the change was made, and the old behaviour must be preserved if that condition isn't met.
|
||||
|
||||
## Technical Design
|
||||
|
||||
The minimum and target platform versions will be written to the manifest of the CorDapp's JAR, in fields called `Min-Platform-Version` and `Target-Platform-Version`.
|
||||
The node's CorDapp loader reads these values from the manifest when loading the CorDapp. If the CorDapp's minimum platform version is greater than the node's platform version, the node will not load the CorDapp and log a warning. The CorDapp loader sets the minimum and target version in `net.corda.core.cordapp.Cordapp`, which can be obtained via the `CorDappContext` from the service hub.
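
As a sketch only (not the actual node implementation, and the defaults for missing attributes are assumptions), reading and checking those two manifest fields could look like this:

```kotlin
import java.util.jar.JarFile

data class CordappVersionInfo(val minimumPlatformVersion: Int, val targetPlatformVersion: Int)

fun readVersionInfo(jarPath: String, nodePlatformVersion: Int): CordappVersionInfo? {
    JarFile(jarPath).use { jar ->
        val attributes = jar.manifest?.mainAttributes ?: return null
        val min = attributes.getValue("Min-Platform-Version")?.toIntOrNull() ?: 1
        val target = attributes.getValue("Target-Platform-Version")?.toIntOrNull() ?: min
        if (min > nodePlatformVersion) {
            // The node refuses to load the CorDapp and logs a warning, as described above.
            System.err.println("WARN: $jarPath requires platform version $min, node is on $nodePlatformVersion")
            return null
        }
        return CordappVersionInfo(min, target)
    }
}
```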
|
||||
|
||||
To make APIs caller-sensitive in cases where the service hub is not available a different approach has to be used. It would be possible to do a stack walk and parse the manifest of each class on the stack to determine if it belongs to a CorDapp, and if so, what its target version is. Alternatively, the mapping of classes to `Cordapp`s obtained by the CorDapp loader could be stored in a global singleton. This singleton would expose a lambda returning the current CorDapp's version information (e.g. `() -> Cordapp.Info`).
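
One hedged way to realise the stack-walk variant is Java 9's `StackWalker`; the `cordappForClass` lookup below stands in for whatever class-to-`Cordapp` mapping the loader maintains and is purely illustrative.

```kotlin
import java.util.stream.Collectors

// Sketch only: return the target version of the first frame on the stack that belongs to a CorDapp, if any.
fun callingCordappTargetVersion(cordappForClass: (Class<*>) -> Int?): Int? {
    val walker = StackWalker.getInstance(StackWalker.Option.RETAIN_CLASS_REFERENCE)
    val callerClasses: List<Class<*>> = walker.walk { frames ->
        frames.map { it.declaringClass }.collect(Collectors.toList())
    }
    return callerClasses.asSequence().mapNotNull(cordappForClass).firstOrNull()
}
```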
|
||||
|
||||
Let's assume that we want to change `TimeWindow.Between` to make it inclusive, i.e. change `contains(instant: Instant) = instant >= fromTime && instant < untilTime` to `contains(instant: Instant) = instant >= fromTime && instant <= untilTime`. However, doing so will break existing CorDapps. We could then version-guard the change such that the new behaviour is only used if the target version of the CorDapp calling `contains` is equal to or greater than the platform version that contains this change. It would look similar to this:
|
||||
|
||||
```kotlin
fun contains(instant: Instant): Boolean {
    if (CorDappVersionResolver.resolve().targetVersion > 42) {
        return instant >= fromTime && instant <= untilTime
    } else {
        return instant >= fromTime && instant < untilTime
    }
}
```
|
||||
Version-gating API changes when the service hub is available would look similar to the above example; in that case the service hub's CorDapp provider would be used to determine if this code is being called from a CorDapp and to obtain its target version information.
|
@ -1,39 +0,0 @@
|
||||
![Corda](https://www.corda.net/wp-content/uploads/2016/11/fg005_corda_b.png)
|
||||
|
||||
--------------------------------------------
|
||||
Design Decision: <Description heading>
|
||||
============================================
|
||||
|
||||
## Background / Context
|
||||
|
||||
Short outline of decision point.
|
||||
|
||||
## Options Analysis
|
||||
|
||||
### A. <Option summary>
|
||||
|
||||
#### Advantages
|
||||
|
||||
1.
|
||||
2.
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1.
|
||||
2.
|
||||
|
||||
### B. <Option summary>
|
||||
|
||||
#### Advantages
|
||||
|
||||
1.
|
||||
2.
|
||||
|
||||
#### Disadvantages
|
||||
|
||||
1.
|
||||
2.
|
||||
|
||||
## Recommendation and justification
|
||||
|
||||
Proceed with Option <A or B or ... >
|
@ -1,76 +0,0 @@
|
||||
# Design doc template
|
||||
|
||||
## Overview
|
||||
|
||||
Please read the [Design Review Process](../design-review-process.md) before completing a design.
|
||||
|
||||
Each section of the document should be at the second level (two hashes at the start of a line).
|
||||
|
||||
This section should describe the desired change or feature, along with background on why it's needed and what problem
|
||||
it solves.
|
||||
|
||||
An outcome of the design document should be an implementation plan that defines JIRA stories and tasks to be completed
|
||||
to produce shippable, demonstrable, executable code.
|
||||
|
||||
Please complete and/or remove section headings as appropriate to the design being proposed. These are provided as
|
||||
guidance and to structure the design in a consistent and coherent manner.
|
||||
|
||||
## Background
|
||||
|
||||
Description of existing solution (if any) and/or rationale for requirement.
|
||||
|
||||
* Reference(s) to discussions held elsewhere (slack, wiki, etc).
|
||||
* Definitions, acronyms and abbreviations
|
||||
|
||||
## Goals
|
||||
|
||||
What's in scope to be solved.
|
||||
|
||||
## Non-goals
|
||||
|
||||
What won't be tackled as part of this design, either because it's not needed/wanted, or because it will be tackled later
|
||||
as part of a separate design effort. Figuring out what you will *not* do is frequently a useful exercise.
|
||||
|
||||
## Timeline
|
||||
|
||||
* Is this a short, medium or long-term solution?
|
||||
* For a short-term design, is this evolvable / extensible or a stop-gap (e.g. potentially throwaway)?
|
||||
|
||||
## Requirements
|
||||
|
||||
* Reference(s) to any of following:
|
||||
* Captured Product Backlog JIRA entry
|
||||
* Internal White Paper feature item and/or visionary feature
|
||||
* Project related requirement (POC, RFP, Pilot, Prototype) from
|
||||
* Internal Incubator / Accelerator project
|
||||
* Direct from Customer, ISV, SI, Partner
|
||||
* Use Cases
|
||||
* Assumptions
|
||||
|
||||
## Design Decisions
|
||||
|
||||
List of design decisions identified in defining the target solution.
|
||||
|
||||
For each item, please complete the attached [Design Decision template](decisions/decision.md)
|
||||
|
||||
Use the ``.. toctree::`` feature to list out the design decision docs here (see the source of this file for an example).
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
decisions/decision.md
|
||||
|
||||
## Design
|
||||
|
||||
Think about:
|
||||
|
||||
* Public API, backwards compatibility impact.
|
||||
* UI requirements, if any. Illustrate with UI Mockups and/or wireframes.
|
||||
* Data model & serialization impact and changes required.
|
||||
* Infrastructure services: persistence (schemas), messaging.
|
||||
* Impact on performance, scalability, high availability
|
||||
* Versioning, upgradability, migration
|
||||
* Management: audit, alerting, monitoring, backup/recovery, archiving
|
||||
* Data privacy, authentication, access control
|
||||
* Logging
|
||||
* Testability
|
@ -1,429 +0,0 @@
|
||||
<style>.wy-table-responsive table td, .wy-table-responsive table th { white-space: normal;}</style>
|
||||
Corda Threat Model
|
||||
==================
|
||||
|
||||
This document describes the security threat model of the Corda Platform. The Corda Threat Model is the result of architectural and threat modelling sessions,
|
||||
and is designed to provide a high-level overview of the security objectives for the Corda Network, and the controls and mitigations used to deliver on those
|
||||
objectives. It is intended to support subsequent analysis and architecture of systems connecting with the network and the applications which interact with data
|
||||
across it.
|
||||
|
||||
It is incumbent on all ledger network participants to review and assess the security measures described in this document against their specific organisational
|
||||
requirements and policies, and to implement any additional measures needed.
|
||||
|
||||
Scope
|
||||
-----
|
||||
|
||||
Built on the [Corda](http://www.corda.net/) distributed ledger platform designed by R3, the ledger network enables the origination and management of agreements
|
||||
between business partners. Participants to the network create and maintain Corda *nodes,* each hosting one or more pluggable applications ( *CorDapps* ) which
|
||||
define the data to be exchanged and its workflow. See the [Corda Technical White Paper](https://docs.corda.net/_static/corda-technical-whitepaper.pdf) for a
|
||||
detailed description of Corda's design and functionality.
|
||||
|
||||
R3 provide and maintain a number of essential services underpinning the ledger network. In the future these services are intended to be operated by a separate
|
||||
Corda Foundation. The network services currently include:
|
||||
|
||||
- Network Identity service ('Doorman'): Issues signed digital certificates that uniquely identify parties on the network.
|
||||
- Network Map service: Provides a way for nodes to advertise their identity, and identify other nodes on the network, their network address and advertised
|
||||
services.
|
||||
|
||||
Participants to the ledger network include major institutions, financial organisations and regulated bodies, across various global jurisdictions. In a majority
|
||||
of cases, there are stringent requirements in place for participants to demonstrate that their handling of all data is performed in an appropriately secure
|
||||
manner, including the exchange of data over the ledger network. This document identifies measures within the Corda platform and supporting infrastructure to
|
||||
mitigate key security risks in support of these requirements.
|
||||
|
||||
The Corda Network
|
||||
-----------------
|
||||
|
||||
The diagram below illustrates the network architecture, protocols and high level data flows that comprise the Corda Network. The threat model has been developed
|
||||
based upon this architecture.
|
||||
|
||||
![](./images/threat-model.png)
|
||||
|
||||
Threat Model
|
||||
------------
|
||||
|
||||
Threat Modelling is an iterative process that works to identify, describe and mitigate threats to a system. One of the most common models for identifying
|
||||
threats is the [STRIDE](https://en.wikipedia.org/wiki/STRIDE_(security)) framework. It provides a set of security threats in six categories:
|
||||
|
||||
- Spoofing
|
||||
- Tampering
|
||||
- Repudiation
|
||||
- Information Disclosure
|
||||
- Denial of Service
|
||||
- Elevation of Privilege
|
||||
|
||||
The Corda threat model uses the STRIDE framework to present the threats to the Corda Network in a structured way. It should be stressed that threat modelling is
|
||||
an iterative process that is never complete. The model described below is part of an on-going process intended to continually refine the security architecture
|
||||
of the Corda platform.
|
||||
|
||||
### Spoofing
|
||||
|
||||
Spoofing is pretending to be something or someone other than yourself. It is the actions taken by an attacker to impersonate another party, typically for the
|
||||
purposes of gaining unauthorised access to privileged data, or perpetrating fraudulent transactions. Spoofing can occur on multiple levels. Machines can be
|
||||
impersonated at the network level by a variety of methods such as ARP & IP spoofing or DNS compromise.
|
||||
|
||||
Spoofing can also occur at an application or user-level. Attacks at this level typically target authentication logic, using compromised passwords and
|
||||
cryptographic keys, or by subverting cryptography systems.
|
||||
|
||||
Corda employs a Public Key Infrastructure (PKI) to validate the identity of nodes, both at the point of registration with the network map service and
|
||||
subsequently through the cryptographic signing of transactions. An imposter would need to acquire an organisation's private keys in order to meaningfully
|
||||
impersonate that organisation. R3 provides guidance to all ledger network participants to ensure adequate security is maintained around cryptographic keys.
|
||||
|
||||
+-------------+------------------------------------------------------------------------------+----------------------------------------------------------------+
|
||||
| Element | Attacks | Mitigations |
|
||||
+=============+==============================================================================+================================================================+
|
||||
| RPC Client | An external attacker impersonates an RPC client and is able to initiate | The RPC Client is authenticated by the node and must supply |
|
||||
| | flows on their behalf. | valid credentials (username & password). |
|
||||
| | | |
|
||||
| | A malicious RPC client connects to the node and impersonates another, | RPC Client permissions are configured by the node |
|
||||
| | higher-privileged client on the same system, and initiates flows on their | administrator and can be used to restrict the actions and |
|
||||
| | behalf. | flows available to the client. |
|
||||
| | | |
|
||||
| | **Impacts** | RPC credentials and permissions can be managed by an Apache |
|
||||
| | | Shiro service. The RPC service restricts which actions are |
|
||||
| | If successful, the attacker would be able to perform actions that they are | available to a client based on what permissions they have been |
|
||||
| | not authorised to perform, such initiating flows. The impact of these | assigned. |
|
||||
| | actions could have financial consequences depending on what flows were | |
|
||||
| | available to the attacker. | |
|
||||
+-------------+------------------------------------------------------------------------------+----------------------------------------------------------------+
|
||||
| Node | An attacker attempts to impersonate a node and issue a transaction using | Nodes must connect to each other using using |
|
||||
| | their identity. | mutually-authenticated TLS connections. Node identity is |
|
||||
| | | authenticated using the certificates exchanged as part of the |
|
||||
| | An attacker attempts to impersonate another node on the network by | TLS protocol. Only the node that owns the corresponding |
|
||||
| | submitting NodeInfo updates with falsified address and/or identity | private key can assert their true identity. |
|
||||
| | information. | |
|
||||
| | | NodeInfo updates contain the node's public identity |
|
||||
| | **Impacts** | certificate and must be signed by the corresponding private |
|
||||
| | | key. Only the node in possession of this private key can sign |
|
||||
| | If successful, a node able to assume the identity of another party could | the NodeInfo. |
|
||||
| | conduct fraudulent transactions (e.g. pay cash to its own identity), giving | |
|
||||
| | a direct financial impact to the compromised identity. Demonstrating that | Corda employs a Public Key Infrastructure (PKI) to validate |
|
||||
| | the actions were undertaken fraudulently could prove technically challenging | the identity of nodes. An imposter would need to acquire an |
|
||||
| | to any subsequent dispute resolution process. | organisation's private keys in order to meaningfully |
|
||||
| | | impersonate that organisation. Corda will soon support a range |
|
||||
| | In addition, an impersonating node may be able to obtain privileged | of HSMs (Hardware Security Modules) for storing a node's |
|
||||
| | information from other nodes, including receipt of messages intended for the | private keys, which mitigates this risk. |
|
||||
| | original party containing information on new and historic transactions. | |
|
||||
+-------------+------------------------------------------------------------------------------+----------------------------------------------------------------+
|
||||
| Network Map | An attacker with appropriate network access performs a DNS compromise, | Connections to the Network Map service are secured using the |
|
||||
| | resulting in network traffic to the Doorman & Network Map being routed to | HTTPS protocol. The connecting node authenticates the |
|
||||
| | their attack server, which attempts to impersonate these machines. | NetworkMap servers using their public certificates, to ensure |
|
||||
| | | the identity of these servers is correct. |
|
||||
| | **Impact** | |
|
||||
| | | All data received from the NetworkMap is digitally signed (in |
|
||||
| | Impersonation of the Network Map would enable an attacker to issue | addition to being protected by TLS) - an attacker attempting |
|
||||
| | unauthorised updates to the map. | to spoof the Network Map would need to acquire both private |
|
||||
| | | TLS keys, and the private NetworkMap signing keys. |
|
||||
| | | |
|
||||
| | | The Doorman and NetworkMap signing keys are stored inside a |
|
||||
| | | (Hardware Security Module (HSM) with strict security controls |
|
||||
| | | (network separation and physical access controls). |
|
||||
+-------------+------------------------------------------------------------------------------+----------------------------------------------------------------+
|
||||
| Doorman | An malicious attacker operator attempts to join the Corda Network by | R3 operate strict validation procedures to ensure that |
|
||||
| | impersonating an existing organisation and issues a fraudulent registration | requests to join the Corda Network have legitimately |
|
||||
| | request. | originated from the organisation in question. |
|
||||
| | | |
|
||||
| | **Impact** | |
|
||||
| | | |
|
||||
| | The attacker would be able to join and impersonate an organisation. | |
|
||||
| | | |
|
||||
| | The operator could issue an identity cert for any organisation, publish a | |
|
||||
| | valid NodeInfo and redirect all traffic to themselves in the clear. | |
|
||||
+-------------+------------------------------------------------------------------------------+----------------------------------------------------------------+
|
||||
|
||||
|
||||
|
||||
### Tampering
|
||||
|
||||
Tampering refers to the modification of data with malicious intent. This typically involves modification of data at rest (such as a file on disk, or fields in a
|
||||
database), or modification of data in transit.
|
||||
|
||||
To be successful, an attacker would require privileged access to some part of the network infrastructure (either public or internal private networks). They
|
||||
might also have access to a node's file-system, database or even direct memory access.
|
||||
|
||||
+------------+-----------------------------------------------------------------------------+------------------------------------------------------------------+
|
||||
| Element | Attacks | Mitigations |
|
||||
+============+=============================================================================+==================================================================+
|
||||
| Node | Unintended, adverse behaviour of a CorDapp running on one or more nodes - | By design, Corda's notary-based consensus model and contract |
|
||||
| (CorDapp) | either its core code or any supporting third party libraries. A coding bug | validation mechanisms provide protection against attempts to |
|
||||
| | is assumed to be the default cause, although malicious modification of a | alter shared data or perform invariant operations. The primary |
|
||||
| | CorDapp could result in similar effects. | risk is therefore to local systems. |
|
||||
| | | |
|
||||
| | | Future versions of Corda will require CorDapps to be executed |
|
||||
| | | inside a sandboxed JVM environment, modified to restrict |
|
||||
| | | unauthorised access to the local file system and network. This |
|
||||
| | | is intended to minimise the potential of a compromised CorDapp |
|
||||
| | | to affect systems local to the node. |
|
||||
+------------+-----------------------------------------------------------------------------+------------------------------------------------------------------+
|
||||
| P2P & RPC | An attacker performs Man-in-the-Middle (MITM) attack against a node's | Mutually authenticated TLS connections between nodes ensures |
|
||||
| connection | peer-to-peer (P2P) connection | that Man-In-The-Middle (MITM) attacks cannot take place. Corda |
|
||||
| s | | Nodes restrict their connections to TLS v1.2 and also restrict |
|
||||
| | **Impact** | which cipher suites are accepted. |
|
||||
| | | |
|
||||
| | An attacker would be able to modify transactions between participating | |
|
||||
| | nodes. | |
|
||||
+------------+-----------------------------------------------------------------------------+------------------------------------------------------------------+
|
||||
| Node Vault | An attacker gains access to the node's vault and modifies tables in the | There are not currently any direct controls to mitigate this |
|
||||
| | database. | kind of attack. A node's vault is assumed to be within the same |
|
||||
| | | trust boundary of the node JVM. Access to the vault must be |
|
||||
| | **Impact** | restricted such that only the node can access it. Both |
|
||||
| | | network-level controls (fire-walling) and database permissions |
|
||||
| | Transaction history would become compromised. The impact could range from | must be employed. |
|
||||
| | deletion of data to malicious tampering of financial detail. | |
|
||||
| | | Note that the tampering of a node's vault only affects that |
|
||||
| | | specific node's transaction history. No other node in the |
|
||||
| | | network is affected and any tampering attempts are easily |
|
||||
| | | detected. |
|
||||
| | | |
|
||||
| | | |
|
||||
+------------+-----------------------------------------------------------------------------+------------------------------------------------------------------+
|
||||
| Network | An attacker compromises the Network Map service and publishes an | Individual Node entries in the NetworkMap must be signed by the |
|
||||
| Map | illegitimate update. | associated node's private key. The signatures are validated by |
|
||||
| | | the NetworkMap service, and all other Nodes in the network, to |
|
||||
| | **Impact** | ensure they have not been tampered with. An attacker would need |
|
||||
| | | to acquire a node's private identity signing key to be able to |
|
||||
| | NodeInfo entries (name & address information) could potentially become | make modifications to a NodeInfo. This is only possible if the |
|
||||
| | altered if this attack was possible | attacker has control of the node in question. |
|
||||
| | | |
|
||||
| | The NetworkMap could be deleted and/or unauthorized nodes could be added | It is not possible for the NetworkMap service (or R3) to modify |
|
||||
| | to, or removed from the map. | entries in the network map (because the node's private keys are |
|
||||
| | | not accessible). If the NetworkMap service were compromised, the |
|
||||
| | | only impact the attacker could have would be to add or remove |
|
||||
| | | individual entries in the map. |
|
||||
+------------+-----------------------------------------------------------------------------+------------------------------------------------------------------+
|
||||
|
||||
### Repudiation
|
||||
|
||||
Repudiation refers to the ability to claim a malicious action did not take place. Repudiation becomes relevant when it is not possible to verify the identity of
|
||||
an attacker, or there is a lack of evidence to link their malicious actions with events in a system.
|
||||
|
||||
Preventing repudiation does not prevent other forms of attack. Rather, the goal is to ensure that the attacker is identifiable, their actions can be traced, and
|
||||
there is no way for the attacker to deny having committed those actions.
|
||||
|
||||
+-------------+------------------------------------------------------------------------------+-----------------------------------------------------------------+
|
||||
| Element | Attacks | Mitigations |
|
||||
+=============+==============================================================================+=================================================================+
|
||||
| RPC Client | Attacker attempts to initiate a flow that they are not entitled to perform | RPC clients must authenticate to the Node using credentials |
|
||||
| | | passed over TLS. It is therefore not possible for an RPC client |
|
||||
| | **Impact** | to perform actions without first proving their identity. |
|
||||
| | | |
|
||||
| | Flows could be initiated without knowing the identity of the client. | All interactions with an RPC user are also logged by the node. |
|
||||
| | | An attacker's identity and actions will be recorded and cannot |
|
||||
| | | be repudiated. |
|
||||
+-------------+------------------------------------------------------------------------------+-----------------------------------------------------------------+
|
||||
| Node | A malicious CorDapp attempts to spend a state that does not belong to them. | Corda transactions must be signed with a node's private |
|
||||
| | The node operator then claims that it was not their node that initiated the | identity key in order to be accepted by the rest of the |
|
||||
| | transaction. | network. The signature directly identities the signing party |
|
||||
| | | and cannot be made by any other node - therefore the act of |
|
||||
| | **Impact** | signing a transaction |
|
||||
| | | |
|
||||
| | Financial transactions could be initiated by anonymous parties, leading to | Corda transactions between nodes utilize the P2P protocol, |
|
||||
| | financial loss, and loss of confidence in the network. | which requires a mutually authenticated TLS connection. It is |
|
||||
| | | not possible for a node to issue transactions without having |
|
||||
| | | it's identity authenticated by other nodes in the network. Node |
|
||||
| | | identity and TLS certificates are issued via Corda Network |
|
||||
| | | services, and use the Corda PKI (Public Key Infrastructure) for |
|
||||
| | | authentication. |
|
||||
| | | |
|
||||
| | | All P2P transactions are logged by the node, meaning that any |
|
||||
| | | interactions are recorded |
|
||||
+-------------+------------------------------------------------------------------------------+-----------------------------------------------------------------+
|
||||
| Node | A node attempts to perform a denial-of-state attack. | Non-validating Notaries require a signature over every request, |
|
||||
| | | therefore nobody can deny performing denial-of-state attack |
|
||||
| | | because every transaction clearly identities the node that |
|
||||
| | | initiated it. |
|
||||
+-------------+------------------------------------------------------------------------------+-----------------------------------------------------------------+
|
||||
| Node | | |
|
||||
+-------------+------------------------------------------------------------------------------+-----------------------------------------------------------------+
|
||||
|
||||
|
||||
|
||||
### Information Disclosure
|
||||
|
||||
Information disclosure is about the unauthorised access of data. Attacks of this kind have an impact when confidential data is accessed. Typical examples of
|
||||
attack include extracting secrets from a running process, and accessing confidential files on a file-system which have not been appropriately secured.
|
||||
Interception of network communications between trusted parties can also lead to information disclosure.
|
||||
|
||||
An attacker capable of intercepting network traffic from a Corda node would, at a minimum, be able to identify which other parties that node was interacting
|
||||
with, along with relative frequency and volume of data being shared; this could be used to infer additional privileged information without the parties'
|
||||
consent. All network communication of a Corda node is encrypted using the TLS protocol (v1.2) with modern cryptographic algorithms.
|
||||
|
||||
+------------+------------------------------------------------------------------------------+------------------------------------------------------------------+
|
||||
| Element | Attack | Mitigations |
|
||||
+============+==============================================================================+==================================================================+
|
||||
| Node | An attacker attempts to retrieve transaction history from a peer node in the | By design, Corda nodes do not globally broadcast transaction |
|
||||
| | network, for which they have no legitimate right of access. | information to all participants in the network. |
|
||||
| | | |
|
||||
| | Corda nodes will, upon receipt of a request referencing a valid transaction | A node will not divulge arbitrary transactions to a peer unless |
|
||||
| | hash, respond with the dependency graph of that transaction. One theoretical | that peer has been included in the transaction flow. A node only |
|
||||
| | scenario is therefore that a participant is able to guess (or otherwise | divulges transaction history if the transaction being requested |
|
||||
| | acquire by illicit means) the hash of a valid transaction, thereby being | is a descendant of a transaction that the node itself has |
|
||||
| | able to acquire its content from another node. | previously shared as part of the current flow session. |
|
||||
| | | |
|
||||
| | **Impact** | The SGX integration feature currently envisaged for Corda will |
|
||||
| | | implement CPU peer-to-peer encryption under which transaction |
|
||||
| | If successful, an exploit of the form above could result in information | graphs are transmitted in an encrypted state and only decrypted |
|
||||
| | private to specific participants being shared with one or more | within a secure enclave. Knowledge of a transaction hash will |
|
||||
| | non-privileged parties. This may include market-sensitive information used | then be further rendered insufficient for a non-privileged party |
|
||||
| | to derive competitive advantage. | to view the content of a transaction. |
|
||||
+------------+------------------------------------------------------------------------------+------------------------------------------------------------------+
|
||||
| Node Vault | An unauthorised user attempts to access the node's vault | Access to the Vault uses standard JDBC authentication mechanism. |
|
||||
| (database) | | Any user connecting to the vault must have permission to do so. |
|
||||
| | **Impact** | |
|
||||
| | | |
|
||||
| | Access to the vault would reveal the full transaction history that the node | |
|
||||
| | has taken part in. This may include financial information. | |
|
||||
+------------+------------------------------------------------------------------------------+------------------------------------------------------------------+
|
||||
| Node | An attacker who gains access to the machine running the Node attempts to | Corda Nodes are designed to be executed using a designated |
|
||||
| Process | read memory from the JVM process. | 'corda' system process, which other users and processes on the |
|
||||
| (JVM) | | system do not have permission to access. |
|
||||
| | An attacker with access the file-system attempts to read the node's | |
|
||||
| | cryptographic key-store, containing the private identity keys. | The node's Java Key Store is encrypted using PKCS\#12 |
|
||||
| | | encryption. In the future Corda will eventually store its keys |
|
||||
| | **Impact** | in a HSM (Hardware Security Module). |
|
||||
| | | |
|
||||
| | An attacker would be able to read sensitive such as private identity keys. | |
|
||||
| | The worst impact would be the ability to extract private keys from the JVM | |
|
||||
| | process. | |
|
||||
| | | |
|
||||
| | | |
|
||||
+------------+------------------------------------------------------------------------------+------------------------------------------------------------------+
|
||||
| RPC Client | Interception of RPC traffic between a client system and the node. | RPC communications are protected by the TLS protocol. |
|
||||
| | | |
|
||||
| | A malicious RPC client authenticates to a Node and attempts to query the | Permission to query a node's vault must be explicitly granted on |
|
||||
| | transaction vault. | a per-user basis. It is recommended that RPC credentials and |
|
||||
| | | permissions are managed in an Apache Shiro database. |
|
||||
| | **Impact** | |
|
||||
| | | |
|
||||
| | An attacker would be able to see details of transactions shared between the | |
|
||||
| | connected business systems and any transacting party. | |
|
||||
+------------+------------------------------------------------------------------------------+------------------------------------------------------------------+
|
||||
|
||||
|
||||
|
||||
### Denial of Service
|
||||
|
||||
Denial-of-service (DoS) attacks target the availability of a resource from its intended users. There are two anticipated targets of a DoS attack - network
|
||||
participants (Corda Nodes) and network services (Doorman and the Network Map). DoS attacks occur by targeting the node or network services with a high
|
||||
volume/frequency of requests, or by sending malformed requests. Typical DoS attacks leverage a botnet or other distributed group of systems (Distributed Denial
|
||||
of Service, DDoS). A successful DoS attack may result in non-availability of targeted ledger network node(s)/service(s), both during the attack and thereafter
|
||||
until normal service can be resumed.
|
||||
|
||||
Communication over the ledger network is primarily peer-to-peer. Therefore the network as a whole is relatively resilient to DoS attacks. Notaries and oracles
|
||||
will only communicate with peers in the network, so are protected from non-member-on-member application-level attack.
|
||||
|
||||
Corda Network Services are protected by enterprise-grade DDoS detection and mitigation services.
|
||||
|
||||
+------------+------------------------------------------------------------------------------+------------------------------------------------------------------+
|
||||
| Element | Attack | Mitigations |
|
||||
+============+==============================================================================+==================================================================+
|
||||
| Node | An attacker control sends high volume of malformed transactions to a node. | P2P communcation is authenticated as part of the TLS protocol, |
|
||||
| | | meaning that attackers must be part of the Corda network to |
|
||||
| | **Impact** | launch an attack. |
|
||||
| | | |
|
||||
| | Nodes targeted by this attack could exhaust their processing & memory | Communication over the ledger network is primarily peer-to-peer, |
|
||||
| | resources, or potentially cease responding to transactions. | the network as a whole is relatively resilient to DoS attacks, |
|
||||
| | | the primary threat being to specific nodes or services. |
|
||||
| | | |
|
||||
| | | Note that there is no specific mitigation against DoS attacks at |
|
||||
| | | the per-node level. DoS attacks by participants on other |
|
||||
| | | participants will be expressly forbidden under the terms of the |
|
||||
| | | ledger network's network agreement. Measures will be taken |
|
||||
| | | against any ledger network participant found to have perpetrated |
|
||||
| | | a DoS attack, including exclusion from the ledger network |
|
||||
| | | network and potential litigation. As a result, the perceived |
|
||||
| | | risk of a member-on-member attack is low and technical measures |
|
||||
| | | are not considered under this threat model, although they may be |
|
||||
| | | included in future iterations. |
|
||||
+------------+------------------------------------------------------------------------------+------------------------------------------------------------------+
|
||||
| CorDapp | Unintended termination or other logical sequence (e.g. due to a coding bug | The network agreement will stipulate a default maximum allowable |
|
||||
| | in either Corda or a CorDapp) by which a party is rendered unable to resolve | period time - the 'event horizon' - within which a party is |
|
||||
| | a f low. The most likely results from another party failing to respond when | required to provide a valid response to any message sent to it |
|
||||
| | required to do so under the terms of the agreed transaction protocol. | in the course of a flow. If that period is exceeded, the flow |
|
||||
| | | will be considered to be cancelled and may be discontinued |
|
||||
| | **Impact** | without prejudice by all parties. The event horizon may be |
|
||||
| | | superseded by agreements between parties specifying other |
|
||||
| | Depending on the nature of the flow, a party could be financially impacted | timeout periods, which may be encoded into flows under the Corda |
|
||||
| | by failure to resolve a flow on an indefinite basis. For example, a party | flow framework. |
|
||||
| | may be left in possession of a digital asset without the means to transfer | |
|
||||
| | it to another party. | Additional measures may be taken under the agreement against |
|
||||
| | | parties who repeatedly fail to meet their response obligations |
|
||||
| | | under the network agreement. |
|
||||
+------------+------------------------------------------------------------------------------+------------------------------------------------------------------+
|
||||
| Doorman | Attacker submits excessive registration requests to the Doorman service | Doorman is deployed behind a rate-limiting firewall. |
|
||||
| | | |
|
||||
| | | Doorman requests are validated and filtered to ensure malformed |
|
||||
| | | requests are rejected. |
|
||||
+------------+------------------------------------------------------------------------------+------------------------------------------------------------------+
|
||||
| Network | Attacker causes the network map service to become unavailable | Updates to the network map must be signed by participant nodes |
|
||||
| Map | | and are authenticated before being processed. |
|
||||
| | | |
|
||||
| | | The network map is designed to be distributed by a CDN (Content |
|
||||
| | | Delivery Network). This design leverages the architecture and |
|
||||
| | | security controls of the CDN and is expected to be resilient to |
|
||||
| | | DDoS (Distributed Denial of Service) attack. |
|
||||
| | | |
|
||||
| | | The Network Map is also cached locally by nodes on the network. |
|
||||
| | | If the network map online service were temporarily unavailable, |
|
||||
| | | the Corda network would not be affected. |
|
||||
| | | |
|
||||
| | | There is no requirement for the network map services to be |
|
||||
| | | highly available in order for the ledger network to be |
|
||||
| | | operational. Temporary non-availability of the network map |
|
||||
| | | service may delay certification of new entrants to the network, |
|
||||
| | | but will have no impact on existing participants. Similarly, the |
|
||||
| | | network map will be cached by individual nodes once downloaded |
|
||||
| | | from the network map service; unplanned downtime would prevent |
|
||||
| | | broadcast of updates relating to new nodes connecting to / |
|
||||
| | | disconnecting from the network, but not affect communication |
|
||||
| | | between nodes whose connection state remains unchanged |
|
||||
| | | throughout the incident. |
|
||||
+------------+------------------------------------------------------------------------------+------------------------------------------------------------------+
|
||||
|
||||
|
||||
|
||||
### Elevation of Privilege
|
||||
|
||||
Elevation of Privilege is enabling somebody to perform actions they are not permitted to do. Attacks range from a normal user executing actions as a more
|
||||
privileged administrator, to a remote (external) attacker with no privileges executing arbitrary code.
|
||||
|
||||
+------------+------------------------------------------------------------------------------+-----------------------------------------------------------------+
|
||||
| Element | Attack | Mitigations |
|
||||
+============+==============================================================================+=================================================================+
|
||||
| Node | Malicious contract attempts to instantiate classes in the JVM that it is not | The AMQP serialiser uses a combination of white and black-lists |
|
||||
| | authorised to access. | to mitigate against de-serialisation vulnerabilities. |
|
||||
| | | |
|
||||
| | Malicious CorDapp sends malformed serialised data to a peer. | Corda does not currently provide specific security controls to |
|
||||
| | | mitigate all classes of privilege escalation vulnerabilities. |
|
||||
| | **Impact** | The design of Corda requires that CorDapps are inherently |
|
||||
| | | trusted by the node administrator. |
|
||||
| | Unauthorised remote code execution would lead to complete system compromise. | |
|
||||
| | | Future security research will introduce stronger controls that |
|
||||
| | | can mitigate this class of threat. The Deterministic JVM will |
|
||||
| | | provide a sandbox that prevents execution of code & classes |
|
||||
| | | outside of the security boundary that contract code is |
|
||||
| | | restricted to. |
|
||||
| | | |
|
||||
| | | |
|
||||
+------------+------------------------------------------------------------------------------+-----------------------------------------------------------------+
|
||||
| RPC Client | A malicious RPC client connects to the node and impersonates another, | Nodes implement an access-control model that restricts what |
|
||||
| | higher-privileged client on the same system, and initiates flows on their | actions RPC users can perform. |
|
||||
| | behalf. | |
|
||||
| | | Session replay is mitigated by virtue of the TLS protocol used |
|
||||
| | | to protect RPC communications. |
|
||||
+------------+------------------------------------------------------------------------------+-----------------------------------------------------------------+
|
||||
|
||||
|
||||
|
||||
Conclusion
|
||||
----------
|
||||
|
||||
The threat model presented here describes the main threats to the Corda Network, and the controls that are included to mitigate these threats. It was necessary
|
||||
to restrict this model to a high-level perspective of the Corda Network. It is hoped that enough information is provided to allow network participants to
|
||||
understand the security model of Corda.
|
||||
|
||||
Threat modelling is an on-going process. There is active research at R3 to continue evolving the Corda Threat Model. In particular, models are being developed
|
||||
that focus more closely on individual components - such as the Node, Network Map and Doorman.
|
||||
|
||||
|
||||
|
||||
|
@ -1,398 +0,0 @@
|
||||
# Contract versioning and ensuring data integrity
|
||||
|
||||
|
||||
## Terminology used in this document:
|
||||
|
||||
- ContractJAR = The code that contains the State, Command, Contract and (optional) custom persistent Schema classes. This code is used to verify transactions and is stored on the ledger as an Attachment. (TODO: Find a better name.)
|
||||
- FlowsJAR = Code that contains the flows and services. This is installed on the node and exposes endpoints. (TODO: Find a better name.)
|
||||
- CorDapp = Distributed applications that run on the Corda platform (https://docs.corda.net/cordapp-overview.html). This term does not mean anything in this document, because it is including both of the above!
|
||||
- Attachment = A file that is stored on the "ledger" and is referenced by its hash. In this document it is usually the ContractJAR.
|
||||
- Contract = The class that contains the verification logic. (lives in the ContractJar)
|
||||
- State schema or State = the fields that compose the ContractState class.
|
||||
|
||||
|
||||
## Background:
|
||||
|
||||
This document addresses "Corda as a platform for applications" concerns.
|
||||
|
||||
Applications that run on Corda - CorDapps - can, unlike applications on most other blockchains, be updated to fix bugs and address new requirements.
|
||||
|
||||
Corda also allows a lot of flexibility: CorDapps can depend on other CorDapps that have different release cycles, and participants on the network can have any combination of versions installed.
|
||||
|
||||
This document is focused mainly on the "ContractJar" part of the CorDapps, as this is the Smart contract that lives on the ledger.
|
||||
|
||||
Starting with version 3, Corda has introduced the WhitelistedByZone Contract Constraint, which is the first constraint that allows the contract and contract state type to evolve.
|
||||
In version 4 we will introduce the decentralized Signature Constraint, which is also an upgradable (allows evolving) constraint.
|
||||
This introduces a set of new problems that were not present when the Hash Constraint was the only alternative. (The Hash Constraint is non-upgradeable, as it pins the jar version to the hardcoded hash. It can only be upgraded via the "explicit" mechanism.)
|
||||
|
||||
E.g.:
|
||||
Developer MegaCorp develops a token contract: `com.megacorp.tokens.MegaToken`.
|
||||
As various issues are discovered, and requirements change over time, MegaCorp constantly releases new versions and distributes them to node operators, who can use these new versions for new transactions.
|
||||
These versions could in theory either change the verification logic, change the meaning of various state fields, or even add/remove/rename fields.
|
||||
|
||||
Also, this means that at any point in time, different nodes may have different versions of MegaToken installed. (and the associated MegaFlows that build transactions using this contract). But these nodes still need to communicate.
|
||||
|
||||
Also in the vault of different nodes, there are states created by transactions with various versions of the token contract.
|
||||
|
||||
Corda is designed such that the flow that builds the transaction (on its executing node) will have to choose the contract version that will be used to create the output states and also verify the current transaction.
|
||||
|
||||
But because input states are actually output states that were serialised with the previous transaction (built using a potentially different version of the contract), states serialised with one version of the ContractJAR will need to be deserialisable with a different version.
|
||||
|
||||
|
||||
.. image:: ../../resources/tx-chain.png
|
||||
:scale: 25%
|
||||
:align: center
|
||||
|
||||
|
||||
## Goals
|
||||
|
||||
- States should be correctly deserialized as an input state of a transaction that uses a different release of the contract code.
|
||||
After the input states are correctly deserialised they can be correctly verified by the transaction contract.
|
||||
This is critical for the UTXO model of Corda to function correctly.
|
||||
|
||||
- Define a simple process and basic tooling for contract code developers to ensure a consistent release process.
|
||||
|
||||
- Nodes should be prevented from selecting older, buggy contract code for current transactions.
|
||||
|
||||
- Ensure a basic mechanism so that flows do not miss essential data when communicating with newer flows.
|
||||
|
||||
|
||||
## Non-Goals
|
||||
|
||||
This design is not about:
|
||||
|
||||
- Addressing security issues discovered in an older version of a contract (that was used in transactions) without compromising the trust in the ledger. (There are proposals for this and can be extracted in a separate doc)
|
||||
- Defining the concept of "CorDapp identity", useful when flows or contracts are coded against third-party contracts.
- Evolving states from the HashConstraint or the Whitelist constraint (addressed in a separate design).
- Publishing and distribution of applications to nodes.
|
||||
- Contract constraints, package ownership, or any operational concerns.
|
||||
|
||||
## Issues considered but postponed for a future version of Corda
|
||||
|
||||
- How releasing new versions of contract code interacts with the explicit upgrade functionality.
|
||||
- How contracts depend on other contracts.
|
||||
- Node to node communication using flows, and what impact different contract versions have on that. (Flows depending on contracts)
|
||||
- Versioning of flows and subflows and backwards compatibility concerns.
|
||||
|
||||
|
||||
### Assumptions and trade-offs made for the current version
|
||||
|
||||
#### We assume that ContractStates will never change their semantics in a way that would impact other Contracts or Flows that depend on them.
|
||||
|
||||
E.g.: If various contracts depend on Cash, the assumption is that no new field will be added to Cash that would have an influence over the amount or the owner (the fundamental fields of Cash).
|
||||
It is always safe to mix new CashStates with older states that depend on it.
|
||||
|
||||
This means that we can simplify the contract-to-contract dependency. Given that the UpgradeableContract is actually a contract that depends on another contract, it can be simplified too.
|
||||
|
||||
This is not a very strong definition, so we will have to create more formalised rules in the next releases.
|
||||
|
||||
If any contract breaks this assumption, in Corda 4 there will be no platform support for the transition to the new version. The burden of coordinating with all the other CorDapp developers and nodes is on the original developer.
|
||||
|
||||
|
||||
#### Flow to Flow communication could be lossy for objects that are not ContractStates or Commands.
|
||||
|
||||
Explanation:
|
||||
Flows communicate by passing around various objects, and eventually the entire TransactionBuilder.
|
||||
As flows evolve, these objects might evolve too, even if the sequence stays the same.
|
||||
The decision was that the node that sends data would decide (if its version is higher) whether this new data is relevant for the other party.
|
||||
|
||||
The objects that live on the ledger, like ContractStates and Commands, which the other party actually has to sign, will not be allowed to lose any data.
|
||||
|
||||
|
||||
#### We assume that cordapp developers will correctly understand all implications and handle backwards compatibility themselves.
|
||||
Basically they will have to realise that any version of a flow can talk to any other version, and code accordingly.
|
||||
|
||||
This gets particularly tricky when there are reusable inline subflows involved.
|
||||
|
||||
|
||||
## Design details
|
||||
|
||||
### Possible attack under the current implementation
|
||||
|
||||
CorDapp developer MegaCorp wants to release an update to their CorDapp and add support for accumulating debt on the token (which is a placeholder for something you really want to know about).
|
||||
|
||||
- V1: com.megacorp.token.MegaToken(amount: Amount, owner: Party)
|
||||
- V2: com.megacorp.token.MegaToken(amount: Amount, owner: Party, accumulatedDebt: Amount? = 0)
|
||||
|
||||
After they publish the new release, this sort of scenario could happen if we don't have a mechanism to stop it.
|
||||
|
||||
1. Tx1: Alice transfers MegaToken to Bob, and selects V1
|
||||
2. Tx2: Bob transfers to Chuck, but selects V2. The V1 output state will be deserialised with an accumulatedDebt=0, which is correct.
|
||||
3. After a while, Chuck accumulates some debt on this token.
|
||||
4. Txn: Chuck creates a transaction with Dan, but selects V1 as the contract version for this transaction, thus managing to "lose" the `accumulatedDebt`. (V1 does not know about the accumulatedDebt field)
|
||||
|
||||
|
||||
|
||||
### High level description of the solution
|
||||
|
||||
Currently we have the concept of "CorDapp" and, as described in the terminology section, this makes reasoning harder as it is actually composed of 2 parts.
|
||||
|
||||
Contracts and Flows should be able to evolve and be released independently, and have proper names and their own version, even if they share the same gradle multi-module build.
|
||||
|
||||
Contract states need to be seen as evolvable objects that can be different from one version to the next.
|
||||
|
||||
Corda uses a proprietary serialisation engine based on AMQP, which allows evolution of objects: https://docs.corda.net/serialization-enum-evolution.html.
|
||||
|
||||
We can use features already implemented in the serialisation engine and add new features to make sure that data on the ledger is never lost from one transaction to the next.
|
||||
|
||||
|
||||
### Contract Version
|
||||
|
||||
Contract code should live in its own Gradle module. This is already the way our examples are written.
|
||||
|
||||
The Cordapp gradle plugin should be amended to differentiate between a "flows" module and a "contracts" module.
|
||||
|
||||
In the build.gradle file of the contracts module, there should be a `version` property that needs to be incremented for each release.
|
||||
|
||||
This `version` will be used for the regular release, and be part of the jar name.
|
||||
|
||||
Also, when packaging the contract for release, the `version` should be added by the plugin to the manifest file, together with other properties like `target-platform-version`.
|
||||
|
||||
When loading the contractJar in the attachment storage, the version should be saved as a column, so it is easily accessible.
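
As an illustration only (the manifest attribute name `Corda-Contract-Version` is an assumption for this sketch, not necessarily the plugin's actual output), reading such a version back out of a contract JAR could look like:

```kotlin
import java.util.jar.JarFile

// Hypothetical sketch: read a version attribute from a contract JAR's manifest.
// The attribute name is an assumption made for illustration purposes.
fun contractVersionOf(jarPath: String): Int? =
    JarFile(jarPath).use { jar ->
        jar.manifest?.mainAttributes?.getValue("Corda-Contract-Version")?.toIntOrNull()
    }
```
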
|
||||
|
||||
The new cordapp plugin should also be able to deploy nodes with different versions of the code so that developers can test compatibility.
|
||||
|
||||
Ideally the driver should be able to do this too.
|
||||
|
||||
|
||||
#### Alternatives considered
|
||||
|
||||
The version could be of the `major.minor` format, so that developers can encode whether they actually made a breaking change or not.
|
||||
Given that we assumed that breaking changes are not supported in this version, we can keep it to a simple `major`.
|
||||
|
||||
|
||||
#### Backwards compatibility
|
||||
|
||||
Contracts released before V4 will not have this metadata.
|
||||
|
||||
Assuming that the constraints propagated correctly, when verifying a transaction where the constraint:
|
||||
|
||||
- is the HashConstraint, the contract can be considered to have `version=1`
- is the WhitelistedByZoneConstraint, the contract can be considered to have `version = Order_of_the_hash_in_the_whitelist`
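
A hypothetical sketch of this mapping (the `whitelist` parameter, mapping contract class names to their ordered whitelisted attachment hashes, is an assumption made for illustration):

```kotlin
import net.corda.core.contracts.AttachmentConstraint
import net.corda.core.contracts.HashAttachmentConstraint
import net.corda.core.contracts.WhitelistedByZoneAttachmentConstraint
import net.corda.core.crypto.SecureHash

// Illustrative only: derive a legacy "version" for pre-Corda-4 contract JARs
// following the rules above. Returns null when no version can be inferred.
fun legacyContractVersion(
    constraint: AttachmentConstraint,
    contractClassName: String,
    attachmentId: SecureHash,
    whitelist: Map<String, List<SecureHash>>
): Int? = when (constraint) {
    is HashAttachmentConstraint -> 1
    is WhitelistedByZoneAttachmentConstraint -> {
        val index = whitelist[contractClassName]?.indexOf(attachmentId) ?: -1
        if (index >= 0) index + 1 else null
    }
    else -> null
}
```
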
|
||||
|
||||
Any signed ContractJars should only be considered valid if they have the version metadata (as signing is a Corda 4 feature).
|
||||
|
||||
|
||||
|
||||
### Protection against losing data on the ledger
|
||||
|
||||
The solution we propose is:
|
||||
|
||||
- States can only evolve by respecting some predefined rules (see below).
|
||||
- The serialisation engine will need a new `Strict mode` feature to enforce the evolution rules.
|
||||
- The `version` metadata of the contract code can be used to make sure that nodes can't spend a state with an older version (downgrade).
|
||||
|
||||
|
||||
#### Contract State evolution
|
||||
|
||||
States need to follow the general serialisation rules: https://docs.corda.net/serialization-default-evolution.html
|
||||
|
||||
These are the possible evolutions based on these general rules:
|
||||
- Adding nullable fields with default values is OK (deserialising old versions would populate the newer fields with the defaults)
|
||||
- Adding non-nullable fields with default values is OK but requires extra serialisation annotation
|
||||
- Removing fields is permitted, but transactions will fail at the message deserialisation stage if a non-null value is supplied for a removed field. This means that if a nullable field is added, states received from the previous version can be transmitted back to that version, as evolution from the old version to the new version will supply a default value of null for that field, and evolution from the new version back to the old version will discard that null value. If the value is not-null, the data is assumed to have originated from the new version and to be significant for contract validation, and the old version must refuse to handle it.
|
||||
- Renaming fields is not OK (it will be possible in the future, when the serialisation engine supports it)
- Changing the type of a field is not OK (the serialisation engine would most likely fail)
- Deprecating fields is OK (as long as the field is not removed)
|
||||
|
||||
|
||||
Given the above reasoning, states only support a subset of the general rules: basically, they can only evolve by adding new fields.
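
For illustration (the names are hypothetical and the contract wiring is omitted), a state evolved in the only way the rules above always allow - by adding a nullable field with a default - might look like this in its V2 form:

```kotlin
import net.corda.core.contracts.Amount
import net.corda.core.contracts.ContractState
import net.corda.core.identity.AbstractParty
import net.corda.core.identity.Party
import java.util.Currency

// V2 of a hypothetical token state. States serialised by V1 (which had no
// accumulatedDebt field) deserialise here with accumulatedDebt = null.
data class MegaToken(
    val amount: Amount<Currency>,
    val owner: Party,
    val accumulatedDebt: Amount<Currency>? = null
) : ContractState {
    override val participants: List<AbstractParty> get() = listOf(owner)
}
```
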
|
||||
|
||||
Another way to look at this:
|
||||
|
||||
When Contract.verify is written (and compiled), it is in the same project as the current (from its point of view) version of the State.
|
||||
But, at runtime, the contract state it's operating with can actually be a deserialised version of any previous state.
|
||||
That's why the current one needs to be a superset of all preceding states, and you can't verify with an older version.
|
||||
|
||||
|
||||
The serialisation engine needs to implement the above rules, and needs to run in this `Strict Mode` during transaction verification.
|
||||
This mode can be implemented as a new `SerializationContext`.
|
||||
|
||||
The same serialization evolution `Strict Mode` needs to be enforced any time ContractStates or Commands are serialized.
|
||||
This is to ensure that, when flows communicate with older versions, the older node will not sign something that it does not know about.
|
||||
|
||||
|
||||
##### Backwards compatibility
|
||||
|
||||
The only case where this would break existing transactions is when the Whitelist Constraint was used and states were evolved in a way that does not follow the above rules.
Should this rule be applied retroactively, or only to post-V4 transactions?
|
||||
|
||||
|
||||
### Non-downgrade rule
|
||||
|
||||
To avoid the possibility of malicious nodes selecting old and buggy contract code when spending newer states, we need to enforce a `Non Downgrade rule`.
|
||||
|
||||
Transactions contain an attachment for each contract. The version of the output states is the version of this contract attachment.
|
||||
(It can be seen as the version of code that instantiated and serialised those classes.)
|
||||
|
||||
The rule is: the version of the code used in the transaction that spends a state needs to be greater than or equal to the version of every input state: ``spending_version >= creation_version``.
|
||||
|
||||
This rule needs to be enforced at verification time, and also during transaction building.
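
A minimal sketch of such a check (the names are illustrative, not the actual Corda implementation):

```kotlin
// Hypothetical check run while building and verifying a transaction: refuse to
// spend any input state with a contract attachment older than the one that
// created it.
fun checkNoDowngrade(inputStateVersions: List<Int>, spendingContractVersion: Int) {
    inputStateVersions.forEach { creationVersion ->
        require(spendingContractVersion >= creationVersion) {
            "Contract version $spendingContractVersion is older than the version " +
                "$creationVersion that created one of the input states"
        }
    }
}
```
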
|
||||
|
||||
*Note:* This rule can be implemented as a normal contract constraint, as it is a constraint on the attachments that can be used on the spending transaction.
|
||||
We don't currently support multiple constraints on a state and don't have any delegation mechanism to implement it like that.
|
||||
|
||||
|
||||
#### Considered but decided against - Add Version field on TransactionState
|
||||
|
||||
The `version` could also be stored redundantly to states on the ledger - as a field in ``TransactionState`` and also in the `vault_states` table.
|
||||
|
||||
This would make the verification logic more clear and faster, and also expose the version of the input states to the contract verify code.
|
||||
|
||||
Note:
|
||||
|
||||
- When a transaction is verified the version of the attachment needs to match the version of the output states.
|
||||
|
||||
|
||||
## Actions
|
||||
|
||||
1. Implement the serialization `Strict Mode` and wire it into transaction verification, and more generally wherever the node deserialises ContractStates or Commands.
Also define the other possible evolutions and decide whether they are allowed (e.g. adding/removing interfaces).
|
||||
2. Find some good names for the `ContractJar` and the `FlowsJar`.
|
||||
3. Implement the gradle plugin changes to split the 'cordapp' into 'contract' (better name) and 'flows' (better name).
|
||||
4. Implement the versioning strategy proposed above in the 'contract' part of the plugin.
|
||||
5. When importing a ContractJar, read the version and save it as a database column. Also add it as a field to the `ContractAttachment` class.
|
||||
6. Implement the non-downgrade rule using the above `version`.
|
||||
7. Add support to the `cordapp` plugin and the driver to test multiple versions of the contractJar together.
|
||||
8. Document the entire release process.
|
||||
9. Update samples with the new gradle plugin.
|
||||
10. Create an elaborate sample for some more complex flows and contract upgrade scenarios.
|
||||
Ideally with new fields added to the state, new version of the flow protocol, internal flow objects changed, subflows, dependency on other contracts.
|
||||
This would be published as an example for how it should be done.
|
||||
11. Use this elaborate sample to create a comprehensive automated Behave Compatibility Test Suite.
|
||||
|
||||
## Deferred tasks
|
||||
|
||||
1. Formalise the dependency rules between contracts, contracts and flows, and also subflow versioning.
|
||||
2. Find a way to hide the complexity of backwards compatibility from flow developers (maybe using a handshake).
|
||||
3. Remove custom contract state schema from ContractJar.
|
||||
|
||||
## Appendix:
|
||||
|
||||
This section contains hypothetical scenarios that illustrate various issues and how they are solved by the current design.
|
||||
|
||||
These are the possible changes:
|
||||
- changes to the state (fields added/removed from the state)
|
||||
- changes to the verification logic (more restrictive, less restrictive, or both - for different checks)
|
||||
- adding/removing/renaming of commands
|
||||
- Persistent Schema changes (These are out of scope of this document, as they shouldn't be in contract jar in the first place)
|
||||
- any combination of the above
|
||||
|
||||
|
||||
Terminology:
|
||||
|
||||
- V1, V2 - are versions of the contract.
|
||||
- Tx1, Tx2 - are transactions between parties.
|
||||
|
||||
|
||||
### Scenario 1 - Spending a state with an older contract.

- V1: com.megacorp.token.MegaToken(amount: Amount, owner: Party)
- V2: com.megacorp.token.MegaToken(amount: Amount, owner: Party, accumulatedDebt: Amount? = 0)

- Tx1: Alice transfers MegaToken to Bob, and selects V1.
- Tx2: Bob transfers to Chuck, but selects V2. The V1 output state will be deserialised with accumulatedDebt=0, which is correct.
- After a while, Chuck accumulates some debt on this token.
- Txn: Chuck creates a transaction with Dan, but selects V1 as the contract version for this transaction, thus managing to "lose" the accumulatedDebt (V1 does not know about the accumulatedDebt field).

Solution: This was analysed above. It will be solved by the non-downgrade rule and the serialization engine changes.

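A minimal Kotlin sketch of the V1 to V2 evolution of `MegaToken`, assuming the new field is given a default value so that V1-serialised states can still be deserialised by V2 code; the class is illustrative, not taken from a real CorDapp:

```kotlin
import net.corda.core.contracts.Amount
import net.corda.core.contracts.ContractState
import net.corda.core.identity.AbstractParty
import net.corda.core.identity.Party
import java.util.Currency

// V1 shape: MegaToken(amount, owner).
// V2 adds accumulatedDebt with a default, so a state serialised by V1 code evolves into a
// V2 object with accumulatedDebt filled in by the default value.
data class MegaToken(
    val amount: Amount<Currency>,
    val owner: Party,
    val accumulatedDebt: Amount<Currency>? = null // nullable default stands in for the "= 0" in the scenario
) : ContractState {
    override val participants: List<AbstractParty> get() = listOf(owner)
}
```
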
### Scenario 2 - Running an explicit upgrade written against an older contract version.

- V1: com.megacorp.token.MegaToken(amount: Amount, owner: Party)
- V2: com.megacorp.token.MegaToken(amount: Amount, owner: Party, accumulatedDebt: Amount? = 0)
- Another company creates a better com.gigacorp.token.GigaToken that is an UpgradedContract designed to replace the MegaToken via an explicit upgrade, but developed against V1 (as V2 was not released at the time of development).

Same as before:

- Tx1: Alice transfers MegaToken to Bob, and selects V1 (the only version available at that time).
- Tx2: Bob transfers to Chuck, but selects V2. The V1 output state will be deserialised with accumulatedDebt=0, which is correct.
- After a while, Chuck accumulates some debt on this token.
- Chuck notices that the GigaToken does not know about the accumulatedDebt field, so he runs an explicit upgrade and transforms his MegaToken with debt into a clean GigaToken.

Solution: This attack breaks the assumption we made that contracts will not add fields that change them fundamentally.

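To make the scenario concrete, here is a sketch of what such an explicit upgrade contract could look like when written against V1, silently dropping any debt; the state and contract classes are hypothetical, while `UpgradedContract` and the other core types are the real Corda interfaces:

```kotlin
import net.corda.core.contracts.Amount
import net.corda.core.contracts.ContractState
import net.corda.core.contracts.UpgradedContract
import net.corda.core.identity.AbstractParty
import net.corda.core.identity.Party
import net.corda.core.transactions.LedgerTransaction
import java.util.Currency

// The V1 shape of MegaToken, as seen by the GigaToken developer (no accumulatedDebt field).
data class MegaTokenV1(val amount: Amount<Currency>, val owner: Party) : ContractState {
    override val participants: List<AbstractParty> get() = listOf(owner)
}

// The replacement state offered by the other company.
data class GigaToken(val amount: Amount<Currency>, val owner: Party) : ContractState {
    override val participants: List<AbstractParty> get() = listOf(owner)
}

class GigaTokenContract : UpgradedContract<MegaTokenV1, GigaToken> {
    // Hypothetical class name of the legacy contract being replaced.
    override val legacyContract: String = "com.megacorp.token.MegaTokenContract"

    // Written against V1, so any accumulatedDebt present in a V2 state is simply not
    // carried over - which is exactly the problem described in this scenario.
    override fun upgrade(state: MegaTokenV1): GigaToken = GigaToken(state.amount, state.owner)

    override fun verify(tx: LedgerTransaction) {
        // Verification of the upgrade transaction would go here; omitted in this sketch.
    }
}
```
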
### Scenario 3 - Flows installed by 2 peers compiled against different contract versions.

- Alice runs V1 of the MegaToken FlowsJAR, while Bob runs V2.
- Bob builds a new transaction where he transfers a state with accumulatedDebt to Alice.
- Alice is not able to correctly evaluate the business proposition, as she does not know that there even exists an accumulatedDebt field, but still signs it as if it were debt free.

Solution: Solved by the assumption that peer-to-peer communication can be lossy, and that the peer with the higher version is responsible for sending the right data.

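One way the higher-version peer can take on that responsibility is to check the counterparty's flow version before deciding what to send. The sketch below uses the real `FlowSession.getCounterpartyFlowInfo()` API, but the payload types, the flow itself and the version cut-off are hypothetical:

```kotlin
import co.paralleluniverse.fibers.Suspendable
import net.corda.core.flows.FlowLogic
import net.corda.core.flows.FlowSession
import net.corda.core.serialization.CordaSerializable

// Hypothetical payloads: the V2 payload carries the accumulated debt, the V1 payload does not.
@CordaSerializable
data class TokenProposalV1(val amount: Long)

@CordaSerializable
data class TokenProposalV2(val amount: Long, val accumulatedDebt: Long)

// Hypothetical sender-side flow run by the peer on the newer version.
class SendProposalFlow(
    private val session: FlowSession,
    private val proposal: TokenProposalV2
) : FlowLogic<Unit>() {
    @Suspendable
    override fun call() {
        // Peers advertise a flow version; using 2 as the cut-off is an assumption for this sketch.
        if (session.getCounterpartyFlowInfo().flowVersion >= 2) {
            session.send(proposal)
        } else {
            // Downgrade the payload for the older peer - acceptable only if dropping the debt is safe.
            session.send(TokenProposalV1(proposal.amount))
        }
    }
}
```
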
### Scenario 4 - Developer attempts to rename a field from one version to the next.

- V1: com.megacorp.token.MegaToken(amount: Amount, owner: Party, accumulatedDebt: Amount)
- V2: com.megacorp.token.MegaToken(amount: Amount, owner: Party, currentDebt: Amount)

This is a rename of a field.
This would break as soon as you try to spend a V1 state with V2, because there is no default value for currentDebt.

Solution: Not possible for now, as it breaks the serialisation evolution rules. Could be added as a new feature in the future.

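For contrast with the Scenario 1 sketch, this is roughly what the rename would look like; because `currentDebt` is non-nullable and has no default, the serialisation engine has nothing to populate it with when evolving a V1 state (illustrative class, not real CorDapp code):

```kotlin
import net.corda.core.contracts.Amount
import net.corda.core.contracts.ContractState
import net.corda.core.identity.AbstractParty
import net.corda.core.identity.Party
import java.util.Currency

// V2 after the attempted rename: accumulatedDebt is gone and currentDebt has no default,
// so a V1 state (which only carries amount, owner and accumulatedDebt) cannot be evolved
// into this shape - the engine has no value to supply for currentDebt.
data class MegaToken(
    val amount: Amount<Currency>,
    val owner: Party,
    val currentDebt: Amount<Currency>
) : ContractState {
    override val participants: List<AbstractParty> get() = listOf(owner)
}
```
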
### Scenario 5 - Contract verification logic becomes more strict in a new version. General concerns.

- V1: check that amount > 10
- V2: check that amount > 12

- Tx1: Alice transfers MegaToken(amount=11) to Bob, and selects V1.
- Tx2: Bob wants to forward it to Charlie, but in the meantime V2 is released, and Bob installs it.

- If Bob selects V2, the contract will fail. So Bob needs to select V1, but will Charlie accept it?

The question is how important it is that amount is > 12:

- Is it a new regulation, active from a certain date?
- Was it actually a (security) bug in V1?
- Is it just a change made for no good reason?

Solution: This is not addressed in the current design doc. It needs to be explored in more depth.

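A sketch of the two versions of the verification logic used in Scenarios 5-7, with a simplified state so that the only difference is the threshold; all class names are hypothetical:

```kotlin
import net.corda.core.contracts.Contract
import net.corda.core.contracts.ContractState
import net.corda.core.identity.AbstractParty
import net.corda.core.identity.Party
import net.corda.core.transactions.LedgerTransaction

// Simplified state for this sketch only: amount is a plain Long rather than an Amount.
data class SimpleToken(val amount: Long, val owner: Party) : ContractState {
    override val participants: List<AbstractParty> get() = listOf(owner)
}

// V1 of the verification logic.
class SimpleTokenContractV1 : Contract {
    override fun verify(tx: LedgerTransaction) {
        val amounts = tx.outputStates.filterIsInstance<SimpleToken>().map { it.amount }
        require(amounts.all { it > 10 }) { "amount must be greater than 10" }
    }
}

// V2 tightens the check: a transaction that was fine under V1 (e.g. amount = 11)
// now fails if the V2 attachment is selected, which is exactly Bob's dilemma above.
class SimpleTokenContractV2 : Contract {
    override fun verify(tx: LedgerTransaction) {
        val amounts = tx.outputStates.filterIsInstance<SimpleToken>().map { it.amount }
        require(amounts.all { it > 12 }) { "amount must be greater than 12" }
    }
}
```
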
### Scenario 6 - Contract verification logic becomes more strict in a new version. Peers on different flow versions.

- V1: check that amount > 10
- V2: check that amount > 12

- Alice runs V1 of the MegaToken FlowsJAR, while Bob runs V2. So Alice does not know (yet) that in the future the amount will have to be > 12.
- Alice builds a transaction that transfers 11 tokens to Bob. This is a perfectly good transaction (from her point of view), with the V1 attachment.
- Should Bob sign this transaction?
- This could be encoded in Bob's flow (which should know that the underlying contract has changed in this way).

Solution: Same as Scenario 5.

### Scenario 7 - Contract verification logic becomes less strict.

- V1: check that amount > 12
- V2: check that amount > 10

Alice runs V1 of the MegaToken FlowsJAR, while Bob runs V2.

Because there was no change to the actual structure of the state, the flow that Alice has installed is compatible with the newer version from Bob, so Alice could download V2 from Bob.

Solution: This should not require any change.

### Scenario 8 - Contract depends on another Contract.

- V1: com.megacorp.token.MegaToken(amount: Amount, owner: Party)
- V2: com.megacorp.token.MegaToken(amount: Amount, owner: Party, accumulatedDebt: Amount? = 0)

A contract developed by a third party, com.megabank.tokens.SuperToken, depends on com.megacorp.token.MegaToken.

V1 of `com.megabank.tokens.SuperToken` is compiled against V1 of `com.megacorp.token.MegaToken`, so it does not know about the new `accumulatedDebt` field.

Alice swaps MegaToken-V2 for SuperToken-V1 with Bob in a transaction. If Alice selects V1 of the SuperToken contract attachment, then it will not be able to correctly evaluate the transaction.

Solution: Solved by the assumption we made that contracts don't change fundamentally.

### Scenario 9 - A new command is added or removed.

- V1 of com.megacorp.token.MegaToken has 3 Commands: Issue, Move, Exit
- V2 of com.megacorp.token.MegaToken has 4 Commands: Issue, Move, Exit, AddDebt
- V3 of com.megacorp.token.MegaToken has 3 Commands: Issue, Move, Exit

There should not be any problem with adding/removing commands, as they apply only to transitions.
The spending of a state should not be affected by the command that created it.

Solution: Does not require any change.

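A sketch of the command sets in this scenario, assuming the common pattern of modelling commands as classes implementing `CommandData`; the names follow the scenario, the shape of the classes is illustrative:

```kotlin
import net.corda.core.contracts.CommandData

// Commands for the MegaToken contract across versions. V1 and V3 ship only the first
// three; V2 additionally ships AddDebt. Adding or removing a command only affects which
// transitions a transaction can express, not how previously created states are spent.
interface MegaTokenCommands : CommandData {
    class Issue : MegaTokenCommands
    class Move : MegaTokenCommands
    class Exit : MegaTokenCommands
    class AddDebt : MegaTokenCommands // present only in V2
}
```
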
@ -251,7 +251,7 @@ To copy the same file to all nodes `ext.drivers` can be defined in the top level
Package namespace ownership
^^^^^^^^^^^^^^^^^^^^^^^^^^^
To specify :doc:`design/data-model-upgrades/package-namespace-ownership` configuration, the optional ``networkParameterOverrides`` and ``packageOwnership`` blocks can be used, similar to the configuration file used in :doc:`network-bootstrapper`:
To specify package namespace ownership, the optional ``networkParameterOverrides`` and ``packageOwnership`` blocks can be used, similar to the configuration file used in :doc:`network-bootstrapper`:

.. sourcecode:: groovy

@ -145,6 +145,5 @@ Welcome to Corda !
contributing-index.rst
deterministic-modules.rst
design/design-docs-index.rst
changelog
legal-info

@ -333,8 +333,6 @@ Package namespace ownership is a Corda security feature that allows a compatibil
namespace to registered users (e.g. a CorDapp development organisation). The exact mechanism used to claim a namespace is up to the zone
operator. A typical approach would be to accept an SSL certificate with the domain in it as proof of domain ownership, or to accept an email from that domain.

.. note:: Read more about *Package ownership* :doc:`here<design/data-model-upgrades/package-namespace-ownership>`.

A Java package namespace is case insensitive and cannot be a sub-package of an existing registered namespace.
See `Naming a Package <https://docs.oracle.com/javase/tutorial/java/package/namingpkgs.html>`_ and `Naming Conventions <https://www.oracle.com/technetwork/java/javase/documentation/codeconventions-135099.html#28840>`_ for guidelines on naming conventions.

@ -151,7 +151,6 @@ The current set of network parameters:
This ensures that when a node encounters an owned contract it can uniquely identify it and knows that all other nodes can do the same.
Encountering an owned contract in a JAR that is not signed by the rightful owner is most likely a sign of malicious behaviour, and should be reported.
The transaction verification logic will throw an exception when this happens.
Read more about *Package ownership* here :doc:`design/data-model-upgrades/package-namespace-ownership`.

More parameters will be added in future releases to regulate things like allowed port numbers, whether or not IPv6
connectivity is required for zone members, required cryptographic algorithms and roll-out schedules (e.g. for moving to post quantum cryptography), parameters related to SGX and so on.

@ -99,5 +99,3 @@ sophisticated or proprietary business logic, machine learning models, even user
being Corda flows or services.

.. important:: The ``versionId`` specified for the JAR manifest is currently used for informative purposes only.

.. note:: You can read the original design doc here: :doc:`design/targetversion/design`.
|