diff --git a/docs/source/design/hadr/decisions/crash-shell.md b/docs/source/design/hadr/decisions/crash-shell.md new file mode 100644 index 0000000000..9046148815 --- /dev/null +++ b/docs/source/design/hadr/decisions/crash-shell.md @@ -0,0 +1,52 @@ +![Corda](https://www.corda.net/wp-content/uploads/2016/11/fg005_corda_b.png) + +-------------------------------------------- +Design Decision: Node starting & stopping +============================================ + +## Background / Context + +The potential use of a crash shell is relevant to [high availability](../design.md) capabilities of nodes. + + + +## Options Analysis + +### 1. Use crash shell + +#### Advantages + +1. Already built into the node. +2. Potentially add custom commands. + +#### Disadvantages + +1. Won’t reliably work if the node is in an unstable state +2. Not practical for running hundreds of nodes as our customers are already trying to do. +3. Doesn’t mesh with the user access controls of the organisation. +4. Doesn’t interface to the existing monitoring and control systems e.g. Nagios, Geneos ITRS, Docker Swarm, etc. + +### 2. Delegate to external tools + +#### Advantages + +1. Doesn’t require change from our customers +2. Will work even if node is completely stuck +3. Allows scripted node restart schedules +4. Doesn’t raise questions about access control lists and audit + +#### Disadvantages + +1. More uncertainty about what customers do. +2. Might be more requirements on us to interact nicely with lots of different products. +3. Might mean we get blamed for faults in other people’s control software. +4. Doesn’t coordinate with the node for graceful shutdown. +5. Doesn’t address any crypto features that target protecting the AMQP headers. + +## Recommendation and justification + +Proceed with Option 2: Delegate to external tools + +## Decision taken + +Decision still required. 
\ No newline at end of file diff --git a/docs/source/design/hadr/decisions/db-msg-store.md b/docs/source/design/hadr/decisions/db-msg-store.md new file mode 100644 index 0000000000..1e0c4cbd15 --- /dev/null +++ b/docs/source/design/hadr/decisions/db-msg-store.md @@ -0,0 +1,47 @@ +![Corda](https://www.corda.net/wp-content/uploads/2016/11/fg005_corda_b.png) + +-------------------------------------------- +Design Decision: Message storage +============================================ + +## Background / Context + +Storage of messages by the message broker has implications for replication technologies which can be used to ensure both [high availability](../design.md) and disaster recovery of Corda nodes. + + + +## Options Analysis + +### 1. Storage in the file system + +#### Advantages + +1. Out of the box configuration. +2. Recommended Artemis setup +3. Faster +4. Less likely to have interaction with DB Blob rules + +#### Disadvantages + +1. Unaligned capture time of journal data compared to DB checkpointing. +2. Replication options on Azure are limited. Currently we may be forced to the ‘Azure Files’ SMB mount, rather than the ‘Azure Data Disk’ option. This is still being evaluated + +### 2. Storage in node database + +#### Advantages + +1. Single point of data capture and backup +2. Consistent solution between VM and physical box solutions + +#### Disadvantages + +1. Doesn’t work on H2, or SQL Server. From my own testing LargeObject support is broken. The current Artemis code base does allow some pluggability, but not of the large object implementation, only of the SQL statements. We should lobby for someone to fix the implementations for SQL Server and H2. +2. Probably much slower, although this needs measuring. + +## Recommendation and justification + +Continue with Option 1: Storage in the file system + +## Decision taken + +Decision still required. 
\ No newline at end of file diff --git a/docs/source/design/hadr/decisions/external-broker.md b/docs/source/design/hadr/decisions/external-broker.md new file mode 100644 index 0000000000..2bfb6415aa --- /dev/null +++ b/docs/source/design/hadr/decisions/external-broker.md @@ -0,0 +1,49 @@ +![Corda](https://www.corda.net/wp-content/uploads/2016/11/fg005_corda_b.png) + +-------------------------------------------- +Design Decision: Broker separation +============================================ + +## Background / Context + +A decision of whether to extract the Artemis message broker as a separate component has implications for the design of [high availability](../design.md) for nodes. + + + +## Options Analysis + +### 1. No change (leave broker embedded) + +#### Advantages + +1. Least change + +#### Disadvantages + +1. Means that starting/stopping Corda is tightly coupled to starting/stopping Artemis instances. +2. Risks resource leaks from one system component affecting other components. +3. Not pluggable if we wish to have an alternative broker. + +### 2. External broker + +#### Advantages + +1. Separates concerns +2. Allows future pluggability and standardisation on AMQP +3. Separates life cycles of the components +4. Makes Artemis deployment much more out of the box. +5. Allows easier tuning of VM resources for Flow processing workloads vs broker type workloads. +6. Allows later encrypted version to be an enterprise feature that can interoperate with OS versions. + +#### Disadvantages + +1. More work +2. Requires creating a protocol to control external bridge formation. + +## Recommendation and justification + +Proceed with Option 2: External broker + +## Decision taken + +Decision still required. 
\ No newline at end of file diff --git a/docs/source/design/hadr/decisions/ip-addressing.md b/docs/source/design/hadr/decisions/ip-addressing.md new file mode 100644 index 0000000000..39d5a08ac4 --- /dev/null +++ b/docs/source/design/hadr/decisions/ip-addressing.md @@ -0,0 +1,48 @@ +![Corda](https://www.corda.net/wp-content/uploads/2016/11/fg005_corda_b.png) + +-------------------------------------------- +Design Decision: IP addressing mechanism (near-term) +============================================ + +## Background / Context + +End-to-end encryption is a desirable potential design feature for the [float](../design.md). + + + +## Options Analysis + +### 1. Via load balancer + +#### Advantages + +1. Standard technology in banks and on clouds, often for non-HA purposes. +2. Intended to allow us to wait for completion of network map work. + +#### Disadvantages + +1. We do need to support multiple IP address advertisements in network map long term. +2. Might involve small amount of code if we find Artemis doesn’t like the health probes. So far though testing of the Azure Load balancer doesn’t need this. +3. Won’t work over very large data centre separations, but that doesn’t work for HA/DR either + +### 2. Via IP list in Network Map + +#### Advantages + +1. More flexible +2. More deployment options +3. We will need it one day + +#### Disadvantages + +1. Have to write code to support it. +2. Configuration more complicated and now the nodes are non-equivalent, so you can’t just copy the config to the backup. +3. Artemis has round robin and automatic failover, so we may have to expose a vendor specific config flag in the network map. + +## Recommendation and justification + +Proceed with Option 1: Via Load Balancer + +## Decision taken + +Decision still required. 
\ No newline at end of file diff --git a/docs/source/design/hadr/decisions/medium-term-target.md b/docs/source/design/hadr/decisions/medium-term-target.md new file mode 100644 index 0000000000..566eadd506 --- /dev/null +++ b/docs/source/design/hadr/decisions/medium-term-target.md @@ -0,0 +1,48 @@ +![Corda](https://www.corda.net/wp-content/uploads/2016/11/fg005_corda_b.png) + +------ + +# Design Decision: Medium-term target for node HA + +## Background / Context + +Designing for high availability is a complex task which can only be delivered over an operationally-significant timeline. It is therefore important to determine whether an intermediate state design (deliverable for around March 2018) is desirable as a precursor to longer term outcomes. + + + +## Options Analysis + +### 1. Hot-warm as interim state (see [HA design doc](../design.md)) + +#### Advantages + +1. Simpler master/slave election logic +2. Fewer edge cases with respect to messages being consumed by flows. +3. Naive solution of just stopping/starting the node code is simple to implement. + +#### Disadvantages + +1. Still probably requires the Artemis MQ outside of the node in a cluster. +2. May actually turn out more risky than hot-hot, because shutting down code is always prone to deadlocks and resource leakages. +3. Some work would have to be thrown away when we create a full hot-hot solution. + +### 2. Progress immediately to Hot-hot (see [HA design doc](../design.md)) + +#### Advantages + +1. Horizontal scalability is what all our customers want. +2. It simplifies many deployments as nodes in a cluster are all equivalent. + +#### Disadvantages + +1. More complicated especially regarding message routing. +2. Riskier to do this big-bang style. +3. Might not meet deadlines. + +## Recommendation and justification + +Proceed with Option 1: Hot-warm as interim state. + +## Decision taken + +Decision still required. 
\ No newline at end of file diff --git a/docs/source/design/hadr/decisions/near-term-target.md b/docs/source/design/hadr/decisions/near-term-target.md new file mode 100644 index 0000000000..87cd69db16 --- /dev/null +++ b/docs/source/design/hadr/decisions/near-term-target.md @@ -0,0 +1,46 @@ +![Corda](https://www.corda.net/wp-content/uploads/2016/11/fg005_corda_b.png) + +-------------------------------------------- +Design Decision: Near-term target for node HA +============================================ + +## Background / Context + +Designing for high availability is a complex task which can only be delivered over an operationally-significant timeline. It is therefore important to determine the target state in the near term as a precursor to longer term outcomes. + + + +## Options Analysis + +### 1. No HA + +#### Advantages + +1. Reduces developer distractions. + +#### Disadvantages + +1. No backstop if we miss our targets for fuller HA. +2. No answer at all for simple DR modes. + +### 2. Hot-cold (see [HA design doc](../design.md)) + +#### Advantages + +1. Flushes out lots of basic deployment issues that will be of benefit later. +2. If stuff slips we at least have a backstop position with hot-cold. +3. For now, the only DR story we have is essentially a continuation of this mode +4. The intent of decisions such as using a load balancer is to minimise code changes + +#### Disadvantages + +1. Distracts from the work for more complete forms of HA. +2. Involves creating a few components that are not much use later, for instance the mutual exclusion lock. + +## Recommendation and justification + +Proceed with Option 2: Hot-cold. + +## Decision taken + +Decision still required. 
\ No newline at end of file diff --git a/docs/source/design/hadr/design.md b/docs/source/design/hadr/design.md index 6f32fb11bb..d45e706902 100644 --- a/docs/source/design/hadr/design.md +++ b/docs/source/design/hadr/design.md @@ -48,13 +48,18 @@ In contrast, typical financial institutions maintain large, complex technology l Thus, HA is essential for enterprise Corda and providing help to administrators necessary for rapid fault diagnosis. +### Current node topology + +The diagram below illustrates Corda's current design in the context of messaging between peer nodes. No HA is currently supported by this topology. + +![Current (single process)](./HA%20deployment%20-%20No%20HA.png) + ## Requirements * A logical Corda node should continue to function in the event of an individual component failure or (e.g.) restart. * No loss, corruption or duplication of data on the ledger due to component outages * Ensure continuity of flows throughout any disruption * Support software upgrades in a live network - * Non-goals (out of scope for this design document) * Be able to distribute a node over more than two datacenters. @@ -74,35 +79,35 @@ For the March 31st timeline, I hope that we can achieve a more fully automatic n With regards to DR it is unclear how this would work where synchronous replication is not feasible. At this point we can only investigate approaches as an aside to the main thrust of work for HA support. In the synchronous replication mode it is assumed that the file and database replication can be used to ensure a cold DR backup. -## Proposed Solution -### Current (single process) -![Current (single process)](./HA%20deployment%20-%20No%20HA.png) +## Design Decisions -### Hot-Cold (minimum requirement) +The following design decisions are assumed by this design: + +1. [Near-term-target](./decisions/near-term-target.md): Hot-Cold HA (see below) +2. [Medium-term target](./decisions/medium-term-target.md): Hot-Warm HA (see below) +3. 
[External broker](./decisions/external-broker.md): Yes +4. [Database message store](./decisions/db-msg-store.md): No +5. [IP addressing mechanism](./decisions/ip-addressing.md): Load balancer +6. [Crash shell start/stop](./decisions/crash-shell.md): No + + + +## Target Solution + +### Hot-Cold (near-term target) ![Hot-Cold (minimum requirement)](./HA%20deployment%20-%20Hot-Cold.png) -### Hot-Warm (Medium-term solution) +### Hot-Warm (medium-term target) ![Hot-Warm (Medium-term solution)](./HA%20deployment%20-%20Hot-Warm.png) -### Hot-Hot (Long-term strategic solution) +### Hot-Hot (long-term target) ![Hot-Hot (Long-term strategic solution)](./HA%20deployment%20-%20Hot-Hot.png) -## Alternative Options - -List any alternative solutions that may be viable but not recommended. - -## Final recommendation - -Proposed solution (if more than one option presented) -Proceed direct to implementation -Proceed to Technical Design stage -Proposed Platform Technical team(s) to implement design (if not already decided) - -------------------------------------------- IMPLEMENTATION PLAN ============================================ -# Transitioning from Corda 2.0 to Manually Activated HA +## Transitioning from Corda 2.0 to Manually Activated HA The current Corda is built to run as a fully contained single process with the Flow logic, H2 database and Artemis broker all bundled together. This limits the options for automatic replication, or subsystem failure. Thus, we must use external mechanisms to replicate the data in the case of failure. We also should ensure that accidental dual start is not possible in case of mistakes, or slow shutdown of the primary. @@ -117,20 +122,20 @@ Based on this situation, I suggest the following minimum development tasks are r 7. Confirm that the behaviour of the RPC proxy is stable through these restarts, from the perspective of a stateless REST server calling through to RPC. 
The RPC API should provide positive feedback to the application, so that it can respond in a controlled fashion when disconnected. 8. Work on flow hospital tools where needed -# Moving Towards Automatic Failover HA +## Moving Towards Automatic Failover HA To move towards more automatic failover handling we need to ensure that the node can be partially active i.e. live monitoring the health status and perhaps keeping major data structures in sync for faster activation, but not actually processing flows. This needs to be reversible without leakage, or destabilising the node as it is common to use manually driven master changes to help with software upgrades and to carry out regular node shutdown and maintenance. Also, to reduce the risks associated with the uncoupled replication of the Artemis message data and the database I would recommend that we move the Artemis broker out of the node to allow us to create a failover cluster. This is also in line with the goal of creating a AMQP bridges and Floats. To this end I would suggest packages of work that include: 1. Move the broker out of the node, which will require having a protocol that can be used to signal bridge creation and which decouples the network map. This is in line with the Flow work anyway. -2. Create a mastering solution, probably using Atomix.IO although this might require a solution with a minimum of three nodes to avoid split brain issues. Ideally this service should be extensible in the future to lead towards an eventual state with Flow level sharding. Alternatively, we may be able to add a quick enterprise adaptor to ZooKeeper as master selector if time is tight. This will inevitably impact upon configuration and deployment support. -3. Test the leakage when we repeated start-stop the Node class and fix any resource leaks, or deadlocks that occur at shutdown. -4. Switch the Artemis client code to be able to use the HA mode connection type and thus take advantage of the rapid failover code. 
Also, ensure that we can support multiple public IP addresses reported in the network map. -5. Implement proper detection and handling of disconnect from the external database and/or Artemis broker, which should immediately drop the master status of the node and flush any incomplete flows. -6. We should start looking at how to make RPC proxies recover from disconnect/failover, although this is probably not a top priority. However, it would be good to capture the missed results of completed flows and ensure the API allows clients to unregister/re-register Observables. + 2. Create a mastering solution, probably using Atomix.IO although this might require a solution with a minimum of three nodes to avoid split brain issues. Ideally this service should be extensible in the future to lead towards an eventual state with Flow level sharding. Alternatively, we may be able to add a quick enterprise adaptor to ZooKeeper as master selector if time is tight. This will inevitably impact upon configuration and deployment support. + 3. Test the leakage when we repeatedly start-stop the Node class and fix any resource leaks, or deadlocks that occur at shutdown. + 4. Switch the Artemis client code to be able to use the HA mode connection type and thus take advantage of the rapid failover code. Also, ensure that we can support multiple public IP addresses reported in the network map. + 5. Implement proper detection and handling of disconnect from the external database and/or Artemis broker, which should immediately drop the master status of the node and flush any incomplete flows. + 6. We should start looking at how to make RPC proxies recover from disconnect/failover, although this is probably not a top priority. However, it would be good to capture the missed results of completed flows and ensure the API allows clients to unregister/re-register Observables. 
-# The Future +## The Future Hopefully, most of the work from the automatic failover mode can be modified when we move to a full hot-hot sharding of flows across nodes. The mastering solution will need to be modified to negotiate finer grained claim on individual flows, rather than stopping the whole of Node. Also, the routing of messages will have to be thought about so that they go to the correct node for processing, but failover if the node dies. However, most of the other health monitoring and operational aspects should be reusable.
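
Reviewer note: the "finer grained claim on individual flows" described in the final paragraph could be pictured with a lease table keyed by flow, as in the rough sketch below. It is an illustration under stated assumptions only, not Corda code and not the Atomix.IO or ZooKeeper API: `FlowClaimTable`, `Lease`, `lease_seconds` and the node and flow identifiers are all hypothetical names, and the in-memory dictionary stands in for whatever shared coordination service is eventually chosen.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Lease:
    owner: str         # hypothetical node identifier
    expires_at: float  # time after which another node may take over


class FlowClaimTable:
    """In-memory stand-in for a shared coordination service (hypothetical)."""

    def __init__(self, lease_seconds: float) -> None:
        self.lease_seconds = lease_seconds
        self._leases: Dict[str, Lease] = {}

    def claim(self, flow_id: str, node: str, now: float) -> bool:
        """Claim or renew a flow; fails while another node's lease is live."""
        lease = self._leases.get(flow_id)
        if lease is not None and lease.owner != node and lease.expires_at > now:
            return False
        self._leases[flow_id] = Lease(node, now + self.lease_seconds)
        return True

    def owner(self, flow_id: str, now: float) -> Optional[str]:
        """Return the node that should receive messages for this flow, if any."""
        lease = self._leases.get(flow_id)
        if lease is None or lease.expires_at <= now:
            return None
        return lease.owner


table = FlowClaimTable(lease_seconds=10.0)
assert table.claim("flow-1", "node-a", now=0.0)      # node-a takes flow-1
assert not table.claim("flow-1", "node-b", now=5.0)  # lease is still live
assert table.claim("flow-2", "node-b", now=5.0)      # other flows are independent
assert table.owner("flow-1", now=5.0) == "node-a"
assert table.owner("flow-1", now=11.0) is None       # node-a stopped renewing
assert table.claim("flow-1", "node-b", now=11.0)     # node-b takes over the flow
```

The clock is passed in explicitly so the behaviour is deterministic. In this picture the whole-node mastering of the hot-warm stage would correspond to a single lease covering every flow, and message routing would consult `owner` to pick the target node.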