Docs: import hadr design doc.
parent 3f44910a2b
commit b2c28cb523
BIN  docs/source/design/hadr/HA deployment - Hot-Cold.png  (new file, binary not shown, 376 KiB)
BIN  docs/source/design/hadr/HA deployment - Hot-Hot.png   (new file, binary not shown, 423 KiB)
BIN  docs/source/design/hadr/HA deployment - Hot-Warm.png  (new file, binary not shown, 247 KiB)
BIN  docs/source/design/hadr/HA deployment - No HA.png     (new file, binary not shown, 280 KiB)
52  docs/source/design/hadr/decisions/crash-shell.md  (new file)
@@ -0,0 +1,52 @@
![Corda](https://www.corda.net/wp-content/uploads/2016/11/fg005_corda_b.png)

--------------------------------------------

Design Decision: Node starting & stopping
============================================

## Background / Context

The potential use of a crash shell is relevant to the [high availability](../design.md) capabilities of nodes.

## Options Analysis

### 1. Use crash shell

#### Advantages

1. Already built into the node.
2. Potentially add custom commands.

#### Disadvantages

1. Won't reliably work if the node is in an unstable state.
2. Not practical for running hundreds of nodes, as our customers are already trying to do.
3. Doesn't mesh with the user access controls of the organisation.
4. Doesn't interface to the existing monitoring and control systems, i.e. Nagios, Geneos ITRS, Docker Swarm, etc.

### 2. Delegate to external tools

#### Advantages

1. Doesn't require change from our customers.
2. Will work even if the node is completely stuck.
3. Allows scripted node restart schedules.
4. Doesn't raise questions about access control lists and audit.

#### Disadvantages

1. More uncertainty about what customers do.
2. Might impose more requirements on us to interact nicely with lots of different products.
3. Might mean we get blamed for faults in other people's control software.
4. Doesn't coordinate with the node for graceful shutdown.
5. Doesn't address any crypto features that target protecting the AMQP headers.

## Recommendation and justification

Proceed with Option 2: Delegate to external tools.

## Decision taken

**[DRB meeting, 16/11/2017:](./drb-meeting-20171116.md)** Restarts should be handled by polite shutdown, followed by a hard clear. (RGB, JC, MH agreed)

47  docs/source/design/hadr/decisions/db-msg-store.md  (new file)
@@ -0,0 +1,47 @@
![Corda](https://www.corda.net/wp-content/uploads/2016/11/fg005_corda_b.png)

--------------------------------------------

Design Decision: Message storage
============================================

## Background / Context

Storage of messages by the message broker has implications for the replication technologies which can be used to ensure both [high availability](../design.md) and disaster recovery of Corda nodes.

## Options Analysis

### 1. Storage in the file system

#### Advantages

1. Out-of-the-box configuration.
2. Recommended Artemis setup.
3. Faster.
4. Less likely to have interactions with DB BLOB rules.

#### Disadvantages

1. Unaligned capture time of journal data compared to DB checkpointing.
2. Replication options on Azure are limited. Currently we may be forced to the 'Azure Files' SMB mount, rather than the 'Azure Data Disk' option. This is still being evaluated.

### 2. Storage in node database

#### Advantages

1. Single point of data capture and backup.
2. Consistent solution between VM and physical box solutions.

#### Disadvantages

1. Doesn't work on H2 or SQL Server. From my own testing, LargeObject support is broken. The current Artemis code base does allow some pluggability, but not of the large object implementation, only of the SQL statements. We should lobby for someone to fix the implementations for SQL Server and H2.
2. Probably much slower, although this needs measuring.

## Recommendation and justification

Continue with Option 1: Storage in the file system.

## Decision taken

[DRB meeting, 16/11/2017:](./drb-meeting-20171116.md) Use storage in the file system (for now)

126  docs/source/design/hadr/decisions/drb-meeting-20171116.md  (new file)
@@ -0,0 +1,126 @@
![Corda](https://www.corda.net/wp-content/uploads/2016/11/fg005_corda_b.png)

--------------------------------------------

Design Review Board Meeting Minutes
============================================

**Date / Time:** 16/11/2017, 16:30

## Attendees

- Mark Oldfield (MO)
- Matthew Nesbit (MN)
- Richard Gendal Brown (RGB)
- James Carlyle (JC)
- Mike Hearn (MH)
- Jose Coll (JoC)
- Rick Parker (RP)
- Andrey Bozhko (AB)
- Dave Hudson (DH)
- Nick Arini (NA)
- Ben Abineri (BA)
- Jonathan Sartin (JS)
- David Lee (DL)

## **Minutes**

The meeting re-opened following prior discussion of the float design.

MN introduced the design for high availability, clarifying that the design did not include support for DR-implied features (asynchronous replication etc.).

MN highlighted limitations in testability: Azure had confirmed support for geo-replication, but with limited control by the user and no testing facility; all R3 can do is test for impact on performance.

The design was noted to depend heavily on external technologies for replication, with R3's testing capability limited to Azure. Agent banks may want to use SAN replication across dark fiber sites, redundant switches, etc. not available to R3.

MN noted that certain databases are not yet officially supported in Corda.

### [Near-term target](./near-term-target.md), [Medium-term target](./medium-term-target.md)

Outlining the hot-cold design, MN highlighted the importance of ensuring only one node is active at one time. MN argued for having a tested hot-cold solution as a 'backstop'. MN confirmed the work involved was to develop DB/SAN exclusion checkers and test appropriately.

JC queried whether unknowns exist for hot-cold. MN described limitations of Azure file replication.

JC noted there was optionality around both the replication mechanisms and the on-premises vs. cloud deployment.

### [Message storage](./db-msg-store.md)

Lack of support for storing Artemis messages via JDBC was raised, and the possibility for RedHat to provide an enhancement was discussed.

MH raised the alternative of using Artemis' inbuilt replication protocol - MN confirmed this was in scope for hot-warm, but not hot-cold.

JC posited that file system/SAN replication should be OK for banks.

**DECISION AGREED**: Use storage in the file system (for now).

AB asked about protections against corruption; RGB highlighted the need for testing on this. MH described previous testing activity, arguing for a performance cluster that repeatedly runs load tests, kills nodes, checks they come back, etc.

MN could not comment on the testing status of the current code. MH noted the notary hasn't been tested.

AB queried how basic node recovery would work. MN explained, highlighting the limitation for RPC callbacks.

JC proposed these limitations should be noted and explained to Finastra; move on.

There was discussion of how RPC observables could be made to persist across node outages. MN argued that for most applications, a clear signal of the outage that triggered clients to resubscribe was preferable. This was agreed.

JC argued for using Kafka.

MN presented the hot-warm solution as a target for March-April and provided clarifications on differences vs. hot-cold and hot-hot.

JC highlighted that clustered Artemis was an important intermediate step. MN highlighted other important features.

MO noted that different banks may opt for different solutions.

JoC raised the question of multi-IP per node.

MN described the hot-hot solution, highlighting that flows remained 'sticky' to a particular instance but could be picked up by another when needed.

AB preferred the hot-hot solution. MN noted the many edge cases to be worked through.

AB queried the DR story. MO stated this was out of scope at present.

There was discussion of the implications of not having synchronous replication.

MH questioned the need for a backup strategy that allows winding back the clock. MO stated this was out of scope at present.

MO drew attention to the expectation that Corda would be considered part of larger solutions with controlled restore procedures under BCP.

JC noted the variability in many elements as a challenge.

MO argued for providing a 'shrink-wrapped' solution based around equipment R3 could test (e.g. Azure).

JC argued for the need to manage testing of banks' infrastructure choices in order to reduce time to implementation.

There was discussion around the semantic difference between HA and DR. MH argued for a definition based around rolling backups. MN and MO shared banks' view of what DR is. MH contrasted this with Google definitions. AB noted HA and DR have different SLAs.

**DECISION AGREED:** Near-term target: Hot-cold; Medium-term target: Hot-warm (RGB, JC, MH agreed).

RGB queried why Artemis couldn't be run in clustered mode now. MN explained.

AB queried what Finastra asked for. MO implied nothing specific; MH maintained this would be needed anyway.

### [Broker separation](./external-broker.md)

MN outlined his rationale for broker separation.

JC queried whether this would affect demos.

MN gave an assumption that HA was for enterprise only; RGB and JC pointed out that Enterprise might still be made available for non-production use.

**DECISION AGREED**: The broker should only be separated if required by other features (e.g. the float), otherwise not. (RGB, JC, MH agreed)

### [Load balancers and multi-IP](./ip-addressing.md)

The topic was discussed.

**DECISION AGREED**: The design can allow for optional load balancers to be implemented by clients.

### [Crash shell](./crash-shell.md)

MN provided an outline explanation.

**DECISION AGREED**: Restarts should be handled by polite shutdown, followed by a hard clear. (RGB, JC, MH agreed)

49  docs/source/design/hadr/decisions/external-broker.md  (new file)
@@ -0,0 +1,49 @@
![Corda](https://www.corda.net/wp-content/uploads/2016/11/fg005_corda_b.png)

--------------------------------------------

Design Decision: Broker separation
============================================

## Background / Context

A decision on whether to extract the Artemis message broker as a separate component has implications for the design of [high availability](../design.md) for nodes.

## Options Analysis

### 1. No change (leave broker embedded)

#### Advantages

1. Least change.

#### Disadvantages

1. Means that starting/stopping Corda is tightly coupled to starting/stopping Artemis instances.
2. Risks resource leaks from one system component affecting other components.
3. Not pluggable if we wish to have an alternative broker.

### 2. External broker

#### Advantages

1. Separates concerns.
2. Allows future pluggability and standardisation on AMQP.
3. Separates the life cycles of the components.
4. Makes Artemis deployment much more out of the box.
5. Allows easier tuning of VM resources for flow-processing workloads vs. broker-type workloads.
6. Allows a later encrypted version to be an enterprise feature that can interoperate with open source versions.

#### Disadvantages

1. More work.
2. Requires creating a protocol to control external bridge formation.

## Recommendation and justification

Proceed with Option 2: External broker.

## Decision taken

**[DRB meeting, 16/11/2017:](./drb-meeting-20171116.md)** The broker should only be separated if required by other features (e.g. the float), otherwise not. (RGB, JC, MH agreed)

48  docs/source/design/hadr/decisions/ip-addressing.md  (new file)
@@ -0,0 +1,48 @@
![Corda](https://www.corda.net/wp-content/uploads/2016/11/fg005_corda_b.png)

--------------------------------------------

Design Decision: IP addressing mechanism (near-term)
============================================

## Background / Context

End-to-end encryption is a desirable potential design feature for the [float](../design.md).

## Options Analysis

### 1. Via load balancer

#### Advantages

1. Standard technology in banks and on clouds, often used for non-HA purposes.
2. Intended to allow us to wait for completion of the network map work.

#### Disadvantages

1. We do need to support multiple IP address advertisements in the network map long term.
2. Might involve a small amount of code if we find Artemis doesn't like the health probes. So far, though, testing of the Azure load balancer hasn't needed this.
3. Won't work over very large data centre separations, but that doesn't work for HA/DR either.

### 2. Via IP list in Network Map

#### Advantages

1. More flexible.
2. More deployment options.
3. We will need it one day.

#### Disadvantages

1. Have to write code to support it.
2. Configuration is more complicated, and the nodes are then non-equivalent, so you can't just copy the config to the backup.
3. Artemis has round robin and automatic failover, so we may have to expose a vendor-specific config flag in the network map.

## Recommendation and justification

Proceed with Option 1: Via load balancer.

## Decision taken

**[DRB meeting, 16/11/2017:](./drb-meeting-20171116.md)** The design can allow for optional load balancers to be implemented by clients. (RGB, JC, MH agreed)

49  docs/source/design/hadr/decisions/medium-term-target.md  (new file)
@@ -0,0 +1,49 @@
![Corda](https://www.corda.net/wp-content/uploads/2016/11/fg005_corda_b.png)

------

# Design Decision: Medium-term target for node HA

## Background / Context

Designing for high availability is a complex task which can only be delivered over an operationally-significant timeline. It is therefore important to determine whether an intermediate-state design (deliverable around March 2018) is desirable as a precursor to longer-term outcomes.

## Options Analysis

### 1. Hot-warm as interim state (see [HA design doc](../design.md))

#### Advantages

1. Simpler master/slave election logic.
2. Fewer edge cases with respect to messages being consumed by flows.
3. The naive solution of just stopping/starting the node code is simple to implement.

#### Disadvantages

1. Still probably requires the Artemis MQ to run outside of the node in a cluster.
2. May actually turn out riskier than hot-hot, because shutting down code is always prone to deadlocks and resource leakages.
3. Some work would have to be thrown away when we create a full hot-hot solution.

### 2. Progress immediately to hot-hot (see [HA design doc](../design.md))

#### Advantages

1. Horizontal scalability is what all our customers want.
2. It simplifies many deployments as nodes in a cluster are all equivalent.

#### Disadvantages

1. More complicated, especially regarding message routing.
2. Riskier to do this big-bang style.
3. Might not meet deadlines.

## Recommendation and justification

Proceed with Option 1: Hot-warm as interim state.

## Decision taken

**[DRB meeting, 16/11/2017:](./drb-meeting-20171116.md)** Adopt option 1: Medium-term target: Hot-warm (RGB, JC, MH agreed)

46  docs/source/design/hadr/decisions/near-term-target.md  (new file)
@@ -0,0 +1,46 @@
![Corda](https://www.corda.net/wp-content/uploads/2016/11/fg005_corda_b.png)

--------------------------------------------

Design Decision: Near-term target for node HA
============================================

## Background / Context

Designing for high availability is a complex task which can only be delivered over an operationally-significant timeline. It is therefore important to determine the target state in the near term as a precursor to longer-term outcomes.

## Options Analysis

### 1. No HA

#### Advantages

1. Reduces developer distractions.

#### Disadvantages

1. No backstop if we miss our targets for fuller HA.
2. No answer at all for simple DR modes.

### 2. Hot-cold (see [HA design doc](../design.md))

#### Advantages

1. Flushes out lots of basic deployment issues that will be of benefit later.
2. If stuff slips, we at least have a backstop position with hot-cold.
3. For now, the only DR story we have is essentially a continuation of this mode.
4. The intent of decisions such as using a load balancer is to minimise code changes.

#### Disadvantages

1. Distracts from the work for more complete forms of HA.
2. Involves creating a few components that are not much use later, for instance the mutual exclusion lock.

## Recommendation and justification

Proceed with Option 2: Hot-cold.

## Decision taken

**[DRB meeting, 16/11/2017:](./drb-meeting-20171116.md)** Adopt option 2: Near-term target: Hot-cold (RGB, JC, MH agreed)

236  docs/source/design/hadr/design.md  (new file)
@@ -0,0 +1,236 @@
![Corda](https://www.corda.net/wp-content/uploads/2016/11/fg005_corda_b.png)

# High Availability Support for Corda: A Phased Approach

-------------------
DOCUMENT MANAGEMENT
===================

## Document Control

* High Availability and Disaster Recovery for Corda: A Phased Approach
* Date: 13th November 2017
* Author: Matthew Nesbit
* Distribution: Design Review Board, Product Management, Services - Technical (Consulting), Platform Delivery
* Corda target version: Enterprise

## Document Sign-off

* Author: David Lee
* Reviewer(s): TBD
* Final approver(s): TBD

## Document History

--------------------------------------------
HIGH LEVEL DESIGN
============================================

## Overview
### Background

The term high availability (HA) is used in this document to refer to the ability to rapidly handle any single component failure, whether due to physical issues (e.g. hard drive failure), network connectivity loss, or software faults.

Expectations of HA in modern enterprise systems are for systems to recover normal operation in a few minutes at most, while ensuring minimal/zero data loss. Whilst overall reliability is the overriding objective, it is desirable for Corda to offer HA mechanisms which are both highly automated and transparent to node operators. HA mechanisms must not involve any configuration changes that require more than an appropriate admin tool, or a simple start/stop of a process, as anything more would need an Emergency Change Request.

HA naturally grades into requirements for Disaster Recovery (DR), which requires that there is a tested procedure to handle large-scale multi-component failures, e.g. due to data centre flooding or acts of terrorism. DR processes are permitted to involve significant manual intervention, although the complications of actually invoking a Business Continuity Plan (BCP) mean that the less manual intervention is needed, the more competitive Corda will be in the modern vendor market.
For modern financial institutions, maintaining comprehensive and effective BCP procedures is a legal requirement, and these procedures are generally tested at least once a year.

However, until Corda is the system of record, or the primary system for transactions, we are unlikely to be required to have any kind of fully automatic DR. In fact, we are likely to be restarted only once BCP has restored the most critical systems.
In contrast, typical financial institutions maintain large, complex technology landscapes in which individual component failures can occur, such as:

* Small-scale software failures
* Mandatory data centre power cycles
* Operating system patching and restarts
* Short-lived network outages
* Middleware queue build-up
* Machine failures

Thus, HA is essential for enterprise Corda, as is providing administrators with the help necessary for rapid fault diagnosis.

### Current node topology

![Current (single process)](./HA%20deployment%20-%20No%20HA.png)

The current solution has a single integrated process running in one JVM, including Artemis, the H2 database, the flow state machine and P2P bridging. All storage is on the local file system. There is no HA capability other than manual restart of the node following failure.

#### Limitations

- All sub-systems must be started and stopped together.
- Unable to handle partial failure, e.g. of Artemis.
- Artemis cannot use its in-built HA capability (clustered slave mode) as it is embedded.
- Cannot run the node with the flow state machine suspended.
- Cannot use alternative message brokers.
- Cannot run multiple nodes against the same broker.
- Cannot use alternative databases to H2.
- Cannot share the database across Corda nodes.
- RPC clients do have automatic reconnect, but there is no clear solution for resynchronising on reconnect.
- The backup strategy is unclear.

## Requirements
### Goals
* A logical Corda node should continue to function in the event of an individual component failure or (e.g.) restart.
* No loss, corruption or duplication of data on the ledger due to component outages.
* Ensure continuity of flows throughout any disruption.
* Support software upgrades in a live network.

### Non-goals (out of scope for this design document)
* Be able to distribute a node over more than two data centres.
* Be able to distribute a node between data centres that are very far apart latency-wise (unless you don't care about performance).
* Be able to tolerate arbitrary byzantine failures within a node cluster.
* DR, specifically in the case of the complete failure of a site/data centre/cluster or region, will require a different solution to that specified here. For now, DR is only supported where performant synchronous replication is feasible, i.e. sites only a few miles apart.

## Timeline

This design document outlines a range of topologies which will be enabled through progressive enhancements from the short to the long term.

On the timescales available for the current production pilot deployments, we clearly do not have time to reach the ideal of a highly fault-tolerant, horizontally scaled Corda.

Instead, I suggest that we can achieve only the simplest state of a standby Corda installation by January 5th, and even this is contingent on other enterprise features, such as the external database and network map stabilisation, being completed on this timescale, plus any issues raised by testing.

For the March 31st timeline, I hope that we can achieve a more fully automatic node failover state, with the Artemis broker running as a cluster too. I include a diagram of a fully scaled Corda for completeness and so that I can discuss what work is reusable and what is throwaway.

With regard to DR, it is unclear how this would work where synchronous replication is not feasible. At this point we can only investigate approaches as an aside to the main thrust of work for HA support. In the synchronous replication mode, it is assumed that file and database replication can be used to ensure a cold DR backup.

## Design Decisions

The following design decisions are assumed by this design:

1. [Near-term target](./decisions/near-term-target.md): Hot-Cold HA (see below)
2. [Medium-term target](./decisions/medium-term-target.md): Hot-Warm HA (see below)
3. [External broker](./decisions/external-broker.md): Yes
4. [Database message store](./decisions/db-msg-store.md): No
5. [IP addressing mechanism](./decisions/ip-addressing.md): Load balancer
6. [Crash shell start/stop](./decisions/crash-shell.md): No

## Target Solution

### Hot-Cold (minimum requirement)
![Hot-Cold (minimum requirement)](./HA%20deployment%20-%20Hot-Cold.png)

Small-scale software failures on a node are recovered from locally by restarting/resetting the offending component via an external (to the JVM) "Health Watchdog" (HW) process. The HW process (e.g. a shell script or similar) would monitor parameters of the Java processes by periodically querying them (with a sleep period of a few seconds). This may require the introduction of a few monitoring 'hooks' into the Corda codebase, or a "health" CorDapp that the HW script can interface with. There would be back-off logic to prevent continuous restarts in the case of persistent failure.

We would provide a fully-functional sample HW script for Linux/Unix deployment platforms.
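
To make the watchdog idea concrete, here is a minimal Kotlin sketch of such a loop (not the promised sample script): it assumes a hypothetical local health endpoint on a TCP port and a hypothetical restart command, and applies back-off between restart attempts.

```kotlin
import java.net.InetSocketAddress
import java.net.Socket

// Sketch of an external Health Watchdog loop. The health port and restart
// command are illustrative assumptions, not part of the current Corda node.
const val HEALTH_PORT = 10050                                  // hypothetical health endpoint
val RESTART_COMMAND = listOf("systemctl", "restart", "corda-node")

fun isHealthy(): Boolean = try {
    // Crude liveness probe: a real watchdog would query richer monitoring
    // 'hooks' or a dedicated "health" CorDapp instead of a bare TCP connect.
    Socket().use { it.connect(InetSocketAddress("localhost", HEALTH_PORT), 2_000); true }
} catch (e: Exception) {
    false
}

fun main() {
    var backOffSeconds = 5L
    while (true) {
        if (isHealthy()) {
            backOffSeconds = 5L                                // healthy again: reset back-off
        } else {
            println("Node unhealthy; restarting (next back-off ${backOffSeconds}s)")
            ProcessBuilder(RESTART_COMMAND).inheritIO().start().waitFor()
            Thread.sleep(backOffSeconds * 1_000)               // back off to avoid continuous restarts
            backOffSeconds = minOf(backOffSeconds * 2, 300L)
        }
        Thread.sleep(5_000)                                    // poll every few seconds
    }
}
```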

The hot-cold design provides a backup VM and Corda deployment instance that can be manually started if the primary is stopped. The failed primary must be killed to ensure it is fully stopped.

For single-node deployment scenarios, the simplest supported way to recover from failures is to restart the entire set of Corda node processes or reboot the node OS.

For a 2-node HA deployment scenario, a load balancer determines which node is active and routes traffic to that node.
The load balancer will need to monitor the health of the primary and secondary nodes and automatically route traffic from the public IP address to the only active endpoint. An external solution is required for the load balancer and health monitor. In the case of Azure cloud deployments, no custom code needs to be developed to support the health monitor.

An additional component will be written to prevent accidental dual running, most likely making use of a database heartbeat table. Code size should be minimal.
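
A minimal sketch of what that mutual-exclusion component might look like, assuming a hypothetical single-row `node_mutual_exclusion` table; the table, column names and timeout are illustrative only, and a production version would also need to consider transaction isolation:

```kotlin
import java.sql.Connection
import java.time.Duration
import java.time.Instant

// Hypothetical single-row table: node_mutual_exclusion(node_id VARCHAR, last_beat TIMESTAMP).
// A node may only start if the row is absent or the last heartbeat is stale.
class DbMutualExclusion(private val connection: Connection,
                        private val nodeId: String,
                        private val staleAfter: Duration = Duration.ofSeconds(30)) {

    fun tryBecomeActive(): Boolean {
        connection.prepareStatement("SELECT node_id, last_beat FROM node_mutual_exclusion").use { stmt ->
            stmt.executeQuery().use { rs ->
                if (rs.next()) {
                    val holder = rs.getString("node_id")
                    val lastBeat = rs.getTimestamp("last_beat").toInstant()
                    val stale = Duration.between(lastBeat, Instant.now()) > staleAfter
                    if (holder != nodeId && !stale) return false   // another node is alive: refuse to start
                    claimRow(update = true)
                } else {
                    claimRow(update = false)
                }
            }
        }
        return true
    }

    // Called periodically (e.g. every few seconds) while this node is the active one.
    fun beat() = claimRow(update = true)

    private fun claimRow(update: Boolean) {
        val sql = if (update)
            "UPDATE node_mutual_exclusion SET node_id = ?, last_beat = CURRENT_TIMESTAMP"
        else
            "INSERT INTO node_mutual_exclusion (node_id, last_beat) VALUES (?, CURRENT_TIMESTAMP)"
        connection.prepareStatement(sql).use { it.setString(1, nodeId); it.executeUpdate() }
    }
}
```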

#### Advantages

- This approach minimises the need for new code, so it can be deployed quickly.
- Use of a load balancer in the short term avoids the need for new code and configuration management to support the alternative approach of multiple advertised addresses for a single legal identity.
- Configuration of the inactive node should be a simple mirror of the primary.
- Assumes external monitoring and management of the nodes (e.g. the ability to identify node failure), so that Corda watchdog code will not be required (customer developed).

#### Limitations

- Slow failover, as this is manually controlled.
- Requires external solutions for replication of database and Artemis journal data.
- Replication mechanism on agent banks with real servers not tested.
- Replication mechanism on Azure is under test but may prove to be too slow.
- Compatibility with external load balancers not tested. Only the Azure configuration has been tested.
- Contingent on completion of database support and testing of replication.
- Failure of the database (loss of connection) may not be supported or may require additional code.
- RPC clients are assumed to make short-lived RPC requests, e.g. from a REST server, so there is no support for long-lived clients operating across failover.
- Replication time points of the database and Artemis message data are independent and may not fully synchronise (may work subject to testing).
- Health reporting and process controls need to be developed by the customer.

### Hot-Warm (Medium-term solution)
![Hot-Warm (Medium-term solution)](./HA%20deployment%20-%20Hot-Warm.png)

Hot-warm aims to automate failover and provide failover of individual major components, e.g. Artemis.

It involves two key changes to the hot-cold design:
1) Separation and clustering of the Artemis broker.
2) Start and stop of flow processing without JVM exit.

The consequences of these changes are that peer-to-peer bridging is separated from the node and a bridge control protocol must be developed (a sketch of such a protocol is given below).
A leader election component is a precursor to load balancing. It is likely to be a combination of custom code and a standard library and, in the short term, is likely to be via the database.
Cleaner handling of disconnects from the external components (Artemis and the database) will also be needed.
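
Purely to illustrate the shape such a bridge control protocol could take (no such message types exist in the current codebase), the node and an external bridge manager might exchange messages along these lines:

```kotlin
// Hypothetical bridge control messages: a sketch of the protocol shape only,
// not an existing Corda API. These would travel over the broker between the
// node and the external bridge process.
data class QueueSnapshot(val queueName: String, val legalName: String, val targetAddresses: List<String>)

sealed class BridgeControl {
    // Node -> bridge: full snapshot of outbound queues needing bridges (sent on start-up/failover).
    data class NodeToBridgeSnapshot(val nodeId: String, val sendQueues: List<QueueSnapshot>) : BridgeControl()

    // Node -> bridge: a new outbound queue was created and needs a bridge.
    data class CreateBridge(val nodeId: String, val queue: QueueSnapshot) : BridgeControl()

    // Node -> bridge: a queue was removed; tear down the corresponding bridge.
    data class DeleteBridge(val nodeId: String, val queueName: String) : BridgeControl()

    // Bridge -> node: ask the (newly active) node to re-send its snapshot.
    data class BridgeToNodeSnapshotRequest(val bridgeId: String) : BridgeControl()
}
```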

#### Advantages

- Faster failover, as no manual intervention is required.
- We can use the Artemis replication protocol to replicate the message store.
- The approach is integrated with preliminary steps for the float.
- Able to handle loss of network connectivity to the database from one node.
- Extraction of the Artemis server allows a more standard Artemis deployment.
- Provides protection against resource leakage in either Artemis or the node affecting the other component.
- VMs can be tuned to address the different workload patterns of broker and node.
- The bridge work allows a chance to support multiple IP addresses without a load balancer.

#### Limitations

- This approach will require careful testing of resource management on partial shutdown.
- No horizontal scaling support.
- Deployment of master and slave may not be completely symmetric.
- Care must be taken with upgrades to ensure master/slave election operates across updates.
- Artemis clustering does require a designated master at start-up of its cluster, hence any restart involving changing the primary node will require configuration management.
- The development effort is much more significant than for the hot-cold configuration.

### Hot-Hot (Long-term strategic solution)
![Hot-Hot (Long-term strategic solution)](./HA%20deployment%20-%20Hot-Hot.png)

In this configuration, all nodes are actively processing work and share a clustered database. A mechanism for sharding or distributing the workload will need to be developed.

#### Advantages

- Faster failover, as flows are picked up by other active nodes.
- Rapid scaling by adding additional nodes.
- Node deployment is symmetric.
- Any broker that can support AMQP can be used.
- RPC can gracefully handle failover because responsibility for the flow can be migrated across nodes without the client being aware.

#### Limitations

- Very significant work, with many edge cases during failure.
- Will require handling of more states than just checkpoints, e.g. soft locks and RPC subscriptions.
- Single flows will not be active on multiple nodes without future development work.

--------------------------------------------
IMPLEMENTATION PLAN
============================================

## Transitioning from Corda 2.0 to Manually Activated HA

The current Corda node is built to run as a fully contained single process with the flow logic, H2 database and Artemis broker all bundled together. This limits the options for automatic replication or handling of subsystem failure. Thus, we must use external mechanisms to replicate the data in the case of failure. We should also ensure that accidental dual start is not possible in the case of mistakes or slow shutdown of the primary.

Based on this situation, I suggest the following minimum development tasks are required for a tested HA deployment:

1. Complete and merge JDBC support for an external clustered database. Azure SQL Server has been identified as the most likely initial deployment. With this we should be able to point at an HA database instance for ledger and checkpoint data.
2. I am suggesting that for the near term we just use the Azure Load Balancer to hide the multiple machine addresses. This does require allowing a health monitoring link to the Artemis broker, but so far testing indicates that this operates without issue. Longer term we need to ensure that the network map and configuration support exists for the system to work with multiple TCP/IP endpoints advertised to external nodes. Ideally this should be rolled into the work for AMQP bridges and floats.
3. Implement a very simple mutual exclusion feature, so that an enterprise node cannot start if another is running against the same database. This can be via a simple heartbeat update in the database, or possibly some other library. This feature should be enabled only when specified by configuration.
4. The replication of the Artemis message queues will have to be via an external mechanism. On Azure we believe that the only practical solution is the 'Azure Files' approach, which maps a virtual Samba drive. We are testing this in case it is too slow to work. The mounting of separate data disks is possible, but they can only be mounted to one VM at a time, so they would not be compatible with the goal of no change requests for HA.
5. Improve health monitoring to better indicate failures. Extending the existing JMX and logging support should achieve this, although we probably need to create a watchdog CorDapp that verifies that the state machine and Artemis messaging are able to process new work and that monitors flow latency (a sketch follows this list).
6. Test the checkpointing mechanism and confirm that failures don't corrupt the data by deploying an HA setup on Azure and driving flows through the system as we stop the node randomly and switch to the other node. If this reveals any issues we will have to fix them.
7. Confirm that the behaviour of the RPC client API is stable through these restarts, from the perspective of a stateless REST server calling through to RPC. The RPC API should provide positive feedback to the application, so that it can respond in a controlled fashion when disconnected.
8. Work on flow hospital tools where needed.
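
As referenced in item 5 above, here is a minimal sketch of the sort of health-check flow such a watchdog CorDapp could expose, together with an external monitor driving it over RPC to measure flow latency. The flow name, RPC address and credentials are assumptions for illustration, and the exact RPC client API may differ between Corda versions:

```kotlin
import co.paralleluniverse.fibers.Suspendable
import net.corda.client.rpc.CordaRPCClient
import net.corda.core.flows.FlowLogic
import net.corda.core.flows.StartableByRPC
import net.corda.core.messaging.startFlow
import net.corda.core.utilities.NetworkHostAndPort
import java.time.Duration
import java.time.Instant

// Trivial flow: if this completes, the state machine is accepting and processing new work.
@StartableByRPC
class HealthCheckFlow : FlowLogic<Instant>() {
    @Suspendable
    override fun call(): Instant = Instant.now()
}

// External monitor: starts the flow via RPC and reports end-to-end flow latency.
fun main() {
    val client = CordaRPCClient(NetworkHostAndPort("localhost", 10006))   // assumed RPC address
    client.start("monitor", "password").use { connection ->               // assumed credentials
        val started = Instant.now()
        connection.proxy.startFlow(::HealthCheckFlow).returnValue.get()
        println("Flow latency: ${Duration.between(started, Instant.now()).toMillis()} ms")
    }
}
```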

## Moving Towards Automatic Failover HA

To move towards more automatic failover handling, we need to ensure that the node can be partially active, i.e. live-monitoring the health status and perhaps keeping major data structures in sync for faster activation, but not actually processing flows. This needs to be reversible without leakage, or destabilising the node, as it is common to use manually driven master changes to help with software upgrades and to carry out regular node shutdown and maintenance. Also, to reduce the risks associated with the uncoupled replication of the Artemis message data and the database, I would recommend that we move the Artemis broker out of the node to allow us to create a failover cluster. This is also in line with the goal of creating AMQP bridges and floats.

To this end I would suggest packages of work that include:

1. Move the broker out of the node, which will require having a protocol that can be used to signal bridge creation and which decouples the network map. This is in line with the Flow work anyway.
2. Create a mastering solution, probably using Atomix.IO, although this might require a solution with a minimum of three nodes to avoid split-brain issues. Ideally this service should be extensible in the future to lead towards an eventual state with flow-level sharding. Alternatively, we may be able to add a quick enterprise adaptor to ZooKeeper as a master selector if time is tight (a ZooKeeper-based sketch follows this list). This will inevitably impact upon configuration and deployment support.
3. Test for leakage when we repeatedly start and stop the Node class, and fix any resource leaks or deadlocks that occur at shutdown.
4. Switch the Artemis client code to be able to use the HA mode connection type and thus take advantage of the rapid failover code. Also, ensure that we can support multiple public IP addresses reported in the network map.
5. Implement proper detection and handling of disconnect from the external database and/or Artemis broker, which should immediately drop the master status of the node and flush any incomplete flows.
6. We should start looking at how to make RPC proxies recover from disconnect/failover, although this is probably not a top priority. However, it would be good to capture the missed results of completed flows and ensure the API allows clients to unregister/re-register Observables.
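
As referenced in item 2 above, a rough sketch of what a ZooKeeper-based master selector could look like using the Apache Curator `LeaderLatch` recipe; the connection string, latch path and the activate/deactivate hooks are assumptions for illustration:

```kotlin
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.framework.recipes.leader.LeaderLatch
import org.apache.curator.framework.recipes.leader.LeaderLatchListener
import org.apache.curator.retry.ExponentialBackoffRetry

// Sketch only: 'activateNode'/'deactivateNode' stand in for starting and stopping
// flow processing without exiting the JVM.
fun runMasterElection(nodeId: String, activateNode: () -> Unit, deactivateNode: () -> Unit) {
    // A three-node ZooKeeper ensemble is assumed, to avoid split-brain issues.
    val client = CuratorFrameworkFactory.newClient(
        "zk1:2181,zk2:2181,zk3:2181",
        ExponentialBackoffRetry(1000, 3)
    )
    client.start()

    val latch = LeaderLatch(client, "/corda/ha/bank-a/leader", nodeId)
    latch.addListener(object : LeaderLatchListener {
        override fun isLeader() = activateNode()      // this node becomes the hot (master) instance
        override fun notLeader() = deactivateNode()   // drop back to warm standby
    })
    latch.start()   // join the election; the callbacks fire as leadership changes
}
```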

## The Future

Hopefully, most of the work from the automatic failover mode can be adapted when we move to a full hot-hot sharding of flows across nodes. The mastering solution will need to be modified to negotiate finer-grained claims on individual flows, rather than stopping the whole node. Also, the routing of messages will have to be thought about so that they go to the correct node for processing, but fail over if that node dies. However, most of the other health monitoring and operational aspects should be reusable.

We also need to look at DR issues, and in particular how we might handle asynchronous replication and possibly alternative recovery/reconciliation mechanisms.