Bogdan hot warm doc

2025-05-27 20:54:24 +00:00 · 2018-02-05 10:14:42 +00:00 · 2018-02-05 10:14:42 +00:00 · b092b6b547
commit b092b6b547
parent 92cf91c0b0
4 changed files with 192 additions and 0 deletions
--- a/docs/source/design/failure-detection-master-election/HW
+++ b/docs/source/design/failure-detection-master-election/HW
--- a/docs/source/design/failure-detection-master-election/HW
+++ b/docs/source/design/failure-detection-master-election/HW
--- a/docs/source/design/failure-detection-master-election/design.md
+++ b/docs/source/design/failure-detection-master-election/design.md
@ -0,0 +1,105 @@
+![Corda](https://www.corda.net/wp-content/uploads/2016/11/fg005_corda_b.png)
+
+# Failure detection and master election: design proposal
+
+-------------------
+DOCUMENT MANAGEMENT
+===================
+
+## Document Control
+
+* Failure detection and master election: design proposal
+* Date: 23rd January 2018
+* Author: Bogdan Paunescu
+* Distribution: Design Review Board, Product Management, DevOps, Services - Technical (Consulting)
+* Corda target version: Enterprise
+
+## Document History
+
+--------------------------------------------
+HIGH LEVEL DESIGN
+============================================
+
+## Background
+
+Two key issues need to be resolved before Hot-Warm can be implemented:
+
+* Automatic failure detection (currently our Hot-Cold set-up requires a human observer to detect a failed node)
+* Master election and node activation (currently done manually)
+
+This document proposes two solutions to the above mentioned issues. The strengths and drawbacks of each solution are explored.
+
+## Constraints/Requirements
+
+Typical modern HA environments rely on a majority quorum of the cluster to be alive and operating normally in order to service requests. This means:
+
+* A cluster of 1 replica can tolerate 0 failures
+* A cluster of 2 replicas can tolerate 0 failures
+* A cluster of 3 replicas can tolerate 1 failure
+* A cluster of 4 replicas can tolerate 1 failure
+* A cluster of 5 replicas can tolerate 2 failures
+
+This already poses a challenge to us as clients will most likely want to deploy the minimum possible number of R3 Corda nodes. Ideally that minimum would be 3 but a solution for only 2 nodes should be available (even if it provides a lesser degree of HA than 3, 5 or more nodes). The problem with having only two nodes in the cluster is there is no distinction between failure and network partition.
+
+Users should be allowed to set a preference for which node to be active in a hot-warm environment. This would probably be done with the help of a property(persisted in the DB in order to be changed on the fly). This is an important functionality as users might want to have the active node on better hardware and switch to the back-ups and back as soon as possible.
+
+It would also be helpful for the chosen solution to not add deployment complexity.
+
+## Proposed solutions
+
+Based on what is needed for Hot-Warm, 1 active node and at least one passive node (started but in stand-by mode), and the constraints identified above (automatic failover with at least 2 nodes and master preference), two frameworks have been explored: Zookeeper and Atomix. Neither apply to our use cases perfectly and require some tinkering to solve our issues, especially the preferred master election.
+
+### Zookeeper
+
+![](./HW%20Design-Zookeeper.png)
+
+Preferred leader election - while the default algorithm does not take into account a leader preference, a custom algorithm can be implemented to suit our needs.
+
+Environment with 2 nodes -  while this type of set-up can't distinguish between a node failure and network partition, a workaround can be implemented by having 2 nodes and 3 zookeeper instances(3rd would be needed to form a majority).
+
+Pros:
+- Very well documented
+- Widely used, hence a lot of cookbooks, recipes and solutions to all sorts of problems
+- Supports custom leader election
+
+Cons:
+- Added deployment complexity
+- Bootstrapping a cluster is not very straightforward
+- Too complex for our needs?
+
+### Atomix
+
+![](./HW%20Design-Atomix.png)
+
+Preferred leader election - cannot be implemented easily; a creative solution would be required.
+
+Environment with 2 nodes - using only embedded replicas, there's no solution; Atomix comes also as a standalone server which could be run outside the node as a 3rd entity to allow a quorum(see image above).
+
+Pros:
+- Easy to get started with
+- Embedded, no added deployment complexity
+- Already used partially (Atomix Catalyst) in the notary cluster
+
+Cons:
+- Not as popular as Zookeeper, less used
+- Documentation is underwhelming; no proper usage examples
+- No easy way of influencing leader election; will require some creative use of Atomix functionality either via distributed groups or other resources
+
+## Recommendations
+
+If Zookeeper is chosen, we would need to look into a solution for easy configuration and deployment (maybe docker images). Custom leader election can be implemented by following one of the [examples](https://github.com/SainTechnologySolutions/allprogrammingtutorials/tree/master/apache-zookeeper/leader-election) available online.
+
+If Atomix is chosen, a solution to enforce some sort of preferred leader needs to found. One way to do it would be to have the Corda cluster leader be a separate entity from the Atomix cluster leader. Implementing the election would then be done using the distributed resources made available by the framework.
+
+## Conclusions
+
+Whichever solution is chosen, using 2 nodes in a Hot-Warm environment is not ideal. A minimum of 3 is required to ensure proper failover.
+
+Almost every configuration option that these frameworks offer should be exposed through node.conf.
+
+We've looked into using Galera which is currently used for the Notary cluster for storing the committed state hashes. It offers multi-master read/write and certification-based replication which is not leader based. It could be used to implement automatic failure detection and master election(similar to our current mutual exclusion).However, we found that it doesn't suit our needs because:
+- it adds to deployment complexity
+- usable only with MySQL and InnoDB storage engine
+- we'd have to implement node failure detection and master election from scratch; in this regard both Atomix and Zookeeper are better suited
+
+Our preference would be Zookeeper despite not being as lightweight and deployment-friendly as Atomix. The wide spread use, proper documentation and flexibility to use it not only for automatic failover and master election but also configuration management(something we might consider moving forward) makes it a better fit for our needs.
--- a/docs/source/design/failure-detection-master-election/drb-meeting-20180131.md
+++ b/docs/source/design/failure-detection-master-election/drb-meeting-20180131.md
@ -0,0 +1,87 @@
+![Corda](https://www.corda.net/wp-content/uploads/2016/11/fg005_corda_b.png)
+
+--------------------------------------------
+Design Review Board Meeting Minutes
+============================================
+
+**Date / Time:** Jan 31 2018, 11.00
+
+## Attendees
+
+- Matthew Nesbit (MN)
+- Bogdan Paunescu (BP)
+- James Carlyle (JC)
+- Mike Hearn (MH)
+- Wawrzyniec Niewodniczanski (WN)
+- Jonathan Sartin (JS)
+- Gavin Thomas (GT)
+
+
+## **Decision**
+
+Proceed with recommendation to use Zookeeper as the master selection solution
+
+
+## **Primary Requirement of Design**
+
+- Client can run just 2 nodes, master and slave
+- Current deployment model to not change significantly
+- Prioritised mastering or be able to automatically elect a master. Useful to allow clients to do rolling upgrades, or for use when a high spec machine is used for master
+- Nice to have: use for flow sharding and soft locking
+
+## **Minutes**
+
+MN presented a high level summary of the options:
+- Galera:
+    - Negative: does not have leader election and failover capability.
+
+- Atomix IO:
+    - Positive: does integrate into node easily, can setup ports
+    - Negative: requires min 3 nodes, cannot manipulate election e.g. drop the master rolling deployments / upgrades, cannot select the 'beefy' host for master where cost efficiencies have been used for the slave / DR, young library and has limited functionality, poor documentation and examples
+
+- Zookeeper (recommended option): industry standard widely used and trusted. May be able to leverage clients' incumbent Zookeeper infrastructure
+    - Positive: has flexibility for storage and a potential for future proofing; good permissioning capabilities; standalone cluster of Zookeeper servers allows 2 nodes solution rather than 3
+    - Negative: adds deployment complexity due to need for Zookeeper cluster split across datacentres
+Wrapper library choice for Zookeeper requires some analysis
+
+
+MH: predictable source of API for RAFT implementations and Zookeeper compared to Atomix. Be better to have master selector implemented as an abstraction
+
+MH: hybrid approach possible - 3rd node for oversight, i.e. 2 embedded in the node, 3rd is an observer. Zookeeper can have one node in primary data centre, one in secondary data centre and 3rd as tie-breaker
+
+WN: why are we concerned about cost of 3 machines? MN: we're seeing / hearing clients wanting to run many nodes on one VM. Zookeeper is good for this since 1 Zookepper cluster can serve 100+ nodes
+
+MH: terminology clarification required: what holds the master lock? Ideally would be good to see design thinking around split node and which bits need HA. MB: as a long term vision, ideally have 1 database for many IDs and the flows for those IDs are load balanced. Regarding services internally to node being suspended, this is being investigated.
+
+MH: regarding auto failover, in the event a database has its own perception of master and slave, how is this handled? Failure detector will need to grow or have local only schedule to confirm it is processing everything including connectivity between database and bus, i.e. implement a 'healthiness' concept
+
+MH: can you get into a situation where the node fails over but the database does not, but database traffic continues to be sent to down node? MB: database will go offline leading to an all-stop event.
+
+MH: can you have master affinity between node and database? MH: need watchdog / heartbeat solutions to confirm state of all components
+
+JC: how long will this solution live? MB: will work for hot / hot flow sharding, multiple flow workers and soft locks, then this is long term solution. Service abstraction will be used so we are not wedded to Zookeeper however the abstraction work can be done later
+
+JC: does the implementation with Zookeeper have an impact on whether cloud or physical deployments are used? MB: its an internal component, not part of the larger Corda network therefore can be either. For the customer they will have to deploy a separate Zookeeper solution, but this is the same for Atomix.
+
+WN: where Corda as a service is being deployed with many nodes in the cloud. Zookeeper will be better suited to big providers.
+
+WN: concern is the customer expects to get everything on a plate, therefore will need to be educated on how to implement Zookeeper, but this is the same for other master selection solutions.
+
+JC: is it possible to launch R3 Corda with a button on Azure marketplace to commission a Zookeeper? Yes, if we can resource it. But expectation is Zookeeper will be used by well-informed clients / implementers so one-click option is less relevant.
+
+MH: how does failover work with HSMs? MB: can replicate realm so failover is trivial
+
+JC: how do we document Enterprise features? Publish design docs? Enterprise fact sheets? R3 Corda marketing material? Clear seperation of documentation is required. GT: this is already achieved by havind docs.corda.net for open source Corda and docs.corda.r3.com for enterprise R3 Corda
+
+
+### Next Steps
+
+MN proposed the following steps:
+
+1)   Determine who has experience in the team to help select wrapper library
+2)   Build container with Zookeeper for development
+3)   Demo hot / cold with current R3 Corda Dev Preview release (writing a guide)
+4)   Turn nodes passive or active
+5)   Leader election
+6)   Failure detection and tooling
+7)   Edge case testing