mirror of
https://github.com/corda/corda.git
synced 2025-01-19 11:16:54 +00:00
Initial doc setup
parent
523a6db0b9
commit
bb1143652f
185
docs/source/design/hadr/design.md
Normal file
@ -0,0 +1,185 @@
![Corda](https://www.corda.net/wp-content/uploads/2016/11/fg005_corda_b.png)

# High Availability and Disaster Recovery for Corda: A Phased Approach

============================================
DOCUMENT MANAGEMENT
============================================

## Document Control

* High Availability and Disaster Recovery for Corda: A Phased Approach
* Date: 13th November 2018
* Author: Matthew Nesbit
* Distribution: Design Review Board, Product Management, Services - Technical (Consulting), Platform Delivery
* Corda target version: Enterprise

## Document Sign-off

* Author: David Lee
* Reviewer(s): TBD
* Final approver(s): TBD

## Document History

============================================
HIGH LEVEL DESIGN
============================================

## Overview

The term high availability (HA) is used in this document to refer to the ability to rapidly handle any single component failure, whether due to physical issues (e.g. hard drive failure), network connectivity loss, or software faults.

Modern enterprise systems are expected to recover normal operation within a few minutes at most, with minimal or zero data loss. Whilst overall reliability is the overriding objective, it is desirable for Corda to offer HA mechanisms which are both highly automated and transparent to node operators. HA mechanisms must not require anything more than an appropriate admin tool or a simple start/stop of a process, because any further configuration change would need an Emergency Change Request.

HA naturally grades into requirements for Disaster Recovery (DR), which requires a tested procedure for handling large-scale, multi-component failures (e.g. data centre flooding or acts of terrorism). DR processes are permitted to involve significant manual intervention, although the complications of actually invoking a Business Continuity Plan (BCP) mean that the less manual intervention is needed, the more competitive Corda will be in the modern vendor market.

For modern financial institutions, maintaining comprehensive and effective BCP procedures is a legal requirement, and those procedures are generally tested at least once a year.

However, until Corda is the system of record, or the primary system for transactions, we are unlikely to be required to have any kind of fully automatic DR. In fact, Corda is likely to be restarted only once BCP has restored the most critical systems.

In contrast, typical financial institutions maintain large, complex technology landscapes in which individual component failures can occur, such as:

* Small scale software failures
* Mandatory data centre power cycles
* Operating system patching and restarts
* Short lived network outages
* Middleware queue build-up
* Machine failures

Thus, HA is essential for enterprise Corda, as is providing administrators with the help necessary for rapid fault diagnosis.

## Scope

* Goals
* Non-goals (e.g. out of scope)
* Reference(s) to similar or related work

## Timeline

This design document outlines a range of topologies which will be enabled through progressive enhancements from the short to long term.

Hot-

## Requirements

* Reference(s) to any of the following:
  * Captured Product Backlog JIRA entry
  * Internal White Paper feature item and/or visionary feature
  * Project related requirement (POC, RFP, Pilot, Prototype) from
    * Internal Incubator / Accelerator project
    * Direct from Customer, ISV, SI, Partner
* Use Cases
* Assumptions

## Proposed Solution

* Illustrate any business process with diagrams
  * Business Process Flow (or formal BPMN 2.0), swimlane activity
  * UML: activity, state, sequence
* Illustrate operational solutions with deployment diagrams
  * Network
* Validation matrix (against requirements)
  * Role, requirement, how design satisfies requirement
* Sample walk through (against Use Cases)
* Implications
  * Technical
  * Operational
  * Security
* Adherence to existing industry standards or approaches
* List any standards to be followed / adopted
* Outstanding issues

## Alternative Options

List any alternative solutions that may be viable but not recommended.

## Final recommendation

* Proposed solution (if more than one option presented)
* Proceed direct to implementation
* Proceed to Technical Design stage
* Proposed Platform Technical team(s) to implement design (if not already decided)

============================================
TECHNICAL DESIGN
============================================

## Interfaces

* Public APIs impacted
* Internal APIs impacted
* Modules impacted
  * Illustrate with Software Component diagrams

## Functional

* UI requirements
  * Illustrate with UI Mockups and/or Wireframes

* (Subsystem) Component descriptions and interactions.
  Consider and list existing impacted components and services within Corda:
  * Doorman
  * Network Map
  * Public APIs (ServiceHub, RPCOps)
  * Vault
  * Notaries
  * Identity services
  * Flow framework
  * Attachments
  * Core data structures, libraries or utilities
  * Testing frameworks
  * Pluggable infrastructure: DBs, Message Brokers, LDAP

* Data model & serialization impact and changes required.
  Illustrate with ERD diagrams

* Infrastructure services: persistence (schemas), messaging

## Non-Functional

* Performance
* Scalability
* High Availability

## Operational

* Deployment
  * Versioning
* Maintenance
  * Upgradability, migration
* Management
  * Audit, alerting, monitoring, backup/recovery, archiving

## Security

* Data privacy
* Authentication
* Access control

## Software Development Tools & Programming Standards to be adopted

* languages
* frameworks
* 3rd party libraries
* supporting tools

## Testability

* Unit
* Integration
* Smoke
* Non-functional (performance)

============================================
IMPLEMENTATION PLAN
============================================

* Estimated time (number of Sprints) and effort (resources)
* Fit within Corda Release schedule
* Long term feature
  * Epic and story breakdown
  * Shippable timelines

1. We need the JDBC support for an external clustered database completed and merged. Azure SQL Server has been identified as the most likely Finestra. With this we should be able to point at an HA database instance for Ledger and Checkpoint data.
2. I am suggesting that for the near term we just use the Azure Load Balancer to hide the multiple machine addresses. This does require allowing a health monitoring link to the Artemis broker, but so far testing indicates that this operates without issue. Longer term we need to ensure that the network map and configuration support exist for the system to work with multiple TCP/IP endpoints advertised to external nodes. Ideally this should be rolled into the work for AMQP bridges and Floats.
3. Implement a very simple mutual exclusion feature, so that an enterprise node cannot start if another is already running against the same database. This can be via a simple heartbeat update in the database, or possibly some other library. This feature should be enabled only when specified by configuration.
4. The replication of the Artemis Message Queues will have to be via an external mechanism. On Azure we believe that the only practical solution is the 'Azure Files' approach, which maps a virtual Samba drive. We are testing this in case it is too slow to work. The mounting of separate Data Disks is possible, but they can only be mounted to one VM at a time, so they would not be compatible with the goal of no change requests for HA.
5. We need to improve our health monitoring to better indicate faults and failures. Extending our existing JMX and logging support should manage this, although we probably need to create a watchdog CorDapp that verifies that the State Machine and Artemis messaging are able to process new work, and that monitors flow latency.
6. We should test the checkpointing mechanism and confirm that failures don't corrupt the data by deploying an HA setup on Azure and driving flows through the system as we stop the node randomly and switch to the other node. If this reveals any issues we will have to fix them.
7. We need to confirm that the behaviour of the RPC proxy is stable through these restarts, from the perspective of a stateless REST server calling through to RPC. The RPC API should provide positive feedback to the application, so that it can respond in a controlled fashion when disconnected.
8. We may need to work on some flow hospital tools.
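
The heartbeat-based mutual exclusion item in the plan above could work roughly as follows. This is only a minimal sketch, not the Corda implementation: the `node_mutex` table, the `try_acquire` function, and the 30-second lease timeout are all hypothetical names and values chosen for illustration, and SQLite stands in for the clustered database so that the example is self-contained.

```python
import sqlite3

# Hypothetical value: how stale a heartbeat must be before another node may take over.
LEASE_TIMEOUT_SECONDS = 30.0


def init_schema(conn):
    """Create the single-row lease table (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS node_mutex ("
        " id INTEGER PRIMARY KEY CHECK (id = 1),"
        " owner TEXT NOT NULL,"
        " heartbeat REAL NOT NULL)"
    )
    conn.commit()


def try_acquire(conn, owner, now, timeout=LEASE_TIMEOUT_SECONDS):
    """Return True if `owner` holds the lease after this call.

    Called once at start-up to claim the lease, then periodically
    to refresh the heartbeat while the node is running.
    """
    with conn:  # one transaction: commits on success, rolls back on error
        # First node ever: create the single lease row.
        cur = conn.execute(
            "INSERT OR IGNORE INTO node_mutex (id, owner, heartbeat)"
            " VALUES (1, ?, ?)",
            (owner, now),
        )
        if cur.rowcount == 1:
            return True
        # Otherwise renew our own lease, or take over a stale one.
        cur = conn.execute(
            "UPDATE node_mutex SET owner = ?, heartbeat = ?"
            " WHERE id = 1 AND (owner = ? OR heartbeat < ?)",
            (owner, now, owner, now - timeout),
        )
        return cur.rowcount == 1
```

A node would refuse to start when `try_acquire` returns `False`. The sketch takes `now` from the caller to keep it testable; a real deployment would more likely use the database server's clock so that clock skew between machines cannot cause a premature takeover.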
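
The watchdog idea in the health monitoring item above amounts to running a small piece of probe work, timing it, and reporting both failure and excessive latency. The sketch below illustrates that shape only; `ProbeResult`, `run_probe`, and the latency threshold are hypothetical, and the probe callable is a placeholder for whatever test flow or message round trip a real watchdog would drive.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeResult:
    healthy: bool
    latency_seconds: float
    detail: str


def run_probe(
    probe: Callable[[], None],
    max_latency_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> ProbeResult:
    """Run one probe and classify the outcome.

    `probe` is a placeholder for real work (for example, starting a
    trivial flow and waiting for it to finish). Any exception counts
    as a failure; a slow success is also reported as unhealthy.
    """
    start = clock()
    try:
        probe()
    except Exception as exc:
        return ProbeResult(False, clock() - start, f"probe failed: {exc}")
    latency = clock() - start
    if latency > max_latency_seconds:
        return ProbeResult(False, latency, "probe too slow")
    return ProbeResult(True, latency, "ok")
```

The result could then be published through whichever channel the operators already watch, such as a JMX attribute or a log line.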
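
For the RPC stability item above, one way a stateless REST server might "respond in a controlled fashion when disconnected" is to retry with exponential backoff and surface a clear error once the retries are exhausted. The following is a sketch under assumptions: `RpcUnavailable` and `call_with_retry` are hypothetical names, not part of the Corda RPC API, and the delays are arbitrary.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RpcUnavailable(Exception):
    """Hypothetical signal that the node's RPC endpoint is disconnected."""


def call_with_retry(
    operation: Callable[[], T],
    attempts: int = 5,
    base_delay_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying with exponential backoff while it is unavailable.

    Re-raises RpcUnavailable after the final attempt so the caller
    can report the outage instead of hanging.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except RpcUnavailable:
            if attempt == attempts - 1:
                raise
            # Delays grow as base, 2*base, 4*base, ...
            sleep(base_delay_seconds * (2 ** attempt))
    raise AssertionError("unreachable")
```

Blind retries are only safe for idempotent calls. Retrying a request that starts a flow could start it twice if the first attempt actually reached the node before the connection dropped, so such calls would need some form of de-duplication.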