High Availability and Disaster Recovery for Corda: A Phased Approach
============================================ DOCUMENT MANAGEMENT
Document Control
- High Availability and Disaster Recovery for Corda: A Phased Approach
- Date: 13th November 2018
- Author: Matthew Nesbit
- Distribution: Design Review Board, Product Management, Services - Technical (Consulting), Platform Delivery
- Corda target version: Enterprise
Document Sign-off
- Author: David Lee
- Reviewer(s): TBD
- Final approver(s): TBD
Document History
============================================ HIGH LEVEL DESIGN
Overview
The term high availability (HA) is used in this document to refer to the ability to rapidly handle any single component failure, whether due to physical issues (e.g. hard drive failure), network connectivity loss, or software faults.
Expectations of HA in modern enterprise systems are for systems to recover normal operation in a few minutes at most, while ensuring minimal or zero data loss. Whilst overall reliability is the overriding objective, it is desirable for Corda to offer HA mechanisms which are both highly automated and transparent to node operators. HA mechanisms must not involve any configuration change that requires more than an appropriate admin tool or a simple start/stop of a process, as anything more would need an Emergency Change Request.
HA naturally grades into requirements for Disaster Recovery (DR), which requires that there is a tested procedure to handle large scale multi-component failures, e.g. due to data centre flooding or acts of terrorism. DR processes are permitted to involve significant manual intervention, although the complications of actually invoking a Business Continuity Plan (BCP) mean that the less manual intervention is required, the more competitive Corda will be in the modern vendor market. For modern financial institutions, maintaining comprehensive and effective BCP procedures is a legal requirement, and these procedures are generally tested at least once a year.
However, until Corda is the system of record, or the primary system for transactions, we are unlikely to be required to have any kind of fully automatic DR. In fact, Corda is likely to be restarted only once BCP has restored the most critical systems. In contrast, typical financial institutions maintain large, complex technology landscapes in which individual component failures can occur, such as:
- Small scale software failures
- Mandatory data centre power cycles
- Operating system patching and restarts
- Short lived network outages
- Middleware queue build-up
- Machine failures
Thus, HA is essential for enterprise Corda, as is providing administrators with the help necessary for rapid fault diagnosis.
Scope
- Goals
- Non-goals (e.g. out of scope)
- Reference(s) to similar or related work
Timeline
This design document outlines a range of topologies which will be enabled through progressive enhancements from the short to long term. Hot-
Requirements
- Reference(s) to any of the following:
  - Captured Product Backlog JIRA entry
  - Internal White Paper feature item and/or visionary feature
  - Project related requirement (POC, RFP, Pilot, Prototype) from
    - Internal Incubator / Accelerator project
    - Direct from Customer, ISV, SI, Partner
- Use Cases
- Assumptions
Proposed Solution
- Illustrate any business process with diagrams
  - Business Process Flow (or formal BPMN 2.0), swimlane activity
  - UML: activity, state, sequence
- Illustrate operational solutions with deployment diagrams
  - Network
- Validation matrix (against requirements)
  - Role, requirement, how design satisfies requirement
- Sample walk through (against Use Cases)
- Implications
  - Technical
  - Operational
  - Security
- Adherence to existing industry standards or approaches
- List any standards to be followed / adopted
- Outstanding issues
Alternative Options
List any alternative solutions that may be viable but not recommended.
Final recommendation
- Proposed solution (if more than one option presented)
- Proceed direct to implementation
- Proceed to Technical Design stage
- Proposed Platform Technical team(s) to implement design (if not already decided)
============================================ TECHNICAL DESIGN
Interfaces
- Public APIs impact
- Internal APIs impacted
- Modules impacted
  - Illustrate with Software Component diagrams
Functional
- UI requirements
  - Illustrate with UI Mockups and/or Wireframes
- (Subsystem) component descriptions and interactions. Consider and list existing impacted components and services within Corda:
  - Doorman
  - Network Map
  - Public APIs (ServiceHub, RPCOps)
  - Vault
  - Notaries
  - Identity services
  - Flow framework
  - Attachments
  - Core data structures, libraries or utilities
  - Testing frameworks
  - Pluggable infrastructure: DBs, Message Brokers, LDAP
- Data model & serialization impact and changes required
  - Illustrate with ERD diagrams
- Infrastructure services: persistence (schemas), messaging
Non-Functional
- Performance
- Scalability
- High Availability
Operational
- Deployment
  - Versioning
- Maintenance
  - Upgradability, migration
- Management
  - Audit, alerting, monitoring, backup/recovery, archiving
Security
- Data privacy
- Authentication
- Access control
Software Development Tools & Programming Standards to be adopted.
- languages
- frameworks
- 3rd party libraries
- supporting tools
Testability
- Unit
- Integration
- Smoke
- Non-functional (performance)
============================================ IMPLEMENTATION PLAN
- Estimated time (number of Sprints) and effort (resources)
- Fit within Corda Release schedule
- Long term feature
  - Epic and story breakdown
  - Shippable timelines