Hot-cold operational guide fixes and improvements (#884)

* minor typos, improving the HA operational

* Hot-cold operational guide improvement(typos, styling, more specific settings)

* ENT-1959: address PR comments

* address pr comments
bpaunescu 2018-05-30 14:12:47 +01:00 committed by GitHub
parent 86b4086b37
commit 9447398acb
5 changed files with 138 additions and 58 deletions


Overview
--------
This section describes hot-cold availability of Corda nodes and their associated configuration setup. In such a set-up,
there is one back-up instance that can be started if the primary instance stops. Each instance of Corda should be hosted
on a separate server and represent the same entity in the Corda network.

.. note:: It is expected that the users handle the monitoring of the instances and use the appropriate tools to switch
   between the primary and the back-up in case of failure.

In order to achieve this set-up, in addition to the physical nodes, a few other resources are required:
* 3rd party database which should be running in some sort of replication mode to avoid any data loss
* a network drive mounted on all nodes (used to store P2P messaging broker files)
* an internet facing load balancer to monitor the health of the primary and secondary instances and to automatically
  route traffic from the public IP address to the *hot* instance

This guide covers all the steps required to configure and deploy the nodes, as well as the above-mentioned resources, for
both **Microsoft Azure** and **Amazon Web Services**. The image below illustrates the environment that will result from
following the guide. There will be two Corda nodes, one active and the other inactive. Each node will represent the same
legal identity inside the Corda network. Both will share a database and a network file system.

.. image:: resources/hot-cold.png
Configuring the load balancer
-----------------------------
In a hot-cold environment, the load balancer is used to redirect incoming traffic (P2P, RPC and HTTP) towards the active
Corda node instance. The internet facing IP address of the load balancer will be advertised to the rest of the Corda network
by each node as its P2P address. This is done by configuring the load balancer IP as the node's P2P address in its
configuration file. The back-end pool of the load balancer should include both machines hosting the nodes so that traffic
can be redirected to them. A load balancing rule should be created for each port configured in the nodes' configuration
files (P2P, RPC and HTTP). Furthermore, to determine which machine the traffic should be redirected to, a health probe
should be created for each port as well.

.. important:: Set TCP as the protocol for P2P and RPC health probes.
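
As a quick sanity check, the behaviour of these probes can be reproduced by hand from any machine that can reach the
active node. The sketch below is an illustration only; the host name is a placeholder and the ports are the defaults used
throughout this guide:

.. container:: codeset

   .. sourcecode:: groovy

      # TCP probes succeed when the port accepts a connection; the web probe performs an HTTP GET on "/".
      nc -vz <active-node-host> 10002            # P2P
      nc -vz <active-node-host> 10003            # RPC
      curl -sf http://<active-node-host>:10004/ > /dev/null && echo "web OK"
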
Microsoft Azure
~~~~~~~~~~~~~~~
A guide on how to create an internet facing load balancer in Azure can be found `here <https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-get-started-internet-portal>`_.
The next step is to create health probes and load balancing rules for every port corresponding to each type of connection.

When creating the health probes, there are several properties that have to be set:

* name - used to identify the probe when associating it with a rule (e.g. p2p, rpc, web).
* protocol - determines what kind of packets are used to assess the health of the VMs behind the balancer. Use
  TCP for the P2P and RPC probes, HTTP for the web traffic probes.
* port - the port being checked.
* path - in case of the HTTP protocol, it has to be set to "/". Leave empty for the TCP probes.
* interval - the amount of time in seconds between probe attempts.
* unhealthy threshold - the number of failed probes before a VM is considered unhealthy. No specific value is suggested;
  the default is reasonable.
A possible configuration for a hot-cold environment would be:

===== ======== ====== ===== ==========
Name  Protocol Port   Path  Used by
----- -------- ------ ----- ----------
p2p   TCP      10002        ha-lbr-p2p
rpc   TCP      10003        ha-lbr-rpc
web   HTTP     10004  /     ha-lbr-web
===== ======== ====== ===== ==========

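As an illustration only, a roughly equivalent set of probes could be scripted with the Azure CLI rather than the portal;
the resource group and load balancer names below are placeholders, and the exact flag names may differ between CLI
versions:

.. container:: codeset

   .. sourcecode:: groovy

      # Health probes matching the table above: TCP for P2P and RPC, HTTP with path "/" for web.
      az network lb probe create --resource-group <resource-group> --lb-name <lb-name> \
          --name p2p --protocol Tcp --port 10002
      az network lb probe create --resource-group <resource-group> --lb-name <lb-name> \
          --name rpc --protocol Tcp --port 10003
      az network lb probe create --resource-group <resource-group> --lb-name <lb-name> \
          --name web --protocol Http --port 10004 --path /
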
The following properties have to be set when creating a load balancing rule:
* name - simple identifier.
* ip version - depending on how the resources have been created and configured, it can be IPv4 or IPv6.
* frontend ip address - the address used by peers and clients to communicate with the Corda instances.
* protocol - needs to be set to TCP for every rule.
* port - used by peers and clients to communicate with the Corda instances.
* backend port - target port for traffic redirection. Set to the same value as the previous port.
* backend pool - an Azure specific resource that represents the address pool of the VMs hosting the Corda instances.
* health probe - the probe name used to determine the target VM for incoming traffic.
* session persistence - mode in which requests are handled. Set to **None** to specify that successive
  requests from the same client can be received by any VM for the duration of the session.
Using the health probe example, a possible load balancer configuration would be:

============ ========= ============ ============
Name         Rule      Backend pool Health probe
------------ --------- ------------ ------------
ha-lbr-p2p   TCP/10002 ha-testing   p2p
ha-lbr-rpc   TCP/10003 ha-testing   rpc
ha-lbr-web   TCP/10004 ha-testing   web
============ ========= ============ ============

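Again as an illustration only, the first rule in the table could be created with the Azure CLI rather than the portal;
the frontend IP configuration name is a placeholder and the exact flag names may differ between CLI versions:

.. container:: codeset

   .. sourcecode:: groovy

      # Rule tying frontend port 10002 to backend port 10002, the ha-testing backend pool and the p2p probe.
      # --load-distribution Default corresponds to the "None" session persistence setting.
      az network lb rule create --resource-group <resource-group> --lb-name <lb-name> --name ha-lbr-p2p \
          --protocol Tcp --frontend-port 10002 --backend-port 10002 \
          --frontend-ip-name <frontend-ip-name> --backend-pool-name ha-testing \
          --probe-name p2p --load-distribution Default
      # Repeat for ha-lbr-rpc (port 10003, probe rpc) and ha-lbr-web (port 10004, probe web).
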
Amazon Web Services
~~~~~~~~~~~~~~~~~~~
A guide on how to create an internet facing load balancer in AWS can be found `here <https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/elb-getting-started.html>`_.
AWS offers 3 types of load balancers: application, network, and classic. For this guide, only the classic load balancer
configuration is covered.

Because the AWS classic load balancer can be configured with only one health check, it is required to create a load balancer
per type of connection (P2P, RPC and HTTP), each with its own health check. Everything can be configured in one go, without
having to create the rules and checks as separate resources.

When creating an AWS classic load balancer, the following configuration properties need to be set:

* Load Balancer name - simple identifier.
* Create LB inside - set it to the network containing the EC2 VMs hosting the Corda instances.
* Create an internal load balancer - not chosen, as it has to be external (internet facing).
* Enable advanced VPC configuration - depends on what option is chosen for **Create LB inside**.
* Listener Configuration:

  - Load Balancer Protocol - protocol for incoming traffic.
  - Load Balancer Port - used by peers and clients to communicate with the Corda instances.
  - Instance Protocol - protocol for redirected traffic. Set to the same value as the previous protocol.
  - Instance Port - target port for traffic redirection. Set to the same value as the previous port.

* Security groups - used to control visibility and access of the load balancer in the network and outside.
* Health check - mechanism used to determine to which EC2 instance the traffic will be directed. Only one health check
  per balancer.

  - Ping Protocol - determines what kind of packets are used to assess the health of the EC2s behind the balancer. Use
    TCP for the P2P and RPC checks, HTTP for the web traffic checks.
  - Ping Port - the port being checked.
  - Ping Path - in case of the HTTP protocol, it has to be set to "/". Leave empty for the TCP checks.
  - Timeout - the amount of time in seconds a check waits for a response before failing.
  - Interval - the amount of time in seconds between check attempts.
  - Unhealthy threshold - the number of failed checks after which an EC2 instance is considered unusable.
  - Healthy threshold - the number of consecutive successful checks after which an EC2 instance is considered usable.

After creating a load balancer for each traffic type, the configuration should look like this:

============ ======================================= ============
Name         Port Configuration                      Health Check
------------ --------------------------------------- ------------
ha-lb-p2p    10002 (TCP) forwarding to 10002 (TCP)   TCP:10002
ha-lb-rpc    10003 (TCP) forwarding to 10003 (TCP)   TCP:10003
ha-lb-web    10004 (HTTP) forwarding to 10004 (HTTP) HTTP:10004
============ ======================================= ============

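As an illustration only, the P2P balancer from the table above could also be created with the AWS CLI rather than the
console; the subnet, security group and instance identifiers are placeholders:

.. container:: codeset

   .. sourcecode:: groovy

      # Classic load balancer forwarding TCP 10002, with a TCP health check on the same port.
      aws elb create-load-balancer --load-balancer-name ha-lb-p2p \
          --listeners "Protocol=TCP,LoadBalancerPort=10002,InstanceProtocol=TCP,InstancePort=10002" \
          --subnets <subnet-id> --security-groups <security-group-id>
      aws elb configure-health-check --load-balancer-name ha-lb-p2p \
          --health-check Target=TCP:10002,Interval=10,Timeout=5,UnhealthyThreshold=2,HealthyThreshold=2
      aws elb register-instances-with-load-balancer --load-balancer-name ha-lb-p2p \
          --instances <primary-instance-id> <backup-instance-id>
      # Repeat for ha-lb-rpc (TCP:10003) and ha-lb-web (HTTP, Target=HTTP:10004/).
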
Configuring the shared network drive
------------------------------------
Microsoft Azure
~~~~~~~~~~~~~~~
When deploying in Azure, a ``File Share`` component can be used. To create a file share, a ``Storage Account`` is required.
In order to create one, please follow the guide found `here <https://docs.microsoft.com/en-us/azure/storage/common/storage-create-storage-account>`_.

The following are the properties that can be set during creation:

* Deployment model - set to **Resource manager**.
* Account kind - set to **General purpose** as Artemis can't work with **Blobs**.
* Performance - drive access speeds. The **Standard (HDD)** option offers speeds of around 14-16 MB/s; **Premium (SSD)** is
  faster, although no performance figures were found. Both options are sufficient for the purpose of this storage account.
* Replication type - can be any of **LRS**, **ZRS** or **GRS**.
* Secure transfer - set to **Enabled**.
* Location - chosen based on requirements. Some of the above options are not available for all locations.

.. note:: From the Azure documentation: *LRS is the lowest cost replication option and offers the least durability compared
   to other options. If a datacenter-level disaster (for example, fire or flooding) occurs, all replicas may be
   lost or unrecoverable. To mitigate this risk, Microsoft recommends using either zone-redundant storage (ZRS) or
   geo-redundant storage (GRS).*

After creating the storage account, add a **file share** to it. The maximum quota is 5 TiB, which is more than enough for
the purpose of this file share. The newly created file share needs to be mounted and linked to the ``artemis`` directory
in the Corda base directory of both primary and back-up VMs. To facilitate operations, a persistent mount point can be
created using **/etc/fstab**:

- required: **storage account name**, **storage account key** (choose one of the two found in Your_storage → Settings → Access keys) and the **file share name**
- persist the mount point by using the following command, replacing the placeholders in angle brackets with the
  appropriate values:

.. container:: codeset

   .. sourcecode:: groovy

      sudo bash -c 'echo "//<storage-account-name>.file.core.windows.net/<share-name> /mymountpoint cifs vers=2.1,username=<storage-account-name>,password=<storage-account-key>,dir_mode=0700,file_mode=0700,serverino" >> /etc/fstab'

In the above command, **mymountpoint** represents the location on the VM's file system where the mount point will be created.
It is important to set the appropriate **file_mode** value, based on user requirements.
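
Once the */etc/fstab* entry is in place, the share can be mounted and linked on each VM. The sketch below is an
illustration only; the mount point and the Corda base directory are placeholders, and whether ``artemis`` points at the
share root or at a sub-directory of it is a deployment choice. The same linking step applies to the AWS EFS mount
described in the next section:

.. container:: codeset

   .. sourcecode:: groovy

      sudo mkdir -p /mymountpoint
      sudo mount -a                                    # mounts the share using the /etc/fstab entry above
      # Remove or move any existing artemis directory first, then run on both the primary and back-up VMs.
      ln -s /mymountpoint <corda-base-dir>/artemis
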
Amazon Web Services
~~~~~~~~~~~~~~~~~~~
When deploying on AWS, an ``Elastic File System`` can be used. Creating one can be easily done by following `this <https://docs.aws.amazon.com/efs/latest/ug/getting-started.html>`_ guide.
During the creation, two performance modes are offered: **General Purpose** and **Max I/O**. For a simple hot-cold
environment consisting of a few nodes, the general purpose mode is sufficient, as the superior mode is best suited for
large clusters of thousands of machines accessing the file system.

The newly created EFS needs to be mounted and linked to the ``artemis`` directory in the Corda base directory of both
primary and back-up VMs. To facilitate operations, a persistent mount point can be created using **/etc/fstab**:

.. container:: codeset

   .. sourcecode:: groovy

      sudo bash -c 'echo "mount-target-DNS:/ efs-mount-point nfs4 nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,_netdev,noresvport 0 0" >> /etc/fstab'

.. note:: EFS cannot be mounted on a Windows machine. Please see EFS limits `here <https://docs.aws.amazon.com/efs/latest/ug/limits.html>`_.
Node deployment
---------------
This section covers the deployment of the back-up Corda instance. It is assumed that the primary has already been deployed.
For instructions on how to do so, please see :doc:`deploying-a-node`.
The following files and directories need to be copied from the primary instance to the back-up instance, as well as any
CorDapps and JARs that exist:

* ./certificates/

:on: Whether hot cold high availability is turned on, default is ``false``.
:machineName: Unique name for the node. It is combined with the node's base directory to create an identifier which is
   used in the mutual exclusion process (to signal which Corda instance is active and using the database). The default
   value is the machine's host name.
:updateInterval: Period (milliseconds) over which the running node updates the mutual exclusion lease.
:waitInterval: Amount of time (milliseconds) to wait since the last mutual exclusion lease update before being able to
   become the active node. This has to be greater than ``updateInterval``.
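
As an illustration only, these settings might be appended to each instance's ``node.conf`` as shown in the sketch below.
The enclosing block name, the file path and the interval values are assumptions rather than values taken from this guide;
note that ``waitInterval`` must be greater than ``updateInterval``:

.. container:: codeset

   .. sourcecode:: groovy

      # Sketch only: block name, path and values are assumptions; adjust them to the actual node.conf layout.
      cat >> /opt/corda/node.conf <<'EOF'
      enterpriseConfiguration {
          mutualExclusionConfiguration {
              on = true
              machineName = "corda-ha-vm-1"   # unique per machine
              updateInterval = 20000
              waitInterval = 40000            # must be greater than updateInterval
          }
      }
      EOF
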
Node configuration
------------------
