ENT-1959: add a default value for mutualExclusionConfiguration.machineName (#877)

* ENT-1959: add a default value for mutualExclusionConfiguration.machineName

* ENT-1959: update docs

* ENT-1959: update docs, remove machineName from default conf, add unit test
This commit is contained in:
bpaunescu 2018-05-25 11:13:53 +01:00 committed by GitHub
parent 9155000407
commit d13cca49ec
11 changed files with 52 additions and 168 deletions

View File

@ -301,7 +301,6 @@ Typical configuration for ``nodeserver1`` would be a ``node.conf`` files contain
externalBridge = true // Ensure node doesn't run P2P AMQP bridge, instead delegate to the BridgeInner.
mutualExclusionConfiguration = { // Enable the protective heartbeat logic so that only one node instance is ever running.
on = true
machineName = "nodeserver1"
updateInterval = 20000
waitInterval = 40000
}
@ -353,7 +352,6 @@ Typical configuration for ``nodeserver2`` would be a ``node.conf`` files contain
externalBridge = true // Ensure node doesn't run P2P AMQP bridge, instead delegate to the BridgeInner.
mutualExclusionConfiguration = { // Enable the protective heartbeat logic so that only one node instance is ever running.
on = true
machineName = "nodeserver2"
updateInterval = 20000
waitInterval = 40000
}

View File

@ -1,152 +0,0 @@
High Availability
=================
This section describes how to make a Corda node highly available.
Hot Cold
~~~~~~~~
In the hot-cold configuration, failover is handled manually by promoting the cold node after the hot node
has failed or been taken offline for maintenance.
For RPC clients there is a way to recover in case of failover; see the section below.
Prerequisites
-------------
* A load-balancer for P2P, RPC and web traffic
* A shared file system for the artemis and certificates directories
* A shared database, e.g. Azure SQL
The hot-cold deployment consists of two Corda nodes: a hot node that is currently handling requests and running flows,
and a cold backup node that can take over if the hot node fails or is taken offline for an upgrade. Both nodes should
be able to connect to a shared database and a replicated file-system hosting the artemis and certificates directories.
The hot-cold ensemble should be fronted by a load-balancer for P2P, web and RPC traffic. The load-balancer should do
health monitoring and route the traffic to the node that is currently active. To prevent data corruption in case of
accidental simultaneous start of both nodes, the current hot node takes a leader lease in the form of a mutual exclusion
lock implemented by a row in the shared database.
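The lease mechanism described above can be sketched as follows. This is a simplified in-memory model with hypothetical types (``Lease``, ``LeaseTable``), not the actual ``RunOnceService`` implementation, which stores the lease as a row in the shared database:

```kotlin
// Simplified sketch of the mutual exclusion lease. Hypothetical types; the real
// implementation (RunOnceService) keeps this state in a shared database row.
data class Lease(val machineName: String, val lastUpdateMillis: Long)

class LeaseTable(private val waitIntervalMillis: Long) {
    private var lease: Lease? = null

    /** Try to become (or remain) the hot node. Returns true on success. */
    @Synchronized
    fun tryAcquire(machineName: String, nowMillis: Long): Boolean {
        val current = lease
        val canTake = current == null ||
                current.machineName == machineName ||                      // we already hold it
                nowMillis - current.lastUpdateMillis > waitIntervalMillis  // holder presumed dead
        if (canTake) lease = Lease(machineName, nowMillis)
        return canTake
    }
}
```

With ``waitIntervalMillis = 40000``, a second node that probes 20 seconds after the holder's last update is rejected, but succeeds once the lease has gone stale.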
Configuration
-------------
The configuration snippet below shows the relevant settings.
.. sourcecode:: none
enterpriseConfiguration = {
mutualExclusionConfiguration = {
on = true
machineName = ${HOSTNAME}
updateInterval = 20000
waitInterval = 40000
}
}
Fields
------
:on: Whether hot-cold high availability is turned on; default is ``false``.
:machineName: Unique name for the node.
:updateInterval: Period (in milliseconds) at which the running node updates the mutual exclusion lease.
:waitInterval: Amount of time (in milliseconds) to wait since the last mutual exclusion lease update before a node is allowed to become the master. This has to be greater than ``updateInterval``.
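The constraint that ``waitInterval`` must exceed ``updateInterval`` can be expressed as a small sanity check. The ``MutualExclusionSettings`` type below is hypothetical, not part of the node's own validation:

```kotlin
// Hypothetical sanity check mirroring the documented constraint: a holder must
// miss at least one full update cycle before a peer may take over the lease.
data class MutualExclusionSettings(val updateInterval: Long, val waitInterval: Long) {
    init {
        require(waitInterval > updateInterval) {
            "waitInterval ($waitInterval ms) must be greater than updateInterval ($updateInterval ms)"
        }
    }
}
```

``MutualExclusionSettings(20_000L, 40_000L)`` matches the sample configuration; swapping the two values throws ``IllegalArgumentException``.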
RPC failover
------------
In the hot-cold setup there will be a short period of time when neither of the nodes is available and accepting connections.
If the RPC client has never been connected before and makes its first RPC connection during this instability window, the connection will be rejected
as if the server address did not exist. The only choice the client has in this case is to catch the corresponding exception during ``CordaRPCClient.start()``
and keep retrying.
The following code snippet illustrates that.
.. sourcecode:: Kotlin
fun establishConnectionWithRetry(nodeHostAndPort: NetworkHostAndPort, username: String, password: String): CordaRPCConnection {
val retryInterval = 5.seconds
do {
val connection = try {
logger.info("Connecting to: $nodeHostAndPort")
val client = CordaRPCClient(
nodeHostAndPort,
object : CordaRPCClientConfiguration {
override val connectionMaxRetryInterval = retryInterval
}
)
val _connection = client.start(username, password)
// Check connection is truly operational before returning it.
val nodeInfo = _connection.proxy.nodeInfo()
require(nodeInfo.legalIdentitiesAndCerts.isNotEmpty())
_connection
} catch(secEx: ActiveMQSecurityException) {
// Happens when incorrect credentials provided - no point to retry connecting.
throw secEx
}
catch(th: Throwable) {
// Deliberately not logging full stack trace as it will be full of internal stacktraces.
logger.info("Exception upon establishing connection: " + th.message)
null
}
if(connection != null) {
logger.info("Connection successfully established with: $nodeHostAndPort")
return connection
}
// Could not connect this time round - pause before giving another try.
Thread.sleep(retryInterval.toMillis())
} while (connection == null)
throw IllegalArgumentException("Never reaches here")
}
If, however, the RPC client was connected through the load-balancer to a node and failover occurred, it will take some time for the cold instance to start up.
Acceptable behaviour in this case would be for the RPC client to keep retrying to connect and, once connected, back-fill any data that might have been missed while the connection was down.
In a way, this scenario is no different from a temporary loss of connectivity with a node even without any form of high availability.
In order to achieve this retry/back-fill functionality the client needs to install an ``onError`` handler on the ``Observable`` returned by ``CordaRPCOps``.
The code below illustrates how this can be achieved.
.. sourcecode:: Kotlin
fun performRpcReconnect(nodeHostAndPort: NetworkHostAndPort, username: String, password: String) {
val connection = establishConnectionWithRetry(nodeHostAndPort, username, password)
val proxy = connection.proxy
val (stateMachineInfos, stateMachineUpdatesRaw) = proxy.stateMachinesFeed()
val retryableStateMachineUpdatesSubscription: AtomicReference<Subscription?> = AtomicReference(null)
val subscription: Subscription = stateMachineUpdatesRaw
.startWith(stateMachineInfos.map { StateMachineUpdate.Added(it) })
.subscribe({ clientCode(it) /* Client code here */ }, {
// Terminate subscription such that nothing gets past this point to downstream Observables.
retryableStateMachineUpdatesSubscription.get()?.unsubscribe()
// It is good idea to close connection to properly mark the end of it. During re-connect we will create a new
// client and a new connection, so no going back to this one. Also the server might be down, so we are
// force closing the connection to avoid propagation of notification to the server side.
connection.forceClose()
// Perform re-connect.
performRpcReconnect(nodeHostAndPort, username, password)
})
retryableStateMachineUpdatesSubscription.set(subscription)
}
In this code snippet the function ``performRpcReconnect`` creates an RPC connection and installs an error handler
when subscribing to an ``Observable``. This ``onError`` handler is invoked when failover happens; the code then
terminates the existing subscription, closes the RPC connection and recursively calls ``performRpcReconnect``, which will re-subscribe
once the RPC connection comes back online.
Client code is fed with instances of ``StateMachineInfo`` via the call ``clientCode(it)``. Upon re-connect, this code receives
all the items, some of which might already have been delivered before the failover occurred.
It is up to the client code to keep a memory of what it has processed and handle such duplicate items as appropriate.
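One way to keep such a memory is to track the IDs of updates already processed and drop re-deliveries. The sketch below uses simplified stand-in types (``Update``, ``DeduplicatingHandler``), not Corda's actual ``StateMachineUpdate`` API:

```kotlin
// Simplified stand-in for a state machine update; real client code would key
// on the actual state machine run ID.
data class Update(val id: String, val payload: String)

class DeduplicatingHandler(private val delegate: (Update) -> Unit) {
    private val seenIds = mutableSetOf<String>()

    /** Forwards each update at most once, dropping re-deliveries after a reconnect. */
    fun onUpdate(update: Update) {
        if (seenIds.add(update.id)) delegate(update)
    }
}
```

A handler like this would wrap the ``clientCode(it)`` call, so that items replayed by ``startWith`` after a reconnect are silently ignored if they were already seen.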
Hot Warm
~~~~~~~~
In the future we are going to support automatic failover.

View File

@ -156,7 +156,7 @@ exists, all others will shut down shortly after starting. A standard configurati
enterpriseConfiguration = {
mutualExclusionConfiguration = {
on = true
machineName = ${UNIQUE_ID}
machineName = ${UNIQUE_ID} // Optional
updateInterval = 20000
waitInterval = 40000
}
@ -164,7 +164,9 @@ exists, all others will shut down shortly after starting. A standard configurati
:on: Whether hot cold high availability is turned on, default is ``false``.
:machineName: Unique name for node. Used when checking which node is active. Example: *corda-ha-vm1.example.com*
:machineName: Unique name for the node. It is combined with the node's base directory to create an identifier which is
used in the mutual exclusion process (signalling which Corda instance is active and using the database). The default value is the
machine's host name.
:updateInterval: Period (in milliseconds) at which the running node updates the mutual exclusion lease.
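Assuming the behaviour described above, the effective identifier joins the node's base directory with ``machineName``, which itself falls back to the local host name. A minimal sketch (``effectiveIdentifier`` is a hypothetical helper mirroring the documented behaviour, not the node's internals):

```kotlin
import java.net.InetAddress

// Sketch of how the mutual exclusion identifier is composed: base directory
// plus machineName, where machineName defaults to the local host name.
fun effectiveIdentifier(baseDirectory: String, machineName: String? = null): String {
    val name = machineName ?: InetAddress.getLocalHost().hostName
    return "$baseDirectory/$name"
}
```

Because the base directory is part of the identifier, two nodes sharing a host (and hence a default ``machineName``) still produce distinct identifiers.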
@ -206,7 +208,6 @@ file that can be used for either node:
enterpriseConfiguration = {
mutualExclusionConfiguration = {
on = true
machineName = "${NODE_MACHINE_ID}"
updateInterval = 20000
waitInterval = 40000
}
@ -218,5 +219,4 @@ network.
Each machine's own address is used for the RPC connection as the node's internal messaging client needs it to
connect to the broker.
The ``machineName`` value should be different for each node as it is used to ensure that only one of them can be active at any time.


View File

@ -339,7 +339,9 @@ abstract class AbstractNode(val configuration: NodeConfiguration,
networkParameters)
val mutualExclusionConfiguration = configuration.enterpriseConfiguration.mutualExclusionConfiguration
if (mutualExclusionConfiguration.on) {
RunOnceService(database, mutualExclusionConfiguration.machineName,
// Ensure uniqueness in case nodes are hosted on the same machine.
val extendedMachineName = "${configuration.baseDirectory}/${mutualExclusionConfiguration.machineName}"
RunOnceService(database, extendedMachineName,
ManagementFactory.getRuntimeMXBean().name.split("@")[0],
mutualExclusionConfiguration.updateInterval, mutualExclusionConfiguration.waitInterval).start()
}

View File

@ -10,8 +10,8 @@
package net.corda.node.services.config
import net.corda.node.services.statemachine.transitions.SessionDeliverPersistenceStrategy
import net.corda.node.services.statemachine.transitions.StateMachineConfiguration
import java.net.InetAddress
data class EnterpriseConfiguration(
val mutualExclusionConfiguration: MutualExclusionConfiguration,
@ -19,7 +19,15 @@ data class EnterpriseConfiguration(
val tuning: PerformanceTuning = PerformanceTuning.default,
val externalBridge: Boolean? = null)
data class MutualExclusionConfiguration(val on: Boolean = false, val machineName: String, val updateInterval: Long, val waitInterval: Long)
data class MutualExclusionConfiguration(val on: Boolean = false,
val machineName: String = defaultMachineName,
val updateInterval: Long,
val waitInterval: Long
) {
companion object {
private val defaultMachineName = InetAddress.getLocalHost().hostName
}
}
/**
* @param flowThreadPoolSize Determines the size of the thread pool used by the flow framework to run flows.

View File

@ -28,7 +28,6 @@ verifierType = InMemory
enterpriseConfiguration = {
mutualExclusionConfiguration = {
on = false
machineName = ""
updateInterval = 20000
waitInterval = 40000
}

View File

@ -13,26 +13,23 @@ package net.corda.node.services.config
import com.typesafe.config.Config
import com.typesafe.config.ConfigFactory
import com.zaxxer.hikari.HikariConfig
import net.corda.core.internal.div
import net.corda.core.internal.toPath
import net.corda.core.utilities.NetworkHostAndPort
import net.corda.nodeapi.internal.persistence.CordaPersistence.DataSourceConfigTag
import net.corda.core.utilities.seconds
import net.corda.nodeapi.BrokerRpcSslOptions
import net.corda.nodeapi.internal.config.UnknownConfigKeysPolicy
import net.corda.testing.core.ALICE_NAME
import net.corda.testing.node.MockServices.Companion.makeTestDataSourceProperties
import net.corda.tools.shell.SSHDConfiguration
import org.assertj.core.api.Assertions.assertThat
import org.assertj.core.api.Assertions.assertThatThrownBy
import org.junit.Test
import java.net.InetAddress
import java.net.URL
import java.net.URI
import java.nio.file.Paths
import java.util.*
import kotlin.test.assertEquals
import kotlin.test.assertFalse
import kotlin.test.assertNull
import kotlin.test.assertTrue
import kotlin.test.*
class NodeConfigurationImplTest {
@Test
@ -174,6 +171,12 @@ class NodeConfigurationImplTest {
assertThat(errors).hasOnlyOneElementSatisfying { error -> error.contains("compatibilityZoneURL") && error.contains("devMode") }
}
@Test
fun `mutual exclusion machineName set to default if not explicitly set`() {
val config = getConfig("test-config-mutualExclusion-noMachineName.conf").parseAsNodeConfiguration(UnknownConfigKeysPolicy.IGNORE::handle)
assertEquals(InetAddress.getLocalHost().hostName, config.enterpriseConfiguration.mutualExclusionConfiguration.machineName)
}
private fun configDebugOptions(devMode: Boolean, devModeOptions: DevModeOptions?): NodeConfiguration {
return testConfiguration.copy(devMode = devMode, devModeOptions = devModeOptions)
}

View File

@ -0,0 +1,26 @@
p2pAddress : "localhost:10002"
rpcSettings {
address : "localhost:10003"
adminAddress : "localhost:1777"
}
h2port : 11000
myLegalName : "O=Corda HA, L=London, C=GB"
keyStorePassword : "cordacadevpass"
trustStorePassword : "trustpass"
devMode : true
rpcUsers=[
{
user=corda
password=corda_is_awesome
permissions=[
ALL
]
}
]
enterpriseConfiguration = {
mutualExclusionConfiguration = {
on = true
updateInterval = 20000
waitInterval = 40000
}
}