Document FinalityFlow error handling logic and reference flow hospital doc.

2025-05-31 22:50:53 +00:00 · 2019-03-27 13:09:05 +00:00 · 2019-03-27 13:09:05 +00:00 · f43906b973
commit f43906b973
parent a81fbadec5
2 changed files with 92 additions and 0 deletions
--- a/docs/source/api-flows.rst
+++ b/docs/source/api-flows.rst
@@ -613,6 +613,21 @@ flow to receive the transaction:
 ``idOfTxWeSigned`` is an optional parameter used to confirm that we got the right transaction. It comes from using ``SignTransactionFlow``
 which is described below.

+**Error handling behaviour**
+
+Once a transaction has been notarised and its input states consumed by the flow initiator (e.g. the sender), the participant(s) receiving the
+transaction may still fail to verify it, or the receiving flow (the finality handler) may fail due to some other error. In that scenario not
+all parties have the correct, up-to-date view of the ledger (a condition known as `eventual consistency <https://en.wikipedia.org/wiki/Eventual_consistency>`_
+in distributed systems terminology). To recover, the receiver's finality handler is automatically sent to the
+:doc:`node-flow-hospital`, where it is suspended and retried from its last checkpoint upon node restart, or according to the other conditional retry rules
+explained in :ref:`flow hospital runtime behaviour <flow-hospital-runtime>`. This gives the node operator the opportunity to recover from the error.
+Until the issue is resolved, the node will continue to retry the flow on each startup. Once the receiver's finality flow completes successfully,
+the ledger becomes fully consistent once again.
+
+.. warning:: It's possible to forcibly terminate the erroring finality handler using the ``killFlow`` RPC but at the risk of an inconsistent view of the ledger.
+
+.. note:: A future release will allow retrying hospitalised flows without restarting the node, i.e. via RPC.
+
 CollectSignaturesFlow/SignTransactionFlow
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The list of parties who need to sign a transaction is dictated by the transaction's commands. Once we've signed a
--- a/docs/source/node-flow-hospital.rst
+++ b/docs/source/node-flow-hospital.rst
@@ -0,0 +1,77 @@
+Flow Hospital
+=============
+
+Overview
+--------
+
+The **flow hospital** refers to a built-in node service that manages flows that have encountered an error.
+
+This service is responsible for recording, tracking, diagnosis, recovery and retry. It determines whether errored flows should be retried
+from their previous checkpoints or have their errors propagate. Flows may be recoverable under certain scenarios (e.g. manual intervention
+may be required to install a missing contract JAR version). For a given errored flow, the flow hospital service determines the next course of
+action towards recovery and retry.
+
+.. note:: The flow hospital never terminates a flow itself; it propagates the flow's error back to the state machine and, ultimately, to end-user code to handle.
+
+This concept is analogous to *exception management handling* associated with enterprise workflow software, or
+*retry queues/stores* in enterprise messaging middleware for recovering from failure to deliver a message.
+
+Functionality
+-------------
+
+Flow hospital functionality is enabled by default in |release|. No explicit configuration settings are required.
+
+There are two aspects to the flow hospital:
+
+- run-time behaviour in the node upon failure, including retry and recovery transitions and policies.
+- visualisation of failed flows in the Explorer UI.
+
+.. _flow-hospital-runtime:
+
+Run-time behaviour
+~~~~~~~~~~~~~~~~~~
+
+Specifically, there are two main ways a flow is hospitalised:
+
+1. A counterparty invokes a flow on your node that isn’t installed (i.e. missing CorDapp):
+   this will cause the flow session initialisation mechanism to trigger a ``ClassNotFoundException``.
+   If this happens, the session initiation attempt is kept in the hospital for observation and will retry if you restart the node.
+   Corrective action requires installing the correct CorDapp in the node's "cordapps" directory.
+
+   .. warning:: There is currently no retry API. If you don’t want to install the CorDapp, you should be able to call ``killFlow`` with the UUID
+      associated with the failing flow in the node's log messages.
+
+2. Once started, if a flow experiences an error, the following failure scenarios are handled:
+
+   * ``SQLException`` mentioning a deadlock:
+     if this happens, the flow will retry. If it retries more than once, a back-off delay is applied to try to reduce contention.
+     Under the current policy these failed flows retry forever (unless explicitly killed). No intervention required.
+
+   * Database constraint violation:
+     this scenario may occur due to natural contention between racing flows, as Corda delegates contention handling to the database's optimistic concurrency control.
+     As the likelihood of re-occurrence should be low, the flow will error and fail if it experiences this at the same point more than 3 times. No intervention required.
+
+   * Finality flow handling - Corda 3.x (old style) ``FinalityFlow`` and Corda 4.x ``ReceiveFinalityFlow`` handling:
+     on the receiving side of the finality flow, any error results in the flow being kept in the hospital for observation so that the cause of the
+     error can be rectified (and the transaction isn’t lost if, for example, associated contract JARs are missing).
+     Intervention is expected to be “rectify error, perhaps uploading attachment, and restart node” (or alternatively reject and call ``killFlow``).
+
+   * ``FlowTimeoutException``:
+     this is used internally by the notary client flow when talking to an HA notary. It causes the client to try a different
+     member of the notary cluster if it doesn't hear back from the member it originally sent the request to within a “reasonable” time.
+     The time is hard to document because the notary members, if actually alive, will inform the requester of the ETA of a response.
+     This can occur an unlimited number of times, i.e. we never give up notarising. No intervention required.
+
+Futures
+-------
+
+The flow hospital will be extended in the following areas:
+
+- Human Computer Interaction (HCI) with MQ integration <ref design/CID>
+- Addition of Public APIs (and CRaSH utility functions) to trigger retries
+- Improved back-off and retry policies
+- Improved Explorer visualisation and operational controls (e.g. the ability to select and retry or terminate a failed flow)
+- Tighter integration with Corda Enterprise monitoring and management tooling
+
+
+
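
The run-time triage rules listed under "Run-time behaviour" in ``node-flow-hospital.rst`` can be pictured as a small decision table. The following is only a minimal sketch of that table, not Corda's implementation: the class ``FlowHospitalSketch``, its enums, the fallback for unlisted errors and the back-off constants (a 1 second base, doubling, capped at 1 minute) are all hypothetical names and values chosen for illustration. The document only states that a back-off is applied once a deadlocked flow retries more than once, and that a constraint violation fails the flow after more than 3 occurrences at the same point.

```java
import java.time.Duration;

/** Hypothetical model of the documented triage rules; not part of the Corda API. */
final class FlowHospitalSketch {

    /** Failure categories named in the flow hospital document, plus a catch-all. */
    enum Failure { DEADLOCK, CONSTRAINT_VIOLATION, FINALITY_RECEIVE_ERROR, NOTARY_TIMEOUT, OTHER }

    /** Possible decisions for an errored flow. */
    enum Outcome { RETRY, KEEP_FOR_OBSERVATION, PROPAGATE_ERROR }

    // The limit of 3 comes from the document; the back-off values are assumptions.
    static final int MAX_CONSTRAINT_OCCURRENCES = 3;
    static final Duration BASE_BACKOFF = Duration.ofSeconds(1);
    static final Duration MAX_BACKOFF = Duration.ofMinutes(1);

    /**
     * Decides what happens to a flow that has just failed.
     *
     * @param failure     the kind of error the flow hit
     * @param occurrences how many times (including this one) it has failed at the same point
     */
    static Outcome triage(Failure failure, int occurrences) {
        switch (failure) {
            case DEADLOCK:
            case NOTARY_TIMEOUT:
                // Retried forever unless the flow is explicitly killed.
                return Outcome.RETRY;
            case CONSTRAINT_VIOLATION:
                // Fails once seen more than 3 times at the same point.
                return occurrences > MAX_CONSTRAINT_OCCURRENCES
                        ? Outcome.PROPAGATE_ERROR
                        : Outcome.RETRY;
            case FINALITY_RECEIVE_ERROR:
                // Held until the operator fixes the cause and restarts the node.
                return Outcome.KEEP_FOR_OBSERVATION;
            default:
                // Assumption: errors not listed in the document propagate to user code.
                return Outcome.PROPAGATE_ERROR;
        }
    }

    /** Assumed back-off: none for the first retry, then doubling from 1s up to a 1 minute cap. */
    static Duration deadlockBackoff(int retryNumber) {
        if (retryNumber <= 1) {
            return Duration.ZERO;
        }
        int exponent = Math.min(retryNumber - 2, 20); // bound the shift to avoid overflow
        Duration delay = BASE_BACKOFF.multipliedBy(1L << exponent);
        return delay.compareTo(MAX_BACKOFF) > 0 ? MAX_BACKOFF : delay;
    }

    public static void main(String[] args) {
        System.out.println(triage(Failure.CONSTRAINT_VIOLATION, 4));
        System.out.println(triage(Failure.FINALITY_RECEIVE_ERROR, 1));
        System.out.println(deadlockBackoff(3));
    }
}
```

Keeping the decision (``triage``) separate from the delay calculation (``deadlockBackoff``) mirrors the way the document describes retry policy and back-off as distinct concerns, and makes it easy to swap in a different back-off curve.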