Document FinalityFlow error handling logic and reference flow hospital doc.

2025-05-30 06:04:24 +00:00 · 2019-03-27 13:09:05 +00:00 · 2019-03-27 13:09:05 +00:00 · f43906b973
commit f43906b973
parent a81fbadec5
2 changed files with 92 additions and 0 deletions
--- a/docs/source/api-flows.rst
+++ b/docs/source/api-flows.rst
@ -613,6 +613,21 @@ flow to receive the transaction:
 ``idOfTxWeSigned`` is an optional parameter used to confirm that we got the right transaction. It comes from using ``SignTransactionFlow``
 which is described below.
 **Error handling behaviour**
 Once a transaction has been notarised and its input states consumed by the flow initiator (e.g. the sender), the participant(s) receiving the
 transaction may fail to verify it, or the receiving flow (the finality handler) may fail due to some other error. In that scenario not
 all parties have a correct, up-to-date view of the ledger (a condition associated with `eventual consistency <https://en.wikipedia.org/wiki/Eventual_consistency>`_
 in distributed systems terminology). To recover from this scenario, the receiver's finality handler is automatically sent to the
 :doc:`node-flow-hospital`, where it is suspended and retried from its last checkpoint upon node restart, or according to other conditional retry rules
 explained in :ref:`flow hospital runtime behaviour <flow-hospital-runtime>`. This gives the node operator the opportunity to recover from the error.
 Until the issue is resolved the node will continue to retry the flow on each startup. Once the receiver's finality flow completes successfully,
 the ledger becomes fully consistent once again.
 .. warning:: It's possible to forcibly terminate the erroring finality handler using the ``killFlow`` RPC but at the risk of an inconsistent view of the ledger.
 .. note:: A future release will allow retrying hospitalised flows without restarting the node, i.e. via RPC.
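 The recovery cycle described above can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not Corda code:
 ``FinalityRetrySketch`` and all of its methods are hypothetical names, and a missing attachment stands in for whatever error the
 receiving handler hits. The handler stays hospitalised across restarts until the operator fixes the cause, after which the retry succeeds.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of a hospitalised finality handler. All names are hypothetical, not Corda APIs. */
final class FinalityRetrySketch {
    /** Attachments (for example contract JARs) currently installed on the receiving node. */
    private final Set<String> installedAttachments = new HashSet<>();
    /** Transactions the receiving node has successfully recorded. */
    private final Set<String> recordedTransactions = new HashSet<>();
    /** Checkpointed handler kept in the hospital: the transaction still waiting to be recorded, or null. */
    private String hospitalisedTxId;
    /** Attachment the hospitalised transaction needs before it can be verified. */
    private String requiredAttachment;

    /** Receive a notarised transaction. On failure the handler is kept for retry rather than dropped. */
    boolean receive(String txId, String attachment) {
        hospitalisedTxId = txId;
        requiredAttachment = attachment;
        return tryRecord();
    }

    /** Operator intervention: install the missing attachment. */
    void installAttachment(String attachment) {
        installedAttachments.add(attachment);
    }

    /** Node restart: any hospitalised handler is retried from its checkpoint. */
    boolean restart() {
        return hospitalisedTxId == null || tryRecord();
    }

    /** True once the receiver has recorded the transaction, i.e. both sides agree again. */
    boolean hasRecorded(String txId) {
        return recordedTransactions.contains(txId);
    }

    private boolean tryRecord() {
        if (!installedAttachments.contains(requiredAttachment)) {
            return false; // verification fails, the handler stays in the hospital
        }
        recordedTransactions.add(hospitalisedTxId);
        hospitalisedTxId = null;
        requiredAttachment = null;
        return true;
    }

    public static void main(String[] args) {
        FinalityRetrySketch node = new FinalityRetrySketch();
        System.out.println("first attempt succeeded: " + node.receive("tx-1", "contract-v2.jar"));
        System.out.println("restart without fix succeeded: " + node.restart());
        node.installAttachment("contract-v2.jar");
        System.out.println("restart after fix succeeded: " + node.restart());
    }
}
```

 In this model, killing the hospitalised handler would correspond to clearing the checkpoint without recording the transaction,
 which is why the warning above mentions the risk of an inconsistent view of the ledger.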
 CollectSignaturesFlow/SignTransactionFlow
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The list of parties who need to sign a transaction is dictated by the transaction's commands. Once we've signed a
--- a/docs/source/node-flow-hospital.rst
+++ b/docs/source/node-flow-hospital.rst
@ -0,0 +1,77 @@
 Flow Hospital
 =============
 Overview
 --------
 The **flow hospital** refers to a built-in node service that manages flows that have encountered an error.
 This service is responsible for recording, tracking, diagnosis, recovery and retry. It determines whether errored flows should be retried
 from their previous checkpoints or have their errors propagate. Flows may be recoverable under certain scenarios (e.g. manual intervention
 may be required to install a missing contract JAR version). For a given errored flow, the flow hospital service determines the next course of
 action towards recovery and retry.
 .. note:: The flow hospital will never terminate a flow itself, but will propagate its error back to the state machine and, ultimately, to end-user code to handle.
 This concept is analogous to *exception management handling* associated with enterprise workflow software, or
 *retry queues/stores* in enterprise messaging middleware for recovering from failure to deliver a message.
 Functionality
 -------------
 Flow hospital functionality is enabled by default in |release|. No explicit configuration settings are required.
 There are two aspects to the flow hospital:
 - run-time behaviour in the node upon failure, including retry and recovery transitions and policies.
 - visualisation of failed flows in the Explorer UI.
 .. _flow-hospital-runtime:
 Run-time behaviour
 ~~~~~~~~~~~~~~~~~~
 Specifically, there are two main ways a flow is hospitalised:
 1. A counterparty invokes a flow on your node that isn’t installed (i.e. missing CorDapp):
   this will cause the flow session initialisation mechanism to trigger a ``ClassNotFoundException``.
   If this happens, the session initiation attempt is kept in the hospital for observation and will retry if you restart the node.
   Corrective action requires installing the correct CorDapp in the node's "cordapps" directory.
   .. warning:: There is currently no retry API. If you don’t want to install the CorDapp, you should be able to call ``killFlow`` with the UUID
      that the node's log messages associate with the failing flow.
 2. Once started, if a flow experiences an error, the following failure scenarios are handled:
   * ``SQLException`` mentioning a deadlock:
     if this happens, the flow will retry. If it retries more than once, a back-off delay is applied to try to reduce contention.
     Under the current policy these types of failed flows will retry forever (unless explicitly killed). No intervention required.
   * Database constraint violation:
     this scenario may occur due to natural contention between racing flows, as Corda delegates the handling of such conflicts to the database's optimistic concurrency control.
     As the likelihood of recurrence should be low, the flow will error and fail if it experiences this at the same point more than 3 times. No intervention required.
   * Finality flow handling - Corda 3.x (old style) ``FinalityFlow`` and Corda 4.x ``ReceiveFinalityFlow``:
     on the receive side of the finality flow, any error will result in the flow being kept in for observation to allow the cause of the
     error to be rectified (so that the transaction isn’t lost if, for example, associated contract JARs are missing).
     Intervention is expected to be “rectify error, perhaps uploading attachment, and restart node” (or alternatively reject and call ``killFlow``).
   * ``FlowTimeoutException``:
     this is used internally by the notary client flow when talking to an HA notary. It causes the client to try a different
     member of the notary cluster if it doesn't hear back from the member it originally sent the request to within a “reasonable” time.
     The time is hard to document, as the notary members, if actually alive, will inform the requester of the ETA of a response.
     This can occur an unlimited number of times, i.e. we never give up notarising. No intervention required.
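 The retry rules listed above can be summarised as a small decision function. The following is a minimal sketch, not the node's
 actual implementation: ``FlowHospitalSketch``, ``Outcome``, ``ConstraintViolation`` and ``FlowTimeout`` are hypothetical names,
 detecting a deadlock by searching the exception message is an assumption, and propagating any error not listed above is also an assumption.

```java
import java.sql.SQLException;
import java.util.Locale;

/** Toy model of the retry rules listed above. Not Corda's actual implementation. */
final class FlowHospitalSketch {

    /** Possible outcomes for an errored flow in this sketch. */
    enum Outcome { RETRY, RETRY_WITH_BACKOFF, KEEP_FOR_OBSERVATION, PROPAGATE_ERROR }

    /** Hypothetical stand-in for a database constraint violation. */
    static final class ConstraintViolation extends RuntimeException { }

    /** Hypothetical stand-in for FlowTimeoutException. */
    static final class FlowTimeout extends RuntimeException { }

    /** A constraint violation at the same point is tolerated this many times before the flow fails. */
    static final int MAX_CONSTRAINT_VIOLATIONS = 3;

    /**
     * @param error               the error raised by the flow
     * @param isFinalityReceiver  true if the flow is the receive side of the finality flow
     * @param previousOccurrences how often this error was already seen at the same point
     */
    static Outcome decide(Throwable error, boolean isFinalityReceiver, int previousOccurrences) {
        if (isFinalityReceiver) {
            // Any error on the receive side keeps the flow in for observation.
            return Outcome.KEEP_FOR_OBSERVATION;
        }
        if (error instanceof SQLException && mentionsDeadlock(error)) {
            // Retried forever; a back-off delay applies from the second retry onwards.
            return previousOccurrences >= 1 ? Outcome.RETRY_WITH_BACKOFF : Outcome.RETRY;
        }
        if (error instanceof ConstraintViolation) {
            // Fails once the violation has been experienced more than 3 times.
            return previousOccurrences < MAX_CONSTRAINT_VIOLATIONS ? Outcome.RETRY : Outcome.PROPAGATE_ERROR;
        }
        if (error instanceof FlowTimeout) {
            // The notary client never gives up.
            return Outcome.RETRY;
        }
        // Assumption: anything else is handed back to the state machine and user code.
        return Outcome.PROPAGATE_ERROR;
    }

    /** Assumption: a deadlock is recognised by the word "deadlock" in the exception message. */
    private static boolean mentionsDeadlock(Throwable error) {
        String message = error.getMessage();
        return message != null && message.toLowerCase(Locale.ROOT).contains("deadlock");
    }

    public static void main(String[] args) {
        System.out.println(decide(new SQLException("Deadlock detected"), false, 1));
        System.out.println(decide(new ConstraintViolation(), false, 3));
        System.out.println(decide(new IllegalStateException("missing contract JAR"), true, 0));
    }
}
```

 The finality check comes first because, per the list above, any error on the receive side of the finality flow is kept for observation,
 whatever its type.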
 Futures
 -------
 The flow hospital will be extended in the following areas:
 - Human Computer Interaction (HCI) with MQ integration <ref design/CID>
 - Addition of Public APIs (and CRaSH utility functions) to trigger retries
 - Improved back-off and retry policies
 - Improved Explorer visualisation and operational controls (e.g. ability to select and retry or terminate a failed flow)
 - Tighter integration with Corda Enterprise monitoring and management tooling