mirror of
https://github.com/corda/corda.git
Document FinalityFlow error handling logic and reference flow hospital doc.
parent a81fbadec5
commit f43906b973
@@ -613,6 +613,21 @@ flow to receive the transaction:
``idOfTxWeSigned`` is an optional parameter used to confirm that we got the right transaction. It comes from using ``SignTransactionFlow``,
which is described below.

**Error handling behaviour**

Once a transaction has been notarised and its input states consumed by the flow initiator (e.g. the sender), the participant(s) receiving the
transaction may still fail to verify it, or the receiving flow (the finality handler) may fail due to some other error. In that case not
all parties have a correct, up-to-date view of the ledger until the failure is resolved (a condition described as `eventual consistency <https://en.wikipedia.org/wiki/Eventual_consistency>`_
in distributed systems terminology). To recover from this scenario, the receiver's finality handler is automatically sent to the
:doc:`node-flow-hospital`, where it is suspended and retried from its last checkpoint upon node restart, or according to the other conditional retry rules
explained in :ref:`flow hospital runtime behaviour <flow-hospital-runtime>`. This gives the node operator the opportunity to recover from the error.
Until the issue is resolved, the node will continue to retry the flow on each startup. Once the receiver's finality flow completes successfully,
the ledger becomes fully consistent again.

.. warning:: It's possible to forcibly terminate the erroring finality handler using the ``killFlow`` RPC, but at the risk of an inconsistent view of the ledger.

.. note:: A future release will allow retrying hospitalised flows without restarting the node, i.e. via RPC.
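
The retry-on-startup behaviour of a hospitalised finality handler can be pictured with a small standalone model. This is only an illustrative sketch, not Corda code: the class ``HospitalisedFinalityHandlerSketch``, its methods and the idea of passing the "resume from checkpoint" step as a ``BooleanSupplier`` are all hypothetical names introduced for this example.

```java
import java.util.function.BooleanSupplier;

// Toy model (assumption: not part of Corda) of a finality handler that stays
// in the flow hospital and is retried from its checkpoint on each node startup.
class HospitalisedFinalityHandlerSketch {
    // Hypothetical stand-in for "resume the flow from its last checkpoint";
    // returns true when the flow completes, false when it errors again.
    private final BooleanSupplier resumeFromCheckpoint;
    private boolean completed = false;
    private int attempts = 0;

    HospitalisedFinalityHandlerSketch(BooleanSupplier resumeFromCheckpoint) {
        this.resumeFromCheckpoint = resumeFromCheckpoint;
    }

    // Called once per simulated node startup. A completed flow is not retried.
    boolean onNodeStartup() {
        if (completed) {
            return true;
        }
        attempts++;
        completed = resumeFromCheckpoint.getAsBoolean();
        return completed;
    }

    int attempts() {
        return attempts;
    }

    public static void main(String[] args) {
        // The flow keeps failing until the operator fixes the cause,
        // modelled here as installing a missing contract JAR.
        boolean[] contractJarInstalled = {false};
        HospitalisedFinalityHandlerSketch handler =
                new HospitalisedFinalityHandlerSketch(() -> contractJarInstalled[0]);

        System.out.println("first startup completed: " + handler.onNodeStartup());
        contractJarInstalled[0] = true; // operator rectifies the error
        System.out.println("second startup completed: " + handler.onNodeStartup());
    }
}
```

In this model the ledger view of the receiving node only catches up once ``onNodeStartup`` returns ``true``, which mirrors the point that the error must be rectified before a restart can succeed.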

CollectSignaturesFlow/SignTransactionFlow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The list of parties who need to sign a transaction is dictated by the transaction's commands. Once we've signed a

77
docs/source/node-flow-hospital.rst
Normal file
@@ -0,0 +1,77 @@
Flow Hospital
=============

Overview
--------

The **flow hospital** refers to a built-in node service that manages flows that have encountered an error.

This service is responsible for recording, tracking, diagnosing, recovering and retrying errored flows. It determines whether errored flows should be retried
from their previous checkpoints or have their errors propagated. Flows may be recoverable under certain scenarios (e.g. manual intervention
may be required to install a missing contract JAR version). For a given errored flow, the flow hospital service determines the next course of
action towards recovery and retry.

.. note:: The flow hospital will never terminate a flow, but will propagate its error back to the state machine and, ultimately, to end-user code to handle.

This concept is analogous to the *exception management handling* associated with enterprise workflow software, or the
*retry queues/stores* used in enterprise messaging middleware to recover from a failure to deliver a message.

Functionality
-------------

Flow hospital functionality is enabled by default in |release|. No explicit configuration settings are required.

There are two aspects to the flow hospital:

- run-time behaviour in the node upon failure, including retry and recovery transitions and policies.
- visualisation of failed flows in the Explorer UI.

.. _flow-hospital-runtime:

Run-time behaviour
~~~~~~~~~~~~~~~~~~

Specifically, there are two main ways a flow is hospitalised:

1. A counterparty invokes a flow on your node that isn’t installed (i.e. missing CorDapp):
   this will cause the flow session initialisation mechanism to trigger a ``ClassNotFoundException``.
   If this happens, the session initiation attempt is kept in the hospital for observation and will be retried if you restart the node.
   Corrective action requires installing the correct CorDapp in the node's "cordapps" directory.

   .. warning:: There is currently no retry API. If you don’t want to install the CorDapp, you should be able to call ``killFlow`` with the UUID
      associated with the failing flow in the node's log messages.

2. Once started, if a flow experiences an error, the following failure scenarios are handled:

   * ``SQLException`` mentioning a deadlock:
     if this happens, the flow will retry. If it retries more than once, a back-off delay is applied to try to reduce contention.
     Current policy means these types of failed flows will retry forever (unless explicitly killed). No intervention required.

   * Database constraint violation:
     this scenario may occur due to natural contention between racing flows, as Corda delegates handling of this contention to the database's optimistic concurrency control.
     As the likelihood of re-occurrence should be low, the flow will actually error and fail if it experiences this at the same point more than 3 times. No intervention required.

   * Finality flow handling - Corda 3.x (old style) ``FinalityFlow`` and Corda 4.x ``ReceiveFinalityFlow`` handling:
     on the receive side of the finality flow, any error will result in the flow being kept in for observation to allow the cause of the
     error to be rectified (so that the transaction isn’t lost if, for example, associated contract JARs are missing).
     Intervention is expected to be “rectify error, perhaps uploading attachment, and restart node” (or alternatively reject and call ``killFlow``).

   * ``FlowTimeoutException``:
     this is used internally by the notary client flow when talking to an HA notary. It’s used to cause the client to try to talk to a different
     member of the notary cluster if it doesn't hear back from the original member it sent the request to within a “reasonable” time.
     The time is hard to document as the notary members, if actually alive, will inform the requester of the ETA of a response.
     This can occur an unlimited number of times, i.e. we never give up notarising. No intervention required.
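
The failure scenarios above amount to a small decision table, which the following standalone sketch tries to capture. It is a simplified model under stated assumptions, not the node's actual implementation: the enum names, the ``diagnose`` method and the ``occurrences`` parameter (how many times the error has been seen at the same point, counting the current one) are hypothetical. Treating any other error as propagated is also an assumption, based only on the overview statement that errors are either retried or propagated.

```java
// Hypothetical decision table for the scenarios listed above.
// None of these names are Corda APIs; they exist only for illustration.
class FlowHospitalPolicySketch {
    enum ErrorKind {
        DATABASE_DEADLOCK,       // SQLException mentioning a deadlock
        CONSTRAINT_VIOLATION,    // database constraint violation
        FINALITY_RECEIVE_ERROR,  // any error on the receive side of finality
        NOTARY_TIMEOUT,          // FlowTimeoutException from the notary client flow
        OTHER                    // anything else (assumption: propagated)
    }

    enum Outcome { RETRY, RETRY_WITH_BACKOFF, KEEP_FOR_OBSERVATION, PROPAGATE_ERROR }

    // "more than 3 times" at the same point makes the flow error and fail.
    static final int MAX_CONSTRAINT_VIOLATIONS = 3;

    // occurrences: times this error was seen at the same point, including now.
    static Outcome diagnose(ErrorKind kind, int occurrences) {
        if (occurrences < 1) {
            throw new IllegalArgumentException("occurrences must be >= 1");
        }
        switch (kind) {
            case DATABASE_DEADLOCK:
                // Retries forever; a back-off delay applies from the second retry.
                return occurrences == 1 ? Outcome.RETRY : Outcome.RETRY_WITH_BACKOFF;
            case CONSTRAINT_VIOLATION:
                return occurrences > MAX_CONSTRAINT_VIOLATIONS
                        ? Outcome.PROPAGATE_ERROR
                        : Outcome.RETRY;
            case FINALITY_RECEIVE_ERROR:
                // Kept in hospital until the operator fixes the cause and restarts.
                return Outcome.KEEP_FOR_OBSERVATION;
            case NOTARY_TIMEOUT:
                // Never gives up notarising.
                return Outcome.RETRY;
            default:
                return Outcome.PROPAGATE_ERROR;
        }
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 4; n++) {
            System.out.println("constraint violation #" + n + " -> "
                    + diagnose(ErrorKind.CONSTRAINT_VIOLATION, n));
        }
    }
}
```

The sketch deliberately leaves out the length of the back-off delay and the notary timeout, since neither is specified in the text above.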

Futures
-------

The flow hospital will be extended in the following areas:

- Human Computer Interaction (HCI) with MQ integration <ref design/CID>
- Addition of Public APIs (and CRaSH utility functions) to trigger retries
- Improved back-off and retry policies
- Improved Explorer visualisation operational controls (e.g. ability to select and retry or terminate a failed flow).
- Tighter integration with Corda Enterprise monitoring and management tooling