mirror of
https://github.com/corda/corda.git
synced 2025-05-30 06:04:24 +00:00
Document FinalityFlow error handling logic and reference flow hospital doc.
This commit is contained in:
parent
a81fbadec5
commit
f43906b973
@ -613,6 +613,21 @@ flow to receive the transaction:
``idOfTxWeSigned`` is an optional parameter used to confirm that we got the right transaction. It comes from using ``SignTransactionFlow``
which is described below.

**Error handling behaviour**

Once a transaction has been notarised and its input states consumed by the flow initiator (e.g. the sender), the participant(s) receiving the
transaction may still fail to verify it, or the receiving flow (the finality handler) may fail due to some other error. In that scenario not
all parties have the correct, up-to-date view of the ledger (a condition known as `eventual consistency <https://en.wikipedia.org/wiki/Eventual_consistency>`_
in distributed systems terminology). To recover from this scenario, the receiver's finality handler is automatically sent to the
:doc:`node-flow-hospital`, where it is suspended and retried from its last checkpoint upon node restart, or according to other conditional retry rules
explained in :ref:`flow hospital runtime behaviour <flow-hospital-runtime>`. This gives the node operator the opportunity to recover from the error.
Until the issue is resolved, the node will continue to retry the flow on each startup. Once the receiver's finality flow completes successfully,
the ledger will become fully consistent once again.
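
The retry-on-restart behaviour described above can be pictured with a small, self-contained Java model. This is only a sketch under stated assumptions: ``FinalityRetrySketch``, ``Handler`` and ``restartNode`` are hypothetical names invented for illustration and are not Corda APIs, and a missing contract JAR stands in for "whatever caused the receiving flow to fail".

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: none of these types exist in Corda. */
public class FinalityRetrySketch {

    /** Hypothetical receiver-side handler that fails until the missing contract JAR is installed. */
    static final class Handler {
        boolean contractJarInstalled = false; // set to true once the node operator fixes the cause
        boolean recorded = false;             // true once the transaction is stored locally
        int attempts = 0;

        /** Replays from the last checkpoint; returns true on success. */
        boolean resumeFromCheckpoint() {
            attempts++;
            if (!contractJarInstalled) {
                return false; // verification still fails, so the flow stays in the hospital
            }
            recorded = true;
            return true;
        }
    }

    /** Each simulated node start retries every flow still held for observation. */
    static List<Handler> restartNode(List<Handler> hospital) {
        List<Handler> stillInHospital = new ArrayList<>();
        for (Handler h : hospital) {
            if (!h.resumeFromCheckpoint()) {
                stillInHospital.add(h);
            }
        }
        return stillInHospital;
    }

    public static void main(String[] args) {
        Handler handler = new Handler();
        List<Handler> hospital = new ArrayList<>();
        hospital.add(handler);

        hospital = restartNode(hospital);    // cause not fixed: still hospitalised
        handler.contractJarInstalled = true; // operator rectifies the error
        hospital = restartNode(hospital);    // retry succeeds

        System.out.println("attempts=" + handler.attempts
                + " recorded=" + handler.recorded
                + " inHospital=" + hospital.size());
    }
}
```

In this model the handler only leaves the hospital once the underlying cause has been fixed and the node has been started again, which mirrors the operator workflow described in the paragraph.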

.. warning:: It's possible to forcibly terminate the failing finality handler using the ``killFlow`` RPC, but at the risk of leaving an inconsistent view of the ledger.

.. note:: A future release will allow retrying hospitalised flows without restarting the node, i.e. via RPC.

CollectSignaturesFlow/SignTransactionFlow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The list of parties who need to sign a transaction is dictated by the transaction's commands. Once we've signed a
77
docs/source/node-flow-hospital.rst
Normal file
@ -0,0 +1,77 @@
Flow Hospital
=============

Overview
--------

The **flow hospital** refers to a built-in node service that manages flows that have encountered an error.

This service is responsible for recording, tracking, diagnosis, recovery and retry. It determines whether errored flows should be retried
from their previous checkpoints or have their errors propagate. Flows may be recoverable under certain scenarios (e.g. manual intervention
may be required to install a missing contract JAR version). For a given errored flow, the flow hospital service determines the next course of
action towards recovery and retry.

.. note:: The flow hospital will never terminate a flow, but will propagate its error back to the state machine and, ultimately, to end-user code to handle.

This concept is analogous to *exception management handling* associated with enterprise workflow software, or
*retry queues/stores* in enterprise messaging middleware for recovering from failure to deliver a message.

Functionality
-------------

Flow hospital functionality is enabled by default in |release|. No explicit configuration settings are required.

There are two aspects to the flow hospital:

- run-time behaviour in the node upon failure, including retry and recovery transitions and policies.
- visualisation of failed flows in the Explorer UI.

.. _flow-hospital-runtime:

Run-time behaviour
~~~~~~~~~~~~~~~~~~

Specifically, there are two main ways a flow is hospitalised:

1. A counterparty invokes a flow on your node that isn’t installed (i.e. missing CorDapp):
   this will cause the flow session initialisation mechanism to trigger a ``ClassNotFoundException``.
   If this happens, the session initiation attempt is kept in the hospital for observation and will retry if you restart the node.
   Corrective action requires installing the correct CorDapp in the node's "cordapps" directory.

   .. warning:: There is currently no retry API. If you don’t want to install the CorDapp, you should be able to call ``killFlow`` with the UUID
      associated with the failing flow in the node's log messages.

2. Once started, if a flow experiences an error, the following failure scenarios are handled:

   * ``SQLException`` mentioning a deadlock:
     if this happens, the flow will retry. If it retries more than once, a back-off delay is applied to try and reduce contention.
     Current policy means these types of failed flows will retry forever (unless explicitly killed). No intervention required.

   * Database constraint violation:
     this scenario may occur due to natural contention between racing flows, as Corda delegates handling of such conflicts to the database's optimistic concurrency control.
     As the likelihood of re-occurrence should be low, the flow will error and fail if it experiences this at the same point more than 3 times. No intervention required.

   * Finality flow handling - Corda 3.x (old style) ``FinalityFlow`` and Corda 4.x ``ReceiveFinalityFlow`` handling:
     on the receiving side of the finality flow, any error will result in the flow being kept in for observation to allow the cause of the
     error to be rectified (so that the transaction isn’t lost if, for example, associated contract JARs are missing).
     Intervention is expected to be “rectify error, perhaps uploading attachment, and restart node” (or alternatively reject and call ``killFlow``).

   * ``FlowTimeoutException``:
     this is used internally by the notary client flow when talking to an HA notary. It’s used to cause the client to try and talk to a different
     member of the notary cluster if it doesn't hear back from the original member it sent the request to within a “reasonable” time.
     The time is hard to document as the notary members, if actually alive, will inform the requester of the ETA of a response.
     This can occur an unlimited number of times, i.e. we never give up notarising. No intervention required.
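
The failure scenarios listed above amount to a small decision table, which can be sketched in Java as follows. This is an illustrative model, not the node's implementation: ``HospitalTriageSketch``, ``ErrorKind``, ``Outcome``, ``triage`` and ``backOffMillis`` are hypothetical names, the treatment of unrecognised errors is an assumption, and the back-off values are invented because the text does not state the real delays.

```java
/** Toy model of the triage rules listed above; none of these names are Corda APIs. */
public class HospitalTriageSketch {

    enum ErrorKind { DEADLOCK, CONSTRAINT_VIOLATION, FINALITY_RECEIVE_ERROR, FLOW_TIMEOUT, OTHER }

    enum Outcome { RETRY, KEEP_FOR_OBSERVATION, PROPAGATE_ERROR }

    // From the text: a constraint violation at the same point is tolerated up to 3 times.
    static final int MAX_CONSTRAINT_VIOLATIONS = 3;

    // Assumed values: the text does not state the actual back-off delays.
    static final long BASE_DELAY_MILLIS = 1_000L;
    static final long MAX_DELAY_MILLIS = 60_000L;

    /** occurrence is the 1-based count of this error at the same point in the flow. */
    static Outcome triage(ErrorKind kind, int occurrence) {
        switch (kind) {
            case DEADLOCK:
            case FLOW_TIMEOUT:
                return Outcome.RETRY; // retried indefinitely, no intervention required
            case CONSTRAINT_VIOLATION:
                return occurrence <= MAX_CONSTRAINT_VIOLATIONS
                        ? Outcome.RETRY
                        : Outcome.PROPAGATE_ERROR;
            case FINALITY_RECEIVE_ERROR:
                return Outcome.KEEP_FOR_OBSERVATION; // retried after the operator fixes the cause and restarts
            default:
                return Outcome.PROPAGATE_ERROR; // assumption: unrecognised errors go back to the state machine
        }
    }

    /** Hypothetical exponential back-off: no delay for the first retry, then doubling up to a cap. */
    static long backOffMillis(int retryNumber) {
        if (retryNumber <= 1) {
            return 0L;
        }
        int exponent = Math.min(retryNumber - 2, 20); // bound the shift to avoid overflow
        return Math.min(MAX_DELAY_MILLIS, BASE_DELAY_MILLIS << exponent);
    }

    public static void main(String[] args) {
        System.out.println(triage(ErrorKind.CONSTRAINT_VIOLATION, 3));
        System.out.println(triage(ErrorKind.CONSTRAINT_VIOLATION, 4));
        System.out.println(backOffMillis(1) + " " + backOffMillis(2) + " " + backOffMillis(3));
    }
}
```

Keeping the rules in one function makes the two "no intervention required" cases (deadlock and timeout) easy to tell apart from the finality case, where the flow waits for the operator.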

Futures
-------

The flow hospital will be extended in the following areas:

- Human Computer Interaction (HCI) with MQ integration <ref design/CID>
- Addition of public APIs (and CRaSH utility functions) to trigger retries
- Improved back-off and retry policies
- Improved Explorer visualisation and operational controls (e.g. ability to select and retry or terminate a failed flow)
- Tighter integration with Corda Enterprise monitoring and management tooling