This page contains information about checkpoint tooling. These tools can be used to debug the causes of stuck flows.
Before reading this page, please ensure you understand the mechanics and principles of Corda Flows by reading :doc:`key-concepts-flows` and :doc:`flow-state-machines`.
It is also recommended that you understand the purpose and behaviour of the :doc:`node-flow-hospital` in relation to *checkpoints* and flow recovery.
An advanced explanation of :ref:`checkpoints <flow_internals_checkpoints_ref>` within the flow state machine can be found in :doc:`contributing-flow-internals`.
A flow *checkpoint* is a serialised snapshot of the flow's stack frames and any objects reachable from the stack. Checkpoints are saved to
the database automatically when a flow suspends or resumes, which typically happens when sending or receiving messages. A flow may be replayed
from the last checkpoint if the node restarts. Automatic checkpointing is an unusual feature of Corda and significantly helps developers write
reliable code that can survive node restarts and crashes. It also assists with scaling up, as flows that are waiting for a response can be flushed
from memory.
The checkpoint tools available are:

- :ref:`Checkpoint dumper <checkpoint_dumper>`
- :ref:`Checkpoint agent <checkpoint_agent>`

.. _checkpoint_dumper:

Checkpoint dumper
~~~~~~~~~~~~~~~~~

The checkpoint dumper outputs information about flows running on a node. This is useful for diagnosing the causes of stuck flows. Using the generated output,
corrective actions can be taken to resolve the issues flows are facing. One possible solution is to end a flow using the ``flow kill`` command.
- Each file follows the naming format ``<flow name>-<flow id>.json`` (for example, ``CashIssueAndPaymentFlow-90613d6f-be78-41bd-98e1-33a756c28808.json``).
- The zip is placed into the ``logs`` directory of the node and is named ``checkpoints_dump-<date and time>.zip`` (for example, ``checkpoints_dump-20190812-153847``).
Below are some of the more important fields included in the output:

- ``flowId``: The id of the flow
- ``topLevelFlowClass``: The name of the original flow that was invoked (by RPC or a service)
- ``topLevelFlowLogic``: Detailed view of the top level flow
- ``flowCallStackSummary``: A summarised list of the current stack of sub flows along with any progress tracker information
- ``suspendedOn``: The command that the flow is suspended on (e.g. ``SuspendAndReceive``), which includes the ``suspendedTimestamp``
- ``flowCallStack``: A detailed view of the current stack of sub flows

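
The fields above can be pulled out of a dump with a short script. The following is a minimal sketch rather than part of the Corda tooling: the function name ``summarise_checkpoint_dump`` is hypothetical, and it assumes that each JSON file is a single object whose ``suspendedOn`` value is itself an object containing ``suspendedTimestamp``.

```python
import json
import zipfile


def summarise_checkpoint_dump(zip_path):
    """Return one summary dict per JSON file found in a checkpoint dump zip."""
    summaries = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in sorted(archive.namelist()):
            # Only the per-flow JSON files are of interest here.
            if not name.endswith(".json"):
                continue
            checkpoint = json.loads(archive.read(name))
            # Assumption: suspendedOn is a JSON object holding suspendedTimestamp.
            suspended_on = checkpoint.get("suspendedOn")
            timestamp = None
            if isinstance(suspended_on, dict):
                timestamp = suspended_on.get("suspendedTimestamp")
            summaries.append({
                "file": name,
                "flowId": checkpoint.get("flowId"),
                "topLevelFlowClass": checkpoint.get("topLevelFlowClass"),
                "suspendedTimestamp": timestamp,
            })
    return summaries
```

Calling ``summarise_checkpoint_dump`` with the path of a dump zip from the node's ``logs`` directory returns a list that can be sorted or filtered, for example to find the flows that have been suspended the longest.
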

.. _checkpoint_agent:

Checkpoint agent
~~~~~~~~~~~~~~~~

The Checkpoint Agent is a very low level diagnostics tool that can be used to output the type, size and content of flow *checkpoints* at node runtime.
It is primarily targeted at users developing and testing code that may exhibit flow misbehaviour (preferably before going into production).
For a given flow *checkpoint*, the agent outputs:

1. Information about the checkpoint, such as its ``id`` (also called a ``flow id``), which can be used to correlate it with that flow's lifecycle details in the main Corda logs.
2. A nested hierarchical view of its reachable objects (indented and tagged with depth and size), including the state
   of any flows held within the checkpoint.

Diagnostic information is written to standard log files (e.g. a log4j2 configured logger).
To run the agent, pass the following argument to the JVM used to start a Corda node: ``-Dcapsule.jvm.args=-javaagent:<PATH>/checkpoint-agent.jar[=arg=value,...]``

.. note:: As above, also make sure to use the jar when running Corda gradle plugin configuration tasks, e.g. the ``cordformation deployNodes`` task.
   See https://docs.corda.net/head/generating-a-node.html#the-cordform-task

* ``instrumentType``: whether to output checkpoints on read or write. Possible values: ``read``, ``write``. Default: ``read``.
* ``instrumentClassname``: specifies the base type of objects to log. The default setting is to process all *Flow* object types. Default: ``net.corda.node.services.statemachine.FlowStateMachineImpl``.
* ``minimumSize``: specifies the minimum size (in bytes) of objects to log. Default: 8192 bytes (8 KB).
* ``maximumSize``: specifies the maximum size (in bytes) of objects to log. Default: 20000000 bytes (20 MB).
* ``graphDepth``: specifies how many levels deep to display the graph output. Default: unlimited.
* ``printOnce``: if true, displays a full object reference (and its sub-graph) only once. Otherwise an object is displayed each time it is referenced. Default: true.

These arguments are passed to the JVM along with the agent specification, appended to the jar path as comma-separated ``arg=value`` pairs.
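
As an illustration of how the argument string is put together, the sketch below assembles it from the ``-Dcapsule.jvm.args=-javaagent:<PATH>/checkpoint-agent.jar[=arg=value,...]`` pattern given above. The helper name ``build_agent_jvm_arg`` is hypothetical and not part of Corda, and rendering booleans as lower-case ``true``/``false`` is an assumption.

```python
def build_agent_jvm_arg(agent_jar_path, **agent_args):
    """Compose the JVM argument that attaches the checkpoint agent."""
    spec = "-javaagent:" + agent_jar_path
    if agent_args:
        pairs = []
        for name, value in agent_args.items():
            # Assumption: the agent expects booleans as lower-case true/false.
            if isinstance(value, bool):
                value = str(value).lower()
            pairs.append(f"{name}={value}")
        # Arguments follow the jar path as comma-separated arg=value pairs.
        spec += "=" + ",".join(pairs)
    return "-Dcapsule.jvm.args=" + spec


print(build_agent_jvm_arg(
    "<PATH>/checkpoint-agent.jar",
    instrumentType="write",
    minimumSize=10240,
    printOnce=False,
))
```

The values shown (``write``, ``10240``, ``False``) are arbitrary examples; any of the parameters listed above can be passed in the same way, and omitted parameters keep their defaults.
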

.. sourcecode:: none

    15f16740-4ea2-4e48-bcb3-fd9051d5b Cash Issue And Payment bankUser In progress
    1c6c3e59-26aa-4b93-8435-4e34e265e Cash Issue And Payment bankUser In progress
    90613d6f-be78-41bd-98e1-33a756c28 Cash Issue And Payment bankUser In progress
    43c7d5c8-aa66-4a98-beed-dc91354d0 Cash Issue And Payment bankUser In progress
    Waiting for completion or Ctrl-C ...

Note that "In progress" indicates the flows above have not completed (and will have been checkpointed).

1. Check the main Corda node log file for *hospitalisation* and/or *flow retry* messages: ``<NODE_BASE>\logs\node-<hostname>.log``

   .. sourcecode:: none

       [INFO ] 2019-07-11T17:56:43,227Z [pool-12-thread-1] statemachine.FlowMonitor. - Flow with id 90613d6f-be78-41bd-98e1-33a756c28808 has been waiting for 97904 seconds to receive messages from parties [O=BigCorporation, L=New York, C=US].

   .. note:: Always search for the flow id, in this case **90613d6f-be78-41bd-98e1-33a756c28808**

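
Searching the log for a flow id can also be scripted. The sketch below is illustrative only: both function names are hypothetical, and the regular expression assumes that ``FlowMonitor`` messages follow the wording of the sample line above.

```python
import re

# Assumption: FlowMonitor messages follow the wording of the sample log line.
WAITING_PATTERN = re.compile(
    r"Flow with id (\S+) has been waiting for (\d+) seconds"
)


def find_flow_log_lines(log_lines, flow_id):
    """Return every log line that mentions the given flow id."""
    return [line.rstrip("\n") for line in log_lines if flow_id in line]


def parse_flow_monitor_line(line):
    """Return (flow_id, seconds_waiting), or None if the line does not match."""
    match = WAITING_PATTERN.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))
```

Passing an open log file (or any iterable of lines) to ``find_flow_log_lines`` narrows the log down to one flow, and ``parse_flow_monitor_line`` can then be applied to the result to see how long that flow has been waiting.
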
* attempting to perform a graceful shutdown (draining all outstanding flows and preventing others from starting) and re-start of the node:

.. sourcecode:: none

    Welcome to the Corda interactive shell.
    Useful commands include 'help' to see what is available, and 'bye' to shut down the node.
    Thu Jul 11 19:52:56 BST 2019>>> gracefulShutdown

Upon restart, ensure you disable flow draining mode so that the node can continue to receive requests:

.. sourcecode:: none

    Welcome to the Corda interactive shell.
    Useful commands include 'help' to see what is available, and 'bye' to shut down the node.
    Thu Jul 11 19:52:56 BST 2019>>> run setFlowsDrainingModeEnabled enabled: false

See also :ref:`Flow draining mode <draining-mode>`.
* contacting other participants in the network where their nodes are not responding to an initiated flow.
The checkpoint dump gives good diagnostics on the reason a flow may be suspended (including the destination peer participant node that is not responding):