Previously we were just throwing this away when pausing, meaning
updates would not be passed back to the user.
The progress tracker is now maintained in the `NonResidentFlow`
allowing it to be reused in the flow when it is retried.
* ENT-5672 Update database query to get paused flows which have previously been hospitalised
* NOTICK Remove unneeded check if a database exception was removed when switching a flow to RUNNABLE since we were to remove it anyway
Always attempt to load a checkpoint from the database when a flow
retries.
This is to prevent transient errors where the checkpoint is committed to
the database but throws an error back to the node. When the node tries
to retry in this scenario, `isAnyCheckpointPersisted` is false, meaning
that it will try to insert when it tries to save its initial checkpoint
again.
By loading from the existing checkpoint, even though it doesn't
really use it because it is `Unstarted`, the flag gets put into the
right state and will update rather than insert later on.
The issue with the test was that the environment variable are kept as a static member so it passed if it was the first one to run, but failed if another test runs the config beforehand.
Terminate sessions that need to be removed instantly in whatever transition is currently executing, rather than scheduling another event and doing so at a later time.
To do this, update the transition being created in `TopLevelTransition` to remove the sessions and append the `RemoveSessionBindings` action to it.
This achieves the same outcome as the original code but does so with 1 less transition. Doing this also removes the race condition that can occur where another external event is added to the flow's event queue before the terminate event could be added.
* CORDA-3986 Increase sleep in `FlowSessionCloseTest`
A sleep duration needed to be increased to ensure that an end session
message has time to be processed by the other node.
Locks do not fully fix this because some internal processing needs to be
completed that can't be waited for using a lock. Therefore the sleep
time was increased generously.
* removed dependency from tools.jar
I removed the log line in /node/src/main/kotlin/net/corda/node/internal/NodeStartup.kt because I felt it was not so important
and I modified the checkpoint agent detection simply using a static field (I tested both with and without the checkpoint agent running and detection works correctly)
* move method to node-api to address review comments
Co-authored-by: Walter Oggioni <walter.oggioni@r3.com>
* CORDA-3954: Added step to run database migration scripts during initial node registration with -s / --skip-schema-creation option (default to false) to prevent migration.
* CORDA-3954: Applied code convention to if statement
* CORDA-3954: Marked NodeCmdLineOptions' -s/--skip-schema-creation as deprecated and hidden in line with --initial-registration
Disable `SignatureConstraintMigrationFromHashConstraintsTests.HashConstraint cannot be migrated to SignatureConstraint if a HashConstraint is specified for one state and another uses an AutomaticPlaceholderConstraint()` as it frequently appears in reports about Gradle process failures, to try isolating the actual cause.
* fix suggestion and tests
* detekt suppress
* making sure the forced join works with IndirectStatePersistable and removing unnecessary joinPredicates from parse with sorting
* remove joinPredicates and add tests
* rename sorting
* revert deleting joinPredicates and modify the force join to use `OR` instead of `AND`
* add system property switch
* CORDA-3657 Extract information from state machine
`FlowReadOperations` interface provides functions that extract
information about flows from the state machine manager.
`FlowOperator` implements this interface (along with another currenly
empty interface).
* CORDA-3657 Rename function and use set
* initial test is passing
* wip
* done tests
* additional tests to cover more FlowIORequest variations
* completed tests
* The quasar.jar should nat have been changed
* Fixed issues reported by detekt
* got rid of sync objects, instead relying on nodes being offline
* Added extra grouping test and minor simplification
* Hospital test must use online node which fails on otherside
* Added additional information required for the ENT
* Added tests to cover SEND FlowIORequests
* using node name constants from the core testing module
* Changed flow operator to the query pattern
* made query fields mutable to simply building query
* fixed detekt issue
* Fixed test which had dependency on the order int the result (failed for windows)
* Fixed recommendations in PR
* Moved WrappedFlowExternalOperation and WrappedFlowExternalAsyncOperation to FlowExternalOperation.kt as per PR comment
* Moved extension to FlowAsyncOperation
* removed unnecessarily brackets
Co-authored-by: LankyDan <danknewton@hotmail.com>
Fix memory leak due to DB not shutting down in FlowFrameworkPersistenceTests.flow restarted just after receiving payload.
Also reduces number of class-wide variables to reduce scope for references being accidentally held between runs.
`KillFlowTest` is failing quite often. This is probably due to issues in ordering when taking and releasing locks. By using `CountDownLatch` in places instead of `Semaphore`s should reduce the likelihood of tests failing.
If an error occurs when creating a transition (a.k.a anything inside of
`TopLevelTransition`) then resume the flow with the error that occurred.
This is needed, because the current code is swallowing all errors thrown
at this point and causing the flow to hang.
This change will allow better debugging of errors since the real error
will be thrown back to the flow and will get handled and logged by the
normal error code path.
Extra logging has been added to `processEventsUntilFlowIsResumed`, just
in case an exception gets thrown out of the normal code path. We do not
want this exception to be swallowed as it can make it impossible to
debug the original error.
Save the exception for flows that fail during session init when they are
kept for observation.
Change the exception tidy up logic to only update the flow's status if
the exception was removed.
Add `CordaRPCOps.reattachFlowWithClientId` to allow clients to reattach
to an existing flow by only providing a client id. This behaviour is the
same as calling `startFlowDynamicWithClientId` for an existing
`clientId`. Where it differs is `reattachFlowWithClientId` will return
`null` if there is no flow running or finished on the node with the same
client id.
Return `null` if record deleted from race-condition
Update the compatible flag in the DB if the flowstate cannot be deserialised.
The most common cause of this problem is if a CorDapp has been upgraded
without draining flows from the node.
`RUNNABLE` and `HOSPITALISED` flows are restored on node startup so
the flag is set for these then. The flag can also be set when a flow
retries for some reason (see retryFlowFromSafePoint) in this case the
problem has been caused by another reason.
Added a newpause event to the statemachine which returns an Abort
continuation and causes the flow to be moved into the Paused flow Map.
Flows can receive session messages whilst paused.
Add a lock to `StateMachineState`, allowing every flow to lock
themselves when performing a transition or when an external thread (such
as `killFlow`) tries to interact with a flow from occurring at the same
time.
Doing this prevents race-conditions where the external threads mutate
the database or the flow's state causing an in-flight transition to
fail.
A `Semaphore` is used to acquire and release the lock. A `ReentrantLock`
is not used as it is possible for a flow to suspend while locked, and
resume on a different thread. This causes a `ReentrantLock` to fail when
releasing the lock because the thread doing so is not the thread holding
the lock. `Semaphore`s can be used across threads, therefore bypassing
this issue.
The lock is copied across when a flow is retried. This is to prevent
another thread from interacting with a flow just after it has been
retried. Without copying the lock, the external thread would acquire the
old lock and execute, while the fiber thread acquires the new lock and
also executes.
* Remove use of Thread.sleep() FROM FlowReloadAfterCheckpointTest, instead relying on CountdownLatch to wait until the target number has been hit or a timeout occurs, so the thread can continue as soon as the target is hit.
* Replace use of hashmaps to a concurrent queue, to mitigate risk of complex threading issues.
Enhance rpc acknowledgement method (`removeClientId`) to remove checkpoint
from all checkpoint database tables.
Optimize `CheckpointStorage.removeCheckpoint` to not delete from all checkpoint
tables if not needed. This includes excluding the results (`DBFlowResult`) and
exceptions (`DBFlowException`) tables.
Integrate `DBFlowException` with the rest of the checkpoint schema, so now
we are saving the flow's exception result in the database.
Making statemachine not remove `FAILED` flows' checkpoints from the
database if they are started with a clientId.
Retrieve the DBFlowException from the database to construct a
`FlowStateMachineHandle` future and complete exceptionally the flow's result
future for requests (`startFlowDynamicWithClientId`) that pick FAILED flows ,
started with client id, of status Removed.
On killing a flow the client id mapping of the flow gets removed.
The storage serialiser is used for serialising exceptions. Note, that if an
exception cannot be serialised, it will not fail and will instead be stored
as a `CordaRuntimeException`. This could be improved in future
changes.
Dummy package names cause build failure as they are not found on the classpath when trying to import them. Now that empty package name list is allowed, the dummy names are removed.
* CORDA-3663 MockServices crashes when two of the provided packages to scan are deemed empty in 4.4 RC05
this happends when a given package is not found on the classpath. Now it is handled and an exception is thrown
* replace dummy package names in tests with valid ones
* allow empty package list for CustomCordapps and exclude those from the created jars
* detekt fix
* always true logic fix
* fix to check for empty packages instead of empty classes
* fix for classes and fixups
* logic refactor because of detekt stupidity
* PR related minor refactors
When an incorrect message is received, the flow should resume to allow
it to throw the error back to user code and possibly cause the flow to
fail.
For now, if an `EndSessionMessage` is received instead of a
`DataSessionMessage`, then an `UnexpectedFlowEndException` is thrown
back to user code. Allowing it to correctly re-enter normal flow error
handling.
Without this change, the flow will hang due to it failing while creating
a transition which exists outside of the general state machine error
handling code path.
Sessions are now terminated after performing the original
`FlowIORequest` passed into `StartedFlowTransition`, instead of before.
This is done by scheduling an `Event.TerminateSessions` if there are
sessions to terminate when performing a suspending event.
Originally this was done by hijacking a transition that is trying to
perform a `StartedFlowTransition`, terminating the sessions and then
scheduling another `Event.DoRemainingWork` to perform the original
transition. This introduced a bug where, another event (from a external
message) could be placed onto the queue before the
`Event.DoRemainingWork` could be added. In most scenarios, that should
be ok. But, if a flow is retrying (while in an uninitiated state) and
this occurs the flow could fail due to being in an unexpected state.
Terminating the sessions after performing the original transition
removes this possibility. Meaning that a restarting flow will always
perform the transition they supposed to do (based on the called
suspending event).
* CORDA-3871: Import external code
Compiles, but does not work for various reasons
* CORDA-3871: More improvements to imported code
Currently fails due to keystores not being found
* CORDA-3871: Initialise keystores for the server
Currently fails due to keystores for client not being found
* CORDA-3871: Configure certificates to client
The program started to run
* CORDA-3871: Improve debug output
* CORDA-3871: Few more minor changes
* CORDA-3871: Add AMQClient test
Currently fails due to `localCert` not being set
* CORDA-3871: Configure server to demand client to present its certificate
* CORDA-3871: Changes to the test to make it pass
ACK status is not delivered as server is not talking AMQP
* CORDA-3871: Add delayed handshake scenario
* CORDA-3871: Tidy-up imported classes
* CORDA-3871: Hide thread creation inside `ServerThread`
* CORDA-3871: Test description
* CORDA-3871: Detekt baseline update
* CORDA-3871: Trigger repeated execution of new tests
To make sure they are not flaky
* CORDA-3871: Improve robustness of the newly introduced tests
* CORDA-3871: Improve robustness of the newly introduced tests
* CORDA-3871: New tests proven to be stable - reduce number of iterations to 1
* CORDA-3871: Adding Alex Karnezis to the list of contributors
Correct race condition in FlowVersioningTest where the last message is read (and the session close can be triggered)
before one side has finished reading metadata from the session.
Remove memory leak endurance test as it spends 8 minutes testing a single failure case that's not end user visible,
and ultimately manifests elsewhere in test failures (which is where this came from in the beginning). It was a good
idea to confirm the change fixed the issue, but this isn't critical enough to retain.
Making statemachine not remove COMPLETED flows' checkpoints from the database
if they are started with a clientId, instead they are getting persisted and retained within
the database along with their result (`DBFlowResult`).
On flow start with a client id (`startFlowDynamicWithClientId`), if the client id maps to
a flow that was previously started with the same client id and the flow is now finished,
then fetch the `DBFlowResult` from the database to construct a
`FlowStateMachineHandle` done future and return it back to the client.
Object stored as results must abide by the storage serializer rules. If they fail to do so
the result will not be stored and an exception is thrown to the client to indicate this.
Enable reloading of a flow after every checkpoint is saved. This
includes reloading the checkpoint from the database and recreating the
fiber.
When a flow and its `StateMachineState` is created it checks the node's
config to see if the `reloadCheckpointAfterSuspend` is set to true. If it is
it initialises `StateMachineState.reloadCheckpointAfterSuspendCount`
with the value 0. Otherwise, it remains `null`.
This count represents how many times the flow has reloaded from its
checkpoint (not the same as retrying). It is incremented every time the
flow is reloaded.
When a flow suspends, it processes the suspend event like usual, but
it will now also check if `reloadCheckpointAfterSuspendCount` is not
`null` (that it is activated) and process a
`ReloadFlowFromCheckpointAfterSuspend`event, if and only if
`reloadCheckpointAfterSuspendCount` is greater than
`CheckpointState.numberOfSuspends`.
This means idempotent flows can reload from the start and not reload
again until reaching a new suspension point.
Flows that skip checkpoints can reload from a previously saved
checkpoint (or from the initial checkpoint) and will continue reloading
on reaching the next new suspension point (not the suspension point that
it skipped saving).
If the flow fails to deserialize the checkpoint from the database upon
reloading a `ReloadFlowFromCheckpointException` is throw. This causes
the flow to be kept for observation.
* CORDA-3844: Add new functions to network map client
* CORDA-3844: Apply new fetch logic to nm updater
* CORDA-3844: Fix base url and warnings
* CORDA-3844: Change response object and response validation
In order to make sure that the returned node infos are not maliciously modified, either a signed list response
or a signed reference object would need to be provided. As providing a signed list requires a lot of effort from NM and Signer services,
the signed network map is provided instead, allowing nodes to validate that the list provided conforms to the entries of the signed network map.
* CORDA-3844: Add clarifications and comments
* CORDA-3844: Add error handling for bulk request
* CORDA-3844: Enhance testing
* CORDA-3844: Fix detekt issues
* EG-3844: Apply pr suggestions
* CORDA-3845: Update BC to 1.64
* CORDA-3845: Upgraded log4j to 2.13.3
* We can remove the use of Manifests from the logging package so that when _it_ logs it doesn't error on the fact the stream was already closed by the default Java logger.
* Some more tidy up
* Remove the logging package as a plugin
* latest BC version
* Remove old test
* fix up
* Fix some rebased changes to log file handling
* Fix some rebased changes to log file handling
* Update slf4j too
Co-authored-by: Adel El-Beik <adel.el-beik@r3.com>
* CORDA-3717: Apply custom serializers to checkpoints
* Remove try/catch to fix TooGenericExceptionCaught detekt rule
* Rename exception
* Extract method
* Put calls to the userSerializer on their own lines to improve readability
* Remove unused constructors from exception
* Remove unused proxyType field
* Give field a descriptive name
* Explain why we are looking for two type parameters when we only use one
* Tidy up the fetching of types
* Use 0 seconds when forcing a flow checkpoint inside test
* Add test to check references are restored correctly
* Add CheckpointCustomSerializer interface
* Wire up the new CheckpointCustomSerializer interface
* Use kryo default for abstract classes
* Remove unused imports
* Remove need for external library in tests
* Make file match original to remove from diff
* Remove maySkipCheckpoint from calls to sleep
* Add newline to end of file
* Test custom serializers mapped to interfaces
* Test serializer configured with abstract class
* Move test into its own package
* Rename test
* Move flows and serializers into their own source file
* Move broken map into its own source file
* Delete comment now source file is simpler
* Rename class to have a shorter name
* Add tests that run the checkpoint serializer directly
* Check serialization of final classes
* Register as default unless the target class is final
* Test PublicKey serializer has not been overridden
* Add a broken serializer for EdDSAPublicKey to make test more robust
* Split serializer registration into default and non-default registrations. Run registrations at the right time to preserve Cordas own custom serializers.
* Check for duplicate custom checkpoint serializers
* Add doc comments
* Add doc comments to CustomSerializerCheckpointAdaptor
* Add test to check duplicate serializers are logged
* Do not log the duplicate serializer warning when the duplicate is the same class
* Update doc comment for CheckpointCustomSerializer
* Sort serializers by classname so we are not registering in an unknown or random order
* Add test to serialize a class that references itself
* Store custom serializer type in the Kryo stream so we can spot when a different serializer is being used to deserialize
* Testing has shown that registering custom serializers as default is more robust when adding new cordapps
* Remove new line character
* Remove unused imports
* Add interface net.corda.core.serialization.CheckpointCustomSerializer to api-current.txt
* Remove comment
* Update comment on exception
* Make CustomSerializerCheckpointAdaptor internal
* Revert "Add interface net.corda.core.serialization.CheckpointCustomSerializer to api-current.txt"
This reverts commit b835de79bd.
* Restore "Add interface net.corda.core.serialization.CheckpointCustomSerializer to api-current.txt""
This reverts commit 718873a4e9.
* Pass the class loader instead of the context
* Do less work in test setup
* Make the serialization context unique for CustomCheckpointSerializerTest so we get a new Kryo pool for the test
* Rebuild the Kryo pool for the given context when we change custom serializers
* Rebuild all Kryo pools on serializer change to keep serializer list consistent
* Move the custom serializer list into CheckpointSerializationContext to reduce scope from global to a serialization context
* Remove unused imports
* Make the new checkpointCustomSerializers property default to the empty list
* Delegate implementation using kotlin language feature
Refactor `FlowStateMachineImpl.transientValues` and
`FlowStateMachineImpl.transientState` to stop the fields from exposing
the fact that they are nullable.
This is done by having private backing fields `transientValuesReference`
and `transientStateReference` that can be null. The nullability is still
needed due to serialisation and deserialisation of flow fibers. The
fields are transient and therefore will be null when reloaded from the
database.
Getters and setters hide the private field, allowing a non-null field to
returned.
There is no point other than in `FlowCreator` where the transient fields
can be null. Therefore the non null checks that are being made are
valid.
Add custom kryo serialisation and deserialisation to `TransientValues`
and `StateMachineState` to ensure that neither of the objects are ever
touched by kryo.
Introducing a new flow start method (`startFlowDynamicWithClientId`) passing in a `clientId`.
Once `startFlowDynamicWithClientId` gets called, the `clientId` gets injected into `InvocationContext` and also pushed to the logging context.
If a new flow starts with this method, then a < `clientId` to flow > pair is kept on node side, even after the flow's lifetime. If `startFlowDynamicWithClientId` is called again with the same `clientId` then the node identifies that this `clientId` refers to an existing < `clientId` to flow > pair and returns back to the rpc client a `FlowStateMachineHandle` future, created out of that pair.
`FlowStateMachineHandle` interface was introduced as a thinner `FlowStateMachine`. All `FlowStateMachine` properties used by call sites are moved into this new interface along with `clientId` and then `FlowStateMachine` extends it.
Introducing an acknowledgement method (`removeClientId`). Calling this method removes the < `clientId` to flow > pair on the node side and frees resources.
* CORDA-3769: Switched attachments class loader cache to use caffeine with original implementation used by determinstic core.
* CORDA-3769: Removed default ctor arguments.
* CORDA-3769: Switched mapping function to Function type to avoid synthetic method being generated.
* CORDA-3769: Now using a cache created from NamedCacheFactory for the attachments class loader cache.
* CORDA-3769: Making detekt happy.
* CORDA-3769: The finality tests now check for UntrustedAttachmentsException which will actually happen in reality.
* CORDA-3769: Refactored after review comments.
* CORDA-3769: Removed the AttachmentsClassLoaderSimpleCacheImpl as DJVM does not need it. Also updated due to review comments.
* CORDA-3769: Removed the generic parameters from AttachmentsClassLoader.
* CORDA-3769: Removed unused imports.
* CORDA-3769: Updates from review comments.
* CORDA-3769: Updated following review comments. MigrationServicesForResolution now uses cache factory. Ctor updated for AttachmentsClassLoaderSimpleCacheImpl.
* CORDA-3769: Reduced max class loader cache size
* CORDA-3769: Fixed the attachments class loader cache size to a fixed default
* CORDA-3769: Switched attachments class loader size to be reduced by fixed value.
Cancel the future being run by a flow when finishing or retrying it. The
cancellation of the future no longer cares about what type of future it
is.
`StateMachineState` has the `future` field, which holds the 3
(currently) possible types of futures:
- sleep
- wait for ledger commit
- async operation / external operation
Move the starting of all futures triggered by actions into
`ActionFutureExecutor`.
* Add schema migration to smoke tests
* Fix driver to work correctly for out-of-proc node with persistent database.
Co-authored-by: Ross Nicoll <ross.nicoll@r3.com>
Changes to `TopLevelTransition` after merging from earlier releases.
When a flow is kept for observation and its checkpoint is saved as
HOSPITALIZED in the database, we must acknowledge the session init and
flow start events so that they are not replayed on node startup.
Otherwise the same flow will be ran twice when the node is restarted,
one from the checkpoint and one from artemis.
Reduce exception_message and stack_trace lengths in table node_flow_exceptions from 4000 to 2000 to fix Oracle failing with: 'ORA-00910: specified length too long for its datatype'
These tests were removed after doing a merge from 4.4. They needed
updating after the changes from 4.4 anyway. These have been included in
this change.
Also fix kill flow tests and send initial tests.
When an uncaught exception propagates all the way to the flow exception
handler, the flow will be forced into observation/hospitalised.
The updating of the checkpoints status is done on a separate thread as
the fiber cannot be relied on anymore. The new thread is needed to allow
database transaction to be created and committed. Failures to the status
update will be rescheduled to ensure that this information is eventually
reflected in the database.
* Move log messages that are not useful in typical usage from info to debug level to reduce log spam.
* Add node startup check before attempting to connect.
In enterprise, `AuthDBTests` picked up a schema from a unit test and
included it in the cordapp it builds. This schema does not have a
migration and therefore fails the integration tests.
`NodeBasedTest` now lets cordapps to be defined and passed in to avoid
this issue. It defaults to making a cordapp from the tests base
directory if none are provided.
When a non-database exception is thrown out of a `withEntityManager`
block, always check if the session needs to be rolled back.
This means if a database error is caught and a new non-database error is
thrown out of the `withEntityManager` block, the transaction is still
rolled back. The flow can then continue progressing as normal.
* CORDA-3722 withEntityManager can rollback its session
Improve the handling of database transactions when using
`withEntityManager` inside a flow.
Extra changes have been included to improve the safety and
correctness of Corda around handling database transactions.
This focuses on allowing flows to catch errors that occur inside an
entity manager and handle them accordingly.
Errors can be caught in two places:
- Inside `withEntityManager`
- Outside `withEntityManager`
Further changes have been included to ensure that transactions are
rolled back correctly.
Errors caught inside `withEntityManager` require the flow to manually
`flush` the current session (the entity manager's individual session).
By manually flushing the session, a `try-catch` block can be placed
around the `flush` call, allowing possible exceptions to be caught.
Once an error is thrown from a call to `flush`, it is no longer possible
to use the same entity manager to trigger any database operations. The
only possible option is to rollback the changes from that session.
The flow can continue executing updates within the same session but they
will never be committed. What happens in this situation should be handled
by the flow. Explicitly restricting the scenario requires a lot of effort
and code. Instead, we should rely on the developer to control complex
workflows.
To continue updating the database after an error like this occurs, a new
`withEntityManager` block should be used (after catching the previous
error).
Exceptions can be caught around `withEntityManager` blocks. This allows
errors to be handled in the same way as stated above, except the need to
manually `flush` the session is removed. `withEntityManager` will
automatically `flush` a session if it has not been marked for rollback
due to an earlier error.
A `try-catch` can then be placed around the whole of the
`withEntityManager` block, allowing the error to be caught while not
committing any changes to the underlying database transaction.
To make `withEntityManager` blocks work like mini database transactions,
save points have been utilised. A new savepoint is created when opening
a `withEntityManager` block (along with a new session). It is then used
as a reference point to rollback to if the session errors and needs to
roll back. The savepoint is then released (independently from
completing successfully or failing).
Using save points means, that either all the statements inside the
entity manager are executed, or none of them are.
- A new session is created every time an entity manager is requested,
but this does not replace the flow's main underlying database session.
- `CordaPersistence.transaction` can now determine whether it needs
to execute its extra error handling code. This is needed to allow errors
escape `withEntityManager` blocks while allowing some of our exception
handling around subscribers (in `NodeVaultService`) to continue to work.
## Summary
This change deals with multiple issues:
* Errors that occur during flow initialisation.
* Errors that occur when handling the outcome of an existing flow error.
* Failures to rollback and close a database transaction when an error
occurs in `TransitionExecutorImpl`.
* Removal of create and commit transaction actions around retrying a flow.
## Errors that occur during flow initialisation
Flow initialisation has been moved into the try/catch that exists inside
`FlowStateMachineImpl.run`. This means if an error is thrown all the way
out of `initialiseFlow` (which should rarely happen) it will be caught
and move into a flow's standard error handling path. The flow should
then properly terminate.
`Event.Error` was changed to make the choice to rollback be optional.
Errors during flow initialisation cause the flow to not have a open
database transaction. Therefore there is no need to rollback.
## Errors that occur when handling the outcome of an existing flow error
When an error occurs a flow goes to the flow hospital and is given an
outcome event to address the original error. If the transition that was
processing the error outcome event (`StartErrorPropagation` and
`RetryFlowFromSafePoint`) has an error then the flow aborts and
nothing happens. This means that the flow is left in a runnable state.
To resolve this, we now retry the original error outcome event whenever
another error occurs doing so.
This is done by adding a new staff member that looks for
`ErrorStateTransitionException` thrown in the error code path of
`TransitionExecutorImpl`. It then takes the last outcome for that flow
and schedules it to run again. This scheduling runs with a backoff.
This means that a flow will continually retry the original error outcome
event until it completes it successfully.
## Failures to rollback and close a database transaction when an error occurs in `TransitionExecutorImpl`
Rolling back and closing the database transaction inside of
`TransitionExecutorImpl` is now done inside individual try/catch blocks
as this should not prevent the flow from continuing.
## Removal of create and commit transaction actions around retrying a flow
The database commit that occurs after retrying a flow can fail which
required some custom code just for that event to prevent inconsistent
behaviour. The transaction was only needed for reading checkpoints from
the database, therefore the transaction was moved into
`retryFlowFromSafePoint` instead and the commit removed.
If we need to commit data inside of `retryFlowFromSafePoint` in the
future, a commit should be added directly to `retryFlowFromSafePoint`.
The commit should occur before the flow is started on a new fiber.
The state machines state is held within `InnerState` which lived inside
the SMM. `InnerState` has been extracted out of the SMM to allow the SMM
to be refactored in the future. Smaller classes can now be made that
focus on a single goal as the locking of the state can be accessed from
external classes. To achieve this, pass the `InnerState` into the class
and request a lock if needed.
The locking of `InnerState` has been made a property of the `InnerState`
itself. It has a `lock` field that allows locks to be taken out when
needed.
An inline `withLock` function has been added to tidy up the code and not
harm performance.
Some classes have been made internal to prevent invalid usage of purely
node internal classes.
As part of this change, flow timeouts have been extracted out into
`FlowTimeoutScheduler`.
Only hit the database if `StateMachineState.isAnyCheckpointPersisted`
returns true. Otherwise, there will be no checkpoint to retrieve from the
database anyway. This can prevent errors due to a transient loss of
connection to the database.
Update tests after merging to 4.6
* Decouple DatabaseConfig and CordaPersistence etc.
* Add schema sync to schema migration + test
* Add command line parameters for synchronising schema
Only hit the database if `StateMachineState.isAnyCheckpointPersisted`
returns true. Otherwise, there will be no checkpoint to retrieve from the
database anyway. This can prevent errors due to a transient loss of
connection to the database.
* CORDA-3578 add corda prefix to conf file names as original issue was having non-corda reference.conf files on classpath causes DriverDSLImp failure
it is better to have this naming convention and avoid further conflicts of conf files.
* fixed weird duplicates
* revert renaming changes for web-reference.conf and loadtest-reference.conf
* Ensure logs directory is written to correct location
* Remove a superfluous set of log path property
* Add a unit test to catch bad log paths
* Address detekt issues
* checking for unsigned cordapps in prod mode and shutting down if it the contracts are not signed
* inheriting from CordaRuntimeException instead of Exception
* had the same tests twice, removed the shutdown for minimumplatformversion, modified some of the tests
* shut down when we get invalid platformversion and modified the test according to this
* making the message more accurate
* CORDA-3176 Inefficient query generated on vault queries with custom paging
a check added to avoid self joins
local postgres db tests were executed on large volume test data (50k states) to ensure that the DB optimizes the suspicious query and so it does not cuase performance issues.
A test added to check that a pagination query on large volume of sorted data is completed in a reasonable time
* foreach detekt fix
Adding to the query -explicitly- the flow's locked states resolved the SQL Deadlocks in SQL server. The SQL Deadlocks would come up from `softLockRelease` when only the `lockId` was passed in as argument. In that case the query optimizer would use `lock_id_idx(lock_id, state_status)` to search and update entries in `VAULT_STATES` table.
However, all rest of the queries would follow the opposite direction meaning they would use PK's `index(output_index, transaction_id)` but they would also update the `lock_id` column and therefore the `lock_id_idx` as well, because `lock_id` is a part of it. That was causing a circular locking among the different transactions (SQL processes) within the database.
To resolve this, whenever a flow attempts to reserve soft locks using their flow id (remember the flow id is always the flow id for the very first flow in the flow stack), we then save these states to the fiber. Then, upon releasing soft locks the fiber passes that set to the release soft locks query. That way the database query optimizer will use the primary key index of VAULT_STATES table, instead of lock_id_idx in order to search rows to update. That way the query will be aligned with the rest of the queries that are following that route as well (i.e. making use of the primary key), and therefore its locking order of resources within the database will be aligned with the rest queries' locking orders (solving SQL deadlocks).
* Fixed SQL deadlocks caused from softLockRelease, by saving locked states per fiber; NodeVaultService.softLockRelease query then uses VAULT_STATES PK index instead of lock_id_idx
Speed up SQL server by breaking down queries with > 16 elements in their IN clause, into sub queries with 16 elements max in their IN clauses
* Allow softLockedStates to remove states
* Add to softLockedStates only states soft locked under our flow id
* Fix softLockRelease not to take into account flowStateMachineImpl.softLockedStates when using lockId != ourFlowId
* Moved CriteriaBuilder.executeUpdate at the bottom of the file
Performance tuning of the new checkpoint schema;
- Checkpoint tables are now using `flowId` as join keys.
- Indexes consist of a PK's index on `node_checkpoints(flow_id)` and then unique indexes on `node_checkpoint_blobs(flow_id)` and `node_flow_metadata(flow_id)`.
- Serialization of `checkpointState` is being done with `CHECKPOINT_CONTEXT` so that we can have compression. This is needed when messages get passed into `checkpointState.sessions` therefore `checkpointState` grows in size upon serialized and
saved into the database.
* Deserialize checkpointState with CHECKPOINT_CONTEXT
* Align tests with schema update; We cannot add and update a checkpoint in the same session now, ends up with hibernate complaining: two different objects with same identifier
* Fix indentation and format
* Ignore tests that assert DBFlowResult or DBFlowException
* Set DBFlowCheckpoint.blob to null whenever the flow errors or hospitalizes; this way we save an extra SELECT in such cases;
* Fix test; cleared Hibernate session, it would fail at checkpoint with 'org.hibernate.NonUniqueObjectException'
* Changing VARCHAR to NVARCHAR
* Rename v17 liquibase scripts to v19 to resolve collision with ENT v17 scripts
* CORDA-3750: Use hand-written sandbox Crypto object that delegates to the node.
* CORDA-3750: Add integration test for deterministic CashIssueAndPayment flow.
* Tidy up generics for Array instances.
* Upgrade to DJVM 1.1-RC04.
When a flow is missing its database transaction the _real_ stacktrace
of the exception is missing in the logs. This is due to the normal error
handling path way does not work due the transaction being missing. When
it goes to process the next error event (due to the original transaction
context missing error) it will fail for the same error as it needs the
context to be able to process the error event.
The extra logging should help diagnose future errors.
Removing the ability to initialise schema from the node config, and add a new sub-command to initialise the schema (that does not do anything else and exits afterwards).
Also adding a command line flag that allow app schema to be maintained by hibernate for legacy cordapps, tests or rapid development.
Patching up mock net and driver test frameworks so they create the required schemas for tests to work, defaulting schema migration and hibernate schema management to true to match pre-existing behaviour.
Modified network bootstrapper to run an initial schema set-up so it can register nodes.
* Fix erroneous sql statement for oracle; It was failing tests with 'ORA-00933: SQL command not properly ended'
* Fixed flaky test; it didn't wait for counter party flow to get hospitalized as the test implied
Added command-line option: `--pause-all-flows` to the Node to control this.
This mode causes all checkpoints to be set to status PAUSED when the
state machine starts up (in StartMode.Safe mode).
Changed the state machine so that PAUSED checkpoints are loaded into
memory (the checkpoint is deserialised but the flow state is left serialised)
but not started.
Messages from peers are queued whilst the flow is paused and processed
once the flow is resumed.
When a non-database exception is thrown out of a `withEntityManager`
block, always check if the session needs to be rolled back.
This means if a database error is caught and a new non-database error is
thrown out of the `withEntityManager` block, the transaction is still
rolled back. The flow can then continue progressing as normal.
* CORDA-3715: When loading cordapps now check that contract classes have class version between 49 and 52
* CORDA-3715: Now check class version when contract verification takes place.
* CORDA-3715: Making detekt happy with number of levels in func
* CORDA-3715: Make use of new ClassGraph release which provides class file major version number.
* CORDA-3715: Changed package name in test jars
* CORDA-3715: Use ClassGraph when loading attachments.
* CORDA-3715: Reverted file to 4.5 version
* CORDA-3715: Updating method to match non deterministic version.
* CORDA-3715: Added in default param.
* CORDA-3715: Adjusted min JDK version to 1.1
* CORDA-3715: Switching check to JDK 1.2
* CORDA-3715: Now version check SerializationWhitelist classes.
* CORDA-3715: Switched default to null for range.
* [EG-438] First commit of error code interface
* [EG-438] Implement error reporter and a few error codes
* [EG-438] Add unit tests and default properties files
* [EG-438] Add the error table builder
* [EG-438] Update initial properties files
* [EG-438] Add some Irish tests and the build.gradle
* [EG-438] Fall back for aliases and use different resource strategy
* [EG-438] Define the URL using a project-specific context
* [EG-438] Tidy up initialization code
* [EG-438] Add testing to generator and tidy up
* [EG-438] Remove direct dependency on core and add own logging config
* [EG-438] Fix compiler warnings and tidy up logging
* [EG-438] Fix detekt warnings
* [EG-438] Improve error messages
* [EG-438] Address first set of review comments
* [EG-438] Use enums and a builder for the reporter
* [EG-438] Address first set of review comments
* [EG-438] Use enums and a builder for the reporter
* [EG-438] Add kdocs for error resource static methods
* [EG-440] Add error code for duplicate CorDapp loading
* [EG-438] Handle enums defined with underscores
* [EG-440] Add errors for some CorDapp loading scenarios
* [EG-440] Finish adding errors for CorDapp loading
* [EG-440] Fix up errors in properties files
* [EG-440] Start change to error code definition
* [EG-440] Update error code definition and add resource generation tool
* [EG-440] Tidy up error resource generation tool frontend
* [EG-440] Small refactorings and add kdocs
* [EG-440] Generate all missing resources
* [EG-440] Some refactoring and start writing a test
* [EG-440] Update unit test for resource generator
* [EG-440] Renaming of various parts of the error tool
* [EG-440] Add testing for errors and fix an issue in resource generation
* [EG-440] Add a kdoc for context provider API
* [EG-440] Remove old code from repository
* [EG-440] Address some review comments
* CORDA-3291 `isKilled` flag and session errors for killed flows
## Summary
Two major improvements have been worked on:
- A new flag named `isKilled` has been added to `FlowLogic` to allow
developers to break out of loops without suspension points.
- Killed flows now send session errors to their counter parties allowing
their flows to also terminate without further coordination.
Achieving these changes required a __fundamental__ change to how flows are
killed as well as how they sleep.
## `isKilled` flag
The addition of `FlowLogic.isKilled` allows flows to check if the
current flow has been killed. They can then throw an exception to lead
to the flow's termination (following the standard error pathway). They
can also perform some extra logic or not throw an exception if they
really wanted to.
No matter what, once the flag is set, the flow will terminate. Due to
timing, a killed flow might successfully process its next suspension
event, but it will then process a killed transition and terminate.
## Send session errors when killing a flow
A flow will now send session errors to all of its counter parties. They
are transferred as `UnexpectedFlowEndException`s. This allows initiated
flows to handle these errors as they see fit, although they should
probably just terminate.
## How flows are killed
### Before
Originally we were relying on Quasar to interrupt a flow's fiber, we
could then handle the resulting `InterruptedException`. The problem with
this solution is that it only worked when a flow was already suspended
or when a flow moved into suspension. Flows stuck in loops did not work.
### After
We now *do not* use Quasar to interrupt a flow's fiber. Instead, we
switch `FlowStateMachine.isKilled` to true and schedule a new event.
Any event that is processed after switching this flag will now cause a
`KilledFlowTransition`. This transition follows similar logic to how
error propagation works. Note, the extra event allows a suspended flow
to be killed without waiting for the event that it was _really_ waiting
for.
This allows a lot of the tidy up code in `StateMachineManager.killFlow`
to be removed as tidy up is executed as part of removing a flow.
Deleting a flow's checkpoint and releasing related soft locks is still
handled manually in case of infinite loops but also triggered as part
of the actions executed in a transition.
This required flow sleeping to be changed as we no longer rely on
quasar.
## How flows now sleep
The reliance on Quasar to make a flow sleep has been removed.
Instead, when a flow sleeps we create a `ScheduledFuture` that is
delayed for the requested sleep duration. When the future executes it
schedules a `WakeUpFromSleep` event that wakes up the flow... Duh.
`FlowSleepScheduler` handles the future logic. It also uses the same
scheduled thread pool that timed flows uses.
A future field was added to `StateMachineState`. This removes the
need for concurrency control around flow sleeps as the code path does
not need to touch any concurrent data structures.
To achieve this:
- `StateMachineState.future` added as a `var`
- When the `ScheduledFuture` is created to wake up the flow the passed
in `StateMachineState` has its `future` value changed
- When resumed `future` and `isWaitingForFuture` are set to `null` and
`false` respectively
- When cancelling a sleeping flow, the `future` is cancelled and nulled
out. `isWaitingForFuture` is not changed since the flow is ending anyway
so really the value of the field is not important.
* [EG-438] First commit of error code interface
* [EG-438] Implement error reporter and a few error codes
* [EG-438] Add unit tests and default properties files
* [EG-438] Add the error table builder
* [EG-438] Update initial properties files
* [EG-438] Add some Irish tests and the build.gradle
* [EG-438] Fall back for aliases and use different resource strategy
* [EG-438] Define the URL using a project-specific context
* [EG-438] Tidy up initialization code
* [EG-438] Add testing to generator and tidy up
* [EG-438] Remove direct dependency on core and add own logging config
* [EG-438] Fix compiler warnings and tidy up logging
* [EG-438] Fix detekt warnings
* [EG-438] Improve error messages
* [EG-438] Address first set of review comments
* [EG-438] Use enums and a builder for the reporter
* [EG-438] Add kdocs for error resource static methods
* [EG-438] Handle enums defined with underscores
* [EG-438] Slight refactoring of startup code
* [EG-438] Port changes to error reporting code from future branch
* [EG-438] Also port test changes
* [EG-438] Suppress a deliberately unused parameter
* CORDA-3722 withEntityManager can rollback its session
## Summary
Improve the handling of database transactions when using
`withEntityManager` inside a flow.
Extra changes have been included to improve the safety and
correctness of Corda around handling database transactions.
This focuses on allowing flows to catch errors that occur inside an
entity manager and handle them accordingly.
Errors can be caught in two places:
- Inside `withEntityManager`
- Outside `withEntityManager`
Further changes have been included to ensure that transactions are
rolled back correctly.
## Catching errors inside `withEntityManager`
Errors caught inside `withEntityManager` require the flow to manually
`flush` the current session (the entity manager's individual session).
By manually flushing the session, a `try-catch` block can be placed
around the `flush` call, allowing possible exceptions to be caught.
Once an error is thrown from a call to `flush`, it is no longer possible
to use the same entity manager to trigger any database operations. The
only possible option is to rollback the changes from that session.
The flow can continue executing updates within the same session but they
will never be committed. What happens in this situation should be handled
by the flow. Explicitly restricting the scenario requires a lot of effort
and code. Instead, we should rely on the developer to control complex
workflows.
To continue updating the database after an error like this occurs, a new
`withEntityManager` block should be used (after catching the previous
error).
## Catching errors outside `withEntityManager`
Exceptions can be caught around `withEntityManager` blocks. This allows
errors to be handled in the same way as stated above, except the need to
manually `flush` the session is removed. `withEntityManager` will
automatically `flush` a session if it has not been marked for rollback
due to an earlier error.
A `try-catch` can then be placed around the whole of the
`withEntityManager` block, allowing the error to be caught while not
committing any changes to the underlying database transaction.
## Savepoints / Transactionality
To make `withEntityManager` blocks work like mini database transactions,
save points have been utilised. A new savepoint is created when opening
a `withEntityManager` block (along with a new session). It is then used
as a reference point to rollback to if the session errors and needs to
roll back. The savepoint is then released (independently from
completing successfully or failing).
Using save points means, that either all the statements inside the
entity manager are executed, or none of them are.
## Some implementation details
- A new session is created every time an entity manager is requested,
but this does not replace the flow's main underlying database session.
- `CordaPersistence.transaction` can now determine whether it needs
to execute its extra error handling code. This is needed to allow errors
escape `withEntityManager` blocks while allowing some of our exception
handling around subscribers (in `NodeVaultService`) to continue to work.
On node start, load CordaServices before starting the NotaryService,
so that the NotaryService can check that the services it requires are
available when starting.
Resolves#6172.
* CORDA-3762: Integration test exposing the problem reported
* CORDA-3726: Additional logging
* CORDA-3726: Prevent thread leaks
* CORDA-3726: New `journalBufferTimeout` parameter
* CORDA-3726: Override `journalBufferTimeout` parameter
* CORDA-3726: Making Detekt happier
* CORDA-3276: Account for extra thread user in MockNetwork
For real node this does not matter as `shutdown` can safely be called multiple times, which is not true for server thread provided by MockNetwork
* CORDA-3276: Do not make SMM shutdown "executor" as it belongs to AbstractNode
* CORDA-3276: Address input from @rick-r3
* CORDA-3276: Fix test after rebase
* adding blocked functions ro RestrictedEntityManager and creating RestrictedConnection class
* adding flow tests and fixing issues regarding the review
* adding quasar util to gradle
* updating flow tests
* adding space before } at .isThrownBy()
* adding spaces
* [EG-503] Spent state audit tool
Fixes
* Refinements to notary query interfaces. Feature complete.
* EG-503: Introduce optional `notaryService` in `ServiceHubCoreInternal`
* Remove redundant logic following change to use extensions API
Co-authored-by: Viktor Kolomeyko <viktor.kolomeyko@r3.com>
* CORDA-3696: Temporary update to enable JDK11 build and test. Will eventually be switchable.
* CORDA-3696: Filter out the Nashorn warning.
* CORDA-3696: Add JDK11 classifier.
* CORDA-3696: Updated match string to cope with JDK11.
* CORDA-3696: Filtering out SPHINCS256_SHA256 where failing due to JDK11.
* CORDA-3696: Now remove SPHINCS256_SHA256 only if JDK11.
* CORDA-3696: Fix test failure - switch to regex matching.
* CORDA-3696: Hide the illegal access warnings.
* CORDA-3696: Check for Java11 when disabling Java11 warnings.
* CORDA-3696: Fix unneccessary non null check.
* CORDA-3696: Reverting build env to JDK8
* CORDA-3696: Revert hiding of illegal access warnings via Unsafe class.
* CORDA-3696: Remove internal access warnings and new JDK11 version checker.
* CORDA-3696: Updated build file for OS
* CORDA-3696: Removed typo
* CORDA-3696: Fixed space typo.
* CORDA-3696: Open modules to remove the illegal access warnings.
Co-authored-by: Adel El-Beik <adelel-beik@19LDN-MAC108.local>
* CORDA-3691 Delete checkpoint when flow finishes
The checkpoint and its related records in joined tables should be deleted
when a flow finishes.
Keeping these flows around will be completed in the future.
* CORDA-3691 Ignore some flow metadata tests
Ignore tests around recording the finish time of flow metadata records
since we are not currently keeping COMPLETED flows in the database.
Flows that are kept for overnight observation:
- Save their Checkpoint.status as 'HOSPITALIZED' in the database
- Save the error that caused the hospitalization in the database
A new Event was added for this reason. Whenever the hospital determines
a flow for hospitalization, it adds this Event in the flow's fiber queue.
When processed it creates a new DB transaction, stores the checkpoint status along with
the error, and it adds a 'FlowContinuation.ProcessEvents' continuation so that the fiber keeps
processing events (effectively since there are no more events in the fiber's channel, the fiber will suspend).
Flows that error:
- Their checkpoints are kept in the database
- Save their Checkpoint.status as 'FAILED'
- Save the error that caused the error in the database
Upon erroring, the flow's Checkpoint.status gets updated('FAILED') and the checkpoint is stored
in the database instead of getting removed. The flow then propagates the error to counterparties,
sets its future with the error and gets removed from memory.
* ENT-4967: Require no classifier for corda-node-djvm, corda-deserializers-djvm.
* Also remove classifiers from core, serialization and finance-contracts.
* Compile corda-serialization-djvm for Java 8 and remove its classifier.
Added a new field Completed to the in-memory object FlowState.
FlowState.Completed is corresponds to flow_state=Null in the DB.
This change will save disk space.
* Run serialisation tests with both in-process and out-of-process nodes.
* Add custom serialisers and whitelists to Driver's AMQPServerSerializationScheme.
* Run serialisation tests with both in-process and out-of-process nodes.
* Add custom serialisers and whitelists to Driver's AMQPServerSerializationScheme.
* CORDA-3601 Record a flow's finish time
Record a flow's finish time by updating its metadata record. It is set
in `updateCheckpoint` by checking the status of the checkpoint. If it is
`COMPLETED` it will set the `finishInstant` on the metadata object and
update it.
* CORDA-3601 Record flow finish time for all finished statuses
Update the flow finish time for the following statuses:
- COMPLETED
- KILLED
- FAILED
* CORDA-3601 Use platform clock in `DBCheckpointStorage`