As explained in the comments of this commit, a container with the restart policy
of 'on-failure' with a non-zero exit code matches the conditions for the race, so
the Supervisor will also attempt to start it. A container with the 'no' restart
policy that has been started once will not be started again. If a container with
'no' has never been started, its service status will be 'Installed' and the Supervisor
will keep trying to start it until it succeeds, so a service with 'no' doesn't
require special handling.
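
The rules above can be sketched as a small decision function. This is only an
illustration: the `ServiceSnapshot` type and the `needsStart` name are hypothetical,
and the handling of 'always' and 'unless-stopped' (start again when not running) is
an assumption based on the earlier race-condition commit, not the Supervisor's code.

```typescript
type RestartPolicy = 'always' | 'unless-stopped' | 'on-failure' | 'no';
type Status = 'Installed' | 'running' | 'exited';

// Hypothetical snapshot of a service; not the Supervisor's real Service class.
interface ServiceSnapshot {
	restartPolicy: RestartPolicy;
	status: Status;
	// Exit code from the container inspect, if the container has exited.
	exitCode?: number;
}

function needsStart(svc: ServiceSnapshot): boolean {
	if (svc.status === 'running') {
		return false;
	}
	// Never started: the Supervisor already retries these until success.
	if (svc.status === 'Installed') {
		return true;
	}
	switch (svc.restartPolicy) {
		case 'no':
			// Started once already, so it is not started again.
			return false;
		case 'on-failure':
			// A non-zero exit code matches the conditions for the race.
			return svc.exitCode !== undefined && svc.exitCode !== 0;
		default:
			// Assumption: 'always' and 'unless-stopped' are simply started again.
			return true;
	}
}
```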
Signed-off-by: Christina Ying Wang <christina@balena.io>
This ensures the target state has settled, since the reported 'applied' status
does not seem to be 100% accurate and the actual Engine state may lag slightly behind.
Signed-off-by: Christina Ying Wang <christina@balena.io>
There exists a race condition between Engine and a host resource that may not
be immediately created. In this race condition, if a container's compose config
depends on the existence of that host resource, such as a network interface, and the
Engine tries to create & start the container before the host resource is created, the
Engine will not reattempt to start the container, regardless of the restart policy.
This is undesirable, but it seems to be the behavior as implemented by Docker.
To rectify this, the Supervisor state funnel noops for a grace period of 1 minute
after starting a container to see that the container's status has become 'running'.
If the container exits because of the race condition, the status becomes 'exited' and the
Supervisor will attempt to generate another start step. This noop-wait-start step loop
will repeat until the container is able to start.
If the container is never able to start, the host failed to create the
host resource, and that should be fixed at the host level.
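
A minimal sketch of the noop-wait-start loop follows. The `ContainerView` type,
the `lastStartAttemptMs` field and the `nextStep` function are hypothetical names,
and the sketch assumes the Supervisor waits out the full grace period before
generating another start step; the real state funnel may differ.

```typescript
// Grace period from the description above: 1 minute after a start attempt.
const START_GRACE_MS = 60 * 1000;

// Hypothetical view of a container; not the Supervisor's real types.
interface ContainerView {
	status: 'created' | 'running' | 'exited';
	// When a start was last issued for this container (ms since epoch).
	lastStartAttemptMs: number;
}

// null means no step is needed for this container.
type Step = 'noop' | 'start' | null;

function nextStep(container: ContainerView, nowMs: number): Step {
	if (container.status === 'running') {
		return null;
	}
	// Still inside the grace period: wait and look again on the next pass.
	if (nowMs - container.lastStartAttemptMs < START_GRACE_MS) {
		return 'noop';
	}
	// Grace period elapsed and the container is not running: try again.
	return 'start';
}
```

Calling `nextStep` on every pass of the state loop yields noop, noop, ..., start
until the container reports 'running'.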
This commit does not handle the case of services with restart policies "no" or "on-failure"
which encounter this host race, as metadata from container inspects needs to be introduced
during step calculation in order to figure out whether services with those restart policies
need to be started. This will be fixed in a future PR.
Change-type: patch
Signed-off-by: Christina Ying Wang <christina@balena.io>
Support for colon characters was added in v14.6.0, which enabled
configurations for HDMI port 2 (e.g. on the RPi 4). These configurations
are not documented anywhere else, so this makes it easier for users to
find the relevant information for working with HDMI.
Change-type: patch
Relates-to: #2090
After a recent change enforcing all the partitions to be on the same
block device, encrypted partitions are no longer being detected
correctly. This is because the assumption that the parent block device
is a substring of the actually mounted block device does not work
for LUKS devices - the mount will either be /dev/mapper/luks-XXX
or /dev/dm-X while the parent device is still e.g. /dev/sda.
The usual balenaOS boot partition is also split in two - boot and efi.
The boot partition (mounted under /mnt/boot) is encrypted and the efi
partition (mounted under /mnt/efi) is not.
This patch generalizes the detection of the parent device so that
it works with both encrypted and unencrypted partitions.
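
One way to picture a substring-free check is to walk from the mounted device up
to its top-level parent. This is only a sketch under assumptions: the `ParentMap`
type and the `rootBlockDevice` function are hypothetical, the parent relationships
are taken as given however they are obtained, and this is not necessarily how the
patch itself detects the parent.

```typescript
// Hypothetical map from a block device to its direct parent device.
type ParentMap = Record<string, string | undefined>;

// Follow parent links until a device with no parent is reached.
function rootBlockDevice(device: string, parents: ParentMap): string {
	const seen = new Set<string>(); // guards against malformed, cyclic input
	let current = device;
	while (parents[current] !== undefined && !seen.has(current)) {
		seen.add(current);
		current = parents[current] as string;
	}
	return current;
}

// Two mounts are on the same block device if they share a root.
function onSameDevice(a: string, b: string, parents: ParentMap): boolean {
	return rootBlockDevice(a, parents) === rootBlockDevice(b, parents);
}
```

With a map such as `/dev/dm-0` to `/dev/sda2` to `/dev/sda`, an encrypted boot
partition and an unencrypted efi partition both resolve to `/dev/sda`, even though
`/dev/sda` is not a substring of `/dev/dm-0`.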
Change-type: patch
Signed-off-by: Michal Toman <michalt@balena.io>
The docker compose V2 spec no longer accepts `network_mode: bridge`,
which means we can no longer override the network configuration of
the `balena-supervisor` service for tests.
For this reason we now create a separate service to run the built
supervisor `balena-supervisor-sut` and run API tests against this
service instead of the default `balena-supervisor`.
Change-type: patch
A bug in service comparison meant that a device already running a service
from a new release with network changes would never stop the running
service, so the remaining services would get stuck in the `Downloaded`
state forever.
This fixes the comparison so the service will get killed in this case,
particularly allowing devices to recover from #1576.
Change-type: patch
Devices affected by the bug described in #1576 are also stuck with some
services in the `Downloaded` state, because the state engine does not
detect that the running services should be killed on a network change
even if they belong to a new release. This is a bug, which can be
replicated by the tests in this commit.
Change-type: patch
Previously, an `updateMetadata` step would take precedence over a `kill`
step when network changes were present. This would lead to an
inconsistent state if an update included both a network change and a
container change.
Closes: #1576
Change-type: patch
These tests use the supervisor API to check that applying a target state
allows the device to eventually get to the desired target configuration.
These are high-level tests that work with real images and containers
using dind.
Change-type: patch
The supervisor allows the target image to be an image without a
registry (e.g. `alpine:latest`). While this really only happens in
local mode, we don't want to pass credentials to the default registry, as
those credentials are meant for the balena registry and would otherwise fail.
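
As an illustration, the check for "an image without a registry" can follow Docker's
usual reference rule: the part before the first `/` names a registry only if it
contains a `.` or a `:` or is `localhost`. The `usesDefaultRegistry` and `authFor`
names and the `Credentials` type are hypothetical, and the sketch covers only the
case described above, not matching credentials to specific registry hosts.

```typescript
// True when the image name carries no registry host, e.g. `alpine:latest`.
function usesDefaultRegistry(image: string): boolean {
	const slash = image.indexOf('/');
	if (slash === -1) {
		return true;
	}
	const first = image.slice(0, slash);
	// A first component with '.' or ':' (or 'localhost') is a registry host.
	return !(/[.:]/.test(first) || first === 'localhost');
}

// Hypothetical credentials shape, for illustration only.
interface Credentials {
	username: string;
	password: string;
}

// Skip credentials entirely when the image comes from the default registry.
function authFor(image: string, creds: Credentials): Credentials | undefined {
	return usesDefaultRegistry(image) ? undefined : creds;
}
```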
Change-type: patch
Target volatile doesn't make sense now that we can use the
current state as a target. Apparently it wasn't being used for
anything anymore.
Change-type: patch
This simplifies this module interface and hides implementation details
from the rest of the code.
The function `applyIntermediateTarget` will now call `pausingApply`
before applying the target.
API actions no longer need to call `pausingApply`.
Change-type: patch
The actions now work by passing an intermediate state to the state
engine.
- doPurge first removes the user app from the target state and passes
that to the state engine for purging. Since intermediate state doesn't
remove images, this will have the effect of basically re-installing
the app.
- doRestart modifies the target state by first removing only the
services from the current state but keeping volumes and networks. This
has the same effect as before, where services were stopped one by one.
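
The two intermediate states described above can be sketched as pure functions.
The `App` and `State` shapes and the function names are hypothetical
simplifications, not the Supervisor's real types; in both cases the full target
would be applied again afterwards.

```typescript
// Hypothetical, simplified app shape; real services, volumes and networks
// are richer objects.
interface App {
	services: string[];
	volumes: string[];
	networks: string[];
}

// Apps keyed by app UUID.
type State = Record<string, App>;

// doPurge: drop the user app from the target entirely.
function purgeIntermediate(target: State, appUuid: string): State {
	const intermediate: State = { ...target };
	delete intermediate[appUuid];
	return intermediate;
}

// doRestart: remove only the services, keep volumes and networks.
function restartIntermediate(current: State, appUuid: string): State {
	const app = current[appUuid];
	if (app === undefined) {
		return { ...current };
	}
	return { ...current, [appUuid]: { ...app, services: [] } };
}
```

Neither function mutates its input, so the original target stays available for
the final apply.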
Change-type: patch
Local mode uses a numeric `appUuid`, which was breaking the parsing of
the network name. This commit fixes the parsing so the current state
can be used as a target state.
Change-type: patch
The Service class in `compose/service.ts` cannot get the image name
from the image id when building the object from the container metadata.
We query the metadata in the application manager getCurrentApps method
so the current state can be used as target by API methods
Change-type: patch
Network aliases are now compared checking that the target state is a
subset of the current state. This will prevent service restarts due to
additional aliases created by docker in the container.
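
The subset comparison can be sketched as follows. The `aliasesMatch` name is
hypothetical and the sketch treats aliases as plain strings.

```typescript
// The target matches when every target alias is present in the current
// state; extra aliases on the current side are ignored.
function aliasesMatch(current: string[], target: string[]): boolean {
	const present = new Set(current);
	return target.every((alias) => present.has(alias));
}
```

For example, a current list of `['web', '0123456789ab']` matches a target of
`['web']`, while a current list of `['web']` does not match a target of
`['web', 'api']`.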
Closes: #2134
Change-type: patch
When getting the service from the docker container, remove the
containerId from the list of aliases (which gets added by docker). This
will make it easier to use the current service state as a target.
This will help us remove the `safeStateClone` function in the API in a
future commit
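
A minimal sketch of that filtering, assuming the alias added by docker is either
the full container ID or its 12-character short form. The function name is
hypothetical.

```typescript
// Drop the alias that docker derives from the container ID, keeping the
// aliases that came from the service definition.
function withoutContainerIdAlias(containerId: string, aliases: string[]): string[] {
	const shortId = containerId.slice(0, 12);
	return aliases.filter((alias) => alias !== containerId && alias !== shortId);
}
```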
Change-type: patch
This replaces the previous flag `isApplyingIntermediate` on application
manager and simplifies the interface of the state engine to make temporary changes to the
general app state.
Change-type: patch
There were multiple places in the state engine that skipped some
operations while in local mode. In reality, all that's needed while in
local mode is to skip image and volume deletion.
This commit simplifies application-manager and compose app to be more
local mode agnostic, and instead makes image deletion and volume
deletion configurable via function arguments.
This also has the benefit of making the treatment of local mode
applications more similar to cloud mode applications, allowing
API endpoints to function the same way in both modes.
Change-type: patch