This reverts commit 0c7bad7792, as that
change causes a service restart loop. The supervisor cannot distinguish
between ports exposed via the `EXPOSE` directive and the docker-compose
`expose` property. Because of this, in the case of `network-mode:
service:<...>` the current state and target state never match, leading
to a service restart loop.
Change-type: patch
The supervisor exposes ports configured using the `EXPOSE` directive in
the dockerfile when configuring the container for runtime. This can
cause issues if using `network_mode: service:<service name>` as the
expose configuration is not compatible with that network mode. This
fix now skips image exposed ports for that particular network mode.
Change-type: patch
Relates-to: #2211
/mnt/boot/os-release isn't always accurate so /mnt/root should be the source of truth.
Change-type: patch
Signed-off-by: Christina Ying Wang <christina@balena.io>
The node-dbus module is unmaintained and a blocker for the update to
Node 18. Switching to our own node bindings for systemd solves this
issue
Relates-to: Shouqun/node-dbus#241
Change-type: patch
It's not an official status from container inspects, and the Supervisor
doesn't set it internally anywhere. It's better to remove it entirely as the
method by which Supervisor sets internal service statuses is by using a global
event emitter (reportNewStatus) which makes things difficult to test.
Change-type: patch
Signed-off-by: Christina Ying Wang <christina@balena.io>
The previous implementation in #2170 of parsing the container status was too general,
because it relied on the mistaken assumption that a container would have a status of
`Stopped` if it was manually stopped. This turned out to be untrue, as manually stopped
containers were also getting restarted by the Supervisor due to their inspect status of
`exited`. With this, parsing the exit message became unavoidable as there are no other
clear ways to discern a container that has been manually stopped and shouldn't be started
from a container experiencing the Engine-host race condition issue (again, see #2170).
Since we're just parsing the exit error message, we don't need to worry about different behaviors
amongst restart policies, as any container with the error message on exit should be started.
Change-type: patch
Closes: #2178
Signed-off-by: Christina Ying Wang <christina@balena.io>
As explained in the comments of this commit, a container with the restart policy
of 'on-failure' with a non-zero exit code matches the conditions for the race, so
the Supervisor will also attempt to start it. A container with the 'no' restart
policy that has been started once will not be started again. If a container with
'no' has never been started, its service status will be 'Installed' and the Supervisor
will already try to start it until success, so the service with 'no' doesn't require
special handling.
Signed-off-by: Christina Ying Wang <christina@balena.io>
This ensures target state has settled (since it seems that the 'applied' status
that's reported isn't 100% accurate and the actual Engine state may lag behind slightly)
Signed-off-by: Christina Ying Wang <christina@balena.io>
There exists a race condition between Engine and a host resource that may not
be immediately created. In this race condition, if a container's compose config
depends on the existence of that host resource, such as a network interface, and the
Engine tries to create & start the container before the host resource is created, the
Engine will not reattempt to start the container, regardless of the restart policy.
This is undesireable behavior but seems to be the behavior as implemented by Docker.
To rectify this, the Supervisor state funnel noops for a grace period of 1 minute
after starting a container to see that the container's status has become 'running`.
If the container exits because of the race condition, the status becomes 'exited' and the
Supervisor will attempt to generate another start step. This noop-wait-start step loop
will repeat until the container is able to start.
If the container is never able to start, there was a problem in the host in the creation of the
host resource, and that should be fixed at the host level.
This commit does not handle the case of services with restart policies "no" or "on-failure"
which encounter this host race, as metadata from container inspects needs to be introduced
during step calculation in order to figure out whether services with those restart policies
need to be started. This will be fixed in a future PR.
Change-type: patch
Signed-off-by: Christina Ying Wang <christina@balena.io>
Devices affected by the bug described in 1576, are also stuck with some
services in the `Downloaded` state, because the state engine does not
detect that the running services should be killed on a network change
even if they belong to a new release. This is a bug, which can be
replicated by the tests in this commit
Change-type: patch
These tests use the supervisor API to check that applying a target state
allows the device to eventually get to the desired target configuration.
This are high-level tests that work with real images and containers
using dind.
Change-type: patch
The actions now work by passing an intermediate state to the state
engine.
- doPurge first removes the user app from the target state and passes
that to the state engine for purging. Since intermediate state doesn't
remove images, this will have the effect of basically re-installing
the app.
- doRestart modifies the target state by first removing only the
services from the current state but keeping volumes and networks. This
has the same effect as before where services were stopped one by one
Change-type: patch
Network aliases are now compared checking that the target state is a
subset of the current state. This will prevent service restarts due to
additional aliases created by docker in the container.
Closes: #2134
Change-type: patch
When getting the service from the docker container, remove the
containerId from the list of aliases (which gets added by docker). This
will make it easier to use the current service state as a target.
This will help us remove the `safeStateClone` function in the API in a
future commit
Change-type: patch
There were multiple places in the state engine that skipped some
operations while in local mode. In reality, all it's needed while in
local mode is to skip image and volume deletion.
This commit simplifies application-manager and compose app to be more
local mode agnostic and instead making the image deletion and volume
deletion configurable via function arguments.
This also has the benefit to make the treatment of local mode
applications more similar to cloud mode applications, allowing for
API endpoints to function the same way both modes.
Change-type: patch
The OS since v2.82.6 will monitor changes to config.json and restart
the relevant services to apply the changes. There is no need to trigger
restart of the services via the supervisor. Users on older OS versions
will need to update their OS or restart the services manually as OS
loses support after 2y.
Change-type: patch
Closes: #2160
From: c0b4fafe84
Restart-service checks that both services have restarted in its test assertion, which is
incorrect as restart-service should only restart one service.
Change-type: patch
Signed-off-by: Christina Ying Wang <christina@balena.io>
As the Supervisor is a privileged container, it has access to host /dev, and therefore has access
to boot, data, and state balenaOS partitions. This commit sets up the framework for the following:
- Finds the /dev partition that corresponds to each partition based on partition label
- Mounts the partitions into set mountpoints in the device
- Removes reliance on env vars and mountpoints provided by host's start-balena-supervisor script
- Simplifies host path querying by centralizing these queries through methods in lib/host-utils.ts
This particular changes env vars for and mounts the boot partition.
Since the Supervisor would no longer rely on container `run` arguments provided by a host script,
this change moves Supervisor closer to being able to start itself (Supervisor-as-an-app).
Change-type: minor
Signed-off-by: Christina Ying Wang <christina@balena.io>
The issue with the original Supervisor implementation of the firewall is that
on Supervisor start, the Supervisor flushes the INPUT chain of the filter table.
This doesn't play well with services that add to the INPUT chain on startup that
may start up before the Supervisor, such as certain NetworkManager connection
profiles. This change only replaces the BALENA-FIREWALL rule in the INPUT chain,
preserving the other rules as well as their order.
Closes: #1482
Change-type: patch
Signed-off-by: Christina Ying Wang <christina@balena.io>
We have seen a few times devices with duplicated network names for some
reason. While we don't know the cause the networks get duplicates, this
can be disruptive for updates as trying to create a container referencing a duplicate
network results in a 400 error from the engine.
This commit finds and removes duplicate networks via the state engine,
this means that even if somehow a container could be referencing a
network that has been duplicated later somehow, this will remove the
container first.
While thies doesn't solve the problem of duplicate networks being
created in the first place, it will fix the state of the system to
correct the inconsistency.
Change-type: minor
Closes: #590
This includes:
- proxyvisor.js
- references in docs
- references device-state, api-binder, compose modules, API
- references in tests
The commit also adds a migration to remove the 4 dependent device tables from the DB.
Change-type: minor
Signed-off-by: Christina Ying Wang <christina@balena.io>
The wait-for-it script used during tests would setup a timer
that would send SIGUSR2 to the parent process after the timer ends.
Since node was ignoring additional signals, the timer ending would have
no effect after the node process had replaced the start script. However
when node has pid != 1, SIGUSR2 default behavior is to terminate the
process, meaning the tests would fail after 30 seconds.
The script is now updated so the timer is killed once the services are
ready for the tests.
The Raspberry Pi config.txt file defines the use of colon to configure
variables of the same name in different ports, for instance on those
devices with two hdmi ports. This syntax was previously not supported by
the supervisor. This change relaxes the syntax validation on config vars
to allow the use of the colon character.
Relates-to: #1573, #2046
Change-type: minor
This includes:
- /v1/apps/:appId/(stop|start)
- /v2/applications/:appId/(restart|stop|start)-service
Signed-off-by: Christina Ying Wang <christina@balena.io>