Commit Graph

453 Commits

Author SHA1 Message Date
Christina Ying Wang
6e185fbd44 Don't follow symlinks when checking for lockfiles
The Supervisor should only care whether a lockfile exists or
not. This also fixes an edge case where a user symlinked a lockfile
to a nonexistent file, causing the Supervisor to enter an error
loop as it was not able to `stat` the nonexistent file.

Change-type: patch
Signed-off-by: Christina Ying Wang <christina@balena.io>
2024-04-12 10:34:46 -04:00
Christina Ying Wang
f863075bdc Add memory usage healthcheck
This healthcheck fails when Supervisor memory usage is above a threshold
based on initial memory measurements after device state has settled.

Change-type: patch
Signed-off-by: Christina Ying Wang <christina@balena.io>
2024-04-11 18:16:47 -07:00
Christina Ying Wang
8ac2ce4677 Respect lockOverride when taking locks
Signed-off-by: Christina Ying Wang <christina@balena.io>
2024-04-06 00:59:04 -07:00
Christina Ying Wang
fd7d58f89a Clean up lockfiles on takeLock step failure
We don't want any Supervisor lockfiles to remain on the device
when a takeLock step fails because this would interfere with the user app.

Signed-off-by: Christina Ying Wang <christina@balena.io>
2024-04-04 14:07:47 -07:00
Christina Ying Wang
fb1bd33ab6 Refine update locking interface
* Remove Supervisor lockfile cleanup SIGTERM listener
* Modify lockfile.getLocksTaken to read files from the filesystem
* Remove in-memory tracking of locks taken in favor of filesystem
* Require both `(resin-)updates.lock` to be locked with `nobody` UID
  for service to count as locked by the Supervisor

Signed-off-by: Christina Ying Wang <christina@balena.io>
2024-04-04 14:07:47 -07:00
Christina Ying Wang
10f294cf8e Add takeLock to state funnel
A takeLock step should be generated before any of the following steps:
* kill
* start
* stop
* updateMetadata
* restart
* handover

ALL services in an app will be locked for any of the above actions,
unless the action is generated through Supervisor API's
`POST /v2/applications/:appId/(start|stop|restart)-service` endpoints,
in which case only the target service will be locked.

A lock will be taken for a service before it starts by creating the
directory in /tmp before the Engine creates it through bind mounts.

Also, the commit simplifies the generation of service kill
steps from network/volume changes or removals.

Signed-off-by: Christina Ying Wang <christina@balena.io>
2024-04-04 14:07:47 -07:00
Christina Ying Wang
cf8d8cedd7 Simplify lock interface to prep for adding takeLock to state funnel
This commit changes a few things:

* Pass `force` to `takeLock` step directly. This allows us to remove
the `lockFn` used by app manager's action executors, setting takeLock
as the main interface to interact with the update lock module. Note
that this commit by itself will not pass tests, as no update locking
occurs where it once did. This will be amended in the next commit.

* Remove locking functions from doRestart & doPurge, as this is
the only area where skipLock is required.

* Remove `skipLock` interface, as it's redundant with the functionality
of `force`. The only time `skipLock` is true is in doRestart/doPurge,
as those API methods are already run within a lock function. We removed
the lock function which removes the need for skipLock, and in the next
commit we'll add locking as a composition step to replace the
functionality removed here.

* Remove some methods not in use, such as app manager's `stopAll`.

Signed-off-by: Christina Ying Wang <christina@balena.io>
2024-04-04 14:07:47 -07:00
Christina Ying Wang
af6359f7ae Take lock before updating service metadata
Change-type: minor
Signed-off-by: Christina Ying Wang <christina@balena.io>
2024-04-04 14:07:47 -07:00
Christina Ying Wang
e6df78a22b Implement takeLock composition step + tests
This commit only implements the action that a takeLock step
results in. It does not add takeLock step generation logic
to the state funnel yet.

Signed-off-by: Christina Ying Wang <christina@balena.io>
2024-04-04 14:07:47 -07:00
Christina Ying Wang
f2843e1382 Add update lock release functionality to state funnel
releaseLock is a step that will be inferred if there are services
in target state, and if some of those services have locks taken by
the Supervisor.

The releaseLock composition step calls the method of the same name
in the updateLock module, which takes the exclusive process lock before
disposing all Supervisor lockfiles in the target appId.

This is half of the update lock incorporation into the state funnel, as
we also need to introduce a takeLock step which triggers during crucial
stages of device state transition.

Signed-off-by: Christina Ying Wang <christina@balena.io>
2024-04-04 14:07:47 -07:00
Christina Ying Wang
d18a740a40 Add methods for easier checking of lockfile existence
Signed-off-by: Christina Ying Wang <christina@balena.io>
2024-04-04 14:07:47 -07:00
Felipe Lalanne
6217546894 Update typescript to v5
This also updates code to use the default import syntax instead of
`import * as` when the imported module exposes a default. This is needed
with the latest typescript version.

Change-type: patch
2024-03-05 15:33:56 -03:00
Felipe Lalanne
988a1c9e9a Update @balena/lint to v7
This updates balena lint to the latest version to enable eslint support
and unblock Typescript updates. This is a huge number of changes as the
linting rules are much more strict now, requiring modifiying most files
in the codebase. This commit also bumps the test dependency `rewire` as
that was interfering with the update of balena-lint

Change-type: patch
2024-03-01 18:27:30 -03:00
Felipe Lalanne
bda1bac04c Add support for repeated overlays
RPI firmware configuration allows repeating overlays to define
configurations on multiple devices. For instance, for configuring
multiple `ads` devices, `config.txt` needs to be setup this way

```
dtoverlay=ads1115,addr=0x48
dtoverlay=ads1115,addr=0x49
```

Before this change, the supervisor would interpret both lines as
belonging to the same overlay, preventing users from configuring multiple
devices, and leading to a loop when trying to apply configurations with
repeated overlays coming from the cloud side.

Change-type: minor
2024-02-27 14:52:41 -03:00
Christina Ying Wang
3fd035c5bd Patch default dtparam handling in config.txt
This commit completes the list of default / board-wide dtparams
to include some `baudrate` and `vc` i2c params.

Change-type: patch
Signed-off-by: Christina Ying Wang <christina@balena.io>
2024-02-21 12:45:29 -08:00
Christina Ying Wang
e22253ce6e Patch config.txt backend to return array configs correctly
Previously, getBootConfig() of the config.txt backend was omitting
array configurations such as gpio settings, thus resulting in the SV
mistakenly assuming that boot config had not been applied, since gpio
would not be in current config.txt config but would be in target config.
This resulted in SV entering an infinite loop of attempting to apply the
gpio config when it wasn't necessary.

Change-type: patch
Signed-off-by: Christina Ying Wang <christina@balena.io>
2024-02-16 18:12:33 -08:00
Felipe Lalanne
6e6a796da5 Add special case for base DTO params on RPI config
While ordering is important in the RPI firmware configuration file (config.txt),
some dt params are by default considered part of the base dt overlay
if they are not used by other overlays.
Unfortunately the [list of dtparams](https://github.com/raspberrypi/firmware/blob/master/boot/overlays/README#L133)
is too long to add all of them as exceptions, but we can add the params
used in the default config.txt provided in OS images, to avoid reboots
when updating to this new supervisor and correctly parsing the
provisioning config.txt as variables.

While this addition handles most common scenarios, there is still a
chance a user may have use other base overlay dt params in the initial
config, in which case those will be interpreted according to the
relative ordering

Change-type: patch
2024-02-08 15:48:10 -03:00
Felipe Lalanne
55a8c5bf90 Add tests for dtoverlay management in config.txt 2024-02-07 20:38:44 -03:00
Christina Ying Wang
3afcef2969 Respect update strategies app-wide instead of at the service level
Fixes behavior for release updates which removes a service in current state
and adds a new service in target state.

Change-type: patch
Closes: #2095
Signed-off-by: Christina Ying Wang <christina@balena.io>
2024-01-29 12:26:28 -08:00
Felipe Lalanne
87b195685a Use the state-helper functions in app module tests 2024-01-29 12:25:55 -08:00
Felipe Lalanne
6ee606806d Fix docker utils tests for docker v25
From docker 25, the engine will validate IPAM config. This would cause
the docker utils test to fail since the network/subnet configuration was
incorrect.

Change-type: patch
2024-01-25 15:05:12 -03:00
Felipe Lalanne
9bd216327f Expose ports from port mappings on services
PR #2217 removed the expose configuration but also caused a regresion
where ports set via the `ports` configuration would no longer get
exposed to the host, despite portmappings being set. This fixes that
issue by exposing only those ports comming from port mappings.

Change-type: patch
2023-10-24 15:04:39 -03:00
Felipe Lalanne
416170bc05 Ignore expose service compose configuration
The docker EXPOSE directive and corresponding docker-compose `expose`
service configuration serves as documentation/metadata that a container
listens on a certain port that may be used for service discovery but it doesn't
have any real impact on the ability for
other containers on the same network to access the exposed service via
the port. In newer engine implementations, this property may conflict
with other network configurations, and prevent the container from being
started by the docker engine (see #2211).

This PR removes code that would manage the expose property and takes the
property out of the whitelist. A composition with the `expose` property
will result in the log message `Ignoring unsupported or unknown compose fields: expose`.

While this change should not have operational impact, it still removes
a previously supported configuration and as such there is a chance of it
being a breaking change for some applications. For this reason it is
being published as a new major version.

Change-type: major
Closes: #2211
2023-10-23 11:41:32 -03:00
Pagan Gazzard
c9f032e13a Switch _.isFunction usage to native versions
Change-type: patch
2023-10-16 14:30:25 -03:00
Pagan Gazzard
20df54668c Switch _.isArray usage to native versions
Change-type: patch
2023-10-16 14:30:25 -03:00
Felipe Lalanne
3e828dcc52 Revert "Do not expose ports from image if service network mode"
This reverts commit 0c7bad7792, as that
change causes a service restart loop. The supervisor cannot distinguish
between ports exposed via the `EXPOSE` directive and the docker-compose
`expose` property. Because of this, in the case of `network-mode:
service:<...>` the current state and target state never match, leading
to a service restart loop.

Change-type: patch
2023-10-16 13:06:50 -03:00
Pagan Gazzard
766cce89c7 Convert multiple bluebird uses to native promises
Change-type: patch
2023-10-16 11:40:45 +01:00
Felipe Lalanne
0c7bad7792 Do not expose ports from image if service network mode
The supervisor exposes ports configured using the `EXPOSE` directive in
the dockerfile when configuring the container for runtime. This can
cause issues if using `network_mode: service:<service name>` as the
expose configuration is not compatible with that network mode. This
fix now skips image exposed ports for that particular network mode.

Change-type: patch
Relates-to: #2211
2023-10-12 18:03:42 -03:00
Pagan Gazzard
894bdeeeb6 Remove unused docker logs logging code
Change-type: patch
2023-10-11 14:20:33 +01:00
Christina Ying Wang
bc1d251e66 Revert os-release path to /mnt/root
/mnt/boot/os-release isn't always accurate so /mnt/root should be the source of truth.

Change-type: patch
Signed-off-by: Christina Ying Wang <christina@balena.io>
2023-10-09 14:02:02 -07:00
Felipe Lalanne
327dc31ef0 Replace node-dbus with @balena/systemd
The node-dbus module is unmaintained and a blocker for the update to
Node 18. Switching to our own node bindings for systemd solves this
issue

Relates-to: Shouqun/node-dbus#241
Change-type: patch
2023-08-16 15:58:52 -04:00
Felipe Lalanne
8f17c30de6 Replace dbus test service with mock-systemd-bus
This avoids unnecessary mocking and tests against the real systemd API

Change-type: patch
2023-08-16 14:46:58 -04:00
Alexandru Costache
512240c544 backends: Add Jetson Orin NANO custom device-tree support
Signed-off-by: Alexandru Costache <alexandru@balena.io>
Change-type: patch
2023-07-11 18:11:32 +03:00
Christina Ying Wang
38fe8dae75 Remove the 'Stopped' status for services
It's not an official status from container inspects, and the Supervisor
doesn't set it internally anywhere. It's better to remove it entirely as the
method by which Supervisor sets internal service statuses is by using a global
event emitter (reportNewStatus) which makes things difficult to test.

Change-type: patch
Signed-off-by: Christina Ying Wang <christina@balena.io>
2023-06-28 11:17:13 -04:00
Christina W
71d24d6e33 Parse container exit error message instead of status
The previous implementation in #2170 of parsing the container status was too general,
because it relied on the mistaken assumption that a container would have a status of
`Stopped` if it was manually stopped. This turned out to be untrue, as manually stopped
containers were also getting restarted by the Supervisor due to their inspect status of
`exited`. With this, parsing the exit message became unavoidable as there are no other
clear ways to discern a container that has been manually stopped and shouldn't be started
from a container experiencing the Engine-host race condition issue (again, see #2170).

Since we're just parsing the exit error message, we don't need to worry about different behaviors
amongst restart policies, as any container with the error message on exit should be started.

Change-type: patch
Closes: #2178
Signed-off-by: Christina Ying Wang <christina@balena.io>
2023-06-22 14:43:17 -07:00
Christina Ying Wang
7eba48f8b8 Improve tests surrounding Engine-host race patch
See: #2170
Change-type: patch
Signed-off-by: Christina Ying Wang <christina@balena.io>
2023-06-19 11:11:26 -07:00
Christina Ying Wang
9e249e6ae8 Remove unnecessary async/await from method
Signed-off-by: Christina Ying Wang <christina@balena.io>
2023-06-19 11:11:26 -07:00
Christina Ying Wang
6e6f79c71d Decrease wait time before start from 60s to 30s
60 seconds to wait may be excessively long.

Signed-off-by: Christina Ying Wang <christina@balena.io>
2023-06-19 11:11:26 -07:00
Christina Ying Wang
ace642ea0f Improve naming of a util function & add unit test
isOlderThan -> isValidDateAndOlderThan

See: https://github.com/balena-os/balena-supervisor/pull/2170#discussion_r1226809686
Signed-off-by: Christina Ying Wang <christina@balena.io>
2023-06-19 11:11:26 -07:00
Christina Ying Wang
2537eb8189 Handle the case of 'on-failure' restart policy
As explained in the comments of this commit, a container with the restart policy
of 'on-failure' with a non-zero exit code matches the conditions for the race, so
the Supervisor will also attempt to start it. A container with the 'no' restart
policy that has been started once will not be started again. If a container with
'no' has never been started, its service status will be 'Installed' and the Supervisor
will already try to start it until success, so the service with 'no' doesn't require
special handling.

Signed-off-by: Christina Ying Wang <christina@balena.io>
2023-06-05 11:05:58 -07:00
Christina Ying Wang
95f3e13d50 Add extra delay after state engine integration tests
This ensures target state has settled (since it seems that the 'applied' status
that's reported isn't 100% accurate and the actual Engine state may lag behind slightly)

Signed-off-by: Christina Ying Wang <christina@balena.io>
2023-05-31 11:33:27 -07:00
Christina Ying Wang
7f32141958 Handle Engine-host race condition for "always" and "unless-stopped" restart policy
There exists a race condition between Engine and a host resource that may not
be immediately created. In this race condition, if a container's compose config
depends on the existence of that host resource, such as a network interface, and the
Engine tries to create & start the container before the host resource is created, the
Engine will not reattempt to start the container, regardless of the restart policy.
This is undesireable behavior but seems to be the behavior as implemented by Docker.

To rectify this, the Supervisor state funnel noops for a grace period of 1 minute
after starting a container to see that the container's status has become 'running`.
If the container exits because of the race condition, the status becomes 'exited' and the
Supervisor will attempt to generate another start step. This noop-wait-start step loop
will repeat until the container is able to start.

If the container is never able to start, there was a problem in the host in the creation of the
host resource, and that should be fixed at the host level.

This commit does not handle the case of services with restart policies "no" or "on-failure"
which encounter this host race, as metadata from container inspects needs to be introduced
during step calculation in order to figure out whether services with those restart policies
need to be started. This will be fixed in a future PR.

Change-type: patch
Signed-off-by: Christina Ying Wang <christina@balena.io>
2023-05-31 11:32:19 -07:00
Felipe Lalanne
2758e190b2 Fix sw.arch typo when testing contracts
Change-type: patch
2023-05-11 13:07:26 -04:00
Felipe Lalanne
8656bd62f7 Add arch.sw to the valid container requirements
Change-type: minor
2023-05-09 15:44:26 -04:00
Felipe Lalanne
f1f09e0e27 Allow using slug to validate hw.device-type contract
This also adds the hw.device-type test case to the unit tests.

Change-type: patch
2023-05-09 15:20:18 -04:00
Felipe Lalanne
a884a58b4c Simplify and move lib/contract.spec.ts to tests/unit
Improve contract tests to remove dependence on stubs and unnecessary
system calls.

Change-type: patch
2023-05-09 15:20:12 -04:00
Felipe Lalanne
7b8b187c74 Create tests with recovery from #1576
Devices affected by the bug described in 1576, are also stuck with some
services in the `Downloaded` state, because the state engine does not
detect that the running services should be killed on a network change
even if they belong to a new release. This is a bug, which can be
replicated by the tests in this commit

Change-type: patch
2023-04-26 11:58:42 -04:00
Felipe Lalanne
0a358a4463 Add replication of issue using unit tests
Change-type: patch
2023-04-25 14:47:00 -04:00
Felipe Lalanne
138aec5de4 Add integration tests for state-engine
These tests use the supervisor API to check that applying a target state
allows the device to eventually get to the desired target configuration.

This are high-level tests that work with real images and containers
using dind.

Change-type: patch
2023-04-25 14:47:00 -04:00
Felipe Lalanne
3d43f7e3b3 Simplify doRestart and doPurge actions
The actions now work by passing an intermediate state to the state
engine.

- doPurge first removes the user app from the target state and passes
  that to the state engine for purging. Since intermediate state doesn't
  remove images, this will have the effect of basically re-installing
  the app.

- doRestart modifies the target state by first removing only the
  services from the current state but keeping volumes and networks. This
  has the same effect as before where services were stopped one by one

Change-type: patch
2023-04-20 14:58:58 -04:00