1534 Commits

Author SHA1 Message Date
Pagan Gazzard
766cce89c7 Convert multiple bluebird uses to native promises
Change-type: patch
2023-10-16 11:40:45 +01:00
Felipe Lalanne
0c7bad7792 Do not expose ports from image if service network mode
The supervisor exposes ports configured using the `EXPOSE` directive in
the dockerfile when configuring the container for runtime. This can
cause issues if using `network_mode: service:<service name>` as the
expose configuration is not compatible with that network mode. This
fix now skips image exposed ports for that particular network mode.

Change-type: patch
Relates-to: #2211
2023-10-12 18:03:42 -03:00
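The rule described in 0c7bad7792 can be sketched as a small pure function. This is only an illustration: `ServiceConfig`, `resolveExposedPorts`, and the choice to keep ports declared in the composition are assumptions, not the Supervisor's actual code.

```typescript
// Hypothetical, simplified shape of a service's runtime configuration.
interface ServiceConfig {
  networkMode: string; // e.g. 'bridge', 'host', 'service:main'
  expose: string[]; // ports declared in the composition, e.g. '8080/tcp'
}

// Returns the ports to expose on the container. Ports from the image's
// EXPOSE directive are skipped when the container joins another service's
// network (`network_mode: service:<service name>`).
function resolveExposedPorts(
  config: ServiceConfig,
  imageExposedPorts: string[],
): string[] {
  if (config.networkMode.startsWith('service:')) {
    // Assumption: composition-level ports are kept, image ports are dropped.
    return [...config.expose];
  }
  // De-duplicate while preserving order.
  return Array.from(new Set([...config.expose, ...imageExposedPorts]));
}
```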
Pagan Gazzard
3d73bf3e91 Use mutation for adding service/image ids to logs to reduce allocations
Change-type: patch
2023-10-11 15:39:19 -03:00
Pagan Gazzard
d685ccacb2 Keep the container lock for the entire duration of attaching logs
Change-type: patch
2023-10-11 15:39:19 -03:00
Pagan Gazzard
74d374b5ad Remove unnecessary async on handling journald stderr entries
Change-type: patch
2023-10-11 15:39:19 -03:00
Pagan Gazzard
e3806ec018 Avoid unnecessary work in systemd log row handling for invalid logs
Change-type: patch
2023-10-11 15:39:19 -03:00
Pagan Gazzard
894bdeeeb6 Remove unused docker logs logging code
Change-type: patch
2023-10-11 14:20:33 +01:00
Christina Ying Wang
06d4775178 Use native structuredClone instead of _.cloneDeep
Memory tests have shown performance improvements from using the native method.

Change-type: patch
Signed-off-by: Christina Ying Wang <christina@balena.io>
2023-09-29 12:29:50 -07:00
jaomaloy
ab513cc021 Dump target-state to hostOS tmp dir
This change is mainly for the hostOS
to know if update locks should be ignored
when updating to a newer version.

Change-type: patch
Signed-off-by: jaomaloy <jao.maloy@balena.io>
2023-09-14 11:03:34 +08:00
Felipe Lalanne
327dc31ef0 Replace node-dbus with @balena/systemd
The node-dbus module is unmaintained and a blocker for the update to
Node 18. Switching to our own node bindings for systemd solves this
issue

Relates-to: Shouqun/node-dbus#241
Change-type: patch
2023-08-16 15:58:52 -04:00
Alexandru Costache
512240c544 backends: Add Jetson Orin NANO custom device-tree support
Signed-off-by: Alexandru Costache <alexandru@balena.io>
Change-type: patch
2023-07-11 18:11:32 +03:00
Florin Sarbu
8d2b310af8 Add revpi-connect-s to Raspberry Pi variants
We need the supervisor to be able to manage config.txt changes for the
Revolution Pi Connect S.

Change-type: patch
Signed-off-by: Florin Sarbu <florin@balena.io>
2023-07-05 13:55:29 +02:00
Christina Ying Wang
38fe8dae75 Remove the 'Stopped' status for services
It's not an official status from container inspects, and the Supervisor
doesn't set it internally anywhere. It's better to remove it entirely as the
method by which Supervisor sets internal service statuses is by using a global
event emitter (reportNewStatus) which makes things difficult to test.

Change-type: patch
Signed-off-by: Christina Ying Wang <christina@balena.io>
2023-06-28 11:17:13 -04:00
Christina W
71d24d6e33 Parse container exit error message instead of status
The previous implementation in #2170 of parsing the container status was too general,
because it relied on the mistaken assumption that a container would have a status of
`Stopped` if it was manually stopped. This turned out to be untrue, as manually stopped
containers were also getting restarted by the Supervisor due to their inspect status of
`exited`. With this, parsing the exit message became unavoidable as there are no other
clear ways to discern a container that has been manually stopped and shouldn't be started
from a container experiencing the Engine-host race condition issue (again, see #2170).

Since we're just parsing the exit error message, we don't need to worry about different behaviors
amongst restart policies, as any container with the error message on exit should be started.

Change-type: patch
Closes: #2178
Signed-off-by: Christina Ying Wang <christina@balena.io>
2023-06-22 14:43:17 -07:00
Felipe Lalanne
12eac04484 Fix /v2/applications/state endpoint
It was returning stale information, particularly the download progress
of the target release images never got updated.

Change-type: patch
Closes: #2174
2023-06-19 17:16:36 -04:00
Christina Ying Wang
9e249e6ae8 Remove unnecessary async/await from method
Signed-off-by: Christina Ying Wang <christina@balena.io>
2023-06-19 11:11:26 -07:00
Christina Ying Wang
6e6f79c71d Decrease wait time before start from 60s to 30s
60 seconds to wait may be excessively long.

Signed-off-by: Christina Ying Wang <christina@balena.io>
2023-06-19 11:11:26 -07:00
Christina Ying Wang
ace642ea0f Improve naming of a util function & add unit test
isOlderThan -> isValidDateAndOlderThan

See: https://github.com/balena-os/balena-supervisor/pull/2170#discussion_r1226809686
Signed-off-by: Christina Ying Wang <christina@balena.io>
2023-06-19 11:11:26 -07:00
Christina Ying Wang
ab80f198d8 Add exitCode property to Service class
Since we need to conditionally query the service's exit code
during step inference, adding the exitCode property keeps the
step inference function pure.

See: https://github.com/balena-os/balena-supervisor/pull/2170#discussion_r1226805153
Signed-off-by: Christina Ying Wang <christina@balena.io>
2023-06-19 11:11:26 -07:00
Christina Ying Wang
2537eb8189 Handle the case of 'on-failure' restart policy
As explained in the comments of this commit, a container with the restart policy
of 'on-failure' with a non-zero exit code matches the conditions for the race, so
the Supervisor will also attempt to start it. A container with the 'no' restart
policy that has been started once will not be started again. If a container with
'no' has never been started, its service status will be 'Installed' and the Supervisor
will already try to start it until success, so the service with 'no' doesn't require
special handling.

Signed-off-by: Christina Ying Wang <christina@balena.io>
2023-06-05 11:05:58 -07:00
Christina Ying Wang
7f32141958 Handle Engine-host race condition for "always" and "unless-stopped" restart policy
There exists a race condition between Engine and a host resource that may not
be immediately created. In this race condition, if a container's compose config
depends on the existence of that host resource, such as a network interface, and the
Engine tries to create & start the container before the host resource is created, the
Engine will not reattempt to start the container, regardless of the restart policy.
This is undesirable behavior but seems to be the behavior as implemented by Docker.

To rectify this, the Supervisor state funnel noops for a grace period of 1 minute
after starting a container to see that the container's status has become 'running'.
If the container exits because of the race condition, the status becomes 'exited' and the
Supervisor will attempt to generate another start step. This noop-wait-start step loop
will repeat until the container is able to start.

If the container is never able to start, there was a problem in the host in the creation of the
host resource, and that should be fixed at the host level.

This commit does not handle the case of services with restart policies "no" or "on-failure"
which encounter this host race, as metadata from container inspects needs to be introduced
during step calculation in order to figure out whether services with those restart policies
need to be started. This will be fixed in a future PR.

Change-type: patch
Signed-off-by: Christina Ying Wang <christina@balena.io>
2023-05-31 11:32:19 -07:00
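The noop-wait-start loop from 7f32141958, together with the 'on-failure' case from 2537eb8189, can be sketched as a pure decision function. The `ContainerState` shape, the `startAttemptedAt` field, and the function name are hypothetical; the real step inference works on the Supervisor's own service model.

```typescript
type RestartPolicy = 'no' | 'always' | 'unless-stopped' | 'on-failure';
type Step = 'noop' | 'start';

// Hypothetical subset of container metadata used for the decision.
interface ContainerState {
  status: string; // e.g. 'running', 'exited'
  restartPolicy: RestartPolicy;
  exitCode: number;
  startAttemptedAt: Date; // when the last start was attempted
}

// Decide whether another start step should be generated.
// graceMs defaults to the 1 minute mentioned in the commit message.
function nextStep(c: ContainerState, now: Date, graceMs = 60 * 1000): Step {
  if (c.status === 'running') {
    return 'noop';
  }
  // Inside the grace period: wait and see whether the container comes up.
  if (now.getTime() - c.startAttemptedAt.getTime() < graceMs) {
    return 'noop';
  }
  if (c.status !== 'exited') {
    return 'noop';
  }
  switch (c.restartPolicy) {
    case 'always':
    case 'unless-stopped':
      return 'start';
    case 'on-failure':
      // Only a non-zero exit code matches the conditions for the race.
      return c.exitCode !== 0 ? 'start' : 'noop';
    default:
      // 'no': a container that was started once is not started again.
      return 'noop';
  }
}
```

Keeping the function free of I/O means all inputs, including the exit code, have to be gathered before step inference runs.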
Felipe Lalanne
8656bd62f7 Add arch.sw to the valid container requirements
Change-type: minor
2023-05-09 15:44:26 -04:00
Felipe Lalanne
f1f09e0e27 Allow using slug to validate hw.device-type contract
This also adds the hw.device-type test case to the unit tests.

Change-type: patch
2023-05-09 15:20:18 -04:00
Felipe Lalanne
5fdd689590 Fix service comparison when creating component steps
A bug in service comparison meant that a device already running a
service from a new release with network changes would never stop the
running service, so the remaining services would be stuck in the
`Downloaded` state forever.

This fixes the comparison so the service will get killed in this case,
particularly allowing devices to recover from #1576

Change-type: patch
2023-04-26 11:58:48 -04:00
Felipe Lalanne
7aecaae8b0 Skip updateMetadata step if there are network changes
Previously, an `updateMetadata` step would take precedence over a `kill`
step when network changes were present. This would lead to an
inconsistent state if an update included both a network and a container
change.

Closes: #1576
Change-type: patch
2023-04-25 14:47:00 -04:00
Felipe Lalanne
138aec5de4 Add integration tests for state-engine
These tests use the supervisor API to check that applying a target state
allows the device to eventually get to the desired target configuration.

These are high-level tests that work with real images and containers
using dind.

Change-type: patch
2023-04-25 14:47:00 -04:00
Felipe Lalanne
c1207cbbff Do not pass auth to images with no registry
The supervisor allows the target image to be an image without a
registry (e.g. `alpine:latest`). While this really only happens in
local mode, we don't want to pass credentials to the default registry,
as those credentials are meant for the balena registry and will
otherwise fail.

Change-type: patch
2023-04-25 14:47:00 -04:00
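A minimal sketch of the check in c1207cbbff, assuming the usual Docker convention that the first path component of an image reference is a registry host only when it contains a `.` or `:` or equals `localhost`. The function names and the `RegistryCredentials` type are hypothetical, and the Supervisor may use a different helper to split the reference.

```typescript
interface RegistryCredentials {
  username: string;
  password: string;
}

// Returns the registry host of an image reference, or undefined when the
// reference has none and would resolve to the engine's default registry.
function getRegistry(image: string): string | undefined {
  const slash = image.indexOf('/');
  if (slash === -1) {
    return undefined; // e.g. 'alpine:latest'
  }
  const first = image.slice(0, slash);
  const isHost =
    first.includes('.') || first.includes(':') || first === 'localhost';
  return isHost ? first : undefined; // 'library/alpine' has no registry
}

// Only hand out credentials when the image names a registry explicitly.
function pullAuthFor(
  image: string,
  creds: RegistryCredentials,
): RegistryCredentials | undefined {
  return getRegistry(image) !== undefined ? creds : undefined;
}
```

A fuller version would also compare the host against the registry the credentials were issued for.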
Felipe Lalanne
6c031299d6 Remove safeStateClone function
This function is no longer needed with the latest changes to
getCurrentState

Change-type: patch
2023-04-20 14:58:58 -04:00
Felipe Lalanne
36311ef7a1 Get rid of targetVolatile in app manager
Target volatile doesn't make sense now that we can use the
current state as a target. It wasn't actually being used for anything
anymore apparently

Change-type: patch
2023-04-20 14:58:58 -04:00
Felipe Lalanne
1e0dd381f5 Make pausingApply a private member of device-state
This simplifies this module interface and hides implementation details
from the rest of the code.

The function `applyIntermediateTarget` will now call `pausingApply`
before applying the target

API actions no longer need to call pausing apply

Change-type: patch
2023-04-20 14:58:58 -04:00
Felipe Lalanne
3d43f7e3b3 Simplify doRestart and doPurge actions
The actions now work by passing an intermediate state to the state
engine.

- doPurge first removes the user app from the target state and passes
  that to the state engine for purging. Since intermediate state doesn't
  remove images, this will have the effect of basically re-installing
  the app.

- doRestart modifies the target state by first removing only the
  services from the current state but keeping volumes and networks. This
  has the same effect as before where services were stopped one by one

Change-type: patch
2023-04-20 14:58:58 -04:00
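The two intermediate states described in 3d43f7e3b3 can be illustrated with a heavily simplified state model. The `App` shape and both function names are hypothetical; the real state objects carry far more detail.

```typescript
// Hypothetical, simplified model of an app in the device state.
interface App {
  uuid: string;
  services: string[];
  volumes: string[];
  networks: string[];
}

// Intermediate target for a restart: the app keeps its volumes and
// networks but has no services, so the engine stops them all.
function restartIntermediate(current: App[], appUuid: string): App[] {
  return current.map((app) =>
    app.uuid === appUuid ? { ...app, services: [] } : app,
  );
}

// Intermediate target for a purge: the app is removed from the target.
// Images are not deleted while applying an intermediate state, so
// re-applying the original target effectively re-installs the app.
function purgeIntermediate(target: App[], appUuid: string): App[] {
  return target.filter((app) => app.uuid !== appUuid);
}
```

In both cases the intermediate state would be applied first and the original target afterwards.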
Felipe Lalanne
43630e5267 Fix network appUuid inference in local mode
Local mode uses a numeric `appUuid` which was messing up parsing the
network name. This fixes this issue so the current state can be used
as a target state

Change-type: patch
2023-04-20 14:58:58 -04:00
Felipe Lalanne
b1fc4e1761 Get image name from DB when getting the app current state
The Service class in `compose/service.ts` cannot get the image name
from the image id when building the object from the container metadata.

We query the metadata in the application manager getCurrentApps method
so the current state can be used as target by API methods

Change-type: patch
2023-04-20 14:58:58 -04:00
Felipe Lalanne
27f0d2e655 Improve net alias comparison to prevent unwanted restarts
Network aliases are now compared checking that the target state is a
subset of the current state. This will prevent service restarts due to
additional aliases created by docker in the container.

Closes: #2134
Change-type: patch
2023-04-20 14:58:58 -04:00
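The subset comparison from 27f0d2e655 can be sketched as follows. The function name is hypothetical.

```typescript
// Returns true when every alias requested by the target state is present
// on the running container. Extra aliases on the current side (for
// example those added by docker) do not count as a difference.
function aliasesSatisfied(current: string[], target: string[]): boolean {
  const have = new Set(current);
  return target.every((alias) => have.has(alias));
}
```

With this rule, a container whose aliases are `['main', 'abc123def456']` still matches a target of `['main']`, so no restart is triggered.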
Felipe Lalanne
cb98133717 Exclude containerId from service network aliases
When getting the service from the docker container, remove the
containerId from the list of aliases (which gets added by docker). This
will make it easier to use the current service state as a target.

This will help us remove the `safeStateClone` function in the API in a
future commit

Change-type: patch
2023-04-20 14:58:58 -04:00
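A sketch of the filtering in cb98133717, assuming that the alias docker adds is either the full container ID or its 12-character short form. The function name is hypothetical and the Supervisor's own check may differ.

```typescript
// Drops the alias that docker adds for the container itself, leaving
// only the aliases that came from the service configuration.
function withoutContainerIdAlias(
  aliases: string[],
  containerId: string,
): string[] {
  const shortId = containerId.slice(0, 12); // assumed short ID length
  return aliases.filter(
    (alias) => alias !== containerId && alias !== shortId,
  );
}
```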
Felipe Lalanne
f2ca7dbb6a Skip image delete when applying intermediate state
This replaces the previous flag `isApplyingIntermediate` on application
manager and simplifies the interface of the state engine to make temporary changes to the
general app state.

Change-type: patch
2023-04-20 14:58:58 -04:00
Felipe Lalanne
967cb7747f Make local mode image management work as in cloud mode
There were multiple places in the state engine that skipped some
operations while in local mode. In reality, all that is needed while in
local mode is to skip image and volume deletion.

This commit simplifies application-manager and compose app to be more
local mode agnostic, instead making image deletion and volume
deletion configurable via function arguments.

This also has the benefit of making the treatment of local mode
applications more similar to cloud mode applications, allowing
API endpoints to function the same way in both modes.

Change-type: patch
2023-04-20 14:58:58 -04:00
Felipe Lalanne
76d5be64e5 Remove ignoreImages argument from getRequiredSteps
The argument was unused and hence unnecessary. This is just a bit of
cleanup

Change-type: patch
2023-04-20 14:58:58 -04:00
Felipe Lalanne
7b68ee4c4f Do not restart balena-hostname on rename
The OS since v2.82.6 will monitor changes to config.json and restart
the relevant services to apply the changes. There is no need to trigger
restart of the services via the supervisor. Users on older OS versions
will need to update their OS or restart the services manually, as OS
releases lose support after two years.

Change-type: patch
Closes: #2160
2023-04-20 11:43:35 -04:00
Felipe Lalanne
6764641426 Log uncaught promise exceptions on the app entry
Node 15 [changed the way it treats unhandled promise rejections](https://github.com/nodejs/node/blob/main/doc/changelogs/CHANGELOG_V15.md#throw-on-unhandled-rejections---33021) from a warning to a throw.
For this reason, errors like a corrupt migration directory, which happens when trying to
roll back to a previous supervisor version, were no longer showing a
message but dumping the full minified code into the journal logs.

This PR adds a catch-all in app.ts to log the exception and exit with
a code of 1.

Change-type: patch
2023-04-10 11:18:35 -04:00
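The catch-all described in 6764641426 might look roughly like this. `runEntry` and its injectable `exit` and `log` parameters are hypothetical and exist only to keep the sketch testable; the actual code in app.ts is not shown in the commit message.

```typescript
// Runs the entry point and turns any rejection into a logged error
// followed by exit code 1, instead of an unhandled rejection.
function runEntry(
  main: () => Promise<void>,
  exit: (code: number) => void = (code) => process.exit(code),
  log: (...args: unknown[]) => void = console.error,
): Promise<void> {
  // Promise.resolve().then(...) also captures synchronous throws in main.
  return Promise.resolve()
    .then(main)
    .catch((e: unknown) => {
      log('Uncaught exception', e);
      exit(1);
    });
}
```

At the application entry this would be called as `runEntry(main)`, where `main` is whatever async function starts the application.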
Alexandru Costache
6b67db98e5 backends: Add Jetson Orin NX custom device-tree support
Signed-off-by: Alexandru Costache <alexandru@balena.io>
Change-type: patch
2023-04-07 18:12:31 +03:00
Christina Ying Wang
4c948c8854 Mount data and state partitions on container startup
Signed-off-by: Christina Ying Wang <christina@balena.io>
2023-03-27 12:07:01 -07:00
Christina Ying Wang
49ee1042a8 Mount boot partition into container on Supervisor start
As the Supervisor is a privileged container, it has access to host /dev, and therefore has access
to boot, data, and state balenaOS partitions. This commit sets up the framework for the following:

- Finds the /dev partition that corresponds to each partition based on partition label
- Mounts the partitions into set mountpoints in the device
- Removes reliance on env vars and mountpoints provided by host's start-balena-supervisor script
- Simplifies host path querying by centralizing these queries through methods in lib/host-utils.ts

This particular change updates the env vars for, and mounts, the boot partition.

Since the Supervisor would no longer rely on container `run` arguments provided by a host script,
this change moves Supervisor closer to being able to start itself (Supervisor-as-an-app).

Change-type: minor
Signed-off-by: Christina Ying Wang <christina@balena.io>
2023-03-27 12:07:01 -07:00
Christina Ying Wang
9522c15ecd Change constants imports to remove 'require'
Signed-off-by: Christina Ying Wang <christina@balena.io>
2023-03-27 12:07:01 -07:00
Christina Ying Wang
37371d89dc Add missing log backend field assignment in logger init
Change-type: patch
Signed-off-by: Christina Ying Wang <christina@balena.io>
2023-03-23 14:07:35 -07:00
Christina Ying Wang
36e46d80a6 Use log endpoint subdomain if it exists in config.json
See: https://github.com/balena-io/open-balena-api/pull/1288
Change-type: patch
Signed-off-by: Christina Ying Wang <christina@balena.io>
2023-03-14 12:56:32 -07:00
Felipe Lalanne
f6435814cd Skip pin device step if release was deleted
Preloaded devices can require that the device is pinned to the preloaded
release on provisioning. However, if that release gets
deleted in the future, the device would remain in the "VPN
only" state forever, as the provisioning process could not finish due to
the pinning failure.

This commit changes the behavior so if the release does not exist, the
pinning step is skipped and the device follows the fleet pinning state.

Closes: #2133
Change-type: patch
2023-03-13 10:03:00 -03:00
Christina Ying Wang
84a9e7e9ac Replace BALENA-FIREWALL rule in INPUT chain instead of flushing
The issue with the original Supervisor implementation of the firewall is that
on Supervisor start, the Supervisor flushes the INPUT chain of the filter table.
This doesn't play well with services that add to the INPUT chain on startup that
may start up before the Supervisor, such as certain NetworkManager connection
profiles. This change only replaces the BALENA-FIREWALL rule in the INPUT chain,
preserving the other rules as well as their order.

Closes: #1482
Change-type: patch
Signed-off-by: Christina Ying Wang <christina@balena.io>
2023-03-01 13:42:07 -08:00
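The difference between flushing and replacing, as described in 84a9e7e9ac, can be illustrated on an in-memory model of the INPUT chain. This sketch does not call iptables; the `Rule` shape, the function name, and the choice to append the rule when it is absent are assumptions.

```typescript
// Hypothetical in-memory model of one rule in a chain.
interface Rule {
  target: string; // jump target, e.g. 'BALENA-FIREWALL'
  match: string; // remaining rule specification
}

// Replaces the rule that jumps to the same target, in place, so that the
// other rules and their order are preserved.
function replaceRule(chain: Rule[], rule: Rule): Rule[] {
  const idx = chain.findIndex((r) => r.target === rule.target);
  if (idx === -1) {
    return [...chain, rule]; // assumption: append when not present yet
  }
  return chain.map((r, i) => (i === idx ? rule : r));
}
```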
Pagan Gazzard
d356f979d3 Always lower case the cpu id to avoid bouncing between casing when reporting
Change-type: patch
2023-02-15 13:54:40 +00:00
Felipe Lalanne
89175432af Find and remove duplicate networks
We have occasionally seen devices with duplicated network names. While
we don't know why the networks get duplicated, this can be disruptive
for updates, as trying to create a container referencing a duplicate
network results in a 400 error from the engine.

This commit finds and removes duplicate networks via the state engine.
This means that even if a container references a network that was
later duplicated, the container will be removed first.

While this doesn't solve the problem of duplicate networks being
created in the first place, it will fix the state of the system to
correct the inconsistency.

Change-type: minor
Closes: #590
2023-02-10 20:24:36 -05:00
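Detecting the duplicates mentioned in 89175432af amounts to grouping networks by name. A minimal sketch, with a hypothetical `NetworkInfo` shape and function name:

```typescript
// Hypothetical subset of the fields returned when listing networks.
interface NetworkInfo {
  id: string;
  name: string;
}

// Returns every network whose name is shared with at least one other
// network. Names seen only once are left out.
function findDuplicateNetworks(networks: NetworkInfo[]): NetworkInfo[] {
  const counts = new Map<string, number>();
  for (const n of networks) {
    counts.set(n.name, (counts.get(n.name) ?? 0) + 1);
  }
  return networks.filter((n) => (counts.get(n.name) ?? 0) > 1);
}
```

Per the commit message, containers that reference such a network are removed before the network itself.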