balena-supervisor

mirror of https://github.com/balena-os/balena-supervisor.git synced 2024-12-19 13:47:54 +00:00

Author	SHA1	Message	Date
Christina Ying Wang	6e185fbd44	Don't follow symlinks when checking for lockfiles The Supervisor should only care whether a lockfile exists or not. This also fixes an edge case where a user symlinked a lockfile to a nonexistent file, causing the Supervisor to enter an error loop as it was not able to `stat` the nonexistent file. Change-type: patch Signed-off-by: Christina Ying Wang <christina@balena.io>	2024-04-12 10:34:46 -04:00
Christina Ying Wang	f863075bdc	Add memory usage healthcheck This healthcheck fails when Supervisor memory usage is above a threshold based on initial memory measurements after device state has settled. Change-type: patch Signed-off-by: Christina Ying Wang <christina@balena.io>	2024-04-11 18:16:47 -07:00
Christina Ying Wang	8ac2ce4677	Respect lockOverride when taking locks Signed-off-by: Christina Ying Wang <christina@balena.io>	2024-04-06 00:59:04 -07:00
Christina Ying Wang	fd7d58f89a	Clean up lockfiles on takeLock step failure We don't want any Supervisor lockfiles to remain on the device when a takeLock step fails because this would interfere with the user app. Signed-off-by: Christina Ying Wang <christina@balena.io>	2024-04-04 14:07:47 -07:00
Christina Ying Wang	fb1bd33ab6	Refine update locking interface * Remove Supervisor lockfile cleanup SIGTERM listener * Modify lockfile.getLocksTaken to read files from the filesystem * Remove in-memory tracking of locks taken in favor of filesystem * Require both `(resin-)updates.lock` to be locked with `nobody` UID for service to count as locked by the Supervisor Signed-off-by: Christina Ying Wang <christina@balena.io>	2024-04-04 14:07:47 -07:00
Christina Ying Wang	10f294cf8e	Add takeLock to state funnel A takeLock step should be generated before any of the following steps: * kill * start * stop * updateMetadata * restart * handover ALL services in an app will be locked for any of the above actions, unless the action is generated through Supervisor API's `POST /v2/applications/:appId/(start|stop|restart)-service` endpoints, in which case only the target service will be locked. A lock will be taken for a service before it starts by creating the directory in /tmp before the Engine creates it through bind mounts. Also, the commit simplifies the generation of service kill steps from network/volume changes or removals. Signed-off-by: Christina Ying Wang <christina@balena.io>	2024-04-04 14:07:47 -07:00
Christina Ying Wang	cf8d8cedd7	Simplify lock interface to prep for adding takeLock to state funnel This commit changes a few things: * Pass `force` to `takeLock` step directly. This allows us to remove the `lockFn` used by app manager's action executors, setting takeLock as the main interface to interact with the update lock module. Note that this commit by itself will not pass tests, as no update locking occurs where it once did. This will be amended in the next commit. * Remove locking functions from doRestart & doPurge, as this is the only area where skipLock is required. * Remove `skipLock` interface, as it's redundant with the functionality of `force`. The only time `skipLock` is true is in doRestart/doPurge, as those API methods are already run within a lock function. We removed the lock function which removes the need for skipLock, and in the next commit we'll add locking as a composition step to replace the functionality removed here. * Remove some methods not in use, such as app manager's `stopAll`. Signed-off-by: Christina Ying Wang <christina@balena.io>	2024-04-04 14:07:47 -07:00
Christina Ying Wang	af6359f7ae	Take lock before updating service metadata Change-type: minor Signed-off-by: Christina Ying Wang <christina@balena.io>	2024-04-04 14:07:47 -07:00
Christina Ying Wang	e6df78a22b	Implement takeLock composition step + tests This commit only implements the action that a takeLock step results in. It does not add takeLock step generation logic to the state funnel yet. Signed-off-by: Christina Ying Wang <christina@balena.io>	2024-04-04 14:07:47 -07:00
Christina Ying Wang	f2843e1382	Add update lock release functionality to state funnel releaseLock is a step that will be inferred if there are services in target state, and if some of those services have locks taken by the Supervisor. The releaseLock composition step calls the method of the same name in the updateLock module, which takes the exclusive process lock before disposing all Supervisor lockfiles in the target appId. This is half of the update lock incorporation into the state funnel, as we also need to introduce a takeLock step which triggers during crucial stages of device state transition. Signed-off-by: Christina Ying Wang <christina@balena.io>	2024-04-04 14:07:47 -07:00
Christina Ying Wang	d18a740a40	Add methods for easier checking of lockfile existence Signed-off-by: Christina Ying Wang <christina@balena.io>	2024-04-04 14:07:47 -07:00
Felipe Lalanne	6217546894	Update typescript to v5 This also updates code to use the default import syntax instead of `import * as` when the imported module exposes a default. This is needed with the latest typescript version. Change-type: patch	2024-03-05 15:33:56 -03:00
Felipe Lalanne	988a1c9e9a	Update @balena/lint to v7 This updates balena lint to the latest version to enable eslint support and unblock Typescript updates. This is a huge number of changes as the linting rules are much more strict now, requiring modifying most files in the codebase. This commit also bumps the test dependency `rewire` as that was interfering with the update of balena-lint Change-type: patch	2024-03-01 18:27:30 -03:00
Felipe Lalanne	bda1bac04c	Add support for repeated overlays RPI firmware configuration allows repeating overlays to define configurations on multiple devices. For instance, for configuring multiple `ads` devices, `config.txt` needs to be set up this way ``` dtoverlay=ads1115,addr=0x48 dtoverlay=ads1115,addr=0x49 ``` Before this change, the supervisor would interpret both lines as belonging to the same overlay, preventing users from configuring multiple devices, and leading to a loop when trying to apply configurations with repeated overlays coming from the cloud side. Change-type: minor	2024-02-27 14:52:41 -03:00
Christina Ying Wang	3fd035c5bd	Patch default dtparam handling in config.txt This commit completes the list of default / board-wide dtparams to include some `baudrate` and `vc` i2c params. Change-type: patch Signed-off-by: Christina Ying Wang <christina@balena.io>	2024-02-21 12:45:29 -08:00
Christina Ying Wang	e22253ce6e	Patch config.txt backend to return array configs correctly Previously, getBootConfig() of the config.txt backend was omitting array configurations such as gpio settings, thus resulting in the SV mistakenly assuming that boot config had not been applied, since gpio would not be in current config.txt config but would be in target config. This resulted in SV entering an infinite loop of attempting to apply the gpio config when it wasn't necessary. Change-type: patch Signed-off-by: Christina Ying Wang <christina@balena.io>	2024-02-16 18:12:33 -08:00
Felipe Lalanne	6e6a796da5	Add special case for base DTO params on RPI config While ordering is important in the RPI firmware configuration file (config.txt), some dt params are by default considered part of the base dt overlay if they are not used by other overlays. Unfortunately the [list of dtparams](https://github.com/raspberrypi/firmware/blob/master/boot/overlays/README#L133) is too long to add all of them as exceptions, but we can add the params used in the default config.txt provided in OS images, to avoid reboots when updating to this new supervisor and correctly parsing the provisioning config.txt as variables. While this addition handles most common scenarios, there is still a chance a user may have used other base overlay dt params in the initial config, in which case those will be interpreted according to the relative ordering Change-type: patch	2024-02-08 15:48:10 -03:00
Felipe Lalanne	55a8c5bf90	Add tests for dtoverlay management in config.txt	2024-02-07 20:38:44 -03:00
Christina Ying Wang	3afcef2969	Respect update strategies app-wide instead of at the service level Fixes behavior for release updates which removes a service in current state and adds a new service in target state. Change-type: patch Closes: #2095 Signed-off-by: Christina Ying Wang <christina@balena.io>	2024-01-29 12:26:28 -08:00
Felipe Lalanne	87b195685a	Use the state-helper functions in app module tests	2024-01-29 12:25:55 -08:00
Felipe Lalanne	6ee606806d	Fix docker utils tests for docker v25 From docker 25, the engine will validate IPAM config. This would cause the docker utils test to fail since the network/subnet configuration was incorrect. Change-type: patch	2024-01-25 15:05:12 -03:00
Felipe Lalanne	9bd216327f	Expose ports from port mappings on services PR #2217 removed the expose configuration but also caused a regression where ports set via the `ports` configuration would no longer get exposed to the host, despite portmappings being set. This fixes that issue by exposing only those ports coming from port mappings. Change-type: patch	2023-10-24 15:04:39 -03:00
Felipe Lalanne	416170bc05	Ignore `expose` service compose configuration The docker EXPOSE directive and corresponding docker-compose `expose` service configuration serves as documentation/metadata that a container listens on a certain port that may be used for service discovery but it doesn't have any real impact on the ability for other containers on the same network to access the exposed service via the port. In newer engine implementations, this property may conflict with other network configurations, and prevent the container from being started by the docker engine (see #2211). This PR removes code that would manage the expose property and takes the property out of the whitelist. A composition with the `expose` property will result in the log message `Ignoring unsupported or unknown compose fields: expose`. While this change should not have operational impact, it still removes a previously supported configuration and as such there is a chance of it being a breaking change for some applications. For this reason it is being published as a new major version. Change-type: major Closes: #2211	2023-10-23 11:41:32 -03:00
Pagan Gazzard	c9f032e13a	Switch _.isFunction usage to native versions Change-type: patch	2023-10-16 14:30:25 -03:00
Pagan Gazzard	20df54668c	Switch _.isArray usage to native versions Change-type: patch	2023-10-16 14:30:25 -03:00
Felipe Lalanne	3e828dcc52	Revert "Do not expose ports from image if service network mode" This reverts commit `0c7bad7792`, as that change causes a service restart loop. The supervisor cannot distinguish between ports exposed via the `EXPOSE` directive and the docker-compose `expose` property. Because of this, in the case of `network-mode: service:<...>` the current state and target state never match, leading to a service restart loop. Change-type: patch	2023-10-16 13:06:50 -03:00
Pagan Gazzard	766cce89c7	Convert multiple bluebird uses to native promises Change-type: patch	2023-10-16 11:40:45 +01:00
Felipe Lalanne	0c7bad7792	Do not expose ports from image if service network mode The supervisor exposes ports configured using the `EXPOSE` directive in the dockerfile when configuring the container for runtime. This can cause issues if using `network_mode: service:<service name>` as the expose configuration is not compatible with that network mode. This fix now skips image exposed ports for that particular network mode. Change-type: patch Relates-to: #2211	2023-10-12 18:03:42 -03:00
Pagan Gazzard	894bdeeeb6	Remove unused docker logs logging code Change-type: patch	2023-10-11 14:20:33 +01:00
Christina Ying Wang	bc1d251e66	Revert os-release path to /mnt/root /mnt/boot/os-release isn't always accurate so /mnt/root should be the source of truth. Change-type: patch Signed-off-by: Christina Ying Wang <christina@balena.io>	2023-10-09 14:02:02 -07:00
Felipe Lalanne	327dc31ef0	Replace node-dbus with @balena/systemd The node-dbus module is unmaintained and a blocker for the update to Node 18. Switching to our own node bindings for systemd solves this issue Relates-to: Shouqun/node-dbus#241 Change-type: patch	2023-08-16 15:58:52 -04:00
Felipe Lalanne	8f17c30de6	Replace dbus test service with mock-systemd-bus This avoids unnecessary mocking and tests against the real systemd API Change-type: patch	2023-08-16 14:46:58 -04:00
Alexandru Costache	512240c544	backends: Add Jetson Orin NANO custom device-tree support Signed-off-by: Alexandru Costache <alexandru@balena.io> Change-type: patch	2023-07-11 18:11:32 +03:00
Christina Ying Wang	38fe8dae75	Remove the 'Stopped' status for services It's not an official status from container inspects, and the Supervisor doesn't set it internally anywhere. It's better to remove it entirely as the method by which Supervisor sets internal service statuses is by using a global event emitter (reportNewStatus) which makes things difficult to test. Change-type: patch Signed-off-by: Christina Ying Wang <christina@balena.io>	2023-06-28 11:17:13 -04:00
Christina W	71d24d6e33	Parse container exit error message instead of status The previous implementation in #2170 of parsing the container status was too general, because it relied on the mistaken assumption that a container would have a status of `Stopped` if it was manually stopped. This turned out to be untrue, as manually stopped containers were also getting restarted by the Supervisor due to their inspect status of `exited`. With this, parsing the exit message became unavoidable as there are no other clear ways to discern a container that has been manually stopped and shouldn't be started from a container experiencing the Engine-host race condition issue (again, see #2170). Since we're just parsing the exit error message, we don't need to worry about different behaviors amongst restart policies, as any container with the error message on exit should be started. Change-type: patch Closes: #2178 Signed-off-by: Christina Ying Wang <christina@balena.io>	2023-06-22 14:43:17 -07:00
Christina Ying Wang	7eba48f8b8	Improve tests surrounding Engine-host race patch See: #2170 Change-type: patch Signed-off-by: Christina Ying Wang <christina@balena.io>	2023-06-19 11:11:26 -07:00
Christina Ying Wang	9e249e6ae8	Remove unnecessary async/await from method Signed-off-by: Christina Ying Wang <christina@balena.io>	2023-06-19 11:11:26 -07:00
Christina Ying Wang	6e6f79c71d	Decrease wait time before start from 60s to 30s 60 seconds to wait may be excessively long. Signed-off-by: Christina Ying Wang <christina@balena.io>	2023-06-19 11:11:26 -07:00
Christina Ying Wang	ace642ea0f	Improve naming of a util function & add unit test isOlderThan -> isValidDateAndOlderThan See: https://github.com/balena-os/balena-supervisor/pull/2170#discussion_r1226809686 Signed-off-by: Christina Ying Wang <christina@balena.io>	2023-06-19 11:11:26 -07:00
Christina Ying Wang	2537eb8189	Handle the case of 'on-failure' restart policy As explained in the comments of this commit, a container with the restart policy of 'on-failure' with a non-zero exit code matches the conditions for the race, so the Supervisor will also attempt to start it. A container with the 'no' restart policy that has been started once will not be started again. If a container with 'no' has never been started, its service status will be 'Installed' and the Supervisor will already try to start it until success, so the service with 'no' doesn't require special handling. Signed-off-by: Christina Ying Wang <christina@balena.io>	2023-06-05 11:05:58 -07:00
Christina Ying Wang	95f3e13d50	Add extra delay after state engine integration tests This ensures target state has settled (since it seems that the 'applied' status that's reported isn't 100% accurate and the actual Engine state may lag behind slightly) Signed-off-by: Christina Ying Wang <christina@balena.io>	2023-05-31 11:33:27 -07:00
Christina Ying Wang	7f32141958	Handle Engine-host race condition for "always" and "unless-stopped" restart policy There exists a race condition between Engine and a host resource that may not be immediately created. In this race condition, if a container's compose config depends on the existence of that host resource, such as a network interface, and the Engine tries to create & start the container before the host resource is created, the Engine will not reattempt to start the container, regardless of the restart policy. This is undesirable behavior but seems to be the behavior as implemented by Docker. To rectify this, the Supervisor state funnel noops for a grace period of 1 minute after starting a container to see that the container's status has become 'running'. If the container exits because of the race condition, the status becomes 'exited' and the Supervisor will attempt to generate another start step. This noop-wait-start step loop will repeat until the container is able to start. If the container is never able to start, there was a problem in the host in the creation of the host resource, and that should be fixed at the host level. This commit does not handle the case of services with restart policies "no" or "on-failure" which encounter this host race, as metadata from container inspects needs to be introduced during step calculation in order to figure out whether services with those restart policies need to be started. This will be fixed in a future PR. Change-type: patch Signed-off-by: Christina Ying Wang <christina@balena.io>	2023-05-31 11:32:19 -07:00
Felipe Lalanne	2758e190b2	Fix `sw.arch` typo when testing contracts Change-type: patch	2023-05-11 13:07:26 -04:00
Felipe Lalanne	8656bd62f7	Add `arch.sw` to the valid container requirements Change-type: minor	2023-05-09 15:44:26 -04:00
Felipe Lalanne	f1f09e0e27	Allow using slug to validate hw.device-type contract This also adds the hw.device-type test case to the unit tests. Change-type: patch	2023-05-09 15:20:18 -04:00
Felipe Lalanne	a884a58b4c	Simplify and move lib/contract.spec.ts to tests/unit Improve contract tests to remove dependence on stubs and unnecessary system calls. Change-type: patch	2023-05-09 15:20:12 -04:00
Felipe Lalanne	7b8b187c74	Create tests with recovery from #1576 Devices affected by the bug described in #1576 are also stuck with some services in the `Downloaded` state, because the state engine does not detect that the running services should be killed on a network change even if they belong to a new release. This is a bug, which can be replicated by the tests in this commit Change-type: patch	2023-04-26 11:58:42 -04:00
Felipe Lalanne	0a358a4463	Add replication of issue using unit tests Change-type: patch	2023-04-25 14:47:00 -04:00
Felipe Lalanne	138aec5de4	Add integration tests for state-engine These tests use the supervisor API to check that applying a target state allows the device to eventually get to the desired target configuration. These are high-level tests that work with real images and containers using dind. Change-type: patch	2023-04-25 14:47:00 -04:00
Felipe Lalanne	3d43f7e3b3	Simplify doRestart and doPurge actions The actions now work by passing an intermediate state to the state engine. - doPurge first removes the user app from the target state and passes that to the state engine for purging. Since intermediate state doesn't remove images, this will have the effect of basically re-installing the app. - doRestart modifies the target state by first removing only the services from the current state but keeping volumes and networks. This has the same effect as before where services were stopped one by one Change-type: patch	2023-04-20 14:58:58 -04:00
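
Commit 6e185fbd44 in the log above describes checking for lockfiles without following symlinks, so that a lockfile symlinked to a nonexistent file still counts as present. The following is a minimal sketch of that idea using Node's `lstat`, not the Supervisor's actual code; `lockfileExists` and `demoDanglingLockfile` are hypothetical names introduced only for illustration.

```typescript
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

// Hypothetical helper: reports whether a directory entry exists at lockPath.
// lstat inspects the entry itself, so a dangling symlink still counts.
function lockfileExists(lockPath: string): boolean {
	try {
		fs.lstatSync(lockPath);
		return true;
	} catch (err) {
		// ENOENT means there is no entry at all; anything else is unexpected.
		if ((err as NodeJS.ErrnoException).code === 'ENOENT') {
			return false;
		}
		throw err;
	}
}

// Hypothetical demo: a lockfile symlinked to a target that does not exist.
function demoDanglingLockfile(): { viaLstat: boolean; viaStat: boolean } {
	const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'lock-demo-'));
	const lock = path.join(dir, 'updates.lock');
	try {
		fs.symlinkSync(path.join(dir, 'does-not-exist'), lock);
		return {
			viaLstat: lockfileExists(lock),
			// existsSync follows the link and so looks at the missing target.
			viaStat: fs.existsSync(lock),
		};
	} finally {
		fs.rmSync(dir, { recursive: true, force: true });
	}
}
```

In this sketch the `lstat`-based check treats the dangling symlink as an existing lockfile, while a check that follows the link does not see it.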


453 Commits
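
Commit bda1bac04c in the log above explains that repeated `dtoverlay` lines in `config.txt` must be treated as separate overlays rather than merged into one. The sketch below illustrates that reading with a simplified parser; it is an assumption-laden illustration rather than the Supervisor's config.txt backend, and `Overlay` and `parseOverlays` are hypothetical names. It assumes that a `dtparam` line applies to the most recent `dtoverlay` line and ignores base-overlay params.

```typescript
// Hypothetical shape for one overlay entry and its comma-separated params.
interface Overlay {
	name: string;
	params: string[];
}

// Simplified parser: every dtoverlay line starts a new entry, even when the
// overlay name repeats, so two ads1115 lines yield two separate overlays.
function parseOverlays(configTxt: string): Overlay[] {
	const overlays: Overlay[] = [];
	let current: Overlay | undefined;
	for (const raw of configTxt.split('\n')) {
		const line = raw.trim();
		if (line.startsWith('dtoverlay=')) {
			const [name, ...params] = line.slice('dtoverlay='.length).split(',');
			if (name) {
				current = { name, params };
				overlays.push(current);
			}
		} else if (line.startsWith('dtparam=') && current != null) {
			// Assumption: params attach to the overlay declared just before them.
			current.params.push(...line.slice('dtparam='.length).split(','));
		}
	}
	return overlays;
}

// The example from the commit message, parsed as two distinct overlays.
const parsedExample = parseOverlays(
	'dtoverlay=ads1115,addr=0x48\ndtoverlay=ads1115,addr=0x49',
);
```

Keeping an ordered list instead of a map keyed by overlay name is what lets the two `ads1115` entries coexist.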