Init supports boolean values, and is not included in the config when
not defined.
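A minimal sketch of the described behavior (the helper and the mapping to Docker's `HostConfig.Init` are illustrative assumptions, not the actual supervisor code):
```
// Illustrative only: include Init in the generated config solely when the
// compose-level `init` value was explicitly defined.
function initHostConfig(init?: boolean): { Init?: boolean } {
  return init === undefined ? {} : { Init: init };
}

// initHostConfig()      -> {}
// initHostConfig(true)  -> { Init: true }
// initHostConfig(false) -> { Init: false }
```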
Change-type: patch
Signed-off-by: Christina Ying Wang <christina@balena.io>
This moves from throwing an error when an app is rejected due to unmet
requirements (because of contracts) to storing the target with a
`rejected` flag in the database.
The application manager filters rejected apps when calculating steps to
prevent them from affecting the current state. The state engine uses the
rejection info to generate the state report.
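A rough sketch of the filtering idea (types and names are illustrative, not the actual implementation):
```
// Rejected targets stay in the database for state reporting but are
// excluded from step calculation.
interface TargetApp {
  appUuid: string;
  rejected: boolean;
}

function appsForStepCalculation(targetApps: TargetApp[]): TargetApp[] {
  return targetApps.filter((app) => !app.rejected);
}
```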
Change-type: minor
This makes the LogBackend `log` method async in preparation for
upcoming changes that will use backpressure from the connection to delay
logging coming from containers.
This also removes the unnecessary imageId from the LogMessage type.
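A hedged sketch of the new method shape (the LogMessage fields shown are illustrative):
```
// log() now returns a Promise so callers can await it and the backend can
// exert backpressure on container log streams.
interface LogMessage {
  message: string;
  timestamp: number;
  isSystem?: boolean;
}

abstract class LogBackend {
  public abstract log(message: LogMessage): Promise<void>;
}
```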
Change-type: patch
This removes the supervisor's dependence on the containerLogs database
for remembering the last sent timestamp. This commit instead uses the
supervisor startup time as the initial time for log retrieval.
This might result in some missing logs for services that start before
the supervisor after a boot, or if the supervisor restarts.
However, this seems like an acceptable trade-off, as the current
implementation seems to make things worse in resource-constrained
environments.
We'll move storing the last sent timestamp to a better storage medium in
a future commit.
Change-type: minor
This fixes a regression in the supervisor state engine computation
(introduced in v16.2.0) that occurs when the target state removes a
network at the same time that a service referencing that network is
changed. For example, going from
```
services:
  one:
    image: alpine:3.18
    networks: ['balena']
networks:
  balena:
```
to
```
services:
  one:
    image: alpine:latest
```
would never reach the target state, as killing the service in order to
remove the network is prioritized, but one of the invariants in the
target state calculation is to not kill any services until all images
have been downloaded. These two instructions were in contradiction,
leading to a deadlock.
The fix only adds removal steps for services that depend on a changing
network or volume when the service container is not already being
removed.
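An illustrative sketch of that rule (types and names are assumptions, not the actual implementation):
```
interface ServiceRef {
  serviceName: string;
}

// Only generate a kill step for a dependent service if the update isn't
// already removing that service's container.
function removalStepsForDependents(
  dependents: ServiceRef[],
  alreadyBeingRemoved: Set<string>,
): Array<{ action: 'kill'; service: ServiceRef }> {
  return dependents
    .filter((svc) => !alreadyBeingRemoved.has(svc.serviceName))
    .map((svc) => ({ action: 'kill' as const, service: svc }));
}
```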
Change-type: patch
This reduces circular dependencies from 250 to 80 by ensuring that
modules that only require types do not import the full module with all
its dependencies.
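An example of the pattern (module path and fields are illustrative):
```
// `import type` is erased at compile time, so this module no longer pulls
// in the service module's runtime dependencies.
import type { Service } from './compose/service';

export function getServiceName(svc: Service): string {
  return svc.serviceName;
}
```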
Change-type: patch
This splits `App`, `Network`, `Service` and `Volume`, which used to be
defined as classes, into an exported interface and a class
implementation that is not exported. This allows working with just the
types in some cases and prevents circular dependencies when importing.
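A sketch of the split (fields are illustrative):
```
// The interface is exported for typing; the implementing class stays
// private to the module, breaking the import cycle for type-only users.
export interface Network {
  name: string;
  appUuid: string;
}

class NetworkImpl implements Network {
  constructor(
    public name: string,
    public appUuid: string,
  ) {}
}

export function createNetwork(name: string, appUuid: string): Network {
  return new NetworkImpl(name, appUuid);
}
```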
Change-type: patch
This bumps dockerode, removes resin-docker-build in favor of
@balena/compose, and updates docker-delta and docker-progress packages.
Change-type: patch
* Remove Supervisor lockfile cleanup SIGTERM listener
* Modify lockfile.getLocksTaken to read files from the filesystem
* Remove in-memory tracking of locks taken in favor of filesystem
* Require both `(resin-)updates.lock` to be locked with the `nobody` UID
for a service to count as locked by the Supervisor (see the sketch below)
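A minimal sketch of the filesystem-based check, assuming a per-service lock directory and the conventional `nobody` UID of 65534:
```
import { promises as fs } from 'fs';

const NOBODY_UID = 65534; // assumption for illustration

// A service only counts as locked by the Supervisor when both lockfiles
// exist and are owned by `nobody`.
async function isLockedBySupervisor(lockDir: string): Promise<boolean> {
  const stats = await Promise.all(
    ['updates.lock', 'resin-updates.lock'].map((f) =>
      fs.stat(`${lockDir}/${f}`).catch(() => null),
    ),
  );
  return stats.every((s) => s != null && s.uid === NOBODY_UID);
}
```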
Signed-off-by: Christina Ying Wang <christina@balena.io>
A takeLock step should be generated before any of the following steps:
* kill
* start
* stop
* updateMetadata
* restart
* handover
ALL services in an app will be locked for any of the above actions,
unless the action is generated through Supervisor API's
`POST /v2/applications/:appId/(start|stop|restart)-service` endpoints,
in which case only the target service will be locked.
A lock will be taken for a service before it starts by creating the
directory in /tmp before the Engine creates it through bind mounts.
Also, the commit simplifies the generation of service kill
steps from network/volume changes or removals.
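An illustrative sketch of the pre-start lock directory creation (the path layout is an assumption):
```
import { promises as fs } from 'fs';

// Create the per-service lock directory so it already exists when the
// Engine bind-mounts it into the container.
async function ensureLockDir(
  appId: number,
  serviceName: string,
): Promise<void> {
  await fs.mkdir(`/tmp/balena-supervisor/services/${appId}/${serviceName}`, {
    recursive: true,
  });
}
```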
Signed-off-by: Christina Ying Wang <christina@balena.io>
This commit changes a few things:
* Pass `force` to `takeLock` step directly. This allows us to remove
the `lockFn` used by app manager's action executors, setting takeLock
as the main interface to interact with the update lock module. Note
that this commit by itself will not pass tests, as no update locking
occurs where it once did. This will be amended in the next commit.
* Remove locking functions from doRestart & doPurge, as this is
the only area where skipLock is required.
* Remove `skipLock` interface, as it's redundant with the functionality
of `force`. The only time `skipLock` is true is in doRestart/doPurge,
as those API methods are already run within a lock function. We removed
the lock function which removes the need for skipLock, and in the next
commit we'll add locking as a composition step to replace the
functionality removed here.
* Remove some methods not in use, such as app manager's `stopAll`.
Signed-off-by: Christina Ying Wang <christina@balena.io>
This commit only implements the action that a takeLock step
results in. It does not add takeLock step generation logic
to the state funnel yet.
Signed-off-by: Christina Ying Wang <christina@balena.io>
releaseLock is a step that will be inferred if there are services in
the target state, and if some of those services have locks taken by the
Supervisor.
The releaseLock composition step calls the method of the same name in
the updateLock module, which takes the exclusive process lock before
disposing of all Supervisor lockfiles for the target appId.
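A hedged sketch of the inference rule (types are illustrative):
```
interface LockState {
  appId: number;
  servicesWithLocks: string[];
}

// Emit a releaseLock step only when services exist in the target state and
// the Supervisor still holds locks for the app.
function inferReleaseLock(
  targetServiceCount: number,
  locks: LockState[],
): Array<{ action: 'releaseLock'; appId: number }> {
  if (targetServiceCount === 0) {
    return [];
  }
  return locks
    .filter((l) => l.servicesWithLocks.length > 0)
    .map((l) => ({ action: 'releaseLock' as const, appId: l.appId }));
}
```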
This is half of the update lock incorporation into the state funnel, as
we also need to introduce a takeLock step which triggers during crucial
stages of device state transition.
Signed-off-by: Christina Ying Wang <christina@balena.io>
This also updates code to use the default import syntax instead of
`import * as` when the imported module exposes a default. This is needed
with the latest TypeScript version.
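An example of the syntax change (the module is illustrative):
```
// Before:
// import * as express from 'express';

// After (default import, required by the newer TypeScript setup):
import express from 'express';

const app = express();
```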
Change-type: patch
This updates balena-lint to the latest version to enable ESLint support
and unblock TypeScript updates. This results in a large number of
changes, as the linting rules are much stricter now, requiring
modifications to most files in the codebase. This commit also bumps the
test dependency `rewire`, as that was interfering with the update of
balena-lint.
Fixes behavior for release updates that remove a service in the current
state and add a new service in the target state.
Change-type: patch
Closes: #2095
Signed-off-by: Christina Ying Wang <christina@balena.io>
The `updateMetadata` step renames the container to match the target
release when the service doesn't change between releases. We have seen
this step fail because of an engine bug that seems to relate to the
engine keeping stale references after container restarts. The only way
around this issue is to remove the old container and create it again.
This implements that workaround during the updateMetadata step to deal
with that issue.
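A hedged sketch of the workaround using dockerode (not the actual supervisor code; only a subset of the container config is carried over here):
```
import Docker from 'dockerode';

// Instead of renaming the container, remove it and create a fresh one
// with the target name, copying over the relevant configuration.
async function recreateWithTargetName(
  docker: Docker,
  containerId: string,
  targetName: string,
): Promise<void> {
  const container = docker.getContainer(containerId);
  const info = await container.inspect();
  await container.remove({ force: true });
  const created = await docker.createContainer({
    name: targetName,
    Image: info.Config.Image,
    Cmd: info.Config.Cmd,
    Env: info.Config.Env,
    Labels: info.Config.Labels,
  });
  await created.start();
}
```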
Change-type: minor
Relates-to: balena-os/balena-engine#261
PR #2217 removed the expose configuration but also caused a regression
where ports set via the `ports` configuration would no longer get
exposed to the host, despite port mappings being set. This fixes that
issue by exposing only those ports coming from the port mappings.
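An illustrative sketch of deriving the exposed ports only from the port mappings (the shapes are simplified):
```
// Keys of the port bindings look like '80/tcp'; only those become
// ExposedPorts entries, rather than anything from the image's EXPOSE.
function exposedPortsFromMappings(
  portBindings: Record<string, Array<{ HostPort: string }>>,
): Record<string, {}> {
  const exposed: Record<string, {}> = {};
  for (const portAndProto of Object.keys(portBindings)) {
    exposed[portAndProto] = {};
  }
  return exposed;
}
```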
Change-type: patch
The Docker EXPOSE directive and the corresponding docker-compose
`expose` service configuration serve as documentation/metadata
indicating that a container listens on a certain port, which may be used
for service discovery, but they have no real impact on the ability of
other containers on the same network to access the exposed service via
that port. In newer engine implementations, this property may conflict
with other network configurations and prevent the container from being
started by the Docker engine (see #2211).
This PR removes code that would manage the expose property and takes the
property out of the whitelist. A composition with the `expose` property
will result in the log message `Ignoring unsupported or unknown compose fields: expose`.
While this change should not have operational impact, it still removes
a previously supported configuration and as such there is a chance of it
being a breaking change for some applications. For this reason it is
being published as a new major version.
Change-type: major
Closes: #2211
This reverts commit 0c7bad779291e15e419166a2c66c2a21dd06aa83, as that
change causes a service restart loop. The supervisor cannot distinguish
between ports exposed via the `EXPOSE` directive and the docker-compose
`expose` property. Because of this, in the case of `network_mode:
service:<...>` the current state and target state never match, leading
to a service restart loop.
Change-type: patch
The supervisor exposes ports configured using the `EXPOSE` directive in
the Dockerfile when configuring the container for runtime. This can
cause issues when using `network_mode: service:<service name>`, as the
expose configuration is not compatible with that network mode. This fix
skips image-exposed ports for that particular network mode.
Change-type: patch
Relates-to: #2211
Memory tests have shown performance improvements from using the native method.
Change-type: patch
Signed-off-by: Christina Ying Wang <christina@balena.io>
It's not an official status from container inspects, and the Supervisor
doesn't set it internally anywhere. It's better to remove it entirely,
as the method by which the Supervisor sets internal service statuses is
a global event emitter (reportNewStatus), which makes things difficult
to test.
Change-type: patch
Signed-off-by: Christina Ying Wang <christina@balena.io>
The previous implementation in #2170 of parsing the container status was too general,
because it relied on the mistaken assumption that a container would have a status of
`Stopped` if it was manually stopped. This turned out to be untrue, as manually stopped
containers were also getting restarted by the Supervisor due to their inspect status of
`exited`. With this, parsing the exit message became unavoidable as there are no other
clear ways to discern a container that has been manually stopped and shouldn't be started
from a container experiencing the Engine-host race condition issue (again, see #2170).
Since we're just parsing the exit error message, we don't need to worry about different behaviors
amongst restart policies, as any container with the error message on exit should be started.
Change-type: patch
Closes: #2178
Signed-off-by: Christina Ying Wang <christina@balena.io>
As explained in the comments of this commit, a container with the restart policy
of 'on-failure' with a non-zero exit code matches the conditions for the race, so
the Supervisor will also attempt to start it. A container with the 'no' restart
policy that has been started once will not be started again. If a container with
'no' has never been started, its service status will be 'Installed' and the Supervisor
will already try to start it until success, so the service with 'no' doesn't require
special handling.
Signed-off-by: Christina Ying Wang <christina@balena.io>
There exists a race condition between the Engine and a host resource that may not
be immediately created. In this race condition, if a container's compose config
depends on the existence of that host resource, such as a network interface, and the
Engine tries to create & start the container before the host resource is created, the
Engine will not reattempt to start the container, regardless of the restart policy.
This is undesirable behavior, but it seems to be the behavior as implemented by Docker.
To rectify this, the Supervisor state funnel noops for a grace period of 1 minute
after starting a container, waiting for the container's status to become 'running'.
If the container exits because of the race condition, the status becomes 'exited' and the
Supervisor will attempt to generate another start step. This noop-wait-start step loop
will repeat until the container is able to start.
If the container is never able to start, then there was a problem with the host's
creation of the host resource, and that should be fixed at the host level.
This commit does not handle the case of services with restart policies "no" or "on-failure"
which encounter this host race, as metadata from container inspects needs to be introduced
during step calculation in order to figure out whether services with those restart policies
need to be started. This will be fixed in a future PR.
Change-type: patch
Signed-off-by: Christina Ying Wang <christina@balena.io>
A bug in service comparison meant that a device already running a
service from a new release with network changes would never stop the
running service, so the remaining services would get stuck forever in
the `Downloaded` state.
This fixes the comparison so the service gets killed in this case, in
particular allowing devices to recover from #1576.
Change-type: patch
Previously, an `updateMetadata` step would take precedence over a
`kill` step when network changes were present. This could lead to an
inconsistent state if an update included both a network and a container
change.
Closes: #1576
Change-type: patch
Target volatile no longer makes sense now that we can use the current
state as a target, and it apparently wasn't being used for anything
anymore.
Change-type: patch
The actions now work by passing an intermediate state to the state
engine.
- doPurge first removes the user app from the target state and passes
that to the state engine for purging. Since the intermediate state
doesn't remove images, this has the effect of essentially re-installing
the app.
- doRestart modifies the target state by removing only the services from
the current state while keeping volumes and networks. This has the same
effect as before, where services were stopped one by one.
Change-type: patch
Local mode uses a numeric `appUuid`, which was breaking the parsing of
the network name. This fixes the issue so that the current state can be
used as a target state.
Change-type: patch