1671 Commits

Author SHA1 Message Date
8ffdba7d18 Remove memory healthcheck
Supervisor has had memory leaks removed since v16.5.1, with latest tested
version being v16.7.1. Furthermore, on recent reported instances of memory healthcheck
triggering on support, we've snapshotted the heap before & after on devices multiple times
without finding any evidence of memory leaks in the snapshots.

Therefore, it's hypothesized that the heuristic for determining starting memory may
be flawed in that it's not waiting long enough after system startup, or it may
run right after garbage collection has happened. Because of the variability and
difficulty of ascertaining these factors, we suspect an inaccurate memory
baseline may be the cause of the instances of false positives on support.

See: https://balena.zulipchat.com/#narrow/channel/403752-channel.2Fsupport-help/topic/supervisor.20memory.20usage.20above.20threadhold/near/520640885
Change-type: patch
Signed-off-by: Christina Ying Wang <christina@balena.io>
2025-06-06 11:44:15 -07:00
8c69166271 Support target state apply cancellation
The current target state apply is cancelled when either:
- /v1/update is called with cancel: true
- A different target state is received from the cloud (with a non-304 status)

Following apply cancellation, a target state apply is re-triggered. This ensures
that the user can force a device out of a dead-locked situation where a long-running
task such as an image fetch fails to cede control back to the Supervisor, which is
the behavior observed in an Engine bug with infinite pull retries with a bad network.

Change-type: minor
Signed-off-by: Christina Ying Wang <christina@balena.io>
2025-05-28 07:46:10 -07:00
0af915d815 Pass AbortSignal to image pull functions
When abortController.abort() is called, this signal is passed down
to the functions that interface with Docker Engine for image pulls,
cancelling those pulls.

The next commit will limit when abortController.abort() is called.
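
A minimal sketch of how a signal can end an otherwise endless retry loop. The `pullWithRetries` helper, its `tryPull` callback, and the `PullResult` type are hypothetical and not taken from the Supervisor codebase; only `AbortController` and `AbortSignal` are standard APIs.

```typescript
// Hypothetical result type for this sketch; not a Supervisor type.
type PullResult = 'pulled' | 'cancelled' | 'gave-up';

// Retry a pull until it succeeds, the signal is aborted, or maxAttempts is hit.
// `tryPull` stands in for the call into Docker Engine and returns true on success.
function pullWithRetries(
  signal: AbortSignal,
  tryPull: (attempt: number) => boolean,
  maxAttempts: number,
): PullResult {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Checking the signal before each attempt is what lets a caller break the loop.
    if (signal.aborted) {
      return 'cancelled';
    }
    if (tryPull(attempt)) {
      return 'pulled';
    }
  }
  return 'gave-up';
}
```

A caller would hold the matching `AbortController` and call `abort()` when the apply is cancelled.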

Signed-off-by: Christina Ying Wang <christina@balena.io>
2025-05-27 11:09:19 -07:00
49e91a2639 Exclude reclaimable slab memory from used memory metric
Aligns metric value with used memory reported by the free and htop
utilities.
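
As an illustration only, the sketch below computes used memory from `/proc/meminfo`-style text. The formula (total minus free, buffers, page cache, and reclaimable slab) is an assumption about what "used" means here. The exact calculation in `free` and `htop` varies by version, and the function names are hypothetical.

```typescript
// Parse "/proc/meminfo"-style text into a map of field name -> value in kB.
function parseMeminfo(text: string): Map<string, number> {
  const fields = new Map<string, number>();
  for (const line of text.split('\n')) {
    const match = /^(\w+):\s+(\d+)/.exec(line);
    if (match) {
      fields.set(match[1], Number(match[2]));
    }
  }
  return fields;
}

// Assumed formula: buffers, page cache and reclaimable slab do not count as used.
function usedMemoryKb(text: string): number {
  const m = parseMeminfo(text);
  const get = (name: string): number => m.get(name) ?? 0;
  return (
    get('MemTotal') -
    get('MemFree') -
    get('Buffers') -
    get('Cached') -
    get('SReclaimable')
  );
}
```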

Change-type: patch
2025-05-25 11:51:06 -04:00
4318272844 Remove unsupported fields from contract requirements
A contract including extra requirement fields, such as "name" would fail
validation. This PR removes any extra fields from the validated contract
to prevent services with these extra fields from getting rejected by the
contract validation.
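
A rough sketch of the idea, assuming a hypothetical list of supported requirement fields. The real list, and the names `SUPPORTED_FIELDS` and `stripUnsupportedFields`, are not taken from the Supervisor source.

```typescript
// Hypothetical set of requirement fields the validator understands.
const SUPPORTED_FIELDS = ['type', 'slug', 'version'] as const;

type Requirement = Record<string, unknown>;

// Return a copy of the requirement that contains only supported fields.
function stripUnsupportedFields(requirement: Requirement): Requirement {
  const cleaned: Requirement = {};
  for (const field of SUPPORTED_FIELDS) {
    if (field in requirement) {
      cleaned[field] = requirement[field];
    }
  }
  return cleaned;
}
```

With this, a requirement carrying an extra `name` field is validated as if the field were absent.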

Change-type: patch
2025-05-15 17:38:03 -04:00
7c83eaef80 Simplify contract validation module
Use `satisfiesChildContract` instead of Blueprints as the previous
implementation did.

Change-type: patch
2025-05-08 19:33:12 -04:00
d475b1d830 Fix search for app leftover locks
The leftover locks search was creating an array rather than an object
keyed by the appId. This could affect the lock cleanup and make leftover
locks from one app affect the install of the app in local mode.
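
The difference can be sketched as follows. The `LeftoverLock` shape and the `groupLocksByApp` helper are hypothetical and only illustrate why keying by appId keeps each app's locks separate.

```typescript
// Hypothetical shape of a lock found on disk.
interface LeftoverLock {
  appId: number;
  path: string;
}

// Group lock paths into an object keyed by appId, so that a lookup for one
// app never returns another app's locks.
function groupLocksByApp(locks: LeftoverLock[]): Record<number, string[]> {
  const byApp: Record<number, string[]> = {};
  for (const lock of locks) {
    const paths = byApp[lock.appId] ?? [];
    paths.push(lock.path);
    byApp[lock.appId] = paths;
  }
  return byApp;
}
```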

Change-type: patch
2025-04-01 17:56:06 -03:00
b596c77ce2 Add Docker network label if custom ipam config
In a target release where the only change is the addition or removal
of a custom ipam config, the Supervisor does not recreate the network
due to ignoring ipam config differences when comparing current and target
network (in network.isEqualConfig). This commit implements the addition of
a network label if the target compose object includes a network with custom
ipam. With the label, the Supervisor will detect a difference between a
network with a custom ipam and a network without, without needing to compare
the ipam configs themselves.
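
A simplified sketch of the approach. The label key, the `NetworkConfig` shape, and the helper names are assumptions made for illustration. The `isEqualConfig` below is a stand-in, not the Supervisor's `network.isEqualConfig`.

```typescript
// Hypothetical label key, used only for this sketch.
const IPAM_LABEL = 'example.custom-ipam';

interface NetworkConfig {
  driver: string;
  labels: Record<string, string>;
  ipam?: { subnet: string };
}

// Add the marker label when the target network defines a custom ipam config.
function withIpamLabel(network: NetworkConfig): NetworkConfig {
  if (network.ipam === undefined) {
    return network;
  }
  return { ...network, labels: { ...network.labels, [IPAM_LABEL]: 'true' } };
}

// Serialize labels in key order so that two label sets can be compared.
function labelsKey(labels: Record<string, string>): string {
  const sorted = Object.entries(labels).sort(([a], [b]) => a.localeCompare(b));
  return JSON.stringify(sorted);
}

// Compare driver and labels only; ipam settings are deliberately ignored.
function isEqualConfig(a: NetworkConfig, b: NetworkConfig): boolean {
  return a.driver === b.driver && labelsKey(a.labels) === labelsKey(b.labels);
}
```

Because the label differs, a network with custom ipam no longer compares equal to one without, even though the ipam values are never inspected.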

This is a major change, as devices running networks with custom ipam configs
will have their networks recreated to add the network label.

Closes: #2251
Change-type: major
See: https://balena.fibery.io/Work/Project/Fix-Supervisor-not-recreating-network-when-passed-custom-ipam-config-1127
Signed-off-by: Christina Ying Wang <christina@balena.io>
2025-03-24 14:55:19 -07:00
7764f98c9d Start a dependent if all dependencies are started
The previous behavior required that dependencies be running before
starting the dependent service. As a result, services that depend on
a one-shot service would never get started, which goes against the
default Docker behavior.

Depending on a service to be running will require the implementation of
[long syntax depends_on](https://docs.docker.com/reference/compose-file/services/#long-syntax-1) and the condition
`service_healthy`.
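
The started-versus-running distinction can be sketched like this. The status values and function names are hypothetical and chosen only to show why a one-shot dependency that has already exited still unblocks its dependent.

```typescript
// Hypothetical status values for this sketch.
type ServiceStatus = 'created' | 'running' | 'exited';

// A dependency counts as started once it has been launched, even if it has
// since exited (as a one-shot service would).
function hasStarted(status: ServiceStatus): boolean {
  return status === 'running' || status === 'exited';
}

// The dependent may start only when every dependency is known and has started.
function canStartDependent(
  dependsOn: string[],
  statuses: Record<string, ServiceStatus>,
): boolean {
  return dependsOn.every((name) => {
    const status = statuses[name];
    return status !== undefined && hasStarted(status);
  });
}
```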

Change-type: patch
Closes: #2409
2025-03-20 14:51:32 -03:00
ae337a1dd7 Remove GOT retries on state poll
The state poll already has its own retry implementation, making the GOT
default retries unnecessary.

Change-type: patch
2025-03-12 10:59:16 -03:00
bdbc6a4ba4 Ensure poll socket timeout is defined early
We have observed that even when setting the socket timeout on the
state poll https request, the timeout is only applied once the socket is
connected. This causes issues with Node's auto family selection (Happy
Eyeballs), as the default https timeout is 5s, which means that a larger
[auto select attempt timeout](https://nodejs.org/docs/latest-v22.x/api/net.html#netgetdefaultautoselectfamilyattempttimeout) may result in the socket timing out before all connection attempts have been tried.

This commit sets a different https Agent for state polling, with a
timeout matching the `apiRequestTimeout` used for other request events.

Change-type: patch
2025-03-12 10:59:11 -03:00
026dc0aed2 Release locks when removing apps
This prevents leftover locks that can prevent other operations from
taking place.

Change-type: patch
2025-03-06 11:50:31 -03:00
6d00be2093 Log non-API errors during state poll
The supervisor was failing silently if an error happened while establishing the
connection (e.g. requesting the socket).

Change-type: patch
2025-03-04 10:46:45 -03:00
f8bdb14335 Fix target poll healthcheck
The Target.lastFetch time that the healthcheck compares against was
reset any time a poll was attempted, no matter the outcome. This changes
the behavior so the time is reset only on a successful poll.
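
A small sketch of the corrected rule, using a hypothetical `PollTracker` class and a hypothetical `maxAgeMs` threshold rather than the Supervisor's actual types or limits.

```typescript
// Track the time of the last successful poll; failed attempts leave it unchanged.
class PollTracker {
  private lastFetch: number;

  constructor(startedAt: number) {
    this.lastFetch = startedAt;
  }

  // Only a successful poll moves the timestamp forward.
  recordAttempt(succeeded: boolean, now: number): void {
    if (succeeded) {
      this.lastFetch = now;
    }
  }

  // Healthy while the last successful poll is at most maxAgeMs old.
  isHealthy(now: number, maxAgeMs: number): boolean {
    return now - this.lastFetch <= maxAgeMs;
  }
}
```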

Change-type: patch
2025-03-04 10:45:31 -03:00
49163e92a0 Decrease balenaCloud api request timeout from 15m to 59s
This was mistakenly increased due to confusion between the timeout for
requests to the supervisor's api vs the timeout for requests from the
supervisor to the balenaCloud api. This separates the two configs and
documents the difference between the timeouts whilst also decreasing
the timeout for balenaCloud api requests to the correct/expected value.

Change-type: patch
2025-03-04 12:29:18 +00:00
2dc9d275b1 Don't revert to regular pull if delta server 401
If the Supervisor receives a 401 Unauthorized from the delta server
when requesting a delta image location, we should surface the error
instead of falling back to a regular pull immediately, as there could
be an issue with the delta auth token, which refreshes after
DELTA_TOKEN_TIMEOUT (10min), or some other edge case.

Change-type: patch
Signed-off-by: Christina Ying Wang <christina@balena.io>
2025-02-24 16:17:15 -08:00
341111f1f9 Retry DELTA_APPLY_RETRY_COUNT (3) times during delta apply fail before reverting to regular pull
This prevents an image download error loop where the delta image on the delta server is present,
but some aspect of the delta image or the base image on the device does not match up, causing
the delta to fail to be applied to the base image.

Delta apply errors don't raise status codes as they are thrown from the Engine (although they should),
so if an error with a status code is raised during this time, throw an error to the handler
indicating that the delta should be retried until success. Errors with status codes raised during
this time are largely network related, so falling back to a regular pull won't improve anything.

Upon delta apply errors exceeding DELTA_APPLY_RETRY_COUNT, revert to a regular pull.
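
The decision described above could be sketched as follows. The `PullError` shape, the `NextAction` type, and the function name are hypothetical, and the exact boundary (whether the third or fourth apply failure triggers the fallback) is an assumption.

```typescript
// Value taken from the commit title.
const DELTA_APPLY_RETRY_COUNT = 3;

// Hypothetical error shape: Engine delta-apply errors carry no statusCode.
interface PullError {
  statusCode?: number;
}

type NextAction = 'retry-delta' | 'regular-pull';

// Decide what to do after a failure during delta apply.
// `applyFailures` counts apply errors seen so far, including this one.
function nextActionAfterFailure(error: PullError, applyFailures: number): NextAction {
  if (error.statusCode !== undefined) {
    // Errors with a status code are treated as network related, so a regular
    // pull would not help: keep retrying the delta.
    return 'retry-delta';
  }
  // Assumed boundary: fall back only once the failures exceed the retry count.
  return applyFailures > DELTA_APPLY_RETRY_COUNT ? 'regular-pull' : 'retry-delta';
}
```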

Change-type: patch
Signed-off-by: Christina Ying Wang <christina@balena.io>
2025-02-11 12:19:53 -08:00
1fc242200f Revert to regular pull immediately on delta server failure (code 400s)
If the delta server responds immediately with HTTP 4xx upon requesting a delta image,
this means the server is not able to supply the resource, so fall back to a regular pull
immediately.

Change-type: patch
Signed-off-by: Christina Ying Wang <christina@balena.io>
2025-02-11 10:58:51 -08:00
f71f98777c Update network-manager to v1
Change-type: patch
2025-01-23 23:40:52 -03:00
85fc5784bc Update contrato to v0.12.0
Change-type: patch
2025-01-15 18:56:24 -03:00
e416ad0daf Add support for io.balena.update.requires-reboot
This label can be used by user services to indicate that a reboot is
required after the install of a service in order to fully apply an update.

Change-type: minor
2025-01-14 11:20:35 -03:00
75127c6074 Move reboot breadcrumb check to device-state
This was on device-config before, but we'll need to set the reboot
breadcrumb from the application-manager as well when we introduce
`requires-reboot` as a label.

Change-type: patch
2025-01-09 14:31:55 -03:00
51f1fb0f30 Refactor device-config as part of device-state
Move the device-config module to the device-state folder and export only
those functions that are needed elsewhere in the codebase

This moves us closer to making the device-state module the only way to
modify application and configuration.

Change-type: patch
2025-01-09 14:31:43 -03:00
8e6c0fcad7 Wait for service dependencies to be running
This fixes a regression where dependencies would only be started in
order and would start the dependent service if its dependency had been
started at some point in the past, regardless of the running condition.

This makes the behavior more consistent with docker compose where the
[dependency needs to be
running or healthy](69a83d1303/pkg/compose/convergence.go (L441)) for the service to be started.

Change-type: patch
2024-12-13 16:22:11 -03:00
2f2b2e1c50 Don't require reboot if setting fan control
Signed-off-by: Christina Ying Wang <christina@balena.io>
2024-12-09 18:43:57 -08:00
828bd22ba0 Add PowerFanConfig config backend
This config backend uses ConfigJsonConfigBackend to update
os.power and os.fan subfields under the "os" key, in order
to set power and fan configs. The expected format for os.power
and os.fan settings is:
```
{
  os: {
    power: {
      mode: string
    },
    fan: {
      profile: string
    }
  }
}
```

There may be other keys in os which are not managed by the Supervisor,
so PowerFanConfig backend doesn't read or write to them. Extra keys in os.power
and os.fan are ignored when getting boot config and removed when setting
boot config.
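
The write behavior described above can be sketched as below. The `OsConfig` type and `setPowerFan` function are hypothetical. The sketch only illustrates that unmanaged keys under `os` survive while extra keys inside `os.power` and `os.fan` are dropped.

```typescript
// Shape based on the format above; the index signatures allow extra keys.
interface OsConfig {
  power?: { mode?: string; [key: string]: unknown };
  fan?: { profile?: string; [key: string]: unknown };
  [key: string]: unknown;
}

// Write new power and fan settings into an existing "os" object.
// Unmanaged keys in `os` are preserved; power and fan are replaced outright,
// so any extra keys they previously held are removed.
function setPowerFan(
  os: OsConfig,
  target: { mode: string; profile: string },
): OsConfig {
  return {
    ...os,
    power: { mode: target.mode },
    fan: { profile: target.profile },
  };
}
```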

After this backend writes to config.json, host services os-power-mode
and os-fan-profile pick up the changes, on reboot in the former's case
and at runtime in the latter's case. The changes are applied by the host
services, which the Supervisor does not manage aside from streaming
their service logs to the dashboard.

Change-type: minor
Signed-off-by: Christina Ying Wang <christina@balena.io>
2024-12-09 18:43:51 -08:00
54fcfa22a7 Support "os" key with object values in ConfigJsonConfigBackend
Signed-off-by: Christina Ying Wang <christina@balena.io>
2024-12-09 18:29:26 -08:00
9ec45a724b Add tests for ConfigJsonConfigBackend
Also deprecate path-getting method, and remove OS version check.
The OS version itself is not used in ConfigJsonConfigBackend, so
it seems the OS version check is to confirm the existence of config.json
during class init, because OS version is a field that's always there
in a valid config.json.

Signed-off-by: Christina Ying Wang <christina@balena.io>
2024-12-09 18:29:26 -08:00
8f3eeff72d Stream logs from last SV's State.FinishedAt, process uptime otherwise
This will catch any container or host logs between Supervisor runs. If
FinishedAt is invalid (0), the last sent timestamp is already set (i.e.
this isn't the first time logMonitor.start() has been called), or
the Supervisor container metadata couldn't be acquired, use the
Supervisor process uptime as the default. This has the downside of
missing any logs generated during SV downtime, but at least
means the log-streamer can proceed without error.
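
One way to read the fallback rule is sketched here. The function and parameter names are hypothetical, and treating "process uptime" as "now minus uptime" (that is, the process start time) is an assumption.

```typescript
// Pick the timestamp (ms since epoch) from which to start streaming logs.
// finishedAt is null when the container metadata could not be acquired.
function logStartTime(opts: {
  finishedAt: number | null;
  lastSentTimestamp: number | null;
  now: number;
  uptimeMs: number;
}): number {
  const { finishedAt, lastSentTimestamp, now, uptimeMs } = opts;
  // Fall back to the process start time if FinishedAt is missing or invalid,
  // or if a timestamp was already sent by an earlier start() call.
  if (finishedAt === null || finishedAt <= 0 || lastSentTimestamp !== null) {
    return now - uptimeMs;
  }
  return finishedAt;
}
```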

Signed-off-by: Christina Ying Wang <christina@balena.io>
2024-12-06 07:46:38 -08:00
fb6fa9b16c Add ability to stream logs from host services to cloud
Add `os-power-mode.service`, `nvpmodel.service`, and `os-fan-profile.service`
which report status from applying power mode and fan profile configs as read
from config.json. The Supervisor sets these configs in config.json for these
host services to pick up and apply.

Also add host log streaming from `jetson-qspi-manager.service` as it
will very soon be needed for Jetson Orins.

Relates-to: #2379
See: balena-io/open-balena-api#1792
See: balena-os/balena-jetson-orin#513
Change-type: minor
Signed-off-by: Christina Ying Wang <christina@balena.io>
2024-12-06 07:45:43 -08:00
c610710f03 Move logger.ts into logging/index.ts
Signed-off-by: Christina Ying Wang <christina@balena.io>
2024-12-05 21:55:09 -08:00
e62e245fc7 Modify log monitor logging to be more generic
Includes other host services in addition to balena.service

Signed-off-by: Christina Ying Wang <christina@balena.io>
2024-12-05 09:11:04 -08:00
a2d4b31b23 Take update locks for host-config changes
This adds update-lock support to hostname changes via the host-config
endpoint, in addition to proxy changes, as changing the hostname may
cause an engine restart from the OS.

Change-type: minor
2024-12-03 15:07:24 -03:00
8b3b9a5b7b Respect lockOverride when using withLock
2024-11-27 16:40:58 -03:00
9c09329b86 Clean up remaining locks on state settle
Locks could remain from a previous supervisor run that didn't get to
settle the state. This ensures that cleanup will happen for remaining
locks every time the state is settled.

Change-type: patch
2024-11-27 16:40:58 -03:00
3c6e9dd209 Refactor update-locks implementation
The refactor simplifies the implementation and ensures that locks per
app can only be held by one supervisor task at a time.

Change-type: patch
2024-11-27 16:40:50 -03:00
d8f54c05e7 Refactor lockfile module
Updated interfaces for clarity

Change-type: patch
2024-11-15 18:25:50 -03:00
7e1cafa866 Firewall: allow DNS requests from custom Docker bridge networks
We only allow DNS requests through the `balena0` interface, but this
is the default Docker bridge which is used for containers that
don't have a custom bridge. However, the Supervisor creates a
custom bridge for all containers unless another network mode is
specified.

Change-type: patch
Signed-off-by: Christina Ying Wang <christina@balena.io>
2024-11-08 17:02:34 -08:00
3d3f659f16 Delete apps not in target from db by appUuid instead of appId
Resolve an issue in balenaMachine instances that were installed at <v14.1.0,
in which a Supervisor app with random UUID is kept in the target db due to its appId
being the same, even after the BM instance has upgraded to v14.1.0 which patches
the correct reserved Supervisor app UUIDs in. This results in two Supervisors running
on devices under the BM instance which persists after BM upgrade.

See: https://balena.fibery.io/search/T7ozi#Inputs/Pattern/Two-supervisors-are-running-on-device-3370
Change-type: patch
Signed-off-by: Christina Ying Wang <christina@balena.io>
2024-11-04 14:15:55 -08:00
ed1c18e369 Add support for init field from compose
Init supports boolean values, and is not included in the config when
not defined.

Change-type: patch
Signed-off-by: Christina Ying Wang <christina@balena.io>
2024-09-26 10:39:59 -03:00
e9a52e6786 Store rejected apps in the database
This moves from throwing an error when an app is rejected due to unmet
requirements (because of contracts) to storing the target with a
`rejected` flag on the database.

The application manager filters rejected apps when calculating steps to
prevent them from affecting the current state. The state engine uses the
rejection info to generate the state report.

Change-type: minor
2024-08-30 10:52:11 -04:00
227fee9941 Set the app update status when reporting state
Change-type: minor
2024-08-30 10:52:11 -04:00
48e526ec43 Refactor contracts validation code
This updates the interfaces on lib/contracts and the validation in
the application-manager module.
2024-08-30 10:52:11 -04:00
e9f460fd75 Add update status to types
Change-type: minor
2024-08-30 10:52:11 -04:00
788afee9a1 Remove unused patchDevice function
This function was a remnant of the dependent devices code that
was removed in #2105

Change-type: patch
2024-08-29 10:34:43 -04:00
eaa07e97a9 Add support for redsocks dnsu2t config
Users may specify dnsu2t config by including a `dns` field
in the `proxy` section of PATCH /v1/device/host-config's body:
```
{
  network: {
    proxy: {
      dns: '1.1.1.1:53',
    }
  }
}
```

If `dns` is a string, ADDRESS and PORT are required and should be
in the format `ADDRESS:PORT`. The endpoint will error with
code 400 if either ADDRESS or PORT is missing.

`dns` may also be a boolean. If true, defaults will be configured.
If false, the dns configuration will be removed.

If `proxy` is patched to empty, `dns` will be removed regardless
of its current or input configs, as `dns` depends on an active
redsocks proxy to function.
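
The accepted `dns` inputs could be validated along these lines. The `DnsSetting` type and `parseDnsInput` function are hypothetical. The default values applied for `true` are not stated here, so the sketch only marks that case. The thrown error stands in for the endpoint's 400 response.

```typescript
// Hypothetical normalized form of the `dns` input.
type DnsSetting =
  | { kind: 'custom'; address: string; port: number }
  | { kind: 'defaults' }
  | { kind: 'removed' };

function parseDnsInput(dns: string | boolean): DnsSetting {
  if (dns === true) {
    return { kind: 'defaults' };
  }
  if (dns === false) {
    return { kind: 'removed' };
  }
  // Split on the last colon so that ADDRESS and PORT can both be checked.
  const separator = dns.lastIndexOf(':');
  const address = separator > 0 ? dns.slice(0, separator) : '';
  const portText = separator >= 0 ? dns.slice(separator + 1) : '';
  if (address === '' || !/^\d+$/.test(portText)) {
    // The endpoint would answer with code 400 in this case.
    throw new Error('dns must be in the format ADDRESS:PORT');
  }
  return { kind: 'custom', address, port: Number(portText) };
}
```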

Change-type: minor
Signed-off-by: Christina Ying Wang <christina@balena.io>
2024-08-28 14:01:51 -07:00
8bf346a6fd Parse dnsu2t block to dns config
Signed-off-by: Christina Ying Wang <christina@balena.io>
2024-08-28 13:51:46 -07:00
b775f8f14d Stringify dns subsection of redsocks input config to dnsu2t
Signed-off-by: Christina Ying Wang <christina@balena.io>
2024-08-28 13:51:46 -07:00
e724f60beb Strip additional fields from HostConfiguration type
Signed-off-by: Christina Ying Wang <christina@balena.io>
2024-08-28 13:51:46 -07:00
51e59725f8 Add unit test for usingInferStepsLock
Change-type: patch
Signed-off-by: Christina Ying Wang <christina@balena.io>
2024-08-26 13:44:51 -07:00