Known Supervisor memory leaks have been fixed as of v16.5.1, and the latest
tested version is v16.7.1. Furthermore, in recent instances of the memory healthcheck
triggering on support, we've taken heap snapshots on devices before and after, multiple
times, without finding any evidence of memory leaks in the snapshots.
Therefore, it's hypothesized that the heuristic for determining starting memory may
be flawed in that it's not waiting long enough after system startup, or it may
run right after garbage collection has happened. Because of the variability and
difficulty of ascertaining these factors, we suspect an inaccurate memory
baseline may be the cause of the instances of false positives on support.
See: https://balena.zulipchat.com/#narrow/channel/403752-channel.2Fsupport-help/topic/supervisor.20memory.20usage.20above.20threadhold/near/520640885
Change-type: patch
Signed-off-by: Christina Ying Wang <christina@balena.io>
The current target state apply is cancelled when either:
- /v1/update is called with cancel: true
- A different target state is received from the cloud (with a non-304 status)
Following apply cancellation, a target state apply is re-triggered. This ensures
that the user can force a device out of a dead-locked situation where a long-running
task such as an image fetch fails to cede control back to the Supervisor, which is
the behavior observed in an Engine bug with infinite pull retries with a bad network.
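
The two cancellation conditions above can be sketched as a small predicate. This is an illustrative sketch only: `ApplyEvent` and `shouldCancelApply` are hypothetical names rather than Supervisor identifiers, and it assumes a `cloud-target` event is only raised once a poll response has actually been received.

```typescript
// Hypothetical event shapes; the Supervisor's real types may differ.
type ApplyEvent =
  | { kind: 'v1-update'; cancel: boolean }
  | { kind: 'cloud-target'; statusCode: number };

// Returns true when the in-flight target state apply should be cancelled.
// After a cancellation, a new apply would be triggered by the caller.
function shouldCancelApply(event: ApplyEvent): boolean {
  if (event.kind === 'v1-update') {
    // /v1/update only cancels when called with cancel: true.
    return event.cancel;
  }
  // A 304 means the target is unchanged, so the current apply continues.
  return event.statusCode !== 304;
}
```

Keeping the decision in one pure function makes it easy to see that an unchanged target (304) never interrupts an apply that is already in progress.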
Change-type: minor
Signed-off-by: Christina Ying Wang <christina@balena.io>
When abortController.abort() is called, this signal is passed down
to the functions that interface with Docker Engine for image pulls,
cancelling those pulls.
The next commit will limit when abortController.abort() is called.
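
As a rough illustration of how a single abort signal can cancel several pulls, the sketch below uses a simulated pull instead of a real Docker Engine call. `simulatedPull`, `pullEvents`, and the image names are hypothetical and exist only for this example.

```typescript
// Log of what happened to each simulated pull, for inspection.
const pullEvents: string[] = [];

// Hypothetical stand-in for an Engine image pull. It completes after `ms`
// milliseconds unless the signal aborts first.
function simulatedPull(image: string, ms: number, signal: AbortSignal): Promise<string> {
  return new Promise<string>((resolve, reject) => {
    if (signal.aborted) {
      pullEvents.push(`aborted ${image}`);
      reject(new Error(`pull of ${image} aborted`));
      return;
    }
    let timer: ReturnType<typeof setTimeout> | undefined;
    const onAbort = () => {
      clearTimeout(timer); // stop the pending "download"
      pullEvents.push(`aborted ${image}`);
      reject(new Error(`pull of ${image} aborted`));
    };
    signal.addEventListener('abort', onAbort, { once: true });
    timer = setTimeout(() => {
      signal.removeEventListener('abort', onAbort);
      pullEvents.push(`pulled ${image}`);
      resolve(image);
    }, ms);
  });
}

// One controller's signal is shared by every pull in the current apply.
const abortController = new AbortController();
const pendingPulls = ['app/main', 'app/sidecar'].map((image) =>
  simulatedPull(image, 60000, abortController.signal).catch((e: Error) => e.message),
);

// Aborting once cancels both in-flight pulls.
abortController.abort();
```

Because abort listeners run synchronously, both pulls are recorded as aborted as soon as `abortController.abort()` returns.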
Signed-off-by: Christina Ying Wang <christina@balena.io>
A contract including extra requirement fields, such as "name", would fail
validation. This PR removes any extra fields from the validated contract
to prevent services with these extra fields from getting rejected by the
contract validation.
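
A minimal sketch of the idea, assuming that only `type` and `version` are meaningful to validation (that list, and the names `KNOWN_FIELDS` and `stripExtraFields`, are assumptions made for illustration, not the Supervisor's actual code):

```typescript
// Assumption: the only requirement fields that validation understands.
const KNOWN_FIELDS = ['type', 'version'];

type Requirement = Record<string, unknown>;

// Returns a copy of the requirement containing only the known fields.
function stripExtraFields(requirement: Requirement): Requirement {
  const cleaned: Requirement = {};
  for (const field of KNOWN_FIELDS) {
    if (field in requirement) {
      cleaned[field] = requirement[field];
    }
  }
  return cleaned;
}

// The extra "name" field is dropped before the requirement is validated.
const cleanedRequirement = stripExtraFields({
  type: 'sw.supervisor',
  version: '>=16.0.0',
  name: 'Supervisor',
});
```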
Change-type: patch
The leftover locks search was creating an array rather than an object
keyed by the appId. This could affect the lock cleanup and make leftover
locks from one app affect the install of the app in local mode.
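
To illustrate why the shape matters, the sketch below groups leftover locks into an object keyed by `appId`, so that one app's locks can be looked up without touching another's. The `LeftoverLock` type and `locksByAppId` function are hypothetical names for this example.

```typescript
// Hypothetical shape of a leftover lock found on disk.
type LeftoverLock = { appId: number; path: string };

// Groups lock paths by appId so each app only sees its own leftovers.
function locksByAppId(locks: LeftoverLock[]): Record<number, string[]> {
  const byApp: Record<number, string[]> = {};
  for (const lock of locks) {
    const paths = byApp[lock.appId] ?? [];
    paths.push(lock.path);
    byApp[lock.appId] = paths;
  }
  return byApp;
}

const grouped = locksByAppId([
  { appId: 1, path: '/locks/1/main/updates.lock' },
  { appId: 2, path: '/locks/2/main/updates.lock' },
  { appId: 1, path: '/locks/1/sidecar/updates.lock' },
]);
```

With a flat array, a lookup by `appId` would instead index by position and could return locks belonging to a different app.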
Change-type: patch
In a target release where the only change is the addition or removal
of a custom ipam config, the Supervisor does not recreate the network
due to ignoring ipam config differences when comparing current and target
network (in network.isEqualConfig). This commit implements the addition of
a network label if the target compose object includes a network with custom
ipam. With the label, the Supervisor will detect a difference between a
network with a custom ipam and a network without, without needing to compare
the ipam configs themselves.
This is a major change, as devices running networks with custom ipam configs
will have their networks recreated to add the network label.
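
The label approach can be sketched as follows. The label name `example.supervisor.custom-ipam` and the helpers `withIpamLabel` and `labelsEqual` are hypothetical; the sketch only assumes, as described above, that the comparison looks at labels and ignores the ipam config itself.

```typescript
// Hypothetical label name, used only for this illustration.
const CUSTOM_IPAM_LABEL = 'example.supervisor.custom-ipam';

type NetworkConfig = {
  labels: Record<string, string>;
  ipam?: { driver?: string; config?: Array<{ subnet: string; gateway?: string }> };
};

// Adds the marker label when the target network defines a custom ipam.
function withIpamLabel(target: NetworkConfig): NetworkConfig {
  if (target.ipam === undefined) {
    return target;
  }
  return { ...target, labels: { ...target.labels, [CUSTOM_IPAM_LABEL]: 'true' } };
}

// Compares labels only, deliberately ignoring the ipam config.
function labelsEqual(a: NetworkConfig, b: NetworkConfig): boolean {
  const keys = Object.keys(a.labels).concat(Object.keys(b.labels));
  return keys.every((key) => a.labels[key] === b.labels[key]);
}

const currentNetwork: NetworkConfig = { labels: {} };
const targetNetwork = withIpamLabel({
  labels: {},
  ipam: { config: [{ subnet: '172.28.0.0/16' }] },
});

// The label makes the networks differ, which would trigger a recreate.
const needsRecreate = !labelsEqual(currentNetwork, targetNetwork);
```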
Closes: #2251
Change-type: major
See: https://balena.fibery.io/Work/Project/Fix-Supervisor-not-recreating-network-when-passed-custom-ipam-config-1127
Signed-off-by: Christina Ying Wang <christina@balena.io>
The previous behavior required that dependencies were running before
starting the dependent service. As a result, services dependent on
a one-shot service would never get started, which goes against the default
docker behavior.
Depending on a service to be running will require the implementation of
[long syntax depends_on](https://docs.docker.com/reference/compose-file/services/#long-syntax-1) and the condition
`service_healthy`.
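
A rough sketch of the relaxed check, in which a dependency only needs to have been started and may already have exited. The status names and the `dependenciesStarted` function are assumptions for illustration, not the Supervisor's actual model.

```typescript
// Hypothetical service statuses for this example.
type ServiceStatus = 'not-created' | 'running' | 'exited';

// A dependency counts as satisfied once it has been started, even if it
// has since exited (as a one-shot service would).
function dependenciesStarted(
  dependsOn: string[],
  statuses: Record<string, ServiceStatus>,
): boolean {
  return dependsOn.every((name) => {
    const status = statuses[name];
    return status === 'running' || status === 'exited';
  });
}

// "migrate" is a one-shot service that has already finished.
const canStartApi = dependenciesStarted(['migrate', 'db'], {
  migrate: 'exited',
  db: 'running',
});
```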
Change-type: patch
Closes: #2409
We have observed that even when setting the socket timeout on the
state poll https request, the timeout is only applied once the socket is
connected. This causes issues with Node's auto family selection (happy
eyeballs), as the default https timeout is 5s, which means that a larger
[auto select attempt timeout](https://nodejs.org/docs/latest-v22.x/api/net.html#netgetdefaultautoselectfamilyattempttimeout) may result in the socket timing out before all connection attempts have been tried.
This commit sets a different https Agent for state polling, with a
timeout matching the `apiRequestTimeout` used for other request events.
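
As a sketch of what a dedicated agent might look like, the example below builds an `https.Agent` with its `timeout` option set. The value of `apiRequestTimeout` here is a placeholder, and `statePollAgentOptions` is a hypothetical helper; neither is taken from the Supervisor source.

```typescript
import * as https from 'https';

// Placeholder standing in for the Supervisor's `apiRequestTimeout`
// setting, in milliseconds. The real value is not stated here.
const apiRequestTimeout = 59000;

// Builds the options for an agent used only by state poll requests.
function statePollAgentOptions(timeoutMs: number): https.AgentOptions {
  // `timeout` is the socket timeout the agent sets when it creates a socket.
  return { timeout: timeoutMs };
}

const statePollAgent = new https.Agent(statePollAgentOptions(apiRequestTimeout));
```

The agent would then be passed as the `agent` option of the state poll request, leaving other requests on the default agent.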
Change-type: patch
The Target.lastFetch time compared when performing the healthcheck
resets any time a poll is attempted, no matter the outcome. This changes
the behavior so that the time is reset only on a successful poll.
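
The changed behavior can be sketched as below. `PollHealth`, `recordPoll`, and `isHealthy` are hypothetical names, and the maximum age passed to the healthcheck is an assumption of this example.

```typescript
// Tracks the time of the last successful target state fetch.
class PollHealth {
  private lastFetch: number;

  constructor(startedAt: number) {
    this.lastFetch = startedAt;
  }

  // Only a successful poll moves lastFetch forward; failures leave it alone.
  recordPoll(succeeded: boolean, now: number): void {
    if (succeeded) {
      this.lastFetch = now;
    }
  }

  // Healthy while the last successful fetch is recent enough.
  isHealthy(now: number, maxAgeMs: number): boolean {
    return now - this.lastFetch <= maxAgeMs;
  }
}

const health = new PollHealth(0);
health.recordPoll(false, 900); // failed poll: lastFetch stays at 0
const healthyAfterFailure = health.isHealthy(1000, 500);
health.recordPoll(true, 900); // successful poll: lastFetch becomes 900
const healthyAfterSuccess = health.isHealthy(1000, 500);
```

With the previous behavior, the failed poll at `900` would have reset the timer and hidden the fact that no target state had been fetched since startup.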
Change-type: patch
This was mistakenly increased due to confusion between the timeout for
requests to the supervisor's api vs the timeout for requests from the
supervisor to the balenaCloud api. This separates the two configs and
documents the difference between the timeouts whilst also decreasing
the timeout for balenaCloud api requests to the correct/expected value.
Change-type: patch
If the Supervisor receives a 401 Unauthorized from the delta server
when requesting a delta image location, we should surface the error
instead of falling back to a regular pull immediately, as there could
be an issue with the delta auth token, which refreshes after
DELTA_TOKEN_TIMEOUT (10min), or some other edge case.
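
One way to picture the decision is the sketch below. `DeltaOutcome` and `onDeltaLocationResponse` are hypothetical names, and the handling of non-401 failures (falling back to a regular pull) is an assumption inferred from the description above.

```typescript
type DeltaOutcome = 'use-delta' | 'surface-error' | 'regular-pull';

// Decides what to do with the delta server's response to a
// delta image location request.
function onDeltaLocationResponse(statusCode: number): DeltaOutcome {
  if (statusCode >= 200 && statusCode < 300) {
    return 'use-delta';
  }
  if (statusCode === 401) {
    // Possibly a stale delta auth token: report it instead of hiding it.
    return 'surface-error';
  }
  // Assumption: other failures still fall back to a regular pull.
  return 'regular-pull';
}
```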
Change-type: patch
Signed-off-by: Christina Ying Wang <christina@balena.io>