This change improves the check for the DuplicateUuidError that can happen if a device has been provisioned but the API's response hasn't been persisted - the error message
returned from the API has been known to have a few variations (usually an extra dot at the end), so we now use _.startsWith instead of checking for equal strings to make the
supervisor still work under these variations.
Change-Type: patch
Signed-off-by: Pablo Carranza Velez <pablo@resin.io>
It appears preloaded apps have been getting restarted because the "apiKey" configuration value was only available after provisioning succeeded. This change ensures the
deviceApiKey that the device will use is injected into the env vars of preloaded apps, ensuring the app is not restarted (unless provisioning fails and the uuid and deviceApiKey are
regenerated, but this should be rare).
We also ensure that whenever an app's RESIN_API_KEY env var is populated, it is *always* done with the deviceApiKey and never with the provisioning apiKey.
Closes#457
Change-Type: patch
Signed-off-by: Pablo Carranza Velez <pablo@resin.io>
- Updates resumable-request to 1.0.1
- Updates docker-progress to 2.0.3
- Removes `DEFAULT_DELTA_APPLY_TIMEOUT`; it’s not needed anymore, docker-delta reliably tracks rsync.
- Properly end the update when applying the delta results in an error.
Change-Type: patch
This commit changes the way the source for a delta is determined. We used to do
it by comparing the available tags with the one we want and relying on the format that
includes the app in the image name. Now we explicitly choose a delta source from the previous app
version if we have one, and otherwise use the image from any available app - which will allow us
to have a valid source when moving a device between apps.
For this to work consistently if there's an unexpected reboot, we now avoid deleting an app from the db
until the full update has succeeded. Instead, we mark the app for deletion so that we still have the image stored after the reboot.
This commit also changes a .map to .mapSeries when iterating over appIds for removal/install/update - this avoids parallel treatment
of apps which can cause inconsistencies in the status reported to the API.
Change-Type: patch
Signed-off-by: Pablo Carranza Velez <pablo@resin.io>
We've been using `.catch Promise.OperationalError, ...` to catch errors when stopping a container and
detecting whether the error means that the container has already been stopped of removed.
Apparently, after the recent dockerode upgrade these errors are not typed as OperationalError anymore, causing error
messages like "No such container: null" when applying an update. This commit makes us catch all errors and check for their statusCode.
Change-Type: patch
Signed-off-by: Pablo Carranza Velez <pablo@resin.io>
Errors from docker-modem that are passed from dockerode can have a "json" or "reason" property,
but that is generally less descriptive than the more standard "message", and can show up in the logs
as `[object Object]`. This commit changes it so that we log err.message if it is non-empty, and otherwise
look for json and reason.
Change-Type: patch
Signed-off-by: Pablo Carranza Velez <pablo@resin.io>
Applying a delta update consists of two parts:
1. The request to the delta server for the delta payload (an rsync batch file, plus some prepended Docker metadata). The response is a redirect to a URL that contains the delta (currently S3).
2. The request for the actual download of the delta. The response is streamed directly to rsync, which applies it onto the mounted root filesystem of the final image.
The first step may take a while as it may trigger the generation of the delta if the request is the first one for this combination of src/dest image and the images are large. If the request times out, either because of the delta server taking too long to respond or bad network, the Supervisor automatically schedules a retry to be performed after a while.
Currently, similar behaviour applies to the second step as well -- if the request fails, we immediately bail out and the Supervisor schedules a retry of the whole process (i.e. from step 1). But in this case it means we might have downloaded and applied some or most of the delta when a socket timeout occurs causing us to start all over again, wasting time and bandwidth.
This commit splits the process into the two discreet steps and improves the behaviour on the second step. Specifically:
- makes the Supervisor try to resume the delta download request several times before it bails out and starts the process all over again.
- removes arbitrary timeout which applied over the whole process and meant some deltas would never manage to be applied (because of large delta size and low network bandwidth).
- makes sure any launched rsync processes always exit and any opened streams consumed and closed.
Most of the improvements are in the two dependencies linked below -- `resumable-request` and `node-docker-delta` -- and this commit merely combines the updated versions of these modules.
Change-Type: minor
Connects-To: #140
Depends-On: https://github.com/resin-io/node-docker-delta/pull/19
Depends-On: https://github.com/resin-io-modules/resumable-request/pull/2
We mark when the device is rebooting and avoid some steps in the update cycle that change the device
state, similarly to when the device is in local mode, to avoid problems with non-atomic operations.
This doesn't solve *all* the potential scenarios of a reboot happening in the middle of an update, but at least
should prevent the case where we start an app container and reboot the device before saving the containerId, potentially
causing a duplicated container issue.
We also correct the API docs to reflect the 202 response when reboot or shutdown are successful.
Change-Type: patch
Signed-off-by: Pablo Carranza Velez <pablo@resin.io>
We used to store the uuid which would cause the supervisor to not attempt a provisioning even if offline mode
was turned off. This was to avoid preloaded apps being reloaded constantly leaving multiple containers.
We now avoid persisting the uuid, so that when the supervisor goes out of offline mode it can provision
without the need to wipe out the db. We avoid the problem with preloaded apps by not loading them
if there's apps already stored on the db.
(In the future, apps in the db will only represent target state and we can make preloaded apps be reloaded on every
start, but for now we can't do it as long as we store the containerId on the db - deleting an app on the db
means losing track of its containerId and therefore leaving an orphaned container)
Change-Type: patch
Signed-off-by: Pablo Carranza Velez <pablo@resin.io>
This makes the Async suffix for docker functions unnecessary. It also allows us to remove dockerode as an
explicit dependency.
Change-Type: minor
Signed-off-by: Pablo Carranza Velez <pablo@resin.io>
The test for an exec format error caused a `err.json.trim` is not a function
error so the message shown didn't relate to what the problem actually was.
This makes the test for the exec format error safer.
Change-Type: patch
Signed-off-by: Pablo Carranza Velez <pablo@resin.io>
The current setup would cause the check to always fail - the consequence is not *that* bad since
the provisioning key still gets overwritten, but it's better to delete it if we can.
Change-Type: patch
Signed-off-by: Pablo Carranza Velez <pablo@resin.io>
This allows us to also remove a few npm dependencies and the docker compose binary.
Change-Type: major
Signed-off-by: Pablo Carranza Velez <pablo@resin.io>
The supervisor uses an `API_ENDPOINT` environment variable to define what API to register to. Up to now this has been defaulted to `https://api.resin.io`.
(In Resin OS devices this environment variable ultimately comes from config.json).
This commit changes the behavior so that an empty value of that environment variable causes the supervisor to work in "offline mode", i.e. not connected to a remote server.
Basically only preloaded apps and the supervisor API work in this mode.
The config.json `supervisorOfflineMode` field still works for backwards compatibility, but we'll treat it as deprecated and it should be removed eventually.
Change-Type: minor
Signed-off-by: Pablo Carranza Velez <pablo@resin.io>
The logic to disable mixpanel initialization in offline mode was inverted :S causing mixpanel
to *only* be initialized when in offline mode.
Change-Type: patch
Signed-off-by: Pablo Carranza Velez <pablo@resin.io>
This was properly done in the recently added changes in bootstrap.coffee,
but all other references where using "Os" instead of "OS.
Change-Type: patch
Signed-off-by: Pablo Carranza Velez <pablo@resin.io>
When requesting a delta, a `Promise.join` promise chain was producing unhandled
errors since it consisted in a separate promise chain from the parent function which,
was created with `new Promise`. This commit fixes this by creating the new Promise only
when it's needed, avoiding the creation of a separate promise chain.
Closes#432
Change-Type: patch
Signed-off-by: Pablo Carranza Velez <pablo@resin.io>
This avoids problems when updating the supervisor on an older OS, where the VPN and other
host services still require config.json to have an apiKey field to authenticate.
Change-Type: patch
Signed-off-by: Pablo Carranza Velez <pablo@resin.io>
This helps avoid unnecessary writes to the DB which may cause disk wearout.
We also change the error message in this section to show that the error might have happened
when fetching the device config as much as when setting it.
Change-Type: patch
Signed-off-by: Pablo Carranza Velez <pablo@resin.io>
device.getID caused a fatal error when connection was down, as the memoization with `promise: true` throws
synchronously. Changing memoizee to use `promise: 'then'` makes the memoization work as expected.
Change-Type: patch
Signed-off-by: Pablo Carranza Velez <pablo@resin.io>
We add an extra image/container cleanup before applying updates, allowing any unwanted images to be deleted.
When doing this, we take care not to delete images that will be used when the target state is applied.
This prevents the problem of stale images being stored while the update lock is set, potentially
leaving the device out of space.
Running the cleanup *before* applying the update ensures that only one target image is downloaded: if a stale one
had been downloaded previously, it will be deleted before starting the update for the new one. This can have a slight
impact on delta performance, since the delta is potentially done from an older (and more different) version of the app,
but can have a big impact on storage usage, as not doing this would duplicate the required free storage space when
the update lock is set.
Change-Type: patch
Signed-off-by: Pablo Carranza Velez <pablo@resin.io>
Hand over authentication credentials to the docker engine
Fetch an access token from the API if possible and hand it over to the delta server
Change-Type: minor
Signed-off-by: Andreas Fitzek <andreas@resin.io>
This prevents duplicated containers when updating from older supervisors before the config column
was introduced.
Change-Type: patch
Signed-off-by: Pablo Carranza Velez <pablo@resin.io>
Header is in the format Supervisor/X.Y.Z (Linux; Resin OS v2.A.B.revC; Dev) - omitting any fields
that are not available depending on the OS.
Change-Type: minor
Signed-off-by: Pablo Carranza Velez <pablo@resin.io>
The lock is now located at `/tmp/resin-supervisor/<appId>/` on the host, and `/tmp/resin/`
on the user container. The old lock location is supported only in Resin OS 1.X (and both locks are
taken in that case).
This fixes the race condition when the app is started before the supervisor, and takes a lock that is
cleared on supervisor startup.
Change-Type: major
Signed-off-by: Pablo Carranza Velez <pablo@resin.io>
Local mode makes the API accept unauthenticated requests.
Local mode now also removes app containers when stopping them.
Local mode only works on a host OS that has `VARIANT_ID = "dev"` in /etc/os-release.
Also add more explicit logging when stopping an app and it was already stopped
or the container was already removed.
Change-Type: patch
Signed-off-by: Pablo Carranza Velez <pablo@resin.io>
Up to now we've only been running the "special actions" (like vpn on/off, logs on/off)
when the target state includes a current value for the corresponding config variable.
We now also check if there was a *previous* value, and in that case also call the action function.
These functions are prepared to reset to a default when they're called with an undefined value.
Change-Type: patch
Signed-off-by: Pablo Carranza Velez <pablo@resin.io>
Config variables now use a checkTruthy validation function,
and can be "1", "on", "true" or true to be considered true, or
"0", "off", "false" or false to be considered false.
Change-Type: minor
Signed-off-by: Pablo Carranza Velez <pablo@resin.io>
A RESIN_SUPERVISOR_LOCAL_MODE variable is introduced. When this variable is "1", all apps
are stopped and the update cycle stops executing changes other than deviceConfig changes
and the proxyvisor.
Change-Type: minor
Signed-off-by: Pablo Carranza Velez <pablo@resin.io>
We've been using docker attach, which only gives us the logs since we attach. This change allows getting the
full logs from the beginning.
We also use the timestamps that come with the logs from docker, as they will be more precise and are more relevant now
that we're getting previous logs from history.
Change-Type: patch
Signed-off-by: Pablo Carranza Velez <pablo@resin.io>
On ResinOS 2.X the default mounts should not include the previously deprecated host_run, and there's no connman which makes the connman mount confusing.
This is a breaking change as it is not backwards-compatible on non-ResinOS instances of the supervisor.
Change-Type: major
Signed-off-by: Pablo Carranza Velez <pablo@resin.io>
The logic for updateStatus.forceNext is changed so that its value is checked when the scheduled update is run, instead
of when the update is scheduled. And when an update is already scheduled and a new request comes in,
we mark forceNext as true if the new request requires a force update.
Change-Type: patch
Signed-off-by: Pablo Carranza Velez <pablo@resin.io>
Also split out deviceConfig set and get to a separate module to avoid circular dependencies.
Change-Type: patch
Signed-off-by: Pablo Carranza Velez <pablo@resin.io>
Also take validation functions into a module, and use that in all cases where
we need to check for an integer or string.
Signed-off-by: Pablo Carranza Velez <pablo@resin.io>
Current delta timeouts are too limiting, so we increase the request timeout to 30 minutes which is big enough that
the server will time out first and we can provide a nice message letting the user know we'll retry; and we increase
the total timeout to 24 hours to account for really big deltas over slower connections (the rsync calls will time out anyways
if something else goes wrong, as they have a 5 minute I/O timeout).
The timeouts are now configurable with the RESIN_SUPERVISOR_DELTA_REQUEST_TIMEOUT and RESIN_SUPERVISOR_DELTA_TOTAL_TIMEOUT
configuration variables.
Change-Type: minor
Signed-off-by: Pablo Carranza Velez <pablo@resin.io>
We add a 1s delay between requests to the API to apply state changes,
as this will throttle it to a point that it has a reasonable rate while
preventing too many unnecessary requests to the API.
Closes#375
Signed-off-by: Pablo Carranza Velez <pablo@resin.io>
Resin OS 2.X removes the use of compressed modules, which was the initial
motivation for us to bind mount kmod into user containers (as Debian distros
don't include support for compressed modules).
This is a breaking change, but we still keep bind mounting on devices that are
on 1.X to ensure we don't break apps currently relying on the feature.
Implementation note: some functions in device.coffee have been refactored to
extract (DRY) a memoization procedure for Promise-returning functions.
`device.getOSVersion()` now also memoizes its result.
Change-Type: major
Signed-off-by: Pablo Carranza Velez <pablo@resin.io>
If there's no entries in deviceConfig table, always create one.
Avoids problems if the supervisor is stopped while running the db initialization
(deviceConfig gets created but not populated).
Change-Type: patch
Signed-off-by: Pablo Carranza Velez <pablo@resin.io>
Using REJECT allows better feedback for legitimate users while providing the same level
of security than drop (see http://www.chiark.greenend.org.uk/~peterb/network/drop-vs-reject).
But some hosts don't have REJECT support in the kernel config, so in that case we fall back to DROP.
With the bluebird update to v3, all requests to gosuper (most notably, getting the IP addresses) got broken as we use .spread, which requires the Promise to fulfill with an array. So we need to add multiArgs so that getAsync and postAsync return an array.
* Use appId in dependent app assets tar path, and only create the tar if it doesn't exist
* Support AUFS by upgrading node-docker-delta to 1.0.0 and docker-toolbelt to 1.3.0
* Better parameter handling in PUT /v1/devices/:uuid
* An update hook response of 200 will cause the proxyvisor to stop pinging the hook
* Allow deleting dependent apps and devices
* Implement delete dependent device hook
* Omit some fields when responding with a device object
* Store config vars when there's nothing else to update
* Do not mark an update as failed if the hook failed
* When hitting the dependent devices hook, send appId as int
* Implement proxyvisor API with dependent device handling
* Use the state endpoint from the API to get the full device state
* Add a deviceConfig db table to store host config separately, and allow deleting config.txt entries
* Expose RESIN_APP_NAME, RESIN_APP_RELEASE, RESIN_DEVICE_NAME_AT_INIT, RESIN_DEVICE_TYPE and RESIN_HOST_OS_VERSION env vars
* Add missing error handler on a stream in docker-utils
Log messages to PubNub are now an array instead of an object.
Each element of the array is an object with m (message), t (timestamp) and s (isSystem, optional) attributes.
Logs are sent at a specific interval (110ms, fit with some margin to PubNub's approximated 10 messages/s limit), and truncated to PubNub's 32KB limit.
If we don't persist the uuid then every time the supervisor starts it
will think it's a new device. This triggers a wipe of the local state
and also a re-load of the preloaded apps. This in turn causes multiple
instances of the preloaded apps to be left running.
Signed-off-by: Petros Angelatos <petrosagg@gmail.com>
A supervisorOfflineMode true-ish attribute in config.json will cause that:
* If unprovisioned, the supervisor won't try to provision on Resin
* The update cycle will not start as the device won't consider itself provisioned
* Logs will not be sent to pubnub
* Mixpanel events won't be tracked
* The device state won't be updated to the Resin API
This change will also make the Supervisor API work with an unprovisioned device.
- Validate the options in the YAML file
- Define bind mounts for each service as in Resin apps
- Keep the modified compose file inside the supervisor's /data folder
- Fix error reporting in the first stage of "up"
Allow users to set container restart policy using environment variables.
RESIN_APP_RESTART_POLICY sets the name of the policy, and
if policy is "on-failure", optionally, RESIN_APP_RESTART_RETRIES
sets the maximum number of retries.
More information on docker docs:
https://docs.docker.com/engine/reference/run/#restart-policies-restart
One major change we introduce here is that the default policy is set
to always while we used to have the default "no".
We validate the arguments and pass retries parameter only for the case
of "on-failure" as specified in Docker API as of v1.19.
We could let docker handle the arguments directly, gaining
forwards-compatibility with any new features, but I opted
for an implementation that is as well-defined as possible.
Docker 1.10 sends containerInfo.Image without the ":latest", so
the image name doesn't match the app's imageId.
This fix first splits the image name into repo and tag and then rebuilds
it to include ":latest" when appropriate. Should avoid removing containers
when using resin-sync.
This allows application containers to interface with host connman.
Host /var/lib/connman is bind mounted to /host_var/lib/connman to avoid
collisions with connman installations inside the container.
* On handover, fetch old app from DB before starting the new app (and overwriting the DB record)
* Tidy up the logging
* Fix waitToKill so that it actually works
* Several other fixups
* Break up the update function into manageable parts
* Refactor the special variables into the RESIN_SUPERVISOR_ namespace
* Use three update strategies: normal, kill before download, hand over
TO-DO: implement waitToKill in the hand over case.
Otherwise, we have a deadlock whenever the lock is forced: the app
will be restarted, and will find the lockfile is already there so it won't be
able to lock/unlock it. The supervisor will have the same problem.
So the solution is that, whenever we kill the app, that is the lock owner, we also unlock the file.
Select app to kill from DB within lock (otherwise, if some other part kills and restarts the app, the
containerId will have changed and the real container will not be removed).
Change the update cycle to go by appId instead of imageId.
Use Promise.using for lockFile locks and unlocks.
Now updates shouldn't stop if one of the apps fails to update
(it's a step towards better supporting multiple apps).
Forcing the lock now works.
Remove unnecessary require fs
Nicer assignment for s in joinErrorMessages
* gosuper in dockerignored folder
* correctly handle app not found in purge
* test formatting in test-gosuper
* Fix test-gosuper
* DRY up test-integration
* More work on the integration test
* Correctly get supervisor IP
* Use Fatal for test errors
* test-integration working separate from run-supervisor
* Use jenkins' JOB_NAME to identify and remove containers with their volumes
* Document testing procedure
* Document the assume-unchanged tip
* Use /mnt/root for data path
* Nicer secret assignment
* Restart app when purging
* Use log.Fatal to exit with status 1
* Quotes in entry.sh
* Use JSON for request body
* Handle errors for parseJsonBody
* Better error printing in main
* First attempt at testing nodesuper from Go
* Cleaner build
* Use ARCH to differentiate concurrent tests/builds
* Use --rm to autoremove containers
If we're restarting for binds/mounts then we're already in a non-working state, and starting the new supervisor may have to wait for bootstrap to complete (which takes an indefinite amount of time to complete, meaning the timeout was killing the new container before it could bootstrap in some cases)
This handles the case of running an updated supervisor on an old image which doesn't support host commands - printing out a nice message and waiting rather than exiting silently.
This stores the container id for an app when creating that app, using it when it is necessary to stop/remove the app and when attempting to start it again (rather than creating a new container each time, eg restarting the pi does not create a new container any more)