patch: Migrate Supervisor Debugging docs from masterclass

Signed-off-by: Vipul Gupta (@vipulgupta2048) <vipul@balena.io>
2025-05-27 21:04:19 +00:00 · 2022-12-28 00:59:32 +05:30 · 2022-12-28 00:59:32 +05:30 · a9045d5eda
commit a9045d5eda
parent f94f06c7ff
1 changed files with 349 additions and 0 deletions
--- a/docs/debugging-supervisor.md
+++ b/docs/debugging-supervisor.md
@ -0,0 +1,349 @@
+# Working with the Supervisor
+
+Service: `balena-supervisor.service`
+
+The balena Supervisor is the service that manages the software release on a
+device: it determines when to download updates, applies changes to variables,
+ensures services are restarted correctly, and so on. It is, in effect, the
+on-device agent for balenaCloud.
+
+As such, it's imperative that the Supervisor is operational and healthy at all
+times, even when a device is not connected to the Internet, because it is
+still responsible for keeping an offline device running.
+
+The Supervisor itself is a Docker service that runs alongside any installed
+user services and the healthcheck container (more on that later). One
+major advantage of running it as a Docker service is that it can be updated
+just like any other service (although actually carrying that out is slightly
+different to updating user containers, see 'Updating the Supervisor').
+
+Assuming you're still logged into your development device, run the following:
+
+```shell
+root@debug-device:~# systemctl status balena-supervisor
+● balena-supervisor.service - Balena supervisor
+     Loaded: loaded (/lib/systemd/system/balena-supervisor.service; enabled; vendor preset: enabled)
+     Active: active (running) since Fri 2022-08-19 18:08:59 UTC; 41s ago
+    Process: 2296 ExecStartPre=/usr/bin/balena stop resin_supervisor (code=exited, status=1/FAILURE)
+    Process: 2311 ExecStartPre=/usr/bin/balena stop balena_supervisor (code=exited, status=0/SUCCESS)
+    Process: 2325 ExecStartPre=/bin/systemctl is-active balena.service (code=exited, status=0/SUCCESS)
+   Main PID: 2326 (start-balena-su)
+      Tasks: 10 (limit: 1878)
+     Memory: 11.9M
+     CGroup: /system.slice/balena-supervisor.service
+             ├─2326 /bin/sh /usr/bin/start-balena-supervisor
+             ├─2329 /proc/self/exe --healthcheck /usr/lib/balena-supervisor/balena-supervisor-healthcheck --pid 2326
+             └─2486 balena start --attach balena_supervisor
+
+Aug 19 18:09:07 debug-device balena-supervisor[2486]: [debug]   Starting target state poll
+Aug 19 18:09:07 debug-device balena-supervisor[2486]: [debug]   Spawning journald with: chroot  /mnt/root journalctl -a --follow -o json >
+Aug 19 18:09:07 debug-device balena-supervisor[2486]: [debug]   Finished applying target state
+Aug 19 18:09:07 debug-device balena-supervisor[2486]: [success] Device state apply success
+Aug 19 18:09:07 debug-device balena-supervisor[2486]: [info]    Applying target state
+Aug 19 18:09:07 debug-device balena-supervisor[2486]: [info]    Reported current state to the cloud
+Aug 19 18:09:07 debug-device balena-supervisor[2486]: [debug]   Finished applying target state
+Aug 19 18:09:07 debug-device balena-supervisor[2486]: [success] Device state apply success
+Aug 19 18:09:17 debug-device balena-supervisor[2486]: [info]    Internet Connectivity: OK
+Aug 19 18:09:18 debug-device balena-supervisor[2486]: [info]    Reported current state to the cloud
+```
+
+You can see the Supervisor is just another `systemd` service
+(`balena-supervisor.service`), and that it is started and run by balenaEngine.
+
+Supervisor issues, due to their nature, vary quite significantly. Problems are
+also commonly misattributed to it. As the Supervisor is verbose about its
+state and actions (such as the download of images), it tends to be suspected of
+problems when in fact there are usually other underlying issues. A few examples
+are:
+
+- Networking problems - The Supervisor reports failed downloads or attempts to
+  retrieve the same images repeatedly, where in fact unstable networking is
+  usually the cause.
+- Service container restarts - The default policy for service containers is to
+  restart if they exit, and this is sometimes misunderstood. If a container is
+  restarting, it's worth checking whether the container itself is exiting,
+  either due to a bug in the service container code or because it has
+  correctly come to the end of its running process.
+- Staged releases - A fleet/device has been pinned to a particular
+  version, and a new push is not being downloaded.
+
+It's _always_ worth considering how the system is configured, how releases were
+produced, how the fleet or device is configured and what the current
+networking state is when investigating Supervisor issues, to ensure that there
+isn't something else amiss that the Supervisor is merely exposing via logging.
+
+Another point to note is that the Supervisor is started using
+[`healthdog`](https://github.com/balena-os/healthdog-rs) which continually
+ensures that the Supervisor is present by using balenaEngine to find the
+Supervisor image. If the image isn't present, or balenaEngine doesn't respond,
+then the Supervisor is restarted. The default period for this check is 180
+seconds at the time of writing, but you can inspect the
+`/lib/systemd/system/balena-supervisor.service` file on-device to see what
+it is for the device you're connected to via SSH. For example, using our example device:
+
+```shell
+root@debug-device:~# cat /lib/systemd/system/balena-supervisor.service
+[Unit]
+Description=Balena supervisor
+Requires=\
+    resin\x2ddata.mount \
+    balena-device-uuid.service \
+    os-config-devicekey.service \
+    bind-etc-balena-supervisor.service \
+    extract-balena-ca.service
+Wants=\
+    migrate-supervisor-state.service
+After=\
+    balena.service \
+    resin\x2ddata.mount \
+    balena-device-uuid.service \
+    os-config-devicekey.service \
+    bind-etc-systemd-system-resin.target.wants.service \
+    bind-etc-balena-supervisor.service \
+    migrate-supervisor-state.service \
+    extract-balena-ca.service
+Wants=balena.service
+ConditionPathExists=/etc/balena-supervisor/supervisor.conf
+
+[Service]
+Type=simple
+Restart=always
+RestartSec=10s
+WatchdogSec=180
+SyslogIdentifier=balena-supervisor
+EnvironmentFile=/etc/balena-supervisor/supervisor.conf
+EnvironmentFile=-/tmp/update-supervisor.conf
+ExecStartPre=-/usr/bin/balena stop resin_supervisor
+ExecStartPre=-/usr/bin/balena stop balena_supervisor
+ExecStartPre=/bin/systemctl is-active balena.service
+ExecStart=/usr/bin/healthdog --healthcheck=/usr/lib/balena-supervisor/balena-supervisor-healthcheck /usr/bin/start-balena-supervisor
+ExecStop=-/usr/bin/balena stop balena_supervisor
+
+[Install]
+WantedBy=multi-user.target
+Alias=resin-supervisor.service
+```
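+
+To read just the watchdog period rather than the whole unit file, the lookup
+can be sketched as a small shell function. This is a minimal sketch and not
+part of balenaOS: the function name `watchdog_period` and its default path
+argument are assumptions made for illustration, and it assumes the unit file
+sets `WatchdogSec=` on a line of its own, as in the listing above.
+
+```shell
+# Print the watchdog period configured in a systemd unit file.
+# watchdog_period is an illustrative helper name, not a balenaOS command.
+watchdog_period() {
+    # Default to the Supervisor unit file path shown in the listing above.
+    unit_file="${1:-/lib/systemd/system/balena-supervisor.service}"
+    # Print only the value after "WatchdogSec=", ignoring every other line.
+    sed -n 's/^WatchdogSec=//p' "$unit_file"
+}
+```
+
+Run against the unit file shown above, `watchdog_period` prints `180`.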
+
+#### 8.1 Restarting the Supervisor
+
+It's incredibly rare to actually _need_ a Supervisor restart. The
+Supervisor will attempt to recover from issues that occur automatically, without
+the requirement for a restart. If you've got to a point where you believe that
+a restart is required, double check with the other agent on-duty, and if
+required either with the Supervisor maintainer or another knowledgeable engineer
+before doing so.
+
+There are instances where the Supervisor is incorrectly restarted when in fact
+the issue could be down to corruption of service images, containers, volumes
+or networking. In these cases, you're better off dealing with the underlying
+balenaEngine to ensure that anything corrupt is recreated correctly. See the
+balenaEngine section for more details.
+
+If a restart is required, ensure that you have gathered as much information
+as possible before a restart, including pertinent logs and symptoms so that
+investigations can occur asynchronously to determine what occurred and how it
+may be mitigated in the future. Enabling permanent logging may also be of
+benefit in cases where symptoms are repeatedly occurring.
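+
+One way to capture the Supervisor's logs before a restart is to write its
+journal to a timestamped file. The following is a minimal sketch under stated
+assumptions: the helper name `save_supervisor_logs`, the file naming scheme and
+the default destination directory `/mnt/data` are all illustrative choices,
+not an established procedure. Unless persistent logging is enabled, the
+journal is only likely to hold entries from the current boot.
+
+```shell
+# Save the Supervisor's journal to a timestamped file before restarting it.
+# save_supervisor_logs and the default destination are illustrative choices.
+save_supervisor_logs() {
+    dest_dir="${1:-/mnt/data}"
+    out_file="$dest_dir/supervisor-$(date -u +%Y%m%dT%H%M%SZ).log"
+    # -u selects the unit, -a prints all fields in full, --no-pager disables paging.
+    journalctl -u balena-supervisor.service -a --no-pager > "$out_file" || return 1
+    # Print the path so it can be noted alongside the observed symptoms.
+    echo "$out_file"
+}
+```
+
+Calling `save_supervisor_logs` with no argument writes the file under
+`/mnt/data` and prints its path.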
+
+To restart the Supervisor, simply restart the `systemd` service:
+
+```shell
+root@debug-device:~# systemctl restart balena-supervisor.service
+root@debug-device:~# systemctl status balena-supervisor.service
+● balena-supervisor.service - Balena supervisor
+     Loaded: loaded (/lib/systemd/system/balena-supervisor.service; enabled; vendor preset: enabled)
+     Active: active (running) since Fri 2022-08-19 18:13:28 UTC; 10s ago
+    Process: 3013 ExecStartPre=/usr/bin/balena stop resin_supervisor (code=exited, status=1/FAILURE)
+    Process: 3021 ExecStartPre=/usr/bin/balena stop balena_supervisor (code=exited, status=0/SUCCESS)
+    Process: 3030 ExecStartPre=/bin/systemctl is-active balena.service (code=exited, status=0/SUCCESS)
+   Main PID: 3031 (start-balena-su)
+      Tasks: 11 (limit: 1878)
+     Memory: 11.8M
+     CGroup: /system.slice/balena-supervisor.service
+             ├─3031 /bin/sh /usr/bin/start-balena-supervisor
+             ├─3032 /proc/self/exe --healthcheck /usr/lib/balena-supervisor/balena-supervisor-healthcheck --pid 3031
+             └─3089 balena start --attach balena_supervisor
+
+Aug 19 18:13:33 debug-device balena-supervisor[3089]: [info]    Waiting for connectivity...
+Aug 19 18:13:33 debug-device balena-supervisor[3089]: [debug]   Starting current state report
+Aug 19 18:13:33 debug-device balena-supervisor[3089]: [debug]   Starting target state poll
+Aug 19 18:13:33 debug-device balena-supervisor[3089]: [debug]   Spawning journald with: chroot  /mnt/root journalctl -a --follow -o json >
+Aug 19 18:13:33 debug-device balena-supervisor[3089]: [debug]   Finished applying target state
+Aug 19 18:13:33 debug-device balena-supervisor[3089]: [success] Device state apply success
+Aug 19 18:13:34 debug-device balena-supervisor[3089]: [info]    Applying target state
+Aug 19 18:13:34 debug-device balena-supervisor[3089]: [info]    Reported current state to the cloud
+Aug 19 18:13:34 debug-device balena-supervisor[3089]: [debug]   Finished applying target state
+Aug 19 18:13:34 debug-device balena-supervisor[3089]: [success] Device state apply success
+```
+
+#### 8.2 Updating the Supervisor
+
+Occasionally, there are situations where the Supervisor requires an update. This
+may be because a device needs to use a new feature or because the version of
+the Supervisor on a device is outdated and is causing an issue. Usually the best
+way to achieve this is via a balenaOS update, either from the dashboard or via
+the command line on the device.
+
+If updating balenaOS is not desirable or a user prefers updating the Supervisor independently, this can easily be accomplished using [self-service](https://www.balena.io/docs/reference/supervisor/supervisor-upgrades/) Supervisor upgrades. Alternatively, this can be done programmatically by using the Node.js SDK method [device.setSupervisorRelease](https://www.balena.io/docs/reference/sdk/node-sdk/#devicesetsupervisorreleaseuuidorid-supervisorversionorid-%E2%87%92-codepromisecode).
+
+You can additionally write a script to manage this for a fleet of devices in combination with other SDK functions such as [device.getAll](https://www.balena.io/docs/reference/sdk/node-sdk/#devicegetalloptions-%E2%87%92-codepromisecode).
+
+**Note:** In order to update the Supervisor release for a device, you must have edit permissions on the device (i.e., more than just support access).
+
+#### 8.3 The Supervisor Database
+
+The Supervisor uses a SQLite database to store persistent state (so in the
+case of going offline, or a reboot, it knows exactly what state an
+app should be in, and which images, containers, volumes and networks
+to apply to it).
+
+This database is located at
+`/mnt/data/resin-data/balena-supervisor/database.sqlite` and can be accessed
+inside the Supervisor, most easily by running Node. Assuming you're logged
+into your device, run the following:
+
+```shell
+root@debug-device:~# balena exec -ti balena_supervisor node
+```
+
+This will get you into a Node interpreter in the Supervisor service
+container. From here, we can use the `sqlite3` NPM module used by
+the Supervisor to make requests to the database:
+
+```shell
+> sqlite3 = require('sqlite3');
+{
+  Database: [Function: Database],
+  Statement: [Function: Statement],
+  Backup: [Function: Backup],
+  OPEN_READONLY: 1,
+  OPEN_READWRITE: 2,
+  OPEN_CREATE: 4,
+  OPEN_FULLMUTEX: 65536,
+  OPEN_URI: 64,
+  OPEN_SHAREDCACHE: 131072,
+  OPEN_PRIVATECACHE: 262144,
+  VERSION: '3.30.1',
+  SOURCE_ID: '2019-10-10 20:19:45 18db032d058f1436ce3dea84081f4ee5a0f2259ad97301d43c426bc7f3df1b0b',
+  VERSION_NUMBER: 3030001,
+  OK: 0,
+  ERROR: 1,
+  INTERNAL: 2,
+  PERM: 3,
+  ABORT: 4,
+  BUSY: 5,
+  LOCKED: 6,
+  NOMEM: 7,
+  READONLY: 8,
+  INTERRUPT: 9,
+  IOERR: 10,
+  CORRUPT: 11,
+  NOTFOUND: 12,
+  FULL: 13,
+  CANTOPEN: 14,
+  PROTOCOL: 15,
+  EMPTY: 16,
+  SCHEMA: 17,
+  TOOBIG: 18,
+  CONSTRAINT: 19,
+  MISMATCH: 20,
+  MISUSE: 21,
+  NOLFS: 22,
+  AUTH: 23,
+  FORMAT: 24,
+  RANGE: 25,
+  NOTADB: 26,
+  cached: { Database: [Function: Database], objects: {} },
+  verbose: [Function]
+}
+> db = new sqlite3.Database('/data/database.sqlite');
+Database {
+  open: false,
+  filename: '/data/database.sqlite',
+  mode: 65542
+}
+```
+
+You can get a list of all the tables used by the Supervisor by issuing:
+
+```shell
+> db.all("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;", console.log);
+Database { open: true, filename: '/data/database.sqlite', mode: 65542 }
+> null [
+  { name: 'apiSecret' },
+  { name: 'app' },
+  { name: 'config' },
+  { name: 'containerLogs' },
+  { name: 'currentCommit' },
+  { name: 'dependentApp' },
+  { name: 'dependentAppTarget' },
+  { name: 'dependentDevice' },
+  { name: 'dependentDeviceTarget' },
+  { name: 'deviceConfig' },
+  { name: 'engineSnapshot' },
+  { name: 'image' },
+  { name: 'knex_migrations' },
+  { name: 'knex_migrations_lock' },
+  { name: 'logsChannelSecret' },
+  { name: 'sqlite_sequence' }
+]
+```
+
+With these, you can then examine and modify data, if required. Note that there's
+usually little reason to do so, but this is included for completeness. For
+example, to examine the configuration used by the Supervisor:
+
+```shell
+> db.all('SELECT * FROM config;', console.log);
+Database { open: true, filename: '/data/database.sqlite', mode: 65542 }
+> null [
+  { key: 'localMode', value: 'false' },
+  { key: 'initialConfigSaved', value: 'true' },
+  {
+    key: 'initialConfigReported',
+    value: 'https://api.balena-cloud.com'
+  },
+  { key: 'name', value: 'shy-rain' },
+  { key: 'targetStateSet', value: 'true' },
+  { key: 'delta', value: 'true' },
+  { key: 'deltaVersion', value: '3' }
+]
+```
+
+Occasionally, should the Supervisor get into a state where it is unable to
+determine which release images it should be downloading or running, it
+is necessary to clear the database. This usually goes hand-in-hand with removing
+the current containers and putting the Supervisor into a 'first boot' state,
+whilst keeping the Supervisor and release images. This can be achieved by
+carrying out the following:
+
+```shell
+root@debug-device:~# systemctl stop balena-supervisor.service update-balena-supervisor.timer
+root@debug-device:~# balena rm -f $(balena ps -aq)
+1db1d281a548
+6c5cde1581e5
+2a9f6e83578a
+root@debug-device:~# rm /mnt/data/resin-data/balena-supervisor/database.sqlite
+```
+
+This:
+
+- Stops the Supervisor (and the timer that will attempt to restart it).
+- Removes all current service containers (including the Supervisor).
+- Removes the Supervisor database.
+
+If for some reason the images also need to be removed, run
+`balena rmi -f $(balena images -q)`, which will remove all images _including_
+the Supervisor image.
+
+You can now restart the Supervisor:
+
+```shell
+root@debug-device:~# systemctl start update-balena-supervisor.timer balena-supervisor.service
+```
+
+If you deleted all the images, this will first download the Supervisor image
+again before restarting it.
+At this point, the Supervisor will start up as if the device has just been
+provisioned (though it will already be registered), and the release will
+be freshly downloaded (if the images were removed) before starting the service
+containers.
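+
+Because removing the database is irreversible, it may be worth keeping a copy
+for later investigation before running the `rm` step above. The following is
+a minimal sketch rather than an established procedure: the helper name
+`backup_supervisor_db` and the `.bak-<timestamp>` suffix are assumptions made
+for illustration. It is best run after the Supervisor has been stopped, so the
+file is not being written to while it is copied.
+
+```shell
+# Copy the Supervisor database aside before deleting it.
+# backup_supervisor_db and the ".bak-<timestamp>" suffix are illustrative.
+backup_supervisor_db() {
+    db_path="${1:-/mnt/data/resin-data/balena-supervisor/database.sqlite}"
+    backup_path="$db_path.bak-$(date -u +%Y%m%dT%H%M%SZ)"
+    # Stop early with a message if there is no database to copy.
+    [ -f "$db_path" ] || { echo "no database at $db_path" >&2; return 1; }
+    # -p preserves the mode and timestamps of the original file.
+    cp -p "$db_path" "$backup_path" && echo "$backup_path"
+}
+```
+
+With no argument, `backup_supervisor_db` copies the database from the path
+given earlier in this section and prints the location of the copy.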