# Working with the Supervisor

Service: `balena-supervisor.service`

The balena Supervisor is the service that manages the software release on a
device: it determines when to download updates, applies changes to variables,
ensures that services are restarted correctly, and so on. It is, in effect,
the on-device agent for balenaCloud.

As such, it's imperative that the Supervisor is operational and healthy at all
times, even when a device is not connected to the Internet, as it is still
responsible for keeping an offline device running.

The Supervisor itself is a Docker service that runs alongside any installed
user services and the healthcheck container (more on that later). One major
advantage of running it as a Docker service is that it can be updated just
like any other service (although actually carrying that out is slightly
different to updating user containers; see 'Updating the Supervisor').
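
Because the Supervisor runs as a container, you can see it alongside any user
services using balenaEngine's Docker-compatible CLI (a quick check; as in the
rest of this document, the container is named `balena_supervisor`):

```shell
root@debug-device:~# balena ps --filter "name=balena_supervisor"
```
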
Assuming you're still logged into your development device, run the following:

```shell
root@debug-device:~# systemctl status balena-supervisor
● balena-supervisor.service - Balena supervisor
     Loaded: loaded (/lib/systemd/system/balena-supervisor.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2022-08-19 18:08:59 UTC; 41s ago
    Process: 2296 ExecStartPre=/usr/bin/balena stop resin_supervisor (code=exited, status=1/FAILURE)
    Process: 2311 ExecStartPre=/usr/bin/balena stop balena_supervisor (code=exited, status=0/SUCCESS)
    Process: 2325 ExecStartPre=/bin/systemctl is-active balena.service (code=exited, status=0/SUCCESS)
   Main PID: 2326 (start-balena-su)
      Tasks: 10 (limit: 1878)
     Memory: 11.9M
     CGroup: /system.slice/balena-supervisor.service
             ├─2326 /bin/sh /usr/bin/start-balena-supervisor
             ├─2329 /proc/self/exe --healthcheck /usr/lib/balena-supervisor/balena-supervisor-healthcheck --pid 2326
             └─2486 balena start --attach balena_supervisor

Aug 19 18:09:07 debug-device balena-supervisor[2486]: [debug] Starting target state poll
Aug 19 18:09:07 debug-device balena-supervisor[2486]: [debug] Spawning journald with: chroot /mnt/root journalctl -a --follow -o json >
Aug 19 18:09:07 debug-device balena-supervisor[2486]: [debug] Finished applying target state
Aug 19 18:09:07 debug-device balena-supervisor[2486]: [success] Device state apply success
Aug 19 18:09:07 debug-device balena-supervisor[2486]: [info] Applying target state
Aug 19 18:09:07 debug-device balena-supervisor[2486]: [info] Reported current state to the cloud
Aug 19 18:09:07 debug-device balena-supervisor[2486]: [debug] Finished applying target state
Aug 19 18:09:07 debug-device balena-supervisor[2486]: [success] Device state apply success
Aug 19 18:09:17 debug-device balena-supervisor[2486]: [info] Internet Connectivity: OK
Aug 19 18:09:18 debug-device balena-supervisor[2486]: [info] Reported current state to the cloud
```

You can see the Supervisor is just another `systemd` service
(`balena-supervisor.service`), and that it is started and run by balenaEngine.

Supervisor issues, by their nature, vary quite significantly, and problems are
also commonly misattributed to the Supervisor. As the Supervisor is verbose
about its state and actions (such as the download of images), it tends to be
suspected of problems when in fact there are usually other underlying issues.
A few examples are:

- Networking problems - the Supervisor reports failed downloads or attempts to
  retrieve the same images repeatedly, where unstable networking is usually
  the real cause.
- Service container restarts - the default policy for service containers is to
  restart if they exit, and this is sometimes misunderstood. If a container is
  restarting, it's worth ensuring that the container itself isn't exiting,
  either due to a bug in the service container code or because it has
  legitimately come to the end of its running process (see the example after
  this list).
- Staged releases - a fleet/device has been pinned to a particular version,
  and a new push is not being downloaded.
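
To check whether a service container is exiting of its own accord, you can ask
balenaEngine for its restart count and last exit code (a quick sketch; the
container name here is illustrative, so use `balena ps` to find the real one):

```shell
root@debug-device:~# balena inspect --format 'restarts={{.RestartCount}} exit={{.State.ExitCode}}' my-service_1
```
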
It's _always_ worth considering how the system is configured, how releases
were produced, how the fleet or device is configured and what the current
networking state is when investigating Supervisor issues, to ensure that there
isn't something else amiss that the Supervisor is merely exposing via logging.

Another point to note is that the Supervisor is started using
[`healthdog`](https://github.com/balena-os/healthdog-rs), which continually
ensures that the Supervisor is present by using balenaEngine to find the
Supervisor image. If the image isn't present, or balenaEngine doesn't respond,
then the Supervisor is restarted. The default period for this check is 180
seconds at the time of writing, but inspect the
`/lib/systemd/system/balena-supervisor.service` file on-device to see what it
is for the device you're SSH'd into. For example, using our example device:

```shell
root@debug-device:~# cat /lib/systemd/system/balena-supervisor.service
[Unit]
Description=Balena supervisor
Requires=\
    resin\x2ddata.mount \
    balena-device-uuid.service \
    os-config-devicekey.service \
    bind-etc-balena-supervisor.service \
    extract-balena-ca.service
Wants=\
    migrate-supervisor-state.service
After=\
    balena.service \
    resin\x2ddata.mount \
    balena-device-uuid.service \
    os-config-devicekey.service \
    bind-etc-systemd-system-resin.target.wants.service \
    bind-etc-balena-supervisor.service \
    migrate-supervisor-state.service \
    extract-balena-ca.service
Wants=balena.service
ConditionPathExists=/etc/balena-supervisor/supervisor.conf

[Service]
Type=simple
Restart=always
RestartSec=10s
WatchdogSec=180
SyslogIdentifier=balena-supervisor
EnvironmentFile=/etc/balena-supervisor/supervisor.conf
EnvironmentFile=-/tmp/update-supervisor.conf
ExecStartPre=-/usr/bin/balena stop resin_supervisor
ExecStartPre=-/usr/bin/balena stop balena_supervisor
ExecStartPre=/bin/systemctl is-active balena.service
ExecStart=/usr/bin/healthdog --healthcheck=/usr/lib/balena-supervisor/balena-supervisor-healthcheck /usr/bin/start-balena-supervisor
ExecStop=-/usr/bin/balena stop balena_supervisor

[Install]
WantedBy=multi-user.target
Alias=resin-supervisor.service
```
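
To confirm the watchdog interval systemd is actually enforcing, you can query
the unit directly. You can also exercise the Supervisor's health endpoint
yourself; this is a sketch assuming the default Supervisor API port of 48484,
which a healthy device should answer with a 200:

```shell
root@debug-device:~# systemctl show balena-supervisor.service --property=WatchdogUSec
WatchdogUSec=3min
root@debug-device:~# curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:48484/v1/healthy
200
```
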
#### 8.1 Restarting the Supervisor

It's actually incredibly rare to _need_ a Supervisor restart. The Supervisor
will automatically attempt to recover from issues that occur, without
requiring a restart. If you've got to a point where you believe that a restart
is required, double check with the other agent on duty and, if required, with
the Supervisor maintainer or another knowledgeable engineer before doing so.

There are instances where the Supervisor is incorrectly restarted when in fact
the issue is down to corruption of service images, containers, volumes or
networking. In these cases, you're better off dealing with the underlying
balenaEngine to ensure that anything corrupt is recreated correctly. See the
balenaEngine section for more details.

If a restart is required, ensure that you have gathered as much information as
possible before restarting, including pertinent logs and symptoms, so that
investigations can occur asynchronously to determine what occurred and how it
may be mitigated in the future. Enabling persistent logging may also be of
benefit in cases where symptoms occur repeatedly.
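
For example, you might capture the Supervisor journal to persistent storage
before restarting, and enable persistent logging so logs survive reboots (a
sketch; on recent balenaOS releases `persistentLogging` is the relevant
`config.json` flag, and it only takes effect on the next boot):

```shell
root@debug-device:~# journalctl -u balena-supervisor -a --no-pager > /mnt/data/supervisor-journal.txt
root@debug-device:~# jq '.persistentLogging = true' /mnt/boot/config.json > /tmp/config.json \
    && mv /tmp/config.json /mnt/boot/config.json
```
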
To restart the Supervisor, simply restart the `systemd` service:

```shell
root@debug-device:~# systemctl restart balena-supervisor.service
root@debug-device:~# systemctl status balena-supervisor.service
● balena-supervisor.service - Balena supervisor
     Loaded: loaded (/lib/systemd/system/balena-supervisor.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2022-08-19 18:13:28 UTC; 10s ago
    Process: 3013 ExecStartPre=/usr/bin/balena stop resin_supervisor (code=exited, status=1/FAILURE)
    Process: 3021 ExecStartPre=/usr/bin/balena stop balena_supervisor (code=exited, status=0/SUCCESS)
    Process: 3030 ExecStartPre=/bin/systemctl is-active balena.service (code=exited, status=0/SUCCESS)
   Main PID: 3031 (start-balena-su)
      Tasks: 11 (limit: 1878)
     Memory: 11.8M
     CGroup: /system.slice/balena-supervisor.service
             ├─3031 /bin/sh /usr/bin/start-balena-supervisor
             ├─3032 /proc/self/exe --healthcheck /usr/lib/balena-supervisor/balena-supervisor-healthcheck --pid 3031
             └─3089 balena start --attach balena_supervisor

Aug 19 18:13:33 debug-device balena-supervisor[3089]: [info] Waiting for connectivity...
Aug 19 18:13:33 debug-device balena-supervisor[3089]: [debug] Starting current state report
Aug 19 18:13:33 debug-device balena-supervisor[3089]: [debug] Starting target state poll
Aug 19 18:13:33 debug-device balena-supervisor[3089]: [debug] Spawning journald with: chroot /mnt/root journalctl -a --follow -o json >
Aug 19 18:13:33 debug-device balena-supervisor[3089]: [debug] Finished applying target state
Aug 19 18:13:33 debug-device balena-supervisor[3089]: [success] Device state apply success
Aug 19 18:13:34 debug-device balena-supervisor[3089]: [info] Applying target state
Aug 19 18:13:34 debug-device balena-supervisor[3089]: [info] Reported current state to the cloud
Aug 19 18:13:34 debug-device balena-supervisor[3089]: [debug] Finished applying target state
Aug 19 18:13:34 debug-device balena-supervisor[3089]: [success] Device state apply success
```

#### 8.2 Updating the Supervisor

Occasionally, there are situations where the Supervisor requires an update.
This may be because a device needs to use a new feature or because the version
of the Supervisor on a device is outdated and is causing an issue. Usually the
best way to achieve this is via a balenaOS update, either from the dashboard
or via the command line on the device.

If updating balenaOS is not desirable, or a user prefers updating the
Supervisor independently, this can easily be accomplished using
[self-service](https://www.balena.io/docs/reference/supervisor/supervisor-upgrades/)
Supervisor upgrades. Alternatively, this can be done programmatically using
the Node.js SDK method
[device.setSupervisorRelease](https://www.balena.io/docs/reference/sdk/node-sdk/#devicesetsupervisorreleaseuuidorid-supervisorversionorid-%E2%87%92-codepromisecode).

You can additionally write a script to manage this for a fleet of devices in
combination with other SDK functions such as
[device.getAll](https://www.balena.io/docs/reference/sdk/node-sdk/#devicegetalloptions-%E2%87%92-codepromisecode),
as sketched below.
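
As a minimal sketch of such a script, run from a workstation with the
`balena-sdk` package installed (the fleet slug, target Supervisor version and
`BALENA_API_KEY` environment variable are illustrative, and a recent Node REPL
with top-level `await` is assumed):

```shell
$ node
> const { getSdk } = require('balena-sdk');
> const balena = getSdk();
> await balena.auth.loginWithToken(process.env.BALENA_API_KEY);
> // Pin every device in the fleet to the same target Supervisor release
> for (const device of await balena.models.device.getAllByApplication('myorg/myfleet')) {
...   await balena.models.device.setSupervisorRelease(device.uuid, 'v14.11.5');
... }
```
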
**Note:** In order to update the Supervisor release for a device, you must have edit permissions on the device (i.e., more than just support access).

#### 8.3 The Supervisor Database

The Supervisor uses a SQLite database to store persistent state (so in the
case of going offline, or a reboot, it knows exactly what state an app should
be in, and which images, containers, volumes and networks to apply to it).

This database is located at
`/mnt/data/resin-data/balena-supervisor/database.sqlite` and can be accessed
from inside the Supervisor container, most easily by running Node. Assuming
you're logged into your device, run the following:

```shell
root@debug-device:~# balena exec -ti balena_supervisor node
```

This will get you into a Node interpreter in the Supervisor service container.
From here, we can use the `sqlite3` NPM module used by the Supervisor to make
requests to the database:

```shell
> sqlite3 = require('sqlite3');
{
  Database: [Function: Database],
  Statement: [Function: Statement],
  Backup: [Function: Backup],
  OPEN_READONLY: 1,
  OPEN_READWRITE: 2,
  OPEN_CREATE: 4,
  OPEN_FULLMUTEX: 65536,
  OPEN_URI: 64,
  OPEN_SHAREDCACHE: 131072,
  OPEN_PRIVATECACHE: 262144,
  VERSION: '3.30.1',
  SOURCE_ID: '2019-10-10 20:19:45 18db032d058f1436ce3dea84081f4ee5a0f2259ad97301d43c426bc7f3df1b0b',
  VERSION_NUMBER: 3030001,
  OK: 0,
  ERROR: 1,
  INTERNAL: 2,
  PERM: 3,
  ABORT: 4,
  BUSY: 5,
  LOCKED: 6,
  NOMEM: 7,
  READONLY: 8,
  INTERRUPT: 9,
  IOERR: 10,
  CORRUPT: 11,
  NOTFOUND: 12,
  FULL: 13,
  CANTOPEN: 14,
  PROTOCOL: 15,
  EMPTY: 16,
  SCHEMA: 17,
  TOOBIG: 18,
  CONSTRAINT: 19,
  MISMATCH: 20,
  MISUSE: 21,
  NOLFS: 22,
  AUTH: 23,
  FORMAT: 24,
  RANGE: 25,
  NOTADB: 26,
  cached: { Database: [Function: Database], objects: {} },
  verbose: [Function]
}
> db = new sqlite3.Database('/data/database.sqlite');
Database {
  open: false,
  filename: '/data/database.sqlite',
  mode: 65542
}
```

You can get a list of all the tables used by the Supervisor by issuing:

```shell
> db.all("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;", console.log);
Database { open: true, filename: '/data/database.sqlite', mode: 65542 }
> null [
  { name: 'apiSecret' },
  { name: 'app' },
  { name: 'config' },
  { name: 'containerLogs' },
  { name: 'currentCommit' },
  { name: 'dependentApp' },
  { name: 'dependentAppTarget' },
  { name: 'dependentDevice' },
  { name: 'dependentDeviceTarget' },
  { name: 'deviceConfig' },
  { name: 'engineSnapshot' },
  { name: 'image' },
  { name: 'knex_migrations' },
  { name: 'knex_migrations_lock' },
  { name: 'logsChannelSecret' },
  { name: 'sqlite_sequence' }
]
```

With these, you can then examine and modify data, if required. Note that
there's usually little reason to do so, but this is included for completeness.
For example, to examine the configuration used by the Supervisor:

```shell
> db.all('SELECT * FROM config;', console.log);
Database { open: true, filename: '/data/database.sqlite', mode: 65542 }
> null [
  { key: 'localMode', value: 'false' },
  { key: 'initialConfigSaved', value: 'true' },
  {
    key: 'initialConfigReported',
    value: 'https://api.balena-cloud.com'
  },
  { key: 'name', value: 'shy-rain' },
  { key: 'targetStateSet', value: 'true' },
  { key: 'delta', value: 'true' },
  { key: 'deltaVersion', value: '3' }
]
```
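
Modifying data follows the same pattern using `db.run`. As a sketch only (the
new value here is purely illustrative, and hand-editing Supervisor state can
leave a device in an inconsistent state):

```shell
> db.run("UPDATE config SET value = 'shy-rain-2' WHERE key = 'name';", console.log);
Database { open: true, filename: '/data/database.sqlite', mode: 65542 }
> null
```
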
Occasionally, should the Supervisor get into a state where it is unable to
determine which release images it should be downloading or running, it is
necessary to clear the database. This usually goes hand-in-hand with removing
the current containers and putting the Supervisor into a 'first boot' state,
whilst keeping the Supervisor and release images. This can be achieved by
carrying out the following:

```shell
root@debug-device:~# systemctl stop balena-supervisor.service update-balena-supervisor.timer
root@debug-device:~# balena rm -f $(balena ps -aq)
1db1d281a548
6c5cde1581e5
2a9f6e83578a
root@debug-device:~# rm /mnt/data/resin-data/balena-supervisor/database.sqlite
```

This:

- Stops the Supervisor (and the timer that will attempt to restart it).
- Removes all current service containers (including the Supervisor).
- Removes the Supervisor database.

(If for some reason the images also need to be removed, run
`balena rmi -f $(balena images -q)`, which will remove all images _including_
the Supervisor image.)

You can now restart the Supervisor:

```shell
root@debug-device:~# systemctl start update-balena-supervisor.timer balena-supervisor.service
```

If you deleted all the images, this will first download the Supervisor image
again before restarting it.

At this point, the Supervisor will start up as if the device has just been
provisioned (though it will already be registered), and the release will be
freshly downloaded (if the images were removed) before starting the service
containers.