# Working with the Supervisor

Service: `balena-supervisor.service`

The balena Supervisor is the service that manages the software release on a
device: it determines when to download updates, applies changes to variables,
ensures that services are restarted correctly, and so on. It is, in effect,
the on-device agent for balenaCloud.

As such, it's imperative that the Supervisor is operational and healthy at all
times, even when a device is not connected to the Internet, as it is still
responsible for keeping an offline device running.

The Supervisor itself is a Docker service that runs alongside any installed
user services and the healthcheck container (more on that later). One major
advantage of running it as a Docker service is that it can be updated just
like any other service (although actually carrying that out is slightly
different to updating user containers; see 'Updating the Supervisor').
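
Because the Supervisor runs as a container, you can see it alongside any user
services using balenaEngine's Docker-compatible CLI (a quick check; as in the
rest of this document, the container is named `balena_supervisor`):

```shell
root@debug-device:~# balena ps --filter "name=balena_supervisor"
```
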
Assuming you're still logged into your development device, run the following:

```shell
root@debug-device:~# systemctl status balena-supervisor
● balena-supervisor.service - Balena supervisor
     Loaded: loaded (/lib/systemd/system/balena-supervisor.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2022-08-19 18:08:59 UTC; 41s ago
    Process: 2296 ExecStartPre=/usr/bin/balena stop resin_supervisor (code=exited, status=1/FAILURE)
    Process: 2311 ExecStartPre=/usr/bin/balena stop balena_supervisor (code=exited, status=0/SUCCESS)
    Process: 2325 ExecStartPre=/bin/systemctl is-active balena.service (code=exited, status=0/SUCCESS)
   Main PID: 2326 (start-balena-su)
      Tasks: 10 (limit: 1878)
     Memory: 11.9M
     CGroup: /system.slice/balena-supervisor.service
             ├─2326 /bin/sh /usr/bin/start-balena-supervisor
             ├─2329 /proc/self/exe --healthcheck /usr/lib/balena-supervisor/balena-supervisor-healthcheck --pid 2326
             └─2486 balena start --attach balena_supervisor

Aug 19 18:09:07 debug-device balena-supervisor[2486]: [debug] Starting target state poll
Aug 19 18:09:07 debug-device balena-supervisor[2486]: [debug] Spawning journald with: chroot /mnt/root journalctl -a --follow -o json >
Aug 19 18:09:07 debug-device balena-supervisor[2486]: [debug] Finished applying target state
Aug 19 18:09:07 debug-device balena-supervisor[2486]: [success] Device state apply success
Aug 19 18:09:07 debug-device balena-supervisor[2486]: [info] Applying target state
Aug 19 18:09:07 debug-device balena-supervisor[2486]: [info] Reported current state to the cloud
Aug 19 18:09:07 debug-device balena-supervisor[2486]: [debug] Finished applying target state
Aug 19 18:09:07 debug-device balena-supervisor[2486]: [success] Device state apply success
Aug 19 18:09:17 debug-device balena-supervisor[2486]: [info] Internet Connectivity: OK
Aug 19 18:09:18 debug-device balena-supervisor[2486]: [info] Reported current state to the cloud
```

You can see the Supervisor is just another `systemd` service
(`balena-supervisor.service`), and that it is started and run by balenaEngine.

Supervisor issues, by their nature, vary quite significantly, and problems are
also commonly misattributed to the Supervisor. As the Supervisor is verbose
about its state and actions (such as the download of images), it tends to be
suspected of problems when in fact there are usually other underlying issues.
A few examples are:

- Networking problems - the Supervisor reports failed downloads or attempts to
  retrieve the same images repeatedly, where unstable networking is usually
  the real cause.
- Service container restarts - the default policy for service containers is to
  restart if they exit, and this is sometimes misunderstood. If a container is
  restarting, it's worth ensuring that the container itself isn't exiting,
  either due to a bug in the service container code or because it has
  legitimately come to the end of its running process (see the example after
  this list).
- Staged releases - a fleet/device has been pinned to a particular version,
  and a new push is not being downloaded.
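
To check whether a service container is exiting of its own accord, you can ask
balenaEngine for its restart count and last exit code (a quick sketch; the
container name here is illustrative, so use `balena ps` to find the real one):

```shell
root@debug-device:~# balena inspect --format 'restarts={{.RestartCount}} exit={{.State.ExitCode}}' my-service_1
```
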
It's _always_ worth considering how the system is configured, how releases
were produced, how the fleet or device is configured and what the current
networking state is when investigating Supervisor issues, to ensure that there
isn't something else amiss that the Supervisor is merely exposing via logging.

Another point to note is that the Supervisor is started using
[`healthdog`](https://github.com/balena-os/healthdog-rs), which continually
ensures that the Supervisor is present by using balenaEngine to find the
Supervisor image. If the image isn't present, or balenaEngine doesn't respond,
then the Supervisor is restarted. The default period for this check is 180
seconds at the time of writing, but inspect the
`/lib/systemd/system/balena-supervisor.service` file on-device to see what it
is for the device you're SSH'd into. For example, using our example device:

```shell
root@debug-device:~# cat /lib/systemd/system/balena-supervisor.service
[Unit]
Description=Balena supervisor
Requires=\
    resin\x2ddata.mount \
    balena-device-uuid.service \
    os-config-devicekey.service \
    bind-etc-balena-supervisor.service \
    extract-balena-ca.service
Wants=\
    migrate-supervisor-state.service
After=\
    balena.service \
    resin\x2ddata.mount \
    balena-device-uuid.service \
    os-config-devicekey.service \
    bind-etc-systemd-system-resin.target.wants.service \
    bind-etc-balena-supervisor.service \
    migrate-supervisor-state.service \
    extract-balena-ca.service
Wants=balena.service
ConditionPathExists=/etc/balena-supervisor/supervisor.conf

[Service]
Type=simple
Restart=always
RestartSec=10s
WatchdogSec=180
SyslogIdentifier=balena-supervisor
EnvironmentFile=/etc/balena-supervisor/supervisor.conf
EnvironmentFile=-/tmp/update-supervisor.conf
ExecStartPre=-/usr/bin/balena stop resin_supervisor
ExecStartPre=-/usr/bin/balena stop balena_supervisor
ExecStartPre=/bin/systemctl is-active balena.service
ExecStart=/usr/bin/healthdog --healthcheck=/usr/lib/balena-supervisor/balena-supervisor-healthcheck /usr/bin/start-balena-supervisor
ExecStop=-/usr/bin/balena stop balena_supervisor

[Install]
WantedBy=multi-user.target
Alias=resin-supervisor.service
```
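
To confirm the watchdog interval systemd is actually enforcing, you can query
the unit directly. You can also exercise the Supervisor's health endpoint
yourself; this is a sketch assuming the default Supervisor API port of 48484,
which a healthy device should answer with a 200:

```shell
root@debug-device:~# systemctl show balena-supervisor.service --property=WatchdogUSec
WatchdogUSec=3min
root@debug-device:~# curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:48484/v1/healthy
200
```
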
#### 8.1 Restarting the Supervisor

It's actually incredibly rare to _need_ a Supervisor restart. The Supervisor
will automatically attempt to recover from issues that occur, without
requiring a restart. If you've got to a point where you believe that a restart
is required, double check with the other agent on duty and, if required, with
the Supervisor maintainer or another knowledgeable engineer before doing so.

There are instances where the Supervisor is incorrectly restarted when in fact
the issue is down to corruption of service images, containers, volumes or
networking. In these cases, you're better off dealing with the underlying
balenaEngine to ensure that anything corrupt is recreated correctly. See the
balenaEngine section for more details.

If a restart is required, ensure that you have gathered as much information as
possible before restarting, including pertinent logs and symptoms, so that
investigations can occur asynchronously to determine what occurred and how it
may be mitigated in the future. Enabling persistent logging may also be of
benefit in cases where symptoms occur repeatedly.
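
For example, you might capture the Supervisor journal to persistent storage
before restarting, and enable persistent logging so logs survive reboots (a
sketch; on recent balenaOS releases `persistentLogging` is the relevant
`config.json` flag, and it only takes effect on the next boot):

```shell
root@debug-device:~# journalctl -u balena-supervisor -a --no-pager > /mnt/data/supervisor-journal.txt
root@debug-device:~# jq '.persistentLogging = true' /mnt/boot/config.json > /tmp/config.json \
    && mv /tmp/config.json /mnt/boot/config.json
```
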
To restart the Supervisor, simply restart the `systemd` service:

```shell
root@debug-device:~# systemctl restart balena-supervisor.service
root@debug-device:~# systemctl status balena-supervisor.service
● balena-supervisor.service - Balena supervisor
     Loaded: loaded (/lib/systemd/system/balena-supervisor.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2022-08-19 18:13:28 UTC; 10s ago
    Process: 3013 ExecStartPre=/usr/bin/balena stop resin_supervisor (code=exited, status=1/FAILURE)
    Process: 3021 ExecStartPre=/usr/bin/balena stop balena_supervisor (code=exited, status=0/SUCCESS)
    Process: 3030 ExecStartPre=/bin/systemctl is-active balena.service (code=exited, status=0/SUCCESS)
   Main PID: 3031 (start-balena-su)
      Tasks: 11 (limit: 1878)
     Memory: 11.8M
     CGroup: /system.slice/balena-supervisor.service
             ├─3031 /bin/sh /usr/bin/start-balena-supervisor
             ├─3032 /proc/self/exe --healthcheck /usr/lib/balena-supervisor/balena-supervisor-healthcheck --pid 3031
             └─3089 balena start --attach balena_supervisor

Aug 19 18:13:33 debug-device balena-supervisor[3089]: [info] Waiting for connectivity...
Aug 19 18:13:33 debug-device balena-supervisor[3089]: [debug] Starting current state report
Aug 19 18:13:33 debug-device balena-supervisor[3089]: [debug] Starting target state poll
Aug 19 18:13:33 debug-device balena-supervisor[3089]: [debug] Spawning journald with: chroot /mnt/root journalctl -a --follow -o json >
Aug 19 18:13:33 debug-device balena-supervisor[3089]: [debug] Finished applying target state
Aug 19 18:13:33 debug-device balena-supervisor[3089]: [success] Device state apply success
Aug 19 18:13:34 debug-device balena-supervisor[3089]: [info] Applying target state
Aug 19 18:13:34 debug-device balena-supervisor[3089]: [info] Reported current state to the cloud
Aug 19 18:13:34 debug-device balena-supervisor[3089]: [debug] Finished applying target state
Aug 19 18:13:34 debug-device balena-supervisor[3089]: [success] Device state apply success
```

#### 8.2 Updating the Supervisor

Occasionally, there are situations where the Supervisor requires an update.
This may be because a device needs to use a new feature or because the version
of the Supervisor on a device is outdated and is causing an issue. Usually the
best way to achieve this is via a balenaOS update, either from the dashboard
or via the command line on the device.

If updating balenaOS is not desirable, or a user prefers updating the
Supervisor independently, this can easily be accomplished using
[self-service](https://www.balena.io/docs/reference/supervisor/supervisor-upgrades/)
Supervisor upgrades. Alternatively, this can be done programmatically using
the Node.js SDK method
[device.setSupervisorRelease](https://www.balena.io/docs/reference/sdk/node-sdk/#devicesetsupervisorreleaseuuidorid-supervisorversionorid-%E2%87%92-codepromisecode).

You can additionally write a script to manage this for a fleet of devices in
combination with other SDK functions such as
[device.getAll](https://www.balena.io/docs/reference/sdk/node-sdk/#devicegetalloptions-%E2%87%92-codepromisecode),
as sketched below.
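
As a minimal sketch of such a script, run from a workstation with the
`balena-sdk` package installed (the fleet slug, target Supervisor version and
`BALENA_API_KEY` environment variable are illustrative, and a recent Node REPL
with top-level `await` is assumed):

```shell
$ node
> const { getSdk } = require('balena-sdk');
> const balena = getSdk();
> await balena.auth.loginWithToken(process.env.BALENA_API_KEY);
> // Pin every device in the fleet to the same target Supervisor release
> for (const device of await balena.models.device.getAllByApplication('myorg/myfleet')) {
...   await balena.models.device.setSupervisorRelease(device.uuid, 'v14.11.5');
... }
```
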
**Note:** In order to update the Supervisor release for a device, you must have edit permissions on the device (i.e., more than just support access).

#### 8.3 The Supervisor Database

The Supervisor uses a SQLite database to store persistent state (so in the
case of going offline, or a reboot, it knows exactly what state an app should
be in, and which images, containers, volumes and networks to apply to it).

This database is located at
`/mnt/data/resin-data/balena-supervisor/database.sqlite` and can be accessed
from inside the Supervisor container, most easily by running Node. Assuming
you're logged into your device, run the following:

```shell
root@debug-device:~# balena exec -ti balena_supervisor node
```

This will get you into a Node interpreter in the Supervisor service container.
From here, we can use the `sqlite3` NPM module used by the Supervisor to make
requests to the database:

```shell
> sqlite3 = require('sqlite3');
{
  Database: [Function: Database],
  Statement: [Function: Statement],
  Backup: [Function: Backup],
  OPEN_READONLY: 1,
  OPEN_READWRITE: 2,
  OPEN_CREATE: 4,
  OPEN_FULLMUTEX: 65536,
  OPEN_URI: 64,
  OPEN_SHAREDCACHE: 131072,
  OPEN_PRIVATECACHE: 262144,
  VERSION: '3.30.1',
  SOURCE_ID: '2019-10-10 20:19:45 18db032d058f1436ce3dea84081f4ee5a0f2259ad97301d43c426bc7f3df1b0b',
  VERSION_NUMBER: 3030001,
  OK: 0,
  ERROR: 1,
  INTERNAL: 2,
  PERM: 3,
  ABORT: 4,
  BUSY: 5,
  LOCKED: 6,
  NOMEM: 7,
  READONLY: 8,
  INTERRUPT: 9,
  IOERR: 10,
  CORRUPT: 11,
  NOTFOUND: 12,
  FULL: 13,
  CANTOPEN: 14,
  PROTOCOL: 15,
  EMPTY: 16,
  SCHEMA: 17,
  TOOBIG: 18,
  CONSTRAINT: 19,
  MISMATCH: 20,
  MISUSE: 21,
  NOLFS: 22,
  AUTH: 23,
  FORMAT: 24,
  RANGE: 25,
  NOTADB: 26,
  cached: { Database: [Function: Database], objects: {} },
  verbose: [Function]
}
> db = new sqlite3.Database('/data/database.sqlite');
Database {
  open: false,
  filename: '/data/database.sqlite',
  mode: 65542
}
```

You can get a list of all the tables used by the Supervisor by issuing:

```shell
> db.all("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;", console.log);
Database { open: true, filename: '/data/database.sqlite', mode: 65542 }
> null [
  { name: 'apiSecret' },
  { name: 'app' },
  { name: 'config' },
  { name: 'containerLogs' },
  { name: 'currentCommit' },
  { name: 'dependentApp' },
  { name: 'dependentAppTarget' },
  { name: 'dependentDevice' },
  { name: 'dependentDeviceTarget' },
  { name: 'deviceConfig' },
  { name: 'engineSnapshot' },
  { name: 'image' },
  { name: 'knex_migrations' },
  { name: 'knex_migrations_lock' },
  { name: 'logsChannelSecret' },
  { name: 'sqlite_sequence' }
]
```

With these, you can then examine and modify data, if required. Note that
there's usually little reason to do so, but this is included for completeness.
For example, to examine the configuration used by the Supervisor:

```shell
> db.all('SELECT * FROM config;', console.log);
Database { open: true, filename: '/data/database.sqlite', mode: 65542 }
> null [
  { key: 'localMode', value: 'false' },
  { key: 'initialConfigSaved', value: 'true' },
  {
    key: 'initialConfigReported',
    value: 'https://api.balena-cloud.com'
  },
  { key: 'name', value: 'shy-rain' },
  { key: 'targetStateSet', value: 'true' },
  { key: 'delta', value: 'true' },
  { key: 'deltaVersion', value: '3' }
]
```
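
Modifying data follows the same pattern using `db.run`. As a sketch only (the
new value here is purely illustrative, and hand-editing Supervisor state can
leave a device in an inconsistent state):

```shell
> db.run("UPDATE config SET value = 'shy-rain-2' WHERE key = 'name';", console.log);
Database { open: true, filename: '/data/database.sqlite', mode: 65542 }
> null
```
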
Occasionally, should the Supervisor get into a state where it is unable to
determine which release images it should be downloading or running, it is
necessary to clear the database. This usually goes hand-in-hand with removing
the current containers and putting the Supervisor into a 'first boot' state,
whilst keeping the Supervisor and release images. This can be achieved by
carrying out the following:

```shell
root@debug-device:~# systemctl stop balena-supervisor.service update-balena-supervisor.timer
root@debug-device:~# balena rm -f $(balena ps -aq)
1db1d281a548
6c5cde1581e5
2a9f6e83578a
root@debug-device:~# rm /mnt/data/resin-data/balena-supervisor/database.sqlite
```

This:

- Stops the Supervisor (and the timer that will attempt to restart it).
- Removes all current service containers (including the Supervisor).
- Removes the Supervisor database.

(If for some reason the images also need to be removed, run
`balena rmi -f $(balena images -q)`, which will remove all images _including_
the Supervisor image.)

You can now restart the Supervisor:

```shell
root@debug-device:~# systemctl start update-balena-supervisor.timer balena-supervisor.service
```

If you deleted all the images, this will first download the Supervisor image
again before restarting it.

At this point, the Supervisor will start up as if the device has just been
provisioned (though it will already be registered), and the release will be
freshly downloaded (if the images were removed) before starting the service
containers.