37 Commits

Author SHA1 Message Date
d2faf7c66d Fix case of logger format string specifier (#1160)
Fix a log statement with an invalid format string specifier. At runtime, the invalid specifier causes the service to throw a `ValueError`. This is typically invoked in the `agent_can_schedule` function [here](https://github.com/microsoft/onefuzz/blob/main/src/api-service/__app__/agent_can_schedule/__init__.py#L33).
2021-08-23 14:37:01 +00:00
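
As a rough illustration of the failure mode described in #1160 (not the actual OneFuzz code), Python's `%`-style formatting rejects an upper-case `%S` where `%s` was intended. The helper names and message text below are hypothetical, and whether the error propagates out of a real log call depends on the logging setup in use.

```python
def format_message(template: str, value: str) -> str:
    """Apply %-style formatting, as a log call would apply it to its arguments."""
    return template % value


def try_format(template: str, value: str) -> str:
    """Return the formatted message, or a description of the formatting error."""
    try:
        return format_message(template, value)
    except ValueError as err:
        # An unsupported conversion character such as "%S" ends up here.
        return f"ValueError: {err}"


print(try_format("node %s can schedule", "abc"))  # valid specifier
print(try_format("node %S can schedule", "abc"))  # invalid specifier
```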
9ec7e7a20a process all expired nodes rather than those not already marked for deletion (#1103)
This makes sure debug_keep_node is reset and the rest of the reimage processing occurs regardless of reimage_requested and delete_requested being set.

Without this, nodes that are marked `debug_keep_node` do not get reimaged/deleted.
2021-07-27 00:53:04 +00:00
55366e751a allow pools & scalesets set to shutdown to halt (#1104)
Currently, if a pool or scaleset is set to `shutdown`, it cannot be set to `halt`.

While moving from `halt` to `shutdown` would cause issues, moving from `shutdown` to `halt` is fine.
2021-07-23 13:14:47 +00:00
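
The one-way rule described in #1104 can be sketched as a small transition table. This is a minimal sketch under assumptions: the `State` enum, the `ALLOWED` table, and `can_transition` are hypothetical names, and the real pool and scaleset models have more states than shown here.

```python
from enum import Enum


class State(Enum):
    running = "running"
    shutdown = "shutdown"
    halt = "halt"


# Hypothetical transition table: shutdown may escalate to halt,
# but halt may never be downgraded back to shutdown.
ALLOWED = {
    State.running: {State.shutdown, State.halt},
    State.shutdown: {State.halt},
    State.halt: set(),
}


def can_transition(current: State, requested: State) -> bool:
    """Return True if moving from `current` to `requested` is permitted."""
    return requested in ALLOWED[current]
```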
89b7d13125 Fix get_dead_nodes query (#1054) 2021-07-09 13:33:42 -04:00
826ef8dd22 Pool shrink queue (#1050) 2021-07-08 10:23:54 -04:00
45d468f2ce set pool_id on node creation (#1049) 2021-07-07 17:58:24 -04:00
52f83b5b26 add EventScalesetResizeScheduled (#1047) 2021-07-07 14:15:26 -04:00
7b2679a1ce make ShrinkQueue not scaleset specific (#1046) 2021-07-07 13:27:49 -04:00
883c93aaf4 ensure VM IDs are unique before calling Azure reimage/delete APIs (#1023) 2021-06-25 11:54:52 -04:00
5f8e423265 remove nodes from db upon reimage (#1005)
The flag `Node.reimage_queued` is intended to stop nodes from reimaging repeatedly.  

In #970, to work around Azure API failures, this flag was cycled if the node was already set to cleanup. Unfortunately, reimaging can sometimes take a significant amount of time, causing this change to reimage nodes multiple times.

Instead of using `reimage_queued` as a flag, this PR deletes the node from the storage table upon reimage.  When the node registers OR the next time through `Scaleset.cleanup_nodes`, the Node will be recreated automatically, whichever comes first.
2021-06-23 22:25:15 +00:00
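
The approach in #1005 can be pictured with a plain dictionary standing in for the storage table. This is only a sketch under assumptions: `reimage`, `ensure_node`, and `cleanup_nodes` are hypothetical stand-ins, not the service's actual functions, and the record contents are invented for illustration.

```python
from typing import Dict, Iterable

# Hypothetical in-memory stand-in for the node storage table.
NodeTable = Dict[str, dict]


def reimage(table: NodeTable, machine_id: str) -> None:
    """Drop the node record instead of setting a `reimage_queued` flag."""
    table.pop(machine_id, None)


def ensure_node(table: NodeTable, machine_id: str) -> dict:
    """Recreate the record if missing (on node registration or during cleanup)."""
    return table.setdefault(machine_id, {"state": "init"})


def cleanup_nodes(table: NodeTable, vm_ids: Iterable[str]) -> None:
    """Make sure every VM reported for the scaleset has a node record."""
    for machine_id in vm_ids:
        ensure_node(table, machine_id)
```

Because the record is gone while the reimage is in flight, a second cleanup pass has nothing to reimage again until the node is recreated.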
50652c2e48 mark tasks as failed when the node is being reimaged due to heartbeat issues (#1015) 2021-06-23 16:39:47 -04:00
b9950c5526 update log messages to ease debugging (#988) 2021-06-14 15:18:03 -04:00
bcdae2d5cb Check scaleset size for missing nodes (#984) 2021-06-11 18:47:21 -04:00
2be1edd9dc handle reimaging failures by resetting reimage_queued (#970)
In a previous commit, `reimage_queued` was added to prevent reimaging a node while it is already reimaging. However, this means nodes whose reimaging fails due to Azure issues never finish reimaging.

This resets the flag, allowing the node to reimage in the following cleanup cycle.
2021-06-09 18:58:56 +00:00
af39d25a7d reimage/delete expired nodes even with the debug_keep_node flag (#968)
Fixes #965
2021-06-08 17:37:10 +00:00
b761908409 send NodeCommandStopIfFree on node shutdown (#940)
If we move to shut down a single node but it's not doing work, it will
wait until it picks up work to shut down. This shortcuts that.
2021-06-01 15:03:33 +00:00
8b74d08d3d fix deleting nodes with expired heartbeats (#930) 2021-05-26 13:06:44 -04:00
ff140a6b1b Stop tasks on nodes before deleting task queues (#801) 2021-05-17 18:59:13 +00:00
cb5e786bcd add event for scaleset state updates (#882)
This moves all scaleset state updates through `Scaleset.set_state` and adds a new event EventScalesetStateUpdated.
2021-05-13 21:23:02 +00:00
584f68065d cleanup a handful of scaleset logs (#880) 2021-05-12 17:31:08 -04:00
221a3316a1 Add StopIfFree node command to tell free nodes to stop asking for new work (#866) 2021-05-07 13:55:50 -04:00
007ecf2efe shutdown missing scalesets during resize (#860) 2021-05-06 12:00:09 -04:00
ced21b2ea3 Add node messages to node get (#836)
This exposes the node commands that have yet to be processed by the node. Example use case: the SDK can now ask "has this node installed my SSH key?"
2021-04-26 16:14:58 -04:00
f4b5c1ae73 when processing node updates, don't wait on the node in cases it should be stopped (#834)
In situations when the node should be done, mark it as done without
waiting for the node to respond to the Done command.
2021-04-26 15:19:46 -04:00
cf3d904940 address formatting from black 21.4b0 (#831) 2021-04-26 12:35:16 -04:00
80b3533f83 Report the setup failure in the task when available (#781) 2021-04-09 08:57:56 -04:00
3096f99e86 enable using ephemeral disks by default (#461) 2021-03-30 18:48:44 -04:00
a3fdc74c53 handle exception related to manually deleted scalesets (#672)
If a user manually deletes a scaleset managed by OneFuzz, then `get_vmss_size` returns None.

When this happens, `Scaleset.shutdown` generates an exception from the `logging.info` call on line 573.

This PR handles this edge condition.
2021-03-15 14:18:59 +00:00
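
The edge case in #672 amounts to guarding against a missing size before using it. The sketch below assumes a size lookup that returns `None` for a deleted scaleset, as the commit describes for `get_vmss_size`; the function `describe_shutdown` and its messages are hypothetical and only illustrate the guard.

```python
from typing import Optional


def describe_shutdown(scaleset_id: str, size: Optional[int]) -> str:
    """Build a shutdown log message, tolerating a scaleset that no longer exists."""
    if size is None:
        # The scaleset was removed outside the service; there is no size to report.
        return f"scaleset {scaleset_id} no longer exists, nothing to shut down"
    return f"scaleset {scaleset_id} shutting down, {size} nodes remaining"
```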
fb482e357e don't schedule work to a node if the scaleset or pool is shutting down (#583) 2021-02-23 13:33:41 -05:00
feb80ecb54 allow nodes with multiple tasks to continue on task stop (#567)
As is, when multiple tasks are running on a single node, if any one of them stops, the node gets reimaged.

This changes the behavior such that when a node with multiple tasks has one task stop, the other tasks will continue.
2021-02-19 23:54:26 +00:00
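
The behavior change in #567 can be sketched as follows. The `Node` dataclass, its fields, and `stop_task` are hypothetical and much simpler than the service's real model; the point is only that a reimage is requested once the last task on the node has stopped.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Node:
    running_tasks: Set[str] = field(default_factory=set)
    needs_reimage: bool = False


def stop_task(node: Node, task_id: str) -> None:
    """Stop one task; request a reimage only when no tasks remain on the node."""
    node.running_tasks.discard(task_id)
    if not node.running_tasks:
        node.needs_reimage = True
```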
8ce4638b8a clarify scaleset logging (#568) 2021-02-19 19:36:16 +00:00
929d9ce496 make user triggered reimaging happen immediately (#566) 2021-02-18 14:08:25 -05:00
8c9f65c0be add missing scaleset nodes (#518) 2021-02-08 13:50:08 -05:00
1d74379a70 use the primitive types in more places (#514) 2021-02-05 13:10:37 -05:00
e3dfcb8b95 Scalesets that are about to be deleted don't need updated configs (#511) 2021-02-05 09:53:29 -05:00
3cb055d331 clarify message upon service & agent version mismatch (#510) 2021-02-04 19:58:45 -05:00
a02e084522 split out node, scaleset, and pool code (#507) 2021-02-04 19:07:49 -05:00