* verify AAD tenants, primarily needed in multi-tenant deployments (see the sketch after this list)
* add logging and fix trailing slash for issuer
* handle call_if* helpers not supporting callbacks that take additional arguments
* add logging
* include new datatype in webhook docs
* fix pytypes unit tests
Co-authored-by: Brian Caswell <bmc@shmoo.com>
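As a rough illustration of the tenant/issuer checks mentioned above, here is a minimal sketch; `verify_token_issuer` and `allowed_issuers` are hypothetical names standing in for the service's actual API, and the trailing-slash normalization mirrors the fix described in the bullets.

```python
import logging
from typing import List


def verify_token_issuer(issuer: str, allowed_issuers: List[str]) -> bool:
    # AAD can report the issuer with or without a trailing slash, so
    # normalize both sides before comparing
    normalized = issuer.rstrip("/")
    allowed = [entry.rstrip("/") for entry in allowed_issuers]

    if normalized not in allowed:
        logging.warning("rejecting token from unexpected issuer: %s", issuer)
        return False

    logging.info("accepted token issuer: %s", normalized)
    return True
```

In a multi-tenant deployment, `allowed_issuers` would carry one entry per allowed tenant.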
This makes sure `debug_keep_node` is reset and the rest of the reimage processing occurs regardless of `reimage_requested` and `delete_requested` being set.
Without this, nodes that are marked `debug_keep_node` do not get reimaged/deleted.
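A simplified sketch of the control flow described above (not the actual scaleset code): `debug_keep_node` is cleared unconditionally, and the reimage/delete handling still runs.

```python
from dataclasses import dataclass


@dataclass
class Node:
    machine_id: str
    debug_keep_node: bool = False
    reimage_requested: bool = False
    delete_requested: bool = False


def process_node(node: Node) -> str:
    # reset debug_keep_node unconditionally so a node held for debugging
    # still flows through the normal reimage/delete handling below
    node.debug_keep_node = False

    # the rest of the processing happens regardless of which flags were set
    if node.delete_requested:
        return "delete"
    if node.reimage_requested:
        return "reimage"
    return "no-op"


# a node marked debug_keep_node is still reimaged once it is processed
assert process_node(Node("node-1", debug_keep_node=True, reimage_requested=True)) == "reimage"
```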
Currently, if a pool or scaleset is set to `shutdown`, it cannot be set to `halt`.
While moving from `halt` to `shutdown` would cause issues, moving from `shutdown` to `halt` is fine.
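A small sketch of that one-way rule, using a hypothetical helper rather than the real pool/scaleset state machine:

```python
from enum import Enum


class ScalesetState(Enum):
    running = "running"
    shutdown = "shutdown"
    halt = "halt"


def can_transition(current: ScalesetState, requested: ScalesetState) -> bool:
    # halt -> shutdown would cause issues, so it stays disallowed;
    # shutdown -> halt is fine and is the transition this change enables
    if current == ScalesetState.halt and requested == ScalesetState.shutdown:
        return False
    return True


assert can_transition(ScalesetState.shutdown, ScalesetState.halt)
assert not can_transition(ScalesetState.halt, ScalesetState.shutdown)
```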
The SignalR integration in Azure Functions does not have automatic retry. When the SignalR instance has issues, all other APIs fail.
To make the service resilient to SignalR outages, this bounces SignalR events through an Azure Storage queue.
NOTE: This PR does not remove the integration from all of the functions. That is intended to be done as a follow-on PR.
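A hedged sketch of the bounce: enqueue the event to an Azure Storage queue and let a queue-triggered function forward it to SignalR (the forwarder is not shown). The queue name and environment variable are assumptions; azure-storage-queue is used for illustration.

```python
import json
import os

from azure.storage.queue import QueueClient

# illustrative queue name; the real service wires this up differently
SIGNALR_EVENTS_QUEUE = "signalr-events"


def queue_signalr_event(event: dict) -> None:
    # rather than invoking the SignalR output binding directly, push the
    # event onto a storage queue; a SignalR outage then only delays delivery
    # instead of failing the API call that produced the event
    client = QueueClient.from_connection_string(
        os.environ["AzureWebJobsStorage"], SIGNALR_EVENTS_QUEUE
    )
    client.send_message(json.dumps(event))
```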
To reduce how frequently the service hits the IMS, the service caches the azure-mgmt clients between API calls. While the management APIs should have some resilience to authentication token expiration built in, not all of them do.
This is seen with `ClientAuthenticationError`, most often with the nested exception record of `ExpiredAuthenticationToken`.
This wraps all of the compute-layer functionality in a check that catches this exception and retries the request.
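A minimal sketch of the retry wrapper, under the assumption that the failures surface as azure-core's `ClientAuthenticationError`; the decorator name and the single retry are illustrative, not the service's exact implementation.

```python
import functools
import logging
from typing import Any, Callable, TypeVar

from azure.core.exceptions import ClientAuthenticationError

T = TypeVar("T")


def retry_on_auth_failure(func: Callable[..., T]) -> Callable[..., T]:
    # a cached azure-mgmt client can hold an expired token; when a call fails
    # with ClientAuthenticationError (often wrapping ExpiredAuthenticationToken),
    # log it and retry the request once
    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> T:
        try:
            return func(*args, **kwargs)
        except ClientAuthenticationError:
            logging.warning("auth failure calling %s, retrying", func.__name__)
            return func(*args, **kwargs)

    return wrapper
```

Each compute-layer call site would then be decorated with `retry_on_auth_failure` (or wrapped at call time).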
The flag `Node.reimage_queued` is intended to stop nodes from reimaging repeatedly.
In #970, in order to work around Azure API failures, this flag was cycled if the node was already set to cleanup. Unfortunately, reimaging can sometimes take a significant amount of time, causing this change to queue nodes for reimage multiple times.
Instead of using `reimage_queued` as a flag, this PR deletes the node from the storage table upon reimage. When the node registers OR the next time through `Scaleset.cleanup_nodes`, the Node will be recreated automatically, whichever comes first.
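A sketch of the delete-on-reimage approach using azure-data-tables; the table name, key layout, and connection setting are assumptions standing in for the service's own table layer.

```python
import os

from azure.data.tables import TableClient

# hypothetical table name and key layout; the real service goes through its
# own ORM layer rather than a raw TableClient
NODE_TABLE = "Node"


def on_reimage_issued(scaleset_id: str, machine_id: str) -> None:
    # after the reimage is requested via the compute API (omitted), delete the
    # node's row instead of setting reimage_queued; the row is recreated when
    # the node re-registers or on the next Scaleset.cleanup_nodes pass,
    # whichever comes first
    table = TableClient.from_connection_string(
        os.environ["AzureWebJobsStorage"], table_name=NODE_TABLE
    )
    table.delete_entity(partition_key=scaleset_id, row_key=machine_id)
```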
In a previous commit, `reimage_queued` was added to prevent reimaging a node while it is already reimaging. However, this means that when a reimage fails due to Azure issues, the node never finishes reimaging.
This resets the flag, allowing the node to be reimaged in the following cleanup cycle.
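A simplified sketch of that behavior (illustrative names, not the actual cleanup code): the flag is set while the reimage call is in flight and cleared if the call fails, so the next cleanup cycle can retry.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Node:
    machine_id: str
    reimage_queued: bool = False


def queue_reimage(node: Node, reimage_call: Callable[[Node], None]) -> None:
    # mark the node so it is not queued for reimage twice while in flight
    node.reimage_queued = True
    try:
        reimage_call(node)
    except Exception:
        # an Azure-side failure would otherwise leave the flag set forever;
        # clearing it lets the following cleanup cycle queue the node again
        node.reimage_queued = False
        raise
```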
We're experiencing a bug where Unions of sub-models are getting downcast, which causes a loss of information.
As an example, `EventScalesetCreated` was getting downcast to `EventScalesetDeleted`. I have not figured out why, nor can I replicate it locally to minimize the bug for an upstream report, but I was able to reliably replicate it on the service.
While working through this issue, I noticed that deserialization of SignalR events was frequently wrong, leaving things like tasks as "init" in `status top`.
Both of these issues are Unions of models with a type field, so it's likely these are related.
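A small reproduction of the downcast pattern, assuming pydantic v1's left-to-right Union matching and made-up field names; parsing by the explicit type field is shown as one way to avoid it and is not necessarily the exact change made here.

```python
from enum import Enum
from typing import Union

from pydantic import BaseModel


class EventType(Enum):
    scaleset_created = "scaleset_created"
    scaleset_deleted = "scaleset_deleted"


class ScalesetDeleted(BaseModel):
    scaleset_id: str


class ScalesetCreated(BaseModel):
    scaleset_id: str
    pool_name: str
    vm_sku: str


class Message(BaseModel):
    event_type: EventType
    event: Union[ScalesetDeleted, ScalesetCreated]


payload = {"scaleset_id": "abc", "pool_name": "pool", "vm_sku": "Standard_D2s_v3"}

# pydantic v1 tries Union members in order and keeps the first match; the
# richer payload validates as ScalesetDeleted, silently dropping fields
msg = Message(event_type=EventType.scaleset_created, event=payload)
assert type(msg.event) is ScalesetDeleted

# selecting the model from the explicit type field avoids relying on
# Union matching order
MODELS = {
    EventType.scaleset_created: ScalesetCreated,
    EventType.scaleset_deleted: ScalesetDeleted,
}
event = MODELS[EventType.scaleset_created].parse_obj(payload)
assert type(event) is ScalesetCreated
```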
This fixes an issue related to object id reuse that can make the object-identification cache fail. Instead of relying on that cache, this simplifies `hide_secrets` to always recurse and use `setattr` to set the value based on the recursion.
Note: the object id reuse issue was seen during development of `events.filter_event_recurse`, and this was the fix for the id reuse there.
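A hedged sketch of the recurse-and-setattr shape described here, over a simplified notion of "secret" leaves; the real `hide_secrets` walks the service's pydantic models and its secret wrapper type.

```python
from typing import Any, Callable

from pydantic import BaseModel


def hide_secrets(data: Any, hider: Callable[[Any], Any]) -> Any:
    # always recurse and always assign the result back with setattr, instead
    # of tracking visited objects in an id()-keyed cache; id values can be
    # reused once an object is garbage collected, which made that cache unsound
    if isinstance(data, BaseModel):
        for name in data.__fields__:
            setattr(data, name, hide_secrets(getattr(data, name), hider))
        return data
    if isinstance(data, dict):
        return {key: hide_secrets(value, hider) for key, value in data.items()}
    if isinstance(data, list):
        return [hide_secrets(value, hider) for value in data]
    # leaf value: let the caller-provided hider decide whether to redact it
    return hider(data)
```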
Python documentation states:

> id(object):
> Return the “identity” of an object. This is an integer (or long integer)
> which is guaranteed to be unique and constant for this object during its
> lifetime. Two objects with non-overlapping lifetimes may have the same
> id() value.
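The reuse the documentation warns about is easy to observe in CPython (implementation-dependent, but common): two objects with non-overlapping lifetimes can end up with the same id, which is why an id()-keyed cache can misidentify a new object as one it has already seen.

```python
class Marker:
    pass


# the first Marker is garbage collected before the second is created, so
# CPython often reuses the same memory address, and therefore the same id()
first = id(Marker())
second = id(Marker())
print(first == second)  # frequently True in CPython, though never guaranteed
```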