Currently, if a pool or scaleset is set to `shutdown`, it cannot subsequently be set to `halt`.
Moving from `halt` back to `shutdown` would cause issues, but moving from `shutdown` to `halt` is fine, so this allows that transition.
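A hypothetical sketch of the rule (the enum and helper below are illustrative, not the actual onefuzz types):

```python
from enum import Enum

class PoolState(Enum):
    # illustrative subset of states
    running = "running"
    shutdown = "shutdown"
    halt = "halt"

def can_transition(current: PoolState, new: PoolState) -> bool:
    # shutdown -> halt is fine; halt back to shutdown would cause issues
    # and is rejected.
    if current == PoolState.halt and new == PoolState.shutdown:
        return False
    return True
```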
The SignalR integration for Azure Functions does not have automatic retry, so when the SignalR instance has issues, all of the other APIs fail with it.
To make the service resilient to SignalR outages, this bounces SignalR events through an Azure Storage queue.
NOTE: This PR does not remove the integration from all of the functions. That is intended to be done as a follow-on PR.
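As a rough sketch of the approach (the queue name and helper below are assumptions, not part of this PR), events are written to an Azure Storage queue and a separate queue-triggered function forwards them to SignalR, so an API call never fails just because SignalR is down:

```python
import json

from azure.storage.queue import QueueClient

def queue_signalr_event(connection_string: str, event: dict) -> None:
    # Write the event to a storage queue instead of the SignalR output
    # binding; a queue-triggered function forwards it to SignalR, and the
    # queue's built-in retry covers SignalR outages.
    client = QueueClient.from_connection_string(connection_string, "signalr-events")
    client.send_message(json.dumps(event))
```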
To reduce how frequently the service hits the IMS, it caches the azure-mgmt clients between API calls. While the management APIs should have some amount of handling for authentication-token expiration built in, not all of them do.
This is seen with `ClientAuthenticationError`, most often with the nested exception record of `ExpiredAuthenticationToken`.
This wraps all of the compute-layer functionality in a helper that catches this exception and retries the request.
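A minimal sketch of the retry idea, assuming a `reset_client` hook that rebuilds the cached azure-mgmt client (the helper names are illustrative, not the actual onefuzz wrapper):

```python
from typing import Callable, TypeVar

from azure.core.exceptions import ClientAuthenticationError

T = TypeVar("T")

def retry_on_auth_failure(operation: Callable[[], T], reset_client: Callable[[], None]) -> T:
    try:
        return operation()
    except ClientAuthenticationError:
        # Typically ExpiredAuthenticationToken: the cached client's token has
        # expired, so rebuild the client and retry the request once.
        reset_client()
        return operation()
```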
The flag `Node.reimage_queued` is intended to stop nodes from reimaging repeatedly.
In #970, in order to work around Azure API failures, this flag was cycled if the node was already set to cleanup. Unfortunately, reimaging can sometimes take a significant amount of time, causing this change to reimage nodes multiple times.
Instead of using `reimage_queued` as a flag, this PR deletes the node from the storage table upon reimage. When the node registers OR the next time through `Scaleset.cleanup_nodes`, the Node will be recreated automatically, whichever comes first.
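A compact sketch of the new lifecycle (the table and helpers below are stand-ins, not the onefuzz storage API): the row is deleted when the reimage is issued, and lookups simply recreate it on demand.

```python
from typing import Dict

# stand-in for the Azure Table that backs Node
NODE_TABLE: Dict[str, dict] = {}

def delete_node(machine_id: str) -> None:
    # called when the reimage is issued; no reimage_queued flag is left behind
    NODE_TABLE.pop(machine_id, None)

def get_or_create_node(machine_id: str, scaleset_id: str) -> dict:
    # called from registration and from Scaleset.cleanup_nodes; whichever
    # happens first recreates the row for the freshly reimaged VM
    if machine_id not in NODE_TABLE:
        NODE_TABLE[machine_id] = {"machine_id": machine_id, "scaleset_id": scaleset_id}
    return NODE_TABLE[machine_id]
```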
In a previous commit, `reimage_queued` was added to prevent reimaging a node while it is already reimaging. However, this means that when a reimage fails due to Azure issues, the node never finishes reimaging.
This resets the flag, allowing the node to be reimaged in the following cleanup cycle.
We're experiencing a bug where Unions of sub-models are getting downcast, which causes a loss of information.
As an example, EventScalesetCreated was getting downcast to EventScalesetDeleted. I have not figured out why, nor have I been able to replicate it locally to produce a minimized bug report to send upstream, but I was able to reliably replicate it on the service.
While working through this issue, I noticed that deserialization of SignalR events was frequently wrong, leaving things like tasks as "init" in `status top`.
Both of these issues involve Unions of models with a type field, so they are likely related.
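A minimal illustration of this class of bug, assuming pydantic v1 semantics (the models below are stand-ins, not the actual onefuzz event types): with a plain `Union`, members are tried left to right, so a payload can be coerced into a model that merely shares its required fields, silently dropping the rest.

```python
from typing import Union

from pydantic import BaseModel, parse_obj_as

class EventScalesetDeletedLike(BaseModel):
    scaleset_id: str

class EventScalesetCreatedLike(BaseModel):
    scaleset_id: str
    size: int

payload = {"scaleset_id": "example", "size": 10}
event = parse_obj_as(Union[EventScalesetDeletedLike, EventScalesetCreatedLike], payload)
print(type(event).__name__)  # EventScalesetDeletedLike: "size" was silently dropped
```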
This fixes an issue related to object id reuse, which can cause the object-identification cache to fail. Instead, this simplifies `hide_secrets` to always recurse and to use `setattr` to set each value based on the result of the recursion.
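A hedged sketch of the always-recurse shape described above (the `SecretData` stub and the helper body are illustrative, not the exact change):

```python
from typing import Any, Callable

from pydantic import BaseModel

class SecretData(BaseModel):
    # stub standing in for the model that wraps secret values
    secret: str

def hide_secrets(data: Any, hide_secret: Callable[[SecretData], SecretData]) -> Any:
    # Always recurse and always write the result back, rather than caching
    # visited objects by id() to decide what to skip.
    if isinstance(data, SecretData):
        return hide_secret(data)
    if isinstance(data, BaseModel):
        for name in data.__fields__:
            setattr(data, name, hide_secrets(getattr(data, name), hide_secret))
        return data
    if isinstance(data, list):
        return [hide_secrets(item, hide_secret) for item in data]
    if isinstance(data, dict):
        return {key: hide_secrets(value, hide_secret) for key, value in data.items()}
    return data
```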
Note: the object id reuse issue was seen during development of `events.filter_event_recurse`, and this was the fix for the id reuse there.
The Python documentation states:

> `id(object)`: Return the “identity” of an object. This is an integer (or long integer) which is guaranteed to be unique and constant for this object during its lifetime. Two objects with non-overlapping lifetimes may have the same `id()` value.
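For instance, this toy example shows why keying a cache on `id()` is unreliable:

```python
a = object()
cached = id(a)   # remember "we already processed this object"
del a            # a's lifetime ends; its memory can be reused

b = object()     # a different object that may land at the same address
print(id(b) == cached)  # can print True in CPython, wrongly "matching" the cache
```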
## Summary of the Pull Request
_What is this about?_
We'd like to refactor the proxy lifecycle to only delete a proxy when it is out-of-date, i.e. when it is older than 7 days or its version no longer matches the service. I've changed two files, `proxy.py` and `timer_daily/__init__.py`, to check the version and timestamp before stopping a live proxy.
## PR Checklist
* [ ] Applies to work item: #xxx
* [ ] CLA signed. If not, go over [here](https://cla.opensource.microsoft.com/microsoft/onefuzz) and sign the CLA.
* [ ] Tests added/passed
* [ ] Requires documentation to be updated
* [x] I've discussed this with core contributors already. If not checked, I'm ready to accept this work might be rejected in favor of a different grand plan. Issue number where discussion took place: #xxx
## Info on Pull Request
_What does this include?_
Changes to two files:

`proxy.py`:
- `get_or_create()` edited to check whether the proxy's timestamp is older than 7 days.
- Created `is_outdated()` to check the version and timestamp for an out-of-date proxy (sketched below).

`timer_daily/__init__.py`:
- The proxy check now calls `is_outdated()` before determining whether a proxy should be shut down.
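A minimal sketch of the check, assuming the proxy record stores its build version and a creation timestamp (the field and constant names below are illustrative, not the actual onefuzz code):

```python
from datetime import datetime, timedelta, timezone

PROXY_LIFESPAN = timedelta(days=7)

def is_outdated(proxy_version: str, current_version: str, created: datetime) -> bool:
    # A proxy is outdated if it was created by a different version of the
    # service or has been alive for more than 7 days.
    if proxy_version != current_version:
        return True
    return datetime.now(timezone.utc) - created > PROXY_LIFESPAN
```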
## Validation Steps Performed
Deploying a test instance to confirm that the proxy lives past a single day.