221 Commits

Author SHA1 Message Date
bcdae2d5cb Check scaleset size for missing nodes (#984) 2021-06-11 18:47:21 -04:00
2be1edd9dc handle reimaging failures by resetting reimage_queued (#970)
In a previous commit, reimage_queued was added to prevent reimaging a node while it is reimaging.  However, this means reimaging failures due to Azure issues don't finish reimaging.

This resets the flag, allowing the node to reimage in the following cleanup cycle.
2021-06-09 18:58:56 +00:00
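A minimal sketch of the reset described in #970, under stated assumptions: the `Node` class, `try_reimage`, and the `reimage` callable are hypothetical stand-ins for illustration, not the service's code. The point is only that a failed attempt clears `reimage_queued` so a later cleanup cycle can retry.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Node:
    # Hypothetical, reduced node model; only the flag matters here.
    machine_id: str
    reimage_queued: bool = False


def try_reimage(node: Node, reimage: Callable[[str], None]) -> bool:
    """Attempt a reimage; `reimage` is an assumed callable that raises on failure."""
    if node.reimage_queued:
        # A reimage is already in flight, so do not queue another one.
        return False
    node.reimage_queued = True
    try:
        reimage(node.machine_id)
    except Exception:
        # Reset the flag so the next cleanup cycle can try again.
        node.reimage_queued = False
        return False
    return True
```

Without the reset in the `except` branch, a node whose reimage failed would keep `reimage_queued` set and be skipped on every later cycle.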
da931b3a5c address issues raised from latest mypy (#972) 2021-06-09 12:04:24 -04:00
af39d25a7d reimage/delete expired nodes even with the debug_keep_node flag (#968)
Fixes #965
2021-06-08 17:37:10 +00:00
ed289c9a3c handle scaleset resize exceptions (#967) 2021-06-08 09:30:36 -04:00
2c72bd590f Add generic coverage task (#763)
**Todo:**
- [x] Finalize format for coverage file(s)
- [x] Add service support
- [x] Integration test
- [x] Merge #926 
- [x] Merge #929
2021-06-03 23:36:00 +00:00
a92c84d42a work around issue with discriminated typed unions (#939)
We're experiencing a bug where Unions of sub-models are getting downcast, which causes a loss of information.  

As an example, EventScalesetCreated was getting downcast to EventScalesetDeleted.  I have not figured out why, nor can I replicate it locally to minimize the bug to send upstream, but I was able to reliably replicate it on the service.

While working through this issue, I noticed that deserialization of SignalR events was frequently wrong, leaving things like tasks as "init" in `status top`.

Both of these issues are Unions of models with a type field, so it's likely these are related.
2021-06-02 16:40:58 +00:00
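The commit above (#939) does not identify a root cause, so the following is only an illustration of the general hazard, not a reproduction of the bug. It uses plain dataclasses with hypothetical names (`Created`, `Deleted`, `parse_first_match`, `parse_by_type`): a parser that accepts the first union member whose fields fit can silently reduce a richer model to a smaller one, while dispatching on an explicit type field cannot.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, List, Type


@dataclass
class Deleted:
    scaleset_id: str


@dataclass
class Created:
    scaleset_id: str
    size: int


def parse_first_match(raw: Dict[str, Any], models: List[Type[Any]]) -> Any:
    # Naive union parsing: take the first model whose fields are all present,
    # ignoring any extra keys. Extra keys are silently dropped.
    for model in models:
        names = [f.name for f in fields(model)]
        if all(name in raw for name in names):
            return model(**{name: raw[name] for name in names})
    raise ValueError("no model matched")


# Assumed mapping from a "type" field to the model it names.
BY_TYPE: Dict[str, Type[Any]] = {"created": Created, "deleted": Deleted}


def parse_by_type(raw: Dict[str, Any]) -> Any:
    # Dispatch on the explicit type field instead of guessing from the shape.
    model = BY_TYPE[raw["type"]]
    return model(**{f.name: raw[f.name] for f in fields(model)})


raw = {"type": "created", "scaleset_id": "abc", "size": 10}
downcast = parse_first_match(raw, [Deleted, Created])  # a Deleted; size is lost
exact = parse_by_type(raw)  # a Created with size intact
```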
60ae07c34f handle azure-storage deleting nonexistent containers (#948) 2021-06-02 15:11:33 +00:00
b761908409 send NodeCommandStopIfFree on node shutdown (#940)
If we move to shut down a single node but it's not doing work, it will
wait until it picks up work to shut down.  This shortcuts that.
2021-06-01 15:03:33 +00:00
0a6021bfa1 prevent object id collision in hide_secrets (#936)
This fixes an issue where object id reuse can make the object
identification cache fail.  Instead, this simplifies hide_secrets to
always recurse and to use setattr to always set the value based on
the recursion.

Note, the object id reuse issue was seen in the
`events.filter_event_recurse` development and this was the fix for the
id reuse there.

Python documentation states:

id(object):

Return the “identity” of an object. This is an integer (or long integer)
which is guaranteed to be unique and constant for this object during its
lifetime. Two objects with non-overlapping lifetimes may have the same
id() value.
2021-05-27 08:28:02 -04:00
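A rough sketch of the approach described in #936, assuming simplified models: it recurses unconditionally and writes each result back with `setattr`, instead of tracking visited objects by `id()`. The `Secret`, `Inner`, and `Config` classes and the `hide` callback are hypothetical stand-ins, not the OneFuzz models.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import Any, Callable, List


@dataclass
class Secret:
    # Hypothetical stand-in for a secret-bearing value.
    value: str


@dataclass
class Inner:
    token: Secret


@dataclass
class Config:
    name: str
    inner: Inner
    extras: List[Any]


def hide_secrets(data: Any, hide: Callable[[Secret], Secret]) -> Any:
    # Always recurse; there is no id()-keyed cache, so a reused id
    # can never cause an object to be skipped.
    if isinstance(data, Secret):
        return hide(data)
    if isinstance(data, list):
        return [hide_secrets(item, hide) for item in data]
    if isinstance(data, dict):
        return {key: hide_secrets(value, hide) for key, value in data.items()}
    if is_dataclass(data) and not isinstance(data, type):
        for field in fields(data):
            # Always set the attribute from the result of the recursion.
            setattr(data, field.name, hide_secrets(getattr(data, field.name), hide))
    return data


config = Config("example", Inner(Secret("hunter2")), [Secret("abc"), "plain"])
hide_secrets(config, lambda secret: Secret("[hidden]"))
```

The trade-off is that shared sub-objects are visited more than once, which costs some extra work but removes the dependence on `id()` values staying unique across object lifetimes.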
d557fc16c6 mark tasks that are stopped that never started with an error (#935) 2021-05-26 18:42:21 -04:00
c107a04cf9 fix issue deleting proxy from storage tables (#932) 2021-05-26 13:33:22 -04:00
8b74d08d3d fix deleting nodes with expired heartbeats (#930) 2021-05-26 13:06:44 -04:00
2241dcc7a4 update azure-mgmt-resource to 18.0.0 (#903) 2021-05-24 16:33:06 +00:00
a103985c0d Fix multi proxy race condition (#909)
Refactored PR of #904 for easier review.  Once #908 is reviewed & merged, this will be easier to review.
2021-05-22 06:50:08 +00:00
6e5f7e4d4c encode proxy name as base58 to allow full deletion of resources (#907) 2021-05-21 20:54:17 -04:00
a4bb670fb2 add proxy_state_updated events (#908) 2021-05-21 12:47:54 -04:00
2f81c44f01 Refactoring proxy lifetime to only shutdown when proxy is out-of-date. (#839)
## Summary of the Pull Request

_What is this about?_
We'd like to refactor the proxy lifecycle to only delete when the proxy is out-of-date - i.e. when the proxy is older than 7 days or a mismatched version. I've changed two files, proxy.py and timer_daily\init.py to check for the version and timestamp before stopping a live proxy. 

## PR Checklist
* [ ] Applies to work item: #xxx
* [ ] CLA signed. If not, go over [here](https://cla.opensource.microsoft.com/microsoft/onefuzz) and sign the CLA.
* [ ] Tests added/passed
* [ ] Requires documentation to be updated
* [x] I've discussed this with core contributors already. If not checked, I'm ready to accept this work might be rejected in favor of a different grand plan. Issue number where discussion took place: #xxx

## Info on Pull Request

_What does this include?_
Changes to two files: 
proxy.py: 
- get_or_create() edited to check if timestamp is >7 days.
- Created is_outdated() to check version and timestamp for out-of-date proxy. 
timer_daily/init.py:
- Proxy check now includes is_outdated() before determining if a proxy should be shut down.

## Validation Steps Performed
Deploying test instance to determine if proxy lives past a single day.
2021-05-20 14:33:29 +00:00
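The check described in #839 could look roughly like the sketch below. Only the 7-day limit and the version comparison come from the PR description; the `Proxy` fields, the `CURRENT_VERSION` constant, and the `now` parameter are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

PROXY_MAX_AGE = timedelta(days=7)  # limit stated in the PR description
CURRENT_VERSION = "1.0.0"  # hypothetical placeholder for the running service version


@dataclass
class Proxy:
    # Hypothetical, reduced proxy model.
    version: str
    created: datetime

    def is_outdated(self, now: Optional[datetime] = None) -> bool:
        """True if the proxy's version is mismatched or it is older than 7 days."""
        now = now or datetime.now(timezone.utc)
        if self.version != CURRENT_VERSION:
            return True
        return now - self.created > PROXY_MAX_AGE
```

Passing `now` explicitly keeps the check deterministic in tests; a daily timer would call it without arguments and stop only the proxies for which it returns `True`.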
ff140a6b1b Stop tasks on nodes before deleting task queues (#801) 2021-05-17 18:59:13 +00:00
811264e249 handle issue from azure-mgmt-resource 17.0.0 upgrade (#893) 2021-05-14 16:19:52 -04:00
e8b654d0d4 update HasState Protocol to always log state transitions (#881) 2021-05-14 02:47:59 +00:00
cb5e786bcd add event for scaleset state updates (#882)
This moves all scaleset state updates through `Scaleset.set_state` and adds a new event EventScalesetStateUpdated.
2021-05-13 21:23:02 +00:00
584f68065d cleanup a handful of scaleset logs (#880) 2021-05-12 17:31:08 -04:00
221a3316a1 Add StopIfFree node command to tell free nodes to stop asking for new work (#866) 2021-05-07 13:55:50 -04:00
007ecf2efe shutdown missing scalesets during resize (#860) 2021-05-06 12:00:09 -04:00
ced21b2ea3 Add node messages to node get (#836)
This exposes the node commands that have yet to be processed by the node.  Example use case:  The SDK can now ask "has this node installed my SSH key?"
2021-04-26 16:14:58 -04:00
541e745199 handle queues vanishing during peek (#832)
Handle queues getting deleted during peek_queue.  This can happen when
polling the pool for work while the pool is getting shut down.
2021-04-26 15:42:40 -04:00
f4b5c1ae73 when processing node updates, don't wait on the node in cases it should be stopped (#834)
In situations when the node should be done, mark it as done without
waiting for the node to respond to the Done command.
2021-04-26 15:19:46 -04:00
cf3d904940 address formatting from black 21.4b0 (#831) 2021-04-26 12:35:16 -04:00
c5e0163068 catch VM SKU & VM Image generation mismatch failures (#803) 2021-04-14 14:34:12 -04:00
c8572cd55a remove timestamp from WebhookMessageLog model (#804) 2021-04-14 12:49:46 -04:00
627463d94b only record the first failure if a task has multiple failures (#797) 2021-04-13 17:36:56 -04:00
470e95c833 add Timestamp to multiple models (#796)
Expose the Azure storage table's "Timestamp" for the models where Timestamp should be user-accessible and remove the Timestamp field from models that did not sign up for it.

The behavior where Timestamp is only set by Azure Storage is kept.
2021-04-13 19:03:25 +00:00
39464dc606 simplify removing UserInfo from events prior to logging (#795) 2021-04-13 15:47:08 +00:00
46b8bdccbc add TaskConfig to crash_reported and regression_reported events (#793)
resolves #757 and #758
2021-04-13 10:24:12 +00:00
80b3533f83 Report the setup failure in the task when available (#781) 2021-04-09 08:57:56 -04:00
e21eafd135 clarify telemetry key names at the service level (#769) 2021-04-05 15:23:03 -04:00
ca12904684 add log checking to refactored integration check (#700)
In practice, Application Insights can take up to 3 minutes before something sent to it is available via KQL.

This PR logs start and stop markers so that the integration tests only search for logs emitted during the test run. This reduces the complexity when using the integration tests during the development process.

Note: this migrated the new functionality from #356 into the latest integration test tools.
2021-04-02 21:49:19 +00:00
7e5cf780a6 Added support for multi tenant authentication (#746)
## Summary of the Pull Request

_What is this about?_

## PR Checklist
* [x] Applies to work item: #562 
* [x] CLA signed. If not, go over [here](https://cla.opensource.microsoft.com/microsoft/onefuzz) and sign the CLA.
* [x] Tests added/passed
* [ ] Requires documentation to be updated
* [x] I've discussed this with core contributors already. If not checked, I'm ready to accept this work might be rejected in favor of a different grand plan. Issue number where discussion took place: #xxx

## Info on Pull Request

The end-to-end changes needed to have onefuzz deployed with multi-tenant authentication.

## Validation Steps Performed

_How does someone test & validate?_
2021-04-02 14:39:20 +00:00
3096f99e86 enable using ephemeral disks by default (#461) 2021-03-30 18:48:44 -04:00
3eb7c8643b set expect_crash_on_failure default to False on libFuzzer tasks (#748) 2021-03-30 21:51:15 +00:00
92b5139a0a Removing UserInfo from notifications logging (#724) 2021-03-23 18:47:05 -04:00
1706a91291 Removing UserInfo from 'created task' logging (#725) 2021-03-23 18:45:18 -04:00
516b1e000e expose minimized_stack_depth functionality in the CLI/API (#715) 2021-03-23 10:09:34 -04:00
6e60a8cf10 add regression testing tasks (#664) 2021-03-18 15:37:19 -04:00
7ebdeac537 Added UserInfo Filter Logging Function (#661)
## Summary of the Pull Request

_What is this about?_
Due to our GDPR privacy requirements, we decided that it would be best to completely purge personally identifiable information from our AppInsights telemetry and logging. Instead of just removing all of the logging statements with personal info, I created a filter function that logs telemetry after it's been run through a recursive scrubbing function. This PR includes this new scrubbing function.

## PR Checklist
* [x] Applies to work item: #660
* [ ] CLA signed. If not, go over [here](https://cla.opensource.microsoft.com/microsoft/onefuzz) and sign the CLA.
* [ ] Tests added/passed
* [ ] Requires documentation to be updated
* [x] I've discussed this with core contributors already. If not checked, I'm ready to accept this work might be rejected in favor of a different grand plan. Issue number where discussion took place: #xxx

## Info on Pull Request

_What does this include?_
Includes changes to events.py in onefuzzlib. I've implemented functionality - log_event() - to recursively check Event structures for UserInfo before logging to AppInsights. 

## Validation Steps Performed
I ran local tests using a script I created with test events.

_How does someone test & validate?_
I can provide the local testing script. If that is insufficient, I can write a unit test that will run against this code.
2021-03-15 23:56:00 +00:00
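One way to picture the recursive scrubbing described in #661 is the sketch below. It is not the `log_event()` implementation from events.py: `UserInfo` is reduced to a single field, and `ExampleEvent` and `filter_event` are hypothetical names. The function builds a loggable copy of an event in which every `UserInfo` value is replaced by `None`, however deeply it is nested.

```python
from dataclasses import dataclass, field, fields, is_dataclass
from typing import Any, List, Optional


@dataclass
class UserInfo:
    # Reduced stand-in; the real model is assumed to carry more fields.
    upn: Optional[str] = None


@dataclass
class ExampleEvent:
    # Hypothetical event shape used only for this illustration.
    task_id: str
    user_info: Optional[UserInfo] = None
    history: List[Any] = field(default_factory=list)


def filter_event(value: Any) -> Any:
    """Return a loggable copy of `value` with every UserInfo replaced by None."""
    if isinstance(value, UserInfo):
        return None
    if isinstance(value, (list, tuple)):
        return [filter_event(item) for item in value]
    if isinstance(value, dict):
        return {key: filter_event(item) for key, item in value.items()}
    if is_dataclass(value) and not isinstance(value, type):
        # Convert models to plain dicts so the original event is left untouched.
        return {f.name: filter_event(getattr(value, f.name)) for f in fields(value)}
    return value
```

Because the function returns a copy, the event that is sent to webhooks or stored keeps its `UserInfo`; only the value handed to the logger is scrubbed.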
a3fdc74c53 handle exception related to manually deleted scalesets (#672)
If a user manually deletes a scaleset managed by OneFuzz, then `get_vmss_size` returns None.

When this happens, `Scaleset.shutdown` generates an exception from the `logging.info` call on line 573.

This PR handles this edge condition.
2021-03-15 14:18:59 +00:00
6888fc8fb8 send EventTaskFailed and EventTaskStopped once the task is stopped (#651)
As is, these events are sent once the task enters the state `stopping`.
However, the tasks can still be running on the VMs which can be
confusing.
2021-03-12 01:48:28 +00:00
14c7d5e4d9 mark dependent tasks failed upon failure (#650)
Fix #644
2021-03-11 22:24:43 +00:00
b4ceb263e0 stop jobs once all tasks are stopped (#649)
Fixed #643
2021-03-09 20:09:18 +00:00