Commit Graph

115 Commits

Author SHA1 Message Date
8600a44f1f fix bool queries (#597)
This addresses broken queries used for identifying outdated nodes.
2021-02-26 16:51:05 +00:00
fb482e357e don't schedule work to a node if the scaleset or pool is shutting down (#583) 2021-02-23 13:33:41 -05:00
cebb84b9e7 handle error condition when creating a container that is being deleted (#582)
When users try to create a container immediately after deleting it, Azure will fail saying the deletion is in-progress.

catching ResourceExistsError during create handles this error.
2021-02-22 01:49:07 +00:00
feb80ecb54 allow nodes with multiple tasks to continue on task stop (#567)
As is, when multiple tasks are running on a single node, if any one of them stops, the node gets reimaged.

This changes the behavior such that when a node with multiple tasks has one task stop, the other tasks will continue.
2021-02-19 23:54:26 +00:00
6ba5795f36 update proxy port ranges to avoid current blocks (#552) 2021-02-19 17:50:09 -05:00
4de19ffe5e stop jobs that do not start within 30 days (#565)
If a job does not start within 30 days, stop the job and mark all of the tasks as `failed`.
2021-02-19 21:23:35 +00:00
305c23a4d9 add instance information to webhooks (#577)
Fixes #574
2021-02-19 21:00:51 +00:00
8ce4638b8a clarify scaleset logging (#568) 2021-02-19 19:36:16 +00:00
4992b494f1 add task config to all task events (#580) 2021-02-19 14:10:48 -05:00
872a5ddc14 add details to exceptions generated during report render failures (#576) 2021-02-19 13:48:49 -05:00
3a7bc95316 import local relative paths (#579) 2021-02-19 12:29:35 -05:00
929d9ce496 make user triggered reimaging happen immediately (#566) 2021-02-18 14:08:25 -05:00
279629292f handle SkuNotAvailable errors when creating VM Scalesets (#557) 2021-02-17 16:52:37 -05:00
bdcab6eb08 handle tokens from x-ms-token-aad-id-token (#531) 2021-02-10 12:41:15 -05:00
8ee7fae240 use the cached Azure Identity instance for storage operations (#526) 2021-02-09 12:20:12 -05:00
91a3690551 fix logging to show corpus accounts found, not a function ref (#524) 2021-02-09 11:55:52 -05:00
5114332ea0 clarify proxy log messages (#520) 2021-02-08 17:39:26 -05:00
8c9f65c0be add missing scaleset nodes (#518) 2021-02-08 13:50:08 -05:00
1d74379a70 use the primitive types in more places (#514) 2021-02-05 13:10:37 -05:00
e3dfcb8b95 Scalesets that are about to be deleted don't need updated configs (#511) 2021-02-05 09:53:29 -05:00
3cb055d331 clarify message upon service & agent version mismatch (#510) 2021-02-04 19:58:45 -05:00
a02e084522 split out node, scaleset, and pool code (#507) 2021-02-04 19:07:49 -05:00
c5bb0f0588 Update Proxy heartbeat & logging (#502) 2021-02-04 15:38:17 -05:00
5e2e9448df add security auditing of python code using Bandit during CICD (#491) 2021-02-01 16:51:03 -05:00
0f70ffa3e2 try pushing updates to scaleset configs frequently until the push succeeds (#489) 2021-02-01 10:09:40 -05:00
a46f7b4193 expose supervisor tasks that are fully self-contained fuzzing tasks in the service (#474)
Exposes the functionality added in #454 to the service & CLI.

Fixes #439
2021-01-29 00:01:59 +00:00
14fc1ca51f remove unused Event generation from the pre-2.0.0 SignalR integration (#477)
Remove a vestige of the adhoc events used by the previous SingalR integration for container updates.
2021-01-28 21:56:31 +00:00
f155ad625f reimage long-lived nodes (#476)
This helps keep nodes on scalesets that use `latest` OS image SKUs reasonably up-to-date with OS patches without disrupting running fuzzing tasks with patch reboot cycles.

In combination with the already-merged #416, this PR closes #414.
2021-01-28 20:36:40 +00:00
24685ca8df Updating Windows Default Image from RS5-Pro to 20H2-Pro (#469)
RS5-Pro is no longer updated in the Azure Marketplace. In order to ensure the Windows 10 VMs are regularly updated, we need to switch the default image to 20H2-Pro, which is regularly maintained.
2021-01-27 13:46:46 +00:00
5027745ee2 simplify get/delete for scalesets (#468) 2021-01-26 14:43:14 -05:00
165257e989 update python prereqs (#427)
Updates the following libraries in the service:
* azure-core
* azure-functions
* azure-identity
* azure-keyvault-keys
* azure-keyvault-secrets
* azure-mgmt-compute
* azure-mgmt-core
* azure-mgmt-loganalytics
* azure-mgmt-network
* azure-mgmt-resource
* azure-mgmt-storage
* azure-mgmt-subscription
* azure-storage-blob
* azure-storage-queue
* pydantic
* requests
* jsonpatch

Removes the following libraries in the service:
* azure-cli-core
* azure-cli-nspkg
* azure-mgmt-cosmosdb
* azure-servicebus

Updates the following libraries in the CLI:
* requests
* semver
* asciimatics
* pydantic
* tenacity

Updates the following libraries in onefuzztypes:
* pydantic

The primary "legacy" libraries are [azure-graphrbac](https://pypi.org/project/azure-graphrbac/) and azure-cosmosdb-table.  The former has not been updated to use azure-identity yet. The later is being rewritten as [azure-data-tables](https://pypi.org/project/azure-data-tables/), but is still in early beta.
2021-01-25 20:53:40 +00:00
31ea71e8b6 use the unique-string based keyvault names (#462) 2021-01-25 15:02:12 -05:00
4bc90a7564 set max stdout/stderr size (#460) 2021-01-25 13:07:35 -05:00
3f2883d38e Storing secrets in azure keyvault (#326) 2021-01-25 11:12:07 -05:00
e4ecf7e230 remove early-exit from cleanup_nodes that broke dead node cleanup (#458) 2021-01-22 18:04:50 -05:00
2f3139cda1 unify node resetting & deleting into delete/recreate (#450) 2021-01-22 22:04:44 +00:00
e6dec041b2 move to using machine_id rather than node_id (#451)
Handle unifying onto machine_id for NodeMessage.
2021-01-21 16:22:16 +00:00
fd956380d4 experimental "local fuzzing" support (#405)
This PR adds an experimental "local" mode for the agent, starting with `libfuzzer`.  For tasks that poll a queue, in local mode, they just monitor a directory for new files.

Supported commands: 
* libfuzzer-fuzz (models the `libfuzzer-fuzz` task)
* libfuzzer-coverage (models the `libfuzzer-coverage` task)
* libfuzzer-crash-report (models the `libfuzzer-crash-report` task)
* libfuzzer (models the `libfuzzer basic` job template, running libfuzzer-fuzz and libfuzzer-crash-report tasks concurrently, where any files that show up in `crashes_dir` are automatically turned into reports, and optionally runs the coverage task which runs the coverage data exporter for each file that shows up in `inputs_dir`).

Under the hood, there are a handful of changes required to the rest of the system to enable this feature.
1. `SyncedDir` URLs are now optional.  In local mode, these no longer make sense.   (We've discussed moving management of `SyncedDirs` to the Supervisor.  This is tangential to that effort.)
2. `InputPoller` uses a `tempdir` rather than abusing `task_id` for temporary directory naming.
3. Moved the `agent` to only use a single tokio runtime, rather than one for each of the subcommands.
4. Sets the default log level to `info`.  (RUST_LOG can still be used as is).

Note, this removes the `onefuzz-agent debug` commands for the tasks that are now exposed via `onefuzz-agent local`, as these provide a more featureful version of the debug tasks.
2021-01-20 03:33:25 +00:00
513d1f52c9 Unify Dashboard & Webhook events (#394)
This change unifies the previously adhoc SignalR events and Webhooks into a single event format.
2021-01-11 21:43:09 +00:00
d573100a97 Clear node messages on deletion (#419)
## Summary of the Pull Request

_What is this about?_

## PR Checklist
* [ ] Applies to work item: #xxx
* [ ] CLA signed. If not, go over [here](https://cla.opensource.microsoft.com/microsoft/onefuzz) and sign the CLI.
* [ ] Tests added/passed
* [ ] Requires documentation to be updated
* [ ] I've discussed this with core contributors already. If not checked, I'm ready to accept this work might be rejected in favor of a different grand plan. Issue number where discussion took place: #xxx

## Info on Pull Request

_What does this include?_

## Validation Steps Performed

_How does someone test & validate?_
2021-01-11 20:14:43 +00:00
6aa7d5f6cf remove unused back_channel_address entry (#420) 2021-01-08 14:23:30 -05:00
46e8454569 compare containers rather than SAS urls when building worksets (#418)
By comparing container names rather than SAS urls, this removes a race condition that prevented co-locatable tasks from being co-located.
2021-01-08 09:45:05 +00:00
e799eb03cd Shorten the expiry window for the work queue SAS URLs assigned at node registration (#416)
The underlying impact is that nodes must re-register on a more frequent basis.

Nodes find out they are out-of-date is during registration and immediately prior to starting a new set of work.  Requiring nodes re-register on a shortened cycle provides more opportunities for nodes to get re-imaged.

Additionally, this addresses an issue handling the SAS URL expiry in a more clean fashion in the supervisor.
2021-01-07 12:34:26 +00:00
3b26ffef65 support multiple corpus accounts (#334)
Add support for sharding across multiple storage accounts for blob containers used for corpus management.

Things to note:

1. Additional storage accounts must be in the same resource group, support the "blob" endpoint, and have the tag `storage_type` with the value `corpus`.  A utility is provided (`src/utils/add-corpus-storage-accounts`), which adds storage accounts. 
2. If any secondary storage accounts exist, they are used by default for containers.
3. Storage account names are cached in memory the Azure Function instance forever.   Upon adding new storage accounts, the app needs to be restarted to pick up the new accounts.
2021-01-06 23:11:39 +00:00
f345bd239d Add ssh keys to nodes on demand (#411)
Our existing model has a per-scaleset SSH key.  This update moves towards using user provided SSH keys when they need to connect to a given node.
2021-01-06 19:29:38 +00:00
c1a50f6f6c Colocate tasks (#402)
Enables co-locating multiple tasks in a given work-set.

Tasks are bucketed by the following:
* OS
* job id
* setup container
* VM SKU & image (used in pre-1.0 style tasks)
* pool name (used in 1.0+ style tasks)
* if the task needs rebooting after the task setup script executes.

Additionally, a task will end up in a unique bucket if any of the following are true:
* The task is set to run on more than one VM
* The task is missing the `task.config.colocate` flag (all tasks created prior to this functionality) or the value is False

This updates the libfuzzer template to make use of colocation.  Users can specify co-locating all of the tasks *or* co-locating the secondary tasks.
2021-01-06 13:49:15 +00:00
986df8fcc6 limit updating outdated nodes to 500 at a time (#397) 2021-01-05 17:40:36 -05:00
633e5b5f02 restrict api endpoints (#404)
Restrict API endpoints from agents
2021-01-05 19:40:58 +00:00
37f06bb324 handle libfuzzer fuzzing non-zero exits better (#381)
When running libfuzzer in 'fuzzing' mode, we expect the following on exit.

If the exit code is zero, crashing input isn't required.  This happens if the user specifies '-runs=N'

If the exit code is non-zero, then crashes are expected.  In practice, there are two causes to non-zero exits.
1. If the binary can't execute for some reason, like a missing prerequisite
2. If the binary _can_ execute, sometimes the sanitizers are put in such a bad place that they are unable to record the input that caused the crash.

This PR enables handling these two non-zero exit cases.

1. Optionally verify the libfuzzer target loads appropriately using `target_exe -help=1`.  This allows failing faster in the common issues, such a missing prerequisite library.
2. Optionally allow non-zero exits without crashes to be a warning, rather than a task failure.
2021-01-05 14:40:15 +00:00
29c7cfbd5d filter out deleted nodes as to prevent them from being saved later (#391)
In `Scaleset.cleanup_nodes`, nodes that are no longer part of the scaleset should get deleted.  Without filtering the list, the nodes could get re-saved to the Node table later on.
2021-01-04 20:28:57 +00:00