Commit Graph

223 Commits

Author SHA1 Message Date
e799eb03cd Shorten the expiry window for the work queue SAS URLs assigned at node registration (#416)
The underlying impact is that nodes must re-register on a more frequent basis.

Nodes find out they are out-of-date is during registration and immediately prior to starting a new set of work.  Requiring nodes re-register on a shortened cycle provides more opportunities for nodes to get re-imaged.

Additionally, this addresses an issue handling the SAS URL expiry in a more clean fashion in the supervisor.
2021-01-07 12:34:26 +00:00
3b26ffef65 support multiple corpus accounts (#334)
Add support for sharding across multiple storage accounts for blob containers used for corpus management.

Things to note:

1. Additional storage accounts must be in the same resource group, support the "blob" endpoint, and have the tag `storage_type` with the value `corpus`.  A utility is provided (`src/utils/add-corpus-storage-accounts`), which adds storage accounts. 
2. If any secondary storage accounts exist, they are used by default for containers.
3. Storage account names are cached in memory the Azure Function instance forever.   Upon adding new storage accounts, the app needs to be restarted to pick up the new accounts.
2021-01-06 23:11:39 +00:00
f345bd239d Add ssh keys to nodes on demand (#411)
Our existing model has a per-scaleset SSH key.  This update moves towards using user provided SSH keys when they need to connect to a given node.
2021-01-06 19:29:38 +00:00
c1a50f6f6c Colocate tasks (#402)
Enables co-locating multiple tasks in a given work-set.

Tasks are bucketed by the following:
* OS
* job id
* setup container
* VM SKU & image (used in pre-1.0 style tasks)
* pool name (used in 1.0+ style tasks)
* if the task needs rebooting after the task setup script executes.

Additionally, a task will end up in a unique bucket if any of the following are true:
* The task is set to run on more than one VM
* The task is missing the `task.config.colocate` flag (all tasks created prior to this functionality) or the value is False

This updates the libfuzzer template to make use of colocation.  Users can specify co-locating all of the tasks *or* co-locating the secondary tasks.
2021-01-06 13:49:15 +00:00
986df8fcc6 limit updating outdated nodes to 500 at a time (#397) 2021-01-05 17:40:36 -05:00
633e5b5f02 restrict api endpoints (#404)
Restrict API endpoints from agents
2021-01-05 19:40:58 +00:00
37f06bb324 handle libfuzzer fuzzing non-zero exits better (#381)
When running libfuzzer in 'fuzzing' mode, we expect the following on exit.

If the exit code is zero, crashing input isn't required.  This happens if the user specifies '-runs=N'

If the exit code is non-zero, then crashes are expected.  In practice, there are two causes to non-zero exits.
1. If the binary can't execute for some reason, like a missing prerequisite
2. If the binary _can_ execute, sometimes the sanitizers are put in such a bad place that they are unable to record the input that caused the crash.

This PR enables handling these two non-zero exit cases.

1. Optionally verify the libfuzzer target loads appropriately using `target_exe -help=1`.  This allows failing faster in the common issues, such a missing prerequisite library.
2. Optionally allow non-zero exits without crashes to be a warning, rather than a task failure.
2021-01-05 14:40:15 +00:00
29c7cfbd5d filter out deleted nodes as to prevent them from being saved later (#391)
In `Scaleset.cleanup_nodes`, nodes that are no longer part of the scaleset should get deleted.  Without filtering the list, the nodes could get re-saved to the Node table later on.
2021-01-04 20:28:57 +00:00
4c2679d61e Re-add windows ssh key (#390)
Adds a scaleset specific setup script, which allows us to save the scaleset based SSH keys into the VM on setup.
2021-01-04 19:52:27 +00:00
e6b55ab95a Simplify job template management workflow (#354)
1. Merge 'create' and 'update' to a single 'save' operation.
2. Allow fetching a single template.

This enables the following workflow:

```
$ onefuzz job_templates manage get libfuzzer_linux > template.json
$ <... update template as desired ...>
$ onefuzz job_templates manage save libfuzzer_linux @./template.json
$
```
2020-12-02 14:27:42 +00:00
9b3ccf37ea use the correct instrumentation key (#355) 2020-12-01 18:44:10 -05:00
7f97c142ed add the instrumentation key to Info (#353) 2020-12-01 11:13:06 -05:00
30cc5d4778 ignore nodes already scheduled for re-imaging in outdated check (#341)
If a node is already scheduled to be reimaged/deleted, we should not bother checking if it's outdated.
2020-11-30 17:36:15 +00:00
33b7608aaf Adding option to merge all inputs at once (#282) 2020-11-24 08:43:08 -05:00
d47124fe8c Fix state management in the scheduler (#337) 2020-11-24 12:43:51 +00:00
9e2a61fe66 Add user_info to Jobs & Repro (#327)
This adds information about the user that created a job or repro VM to the respective resources.

This expands on the addition made to tasks in #303.
2020-11-20 15:46:52 +00:00
b2b4a06afa Address typing issues hidden by memoization.caching (#322) 2020-11-18 15:08:40 -05:00
e47e89609a Use Storage Account types, rather than account_id (#320)
We need to move to supporting data sharding.

One of the steps towards that is stop passing around `account_id`, rather we need to specify the type of storage we need.
2020-11-18 14:06:14 +00:00
64bd389eb7 Declarative templates (#266) 2020-11-17 16:00:09 -05:00
beea318968 Add User Info to created tasks (#303)
This PR makes user information from JWT tokens available as part of a Task.

Included changes:
* Renamed `verify_token` to `call_if_agent`, since this function is specific to agent token verification
* Renames `is_authorized` to `is_agent`, since this function checks if the token is an agent
* Adds support for unmanaged nodes in `is_agent` (see #133 for information) 
* Saves the user information from the JWT token on task create as part of `TaskConfig`

Note, `TaskConfig` is what is provided to notification templates.  This enables Github issues and ADO work items to tie back to the user that created the task.

Note, while `upn` _usually_ means email for AAD user tokens.  If we were going to make use of the email address, we should perform a graph lookup based on the `oid`, but we're not.
2020-11-13 11:50:52 +00:00
31f099d3d4 Event based webhooks (#296) 2020-11-12 17:44:42 -05:00
a0b5d10c81 Add target_workers to TaskUnitConfig (#305) 2020-11-12 13:22:53 -05:00
ca209eb543 refactor agent_events handler (#261) 2020-11-11 18:28:16 -05:00
dec1a2d7b0 removing nodes whose ground truth is not avail (#275) 2020-11-11 12:20:05 -05:00
ba59230187 fix create_vmss log message (#293) 2020-11-11 12:17:42 -05:00
e638908aac Add application-insights debug cli (#281) 2020-11-11 06:17:43 -05:00
82806b1cf2 Keeps task/node association until the nodes are reimaged (#273) 2020-11-10 17:41:51 -05:00
bbee84ab1f Storing the user assigned managed identity in the scaleset table (#255) 2020-11-05 18:36:59 -05:00
b5578381ce default TTL for queued messages to infinite (#259) 2020-11-04 15:41:05 -05:00
04643a9eed fixing libfuzzer_merge (#240) 2020-11-03 15:46:18 -05:00
4ef489b397 adding node shutdown (#252) 2020-11-03 11:39:51 -05:00
6c598773dd add instance_id generated at install time (#245) 2020-11-02 14:27:51 -05:00
ced8200d74 enable setting ensemble sync duration timer (#229) 2020-10-29 14:48:12 -04:00
154be220ae Enable User assigned managed identity for scalesets (#219) 2020-10-29 13:53:11 -04:00
f4b874e19e Always use the get_*_account helper methods (#226) 2020-10-28 21:40:21 -04:00
1d2fb99dd4 expose the ability manually override node reset (#201) 2020-10-27 17:29:53 -04:00
d4c584342a address multiple issues found by pylint (#206) 2020-10-26 12:24:50 -04:00
8a62830f2b reduce how often list_containers is called (#196) 2020-10-23 09:43:45 -04:00
bfbc9f8c9e Bug fixes for Pool Resize (#158) 2020-10-23 08:58:15 -04:00
d769072343 cache tokens in memory forever (#195) 2020-10-22 19:13:59 -04:00
0a560cefba fix format strings (#189) 2020-10-22 12:15:39 -04:00
041c6ae130 Reimage dead nodes (#154) 2020-10-20 16:58:02 -04:00
178537df05 Store the heartbeat data in the task and node tables (#164) 2020-10-20 14:24:00 -04:00
75f29b9f2e Remove update_event as a single event loop for the system (#160) 2020-10-16 21:42:35 -04:00
fa25823342 split node and task heartbeats in two nodes (#163) 2020-10-15 21:30:03 -04:00
645a5e5702 stop automatically queueing objects for work (#159) 2020-10-15 14:39:37 -04:00
3189daeeb7 implementing heartbeat for the supervisor (#30) 2020-10-14 15:13:16 -04:00
d73616b366 Save the managed identity as soon as it's available (#144) 2020-10-14 11:38:05 -04:00
7f0c25e2da Managing Pool Resizing at service side (#107) 2020-10-13 14:04:26 -04:00
1398285aea Fixes typing error identified by a new mypy release (#129) 2020-10-09 16:44:59 -04:00