Commit Graph

1063 Commits

Author SHA1 Message Date
7f66eeee0d handle OperationNotAllowed errors when creating VMSS (#614) 2021-03-02 16:14:10 -05:00
a0c04ec3d1 Add symbol cache and filtering (#570)
- Add caching to symbol table-driven module disassembly on Linux.
- Add configurable regex-based filtering for coverage collection, by module and module-scoped symbol name.

Block coverage recording can be manually tested using the `block_coverage` example in the `coverage` crate. See `./block_coverage -h` for expected args.

The filter file is optional. The file format is JSON like this:
```json
{
    "modules": {
        "allow": [
            "<module-path-regex-1>",
            "<module-path-regex-2>"
        ]
    },
    "symbols": {
        "<module-path-regex-1>": {
            "allow": [
                "<symbol-name-regex-1>",
                "<symbol-name-regex-2>"
            ]
        },
        "<module-path-regex-2>": {
            "deny": [
                "<symbol-name-regex-3>",
                "<symbol-name-regex-4>"
            ]
        }
    }
}
```
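
As a rough illustration of how a filter file in this format might be evaluated, here is a minimal Python sketch. It is not the `coverage` crate's implementation. The matching semantics are assumptions: an `allow` list keeps only matching names, a `deny` list drops matching names, a missing rule permits everything, and patterns are applied with `re.search`. The helper names `rule_permits` and `should_record` and the sample patterns are hypothetical.

```python
import json
import re

# Hypothetical filter document in the format described above.
FILTER_JSON = r"""
{
    "modules": {"allow": ["/usr/lib/.*", ".*/fuzz\\.exe"]},
    "symbols": {
        "/usr/lib/.*": {"allow": ["^png_.*"]},
        ".*/fuzz\\.exe": {"deny": ["^__asan_.*"]}
    }
}
"""


def rule_permits(rule, name):
    """Apply one allow/deny rule to a name (assumed semantics, see lead-in)."""
    if not rule:
        return True  # no rule configured: permit everything
    if "allow" in rule:
        return any(re.search(pattern, name) for pattern in rule["allow"])
    if "deny" in rule:
        return not any(re.search(pattern, name) for pattern in rule["deny"])
    return True


def should_record(filters, module_path, symbol):
    """Decide whether coverage for `symbol` in `module_path` would be recorded."""
    if not rule_permits(filters.get("modules"), module_path):
        return False
    # Symbol rules are scoped to the modules whose path matches the rule's key.
    for module_regex, rule in filters.get("symbols", {}).items():
        if re.search(module_regex, module_path) and not rule_permits(rule, symbol):
            return False
    return True


filters = json.loads(FILTER_JSON)
```

Because the filter file is optional, passing an empty dict to `should_record` permits every module and symbol in this sketch.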

Closes #285.
2021-03-02 19:42:05 +00:00
b97093735a fix agent retry on connection level failures (#623)
In debugging the connection retry issues, I dug into this more.  

Apparently, some of hyper's connection errors are not mapped to std::io::Error, rendering the existing downcast impl ineffective.

As such, this PR makes the following updates:
1. Any request that fails for what `reqwest` calls a `connection` error is considered transient.
2. Updates the retry notify code to use our `warn` macro such that the events show up in application insights.
3. Updates the unit test to demonstrate such failures by trying to connect to `http://localhost:81/`, which shouldn't be listening on any system.
4. Adds a simple unit test to verify that, with `send_retry_default`, connections to https://www.microsoft.com work.
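
The retry policy in the list above can be sketched in Python as follows. This is an illustration only, not the agent's Rust code: `send_with_retry` is a hypothetical helper, the standard-library `ConnectionError` stands in for what `reqwest` calls a `connection` error, and the exponential backoff parameters are assumptions.

```python
import time


def send_with_retry(send, max_attempts=5, base_delay=0.1, sleep=time.sleep, notify=print):
    """Call `send()` and retry only when it fails with a connection-level error.

    `notify` plays the role of the warning hook that reports each retry;
    any other exception type is treated as permanent and propagates at once.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except ConnectionError as err:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last connection error
            delay = base_delay * 2 ** (attempt - 1)  # assumed backoff schedule
            notify(f"request failed (attempt {attempt}): {err!r}; retrying in {delay:.2f}s")
            sleep(delay)
```

Injecting `sleep` and `notify` keeps the sketch testable without real delays, much as the commit's unit test relies on a port that should never be listening instead of a live outage.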

Fixes #263
2021-03-02 19:02:10 +00:00
c537458ade update azure-mgmt-compute to 19.0.0 (#611) 2021-03-02 18:44:26 +00:00
ba836a2062 update azure-mgmt-eventgrid to 3.0.0rc9 (#610) 2021-03-02 18:18:38 +00:00
296ba2ee23 update azure-mgmt-storage to 17.0.0 (#612) 2021-03-02 18:00:07 +00:00
e43c1c875c simplify batch-processing log (#622)
Simplifies the logs from:

`Processing batch-downloaded input Ok(DirEntry(DirEntry("task_crashes_1/input-b4c3482194a6ebd275577ea52255fcea3358f3220c408d3c53b9f32b653e6586.txt")))`

to:

`Processing batch-downloaded input: task_crashes_1/input-b4c3482194a6ebd275577ea52255fcea3358f3220c408d3c53b9f32b653e6586.txt`
2021-03-02 17:32:07 +00:00
d4cedabdf8 update 3rd party rust dependencies (#624) 2021-03-02 11:41:30 -05:00
32681b2611 update azure-mgmt-resource (#607) 2021-03-02 08:35:07 +00:00
100e22a359 Rewrite redundant Result wraps (#616) 2021-03-01 12:43:30 -05:00
0f895d11c9 add context to logging of supervisor work queue interaction (#601) 2021-02-27 20:17:04 -05:00
c1a2c9febb fix infinite loop on request error that isn't an IO Error (#603) 2021-02-26 20:23:39 -05:00
6a82f57c4a remove unused library from prereqs (#599) 2021-02-26 16:57:55 -05:00
e3c73d7a10 Update command variable expansion (#561)
* Documents `crashes_account` and `crashes_container`
* Adds `reports_dir` and support for `unique_reports`, `reports`, and `no_repro` containers to the generic analysis task
* Adds `microsoft_telemetry_key` and `instance_telemetry_key` to generic supervisor, generator, and analysis tasks
2021-02-26 20:58:09 +00:00
419ca05b28 Actively tail worker stdio from supervisor agent (#588)
In the supervisor agent, incrementally read from the running worker agent's redirected stderr and stdout, instead of waiting until it exits.

The worker agent's stderr and stdout are piped to the supervisor when tasks are run. The supervisor's `WorkerRunner` does _not_ use `wait_with_output()`, which handles this (at the cost of blocking). Instead, it makes repeated calls to `try_wait()` on timer-based state transitions, and does not try to read the pipes until the worker exits. But when one of the child's pipes is full, the child can block forever waiting on a `write(2)`, such as in a `log` facade implementation.

This bug has not been caught because we control the child worker agent, and until recently, it mostly only wrote to these streams using `env_logger` at its default log level. But recent work: (1) set more-verbose `INFO` level default logging, (2) logged stderr/stdout lines of child processes of _the worker_, and (3) some user targets logged very verbosely for debugging. This surfaced the underlying issue.
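
The general pattern of draining a child's pipes while polling for its exit can be sketched in Python. This is an analogy under stated assumptions, not the supervisor's Rust implementation: `run_and_tail` is a hypothetical helper, `Popen.poll()` stands in for `try_wait()`, and one reader thread per pipe stands in for whatever incremental read mechanism the agent uses.

```python
import subprocess
import sys
import threading
import time


def run_and_tail(cmd, poll_interval=0.05):
    """Run `cmd`, reading its stdout and stderr incrementally while it runs."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    lines = {"stdout": [], "stderr": []}

    def drain(name, pipe):
        # Reading continuously keeps the pipe from filling up, so the child
        # never blocks in write(2) waiting for the parent to catch up.
        for line in pipe:
            lines[name].append(line.rstrip("\n"))
        pipe.close()

    readers = [
        threading.Thread(target=drain, args=("stdout", proc.stdout), daemon=True),
        threading.Thread(target=drain, args=("stderr", proc.stderr), daemon=True),
    ]
    for reader in readers:
        reader.start()

    # Non-blocking exit check on a timer, analogous to repeated try_wait() calls.
    while proc.poll() is None:
        time.sleep(poll_interval)

    for reader in readers:
        reader.join()
    return proc.returncode, lines


if __name__ == "__main__":
    # A child that writes far more than a typical pipe buffer holds.
    child = "for i in range(20000): print('line', i)"
    code, out = run_and_tail([sys.executable, "-c", child])
    print(code, len(out["stdout"]))
```

If the reader threads were started only after the polling loop, a sufficiently verbose child would stall once its pipe filled, which is the failure mode described above.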
2021-02-26 20:09:02 +00:00
06f45f338c Update Task Heartbeat to include Job_id (#594) 2021-02-26 13:36:10 -05:00
6a049db3a3 Renames application insights keys to be more clear (#587)
* renames `telemetry_key` to `microsoft_telemetry_key`
* renames `instrumentation_key` to `instance_telemetry_key`
* renames `can_share` to `can_share_with_microsoft`
* renames the `applicationinsights-rs` instances to `internal` and `microsoft`, according to the keys used during construction.

This clarifies the underlying use of Application Insights keys and uses tuple structs so that Rust's type checker ensures the keys are used correctly.
2021-02-26 17:04:49 +00:00
8600a44f1f fix bool queries (#597)
This addresses broken queries used for identifying outdated nodes.
2021-02-26 16:51:05 +00:00
bc6c8408c4 add onefuzz containers files download_dir (#598)
fixes #571
2021-02-26 15:27:51 +00:00
4cd2de0e93 Update azure-cli & azure-cli-core (#596) 2021-02-26 09:19:25 -05:00
daef1637f8 update jinja2 (#595) 2021-02-26 09:19:10 -05:00
a3fa5f6b62 Update onefuzz-agent unit tests (#592) 2021-02-24 20:54:36 -05:00
ed86bb0099 Use non-deprecated atomic method (#593) 2021-02-24 17:41:24 -05:00
fb482e357e don't schedule work to a node if the scaleset or pool is shutting down (#583) 2021-02-23 13:33:41 -05:00
e7fe099f25 handle delayed AAD resources in deployments (#585) 2021-02-22 19:40:07 -05:00
cebb84b9e7 handle error condition when creating a container that is being deleted (#582)
When users try to create a container immediately after deleting it, Azure will fail saying the deletion is in-progress.

Catching `ResourceExistsError` during creation handles this error.
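
A minimal sketch of that handling, under loud assumptions: the local `ResourceExistsError` class is a stand-in for `azure.core.exceptions.ResourceExistsError` so the example runs without the Azure SDK, `try_create_container` and its `create` callback are hypothetical, and returning `None` so the caller can retry later is an assumed policy, not necessarily what the service does.

```python
class ResourceExistsError(Exception):
    """Stand-in for azure.core.exceptions.ResourceExistsError (assumption)."""


def try_create_container(create, name):
    """Attempt to create a container; return None if Azure rejects the create.

    `create` is a hypothetical callable that performs the actual SDK call.
    When the container is still being deleted, the create fails with
    ResourceExistsError; this sketch reports that as None instead of raising.
    """
    try:
        return create(name)
    except ResourceExistsError:
        return None
```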
2021-02-22 01:49:07 +00:00
feb80ecb54 allow nodes with multiple tasks to continue on task stop (#567)
As is, when multiple tasks are running on a single node, if any one of them stops, the node gets reimaged.

This changes the behavior such that when a node with multiple tasks has one task stop, the other tasks will continue.
2021-02-19 23:54:26 +00:00
6ba5795f36 update proxy port ranges to avoid current blocks (#552) 2021-02-19 17:50:09 -05:00
4de19ffe5e stop jobs that do not start within 30 days (#565)
If a job does not start within 30 days, stop the job and mark all of the tasks as `failed`.
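
The rule can be sketched as follows. The `Job` and `Task` classes, the state names other than `failed`, and the `stop_if_never_started` helper are hypothetical and exist only to make the example self-contained; the service's actual models differ.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

NEVER_STARTED_LIMIT = timedelta(days=30)


@dataclass
class Task:
    state: str = "waiting"  # hypothetical initial task state


@dataclass
class Job:
    created: datetime
    state: str = "init"  # hypothetical "not yet started" job state
    tasks: list = field(default_factory=list)


def stop_if_never_started(job, now):
    """Stop a job that has not started within 30 days and fail all of its tasks."""
    if job.state != "init" or now - job.created <= NEVER_STARTED_LIMIT:
        return False
    job.state = "stopped"
    for task in job.tasks:
        task.state = "failed"
    return True
```

Passing `now` explicitly (for example `datetime.now(timezone.utc)`) keeps the check deterministic and easy to test.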
2021-02-19 21:23:35 +00:00
305c23a4d9 add instance information to webhooks (#577)
Fixes #574
2021-02-19 21:00:51 +00:00
8ce4638b8a clarify scaleset logging (#568) 2021-02-19 19:36:16 +00:00
4992b494f1 add task config to all task events (#580) 2021-02-19 14:10:48 -05:00
872a5ddc14 add details to exceptions generated during report render failures (#576) 2021-02-19 13:48:49 -05:00
3a7bc95316 import local relative paths (#579) 2021-02-19 12:29:35 -05:00
cc5965ebbf add .gitignore to ignore libfuzzer-dotnet build artifacts (#564) 2021-02-19 09:32:26 +00:00
657af9722c coverage containers should be unique to the project/name/build/platform (#572) 2021-02-18 17:07:44 -05:00
929d9ce496 make user triggered reimaging happen immediately (#566) 2021-02-18 14:08:25 -05:00
279629292f handle SkuNotAvailable errors when creating VM Scalesets (#557) 2021-02-17 16:52:37 -05:00
89d7f060dd make missing symbols for coverage tasks more explicit (#554)
This moves from:

```
"Error: coverage extraction from C:\users\bcaswell\projects\bugs\andrew-coverage-fail\setup\oft-setup-5c77cfe1b181520ab0b33a16286a690a\fuzz.exe failed when processing file "11f6ad8ec52a2984abaafd7c3b516503785c2072".  target appears to be missing sancov instrumentation",
```

To the even more explicit:
```
Error: Target appears to be missing sancov instrumentation.  This error can happen due to missing coverage symbols.
target_exe: C:\users\bcaswell\projects\bugs\andrew-coverage-fail\setup\oft-setup-5c77cfe1b181520ab0b33a16286a690a\fuzz.exe
input: "11f6ad8ec52a2984abaafd7c3b516503785c2072"
debugger stdout:
...
[+] disabling sympath
[+] processing fuzz.exe
[+] no tables  fuzz.exe
[+] processing C:\WINDOWS\SYSTEM32\kernel.appcore.dll
[+] no tables  C:\WINDOWS\SYSTEM32\kernel.appcore.dll
[+] processing C:\WINDOWS\System32\KERNELBASE.dll
[+] no tables  C:\WINDOWS\System32\KERNELBASE.dll
[+] processing C:\WINDOWS\System32\RPCRT4.dll
[+] no tables  C:\WINDOWS\System32\RPCRT4.dll
[+] processing C:\WINDOWS\System32\msvcrt.dll
[+] no tables  C:\WINDOWS\System32\msvcrt.dll
[+] processing C:\WINDOWS\System32\KERNEL32.DLL
[+] no tables  C:\WINDOWS\System32\KERNEL32.DLL
[+] processing ntdll.dll
[+] no tables  ntdll.dll
Error: unable to find sancov counter symbols [at DumpCounters (line 114 col 9)]
...
```
2021-02-17 16:34:09 +00:00
ce47e4924a add status job commands (#550) 2021-02-16 13:47:57 -05:00
c160088998 expose input_blob fields needed to generate crash reports (#551) 2021-02-16 13:16:54 -05:00
f64a0dcc05 lint integration-test.py (#549) 2021-02-16 12:22:45 -05:00
e9b67952e3 update 3rd-party rust dependencies (#548) 2021-02-16 11:11:20 -05:00
933fe6850c libfuzzer-dotnet integration (#535) 2021-02-11 17:30:24 -05:00
360693e8a4 move verbose to debug to align with log and opentelemetry (#541) 2021-02-11 16:49:27 -05:00
a3d73a240d report the total coverage after processing all inputs in local mode (#537) 2021-02-11 19:34:09 +00:00
1e536c54d3 update error message when coverage extraction fails (#539) 2021-02-11 14:18:49 -05:00
18d9daf909 Enable waiting for a job to start for managed templates (#532)
Provide `--wait_for_running` for managed templates.
2021-02-11 08:34:28 +00:00
f8046934e9 add roles to agent & supervisor (#527) 2021-02-10 20:56:22 +00:00
bdcab6eb08 handle tokens from x-ms-token-aad-id-token (#531) 2021-02-10 12:41:15 -05:00