In `TimerWorkers`, if updating one entity throws an exception, we abandon the whole update. Instead, log the error and continue processing the remaining entities. This lets us make progress even if one entity is stuck.
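The real `TimerWorkers` code is C#. The Python sketch below only illustrates the "log and continue" pattern. The names `update_all` and `update` are hypothetical. Each entity is updated inside its own `try`, so one stuck entity is logged and reported back instead of aborting the loop.

```python
import logging
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")

log = logging.getLogger("timer_workers")


def update_all(entities: Iterable[T], update: Callable[[T], None]) -> List[T]:
    """Attempt to update every entity and return the ones that failed.

    Hypothetical helper: a failure on one entity is logged and skipped
    instead of abandoning the remaining updates.
    """
    failed: List[T] = []
    for entity in entities:
        try:
            update(entity)
        except Exception:
            # log.exception records the traceback at ERROR level; then move on
            log.exception("failed to update entity %r", entity)
            failed.append(entity)
    return failed
```

Returning the failed entities lets the caller decide whether to retry them on the next timer tick.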
In our integration test runs we are seeing connection-reset errors, which cause CLI operations to fail.
To fix this:
1. Enable TCP keep-alive so that connections through the Azure load balancer stay alive longer than its default idle timeout (4 minutes).
2. Treat `ConnectionResetError` as retryable.
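The two steps above can be sketched at the raw-socket level with only the Python standard library. This is not the CLI's actual code. The names `enable_keepalive` and `with_retry` and the timer values are assumptions. The one hard requirement is that the idle time before the first probe is shorter than the load balancer's 4-minute idle timeout. The `TCP_KEEP*` options are platform specific, so they are guarded with `hasattr`.

```python
import socket
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Assumed values: start probing well before the 4-minute idle timeout.
KEEPALIVE_IDLE_SECONDS = 60
KEEPALIVE_INTERVAL_SECONDS = 30
KEEPALIVE_PROBES = 5


def enable_keepalive(sock: socket.socket) -> None:
    """Turn on TCP keep-alive and tune its timers where the platform allows."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These constants exist only on some platforms (e.g. Linux), so guard them.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, KEEPALIVE_IDLE_SECONDS)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, KEEPALIVE_INTERVAL_SECONDS)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, KEEPALIVE_PROBES)


def with_retry(op: Callable[[], T], attempts: int = 3, delay: float = 1.0) -> T:
    """Call op, retrying when the peer resets the connection."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts):
        try:
            return op()
        except ConnectionResetError:
            time.sleep(delay * attempt)  # simple linear backoff
    return op()  # final attempt: let any error propagate to the caller
```

Only `ConnectionResetError` is retried here. Other errors still fail immediately, so genuine bugs are not masked by retries.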
* use `InterpolatedStringHandler` to move values into `CustomDimensions` tags instead of embedding them in the error message
* log blob save raw response failure
* add `StringBuilder` to `CSharpExtensions`
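The C# change relies on a custom `InterpolatedStringHandler`, which has no direct Python equivalent. As a rough analog only, the sketch below keeps the log message as a fixed template and attaches the values as a dictionary through the standard `logging` `extra` parameter. The helper `log_with_dimensions`, the demo `CollectingHandler` and the `custom_dimensions` key are assumptions. One likely benefit of this approach is that identical errors share one message text, while the varying values remain queryable as separate fields.

```python
import logging
from typing import Any, List


class CollectingHandler(logging.Handler):
    """Demo handler that keeps every record it receives."""

    def __init__(self) -> None:
        super().__init__()
        self.records: List[logging.LogRecord] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


def log_with_dimensions(logger: logging.Logger, template: str, **dims: Any) -> None:
    """Log a fixed message template. The values are attached to the record as a
    custom_dimensions dict instead of being formatted into the message text."""
    logger.error(template, extra={"custom_dimensions": dims})
```

For example, `log_with_dimensions(logger, "failed to save job {JobId}", JobId="abc")` emits the literal message `failed to save job {JobId}` and carries `{"JobId": "abc"}` on the record.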
Co-authored-by: stas <statis@microsoft.com>
As seen in #2441, it is easy to accidentally drop the return values of updated entities.
This PR adds a Roslyn analyzer that detects unused return values. To ignore a value explicitly, discard it with `_ = …;`
Closes #2442.
* add logs
* avoid relying on exceptions for logic flow control
* add logs to agent commands
* add more logs and fix error logging when table writes fail
* move machine ID to CustomDimensions
* log insert errors
* Log Delete failures
* more logs
* more logs
* more logs
* more logs (I think that's it; there are no more...)
Co-authored-by: stas <statis@microsoft.com>
* fix OMS Linux repro extension config
* fix lost Node state updates
* fix bug in ReproVmss
* rewrite ssh auth
* fix `ssh-keygen` on Windows Azure Functions
* more logs
* try `-P`
* use an empty string for the password
* use an argument list
* addressing comments
Co-authored-by: stas <statis@microsoft.com>
Co-authored-by: George Pollard <gpollard@microsoft.com>
* mark tasks as failed if a work unit cannot be created for the task
* fix up time queries
* query improvements
Co-authored-by: stas <statis@microsoft.com>
* fix queries in timer retention
* do not discard the proxy record after the proxy state is processed, since that record needs to persist
* addressing comments
Co-authored-by: stas <statis@microsoft.com>
Use Codecov to show coverage reports, which provide highlighted versions of the source files that make missing coverage easy to spot.
- Set up Rust coverage using [`cargo-llvm-cov`](https://github.com/taiki-e/cargo-llvm-cov).
- Add the `ci/agent.sh` build script to the agent artifact cache key, since it wasn't there before.
- Don't run Rust tests in `--release` mode (have been meaning to change this so doing it at the same time).
There is some subtlety in storing the coverage result inside the cached agent artifact: when we reuse the cached artifact, we can still upload its coverage information to Codecov. Without this, coverage would appear to drop whenever the cached artifact was reused.
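Adding `ci/agent.sh` to the cache key matters because a content-based key changes whenever any of its inputs change. The CI uses its own tooling to compute the key. The Python sketch below, with the hypothetical helper `cache_key`, only illustrates the idea: hash the contents of every input file. A build-script edit then produces a new key and forces a rebuild instead of reusing a stale artifact.

```python
import hashlib
from pathlib import Path
from typing import Iterable


def cache_key(prefix: str, inputs: Iterable[Path]) -> str:
    """Derive a cache key from the paths and contents of all input files.

    Hypothetical helper: any change to an input, including a build script
    such as ci/agent.sh, yields a different key and therefore a cache miss.
    """
    digest = hashlib.sha256()
    for path in sorted(inputs):  # sort so the key does not depend on input order
        digest.update(str(path).encode())
        digest.update(b"\0")  # separator so path/content boundaries are unambiguous
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return f"{prefix}-{digest.hexdigest()}"
```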
Two fixes to scheduling code:
- `GetPool` was not correct for the VM case (this code is possibly legacy and not used any more)
- `BuildWorkUnit` could fetch the same pool multiple times and then fail due to a `BucketConfig` mismatch (on `TimeStamp`)
  - add a cache to the loop so that we only fetch each pool once
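The fix in `BuildWorkUnit` is C#. As a language-neutral sketch, a per-loop cache guarantees one fetch per distinct pool. Every task that targets the same pool then shares the same fetched object, so values derived from it, such as a `TimeStamp`, cannot disagree. The names `fetch_each_once` and `get_pool` are hypothetical.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Tuple, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


def fetch_each_once(keys: Iterable[K], fetch: Callable[[K], V]) -> List[Tuple[K, V]]:
    """Resolve every key in order, calling fetch at most once per distinct key."""
    cache: Dict[K, V] = {}
    result: List[Tuple[K, V]] = []
    for key in keys:
        if key not in cache:
            cache[key] = fetch(key)  # first time we see this pool: fetch it
        result.append((key, cache[key]))  # later lookups reuse the same object
    return result
```

The cache is scoped to a single loop, so each scheduling pass still sees fresh pool data.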