This ought to close the potential for dropped errors and hanging downloads.
Verify needs to be examined; I may have broken it, although all tests pass.
This check needs to be done with each fetch from the storage server, to
detect when someone has changed the share (i.e. our servermap goes stale).
Doing it just once at the beginning of retrieve isn't enough: a write might
occur after the first segment but before the second, etc.
_try_to_validate_prefix() was not removed: it will be used by the future
check-with-each-fetch code.
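A minimal sketch of what that per-fetch check could look like (the names and
the prefix layout here are assumptions, not the final code):

    import struct
    from allmydata.mutable.common import UncoordinatedWriteError

    # assumed SDMF signed-prefix layout: version, seqnum, root hash, IV,
    # then k/N/segsize/datalen
    SIGNED_PREFIX_SIZE = struct.calcsize(">BQ32s16s BBQQ")

    def check_prefix(fetched_data, expected_prefix):
        # A mismatch means the share was rewritten after we built the
        # servermap, i.e. the servermap has gone stale.
        if fetched_data[:SIGNED_PREFIX_SIZE] != expected_prefix:
            raise UncoordinatedWriteError("share changed during retrieve")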
test_mutable.Roundtrip.test_corrupt_all_seqnum_late was disabled, since it
fails until this check is brought back. (The corruption it applies touches
only the prefix, not the block data, so the check-less retrieve actually
tolerates it.) Don't forget to re-enable it once the check is brought back.
This is a neat trick to reduce Foolscap overhead, but the need for an
explicit flush() complicates the Retrieve path and makes it prone to
lost-progress bugs.
Also change test_mutable.FakeStorageServer to tolerate multiple reads of the
same share in a row, a limitation exposed by turning off the queue.
Note that the downloader will still fetch a segment for a zero-length
read, which is wasteful. Fixing that isn't specifically required to fix
#1512, but it should probably be fixed before 1.9.
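A plausible guard for that fix (a sketch: the class is a stand-in for the
real version object, and the helper name is hypothetical):

    from twisted.internet import defer

    class FileVersionSketch:            # stand-in for the real version object
        def read(self, consumer, offset=0, size=None):
            if size == 0:
                # nothing to deliver: skip the segment fetch entirely
                return defer.succeed(consumer)
            return self._start_fetch(consumer, offset, size)  # hypothetical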
publish:
* encrypt and encode times are cumulative, not just current-segment
retrieve:
* same for decrypt and decode times
* update "current status" to include segment number
* set status to Finished/Failed when download is complete
* set progress to 1.0 when complete
More improvements to consider:
* progress is currently 0% or 100%: should calculate how many segments are
  involved (remembering retrieve can be less than the whole file) and set it
  to a fraction (see the sketch after this list)
* "fetch" time is fuzzy: what we want is to know how much of the delay is not
our own fault, but since we do decode/decrypt work while waiting for more
shares, it's not straightforward
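The segment-fraction idea might look like this (all names hypothetical):

    def update_progress(status, first_segnum, last_segnum, segments_fetched):
        # a retrieve may cover only a slice of the file, so measure
        # against the segments this particular read actually needs
        needed = last_segnum - first_segnum + 1
        status.set_progress(float(segments_fetched) / needed)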
The old code was calculating the "extension parameters" (a list) from the
downloader hints (a dictionary) with hints.values(), whose order is not stable, and
would result in corrupted filecaps (with the 'k' and 'segsize' hints
occasionally swapped). The new code always uses [k,segsize].
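In other words (hints being the downloader-hints dictionary):

    # old: dict iteration order is arbitrary, so the filecap could get
    # [segsize, k] instead of [k, segsize]
    extension_parameters = hints.values()

    # new: fixed order, always [k, segsize]
    extension_parameters = [hints["k"], hints["segsize"]]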
Without this, we get a regression when modifying a mutable file that was
created with more shares (larger N) than our current tahoe.cfg . The
modification attempt creates new versions of the (0,1,..,newN-1) shares, but
leaves the old versions of the (newN,..,oldN-1) shares alone (and throws an
assertion error in SDMFSlotWriteProxy.finish_publishing in the process).
The mixed versions that result (some shares with e.g. N=10, some with N=20,
such that both versions are recoverable) cause problems for the Publish code,
even before MDMF landed. Might be related to refs #1390 and refs #1042.
The changes in layout.py are mostly concerned with the MDMF share
format. In particular, we define read and write proxy objects used by
retrieval, publishing, and other code to write and read the MDMF share
format. We create equivalent proxies for SDMF objects so that these
objects can be suitably general.
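The shape of the proxy interface, roughly (method names here are
illustrative; the real definitions are in layout.py):

    class SlotWriteProxy:
        """Hides one share's on-disk format (MDMF or SDMF) from the
        publisher; the MDMF and SDMF write proxies both present this
        same face."""
        def put_block(self, data, segnum, salt):
            raise NotImplementedError
        def put_signature(self, signature):
            raise NotImplementedError
        def finish_publishing(self):
            # push any buffered writes to the storage server
            raise NotImplementedError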
In particular:
- Break MutableFileNode and MutableFileVersion into distinct classes.
- Implement the interface modifications made for MDMF.
- Be aware of MDMF caps.
- Learn how to create and work with MDMF files.
Remove install.html (long since deprecated).
Also replace some obsolete references to install.html with references to quickstart.rst.
Fix some broken internal references within docs/historical/historical_known_issues.txt.
Thanks to Ravi Pinjala and Patrick McDonald.
refs #1227
Pass around IServer instance instead of (peerid, rref) tuple. Replace
"descriptor" with "server". Other replacements:
get_all_servers -> get_connected_servers/get_known_servers
get_servers_for_index -> get_servers_for_psi (now returns IServers)
This change still needs to be pushed further down: lots of code is now
getting the IServer and then distributing (peerid, rref) internally.
Instead, it ought to distribute the IServer internally and delay
extracting a serverid or rref until the last moment.
no_network.py was updated to retain parallelism.
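The intended end state looks something like this (a sketch: get_rref() is
assumed to be the IServer accessor, and got_response is illustrative):

    def query_all_servers(storage_broker, storage_index):
        def got_response(buckets, server):
            # still holding the IServer; ask it for a serverid or rref
            # only at the moment one is actually needed
            return (server, buckets)
        ds = []
        for server in storage_broker.get_connected_servers():
            rref = server.get_rref()  # unwrap at the last moment
            d = rref.callRemote("get_buckets", storage_index)
            d.addCallback(got_response, server)
            ds.append(d)
        return ds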
allmydata.util.log.err() either takes a Failure as the first positional
argument, or takes no positional arguments and must be invoked in an
exception handler. Fixed its signature to match both foolscap.logging.log.err
and twisted.python.log.err . Included a brief unit test.
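The two accepted call patterns, matching twisted.python.log.err (the
surrounding functions are illustrative):

    from allmydata.util import log

    def process(task):
        try:
            task.run()       # hypothetical operation
        except Exception:
            log.err()        # no positional args: captures the active exception

    def errback(f):
        log.err(f)           # a Failure as the first positional argument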
Stop checking separately for ConnectionDone/ConnectionLost, since those have
been folded into DeadReferenceError since foolscap-0.3.1 . Write
rrefutil.trap_deadref() in terms of rrefutil.trap_and_discard() to improve
code coverage.
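Roughly (signatures assumed; the real code is in allmydata.util.rrefutil):

    from foolscap.api import DeadReferenceError

    def trap_and_discard(f, *errorTypes):
        # swallow the listed error types; anything else re-raises
        f.trap(*errorTypes)

    def trap_deadref(f):
        # since foolscap-0.3.1, DeadReferenceError covers
        # ConnectionDone/ConnectionLost as well
        return trap_and_discard(f, DeadReferenceError)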
The bug was that a disconnected server could cause us to re-enter the initial
loop() call, sending multiple queries to a single server, provoking an
incorrect UCWE. To fix it, stall the loop() with an eventual.fireEventually()
instead of weird errors. Closes#874 and #786.
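The stall looks roughly like this (fireEventually is foolscap's
eventual-send helper; the method body is paraphrased):

    from foolscap.eventual import fireEventually

    def loop(self):                     # method sketch
        # run the body on a later reactor turn, so a synchronously-firing
        # errback from a disconnected server cannot re-enter loop() and
        # double-query the same server
        d = fireEventually()
        d.addCallback(lambda ignored: self._do_loop())  # hypothetical body
        return d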
Previously, if the file had 0 shares, this would raise TypeError as it tried
to call download_version(None). If the file had some shares but fewer than
'k', it would incorrectly raise MustForceRepairError.
Added get_successful() to the IRepairResults API, to give repair() a place to
report non-code-bug problems like this.
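Callers can then check for these conditions without a traceback, along
these lines (usage assumed; only get_successful() comes from this change):

    from allmydata.monitor import Monitor

    d = filenode.check_and_repair(Monitor())
    def _done(check_and_repair_results):
        rr = check_and_repair_results.get_repair_results()
        if not rr.get_successful():
            # a non-code-bug failure: e.g. 0 shares, or fewer than 'k'
            # shares when force=True wasn't given
            handle_unrepairable(rr)  # hypothetical handler
    d.addCallback(_done)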
Mutable servermap updates and the immutable checker, when run with
add_lease=True, send both the do-you-have-block and add-lease commands in
parallel, to avoid an extra round trip time. Many older servers have problems
with add-lease and raise various exceptions, which don't generally matter.
The client-side code was catching+ignoring some of them, but unrecognized
exceptions were passed through to the DYHB code, concealing the DYHB results
from the checker, making it think the server had no shares.
The fix is to separate the code paths. Both commands are sent at the same
time, but the errback path from add-lease is handled separately. Known
exceptions are ignored, the others (both unknown-remote and all-local) are
logged (log.WEIRD, which will trigger an Incident), but neither will affect
the DYHB results.
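Schematically (following the immutable checker; remote-method details
differ for the mutable servermap case):

    from allmydata.util import log

    def query_server(rref, storage_index, renew_secret, cancel_secret,
                     add_lease):
        # the DYHB query: its Deferred is the only one the checker sees
        d = rref.callRemote("get_buckets", storage_index)
        if add_lease:
            d2 = rref.callRemote("add_lease", storage_index,
                                 renew_secret, cancel_secret)
            def _add_lease_failed(f):
                if f.check(IndexError):  # e.g. a known-harmless old-server error
                    return
                log.err(f, level=log.WEIRD)  # unknown: log, trigger an Incident
            # errors stay on d2's own errback chain and never touch d
            d2.addErrback(_add_lease_failed)
        return d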
The add-lease messages are sent first, and we know that each server handles them
synchronously. So when the checker is done, we can be sure that all the
add-lease messages have been retired. This makes life easier for unit tests.
Under normal conditions this wouldn't cause any problems, but if the shares
are really sparse (perhaps because new servers were added), then file-modify
operations might stop looking too early and leave old shares in place.