If a server did not respond to the pre-repair filecheck, but did respond
to the repair, that server was not correctly added to the
RepairResults.data["servers-responding"] list. (This resulted from a
buggy usage of DictOfSets.union() in filenode.py).
In addition, servers to which filecheck queries were sent, but did not
respond, were incorrectly added to the servers-responding list
anyawys. (This resulted from code in the checker.py not paying attention
to the 'responded' flag).
The first bug was neatly masked by the second: it's pretty rare to have
a server suddenly start responding in the one-second window between a
filecheck and a subsequent repair, and if the server was around for the
filecheck, you'd never notice the problem. I only spotted the smelly
code while I was changing it for IServer cleanup purposes.
I added coverage to test_repairer.py for this. Trying to get that test
to fail before fixing the first bug is what led me to discover the
second bug. I also had to update test_corrupt_file_verno, since it was
incorrectly asserting that 10 servers responded, when in fact one of
them throws an error (but the second bug was causing it to be reported
anyways).
Shares are still verified in parallel, but within a share, don't request a
block until the previous block has been verified and the memory we used to hold
it has been freed up.
Patch originally due to Brian. This version has a mockery-patchery-style test
which is "low tech" (it implements the patching inline in the test code instead
of using an extension of the mock.patch() function from the mock library) and
which unpatches in case of exception.
fixes#1395
This patch was originally written by Brian, but was re-recorded by Zooko to use
darcs replace instead of hunks for any file in which it would result in fewer
total hunks.
refs #1363
Pass around IServer instance instead of (peerid, rref) tuple. Replace
"descriptor" with "server". Other replacements:
get_all_servers -> get_connected_servers/get_known_servers
get_servers_for_index -> get_servers_for_psi (now returns IServers)
This change still needs to be pushed further down: lots of code is now
getting the IServer and then distributing (peerid, rref) internally.
Instead, it ought to distribute the IServer internally and delay
extracting a serverid or rref until the last moment.
no_network.py was updated to retain parallelism.
allmydata.util.log.err() either takes a Failure as the first positional
argument, or takes no positional arguments and must be invoked in an
exception handler. Fixed its signature to match both foolscap.logging.log.err
and twisted.python.log.err . Included a brief unit test.
Stop checking separately for ConnectionDone/ConnectionLost, since those have
been folded into DeadReferenceError since foolscap-0.3.1 . Write
rrefutil.trap_deadref() in terms of rrefutil.trap_and_discard() to improve
code coverage.
Mutable servermap updates and the immutable checker, when run with
add_lease=True, send both the do-you-have-block and add-lease commands in
parallel, to avoid an extra round trip time. Many older servers have problems
with add-lease and raise various exceptions, which don't generally matter.
The client-side code was catching+ignoring some of them, but unrecognized
exceptions were passed through to the DYHB code, concealing the DYHB results
from the checker, making it think the server had no shares.
The fix is to separate the code paths. Both commands are sent at the same
time, but the errback path from add-lease is handled separately. Known
exceptions are ignored, the others (both unknown-remote and all-local) are
logged (log.WEIRD, which will trigger an Incident), but neither will affect
the DYHB results.
The add-lease message is sent first, and we know that the server handles them
synchronously. So when the checker is done, we can be sure that all the
add-lease messages have been retired. This makes life easier for unit tests.
* stop using IURI as an adapter
* pass cap strings around instead of URI instances
* move filenode/dirnode creation duties from Client to new NodeMaker class
* move other Client duties to KeyGenerator, SecretHolder, History classes
* stop passing Client reference to dirnode/filenode constructors
- pass less-powerful references instead, like StorageBroker or Uploader
* always create DirectoryNodes by wrapping a filenode (mutable for now)
* remove some specialized mock classes from unit tests
Detailed list of changes (done one at a time, then merged together)
always pass a string to create_node_from_uri(), not an IURI instance
always pass a string to IFilesystemNode constructors, not an IURI instance
stop using IURI() as an adapter, switch on cap prefix in create_node_from_uri()
client.py: move SecretHolder code out to a separate class
test_web.py: hush pyflakes
client.py: move NodeMaker functionality out into a separate object
LiteralFileNode: stop storing a Client reference
immutable Checker: remove Client reference, it only needs a SecretHolder
immutable Upload: remove Client reference, leave SecretHolder and StorageBroker
immutable Repairer: replace Client reference with StorageBroker and SecretHolder
immutable FileNode: remove Client reference
mutable.Publish: stop passing Client
mutable.ServermapUpdater: get StorageBroker in constructor, not by peeking into Client reference
MutableChecker: reference StorageBroker and History directly, not through Client
mutable.FileNode: removed unused indirection to checker classes
mutable.FileNode: remove Client reference
client.py: move RSA key generation into a separate class, so it can be passed to the nodemaker
move create_mutable_file() into NodeMaker
test_dirnode.py: stop using FakeClient mockups, use NoNetworkGrid instead. This simplifies the code, but takes longer to run (17s instead of 6s). This should come down later when other cleanups make it possible to use simpler (non-RSA) fake mutable files for dirnode tests.
test_mutable.py: clean up basedir names
client.py: move create_empty_dirnode() into NodeMaker
dirnode.py: get rid of DirectoryNode.create
remove DirectoryNode.init_from_uri, refactor NodeMaker for customization, simplify test_web's mock Client to match
stop passing Client to DirectoryNode, make DirectoryNode.create_with_mutablefile the normal DirectoryNode constructor, start removing client from NodeMaker
remove Client from NodeMaker
move helper status into History, pass History to web.Status instead of Client
test_mutable.py: fix minor typo
instead of examining the value returned by f.trap, because the latter appears
to squash exception types down into their base classes (i.e. since
ShareVersionIncompatible is a subclass of LayoutInvalid,
f.trap(Failure(ShareVersionIncompatible)) == LayoutInvalid).
All this resulted in 'incompatible' shares being misclassified as 'corrupt'.
This implements an immutable repairer by marrying a CiphertextDownloader to a CHKUploader. It extends the IDownloadTarget interface so that the downloader can provide some metadata that the uploader requires.
The processing is incremental -- it uploads the first segments before it finishes downloading the whole file. This is necessary so that you can repair large files without running out of RAM or using a temporary file on the repairer.
It requires only a verifycap, not a readcap. That is: it doesn't need or use the decryption key, only the integrity check codes.
There are several tests marked TODO and several instances of XXX in the source code. I intend to open tickets to document further improvements to functionality and testing, but the current version is probably good enough for Tahoe-1.3.0.
We were not making sure that we really got all the crypttext hashes during download. If a server were to return less than the complete set of crypttext hashes, then our subsequent attempt to verify the correctness of the ciphertext would fail. (And it wouldn't be obvious without very careful debugging why it had failed.)
This patch makes it so that you keep trying to get ciphertext hashes until you have a full set or you run out of servers to ask.
New checker and verifier use the new download class. They are robust against various sorts of failures or corruption. They return detailed results explaining what they learned about your immutable files. Some grotesque sorts of corruption are not properly handled yet, and those ones are marked as TODO or commented-out in the unit tests.
There is also a repairer module in this patch with the beginnings of a repairer in it. That repairer is mostly just the interface to the outside world -- the core operation of actually reconstructing the missing data blocks and uploading them is not in there yet.
This patch also refactors the unit tests in test_immutable so that the handling of each kind of corruption is reported as passing or failing separately, can be separately TODO'ified, etc. The unit tests are also improved in various ways to require more of the code under test or to stop requiring unreasonable things of it. :-)
I get confused about whether a given argument or return value is a uri-as-string or uri-as-object. This patch adds a lot of assertions that it is one or the other, and also changes CheckerResults to take objects not strings.
In the future, I hope that we generally use Python objects except when importing into or exporting from the Python interpreter e.g. over the wire, the UI, or a stored file.
Refactor into a class the logic of asking each server in turn until one of them gives an answer
that validates. It is called ValidatedThingObtainer.
Refactor the downloading and verification of the URI Extension Block into a class named
ValidatedExtendedURIProxy.
The new logic of validating UEBs is minimalist: it doesn't require the UEB to contain any
unncessary information, but of course it still accepts such information for backwards
compatibility (so that this new download code is able to download files uploaded with old, and
for that matter with current, upload code).
The new logic of validating UEBs follows the practice of doing all validation up front. This
practice advises one to isolate the validation of incoming data into one place, so that all of
the rest of the code can assume only valid data.
If any redundant information is present in the UEB+URI, the new code cross-checks and asserts
that it is all fully consistent. This closes some issues where the uploader could have
uploaded inconsistent redundant data, which would probably have caused the old downloader to
simply reject that download after getting a Python exception, but perhaps could have caused
greater harm to the old downloader.
I removed the notion of selecting an erasure codec from codec.py based on the string that was
passed in the UEB. Currently "crs" is the only such string that works, so
"_assert(codec_name == 'crs')" is simpler and more explicit. This is also in keeping with the
"validate up front" strategy -- now if someone sets a different string than "crs" in their UEB,
the downloader will reject the download in the "validate this UEB" function instead of in a
separate "select the codec instance" function.
I removed the code to check plaintext hashes and plaintext Merkle Trees. Uploaders do not
produce this information any more (since it potentially exposes confidential information about
the file), and the unit tests for it were disabled. The downloader before this patch would
check that plaintext hash or plaintext merkle tree if they were present, but not complain if
they were absent. The new downloader in this patch complains if they are present and doesn't
check them. (We might in the future re-introduce such hashes over the plaintext, but encrypt
the hashes which are stored in the UEB to preserve confidentiality. This would be a double-
check on the correctness of our own source code -- the current Merkle Tree over the ciphertext
is already sufficient to guarantee the integrity of the download unless there is a bug in our
Merkle Tree or AES implementation.)
This patch increases the lines-of-code count by 8 (from 17,770 to 17,778), and reduces the
uncovered-by-tests lines-of-code count by 24 (from 1408 to 1384). Those numbers would be more
meaningful if we omitted src/allmydata/util/ from the test-coverage statistics.
to make the internal ones use binary strings (nodeid, storage index) and
the web/JSON ones use base32-encoded strings. The immutable verifier is
still incomplete (it returns imaginary healty results).