Rename "downloadable" to "target".
Rename "u" to "v" in FileDownloader.__init__().
Rename "_uri" to "_verifycap" in FileDownloader.
Rename "_downloadable" to "_target" in FileDownloader.
Rename "FileDownloader" to "CiphertextDownloader".
FileDownloader takes a verify cap and produces ciphertext, instead of taking a read cap and producing plaintext.
FileDownloader does all integrity checking including the mandatory ciphertext hash tree and the optional ciphertext flat hash, rather than expecting its target to do some of that checking.
Rename immutable.download.Output to immutable.download.DecryptingOutput. An instance of DecryptingOutput can be passed to FileDownloader to use as the latter's target. Text pushed to the DecryptingOutput is decrypted and then pushed to *its* target.
DecryptingOutput satisfies the IConsumer interface, and if its target also satisfies IConsumer, then it forwards and pause/unpause signals to its producer (which is the FileDownloader).
This patch also changes some logging code to use the new logging mixin class.
Check integrity of a segment and decrypt the segment one block-sized buffer at a time instead of copying the buffers together into one segment-sized buffer (reduces peak memory usage, I think, and is probably a tad faster/less CPU, depending on your encoding parameters).
Refactor FileDownloader so that processing of segments and of tail-segment share as much code is possible.
FileDownloader and FileNode take caps as instances of URI (Python objects), not as strings.
This makes Uploader take an EncryptedUploadable object instead of an Uploadable object. I also changed it to return a verify cap instead of a tuple of the bits of data that one finds in a verify cap.
This will facilitate hooking together an Uploader and a Downloader to make a Repairer.
Also move offloaded.py into src/allmydata/immutable/.
The code for validating the share hash tree and the block hash tree has been rewritten to make sure it handles all cases, to share metadata about the file (such as the share hash tree, block hash trees, and UEB) among different share downloads, and not to require hashes to be stored on the server unnecessarily, such as the roots of the block hash trees (not needed since they are also the leaves of the share hash tree), and the root of the share hash tree (not needed since it is also included in the UEB). It also passes the latest tests including handling corrupted shares well.
ValidatedReadBucketProxy takes a share_hash_tree argument to its constructor, which is a reference to a share hash tree shared by all ValidatedReadBucketProxies for that immutable file download.
ValidatedReadBucketProxy requires the block_size and share_size to be provided in its constructor, and it then uses those to compute the offsets and lengths of blocks when it needs them, instead of reading those values out of the share. The user of ValidatedReadBucketProxy therefore has to have first used a ValidatedExtendedURIProxy to compute those two values from the validated contents of the URI. This is pleasingly simplifies safety analysis: the client knows which span of bytes corresponds to a given block from the validated URI data, rather than from the unvalidated data stored on the storage server. It also simplifies unit testing of verifier/repairer, because now it doesn't care about the contents of the "share size" and "block size" fields in the share. It does not relieve the need for share data v2 layout, because we still need to store and retrieve the offsets of the fields which come after the share data, therefore we still need to use share data v2 with its 8-byte fields if we want to store share data larger than about 2^32.
Specify which subset of the block hashes and share hashes you need while downloading a particular share. In the future this will hopefully be used to fetch only a subset, for network efficiency, but currently all of them are fetched, regardless of which subset you specify.
ReadBucketProxy hides the question of whether it has "started" or not (sent a request to the server to get metadata) from its user.
Download is optimized to do as few roundtrips and as few requests as possible, hopefully speeding up download a bit.
Refactor into a class the logic of asking each server in turn until one of them gives an answer
that validates. It is called ValidatedThingObtainer.
Refactor the downloading and verification of the URI Extension Block into a class named
ValidatedExtendedURIProxy.
The new logic of validating UEBs is minimalist: it doesn't require the UEB to contain any
unncessary information, but of course it still accepts such information for backwards
compatibility (so that this new download code is able to download files uploaded with old, and
for that matter with current, upload code).
The new logic of validating UEBs follows the practice of doing all validation up front. This
practice advises one to isolate the validation of incoming data into one place, so that all of
the rest of the code can assume only valid data.
If any redundant information is present in the UEB+URI, the new code cross-checks and asserts
that it is all fully consistent. This closes some issues where the uploader could have
uploaded inconsistent redundant data, which would probably have caused the old downloader to
simply reject that download after getting a Python exception, but perhaps could have caused
greater harm to the old downloader.
I removed the notion of selecting an erasure codec from codec.py based on the string that was
passed in the UEB. Currently "crs" is the only such string that works, so
"_assert(codec_name == 'crs')" is simpler and more explicit. This is also in keeping with the
"validate up front" strategy -- now if someone sets a different string than "crs" in their UEB,
the downloader will reject the download in the "validate this UEB" function instead of in a
separate "select the codec instance" function.
I removed the code to check plaintext hashes and plaintext Merkle Trees. Uploaders do not
produce this information any more (since it potentially exposes confidential information about
the file), and the unit tests for it were disabled. The downloader before this patch would
check that plaintext hash or plaintext merkle tree if they were present, but not complain if
they were absent. The new downloader in this patch complains if they are present and doesn't
check them. (We might in the future re-introduce such hashes over the plaintext, but encrypt
the hashes which are stored in the UEB to preserve confidentiality. This would be a double-
check on the correctness of our own source code -- the current Merkle Tree over the ciphertext
is already sufficient to guarantee the integrity of the download unless there is a bug in our
Merkle Tree or AES implementation.)
This patch increases the lines-of-code count by 8 (from 17,770 to 17,778), and reduces the
uncovered-by-tests lines-of-code count by 24 (from 1408 to 1384). Those numbers would be more
meaningful if we omitted src/allmydata/util/ from the test-coverage statistics.
This fixes#491 (URIs do not refer to unique files in Allmydata Tahoe).
Fortunately all of the versions of Tahoe currently in use are already producing
this ciphertext hash tree when uploading, so there is no
backwards-compatibility problem with having the downloader require it to be
present.
This removes the guess-partial-information attack vector, and reduces
the amount of overhead that we consume with each file. It also introduces
a forwards-compability break: older versions of the code (before the
previous download-time "make hashes optional" patch) will be unable
to read files uploaded by this version, as they will complain about the
missing hashes. This patch is experimental, and is being pushed into
trunk to obtain test coverage. We may undo it before releasing 1.0.
Now upload or encode methods take a required argument named "convergence" which can be either None, indicating no convergent encryption at all, or a string, which is the "added secret" to be mixed in to the content hash key. If you want traditional convergent encryption behavior, set the added secret to be the empty string.
This patch also renames "content hash key" to "convergent encryption" in a argument names and variable names. (A different and larger renaming is needed in order to clarify that Tahoe supports immutable files which are not encrypted content-hash-key a.k.a. convergent encryption.)
This patch also changes a few unit tests to use non-convergent encryption, because it doesn't matter for what they are testing and non-convergent encryption is slightly faster.
This removes the guess-partial-information attack vector, and reduces
the amount of overhead that we consume with each file. It also introduces
a forwards-compability break: older versions of the code (before the
previous download-time "make hashes optional" patch) will be unable
to read files uploaded by this version, as they will complain about the
missing hashes. This patch is experimental, and is being pushed into
trunk to obtain test coverage. We may undo it before releasing 1.0.
Removing the plaintext hashes can help with the guess-partial-information
attack. This does not affect compatibility, but if and when we actually
remove any hashes from the share, that will introduce a
forwards-compatibility break: tahoe-0.9 will not be able to read such files.
This will make it easier to change RIBucketWriter in the future to reduce the wire
protocol to just open/write(offset,data)/close, and do all the structuring on the
client end. The ultimate goal is to store each bucket in a single file, to reduce
the considerable filesystem-quantization/inode overhead on the storage servers.
The only SHA-1 hash that remains is used in the permutation of nodeids,
where we need to decide if we care about performance or long-term security.
I suspect that we could use a much weaker hash (and faster) hash for
this purpose. In the long run, we'll be doing thousands of such hashes
for each file uploaded or downloaded (one per known peer).
This (compatibility-breaking) change moves much of the validation data and
encoding parameters out of the URI and into the so-called "thingA" block
(which will get a better name as soon as we find one we're comfortable with).
The URI retains the "storage_index" (a generalized term for the role that
we're currently using the verifierid for, the unique index for each file
that gets used by storage servers to decide which shares to return), the
decryption key, the needed_shares/total_shares counts (since they affect
peer selection), and the hash of the thingA block.
This shortens the URI and lets us add more kinds of validation data without
growing the URI (like plaintext merkle trees, to enable strong incremental
plaintext validation), at the cost of maybe 150 bytes of alacrity. Each
storage server holds an identical copy of the thingA block.
This is an incompatible change: new messages have been added to the storage
server interface, and the URI format has changed drastically.