This removes the guess-partial-information attack vector, and reduces
the amount of overhead that we consume with each file. It also introduces
a forwards-compability break: older versions of the code (before the
previous download-time "make hashes optional" patch) will be unable
to read files uploaded by this version, as they will complain about the
missing hashes. This patch is experimental, and is being pushed into
trunk to obtain test coverage. We may undo it before releasing 1.0.
Now upload or encode methods take a required argument named "convergence" which can be either None, indicating no convergent encryption at all, or a string, which is the "added secret" to be mixed in to the content hash key. If you want traditional convergent encryption behavior, set the added secret to be the empty string.
This patch also renames "content hash key" to "convergent encryption" in a argument names and variable names. (A different and larger renaming is needed in order to clarify that Tahoe supports immutable files which are not encrypted content-hash-key a.k.a. convergent encryption.)
This patch also changes a few unit tests to use non-convergent encryption, because it doesn't matter for what they are testing and non-convergent encryption is slightly faster.
This removes the guess-partial-information attack vector, and reduces
the amount of overhead that we consume with each file. It also introduces
a forwards-compability break: older versions of the code (before the
previous download-time "make hashes optional" patch) will be unable
to read files uploaded by this version, as they will complain about the
missing hashes. This patch is experimental, and is being pushed into
trunk to obtain test coverage. We may undo it before releasing 1.0.
Removing the plaintext hashes can help with the guess-partial-information
attack. This does not affect compatibility, but if and when we actually
remove any hashes from the share, that will introduce a
forwards-compatibility break: tahoe-0.9 will not be able to read such files.
This will make it easier to change RIBucketWriter in the future to reduce the wire
protocol to just open/write(offset,data)/close, and do all the structuring on the
client end. The ultimate goal is to store each bucket in a single file, to reduce
the considerable filesystem-quantization/inode overhead on the storage servers.
The only SHA-1 hash that remains is used in the permutation of nodeids,
where we need to decide if we care about performance or long-term security.
I suspect that we could use a much weaker hash (and faster) hash for
this purpose. In the long run, we'll be doing thousands of such hashes
for each file uploaded or downloaded (one per known peer).
This (compatibility-breaking) change moves much of the validation data and
encoding parameters out of the URI and into the so-called "thingA" block
(which will get a better name as soon as we find one we're comfortable with).
The URI retains the "storage_index" (a generalized term for the role that
we're currently using the verifierid for, the unique index for each file
that gets used by storage servers to decide which shares to return), the
decryption key, the needed_shares/total_shares counts (since they affect
peer selection), and the hash of the thingA block.
This shortens the URI and lets us add more kinds of validation data without
growing the URI (like plaintext merkle trees, to enable strong incremental
plaintext validation), at the cost of maybe 150 bytes of alacrity. Each
storage server holds an identical copy of the thingA block.
This is an incompatible change: new messages have been added to the storage
server interface, and the URI format has changed drastically.
Implement enough of chunk.IncompleteHashTree to be usable.
Rearrange download: all block/hash requests now go through
a ValidatedBucket instance, which is responsible for retrieving
and verifying hashes before providing validated data. Download
was changed to use ValidatedBuckets everywhere instead of
unwrapped RIBucketReader references.
Note that using "whatever version of python the name 'python' maps to in the current shell environment" is more error-prone that specifying which python you mean, such as by executing "/usr/bin/python setup.py" instead of executing "./setup.py". When you build tahoe (by running "make") it will make a copy of bin/allmydata-tahoe in instdir/bin/allmydata-tahoe with the shebang line rewritten to execute the specific version of python that was used when building instead of to execute "/usr/bin/env python".
However, it seems better that the default for lazy people be "whatever 'python' means currently" instead of "whatever 'python' meant to the manufacturer of your operating system".