(protocol proposal, work-in-progress, not authoritative)

(this document describes DSA-based mutable files, as opposed to the RSA-based
mutable files that were introduced in tahoe-0.7.0 . This proposal has not yet
been implemented. Please see mutable-DSA.svg for a quick picture of the
crypto scheme described herein)

= Mutable Files =

Mutable File Slots are places with a stable identifier that can hold data
that changes over time. In contrast to CHK slots, for which the
URI/identifier is derived from the contents themselves, the Mutable File Slot
URI remains fixed for the life of the slot, regardless of what data is placed
inside it.

Each mutable slot is referenced by two different URIs. The "read-write" URI
grants read-write access to its holder, allowing them to put whatever
contents they like into the slot. The "read-only" URI is less powerful, only
granting read access, and not enabling modification of the data. The
read-write URI can be turned into the read-only URI, but not the other way
around.

The data in these slots is distributed over a number of servers, using the
same erasure coding that CHK files use, with 3-of-10 being a typical choice
of encoding parameters. The data is encrypted and signed in such a way that
only the holders of the read-write URI will be able to set the contents of
the slot, and only the holders of the read-only URI will be able to read
those contents. Holders of either URI will be able to validate the contents
as being written by someone with the read-write URI. The servers who hold the
shares cannot read or modify them: the worst they can do is deny service (by
deleting or corrupting the shares), or attempt a rollback attack (which can
only succeed with the cooperation of at least k servers).

== Consistency vs Availability ==

There is an age-old battle between consistency and availability. Epic papers
have been written, elaborate proofs have been established, and generations of
theorists have learned that you cannot simultaneously achieve guaranteed
consistency and guaranteed availability. In addition, the closer you get to a
guarantee on either axis, the higher the cost and complexity of the design.

Tahoe's design goals are to strongly favor design simplicity, and then to
slightly favor read availability, over the other criteria.

As we develop more sophisticated mutable slots, the API may expose multiple
read versions to the application layer. The tahoe philosophy is to defer most
consistency recovery logic to the higher layers. Some applications have
effective ways to merge multiple versions, so inconsistency is not
necessarily a problem (e.g. directory nodes can usually merge multiple "add
child" operations).

== The Prime Coordination Directive: "Don't Do That" ==

The current rule for applications which run on top of Tahoe is "do not
perform simultaneous uncoordinated writes". That means you need some
mechanism outside of Tahoe to make sure that two parties are not trying to
modify the same mutable slot at the same time. For example:

 * don't give the read-write URI to anyone else. Dirnodes in a private
   directory generally satisfy this case, as long as you don't use two
   clients on the same account at the same time
 * if you give a read-write URI to someone else, stop using it yourself. An
   inbox would be a good example of this.
 * if you give a read-write URI to someone else, call them on the phone
   before you write into it
 * build an automated mechanism to have your agents coordinate writes.
   For example, we expect a future release to include a FURL for a
   "coordination server" in the dirnodes. The rule can be that you must
   contact the coordination server and obtain a lock/lease on the file
   before you're allowed to modify it.

If you do not follow this rule, Bad Things will happen. The worst-case Bad
Thing is that the entire file will be lost. A less-bad Bad Thing is that one
or more of the simultaneous writers will lose their changes. An observer of
the file may not see monotonically-increasing changes to the file, i.e. they
may see version 1, then version 2, then 3, then 2 again.

Tahoe takes some amount of care to reduce the badness of these Bad Things.
One way you can help nudge it from the "lose your file" case into the "lose
some changes" case is to reduce the number of competing versions: multiple
versions of the file that different parties are trying to establish as the
one true current contents. Each simultaneous writer counts as a "competing
version", as does the previous version of the file. If the count "S" of these
competing versions is larger than N/k, then the file runs the risk of being
lost completely. If at least one of the writers remains running after the
collision is detected, it will attempt to recover, but if S>(N/k) and all
writers crash after writing a few shares, the file will be lost.

== Small Distributed Mutable Files ==

SDMF slots are suitable for small (<1MB) files that are edited by rewriting
the entire file. The three operations are:

 * allocate (with initial contents)
 * set (with new contents)
 * get (old contents)

The first use of SDMF slots will be to hold directories (dirnodes), which map
encrypted child names to rw-URI/ro-URI pairs.

=== SDMF slots overview ===

Each SDMF slot is created with a DSA public/private key pair, using a
system-wide common modulus and generator, in which the private key is a
random 256 bit number, and the public key is a larger value (about 2048 bits)
that can be derived with a bit of math from the private key. The public key
is known as the "verification key", while the private key is called the
"signature key".

The 256-bit signature key is used verbatim as the "write capability". This
can be converted into the 2048ish-bit verification key through a fairly cheap
set of modular exponentiation operations; this is done any time the holder of
the write-cap wants to read the data. (Note that the signature key can either
be a newly-generated random value, or the hash of something else, if we found
a need for a capability that's stronger than the write-cap).

This results in a write-cap which is 256 bits long and can thus be expressed
in an ASCII/transport-safe encoded form (base62 encoding, fits in 72
characters, including a local-node http: convenience prefix).

The private key is hashed to form a 256-bit "salt". The public key is also
hashed to form a 256-bit "pubkey hash". These two values are concatenated,
hashed, and truncated to 192 bits to form the first 192 bits of the read-cap.
The pubkey hash is hashed by itself and truncated to 64 bits to form the last
64 bits of the read-cap. The full read-cap is 256 bits long, just like the
write-cap.

The first 192 bits of the read-cap are hashed and truncated to form the first
64 bits of the storage index. The last 64 bits of the read-cap are hashed to
form the last 64 bits of the storage index. This gives us a 128-bit storage
index.

The verification-cap is the first 64 bits of the storage index plus the
pubkey hash, 320 bits total. The verification-cap doesn't need to be
expressed in a printable transport-safe form, so it's ok that it's longer.

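The following sketch shows how these derivations could chain together. It is
illustrative only: the tagged_hash() helper and its tag strings are
hypothetical placeholders (the proposal does not pin down the hash function
or its personalization); the cap layouts follow the prose above.

  import hashlib

  def tagged_hash(tag, *values):
      # Hypothetical domain-separated SHA-256; the real tags/hash are TBD.
      h = hashlib.sha256()
      h.update(tag)
      for v in values:
          h.update(v)
      return h.digest()                    # 32 bytes == 256 bits

  def derive_caps(privkey_bytes, pubkey_bytes):
      # write-cap: the 256-bit signature (private) key, used verbatim
      write_cap = privkey_bytes                                      # 32 bytes

      salt        = tagged_hash(b"salt-v1", privkey_bytes)           # 256 bits
      pubkey_hash = tagged_hash(b"pubkey-v1", pubkey_bytes)          # 256 bits

      # read-cap: 192 bits from H(salt+pubkey_hash), 64 bits from H(pubkey_hash)
      read_cap = (tagged_hash(b"readcap-1-v1", salt, pubkey_hash)[:24] +
                  tagged_hash(b"readcap-2-v1", pubkey_hash)[:8])     # 32 bytes

      # storage index: 64 bits derived from each half of the read-cap
      storage_index = (tagged_hash(b"si-1-v1", read_cap[:24])[:8] +
                       tagged_hash(b"si-2-v1", read_cap[24:])[:8])   # 16 bytes

      # verify-cap: first 64 bits of the storage index plus the pubkey hash
      verify_cap = storage_index[:8] + pubkey_hash                   # 320 bits

      return write_cap, read_cap, storage_index, verify_cap
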
The read-cap is hashed one way to form an AES encryption key that is used to
encrypt the salt; this key is called the "salt key". The encrypted salt is
stored in the share. The private key never changes, therefore the salt never
changes, and the salt key is only used for a single purpose, so there is no
need for an IV.

The read-cap is hashed a different way to form the master data encryption
key. A random "data salt" is generated each time the share's contents are
replaced, and the master data encryption key is concatenated with the data
salt, then hashed, to form the AES CTR-mode "read key" that will be used to
encrypt the actual file data. This is to avoid key-reuse. An outstanding
issue is how to avoid key reuse when files are modified in place instead of
being replaced completely; this is not done in SDMF but might occur in MDMF.

The private key is hashed one way to form the salt, and a different way to
form the "write enabler master". For each storage server on which a share is
kept, the write enabler master is concatenated with the server's nodeid and
hashed, and the result is called the "write enabler" for that particular
server. Note that multiple shares of the same slot stored on the same server
will all get the same write enabler, i.e. the write enabler is associated
with the "bucket", rather than the individual shares.

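A minimal sketch of these secondary derivations, reusing the hypothetical
tagged_hash() helper from the previous example; the tag strings and the AES
key length (128 bits here) are assumptions, not part of the proposal.

  def derive_salt_key(read_cap):
      # AES key used (once, with no IV) to encrypt the salt stored in the share
      return tagged_hash(b"salt-key-v1", read_cap)[:16]

  def derive_read_key(read_cap, data_salt):
      # master data encryption key, mixed with the per-update data salt
      master = tagged_hash(b"master-data-key-v1", read_cap)
      return tagged_hash(b"read-key-v1", master, data_salt)[:16]    # AES-CTR key

  def derive_write_enabler(privkey_bytes, server_nodeid):
      # one write-enabler master per slot, one write enabler per server bucket
      wem = tagged_hash(b"write-enabler-master-v1", privkey_bytes)
      return tagged_hash(b"write-enabler-v1", wem, server_nodeid)
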
The private key is hashed a third way to form the "data write key", which can
be used by applications which wish to store some data in a form that is only
available to those with a write-cap, and not to those with merely a read-cap.
This is used to implement transitive read-onlyness of dirnodes.

The public key is stored on the servers, as is the encrypted salt, the
(non-encrypted) data salt, the encrypted data, and a signature. The container
records the write-enabler, but of course this is not visible to readers. To
make sure that every byte of the share can be verified by a holder of the
verify-cap (and also by the storage server itself), the signature covers the
version number, the sequence number, the root hash "R" of the share merkle
tree, the encoding parameters, and the encrypted salt. "R" itself covers the
hash trees and the share data.

The read-write URI is just the private key. The read-only URI is the read-cap
key. The verify-only URI contains the pubkey hash and the first 64 bits of
the storage index.

 FMW:b2a(privatekey)
 FMR:b2a(readcap)
 FMV:b2a(storageindex[:64])b2a(pubkey-hash)

Note that this allows the read-only and verify-only URIs to be derived from
the read-write URI without actually retrieving any data from the share, but
instead by regenerating the public key from the private one. Uses of the
read-only or verify-only caps must validate the public key against their
pubkey hash (or its derivative) the first time they retrieve the pubkey,
before trusting any signatures they see.

The SDMF slot is allocated by sending a request to the storage server with a
desired size, the storage index, and the write enabler for that server's
nodeid. If granted, the write enabler is stashed inside the slot's backing
store file. All further write requests must be accompanied by the write
enabler or they will not be honored. The storage server does not share the
write enabler with anyone else.

The SDMF slot structure will be described in more detail below. The important
pieces are:

 * a sequence number
 * a root hash "R"
 * the data salt
 * the encoding parameters (including k, N, file size, segment size)
 * a signed copy of [seqnum,R,data_salt,encoding_params] (using signature key)
 * the verification key (not encrypted)
 * the share hash chain (part of a Merkle tree over the share hashes)
 * the block hash tree (Merkle tree over blocks of share data)
 * the share data itself (erasure-coding of read-key-encrypted file data)
 * the salt, encrypted with the salt key

The access pattern for read (assuming we hold the write-cap) is:
 * generate public key from the private one
 * hash private key to get the salt, hash public key, form read-cap
 * form storage-index
 * use storage-index to locate 'k' shares with identical 'R' values
   * either get one share, read 'k' from it, then read k-1 shares
   * or read, say, 5 shares, discover k, either get more or be finished
   * or copy k into the URIs
 * .. jump to "COMMON READ", below

To read (assuming we only hold the read-cap), do:
 * hash read-cap pieces to generate storage index and salt key
 * use storage-index to locate 'k' shares with identical 'R' values
 * retrieve verification key and encrypted salt
 * decrypt salt
 * hash decrypted salt and pubkey to generate another copy of the read-cap,
   make sure they match (this validates the pubkey)
 * .. jump to "COMMON READ"

 * COMMON READ:
   * read seqnum, R, data salt, encoding parameters, signature
   * verify signature against verification key
   * hash data salt and read-cap to generate read-key
   * read share data, compute block-hash Merkle tree and root "r"
   * read share hash chain (leading from "r" to "R")
   * validate share hash chain up to the root "R" (see the sketch after this
     list)
   * submit share data to erasure decoding
   * decrypt decoded data with read-key
   * submit plaintext to application

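The hash-chain validation step above might look roughly like the following
sketch. It assumes a binary-heap numbering of share-hash-tree nodes (root at
index 0), a particular leaf/pair hashing convention, and a simple
duplicate-last-leaf padding rule; all of these are placeholders for whatever
Merkle-tree layout the real implementation chooses.

  import hashlib

  def pair_hash(left, right):
      # Hypothetical internal-node hash; the real tagging is TBD.
      return hashlib.sha256(b"merkle-pair-v1" + left + right).digest()

  def merkle_root(leaves):
      # Root of a binary Merkle tree over 32-byte leaf hashes.
      layer = list(leaves)
      while len(layer) > 1:
          if len(layer) % 2:
              layer.append(layer[-1])            # padding rule: assumption
          layer = [pair_hash(layer[i], layer[i + 1])
                   for i in range(0, len(layer), 2)]
      return layer[0]

  def climb_to_root(leaf_node, leaf_hash, chain):
      # chain: dict of tree-node number -> 32-byte hash (the share hash chain)
      node, h = leaf_node, leaf_hash
      while node > 0:
          sibling = node + 1 if node % 2 else node - 1
          left, right = (h, chain[sibling]) if node < sibling else (chain[sibling], h)
          node, h = (node - 1) // 2, pair_hash(left, right)
      return h

  def validate_share(block_hashes, share_leaf_node, share_hash_chain, R):
      r = merkle_root(block_hashes)              # per-share root "r"
      leaf = hashlib.sha256(b"merkle-leaf-v1" + r).digest()   # leaf hash: assumption
      if climb_to_root(share_leaf_node, leaf, share_hash_chain) != R:
          raise ValueError("share hash chain does not lead to R")
      return r
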
The access pattern for write is:
 * generate pubkey, salt, read-cap, storage-index as in read case
 * generate data salt for this update, generate read-key
 * encrypt plaintext from application with read-key
   * application can encrypt some data with the data-write-key to make it
     only available to writers (used for transitively-readonly dirnodes)
 * erasure-code crypttext to form shares
 * split shares into blocks
 * compute Merkle tree of blocks, giving root "r" for each share
 * compute Merkle tree of shares, find root "R" for the file as a whole
 * create share data structures, one per server:
   * use seqnum which is one higher than the old version
   * share hash chain has log(N) hashes, different for each server
   * signed data is the same for each server
   * include pubkey, encrypted salt, data salt
 * now we have N shares and need homes for them
 * walk through peers
   * if share is not already present, allocate-and-set
   * otherwise, try to modify existing share:
     * send testv_and_writev operation to each one
     * testv says to accept share if their(seqnum+R) <= our(seqnum+R)
 * count how many servers wind up with which versions (histogram over R)
 * keep going until N servers have the same version, or we run out of servers
   * if any servers wound up with a different version, report error to
     application
   * if we ran out of servers, initiate recovery process (described below)

==== Cryptographic Properties ====

This scheme protects the data's confidentiality with 192 bits of key
material, since the read-cap contains 192 secret bits (derived from an
encrypted salt, which is encrypted using those same 192 bits plus some
additional public material).

The integrity of the data (assuming that the signature is valid) is protected
by the 256-bit hash which gets included in the signature. The privilege of
modifying the data (equivalent to the ability to form a valid signature) is
protected by a 256 bit random DSA private key, and the difficulty of
computing a discrete logarithm in a 2048-bit field.

There are a few weaker denial-of-service attacks possible. If N-k+1 of the
shares are damaged or unavailable, the client will be unable to recover the
file. Any coalition of more than N-k shareholders will be able to effect this
attack by merely refusing to provide the desired share. The "write enabler"
shared secret protects existing shares from being displaced by new ones,
except by the holder of the write-cap. One server cannot affect the other
shares of the same file, once those other shares are in place.

The worst DoS attack is the "roadblock attack", which must be made before
those shares get placed. Storage indexes are effectively random (being
derived from the hash of a random value), so they are not guessable before
the writer begins their upload, but there is a window of vulnerability during
the beginning of the upload, when some servers have heard about the storage
index but not all of them.

The roadblock attack we want to prevent is when the first server that the
uploader contacts quickly runs to all the other selected servers and places a
bogus share under the same storage index, before the uploader can contact
them. These shares will normally be accepted, since storage servers create
new shares on demand. The bogus shares would have randomly-generated
write-enablers, which will of course be different than the real uploader's
write-enabler, since the malicious server does not know the write-cap.

If this attack were successful, the uploader would be unable to place any of
their shares, because the slots have already been filled by the bogus shares.
The uploader would probably try for peers further and further away from the
desired location, but eventually they will hit a preconfigured distance limit
and give up. In addition, the further the writer searches, the less likely it
is that a reader will search as far. So a successful attack will either cause
the file to be uploaded but not be reachable, or it will cause the upload to
fail.

If the uploader tries again (creating a new privkey), they may get lucky and
the malicious servers will appear later in the query list, giving sufficient
honest servers a chance to see their share before the malicious one manages
to place bogus ones.

The first line of defense against this attack is the timing challenges: the
attacking server must be ready to act the moment a storage request arrives
(which will only occur for a certain percentage of all new-file uploads), and
only has a few seconds to act before the other servers will have allocated
the shares (and recorded the write-enabler, terminating the window of
vulnerability).

The second line of defense is post-verification, and is possible because the
storage index is partially derived from the public key hash. A storage server
can, at any time, verify every public bit of the container as being signed by
the verification key (this operation is recommended as a continual background
process, when disk usage is minimal, to detect disk errors). The server can
also hash the verification key to derive 64 bits of the storage index. If it
detects that these 64 bits do not match (but the rest of the share validates
correctly), then the implication is that this share was stored to the wrong
storage index, either due to a bug or a roadblock attack.

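In terms of the derivations sketched earlier, the portion of the storage
index that a server can recompute from the public key alone is the last 64
bits; a short sketch (tag strings again hypothetical):

  def storage_index_suffix_from_pubkey(pubkey_bytes):
      # pubkey -> pubkey hash -> last 64 bits of read-cap -> last 64 bits of SI
      pubkey_hash  = tagged_hash(b"pubkey-v1", pubkey_bytes)
      readcap_tail = tagged_hash(b"readcap-2-v1", pubkey_hash)[:8]
      return tagged_hash(b"si-2-v1", readcap_tail)[:8]

  def share_is_at_right_index(pubkey_bytes, storage_index):
      return storage_index[8:] == storage_index_suffix_from_pubkey(pubkey_bytes)
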
If an uploader finds that they are unable to place their shares because of
"bad write enabler errors" (as reported by the prospective storage servers),
it can "cry foul", and ask the storage server to perform this verification on
the share in question. If the pubkey and storage index do not match, the
storage server can delete the bogus share, thus allowing the real uploader to
place their share. Of course the origin of the offending bogus share should
be logged and reported to a central authority, so corrective measures can be
taken. It may be necessary to have this "cry foul" protocol include the new
write-enabler, to close the window during which the malicious server can
re-submit the bogus share during the adjudication process.

If the problem persists, the servers can be placed into pre-verification
mode, in which this verification is performed on all potential shares before
being committed to disk. This mode is more CPU-intensive (since normally the
storage server ignores the contents of the container altogether), but would
solve the problem completely.

The mere existence of these potential defenses should be sufficient to deter
any actual attacks. Note that the storage index only has 64 bits of
pubkey-derived data in it, which is below the usual crypto guidelines for
security factors. In this case it's a pre-image attack which would be needed,
rather than a collision, and the actual attack would be to find a keypair for
which the public key can be hashed three times to produce the desired portion
of the storage index. We believe that 64 bits of material is sufficiently
resistant to this form of pre-image attack to serve as a suitable deterrent.

=== Server Storage Protocol ===

The storage servers will provide a mutable slot container which is oblivious
to the details of the data being contained inside it. Each storage index
refers to a "bucket", and each bucket has one or more shares inside it. (In a
well-provisioned network, each bucket will have only one share). The bucket
is stored as a directory, using the base32-encoded storage index as the
directory name. Each share is stored in a single file, using the share number
as the filename.

The container holds space for a container magic number (for versioning), the
write enabler, the nodeid which accepted the write enabler (used for share
migration, described below), a small number of lease structures, the embedded
data itself, and expansion space for additional lease structures.

 #   offset    size    name
 1    0        32      magic verstr "tahoe mutable container v1" plus binary
 2    32       20      write enabler's nodeid
 3    52       32      write enabler
 4    84       8       data size (actual share data present) (a)
 5    92       8       offset of (8) count of extra leases (after data)
 6    100      368     four leases, 92 bytes each
                        0    4   ownerid (0 means "no lease here")
                        4    4   expiration timestamp
                        8    32  renewal token
                        40   32  cancel token
                        72   20  nodeid which accepted the tokens
 7    468      (a)     data
 8    ??       4       count of extra leases
 9    ??       n*92    extra leases

The "extra leases" field must be copied and rewritten each time the size of
the enclosed data changes. The hope is that most buckets will have four or
fewer leases and this extra copying will not usually be necessary.

The (4) "data size" field contains the actual number of bytes of data present
in field (7), such that a client request to read beyond 468+(a) will result
in an error. This allows the client to (one day) read relative to the end of
the file. The container size (that is, (8)-(7)) might be larger, especially
if extra size was pre-allocated in anticipation of filling the container with
a lot of data.

The offset in (5) points at the *count* of extra leases, at (8). The actual
leases (at (9)) begin 4 bytes later. If the container size changes, both (8)
and (9) must be relocated by copying.

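A sketch of how the fixed part of this container could be packed with
Python's struct module; the field widths simply restate the table above and
the magic-string padding is an assumption, so this is not a committed wire
format.

  import struct

  # magic(32) nodeid(20) write-enabler(32) data-size(8) extra-lease-offset(8)
  CONTAINER_HEADER = ">32s20s32sQQ"        # 100 bytes: fields 1-5 above
  LEASE = ">LL32s32s20s"                   # 92 bytes: ownerid, expiration,
                                           # renew token, cancel token, nodeid
  HEADER_SIZE = struct.calcsize(CONTAINER_HEADER)    # 100
  LEASE_SIZE = struct.calcsize(LEASE)                # 92
  DATA_OFFSET = HEADER_SIZE + 4 * LEASE_SIZE         # 468, field 7 above

  def pack_container_header(nodeid, write_enabler, data_size):
      magic = b"tahoe mutable container v1" + b"\x00" * 6   # padding: assumption
      extra_lease_offset = DATA_OFFSET + data_size          # field 5 points at field 8
      return struct.pack(CONTAINER_HEADER, magic, nodeid,
                         write_enabler, data_size, extra_lease_offset)
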
The server will honor any write commands that provide the write token and do
not exceed the server-wide storage size limitations. Read and write commands
MUST be restricted to the 'data' portion of the container: the implementation
of those commands MUST perform correct bounds-checking to make sure other
portions of the container are inaccessible to the clients.

The two methods provided by the storage server on these "MutableSlot" share
objects are:

 * readv(ListOf(offset=int, length=int))
   * returns a list of bytestrings, of the various requested lengths
   * offset < 0 is interpreted relative to the end of the data
   * spans which hit the end of the data will return truncated data

 * testv_and_writev(write_enabler, test_vector, write_vector)
   * this is a test-and-set operation which performs the given tests and only
     applies the desired writes if all tests succeed. This is used to detect
     simultaneous writers, and to reduce the chance that an update will lose
     data recently written by some other party (written after the last time
     this slot was read).
   * test_vector=ListOf(TupleOf(offset, length, opcode, specimen))
   * the opcode is a string, from the set [gt, ge, eq, le, lt, ne]
   * each element of the test vector is read from the slot's data and
     compared against the specimen using the desired (in)equality. If all
     tests evaluate True, the write is performed
   * write_vector=ListOf(TupleOf(offset, newdata))
     * offset < 0 is not yet defined, it probably means relative to the
       end of the data, which probably means append, but we haven't nailed
       it down quite yet
     * write vectors are executed in order, which specifies the results of
       overlapping writes
   * return value:
     * error: OutOfSpace
     * error: something else (io error, out of memory, whatever)
     * (True, old_test_data): the write was accepted (test_vector passed)
     * (False, old_test_data): the write was rejected (test_vector failed)
       * both 'accepted' and 'rejected' return the old data that was used
         for the test_vector comparison. This can be used by the client
         to detect write collisions, including collisions for which the
         desired behavior was to overwrite the old version.

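As an illustration, the "their(seqnum+R) <= our(seqnum+R)" test used by the
publish algorithm (see Recovery, below) can be expressed as a single test
vector covering the bytes that hold the sequence number and root hash. The
offsets assume the SDMF layout given later, and the call is schematic.

  import struct

  def make_update_vectors(new_seqnum, new_root_hash, new_share_data):
      # In the SDMF layout, the 8-byte seqnum is at offset 1 and the 32-byte
      # root hash "R" at offset 9, so one 40-byte comparison covers both.
      our_version = struct.pack(">Q32s", new_seqnum, new_root_hash)
      test_vector = [(1, 40, "le", our_version)]   # accept if theirs <= ours
      write_vector = [(0, new_share_data)]         # rewrite the whole share
      return test_vector, write_vector

  # Schematic use against one server's MutableSlot object:
  #   testv, writev = make_update_vectors(seqnum, R, share_bytes)
  #   accepted, old_test_data = slot.testv_and_writev(write_enabler, testv, writev)
  #   if not accepted:
  #       old_seqnum, old_R = struct.unpack(">Q32s", old_test_data[0])
  #       # somebody else holds a newer version; see Recovery below
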
In addition, the storage server provides several methods to access these
share objects:

 * allocate_mutable_slot(storage_index, sharenums=SetOf(int))
   * returns DictOf(int, MutableSlot)
 * get_mutable_slot(storage_index)
   * returns DictOf(int, MutableSlot)
   * or raises KeyError

We intend to add an interface which allows small slots to allocate-and-write
in a single call, as well as do update or read in a single call. The goal is
to allow a reasonably-sized dirnode to be created (or updated, or read) in
just one round trip (to all N shareholders in parallel).

==== migrating shares ====

If a share must be migrated from one server to another, two values become
invalid: the write enabler (since it was computed for the old server), and
the lease renew/cancel tokens.

Suppose that a slot was first created on nodeA, and was thus initialized with
WE(nodeA) (= H(WEM+nodeA)). Later, for provisioning reasons, the share is
moved from nodeA to nodeB.

Readers may still be able to find the share in its new home, depending upon
how many servers are present in the grid, where the new nodeid lands in the
permuted index for this particular storage index, and how many servers the
reading client is willing to contact.

When a client attempts to write to this migrated share, it will get a "bad
write enabler" error, since the WE it computes for nodeB will not match the
WE(nodeA) that was embedded in the share. When this occurs, the "bad write
enabler" message must include the old nodeid (e.g. nodeA) that was in the
share.

The client then computes H(nodeB+H(WEM+nodeA)), which is the same as
H(nodeB+WE(nodeA)). The client sends this along with the new WE(nodeB), which
is H(WEM+nodeB). Note that the client only sends WE(nodeB) to nodeB, never to
anyone else. Also note that the client does not send a value to nodeB that
would allow the node to impersonate the client to a third node: everything
sent to nodeB will include something specific to nodeB in it.

The server locally computes H(nodeB+WE(nodeA)), using its own node id and the
old write enabler from the share. It compares this against the value supplied
by the client. If they match, this serves as proof that the client was able
to compute the old write enabler. The server then accepts the client's new
WE(nodeB) and writes it into the container.

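A sketch of both sides of this exchange, reusing the hypothetical
tagged_hash() helper; the tag strings and the message framing are
assumptions.

  # Client side: it knows the write-enabler master (WEM) for this slot.
  def make_we_fixup_request(wem, old_nodeid, new_nodeid):
      old_we = tagged_hash(b"write-enabler-v1", wem, old_nodeid)   # WE(nodeA)
      proof  = tagged_hash(b"we-fixup-v1", new_nodeid, old_we)     # H(nodeB+WE(nodeA))
      new_we = tagged_hash(b"write-enabler-v1", wem, new_nodeid)   # WE(nodeB)
      return proof, new_we              # sent only to nodeB

  # Server side (nodeB): it knows its own nodeid and the WE stored in the share.
  def apply_we_fixup(my_nodeid, stored_we, proof, new_we):
      expected = tagged_hash(b"we-fixup-v1", my_nodeid, stored_we)
      if expected != proof:
          raise ValueError("client did not prove knowledge of the old write enabler")
      return new_we                     # replaces stored_we in the container
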
This WE-fixup process requires an extra round trip, and requires the error
message to include the old nodeid, but does not require any public key
operations on either client or server.

Migrating the leases will require a similar protocol. This protocol will be
defined concretely at a later date.

=== Code Details ===

The current FileNode class will be renamed ImmutableFileNode, and a new
MutableFileNode class will be created. Instances of this class will contain a
URI and a reference to the client (for peer selection and connection). The
methods of MutableFileNode are:

 * replace(newdata) -> OK, ConsistencyError, NotEnoughPeersError
 * get() -> [deferred] newdata, NotEnoughPeersError
   * if there are multiple retrievable versions in the grid, get() returns
     the first version it can reconstruct, and silently ignores the others.
     In the future, a more advanced API will signal and provide access to
     the multiple heads.

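A schematic of how a caller (e.g. dirnode code) might use this proposed class
from Twisted code; the constructor signature and calling pattern here are
assumptions about an API that has not been written yet.

  from twisted.internet import defer

  @defer.inlineCallbacks
  def update_slot(client, write_uri, mutate_contents):
      node = MutableFileNode(write_uri, client)   # proposed class, hypothetical ctor
      old = yield node.get()                      # first recoverable version
      new = mutate_contents(old)                  # e.g. merge in an "add child"
      yield node.replace(new)                     # may errback with ConsistencyError
                                                  # or NotEnoughPeersError
      defer.returnValue(new)
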
The peer-selection and data-structure manipulation (and signing/verification)
steps will be implemented in a separate class in allmydata/mutable.py .

=== SDMF Slot Format ===

This SDMF data lives inside a server-side MutableSlot container. The server
is generally oblivious to this format, but it may look inside the container
when verification is desired.

This data is tightly packed. There are no gaps left between the different
fields, and the offset table is mainly present to allow future flexibility of
key sizes.

 #    offset   size    name
 1    0        1       version byte, \x01 for this format
 2    1        8       sequence number. 2^64-1 must be handled specially, TBD
 3    9        32      "R" (root of share hash Merkle tree)
 4    41       32      data salt (readkey is H(readcap+data_salt))
 5    73       32      encrypted salt (AESenc(key=H(readcap), salt))
 6    105      18      encoding parameters:
       105     1        k
       106     1        N
       107     8        segment size
       115     8        data length (of original plaintext)
 7    123      36      offset table:
       127     4        (9) signature
       131     4        (10) share hash chain
       135     4        (11) block hash tree
       139     4        (12) share data
       143     8        (13) EOF
 8    151      256     verification key (2048bit DSA key)
 9    407      40      signature=DSAsig(H([1,2,3,4,5,6]))
10    447      (a)     share hash chain, encoded as:
                        "".join([pack(">H32s", shnum, hash)
                                 for (shnum,hash) in needed_hashes])
11    ??       (b)     block hash tree, encoded as:
                        "".join([pack(">32s",hash) for hash in block_hash_tree])
12    ??       LEN     share data
13    ??       --      EOF

(a) The share hash chain contains ceil(log(N)) hashes, each 32 bytes long.
    This is the set of hashes necessary to validate this share's leaf in the
    share Merkle tree. For N=10, this is 4 hashes, i.e. 128 bytes.
(b) The block hash tree contains ceil(length/segsize) hashes, each 32 bytes
    long. This is the set of hashes necessary to validate any given block of
    share data up to the per-share root "r". Each "r" is a leaf of the share
    hash tree (with root "R"), from which a minimal subset of hashes is put
    in the share hash chain in (10).

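The fixed-size prefix and offset table could be packed as follows; the struct
format strings simply restate the table rows and are not normative.

  import struct

  SDMF_PREFIX = ">BQ32s32s32sBBQQ"   # rows 1-6: version, seqnum, R, data salt,
                                     # encrypted salt, k, N, segment size,
                                     # data length (123 bytes total)
  OFFSET_TABLE = ">LLLLQ"            # row 7: offsets of rows 9-13

  def pack_sdmf_prefix(seqnum, R, data_salt, encrypted_salt,
                       k, N, segsize, datalen, offsets):
      prefix = struct.pack(SDMF_PREFIX, 1, seqnum, R, data_salt,
                           encrypted_salt, k, N, segsize, datalen)
      table = struct.pack(OFFSET_TABLE,
                          offsets["signature"], offsets["share_hash_chain"],
                          offsets["block_hash_tree"], offsets["share_data"],
                          offsets["EOF"])
      return prefix + table
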
=== Recovery ===

The first line of defense against damage caused by colliding writes is the
Prime Coordination Directive: "Don't Do That".

The second line of defense is to keep "S" (the number of competing versions)
lower than N/k. If this holds true, at least one competing version will have
k shares and thus be recoverable. Note that server unavailability counts
against us here: the old version stored on the unavailable server must be
included in the value of S.

The third line of defense is our use of testv_and_writev() (described above),
which increases the convergence of simultaneous writes: one of the writers
will be favored (the one with the highest "R"), and that version is more
likely to be accepted than the others. This defense is least effective in the
pathological situation where S simultaneous writers are active, the one with
the lowest "R" writes to N-k+1 of the shares and then dies, then the one with
the next-lowest "R" writes to N-2k+1 of the shares and dies, etc, until the
one with the highest "R" writes to k-1 shares and dies. Any other sequencing
will allow the highest "R" to write to at least k shares and establish a new
revision.

The fourth line of defense is the fact that each client keeps writing until
at least one version has N shares. This uses additional servers, if
necessary, to make sure that either the client's version or some
newer/overriding version is highly available.

The fifth line of defense is the recovery algorithm, which seeks to make sure
that at least *one* version is highly available, even if that version is
somebody else's.

The write-shares-to-peers algorithm is as follows:

 * permute peers according to storage index
 * walk through peers, trying to assign one share per peer
 * for each peer:
   * send testv_and_writev, using "old(seqnum+R) <= our(seqnum+R)" as the test
     * this means that we will overwrite any old versions, and we will
       overwrite simultaneous writers of the same version if our R is higher.
       We will not overwrite writers using a higher seqnum.
 * record the version that each share winds up with. If the write was
   accepted, this is our own version. If it was rejected, read the
   old_test_data to find out what version was retained.
   * if old_test_data indicates the seqnum was equal or greater than our
     own, mark the "Simultaneous Writes Detected" flag, which will eventually
     result in an error being reported to the writer (in their close() call).
 * build a histogram of "R" values
 * repeat until the histogram indicates that some version (possibly ours)
   has N shares. Use new servers if necessary.
 * If we run out of servers:
   * if there are at least shares-of-happiness of any one version, we're
     happy, so return. (the close() might still get an error)
   * not happy, need to reinforce something, goto RECOVERY

RECOVERY:
 * read all shares, count the versions, identify the recoverable ones,
   discard the unrecoverable ones.
 * sort versions: locate max(seqnums), put all versions with that seqnum
   in the list, sort by number of outstanding shares. Then put our own
   version. (TODO: put versions with seqnum <max but >us ahead of us?).
 * for each version:
   * attempt to recover that version
   * if not possible, remove it from the list, go to next one
   * if recovered, start at beginning of peer list, push that version,
     continue until N shares are placed
   * if pushing our own version, bump up the seqnum to one higher than
     the max seqnum we saw
 * if we run out of servers:
   * schedule retry and exponential backoff to repeat RECOVERY
   * admit defeat after some period? presumably the client will be shut down
     eventually, maybe keep trying (once per hour?) until then.

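A condensed sketch of the "count versions and decide" bookkeeping shared by
the publish loop and RECOVERY above; the data structures are illustrative
only.

  from collections import Counter

  def tally_versions(placements):
      # placements: iterable of (server, seqnum, R) describing which version
      # each placed share ended up holding (ours or somebody else's)
      return Counter((seqnum, R) for _server, seqnum, R in placements)

  def publish_outcome(histogram, N, shares_of_happiness):
      (best_version, count), = histogram.most_common(1) or [((None, None), 0)]
      if count >= N:
          return "done", best_version      # some version is fully placed
      if count >= shares_of_happiness:
          return "happy", best_version     # good enough; close() may still warn
      return "recover", best_version       # goto RECOVERY

  def recovery_order(histogram, our_version):
      # newest, most-widely-placed versions first, then our own version last
      max_seqnum = max(seqnum for (seqnum, _R) in histogram)
      candidates = [(v, c) for v, c in histogram.items() if v[0] == max_seqnum]
      candidates.sort(key=lambda vc: vc[1], reverse=True)
      return [v for v, _c in candidates] + [our_version]
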
== Medium Distributed Mutable Files ==

These are just like the SDMF case, but:

 * we actually take advantage of the Merkle hash tree over the blocks, by
   reading a single segment of data at a time (and its necessary hashes), to
   reduce the read-time alacrity
 * we allow arbitrary writes to the file (i.e. seek() is provided, and
   O_TRUNC is no longer required)
 * we write more code on the client side (in the MutableFileNode class), to
   first read each segment that a write must modify. This looks exactly like
   the way a normal filesystem uses a block device, or how a CPU must perform
   a cache-line fill before modifying a single word.
 * we might implement some sort of copy-based atomic update server call,
   to allow multiple writev() calls to appear atomic to any readers.

MDMF slots provide fairly efficient in-place edits of very large files (a few
GB). Appending data is also fairly efficient, although each time a power of 2
boundary is crossed, the entire file must effectively be re-uploaded (because
the size of the block hash tree changes), so if the filesize is known in
advance, that space ought to be pre-allocated (by leaving extra space between
the block hash tree and the actual data).

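The "cache-line fill" style of partial write could look like this sketch,
where node.read_segments() and node.write_segments() stand in for MDMF
segment operations that have not been designed yet.

  def write_span(node, offset, data, segsize):
      # Read-modify-write only the segments touched by this write.
      first = offset // segsize
      last = (offset + len(data) - 1) // segsize
      buf = bytearray(node.read_segments(first, last))   # the "cache-line fill"
      start = offset - first * segsize
      buf[start:start + len(data)] = data
      node.write_segments(first, bytes(buf))   # re-encrypt, re-encode, and
                                               # update the affected hash trees
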
MDMF1 uses the Merkle tree to enable low-alacrity random-access reads. MDMF2
adds cache-line reads to allow random-access writes.

== Large Distributed Mutable Files ==

LDMF slots use a fundamentally different way to store the file, inspired by
Mercurial's "revlog" format. They enable very efficient insert/remove/replace
editing of arbitrary spans. Multiple versions of the file can be retained, in
a revision graph that can have multiple heads. Each revision can be
referenced by a cryptographic identifier. There are two forms of the URI, one
that means "most recent version", and a longer one that points to a specific
revision.

Metadata can be attached to the revisions, like timestamps, to enable rolling
back an entire tree to a specific point in history.

LDMF1 provides deltas but tries to avoid dealing with multiple heads. LDMF2
provides explicit support for revision identifiers and branching.

== TODO ==

improve allocate-and-write or get-writer-buckets API to allow one-call (or
maybe two-call) updates. The challenge is in figuring out which shares are on
which machines. First cut will have lots of round trips.

(eventually) define behavior when seqnum wraps. At the very least make sure
it can't cause a security problem. "the slot is worn out" is acceptable.

(eventually) define share-migration lease update protocol. Including the
nodeid who accepted the lease is useful; we can use the same protocol as we
do for updating the write enabler. However we need to know which lease to
update.. maybe send back a list of all old nodeids that we find, then try all
of them when we accept the update?

We now do this in a specially-formatted IndexError exception:
 "UNABLE to renew non-existent lease. I have leases accepted by " +
 "nodeids: '12345','abcde','44221' ."

Every node in a given tahoe grid must have the same common DSA modulus and
generator, but different grids could use different parameters. We haven't
figured out how to define a "grid id" yet, but I think the DSA parameters
should be part of that identifier. In practical terms, this might mean that
the Introducer tells each node what parameters to use, or perhaps the node
could have a config file which specifies them instead.