(protocol proposal, work-in-progress, not authoritative)
(this document describes DSA-based mutable files, as opposed to the RSA-based
mutable files that were introduced in tahoe-0.7.0 . This proposal has not yet
been implemented. Please see mutable-DSA.svg for a quick picture of the
crypto scheme described herein)
= Mutable Files =
Mutable File Slots are places with a stable identifier that can hold data
that changes over time. In contrast to CHK slots, for which the
URI/identifier is derived from the contents themselves, the Mutable File Slot
URI remains fixed for the life of the slot, regardless of what data is placed
inside it.
Each mutable slot is referenced by two different URIs. The "read-write" URI
grants read-write access to its holder, allowing them to put whatever
contents they like into the slot. The "read-only" URI is less powerful, only
granting read access, and not enabling modification of the data. The
read-write URI can be turned into the read-only URI, but not the other way
around.
The data in these slots is distributed over a number of servers, using the
same erasure coding that CHK files use, with 3-of-10 being a typical choice
of encoding parameters. The data is encrypted and signed in such a way that
only the holders of the read-write URI will be able to set the contents of
the slot, and only the holders of the read-only URI will be able to read
those contents. Holders of either URI will be able to validate the contents
as being written by someone with the read-write URI. The servers who hold the
shares cannot read or modify them: the worst they can do is deny service (by
deleting or corrupting the shares), or attempt a rollback attack (which can
only succeed with the cooperation of at least k servers).
== Consistency vs Availability ==
There is an age-old battle between consistency and availability. Epic papers
have been written, elaborate proofs have been established, and generations of
theorists have learned that you cannot simultaneously achieve guaranteed
consistency and guaranteed availability. In addition, the closer you push
either guarantee toward perfection, the more the cost and complexity of the
design go up.
Tahoe's design goals are to largely favor design simplicity, then slightly
favor read availability, over the other criteria.
As we develop more sophisticated mutable slots, the API may expose multiple
read versions to the application layer. The tahoe philosophy is to defer most
consistency recovery logic to the higher layers. Some applications have
effective ways to merge multiple versions, so inconsistency is not
necessarily a problem (e.g. directory nodes can usually merge multiple "add
child" operations).
== The Prime Coordination Directive: "Don't Do That" ==
The current rule for applications which run on top of Tahoe is "do not
perform simultaneous uncoordinated writes". That means you need non-tahoe
means to make sure that two parties are not trying to modify the same mutable
slot at the same time. For example:
* don't give the read-write URI to anyone else. Dirnodes in a private
directory generally satisfy this case, as long as you don't use two
clients on the same account at the same time
* if you give a read-write URI to someone else, stop using it yourself. An
inbox would be a good example of this.
* if you give a read-write URI to someone else, call them on the phone
before you write into it
* build an automated mechanism to have your agents coordinate writes.
For example, we expect a future release to include a FURL for a
"coordination server" in the dirnodes. The rule can be that you must
contact the coordination server and obtain a lock/lease on the file
before you're allowed to modify it.
If you do not follow this rule, Bad Things will happen. The worst-case Bad
Thing is that the entire file will be lost. A less-bad Bad Thing is that one
or more of the simultaneous writers will lose their changes. An observer of
the file may not see monotonically-increasing changes to the file, i.e. they
may see version 1, then version 2, then 3, then 2 again.
Tahoe takes some amount of care to reduce the badness of these Bad Things.
One way you can help nudge it from the "lose your file" case into the "lose
some changes" case is to reduce the number of competing versions: multiple
versions of the file that different parties are trying to establish as the
one true current contents. Each simultaneous writer counts as a "competing
version", as does the previous version of the file. If the count "S" of these
competing versions is larger than N/k, then the file runs the risk of being
lost completely (with the typical 3-of-10 encoding, N/k is about 3.3, so four
or more competing versions put the file at risk). If at least one of the
writers remains running after the collision is detected, it will attempt to
recover, but if S>(N/k) and all writers crash after writing a few shares, the
file will be lost.
== Small Distributed Mutable Files ==
SDMF slots are suitable for small (<1MB) files that are edited by rewriting
the entire file. The three operations are:
* allocate (with initial contents)
* set (with new contents)
* get (old contents)
The first use of SDMF slots will be to hold directories (dirnodes), which map
encrypted child names to rw-URI/ro-URI pairs.
=== SDMF slots overview ===
Each SDMF slot is created with a DSA public/private key pair, using a
system-wide common modulus and generator, in which the private key is a
random 256 bit number, and the public key is a larger value (about 2048 bits)
that can be derived with a bit of math from the private key. The public key
is known as the "verification key", while the private key is called the
"signature key".
The 256-bit signature key is used verbatim as the "write capability". This
can be converted into the 2048ish-bit verification key through a fairly cheap
set of modular exponentiation operations; this is done any time the holder of
the write-cap wants to read the data. (Note that the signature key can either
be a newly-generated random value, or the hash of something else, if we found
a need for a capability that's stronger than the write-cap).
This results in a write-cap which is 256 bits long and can thus be expressed
in an ASCII/transport-safe encoded form (base62 encoding, fits in 72
characters, including a local-node http: convenience prefix).
The private key is hashed to form a 256-bit "salt". The public key is also
hashed to form a 256-bit "pubkey hash". These two values are concatenated,
hashed, and truncated to 192 bits to form the first 192 bits of the read-cap.
The pubkey hash is hashed by itself and truncated to 64 bits to form the last
64 bits of the read-cap. The full read-cap is 256 bits long, just like the
write-cap.
The first 192 bits of the read-cap are hashed and truncated to form the first
192 bits of the "traversal cap". The last 64 bits of the read-cap are hashed
to form the last 64 bits of the traversal cap. This gives us a 256-bit
traversal cap.
The first 192 bits of the traversal-cap are hashed and truncated to form the
first 64 bits of the storage index. The last 64 bits of the traversal-cap are
hashed to form the last 64 bits of the storage index. This gives us a 128-bit
storage index.
The verification-cap is the first 64 bits of the storage index plus the
pubkey hash, 320 bits total. The verification-cap doesn't need to be
expressed in a printable transport-safe form, so it's ok that it's longer.
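To make the derivation chain concrete, here is a minimal Python sketch of the
cap derivations described above. It is illustrative only: the tagged-hash
helper, the tag strings, and the key serialization are assumptions, not part
of this proposal.

  import hashlib

  def tagged_hash(tag, *values):
      # hypothetical tagged SHA-256; the real scheme would fix the tags
      h = hashlib.sha256(tag)
      for v in values:
          h.update(v)
      return h.digest()  # 32 bytes (256 bits)

  def derive_caps(signature_key, verification_key):
      # signature_key: the 256-bit DSA private key (== the write-cap)
      # verification_key: the serialized public key derived from it
      salt = tagged_hash(b"salt", signature_key)
      pubkey_hash = tagged_hash(b"pubkey", verification_key)
      read_cap = (tagged_hash(b"read-1", salt, pubkey_hash)[:24]   # 192 bits
                  + tagged_hash(b"read-2", pubkey_hash)[:8])       # + 64 bits
      traversal_cap = (tagged_hash(b"trav-1", read_cap[:24])[:24]
                       + tagged_hash(b"trav-2", read_cap[24:])[:8])
      storage_index = (tagged_hash(b"si-1", traversal_cap[:24])[:8]
                       + tagged_hash(b"si-2", traversal_cap[24:])[:8])  # 128 bits
      verify_cap = storage_index[:8] + pubkey_hash  # 320 bits
      return read_cap, traversal_cap, storage_index, verify_cap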
The read-cap is hashed one way to form an AES encryption key that is used to
encrypt the salt; this key is called the "salt key". The encrypted salt is
stored in the share. The private key never changes, therefore the salt never
changes, and the salt key is only used for a single purpose, so there is no
need for an IV.
The read-cap is hashed a different way to form the master data encryption
key. A random "data salt" is generated each time the share's contents are
replaced, and the master data encryption key is concatenated with the data
salt, then hashed, to form the AES CTR-mode "read key" that will be used to
encrypt the actual file data. This is to avoid key-reuse. An outstanding
issue is how to avoid key reuse when files are modified in place instead of
being replaced completely; this is not done in SDMF but might occur in MDMF.
The master data encryption key is used to encrypt data that should be visible
to holders of a write-cap or a read-cap, but not to holders of a
traversal-cap.
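A sketch of this per-update encryption, reusing the hypothetical tagged_hash
helper from the sketch above and using the pycryptodome AES module purely for
illustration; the tag strings and the zero-length-nonce CTR construction are
assumptions:

  import os
  from Crypto.Cipher import AES  # pycryptodome, an illustrative choice only

  def encrypt_file_data(read_cap, plaintext):
      master_key = tagged_hash(b"data-master", read_cap)  # never used directly
      data_salt = os.urandom(32)                          # fresh for each update
      read_key = tagged_hash(b"read-key", master_key, data_salt)
      # CTR mode without a nonce is acceptable here only because read_key is
      # unique per update (which is exactly what the data salt provides)
      cipher = AES.new(read_key, AES.MODE_CTR, nonce=b"")
      return data_salt, cipher.encrypt(plaintext)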
The private key is hashed one way to form the salt, and a different way to
form the "write enabler master". For each storage server on which a share is
kept, the write enabler master is concatenated with the server's nodeid and
hashed, and the result is called the "write enabler" for that particular
server. Note that multiple shares of the same slot stored on the same server
will all get the same write enabler, i.e. the write enabler is associated
with the "bucket", rather than the individual shares.
The private key is hashed a third way to form the "data write key", which can
be used by applications which wish to store some data in a form that is only
available to those with a write-cap, and not to those with merely a read-cap.
This is used to implement transitive read-onlyness of dirnodes.
The traversal cap is hashed to form the "traversal key", which can be used by
applications that wish to store data in a form that is available to holders
of a write-cap, read-cap, or traversal-cap.
The idea is that dirnodes will store child write-caps under the data write key,
child names and read-caps under the read-key, and verify-caps (for files) or
deep-verify-caps (for directories) under the traversal key. This would give
the holder of a root deep-verify-cap the ability to create a verify manifest
for everything reachable from the root, but not the ability to see any
plaintext or filenames. This would make it easier to delegate filechecking
and repair to a not-fully-trusted agent.
The public key is stored on the servers, as is the encrypted salt, the
(non-encrypted) data salt, the encrypted data, and a signature. The container
records the write-enabler, but of course this is not visible to readers. To
make sure that every byte of the share can be verified by a holder of the
verify-cap (and also by the storage server itself), the signature covers the
version number, the sequence number, the root hash "R" of the share merkle
tree, the encoding parameters, and the encrypted salt. "R" itself covers the
hash trees and the share data.
The read-write URI is just the private key. The read-only URI is the
read-cap. The deep-verify URI is the traversal-cap. The verify-only URI
contains the pubkey hash and the first 64 bits of the storage index.
FMW:b2a(privatekey)
FMR:b2a(readcap)
FMT:b2a(traversalcap)
FMV:b2a(storageindex[:64])b2a(pubkey-hash)
Note that this allows the read-only, deep-verify, and verify-only URIs to be
derived from the read-write URI without actually retrieving any data from the
share, but instead by regenerating the public key from the private one. Users
of the read-only, deep-verify, or verify-only caps must validate the public
key against their pubkey hash (or its derivative) the first time they
retrieve the pubkey, before trusting any signatures they see.
The SDMF slot is allocated by sending a request to the storage server with a
desired size, the storage index, and the write enabler for that server's
nodeid. If granted, the write enabler is stashed inside the slot's backing
store file. All further write requests must be accompanied by the write
enabler or they will not be honored. The storage server does not share the
write enabler with anyone else.
The SDMF slot structure will be described in more detail below. The important
pieces are:
* a sequence number
* a root hash "R"
* the data salt
* the encoding parameters (including k, N, file size, segment size)
* a signed copy of [seqnum,R,data_salt,encoding_params] (using signature key)
* the verification key (not encrypted)
* the share hash chain (part of a Merkle tree over the share hashes)
* the block hash tree (Merkle tree over blocks of share data)
* the share data itself (erasure-coding of read-key-encrypted file data)
* the salt, encrypted with the salt key
The access pattern for read (assuming we hold the write-cap) is:
* generate public key from the private one
* hash private key to get the salt, hash public key, form read-cap
* form storage-index
* use storage-index to locate 'k' shares with identical 'R' values
* either get one share, read 'k' from it, then read k-1 shares
* or read, say, 5 shares, discover k, either get more or be finished
* or copy k into the URIs
* .. jump to "COMMON READ", below
To read (assuming we only hold the read-cap), do:
* hash read-cap pieces to generate storage index and salt key
* use storage-index to locate 'k' shares with identical 'R' values
* retrieve verification key and encrypted salt
* decrypt salt
* hash decrypted salt and pubkey to generate another copy of the read-cap,
make sure they match (this validates the pubkey)
* .. jump to "COMMON READ"
* COMMON READ:
* read seqnum, R, data salt, encoding parameters, signature
* verify signature against verification key
* hash data salt and read-cap to generate read-key
* read share data, compute block-hash Merkle tree and root "r"
* read share hash chain (leading from "r" to "R")
* validate share hash chain up to the root "R"
* submit share data to erasure decoding
* decrypt decoded data with read-key
* submit plaintext to application
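The pubkey-validation step in the read-cap path could look roughly like this,
again reusing the hypothetical tagged_hash helper and the illustrative
pycryptodome AES choice from the sketches above:

  from Crypto.Cipher import AES

  def validate_pubkey(read_cap, encrypted_salt, verification_key):
      # decrypt the salt, re-derive the read-cap, and compare
      salt_key = tagged_hash(b"salt-key", read_cap)
      salt = AES.new(salt_key, AES.MODE_CTR, nonce=b"").decrypt(encrypted_salt)
      pubkey_hash = tagged_hash(b"pubkey", verification_key)
      rederived = (tagged_hash(b"read-1", salt, pubkey_hash)[:24]
                   + tagged_hash(b"read-2", pubkey_hash)[:8])
      return rederived == read_cap  # False means the pubkey is not ours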
The access pattern for write is:
* generate pubkey, salt, read-cap, storage-index as in read case
* generate data salt for this update, generate read-key
* encrypt plaintext from application with read-key
* application can encrypt some data with the data-write-key to make it
only available to writers (used for transitively-readonly dirnodes)
* erasure-code crypttext to form shares
* split shares into blocks
* compute Merkle tree of blocks, giving root "r" for each share
* compute Merkle tree of shares, find root "R" for the file as a whole
* create share data structures, one per server:
* use seqnum which is one higher than the old version
* share hash chain has log(N) hashes, different for each server
* signed data is the same for each server
* include pubkey, encrypted salt, data salt
* now we have N shares and need homes for them
* walk through peers
* if share is not already present, allocate-and-set
* otherwise, try to modify existing share:
* send testv_and_writev operation to each one
* testv says to accept share if their(seqnum+R) <= our(seqnum+R)
* count how many servers wind up with which versions (histogram over R)
* keep going until N servers have the same version, or we run out of servers
* if any servers wound up with a different version, report error to
application
* if we ran out of servers, initiate recovery process (described below)
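The "testv says to accept share if their(seqnum+R) <= our(seqnum+R)" step can
be expressed as a single test-vector entry: the sequence number (8 bytes,
big-endian) sits immediately before "R" in the share (see the SDMF slot
format below), so a lexicographic comparison of those 40 bytes matches the
(seqnum, R) ordering. A sketch, with the offset taken from the format table
and the helper name hypothetical:

  import struct

  SEQNUM_OFFSET = 1  # seqnum at offset 1, "R" at offset 9, per the format table

  def make_test_and_write(new_share, our_seqnum, our_R):
      # accept the write only if the share currently on disk holds a
      # (seqnum, R) pair that is <= ours
      specimen = struct.pack(">Q", our_seqnum) + our_R
      test_vector = [(SEQNUM_OFFSET, len(specimen), "le", specimen)]
      write_vector = [(0, new_share)]  # replace the whole share body
      return test_vector, write_vector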
==== Cryptographic Properties ====
This scheme protects the data's confidentiality with 192 bits of key
material, since the read-cap contains 192 secret bits (derived from an
encrypted salt, which is encrypted using those same 192 bits plus some
additional public material).
The integrity of the data (assuming that the signature is valid) is protected
by the 256-bit hash which gets included in the signature. The privilege of
modifying the data (equivalent to the ability to form a valid signature) is
protected by a 256 bit random DSA private key, and the difficulty of
computing a discrete logarithm in a 2048-bit field.
There are a few weaker denial-of-service attacks possible. If N-k+1 of the
shares are damaged or unavailable, the client will be unable to recover the
file. Any coalition of more than N-k shareholders will be able to effect this
attack by merely refusing to provide the desired share. The "write enabler"
shared secret protects existing shares from being displaced by new ones,
except by the holder of the write-cap. One server cannot affect the other
shares of the same file, once those other shares are in place.
The worst DoS attack is the "roadblock attack", which must be made before
those shares get placed. Storage indexes are effectively random (being
derived from the hash of a random value), so they are not guessable before
the writer begins their upload, but there is a window of vulnerability during
the beginning of the upload, when some servers have heard about the storage
index but not all of them.
The roadblock attack we want to prevent is when the first server that the
uploader contacts quickly runs to all the other selected servers and places a
bogus share under the same storage index, before the uploader can contact
them. These shares will normally be accepted, since storage servers create
new shares on demand. The bogus shares would have randomly-generated
write-enablers, which will of course be different than the real uploader's
write-enabler, since the malicious server does not know the write-cap.
If this attack were successful, the uploader would be unable to place any of
their shares, because the slots have already been filled by the bogus shares.
The uploader would probably try for peers further and further away from the
desired location, but eventually they will hit a preconfigured distance limit
and give up. In addition, the further the writer searches, the less likely it
is that a reader will search as far. So a successful attack will either cause
the file to be uploaded but not be reachable, or it will cause the upload to
fail.
If the uploader tries again (creating a new privkey), they may get lucky and
the malicious servers will appear later in the query list, giving sufficient
honest servers a chance to see their share before the malicious one manages
to place bogus ones.
The first line of defense against this attack is timing: the attacking
server must be ready to act the moment a storage request arrives
(which will only occur for a certain percentage of all new-file uploads), and
only has a few seconds to act before the other servers will have allocated
the shares (and recorded the write-enabler, terminating the window of
vulnerability).
The second line of defense is post-verification, and is possible because the
storage index is partially derived from the public key hash. A storage server
can, at any time, verify every public bit of the container as being signed by
the verification key (this operation is recommended as a continual background
process, when disk usage is minimal, to detect disk errors). The server can
also hash the verification key to derive 64 bits of the storage index. If it
detects that these 64 bits do not match (but the rest of the share validates
correctly), then the implication is that this share was stored to the wrong
storage index, either due to a bug or a roadblock attack.
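A sketch of that server-side check, assuming (per the derivation chain and
the hypothetical tags used in the earlier sketches) that the pubkey-derivable
portion is the trailing 64 bits of the storage index:

  def server_post_verify(storage_index, verification_key):
      pubkey_hash = tagged_hash(b"pubkey", verification_key)
      readcap_tail = tagged_hash(b"read-2", pubkey_hash)[:8]
      traversal_tail = tagged_hash(b"trav-2", readcap_tail)[:8]
      si_tail = tagged_hash(b"si-2", traversal_tail)[:8]
      # a mismatch means the share was stored under the wrong storage index,
      # either by accident or as a roadblock attack
      return si_tail == storage_index[8:]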
If an uploader finds that they are unable to place their shares because of
"bad write enabler errors" (as reported by the prospective storage servers),
it can "cry foul", and ask the storage server to perform this verification on
the share in question. If the pubkey and storage index do not match, the
storage server can delete the bogus share, thus allowing the real uploader to
place their share. Of course the origin of the offending bogus share should
be logged and reported to a central authority, so corrective measures can be
taken. It may be necessary to have this "cry foul" protocol include the new
write-enabler, to close the window during which the malicious server can
re-submit the bogus share during the adjudication process.
If the problem persists, the servers can be placed into pre-verification
mode, in which this verification is performed on all potential shares before
being committed to disk. This mode is more CPU-intensive (since normally the
storage server ignores the contents of the container altogether), but would
solve the problem completely.
The mere existence of these potential defenses should be sufficient to deter
any actual attacks. Note that the storage index only has 64 bits of
pubkey-derived data in it, which is below the usual crypto guidelines for
security factors. In this case it's a pre-image attack which would be needed,
rather than a collision, and the actual attack would be to find a keypair for
which the public key can be hashed three times to produce the desired portion
of the storage index. We believe that 64 bits of material is sufficiently
resistant to this form of pre-image attack to serve as a suitable deterrent.
=== Server Storage Protocol ===
The storage servers will provide a mutable slot container which is oblivious
to the details of the data being contained inside it. Each storage index
refers to a "bucket", and each bucket has one or more shares inside it. (In a
well-provisioned network, each bucket will have only one share). The bucket
is stored as a directory, using the base32-encoded storage index as the
directory name. Each share is stored in a single file, using the share number
as the filename.
The container holds space for a container magic number (for versioning), the
write enabler, the nodeid which accepted the write enabler (used for share
migration, described below), a small number of lease structures, the embedded
data itself, and expansion space for additional lease structures.
 #   offset  size   name
 1   0       32     magic verstr "tahoe mutable container v1" plus binary
 2   32      20     write enabler's nodeid
 3   52      32     write enabler
 4   84      8      data size (actual share data present) (a)
 5   92      8      offset of (8) count of extra leases (after data)
 6   100     368    four leases, 92 bytes each
                     0   4    ownerid (0 means "no lease here")
                     4   4    expiration timestamp
                     8   32   renewal token
                     40  32   cancel token
                     72  20   nodeid which accepted the tokens
 7   468     (a)    data
 8   ??      4      count of extra leases
 9   ??      n*92   extra leases
The "extra leases" field must be copied and rewritten each time the size of
the enclosed data changes. The hope is that most buckets will have four or
fewer leases and this extra copying will not usually be necessary.
The (4) "data size" field contains the actual number of bytes of data present
in field (7), such that a client request to read beyond 504+(a) will result
in an error. This allows the client to (one day) read relative to the end of
the file. The container size (that is, (8)-(7)) might be larger, especially
if extra size was pre-allocated in anticipation of filling the container with
a lot of data.
The offset in (5) points at the *count* of extra leases, at (8). The actual
leases (at (9)) begin 4 bytes later. If the container size changes, both (8)
and (9) must be relocated by copying.
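A sketch of parsing the fixed-size part of this container with the Python
struct module; the field widths come straight from the table above, while the
function name and file handling are illustrative:

  import struct

  # magic (32), WE nodeid (20), write enabler (32), data size (8),
  # offset of the extra-lease count (8): 100 bytes total
  HEADER = struct.Struct(">32s20s32sQQ")
  # ownerid (4), expiration (4), renew token (32), cancel token (32),
  # accepting nodeid (20): 92 bytes per lease
  LEASE = struct.Struct(">LL32s32s20s")

  def read_container_header(f):
      magic, we_nodeid, write_enabler, data_size, extra_lease_offset = \
          HEADER.unpack(f.read(HEADER.size))
      leases = [LEASE.unpack(f.read(LEASE.size)) for _ in range(4)]
      return magic, we_nodeid, write_enabler, data_size, extra_lease_offset, leases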
The server will honor any write commands that provide the write enabler and do
not exceed the server-wide storage size limitations. Read and write commands
MUST be restricted to the 'data' portion of the container: the implementation
of those commands MUST perform correct bounds-checking to make sure other
portions of the container are inaccessible to the clients.
The two methods provided by the storage server on these "MutableSlot" share
objects are:
* readv(ListOf(offset=int, length=int))
* returns a list of bytestrings, of the various requested lengths
* offset < 0 is interpreted relative to the end of the data
* spans which hit the end of the data will return truncated data
* testv_and_writev(write_enabler, test_vector, write_vector)
* this is a test-and-set operation which performs the given tests and only
applies the desired writes if all tests succeed. This is used to detect
simultaneous writers, and to reduce the chance that an update will lose
data recently written by some other party (written after the last time
this slot was read).
* test_vector=ListOf(TupleOf(offset, length, opcode, specimen))
* the opcode is a string, from the set [gt, ge, eq, le, lt, ne]
* each element of the test vector is read from the slot's data and
compared against the specimen using the desired (in)equality. If all
tests evaluate True, the write is performed
* write_vector=ListOf(TupleOf(offset, newdata))
* offset < 0 is not yet defined, it probably means relative to the
end of the data, which probably means append, but we haven't nailed
it down quite yet
* write vectors are executed in order, which specifies the results of
overlapping writes
* return value:
* error: OutOfSpace
* error: something else (io error, out of memory, whatever)
* (True, old_test_data): the write was accepted (test_vector passed)
* (False, old_test_data): the write was rejected (test_vector failed)
* both 'accepted' and 'rejected' return the old data that was used
for the test_vector comparison. This can be used by the client
to detect write collisions, including collisions for which the
desired behavior was to overwrite the old version.
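A sketch of how a server might evaluate these vectors against the current
share contents; the opcode names come from the list above, while the function
and exception names are hypothetical:

  import operator

  OPCODES = {"gt": operator.gt, "ge": operator.ge, "eq": operator.eq,
             "le": operator.le, "lt": operator.lt, "ne": operator.ne}

  class BadWriteEnablerError(Exception):
      pass

  def testv_and_writev(share_data, stored_we, supplied_we,
                       test_vector, write_vector):
      if supplied_we != stored_we:
          raise BadWriteEnablerError()
      old_test_data = [share_data[offset:offset + length]
                       for (offset, length, _op, _specimen) in test_vector]
      accepted = all(OPCODES[op](share_data[offset:offset + length], specimen)
                     for (offset, length, op, specimen) in test_vector)
      if accepted:
          data = bytearray(share_data)
          for (offset, newdata) in write_vector:       # applied in order, so
              data[offset:offset + len(newdata)] = newdata  # later writes win
          share_data = bytes(data)
      return (accepted, old_test_data), share_data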
In addition, the storage server provides several methods to access these
share objects:
* allocate_mutable_slot(storage_index, sharenums=SetOf(int))
* returns DictOf(int, MutableSlot)
* get_mutable_slot(storage_index)
* returns DictOf(int, MutableSlot)
* or raises KeyError
We intend to add an interface which allows small slots to allocate-and-write
in a single call, as well as do update or read in a single call. The goal is
to allow a reasonably-sized dirnode to be created (or updated, or read) in
just one round trip (to all N shareholders in parallel).
==== migrating shares ====
If a share must be migrated from one server to another, two values become
invalid: the write enabler (since it was computed for the old server), and
the lease renew/cancel tokens.
Suppose that a slot was first created on nodeA, and was thus initialized with
WE(nodeA) (= H(WEM+nodeA)). Later, for provisioning reasons, the share is
moved from nodeA to nodeB.
Readers may still be able to find the share in its new home, depending upon
how many servers are present in the grid, where the new nodeid lands in the
permuted index for this particular storage index, and how many servers the
reading client is willing to contact.
When a client attempts to write to this migrated share, it will get a "bad
write enabler" error, since the WE it computes for nodeB will not match the
WE(nodeA) that was embedded in the share. When this occurs, the "bad write
enabler" message must include the old nodeid (e.g. nodeA) that was in the
share.
The client then computes H(nodeB+H(WEM+nodeA)), which is the same as
H(nodeB+WE(nodeA)). The client sends this along with the new WE(nodeB), which
is H(WEM+nodeB). Note that the client only sends WE(nodeB) to nodeB, never to
anyone else. Also note that the client does not send a value to nodeB that
would allow the node to impersonate the client to a third node: everything
sent to nodeB will include something specific to nodeB in it.
The server locally computes H(nodeB+WE(nodeA)), using its own node id and the
old write enabler from the share. It compares this against the value supplied
by the client. If they match, this serves as proof that the client was able
to compute the old write enabler. The server then accepts the client's new
WE(nodeB) and writes it into the container.
This WE-fixup process requires an extra round trip, and requires the error
message to include the old nodeid, but does not require any public key
operations on either client or server.
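A sketch of both halves of that exchange, reusing the hypothetical tagged_hash
helper from the earlier sketches (the real construction is whatever hash the
write-enabler scheme settles on):

  def client_we_fixup(write_enabler_master, old_nodeid, new_server_nodeid):
      old_we = tagged_hash(b"we", write_enabler_master, old_nodeid)
      # H(nodeB + WE(nodeA)): proves we can compute the old write enabler,
      # but is only useful to nodeB itself
      proof = tagged_hash(b"we-fixup", new_server_nodeid, old_we)
      new_we = tagged_hash(b"we", write_enabler_master, new_server_nodeid)
      return proof, new_we  # sent only to the new server (nodeB)

  def server_we_fixup(my_nodeid, stored_old_we, claimed_proof, new_we):
      expected = tagged_hash(b"we-fixup", my_nodeid, stored_old_we)
      if expected != claimed_proof:
          return None  # the client could not compute the old write enabler
      return new_we    # accept and write the new WE into the container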
Migrating the leases will require a similar protocol. This protocol will be
defined concretely at a later date.
=== Code Details ===
The current FileNode class will be renamed ImmutableFileNode, and a new
MutableFileNode class will be created. Instances of this class will contain a
URI and a reference to the client (for peer selection and connection). The
methods of MutableFileNode are:
* replace(newdata) -> OK, ConsistencyError, NotEnoughPeersError
* get() -> [deferred] newdata, NotEnoughPeersError
* if there are multiple retrievable versions in the grid, get() returns
the first version it can reconstruct, and silently ignores the others.
In the future, a more advanced API will signal and provide access to
the multiple heads.
The peer-selection and data-structure manipulation (and signing/verification)
steps will be implemented in a separate class in allmydata/mutable.py .
=== SDMF Slot Format ===
This SDMF data lives inside a server-side MutableSlot container. The server
is generally oblivious to this format, but it may look inside the container
when verification is desired.
This data is tightly packed. There are no gaps left between the different
fields, and the offset table is mainly present to allow future flexibility of
key sizes.
 #    offset   size    name
 1    0        1       version byte, \x01 for this format
 2    1        8       sequence number. 2^64-1 must be handled specially, TBD
 3    9        32      "R" (root of share hash Merkle tree)
 4    41       32      data salt (readkey is H(readcap+data_salt))
 5    73       32      encrypted salt (AESenc(key=H(readcap), salt))
 6    105      18      encoding parameters:
       105     1        k
       106     1        N
       107     8        segment size
       115     8        data length (of original plaintext)
 7    123      36      offset table:
       127     4        (9) signature
       131     4        (10) share hash chain
       135     4        (11) block hash tree
       139     4        (12) share data
       143     8        (13) EOF
 8    151      256     verification key (2048-bit DSA key)
 9    407      40      signature=DSAsig(H([1,2,3,4,5,6]))
10    447      (a)     share hash chain, encoded as:
                        "".join([pack(">H32s", shnum, hash)
                                 for (shnum,hash) in needed_hashes])
11    ??       (b)     block hash tree, encoded as:
                        "".join([pack(">32s",hash) for hash in block_hash_tree])
12    ??       LEN     share data
13    ??       --      EOF
(a) The share hash chain contains ceil(log2(N)) hashes, each 32 bytes long.
This is the set of hashes necessary to validate this share's leaf in the
share Merkle tree. For N=10, this is 4 hashes, i.e. 128 bytes.
(b) The block hash tree contains ceil(length/segsize) hashes, each 32 bytes
long. This is the set of hashes necessary to validate any given block of
share data up to the per-share root "r". Each "r" is a leaf of the share
hash tree (with root "R"), from which a minimal subset of hashes is put in
the share hash chain in (10).
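A sketch of how a writer could compute the per-share roots "r" and the
file-wide root "R", assuming a simple binary Merkle tree over SHA-256 whose
leaf list is padded to a power of two; the padding rule and the lack of hash
tagging are simplifications, not part of this proposal:

  import hashlib

  def merkle_root(leaves):
      nodes = list(leaves)
      while len(nodes) & (len(nodes) - 1):  # pad to a power of two
          nodes.append(nodes[-1])
      while len(nodes) > 1:
          nodes = [hashlib.sha256(nodes[i] + nodes[i + 1]).digest()
                   for i in range(0, len(nodes), 2)]
      return nodes[0]

  def share_roots(shares, block_size):
      roots = []
      for share in shares:
          blocks = [share[i:i + block_size]
                    for i in range(0, len(share), block_size)]
          roots.append(merkle_root([hashlib.sha256(b).digest() for b in blocks]))
      return roots, merkle_root(roots)  # per-share "r" values, and "R"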
=== Recovery ===
The first line of defense against damage caused by colliding writes is the
Prime Coordination Directive: "Don't Do That".
The second line of defense is to keep "S" (the number of competing versions)
lower than N/k. If this holds true, at least one competing version will have
k shares and thus be recoverable. Note that server unavailability counts
against us here: the old version stored on the unavailable server must be
included in the value of S.
The third line of defense is our use of testv_and_writev() (described below),
which increases the convergence of simultaneous writes: one of the writers
will be favored (the one with the highest "R"), and that version is more
likely to be accepted than the others. This defense is least effective in the
pathological situation where S simultaneous writers are active, the one with
the lowest "R" writes to N-k+1 of the shares and then dies, then the one with
the next-lowest "R" writes to N-2k+1 of the shares and dies, etc, until the
one with the highest "R" writes to k-1 shares and dies. Any other sequencing
will allow the highest "R" to write to at least k shares and establish a new
revision.
The fourth line of defense is the fact that each client keeps writing until
at least one version has N shares. This uses additional servers, if
necessary, to make sure that either the client's version or some
newer/overriding version is highly available.
The fifth line of defense is the recovery algorithm, which seeks to make sure
that at least *one* version is highly available, even if that version is
somebody else's.
The write-shares-to-peers algorithm is as follows:
* permute peers according to storage index
* walk through peers, trying to assign one share per peer
* for each peer:
* send testv_and_writev, using "old(seqnum+R) <= our(seqnum+R)" as the test
* this means that we will overwrite any old versions, and we will
overwrite simultaneous writers of the same version if our R is higher.
We will not overwrite writers using a higher seqnum.
* record the version that each share winds up with. If the write was
accepted, this is our own version. If it was rejected, read the
old_test_data to find out what version was retained.
* if old_test_data indicates the seqnum was equal or greater than our
own, mark the "Simultanous Writes Detected" flag, which will eventually
result in an error being reported to the writer (in their close() call).
* build a histogram of "R" values
* repeat until the histogram indicates that some version (possibly ours)
has N shares. Use new servers if necessary.
* If we run out of servers:
* if there are at least shares-of-happiness of any one version, we're
happy, so return. (the close() might still get an error)
* not happy, need to reinforce something, goto RECOVERY
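A sketch of the version-histogram bookkeeping used in the loop above; a
"version" here stands for the (seqnum, R) pair, and the function and argument
names are hypothetical:

  from collections import Counter

  def placement_status(placements, N, shares_of_happiness):
      # placements: (peer, version) pairs recording which version each placed
      # share ended up holding (ours, or the one that beat us)
      if not placements:
          return "need-recovery", None
      histogram = Counter(version for (_peer, version) in placements)
      leader, count = histogram.most_common(1)[0]
      if count >= N:
          return "done", leader
      if count >= shares_of_happiness:
          return "happy-enough", leader  # close() might still report an error
      return "need-recovery", leader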
RECOVERY:
* read all shares, count the versions, identify the recoverable ones,
discard the unrecoverable ones.
* sort versions: locate max(seqnums), put all versions with that seqnum
in the list, sort by number of outstanding shares. Then put our own
version. (TODO: put versions with seqnum <max but >us ahead of us?).
* for each version:
* attempt to recover that version
* if not possible, remove it from the list, go to next one
* if recovered, start at beginning of peer list, push that version,
continue until N shares are placed
* if pushing our own version, bump up the seqnum to one higher than
the max seqnum we saw
* if we run out of servers:
* schedule retry and exponential backoff to repeat RECOVERY
* admit defeat after some period? presumably the client will be shut down
eventually, maybe keep trying (once per hour?) until then.
== Medium Distributed Mutable Files ==
These are just like the SDMF case, but:
* we actually take advantage of the Merkle hash tree over the blocks, by
reading a single segment of data at a time (and its necessary hashes), to
reduce the read-time alacrity
* we allow arbitrary writes to the file (i.e. seek() is provided, and
O_TRUNC is no longer required)
* we write more code on the client side (in the MutableFileNode class), to
first read each segment that a write must modify. This looks exactly like
the way a normal filesystem uses a block device, or how a CPU must perform
a cache-line fill before modifying a single word.
* we might implement some sort of copy-based atomic update server call,
to allow multiple writev() calls to appear atomic to any readers.
MDMF slots provide fairly efficient in-place edits of very large files (a few
GB). Appending data is also fairly efficient, although each time a power of 2
boundary is crossed, the entire file must effectively be re-uploaded (because
the size of the block hash tree changes), so if the filesize is known in
advance, that space ought to be pre-allocated (by leaving extra space between
the block hash tree and the actual data).
MDMF1 uses the Merkle tree to enable low-alacrity random-access reads. MDMF2
adds cache-line reads to allow random-access writes.
== Large Distributed Mutable Files ==
LDMF slots use a fundamentally different way to store the file, inspired by
Mercurial's "revlog" format. They enable very efficient insert/remove/replace
editing of arbitrary spans. Multiple versions of the file can be retained, in
a revision graph that can have multiple heads. Each revision can be
referenced by a cryptographic identifier. There are two forms of the URI, one
that means "most recent version", and a longer one that points to a specific
revision.
Metadata can be attached to the revisions, like timestamps, to enable rolling
back an entire tree to a specific point in history.
LDMF1 provides deltas but tries to avoid dealing with multiple heads. LDMF2
provides explicit support for revision identifiers and branching.
== TODO ==
improve allocate-and-write or get-writer-buckets API to allow one-call (or
maybe two-call) updates. The challenge is in figuring out which shares are on
which machines. First cut will have lots of round trips.
(eventually) define behavior when seqnum wraps. At the very least make sure
it can't cause a security problem. "the slot is worn out" is acceptable.
(eventually) define share-migration lease update protocol. Including the
nodeid who accepted the lease is useful, we can use the same protocol as we
do for updating the write enabler. However we need to know which lease to
update... maybe send back a list of all old nodeids that we find, then try all
of them when we accept the update?
We now do this in a specially-formatted IndexError exception:
"UNABLE to renew non-existent lease. I have leases accepted by " +
"nodeids: '12345','abcde','44221' ."
Every node in a given tahoe grid must have the same common DSA moduli and
exponent, but different grids could use different parameters. We haven't
figured out how to define a "grid id" yet, but I think the DSA parameters
should be part of that identifier. In practical terms, this might mean that
the Introducer tells each node what parameters to use, or perhaps the node
could have a config file which specifies them instead.