(protocol proposal, work-in-progress, not authoritative)
= Mutable Files =
Mutable File Slots are places with a stable identifier that can hold data
that changes over time. In contrast to CHK files, for which the
URI/identifier is derived from the contents themselves, the Mutable File Slot
URI remains fixed for the life of the slot, regardless of what data is placed
inside it.
Each mutable slot is referenced by two different URIs. The "read-write" URI
grants read-write access to its holder, allowing them to put whatever
contents they like into the slot. The "read-only" URI is less powerful, only
granting read access, and not enabling modification of the data. The
read-write URI can be turned into the read-only URI, but not the other way
around.
The data in these slots is distributed over a number of servers, using the
same erasure coding that CHK files use, with 3-of-10 being a typical choice
of encoding parameters. The data is encrypted and signed in such a way that
only the holders of the read-write URI will be able to set the contents of
the slot, and only the holders of the read-only URI will be able to read
those contents. Holders of either URI will be able to validate the contents
as being written by someone with the read-write URI. The servers who hold the
shares cannot read or modify them: the worst they can do is deny service (by
deleting or corrupting the shares), or attempt a rollback attack (which can
only succeed with the cooperation of at least k servers).
== Consistency vs Availability ==
There is an age-old battle between consistency and availability. Epic papers
have been written, elaborate proofs have been established, and generations of
theorists have learned that you cannot simultaneously achieve guaranteed
consistency with guaranteed availability. In addition, the closer you get to
a perfect guarantee on either axis, the higher the cost and complexity of the
design.
Tahoe's design goals are to largely favor design simplicity, and then to
slightly favor read availability, over the other criteria.
As we develop more sophisticated mutable slots, the API may expose multiple
read versions to the application layer. The tahoe philosophy is to defer most
consistency recovery logic to the higher layers. Some applications have
effective ways to merge multiple versions, so inconsistency is not
necessarily a problem (e.g. directory nodes can usually merge multiple "add
child" operations).
== The Prime Coordination Directive: "Don't Do That" ==
The current rule for applications which run on top of Tahoe is "do not
perform simultaneous uncoordinated writes". That means you need non-tahoe
means to make sure that two parties are not trying to modify the same mutable
slot at the same time. For example:
* don't give the read-write URI to anyone else. Dirnodes in a private
directory generally satisfy this case, as long as you don't use two
clients on the same account at the same time
* if you give a read-write URI to someone else, stop using it yourself. An
inbox would be a good example of this.
* if you give a read-write URI to someone else, call them on the phone
before you write into it
* build an automated mechanism to have your agents coordinate writes.
For example, we expect a future release to include a FURL for a
"coordination server" in the dirnodes. The rule can be that you must
contact the coordination server and obtain a lock/lease on the file
before you're allowed to modify it.
If you do not follow this rule, Bad Things will happen. The worst-case Bad
Thing is that the entire file will be lost. A less-bad Bad Thing is that one
or more of the simultaneous writers will lose their changes. An observer of
the file may not see monotonically-increasing changes to the file, i.e. they
may see version 1, then version 2, then 3, then 2 again.
Tahoe takes some amount of care to reduce the badness of these Bad Things.
One way you can help nudge it from the "lose your file" case into the "lose
some changes" case is to reduce the number of competing versions: multiple
versions of the file that different parties are trying to establish as the
one true current contents. Each simultaneous writer counts as a "competing
version", as does the previous version of the file. If the count "S" of these
competing versions is larger than N/k, then the file runs the risk of being
lost completely. If at least one of the writers remains running after the
collision is detected, it will attempt to recover, but if S>(N/k) and all
writers crash after writing a few shares, the file will be lost.
== Small Distributed Mutable Files ==
SDMF slots are suitable for small (<1MB) files that are edited by rewriting
the entire file. The three operations are:
* allocate (with initial contents)
* set (with new contents)
* get (old contents)
The first use of SDMF slots will be to hold directories (dirnodes), which map
encrypted child names to rw-URI/ro-URI pairs.
=== SDMF slots overview ===
Each SDMF slot is created with a public/private key pair (known as the
"verification key" and the "signature key"). The public key is hashed to form
the "read key" (an AES symmetric key), and the read key is hashed to form the
Storage Index (a unique string). The private key and public key are
concatenated together and hashed to form the "write key". The write key is
then hashed to form the "write enabler master". For each storage server on
which a share is kept, the write enabler master is concatenated with the
server's nodeid and hashed, and the result is called the "write enabler" for
that particular server.
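As a minimal sketch of that derivation chain, assuming SHA-256 for every
hash and hypothetical tag strings (the real code may use different hashes,
tags, and key lengths):

  import hashlib

  def tagged_hash(tag, data):
      # Hypothetical tagging scheme; each derivation step gets its own tag
      # so that the derived keys cannot be confused with one another.
      return hashlib.sha256(tag + b":" + data).digest()

  def derive_keys(pubkey, privkey, server_nodeids):
      read_key = tagged_hash(b"readkey", pubkey)             # AES key
      storage_index = tagged_hash(b"storageindex", read_key)
      write_key = tagged_hash(b"writekey", privkey + pubkey)
      we_master = tagged_hash(b"we-master", write_key)
      write_enablers = dict((nodeid, tagged_hash(b"we", we_master + nodeid))
                            for nodeid in server_nodeids)
      return read_key, storage_index, write_key, write_enablers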
The read-write URI consists of the write key and the storage index. The
read-only URI contains just the read key.
The SDMF slot is allocated by sending a request to the storage server with a
desired size, the storage index, and the write enabler for that server's
nodeid. If granted, the write enabler is stashed inside the slot's backing
store file. All further write requests must be accompanied by the write
enabler or they will not be honored. The storage server does not share the
write enabler with anyone else.
The SDMF slot structure will be described in more detail below. The important
pieces are:
* a sequence number
* a root hash "R"
* the encoding parameters (including k, N, and the file size)
* a signed copy of [seqnum,R,encoding_params], using the signature key
* the verification key (not encrypted)
* the share hash chain (part of a Merkle tree over the share hashes)
* the share data itself (erasure-coding of read-key-encrypted file data)
* the signature key, encrypted with the write key
The access pattern for read is:
* use storage index to locate 'k' shares with identical 'R' values
* read verification key
* hash verification key, compare against read key
* OOPS!!! verification key is in the clear, so read key is too!! FIX!
* read seqnum, R, encoding parameters, signature
* verify signature
* read share data, hash
* read share hash chain
* validate share hash chain up to the root "R" (sketched after this list)
* submit share data to erasure decoding
* decrypt decoded data with read-key
* submit plaintext to application
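The "validate share hash chain" step is ordinary Merkle-path verification. A
minimal sketch, assuming SHA-256 and a hypothetical leaf-to-root chain
encoding (the on-disk chain is just the ceil(log2(N)) raw hashes, with
positions implied by the share number):

  import hashlib

  def h(data):
      return hashlib.sha256(data).digest()

  def validate_share(share_data, chain, roothash):
      # 'chain' is a hypothetical list of (sibling_is_left, sibling_hash)
      # pairs, ordered from the leaf up to (but not including) the root.
      current = h(share_data)               # this share's leaf hash
      for sibling_is_left, sibling in chain:
          if sibling_is_left:
              current = h(sibling + current)
          else:
              current = h(current + sibling)
      return current == roothash            # must match "R" from the header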
The access pattern for write is:
* use the storage index to locate at least one share
* read verification key and encrypted signature key
* decrypt signature key using write-key
* hash the concatenation of the signature and verification keys, compare
against write-key
* hash verification key to form read-key
* encrypt plaintext from application with read-key
* erasure-code crypttext to form shares
* compute Merkle tree of shares, find root "R"
* create share data structures, one per server:
* use seqnum which is one higher than the old version
* share hash chain has log(N) hashes, different for each server
* signed data is the same for each server
* now we have N shares and need homes for them
* walk through peers
* if share is not already present, allocate-and-set
* otherwise, try to modify existing share:
* send testv_and_writev operation to each one
* testv says to accept share if their(seqnum+R) <= our(seqnum+R) (see the
sketch after this list)
* count how many servers wind up with which versions (histogram over R)
* keep going until N servers have the same version, or we run out of servers
* if any servers wound up with a different version, report error to
application
* if we ran out of servers, initiate recovery process (described below)
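The version comparison in the testv step is lexicographic on (seqnum, R); as
a sketch of the client-side predicate (names hypothetical):

  def may_overwrite(their_seqnum, their_R, our_seqnum, our_R):
      # The share on the server may be replaced only if its version is no
      # newer than ours: older seqnums always lose, and a same-seqnum rival
      # loses if its roothash R sorts lower than ours.
      return (their_seqnum, their_R) <= (our_seqnum, our_R)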
=== Server Storage Protocol ===
The storage servers will provide a mutable slot container which is oblivious
to the details of the data being contained inside it. Each storage index
refers to a "bucket", and each bucket has one or more shares inside it. (In a
well-provisioned network, each bucket will have only one share). The bucket
is stored as a directory, using the base32-encoded storage index as the
directory name. Each share is stored in a single file, using the share number
as the filename.
The container holds space for a container magic number (for versioning), the
write enabler, the nodeid for which the write enabler was generated (for
share migration, TBD), a small number of lease structures, the embedded data
itself, and expansion space for additional lease structures.
 #   offset  size  name
 1      0     32   magic verstr "tahoe mutable container v1" plus binary
 2     32     32   write enabler's nodeid
 3     64     32   write enabler
 4     96      8   offset of extra leases (after data)
 5    104    288   four leases, 72 bytes each:
         0     4     ownerid (0 means "no lease here")
         4     4     expiration timestamp
         8    32     renewal token
        40    32     cancel token
 6    392     ??   data
 7     ??      4   count of extra leases
 8     ??   n*72   extra leases
The "extra leases" field must be copied and rewritten each time the size of
the enclosed data changes. The hope is that most buckets will have four or
fewer leases and this extra copying will not usually be necessary.
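A sketch of unpacking that fixed-size prefix, assuming big-endian packing
(the document does not specify a byte order):

  import struct

  # magic, write enabler's nodeid, write enabler, offset of extra leases
  HEADER = struct.Struct(">32s32s32sQ")            # 104 bytes
  # ownerid, expiration timestamp, renewal token, cancel token
  LEASE = struct.Struct(">LL32s32s")               # 72 bytes

  def parse_container_prefix(blob):
      magic, we_nodeid, write_enabler, extra_lease_offset = \
          HEADER.unpack_from(blob, 0)
      leases = [LEASE.unpack_from(blob, HEADER.size + i * LEASE.size)
                for i in range(4)]                 # data begins at 392
      return magic, we_nodeid, write_enabler, extra_lease_offset, leases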
The server will honor any write commands that provide the write enabler and
do
not exceed the server-wide storage size limitations. Read and write commands
MUST be restricted to the 'data' portion of the container: the implementation
of those commands MUST perform correct bounds-checking to make sure other
portions of the container are inaccessible to the clients.
The two methods provided by the storage server on these "MutableSlot" share
objects are:
* readv(ListOf(offset=int, length=int))
* returns a list of bytestrings, of the various requested lengths
* offset < 0 is interpreted relative to the end of the data
* spans which hit the end of the data will return truncated data
* testv_and_writev(write_enabler, test_vector, write_vector)
* this is a test-and-set operation which performs the given tests and only
applies the desired writes if all tests succeed. This is used to detect
simultaneous writers, and to reduce the chance that an update will lose
data recently written by some other party (written after the last time
this slot was read).
* test_vector=ListOf(TupleOf(offset, length, opcode, specimen))
* the opcode is a string, from the set [gt, ge, eq, le, lt, ne]
* each element of the test vector is read from the slot's data and
compared against the specimen using the desired (in)equality. If all
tests evaluate True, the write is performed
* write_vector=ListOf(TupleOf(offset, newdata))
* offset < 0 is not yet defined, it probably means relative to the
end of the data, which probably means append, but we haven't nailed
it down quite yet
* write vectors are executed in order, which specifies the results of
overlapping writes
* return value:
* error: OutOfSpace
* error: something else (io error, out of memory, whatever)
* (True, old_test_data): the write was accepted (test_vector passed)
* (False, old_test_data): the write was rejected (test_vector failed)
* both 'accepted' and 'rejected' return the old data that was used
for the test_vector comparison. This can be used by the client
to detect write collisions, including collisions for which the
desired behavior was to overwrite the old version.
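A sketch of those semantics on the server side ('slot' is a hypothetical
object holding the share data as a mutable bytearray, and BadWriteEnablerError
is a placeholder; the OutOfSpace and I/O error paths are omitted):

  import operator

  OPCODES = {"lt": operator.lt, "le": operator.le, "eq": operator.eq,
             "ne": operator.ne, "ge": operator.ge, "gt": operator.gt}

  class BadWriteEnablerError(Exception):
      pass

  def testv_and_writev(slot, write_enabler, test_vector, write_vector):
      if write_enabler != slot.write_enabler:
          raise BadWriteEnablerError()
      # Read the old data first: it is returned whether or not the tests
      # pass, so clients can diagnose collisions.
      old_test_data = [bytes(slot.data[offset:offset + length])
                       for (offset, length, _, _) in test_vector]
      for ((_, _, opcode, specimen), read) in zip(test_vector, old_test_data):
          if not OPCODES[opcode](read, specimen):
              return (False, old_test_data)   # rejected, data untouched
      for (offset, newdata) in write_vector:  # in order: later writes win
          slot.data[offset:offset + len(newdata)] = newdata
      return (True, old_test_data)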
In addition, the storage server provides several methods to access these
share objects:
* allocate_mutable_slot(storage_index, sharenums=SetOf(int))
* returns DictOf(int, MutableSlot)
* get_mutable_slot(storage_index)
* returns DictOf(int, MutableSlot)
* or raises KeyError
We intend to add an interface which allows small slots to allocate-and-write
in a single call, as well as do update or read in a single call. The goal is
to allow a reasonably-sized dirnode to be created (or updated, or read) in
just one round trip (to all N shareholders in parallel).
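Put together, a single-server round trip using the proposed interfaces might
look like this sketch (nothing here exists yet; 'server' stands for a remote
reference to one storage server):

  def create_and_read(server, storage_index, write_enabler, initial_data):
      # Allocate share 0, write the initial contents unconditionally (an
      # empty test vector always passes), then read them back.
      shares = server.allocate_mutable_slot(storage_index,
                                            sharenums=set([0]))
      accepted, _old = shares[0].testv_and_writev(
          write_enabler, test_vector=[], write_vector=[(0, initial_data)])
      assert accepted
      return shares[0].readv([(0, len(initial_data))])[0]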
==== migrating shares ====
If a share must be migrated from one server to another, two values become
invalid: the write enabler (since it was computed for the old server), and
the lease renew/cancel tokens.
One idea we have is to say that the migration process is obligated to replace
the write enabler with its hash (but leaving the old "write enabler node id"
in place, to remind it that this WE isn't its own). When a writer attempts to
modify a slot with the old write enabler, the server will reject the request
and include the old WE-nodeid in the rejection message. The writer should
then realize that the share has been migrated and try again with the hash of
their old write enabler.
This process doesn't provide any means to fix up the write enabler, though,
requiring an extra roundtrip for the remainder of the slot's lifetime. It
might work better to have a call that allows the WE to be replaced, by
proving that the writer knows H(old-WE-nodeid,old-WE). If we leave the old WE
in place when migrating, this allows both writer and server to agree upon the
writer's authority, hopefully without granting the server any new authority
(or enabling it to trick a writer into revealing one).
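A sketch of that replacement call, assuming SHA-256 and hypothetical slot
fields; the old WE and old WE-nodeid stay in the container precisely so the
new server can check the proof:

  import hashlib

  def update_write_enabler(slot, proof, new_write_enabler):
      # The writer, knowing the old write enabler, presents
      # H(old-WE-nodeid + old-WE); the server verifies it against the
      # values kept from before the migration, then installs the new WE.
      expected = hashlib.sha256(slot.old_we_nodeid +
                                slot.old_write_enabler).digest()
      if proof != expected:
          raise ValueError("bad write-enabler proof")
      slot.write_enabler = new_write_enabler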
=== Code Details ===
The current FileNode class will be renamed ImmutableFileNode, and a new
MutableFileNode class will be created. Instances of this class will contain a
URI and a reference to the client (for peer selection and connection). The
methods of MutableFileNode are:
* replace(newdata) -> OK, ConsistencyError, NotEnoughPeersError
* get() -> [deferred] newdata, NotEnoughPeersError
* if there are multiple retrievable versions in the grid, get() returns
the first version it can reconstruct, and silently ignores the others.
In the future, a more advanced API will signal and provide access to
the multiple heads.
The peer-selection and data-structure manipulation (and signing/verification)
steps will be implemented in a separate class in allmydata/mutable.py .
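A skeleton of the proposed class (a sketch only; the method bodies and error
types are placeholders):

  from twisted.internet import defer

  class MutableFileNode:
      def __init__(self, uri, client):
          self.uri = uri        # RW- or RO-URI for the slot
          self.client = client  # used for peer selection and connections

      def replace(self, newdata):
          # -> Deferred: fires when newdata is established, or fails with
          #    ConsistencyError / NotEnoughPeersError
          return defer.fail(NotImplementedError("sketch only"))

      def get(self):
          # -> Deferred: fires with the first reconstructable version's
          #    plaintext; other retrievable versions are silently ignored
          return defer.fail(NotImplementedError("sketch only"))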
=== SDMF Slot Format ===
This SDMF data lives inside a server-side MutableSlot container. The server
is oblivious to this format.
 #   offset  size  name
 1      0      1   version byte, \x00 for this format
 2      1      8   sequence number. 2^64-1 must be handled specially, TBD
 3      9     32   "R" (root of share hash Merkle tree)
 4     41     18   encoding parameters:
        41     1     k
        42     1     N
        43     8     segment size
        51     8     data length
 5     59     32   offset table:
        91     4     (7) signature
        95     4     (8) share hash chain
        99     4     (9) share data
       103     8     (10) encrypted private key
 6    111    256   verification key (2048 RSA key 'n' value, e=3)
 7    367    256   signature= RSAenc(sig-key, H(version+seqnum+R+encparams))
 8    623    (a)   share hash chain
 9     ??    LEN   share data
10     ??    256   encrypted private key= AESenc(write-key, RSA 'd' value)
(a) The share hash chain contains ceil(log2(N)) hashes, each 32 bytes long.
This is the set of hashes necessary to validate this share's leaf in the
share Merkle tree. For N=10, this is 4 hashes, i.e. 128 bytes.
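A sketch of parsing the fixed part of that header (fields 1 through 4, which
end exactly at offset 59 where the offset table begins; big-endian packing
is assumed):

  import struct

  # version byte, seqnum, R, k, N, segment size, data length
  PREFIX = struct.Struct(">BQ32sBBQQ")             # 59 bytes

  def parse_sdmf_prefix(share):
      (version, seqnum, roothash, k, n,
       segsize, datalen) = PREFIX.unpack_from(share, 0)
      assert version == 0x00                       # this format only
      return seqnum, roothash, k, n, segsize, datalen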
=== Recovery ===
The first line of defense against damage caused by colliding writes is the
Prime Coordination Directive: "Don't Do That".
The second line of defense is to keep "S" (the number of competing versions)
lower than N/k. If this holds true, at least one competing version will have
k shares and thus be recoverable. Note that server unavailability counts
against us here: the old version stored on the unavailable server must be
included in the value of S.
The third line of defense is our use of testv_and_writev() (described above),
which increases the convergence of simultaneous writes: one of the writers
will be favored (the one with the highest "R"), and that version is more
likely to be accepted than the others. This defense is least effective in the
pathological situation where S simultaneous writers are active, the one with
the lowest "R" writes to N-k+1 of the shares and then dies, then the one with
the next-lowest "R" writes to N-2k+1 of the shares and dies, etc, until the
one with the highest "R" writes to k-1 shares and dies. Any other sequencing
will allow the highest "R" to write to at least k shares and establish a new
revision.
The fourth line of defense is the fact that each client keeps writing until
at least one version has N shares. This uses additional servers, if
necessary, to make sure that either the client's version or some
newer/overriding version is highly available.
The fifth line of defense is the recovery algorithm, which seeks to make sure
that at least *one* version is highly available, even if that version is
somebody else's.
The write-shares-to-peers algorithm is as follows:
* permute peers according to storage index
* walk through peers, trying to assign one share per peer
* for each peer:
* send testv_and_writev, using "old(seqnum+R) <= our(seqnum+R)" as the test
* this means that we will overwrite any old versions, and we will
overwrite simultaneous writers of the same version if our R is higher.
We will not overwrite writers using a higher seqnum.
* record the version that each share winds up with. If the write was
accepted, this is our own version. If it was rejected, read the
old_test_data to find out what version was retained.
* if old_test_data indicates the seqnum was equal or greater than our
own, mark the "Simultaneous Writes Detected" flag, which will eventually
result in an error being reported to the writer (in their close() call).
* build a histogram of "R" values
* repeat until the histogram indicates that some version (possibly ours)
has N shares. Use new servers if necessary.
* If we run out of servers:
* if there are at least shares-of-happiness of any one version, we're
happy, so return. (the close() might still get an error)
* not happy, need to reinforce something, goto RECOVERY
RECOVERY:
* read all shares, count the versions, identify the recoverable ones,
discard the unrecoverable ones.
* sort versions: locate max(seqnums), put all versions with that seqnum
in the list, sort by number of outstanding shares, then append our own
version (TODO: should versions with a seqnum below the max but above our
own go ahead of ours?). This ordering is sketched after this list.
* for each version:
* attempt to recover that version
* if not possible, remove it from the list, go to next one
* if recovered, start at beginning of peer list, push that version,
continue until N shares are placed
* if pushing our own version, bump up the seqnum to one higher than
the max seqnum we saw
* if we run out of servers:
* schedule retry and exponential backoff to repeat RECOVERY
* admit defeat after some period? presumably the client will be shut down
eventually, maybe keep trying (once per hour?) until then.
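The sort step in RECOVERY, sketched with a hypothetical mapping from
(seqnum, R) version identifiers to outstanding-share counts:

  def order_versions(share_counts, our_version):
      # Keep only the versions carrying the highest seqnum, try the ones
      # with the most outstanding shares first, and fall back to pushing
      # our own version if none of those can be recovered.
      max_seqnum = max(seqnum for (seqnum, _) in share_counts)
      candidates = [v for v in share_counts if v[0] == max_seqnum]
      candidates.sort(key=lambda v: share_counts[v], reverse=True)
      if our_version not in candidates:
          candidates.append(our_version)
      return candidates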
== Medium Distributed Mutable Files ==
These are just like the SDMF case, but:
* we use a Merkle hash tree over the blocks, instead of using a single flat
hash, to reduce the read-time alacrity (the amount of data that must be
fetched before any plaintext can be delivered)
* we allow arbitrary writes to the file (i.e. seek() is provided, and
O_TRUNC is no longer required)
* we write more code on the client side (in the MutableFileNode class), to
first read each segment that a write must modify. This looks exactly like
the way a normal filesystem uses a block device, or how a CPU must perform
a cache-line fill before modifying a single word.
* we might implement some sort of copy-based atomic update server call,
to allow multiple writev() calls to appear atomic to any readers.
MDMF slots provide fairly efficient in-place edits of very large files (a few
GB). Appending data is also fairly efficient, although each time a power of 2
boundary is crossed, the entire file must effectively be re-uploaded, so if
the filesize is known in advance, that space ought to be pre-allocated.
MDMF1 uses the Merkle tree to enable low-alacrity random-access reads. MDMF2
adds cache-line reads to allow random-access writes.
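The cache-line-fill analogy reduces to computing which segments a write
touches, as in this sketch:

  def affected_segments(offset, length, segment_size):
      # A write covering [offset, offset+length) must first read (and will
      # then re-encode) every segment it overlaps, just as a CPU fills a
      # cache line before modifying one word within it.
      first = offset // segment_size
      last = (offset + length - 1) // segment_size
      return list(range(first, last + 1))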
== Large Distributed Mutable Files ==
LDMF slots use a fundamentally different way to store the file, inspired by
Mercurial's "revlog" format. They enable very efficient insert/remove/replace
editing of arbitrary spans. Multiple versions of the file can be retained, in
a revision graph that can have multiple heads. Each revision can be
referenced by a cryptographic identifier. There are two forms of the URI, one
that means "most recent version", and a longer one that points to a specific
revision.
Metadata can be attached to the revisions, like timestamps, to enable rolling
back an entire tree to a specific point in history.
LDMF1 provides deltas but tries to avoid dealing with multiple heads. LDMF2
provides explicit support for revision identifiers and branching.
== TODO ==
fix gigantic RO-URI security bug, probably by adding a second secret. How
about:
* H(privkey+pubkey) -> writekey -> readkey -> storageindex
* RW-URI = writekey
* RO-URI = readkey + H(pubkey)
improve allocate-and-write or get-writer-buckets API to allow one-call (or
maybe two-call) updates. The challenge is in figuring out which shares are on
which machines.
(eventually) define behavior when seqnum wraps. At the very least make sure
it can't cause a security problem. "the slot is worn out" is acceptable.
(eventually) define share-migration WE-update protocol