The "Denver Airport" Protocol
|
|
|
|
(discussed whilst returning robk to DEN, 12/1/06)
|
|
|
|
This is a scaling improvement on the "Select Peers" phase of Tahoe2. The
|
|
problem it tries to address is the storage and maintenance of the 1M-long
|
|
peer list, and the relative difficulty of gathering long-term reliability
|
|
information on a useful numbers of those peers.
|
|
|
|
In DEN, each node maintains a Chord-style set of connections to other nodes:
log2(N) "finger" connections to distant peers (the first of which is halfway
across the ring, the second is 1/4 across, then 1/8th, etc). These
connections need to be kept alive with relatively short timeouts (5s?), so
any breaks can be rejoined quickly. In addition to the finger connections,
each node must also remain aware of K "successor" nodes (those which are
immediately clockwise of the starting point). The node is not required to
maintain connections to these, but it should remain informed about their
contact information, so that it can create connections when necessary. We
probably need a connection open to the immediate successor at all times.

Since inbound connections exist too, each node has something like 2*log2(N)
plus up to 2*K connections.

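As a rough sketch of the connection structure described above (RING_BITS,
the 2^160 keyspace, and the function names are assumptions for illustration,
not part of this note):

  # Rough sketch only: RING_BITS, known_nodes, and K are illustrative.
  RING_BITS = 160                    # assume nodeids live in a 2^160 ring
  RING_SIZE = 2 ** RING_BITS

  def finger_targets(my_nodeid):
      # first finger is halfway across the ring, then 1/4, 1/8, ...
      return [(my_nodeid + RING_SIZE // 2 ** (i + 1)) % RING_SIZE
              for i in range(RING_BITS)]

  def successor_set(my_nodeid, known_nodes, K):
      # the K nodes immediately clockwise of us on the ring
      others = sorted((n for n in known_nodes if n != my_nodeid),
                      key=lambda n: (n - my_nodeid) % RING_SIZE)
      return others[:K]
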
Each node keeps a history of uptime/availability of the nodes that it remains
connected to. Each message that is sent to these peers includes an estimate
of that peer's availability from the point of view of the outside world. The
receiving node will average these reports together to determine what kind of
reliability it should announce to anyone it accepts leases for. This
reliability is expressed as a percentage uptime: P=1.0 means the peer is
available 24/7, P=0.0 means it is almost never reachable.

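A minimal sketch of that averaging (the class and method names are invented;
the note above does not specify a data structure):

  # Illustrative only: average the availability estimates received from peers.
  class ReliabilityEstimate:
      def __init__(self):
          self.reports = []            # availability estimates from our peers

      def add_report(self, p):         # p is in [0.0, 1.0]
          self.reports.append(p)

      def announced_reliability(self):
          # the uptime percentage we advertise when accepting leases
          if not self.reports:
              return 0.0
          return sum(self.reports) / len(self.reports)
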
When a node wishes to publish a file, it creates a list of (verifierid,
sharenum) tuples, and computes a hash of each tuple. These hashes then
represent starting points for the landlord search:

  starting_points = [(sharenum, sha(verifierid + str(sharenum)))
                     for sharenum in range(256)]

The node then constructs a reservation message that contains enough
information for the potential landlord to evaluate the lease, *and* to make a
connection back to the starting node:

  message = [verifierid, sharesize, requestor_pburl, starting_points]

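Fleshed out into something runnable, assuming sha() is SHA-1 over the
concatenated bytes and verifierid is a byte string (neither is pinned down
above):

  # Runnable version of the snippets above; SHA-1 and the byte encoding of
  # sharenum are assumptions, not part of the design.
  from hashlib import sha1

  def make_reservation(verifierid, sharesize, requestor_pburl, num_shares=256):
      starting_points = [
          (sharenum, sha1(verifierid + str(sharenum).encode()).digest())
          for sharenum in range(num_shares)]
      message = [verifierid, sharesize, requestor_pburl, starting_points]
      return message
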
The node looks through its list of finger connections and splits this message
into up to log2(N) smaller messages, each of which contains only the starting
points that should be sent to that finger connection. Specifically, we send a
starting_point to a finger A if the nodeid of that finger is <= the
starting_point and if the next finger B is > the starting_point. Each message
sent out can contain multiple starting_points, each for a different share.

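A sketch of that splitting rule (finger_ids is assumed to be the sorted
nodeids of our fingers; wrap-around at the top of the ring is glossed over):

  def split_by_finger(finger_ids, starting_points):
      # finger_ids: nodeids of our fingers, in increasing order
      buckets = {}                     # finger nodeid -> points to forward
      for sharenum, point in starting_points:
          for node_a, node_b in zip(finger_ids, finger_ids[1:]):
              # send to finger A when A <= starting_point < next finger B
              if node_a <= point < node_b:
                  buckets.setdefault(node_a, []).append((sharenum, point))
                  break
      return buckets
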
When a finger node receives this message, it performs the same splitting
algorithm, sending each starting_point to other fingers. Eventually a
starting_point is received by a node that knows that the starting_point lies
between itself and its immediate successor. At this point the message
switches from the "hop" mode (following fingers) to the "search" mode
(following successors).

While in "search" mode, each node interprets the message as a lease request.
It checks its storage pool to see if it can accommodate the reservation. If
so, it uses requestor_pburl to contact the originator and announces its
willingness to host the given sharenum. This message will include the
reliability measurement derived from the host's counterclockwise neighbors.

If the recipient cannot host the share, it forwards the request on to the
next successor, which repeats the cycle. Each message has a maximum hop count
which limits the number of peers that may be searched before giving up. If a
node sees that it is the last such hop, it must establish a connection to the
originator and let it know that this sharenum could not be hosted.

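In sketch form, a search-mode node might handle a request like this
(storage_pool, contact(), forward_to_successor(), and the method names are
invented for illustration):

  # Rough sketch of search-mode handling; all names here are assumptions.
  def handle_lease_request(node, verifierid, sharesize, pburl,
                           sharenum, hops_left):
      if node.storage_pool.can_accommodate(sharesize):
          # accept: contact the originator directly, including our reliability
          node.contact(pburl).accept_share(sharenum,
                                           node.announced_reliability())
      elif hops_left <= 0:
          # last allowed hop: tell the originator this sharenum found no home
          node.contact(pburl).reject_share(sharenum)
      else:
          forward_to_successor(node, verifierid, sharesize, pburl,
                               sharenum, hops_left - 1)
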
The originator sends out something like 100 or 200 starting points, and
expects to get back responses (positive or negative) in a reasonable amount
of time (perhaps if we receive half of the responses within time T, wait for
a total of 2T for the remaining ones). If no response is received within the
timeout, either re-send the requests for those shares (to different fingers)
or send requests for completely different shares.

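One way the "half in T, wait 2T total" idea could look (purely illustrative):

  # Purely illustrative: if half of the requests have answered within T
  # seconds of sending, give the stragglers until 2*T before re-sending or
  # trying different shares.
  def overall_deadline(send_time, response_times, num_requests,
                       default_timeout):
      half = num_requests // 2
      answered = sorted(t - send_time for t in response_times)
      if half > 0 and len(answered) >= half:
          T = answered[half - 1]       # time at which half had responded
          return send_time + 2 * T
      return send_time + default_timeout
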
Each share represents some fraction of a point "S", such that the points for
enough shares to reconstruct the whole file total to 1.0 points. I.e., if we
construct 100 shares such that we need 25 of them to reconstruct the file,
then each share represents .04 points.

As the positive responses come in, we accumulate two counters: the capacity
counter (which gets a full S points for each positive response), and the
reliability counter (which gets S*(reliability-of-host) points). The capacity
counter is not allowed to go above some limit (like 4x), as determined by
provisioning. The node keeps adding leases until the reliability counter has
gone above some other threshold (larger but close to 1.0).

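As a quick illustration of that bookkeeping (the 4x capacity limit and a
reliability target a bit above 1.0 come from the text; the exact numbers
below are placeholders):

  # Illustrative only; capacity_limit=4.0 ("4x") and reliability_target=1.2
  # are placeholder values.
  def accept_leases(responses, shares_needed,
                    capacity_limit=4.0, reliability_target=1.2):
      S = 1.0 / shares_needed          # e.g. 1/25 = 0.04 points per share
      capacity = reliability = 0.0
      accepted = []
      for host, host_reliability in responses:   # positive responses, in order
          if capacity + S > capacity_limit:
              break                    # provisioning cap on raw capacity
          capacity += S
          reliability += S * host_reliability
          accepted.append(host)
          if reliability >= reliability_target:
              break                    # enough reliability points collected
      return accepted
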
[ at download time, each host will be able to provide the share back with
probability P times an exponential decay factor related to peer death. Sum
these probabilities to get the average number of shares that will be
available. The interesting thing is actually the distribution of these
probabilities, and what threshold you have to pick to get a sufficiently
high chance of recovering the file. If there are N identical peers with
probability P, the number of recovered shares will have a gaussian
distribution with an average of N*P and a stddev of (??). The complementary
CDF of this distribution is an S-curve, with a sharper slope when N is
large. The probability of recovering the file is the value of this S-curve
at the threshold value (the number of necessary shares).

P is not actually constant across all peers; rather, we assume that it has
its own distribution: maybe gaussian, more likely exponential (power law).
This changes the shape of the S-curve. Assuming that we can characterize
the distribution of P with perhaps two parameters (say meanP and stddevP),
the S-curve is a function of meanP, stddevP, N, and threshold...

To get 99.99% or 99.999% recoverability, we must choose a threshold value
high enough to accommodate the random variations and uncertainty about the
real values of P for each of the hosts we've selected. By counting
reliability points, we are trying to estimate meanP/stddevP, so we know
which S-curve to look at. The threshold is fixed at 1.0, since that's what
erasure coding tells us we need to recover the file. The job is then to add
hosts (increasing N and possibly changing meanP/stddevP) until our
recoverability probability is as high as we want.
]

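For the simplest case in the note above (N identical hosts, each available
with probability P), the share count is binomial (of which the gaussian is
an approximation), and the recovery probability is just the binomial tail at
the threshold. A small illustration with made-up parameter values:

  # Constant-P illustration: the number of shares that come back is
  # binomial(N, P); the file is recovered when at least 'threshold' shares
  # (the erasure-coding minimum) are available.
  from math import comb

  def recovery_probability(N, P, threshold):
      return sum(comb(N, k) * P ** k * (1 - P) ** (N - k)
                 for k in range(threshold, N + 1))

  # e.g. recovery_probability(100, 0.9, 25) is essentially 1.0, while
  # recovery_probability(30, 0.9, 25) is noticeably lower
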
The originator takes all acceptance messages and adds them in order to the
list of landlords that will be used to host the file. It stops when it gets
enough reliability points. Note that it does *not* discriminate against
unreliable hosts: they are less likely to have been found in the first place,
so we don't need to discriminate against them a second time. We do, however,
use the reliability points to acknowledge that sending data to an unreliable
peer is not as useful as sending it to a reliable one (there is still value
in doing so, though). The remaining reservation-acceptance messages are
cancelled and then put aside: if we need to make a second pass, we ask those
peers first.

Shares are then created and published as in Tahoe2. If we lose a connection
during the encoding, that share is lost. If we lose enough shares, we might
want to generate more to make up for them: this is done by using the leftover
acceptance messages first, then triggering a new Chord search for the
as-yet-unaccepted sharenums. These new peers will get shares from all
segments that have not yet been finished, then a second pass will be made to
catch them up on the earlier segments.

Properties of this approach:

The total number of peers that each node must know anything about is bounded
to something like 2*log2(N) + K, probably on the order of 50 to 100 total.
This is the biggest advantage, since in tahoe2 each node must know at least
the nodeid of all 1M peers. The maintenance traffic should be much less as a
result.

Each node must maintain open (keep-alived) connections to something like
2*log2(N) peers. In tahoe2, this number is 0 (well, probably 1 for the
introducer).

During upload, each node must actively use 100 connections to a random set
of peers to push data (just like tahoe2).

The probability that any given share-request goes unanswered is roughly the
number of hops it travels through times the chance that a peer dies while
holding on to the message. This should be pretty small, as the message
should only be held by a peer for a few seconds (more if their network is
busy). In tahoe2, each share-request always gets a response, since they are
made directly to the target.

I visualize the peer-lookup process as the originator creating a
message-in-a-bottle for each share. Each message says "Dear Sir/Madam, I
would like to store X bytes of data for file Y (share #Z) on a system close
to (but not below) nodeid STARTING_POINT. If you find this amenable, please
contact me at PBURL so we can make arrangements." These messages are then
bundled together according to their rough destination (STARTING_POINT) and
sent somewhere in the right direction.

Download happens the same way: lookup messages are disseminated towards the
STARTING_POINT and then proceed one successor at a time from there. There are
two ways that the share might go missing: if the node is now offline (or has
for some reason lost its shares), or if new nodes have joined since the
original upload and the search depth (maximum hop count) is too small to
accommodate the churn. Both result in the same amount of localized traffic. In
the latter case, a storage node might want to migrate the share closer to the
starting point, or perhaps just send the nodes closer to the starting point a
note to remember a pointer to the share.

Checking: anyone who wishes to do a filecheck needs to send out a lookup
message for every potential share. These lookup messages could have a higher
search depth than usual. It would be useful to know how many peers each
message went through before being returned: this could guide repair, by
instructing the old host (which is further from the starting point than
you'd like) to push its share closer towards the starting point.