The "Denver Airport" Protocol
|
|
|
|
(discussed whilst returning robk to DEN, 12/1/06)
|
|
|
|
This is a scaling improvement on the "Select Peers" phase of Tahoe2. The
|
|
problem it tries to address is the storage and maintenance of the 1M-long
|
|
peer list, and the relative difficulty of gathering long-term reliability
|
|
information on a useful numbers of those peers.
|
|
|
|
In DEN, each node maintains a Chord-style set of connections to other nodes:
log2(N) "finger" connections to distant peers (the first of which is halfway
across the ring, the second is 1/4 across, then 1/8th, etc). These
connections need to be kept alive with relatively short timeouts (5s?), so
any breaks can be rejoined quickly. In addition to the finger connections,
each node must also remain aware of K "successor" nodes (those which are
immediately clockwise of the starting point). The node is not required to
maintain connections to these, but it should remain informed about their
contact information, so that it can create connections when necessary. We
probably need a connection open to the immediate successor at all times.

Since inbound connections exist too, each node has something like 2*log2(N)
plus up to 2*K connections.

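As a rough sketch of the connection structure described above (RING_BITS,
the 2^160 keyspace, and the function names are assumptions for illustration,
not part of this note):

  # Rough sketch only: RING_BITS, known_nodes, and K are illustrative.
  RING_BITS = 160                    # assume nodeids live in a 2^160 ring
  RING_SIZE = 2 ** RING_BITS

  def finger_targets(my_nodeid):
      # first finger is halfway across the ring, then 1/4, 1/8, ...
      return [(my_nodeid + RING_SIZE // 2 ** (i + 1)) % RING_SIZE
              for i in range(RING_BITS)]

  def successor_set(my_nodeid, known_nodes, K):
      # the K nodes immediately clockwise of us on the ring
      others = sorted((n for n in known_nodes if n != my_nodeid),
                      key=lambda n: (n - my_nodeid) % RING_SIZE)
      return others[:K]
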
Each node keeps a history of uptime/availability of the nodes that it remains
connected to. Each message that is sent to these peers includes an estimate
of that peer's availability from the point of view of the outside world. The
receiving node will average these reports together to determine what kind of
reliability it should announce to anyone it accepts leases for. This
reliability is expressed as a percentage uptime: P=1.0 means the peer is
available 24/7, P=0.0 means it is almost never reachable.

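A minimal sketch of that averaging (the class and method names are invented;
the note above does not specify a data structure):

  # Illustrative only: average the availability estimates received from peers.
  class ReliabilityEstimate:
      def __init__(self):
          self.reports = []            # availability estimates from our peers

      def add_report(self, p):         # p is in [0.0, 1.0]
          self.reports.append(p)

      def announced_reliability(self):
          # the uptime percentage we advertise when accepting leases
          if not self.reports:
              return 0.0
          return sum(self.reports) / len(self.reports)
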
When a node wishes to publish a file, it creates a list of (verifierid,
sharenum) tuples, and computes a hash of each tuple. These hashes then
represent starting points for the landlord search:

  starting_points = [(sharenum, sha(verifierid + str(sharenum)))
                     for sharenum in range(256)]

The node then constructs a reservation message that contains enough
information for the potential landlord to evaluate the lease, *and* to make a
connection back to the starting node:

  message = [verifierid, sharesize, requestor_pburl, starting_points]

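Fleshed out into something runnable, assuming sha() is SHA-1 over the
concatenated bytes and verifierid is a byte string (neither is pinned down
above):

  # Runnable version of the snippets above; SHA-1 and the byte encoding of
  # sharenum are assumptions, not part of the design.
  from hashlib import sha1

  def make_reservation(verifierid, sharesize, requestor_pburl, num_shares=256):
      starting_points = [
          (sharenum, sha1(verifierid + str(sharenum).encode()).digest())
          for sharenum in range(num_shares)]
      message = [verifierid, sharesize, requestor_pburl, starting_points]
      return message
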
The node looks through its list of finger connections and splits this message
into up to log2(N) smaller messages, each of which contains only the starting
points that should be sent to that finger connection. Specifically, we send a
starting_point to a finger A if the nodeid of that finger is <= the
starting_point and if the next finger B is > the starting_point. Each message
sent out can contain multiple starting_points, each for a different share.

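A sketch of that splitting rule (finger_ids is assumed to be the sorted
nodeids of our fingers; wrap-around at the top of the ring is glossed over):

  def split_by_finger(finger_ids, starting_points):
      # finger_ids: nodeids of our fingers, in increasing order
      buckets = {}                     # finger nodeid -> points to forward
      for sharenum, point in starting_points:
          for node_a, node_b in zip(finger_ids, finger_ids[1:]):
              # send to finger A when A <= starting_point < next finger B
              if node_a <= point < node_b:
                  buckets.setdefault(node_a, []).append((sharenum, point))
                  break
      return buckets
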
When a finger node receives this message, it performs the same splitting
algorithm, sending each starting_point to other fingers. Eventually a
starting_point is received by a node that knows that the starting_point lies
between itself and its immediate successor. At this point the message
switches from the "hop" mode (following fingers) to the "search" mode
(following successors).

While in "search" mode, each node interprets the message as a lease request.
It checks its storage pool to see if it can accommodate the reservation. If
so, it uses requestor_pburl to contact the originator and announces its
willingness to host the given sharenum. This message will include the
reliability measurement derived from the host's counterclockwise neighbors.

If the recipient cannot host the share, it forwards the request on to the
next successor, which repeats the cycle. Each message has a maximum hop count
which limits the number of peers that may be searched before giving up. If a
node sees that it is the last such hop, it must establish a connection to the
originator and let it know that this sharenum could not be hosted.

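In sketch form, a search-mode node might handle a request like this
(storage_pool, contact(), forward_to_successor(), and the method names are
invented for illustration):

  # Rough sketch of search-mode handling; all names here are assumptions.
  def handle_lease_request(node, verifierid, sharesize, pburl,
                           sharenum, hops_left):
      if node.storage_pool.can_accommodate(sharesize):
          # accept: contact the originator directly, including our reliability
          node.contact(pburl).accept_share(sharenum,
                                           node.announced_reliability())
      elif hops_left <= 0:
          # last allowed hop: tell the originator this sharenum found no home
          node.contact(pburl).reject_share(sharenum)
      else:
          forward_to_successor(node, verifierid, sharesize, pburl,
                               sharenum, hops_left - 1)
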
The originator sends out something like 100 or 200 starting points, and
expects to get back responses (positive or negative) in a reasonable amount
of time (perhaps if we receive half of the responses within time T, wait for
a total of 2T for the remaining ones). If no response is received within the
timeout, either re-send the requests for those shares (to different fingers)
or send requests for completely different shares.

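One way the "half in T, wait 2T total" idea could look (purely illustrative):

  # Purely illustrative: if half of the requests have answered within T
  # seconds of sending, give the stragglers until 2*T before re-sending or
  # trying different shares.
  def overall_deadline(send_time, response_times, num_requests,
                       default_timeout):
      half = num_requests // 2
      answered = sorted(t - send_time for t in response_times)
      if half > 0 and len(answered) >= half:
          T = answered[half - 1]       # time at which half had responded
          return send_time + 2 * T
      return send_time + default_timeout
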
Each share represents some fraction of a point "S", such that the points for
enough shares to reconstruct the whole file total to 1.0 points. I.e., if we
construct 100 shares such that we need 25 of them to reconstruct the file,
then each share represents .04 points.

As the positive responses come in, we accumulate two counters: the capacity
counter (which gets a full S points for each positive response), and the
reliability counter (which gets S*(reliability-of-host) points). The capacity
counter is not allowed to go above some limit (like 4x), as determined by
provisioning. The node keeps adding leases until the reliability counter has
gone above some other threshold (larger but close to 1.0).

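As a quick illustration of that bookkeeping (the 4x capacity limit and a
reliability target a bit above 1.0 come from the text; the exact numbers
below are placeholders):

  # Illustrative only; capacity_limit=4.0 ("4x") and reliability_target=1.2
  # are placeholder values.
  def accept_leases(responses, shares_needed,
                    capacity_limit=4.0, reliability_target=1.2):
      S = 1.0 / shares_needed          # e.g. 1/25 = 0.04 points per share
      capacity = reliability = 0.0
      accepted = []
      for host, host_reliability in responses:   # positive responses, in order
          if capacity + S > capacity_limit:
              break                    # provisioning cap on raw capacity
          capacity += S
          reliability += S * host_reliability
          accepted.append(host)
          if reliability >= reliability_target:
              break                    # enough reliability points collected
      return accepted
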
[ at download time, each host will be able to provide the share back with
probability P times an exponential decay factor related to peer death. Sum
these probabilities to get the average number of shares that will be
available. The interesting thing is actually the distribution of these
probabilities, and what threshold you have to pick to get a sufficiently
high chance of recovering the file. If there are N identical peers with
probability P, the number of recovered shares will have a gaussian
distribution with an average of N*P and a stddev of (??). The complementary
CDF of this distribution is an S-curve, with a sharper slope when N is
large. The probability of recovering the file is the value of this S-curve
at the threshold value (the number of necessary shares).

P is not actually constant across all peers; rather, we assume that it has
its own distribution: maybe gaussian, more likely exponential (power law).
This changes the shape of the S-curve. Assuming that we can characterize
the distribution of P with perhaps two parameters (say meanP and stddevP),
the S-curve is a function of meanP, stddevP, N, and threshold...

To get 99.99% or 99.999% recoverability, we must choose a threshold value
high enough to accommodate the random variations and uncertainty about the
real values of P for each of the hosts we've selected. By counting
reliability points, we are trying to estimate meanP/stddevP, so we know
which S-curve to look at. The threshold is fixed at 1.0, since that's what
erasure coding tells us we need to recover the file. The job is then to add
hosts (increasing N and possibly changing meanP/stddevP) until our
recoverability probability is as high as we want.
]

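For the simplest case in the note above (N identical hosts, each available
with probability P), the share count is binomial (of which the gaussian is
an approximation), and the recovery probability is just the binomial tail at
the threshold. A small illustration with made-up parameter values:

  # Constant-P illustration: the number of shares that come back is
  # binomial(N, P); the file is recovered when at least 'threshold' shares
  # (the erasure-coding minimum) are available.
  from math import comb

  def recovery_probability(N, P, threshold):
      return sum(comb(N, k) * P ** k * (1 - P) ** (N - k)
                 for k in range(threshold, N + 1))

  # e.g. recovery_probability(100, 0.9, 25) is essentially 1.0, while
  # recovery_probability(30, 0.9, 25) is noticeably lower
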
The originator takes all acceptance messages and adds them in order to the
list of landlords that will be used to host the file. It stops when it gets
enough reliability points. Note that it does *not* discriminate against
unreliable hosts: they are less likely to have been found in the first place,
so we don't need to discriminate against them a second time. We do, however,
use the reliability points to acknowledge that sending data to an unreliable
peer is not as useful as sending it to a reliable one (there is still value
in doing so, though). The remaining reservation-acceptance messages are
cancelled and then put aside: if we need to make a second pass, we ask those
peers first.

Shares are then created and published as in Tahoe2. If we lose a connection
during the encoding, that share is lost. If we lose enough shares, we might
want to generate more to make up for them: this is done by using the leftover
acceptance messages first, then triggering a new Chord search for the
as-yet-unaccepted sharenums. These new peers will get shares from all
segments that have not yet been finished, then a second pass will be made to
catch them up on the earlier segments.

Properties of this approach:

The total number of peers that each node must know anything about is bounded
to something like 2*log2(N) + K, probably on the order of 50 to 100 total.
This is the biggest advantage, since in tahoe2 each node must know at least
the nodeid of all 1M peers. The maintenance traffic should be much less as a
result.

Each node must maintain open (keep-alived) connections to something like
2*log2(N) peers. In tahoe2, this number is 0 (well, probably 1 for the
introducer).

During upload, each node must actively use 100 connections to a random set
of peers to push data (just like tahoe2).

The probability that any given share-request goes unanswered is roughly the
number of hops it travels through times the chance that a peer dies while
holding on to the message. This should be pretty small, as the message
should only be held by a peer for a few seconds (more if their network is
busy). In tahoe2, each share-request always gets a response, since they are
made directly to the target.

I visualize the peer-lookup process as the originator creating a
message-in-a-bottle for each share. Each message says "Dear Sir/Madam, I
would like to store X bytes of data for file Y (share #Z) on a system close
to (but not below) nodeid STARTING_POINT. If you find this amenable, please
contact me at PBURL so we can make arrangements." These messages are then
bundled together according to their rough destination (STARTING_POINT) and
sent somewhere in the right direction.

Download happens the same way: lookup messages are disseminated towards the
STARTING_POINT and then proceed one successor at a time from there. There are
two ways that the share might go missing: if the node is now offline (or has
for some reason lost its shares), or if new nodes have joined since the
original upload and the search depth (maximum hop count) is too small to
accommodate the churn. Both result in the same amount of localized traffic. In
the latter case, a storage node might want to migrate the share closer to the
starting point, or perhaps just send the nodes closer to the starting point a
note to remember a pointer to the share.

Checking: anyone who wishes to do a filecheck needs to send out a lookup
message for every potential share. These lookup messages could have a higher
search depth than usual. It would be useful to know how many peers each
message went through before being returned: this could guide repair, by
instructing the old host (which is further from the starting point than
you'd like) to push its share closer towards the starting point.