mirror of
https://github.com/tahoe-lafs/tahoe-lafs.git
synced 2024-12-23 14:52:26 +00:00
179 lines
8.8 KiB
ReStructuredText
179 lines
8.8 KiB
ReStructuredText
.. -*- coding: utf-8-with-signature -*-
|
|
|
|
====================
|
|
Servers of Happiness
|
|
====================
|
|
|
|
When you upload a file to a Tahoe-LAFS grid, you expect that it will
|
|
stay there for a while, and that it will do so even if a few of the
|
|
peers on the grid stop working, or if something else goes wrong. An
|
|
upload health metric helps to make sure that this actually happens.
|
|
An upload health metric is a test that looks at a file on a Tahoe-LAFS
|
|
grid and says whether or not that file is healthy; that is, whether it
|
|
is distributed on the grid in such a way as to ensure that it will
|
|
probably survive in good enough shape to be recoverable, even if a few
|
|
things go wrong between the time of the test and the time that it is
|
|
recovered. Our current upload health metric for immutable files is called
|
|
'servers-of-happiness'; its predecessor was called 'shares-of-happiness'.
|
|
|
|
shares-of-happiness used the number of encoded shares generated by a
|
|
file upload to say whether or not it was healthy. If there were more
|
|
shares than a user-configurable threshold, the file was reported to be
|
|
healthy; otherwise, it was reported to be unhealthy. In normal
|
|
situations, the upload process would distribute shares fairly evenly
|
|
over the peers in the grid, and in that case shares-of-happiness
|
|
worked fine. However, because it only considered the number of shares,
|
|
and not where they were on the grid, it could not detect situations
|
|
where a file was unhealthy because most or all of the shares generated
|
|
from the file were stored on one or two peers.
|
|
|
|
servers-of-happiness addresses this by extending the share-focused
|
|
upload health metric to also consider the location of the shares on
|
|
grid. servers-of-happiness looks at the mapping of peers to the shares
|
|
that they hold, and compares the cardinality of the largest happy subset
|
|
of those to a user-configurable threshold. A happy subset of peers has
|
|
the property that any k (where k is as in k-of-n encoding) peers within
|
|
the subset can reconstruct the source file. This definition of file
|
|
health provides a stronger assurance of file availability over time;
|
|
with 3-of-10 encoding, and happy=7, a healthy file is still guaranteed
|
|
to be available even if 4 peers fail.
|
|
|
|
Measuring Servers of Happiness
|
|
==============================
|
|
|
|
We calculate servers-of-happiness by computing a matching on a
|
|
bipartite graph that is related to the layout of shares on the grid.
|
|
One set of vertices is the peers on the grid, and one set of vertices is
|
|
the shares. An edge connects a peer and a share if the peer will (or
|
|
does, for existing shares) hold the share. The size of the maximum
|
|
matching on this graph is the size of the largest happy peer set that
|
|
exists for the upload.
|
|
|
|
First, note that a bipartite matching of size n corresponds to a happy
|
|
subset of size n. This is because a bipartite matching of size n implies
|
|
that there are n peers such that each peer holds a share that no other
|
|
peer holds. Then any k of those peers collectively hold k distinct
|
|
shares, and can restore the file.
|
|
|
|
A bipartite matching of size n is not necessary for a happy subset of
|
|
size n, however (so it is not correct to say that the size of the
|
|
maximum matching on this graph is the size of the largest happy subset
|
|
of peers that exists for the upload). For example, consider a file with
|
|
k = 3, and suppose that each peer has all three of those pieces. Then,
|
|
since any peer from the original upload can restore the file, if there
|
|
are 10 peers holding shares, and the happiness threshold is 7, the
|
|
upload should be declared happy, because there is a happy subset of size
|
|
10, and 10 > 7. However, since a maximum matching on the bipartite graph
|
|
related to this layout has only 3 edges, Tahoe-LAFS declares the upload
|
|
unhealthy. Though it is not unhealthy, a share layout like this example
|
|
is inefficient; for k = 3, and if there are n peers, it corresponds to
|
|
an expansion factor of 10x. Layouts that are declared healthy by the
|
|
bipartite graph matching approach have the property that they correspond
|
|
to uploads that are either already relatively efficient in their
|
|
utilization of space, or can be made to be so by deleting shares; and
|
|
that place all of the shares that they generate, enabling redistribution
|
|
of shares later without having to re-encode the file. Also, it is
|
|
computationally reasonable to compute a maximum matching in a bipartite
|
|
graph, and there are well-studied algorithms to do that.
|
|
|
|
Issues
|
|
======
|
|
|
|
The uploader is good at detecting unhealthy upload layouts, but it
|
|
doesn't always know how to make an unhealthy upload into a healthy
|
|
upload if it is possible to do so; it attempts to redistribute shares to
|
|
achieve happiness, but only in certain circumstances. The redistribution
|
|
algorithm isn't optimal, either, so even in these cases it will not
|
|
always find a happy layout if one can be arrived at through
|
|
redistribution. We are investigating improvements to address these
|
|
issues.
|
|
|
|
We don't use servers-of-happiness for mutable files yet; this fix will
|
|
likely come in Tahoe-LAFS version 1.13.
|
|
|
|
|
|
============================
|
|
Upload Strategy of Happiness
|
|
============================
|
|
|
|
As mentioned above, the uploader is good at detecting instances which
|
|
do not pass the servers-of-happiness test, but the share distribution algorithm
|
|
is not always successful in instances where happiness can be achieved. A new
|
|
placement algorithm designed to pass the servers-of-happiness test, titled
|
|
'Upload Strategy of Happiness', is meant to fix these instances where the uploader
|
|
is unable to achieve happiness.
|
|
|
|
Calculating Share Placements
|
|
============================
|
|
|
|
We calculate share placement like so:
|
|
|
|
0. Start with an ordered list of servers. Maybe *2N* of them.
|
|
|
|
1. Query all servers for existing shares.
|
|
|
|
1a. Query remaining space from all servers. Every server that has
|
|
enough free space is considered "readwrite" and every server with too
|
|
little space is "readonly".
|
|
|
|
2. Construct a bipartite graph G1 of *readonly* servers to pre-existing
|
|
shares, where an edge exists between an arbitrary readonly server S and an
|
|
arbitrary share T if and only if S currently holds T.
|
|
|
|
3. Calculate a maximum matching graph of G1 (a set of S->T edges that has or
|
|
is-tied-for the highest "happiness score"). There is a clever efficient
|
|
algorithm for this, named "Ford-Fulkerson". There may be more than one
|
|
maximum matching for this graph; we choose one of them arbitrarily, but
|
|
prefer earlier servers. Call this particular placement M1. The placement
|
|
maps shares to servers, where each share appears at most once, and each
|
|
server appears at most once.
|
|
|
|
4. Construct a bipartite graph G2 of readwrite servers to pre-existing
|
|
shares. Then remove any edge (from G2) that uses a server or a share found
|
|
in M1. Let an edge exist between server S and share T if and only if S
|
|
already holds T.
|
|
|
|
5. Calculate a maximum matching graph of G2, call this M2, again preferring
|
|
earlier servers.
|
|
|
|
6. Construct a bipartite graph G3 of (only readwrite) servers to
|
|
shares (some shares may already exist on a server). Then remove
|
|
(from G3) any servers and shares used in M1 or M2 (note that we
|
|
retain servers/shares that were in G1/G2 but *not* in the M1/M2
|
|
subsets)
|
|
|
|
7. Calculate a maximum matching graph of G3, call this M3, preferring earlier
|
|
servers. The final placement table is the union of M1+M2+M3.
|
|
|
|
8. Renew the shares on their respective servers from M1 and M2.
|
|
|
|
9. Upload share T to server S if an edge exists between S and T in M3.
|
|
|
|
10. If any placements from step 9 fail, mark the server as read-only. Go back
|
|
to step 2 (since we may discover a server is/has-become read-only, or has
|
|
failed, during step 9).
|
|
|
|
Rationale (Step 4): when we see pre-existing shares on read-only servers, we
|
|
prefer to rely upon those (rather than the ones on read-write servers), so we
|
|
can maybe use the read-write servers for new shares. If we picked the
|
|
read-write server's share, then we couldn't re-use that server for new ones
|
|
(we only rely upon each server for one share, more or less).
|
|
|
|
Properties of Upload Strategy of Happiness
|
|
==========================================
|
|
|
|
The size of the maximum bipartite matching is bounded by the size of the smaller
|
|
set of vertices. Therefore in a situation where the set of servers is smaller
|
|
than the set of shares, placement is not generated for a subset of shares. In
|
|
this case the remaining shares are distributed as evenly as possible across the
|
|
set of writable servers.
|
|
|
|
If the servers-of-happiness criteria can be met, the upload strategy of
|
|
happiness guarantees that H shares will be placed on the network. During file
|
|
repair, if the set of servers is larger than N, the algorithm will only attempt
|
|
to spread shares over N distinct servers. For both initial file upload and file
|
|
repair, N should be viewed as the maximum number of distinct servers shares
|
|
can be placed on, and H as the minimum amount. The uploader will fail if
|
|
the number of distinct servers is less than H, and it will never attempt to
|
|
exceed N.
|