tahoe-lafs/docs/specifications/servers-of-happiness.rst

8.8 KiB

Servers of Happiness

When you upload a file to a Tahoe-LAFS grid, you expect that it will stay there for a while, and that it will do so even if a few of the peers on the grid stop working, or if something else goes wrong. An upload health metric helps to make sure that this actually happens. An upload health metric is a test that looks at a file on a Tahoe-LAFS grid and says whether or not that file is healthy; that is, whether it is distributed on the grid in such a way as to ensure that it will probably survive in good enough shape to be recoverable, even if a few things go wrong between the time of the test and the time that it is recovered. Our current upload health metric for immutable files is called 'servers-of-happiness'; its predecessor was called 'shares-of-happiness'.

shares-of-happiness used the number of encoded shares generated by a file upload to say whether or not it was healthy. If there were more shares than a user-configurable threshold, the file was reported to be healthy; otherwise, it was reported to be unhealthy. In normal situations, the upload process would distribute shares fairly evenly over the peers in the grid, and in that case shares-of-happiness worked fine. However, because it only considered the number of shares, and not where they were on the grid, it could not detect situations where a file was unhealthy because most or all of the shares generated from the file were stored on one or two peers.

servers-of-happiness addresses this by extending the share-focused upload health metric to also consider the location of the shares on grid. servers-of-happiness looks at the mapping of peers to the shares that they hold, and compares the cardinality of the largest happy subset of those to a user-configurable threshold. A happy subset of peers has the property that any k (where k is as in k-of-n encoding) peers within the subset can reconstruct the source file. This definition of file health provides a stronger assurance of file availability over time; with 3-of-10 encoding, and happy=7, a healthy file is still guaranteed to be available even if 4 peers fail.

Measuring Servers of Happiness

We calculate servers-of-happiness by computing a matching on a bipartite graph that is related to the layout of shares on the grid. One set of vertices is the peers on the grid, and one set of vertices is the shares. An edge connects a peer and a share if the peer will (or does, for existing shares) hold the share. The size of the maximum matching on this graph is the size of the largest happy peer set that exists for the upload.

First, note that a bipartite matching of size n corresponds to a happy subset of size n. This is because a bipartite matching of size n implies that there are n peers such that each peer holds a share that no other peer holds. Then any k of those peers collectively hold k distinct shares, and can restore the file.

A bipartite matching of size n is not necessary for a happy subset of size n, however (so it is not correct to say that the size of the maximum matching on this graph is the size of the largest happy subset of peers that exists for the upload). For example, consider a file with k = 3, and suppose that each peer has all three of those pieces. Then, since any peer from the original upload can restore the file, if there are 10 peers holding shares, and the happiness threshold is 7, the upload should be declared happy, because there is a happy subset of size 10, and 10 > 7. However, since a maximum matching on the bipartite graph related to this layout has only 3 edges, Tahoe-LAFS declares the upload unhealthy. Though it is not unhealthy, a share layout like this example is inefficient; for k = 3, and if there are n peers, it corresponds to an expansion factor of 10x. Layouts that are declared healthy by the bipartite graph matching approach have the property that they correspond to uploads that are either already relatively efficient in their utilization of space, or can be made to be so by deleting shares; and that place all of the shares that they generate, enabling redistribution of shares later without having to re-encode the file. Also, it is computationally reasonable to compute a maximum matching in a bipartite graph, and there are well-studied algorithms to do that.

Issues

The uploader is good at detecting unhealthy upload layouts, but it doesn't always know how to make an unhealthy upload into a healthy upload if it is possible to do so; it attempts to redistribute shares to achieve happiness, but only in certain circumstances. The redistribution algorithm isn't optimal, either, so even in these cases it will not always find a happy layout if one can be arrived at through redistribution. We are investigating improvements to address these issues.

We don't use servers-of-happiness for mutable files yet; this fix will likely come in Tahoe-LAFS version 1.13.

Upload Strategy of Happiness

As mentioned above, the uploader is good at detecting instances which do not pass the servers-of-happiness test, but the share distribution algorithm is not always successful in instances where happiness can be achieved. A new placement algorithm designed to pass the servers-of-happiness test, titled 'Upload Strategy of Happiness', is meant to fix these instances where the uploader is unable to achieve happiness.

Calculating Share Placements

We calculate share placement like so:

  1. Start with an ordered list of servers. Maybe 2N of them.
  2. Query all servers for existing shares.
1a. Query remaining space from all servers. Every server that has

enough free space is considered "readwrite" and every server with too little space is "readonly".

2. Construct a bipartite graph G1 of readonly servers to pre-existing

shares, where an edge exists between an arbitrary readonly server S and an arbitrary share T if and only if S currently holds T.

3. Calculate a maximum matching graph of G1 (a set of S->T edges that has or

is-tied-for the highest "happiness score"). There is a clever efficient algorithm for this, named "Ford-Fulkerson". There may be more than one maximum matching for this graph; we choose one of them arbitrarily, but prefer earlier servers. Call this particular placement M1. The placement maps shares to servers, where each share appears at most once, and each server appears at most once.

4. Construct a bipartite graph G2 of readwrite servers to pre-existing

shares. Then remove any edge (from G2) that uses a server or a share found in M1. Let an edge exist between server S and share T if and only if S already holds T.

5. Calculate a maximum matching graph of G2, call this M2, again preferring

earlier servers.

6. Construct a bipartite graph G3 of (only readwrite) servers to

shares (some shares may already exist on a server). Then remove (from G3) any servers and shares used in M1 or M2 (note that we retain servers/shares that were in G1/G2 but not in the M1/M2 subsets)

7. Calculate a maximum matching graph of G3, call this M3, preferring earlier

servers. The final placement table is the union of M1+M2+M3.

  1. Renew the shares on their respective servers from M1 and M2.
  2. Upload share T to server S if an edge exists between S and T in M3.
  3. If any placements from step 9 fail, mark the server as read-only. Go back to step 2 (since we may discover a server is/has-become read-only, or has failed, during step 9).

Rationale (Step 4): when we see pre-existing shares on read-only servers, we prefer to rely upon those (rather than the ones on read-write servers), so we can maybe use the read-write servers for new shares. If we picked the read-write server's share, then we couldn't re-use that server for new ones (we only rely upon each server for one share, more or less).

Properties of Upload Strategy of Happiness

The size of the maximum bipartite matching is bounded by the size of the smaller set of vertices. Therefore in a situation where the set of servers is smaller than the set of shares, placement is not generated for a subset of shares. In this case the remaining shares are distributed as evenly as possible across the set of writable servers.

If the servers-of-happiness criteria can be met, the upload strategy of happiness guarantees that H shares will be placed on the network. During file repair, if the set of servers is larger than N, the algorithm will only attempt to spread shares over N distinct servers. For both initial file upload and file repair, N should be viewed as the maximum number of distinct servers shares can be placed on, and H as the minimum amount. The uploader will fail if the number of distinct servers is less than H, and it will never attempt to exceed N.