#LyX 1.6.1 created this file. For more info see http://www.lyx.org/ \lyxformat 345 \begin_document \begin_header \textclass amsart \use_default_options true \begin_modules theorems-ams theorems-ams-extended \end_modules \language english \inputencoding auto \font_roman default \font_sans default \font_typewriter default \font_default_family default \font_sc false \font_osf false \font_sf_scale 100 \font_tt_scale 100 \graphics default \float_placement h \paperfontsize default \spacing single \use_hyperref false \papersize default \use_geometry false \use_amsmath 1 \use_esint 1 \cite_engine basic \use_bibtopic false \paperorientation portrait \secnumdepth 3 \tocdepth 3 \paragraph_separation indent \defskip medskip \quotes_language english \papercolumns 1 \papersides 1 \paperpagestyle default \tracking_changes false \output_changes false \author "" \author "" \end_header \begin_body \begin_layout Title Tahoe Distributed Filesharing System Loss Model \end_layout \begin_layout Author Shawn Willden \end_layout \begin_layout Date 01/14/2009 \end_layout \begin_layout Address South Weber, Utah \end_layout \begin_layout Email shawn@willden.org \end_layout \begin_layout Abstract The abstract goes here \end_layout \begin_layout Section Problem Statement \end_layout \begin_layout Standard The allmydata Tahoe distributed file system uses Reed-Solomon erasure coding to split files into \begin_inset Formula $N$ \end_inset shares, each of which is then delivered to a randomly-selected peer in a distributed network. The file can later be reassembled from any \begin_inset Formula $k\leq N$ \end_inset of the shares, if they are available. \end_layout \begin_layout Standard Over time shares are lost for a variety of reasons. Storage servers may crash, be destroyed or simply be removed from the network. 
To mitigate such losses, Tahoe network clients employ a repair agent which scans the peers once per time period \begin_inset Formula $A$ \end_inset and determines how many of the shares remain. If fewer than \begin_inset Formula $L$ \end_inset ( \begin_inset Formula $k\leq L\leq N$ \end_inset ) shares remain, then the repairer reconstructs the file shares and redistributes the missing ones, bringing the availability back up to full. \end_layout \begin_layout Standard The question we're trying to answer is "What's the probability that we'll be able to reassemble the file at some later time \begin_inset Formula $T$ \end_inset ?". We'd also like to be able to determine what values we should choose for \begin_inset Formula $k$ \end_inset , \begin_inset Formula $N$ \end_inset , \begin_inset Formula $A$ \end_inset , and \begin_inset Formula $L$ \end_inset in order to ensure \begin_inset Formula $Pr[loss]\leq t$ \end_inset for some threshold probability \begin_inset Formula $t$ \end_inset . This is an optimization problem because although we could obtain very low \begin_inset Formula $Pr[loss]$ \end_inset by choosing small \begin_inset Formula $k,$ \end_inset large \begin_inset Formula $N$ \end_inset , small \begin_inset Formula $A$ \end_inset , and setting \begin_inset Formula $L=N$ \end_inset , these choices have costs. The peer storage and bandwidth consumed by the share distribution process are approximately \begin_inset Formula $\nicefrac{N}{k}$ \end_inset times the size of the original file, so we would like to reduce this ratio as far as possible consistent with \begin_inset Formula $Pr[loss]\leq t$ \end_inset . Likewise, a frequent and aggressive repair process can be used to ensure that the number of shares available at any time is very close to \begin_inset Formula $N,$ \end_inset but at a cost in bandwidth as the repair agent downloads \begin_inset Formula $k$ \end_inset shares to reconstruct the file and uploads new shares to replace those that are lost. 
\end_layout \begin_layout Section Reliability \end_layout \begin_layout Standard The probability that the file becomes unrecoverable is dependent upon the probability that the peers to whom we send shares are able to return those copies on demand. Shares that are returned in corrupted form can be detected and discarded, so there is no need to distinguish between corruption and loss. \end_layout \begin_layout Standard There are a large number of factors that affect share availability. Availability can be temporarily interrupted by peer unavailability, due to network outages, power failures or administrative shutdown, among other reasons. Availability can be permanently lost due to failure or corruption of storage media, catastrophic damage to the peer system, administrative error, withdrawal from the network, malicious corruption, etc. \end_layout \begin_layout Standard The existence of intermittent failure modes motivates the introduction of a distinction between \noun on availability \noun default and \noun on reliability \noun default . Reliability is the probability that a share is retrievable assuming intermittent failures can be waited out, so reliability considers only permanent failures. Availability considers all failures, and is focused on the probability of retrieval within some defined time frame. \end_layout \begin_layout Standard Another consideration is that some failures affect multiple shares. If multiple shares of a file are stored on a single hard drive, for example, failure of that drive may lose them all. Catastrophic damage to a data center may destroy all shares on all peers in that data center. \end_layout \begin_layout Standard While the types of failures that may occur are pretty consistent across even very different peers, their probabilities differ dramatically. 
A professionally-administered blade server with redundant storage, power and Internet located in a carefully-monitored data center with automatic fire suppression systems is much less likely to become either temporarily or permanently unavailable than the typical virus and malware-ridden home computer on a single cable modem connection. A variety of situations in between exist as well, such as the case of the author's home file server, which is administered by an IT professional and uses RAID level 6 redundant storage, but runs on old, cobbled-together equipment, and has a consumer-grade Internet connection. \end_layout \begin_layout Standard To begin with, let's use a simple definition of reliability: \end_layout \begin_layout Definition \noun on Reliability \noun default is the probability \begin_inset Formula $p_{i}$ \end_inset that a share \begin_inset Formula $s_{i}$ \end_inset will survive to (be retrievable at) time \begin_inset Formula $T=A$ \end_inset , ignoring intermittent failures. That is, the probability that the share will be retrievable at the end of the current repair cycle, and therefore usable by the repairer to regenerate any lost shares. \end_layout \begin_layout Definition Reliability is clearly dependent on \begin_inset Formula $A$ \end_inset . Short repair cycles offer less time for shares to \begin_inset Quotes eld \end_inset decay \begin_inset Quotes erd \end_inset into unavailability. \end_layout \begin_layout Subsection Fixed Reliability \begin_inset CommandInset label LatexCommand label name "sub:Fixed-Reliability" \end_inset \end_layout \begin_layout Standard In the simplest case, the peers holding the file shares all have the same reliability \begin_inset Formula $p$ \end_inset , and are all independent from one another. Let \begin_inset Formula $K$ \end_inset be a random variable that represents the number of shares that survive \begin_inset Formula $A$ \end_inset . 
Each share's survival can be viewed as an independent Bernoulli trial with a success probability of \begin_inset Formula $p$ \end_inset , which means that \begin_inset Formula $K$ \end_inset follows the binomial distribution with parameters \begin_inset Formula $N$ \end_inset and \begin_inset Formula $p$ \end_inset . That is, \begin_inset Formula $K\sim B(N,p)$ \end_inset . \end_layout \begin_layout Theorem Binomial Distribution Theorem \end_layout \begin_layout Theorem Consider \begin_inset Formula $n$ \end_inset independent Bernoulli trials \begin_inset Foot status collapsed \begin_layout Plain Layout A Bernoulli trial is simply a test of some sort that results in one of two outcomes, one of which is designated success and the other failure. The classic example of a Bernoulli trial is a coin toss. \end_layout \end_inset that succeed with probability \begin_inset Formula $p$ \end_inset , and let \begin_inset Formula $K$ \end_inset be a random variable that represents the number of successes. We say that \begin_inset Formula $K$ \end_inset follows the Binomial Distribution with parameters \begin_inset Formula $n$ \end_inset and \begin_inset Formula $p$ \end_inset , denoted \begin_inset Formula $K\sim B(n,p)$ \end_inset . 
The probability that \begin_inset Formula $K$ \end_inset takes a particular value \begin_inset Formula $m$ \end_inset (the probability that there are exactly \begin_inset Formula $m$ \end_inset successful trials, and therefore \begin_inset Formula $n-m$ \end_inset failures) is called the probability mass function and is given by: \begin_inset Formula \begin{equation} Pr[K=m]=f(m;n,p)=\binom{n}{m}p^{m}(1-p)^{n-m}\label{eq:binomial-pmf}\end{equation} \end_inset \end_layout \begin_layout Proof Consider the specific case of exactly \begin_inset Formula $m$ \end_inset successes followed by \begin_inset Formula $n-m$ \end_inset failures. Because each success has probability \begin_inset Formula $p$ \end_inset , each failure has probability \begin_inset Formula $1-p$ \end_inset , and the trials are independent, the probability of this exact case occurring is \begin_inset Formula $p^{m}\left(1-p\right)^{\left(n-m\right)}$ \end_inset , the product of the probabilities of the outcome of each trial. \end_layout \begin_layout Proof Now consider any reordering of these \begin_inset Formula $m$ \end_inset successes and \begin_inset Formula $n-m$ \end_inset failures. Any such reordering occurs with the same probability \begin_inset Formula $p^{m}\left(1-p\right)^{\left(n-m\right)}$ \end_inset , but with the terms of the product reordered. Since multiplication is commutative, each such reordering has the same probability. There are \begin_inset Formula $\binom{n}{m}$ \end_inset such orderings, and the orderings are mutually exclusive events, so their probabilities add and the probability that any ordering of \begin_inset Formula $m$ \end_inset successes and \begin_inset Formula $n-m$ \end_inset failures occurs is given by \begin_inset Formula \[ \binom{n}{m}p^{m}\left(1-p\right)^{\left(n-m\right)}\] \end_inset which is the right-hand-side of equation \begin_inset CommandInset ref LatexCommand ref reference "eq:binomial-pmf" \end_inset . 
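\end_layout \begin_layout Standard As a small worked instance of the binomial probability mass function, take three independent trials that each succeed with probability 0.9. The values \begin_inset Formula $n=3$ \end_inset , \begin_inset Formula $p=0.9$ \end_inset and \begin_inset Formula $m=2$ \end_inset are chosen purely for illustration and are not measurements of any real system. The probability that exactly two of the three trials succeed is \begin_inset Formula \[ Pr[K=2]=\binom{3}{2}(0.9)^{2}(0.1)^{1}=3\times0.81\times0.1=0.243\] \end_inset Here the factor 3 counts the three possible positions of the single failure, and each of those orderings has probability \begin_inset Formula $0.081$ \end_inset . 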
\end_layout \begin_layout Standard A file survives if at least \begin_inset Formula $k$ \end_inset of the \begin_inset Formula $N$ \end_inset shares survive. Equation \begin_inset CommandInset ref LatexCommand ref reference "eq:binomial-pmf" \end_inset , with \begin_inset Formula $n=N$ \end_inset , gives the probability that exactly \begin_inset Formula $i$ \end_inset shares survive, for any \begin_inset Formula $0\leq i\leq N$ \end_inset , so the probability that fewer than \begin_inset Formula $k$ \end_inset survive is the sum of the probabilities that \begin_inset Formula $0,1,2,\ldots,k-1$ \end_inset shares survive. That is: \end_layout \begin_layout Standard \begin_inset Formula \begin{equation} Pr[file\, lost]=\sum_{i=0}^{k-1}\binom{N}{i}p^{i}(1-p)^{N-i}\label{eq:simple-failure}\end{equation} \end_inset \end_layout \begin_layout Subsection Independent Reliability \begin_inset CommandInset label LatexCommand label name "sub:Independent-Reliability" \end_inset \end_layout \begin_layout Standard Equation \begin_inset CommandInset ref LatexCommand ref reference "eq:simple-failure" \end_inset assumes that each share has the same probability of survival, but as explained above, this is not necessarily true. A more accurate model allows each share \begin_inset Formula $s_{i}$ \end_inset its own probability of survival \begin_inset Formula $p_{i}$ \end_inset . Each share's survival can still be treated as an independent Bernoulli trial, but with success probability \begin_inset Formula $p_{i}$ \end_inset . Under this assumption, \begin_inset Formula $K$ \end_inset follows a generalized binomial distribution with parameters \begin_inset Formula $N$ \end_inset and \begin_inset Formula $p_{i}$ \end_inset where \begin_inset Formula $1\leq i\leq N$ \end_inset . \end_layout \begin_layout Standard The PMF for this generalized \begin_inset Formula $K$ \end_inset does not have a simple closed-form representation. However, the PMFs for random variables representing individual share survival do. 
Let \begin_inset Formula $S_{i}$ \end_inset be a random variable such that: \end_layout \begin_layout Standard \begin_inset Formula \[ S_{i}=\begin{cases} 1 & \textnormal{if }s_{i}\textnormal{ survives}\\ 0 & \textnormal{if }s_{i}\textnormal{ fails}\end{cases}\] \end_inset \end_layout \begin_layout Standard The PMF for \begin_inset Formula $S_{i}$ \end_inset is very simple: \begin_inset Formula \[ Pr[S_{i}=j]=\begin{cases} 1-p_{i} & j=0\\ p_{i} & j=1\end{cases}\] \end_inset \end_layout \begin_layout Standard Note that since each \begin_inset Formula $S_{i}$ \end_inset is the survivor count for the single share \begin_inset Formula $s_{i}$ \end_inset (either 0 or 1), if we add up all of the individual survivor counts, we get the group survivor count. That is: \begin_inset Formula \[ \sum_{i=1}^{N}S_{i}=K\] \end_inset Effectively, \begin_inset Formula $K$ \end_inset has just been separated into the series of Bernoulli trials that make it up. \end_layout \begin_layout Theorem Discrete Convolution Theorem \end_layout \begin_layout Theorem Let \begin_inset Formula $X$ \end_inset and \begin_inset Formula $Y$ \end_inset be independent discrete random variables with probability mass functions given by \begin_inset Formula $Pr\left[X=x\right]=f(x)$ \end_inset and \begin_inset Formula $Pr\left[Y=y\right]=g(y).$ \end_inset Let \begin_inset Formula $Z$ \end_inset be the discrete random variable obtained by summing \begin_inset Formula $X$ \end_inset and \begin_inset Formula $Y$ \end_inset . \end_layout \begin_layout Theorem The probability mass function of \begin_inset Formula $Z$ \end_inset is given by \begin_inset Formula \[ Pr[Z=z]=h(z)=\left(f\star g\right)(z)\] \end_inset where \begin_inset Formula $\star$ \end_inset denotes the discrete convolution operation: \begin_inset Formula \[ \left(f\star g\right)\left(n\right)=\sum_{m=-\infty}^{\infty}f\left(m\right)g\left(n-m\right)\] \end_inset \end_layout \begin_layout Proof The proof is beyond the scope of this paper. 
\begin_inset Foot status collapsed \begin_layout Plain Layout \begin_inset Quotes eld \end_inset Beyond the scope of this paper \begin_inset Quotes erd \end_inset usually means \begin_inset Quotes eld \end_inset Too long and nasty to bore you with \begin_inset Quotes erd \end_inset . In this case it means \begin_inset Quotes eld \end_inset The author hasn't the foggiest idea why this is true, or how to prove it, but reliable authorities say it's real, and in practice it works a treat. \begin_inset Quotes erd \end_inset \end_layout \end_inset If you don't believe it's true, look it up on Wikipedia, which is never wrong. \end_layout \begin_layout Standard Applying the discrete convolution theorem, if \begin_inset Formula $Pr[K=i]=f(i)$ \end_inset and \begin_inset Formula $Pr[S_{i}=j]=g_{i}(j)$ \end_inset , then \begin_inset Formula $f=g_{1}\star g_{2}\star g_{3}\star\ldots\star g_{N}$ \end_inset . Since convolution is associative, this can also be written as \begin_inset Formula \begin{equation} f=(\ldots((g_{1}\star g_{2})\star g_{3})\star\ldots)\star g_{N}\label{eq:convolution}\end{equation} \end_inset Therefore, \begin_inset Formula $f$ \end_inset can be computed as a sequence of convolution operations on the simple PMFs of the random variables \begin_inset Formula $S_{i}$ \end_inset . In fact, for large \begin_inset Formula $N$ \end_inset , equation \begin_inset CommandInset ref LatexCommand ref reference "eq:convolution" \end_inset turns out to be a more effective means of computing the PMF of \begin_inset Formula $K$ \end_inset even in the case of the standard binomial distribution, primarily because the binomial coefficients in equation \begin_inset CommandInset ref LatexCommand ref reference "eq:binomial-pmf" \end_inset become very large and overflow unless arbitrary precision numeric representations are used. 
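\end_layout \begin_layout Standard To make the repeated convolution concrete, here is a minimal worked instance with just two shares. The survival probabilities \begin_inset Formula $p_{1}=0.9$ \end_inset and \begin_inset Formula $p_{2}=0.8$ \end_inset are assumed purely for illustration. Writing \begin_inset Formula $g_{i}(0)=1-p_{i}$ \end_inset and \begin_inset Formula $g_{i}(1)=p_{i}$ \end_inset , the PMF of the survivor count is obtained by summing, for each total, the products of the ways that total can arise: \begin_inset Formula \begin{align*} f(0) & =g_{1}(0)\, g_{2}(0)=0.1\times0.2=0.02\\ f(1) & =g_{1}(0)\, g_{2}(1)+g_{1}(1)\, g_{2}(0)=0.08+0.18=0.26\\ f(2) & =g_{1}(1)\, g_{2}(1)=0.9\times0.8=0.72\end{align*} \end_inset The three values sum to 1, as a PMF must. A third share would be folded in by convolving this three-element result with \begin_inset Formula $g_{3}$ \end_inset in the same way. 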
\end_layout \begin_layout Standard Note also that it is not necessary to have very simple PMFs like those of the \begin_inset Formula $S_{i}$ \end_inset . Any share or set of shares that has a known PMF can be combined with any other set with a known PMF by convolution, as long as the two share sets are independent. Since PMFs are easily represented as simple lists of probabilities, where the \begin_inset Formula $i$ \end_inset th element in the list corresponds to \begin_inset Formula $Pr[K=i]$ \end_inset , these functions are easily managed in software, and computing the convolution is both simple and efficient. \end_layout \begin_layout Subsection Multiple Failure Modes \begin_inset CommandInset label LatexCommand label name "sub:Multiple-Failure-Modes" \end_inset \end_layout \begin_layout Standard In modeling share survival probabilities, it's useful to be able to analyze separately each of the various failure modes. If reliable statistics for disk failure can be obtained, then a probability mass function for that form of failure can be generated. Similarly, statistics on other hardware failures, administrative errors, network losses, etc., can all be estimated independently. If those estimates can then be combined into a single PMF for a share, then we can use it to predict failures for that share. \end_layout \begin_layout Standard Combining independent failure modes for a single share is straightforward. If \begin_inset Formula $p_{i,j}$ \end_inset is the probability of survival of the \begin_inset Formula $j$ \end_inset th failure mode of share \begin_inset Formula $i$ \end_inset , \begin_inset Formula $1\leq j\leq m$ \end_inset , then \begin_inset Formula \[ Pr[S_{i}=k]=f_{i}(k)=\begin{cases} \prod_{j=1}^{m}p_{i,j} & k=1\\ 1-\prod_{j=1}^{m}p_{i,j} & k=0\end{cases}\] \end_inset is the survival PMF. 
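\end_layout \begin_layout Standard As a brief illustration, suppose a share is exposed to \begin_inset Formula $m=2$ \end_inset independent failure modes, say disk failure and administrative error, and assume (hypothetically, not from measured data) survival probabilities \begin_inset Formula $p_{i,1}=0.99$ \end_inset and \begin_inset Formula $p_{i,2}=0.95$ \end_inset . The share survives only if it survives both modes, so \begin_inset Formula \[ Pr[S_{i}=1]=0.99\times0.95=0.9405,\qquad Pr[S_{i}=0]=1-0.9405=0.0595\] \end_inset Note that the combined survival probability is lower than that of either mode taken alone. 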
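\end_layout \begin_layout Subsection Multi-share failures \begin_inset CommandInset label LatexCommand label name "sub:Multi-share-failures" \end_inset \end_layout \begin_layout Standard If there are failure modes that affect multiple computers, we can also construct the PMF that predicts their survival. The key observation is that the PMF has non-zero probabilities only for \begin_inset Formula $0$ \end_inset survivors and \begin_inset Formula $n$ \end_inset survivors, where \begin_inset Formula $n$ \end_inset is the number of shares in the set. If \begin_inset Formula $p$ \end_inset is the probability of survival, the PMF of \begin_inset Formula $K$ \end_inset , a random variable representing the number of survivors, is \begin_inset Formula \[ Pr[K=i]=f(i)=\begin{cases} p & i=n\\ 0 & 0<i<n\\ 1-p & i=0\end{cases}\] \end_inset \end_layout \begin_layout Standard \begin_inset Float table wide false sideways false status open \begin_layout Plain Layout \begin_inset Tabular
<lyxtabular version="3" rows="13" columns="4">
<features>
<column alignment="center" valignment="top" width="0">
<column alignment="center" valignment="top" width="0">
<column alignment="center" valignment="top" width="0">
<column alignment="center" valignment="top" width="0">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout \begin_inset Formula $k$ \end_inset \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout \begin_inset Formula $Pr[K=k]$ \end_inset \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout \begin_inset Formula $Pr[file\, loss]=Pr[K<k]$ \end_inset \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout \begin_inset Formula $N/k$ \end_inset \end_layout \end_inset </cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout 1 \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout \begin_inset Formula $1.60\times10^{-9}$ \end_inset \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout \begin_inset Formula $2.53\times10^{-11}$ \end_inset \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout 12 \end_layout \end_inset </cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout 2 \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout \begin_inset Formula $3.80\times10^{-8}$ \end_inset \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout \begin_inset Formula $1.63\times10^{-9}$ \end_inset \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout 6 \end_layout \end_inset </cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout 3 \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout \begin_inset Formula $4.04\times10^{-7}$ \end_inset \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout \begin_inset Formula $3.70\times10^{-8}$ \end_inset \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout 4 \end_layout \end_inset </cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout 4 \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout \begin_inset Formula $2.06\times10^{-6}$ \end_inset \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout \begin_inset Formula $4.44\times10^{-7}$ \end_inset \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout 3 \end_layout \end_inset </cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout 5 \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout \begin_inset Formula $2.10\times10^{-5}$ \end_inset \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout \begin_inset Formula $2.50\times10^{-6}$ \end_inset \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout 2.4 \end_layout \end_inset </cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout 6 \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout \begin_inset Formula $0.000428$ \end_inset \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout \begin_inset Formula $2.35\times10^{-5}$ \end_inset \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout 2 \end_layout \end_inset </cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout 7 \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout \begin_inset Formula $0.00417$ \end_inset \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout \begin_inset Formula $0.000452$ \end_inset \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout 1.7 \end_layout \end_inset </cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout 8 \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout \begin_inset Formula $0.0157$ \end_inset \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout \begin_inset Formula $0.00462$ \end_inset \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout 1.5 \end_layout \end_inset </cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout 9 \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout \begin_inset Formula $0.00127$ \end_inset \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout \begin_inset Formula $0.0203$ \end_inset \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout 1.3 \end_layout \end_inset </cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout 10 \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout \begin_inset Formula $0.0230$ \end_inset \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout \begin_inset Formula $0.0216$ \end_inset \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout 1.2 \end_layout \end_inset </cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout 11 \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout \begin_inset Formula $0.208$ \end_inset \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout \begin_inset Formula $0.0446$ \end_inset \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout 1.1 \end_layout \end_inset </cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout 12 \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout \begin_inset Formula $0.747$ \end_inset \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout \begin_inset Formula $0.253$ \end_inset \end_layout \end_inset </cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none"> \begin_inset Text \begin_layout Plain Layout 1 \end_layout \end_inset </cell>
</row>
</lyxtabular>
\end_inset \end_layout \begin_layout Plain Layout \begin_inset Caption \begin_layout Plain Layout \align left \begin_inset CommandInset label LatexCommand label name "tab:Example-PMF" \end_inset Example PMF \end_layout \end_inset \end_layout \begin_layout Plain Layout \end_layout \end_inset \end_layout \begin_layout Standard The table demonstrates the 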
importance of the selection of \begin_inset Formula $k$ \end_inset , and the tradeoff against file size expansion. Note that the survival of exactly 9 servers is significantly less likely than the survival of 8 or 10 servers. This is, again, an artifact of the group failure modes. Because of this, there is no reason to choose \begin_inset Formula $k=9$ \end_inset over \begin_inset Formula $k=10$ \end_inset . Normally, reducing the number of shares needed for reassembly improves the file's chances of survival, but in this case it provides a minuscule gain in reliability at the cost of a roughly 10% increase in bandwidth and storage consumed. \end_layout \begin_layout Subsection Share Duplication \end_layout \begin_layout Standard Before moving on to consider issues other than single-interval file loss, let's analyze one more possibility, that of \begin_inset Quotes eld \end_inset cheap \begin_inset Quotes erd \end_inset file repair via share duplication. \end_layout \begin_layout Standard Initially, files are split using erasure coding, which creates \begin_inset Formula $N$ \end_inset unique shares, any \begin_inset Formula $k$ \end_inset of which can be used to reconstruct the file. When shares are lost, proper repair downloads some \begin_inset Formula $k$ \end_inset shares, reconstructs the original file and then uses the erasure coding algorithm to reconstruct the lost shares, then redeploys them to peers in the network. This is a somewhat expensive process. \end_layout \begin_layout Standard A cheaper repair option is simply to direct some peer that has share \begin_inset Formula $s_{i}$ \end_inset to send a copy to another peer, thus increasing by one the number of shares in the network. This is not as good as actually replacing the lost share, though. Suppose that more shares were lost, leaving only \begin_inset Formula $k$ \end_inset shares remaining. 
If two of those shares are identical, because one was duplicated in this fashion, then only \begin_inset Formula $k-1$ \end_inset shares truly remain, and the file can no longer be reconstructed. \end_layout \begin_layout Standard However, such cheap repair is not completely pointless; it does increase file survivability. The question is: By how much? \end_layout \begin_layout Standard Effectively, share duplication simply increases the probability that \begin_inset Formula $s_{i}$ \end_inset will survive, by providing two locations from which to retrieve it. We can view the two copies of the single share as one, but with a higher probability of survival than would be provided by either of the two peers. In particular, if \begin_inset Formula $p_{1}$ \end_inset and \begin_inset Formula $p_{2}$ \end_inset are the probabilities that the two peers will survive, respectively, then \begin_inset Formula \[ Pr[s_{i}\, survives]=p_{1}+p_{2}-p_{1}p_{2}\] \end_inset \end_layout \begin_layout Standard More generally, if a single share is deployed on \begin_inset Formula $n$ \end_inset peers, each with a PMF \begin_inset Formula $f_{i}(j),0\leq j\leq1,1\leq i\leq n$ \end_inset , the share survival count is a random variable \begin_inset Formula $S$ \end_inset and the probability of share loss is \begin_inset Formula \[ Pr[S=0]=(f_{1}\star f_{2}\star\ldots\star f_{n})(0)\] \end_inset \end_layout \begin_layout Standard From that, we can construct a share PMF in the obvious way, which can then be convolved with the other share PMFs to produce the share set PMF. \end_layout \begin_layout Example Suppose a file has \begin_inset Formula $N=10,k=3$ \end_inset and that all servers have survival probability \begin_inset Formula $p=.9$ \end_inset . Given a full complement of shares, \begin_inset Formula $Pr[\textrm{file\, loss}]=3.74\times10^{-7}$ \end_inset . 
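This figure can be checked by hand, assuming independent servers as in the fixed-reliability model, by summing the binomial probabilities of 0, 1 and 2 survivors for the stated parameters: \begin_inset Formula \[ \sum_{i=0}^{2}\binom{10}{i}(0.9)^{i}(0.1)^{10-i}=10^{-10}+9\times10^{-9}+3.645\times10^{-7}\approx3.74\times10^{-7}\] \end_inset 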
Suppose that four shares are lost, which increases \begin_inset Formula $Pr[\textrm{file\, loss}]$ \end_inset to \begin_inset Formula $.00127$ \end_inset , a value roughly \begin_inset Formula $3400$ \end_inset times greater. Rather than doing a proper reconstruction, we could direct four peers still holding shares to send a copy of their share to a new peer, which changes the composition of the shares from one of six unique \begin_inset Quotes eld \end_inset standard \begin_inset Quotes erd \end_inset shares, to one of two standard shares, each with survival probability \begin_inset Formula $.9$ \end_inset and four \begin_inset Quotes eld \end_inset doubled \begin_inset Quotes erd \end_inset shares, each with survival probability \begin_inset Formula $2p-p^{2}=.99$ \end_inset . \end_layout \begin_layout Example Combining the two single-peer share PMFs with the four double-share PMFs gives a new file loss probability of \begin_inset Formula $6.64\times10^{-6}$ \end_inset . Not as good as a full repair, but still quite respectable. Also, if storage were not a concern, all six shares could be duplicated, for a \begin_inset Formula $Pr[file\, loss]=1.48\times10^{-7}$ \end_inset , which is actually about 2.5 times better than the nominal case. \end_layout \begin_layout Example The reason such cheap repairs may be attractive in many cases is that distributed bandwidth is cheaper than bandwidth through a single peer. This is particularly true if that single peer has a very slow connection, which is common for home computers, especially in the outbound direction. \end_layout \begin_layout Section Long-Term Reliability \end_layout \begin_layout Standard Thus far, we've focused entirely on the probability that a file survives the interval \begin_inset Formula $A$ \end_inset between repair times. The probability that a file survives long-term, though, is also important. 
As long as the probability of failure during a repair period is non-zero, a given file will eventually be lost. We want to know what the probability of surviving for time \begin_inset Formula $T$ \end_inset is, and how the parameters \begin_inset Formula $A$ \end_inset (time between repairs) and \begin_inset Formula $L$ \end_inset (share low watermark) affect survival time. \end_layout \begin_layout Standard To model file survival time, let \begin_inset Formula $T$ \end_inset be a random variable denoting the time at which a given file becomes unrecoverable, and \begin_inset Formula $R(t)=Pr[T>t]$ \end_inset be a function that gives the probability that the file survives to time \begin_inset Formula $t$ \end_inset . \begin_inset Formula $R(t)$ \end_inset is the survival function of \begin_inset Formula $T$ \end_inset , that is, the complement of its cumulative distribution function. \end_layout \begin_layout Standard Most survival functions are continuous, but \begin_inset Formula $R(t)$ \end_inset is inherently discrete, and stochastic. The time steps are the repair intervals, each of length \begin_inset Formula $A$ \end_inset , so \begin_inset Formula $T$ \end_inset -values are multiples of \begin_inset Formula $A$ \end_inset . During each interval, the file's shares degrade according to the probability mass function of \begin_inset Formula $K$ \end_inset . \end_layout \begin_layout Subsection Aggressive Repairs \end_layout \begin_layout Standard Let's first consider the case of an aggressive repairer. Every interval, this repairer checks the file for share losses and restores them. Thus, at the beginning of each interval, the file always has \begin_inset Formula $N$ \end_inset shares, distributed on servers with various individual and group failure probabilities, which will survive or fail per the output of random variable \begin_inset Formula $K$ \end_inset . 
\end_layout \begin_layout Standard For any interval, then, the probability that the file will survive is \begin_inset Formula $f\left(k\right)=Pr[K\geq k]$ \end_inset . Since each interval's success or failure is independent, and assuming the share reliabilities remain constant over time, the probability of surviving \begin_inset Formula $t$ \end_inset consecutive intervals is \begin_inset Formula \begin{equation} R\left(t\right)=f(k)^{t}\end{equation} \end_inset \end_layout \begin_layout Standard This simple survival function makes it easy to select parameters \begin_inset Formula $N$ \end_inset and \begin_inset Formula $k$ \end_inset such that \begin_inset Formula $R(t)\geq r$ \end_inset , where \begin_inset Formula $r$ \end_inset is a user-specified parameter indicating the desired probability of survival to time \begin_inset Formula $t$ \end_inset . Specifically, we can solve for \begin_inset Formula $f\left(k\right)$ \end_inset in \begin_inset Formula $r\leq f\left(k\right)^{t}$ \end_inset , giving: \begin_inset Formula \begin{equation} f\left(k\right)\geq\sqrt[t]{r}\end{equation} \end_inset \end_layout \begin_layout Standard So, given the PMF of \begin_inset Formula $K$ \end_inset , to assure the survival of a file to time \begin_inset Formula $t$ \end_inset with probability at least \begin_inset Formula $r$ \end_inset , choose the largest \begin_inset Formula $k:f\left(k\right)\geq\sqrt[t]{r}$ \end_inset . For example, if \begin_inset Formula $A$ \end_inset is one month, and \begin_inset Formula $r=1-\nicefrac{1}{1000000}$ \end_inset and \begin_inset Formula $t=120$ \end_inset , or 10 years, we calculate \begin_inset Formula $f\left(k\right)\geq\sqrt[120]{.999999}\cong0.999999992$ \end_inset . Per the PMF of table \begin_inset CommandInset ref LatexCommand ref reference "tab:Example-PMF" \end_inset , this means \begin_inset Formula $k=2$ \end_inset achieves the goal, at the cost of a six-fold expansion in stored file size. 
If the lesser goal of no more than \begin_inset Formula $\nicefrac{1}{1000}$ \end_inset probability of loss is adopted, then since \begin_inset Formula $\sqrt[120]{.999}\cong.999992$ \end_inset , \begin_inset Formula $k=5$ \end_inset achieves the goal with an expansion factor of \begin_inset Formula $2.4$ \end_inset . \end_layout \begin_layout Subsection Repair Cost \end_layout \begin_layout Standard The simplicity and predictability of aggressive repair is attractive, but there is a downside: Repairs cost processing power and bandwidth. The processing power is proportional to the size of the file, since the whole file must be reconstructed and then re-processed using the Reed-Solomon algorithm, while the bandwidth cost is proportional to the number of missing shares that must be replaced, \begin_inset Formula $N-K$ \end_inset . \end_layout \begin_layout Standard Let \begin_inset Formula $c\left(s,d,k\right)$ \end_inset be a cost function that combines the processing cost of regenerating a file of size \begin_inset Formula $s$ \end_inset and the bandwidth cost of downloading a file of size \begin_inset Formula $s$ \end_inset and uploading \begin_inset Formula $d$ \end_inset shares each of size \begin_inset Formula $\nicefrac{s}{k}$ \end_inset . Also, let \begin_inset Formula $D$ \end_inset denote the random variable \begin_inset Formula $N-K$ \end_inset , which is the number of shares that must be redistributed to bring the file share set back up to \begin_inset Formula $N$ \end_inset after degrading during an interval. The probability mass function of \begin_inset Formula $D$ \end_inset is \begin_inset Formula \[ Pr[D=d]=f(d)=\begin{cases} Pr\left[K=N\right]+Pr[K