Lossmodel updates

Various improvements to the lossmodel, plus addition of README.lossmodel that provides a link to the PDF.
2025-02-08 04:10:25 +00:00 · 2009-07-28 15:44:30 -07:00 · 2009-07-28 15:44:30 -07:00 · 3782c27ac5
commit 3782c27ac5
parent 560b09d696
2 changed files with 402 additions and 175 deletions
--- a/docs/proposed/README.lossmodel
+++ b/docs/proposed/README.lossmodel
@ -0,0 +1,28 @@
 The lossmodel.lyx file is the source document for an in-progress paper
 that analyzes the probability of losing files stored in a Tahoe
 Least-Authority File System under various scenarios.  It describes:
 1.  How to estimate peer reliabilities, based on peer MTBF failure
 data.
 2.  How to compute file loss probabilities, based on a given set of
 shares stored on peers with estimated reliabilities.  The peer
 reliabilities do not have to be uniform, and the model takes into
 account the file repair process.
 3.  How to estimate Tahoe parameters for k (shares needed), n (shares
 distributed) and A (repair interval) to achieve a file reliability
 target.
 4.  How to compute the estimated repair cost over time, discounted at
 a fixed rate, of maintaining a file for a time period T.
 Future work will also address the latter three issues in the context
 of "non-aggressive" repair, where repair will only be performed if
 too many shares are lost, and it will also extend the repair cost
 estimation model to suggest cost functions appropriate for common
 network architectures.
 A PDF of the current version of the file may be downloaded from:
    http://willden.org/~shawn/lossmodel.pdf
--- a/docs/proposed/lossmodel.lyx
+++ b/docs/proposed/lossmodel.lyx
@ -1,4 +1,4 @@
-#LyX 1.6.1 created this file. For more info see http://www.lyx.org/
+#LyX 1.6.2 created this file. For more info see http://www.lyx.org/
 \lyxformat 345
 \begin_document
 \begin_header
@ -56,7 +56,7 @@ Shawn Willden
 \end_layout
 \begin_layout Date
-01/14/2009
+07/22/2009
 \end_layout
 \begin_layout Address
@ -81,8 +81,8 @@ The allmydata Tahoe distributed file system uses Reed-Solomon erasure coding
 \begin_inset Formula $N$
 \end_inset
- shares, each of which is then delivered to a randomly-selected peer in
+ shares which are delivered to randomly-selected peers in a distributed
- a distributed network.
+ network.
 The file can later be reassembled from any 
 \begin_inset Formula $k\leq N$
 \end_inset
@ -112,7 +112,7 @@ s the missing ones, bringing the availability back up to full.
 \end_layout
 \begin_layout Standard
-The question we're trying to answer is "What's the probability that we'll
+The question we're trying to answer is "What is the probability that we'll
 be able to reassemble the file at some later time 
 \begin_inset Formula $T$
 \end_inset
@ -136,11 +136,11 @@ The question we're trying to answer is "What's the probability that we'll
 \end_inset
 in order to ensure 
-\begin_inset Formula $Pr[loss]\leq t$
+\begin_inset Formula $Pr[loss]\leq r$
 \end_inset
 for some threshold probability 
-\begin_inset Formula $t$
+\begin_inset Formula $r$
 \end_inset
 .
@ -149,45 +149,33 @@ The question we're trying to answer is "What's the probability that we'll
 \begin_inset Formula $Pr[loss]$
 \end_inset
- by choosing small 
+ by selecting conservative parameters, these choices have costs.
 \begin_inset Formula $k,$
 \end_inset
 large 
 \begin_inset Formula $N$
 \end_inset
 , small 
 \begin_inset Formula $A$
 \end_inset
 , and setting 
 \begin_inset Formula $L=N$
 \end_inset
 , these choices have costs.
 The peer storage and bandwidth consumed by the share distribution process
 are approximately 
 \begin_inset Formula $\nicefrac{N}{k}$
 \end_inset
- times the size of the original file, so we would like to reduce this ratio
+ times the size of the original file, so we would like to minimize 
- as far as possible consistent with 
+\begin_inset Formula $\nicefrac{N}{k}$
-\begin_inset Formula $Pr[loss]\leq t$
+\end_inset
 , consistent with 
 \begin_inset Formula $Pr[loss]\leq r$
 \end_inset
 .
- Likewise, frequent and aggressive repair process can be used to ensure
+ Likewise, a frequent and aggressive repair process keeps the number of
- that the number of shares available at any time is very close to 
+ shares available close to 
 \begin_inset Formula $N,$
 \end_inset
- but at a cost in bandwidth as the repair agent downloads 
+ but at a cost in bandwidth and processing time as the repair agent downloads
 \begin_inset Formula $k$
 \end_inset
- shares to reconstruct the file and uploads new shares to replace those
+ shares, reconstructs the file and uploads new shares to replace those that
- that are lost.
+ are lost.
 \end_layout
 \begin_layout Section
@ -198,13 +186,13 @@ Reliability
 The probability that the file becomes unrecoverable is dependent upon the
 probability that the peers to whom we send shares are able to return those
 copies on demand.
- Shares that are returned in corrupted form can be detected and discarded,
+ Shares that are corrupted are detected and discarded, so there is no need
- so there is no need to distinguish between corruption and loss.
+ to distinguish between corruption and loss.
 \end_layout
 \begin_layout Standard
-There are a large number of factors that affect share availability.
+Many factors affect share availability.
- Availability can be temporarily interrupted by peer unavailability, due
+ Availability can be temporarily interrupted by peer unavailability due
 to network outages, power failures or administrative shutdown, among other
 reasons.
 Availability can be permanently lost due to failure or corruption of storage
@ -238,12 +226,12 @@ Another consideration is that some failures affect multiple shares.
 \end_layout
 \begin_layout Standard
-While the types of failures that may occur are pretty consistent across
+While the types of failures that may occur are quite consistent across peers,
- even very different peers, their probabilities differ dramatically.
+ their probabilities differ dramatically.
- A professionally-administered blade server with redundant storage, power
+ A professionally-administered server with redundant storage, power and
- and Internet located in a carefully-monitored data center with automatic
+ Internet located in a carefully-monitored data center with automatic fire
- fire suppression systems is much less likely to become either temporarily
+ suppression systems is much less likely to become either temporarily or
- or permanently unavailable than the typical virus and malware-ridden home
+ permanently unavailable than the typical virus and malware-ridden home
 computer on a single cable modem connection.
 A variety of situations in between exist as well, such as the case of the
 author's home file server, which is administered by an IT professional
@ -268,7 +256,7 @@ Reliability
 \begin_inset Formula $s_{i}$
 \end_inset
- will surve to (be retrievable at) time 
+ will survive to (be retrievable at) time 
 \begin_inset Formula $T=A$
 \end_inset
@ -278,8 +266,12 @@ Reliability
 any lost shares.
 \end_layout
-\begin_layout Definition
+\begin_layout Standard
-Reliability is clearly dependent on 
+Reliability 
 \begin_inset Formula $p_{i}$
 \end_inset
 is clearly dependent on 
 \begin_inset Formula $A$
 \end_inset
@ -296,7 +288,181 @@ decay
 \end_layout
 \begin_layout Subsection
-Fixed Reliability
+Peer Reliability
 \end_layout
 \begin_layout Standard
 Since peer reliability is the basis for any computations we may do on share
 and file reliability, we must have a way to estimate it.
 Reliability modeling of hardware, software and human performance are each
 complex topics, the subject of much ongoing research.
 In particular, the reliability of one of the key components of any peer
 from our perspective -- the hard drive where file shares are stored --
 is the subject of much current debate.
 \end_layout
 \begin_layout Standard
 A common assumption about hardware failure is that it follows the 
 \begin_inset Quotes eld
 \end_inset
 bathtub curve
 \begin_inset Quotes erd
 \end_inset
 , with frequent failures during the first few months, a constant failure
 rate for a few years and then a rising failure rate as the hardware wears
 out.
 This curve is often flattened by burn-in stress testing, and by periodic
 replacement that assures that in-service components never reach 
 \begin_inset Quotes eld
 \end_inset
 old age
 \begin_inset Quotes erd
 \end_inset
 .
 \end_layout
 \begin_layout Standard
 In any case, we're generally going to ignore all of that complexity and
 focus on the bottom of the bathtub, assuming constant failure rates.
 This is a particularly reasonable assumption as long as we're focused on
 failures during a particular, relatively short interval 
 \begin_inset Formula $A$
 \end_inset
 .
 Towards the end of this paper, as we examine failures over many repair
 intervals, the assumption becomes more tenuous, and we note some of the
 issues.
 \end_layout
 \begin_layout Subsubsection
 Estimate Adaptation
 \end_layout
 \begin_layout Standard
 Even assuming constant failure rates, however, it will be rare that the
 duration of 
 \begin_inset Formula $A$
 \end_inset
 coincides with the available failure rate data, particularly since we want
 to view 
 \begin_inset Formula $A$
 \end_inset
 as a tunable parameter.
 It's necessary to be able to adapt failure rates baselined against any given
 duration to the selected value of 
 \begin_inset Formula $A$
 \end_inset
 .
 \end_layout
 \begin_layout Standard
 Another issue is that failure rates of hardware, etc., are necessarily continuous
 in nature, while the per-interval failure/survival rates that are of interest
 for file reliability calculations are discrete -- a peer either survives
 or fails during the interval.
 The continuous nature of failure rates means that the common and obvious
 methods for estimating failure rates result in values that follow continuous,
 not discrete distributions.
 The difference is minor for small failure probabilities, and converges
 to zero as the number of intervals goes to infinity, but is important enough
 in some cases to be worth correcting for.
 \end_layout
 \begin_layout Standard
 Continuous failure rates are described in terms of mean time to failure,
 and under the assumption that failure rates are constant, failure times
 are exponentially distributed.
 Under these assumptions, the probability density of a machine failing at time 
 \begin_inset Formula $t$
 \end_inset
 , is 
 \begin_inset Formula \[
 f\left(t\right)=\lambda e^{-\lambda t}\]
 \end_inset
 where 
 \begin_inset Formula $\lambda$
 \end_inset
 represents the per unit-time failure rate.
 The probability that a machine fails at or before time 
 \begin_inset Formula $A$
 \end_inset
 is therefore
 \begin_inset Formula \begin{align}
 F\left(A\right) & =\int_{0}^{A}f\left(x\right)dx\nonumber \\
 & =\int_{0}^{A}\lambda e^{-\lambda x}dx\nonumber \\
 & =1-e^{-\lambda A}\label{eq:failure-time}\end{align}
 \end_inset
 \end_layout
 \begin_layout Standard
 Note that 
 \begin_inset Formula $A$
 \end_inset
 and 
 \begin_inset Formula $\lambda$
 \end_inset
 in 
 \begin_inset CommandInset ref
 LatexCommand ref
 reference "eq:failure-time"
 \end_inset
 must be expressed in consistent time units.
 If they're different, unit conversions should be applied in the normal
 way.
 For example, if the estimate for 
 \begin_inset Formula $\lambda$
 \end_inset
 is 750 failures per million hours, and 
 \begin_inset Formula $A$
 \end_inset
 is one month, then either 
 \begin_inset Formula $A$
 \end_inset
 should be represented as 
 \begin_inset Formula $30\cdot24/1000000=.00072$
 \end_inset
 , or 
 \begin_inset Formula $\lambda$
 \end_inset
 should be converted to failures per month.
 Or both may be converted to hours.
 \end_layout
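The unit conversion and the cumulative failure formula above can be sketched in a few lines of Python. This is only an illustration under the constant-failure-rate assumption; the function name and the choice of a 30-day month are assumptions made for the example, and the rate of 750 failures per million hours is the figure used in the text.

```python
import math

def interval_failure_probability(failures_per_million_hours, interval_hours):
    """Return F(A) = 1 - exp(-lambda * A), with lambda and A both in hours."""
    rate_per_hour = failures_per_million_hours / 1_000_000
    # expm1 computes exp(x) - 1 accurately, which helps when lambda * A is tiny.
    return -math.expm1(-rate_per_hour * interval_hours)

# One 30-day month at 750 failures per million hours (lambda * A = 0.54).
month_hours = 30 * 24
p_fail = interval_failure_probability(750, month_hours)
p_survive = 1.0 - p_fail
```

Converting both quantities to hours is one of the options the text mentions; expressing A in millions of hours (0.00072) and leaving the rate at 750 gives the same product.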
 \begin_layout Subsubsection
 Acquiring Peer Reliability Estimates
 \end_layout
 \begin_layout Standard
 Need to write this.
 \end_layout
 \begin_layout Subsection
 Uniform Reliability
 \begin_inset CommandInset label
 LatexCommand label
 name "sub:Fixed-Reliability"
@ -323,8 +489,8 @@ In the simplest case, the peers holding the file shares all have the same
 \end_inset
 .
- Each share's survival can be viewed as an indepedent Bernoulli trial with
+ Each share's survival can be viewed as an independent Bernoulli trial with
- a succes probability of 
+ a success probability of 
 \begin_inset Formula $p$
 \end_inset
@ -332,7 +498,7 @@ In the simplest case, the peers holding the file shares all have the same
 \begin_inset Formula $K$
 \end_inset
- follows the binomial distribution with paramaters 
+ follows the binomial distribution with parameters 
 \begin_inset Formula $N$
 \end_inset
@ -377,7 +543,15 @@ A Bernoulli trial is simply a test of some sort that results in one of two
 \begin_inset Formula $K$
 \end_inset
- be a random variable that represents the number of successes.
+ be a random variable that represents the number, 
 \begin_inset Formula $m$
 \end_inset
 , of successes, 
 \begin_inset Formula $0\le m\le n$
 \end_inset
 .
 We say that 
 \begin_inset Formula $K$
 \end_inset
@ -387,7 +561,8 @@ A Bernoulli trial is simply a test of some sort that results in one of two
 \end_inset
 .
- The probability that 
+ The probability mass function (PMF) of K is a function that gives the
 probability that 
 \begin_inset Formula $K$
 \end_inset
@ -403,9 +578,10 @@ A Bernoulli trial is simply a test of some sort that results in one of two
 \begin_inset Formula $n-m$
 \end_inset
- failures) is called the probability mass function and is given by:
+ failures).
 The PMF of K is
 \begin_inset Formula \begin{equation}
-Pr[K=m]=f(m;n,p)=\binom{n}{p}p^{m}(1-p)^{n-m}\label{eq:binomial-pmf}\end{equation}
+Pr[K=m]=f(m;n,p)=\binom{n}{m}p^{m}(1-p)^{n-m}\label{eq:binomial-pmf}\end{equation}
 \end_inset
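As a rough illustration of the binomial PMF in the equation above, the following Python sketch evaluates it directly. The function name and the sample values n = 10 and p = 0.9 are hypothetical, chosen only to show the calculation.

```python
from math import comb

def binomial_pmf(m, n, p):
    """Pr[K = m] for n independent trials with success probability p."""
    return comb(n, m) * p**m * (1 - p)**(n - m)

# Hypothetical example: 10 shares, each surviving with probability 0.9.
n, p = 10, 0.9
pmf = [binomial_pmf(m, n, p) for m in range(n + 1)]
```

Here `pmf[m]` is the probability that exactly m of the n shares survive, so the entries sum to 1.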
@ -455,7 +631,8 @@ Now consider any reordering of these
 Since multiplication is commutative, each such reordering has the same
 probability.
 There are n-choose-m such orderings, and each ordering is a mutually exclusive
- event, so the probability that any ordering of 
+ event, meaning we can sum the probabilities of the individual orderings,
 so the probability that any ordering of 
 \begin_inset Formula $m$
 \end_inset
@ -544,7 +721,7 @@ reference "eq:simple-failure"
 \end_inset
- assumes that each share has the same probability of survival, but as explained
+ assumes that all shares have the same probability of survival, but as explained
 above, this is not necessarily true.
 A more accurate model allows each share 
 \begin_inset Formula $s_{i}$
@ -570,11 +747,7 @@ reference "eq:simple-failure"
 \end_inset
 and 
-\begin_inset Formula $p_{i}$
+\begin_inset Formula $p_{1},p_{2},\dots,p_{N}$
 \end_inset
 where 
 \begin_inset Formula $1\leq i\leq N$
 \end_inset
 .
@ -589,7 +762,7 @@ The PMF for this generalized
 However, the PMFs for random variables representing individual share survival
 do.
 Let 
-\begin_inset Formula $S_{i}$
+\begin_inset Formula $K_{i}$
 \end_inset
 be a random variable such that:
@ -597,7 +770,7 @@ The PMF for this generalized
 \begin_layout Standard
 \begin_inset Formula \[
-S_{i}=\begin{cases}
+K_{i}=\begin{cases}
 1 & \textnormal{if }s_{i}\textnormal{ survives}\\
 0 & \textnormal{if }s_{i}\textnormal{ fails}\end{cases}\]
@ -608,14 +781,20 @@ S_{i}=\begin{cases}
 \begin_layout Standard
 The PMF for 
-\begin_inset Formula $S_{i}$
+\begin_inset Formula $K_{i}$
 \end_inset
 is very simple: 
 \begin_inset Formula \[
-Pr[S_{i}=j]=\begin{cases}
+Pr[K_{i}=j]=\begin{cases}
-1-p_{i} & j=0\\
+p_{i} & j=1\\
-p_{i} & j=1\end{cases}\]
+1-p_{i} & j=0\end{cases}\]
 \end_inset
 which can also be expressed as
 \begin_inset Formula \[
 Pr[K_{i}=j]=f\left(j\right)=\left(1-p_{i}\right)\left(1-j\right)+p_{i}\left(j\right)\]
 \end_inset
@ -624,7 +803,7 @@ p_{i} & j=1\end{cases}\]
 \begin_layout Standard
 Note that since each 
-\begin_inset Formula $S_{i}$
+\begin_inset Formula $K_{i}$
 \end_inset
 represents the count of shares 
@ -635,16 +814,15 @@ Note that since each
 counts, we get the group survivor count.
 That is:
 \begin_inset Formula \[
-\sum_{i=1}^{N}S_{i}=K\]
+\sum_{i=1}^{N}K_{i}=K\]
 \end_inset
-Effectively, 
+Effectively, we have separated 
 \begin_inset Formula $K$
 \end_inset
- has just been separated into the series of Bernoulli trials that make it
+ into the series of Bernoulli trials that make it
 up.
 \end_layout
 \begin_layout Theorem
@ -709,54 +887,38 @@ where
 \begin_layout Proof
 The proof is beyond the scope of this paper.
 \begin_inset Foot
 status collapsed
 \begin_layout Plain Layout
 \begin_inset Quotes eld
 \end_inset
 Beyond the scope of this paper
 \begin_inset Quotes erd
 \end_inset
 usually means 
 \begin_inset Quotes eld
 \end_inset
 Too long and nasty to bore you with
 \begin_inset Quotes erd
 \end_inset
 .
 In this case it means 
 \begin_inset Quotes eld
 \end_inset
 The author hasn't the foggiest idea why this is true, or how to prove it,
 but reliable authorities say it's real, and in practice it works a treat.
 \begin_inset Quotes erd
 \end_inset
 \end_layout
 \end_inset
 If you don't believe it's true, look it up on Wikipedia, which is never
 wrong.
 \end_layout
 \begin_layout Standard
-Applying the discrete convolution theorem, if 
+If we denote the PMF of 
-\begin_inset Formula $Pr[K=i]=f(i)$
+\begin_inset Formula $K$
 \end_inset
 with 
 \begin_inset Formula $f$
 \end_inset
 and the PMF of 
 \begin_inset Formula $K_{i}$
 \end_inset
 with 
 \begin_inset Formula $g_{i}$
 \end_inset
 (more formally, 
 \begin_inset Formula $Pr[K=x]=f(x)$
 \end_inset
 and 
-\begin_inset Formula $Pr[S_{i}=j]=g_{i}(j)$
+\begin_inset Formula $Pr[K_{i}=x]=g_{i}(x)$
 \end_inset
-, then 
+) then since 
 \begin_inset Formula $K=\sum_{i=1}^{N}K_{i}$
 \end_inset
 , according to the discrete convolution theorem 
 \begin_inset Formula $f=g_{1}\star g_{2}\star g_{3}\star\ldots\star g_{N}$
 \end_inset
@ -777,7 +939,7 @@ Therefore,
 can be computed as a sequence of convolution operations on the simple PMFs
 of the random variables 
-\begin_inset Formula $S_{i}$
+\begin_inset Formula $K_{i}$
 \end_inset
 .
@ -796,8 +958,13 @@ reference "eq:convolution"
 \begin_inset Formula $K$
 \end_inset
- even in the case of the standard binomial distribution, primarily because
+ than the binomial theorem,
- the binomial calculation in equation 
+ even in the case of shares with identical survival probability.
 It is better because the calculation of 
 \begin_inset Formula $\binom{n}{m}$
 \end_inset
 in equation 
 \begin_inset CommandInset ref
 LatexCommand ref
 reference "eq:binomial-pmf"
@ -811,24 +978,20 @@ reference "eq:binomial-pmf"
 \begin_layout Standard
 Note also that it is not necessary to have very simple PMFs like those of
 the 
-\begin_inset Formula $S_{i}$
+\begin_inset Formula $K_{i}$
 \end_inset
 .
 Any share or set of shares that has a known PMF can be combined with any
 other set with a known PMF by convolution, as long as the two share sets
 are independent.
- Since PMFs are easily represented as simple lists of probabilities, where
+ The reverse holds as well; given a group with an empirically-derived PMF,
- the 
+ it's theoretically possible to solve for an individual PMF, and thereby
-\begin_inset Formula $i$
+ determine 
 \begin_inset Formula $p_{i}$
 \end_inset
-th element in the list corresponds to 
+ even when per-share data is unavailable.
 \begin_inset Formula $Pr[K=i]$
 \end_inset
 , these functions are easily managed in software, and computing the convolution
 is both simple and efficient.
 \end_layout
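To make the list representation concrete, here is a minimal Python sketch of discrete convolution over PMFs stored as lists, where element i holds Pr[K=i]. The function names and the three sample survival probabilities are hypothetical; the paper does not prescribe an implementation.

```python
def convolve(f, g):
    """Discrete convolution of two PMFs given as lists of probabilities."""
    result = [0.0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            result[i + j] += fi * gj
    return result

def survival_pmf(probs):
    """PMF of the survivor count for independent shares with the given p_i."""
    pmf = [1.0]  # zero shares: zero survivors with certainty
    for p in probs:
        pmf = convolve(pmf, [1.0 - p, p])  # each share's PMF is [1 - p_i, p_i]
    return pmf

# Hypothetical example: three shares with different survival probabilities.
pmf = survival_pmf([0.9, 0.8, 0.95])
```

Because each step only multiplies and adds probabilities, no large binomial coefficients are ever formed, and any two independent share sets with known PMFs can be combined with a single `convolve` call.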
 \begin_layout Subsection
@ -845,8 +1008,8 @@ name "sub:Multiple-Failure-Modes"
 \begin_layout Standard
 In modeling share survival probabilities, it's useful to be able to analyze
 separately each of the various failure modes.
- If reliable statistics for disk failure can be obtained, then a probability
+ For example, if reliable statistics for disk failure can be obtained, then
- mass function for that form of failure can be generated.
+ a probability mass function for that form of failure can be generated.
 Similarly, statistics on other hardware failures, administrative errors,
 network losses, etc., can all be estimated independently.
 If those estimates can then be combined into a single PMF for a share,
@ -873,7 +1036,7 @@ th failure mode of share
 , then 
 \begin_inset Formula \[
-Pr[S_{i}=k]=f_{i}(k)=\begin{cases}
+Pr[K_{i}=k]=f_{i}(k)=\begin{cases}
 \prod_{j=1}^{m}p_{i,j} & k=1\\
 1-\prod_{j=1}^{m}p_{i,j} & k=0\end{cases}\]
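The per-share PMF above amounts to multiplying the survival probabilities of the independent failure modes. A small Python sketch, with a hypothetical function name and using the RAID and administrative-error figures from the data center example later in the paper:

```python
from math import prod

def share_pmf(mode_survival_probs):
    """Return [Pr[K_i = 0], Pr[K_i = 1]] for a share with independent failure modes."""
    p_survive = prod(mode_survival_probs)  # the share must survive every mode
    return [1.0 - p_survive, p_survive]

# RAID failure (.0002) and administrative error (.003), as survival probabilities.
pmf = share_pmf([1 - 0.0002, 1 - 0.003])
```

The sketch assumes the failure modes are independent of one another, which is what allows the simple product.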
@ -918,12 +1081,12 @@ If there are failure modes that affect multiple computers, we can also construct
 \begin_inset Formula $K$
 \end_inset
-, a random variable representing the number of surviors is
+, a random variable representing the number of survivors is
 \begin_inset Formula \[
-Pr[K=i]=f(i)=\begin{cases}
+Pr[K=k]=f(k)=\begin{cases}
-p & i=n\\
+p & k=n\\
 0 & 0<k<n\\
-1-p & i=0\end{cases}\]
+1-p & k=0\end{cases}\]
 \end_inset
@ -994,7 +1157,7 @@ Four PCs located in random homes, connected to the Internet via assorted
 \end_layout
 \begin_layout Standard
-If one share is placed on each of these 20 computers, what's the probability
+If one share is placed on each of these 12 computers, what's the probability
 mass function of share survival? To more compactly describe PMFs, we'll
 denote them as probability vectors of the form 
 \begin_inset Formula $\left[\alpha_{0},\alpha_{1},\alpha_{2},\ldots,\alpha_{n}\right]$
@ -1012,13 +1175,18 @@ If one share is placed on each of these 20 computers, what's the probability
 \end_layout
 \begin_layout Standard
-The servers in the two data centers have individual survival probabilities
+The servers in the two data centers have individual failure probabilities
- of RAID failure (.0002) and administrative error (.003) giving 
+ of RAID failure (.0002) and administrative error (.003) giving an individual
 survival probability of 
 \begin_inset Formula \[
 (1-.0002)\cdot(1-.003)=.9998\cdot.997=.9968\]
 \end_inset
 \end_layout
 \begin_layout Standard
 Using 
 \begin_inset Formula $p=.9968,n=4$
 \end_inset
@ -1046,26 +1214,30 @@ which applies to each group of four servers.
 \begin_inset Formula $1.049\cdot10^{-10}$
 \end_inset
-) times the probability they all fail because of a network outage (
+) plus the probability they all fail because of a network outage (
 \begin_inset Formula $.0001$
 \end_inset
 ) less the probability they fail for both reasons:
 \begin_inset Formula \[
-\left(1.049\times10^{-10}\right)+\left(0.0001\right)-\left[\left(1.049\times10^{-10}\right)\cdot\left(0.0001\right)\right]=0.0001\]
+\left(1.049\times10^{-10}\right)+\left(0.0001\right)-\left[\left(1.049\times10^{-10}\right)\cdot\left(0.0001\right)\right]\approxeq0.0001\]
 \end_inset
 \end_layout
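The combination of individual and whole-group failure used in this example can be sketched in Python as follows. The function names are hypothetical; the values p = .9968, n = 4 and the network outage probability .0001 are the ones given in the text.

```python
from math import comb

def binomial_pmf_vector(n, p):
    """PMF of the survivor count as a list: element m is Pr[K = m]."""
    return [comb(n, m) * p**m * (1 - p)**(n - m) for m in range(n + 1)]

def with_group_failure(pmf, group_survival):
    """Combine a survivor-count PMF with an independent whole-group failure mode."""
    group_failure = 1.0 - group_survival
    # For i > 0 the group must survive and exactly i members must survive.
    combined = [group_survival * x for x in pmf]
    # For i = 0: all fail individually, or the group fails, less the overlap.
    combined[0] = pmf[0] + group_failure - pmf[0] * group_failure
    return combined

individual = binomial_pmf_vector(4, 0.9968)
combined = with_group_failure(individual, 0.9999)
```

The zero element follows the inclusion-exclusion sum shown above, and the remaining elements are the individual PMF scaled by the probability of surviving the group failure, so the combined vector still sums to 1.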
 \begin_layout Standard
 That's the 
-\begin_inset Formula $0$
+\begin_inset Formula $i=0$
 \end_inset
-th element of the combined PMF.
+ element of the combined PMF.
 The combined probability of survival of 
 \begin_inset Formula $0<i\leq4$
 \end_inset
- servers is simpler: it's the probility they survive individual failure,
+ servers is simpler: it's the probability they survive individual failure,
 from the individual failure PMF above, times the probability they survive
 network failure (.9999).
 So the combined survival PMF, which we'll denote as 
@ -1096,9 +1268,9 @@ Of course, the failures need not be truly simultaneous, they just have happen
 \end_layout
 \begin_layout Standard
-The same process for the Hawaii servers, but with group survival probability
+Applying the same process to the Hawaii servers, but with group survival
- of 
+ probability of 
-\begin_inset Formula $(1-.0001)(1-.02)=.9799$
+\begin_inset Formula $(1-.0001)(1-.02)=.9799$
 \end_inset
 gives the survival PMF 
@ -1107,10 +1279,7 @@ h(i)=\left[0.0201,1.280\times10^{-7},5.982\times10^{-5},0.01242,0.9674\right]\]
 \end_inset
-which has the unusual property that it's more likely that all of the servers
+
 will be lost than that only one will survive.
 This is because in order for exactly one to survive, it's necessary for
 three to have the 
 \end_layout
 \begin_layout Standard
@ -1140,10 +1309,30 @@ Applying the convolution operator to
 \end_inset
-Note the interesting fact that losing four shares is 10,000 times more likely
+
- than losing three.
+\end_layout
- This is because both data centers have a whole-center failure modes, and
+
 \begin_layout Standard
 \begin_inset VSpace defskip
 \end_inset
 \end_layout
 \begin_layout Standard
 Note that losing four shares (
 \begin_inset Formula $i=4$
 \end_inset
 ) is 10,000 times more likely than losing three (
 \begin_inset Formula $i=5$
 \end_inset
 ).
 This is because both data centers have a whole-center failure mode, and
 the Hawaii center's lava burn probability is so high.
 Similarly, the probability of losing all of them is 1000 times higher than
 the probability of losing all but one.
 \end_layout
 \begin_layout Standard
@ -1173,16 +1362,16 @@ reference "eq:binomial-pmf"
 \end_inset
 to compute the PMF 
-\begin_inset Formula $f(i),0\leq i\leq4$
+\begin_inset Formula $g(i),0\leq i\leq4$
 \end_inset
 for the PCs and finally compute 
-\begin_inset Formula $s(i)=\left(f\star\left(n\star h\right)\right)\left(i\right)$
+\begin_inset Formula $f(i)=\left(g\star\left(n\star h\right)\right)\left(i\right)$
 \end_inset
 , the PMF of the whole share set.
 Summing the values of 
-\begin_inset Formula $s(i)$
+\begin_inset Formula $f(i)$
 \end_inset
 for 
@ -1848,7 +2037,7 @@ The table demonstrates the importance of the selection of
 .
 Normally, reducing the number of shares needed for reassembly improves the
- file's chances of survival, but in this case it provides a miniscule gain
+ file's chances of survival, but in this case it provides a minuscule gain
 in reliability at the cost of a 10% increase in bandwidth and storage consumed.
 \end_layout
@ -1898,10 +2087,6 @@ A cheaper repair option is simply to direct some peer that has share
 in the network.
 This is not as good as actually replacing the lost share, though.
 Suppose that more shares were lost, leaving only 
 \begin_inset Formula $ $
 \end_inset
 \begin_inset Formula $k$
 \end_inset
@ -1917,7 +2102,7 @@ A cheaper repair option is simply to direct some peer that has share
 \begin_layout Standard
 However, such cheap repair is not completely pointless; it does increase
 file survivability.
- The question is: By how much?
+ But by how much?
 \end_layout
 \begin_layout Standard
@ -1955,12 +2140,12 @@ More generally, if a single share is deployed on
 \end_inset
 , the share survival count is a random variable 
-\begin_inset Formula $S$
+\begin_inset Formula $K$
 \end_inset
 and the probability of share loss is 
 \begin_inset Formula \[
-Pr[S=0]=(f_{1}\star f_{2}\star\ldots\star f_{n})(0)\]
+Pr[K=0]=(f_{1}\star f_{2}\star\ldots\star f_{n})(0)\]
 \end_inset
@ -2023,7 +2208,7 @@ doubled
 \end_inset
 shares, each with survival probability 
-\begin_inset Formula $2p-p^{2}\approx.99$
+\begin_inset Formula $2p-p^{2}\approxeq.99$
 \end_inset
 .
@ -2066,11 +2251,11 @@ Thus far, we've focused entirely on the probability that a file survives
 The probability that a file survives long-term, though, is also important.
 As long as the probability of failure during a repair period is non-zero,
 a given file will eventually be lost.
- We want to know what the probability of surviving for time 
+ We want to know the probability of surviving for time 
 \begin_inset Formula $T$
 \end_inset
- is, and how the parameters 
+, and how the parameters 
 \begin_inset Formula $A$
 \end_inset
@ -2078,7 +2263,7 @@ Thus far, we've focused entirely on the probability that a file survives
 \begin_inset Formula $L$
 \end_inset
- (share low watermark) affect survival time.
+ (allowed share low watermark) affect survival time.
 \end_layout
 \begin_layout Standard
@ -2113,7 +2298,7 @@ Most survival functions are continuous, but
 \begin_inset Formula $R(t)$
 \end_inset
- is inherently discrete, and stochastic.
+ is inherently discrete and stochastic.
 The time steps are the repair intervals, each of length 
 \begin_inset Formula $A$
 \end_inset
@ -2136,7 +2321,7 @@ Most survival functions are continuous, but
 \end_layout
 \begin_layout Subsection
-Aggressive Repairs
+Aggressive Repair
 \end_layout
 \begin_layout Standard
@ -2148,7 +2333,7 @@ Let's first consider the case of an aggressive repairer.
 \end_inset
 shares, distributed on servers with various individual and group failure
- probalities, which will survive or fail per the output of random variable
+ probabilities, which will survive or fail per the output of random variable
 \begin_inset Formula $K$
 \end_inset
@ -2226,7 +2411,11 @@ So, given a PMF
 \end_inset
 , choose 
-\begin_inset Formula $k:f\left(k\right)\geq\sqrt[t]{r}$
+\begin_inset Formula $k$
 \end_inset
 such that 
 \begin_inset Formula $f\left(k\right)\geq\sqrt[t]{r}$
 \end_inset
 .
@ -2235,7 +2424,7 @@ So, given a PMF
 \end_inset
 is one month, and 
-\begin_inset Formula $r=1-\nicefrac{1}{1000000}$
+\begin_inset Formula $r=1-\nicefrac{1}{10^{6}}$
 \end_inset
 and 
@ -2243,7 +2432,7 @@ So, given a PMF
 \end_inset
 , or 10 years, we calculate 
-\begin_inset Formula $f\left(k\right)\geq\sqrt[120]{.999999}\cong0.999999992$
+\begin_inset Formula $f\left(k\right)\geq\sqrt[120]{.999999}\approx0.999999992$
 \end_inset
 .
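The per-interval requirement can be checked numerically. In this sketch the function name is hypothetical; r is the required long-term survival probability and t the number of repair intervals, as in the text.

```python
def per_interval_target(r, t):
    """Per-interval survival probability needed to reach r after t intervals."""
    return r ** (1.0 / t)

# Ten years of monthly repair intervals with a one-in-a-million loss target.
target = per_interval_target(1 - 1e-6, 120)
```

Raising `target` to the 120th power recovers r, which is a quick sanity check on the root.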
@ -2258,7 +2447,7 @@ reference "tab:Example-PMF"
 \begin_inset Formula $k=2$
 \end_inset
-, achieves the goal, at the cose of a six-fold expansion in stored file
+, achieves the goal, at the cost of a six-fold expansion in stored file
 size.
 If the lesser goal of no more than 
 \begin_inset Formula $\nicefrac{1}{1000}$
@ -2396,9 +2585,9 @@ Since each interval starts with a full complement of shares, the expected
 \end_layout
 \begin_layout Standard
-It may also be useful to discount future cost, since CPU and bandwidth are
+It is also necessary to discount future cost, since CPU and bandwidth are
 both going to get cheaper over time.
- To accomodate this, we throw in an addition per-period discount rate 
+ To accommodate this, we throw in an additional per-period discount rate 
 \begin_inset Formula $r$
 \end_inset
@ -2429,6 +2618,14 @@ If
 this collapses to the previous result, as one would expect.
 \end_layout
 \begin_layout Subsection
 Non-aggressive Repair
 \end_layout
 \begin_layout Standard
 Need to write this.
 \end_layout
 \begin_layout Section
 Time-Sensitive Retrieval
 \end_layout
@ -2436,8 +2633,10 @@ Time-Sensitive Retrieval
 \begin_layout Standard
 The above work has almost entirely ignored the distinction between availability
 and reliability.
- In reality, temporary and permanent failures need to be modeled separately,
+\end_layout
- and 
+
 \begin_layout Standard
 Need to write this.
 \end_layout
 \end_body