mirror of
https://github.com/tahoe-lafs/tahoe-lafs.git
synced 2025-01-11 23:42:38 +00:00
d3d1293d2f
Thanks to David-Sarah Hopwood for the patch.
525 lines
29 KiB
Plaintext
525 lines
29 KiB
Plaintext
|
|
Tahoe-LAFS Architecture
|
|
|
|
(See the docs/specifications directory for more details.)
|
|
|
|
OVERVIEW
|
|
|
|
At a high-level this system consists of three layers: the key-value store,
|
|
the filesystem, and the application.
|
|
|
|
The lowest layer is the key-value store, which is a distributed hashtable
|
|
mapping from capabilities to data. The capabilities are relatively short
|
|
ASCII strings, each used as a reference to an arbitrary-length sequence of
|
|
data bytes, and are like a URI for that data. This data is encrypted and
|
|
distributed across a number of nodes, such that it will survive the loss of
|
|
most of the nodes.
|
|
|
|
The middle layer is the decentralized filesystem: a directed graph in which
|
|
the intermediate nodes are directories and the leaf nodes are files. The leaf
|
|
nodes contain only the file data -- they contain no metadata about the file
|
|
other than the length in bytes. The edges leading to leaf nodes have metadata
|
|
attached to them about the file they point to. Therefore, the same file may
|
|
be associated with different metadata if it is dereferenced through different
|
|
edges.
|
|
|
|
The top layer consists of the applications using the filesystem.
|
|
Allmydata.com uses it for a backup service: the application periodically
|
|
copies files from the local disk onto the decentralized filesystem. We later
|
|
provide read-only access to those files, allowing users to recover them. The
|
|
filesystem can be used by other applications, too.
|
|
|
|
|
|
THE GRID OF STORAGE SERVERS
|
|
|
|
A key-value store is implemented by a collection of peer nodes -- processes
|
|
running on computers -- called a "grid". (The term "grid" is also used
|
|
loosely for the filesystem supported by these nodes.) The nodes in a grid
|
|
establish TCP connections to each other using Foolscap, a secure
|
|
remote-message-passing library.
|
|
|
|
Each node offers certain services to the others. The primary service is that
|
|
of the storage server, which holds data in the form of "shares". Shares are
|
|
encoded pieces of files. There are a configurable number of shares for each
|
|
file, 10 by default. Normally, each share is stored on a separate server, but
|
|
a single server can hold multiple shares for a single file.
|
|
|
|
Nodes learn about each other through an "introducer". Each node connects to a
|
|
central introducer at startup, and receives a list of all other nodes from
|
|
it. Each node then connects to all other nodes, creating a fully-connected
|
|
topology. In the current release, nodes behind NAT boxes will connect to all
|
|
nodes that they can open connections to, but they cannot open connections to
|
|
other nodes behind NAT boxes. Therefore, the more nodes behind NAT boxes, the
|
|
less the topology resembles the intended fully-connected topology.
|
|
|
|
The introducer in nominally a single point of failure, in that clients who
|
|
never see the introducer will be unable to connect to any storage servers.
|
|
But once a client has been introduced to everybody, they do not need the
|
|
introducer again until they are restarted. The danger of a SPOF is further
|
|
reduced in other ways. First, the introducer is defined by a hostname and a
|
|
private key, which are easy to move to a new host in case the original one
|
|
suffers an unrecoverable hardware problem. Second, even if the private key is
|
|
lost, clients can be reconfigured with a new introducer.furl that points to a
|
|
new one. Finally, we have plans to decentralize introduction, allowing any
|
|
node to tell a new client about all the others. With decentralized
|
|
"gossip-based" introduction, simply knowing how to contact any one node will
|
|
be enough to contact all of them.
|
|
|
|
|
|
FILE ENCODING
|
|
|
|
When a node stores a file on its grid, it first encrypts the file, using a key
|
|
that is optionally derived from the hash of the file itself. It then segments
|
|
the encrypted file into small pieces, in order to reduce the memory footprint,
|
|
and to decrease the lag between initiating a download and receiving the first
|
|
part of the file; for example the lag between hitting "play" and a movie
|
|
actually starting.
|
|
|
|
The node then erasure-codes each segment, producing blocks such that only a
|
|
subset of them are needed to reconstruct the segment. It sends one block from
|
|
each segment to a given server. The set of blocks on a given server
|
|
constitutes a "share". Only a subset of the shares (3 out of 10, by default)
|
|
are needed to reconstruct the file.
|
|
|
|
A tagged hash of the encryption key is used to form the "storage index", which
|
|
is used for both server selection (described below) and to index shares within
|
|
the Storage Servers on the selected nodes.
|
|
|
|
Hashes are computed while the shares are being produced, to validate the
|
|
ciphertext and the shares themselves. Merkle hash trees are used to enable
|
|
validation of individual segments of ciphertext without requiring the
|
|
download/decoding of the whole file. These hashes go into the "Capability
|
|
Extension Block", which will be stored with each share.
|
|
|
|
The capability contains the encryption key, the hash of the Capability
|
|
Extension Block, and any encoding parameters necessary to perform the eventual
|
|
decoding process. For convenience, it also contains the size of the file
|
|
being stored.
|
|
|
|
|
|
On the download side, the node that wishes to turn a capability into a
|
|
sequence of bytes will obtain the necessary shares from remote nodes, break
|
|
them into blocks, use erasure-decoding to turn them into segments of
|
|
ciphertext, use the decryption key to convert that into plaintext, then emit
|
|
the plaintext bytes to the output target (which could be a file on disk, or it
|
|
could be streamed directly to a web browser or media player).
|
|
|
|
All hashes use SHA-256, and a different tag is used for each purpose.
|
|
Netstrings are used where necessary to insure these tags cannot be confused
|
|
with the data to be hashed. All encryption uses AES in CTR mode. The erasure
|
|
coding is performed with zfec.
|
|
|
|
A Merkle Hash Tree is used to validate the encoded blocks before they are fed
|
|
into the decode process, and a transverse tree is used to validate the shares
|
|
as they are retrieved. A third merkle tree is constructed over the plaintext
|
|
segments, and a fourth is constructed over the ciphertext segments. All
|
|
necessary hashes are stored with the shares, and the hash tree roots are put
|
|
in the Capability Extension Block. The final hash of the extension block goes
|
|
into the capability itself.
|
|
|
|
Note that the number of shares created is fixed at the time the file is
|
|
uploaded: it is not possible to create additional shares later. The use of a
|
|
top-level hash tree also requires that nodes create all shares at once, even
|
|
if they don't intend to upload some of them, otherwise the hashroot cannot be
|
|
calculated correctly.
|
|
|
|
|
|
CAPABILITIES
|
|
|
|
Capabilities to immutable files represent a specific set of bytes. Think of
|
|
it like a hash function: you feed in a bunch of bytes, and you get out a
|
|
capability, which is deterministically derived from the input data: changing
|
|
even one bit of the input data will result in a completely different
|
|
capability.
|
|
|
|
Read-only capabilities to mutable files represent the ability to get a set of
|
|
bytes representing some version of the file, most likely the latest version.
|
|
Each read-only capability is unique. In fact, each mutable file has a unique
|
|
public/private key pair created when the mutable file is created, and the
|
|
read-only capability to that file includes a secure hash of the public key.
|
|
|
|
Read-write capabilities to mutable files represent the ability to read the
|
|
file (just like a read-only capability) and also to write a new version of
|
|
the file, overwriting any extant version. Read-write capabilities are unique
|
|
-- each one includes the secure hash of the private key associated with that
|
|
mutable file.
|
|
|
|
The capability provides both "location" and "identification": you can use it
|
|
to retrieve a set of bytes, and then you can use it to validate ("identify")
|
|
that these potential bytes are indeed the ones that you were looking for.
|
|
|
|
The "key-value store" layer is insufficient to provide a usable filesystem,
|
|
which requires human-meaningful names. Capabilities sit on the
|
|
"global+secure" edge of Zooko's Triangle[1]. They are self-authenticating,
|
|
meaning that nobody can trick you into using a file that doesn't match the
|
|
capability you used to refer to that file.
|
|
|
|
|
|
SERVER SELECTION
|
|
|
|
When a file is uploaded, the encoded shares are sent to other nodes. But to
|
|
which ones? The "server selection" algorithm is used to make this choice.
|
|
|
|
In the current version, the storage index is used to consistently-permute the
|
|
set of all peer nodes (by sorting the peer nodes by
|
|
HASH(storage_index+peerid)). Each file gets a different permutation, which
|
|
(on average) will evenly distribute shares among the grid and avoid hotspots.
|
|
|
|
We use this permuted list of nodes to ask each node, in turn, if it will hold
|
|
a share for us, by sending an 'allocate_buckets() query' to each one. Some
|
|
will say yes, others (those who are full) will say no: when a node refuses our
|
|
request, we just take that share to the next node on the list. We keep going
|
|
until we run out of shares to place. At the end of the process, we'll have a
|
|
table that maps each share number to a node, and then we can begin the
|
|
encode+push phase, using the table to decide where each share should be sent.
|
|
|
|
Most of the time, this will result in one share per node, which gives us
|
|
maximum reliability (since it disperses the failures as widely as possible).
|
|
If there are fewer useable nodes than there are shares, we'll be forced to
|
|
loop around, eventually giving multiple shares to a single node. This reduces
|
|
reliability, so it isn't the sort of thing we want to happen all the time, and
|
|
either indicates that the default encoding parameters are set incorrectly
|
|
(creating more shares than you have nodes), or that the grid does not have
|
|
enough space (many nodes are full). But apart from that, it doesn't hurt. If
|
|
we have to loop through the node list a second time, we accelerate the query
|
|
process, by asking each node to hold multiple shares on the second pass. In
|
|
most cases, this means we'll never send more than two queries to any given
|
|
node.
|
|
|
|
If a node is unreachable, or has an error, or refuses to accept any of our
|
|
shares, we remove them from the permuted list, so we won't query them a second
|
|
time for this file. If a node already has shares for the file we're uploading
|
|
(or if someone else is currently sending them shares), we add that information
|
|
to the share-to-peer-node table. This lets us do less work for files which have
|
|
been uploaded once before, while making sure we still wind up with as many
|
|
shares as we desire.
|
|
|
|
If we are unable to place every share that we want, but we still managed to
|
|
place a quantity known as "shares of happiness", we'll do the upload anyways.
|
|
If we cannot place at least this many, the upload is declared a failure.
|
|
|
|
The current defaults use k=3, shares_of_happiness=7, and N=10, meaning that
|
|
we'll try to place 10 shares, we'll be happy if we can place 7, and we need to
|
|
get back any 3 to recover the file. This results in a 3.3x expansion
|
|
factor. In general, you should set N about equal to the number of nodes in
|
|
your grid, then set N/k to achieve your desired availability goals.
|
|
|
|
When downloading a file, the current release just asks all known nodes for any
|
|
shares they might have, chooses the minimal necessary subset, then starts
|
|
downloading and processing those shares. A later release will use the full
|
|
algorithm to reduce the number of queries that must be sent out. This
|
|
algorithm uses the same consistent-hashing permutation as on upload, but stops
|
|
after it has located k shares (instead of all N). This reduces the number of
|
|
queries that must be sent before downloading can begin.
|
|
|
|
The actual number of queries is directly related to the availability of the
|
|
nodes and the degree of overlap between the node list used at upload and at
|
|
download. For stable grids, this overlap is very high, and usually the first k
|
|
queries will result in shares. The number of queries grows as the stability
|
|
decreases. Some limits may be imposed in large grids to avoid querying a
|
|
million nodes; this provides a tradeoff between the work spent to discover
|
|
that a file is unrecoverable and the probability that a retrieval will fail
|
|
when it could have succeeded if we had just tried a little bit harder. The
|
|
appropriate value of this tradeoff will depend upon the size of the grid, and
|
|
will change over time.
|
|
|
|
Other peer-node selection algorithms are possible. One earlier version (known
|
|
as "Tahoe 3") used the permutation to place the nodes around a large ring,
|
|
distributed shares evenly around the same ring, then walks clockwise from 0
|
|
with a basket: each time we encounter a share, put it in the basket, each time
|
|
we encounter a node, give them as many shares from our basket as they'll
|
|
accept. This reduced the number of queries (usually to 1) for small grids
|
|
(where N is larger than the number of nodes), but resulted in extremely
|
|
non-uniform share distribution, which significantly hurt reliability
|
|
(sometimes the permutation resulted in most of the shares being dumped on a
|
|
single node).
|
|
|
|
Another algorithm (known as "denver airport"[2]) uses the permuted hash to
|
|
decide on an approximate target for each share, then sends lease requests via
|
|
Chord routing. The request includes the contact information of the uploading
|
|
node, and asks that the node which eventually accepts the lease should contact
|
|
the uploader directly. The shares are then transferred over direct connections
|
|
rather than through multiple Chord hops. Download uses the same approach. This
|
|
allows nodes to avoid maintaining a large number of long-term connections, at
|
|
the expense of complexity, latency, and reliability.
|
|
|
|
|
|
SWARMING DOWNLOAD, TRICKLING UPLOAD
|
|
|
|
Because the shares being downloaded are distributed across a large number of
|
|
nodes, the download process will pull from many of them at the same time. The
|
|
current encoding parameters require 3 shares to be retrieved for each segment,
|
|
which means that up to 3 nodes will be used simultaneously. For larger
|
|
networks, 8-of-22 encoding could be used, meaning 8 nodes can be used
|
|
simultaneously. This allows the download process to use the sum of the
|
|
available nodes' upload bandwidths, resulting in downloads that take full
|
|
advantage of the common 8x disparity between download and upload bandwith on
|
|
modern ADSL lines.
|
|
|
|
On the other hand, uploads are hampered by the need to upload encoded shares
|
|
that are larger than the original data (3.3x larger with the current default
|
|
encoding parameters), through the slow end of the asymmetric connection. This
|
|
means that on a typical 8x ADSL line, uploading a file will take about 32
|
|
times longer than downloading it again later.
|
|
|
|
Smaller expansion ratios can reduce this upload penalty, at the expense of
|
|
reliability. See RELIABILITY, below. By using an "upload helper", this penalty
|
|
is eliminated: the client does a 1x upload of encrypted data to the helper,
|
|
then the helper performs encoding and pushes the shares to the storage
|
|
servers. This is an improvement if the helper has significantly higher upload
|
|
bandwidth than the client, so it makes the most sense for a commercially-run
|
|
grid for which all of the storage servers are in a colo facility with high
|
|
interconnect bandwidth. In this case, the helper is placed in the same
|
|
facility, so the helper-to-storage-server bandwidth is huge.
|
|
|
|
See "helper.txt" for details about the upload helper.
|
|
|
|
|
|
THE FILESYSTEM LAYER
|
|
|
|
The "filesystem" layer is responsible for mapping human-meaningful pathnames
|
|
(directories and filenames) to pieces of data. The actual bytes inside these
|
|
files are referenced by capability, but the filesystem layer is where the
|
|
directory names, file names, and metadata are kept.
|
|
|
|
The filesystem layer is a graph of directories. Each directory contains a
|
|
table of named children. These children are either other directories or
|
|
files. All children are referenced by their capability.
|
|
|
|
A directory has two forms of capability: read-write caps and read-only
|
|
caps. The table of children inside the directory has a read-write and
|
|
read-only capability for each child. If you have a read-only capability for a
|
|
given directory, you will not be able to access the read-write capability of
|
|
its children. This results in "transitively read-only" directory access.
|
|
|
|
By having two different capabilities, you can choose which you want to share
|
|
with someone else. If you create a new directory and share the read-write
|
|
capability for it with a friend, then you will both be able to modify its
|
|
contents. If instead you give them the read-only capability, then they will
|
|
*not* be able to modify the contents. Any capability that you receive can be
|
|
linked in to any directory that you can modify, so very powerful
|
|
shared+published directory structures can be built from these components.
|
|
|
|
This structure enable individual users to have their own personal space, with
|
|
links to spaces that are shared with specific other users, and other spaces
|
|
that are globally visible.
|
|
|
|
|
|
LEASES, REFRESHING, GARBAGE COLLECTION
|
|
|
|
When a file or directory in the virtual filesystem is no longer referenced,
|
|
the space that its shares occupied on each storage server can be freed,
|
|
making room for other shares. Tahoe currently uses a garbage collection
|
|
("GC") mechanism to implement this space-reclamation process. Each share has
|
|
one or more "leases", which are managed by clients who want the
|
|
file/directory to be retained. The storage server accepts each share for a
|
|
pre-defined period of time, and is allowed to delete the share if all of the
|
|
leases are cancelled or allowed to expire.
|
|
|
|
Garbage collection is not enabled by default: storage servers will not delete
|
|
shares without being explicitly configured to do so. When GC is enabled,
|
|
clients are responsible for renewing their leases on a periodic basis at
|
|
least frequently enough to prevent any of the leases from expiring before the
|
|
next renewal pass.
|
|
|
|
See docs/garbage-collection.txt for further information, and how to configure
|
|
garbage collection.
|
|
|
|
|
|
FILE REPAIRER
|
|
|
|
Shares may go away because the storage server hosting them has suffered a
|
|
failure: either temporary downtime (affecting availability of the file), or a
|
|
permanent data loss (affecting the reliability of the file). Hard drives
|
|
crash, power supplies explode, coffee spills, and asteroids strike. The goal
|
|
of a robust distributed filesystem is to survive these setbacks.
|
|
|
|
To work against this slow, continual loss of shares, a File Checker is used to
|
|
periodically count the number of shares still available for any given file. A
|
|
more extensive form of checking known as the File Verifier can download the
|
|
ciphertext of the target file and perform integrity checks (using strong
|
|
hashes) to make sure the data is stil intact. When the file is found to have
|
|
decayed below some threshold, the File Repairer can be used to regenerate and
|
|
re-upload the missing shares. These processes are conceptually distinct (the
|
|
repairer is only run if the checker/verifier decides it is necessary), but in
|
|
practice they will be closely related, and may run in the same process.
|
|
|
|
The repairer process does not get the full capability of the file to be
|
|
maintained: it merely gets the "repairer capability" subset, which does not
|
|
include the decryption key. The File Verifier uses that data to find out which
|
|
nodes ought to hold shares for this file, and to see if those nodes are still
|
|
around and willing to provide the data. If the file is not healthy enough, the
|
|
File Repairer is invoked to download the ciphertext, regenerate any missing
|
|
shares, and upload them to new nodes. The goal of the File Repairer is to
|
|
finish up with a full set of "N" shares.
|
|
|
|
There are a number of engineering issues to be resolved here. The bandwidth,
|
|
disk IO, and CPU time consumed by the verification/repair process must be
|
|
balanced against the robustness that it provides to the grid. The nodes
|
|
involved in repair will have very different access patterns than normal nodes,
|
|
such that these processes may need to be run on hosts with more memory or
|
|
network connectivity than usual. The frequency of repair will directly affect
|
|
the resources consumed. In some cases, verification of multiple files can be
|
|
performed at the same time, and repair of files can be delegated off to other
|
|
nodes.
|
|
|
|
The security model we are currently using assumes that nodes who claim to hold
|
|
a share will actually provide it when asked. (We validate the data they
|
|
provide before using it in any way, but if enough nodes claim to hold the data
|
|
and are wrong, the file will not be repaired, and may decay beyond
|
|
recoverability). There are several interesting approaches to mitigate this
|
|
threat, ranging from challenges to provide a keyed hash of the allegedly-held
|
|
data (using "buddy nodes", in which two nodes hold the same block, and check
|
|
up on each other), to reputation systems, or even the original Mojo Nation
|
|
economic model.
|
|
|
|
|
|
SECURITY
|
|
|
|
The design goal for this project is that an attacker may be able to deny
|
|
service (i.e. prevent you from recovering a file that was uploaded earlier)
|
|
but can accomplish none of the following three attacks:
|
|
|
|
1) violate confidentiality: the attacker gets to view data to which you have
|
|
not granted them access
|
|
2) violate consistency: the attacker convinces you that the wrong data is
|
|
actually the data you were intending to retrieve
|
|
3) violate mutability: the attacker gets to modify a directory (either the
|
|
pathnames or the file contents) to which you have not given them
|
|
mutability rights
|
|
|
|
Data validity and consistency (the promise that the downloaded data will match
|
|
the originally uploaded data) is provided by the hashes embedded in the
|
|
capability. Data confidentiality (the promise that the data is only readable
|
|
by people with the capability) is provided by the encryption key embedded in
|
|
the capability. Data availability (the hope that data which has been uploaded
|
|
in the past will be downloadable in the future) is provided by the grid, which
|
|
distributes failures in a way that reduces the correlation between individual
|
|
node failure and overall file recovery failure, and by the erasure-coding
|
|
technique used to generate shares.
|
|
|
|
Many of these security properties depend upon the usual cryptographic
|
|
assumptions: the resistance of AES and RSA to attack, the resistance of
|
|
SHA-256 to pre-image attacks, and upon the proximity of 2^-128 and 2^-256 to
|
|
zero. A break in AES would allow a confidentiality violation, a pre-image
|
|
break in SHA-256 would allow a consistency violation, and a break in RSA
|
|
would allow a mutability violation. The discovery of a collision in SHA-256
|
|
is unlikely to allow much, but could conceivably allow a consistency
|
|
violation in data that was uploaded by the attacker. If SHA-256 is
|
|
threatened, further analysis will be warranted.
|
|
|
|
There is no attempt made to provide anonymity, neither of the origin of a
|
|
piece of data nor the identity of the subsequent downloaders. In general,
|
|
anyone who already knows the contents of a file will be in a strong position
|
|
to determine who else is uploading or downloading it. Also, it is quite easy
|
|
for a sufficiently large coalition of nodes to correlate the set of nodes who
|
|
are all uploading or downloading the same file, even if the attacker does not
|
|
know the contents of the file in question.
|
|
|
|
Also note that the file size and (when convergence is being used) a keyed hash
|
|
of the plaintext are not protected. Many people can determine the size of the
|
|
file you are accessing, and if they already know the contents of a given file,
|
|
they will be able to determine that you are uploading or downloading the same
|
|
one.
|
|
|
|
A likely enhancement is the ability to use distinct encryption keys for each
|
|
file, avoiding the file-correlation attacks at the expense of increased
|
|
storage consumption. This is known as "non-convergent" encoding.
|
|
|
|
The capability-based security model is used throughout this project. Directory
|
|
operations are expressed in terms of distinct read- and write- capabilities.
|
|
Knowing the read-capability of a file is equivalent to the ability to read the
|
|
corresponding data. The capability to validate the correctness of a file is
|
|
strictly weaker than the read-capability (possession of read-capability
|
|
automatically grants you possession of validate-capability, but not vice
|
|
versa). These capabilities may be expressly delegated (irrevocably) by simply
|
|
transferring the relevant secrets.
|
|
|
|
The application layer can provide whatever access model is desired, built on
|
|
top of this capability access model. The first big user of this system so far
|
|
is allmydata.com. The allmydata.com access model currently works like a
|
|
normal web site, using username and password to give a user access to her
|
|
virtual drive. In addition, allmydata.com users can share individual files
|
|
(using a file sharing interface built on top of the immutable file read
|
|
capabilities).
|
|
|
|
|
|
RELIABILITY
|
|
|
|
File encoding and peer-node selection parameters can be adjusted to achieve
|
|
different goals. Each choice results in a number of properties; there are many
|
|
tradeoffs.
|
|
|
|
First, some terms: the erasure-coding algorithm is described as K-out-of-N
|
|
(for this release, the default values are K=3 and N=10). Each grid will have
|
|
some number of nodes; this number will rise and fall over time as nodes join,
|
|
drop out, come back, and leave forever. Files are of various sizes, some are
|
|
popular, others are rare. Nodes have various capacities, variable
|
|
upload/download bandwidths, and network latency. Most of the mathematical
|
|
models that look at node failure assume some average (and independent)
|
|
probability 'P' of a given node being available: this can be high (servers
|
|
tend to be online and available >90% of the time) or low (laptops tend to be
|
|
turned on for an hour then disappear for several days). Files are encoded in
|
|
segments of a given maximum size, which affects memory usage.
|
|
|
|
The ratio of N/K is the "expansion factor". Higher expansion factors improve
|
|
reliability very quickly (the binomial distribution curve is very sharp), but
|
|
consumes much more grid capacity. When P=50%, the absolute value of K affects
|
|
the granularity of the binomial curve (1-out-of-2 is much worse than
|
|
50-out-of-100), but high values asymptotically approach a constant (i.e.
|
|
500-of-1000 is not much better than 50-of-100). When P is high and the
|
|
expansion factor is held at a constant, higher values of K and N give much
|
|
better reliability (for P=99%, 50-out-of-100 is much much better than 5-of-10,
|
|
roughly 10^50 times better), because there are more shares that can be lost
|
|
without losing the file.
|
|
|
|
Likewise, the total number of nodes in the network affects the same
|
|
granularity: having only one node means a single point of failure, no matter
|
|
how many copies of the file you make. Independent nodes (with uncorrelated
|
|
failures) are necessary to hit the mathematical ideals: if you have 100 nodes
|
|
but they are all in the same office building, then a single power failure will
|
|
take out all of them at once. The "Sybil Attack" is where a single attacker
|
|
convinces you that they are actually multiple servers, so that you think you
|
|
are using a large number of independent nodes, but in fact you have a single
|
|
point of failure (where the attacker turns off all their machines at
|
|
once). Large grids, with lots of truly independent nodes, will enable the use
|
|
of lower expansion factors to achieve the same reliability, but will increase
|
|
overhead because each node needs to know something about every other, and the
|
|
rate at which nodes come and go will be higher (requiring network maintenance
|
|
traffic). Also, the File Repairer work will increase with larger grids,
|
|
although then the job can be distributed out to more nodes.
|
|
|
|
Higher values of N increase overhead: more shares means more Merkle hashes
|
|
that must be included with the data, and more nodes to contact to retrieve the
|
|
shares. Smaller segment sizes reduce memory usage (since each segment must be
|
|
held in memory while erasure coding runs) and improves "alacrity" (since
|
|
downloading can validate a smaller piece of data faster, delivering it to the
|
|
target sooner), but also increase overhead (because more blocks means more
|
|
Merkle hashes to validate them).
|
|
|
|
In general, small private grids should work well, but the participants will
|
|
have to decide between storage overhead and reliability. Large stable grids
|
|
will be able to reduce the expansion factor down to a bare minimum while still
|
|
retaining high reliability, but large unstable grids (where nodes are coming
|
|
and going very quickly) may require more repair/verification bandwidth than
|
|
actual upload/download traffic.
|
|
|
|
Tahoe nodes that run a webserver have a page dedicated to provisioning
|
|
decisions: this tool may help you evaluate different expansion factors and
|
|
view the disk consumption of each. It is also acquiring some sections with
|
|
availability/reliability numbers, as well as preliminary cost analysis data.
|
|
This tool will continue to evolve as our analysis improves.
|
|
|
|
------------------------------
|
|
|
|
[1]: http://en.wikipedia.org/wiki/Zooko%27s_triangle
|
|
|
|
[2]: all of these names are derived from the location where they were
|
|
concocted, in this case in a car ride from Boulder to DEN. To be
|
|
precise, "Tahoe 1" was an unworkable scheme in which everyone who holds
|
|
shares for a given file would form a sort of cabal which kept track of
|
|
all the others, "Tahoe 2" is the first-100-nodes in the permuted hash
|
|
described in this document, and "Tahoe 3" (or perhaps "Potrero hill 1")
|
|
was the abandoned ring-with-many-hands approach.
|
|
|