Allmydata "Tahoe" Architecture


OVERVIEW

The high-level view of this system consists of three layers: the grid, the
virtual drive, and the application that sits on top.

The lowest layer is the "grid", basically a DHT (Distributed Hash Table)
which maps URIs to data. The URIs are relatively short ASCII strings
(currently about 140 bytes), and each is used as a reference to an immutable
arbitrary-length sequence of data bytes. This data is distributed around the
grid in a large number of nodes, such that a statistically unlikely number
of nodes would have to be unavailable for the data to become unavailable.

The middle layer is the virtual drive: a tree-shaped data structure in which
the intermediate nodes are directories and the leaf nodes are files. Each
file node contains both the URI of the file's data and all the necessary
metadata (MIME type, filename, ctime/mtime, etc) required to present the file
to a user in a meaningful way (displaying it in a web browser, or on a
desktop).

The top layer is where the applications that use this virtual drive operate.
Allmydata uses this for a backup service, in which the application copies the
files to be backed up from the local disk into the virtual drive on a
periodic basis. By providing read-only access to the same virtual drive
later, a user can recover older versions of their files. Other sorts of
applications can run on top of the virtual drive as well: anything that has
a use for a secure, robust, distributed filestore.

Note: some of the description below indicates design targets rather than
actual code present in the current release. Please take a look at roadmap.txt
to get an idea of how much of this has been implemented so far.


THE BIG GRID OF PEERS

Underlying the grid is a large collection of peer nodes. These are processes
running on a wide variety of computers, all of which know about each other in
some way or another. They establish TCP connections to one another using
Foolscap, an encrypted+authenticated remote message passing library (using
TLS connections and self-authenticating identifiers called "FURLs").

Each peer offers certain services to the others. The primary service is the
StorageServer, which offers to hold data for a limited period of time (a
"lease"). Each StorageServer has a quota, and it will reject lease requests
that would cause it to consume more space than it wants to provide. When a
lease expires, the data is deleted. Peers might renew their leases.

This storage is used to hold "shares", which are themselves used to store
files in the grid. There are many shares for each file, typically around 100
(the exact number depends upon the tradeoffs made between reliability,
overhead, and storage space consumed). The files are indexed by a piece of
the URI called the "verifierid", which is derived from the contents of the
file. Leases are indexed by verifierid, and a single StorageServer may hold
multiple shares for the corresponding file. Multiple peers can hold leases on
the same file, in which case the shares will be kept alive until the last
lease expires. The typical lease is expected to be for one month: enough time
for interested parties to renew it, but not so long that abandoned data
consumes unreasonable space. Peers are expected to "delete" (drop leases on)
data that they know they no longer want: lease expiration is meant as a
safety measure.

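The lease rules above (quota check, several holders per verifierid, data
kept until the last lease is gone) can be sketched as a toy model. This is
only an illustration: the class name LeaseTable, its methods, and the use of
plain numbers for time are assumptions made here, not the StorageServer
interface.

```python
class LeaseTable:
    """Toy model of lease bookkeeping on one StorageServer (illustrative)."""

    def __init__(self, quota_bytes):
        self.quota_bytes = quota_bytes
        self.used_bytes = 0
        self.sizes = {}    # verifierid -> bytes held for that file
        self.leases = {}   # verifierid -> {holder: expiry time}

    def request_lease(self, verifierid, holder, size, now, duration):
        """Grant or renew a lease; reject new data that would exceed the quota."""
        if verifierid not in self.sizes:
            if self.used_bytes + size > self.quota_bytes:
                return False
            self.sizes[verifierid] = size
            self.used_bytes += size
            self.leases[verifierid] = {}
        # A second holder of the same file adds a lease but no extra storage.
        self.leases[verifierid][holder] = now + duration
        return True

    def cancel_lease(self, verifierid, holder, now):
        """The "delete" operation: drop one holder's lease explicitly."""
        self.leases.get(verifierid, {}).pop(holder, None)
        self.expire(now)

    def expire(self, now):
        """Drop expired leases; delete the data once no lease remains."""
        for vid in list(self.leases):
            live = {h: t for h, t in self.leases[vid].items() if t > now}
            if live:
                self.leases[vid] = live
            else:
                del self.leases[vid]
                self.used_bytes -= self.sizes.pop(vid)
```

In this model the lease count acts as a reference count: cancelling one
holder's lease leaves the shares in place for as long as any other holder's
lease is still live.
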
In this release, peers learn about each other through the "introducer". Each
peer connects to this central introducer at startup, and receives a list of
all other peers from it. Each peer then connects to all other peers, creating
a fully-connected topology. Future versions will reduce the number of
connections considerably, to enable the grid to scale to larger sizes: the
design target is one million nodes. In addition, future versions will offer
relay and NAT-traversal services to allow nodes without full internet
connectivity to participate. In the current release, nodes behind NAT boxes
will connect to all nodes that they can open connections to, but they cannot
open connections to other nodes behind NAT boxes. Therefore, the more nodes
there are behind NAT boxes the less the topology resembles the intended
fully-connected mesh topology. (See also
http://allmydata.org/trac/tahoe/ticket/22 ).


FILE ENCODING

When a file is to be added to the grid, it is first encrypted using a key
that is derived from the hash of the file itself. The encrypted file is then
broken up into segments so it can be processed in small pieces (to minimize
the memory footprint of both encode and decode operations, and to increase
the so-called "alacrity": how quickly the download operation can provide
validated data to the user, basically the lag between hitting "play" and the
movie actually starting). Each segment is erasure coded, which creates a set
of encoded blocks that together are larger than the input segment, such that
only a subset of the output blocks are required to reconstruct the segment.
These blocks are then combined into "shares", such that a subset of the
shares can be used to reconstruct the whole file. The shares are then
deposited in StorageServers in other peers.

A tagged hash of the original file is called the "fileid", while a
differently-tagged hash of the original file provides the encryption key. A
tagged hash of the *encrypted* file is called the "verifierid", and is used
for both peer selection (described below) and to index shares within the
StorageServers on the selected peers.

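A minimal sketch of how the three tagged hashes relate, assuming SHA256 over
a netstring-framed tag followed by the data. The tag strings, the 16-byte key
length, and the encrypt callable are placeholders invented for illustration;
the real tags and framing are not given in this document.

```python
import hashlib

def netstring(data: bytes) -> bytes:
    """Frame bytes as a netstring: b"<length>:<data>,"."""
    return b"%d:%s," % (len(data), data)

def tagged_hash(tag: bytes, data: bytes) -> bytes:
    """SHA256 over a netstring-framed tag followed by the data."""
    return hashlib.sha256(netstring(tag) + data).digest()

# Hypothetical tag strings, chosen only so that each purpose gets its own tag.
FILEID_TAG = b"example_fileid_tag"
KEY_TAG = b"example_encryption_key_tag"
VERIFIERID_TAG = b"example_verifierid_tag"

def derive_ids(plaintext: bytes, encrypt):
    """Return (fileid, key, verifierid) for one file.

    `encrypt(key, plaintext)` stands in for the cipher and must return the
    crypttext as bytes. Truncating the key to 16 bytes is an assumption.
    """
    fileid = tagged_hash(FILEID_TAG, plaintext)
    key = tagged_hash(KEY_TAG, plaintext)[:16]
    crypttext = encrypt(key, plaintext)
    verifierid = tagged_hash(VERIFIERID_TAG, crypttext)
    return fileid, key, verifierid
```

Because every value is derived from the file contents alone, two peers that
upload the same file arrive at the same fileid, key, and verifierid.
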
The URI contains the fileid, the verifierid, the encryption key, any encoding
parameters necessary to perform the eventual decoding process, and some
additional hashes that allow the download process to validate the data it
receives.

On the download side, the node that wishes to turn a URI into a sequence of
bytes will obtain the necessary shares from remote nodes, break them into
blocks, use erasure-decoding to turn them into segments of crypttext, use the
decryption key to convert that into plaintext, then emit the plaintext bytes
to the output target (which could be a file on disk, or it could be streamed
directly to a web browser or media player).

All hashes use SHA256, and a different tag is used for each purpose.
Netstrings are used where necessary to ensure these tags cannot be confused
with the data to be hashed. All encryption uses AES in CTR mode. The erasure
coding is performed with zfec (a Python wrapper around Rizzo's FEC library).
A Merkle Hash Tree is used to validate the encoded blocks before they are fed
into the decode process, and a second tree is used to validate the shares
as they are retrieved. The hash tree root is put into the URI.

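For readers unfamiliar with Merkle hash trees, the following sketch computes
a root over a list of blocks. The leaf and node prefixes and the rule of
duplicating the last hash on odd-sized levels are assumptions made for this
example; the tree layout actually used by the encoder is not specified here.

```python
import hashlib

def leaf_hash(block: bytes) -> bytes:
    # Distinct prefixes keep leaf hashes and interior hashes from colliding.
    return hashlib.sha256(b"leaf:" + block).digest()

def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"node:" + left + right).digest()

def merkle_root(blocks):
    """Hash the blocks pairwise, level by level, until one root remains."""
    if not blocks:
        raise ValueError("need at least one block")
    level = [leaf_hash(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # padding rule is an assumption
        level = [node_hash(level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]
```

Changing any single block changes the root, so a downloader who trusts the
root (because it came from the URI) can check each block against it using
only a short chain of sibling hashes.
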
Note that the number of shares created is fixed at the time the file is
uploaded: it is not possible to create additional shares later. The use of a
top-level hash tree also requires that nodes create all shares at once, even
if they don't intend to upload some of them, otherwise the hashroot cannot be
calculated correctly.


URIs

Each URI represents a specific set of bytes. Think of it like a hash
function: you feed in a bunch of bytes, and you get out a URI. The URI is
deterministically derived from the input data: changing even one bit of the
input data will result in a drastically different URI. The URI provides both
"identification" and "location": you can use it to locate/retrieve a set of
bytes that are probably the same as the original file, and then you can use
it to validate that these potential bytes are indeed the ones that you were
looking for.

URIs refer to an immutable set of bytes. If you modify a file and upload the
new version to the grid, you will get a different URI. URIs do not represent
filenames at all, just the data that a filename might point to at some given
point in time. This is why the "grid" layer is insufficient to provide a
virtual drive: an actual filesystem requires human-meaningful names and
mutability, while URIs provide neither. URIs sit on the "global+secure" edge
of Zooko's Triangle[1]. They are self-authenticating, meaning that nobody can
trick you into using the wrong data.

The URI should be considered as a "read capability" for the corresponding
data: anyone who knows the full URI has the ability to read the given data.
There is a subset of the URI (which leaves out the encryption key and fileid)
which is called the "verification capability": it allows the holder to
retrieve and validate the crypttext, but not the plaintext. Once the
crypttext is available, the erasure-coded shares can be regenerated. This
will allow a file-repair process to maintain and improve the robustness of
files without being able to read their contents.

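The relationship between the two capabilities can be pictured as one record
being a strict subset of the other. The class and field names below are
hypothetical and do not reflect the actual URI layout; the sketch only shows
that the verification capability is produced by dropping the key and fileid.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VerifyCap:
    """Enough to locate and validate crypttext, not enough to decrypt it."""
    verifierid: bytes
    encoding_params: tuple   # for example (K, N, segment size)
    roothash: bytes

@dataclass(frozen=True)
class ReadCap:
    """Everything in the full URI, including the secrets."""
    fileid: bytes
    key: bytes
    verifierid: bytes
    encoding_params: tuple
    roothash: bytes

    def to_verify_cap(self) -> VerifyCap:
        # Attenuation: copy only the non-secret fields.
        return VerifyCap(self.verifierid, self.encoding_params, self.roothash)
```

The conversion only goes one way: a holder of a ReadCap can hand out a
VerifyCap, but nothing in a VerifyCap allows the key to be recovered.
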
The lease mechanism will also involve a "delete" capability, by which a peer
which uploaded a file can indicate that they don't want it anymore. It is not
truly a delete capability because other peers might be holding leases on the
same data, and it should not be deleted until the lease count (i.e. reference
count) goes to zero, so perhaps "cancel-the-lease capability" is more
accurate. The plan is to store this capability next to the URI in the virtual
drive structure.


PEER SELECTION

When a file is uploaded, the encoded shares are sent to other peers. But to
which ones? The "peer selection" algorithm is used to make this choice.

In the current version, the verifierid is used to consistently-permute the
set of all peers (by sorting the peers by HASH(verifierid+peerid)). Each file
gets a different permutation, which (on average) will evenly distribute
shares across the grid and avoid hotspots.

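The permutation step can be sketched in a few lines. The function name and
the use of raw byte concatenation for "verifierid+peerid" are assumptions;
the hash is taken to be SHA256, as stated for all hashes in this document.

```python
import hashlib

def permuted_peers(verifierid: bytes, peerids):
    """Sort peers by HASH(verifierid + peerid), giving a per-file ordering."""
    return sorted(
        peerids,
        key=lambda peerid: hashlib.sha256(verifierid + peerid).digest(),
    )
```

Every node that knows the same peer list and the same verifierid computes
the same ordering without any coordination, which is what makes the
permutation "consistent".
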
This permutation places the peers around a 2^256-sized ring, like the rim of
a big clock. The 100-or-so shares are then placed around the same ring (at 0,
1/100*2^256, 2/100*2^256, ... 99/100*2^256). Imagine that we start at 0 with
an empty basket in hand and proceed clockwise. When we come to a share, we
pick it up and put it in the basket. When we come to a peer, we ask that peer
if they will give us a lease for every share in our basket.

The peer will grant us leases for some of those shares and reject others (if
they are full or almost full). If they reject all our requests, we remove
them from the ring, because they are full and thus unhelpful. Each share they
accept is removed from the basket. The remainder stay in the basket as we
continue walking clockwise.

We keep walking, accumulating shares and distributing them to peers, until
either we find a home for all shares, or there are no peers left in the ring
(because they are all full). If we run out of peers before we run out of
shares, the upload may be considered a failure, depending upon how many
shares we were able to place. The current parameters try to place 100 shares,
of which 25 must be retrievable to recover the file, and the peer selection
algorithm is happy if it was able to place at least 75 shares. These numbers
are adjustable: 25-out-of-100 means an expansion factor of 4x (every file in
the grid consumes four times as much space when totalled across all
StorageServers), but is highly reliable (the actual reliability is a binomial
distribution function of the expected availability of the individual peers,
but in general it goes up very quickly with the expansion factor).

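The basket walk can be simulated directly. In this sketch each peer is
reduced to a single number, the count of shares it is still willing to
accept, which is a simplification of real lease negotiation. The function
names and the 75-share threshold parameter are illustrative.

```python
import hashlib

RING = 2 ** 256

def ring_position(verifierid: bytes, peerid: bytes) -> int:
    digest = hashlib.sha256(verifierid + peerid).digest()
    return int.from_bytes(digest, "big")

def place_shares(verifierid, capacities, total_shares=100):
    """Walk the ring clockwise with a basket; return {sharenum: peerid}.

    capacities maps peerid -> number of shares that peer will accept.
    """
    remaining = dict(capacities)
    ring = sorted((ring_position(verifierid, p), p) for p in remaining)
    # Shares still lying on the ring, in clockwise order.
    waiting = [(i * RING // total_shares, i) for i in range(total_shares)]
    basket, placed = [], {}
    while ring and (basket or waiting):
        survivors = []
        for pos, peer in ring:
            # Pick up every share we pass on the way to this peer.
            while waiting and waiting[0][0] <= pos:
                basket.append(waiting.pop(0)[1])
            if not basket:
                survivors.append((pos, peer))
                continue
            accepted = basket[:remaining[peer]]
            if accepted:
                for share in accepted:
                    placed[share] = peer
                basket = basket[len(accepted):]
                remaining[peer] -= len(accepted)
                survivors.append((pos, peer))
            # A peer that rejects everything is dropped from the ring.
        # Finishing the lap collects any shares lying past the last peer.
        basket.extend(share for _, share in waiting)
        waiting = []
        ring = survivors
    return placed

def upload_succeeded(placed, happy=75):
    return len(placed) >= happy
```

Each lap either places at least one share or removes at least one full peer,
so the walk always terminates: either the basket empties or the ring does.
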
If the file has been uploaded before (or if two uploads are happening at the
same time), a peer might already have shares for the same file we are
proposing to send to them. In this case, those shares are removed from the
list and assumed to be available (or will be soon). This reduces the number
of uploads that must be performed.

When downloading a file, the current release just asks all known peers for
any shares they might have, chooses the minimal necessary subset, then starts
downloading and processing those shares. A later release will use the full
algorithm to reduce the number of queries that must be sent out. This
algorithm uses the same consistent-hashing permutation as on upload, but
instead of one walker with one basket, we have 100 walkers (one per share).
They each proceed clockwise in parallel until they find a peer, and put that
one on the "A" list: out of all peers, this one is the most likely to be the
same one to which the share was originally uploaded. The next peer that each
walker encounters is put on the "B" list, etc.

All the "A" list peers are asked for any shares they might have. If enough of
them can provide a share, the download phase begins and those shares are
retrieved and decoded. If not, the "B" list peers are contacted, etc. This
routine will eventually find all the peers that have shares, and will find
them quickly if there is significant overlap between the set of peers that
were present when the file was uploaded and the set of peers that are present
as it is downloaded (i.e. if the "peerlist stability" is high). Some limits
may be imposed in large grids to avoid querying a million peers; this
provides a tradeoff between the work spent to discover that a file is
unrecoverable and the probability that a retrieval will fail when it could
have succeeded if we had just tried a little bit harder. The appropriate
value of this tradeoff will depend upon the size of the grid, and will change
over time.

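The "A" list, "B" list, and so on can be computed without simulating the
walkers step by step: walker i starts at share i's position, so its first
peer is the first ring entry at or after that position, its second peer is
the next entry, and so forth. The sketch below assumes that model; the
function name and the choice to skip peers already placed on an earlier list
are decisions made for this example, not part of the described design.

```python
import bisect
import hashlib

RING = 2 ** 256

def ring_position(verifierid: bytes, peerid: bytes) -> int:
    digest = hashlib.sha256(verifierid + peerid).digest()
    return int.from_bytes(digest, "big")

def query_rounds(verifierid, peerids, total_shares=100, rounds=3):
    """Return [A_list, B_list, ...]: the peers to query in each round."""
    ring = sorted((ring_position(verifierid, p), p) for p in peerids)
    positions = [pos for pos, _ in ring]
    seen, lists = set(), []
    for depth in range(min(rounds, len(ring))):
        batch = []
        for i in range(total_shares):
            # First ring entry at or clockwise of share i's position.
            start = bisect.bisect_left(positions, i * RING // total_shares)
            peer = ring[(start + depth) % len(ring)][1]
            if peer not in seen:   # do not query the same peer twice
                seen.add(peer)
                batch.append(peer)
        lists.append(batch)
    return lists
```

A downloader would query the first list, count the shares offered, and only
move on to the next list when fewer than the required 25 shares have been
found. The rounds parameter plays the role of the query limit mentioned
above.
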
Other peer selection algorithms are being evaluated. One of them (known as
"tahoe 2") uses the same consistent hash, starts at 0 and requests one lease
per peer until it gets 100 of them. This is likely to get better overlap
(since a single insertion or deletion will still leave 99 overlapping peers),
but is non-ideal in other ways (TODO: what were they?). It would also make it
easier to select peers on the basis of their reliability, uptime, or
reputation: we could pick 75 good peers plus 50 marginal peers, if it seemed
likely that this would provide as good service as 100 good peers.

Another algorithm (known as "denver airport"[2]) uses the permuted hash to
decide on an approximate target for each share, then sends lease requests via
Chord routing. The request includes the contact information of the uploading
node, and asks that the node which eventually accepts the lease should
contact the uploader directly. The shares are then transferred over direct
connections rather than through multiple Chord hops. Download uses the same
approach. This allows nodes to avoid maintaining a large number of long-term
connections, at the expense of complexity, latency, and reliability.


SWARMING DOWNLOAD, TRICKLING UPLOAD

Because the shares being downloaded are distributed across a large number of
peers, the download process will pull from many of them at the same time. The
current encoding parameters require 25 shares to be retrieved for each
segment, which means that up to 25 peers will be used simultaneously. This
allows the download process to use the sum of the available peers' upload
bandwidths, resulting in downloads that take full advantage of the common 8x
disparity between download and upload bandwidth on modern ADSL lines.

On the other hand, uploads are hampered by the need to upload encoded shares
that are larger than the original data (4x larger with the current default
encoding parameters), through the slow end of the asymmetric connection. This
means that on a typical 8x ADSL line, uploading a file will take about 32
times longer than downloading it again later.

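The factor of 32 is the product of the 4x expansion and the 8x bandwidth
asymmetry. The short calculation below spells this out under two simplifying
assumptions: the downloader's own downlink is the bottleneck, and a download
fetches about one file's worth of data (K shares, each roughly 1/K of the
file). The link speeds and file size are made-up example values.

```python
def upload_seconds(file_bytes, k, n, uplink_bytes_per_sec):
    """Time to push all N shares (N/K times the file size) up the slow link."""
    return (n / k) * file_bytes / uplink_bytes_per_sec

def download_seconds(file_bytes, downlink_bytes_per_sec):
    """Time to pull K shares (about one file's worth) down the fast link."""
    return file_bytes / downlink_bytes_per_sec

uplink = 64_000          # bytes per second, example value
downlink = 8 * uplink    # the 8x ADSL asymmetry
size = 10_000_000        # bytes, example value

ratio = upload_seconds(size, 25, 100, uplink) / download_seconds(size, downlink)
# ratio is (100 / 25) * 8, i.e. 32
```

The ratio does not depend on the file size or on the absolute link speed,
only on N/K and on the asymmetry between the two directions.
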
Smaller expansion ratios can reduce this upload penalty, at the expense of
reliability. See RELIABILITY, below.


FILETREE: THE VIRTUAL DRIVE LAYER

The "virtual drive" layer is responsible for mapping human-meaningful
pathnames (directories and filenames) to pieces of data. The actual bytes
inside these files are referenced by URI, but the "filetree" is where the
directory names, file names, and metadata are kept.

The current release has a very simplistic filetree model. There is a single
globally-shared directory structure, which maps filename to URI. This
structure is maintained in a central node (which happens to be the same node
that houses the Introducer), by writing URIs to files in a local filesystem.

A future release (probably the next one) will offer each application the
ability to have a separate file tree. Each tree can reference others. Some
trees are redirections, while others actually contain subdirectories full of
filenames. The redirections may be mutable by some users but not by others,
allowing both read-only and read-write views of the same data. This will
enable individual users to have their own personal space, with links to
spaces that are shared with specific other users, and other spaces that are
globally visible. Eventually the application layer will present these pieces
in a way that allows the sharing of a specific file or the creation of a
"virtual CD" as easily as dragging a folder onto a user icon.

The URIs described above are "Content Hash Key" (CHK) identifiers[3], in
which the identifier refers to a specific, unchangeable sequence of bytes. In
this project, CHK identifiers are used for both files and immutable versions
of directories: the tree of directory and file nodes is serialized into a
sequence of bytes, which is then uploaded and turned into a URI. Each time
the directory is changed, a new URI is generated for it and propagated to the
filetree above it. There is a separate kind of upload, not yet implemented,
called SSK (short for Signed Subspace Key), in which the URI refers to a
mutable slot. Some users have a write-capability to this slot, allowing them
to change the data that it refers to. Others only have a read-capability,
merely letting them read the current contents. These SSK slots can be used to
provide mutability in the filetree, so that users can actually change the
contents of their virtual drive. Redirection nodes can also provide
mutability, such as a central service which allows a user to set the current
URI of their top-level filetree. SSK slots provide a decentralized way to
accomplish this mutability, whereas centralized redirection nodes are more
vulnerable to single-point-of-failure issues.


FILE REPAIRER

Each node is expected to explicitly drop leases on files that it knows it no
longer wants (the "delete" operation). Nodes are also expected to renew
leases on files that still exist in their filetrees. When nodes are offline
for an extended period of time, their files may decay (both because of leases
expiring and because of StorageServers going offline). A File Verifier is
used to check on the health of any given file, and a File Repairer is used to
keep desired files alive. The two are conceptually distinct (the repairer
is run if the verifier decides it is necessary), but in practice they will be
closely related, and may run in the same process.

The repairer process does not get the full URI of the file to be maintained:
it merely gets the "repairer capability" subset, which does not include the
decryption key. The File Verifier uses that data to find out which peers
ought to hold shares for this file, and to see if those peers are still
around and willing to provide the data. If the file is not healthy enough,
the File Repairer is invoked to download the crypttext, regenerate any
missing shares, and upload them to new peers. The goal of the File Repairer
is to finish up with a full set of 100 shares.

There are a number of engineering issues to be resolved here. The bandwidth,
disk IO, and CPU time consumed by the verification/repair process must be
balanced against the robustness that it provides to the grid. The nodes
involved in repair will have very different access patterns than normal
nodes, such that these processes may need to be run on hosts with more memory
or network connectivity than usual. The frequency of repair runs directly
affects the resources consumed. In some cases, verification of multiple files
can be performed at the same time, and repair of files can be delegated off
to other nodes.

The security model we are currently using assumes that peers who claim to
hold a share will actually provide it when asked. (We validate the data they
provide before using it in any way, but if enough peers claim to hold the
data and are wrong, the file will not be repaired, and may decay beyond
recoverability). There are several interesting approaches to mitigate this
threat, ranging from challenges to provide a keyed hash of the allegedly-held
data (using "buddy nodes", in which two peers hold the same block, and check
up on each other), to reputation systems, or even the original Mojo Nation
economic model.


SECURITY

The design goal for this project is that an attacker may be able to deny
service (i.e. prevent you from recovering a file that was uploaded earlier)
but can accomplish none of the following three attacks:

 1) violate privacy: the attacker gets to view data to which you have not
    granted them access
 2) violate consistency: the attacker convinces you that the wrong data is
    actually the data you were intending to retrieve
 3) violate mutability: the attacker gets to modify a filetree (either the
    pathnames or the file contents) to which you have not given them
    mutability rights

Data validity and consistency (the promise that the downloaded data will
match the originally uploaded data) is provided by the hashes embedded in the
URI. Data security (the promise that the data is only readable by people with
the URI) is provided by the encryption key embedded in the URI. Data
availability (the hope that data which has been uploaded in the past will be
downloadable in the future) is provided by the grid, which distributes
failures in a way that reduces the correlation between individual node
failure and overall file recovery failure.

Many of these security properties depend upon the usual cryptographic
assumptions: the resistance of AES and RSA to attack, the resistance of
SHA256 to pre-image attacks, and upon the proximity of 2^-128 and 2^-256 to
zero. A break in AES would allow a privacy violation, a pre-image break in
SHA256 would allow a consistency violation, and a break in RSA would allow a
mutability violation. The discovery of a collision in SHA256 is unlikely to
allow much, but could conceivably allow a consistency violation in data that
was uploaded by the attacker. If SHA256 is threatened, further analysis will
be warranted.

There is no attempt made to provide anonymity, either of the origin of a
piece of data or of the identity of the subsequent downloaders. In general,
anyone who already knows the contents of a file will be in a strong position
to determine who else is uploading or downloading it. Also, it is quite easy
for a coalition of more than 1% of the nodes to correlate the set of peers
who are all uploading or downloading the same file, even if the attacker does
not know the contents of the file in question.

Also note that the file size and verifierid are not protected. Many people
can determine the size of the file you are accessing, and if they already
know the contents of a given file, they will be able to determine that you
are uploading or downloading the same one.

A likely enhancement is the ability to use distinct encryption keys for each
file, avoiding the file-correlation attacks at the expense of increased
storage consumption.

The capability-based security model is used throughout this project. Filetree
operations are expressed in terms of distinct read and write capabilities.
The URI of a file is the read-capability: knowing the URI is equivalent to
the ability to read the corresponding data. The capability to validate and
repair a file is a subset of the read-capability. The capability to read an
SSK slot is a subset of the capability to modify it. These capabilities may
be expressly delegated (irrevocably) by simply transferring the relevant
secrets. Special forms of SSK slots can be used to make revocable delegations
of particular directories. Certain redirections in the filetree code are
expressed as Foolscap "furls", which are also capabilities and provide access
to an instance of code running on a central server: these can be delegated
just as easily as any other capability, and can be made revocable by
delegating access to a forwarder instead of the actual target.

The application layer can provide whatever security/access model is desired,
but we expect the first few to also follow capability discipline: rather than
user accounts with passwords, each user will get a furl to their private
filetree, and the presentation layer will give them the ability to break off
pieces of this filetree for delegation or sharing with others on demand.


RELIABILITY

File encoding and peer selection parameters can be adjusted to achieve
different goals. Each choice results in a number of properties; there are
many tradeoffs.

First, some terms: the erasure-coding algorithm is described as K-out-of-N
(for this release, the default values are K=25 and N=100). Each grid will
have some number of peers; this number will rise and fall over time as peers
join, drop out, come back, and leave forever. Files are of various sizes;
some are popular, others are rare. Peers have various capacities, variable
upload/download bandwidths, and network latency. Most of the mathematical
models that look at peer failure assume some average (and independent)
probability 'P' of a given peer being available: this can be high (servers
tend to be online and available >90% of the time) or low (laptops tend to be
turned on for an hour then disappear for several days). Files are encoded in
segments of a given maximum size, which affects memory usage.

The ratio of N/K is the "expansion factor". Higher expansion factors improve
reliability very quickly (the binomial distribution curve is very sharp), but
consume much more grid capacity. The absolute value of K affects the
granularity of the binomial curve (1-out-of-2 is much worse than
50-out-of-100), but high values asymptotically approach a constant that
depends upon 'P' (i.e. 500-of-1000 is not much better than 50-of-100).

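The binomial model behind these statements is easy to evaluate. The sketch
below assumes one share per peer and independent peers that are each
available with probability p, which is the idealized model described above
rather than a measurement of any real grid. The function name is
illustrative.

```python
from math import comb

def file_availability(k: int, n: int, p: float) -> float:
    """Probability that at least k of n independently-held shares are reachable."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Same 2x expansion factor, different granularity, with p = 0.9:
coarse = file_availability(1, 2, 0.9)      # fails only if both peers are down
fine = file_availability(50, 100, 0.9)     # fails only if more than 50 are down

# The default 25-out-of-100 encoding with unreliable peers (p = 0.5):
default = file_availability(25, 100, 0.5)
```

With p = 0.9 the 1-out-of-2 case has a 1% chance of being unavailable (both
peers down), while 50-out-of-100 fails only when more than half of the peers
are down at once, which is far less likely. The comparison depends on p being
comfortably above K/N: when p is close to K/N, adding more shares at the same
ratio helps much less.
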
Likewise, the total number of peers in the network affects the same
granularity: having only one peer means a single point of failure, no matter
how many copies of the file you make. Independent peers (with uncorrelated
failures) are necessary to hit the mathematical ideals: if you have 100 nodes
but they are all in the same office building, then a single power failure
will take out all of them at once. The "Sybil Attack" is where a single
attacker convinces you that they are actually multiple servers, so that you
think you are using a large number of independent peers, but in fact you have
a single point of failure (where the attacker turns off all their machines at
once). Large grids, with lots of truly-independent peers, will enable the
use of lower expansion factors to achieve the same reliability, but increase
overhead because each peer needs to know something about every other, and the
rate at which peers come and go will be higher (requiring network maintenance
traffic). Also, the File Repairer work will increase with larger grids,
although then the job can be distributed out to more peers.

Higher values of N increase overhead: more shares means more Merkle hashes
that must be included with the data, and more peers to contact to retrieve
the shares. Smaller segment sizes reduce memory usage (since each segment
must be held in memory while erasure coding runs) and increase "alacrity"
(since downloading can validate a smaller piece of data faster, delivering it
to the target sooner), but also increase overhead (because more blocks means
more Merkle hashes to validate them).

In general, small private grids should work well, but the participants will
have to decide between storage overhead and reliability. Large stable grids
will be able to reduce the expansion factor down to a bare minimum while
still retaining high reliability, but large unstable grids (where nodes are
coming and going very quickly) may require more repair/verification bandwidth
than actual upload/download traffic.


------------------------------

[1]: http://en.wikipedia.org/wiki/Zooko%27s_triangle

[2]: all of these names are derived from the location where they were
     concocted, in this case in a car ride from Boulder to DEN. To be
     precise, "tahoe 1" was an unworkable scheme in which everyone holding
     shares for a given file formed a sort of cabal which kept track of all
     the others, "tahoe 2" is the first-100-peers in the permuted hash, and
     this document describes "tahoe 3", or perhaps "potrero hill 1".

[3]: the terms CHK and SSK come from Freenet,
     http://wiki.freenetproject.org/FreenetCHKPages ,
     although we use "SSK" in a slightly different way