Allmydata "Tahoe" Architecture


OVERVIEW

The high-level view of this system consists of three layers: the grid, the
virtual drive, and the application that sits on top.

The lowest layer is the "grid", basically a DHT (Distributed Hash Table)
which maps capabilities to data. The capabilities are relatively short ascii
strings, and each is used as a reference to an arbitrary-length sequence of
data bytes. This data is encrypted and distributed around the grid across a
large number of nodes, such that a large fraction of the nodes would have to
be unavailable for the data to become unavailable.

The middle layer is the virtual drive: a directed-acyclic-graph-shaped data
structure in which the intermediate nodes are directories and the leaf nodes
are files. The leaf nodes contain only the file data -- they don't contain
any metadata about the file except for the length. The edges that lead to
leaf nodes have metadata attached to them about the file that they point to.
Therefore, the same file may have different metadata associated with it if it
is dereferenced through different edges.

The top layer is where the applications that use this virtual drive operate.
Allmydata uses this for a backup service, in which the application copies the
files to be backed up from the local disk into the virtual drive on a
periodic basis. By providing read-only access to the same virtual drive
later, a user can recover older versions of their files. Other sorts of
applications can run on top of the virtual drive, of course -- anything that
has a use for a secure, decentralized, fault-tolerant filesystem.


THE BIG GRID OF STORAGE SERVERS

Underlying the grid is a large collection of peer nodes. These are processes
running on a wide variety of computers, all of which know about each other in
some way or another. They establish TCP connections to one another using
Foolscap, an encrypted+authenticated remote message passing library (using
TLS connections and self-authenticating identifiers called "FURLs").

Each peer offers certain services to the others. The primary service is the
StorageServer, which offers to hold data. Each StorageServer has a quota, and
it will reject storage requests that would cause it to consume more space
than it wants to provide.

This storage is used to hold "shares", which are encoded pieces of files in
the grid. There are many shares for each file, typically between 10 and 100
(the exact number depends upon the tradeoffs made between reliability,
overhead, and storage space consumed). The files are indexed by a
"StorageIndex", which is derived from the encryption key, which is derived
from the contents of the file. Leases are indexed by StorageIndex, and a
single StorageServer may hold multiple shares for the corresponding
file. Multiple peers can hold leases on the same file.

Peers learn about each other through the "introducer". Each peer connects to
this central introducer at startup, and receives a list of all other peers
from it. Each peer then connects to all other peers, creating a
fully-connected topology. Future versions will reduce the number of
connections considerably, to enable the grid to scale to larger sizes: the
design target is one million nodes. In addition, future versions will offer
relay and NAT-traversal services to allow nodes without full internet
connectivity to participate. In the current release, nodes behind NAT boxes
will connect to all nodes that they can open connections to, but they cannot
open connections to other nodes behind NAT boxes. Therefore, the more nodes
there are behind NAT boxes, the less the topology resembles the intended
fully-connected mesh.


FILE ENCODING

When a file is to be added to the grid, it is first encrypted using a key
that is derived from the hash of the file itself. The encrypted file is then
broken up into segments so it can be processed in small pieces (to minimize
the memory footprint of both encode and decode operations, and to increase
the so-called "alacrity": how quickly the download operation can provide
validated data to the user, basically the lag between hitting "play" and the
movie actually starting). Each segment is erasure coded, which creates
encoded blocks such that only a subset of them are required to reconstruct
the segment. These blocks are then combined into "shares", such that a subset
of the shares can be used to reconstruct the whole file. The shares are then
deposited in the StorageServers of other peers.

A tagged hash of the encryption key is used to form the "storage index",
which is used for both server selection (described below) and to index shares
within the StorageServers on the selected peers.
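
As a rough illustration of this derivation chain, the sketch below derives a
convergent encryption key from the file's contents and then a storage index
from a tagged hash of that key. The tag strings and truncation lengths are
invented for the example (the real code uses its own tags), but the shape --
SHA-256 with netstring-framed tags -- follows the hashing conventions
described later in this section.

  from hashlib import sha256

  def netstring(s):
      # length-prefixed framing, e.g. b"5:hello,", so a tag can never be
      # confused with the data that follows it
      return b"%d:%s," % (len(s), s)

  def tagged_hash(tag, data):
      return sha256(netstring(tag) + data).digest()

  def convergent_encryption_key(plaintext):
      # hypothetical tag; truncated to 16 bytes for an AES-128 key
      return tagged_hash(b"example_encryption_key", plaintext)[:16]

  def storage_index(encryption_key):
      # hypothetical tag; note that the index is derived from the key,
      # not directly from the plaintext
      return tagged_hash(b"example_storage_index", encryption_key)[:16]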

A variety of hashes are computed while the shares are being produced, to
validate the plaintext, the ciphertext, and the shares themselves. Merkle
hash trees are also produced to enable validation of individual segments of
plaintext or ciphertext without requiring the download/decoding of the whole
file. These hashes go into the "Capability Extension Block", which will be
stored with each share.

The capability contains the encryption key, the hash of the Capability
Extension Block, and any encoding parameters necessary to perform the
eventual decoding process. For convenience, it also contains the size of the
file being stored.

On the download side, the node that wishes to turn a capability into a
sequence of bytes will obtain the necessary shares from remote nodes, break
them into blocks, use erasure-decoding to turn them into segments of
ciphertext, use the decryption key to convert that into plaintext, then emit
the plaintext bytes to the output target (which could be a file on disk, or
it could be streamed directly to a web browser or media player).

All hashes use SHA-256, and a different tag is used for each purpose.
Netstrings are used where necessary to ensure these tags cannot be confused
with the data to be hashed. All encryption uses AES in CTR mode. The erasure
coding is performed with zfec.
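
For example, erasure coding one segment with zfec might look like the
following sketch (assuming zfec's Encoder/Decoder classes behave as shown;
real shares also carry hashes and other metadata, so this illustrates only
the coding step):

  import zfec

  k, n = 3, 10
  segment = b"x" * 3000                    # one segment of ciphertext
  size = len(segment) // k
  inblocks = [segment[i*size:(i+1)*size] for i in range(k)]

  # produce n blocks, any k of which suffice to rebuild the segment
  blocks = zfec.Encoder(k, n).encode(inblocks)

  # pretend only blocks 1, 4, and 9 could be retrieved
  nums = [1, 4, 9]
  decoded = zfec.Decoder(k, n).decode([blocks[i] for i in nums], nums)
  assert b"".join(decoded) == segment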

A Merkle Hash Tree is used to validate the encoded blocks before they are fed
into the decode process, and a transverse tree is used to validate the shares
as they are retrieved. A third Merkle tree is constructed over the plaintext
segments, and a fourth is constructed over the ciphertext segments. All
necessary hashes are stored with the shares, and the hash tree roots are put
in the Capability Extension Block. The final hash of the extension block goes
into the capability itself.
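
The tree construction itself is simple; the sketch below computes a root over
a list of blocks using plain SHA-256 (the real trees use tagged hashes and a
specific padding rule, so treat this only as an illustration of the idea):

  from hashlib import sha256

  def merkle_root(leaves):
      layer = [sha256(leaf).digest() for leaf in leaves]
      while len(layer) > 1:
          if len(layer) % 2:
              layer.append(layer[-1])      # naive padding for odd layers
          layer = [sha256(layer[i] + layer[i+1]).digest()
                   for i in range(0, len(layer), 2)]
      return layer[0]

  root = merkle_root([b"block%d" % i for i in range(8)])
  # the root is the trusted value: a downloader can verify any one block
  # using just the block, its sibling hashes, and this root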

Note that the number of shares created is fixed at the time the file is
uploaded: it is not possible to create additional shares later. The use of a
top-level hash tree also requires that nodes create all shares at once, even
if they don't intend to upload some of them, otherwise the hashroot cannot be
calculated correctly.


CAPABILITIES

Capabilities to immutable files represent a specific set of bytes. Think of
it like a hash function: you feed in a bunch of bytes, and you get out a
capability, which is deterministically derived from the input data: changing
even one bit of the input data will result in a completely different
capability.

Read-only capabilities to mutable files represent the ability to get a set of
bytes representing a version of the file. Each read-only capability is
unique. In fact, each mutable file has a unique public/private key pair
created when the mutable file is created, and the read-only capability to
that file includes a secure hash of the public key.

Read-write capabilities to mutable files represent the ability to read the
file (just like a read-only capability) and also to write a new version of
the file, overwriting any extant version. Read-write capabilities are unique
-- each one includes the secure hash of the private key associated with that
mutable file.

The capability provides both "location" and "identification": you can use it
to retrieve a set of bytes, and then you can use it to validate ("identify")
that these potential bytes are indeed the ones that you were looking for.

The "grid" layer is insufficient to provide a virtual drive: an actual
filesystem requires human-meaningful names. Capabilities sit on the
"global+secure" edge of Zooko's Triangle[1]. They are self-authenticating,
meaning that nobody can trick you into using a file that doesn't match the
capability you used to refer to that file.


SERVER SELECTION

When a file is uploaded, the encoded shares are sent to other peers. But to
which ones? The "server selection" algorithm is used to make this choice.

In the current version, the verifierid is used to consistently permute the
set of all peers (by sorting the peers by HASH(verifierid+peerid)). Each file
gets a different permutation, which (on average) will evenly distribute
shares among the grid and avoid hotspots.
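
A sketch of that permutation (treating the verifierid and peerids as byte
strings and using SHA-256; the real code differs in details such as tagging):

  from hashlib import sha256

  def permuted_peers(verifierid, peerids):
      # every file sees the peer list in a different, but repeatable, order
      return sorted(peerids,
                    key=lambda peerid: sha256(verifierid + peerid).digest())

  peers = [b"peer-%d" % i for i in range(20)]
  upload_order = permuted_peers(b"\x01" * 16, peers)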

We use this permuted list of peers to ask each peer, in turn, if it will hold
on to a share for us, by sending an 'allocate_buckets()' query to each one.
Some will say yes, others (those who are full) will say no: when a peer
refuses our request, we just take that share to the next peer on the list. We
keep going until we run out of shares to place. At the end of the process,
we'll have a table that maps each share number to a peer, and then we can
begin the encode+push phase, using the table to decide where each share
should be sent.
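
In outline, that placement loop looks something like this sketch, where the
hypothetical ask_peer_to_hold() stands in for the remote allocate_buckets()
call (the loop-around and multiple-shares-per-query refinements described in
the next paragraph are omitted):

  def place_shares(num_shares, permuted_peers, ask_peer_to_hold):
      sharemap = {}                        # share number -> peer
      shares = list(range(num_shares))
      for peer in permuted_peers:
          if not shares:
              break
          if ask_peer_to_hold(peer, shares[0]):
              sharemap[shares.pop(0)] = peer
          # a peer that refuses is skipped; the share moves to the next peer
      return sharemap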

Most of the time, this will result in one share per peer, which gives us
maximum reliability (since it disperses the failures as widely as possible).
If there are fewer usable peers than there are shares, we'll be forced to
loop around, eventually giving multiple shares to a single peer. This reduces
reliability, so it isn't the sort of thing we want to happen all the time,
and either indicates that the default encoding parameters are set incorrectly
(creating more shares than you have peers), or that the grid does not have
enough space (many peers are full). But apart from that, it doesn't hurt. If
we have to loop through the peer list a second time, we accelerate the query
process, by asking each peer to hold multiple shares on the second pass. In
most cases, this means we'll never send more than two queries to any given
peer.

If a peer is unreachable, or has an error, or refuses to accept any of our
shares, we remove them from the permuted list, so we won't query them a
second time for this file. If a peer already has shares for the file we're
uploading (or if someone else is currently sending them shares), we add that
information to the share-to-peer table. This lets us do less work for files
which have been uploaded once before, while making sure we still wind up with
as many shares as we desire.

If we are unable to place every share that we want, but we still managed to
place a quantity known as "shares of happiness", we'll do the upload anyway.
If we cannot place at least this many, the upload is declared a failure.

The current defaults use k=3, shares_of_happiness=7, and N=10, meaning that
we'll try to place 10 shares, we'll be happy if we can place 7, and we need
to get back any 3 to recover the file. This results in a 3.3x expansion
factor. In general, you should set N about equal to the number of peers in
your grid, then set N/k to achieve your desired availability goals.
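
Concretely, for those defaults (a tiny sketch; the helper name is invented):

  k, shares_of_happiness, n = 3, 7, 10

  expansion_factor = n / k          # 10/3, about 3.3x storage consumed

  def upload_is_good_enough(shares_placed):
      # accept the upload if at least "shares of happiness" shares landed
      return shares_placed >= shares_of_happiness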

When downloading a file, the current release just asks all known peers for
any shares they might have, chooses the minimal necessary subset, then starts
downloading and processing those shares. A later release will use the full
algorithm to reduce the number of queries that must be sent out. This
algorithm uses the same consistent-hashing permutation as on upload, but
stops after it has located k shares (instead of all N). This reduces the
number of queries that must be sent before downloading can begin.

The actual number of queries is directly related to the availability of the
peers and the degree of overlap between the peerlist used at upload and at
download. For stable grids, this overlap is very high, and usually the first
k queries will result in shares. The number of queries grows as the stability
decreases. Some limits may be imposed in large grids to avoid querying a
million peers; this provides a tradeoff between the work spent to discover
that a file is unrecoverable and the probability that a retrieval will fail
when it could have succeeded if we had just tried a little bit harder. The
appropriate value of this tradeoff will depend upon the size of the grid, and
will change over time.

Other peer selection algorithms are possible. One earlier version (known as
"tahoe 3") used the permutation to place the peers around a large ring,
distributed shares evenly around the same ring, then walked clockwise from 0
with a basket: each time we encounter a share, put it in the basket; each
time we encounter a peer, give them as many shares from our basket as they'll
accept. This reduced the number of queries (usually to 1) for small grids
(where N is larger than the number of peers), but resulted in extremely
non-uniform share distribution, which significantly hurt reliability
(sometimes the permutation resulted in most of the shares being dumped on a
single peer).

Another algorithm (known as "denver airport"[2]) uses the permuted hash to
decide on an approximate target for each share, then sends lease requests via
Chord routing. The request includes the contact information of the uploading
node, and asks that the node which eventually accepts the lease should
contact the uploader directly. The shares are then transferred over direct
connections rather than through multiple Chord hops. Download uses the same
approach. This allows nodes to avoid maintaining a large number of long-term
connections, at the expense of complexity, latency, and reliability.


SWARMING DOWNLOAD, TRICKLING UPLOAD

Because the shares being downloaded are distributed across a large number of
peers, the download process will pull from many of them at the same time. The
current encoding parameters require 3 shares to be retrieved for each
segment, which means that up to 3 peers will be used simultaneously. For
larger networks, 8-of-22 encoding could be used, meaning 8 peers can be used
simultaneously. This allows the download process to use the sum of the
available peers' upload bandwidths, resulting in downloads that take full
advantage of the common 8x disparity between download and upload bandwidth on
modern ADSL lines.

On the other hand, uploads are hampered by the need to upload encoded shares
that are larger than the original data (3.3x larger with the current default
encoding parameters), through the slow end of the asymmetric connection. This
means that on a typical 8x ADSL line, uploading a file will take about 32
times longer than downloading it again later.

Smaller expansion ratios can reduce this upload penalty, at the expense of
reliability. See RELIABILITY, below. A project known as "offloaded uploading"
can eliminate the penalty, if there is a node somewhere else in the network
that is willing to do the work of encoding and upload for you.


VDRIVE AND DIRNODES: THE VIRTUAL DRIVE LAYER

The "virtual drive" layer is responsible for mapping human-meaningful
pathnames (directories and filenames) to pieces of data. The actual bytes
inside these files are referenced by capability, but the "vdrive" is where
the directory names, file names, and metadata are kept.

In the current release, the virtual drive is a graph of "dirnodes". Each
dirnode represents a single directory, and thus contains a table of named
children. These children are either other dirnodes or actual files. All
children are referenced by their capability. Each client creates a "private
vdrive" dirnode at startup. The clients also receive access to a "global
vdrive" dirnode from the central introducer/vdrive server, which is shared
between all clients and serves as an easy demonstration of having multiple
writers for a single dirnode.

The dirnode itself has two forms of capability: one is read-write and the
other is read-only. The table of children inside the dirnode has a read-write
and a read-only capability for each child. If you have a read-only capability
for a given dirnode, you will not be able to access the read-write capability
of the children. This results in "transitively read-only" dirnode access.
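
One way to picture a dirnode's table of children (an in-memory sketch that
ignores the encryption and serialization of real dirnodes; the capability
strings are placeholders): each entry carries both capabilities plus per-edge
metadata, and a read-only view simply omits the write halves.

  # child name -> (read-write cap, read-only cap, edge metadata)
  children = {
      "notes.txt": ("rw-cap-1", "ro-cap-1", {"size": 1234}),
      "photos":    ("rw-cap-2", "ro-cap-2", {}),
  }

  def read_only_view(children):
      # holders of the dirnode's read-only capability see only the
      # read-only halves: "transitively read-only" access
      return {name: (None, ro, meta)
              for name, (rw, ro, meta) in children.items()}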

By having two different capabilities, you can choose which you want to share
with someone else. If you create a new directory and share the read-write
capability for it with a friend, then you will both be able to modify its
contents. If instead you give them the read-only capability, then they will
*not* be able to modify the contents. Any capability that you receive can be
attached to any dirnode that you can modify, so very powerful
shared+published directory structures can be built from these components.

This structure enables individual users to have their own personal space,
with links to spaces that are shared with specific other users, and other
spaces that are globally visible. Eventually the application layer will
present these pieces in a way that allows the sharing of a specific file or
the creation of a "virtual CD" as easily as dragging a folder onto a user
icon.

In the current release, these dirnodes are *not* distributed. Instead, each
dirnode lives on a single host, in a file on its local (physical) disk. In
addition, all dirnodes are on the same host, known as the "Introducer And
VDrive Node". This simplifies implementation and consistency, but obviously
has a drastic effect on reliability: the file data can survive multiple host
failures, but the vdrive that points to that data cannot. Fixing this
situation is a high priority task.


LEASES, REFRESHING, GARBAGE COLLECTION

Shares are uploaded to a storage server, but they do not necessarily stay
there forever. We are anticipating three main share-lifetime management modes
for Tahoe: 1) per-share leases which expire, 2) per-account timers which
expire and cancel all leases for the account, and 3) centralized account
management without expiration timers.

Multiple clients may be interested in a given share, for example if two
clients uploaded the same file, or if two clients are sharing a directory and
both want to make sure the files therein remain available. Consequently, each
share (technically each "bucket", which may contain multiple shares for a
single storage index) has a set of leases, one per client. One way to
visualize this is with a large table, with shares (i.e. buckets, or storage
indices, or files) as the rows, and accounts as columns. Each square of this
table might hold a lease.
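
That table can be sketched as a nested mapping (purely illustrative, not the
on-disk format):

  # storage index -> account -> lease expiration time (seconds since epoch)
  leases = {
      "si-1": {"account-A": 1200000000, "account-B": 1200600000},
      "si-2": {"account-A": 1200100000},
  }

  def expired_buckets(leases, now):
      # a bucket becomes garbage only when every lease on it has expired
      return [si for si, row in leases.items()
              if all(expiry < now for expiry in row.values())]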

Using limited-duration leases reduces the storage consumed by clients who
have (for whatever reason) forgotten about the share they once cared about.
Clients are supposed to explicitly cancel leases for every file that they
remove from their vdrive, and when the last lease is removed on a share, the
storage server deletes that share. However, the storage server might be
offline when the client deletes the file, or the client might experience a
bug or a race condition that results in forgetting about the file. Using
leases that expire unless otherwise renewed ensures that these lost files
will not consume storage space forever. On the other hand, they require
periodic maintenance, which can become prohibitively expensive for large
grids. In addition, clients who go offline for a while are then obligated to
get someone else to keep their files alive for them.

In the first mode, each client holds a limited-duration lease on each share
(typically one month), and clients are obligated to periodically renew these
leases to keep them from expiring (typically once a week). In this mode, the
storage server does not know anything about which client is which: it only
knows about leases.

In the second mode, each server maintains a list of clients and which leases
they hold. This is called the "account list", and each time a client wants to
upload a share or establish a lease, it provides credentials to allow the
server to know which Account it will be using. Rather than putting individual
timers on each lease, the server puts a timer on the Account. When the
account expires, all of the associated leases are cancelled.

In this mode, clients are obligated to renew the Account periodically, but
not the (thousands of) individual share leases. Clients which forget about
files are still incurring a storage cost for those files. An occasional
reconciliation process (in which the client presents the storage server with
a list of all the files it cares about, and the server removes leases for
anything that isn't on the list) can be used to free this storage, but the
effort involved is large, so reconciliation must be done very infrequently.

Our plan is to have the clients create their own Accounts, based upon the
possession of a private key. Clients can create as many accounts as they
wish, but they are responsible for their own maintenance. Servers can add up
all the leases for each account and present a report of usage, in bytes per
account. This is intended for friendnet scenarios where it would be nice to
know how much space your friends are consuming on your disk.

In the third mode, the Account objects are centrally managed, and are not
expired by the storage servers. In this mode, the client presents credentials
that are issued by a central authority, such as a signed message which the
storage server can verify. The storage used by this account is not freed
unless and until the central account manager says so.

This mode is more appropriate for a commercial offering, in which use of the
storage servers is contingent upon a monthly fee, or other membership
criteria. Being able to ask the storage usage for each account (or establish
limits on it) helps to enforce whatever kind of membership policy is desired.

Each lease is created with a pair of secrets: the "renew secret" and the
"cancel secret". These are just random-looking strings, derived by hashing
other higher-level secrets, starting with a per-client master secret. Anyone
who knows the secret is allowed to restart the expiration timer, or cancel
the lease altogether. Having these be individual values allows the original
uploading node to delegate these capabilities to others.
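
A sketch of that derivation (the tags and the exact chain of inputs are
invented for the example; the point is that each secret is a one-way hash, so
handing someone the renew secret does not reveal the cancel secret or the
master secret):

  from hashlib import sha256

  def derive(tag, *parts):
      return sha256(b"|".join((tag,) + parts)).digest()

  def lease_secrets(client_master_secret, storage_index, server_id):
      renew = derive(b"lease-renew", client_master_secret, storage_index,
                     server_id)
      cancel = derive(b"lease-cancel", client_master_secret, storage_index,
                      server_id)
      return renew, cancel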

In the current release, clients provide lease secrets to the storage server,
and each lease contains an expiration time, but there is no facility to
actually expire leases, nor are there explicit owners (the "ownerid" field of
each lease is always set to zero). In addition, many features have not been
implemented yet: the client should claim leases on files which are added to
the vdrive by linking (as opposed to uploading), and the client should cancel
leases on files which are removed from the vdrive, but neither has been
written yet. This means that shares are not ever deleted in this release.


FILE REPAIRER

Shares may go away because the storage server hosting them has suffered a
failure: either temporary downtime (affecting availability of the file), or a
permanent data loss (affecting the reliability of the file). Hard drives
crash, power supplies explode, coffee spills, and asteroids strike. The goal
of a robust distributed filesystem is to survive these setbacks.

To work against this slow, continual loss of shares, a File Checker is used
to periodically count the number of shares still available for any given
file. A more extensive form of checking known as the File Verifier can
download the ciphertext of the target file and perform integrity checks
(using strong hashes) to make sure the data is still intact. When the file is
found to have decayed below some threshold, the File Repairer can be used to
regenerate and re-upload the missing shares. These processes are conceptually
distinct (the repairer is only run if the checker/verifier decides it is
necessary), but in practice they will be closely related, and may run in the
same process.
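
The decision logic amounts to a threshold check, roughly like this sketch
(count_available_shares() and repair() are placeholders for the real File
Checker and File Repairer, and the threshold values are only examples):

  def check_and_maybe_repair(count_available_shares, repair, storage_index,
                             k=3, repair_threshold=7):
      available = count_available_shares(storage_index)
      if available < k:
          return "unrecoverable"   # fewer than k shares: nothing can rebuild it
      if available < repair_threshold:
          repair(storage_index)    # regenerate and re-upload missing shares
          return "repaired"
      return "healthy"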

The repairer process does not get the full capability of the file to be
maintained: it merely gets the "repairer capability" subset, which does not
include the decryption key. The File Verifier uses that data to find out
which peers ought to hold shares for this file, and to see if those peers are
still around and willing to provide the data. If the file is not healthy
enough, the File Repairer is invoked to download the ciphertext, regenerate
any missing shares, and upload them to new peers. The goal of the File
Repairer is to finish up with a full set of N shares.

There are a number of engineering issues to be resolved here. The bandwidth,
disk IO, and CPU time consumed by the verification/repair process must be
balanced against the robustness that it provides to the grid. The nodes
involved in repair will have very different access patterns than normal
nodes, such that these processes may need to be run on hosts with more memory
or network connectivity than usual. The frequency of repair runs directly
affects the resources consumed. In some cases, verification of multiple files
can be performed at the same time, and repair of files can be delegated off
to other nodes.

The security model we are currently using assumes that peers who claim to
hold a share will actually provide it when asked. (We validate the data they
provide before using it in any way, but if enough peers claim to hold the
data and are wrong, the file will not be repaired, and may decay beyond
recoverability). There are several interesting approaches to mitigate this
threat, ranging from challenges to provide a keyed hash of the allegedly-held
data (using "buddy nodes", in which two peers hold the same block, and check
up on each other), to reputation systems, or even the original Mojo Nation
economic model.


SECURITY

The design goal for this project is that an attacker may be able to deny
service (i.e. prevent you from recovering a file that was uploaded earlier)
but can accomplish none of the following three attacks:

 1) violate confidentiality: the attacker gets to view data to which you have
    not granted them access

 2) violate consistency: the attacker convinces you that the wrong data is
    actually the data you were intending to retrieve

 3) violate mutability: the attacker gets to modify a dirnode (either the
    pathnames or the file contents) to which you have not given them
    mutability rights

Data validity and consistency (the promise that the downloaded data will
match the originally uploaded data) are provided by the hashes embedded in
the capability. Data confidentiality (the promise that the data is only
readable by people with the capability) is provided by the encryption key
embedded in the capability. Data availability (the hope that data which has
been uploaded in the past will be downloadable in the future) is provided by
the grid, which distributes failures in a way that reduces the correlation
between individual node failure and overall file recovery failure.

Many of these security properties depend upon the usual cryptographic
assumptions: the resistance of AES and RSA to attack, the resistance of
SHA256 to pre-image attacks, and upon the proximity of 2^-128 and 2^-256 to
zero. A break in AES would allow a confidentiality violation, a pre-image
break in SHA256 would allow a consistency violation, and a break in RSA would
allow a mutability violation. The discovery of a collision in SHA256 is
unlikely to allow much, but could conceivably allow a consistency violation
in data that was uploaded by the attacker. If SHA256 is threatened, further
analysis will be warranted.

No attempt is made to provide anonymity, either of the origin of a piece of
data or of the identity of the subsequent downloaders. In general, anyone who
already knows the contents of a file will be in a strong position to
determine who else is uploading or downloading it. Also, it is quite easy for
a coalition of more than 1% of the nodes to correlate the set of peers who
are all uploading or downloading the same file, even if the attacker does not
know the contents of the file in question.

Also note that the file size and (when convergence is being used) a keyed
hash of the plaintext are not protected. Many people can determine the size
of the file you are accessing, and if they already know the contents of a
given file, they will be able to determine that you are uploading or
downloading the same one.

A likely enhancement is the ability to use distinct encryption keys for each
file, avoiding the file-correlation attacks at the expense of increased
storage consumption. This is known as "non-convergent" encoding.

The capability-based security model is used throughout this project.
Operations on dirnodes are expressed in terms of distinct read and write
capabilities. Knowing the read-capability of a file is equivalent to the
ability to read the corresponding data. The capability to validate the
correctness of a file is strictly weaker than the read-capability (possession
of read-capability automatically grants you possession of
validate-capability, but not vice versa). These capabilities may be expressly
delegated (irrevocably) by simply transferring the relevant secrets.

The application layer can provide whatever security/access model is desired,
but we expect the first few to also follow capability discipline: rather than
user accounts with passwords, each user will get a FURL to their private
dirnode, and the presentation layer will give them the ability to break off
pieces of this vdrive for delegation or sharing with others on demand.


RELIABILITY

File encoding and peer selection parameters can be adjusted to achieve
different goals. Each choice results in a number of properties; there are
many tradeoffs.

First, some terms: the erasure-coding algorithm is described as K-out-of-N
(for this release, the default values are K=3 and N=10). Each grid will have
some number of peers; this number will rise and fall over time as peers join,
drop out, come back, and leave forever. Files are of various sizes, some are
popular, others are rare. Peers have various capacities, variable
upload/download bandwidths, and network latency. Most of the mathematical
models that look at peer failure assume some average (and independent)
probability 'P' of a given peer being available: this can be high (servers
tend to be online and available >90% of the time) or low (laptops tend to be
turned on for an hour then disappear for several days). Files are encoded in
segments of a given maximum size, which affects memory usage.

The ratio of N/K is the "expansion factor". Higher expansion factors improve
reliability very quickly (the binomial distribution curve is very sharp), but
consume much more grid capacity. The absolute value of K affects the
granularity of the binomial curve (1-out-of-2 is much worse than
50-out-of-100), but high values asymptotically approach a constant that
depends upon 'P' (i.e. 500-of-1000 is not much better than 50-of-100).
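
To make the binomial point concrete, the standard survival calculation
(assuming one share per peer and independent availability P) sums the tail of
the binomial distribution; comparing codes that share a 2x expansion factor
but differ in K shows the granularity effect described above:

  from math import comb

  def p_file_available(k, n, p):
      # probability that at least k of the n shares sit on reachable peers
      return sum(comb(n, i) * p**i * (1 - p)**(n - i)
                 for i in range(k, n + 1))

  for k, n in [(1, 2), (5, 10), (50, 100)]:
      print("%d-of-%d: %.6f" % (k, n, p_file_available(k, n, 0.9)))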

Likewise, the total number of peers in the network affects the same
granularity: having only one peer means a single point of failure, no matter
how many copies of the file you make. Independent peers (with uncorrelated
failures) are necessary to hit the mathematical ideals: if you have 100 nodes
but they are all in the same office building, then a single power failure
will take out all of them at once. The "Sybil Attack" is where a single
attacker convinces you that they are actually multiple servers, so that you
think you are using a large number of independent peers, but in fact you have
a single point of failure (where the attacker turns off all their machines at
once). Large grids, with lots of truly-independent peers, will enable the
use of lower expansion factors to achieve the same reliability, but increase
overhead because each peer needs to know something about every other, and the
rate at which peers come and go will be higher (requiring network maintenance
traffic). Also, the File Repairer work will increase with larger grids,
although then the job can be distributed out to more peers.

Higher values of N increase overhead: more shares means more Merkle hashes
that must be included with the data, and more peers to contact to retrieve
the shares. Smaller segment sizes reduce memory usage (since each segment
must be held in memory while erasure coding runs) and increase "alacrity"
(since downloading can validate a smaller piece of data faster, delivering it
to the target sooner), but also increase overhead (because more blocks means
more Merkle hashes to validate them).

In general, small private grids should work well, but the participants will
have to decide between storage overhead and reliability. Large stable grids
will be able to reduce the expansion factor down to a bare minimum while
still retaining high reliability, but large unstable grids (where nodes are
coming and going very quickly) may require more repair/verification bandwidth
than actual upload/download traffic.

Tahoe nodes that run a webserver have a page dedicated to provisioning
decisions: this tool may help you evaluate different expansion factors and
view the disk consumption of each. It is also acquiring some sections with
availability/reliability numbers, as well as preliminary cost analysis data.
This tool will continue to evolve as our analysis improves.


------------------------------

[1]: http://en.wikipedia.org/wiki/Zooko%27s_triangle

[2]: all of these names are derived from the location where they were
     concocted, in this case in a car ride from Boulder to DEN. To be
     precise, "tahoe 1" was an unworkable scheme in which everyone who holds
     shares for a given file would form a sort of cabal which kept track of
     all the others, "tahoe 2" is the first-100-peers in the permuted hash,
     and "tahoe 3" (or perhaps "potrero hill 1") is the ring-based scheme
     described above.