tahoe-lafs/docs/architecture.txt


   Tahoe-LAFS Architecture

(See the docs/specifications directory for more details.)

OVERVIEW

There are three layers: the key-value store, the filesystem, and the
application.

The lowest layer is the key-value store. The keys are "capabilities" -- short
ascii strings -- and the values are sequences of data bytes. This data is
encrypted and distributed across a number of nodes, such that it will survive
the loss of most of the nodes. There are no hard limits on the size of the
values, but there may be performance issues with extremely large values (just
due to the limitation of network bandwidth). In practice, values as small as a
few bytes and as large as tens of gigabytes are in common use.

The middle layer is the decentralized filesystem: a directed graph in which
the intermediate nodes are directories and the leaf nodes are files. The leaf
nodes contain only the data -- they contain no metadata other than the length
in bytes.  The edges leading to leaf nodes have metadata attached to them
about the file they point to.  Therefore, the same file may be associated with
different metadata if it is referred to through different edges.

The top layer consists of the applications using the filesystem.
Allmydata.com uses it for a backup service: the application periodically
copies files from the local disk onto the decentralized filesystem.  We later
provide read-only access to those files, allowing users to recover them.
There are several other applications built on top of the Tahoe-LAFS filesystem
(see the RelatedProjects page of the wiki for a list).


THE KEY-VALUE STORE

The key-value store is implemented by a grid of Tahoe-LAFS storage servers --
user-space processes.  Tahoe-LAFS storage clients communicate with the storage
servers over TCP.

Storage servers hold data in the form of "shares".  Shares are encoded pieces
of files.  There are a configurable number of shares for each file, 10 by
default.  Normally, each share is stored on a separate server, but in some
cases a single server can hold multiple shares of a file.

Nodes learn about each other through an "introducer". Each server connects to
the introducer at startup and announces its presence. Each client connects to
the introducer at startup, and receives a list of all servers from it. Each
client then connects to every server, creating a "bi-clique" topology. In the
current release, nodes behind NAT boxes will connect to all nodes that they
can open connections to, but they cannot open connections to other nodes
behind NAT boxes.  Therefore, the more nodes behind NAT boxes, the less the
topology resembles the intended bi-clique topology.

The introducer is a Single Point of Failure ("SPoF"), in that clients who
never connect to the introducer will be unable to connect to any storage
servers, but once a client has been introduced to everybody, it does not need
the introducer again until it is restarted. The danger of a SPoF is further
reduced in two ways. First, the introducer is defined by a hostname and a
private key, which are easy to move to a new host in case the original one
suffers an unrecoverable hardware problem. Second, even if the private key is
lost, clients can be reconfigured to use a new introducer.

For future releases, we have plans to decentralize introduction, allowing any
server to tell a new client about all the others.


FILE ENCODING

When a client stores a file on the grid, it first encrypts the file. It then
breaks the encrypted file into small segments, in order to reduce the memory
footprint, and to decrease the lag between initiating a download and receiving
the first part of the file; for example the lag between hitting "play" and a
movie actually starting.

The client then erasure-codes each segment, producing blocks of which only a
subset are needed to reconstruct the segment (3 out of 10, with the default
settings).

It sends one block from each segment to a given server. The set of blocks on a
given server constitutes a "share". Therefore a subset f the shares (3 out of 10,
by default) are needed to reconstruct the file.

A hash of the encryption key is used to form the "storage index", which is used
for both server selection (described below) and to index shares within the
Storage Servers on the selected nodes.

The client computes secure hashes of the ciphertext and of the shares. It uses
Merkle Trees so that it is possible to verify the correctness of a subset of
the data without requiring all of the data.  For example, this allows you to
verify the correctness of the first segment of a movie file and then begin
playing the movie file in your movie viewer before the entire movie file has
been downloaded.

These hashes are stored in a small datastructure named the Capability
Extension Block which is stored on the storage servers alongside each share.

The capability contains the encryption key, the hash of the Capability
Extension Block, and any encoding parameters necessary to perform the eventual
decoding process.  For convenience, it also contains the size of the file
being stored.

To download, the client that wishes to turn a capability into a sequence of
bytes will obtain the blocks from storage servers, use erasure-decoding to
turn them into segments of ciphertext, use the decryption key to convert that
into plaintext, then emit the plaintext bytes to the output target.


CAPABILITIES

Capabilities to immutable files represent a specific set of bytes.  Think of
it like a hash function: you feed in a bunch of bytes, and you get out a
capability, which is deterministically derived from the input data: changing
even one bit of the input data will result in a completely different
capability.

Read-only capabilities to mutable files represent the ability to get a set of
bytes representing some version of the file, most likely the latest version.
Each read-only capability is unique. In fact, each mutable file has a unique
public/private key pair created when the mutable file is created, and the
read-only capability to that file includes a secure hash of the public key.

Read-write capabilities to mutable files represent the ability to read the
file (just like a read-only capability) and also to write a new version of
the file, overwriting any extant version. Read-write capabilities are unique
-- each one includes the secure hash of the private key associated with that
mutable file.

The capability provides both "location" and "identification": you can use it
to retrieve a set of bytes, and then you can use it to validate ("identify")
that these potential bytes are indeed the ones that you were looking for.

The "key-value store" layer doesn't include human-meaningful
names. Capabilities sit on the "global+secure" edge of Zooko's
Triangle[1]. They are self-authenticating, meaning that nobody can trick you
into accepting a file that doesn't match the capability you used to refer to
that file. The filesystem layer (described below) adds human-meaningful names
atop the key-value layer.


SERVER SELECTION

When a file is uploaded, the encoded shares are sent to other nodes. But to
which ones? The "server selection" algorithm is used to make this choice.

In the current version, the storage index is used to consistently-permute the
set of all peer nodes (by sorting the peer nodes by
HASH(storage_index+peerid)). Each file gets a different permutation, which
(on average) will evenly distribute shares among the grid and avoid hotspots.

We use this permuted list of nodes to ask each node, in turn, if it will hold
a share for us, by sending an 'allocate_buckets() query' to each one. Some
will say yes, others (those who are full) will say no: when a node refuses our
request, we just take that share to the next node on the list. We keep going
until we run out of shares to place. At the end of the process, we'll have a
table that maps each share number to a node, and then we can begin the
encode+push phase, using the table to decide where each share should be sent.

Most of the time, this will result in one share per node, which gives us
maximum reliability (since it disperses the failures as widely as possible).
If there are fewer useable nodes than there are shares, we'll be forced to
loop around, eventually giving multiple shares to a single node. This reduces
reliability, so it isn't the sort of thing we want to happen all the time, and
either indicates that the default encoding parameters are set incorrectly
(creating more shares than you have nodes), or that the grid does not have
enough space (many nodes are full). But apart from that, it doesn't hurt. If
we have to loop through the node list a second time, we accelerate the query
process, by asking each node to hold multiple shares on the second pass. In
most cases, this means we'll never send more than two queries to any given
node.

If a node is unreachable, or has an error, or refuses to accept any of our
shares, we remove them from the permuted list, so we won't query them a second
time for this file. If a node already has shares for the file we're uploading
(or if someone else is currently sending them shares), we add that information
to the share-to-peer-node table. This lets us do less work for files which have
been uploaded once before, while making sure we still wind up with as many
shares as we desire.

If we are unable to place every share that we want, but we still managed to
place a quantity known as "shares of happiness", we'll do the upload anyways.
If we cannot place at least this many, the upload is declared a failure.

The current defaults use k=3, shares_of_happiness=7, and N=10, meaning that
we'll try to place 10 shares, we'll be happy if we can place 7, and we need to
get back any 3 to recover the file. This results in a 3.3x expansion
factor. In general, you should set N about equal to the number of nodes in
your grid, then set N/k to achieve your desired availability goals.

When downloading a file, the current version just asks all known servers for
any shares they might have.  and then downloads the shares from the first servers that

chooses the minimal necessary subset, then starts
change downloading and processing those shares. A future release will use the
server selection algorithm to reduce the number of queries that must be sent
out. This algorithm uses the same consistent-hashing permutation as on upload,
but stops after it has located k shares (instead of all N). This reduces the
number of queries that must be sent before downloading can begin.

The actual number of queries is directly related to the availability of the
nodes and the degree of overlap between the node list used at upload and at
download. For stable grids, this overlap is very high, and usually the first k
queries will result in shares. The number of queries grows as the stability
decreases. Some limits may be imposed in large grids to avoid querying a
million nodes; this provides a tradeoff between the work spent to discover
that a file is unrecoverable and the probability that a retrieval will fail
when it could have succeeded if we had just tried a little bit harder. The
appropriate value of this tradeoff will depend upon the size of the grid, and
will change over time.

Other peer-node selection algorithms are possible. One earlier version (known
as "Tahoe 3") used the permutation to place the nodes around a large ring,
distributed shares evenly around the same ring, then walks clockwise from 0
with a basket: each time we encounter a share, put it in the basket, each time
we encounter a node, give them as many shares from our basket as they'll
accept. This reduced the number of queries (usually to 1) for small grids
(where N is larger than the number of nodes), but resulted in extremely
non-uniform share distribution, which significantly hurt reliability
(sometimes the permutation resulted in most of the shares being dumped on a
single node).

Another algorithm (known as "denver airport"[2]) uses the permuted hash to
decide on an approximate target for each share, then sends lease requests via
Chord routing. The request includes the contact information of the uploading
node, and asks that the node which eventually accepts the lease should contact
the uploader directly. The shares are then transferred over direct connections
rather than through multiple Chord hops. Download uses the same approach. This
allows nodes to avoid maintaining a large number of long-term connections, at
the expense of complexity, latency, and reliability.


SWARMING DOWNLOAD, TRICKLING UPLOAD

Because the shares being downloaded are distributed across a large number of
nodes, the download process will pull from many of them at the same time. The
current encoding parameters require 3 shares to be retrieved for each segment,
which means that up to 3 nodes will be used simultaneously. For larger
networks, 8-of-22 encoding could be used, meaning 8 nodes can be used
simultaneously. This allows the download process to use the sum of the
available nodes' upload bandwidths, resulting in downloads that take full
advantage of the common 8x disparity between download and upload bandwith on
modern ADSL lines.

On the other hand, uploads are hampered by the need to upload encoded shares
that are larger than the original data (3.3x larger with the current default
encoding parameters), through the slow end of the asymmetric connection. This
means that on a typical 8x ADSL line, uploading a file will take about 32
times longer than downloading it again later.

Smaller expansion ratios can reduce this upload penalty, at the expense of
reliability. See RELIABILITY, below. By using an "upload helper", this penalty
is eliminated: the client does a 1x upload of encrypted data to the helper,
then the helper performs encoding and pushes the shares to the storage
servers. This is an improvement if the helper has significantly higher upload
bandwidth than the client, so it makes the most sense for a commercially-run
grid for which all of the storage servers are in a colo facility with high
interconnect bandwidth. In this case, the helper is placed in the same
facility, so the helper-to-storage-server bandwidth is huge.

See "helper.txt" for details about the upload helper.


THE FILESYSTEM LAYER

The "filesystem" layer is responsible for mapping human-meaningful pathnames
(directories and filenames) to pieces of data. The actual bytes inside these
files are referenced by capability, but the filesystem layer is where the
directory names, file names, and metadata are kept.

The filesystem layer is a graph of directories. Each directory contains a
table of named children. These children are either other directories or
files. All children are referenced by their capability.

A directory has two forms of capability: read-write caps and read-only
caps. The table of children inside the directory has a read-write and
read-only capability for each child. If you have a read-only capability for a
given directory, you will not be able to access the read-write capability of
its children.  This results in "transitively read-only" directory access.

By having two different capabilities, you can choose which you want to share
with someone else. If you create a new directory and share the read-write
capability for it with a friend, then you will both be able to modify its
contents. If instead you give them the read-only capability, then they will
*not* be able to modify the contents. Any capability that you receive can be
linked in to any directory that you can modify, so very powerful
shared+published directory structures can be built from these components.

This structure enable individual users to have their own personal space, with
links to spaces that are shared with specific other users, and other spaces
that are globally visible.


LEASES, REFRESHING, GARBAGE COLLECTION

When a file or directory in the virtual filesystem is no longer referenced,
the space that its shares occupied on each storage server can be freed,
making room for other shares. Tahoe currently uses a garbage collection
("GC") mechanism to implement this space-reclamation process. Each share has
one or more "leases", which are managed by clients who want the
file/directory to be retained. The storage server accepts each share for a
pre-defined period of time, and is allowed to delete the share if all of the
leases are cancelled or allowed to expire.

Garbage collection is not enabled by default: storage servers will not delete
shares without being explicitly configured to do so. When GC is enabled,
clients are responsible for renewing their leases on a periodic basis at
least frequently enough to prevent any of the leases from expiring before the
next renewal pass.

See docs/garbage-collection.txt for further information, and how to configure
garbage collection.


FILE REPAIRER

Shares may go away because the storage server hosting them has suffered a
failure: either temporary downtime (affecting availability of the file), or a
permanent data loss (affecting the reliability of the file). Hard drives
crash, power supplies explode, coffee spills, and asteroids strike. The goal
of a robust distributed filesystem is to survive these setbacks.

To work against this slow, continual loss of shares, a File Checker is used to
periodically count the number of shares still available for any given file. A
more extensive form of checking known as the File Verifier can download the
ciphertext of the target file and perform integrity checks (using strong
hashes) to make sure the data is stil intact. When the file is found to have
decayed below some threshold, the File Repairer can be used to regenerate and
re-upload the missing shares. These processes are conceptually distinct (the
repairer is only run if the checker/verifier decides it is necessary), but in
practice they will be closely related, and may run in the same process.

The repairer process does not get the full capability of the file to be
maintained: it merely gets the "repairer capability" subset, which does not
include the decryption key. The File Verifier uses that data to find out which
nodes ought to hold shares for this file, and to see if those nodes are still
around and willing to provide the data. If the file is not healthy enough, the
File Repairer is invoked to download the ciphertext, regenerate any missing
shares, and upload them to new nodes. The goal of the File Repairer is to
finish up with a full set of "N" shares.

There are a number of engineering issues to be resolved here. The bandwidth,
disk IO, and CPU time consumed by the verification/repair process must be
balanced against the robustness that it provides to the grid. The nodes
involved in repair will have very different access patterns than normal nodes,
such that these processes may need to be run on hosts with more memory or
network connectivity than usual. The frequency of repair will directly affect
the resources consumed. In some cases, verification of multiple files can be
performed at the same time, and repair of files can be delegated off to other
nodes.

The security model we are currently using assumes that nodes who claim to hold
a share will actually provide it when asked. (We validate the data they
provide before using it in any way, but if enough nodes claim to hold the data
and are wrong, the file will not be repaired, and may decay beyond
recoverability). There are several interesting approaches to mitigate this
threat, ranging from challenges to provide a keyed hash of the allegedly-held
data (using "buddy nodes", in which two nodes hold the same block, and check
up on each other), to reputation systems, or even the original Mojo Nation
economic model.


SECURITY

The design goal for this project is that an attacker may be able to deny
service (i.e. prevent you from recovering a file that was uploaded earlier)
but can accomplish none of the following three attacks:

 1) violate confidentiality: the attacker gets to view data to which you have
    not granted them access
 2) violate consistency: the attacker convinces you that the wrong data is
    actually the data you were intending to retrieve
 3) violate mutability: the attacker gets to modify a directory (either the
    pathnames or the file contents) to which you have not given them
    mutability rights

Data validity and consistency (the promise that the downloaded data will match
the originally uploaded data) is provided by the hashes embedded in the
capability. Data confidentiality (the promise that the data is only readable
by people with the capability) is provided by the encryption key embedded in
the capability. Data availability (the hope that data which has been uploaded
in the past will be downloadable in the future) is provided by the grid, which
distributes failures in a way that reduces the correlation between individual
node failure and overall file recovery failure, and by the erasure-coding
technique used to generate shares.

Many of these security properties depend upon the usual cryptographic
assumptions: the resistance of AES and RSA to attack, the resistance of
SHA-256 to pre-image attacks, and upon the proximity of 2^-128 and 2^-256 to
zero. A break in AES would allow a confidentiality violation, a pre-image
break in SHA-256 would allow a consistency violation, and a break in RSA
would allow a mutability violation. The discovery of a collision in SHA-256
is unlikely to allow much, but could conceivably allow a consistency
violation in data that was uploaded by the attacker. If SHA-256 is
threatened, further analysis will be warranted.

There is no attempt made to provide anonymity, neither of the origin of a
piece of data nor the identity of the subsequent downloaders. In general,
anyone who already knows the contents of a file will be in a strong position
to determine who else is uploading or downloading it. Also, it is quite easy
for a sufficiently large coalition of nodes to correlate the set of nodes who
are all uploading or downloading the same file, even if the attacker does not
know the contents of the file in question.

Also note that the file size and (when convergence is being used) a keyed hash
of the plaintext are not protected. Many people can determine the size of the
file you are accessing, and if they already know the contents of a given file,
they will be able to determine that you are uploading or downloading the same
one.

A likely enhancement is the ability to use distinct encryption keys for each
file, avoiding the file-correlation attacks at the expense of increased
storage consumption. This is known as "non-convergent" encoding.

The capability-based security model is used throughout this project. Directory
operations are expressed in terms of distinct read- and write- capabilities.
Knowing the read-capability of a file is equivalent to the ability to read the
corresponding data. The capability to validate the correctness of a file is
strictly weaker than the read-capability (possession of read-capability
automatically grants you possession of validate-capability, but not vice
versa). These capabilities may be expressly delegated (irrevocably) by simply
transferring the relevant secrets.

The application layer can provide whatever access model is desired, built on
top of this capability access model.  The first big user of this system so far
is allmydata.com.  The allmydata.com access model currently works like a
normal web site, using username and password to give a user access to her
virtual drive.  In addition, allmydata.com users can share individual files
(using a file sharing interface built on top of the immutable file read
capabilities).


RELIABILITY

File encoding and peer-node selection parameters can be adjusted to achieve
different goals. Each choice results in a number of properties; there are many
tradeoffs.

First, some terms: the erasure-coding algorithm is described as K-out-of-N
(for this release, the default values are K=3 and N=10). Each grid will have
some number of nodes; this number will rise and fall over time as nodes join,
drop out, come back, and leave forever. Files are of various sizes, some are
popular, others are rare. Nodes have various capacities, variable
upload/download bandwidths, and network latency. Most of the mathematical
models that look at node failure assume some average (and independent)
probability 'P' of a given node being available: this can be high (servers
tend to be online and available >90% of the time) or low (laptops tend to be
turned on for an hour then disappear for several days). Files are encoded in
segments of a given maximum size, which affects memory usage.

The ratio of N/K is the "expansion factor". Higher expansion factors improve
reliability very quickly (the binomial distribution curve is very sharp), but
consumes much more grid capacity. When P=50%, the absolute value of K affects
the granularity of the binomial curve (1-out-of-2 is much worse than
50-out-of-100), but high values asymptotically approach a constant (i.e.
500-of-1000 is not much better than 50-of-100). When P is high and the
expansion factor is held at a constant, higher values of K and N give much
better reliability (for P=99%, 50-out-of-100 is much much better than 5-of-10,
roughly 10^50 times better), because there are more shares that can be lost
without losing the file.

Likewise, the total number of nodes in the network affects the same
granularity: having only one node means a single point of failure, no matter
how many copies of the file you make. Independent nodes (with uncorrelated
failures) are necessary to hit the mathematical ideals: if you have 100 nodes
but they are all in the same office building, then a single power failure will
take out all of them at once. The "Sybil Attack" is where a single attacker
convinces you that they are actually multiple servers, so that you think you
are using a large number of independent nodes, but in fact you have a single
point of failure (where the attacker turns off all their machines at
once). Large grids, with lots of truly independent nodes, will enable the use
of lower expansion factors to achieve the same reliability, but will increase
overhead because each node needs to know something about every other, and the
rate at which nodes come and go will be higher (requiring network maintenance
traffic). Also, the File Repairer work will increase with larger grids,
although then the job can be distributed out to more nodes.

Higher values of N increase overhead: more shares means more Merkle hashes
that must be included with the data, and more nodes to contact to retrieve the
shares. Smaller segment sizes reduce memory usage (since each segment must be
held in memory while erasure coding runs) and improves "alacrity" (since
downloading can validate a smaller piece of data faster, delivering it to the
target sooner), but also increase overhead (because more blocks means more
Merkle hashes to validate them).

In general, small private grids should work well, but the participants will
have to decide between storage overhead and reliability. Large stable grids
will be able to reduce the expansion factor down to a bare minimum while still
retaining high reliability, but large unstable grids (where nodes are coming
and going very quickly) may require more repair/verification bandwidth than
actual upload/download traffic.

Tahoe nodes that run a webserver have a page dedicated to provisioning
decisions: this tool may help you evaluate different expansion factors and
view the disk consumption of each. It is also acquiring some sections with
availability/reliability numbers, as well as preliminary cost analysis data.
This tool will continue to evolve as our analysis improves.

------------------------------

[1]: http://en.wikipedia.org/wiki/Zooko%27s_triangle

[2]: all of these names are derived from the location where they were
     concocted, in this case in a car ride from Boulder to DEN. To be
     precise, "Tahoe 1" was an unworkable scheme in which everyone who holds
     shares for a given file would form a sort of cabal which kept track of
     all the others, "Tahoe 2" is the first-100-nodes in the permuted hash
     described in this document, and "Tahoe 3" (or perhaps "Potrero hill 1")
     was the abandoned ring-with-many-hands approach.