Allmydata "Tahoe" Architecture OVERVIEW The high-level view of this system consists of three layers: the grid, the virtual drive, and the application that sits on top. The lowest layer is the "grid", basically a DHT (Distributed Hash Table) which maps URIs to data. The URIs are relatively short ascii strings (currently about 140 bytes), and each is used as a reference to an immutable arbitrary-length sequence of data bytes. This data is distributed around the grid across a large number of nodes, such that a statistically unlikely number of nodes would have to be unavailable for the data to become unavailable. The middle layer is the virtual drive: a tree-shaped data structure in which the intermediate nodes are directories and the leaf nodes are files. Each file contains both the URI of the file's data and all the necessary metadata (MIME type, filename, ctime/mtime, etc) required to present the file to a user in a meaningful way (displaying it in a web browser, or on a desktop). The top layer is where the applications that use this virtual drive operate. Allmydata uses this for a backup service, in which the application copies the files to be backed up from the local disk into the virtual drive on a periodic basis. By providing read-only access to the same virtual drive later, a user can recover older versions of their files. Other sorts of applications can run on top of the virtual drive, of course -- anything that has a use for a secure, robust, distributed filestore. Note: some of the description below indicates design targets rather than actual code present in the current release. Please take a look at roadmap.txt to get an idea of how much of this has been implemented so far. THE BIG GRID OF PEERS Underlying the grid is a large collection of peer nodes. These are processes running on a wide variety of computers, all of which know about each other in some way or another. They establish TCP connections to one another using Foolscap, an encrypted+authenticated remote message passing library (using TLS connections and self-authenticating identifiers called "FURLs"). Each peer offers certain services to the others. The primary service is the StorageServer, which offers to hold data for a limited period of time (a "lease"). Each StorageServer has a quota, and it will reject lease requests that would cause it to consume more space than it wants to provide. When a lease expires, the data is deleted. Peers might renew their leases. This storage is used to hold "shares", which are themselves used to store files in the grid. There are many shares for each file, typically between 10 and 100 (the exact number depends upon the tradeoffs made between reliability, overhead, and storage space consumed). The files are indexed by a "StorageIndex", which is derived from the encryption key, which may be randomly generated or it may be derived from the contents of the file. Leases are indexed by StorageIndex, and a single StorageServer may hold multiple shares for the corresponding file. Multiple peers can hold leases on the same file, in which case the shares will be kept alive until the last lease expires. The typical lease is expected to be for one month: enough time for interested parties to renew it, but not so long that abandoned data consumes unreasonable space. Peers are expected to "delete" (drop leases) on data that they know they no longer want: lease expiration is meant as a safety measure. In this release, peers learn about each other through the "introducer". Each peer connects to this central introducer at startup, and receives a list of all other peers from it. Each peer then connects to all other peers, creating a fully-connected topology. Future versions will reduce the number of connections considerably, to enable the grid to scale to larger sizes: the design target is one million nodes. In addition, future versions will offer relay and NAT-traversal services to allow nodes without full internet connectivity to participate. In the current release, nodes behind NAT boxes will connect to all nodes that they can open connections to, but they cannot open connections to other nodes behind NAT boxes. Therefore, the more nodes there are behind NAT boxes the less the topology resembles the intended fully-connected mesh topology. FILE ENCODING When a file is to be added to the grid, it is first encrypted using a key that is derived from the hash of the file itself (if convergence is desired) or randomly generated (if not). The encrypted file is then broken up into segments so it can be processed in small pieces (to minimize the memory footprint of both encode and decode operations, and to increase the so-called "alacrity": how quickly can the download operation provide validated data to the user, basically the lag between hitting "play" and the movie actually starting). Each segment is erasure coded, which creates encoded blocks such that only a subset of them are required to reconstruct the segment. These blocks are then combined into "shares", such that a subset of the shares can be used to reconstruct the whole file. The shares are then deposited in StorageServers in other peers. A tagged hash of the encryption key is used to form the "storage index", which is used for both peer selection (described below) and to index shares within the StorageServers on the selected peers. A variety of hashes are computed while the shares are being produced, to validate the plaintext, the crypttext, and the shares themselves. Merkle hash trees are also produced to enable validation of individual segments of plaintext or crypttext without requiring the download/decoding of the whole file. These hashes go into the "URI Extension Block", which will be stored with each share. The URI contains the encryption key, the hash of the URI Extension Block, and any encoding parameters necessary to perform the eventual decoding process. For convenience, it also contains the size of the file being stored. On the download side, the node that wishes to turn a URI into a sequence of bytes will obtain the necessary shares from remote nodes, break them into blocks, use erasure-decoding to turn them into segments of crypttext, use the decryption key to convert that into plaintext, then emit the plaintext bytes to the output target (which could be a file on disk, or it could be streamed directly to a web browser or media player). All hashes use SHA256, and a different tag is used for each purpose. Netstrings are used where necessary to insure these tags cannot be confused with the data to be hashed. All encryption uses AES in CTR mode. The erasure coding is performed with zfec (a python wrapper around Rizzo's FEC library). A Merkle Hash Tree is used to validate the encoded blocks before they are fed into the decode process, and a transverse tree is used to validate the shares before they are retrieved. A third merkle tree is constructed over the plaintext segments, and a fourth is constructed over the crypttext segments. All necessary hash chains are stored with the shares, and the hash tree roots are put in the URI extension block. The final hash of the extension block goes into the URI itself. Note that the number of shares created is fixed at the time the file is uploaded: it is not possible to create additional shares later. The use of a top-level hash tree also requires that nodes create all shares at once, even if they don't intend to upload some of them, otherwise the hashroot cannot be calculated correctly. URIs Each URI represents a specific set of bytes. Think of it like a hash function: you feed in a bunch of bytes, and you get out a URI. If convergence is enabled, the URI is deterministically derived from the input data: changing even one bit of the input data will result in a drastically different URI. If convergence is not enabled, the encoding process will generate a different URI each time the file is uploaded. The URI provides both "location" and "identification": you can use it to locate/retrieve a set of bytes that are possibly the same as the original file, and then you can use it to validate ("identify") that these potential bytes are indeed the ones that you were looking for. URIs refer to an immutable set of bytes. If you modify a file and upload the new version to the grid, you will get a different URI. URIs do not represent filenames at all, just the data that a filename might point to at some given point in time. This is why the "grid" layer is insufficient to provide a virtual drive: an actual filesystem requires human-meaningful names and mutability, while URIs provide neither. URIs sit on the "global+secure" edge of Zooko's Triangle[1]. They are self-authenticating, meaning that nobody can trick you into using the wrong data. The URI should be considered as a "read capability" for the corresponding data: anyone who knows the full URI has the ability to read the given data. There is a subset of the URI (which leaves out the encryption key and fileid) which is called the "verification capability": it allows the holder to retrieve and validate the crypttext, but not the plaintext. Once the crypttext is available, the erasure-coded shares can be regenerated. This will allow a file-repair process to maintain and improve the robustness of files without being able to read their contents. The lease mechanism will also involve a "delete" capability, by which a peer which uploaded a file can indicate that they don't want it anymore. It is not truly a delete capability because other peers might be holding leases on the same data, and it should not be deleted until the lease count (i.e. reference count) goes to zero, so perhaps "cancel-the-lease capability" is more accurate. The plan is to store this capability next to the URI in the virtual drive structure. PEER SELECTION When a file is uploaded, the encoded shares are sent to other peers. But to which ones? The "peer selection" algorithm is used to make this choice. In the current version, the verifierid is used to consistently-permute the set of all peers (by sorting the peers by HASH(verifierid+peerid)). Each file gets a different permutation, which (on average) will evenly distribute shares among the grid and avoid hotspots. This permutation places the peers around a 2^256-sized ring, like the rim of a big clock. The 100-or-so shares are then placed around the same ring (at 0, 1/100*2^256, 2/100*2^256, ... 99/100*2^256). Imagine that we start at 0 with an empty basket in hand and proceed clockwise. When we come to a share, we pick it up and put it in the basket. When we come to a peer, we ask that peer if they will give us a lease for every share in our basket. The peer will grant us leases for some of those shares and reject others (if they are full or almost full). If they reject all our requests, we remove them from the ring, because they are full and thus unhelpful. Each share they accept is removed from the basket. The remainder stay in the basket as we continue walking clockwise. We keep walking, accumulating shares and distributing them to peers, until either we find a home for all shares, or there are no peers left in the ring (because they are all full). If we run out of peers before we run out of shares, the upload may be considered a failure, depending upon how many shares we were able to place. The current parameters try to place 100 shares, of which 25 must be retrievable to recover the file, and the peer selection algorithm is happy if it was able to place at least 75 shares. These numbers are adjustable: 25-out-of-100 means an expansion factor of 4x (every file in the grid consumes four times as much space when totalled across all StorageServers), but is highly reliable (the actual reliability is a binomial distribution function of the expected availability of the individual peers, but in general it goes up very quickly with the expansion factor). If the file has been uploaded before (or if two uploads are happening at the same time), a peer might already have shares for the same file we are proposing to send to them. In this case, those shares are removed from the list and assumed to be available (or will be soon). This reduces the number of uploads that must be performed. When downloading a file, the current release just asks all known peers for any shares they might have, chooses the minimal necessary subset, then starts downloading and processing those shares. A later release will use the full algorithm to reduce the number of queries that must be sent out. This algorithm uses the same consistent-hashing permutation as on upload, but instead of one walker with one basket, we have 100 walkers (one per share). They each proceed clockwise in parallel until they find a peer, and put that one on the "A" list: out of all peers, this one is the most likely to be the same one to which the share was originally uploaded. The next peer that each walker encounters is put on the "B" list, etc. All the "A" list peers are asked for any shares they might have. If enough of them can provide a share, the download phase begins and those shares are retrieved and decoded. If not, the "B" list peers are contacted, etc. This routine will eventually find all the peers that have shares, and will find them quickly if there is significant overlap between the set of peers that were present when the file was uploaded and the set of peers that are present as it is downloaded (i.e. if the "peerlist stability" is high). Some limits may be imposed in large grids to avoid querying a million peers; this provides a tradeoff between the work spent to discover that a file is unrecoverable and the probability that a retrieval will fail when it could have succeeded if we had just tried a little bit harder. The appropriate value of this tradeoff will depend upon the size of the grid, and will change over time. Other peer selection algorithms are being evaluated. One of them (known as "tahoe 2") uses the same consistent hash, starts at 0 and requests one lease per peer until it gets 100 of them. This is likely to get better overlap (since a single insertion or deletion will still leave 99 overlapping peers), but is non-ideal in other ways (TODO: what were they?). It would also make it easier to select peers on the basis of their reliability, uptime, or reputation: we could pick 75 good peers plus 50 marginal peers, if it seemed likely that this would provide as good service as 100 good peers. Another algorithm (known as "denver airport"[2]) uses the permuted hash to decide on an approximate target for each share, then sends lease requests via Chord routing. The request includes the contact information of the uploading node, and asks that the node which eventually accepts the lease should contact the uploader directly. The shares are then transferred over direct connections rather than through multiple Chord hops. Download uses the same approach. This allows nodes to avoid maintaining a large number of long-term connections, at the expense of complexity, latency, and reliability. SWARMING DOWNLOAD, TRICKLING UPLOAD Because the shares being downloaded are distributed across a large number of peers, the download process will pull from many of them at the same time. The current encoding parameters require 25 shares to be retrieved for each segment, which means that up to 25 peers will be used simultaneously. This allows the download process to use the sum of the available peers' upload bandwidths, resulting in downloads that take full advantage of the common 8x disparity between download and upload bandwith on modern ADSL lines. On the other hand, uploads are hampered by the need to upload encoded shares that are larger than the original data (4x larger with the current default encoding parameters), through the slow end of the asymmetric connection. This means that on a typical 8x ADSL line, uploading a file will take about 32 times longer than downloading it again later. Smaller expansion ratios can reduce this upload penalty, at the expense of reliability. See RELIABILITY, below. FILETREE: THE VIRTUAL DRIVE LAYER The "virtual drive" layer is responsible for mapping human-meaningful pathnames (directories and filenames) to pieces of data. The actual bytes inside these files are referenced by URI, but the "filetree" is where the directory names, file names, and metadata are kept. The current release has a very simplistic filetree model. There is a single globally-shared directory structure, which maps filename to URI. This structure is maintained in a central node (which happens to be the same node that houses the Introducer), by writing URIs to files in a local filesystem. A future release (probably the next one) will offer each application the ability to have a separate file tree. Each tree can reference others. Some trees are redirections, while others actually contain subdirectories full of filenames. The redirections may be mutable by some users but not by others, allowing both read-only and read-write views of the same data. This will enable individual users to have their own personal space, with links to spaces that are shared with specific other users, and other spaces that are globally visible. Eventually the application layer will present these pieces in a way that allows the sharing of a specific file or the creation of a "virtual CD" as easily as dragging a folder onto a user icon. The URIs described above are "Content Hash Key" (CHK) identifiers[3], in which the identifier refers to a specific, unchangeable sequence of bytes. In this project, CHK identifiers are used for both files and immutable versions of directories: the tree of directory and file nodes is serialized into a sequence of bytes, which is then uploaded and turned into a URI. Each time the directory is changed, a new URI is generated for it and propagated to the filetree above it. There is a separate kind of upload, not yet implemented, called SSK (short for Signed Subspace Key), in which the URI refers to a mutable slot. Some users have a write-capability to this slot, allowing them to change the data that it refers to. Others only have a read-capability, merely letting them read the current contents. These SSK slots can be used to provide mutability in the filetree, so that users can actually change the contents of their virtual drive. Redirection nodes can also provide mutability, such as a central service which allows a user to set the current URI of their top-level filetree. SSK slots provide a decentralized way to accomplish this mutability, whereas centralized redirection nodes are more vulnerable to single-point-of-failure issues. FILE REPAIRER Each node is expected to explicitly drop leases on files that it knows it no longer wants (the "delete" operation). Nodes are also expected to renew leases on files that still exist in their filetrees. When nodes are offline for an extended period of time, their files may decay (both because of leases expiring and because of StorageServers going offline). A File Verifier is used to check on the health of any given file, and a File Repairer is used to to keep desired files alive. The two are conceptually distinct (the repairer is run if the verifier decides it is necessary), but in practice they will be closely related, and may run in the same process. The repairer process does not get the full URI of the file to be maintained: it merely gets the "repairer capability" subset, which does not include the decryption key. The File Verifier uses that data to find out which peers ought to hold shares for this file, and to see if those peers are still around and willing to provide the data. If the file is not healthy enough, the File Repairer is invoked to download the crypttext, regenerate any missing shares, and upload them to new peers. The goal of the File Repairer is to finish up with a full set of 100 shares. There are a number of engineering issues to be resolved here. The bandwidth, disk IO, and CPU time consumed by the verification/repair process must be balanced against the robustness that it provides to the grid. The nodes involved in repair will have very different access patterns than normal nodes, such that these processes may need to be run on hosts with more memory or network connectivity than usual. The frequency of repair runs directly affects the resources consumed. In some cases, verification of multiple files can be performed at the same time, and repair of files can be delegated off to other nodes. The security model we are currently using assumes that peers who claim to hold a share will actually provide it when asked. (We validate the data they provide before using it in any way, but if enough peers claim to hold the data and are wrong, the file will not be repaired, and may decay beyond recoverability). There are several interesting approaches to mitigate this threat, ranging from challenges to provide a keyed hash of the allegedly-held data (using "buddy nodes", in which two peers hold the same block, and check up on each other), to reputation systems, or even the original Mojo Nation economic model. SECURITY The design goal for this project is that an attacker may be able to deny service (i.e. prevent you from recovering a file that was uploaded earlier) but can accomplish none of the following three attacks: 1) violate privacy: the attacker gets to view data to which you have not granted them access 2) violate consistency: the attacker convinces you that the wrong data is actually the data you were intending to retrieve 3) violate mutability: the attacker gets to modify a filetree (either the pathnames or the file contents) to which you have not given them mutability rights Data validity and consistency (the promise that the downloaded data will match the originally uploaded data) is provided by the hashes embedded the URI. Data security (the promise that the data is only readable by people with the URI) is provided by the encryption key embedded in the URI. Data availability (the hope that data which has been uploaded in the past will be downloadable in the future) is provided by the grid, which distributes failures in a way that reduces the correlation between individual node failure and overall file recovery failure. Many of these security properties depend upon the usual cryptographic assumptions: the resistance of AES and RSA to attack, the resistance of SHA256 to pre-image attacks, and upon the proximity of 2^-128 and 2^-256 to zero. A break in AES would allow a privacy violation, a pre-image break in SHA256 would allow a consistency violation, and a break in RSA would allow a mutability violation. The discovery of a collision in SHA256 is unlikely to allow much, but could conceivably allow a consistency violation in data that was uploaded by the attacker. If SHA256 is threatened, further analysis will be warranted. There is no attempt made to provide anonymity, neither of the origin of a piece of data nor the identity of the subsequent downloaders. In general, anyone who already knows the contents of a file will be in a strong position to determine who else is uploading or downloading it. Also, it is quite easy for a coalition of more than 1% of the nodes to correlate the set of peers who are all uploading or downloading the same file, even if the attacker does not know the contents of the file in question. Also note that the file size and verifierid are not protected. Many people can determine the size of the file you are accessing, and if they already know the contents of a given file, they will be able to determine that you are uploading or downloading the same one. A likely enhancement is the ability to use distinct encryption keys for each file, avoiding the file-correlation attacks at the expense of increased storage consumption. The capability-based security model is used throughout this project. Filetree operations are expressed in terms of distinct read and write capabilities. The URI of a file is the read-capability: knowing the URI is equivalent to the ability to read the corresponding data. The capability to validate and repair a file is a subset of the read-capability. The capability to read an SSK slot is a subset of the capability to modify it. These capabilities may be expressly delegated (irrevocably) by simply transferring the relevant secrets. Special forms of SSK slots can be used to make revocable delegations of particular directories. Certain redirections in the filetree code are expressed as Foolscap "furls", which are also capabilities and provide access to an instance of code running on a central server: these can be delegated just as easily as any other capability, and can be made revocable by delegating access to a forwarder instead of the actual target. The application layer can provide whatever security/access model is desired, but we expect the first few to also follow capability discipline: rather than user accounts with passwords, each user will get a furl to their private filetree, and the presentation layer will give them the ability to break off pieces of this filetree for delegation or sharing with others on demand. RELIABILITY File encoding and peer selection parameters can be adjusted to achieve different goals. Each choice results in a number of properties; there are many tradeoffs. First, some terms: the erasure-coding algorithm is described as K-out-of-N (for this release, the default values are K=25 and N=100). Each grid will have some number of peers; this number will rise and fall over time as peers join, drop out, come back, and leave forever. Files are of various sizes, some are popular, others are rare. Peers have various capacities, variable upload/download bandwidths, and network latency. Most of the mathematical models that look at peer failure assume some average (and independent) probability 'P' of a given peer being available: this can be high (servers tend to be online and available >90% of the time) or low (laptops tend to be turned on for an hour then disappear for several days). Files are encoded in segments of a given maximum size, which affects memory usage. The ratio of N/K is the "expansion factor". Higher expansion factors improve reliability very quickly (the binomial distribution curve is very sharp), but consumes much more grid capacity. The absolute value of K affects the granularity of the binomial curve (1-out-of-2 is much worse than 50-out-of-100), but high values asymptotically approach a constant that depends upon 'P' (i.e. 500-of-1000 is not much better than 50-of-100). Likewise, the total number of peers in the network affects the same granularity: having only one peer means a single point of failure, no matter how many copies of the file you make. Independent peers (with uncorrelated failures) are necessary to hit the mathematical ideals: if you have 100 nodes but they are all in the same office building, then a single power failure will take out all of them at once. The "Sybil Attack" is where a single attacker convinces you that they are actually multiple servers, so that you think you are using a large number of independent peers, but in fact you have a single point of failure (where the attacker turns off all their machines at once). Large grids, with lots of truly-independent peers, will enable the use of lower expansion factors to achieve the same reliability, but increase overhead because each peer needs to know something about every other, and the rate at which peers come and go will be higher (requiring network maintenance traffic). Also, the File Repairer work will increase with larger grids, although then the job can be distributed out to more peers. Higher values of N increase overhead: more shares means more Merkle hashes that must be included with the data, and more peers to contact to retrieve the shares. Smaller segment sizes reduce memory usage (since each segment must be held in memory while erasure coding runs) and increases "alacrity" (since downloading can validate a smaller piece of data faster, delivering it to the target sooner), but also increase overhead (because more blocks means more Merkle hashes to validate them). In general, small private grids should work well, but the participants will have to decide between storage overhead and reliability. Large stable grids will be able to reduce the expansion factor down to a bare minimum while still retaining high reliability, but large unstable grids (where nodes are coming and going very quickly) may require more repair/verification bandwidth than actual upload/download traffic. ------------------------------ [1]: http://en.wikipedia.org/wiki/Zooko%27s_triangle [2]: all of these names are derived from the location where they were concocted, in this case in a car ride from Boulder to DEN. To be precise, "tahoe 1" was an unworkable scheme in which everyone holding shares for a given file formed a sort of cabal which kept track of all the others, "tahoe 2" is the first-100-peers in the permuted hash, and this document descibes "tahoe 3", or perhaps "potrero hill 1". [3]: the terms CHK and SSK come from Freenet, http://wiki.freenetproject.org/FreenetCHKPages , although we use "SSK" in a slightly different way