
==============================
Specification Document Outline
==============================

While we do not yet have a clear set of specification documents for Tahoe (explaining the file formats, so that others can write interoperable implementations), this document is intended to lay out an outline for what these specs ought to contain. Think of this as the OSI 7-layer model for Tahoe.

We currently imagine 4 documents.

  1. #1: Share Format, Encoding Algorithm
  2. #2: Share Exchange Protocol
  3. #3: Server Selection Algorithm, Filecap Format
  4. #4: Directory Format

#1: Share Format, Encoding Algorithm
====================================

This document will describe the way that files are encrypted and encoded into shares. It will include a specification of the share format, and explain both the encoding and decoding algorithms. It will cover both mutable and immutable files.

The immutable encoding algorithm, as described by this document, will start with a plaintext series of bytes, encoding parameters "k" and "N", and either an encryption key or a mechanism for deterministically deriving the key from the plaintext (the CHK specification). The algorithm will end with a set of N shares, and a set of values that must be included in the filecap to provide confidentiality (the encryption key) and integrity (the UEB hash).

The immutable decoding algorithm will start with the filecap values (key and UEB hash) and "k" shares. It will explain how to validate the shares against the integrity information, how to reverse the erasure-coding, and how to decrypt the resulting ciphertext. It will result in the original plaintext bytes (or some subrange thereof).
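
A minimal round-trip sketch of these two algorithms, assuming the zfec library (which Tahoe uses for erasure coding, here via its "easyfec" wrapper) and the "cryptography" library for AES-CTR. The single hash over the ciphertext is only a stand-in for the real UEB/Merkle-tree structure, which lets each share be validated individually; all names are illustrative::

  import hashlib, os
  from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes
  from zfec import easyfec

  def encode_immutable(plaintext, k, n, key=None):
      key = key or os.urandom(16)        # or derive deterministically (CHK)
      enc = Cipher(algorithms.AES(key), modes.CTR(b"\0" * 16)).encryptor()
      ciphertext = enc.update(plaintext) + enc.finalize()
      shares = easyfec.Encoder(k, n).encode(ciphertext)  # n shares, any k suffice
      integrity = hashlib.sha256(ciphertext).digest()    # stand-in for the UEB hash
      return shares, key, integrity, len(ciphertext)

  def decode_immutable(numbered_shares, k, n, key, integrity, size):
      sharenums, blocks = zip(*numbered_shares)          # any k of the n shares
      padlen = -size % k                 # easyfec pads input to a multiple of k
      ciphertext = easyfec.Decoder(k, n).decode(list(blocks), list(sharenums), padlen)
      assert hashlib.sha256(ciphertext).digest() == integrity
      dec = Cipher(algorithms.AES(key), modes.CTR(b"\0" * 16)).decryptor()
      return dec.update(ciphertext) + dec.finalize()

  # Round trip: any 3 of the 10 shares recover the original plaintext.
  shares, key, integrity, size = encode_immutable(b"hello world", k=3, n=10)
  first_three = list(enumerate(shares))[:3]
  assert decode_immutable(first_three, 3, 10, key, integrity, size) == b"hello world"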

The sections on mutable files will contain similar information.

This document is not responsible for explaining the filecap format, since full filecaps may need to contain additional information as described in document #3. Likewise it is not responsible for explaining where to put the generated shares or where to find them again later.

It is also not responsible for explaining the access control mechanisms surrounding share upload, download, or modification ("Accounting" is the business of controlling share upload to conserve space, and mutable file shares require some sort of access control to prevent non-writecap holders from destroying shares). We don't yet have a document dedicated to explaining these, but let's call it "Access Control" for now.

#2: Share Exchange Protocol
===========================

This document explains the wire-protocol used to upload, download, and modify shares on the various storage servers.

Given the N shares created by the algorithm described in document #1, and a set of servers who are willing to accept those shares, the protocols in this document will be sufficient to get the shares onto the servers. Likewise, given a set of servers who hold at least k shares, these protocols will be enough to retrieve the shares necessary to begin the decoding process described in document #1. The notion of a "storage index" is used to reference a particular file's shares: the storage index is generated by the encoding process described in document #1.

This document does not describe how to identify or choose those servers; rather, it explains what to do once they have been selected (by the mechanisms in document #3).

This document also explains the protocols that a client uses to ask a server whether or not it is willing to accept an uploaded share, and whether it has a share available for download. These protocols will be used by the mechanisms in document #3 to help decide where the shares should be placed.
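
An illustrative client-side view of these queries, loosely modeled on the remote interface of the current Foolscap-based protocol; the method names and signatures here are simplifying assumptions, not the wire protocol itself::

  from abc import ABC, abstractmethod

  class StorageServer(ABC):
      @abstractmethod
      def allocate_buckets(self, storage_index, sharenums, allocated_size):
          """Ask whether this server will accept an upload of the given
          shares; returns (already_have, writers): the share numbers it
          already holds, and write handles for those it agreed to take."""

      @abstractmethod
      def get_buckets(self, storage_index):
          """Ask which shares this server holds for the given storage
          index; returns a map from share number to a read handle."""

  def probe_for_download(servers, storage_index, k):
      # Query servers until at least k distinct shares have been located;
      # document #3 decides which servers to ask and in what order.
      found = {}
      for server in servers:
          for sharenum, reader in server.get_buckets(storage_index).items():
              found.setdefault(sharenum, reader)
          if len(found) >= k:
              break
      return found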

Where cryptographic mechanisms are necessary to implement access-control policy, this document will explain those mechanisms.

In the future, Tahoe will be able to use multiple protocols to speak to storage servers. There will be alternative forms of this document, one for each protocol. The first one to be written will describe the Foolscap-based protocol that Tahoe currently uses, but we anticipate that a subsequent one will describe a more HTTP-based protocol.

#3: Server Selection Algorithm, Filecap Format
==============================================

This document has two interrelated purposes. With a deeper understanding of the issues, we may be able to separate these more cleanly in the future.

The first purpose is to explain the server selection algorithm. Given a set of N shares, where should those shares be uploaded? Given some information stored about a previously-uploaded file, how should a downloader locate and recover at least k shares? Given a previously-uploaded mutable file, how should a modifier locate all (or most of) the shares with a reasonable amount of work?

This question implies many things, all of which should be explained in this document (a sketch of the kinds of state involved appears after this list):

  * the notion of a "grid", nominally a set of servers who could potentially hold shares, which might change over time
  * a way to configure which grid should be used
  * a way to discover which servers are a part of that grid
  * a way to decide which servers are reliable enough to be worth sending shares to
  * an algorithm to handle servers which refuse shares
  * a way for a downloader to locate which servers have shares
  * a way to choose which shares should be used for download
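
A sketch of the state such an algorithm needs; every name here is hypothetical, chosen only to make the list above concrete::

  from dataclasses import dataclass, field

  @dataclass
  class ServerDescriptor:
      server_id: bytes       # stable identifier announced for this server
      location: str          # how to reach it, e.g. a Foolscap FURL
      reliable: bool = True  # passed whatever reliability policy is in force

  @dataclass
  class GridConfig:
      grid_name: str         # which grid this client should use
      introducer: str        # where to learn the current server list
      servers: list = field(default_factory=list)

  def upload_candidates(grid):
      """Servers considered reliable enough to be worth sending shares to."""
      return [s for s in grid.servers if s.reliable]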

The server-selection algorithm has several obviously competing goals:

  * minimize the amount of work that must be done during upload
  * minimize the total storage resources used
  * avoid "hot spots": balance load among multiple servers
  * maximize the chance that enough shares will be downloadable later, by uploading lots of shares, and by placing them on reliable servers
  * minimize the work that the future downloader must do
  * tolerate temporary server failures, permanent server departure, and new server insertions
  * minimize the amount of information that must be added to the filecap

The server-selection algorithm is defined in some context: some set of expectations about the servers or grid with which it is expected to operate. Different algorithms are appropriate for different situations, so there will be multiple alternative versions of this document.

The first version of this document will describe the algorithm that the current (1.3.0) release uses, which is heavily weighted towards the two main use cases for which Tahoe has been designed: the small, stable friendnet, and the allmydata.com managed grid. In both cases, we assume that the storage servers are online most of the time, that they are uniformly highly reliable, and that the set of servers does not change very rapidly. The server-selection algorithm for this environment uses a permuted server list to achieve load-balancing, uses all servers identically, and derives the permutation key from the storage index to avoid adding a new field to the filecap.
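
A sketch of that permuted-list approach. The structure (sorting servers by a hash of the server identifier and the storage index) follows the description above, but the exact hash construction here is an assumption, not the normative one::

  import hashlib

  def permuted_servers(server_ids, storage_index):
      # Each file gets its own deterministic ordering of the same server
      # list, so load spreads evenly without any new filecap field.
      return sorted(server_ids,
                    key=lambda sid: hashlib.sha256(sid + storage_index).digest())

  def place_shares(server_ids, storage_index, n):
      # Assign the n shares round-robin along the permuted ring; the
      # downloader re-derives the same ring from the storage index.
      ring = permuted_servers(server_ids, storage_index)
      return {sharenum: ring[sharenum % len(ring)] for sharenum in range(n)}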

An alternative algorithm could give clients more precise control over share placement, for example by a user who wished to make sure that k+1 shares are located in each datacenter (to allow downloads to take place using only local bandwidth). This algorithm could skip the permuted list and use other mechanisms to accomplish load-balancing (or ignore the issue altogether). It could add additional information to the filecap (like a list of which servers received the shares) in lieu of performing a search at download time, perhaps at the expense of allowing a repairer to move shares to a new server after the initial upload. It might make up for this by storing "location hints" next to each share, to indicate where other shares are likely to be found, and obligating the repairer to update these hints.

The second purpose of this document is to explain the format of the file capability string (or "filecap" for short). There are multiple kinds of capabilities (read-write, read-only, verify-only, repaircap, lease-renewal cap, traverse-only, etc.). There are multiple ways to represent the filecap (compressed binary, human-readable, clickable-HTTP-URL, "tahoe:" URL, etc.), but they must all contain enough information to reliably retrieve a file (given some context, of course). It must at least contain the confidentiality and integrity information from document #1 (i.e. the encryption key and the UEB hash). It must also contain whatever additional information the upload-time server-selection algorithm generated that will be required by the downloader.

For some server-selection algorithms, the additional information will be minimal. For example, the 1.3.0 release uses the hash of the encryption key as a storage index, permutes the server list with that storage index, and learns the current list of servers from an Introducer. This allows a "close-enough" list of servers to be compressed into a filecap field that is already required anyway (the encryption key). It also adds k and N to the filecap, to speed up the downloader's search (the downloader knows how many shares it needs, so it can send out multiple queries in parallel).
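
For concreteness, a sketch of parsing such a filecap. The field layout (key, UEB hash, k, N, size) follows this description, though the exact string syntax should be checked against the normative specification rather than taken from this example::

  from base64 import b32decode

  def parse_chk_cap(cap):
      """Split 'URI:CHK:<key>:<ueb_hash>:<k>:<N>:<size>' into its fields."""
      scheme, kind, key_s, ueb_s, k_s, n_s, size_s = cap.split(":")
      assert (scheme, kind) == ("URI", "CHK")
      # Tahoe-style caps use lowercase, unpadded base32.
      unb32 = lambda s: b32decode(s.upper() + "=" * (-len(s) % 8))
      return {
          "key": unb32(key_s),       # confidentiality (from document #1)
          "ueb_hash": unb32(ueb_s),  # integrity (from document #1)
          "k": int(k_s),             # shares needed...
          "n": int(n_s),             # ...out of shares produced
          "size": int(size_s),
      }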

But other server-selection algorithms might require more information. Each variant of this document will explain how to encode that additional information into the filecap, and how to extract and use that information at download time.

These two purposes are interrelated. A filecap that is interpreted in the context of the allmydata.com commercial grid, which uses tahoe-1.3.0, implies a specific server-selection algorithm, a specific Introducer, and therefore a fairly-specific set of servers to query for shares. A filecap which is meant to be interpreted on a different sort of grid would need different information.

Some filecap formats can be designed to contain more information (and depend less upon context), in the same way that an HTTP URL carries a hostname but still implies the existence of a single global DNS system. Ideally a Tahoe filecap should be able to specify which "grid" it lives in, with enough information to allow a compatible implementation of Tahoe to locate that grid and retrieve the file (regardless of which server-selection algorithm was used for upload).

This more-universal format might come at the expense of reliability, however. Tahoe-1.3.0 filecaps do not contain hostnames, because the failure of DNS or an individual host might then impact file availability (although the Introducer's announcements do contain DNS names or IP addresses).

#4: Directory Format
====================

Tahoe directories are a special way of interpreting and managing the contents of a file (either mutable or immutable). These "dirnode" files are basically serialized tables that map child name to filecap/dircap. This document describes the format of these files.

Tahoe-1.3.0 directories are "transitively readonly", which is accomplished by applying an additional layer of encryption to the list of child writecaps. The key for this encryption is derived from the containing file's writecap. This document must explain how to derive this key and apply it to the appropriate portion of the table.
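A sketch of that extra layer. The use of AES-CTR and the tagged-SHA-256 key derivation are assumptions for illustration; only the structure (child writecaps encrypted under a key derived from the parent's writecap) follows the text above::

  import hashlib
  from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

  def derive_writecap_key(parent_writecap):
      # Tagged hash (the tag is illustrative) so this key cannot collide
      # with any other value derived from the same writecap.
      tag = b"illustrative-dirnode-writekey:"
      return hashlib.sha256(tag + parent_writecap).digest()[:16]

  def encrypt_child_writecap(parent_writecap, child_writecap):
      # A real format would need a per-entry IV; a fixed one is used here
      # only to keep the sketch short.
      enc = Cipher(algorithms.AES(derive_writecap_key(parent_writecap)),
                   modes.CTR(b"\0" * 16)).encryptor()
      return enc.update(child_writecap) + enc.finalize()

  # One table row: holders of the read-only dircap see the child's rocap,
  # but can recover the child's writecap only if they also hold the
  # parent's writecap.
  row = ("photos",                                # child name
         b"URI:...child read-only cap...",        # rocap, in the clear
         encrypt_child_writecap(b"URI:...parent writecap...",
                                b"URI:...child writecap..."))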

Future versions of the directory format are expected to contain "deep-traversal caps", which allow verification/repair of files without exposing their plaintext to the repair agent. This document will be responsible for explaining traversal caps too.

Future versions of the directory format will probably contain an index and more advanced data structures (for efficiency and fast lookups), instead of a simple flat list of (childname, childcap). This document will also need to describe metadata formats, including what access-control policies are defined for the metadata.