mirror of
https://github.com/tahoe-lafs/tahoe-lafs.git
synced 2025-01-04 12:14:11 +00:00
211 lines
11 KiB
Plaintext
211 lines
11 KiB
Plaintext
|
= Specification Document Outline =
|
||
|
|
||
|
While we do not yet have a clear set of specification documents for Tahoe
|
||
|
(explaining the file formats, so that others can write interoperable
|
||
|
implementations), this document is intended to lay out an outline for what
|
||
|
these specs ought to contain. Think of this as the ISO 7-Layer Model for
|
||
|
Tahoe.
|
||
|
|
||
|
We currently imagine 4 documents.
|
||
|
|
||
|
== #1: Share Format, Encoding Algorithm ==
|
||
|
|
||
|
This document will describe the way that files are encrypted and encoded into
|
||
|
shares. It will include a specification of the share format, and explain both
|
||
|
the encoding and decoding algorithms. It will cover both mutable and
|
||
|
immutable files.
|
||
|
|
||
|
The immutable encoding algorithm, as described by this document, will start
|
||
|
with a plaintext series of bytes, encoding parameters "k" and "N", and either
|
||
|
an encryption key or a mechanism for deterministically deriving the key from
|
||
|
the plaintext (the CHK specification). The algorithm will end with a set of N
|
||
|
shares, and a set of values that must be included in the filecap to provide
|
||
|
confidentiality (the encryption key) and integrity (the UEB hash).
|
||
|
|
||
|
The immutable decoding algorithm will start with the filecap values (key and
|
||
|
UEB hash) and "k" shares. It will explain how to validate the shares against
|
||
|
the integrity information, how to reverse the erasure-coding, and how to
|
||
|
decrypt the resulting ciphertext. It will result in the original plaintext
|
||
|
bytes (or some subrange thereof).
|
||
|
|
||
|
The sections on mutable files will contain similar information.
|
||
|
|
||
|
This document is *not* responsible for explaining the filecap format, since
|
||
|
full filecaps may need to contain additional information as described in
|
||
|
document #3. Likewise it it not responsible for explaining where to put the
|
||
|
generated shares or where to find them again later.
|
||
|
|
||
|
It is also not responsible for explaining the access control mechanisms
|
||
|
surrounding share upload, download, or modification ("Accounting" is the
|
||
|
business of controlling share upload to conserve space, and mutable file
|
||
|
shares require some sort of access control to prevent non-writecap holders
|
||
|
from destroying shares). We don't yet have a document dedicated to explaining
|
||
|
these, but let's call it "Access Control" for now.
|
||
|
|
||
|
|
||
|
== #2: Share Exchange Protocol ==
|
||
|
|
||
|
This document explains the wire-protocol used to upload, download, and modify
|
||
|
shares on the various storage servers.
|
||
|
|
||
|
Given the N shares created by the algorithm described in document #1, and a
|
||
|
set of servers who are willing to accept those shares, the protocols in this
|
||
|
document will be sufficient to get the shares onto the servers. Likewise,
|
||
|
given a set of servers who hold at least k shares, these protocols will be
|
||
|
enough to retrieve the shares necessary to begin the decoding process
|
||
|
described in document #1. The notion of a "storage index" is used to
|
||
|
reference a particular share: the storage index is generated by the encoding
|
||
|
process described in document #1.
|
||
|
|
||
|
This document does *not* describe how to identify or choose those servers,
|
||
|
rather it explains what to do once they have been selected (by the mechanisms
|
||
|
in document #3).
|
||
|
|
||
|
This document also explains the protocols that a client uses to ask a server
|
||
|
whether or not it is willing to accept an uploaded share, and whether it has
|
||
|
a share available for download. These protocols will be used by the
|
||
|
mechanisms in document #3 to help decide where the shares should be placed.
|
||
|
|
||
|
Where cryptographic mechanisms are necessary to implement access-control
|
||
|
policy, this document will explain those mechanisms.
|
||
|
|
||
|
In the future, Tahoe will be able to use multiple protocols to speak to
|
||
|
storage servers. There will be alternative forms of this document, one for
|
||
|
each protocol. The first one to be written will describe the Foolscap-based
|
||
|
protocol that tahoe currently uses, but we anticipate a subsequent one to
|
||
|
describe a more HTTP-based protocol.
|
||
|
|
||
|
== #3: Server Selection Algorithm, filecap format ==
|
||
|
|
||
|
This document has two interrelated purposes. With a deeper understanding of
|
||
|
the issues, we may be able to separate these more cleanly in the future.
|
||
|
|
||
|
The first purpose is to explain the server selection algorithm. Given a set
|
||
|
of N shares, where should those shares be uploaded? Given some information
|
||
|
stored about a previously-uploaded file, how should a downloader locate and
|
||
|
recover at least k shares? Given a previously-uploaded mutable file, how
|
||
|
should a modifier locate all (or most of) the shares with a reasonable amount
|
||
|
of work?
|
||
|
|
||
|
This question implies many things, all of which should be explained in this
|
||
|
document:
|
||
|
|
||
|
* the notion of a "grid", nominally a set of servers who could potentially
|
||
|
hold shares, which might change over time
|
||
|
* a way to configure which grid should be used
|
||
|
* a way to discover which servers are a part of that grid
|
||
|
* a way to decide which servers are reliable enough to be worth sending
|
||
|
shares
|
||
|
* an algorithm to handle servers which refuse shares
|
||
|
* a way for a downloader to locate which servers have shares
|
||
|
* a way to choose which shares should be used for download
|
||
|
|
||
|
The server-selection algorithm has several obviously competing goals:
|
||
|
|
||
|
* minimize the amount of work that must be done during upload
|
||
|
* minimize the total storage resources used
|
||
|
* avoid "hot spots", balance load among multiple servers
|
||
|
* maximize the chance that enough shares will be downloadable later, by
|
||
|
uploading lots of shares, and by placing them on reliable servers
|
||
|
* minimize the work that the future downloader must do
|
||
|
* tolerate temporary server failures, permanent server departure, and new
|
||
|
server insertions
|
||
|
* minimize the amount of information that must be added to the filecap
|
||
|
|
||
|
The server-selection algorithm is defined in some context: some set of
|
||
|
expectations about the servers or grid with which it is expected to operate.
|
||
|
Different algorithms are appropriate for different situtations, so there will
|
||
|
be multiple alternatives of this document.
|
||
|
|
||
|
The first version of this document will describe the algorithm that the
|
||
|
current (1.3.0) release uses, which is heavily weighted towards the two main
|
||
|
use case scenarios for which Tahoe has been designed: the small, stable
|
||
|
friendnet, and the allmydata.com managed grid. In both cases, we assume that
|
||
|
the storage servers are online most of the time, they are uniformly highly
|
||
|
reliable, and that the set of servers does not change very rapidly. The
|
||
|
server-selection algorithm for this environment uses a permuted server list
|
||
|
to achieve load-balancing, uses all servers identically, and derives the
|
||
|
permutation key from the storage index to avoid adding a new field to the
|
||
|
filecap.
|
||
|
|
||
|
An alternative algorithm could give clients more precise control over share
|
||
|
placement, for example by a user who wished to make sure that k+1 shares are
|
||
|
located in each datacenter (to allow downloads to take place using only local
|
||
|
bandwidth). This algorithm could skip the permuted list and use other
|
||
|
mechanisms to accomplish load-balancing (or ignore the issue altogether). It
|
||
|
could add additional information to the filecap (like a list of which servers
|
||
|
received the shares) in lieu of performing a search at download time, perhaps
|
||
|
at the expense of allowing a repairer to move shares to a new server after
|
||
|
the initial upload. It might make up for this by storing "location hints"
|
||
|
next to each share, to indicate where other shares are likely to be found,
|
||
|
and obligating the repairer to update these hints.
|
||
|
|
||
|
The second purpose of this document is to explain the format of the file
|
||
|
capability string (or "filecap" for short). There are multiple kinds of
|
||
|
capabilties (read-write, read-only, verify-only, repaircap, lease-renewal
|
||
|
cap, traverse-only, etc). There are multiple ways to represent the filecap
|
||
|
(compressed binary, human-readable, clickable-HTTP-URL, "tahoe:" URL, etc),
|
||
|
but they must all contain enough information to reliably retrieve a file
|
||
|
(given some context, of course). It must at least contain the confidentiality
|
||
|
and integrity information from document #1 (i.e. the encryption key and the
|
||
|
UEB hash). It must also contain whatever additional information the
|
||
|
upload-time server-selection algorithm generated that will be required by the
|
||
|
downloader.
|
||
|
|
||
|
For some server-selection algorithms, the additional information will be
|
||
|
minimal. For example, the 1.3.0 release uses the hash of the encryption key
|
||
|
as a storage index, and uses the storage index to permute the server list,
|
||
|
and uses an Introducer to learn the current list of servers. This allows a
|
||
|
"close-enough" list of servers to be compressed into a filecap field that is
|
||
|
already required anyways (the encryption key). It also adds k and N to the
|
||
|
filecap, to speed up the downloader's search (the downloader knows how many
|
||
|
shares it needs, so it can send out multiple queries in parallel).
|
||
|
|
||
|
But other server-selection algorithms might require more information. Each
|
||
|
variant of this document will explain how to encode that additional
|
||
|
information into the filecap, and how to extract and use that information at
|
||
|
download time.
|
||
|
|
||
|
These two purposes are interrelated. A filecap that is interpreted in the
|
||
|
context of the allmydata.com commercial grid, which uses tahoe-1.3.0, implies
|
||
|
a specific peer-selection algorithm, a specific Introducer, and therefore a
|
||
|
fairly-specific set of servers to query for shares. A filecap which is meant
|
||
|
to be interpreted on a different sort of grid would need different
|
||
|
information.
|
||
|
|
||
|
Some filecap formats can be designed to contain more information (and depend
|
||
|
less upon context), such as the way an HTTP URL implies the existence of a
|
||
|
single global DNS system. Ideally a tahoe filecap should be able to specify
|
||
|
which "grid" it lives in, with enough information to allow a compatible
|
||
|
implementation of Tahoe to locate that grid and retrieve the file (regardless
|
||
|
of which server-selection algorithm was used for upload).
|
||
|
|
||
|
This more-universal format might come at the expense of reliability, however.
|
||
|
Tahoe-1.3.0 filecaps do not contain hostnames, because the failure of DNS or
|
||
|
an individual host might then impact file availability (however the
|
||
|
Introducer contains DNS names or IP addresses).
|
||
|
|
||
|
== #4: Directory Format ==
|
||
|
|
||
|
Tahoe directories are a special way of interpreting and managing the contents
|
||
|
of a file (either mutable or immutable). These "dirnode" files are basically
|
||
|
serialized tables that map child name to filecap/dircap. This document
|
||
|
describes the format of these files.
|
||
|
|
||
|
Tahoe-1.3.0 directories are "transitively readonly", which is accomplished by
|
||
|
applying an additional layer of encryption to the list of child writecaps.
|
||
|
The key for this encryption is derived from the containing file's writecap.
|
||
|
This document must explain how to derive this key and apply it to the
|
||
|
appropriate portion of the table.
|
||
|
|
||
|
Future versions of the directory format are expected to contain
|
||
|
"deep-traversal caps", which allow verification/repair of files without
|
||
|
exposing their plaintext to the repair agent. This document wil be
|
||
|
responsible for explaining traversal caps too.
|
||
|
|
||
|
Future versions of the directory format will probably contain an index and
|
||
|
more advanced data structures (for efficiency and fast lookups), instead of a
|
||
|
simple flat list of (childname, childcap). This document will also need to
|
||
|
describe metadata formats, including what access-control policies are defined
|
||
|
for the metadata.
|