.. -*- coding: utf-8-with-signature -*-

=============================================================
Redundant Array of Independent Clouds: Share To Cloud Mapping
=============================================================


Introduction
============

This document describes a proposed design for the mapping of LAFS shares to
objects in a cloud storage service. It also analyzes the costs for each of the
functional requirements, including network, disk, storage and API usage costs.


Terminology
===========

*LAFS share*
   A Tahoe-LAFS share representing part of a file after encryption and
   erasure encoding.

*LAFS shareset*
   The set of shares stored by a LAFS storage server for a given storage index.
   Each share within a shareset is identified by a small integer, its share
   number.

*Cloud storage service*
   A service such as Amazon S3 `²`_, Rackspace Cloud Files `³`_,
   Google Cloud Storage `⁴`_, or Windows Azure `⁵`_, that provides cloud storage.

*Cloud storage interface*
   A protocol interface supported by a cloud storage service, such as the
   S3 interface `⁶`_, the OpenStack Object Storage interface `⁷`_, the
   Google Cloud Storage interface `⁸`_, or the Azure interface `⁹`_. There may be
   multiple services implementing a given cloud storage interface. In this design,
   only REST-based APIs `¹⁰`_ over HTTP will be used as interfaces.

*Store object*
   A file-like abstraction provided by a cloud storage service, storing a
   sequence of bytes. Store objects are mutable in the sense that the contents
   and metadata of the store object with a given name in a given backend store
   can be replaced. Store objects are called “blobs” in the Azure interface,
   and “objects” in the other interfaces.

*Cloud backend store*
   A container for store objects provided by a cloud service. Cloud backend
   stores are called “buckets” in the S3 and Google Cloud Storage interfaces,
   and “containers” in the Azure and OpenStack Storage interfaces.


Functional Requirements
=======================

* *Upload*: a LAFS share can be uploaded to an appropriately configured
  Tahoe-LAFS storage server and the data is stored to the cloud
  storage service.

* *Scalable shares*: there is no hard limit on the size of a LAFS share
  that can be uploaded.

  If the cloud storage interface offers scalable files, then this could be
  implemented by using that feature of the specific cloud storage
  interface. Alternatively, it could be implemented by mapping from the LAFS
  abstraction of an unlimited-size immutable share to a set of size-limited
  store objects.

* *Streaming upload*: the size of the LAFS share that is uploaded
  can exceed the amount of RAM and even the amount of direct attached
  storage on the storage server. I.e., the storage server is required to
  stream the data directly to the ultimate cloud storage service while
  processing it, rather than buffering the data until the client has finished
  uploading and only then transferring it to the cloud storage service.

* *Download*: a LAFS share can be downloaded from an appropriately
  configured Tahoe-LAFS storage server, and the data is loaded from the
  cloud storage service.

* *Streaming download*: the size of the LAFS share that is
  downloaded can exceed the amount of RAM and even the amount of direct
  attached storage on the storage server. I.e., the storage server is
  required to stream the data directly to the client while processing it,
  rather than buffering the data until the cloud storage service has finished
  serving it and only then transferring it to the client.

* *Modify*: a LAFS share can have part of its contents modified.

  If the cloud storage interface offers scalable mutable files, then this
  could be implemented by using that feature of the specific cloud storage
  interface. Alternatively, it could be implemented by mapping from the LAFS
  abstraction of an unlimited-size mutable share to a set of size-limited
  store objects.

* *Efficient modify*: the size of the LAFS share being
  modified can exceed the amount of RAM and even the amount of direct
  attached storage on the storage server. I.e., the storage server is
  required to download, patch, and upload only the segment(s) of the share
  that are being modified, rather than downloading, patching, and uploading
  the entire share.

* *Tracking leases*: the Tahoe-LAFS storage server is required to track when
  each share has its lease renewed so that unused shares (shares whose lease
  has not been renewed within a time limit, e.g. 30 days) can be garbage
  collected. This does not necessarily require code specific to each cloud
  storage interface, because the lease tracking can be performed in the
  storage server's generic component rather than in the component supporting
  each interface.


Mapping
=======

This section describes the mapping between LAFS shares and store objects.

A LAFS share will be split into one or more “chunks” that are each stored in a
store object. A LAFS share of size `C` bytes will be stored as `ceiling(C / chunksize)`
chunks. The last chunk has a size between 1 and `chunksize` bytes inclusive.
(It is not possible for `C` to be zero, because valid shares always have a header,
so there is at least one chunk for each share.)
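
The arithmetic above can be illustrated with a minimal sketch. The function
name ``chunk_sizes`` is hypothetical and is not part of the Tahoe-LAFS code;
it only shows how the chunk count and the size of the last chunk follow from
`C` and `chunksize`.

```python
def chunk_sizes(share_size, chunksize):
    """Return the sizes of the chunks that a share of share_size bytes
    would be split into (hypothetical helper, for illustration only)."""
    if share_size <= 0 or chunksize <= 0:
        raise ValueError("share_size and chunksize must be positive")
    # Ceiling division without floating point: ceiling(C / chunksize).
    num_chunks = -(-share_size // chunksize)
    # Every chunk except the last is full; the last holds the remainder.
    last = share_size - (num_chunks - 1) * chunksize
    return [chunksize] * (num_chunks - 1) + [last]
```

When `C` is an exact multiple of `chunksize`, the last chunk is a full chunk
rather than an empty one, which matches the "between 1 and `chunksize`" rule.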

For an existing share, the chunk size is determined by the size of the first
chunk. For a new share, it is a parameter that may depend on the storage
interface. It is an error for any chunk to be larger than the first chunk, or
for any chunk other than the last to be smaller than the first chunk.
If a mutable share with total size less than the default chunk size for the
storage interface is being modified, the new contents are split using the
default chunk size.
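
A sketch of how these consistency rules might be checked for an existing
share, given the store object sizes in chunk order. The function name and the
use of ``ValueError`` are assumptions made for illustration; the design does
not specify how errors are reported.

```python
def validate_chunk_sizes(sizes):
    """Check the chunk-size rules for an existing share and return the
    chunk size implied by the first chunk (hypothetical helper).

    sizes: sizes in bytes of chunk 0, chunk 1, ... in order.
    """
    if not sizes:
        raise ValueError("a share always has at least one chunk")
    chunksize = sizes[0]
    last_index = len(sizes) - 1
    for i, size in enumerate(sizes):
        if size < 1:
            raise ValueError(f"chunk {i} is empty")
        if size > chunksize:
            raise ValueError(f"chunk {i} is larger than the first chunk")
        if i != last_index and size < chunksize:
            raise ValueError(f"chunk {i} is smaller than the first chunk")
    return chunksize
```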

*Rationale*: this design allows the `chunksize` parameter to be changed for
new shares written via a particular storage interface, without breaking
compatibility with existing stored shares. All cloud storage interfaces
return the sizes of store objects with requests to list objects, and so
the size of the first chunk can be determined without an additional request.

The name of the store object for chunk `i` > 0 of a LAFS share with storage index
`STORAGEINDEX` and share number `SHNUM` will be

   shares/`ST`/`STORAGEINDEX`/`SHNUM.i`

where `ST` is the first two characters of `STORAGEINDEX`. When `i` is 0, the
`.0` is omitted.
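
The naming rule can be written as a short function. This is a sketch only:
``chunk_object_name`` is a hypothetical name, and the storage index is assumed
to be passed in its string form.

```python
def chunk_object_name(storage_index, shnum, i):
    """Build the store object name for chunk i of a share
    (hypothetical helper following the layout described above)."""
    if i < 0:
        raise ValueError("chunk index must be non-negative")
    # ST is the first two characters of the storage index string.
    name = f"shares/{storage_index[:2]}/{storage_index}/{shnum}"
    # Chunk 0 has no suffix, so it keeps the single-object name.
    if i > 0:
        name += f".{i}"
    return name
```

For example, with the made-up storage index ``"abcdef"``, chunk 2 of share 3
would be named ``shares/ab/abcdef/3.2``, and chunk 0 would be named
``shares/ab/abcdef/3``.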

*Rationale*: this layout maintains compatibility with data stored by the
prototype S3 backend, for which Least Authority Enterprises has existing
customers. This prototype always used a single store object to store each
share, with name

   shares/`ST`/`STORAGEINDEX`/`SHNUM`

By using the same prefix “shares/`ST`/`STORAGEINDEX`/” for old and new layouts,
the storage server can obtain a list of store objects associated with a given
shareset without having to know the layout in advance, and without having to
make multiple API requests. This also simplifies sharing of test code between the
disk and cloud backends.
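
To illustrate why a single prefix listing is enough, the following sketch
groups the names returned by such a listing into chunk indices per share
number. The function name is hypothetical, and the input is assumed to be a
plain list of object names as returned by a "list objects" request.

```python
def group_chunks_by_share(storage_index, object_names):
    """Map each share number to the sorted chunk indices found in a
    listing of the shareset prefix (hypothetical helper)."""
    prefix = f"shares/{storage_index[:2]}/{storage_index}/"
    shares = {}
    for name in object_names:
        if not name.startswith(prefix):
            continue  # not part of this shareset
        # "SHNUM" means chunk 0; "SHNUM.i" means chunk i > 0.
        shnum_text, _, index_text = name[len(prefix):].partition(".")
        index = int(index_text) if index_text else 0
        shares.setdefault(int(shnum_text), set()).add(index)
    return {shnum: sorted(indices) for shnum, indices in shares.items()}
```

An object written by the prototype (a single object per share) appears here
simply as a share whose only chunk index is 0.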

Mutable and immutable shares will be “chunked” in the same way.


Rationale for Chunking
----------------------

Limiting the amount of data received or sent in a single request has the
following advantages:

* It is unnecessary to write separate code to take advantage of the
  “large object” features of each cloud storage interface, which differ
  significantly in their design.
* Data needed for each PUT request can be discarded after it completes.
  If a PUT request fails, it can be retried while only holding the data
  for that request in memory.


Costs
=====

In this section we analyze the costs of the proposed design in terms of network,
disk, memory, cloud storage, and API usage.


Network usage: bandwidth and number of round trips
--------------------------------------------------

When a Tahoe-LAFS storage client allocates a new share on a storage server,
the backend will request a list of the existing store objects with the
appropriate prefix. This takes one HTTP request in the common case, but may
take more for the S3 interface, which has a limit of 1000 objects returned in
a single “GET Bucket” request.

If the share is to be read, the client will make a number of calls each
specifying the offset and length of the required span of bytes. On the first
request that overlaps a given chunk of the share, the server will make an
HTTP GET request for that store object. The server may also speculatively
make GET requests for store objects that are likely to be needed soon (which
can be predicted since reads are normally sequential), in order to reduce
latency.

Each read will be satisfied as soon as the corresponding data is available,
without waiting for the rest of the chunk, in order to minimize read latency.

All four cloud storage interfaces support GET requests using the
Range HTTP header. This could be used to optimize reads where the
Tahoe-LAFS storage client requires only part of a share.
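
As a sketch of how a read span could be translated into per-chunk requests,
the following hypothetical helper maps an (offset, length) pair to the chunk
indices it overlaps, together with a ``Range`` header value for each. It
assumes that every chunk other than the last is exactly `chunksize` bytes, as
required by the mapping above; HTTP byte ranges are inclusive at both ends.

```python
def chunk_ranges(offset, length, chunksize):
    """Return a list of (chunk_index, range_header_value) pairs covering
    the span [offset, offset + length) of a share (illustrative only)."""
    if offset < 0 or length < 0 or chunksize <= 0:
        raise ValueError("invalid span or chunk size")
    if length == 0:
        return []
    end = offset + length                 # exclusive end of the span
    first = offset // chunksize           # first chunk overlapped
    last = (end - 1) // chunksize         # last chunk overlapped
    result = []
    for i in range(first, last + 1):
        base = i * chunksize              # share offset where chunk i starts
        start_in_chunk = max(offset, base) - base
        end_in_chunk = min(end, base + chunksize) - base  # exclusive
        result.append((i, f"bytes={start_in_chunk}-{end_in_chunk - 1}"))
    return result
```

A backend that always fetches whole chunks would use only the chunk indices;
the header values matter only if the Range optimization is adopted.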

If the share is to be written, the server will make an HTTP PUT request for
each chunk that has been completed. Tahoe-LAFS clients only write immutable
shares sequentially, and so we can rely on that property to simplify the
implementation.

When modifying shares of an existing mutable file, the storage server will
be able to make PUT requests only for chunks that have changed.
(Current Tahoe-LAFS v1.9 clients will not take advantage of this ability, but
future versions will probably do so for MDMF files.)
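
A minimal sketch of how the changed chunks might be identified for a mutable
write. The representation of writes as ``(offset, data)`` pairs and the
function name are assumptions for illustration, not the storage protocol's
actual data structures.

```python
def dirty_chunk_indices(writes, chunksize):
    """Return the sorted indices of chunks touched by a set of writes.

    writes: iterable of (offset, data) pairs, where data is a bytes object.
    Only these chunks would need a PUT request (illustrative only).
    """
    dirty = set()
    for offset, data in writes:
        if not data:
            continue  # an empty write touches no chunk
        first = offset // chunksize
        last = (offset + len(data) - 1) // chunksize
        dirty.update(range(first, last + 1))
    return sorted(dirty)
```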

In some cases, it may be necessary to retry a request (see the `Structure of
Implementation`_ section below). In the case of a PUT request, at the point
at which a retry is needed, the new chunk contents to be stored will still be
in memory and so this is not problematic.

In the absence of retries, the maximum number of GET requests that will be made
when downloading a file, or the maximum number of PUT requests when uploading
or modifying a file, will be equal to the number of chunks in the file.

If the new mutable share content has fewer chunks than the old content,
then the remaining store objects for old chunks must be deleted (using one
HTTP request each). When reading a share, the backend must tolerate the case
where these store objects have not been deleted successfully.
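
The set of store objects to delete after a shrinking write can be sketched as
follows. The helper name is hypothetical, and the sketch assumes the old and
new contents use the same `chunksize`; it does not cover the case where the
new contents are re-split with a different default chunk size.

```python
def stale_chunk_indices(old_size, new_size, chunksize):
    """Return the chunk indices that existed for the old share contents
    but are not needed for the new contents (illustrative only)."""
    if old_size <= 0 or new_size <= 0 or chunksize <= 0:
        raise ValueError("sizes must be positive")
    old_count = -(-old_size // chunksize)   # ceiling division
    new_count = -(-new_size // chunksize)
    # Empty when the share did not lose any chunks.
    return list(range(new_count, old_count))
```

Each returned index corresponds to one DELETE request.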

The last write to a share will be reported as successful only when all
corresponding HTTP PUTs and DELETEs have completed successfully.


Disk usage (local to the storage server)
----------------------------------------

It is never necessary for the storage server to write the content of share
chunks to local disk, either when they are read or when they are written. Each
chunk is held only in memory.

A proposed change to the Tahoe-LAFS storage server implementation uses an SQLite
database to store metadata about shares. In that case the same database would
be used for the cloud backend. This would enable lease tracking to be implemented
in the same way for disk and cloud backends.


Memory usage
------------

The use of chunking simplifies bounding the memory usage of the storage server
when handling files that may be larger than memory. However, this depends on
limiting the number of chunks that are simultaneously held in memory.
Multiple chunks can be held in memory either because of pipelining of requests
for a single share, or because multiple shares are being read or written
(possibly by multiple clients).

For immutable shares, the Tahoe-LAFS storage protocol requires the client to
specify in advance the maximum amount of data it will write. Also, a cooperative
client (including all existing released versions of the Tahoe-LAFS code) will
limit the amount of data that is pipelined, currently to 50 KiB. Since the chunk
size will be greater than that, it is possible to ensure that for each allocation,
the maximum chunk data memory usage is the lesser of two chunks and the allocation
size. (There is some additional overhead but it is small compared to the chunk
data.) If the maximum memory usage of a new allocation would exceed the memory
available, the allocation can be delayed or possibly denied, so that the total
memory usage is bounded.
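
The bound described above, and one possible way to enforce it across
allocations, can be sketched as follows. The class and method names are
hypothetical, and the policy of returning ``None`` when an allocation does not
fit is only one of the options ("delayed or possibly denied") mentioned in the
text.

```python
def max_chunk_memory(allocation_size, chunksize):
    """Upper bound on chunk data held in memory for one immutable
    allocation: the lesser of two chunks and the allocation size."""
    return min(2 * chunksize, allocation_size)


class MemoryBudget:
    """Hypothetical admission control keeping total chunk memory bounded."""

    def __init__(self, limit):
        self.limit = limit      # total bytes available for chunk data
        self.reserved = 0       # bytes reserved by admitted allocations

    def try_reserve(self, allocation_size, chunksize):
        """Reserve memory for a new allocation, or return None if it does
        not fit (the caller may then delay or deny the allocation)."""
        needed = max_chunk_memory(allocation_size, chunksize)
        if self.reserved + needed > self.limit:
            return None
        self.reserved += needed
        return needed

    def release(self, amount):
        """Give back a reservation once the allocation is finished."""
        self.reserved -= amount
```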

It is not clear that the existing protocol allows allocations for mutable
shares to be bounded in general; this may be addressed in a future protocol change.

The above discussion assumes that clients do not maliciously send large
messages as a denial-of-service attack. Foolscap (the protocol layer underlying
the Tahoe-LAFS storage protocol) does not attempt to resist denial of service.


Storage
-------

The storage requirements, including not-yet-collected garbage shares, are
the same as for the Tahoe-LAFS disk backend. That is, the total size of cloud
objects stored is equal to the total size of shares that the disk backend
would store.

Erasure coding causes the size of shares for each file to be a
factor `shares.total` / `shares.needed` times the file size, plus overhead
that is logarithmic in the file size `¹¹`_.
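
As a rough worked example of the expansion factor, the following sketch
computes the approximate total size of all shares of a file. It deliberately
ignores the logarithmic overhead mentioned above, and the function name is
hypothetical.

```python
def approx_total_share_bytes(file_size, shares_needed, shares_total):
    """Approximate total bytes stored across all shares of one file,
    ignoring per-share overhead (illustrative only)."""
    if not 0 < shares_needed <= shares_total:
        raise ValueError("require 0 < shares_needed <= shares_total")
    return file_size * shares_total / shares_needed
```

For instance, with `shares.needed` = 3 and `shares.total` = 10, a file of
3,000,000 bytes occupies roughly 10,000,000 bytes in total across its shares.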


API usage
---------

Cloud storage backends typically charge a small fee per API request. The number of
requests to the cloud storage service for various operations is discussed under
“network usage” above.


Structure of Implementation
===========================

A generic “cloud backend”, based on the prototype S3 backend but with support
for chunking as described above, will be written.

An instance of the cloud backend can be attached to one of several
“cloud interface adapters”, one for each cloud storage interface. These
adapters will operate only on chunks, and need not distinguish between
mutable and immutable shares. They will be a relatively “thin” abstraction
layer over the HTTP APIs of each cloud storage interface, similar to the
S3Bucket abstraction in the prototype.

For some cloud storage services it may be necessary to transparently retry
requests in order to recover from transient failures. (Although the erasure
coding may enable a file to be retrieved even when shares are not stored by or
not readable from all cloud storage services used in a Tahoe-LAFS grid, it may
be desirable to retry cloud storage service requests in order to improve overall
reliability.) Support for this will be implemented in the generic cloud backend,
and used whenever a cloud interface adapter reports a transient failure. Our
experience with the prototype suggests that it is necessary to retry on transient
failures for Amazon's S3 service.
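
A minimal, synchronous sketch of such a retry policy is shown below. All names
here (``TransientError``, ``call_with_retries``, the attempt limit and the
linear back-off) are assumptions made for illustration; the design does not
specify the retry policy, and a real implementation would need to fit the
storage server's asynchronous style rather than block.

```python
import time


class TransientError(Exception):
    """Hypothetical marker raised by an adapter for a retryable failure."""


def call_with_retries(operation, max_attempts=3, delay=0.0, sleep=time.sleep):
    """Call operation() and retry it when it reports a transient failure.

    Non-transient exceptions propagate immediately; a transient failure on
    the final attempt is re-raised to the caller.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise
            sleep(delay * attempt)  # simple linear back-off between tries
```

Because the chunk contents for a PUT are still in memory when the failure is
reported, ``operation`` can simply repeat the same request.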

There will also be a “mock” cloud interface adapter, based on the prototype's
MockS3Bucket. This allows tests of the generic cloud backend to be run without
a connection to a real cloud service. The mock adapter will be able to simulate
transient and non-transient failures.
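
The following sketch suggests what an in-memory mock of this kind might look
like. It is not the prototype's MockS3Bucket: the class name, method names and
failure-injection mechanism are all hypothetical, chosen only to show how
transient and non-transient failures could be simulated in tests.

```python
class TransientError(Exception):
    """Hypothetical marker for a failure that may succeed if retried."""


class MockCloudAdapter:
    """In-memory stand-in for a cloud interface adapter (illustrative)."""

    def __init__(self):
        self.objects = {}             # object name -> bytes
        self.transient_failures = 0   # number of upcoming calls that fail

    def _maybe_fail(self):
        if self.transient_failures > 0:
            self.transient_failures -= 1
            raise TransientError("simulated transient failure")

    def put_object(self, name, data):
        self._maybe_fail()
        self.objects[name] = bytes(data)

    def get_object(self, name):
        self._maybe_fail()
        # A missing object raises KeyError, modelling a non-transient failure.
        return self.objects[name]

    def delete_object(self, name):
        self._maybe_fail()
        self.objects.pop(name, None)

    def list_objects(self, prefix):
        self._maybe_fail()
        return sorted(n for n in self.objects if n.startswith(prefix))
```

A test can set ``transient_failures`` to a small number before an operation to
check that the generic backend retries and eventually succeeds.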


Known Issues
============

This design worsens a known “write hole” issue in Tahoe-LAFS when updating
the contents of mutable files. An update to a mutable file can require
changing the contents of multiple chunks, and if the client fails or is
disconnected during the operation the resulting state of the store objects
for that share may be inconsistent: no longer containing all of the old version,
but not yet containing all of the new version. A mutable share can be left in
an inconsistent state even by the existing Tahoe-LAFS disk backend if it fails
during a write, but that has a smaller chance of occurrence because the current
client behavior leads to mutable shares being written to disk in a single
system call.

The best fix for this issue probably requires changing the Tahoe-LAFS storage
protocol, perhaps by extending it to use a two-phase or three-phase commit
(ticket #1755).


References
==========

¹ omitted

.. _²:

² “Amazon S3” Amazon (2012)

   https://aws.amazon.com/s3/

.. _³:

³ “Rackspace Cloud Files” Rackspace (2012)

   https://www.rackspace.com/cloud/cloud_hosting_products/files/

.. _⁴:

⁴ “Google Cloud Storage” Google (2012)

   https://developers.google.com/storage/

.. _⁵:

⁵ “Windows Azure Storage” Microsoft (2012)

   https://www.windowsazure.com/en-us/develop/net/fundamentals/cloud-storage/

.. _⁶:

⁶ “Amazon Simple Storage Service (Amazon S3) API Reference: REST API” Amazon (2012)

   http://docs.amazonwebservices.com/AmazonS3/latest/API/APIRest.html

.. _⁷:

⁷ “OpenStack Object Storage” openstack.org (2012)

   http://openstack.org/projects/storage/

.. _⁸:

⁸ “Google Cloud Storage Reference Guide” Google (2012)

   https://developers.google.com/storage/docs/reference-guide

.. _⁹:

⁹ “Windows Azure Storage Services REST API Reference” Microsoft (2012)

   http://msdn.microsoft.com/en-us/library/windowsazure/dd179355.aspx

.. _¹⁰:

¹⁰ “Representational state transfer” English Wikipedia (2012)

   https://en.wikipedia.org/wiki/Representational_state_transfer

.. _¹¹:

¹¹ “Performance costs for some common operations” tahoe-lafs.org (2012)

   :doc:`../../performance`