=============================================================
Redundant Array of Independent Clouds: Share To Cloud Mapping
=============================================================

Introduction
============

This document describes a proposed design for the mapping of LAFS shares to
objects in a cloud storage service. It also analyzes the costs of each of the
functional requirements, including network, disk, storage, and API usage costs.

Terminology
===========

*LAFS share*
    A Tahoe-LAFS share representing part of a file after encryption and
    erasure encoding.

*LAFS shareset*
    The set of shares stored by a LAFS storage server for a given storage
    index. The shares within a shareset are numbered by a small integer.

*Cloud storage service*
    A service such as Amazon S3 `²`_, Rackspace Cloud Files `³`_,
    Google Cloud Storage `⁴`_, or Windows Azure `⁵`_, that provides cloud
    storage.

*Cloud storage interface*
    A protocol interface supported by a cloud storage service, such as the
    S3 interface `⁶`_, the OpenStack Object Storage interface `⁷`_, the
    Google Cloud Storage interface `⁸`_, or the Azure interface `⁹`_. There
    may be multiple services implementing a given cloud storage interface.
    In this design, only REST-based APIs `¹⁰`_ over HTTP will be used as
    interfaces.

*Cloud object*
    A file-like abstraction provided by a cloud storage service, storing a
    sequence of bytes. Cloud objects are mutable in the sense that the
    contents and metadata of the cloud object with a given name in a given
    cloud container can be replaced. Cloud objects are called “blobs” in the
    Azure interface, and “objects” in the other interfaces.

*Cloud container*
    A container for cloud objects provided by a cloud service. Cloud
    containers are called “buckets” in the S3 and Google Cloud Storage
    interfaces, and “containers” in the Azure and OpenStack Storage
    interfaces.

Functional Requirements
=======================

* *Upload*: a LAFS share can be uploaded to an appropriately configured
  Tahoe-LAFS storage server, and the data is stored to the cloud storage
  service.

* *Scalable shares*: there is no hard limit on the size of LAFS share that
  can be uploaded.

  If the cloud storage interface offers scalable files, then this could be
  implemented by using that feature of the specific cloud storage interface.
  Alternatively, it could be implemented by mapping from the LAFS abstraction
  of an unlimited-size immutable share to a set of size-limited cloud
  objects.

* *Streaming upload*: the size of the LAFS share that is uploaded can exceed
  the amount of RAM and even the amount of direct-attached storage on the
  storage server. That is, the storage server is required to stream the data
  directly to the ultimate cloud storage service while processing it, rather
  than buffering the data until the client has finished uploading and only
  then transferring it to the cloud storage service.

* *Download*: a LAFS share can be downloaded from an appropriately
  configured Tahoe-LAFS storage server, and the data is loaded from the
  cloud storage service.

* *Streaming download*: the size of the LAFS share that is downloaded can
  exceed the amount of RAM and even the amount of direct-attached storage on
  the storage server. That is, the storage server is required to stream the
  data directly to the client while processing it, rather than buffering the
  data until the cloud storage service has finished serving it and only then
  transferring it to the client.

* *Modify*: a LAFS share can have part of its contents modified.

  If the cloud storage interface offers scalable mutable files, then this
  could be implemented by using that feature of the specific cloud storage
  interface. Alternatively, it could be implemented by mapping from the LAFS
  abstraction of an unlimited-size mutable share to a set of size-limited
  cloud objects.

* *Efficient modify*: the size of the LAFS share being modified can exceed
  the amount of RAM and even the amount of direct-attached storage on the
  storage server. That is, the storage server is required to download,
  patch, and upload only the segment(s) of the share that are being
  modified, rather than downloading, patching, and uploading the entire
  share.

* *Tracking leases*: the Tahoe-LAFS storage server is required to track when
  each share has its lease renewed, so that unused shares (shares whose
  lease has not been renewed within a time limit, e.g. 30 days) can be
  garbage-collected. This does not necessarily require code specific to each
  cloud storage interface, because the lease tracking can be performed in
  the storage server's generic component rather than in the component
  supporting each interface.

Mapping
=======

This section describes the mapping between LAFS shares and cloud objects.

A LAFS share will be split into one or more “chunks” that are each stored in
a cloud object. A LAFS share of size `C` bytes will be stored as
`ceiling(C / chunksize)` chunks. The last chunk has a size between 1 and
`chunksize` bytes inclusive. (It is not possible for `C` to be zero, because
valid shares always have a header, so there is at least one chunk for each
share.)

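As a minimal illustration of this rule, a sketch in Python (the function name
``split_into_chunks`` is hypothetical, not part of the design)::

    import math

    def split_into_chunks(share_data, chunksize):
        """Split a share of C bytes into ceiling(C / chunksize) chunks.
        The last chunk has between 1 and chunksize bytes; C cannot be
        zero because a valid share always contains at least a header."""
        C = len(share_data)
        assert C > 0, "valid shares always have a header"
        num_chunks = math.ceil(C / chunksize)
        return [share_data[i * chunksize:(i + 1) * chunksize]
                for i in range(num_chunks)]
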
For an existing share, the chunk size is determined by the size of the first
chunk. For a new share, it is a parameter that may depend on the storage
interface. It is an error for any chunk to be larger than the first chunk, or
for any chunk other than the last to be smaller than the first chunk. If a
mutable share with total size less than the default chunk size for the
storage interface is being modified, the new contents are split using the
default chunk size.

*Rationale*: this design allows the `chunksize` parameter to be changed for
new shares written via a particular storage interface, without breaking
compatibility with existing stored shares. All cloud storage interfaces
return the sizes of cloud objects with requests to list objects, and so
the size of the first chunk can be determined without an additional request.

The name of the cloud object for chunk `i` > 0 of a LAFS share with storage
index `STORAGEINDEX` and share number `SHNUM` will be

   shares/`ST`/`STORAGEINDEX`/`SHNUM.i`

where `ST` is the first two characters of `STORAGEINDEX`. When `i` is 0, the
`.0` is omitted.

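This naming rule is simple enough to capture in a few lines; a sketch follows
(the helper name ``chunk_object_name`` is hypothetical)::

    def chunk_object_name(storageindex, shnum, i):
        """Return the cloud object name for chunk i of the share with the
        given storage index (a string) and share number. Chunk 0 omits
        the ".0" suffix, matching the prototype S3 backend's layout."""
        prefix = "shares/%s/%s/" % (storageindex[:2], storageindex)
        if i == 0:
            return prefix + str(shnum)
        return "%s%d.%d" % (prefix, shnum, i)
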
*Rationale*: this layout maintains compatibility with data stored by the
prototype S3 backend, for which Least Authority Enterprises has existing
customers. This prototype always used a single cloud object to store each
share, with name

   shares/`ST`/`STORAGEINDEX`/`SHNUM`

By using the same prefix “shares/`ST`/`STORAGEINDEX`/” for old and new
layouts, the storage server can obtain a list of cloud objects associated
with a given shareset without having to know the layout in advance, and
without having to make multiple API requests. This also simplifies sharing
of test code between the disk and cloud backends.

Mutable and immutable shares will be “chunked” in the same way.

Rationale for Chunking
----------------------

Limiting the amount of data received or sent in a single request has the
following advantages:

* It is unnecessary to write separate code to take advantage of the
  “large object” features of each cloud storage interface, which differ
  significantly in their design.
* Data needed for each PUT request can be discarded after it completes.
  If a PUT request fails, it can be retried while only holding the data
  for that request in memory.

Costs
=====

In this section we analyze the costs of the proposed design in terms of
network, disk, memory, cloud storage, and API usage.

Network usage—bandwidth and number-of-round-trips
-------------------------------------------------

When a Tahoe-LAFS storage client allocates a new share on a storage server,
the backend will request a list of the existing cloud objects with the
appropriate prefix. This takes one HTTP request in the common case, but may
take more for the S3 interface, which has a limit of 1000 objects returned in
a single “GET Bucket” request; listing a larger shareset requires following
the pagination of the responses, as sketched below.

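For illustration only, a sketch of such a paginated listing using the modern
``boto3`` S3 client (not the library used by the prototype backend, which was
built on Twisted)::

    import boto3  # illustration only; the actual backend would use a
                  # Twisted-compatible HTTP client

    def list_shareset_objects(bucket, prefix):
        """List all cloud objects with the given prefix, following S3's
        pagination (at most 1000 keys are returned per response)."""
        s3 = boto3.client("s3")
        names = []
        kwargs = {"Bucket": bucket, "Prefix": prefix}
        while True:
            response = s3.list_objects_v2(**kwargs)
            names.extend(obj["Key"] for obj in response.get("Contents", []))
            if not response.get("IsTruncated"):
                return names
            kwargs["ContinuationToken"] = response["NextContinuationToken"]
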
If the share is to be read, the client will make a number of calls, each
specifying the offset and length of the required span of bytes. On the first
request that overlaps a given chunk of the share, the server will make an
HTTP GET request for that cloud object. The server may also speculatively
make GET requests for cloud objects that are likely to be needed soon (which
can be predicted since reads are normally sequential), in order to reduce
latency.

Each read will be satisfied as soon as the corresponding data is available,
without waiting for the rest of the chunk, in order to minimize read latency.

All four cloud storage interfaces support GET requests using the Range HTTP
header. This could be used to optimize reads where the Tahoe-LAFS storage
client requires only part of a share.

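A minimal sketch of such a ranged read, using the ``requests`` library for
brevity (authentication headers are omitted, and the URL and offsets are
placeholders)::

    import requests

    def get_byte_range(url, offset, length):
        """Fetch bytes [offset, offset + length) of a cloud object via an
        HTTP Range request; a 206 Partial Content response is expected."""
        headers = {"Range": "bytes=%d-%d" % (offset, offset + length - 1)}
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        return response.content
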
If the share is to be written, the server will make an HTTP PUT request for
each chunk that has been completed. Tahoe-LAFS clients only write immutable
shares sequentially, and so we can rely on that property to simplify the
implementation.

When modifying shares of an existing mutable file, the storage server will
be able to make PUT requests only for chunks that have changed. (Current
Tahoe-LAFS v1.9 clients will not take advantage of this ability, but future
versions will probably do so for MDMF files.)

In some cases, it may be necessary to retry a request (see the `Structure of
Implementation`_ section below). In the case of a PUT request, at the point
at which a retry is needed, the new chunk contents to be stored will still
be in memory, and so this is not problematic.

In the absence of retries, the maximum number of GET requests that will be
made when downloading a file, or the maximum number of PUT requests when
uploading or modifying a file, will be equal to the number of chunks in the
file.

If the new mutable share content has fewer chunks than the old content,
then the remaining cloud objects for old chunks must be deleted (using one
HTTP request each). When reading a share, the backend must tolerate the case
where these cloud objects have not been deleted successfully.

The last write to a share will be reported as successful only when all
corresponding HTTP PUTs and DELETEs have completed successfully.

Disk usage (local to the storage server)
----------------------------------------

It is never necessary for the storage server to write the content of share
chunks to local disk, either when they are read or when they are written.
Each chunk is held only in memory.

A proposed change to the Tahoe-LAFS storage server implementation uses an
SQLite database to store metadata about shares. In that case the same
database would be used for the cloud backend. This would enable lease
tracking to be implemented in the same way for disk and cloud backends.

Memory usage
------------

The use of chunking simplifies bounding the memory usage of the storage
server when handling files that may be larger than memory. However, this
depends on limiting the number of chunks that are simultaneously held in
memory. Multiple chunks can be held in memory either because of pipelining
of requests for a single share, or because multiple shares are being read or
written (possibly by multiple clients).

For immutable shares, the Tahoe-LAFS storage protocol requires the client to
specify in advance the maximum amount of data it will write. Also, a
cooperative client (including all existing released versions of the
Tahoe-LAFS code) will limit the amount of data that is pipelined, currently
to 50 KiB. Since the chunk size will be greater than that, it is possible to
ensure that for each allocation, the maximum chunk data memory usage is the
lesser of two chunks and the allocation size. (There is some additional
overhead, but it is small compared to the chunk data.) If the maximum memory
usage of a new allocation would exceed the memory available, the allocation
can be delayed or possibly denied, so that the total memory usage is
bounded.

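A sketch of the accounting just described (the helper names are hypothetical,
and the 50 KiB pipeline limit is taken from the current client behavior)::

    PIPELINE_LIMIT = 50 * 1024  # bytes pipelined by a cooperative client

    def max_chunk_memory(allocated_size, chunksize):
        """Upper bound on chunk data held in memory for one immutable
        allocation: the lesser of two chunks and the allocation size.
        Valid when the chunk size exceeds the client's pipeline limit."""
        assert chunksize > PIPELINE_LIMIT
        return min(2 * chunksize, allocated_size)

    def can_accept_allocation(allocated_size, chunksize,
                              memory_in_use, memory_available):
        """Decide whether a new allocation can be accepted immediately,
        or must be delayed or denied to keep total memory bounded."""
        needed = max_chunk_memory(allocated_size, chunksize)
        return memory_in_use + needed <= memory_available
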
It is not clear that the existing protocol allows allocations for mutable
shares to be bounded in general; this may be addressed in a future protocol
change.

The above discussion assumes that clients do not maliciously send large
messages as a denial-of-service attack. Foolscap (the protocol layer
underlying the Tahoe-LAFS storage protocol) does not attempt to resist
denial of service.

Storage
-------

The storage requirements, including not-yet-collected garbage shares, are
the same as for the Tahoe-LAFS disk backend. That is, the total size of
cloud objects stored is equal to the total size of shares that the disk
backend would store.

Erasure coding causes the size of shares for each file to be a factor
`shares.total` / `shares.needed` times the file size, plus overhead that is
logarithmic in the file size `¹¹`_.

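For example, with the default encoding parameters `shares.needed` = 3 and
`shares.total` = 10, a 1 GiB file is stored as roughly 10/3 ≈ 3.33 GiB of
share data in total across all storage servers, plus the logarithmic
overhead.
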
API usage
---------

Cloud storage services typically charge a small fee per API request. The
number of requests to the cloud storage service for various operations is
discussed under “network usage” above.

Structure of Implementation
===========================

A generic “cloud backend”, based on the prototype S3 backend but with
support for chunking as described above, will be written.

An instance of the cloud backend can be attached to one of several
“cloud interface adapters”, one for each cloud storage interface. These
adapters will operate only on chunks, and need not distinguish between
mutable and immutable shares. They will be a relatively “thin” abstraction
layer over the HTTP APIs of each cloud storage interface, similar to the
S3Bucket abstraction in the prototype.

For some cloud storage services it may be necessary to transparently retry
requests in order to recover from transient failures. (Although the erasure
coding may enable a file to be retrieved even when shares are not stored by
or not readable from all cloud storage services used in a Tahoe-LAFS grid,
it may be desirable to retry cloud storage service requests in order to
improve overall reliability.) Support for this will be implemented in the
generic cloud backend, and used whenever a cloud interface adapter reports a
transient failure. Our experience with the prototype suggests that it is
necessary to retry on transient failures for Amazon's S3 service.

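A minimal sketch of this retry behavior, assuming a synchronous ``operation``
callable and a predicate classifying failures (the real backend would use
Twisted Deferreds rather than blocking sleeps)::

    import time

    def call_with_retries(operation, is_transient, max_tries=4, delay=1.0):
        """Retry an idempotent cloud request on transient failures,
        backing off exponentially between attempts."""
        for attempt in range(max_tries):
            try:
                return operation()
            except Exception as e:
                if attempt == max_tries - 1 or not is_transient(e):
                    raise
                time.sleep(delay * (2 ** attempt))
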
There will also be a “mock” cloud interface adapter, based on the
prototype's MockS3Bucket. This allows tests of the generic cloud backend to
be run without a connection to a real cloud service. The mock adapter will
be able to simulate transient and non-transient failures.

Known Issues
============

This design worsens a known “write hole” issue in Tahoe-LAFS when updating
the contents of mutable files. An update to a mutable file can require
changing the contents of multiple chunks, and if the client fails or is
disconnected during the operation, the resulting state of the stored cloud
objects may be inconsistent—no longer containing all of the old version, but
not yet containing all of the new version. A mutable share can be left in an
inconsistent state even by the existing Tahoe-LAFS disk backend if it fails
during a write, but that has a smaller chance of occurrence because the
current client behavior leads to mutable shares being written to disk in a
single system call.

The best fix for this issue probably requires changing the Tahoe-LAFS
storage protocol, perhaps by extending it to use a two-phase or three-phase
commit (ticket #1755).

References
==========

¹ omitted

.. _²:

² “Amazon S3” Amazon (2012)

   https://aws.amazon.com/s3/

.. _³:

³ “Rackspace Cloud Files” Rackspace (2012)

   https://www.rackspace.com/cloud/cloud_hosting_products/files/

.. _⁴:

⁴ “Google Cloud Storage” Google (2012)

   https://developers.google.com/storage/

.. _⁵:

⁵ “Windows Azure Storage” Microsoft (2012)

   https://www.windowsazure.com/en-us/develop/net/fundamentals/cloud-storage/

.. _⁶:

⁶ “Amazon Simple Storage Service (Amazon S3) API Reference: REST API” Amazon (2012)

   http://docs.amazonwebservices.com/AmazonS3/latest/API/APIRest.html

.. _⁷:

⁷ “OpenStack Object Storage” openstack.org (2012)

   http://openstack.org/projects/storage/

.. _⁸:

⁸ “Google Cloud Storage Reference Guide” Google (2012)

   https://developers.google.com/storage/docs/reference-guide

.. _⁹:

⁹ “Windows Azure Storage Services REST API Reference” Microsoft (2012)

   http://msdn.microsoft.com/en-us/library/windowsazure/dd179355.aspx

.. _¹⁰:

¹⁰ “Representational state transfer” English Wikipedia (2012)

   https://en.wikipedia.org/wiki/Representational_state_transfer

.. _¹¹:

¹¹ “Performance costs for some common operations” tahoe-lafs.org (2012)

   https://tahoe-lafs.org/trac/tahoe-lafs/browser/trunk/docs/performance.rst