

=============================================================
Redundant Array of Independent Clouds: Share To Cloud Mapping
=============================================================


Introduction
============

This document describes a proposed design for the mapping of LAFS shares to
objects in a cloud storage service. It also analyzes the costs for each of the
functional requirements, including network, disk, storage and API usage costs.


Terminology
===========

*LAFS share*
   A Tahoe-LAFS share representing part of a file after encryption and
   erasure encoding.

*LAFS shareset*
   The set of shares stored by a LAFS storage server for a given storage index.
   The shares within a shareset are numbered by a small integer.

*Cloud storage service*
   A service such as Amazon S3 `²`_, Rackspace Cloud Files `³`_,
   Google Cloud Storage `⁴`_, or Windows Azure `⁵`_, that provides cloud storage.

*Cloud storage interface*
   A protocol interface supported by a cloud storage service, such as the
   S3 interface `⁶`_, the OpenStack Object Storage interface `⁷`_, the
   Google Cloud Storage interface `⁸`_, or the Azure interface `⁹`_. There may be
   multiple services implementing a given cloud storage interface. In this design,
   only REST-based APIs `¹⁰`_ over HTTP will be used as interfaces.

*Cloud object*
   A file-like abstraction provided by a cloud storage service, storing a
   sequence of bytes. Cloud objects are mutable in the sense that the contents
   and metadata of the cloud object with a given name in a given cloud container
   can be replaced. Cloud objects are called “blobs” in the Azure interface,
   and “objects” in the other interfaces.

*Cloud container*
   A container for cloud objects provided by a cloud service. Cloud containers
   are called “buckets” in the S3 and Google Cloud Storage interfaces, and
   “containers” in the Azure and OpenStack Storage interfaces.


Functional Requirements
=======================

* *Upload*: a LAFS share can be uploaded to an appropriately configured
  Tahoe-LAFS storage server and the data is stored to the cloud
  storage service.

 * *Scalable shares*: there is no hard limit on the size of LAFS share
   that can be uploaded.

   If the cloud storage interface offers scalable files, then this could be
   implemented by using that feature of the specific cloud storage
   interface. Alternately, it could be implemented by mapping from the LAFS
   abstraction of an unlimited-size immutable share to a set of size-limited
   cloud objects.

 * *Streaming upload*: the size of the LAFS share that is uploaded
   can exceed the amount of RAM and even the amount of direct attached
   storage on the storage server. I.e., the storage server is required to
   stream the data directly to the ultimate cloud storage service while
   processing it, rather than buffering the data until the client has
   finished uploading and only then transferring it to the cloud storage
   service.

* *Download*: a LAFS share can be downloaded from an appropriately
  configured Tahoe-LAFS storage server, and the data is loaded from the
  cloud storage service.

 * *Streaming download*: the size of the LAFS share that is
   downloaded can exceed the amount of RAM and even the amount of direct
   attached storage on the storage server. I.e., the storage server is
   required to stream the data directly to the client while processing it,
   rather than buffering the data until the cloud storage service has
   finished serving it and only then transferring it to the client.

* *Modify*: a LAFS share can have part of its contents modified.

  If the cloud storage interface offers scalable mutable files, then this
  could be implemented by using that feature of the specific cloud storage
  interface. Alternately, it could be implemented by mapping from the LAFS
  abstraction of an unlimited-size mutable share to a set of size-limited
  cloud objects.

 * *Efficient modify*: the size of the LAFS share being
   modified can exceed the amount of RAM and even the amount of direct
   attached storage on the storage server. I.e., the storage server is
   required to download, patch, and upload only the segment(s) of the share
   that are being modified, rather than downloading, patching, and uploading
   the entire share.

* *Tracking leases*: The Tahoe-LAFS storage server is required to track when
  each share has its lease renewed so that unused shares (shares whose lease
  has not been renewed within a time limit, e.g. 30 days) can be garbage
  collected. This does not necessarily require code specific to each cloud
  storage interface, because the lease tracking can be performed in the
  storage server's generic component rather than in the component supporting
  each interface.


Mapping
=======

This section describes the mapping between LAFS shares and cloud objects.

A LAFS share will be split into one or more “chunks” that are each stored in a
cloud object. A LAFS share of size `C` bytes will be stored as `ceiling(C / chunksize)`
chunks. The last chunk has a size between 1 and `chunksize` bytes inclusive.
(It is not possible for `C` to be zero, because valid shares always have a
header; therefore there is at least one chunk for each share.)
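
The chunk count and sizes above can be illustrated with a minimal sketch. The
function name `chunk_sizes` is hypothetical and is not part of any existing
Tahoe-LAFS code; it only restates the arithmetic of this section.

```python
def chunk_sizes(share_size, chunksize):
    """Return the sizes of the chunks used to store a share of share_size bytes.

    Hypothetical helper: every chunk except possibly the last is exactly
    `chunksize` bytes, and the last is between 1 and `chunksize` bytes.
    """
    if share_size <= 0:
        raise ValueError("a valid share always has a header, so its size is > 0")
    if chunksize <= 0:
        raise ValueError("chunksize must be positive")
    # Integer form of ceiling(share_size / chunksize).
    nchunks = (share_size + chunksize - 1) // chunksize
    last = share_size - (nchunks - 1) * chunksize
    return [chunksize] * (nchunks - 1) + [last]
```

For example, a 10-byte share with a 4-byte chunk size would be stored as three
chunks of 4, 4, and 2 bytes.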

For an existing share, the chunk size is determined by the size of the first
chunk. For a new share, it is a parameter that may depend on the storage
interface. It is an error for any chunk to be larger than the first chunk, or
for any chunk other than the last to be smaller than the first chunk.
If a mutable share with total size less than the default chunk size for the
storage interface is being modified, the new contents are split using the
default chunk size.

  *Rationale*: this design allows the `chunksize` parameter to be changed for
  new shares written via a particular storage interface, without breaking
  compatibility with existing stored shares. All cloud storage interfaces
  return the sizes of cloud objects with requests to list objects, and so
  the size of the first chunk can be determined without an additional request.
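
A sketch of how the size rules above might be checked, assuming the sizes of a
share's chunks have already been obtained from an object listing and ordered
by chunk index. The function name `check_chunk_sizes` is an assumption made
for illustration.

```python
def check_chunk_sizes(sizes):
    """Validate chunk sizes (ordered by chunk index) and return the chunk size.

    Hypothetical helper: the chunk size of an existing share is taken to be
    the size of its first chunk, as described in the text.
    """
    if not sizes:
        raise ValueError("a share has at least one chunk")
    chunksize = sizes[0]
    last_index = len(sizes) - 1
    for i, size in enumerate(sizes):
        if size < 1:
            raise ValueError("chunk %d is empty" % i)
        if size > chunksize:
            raise ValueError("chunk %d is larger than the first chunk" % i)
        if i != last_index and size < chunksize:
            raise ValueError("non-final chunk %d is smaller than the first chunk" % i)
    return chunksize
```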

The name of the cloud object for chunk `i` > 0 of a LAFS share with storage
index `STORAGEINDEX` and share number `SHNUM` will be

  shares/`ST`/`STORAGEINDEX`/`SHNUM.i`

where `ST` is the first two characters of `STORAGEINDEX`. When `i` is 0, the
`.0` is omitted.
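
The naming rule can be sketched as follows. The function name
`chunk_object_name` and the sample storage index used in the usage note are
hypothetical; only the name layout itself comes from this document.

```python
def chunk_object_name(storage_index, shnum, i):
    """Return the cloud object name for chunk i of share shnum.

    Hypothetical helper: storage_index is the storage index string, and the
    two-character prefix directory is its first two characters.
    """
    if i < 0:
        raise ValueError("chunk index must be non-negative")
    prefix = "shares/%s/%s/" % (storage_index[:2], storage_index)
    if i == 0:
        # Chunk 0 omits the ".0" suffix, matching the single-object layout.
        return prefix + "%d" % shnum
    return prefix + "%d.%d" % (shnum, i)
```

With an illustrative storage index of `abcdefgh`, chunk 0 of share 3 would be
named `shares/ab/abcdefgh/3` and chunk 2 would be named
`shares/ab/abcdefgh/3.2`.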

  *Rationale*: this layout maintains compatibility with data stored by the
  prototype S3 backend, for which Least Authority Enterprises has existing
  customers. This prototype always used a single cloud object to store each
  share, with name

    shares/`ST`/`STORAGEINDEX`/`SHNUM`

  By using the same prefix “shares/`ST`/`STORAGEINDEX`/” for old and new layouts,
  the storage server can obtain a list of cloud objects associated with a given
  shareset without having to know the layout in advance, and without having to
  make multiple API requests. This also simplifies sharing of test code between the
  disk and cloud backends.

Mutable and immutable shares will be “chunked” in the same way.


Rationale for Chunking
----------------------

Limiting the amount of data received or sent in a single request has the
following advantages:

* It is unnecessary to write separate code to take advantage of the
  “large object” features of each cloud storage interface, which differ
  significantly in their design.
* Data needed for each PUT request can be discarded after it completes.
  If a PUT request fails, it can be retried while only holding the data
  for that request in memory.


Costs
=====

In this section we analyze the costs of the proposed design in terms of network,
disk, memory, cloud storage, and API usage.


Network usage: bandwidth and number-of-round-trips
--------------------------------------------------

When a Tahoe-LAFS storage client allocates a new share on a storage server,
the backend will request a list of the existing cloud objects with the
appropriate prefix. This takes one HTTP request in the common case, but may
take more for the S3 interface, which has a limit of 1000 objects returned in
a single “GET Bucket” request.

If the share is to be read, the client will make a number of calls each
specifying the offset and length of the required span of bytes. On the first
request that overlaps a given chunk of the share, the server will make an
HTTP GET request for that cloud object. The server may also speculatively
make GET requests for cloud objects that are likely to be needed soon (which
can be predicted since reads are normally sequential), in order to reduce
latency.

Each read will be satisfied as soon as the corresponding data is available,
without waiting for the rest of the chunk, in order to minimize read latency.

All four cloud storage interfaces support GET requests using the
Range HTTP header. This could be used to optimize reads where the
Tahoe-LAFS storage client requires only part of a share.
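
As an illustration of how a read span could be translated into ranged GET
requests, the following sketch maps an (offset, length) pair onto per-chunk
byte ranges. The function name `chunk_ranges` is hypothetical, and this is one
possible approach rather than the planned implementation. The first and last
byte positions in an HTTP Range header value are both inclusive.

```python
def chunk_ranges(offset, length, chunksize):
    """Map a read of `length` bytes at `offset` onto per-chunk Range values.

    Hypothetical helper: returns a list of (chunk_index, range_value) pairs,
    where range_value is relative to the start of that chunk.
    """
    ranges = []
    end = offset + length  # exclusive end of the span within the share
    pos = offset
    while pos < end:
        index = pos // chunksize
        chunk_start = index * chunksize
        first = pos - chunk_start
        # Exclusive end of the part of the span that falls in this chunk.
        stop = min(end, chunk_start + chunksize) - chunk_start
        ranges.append((index, "bytes=%d-%d" % (first, stop - 1)))
        pos = chunk_start + stop
    return ranges
```

A read of 6 bytes at offset 3 with a 4-byte chunk size touches the last byte
of chunk 0, all of chunk 1, and the first byte of chunk 2.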

If the share is to be written, the server will make an HTTP PUT request for
each chunk that has been completed. Tahoe-LAFS clients only write immutable
shares sequentially, and so we can rely on that property to simplify the
implementation.

When modifying shares of an existing mutable file, the storage server will
be able to make PUT requests only for chunks that have changed.
(Current Tahoe-LAFS v1.9 clients will not take advantage of this ability, but
future versions will probably do so for MDMF files.)

In some cases, it may be necessary to retry a request (see the `Structure of
Implementation`_ section below). In the case of a PUT request, at the point
at which a retry is needed, the new chunk contents to be stored will still be
in memory and so this is not problematic.

In the absence of retries, the maximum number of GET requests that will be made
when downloading a share, or the maximum number of PUT requests when uploading
or modifying a share, will be equal to the number of chunks in the share.

If the new mutable share content has fewer chunks than the old content,
then the remaining cloud objects for old chunks must be deleted (using one
HTTP request each). When reading a share, the backend must tolerate the case
where these cloud objects have not been deleted successfully.
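
A minimal sketch of the cleanup step, assuming the backend knows how many
chunks the old content had and the size of the new content. The function name
`stale_chunk_indices` is hypothetical.

```python
def stale_chunk_indices(old_nchunks, new_share_size, chunksize):
    """Return the indices of old chunks that the new content no longer uses.

    Hypothetical helper: each returned index corresponds to one cloud object
    that would need one HTTP DELETE request.
    """
    # Integer form of ceiling(new_share_size / chunksize).
    new_nchunks = (new_share_size + chunksize - 1) // chunksize
    return list(range(new_nchunks, old_nchunks))
```

If a share shrinks from five chunks to 9 bytes with a 4-byte chunk size, the
new content occupies chunks 0 to 2, so chunks 3 and 4 would be deleted.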

The last write to a share will be reported as successful only when all
corresponding HTTP PUTs and DELETEs have completed successfully.


Disk usage (local to the storage server)
----------------------------------------

It is never necessary for the storage server to write the content of share
chunks to local disk, either when they are read or when they are written. Each
chunk is held only in memory.

A proposed change to the Tahoe-LAFS storage server implementation uses a sqlite
database to store metadata about shares. In that case the same database would
be used for the cloud backend. This would enable lease tracking to be implemented
in the same way for disk and cloud backends.


Memory usage
------------

The use of chunking simplifies bounding the memory usage of the storage server
when handling files that may be larger than memory. However, this depends on
limiting the number of chunks that are simultaneously held in memory.
Multiple chunks can be held in memory either because of pipelining of requests
for a single share, or because multiple shares are being read or written
(possibly by multiple clients).

For immutable shares, the Tahoe-LAFS storage protocol requires the client to
specify in advance the maximum amount of data it will write. Also, a cooperative
client (including all existing released versions of the Tahoe-LAFS code) will
limit the amount of data that is pipelined, currently to 50 KiB. Since the chunk
size will be greater than that, it is possible to ensure that for each allocation,
the maximum chunk data memory usage is the lesser of two chunks and the allocation
size. (There is some additional overhead but it is small compared to the chunk
data.) If the maximum memory usage of a new allocation would exceed the memory
available, the allocation can be delayed or possibly denied, so that the total
memory usage is bounded.
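
The bound for immutable shares can be sketched as a simple admission check.
The names `max_chunk_memory` and `MemoryBudget` are hypothetical, and the
sketch ignores the small additional overhead mentioned above; it shows one
way the delay-or-deny decision could be made, not a committed design.

```python
def max_chunk_memory(allocation_size, chunksize):
    """Upper bound on chunk data held in memory for one allocation."""
    return min(2 * chunksize, allocation_size)


class MemoryBudget:
    """Hypothetical tracker for memory reserved by in-progress allocations."""

    def __init__(self, limit):
        self.limit = limit
        self.reserved = 0

    def try_reserve(self, allocation_size, chunksize):
        """Reserve memory for an allocation; return False if it would not fit.

        A False result means the caller should delay or deny the allocation.
        """
        need = max_chunk_memory(allocation_size, chunksize)
        if self.reserved + need > self.limit:
            return False
        self.reserved += need
        return True

    def release(self, allocation_size, chunksize):
        """Return the memory reserved for a finished allocation."""
        self.reserved -= max_chunk_memory(allocation_size, chunksize)
```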

It is not clear that the existing protocol allows allocations for mutable
shares to be bounded in general; this may be addressed in a future protocol change.

The above discussion assumes that clients do not maliciously send large
messages as a denial-of-service attack. Foolscap (the protocol layer underlying
the Tahoe-LAFS storage protocol) does not attempt to resist denial of service.


Storage
-------

The storage requirements, including not-yet-collected garbage shares, are
the same as for the Tahoe-LAFS disk backend. That is, the total size of cloud
objects stored is equal to the total size of shares that the disk backend
would store.

Erasure coding causes the size of shares for each file to be a
factor `shares.total` / `shares.needed` times the file size, plus overhead
that is logarithmic in the file size `¹¹`_.
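
As a rough worked example of the expansion factor, the following sketch
estimates the total size of all shares for a file. It deliberately ignores the
logarithmic overhead, and the function name `approx_stored_bytes` is
hypothetical.

```python
def approx_stored_bytes(file_size, shares_needed, shares_total):
    """Approximate total size of all shares for a file, ignoring overhead.

    Hypothetical helper: applies the shares.total / shares.needed factor only.
    """
    if shares_needed < 1 or shares_total < shares_needed:
        raise ValueError("require 1 <= shares_needed <= shares_total")
    return file_size * shares_total / shares_needed
```

With `shares.needed` = 3 and `shares.total` = 10, a 3000-byte file would
occupy roughly 10000 bytes across all of its shares before overhead.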


API usage
---------

Cloud storage backends typically charge a small fee per API request. The number of
requests to the cloud storage service for various operations is discussed under
“network usage” above.


Structure of Implementation
===========================

A generic “cloud backend”, based on the prototype S3 backend but with support
for chunking as described above, will be written.

An instance of the cloud backend can be attached to one of several
“cloud interface adapters”, one for each cloud storage interface. These
adapters will operate only on chunks, and need not distinguish between
mutable and immutable shares. They will be a relatively “thin” abstraction
layer over the HTTP APIs of each cloud storage interface, similar to the
S3Bucket abstraction in the prototype.

For some cloud storage services it may be necessary to transparently retry
requests in order to recover from transient failures. (Although the erasure
coding may enable a file to be retrieved even when shares are not stored by or
not readable from all cloud storage services used in a Tahoe-LAFS grid, it may
be desirable to retry cloud storage service requests in order to improve overall
reliability.) Support for this will be implemented in the generic cloud backend,
and used whenever a cloud interface adapter reports a transient failure. Our
experience with the prototype suggests that it is necessary to retry on transient
failures for Amazon's S3 service.
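
One way the generic retry support could look is sketched below. It is
synchronous for brevity, and the exception class `TransientFailure`, the
function `with_retries`, and the retry count and delays are all assumptions
made for illustration rather than details of the prototype.

```python
import time


class TransientFailure(Exception):
    """Hypothetical error an adapter raises for a failure worth retrying."""


def with_retries(operation, max_attempts=3, initial_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying on TransientFailure with doubling delays.

    Any other exception, and the final TransientFailure once max_attempts
    is reached, propagates to the caller.
    """
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientFailure:
            if attempt == max_attempts:
                raise
            sleep(delay)
            delay *= 2
```

Because the chunk data for a PUT is still in memory when a retry is needed,
the same `operation` callable can simply be invoked again.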

There will also be a “mock” cloud interface adapter, based on the prototype's
MockS3Bucket. This allows tests of the generic cloud backend to be run without
a connection to a real cloud service. The mock adapter will be able to simulate
transient and non-transient failures.
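
A minimal in-memory sketch of such a mock is shown below. The class and method
names (`MockCloudAdapter`, `put_object`, `fail_next`, and so on) are
hypothetical and are not taken from the prototype's MockS3Bucket; the sketch
only illustrates operating on whole chunks and injecting failures.

```python
class SimulatedFailure(Exception):
    """Hypothetical error raised by the mock; `transient` says whether to retry."""

    def __init__(self, transient):
        kind = "transient" if transient else "non-transient"
        super().__init__("simulated %s failure" % kind)
        self.transient = transient


class MockCloudAdapter:
    """Hypothetical in-memory stand-in for a cloud interface adapter."""

    def __init__(self):
        self._objects = {}
        self._pending_failures = []

    def fail_next(self, transient=True):
        """Make the next request fail with a simulated failure."""
        self._pending_failures.append(transient)

    def _maybe_fail(self):
        if self._pending_failures:
            raise SimulatedFailure(self._pending_failures.pop(0))

    def put_object(self, name, data):
        self._maybe_fail()
        self._objects[name] = bytes(data)

    def get_object(self, name):
        self._maybe_fail()
        return self._objects[name]

    def delete_object(self, name):
        self._maybe_fail()
        self._objects.pop(name, None)

    def list_objects(self, prefix):
        """Return sorted (name, size) pairs for objects with the given prefix."""
        self._maybe_fail()
        return sorted((n, len(d)) for n, d in self._objects.items()
                      if n.startswith(prefix))
```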


Known Issues
============

This design worsens a known “write hole” issue in Tahoe-LAFS when updating
the contents of mutable files. An update to a mutable file can require
changing the contents of multiple chunks, and if the client fails or is
disconnected during the operation the resulting state of the stored cloud
objects may be inconsistent: no longer containing all of the old version, but
not yet containing all of the new version. A mutable share can be left in an
inconsistent state even by the existing Tahoe-LAFS disk backend if it fails
during a write, but that has a smaller chance of occurrence because the
current client behavior leads to mutable shares being written to disk in a
single system call.

The best fix for this issue probably requires changing the Tahoe-LAFS storage
protocol, perhaps by extending it to use a two-phase or three-phase commit
(ticket #1755).


References
==========

¹ omitted

.. _²:

² “Amazon S3” Amazon (2012)

   https://aws.amazon.com/s3/

.. _³:

³ “Rackspace Cloud Files” Rackspace (2012)

   https://www.rackspace.com/cloud/cloud_hosting_products/files/

.. _⁴:

⁴ “Google Cloud Storage” Google (2012)

   https://developers.google.com/storage/

.. _⁵:

⁵ “Windows Azure Storage” Microsoft (2012)

   https://www.windowsazure.com/en-us/develop/net/fundamentals/cloud-storage/

.. _⁶:

⁶ “Amazon Simple Storage Service (Amazon S3) API Reference: REST API” Amazon (2012)

   http://docs.amazonwebservices.com/AmazonS3/latest/API/APIRest.html

.. _⁷:

⁷ “OpenStack Object Storage” openstack.org (2012)

   http://openstack.org/projects/storage/

.. _⁸:

⁸ “Google Cloud Storage Reference Guide” Google (2012)

   https://developers.google.com/storage/docs/reference-guide

.. _⁹:

⁹ “Windows Azure Storage Services REST API Reference” Microsoft (2012)

   http://msdn.microsoft.com/en-us/library/windowsazure/dd179355.aspx

.. _¹⁰:

¹⁰ “Representational state transfer” English Wikipedia (2012)

    https://en.wikipedia.org/wiki/Representational_state_transfer

.. _¹¹:

¹¹ “Performance costs for some common operations” tahoe-lafs.org (2012)

    https://tahoe-lafs.org/trac/tahoe-lafs/browser/trunk/docs/performance.rst