mirror of
https://github.com/tahoe-lafs/tahoe-lafs.git
synced 2025-04-07 19:04:21 +00:00
docs/proposed/leasedb.rst: Add design doc for leasedb.
This version refs #1834. Signed-off-by: David-Sarah Hopwood <david-sarah@jacaranda.org>
This commit is contained in:
parent
8a1b2c7aa6
commit
b94721ad40
288
docs/proposed/leasedb.rst
Normal file
288
docs/proposed/leasedb.rst
Normal file
@ -0,0 +1,288 @@
|
||||
|
||||
=====================
|
||||
Lease database design
|
||||
=====================
|
||||
|
||||
The target audience for this document is developers who wish to understand
|
||||
the new lease database (leasedb) planned to be added in Tahoe-LAFS v1.11.0.
|
||||
|
||||
|
||||
Introduction
|
||||
------------
|
||||
|
||||
A "lease" is a request by an account that a share not be deleted before a
|
||||
specified time. Each storage server stores leases in order to know which
|
||||
shares to spare from garbage collection.
|
||||
|
||||
Motivation
|
||||
----------
|
||||
|
||||
The leasedb will replace the current design in which leases are stored in
|
||||
the storage server's share container files. That design has several
|
||||
disadvantages:
|
||||
|
||||
- Updating a lease requires modifying a share container file (even for
|
||||
immutable shares). This complicates the implementation of share classes.
|
||||
The mixing of share contents and lease data in share files also led to a
|
||||
security bug (ticket `#1528`_).
|
||||
|
||||
- When only the disk backend is supported, it is possible to read and
|
||||
update leases synchronously because the share files are stored locally
|
||||
to the storage server. For the cloud backend, accessing share files
|
||||
requires an HTTP request, and so must be asynchronous. Accepting this
|
||||
asynchrony for lease queries would be both inefficient and complex.
|
||||
Moving lease information out of shares and into a local database allows
|
||||
lease queries to stay synchronous.
|
||||
|
||||
Also, the current cryptographic protocol for renewing and cancelling leases
|
||||
(based on shared secrets derived from secure hash functions) is complex,
|
||||
and the cancellation part was never used.
|
||||
|
||||
The leasedb solves the first two problems by storing the lease information in
|
||||
a local database instead of in the share container files. The share data
|
||||
itself is still held in the share container file.
|
||||
|
||||
At the same time as implementing leasedb, we devised a simpler protocol for
|
||||
allocating and cancelling leases: a client can use a public key digital
|
||||
signature to authenticate access to a foolscap object representing the
|
||||
authority of an account. This protocol is not yet implemented; at the time
|
||||
of writing, only an "anonymous" account is supported.
|
||||
|
||||
The leasedb also provides an efficient way to get summarized information,
|
||||
such as total space usage of shares leased by an account, for accounting
|
||||
purposes.
|
||||
|
||||
.. _`#1528`: https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1528
|
||||
|
||||
|
||||
Design constraints
|
||||
------------------
|
||||
|
||||
A share is stored as a collection of objects. The persistent storage may be
|
||||
remote from the server (for example, cloud storage).
|
||||
|
||||
Writing to the persistent store objects is in general not an atomic
|
||||
operation. So the leasedb also keeps track of which shares are in an
|
||||
inconsistent state because they have been partly written. (This may
|
||||
change in future when we implement a protocol to improve atomicity of
|
||||
updates to mutable shares.)
|
||||
|
||||
Leases are no longer stored in shares. The same share format is used as
|
||||
before, but the lease slots are ignored, and are cleared when rewriting a
|
||||
mutable share. The new design also does not use lease renewal or cancel
|
||||
secrets. (They are accepted as parameters in the storage protocol interfaces
|
||||
for backward compatibility, but are ignored. Cancel secrets were already
|
||||
ignored due to the fix for `#1528`_.)
|
||||
|
||||
The new design needs to be fail-safe in the sense that if the lease database
|
||||
is lost or corruption is detected, no share data will be lost (even though
|
||||
the metadata about leases held by particular accounts has been lost).
|
||||
|
||||
|
||||
Accounting crawler
|
||||
------------------
|
||||
|
||||
A "crawler" is a long-running process that visits share container files at a
|
||||
slow rate, so as not to overload the server by trying to visit all share
|
||||
container files one after another immediately.
|
||||
|
||||
The accounting crawler replaces the previous "lease crawler". It examines
|
||||
each share container file and compares it with the state of the leasedb, and
|
||||
may update the state of the share and/or the leasedb.
|
||||
|
||||
The accounting crawler may perform the following functions (but see ticket
|
||||
#1834 for a proposal to reduce the scope of its responsibility):
|
||||
|
||||
- Remove leases that are past their expiration time. (Currently, this is
|
||||
done automatically before deleting shares, but we plan to allow expiration
|
||||
to be performed separately for individual accounts in future.)
|
||||
|
||||
- Delete the objects containing unleased shares — that is, shares that have
|
||||
stable entries in the leasedb but no current leases (see below for the
|
||||
definition of "stable" entries).
|
||||
|
||||
- Discover shares that have been manually added to storage, via ``scp`` or
|
||||
some other out-of-band means.
|
||||
|
||||
- Discover shares that are present when a storage server is upgraded to
|
||||
a leasedb-supporting version from a previous version, and give them
|
||||
"starter leases".
|
||||
|
||||
- Recover from a situation where the leasedb is lost or detectably
|
||||
corrupted. This is handled in the same way as upgrading from a previous
|
||||
version.
|
||||
|
||||
- Detect shares that have unexpectedly disappeared from storage. The
|
||||
disappearance of a share is logged, and its entry and leases are removed
|
||||
from the leasedb.
|
||||
|
||||
|
||||
Accounts
|
||||
--------
|
||||
|
||||
An account holds leases for some subset of shares stored by a server. The
|
||||
leasedb schema can handle many distinct accounts, but for the time being we
|
||||
create only two accounts: an anonymous account and a starter account. The
|
||||
starter account is used for leases on shares discovered by the accounting
|
||||
crawler; the anonymous account is used for all other leases.
|
||||
|
||||
The leasedb has at most one lease entry per account per (storage_index,
|
||||
shnum) pair. This entry stores the times when the lease was last renewed and
|
||||
when it is set to expire (if the expiration policy does not force it to
|
||||
expire earlier), represented as Unix UTC-seconds-since-epoch timestamps.
|
||||
|
||||
For more on expiration policy, see `docs/garbage-collection.rst
|
||||
<../garbage-collection.rst>`__.
|
||||
|
||||
|
||||
Share states
|
||||
------------
|
||||
|
||||
The leasedb holds an explicit indicator of the state of each share.
|
||||
|
||||
The diagram and descriptions below give the possible values of the "state"
|
||||
indicator, what that value means, and transitions between states, for any
|
||||
(storage_index, shnum) pair on each server::
|
||||
|
||||
|
||||
# STATE_STABLE -------.
|
||||
# ^ | ^ | |
|
||||
# | v | | v
|
||||
# STATE_COMING | | STATE_GOING
|
||||
# ^ | | |
|
||||
# | | v |
|
||||
# '----- NONE <------'
|
||||
|
||||
|
||||
**NONE**: There is no entry in the ``shares`` table for this (storage_index,
|
||||
shnum) in this server's leasedb. This is the initial state.
|
||||
|
||||
**STATE_COMING**: The share is being created or (if a mutable share)
|
||||
updated. The store objects may have been at least partially written, but
|
||||
the storage server doesn't have confirmation that they have all been
|
||||
completely written.
|
||||
|
||||
**STATE_STABLE**: The store objects have been completely written and are
|
||||
not in the process of being modified or deleted by the storage server. (It
|
||||
could have been modified or deleted behind the back of the storage server,
|
||||
but if it has, the server has not noticed that yet.) The share may or may not
|
||||
be leased.
|
||||
|
||||
**STATE_GOING**: The share is being deleted.
|
||||
|
||||
State transitions
|
||||
-----------------
|
||||
|
||||
• **STATE_GOING** → **NONE**
|
||||
|
||||
trigger: The storage server gains confidence that all store objects for
|
||||
the share have been removed.
|
||||
|
||||
implementation:
|
||||
|
||||
1. Remove the entry in the leasedb.
|
||||
|
||||
• **STATE_STABLE** → **NONE**
|
||||
|
||||
trigger: The accounting crawler noticed that all the store objects for
|
||||
this share are gone.
|
||||
|
||||
implementation:
|
||||
|
||||
1. Remove the entry in the leasedb.
|
||||
|
||||
• **NONE** → **STATE_COMING**
|
||||
|
||||
triggers: A new share is being created, as explicitly signalled by a
|
||||
client invoking a creation command, *or* the accounting crawler discovers
|
||||
an incomplete share.
|
||||
|
||||
implementation:
|
||||
|
||||
1. Add an entry to the leasedb with **STATE_COMING**.
|
||||
|
||||
2. (In case of explicit creation) begin writing the store objects to hold
|
||||
the share.
|
||||
|
||||
• **STATE_STABLE** → **STATE_COMING**
|
||||
|
||||
trigger: A mutable share is being modified, as explicitly signalled by a
|
||||
client invoking a modification command.
|
||||
|
||||
implementation:
|
||||
|
||||
1. Add an entry to the leasedb with **STATE_COMING**.
|
||||
|
||||
2. Begin updating the store objects.
|
||||
|
||||
• **STATE_COMING** → **STATE_STABLE**
|
||||
|
||||
trigger: All store objects have been written.
|
||||
|
||||
implementation:
|
||||
|
||||
1. Change the state value of this entry in the leasedb from
|
||||
**STATE_COMING** to **STATE_STABLE**.
|
||||
|
||||
• **NONE** → **STATE_STABLE**
|
||||
|
||||
trigger: The accounting crawler discovers a complete share.
|
||||
|
||||
implementation:
|
||||
|
||||
1. Add an entry to the leasedb with **STATE_STABLE**.
|
||||
|
||||
• **STATE_STABLE** → **STATE_GOING**
|
||||
|
||||
trigger: The share should be deleted because it is unleased.
|
||||
|
||||
implementation:
|
||||
|
||||
1. Change the state value of this entry in the leasedb from
|
||||
**STATE_STABLE** to **STATE_GOING**.
|
||||
|
||||
2. Initiate removal of the store objects.
|
||||
|
||||
|
||||
The following constraints are needed to avoid race conditions:
|
||||
|
||||
- While a share is being deleted (entry in **STATE_GOING**), we do not accept
|
||||
any requests to recreate it. That would result in add and delete requests
|
||||
for store objects being sent concurrently, with undefined results.
|
||||
|
||||
- While a share is being added or modified (entry in **STATE_COMING**), we
|
||||
treat it as leased.
|
||||
|
||||
- Creation or modification requests for a given mutable share are serialized.
|
||||
|
||||
|
||||
Unresolved design issues
|
||||
------------------------
|
||||
|
||||
- What happens if a write to store objects for a new share fails
|
||||
permanently? If we delete the share entry, then the accounting crawler
|
||||
will eventually get to those store objects and see that their lengths
|
||||
are inconsistent with the length in the container header. This will cause
|
||||
the share to be treated as corrupted. Should we instead attempt to
|
||||
delete those objects immediately? If so, do we need a direct
|
||||
**STATE_COMING** → **STATE_GOING** transition to handle this case?
|
||||
|
||||
- What happens if only some store objects for a share disappear
|
||||
unexpectedly? This case is similar to only some objects having been
|
||||
written when we get an unrecoverable error during creation of a share, but
|
||||
perhaps we want to treat it differently in order to preserve information
|
||||
about the storage service having lost data.
|
||||
|
||||
- Does the leasedb need to track corrupted shares?
|
||||
|
||||
|
||||
Future directions
|
||||
-----------------
|
||||
|
||||
Clients will have key pairs identifying accounts, and will be able to add
|
||||
leases for a specific account. Various space usage policies can be defined.
|
||||
|
||||
Better migration tools ('tahoe storage export'?) will create export files
|
||||
that include both the share data and the lease data, and then an import tool
|
||||
will both put the share in the right place and update the recipient node's
|
||||
leasedb.
|
Loading…
x
Reference in New Issue
Block a user