Magic Folder design for remote-to-local sync
Scope
In this Objective we will design remote-to-local synchronization:
- How to efficiently determine which objects (files and directories) have to be downloaded in order to bring the current local filesystem into sync with the newly-discovered version of the remote filesystem.
- How to distinguish overwrites, in which the remote side was aware of your most recent version and overwrote it with a new version, from conflicts, in which the remote side was unaware of your most recent version when it published its new version. The latter needs to be raised to the user as an issue the user will have to resolve and the former must not bother the user.
- How to overwrite the (stale) local versions of those objects with the newly acquired objects, while preserving backed-up versions of those overwritten objects in case the user didn't want this overwrite and wants to recover the old version.
Tickets on the Tahoe-LAFS trac with the otf-magic-folder-objective4 keyword are within the scope of the remote-to-local synchronization design.
Glossary
Object: a file or directory.
DMD: distributed mutable directory.
Folder: an abstract directory that is synchronized between clients. (A folder is not the same as the directory corresponding to it on any particular client, nor is it the same as a DMD.)
Collective: the set of clients subscribed to a given Magic Folder.
Descendant: a direct or indirect child in a directory or folder tree.
Subfolder: a folder that is a descendant of a magic folder.
Subpath: the path from a magic folder to one of its descendants.
Write: a modification to a local filesystem object by a client.
Read: a read from a local filesystem object by a client.
Upload: an upload of a local object to the Tahoe-LAFS file store.
Download: a download from the Tahoe-LAFS file store to a local object.
Pending notification: a local filesystem change that has been detected but not yet processed.
Representing the Magic Folder in Tahoe-LAFS
Unlike the local case where we use inotify or ReadDirectoryChangesW to detect filesystem changes, we have no mechanism to register a monitor for changes to a Tahoe-LAFS directory. Therefore, we must periodically poll for changes.
An important constraint on the solution is Tahoe-LAFS' "write coordination directive", which prohibits concurrent writes by different storage clients to the same mutable object:
Tahoe does not provide locking of mutable files and directories. If there is more than one simultaneous attempt to change a mutable file or directory, then an UncoordinatedWriteError may result. This might, in rare cases, cause the file or directory contents to be accidentally deleted. The user is expected to ensure that there is at most one outstanding write or update request for a given file or directory at a time. One convenient way to accomplish this is to make a different file or directory for each person or process that wants to write.
Since it is a goal to allow multiple users to write to a Magic Folder, if the write coordination directive remains the same as above, then we will not be able to implement the Magic Folder as a single Tahoe-LAFS DMD. In general therefore, we will have multiple DMDs —spread across clients— that together represent the Magic Folder. Each client in a Magic Folder collective polls the other clients' DMDs in order to detect remote changes.
Six possible designs were considered for the representation of subfolders of the Magic Folder:
1. All subfolders written by a given Magic Folder client are collapsed into a single client DMD, containing immutable files. The child name of each file encodes the full subpath of that file relative to the Magic Folder.
2. The DMD tree under a client DMD is a direct copy of the folder tree written by that client to the Magic Folder. Not all subfolders have corresponding DMDs; only those to which that client has written files or child subfolders.
3. The directory tree under a client DMD is a tahoe backup structure containing immutable snapshots of the folder tree written by that client to the Magic Folder. As in design 2, only objects written by that client are present.
4. Each client DMD contains an eventually consistent mirror of all files and folders written by any Magic Folder client. Thus each client must also copy changes made by other Magic Folder clients to its own client DMD.
5. Each client DMD contains a tahoe backup structure containing immutable snapshots of all files and folders written by any Magic Folder client. Thus each client must also create another snapshot in its own client DMD when changes are made by another client. (It can potentially batch changes, subject to latency requirements.)
6. The write coordination problem is solved by implementing two-phase commit. Then, the representation consists of a single DMD tree which is written by all clients.
Here is a summary of advantages and disadvantages of each design:
+----------------------------+
|            Key             |
+=======+====================+
| ++    | major advantage    |
+-------+--------------------+
| +     | minor advantage    |
+-------+--------------------+
| ‒     | minor disadvantage |
+-------+--------------------+
| ‒ ‒   | major disadvantage |
+-------+--------------------+
| ‒ ‒ ‒ | showstopper        |
+-------+--------------------+
123456+: All designs have the property that a recursive add-lease operation starting from a collective directory containing all of the client DMDs will find all of the files and directories used in the Magic Folder representation. Therefore the representation is compatible with garbage collection, even when a pre-Magic-Folder client does the lease marking.
123456+: All designs avoid "breaking" pre-Magic-Folder clients that read a directory or file that is part of the representation.
456++: Only these designs allow a readcap to one of the client directories —or one of their subdirectories— to be directly shared with other Tahoe-LAFS clients (not necessarily Magic Folder clients), so that such a client sees all of the contents of the Magic Folder. Note that this was not a requirement of the OTF proposal, although it is useful.
135+: A Magic Folder client has only one mutable Tahoe-LAFS object to monitor per other client. This minimizes the communication bandwidth required for polling, or alternatively minimizes the latency achievable for a given polling bandwidth.
1236+: A client does not need to make changes to its own DMD that repeat changes that another Magic Folder client had previously made. This reduces write bandwidth and complexity.
1‒: If the Magic Folder has many subfolders, their files will all be collapsed into the same DMD, which could get quite large. In practice a single DMD can easily handle the number of files expected to be written by a client, so this is unlikely to be a significant issue.
123‒ ‒ ‒: In these designs, the set of files in a Magic Folder is represented as the union of the files in all client DMDs. However, when a file is modified by more than one client, it will be linked from multiple client DMDs. We therefore need a mechanism, such as a version number or a monotonically increasing timestamp, to determine which copy takes priority.
35‒ ‒: When a Magic Folder client detects a remote change, it must traverse an immutable directory structure to see what has changed. Completely unchanged subtrees will have the same URI, allowing some of this traversal to be shortcutted.
24‒ ‒ ‒: When a Magic Folder client detects a remote change, it must traverse a mutable directory structure to see what has changed. This is more complex and less efficient than traversing an immutable structure, because shortcutting is not possible (each DMD retains the same URI even if a descendant object has changed), and because the structure may change while it is being traversed. Also the traversal needs to be robust against cycles, which can only occur in mutable structures.
45‒ ‒: When a change occurs in one Magic Folder client, it will propagate to all the other clients. Each client will therefore see multiple representation changes for a single logical change to the Magic Folder contents, and must suppress the duplicates. This is particularly problematic for design 4 where it interacts with the preceding issue.
4‒ ‒ ‒, 5‒ ‒: There is the potential for client DMDs to get "out of sync" with each other, potentially for long periods if errors occur. Thus each client must be able to "repair" its client directory (and its subdirectory structure) concurrently with performing its own writes. This is a significant complexity burden and may introduce failure modes that could not otherwise happen.
6‒ ‒ ‒: While two-phase commit is a well-established protocol, its application to Tahoe-LAFS requires significant design work, and may still leave some corner cases of the write coordination problem unsolved.
+------------------------------------------------+------+------+------+------+------+------+
| Design Property                                |  1   |  2   |  3   |  4   |  5   |  6   |
+================================================+======+======+======+======+======+======+
| advantages                                     |      |      |      |      |      |      |
+------------------------------------------------+------+------+------+------+------+------+
| Compatible with garbage collection             | +    | +    | +    | +    | +    | +    |
+------------------------------------------------+------+------+------+------+------+------+
| Does not break old clients                     | +    | +    | +    | +    | +    | +    |
+------------------------------------------------+------+------+------+------+------+------+
| Allows direct sharing                          |      |      |      | ++   | ++   | ++   |
+------------------------------------------------+------+------+------+------+------+------+
| Efficient use of bandwidth                     | +    |      | +    |      | +    |      |
+------------------------------------------------+------+------+------+------+------+------+
| No repeated changes                            | +    | +    | +    |      |      | +    |
+------------------------------------------------+------+------+------+------+------+------+
| disadvantages                                  |  1   |  2   |  3   |  4   |  5   |  6   |
+------------------------------------------------+------+------+------+------+------+------+
| Can result in large DMDs                       | ‒    |      |      |      |      |      |
+------------------------------------------------+------+------+------+------+------+------+
| Need version number to determine priority      | ‒ ‒ ‒| ‒ ‒ ‒| ‒ ‒ ‒|      |      |      |
+------------------------------------------------+------+------+------+------+------+------+
| Must traverse immutable directory structure    |      |      | ‒ ‒  |      | ‒ ‒  |      |
+------------------------------------------------+------+------+------+------+------+------+
| Must traverse mutable directory structure      |      | ‒ ‒ ‒|      | ‒ ‒ ‒|      |      |
+------------------------------------------------+------+------+------+------+------+------+
| Must suppress duplicate representation changes |      |      |      | ‒ ‒  | ‒ ‒  |      |
+------------------------------------------------+------+------+------+------+------+------+
| "Out of sync" problem                          |      |      |      | ‒ ‒ ‒| ‒ ‒  |      |
+------------------------------------------------+------+------+------+------+------+------+
| Unsolved design problems                       |      |      |      |      |      | ‒ ‒ ‒|
+------------------------------------------------+------+------+------+------+------+------+
Evaluation of designs
Designs 2 and 3 have no significant advantages over design 1, while requiring higher polling bandwidth and greater complexity due to the need to create subdirectories. These designs were therefore rejected.
Design 4 was rejected due to the out-of-sync problem, which is severe and possibly unsolvable for mutable structures.
For design 5, the out-of-sync problem is still present but possibly solvable. However, design 5 is substantially more complex, less efficient in bandwidth/latency, and less scalable in number of clients and subfolders than design 1. It only gains over design 1 on the ability to share directory readcaps to the Magic Folder (or subfolders), which was not a requirement. It would be possible to implement this feature in future by switching to design 6.
For the time being, however, design 6 was considered out-of-scope for this project.
Therefore, design 1 was chosen. That is:
All subfolders written by a given Magic Folder client are collapsed into a single client DMD, containing immutable files. The child name of each file encodes the full subpath of that file relative to the Magic Folder.
Each directory entry in a DMD also stores a version number, so that the latest version of a file is well-defined when it has been modified by multiple clients.
To enable representing empty directories, a client that creates a directory should link a corresponding zero-length file in its DMD, at a name that ends with the encoded directory separator character.
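As a concrete illustration, here is a minimal sketch of one possible reversible encoding of subpaths into DMD child names; the particular escape characters ("@@" and "@_") are an assumption for illustration, not part of the design::

    # Sketch: encode a subpath as a single DMD child name, reversibly.
    # '/' is the abstract folder separator; '@' serves as the escape character.

    def path2childname(subpath):
        # e.g. "docs/readme.txt" -> "docs@_readme.txt"
        return subpath.replace("@", "@@").replace("/", "@_")

    def childname2path(childname):
        out, i = [], 0
        while i < len(childname):
            if childname[i] == "@":
                out.append("/" if childname[i + 1] == "_" else "@")
                i += 2
            else:
                out.append(childname[i])
                i += 1
        return "".join(out)

    # An empty directory "docs/old/" would then be represented by linking a
    # zero-length file at the child name "docs@_old@_".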
We want to enable dynamic configuration of the membership of a Magic Folder collective, without having to reconfigure or restart each client when another client joins. To support this, we have a single collective directory that links to all of the client DMDs, named by their client nicknames. If the collective directory is mutable, then it is possible to change its contents in order to add clients. Note that a client DMD should not be unlinked from the collective directory unless all of its files are first copied to some other client DMD.
A client needs to be able to write to its own DMD, and read from other DMDs. To be consistent with the Principle of Least Authority, each client's reference to its own DMD is a write capability, whereas its reference to the collective directory is a read capability. The latter transitively grants read access to all of the other client DMDs and the files linked from them, as required.
Design and implementation of the user interface for maintaining this DMD structure and configuration will be addressed in Objectives 5 and 6.
During operation, each client will poll for changes on other clients at a predetermined frequency. On each poll, it will reread the collective directory (to allow for added or removed clients), and then read each client DMD linked from it.
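For illustration, a single polling pass might look like the following sketch. It assumes a local Tahoe-LAFS node with the default web gateway port and uses the web API's ?t=json directory listing; the capability value and the comparison against local state are placeholders::

    import json
    from urllib.request import urlopen
    from urllib.parse import quote

    NODE_URL = "http://127.0.0.1:3456"
    COLLECTIVE_READCAP = "URI:DIR2-RO:..."  # this client's configured read cap

    def read_dir(cap):
        # The web API returns a directory's children as JSON.
        with urlopen("%s/uri/%s?t=json" % (NODE_URL, quote(cap, safe=""))) as f:
            nodetype, data = json.load(f)
        assert nodetype == "dirnode"
        return data["children"]

    def poll_once(my_nickname):
        # Reread the collective directory to pick up added/removed clients...
        for nickname, (kind, info) in read_dir(COLLECTIVE_READCAP).items():
            if nickname == my_nickname or kind != "dirnode":
                continue
            # ...then read each other client's DMD, looking for changed entries.
            for name, (ftype, entry) in read_dir(info["ro_uri"]).items():
                metadata = entry.get("metadata", {})
                # Compare (name, metadata.get("version"), entry.get("ro_uri"))
                # against local state here.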
"Hidden" files, and files with names matching the patterns used for backup, temporary, and conflicted files, will be ignored, i.e. not synchronized in either direction. A file is hidden if it has a filename beginning with "." (on any platform), or has the hidden or system attribute on Windows.
Conflict Detection and Resolution
The combination of local filesystems and distributed objects is an example of shared state concurrency, which is highly error-prone and can result in race conditions that are complex to analyze. Unfortunately we have no option but to use shared state in this situation.
We call the resulting design issues "dragons" (as in "Here be dragons"), which as a convenient mnemonic we have named after the classical Greek elements Earth, Fire, Air, and Water.
Note: all filenames used in the following sections are examples, and the filename patterns we use in the actual implementation may differ. The actual patterns will probably include timestamps, and for conflicted files, the nickname of the client that last changed the file.
Earth Dragons: Collisions between local filesystem operations and downloads
Write/download collisions
Suppose that Alice's Magic Folder client is about to write a version of foo that it has downloaded in response to a remote change.
The criteria for distinguishing overwrites from conflicts are described later in the Fire Dragons section. Suppose that the remote change has been initially classified as an overwrite. (As we will see, it may be reclassified in some circumstances.)
A write/download collision occurs when another program writes to foo in the local filesystem, concurrently with the new version being written by the Magic Folder client. We need to ensure that this does not cause data loss, as far as possible.
An important constraint on the design is that on Windows, it is not possible to rename a file to the same name as an existing file in that directory. Also, on Windows it may not be possible to delete or rename a file that has been opened by another process (depending on the sharing flags specified by that process). Therefore we need to consider carefully how to handle failure conditions.
In our proposed design, Alice's Magic Folder client follows this procedure for an overwrite in response to a remote change:
1. Write a temporary file, say .foo.tmp.
2. Use the procedure described in the Fire Dragons section to obtain an initial classification as an overwrite or a conflict. (This takes as input the last_downloaded_uri field from the directory entry of the changed foo.)
3. Set the mtime of the replacement file to be T seconds before the current local time.
4. Perform a "file replacement" operation (explained below) with backup filename foo.backup, replaced file foo, and replacement file .foo.tmp. If any step of this operation fails, reclassify as a conflict and stop.

To reclassify as a conflict, attempt to rename .foo.tmp to foo.conflicted, suppressing errors.
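The following sketch shows how these steps might fit together. classify_download, replace_file, and rename_to_conflicted are hypothetical helpers (file replacement is described next), and the value of T is an assumption::

    # Sketch of the overwrite procedure; helper functions are hypothetical.
    import os, time

    T = 10  # seconds; assumed value, comfortably above the timestamp granularity

    def write_downloaded_file(path, contents, remote_last_downloaded_uri):
        dirname, base = os.path.split(path)
        tmp = os.path.join(dirname, "." + base + ".tmp")
        with open(tmp, "wb") as f:                      # step 1: temporary file
            f.write(contents)
        cls = classify_download(path, remote_last_downloaded_uri)  # step 2
        backdated = time.time() - T
        os.utime(tmp, (backdated, backdated))           # step 3: backdate mtime
        if cls == "overwrite":
            try:
                replace_file(path, tmp)                 # step 4: file replacement
                return
            except OSError:
                pass                                    # reclassify as a conflict
        rename_to_conflicted(path, tmp)                 # suppressing errors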
The implementation of file replacement differs between Unix and Windows. On Unix, it can be implemented as follows:
- 4a. Set the permissions of the replacement file to be the same as the replaced file, bitwise-or'd with octal 600 (rw-------).
- 4b. Attempt to move the replaced file (foo) to the backup filename (foo.backup).
- 4c. Attempt to create a hard link at the replaced filename (foo) pointing to the replacement file (.foo.tmp).
- 4d. Attempt to unlink the replacement file (.foo.tmp), suppressing errors.
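A minimal sketch of steps 4a-4d in Python; error handling beyond what the steps specify is omitted::

    import os, stat

    def replace_file(replaced, replacement):
        backup = replaced + ".backup"
        mode = stat.S_IMODE(os.stat(replaced).st_mode)
        os.chmod(replacement, mode | 0o600)  # 4a: same permissions, or'd with 0o600
        os.rename(replaced, backup)          # 4b: move foo to foo.backup
        os.link(replacement, replaced)       # 4c: hard-link foo to the new contents;
                                             #     fails with EEXIST if foo reappeared
        try:
            os.unlink(replacement)           # 4d: remove .foo.tmp, suppressing errors
        except OSError:
            pass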
Note that, if there is no conflict, the entry for foo recorded in the magic folder db will reflect the mtime set in step 3. The link operation in step 4c will cause an IN_CREATE event for foo, but this will not trigger an upload, because the metadata recorded in the database entry will exactly match the metadata for the file's inode on disk. (The two hard links — foo and, while it still exists, .foo.tmp — share the same inode and therefore the same metadata.)
On Windows, file replacement can be implemented as a single call to the ReplaceFileW API (with the REPLACEFILE_IGNORE_MERGE_ERRORS flag).
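For illustration, the call could be made from Python via ctypes as in this sketch; REPLACEFILE_IGNORE_MERGE_ERRORS is the documented Win32 flag value, and the error handling shown is illustrative::

    import ctypes

    REPLACEFILE_IGNORE_MERGE_ERRORS = 0x00000002

    def replace_file_windows(replaced, replacement, backup):
        # BOOL ReplaceFileW(lpReplacedFileName, lpReplacementFileName,
        #                   lpBackupFileName, dwReplaceFlags, lpExclude, lpReserved)
        ok = ctypes.windll.kernel32.ReplaceFileW(
            replaced, replacement, backup,
            REPLACEFILE_IGNORE_MERGE_ERRORS, None, None)
        if not ok:
            # e.g. ERROR_UNABLE_TO_MOVE_REPLACEMENT_2; the caller reclassifies
            # the download as a conflict on failure.
            raise ctypes.WinError()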
As in the Unix case, the ReplaceFileW operation will cause a change notification for foo. The replaced foo has the same mtime as the replacement file, and so this notification will not trigger an unwanted upload.
To determine whether this procedure adequately protects against data loss, we need to consider what happens if another process attempts to update foo, for example by renaming foo.other to foo. This requires us to analyze all possible interleavings between the operations performed by the Magic Folder client and the other process. (Note that atomic operations on a directory are totally ordered.) The set of possible interleavings differs between Windows and Unix.
On Unix, we have:
- Interleaving A: the other process' rename precedes our rename in step 4b, and we get an IN_MOVED_TO event for its rename by step 2. Then we reclassify as a conflict; its changes end up at foo and ours end up at foo.conflicted. This avoids data loss.
- Interleaving B: its rename precedes ours in step 4b, and we do not get an event for its rename by step 2. Its changes end up at foo.backup, and ours end up at foo after being linked there in step 4c. This avoids data loss.
- Interleaving C: its rename happens between our rename in step 4b and our link operation in step 4c of the file replacement. The latter fails with an EEXIST error because foo already exists. We reclassify as a conflict; the old version ends up at foo.backup, the other process' changes end up at foo, and ours at foo.conflicted. This avoids data loss.
- Interleaving D: its rename happens after our link in step 4c, and causes an IN_MOVED_TO event for foo. Its rename also changes the mtime for foo so that it is different from the mtime calculated in step 3, and therefore different from the metadata recorded for foo in the magic folder db. (Assuming no system clock changes, its rename will set an mtime timestamp corresponding to a time after step 4c, which is not equal to the timestamp T seconds before step 4a, provided that T is sufficiently greater than the timestamp granularity.) Therefore, an upload will be triggered for foo after its change, which is correct and avoids data loss.
On Windows, the internal implementation of ReplaceFileW is similar to what we have described above for Unix; it works like this:

- 4a′. Copy metadata (which does not include mtime) from the replaced file (foo) to the replacement file (.foo.tmp).
- 4b′. Attempt to move the replaced file (foo) onto the backup filename (foo.backup), deleting the latter if it already exists.
- 4c′. Attempt to move the replacement file (.foo.tmp) to the replaced filename (foo); fail if the destination already exists.

Notice that this is essentially the same as the algorithm we use for Unix, but steps 4c and 4d on Unix are combined into a single step 4c′. (If there is a failure at step 4c′ after step 4b′ has completed, the ReplaceFileW call will fail with return code ERROR_UNABLE_TO_MOVE_REPLACEMENT_2. However, it is still preferable to use this API over two MoveFileExW calls, because it retains the attributes and ACLs of foo where possible.)
However, on Windows the other application will not be able to directly rename foo.other onto foo (which would fail because the destination already exists); it will have to rename or delete foo first. Without loss of generality, let's say foo is deleted. This complicates the interleaving analysis, because we have two operations done by the other process interleaving with three done by the magic folder process (rather than one operation interleaving with four as on Unix). The cases are:
- Interleaving A′: the other process' deletion of foo and its rename of foo.other to foo both precede our rename in step 4b. We get an event corresponding to its rename by step 2. Then we reclassify as a conflict; its changes end up at foo and ours end up at foo.conflicted. This avoids data loss.
- Interleaving B′: the other process' deletion of foo and its rename of foo.other to foo both precede our rename in step 4b. We do not get an event for its rename by step 2. Its changes end up at foo.backup, and ours end up at foo after being moved there in step 4c′. This avoids data loss.
- Interleaving C′: the other process' deletion of foo precedes our rename of foo to foo.backup done by ReplaceFileW, but its rename of foo.other to foo does not, so we get an ERROR_FILE_NOT_FOUND error from ReplaceFileW indicating that the replaced file does not exist. Then we reclassify as a conflict; the other process' changes end up at foo (after it has renamed foo.other to foo) and our changes end up at foo.conflicted. This avoids data loss.
- Interleaving D′: the other process' deletion and/or rename happen during the call to ReplaceFileW, causing the latter to fail. There are two subcases:
  - if the error is ERROR_UNABLE_TO_MOVE_REPLACEMENT_2, then foo is renamed to foo.backup and .foo.tmp remains at its original name after the call;
  - for all other errors, foo and .foo.tmp both remain at their original names after the call.
  In both subcases, we reclassify as a conflict and rename .foo.tmp to foo.conflicted. This avoids data loss.
- Interleaving E′: the other process' deletion of foo and attempt to rename foo.other to foo both happen after all internal operations of ReplaceFileW have completed. This causes deletion and rename events for foo (which will in practice be merged due to the pending delay, although we don't rely on that for correctness). The rename also changes the mtime for foo so that it is different from the mtime calculated in step 3, and therefore different from the metadata recorded for foo in the magic folder db. (Assuming no system clock changes, its rename will set an mtime timestamp corresponding to a time after the internal operations of ReplaceFileW have completed, which is not equal to the timestamp T seconds before ReplaceFileW is called, provided that T is sufficiently greater than the timestamp granularity.) Therefore, an upload will be triggered for foo after its change, which is correct and avoids data loss.
We also need to consider what happens if another process opens foo and writes to it directly, rather than renaming another file onto it:

- On Unix, open file handles refer to inodes, not paths. If the other process opens foo before it has been renamed to foo.backup, and then closes the file, changes will have been written to the file at the same inode, even if that inode is now linked at foo.backup. This avoids data loss.
- On Windows, we have two subcases, depending on whether the sharing flags specified by the other process when it opened its file handle included FILE_SHARE_DELETE. (This flag covers both deletion and rename operations.)
  - If the sharing flags do not allow deletion/renaming, the ReplaceFileW operation will fail without renaming foo. In this case we will end up with foo changed by the other process, and the downloaded file still in .foo.tmp. This avoids data loss.
  - If the sharing flags do allow deletion/renaming, then data loss or corruption may occur. This is unavoidable and can be attributed to the other process making a poor choice of sharing flags (either explicitly if it used CreateFile, or via whichever higher-level API it used).
Note that it is possible that another process tries to open the file between steps 4b and 4c (or 4b′ and 4c′ on Windows). In this case the open will fail because foo does not exist. Nevertheless, no data will be lost, and in many cases the user will be able to retry the operation.
Above we only described the case where the download was initially classified as an overwrite. If it was classed as a conflict, the procedure is the same except that we choose a unique filename for the conflicted file (say, foo.conflicted_unique). We write the new contents to .foo.tmp and then rename it to foo.conflicted_unique in such a way that the rename will fail if the destination already exists. (On Windows this is a simple rename; on Unix it can be implemented as a link operation followed by an unlink, similar to steps 4c and 4d above.) If this fails because another process wrote foo.conflicted_unique after we chose the filename, then we retry with a different filename.
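A sketch of the Unix variant of this rename-with-retry; the numbered naming scheme for the unique conflicted filename is illustrative only::

    import errno, os

    def move_to_unique_conflicted(tmp_path, base_path):
        n = 0
        while True:
            candidate = "%s.conflicted_%d" % (base_path, n)
            try:
                os.link(tmp_path, candidate)  # fails with EEXIST if the name is taken
            except OSError as e:
                if e.errno != errno.EEXIST:
                    raise
                n += 1                        # another process took this name; retry
                continue
            try:
                os.unlink(tmp_path)           # complete the "rename", suppress errors
            except OSError:
                pass
            return candidate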
Read/download collisions
A read/download collision occurs when another program reads from foo in the local filesystem, concurrently with the new version being written by the Magic Folder client. We want to ensure that any successful attempt to read the file by the other program obtains a consistent view of its contents.
On Unix, the above procedure for writing downloads is sufficient to achieve this. There are three cases:
- A. The other process opens foo for reading before it is renamed to foo.backup. Then the file handle will continue to refer to the old file across the rename, and the other process will read the old contents.
- B. The other process attempts to open foo after it has been renamed to foo.backup, and before it is linked in step 4c. The open call fails, which is acceptable.
- C. The other process opens foo after it has been linked to the new file. Then it will read the new contents.
On Windows, the analysis is very similar, but case A′ needs to be split into two subcases, depending on the sharing mode the other process uses when opening the file for reading:

- A′. The other process opens foo before the Magic Folder client's attempt to rename foo to foo.backup (as part of the implementation of ReplaceFileW). The subcases are:
  - The other process uses sharing flags that deny deletion and renames. The ReplaceFileW call fails, and the download is reclassified as a conflict. The downloaded file ends up at foo.conflicted, which is correct.
  - The other process uses sharing flags that allow deletion and renames. The ReplaceFileW call succeeds, and the other process reads inconsistent data. This can be attributed to a poor choice of sharing flags by the other process.
- B′. The other process attempts to open foo at the point during the ReplaceFileW call where it does not exist. The open call fails, which is acceptable.
- C′. The other process opens foo after it has been linked to the new file. Then it will read the new contents.
For both write/download and read/download collisions, we have considered only interleavings with a single other process, and only the most common possibilities for the other process' interaction with the file. If multiple other processes are involved, or if a process performs operations other than those considered, then we cannot say much about the outcome in general; however, we believe that such cases will be much less common.
Fire Dragons: Distinguishing conflicts from overwrites
When synchronizing a file that has changed remotely, the Magic Folder client needs to distinguish between overwrites, in which the remote side was aware of your most recent version and overwrote it with a new version, and conflicts, in which the remote side was unaware of your most recent version when it published its new version. Those two cases have to be handled differently — the latter needs to be raised to the user as an issue the user will have to resolve and the former must not bother the user.
For example, suppose that Alice's Magic Folder client sees a change to foo in Bob's DMD. If the version it downloads from Bob's DMD is "based on" the version currently in Alice's local filesystem at the time Alice's client attempts to write the downloaded file, then it is an overwrite. Otherwise it is initially classified as a conflict.
This initial classification is used by the procedure for writing a file described in the Earth Dragons section above. As explained in that section, we may reclassify an overwrite as a conflict if an error occurs during the write procedure.
In order to implement this policy, we need to specify how the "based on" relation between file versions is recorded and updated.
We propose to record this information:
- in the magic folder db, for local files;
- in the Tahoe-LAFS directory metadata, for files stored in the Magic Folder.
In the magic folder db we will add a last-downloaded record, consisting of last_downloaded_uri and last_downloaded_timestamp fields, for each path stored in the database. Whenever a Magic Folder client downloads a file, it stores the downloaded version's URI and the current local timestamp in this record. Since only immutable files are used, the URI will be an immutable file URI, which is deterministically and uniquely derived from the file contents and the Tahoe-LAFS node's convergence secret.
(Note that the last-downloaded record is updated regardless of whether the download is an overwrite or a conflict. The rationale for this is to avoid "conflict loops" between clients, where every new version after the first conflict would be considered as another conflict.)
Later, in response to a local filesystem change at a given path, the Magic Folder client reads the last-downloaded record associated with that path (if any) from the database and then uploads the current file. When it links the uploaded file into its client DMD, it includes the last_downloaded_uri field in the metadata of the directory entry, overwriting any existing field of that name. If there was no last-downloaded record associated with the path, this field is omitted.
Note that the last_downloaded_uri field does not record the URI of the uploaded file (which would be redundant); it records the URI of the last download before the local change that caused the upload. The field will be absent if the file has never been downloaded by this client (i.e. if it was created on this client and no change by any other client has been detected).
A possible refinement also takes into account the last_downloaded_timestamp field from the magic folder db, and compares it to the timestamp of the change that caused the upload (which should be later, assuming no system clock changes). If the duration between these timestamps is very short, then we are uncertain about whether the process on Bob's system that wrote the local file could have taken into account the last download. We can use this information to be conservative about treating changes as conflicts. So, if the duration is less than a configured threshold, we omit the last_downloaded_uri field from the metadata. This will have the effect of making other clients treat this change as a conflict whenever they already have a copy of the file.
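A sketch of how the uploaded entry's metadata might be constructed under this refinement; the field name last_downloaded_uri follows the text, while the threshold value and record attributes are assumptions::

    UNCERTAINTY_THRESHOLD = 2.0  # seconds; assumed configurable

    def make_upload_metadata(record, change_timestamp, new_version):
        # record is the last-downloaded record for this path, or None.
        metadata = {"version": new_version}
        if record is not None:
            # Omit last_downloaded_uri if the local change followed the last
            # download too closely for us to be confident the writer saw it;
            # other clients will then conservatively treat the change as a
            # conflict.
            delta = change_timestamp - record.last_downloaded_timestamp
            if delta >= UNCERTAINTY_THRESHOLD:
                metadata["last_downloaded_uri"] = record.last_downloaded_uri
        return metadata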
Now we are ready to describe the algorithm for determining whether a download for the file foo is an overwrite or a conflict (refining step 2 of the procedure from the Earth Dragons section).
Let last_downloaded_uri be the field of that name obtained from the directory entry metadata for foo in Bob's DMD (this field may be absent). Then the algorithm is:
- 2a. If Alice has no local copy of foo, classify as an overwrite.
- 2b. Otherwise, "stat" foo to get its current statinfo (size in bytes, mtime, and ctime).
- 2c. Read the following information for the path foo from the local magic folder db:
  - the last-uploaded statinfo, if any (this is the size in bytes, mtime, and ctime stored in the local_files table when the file was last uploaded);
  - the filecap field of the caps table for this file, which is the URI under which the file was last uploaded. Call this last_uploaded_uri.
- 2d. If any of the following are true, then classify as a conflict:
  - there are pending notifications of changes to foo;
  - the last-uploaded statinfo is either absent, or different from the current statinfo;
  - either last_downloaded_uri or last_uploaded_uri (or both) are absent, or they are different.
  Otherwise, classify as an overwrite.
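Putting steps 2a-2d together, here is a sketch of the classification logic; the magic folder db accessor names are hypothetical::

    import os

    def classify(path, remote_last_downloaded_uri, db):
        if not os.path.exists(path):                 # 2a: no local copy
            return "overwrite"
        s = os.stat(path)                            # 2b: current statinfo
        current_statinfo = (s.st_size, s.st_mtime, s.st_ctime)
        last_uploaded_statinfo = db.get_last_uploaded_statinfo(path)  # 2c
        last_uploaded_uri = db.get_last_uploaded_uri(path)
        if (db.has_pending_notifications(path)       # 2d: any of these => conflict
                or last_uploaded_statinfo is None
                or last_uploaded_statinfo != current_statinfo
                or remote_last_downloaded_uri is None
                or last_uploaded_uri is None
                or remote_last_downloaded_uri != last_uploaded_uri):
            return "conflict"
        return "overwrite"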
Air Dragons: Collisions between local writes and uploads
Short of filesystem-specific features on Unix or the shadow copy service on Windows (which is per-volume and therefore difficult to use in this context), there is no way to read the whole contents of a file atomically. Therefore, when we read a file in order to upload it, we may read an inconsistent version if it was also being written locally.
A well-behaved application can avoid this problem for its writes:
- On Unix, if another process modifies a file by renaming a temporary file onto it, then we will consistently read either the old contents or the new contents.
- On Windows, if the other process uses sharing flags to deny reads while it is writing a file, then we will consistently read either the old contents or the new contents, unless a sharing error occurs. In the case of a sharing error we should retry later, up to a maximum number of retries.
In the case of a not-so-well-behaved application writing to a file at the same time we read from it, the magic folder will still be eventually consistent, but inconsistent versions may be visible to other users' clients.
In Objective 2 we implemented a delay, called the pending delay, after the notification of a filesystem change and before the file is read in order to upload it (Tahoe-LAFS ticket #1440). If another change notification occurs within the pending delay time, the delay is restarted. This helps to some extent, because it means that as long as each write completes more quickly than the pending delay, and writes occur less frequently than once per pending delay, we shouldn't encounter this inconsistency.
The likelihood of inconsistency could be further reduced, even for writes by not-so-well-behaved applications, by delaying the actual upload for a further period —called the stability delay— after the file has finished being read. If a notification occurs between the end of the pending delay and the end of the stability delay, then the read would be aborted and the notification requeued.
This would have the effect of ensuring that no write notifications have been received for the file during a time window that brackets the period when it was being read, with margin before and after this period defined by the pending and stability delays. The delays are intended to account for asynchronous notification of events, and caching in the filesystem.
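A sketch of how the two delays could interact, using Twisted (on which Tahoe-LAFS is built); the delay values and helper signatures are assumptions, and the stability delay itself is only proposed, not implemented::

    from twisted.internet import reactor

    PENDING_DELAY = 1.0    # restarted whenever another change notification arrives
    STABILITY_DELAY = 1.0  # proposed further delay after the file has been read

    class DelayedUploader(object):
        def __init__(self, path, read_file, upload):
            self._path, self._read_file, self._upload = path, read_file, upload
            self._call = None       # the currently scheduled reactor call, if any
            self._snapshot = None   # contents read but not yet uploaded

        def on_change_notification(self):
            # Any notification aborts a stability window in progress and
            # restarts the pending delay.
            self._snapshot = None
            if self._call is not None and self._call.active():
                self._call.cancel()
            self._call = reactor.callLater(PENDING_DELAY, self._read)

        def _read(self):
            self._snapshot = self._read_file(self._path)
            self._call = reactor.callLater(STABILITY_DELAY, self._maybe_upload)

        def _maybe_upload(self):
            if self._snapshot is not None:  # no notification during the window
                self._upload(self._path, self._snapshot)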
Note however that we cannot guarantee that the delays will be long enough to prevent inconsistency in any particular case. Also, the stability delay would potentially affect performance significantly because (unlike the pending delay) it is not overlapped when there are multiple files on the upload queue. This performance impact could be mitigated by uploading files in parallel where possible (Tahoe-LAFS ticket #1459).
We have not yet decided whether to implement the stability delay, and it is not planned to be implemented for the OTF objective 4 milestone. Ticket #2431 has been opened to track this idea.
Note that the situation of both a local process and the Magic Folder client reading a file at the same time cannot cause any inconsistency.
Water Dragons: Handling deletion and renames
Deletion of a file
When a file is deleted from the filesystem of a Magic Folder client, the most intuitive behavior is for it also to be deleted under that name from other clients. To avoid data loss, the other clients should actually rename their copies to a backup filename.
It would not be sufficient for a Magic Folder client that deletes a file to implement this simply by removing the directory entry from its DMD. Indeed, the entry may not exist in the client's DMD if it has never previously changed the file.
Instead, the client links a zero-length file into its DMD and sets deleted: true in the directory entry metadata. Other clients take this as a signal to rename their copies to the backup filename.
Note that the entry for this zero-length file has a version number as usual, and later versions may restore the file.
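For illustration, the metadata for such an entry might look like the following sketch; the exact version-increment rule is an assumption::

    def deletion_entry_metadata(previous_version):
        # Metadata for the zero-length file linked in place of the deleted file.
        # A later entry with a higher version and no "deleted" flag restores it.
        return {"version": previous_version + 1, "deleted": True}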
When a Magic Folder client restarts, we can detect files that had been downloaded but were deleted while it was not running, because their paths will have last-downloaded records in the magic folder db without any corresponding local file.
Deletion of a directory
Local filesystems (unlike a Tahoe-LAFS filesystem) normally cannot unlink a directory that has any remaining children. Therefore a Magic Folder client cannot delete local copies of directories in general, because they will typically contain backup files. This must be done manually on each client if desired.
Nevertheless, a Magic Folder client that deletes a directory should set deleted: true on the metadata entry for the corresponding zero-length file. This avoids the directory being recreated after it has been manually deleted from a client.
Renaming
It is sufficient to handle renaming of a file by treating it as a deletion and an addition under the new name.
This also applies to directories, although users may find the resulting behavior unintuitive: all of the files under the old name will be renamed to backup filenames, and a new directory structure created under the new name. We believe this is the best that can be done without imposing unreasonable implementation complexity.
Summary
This completes the design of remote-to-local synchronization. We realize that it may seem very complicated. Anecdotally, proprietary filesystem synchronization designs we are aware of, such as Dropbox, are said to incur similar or greater design complexity.