= PRELIMINARY =

This document is a description of a feature which is not yet implemented,
added here to solicit feedback and to describe future plans. This document
is subject to revision or withdrawal at any moment. Until this notice is
removed, consider this entire document to be a figment of your imagination.

= The Tahoe BackupDB =

To speed up backup operations, Tahoe maintains a small database known as the
"backupdb". This is used to avoid re-uploading files which have already been
uploaded recently.

This database lives in ~/.tahoe/private/backupdb.sqlite, and is a SQLite
single-file database. It is used by the "tahoe backup" command, and by the
"tahoe cp" command when the --use-backupdb option is included.

The purpose of this database is specifically to manage the file-to-cap
translation (the "upload" step). It does not address directory updates.

The overall goal of optimizing backup is to reduce the work required when the
source disk has not changed since the last backup. In the ideal case, running
"tahoe backup" twice in a row, with no intervening changes to the disk, will
not require any network traffic.

This database is optional. If it is deleted, the worst effect is that a
subsequent backup operation may use more effort (network bandwidth, CPU
cycles, and disk IO) than it would have without the backupdb.

== Schema ==

The database contains the following tables:

{{{
CREATE TABLE version
(
 version integer  # contains one row, set to 0
);

CREATE TABLE last_upload
(
 path varchar(1024),  # index, this is os.path.abspath(fn)
 size integer,        # os.stat(fn)[stat.ST_SIZE]
 mtime number,        # os.stat(fn)[stat.ST_MTIME]
 fileid integer
);

CREATE TABLE caps
(
 fileid integer PRIMARY KEY AUTOINCREMENT,
 filecap varchar(256),  # URI:CHK:...
 last_uploaded timestamp,
 last_checked timestamp
);

CREATE TABLE keys_to_files
(
 readkey varchar(256) PRIMARY KEY,  # index, AES key portion of filecap
 fileid integer
);
}}}

Notes: if we extend the backupdb to assist with directory maintenance (see
below), we may need paths in multiple places, so it would make sense to
create a table for them, and change the last_upload table to refer to a
pathid instead of an absolute path:

{{{
CREATE TABLE paths
(
 path varchar(1024),  # index
 pathid integer PRIMARY KEY AUTOINCREMENT
);
}}}

== Operation ==

The upload process starts with a pathname (like ~/.emacs) and wants to end up
with a file-cap (like URI:CHK:...).

The first step is to convert the path to an absolute form
(/home/warner/.emacs) and do a lookup in the last_upload table. If the path
is not present in this table, the file must be uploaded. The upload process
is (a sketch of the database bookkeeping follows the list):

 1. record the file's size and modification time
 2. upload the file into the grid, obtaining an immutable file read-cap
 3. add an entry to the 'caps' table, with the read-cap, and the current time
 4. extract the read-key from the read-cap, add an entry to 'keys_to_files'
 5. add an entry to 'last_upload'
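The database half of these steps might look roughly like the following
sketch, using Python's standard sqlite3 module. The function name
record_upload is hypothetical, and the filecap and readkey values are assumed
to come from the actual upload in step 2, which is not shown here:

{{{
import os, stat, time

def record_upload(db, abspath, filecap, readkey):
    # 'db' is a sqlite3 connection to backupdb.sqlite; 'filecap' and
    # 'readkey' come from the grid upload (step 2), which is not shown.
    s = os.stat(abspath)
    size, mtime = s[stat.ST_SIZE], s[stat.ST_MTIME]   # step 1
    now = time.time()
    cur = db.cursor()
    # step 3: remember the read-cap and when we uploaded/checked it
    cur.execute("INSERT INTO caps (filecap, last_uploaded, last_checked)"
                " VALUES (?,?,?)", (filecap, now, now))
    fileid = cur.lastrowid   # AUTOINCREMENT primary key of the new 'caps' row
    # step 4: map the read-key back to this fileid
    cur.execute("INSERT INTO keys_to_files (readkey, fileid) VALUES (?,?)",
                (readkey, fileid))
    # step 5: remember the path/size/mtime we just uploaded
    cur.execute("INSERT INTO last_upload (path, size, mtime, fileid)"
                " VALUES (?,?,?,?)", (abspath, size, mtime, fileid))
    db.commit()
}}}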
If the path *is* present in 'last_upload', the easy-to-compute identifying
information is compared: file size and modification time. If these differ,
the file may need to be uploaded again: the row is removed from the
last_upload table, and the steps described next are followed. If only the
mtime differs, the file may have changed; if the size differs, the file has
certainly changed.

In that case the client will compute the CHK read-key for the file by hashing
its contents, using exactly the same algorithm as the node does when it
uploads a file (including ~/.tahoe/private/convergence). It then checks the
'keys_to_files' table to see if this file has been uploaded before: perhaps
the file was moved from elsewhere on the disk. If no match is found, the file
must be uploaded, so the upload process above is followed.

If the read-key *is* found in the 'keys_to_files' table, then the file has
been uploaded before, but we should consider performing a file check / verify
operation to make sure we can skip a new upload. The fileid is used to
retrieve the entry from the 'caps' table, and the last_checked timestamp is
examined. If this timestamp is too old, a filecheck operation should be
performed, and the file repaired if the results are not satisfactory.

A "random early check" algorithm should be used, in which a check is
performed with a probability that increases with the age of the previous
results. E.g. files that were last checked within a month are not checked,
files that were checked 5 weeks ago are re-checked with 25% probability,
6 weeks with 50%, and files checked more than 8 weeks ago are always checked.
This reduces the "thundering herd" of filechecks-on-everything that would
otherwise result when a backup operation is run one month after the original
backup.

The readkey can be submitted to the upload operation, to remove a duplicate
hashing pass through the file and reduce the disk IO. In a future version of
the storage server protocol, this could also improve the "streamingness" of
the upload process.

If the file's size and mtime match, the file is considered to be unmodified,
and the last_checked timestamp from the 'caps' table is examined as above
(possibly resulting in a filecheck or repair). The --no-timestamps option
disables this check: this removes the danger of false-positives (i.e. not
uploading a new file, because it appeared to be the same as a previously
uploaded one), but increases the amount of disk IO that must be performed
(every byte of every file must be hashed to compute the readkey).
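For concreteness, the "random early check" decision described above could be
computed as in the following sketch. The exact length of a "month" is an
assumption here: four weeks is chosen so that the quoted 25%, 50%, and 100%
points fall at 5, 6, and 8 weeks respectively:

{{{
import random, time

# Assumption for this sketch: a "month" is four weeks.
MONTH = 28 * 24 * 60 * 60

def should_check(last_checked, now=None):
    # Never re-check within a month of the last check, then ramp linearly
    # up to always re-checking once the last check is two months old.
    if now is None:
        now = time.time()
    age = now - last_checked
    probability = min(max((age - MONTH) / float(MONTH), 0.0), 1.0)
    return random.random() < probability
}}}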
This algorithm is summarized in the following pseudocode:

{{{
def backup(path):
    abspath = os.path.abspath(path)
    # step 1: record the file's current size and modification time
    current_size = os.stat(abspath)[stat.ST_SIZE]
    current_mtime = os.stat(abspath)[stat.ST_MTIME]
    result = check_for_upload(abspath)
    now = time.time()
    if result == MUST_UPLOAD:
        filecap = upload(abspath, key=result.readkey)
        fileid = db("INSERT INTO caps (filecap, last_uploaded, last_checked)",
                    (filecap, now, now))
        db("INSERT INTO keys_to_files", (result.readkey, fileid))
        db("INSERT INTO last_upload",
           (abspath, current_size, current_mtime, fileid))
    if result in (MOVED, ALREADY_UPLOADED):
        age = now - result.last_checked
        probability = (age - 1*MONTH) / (1*MONTH)
        probability = min(max(probability, 0.0), 1.0)
        if random.random() < probability:
            do_filecheck(result.filecap)
    if result == MOVED:
        db("INSERT INTO last_upload",
           (abspath, current_size, current_mtime, result.fileid))

def check_for_upload(abspath):
    row = db("SELECT size, mtime, fileid FROM last_upload"
             " WHERE path == %s" % abspath)
    if not row:
        return check_moved(abspath)
    current_size = os.stat(abspath)[stat.ST_SIZE]
    current_mtime = os.stat(abspath)[stat.ST_MTIME]
    (last_size, last_mtime, last_fileid) = row
    if file_changed(current_size, last_size, current_mtime, last_mtime):
        db("DELETE FROM last_upload WHERE fileid=%s" % last_fileid)
        return check_moved(abspath)
    (filecap, last_checked) = db("SELECT filecap, last_checked FROM caps"
                                 " WHERE fileid == %s" % last_fileid)
    return ALREADY_UPLOADED(filecap=filecap, last_checked=last_checked)

def file_changed(current_size, last_size, current_mtime, last_mtime):
    if last_size != current_size:
        return True
    if NO_TIMESTAMPS:
        return True
    if last_mtime != current_mtime:
        return True
    return False

def check_moved(abspath):
    readkey = hash_with_convergence(abspath)
    fileid = db("SELECT fileid FROM keys_to_files WHERE readkey == %s" % readkey)
    if not fileid:
        return MUST_UPLOAD(readkey=readkey)
    (filecap, last_checked) = db("SELECT filecap, last_checked FROM caps"
                                 " WHERE fileid == %s" % fileid)
    return MOVED(fileid=fileid, filecap=filecap, last_checked=last_checked)

def do_filecheck(filecap):
    health = check(filecap)
    if health < DESIRED:
        repair(filecap)
}}}
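Since the schema listing above uses '#' annotations rather than SQL comments,
here is a sketch of how a new backupdb might be created with Python's
standard sqlite3 module. The create_backupdb function is illustrative, not
part of the design; the index on last_upload.path corresponds to the
"# index" annotation, and readkey already gets an index by virtue of being a
primary key:

{{{
import sqlite3

SCHEMA = """
CREATE TABLE version (version INTEGER);     -- contains one row, set to 0
CREATE TABLE last_upload
(
 path VARCHAR(1024),                        -- os.path.abspath(fn)
 size INTEGER,                              -- os.stat(fn)[stat.ST_SIZE]
 mtime NUMBER,                              -- os.stat(fn)[stat.ST_MTIME]
 fileid INTEGER
);
CREATE INDEX last_upload_path ON last_upload (path);
CREATE TABLE caps
(
 fileid INTEGER PRIMARY KEY AUTOINCREMENT,
 filecap VARCHAR(256),                      -- URI:CHK:...
 last_uploaded TIMESTAMP,
 last_checked TIMESTAMP
);
CREATE TABLE keys_to_files
(
 readkey VARCHAR(256) PRIMARY KEY,          -- AES key portion of filecap
 fileid INTEGER
);
"""

def create_backupdb(dbfile):
    db = sqlite3.connect(dbfile)
    db.executescript(SCHEMA)
    db.execute("INSERT INTO version (version) VALUES (0)")
    db.commit()
    return db
}}}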