i.e. change set_data() to accept lots of parameters, instead of taking
a single dictionary with lots of keys. Also Convert all CheckResults
creators to use it.
The goal is to make CheckResults more strongly typed, and remove the
ambiguous ".data" field in favor of a bunch of specific counters and
sharelists, so I can changes .sharemap and .servermap to use IServer
instances instead of string serverids. By cleaning this up first, I hope
to get that task done with less debugging.
This stores IDisplayableServer-providing instances (StubServers or
NativeStorageServers) in the .servermap and .sharemap dictionaries. But
get_servermap()/get_sharemap() still return data structures with
serverids, not IServers, by translating their data on the way out. This
lets us put off changing the callers for a little bit longer.
Complete the getter-based transformation, by hiding ".uri" and updating
callers to use get_uri(). Also don't set a dummy self._uri, leave it
undefined until someone calls set_uri().
This hides attributes with e.g. _sharemap, and creates getters like
get_sharemap() to access them, for every field except .uri . This will
make it easier to modify the internal representation of .sharemap
without requiring callers to adjust quite yet.
".uri" has so many users that it seemed better to update it in a
subsequent patch.
Populate most of UploadResults (except .uri, which is learned later when
using a Helper) in the constructor, instead of allowing creators to
write to attributes later. This will help isolate the fields that we
want to change to use IServers.
This splits the pb.Copyable on-wire object (HelperUploadResults) out
from the local results object (UploadResults). To maintain compatibility
with older Helpers, we have to leave pb.Copyable classes alone and
unmodified, but we want to change UploadResults to use IServers instead
of serverids. So by using a different class on the wire, and translating
to/from it on either end, we can accomplish both.
This measured how long the Helper took to do a filecheck before asking
for ciphertext. The "Contacting Helper" report includes both
existence_check and the client-helper RTT.
For non-overlapping uploads, it was being returned correctly. But when
multiple upload requests overlapped, and the file was not already in the
grid, the filecheck would only run once, and its existence_check time
would be reported for all uploaders (even if they didn't have to wait
for that time). Cleaning that up proved too difficult: the only correct
place to report this time is from the initial remote_upload_chk() call,
but the return value of that is too constrained to accomodate it in the
needs-upload case.
So I'm removing it altogether. Eventually I plan to add a proper
events/times field and record more data, including this check, in a form
that can be drawn on a nice zoomable timeline view.
Old clients talking to a new Helper (which doesn't supply the value)
will tolerate the loss (they'll just display an empty field on the web
view).
If a server did not respond to the pre-repair filecheck, but did respond
to the repair, that server was not correctly added to the
RepairResults.data["servers-responding"] list. (This resulted from a
buggy usage of DictOfSets.union() in filenode.py).
In addition, servers to which filecheck queries were sent, but did not
respond, were incorrectly added to the servers-responding list
anyawys. (This resulted from code in the checker.py not paying attention
to the 'responded' flag).
The first bug was neatly masked by the second: it's pretty rare to have
a server suddenly start responding in the one-second window between a
filecheck and a subsequent repair, and if the server was around for the
filecheck, you'd never notice the problem. I only spotted the smelly
code while I was changing it for IServer cleanup purposes.
I added coverage to test_repairer.py for this. Trying to get that test
to fail before fixing the first bug is what led me to discover the
second bug. I also had to update test_corrupt_file_verno, since it was
incorrectly asserting that 10 servers responded, when in fact one of
them throws an error (but the second bug was causing it to be reported
anyways).
* use d3.js v2.4.6
* add a "toggle misc events" button, to get hash/bitmap-checking details
* only draw data that's on screen, for speed
* add fragment-arg to fetch timeline data.json from somewhere else
This consistently records all immutable uploads in the Recent Uploads And
Downloads page, regardless of code path. Previously, certain webapi upload
operations (like PUT /uri/$DIRCAP/newchildname) failed to pass the History
object and were left out.
Shares are still verified in parallel, but within a share, don't request a
block until the previous block has been verified and the memory we used to hold
it has been freed up.
Patch originally due to Brian. This version has a mockery-patchery-style test
which is "low tech" (it implements the patching inline in the test code instead
of using an extension of the mock.patch() function from the mock library) and
which unpatches in case of exception.
fixes#1395
This patch is a rebase of a patch originally written by Brian. I didn't change any of the intent of Brian's patch, just ported it to current trunk.
refs #1363
This patch was originally written by Brian, but was re-recorded by Zooko to use
darcs replace instead of hunks for any file in which it would result in fewer
total hunks.
refs #1363
This patch was written by Brian but was re-recorded by Zooko (with David-Sarah looking on) to use darcs replace instead of editing to rename the three variables to their new names.
refs #1363
When blocks terminate (either COMPLETE or CORRUPT/DEAD/BADSEGNUM), the
_shares_from_server dict was being popped incorrectly (using shnum as the
index instead of serverid). I'm still thinking through the consequences of
this bug. It was probably benign and really hard to detect. I think it would
cause us to incorrectly believe that we're pulling too many shares from a
server, and thus prefer a different server rather than asking for a second
share from the first server. The diversity code is intended to spread out the
number of shares simultaneously being requested from each server, but with
this bug, it might be spreading out the total number of shares requested at
all, not just simultaneously. (note that SegmentFetcher is scoped to a single
segment, so the effect doesn't last very long).
No behavioral changes, just updating variable/method names and log messages.
The effects outside these three files should be minimal: some exception
messages changed (to say "server" instead of "peer"), and some internal class
names were changed. A few things still use "peer" to minimize external
changes, like UploadResults.timings["peer_selection"] and
happinessutil.merge_peers, which can be changed later.
Pass around IServer instance instead of (peerid, rref) tuple. Replace
"descriptor" with "server". Other replacements:
get_all_servers -> get_connected_servers/get_known_servers
get_servers_for_index -> get_servers_for_psi (now returns IServers)
This change still needs to be pushed further down: lots of code is now
getting the IServer and then distributing (peerid, rref) internally.
Instead, it ought to distribute the IServer internally and delay
extracting a serverid or rref until the last moment.
no_network.py was updated to retain parallelism.
* repairer (really the uploader) reads beyond end of input file (Uploadable)
* new-downloader does not tolerate overreads
* uploader does lots of tiny reads (inefficient)
This fixes the last two. The uploader still does a single overread at the end
of the input file, but now that's ok so we can leave it in place. The
uploader now expects the Uploadable to behave like a normal disk
file (reading beyond EOF will return less data than was asked for), and now
the new-downloadable behaves that way.
fixes#1191
Patch by Brian. This patch description was actually written by Zooko, but I forged Brian's name on the "author" field so that he would get credit for this patch in revision control history.
deliver all shares at once instead of feeding them out one-at-a-time.
Also fix distribution of real-number-of-segments information: now all
CommonShares (not just the ones used for the first segment) get a
correctly-sized hashtree. Previously, the late ones might not, which would
make them crash and get dropped (causing the download to fail if the initial
set were insufficient, perhaps because one of their servers went away).
Update tests, add some TODO notes, improve variable names and comments.
Improve logging: add logparents, set more appropriate levels.
This avoids spamming the "recent uploads and downloads" /status page from
FileNode instances that were created for a directory read but which nobody is
ever going to read from. I also cleaned up the way DownloadStatus instances
are made to only ever do it in the CiphertextFileNode, not in the
higher-level plaintext FileNode. Also fixed DownloadStatus handling of read
size, thanks to David-Sarah for the catch.
seems to avoid the #1155 log message which reveals the URI (and filecap).
Also add an [ERROR] marker to the flog entry, since unregisterProducer also
makes interrupted downloads appear "200 OK"; this makes it more obvious that
the download did not complete.
The lost-progress bug occurred when two simultanous read() calls fetched
different segments, and the first one failed (due to corruption, or the other
bugs in #1154): the second read() would never complete. While in this state,
cancelling the second read by having its consumer call stopProducing) would
trigger the cancel-intolerance bug. Finally, in downloader.node.Cancel,
prevent late cancels by adding an 'active' flag
The Range header causes n.read() to be called with an offset= of type 'long',
which eventually got used in a Spans/DataSpans object's __len__ method.
Apparently python doesn't permit __len__() to return longs, only ints.
Rewrote Spans/DataSpans to use s.len() instead of len(s) aka s.__len__() .
Added a test in test_download. Note that test_web didn't catch this because
it uses mock FileNodes for speed: it's probably time to rewrite that.
There is still an unresolved error-recovery problem in #1154, so I'm not
closing the ticket quite yet.
The fixed 10-second timer will eventually be replaced with a per-server
value, calculated based on observed response times.
test_hung_server.py: enhance to exercise DYHB=OVERDUE state. Split existing
mutable+immutable tests into two pieces for clarity. Reenabled several tests.
Deleted the now-obsolete "test_failover_during_stage_4".
- Make some important utility functions clearer and more thoroughly
documented.
- Assert in upload.servers_of_happiness that the buckets attributes
of PeerTrackers passed to it are mutually disjoint.
- Get rid of some silly non-Pythonisms that I didn't see when I first
wrote these patches.
- Make sure that should_add_server returns true when queried about a
shnum that it doesn't know about yet.
- Change Tahoe2PeerSelector.preexisting_shares to map a shareid to a set
of peerids, alter dependencies to deal with that.
- Remove upload.should_add_servers, because it is no longer necessary
- Move upload.shares_of_happiness and upload.shares_by_server to a utility
file.
- Change some points in Tahoe2PeerSelector.
- Compute servers_of_happiness using a bipartite matching algorithm that
we know is optimal instead of an ad-hoc greedy algorithm that isn't.
- Change servers_of_happiness to just take a sharemap as an argument,
change its callers to merge existing_shares and used_peers before
calling it.
- Change an error message in the encoder to be more appropriate for
servers of happiness.
- Clarify the wording of an error message in immutable/upload.py
- Refactor a happiness failure message to happinessutil.py, and make
immutable/upload.py and immutable/encode.py use it.
- Move the word "only" as far to the right as possible in failure
messages.
- Use a better definition of progress during peer selection.
- Do read-only peer share detection queries in parallel, not sequentially.
- Clean up logging semantics; print the query statistics whenever an
upload is unsuccessful, not just in one case.
When I first implemented #778, I just altered the error messages to refer to
servers where they referred to shares. The resulting error messages weren't
very good. These are a bit better.
The Tahoe2PeerSelector returned either NoSharesError or NotEnoughSharesError
for a variety of error conditions that weren't informatively described by them.
This patch creates a new error, UploadHappinessError, replaces uses of
NoSharesError and NotEnoughSharesError with it, and alters the error message
raised with the errors to be more in line with the new servers_of_happiness
behavior. See ticket #834 for more information.
This can be useful if one of the ones that he has already begun downloading fails. See #287 for discussion. This fixes part of #287 which part was a regression caused by #928, namely this fixes fail-over in case a share is corrupted (or the server returns an error or disconnects). This does not fix the related issue mentioned in #287 if a server hangs and doesn't reply to requests for blocks.
This should put an end to the phenomenon I've been seeing that a single hung server can cause all downloads on a grid to hang. Also it should speed up all downloads by (a) not-waiting for responses to queries that it doesn't need, and (b) downloading shares from the servers which answered the initial query the fastest.
Also, do not count how many buckets you've gotten when deciding whether the download has enough shares or not -- instead count how many buckets to *unique* shares that you've gotten. This appears to improve a slightly weird behavior in the current download code in which receiving >= K different buckets all to the same sharenumber would make it think it had enough to download the file when in fact it hadn't.
This patch needs tests before it is actually ready for trunk.
allmydata.util.log.err() either takes a Failure as the first positional
argument, or takes no positional arguments and must be invoked in an
exception handler. Fixed its signature to match both foolscap.logging.log.err
and twisted.python.log.err . Included a brief unit test.
Stop checking separately for ConnectionDone/ConnectionLost, since those have
been folded into DeadReferenceError since foolscap-0.3.1 . Write
rrefutil.trap_deadref() in terms of rrefutil.trap_and_discard() to improve
code coverage.