i.e. change set_data() to accept lots of parameters, instead of taking
a single dictionary with lots of keys. Also Convert all CheckResults
creators to use it.
The goal is to make CheckResults more strongly typed, and remove the
ambiguous ".data" field in favor of a bunch of specific counters and
sharelists, so I can changes .sharemap and .servermap to use IServer
instances instead of string serverids. By cleaning this up first, I hope
to get that task done with less debugging.
The Fake*Node classes in test/common.py were accumulating share data in
a class-level dictionary, which persisted from one test run to the next.
As a result, running test_web.py over and over (with trial's
--until-failure feature) made this dictionary grow without bound,
eventually running out of memory.
This fix moves that dictionary into the FakeClient built fresh for each
test, so it doesn't build up. It does the same thing for "file_types",
which was much smaller but still lived at the class level.
Closes#1729
This stores IDisplayableServer-providing instances (StubServers or
NativeStorageServers) in the .servermap and .sharemap dictionaries. But
get_servermap()/get_sharemap() still return data structures with
serverids, not IServers, by translating their data on the way out. This
lets us put off changing the callers for a little bit longer.
IDisplayableServer includes just enough functionality to call
.get_name() and friends, which is all that the UploadResults really
need. IServer is a superset that includes actual share-manipulation
methods. StubServer instances provide only IDisplayableServer, while
actual NativeStorageServer instances provide the full IServer interface.
When the Helper sends a serverid (so we know what to call the server but
nothing else about it, and have no corresponding NativeStorageServer
object to reference), but we want to store an IDisplayableServer in the
UploadResults, we create a synthetic StubServer "server" and store that
instead.
Complete the getter-based transformation, by hiding ".uri" and updating
callers to use get_uri(). Also don't set a dummy self._uri, leave it
undefined until someone calls set_uri().
This hides attributes with e.g. _sharemap, and creates getters like
get_sharemap() to access them, for every field except .uri . This will
make it easier to modify the internal representation of .sharemap
without requiring callers to adjust quite yet.
".uri" has so many users that it seemed better to update it in a
subsequent patch.
Populate most of UploadResults (except .uri, which is learned later when
using a Helper) in the constructor, instead of allowing creators to
write to attributes later. This will help isolate the fields that we
want to change to use IServers.
This splits the pb.Copyable on-wire object (HelperUploadResults) out
from the local results object (UploadResults). To maintain compatibility
with older Helpers, we have to leave pb.Copyable classes alone and
unmodified, but we want to change UploadResults to use IServers instead
of serverids. So by using a different class on the wire, and translating
to/from it on either end, we can accomplish both.
This measured how long the Helper took to do a filecheck before asking
for ciphertext. The "Contacting Helper" report includes both
existence_check and the client-helper RTT.
For non-overlapping uploads, it was being returned correctly. But when
multiple upload requests overlapped, and the file was not already in the
grid, the filecheck would only run once, and its existence_check time
would be reported for all uploaders (even if they didn't have to wait
for that time). Cleaning that up proved too difficult: the only correct
place to report this time is from the initial remote_upload_chk() call,
but the return value of that is too constrained to accomodate it in the
needs-upload case.
So I'm removing it altogether. Eventually I plan to add a proper
events/times field and record more data, including this check, in a form
that can be drawn on a nice zoomable timeline view.
Old clients talking to a new Helper (which doesn't supply the value)
will tolerate the loss (they'll just display an empty field on the web
view).
Unlike set.union(), which returns a new set, DictOfSets.union() modified
the DictOfSets in-place. The name collision bit me when I changed some
code from using DictOfSets to a normal set, and expected that
set.union() would modify the set in-place. Since there was only one user
of DictOfSets.union, I figured it was safer to just get rid of it.
If a server did not respond to the pre-repair filecheck, but did respond
to the repair, that server was not correctly added to the
RepairResults.data["servers-responding"] list. (This resulted from a
buggy usage of DictOfSets.union() in filenode.py).
In addition, servers to which filecheck queries were sent, but did not
respond, were incorrectly added to the servers-responding list
anyawys. (This resulted from code in the checker.py not paying attention
to the 'responded' flag).
The first bug was neatly masked by the second: it's pretty rare to have
a server suddenly start responding in the one-second window between a
filecheck and a subsequent repair, and if the server was around for the
filecheck, you'd never notice the problem. I only spotted the smelly
code while I was changing it for IServer cleanup purposes.
I added coverage to test_repairer.py for this. Trying to get that test
to fail before fixing the first bug is what led me to discover the
second bug. I also had to update test_corrupt_file_verno, since it was
incorrectly asserting that 10 servers responded, when in fact one of
them throws an error (but the second bug was causing it to be reported
anyways).
Previously, test_runner sometimes fails because the _node_has_started()
poller fires after the portnum file has been opened, but before it has
actually been filled, allowing the test process to observe an empty file,
which flunks the test.
This adds a new fileutil.write_atomically() function (using the usual
write-to-.tmp-then-rename approach), and uses it for both node.url and
client.port . These files are written a bit before the node is really up and
running, but they're late enough for test_runner's purposes, which is to know
when it's safe to read client.port and use 'tahoe restart' (and therefore
SIGINT) to restart the node.
The current node/client code doesn't offer any better "are you really done
with startup" indicator.. the ideal approach would be to either watch the
logfile, or connect to its flogport, but both are a hassle. Changing the node
to write out a new "all done" file would be intrusive for regular
operations.
t=info contains randomly-generated ophandles, and t=rename-form contains the
name of the child being renamed, so neither is eligible for a
short-circuiting ETag. Enhanced test_web to exercise this. Had to improve
FakeCHKFileNode slightly to let it participate. Refs #443.