the RIStatsProvider interface requires that counter and stat values be
ChoiceOf(float, int, long). the recent changes to the storage server to not
track 'consumed' led to returning None as the value of a counter.
this causes violations on nodes whose stats are being gathered.
this patch simply omits that stat if 'consumed' is not being tracked.
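A minimal sketch of the idea, not the actual storage server code. The function name `build_stats` and the stat key names are hypothetical, chosen only for illustration:

```python
def build_stats(allocated, consumed=None):
    """Return a stats dict, leaving out 'consumed' when it is not tracked.

    The key names are illustrative assumptions, not the real ones.
    """
    stats = {"storage_server.allocated": allocated}
    if consumed is not None:
        # Only report the stat when there is a real numeric value; a None
        # value would not satisfy a ChoiceOf(float, int, long) constraint.
        stats["storage_server.consumed"] = consumed
    return stats
```

With this shape, every value that reaches the remote interface is numeric, whether or not 'consumed' is tracked.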
one of the storage servers is throwing foolscap violations about the
return value of get_stats(). this adds a log of the data returned
to the foolscap log event stream at the debug level '12' (between
NOISY(10) and OPERATIONAL(20)). hopefully this will facilitate
finding the cause of this problem.
the timeouts on uses of 'poll' were there purely to make sure a test doesn't
poll indefinitely. however having such timeouts makes tests susceptible
to premature timeouts under high load, or on slow machines. (e.g. cygwin
slaves running in virtual machines on loaded hosts)
purportedly trial by default applies a timeout to tests to prevent them
hanging out indefinitely, so these poll timeouts are redundant and cause
intermittent failures on slow hosts. hence they're more bother than they're
worth, and should be culled.
previously there was an edge case in the timing of expected behaviour
of the key_generator (w.r.t. the refresh delay and twisted/foolscap
delivery). if it took >6s for a key to be generated, then it was
possible for the pool refresh delay to transpire _during_ the
synchronous creation of a key in remote_get_rsa_key_pair. this could
lead to the timer elapsing during key creation and hence the pool
being refilled before control returned to the client.
this change ensures that the time window from a get key request
until the key gen reactor blocks to refill the pool is the time
since a request was answered, not since a request was asked.
this causes the behaviour to match expectations, as embodied in
test_keygen, even if the delay window is dropped to 0.1s
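A minimal sketch of the timing change, assuming a plain class with an injectable clock rather than the real twisted timer. `KeyPoolSketch` and its method names are hypothetical:

```python
import time


class KeyPoolSketch:
    """Hypothetical stand-in for a key pool with a delayed refill."""

    def __init__(self, refresh_delay, clock=time.monotonic):
        self.refresh_delay = refresh_delay
        self.clock = clock
        self.last_answered = None  # time the most recent request was answered

    def get_key(self, make_key):
        key = make_key()  # may be slow (synchronous key generation)
        # Record the time *after* the key has been produced, so the delay
        # window starts when the request is answered, not when it was asked.
        self.last_answered = self.clock()
        return key

    def refill_due(self):
        if self.last_answered is None:
            return False
        return self.clock() - self.last_answered >= self.refresh_delay
```

Because the timestamp is taken after key creation, a key that takes 6s to generate cannot by itself use up a 0.1s delay window.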
in both these cases, the timeout only serves to abort a stuck test, and
the key_generator should respond more quickly, but seeing test failures
in buildbot on some platforms suggests that the test is too susceptible
to timing issues on loaded buildslaves.
this cleans up KeyGenerator to be a service (a subservice of the
KeyGeneratorService as instantiated by the key-generator.tac app).
this means that the timer which replenishes the keypool will be
shut down cleanly when the service is stopped.
adds checks on the key_generator service and client into the system
test 'test_mutable' such that one of the nodes (clients[3]) uses
the key_generator service, and checks that mutable file creations
in that node, via a variety of means, all consume keys from
the key_generator.
this adds a new service to pre-generate RSA key pairs. This allows
the expensive (i.e. slow) key generation to be placed into a process
outside the node, so that the node's reactor will not block when it
needs a key pair, but instead can retrieve them from a pool of already
generated key pairs in the key-generator service.
it adds a tahoe create-key-generator command which initialises an
empty dir with a tahoe-key-generator.tac file which can then be run
via twistd. it stashes its .pem and portnum for furl stability and
writes the furl of the key gen service to key_generator.furl, also
printing it to stdout.
by placing a key_generator.furl file into the node's config directory
(e.g. ~/.tahoe) a node will attempt to connect to such a service, and
will use that when creating mutable files (i.e. directories) whenever
possible. if the keygen service is unavailable, it will perform the
key generation locally instead, as before.
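The fallback behaviour can be sketched as follows. This is a synchronous simplification (the real node works with deferreds), and `obtain_key_pair`, `fetch_remote` and `generate_local` are hypothetical names:

```python
def obtain_key_pair(fetch_remote, generate_local):
    """Prefer the key-generator service, else generate the key locally.

    fetch_remote is None when no key_generator.furl was configured.
    """
    if fetch_remote is not None:
        try:
            return fetch_remote()
        except ConnectionError:
            # Service unavailable: fall through to local generation.
            pass
    return generate_local()
```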
When we establish any new connection, reset the delays on all the other
Reconnectors. This will trigger a new batch of connection attempts. The idea
is to detect when we (the client) have been offline for a while, and to
connect to all servers when we get back online. By accelerating the timers
inside the Reconnectors, we try to avoid spending a long time in a
partially-connected state (which increases the chances of causing problems
with mutable files, by not updating all the shares that we ought to).
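A minimal sketch of the reset, assuming a simple exponential backoff. The class, its constants and `connection_established` are hypothetical and are not foolscap's Reconnector API:

```python
class ReconnectorSketch:
    """Hypothetical reconnector with exponential backoff."""

    INITIAL_DELAY = 1.0
    MAX_DELAY = 3600.0

    def __init__(self, name):
        self.name = name
        self.delay = self.INITIAL_DELAY
        self.connected = False

    def attempt_failed(self):
        # Back off: double the delay before the next attempt, up to a cap.
        self.delay = min(self.delay * 2, self.MAX_DELAY)

    def reset(self):
        # Accelerate the timer so the next attempt happens soon.
        self.delay = self.INITIAL_DELAY


def connection_established(winner, reconnectors):
    """On any new connection, reset the delays on all other reconnectors."""
    winner.connected = True
    for r in reconnectors:
        if r is not winner:
            r.reset()
```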
not status instances. Fix this. The symptom was that following a link like
'up-123' that referred to an old operation (no longer in memory) while an
upload was active would get an ugly traceback instead of a "no such resource"
message.
when the confwiz configures a node (i.e. typically once on mac, once per
install on windows) in addition to writing the root_dir.cap retrieved from
the native_client backend into a config file, it additionally writes a hash
thereof into the 'convergence' config file.
this causes uploads from this node to use a consistent 'convergence' hashing
value matching any other nodes with the same configured root_dir, i.e. for
the most part other systems installed and configured on the same account.
This removes the guess-partial-information attack vector, and reduces
the amount of overhead that we consume with each file. It also introduces
a forwards-compatibility break: older versions of the code (before the
previous download-time "make hashes optional" patch) will be unable
to read files uploaded by this version, as they will complain about the
missing hashes. This patch is experimental, and is being pushed into
trunk to obtain test coverage. We may undo it before releasing 1.0.
Now upload or encode methods take a required argument named "convergence" which can be either None, indicating no convergent encryption at all, or a string, which is the "added secret" to be mixed in to the content hash key. If you want traditional convergent encryption behavior, set the added secret to be the empty string.
This patch also renames "content hash key" to "convergent encryption" in argument names and variable names. (A different and larger renaming is needed in order to clarify that Tahoe supports immutable files which are not encrypted with a content-hash-key, a.k.a. convergent encryption.)
This patch also changes a few unit tests to use non-convergent encryption, because it doesn't matter for what they are testing and non-convergent encryption is slightly faster.
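To illustrate the three cases for the "convergence" argument, here is a sketch under loud assumptions: the hash construction below is NOT Tahoe's actual one, `derive_key` is a hypothetical name, and bytes are used where the text says "string":

```python
import hashlib
import os


def derive_key(plaintext, convergence):
    """Illustrative only: this is not Tahoe's real key derivation."""
    if convergence is None:
        # No convergent encryption at all: use a fresh random key.
        return os.urandom(16)
    h = hashlib.sha256()
    # Length-prefix the added secret so secret and content cannot be confused.
    h.update(len(convergence).to_bytes(8, "big"))
    h.update(convergence)
    h.update(plaintext)
    return h.digest()[:16]
```

Passing `b""` gives the traditional behaviour (same content, same key everywhere), a non-empty secret gives convergence only among holders of that secret, and `None` gives a key unrelated to the content.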
Removing the plaintext hashes can help with the guess-partial-information
attack. This does not affect compatibility, but if and when we actually
remove any hashes from the share, that will introduce a
forwards-compatibility break: tahoe-0.9 will not be able to read such files.
this changes the confwiz to have a look and feel much more consistent
with that of the innosetup installer from within which it is launched.
this applies, naturally, primarily to windows.
added a test for the simple mkdir-p hack I added yesterday.
checks that mkdir-p can create a directory hierarchy, and that resubmitting
a request for the same path yields the existing dir's uri
this adds a t=mkdir-p call to directories (accessed by their uri as
/uri/<URI>?t=mkdir-p&path=/some/path) which returns the uri for a
directory at a specified path below the given uri, regardless of
whether the directory exists or whether intermediate directories
need to be created to satisfy the request.
this is used by the migration code in MV to optimise the work of
path traversal which was otherwise done on every file PUT.
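The mkdir-p semantics can be sketched over an in-memory tree. `DirSketch`, `mkdir_p` and the uri format are hypothetical and say nothing about how the real webapi stores directories:

```python
import itertools

_counter = itertools.count(1)


class DirSketch:
    """Hypothetical in-memory directory with a made-up uri."""

    def __init__(self):
        self.uri = "URI:DIR-SKETCH:%d" % next(_counter)
        self.children = {}


def mkdir_p(root, path):
    """Return the uri of the directory at path, creating parents as needed."""
    node = root
    for name in [seg for seg in path.split("/") if seg]:
        if name not in node.children:
            # Intermediate (or final) directory is missing: create it.
            node.children[name] = DirSketch()
        node = node.children[name]
    return node.uri
```

Resubmitting a request for the same path yields the existing directory's uri, which is the property the test described above checks.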
This is because there exist in the wild computers that are misconfigured so that 'localhost' doesn't resolve to 127.0.0.1. On those computers, using 'localhost' for the nodeurl is a security problem, because the user commonly sends valuable caps to the nodeurl.
motivated simply by a desire to be able to identify 'noderoot' directories for
debugging and testing, the confwiz now writes an 'accountname' file based on
what account was used when the node was configured. this is not currently read
by or used by any code in the system, but helps identify directories from testing.
1. changed the node's exit-on-error behaviour. rather than logging debug and
then delegating to self._abort_process(), instead simply delegate to
self._service_startup_failed(failure) to report failures in the startup deferred
chain. subclasses then have complete control of handling and reporting any
failures in node startup.
2. replace the convoluted wx.PostEvent() glue for posting an event into the
gui thread with the simpler expedient of wx.CallAfter() which is much like
foolscap's eventually() but also thread safe for inducing a call back on the
gui thread.
in certain cases (e.g. the node.pem changed but old .furls are in private/)
the node will abort upon startup. previously it used os.abort() which in these
cases caused the mac gui app to crash on startup with no explanation.
this changes that behaviour from calling os.abort() to calling
node._abort_process(failure) which by default calls os.abort(). this allows
that method to be overridden in subclasses.
the mac app now provides and uses such a subclass of Client, so that failures
are reported to the user in a message dialog before the process exits.
this uses wx.PostEvent() with a custom event type to signal from the reactor
thread into the gui thread.
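A minimal sketch of the override hook, assuming simplified stand-ins for the node classes. `NodeSketch`, `GuiNodeSketch` and `show_dialog` are hypothetical; only `_abort_process(failure)` and `os.abort()` come from the description above:

```python
import os


class NodeSketch:
    """Hypothetical node: startup failures are routed through a hook."""

    def startup_failed(self, failure):
        self._abort_process(failure)

    def _abort_process(self, failure):
        # Default behaviour, as before: abort the process.
        os.abort()


class GuiNodeSketch(NodeSketch):
    """Hypothetical subclass that reports the failure to the user first."""

    def __init__(self, show_dialog):
        self.show_dialog = show_dialog
        self.exited = False

    def _abort_process(self, failure):
        self.show_dialog("node failed to start: %s" % (failure,))
        self.exited = True  # a real app would exit the process here
```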
the confwiz now uses socket.gethostname() if a 'nickname' file doesn't already
exist, and passes that nickname into the 'record_install' method on the backend,
so that the moniker can be recorded in the system table.
when an operation takes 'too long', on 10.4 the user gets a dialog about
the problem with a 'force eject / keep trying' choice. on 10.5 the fuse
system seems to summarily unmount the drive.
this showed up in 10.5 testing because the time to open() a file depended
upon the size of the file, and an 8MB test file took long enough for the
node to download that the open() call didn't respond within 60s and fuse
spontaneously ejected the drive, quitting the plugin (and cancelling the
download).
this changes the fuse options passed to the plugin by the ui when the
'mount filesystem' window is used. command line users should check out
the '-odaemon_timeout=...' option. this changes the default timeout from
60s to 300s (5min) for ui launched plugins.
this will be addressed in a deeper manner at a later date, with a more
advanced fuse subsystem which can interleave open()/read() with the
actual download of the file, only blocking when data is not downloaded
yet.
Unfinished bits: doc in webapi.txt, test handling of badly formed JSON, return reasonable HTTP response, examination of the effect of this patch on code coverage -- but I'm committing it anyway because MikeB can use it and I'm being called to dinner...
the name 'tahoe' is in the process of being removed from the windows
installer and binaries. this changes the name of the smb service the
confwiz tries to start to 'Allmydata SMB'
this adds an action to the dock menu and to the file menu (when visible)
"Mount Filesystem". This action opens a window offering the user an
opportunity to select from any of the named *.cap files in their
.tahoe/private directory, and choose a corresponding mount point to mount
that at.
it launches the .app binary as a subprocess with the corresponding command
line arguments to launch the 'tahoe fuse' functionality to mount that file
system. if a NAME.icns file is present in .tahoe/private alongside the
chosen NAME.cap, then that icon will be used when the filesystem is mounted.
this is highly unlikely to work when running from source, since it uses
introspection on sys.executable to find the relevant binary to launch in
order to get the right built .app's 'tahoe fuse' functionality.
it is also relatively likely that the code currently checked in, hence
linked into the build, will have as yet unresolved library dependencies.
it's quite unlikely to work on 10.5 with macfuse 1.3.1 at the moment.
this provides a variety of changes to the macfuse 'tahoefuse' implementation.
most notably it extends the 'tahoe' command available through the mac build
to provide a 'fuse' subcommand, which invokes tahoefuse. this addresses
various aspects of main(argv) handling, sys.argv manipulation to provide an
appropriate command line syntax that meshes with the fuse library's built-
in command line parsing.
this provides a "tahoe fuse [dir_cap_name] [fuse_options] mountpoint"
command, where dir_cap_name is an optional name of a .cap file to be found
in ~/.tahoe/private defaulting to the standard root_dir.cap. fuse_options
if given are passed into the fuse system as its normal command line options
and the mountpoint is checked for existence before launching fuse.
the tahoe 'fuse' command is provided as an additional_command to the tahoe
runner in the case that it's launched from the mac .app binary.
this also includes a tweak to the TFS class which incorporates the ctime
and mtime of files into the tahoe fs model, if available.
runner provides the main point of entry for the 'tahoe' command, and
provides various subcommands by default. this provides a hook whereby
additional subcommands can be added in other contexts, providing a
simple way to extend the (sub)commands space available through 'tahoe'.
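The hook can be sketched as a command table that callers may extend. The names `run`, `DEFAULT_COMMANDS`, `additional_commands`, `cmd_version` and `cmd_fuse` are assumptions for illustration and not the runner's real signature:

```python
def cmd_version(args):
    return "version"


# Subcommands that are available by default (illustrative).
DEFAULT_COMMANDS = {"version": cmd_version}


def run(argv, additional_commands=None):
    """Dispatch argv[0] to a subcommand, including any extra ones supplied."""
    commands = dict(DEFAULT_COMMANDS)
    if additional_commands:
        # Callers (e.g. a platform-specific launcher) add their own commands.
        commands.update(additional_commands)
    if not argv or argv[0] not in commands:
        raise ValueError("unknown command")
    return commands[argv[0]](argv[1:])


def cmd_fuse(args):
    return "fuse " + " ".join(args)
```

A launcher could then call `run(["fuse", "/mnt/point"], {"fuse": cmd_fuse})` without the default table knowing anything about fuse.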
regardless of platform, the confwiz now opens the welcome page upon
writing a config. it also provides a 'plat' argument (from python's
sys.platform) to help disambiguate our instructions by platform.
adds command line option parsing to the confwiz.
the previous --uninstall option behaves as before, but is parsed
more explicitly with the twisted usage library.
added is a --server option, which controls which web site the
backend script for configuration is to be found on. (it is looked
for at /native_client.php on the given server) this option can be
used in conjunction with --uninstall to control where the uninstall
is recorded.
Options:
  -u, --uninstall  record uninstall
  -s, --server=    url of server to contact
                   [default: https://beta.allmydata.com/]
e.g. confwiz.py -s https://www-test.allmydata.com/
while investigating fuse related stuff, I spent quite a while staring at
very cryptic explosions I got from idlib. it turns out that unicode
objects and str objects have .translate() methods with differing signatures.
to save anyone else the headache, this makes it very clear if you accidentally
try to pass a unicode object in to a2b() etc.
base62 encoding fits more information into alphanumeric chars while avoiding the troublesome non-alphanumeric chars of base64 encoding. In particular, this allows us to work around the ext3 "32,000 entries in a directory" limit while retaining the convenient property that the intermediate directory names are leading prefixes of the storage index file names.
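A minimal sketch of a fixed-width base62 encoder. The alphabet order and the function name `base62_encode` are assumptions and may not match the encoding actually used in the code:

```python
import math
import string

# 10 digits + 26 uppercase + 26 lowercase = 62 alphanumeric characters.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase


def base62_encode(data):
    """Encode bytes as a fixed-width base62 string, most significant first."""
    # Number of base62 digits needed to hold any value of this byte length.
    width = math.ceil(len(data) * 8 / math.log2(62))
    num = int.from_bytes(data, "big")
    digits = []
    for _ in range(width):
        num, rem = divmod(num, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))
```

Since 62**2 == 3844, a directory named by the first two characters of such a name stays well under the 32,000 limit while remaining a leading prefix of the full name. The two-character choice is an assumption for the arithmetic, not something stated above.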
having moved initialisation into startService to handle tub init cleanly,
I neglected the up-call to startService, which wound up not starting the
load_monitor.
also I changed the 'running' attribute to 'started' since 'running' is
the name used internally by MultiService itself.
this adds an interface, IStatsProducer, defining the get_stats() method
which the stats provider calls upon any registered producer, and makes the
register_producer() method check that the interface is implemented.
also refine the startup logic, so that the stats provider doesn't try and
connect out to the stats gatherer until after the node declares the tub
'ready'. this is to address an issue whereby providers would attach to
the gatherer without providing a valid furl, and hence the gatherer would
be unable to determine the tubid of the connected client, leading to lost
samples.
The filesystem which gets my vote for most undeservedly popular is ext3, and it has a hard limit of 32,000 entries in a directory. Many other filesystems (even ones that I like more than I like ext3) have either hard limits or bad performance consequences or weird edge cases when you get too many entries in a single directory.
This patch makes it so that there is a layer of intermediate directories between the "shares" directory and the actual storage-index directory (the one whose name contains the entire storage index (z-base-32 encoded) and which contains one or more share files named by their share number).
The intermediate directories are named by the first 14 bits of the storage index, which means there are at most 16384 of them. (This also means that the intermediate directory names are not a leading prefix of the storage-index directory names -- to do that would have required us to have intermediate directories limited to either 1024 (2-char), which is too few, or 32768 (3-chars of a full 5 bits each), which would overrun ext3's funny hard limit of 32,000.)
This closes #150, and please see the "convertshares.py" script attached to #150 to convert your old tahoe-0.7.0 storage/shares directory into a new tahoe-0.8.0 storage/shares directory.
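The arithmetic behind the 14-bit choice can be sketched as below. `prefix_14_bits` is a hypothetical helper that returns the prefix as an integer; how the real code renders that prefix as a directory name is not shown here:

```python
def prefix_14_bits(storage_index):
    """Return the first 14 bits of a binary storage index as an integer."""
    if len(storage_index) < 2:
        raise ValueError("need at least two bytes of storage index")
    first16 = int.from_bytes(storage_index[:2], "big")
    # Drop the two low bits of the first 16 to keep only the leading 14.
    return first16 >> 2


# Directory counts for the alternatives discussed above:
TWO_CHARS = 2 ** 10    # 1024 (two z-base-32 chars of 5 bits each)
FOURTEEN = 2 ** 14     # 16384
THREE_CHARS = 2 ** 15  # 32768, which exceeds ext3's limit of 32,000
```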
We have a desire to collect runtime statistics from multiple nodes primarily
for server monitoring purposes. This is a simple implementation of
such a system, as a skeleton to build more sophistication upon.
Each client now looks for a 'stats_gatherer.furl' config file. If it has
been configured to use a stats gatherer, then it instantiates internally
a StatsProvider. This is a central place for code which wishes to offer
stats up for monitoring to report them to, either by calling
stats_provider.count('stat.name', value) to increment a counter, or by
registering a class as a stats producer with sp.register_producer(obj).
The StatsProvider connects to the StatsGatherer server and provides its
provider upon startup. The StatsGatherer is then responsible for polling
the attached providers periodically to retrieve the data provided.
The provider queries each registered producer when the gatherer queries
the provider. Both the internal 'counters' and the queried 'stats' are
then reported to the gatherer.
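The provider side can be sketched as follows. `count()` and `register_producer()` follow the description above, while the class name, the `get()` method and the shape of the returned dict are assumptions:

```python
class StatsProviderSketch:
    """Hypothetical provider: holds counters and polls producers on demand."""

    def __init__(self):
        self.counters = {}
        self.producers = []

    def count(self, name, delta=1):
        # Increment a named counter.
        self.counters[name] = self.counters.get(name, 0) + delta

    def register_producer(self, producer):
        # Stand-in for an interface check: require a callable get_stats().
        if not callable(getattr(producer, "get_stats", None)):
            raise TypeError("producer must provide get_stats()")
        self.producers.append(producer)

    def get(self):
        # Producers are queried only when the gatherer queries the provider.
        stats = {}
        for producer in self.producers:
            stats.update(producer.get_stats())
        return {"counters": dict(self.counters), "stats": stats}
```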
This provides a simple gatherer app, (c.f. make stats-gatherer-run)
which prints its furl and listens for incoming connections. Once a
minute, the gatherer polls all connected providers, and writes the
retrieved data into a pickle file.
Also included is a munin plugin which knows how to read the gatherer's
stats.pickle and output data munin can interpret. This plugin,
tahoe-stats.py, can be symlinked under multiple different names within
munin's 'plugins' directory, and inspects argv to determine which
data to display, doing a lookup in a table within that file.
It looks in the environment for 'statsfile' to determine the path to
the gatherer's stats.pickle. An example plugins-conf.d file is
provided.
fix the make-confwiz-match-installer-size changes, to eliminate some weird
layout/rendering bugs. also tweaked the layout slightly to add space between
the warning label and the newsletter subscribe checkbox.
this will write an arbitrary number of config files, instead of being restricted
to just the introducer.furl, based on the response of the php backend.
the get_config is passed username/password
Previously, once the node itself was launched, the UI event loop was no longer
running. This meant that the app would sit around seemingly 'wedged' and being
reported as 'Not Responding' by the os.
This changes that by actually implementing a wxPython gui which is left running
while the reactor, and the node within it, is launched in another thread.
Beyond 'quit' -> reactor.stop, there are no interactions between the threads.
The ui provides 'open web root' and 'open account page' actions, both in the
file menu, and in the (right click) dock icon menu.
Something weird in the handling of wxpython's per-frame menubar stuff seems to
mean that the menu bar only displays the file menu and about etc (i.e. the items
from the wx menubar) if the focus changes from and back to the app while the
frame the menubar belongs to is displayed. Hence a splash frame comes up at
startup to provide an opportunity.
It also seems that, in the case that the file menu is not available, that one
can induce it to reappear by choosing 'about' from the dock menu, and then
closing the about window.
this moves some of the code common to both windows and mac builds into the
allmydata module hierarchy, and cleans up the windows and mac build directories
to import the code from there.
using sibpath to find web template files relative to source code is functional
when running from source environments, but not especially flexible when running
from bundled built environments. the more 'orthodox' mechanism, pkg_resources,
in theory at least, knows how to find resource files in various environments.
this makes the 'web' directory in allmydata into an actual allmydata.web module
(since pkg_resources looks for files relative to a named module, and that module
must be importable) and uses pkg_resources.resource_filename to find the files
therein.
Using pkg_resources.require() like this also apparently allows people to install multiple different versions of packages on their system and tahoe (if pkg_resources is available to it) will import the version of the package that it requires. I haven't tested this feature.
in a discussion the other day, brian had asked me to try removing this fix, since
it leads to double-closing the reader. since on my windows box, the test failures
I'd experienced were related to the ConnectionLost exception problem, and this
close didn't seem to make a difference to test results, I agreed.
turns out that the buildbot's environment does fail without this fix, even with
the exception fix, as I'd kind of expected.
it makes sense, because the reader (specifically the file handle) must be closed
before it can be unlinked. at any rate, I'm reinstating this, in order to fix the
windows build
unlinking a file before closing it is not portable. it works on unix, but fails
on windows, since an open file holds a lock there.
this closes the reader before trying to unlink the encoding file within the
CHKUploadHelper.
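The portable ordering, sketched with a temporary file. `close_then_unlink` and `demo` are hypothetical helpers, not code from CHKUploadHelper:

```python
import os
import tempfile


def close_then_unlink(fileobj, path):
    """Close the handle first, then remove the file (portable ordering)."""
    fileobj.close()
    os.unlink(path)


def demo():
    # Create a scratch file, write to it, then dispose of it safely.
    fd, path = tempfile.mkstemp()
    f = os.fdopen(fd, "wb")
    f.write(b"x")
    close_then_unlink(f, path)
    return f.closed, os.path.exists(path)
```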
in trying to test my fix for the failure of the offloaded unit test on windows
(by closing the reader before unlinking the encoding file - which, perhaps
disturbingly doesn't actually make a difference in my windows environment)
I was unable to, because the unit test failed every time with a connection lost
error.
after much more time than I'd like to admit it took, I eventually managed to
track that down to a part of the unit test which is supposed to be dropping
a connection. it looks like the exceptions that get thrown on unix, or at
least all the specific environments brian tested in, for that dropped
connection are different from what is thrown on my box (which is running py2.4
and twisted 2.4.0, for reference). adding ConnectionLost to the list of
expected exceptions makes the test pass.
though curiously still my test logs a NotEnoughWritersError error, and I'm not
currently able to fathom why that exception isn't leading to any overall
failure of the unit test itself.
for general interest, a large part of the time spent trying to track this down
was lost to the state of logging. I added a whole bunch of logging to try
and track down where the tests were failing, but then spent a bunch of time
searching in vain for that log output. as far as I can tell at this point
the unit tests are themselves logging to foolscap's log module, but that isn't
being directed anywhere, so all the test's logging is being black holed.
use of twisted.python.util.sibpath to find files relative to modules doesn't
work when those modules are bundled into a library by py2exe. this provides
an alternative implementation (in allmydata.util.sibpath) which checks for
the existence of the file, and if it is not found, attempts to find it relative
to sys.executable instead.
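A sketch of the lookup order described above. The signature is an assumption (the `executable` parameter exists only to make the sketch testable) and may differ from the real allmydata.util.sibpath:

```python
import os
import sys


def sibpath(module_file, filename, executable=None):
    """Find filename next to module_file, else next to the executable."""
    candidate = os.path.join(
        os.path.dirname(os.path.abspath(module_file)), filename)
    if os.path.exists(candidate):
        return candidate
    # Not found beside the module (e.g. the module lives inside a bundled
    # library): look relative to the running executable instead.
    exe = executable if executable is not None else sys.executable
    return os.path.join(os.path.dirname(os.path.abspath(exe)), filename)
```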
adds a 'run' command to bin/tahoe / tahoe.exe
it loads a client node into the tahoe process itself,
running in the base dir specified by --basedir/-C and
defaulting to the current working dir.
it runs synchronously, and the tahoe process blocks until
the reactor is stopped.
this is probably not of very high utility in the unix case of bin/tahoe
but is useful when working with native builds, e.g. py2exe's tahoe.exe,
to examine and debug the runtime environment, linking problems etc.
a recent purge of the start.html code also took away the logic that wrote
'node.url' into the node root. this is required for the tahoe cli tool to
find the node. this puts back a limited fraction of that code, so that the
node writes out a node.url file upon startup.
taking the same arguments as tahoe ls, it does a webbrowser.open to the page
specified by those args. hence "tahoe webopen" will open a browser to the
root dir specified in private/root_dir.cap by default.
this might be a good alternative to the start.html page.
* rename my_private_dir.cap to root_dir.cap
* move it into the private subdir
* change the cmdline argument "--root-uri=[private]" to "--dir-uri=[root]"
Unfortunately although it passes the unit tests, it doesn't work, because the unit tests and the implementation use the "encode params into URL" technique but the button uses the "encode params into request body" technique.
Also allow an optional leading "http://127.0.0.1:8123/uri/".
Also fix a few unit tests to generate bogus Dirnode URIs of the modern form instead of the former form.
Hm... I refactored processing of segments in a way that I marked as "XXX HELP
I AM YUCKY", and then I ran out of time for rerefactoring it before I
committed. At least all the tests pass.
The underlying issue is recorded in #211: one corrupt share in a query
response will cause us to ignore the remaining shares in that response, even
if they are good. In our tests (with N=10 but only 5 peers), this can leave
us with too few shares to recover the file.
The temporary workaround is to use 10 peers, to make sure we never get
multiple shares per response. The real fix will be to fix the control flow.
This fixes #209.
* use new decentralized directories everywhere instead of old centralized directories
* provide UI to them through the web server
* provide UI to them through the CLI
* update unit tests to simulate decentralized mutable directories in order to test other components that rely on them
* remove the notion of a "vdrive server" and a client thereof
* remove the notion of a "public vdrive", which was a directory that was centrally published/subscribed automatically by the tahoe node (you can accomplish this manually by making a directory and posting the URL to it on your web site, for example)
* add a notion of "wait_for_numpeers" when you need to publish data to peers, which is how many peers should be attached before you start. The default is 1.
* add __repr__ for filesystem nodes (note: these reprs contain a few bits of the secret key!)
* fix a few bugs where we used to equate "mutable" with "not read-only". Nowadays all directories are mutable, but some might be read-only (to you).
* fix a few bugs where code wasn't aware of the new general-purpose metadata dict that comes with each filesystem edge
* sundry fixes to unit tests to adjust to the new directories, e.g. don't assume that every share on disk belongs to a chk file.
It isn't currently used, and I don't remember what part of its behavior was so much better than tahoe_put.py, and Brian has subsequently improved tahoe_put.py.
I'm not 100% sure that this is correct, but it looks reasonable, it passes unit
tests (although note that unit tests are currently not covering the new mutable
files very well), and it makes the "view JSON" link on a directory work instead
of raising an assertion error.