mirror of
https://github.com/tahoe-lafs/tahoe-lafs.git
synced 2025-01-01 10:46:42 +00:00
271 lines
13 KiB
Plaintext
271 lines
13 KiB
Plaintext
= Tahoe Logging =
|
|
|
|
1. Overview
|
|
2. Realtime Logging
|
|
3. Incidents
|
|
4. Working with flogfiles
|
|
5. Gatherers
|
|
5.1. Incident Gatherer
|
|
5.2. Log Gatherer
|
|
6. Local twistd.log files
|
|
7. Adding log messages
|
|
8. Log Messages During Unit Tests
|
|
|
|
== Overview ==
|
|
|
|
Tahoe uses the Foolscap logging mechanism (known as the "flog" subsystem) to
|
|
record information about what is happening inside the Tahoe node. This is
|
|
primarily for use by programmers and grid operators who want to find out what
|
|
went wrong.
|
|
|
|
The foolscap logging system is documented here:
|
|
|
|
http://foolscap.lothar.com/docs/logging.html
|
|
|
|
The foolscap distribution includes a utility named "flogtool" (usually at
|
|
/usr/bin/flogtool) which is used to get access to many foolscap logging
|
|
features.
|
|
|
|
== Realtime Logging ==
|
|
|
|
When you are working on Tahoe code, and want to see what the node is doing,
|
|
the easiest tool to use is "flogtool tail". This connects to the tahoe node
|
|
and subscribes to hear about all log events. These events are then displayed
|
|
to stdout, and optionally saved to a file.
|
|
|
|
"flogtool tail" connects to the "logport", for which the FURL is stored in
|
|
BASEDIR/private/logport.furl . The following command will connect to this
|
|
port and start emitting log information:
|
|
|
|
flogtool tail BASEDIR/private/logport.furl
|
|
|
|
The "--save-to FILENAME" option will save all received events to a file,
|
|
where then can be examined later with "flogtool dump" or "flogtool
|
|
web-viewer". The --catch-up flag will ask the node to dump all stored events
|
|
before subscribing to new ones (without --catch-up, you will only hear about
|
|
events that occur after the tool has connected and subscribed).
|
|
|
|
== Incidents ==
|
|
|
|
Foolscap keeps a short list of recent events in memory. When something goes
|
|
wrong, it writes all the history it has (and everything that gets logged in
|
|
the next few seconds) into a file called an "incident". These files go into
|
|
BASEDIR/logs/incidents/ , in a file named
|
|
"incident-TIMESTAMP-UNIQUE.flog.bz2". The default definition of "something
|
|
goes wrong" is the generation of a log event at the log.WEIRD level or
|
|
higher, but other criteria could be implemented.
|
|
|
|
The typical "incident report" we've seen in a large Tahoe grid is about 40kB
|
|
compressed, representing about 1800 recent events.
|
|
|
|
These "flogfiles" have a similar format to the files saved by "flogtool tail
|
|
--save-to". They are simply lists of log events, with a small header to
|
|
indicate which event triggered the incident.
|
|
|
|
The "flogtool dump FLOGFILE" command will take one of these .flog.bz2 files
|
|
and print their contents to stdout, one line per event. The raw event
|
|
dictionaries can be dumped by using "flogtool dump --verbose FLOGFILE".
|
|
|
|
The "flogtool web-viewer" command can be used to examine the flogfile in a
|
|
web browser. It runs a small HTTP server and emits the URL on stdout. This
|
|
view provides more structure than the output of "flogtool dump": the
|
|
parent/child relationships of log events is displayed in a nested format.
|
|
"flogtool web-viewer" is still fairly immature.
|
|
|
|
== Working with flogfiles ==
|
|
|
|
The "flogtool filter" command can be used to take a large flogfile (perhaps
|
|
one created by the log-gatherer, see below) and copy a subset of events into
|
|
a second file. This smaller flogfile may be easier to work with than the
|
|
original. The arguments to "flogtool filter" specify filtering criteria: a
|
|
predicate that each event must match to be copied into the target file.
|
|
--before and --after are used to exclude events outside a given window of
|
|
time. --above will retain events above a certain severity level. --from
|
|
retains events send by a specific tubid. --strip-facility removes events that
|
|
were emitted with a given facility (like foolscap.negotiation or
|
|
tahoe.upload).
|
|
|
|
== Gatherers ==
|
|
|
|
In a deployed Tahoe grid, it is useful to get log information automatically
|
|
transferred to a central log-gatherer host. This offloads the (admittedly
|
|
modest) storage requirements to a different host and provides access to
|
|
logfiles from multiple nodes (webapi/storage/helper) nodes in a single place.
|
|
|
|
There are two kinds of gatherers. Both produce a FURL which needs to be
|
|
placed in the NODEDIR/log_gatherer.furl file (one FURL per line) of the nodes
|
|
that are to publish their logs to the gatherer. When the Tahoe node starts,
|
|
it will connect to the configured gatherers and offer its logport: the
|
|
gatherer will then use the logport to subscribe to hear about events.
|
|
|
|
The gatherer will write to files in its working directory, which can then be
|
|
examined with tools like "flogtool dump" as described above.
|
|
|
|
=== Incident Gatherer ===
|
|
|
|
The "incident gatherer" only collects Incidents: records of the log events
|
|
that occurred just before and slightly after some high-level "trigger event"
|
|
was recorded. Each incident is classified into a "category": a short string
|
|
that summarizes what sort of problem took place. These classification
|
|
functions are written after examining a new/unknown incident. The idea is to
|
|
recognize when the same problem is happening multiple times.
|
|
|
|
A collection of classification functions that are useful for Tahoe nodes are
|
|
provided in misc/incident-gatherer/support_classifiers.py . There is roughly
|
|
one category for each log.WEIRD-or-higher level event in the Tahoe source
|
|
code.
|
|
|
|
The incident gatherer is created with the "flogtool create-incident-gatherer
|
|
WORKDIR" command, and started with "tahoe start". The generated
|
|
"gatherer.tac" file should be modified to add classifier functions.
|
|
|
|
The incident gatherer writes incident names (which are simply the relative
|
|
pathname of the incident-*.flog.bz2 file) into classified/CATEGORY. For
|
|
example, the classified/mutable-retrieve-uncoordinated-write-error file
|
|
contains a list of all incidents which were triggered by an uncoordinated
|
|
write that was detected during mutable file retrieval (caused when somebody
|
|
changed the contents of the mutable file in between the node's mapupdate step
|
|
and the retrieve step). The classified/unknown file contains a list of all
|
|
incidents that did not match any of the classification functions.
|
|
|
|
At startup, the incident gatherer will automatically reclassify any incident
|
|
report which is not mentioned in any of the classified/* files. So the usual
|
|
workflow is to examine the incidents in classified/unknown, add a new
|
|
classification function, delete classified/unknown, then bound the gatherer
|
|
with "tahoe restart WORKDIR". The incidents which can be classified with the
|
|
new functions will be added to their own classified/FOO lists, and the
|
|
remaining ones will be put in classified/unknown, where the process can be
|
|
repeated until all events are classifiable.
|
|
|
|
The incident gatherer is still fairly immature: future versions will have a
|
|
web interface and an RSS feed, so operations personnel can track problems in
|
|
the storage grid.
|
|
|
|
In our experience, each Incident takes about two seconds to transfer from the
|
|
node which generated it to the gatherer. The gatherer will automatically
|
|
catch up to any incidents which occurred while it is offline.
|
|
|
|
=== Log Gatherer ===
|
|
|
|
The "Log Gatherer" subscribes to hear about every single event published by
|
|
the connected nodes, regardless of severity. This server writes these log
|
|
events into a large flogfile that is rotated (closed, compressed, and
|
|
replaced with a new one) on a periodic basis. Each flogfile is named
|
|
according to the range of time it represents, with names like
|
|
"from-2008-08-26-132256--to-2008-08-26-162256.flog.bz2". The flogfiles
|
|
contain events from many different sources, making it easier to correlate
|
|
things that happened on multiple machines (such as comparing a client node
|
|
making a request with the storage servers that respond to that request).
|
|
|
|
The Log Gatherer is created with the "flogtool create-gatherer WORKDIR"
|
|
command, and started with "tahoe start". The log_gatherer.furl it creates
|
|
then needs to be copied into the BASEDIR/log_gatherer.furl file of all nodes
|
|
which should be sending it log events.
|
|
|
|
The "flogtool filter" command, described above, is useful to cut down the
|
|
potentially-large flogfiles into more a narrowly-focussed form.
|
|
|
|
Busy nodes, particularly wapi nodes which are performing recursive
|
|
deep-size/deep-stats/deep-check operations, can produce a lot of log events.
|
|
To avoid overwhelming the node (and using an unbounded amount of memory for
|
|
the outbound TCP queue), publishing nodes will start dropping log events when
|
|
the outbound queue grows too large. When this occurs, there will be gaps
|
|
(non-sequential event numbers) in the log-gatherer's flogfiles.
|
|
|
|
== Local twistd.log files ==
|
|
|
|
[TODO: not yet true, requires foolscap-0.3.1 and a change to allmydata.node]
|
|
|
|
In addition to the foolscap-based event logs, certain high-level events will
|
|
be recorded directly in human-readable text form, in the
|
|
BASEDIR/logs/twistd.log file (and its rotated old versions: twistd.log.1,
|
|
twistd.log.2, etc). This form does not contain as much information as the
|
|
flogfiles available through the means described previously, but they are
|
|
immediately available to the curious developer, and are retained until the
|
|
twistd.log.NN files are explicitly deleted.
|
|
|
|
Only events at the log.OPERATIONAL level or higher are bridged to twistd.log
|
|
(i.e. not the log.NOISY debugging events). In addition, foolscap internal
|
|
events (like connection negotiation messages) are not bridged to twistd.log .
|
|
|
|
== Adding log messages ==
|
|
|
|
When adding new code, the Tahoe developer should add a reasonable number of
|
|
new log events. For details, please see the Foolscap logging documentation,
|
|
but a few notes are worth stating here:
|
|
|
|
* use a facility prefix of "tahoe.", like "tahoe.mutable.publish"
|
|
|
|
* assign each severe (log.WEIRD or higher) event a unique message
|
|
identifier, as the umid= argument to the log.msg() call. The
|
|
misc/coding_tools/make_umid script may be useful for this purpose. This will make it
|
|
easier to write a classification function for these messages.
|
|
|
|
* use the parent= argument whenever the event is causally/temporally
|
|
clustered with its parent. For example, a download process that involves
|
|
three sequential hash fetches could announce the send and receipt of those
|
|
hash-fetch messages with a parent= argument that ties them to the overall
|
|
download process. However, each new wapi download request should be
|
|
unparented.
|
|
|
|
* use the format= argument in preference to the message= argument. E.g.
|
|
use log.msg(format="got %(n)d shares, need %(k)d", n=n, k=k) instead of
|
|
log.msg("got %d shares, need %d" % (n,k)). This will allow later tools to
|
|
analyze the event without needing to scrape/reconstruct the structured
|
|
data out of the formatted string.
|
|
|
|
* Pass extra information as extra keyword arguments, even if they aren't
|
|
included in the format= string. This information will be displayed in the
|
|
"flogtool dump --verbose" output, as well as being available to other
|
|
tools. The umid= argument should be passed this way.
|
|
|
|
* use log.err for the catch-all addErrback that gets attached to the end of
|
|
any given Deferred chain. When used in conjunction with LOGTOTWISTED=1,
|
|
log.err() will tell Twisted about the error-nature of the log message,
|
|
causing Trial to flunk the test (with an "ERROR" indication that prints a
|
|
copy of the Failure, including a traceback). Don't use log.err for events
|
|
that are BAD but handled (like hash failures: since these are often
|
|
deliberately provoked by test code, they should not cause test failures):
|
|
use log.msg(level=BAD) for those instead.
|
|
|
|
|
|
== Log Messages During Unit Tests ==
|
|
|
|
If a test is failing and you aren't sure why, start by enabling
|
|
FLOGTOTWISTED=1 like this:
|
|
|
|
make test FLOGTOTWISTED=1
|
|
|
|
With FLOGTOTWISTED=1, sufficiently-important log events will be written into
|
|
_trial_temp/test.log, which may give you more ideas about why the test is
|
|
failing. Note, however, that _trial_temp/log.out will not receive messages
|
|
below the level=OPERATIONAL threshold, due to this issue:
|
|
<http://foolscap.lothar.com/trac/ticket/154>
|
|
|
|
|
|
If that isn't enough, look at the detailed foolscap logging messages instead,
|
|
by running the tests like this:
|
|
|
|
make test FLOGFILE=flog.out.bz2 FLOGLEVEL=1 FLOGTOTWISTED=1
|
|
|
|
The first environment variable will cause foolscap log events to be written
|
|
to ./flog.out.bz2 (instead of merely being recorded in the circular buffers
|
|
for the use of remote subscribers or incident reports). The second will cause
|
|
all log events to be written out, not just the higher-severity ones. The
|
|
third will cause twisted log events (like the markers that indicate when each
|
|
unit test is starting and stopping) to be copied into the flogfile, making it
|
|
easier to correlate log events with unit tests.
|
|
|
|
Enabling this form of logging appears to roughly double the runtime of the
|
|
unit tests. The flog.out.bz2 file is approximately 2MB.
|
|
|
|
You can then use "flogtool dump" or "flogtool web-viewer" on the resulting
|
|
flog.out file.
|
|
|
|
("flogtool tail" and the log-gatherer are not useful during unit tests, since
|
|
there is no single Tub to which all the log messages are published).
|
|
|
|
It is possible for setting these environment variables to cause spurious test
|
|
failures in tests with race condition bugs. All known instances of this have
|
|
been fixed as of Tahoe-LAFS v1.7.1.
|