Tahoe Logging
- Overview
- Realtime Logging
- Incidents
- Working with flogfiles
- Gatherers
- Local twistd.log files
- Adding log messages
- Log Messages During Unit Tests
Overview
Tahoe uses the Foolscap logging mechanism (known as the "flog" subsystem) to record information about what is happening inside the Tahoe node. This is primarily for use by programmers and grid operators who want to find out what went wrong.
The foolscap logging system is documented at http://foolscap.lothar.com/docs/logging.html.
The foolscap distribution includes a utility named "flogtool" (usually at /usr/bin/flogtool on Unix) which provides access to many foolscap logging features.
Realtime Logging
When you are working on Tahoe code and want to see what the node is doing, the easiest tool to use is "flogtool tail". This connects to the Tahoe node and subscribes to hear about all log events. These events are then displayed to stdout, and optionally saved to a file.

"flogtool tail" connects to the "logport", for which the FURL is stored in BASEDIR/private/logport.furl. The following command will connect to this port and start emitting log information:
flogtool tail BASEDIR/private/logport.furl
The --save-to FILENAME option will save all received events to a file, where they can be examined later with "flogtool dump" or "flogtool web-viewer". The --catch-up option will ask the node to dump all stored events before subscribing to new ones (without --catch-up, you will only hear about events that occur after the tool has connected and subscribed).
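For example, the following invocation dumps the node's stored events and then keeps following new ones, saving everything to a file (the output filename here is arbitrary):

  flogtool tail --catch-up --save-to events.flog BASEDIR/private/logport.furl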
Incidents
Foolscap keeps a short list of recent events in memory. When something goes wrong, it writes all the history it has (and everything that gets logged in the next few seconds) into a file called an "incident". These files go into BASEDIR/logs/incidents/, in a file named "incident-TIMESTAMP-UNIQUE.flog.bz2". The default definition of "something goes wrong" is the generation of a log event at the log.WEIRD level or higher, but other criteria could be implemented.
The typical "incident report" we've seen in a large Tahoe grid is about 40kB compressed, representing about 1800 recent events.
These "flogfiles" have a similar format to the files saved by
"flogtool tail --save-to
". They are simply lists of log
events, with a small header to indicate which event triggered the
incident.
The "flogtool dump FLOGFILE
" command will take one of
these .flog.bz2
files and print their contents to stdout,
one line per event. The raw event dictionaries can be dumped by using
"flogtool dump --verbose FLOGFILE
".
The "flogtool web-viewer
" command can be used to examine
the flogfile in a web browser. It runs a small HTTP server and emits the
URL on stdout. This view provides more structure than the output of
"flogtool dump
": the parent/child relationships of log
events is displayed in a nested format.
"flogtool web-viewer
" is still fairly immature.
Working with flogfiles
The "flogtool filter
" command can be used to take a
large flogfile (perhaps one created by the log-gatherer, see below) and
copy a subset of events into a second file. This smaller flogfile may be
easier to work with than the original. The arguments to
"flogtool filter
" specify filtering criteria: a predicate
that each event must match to be copied into the target file.
--before
and --after
are used to exclude
events outside a given window of time. --above
will retain
events above a certain severity level. --from
retains
events send by a specific tubid. --strip-facility
removes
events that were emitted with a given facility (like
foolscap.negotiation
or tahoe.upload
).
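For example, here is a sketch of a filtering run that drops foolscap negotiation chatter from a gathered flogfile. The input/output filenames are placeholders, and the OLDFILE NEWFILE argument order is an assumption; check "flogtool filter --help" for the exact syntax:

  flogtool filter --strip-facility foolscap.negotiation gathered.flog filtered.flog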
Gatherers
In a deployed Tahoe grid, it is useful to get log information automatically transferred to a central log-gatherer host. This offloads the (admittedly modest) storage requirements to a different host and provides access to logfiles from multiple nodes (web-API, storage, or helper) in a single place.
There are two kinds of gatherers. Both produce a FURL which needs to be placed in the NODEDIR/log_gatherer.furl file (one FURL per line) of each node that is to publish its logs to the gatherer. When the Tahoe node starts, it will connect to the configured gatherers and offer its logport: the gatherer will then use the logport to subscribe to hear about events.
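Each line of log_gatherer.furl is a foolscap FURL. Schematically (the pieces below are placeholders, not a real FURL):

  pb://TUBID@HOST:PORT/SWISSNUM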
The gatherer will write to files in its working directory, which can then be examined with tools like "flogtool dump" as described above.
Incident Gatherer
The "incident gatherer" only collects Incidents: records of the log events that occurred just before and slightly after some high-level "trigger event" was recorded. Each incident is classified into a "category": a short string that summarizes what sort of problem took place. These classification functions are written after examining a new/unknown incident. The idea is to recognize when the same problem is happening multiple times.
A collection of classification functions that are useful for Tahoe nodes is provided in misc/incident-gatherer/support_classifiers.py. There is roughly one category for each log.WEIRD-or-higher level event in the Tahoe source code.
The incident gatherer is created with the "flogtool create-incident-gatherer WORKDIR" command, and started with "tahoe start". The generated "gatherer.tac" file should be modified to add classifier functions.
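A classifier is, roughly, a function that inspects the incident's trigger event and returns a category string (or None to decline). The sketch below is hypothetical: the exact signature, the trigger dictionary keys, and the add_classifier() registration hook are assumptions that should be checked against misc/incident-gatherer/support_classifiers.py and the generated gatherer.tac:

  # Hypothetical classifier sketch; compare with support_classifiers.py
  # for the real signature and registration mechanism.
  def classify_ucwe(trigger):
      # 'trigger' is assumed to be the event dictionary that tripped the incident
      text = str(trigger.get("format") or trigger.get("message") or "")
      if "UncoordinatedWriteError" in text:
          return "mutable-retrieve-uncoordinated-write-error"
      return None  # decline, so other classifiers (or classified/unknown) apply

  # assumed registration hook, added to gatherer.tac:
  # gatherer.add_classifier(classify_ucwe)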
The incident gatherer writes incident names (which are simply the relative pathname of the incident-*.flog.bz2 file) into classified/CATEGORY. For example, the classified/mutable-retrieve-uncoordinated-write-error file contains a list of all incidents which were triggered by an uncoordinated write that was detected during mutable file retrieval (caused when somebody changed the contents of the mutable file in between the node's mapupdate step and the retrieve step). The classified/unknown file contains a list of all incidents that did not match any of the classification functions.
At startup, the incident gatherer will automatically reclassify any incident report which is not mentioned in any of the classified/* files. So the usual workflow is to examine the incidents in classified/unknown, add a new classification function, delete classified/unknown, then bounce the gatherer with "tahoe restart WORKDIR". The incidents which can be classified with the new functions will be added to their own classified/FOO lists, and the remaining ones will be put in classified/unknown, where the process can be repeated until all events are classifiable.
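Concretely, one pass through that loop might look like this sketch (run from inside WORKDIR; the incident names stored in classified/unknown are pathnames relative to WORKDIR):

  cd WORKDIR
  flogtool dump "$(head -1 classified/unknown)"  # examine one unclassified incident
  # ...add a matching classification function to gatherer.tac...
  rm classified/unknown
  cd .. && tahoe restart WORKDIR                 # reclassifies anything not yet listed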
The incident gatherer is still fairly immature: future versions will have a web interface and an RSS feed, so operations personnel can track problems in the storage grid.
In our experience, each incident takes about two seconds to transfer from the node that generated it to the gatherer. The gatherer will automatically catch up to any incidents which occurred while it was offline.
Log Gatherer
The "Log Gatherer" subscribes to hear about every single event
published by the connected nodes, regardless of severity. This server
writes these log events into a large flogfile that is rotated (closed,
compressed, and replaced with a new one) on a periodic basis. Each
flogfile is named according to the range of time it represents, with
names like
"from-2008-08-26-132256--to-2008-08-26-162256.flog.bz2
".
The flogfiles contain events from many different sources, making it
easier to correlate things that happened on multiple machines (such as
comparing a client node making a request with the storage servers that
respond to that request).
The Log Gatherer is created with the "flogtool create-gatherer WORKDIR" command, and started with "tahoe start". The log_gatherer.furl it creates then needs to be copied into the BASEDIR/log_gatherer.furl file of all nodes that should be sending it log events.
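A sketch of the whole setup (WORKDIR and BASEDIR are placeholders, and each node must be restarted so that it reads the new log_gatherer.furl at startup):

  flogtool create-gatherer WORKDIR
  (cd WORKDIR && tahoe start)
  cp WORKDIR/log_gatherer.furl BASEDIR/log_gatherer.furl
  tahoe restart BASEDIR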
The "flogtool filter
" command, described above, is
useful to cut down the potentially-large flogfiles into more a
narrowly-focussed form.
Busy nodes, particularly web-API nodes which are performing recursive deep-size/deep-stats/deep-check operations, can produce a lot of log events. To avoid overwhelming the node (and using an unbounded amount of memory for the outbound TCP queue), publishing nodes will start dropping log events when the outbound queue grows too large. When this occurs, there will be gaps (non-sequential event numbers) in the log-gatherer's flogfiles.
Local twistd.log files
[TODO: not yet true, requires foolscap-0.3.1 and a change to allmydata.node]
In addition to the foolscap-based event logs, certain high-level events will be recorded directly in human-readable text form, in the BASEDIR/logs/twistd.log file (and its rotated old versions: twistd.log.1, twistd.log.2, etc). These files do not contain as much information as the flogfiles available through the means described previously, but they are immediately available to the curious developer, and are retained until the twistd.log.NN files are explicitly deleted.
Only events at the log.OPERATIONAL level or higher are bridged to twistd.log (i.e. not the log.NOISY debugging events). In addition, foolscap internal events (like connection negotiation messages) are not bridged to twistd.log.
Adding log messages
When adding new code, the Tahoe developer should add a reasonable number of new log events. For details, please see the Foolscap logging documentation, but a few notes are worth stating here:
- use a facility prefix of "tahoe.", like "tahoe.mutable.publish"
- assign each severe (log.WEIRD or higher) event a unique message identifier, as the umid= argument to the log.msg() call. The misc/coding_tools/make_umid script may be useful for this purpose. This will make it easier to write a classification function for these messages.
- use the parent= argument whenever the event is causally/temporally clustered with its parent. For example, a download process that involves three sequential hash fetches could announce the send and receipt of those hash-fetch messages with a parent= argument that ties them to the overall download process. However, each new web-API download request should be unparented.
- use the format= argument in preference to the message= argument. E.g. use log.msg(format="got %(n)d shares, need %(k)d", n=n, k=k) instead of log.msg("got %d shares, need %d" % (n,k)). This will allow later tools to analyze the event without needing to scrape/reconstruct the structured data out of the formatted string (see the combined sketch after this list).
- pass extra information as extra keyword arguments, even if they aren't included in the format= string. This information will be displayed in the "flogtool dump --verbose" output, as well as being available to other tools. The umid= argument should be passed this way.
- use log.err for the catch-all addErrback that gets attached to the end of any given Deferred chain. When used in conjunction with FLOGTOTWISTED=1, log.err() will tell Twisted about the error-nature of the log message, causing Trial to flunk the test (with an "ERROR" indication that prints a copy of the Failure, including a traceback). Don't use log.err for events that are BAD but handled (like hash failures: since these are often deliberately provoked by test code, they should not cause test failures): use log.msg(level=BAD) for those instead.
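Putting these guidelines together, here is a minimal sketch of what such log calls might look like. The facility name, event text, and umid value are illustrative, and it calls foolscap's log module directly (real Tahoe code may go through a project-local wrapper):

  from foolscap.logging import log

  def report_retrieve(n, k):
      # Parent event for the overall operation; log.msg() returns an
      # identifier that child events can reference via parent=.
      op = log.msg(format="starting retrieve, need %(k)d shares", k=k,
                   facility="tahoe.mutable.retrieve",  # "tahoe." prefix
                   level=log.OPERATIONAL)
      if n < k:
          # Structured data (n, k) rides along as keyword arguments rather
          # than being baked into a preformatted string. The umid below is
          # a placeholder; generate a real one with misc/coding_tools/make_umid.
          log.msg(format="got %(n)d shares, need %(k)d", n=n, k=k,
                  parent=op, level=log.WEIRD, umid="xxxxxx")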
Log Messages During Unit Tests
If a test is failing and you aren't sure why, start by enabling FLOGTOTWISTED=1 like this:
make test FLOGTOTWISTED=1
With FLOGTOTWISTED=1, sufficiently-important log events will be written into _trial_temp/test.log, which may give you more ideas about why the test is failing. Note, however, that _trial_temp/test.log will not receive messages below the level=OPERATIONAL threshold, due to this issue: http://foolscap.lothar.com/trac/ticket/154
If that isn't enough, look at the detailed foolscap logging messages instead, by running the tests like this:
make test FLOGFILE=flog.out.bz2 FLOGLEVEL=1 FLOGTOTWISTED=1
The first environment variable will cause foolscap log events to be written to ./flog.out.bz2 (instead of merely being recorded in the circular buffers for the use of remote subscribers or incident reports). The second will cause all log events to be written out, not just the higher-severity ones. The third will cause twisted log events (like the markers that indicate when each unit test is starting and stopping) to be copied into the flogfile, making it easier to correlate log events with unit tests.

Enabling this form of logging appears to roughly double the runtime of the unit tests. The flog.out.bz2 file is approximately 2MB.
You can then use "flogtool dump" or "flogtool web-viewer" on the resulting flog.out.bz2 file. ("flogtool tail" and the log-gatherer are not useful during unit tests, since there is no single Tub to which all the log messages are published.)
Setting these environment variables can cause spurious test failures in tests with race-condition bugs. All known instances of this have been fixed as of Tahoe-LAFS v1.7.1.