mirror of https://github.com/tahoe-lafs/tahoe-lafs.git (synced 2025-02-20 17:52:50 +00:00)
docs/logging.txt: explain tahoe/foolscap logging. Addresses #239.
(commit af07b9cad6, parent 3fab96493e)

= Tahoe Logging =

Tahoe uses the Foolscap logging mechanism (known as the "flog" subsystem) to
record information about what is happening inside the Tahoe node. This is
primarily for use by programmers and grid operators who want to find out what
went wrong.

The foolscap logging system is documented here:

 http://foolscap.lothar.com/docs/logging.html

The foolscap distribution includes a utility named "flogtool" (usually at
/usr/bin/flogtool) which is used to get access to many foolscap logging
features.
== Realtime Logging ==

When you are working on Tahoe code and want to see what the node is doing,
the easiest tool to use is "flogtool tail". This connects to the tahoe node
and subscribes to hear about all log events. These events are then displayed
to stdout, and optionally saved to a file.

"flogtool tail" connects to the "logport", for which the FURL is stored in
BASEDIR/private/logport.furl . The following command will connect to this
port and start emitting log information:

 flogtool tail BASEDIR/private/logport.furl

The "--save-to FILENAME" option will save all received events to a file,
where they can be examined later with "flogtool dump" or "flogtool
web-viewer". The --catch-up flag will ask the node to dump all stored events
before subscribing to new ones (without --catch-up, you will only hear about
events that occur after the tool has connected and subscribed).
== Incidents ==

Foolscap keeps a short list of recent events in memory. When something goes
wrong, it writes all the history it has (and everything that gets logged in
the next few seconds) into a file called an "incident". These files go into
BASEDIR/logs/incidents/ , in a file named
"incident-TIMESTAMP-UNIQUE.flog.bz2". The default definition of "something
goes wrong" is the generation of a log event at the log.WEIRD level or
higher, but other criteria could be implemented.
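
The mechanism can be pictured as a ring buffer plus a trigger threshold. The
following Python sketch is an illustration of that idea, not foolscap's
actual implementation: the class name, numeric level values, and event
layout are all invented here, and the real system also captures events for a
few seconds after the trigger, which this sketch omits.

```python
from collections import deque

# Assumed numeric severity levels, in foolscap's style (illustrative values).
NOISY, OPERATIONAL, WEIRD = 10, 20, 30

class IncidentRecorder:
    """Keep a bounded history of events; dump it when a severe event arrives."""
    def __init__(self, history_size=100, trigger_level=WEIRD):
        self.history = deque(maxlen=history_size)  # ring buffer of recent events
        self.trigger_level = trigger_level
        self.incidents = []  # stand-in for the incident-*.flog.bz2 files

    def msg(self, message, level=OPERATIONAL):
        event = {"message": message, "level": level}
        self.history.append(event)
        if level >= self.trigger_level:
            # "something goes wrong": snapshot everything we still remember
            self.incidents.append(list(self.history))

rec = IncidentRecorder(history_size=3)
rec.msg("connecting", level=NOISY)
rec.msg("uploading share", level=OPERATIONAL)
rec.msg("hash check failed", level=WEIRD)  # triggers an incident
print(len(rec.incidents))     # 1
print(len(rec.incidents[0]))  # 3 (the stored history)
```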
The typical "incident report" we've seen in a large Tahoe grid is about 40kB
compressed, representing about 1800 recent events.

These "flogfiles" have a similar format to the files saved by "flogtool tail
--save-to". They are simply lists of log events, with a small header to
indicate which event triggered the incident.

The "flogtool dump FLOGFILE" command will take one of these .flog.bz2 files
and print its contents to stdout, one line per event. The raw event
dictionaries can be dumped with "flogtool dump --verbose FLOGFILE".

The "flogtool web-viewer" command can be used to examine a flogfile in a
web browser. It runs a small HTTP server and emits the URL on stdout. This
view provides more structure than the output of "flogtool dump": the
parent/child relationships of log events are displayed in a nested format.
"flogtool web-viewer" is still fairly immature.
== Working with flogfiles ==

The "flogtool filter" command can be used to take a large flogfile (perhaps
one created by the log-gatherer, see below) and copy a subset of events into
a second file. This smaller flogfile may be easier to work with than the
original. The arguments to "flogtool filter" specify filtering criteria: a
predicate that each event must match to be copied into the target file.
--before and --after are used to exclude events outside a given window of
time. --above will retain only events at or above a certain severity level.
--from retains events sent by a specific tubid. --strip-facility removes
events that were emitted with a given facility (like foolscap.negotiation or
tahoe.upload).
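
Conceptually, each option contributes one predicate, and an event is copied
only if it satisfies all of them. The Python sketch below illustrates the
idea only; the event field names ("time", "level", "facility") and the
function names are assumptions for this example, not flogtool's internals.

```python
# Each filtering option becomes a predicate over a log-event dict.
def make_predicates(after=None, before=None, above=None, strip_facility=None):
    preds = []
    if after is not None:
        preds.append(lambda e: e["time"] >= after)       # like --after
    if before is not None:
        preds.append(lambda e: e["time"] <= before)      # like --before
    if above is not None:
        preds.append(lambda e: e.get("level", 0) >= above)  # like --above
    if strip_facility is not None:
        preds.append(lambda e:
                     not e.get("facility", "").startswith(strip_facility))
    return preds

def filter_events(events, preds):
    # copy an event into the target file only if every predicate matches
    return [e for e in events if all(p(e) for p in preds)]

events = [
    {"time": 100, "level": 20, "facility": "tahoe.upload"},
    {"time": 200, "level": 10, "facility": "foolscap.negotiation"},
    {"time": 300, "level": 30, "facility": "tahoe.mutable"},
]
preds = make_predicates(after=150, strip_facility="foolscap.negotiation")
print(filter_events(events, preds))  # only the time=300 event survives
```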
== Gatherers ==

In a deployed Tahoe grid, it is useful to get log information automatically
transferred to a central log-gatherer host. This offloads the (admittedly
modest) storage requirements to a different host and provides access to
logfiles from multiple (webapi/storage/helper) nodes in a single place.

There are two kinds of gatherers. Both produce a FURL which needs to be
placed in the NODEDIR/log_gatherer.furl file (one FURL per line) of the nodes
that are to publish their logs to the gatherer. When the Tahoe node starts,
it will connect to the configured gatherers and offer its logport: the
gatherer will then use the logport to subscribe to hear about events.

The gatherer will write to files in its working directory, which can then be
examined with tools like "flogtool dump" as described above.
=== Incident Gatherer ===

The "incident gatherer" only collects Incidents: records of the log events
that occurred just before and slightly after some high-level "trigger event"
was recorded. Each incident is classified into a "category": a short string
that summarizes what sort of problem took place. These classification
functions are written after examining a new/unknown incident. The idea is to
recognize when the same problem is happening multiple times.

A collection of classification functions that are useful for Tahoe nodes is
provided in misc/incident-gatherer/support_classifiers.py . There is roughly
one category for each log.WEIRD-or-higher level event in the Tahoe source
code.
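
A classification function inspects an incident's trigger event and either
returns a category name or declines. The sketch below is only illustrative:
see misc/incident-gatherer/support_classifiers.py for the real functions,
and note that the exact signature and trigger-event fields expected by the
gatherer are defined by foolscap, so treat the ones used here as
assumptions.

```python
def classify_ucwe(trigger):
    """Return a category name if this trigger looks like an uncoordinated
    write error seen during mutable-file retrieval, else None."""
    # assumed trigger-event fields for illustration
    message = trigger.get("format") or trigger.get("message") or ""
    if "UncoordinatedWriteError" in message and "retrieve" in message:
        return "mutable-retrieve-uncoordinated-write-error"
    return None  # not ours: let other classifiers (or 'unknown') handle it

print(classify_ucwe({"message": "retrieve failed: UncoordinatedWriteError"}))
# mutable-retrieve-uncoordinated-write-error
```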
The incident gatherer is created with the "flogtool create-incident-gatherer
WORKDIR" command, and started with "tahoe start". The generated
"gatherer.tac" file should be modified to add classifier functions.

The incident gatherer writes incident names (which are simply the relative
pathname of the incident-*.flog.bz2 file) into classified/CATEGORY. For
example, the classified/mutable-retrieve-uncoordinated-write-error file
contains a list of all incidents which were triggered by an uncoordinated
write that was detected during mutable file retrieval (caused when somebody
changed the contents of the mutable file in between the node's mapupdate step
and the retrieve step). The classified/unknown file contains a list of all
incidents that did not match any of the classification functions.

At startup, the incident gatherer will automatically reclassify any incident
report which is not mentioned in any of the classified/* files. So the usual
workflow is to examine the incidents in classified/unknown, add a new
classification function, delete classified/unknown, then bounce the gatherer
with "tahoe restart WORKDIR". The incidents which can be classified with the
new functions will be added to their own classified/FOO lists, and the
remaining ones will be put in classified/unknown, where the process can be
repeated until all events are classifiable.

The incident gatherer is still fairly immature: future versions will have a
web interface and an RSS feed, so operations personnel can track problems in
the storage grid.

In our experience, each Incident takes about two seconds to transfer from the
node which generated it to the gatherer. The gatherer will automatically
catch up on any incidents which occurred while it was offline.
=== Log Gatherer ===

The "Log Gatherer" subscribes to hear about every single event published by
the connected nodes, regardless of severity. This server writes these log
events into a large flogfile that is rotated (closed, compressed, and
replaced with a new one) on a periodic basis. Each flogfile is named
according to the range of time it represents, with names like
"from-2008-08-26-132256--to-2008-08-26-162256.flog.bz2". The flogfiles
contain events from many different sources, making it easier to correlate
things that happened on multiple machines (such as comparing a client node
making a request with the storage servers that respond to that request).

The Log Gatherer is created with the "flogtool create-gatherer WORKDIR"
command, and started with "tahoe start". The log_gatherer.furl it creates
then needs to be copied into the BASEDIR/log_gatherer.furl file of all nodes
which should be sending it log events.

The "flogtool filter" command, described above, is useful for cutting the
potentially-large flogfiles down into a more narrowly-focused form.

Busy nodes, particularly webapi nodes which are performing recursive
deep-size/deep-stats/deep-check operations, can produce a lot of log events.
To avoid overwhelming the node (and using an unbounded amount of memory for
the outbound TCP queue), publishing nodes will start dropping log events when
the outbound queue grows too large. When this occurs, there will be gaps
(non-sequential event numbers) in the log-gatherer's flogfiles.
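
Since each node numbers its events sequentially, dropped events show up as
holes in that node's number sequence. A small Python sketch of how such gaps
might be detected when post-processing a gathered flogfile (the helper name
is invented for this example; it assumes you have already extracted one
node's event numbers):

```python
def find_gaps(event_numbers):
    """Given one node's event numbers, return (first, last) pairs
    bracketing each run of missing numbers."""
    gaps = []
    nums = sorted(event_numbers)
    for prev, cur in zip(nums, nums[1:]):
        if cur > prev + 1:
            # events prev+1 .. cur-1 were dropped before reaching the gatherer
            gaps.append((prev + 1, cur - 1))
    return gaps

print(find_gaps([1, 2, 3, 7, 8, 12]))  # [(4, 6), (9, 11)]
```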
== Local twistd.log files ==

[TODO: not yet true, requires foolscap-0.3.1 and a change to allmydata.node]

In addition to the foolscap-based event logs, certain high-level events will
be recorded directly in human-readable text form, in the
BASEDIR/logs/twistd.log file (and its rotated old versions: twistd.log.1,
twistd.log.2, etc). These files do not contain as much information as the
flogfiles available through the means described previously, but they are
immediately available to the curious developer, and are retained until the
twistd.log.NN files are explicitly deleted.

Only events at the log.OPERATIONAL level or higher are bridged to twistd.log
(i.e. not the log.NOISY debugging events). In addition, foolscap internal
events (like connection negotiation messages) are not bridged to twistd.log.
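
The bridging rule amounts to: severity at or above OPERATIONAL, and not a
foolscap-internal facility. A Python sketch of that rule (the function name,
event fields, and numeric level values are assumptions made for this
example, in foolscap's style):

```python
NOISY, OPERATIONAL = 10, 20  # assumed numeric levels, foolscap-style

def bridge_to_twistd_log(event):
    """Decide whether a flog event also gets written to twistd.log."""
    if event.get("level", OPERATIONAL) < OPERATIONAL:
        return False  # log.NOISY debugging events stay out
    if event.get("facility", "").startswith("foolscap"):
        return False  # internal negotiation chatter stays out
    return True

print(bridge_to_twistd_log({"level": 20, "facility": "tahoe.upload"}))  # True
print(bridge_to_twistd_log({"level": 10, "facility": "tahoe.upload"}))  # False
```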
== Adding log messages ==

When adding new code, the Tahoe developer should add a reasonable number of
new log events. For details, please see the Foolscap logging documentation,
but a few notes are worth stating here:

 * use a facility prefix of "tahoe.", like "tahoe.mutable.publish"

 * assign each severe (log.WEIRD or higher) event a unique message
   identifier, as the umid= argument to the log.msg() call. The
   misc/make_umid script may be useful for this purpose. This will make it
   easier to write a classification function for these messages.

 * use the parent= argument whenever the event is causally/temporally
   clustered with its parent. For example, a download process that involves
   three sequential hash fetches could announce the send and receipt of those
   hash-fetch messages with a parent= argument that ties them to the overall
   download process. However, each new webapi download request should be
   unparented.

 * use the format= argument in preference to the message= argument. E.g.
   use log.msg(format="got %(n)d shares, need %(k)d", n=n, k=k) instead of
   log.msg("got %d shares, need %d" % (n,k)). This will allow later tools to
   analyze the event without needing to scrape/reconstruct the structured
   data out of the formatted string.

 * pass extra information as extra keyword arguments, even if they aren't
   included in the format= string. This information will be displayed in the
   "flogtool dump --verbose" output, as well as being available to other
   tools. The umid= argument should be passed this way.
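
The format= guideline is worth a concrete illustration. Because the event
keeps n and k as structured fields alongside the format string, later tools
can both render it for humans and query it by field. The snippet below uses
plain dicts and Python %-formatting to show the payoff; it is not foolscap
code, and the event layout is simplified for this example:

```python
# An event recorded with format= plus keyword arguments keeps the
# structured data (here n and k) recoverable, unlike a pre-formatted
# string such as "got 3 shares, need 7".
event = {"format": "got %(n)d shares, need %(k)d", "n": 3, "k": 7}

rendered = event["format"] % event     # human-readable form
structured = (event["n"], event["k"])  # machine-readable form

print(rendered)    # got 3 shares, need 7
print(structured)  # (3, 7)
```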

== Log Messages During Unit Tests ==

To get detailed log messages out of the unit tests, run them like this:

 make test FLOGFILE=flog.out.bz2 FLOGLEVEL=1 FLOGTWISTED=1

The first environment variable will cause foolscap log events to be written
to ./flog.out.bz2 (instead of merely being recorded in the circular buffers
for the use of remote subscribers or incident reports). The second will cause
all log events to be written out, not just the higher-severity ones. The
third will cause twisted log events (like the markers that indicate when each
unit test is starting and stopping) to be copied into the flogfile, making it
easier to correlate log events with unit tests.

Enabling this form of logging appears to roughly double the runtime of the
unit tests. The flog.out.bz2 file is approximately 2MB.

You can then use "flogtool dump" or "flogtool web-viewer" on the resulting
flog.out.bz2 file.

("flogtool tail" and the log-gatherer are not useful during unit tests, since
there is no single Tub to which all the log messages are published).