mirror of
https://github.com/tahoe-lafs/tahoe-lafs.git
synced 2025-01-02 11:16:41 +00:00
288 lines
13 KiB
ReStructuredText
288 lines
13 KiB
ReStructuredText
=============
|
|
Tahoe Logging
|
|
=============
|
|
|
|
1. `Overview`_
|
|
2. `Realtime Logging`_
|
|
3. `Incidents`_
|
|
4. `Working with flogfiles`_
|
|
5. `Gatherers`_
|
|
|
|
1. `Incident Gatherer`_
|
|
2. `Log Gatherer`_
|
|
|
|
6. `Local twistd.log files`_
|
|
7. `Adding log messages`_
|
|
8. `Log Messages During Unit Tests`_
|
|
|
|
Overview
|
|
========
|
|
|
|
Tahoe uses the Foolscap logging mechanism (known as the "flog" subsystem) to
|
|
record information about what is happening inside the Tahoe node. This is
|
|
primarily for use by programmers and grid operators who want to find out what
|
|
went wrong.
|
|
|
|
The foolscap logging system is documented at
|
|
`<http://foolscap.lothar.com/docs/logging.html>`_.
|
|
|
|
The foolscap distribution includes a utility named "``flogtool``" (usually
|
|
at ``/usr/bin/flogtool`` on Unix) which is used to get access to many
|
|
foolscap logging features.
|
|
|
|
Realtime Logging
|
|
================
|
|
|
|
When you are working on Tahoe code, and want to see what the node is doing,
|
|
the easiest tool to use is "``flogtool tail``". This connects to the Tahoe
|
|
node and subscribes to hear about all log events. These events are then
|
|
displayed to stdout, and optionally saved to a file.
|
|
|
|
"``flogtool tail``" connects to the "logport", for which the FURL is stored
|
|
in ``BASEDIR/private/logport.furl`` . The following command will connect to
|
|
this port and start emitting log information::
|
|
|
|
flogtool tail BASEDIR/private/logport.furl
|
|
|
|
The ``--save-to FILENAME`` option will save all received events to a file,
|
|
where then can be examined later with "``flogtool dump``" or
|
|
"``flogtool web-viewer``". The ``--catch-up`` option will ask the node to
|
|
dump all stored events before subscribing to new ones (without ``--catch-up``,
|
|
you will only hear about events that occur after the tool has connected and
|
|
subscribed).
|
|
|
|
Incidents
|
|
=========
|
|
|
|
Foolscap keeps a short list of recent events in memory. When something goes
|
|
wrong, it writes all the history it has (and everything that gets logged in
|
|
the next few seconds) into a file called an "incident". These files go into
|
|
``BASEDIR/logs/incidents/`` , in a file named
|
|
"``incident-TIMESTAMP-UNIQUE.flog.bz2``". The default definition of
|
|
"something goes wrong" is the generation of a log event at the ``log.WEIRD``
|
|
level or higher, but other criteria could be implemented.
|
|
|
|
The typical "incident report" we've seen in a large Tahoe grid is about 40kB
|
|
compressed, representing about 1800 recent events.
|
|
|
|
These "flogfiles" have a similar format to the files saved by
|
|
"``flogtool tail --save-to``". They are simply lists of log events, with a
|
|
small header to indicate which event triggered the incident.
|
|
|
|
The "``flogtool dump FLOGFILE``" command will take one of these ``.flog.bz2``
|
|
files and print their contents to stdout, one line per event. The raw event
|
|
dictionaries can be dumped by using "``flogtool dump --verbose FLOGFILE``".
|
|
|
|
The "``flogtool web-viewer``" command can be used to examine the flogfile
|
|
in a web browser. It runs a small HTTP server and emits the URL on stdout.
|
|
This view provides more structure than the output of "``flogtool dump``":
|
|
the parent/child relationships of log events is displayed in a nested format.
|
|
"``flogtool web-viewer``" is still fairly immature.
|
|
|
|
Working with flogfiles
|
|
======================
|
|
|
|
The "``flogtool filter``" command can be used to take a large flogfile
|
|
(perhaps one created by the log-gatherer, see below) and copy a subset of
|
|
events into a second file. This smaller flogfile may be easier to work with
|
|
than the original. The arguments to "``flogtool filter``" specify filtering
|
|
criteria: a predicate that each event must match to be copied into the
|
|
target file. ``--before`` and ``--after`` are used to exclude events outside
|
|
a given window of time. ``--above`` will retain events above a certain
|
|
severity level. ``--from`` retains events send by a specific tubid.
|
|
``--strip-facility`` removes events that were emitted with a given facility
|
|
(like ``foolscap.negotiation`` or ``tahoe.upload``).
|
|
|
|
Gatherers
|
|
=========
|
|
|
|
In a deployed Tahoe grid, it is useful to get log information automatically
|
|
transferred to a central log-gatherer host. This offloads the (admittedly
|
|
modest) storage requirements to a different host and provides access to
|
|
logfiles from multiple nodes (web-API, storage, or helper) in a single place.
|
|
|
|
There are two kinds of gatherers. Both produce a FURL which needs to be
|
|
placed in the ``NODEDIR/log_gatherer.furl`` file (one FURL per line) of
|
|
each node that is to publish its logs to the gatherer. When the Tahoe node
|
|
starts, it will connect to the configured gatherers and offer its logport:
|
|
the gatherer will then use the logport to subscribe to hear about events.
|
|
|
|
The gatherer will write to files in its working directory, which can then be
|
|
examined with tools like "``flogtool dump``" as described above.
|
|
|
|
Incident Gatherer
|
|
-----------------
|
|
|
|
The "incident gatherer" only collects Incidents: records of the log events
|
|
that occurred just before and slightly after some high-level "trigger event"
|
|
was recorded. Each incident is classified into a "category": a short string
|
|
that summarizes what sort of problem took place. These classification
|
|
functions are written after examining a new/unknown incident. The idea is to
|
|
recognize when the same problem is happening multiple times.
|
|
|
|
A collection of classification functions that are useful for Tahoe nodes are
|
|
provided in ``misc/incident-gatherer/support_classifiers.py`` . There is
|
|
roughly one category for each ``log.WEIRD``-or-higher level event in the
|
|
Tahoe source code.
|
|
|
|
The incident gatherer is created with the "``flogtool create-incident-gatherer
|
|
WORKDIR``" command, and started with "``tahoe start``". The generated
|
|
"``gatherer.tac``" file should be modified to add classifier functions.
|
|
|
|
The incident gatherer writes incident names (which are simply the relative
|
|
pathname of the ``incident-\*.flog.bz2`` file) into ``classified/CATEGORY``.
|
|
For example, the ``classified/mutable-retrieve-uncoordinated-write-error``
|
|
file contains a list of all incidents which were triggered by an uncoordinated
|
|
write that was detected during mutable file retrieval (caused when somebody
|
|
changed the contents of the mutable file in between the node's mapupdate step
|
|
and the retrieve step). The ``classified/unknown`` file contains a list of all
|
|
incidents that did not match any of the classification functions.
|
|
|
|
At startup, the incident gatherer will automatically reclassify any incident
|
|
report which is not mentioned in any of the ``classified/\*`` files. So the
|
|
usual workflow is to examine the incidents in ``classified/unknown``, add a
|
|
new classification function, delete ``classified/unknown``, then bound the
|
|
gatherer with "``tahoe restart WORKDIR``". The incidents which can be
|
|
classified with the new functions will be added to their own ``classified/FOO``
|
|
lists, and the remaining ones will be put in ``classified/unknown``, where
|
|
the process can be repeated until all events are classifiable.
|
|
|
|
The incident gatherer is still fairly immature: future versions will have a
|
|
web interface and an RSS feed, so operations personnel can track problems in
|
|
the storage grid.
|
|
|
|
In our experience, each incident takes about two seconds to transfer from
|
|
the node that generated it to the gatherer. The gatherer will automatically
|
|
catch up to any incidents which occurred while it is offline.
|
|
|
|
Log Gatherer
|
|
------------
|
|
|
|
The "Log Gatherer" subscribes to hear about every single event published by
|
|
the connected nodes, regardless of severity. This server writes these log
|
|
events into a large flogfile that is rotated (closed, compressed, and
|
|
replaced with a new one) on a periodic basis. Each flogfile is named
|
|
according to the range of time it represents, with names like
|
|
"``from-2008-08-26-132256--to-2008-08-26-162256.flog.bz2``". The flogfiles
|
|
contain events from many different sources, making it easier to correlate
|
|
things that happened on multiple machines (such as comparing a client node
|
|
making a request with the storage servers that respond to that request).
|
|
|
|
The Log Gatherer is created with the "``flogtool create-gatherer WORKDIR``"
|
|
command, and started with "``tahoe start``". The ``log_gatherer.furl`` it
|
|
creates then needs to be copied into the ``BASEDIR/log_gatherer.furl`` file
|
|
of all nodes that should be sending it log events.
|
|
|
|
The "``flogtool filter``" command, described above, is useful to cut down the
|
|
potentially-large flogfiles into more a narrowly-focussed form.
|
|
|
|
Busy nodes, particularly web-API nodes which are performing recursive
|
|
deep-size/deep-stats/deep-check operations, can produce a lot of log events.
|
|
To avoid overwhelming the node (and using an unbounded amount of memory for
|
|
the outbound TCP queue), publishing nodes will start dropping log events when
|
|
the outbound queue grows too large. When this occurs, there will be gaps
|
|
(non-sequential event numbers) in the log-gatherer's flogfiles.
|
|
|
|
Local twistd.log files
|
|
======================
|
|
|
|
[TODO: not yet true, requires foolscap-0.3.1 and a change to ``allmydata.node``]
|
|
|
|
In addition to the foolscap-based event logs, certain high-level events will
|
|
be recorded directly in human-readable text form, in the
|
|
``BASEDIR/logs/twistd.log`` file (and its rotated old versions: ``twistd.log.1``,
|
|
``twistd.log.2``, etc). This form does not contain as much information as the
|
|
flogfiles available through the means described previously, but they are
|
|
immediately available to the curious developer, and are retained until the
|
|
twistd.log.NN files are explicitly deleted.
|
|
|
|
Only events at the ``log.OPERATIONAL`` level or higher are bridged to
|
|
``twistd.log`` (i.e. not the ``log.NOISY`` debugging events). In addition,
|
|
foolscap internal events (like connection negotiation messages) are not
|
|
bridged to ``twistd.log``.
|
|
|
|
Adding log messages
|
|
===================
|
|
|
|
When adding new code, the Tahoe developer should add a reasonable number of
|
|
new log events. For details, please see the Foolscap logging documentation,
|
|
but a few notes are worth stating here:
|
|
|
|
* use a facility prefix of "``tahoe.``", like "``tahoe.mutable.publish``"
|
|
|
|
* assign each severe (``log.WEIRD`` or higher) event a unique message
|
|
identifier, as the ``umid=`` argument to the ``log.msg()`` call. The
|
|
``misc/coding_tools/make_umid`` script may be useful for this purpose.
|
|
This will make it easier to write a classification function for these
|
|
messages.
|
|
|
|
* use the ``parent=`` argument whenever the event is causally/temporally
|
|
clustered with its parent. For example, a download process that involves
|
|
three sequential hash fetches could announce the send and receipt of those
|
|
hash-fetch messages with a ``parent=`` argument that ties them to the
|
|
overall download process. However, each new web-API download request
|
|
should be unparented.
|
|
|
|
* use the ``format=`` argument in preference to the ``message=`` argument.
|
|
E.g. use ``log.msg(format="got %(n)d shares, need %(k)d", n=n, k=k)``
|
|
instead of ``log.msg("got %d shares, need %d" % (n,k))``. This will allow
|
|
later tools to analyze the event without needing to scrape/reconstruct
|
|
the structured data out of the formatted string.
|
|
|
|
* Pass extra information as extra keyword arguments, even if they aren't
|
|
included in the ``format=`` string. This information will be displayed in
|
|
the "``flogtool dump --verbose``" output, as well as being available to
|
|
other tools. The ``umid=`` argument should be passed this way.
|
|
|
|
* use ``log.err`` for the catch-all ``addErrback`` that gets attached to
|
|
the end of any given Deferred chain. When used in conjunction with
|
|
``LOGTOTWISTED=1``, ``log.err()`` will tell Twisted about the error-nature
|
|
of the log message, causing Trial to flunk the test (with an "ERROR"
|
|
indication that prints a copy of the Failure, including a traceback).
|
|
Don't use ``log.err`` for events that are ``BAD`` but handled (like hash
|
|
failures: since these are often deliberately provoked by test code, they
|
|
should not cause test failures): use ``log.msg(level=BAD)`` for those
|
|
instead.
|
|
|
|
|
|
Log Messages During Unit Tests
|
|
==============================
|
|
|
|
If a test is failing and you aren't sure why, start by enabling
|
|
``FLOGTOTWISTED=1`` like this::
|
|
|
|
make test FLOGTOTWISTED=1
|
|
|
|
With ``FLOGTOTWISTED=1``, sufficiently-important log events will be written
|
|
into ``_trial_temp/test.log``, which may give you more ideas about why the
|
|
test is failing. Note, however, that ``_trial_temp/log.out`` will not receive
|
|
messages below the ``level=OPERATIONAL`` threshold, due to this issue:
|
|
`<http://foolscap.lothar.com/trac/ticket/154>`_
|
|
|
|
|
|
If that isn't enough, look at the detailed foolscap logging messages instead,
|
|
by running the tests like this::
|
|
|
|
make test FLOGFILE=flog.out.bz2 FLOGLEVEL=1 FLOGTOTWISTED=1
|
|
|
|
The first environment variable will cause foolscap log events to be written
|
|
to ``./flog.out.bz2`` (instead of merely being recorded in the circular buffers
|
|
for the use of remote subscribers or incident reports). The second will cause
|
|
all log events to be written out, not just the higher-severity ones. The
|
|
third will cause twisted log events (like the markers that indicate when each
|
|
unit test is starting and stopping) to be copied into the flogfile, making it
|
|
easier to correlate log events with unit tests.
|
|
|
|
Enabling this form of logging appears to roughly double the runtime of the
|
|
unit tests. The ``flog.out.bz2`` file is approximately 2MB.
|
|
|
|
You can then use "``flogtool dump``" or "``flogtool web-viewer``" on the
|
|
resulting ``flog.out`` file.
|
|
|
|
("``flogtool tail``" and the log-gatherer are not useful during unit tests,
|
|
since there is no single Tub to which all the log messages are published).
|
|
|
|
It is possible for setting these environment variables to cause spurious test
|
|
failures in tests with race condition bugs. All known instances of this have
|
|
been fixed as of Tahoe-LAFS v1.7.1.
|