webapi/deep-manifest t=JSON: don't return the (large) manifest/SI/verifycap lists unless the operation has completed, to avoid the considerable CPU+memory cost of creating the JSON (for 330k dirnodes, it could take two minutes to generate 275MB of JSON). That cost must be paid eventually, but not on every poll.
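
From the client side, the change means a poller can check only the "finished" flag on each request and read the large lists once, at the end. The following is a minimal sketch of such a loop, not code from this commit: `fetch_status` is a hypothetical callable standing in for a GET of /operations/$HANDLE?output=JSON, and the 60-second default interval is simply borrowed from the meta-refresh period described in webapi.txt.

```python
import json
import time


def poll_until_finished(fetch_status, interval=60.0, sleep=time.sleep):
    """Poll a deep-manifest operation until it reports completion.

    fetch_status: hypothetical callable returning the JSON status text.
    interval:     seconds to wait between polls (assumed value).
    sleep:        injectable so the loop can be exercised without waiting.
    """
    while True:
        status = json.loads(fetch_status())
        if status["finished"]:
            # Only now are "manifest", "verifycaps" and "storage-index"
            # expected to be present in the response.
            return status
        # While running, the response carries only the small keys, so
        # each poll stays cheap for both server and client.
        sleep(interval)
```

A caller would pass a function that performs the HTTP request and returns the response body; the large lists are then read from the returned dictionary.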

2025-04-14 14:16:39 +00:00 · 2009-01-08 19:59:32 -07:00 · 2009-01-08 19:59:32 -07:00 · 7ee336b274
commit 7ee336b274
parent 39a089dc7e
2 changed files with 30 additions and 8 deletions
--- a/docs/frontends/webapi.txt
+++ b/docs/frontends/webapi.txt
@@ -213,6 +213,11 @@ GET /operations/$HANDLE?output=JSON   (same)
  * whether the operation is complete, or if it is still running
  * how much of the operation is complete, and how much is left, if possible

+ Note that the final status output can be quite large: a deep-manifest of a
+ directory structure with 300k directories and 200k unique files is about
+ 275MB of JSON, and might take two minutes to generate. For this reason, the
+ full status is not provided until the operation has completed.
+
 The HTML form will include a meta-refresh tag, which will cause a regular
 web browser to reload the status page about 60 seconds later. This tag will
 be removed once the operation has completed.
@@ -966,7 +971,10 @@ POST $DIRURL?t=start-manifest    (must add &ophandle=XYZ)
  by a space.

  If output=JSON is added to the queryargs, then the results will be a
-  JSON-formatted dictionary with six keys:
+  JSON-formatted dictionary with six keys. Note that because large directory
+  structures can result in very large JSON results, the full results will not
+  be available until the operation is complete (i.e. until output["finished"]
+  is True):

   finished (bool): if False then you must reload the page until True
   origin_si (base32 str): the storage index of the starting point
--- a/src/allmydata/web/directory.py
+++ b/src/allmydata/web/directory.py
@@ -728,13 +728,27 @@ class ManifestResults(rend.Page, ReloadMixin):
        inevow.IRequest(ctx).setHeader("content-type", "text/plain")
        m = self.monitor
        s = m.get_status()
-        status = {"manifest": s["manifest"],
-                  "verifycaps": list(s["verifycaps"]),
-                  "storage-index": list(s["storage-index"]),
-                  "stats": s["stats"],
-                  "finished": m.is_finished(),
-                  "origin": base32.b2a(m.origin_si),
-                  }
+
+        status = { "stats": s["stats"],
+                   "finished": m.is_finished(),
+                   "origin": base32.b2a(m.origin_si),
+                   }
+        if m.is_finished():
+            # don't return manifest/verifycaps/SIs unless the operation is
+            # done, to save on CPU/memory (both here and in the HTTP client
+            # who has to unpack the JSON). Tests show that the ManifestWalker
+            # needs about 1092 bytes per item, the JSON we generate here
+            # requires about 503 bytes per item, and some internal overhead
+            # (perhaps transport-layer buffers in twisted.web?) requires an
+            # additional 1047 bytes per item.
+            status.update({ "manifest": s["manifest"],
+                            "verifycaps": [i for i in s["verifycaps"]],
+                            "storage-index": [i for i in s["storage-index"]],
+                            })
+            # simplejson doesn't know how to serialize a set. We use a
+            # generator that walks the set rather than list(setofthing) to
+            # save a small amount of memory (4B*len) and a moderate amount of
+            # CPU.
        return simplejson.dumps(status, indent=1)

    def _si_abbrev(self):
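
The gating pattern in the directory.py hunk above can be tried in isolation. The sketch below is a standalone approximation, not the committed code: `FakeMonitor` is a hypothetical stand-in for the real monitor object, the standard-library `json` module replaces simplejson, the origin is assumed to be already base32-encoded, and `sorted()` is used instead of a list comprehension only to make the output deterministic.

```python
import json


class FakeMonitor:
    """Hypothetical stand-in for the operation monitor used in the diff."""

    def __init__(self, status, finished, origin):
        self._status = status
        self._finished = finished
        self.origin = origin  # assumed to be base32 text already

    def get_status(self):
        return self._status

    def is_finished(self):
        return self._finished


def render_status_json(monitor):
    s = monitor.get_status()
    finished = monitor.is_finished()
    # The small keys are cheap to serialize, so they go out on every poll.
    status = {
        "stats": s["stats"],
        "finished": finished,
        "origin": monitor.origin,
    }
    if finished:
        # The large lists are serialized only once the walk is complete.
        # Sets are not JSON-serializable, so they are converted to lists.
        status.update({
            "manifest": s["manifest"],
            "verifycaps": sorted(s["verifycaps"]),
            "storage-index": sorted(s["storage-index"]),
        })
    return json.dumps(status, indent=1)
```

With a monitor that is still running, the result holds three keys; once it reports finished, all six keys are present.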