doc: add explanation of the motivation for the surprising and awkward API to erasure coding

Zooko O'Whielacronx 2010-10-14 23:02:02 -07:00
parent 5c31a7079b
commit 0c2397523b


@@ -1163,20 +1163,56 @@ class ICodecEncoder(Interface):
encode(), unless of course it already happens to be an even multiple
of required_shares in length.)
-ALSO: the requirement to break up your data into 'required_shares'
-chunks before calling encode() feels a bit surprising, at least from
-the point of view of a user who doesn't know how FEC works. It feels
-like an implementation detail that has leaked outside the
-abstraction barrier. Can you imagine a use case in which the data to
-be encoded might already be available in pre-segmented chunks, such
-that it is faster or less work to make encode() take a list rather
-than splitting a single string?
+Note: the requirement to break up your data into
+'required_shares' chunks of exactly the right length before
+calling encode() is surprising from the point of view of a
+user who doesn't know how FEC works. It feels like an
+implementation detail that has leaked outside the abstraction
+barrier. Is there a use case in which the data to be encoded
+might already be available in pre-segmented chunks, such that
+it is faster or less work to make encode() take a list rather
+than splitting a single string?
-ALSO ALSO: I think 'inshares' is a misleading term, since encode()
-is supposed to *produce* shares, so what it *accepts* should be
-something other than shares. Other places in this interface use the
-word 'data' for that-which-is-not-shares. Maybe we should use that
-term?
+Yes, there is: suppose you are uploading a file with K=64,
+N=128, segsize=262,144. Then each in-share will be of size
+4096. If you use this .encode() API then your code could first
+read each successive 4096-byte chunk from the file, store each
+chunk in a Python string, and collect those strings in a
+Python list. Then you could call .encode(), passing that list
+as "inshares". The encoder would generate the other 64
+"secondary shares" and return to you a new list containing
+references to the same 64 Python strings that you passed in
+(as the primary shares) plus references to the 64 new Python
+strings.
+(You could even imagine using readv() so that the operating
+system can copy all of those bytes from the file into the
+Python list of strings as efficiently as possible, instead of
+a loop written in C or in Python copying the next part of the
+file into the next string.)
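The chunking arithmetic described above can be sketched as follows; `segment_to_inshares` is a hypothetical helper written for illustration, not part of the actual Tahoe-LAFS code:

```python
def segment_to_inshares(segment, required_shares):
    """Split one segment into 'required_shares' equal-length chunks,
    i.e. the pre-segmented in-shares that .encode() expects."""
    # The caller must already have padded the segment to an even
    # multiple of required_shares in length.
    assert len(segment) % required_shares == 0
    chunk_size = len(segment) // required_shares
    return [segment[i * chunk_size:(i + 1) * chunk_size]
            for i in range(required_shares)]

# With K=64 and segsize=262,144, each in-share is 4096 bytes:
segment = b"\x00" * 262144
inshares = segment_to_inshares(segment, 64)
assert len(inshares) == 64
assert all(len(s) == 4096 for s in inshares)
```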
+On the other hand, if you instead use the .encode_proposal()
+API (above), then your code can first read all 262,144 bytes
+of the segment from the file into a Python string, then call
+.encode_proposal() passing the segment data as the "data"
+argument. The encoder would basically first split the "data"
+argument into a list of 64 in-shares of 4096 bytes each, and
+then do the same thing that .encode() does. This would result
+in a little more copying of data and a slightly higher maximum
+memory usage during the process, although it might or might
+not make a practical difference for our current use cases.
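The relationship between the two call shapes could be sketched like this. This is a toy stand-in, not the real ICodecEncoder implementation: the "check shares" here are dummy placeholders rather than real FEC parity, and all names are illustrative:

```python
class SketchEncoder:
    """Toy codec showing the two call shapes, not real erasure coding."""

    def __init__(self, required_shares, max_shares):
        self.k = required_shares  # K: shares needed to reconstruct
        self.n = max_shares       # N: total shares produced

    def encode(self, inshares):
        # Accepts K pre-segmented chunks; returns those same K objects
        # as the primary shares, plus N-K "secondary" (check) shares.
        # A real FEC codec would compute parity; here each check share
        # is just a zero-filled placeholder of the same length.
        assert len(inshares) == self.k
        checks = [bytes(len(inshares[0])) for _ in range(self.n - self.k)]
        return list(inshares) + checks

    def encode_proposal(self, data):
        # Convenience wrapper: split one string into K in-shares first
        # (an extra copy of the data), then delegate to encode().
        chunk = len(data) // self.k
        inshares = [data[i * chunk:(i + 1) * chunk] for i in range(self.k)]
        return self.encode(inshares)

enc = SketchEncoder(64, 128)
shares = enc.encode_proposal(b"\x01" * 262144)
assert len(shares) == 128           # 64 primary + 64 check shares
assert len(shares[0]) == 4096       # 262,144 / 64
```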
+Note that "inshares" is a strange name for the parameter if
+you think of the parameter as being just for feeding in data
+to the codec. It makes more sense if you think of the result
+of this encoding as being the set of shares from inshares plus
+an extra set of "secondary shares" (or "check shares"). It is
+a surprising name! If the API is going to be surprising then
+the name should be surprising. If we switch to
+encode_proposal() above then we should also switch to an
+unsurprising name.
'desired_share_ids', if provided, is required to be a sequence of
ints, each of which is required to be >= 0 and < max_shares. If not