This document contains notes about various ideas that for one reason or another
are not being actively pursued.

## Next byte is non-ASCII after ASCII optimization

The current plan for a SIMD-accelerated inner loop for handling ASCII bytes
makes no use of the fact that if the ASCII loop exited before the buffers
ended, the next byte is known to be non-ASCII.

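A minimal scalar sketch of how that knowledge could be carried forward. The
hypothetical `ascii_run()` below stands in for the SIMD loop (it is not
encoding_rs API) and returns the byte that ended the run along with the run
length, so the caller could dispatch straight to non-ASCII handling without
testing that byte again.

```rust
/// Scalar stand-in for the SIMD ASCII loop (hypothetical helper).
/// Returns the length of the leading ASCII run and, if the run ended before
/// the end of `src`, the byte that ended it.
pub fn ascii_run(src: &[u8]) -> (usize, Option<u8>) {
    let len = src.iter().take_while(|b| b.is_ascii()).count();
    // If `len < src.len()`, the byte at `len` is known to be >= 0x80.
    (len, src.get(len).copied())
}
```

When the second component is `Some`, the byte it holds is guaranteed to be
non-ASCII, which is exactly the bit of information the current plan discards.
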
## Handling ASCII with table lookups when decoding single-byte to UTF-16

Both uconv and ICU outperform encoding_rs when decoding single-byte to UTF-16.
uconv doesn't even do anything fancy to manually unroll the loop (see below).
Both handle even the ASCII range using table lookup. That is, there's no branch
for checking if we're in the lower or upper half of the encoding.

However, adding SIMD acceleration for the ASCII half will likely be a bigger
win than eliminating the branch to decide ASCII vs. non-ASCII.

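For illustration, a branch-free lookup loop of the kind described above could
look roughly like this. It is a sketch, not the uconv or ICU code: the function
names are made up, the 256-entry table is caller-supplied, unmappable bytes are
ignored, and an identity table (the ISO-8859-1 mapping) stands in for real
encoding data.

```rust
/// Decodes single-byte input to UTF-16 with one table lookup per byte and no
/// ASCII vs. non-ASCII branch. `table` is hypothetical caller-supplied data
/// mapping each byte to a UTF-16 code unit; unmappable bytes are not handled
/// in this sketch. Returns the number of code units written.
pub fn decode_by_table(table: &[u16; 256], src: &[u8], dst: &mut [u16]) -> usize {
    let len = src.len().min(dst.len());
    for (d, &s) in dst[..len].iter_mut().zip(&src[..len]) {
        *d = table[s as usize];
    }
    len
}

/// Identity table (the ISO-8859-1 mapping), used only as stand-in data.
pub fn identity_table() -> [u16; 256] {
    let mut table = [0u16; 256];
    for (i, entry) in table.iter_mut().enumerate() {
        *entry = i as u16;
    }
    table
}
```
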
## Manual loop unrolling for single-byte encodings

ICU currently outperforms encoding_rs (by more than 2x!) when decoding a
single-byte encoding to UTF-16. This appears to be thanks to manually unrolling
the conversion loop by 16. See [ucnv_MBCSSingleToBMPWithOffsets][1].

[1]: https://ssl.icu-project.org/repos/icu/icu/tags/release-55-1/source/common/ucnvmbcs.cpp

Notably, none of the single-byte encodings have bytes that'd decode to the
upper half of BMP. Therefore, if the unmappable marker has the highest bit set
instead of being zero, the check for unmappables within a 16-character stride
can be done either by ORing the BMP characters in the stride together and
checking the high bit or by loading the upper halves of the BMP characters
in a `u8x16` register and checking the high bits using the `_mm_movemask_epi8`
/ `pmovmskb` SSE2 instruction.

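The ORing variant of the check can be sketched as follows. The marker value
`0x8000`, the function name and the table are assumptions made for this
example. The loop is left for the compiler to unroll (its trip count is fixed
at 16) rather than unrolled by hand as in ICU.

```rust
/// Hypothetical unmappable marker: any value with the highest bit set works,
/// given that no single-byte encoding decodes to the upper half of the BMP.
pub const UNMAPPABLE: u16 = 0x8000;

/// Decodes one 16-byte stride via `table` (hypothetical caller-supplied data)
/// and returns `true` if no byte in the stride was unmappable.
pub fn decode_stride(table: &[u16; 256], src: &[u8; 16], dst: &mut [u16; 16]) -> bool {
    let mut acc = 0u16;
    // No per-code-unit branch: just accumulate the high bits.
    for (d, &s) in dst.iter_mut().zip(src.iter()) {
        let unit = table[s as usize];
        *d = unit;
        acc |= unit;
    }
    // One check per stride instead of sixteen.
    (acc & UNMAPPABLE) == 0
}
```

On a `false` return the caller would still have to rescan the stride to find
the first unmappable byte. That is acceptable only if unmappables are rare.
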
## After non-ASCII, handle ASCII punctuation without SIMD

The failure mode of SIMD ASCII acceleration involves wasted alignment checks
and a wasted SIMD read when the next code unit is non-ASCII. Non-Latin scripts
have runs of non-ASCII even if ASCII spaces and punctuation are used.
Therefore, consider handling the next two or three bytes following non-ASCII
as non-SIMD before looping back to the SIMD mode. Maybe move back to SIMD ASCII
faster if there's ASCII that's not space or punctuation. Maybe with the "space
or punctuation" check in place, this code can be allowed to be in place even
for UTF-8 and Latin single-byte (i.e. not having different code for Latin and
non-Latin single-byte).

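One way the "space or punctuation" heuristic might be expressed is sketched
below. The helper name and the cap of three bytes are assumptions for the
example. It only decides how many upcoming bytes to keep on the non-SIMD path
and does no conversion itself.

```rust
/// Upper bound on bytes handled without SIMD after non-ASCII (the text
/// suggests two or three; three is an arbitrary pick for this sketch).
const MAX_SCALAR: usize = 3;

/// Hypothetical helper: given the bytes that follow a non-ASCII sequence,
/// returns how many of them to handle without SIMD. Counting stops early at
/// the first ASCII byte that is neither a space nor punctuation, so that
/// ASCII text returns to the SIMD path quickly.
pub fn scalar_prefix_len(rest: &[u8]) -> usize {
    rest.iter()
        .take(MAX_SCALAR)
        .take_while(|&&b| !b.is_ascii() || b == b' ' || b.is_ascii_punctuation())
        .count()
}
```

For example, after a non-ASCII character followed by `", abc"`, the comma and
the space stay on the scalar path and the SIMD loop resumes at `a`.
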
## Prefer maintaining alignment

Instead of returning to acceleration directly after non-ASCII, consider
continuing to the alignment boundary without acceleration.

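The distance to the boundary is cheap to compute. A small sketch, with a
hypothetical helper name and the address passed as a plain `usize`:

```rust
/// Number of bytes from address `addr` to the next multiple of `align`
/// (zero if `addr` is already aligned). `align` must be a power of two.
pub fn bytes_until_aligned(addr: usize, align: usize) -> usize {
    debug_assert!(align.is_power_of_two());
    // Equivalent to `(align - addr % align) % align` for powers of two.
    addr.wrapping_neg() & (align - 1)
}
```

A caller could pass `src.as_ptr() as usize` and `16`, handle that many bytes
without acceleration, and then re-enter the SIMD loop already aligned.
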
## Read from SIMD lanes instead of RAM (cache) when ASCII check fails

When the SIMD ASCII check fails, the data has already been read from memory.
Test whether it's faster to read the data by lane from the SIMD register than
to read it again from RAM (cache).

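The idea can be illustrated without SIMD intrinsics by letting a `u64` stand in
for the vector register. This swaps in a plain ALU word for the real SIMD type,
so it says nothing about actual lane-extraction cost. It only shows the shape
of "extract from what was already loaded". The function name is hypothetical.

```rust
/// Given a word that was loaded little-endian from eight input bytes, returns
/// the index and value of the first non-ASCII byte, extracted from the word
/// itself rather than by reading memory again. Returns `None` if all eight
/// bytes are ASCII.
pub fn first_non_ascii_in_word(word: u64) -> Option<(usize, u8)> {
    let mask = word & 0x8080_8080_8080_8080;
    if mask == 0 {
        return None;
    }
    // The lowest set high bit identifies the first non-ASCII byte.
    let index = (mask.trailing_zeros() / 8) as usize;
    Some((index, (word >> (index * 8)) as u8))
}
```
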
## Use Level 2 Hanzi and Level 2 Kanji ordering

These two are ordered by radical and then by stroke count, so in principle,
they should be mostly Unicode-ordered, although at least Level 2 Hanzi isn't
fully Unicode-ordered. Is "mostly" good enough for encode acceleration?

## Create a `divmod_94()` function

Experiment with a function that computes `(i / 94, i % 94)` more efficiently
than generic code.

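One candidate to experiment with is division by multiplication with a
precomputed reciprocal, sketched below for `u16` inputs (the input type is an
assumption). Compilers typically already apply this kind of strength reduction
to division by a constant, so whether a hand-written version is any faster
would need measuring.

```rust
/// Computes `(i / 94, i % 94)` using a multiply and a shift instead of a
/// division instruction.
pub fn divmod_94(i: u16) -> (u16, u16) {
    // ceil(2^32 / 94); exact for every 16-bit input.
    const RECIPROCAL: u64 = 45_691_142;
    let quotient = ((u64::from(i) * RECIPROCAL) >> 32) as u16;
    // `quotient * 94 <= i`, so neither operation can overflow.
    let remainder = i - quotient * 94;
    (quotient, remainder)
}
```
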
## Align writes on Aarch64

On [Cortex-A57](https://stackoverflow.com/questions/45714535/performance-of-unaligned-simd-load-store-on-aarch64/45938112#45938112),
it might be a good idea to move the destination into 16-byte alignment.

## Unalign UTF-8 validation on Aarch64

Currently, Aarch64 runs the generic ALU UTF-8 validation code that aligns
reads. That's probably unnecessary on Aarch64. (SIMD was slower than ALU!)

## Table-driven UTF-8 validation

When there are at least four bytes left, read all four. Use each byte to
index into a table of magic values for that byte position.

In the value read from the table indexed by lead byte, encode the
following in 16 bits: a 2-bit advance (2, 3 or 4 bytes), 9 positional
bits, one of which is set to indicate the type of lead byte (8 valid
types in the 8 lowest bits, plus invalid; ASCII would be a tenth type),
and the mask for extracting the payload bits from the lead byte
(for conversion to UTF-16 or UTF-32).

In the tables indexable by the trail bytes, in each bit position
corresponding to a lead byte type, store 1 if the trail is
invalid given the lead and 0 if valid given the lead.

Use the low 8 bits of the 16 bits read from the first
table to mask (bitwise AND) one positional bit from each of the
three other values. Bitwise OR the results together with the
bit that is 1 if the lead is invalid. If the result is zero,
the sequence is valid. Otherwise it's invalid.

Use the advance to advance. In the conversion to UTF-16 or
UTF-32 case, use the mask for extracting the meaningful
bits from the lead byte to mask them from the lead. Shift
left by 6 as many times as the advance indicates, etc.

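The scheme above can be sketched as follows. The exact bit layout (type bits
in bits 0..=7, invalid flag in bit 8, advance minus one in bits 9..=10, payload
mask in bits 11..=15), the split of lead bytes into eight types, and all names
are assumptions made for this example. ASCII is assumed to be consumed by a
separate fast path, so the lead table simply flags it as invalid here. Tables
are built at run time to keep the sketch short.

```rust
/// Bit 8 of a lead-table entry: set if the byte is not a valid non-ASCII lead.
const INVALID: u16 = 1 << 8;

/// Per lead type: advance and the valid range of the second byte.
const TYPES: [(u16, u8, u8); 8] = [
    (2, 0x80, 0xBF), // C2..=DF
    (3, 0xA0, 0xBF), // E0
    (3, 0x80, 0xBF), // E1..=EC
    (3, 0x80, 0x9F), // ED
    (3, 0x80, 0xBF), // EE..=EF
    (4, 0x90, 0xBF), // F0
    (4, 0x80, 0xBF), // F1..=F3
    (4, 0x80, 0x8F), // F4
];

fn lead_type(byte: u8) -> Option<usize> {
    match byte {
        0xC2..=0xDF => Some(0),
        0xE0 => Some(1),
        0xE1..=0xEC => Some(2),
        0xED => Some(3),
        0xEE..=0xEF => Some(4),
        0xF0 => Some(5),
        0xF1..=0xF3 => Some(6),
        0xF4 => Some(7),
        _ => None, // ASCII, continuation bytes, C0, C1, F5..=FF
    }
}

pub struct Utf8Tables {
    lead: [u16; 256],
    trail: [[u8; 256]; 3], // indexed by second, third and fourth byte
}

impl Utf8Tables {
    pub fn new() -> Self {
        let mut lead = [INVALID; 256];
        let mut trail = [[0u8; 256]; 3];
        for b in 0..256usize {
            let byte = b as u8;
            if let Some(ty) = lead_type(byte) {
                let advance = TYPES[ty].0;
                let payload_mask = 0x7Fu16 >> advance; // 0x1F, 0x0F or 0x07
                lead[b] = (1u16 << ty) | ((advance - 1) << 9) | (payload_mask << 11);
            }
            let continuation = (0x80..=0xBF).contains(&byte);
            for (ty, &(advance, lo, hi)) in TYPES.iter().enumerate() {
                // A set bit means "invalid as a trail for lead type `ty`".
                if byte < lo || byte > hi {
                    trail[0][b] |= 1u8 << ty;
                }
                if advance >= 3 && !continuation {
                    trail[1][b] |= 1u8 << ty;
                }
                if advance == 4 && !continuation {
                    trail[2][b] |= 1u8 << ty;
                }
            }
        }
        Utf8Tables { lead, trail }
    }

    /// Returns the advance (2, 3 or 4) if `bytes`, which must hold at least
    /// four bytes, starts with a valid non-ASCII UTF-8 sequence.
    pub fn check(&self, bytes: &[u8]) -> Option<usize> {
        let lead = self.lead[bytes[0] as usize];
        let ty = lead as u8; // one-hot lead type; zero for an invalid lead
        let bad = u16::from(self.trail[0][bytes[1] as usize] & ty)
            | u16::from(self.trail[1][bytes[2] as usize] & ty)
            | u16::from(self.trail[2][bytes[3] as usize] & ty)
            | (lead & INVALID);
        if bad == 0 {
            Some(usize::from((lead >> 9) & 3) + 1)
        } else {
            None
        }
    }

    /// Like `check`, but also assembles the code point from the payload bits.
    pub fn decode(&self, bytes: &[u8]) -> Option<(u32, usize)> {
        let advance = self.check(bytes)?;
        let mask = u32::from(self.lead[bytes[0] as usize] >> 11);
        let mut code_point = u32::from(bytes[0]) & mask;
        for &trail in &bytes[1..advance] {
            code_point = (code_point << 6) | u32::from(trail & 0x3F);
        }
        Some((code_point, advance))
    }
}
```

Note that `check()` itself contains no data-dependent branch until the final
comparison: the lead type selects which bit of each trail entry matters, and
trail positions beyond the sequence length always contribute zero.
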