mirror of
https://github.com/zerotier/ZeroTierOne.git
synced 2025-01-03 03:36:41 +00:00
822 lines
30 KiB
Markdown
822 lines
30 KiB
Markdown
|
# encoding_rs
|
||
|
|
||
|
[![Build Status](https://travis-ci.org/hsivonen/encoding_rs.svg?branch=master)](https://travis-ci.org/hsivonen/encoding_rs)
|
||
|
[![crates.io](https://img.shields.io/crates/v/encoding_rs.svg)](https://crates.io/crates/encoding_rs)
|
||
|
[![docs.rs](https://docs.rs/encoding_rs/badge.svg)](https://docs.rs/encoding_rs/)
|
||
|
|
||
|
encoding_rs an implementation of the (non-JavaScript parts of) the
|
||
|
[Encoding Standard](https://encoding.spec.whatwg.org/) written in Rust.
|
||
|
|
||
|
The Encoding Standard defines the Web-compatible set of character encodings,
|
||
|
which means this crate can be used to decode Web content. encoding_rs is
|
||
|
used in Gecko starting with Firefox 56. Due to the notable overlap between
|
||
|
the legacy encodings on the Web and the legacy encodings used on Windows,
|
||
|
this crate may be of use for non-Web-related situations as well; see below
|
||
|
for links to adjacent crates.
|
||
|
|
||
|
Additionally, the `mem` module provides various operations for dealing with
|
||
|
in-RAM text (as opposed to data that's coming from or going to an IO boundary).
|
||
|
The `mem` module is a module instead of a separate crate due to internal
|
||
|
implementation detail efficiencies.
|
||
|
|
||
|
## Functionality
|
||
|
|
||
|
Due to the Gecko use case, encoding_rs supports decoding to and encoding from
|
||
|
UTF-16 in addition to supporting the usual Rust use case of decoding to and
|
||
|
encoding from UTF-8. Additionally, the API has been designed to be FFI-friendly
|
||
|
to accommodate the C++ side of Gecko.
|
||
|
|
||
|
Specifically, encoding_rs does the following:
|
||
|
|
||
|
* Decodes a stream of bytes in an Encoding Standard-defined character encoding
|
||
|
into valid aligned native-endian in-RAM UTF-16 (units of `u16` / `char16_t`).
|
||
|
* Encodes a stream of potentially-invalid aligned native-endian in-RAM UTF-16
|
||
|
(units of `u16` / `char16_t`) into a sequence of bytes in an Encoding
|
||
|
Standard-defined character encoding as if the lone surrogates had been
|
||
|
replaced with the REPLACEMENT CHARACTER before performing the encode.
|
||
|
(Gecko's UTF-16 is potentially invalid.)
|
||
|
* Decodes a stream of bytes in an Encoding Standard-defined character
|
||
|
encoding into valid UTF-8.
|
||
|
* Encodes a stream of valid UTF-8 into a sequence of bytes in an Encoding
|
||
|
Standard-defined character encoding. (Rust's UTF-8 is guaranteed-valid.)
|
||
|
* Does the above in streaming (input and output split across multiple
|
||
|
buffers) and non-streaming (whole input in a single buffer and whole
|
||
|
output in a single buffer) variants.
|
||
|
* Avoids copying (borrows) when possible in the non-streaming cases when
|
||
|
decoding to or encoding from UTF-8.
|
||
|
* Resolves textual labels that identify character encodings in
|
||
|
protocol text into type-safe objects representing the those encodings
|
||
|
conceptually.
|
||
|
* Maps the type-safe encoding objects onto strings suitable for
|
||
|
returning from `document.characterSet`.
|
||
|
* Validates UTF-8 (in common instruction set scenarios a bit faster for Web
|
||
|
workloads than the standard library; hopefully will get upstreamed some
|
||
|
day) and ASCII.
|
||
|
|
||
|
Additionally, `encoding_rs::mem` does the following:
|
||
|
|
||
|
* Checks if a byte buffer contains only ASCII.
|
||
|
* Checks if a potentially-invalid UTF-16 buffer contains only Basic Latin (ASCII).
|
||
|
* Checks if a valid UTF-8, potentially-invalid UTF-8 or potentially-invalid UTF-16
|
||
|
buffer contains only Latin1 code points (below U+0100).
|
||
|
* Checks if a valid UTF-8, potentially-invalid UTF-8 or potentially-invalid UTF-16
|
||
|
buffer or a code point or a UTF-16 code unit can trigger right-to-left behavior
|
||
|
(suitable for checking if the Unicode Bidirectional Algorithm can be optimized
|
||
|
out).
|
||
|
* Combined versions of the above two checks.
|
||
|
* Converts valid UTF-8, potentially-invalid UTF-8 and Latin1 to UTF-16.
|
||
|
* Converts potentially-invalid UTF-16 and Latin1 to UTF-8.
|
||
|
* Converts UTF-8 and UTF-16 to Latin1 (if in range).
|
||
|
* Finds the first invalid code unit in a buffer of potentially-invalid UTF-16.
|
||
|
* Makes a mutable buffer of potential-invalid UTF-16 contain valid UTF-16.
|
||
|
* Copies ASCII from one buffer to another up to the first non-ASCII byte.
|
||
|
* Converts ASCII to UTF-16 up to the first non-ASCII byte.
|
||
|
* Converts UTF-16 to ASCII up to the first non-Basic Latin code unit.
|
||
|
|
||
|
## Integration with `std::io`
|
||
|
|
||
|
Notably, the above feature list doesn't include the capability to wrap
|
||
|
a `std::io::Read`, decode it into UTF-8 and presenting the result via
|
||
|
`std::io::Read`. The [`encoding_rs_io`](https://crates.io/crates/encoding_rs_io)
|
||
|
crate provides that capability.
|
||
|
|
||
|
## `no_std` Environment
|
||
|
|
||
|
The crate works in a `no_std` environment. By default, the `alloc` feature,
|
||
|
which assumes that an allocator is present is enabled. For a no-allocator
|
||
|
environment, the default features (i.e. `alloc`) can be turned off. This
|
||
|
makes the part of the API that returns `Vec`/`String`/`Cow` unavailable.
|
||
|
|
||
|
## Decoding Email
|
||
|
|
||
|
For decoding character encodings that occur in email, use the
|
||
|
[`charset`](https://crates.io/crates/charset) crate instead of using this
|
||
|
one directly. (It wraps this crate and adds UTF-7 decoding.)
|
||
|
|
||
|
## Windows Code Page Identifier Mappings
|
||
|
|
||
|
For mappings to and from Windows code page identifiers, use the
|
||
|
[`codepage`](https://crates.io/crates/codepage) crate.
|
||
|
|
||
|
## DOS Encodings
|
||
|
|
||
|
This crate does not support single-byte DOS encodings that aren't required by
|
||
|
the Web Platform, but the [`oem_cp`](https://crates.io/crates/oem_cp) crate does.
|
||
|
|
||
|
## Preparing Text for the Encoders
|
||
|
|
||
|
Normalizing text into Unicode Normalization Form C prior to encoding text into
|
||
|
a legacy encoding minimizes unmappable characters. Text can be normalized to
|
||
|
Unicode Normalization Form C using the
|
||
|
[`unic-normal`](https://crates.io/crates/unic-normal) crate.
|
||
|
|
||
|
The exception is windows-1258, which after normalizing to Unicode Normalization
|
||
|
Form C requires tone marks to be decomposed in order to minimize unmappable
|
||
|
characters. Vietnamese tone marks can be decomposed using the
|
||
|
[`detone`](https://crates.io/crates/detone) crate.
|
||
|
|
||
|
## Licensing
|
||
|
|
||
|
TL;DR: `(Apache-2.0 OR MIT) AND BSD-3-Clause` for the code and data combination.
|
||
|
|
||
|
Please see the file named
|
||
|
[COPYRIGHT](https://github.com/hsivonen/encoding_rs/blob/master/COPYRIGHT).
|
||
|
|
||
|
The non-test code that isn't generated from the WHATWG data in this crate is
|
||
|
under Apache-2.0 OR MIT. Test code is under CC0.
|
||
|
|
||
|
This crate contains code/data generated from WHATWG-supplied data. The WHATWG
|
||
|
upstream changed its license for portions of specs incorporated into source code
|
||
|
from CC0 to BSD-3-Clause between the initial release of this crate and the present
|
||
|
version of this crate. The in-source licensing legends have been updated for the
|
||
|
parts of the generated code that have changed since the upstream license change.
|
||
|
|
||
|
## Documentation
|
||
|
|
||
|
Generated [API documentation](https://docs.rs/encoding_rs/) is available
|
||
|
online.
|
||
|
|
||
|
There is a [long-form write-up](https://hsivonen.fi/encoding_rs/) about the
|
||
|
design and internals of the crate.
|
||
|
|
||
|
## C and C++ bindings
|
||
|
|
||
|
An FFI layer for encoding_rs is available as a
|
||
|
[separate crate](https://github.com/hsivonen/encoding_c). The crate comes
|
||
|
with a [demo C++ wrapper](https://github.com/hsivonen/encoding_c/blob/master/include/encoding_rs_cpp.h)
|
||
|
using the C++ standard library and [GSL](https://github.com/Microsoft/GSL/) types.
|
||
|
|
||
|
The bindings for the `mem` module are in the
|
||
|
[encoding_c_mem crate](https://github.com/hsivonen/encoding_c_mem).
|
||
|
|
||
|
For the Gecko context, there's a
|
||
|
[C++ wrapper using the MFBT/XPCOM types](https://searchfox.org/mozilla-central/source/intl/Encoding.h#100).
|
||
|
|
||
|
There's a [write-up](https://hsivonen.fi/modern-cpp-in-rust/) about the C++
|
||
|
wrappers.
|
||
|
|
||
|
## Sample programs
|
||
|
|
||
|
* [Rust](https://github.com/hsivonen/recode_rs)
|
||
|
* [C](https://github.com/hsivonen/recode_c)
|
||
|
* [C++](https://github.com/hsivonen/recode_cpp)
|
||
|
|
||
|
## Optional features
|
||
|
|
||
|
There are currently these optional cargo features:
|
||
|
|
||
|
### `simd-accel`
|
||
|
|
||
|
Enables SIMD acceleration using the nightly-dependent `packed_simd_2` crate.
|
||
|
|
||
|
This is an opt-in feature, because enabling this feature _opts out_ of Rust's
|
||
|
guarantees of future compilers compiling old code (aka. "stability story").
|
||
|
|
||
|
Currently, this has not been tested to be an improvement except for these
|
||
|
targets:
|
||
|
|
||
|
* x86_64
|
||
|
* i686
|
||
|
* aarch64
|
||
|
* thumbv7neon
|
||
|
|
||
|
If you use nightly Rust, you use targets whose first component is one of the
|
||
|
above, and you are prepared _to have to revise your configuration when updating
|
||
|
Rust_, you should enable this feature. Otherwise, please _do not_ enable this
|
||
|
feature.
|
||
|
|
||
|
_Note!_ If you are compiling for a target that does not have 128-bit SIMD
|
||
|
enabled as part of the target definition and you are enabling 128-bit SIMD
|
||
|
using `-C target_feature`, you need to enable the `core_arch` Cargo feature
|
||
|
for `packed_simd_2` to compile a crates.io snapshot of `core_arch` instead of
|
||
|
using the standard-library copy of `core::arch`, because the `core::arch`
|
||
|
module of the pre-compiled standard library has been compiled with the
|
||
|
assumption that the CPU doesn't have 128-bit SIMD. At present this applies
|
||
|
mainly to 32-bit ARM targets whose first component does not include the
|
||
|
substring `neon`.
|
||
|
|
||
|
The encoding_rs side of things has not been properly set up for POWER,
|
||
|
PowerPC, MIPS, etc., SIMD at this time, so even if you were to follow
|
||
|
the advice from the previous paragraph, you probably shouldn't use
|
||
|
the `simd-accel` option on the less mainstream architectures at this
|
||
|
time.
|
||
|
|
||
|
Used by Firefox.
|
||
|
|
||
|
### `serde`
|
||
|
|
||
|
Enables support for serializing and deserializing `&'static Encoding`-typed
|
||
|
struct fields using [Serde][1].
|
||
|
|
||
|
[1]: https://serde.rs/
|
||
|
|
||
|
Not used by Firefox.
|
||
|
|
||
|
### `fast-legacy-encode`
|
||
|
|
||
|
A catch-all option for enabling the fastest legacy encode options. _Does not
|
||
|
affect decode speed or UTF-8 encode speed._
|
||
|
|
||
|
At present, this option is equivalent to enabling the following options:
|
||
|
* `fast-hangul-encode`
|
||
|
* `fast-hanja-encode`
|
||
|
* `fast-kanji-encode`
|
||
|
* `fast-gb-hanzi-encode`
|
||
|
* `fast-big5-hanzi-encode`
|
||
|
|
||
|
Adds 176 KB to the binary size.
|
||
|
|
||
|
Not used by Firefox.
|
||
|
|
||
|
### `fast-hangul-encode`
|
||
|
|
||
|
Changes encoding precomposed Hangul syllables into EUC-KR from binary
|
||
|
search over the decode-optimized tables to lookup by index making Korean
|
||
|
plain-text encode about 4 times as fast as without this option.
|
||
|
|
||
|
Adds 20 KB to the binary size.
|
||
|
|
||
|
Does _not_ affect decode speed.
|
||
|
|
||
|
Not used by Firefox.
|
||
|
|
||
|
### `fast-hanja-encode`
|
||
|
|
||
|
Changes encoding of Hanja into EUC-KR from linear search over the
|
||
|
decode-optimized table to lookup by index. Since Hanja is practically absent
|
||
|
in modern Korean text, this option doesn't affect perfomance in the common
|
||
|
case and mainly makes sense if you want to make your application resilient
|
||
|
agaist denial of service by someone intentionally feeding it a lot of Hanja
|
||
|
to encode into EUC-KR.
|
||
|
|
||
|
Adds 40 KB to the binary size.
|
||
|
|
||
|
Does _not_ affect decode speed.
|
||
|
|
||
|
Not used by Firefox.
|
||
|
|
||
|
### `fast-kanji-encode`
|
||
|
|
||
|
Changes encoding of Kanji into Shift_JIS, EUC-JP and ISO-2022-JP from linear
|
||
|
search over the decode-optimized tables to lookup by index making Japanese
|
||
|
plain-text encode to legacy encodings 30 to 50 times as fast as without this
|
||
|
option (about 2 times as fast as with `less-slow-kanji-encode`).
|
||
|
|
||
|
Takes precedence over `less-slow-kanji-encode`.
|
||
|
|
||
|
Adds 36 KB to the binary size (24 KB compared to `less-slow-kanji-encode`).
|
||
|
|
||
|
Does _not_ affect decode speed.
|
||
|
|
||
|
Not used by Firefox.
|
||
|
|
||
|
### `less-slow-kanji-encode`
|
||
|
|
||
|
Makes JIS X 0208 Level 1 Kanji (the most common Kanji in Shift_JIS, EUC-JP and
|
||
|
ISO-2022-JP) encode less slow (binary search instead of linear search) making
|
||
|
Japanese plain-text encode to legacy encodings 14 to 23 times as fast as
|
||
|
without this option.
|
||
|
|
||
|
Adds 12 KB to the binary size.
|
||
|
|
||
|
Does _not_ affect decode speed.
|
||
|
|
||
|
Not used by Firefox.
|
||
|
|
||
|
### `fast-gb-hanzi-encode`
|
||
|
|
||
|
Changes encoding of Hanzi in the CJK Unified Ideographs block into GBK and
|
||
|
gb18030 from linear search over a part the decode-optimized tables followed
|
||
|
by a binary search over another part of the decode-optimized tables to lookup
|
||
|
by index making Simplified Chinese plain-text encode to the legacy encodings
|
||
|
100 to 110 times as fast as without this option (about 2.5 times as fast as
|
||
|
with `less-slow-gb-hanzi-encode`).
|
||
|
|
||
|
Takes precedence over `less-slow-gb-hanzi-encode`.
|
||
|
|
||
|
Adds 36 KB to the binary size (24 KB compared to `less-slow-gb-hanzi-encode`).
|
||
|
|
||
|
Does _not_ affect decode speed.
|
||
|
|
||
|
Not used by Firefox.
|
||
|
|
||
|
### `less-slow-gb-hanzi-encode`
|
||
|
|
||
|
Makes GB2312 Level 1 Hanzi (the most common Hanzi in gb18030 and GBK) encode
|
||
|
less slow (binary search instead of linear search) making Simplified Chinese
|
||
|
plain-text encode to the legacy encodings about 40 times as fast as without
|
||
|
this option.
|
||
|
|
||
|
Adds 12 KB to the binary size.
|
||
|
|
||
|
Does _not_ affect decode speed.
|
||
|
|
||
|
Not used by Firefox.
|
||
|
|
||
|
### `fast-big5-hanzi-encode`
|
||
|
|
||
|
Changes encoding of Hanzi in the CJK Unified Ideographs block into Big5 from
|
||
|
linear search over a part the decode-optimized tables to lookup by index
|
||
|
making Traditional Chinese plain-text encode to Big5 105 to 125 times as fast
|
||
|
as without this option (about 3 times as fast as with
|
||
|
`less-slow-big5-hanzi-encode`).
|
||
|
|
||
|
Takes precedence over `less-slow-big5-hanzi-encode`.
|
||
|
|
||
|
Adds 40 KB to the binary size (20 KB compared to `less-slow-big5-hanzi-encode`).
|
||
|
|
||
|
Does _not_ affect decode speed.
|
||
|
|
||
|
Not used by Firefox.
|
||
|
|
||
|
### `less-slow-big5-hanzi-encode`
|
||
|
|
||
|
Makes Big5 Level 1 Hanzi (the most common Hanzi in Big5) encode less slow
|
||
|
(binary search instead of linear search) making Traditional Chinese
|
||
|
plain-text encode to Big5 about 36 times as fast as without this option.
|
||
|
|
||
|
Adds 20 KB to the binary size.
|
||
|
|
||
|
Does _not_ affect decode speed.
|
||
|
|
||
|
Not used by Firefox.
|
||
|
|
||
|
## Performance goals
|
||
|
|
||
|
For decoding to UTF-16, the goal is to perform at least as well as Gecko's old
|
||
|
uconv. For decoding to UTF-8, the goal is to perform at least as well as
|
||
|
rust-encoding. These goals have been achieved.
|
||
|
|
||
|
Encoding to UTF-8 should be fast. (UTF-8 to UTF-8 encode should be equivalent
|
||
|
to `memcpy` and UTF-16 to UTF-8 should be fast.)
|
||
|
|
||
|
Speed is a non-goal when encoding to legacy encodings. By default, encoding to
|
||
|
legacy encodings should not be optimized for speed at the expense of code size
|
||
|
as long as form submission and URL parsing in Gecko don't become noticeably
|
||
|
too slow in real-world use.
|
||
|
|
||
|
In the interest of binary size, by default, encoding_rs does not have
|
||
|
encode-specific data tables beyond 32 bits of encode-specific data for each
|
||
|
single-byte encoding. Therefore, encoders search the decode-optimized data
|
||
|
tables. This is a linear search in most cases. As a result, by default, encode
|
||
|
to legacy encodings varies from slow to extremely slow relative to other
|
||
|
libraries. Still, with realistic work loads, this seemed fast enough not to be
|
||
|
user-visibly slow on Raspberry Pi 3 (which stood in for a phone for testing)
|
||
|
in the Web-exposed encoder use cases.
|
||
|
|
||
|
See the cargo features above for optionally making CJK legacy encode fast.
|
||
|
|
||
|
A framework for measuring performance is [available separately][2].
|
||
|
|
||
|
[2]: https://github.com/hsivonen/encoding_bench/
|
||
|
|
||
|
## Rust Version Compatibility
|
||
|
|
||
|
It is a goal to support the latest stable Rust, the latest nightly Rust and
|
||
|
the version of Rust that's used for Firefox Nightly.
|
||
|
|
||
|
At this time, there is no firm commitment to support a version older than
|
||
|
what's required by Firefox, and there is no commitment to treat MSRV changes
|
||
|
as semver-breaking, because this crate depends on `cfg-if`, which doesn't
|
||
|
appear to treat MSRV changes as semver-breaking, so it would be useless for
|
||
|
this crate to treat MSRV changes as semver-breaking.
|
||
|
|
||
|
As of 2021-02-04, MSRV appears to be Rust 1.36.0 for using the crate and
|
||
|
1.42.0 for doc tests to pass without errors about the global allocator.
|
||
|
|
||
|
## Compatibility with rust-encoding
|
||
|
|
||
|
A compatibility layer that implements the rust-encoding API on top of
|
||
|
encoding_rs is
|
||
|
[provided as a separate crate](https://github.com/hsivonen/encoding_rs_compat)
|
||
|
(cannot be uploaded to crates.io). The compatibility layer was originally
|
||
|
written with the assuption that Firefox would need it, but it is not currently
|
||
|
used in Firefox.
|
||
|
|
||
|
## Regenerating Generated Code
|
||
|
|
||
|
To regenerate the generated code:
|
||
|
|
||
|
* Have Python 2 installed.
|
||
|
* Clone [`https://github.com/hsivonen/encoding_c`](https://github.com/hsivonen/encoding_c)
|
||
|
next to the `encoding_rs` directory.
|
||
|
* Clone [`https://github.com/hsivonen/codepage`](https://github.com/hsivonen/codepage)
|
||
|
next to the `encoding_rs` directory.
|
||
|
* Clone [`https://github.com/whatwg/encoding`](https://github.com/whatwg/encoding)
|
||
|
next to the `encoding_rs` directory.
|
||
|
* Checkout revision `be3337450e7df1c49dca7872153c4c4670dd8256` of the `encoding` repo.
|
||
|
(Note: `f381389` was the revision of `encoding` used from before the `encoding` repo
|
||
|
license change. So far, only output changed since then has been updated to
|
||
|
the new license legend.)
|
||
|
* With the `encoding_rs` directory as the working directory, run
|
||
|
`python generate-encoding-data.py`.
|
||
|
|
||
|
## Roadmap
|
||
|
|
||
|
- [x] Design the low-level API.
|
||
|
- [x] Provide Rust-only convenience features.
|
||
|
- [x] Provide an stl/gsl-flavored C++ API.
|
||
|
- [x] Implement all decoders and encoders.
|
||
|
- [x] Add unit tests for all decoders and encoders.
|
||
|
- [x] Finish BOM sniffing variants in Rust-only convenience features.
|
||
|
- [x] Document the API.
|
||
|
- [x] Publish the crate on crates.io.
|
||
|
- [x] Create a solution for measuring performance.
|
||
|
- [x] Accelerate ASCII conversions using SSE2 on x86.
|
||
|
- [x] Accelerate ASCII conversions using ALU register-sized operations on
|
||
|
non-x86 architectures (process an `usize` instead of `u8` at a time).
|
||
|
- [x] Split FFI into a separate crate so that the FFI doesn't interfere with
|
||
|
LTO in pure-Rust usage.
|
||
|
- [x] Compress CJK indices by making use of sequential code points as well
|
||
|
as Unicode-ordered parts of indices.
|
||
|
- [x] Make lookups by label or name use binary search that searches from the
|
||
|
end of the label/name to the start.
|
||
|
- [x] Make labels with non-ASCII bytes fail fast.
|
||
|
- [ ] ~Parallelize UTF-8 validation using [Rayon](https://github.com/nikomatsakis/rayon).~
|
||
|
(This turned out to be a pessimization in the ASCII case due to memory bandwidth reasons.)
|
||
|
- [x] Provide an XPCOM/MFBT-flavored C++ API.
|
||
|
- [x] Investigate accelerating single-byte encode with a single fast-tracked
|
||
|
range per encoding.
|
||
|
- [x] Replace uconv with encoding_rs in Gecko.
|
||
|
- [x] Implement the rust-encoding API in terms of encoding_rs.
|
||
|
- [x] Add SIMD acceleration for Aarch64.
|
||
|
- [x] Investigate the use of NEON on 32-bit ARM.
|
||
|
- [ ] ~Investigate Björn Höhrmann's lookup table acceleration for UTF-8 as
|
||
|
adapted to Rust in rust-encoding.~
|
||
|
- [x] Add actually fast CJK encode options.
|
||
|
- [ ] ~Investigate [Bob Steagall's lookup table acceleration for UTF-8](https://github.com/BobSteagall/CppNow2018/blob/master/FastConversionFromUTF-8/Fast%20Conversion%20From%20UTF-8%20with%20C%2B%2B%2C%20DFAs%2C%20and%20SSE%20Intrinsics%20-%20Bob%20Steagall%20-%20C%2B%2BNow%202018.pdf).~
|
||
|
- [ ] Provide a build mode that works without `alloc` (with lesser API surface).
|
||
|
- [ ] Migrate to `std::simd` once it is stable and declare 1.0.
|
||
|
|
||
|
## Release Notes
|
||
|
|
||
|
### 0.8.31
|
||
|
|
||
|
* Use SPDX with parentheses now that crates.io supports parentheses.
|
||
|
|
||
|
### 0.8.30
|
||
|
|
||
|
* Update the licensing information to take into account the WHATWG data license change.
|
||
|
|
||
|
### 0.8.29
|
||
|
|
||
|
* Make the parts that use an allocator optional.
|
||
|
|
||
|
### 0.8.28
|
||
|
|
||
|
* Fix error in Serde support introduced as part of `no_std` support.
|
||
|
|
||
|
### 0.8.27
|
||
|
|
||
|
* Make the crate works in a `no_std` environment (with `alloc`).
|
||
|
|
||
|
### 0.8.26
|
||
|
|
||
|
* Fix oversights in edition 2018 migration that broke the `simd-accel` feature.
|
||
|
|
||
|
### 0.8.25
|
||
|
|
||
|
* Do pointer alignment checks in a way where intermediate steps aren't defined to be Undefined Behavior.
|
||
|
* Update the `packed_simd` dependency to `packed_simd_2`.
|
||
|
* Update the `cfg-if` dependency to 1.0.
|
||
|
* Address warnings that have been introduced by newer Rust versions along the way.
|
||
|
* Update to edition 2018, since even prior to 1.0 `cfg-if` updated to edition 2018 without a semver break.
|
||
|
|
||
|
### 0.8.24
|
||
|
|
||
|
* Avoid computing an intermediate (not dereferenced) pointer value in a manner designated as Undefined Behavior when computing pointer alignment.
|
||
|
|
||
|
### 0.8.23
|
||
|
|
||
|
* Remove year from copyright notices. (No features or bug fixes.)
|
||
|
|
||
|
### 0.8.22
|
||
|
|
||
|
* Formatting fix and new unit test. (No features or bug fixes.)
|
||
|
|
||
|
### 0.8.21
|
||
|
|
||
|
* Fixed a panic with invalid UTF-16[BE|LE] input at the end of the stream.
|
||
|
|
||
|
### 0.8.20
|
||
|
|
||
|
* Make `Decoder::latin1_byte_compatible_up_to` return `None` in more
|
||
|
cases to make the method actually useful. While this could be argued
|
||
|
to be a breaking change due to the bug fix changing semantics, it does
|
||
|
not break callers that had to handle the `None` case in a reasonable
|
||
|
way anyway.
|
||
|
|
||
|
### 0.8.19
|
||
|
|
||
|
* Removed a bunch of bound checks in `convert_str_to_utf16`.
|
||
|
* Added `mem::convert_utf8_to_utf16_without_replacement`.
|
||
|
|
||
|
### 0.8.18
|
||
|
|
||
|
* Added `mem::utf8_latin1_up_to` and `mem::str_latin1_up_to`.
|
||
|
* Added `Decoder::latin1_byte_compatible_up_to`.
|
||
|
|
||
|
### 0.8.17
|
||
|
|
||
|
* Update `bincode` (dev dependency) version requirement to 1.0.
|
||
|
|
||
|
### 0.8.16
|
||
|
|
||
|
* Switch from the `simd` crate to `packed_simd`.
|
||
|
|
||
|
### 0.8.15
|
||
|
|
||
|
* Adjust documentation for `simd-accel` (README-only release).
|
||
|
|
||
|
### 0.8.14
|
||
|
|
||
|
* Made UTF-16 to UTF-8 encode conversion fill the output buffer as
|
||
|
closely as possible.
|
||
|
|
||
|
### 0.8.13
|
||
|
|
||
|
* Made the UTF-8 to UTF-16 decoder compare the number of code units written
|
||
|
with the length of the right slice (the output slice) to fix a panic
|
||
|
introduced in 0.8.11.
|
||
|
|
||
|
### 0.8.12
|
||
|
|
||
|
* Removed the `clippy::` prefix from clippy lint names.
|
||
|
|
||
|
### 0.8.11
|
||
|
|
||
|
* Changed minimum Rust requirement to 1.29.0 (for the ability to refer
|
||
|
to the interior of a `static` when defining another `static`).
|
||
|
* Explicitly aligned the lookup tables for single-byte encodings and
|
||
|
UTF-8 to cache lines in the hope of freeing up one cache line for
|
||
|
other data. (Perhaps the tables were already aligned and this is
|
||
|
placebo.)
|
||
|
* Added 32 bits of encode-oriented data for each single-byte encoding.
|
||
|
The change was performance-neutral for non-Latin1-ish Latin legacy
|
||
|
encodings, improved Latin1-ish and Arabic legacy encode speed
|
||
|
somewhat (new speed is 2.4x the old speed for German, 2.3x for
|
||
|
Arabic, 1.7x for Portuguese and 1.4x for French) and improved
|
||
|
non-Latin1, non-Arabic legacy single-byte encode a lot (7.2x for
|
||
|
Thai, 6x for Greek, 5x for Russian, 4x for Hebrew).
|
||
|
* Added compile-time options for fast CJK legacy encode options (at
|
||
|
the cost of binary size (up to 176 KB) and run-time memory usage).
|
||
|
These options still retain the overall code structure instead of
|
||
|
rewriting the CJK encoders totally, so the speed isn't as good as
|
||
|
what could be achieved by using even more memory / making the
|
||
|
binary even langer.
|
||
|
* Made UTF-8 decode and validation faster.
|
||
|
* Added method `is_single_byte()` on `Encoding`.
|
||
|
* Added `mem::decode_latin1()` and `mem::encode_latin1_lossy()`.
|
||
|
|
||
|
### 0.8.10
|
||
|
|
||
|
* Disabled a unit test that tests a panic condition when the assertion
|
||
|
being tested is disabled.
|
||
|
|
||
|
### 0.8.9
|
||
|
|
||
|
* Made `--features simd-accel` work with stable-channel compiler to
|
||
|
simplify the Firefox build system.
|
||
|
|
||
|
### 0.8.8
|
||
|
|
||
|
* Made the `is_foo_bidi()` not treat U+FEFF (ZERO WIDTH NO-BREAK SPACE
|
||
|
aka. BYTE ORDER MARK) as right-to-left.
|
||
|
* Made the `is_foo_bidi()` functions report `true` if the input contains
|
||
|
Hebrew presentations forms (which are right-to-left but not in a
|
||
|
right-to-left-roadmapped block).
|
||
|
|
||
|
### 0.8.7
|
||
|
|
||
|
* Fixed a panic in the UTF-16LE/UTF-16BE decoder when decoding to UTF-8.
|
||
|
|
||
|
### 0.8.6
|
||
|
|
||
|
* Temporarily removed the debug assertion added in version 0.8.5 from
|
||
|
`convert_utf16_to_latin1_lossy`.
|
||
|
|
||
|
### 0.8.5
|
||
|
|
||
|
* If debug assertions are enabled but fuzzing isn't enabled, lossy conversions
|
||
|
to Latin1 in the `mem` module assert that the input is in the range
|
||
|
U+0000...U+00FF (inclusive).
|
||
|
* In the `mem` module provide conversions from Latin1 and UTF-16 to UTF-8
|
||
|
that can deal with insufficient output space. The idea is to use them
|
||
|
first with an allocation rounded up to jemalloc bucket size and do the
|
||
|
worst-case allocation only if the jemalloc rounding up was insufficient
|
||
|
as the first guess.
|
||
|
|
||
|
### 0.8.4
|
||
|
|
||
|
* Fix SSE2-specific, `simd-accel`-specific memory corruption introduced in
|
||
|
version 0.8.1 in conversions between UTF-16 and Latin1 in the `mem` module.
|
||
|
|
||
|
### 0.8.3
|
||
|
|
||
|
* Removed an `#[inline(never)]` annotation that was not meant for release.
|
||
|
|
||
|
### 0.8.2
|
||
|
|
||
|
* Made non-ASCII UTF-16 to UTF-8 encode faster by manually omitting bound
|
||
|
checks and manually adding branch prediction annotations.
|
||
|
|
||
|
### 0.8.1
|
||
|
|
||
|
* Tweaked loop unrolling and memory alignment for SSE2 conversions between
|
||
|
UTF-16 and Latin1 in the `mem` module to increase the performance when
|
||
|
converting long buffers.
|
||
|
|
||
|
### 0.8.0
|
||
|
|
||
|
* Changed the minimum supported version of Rust to 1.21.0 (semver breaking
|
||
|
change).
|
||
|
* Flipped around the defaults vs. optional features for controlling the size
|
||
|
vs. speed trade-off for Kanji and Hanzi legacy encode (semver breaking
|
||
|
change).
|
||
|
* Added NEON support on ARMv7.
|
||
|
* SIMD-accelerated x-user-defined to UTF-16 decode.
|
||
|
* Made UTF-16LE and UTF-16BE decode a lot faster (including SIMD
|
||
|
acceleration).
|
||
|
|
||
|
### 0.7.2
|
||
|
|
||
|
* Add the `mem` module.
|
||
|
* Refactor SIMD code which can affect performance outside the `mem`
|
||
|
module.
|
||
|
|
||
|
### 0.7.1
|
||
|
|
||
|
* When encoding from invalid UTF-16, correctly handle U+DC00 followed by
|
||
|
another low surrogate.
|
||
|
|
||
|
### 0.7.0
|
||
|
|
||
|
* [Make `replacement` a label of the replacement
|
||
|
encoding.](https://github.com/whatwg/encoding/issues/70) (Spec change.)
|
||
|
* Remove `Encoding::for_name()`. (`Encoding::for_label(foo).unwrap()` is
|
||
|
now close enough after the above label change.)
|
||
|
* Remove the `parallel-utf8` cargo feature.
|
||
|
* Add optional Serde support for `&'static Encoding`.
|
||
|
* Performance tweaks for ASCII handling.
|
||
|
* Performance tweaks for UTF-8 validation.
|
||
|
* SIMD support on aarch64.
|
||
|
|
||
|
### 0.6.11
|
||
|
|
||
|
* Make `Encoder::has_pending_state()` public.
|
||
|
* Update the `simd` crate dependency to 0.2.0.
|
||
|
|
||
|
### 0.6.10
|
||
|
|
||
|
* Reserve enough space for NCRs when encoding to ISO-2022-JP.
|
||
|
* Correct max length calculations for multibyte decoders.
|
||
|
* Correct max length calculations before BOM sniffing has been
|
||
|
performed.
|
||
|
* Correctly calculate max length when encoding from UTF-16 to GBK.
|
||
|
|
||
|
### 0.6.9
|
||
|
|
||
|
* [Don't prepend anything when gb18030 range decode
|
||
|
fails](https://github.com/whatwg/encoding/issues/110). (Spec change.)
|
||
|
|
||
|
### 0.6.8
|
||
|
|
||
|
* Correcly handle the case where the first buffer contains potentially
|
||
|
partial BOM and the next buffer is the last buffer.
|
||
|
* Decode byte `7F` correctly in ISO-2022-JP.
|
||
|
* Make UTF-16 to UTF-8 encode write closer to the end of the buffer.
|
||
|
* Implement `Hash` for `Encoding`.
|
||
|
|
||
|
### 0.6.7
|
||
|
|
||
|
* [Map half-width katakana to full-width katana in ISO-2022-JP
|
||
|
encoder](https://github.com/whatwg/encoding/issues/105). (Spec change.)
|
||
|
* Give `InputEmpty` correct precedence over `OutputFull` when encoding
|
||
|
with replacement and the output buffer passed in is too short or the
|
||
|
remaining space in the output buffer is too small after a replacement.
|
||
|
|
||
|
### 0.6.6
|
||
|
|
||
|
* Correct max length calculation when a partial BOM prefix is part of
|
||
|
the decoder's state.
|
||
|
|
||
|
### 0.6.5
|
||
|
|
||
|
* Correct max length calculation in various encoders.
|
||
|
* Correct max length calculation in the UTF-16 decoder.
|
||
|
* Derive `PartialEq` and `Eq` for the `CoderResult`, `DecoderResult`
|
||
|
and `EncoderResult` types.
|
||
|
|
||
|
### 0.6.4
|
||
|
|
||
|
* Avoid panic when encoding with replacement and the destination buffer is
|
||
|
too short to hold one numeric character reference.
|
||
|
|
||
|
### 0.6.3
|
||
|
|
||
|
* Add support for 32-bit big-endian hosts. (For real this time.)
|
||
|
|
||
|
### 0.6.2
|
||
|
|
||
|
* Fix a panic from subslicing with bad indices in
|
||
|
`Encoder::encode_from_utf16`. (Due to an oversight, it lacked the fix that
|
||
|
`Encoder::encode_from_utf8` already had.)
|
||
|
* Micro-optimize error status accumulation in non-streaming case.
|
||
|
|
||
|
### 0.6.1
|
||
|
|
||
|
* Avoid panic near integer overflow in a case that's unlikely to actually
|
||
|
happen.
|
||
|
* Address Clippy lints.
|
||
|
|
||
|
### 0.6.0
|
||
|
|
||
|
* Make the methods for computing worst-case buffer size requirements check
|
||
|
for integer overflow.
|
||
|
* Upgrade rayon to 0.7.0.
|
||
|
|
||
|
### 0.5.1
|
||
|
|
||
|
* Reorder methods for better documentation readability.
|
||
|
* Add support for big-endian hosts. (Only 64-bit case actually tested.)
|
||
|
* Optimize the ALU (non-SIMD) case for 32-bit ARM instead of x86_64.
|
||
|
|
||
|
### 0.5.0
|
||
|
|
||
|
* Avoid allocating an excessively long buffers in non-streaming decode.
|
||
|
* Fix the behavior of ISO-2022-JP and replacement decoders near the end of the
|
||
|
output buffer.
|
||
|
* Annotate the result structs with `#[must_use]`.
|
||
|
|
||
|
### 0.4.0
|
||
|
|
||
|
* Split FFI into a separate crate.
|
||
|
* Performance tweaks.
|
||
|
* CJK binary size and encoding performance changes.
|
||
|
* Parallelize UTF-8 validation in the case of long buffers (with optional
|
||
|
feature `parallel-utf8`).
|
||
|
* Borrow even with ISO-2022-JP when possible.
|
||
|
|
||
|
### 0.3.2
|
||
|
|
||
|
* Fix moving pointers to alignment in ALU-based ASCII acceleration.
|
||
|
* Fix errors in documentation and improve documentation.
|
||
|
|
||
|
### 0.3.1
|
||
|
|
||
|
* Fix UTF-8 to UTF-16 decode for byte sequences beginning with 0xEE.
|
||
|
* Make UTF-8 to UTF-8 decode SSE2-accelerated when feature `simd-accel` is used.
|
||
|
* When decoding and encoding ASCII-only input from or to an ASCII-compatible
|
||
|
encoding using the non-streaming API, return a borrow of the input.
|
||
|
* Make encode from UTF-16 to UTF-8 faster.
|
||
|
|
||
|
### 0.3
|
||
|
|
||
|
* Change the references to the instances of `Encoding` from `const` to `static`
|
||
|
to make the referents unique across crates that use the refernces.
|
||
|
* Introduce non-reference-typed `FOO_INIT` instances of `Encoding` to allow
|
||
|
foreign crates to initialize `static` arrays with references to `Encoding`
|
||
|
instances even under Rust's constraints that prohibit the initialization of
|
||
|
`&'static Encoding`-typed array items with `&'static Encoding`-typed
|
||
|
`statics`.
|
||
|
* Document that the above two points will be reverted if Rust changes `const`
|
||
|
to work so that cross-crate usage keeps the referents unique.
|
||
|
* Return `Cow`s from Rust-only non-streaming methods for encode and decode.
|
||
|
* `Encoding::for_bom()` returns the length of the BOM.
|
||
|
* ASCII-accelerated conversions for encodings other than UTF-16LE, UTF-16BE,
|
||
|
ISO-2022-JP and x-user-defined.
|
||
|
* Add SSE2 acceleration behind the `simd-accel` feature flag. (Requires
|
||
|
nightly Rust.)
|
||
|
* Fix panic with long bogus labels.
|
||
|
* Map [0xCA to U+05BA in windows-1255](https://github.com/whatwg/encoding/issues/73).
|
||
|
(Spec change.)
|
||
|
* Correct the [end of the Shift_JIS EUDC range](https://github.com/whatwg/encoding/issues/53).
|
||
|
(Spec change.)
|
||
|
|
||
|
### 0.2.4
|
||
|
|
||
|
* Polish FFI documentation.
|
||
|
|
||
|
### 0.2.3
|
||
|
|
||
|
* Fix UTF-16 to UTF-8 encode.
|
||
|
|
||
|
### 0.2.2
|
||
|
|
||
|
* Add `Encoder.encode_from_utf8_to_vec_without_replacement()`.
|
||
|
|
||
|
### 0.2.1
|
||
|
|
||
|
* Add `Encoding.is_ascii_compatible()`.
|
||
|
|
||
|
* Add `Encoding::for_bom()`.
|
||
|
|
||
|
* Make `==` for `Encoding` use name comparison instead of pointer comparison,
|
||
|
because uses of the encoding constants in different crates result in
|
||
|
different addresses and the constant cannot be turned into statics without
|
||
|
breaking other things.
|
||
|
|
||
|
### 0.2.0
|
||
|
|
||
|
The initial release.
|