Quick Guide to understanding Unicode Data Transfer Formats

by Paul Hsieh
What follows is the most concise way I can think of for presenting what one needs to know to understand just the raw Unicode (aka ISO 10646) data formats. This is where everyone needs to start with understanding Unicode, and for people writing just text processing (as opposed to text viewing) tools, there is no need to proceed further than this. So in the interest of conciseness, let's just get right into it.

For the purposes of data transfer, the main thing that Unicode specifies is "code points", which are values in the range from 0x0 to 0x10FFFF with a couple of missing holes. For the encoding of ordinary ASCII text, each code point corresponds to a single character. However, for more complicated languages, multiple code points may be used in combination to describe a text element such as a word or character (for example, "é" may be written either as the single code point U-E9 or as "e" followed by the combining acute accent U-301). Thus "code points" should not be thought of as exactly synonymous with characters, though in many cases they are.

Although there is only one up to date Unicode standard, there are various popular encoding formats. These are described in the table below.

Each format is listed with how it treats each range of 32 bit code point values (hex):

UCS-4
  0:7FFFFFFF          4 octets
  80000000:FFFFFFFF   Invalid

UTF-32
  0:D7FF              4 octets
  D800:DFFF           Invalid
  E000:FFFD           4 octets
  FFFE:FFFF           Invalid
  10000:10FFFF        4 octets
  110000:FFFFFFFF     Invalid

UCS-2/BMP
  0:D7FF              2 octets
  D800:DFFF           Invalid
  E000:FFFD           2 octets
  FFFE:FFFF           Invalid
  10000:FFFFFFFF      No encoding

UTF-16
  0:D7FF              2 octets
  D800:DFFF           No encoding
  E000:FFFD           2 octets
  FFFE:FFFF           Invalid
  10000:10FFFF        4 octets
  110000:FFFFFFFF     No encoding

Latin1
  0:FF                1 octet
  100:FFFFFFFF        No encoding

7-bit ASCII
  0:7F                1 octet
  80:FF               Invalid
  100:FFFFFFFF        No encoding

UTF-8
  0:7F                1 octet
  80:7FF              2 octets
  800:D7FF            3 octets
  D800:DFFF           Invalid
  E000:FFFD           3 octets
  FFFE:FFFF           Invalid
  10000:10FFFF        4 octets
  110000:16887F       Invalid
  168880:FFFFFFFF     No encoding

GB18030
  0:7F                1 octet
  80:D7FF             2 or 4 octets
  D800:DFFF           No encoding
  E000:FFFD           2 or 4 octets
  FFFE:FFFF           Invalid
  10000:10FFFF        4 octets
  110000:16887F       Invalid
  168880:FFFFFFFF     No encoding

An octet is just a more specific way of saying "8 bit byte".

The entries marked Invalid denote values which the encoding could physically represent, but which are considered illegal in that encoding because they do not correspond to valid values. The entries giving an octet count denote ranges for which there is a valid encoding of a valid value. UCS-4 allows for values which are not valid Unicode code points, while all of the UTF formats shown above precisely cover the valid Unicode range. The entries marked No encoding denote ranges which are impossible to represent in that encoding at all, making the question of their validity a moot point.

Note that there is no conflict between these mappings: each individual value has a 1-1 correspondence with any other format's encoding of that same value, if one exists. This is not true of formats like Big5, which contains several character pairs that map to a single Unicode character, as well as several that do not map to any valid Unicode character. Other formats, like UTF-7, have multiple ways of encoding the same values. For these reasons, and because still others (like UTF-1) have not gained much traction, only the encodings shown in the table above will be discussed.

Other things worth noting:

Because of their heritage and their improved efficiency over UTF-32, UTF-8 and UTF-16 are the most common Unicode encoding formats used for data transfer. Between the two there are various arguments as to which is better. UTF-16 is a little faster to decode and encode, it's directly backward compatible with UCS-2, and it is more efficient at encoding most typical East Asian text. UTF-8 has no endianness issues, it's directly backward compatible with ASCII, and it is more efficient at encoding Western text.

So as one can easily calculate from the table above, there are 1112062 valid Unicode code points (0x110000 = 1114112 total values, minus the 2048 surrogates D800:DFFF and the 2 values FFFE and FFFF). As of the most current Unicode standard, only roughly 100000 of these have actually been assigned to individually named universal values. Amongst the encodings shown above, all are trivial except for UTF-8 and UTF-16, which we now go into in more depth.

The UTF-8 mapping

A UTF-8 mapping takes valid Unicode code point values and translates them into one or more octets. An encoder will simply write the octets in sequential order, and a decoder will read the octets one at a time and try to fit them to a reverse mapping. The mapping from a valid Unicode code point value x (= x20 x19 x18 x17 x16 x15 x14 x13 x12 x11 x10 x9 x8 x7 x6 x5 x4 x3 x2 x1 x0 in binary notation, x20 being the most significant bit) to UTF-8 is as follows:

if U-0 ≤ x ≤ U-7F then UTF-8(x) =
0 x6 x5 x4 x3 x2 x1 x0
if U-80 ≤ x ≤ U-7FF then UTF-8(x) =
1 1 0 x10 x9 x8 x7 x6
1 0 x5 x4 x3 x2 x1 x0
if U-800 ≤ x ≤ U-FFFF then UTF-8(x) =
1 1 1 0 x15 x14 x13 x12
1 0 x11 x10 x9 x8 x7 x6
1 0 x5 x4 x3 x2 x1 x0
if U-10000 ≤ x ≤ U-10FFFF then UTF-8(x) =
1 1 1 1 0 x20 x19 x18
1 0 x17 x16 x15 x14 x13 x12
1 0 x11 x10 x9 x8 x7 x6
1 0 x5 x4 x3 x2 x1 x0

In the original UTF-8 encoding (which was intended to map the entire UCS-4 range), 5-octet (26 bit) and 6-octet (31 bit) encodings were also specified following the pattern above. This is important when considering the resynch property below.
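
To make the mapping concrete, here is a minimal sketch of an encoder in C; the function name utf8_encode and its buffer-based interface are my own illustration, not part of any standard API.

    #include <stddef.h>

    /* Encode a single Unicode code point as UTF-8.  Writes 1 to 4 octets
       into out and returns the count, or returns 0 for values UTF-8 deems
       invalid (the surrogates D800:DFFF, or anything above 10FFFF). */
    size_t utf8_encode(unsigned long x, unsigned char out[4]) {
        if (x >= 0xD800 && x <= 0xDFFF) return 0;    /* surrogate hole */
        if (x <= 0x7F) {                             /* 0xxxxxxx */
            out[0] = (unsigned char)x;
            return 1;
        }
        if (x <= 0x7FF) {                            /* 110xxxxx 10xxxxxx */
            out[0] = (unsigned char)(0xC0 | (x >> 6));
            out[1] = (unsigned char)(0x80 | (x & 0x3F));
            return 2;
        }
        if (x <= 0xFFFF) {                           /* 1110xxxx 10xxxxxx 10xxxxxx */
            out[0] = (unsigned char)(0xE0 | (x >> 12));
            out[1] = (unsigned char)(0x80 | ((x >> 6) & 0x3F));
            out[2] = (unsigned char)(0x80 | (x & 0x3F));
            return 3;
        }
        if (x <= 0x10FFFF) {                         /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
            out[0] = (unsigned char)(0xF0 | (x >> 18));
            out[1] = (unsigned char)(0x80 | ((x >> 12) & 0x3F));
            out[2] = (unsigned char)(0x80 | ((x >> 6) & 0x3F));
            out[3] = (unsigned char)(0x80 | (x & 0x3F));
            return 4;
        }
        return 0;                                    /* beyond U-10FFFF */
    }

For brevity the sketch does not reject the values FFFE and FFFF; as with the other value-range checks, that is handled by examining the code point value directly (see property 3 below).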

Properties of UTF-8

  1. Not all possible octet sequences can be produced. In particular, any octet whose value is ≥ F8 can never appear. Furthermore, false aliases are disallowed: for example C1 BF, which uses the 80:7FF form to encode U-7F instead of the required 0:7F form. (In general, only the shortest encoding amongst the various aliases is allowed.) The only valid outputs are those produced by the mapping given above, and decoders must detect when an invalid sequence has been encountered.

  2. By examining the top two bits of each octet alone, it is possible to determine which of 3 roles a UTF-8 octet plays (00 and 01 are ASCII, 11 is the start of a multi-octet encoding, and 10 is a continuation octet of a multi-octet encoding.) When an error is encountered while decoding UTF-8, if the policy is not to halt the decoding, then a resynch can be performed by scanning up to 5 octets past the point of the error (allowing for potentially obsolete encoders which mapped values beyond the Unicode code point range) until an octet whose top two bits are not 10 is found; a sketch of this follows the list.

  3. Detecting the invalid code point ranges D800:DFFF, FFFE:FFFF, and 110000:1FFFFF cannot be done from the UTF-8 bit patterns alone. The decoded code point value must be examined directly to catch these potentially invalid values.

  4. Since the encoding is octet based there are no endianness issues. In particular, leading BOM characters (U-FEFF) are unnecessary and do not imply anything about the data stream.

  5. If the octets from a valid UTF-8 stream are viewed as unsigned 8 bit values, then the lexical sorting order of UTF-8 is identical to the sorting order of UTF-32. Note that this is not the same as collating.

  6. There is no special EOF character. In particular control characters in the ASCII range don't specify anything other than raw code point data values.
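
Here is a minimal sketch in C of the resynch described in property 2 (the function name and buffer interface are my own illustration):

    /* After a decoding error, skip forward past continuation octets
       (top two bits 10), scanning at most 5 octets, and return a
       pointer to the octet where decoding should resume. */
    const unsigned char *utf8_resync(const unsigned char *p,
                                     const unsigned char *end) {
        int i;
        p++;                                  /* step past the bad octet */
        for (i = 0; i < 5 && p < end; i++, p++) {
            if ((*p & 0xC0) != 0x80) break;   /* not a continuation octet */
        }
        return p;
    }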

The UTF-16 mapping

A UTF-16 mapping takes valid Unicode code point values and translates them into one or two 16 bit values. Each 16 bit value is encoded as a pair of octets. An encoder will simply write the 16 bit values in sequential order, and a decoder will read the 16 bit values one at a time and try to fit them to a reverse mapping. The mapping from a valid Unicode code point value x (= x20 x19 x18 x17 x16 x15 x14 x13 x12 x11 x10 x9 x8 x7 x6 x5 x4 x3 x2 x1 x0 in binary notation, x20 being the most significant bit) to UTF-16 is as follows:

if U-0 ≤ x ≤ U-FFFF (excluding D800:DFFF, see property 1 below) then UTF-16(x) =
x15 x14 x13 x12 x11 x10 x9 x8 x7 x6 x5 x4 x3 x2 x1 x0
if U-10000 ≤ x ≤ U-10FFFF then UTF-16(x = y + 0x10000) =
1 1 0 1 1 0 y19 y18 y17 y16 y15 y14 y13 y12 y11 y10
1 1 0 1 1 1 y9 y8 y7 y6 y5 y4 y3 y2 y1 y0

Notice the value shift when encoding values in the 10000:10FFFF range (i.e., a value y is first computed by subtracting 0x10000 from x, then the bits of y are encoded as shown). This prevents redundant encodings.
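
A minimal sketch of this mapping in C follows; again, the function name utf16_encode is my own illustration, and the byte order of each 16 bit value is assumed to be handled elsewhere.

    #include <stddef.h>

    /* Encode a valid Unicode code point as UTF-16.  Writes 1 or 2
       16 bit values into out and returns the count, or returns 0 for
       values UTF-16 cannot or must not encode (D800:DFFF, > 10FFFF). */
    size_t utf16_encode(unsigned long x, unsigned short out[2]) {
        if ((x >= 0xD800 && x <= 0xDFFF) || x > 0x10FFFF) return 0;
        if (x <= 0xFFFF) {                    /* BMP: a single 16 bit value */
            out[0] = (unsigned short)x;
            return 1;
        }
        x -= 0x10000;                         /* the value shift: 20 bits remain */
        out[0] = (unsigned short)(0xD800 | (x >> 10));    /* leading surrogate  */
        out[1] = (unsigned short)(0xDC00 | (x & 0x3FF));  /* trailing surrogate */
        return 2;
    }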

Properties of UTF-16

  1. There is no way to encode values from the range D800:DFFF, as these bit patterns (called surrogates) are used as escapes for encoding the 10000:10FFFF range. When encoding to UTF-16, invalid code points in the range D800:DFFF must be detected and (regardless of policy) such values must not be output. (Editorial note: Interestingly, there is no reuse of the underutilized top portion of the surrogate encoding space to map back to this missing hole. Doing so would have allowed UTF-8 to recover 2048 useful values in its shorter encodings (remember, UTF-8, and Unicode in general, threw out values solely in deference to the limitations of the UTF-16 encoding). This should not be surprising, as UTF-8 and UTF-16 were developed independently; however, it does appear as though UTF-8 conceded part of its shorter mappings, while UTF-16 did nothing to facilitate compatibility between them.)

  2. Surrogates always come in pairs: one in the range D800:DBFF followed immediately by one in the range DC00:DFFF. If a lone DC00:DFFF value is encountered while decoding UTF-16, the decoder may attempt to skip the error (policy allowing) by skipping that single 16 bit value. Likewise, if a leading D800:DBFF value is not followed by a DC00:DFFF surrogate, the decoder may attempt to skip the error (policy allowing) by skipping that single 16 bit value.

  3. Any encountered value FFFF is invalid. Any encountered FFFE is invalid if it does not appear at the start of the UTF-16 stream.

  4. Since the encoding is 16 bit based, there is an endianness issue. Ordinary UTF-16 streams start with the U-FEFF BOM character. If an apparent FFFE value appears at the start, the decoder should assume that its endianness is mismatched with the data stream. When the decoder detects such a mismatch, it should reverse the order of each octet pair consumed, and this endian state should be tracked throughout the decoding process (see the sketch after this list). If processing a multi-part UTF-16 stream, the BOM character may be omitted in the parts following the first one; however, the decoder should expect the endianness to be consistent amongst the parts.

    A starting BOM character should be output only by a UTF-16 encoder; it has no meaning in any of the other encodings. All UTF-16 encoders should output the BOM character at the beginning of the first part of any UTF-16 stream, and maintain a consistent endianness throughout the encoding of a single given stream.

  5. A UTF-16 stream should always be composed of an even number of octets. If an odd number of octets is encountered, the last one should be considered an erroneously encoded element (though multi-part semantics may complicate this; that is beyond what UTF-16 specifies.)

  6. UTF-16 has no useful lexical sorting properties relative to the other Unicode encoding formats.

  7. There is no special EOF character. In particular control characters in the ASCII range don't specify anything other than raw code point data values.
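
To make the BOM handling in property 4 concrete, here is a minimal sketch in C, assuming a decoder that reads each octet pair in big endian order; the function names are my own illustration.

    /* Inspect the first octet pair of a UTF-16 stream, reading it as
       big endian.  Returns 1 if every subsequent octet pair must be
       byte swapped, 0 otherwise (a missing BOM is assumed to match). */
    int utf16_detect_swap(const unsigned char *p) {
        unsigned int first = ((unsigned int)p[0] << 8) | p[1];
        if (first == 0xFEFF) return 0;        /* BOM matches our read order */
        if (first == 0xFFFE) return 1;        /* mismatch: swap each pair   */
        return 0;                             /* no BOM present             */
    }

    /* Read one 16 bit value, honoring the swap state detected above. */
    unsigned short utf16_read(const unsigned char *p, int swapped) {
        unsigned short v = (unsigned short)(((unsigned int)p[0] << 8) | p[1]);
        return swapped ? (unsigned short)((v >> 8) | (v << 8)) : v;
    }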

Unicode private areas

In addition to specifying a number of convoluted ways in which code points can be combined and sized, which direction they flow in visible representations, and how they are collated, and giving each defined character a name, the Unicode standard specifies ranges reserved for private use.

These ranges are Area-A, which is U-F0000:U-FFFFD, and Area-B, which is U-100000:U-10FFFD. These 131068 code points are meant for developers to encode application specific metadata; i.e., ordinary text should never include values in these ranges. For example, if an encoder/decoder has a policy of tagging encountered errors rather than simply halting, it can guess at the value that was probably intended (if it falls in a small enough range, which it typically will), add the base value U-F0000 to it, and insert that instead. The errors can then be dealt with in a subsequent step, after the rest of the message has been decoded.
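
As a hypothetical sketch of such a tagging policy (the choice of Area-A as the base is only the example the text suggests, and the helper name is my own):

    /* Replace an undecodable octet with an Area-A private use code
       point so the error survives decoding for a later cleanup pass. */
    unsigned long tag_decode_error(unsigned char bad_octet) {
        return 0xF0000UL + bad_octet;         /* lands inside U-F0000:U-FFFFD */
    }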

Unicode

What I have presented above is not the meat of the content of Unicode; these are merely commonly used transfer formats. If you wanted to, you could invent your own encoding format, and it would make no difference to the Unicode Standard. The real guts of Unicode start with the mappings from code points to actual glyphs and graphemes. Some graphemes (a generalized notion of a character) can be encoded as a sequence of code points in multiple ways, and these different encodings of the same grapheme still need to be treated as representing that one grapheme. The standard also gives widths, directions, and other character attributes.
