UTF-8 Basics

UTF-8 (8-bit Unicode Transformation Format) is a variable-length encoding for characters defined by Unicode, created by Rob Pike and Ken Thompson. UTF-8 uses sequences of bytes to represent the Unicode code points that cover the alphabets of many of the world's languages. It is especially useful for transmission over 8-bit mail systems.

UTF-8 uses 1 to 4 bytes per character, depending on the Unicode code point. For example, a single byte is enough to encode each of the 128 US-ASCII characters, which occupy the Unicode range U+0000 to U+007F.
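The variable length is easy to observe with Python's built-in codec. This is a minimal sketch; the helper name utf8_length and the sample characters are chosen here for illustration only.

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# One sample per length class: U+0041, U+00F1, U+20AC, U+1F600.
samples = ["A", "ñ", "€", "😀"]

for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s)")
```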

Although it might seem inefficient to represent Unicode characters with up to 4 bytes, UTF-8 allows older systems to transmit characters from this superset of ASCII. In addition, it is still possible to use data compression regardless of the use of UTF-8.

The IETF requires that all Internet protocols indicate which character encoding they use for text and that UTF-8 be one of the supported encodings.

Description

UTF-8 is currently standardized as RFC 3629 (UTF-8, a transformation format of ISO 10646).

In summary, the bits of the Unicode code point are split into several groups, which are then distributed among the lowest-order bit positions of the UTF-8 bytes.

Code points below 128 (decimal) are encoded with a single byte that contains their value; this corresponds exactly to the 128 characters of 7-bit ASCII.

In all other cases, 2 to 4 bytes are used. The most significant bit of every byte in such a sequence is always 1, to prevent confusion with 7-bit ASCII characters, particularly those below 32 (decimal), traditionally called control characters (for example, carriage return).

Code range (hex)  UTF-16 (binary)                      UTF-8 (binary)                       Notes
000000 – 00007F   00000000 0xxxxxxx                    0xxxxxxx                             ASCII equivalent range; the single byte starts with zero
000080 – 0007FF   00000xxx xxxxxxxx                    110xxxxx 10xxxxxx                    The first byte starts with 110; the next byte starts with 10
000800 – 00FFFF   xxxxxxxx xxxxxxxx                    1110xxxx 10xxxxxx 10xxxxxx           The first byte starts with 1110; the next bytes start with 10
010000 – 10FFFF   110110xx xxxxxxxx 110111xx xxxxxxxx  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  UTF-16 requires surrogates; an offset of 0x10000 is subtracted, so the bit pattern is not identical to UTF-8
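The UTF-8 column of the table can be turned into a small encoder. The following is a sketch for illustration, not a replacement for a library codec; the function name encode_code_point is hypothetical, and the rejection of surrogate code points (U+D800 to U+DFFF) follows RFC 3629 rather than anything stated in the table.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point using the bit layouts from the table."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= cp <= 0xDFFF:
        # Surrogates are reserved for UTF-16 and are not valid in UTF-8.
        raise ValueError("surrogate code points cannot be encoded")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


print(encode_code_point(0x20AC).hex())
```

Each branch masks off six payload bits per continuation byte (0x3F) and places the remaining high bits in the lead byte, which is exactly the distribution the x positions in the table describe.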

For example, the character eñe (ñ), which is U+00F1 in Unicode, is encoded in UTF-8 as follows:

  • Its value is in the range 0x0080 to 0x07FF. The table shows that it must be encoded using 2 bytes, with the format 110xxxxx 10xxxxxx.
  • The hexadecimal value 0x00F1 is equivalent to binary (0000-0)000-1111-0001 (the first 5 bits are ignored because they are not needed to represent values in this range).
  • The 11 remaining bits are placed, in order, in the positions marked with x: 11000011 10110001.
  • The final result is two bytes with the hexadecimal values 0xC3 0xB1. That is the code for the letter eñe in UTF-8.
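The same steps can be reproduced with bit strings. This is a small sketch that follows the walkthrough literally; the variable names are illustrative.

```python
code_point = 0x00F1                    # ñ

# Step 2: keep only the 11 bits needed for the range 0x0080-0x07FF.
bits = format(code_point, "011b")      # '00011110001'

# Step 3: fill the x positions of 110xxxxx 10xxxxxx from left to right.
first_byte = "110" + bits[:5]          # '11000011'
second_byte = "10" + bits[5:]          # '10110001'

# Step 4: convert the two bit patterns into actual bytes.
encoded = bytes([int(first_byte, 2), int(second_byte, 2)])

print(first_byte, second_byte)         # 11000011 10110001
print(encoded.hex())                   # c3b1
```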

Thus, the first 128 characters require one byte. The next 1920 characters need two bytes. These include Latin letters with diacritics and the Greek, Cyrillic, Coptic, Armenian, Hebrew, and Arabic alphabets. The remaining UCS-2 characters use three bytes, and supplementary characters are encoded with 4 bytes. (An early specification allowed even more code points to be represented, using 5 or 6 bytes, but this was not widely adopted.)

In fact, the original UTF-8 design allowed sequences of up to 6 bytes, covering the full range 0x00-0x7FFFFFFF (31 bits). In November 2003, RFC 3629 restricted UTF-8 to the range covered by the formal definition of Unicode, 0x00-0x10FFFF. Before this, only the bytes 0xFE and 0xFF never occurred in UTF-8 encoded text. With the restriction, the number of byte values that never appear in valid UTF-8 rose to 13: 0xC0, 0xC1, and 0xF5-0xFF. Although the new definition severely limits the encodable range, it helps with the problem of overlong sequences (different ways of encoding the same character, which can be a security risk): a two-byte overlong form necessarily starts with one of the unused bytes 0xC0 or 0xC1, and RFC 3629 declares all overlong forms invalid, so a conforming decoder rejects them.
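The unused byte values, and the rejection of a sequence that encodes a character in more bytes than necessary, can be checked against a strict decoder. The sketch below relies on Python's built-in "utf-8" codec with its default strict error handling; the names UNUSED_BYTES and is_valid_utf8 are introduced here for illustration.

```python
# Byte values that never appear in UTF-8 limited to U+0000..U+10FFFF.
UNUSED_BYTES = [0xC0, 0xC1] + list(range(0xF5, 0x100))


def is_valid_utf8(data: bytes) -> bool:
    """Return True if a strict decoder accepts data as UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


print(len(UNUSED_BYTES))               # 13
print(is_valid_utf8(b"\xc3\xb1"))      # True: the normal encoding of ñ
print(is_valid_utf8(b"\xc0\xaf"))      # False: two-byte form of "/" (U+002F)
print(is_valid_utf8(b"\xe0\x80\xaf"))  # False: three-byte form of "/"
```

The three-byte case contains none of the 13 unused byte values, so a decoder has to check the decoded value against the minimum for the sequence length rather than rely on the byte list alone.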

Reasoning behind the mechanics of UTF-8

As a consequence of the exact mechanics of UTF-8, byte sequences have the following properties:

  • The most significant bit of a single-byte character is always 0.
  • The most significant bits of the first byte of a multi-byte sequence determine the length of the sequence. These most significant bits are 110 for two-byte sequences, 1110 for three-byte sequences, and so on.
  • The remaining bytes in a multi-byte sequence have 10 as their 2 most significant bits.
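These prefix rules are enough to classify any byte by inspection. The sketch below uses a hypothetical helper named sequence_length; it only looks at the prefix bits and does not validate the rest of the sequence.

```python
def sequence_length(first_byte: int) -> int:
    """Return the sequence length announced by a lead byte's prefix bits."""
    if first_byte >> 7 == 0b0:
        return 1  # 0xxxxxxx: single-byte character
    if first_byte >> 5 == 0b110:
        return 2  # 110xxxxx
    if first_byte >> 4 == 0b1110:
        return 3  # 1110xxxx
    if first_byte >> 3 == 0b11110:
        return 4  # 11110xxx
    # 10xxxxxx is a continuation byte; anything else is not a lead byte.
    raise ValueError(f"0x{first_byte:02X} cannot start a sequence")


for lead in "Añ€😀".encode("utf-8")[::1]:
    try:
        print(f"0x{lead:02X}: lead byte, length {sequence_length(lead)}")
    except ValueError:
        print(f"0x{lead:02X}: continuation byte")
```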

UTF-8 was designed to satisfy these properties, ensuring that no byte sequence of one character is contained within a longer byte sequence of another character. This ensures that substring matching can be applied to search for words or phrases within a text; some old 8-bit variable length encodings (such as Shift-JIS) did not have this property and thus made it more difficult to implement string search algorithms. Although it has been argued that this property adds redundancy to UTF-8 encoded text, the advantages outweigh this concern; furthermore, data compression is not one of the goals of Unicode and must be considered separately.
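Because continuation bytes are always distinguishable from lead bytes, a plain byte search finds only whole characters, and any byte offset can be moved back to a character boundary. The following sketch assumes well-formed UTF-8 input; the helper char_start is hypothetical.

```python
def char_start(data: bytes, index: int) -> int:
    """Move index back to the first byte of the character that contains it."""
    # Continuation bytes have the form 10xxxxxx, i.e. (byte & 0xC0) == 0x80.
    while index > 0 and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index


text = "mañana €5".encode("utf-8")
needle = "ñ".encode("utf-8")

# A byte-level search is safe: the match cannot begin inside another character.
position = text.find(needle)
print(position)              # 2

# Offset 10 is the last byte of the euro sign, which starts at offset 8.
print(char_start(text, 10))  # 8
```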

Advantages

  • Of course, the most notable advantage of any Unicode Transformation Format over legacy encodings is that it can encode any character.
  • Some Unicode symbols (including the basic Latin alphabet) take 1 byte, while others take up to 4. Thus, UTF-8 generally saves space compared to UTF-16 or UTF-32 where 7-bit ASCII characters are common.
  • The byte sequence for one character is never part of a longer sequence for another character, unlike in older encodings such as Shift-JIS.
  • The first byte of a multi-byte sequence is sufficient to determine the length of a multi-byte sequence. This makes it extremely simple to extract a substring from a given string without doing a thorough analysis.
  • Most existing software (including operating systems) was not written with Unicode in mind, and using Unicode with it can cause compatibility problems. For example, the standard library of the C programming language marks the end of a string with the single byte 0x00. In UTF-16, the English letter A is encoded as 0x0041. The library treats the 0x00 byte as the end of the string and ignores the rest. In UTF-8, however, the bytes of a multi-byte sequence never take the value of any ASCII character, including 0x00 and the other control characters, which prevents these and similar problems.
  • Strings in UTF-8 can be sorted using standard byte-oriented sorting routines, and the result follows Unicode code point order (however, such routines do not treat upper and lower case as equivalent for characters encoded with byte values above 127).
  • UTF-8 is the default encoding for XML.
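Two of the points in this list, the absence of stray 0x00 bytes and byte-oriented sorting, can be checked directly. This is a minimal sketch; the helper contains_nul and the word list are made up for illustration, and big-endian UTF-16 without a byte order mark is assumed for the comparison.

```python
def contains_nul(data: bytes) -> bool:
    """Return True if the byte 0x00 occurs anywhere in data."""
    return 0 in data


# UTF-16 encodes "A" as 0x00 0x41, so a C-style string would end immediately.
print(contains_nul("A".encode("utf-16-be")))  # True
# In UTF-8, 0x00 appears only if the text itself contains U+0000.
print(contains_nul("Añ€😀".encode("utf-8")))   # False

words = ["zebra", "Ångström", "apple", "ñandú", "日本", "😀"]

# Sorting by raw UTF-8 bytes gives the same order as sorting by code points
# (Python compares str values code point by code point).
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
by_code_points = sorted(words)
print(by_bytes == by_code_points)             # True
```

Code point order is not the same as language-aware collation, so this ordering is suitable for indexes and lookups rather than for alphabetical lists shown to users.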

Disadvantages

  • UTF-8 is variable in length; that is, different characters take byte sequences of different lengths. The impact of this can be reduced, however, by working with UTF-8 strings through an abstract interface that makes the encoding transparent to the user.
  • A poorly written UTF-8 parser could accept several different pseudo-UTF-8 representations of a character and convert them all to the same Unicode output.
  • Most ideographic characters use 3 bytes in UTF-8 but only 2 in UTF-16. Thus, Chinese, Japanese, and Korean texts generally take more space when encoded in UTF-8.
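The size trade-off in the last point can be measured for any given text. The sketch below compares encoded sizes using Python's built-in codecs; the helper encoded_sizes is hypothetical, and the little-endian variants are used so that no byte order mark is counted.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded size in bytes of text for three Unicode encodings."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {name: len(text.encode(name)) for name in encodings}


print(encoded_sizes("hello"))  # {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}
print(encoded_sizes("日本語"))   # {'utf-8': 9, 'utf-16-le': 6, 'utf-32-le': 12}
```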

History

UTF-8 was invented by Ken Thompson, together with Rob Pike, on September 2, 1992, on a placemat in a New Jersey diner. The next day, Pike and Thompson implemented it and integrated it into their Plan 9 operating system.

UTF-8 was officially presented at the USENIX conference in San Diego (California) in January 1993.
