Understanding character encoding & encoding problems

ASCII encoding

The ASCII encoding provides a standard way to represent characters as numbers from 0 to 127, called code points, each of which fits in 7 bits. For instance:

Character Decimal Binary
A 65 1000001
B 66 1000010
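
The table above can be reproduced with a few lines of Python. This is only a small sketch: the helper name `ascii_row` is made up for illustration, and it relies on the built-in `ord` and `format` functions.

```python
def ascii_row(ch):
    """Return (character, decimal code point, 7-bit binary string)."""
    code_point = ord(ch)
    if code_point > 127:
        # Anything above 127 does not fit in ASCII's 7 bits.
        raise ValueError(f"{ch!r} is not an ASCII character")
    return ch, code_point, format(code_point, "07b")


for ch in "AB":
    print(*ascii_row(ch))
```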

8-bit character encodings (Extended ASCII)

Seven bits are not enough to cover letters used in European languages other than English, such as á, ü, ç, and ñ. Because a byte has 8 bits and ASCII leaves one of them unused, many encodings were created that use the full byte, giving 256 possible code points instead of 128. ISO 8859-1 (also called latin1) is one such encoding.
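
As a rough illustration, assuming Python's built-in `latin-1` and `ascii` codecs, the accented letters mentioned above each fit in one byte in latin1, with the eighth bit set, while ASCII cannot encode them at all. The variable names are arbitrary.

```python
text = "\u00e1\u00fc\u00e7\u00f1"  # the letters á, ü, ç, ñ

# latin1 uses exactly one byte per character, all of them above 127.
latin1_bytes = text.encode("latin-1")
print([format(b, "08b") for b in latin1_bytes])

# ASCII only knows code points 0 to 127, so encoding fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print("encodable as ASCII:", ascii_ok)
```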

Multibyte encodings

Other languages have far more characters than the 256 values a single byte can hold. Big5, which covers Traditional Chinese, is an example of an encoding that uses two bytes per symbol (2^16 = 65536 possible values).
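
A quick check of the two-bytes-per-symbol claim, assuming the `big5` codec that ships with CPython. The sample text is just an arbitrary pair of common Chinese characters.

```python
text = "\u4e2d\u6587"  # two Chinese characters: 中文

big5_bytes = text.encode("big5")

# Two characters end up as four bytes, i.e. two bytes per symbol.
print(len(text), "characters ->", len(big5_bytes), "bytes")
```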

Unicode

Unicode, just like ASCII, defines a table of code points for characters. However, unlike ASCII, Unicode is not an encoding: different encodings can represent the same Unicode code point as different sequences of bits.

Unicode code points are written in hexadecimal, prefixed with “U+”. For example, Ḁ has the Unicode code point U+1E00.
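
The “U+” notation is easy to produce from a character. A minimal sketch, where `u_notation` is a hypothetical helper name and the formatting pads to at least four hexadecimal digits as is conventional:

```python
def u_notation(ch):
    """Format a single character's code point in U+XXXX notation."""
    return f"U+{ord(ch):04X}"


print(u_notation("\u1e00"))  # the character Ḁ
print(u_notation("A"))

# Going the other way: from a code point back to the character.
char_from_code_point = chr(0x1E00)
```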

UTF-8, UTF-16 and UTF-32 are encodings capable of encoding all code points in Unicode. UTF-32 always uses 4 bytes per code point, whereas UTF-8 and UTF-16 are variable-length encodings.

The advantage of a variable-length encoding is that it saves a lot of space for characters that can be represented in a single byte (e.g. the whole English alphabet). For instance, UTF-8 encodes ‘A’ in one byte but can still encode a Chinese character using 3 or more bytes. Variable-length encodings use the highest bits of a byte to signal how many bytes a character consists of. They can also waste space, especially UTF-8, when these signal bits are needed often. For example, UTF-8 is not the most compact choice for Chinese text: most characters take 3 bytes because of the signal bits, while UTF-16 needs only 2 bytes for the same characters (and 4 bytes for the less common ones outside the Basic Multilingual Plane).
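
To make the size trade-off concrete, here is a small sketch that counts the bytes each encoding needs for a few sample characters. The helper name `encoded_lengths` and the sample characters are arbitrary choices; the `-le` codec variants are used so that Python does not prepend a byte order mark to the output.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_lengths(ch):
    """Return the byte length of ch in UTF-8, UTF-16 and UTF-32."""
    return tuple(len(ch.encode(enc)) for enc in ENCODINGS)


samples = {
    "A": "Latin letter",
    "\u00f8": "Latin letter with stroke",
    "\u4e2d": "common Chinese character",
    "\U0001f600": "emoji outside the Basic Multilingual Plane",
}

for ch, description in samples.items():
    print(description, encoded_lengths(ch))

# The signal bits: a leading 1110 on the first byte announces a
# three-byte sequence, and each continuation byte starts with 10.
utf8_bits = [format(b, "08b") for b in "\u4e2d".encode("utf-8")]
print(utf8_bits)
```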

ASCII maps 1:1 onto UTF-8

Unicode is a superset of ASCII: the ASCII code points (numbers from 0 to 127) have the same value in Unicode. For example, the code point 65 is the Latin capital letter 'A' in both Unicode and ASCII.

Every character in the ASCII encoding results in the exact same byte in UTF-8. Any character not in ASCII results in two or more bytes in UTF-8.
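
This property can be verified exhaustively, since there are only 128 ASCII code points. A short sketch using Python's built-in codecs (variable names are illustrative):

```python
# Every ASCII code point encodes to the same single byte in UTF-8.
ascii_matches_utf8 = all(
    chr(cp).encode("ascii") == chr(cp).encode("utf-8") == bytes([cp])
    for cp in range(128)
)

# The first code point past ASCII already needs two bytes in UTF-8.
first_non_ascii = chr(128).encode("utf-8")

print(ascii_matches_utf8, len(first_non_ascii))
```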

Reading in the wrong encoding

When an application (such as a text editor or a browser) reads and displays a document while assuming the wrong encoding, it will typically show unreadable text:

ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔǵÇ≠ǻǢ

If the encoding has not been explicitly specified, the application may try to guess it by analyzing the byte sequences. However, it may still settle on an encoding that displays ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔǵÇ≠ǻǢ, because those bytes are perfectly valid in that encoding. It is hard for the computer to know whether this was the text the author really intended.

When a program encounters bytes that are invalid in the chosen encoding, it often replaces them with ? or silently discards them. When handling Unicode, it may instead insert the “Unicode replacement character” (U+FFFD).
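
Python's `bytes.decode` exposes these behaviours through its `errors` argument, which makes for a compact demonstration. The sketch below decodes latin1 bytes as if they were UTF-8; the sample bytes are chosen for illustration only.

```python
latin1_bytes = b"F\xf8\xf6"  # "Føö" encoded in latin1

# Strict decoding (the default) refuses the invalid bytes outright.
try:
    latin1_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# "replace" substitutes U+FFFD for each invalid byte.
replaced = latin1_bytes.decode("utf-8", errors="replace")

# "ignore" silently drops the invalid bytes.
ignored = latin1_bytes.decode("utf-8", errors="ignore")

print(strict_ok, repr(replaced), repr(ignored))
```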

Convert from one encoding to another

Converting from one encoding to another means keeping the content of the string unchanged while re-encoding each character in the target encoding. For example, the string "Føö" encoded in latin1 has the following bytes:

Character Binary
F 01000110
ø 11111000
ö 11110110

If we encode that same string into UTF-8, it will result in the following bytes:

Character Binary
F 01000110
ø 11000011 10111000
ö 11000011 10110110

Notice that the character F is encoded as the same byte in both cases: F is an ASCII character, and both latin1 and Unicode extend the ASCII code points.
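
In code, a conversion is a decode with the source encoding followed by an encode with the target encoding. A minimal sketch of the latin1 to UTF-8 conversion above (the function name `convert` is hypothetical):

```python
def convert(data, source_encoding, target_encoding):
    """Re-encode bytes from one character encoding to another."""
    text = data.decode(source_encoding)   # bytes -> characters
    return text.encode(target_encoding)   # characters -> bytes


latin1_bytes = b"F\xf8\xf6"  # "Føö" in latin1
utf8_bytes = convert(latin1_bytes, "latin-1", "utf-8")

print(" ".join(format(b, "08b") for b in latin1_bytes))
print(" ".join(format(b, "08b") for b in utf8_bytes))
```

The conversion only gives the right result when `source_encoding` really is the encoding the bytes were written in.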

If a program reads a string in the wrong encoding and then saves it in another encoding, the original string is probably lost or will be hard to recover. For instance, if the string "Føö" encoded in UTF-8 is read as latin1, the program will display the characters FÃ¸Ã¶. It displays 5 characters because the string takes 5 bytes in UTF-8, while latin1 uses a single byte for each symbol. If the program then saves that decoded string using UTF-8, the resulting bytes will be 01000110 11000011 10000011 11000010 10111000 11000011 10000011 11000010 10110110, and our original string and its bytes are lost.
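
The scenario can be replayed step by step in Python. This is a sketch under one assumption worth stating: Python's `latin-1` codec maps every possible byte to a character, so in this particular case the damage can be undone by reversing each step. With other encodings, or once bytes have been replaced or discarded, the same repair may not be possible.

```python
original = "F\u00f8\u00f6"  # "Føö"

# Step 1: UTF-8 bytes are wrongly read as latin1 (5 bytes -> 5 characters).
misread = original.encode("utf-8").decode("latin-1")

# Step 2: the misread text is saved again as UTF-8 (now 9 bytes).
double_encoded = misread.encode("utf-8")

print(repr(misread), len(double_encoded))

# Repair attempt: undo step 2, then undo step 1.
recovered = double_encoded.decode("utf-8").encode("latin-1").decode("utf-8")
print(recovered == original)
```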
