UTF-8 and Extended Characters
June 12, 2008
| Character Range (hex) | Unicode (UCS-2/UTF-16) | UTF-8 |
|---|---|---|
| 0-7F | 00000000 0xxxxxxx | 0xxxxxxx |
| 80-7FF | 00000xxx xxxxxxxx | 110xxxxx 10xxxxxx |
| 800-FFFF | xxxxxxxx xxxxxxxx | 1110xxxx 10xxxxxx 10xxxxxx |
| 10000-1FFFFF | out of range (UTF-16 uses surrogate pairs up to 10FFFF) | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
| 200000-3FFFFFF | out of range | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
| 4000000-7FFFFFFF | out of range | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
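The bit layouts in the table can be turned into a small encoder. The following is only a sketch: `encode_utf8` is a hypothetical helper name, not a library function. It follows the original six-row scheme, where each row uses one lead byte plus one more `10xxxxxx` continuation byte than the row before it. Modern UTF-8 stops at U+10FFFF (four bytes), which is why Python's built-in encoder is used for comparison only up to that range.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one code point using the lead-byte/continuation-byte layout."""
    if code_point < 0 or code_point > 0x7FFFFFFF:
        raise ValueError("code point outside the 31-bit range")
    if code_point <= 0x7F:
        # Plain ASCII: a single byte with the high bit clear.
        return bytes([code_point])

    # (upper limit of the range, lead-byte marker, number of continuation bytes)
    rows = [
        (0x7FF,      0b11000000, 1),
        (0xFFFF,     0b11100000, 2),
        (0x1FFFFF,   0b11110000, 3),
        (0x3FFFFFF,  0b11111000, 4),
        (0x7FFFFFFF, 0b11111100, 5),
    ]
    for limit, marker, n_cont in rows:
        if code_point <= limit:
            # The lead byte carries the topmost payload bits.
            lead = marker | (code_point >> (6 * n_cont))
            # Each continuation byte carries the next 6 bits, prefixed with 10.
            tail = [
                0b10000000 | ((code_point >> (6 * i)) & 0b00111111)
                for i in range(n_cont - 1, -1, -1)
            ]
            return bytes([lead, *tail])
    raise AssertionError("unreachable")


# Compare against Python's built-in encoder for a few valid code points.
for ch in "aβ€😀":
    assert encode_utf8(ord(ch)) == ch.encode("utf-8")
```

The last two rows of the table exist only in the original design, so they cannot be checked against the built-in encoder.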
Note that every byte of a multi-byte UTF-8 character has its high bit set to one, and only the first byte of a multi-byte character has both of its two highest bits set; continuation bytes always begin with 10. This means there can never be confusion about where a character starts. So in UTF-8, the combined Greek and Latin sequence aβcδe is represented by the following seven bytes, and looking at the high bits you can pick out the extended characters without too much trouble:
01100001 11001110 10110010 01100011 11001110 10110100 01100101
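Picking out character boundaries from the high bits can be sketched in a few lines of Python. The helper name `char_starts` is made up for this illustration; it treats any byte that does not begin with the bits 10 as the start of a character.

```python
def char_starts(data: bytes) -> list[int]:
    """Return the offsets of bytes that begin a character.

    Continuation bytes look like 10xxxxxx; everything else (0xxxxxxx for
    ASCII, 11xxxxxx for a lead byte) starts a new character.
    """
    return [i for i, b in enumerate(data) if (b & 0b11000000) != 0b10000000]


data = "aβcδe".encode("utf-8")

# Show the same seven bytes in binary.
print(" ".join(f"{b:08b}" for b in data))

# Offsets where a character begins: a, β (lead byte), c, δ (lead byte), e.
print(char_starts(data))  # [0, 1, 3, 4, 6]
```

Because the test looks at each byte in isolation, a decoder dropped into the middle of a stream can resynchronise at the next character start.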
Now the really clever bit about UTF-8 is that it can pass unharmed through systems that only know ASCII and its 8-bit extensions [programs which don’t even recognize UTF-8], thanks to the fact that each character beyond U+007F looks like a valid sequence of extended ASCII characters when read one byte per character. This is in stark contrast to other Unicode encodings such as UCS-2, which are full of zero bytes and therefore wreak havoc with ASCII processing systems. To a system that reads the bytes as ISO-8859-1 (Latin-1), the UTF-8 representation of aβcδe parses as aÎ²cÎ´e. On the surface this may seem like a corruption, but the important thing to note is that no byte in a UTF-8 bytestream is illegal to an 8-bit-clean extended ASCII system, and no byte of a multi-byte character can be mistaken for a 7-bit ASCII character, so the same string can be read and written out again as raw 8-bit text and then decoded later as the original UTF-8. With the exception of 7-bit text systems [a legacy email standard, unfortunately, for which the hideous UTF-7 had to be invented], UTF-8 should be able to pass through ASCII systems unscathed.
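The round trip can be reproduced in Python. This sketch assumes the "ASCII system" is 8-bit clean and interprets bytes as ISO-8859-1 (Latin-1); other extended ASCII code pages would show different garbage characters and may not map every byte.

```python
original = "aβcδe"
utf8_bytes = original.encode("utf-8")

# What a Latin-1 system "sees" when it reads the UTF-8 bytes one per character.
as_seen = utf8_bytes.decode("latin-1")
print(as_seen)  # aÎ²cÎ´e

# The system writes the text back out; the bytes are unchanged.
written_back = as_seen.encode("latin-1")
assert written_back == utf8_bytes

# A UTF-8 aware program downstream recovers the original string.
restored = written_back.decode("utf-8")
assert restored == original

# For contrast, the 16-bit encoding of the same text contains zero bytes,
# which C-style string handling would treat as terminators.
utf16_bytes = original.encode("utf-16-be")
print(utf16_bytes.count(0))  # 3
assert 0 not in utf8_bytes
```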