Unicode

Authors: Joseph Becker
Lee Collins
Mark Davis
Date: October 1991
Maintainers: The Unicode Consortium
Website: http://www.unicode.org/

The Unicode Standard (Unicode, coined by Becker in 1988) is a computing industry technical standard for the encoding and representation of text written in most of the world's writing systems. [3] The design of Unicode is based on the simplicity and consistency of ASCII, but goes beyond ASCII's limited ability to encode only the `Latin alphabet`_. [7]

The Unicode Standard and ISO/IEC 10646 support three encoding forms (UTF-8, UTF-16, UTF-32) that use a common set of characters. These encoding forms allow for encoding as many as a million characters. This is sufficient for all known character encoding requirements, including full coverage of all historic scripts of the world, as well as common notational systems. [7]

Contents

1   Etymology

The term "Unicode" was introduced by Joseph Becker the first Unicode draft proposal "Unicode 88" (1988):

The name "Unicode" is intended to suggest a unique, unified, universal encoding. A sequence of Unicodes (e.g. text file, string, or character stream) is called "Unitext". [4]

2   Function

Unicode provides programmers with a single universal character encoding. [3] Before Unicode, no standard for multilingual plain text existed. Instead, there were many incompatible standards for encoding plain text, including ASCII, Big Five (Traditional Chinese), and GB2312 (Simplified Chinese). [11]

The Unicode Standard began with a simple goal: to unify the many hundreds of conflicting ways to encode characters, replacing them with a single universal standard. [3] The pre-existing legacy character encodings were both inconsistent and incomplete.

Unicode provides the capacity to encode all characters used for the written languages of the world -- more than 1 million characters can be encoded. [3]

Unicode defines a consistent way of encoding multilingual text that enables the exchange of text data internationally and creates the foundation for global software. [3]

3   Applications

3.1   Web

As the default encoding of HTML and XML, the Unicode Standard provides the underpinning for the World Wide Web and the global business environments of today. [3]

UTF-8 was used by roughly 80% of websites as of 2014. [8]

static/images/Growth_of_Unicode_on_the_Web.png
static/images/usage_of_character_encoding_for_websites.png

4   Substance

4.1   Code points

A single number is assigned to each code element defined by the Unicode Standard. Each of these numbers is called a code point and, when referred to in text, is listed in hexadecimal form following the prefix "U+". For example, the code point U+0041 is the hexadecimal number 0041 (equal to the decimal number 65). It represents the character "A" in the Unicode Standard.

Each character is also assigned a unique name that specifies it and no other. For example, U+0041 is assigned the character name "LATIN CAPITAL LETTER A." U+0A1B is assigned the character name "GURMUKHI LETTER CHA." These Unicode names are identical to the ISO/IEC 10646 names for the same characters. [7]

Code elements are grouped logically throughout the range of code points, called the codespace. The coding starts at U+0000 with the standard ASCII characters, and continues with Greek, Cyrillic, Hebrew, Arabic, Indic and other scripts; then followed by symbols and punctuation. [7] The codespace continues with Hiragana, Katakana, and Bopomofo. The unified Han ideographs are followed by the complete set of modern Hangul. The range of surrogate code points is reserved for use with UTF-16. Towards the end of the BMP is a range of code points reserved for private use, followed by a range of compatibility characters. The compatibility characters are character variants that are encoded only to enable transcoding to earlier standards and old implementations, which made use of them. [7]

static/images/Unicode_Basic_Latin.png

Basic Latin (ASCII).

The Unicode Consortium assigns every character (alphabetic, ideographic, or symbol) in every writing system (but not individual glyphs) a name and a hexadecimal number from 0 to 0x10FFFF (its "code point"); these are the "encoded characters". [2] [3] [6] As a result, characters from any script can be used in any mixture and with equal facility (no code pages are required, and multilingual text is supported). [3]

For code points in the Basic Multilingual Plane (BMP), four digits are used (e.g. U+0058 for the character LATIN CAPITAL LETTER X); for code points outside the BMP, five or six digits are used, as required (e.g. U+E0001 for the character LANGUAGE TAG and U+10FFFD for the character PRIVATE USE CHARACTER-10FFFD).

Code points are written in the form U+12CA. For example, "LATIN CAPITAL LETTER A" is written as U+0041, the Arabic letter Ain is written as U+0639, and "ETHIOPIC SYLLABLE WI" is written as U+12CA (0x12CA, or 4810 in decimal).
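As a rough illustration, Python exposes exactly this mapping: ord() returns a character's integer code point and the standard unicodedata module returns its Unicode character name (a minimal sketch using the examples above):

    import unicodedata

    # The three example characters above: LATIN CAPITAL LETTER A,
    # ARABIC LETTER AIN, and ETHIOPIC SYLLABLE WI.
    for ch in "A\u0639\u12ca":
        cp = ord(ch)  # the character's integer code point
        print("U+%04X %s" % (cp, unicodedata.name(ch)))

    # U+0041 LATIN CAPITAL LETTER A
    # U+0639 ARABIC LETTER AIN
    # U+12CA ETHIOPIC SYLLABLE WI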

static/images/unicode_compared_to_the_2022_framework.png

This figure contrasts the Unicode encoding with mixtures of single-byte character sets with escape sequences to shift the meanings of bytes in the ISO/IEC 2022 framework using multiple character encoding standards.

Code points for characters can be found in one of the many tables listed in the Unicode Standard:

0061    'a'; LATIN SMALL LETTER A
0062    'b'; LATIN SMALL LETTER B
0063    'c'; LATIN SMALL LETTER C
...
007B    '{'; LEFT CURLY BRACKET

The Unicode Standard contains 1,114,112 code points, most of which are available for encoding of characters. [3] Collectively, these code points are called "codespace". The Unicode codespace is divided into seventeen planes of 65,536 (2 ** 16) code points, numbered 0 to 16. 0 is the Basic Multilingual Plane (BMP), 1 is the Supplementary Multilingual Plane (SMP), 2 is the Supplementary Ideographic Plane (SIP), 3 to 13 are unassigned, 14 is the Supplementary Special-purpose Plane (SSP), and 15 and 16 are Supplementary Private Use Areas. The majority of the common characters used in the major languages of the world are encoded in the BMP. [3]
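Because each plane spans exactly 65,536 code points, the plane containing a given code point is just the code point integer-divided by 0x10000. A small illustrative Python snippet (the sample code points are arbitrary picks from the planes named above):

    # plane number = code point // 0x10000 (each plane holds 65,536 code points)
    samples = [
        (0x0041, "LATIN CAPITAL LETTER A, in the BMP"),
        (0x1F600, "an emoji, in the SMP"),
        (0x20000, "a CJK ideograph, in the SIP"),
        (0xE0001, "LANGUAGE TAG, in the SSP"),
    ]
    for cp, note in samples:
        print("U+%04X is in plane %d (%s)" % (cp, cp // 0x10000, note))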

4.2   Encoding

A Unicode string, which represents a text element, is a sequence of code points. This sequence needs to be represented as a set of bytes (meaning, values from 0-255) in memory. The rules for translating a Unicode string into a sequence of bytes are called an "encoding". [6]
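In Python, for instance, str.encode() applies such a set of rules and bytes.decode() reverses it; a minimal sketch of the round trip (UTF-8 is chosen arbitrarily here):

    text = "Hello"                        # a sequence of five code points
    data = text.encode("utf-8")           # encoding: code points -> bytes
    print(data)                           # b'Hello'
    print(data.decode("utf-8") == text)   # True: decoding recovers the original string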

4.2.1   UCS-2

In Unicode 88, Becker outlined a 16-bit character model:

Unicode is intended to address the need for a workable, reliable world text encoding. Unicode could be roughly described as "wide-body ASCII" that has been stretched to 16 bits to encompass the characters of all the world's living languages. In a properly engineered design, 16 bits per character are more than sufficient for this purpose.

This design was based on the assumption that only those scripts and characters in modern use would need to be encoded:

Unicode gives higher priority to ensuring utility for the future than to preserving past antiquities. Unicode aims in the first instance at the characters published in modern text (e.g. in the union of all newspapers and magazines printed in the world in 1988), whose number is undoubtedly far below 2^14 = 16,384. Beyond those modern-use characters, all others may be defined to be obsolete or rare; these are better candidates for private-use registration than for congesting the public list of generally useful Unicodes.

UCS-2 is a fixed-length encoding.

UCS-2 cannot represent code points outside the BMP.

The earliest idea for Unicode encoding was to store code points in two bytes each (Universal Character Set 2). So "Hello" becomes [*]:

  H     e     l     l     o
0x48 00 65 00 6c 00 6c 00 6f 00

Two bytes means there are 2^16 = 65,536 distinct values available, making it possible to represent many different characters from many different scripts, but this turned out not to be enough to represent every character in every writing system. [6] (As of Unicode 6.0, the codespace contains 1,114,112 code points.) UCS-2 is now obsolete.

The first plane (code points U+0000 to U+FFFF) contains the most frequently used characters and is called the Basic Multilingual Plane or BMP. Both UTF-16 and UCS-2 encode code points in this range as single 16-bit code units that are numerically equal to the corresponding code points. The code points in the BMP are the only code points that can be represented in UCS-2. Within this plane, code points U+D800 to U+DFFF (see below) are reserved for lead and trail surrogates.

4.2.2   UTF-16

UTF-16 is an extension of UCS-2 to represent code points outside the BMP.

UTF-16 is a variable-length character encoding for Unicode capable of encoding 1,112,064 code points in the Unicode code space from 0 to 0x10FFFF.

The older UCS-2 (2-byte Universal Character Set) is a similar character encoding that was superseded by UTF-16 in version 2.0 of the Unicode standard in July 1996. UCS-2 produces a fixed-length format by simply using the code point as the 16-bit code unit. UTF-16 expands the code space significantly by using surrogate pairs to encode code points above 0xFFFF, and produces the same result as UCS-2 for all code points in the range 0-0xFFFF that had been or ever will be assigned a character.

Code points from the other planes (called Supplementary Planes) are encoded in UTF-16 by pairs of 16-bit code units called surrogate pairs, by the following scheme:

  • 0x010000 is subtracted from the code point, leaving a 20 bit number in the range 0..0x0FFFFF.
  • The top ten bits (a number in the range 0..0x03FF) are added to 0xD800 to give the first code unit or lead surrogate, which will be in the range 0xD800..0xDBFF. (Previous versions of the Unicode Standard referred to these as high surrogates.)
  • The low ten bits (also in the range 0..0x03FF) are added to 0xDC00 to give the second code unit or trail surrogate, which will be in the range 0xDC00..0xDFFF. (Previous versions of the Unicode Standard referred to these as low surrogates.)
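A sketch of that scheme in Python (encode_surrogate_pair is a hypothetical helper name used only for this illustration):

    def encode_surrogate_pair(cp):
        """Split a supplementary-plane code point (U+10000..U+10FFFF)
        into a UTF-16 lead/trail surrogate pair."""
        assert 0x10000 <= cp <= 0x10FFFF
        v = cp - 0x10000                # 20-bit value in the range 0..0xFFFFF
        lead = 0xD800 + (v >> 10)       # top ten bits -> lead surrogate, 0xD800..0xDBFF
        trail = 0xDC00 + (v & 0x3FF)    # low ten bits -> trail surrogate, 0xDC00..0xDFFF
        return lead, trail

    # U+10000 -> (0xD800, 0xDC00); U+10FFFD -> (0xDBFF, 0xDFFD)
    print([hex(u) for u in encode_surrogate_pair(0x10FFFD)])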

4.2.3   UTF-32 (UCS-4)

To ensure we can handle all 1,114,112 code points, we can store every code point in four bytes.

In this representation, the string "Hello" would look like this [6]:

   H           e           l           l           o
0x48 00 00 00 65 00 00 00 6c 00 00 00 6c 00 00 00 6f 00 00 00
   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19

This representation is straightforward but using it presents a number of problems.

  1. It wastes space. The majority of the code points in most texts are below U+0080 (or U+00FF), so a large amount of space is occupied by zero bytes. The above string takes 20 bytes, compared to the 5 bytes needed for an ASCII representation. Increased RAM usage doesn't matter too much (desktop computers have plenty of RAM, and strings aren't usually that large), but expanding our usage of disk and network bandwidth by a factor of 4 is intolerable. [2] [6]
  2. It's not portable; different processors order the bytes differently. [6]
  3. It's not compatible with existing C functions such as strlen(), so a new family of wide string functions would need to be used. [6]
  4. Many Internet standards are defined in terms of textual data, and can't handle content with embedded zero bytes. [6]
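The space and byte-ordering problems (items 1 and 2) are easy to see in Python, where the little-endian and big-endian forms of UTF-32 produce different byte sequences for the same string (a small illustration):

    s = "Hello"
    print(s.encode("utf-32-le").hex())   # '4800000065000000...' -- 'H' byte first (little-endian)
    print(s.encode("utf-32-be").hex())   # '0000004800000065...' -- zero bytes first (big-endian)
    print(len(s.encode("utf-32-le")))    # 20: four bytes for each of the five characters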

4.2.4   UTF-8

UTF-8 addresses all of the above issues using the following rules [6]:

  1. If the code point is less than U+0080, it's represented by the corresponding one byte value.
  2. If the code point is between U+0080 and U+07FF, it's turned into two byte values between 128 and 255.
  3. Code points greater than U+07FF are turned into three- or four-byte sequences, where each byte of the sequence is between 128 and 255.

In this representation, the string "Hello" would look like this:

   H  e  l  l  o
0x48 65 6c 6c 6f
   0  1  2  3  4

UTF-8 has several convenient properties:

  1. It can handle any Unicode code point. [6]
  2. A Unicode string is turned into a string of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can't handle zero bytes. [6]
  3. A string of ASCII text is also valid UTF-8 text. [6]
  4. UTF-8 is compact; the majority of code points are turned into two bytes, and values less than 128 occupy only a single byte. [6]
  5. If bytes are corrupted or lost, it’s possible to determine the start of the next UTF-8-encoded code point and resynchronize. It’s also unlikely that random 8-bit data will look like valid UTF-8. [6]
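The size rules above are easy to verify in Python; a brief sketch showing how many bytes UTF-8 uses for a code point in each range (the sample characters are arbitrary):

    for ch in ("A", "\u00e9", "\u20ac", "\U0010fffd"):
        data = ch.encode("utf-8")
        print("U+%04X -> %d byte(s): %s" % (ord(ch), len(data), data.hex()))

    # U+0041   -> 1 byte(s): 41        (ASCII range, unchanged)
    # U+00E9   -> 2 byte(s): c3a9      (LATIN SMALL LETTER E WITH ACUTE)
    # U+20AC   -> 3 byte(s): e282ac    (EURO SIGN)
    # U+10FFFD -> 4 byte(s): f48fbfbd  (PRIVATE USE CHARACTER-10FFFD)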

4.2.5   Other encodings

Unicode code points can be encoded in any encoding scheme. For example, you could encode the Unicode string for Hello (U+0048 U+0065 U+006C U+006C U+006F) in ASCII, with one catch: some of the letters might not show up! If the encoding you're using has no equivalent for the Unicode code point you're trying to represent, you usually get a question mark (?) or a box.

Encodings don't have to handle every possible Unicode character, and most encodings don't. [6] For example, Python 2's default encoding is the 'ascii' encoding. The rules for converting a Unicode string into the ASCII encoding are simple; for each code point:

  1. If the code point is < 128, each byte is the same as the value of the code point. [6]
  2. If the code point is 128 or greater, the Unicode string can't be represented in this encoding. (Python raises a UnicodeEncodeError exception in this case.) [6]

Latin-1, also known as ISO-8859-1, is a similar encoding. Unicode code points 0-255 are identical to the Latin-1 values, so converting to this encoding simply requires converting code points to byte values; if a code point larger than 255 is encountered, the string can’t be encoded into Latin-1. [6]
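A short Python illustration of both behaviors (the sample string is arbitrary; 'café' contains the code point U+00E9, which is above 127 but below 256):

    s = "caf\u00e9"                        # 'café'
    print(s.encode("latin-1").hex())       # '636166e9' -- every code point below 256 becomes one byte
    try:
        s.encode("ascii")                  # U+00E9 is 128 or greater, so this fails
    except UnicodeEncodeError as exc:
        print(exc)                         # 'ascii' codec can't encode character '\xe9' ...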

Encodings don’t have to be simple one-to-one mappings like Latin-1. Consider IBM’s EBCDIC, which was used on IBM mainframes. Letter values weren’t in one block: ‘a’ through ‘i’ had values from 129 to 137, but ‘j’ through ‘r’ were 145 through 153. If you wanted to use EBCDIC as an encoding, you’d probably use some sort of lookup table to perform the conversion, but this is largely an internal detail. [6]

5   Representation

ISO 10646 defines several character encoding forms for the Universal Character Set. The simplest, UCS-2, uses a single code value (defined as one or more numbers representing a code point) between 0 and 65,535 for each character, and allows exactly two bytes (one 16-bit word) to represent that value.

Unicode characters can be represented by different character encodings. The three most commonly used are UTF-8, UTF-16, and UTF-32. [3] UTF stands for "Unicode Transformation Format" or "UCS Transformation Format", and the number afterward gives the number of bits in each code unit of the encoding. [1]

5.1   UTF-8

Date: 1992

UTF-8 is a variable-width encoding that represents every character in the Unicode character set using one to four 8-bit bytes (a group of 8 bits is known as an octet in the Unicode Standard). Code points with lower numerical values (i.e. earlier code positions in the Unicode character set, which tend to occur more frequently) are encoded using fewer bytes.

The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, making valid ASCII text valid UTF-8-encoded Unicode as well.

It was designed for backward compatibility with ASCII and to avoid the complications of endianness and byte order marks in UTF-16 and UTF-32. [3]

UTF-8 has become the dominant character encoding for the World Wide Web, accounting for more than half of all Web pages.

6   Properties

6.1   Universal

The repertoire must be large to encompass all characters that are likely to be used in general text interchange, including those in major international, national, and industry character sets. [3]

6.2   Efficient

Plain text is simple to parse: software does not have to maintain state or look for special escape sequences, and character synchronization from any point in a character stream is quick and unambiguous. [3] A fixed character code allows for efficient sorting, searching, display, and editing of text. [3]

6.3   Unambiguous

Any given Unicode code point always represents the same character. [3]

7   Maintenance

The Unicode Standard is developed in conjunction with the Universal Character Set standard.

8   History

The origins of Unicode date to 1987, when Joe Becker from Xerox_, together with Lee Collins and Mark Davis from Apple Inc., started investigating the practicalities of creating a universal character set. [2] In August 1988, Joe Becker published a draft proposal for an "international/multilingual text character encoding system, tentatively called Unicode". He explained that "[t]he name 'Unicode' is intended to suggest a unique, unified, universal encoding".

The Unicode Consortium was incorporated on January 3, 1991, in California_, and in October 1991, the first volume of the Unicode standard was published. The second volume, covering Han ideographs, was published in June 1992.

In the early days of computing, the only characters that mattered were good old unaccented English letters, and we had a code for them called ASCII which was able to represent every character using a number between 32 and 127. Space was 32, the letter "A" was 65, etc. This could conveniently be stored in 7 bits. Most computers in those days were using 8-bit bytes, so not only could you store every possible ASCII character, but you had a whole bit to spare. [2]

Codes below 32 were called unprintable and were used for control characters, like 7 which made your computer beep and 12 which caused the current page of paper to go flying out of the printer and a new one to be fed in. [2]

Because bytes have room for up to eight bits, lots of people got to thinking, "gosh, we can use the codes 128-255 for our own purposes." The trouble was, lots of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255. The IBM-PC had something that came to be known as the OEM character set which provided some accented characters for European languages and a bunch of line drawing characters... horizontal bars, vertical bars, horizontal bars with little dingle-dangles dangling off the right side, etc., and you could use these line drawing characters to make spiffy boxes and lines on the screen, which you can still see running on the 8088 computer at your dry cleaners'. In fact as soon as people started buying PCs outside of America all kinds of different OEM character sets were dreamed up, which all used the top 128 characters for their own purposes. For example on some PCs the character code 130 would display as é, but on computers sold in Israel it was the Hebrew letter Gimel (ג), so when Americans would send their résumés to Israel they would arrive as rגsumגs. In many cases, such as Russian, there were lots of different ideas of what to do with the upper-128 characters, so you couldn't even reliably interchange Russian documents.

Eventually this OEM free-for-all got codified in the ANSI standard. In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages. So for example in Israel DOS used a code page called 862, while Greek users used 737. [2]

Meanwhile, in Asia, even more crazy things were going on to take into account the fact that Asian alphabets have thousands of letters, which were never going to fit into 8 bits. This was usually solved by the messy system called DBCS, the "double byte character set" in which some letters were stored in one byte and others took two. It was easy to move forward in a string, but dang near impossible to move backwards. Programmers were encouraged not to use s++ and s-- to move backwards and forwards, but instead to call functions such as Windows' AnsiNext and AnsiPrev which knew how to deal with the whole mess. [2]

But still, most people just pretended that a byte was a character and a character was 8 bits and as long as you never moved a string from one computer to another, or spoke more than one language, it would sort of always work. But of course, as soon as the Internet happened, it became quite commonplace to move strings from one computer to another, and the whole mess came tumbling down. Luckily, Unicode had been invented. [2]

Unicode began with Unicode 88.

There's a related ISO standard, ISO 10646. Unicode and ISO 10646 were originally separate efforts, but the specifications were merged with the 1.1 revision of Unicode. [6]

The International Organization for Standardization (ISO) set out to compose the universal character set in 1989, and published the draft of ISO 10646 in 1990. Hugh McGregor Ross was one of its principal architects.

In 1990, therefore, two initiatives for a universal character set existed: Unicode, with 16 bits for every character (65,536 possible characters), and ISO 10646, whose 31-bit codespace allowed for over 679 million characters.

Over time, the situation changed in the Unicode standard itself: 65,536 characters came to appear insufficient, and from version 2.0 onwards the standard supports encoding of 1,112,064 code points from 17 planes by means of the UTF-16 surrogate mechanism. For that reason, ISO 10646 was limited to contain only as many characters as can be encoded by UTF-16: a little over a million characters instead of over 679 million. The UCS-4 encoding of ISO 10646 was incorporated into the Unicode standard with the limitation to the UTF-16 range and under the name UTF-32, although it has almost no use outside programs' internal data.

static/images/birthplace_of_utf8.jpg

The diner where UTF-8 was invented. [10]

Rob Pike and Ken Thompson, the designers of the Plan 9 operating system, devised UTF-8 at a diner in September 1992. The new, fast, and well-designed mixed-width encoding went on to become the most popular UCS encoding. [1]

The UCS has over 1.1 million code points available for use, but only the first 65,536 (the Basic Multilingual Plane, or BMP) had entered into common use before 2000. This situation began changing when the People's Republic of China (PRC) ruled in 2000 that all software sold in its jurisdiction would have to support GB 18030. This required software intended for sale in the PRC to move beyond the BMP.

8.1   Unicode 1.0

Date: October 1991

8.2   Unicode 2.0

Date: July 1996

In 1996, a surrogate character mechanism was implemented in Unicode 2.0, so that Unicode was no longer restricted to 16 bits. This increased the Unicode codespace to over a million code points, which allowed for the encoding of many historic scripts (e.g. Egyptian Hieroglyphs) and thousands of rarely used or obsolete characters that had not been anticipated as needing encoding. Among the characters not originally intended for Unicode are rarely used Kanji and Chinese characters, many of which are part of personal and place names; while rarely used, they are much more essential than envisioned in the original architecture of Unicode.

8.3   Unicode 3.0

Date: September 1999

8.4   Unicode 4.0

Date: April 2003

8.5   Unicode 5.0

Date: July 2006

8.6   Unicode 6.0

Date: October 2010

Unicode 6.0 contains a repertoire of more than 110,000 characters covering 100 scripts and various symbols.

8.7   Unicode 7.0

Date: June 2014

9   Further reading

10   Footnotes

[*]

Alternatively:

H     e     l     l     o
00 48 00 65 00 6c 00 6c 00 6f

Originally there were two ways to store Unicode, since early implementors wanted to be able to store their Unicode code points in high-endian or low-endian mode, whichever their particular CPU was fastest at. [2] So people were forced to come up with the bizarre convention of storing a FE FF at the beginning of every Unicode string; this is called a Unicode Byte Order Mark and if you are swapping your high and low bytes it will look like a FF FE and the person reading your string will know that they have to swap every other byte. [2]
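The two byte orders and the mark itself can be seen from Python's codecs module; a brief sketch (the generic 'utf-16' codec writes a BOM automatically, and the decoder uses it to pick the byte order):

    import codecs

    print(codecs.BOM_UTF16_BE.hex())    # 'feff' -- the byte order mark, big-endian
    print(codecs.BOM_UTF16_LE.hex())    # 'fffe' -- the same mark with the bytes swapped
    data = "Hello".encode("utf-16")     # the generic utf-16 codec prepends a BOM
    print(data.decode("utf-16"))        # 'Hello' -- the decoder reads the BOM and orders bytes accordingly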

11   References

[1] Armin Ronacher. January 9, 2014. UCS vs UTF-8 as Internal String Encoding. http://lucumr.pocoo.org/2014/1/9/ucs-vs-utf8/
[2] Joel Spolsky. October 8, 2003. The Absolute Minimum Every Software Developer Absolutely, Positively Must Know about Unicode and Character Sets (No Excuses!). http://www.joelonsoftware.com/articles/Unicode.html
[3] The Unicode Consortium. 2012. The Unicode Standard: Version 6.2 - Core Specification. http://www.unicode.org/versions/Unicode6.2.0/UnicodeStandard-6.2.pdf
[4] Joseph Becker. August 29, 1988. Unicode 88. http://www.unicode.org/history/unicode88.pdf
[6] Unicode HOWTO. https://docs.python.org/2/howto/unicode.html
[7] The Unicode Standard: A Technical Introduction. http://www.unicode.org/standard/principles.html
[8] Usage of UTF-8 for websites. http://w3techs.com/technologies/details/en-utf8/all/all
[9] ISO/IEC 10646. http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html
[10] Rob Pike. https://twitter.com/rob_pike/status/721114626033733633
[11] Axel Rauschmayer. Speaking JavaScript, Chapter 24: Unicode and JavaScript. http://speakingjs.com/es5/ch24.html


Relation between ISO 10646 and Unicode:

In 1991, the ISO Working Group responsible for ISO/IEC 10646 (JTC 1/SC 2/WG 2) and the Unicode Consortium decided to create one universal standard for coding multilingual text. Since then, the ISO 10646 Working Group (SC 2/WG 2) and the Unicode Consortium have worked together very closely to extend the standard and to keep their respective versions synchronized.

These are not the same thing. Although the character codes and encoding forms are synchronized between Unicode and ISO/IEC 10646, the Unicode Standard imposes additional constraints on implementations to ensure that they treat characters uniformly across platforms and applications. To this end, it supplies an extensive set of functional character specifications, character data, algorithms and substantial background material that is not in ISO/IEC 10646.