Monday, December 16, 2013

Difference between Unicode, UTF-8 and UTF-16 (Unicode vs UTF-8 vs UTF-16)

Unicode is a character set. UTF-8 and UTF-16 are both encodings of Unicode. 

A character set is a list of characters, each assigned a unique number (these numbers are called "code points"). For example, in the Unicode character set, the number for A is 65 in decimal, which is 41 in hexadecimal and is usually written U+0041.
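As a quick check, Python's built-in `ord` and `chr` convert between a character and its Unicode code point. This is only a small sketch; the variable name `code_point` is chosen for illustration.

```python
# Look up the code point of a character, and the character for a code point.
code_point = ord("A")

print(code_point)       # the code point as a decimal number
print(hex(code_point))  # the same code point in hexadecimal
print(chr(0x41))        # back from the number to the character
```

The decimal and hexadecimal forms name the same code point; U+0041 is simply the conventional hexadecimal spelling.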

An encoding, on the other hand, is an algorithm that translates a list of numbers into binary so it can be stored on disk. For example, UTF-8 stores each number below 128 in a single byte, so it would translate the number sequence 1, 2, 3, 4 like this:

00000001 00000010 00000011 00000100 

Our data is now translated into binary and can now be saved to disk.
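The translation above can be reproduced with Python's `str.encode`. This is a minimal sketch: the names `numbers`, `to_utf8_bits` and `bits` are made up for the example, and the euro sign (U+20AC) is added only to show that larger numbers take more than one byte.

```python
def to_utf8_bits(code_points):
    """Encode a list of code points as UTF-8 and show each byte in binary."""
    text = "".join(chr(n) for n in code_points)
    encoded = text.encode("utf-8")
    return " ".join(format(byte, "08b") for byte in encoded)

numbers = [1, 2, 3, 4]
bits = to_utf8_bits(numbers)
print(bits)  # 00000001 00000010 00000011 00000100

# A code point above 127 needs several bytes in UTF-8.
euro_bits = to_utf8_bits([0x20AC])
print(euro_bits)  # 11100010 10000010 10101100
```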

All together now

Say an application reads the following from the disk:

01101000 01100101 01101100 01101100 01101111 

The app knows this data represents a Unicode string encoded with UTF-8 and must show it as text to the user. The first step is to convert the binary data to numbers. The app uses the UTF-8 algorithm to decode the data. In this case, the decoder returns this:

104 101 108 108 111 

Since the app knows this is a Unicode string, it can assume each number represents a character. We use the Unicode character set to translate each number to a corresponding character. 

The resulting string is "hello".
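The two steps can be sketched in Python. Note that `bytes.decode` performs both steps at once (bytes to code points to characters), so the code points are recovered afterwards with `ord` purely to make the intermediate numbers visible. The variable names are illustrative.

```python
# The bytes read from disk, written out as groups of bits.
bit_groups = "01101000 01100101 01101100 01101100 01101111".split()
raw = bytes(int(group, 2) for group in bit_groups)

# Decode the UTF-8 bytes into a string of Unicode characters.
text = raw.decode("utf-8")

# Show the code point behind each character.
code_points = [ord(ch) for ch in text]

print(code_points)  # [104, 101, 108, 108, 111]
print(text)         # hello
```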

Historical Artifact from Microsoft

Unicode was developed as a new standard for mapping the characters of the great majority of languages in use today, along with other characters that are less essential but may still be needed when writing text. UTF-8 is only one of several ways to encode Unicode text in a file; UTF-16 and UTF-32 are others.

UTF-8 was developed with compatibility in mind. ASCII was a very prominent standard, and people who already had their files in ASCII might have hesitated to adopt Unicode because it would break their existing systems. UTF-8 eliminated this problem: a file that contains only characters from the ASCII character set is byte-for-byte identical whether it is encoded as UTF-8 or as ASCII. This allowed people to adopt Unicode without converting their ASCII files, and legacy software that was unaware of Unicode could keep reading those files. The other common Unicode encodings, UTF-16 and UTF-32, are not byte-compatible with ASCII and would force people to convert their files and systems.
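The compatibility claim is easy to check. The following sketch encodes the same ASCII-only string three ways; `utf-16-le` is used instead of plain `utf-16` so that Python does not prepend a byte order mark, and the variable names are just for illustration.

```python
text = "hello"  # contains only ASCII characters

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# UTF-8 output is identical to ASCII output for ASCII-only text.
print(ascii_bytes == utf8_bytes)

# UTF-16 output is different: every character takes two bytes.
print(ascii_bytes == utf16_bytes)
print(utf16_bytes)
```

An ASCII-only program reading the UTF-16 bytes would see a zero byte after every letter, which is why such files need converting.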

UTF-8's compatibility with ASCII has a side effect that makes it ideal for text in which most characters belong to the ASCII character set. UTF-8 uses a single byte for each ASCII code point, so such a file is half the size of the same file encoded in UTF-16, which uses 2 bytes for those characters, and a quarter of the size of the same file encoded in UTF-32, which always uses 4. Characters outside ASCII take 2 to 4 bytes in UTF-8, so the saving shrinks, and for some scripts UTF-8 is larger than UTF-16.
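The size difference can be measured directly. This is a small sketch: `encoded_sizes` is a helper name invented for the example, and the little-endian codec names are used so that no byte order mark is counted.

```python
def encoded_sizes(text):
    """Return the number of bytes the text occupies in each encoding."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {name: len(text.encode(name)) for name in encodings}

ascii_sizes = encoded_sizes("hello")       # ASCII-only text
euro_sizes = encoded_sizes("\u20ac")       # the euro sign, U+20AC
emoji_sizes = encoded_sizes("\U0001F600")  # an emoji, U+1F600

print(ascii_sizes)  # {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}
print(euro_sizes)   # {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
print(emoji_sizes)  # {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```

The first line shows the half and quarter ratios for ASCII text; the other two show that the ratios depend on which characters the text contains.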

UTF-8 has been adopted on the World Wide Web because it is both space efficient and byte oriented. Web pages are text files whose markup consists of ASCII characters, and many pages contain few characters outside the ASCII character set. For such pages, other encodings would only increase the network load without any benefit. Even in email transport systems, UTF-8 is slowly but surely being adopted as a replacement for the older encoding systems that are still in use.

Summary:

1. Unicode is the standard for computers to display and manipulate text while UTF-8 is one of the many mapping methods for Unicode. 

2. UTF-8 is a mapping method that retains compatibility with the older ASCII standard

3. UTF-8 is the most space efficient mapping method for Unicode when the text consists mostly of ASCII characters

4. UTF-8 is the most used Unicode encoding on the web

Conclusion

UTF-8 and Unicode cannot be compared directly, because they do different jobs. UTF-8 is an encoding used to translate between binary data and numbers. Unicode is a character set used to translate between numbers and characters.
