Many developers I have worked with have no idea how encoding works, yet they have to work with it all the time in URLs, text files, streams, and so on. Often a dev inherits some code containing a function that nobody understands. It has things like "ToBase16", "UTF8" and "byte[]" in it, which sounds important, so everybody just stays away from it. In fact, people often assume it is some security feature, so they stay far away from it. There is often significant confusion around the differences between encryption, encoding and hashing: encoding looks like encryption because it returns jumbled, "unreadable" text, just like encryption does.

I would suggest first understanding Numeric Systems, because they are fundamental to encoding and to all information in a computer.

Encoding is just a mapping from a binary/decimal/hexadecimal value to a character. For example, if:

  • 1 = a
  • 2 = b
  • 3 = c

Then if you got the message "312" you could map it out and read it as "cab".
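
Here is a minimal sketch of that idea in C#. The Dictionary, class name and message are made up purely for illustration; this is not a real encoding, just the "1 = a, 2 = b, 3 = c" map from above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class ToyEncoding
{
    // A made-up map: not a real character set, just the toy example above.
    static readonly Dictionary<char, char> Map = new Dictionary<char, char>
    {
        ['1'] = 'a',
        ['2'] = 'b',
        ['3'] = 'c'
    };

    static void Main()
    {
        string message = "312";
        // Look each digit up in the agreed map and read off the characters.
        string decoded = new string(message.Select(digit => Map[digit]).ToArray());
        Console.WriteLine(decoded); // cab
    }
}
```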

But everybody has to agree on this map for it to be useful, because it won't work if the bytes shared between computers represent something different on each computer's map, e.g. 1 = a for me but 1 = % for you. If you are using a different character set to the one used to encode the message, the characters will come out wrong, or, if your set has not defined that value, you will get a "?", a block or a diamond. See Encoding in C# for code implementations.
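
A quick sketch of what that mismatch looks like with .NET's built-in Encoding classes. The class name is mine, and Encoding.Latin1 assumes .NET 5 or later; the exact garbage you get depends on which wrong map the reader picks.

```csharp
using System;
using System.Text;

class WrongMap
{
    static void Main()
    {
        // Encode "España" with UTF-8 (the writer's map).
        byte[] bytes = Encoding.UTF8.GetBytes("España");

        // Decode with maps the writer did not use (the reader's maps).
        // ASCII has no mapping for bytes above 127, so it substitutes '?'.
        Console.WriteLine(Encoding.ASCII.GetString(bytes));  // Espa??a

        // Latin-1 (available as Encoding.Latin1 on .NET 5+) maps the two
        // UTF-8 bytes of 'ñ' to two separate, wrong characters.
        Console.WriteLine(Encoding.Latin1.GetString(bytes)); // EspaÃ±a
    }
}
```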

To understand how hexadecimal works see Numeric Systems. Hexadecimal is the standard numeric system for many encodings; of course, the hexadecimal code can also be represented in binary or decimal.

Purpose                                      Symbol   Example
Colors                                       #        red: #FF0000
URL encoding                                 %        "I Love Bananas": I%20Love%20Bananas
E-mail (quoted-printable)                    =        España becomes Espa=F1a
Unicode                                      U+       space: U+0020
Many languages, including descendants of C   0x       space: 0x20
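
A small C# sketch showing a few of these representations side by side. Uri.EscapeDataString and BitConverter.ToString are standard .NET APIs; the class name and the string formatting are just mine for illustration.

```csharp
using System;
using System.Text;

class HexForms
{
    static void Main()
    {
        // The space character is code point 32: 0x20 in hex, U+0020 in Unicode notation.
        char space = ' ';
        Console.WriteLine((int)space);                          // 32
        Console.WriteLine("0x" + ((int)space).ToString("X2"));  // 0x20
        Console.WriteLine("U+" + ((int)space).ToString("X4"));  // U+0020

        // URL encoding replaces the byte with % followed by its hex value.
        Console.WriteLine(Uri.EscapeDataString("I Love Bananas")); // I%20Love%20Bananas

        // Hex dump of a string's UTF-8 bytes.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("red"))); // 72-65-64
    }
}
```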

ASCII is one of the earliest attempts at a universal mapping set. 128 characters were mapped to values, but 0-31 are non-printable control characters that are mostly obsolete nowadays, which leaves even fewer of the 128 slots for printable characters. The big problem with ASCII was that it only mapped English letters.

Since ASCII defines 128 characters we only need 7 bits, meaning the 8th bit of a byte is free. That 8th bit was used to extend the table to include characters from different countries. Since each country used the extra bit to suit its own language, there are several different variations of the Extended ASCII table. Extended ASCII covered different languages from around the world, but different implementations used the same byte values for different mappings.
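
A tiny C# check of the 7-bit point (class name is mine): 'a' is 97 in ASCII, and the top (8th) bit of the byte is 0 for all 128 ASCII values.

```csharp
using System;
using System.Text;

class AsciiBits
{
    static void Main()
    {
        byte[] bytes = Encoding.ASCII.GetBytes("a");

        // 'a' maps to 97; printed in binary the 8th bit is 0, leaving it free.
        Console.WriteLine(bytes[0]);                                       // 97
        Console.WriteLine(Convert.ToString(bytes[0], 2).PadLeft(8, '0'));  // 01100001
    }
}
```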

ANSI is usually the default on Windows (it is obsolete but is often needed to run older software). It is an 8-bit character set (256 characters). It is essentially an extension of ASCII in that it includes all 128 ASCII characters plus an additional 128. The name "ANSI" is a misnomer, since it doesn't correspond to any actual ANSI standard, but the name has stuck. ANSI unified the use of the last bit, but there is not enough space in one byte to cover all the characters in the world: think of all the Arabic characters, Greek, Chinese, Hindi, etc.
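
To see how the same byte lands on different characters depending on which single-byte ("ANSI"/extended ASCII) code page is in use, here is a small sketch. It assumes .NET Core / .NET 5+, where the Windows code pages come from the System.Text.Encoding.CodePages package and must be registered before use.

```csharp
using System;
using System.Text;

class CodePages
{
    static void Main()
    {
        // On .NET Core / .NET 5+ the Windows code pages live in the
        // System.Text.Encoding.CodePages package and must be registered first.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        byte[] oneByte = { 0xF1 };

        // The same byte value means different characters on different code pages.
        Console.WriteLine(Encoding.GetEncoding(1252).GetString(oneByte)); // ñ (Western European)
        Console.WriteLine(Encoding.GetEncoding(1251).GetString(oneByte)); // с (Cyrillic)
    }
}
```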

People from different regions came together to create a universal character set. Unicode started as a 2-byte character set, but that didn't work for various reasons. Every character is mapped to a value called the code point, which is universal, and the code point is then mapped to memory via different encoding methods. E.g. "A" = code point 65, but its memory representation is determined by its encoding (UTF-8 / UTF-16 / UTF-32). Unicode is an abstraction which requires a concrete implementation, an encoding. The code point is always the same but the encoding can change, which allows encodings to be easily changed or discarded. That's why there are hundreds of Unicode encodings, of which UTF has become the most popular.
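
A small C# sketch of that split between code point and encoding (the class name is mine, the APIs are standard .NET): the code point of "A" is always 65, but the bytes in memory depend on the encoding chosen.

```csharp
using System;
using System.Text;

class CodePoints
{
    static void Main()
    {
        string a = "A";

        // The code point is the abstract Unicode number - the same everywhere.
        Console.WriteLine(char.ConvertToUtf32(a, 0)); // 65 (U+0041)

        // The encoding decides the actual bytes in memory.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(a)));    // 41
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(a))); // 41-00 (UTF-16 LE)
        Console.WriteLine(BitConverter.ToString(Encoding.UTF32.GetBytes(a)));   // 41-00-00-00 (UTF-32 LE)
    }
}
```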

Backward compatibility

The first 128 characters are mapped the same as ASCII so that files are backward compatible.
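
A quick way to convince yourself of that in C# (assuming the text really is plain ASCII; the class name is mine):

```csharp
using System;
using System.Linq;
using System.Text;

class BackwardCompat
{
    static void Main()
    {
        // For pure ASCII text the UTF-8 bytes are identical to the ASCII bytes.
        byte[] ascii = Encoding.ASCII.GetBytes("Hello");
        byte[] utf8  = Encoding.UTF8.GetBytes("Hello");
        Console.WriteLine(ascii.SequenceEqual(utf8)); // True
    }
}
```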

UTF-32 is a fixed-length 4-byte character set, hence 32 bits: 4 bytes to map every character in the world... problem solved, right? Not exactly. While UTF-32 covers every language in the world, it also means every character takes up 4 bytes.

The problem was that the English-speaking world did not adopt it because it takes up too much space unnecessarily: a 1 kB ASCII file becomes 4 kB. As you can see below, there are a lot of wasted bits; the short C# check after the list shows the size difference in practice.

  • ASCII "a" = 01100001
  • UTF-32 (Unicode) "a" = 00000000 00000000 00000000 01100001
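
Here is that overhead in C#; the 1024-character string just stands in for a "1 kB ASCII file", and the class name is mine.

```csharp
using System;
using System.Text;

class Utf32Size
{
    static void Main()
    {
        string text = new string('a', 1024); // roughly a "1 kB" ASCII file

        Console.WriteLine(Encoding.ASCII.GetByteCount(text)); // 1024
        Console.WriteLine(Encoding.UTF32.GetByteCount(text)); // 4096 - every character costs 4 bytes
    }
}
```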

UTF-16 is a variable-length character set of 2 or 4 bytes per character. Many characters fit in 16 bits, and it doubles up to 4 bytes for the characters that don't.
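
A short sketch of the two sizes, using Encoding.Unicode, which is UTF-16 in .NET (the sample characters are just convenient picks, and the class name is mine):

```csharp
using System;
using System.Text;

class Utf16Lengths
{
    static void Main()
    {
        // Encoding.Unicode is UTF-16 in .NET.
        Console.WriteLine(Encoding.Unicode.GetByteCount("A"));  // 2 - fits in one 16-bit unit
        Console.WriteLine(Encoding.Unicode.GetByteCount("€"));  // 2 - still fits in 16 bits
        Console.WriteLine(Encoding.Unicode.GetByteCount("😀")); // 4 - needs two 16-bit units
    }
}
```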

Big Endian Vs Little Endian

In UTF-16 the character "A" only needs 1 byte to be represented (0x41), so the other byte is filled with 0s (0x00). The empty byte could come first or second, which can cause some confusion (the C# sketch after the list shows both orders).

  • The full byte at the end = 0x00 0x41 = Big Endian
  • The empty byte at the end = 0x41 0x00 = Little Endian
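
In .NET the two byte orders map to Encoding.BigEndianUnicode and Encoding.Unicode (little endian); a quick sketch, with a made-up class name:

```csharp
using System;
using System.Text;

class Endianness
{
    static void Main()
    {
        // Big endian puts the full byte last for "A"; little endian puts it first.
        Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes("A"))); // 00-41
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes("A")));          // 41-00
    }
}
```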

Byte Order Mark (BOM)

Some companies used Little Endian while others used Big Endian. To identify which UTF-16 variant is being used, 2 additional bytes often get added to the start of the file, called the "Byte Order Mark" (BOM). The BOM is the code point U+FEFF written in the file's byte order, so the reader can tell which order was used:

  • Big Endian BOM bytes: 0xFE 0xFF (254, 255)
  • Little Endian BOM bytes: 0xFF 0xFE (255, 254)

The BOM is not strictly required: if no BOM is found, software will try to parse the file in one byte order and, if it hits errors, fall back to the other.
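
You can see the BOM bytes an encoding would write by asking for its preamble in C#; a minimal sketch (the class name is mine, GetPreamble is the standard .NET call):

```csharp
using System;
using System.Text;

class BomDemo
{
    static void Main()
    {
        // GetPreamble() returns the bytes an encoding writes at the start of a file as its BOM.
        Console.WriteLine(BitConverter.ToString(new UnicodeEncoding(bigEndian: true,  byteOrderMark: true).GetPreamble())); // FE-FF
        Console.WriteLine(BitConverter.ToString(new UnicodeEncoding(bigEndian: false, byteOrderMark: true).GetPreamble())); // FF-FE
        Console.WriteLine(BitConverter.ToString(new UTF8Encoding(encoderShouldEmitUTF8Identifier: true).GetPreamble()));    // EF-BB-BF
    }
}
```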

UTF-8 is a variable-length, multi-byte character set of 1, 2, 3 or 4 bytes per character. Given a run of bytes, a reader would otherwise have no way of knowing whether a character is encoded in 1, 2, 3 or 4 bytes, e.g. how many characters are the bytes 01100001 01100001 01100001 01100001? Therefore the bytes also need to contain metadata about each character, indicating where it starts and ends.
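
A quick C# sketch of those varying lengths; the sample characters are just convenient picks for 1, 2, 3 and 4 bytes, and the class name is mine.

```csharp
using System;
using System.Text;

class Utf8Lengths
{
    static void Main()
    {
        // The same encoding spends 1, 2, 3 or 4 bytes depending on the character.
        Console.WriteLine(Encoding.UTF8.GetByteCount("a"));  // 1
        Console.WriteLine(Encoding.UTF8.GetByteCount("ñ"));  // 2
        Console.WriteLine(Encoding.UTF8.GetByteCount("€"));  // 3
        Console.WriteLine(Encoding.UTF8.GetByteCount("😀")); // 4
    }
}
```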

1 Byte Char

  • First bit marked with a 0 means the char is a single byte
  • Available bits = 7
  • Last code point = 127 (equal to ASCII)
  • eg 01100001 is a single-byte char, so the next byte will be read as the start of a new char.

2 Byte Char

  • Available bits = 11
  • Last code point = 2047
  • First byte: 110xxxxx
  • Second byte: 10xxxxxx
  • eg 11010010 10110100

3 Byte Char

  • Available bits = 16
  • Last code point = 65535
  • First byte: 1110xxxx
  • Second byte: 10xxxxxx
  • Third byte: 10xxxxxx
  • eg 11101010 10110100 10110111

4 Byte Char

  • Available bits = 21
  • Last code point = 2097151
  • First byte: 11110xxx
  • Second byte: 10xxxxxx
  • Third byte: 10xxxxxx
  • Fourth byte: 10xxxxxx
  • eg 11110000 10011111 10011000 10000000 (U+1F600, 😀)
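
Putting the byte layouts above together, here is a small, hedged C# sketch that counts characters in a UTF-8 byte stream just by skipping the 10xxxxxx continuation bytes. The class and helper names are mine, and it assumes the input is valid UTF-8.

```csharp
using System;
using System.Text;

class LeadBytes
{
    // Count characters in a UTF-8 byte stream by looking only at the lead-byte patterns above.
    static int CountUtf8Chars(byte[] bytes)
    {
        int count = 0;
        foreach (byte b in bytes)
        {
            // Continuation bytes start with 10xxxxxx; every other byte starts a new character.
            if ((b & 0b1100_0000) != 0b1000_0000)
                count++;
        }
        return count;
    }

    static void Main()
    {
        byte[] bytes = Encoding.UTF8.GetBytes("a€😀"); // 1 + 3 + 4 = 8 bytes
        Console.WriteLine(bytes.Length);               // 8
        Console.WriteLine(CountUtf8Chars(bytes));      // 3
    }
}
```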

Check out these links for more info:

My Computer Science Repo