Encodings
0001 Jun 1
One minute read

http://www.i18nguy.com/unicode/codepages.html

ASCII

  • ISO-8859-1
  • CP1251
  • GB2312
  • Latin1

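ASCII maps single bytes to characters, using only the values 0 - 127.
The code pages listed above keep that range and assign their own characters to the values 128 - 255 (GB2312 additionally uses two-byte sequences for Chinese characters). Latin1 is another name for ISO-8859-1.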
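
To illustrate why the code page matters, here is a small Python sketch that decodes the same single byte with two of the encodings named above. The byte value 0xE0 is just an example picked for this sketch, and the variable names are made up.

```python
# One byte above 127 means different things in different code pages.
raw = bytes([0xE0])

as_latin1 = raw.decode("iso-8859-1")   # U+00E0, Latin small letter a with grave
as_cp1251 = raw.decode("cp1251")       # U+0430, Cyrillic small letter a

# Bytes in the ASCII range decode to the same text in both code pages.
ascii_bytes = b"abc"
same = ascii_bytes.decode("iso-8859-1") == ascii_bytes.decode("cp1251")

print(repr(as_latin1), repr(as_cp1251), same)
```

Without knowing which code page produced the byte, there is no way to tell which of the two characters was meant.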

Windows 1251

http://msdn.microsoft.com/en-US/goglobal/cc305144.aspx

Unicode

w͢͢͝h͡o͢͡ ̸͢k̵͟n̴͘ǫw̸̛s͘ ̀́w͘͢ḩ̵a҉̡͢t ̧̕h́o̵r͏̵rors̡ ̶͡͠lį̶e͟͟ ̶͝in͢ ͏t̕h̷̡͟e ͟͟d̛a͜r̕͡k̢̨ ͡h̴e͏a̷̢̡rt́͏ ̴̷͠ò̵̶f̸ u̧͘ní̛͜c͢͏o̷͏d̸͢e̡͝

Assigns each character a number, called a code point.
The U+ means «Unicode» and the numbers are hexadecimal.

U+2119:   DOUBLE-STRUCK CAPITAL P
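
The U+ notation can be reproduced in Python with the standard library. This is only a sketch; the variable names are made up for the example.

```python
import unicodedata

ch = "\u2119"                      # the character written as U+2119
code_point = ord(ch)               # integer value of the code point
label = f"U+{code_point:04X}"      # conventional U+ notation, in hexadecimal
name = unicodedata.name(ch)        # official Unicode character name

print(label, name)

# chr() goes the other way, from code point to character.
assert chr(0x2119) == ch
```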

UTF-8

Maps Unicode code points to bytes.
Encoded characters have variable length.
In UTF-8, every code point from 0-127 is stored in a single byte.
Code points 128 and above are stored using 2, 3, or 4 bytes (the original design allowed up to 6, but RFC 3629 limits UTF-8 to 4).
ASCII text is therefore also valid UTF-8, because each ASCII character still takes one byte with the same value.
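
A quick way to see the variable length is to encode a few characters and count the bytes. The sample characters below are chosen only for this sketch.

```python
# Sample characters: A, e with acute, euro sign, grinning face emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Number of bytes each character needs in UTF-8.
lengths = [len(ch.encode("utf-8")) for ch in samples]

for ch, n in zip(samples, lengths):
    print(f"U+{ord(ch):04X} -> {n} byte(s)")

# Pure ASCII text produces the same bytes in ASCII and in UTF-8.
ascii_text = "plain ASCII"
identical = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
```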

convert encoding

iconv -f original_charset -t new_charset originalfile > newfile
iconv -f utf-16le -t utf-8 file1.txt > file2.txt
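
Roughly the same conversion can be sketched in Python. The helper name convert_encoding is hypothetical, and unlike iconv this version reads the whole file into memory, so it only suits small files.

```python
from pathlib import Path


def convert_encoding(src, dst, from_charset, to_charset):
    """Re-encode the text file src into dst (hypothetical helper)."""
    # Decode the raw bytes with the original charset ...
    text = Path(src).read_bytes().decode(from_charset)
    # ... and write them back out encoded with the new one.
    Path(dst).write_bytes(text.encode(to_charset))


# Usage, mirroring the second iconv command above:
# convert_encoding("file1.txt", "file2.txt", "utf-16le", "utf-8")
```

Working on bytes instead of opening the files in text mode avoids any newline translation, so only the encoding changes.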

determine encoding

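file -I originalfile
(-I is the macOS spelling; on Linux use file -i. Both print the MIME type and the charset.)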
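
For a feel of what such a guess involves, here is a deliberately simplified Python heuristic. It is not how the file command works internally; guess_encoding and its return labels are assumptions made for this sketch. It checks for a byte order mark, then tries ASCII and UTF-8, and gives up otherwise, because single-byte code pages cannot be told apart reliably from the bytes alone.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Rough guess at the encoding of data (hypothetical helper)."""
    # A byte order mark at the start is the strongest hint.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Only bytes 0-127: plain ASCII.
    try:
        data.decode("ascii")
        return "us-ascii"
    except UnicodeDecodeError:
        pass
    # Valid UTF-8 sequences are unlikely to occur by accident.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown-8bit"
```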

Python

str.encode() → bytes       (text to bytes; unicode.encode() in Python 2)
bytes.decode() → str       (bytes to text; the result is unicode in Python 2)

sys.getdefaultencoding()
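
A short round trip, assuming Python 3, where text is str and encoded data is bytes. The sample string is made up for the example.

```python
import sys

text = "\u2119 is double-struck"       # str: a sequence of code points
data = text.encode("utf-8")            # str -> bytes
round_trip = data.decode("utf-8")      # bytes -> str

# The encoding used when none is given to encode()/decode().
default = sys.getdefaultencoding()

print(data[:3], round_trip == text, default)
```

Decoding with a different codec than the one used for encoding either raises UnicodeDecodeError or silently yields the wrong characters, so it is worth passing the encoding explicitly.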

