Unicode is a standard that assigns a unique number to each character in every human language. It allows for comprehensive language coverage and the ability to include multiple languages in a single document. Programmers can create Unicode documents by choosing an encoding standard like UTF-8 and writing strings to files using functions like codecs.open() or encoding strings before writing.
Unicode for Small Children (and Children at Heart)
43. Unicode for Small
Children (and
Children at Heart)
Feihong Hsu
Chicago Python Users Group
March 8, 2007
Thanks to Chris McAvoy for the conversation at PyCon
that inspired this talk.
44. Welcome to the Wonderful World of
Unicorns!
A Magical Guide to the World's Most Beloved
Mythological Equine
Completely drawn on my tablet PC using the free Ink
Art program. Unfortunately, Ink Art doesn't come
with good coloring tools so I just left it colorless.
45. Welcome to the Useful World of
Unicode!
A Practical Guide to the World's Most Popular
International Text Standard
46. Top 3 reasons that unicorns are
great
● Friendly and wise
● Healing power
● Bane of evil
47. Top 3 reasons that Unicode is
important
● Comprehensive language
coverage
● Multiple languages in a single
document
● Standardized
The Unicode Standard is maintained by the Unicode
Consortium, an organization based in California.
48. The difference between Horses and
Unicorns
            Horses                      Unicorns
Habitat     Grasslands                  Enchanted forests
Diet        Apples, oats, grass,        Love, spirit of wonder
            barley, etc.
Abilities   Galloping, eating,          Sentience, telepathy, laser
            pooping                     vision (unconfirmed)
I really wasn't sure about including the laser vision
ability. I honestly thought it was an urban myth. But
when a friend of my cousin's sister's friend said that
she saw it in person, I finally relented.
49. Difference between ISO 8859 and
Unicode
                            ISO 8859    Unicode
# supported languages       Some        A lot
# supported characters      256         100,000+
# bytes for each character  1           1-4
Somebody noted that ISO 8859 can actually support
more than 256 characters through its various
extensions, so this is an oversimplification.
50. So what, exactly, is Unicode?
Unicode is a standard that assigns a
unique number to each character in
every human language
Ok, not every language, see next slide
The “unique number” for each character is called a
code point in Unicode terminology.
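To make the idea of a code point concrete, here is a minimal sketch. It assumes Python 3 rather than the Python 2 used in the slides, so chr() stands in for unichr(); the sample character is the one that appears later in the talk.

```python
# Python 3 sketch: a code point is simply the integer assigned to a character.
ch = "宠"

code_point = ord(ch)          # character -> code point (an int)
print(code_point)             # 23456
print(f"U+{code_point:04X}")  # conventional notation: U+5BA0
print(chr(code_point) == ch)  # code point -> character: True
```

Nothing here says how the character is stored as bytes; that is the job of an encoding.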
51. What is Unicode not?
● Doesn't address how the characters
are rendered (that's up to font
makers)
● Doesn't deal with imaginary
languages like Klingon and Elvish
● Doesn't cover every ancient script (though some, such as Gothic and Linear B, are encoded)
● Doesn't deal with obscure languages
that no one uses
Although there are many languages that Unicode
doesn't directly support, there are extensions to
Unicode that are designed to handle these cases.
52. How does Hollywood “create”
unicorns?
● CGI
● Horse with horn glued to forehead
● Two dudes in a costume
It helps if the two dudes are very high. And if they
have circus experience. And if neither of them has a
trick leg.
53. How does a programmer create
Unicode documents?
● Technically, you can't make a
Unicode document
● Usually you pick an official
encoding (UTF-8, UTF-16, etc)
● Sometimes you use a language-
specific encoding (GB2312, Shift-
JIS)
In the vast majority of cases, I think UTF-8 is more
than adequate. If in doubt, just go with that
encoding.
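As a rough illustration of what picking an encoding means, the sketch below encodes the same text three ways and compares the byte counts. It assumes Python 3 (the slides use Python 2), and the variable names are only for illustration.

```python
# Python 3 sketch: one string, several encodings, different bytes on disk.
text = "你好世界"  # "Hello World" in Chinese, as used elsewhere in the talk

sizes = {}
for name in ("utf-8", "utf-16", "gb2312"):
    data = text.encode(name)
    sizes[name] = len(data)
    # Decoding with the same codec always gives the original text back.
    assert data.decode(name) == text

print(sizes)  # {'utf-8': 12, 'utf-16': 10, 'gb2312': 8}
```

The UTF-16 figure includes a two-byte BOM. GB2312 is the most compact here, but it can only represent the characters in its own repertoire, which is why UTF-8 is the safer default.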
54. Python and Unicorn
Working together to combat evil!
I think this is a case of the graphic actually
undermining the point I'm trying to make. This is my
attempt to render a dynamic, exciting action scene of
a pitched battle between orc, unicorn and python.
They are fighting for the fate of the damsel in distress
because she is, like, oh so fine (well, at least when
she's got her makeup on, which she doesn't in this
picture). Unfortunately, the unicorn looks like it's
about to be stabbed in the ass, and the python
seems more interested in biting a chunk out of the
damsel than in saving her.
55. Python and Unicode
Working together to create international
applications!
The only time I actually visited the Unicode
Consortium's web site was to get a copy of the
Unicode logo.
56. Unicode-related functions
● unichr()
● ord()
● unicode.encode()
● str.decode()
Thanks to Ian Bicking for pointing out that it should be
unicode.encode(), not str.encode().
57. Examples of usage
>>> s = unichr(23456)
>>> print s
宠
>>> ord(s)
23456
>>> s.encode('utf-8')
'\xe5\xae\xa0'
>>> s.encode('gb2312')
'\xb3\xe8'
>>> print _
³è
>>> '\xe5\xae\xa0'.decode('utf-8')
u'\u5ba0'
>>> print _
宠
>>>
The PDF version of this presentation doesn't render
the Chinese character properly. But if you copy and
paste in a Unicode-aware editor, you'll probably be
able to see it. I admit it is pretty rare to put a
Chinese character in Courier New font.
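For readers on Python 3, the session above might look roughly like the sketch below. This is an assumption about how it maps over, not part of the original talk: unichr() becomes chr(), unicode becomes str, and encoded data is a separate bytes type. It also shows where the odd-looking "³è" comes from, using Latin-1 as a stand-in for whatever the console assumed.

```python
# Python 3 sketch: str holds Unicode text, bytes holds encoded data.
s = chr(23456)                    # Python 2: unichr(23456)

utf8 = s.encode("utf-8")          # b'\xe5\xae\xa0'
gb = s.encode("gb2312")           # b'\xb3\xe8'

# Decoding with the matching codec restores the character.
assert utf8.decode("utf-8") == s
assert gb.decode("gb2312") == s

# Decoding with the wrong codec succeeds but yields gibberish:
# the GB2312 bytes read as Latin-1 are two unrelated characters.
mojibake = gb.decode("latin-1")
print(mojibake)                   # ³è
```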
58. unicode and str: two different types!
● They have exactly the same API
● But they don't have the same
repr()
● And they don't have the same
type()
● Use isinstance() to tell them apart
Thanks to Atul Varma for making some comments that
led me to adding this slide (and the next one).
59. unicode and str example
>>> u = unicode()
>>> type(u)
<type 'unicode'>
>>> print repr(u)
u''
>>> isinstance(u, str)
False
>>> s = str()
>>> type(s)
<type 'str'>
>>> print repr(s)
''
>>> isinstance(s, unicode)
False
>>>
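The same two-type split exists in Python 3 under different names. The sketch below is a Python 3 analogue, not the slide's own code: str is the text type (Python 2's unicode) and bytes is the encoded type (Python 2's str).

```python
# Python 3 sketch: the text/bytes split that Python 2 calls unicode/str.
text = "你好"                 # str: a sequence of code points
data = text.encode("utf-8")   # bytes: a sequence of integers 0-255

print(type(text).__name__, type(data).__name__)        # str bytes
print(isinstance(text, bytes), isinstance(data, str))  # False False
print(len(text), len(data))                            # 2 6

# Unlike Python 2, mixing the two raises an error instead of decoding implicitly.
try:
    text + data
except TypeError as exc:
    print("TypeError:", exc)
```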
60. Two ways to write a Unicode file
● Use the file object returned by
codecs.open()
● Use a regular file object along with
unicode.encode()
61. Example using codecs.open()
>>> import codecs
>>> s = u'\u4f60\u597d\u4e16\u754c'
>>> fout = codecs.open('document.txt', 'w', 'utf-8')
>>> fout.write(s)
>>> fout.close()
>>> open('document.txt').read().decode('utf-8')
u'\u4f60\u597d\u4e16\u754c'
>>>
62. Example using unicode.encode()
>>> s = u'\u4f60\u597d\u4e16\u754c'
>>> fout = open('document.txt', 'w')
>>> fout.write(s.encode('utf-8'))
>>> fout.close()
>>> open('document.txt').read().decode('utf-8')
u'\u4f60\u597d\u4e16\u754c'
>>>
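Both approaches carry over to Python 3 with small changes. The sketch below assumes Python 3, where the built-in open() accepts an encoding argument and so covers the codecs.open() case; the temporary directory is only there to keep the example self-contained.

```python
import os
import tempfile

text = "你好世界"
path = os.path.join(tempfile.mkdtemp(), "document.txt")

# Way 1: let the file object do the encoding.
with open(path, "w", encoding="utf-8") as fout:
    fout.write(text)

# Way 2: encode the text yourself and write raw bytes.
with open(path, "wb") as fout:
    fout.write(text.encode("utf-8"))

# Either way the bytes on disk are the same and decode back to the text.
with open(path, "rb") as fin:
    raw = fin.read()
print(raw.decode("utf-8") == text)  # True
```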
63. Two ways to read Unicode files
● Use the file object returned by
codecs.open()
● Use a regular file object along with
str.decode()
● Watch out for the BOM!
64. What is Byte Order Mark?
● Called BOM for short
● In UTF-16 docs, indicates little-
endian or big-endian
● Often appears in UTF-8 docs to
distinguish them from ASCII docs
● Use read(1) for UTF-8 documents
with BOM
The actual value of the BOM is 0xfeff. If you try to print
it in the Python interpreter, you won't see anything.
65. Example of reading from a UTF-8
file with BOM
>>> import codecs
>>> fin = codecs.open('bom_document.txt', 'r', 'utf-8')
>>> fin.read(1)
u'\ufeff'
>>> fin.read()
u'\u4f60\u597d\u4e16\u754c'
>>> fin.close()
>>>
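An alternative to the read(1) trick is the "utf-8-sig" codec, which skips a leading BOM when one is present. The talk does not mention this codec, so treat the sketch below as a suggestion. It assumes Python 3 and writes its own sample file to a temporary directory.

```python
import codecs
import os
import tempfile

text = "你好世界"
path = os.path.join(tempfile.mkdtemp(), "bom_document.txt")

# Write a UTF-8 file that starts with a BOM (bytes EF BB BF).
with open(path, "wb") as fout:
    fout.write(codecs.BOM_UTF8 + text.encode("utf-8"))

# Plain "utf-8" keeps the BOM as a leading U+FEFF character...
with open(path, encoding="utf-8") as fin:
    with_bom = fin.read()

# ...while "utf-8-sig" drops it if present (and is harmless if absent).
with open(path, encoding="utf-8-sig") as fin:
    without_bom = fin.read()

print(with_bom[0] == "\ufeff", without_bom == text)  # True True
```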
66. Reading and writing XML
● ElementTree handles everything
implicitly
● It even eats the BOM without
complaining
● It doesn't even need the XML
declaration (as long as you use ASCII
or UTF-8)
● cElementTree works great too!
The lxml module is similarly awesome.
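A small sketch of the BOM claim, assuming Python 3 and the standard-library xml.etree.ElementTree; the element name is made up for the example. The input is raw UTF-8 bytes with a leading BOM and no XML declaration.

```python
import codecs
import xml.etree.ElementTree as ET

# UTF-8 bytes with a leading BOM and no XML declaration.
doc = codecs.BOM_UTF8 + "<greeting>你好世界</greeting>".encode("utf-8")

# The parser works out the encoding itself and hands back text (str).
root = ET.fromstring(doc)
print(root.tag, root.text)
```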
67. File system directory listing
● On Windows, os.listdir('.') won't
show you int'l characters
● You need to use os.listdir(u'.') to
see the Unicode files
● os.getcwd() doesn't show int'l
characters
● Use os.getcwdu() instead
The behavior under Mac OS X is somewhat different. I
don't know about Linux.
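In Python 3 the same "type in, type out" rule applies, with str and bytes in place of unicode and str. os.getcwdu() is gone there: os.getcwd() returns str and os.getcwdb() returns bytes. The sketch below assumes Python 3 and a filesystem encoding that can represent the file name (UTF-8, for instance); the file name is illustrative.

```python
import os
import tempfile

# Scratch directory holding one file with a non-ASCII name.
folder = tempfile.mkdtemp()
open(os.path.join(folder, "你好.txt"), "w").close()

as_text = os.listdir(folder)                # str path in -> str names out
as_bytes = os.listdir(os.fsencode(folder))  # bytes path in -> bytes names out

print(as_text)
print(as_bytes)
```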
68. String interpolation
● Str template strings can be
interpolated with both unicode and
str objects (automatic conversion
to unicode)
● Unicode template strings need to
be interpolated with unicode
objects
Template engines have these sorts of issues as well.
In particular, if you want to render a unicode string in
Mako or Myghty, you need to pass unicode strings
into the template.
69. String interpolation example
>>> 'Hello %s' % u'\u98db\u9d3b'
u'Hello \u98db\u9d3b'
>>> u'Hello %s' % u'\u98db\u9d3b'
u'Hello \u98db\u9d3b'
>>> 'Hello %s' % '\xe9\xa3\x9b\xe9\xb4\xbb'
'Hello \xe9\xa3\x9b\xe9\xb4\xbb'
>>> u'Hello %s' % '\xe9\xa3\x9b\xe9\xb4\xbb'
Traceback (most recent call last):
  File "<pyshell#36>", line 1, in ?
    u'Hello %s' % '\xe9\xa3\x9b\xe9\xb4\xbb'
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)
>>>
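The failure mode changes in Python 3, which is worth seeing next to the session above. This sketch assumes Python 3: interpolating bytes into a text template no longer raises, but it silently inserts the repr of the bytes, so decoding at the boundary is still the fix.

```python
name_text = "飛鴻"
name_bytes = name_text.encode("utf-8")   # b'\xe9\xa3\x9b\xe9\xb4\xbb'

# Text into text: fine.
print("Hello %s" % name_text)

# Bytes into text: no error, but the result contains the bytes' repr.
wrong = "Hello %s" % name_bytes
print(wrong)

# Decode explicitly first, then interpolate.
right = "Hello %s" % name_bytes.decode("utf-8")
print(right == "Hello 飛鴻")  # True
```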
70. Putting Unicode in your Python
source code
● Put “# -*- coding: utf-8 -*-” at top of
your file
● Idle automatically detects non-
ASCII characters and prompts to
edit your file
● Not generally recommended
I don't recommend putting Unicode strings in your
source code because people who don't have
Unicode-aware editors will just see annoying
gibberish.
71. Regular expressions
● The \w special character doesn't
usually match non-ASCII
characters
● To match non-ASCII characters,
use re.UNICODE flag
● Remember that punctuation in
different languages uses different
characters
Punctuation characters in English:
.?!
Compare with punctuation characters in Chinese:
。?!
Although they only look slightly different, they do have
different code points in Unicode.
72. Regular expression example
>>> import re
>>> s = u'ABC\u4f60\u597d\u4e16\u754c'
>>> m = re.match(r"\w+", s)
>>> m.group()
u'ABC'
>>> m = re.match(r"\w+", s, re.UNICODE)
>>> m.group()
u'ABC\u4f60\u597d\u4e16\u754c'
>>>
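The default is reversed in Python 3, which can surprise people porting the example above. A minimal sketch, assuming Python 3 and a str (not bytes) pattern:

```python
import re

s = "ABC你好世界"

# In Python 3, \w in a str pattern is Unicode-aware by default...
unicode_match = re.match(r"\w+", s).group()
print(unicode_match)   # ABC你好世界

# ...and re.ASCII restores the narrow, ASCII-only behaviour.
ascii_match = re.match(r"\w+", s, re.ASCII).group()
print(ascii_match)     # ABC
```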
73. Considerations for web pages
● Don't make pages or folders with int'l
characters (Firefox doesn't handle int'l
URLs well)
● Make sure you use the <meta> tag
when generating web pages
● You can display Unicode even in
ASCII-encoded pages (use character
entities)
As Atul Varma pointed out, Firefox mangles the URL
but does so in a standard way. However, it still ends
up not finding the page. IE can actually find and
display pages with Unicode names. This is probably
the only thing IE does better than Firefox.
74. Web page with <meta> tag
<html>
<head>
<meta http-equiv="Content-Type"
content="text/html;charset=utf-8">
</head>
<body>
<h1> 你好世界 </h1>
</body>
</html>
The text is Chinese for “Hello World”.
75. Web page with character entities
<html>
<head>
<meta http-equiv="Content-Type"
content="text/html;charset=ascii">
</head>
<body>
<h1>&#20320;&#22909;&#19990;&#30028;</h1>
</body>
</html>
Conversion recipe: s.encode('ascii', 'xmlcharrefreplace')
Thanks to Ian Bicking for pointing out a shorter
conversion recipe. For the record, the original one
is:
''.join('&#%d;' % ord(c) for c in s)
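The conversion recipe works the same way in Python 3, except that the result is bytes. The sketch below assumes Python 3 and uses html.unescape() from the standard library purely as a round-trip check; it is not part of the original recipe.

```python
import html

s = "你好世界"

# Replace every character ASCII cannot represent with a numeric reference.
ascii_bytes = s.encode("ascii", "xmlcharrefreplace")
print(ascii_bytes)  # b'&#20320;&#22909;&#19990;&#30028;'

# Unescaping the references recovers the original text.
print(html.unescape(ascii_bytes.decode("ascii")) == s)  # True
```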
76. Processing documents of unknown
encoding
● Use the chardet module
● chardet.detect() function:
– accepts a string
– returns a dictionary with two keys:
'encoding' and 'confidence'
● Also try BeautifulSoup for web pages
77. Encoding detection example
>>> import chardet, urllib2
>>> html = urllib2.urlopen('http://chol.co.kr').read()
>>> result = chardet.detect(html)
>>> result
{'confidence': 0.98999999999999999, 'encoding': 'EUC-KR'}
>>> print html.decode(result['encoding'])
You can also try BeautifulSoup for web pages.
Example:
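import urllib2
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3
content = urllib2.urlopen(url).read()  # url: address of the page to inspect
soup = BeautifulSoup(content)
encoding = soup.originalEncoding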
78. Tools that play nice with Unicode
● IDLE (raw_input() accepts
Unicode)
● Notepad++ (can autodetect UTF-8
files with BOM)
● jEdit
Note that only IDLE on Windows has this feature.
79. Libraries that play nice with Unicode
● Tkinter
● wxPython
● Mako
● BeautifulSoup
● feedparser
● Elementtree
● lxml
80. Libraries that don't play nice with
Unicode
● cStringIO (StringIO.write() doesn't
accept Unicode strings)
● buzhug
● Various ID3 libraries
● ?
81. Databases
● SQLite has no problem with
Unicode
● SQLAlchemy with SQLite is fine
too
● Other databases - ?
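A quick way to check the SQLite claim for yourself is an in-memory round trip. This sketch assumes Python 3 and the standard-library sqlite3 module; the table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE greetings (lang TEXT, message TEXT)")
conn.execute("INSERT INTO greetings VALUES (?, ?)", ("zh", "你好世界"))

# Text goes in as str and comes back as str, unchanged.
row = conn.execute(
    "SELECT message FROM greetings WHERE lang = ?", ("zh",)
).fetchone()
print(row[0] == "你好世界")  # True
conn.close()
```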
82. Platform-specific issues
● Windows DOS prompt has no love for
Unicode
● MacOS X IDLE can't handle Unicode
● MacOS X terminal doesn't like
Unicode, likes UTF-8
● Recommendation: Use PyCrust?
I checked and it turns out that PyCrust chokes on int'l
characters sent through raw_input(), even on
Windows. So I formally withdraw my
recommendation of PyCrust.
84. Questions?
有问题吗?
Thanks to the experts in the audience who provided
hard-hitting answers to the tough questions.
And, of course, thanks to everyone who attended my
first talk at ChiPy. I hope there will be more.