Unicode for Small Children (and Children at Heart)

Unicode for Small
Children (and
Children at Heart)
Feihong Hsu
Chicago Python Users Group
March 8, 2007

Welcome to the Wonderful World of
Unicorns!
A Magical Guide to the World's Most Beloved
Mythological Equine

Welcome to the Useful World of
Unicode!
A Practical Guide to the World's Most Popular
International Text Standard

Top 3 reasons that unicorns are
great
● Friendly and wise

● Healing power

● Bane of evil

Top 3 reasons that Unicode is
important
● Comprehensive language coverage
● Multiple languages in a single document
● Standardized

The difference between Horses and
Unicorns
            Horses                         Unicorns
Habitat     Grasslands                     Enchanted forests
Diet        Apples, oats, grass,           Love, spirit of wonder
            barley, etc.
Abilities   Galloping, eating, pooping     Sentience, telepathy,
                                           laser vision (unconfirmed)

Difference between ISO 8859 and
Unicode
                             ISO 8859    Unicode
# supported languages        Some        A lot
# supported characters       256         100,000+
# bytes for each character   1           1-4

So what, exactly, is Unicode?

Unicode is a standard that assigns a
unique number to each character in
every human language
Ok, not every language, see next slide

What is Unicode not?
● Doesn't address how the characters are rendered (that's up to font makers)
● Doesn't deal with imaginary languages like Klingon and Elvish
● Doesn't deal with ancient languages
● Doesn't deal with obscure languages that no one uses

How does Hollywood “create”
unicorns?
● CGI

● Horse with horn glued to forehead

● Two dudes in a costume

How does a programmer create
Unicode documents?
● Technically, you can't make a Unicode document
● Usually you pick an official encoding (UTF-8, UTF-16, etc.)
● Sometimes you use a language-specific encoding (GB2312, Shift-JIS)

Python and Unicorn
Working together to combat evil!

Python and Unicode
Working together to create international
applications!

Unicode-related functions
● unichr()
● ord()

● unicode.encode()

● str.decode()

Examples of usage
>>> s = unichr(23456)
>>> print s
宠
>>> ord(s)
23456
>>> s.encode('utf-8')
'\xe5\xae\xa0'
>>> s.encode('gb2312')
'\xb3\xe8'
>>> print _
³è
>>> '\xe5\xae\xa0'.decode('utf-8')
u'\u5ba0'
>>> print _
宠
>>>
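
On Python 3 the same experiment looks slightly different. Here's a minimal sketch, assuming Python 3 (where chr() stands in for unichr() and every str is already Unicode); the variable names are just for illustration and aren't from the talk.

```python
# Assumption: Python 3. chr() replaces unichr(), and encode() returns bytes.
s = chr(23456)                      # code point U+5BA0
code_point = ord(s)

utf8 = s.encode('utf-8')            # three bytes in UTF-8
gb = s.encode('gb2312')             # two bytes in GB2312

# The same character needs a different number of bytes in each encoding.
sizes = {name: len(s.encode(name)) for name in ('utf-8', 'utf-16-le', 'gb2312')}

# Decoding with the matching codec gets the character back.
roundtrip = utf8.decode('utf-8')

# Decoding GB2312 bytes as Latin-1 raises no error; it just yields gibberish.
mojibake = gb.decode('latin-1')
```

Note that decoding with the wrong codec doesn't fail here; it quietly produces the ³è gibberish seen in the session above.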

unicode and str: two different types!
● They have exactly the same API
● But they don't have the same repr()
● And they don't have the same type()
● Use isinstance() to tell them apart

unicode and str example
>>> u = unicode()
>>> type(u)
<type 'unicode'>
>>> print repr(u)
u''
>>> isinstance(u, str)
False
>>> s = str()
>>> type(s)
<type 'str'>
>>> print repr(s)
''
>>> isinstance(s, unicode)
False
>>>

Two ways to write a Unicode file
● Use the file object returned by codecs.open()
● Use a regular file object along with unicode.encode()

Example using codecs.open()

>>> import codecs
>>> s = u'\u4f60\u597d\u4e16\u754c'
>>> fout = codecs.open('document.txt', 'w', 'utf-8')
>>> fout.write(s)
>>> fout.close()
>>> open('document.txt').read().decode('utf-8')
u'\u4f60\u597d\u4e16\u754c'
>>>

Example using unicode.encode()

>>> s = u'\u4f60\u597d\u4e16\u754c'
>>> fout = open('document.txt', 'w')
>>> fout.write(s.encode('utf-8'))
>>> fout.close()
>>> open('document.txt').read().decode('utf-8')
u'\u4f60\u597d\u4e16\u754c'
>>>
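
Both approaches should leave identical bytes on disk. A rough Python 3 sketch of that check, assuming the built-in open() with an encoding argument as the stand-in for codecs.open(); the file names and the temp directory are made up for the example.

```python
import os
import tempfile

text = '\u4f60\u597d\u4e16\u754c'   # "Hello World" in Chinese

with tempfile.TemporaryDirectory() as tmpdir:
    path_a = os.path.join(tmpdir, 'document_a.txt')   # hypothetical file names
    path_b = os.path.join(tmpdir, 'document_b.txt')

    # Way 1: let the file object do the encoding.
    with open(path_a, 'w', encoding='utf-8') as fout:
        fout.write(text)

    # Way 2: encode by hand and write the raw bytes.
    with open(path_b, 'wb') as fout:
        fout.write(text.encode('utf-8'))

    # Read both files back as bytes so they can be compared.
    with open(path_a, 'rb') as fin:
        raw_a = fin.read()
    with open(path_b, 'rb') as fin:
        raw_b = fin.read()

same_bytes = raw_a == raw_b
decoded = raw_a.decode('utf-8')
```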

Two ways to read Unicode files
● codecs.open()
● str.decode()
● Watch out for the BOM!

What is a Byte Order Mark?
● Called BOM for short
● In UTF-16 docs, indicates little-endian or big-endian
● Often appears in UTF-8 docs to distinguish them from ASCII docs
● Use read(1) to skip the BOM in UTF-8 documents that start with one

Example of reading from a UTF-8
file with BOM

>>> import codecs
>>> fin = codecs.open('bom_document.txt', 'r', 'utf-8')
>>> fin.read(1)
u'\ufeff'
>>> fin.read()
u'\u4f60\u597d\u4e16\u754c'
>>> fin.close()
>>>
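
If you'd rather not call read(1) by hand, Python also ships a utf-8-sig codec that drops a leading BOM while decoding. A small sketch, assuming Python 3; strip_bom() is a hypothetical helper written for this example, not something from the standard library.

```python
import codecs

payload = '\u4f60\u597d\u4e16\u754c'
raw = codecs.BOM_UTF8 + payload.encode('utf-8')   # bytes of a UTF-8 file with BOM

# Plain utf-8 keeps the BOM as a leading U+FEFF character.
plain = raw.decode('utf-8')

# utf-8-sig removes the BOM if it is present.
stripped = raw.decode('utf-8-sig')


def strip_bom(text):
    """Hypothetical helper: drop a leading U+FEFF from already-decoded text."""
    return text[1:] if text.startswith('\ufeff') else text
```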

Reading and writing XML
● ElementTree handles everything implicitly
● It even eats the BOM without complaining
● It doesn't even need the XML declaration (as long as you use ASCII or UTF-8)
● cElementTree works great too!
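
A quick way to convince yourself of the BOM claim. This is a sketch, assuming Python 3 and the standard-library xml.etree.ElementTree; the <greeting> element is invented for the example.

```python
import codecs
import xml.etree.ElementTree as ET

payload = '\u4f60\u597d\u4e16\u754c'

# UTF-8 bytes with a BOM in front and no XML declaration at all.
xml_bytes = codecs.BOM_UTF8 + ('<greeting>%s</greeting>' % payload).encode('utf-8')

root = ET.fromstring(xml_bytes)     # the parser copes with the BOM
text = root.text

# Writing back out: ask for UTF-8 explicitly to get encoded bytes.
serialized = ET.tostring(root, encoding='utf-8')
```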

File system directory listing
● On Windows, os.listdir('.') won't show you int'l characters
● You need to use os.listdir(u'.') to see the Unicode files
● os.getcwd() doesn't show int'l characters
● Use os.getcwdu() instead

String interpolation
● Str template strings can be interpolated with both unicode and str objects (automatic conversion to unicode)
● Unicode template strings need to be interpolated with unicode objects

String interpolation example
>>> 'Hello %s' % u'\u98db\u9d3b'
u'Hello \u98db\u9d3b'
>>> u'Hello %s' % u'\u98db\u9d3b'
u'Hello \u98db\u9d3b'
>>> 'Hello %s' % '\xe9\xa3\x9b\xe9\xb4\xbb'
'Hello \xe9\xa3\x9b\xe9\xb4\xbb'
>>> u'Hello %s' % '\xe9\xa3\x9b\xe9\xb4\xbb'
Traceback (most recent call last):
  File "<pyshell#36>", line 1, in ?
    u'Hello %s' % '\xe9\xa3\x9b\xe9\xb4\xbb'
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)
>>>
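
The safest habit is to decode byte strings yourself, naming the encoding, before interpolating. A sketch of that, assuming Python 3, where byte strings are written b'...' and are never decoded implicitly; the variable names are illustrative only.

```python
# Assumption: Python 3. These are the UTF-8 bytes from the session above.
name_bytes = b'\xe9\xa3\x9b\xe9\xb4\xbb'

# Decode explicitly, naming the codec, then interpolate text into text.
name = name_bytes.decode('utf-8')
greeting = 'Hello %s' % name

# The implicit ASCII decode that blew up above can be reproduced on purpose.
try:
    name_bytes.decode('ascii')
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc
```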

Putting Unicode in your Python
source code
● Put “# -*- coding: utf-8 -*-” at top of your file
● IDLE automatically detects non-ASCII characters and prompts to edit your file
● Not generally recommended

Regular expressions
● The \w special character doesn't usually match non-ASCII characters
● To match non-ASCII characters, use the re.UNICODE flag
● Remember that punctuation in different languages uses different characters

Regular expression example

>>> import re
>>> s = u'ABC\u4f60\u597d\u4e16\u754c'
>>> m = re.match(r"\w+", s)
>>> m.group()
u'ABC'
>>> m = re.match(r"\w+", s, re.UNICODE)
>>> m.group()
u'ABC\u4f60\u597d\u4e16\u754c'
>>>
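
For comparison, a Python 3 sketch. The assumption to keep in mind is that Python 3 flips the default: \w is Unicode-aware for str patterns, and re.ASCII gives the old ASCII-only behaviour. The sample sentence and the character classes are made up for the example.

```python
import re

s = 'ABC\u4f60\u597d\u4e16\u754c'

# re.ASCII restricts \w to [a-zA-Z0-9_], like the first match above.
ascii_only = re.match(r'\w+', s, re.ASCII).group()

# Without a flag, \w already matches the Chinese characters.
unicode_aware = re.match(r'\w+', s).group()

# Punctuation: the Chinese full stop is not the ASCII period.
sentence = '\u4f60\u597d\u3002'
western = re.search('[.?!]', sentence)
both = re.search('[.?!\u3002\uff1f\uff01]', sentence)
```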

Considerations for web pages
● Don't make pages or folders with int'l characters (Firefox doesn't handle int'l URLs well)
● Make sure you use the <meta> tag when generating web pages
● You can display Unicode even in ASCII-encoded pages (use character entities)

Web page with <meta> tag

<html>
<head>
<meta http-equiv="Content-Type"
content="text/html;charset=utf-8">
</head>
<body>
<h1> 你好世界 </h1>
</body>
</html>

Web page with character entities

<html>
<head>
<meta http-equiv="Content-Type"
content="text/html;charset=ascii">
</head>
<body>
<h1>&#20320;&#22909;&#19990;&#30028;</h1>
</body>
</html>

Conversion recipe: s.encode('ascii', 'xmlcharrefreplace')
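
To see what the recipe produces, here is a sketch assuming Python 3 (the result is a bytes object there); html.unescape() is used only to check the round trip, and the variable names are illustrative.

```python
import html

s = '\u4f60\u597d\u4e16\u754c'

# Every non-ASCII character becomes a numeric character reference.
entities = s.encode('ascii', 'xmlcharrefreplace')

# The same thing spelled out by hand, one reference per character.
manual = ''.join('&#%d;' % ord(c) for c in s)

# A browser (or html.unescape) turns the references back into text.
restored = html.unescape(entities.decode('ascii'))
```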

Processing documents of unknown
encoding
● Use the chardet module
● chardet.detect() function:
– accepts a string
– returns a dictionary with two keys: 'encoding' and 'confidence'
● Also try BeautifulSoup for web pages

Encoding detection example

>>> import chardet, urllib2
>>> html = urllib2.urlopen('http://chol.co.kr').read()
>>> result = chardet.detect(html)
>>> result
{'confidence': 0.98999999999999999, 'encoding': 'EUC-KR'}
>>> print html.decode(result['encoding'])

Tools that play nice with Unicode
● IDLE (raw_input() accepts Unicode)
● Notepad++ (can autodetect UTF-8 files with BOM)
● jEdit

Libraries that play nice with Unicode
● Tkinter
● wxPython

● Mako

● BeautifulSoup

● feedparser

● ElementTree

● lxml

Libraries that don't play nice with
Unicode
● cStringIO (StringIO.write() doesn't accept non-ASCII Unicode strings)
● buzhug

● Various ID3 libraries

● ?

Databases
● SQLite has no problem with Unicode
● SQLAlchemy with SQLite is fine too
● Other databases - ?
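
A quick round trip through SQLite, as a sketch assuming Python 3's standard-library sqlite3 module and an in-memory database; the table and column names are invented for the example.

```python
import sqlite3

phrase_in = '\u4f60\u597d\u4e16\u754c'

conn = sqlite3.connect(':memory:')          # throwaway in-memory database
conn.execute('CREATE TABLE greetings (lang TEXT, phrase TEXT)')

# Pass text as a bound parameter; no manual encoding is needed.
conn.execute('INSERT INTO greetings VALUES (?, ?)', ('zh', phrase_in))
conn.commit()

phrase_out = conn.execute(
    'SELECT phrase FROM greetings WHERE lang = ?', ('zh',)).fetchone()[0]

# SQLite counts characters, not bytes, for TEXT values.
char_count = conn.execute('SELECT length(phrase) FROM greetings').fetchone()[0]
conn.close()
```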

Platform-specific issues
● Windows DOS prompt has no love for Unicode
● MacOS X IDLE can't handle Unicode
● MacOS X terminal doesn't like Unicode, likes UTF-8
● Recommendation: Use PyCrust?

Demos
● Filesystem demo
● Mako template engine demo

● chardet demo

● pysqlite demo

● wxPython demo

Unicode for Small
Children (and
Children at Heart)
Feihong Hsu
Chicago Python Users Group
March 8, 2007

Thanks to Chris McAvoy for the conversation at PyCon
that inspired this talk.

Welcome to the Wonderful World of
Unicorns!
A Magical Guide to the World's Most Beloved
Mythological Equine


Completely drawn on my tablet PC using the free Ink
Art program. Unfortunately, Ink Art doesn't come
with good coloring tools so I just left it colorless.

Welcome to the Useful World of
Unicode!
A Practical Guide to the World's Most Popular
International Text Standard


Top 3 reasons that unicorns are
great
● Friendly and wise
● Healing power

● Bane of evil


Top 3 reasons that Unicode is
important
● Comprehensive language coverage
● Multiple languages in a single document
● Standardized

The Unicode Standard is maintained by the Unicode
Consortium, an organization based in California.

The difference between Horses and
Unicorns
            Horses                         Unicorns
Habitat     Grasslands                     Enchanted forests
Diet        Apples, oats, grass,           Love, spirit of wonder
            barley, etc.
Abilities   Galloping, eating, pooping     Sentience, telepathy,
                                           laser vision (unconfirmed)

I really wasn't sure about including the laser vision
ability. I honestly thought it was an urban myth. But
when a friend of my cousin's sister's friend said that
she saw it in person, I finally relented.

Difference between ISO 8859 and
Unicode
                             ISO 8859    Unicode
# supported languages        Some        A lot
# supported characters       256         100,000+
# bytes for each character   1           1-4

Somebody noted that ISO 8859 can actually support
more than 256 characters through its various
extensions, so this is an oversimplification.

So what, exactly, is Unicode?

Unicode is a standard that assigns a
unique number to each character in
every human language
Ok, not every language, see next slide


The “unique number” for each character is called a
code point in Unicode terminology.

What is Unicode not?
● Doesn't address how the characters are rendered (that's up to font makers)
● Doesn't deal with imaginary languages like Klingon and Elvish
● Doesn't deal with ancient languages
● Doesn't deal with obscure languages that no one uses

Although there are many languages that Unicode
doesn't directly support, there are extensions to
Unicode that are designed to handle these cases.

How does Hollywood “create”
unicorns?
● CGI
● Horse with horn glued to forehead

● Two dudes in a costume


It helps if the two dudes are very high. And if they
have circus experience. And if neither of them has a
trick leg.

How does a programmer create
Unicode documents?
● Technically, you can't make a Unicode document
● Usually you pick an official encoding (UTF-8, UTF-16, etc.)
● Sometimes you use a language-specific encoding (GB2312, Shift-JIS)

In the vast majority of cases, I think UTF-8 is more
than adequate. If in doubt, just go with that
encoding.

Python and Unicorn
Working together to combat evil!


I think this is a case of the graphic actually
undermining the point I'm trying to make. This is my
attempt to render a dynamic, exciting action scene of
a pitched battle between orc, unicorn and python.
They are fighting for the fate of the damsel in distress
because she is, like, oh so fine (well, at least when
she's got her makeup on, which she doesn't in this
picture). Unfortunately, the unicorn looks like it's
about to be stabbed in the ass, and the python
seems more interested in biting a chunk out of the
damsel than in saving her.

Python and Unicode
Working together to create international
applications!


The only time I actually visited the Unicode
Consortium's web site was to get a copy of the
Unicode logo.

Unicode-related functions
● unichr()
● ord()

● unicode.encode()

● str.decode()


Thanks to Ian Bicking for pointing out that it should be
unicode.encode(), not str.encode().

Examples of usage
>>> s = unichr(23456)
>>> print s
宠
>>> ord(s)
23456
>>> s.encode('utf-8')
'\xe5\xae\xa0'
>>> s.encode('gb2312')
'\xb3\xe8'
>>> print _
³è
>>> '\xe5\xae\xa0'.decode('utf-8')
u'\u5ba0'
>>> print _
宠
>>>

The PDF version of this presentation doesn't render
the Chinese character properly. But if you copy and
paste in a Unicode-aware editor, you'll probably be
able to see it. I admit it is pretty rare to put a
Chinese character in Courier New font.

unicode and str: two different types!
● They have exactly the same API
● But they don't have the same repr()
● And they don't have the same type()
● Use isinstance() to tell them apart

Thanks to Atul Varma for making some comments that
led me to adding this slide (and the next one).

unicode and str example
>>> u = unicode()
>>> type(u)
<type 'unicode'>
>>> print repr(u)
u''
>>> isinstance(u, str)
False
>>> s = str()
>>> type(s)
<type 'str'>
>>> print repr(s)
''
>>> isinstance(s, unicode)
False
>>>

Two ways to write a Unicode file
● Use the file object returned by codecs.open()
● Use a regular file object along with unicode.encode()

Example using codecs.open()

>>> import codecs
>>> s = u'\u4f60\u597d\u4e16\u754c'
>>> fout = codecs.open('document.txt', 'w', 'utf-8')
>>> fout.write(s)
>>> fout.close()
>>> open('document.txt').read().decode('utf-8')
u'\u4f60\u597d\u4e16\u754c'
>>>

Example using unicode.encode()

>>> s = u'\u4f60\u597d\u4e16\u754c'
>>> fout = open('document.txt', 'w')
>>> fout.write(s.encode('utf-8'))
>>> fout.close()
>>> open('document.txt').read().decode('utf-8')
u'\u4f60\u597d\u4e16\u754c'
>>>

Two ways to read Unicode files
● codecs.open()
● str.decode()
● Watch out for the BOM!

What is a Byte Order Mark?
● Called BOM for short
● In UTF-16 docs, indicates little-endian or big-endian
● Often appears in UTF-8 docs to distinguish them from ASCII docs
● Use read(1) to skip the BOM in UTF-8 documents that start with one

The actual value of the BOM is 0xfeff. If you try to print
it in the Python interpreter, you won't see anything.

Example of reading from a UTF-8
file with BOM

>>> import codecs
>>> fin = codecs.open('bom_document.txt', 'r', 'utf-8')
>>> fin.read(1)
u'\ufeff'
>>> fin.read()
u'\u4f60\u597d\u4e16\u754c'
>>> fin.close()
>>>

Reading and writing XML
● ElementTree handles everything implicitly
● It even eats the BOM without complaining
● It doesn't even need the XML declaration (as long as you use ASCII or UTF-8)
● cElementTree works great too!

The lxml module is similarly awesome.

File system directory listing
● On Windows, os.listdir('.') won't show you int'l characters
● You need to use os.listdir(u'.') to see the Unicode files
● os.getcwd() doesn't show int'l characters
● Use os.getcwdu() instead

The behavior under Mac OS X is somewhat different. I
don't know about Linux.

String interpolation
● Str template strings can be interpolated with both unicode and str objects (automatic conversion to unicode)
● Unicode template strings need to be interpolated with unicode objects

Template engines have these sorts of issues as well.
In particular, if you want to render a unicode string in
Mako or Myghty, you need to pass unicode strings
into the template.

String interpolation example
>>> 'Hello %s' % u'\u98db\u9d3b'
u'Hello \u98db\u9d3b'
>>> u'Hello %s' % u'\u98db\u9d3b'
u'Hello \u98db\u9d3b'
>>> 'Hello %s' % '\xe9\xa3\x9b\xe9\xb4\xbb'
'Hello \xe9\xa3\x9b\xe9\xb4\xbb'
>>> u'Hello %s' % '\xe9\xa3\x9b\xe9\xb4\xbb'
Traceback (most recent call last):
  File "<pyshell#36>", line 1, in ?
    u'Hello %s' % '\xe9\xa3\x9b\xe9\xb4\xbb'
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)
>>>

Putting Unicode in your Python
source code
● Put “# -*- coding: utf-8 -*-” at top of your file
● IDLE automatically detects non-ASCII characters and prompts to edit your file
● Not generally recommended

I don't recommend putting Unicode strings in your
source code because people who don't have
Unicode-aware editors will just see annoying
gibberish.

Regular expressions
● The \w special character doesn't usually match non-ASCII characters
● To match non-ASCII characters, use the re.UNICODE flag
● Remember that punctuation in different languages uses different characters

Punctuation characters in English:
.?!

Compare with punctuation characters in Chinese:
。？！

Although they only look slightly different, they do have
different code points in Unicode.

Regular expression example

>>> import re
>>> s = u'ABC\u4f60\u597d\u4e16\u754c'
>>> m = re.match(r"\w+", s)
>>> m.group()
u'ABC'
>>> m = re.match(r"\w+", s, re.UNICODE)
>>> m.group()
u'ABC\u4f60\u597d\u4e16\u754c'
>>>

Considerations for web pages
● Don't make pages or folders with int'l characters (Firefox doesn't handle int'l URLs well)
● Make sure you use the <meta> tag when generating web pages
● You can display Unicode even in ASCII-encoded pages (use character entities)

As Atul Varma pointed out, Firefox mangles the URL
but does so in a standard way. However, it still ends
up not finding the page. IE can actually find and
display pages with Unicode names. This is probably
the only thing IE does better than Firefox.

Web page with <meta> tag

<html>
<head>
<meta http-equiv="Content-Type"
content="text/html;charset=utf-8">
</head>
<body>
<h1> 你好世界 </h1>
</body>
</html>

The text is Chinese for “Hello World”.

Web page with character entities

<html>
<head>
<meta http-equiv="Content-Type"
content="text/html;charset=ascii">
</head>
<body>
<h1>&#20320;&#22909;&#19990;&#30028;</h1>
</body>
</html>

Conversion recipe: s.encode('ascii', 'xmlcharrefreplace')

Thanks to Ian Bicking for pointing out a shorter
conversion recipe. For the record, the original one
is:

''.join('&#%d;' % ord(c) for c in s)

Processing documents of unknown
encoding
● Use the chardet module
● chardet.detect() function:
– accepts a string
– returns a dictionary with two keys: 'encoding' and 'confidence'
● Also try BeautifulSoup for web pages

Encoding detection example

>>> import chardet, urllib2
>>> html = urllib2.urlopen('http://chol.co.kr').read()
>>> result = chardet.detect(html)
>>> result
{'confidence': 0.98999999999999999, 'encoding': 'EUC-KR'}
>>> print html.decode(result['encoding'])

You can also try BeautifulSoup for web pages.
Example:

import urllib2
from BeautifulSoup import BeautifulSoup

# url is whatever page you want to inspect
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content)
encoding = soup.originalEncoding

Tools that play nice with Unicode
● IDLE (raw_input() accepts Unicode)
● Notepad++ (can autodetect UTF-8 files with BOM)
● jEdit

Note that only IDLE on Windows has this feature.

Libraries that play nice with Unicode
● Tkinter
● wxPython

● Mako

● BeautifulSoup

● feedparser

● ElementTree

● lxml

Libraries that don't play nice with
Unicode
● cStringIO (StringIO.write() doesn't accept non-ASCII Unicode strings)
● buzhug

● Various ID3 libraries

● ?


Databases
● SQLite has no problem with Unicode
● SQLAlchemy with SQLite is fine too
● Other databases - ?

Platform-specific issues
● Windows DOS prompt has no love for Unicode
● MacOS X IDLE can't handle Unicode
● MacOS X terminal doesn't like Unicode, likes UTF-8
● Recommendation: Use PyCrust?

I checked and it turns out that PyCrust chokes on int'l
characters sent through raw_input(), even on
Windows. So I formally withdraw my
recommendation of PyCrust.

Demos
● Filesystem demo
● Mako template engine demo

● chardet demo

● pysqlite demo

● wxPython demo



Questions?
有问题吗？


Thanks to the experts in the audience who provided
hard-hitting answers to the tough questions.
And, of course, thanks to everyone who attended my
first talk at ChiPy. I hope there will be more.
