Unicode for Small Children (and Children at Heart)
Feihong Hsu
Chicago Python Users Group
March 8, 2007




Thanks to Chris McAvoy for the conversation at PyCon
 that inspired this talk.
Welcome to the Wonderful World of Unicorns!
A Magical Guide to the World's Most Beloved Mythological Equine




Completely drawn on my tablet PC using the free Ink
 Art program. Unfortunately, Ink Art doesn't come
 with good coloring tools so I just left it colorless.
Welcome to the Useful World of Unicode!
A Practical Guide to the World's Most Popular International Text Standard

Top 3 reasons that unicorns are great
● Friendly and wise
● Healing power
● Bane of evil

Top 3 reasons that Unicode is important
● Comprehensive language coverage
● Multiple languages in a single document
● Standardized




The Unicode Standard is maintained by the Unicode
 Consortium, an organization based in California.
The difference between Horses and Unicorns

            Horses                              Unicorns
Habitat     Grasslands                          Enchanted forests
Diet        Apples, oats, grass, barley, etc.   Love, spirit of wonder
Abilities   Galloping, eating, pooping          Sentience, telepathy, laser vision (unconfirmed)




I really wasn't sure about including the laser vision
   ability. I honestly thought it was an urban myth. But
   when a friend of my cousin's sister's friend said that
   she saw it in person, I finally relented.
Difference between ISO 8859 and Unicode

                             ISO 8859   Unicode
# supported languages        Some       A lot
# supported characters       256        100,000+
# bytes for each character   1          1-4




Somebody noted that ISO 8859 can actually support
 more than 256 characters through its various
 extensions, so this is an oversimplification.
So what, exactly, is Unicode?
Unicode is a standard that assigns a unique number to each character in every human language
(OK, not every language; see next slide)




The “unique number” for each character is called a
 code point in Unicode terminology.
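
To make "code point" concrete, here is a minimal sketch. It assumes Python 3, where chr() and ord() do the job that unichr() and ord() do in the Python 2 examples in these slides. The helper name code_point_label is made up for illustration.

```python
def code_point_label(ch):
    """Return the conventional U+XXXX label for a single character."""
    return 'U+%04X' % ord(ch)

# Each character maps to exactly one number (its code point) and back.
for ch in 'A\u00e9\u5ba0':
    # ascii() keeps the output printable even on a console that is not Unicode-aware.
    print(ascii(ch), ord(ch), code_point_label(ch))

# Going the other way: from number to character.
assert chr(0x5BA0) == '\u5ba0'
```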
What is Unicode not?
● Doesn't address how the characters are rendered (that's up to font makers)
● Doesn't deal with imaginary languages like Klingon and Elvish
● Doesn't deal with ancient languages
● Doesn't deal with obscure languages that no one uses




Although there are many languages that Unicode
  doesn't directly support, there are extensions to
  Unicode that are designed to handle these cases.
How does Hollywood “create” unicorns?
● CGI
● Horse with horn glued to forehead
● Two dudes in a costume




It helps if the two dudes are very high. And if they
   have circus experience. And if neither of them has a
   trick leg.
How does a programmer create Unicode documents?
● Technically, you can't make a Unicode document
● Usually you pick an official encoding (UTF-8, UTF-16, etc.)
● Sometimes you use a language-specific encoding (GB2312, Shift-JIS)




In the vast majority of cases, I think UTF-8 is more
  than adequate. If in doubt, just go with that
  encoding.
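
A rough sketch of why the choice of encoding matters: the same code points turn into different bytes depending on the encoding you pick. This assumes Python 3 (str is Unicode text, bytes is the encoded form) and uses the "Hello World" string that appears in later slides.

```python
text = '\u4f60\u597d\u4e16\u754c'  # four Chinese characters

# The same four code points become different byte sequences per encoding.
sizes = {}
for encoding in ('utf-8', 'utf-16-le', 'gb2312'):
    data = text.encode(encoding)
    sizes[encoding] = len(data)
    # Decoding with the same encoding always gets the original text back.
    assert data.decode(encoding) == text

print(sizes)
```

UTF-8 is not the most compact choice for Chinese text, but it can represent every code point, which a language-specific encoding such as GB2312 cannot.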
Python and Unicorn
Working together to combat evil!




I think this is a case of the graphic actually
   undermining the point I'm trying to make. This is my
   attempt to render a dynamic, exciting action scene of
   a pitched battle between orc, unicorn and python.
   They are fighting for the fate of the damsel in distress
   because she is, like, oh so fine (well, at least when
   she's got her makeup on, which she doesn't in this
   picture). Unfortunately, the unicorn looks like it's
   about to be stabbed in the ass, and the python
   seems more interested in biting a chunk out of the
   damsel than in saving her.
Python and Unicode
Working together to create international applications!




The only time I actually visited the Unicode
 Consortium's web site was to get a copy of the
 Unicode logo.
Unicode-related functions
● unichr()
● ord()
● unicode.encode()
● str.decode()




Thanks to Ian Bicking for pointing out that it should be
 unicode.encode(), not str.encode().
Examples of usage
>>> s = unichr(23456)
>>> print s
宠
>>> ord(s)
23456
>>> s.encode('utf-8')
'\xe5\xae\xa0'
>>> s.encode('gb2312')
'\xb3\xe8'
>>> print _
³è
>>> '\xe5\xae\xa0'.decode('utf-8')
u'\u5ba0'
>>> print _
宠
>>>




The PDF version of this presentation doesn't render
 the Chinese character properly. But if you copy and
 paste in a Unicode-aware editor, you'll probably be
 able to see it. I admit it is pretty rare to put a
 Chinese character in Courier New font.
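
The odd "³è" output in the example comes from GB2312 bytes being shown as if they were a Western single-byte encoding. A small sketch of that effect, assuming Python 3 and assuming the console behaved like Latin-1 (the slides don't say which code page was in use):

```python
s = chr(23456)                 # the character U+5BA0
gb_bytes = s.encode('gb2312')  # two bytes: 0xb3 0xe8

# Decoding with the wrong codec does not fail; it silently yields other characters.
wrong = gb_bytes.decode('latin-1')

# Decoding with the codec that produced the bytes recovers the original.
right = gb_bytes.decode('gb2312')

print(ascii(wrong), ascii(right))
```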
unicode and str: two different types!
● They have exactly the same API
● But they don't have the same repr()
● And they don't have the same type()
● Use isinstance() to tell them apart




Thanks to Atul Varma for making some comments that
 led me to adding this slide (and the next one).
unicode and str example
>>> u = unicode()
>>> type(u)
<type 'unicode'>
>>> print repr(u)
u''
>>> isinstance(u, str)
False
>>> s = str()
>>> type(s)
<type 'str'>
>>> print repr(s)
''
>>> isinstance(s, unicode)
False
>>>

Two ways to write a Unicode file
● Use the file object returned by codecs.open()
● Use a regular file object along with unicode.encode()

Example using codecs.open()
>>> import codecs
>>> s = u'\u4f60\u597d\u4e16\u754c'
>>> fout = codecs.open('document.txt', 'w', 'utf-8')
>>> fout.write(s)
>>> fout.close()
>>> open('document.txt').read().decode('utf-8')
u'\u4f60\u597d\u4e16\u754c'
>>>

Example using unicode.encode()
>>> s = u'\u4f60\u597d\u4e16\u754c'
>>> fout = open('document.txt', 'w')
>>> fout.write(s.encode('utf-8'))
>>> fout.close()
>>> open('document.txt').read().decode('utf-8')
u'\u4f60\u597d\u4e16\u754c'
>>>

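For comparison, a sketch of the same two ways of writing in Python 3, which is an assumption here since the slides use Python 2. The built-in open() takes an encoding argument and plays the role of codecs.open(). The temporary directory is only there so the sketch doesn't leave files behind.

```python
import os
import tempfile

text = '\u4f60\u597d\u4e16\u754c'
path = os.path.join(tempfile.mkdtemp(), 'document.txt')

# Way 1: let the file object encode for you.
with open(path, 'w', encoding='utf-8') as fout:
    fout.write(text)

# Way 2: open in binary mode and encode explicitly.
with open(path, 'wb') as fout:
    fout.write(text.encode('utf-8'))

# Either way, the bytes on disk decode back to the same text.
with open(path, 'rb') as fin:
    raw = fin.read()
assert raw.decode('utf-8') == text
```
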
Two ways to read Unicode files
● Use the file object returned by codecs.open()
● Use a regular file object along with str.decode()
● Watch out for the BOM!

What is a Byte Order Mark?
● Called BOM for short
● In UTF-16 docs, indicates little-endian or big-endian
● Often appears in UTF-8 docs to distinguish them from ASCII docs
● Use read(1) to skip it when reading UTF-8 documents with a BOM




The actual value of the BOM is 0xfeff. If you try to print
 it in the Python interpreter, you won't see anything.
Example of reading from a UTF-8 file with BOM
>>> import codecs
>>> fin = codecs.open('bom_document.txt', 'r', 'utf-8')
>>> fin.read(1)
u'\ufeff'
>>> fin.read()
u'\u4f60\u597d\u4e16\u754c'
>>> fin.close()
>>>

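An alternative to read(1), sketched under the assumption of Python 3: the standard 'utf-8-sig' codec drops a leading BOM if there is one. The file is created in a temporary directory so the sketch is self-contained.

```python
import codecs
import os
import tempfile

text = '\u4f60\u597d\u4e16\u754c'
path = os.path.join(tempfile.mkdtemp(), 'bom_document.txt')

# Write a UTF-8 file that starts with a BOM.
with open(path, 'wb') as fout:
    fout.write(codecs.BOM_UTF8 + text.encode('utf-8'))

# Plain 'utf-8' keeps the BOM as a leading U+FEFF character...
with open(path, encoding='utf-8') as fin:
    with_bom = fin.read()

# ...while 'utf-8-sig' strips it if present.
with open(path, encoding='utf-8-sig') as fin:
    without_bom = fin.read()
```
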
Reading and writing XML
● ElementTree handles everything implicitly
● It even eats the BOM without complaining
● It doesn't even need the XML declaration (as long as you use ASCII or UTF-8)
● cElementTree works great too!




The lxml module is similarly awesome.
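
A quick sketch of ElementTree handling the encoding for you. It assumes Python 3 and the standard-library module path xml.etree.ElementTree. The element name greeting is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Build a small document containing non-ASCII text.
root = ET.Element('greeting')
root.text = '\u4f60\u597d\u4e16\u754c'

# Serialize to UTF-8 bytes, then parse those bytes back.
data = ET.tostring(root, encoding='utf-8')
parsed = ET.fromstring(data)
```

No explicit encode() or decode() call is needed on either side; the text comes back as a regular string.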
File system directory listing
● On Windows, os.listdir('.') won't show you int'l characters
● You need to use os.listdir(u'.') to see the Unicode files
● os.getcwd() doesn't show int'l characters
● Use os.getcwdu() instead




The behavior under Mac OS X is somewhat different. I
 don't know about Linux.
String interpolation
● Str template strings can be interpolated with both unicode and str objects (automatic conversion to unicode)
● Unicode template strings need to be interpolated with unicode objects




Template engines have these sorts of issues as well.
 In particular, if you want to render a unicode string in
 Mako or Myghty, you need to pass unicode strings
 into the template.
String interpolation example
>>> 'Hello %s' % u'\u98db\u9d3b'
u'Hello \u98db\u9d3b'
>>> u'Hello %s' % u'\u98db\u9d3b'
u'Hello \u98db\u9d3b'
>>> 'Hello %s' % '\xe9\xa3\x9b\xe9\xb4\xbb'
'Hello \xe9\xa3\x9b\xe9\xb4\xbb'
>>> u'Hello %s' % '\xe9\xa3\x9b\xe9\xb4\xbb'
Traceback (most recent call last):
  File "<pyshell#36>", line 1, in ?
    u'Hello %s' % '\xe9\xa3\x9b\xe9\xb4\xbb'
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)
>>>

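The same pitfall looks a little different in Python 3 (an assumption; the slides use Python 2). Mixing encoded bytes into a text template no longer raises, but it embeds the bytes repr, so the fix is still to decode first.

```python
name_text = '\u98db\u9d3b'
name_bytes = name_text.encode('utf-8')

# Text interpolated into text works as expected.
greeting = 'Hello %s' % name_text

# Bytes interpolated into text do not raise, but the result contains b'...'.
surprise = 'Hello %s' % name_bytes

# Decode first, then interpolate.
fixed = 'Hello %s' % name_bytes.decode('utf-8')
```
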
Putting Unicode in your Python source code
● Put “# -*- coding: utf-8 -*-” at top of your file
● IDLE automatically detects non-ASCII characters and prompts to edit your file
● Not generally recommended




I don't recommend putting Unicode strings in your
   source code because people who don't have
   Unicode-aware editors will just see annoying
   gibberish.
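
One way to keep the source file pure ASCII is to write the text with \u escapes, as the examples in these slides do. A small sketch, assuming Python 3, where the built-in ascii() produces such an escaped literal from any string:

```python
# A literal written with \u escapes keeps the source file pure ASCII.
escaped = '\u4f60\u597d'

# ascii() returns the escaped literal as text, ready to paste into source code.
literal = ascii(escaped)
print(literal)
```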
Regular expressions
● The \w special character doesn't usually match non-ASCII characters
● To match non-ASCII characters, use re.UNICODE flag
● Remember that punctuation in different languages uses different characters




Punctuation characters in English:
.?!

Compare with punctuation characters in Chinese:
。?!

Although they only look slightly different, they do have
  different code points in Unicode.
Regular expression example
>>> import re
>>> s = u'ABC\u4f60\u597d\u4e16\u754c'
>>> m = re.match(r"\w+", s)
>>> m.group()
u'ABC'
>>> m = re.match(r"\w+", s, re.UNICODE)
>>> m.group()
u'ABC\u4f60\u597d\u4e16\u754c'
>>>

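In Python 3 the default is flipped, which is worth knowing if you port this example. A sketch, assuming Python 3: \w in a str pattern is Unicode-aware unless you pass re.ASCII.

```python
import re

s = 'ABC\u4f60\u597d\u4e16\u754c'

# In Python 3, \w on a str pattern is Unicode-aware by default.
unicode_match = re.match(r'\w+', s).group()

# re.ASCII restricts \w to ASCII letters, digits and underscore.
ascii_match = re.match(r'\w+', s, re.ASCII).group()
```
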
Considerations for web pages
● Don't make pages or folders with int'l characters (Firefox doesn't handle int'l URLs well)
● Make sure you use the <meta> tag when generating web pages
● You can display Unicode even in ASCII-encoded pages (use character entities)




As Atul Varma pointed out, Firefox mangles the URL
 but does so in a standard way. However, it still ends
 up not finding the page. IE can actually find and
 display pages with Unicode names. This is probably
 the only thing IE does better than Firefox.
Web page with <meta> tag

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html;charset=utf-8">
  </head>
  <body>
    <h1>你好世界</h1>
  </body>
</html>




The text is Chinese for “Hello World”.
Web page with character entities

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html;charset=ascii">
  </head>
  <body>
    <h1>&#20320;&#22909;&#19990;&#30028;</h1>
  </body>
</html>

Conversion recipe: s.encode('ascii', 'xmlcharrefreplace')




Thanks to Ian Bicking for pointing out a shorter
 conversion recipe. For the record, the original one
 is:

''.join('&#%d;' % ord(c) for c in s)
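
Both recipes side by side, as a sketch assuming Python 3. There the built-in handler returns bytes, while the manual join returns text, so the two are compared after encoding.

```python
text = '\u4f60\u597d\u4e16\u754c'

# Built-in error handler: unencodable characters become &#NNNN; references.
entities = text.encode('ascii', 'xmlcharrefreplace')

# The manual recipe, for comparison (produces str rather than bytes).
manual = ''.join('&#%d;' % ord(c) for c in text)

print(entities)
```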
Processing documents of unknown encoding
● Use the chardet module
● chardet.detect() function:
   – accepts a string
   – returns a dictionary with two keys: 'encoding' and 'confidence'
● Also try BeautifulSoup for web pages

Encoding detection example
>>> import chardet, urllib2
>>> html = urllib2.urlopen('http://chol.co.kr').read()
>>> result = chardet.detect(html)
>>> result
{'confidence': 0.98999999999999999, 'encoding': 'EUC-KR'}
>>> print html.decode(result['encoding'])




You can also try BeautifulSoup for web pages.
 Example:

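import urllib2
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 import path

url = 'http://chol.co.kr'  # same page as in the chardet example
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content)
encoding = soup.originalEncoding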
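
If neither library is available, a much cruder fallback is to try a short list of likely encodings in order. To be clear, this is not how chardet works (it uses statistics over the bytes), and the function name and candidate list below are made up for illustration. The sketch assumes Python 3.

```python
def decode_with_candidates(data, candidates=('utf-8', 'euc-kr', 'latin-1')):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Not reachable with the default list, since latin-1 accepts any bytes.
    raise ValueError('none of the candidate encodings matched')

sample = '\ud55c\uad6d\uc5b4'.encode('euc-kr')  # Korean text as EUC-KR bytes
text, encoding = decode_with_candidates(sample)
```

The order matters: strict encodings such as UTF-8 go first, and latin-1 goes last because it never fails and will happily produce garbage.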
Tools that play nice with Unicode
● IDLE (raw_input() accepts Unicode)
● Notepad++ (can autodetect UTF-8 files with BOM)
● jEdit




Note that only IDLE on Windows has this feature.
Libraries that play nice with Unicode
● Tkinter
● wxPython
● Mako
● BeautifulSoup
● feedparser
● ElementTree
● lxml

Libraries that don't play nice with Unicode
● cStringIO (StringIO.write() doesn't accept Unicode strings)
● buzhug
● Various ID3 libraries
● ?

Databases
● SQLite has no problem with Unicode
● SQLAlchemy with SQLite is fine too
● Other databases - ?

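A minimal round-trip check for the SQLite claim, assuming Python 3 and the standard-library sqlite3 module (the slides' demo used pysqlite). The table and column names are made up for illustration.

```python
import sqlite3

text = '\u4f60\u597d\u4e16\u754c'

conn = sqlite3.connect(':memory:')  # throwaway in-memory database
conn.execute('CREATE TABLE greetings (body TEXT)')
conn.execute('INSERT INTO greetings (body) VALUES (?)', (text,))

# The value comes back as a str with the same code points.
(stored,) = conn.execute('SELECT body FROM greetings').fetchone()
conn.close()
```
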
Platform-specific issues
● Windows DOS prompt has no love for Unicode
● Mac OS X IDLE can't handle Unicode
● Mac OS X terminal doesn't like Unicode, likes UTF-8
● Recommendation: Use PyCrust?




I checked and it turns out that PyCrust chokes on int'l
   characters sent through raw_input(), even on
   Windows. So I formally withdraw my
   recommendation of PyCrust.
Demos
● Filesystem demo
● Mako template engine demo
● chardet demo
● pysqlite demo
● wxPython demo

Questions?
有问题吗?




Thanks to the experts in the audience who provided
 hard-hitting answers to the tough questions.
 And, of course, thanks to everyone who attended my
 first talk at ChiPy. I hope there will be more.

Mais conteúdo relacionado

Semelhante a Unicode for Small Children (and Children at Heart)

Xml For Dummies Chapter 6 Adding Character(S) To Xml
Xml For Dummies   Chapter 6 Adding Character(S) To XmlXml For Dummies   Chapter 6 Adding Character(S) To Xml
Xml For Dummies Chapter 6 Adding Character(S) To Xmlphanleson
 
Camomile : A Unicode library for OCaml
Camomile : A Unicode library for OCamlCamomile : A Unicode library for OCaml
Camomile : A Unicode library for OCamlYamagata Yoriyuki
 
Lecture1_cis4930.pdf
Lecture1_cis4930.pdfLecture1_cis4930.pdf
Lecture1_cis4930.pdfzertash1
 
Comprehasive Exam - IT
Comprehasive Exam - ITComprehasive Exam - IT
Comprehasive Exam - ITguest6ddfb98
 
Jun 29 new privacy technologies for unicode and international data standards ...
Jun 29 new privacy technologies for unicode and international data standards ...Jun 29 new privacy technologies for unicode and international data standards ...
Jun 29 new privacy technologies for unicode and international data standards ...Ulf Mattsson
 
COMPUTER INTRODUCTION
COMPUTER INTRODUCTIONCOMPUTER INTRODUCTION
COMPUTER INTRODUCTIONAmit Sharma
 
Lesson4.2 u4 l1 binary squences
Lesson4.2 u4 l1 binary squencesLesson4.2 u4 l1 binary squences
Lesson4.2 u4 l1 binary squencesLexume1
 
E-Business Suite 1 _ Jim Pang _ The anatomy of multiple language support (MLS...
E-Business Suite 1 _ Jim Pang _ The anatomy of multiple language support (MLS...E-Business Suite 1 _ Jim Pang _ The anatomy of multiple language support (MLS...
E-Business Suite 1 _ Jim Pang _ The anatomy of multiple language support (MLS...InSync2011
 
Foreign Languages for Humans and Computers
Foreign Languages for Humans and ComputersForeign Languages for Humans and Computers
Foreign Languages for Humans and ComputersPeterZukerman
 
The entropic principle: /dev/u?random and NetBSD by Taylor R Campbell
  The entropic principle: /dev/u?random and NetBSD by Taylor R Campbell  The entropic principle: /dev/u?random and NetBSD by Taylor R Campbell
The entropic principle: /dev/u?random and NetBSD by Taylor R Campbelleurobsdcon
 
Unicode Encoding Forms
Unicode Encoding FormsUnicode Encoding Forms
Unicode Encoding FormsMehdi Hasan
 
Understand unicode & utf8 in perl (2)
Understand unicode & utf8 in perl (2)Understand unicode & utf8 in perl (2)
Understand unicode & utf8 in perl (2)Jerome Eteve
 
How To Build And Launch A Successful Globalized App From Day One Or All The ...
How To Build And Launch A Successful Globalized App From Day One  Or All The ...How To Build And Launch A Successful Globalized App From Day One  Or All The ...
How To Build And Launch A Successful Globalized App From Day One Or All The ...agileware
 

Semelhante a Unicode for Small Children (and Children at Heart) (20)

Xml For Dummies Chapter 6 Adding Character(S) To Xml
Xml For Dummies   Chapter 6 Adding Character(S) To XmlXml For Dummies   Chapter 6 Adding Character(S) To Xml
Xml For Dummies Chapter 6 Adding Character(S) To Xml
 
Using unicode with php
Using unicode with phpUsing unicode with php
Using unicode with php
 
Unicode & PHP6
Unicode & PHP6Unicode & PHP6
Unicode & PHP6
 
Camomile : A Unicode library for OCaml
Camomile : A Unicode library for OCamlCamomile : A Unicode library for OCaml
Camomile : A Unicode library for OCaml
 
Using unicode with php
Using unicode with phpUsing unicode with php
Using unicode with php
 
Notes on a Standard: Unicode
Notes on a Standard: UnicodeNotes on a Standard: Unicode
Notes on a Standard: Unicode
 
Uncdtalk
UncdtalkUncdtalk
Uncdtalk
 
Lecture1_cis4930.pdf
Lecture1_cis4930.pdfLecture1_cis4930.pdf
Lecture1_cis4930.pdf
 
Comprehasive Exam - IT
Comprehasive Exam - ITComprehasive Exam - IT
Comprehasive Exam - IT
 
Jun 29 new privacy technologies for unicode and international data standards ...
Jun 29 new privacy technologies for unicode and international data standards ...Jun 29 new privacy technologies for unicode and international data standards ...
Jun 29 new privacy technologies for unicode and international data standards ...
 
COMPUTER INTRODUCTION
COMPUTER INTRODUCTIONCOMPUTER INTRODUCTION
COMPUTER INTRODUCTION
 
Lesson4.2 u4 l1 binary squences
Lesson4.2 u4 l1 binary squencesLesson4.2 u4 l1 binary squences
Lesson4.2 u4 l1 binary squences
 
E-Business Suite 1 _ Jim Pang _ The anatomy of multiple language support (MLS...
E-Business Suite 1 _ Jim Pang _ The anatomy of multiple language support (MLS...E-Business Suite 1 _ Jim Pang _ The anatomy of multiple language support (MLS...
E-Business Suite 1 _ Jim Pang _ The anatomy of multiple language support (MLS...
 
Foreign Languages for Humans and Computers
Foreign Languages for Humans and ComputersForeign Languages for Humans and Computers
Foreign Languages for Humans and Computers
 
The entropic principle: /dev/u?random and NetBSD by Taylor R Campbell
  The entropic principle: /dev/u?random and NetBSD by Taylor R Campbell  The entropic principle: /dev/u?random and NetBSD by Taylor R Campbell
The entropic principle: /dev/u?random and NetBSD by Taylor R Campbell
 
Unicode Encoding Forms
Unicode Encoding FormsUnicode Encoding Forms
Unicode Encoding Forms
 
Io
IoIo
Io
 
Understand unicode & utf8 in perl (2)
Understand unicode & utf8 in perl (2)Understand unicode & utf8 in perl (2)
Understand unicode & utf8 in perl (2)
 
How To Build And Launch A Successful Globalized App From Day One Or All The ...
How To Build And Launch A Successful Globalized App From Day One  Or All The ...How To Build And Launch A Successful Globalized App From Day One  Or All The ...
How To Build And Launch A Successful Globalized App From Day One Or All The ...
 
Unicode 101
Unicode 101Unicode 101
Unicode 101
 

Último

How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Gen AI in Business - Global Trends Report 2024.pdf
Unicode for Small Children (and Children at Heart)

  • 1. Unicode for Small Children (and Children at Heart) Feihong Hsu Chicago Python Users Group March 8, 2007
  • 2. Welcome to the Wonderful World of Unicorns! A Magical Guide to the World's Most Beloved Mythological Equine
  • 3. Welcome to the Useful World of Unicode! A Practical Guide to the World's Most Popular International Text Standard
  • 4. Top 3 reasons that unicorns are great ● Friendly and wise ● Healing power ● Bane of evil
  • 5. Top 3 reasons that Unicode is important ● Comprehensive language coverage ● Multiple languages in a single document ● Standardized
  • 6. The difference between Horses and Unicorns
                  Horses                              Unicorns
      Habitat     Grasslands                          Enchanted forests
      Diet        Apples, oats, grass, barley, etc.   Love, spirit of wonder
      Abilities   Galloping, eating, pooping          Sentience, telepathy, laser vision (unconfirmed)
  • 7. Difference between ISO 8859 and Unicode
                                   ISO 8859   Unicode
      # supported languages        Some       A lot
      # supported characters       256        100,000+
      # bytes for each character   1          1-4
  • 8. So what, exactly, is Unicode? Unicode is a standard that assigns a unique number to each character in every human language Ok, not every language, see next slide
  • 9. What is Unicode not? ● Doesn't address how the characters are rendered (that's up to font makers) ● Doesn't deal with imaginary languages like Klingon and Elvish ● Doesn't deal with ancient languages ● Doesn't deal with obscure languages that no one uses
  • 10. How does Hollywood “create” unicorns? ● CGI ● Horse with horn glued to forehead ● Two dudes in a costume
  • 11. How does a programmer create Unicode documents? ● Technically, you can't make a Unicode document ● Usually you pick an official encoding (UTF-8, UTF-16, etc.) ● Sometimes you use a language-specific encoding (GB2312, Shift-JIS)
  • 12. Python and Unicorn Working together to combat evil!
  • 13. Python and Unicode Working together to create international applications!
  • 14. Unicode-related functions ● unichr() ● ord() ● unicode.encode() ● str.decode()
  • 15. Examples of usage
      >>> s = unichr(23456)
      >>> print s
      宠
      >>> ord(s)
      23456
      >>> s.encode('utf-8')
      '\xe5\xae\xa0'
      >>> s.encode('gb2312')
      '\xb3\xe8'
      >>> print _
      ³è
      >>> '\xe5\xae\xa0'.decode('utf-8')
      u'\u5ba0'
      >>> print _
      宠
      >>>
  • 16. unicode and str: two different types! ● They have exactly the same API ● But they don't have the same repr() ● And they don't have the same type() ● Use isinstance() to tell them apart
  • 17. unicode and str example
      >>> u = unicode()
      >>> type(u)
      <type 'unicode'>
      >>> print repr(u)
      u''
      >>> isinstance(u, str)
      False
      >>> s = str()
      >>> type(s)
      <type 'str'>
      >>> print repr(s)
      ''
      >>> isinstance(s, unicode)
      False
      >>>
  • 18. Two ways to write a Unicode file ● Use the file object returned by codecs.open() ● Use a regular file object along with unicode.encode()
  • 19. Example using codecs.open()
      >>> import codecs
      >>> s = u'\u4f60\u597d\u4e16\u754c'
      >>> fout = codecs.open('document.txt', 'w', 'utf-8')
      >>> fout.write(s)
      >>> fout.close()
      >>> open('document.txt').read().decode('utf-8')
      u'\u4f60\u597d\u4e16\u754c'
      >>>
  • 20. Example using unicode.encode()
      >>> s = u'\u4f60\u597d\u4e16\u754c'
      >>> fout = open('document.txt', 'w')
      >>> fout.write(s.encode('utf-8'))
      >>> fout.close()
      >>> open('document.txt').read().decode('utf-8')
      u'\u4f60\u597d\u4e16\u754c'
      >>>
  • 21. Two ways to read Unicode files ● Use the file object returned by codecs.open() ● Use a regular file object along with str.decode() ● Watch out for the BOM!
  • 22. What is Byte Order Mark? ● Called BOM for short ● In UTF-16 docs, indicates little-endian or big-endian ● Often appears in UTF-8 docs to distinguish them from ASCII docs ● Use read(1) for UTF-8 documents with BOM
  • 23. Example of reading from a UTF-8 file with BOM
      >>> import codecs
      >>> fin = codecs.open('bom_document.txt', 'r', 'utf-8')
      >>> fin.read(1)
      u'\ufeff'
      >>> fin.read()
      u'\u4f60\u597d\u4e16\u754c'
      >>> fin.close()
      >>>
  • 24. Reading and writing XML ● ElementTree handles everything implicitly ● It even eats the BOM without complaining ● It doesn't even need the XML declaration (as long as you use ASCII or UTF-8) ● cElementTree works great too!
  • 25. File system directory listing ● On Windows, os.listdir('.') won't show you int'l characters ● You need to use os.listdir(u'.') to see the Unicode files ● os.getcwd() doesn't show int'l characters ● Use os.getcwdu() instead
  • 26. String interpolation ● Str template strings can be interpolated with both unicode and str objects (automatic conversion to unicode) ● Unicode template strings need to be interpolated with unicode objects
  • 27. String interpolation example
      >>> 'Hello %s' % u'\u98db\u9d3b'
      u'Hello \u98db\u9d3b'
      >>> u'Hello %s' % u'\u98db\u9d3b'
      u'Hello \u98db\u9d3b'
      >>> 'Hello %s' % '\xe9\xa3\x9b\xe9\xb4\xbb'
      'Hello \xe9\xa3\x9b\xe9\xb4\xbb'
      >>> u'Hello %s' % '\xe9\xa3\x9b\xe9\xb4\xbb'
      Traceback (most recent call last):
        File "<pyshell#36>", line 1, in ?
          u'Hello %s' % '\xe9\xa3\x9b\xe9\xb4\xbb'
      UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)
      >>>
  • 28. Putting Unicode in your Python source code ● Put “# -*- coding: utf-8 -*-” at top of your file ● IDLE automatically detects non-ASCII characters and prompts to edit your file ● Not generally recommended
  • 29. Regular expressions ● The \w special character doesn't usually match non-ASCII characters ● To match non-ASCII characters, use re.UNICODE flag ● Remember that punctuation in different languages uses different characters
  • 30. Regular expression example
      >>> import re
      >>> s = u'ABC\u4f60\u597d\u4e16\u754c'
      >>> m = re.match(r"\w+", s)
      >>> m.group()
      u'ABC'
      >>> m = re.match(r"\w+", s, re.UNICODE)
      >>> m.group()
      u'ABC\u4f60\u597d\u4e16\u754c'
      >>>
  • 31. Considerations for web pages ● Don't make pages or folders with int'l characters (Firefox doesn't handle int'l URLs well) ● Make sure you use the <meta> tag when generating web pages ● You can display Unicode even in ASCII-encoded pages (use character entities)
  • 32. Web page with <meta> tag
      <html>
      <head>
      <meta http-equiv="Content-Type" content="text/html;charset=utf-8">
      </head>
      <body>
      <h1> 你好世界 </h1>
      </body>
      </html>
  • 33. Web page with character entities
      <html>
      <head>
      <meta http-equiv="Content-Type" content="text/html;charset=ascii">
      </head>
      <body>
      <h1>&#20320;&#22909;&#19990;&#30028;</h1>
      </body>
      </html>
    Conversion recipe: s.encode('ascii', 'xmlcharrefreplace')
  • 34. Processing documents of unknown encoding ● Use the chardet module ● chardet.detect() function: – accepts a string – returns a dictionary with two keys: 'encoding' and 'confidence' ● Also try BeautifulSoup for web pages
  • 35. Encoding detection example
      >>> import chardet, urllib2
      >>> html = urllib2.urlopen('http://chol.co.kr').read()
      >>> result = chardet.detect(html)
      >>> result
      {'confidence': 0.98999999999999999, 'encoding': 'EUC-KR'}
      >>> print html.decode(result['encoding'])
  • 36. Tools that play nice with Unicode ● IDLE (raw_input() accepts Unicode) ● Notepad++ (can autodetect UTF-8 files with BOM) ● jEdit
  • 37. Libraries that play nice with Unicode ● Tkinter ● wxPython ● Mako ● BeautifulSoup ● feedparser ● Elementtree ● lxml
  • 38. Libraries that don't play nice with Unicode ● cStringIO (StringIO.write() doesn't accept Unicode strings) ● buzhug ● Various ID3 libraries ● ?
  • 39. Databases ● SQLite has no problem with Unicode ● SQLAlchemy with SQLite is fine too ● Other databases - ?
  • 40. Platform-specific issues ● Windows DOS prompt has no love for Unicode ● MacOS X IDLE can't handle Unicode ● MacOS X terminal doesn't like Unicode, likes UTF-8 ● Recommendation: Use PyCrust?
  • 41. Demos ● Filesystem demo ● Mako template engine demo ● chardet demo ● pysqlite demo ● wxPython demo
  • 43. Unicode for Small Children (and Children at Heart) Feihong Hsu Chicago Python Users Group March 8, 2007 1 Thanks to Chris McAvoy for the conversation at PyCon that inspired this talk.
  • 44. Welcome to the Wonderful World of Unicorns! A Magical Guide to the World's Most Beloved Mythological Equine 2 Completely drawn on my tablet PC using the free Ink Art program. Unfortunately, Ink Art doesn't come with good coloring tools so I just left it colorless.
  • 45. Welcome to the Useful World of Unicode! A Practical Guide to the World's Most Popular International Text Standard 3
  • 46. Top 3 reasons that unicorns are great ● Friendly and wise ● Healing power ● Bane of evil 4
  • 47. Top 3 reasons that Unicode is important ● Comprehensive language coverage ● Multiple languages in a single document ● Standardized 5 The Unicode Standard is maintained by the Unicode Consortium, an organization based in California.
  • 48. The difference between Horses and Unicorns
                  Horses                              Unicorns
      Habitat     Grasslands                          Enchanted forests
      Diet        Apples, oats, grass, barley, etc.   Love, spirit of wonder
      Abilities   Galloping, eating, pooping          Sentience, telepathy, laser vision (unconfirmed)
    I really wasn't sure about including the laser vision ability. I honestly thought it was an urban myth. But when a friend of my cousin's sister's friend said that she saw it in person, I finally relented.
  • 49. Difference between ISO 8859 and Unicode
                                   ISO 8859   Unicode
      # supported languages        Some       A lot
      # supported characters       256        100,000+
      # bytes for each character   1          1-4
    Somebody noted that ISO 8859 can actually support more than 256 characters through its various extensions, so this is an oversimplification.
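To make the "1-4 bytes" row concrete, here is a small sketch that counts the encoded bytes for a few characters. It is written for Python 3, which is an assumption on my part (the talk itself uses Python 2), and the sample characters are picked purely for illustration.

```python
# Sample characters, chosen only for illustration:
# ASCII letter, accented Latin letter, CJK ideograph, emoji.
samples = ['A', '\u00e9', '\u5ba0', '\U0001F600']

# UTF-8 uses between one and four bytes per code point.
utf8_lengths = [len(ch.encode('utf-8')) for ch in samples]

# ISO 8859-1 always uses exactly one byte per character.
latin1_length = len('\u00e9'.encode('iso-8859-1'))

print(utf8_lengths, latin1_length)
```

The ideograph and the emoji cannot be encoded in ISO 8859-1 at all, which is the "some languages" limitation in the table.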
  • 50. So what, exactly, is Unicode? Unicode is a standard that assigns a unique number to each character in every human language Ok, not every language, see next slide 8 The “unique number” for each character is called a code point in Unicode terminology.
  • 51. What is Unicode not? ● Doesn't address how the characters are rendered (that's up to font makers) ● Doesn't deal with imaginary languages like Klingon and Elvish ● Doesn't deal with ancient languages ● Doesn't deal with obscure languages that no one uses 9 Although there are many languages that Unicode doesn't directly support, there are extensions to Unicode that are designed to handle these cases.
  • 52. How does Hollywood “create” unicorns? ● CGI ● Horse with horn glued to forehead ● Two dudes in a costume 10 It helps if the two dudes are very high. And if they have circus experience. And if neither of them has a trick leg.
  • 53. How does a programmer create Unicode documents? ● Technically, you can't make a Unicode document ● Usually you pick an official encoding (UTF-8, UTF-16, etc.) ● Sometimes you use a language-specific encoding (GB2312, Shift-JIS)
    In the vast majority of cases, I think UTF-8 is more than adequate. If in doubt, just go with that encoding.
  • 54. Python and Unicorn Working together to combat evil! 12 I think this is a case of the graphic actually undermining the point I'm trying to make. This is my attempt to render a dynamic, exciting action scene of a pitched battle between orc, unicorn and python. They are fighting for the fate of the damsel in distress because she is, like, oh so fine (well, at least when she's got her makeup on, which she doesn't in this picture). Unfortunately, the unicorn looks like it's about to be stabbed in the ass, and the python seems more interested in biting a chunk out of the damsel than in saving her.
  • 55. Python and Unicode Working together to create international applications! 13 The only time I actually visited the Unicode Consortium's web site was to get a copy of the Unicode logo.
  • 56. Unicode-related functions ● unichr() ● ord() ● unicode.encode() ● str.decode() 14 Thanks to Ian Bicking for pointing out that it should be unicode.encode(), not str.encode().
  • 57. Examples of usage
      >>> s = unichr(23456)
      >>> print s
      宠
      >>> ord(s)
      23456
      >>> s.encode('utf-8')
      '\xe5\xae\xa0'
      >>> s.encode('gb2312')
      '\xb3\xe8'
      >>> print _
      ³è
      >>> '\xe5\xae\xa0'.decode('utf-8')
      u'\u5ba0'
      >>> print _
      宠
      >>>
    The PDF version of this presentation doesn't render the Chinese character properly. But if you copy and paste in a Unicode-aware editor, you'll probably be able to see it. I admit it is pretty rare to put a Chinese character in Courier New font.
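The session above is Python 2. As a rough sketch of the same round trip in Python 3 (an assumption on my part, since the talk predates it), `unichr()` becomes `chr()`, and the `str`/`bytes` pair takes over the roles of `unicode`/`str`:

```python
# Python 3 sketch of the round trip shown on the slide (the slide uses Python 2).
s = chr(23456)                    # 23456 == 0x5BA0
code_point = ord(s)               # back to the integer code point

utf8_bytes = s.encode('utf-8')    # three bytes in UTF-8
gb_bytes = s.encode('gb2312')     # two bytes in GB2312

# Decoding with the matching codec recovers the original character.
round_trip = utf8_bytes.decode('utf-8')

print(code_point, utf8_bytes, gb_bytes, round_trip == s)
```

Decoding `gb_bytes` with the wrong codec is what produces the "³è" output on the slide.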
  • 58. unicode and str: two different types! ● They have exactly the same API ● But they don't have the same repr() ● And they don't have the same type() ● Use isinstance() to tell them apart 16 Thanks to Atul Varma for making some comments that led me to adding this slide (and the next one).
  • 59. unicode and str example
      >>> u = unicode()
      >>> type(u)
      <type 'unicode'>
      >>> print repr(u)
      u''
      >>> isinstance(u, str)
      False
      >>> s = str()
      >>> type(s)
      <type 'str'>
      >>> print repr(s)
      ''
      >>> isinstance(s, unicode)
      False
      >>>
  • 60. Two ways to write a Unicode file ● Use the file object returned by codecs.open() ● Use a regular file object along with unicode.encode() 18
  • 61. Example using codecs.open()
      >>> import codecs
      >>> s = u'\u4f60\u597d\u4e16\u754c'
      >>> fout = codecs.open('document.txt', 'w', 'utf-8')
      >>> fout.write(s)
      >>> fout.close()
      >>> open('document.txt').read().decode('utf-8')
      u'\u4f60\u597d\u4e16\u754c'
      >>>
  • 62. Example using unicode.encode()
      >>> s = u'\u4f60\u597d\u4e16\u754c'
      >>> fout = open('document.txt', 'w')
      >>> fout.write(s.encode('utf-8'))
      >>> fout.close()
      >>> open('document.txt').read().decode('utf-8')
      u'\u4f60\u597d\u4e16\u754c'
      >>>
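Both ways of writing a file carry over to Python 3 with small changes. The sketch below assumes Python 3, where the built-in `open()` accepts an `encoding` argument and so covers the `codecs.open()` case; the temporary path is a hypothetical scratch location used only for the illustration.

```python
import os
import tempfile

text = '\u4f60\u597d\u4e16\u754c'   # "Hello World" in Chinese

# Hypothetical scratch path, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'document.txt')

# Way 1: let the file object encode on write.
with open(path, 'w', encoding='utf-8') as fout:
    fout.write(text)

# Way 2: encode by hand and write the raw bytes.
with open(path, 'wb') as fout:
    fout.write(text.encode('utf-8'))

# Read the raw bytes back and decode them explicitly.
with open(path, 'rb') as fin:
    read_back = fin.read().decode('utf-8')

print(read_back == text)
```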
  • 63. Two ways to read Unicode files ● Use the file object returned by codecs.open() ● Use a regular file object along with str.decode() ● Watch out for the BOM! 21
  • 64. What is Byte Order Mark? ● Called BOM for short ● In UTF-16 docs, indicates little-endian or big-endian ● Often appears in UTF-8 docs to distinguish them from ASCII docs ● Use read(1) for UTF-8 documents with BOM
    The actual value of the BOM is 0xfeff. If you try to print it in the Python interpreter, you won't see anything.
  • 65. Example of reading from a UTF-8 file with BOM
      >>> import codecs
      >>> fin = codecs.open('bom_document.txt', 'r', 'utf-8')
      >>> fin.read(1)
      u'\ufeff'
      >>> fin.read()
      u'\u4f60\u597d\u4e16\u754c'
      >>> fin.close()
      >>>
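As an alternative to skipping the BOM with `read(1)`, the standard `utf-8-sig` codec drops a leading BOM when one is present. This is not the technique from the slide, just a sketch of another option; it assumes Python 3 and builds the file contents in memory instead of reading `bom_document.txt`.

```python
import codecs

# Bytes of a UTF-8 document that starts with a BOM (built in memory for this sketch).
raw = codecs.BOM_UTF8 + '\u4f60\u597d\u4e16\u754c'.encode('utf-8')

# Plain 'utf-8' keeps the BOM as a leading U+FEFF character.
with_bom = raw.decode('utf-8')

# 'utf-8-sig' strips a leading BOM if there is one.
without_bom = raw.decode('utf-8-sig')

# ascii() keeps the output printable on terminals that cannot show the characters.
print(ascii(with_bom[0]), len(with_bom), len(without_bom))
```

The same codec name can be passed as the encoding when opening a file, so files with and without a BOM can be read through one code path.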
  • 66. Reading and writing XML ● ElementTree handles everything implicitly ● It even eats the BOM without complaining ● It doesn't even need the XML declaration (as long as you use ASCII or UTF-8) ● cElementTree works great too! 24 The lxml module is similarly awesome.
  • 67. File system directory listing ● On Windows, os.listdir('.') won't show you int'l characters ● You need to use os.listdir(u'.') to see the Unicode files ● os.getcwd() doesn't show int'l characters ● Use os.getcwdu() instead 25 The behavior under Mac OS X is somewhat different. I don't know about Linux.
  • 68. String interpolation ● Str template strings can be interpolated with both unicode and str objects (automatic conversion to unicode) ● Unicode template strings need to be interpolated with unicode objects 26 Template engines have these sorts of issues as well. In particular, if you want to render a unicode string in Mako or Myghty, you need to pass unicode strings into the template.
  • 69. String interpolation example
      >>> 'Hello %s' % u'\u98db\u9d3b'
      u'Hello \u98db\u9d3b'
      >>> u'Hello %s' % u'\u98db\u9d3b'
      u'Hello \u98db\u9d3b'
      >>> 'Hello %s' % '\xe9\xa3\x9b\xe9\xb4\xbb'
      'Hello \xe9\xa3\x9b\xe9\xb4\xbb'
      >>> u'Hello %s' % '\xe9\xa3\x9b\xe9\xb4\xbb'
      Traceback (most recent call last):
        File "<pyshell#36>", line 1, in ?
          u'Hello %s' % '\xe9\xa3\x9b\xe9\xb4\xbb'
      UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)
      >>>
  • 70. Putting Unicode in your Python source code ● Put “# -*- coding: utf-8 -*-” at top of your file ● IDLE automatically detects non-ASCII characters and prompts to edit your file ● Not generally recommended
    I don't recommend putting Unicode strings in your source code because people who don't have Unicode-aware editors will just see annoying gibberish.
  • 71. Regular expressions ● The \w special character doesn't usually match non-ASCII characters ● To match non-ASCII characters, use re.UNICODE flag ● Remember that punctuation in different languages uses different characters
    Punctuation characters in English: .?! Compare with punctuation characters in Chinese: 。?! Although they only look slightly different, they do have different code points in Unicode.
  • 72. Regular expression example
      >>> import re
      >>> s = u'ABC\u4f60\u597d\u4e16\u754c'
      >>> m = re.match(r"\w+", s)
      >>> m.group()
      u'ABC'
      >>> m = re.match(r"\w+", s, re.UNICODE)
      >>> m.group()
      u'ABC\u4f60\u597d\u4e16\u754c'
      >>>
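The failure in the last call comes from Python 2 trying an implicit ASCII decode of the byte string. A minimal sketch of the safer pattern, decoding explicitly before interpolating, is shown below in Python 3 (my assumption; the slide is Python 2), where mixing text and undecoded bytes is refused outright.

```python
# UTF-8 bytes, as they might arrive from a file or a socket.
name_bytes = '\u98db\u9d3b'.encode('utf-8')

# Decode explicitly before interpolating; nothing is converted implicitly.
greeting = 'Hello %s' % name_bytes.decode('utf-8')

# Concatenating text with undecoded bytes raises TypeError in Python 3,
# instead of attempting the implicit ASCII decode that Python 2 performs.
try:
    'Hello ' + name_bytes
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(ascii(greeting), mixed_ok)
```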
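In Python 3 the default is reversed: `\w` in a `str` pattern is Unicode-aware unless told otherwise. The sketch below assumes Python 3 and uses `re.ASCII` to reproduce the flagless Python 2 behavior from the slide.

```python
import re

s = 'ABC\u4f60\u597d\u4e16\u754c'

# In Python 3, \w in a str pattern matches Unicode word characters by default.
default_match = re.match(r'\w+', s).group()

# re.ASCII restricts \w to [a-zA-Z0-9_], like the flagless Python 2 call.
ascii_match = re.match(r'\w+', s, re.ASCII).group()

# ascii() keeps the output printable on terminals that cannot show the characters.
print(ascii(default_match), ascii_match)
```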
  • 73. Considerations for web pages ● Don't make pages or folders with int'l characters (Firefox doesn't handle int'l URLs well) ● Make sure you use the <meta> tag when generating web pages ● You can display Unicode even in ASCII-encoded pages (use character entities) 31 As Atul Varma pointed out, Firefox mangles the URL but does so in a standard way. However, it still ends up not finding the page. IE can actually find and display pages with Unicode names. This is probably the only thing IE does better than Firefox.
  • 74. Web page with <meta> tag
      <html>
      <head>
      <meta http-equiv="Content-Type" content="text/html;charset=utf-8">
      </head>
      <body>
      <h1> 你好世界 </h1>
      </body>
      </html>
    The text is Chinese for “Hello World”.
  • 75. Web page with character entities
      <html>
      <head>
      <meta http-equiv="Content-Type" content="text/html;charset=ascii">
      </head>
      <body>
      <h1>&#20320;&#22909;&#19990;&#30028;</h1>
      </body>
      </html>
    Conversion recipe: s.encode('ascii', 'xmlcharrefreplace')
    Thanks to Ian Bicking for pointing out a shorter conversion recipe. For the record, the original one is: ''.join('&#%d' % ord(c) for c in s)
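The conversion recipe can be checked with a short sketch. It assumes Python 3, where `encode()` returns bytes, so the result is decoded back to text for display; the hand-rolled variant here adds the terminating semicolon that numeric character references normally carry.

```python
s = '\u4f60\u597d\u4e16\u754c'   # "Hello World" in Chinese

# Built-in error handler: each non-ASCII character becomes a numeric character reference.
entities = s.encode('ascii', 'xmlcharrefreplace').decode('ascii')

# Hand-rolled equivalent, with the terminating semicolon included.
manual = ''.join('&#%d;' % ord(c) for c in s)

print(entities)
```

ASCII characters in the input pass through the built-in handler unchanged, whereas the hand-rolled version would turn every character into a reference.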
  • 76. Processing documents of unknown encoding ● Use the chardet module ● chardet.detect() function: – accepts a string – returns a dictionary with two keys: 'encoding' and 'confidence' ● Also try BeautifulSoup for web pages 34
  • 77. Encoding detection example
      >>> import chardet, urllib2
      >>> html = urllib2.urlopen('http://chol.co.kr').read()
      >>> result = chardet.detect(html)
      >>> result
      {'confidence': 0.98999999999999999, 'encoding': 'EUC-KR'}
      >>> print html.decode(result['encoding'])
    You can also try BeautifulSoup for web pages. Example:
      content = urllib2.urlopen(url).read()
      soup = BeautifulSoup(content)
      encoding = soup.originalEncoding
  • 78. Tools that play nice with Unicode ● IDLE (raw_input() accepts Unicode) ● Notepad++ (can autodetect UTF-8 files with BOM) ● jEdit 36 Note that only IDLE on Windows has this feature.
  • 79. Libraries that play nice with Unicode ● Tkinter ● wxPython ● Mako ● BeautifulSoup ● feedparser ● Elementtree ● lxml 37
  • 80. Libraries that don't play nice with Unicode ● cStringIO (StringIO.write() doesn't accept Unicode strings) ● buzhug ● Various ID3 libraries ● ? 38
  • 81. Databases ● SQLite has no problem with Unicode ● SQLAlchemy with SQLite is fine too ● Other databases - ? 39
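The SQLite claim is easy to check with a round trip through an in-memory database. This sketch assumes Python 3 and the standard `sqlite3` module; the table and column names are hypothetical and exist only for the illustration.

```python
import sqlite3

text = '\u4f60\u597d\u4e16\u754c'   # "Hello World" in Chinese

# Throwaway in-memory database; table and column names are made up for this sketch.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE greetings (message TEXT)')
conn.execute('INSERT INTO greetings VALUES (?)', (text,))

# The stored value comes back as the same text.
stored = conn.execute('SELECT message FROM greetings').fetchone()[0]
conn.close()

print(stored == text)
```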
  • 82. Platform-specific issues ● Windows DOS prompt has no love for Unicode ● MacOS X IDLE can't handle Unicode ● MacOS X terminal doesn't like Unicode, likes UTF-8 ● Recommendation: Use PyCrust? 40 I checked and it turns out that PyCrust chokes on int'l characters sent through raw_input(), even on Windows. So I formally withdraw my recommendation of PyCrust.
  • 83. Demos ● Filesystem demo ● Mako template engine demo ● chardet demo ● pysqlite demo ● wxPython demo 41
  • 84. Questions? 有问题吗? Thanks to the experts in the audience who provided hard-hitting answers to the tough questions. And, of course, thanks to everyone who attended my first talk at ChiPy. I hope there will be more.