Every developer eventually feels the pain of character encoding issues. We will cover the fundamentals every Python developer should know about character encoding and Unicode. We will show how to identify the kinds of problems that occur when dealing with character encoding, and outline a set of best practices and useful libraries for avoiding and fixing them.
Character Encoding & Unicode - How to (╯°□°)╯︵ ┻━┻ with dignity
1. ODE TO A SHIPPING LABEL
by Carlos Bueno

Once there was a little o,
with an accent on top like só.

It started out as UTF8,
(universal since '98),
but the program only knew latin1,
and changed little ó to "Ã³" for fun.

A second program saw the "Ã³"
and said "I know HTML entity!"
So "Ã³" was smartened to "&Atilde;&sup3;"
and passed on through happily.

Another program saw the tangle
(more precisely, ampersands to mangle)
and thus the humble "&Atilde;&sup3;"
became "&amp;Atilde;&amp;sup3;"
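The chain of transformations the poem describes can be reproduced in a few lines. This is a minimal Python 3 sketch, not code from the talk: the helper name `to_entities` is made up for illustration, and the three steps are one plausible reading of what each "program" in the poem did.

```python
import html
from html.entities import codepoint2name

original = "\u00f3"  # the little o with an accent: ó

# Program 1: the UTF-8 bytes are misread as Latin-1, one character per byte.
mojibake = original.encode("utf-8").decode("latin-1")


def to_entities(text):
    """Hypothetical helper: swap non-ASCII characters for named HTML entities."""
    pieces = []
    for ch in text:
        code = ord(ch)
        if code > 127 and code in codepoint2name:
            pieces.append("&" + codepoint2name[code] + ";")
        else:
            pieces.append(ch)
    return "".join(pieces)


# Program 2: "smartens" the two mojibake characters into HTML entities.
entities = to_entities(mojibake)

# Program 3: escapes the ampersands of text that was already escaped.
double_escaped = html.escape(entities)

print(mojibake)        # Ã³
print(entities)        # &Atilde;&sup3;
print(double_escaped)  # &amp;Atilde;&amp;sup3;
```

Each step is locally reasonable, which is why this kind of damage tends to accumulate silently as text passes between systems.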
9. – Luke Sneeringer | Program Committee Chair
“You'll be pleased to know that your talk title
crashed our meeting robot, which is a great
argument for the relevance of this talk. :-) ...”
48. Author Review
G. van Rossum
If you decide to design your own car
there are thousands sort of car…
R. Ebert
Every great car should feel new every
time you drive it.
L. Torvalds
Volvo isn’t evil, they just make really
crappy cars.
Application
Processes
Text
PSQL
52. My friend said: “I cannot
believe this is a Volvo! I
had a car just like this
when I lived in Montreal.”
He told me he had paid
9400€ for his.
Sample Review Text
63. My friend said: �I cannot
believe this is a Volvo! I
had a car just like this
when I lived in Montréal.�
He told me he had paid
9400� for his.
Output from UTF-8 encoded PSQL database
72. My friend said: “I cannot
believe this is a Volvo! I
had a car just like this
when I lived in Montreal.”
He told me he had paid
9400€ for his.
Original CP-1252 Data
73. My friend said: “I cannot
believe this is a Volvo! I
had a car just like this
when I lived in Montréal.”
He told me he had paid
9400€ for his.
Mixed CP-1252 & UTF-8
74. My friend said: �I cannot
believe this is a Volvo! I
had a car just like this
when I lived in Montréal.�
He told me he had paid
9400� for his.
Interpreted as UTF-8 by database
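The three views of the review text above can be approximated as follows. This is a Python 3 sketch under assumptions: the byte string is built by hand to mimic a row where most text is CP-1252 but one word was stored as UTF-8, and `errors="replace"` stands in for whatever the database client did when it hit undecodable bytes.

```python
# Most of the row is CP-1252: curly quotes and the euro sign are single bytes.
cp1252_start = "My friend said: \u201cI cannot believe this is a Volvo! ".encode("cp1252")
# One word arrived as UTF-8 instead.
utf8_middle = "Montr\u00e9al".encode("utf-8")
cp1252_end = "\u201d 9400\u20ac".encode("cp1252")

mixed = cp1252_start + utf8_middle + cp1252_end

# Read as CP-1252: the quotes and euro survive, the UTF-8 word turns to mojibake.
as_cp1252 = mixed.decode("cp1252")

# Read as UTF-8: the word survives, the CP-1252 bytes become U+FFFD.
as_utf8 = mixed.decode("utf-8", errors="replace")

print(as_cp1252)
print(as_utf8)
```

Neither single decoding recovers the whole row, which is what makes mixed-encoding data so awkward to repair after the fact.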
81. Traceback (most recent call last):
  File "...", line ..., in <module>
    unicode_row = row_text.decode()
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 31: ordinal not in range(128)
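A similar failure can be reproduced in Python 3, with one difference worth noting: there `bytes.decode()` defaults to UTF-8 rather than ASCII, so the sketch below names the ASCII codec explicitly. The sample bytes are an assumption chosen to contain the same 0x93 byte (a CP-1252 left curly quote).

```python
# A row containing CP-1252 curly quotes (bytes 0x93 and 0x94).
row_text = b"My friend said: \x93I cannot believe this is a Volvo!\x94"

error = None
try:
    row_text.decode("ascii")  # Python 2's implicit default codec
except UnicodeDecodeError as exc:
    error = exc

# The exception records which codec failed and where.
print(error.encoding, hex(row_text[error.start]), error.start)

# Naming the real encoding of the data makes the decode succeed.
unicode_row = row_text.decode("cp1252")
print(unicode_row)
```

The exception attributes (`encoding`, `start`, `object`) are often enough to guess which encoding the data was really in.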
97. test_bytes = 'I am a bytestring mwahaha'

test_unicode = u'ι αм υηι¢σ∂є!'

i_expect_unicode(test_bytes)

i_expect_bytes(test_unicode)
Test interfaces against
both Python text types
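One way to turn the slide above into an actual test case is sketched below in Python 3, where the two types are `str` and `bytes`. The function names come from the slide, but their bodies here are hypothetical stand-ins; a real interface might convert instead of raising.

```python
import unittest


def i_expect_unicode(value):
    """Hypothetical body: accept text only and reject bytes loudly."""
    if not isinstance(value, str):
        raise TypeError("expected str, got " + type(value).__name__)
    return value.upper()


def i_expect_bytes(value):
    """Hypothetical body: accept bytes only and reject text loudly."""
    if not isinstance(value, bytes):
        raise TypeError("expected bytes, got " + type(value).__name__)
    return len(value)


class TextTypeTests(unittest.TestCase):
    test_bytes = b"I am a bytestring mwahaha"
    test_unicode = "\u03b9 \u03b1\u043c \u03c5\u03b7\u03b9\u00a2\u03c3\u2202\u0454!"

    def test_unicode_interface_rejects_bytes(self):
        with self.assertRaises(TypeError):
            i_expect_unicode(self.test_bytes)

    def test_bytes_interface_rejects_unicode(self):
        with self.assertRaises(TypeError):
            i_expect_bytes(self.test_unicode)

    def test_each_interface_accepts_its_own_type(self):
        self.assertEqual(i_expect_bytes(self.test_bytes), 25)
        self.assertIsInstance(i_expect_unicode(self.test_unicode), str)
```

Saved as a module, this can be run with `python -m unittest`. The point is that every text-handling interface gets fed the "wrong" type at least once in the test suite.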
99. utf8_str = u'UՇF-8 ՇєsՇ'.encode('utf8')

with self.assertRaises(UnicodeDecodeError):
    line = ascii_handling_function(utf8_str)
Test handling of
incorrect encoding
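A self-contained version of this test might look like the following Python 3 sketch. `ascii_handling_function` is named on the slide, but its body here is an assumption: a stand-in that simply decodes its input as ASCII.

```python
import unittest


def ascii_handling_function(raw):
    """Hypothetical stand-in: decodes its input strictly as ASCII."""
    return raw.decode("ascii")


class IncorrectEncodingTests(unittest.TestCase):
    def test_utf8_input_is_rejected(self):
        # Non-ASCII text encoded as UTF-8 must not slip through silently.
        utf8_bytes = "U\u0547F-8 \u0547\u0454s\u0547".encode("utf-8")
        with self.assertRaises(UnicodeDecodeError):
            ascii_handling_function(utf8_bytes)

    def test_plain_ascii_is_accepted(self):
        self.assertEqual(ascii_handling_function(b"plain"), "plain")
```

Pairing the failing case with a passing one guards against a function that raises on everything.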
100. Best Practices
1. Know your encodings
2. Use the Unicode sandwich
3. Test your (text related) code
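The Unicode sandwich in practice 2 means: decode bytes to text as early as possible, work only with text in the middle, and encode back to bytes as late as possible. A minimal Python 3 sketch follows; the function name `shout_reviews`, the choice of CP-1252 input and UTF-8 output, and the uppercase "processing" step are all assumptions made for illustration.

```python
import io


def shout_reviews(raw_in, raw_out, in_encoding="cp1252", out_encoding="utf-8"):
    """Hypothetical pipeline: bytes in, text in the middle, bytes out."""
    # Bottom slice of bread: decode at the input boundary.
    text = raw_in.read().decode(in_encoding)

    # The filling: all processing happens on text, never on raw bytes.
    processed = text.upper()

    # Top slice of bread: encode at the output boundary.
    raw_out.write(processed.encode(out_encoding))


# In-memory byte streams stand in for a file, socket, or database driver.
source = io.BytesIO("Montr\u00e9al costs 9400\u20ac".encode("cp1252"))
sink = io.BytesIO()
shout_reviews(source, sink)
print(sink.getvalue().decode("utf-8"))  # MONTRÉAL COSTS 9400€
```

Keeping the encodings as explicit parameters at the edges is what makes practice 1, knowing your encodings, enforceable in code.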
127. >>> u'☃ Brrrr!'.encode('cp1252', 'strict')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/esther/ENV/lib/python2.7/encodings/cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2603' in position 0: character maps to <undefined>
[Python 2.7]
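The `'strict'` argument in that call is only one of several error handlers built into Python's codecs. The Python 3 sketch below compares a few of the standard ones on the same snowman string; which handlers the talk itself went on to show is not recoverable from this transcript, so treat the selection as an assumption.

```python
text = "\u2603 Brrrr!"  # snowman, which has no CP-1252 equivalent

failure = None
try:
    text.encode("cp1252", "strict")  # the default handler: raise
except UnicodeEncodeError as exc:
    failure = exc

# Lossy alternatives: each trades the exception for silent data changes.
replaced = text.encode("cp1252", "replace")           # b'? Brrrr!'
ignored = text.encode("cp1252", "ignore")             # b' Brrrr!'
xml_ref = text.encode("cp1252", "xmlcharrefreplace")  # b'&#9731; Brrrr!'

print(failure.reason)
print(replaced, ignored, xml_ref)
```

`'strict'` is usually the safest choice while debugging, since the lossy handlers hide exactly the characters that need attention.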
131. Cars.com / NewCars.com Tech Team

SoCal Piggies

Ned Batchelder
(for his Pragmatic Unicode talk)
Thank you ツ
132. Pragmatic Unicode
http://nedbatchelder.com/text/unipain.html

The Absolute Minimum You Must Know
http://www.joelonsoftware.com/articles/Unicode.html

Chapter on Strings in “Dive into Python” by Mark Pilgrim
http://getpython3.com/diveintopython3/strings.html

General questions, relating to UTF or Encoding Form
http://www.unicode.org/faq/utf_bom.html

Unicode HOWTO (Python 2.7)
http://docs.python.org/2/howto/unicode.html
The fundamentals
133. “Just what the dickens is ‘Unicode’?”
https://pythonhosted.org/kitchen/unicode-frustrations.html

Differences between these commonly confused encodings
http://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html

“Latin-1” in MySQL is more like “CP-1252”
https://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html

Why it's important to write tests with character boundary values
http://labs.spotify.com/2013/06/18/creative-usernames/
Further reading