ODE TO A SHIPPING LABEL
by Carlos Bueno

Once there was a little o,
with an accent on top like só.

It started out as UTF8,
(universal since '98),
but the program only knew latin1,
and changed little ó to "Ã³" for fun.

A second program saw the "Ã³"
and said "I know HTML entity!"
So "Ã³" was smartened to "&Atilde;&sup3;"
and passed on through happily.

Another program saw the tangle
(more precisely, ampersands to mangle)
and thus the humble "&Atilde;&sup3;"
became "&amp;Atilde;&amp;sup3;"
Character Encoding & Unicode
How to (╯°□°)╯︵ ┻━┻ with dignity
Esther Nam & Travis Fischer
PyCon US 2014, Montréal
Uni-wat?!
┻━┻ ︵ヽ ノ︵ ┻━┻
How to (╯°□°)╯︵ ┻━┻ with dignity
“You'll be pleased to know that your talk title crashed our meeting robot, which is a great argument for the relevance of this talk. :-) ...”
– Luke Sneeringer | Program Committee Chair
Python 3
is out of scope
The Fundamentals
of Unicode
Humans use text.
Computers speak bytes.
a -> 01100001
     ASCII      ISO-8859-15 (latin-9)   CP-1252 (Windows 1252)   UTF-8
a    01100001   01100001                01100001                 01100001
€    NA         10100100                10000000                 11100010 10000010 10101100
¤    NA         NA                      10100100                 11000010 10100100
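The table above can be reproduced with Python's built-in codecs. This is a minimal sketch, not part of the original slides: the helper name `bits` is made up for illustration, and the code is written to run on Python 3 (the slides target 2.7).

```python
# Sketch: show each character's bytes, as bit strings, under four encodings.
ENCODINGS = ["ascii", "iso-8859-15", "cp1252", "utf-8"]


def bits(char, encoding):
    """Return the encoded bytes of `char` as bit strings, or 'NA' if unencodable."""
    try:
        raw = bytearray(char.encode(encoding))
    except UnicodeEncodeError:
        return "NA"  # this encoding has no byte sequence for the character
    return " ".join(format(byte, "08b") for byte in raw)


# a, EURO SIGN (U+20AC), CURRENCY SIGN (U+00A4)
for char in [u"a", u"\u20ac", u"\u00a4"]:
    print(" | ".join(bits(char, enc) for enc in ENCODINGS))
```

Note that the same byte, 10100100, means € in ISO-8859-15 and ¤ in CP-1252.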
π — ‽ ☠ ☁ ☂ ☃ ☄ ★ ☆ ☇ ☈ ☉ ☊
☋ ☌ ☍ ☎ ☏ ☐ ☑ ☒ ☓ ☘ ☙
☚ ☛ ☜ ☝ ☞ ☟ ☠ ☡ ☢ ☣ ☤ ☥ ☦ ☧
☨ ☩ ☪ ☫ ☬ ☭ ☮ ☯ ☸ ☹ ☺ ☻ ☼ ☽ ☾
☿ ♀ ♁ ♂ ♃ ♄ ♅ ♆ ♇ ♔ ♕ ♖ ♗ ♘
♙ ♚ ♛ ♜ ♝ ♞ ♟ ♠ ♡ ♢ ♣ ♤ ♥
♦ ♧ ♨ ♩ ♪ ♫ ♬ ♭ ♯ ♰
♾ ⚀ ⚁ ⚂ ⚃ ⚄ ⚅ ⚆ ⚇ ⚈
Unicode
a -> U+0061
Character -> Unicode Code Point

a -> U+0061
Character -> LATIN SMALL LETTER A

Computers speak bytes.

Unicode
a -> U+0061 -> 01100001
Character -> Unicode Code Point -> Binary Encoding
UTF-8
Unicode Transformation Format
Unicode != UTF-8
Code Points Binary Encoding
U+0061 01100001
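The character / code point / binary encoding layers can be inspected from Python. The sketch below uses only the standard library and runs on Python 3; the variable names are illustrative. It also encodes the same code point as UTF-16 to show that a code point and its UTF-8 bytes are different things.

```python
import unicodedata

char = u"a"
code_point = ord(char)                    # the integer code point
label = "U+{:04X}".format(code_point)     # conventional U+XXXX notation
name = unicodedata.name(char)             # official Unicode character name

# One code point, two different binary encodings.
utf8_bits = " ".join(format(b, "08b") for b in bytearray(char.encode("utf-8")))
utf16_bits = " ".join(format(b, "08b") for b in bytearray(char.encode("utf-16-be")))

print(label, name, utf8_bits, utf16_bits)
```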
Layers of Abstraction
• Display (Glyphs | Fonts)
Let them eat cake!
• Text (Unicode | Code Points)
U+0061
• Storage (Binary | UTF-8)
01100001
Unicode & Python
[Python 2.7]
str type
>>> euro_bytestring = '€'
>>> type(euro_bytestring)
<type 'str'>
[Python 2.7]
unicode type
# € code point
>>> euro_unicode = u'\u20ac'
>>> type(euro_unicode)
<type 'unicode'>
[Python 2.7]
Unicode (code points)
u'\u20ac'

'\xe2\x82\xac'.decode('utf8')   (bytes -> Unicode)
u'\u20ac'.encode('utf8')        (Unicode -> bytes)

Bytes (UTF-8)
'\xe2\x82\xac'
[Python 2.7]
Unicode (code points)
u'\u20ac'

'\xe2\x82\xac'.becode('utf8')
u'\u20ac'.uncode('utf8')

Bytes (UTF-8)
'\xe2\x82\xac'
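Python has no `becode` or `uncode` methods. I read them as a mnemonic (bytes decode, Unicode encode), though the slide does not say so. The real round trip is sketched below in Python 3 syntax, where the byte string needs a `b` prefix.

```python
euro_bytes = b"\xe2\x82\xac"             # three UTF-8 bytes
euro_text = euro_bytes.decode("utf8")    # bytes -> code points
round_trip = euro_text.encode("utf8")    # code points -> bytes

print(len(euro_bytes), len(euro_text))   # three bytes, one code point
```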
[Python 2.7]
You CANNOT infer an
encoding from a bytestring
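A small illustration of why the bytes alone are not enough: the same single byte decodes to different characters, or fails, depending on the encoding you assume. This sketch runs on Python 3 and its variable names are illustrative.

```python
raw = b"\xa4"  # one byte, no encoding information attached

as_latin9 = raw.decode("iso-8859-15")   # EURO SIGN under latin-9
as_cp1252 = raw.decode("cp1252")        # CURRENCY SIGN under Windows 1252

try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False                  # a lone 0xA4 is not valid UTF-8
```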
#! /usr/bin/python
# -*- coding: utf8 -*-

# Opened file should be latin-1 encoded
# If it’s not, call tech support ASAP
with open("input_file.csv") as input_file:

Date: Wed, 11 Apr 2014 11:15:55 -0600
To: foo@bar.com
From: bar@foo.com
Subject: Character encoding
MIME-Version: 1.0
Content-Type: text/html; charset="UTF-8"

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD …>
<html xmlns="http://www.w3.org/1999/xhtml" …>
Best Practices
Example Application
Author          Review
G. van Rossum   If you decide to design your own car there are thousands sort of car…
R. Ebert        Every great car should feel new every time you drive it.
L. Torvalds     Volvo isn’t evil, they just make really crappy cars.
Application Processes Text
PSQL
Encoding: Windows 1252 (CP-1252)
Montreal -> Montréal
psql=# set server_encoding to "utf-8";
My friend said: “I cannot
believe this is a Volvo! I
had a car just like this
when I lived in Montreal.”
He told me he had paid
9400€ for his.
Sample Review Text
# -*- coding: utf-8 -*-

reviews_file = open("reviews_file.csv")

for row_text in reviews_file:
    author, date, review_text = row_text.split(",")
    converted_review = review_text.replace("Montreal",
                                           "Montréal")
    DB.insert(ReviewTable, author, date,
              converted_review)
[Python 2.7]
My friend said: �I cannot
believe this is a Volvo! I
had a car just like this
when I lived in Montréal.�
He told me he had paid
9400� for his.
Output from UTF-8 encoded PSQL database
My friend said: “I cannot
believe this is a Volvo! I
had a car just like this
when I lived in Montreal.”
He told me he had paid
9400€ for his.
Original CP-1252 Data
My friend said: “I cannot
believe this is a Volvo! I
had a car just like this
when I lived in Montréal.”
He told me he had paid
9400€ for his.
Mixed CP-1252 & UTF-8
My friend said: �I cannot
believe this is a Volvo! I
had a car just like this
when I lived in Montréal.�
He told me he had paid
9400� for his.
Interpreted as UTF-8 by database
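The three stages above (original CP-1252 bytes, a UTF-8 "Montréal" spliced in, the result read as UTF-8) can be reproduced in a few lines. This is a sketch under assumptions: it runs on Python 3, and it uses the `replace` error handler to stand in for however the database rendered the undecodable bytes.

```python
review = (u"My friend said: \u201cI cannot believe this is a Volvo! "
          u"I had a car just like this when I lived in Montreal.\u201d "
          u"He told me he had paid 9400\u20ac for his.")

# 1. Original CP-1252 data, as read from the file.
cp1252_bytes = review.encode("cp1252")

# 2. The source file is UTF-8, so its "Montréal" literal is UTF-8 bytes.
utf8_literal = u"Montr\u00e9al".encode("utf-8")
mixed_bytes = cp1252_bytes.replace(b"Montreal", utf8_literal)

# 3. The database treats everything as UTF-8; stray CP-1252 bytes are invalid.
as_seen_by_db = mixed_bytes.decode("utf-8", "replace")
```

Only the curly quotes and the euro sign are damaged, because those are the bytes where CP-1252 and UTF-8 disagree.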
Know your encodings
Best Practice #1
# -*- coding: utf-8 -*-

reviews_file = open("reviews_file.csv")

for row_text in reviews_file:
    unicode_row = row_text.decode()
    author, date, review_text = unicode_row.split(",")
    converted_review = review_text.replace("Montreal",
                                           "Montréal")
    DB.insert(ReviewTable, author, date,
              converted_review)
[Python 2.7]
# -*- coding: utf-8 -*-

reviews_file = open("reviews_file.csv")

for row_text in reviews_file:
    unicode_row = row_text.decode()
    author, date, review_text = unicode_row.split(u",")
    converted_review = review_text.replace(u"Montreal",
                                           u"Montréal")
    DB.insert(ReviewTable, author, date,
              converted_review)
[Python 2.7]
Traceback (most recent call last):
  File "...", line ..., in <module>
    unicode_row = row_text.decode()
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 31: ordinal not in range(128)
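In Python 2.7, calling `decode()` with no argument falls back to the ASCII codec, which is why byte 0x93 (the CP-1252 left curly quote) raises. The sketch below reproduces that on Python 3 by naming `"ascii"` explicitly, since Python 3's default is UTF-8. The sample row, including its date, is made up for illustration.

```python
# Hypothetical CSV row, encoded as CP-1252 like the reviews file in the slides.
row = u"R. Ebert,2014-04-11,\u201cEvery great car\u201d".encode("cp1252")

try:
    row.decode("ascii")              # what Python 2's bare .decode() does
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc                # records which byte could not be decoded

# Naming the real encoding decodes the row cleanly.
text = row.decode("cp1252")
```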
# -*- coding: utf-8 -*-

reviews_file = open("reviews_file.csv")

for row_text in reviews_file:
    unicode_row = row_text.decode("cp1252")
    author, date, review_text = unicode_row.split(u",")
    converted_review = review_text.replace(u"Montreal",
                                           u"Montréal")
    DB.insert(ReviewTable, author, date,
              converted_review)
[Python 2.7]
# -*- coding: utf-8 -*-

reviews_file = open("reviews_file.csv")

for row_text in reviews_file:
    unicode_row = row_text.decode("cp1252")
    author, date, review_text = unicode_row.split(u",")
    converted_review = review_text.replace(u"Montreal",
                                           u"Montréal")
    DB.insert(ReviewTable,
              author.encode("utf8"),
              date.encode("utf8"),
              converted_review.encode("utf8"))
[Python 2.7]
My friend said: “I cannot
believe this is a Volvo! I
had a car just like this
when I lived in Montréal.”
He told me he had paid
9400€ for his.
Use the Unicode
Sandwich
Best Practice #2
Decode as early as possible.
Unicode everywhere in the middle.
Encode as late as possible.
# -*- coding: utf-8 -*-

reviews_file = open("reviews_file.csv")

for row_text in reviews_file:
    unicode_row = row_text.decode("cp1252")
    author, date, review_text = unicode_row.split(u",")
    converted_review = review_text.replace(u"Montreal",
                                           u"Montréal")
    DB.insert(ReviewTable,
              author.encode("utf8"),
              date.encode("utf8"),
              converted_review.encode("utf8"))
[Python 2.7]
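One way to push the decode and encode steps to the very edges is to let the file objects do them. The sketch below is an assumption-laden variant, not the speakers' code: it uses `io.open` (available on both Python 2.7 and 3), writes to a second file instead of the slides' `DB.insert`, and the function name and file paths are hypothetical.

```python
import io
import os
import tempfile


def convert_reviews(src_path, dst_path,
                    src_encoding="cp1252", dst_encoding="utf-8"):
    """Decode on read, work on Unicode text, encode on write."""
    with io.open(src_path, encoding=src_encoding) as src, \
            io.open(dst_path, "w", encoding=dst_encoding) as dst:
        for row in src:                       # row is already Unicode text
            dst.write(row.replace(u"Montreal", u"Montr\u00e9al"))


# Demonstration with a throwaway CP-1252 input file.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "reviews_cp1252.csv")
dst_path = os.path.join(tmp_dir, "reviews_utf8.csv")
with io.open(src_path, "wb") as f:
    f.write(u"I lived in Montreal.\u201d 9400\u20ac".encode("cp1252"))

convert_reviews(src_path, dst_path)

with io.open(dst_path, "rb") as f:
    result = f.read()                         # raw UTF-8 bytes on disk
```

The body of the loop never sees a byte string, so there is no place for an implicit ASCII conversion to sneak in.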
Test Your
(Text Related) Code
Best Practice #3
Test encoding ranges
& boundaries
test_strings = ['Hello Montreal!',
'¡‫ן‬ɐǝɹʇuoɯ o‫ןן‬ǝɥ',
'ђєɭɭ๏ ๓๏ภՇгєค !']
func_under_test(test_strings)
test_bytes = 'I am a bytestring mwahaha'
test_unicode = u'ι αм υηι¢σ∂є!'

i_expect_unicode(test_bytes)
i_expect_bytes(test_unicode)
Test interfaces against
both Python text types
def ascii_handling_function(ascii_str):
    ...
    ascii_str.decode('ascii')
    ...
Test handling of
incorrect encoding
utf8_str = u'UՇF-8 ՇєsՇ'.encode('utf8')

with self.assertRaises(UnicodeDecodeError):
    line = ascii_handling_function(utf8_str)
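The incorrect-encoding test can be fleshed out into a runnable `unittest` case. This is a sketch: `ascii_handling_function` here is a stand-in that only decodes, and the test class and method names are made up. It runs on Python 3.

```python
import unittest


def ascii_handling_function(ascii_str):
    """Stand-in for the function under test: accepts ASCII bytes only."""
    return ascii_str.decode("ascii")


class IncorrectEncodingTest(unittest.TestCase):
    def test_rejects_utf8_bytes(self):
        # Non-ASCII text encoded as UTF-8 must not slip through silently.
        utf8_str = u"U\u0547F-8 \u0547\u0454s\u0547".encode("utf8")
        with self.assertRaises(UnicodeDecodeError):
            ascii_handling_function(utf8_str)

    def test_accepts_ascii_bytes(self):
        self.assertEqual(ascii_handling_function(b"Hello Montreal!"),
                         u"Hello Montreal!")


suite = unittest.TestLoader().loadTestsFromTestCase(IncorrectEncodingTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```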
Best Practices
1. Know your encodings
2. Use the Unicode sandwich
3. Test your (text related) code
Issues We Can’t
Control
Incorrect encoding
Declared as “CP-1252”
Is actually “UTF-8”
UnicodeDecodeError
How to Deal
• Ask
• Guess (with chardet library)
• You wrote tests, right?
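The slides point to chardet for guessing. As a rough standard-library stand-in (a different and much cruder technique than chardet's statistical detection), you can try strict decoding against an ordered list of candidates. The function name and the candidate order are assumptions of this sketch, which runs on Python 3.

```python
def guess_encoding(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate that decodes `raw` without error, else None.

    UTF-8 goes first because random non-UTF-8 bytes rarely form valid UTF-8;
    latin-1 goes last because it accepts every byte and so never fails.
    """
    for encoding in candidates:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None


print(guess_encoding(u"Montr\u00e9al".encode("utf-8")))
print(guess_encoding(b"\x93I cannot\x94"))
```

A successful decode only shows the bytes are valid in that encoding, not that the guess is right, so asking the data's owner remains the better option.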
Mixed encodings or
corrupted bytes
John Smithâ€™s Autoplex

Broken text&hellip; it&#x2019;s fantastic!

Hello ^[[30m; World
MOJIBAKE
u"John Smithâ€™s Autoplex"

>>> u'John Smithâ€™s Autoplex'.encode('cp1252')
'John Smith\xe2\x80\x99s Autoplex'
(bytestring)

>>> 'John Smith\xe2\x80\x99s Autoplex'.decode('utf8')
u'John Smith’s Autoplex'
UTF8:    U+2019 ’  ->  \xe2\x80\x99
CP1252:  \xe2 \x80 \x99  ->  U+00e2 â, U+20ac €, U+2122 ™
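The diagram describes a round trip that is easy to replay: UTF-8 bytes misread as CP-1252 produce the mojibake, and reversing the two steps repairs it. A minimal sketch that runs on Python 3, with illustrative variable names:

```python
original = u"John Smith\u2019s Autoplex"       # RIGHT SINGLE QUOTATION MARK

# Damage: correct UTF-8 bytes decoded with the wrong codec.
mojibake = original.encode("utf-8").decode("cp1252")

# Repair: undo the wrong decode, then decode with the right codec.
repaired = mojibake.encode("cp1252").decode("utf-8")
```

This repair only works when every byte survived the wrong decode, which is not guaranteed because CP-1252 leaves a few byte values undefined.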
str_dealer = 'John Smith\xe2\x80\x99s Autoplex'  # UTF-8 bytes in data declared as CP1252


def manually_convert_encoding(str_dealer):
    """
    Manually replace incorrect, UTF8-encoded bytes
    with CP1252 bytes for the same character
    """
    # str.replace returns a new string, so keep each result
    str_dealer = str_dealer.replace('\xe2\x80\x98', '\x91')  # ‘
    str_dealer = str_dealer.replace('\xe2\x80\x99', '\x92')  # ’
    str_dealer = str_dealer.replace('\xe2\x80\x9c', '\x93')  # “
    str_dealer = str_dealer.replace('\xe2\x80\x9d', '\x94')  # ”
    str_dealer = str_dealer.replace('\xe2\x80\x94', '\x97')  # —
    str_dealer = str_dealer.replace('\xe2\x84\xa2', '\x99')  # ™
    str_dealer = str_dealer.replace('\xe2\x82\xac', '\x80')  # €
    return str_dealer
>>> dealer_name = u"John Smithâ€™s Autoplex"
>>> from ftfy import fix_text
>>> fix_text(dealer_name)
u"John Smith's Autoplex"
python-ftfy fixes mojibake
Target encoding
can’t handle
source data
Source Data (UTF-8) -> ? -> Target Application Data (CP-1252)
>>> u'☃ Brrrr!'.encode('cp1252', 'strict')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/esther/ENV/lib/python2.7/encodings/cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2603' in position 0: character maps to <undefined>
[Python 2.7]
>>> u'☃ Brrrr!'.encode('cp1252', 'ignore')
' Brrrr!'
[Python 2.7]
>>> u'☃ Brrrr!'.encode('cp1252', 'replace')
'? Brrrr!'
[Python 2.7]
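The three error handlers above can be compared side by side. The sketch runs on Python 3, where the results are `bytes`. It also adds `xmlcharrefreplace`, a standard handler the slides do not mention, which keeps the information as an XML numeric character reference when the target is HTML or XML.

```python
snowman = u"\u2603 Brrrr!"                     # SNOWMAN is not in CP-1252

try:
    snowman.encode("cp1252", "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True                       # the default: refuse loudly

ignored = snowman.encode("cp1252", "ignore")   # silently drops the character
replaced = snowman.encode("cp1252", "replace") # substitutes a question mark
escaped = snowman.encode("cp1252", "xmlcharrefreplace")  # numeric reference
```

Both `ignore` and `replace` lose data irreversibly, so `strict` is usually the safer choice while you are still finding out what the data contains.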
U+0004
END OF TRANSMISSION
Thank you ツ
Cars.com / NewCars.com Tech Team
SoCal Piggies
Ned Batchelder (for his Pragmatic Unicode talk)
The fundamentals

Pragmatic Unicode
http://nedbatchelder.com/text/unipain.html

The Absolute Minimum You Must Know
http://www.joelonsoftware.com/articles/Unicode.html

Chapter on Strings in “Dive into Python” by Mark Pilgrim
http://getpython3.com/diveintopython3/strings.html

General questions, relating to UTF or Encoding Form
http://www.unicode.org/faq/utf_bom.html

Unicode HOWTO (Python 2.7)
http://docs.python.org/2/howto/unicode.html

Further reading

“Just what the dickens is ‘Unicode’?”
https://pythonhosted.org/kitchen/unicode-frustrations.html

Differences between these commonly confused encodings
http://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html

“Latin-1” in MySQL is more like “CP-1252”
https://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html

Why it's important to write tests with character boundary values
http://labs.spotify.com/2013/06/18/creative-usernames/

Tools

chardet
https://pypi.python.org/pypi/chardet

python-ftfy
https://github.com/LuminosoInsight/python-ftfy

@estherbester @travisfischer
Slides at http://bit.ly/flip_tables
IRC

Mais conteúdo relacionado

Mais procurados

モデルベース開発勉強会
モデルベース開発勉強会モデルベース開発勉強会
モデルベース開発勉強会耕二 阿部
 
Accelerate with ibm storage ibm spectrum virtualize hyper swap deep dive dee...
Accelerate with ibm storage  ibm spectrum virtualize hyper swap deep dive dee...Accelerate with ibm storage  ibm spectrum virtualize hyper swap deep dive dee...
Accelerate with ibm storage ibm spectrum virtualize hyper swap deep dive dee...xKinAnx
 
Cuadros De Posix Y Win 32
Cuadros De Posix Y Win 32Cuadros De Posix Y Win 32
Cuadros De Posix Y Win 32sistemasop
 
Linux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old SecretsLinux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old SecretsBrendan Gregg
 
様々な全域木問題
様々な全域木問題様々な全域木問題
様々な全域木問題tmaehara
 
Intoduction to Homotopy Type Therory
Intoduction to Homotopy Type TheroryIntoduction to Homotopy Type Therory
Intoduction to Homotopy Type TheroryJack Fox
 
Letter of Recommendation coworker
Letter of Recommendation coworkerLetter of Recommendation coworker
Letter of Recommendation coworkerTibor Belt
 
我的 Windows 平台自動化經驗:基礎批次檔撰寫實務
我的 Windows 平台自動化經驗:基礎批次檔撰寫實務我的 Windows 平台自動化經驗:基礎批次檔撰寫實務
我的 Windows 平台自動化經驗:基礎批次檔撰寫實務Will Huang
 
XPDS16: Porting Xen on ARM to a new SOC - Julien Grall, ARM
XPDS16: Porting Xen on ARM to a new SOC - Julien Grall, ARMXPDS16: Porting Xen on ARM to a new SOC - Julien Grall, ARM
XPDS16: Porting Xen on ARM to a new SOC - Julien Grall, ARMThe Linux Foundation
 
Nick Stephens-how does someone unlock your phone with nose
Nick Stephens-how does someone unlock your phone with noseNick Stephens-how does someone unlock your phone with nose
Nick Stephens-how does someone unlock your phone with noseGeekPwn Keen
 
Device tree support on arm linux
Device tree support on arm linuxDevice tree support on arm linux
Device tree support on arm linuxChih-Min Chao
 
LLVM Backend の紹介
LLVM Backend の紹介LLVM Backend の紹介
LLVM Backend の紹介Akira Maruoka
 
「時計の世界の整数論」第2回プログラマのための数学勉強会 #maths4pg
「時計の世界の整数論」第2回プログラマのための数学勉強会 #maths4pg「時計の世界の整数論」第2回プログラマのための数学勉強会 #maths4pg
「時計の世界の整数論」第2回プログラマのための数学勉強会 #maths4pgJunpei Tsuji
 

Mais procurados (20)

プログラミングコンテスト基礎テクニック
プログラミングコンテスト基礎テクニックプログラミングコンテスト基礎テクニック
プログラミングコンテスト基礎テクニック
 
モデルベース開発勉強会
モデルベース開発勉強会モデルベース開発勉強会
モデルベース開発勉強会
 
CS 354 Typography
CS 354 TypographyCS 354 Typography
CS 354 Typography
 
Les divinitats gregues i romanes
Les divinitats gregues i romanesLes divinitats gregues i romanes
Les divinitats gregues i romanes
 
Accelerate with ibm storage ibm spectrum virtualize hyper swap deep dive dee...
Accelerate with ibm storage  ibm spectrum virtualize hyper swap deep dive dee...Accelerate with ibm storage  ibm spectrum virtualize hyper swap deep dive dee...
Accelerate with ibm storage ibm spectrum virtualize hyper swap deep dive dee...
 
Cuadros De Posix Y Win 32
Cuadros De Posix Y Win 32Cuadros De Posix Y Win 32
Cuadros De Posix Y Win 32
 
Linux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old SecretsLinux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old Secrets
 
Arduino藍牙傳輸應用
Arduino藍牙傳輸應用Arduino藍牙傳輸應用
Arduino藍牙傳輸應用
 
様々な全域木問題
様々な全域木問題様々な全域木問題
様々な全域木問題
 
Intoduction to Homotopy Type Therory
Intoduction to Homotopy Type TheroryIntoduction to Homotopy Type Therory
Intoduction to Homotopy Type Therory
 
Letter of Recommendation coworker
Letter of Recommendation coworkerLetter of Recommendation coworker
Letter of Recommendation coworker
 
我的 Windows 平台自動化經驗:基礎批次檔撰寫實務
我的 Windows 平台自動化經驗:基礎批次檔撰寫實務我的 Windows 平台自動化經驗:基礎批次檔撰寫實務
我的 Windows 平台自動化經驗:基礎批次檔撰寫實務
 
XPDS16: Porting Xen on ARM to a new SOC - Julien Grall, ARM
XPDS16: Porting Xen on ARM to a new SOC - Julien Grall, ARMXPDS16: Porting Xen on ARM to a new SOC - Julien Grall, ARM
XPDS16: Porting Xen on ARM to a new SOC - Julien Grall, ARM
 
Nick Stephens-how does someone unlock your phone with nose
Nick Stephens-how does someone unlock your phone with noseNick Stephens-how does someone unlock your phone with nose
Nick Stephens-how does someone unlock your phone with nose
 
Cscope and ctags
Cscope and ctagsCscope and ctags
Cscope and ctags
 
Device tree support on arm linux
Device tree support on arm linuxDevice tree support on arm linux
Device tree support on arm linux
 
新しい暗号技術
新しい暗号技術新しい暗号技術
新しい暗号技術
 
LLVM Backend の紹介
LLVM Backend の紹介LLVM Backend の紹介
LLVM Backend の紹介
 
「時計の世界の整数論」第2回プログラマのための数学勉強会 #maths4pg
「時計の世界の整数論」第2回プログラマのための数学勉強会 #maths4pg「時計の世界の整数論」第2回プログラマのための数学勉強会 #maths4pg
「時計の世界の整数論」第2回プログラマのための数学勉強会 #maths4pg
 
Memory, IPC and L4Re
Memory, IPC and L4ReMemory, IPC and L4Re
Memory, IPC and L4Re
 

Destaque

Unicode, character encodings in programming and standard persian keyboard layout
Unicode, character encodings in programming and standard persian keyboard layoutUnicode, character encodings in programming and standard persian keyboard layout
Unicode, character encodings in programming and standard persian keyboard layoutbijan_
 
UTF-8: The Secret of Character Encoding
UTF-8: The Secret of Character EncodingUTF-8: The Secret of Character Encoding
UTF-8: The Secret of Character EncodingBert Pattyn
 
Conversational Internet - Creating a natural language interface for web pages
Conversational Internet - Creating a natural language interface for web pagesConversational Internet - Creating a natural language interface for web pages
Conversational Internet - Creating a natural language interface for web pagesDale Lane
 
Automatic Language Identification
Automatic Language IdentificationAutomatic Language Identification
Automatic Language Identificationbigshum
 
Lightweight Natural Language Processing (NLP)
Lightweight Natural Language Processing (NLP)Lightweight Natural Language Processing (NLP)
Lightweight Natural Language Processing (NLP)Lithium
 
Character encoding standard(1)
Character encoding standard(1)Character encoding standard(1)
Character encoding standard(1)Pramila Selvaraj
 
Unicode Fundamentals
Unicode Fundamentals Unicode Fundamentals
Unicode Fundamentals SamiHsDU
 
Corpus Bootstrapping with NLTK
Corpus Bootstrapping with NLTKCorpus Bootstrapping with NLTK
Corpus Bootstrapping with NLTKJacob Perkins
 
Lanyrd Pro
Lanyrd ProLanyrd Pro
Lanyrd ProLanyrd
 
Lanyrd's new integrations with Eventbrite
Lanyrd's new integrations with EventbriteLanyrd's new integrations with Eventbrite
Lanyrd's new integrations with EventbriteLanyrd
 
Open Software Platforms for Mobile Digital Broadcasting
Open Software Platforms for Mobile Digital BroadcastingOpen Software Platforms for Mobile Digital Broadcasting
Open Software Platforms for Mobile Digital BroadcastingFrancois Lefebvre
 
Games To Explain Human Factors: Come, Participate, Learn & Have Fun!!! Photo ...
Games To Explain Human Factors: Come, Participate, Learn & Have Fun!!! Photo ...Games To Explain Human Factors: Come, Participate, Learn & Have Fun!!! Photo ...
Games To Explain Human Factors: Come, Participate, Learn & Have Fun!!! Photo ...Ronald G. Shapiro
 
Putting Out Fires with Content Strategy (InfoDevDC meetup)
Putting Out Fires with Content Strategy (InfoDevDC meetup)Putting Out Fires with Content Strategy (InfoDevDC meetup)
Putting Out Fires with Content Strategy (InfoDevDC meetup)John Collins
 

Destaque (20)

Unicode, character encodings in programming and standard persian keyboard layout
Unicode, character encodings in programming and standard persian keyboard layoutUnicode, character encodings in programming and standard persian keyboard layout
Unicode, character encodings in programming and standard persian keyboard layout
 
UTF-8: The Secret of Character Encoding
UTF-8: The Secret of Character EncodingUTF-8: The Secret of Character Encoding
UTF-8: The Secret of Character Encoding
 
Unicode
UnicodeUnicode
Unicode
 
What character is that
What character is thatWhat character is that
What character is that
 
Unicode
UnicodeUnicode
Unicode
 
Unicode
UnicodeUnicode
Unicode
 
learn-python
learn-pythonlearn-python
learn-python
 
Conversational Internet - Creating a natural language interface for web pages
Conversational Internet - Creating a natural language interface for web pagesConversational Internet - Creating a natural language interface for web pages
Conversational Internet - Creating a natural language interface for web pages
 
Automatic Language Identification
Automatic Language IdentificationAutomatic Language Identification
Automatic Language Identification
 
Lightweight Natural Language Processing (NLP)
Lightweight Natural Language Processing (NLP)Lightweight Natural Language Processing (NLP)
Lightweight Natural Language Processing (NLP)
 
Character encoding standard(1)
Character encoding standard(1)Character encoding standard(1)
Character encoding standard(1)
 
Unicode 101
Unicode 101Unicode 101
Unicode 101
 
Unicode Fundamentals
Unicode Fundamentals Unicode Fundamentals
Unicode Fundamentals
 
Corpus Bootstrapping with NLTK
Corpus Bootstrapping with NLTKCorpus Bootstrapping with NLTK
Corpus Bootstrapping with NLTK
 
Lanyrd Pro
Lanyrd ProLanyrd Pro
Lanyrd Pro
 
Lanyrd's new integrations with Eventbrite
Lanyrd's new integrations with EventbriteLanyrd's new integrations with Eventbrite
Lanyrd's new integrations with Eventbrite
 
Silmeyiniz
SilmeyinizSilmeyiniz
Silmeyiniz
 
Open Software Platforms for Mobile Digital Broadcasting
Open Software Platforms for Mobile Digital BroadcastingOpen Software Platforms for Mobile Digital Broadcasting
Open Software Platforms for Mobile Digital Broadcasting
 
Games To Explain Human Factors: Come, Participate, Learn & Have Fun!!! Photo ...
Games To Explain Human Factors: Come, Participate, Learn & Have Fun!!! Photo ...Games To Explain Human Factors: Come, Participate, Learn & Have Fun!!! Photo ...
Games To Explain Human Factors: Come, Participate, Learn & Have Fun!!! Photo ...
 
Putting Out Fires with Content Strategy (InfoDevDC meetup)
Putting Out Fires with Content Strategy (InfoDevDC meetup)Putting Out Fires with Content Strategy (InfoDevDC meetup)
Putting Out Fires with Content Strategy (InfoDevDC meetup)
 

Character Encoding & Unicode - How to (╯°□°)╯︵ ┻━┻ with dignity

  • 1. ODE TO A SHIPPING LABEL
       by Carlos Bueno

       Once there was a little o,
       with an accent on top like só.

       It started out as UTF8,
       (universal since '98),
       but the program only knew latin1,
       and changed little ó to "Ã³" for fun.

       A second program saw the "Ã³"
       and said "I know HTML entity!"
       So "Ã³" was smartened to "&Atilde;&sup3;"
       and passed on through happily.

       Another program saw the tangle
       (more precisely, ampersands to mangle)
       and thus the humble "&Atilde;&sup3;"
       became "&amp;Atilde;&amp;sup3;"
  • 2. Character Encoding & Unicode: How to (╯°□°)╯︵ ┻━┻ with dignity. Esther Nam & Travis Fischer, PyCon US 2014, Montréal
  • 8. How to (╯°□°)╯︵ ┻━┻ with dignity
  • 9. “You'll be pleased to know that your talk title crashed our meeting robot, which is a great argument for the relevance of this talk. :-) ...” – Luke Sneeringer | Program Committee Chair
  • 10. Python 3 is out of scope
  • 14.      ASCII      ISO-8859-15 (latin-9)   CP-1252 (Windows 1252)   UTF-8
        a    01100001   01100001                01100001                 01100001
        €    NA         10100100                10000000                 11100010 10000010 10101100
        ¤    NA         NA                      10100100                 11000010 10100100
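The encoding table on slide 14 can be reproduced in a few lines. This is a minimal sketch assuming Python 3 (the slides themselves use Python 2.7), where `str.encode` returns `bytes`. The helper name `bits` is made up for illustration, and the characters are written as escapes so the snippet does not depend on the source file's own encoding.

```python
def bits(char, encoding):
    """Return the encoded bytes of `char` as binary octets, or 'NA' if unencodable."""
    try:
        encoded = char.encode(encoding)
    except UnicodeEncodeError:
        # The character has no representation in this encoding.
        return "NA"
    return " ".join(format(byte, "08b") for byte in encoded)


ENCODINGS = ["ascii", "iso-8859-15", "cp1252", "utf-8"]
CHARS = ["a", "\u20ac", "\u00a4"]  # a, EURO SIGN, CURRENCY SIGN

for char in CHARS:
    # Print the code point rather than the character to stay terminal-safe.
    print("U+%04X" % ord(char), [bits(char, enc) for enc in ENCODINGS])
```

The same byte `10100100` shows up under two different characters, which is the collision the table is pointing at.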
  • 18. π — ‽ ☠ ☁ ☂ ☃ ☄ ★ ☆ ☇ ☈ ☉ ☊ ☋ ☌ ☍ ☎ ☏ ☐ ☑ ☒ ☓ ☘ ☙ ☚ ☛ ☜ ☝ ☞ ☟ ☠ ☡ ☢ ☣ ☤ ☥ ☦ ☧ ☨ ☩ ☪ ☫ ☬ ☭ ☮ ☯ ☸ ☹ ☺ ☻ ☼ ☽ ☾ ☿ ♀ ♁ ♂ ♃ ♄ ♅ ♆ ♇ ♔ ♕ ♖ ♗ ♘ ♙ ♚ ♛ ♜ ♝ ♞ ♟ ♠ ♡ ♢ ♣ ♤ ♥ ♦ ♧ ♨ ♩ ♪ ♫ ♬ ♭ ♯ ♰ ♾ ⚀ ⚁ ⚂ ⚃ ⚄ ⚅ ⚆ ⚇ ⚈
  • 21. a -> U+0061 (Character -> Unicode Code Point)
  • 22. Unicode: a -> U+0061 (Character -> Unicode Code Point)
  • 23. Unicode: a -> U+0061 (Character -> LATIN SMALL LETTER A)
  • 25. Unicode: a -> U+0061 -> 01100001 (Character -> Unicode Code Point -> Binary Encoding)
  • 28. Unicode != UTF-8. Code Points: U+0061. Binary Encoding: 01100001.
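To make the "Unicode != UTF-8" distinction concrete, here is a small sketch, again assuming Python 3 rather than the Python 2.7 used in the slides. It looks up the code point and name of a character, then shows that the same code point yields different bytes under two encodings. UTF-16 is not mentioned in the slides and is included only as a contrasting example.

```python
import unicodedata

char = "a"

code_point = ord(char)                   # the number Unicode assigns to the character
name = unicodedata.name(char)            # the character's official Unicode name
utf8_bytes = char.encode("utf-8")        # one binary encoding of that code point
utf16_bytes = char.encode("utf-16-be")   # a different encoding of the same code point

print("U+%04X" % code_point, name)
print(utf8_bytes, utf16_bytes)
```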
  • 32. • Display (Glyphs | Fonts): Let them eat cake!
        • Text (Unicode | Code Points): U+0061
        • Storage (Binary | UTF-8): 01100001
  • 34. str type [Python 2.7]
        >>> euro_bytestring = '€'
        >>> type(euro_bytestring)
        <type 'str'>
  • 35. unicode type [Python 2.7]
        # € code point
        >>> euro_unicode = u'\u20ac'
        >>> type(euro_unicode)
        <type 'unicode'>
  • 41. You CANNOT infer an encoding from a bytestring
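A short illustration of why the encoding cannot be inferred from the bytes alone, assuming Python 3: the same single byte decodes without error under two of the encodings from the earlier table and produces two different characters.

```python
raw = b"\xa4"  # one byte, with no label saying how it was encoded

as_latin9 = raw.decode("iso-8859-15")  # EURO SIGN under latin-9
as_cp1252 = raw.decode("cp1252")       # CURRENCY SIGN under CP-1252

# Both decodes succeed, so nothing in the bytes says which reading is right.
print("U+%04X" % ord(as_latin9), "U+%04X" % ord(as_cp1252))
```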
  • 42. #! /usr/bin/python
        # -*- coding: utf8 -*-

        # Opened file should be latin-1 encoded
        # If it’s not, call tech support ASAP
        with open("input_file.csv") as input_file:

        Date: Wed, 11 Apr 2014 11:15:55 -0600
        To: foo@bar.com
        From: bar@foo.com
        Subject: Character encoding
        MIME-Version: 1.0
        Content-Type: text/html; charset="UTF-8"

        <?xml version="1.0" encoding="UTF-8" ?>
        <!DOCTYPE html PUBLIC “-//W3C//DTD …>
        <html xmlns="http://www.w3.org/1999/xhtml" …>
  • 47. Author           Review
        G. van Rossum    If you decide to design your own car there are thousands sort of car…
        R. Ebert         Every great car should feel new every time you drive it.
        L. Torvalds      Volvo isn’t evil, they just make really crappy cars.

        Application Processes Text
        PSQL
  • 52. Sample Review Text: My friend said: “I cannot believe this is a Volvo! I had a car just like this when I lived in Montreal.” He told me he had paid 9400€ for his.
  • 55. [Python 2.7]
        # -*- coding: utf-8 -*-

        reviews_file = open("reviews_file.csv")

        for row_text in reviews_file:
            author, date, review_text = row_text.split(",")
            converted_review = review_text.replace("Montreal", "Montréal")
            DB.insert(ReviewTable, author, date, converted_review)
  • 63. Output from UTF-8 encoded PSQL database: My friend said: �I cannot believe this is a Volvo! I had a car just like this when I lived in Montréal.� He told me he had paid 9400� for his.
  • 72. Original CP-1252 Data: My friend said: “I cannot believe this is a Volvo! I had a car just like this when I lived in Montreal.” He told me he had paid 9400€ for his.
  • 73. Mixed CP-1252 & UTF-8: My friend said: “I cannot believe this is a Volvo! I had a car just like this when I lived in Montréal.” He told me he had paid 9400€ for his.
  • 74. Interpreted as UTF-8 by database: My friend said: �I cannot believe this is a Volvo! I had a car just like this when I lived in Montréal.� He told me he had paid 9400� for his.
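The three stages on slides 72 to 74 can be reproduced with plain byte operations. This is a sketch assuming Python 3, with a shortened, made-up sample sentence. Using `errors="replace"` to stand in for what the database or display layer did is also an assumption, chosen because it yields the same � characters shown on the slide.

```python
# Stage 1: the file's bytes are CP-1252 (curly quotes are 0x93/0x94, the euro sign is 0x80).
original = "\u201cI lived in Montreal.\u201d 9400\u20ac".encode("cp1252")

# Stage 2: the script's replacement literal is UTF-8 (per its coding declaration),
# so UTF-8 bytes are spliced into otherwise CP-1252 data.
mixed = original.replace(b"Montreal", "Montr\u00e9al".encode("utf-8"))

# Stage 3: a UTF-8 reader accepts the spliced-in bytes but not the CP-1252 ones,
# which come out as U+FFFD REPLACEMENT CHARACTER.
shown = mixed.decode("utf-8", errors="replace")

print(ascii(shown))
```

Only the text the script touched survives, because it is the only part that really is UTF-8.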
  • 78. [Python 2.7]
        # -*- coding: utf-8 -*-

        reviews_file = open("reviews_file.csv")

        for row_text in reviews_file:
            unicode_row = row_text.decode()
            author, date, review_text = unicode_row.split(",")
            converted_review = review_text.replace("Montreal", "Montréal")
            DB.insert(ReviewTable, author, date, converted_review)
  • 79. [Python 2.7]
        # -*- coding: utf-8 -*-

        reviews_file = open("reviews_file.csv")

        for row_text in reviews_file:
            unicode_row = row_text.decode()
            author, date, review_text = unicode_row.split(u",")
            converted_review = review_text.replace(u"Montreal", u"Montréal")
            DB.insert(ReviewTable, author, date, converted_review)
  • 81. Traceback (most recent call last):
          File "...", line ..., in <module>
            unicode_row = row_text.decode()
        UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 31: ordinal not in range(128)
  • 85. [Python 2.7]
        # -*- coding: utf-8 -*-

        reviews_file = open("reviews_file.csv")

        for row_text in reviews_file:
            unicode_row = row_text.decode("cp1252")
            author, date, review_text = unicode_row.split(u",")
            converted_review = review_text.replace(u"Montreal", u"Montréal")
            DB.insert(ReviewTable, author, date, converted_review)
  • 87. [Python 2.7]
        # -*- coding: utf-8 -*-

        reviews_file = open("reviews_file.csv")

        for row_text in reviews_file:
            unicode_row = row_text.decode("cp1252")
            author, date, review_text = unicode_row.split(u",")
            converted_review = review_text.replace(u"Montreal", u"Montréal")
            DB.insert(ReviewTable,
                      author.encode("utf8"),
                      date.encode("utf8"),
                      converted_review.encode("utf8"))
  • 89. My friend said: “I cannot believe this is a Volvo! I had a car just like this when I lived in Montréal.” He told me he had paid 9400€ for his.
  • 91. Decode as early as possible. Unicode everywhere in the middle. Encode as late as possible.
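The "decode early, Unicode in the middle, encode late" rule can be sketched as a single function. This version assumes Python 3 (the slides use Python 2.7) and uses an in-memory byte stream in place of `reviews_file.csv`. The function name `convert_reviews`, the sample row, and the `split(",", 2)` call (which keeps commas inside the review text intact) are illustrative choices, not taken from the talk.

```python
import io


def convert_reviews(raw_lines, source_encoding="cp1252", target_encoding="utf-8"):
    """Decode each input line, process it as text, and encode the fields for output."""
    rows = []
    for raw_line in raw_lines:
        # Bottom slice of the sandwich: bytes -> text, as early as possible.
        text = raw_line.decode(source_encoding)
        author, date, review = text.rstrip("\r\n").split(",", 2)
        # The filling: every operation here works on text, never on bytes.
        review = review.replace("Montreal", "Montr\u00e9al")
        # Top slice: text -> bytes, only at the output boundary.
        rows.append(tuple(field.encode(target_encoding)
                          for field in (author, date, review)))
    return rows


# Hypothetical CP-1252 input standing in for the CSV file.
sample = io.BytesIO(
    "A. Reviewer,2014-04-11,\u201cI lived in Montreal.\u201d 9400\u20ac\n".encode("cp1252")
)
rows = convert_reviews(sample)
print(rows)
```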
  • 95. Best Practice #3: Test Your (Text Related) Code
  • 96. Test encoding ranges & boundaries
        test_strings = ['Hello Montreal!', '¡‫ן‬ɐǝɹʇuoɯ o‫ןן‬ǝɥ', 'ђєɭɭ๏ ๓๏ภՇгєค !']

        func_under_test(test_strings)
  • 97. Test interfaces against both Python text types
        test_bytes = 'I am a bytestring mwahaha'
        test_unicode = u'ι αм υηι¢σ∂є!'

        i_expect_unicode(test_bytes)
        i_expect_bytes(test_unicode)
  • 99. Test handling of incorrect encoding
        utf8_str = u'UՇF-8 ՇєsՇ'.encode('utf8')

        with assertRaises(UnicodeDecodeError):
            line = ascii_handling_function(utf8_str)
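The testing advice on slides 96 to 99 might look like this as a runnable `unittest` case. This is a sketch assuming Python 3. `ascii_handling_function` is named on the slide but never shown, so the body below is a hypothetical stand-in that simply decodes ASCII bytes.

```python
import unittest


def ascii_handling_function(data):
    """Hypothetical function under test: accepts only ASCII-encoded bytes."""
    return data.decode("ascii")


class TextHandlingTests(unittest.TestCase):
    def test_ascii_bytes_are_decoded(self):
        self.assertEqual(ascii_handling_function(b"Hello Montreal!"), "Hello Montreal!")

    def test_wrongly_encoded_bytes_are_rejected(self):
        utf8_bytes = "Montr\u00e9al".encode("utf-8")
        with self.assertRaises(UnicodeDecodeError):
            ascii_handling_function(utf8_bytes)

    def test_text_passed_where_bytes_expected(self):
        # Python 3's str has no decode(), so the wrong text type fails immediately.
        with self.assertRaises(AttributeError):
            ascii_handling_function("already text")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TextHandlingTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```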
  • 100. Best Practices
         1. Know your encodings
         2. Use the Unicode sandwich
         3. Test your (text related) code
  • 103. Author           Review
         G. van Rossum    If you decide to design your own car there are thousands sort of car…
         R. Ebert         Every great car should feel new every time you drive it.
         L. Torvalds      Volvo isn’t evil, they just make really crappy cars.

         Application Processes Text
         PSQL
  • 104. Declared as “CP-1252”; is actually “UTF-8”
  • 105. # -*- coding: utf-8 -*-

         reviews_file = open("reviews_file.csv")

         for row_text in reviews_file:
             unicode_row = row_text.decode("cp1252")
             author, date, review_text = unicode_row.split(u",")
             converted_review = review_text.replace(u"Montreal", u"Montréal")
             DB.insert(ReviewTable,
                       author.encode("utf8"),
                       date.encode("utf8"),
                       converted_review.encode("utf8"))
  • 109. How to Deal
         • Ask
         • Guess (with chardet library)
         • You wrote tests, right?
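When nobody can be asked, guessing comes down to trying candidates. The sketch below is not chardet and makes no statistical guess. It is a standard-library stand-in, assuming Python 3, that returns the first candidate encoding under which the bytes decode cleanly. The function name and the candidate order are illustrative assumptions.

```python
def guess_decode(raw, candidates=("ascii", "utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes without error.

    Strict encodings come first: permissive single-byte codecs accept almost
    any input, so a successful decode there proves very little.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")


print(guess_decode("Montr\u00e9al".encode("utf-8"))[1])
print(guess_decode("9400\u20ac".encode("cp1252"))[1])
```

A clean decode only means the bytes are valid in that encoding, not that the result is what the author wrote, which is why the slide's last bullet still applies.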
  • 112. MOJIBAKE
         John Smithâ€™s Autoplex
         Broken text&hellip; it&#x2019;s fantastic!
         Hello ^[[30m; World
  • 114. u"John Smith’s Autoplex" ! >>>u'John Smith’sAutoplex'.encode('cp1252')
  • 115. u"John Smithâ€™s Autoplex"
      >>> u'John Smithâ€™s Autoplex'.encode('cp1252')
      'John Smith\xe2\x80\x99s Autoplex' (bytestring)
  • 119. 'John Smith\xe2\x80\x99s Autoplex' (bytestring)
      >>> 'John Smith\xe2\x80\x99s Autoplex'.decode('utf8')
      u'John Smith’s Autoplex'
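The two steps on these slides (encode the mojibake back to CP-1252 bytes, then decode those bytes as UTF-8) can be combined into one helper. This is a sketch in Python 3 under the assumption that the damage came from exactly one wrong CP-1252 decode; `fix_cp1252_mojibake` is a hypothetical name.

```python
def fix_cp1252_mojibake(text):
    """Undo one round of 'UTF-8 bytes decoded as CP-1252'.

    Text that does not fit that pattern is returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside CP-1252, or the bytes
        # are not valid UTF-8: it was probably not this kind of mojibake.
        return text


# "â€™" written with escapes so the example survives any re-encoding.
broken = "John Smith\u00e2\u20ac\u2122s Autoplex"
fixed = fix_cp1252_mojibake(broken)
print(fixed)  # John Smith’s Autoplex
```

Already-correct text such as `"John Smith’s Autoplex"` passes through untouched, because its CP-1252 byte for `’` is not valid UTF-8.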
  • 123. str_dealer = u"John Smithâ€™s Autoplex"

      def manually_convert_encoding(str_dealer):
          """
          Manually replace incorrect, UTF8-encoded bytes
          with CP1252 bytes for the same character.
          Takes and returns a bytestring.
          """
          # str.replace returns a new string, so keep each result
          str_dealer = str_dealer.replace('\xe2\x80\x98', '\x91')  # ‘
          str_dealer = str_dealer.replace('\xe2\x80\x99', '\x92')  # ’
          str_dealer = str_dealer.replace('\xe2\x80\x9c', '\x93')  # “
          str_dealer = str_dealer.replace('\xe2\x80\x9d', '\x94')  # ”
          str_dealer = str_dealer.replace('\xe2\x80\x94', '\x97')  # —
          str_dealer = str_dealer.replace('\xe2\x84\xa2', '\x99')  # ™
          str_dealer = str_dealer.replace('\xe2\x82\xac', '\x80')  # €
          return str_dealer
  • 124. python-ftfy fixes mojibake
      dealer_name = u"John Smithâ€™s Autoplex"
      >>> from ftfy import fix_text
      >>> fix_text(dealer_name)
      u"John Smith's Autoplex"
  • 127. >>> u'☃ Brrrr!'.encode('cp1252', 'strict')
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/Users/esther/ENV/lib/python2.7/encodings/cp1252.py", line 12, in encode
          return codecs.charmap_encode(input,errors,encoding_table)
      UnicodeEncodeError: 'charmap' codec can't encode character u'\u2603' in position 0: character maps to <undefined>
      [Python 2.7]
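`'strict'` is only one of the error handlers the codec machinery accepts. The sketch below (Python 3 assumed; `encode_with` is a hypothetical helper) shows what the other standard handlers do with the same snowman, which has no CP-1252 equivalent.

```python
snowman = "\u2603 Brrrr!"


def encode_with(handler):
    """Encode the snowman string to CP-1252 using the given error handler."""
    return snowman.encode("cp1252", handler)


try:
    encode_with("strict")
    strict_failed = False
except UnicodeEncodeError:
    # 'strict' (the default) refuses to lose data.
    strict_failed = True

print(encode_with("replace"))            # b'? Brrrr!'
print(encode_with("ignore"))             # b' Brrrr!'
print(encode_with("xmlcharrefreplace"))  # b'&#9731; Brrrr!'
print(encode_with("backslashreplace"))   # b'\\u2603 Brrrr!'
```

Every handler other than `'strict'` either discards information or changes the text, so the choice depends on whether the consumer of the bytes can tolerate that.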
  • 131. Thank you ツ
      Cars.com / NewCars.com Tech Team
      SoCal Piggies
      Ned Batchelder (for his Pragmatic Unicode talk)
  • 132. The fundamentals
      Pragmatic Unicode: http://nedbatchelder.com/text/unipain.html
      The Absolute Minimum You Must Know: http://www.joelonsoftware.com/articles/Unicode.html
      Chapter on Strings in “Dive into Python” by Mark Pilgrim: http://getpython3.com/diveintopython3/strings.html
      General questions, relating to UTF or Encoding Form: http://www.unicode.org/faq/utf_bom.html
      Unicode HOWTO (Python 2.7): http://docs.python.org/2/howto/unicode.html
  • 133. Further reading
      “Just what the dickens is ‘Unicode’?”: https://pythonhosted.org/kitchen/unicode-frustrations.html
      Differences between these commonly confused encodings: http://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html
      “Latin-1” in MySQL is more like “CP-1252”: https://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html
      Why it's important to write tests with character boundary values: http://labs.spotify.com/2013/06/18/creative-usernames/
  • 135. @estherbester @travisfischer Slides at http://bit.ly/flip_tables IRC