Unicode and Legacy Representations of Emoji
IUC 36
David Yonge-Mallo, i18n Engineer, Google
Oct. 24, 2012
ver. 2012-10-23 14:00
"Bit rot"
09:15-10:00

Presenter:
Dr. Vinton G. Cerf
Vice President and
Chief Internet
Evangelist, Google

KEYNOTE PRESENTATION "Bit Rot" – A Disaster Waiting to Happen

Dr. Cerf will discuss the problem of curating digital
content on the order of centuries. Unicode has a role to
play although there are very complex issues relating to
format and structure of digital objects, interpretation of
content, intellectual property management, perhaps
even patents and other legal framework questions. The
problems are both technical and legal.
Outline
● A brief history of emoji
● Encoding: Shift JIS and Unicode
● Mapping and unification
● Emoji in Unicode 6
● Problems:
○ variation selectors
○ regional indicators
○ counting
● Best practices
Emoji down the ages
What if you were tasked with preserving the following texts
to be passed down for posterity?
awesome! :-)

yay! ☺

i know how much you [emoji image not preserved] hiking
What is an emoji (絵文字)?
絵文字 = picture (絵) + character/letter (文字)
What are they?
● pictures (representational)
● includes facial expressions (smileys)
○ but not restricted to them
● stored and transmitted as encoded characters
○ used in email and SMS
History:
● popularised on Japanese mobile devices
● extension of Japanese character sets
● carrier-specific standards
"Early" history in Japan
Three major cell phone operators supported emoji:
● NTT DoCoMo
● au/EZweb by KDDI
● SoftBank
Problems:
● each operator had its own set of emoji
● they were encoded differently
● no interoperability between them
Examples of emoji

Above: DoCoMo emoji palette
Right: DoCoMo Foma P902i, c. 2005
Examples of emoji
Subset of KDDI emojis:

Subset of SoftBank emojis:
Number of supported emoji

Source: Emoji in Unicode, IUC 33
Encoding - Shift JIS
This is one of the most popular encodings for Japanese.
The "JIS" part refers to Japanese Industrial Standards.
ISO-2022-JP is also known as the "JIS" encoding.
The "shift" part comes from how the double-byte characters
are encoded.
0x00 - 0x7F : matches ASCII (except for 2 characters)
0x81 - 0x9F : first byte of a double-byte character
0xA1 - 0xDF : half-width katakana
0xE0 - 0xEF : first byte of a double-byte character
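
The byte ranges above can be sketched as a small classifier. This is a minimal illustration in Python, not part of the original slides; the function name `classify_shift_jis_byte` is hypothetical, and it covers only the ranges listed on this slide.

```python
def classify_shift_jis_byte(b: int) -> str:
    """Classify one byte using only the Shift JIS ranges listed on the slide."""
    if 0x00 <= b <= 0x7F:
        return "single-byte (ASCII-like)"
    if 0xA1 <= b <= 0xDF:
        return "single-byte (half-width katakana)"
    if 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF:
        return "lead byte of a double-byte character"
    return "other (outside the ranges listed above)"


# Hiragana A, half-width katakana A, and ASCII "A", encoded with Python's codec.
encoded = "\u3042\uff71A".encode("shift_jis")
kinds = [classify_shift_jis_byte(encoded[0]),   # lead byte of the hiragana
         classify_shift_jis_byte(encoded[2]),   # the half-width katakana byte
         classify_shift_jis_byte(encoded[3])]   # the ASCII letter

# A lead byte such as 0xF8 is not covered by the standard ranges on this slide.
vendor_kind = classify_shift_jis_byte(0xF8)
```

Note that 0xF8 falls outside the listed ranges, which is why carrier emoji values starting with that byte need carrier-specific handling.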
Encoding - Shift JIS

Source: modified from Wikipedia
Encoding - Unicode PUA
Unicode has a number of private use areas (PUAs).
PUA range in the Basic Multilingual Plane (BMP):
0xE000 - 0xF8FF
Supplementary PUA-A:
0xF0000 - 0xFFFFD
Supplementary PUA-B:
0x100000 - 0x10FFFD
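
A minimal sketch of a private-use check, assuming the three PUA ranges defined by Unicode (Supplementary PUA-A ends at U+FFFFD because the last two code points of every plane are noncharacters). The helper name `is_private_use` is hypothetical.

```python
import unicodedata

# The three private use areas defined by Unicode.
PUA_RANGES = [
    (0xE000, 0xF8FF),      # BMP private use area
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B
]


def is_private_use(cp: int) -> bool:
    """Return True if the code point lies in one of the PUA ranges."""
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)


# Cross-check one value against the general category "Co" (private use).
category_of_e63e = unicodedata.category(chr(0xE63E))
```

Text containing private-use code points has no standard meaning, so a check like this is one way to flag data that still needs converting.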
Encoding is carrier-specific
Each carrier used different values to encode emoji. For
example...
NTT DoCoMo:
● Shift JIS: 0xF89F - 0xF9FC
● Unicode: 0xE63E - 0xE757 (BMP PUA)
● JIS points for e-mail
... and similarly for the other two carriers.
Mojibake (文字化け)
Mojibake is what happens when encoded text is displayed
using the wrong encoding.

Sent:

Displayed:
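
The sent and displayed images are not preserved here. As a stand-in, the effect can be reproduced with a short sketch: encode text as Shift JIS, then decode the same bytes with the wrong encoding. The choice of Latin-1 and UTF-8 as the "wrong" encodings is an assumption for illustration.

```python
original = "\u7d75\u6587\u5b57"  # the word "emoji" written in kanji

# The sender encodes the text as Shift JIS.
sent = original.encode("shift_jis")

# A receiver that assumes Latin-1 gets one unrelated character per byte.
displayed_latin1 = sent.decode("latin-1")

# A receiver that assumes UTF-8 hits invalid bytes; with errors="replace"
# those become U+FFFD REPLACEMENT CHARACTER.
displayed_utf8 = sent.decode("utf-8", errors="replace")

# Decoding with the encoding that was actually used recovers the text.
recovered = sent.decode("shift_jis")
```

The bytes never change; only the interpretation does, which is why mojibake can often be reversed if the original encoding is known.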
Carrier-to-carrier mapping
[Figure: emoji mapping across SoftBank, Disney, au by KDDI, and DoCoMo]
Source: SoftBank
Emoji support spreads...
Emoji began to be supported in web mail and other
devices:
● Yahoo! Japan Web Mail (2006)
● Gmail (2008)
● iPhone 2.2 (2008)
● Android apps (2009)
Google emoji
Provides a unified representation of the three emoji sets:
● union of all the emoji characters
● cross-mapping
○ combine same character
○ a few dozen: existing Unicode
● about 700 new characters
○ using PUA
○ outside BMP (U+FExxx)
Idea:
● support legacy systems by converting between other encodings and Unicode
[Diagram labels: KDDI, SoftBank, DoCoMo]
Google PUA mapping table
Converting at boundaries
Gmail
(Google PUA)

KDDI

DoCoMo
SoftBank
Convert to/from Unicode
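
Converting at the boundary can be sketched as a table lookup in each direction. This is a minimal sketch, not Gmail's implementation: the table holds only a few illustrative entries (believed to match the carriers' "sun" and "boy" emoji, but to be verified against the real mapping tables), and the names `to_unicode` and `from_unicode` are hypothetical.

```python
# Illustrative entries only: carrier PUA code point -> unified code point.
TO_UNICODE = {
    "docomo": {0xE63E: 0x2600},
    "softbank": {0xE04A: 0x2600, 0xE001: 0x1F466},
}

# Reverse tables for outbound conversion.
FROM_UNICODE = {
    carrier: {uni: pua for pua, uni in table.items()}
    for carrier, table in TO_UNICODE.items()
}

# Every unified emoji code point known to any carrier table.
KNOWN_EMOJI = {uni for table in TO_UNICODE.values() for uni in table.values()}


def to_unicode(text: str, carrier: str) -> str:
    """Inbound: replace carrier code points with unified ones."""
    table = TO_UNICODE[carrier]
    return "".join(chr(table.get(ord(ch), ord(ch))) for ch in text)


def from_unicode(text: str, carrier: str, fallback: str = "[?]") -> str:
    """Outbound: map to the carrier's code points, masking unmappable emoji."""
    table = FROM_UNICODE[carrier]
    out = []
    for ch in text:
        cp = ord(ch)
        if cp in table:
            out.append(chr(table[cp]))
        elif cp in KNOWN_EMOJI:
            out.append(fallback)  # emoji the target carrier cannot represent
        else:
            out.append(ch)
    return "".join(out)
```

Keeping one internal representation means each new carrier needs one table rather than a table per pair of carriers. Note that the fallback changes the length of the text.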
Emoji in Gmail
Uses mapping table to convert
between PUA and carrier encoding.
Display emoji using images. In some
places, "[?]" is displayed.
Right: mobile Gmail on iPhone
Below: desktop Gmail compose window
Making it official
In 2007, the Unicode Technical Committee agreed to
encode most of the emoji characters, for the purpose of
interoperability between systems.
Unicode proposals (joint effort by Google and Apple) 2009:
● N3582 "Proposal for Encoding Emoji Symbols"
● N3583 "Emoji Symbols Proposed for New Encoding"
Authors:
● Markus Scherer, Mark Davis, Kat Momoi, Darick Tong
(Google)
● Yasuo Kida, Peter Edberg (Apple)
The Proposal

Source: N3583 "Emoji Symbols Proposed for New Encoding"
Emoji in Unicode 6
Goal:
● Encode superset of emoji in Unicode, allowing for
roundtrip and fallback mappings
Restrictions:
● Source separation rule (strict rule)
● Reuse existing Unicode symbols
● Separate generic symbols
● Abstract characters (no specific colours or animation)
● Unify semantically identical symbols, but:
disunify visually similar but semantically different
symbols
● Unify Unicode with least-marked most-common symbol
Source: Unicode Technical Committee Subcommittee on Encoding of Symbols
Proposal accepted
In 2010, the new emoji were accepted into Unicode 6.
These consisted of:
● 625 emoji new 1:1 to Unicode 6
● 103 emoji unified 1:1 with existing characters
● 11 keycaps represented as [0-9#] followed by 'keycap'
● 10 new 'flag' emojis represented as sequences
● 65 emoji logos were not added
In addition, Unicode 6 added many other symbols which
are similar in nature to emoji, such as playing cards, plants,
and transportation symbols.
Unified and new emoji
Unified emoji:

New emoji:
New problems introduced
Since Gmail was already using the unified PUA, it looks like
all that needs to be done to bring it up to spec is to replace
the PUA code points with the official ones...
Not so fast -- it's not that simple!
Recall that one of the goals in creating the proposal was:
● Reuse existing Unicode symbols
Also, the new emoji include:
● keycaps and flags represented by sequences of
characters
What could possibly go wrong?
Can you spot the problems?
Variation selectors

Source: Unicode Standardized Variants
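
The figure for this slide is not preserved. As a rough illustration of the issue, the sketch below assumes the standardized variation selectors U+FE0E (text style) and U+FE0F (emoji style) applied to U+263A WHITE SMILING FACE, one of the pre-existing symbols that emoji were unified with. The helper `strip_variation_selectors` is hypothetical.

```python
VS15 = "\ufe0e"  # VARIATION SELECTOR-15: request text presentation
VS16 = "\ufe0f"  # VARIATION SELECTOR-16: request emoji presentation

smiley = "\u263a"          # WHITE SMILING FACE
as_text = smiley + VS15    # same character, text style requested
as_emoji = smiley + VS16   # same character, emoji style requested


def strip_variation_selectors(s: str) -> str:
    """Remove VS15/VS16 so strings can be compared by base characters."""
    return s.replace(VS15, "").replace(VS16, "")


# The selectors are invisible, yet they change equality and length.
same_when_compared_raw = (as_text == as_emoji)
same_after_stripping = (
    strip_variation_selectors(as_text) == strip_variation_selectors(as_emoji)
)
```

Whether stripping is appropriate depends on the use: it helps for search and comparison, but it discards the requested presentation.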
Regional Indicator symbols
The combined carrier emoji contained ten national flags.
(PRC, Germany, Spain, France, UK, Italy, Japan, Korea, Russia, USA)

US proposal (Google and Apple):
● encode as "emoji compatibility symbols"
Germany/Ireland counter-proposal:
● encode 256 characters for ISO 3166 country codes
Compromise:
● encode twenty-six "regional indicator symbols" (A-Z)
● spell out the two-letter country codes
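
Spelling out a country code with regional indicator symbols can be sketched as follows. The twenty-six symbols occupy U+1F1E6 (A) through U+1F1FF (Z); the function name `flag_sequence` is hypothetical.

```python
RI_BASE = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def flag_sequence(country_code: str) -> str:
    """Spell a two-letter country code with regional indicator symbols."""
    code = country_code.upper()
    if len(code) != 2 or not all("A" <= c <= "Z" for c in code):
        raise ValueError("expected a two-letter code such as 'JP'")
    return "".join(chr(RI_BASE + ord(c) - ord("A")) for c in code)


japan = flag_sequence("JP")
usa = flag_sequence("us")
```

Each flag is therefore two supplementary code points, and whether the pair is drawn as a flag or as two letters is up to the renderer.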
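Possible ambiguity
We have "regional indicators" U+1F1E6 (A) to U+1F1FF (Z).

But what if the middle of a string looked like this?
... [run of regional indicator symbols; images not preserved] ...

Is this ... [one pairing] ... or ... [the other pairing] ...?

What about CN/NC, KRUS/RUSK, BB...BBFRUSBB...?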
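
The ambiguity can be made concrete with a short sketch: the same run of regional indicators pairs up differently depending on where reading starts. The helper names `to_ri`, `from_ri`, and `pair_up` are hypothetical.

```python
RI_BASE = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def to_ri(letters: str) -> str:
    """Convert A-Z letters to regional indicator symbols."""
    return "".join(chr(RI_BASE + ord(c) - ord("A")) for c in letters)


def from_ri(text: str) -> str:
    """Convert regional indicator symbols back to A-Z letters."""
    return "".join(
        chr(ord(ch) - RI_BASE + ord("A"))
        for ch in text
        if RI_BASE <= ord(ch) <= RI_BASE + 25
    )


def pair_up(letters: str, offset: int = 0) -> list:
    """Group letters into complete two-letter codes starting at offset."""
    return [letters[i:i + 2] for i in range(offset, len(letters) - 1, 2)]


run = to_ri("KRUS")               # Korea followed by USA
letters = from_ri(run)
from_start = pair_up(letters, 0)  # reading from the start of the run
from_middle = pair_up(letters, 1) # reading from one symbol in
```

Reading from the start gives Korea and USA, while starting one symbol in gives Russia, so code that splits, truncates, or moves a cursor through such a run has to find the start of the run first.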
Be careful how you count!
Counting the wrong thing is a major source of bugs:
● Java's String.length() counts UTF-16 code units, so it miscounts
supplementary code points (UCS-2 vs. UTF-16); use
String.codePointCount() instead
● masking with "[?]" changes the length
● changing encoding changes the length
The above problems existed prior to Unicode 6. But now:
● variation selectors are invisible
● some emoji are represented by sequences (of
supplementary code points)
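
The different counts can be compared side by side. This sketch uses Python rather than Java: Python's `len()` on a string counts code points, and the UTF-16 code unit count (what Java's `String.length()` reports) is derived by encoding. The helper `counts` is hypothetical.

```python
def counts(s: str) -> dict:
    """Return several 'lengths' of the same string."""
    return {
        "code_points": len(s),
        "utf16_units": len(s.encode("utf-16-le")) // 2,
        "utf8_bytes": len(s.encode("utf-8")),
    }


flag_jp = "\U0001F1EF\U0001F1F5"  # two regional indicators, shown as one flag
keycap_1 = "1\u20e3"              # digit + COMBINING ENCLOSING KEYCAP
smiley_vs = "\u263a\ufe0f"        # smiley + invisible variation selector

flag_counts = counts(flag_jp)
keycap_counts = counts(keycap_1)
smiley_counts = counts(smiley_vs)
```

None of these numbers equals the one symbol a user sees, so length limits and cursor movement need to be defined in terms of what is actually being measured.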
Best practices
Strive for the following goals:
● use Unicode encoding rather than Shift JIS or other legacy encodings
● use official Unicode code points instead of PUA
● choose wisely whether to use text or image
● convert to/from Unicode at boundaries
● be aware that Unicode has emoji-like symbols beyond
the Japanese carrier sets, and conversion to the carrier
Shift JIS encodings may not be possible for these
● follow Postel's principle
○ "be liberal in what you accept,
but conservative in what you send"
The End

Thank you!
Q&A
