Unicode and Character Sets
"The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"
- Joel Spolsky, co-founder of Stack Overflow and author of "More Joel on Software"
To a person: A
To a computer: 0100 0001
ASCII: 7 bits, codes 0~127 (printable characters are 32~126)
8 bits: ISO-8859-1, ISO-8859-2, ISO-8859-3, ... up to ISO-8859-16
In ISO-8859-1, 0xC0 is À
In ISO-8859-7, 0xC0 is ΐ
The same octet has different meanings in different charsets!
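A minimal sketch of the point above, using Python's built-in codecs (the variable names are illustrative only): the single octet 0xC0 decodes to a different character depending on which charset is assumed.

```python
# One octet, two charsets, two different characters.
octet = bytes([0xC0])

as_latin1 = octet.decode("iso-8859-1")  # Western European
as_greek = octet.decode("iso-8859-7")   # Greek

print(repr(as_latin1), repr(as_greek))
```

The octet itself never changes; only the table used to interpret it does.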
Unicode
Not an encoding by itself: it assigns a code point to every character in every writing system
A -> U+0041
http://www.unicode.org/charts/
How is Unicode represented in a computer?
UCS-2 (UTF-16)
A -> U+0041 -> 0x00 0x41
PROS:
  Maps code points U+0000~U+FFFF directly to two octets
CONS:
  Incompatible with ASCII
  Wastes memory when the code point is <= U+007F
  UCS-2 cannot represent code points > U+FFFF (UTF-16 handles them with surrogate pairs, four octets each)
UCS-4 (UTF-32)
A -> U+0041 -> 0x00 0x00 0x00 0x41
PROS:
  Maps every code point directly to four octets (the 32-bit range is U+00000000~U+FFFFFFFF; Unicode itself stops at U+10FFFF)
CONS:
  Incompatible with ASCII
  Wastes a huge amount of memory
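The fixed-width encodings above can be checked with Python's codecs. This is a sketch only; the big-endian variants are chosen here so the octets match the slides, and U+1F600 is simply an arbitrary code point above U+FFFF.

```python
# "A" in the two fixed-width encodings (big-endian, no byte-order mark).
utf16 = "A".encode("utf-16-be")
utf32 = "A".encode("utf-32-be")
print(utf16.hex(), utf32.hex())

# A code point above U+FFFF does not fit in one 16-bit unit:
# UTF-16 spends two units (a surrogate pair) on it.
beyond_bmp = "\U0001F600"
utf16_pair = beyond_bmp.encode("utf-16-be")
print(utf16_pair.hex(), len(utf16_pair))
```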
UTF-8
  Code point range    Octets
  0000 ~ 007F         0xxxxxxx
  0080 ~ 07FF         110xxxxx 10xxxxxx
  0800 ~ FFFF         1110xxxx 10xxxxxx 10xxxxxx
A  => U+0041 => 1000001 => 01000001 => 0x41
神 => U+795E => 1111001 01011110 => 11100111 10100101 10011110 => 0xE7 0xA5 0x9E
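The table can be turned into a few lines of bit arithmetic. The function below is a hypothetical helper written for illustration (`utf8_encode` is not a library call); it covers only the three ranges listed above and does not reject surrogates, which a real encoder would.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point using the three-row table above."""
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    raise ValueError("only code points up to U+FFFF are handled in this sketch")


print(utf8_encode(0x0041).hex())
print(utf8_encode(0x795E).hex())
```

Comparing the result with Python's own `str.encode("utf-8")` is a quick way to confirm the bit layout.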
UTF-8
PROS:
  Compatible with ASCII
  Can map all code points to octets
CONS:
  The algorithm is a little complicated
"It does not make sense to have a string without knowing what encoding it uses." - Joel Spolsky
Programs communicate with each other through octet streams.
A sends B: E7 A5 9E E9 A9 AC 3F
A should tell B that the octets are encoded in UTF-8. Then B can work out that the received message is "神马?"
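As a sketch of what B faces, the same seven octets can be decoded under two different assumptions. The variable names are illustrative; ISO-8859-1 stands in for "any wrong guess".

```python
# The octet stream A sends to B.
octets = bytes.fromhex("E7A59EE9A9AC3F")

right = octets.decode("utf-8")       # B was told the charset
wrong = octets.decode("iso-8859-1")  # B guessed a single-byte charset

print(len(octets), len(right), len(wrong))
print(right)
```

Both decodes succeed without an error, which is why the charset has to be communicated rather than detected.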
Charsets in Perl
Two ways to get a string in Perl:
  Literal string
  From I/O
A literal string depends on the encoding of your source code.
    # source file encoded as UTF-8
    my $a1 = "神马?";
    my $a2 = "\xE7\xA5\x9E\xE9\xA9\xAC\x3F";
    my $a3 = <FH>;
Either way, in Perl's eyes it is a string of 7 octets. Is it ISO-8859-1 or UTF-8?
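By default, Perl treats it as just a sequence of octets:
    # source file encoded as UTF-8
    my $a1 = "神马?";
    print length($a1);  # output is 7
How to make Perl treat it as a sequence of characters? Use any one of the following:
    # source file encoded as UTF-8
    use Encode;
    my $a1 = "神马?";
    $a1 = Encode::decode_utf8($a1);        # option 1: returns the decoded string
    # $a1 = Encode::decode("utf8", $a1);   # option 2: same, with the charset named
    # Encode::_utf8_on($a1);               # option 3: only flips the flag, no validation
    print length($a1);  # output is 3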
What has happened inside?
  1. Decode the sequence of octets into code points, as UTF-8 (or another charset)
  2. Encode the code points into Perl's internal format (utf8)
  3. Turn the string's UTF8 flag ON
According to the UTF8 flag, Perl then treats the string as a sequence of characters.
UTF-8? utf8? UTF8?
UTF-8: the standard encoding, designed by Ken Thompson and Rob Pike
utf8: Perl's internal encoding, a superset of UTF-8
UTF8: the name of the flag that indicates whether Perl should treat the string as a sequence of characters
More Examples
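    # source file encoded as UTF-8
    use Encode;
    use Devel::Peek;
    Dump("神"); Dump("\xE7\xA5\x9E");                              # octet strings
    Dump("\x{795E}"); Dump(Encode::decode_utf8("\xE7\xA5\x9E"));  # character strings
    Dump("\x{795E}" . "神");                                       # character string . octet string
Output (abridged):
    FLAGS = <PADMY,POK,pPOK>
    PV = 0x16189d8 "\347\245\236"
    FLAGS = <PADMY,POK,pPOK,UTF8>
    PV = 0x2e7478 "\347\245\236" [UTF8 "\x{795e}"]
    FLAGS = <PADMY,POK,pPOK,UTF8>
    PV = 0x2e74d8 "\347\245\236\303\247\302\245\302\236" [UTF8 "\x{795e}\x{e7}\x{a5}\x{9e}"]
When an octet string is concatenated with a character string, Perl upgrades each octet as if it were an ISO-8859-1 character, so the octet \xE7 becomes the two octets \303\247:
    \303\247 = 11000011 10100111
    \x{e7}   = 11100111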
Convert "神" from UTF-8 to GBK
  神  E7A59E (UTF-8 encoded), UTF8 flag = off
    | decode
  神  U+795E (Unicode), stored as E7A59E (utf8 encoded), UTF8 flag = on
    | encode
  神  C9F1 (GBK encoded), UTF8 flag = off
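The same decode-then-encode pipeline, sketched in Python rather than Perl for comparison. Python has no UTF8 flag; its `bytes` and `str` types play the roles of "flag off" and "flag on". Variable names are illustrative.

```python
# Octets as received: UTF-8 encoded, not yet characters.
utf8_octets = bytes.fromhex("E7A59E")

# decode: octets -> code points (one character, U+795E)
text = utf8_octets.decode("utf-8")

# encode: code points -> octets in the target charset
gbk_octets = text.encode("gbk")

print(hex(ord(text)), gbk_octets.hex())
```

Converting between two charsets always goes through code points; there is no direct octet-to-octet shortcut in this model.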
Charsets in MySQL
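Charset defaults are inherited: server -> database -> table
    CREATE TABLE XXX
    ……
    ……
    ……
    DEFAULT CHARSET = utf8
MySQL spells the charset utf8 (or utf8mb4, which also covers code points above U+FFFF), not UTF-8.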
SET NAMES X is equivalent to:
    SET CHARACTER_SET_CLIENT = X;
    SET CHARACTER_SET_CONNECTION = X;
    SET CHARACTER_SET_RESULTS = X;
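Two clients talking to one MySQL server (data stored as UTF-8), with Connection_charset = shiftJIS:
Shell (UTF-8)
  Client_charset = UTF-8, Results_charset = UTF-8
  Query:  UTF-8 -> shiftJIS (client to connection), then shiftJIS -> UTF-8 (connection to storage)
  Result: UTF-8 <- UTF-8
Perl (euc-jp)
  Client_charset = euc-jp, Results_charset = euc-jp
  Query:  euc-jp -> shiftJIS (client to connection), then shiftJIS -> UTF-8 (connection to storage)
  Result: euc-jp <- UTF-8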
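The chain of conversions can be simulated outside MySQL. This is a rough sketch, not MySQL's implementation: the `transcode` helper and the sample text are hypothetical, and Python's `shift_jis` and `euc_jp` codecs stand in for the server's charsets.

```python
def transcode(octets: bytes, source: str, target: str) -> bytes:
    """Re-encode octets from one charset to another via code points."""
    return octets.decode(source).encode(target)


text = "日本語"  # sample text, representable in all three charsets

# Shell client sends UTF-8; the connection charset is Shift JIS; storage is UTF-8.
from_shell = text.encode("utf-8")
on_connection = transcode(from_shell, "utf-8", "shift_jis")
stored = transcode(on_connection, "shift_jis", "utf-8")

# The Perl client asks for results in EUC-JP.
result_for_perl = transcode(stored, "utf-8", "euc_jp")

print(stored == from_shell, result_for_perl.decode("euc_jp") == text)
```

The round trip is lossless here only because every character exists in each charset along the way; a character missing from the connection charset would be lost before it reached storage.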
Q & A
Thank you!
