BSDCONV
Buganini Q
Since 2009
Charset & Encoding
Character Set: collection of characters
Encoding: binary representation
Charset & Encoding
Figure: Character Sets (diagram of character sets by coverage: Unicode (32-bit address space); Unicode up to U+10FFFF; Unicode BMP, the Basic Multilingual Plane, up to U+FFFF; GB18030; CNS11643; CP950; Latin1)
Charset & Encoding
Character set                     Encoding
Unicode (32-bit address space)    UTF-32 / UCS4
Unicode up to U+10FFFF            UTF-8 [1] / UTF-16
Unicode BMP (up to U+FFFF)        UCS2
GB18030                           GB18030
CNS11643                          CNS11643
CP950                             CP950 (DBCS)
Latin1                            ISO-8859-1 / EBCDIC-037 [2]
[1] Could cover more but restricted by RFC 3629
[2] Aka. IBM-37, some control characters are different from ISO-8859-1
Encoding :: UTF-32 / UCS4
Fixed Length
4 bytes
Filesize *= 4 for ASCII text file
Incompatible with C-style string convention
Endianness concern
Encoding :: UCS2
Fixed Length
2 bytes
Filesize *= 2 for ASCII text file
Incompatible with C-style string convention
Endianness concern
BMP-only
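The size and C-string points above can be checked with a small sketch. It uses Python's built-in codecs as a stand-in (an assumption for illustration, not bsdconv itself); `utf-16-be` stands in for UCS2 because the sample text is BMP-only.

```python
text = "bsdconv"  # pure ASCII sample, 7 characters

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-be")  # same bytes as UCS2 for BMP-only text
utf32 = text.encode("utf-32-be")  # UTF-32 / UCS4

# File size doubles (UCS2) or quadruples (UTF-32) for ASCII text.
print(len(utf8), len(utf16), len(utf32))

# Fixed-width encodings embed NUL bytes in ASCII text, so a
# NUL-terminated C string would be cut short.
print(b"\x00" in utf8, b"\x00" in utf16, b"\x00" in utf32)

# Endianness concern: the little-endian form is a different byte string.
print(text.encode("utf-16-le") == utf16)
```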
Encoding :: UTF-16
Variable Length
2 bytes / 4 bytes (Surrogate pairs)
Surrogates
Using U+D800..U+DFFF
Incompatible with C-style string convention
Endianness concern
******** ********
110110** ******** 110111** ********
Table: UTF-16 Structure
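As a sketch of the surrogate-pair layout in the table above: the function name `to_surrogates` is made up for this example, and the result is cross-checked against Python's own `utf-16-be` codec.

```python
import struct
from typing import Tuple

def to_surrogates(cp: int) -> Tuple[int, int]:
    """Split a code point above the BMP into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need surrogates")
    v = cp - 0x10000              # 20 significant bits remain
    high = 0xD800 | (v >> 10)     # 110110** ********  (top 10 bits)
    low = 0xDC00 | (v & 0x3FF)    # 110111** ********  (low 10 bits)
    return high, low

cp = 0x1F600
high, low = to_surrogates(cp)
print(f"U+{cp:X} -> {high:04X} {low:04X}")

# Cross-check against the built-in codec (big-endian, no BOM).
print(struct.pack(">HH", high, low) == chr(cp).encode("utf-16-be"))
```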
Encoding :: UTF-8
Variable Length
1~6 bytes in the original design (RFC 3629 restricts it to 1~4 bytes, up to U+10FFFF)
Compatible with C-style string convention
Self-synchronizing
Endian-neutral
Sorting order = Code point order
0******* (ASCII)
110***** 10******
1110**** 10****** 10******
11110*** 10****** 10****** 10******
111110** 10****** 10****** 10****** 10******
1111110* 10****** 10****** 10****** 10****** 10******
Table: UTF-8 Structure
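The bit patterns in the table can be turned into a minimal encoder. This sketch covers only the 1 to 4 byte forms allowed by RFC 3629; `utf8_encode` is a name made up here, and it does not reject surrogate code points the way a strict encoder would.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point following the UTF-8 structure table."""
    if cp < 0x80:                      # 0*******
        return bytes([cp])
    if cp < 0x800:                     # 110***** 10******
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 1110**** 10****** 10******
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:                 # 11110*** 10****** 10****** 10******
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("beyond U+10FFFF")

for cp in (0x41, 0x3B1, 0x840C, 0x1F600):
    print(f"U+{cp:04X}", utf8_encode(cp).hex(), utf8_encode(cp) == chr(cp).encode("utf-8"))

# Sorting order = code point order: sorting the encoded bytes gives the
# same order as sorting the strings by code point.
words = ["z", "\u00e9", "\u840c", "\U0001F600"]
print(sorted(words) == sorted(words, key=lambda s: s.encode("utf-8")))
```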
Encoding :: CNS11643 (全字庫) #issue
http://www.cns11643.gov.tw/
Only used by Taiwan government
NOT a subset of Unicode
Not just a charset/encoding
Font
Pronunciation ㄇㄥ ˊ / méng
Radical 艸
Component 艹日月
Stroke
Tra/Sim mapping 萌蕄
Table: Examples for some information provided by 全字庫 for「萌」
Encoding :: CCCII
Variants
Variant glyph at different plane
Mostly used for library indexing
強 21 3D 48
彊 2D 3D 48
强 33 3D 48
Encoding :: Big5
Many incompatible variations (abusing the PUA); no standard tool can rule them all
http://moztw.org/docs/big5/
Scenario      Dominating encoding
Microsoft     CP950
Taiwan BBS    UAO (Unicode-at-Once)
gov.tw        Big5-2003
gov.hk        HKSCS (1999, 2001, 2004)
Special characters conflict
The second byte could be 0x5C (\), 0x7C (|), 0x7E (~), which may have special meaning in certain contexts
Bsdconv :: Decoding and Encoding
Alternative to iconv
Conversion: ISO-8859-1:UTF-8 (from : to)
Figure: Basic two-phase conversion
Bsdconv :: Codecs & Fallback
Optionally produce question mark (U+003F) as replacement
Conversion: UTF-8,3F:ASCII,3F (from : to)
Figure: Fallback codec
Transliteration
Conversion: UTF-8:CP936,CP936-TRANS,3F (from : to)
Figure: Multiple fallback codecs
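The fallback idea (try each codec in the comma-separated list in turn, and emit a replacement only when all of them fail) can be imitated in Python. This is an analogy built on Python's standard codecs, not bsdconv's implementation; `encode_with_fallbacks` is a hypothetical helper.

```python
def encode_with_fallbacks(text, codecs, replacement=b"?"):
    """Encode character by character, trying each codec in order."""
    out = bytearray()
    for ch in text:
        for codec in codecs:
            try:
                out += ch.encode(codec)
                break                      # first codec that works wins
            except UnicodeEncodeError:
                continue                   # fall through to the next codec
        else:
            out += replacement             # every codec failed: emit 0x3F
    return bytes(out)

# Single codec plus question-mark fallback, similar in spirit to ASCII,3F.
print(encode_with_fallbacks("na\u00efve \u840c", ["ascii"]))

# Two codecs before the final replacement.
print(encode_with_fallbacks("a\u00e9\u840c", ["ascii", "latin-1"]))
```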
Big5 5C issue (許功蓋)
BIG5:BIG5-5C,BIG5
#               Input               Output
Big5 Literal    "成功"              "成功\"
ASCII/Hex       "\xA6\xA8\xA5\"     "\xA6\xA8\xA5\\"
BIG5-5C,BIG5:BIG5
#               Input               Output
Big5 Literal    "成功\"             "成功"
ASCII/Hex       "\xA6\xA8\xA5\\"    "\xA6\xA8\xA5\"
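To see where the 0x5C comes from, the bytes can be inspected with Python's built-in `big5` codec (used here as an assumption, standing in for the Big5 variants discussed above). `has_5c_trail` is a helper made up for this sketch.

```python
def has_5c_trail(ch: str) -> bool:
    """True if the Big5 form of ch is two bytes ending in 0x5C (backslash)."""
    raw = ch.encode("big5")
    return len(raw) == 2 and raw[1] == 0x5C

data = "\u6210\u529f".encode("big5")   # 成功
print(data)                            # the last byte is an ASCII backslash
print(data.endswith(b"\\"))

# The characters that gave the issue its name all have a 0x5C trail byte.
print([ch for ch in "\u8a31\u529f\u84cb\u6210" if has_5c_trail(ch)])
```

Any tool that treats the byte string as ASCII-compatible source text (a C string literal, a shell argument, an SQL string) will read that trailing byte as an escape character.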
Traditional/Simplified Chinese
NOT one-to-one mapping
Traditional 乾幹干
vs.
Simplified 干干干
Context dependent
之後、夜之后、入夜之後
Variants
峰、峯
Project Chvar (1/2)
https://github.com/buganini/chvar
Figure: Two level grouping in Chvar (two canonical groups, {签, 簽} and {籖, 籤}, inside one compatibility group)
签 簽 籖 籤
TW 簽 - 籤 -
CN - 签 - 籖
CP950 簽 - 籤 -
GB2312 - 签 × ×
Table: Canonical Group
签 簽 籖 籤
TW 簽 - 簽 簽
CN - 签 签 签
CP950 簽 - 簽 簽
GB2312 - 签 签 签
Table: Compatibility Group
Project Chvar (2/2)
https://github.com/buganini/chvar
Normalization: Canonical Equivalence
Transliteration: Converted, or Canonical Equivalence, or Compatibility Equivalence
Fuzzy character matching: Compatibility Equivalence
Bsdconv :: Phases
Traditional Chinese ⇔ Simplified Chinese
Conversion: UTF-8:ZHTW:UTF-8 (from : inter : to)
Figure: Conversion with inter-mapping phase
Bsdconv :: Phases
Furthermore, phrase mapping
Conversion: UTF-8:ZHTW:ZHTW-WORDS:UTF-8 (from : inter : inter : to)
Figure: Conversion with multiple inter-mapping phases
Unicode :: Casing
IS complicated
Lowercase Uppercase
a A
i I
Table: English
Lowercase Uppercase
ı I
i İ
Table: Turkic
Lowercase Uppercase
a A
à A
Table: French
Lowercase Uppercase
σ Σ
ς Σ
Table: Greek
Default Case Folding
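A few of the cases in the tables above can be reproduced with Python's locale-independent string methods. This is only a sketch of the default Unicode behaviour; the Turkic and French conventions shown in the tables are language-specific tailorings that these methods do not apply.

```python
# Greek: two lowercase sigmas share one uppercase form.
print("\u03c3".upper() == "\u03a3", "\u03c2".upper() == "\u03a3")

# Lowercasing picks the final sigma at the end of a word.
print("\u039f\u0394\u039f\u03a3".lower())

# Turkic letters under the default (non-Turkic) rules:
print("\u0131".upper())                           # dotless i -> I
print([hex(ord(c)) for c in "\u0130".lower()])    # I with dot -> i + U+0307

# Default case folding is meant for caseless matching, not display.
print("Stra\u00dfe".casefold() == "STRASSE".casefold())
```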
Unicode :: Normalization Forms (1/2)
UAX#15
Indexing
Identification security
Username, Domain name
Combining sequence Ç C + ◌̧
Ordering of combining marks q+◌̇+◌̣ q+◌̣+◌̇
Hangul 가 ᄀ + ᅡ
Singleton Ω Ω
Table: Canonical Equivalence
Unicode :: Normalization Forms (2/2)
UAX#15
Font variants ℌ H
Breaking differences NBSP SP
Cursive forms ‫ﻧ‬ ‫ﻨ‬
Circled ① 1
Width, size, rotated
カ カ
︷ {
Superscripts/subscripts ⁹ 9
Squared characters ㍿ 株 + 式 + 会 + 社
Fractions ¾ 3 + / + 4
Others dž d + z + ◌̌
Table: Compatibility Equivalence
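The rows of the canonical and compatibility equivalence tables can be checked with the standard `unicodedata` module. The sketch below uses explicit escapes so that the exact code points are unambiguous.

```python
from unicodedata import normalize

# Canonical equivalence (NFC / NFD)
print(normalize("NFD", "\u00c7") == "C\u0327")                   # combining sequence
print(normalize("NFD", "q\u0307\u0323") == "q\u0323\u0307")      # mark reordering
print(normalize("NFD", "\uac00") == "\u1100\u1161")              # Hangul syllable
print(normalize("NFC", "\u2126") == "\u03a9")                    # singleton: ohm -> omega

# Compatibility equivalence (NFKC / NFKD)
print(normalize("NFKC", "\u210c"))    # font variant  -> H
print(normalize("NFKC", "\u2460"))    # circled digit -> 1
print(normalize("NFKC", "\u2079"))    # superscript   -> 9
print(normalize("NFKC", "\u337f"))    # squared       -> 4 characters
print(normalize("NFKC", "\u00be") == "3\u20444")                 # fraction
print(normalize("NFKD", "\u01c6") == "dz\u030c")                 # digraph

# Canonical forms leave compatibility characters alone.
print(normalize("NFC", "\u2460") == "\u2460")
```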
Normalization for fuzzy matching
UTF-8:UPPER:UTF-8
Input: aăⅷDžбⓐᾥ
Output: AĂⅧDŽБⒶᾭ
UTF-8:ZH-FUZZY-TW:KANA-PHONETIC:NFKD-CASEFOLD:UTF-8
Input: ¼ℌℍăDžⓐ⁹ 灣湾ド鬒鬒㊣ æß
Output: 1⁄4hhădža9 灣灣 do 鬒鬒正 æss
                 Composition    Decomposition
Canonical        NFC            NFD
Compatibility    NFKC           NFKD
Table: The four Unicode normalization forms and the transformations
Bsdconv :: Cascade
Re-encode
Conversion: UTF-8:ISO-8859-1|BIG5:UTF-8 (from : to | from : to)
Figure: Cascaded conversions
Input    Output
¥x¥_     台北
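The cascade above undoes a wrong decoding step: Big5 bytes that were once read as ISO-8859-1. The same round trip can be sketched with Python's built-in codecs (an illustration of the idea, not a bsdconv call).

```python
garbled = "\u00a5x\u00a5_"   # what the user sees: ¥x¥_

# Stage 1 (UTF-8:ISO-8859-1): get back the original byte values.
raw = garbled.encode("iso-8859-1")
print(raw.hex())

# Stage 2 (BIG5:UTF-8): interpret those bytes as Big5.
fixed = raw.decode("big5")
print(fixed)

# The damage is reproduced by going the other way.
print(fixed.encode("big5").decode("iso-8859-1") == garbled)
```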
Bsdconv :: Codec argument
Other than question mark
Conversion: UTF-8,ANY#0121:ASCII,ANY#21 (from : to)
Figure: Codec argument
Or more than one character
Conversion: UTF-8,ANY#013F.0121:ASCII,ANY#21 (from : to)
Figure: Data list, separated by dot
Bsdconv :: Alias
from/3F: ANY#013F&ERROR
to/3F: ANY#3F&ERROR
from/UTF-8: ASCII,_UTF-8
inter/NFKD: _NFKD:_NF-HANGUL-DECOMPOSITION:_NF-ORDER
inter/NFKC: NFKD:_NFC:_NF-HANGUL-COMPOSITION
inter/NFKD-CASEFOLD: NFD:CASEFOLD:NFKD:CASEFOLD:NFKD
filter/01: UNICODE
Bsdconv :: Types
(01) Unicode
(02) CNS11643
(03) Byte
(04) Chinese components
(1B) ANSI control sequences
(00) Bsdconv special characters
Chinese components composition
https://github.com/buganini/chicomp
UTF-8:ZH-DECOMP:ZH-COMP:UTF-8
Input: 功夫不好不要艹我
Output: 巭孬嫑莪
UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:UTF-8
Input: 功夫不好不要艹我
Output: ㄆㄨ ㄋㄠ ㄧㄠ ㄜ ˊ
UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:HAN-PINYIN:UTF-8
Input: 功夫不好不要艹我
Output: pu nao yao [uh]2
Bsdconv :: Flags
FREE - memory management
MARK - identifier
Look-through (1/4)
Input (UTF-8 literal): %u03B1%CE%B2
Decoder: ESCAPE
Internal data: 01 03 B1 | 03 CE | 03 B2
Entity    Unicode    UTF-8 Hex
α         U+03B1     CEB1
β         U+03B2     CEB2
Figure: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8
Look-through (2/4)
Internal data: 01 03 B1 | 03 CE | 03 B2
Encoder: PASS#MARK&FOR=1,BYTE
Internal data: 01 03 B1 | MARK | CE | B2
Look-through (3/4)
Internal data: 01 03 B1 | MARK | CE | B2
Decoder: PASS#UNMARK,UTF-8
Internal data: 01 03 B1 | 01 03 B2
Look-through (4/4)
Internal data: 01 03 B1 | 01 03 B2
Encoder: UTF-8
Internal data: CE B1 ("α") | CE B2 ("β")
Output (UTF-8 literal): αβ
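The effect of the look-through pipeline (code points from `%uXXXX` pass straight through, while raw `%XX` bytes are collected and decoded as UTF-8) can be imitated in a few lines. This is a sketch of the observable behaviour only; it does not model bsdconv's typed internal data or the MARK flag, and `unescape` is a name made up here.

```python
import re

# %uXXXX -> code point (type 01), %XX -> raw byte (type 03), else literal.
TOKEN = re.compile(r"%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})|(.)", re.S)

def unescape(text: str) -> str:
    out = []
    pending = bytearray()      # raw bytes waiting to be decoded as UTF-8

    def flush():
        if pending:
            out.append(pending.decode("utf-8", errors="replace"))
            pending.clear()

    for m in TOKEN.finditer(text):
        uni, byte, literal = m.groups()
        if byte is not None:
            pending.append(int(byte, 16))      # keep collecting bytes
        else:
            flush()                            # bytes end here: decode them
            out.append(chr(int(uni, 16)) if uni is not None else literal)
    flush()
    return "".join(out)

print(unescape("%u03B1%CE%B2"))
```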
Unicode :: East Asian Width (1/2)
UAX#11
Figure: Venn Diagram Showing the Set Relations for Six Categories (Narrow, Halfwidth, Wide, Fullwidth, Ambiguous, Neutral)
Unicode :: East Asian Width (2/2)
UAX#11
Class    Example    Resolved width
A        Я          Ambiguous
N        ऊ          Narrow
Na       A          Narrow
H        ｶ          Narrow
F        Ａ         Wide
W        カ         Wide
W        咦         Wide
Table: Examples for Each Character Class and Their Resolved Widths
Na Narrow
N Neutral, usually treated as Narrow
W Wide
F Fullwidth
H Halfwidth
Table: Width attributes
String width measurement
echo "42(ˊ_>ˋ) 紅茶" | bsdconv UTF-8:WIDTH:NULL
FULL: 2
HALF: 7
AMBI: 2
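A comparable count can be obtained from the East_Asian_Width property exposed by Python's `unicodedata` module. This is a sketch for comparison, not how bsdconv's WIDTH codec is implemented; the two tone marks are written as escapes (U+02CA, U+02CB) on the assumption that these are the characters in the sample string.

```python
import unicodedata
from collections import Counter

def width_classes(text: str) -> Counter:
    """Count characters per East_Asian_Width class (Na, N, H, W, F, A)."""
    return Counter(unicodedata.east_asian_width(ch) for ch in text)

def display_width(text: str, ambiguous_wide: bool = False) -> int:
    """Columns needed on a terminal; ambiguous characters are configurable."""
    total = 0
    for ch in text:
        cls = unicodedata.east_asian_width(ch)
        if cls in ("W", "F") or (cls == "A" and ambiguous_wide):
            total += 2
        else:
            total += 1
    return total

sample = "42(\u02ca_>\u02cb) \u7d05\u8336"
counts = width_classes(sample)
print(counts["W"] + counts["F"], counts["Na"] + counts["N"] + counts["H"], counts["A"])
print(display_width(sample), display_width(sample, ambiguous_wide=True))
```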
Chinese charset encoding detection
https://github.com/buganini/chiconv
ENCODING:SCORE#WITH=CJK:COUNT:ZH-BONUS:ZHTW:ZH-BONUS-PHRASE:NULL
Score(s) = ($SCORE − $IERR ∗ $COUNT ∗ 0.01) / $COUNT
帥呆了 ⇒ UTF-8:SCORE#WITH=CJK:……
ENCODING SCORE COUNT IERR Score(s)
UTF-8 19 4 0 4.75
BIG5 8 3 2 -4.0
GBK 4 1 4 -36.0
CCCII 36 9 0 4.0
UTF-16LE 20 5 2 0.0
Khmer legacy font converter
https://github.com/buganini/khmerconv
Issues
Encodings without registered names, bound to fonts
Stored in CP1252 or UTF-8
Solution
Two pass detection
Detect encoding
Detect font family (currently not working)
(High coverage in SBCS)
Algorithm ported from Khmer Converter [3]
Khmer Converter
Mapping
Reordering
Visual order vs. Unicode model
Unicode Model: baseCharacter [+ [Robat/Shifter] + [Coeng*] + [Shifter] + [Vowel] + [Sign]]
[3] http://www.khmeros.info/en/khmer-converter
Terminal transcoding
https://github.com/buganini/bug5
Issues
UAO: Non-standard big5 extension
Double color hack
ANSI control sequence in the middle of DBCS
Ambiguous width characters
luit/screen cannot help
Solution (tl;dr)
Big5 to Unicode
ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
Unicode to Big5
UTF-8,00,BYTE:ZHTW:AMBIGUOUS-UNPAD:BIG5,CP950-TRANS,UAO,00,ANY#3F
Bug5 explained (1/6)
Input (Big5 literal): ⋆\xC5\x1B[1m\xE5
Decoder: ANSI-CONTROL,BYTE
Internal data: 03 A1 | 03 B9 | 03 C5 | 1B 5B 31 6D | 03 E5
Entity    Unicode    UTF-8 Hex    Big5 Hex
⋆         U+2605     E29885       A1B9
驚        U+9A5A     E9A99A       C5E5
[         U+005B     5B           5B
1         U+0031     31           31
m         U+006D     6D           6D
(NBSP)    U+00A0     C2A0         -
Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
Bug5 explained (2/6)
Internal data: 03 A1 | 03 B9 | 03 C5 | 1B 5B 31 6D | 03 E5
Inter-conversion: BIG5-DEFRAG
Internal data: 03 A1 | 03 B9 | 03 C5 | 03 E5 | 1B 5B 31 6D
Bug5 explained (3/6)
Internal data: 03 A1 | 03 B9 | 03 C5 | 03 E5 | 1B 5B 31 6D
Encoder: BYTE,PASS#MARK&FOR=1B
Internal data: A1 | B9 | C5 | E5 | 1B 5B 31 6D | MARK
Bug5 explained (4/6)
Internal data: A1 | B9 | C5 | E5 | 1B 5B 31 6D | MARK
Decoder: PASS#UNMARK,BIG5
Internal data: 01 26 05 | 01 9A 5A | 1B 5B 31 6D
Bug5 explained (5/6)
Internal data: 01 26 05 | 01 9A 5A | 1B 5B 31 6D
Inter-conversion: AMBIGUOUS-PAD
Internal data: 01 26 05 | 01 A0 | 01 9A 5A | 1B 5B 31 6D
Bug5 explained (6/6)
Internal data: 01 26 05 | 01 A0 | 01 9A 5A | 1B 5B 31 6D
Encoder: UTF-8,PASS#FOR=1B
Output (UTF-8 literal): ⋆ 驚\x1B[1m (⋆, NBSP, 驚, then the control sequence)
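The defragmentation step (2/6) is the part that makes the "double color hack" decodable: a control sequence sitting between the two bytes of a DBCS character is moved behind that character. A minimal byte-level sketch follows, assuming lead bytes in 0x81..0xFE and CSI sequences of the form ESC [ parameters final-byte; `defrag` is a name made up here and this is not the BIG5-DEFRAG codec itself.

```python
import re

# ESC [ <digits and semicolons> <final byte in @..~>, e.g. ESC[1m
CSI = re.compile(rb"\x1b\[[0-9;]*[@-~]")

def defrag(data: bytes) -> bytes:
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        seq = CSI.match(data, i)
        if seq:                                # control sequence between characters
            out += seq.group(0)
            i = seq.end()
        elif 0x81 <= data[i] <= 0xFE:          # DBCS lead byte
            lead, j = data[i], i + 1
            held = bytearray()
            seq = CSI.match(data, j)
            while seq:                         # sequences splitting the character
                held += seq.group(0)
                j = seq.end()
                seq = CSI.match(data, j)
            if j < n:
                out += bytes([lead, data[j]])  # rejoin lead and trail bytes
                j += 1
            else:
                out.append(lead)               # dangling lead byte at end of input
            out += held                        # emit the held sequences afterwards
            i = j
        else:
            out.append(data[i])                # plain single byte
            i += 1
    return bytes(out)

fragmented = b"\xa1\xb9\xc5\x1b[1m\xe5"
print(defrag(fragmented))
```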
Bsdconv :: Bindings
Python/Ruby/Go/Perl/PHP
https://pypi.python.org/pypi/bsdconv
https://rubygems.org/gems/ruby-bsdconv
https://github.com/buganini/go-bsdconv
https://github.com/buganini/perl-bsdconv
https://github.com/buganini/php-bsdconv
PostgreSQL/MySQL
https://github.com/buganini/postgres-bsdconv
https://github.com/buganini/mysql-udf-bsdconv
Irssi
https://github.com/buganini/irssi-scripts/blob/master/irssi-bsdconv.pl
Bsdconv :: GUI
https://github.com/buganini/gbsdconv
Alternative to ConvertZ
Text
File name
File content
Meta tag
Thanks
ESCAPE,UTF-8:PASS#FOR=UNICODE&MARK,BYTE|PASS#UNMARK,UTF-8:NFC:ASCII,ESCAPE|
https://github.com/buganini/bsdconv

Journey of Bsdconv

  • 2. Charset & Encoding Character Set Collection of characters Encoding Binary representation
  • 3. Charset & Encoding .. Unicode (32bits addr. space) . Unicode up to U+10FFFF . Unicode BMP (up to U+FFFF) . (Basic Multilingual Plane) . GB18030. CNS11643 . CP950 . Latin1 Figure: Character Sets
  • 4. Charset & Encoding .. Unicode (32bits addr. space) . Unicode up to U+10FFFF . Unicode BMP (up to U+FFFF) .GB18030. CNS11643 . CP950 . Latin1 . UTF-32 / UCS4 . UTF-81 / UTF-16 . UCS2 . GB18030. CNS11643 . CP950 (DBCS) . ISO-8859-1 / EBCDIC-0372 1 Could cover more but restricted by RFC 3629 2 Aka. IBM-37, some control characters are different from ISO-8859-1
  • 5. Encoding :: UTF-32 / UCS4 Fixed Length 4 bytes Filesize *= 4 for ASCII text file Incompatible with C-style string convention Endianness concern
  • 6. Encoding :: UCS2 Fixed Length 2 bytes Filesize *= 2 for ASCII text file Incompatible with C-style string convention Endianness concern BMP-only
  • 7. Encoding :: UTF-16 Variable Length 2 bytes / 4 bytes (Surrogate pairs) Surrogates Using U+D800..U+DFFF Incompatible with C-style string convention Endianness concern ******** ******** 110110** ******** 110111** ******** Table: UTF-16 Structure
  • 8. Encoding :: UTF-8 Variable Length 1~6 bytes Compatible with C-style string convention Self-synchronizing Endian-neutral Sorting order = Code point order 0******* (ASCII) 110***** 10****** 1110**** 10****** 10****** 11110*** 10****** 10****** 10****** 111110** 10****** 10****** 10****** 10****** 1111110* 10****** 10****** 10****** 10****** 10****** Table: UTF-8 Structure
  • 9. Encoding :: CNS11643 (全字庫) #issue http://www.cns11643.gov.tw/ Only used by Taiwan government NOT a subset of Unicode Not just an charset/encoding Font Pronunciation ㄇㄥ ˊ / méng Radical 艸 Component 艹日月 Stroke Tra/Sim mapping 萌蕄 Table: Examples for some information provided by 全字庫 for「萌」
  • 10. Encoding :: CCCII Variants Variant glyph at different plane Mostly used for library indexing 強 21 3D 48 彊 2D 3D 48 强 33 3D 48
  • 11. Encoding :: Big5 Many incompatible variations (abusing PUA), none of standard tools can rule them all http://moztw.org/docs/big5/ Scenario Dominating encoding Microsoft CP950 Taiwan BBS UAO (Unicode-at-Once) gov.tw Big5-2003 gov.hk HKSCS (1999,2001,2004) Special characters conflict The second byte could be 0x5C (), 0x7C (|), 0x7E (~), which may have special meaning in certain context
  • 12. Bsdconv :: Decoding and Encoding Alternative to iconv ... ISO-8859-1. :. UTF-8.. from . to Figure: Basic two phases conversion
  • 13. Bsdconv :: Codecs & Fallback Optionally produce question mark (U+003F) as replacement ... UTF-8. ,. 3F. :. ASCII. ,. 3F.. from . to Figure: Fallback codec Transliteration ... UTF-8. :. CP936. ,. CP936-TRANS. ,. 3F.. from . to Figure: Multiple fallback codecs
  • 14. Encoding :: Big5 Many incompatible variations (abusing PUA), none of standard tools can rule them all http://moztw.org/docs/big5/ Scenario Dominating encoding Microsoft CP950 Taiwan BBS UAO (Unicode-at-Once) gov.tw Big5-2003 gov.hk HKSCS (1999,2001,2004) Special characters conflict The second byte could be 0x5C (), 0x7C (|), 0x7E (~), which may have special meaning in certain context
  • 15. Big5 5C issue (許功蓋) BIG5:BIG5-5C,BIG5 # Input Output Big5 Literal ” 成功” ” 成功 ” ASCII/Hex ”xA6xA8xA5” ”xA6xA8xA5” BIG5-5C,BIG5:BIG5 # Input Output Big5 Literal ” 成功 ” ” 成功” ASCII/Hex ”xA6xA8xA5” ”xA6xA8xA5”
  • 16. Traditional/Simplified Chinese NOT one-to-one mapping Traditional 乾幹干 vs. Simplified 干干干 Context dependent 之後、夜之后、入夜之後 Variants 峰、峯
  • 17. Project Chvar (1/2) https://github.com/buganini/chvar .. ..签簽. 籖籤. Canonical group . Canonical group . Compatibility group Figure: Two level grouping in Chvar 签 簽 籖 籤 TW 簽 - 籤 - CN - 签 - 籖 CP950 簽 - 籤 - GB2312 - 签 × × Table: Canonical Group 签 簽 籖 籤 TW 簽 - 簽 簽 CN - 签 签 签 CP950 簽 - 簽 簽 GB2312 - 签 签 签 Table: Compatibility Group
  • 18. Project Chvar (2/2) https://github.com/buganini/chvar Normalization Canonical Equivalence Transliteration Converted or Canonical Equivalence or Compatibility Equivalence Fuzzy character matching Compatibility Equivalence 签 簽 籖 籤 TW 簽 - 籤 - CN - 签 - 籖 CP950 簽 - 籤 - GB2312 - 签 × × Table: Canonical Group 签 簽 籖 籤 TW 簽 - 簽 簽 CN - 签 签 签 CP950 簽 - 簽 簽 GB2312 - 签 签 签 Table: Compatibility Group
  • 19. Bsdconv :: Phases Traditional Chinese ⇔ Simplified Chinese ... UTF-8. :. ZHTW. :. UTF-8.. from . inter . to Figure: Conversion with inter-mapping phase
  • 20. Bsdconv :: Phases Furthermore, phrases mapping ... UTF-8. :. ZHTW. :. ZHTW-WORDS. :. UTF-8.. from . inter . inter . to Figure: Conversion with multiple inter-mapping phases
  • 21. Unicode :: Casing IS complicated Lowercase Uppercase a A i I Table: English Lowercase Uppercase ı I i İ Table: Turkic Lowercase Uppercase a A à A Table: French Lowercase Uppercase σ Σ ς Σ Table: Greek Default Case Folding
  • 22. Unicode :: Normalization Forms (1/2) UAX#15 Indexing Identification security Username, Domain name Combining sequence Ç C + ◌̧ Ordering of combining marks q+◌̇+◌̣ q+◌̣+◌̇ Hangul 가 ᄀ + ᅡ Singleton Ω Ω Table: Canonical Equivalence
  • 23. Unicode :: Normalization Forms (2/2) UAX#15 Font variants ℌ H Breaking differences NBSP SP Cursive forms ‫ﻧ‬ ‫ﻨ‬ Circled ① 1 Width, size, rotated カ カ ︷ { Superscripts/subscripts ⁹ 9 Squared characters ㍿ 株 + 式 + 会 + 社 Fractions ¾ 3 + / + 4 Others dž d + z + ◌̌ Table: Compatibility Equivalence
  • 24. Normalization for fuzzy matching UTF-8:UPPER:UTF-8 Input: aăⅷDžбⓐᾥ Output: AĂⅧDŽБⒶᾭ UTF-8:ZH-FUZZY-TW:KANA-PHONETIC:NFKD- CASEFOLD:UTF-8 Input: ¼ℌℍăDžⓐ⁹ 灣湾ド鬒鬒㊣ æß Output: 1⁄4hhădža9 灣灣 do 鬒鬒正 æss Composition Decomposition Canonical NFC NFD Compatibility NFKC NFKD Table: The four Unicode normalization forms and the transformations
  • 25. Bsdconv :: Cascade Re-encode ... UTF-8. :. ISO-8859-1. |. BIG5. :. UTF-8.. from . to . from . to Figure: Cascaded conversions Input Output ¥x¥_ 台北
  • 26. Bsdconv :: Codec argument Other than question mark ... UTF-8. ,. ANY#0121. :. ASCII. ,. ANY#21.. from . to Figure: Codec argument Or more than one character ... UTF-8. ,. ANY#013F.0121. :. ASCII. ,. ANY#21.. from . to Figure: Data list, separated by dot
  • 28. Charset & Encoding
      Figure: Character Sets
        Unicode (32-bit address space)
        Unicode up to U+10FFFF
        Unicode BMP (Basic Multilingual Plane, up to U+FFFF)
        GB18030
        CNS11643
        CP950
        Latin1
  • 29. Bsdconv :: Types
      (01) Unicode
      (02) CNS11643
      (03) Byte
      (04) Chinese components
      (1B) ANSI control sequences
      (00) Bsdconv special characters
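
The type prefixes can be illustrated with a small sketch of type-tagged cells. The layout is inferred from the internal-data examples in this deck (01 03 B1 for U+03B1, 01 A0 for U+00A0, 03 CE for the raw byte CE); the assumption that a code point is stored as the shortest big-endian byte string, and all function names, are this example's own and not taken from bsdconv's source.

```python
TYPE_UNICODE = 0x01   # (01) Unicode
TYPE_BYTE = 0x03      # (03) Byte

def unicode_cell(code_point: int) -> bytes:
    """Tag a code point: type byte 01, then the code point without leading zero bytes."""
    length = max(1, (code_point.bit_length() + 7) // 8)
    return bytes([TYPE_UNICODE]) + code_point.to_bytes(length, "big")

def byte_cell(value: int) -> bytes:
    """Tag a raw, not yet decoded byte: type byte 03, then the byte itself."""
    return bytes([TYPE_BYTE, value])

def hex_dump(cell: bytes) -> str:
    """Render a cell the way the slides print internal data."""
    return " ".join(f"{b:02X}" for b in cell)
```

For instance, `hex_dump(unicode_cell(0x03B1))` renders the cell for α in the same notation the slides use.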
  • 30. Encoding :: CNS11643 (全字庫) #issue
      http://www.cns11643.gov.tw/
      Only used by the Taiwan government
      NOT a subset of Unicode
      Not just a charset/encoding
      Table: Examples for some information provided by 全字庫 for「萌」
        Font
        Pronunciation     ㄇㄥ ˊ / méng
        Radical           艸
        Component         艹日月
        Stroke
        Tra/Sim mapping   萌蕄
  • 31. Chinese components composition https://github.com/buganini/chicomp
      UTF-8:ZH-DECOMP:ZH-COMP:UTF-8
        Input: 功夫不好不要艹我
        Output: 巭孬嫑莪
      UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:UTF-8
        Input: 功夫不好不要艹我
        Output: ㄆㄨ ㄋㄠ ㄧㄠ ㄜ ˊ
      UTF-8:ZH-DECOMP:ZH-COMP:CHEWING:HAN-PINYIN:UTF-8
        Input: 功夫不好不要艹我
        Output: pu nao yao [uh]2
  • 32. Bsdconv :: Flags
      FREE: memory management
      MARK: identifier
  • 34. Look-through (1/4)
      Input (UTF-8 literal): %u03B1%CE%B2
      Decoder: ESCAPE
      Internal data: 01 03 B1 03 CE 03 B2
      Entity   Unicode   UTF-8 Hex
      α        U+03B1    CEB1
      β        U+03B2    CEB2
      Figure: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8
  • 35. Look-through (2/4)
      Internal data: 01 03 B1 03 CE 03 B2
      Encoder: PASS#MARK&FOR=1,BYTE
      Internal data: 01 03 B1 MARK CE B2
      Figure: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8
  • 36. Look-through (3/4)
      Internal data: 01 03 B1 MARK CE B2
      Decoder: PASS#UNMARK,UTF-8
      Internal data: 01 03 B1 01 03 B2
      Figure: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8
  • 37. Look-through (4/4)
      Internal data: 01 03 B1 01 03 B2
      Encoder: UTF-8
      Internal data: CE B1 ("α") CE B2 ("β")
      Output (UTF-8 literal): αβ
      Figure: ESCAPE:PASS#MARK&FOR=1,BYTE|PASS#UNMARK,UTF-8:UTF-8
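
The four look-through steps can be approximated in plain Python. This is a sketch of the idea, not of bsdconv's mechanism: `%uXXXX` escapes are already code points and pass straight through, while `%XX` escapes are raw bytes that are collected and decoded as UTF-8 afterwards. The function name `unescape` and the choice to decode strictly are assumptions of this example.

```python
import re

# %uXXXX is a code point, %XX is a raw byte, anything else is a literal character.
TOKEN = re.compile(r"%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})|(.)", re.S)

def unescape(text: str) -> str:
    out = []
    pending = bytearray()          # raw bytes still waiting to be decoded

    def flush():
        if pending:
            out.append(pending.decode("utf-8"))   # raises on invalid UTF-8
            pending.clear()

    for match in TOKEN.finditer(text):
        code_point, raw_byte, literal = match.groups()
        if raw_byte is not None:
            pending.append(int(raw_byte, 16))     # byte cell: decode later
        else:
            flush()                               # a byte run has ended
            if code_point is not None:
                out.append(chr(int(code_point, 16)))   # already Unicode: pass through
            else:
                out.append(literal)
    flush()
    return "".join(out)
```

With the slide's input, `unescape("%u03B1%CE%B2")` passes α through untouched and decodes the bytes CE B2 into β.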
  • 38. Unicode :: East Asian Width (1/2) UAX#11
      Figure: Venn Diagram Showing the Set Relations for Six Categories (Narrow, Halfwidth, Wide, Fullwidth, Ambiguous, Neutral)
  • 39. Unicode :: East Asian Width (2/2) UAX#11
      Table: Examples for Each Character Class and Their Resolved Widths
        Narrow      ऊ (N), A (Na), ｶ (H)
        Ambiguous   Я (A)
        Wide        Ａ (F), カ (W), 咦 (W)
      Table: Width attributes
        Na   Narrow
        N    Neutral, usually treated as Narrow
        W    Wide
        F    Fullwidth
        H    Halfwidth
  • 40. String width measurement
      echo "42(ˊ_>ˋ) 紅茶" | bsdconv UTF-8:WIDTH:NULL
      FULL: 2
      HALF: 7
      AMBI: 2
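
The same kind of count can be sketched with Python's `unicodedata.east_asian_width`, which exposes the UAX#11 property. This is an approximation, not bsdconv's WIDTH codec: the function name `width_counts` and the choice to count every non-wide, non-ambiguous character (including Neutral) as HALF are assumptions, and the sample assumes the two accents on the slide are U+02CA and U+02CB.

```python
import unicodedata

def width_counts(text: str) -> dict:
    """Count characters by resolved East Asian Width class."""
    counts = {"FULL": 0, "HALF": 0, "AMBI": 0}
    for ch in text:
        width = unicodedata.east_asian_width(ch)
        if width in ("W", "F"):      # Wide and Fullwidth occupy two columns
            counts["FULL"] += 1
        elif width == "A":           # Ambiguous: depends on the terminal or font
            counts["AMBI"] += 1
        else:                        # Na, H and N are treated as one column here
            counts["HALF"] += 1
    return counts

# The slide's sample string, without the trailing newline that echo adds.
sample = "42(\u02CA_>\u02CB) \u7D05\u8336"
result = width_counts(sample)
```

A display width then follows from a policy decision, for example `2 * FULL + HALF + AMBI` when ambiguous characters are rendered narrow.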
  • 41. Chinese charset encoding detection https://github.com/buganini/chiconv
      ENCODING:SCORE#WITH=CJK:COUNT:ZH-BONUS:ZHTW:ZH-BONUS-PHRASE:NULL
      Score(s) = $SCORE−$IERR∗$COUNT∗0.01 $COUNT
      帥呆了 ⇒ UTF-8:SCORE#WITH=CJK:……
      ENCODING   SCORE   COUNT   IERR   Score(s)
      UTF-8      19      4       0      4.75
      BIG5       8       3       2      -4.0
      GBK        4       1       4      -36.0
      CCCII      36      9       0      4.0
      UTF-16LE   20      5       2      0.0
  • 42. Khmer legacy font converter https://github.com/buganini/khmerconv
      Issues
        Encodings without registered names, bound to fonts
        Stored in CP1252 or UTF-8
      Solution
        Two-pass detection
          Detect encoding
          Detect font family (currently not working) (high coverage in SBCS)
        Algorithm ported from Khmer Converter [3]
      Khmer Converter
        Mapping
        Reordering: visual order vs. Unicode model
      Unicode model: baseCharacter [+ [Robat/Shifter] + [Coeng*] + [Shifter] + [Vowel] + [Sign]]
      [3] http://www.khmeros.info/en/khmer-converter
  • 43. Encoding :: Big5
      Many incompatible variations (abusing PUA); no standard tool handles them all
      http://moztw.org/docs/big5/
      Scenario     Dominating encoding
      Microsoft    CP950
      Taiwan BBS   UAO (Unicode-at-Once)
      gov.tw       Big5-2003
      gov.hk       HKSCS (1999, 2001, 2004)
      Special characters conflict: the second byte can be 0x5C (\), 0x7C (|), or 0x7E (~), which may have special meaning in certain contexts
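
The special-character conflict is easy to demonstrate with Python's standard `big5` codec. The helper below is a hypothetical name for this example; it lists characters whose second Big5 byte collides with one of the three ASCII characters named above.

```python
def risky_trail_bytes(text: str, encoding: str = "big5"):
    """Return (char, hex bytes, colliding ASCII char) for risky double-byte characters."""
    risky = {0x5C: "\\", 0x7C: "|", 0x7E: "~"}
    hits = []
    for ch in text:
        raw = ch.encode(encoding)
        if len(raw) == 2 and raw[1] in risky:
            hits.append((ch, raw.hex().upper(), risky[raw[1]]))
    return hits

# 功 is encoded as A5 5C, so its second byte looks like a backslash.
hits = risky_trail_bytes("\u529F")
```

Byte-oriented tools that treat 0x5C as an escape character will split or corrupt such a character unless they are aware of the encoding.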
  • 46. Terminal transcoding https://github.com/buganini/bug5
      Issues
        UAO: non-standard Big5 extension
        Double color hack: ANSI control sequence in the middle of a DBCS character
        Ambiguous width characters
        luit/screen cannot help
      Solution (tl;dr)
        Big5 to Unicode: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
        Unicode to Big5: UTF-8,00,BYTE:ZHTW:AMBIGUOUS-UNPAD:BIG5,CP950-TRANS,UAO,00,ANY#3F
  • 47. Bug5 explained (1/6)
      Input (Big5 literal): ⋆\xC5\x1B[1m\xE5
      Decoder: ANSI-CONTROL,BYTE
      Internal data: 03 A1 03 B9 03 C5 1B 5B 31 6D 03 E5
      Entity   Unicode   UTF-8 Hex   Big5 Hex
      ⋆        U+2605    E29885      A1B9
      驚       U+9A5A    E9A99A      C5E5
      [        U+005B    5B          5B
      1        U+0031    31          31
      m        U+006D    6D          6D
      (NBSP)   U+00A0    C2A0        -
      Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
  • 48. Bug5 explained (2/6)
      Internal data: 03 A1 03 B9 03 C5 1B 5B 31 6D 03 E5
      Inter-conversion: BIG5-DEFRAG
      Internal data: 03 A1 03 B9 03 C5 03 E5 1B 5B 31 6D
      Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
  • 49. Bug5 explained (3/6)
      Internal data: 03 A1 03 B9 03 C5 03 E5 1B 5B 31 6D
      Encoder: BYTE,PASS#MARK&FOR=1B
      Internal data: A1 B9 C5 E5 1B 5B 31 6D MARK
      Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
  • 50. Bug5 explained (4/6)
      Internal data: A1 B9 C5 E5 1B 5B 31 6D MARK
      Decoder: PASS#UNMARK,BIG5
      Internal data: 01 26 05 01 9A 5A 1B 5B 31 6D
      Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
  • 51. Bug5 explained (5/6)
      Internal data: 01 26 05 01 9A 5A 1B 5B 31 6D
      Inter-conversion: AMBIGUOUS-PAD
      Internal data: 01 26 05 01 A0 01 9A 5A 1B 5B 31 6D
      Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
  • 52. Bug5 explained (6/6)
      Internal data: 01 26 05 01 A0 01 9A 5A 1B 5B 31 6D
      Encoder: UTF-8,PASS#FOR=1B
      Output (UTF-8 literal): ⋆(NBSP)驚\x1B[1m
      Figure: ANSI-CONTROL,BYTE:BIG5-DEFRAG:BYTE,PASS#MARK&FOR=1B|PASS#UNMARK,BIG5:AMBIGUOUS-PAD:UTF-8,PASS#FOR=1B
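
The six steps above can be condensed into a rough Python sketch. It is not bug5's implementation: the function names are invented here, only CSI sequences of the form ESC [ ... are recognized, Python's `big5` codec stands in for bsdconv's Big5 tables (so UAO extensions are not covered), and padding every ambiguous-width character with NBSP is this example's reading of AMBIGUOUS-PAD.

```python
import re
import unicodedata

CSI = re.compile(rb"\x1b\[[0-9;]*[@-~]")   # ANSI control sequence: ESC [ params final

def big5_defrag(data: bytes) -> bytes:
    """Move control sequences that split a double-byte character to just after it."""
    out = bytearray()
    i = 0
    while i < len(data):
        lead = data[i]
        if 0x81 <= lead <= 0xFE:               # Big5 lead byte
            i += 1
            held = bytearray()
            match = CSI.match(data, i)
            while match:                       # sequences between lead and trail byte
                held += match.group()
                i = match.end()
                match = CSI.match(data, i)
            out.append(lead)
            if i < len(data):                  # trail byte, if present
                out.append(data[i])
                i += 1
            out += held                        # re-emit the sequences after the pair
        else:
            match = CSI.match(data, i)
            if match:
                out += match.group()
                i = match.end()
            else:
                out.append(lead)
                i += 1
    return bytes(out)

def ambiguous_pad(text: str) -> str:
    """Append NBSP after each ambiguous-width character so it occupies two columns."""
    padded = []
    for ch in text:
        padded.append(ch)
        if unicodedata.east_asian_width(ch) == "A":
            padded.append("\u00A0")
    return "".join(padded)

def big5_terminal_to_unicode(data: bytes) -> str:
    return ambiguous_pad(big5_defrag(data).decode("big5"))

converted = big5_terminal_to_unicode(b"\xA1\xB9\xC5\x1B[1m\xE5")
```

Moving the control sequence after the character changes where the colour switch happens by half a cell, which is the trade-off this approach accepts in exchange for a decodable byte stream.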
  • 54. Bsdconv :: GUI https://github.com/buganini/gbsdconv Alternative to ConvertZ Text File name File content Meta tag