Localizing your apps for multibyte languages

Ken ISHIMOTO (K’s Room Japan)

Localizing your apps
• Part I - WebObjects
• Part II - What is a multibyte language
• Part III - Combining multibyte languages with WebObjects
• Part IV - Multibyte & WOdka

Part I - WebObjects
• Eclipse
• Ant build
• Properties (to make WebObjects ready)
• Database

Eclipse
• Set your workspace encoding to UTF-8.
If you do not, you can run into all kinds of problems; non-English text in the source code can even break compilation.

Ant build
• Set the encoding of your Ant compile task to UTF-8

Properties in your app
• These are the properties that we use:
• file.encoding=UTF-8
• er.extensions.ERXApplication.DefaultEncoding=UTF-8
• er.extensions.ERXApplication.DefaultMessageEncoding=UTF-8
• er.extensions.ERXLocalizationEditor.encoding=UTF-8
• wodka.Application.LanguageEncoding={Japanese = UTF-8; }

Javascript
<script type="text/javascript" charset="UTF-8">

Database - MySQL
• MySQL: append &useUnicode=true&characterEncoding=UTF-8 to the connection URL.
Don't forget to create the database with the 'utf8' character set.

Database - FrontBase
Nothing to do, just works

Part II - What is a multibyte language (Japanese)
• Basics
• Alphabet (how Japanese works)
• Encoding (which encoding to use)

Basics
• This is a sample page from a book.
• A Japanese book is read from right to left, so you open it where you would usually close it.
• Text is read from top to bottom, with the columns running from right to left.
• This can be very complex for word-processing software, so XX Word isn't a good choice for writing books or magazines. That is one reason why there are Japanese text editors that can do this.

Spaces between Words
• This is a pen.
• これはペンです。
• Today we have good weather in Tokyo.
• 今日、東京はとてもいい天気です。
Another big problem can be that there are no spaces between words.
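A minimal sketch of why this matters for code: splitting on whitespace, the usual way to tokenize English, returns the whole Japanese sentence as a single token. The class and method names are only illustrative, and the Japanese sentence is written with Unicode escapes so the example does not depend on the source file encoding.

```java
class WhitespaceTokens {
    // Splits on runs of whitespace, the usual approach for English text.
    static String[] tokens(String text) {
        return text.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String english = "This is a pen.";
        // The Japanese sentence from the slide, written with escapes.
        String japanese = "\u3053\u308C\u306F\u30DA\u30F3\u3067\u3059\u3002";
        System.out.println(tokens(english).length);   // 4
        System.out.println(tokens(japanese).length);  // 1
    }
}
```

Search, word counts and line wrapping that rely on spaces therefore need a different strategy for Japanese text.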

yen symbol vs backslash
• If you're familiar with the Japanese keyboard, the backslash key (\) is replaced by the symbol for the Yen (¥). Way back when, we did a Japanese version of BRIEF, so I was familiar with this phenomenon: paths would be separated by Yen symbols, but everything worked as expected.
• set the URL_A_chars to “$+!’,?;&@=#%><{}[]\"~`^\\|*()”
• completely failed to compile, because it looked like this:
• set the URL_A_chars to “$+!’,?;&@=#%><{}[]¥"~`^¥¥|*()”
• and ¥ didn't escape as you'd expect.
• If I create a new file, either on my system or on the English-only system, I can use any font, type the \ key, and get the \ glyph. Side by side in this file I can use exactly the same font, but when I type the symbol I get the ¥ glyph.

Japanese Alphabet
• 漢字 Kanji (Chinese characters)
• ひらがな Hiragana (Japanese Alphabet)
• カタカナ Katakana (Alphabet for Foreign Words)
• ローマ字 Romaji (English characters)
• 記号 Kigo (Sign)
• 絵文字 Emoji (Smilies)
• 外字 Gaiji (Self-made characters)
• 振り仮名 Furigana

Japanese Alphabet
•漢字 Kanji (Chinese characters)

漢字 Kanji
• The complexity of these characters
• The vast majority of kanji are not in common use in either Japan or China; approximately 2,000 to 3,000 characters are in common use in Japan, a few thousand more find occasional use, and a total of about 13,000 characters can be encoded in the various Japanese Industrial Standards for kanji.
• Kyōiku kanji: the Kyōiku kanji (教育漢字, "education kanji") are 1,006 characters that Japanese children learn in elementary school.
• Jōyō kanji: the Jōyō kanji (常用漢字, "regular-use kanji") are 2,136 characters consisting of all the Kyōiku kanji, plus 1,130 additional kanji taught in junior high and high school. In publishing, characters outside this category are often given furigana.
• Jinmeiyō kanji Since September 27, 2004, the Jinmeiyō kanji (人名用漢字, "kanji for use in personal

Encoding of 生
• UNICODE : 751F
• UTF-8 : E7 94 9F
• Shift-JIS : 90B6
A character does not always fit into 16 bits, and in UTF-8 a single character takes between one and four bytes (生 takes three). That makes it difficult to size a database column: a name field declared as varchar(20) is enough for some languages, but if the database counts the length in bytes, it holds only a few Japanese characters in UTF-8, which is not enough.
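The byte counts above can be checked with a few lines of plain Java. This is only a sketch with illustrative names; it uses escapes for the kanji and for one emoji (U+1F600) to show a character that needs four bytes in UTF-8 and two char values in a Java String.

```java
import java.nio.charset.StandardCharsets;

class EncodedSize {
    // Number of bytes the text occupies once encoded as UTF-8.
    static int utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of Unicode characters (code points), not UTF-16 code units.
    static int codePoints(String text) {
        return text.codePointCount(0, text.length());
    }

    public static void main(String[] args) {
        String sei = "\u751F";          // the kanji from the slide, U+751F
        String emoji = "\uD83D\uDE00";  // U+1F600, outside the 16-bit range
        System.out.println(utf8Bytes("A"));     // 1
        System.out.println(utf8Bytes(sei));     // 3 (E7 94 9F)
        System.out.println(utf8Bytes(emoji));   // 4
        System.out.println(emoji.length());     // 2 UTF-16 code units
        System.out.println(codePoints(emoji));  // 1 character
    }
}
```

When validating input length against a column size, it helps to be explicit about which of these three counts (bytes, code units, code points) the database actually uses.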

Pronunciation : 生
• ON : Chinese-style reading for kanji.
ショウ, ショウ＿ジル, ショウ＿ズル, ジョウ, セイ, ゼイ
Shou, Shou_jiru, Shou_zuru, Jou, Sei, Zei
• KUN : Japanese-style reading for kanji.
イ＿カス, イ＿キ, イ＿キル, イ＿ケル, ウ＿マレ, ウ＿マレル, ウ＿ム, ウブ, ウマ＿レ, ウマ＿レル, オ
＿イ, オ＿ウ, キ, ナ＿ス, ナ＿ル, ナマ, ハ＿エ, ハ＿エル, ハ＿ヤス, バ＿エ
i_kasu, i_ki, i_kiru, i_keru, u_mare, u_mareru, u_mu ....
• Special reading.
アイ, イク, イケ, エ, オ, サ, ナリ, ニュウ, ヌク, フ, ブ, ム＿ス, ヨイ
ai, iku, ike, e, o, sa, nari, nyuu, nuku, fu, bu, mu_su, yoi
• In Chinese this is read: Shēng

Differences between countries
手紙 means "letter" in Japanese and "toilet paper" in Chinese.
Japanese and Chinese are very different, even if some kanji look the same.
It is like English and French: they share some letters, but can you read and understand both?

Character : 生
• 生きる Ikiru ..... live, living , alive
• 生クリーム Nama kuri-mu ..... fresh cream
• 生涯 Shougai ..... lifetime
• 生命 Seimei ..... life
• 生む Umu ..... give birth
We can see that one kanji can have many different meanings and pronunciations.
So it makes no sense at all to sort database records by kanji.
People wouldn't find the data where they expect it, and the sort would only be a Unicode code point sort, which has no meaning.
Every character is very easy to use and access; no special treatment is necessary.

Japanese Alphabet
•ひらがな Hiragana (Japanese Alphabet)

ひらがな Hiragana
• Hiragana is a Japanese syllabary, one basic component of the Japanese writing system.
• Hiragana is used to write native words for which there are no kanji, including grammatical particles, and suffixes such as さん ~san "Mr., Mrs., Miss, Ms.".
Every character is very easy to use and access; no special treatment is necessary.

Japanese Alphabet
•カタカナ Katakana (Foreign Words)

カタカナ Katakana
• Katakana is a Japanese syllabary, one component of the Japanese writing system.
• In contrast to the hiragana syllabary, which is used for those Japanese words and grammatical inflections that kanji does not cover, the katakana syllabary is primarily used for transcribing foreign-language words into Japanese.
Every character is very easy to use and access; no special treatment is necessary.

Half-width kana 半角カナ
• Half-width kana (半角カナ Hankaku kana) are katakana characters displayed at half their normal width (a 2:1 aspect ratio), instead of the usual square (1:1) aspect ratio.
• Half-width kana were used in the early days of Japanese computing, to allow Japanese characters to be displayed on the same grid as monospaced fonts of Latin characters.
• Half-width hiragana or kanji were not used.
• Half-width kana characters are not generally used today, but find some use in specific settings, such as cash register displays, shop receipts, and Japanese digital television and DVD subtitles.
注意！ (Caution!)
Characters like these can be a pain, so a good program converts half-width katakana to full-width katakana.

String s1 = "ｱﾅﾀ";
String s2 = "アナタ";
ERXStringUtilitiesEXTENDED.changeHanKatakanaToZenkakuKatakana(s1);
// RESULT = "アナタ"
s1.equalsIgnoreCase(s2);
// RESULT = false
s1.length();
// RESULT = 3
s2.length();
// RESULT = 3
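If ERXStringUtilitiesEXTENDED is not available, the JDK's Unicode normalization can serve as a stand-in. The sketch below uses java.text.Normalizer with the NFKC form, which folds half-width katakana into the standard full-width characters. This is not how the WOdka method is implemented, as far as this deck shows; the class and method names are illustrative.

```java
import java.text.Normalizer;

class KanaWidth {
    // NFKC folds compatibility forms, including half-width katakana, into their standard forms.
    static String toFullWidthKana(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String half = "\uFF71\uFF85\uFF80";   // half-width "anata"
        String full = "\u30A2\u30CA\u30BF";   // full-width "anata"
        System.out.println(half.equals(full));                   // false
        System.out.println(toFullWidthKana(half).equals(full));  // true
    }
}
```

Note that NFKC also folds full-width letters, digits and the ideographic space to their ASCII counterparts, which may or may not be wanted for a given field.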

Japanese Alphabet
•ローマ字 Romaji (English characters)

NUMBER 数字
• As with spaces, numbers also have variations:
• single byte (Hankaku)
• double byte (Zenkaku)
• Chinese character version (Kanji)

• Hankaku (single) - 0123456789
• Zenkaku - ０１２３４５６７８９
• Kanji - 0 is 零 or 〇
1 is 一 or 壱 / 2 is 二 or 弐 / 3 is 三 or 参
四五六七八九
Converting every number to single-byte digits before storing it in the database is the easy way to go.

isDigitsOnly
String s1 = "0123456789";
String s2 = "０１２３４５６７８９";
ERXStringUtilities.isDigitsOnly(s1);
// RESULT = true
ERXStringUtilities.isDigitsOnly(s2);
// RESULT = true
s1.equalsIgnoreCase(s2);
// RESULT = false

replace double to single
String s = "０１２３４５６７８９";
ERXStringUtilitiesEXTENDED.changeZenkakuNumberToHanNumber(s);
// RESULT = "0123456789"
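A rough equivalent using only the JDK, for readers without the EXTENDED utilities: Character.isDigit and Character.digit already understand full-width digits, so a conversion to ASCII digits is a short loop. The names are illustrative, and the loop also converts decimal digits from other scripts, which is an assumption about what is wanted.

```java
class DigitWidth {
    // Replaces every Unicode decimal digit (for example full-width digits) with its ASCII digit.
    static String toAsciiDigits(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isDigit(c)) {
                out.append((char) ('0' + Character.digit(c, 10)));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String zenkaku = "\uFF10\uFF11\uFF12\uFF13\uFF14\uFF15\uFF16\uFF17\uFF18\uFF19";
        System.out.println(toAsciiDigits(zenkaku));       // 0123456789
        // The kanji numeral for 1 is not a decimal digit, so it is left alone.
        System.out.println(Character.isDigit('\u4E00'));  // false
    }
}
```

Kanji numerals therefore need their own mapping if they must be accepted as numbers.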

LETTER 英字
• Everybody loves the simple 26 characters, which take two years to learn in most schools.
• Some countries have variations, like German with ÜÖÄ.

LETTER 英字
• For each letter there is a double-byte letter
• ‘U’ == ‘Ｕ’
Converting every letter to single-byte before storing it in the database is the easy way to go.

String s1 = "BC";
String s2 = "ＢＣ";
s1.equalsIgnoreCase(s2);
// RESULT = false
s1 = ERXStringUtilitiesEXTENDED.changeZenkakuEijiToHanEiji(s2);
// RESULT = "BC"

Japanese Alphabet
•記号 Kigo (Sign)

Sign 記号
• For each sign there is a double-byte counterpart
• ‘!’ == ‘！’
Converting every sign to single-byte before storing it in the database is the easy way to go.

String s1 = "!@#$%^&*()";
String s2 = "！＠＃＄％＾＆＊（）";
s1 = ERXStringUtilitiesEXTENDED.changeZenkakuKigouToHanKigou(s2);
// RESULT = "!@#$%^&*()"
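For both letters and signs, the full-width forms U+FF01 to U+FF5E mirror ASCII U+0021 to U+007E at a fixed offset, so a conversion can be sketched without any library. This is an illustration of the idea, not the implementation of the EXTENDED methods, and the names are made up for the example.

```java
class FullWidthAscii {
    // Full-width forms U+FF01..U+FF5E mirror ASCII U+0021..U+007E at a fixed offset.
    private static final int OFFSET = 0xFF01 - 0x21;

    static String toHalfWidth(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 0xFF01 && c <= 0xFF5E) {
                out.append((char) (c - OFFSET));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Full-width "BC"
        System.out.println(toHalfWidth("\uFF22\uFF23"));  // BC
        // Full-width version of the sign string from the slide
        System.out.println(toHalfWidth("\uFF01\uFF20\uFF03\uFF04\uFF05\uFF3E\uFF06\uFF0A\uFF08\uFF09"));  // !@#$%^&*()
    }
}
```

Kana, kanji and the ideographic space fall outside this range and pass through unchanged.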

SPACE スペース
• String a = " ";
• String b = "　";
a == space char
b == double-size space char
Converting every space to the single-byte space before storing it in the database is the easy way to go.

trim
// head and tail are 3 space chars
String s = "   A B C   ";
s.trim();
// RESULT = "A B C"
ERXStringUtilities.trimString(s);
ERXStringUtilitiesEXTENDED.trimStringWithZenkaku(s);

better trim
// head and tail are 3 Japanese ZENKAKU (double-byte) space chars
String s = "　　　A B C　　　";
s.trim();
// RESULT = "　　　A B C　　　"
ERXStringUtilities.trimString(s);
// RESULT = "　　　A B C　　　"
ERXStringUtilitiesEXTENDED.trimStringWithZenkaku(s);

remove Space between chars
// between A and B are 2 single spaces + 2 double spaces + 2 single spaces
String s = "A  　　  B";
s.replace(" ", "");
// RESULT = "A　　B"
ERXStringUtilities.removeCharacters(s, " ");
// RESULT = "A　　B"
ERXStringUtilitiesEXTENDED.changeZenkakuToHanKakaku(s).replace(" ", "");
// RESULT = "AB"
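The same two operations can be sketched with plain regular expressions, treating the ideographic space U+3000 like an ordinary space. The class and method names are illustrative and are not part of Wonder or WOdka.

```java
class ZenkakuSpace {
    // Strips leading and trailing ASCII whitespace and ideographic (full-width) spaces.
    static String trimAll(String text) {
        return text.replaceAll("^[\\s\\u3000]+|[\\s\\u3000]+$", "");
    }

    // Removes every ASCII space and every ideographic space.
    static String removeSpaces(String text) {
        return text.replaceAll("[ \\u3000]", "");
    }

    public static void main(String[] args) {
        String padded = "\u3000\u3000\u3000A B C\u3000\u3000\u3000";
        System.out.println(padded.trim().length());   // 11, trim() leaves the full-width spaces in place
        System.out.println(trimAll(padded));          // A B C
        System.out.println(removeSpaces("A  \u3000\u3000  B"));  // AB
    }
}
```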

Japanese Alphabet
•絵文字 Emoji (Smilies)

絵文字 Emoji (Smilies)
• Emoji (絵文字; Japanese pronunciation: [emodʑi]) is the Japanese term for the ideograms or smileys used in Japanese electronic messages and web pages.
• Emoji pictograms by au are specified using the IMG tag. SoftBank Mobile emoji are wrapped between SI/SO escape sequences, and support colors and animation. DoCoMo's emoji are the most compact to transmit, while au's version is more flexible and based on open standards.
If you are creating a CMS or a data-entry application such as a blog or forum, you will have to deal with emoji. Japanese people love to use them.

WOEmoji
At WOWODC 2012 last year I spoke about SnoWOman CMS, and there is a framework named WOEmoji. With this framework it is easy to convert emoji for saving to the database, and they will automatically work on Windows or Android devices as well.
Version 2 of this framework (I am working on it) can also convert to the new open-standard emoji that is being developed in Japan right now.
I am a paying supporter of this project and am waiting for delivery, so that WOEmoji can be updated.

Japanese Alphabet
•外字 Gaiji (Self-made characters)

外字 Gaiji (Self-made characters)
• Gaiji (外字), literally meaning "external characters", are kanji that are not represented in existing Japanese encoding systems. These include variant forms of common kanji that need to be represented alongside the more conventional glyph in reference works, and can include non-kanji symbols as well.
Win XP: it had only a few thousand kanji, and it wasn't easy to use kanji that were not available. So people started creating their own, and the look was sometimes different.
Win Vista: you can see the font is a little different. But you have to buy this 1,500-character gaiji package for about USD 500.
OS X: works out of the box and it is free.

Gaiji 外字 Editor
• This is an old gaiji editor with which users could make their own characters, and that was nice. It started with the first version of Windows. But now, with the Internet, there is a problem: many people are now realizing that such a character can be seen only on the machine it was created on. After it is sent by mail or entered into a database, it looks different on every other machine. So you need to strip out these characters and give the user feedback not to use them.

ERXStringUtilitiesEXTENDED.delete_ModelDependenceCharacters(true, s, 200, false, false);
Because I don't have a Windows machine here, I wasn't able to create a sample string, but there is a command for deleting that kind of character range.
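As a rough illustration of the idea, and not of what delete_ModelDependenceCharacters actually does, the sketch below assumes the user-defined characters live in a Unicode private use area and drops every code point of that category. The class and method names are hypothetical.

```java
class PrivateUseFilter {
    // Drops every code point whose Unicode category is "private use".
    static String stripPrivateUse(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints()
            .filter(cp -> Character.getType(cp) != Character.PRIVATE_USE)
            .forEach(out::appendCodePoint);
        return out.toString();
    }

    // True if the text contains at least one private use code point, so the UI can warn the user.
    static boolean containsPrivateUse(String text) {
        return text.codePoints().anyMatch(cp -> Character.getType(cp) == Character.PRIVATE_USE);
    }

    public static void main(String[] args) {
        String input = "A\uE000B";  // U+E000 is the first private use code point
        System.out.println(containsPrivateUse(input));  // true
        System.out.println(stripPrivateUse(input));     // AB
    }
}
```

Machine-dependent characters that have regular code points in a vendor encoding are not caught by this check and would need a separate list.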

Japanese Alphabet
•振り仮名 Furigana

Furigana 振り仮名
• Furigana (振り仮名) is a Japanese reading aid, consisting of smaller kana, or syllabic characters, printed
next to a kanji (ideographic character) or other character to indicate its pronunciation. It is typically used
to clarify rare, nonstandard or ambiguous readings, or in children's or learners' materials.

Encoding
• UTF-8
• EUC-JP
• Shift JIS
• ISO/IEC 2022
• and some more ...

UTF-8
• UTF-8 (UCS Transformation Format, 8-bit) is a variable-width encoding that can represent every character in the Unicode character set. It was designed for backward compatibility with ASCII and to avoid the complications of endianness and byte order marks in UTF-16 and UTF-32.
We now use UTF-8 for every project. With it you are mostly safe and do not have to care about other encodings, but...

EUC-JP
• EUC-JP: Extended Unix Code (EUC) is a multibyte character encoding system used primarily for Japanese, Korean, and simplified Chinese.
• The structure of EUC is based on the ISO-2022 standard, which specifies a way to represent character sets containing a maximum of 94 characters, or 8836 (94^2) characters, or 830584 (94^3) characters, as sequences of 7-bit codes. Only ISO-2022 compliant character sets can have EUC forms. Up to four coded character sets (referred to as G0, G1, G2, and G3 or as code sets 0, 1, 2, and 3) can be represented with the EUC scheme. G0 is almost always an ISO-646 compliant coded character set (e.g. US-ASCII/KS X 1003/ISO 646:KR in EUC-KR and US-ASCII/the lower half of JIS X 0201 in EUC-JP) that is invoked on GL (i.e. with the most significant bit cleared).
If you have to work with some Win machines, it can happen that you have to import data that is encoded this way.
In my experience, I have never used it.

Shift JIS
• Shift JIS (Shift Japanese Industrial Standards, also SJIS, MIME name Shift_JIS) is a character encoding for the Japanese language, originally developed by a Japanese company called ASCII Corporation in conjunction with Microsoft and standardized as JIS X 0208 Appendix 1.
This is the most widely used encoding in Japan. You can be sure that if you get data from an existing database, or have to connect to one, you will have to deal with it.
We did a lot of SJIS to UTF-8 conversion in the past.
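A conversion like that comes down to decoding the bytes with the source charset and encoding the resulting String again. The sketch below assumes the JDK in use ships the Shift_JIS charset (most full JDK distributions do); the class name is illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SjisToUtf8 {
    // Decodes Shift_JIS bytes to a String, then re-encodes that String as UTF-8.
    static byte[] convert(byte[] sjisBytes) {
        Charset sjis = Charset.forName("Shift_JIS");
        String text = new String(sjisBytes, sjis);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] sjis = { (byte) 0x90, (byte) 0xB6 };  // the kanji U+751F in Shift_JIS
        for (byte b : convert(sjis)) {
            System.out.printf("%02X ", b & 0xFF);    // E7 94 9F
        }
        System.out.println();
    }
}
```

Data that originates on Windows is often in Microsoft's variant of the encoding, which Java exposes as windows-31j, so it is worth checking which of the two a given source really uses.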

ISO/IEC 2022
• ISO/IEC 2022, Information technology - Character code structure and extension techniques, is an ISO standard (equivalent to the ECMA standard ECMA-35) specifying
• a technique for including multiple character sets in a single character encoding system, and
• a technique for representing these character sets in both 7 and 8 bit systems using the same encoding.
You only have to deal with this if you build mailing solutions, but I really don't care about it anymore; JavaMail works just fine.

Localization ローカライズ
• Localization of your App
• Localization Data
• Sorting

ERXLocalizer
// Writing components and code with ERXLocalizer makes your life very easy.
// There are so many things you can do with it, so get comfortable with it.
// Localized String from Code
ERXLocalizer.defaultLocalizer().localizedStringForKey("Nav.Main");
// Localized String in HTML
<wo:str value = "$localizer.Nav.Main" />
<wo:localized value="Nav.Main" />
* This is a bad example because I am using the power of the ‘dark force’: inline bindings. You shouldn’t do that,
* but I always use it. Sorry, I am a bad guy.

.strings
In your app's ‘Resources’ folder, create a folder named language name + ‘.lproj’.
Make the file a plist with key-value pairs,
and save it as UTF-8 rather than UTF-16:
with UTF-8 it is easier to read, and git commits can be viewed as well.

Localization of Data
1. Attributes in Entity
2. Set the data in the edit page
3. Display the attribute depending on the localizer
[[eo]].name_en()
or
[[eo]].name_ja
or
[[eo]].valueForKey("name")
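One way to picture the third step: pick the attribute whose suffix matches the current language and fall back to a default when it is empty. In this sketch a plain Map stands in for the enterprise object, and the helper name, the suffix convention and the sample values are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

class LocalizedAttribute {
    // Looks up key_<lang> first and falls back to key_<fallbackLang> if no value is stored.
    static String localizedValue(Map<String, String> row, String key, String lang, String fallbackLang) {
        String value = row.get(key + "_" + lang);
        return value != null ? value : row.get(key + "_" + fallbackLang);
    }

    public static void main(String[] args) {
        Map<String, String> row = new HashMap<>();   // stands in for the EO
        row.put("name_en", "Tokyo");
        row.put("name_ja", "\u6771\u4EAC");          // "Tokyo" in kanji
        System.out.println(localizedValue(row, "name", "en", "en"));  // Tokyo
        System.out.println(localizedValue(row, "name", "de", "en"));  // Tokyo (fallback)
    }
}
```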

Sorting 1
• name (how it is written)
• furigana (how it is pronounced)

Sorting 2
Two persons, each written in 漢字 Kanji (Chinese characters) and in ひらがな Hiragana or カタカナ Katakana (Japanese Alphabet):
• 林 = はやし = Mr. Hayashi
• 森 = もり = Mr. Mori
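The name / furigana split can be sketched as two comparators over the same records. The example uses the two names from the slide plus a third, hypothetical entry (Aoki) to make the difference visible; the class is illustrative and compares plain code points, which is only a rough approximation of proper Japanese collation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class FuriganaSort {
    static final class Person {
        final String name;      // how it is written
        final String furigana;  // how it is pronounced
        Person(String name, String furigana) {
            this.name = name;
            this.furigana = furigana;
        }
    }

    static final Comparator<Person> BY_NAME = Comparator.comparing(p -> p.name);
    static final Comparator<Person> BY_FURIGANA = Comparator.comparing(p -> p.furigana);

    // Returns the written names in the order given by the comparator.
    static List<String> namesSortedBy(List<Person> people, Comparator<Person> order) {
        List<Person> copy = new ArrayList<>(people);
        copy.sort(order);
        List<String> names = new ArrayList<>();
        for (Person p : copy) {
            names.add(p.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Person> people = Arrays.asList(
            new Person("\u68EE", "\u3082\u308A"),               // Mori
            new Person("\u6797", "\u306F\u3084\u3057"),         // Hayashi
            new Person("\u9752\u6728", "\u3042\u304A\u304D"));  // Aoki (hypothetical third entry)
        System.out.println(namesSortedBy(people, BY_NAME));      // Hayashi, Mori, Aoki: code point order
        System.out.println(namesSortedBy(people, BY_FURIGANA));  // Aoki, Hayashi, Mori: reading order
    }
}
```

Sorting on the furigana column gives the order a Japanese reader expects, which is why the reading is stored next to the written name.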

WOdka improvements
• Language-switching

WOdkaLanguageEnums
• Language name
• Locale Code
• Date format + 24 hours setting
• Data for Flag information

WOdkaCountryEnums
• Country name
• code2 : ISO Code for Country
• code3 : ISO Code for Country
• money : ERXMoneyEnums
• language :WOdkaLanguageEnums
• telephone code
• tax : tax info
• zip : zip format
• company Mailing Format
• family Mailing Format
• Localized words : male, female, sexMale, sexFemale
• flag : path to flag data
• continent : ERXContinentEnums
• EU : ERXEuropeanUnionsEnums

family Mailing Format
"[S][CR][T][_][F][_][L]"
"[L] [F]様"
s = sex
t = title
f = first name
l = last name
cr = next line
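To make the format strings concrete, here is a small sketch that expands the bracketed tokens. It is not the WOdka implementation; the method is hypothetical, and reading "[_]" as a single space is an assumption, since the legend does not list it.

```java
class MailingFormat {
    // Expands the bracketed tokens of a format string; "[CR]" becomes a line break
    // and "[_]" is assumed to stand for a space.
    static String format(String pattern, String sex, String title, String first, String last) {
        return pattern
            .replace("[CR]", "\n")
            .replace("[_]", " ")
            .replace("[S]", sex)
            .replace("[T]", title)
            .replace("[F]", first)
            .replace("[L]", last);
    }

    public static void main(String[] args) {
        // Western style: salutation on its own line, then title, first name, last name.
        System.out.println(format("[S][CR][T][_][F][_][L]", "Mr.", "Dr.", "Ken", "Ishimoto"));
        // Japanese style: last name, first name, then the honorific "sama" (U+69D8).
        System.out.println(format("[L] [F]\u69D8", "", "", "Ken", "Ishimoto"));
    }
}
```

Because the tokens are replaced one after another, a value that itself contains a token such as "[F]" would be expanded again; a production version would want a single-pass parser.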

Thanks to
• Masahiko TANI - A10 Objects Inc., (Japan)
• Hiroyuki FUKUI - Astonish Create (Japan)
Special Thanks to
• Paul YU - Green orchid llc (USA)
