Yukino Ikegami, Setsuo Tsuruta.
Hybrid method for modeless Japanese input using N-gram based binary classification and dictionary.
Multimedia Tools and Applications, Volume 74, Issue 11, pp. 3933–3946, 2015.
Modeless Japanese Input Method
1. Hybrid method for modeless Japanese input using N-gram based binary classification and dictionary
Yukino Ikegami
Setsuo Tsuruta
2014/01/20
2. Necessity of Japanese Input Method
• Japanese has many characters
– Kana
• Hiragana
– 81 characters, e.g. いろはにほへと
• Katakana
– 81 characters, e.g. イロハニホヘト
– Kanji (Chinese characters)
• More than 6,000 characters, e.g. 以呂波仁保反止
• These characters cannot be typed directly on a keyboard
→ A Japanese input method (converting alphabetic input to Japanese characters) is necessary
3. If every Japanese character were assigned its own key…
• Far too many keys!
• A Japanese input method is necessary
4. Japanese Input Method
-Roman to Kana-Kanji Converter-
• Flow
1. Receive the Romanized (alphabetic) input
2. Convert the Romanized input into Kana using a Roman-to-Kana table
3. Convert Kana into Kanji (if necessary)
①n e k o d e s u
②ねこです
③猫です
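The three-step flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tables `ROMAN_TO_KANA` and `KANA_TO_KANJI` are tiny hypothetical stand-ins, and the Kana-to-Kanji step is a plain dictionary lookup rather than the candidate ranking a real converter performs.

```python
# Toy Roman-to-Kana table (hypothetical); a real table has hundreds of entries.
ROMAN_TO_KANA = {
    "a": "あ", "re": "れ", "ha": "は",
    "ne": "ね", "ko": "こ", "de": "で", "su": "す",
}
# Toy Kana-to-Kanji dictionary (hypothetical).
KANA_TO_KANJI = {"ねこ": "猫"}


def roman_to_kana(text, table=ROMAN_TO_KANA):
    """Step 2: greedy longest-match conversion; unknown characters pass through."""
    longest = max(len(key) for key in table)
    out, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += size
                break
        else:
            # No table entry starts here: keep the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


def kana_to_kanji(kana, dictionary=KANA_TO_KANJI):
    """Step 3: replace known readings with Kanji (toy lookup only)."""
    for reading, kanji in dictionary.items():
        kana = kana.replace(reading, kanji)
    return kana


kana = roman_to_kana("nekodesu")   # step 1 input, step 2 output
print(kana, kana_to_kanji(kana))   # ねこです 猫です
```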
5. Problems with Japanese Input Methods
• The user must switch input modes between Japanese and ASCII
e.g. to input ‘あれは8Byteです’ (That is 8 bytes):
areha [Return] [ASCII Mode] 8Byte [Japanese Mode] desu
(two mode switches)
• Switching is cumbersome!
6. Adding Terms to the Dictionary for the Mode-Switching Problem
• Add terms from other languages to the dictionary of a conventional input method editor
• Shortcomings
– New terms are created continuously
– Homograph problem
7. Related Work
• Modeless Pinyin-Chinese Input [Chen et al. 2000]
– Converts alphabetic input (Pinyin) to Chinese
– Uses only word-surface features for classification
• Type-Any [Ehara et al. 2009]
– Converts alphabetic input to any language
– Requires pressing a delimiter key when converting
– Uses only word-surface features for classification
8. Approach
-Modeless Japanese Input Method-
• Automatically switch the input mode
1. Generate a discriminative model with a Support Vector Machine (SVM)
– The model describes multiple n-gram features
2. Use the discriminative model to decide whether each segment of the alphabet sequence is Kana or not
– e.g.
nekohacatdesu → nekoha / cat / desu → ねこはcatです
Japanese / English / Japanese
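Once each character has a Japanese / non-Japanese decision, the segmentation in the example is a simple grouping of consecutive characters with the same label. The sketch below assumes the per-character labels are already available (here they are written by hand; in the proposed method they would come from the discriminative model). The function name `split_segments` is hypothetical.

```python
from itertools import groupby


def split_segments(text, is_japanese):
    """Group consecutive characters that share the same label.

    `is_japanese` holds one boolean per character of `text`.
    Returns a list of (segment, is_japanese) pairs.
    """
    pairs = zip(text, is_japanese)
    return [
        ("".join(ch for ch, _ in group), flag)
        for flag, group in groupby(pairs, key=lambda pair: pair[1])
    ]


text = "nekohacatdesu"
# Hand-written labels standing in for classifier output:
# "nekoha" (6 chars) Japanese, "cat" (3) English, "desu" (4) Japanese.
labels = [True] * 6 + [False] * 3 + [True] * 4
print(split_segments(text, labels))
```

Each segment flagged as Japanese would then be passed to Kana conversion, and the others left as typed.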
9. Main Flow of Modeless Japanese Input Method
[Flowchart]
• Input: user input (alphabet sequence)
• Loop: each character in the user input
• Decision: is the character still ASCII? (True / False)
• Action: Kana conversion
• Resources: discriminative model, non-Japanese dictionary
• Output: system response (Kana and alphabet sequence)
10. Flow of Generating the Discriminative Model
1. Load texts
• 猫はcatです
2. Kanji to Kana
• Using the Japanese morphological analyzer MeCab
• ネコハcatデス
3. Kana to ASCII
• Using a Kana-to-ASCII table (the one used by Google Japanese Input)
• nekohacatdesu
4. ASCII to n-gram
• Character-surface: ne, ek, nek, ko, eko, oh, koh, ha, oha...
• Character-type: LL, LL, LLL, LL, LLL, LL, LLL...
• History: KK, KK, KKK, KK, KKK, KKK...
5. n-gram to ID
• 1, 3, 4, 13, 22...
6. Describe as binary model
• 1:1, 3:1, 4:1, 13:1, 32:1...
7. Learning on SVM
• 1.344, 0.691, 0.023, -1.398...
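The middle steps of this flow (ASCII to n-gram, n-gram to ID, binary description) can be illustrated as follows. This is a sketch under assumptions, not the authors' code: IDs are assigned in order of first appearance, so they will not match the IDs on the slide, and the output line imitates the sparse `index:value` text format used by tools such as LIBSVM. `FeatureIndex` and `to_sparse_line` are hypothetical names.

```python
def surface_ngrams(text, n_min=2, n_max=3):
    """List character n-grams in the order shown on the slide (by end position)."""
    grams = []
    for end in range(n_min, len(text) + 1):
        for n in range(n_min, n_max + 1):
            if end - n >= 0:
                grams.append(text[end - n:end])
    return grams


class FeatureIndex:
    """Maps each distinct feature string to a positive integer ID."""

    def __init__(self):
        self.ids = {}

    def id_of(self, feature):
        # len() is evaluated before insertion, so the first feature gets ID 1.
        return self.ids.setdefault(feature, len(self.ids) + 1)


def to_sparse_line(label, features, index):
    """Describe a sample as binary features: every present ID gets value 1."""
    ids = sorted({index.id_of(f) for f in features})
    return f"{label:+d} " + " ".join(f"{i}:1" for i in ids)


index = FeatureIndex()
grams = surface_ngrams("nekoha")
print(grams)                              # ['ne', 'ek', 'nek', 'ko', 'eko', ...]
print(to_sparse_line(+1, grams, index))   # label +1 assumed to mean "Kana"
```

Lines in this form could then be handed to an SVM trainer, which learns one weight per feature ID.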
11. n-gram Features
あ れ は 8 B y t e
a r e h a 8 B y t e
(with n-gram upper limit n = 2, window size m = 2, and focus point x_i = the 2nd “a”)
• Character-Surface
– Substrings before and after the focus point
– e.g. -2/ha -1/a8 0/8B 1/By
• Character-Type
– Upper-case (U), lower-case (L), number (N), and symbol (S)
– e.g. -2/LL -1/LN 0/NU 1/UL
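The example features can be reproduced with a short function. The indexing convention here is an assumption chosen so that the output matches the slide: with the focus on the 2nd "a" (index 4 of `areha8Byte`), the n-gram labelled with offset k starts k + 1 characters after the focus. Only n-grams of exactly length n are produced; `window_features` and `char_type` are hypothetical names.

```python
def char_type(ch):
    """Classify a character as Upper (U), Lower (L), Number (N), or Symbol (S)."""
    if ch.isupper():
        return "U"
    if ch.islower():
        return "L"
    if ch.isdigit():
        return "N"
    return "S"


def window_features(text, focus, n=2, m=2):
    """Return (surface, type) features for offsets -m .. m-1 around `focus`."""
    surface, types = [], []
    for offset in range(-m, m):
        start = focus + offset + 1  # assumed convention, see lead-in
        gram = text[start:start + n]
        if start < 0 or len(gram) < n:
            continue  # window runs past the edge of the input
        surface.append(f"{offset}/{gram}")
        types.append(f"{offset}/" + "".join(char_type(c) for c in gram))
    return surface, types


surface, types = window_features("areha8Byte", focus=4)
print(surface)  # ['-2/ha', '-1/a8', '0/8B', '1/By']
print(types)    # ['-2/LL', '-1/LN', '0/NU', '1/UL']
```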
12. Generating Non-Japanese Dictionary
• Words that never appear in Japanese-only text
– Longer than 5 characters
– Contain a substring that cannot be converted to Kana
• Source
– Corpus of Contemporary American English (COCA)
– Japanese Wikipedia article title list
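A rough sketch of the two word-level filters follows. Several things here are assumptions: the slide does not say how the conditions combine (this sketch requires both), the Kana-convertibility check uses a heavily simplified romanization pattern (optional consonant plus vowel, or a lone "n") that ignores spellings such as "shi", "tsu", or "kya", and the check against Japanese-only text is omitted. The names `kana_convertible` and `non_japanese_words` are hypothetical.

```python
import re

# Simplified romanized-Kana pattern (assumption): sequences of
# [optional consonant + vowel] or a lone "n".
KANA_LIKE = re.compile(r"(?:[kstnhmyrwgzdbp]?[aiueo]|n)+")


def kana_convertible(word):
    """True if the whole word fits the simplified romanized-Kana pattern."""
    return KANA_LIKE.fullmatch(word.lower()) is not None


def non_japanese_words(candidates, min_length=5):
    """Keep words longer than `min_length` that cannot be fully converted to Kana."""
    return [
        word for word in candidates
        if len(word) > min_length and not kana_convertible(word)
    ]


words = ["keyboard", "banana", "cat", "desktop"]
print(non_japanese_words(words))  # ['keyboard', 'desktop']
```

Here "banana" is dropped because it reads as valid romanized Kana, and "cat" is dropped only because it is too short.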
13. Comparison with Conventional IME
• Conventional method
areha [Return] [Alphabet Mode] 8Byte [Japanese Mode] desu
(two mode switches)
Keystrokes: 17
• Modeless Japanese input method
areha8Bytedesu
Keystrokes: 14
• The number of keystrokes decreases
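The two counts can be checked with a few lines of arithmetic. The counting rule is an assumption inferred from the totals: each bracketed key counts as one keystroke, each letter or digit counts as one, and Shift is not counted.

```python
def keystrokes(tokens):
    """Count one keystroke per bracketed key and one per typed character."""
    return sum(1 if token.startswith("[") else len(token) for token in tokens)


conventional = ["areha", "[Return]", "[Alphabet Mode]", "8Byte", "[Japanese Mode]", "desu"]
modeless = ["areha8Bytedesu"]

print(keystrokes(conventional))  # 5 + 1 + 1 + 5 + 1 + 4 = 17
print(keystrokes(modeless))      # 14
```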
14. Datasets Used in the Evaluation Experiment
• Model generation and method evaluation
– Balanced Corpus of Contemporary Written Japanese (BCCWJ)
• Books, magazines, blogs, government documents, and others
• Non-Japanese dictionary sources
– COCA
– Japanese Wikipedia article title list
17. User test
• Outperforms conventional method
Person No.        1      2      3      4      5      6      7      8      9
Conventional IME  18.18  17.89  15.4   12.71  11.09  10.18  11.42  12.38  10.48
Proposed method   13.34  14.68  9.88   12.23  6.03   7.00   11.03  11.37  10.30
…
• Participants: 4 females and 7 males
• Task: input example sentences (chat, mail, and technical text)
18. Summary
• Switching input modes is cumbersome
• Hybrid Modeless Japanese Input Method
– Automatically switches the input mode between Japanese and ASCII
– Uses an n-gram feature model for discrimination
• character-{surface, type}
– Outperforms conventional methods