In this talk, Wanasit shares what he learned about Japanese NLP while trying to build a Japanese tokenizer from scratch.
Doing Natural Language Processing (NLP) or text processing for Japanese has many challenges. One of the most basic and obvious problems is tokenization (i.e. splitting text into a list of words).
Unlike English, where words are typically separated by spaces, Japanese text (e.g. 日本語の自然言語処理を行うには…) has no such rule of thumb for splitting. It requires tokenizers and NLP tools to be a lot more sophisticated.
2. About Me
● Github: @wanasit
○ Text / NLP projects
● Manager, Software Engineer @ Indeed
○ Search Quality (Metadata) team
○ Work on NLP problems for Jobs / Resumes
3. Disclaimer
1. This talk is NOT related to any of Indeed’s technology
2. I’m not Japanese (or a native speaker)
○ But I built a Japanese tokenizer in my free time
4. Today’s Topics
● NLP and Tokenization (for Japanese)
● Lattice-based Tokenizers (MeCab-style tokenizers)
● How it works
○ Dictionary
○ Tokenization
6. NLP and Tokenization
● How does a computer represent text?
● String (or Char[ ] or Byte[ ] )
■ "Abc"
■ "Hello World"
7. NLP and Tokenization
"Biden is projected winner in Michigan,
Wisconsin as tense nation watch final tally"
Source: NBC News
8. NLP and Tokenization
"Biden is projected winner in Michigan,
Wisconsin as tense nation watch final tally"
● What’s the topic?
● Who is winning? where?
Source: NBC News
10. NLP and Tokenization
● Tokenization / Segmentation
● The first step in solving an NLP problem is usually
identifying the words in the string
○ Input: string, char[ ] (or byte[ ])
○ Output: a list of meaningful words (or tokens)
11. NLP and Tokenization
"Biden is projected winner in Michigan, Wisconsin as
tense nation watch final tally".split(/\W+/)
> ["Biden", "is", "projected", "winner", "in", ...]
15. Japanese Tokenization
● Use prior Japanese knowledge (Dictionary)
○ が, に, …, 氏, 州, …, バイデン
● Consider the context and combination of characters
● Consider the likelihood
○ e.g. 東京都 => [東京, 都], or [東, 京都]
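The ambiguity above can be made concrete with a small sketch: a toy hand-made dictionary and a brute-force enumeration of every possible split (for illustration only; real tokenizers do not enumerate all segmentations).

```python
# Toy sketch: enumerate every way to split a string into words
# from a small hand-made dictionary, to show the ambiguity.
def segmentations(text, dictionary):
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in dictionary:
            # A valid word here; recurse on the remainder of the string.
            for rest in segmentations(text[end:], dictionary):
                results.append([word] + rest)
    return results

toy_dictionary = {"東", "京都", "東京", "都"}
print(segmentations("東京都", toy_dictionary))
# Both [東京, 都] and [東, 京都] are valid splits;
# the tokenizer has to pick the more likely one.
```

With only four dictionary entries there are already two valid segmentations, which is why a likelihood model (the costs below) is needed.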
17. Lattice-based Tokenizers
● aka. MeCab-style tokenizers (or Viterbi tokenizers)
● How:
○ Start from a Dictionary (required)
○ Build a Lattice (a graph) from the dictionary terms whose surface forms appear in the input
○ Run the Viterbi algorithm to find the best connected path
18. Lattice-Based Tokenizers
● Most tokenizers are re-implementations of MeCab (C/C++) on
different platforms:
○ Kuromoji, Sudachi (Java), Kotori (Kotlin)
○ Janome, SudachiPy (Python)
○ Kagome (Go)
○ ...
19. Non-Lattice-Based Tokenizers
● Is Lattice-based the only approach?
● Mostly yes, but there are also:
○ Juman++, Nagisa (RNN)
○ SentencePiece (Unsupervised, used in BERT)
● Out-of-scope of this presentation
21. Dictionary
● Lattice-based tokenizers need a dictionary
○ To recognize predefined terms and grammar
● Dictionaries can often be downloaded as plugins, e.g.
○ $ brew install mecab
○ $ brew install mecab-ipadic
24. Dictionary - Term Table
● Surface Form: how the term appears in the string
● Context ID (left/right): IDs used for connecting terms
together (see later)
● Cost: how common the term is
○ The higher the cost, the less common or less likely the term
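A term-table row can be sketched as a small record. The field values below are hypothetical, made up for illustration (real rows come from a dictionary such as IPADIC and also carry part-of-speech details):

```python
# Hedged sketch of term-table rows; all IDs and costs are hypothetical.
from collections import namedtuple

Term = namedtuple("Term", ["surface", "left_id", "right_id", "cost"])

terms = [
    Term(surface="東京", left_id=1293, right_id=1293, cost=3000),
    Term(surface="京都", left_id=1293, right_id=1293, cost=4000),
]

# Lower cost = more common: of these two terms, 東京 is treated as more likely.
most_common = min(terms, key=lambda t: t.cost)
print(most_common.surface)  # 東京
```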
25. Dictionary - Connection Table / Connection Cost
Context ID (from) | Context ID (to) | Cost
...               | ...             | ...
992               | 992             | 3003
992               | 993             | 2135
...               | ...             | ...
992               | 1293            | -1000
992               | 1294            | -1000
...               | ...             | ...
● The connection cost between types of terms
● The lower the cost, the more likely the connection
● e.g.
○ 992 (v-ru) then 992 (v-ru): cost = 3003 (unlikely)
○ 992 (v-ru) then 1294 (noun): cost = -1000 (likely)
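In code, the connection table is essentially a lookup keyed by the pair of context IDs. A sketch using the example costs from the slide:

```python
# Sketch of the connection table as a dict keyed by (from_id, to_id),
# using the example costs from the slide above.
connection_cost = {
    (992, 992): 3003,    # v-ru followed by v-ru: high cost, unlikely
    (992, 993): 2135,
    (992, 1293): -1000,  # v-ru followed by a noun: negative cost, likely
    (992, 1294): -1000,
}

# The total cost of a path is the sum of term costs plus the
# connection costs between adjacent terms; here we look up one edge.
print(connection_cost[(992, 1294)])  # -1000
```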
26. Dictionary - Term Table
Term table size:
● Kotori (default) ~380,000 terms (3.7 MB)
● MeCab-IPADict ~400,000 terms (12.2 MB)
● Sudachi - Small ~750,000 terms (39.8 MB)
● Sudachi - Full ~2,800,000 terms (121 MB)
○ Includes terms like: "ヽ(`ー`)ノ"
28. Dictionary - Term Table
● What about words not in the table?
○ e.g. "ワナシット タナキットルンアン"
○ The “Unknown-Term Extraction” problem
○ Typically handled by heuristic rules
■ e.g. if there is a run of consecutive katakana, treat it as a noun
● Out-of-scope of this presentation
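One such heuristic can be sketched as a regular expression over the katakana Unicode block. This is a simplified assumption for illustration, not MeCab’s actual rule set:

```python
# Sketch of a common unknown-word heuristic: treat a maximal run of
# katakana characters as one candidate (noun) token. The character
# range and the rule itself are simplifications, not MeCab's logic.
import re

# U+30A0-U+30FF is the katakana block (includes the long-vowel mark ー).
KATAKANA_RUN = re.compile(r"[\u30A0-\u30FF]+")

def extract_katakana_runs(text):
    """Return every maximal run of katakana characters in the text."""
    return KATAKANA_RUN.findall(text)

print(extract_katakana_runs("ワナシット タナキットルンアン"))
```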
30. Lattice-Based Tokenization
Given:
● The Dictionary
● Input: "東京都に住む"
Tokenizer:
1. Find all terms in the input
and build a lattice
2. Find the minimum cost
path through the lattice
32. Step 1: Finding all terms
● For each index i
○ find all dictionary terms starting at the i-th location
● This is a string / pattern-matching problem
○ It requires an efficient lookup data structure for the dictionary
○ e.g. a Trie or a Finite State Transducer (FST)
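This per-index lookup (a “common prefix search”) can be sketched with a plain trie. This is a minimal illustration; real implementations use more compact structures such as double-array tries or FSTs:

```python
# Sketch of dictionary lookup with a trie: for a start index i, find
# every dictionary term that begins at position i in the text.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_term = False

def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_term = True
    return root

def terms_starting_at(trie, text, i):
    """Return all dictionary terms appearing in text at start index i."""
    matches, node = [], trie
    for j in range(i, len(text)):
        node = node.children.get(text[j])
        if node is None:
            break  # no term continues with this character
        if node.is_term:
            matches.append(text[i:j + 1])
    return matches

trie = build_trie(["東", "東京", "東京都", "京都", "都", "に", "住む"])
print(terms_starting_at(trie, "東京都に住む", 0))
# ['東', '東京', '東京都']
```

Running this at every index of the input produces all the lattice nodes in one left-to-right pass.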
33. Step 2: Finding minimum cost
● Viterbi Algorithm (Dynamic Programming)
● For each node, from left to right:
○ Find the minimum-cost path leading to that node
○ Reuse the selected path when considering the following
nodes
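The minimum-cost search can be sketched as follows. The per-term costs are hypothetical and connection costs are omitted for brevity; `min_cost[i]` is the cheapest cost of any segmentation of the first i characters:

```python
# Minimal Viterbi sketch over a toy lattice (assumed costs, not from a
# real dictionary). Terms are edges between character positions.
def tokenize(text, term_costs):
    n = len(text)
    INF = float("inf")
    min_cost = [0.0] + [INF] * n      # min_cost[i]: best cost for text[:i]
    best_prev = [None] * (n + 1)      # (start, term) achieving min_cost[end]
    for i in range(n):
        if min_cost[i] == INF:
            continue  # position i is unreachable
        for term, cost in term_costs.items():
            j = i + len(term)
            if text[i:j] == term and min_cost[i] + cost < min_cost[j]:
                min_cost[j] = min_cost[i] + cost
                best_prev[j] = (i, term)
    # Backtrack from the end to recover the minimum-cost path.
    tokens, i = [], n
    while i > 0:
        start, term = best_prev[i]
        tokens.append(term)
        i = start
    return list(reversed(tokens))

# Hypothetical per-term costs (lower = more likely):
costs = {"東": 5000, "京都": 4000, "東京": 3000, "都": 3500,
         "に": 1000, "住む": 2000}
print(tokenize("東京都に住む", costs))
# ['東京', '都', 'に', '住む']
```

Because each `min_cost[i]` is computed once and reused, the search stays linear in the number of lattice edges rather than exponential in the number of segmentations.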
35. Introduction to Japanese Tokenizers
● Introduction to NLP and Tokenization
● Lattice-based tokenizers (MeCab and others)
○ Dictionary
■ Term table, Connection Cost, ...
○ Tokenization Algorithms
■ Pattern Matching, Viterbi Algorithm, ...
36. Learn more:
● Kotori (on GitHub), a Japanese tokenizer written in Kotlin
○ Small and performant (fastest among JVM-based tokenizers)
○ Supports multiple dictionary formats
● Article: How Japanese Tokenizers Work (by Wanasit)
● Article: 日本語形態素解析の裏側を覗く! (“A peek behind the scenes of Japanese morphological analysis”) (by Cookpad Developers)
● Book: 自然言語処理の基礎 (“Foundations of Natural Language Processing”) (by Manabu Okumura)