In this talk, Wanasit shares what he learned about Japanese NLP while trying to build a Japanese tokenizer from scratch.
Doing Natural Language Processing (NLP) or text processing for Japanese has many challenges. One of the most basic and obvious problems is tokenization (i.e. splitting text into a list of words).
Unlike English, where words are typically separated by spaces, Japanese text (e.g. 日本語の自然言語処理を行うには…) has no such rule of thumb for splitting. It requires tokenizers and NLP tools to be a lot more sophisticated.
2. About Me
● Github: @wanasit
○ Text / NLP projects
● Manager, Software Engineer @ Indeed
○ Search Quality (Metadata) team
○ Work on NLP problems for Jobs / Resumes
3. Disclaimer
1. This talk is NOT related to any of Indeed’s technology
2. I’m not Japanese (or a native speaker)
○ But I built a Japanese tokenizer in my free time
4. Today’s Topics
● NLP and Tokenization (for Japanese)
● Lattice-based Tokenizers (MeCab-style tokenizers)
● How it works
○ Dictionary
○ Tokenization
6. NLP and Tokenization
● How does a computer represent text?
● String (or Char[ ] or Byte[ ] )
■ "Abc"
■ "Hello World"
7. NLP and Tokenization
"Biden is projected winner in Michigan,
Wisconsin as tense nation watch final tally"
Source: NBC News
8. NLP and Tokenization
"Biden is projected winner in Michigan,
Wisconsin as tense nation watch final tally"
● What’s the topic?
● Who is winning? where?
Source: NBC News
10. NLP and Tokenization
● Tokenization / Segmentation
● The first step in solving an NLP problem is usually
identifying the words in the string
○ Input: string, char[ ] (or byte[ ])
○ Output: a list of meaningful words (or tokens)
11. NLP and Tokenization
"Biden is projected winner in Michigan, Wisconsin as
tense nation watch final tally".split(/\W+/)
> ["Biden", "is", "projected", "winner", "in", ...]
15. Japanese Tokenization
● Use prior Japanese knowledge (Dictionary)
○ が, に, …, 氏, 州, …, バイデン
● Consider the context and combination of characters
● Consider the likelihood
○ e.g. 東京都 => [東京, 都], or [東, 京都]
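The ambiguity above can be made concrete with a small sketch: a toy hand-made dictionary and a brute-force enumeration of every possible split (for illustration only; real tokenizers do not enumerate all segmentations).

```python
# Toy sketch: enumerate every way to split a string into words
# from a small hand-made dictionary, to show the ambiguity.
def segmentations(text, dictionary):
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in dictionary:
            # A valid word here; recurse on the remainder of the string.
            for rest in segmentations(text[end:], dictionary):
                results.append([word] + rest)
    return results

toy_dictionary = {"東", "京都", "東京", "都"}
print(segmentations("東京都", toy_dictionary))
# Both [東京, 都] and [東, 京都] are valid splits;
# the tokenizer has to pick the more likely one.
```

With only four dictionary entries there are already two valid segmentations, which is why a likelihood model (the costs below) is needed.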
17. Lattice-based Tokenizers
● aka. MeCab-style tokenizers (or Viterbi tokenizers)
● How:
○ Start from a Dictionary (required)
○ Build a Lattice (a graph) from the dictionary terms whose surface forms appear in the input
○ Run the Viterbi algorithm to find the best connected path
18. Lattice-Based Tokenizers
● Most tokenizers are re-implementations of MeCab (C/C++) on
different platforms:
○ Kuromoji, Sudachi (Java), Kotori (Kotlin)
○ Janome, SudachiPy (Python)
○ Kagome (Go)
○ ...
19. Non-Lattice-Based Tokenizers
● Is Lattice-based the only approach?
● Mostly yes, but there are also:
○ Juman++, Nagisa (RNN)
○ SentencePiece (Unsupervised, used in BERT)
● Out-of-scope of this presentation
21. Dictionary
● Lattice-based tokenizers need a dictionary
○ To recognize predefined terms and grammar
● Dictionaries can often be downloaded as plugins, e.g.
○ $ brew install mecab
○ $ brew install mecab-ipadic
24. Dictionary - Term Table
● Surface Form: how the term appears in the string
● Context ID (left/right): IDs used for connecting terms
together (see later)
● Cost: how common the term is
○ The higher the cost, the less common or less likely the term
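A term-table row can be sketched as a small record. The field values below are hypothetical, made up for illustration (real rows come from a dictionary such as IPADIC and also carry part-of-speech details):

```python
# Hedged sketch of term-table rows; all IDs and costs are hypothetical.
from collections import namedtuple

Term = namedtuple("Term", ["surface", "left_id", "right_id", "cost"])

terms = [
    Term(surface="東京", left_id=1293, right_id=1293, cost=3000),
    Term(surface="京都", left_id=1293, right_id=1293, cost=4000),
]

# Lower cost = more common: of these two terms, 東京 is treated as more likely.
most_common = min(terms, key=lambda t: t.cost)
print(most_common.surface)  # 東京
```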
25. Dictionary - Connection Table / Connection Cost
Context ID (from) | Context ID (to) | Cost
...               | ...             | ...
992               | 992             | 3003
992               | 993             | 2135
...               | ...             | ...
992               | 1293            | -1000
992               | 1294            | -1000
...               | ...             | ...
● The connection cost between types of terms
● The lower the cost, the more likely the connection
● e.g.
○ 992 (v-ru) then 992 (v-ru): cost = 3003 (unlikely)
○ 992 (v-ru) then 1294 (noun): cost = -1000 (likely)
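In code, the connection table is essentially a lookup keyed by the pair of context IDs. A sketch using the example costs from the slide:

```python
# Sketch of the connection table as a dict keyed by (from_id, to_id),
# using the example costs from the slide above.
connection_cost = {
    (992, 992): 3003,    # v-ru followed by v-ru: high cost, unlikely
    (992, 993): 2135,
    (992, 1293): -1000,  # v-ru followed by a noun: negative cost, likely
    (992, 1294): -1000,
}

# The total cost of a path is the sum of term costs plus the
# connection costs between adjacent terms; here we look up one edge.
print(connection_cost[(992, 1294)])  # -1000
```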
26. Dictionary - Term Table
Term table size:
● Kotori (default) ~380,000 terms (3.7 MB)
● MeCab-IPADict ~400,000 terms (12.2 MB)
● Sudachi - Small ~750,000 terms (39.8 MB)
● Sudachi - Full ~2,800,000 terms (121 MB)
○ Includes terms like: "ヽ(`ー`)ノ"
28. Dictionary - Term Table
● What about words not in the table?
○ e.g. "ワナシット タナキットルンアン"
○ The “Unknown-Term Extraction” problem
○ Typically handled by heuristic rules
■ e.g. if there is a run of consecutive katakana, treat it as a noun
● Out-of-scope of this presentation
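One such heuristic can be sketched as a regular expression over the katakana Unicode block. This is a simplified assumption for illustration, not MeCab’s actual rule set:

```python
# Sketch of a common unknown-word heuristic: treat a maximal run of
# katakana characters as one candidate (noun) token. The character
# range and the rule itself are simplifications, not MeCab's logic.
import re

# U+30A0-U+30FF is the katakana block (includes the long-vowel mark ー).
KATAKANA_RUN = re.compile(r"[\u30A0-\u30FF]+")

def extract_katakana_runs(text):
    """Return every maximal run of katakana characters in the text."""
    return KATAKANA_RUN.findall(text)

print(extract_katakana_runs("ワナシット タナキットルンアン"))
```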
30. Lattice-Based Tokenization
Given:
● The Dictionary
● Input: "東京都に住む"
Tokenizer:
1. Find all terms in the input
and build a lattice
2. Find the minimum cost
path through the lattice
32. Step 1: Finding all terms
● For each index i
○ find all dictionary terms starting at the i-th location
● This is a string / pattern-matching problem
○ It requires an efficient lookup data structure for the dictionary
○ e.g. a Trie or a Finite State Transducer (FST)
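This per-index lookup (a “common prefix search”) can be sketched with a plain trie. This is a minimal illustration; real implementations use more compact structures such as double-array tries or FSTs:

```python
# Sketch of dictionary lookup with a trie: for a start index i, find
# every dictionary term that begins at position i in the text.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_term = False

def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_term = True
    return root

def terms_starting_at(trie, text, i):
    """Return all dictionary terms appearing in text at start index i."""
    matches, node = [], trie
    for j in range(i, len(text)):
        node = node.children.get(text[j])
        if node is None:
            break  # no term continues with this character
        if node.is_term:
            matches.append(text[i:j + 1])
    return matches

trie = build_trie(["東", "東京", "東京都", "京都", "都", "に", "住む"])
print(terms_starting_at(trie, "東京都に住む", 0))
# ['東', '東京', '東京都']
```

Running this at every index of the input produces all the lattice nodes in one left-to-right pass.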
33. Step 2: Finding minimum cost
● Viterbi Algorithm (Dynamic Programming)
● For each node, from left to right:
○ Find the minimum-cost path leading to that node
○ Reuse the selected path when considering the following
nodes
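The minimum-cost search can be sketched as follows. The per-term costs are hypothetical and connection costs are omitted for brevity; `min_cost[i]` is the cheapest cost of any segmentation of the first i characters:

```python
# Minimal Viterbi sketch over a toy lattice (assumed costs, not from a
# real dictionary). Terms are edges between character positions.
def tokenize(text, term_costs):
    n = len(text)
    INF = float("inf")
    min_cost = [0.0] + [INF] * n      # min_cost[i]: best cost for text[:i]
    best_prev = [None] * (n + 1)      # (start, term) achieving min_cost[end]
    for i in range(n):
        if min_cost[i] == INF:
            continue  # position i is unreachable
        for term, cost in term_costs.items():
            j = i + len(term)
            if text[i:j] == term and min_cost[i] + cost < min_cost[j]:
                min_cost[j] = min_cost[i] + cost
                best_prev[j] = (i, term)
    # Backtrack from the end to recover the minimum-cost path.
    tokens, i = [], n
    while i > 0:
        start, term = best_prev[i]
        tokens.append(term)
        i = start
    return list(reversed(tokens))

# Hypothetical per-term costs (lower = more likely):
costs = {"東": 5000, "京都": 4000, "東京": 3000, "都": 3500,
         "に": 1000, "住む": 2000}
print(tokenize("東京都に住む", costs))
# ['東京', '都', 'に', '住む']
```

Because each `min_cost[i]` is computed once and reused, the search stays linear in the number of lattice edges rather than exponential in the number of segmentations.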
35. Introduction to Japanese Tokenizers
● Introduction to NLP and Tokenization
● Lattice-based tokenizers (MeCab and others)
○ Dictionary
■ Term table, Connection Cost, ...
○ Tokenization Algorithms
■ Pattern Matching, Viterbi Algorithm, ...
36. Learn more:
● Kotori (on GitHub), a Japanese tokenizer written in Kotlin
○ Small and performant (fastest among JVM-based tokenizers)
○ Supports multiple dictionary formats
● Article: How Japanese Tokenizers Work (by Wanasit)
● Article: 日本語形態素解析の裏側を覗く! (“A peek behind the scenes of Japanese morphological analysis”) (by Cookpad Developers)
● Book: 自然言語処理の基礎 (“Foundations of Natural Language Processing”) (by Manabu Okumura)