Introduction to Japanese Tokenizers
WebHack 2020-11-10
by Wanasit T.
About Me
● GitHub: @wanasit
○ Text / NLP projects
● Manager, Software Engineer @ Indeed
○ Search Quality (Metadata) team
○ Work on NLP problems for Jobs / Resumes
Disclaimer
1. This talk is NOT related to any of Indeed’s technology
2. I’m not Japanese (or a native speaker)
○ But I built a Japanese tokenizer in my free time
Today's Topics
● NLP and Tokenization (for Japanese)
● Lattice-based Tokenizers (MeCab-style tokenizers)
● How it works
○ Dictionary
○ Tokenization
NLP and Tokenization
● How does a computer represent text?
● String (or Char[ ] or Byte[ ] )
■ "Abc"
■ "Hello World"
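To make the "string, char[ ] or byte[ ]" point concrete, here is a small Python sketch (Python is my choice for illustration, not something the talk uses). It shows the same text as a sequence of characters and as UTF-8 bytes; the variable names are made up for this example.

```python
# A string is a sequence of characters (Unicode code points) ...
ascii_text = "Abc"
japanese_text = "東京"

ascii_code_points = [ord(ch) for ch in ascii_text]
japanese_code_points = [hex(ord(ch)) for ch in japanese_text]

# ... and, once encoded, a sequence of bytes.
ascii_bytes = list(ascii_text.encode("utf-8"))
japanese_bytes = japanese_text.encode("utf-8")

print(ascii_code_points, ascii_bytes)
print(japanese_code_points, len(japanese_text), len(japanese_bytes))
```

For ASCII text, characters and bytes line up one to one. For Japanese text each character takes several bytes in UTF-8, so a tokenizer usually works on characters rather than raw bytes.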
NLP and Tokenization
"Biden is projected winner in Michigan,
Wisconsin as tense nation watch final tally"
● What’s the topic?
● Who is winning? where?
Source: NBC News
NLP and Tokenization
● Tokenization / Segmentation
● The first step in solving an NLP problem is usually
identifying the words in the string
○ Input: string, char[ ] (or byte[ ])
○ Output: a list of meaningful words (or tokens)
NLP and Tokenization
"Biden is projected winner in Michigan, Wisconsin as tense nation watch final tally".split(/\W+/)
> ["Biden", "is", "projected", "winner", "in", ...]
Japanese Tokenization
"バイデン氏がミシガン州勝利、大統領にむけ“王手”"
● No spaces (or other delimiters) between words
● Q: How do you split this into words?
Source: TBS News
Japanese Tokenization
● Use prior Japanese knowledge (Dictionary)
○ が, に, …, 氏, 州, …, バイデン
● Consider the context and combination of characters
● Consider the likelihood
○ e.g. 東京都 => [東京, 都], or [東, 京都]
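The 東京都 ambiguity can be reproduced with a few lines of Python. This is only a sketch: `TOY_DICTIONARY` and `segmentations` are hypothetical names, and the dictionary holds just enough entries to show the problem.

```python
def segmentations(text, dictionary):
    """Return every way to split `text` into terms found in `dictionary`."""
    if not text:
        return [[]]  # one way to split the empty string: no terms
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in dictionary:
            for rest in segmentations(text[end:], dictionary):
                results.append([head] + rest)
    return results


# Toy dictionary (hypothetical, for illustration only).
TOY_DICTIONARY = {"東", "京", "都", "東京", "京都"}

candidates = segmentations("東京都", TOY_DICTIONARY)
for candidate in candidates:
    print(candidate)
```

A dictionary alone yields several valid splits, including both [東京, 都] and [東, 京都]. Choosing between them needs some notion of likelihood, which is what the costs in the following slides provide.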
Lattice-based Tokenizers
● a.k.a. MeCab-based tokenizers (or Viterbi tokenizers)
● How:
○ From a Dictionary (required)
○ Build a Lattice (or a graph) from surface dictionary terms
○ Run Viterbi algorithm to find the best connected path
Lattice-Based Tokenizers
● Most tokenizers are re-implementations of MeCab (C/C++) on
different platforms:
○ Kuromoji, Sudachi (Java), Kotori (Kotlin)
○ Janome, SudachiPy (Python)
○ Kagome (Go)
○ ...
Non-Lattice-Based Tokenizers
● Is Lattice-based the only approach?
● Mostly yes, but there are also:
○ Juman++, Nagisa (RNN)
○ SentencePiece (unsupervised, used by BERT-family models such as ALBERT)
● Out of scope for this presentation
How it works
> Dictionary
Dictionary
● Lattice-based tokenizers need a dictionary
○ To recognize predefined terms and grammar
● Dictionaries can often be downloaded as plugins, e.g.
○ $ brew install mecab
○ $ brew install mecab-ipadic
Dictionary
● The recommended dictionary for beginners is MeCab’s IPADIC
● Available from this website
Dictionary - Term Table / Lexicon / CSV files
Surface Form | Context ID (left) | Context ID (right) | Cost | Type | Form | Spelling | ...
東京 | 1293 | 1293 | 3003 | 名詞 (place) | - | トウキョウ | ...
京都 | 1293 | 1293 | 2135 | 名詞 (place) | - | キョウト | ...
東京塚 | 1293 | 1293 | 8676 | 名詞 (place) | - | ヒガシキョウヅカ | ...
行く | 992 | 992 | 8852 | 動詞 (v) | 基本形 | イク | ...
行か | 1002 | 1002 | 7754 | 動詞 (v) | 未然形 | イカ | ...
いく | 992 | 992 | 9672 | 動詞 (v) | 基本形 | イク | ...
Dictionary - Term Table
● Surface Form: how the term appears in the string
● Context ID (left/right): ID used for connecting terms
together (see later)
● Cost: how commonly the term is used
○ The higher the cost, the less common (less likely) the term
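As a rough illustration of the term table, the sketch below loads a few rows into Python objects. The `Term` class, `load_terms`, and `SAMPLE_CSV` are hypothetical. The rows are simplified from the table above, keeping the surface form, both context IDs, the cost, and two of the remaining feature columns; real dictionary files carry more columns.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    surface: str     # how the term appears in the input string
    left_id: int     # context ID used when connecting from the previous term
    right_id: int    # context ID used when connecting to the next term
    cost: int        # higher cost = less common term
    features: tuple  # remaining columns (type, spelling, ...)


# Simplified sample rows (assumption: surface, left ID, right ID, cost, then features).
SAMPLE_CSV = """\
東京,1293,1293,3003,名詞,トウキョウ
京都,1293,1293,2135,名詞,キョウト
行く,992,992,8852,動詞,イク
"""


def load_terms(csv_text):
    """Parse term-table rows into Term objects."""
    terms = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        surface, left_id, right_id, cost, *features = row
        terms.append(Term(surface, int(left_id), int(right_id), int(cost), tuple(features)))
    return terms


terms = load_terms(SAMPLE_CSV)
most_common = min(terms, key=lambda term: term.cost)
print(most_common.surface, most_common.cost)
```

Among these three rows, the term with the lowest cost is the one the dictionary treats as most common.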
Dictionary - Connection Table / Connection Cost
Context ID (from) | Context ID (to) | Cost
... | ... | ...
992 | 992 | 3003
992 | 993 | 2135
... | ... | ...
992 | 1293 | -1000
992 | 1294 | -1000
... | ... | ...
● Connection cost between types of terms
● The lower the cost, the more likely the connection
● e.g.
● 992 (v-ru) then 992 (v-ru)
○ Cost = 3003 (unlikely)
● 992 (v-ru) then 1294 (noun)
○ Cost = -1000 (likely)
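To show how the two tables combine, here is a minimal sketch that scores a sequence of terms as the sum of term costs plus the connection costs between neighbours. `TERMS`, `CONNECTIONS`, and `path_cost` are hypothetical names. The numbers come from the tables above. Connections to the beginning and end of the sentence, which real tokenizers also score, are left out for brevity.

```python
# surface -> (left context ID, right context ID, term cost), from the term table.
TERMS = {
    "行く": (992, 992, 8852),
    "東京": (1293, 1293, 3003),
    "京都": (1293, 1293, 2135),
}

# (right ID of previous term, left ID of next term) -> connection cost.
CONNECTIONS = {
    (992, 992): 3003,
    (992, 993): 2135,
    (992, 1293): -1000,
    (992, 1294): -1000,
}


def path_cost(surfaces):
    """Total cost of a term sequence: term costs plus connection costs."""
    total = 0
    previous_right_id = None
    for surface in surfaces:
        left_id, right_id, cost = TERMS[surface]
        total += cost
        if previous_right_id is not None:
            total += CONNECTIONS[(previous_right_id, left_id)]
        previous_right_id = right_id
    return total


verb_then_verb = path_cost(["行く", "行く"])  # 8852 + 3003 + 8852
verb_then_noun = path_cost(["行く", "東京"])  # 8852 - 1000 + 3003
print(verb_then_verb, verb_then_noun)
```

The verb-verb sequence pays a connection penalty, while the verb-noun sequence gets a discount, so the second path is much cheaper overall.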
Dictionary - Term Table
Term table size:
● Kotori (default) ~380,000 terms (3.7 MB)
● MeCab-IPADict ~400,000 terms (12.2 MB)
● Sudachi - Small ~750,000 terms (39.8 MB)
● Sudachi - Full ~2,800,000 terms (121 MB)
○ Includes terms like: "ヽ(`ー`)ノ"
Dictionary - Term Table
● What about words not in the table?
○ e.g. "ワナシット タナキットルンアン"
○ “Unknown-Term Extraction” Problem
○ Typically, some heuristic rules
■ e.g. if there are consecutive katakana characters, treat them as a noun.
● Out of scope for this presentation
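For a feel of what such a heuristic might look like, the sketch below groups consecutive katakana characters into candidate unknown terms. It is an illustration under my own assumptions, not how any particular tokenizer does it: `is_katakana` simply checks the Unicode Katakana block (U+30A0 to U+30FF), and `katakana_runs` is a hypothetical helper.

```python
def is_katakana(ch):
    """True if `ch` falls inside the Unicode Katakana block (U+30A0..U+30FF)."""
    return "\u30a0" <= ch <= "\u30ff"


def katakana_runs(text):
    """Return (start_index, surface) for each maximal run of katakana."""
    runs = []
    start = None
    for index, ch in enumerate(text):
        if is_katakana(ch):
            if start is None:
                start = index  # a new run begins here
        elif start is not None:
            runs.append((start, text[start:index]))
            start = None
    if start is not None:  # run that reaches the end of the text
        runs.append((start, text[start:]))
    return runs


runs = katakana_runs("ワナシット タナキットルンアン")
print(runs)
```

Each run could then be added to the lattice as a noun candidate with some default cost, so that names missing from the dictionary still get a path.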
How it works
> Tokenization
Lattice-Based Tokenization
Given:
● The Dictionary
● Input:"東京都に住む"
Tokenizer:
1. Find all terms in the input
and build a lattice
2. Find the minimum cost
path through the lattice
Step 1: Finding all terms
● For each index i
○ Find all dictionary terms starting at position i
● String / Pattern Matching problem
○ Requires an efficient lookup data structure for the dictionary
○ e.g. Trie, Finite-State Transducer
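Step 1 can be sketched with a plain dictionary-based trie. This is a minimal illustration, not the data structure any specific tokenizer ships; `TrieNode`, `build_trie`, `terms_starting_at`, `find_all_terms`, and the toy term list are all hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # next character -> TrieNode
        self.is_term = False  # True if a dictionary term ends at this node


def build_trie(surfaces):
    root = TrieNode()
    for surface in surfaces:
        node = root
        for ch in surface:
            node = node.children.setdefault(ch, TrieNode())
        node.is_term = True
    return root


def terms_starting_at(root, text, start):
    """Walk the trie from `start`, collecting every term that is a prefix."""
    found = []
    node = root
    for end in range(start, len(text)):
        node = node.children.get(text[end])
        if node is None:
            break  # no dictionary term continues with this character
        if node.is_term:
            found.append(text[start:end + 1])
    return found


def find_all_terms(root, text):
    return {i: terms_starting_at(root, text, i) for i in range(len(text))}


# Toy dictionary (hypothetical).
trie = build_trie(["東", "京", "都", "東京", "京都", "に", "住む"])
lattice_terms = find_all_terms(trie, "東京都に住む")
for position, found in lattice_terms.items():
    print(position, found)
```

One walk down the trie finds every term starting at a position, so the lookup at each position is bounded by the longest dictionary term rather than by the dictionary size.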
Step 2: Finding minimum cost
● Viterbi Algorithm (Dynamic Programming)
● For each node, from left to right
○ Find the minimum cost path leading to that node
○ Reuse the selected path when considering the following
nodes
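The two bullets above can be sketched as a small dynamic program. Everything here is illustrative: `tokenize`, `TOY_TERMS`, and `TOY_CONNECTIONS` are hypothetical. Only the entries for 東京 and 京都 come from the term table shown earlier, and the remaining IDs and costs are made up. Missing connection entries default to 0, the end-of-sentence connection is ignored, and terms are matched by brute force instead of with the trie from Step 1.

```python
from collections import namedtuple

# A lattice node: a term plus the cheapest known path that ends with it.
Node = namedtuple("Node", "surface right_id total previous")

# surface -> (left ID, right ID, cost). Only 東京 and 京都 come from the slides.
TOY_TERMS = {
    "東京": (1293, 1293, 3003),
    "京都": (1293, 1293, 2135),
    "東": (1293, 1293, 6000),
    "都": (1300, 1300, 4000),
    "に": (150, 150, 1000),
    "住む": (992, 992, 5000),
}
# (right ID of previous term, left ID of next term) -> cost. Made-up values.
TOY_CONNECTIONS = {(1293, 1300): -1000, (1293, 1293): 2000}


def tokenize(text, terms, connections):
    """Return (surfaces, total_cost) of the minimum cost path, or None."""
    ending_at = {0: [Node("", 0, 0, None)]}  # position -> nodes ending there
    for start in range(len(text)):
        if start not in ending_at:
            continue  # no path reaches this position
        for surface, (left_id, right_id, cost) in terms.items():
            if not text.startswith(surface, start):
                continue
            # Reuse the best path found so far for each node ending at `start`.
            best = min(
                ending_at[start],
                key=lambda p: p.total + connections.get((p.right_id, left_id), 0),
            )
            total = best.total + connections.get((best.right_id, left_id), 0) + cost
            node = Node(surface, right_id, total, best)
            ending_at.setdefault(start + len(surface), []).append(node)
    if len(text) not in ending_at:
        return None
    node = min(ending_at[len(text)], key=lambda n: n.total)
    total = node.total
    path = []
    while node.previous is not None:  # follow back-pointers to the start
        path.append(node.surface)
        node = node.previous
    return path[::-1], total


result = tokenize("東京都に住む", TOY_TERMS, TOY_CONNECTIONS)
print(result)
```

With these toy costs the path 東京 / 都 / に / 住む is cheaper than 東 / 京都 / に / 住む. Each node stores only its best predecessor, so the work grows with the number of lattice edges rather than with the number of possible paths.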
Summary
Introduction to Japanese Tokenizers
● Introduction to NLP and Tokenization
● Lattice-based tokenizers (MeCab and others)
○ Dictionary
■ Term table, Connection Cost, ...
○ Tokenization Algorithms
■ Pattern Matching, Viterbi Algorithm, ...
Learn more:
● Kotori (on GitHub), a Japanese tokenizer written in Kotlin
○ Small and performant (fastest among JVM-based)
○ Supports multiple dictionary formats
● Article: How Japanese Tokenizers Work (by Wanasit)
● Article: 日本語形態素解析の裏側を覗く! (by Cookpad Developer)
● Book: 自然言語処理の基礎 (by Manabu Okumura)