Kuromoji FST
2015/06/25
Yoshinari Fujinuma
Overview
• Motivation
• Building FST
• How freezing works
• How equivalent state detection works
• Compiled FST and Virtual Machine
Motivation
• Efficient key-value store for dictionary lookup during tokenization
• String -> integers
• int -> token info
Why FST and not Trie?
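• Finite State Transducer (FST) = Finite State Automaton + Output
• A trie merges only common prefixes; an FST can merge common suffixes as well
• e.g. “can”, “cats”, “dogs”: “can” and “cats” share the prefix “ca”, “cats” and “dogs” share the suffix “s”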
Overview of how the build works
[Figure: list of sorted words, list of integers -> FST Builder -> Object-based FST -> FST Compiler -> Compiled FST]
How Building / Compiling works
• Two variables are the key
• previous word (prev)
• current word (current)
1. Skip the common prefix of prev and current
2. Make arcs to the temp states
3. Freeze (finalize) the states on the suffix where prev differs from current
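The three steps above all depend on where prev and current stop agreeing. The following is a minimal Java sketch of that bookkeeping only; the class and method names (`BuildStep`, `suffixToAdd`, `suffixToFreeze`) are hypothetical and not taken from the Kuromoji sources, and no states or arcs are actually built here.

```java
// Hypothetical helper for illustration; not part of Kuromoji.
class BuildStep {
    // Step 1: length of the common prefix of prev and current.
    static int commonPrefixLength(String prev, String current) {
        int max = Math.min(prev.length(), current.length());
        int i = 0;
        while (i < max && prev.charAt(i) == current.charAt(i)) {
            i++;
        }
        return i;
    }

    // Step 2: the part of current that needs new arcs to temp states.
    static String suffixToAdd(String prev, String current) {
        return current.substring(commonPrefixLength(prev, current));
    }

    // Step 3: the part of prev that differs from current and can be frozen.
    static String suffixToFreeze(String prev, String current) {
        return prev.substring(commonPrefixLength(prev, current));
    }

    public static void main(String[] args) {
        String[] sortedWords = {"cat", "cats", "catx"};
        String prev = "";
        for (String current : sortedWords) {
            System.out.println("prev=\"" + prev + "\" current=\"" + current
                    + "\" add=\"" + suffixToAdd(prev, current)
                    + "\" freeze=\"" + suffixToFreeze(prev, current) + "\"");
            prev = current;
        }
    }
}
```

The input has to be sorted for this to work: once a later word no longer shares part of prev, no subsequent word can share it either, so those states are safe to freeze.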
Toy example
• cat -> 0
• cats -> 1
• catx -> 2
Initializing states
• Initialization
[Figure: frozen states and temp states]
Freezing states
• prev word = “”, current word = cat
[Figure: frozen states and temp states; arcs c/0 -> a/0 -> t/0]
Add Arc to suffix
• prev word = cat, current word = cats
[Figure: frozen states and temp states; arcs c/0 -> a/0 -> t/0 -> s/1]
Freeze differing suffix
• prev word = cats, current word = catx
[Figure, shown in three steps: frozen states and temp states; arcs c/0 -> a/0 -> t/0, then s/1 and x/2 leaving the state after t; the last two steps add the label “HashCode 1”]
Merge Equivalent states
• prev word = catx, current word = “”
[Figure: frozen states and temp states; arcs c/0 -> a/0 -> t/0, then s/1 and x/2 leaving the state after t]
Freezing states
• prev word = catx, current word = “”
[Figure: the arcs c/0 -> a/0 -> t/0, then s/1 and x/2, now drawn under frozen states; no arcs under temp states]
Equivalent state detection
• We want to merge equivalent states!
• Key-value store using HashMap
• Key: State.hashCode()
• Value: State Object
• Collisions are resolved by chaining
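A minimal Java sketch of such a lookup table, assuming the state class provides `hashCode()` and `equals()`. The class name `FrozenStateRegistry` and the method `findOrAdd` are hypothetical, and the type parameter `T` stands in for the State class; the chaining is written out explicitly as a list per hash code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical registry of frozen states; T stands in for the State class.
class FrozenStateRegistry<T> {
    // Key: hash code of a state. Value: chain of frozen states with that hash code.
    private final Map<Integer, List<T>> buckets = new HashMap<>();

    // Returns an equivalent state that is already frozen, if any.
    // Otherwise registers the given state and returns it.
    T findOrAdd(T state) {
        List<T> chain = buckets.computeIfAbsent(state.hashCode(), k -> new ArrayList<>());
        for (T frozen : chain) {
            if (frozen.equals(state)) {
                return frozen; // merge: reuse the existing state
            }
        }
        chain.add(state); // same hash code but not equivalent: extend the chain
        return state;
    }

    public static void main(String[] args) {
        FrozenStateRegistry<String> registry = new FrozenStateRegistry<>();
        String first = new String("final");
        String second = new String("final");
        registry.findOrAdd(first);
        // Prints true: the second, equal object is replaced by the first one.
        System.out.println(registry.findOrAdd(second) == first);
    }
}
```

When the builder freezes a state, it would call `findOrAdd` and point the incoming arc at whatever comes back, so equivalent states end up shared.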
Arc Equivalence
• Same transition character
• Same destination state
• Same output
[Figure: two arcs, each labeled c/0]
State Equivalence
• All outgoing arcs are equivalent
• Both states are of the same type
[Figure: two states, each with an outgoing arc labeled c/0]
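The two equivalence rules map naturally onto `equals()` and `hashCode()`. The sketch below is an illustration under assumptions, not the Kuromoji classes: the "type of state" is modeled as a boolean `isFinal` flag, "same destination state" is checked by object identity (which only works if destination states have already been frozen and merged), and arcs are compared in insertion order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical arc: transition character, output, destination state.
class Arc {
    final char label;
    final int output;
    final State target;

    Arc(char label, int output, State target) {
        this.label = label;
        this.output = output;
        this.target = target;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Arc)) {
            return false;
        }
        Arc other = (Arc) o;
        // Same transition character, same output, same destination object.
        return label == other.label && output == other.output && target == other.target;
    }

    @Override
    public int hashCode() {
        return Objects.hash(label, output, System.identityHashCode(target));
    }
}

// Hypothetical state: a type flag plus its outgoing arcs.
class State {
    final boolean isFinal; // assumed meaning of "type of state"
    final List<Arc> arcs = new ArrayList<>();

    State(boolean isFinal) {
        this.isFinal = isFinal;
    }

    State addArc(char label, int output, State target) {
        arcs.add(new Arc(label, output, target));
        return this;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof State)) {
            return false;
        }
        State other = (State) o;
        // Same type, and all outgoing arcs pairwise equivalent.
        return isFinal == other.isFinal && arcs.equals(other.arcs);
    }

    @Override
    public int hashCode() {
        return Objects.hash(isFinal, arcs);
    }

    public static void main(String[] args) {
        State end = new State(true);
        State a = new State(false).addArc('c', 0, end);
        State b = new State(false).addArc('c', 0, end);
        System.out.println(a.equals(b)); // two states with an equivalent c/0 arc
    }
}
```

Comparing destinations by identity keeps `equals()` from recursing through the whole graph, at the cost of requiring that states be frozen from the suffix end first.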
How Compiled FST works
• Generates a “Program”
• Running a Program = look up a word in a dictionary
• Program runs on a Virtual Machine which we implemented
[Figure: Word (e.g. “cat”) -> Virtual Machine running the Compiled FST (= “Program”) -> the integer if the word exists in the dictionary, or -1 if not]
Program
• List of Instructions, 11 bytes each
• Operation code (Op code)
• Match or Accept, Match, Fail
[Instruction layout: Op code (1 byte) | transition char (2 bytes) | output (4 bytes) | target address (4 bytes)]
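As an illustration of the 11-byte layout (1 + 2 + 4 + 4), here is a minimal Java sketch that packs and unpacks one instruction with `java.nio.ByteBuffer`. The class name, the numeric op code values, and the big-endian byte order are assumptions made for this example; the slides specify only the field widths.

```java
import java.nio.ByteBuffer;

// Hypothetical codec for a single instruction; op code values are made up.
class InstructionCodec {
    static final int SIZE = 11; // 1 + 2 + 4 + 4 bytes
    static final byte FAIL = 0;
    static final byte MATCH = 1;
    static final byte MATCH_OR_ACCEPT = 2;

    // Packs the four fields in the order given on the slide.
    static byte[] encode(byte opcode, char transition, int output, int target) {
        return ByteBuffer.allocate(SIZE)
                .put(opcode)          // op code, 1 byte
                .putChar(transition)  // transition char, 2 bytes
                .putInt(output)       // output, 4 bytes
                .putInt(target)       // target address, 4 bytes
                .array();
    }

    static byte opcode(byte[] instruction) {
        return instruction[0];
    }

    static char transition(byte[] instruction) {
        return ByteBuffer.wrap(instruction).getChar(1);
    }

    static int output(byte[] instruction) {
        return ByteBuffer.wrap(instruction).getInt(3);
    }

    static int target(byte[] instruction) {
        return ByteBuffer.wrap(instruction).getInt(7);
    }

    public static void main(String[] args) {
        byte[] instruction = encode(MATCH, 't', 0, 2);
        System.out.println(instruction.length + " bytes, char=" + transition(instruction)
                + ", output=" + output(instruction) + ", target=" + target(instruction));
    }
}
```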
Match
• Transition to a given address
• Accumulator += output
Address  Op code  Char  Output  Target
0        Fail     None  None    None
1        M / A    x     2       0
2        M / A    s     1       0
3        Fail
4        Match    t     0       2
…
Fail
• Stop running the Program and return -1
• e.g. “tss”
Address  Op code  Char  Output  Target
0        Fail     None  None    None
1        M / A    x     2       0
2        M / A    s     1       0
3        Fail
4        Match    t     0       2
…
Match or Accept
• If the current character is the final character of the input, stop running the Program and return the accumulator
• Otherwise, behave like Match
Instructions vs. Arcs
• What instructions represent
Address  Op code  Char  Output  Target
0        Fail     None  None    None
1        M / A    x     2       0
2        M / A    s     1       0
3        Fail
4        Match    t     0       2
…
[Figure: instructions 2, 1, and 4 correspond to the arcs s/1, x/2, and t/0]
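Virtual Machine running backwards
Address  Op code  Char  Output  Target
0        Fail     None  None    None
1        M / A    x     2       0
2        M / A    s     1       0
3        Fail
4        Match    t     0       2
5        Fail
6        Match    a     0       4
7        Fail
8        Match    c     0       6
• Because states are frozen starting from the suffixes, the instructions for the end of a word get the lowest addresses, and the first character (c) is matched at the highest address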
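To make the table concrete, here is a minimal Java sketch of an interpreter for it. It is an illustration under assumptions, not the Kuromoji Virtual Machine: execution starts at the last instruction, a non-matching instruction falls through to the next lower address (the next arc of the same state) until a Fail is reached, and Match or Accept adds its output before returning. The class name `ToyVm` and the numeric op codes are hypothetical. The program is copied from the table as given, where the `t` instruction is a plain Match, so this sketch returns -1 for “cat”; how the real implementation accepts “cat” is not shown on the slides.

```java
// Hypothetical interpreter for the toy program; not the Kuromoji VM.
class ToyVm {
    static final int FAIL = 0;
    static final int MATCH = 1;
    static final int MATCH_OR_ACCEPT = 2;

    // Each row: {op code, transition char, output, target address}.
    static final int[][] PROGRAM = {
        {FAIL, 0, 0, 0},              // 0
        {MATCH_OR_ACCEPT, 'x', 2, 0}, // 1
        {MATCH_OR_ACCEPT, 's', 1, 0}, // 2
        {FAIL, 0, 0, 0},              // 3
        {MATCH, 't', 0, 2},           // 4
        {FAIL, 0, 0, 0},              // 5
        {MATCH, 'a', 0, 4},           // 6
        {FAIL, 0, 0, 0},              // 7
        {MATCH, 'c', 0, 6},           // 8
    };

    static int run(int[][] program, String word) {
        int pc = program.length - 1; // assumed entry point: the last instruction
        int accumulator = 0;
        int position = 0;
        while (true) {
            int[] instruction = program[pc];
            if (instruction[0] == FAIL) {
                return -1;
            }
            boolean matches = position < word.length()
                    && word.charAt(position) == instruction[1];
            if (!matches) {
                pc--; // linear search: try the next arc of the same state
                continue;
            }
            accumulator += instruction[2];
            position++;
            if (instruction[0] == MATCH_OR_ACCEPT && position == word.length()) {
                return accumulator; // final character consumed: accept
            }
            pc = instruction[3]; // transition to the target address
        }
    }

    public static void main(String[] args) {
        for (String word : new String[] {"cats", "catx", "tss"}) {
            System.out.println(word + " -> " + run(PROGRAM, word));
        }
    }
}
```

With this program, “cats” follows addresses 8, 6, 4, 2 and returns 1; “catx” falls through from address 2 to address 1 and returns 2; “tss” fails to match at address 8 and hits the Fail at address 7.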
Use of Cache
• The lookup for the next state is done by linear search over the outgoing arcs
• The number of outgoing arcs from the start state is large
• Therefore, we cache the outgoing arcs of the start state
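One way such a cache could look, sketched in Java under assumptions: instructions are rows of `{op code, transition char, output, target address}`, the start state's instructions sit at the end of the program and are terminated by a Fail, and the cache is a table indexed by character. The slides do not say how the actual cache is organized; the class name `StartStateCache` is hypothetical.

```java
import java.util.Arrays;

// Hypothetical cache of the start state's outgoing arcs.
class StartStateCache {
    static final int FAIL = 0;
    static final int MATCH = 1;

    // Returns a table indexed by character: the address of the start-state
    // instruction for that character, or -1 if the start state has no such arc.
    static int[] build(int[][] program) {
        int[] cache = new int[Character.MAX_VALUE + 1];
        Arrays.fill(cache, -1);
        // Walk the start state's instructions from the end down to its Fail.
        for (int pc = program.length - 1; pc >= 0 && program[pc][0] != FAIL; pc--) {
            cache[program[pc][1]] = pc;
        }
        return cache;
    }

    public static void main(String[] args) {
        int[][] program = {
            {FAIL, 0, 0, 0},     // 0
            {MATCH, 'x', 0, 0},  // 1: belongs to some other state
            {FAIL, 0, 0, 0},     // 2: terminates the start state
            {MATCH, 'd', 0, 1},  // 3: start state
            {MATCH, 'c', 0, 1},  // 4: start state
        };
        int[] cache = build(program);
        System.out.println("c -> " + cache['c'] + ", d -> " + cache['d'] + ", x -> " + cache['x']);
    }
}
```

A lookup would consult the table for the first character and jump straight to the matching instruction, replacing the linear scan for that one state with a single array access.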
Summary
• An FST is theoretically more compact than a trie
• Implemented an FST Builder which builds
• an Object-based FST
• a Compiled FST, a compact form
• A Virtual Machine runs the compiled program (= looks up a word)
References
• Direct Construction of Minimal Acyclic Subsequential Transducers, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.3698
• Smaller representation of finite-state automata, http://www.sciencedirect.com/science/article/pii/S0304397512003787
• Blog post by Ikawa-san, http://qiita.com/ikawaha/items/be95304a803020e1b2d1
• This code is available at https://github.com/atilika/fst
