This document discusses building a finite state transducer (FST) for efficient dictionary lookups during tokenization. It describes building the FST by iterating through a sorted word list, freezing states when word suffixes differ, and merging equivalent states. The built FST is then compiled into a program that a virtual machine executes to look up words. The program represents the FST as a list of instructions, each holding a transition character, an output value, and a target address. Running the program backwards simulates traversing the FST from a word to its output.
3. Motivation
• Efficient key-value store for dictionary lookup during tokenization
• String -> integer
• Integer -> token info
4. Why FST and not Trie?
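• Finite State Transducer (FST) = Finite State Automaton + Output
• An FST can merge common suffixes as well as common prefixes; a trie merges only prefixes
• e.g. “can”, “cats”, “dogs”: “can” and “cats” share the prefix “ca”, while “cats” and “dogs” can share the final “s”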
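To make the prefix/suffix point concrete, here is a minimal sketch that counts trie nodes and the states left after merging every state with the same set of possible continuations. It ignores outputs, so it describes a plain automaton rather than the transducer built in these slides, and the function names are illustrative only.

```python
def prefixes(words):
    """All prefixes of all words, including the empty prefix (the root)."""
    return {w[:i] for w in words for i in range(len(w) + 1)}


def trie_node_count(words):
    # A trie has exactly one node per distinct prefix.
    return len(prefixes(words))


def minimal_state_count(words):
    # Two prefixes can share one state when exactly the same set of
    # suffixes may follow them (outputs are ignored in this sketch).
    continuations = {
        frozenset(w[len(p):] for w in words if w.startswith(p))
        for p in prefixes(words)
    }
    return len(continuations)


words = ["can", "cats", "dogs"]
print(trie_node_count(words), minimal_state_count(words))  # prints "10 7"
```

With outputs attached to arcs, two states can only be merged if their outputs also agree, so a real FST may merge fewer states than this count suggests.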
5. Overview of how the build works
Sorted list of words + list of integers -> FST Builder -> Object-based FST -> FST Compiler -> Compiled FST
6. How Building / Compiling works
• Two variables are the key
• previous word (prev)
• current word (current)
1. Skip the common prefix of prev and current
2. Make arcs to the temp states
3. Freeze (finalize) the states whose suffix differs between prev and current
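The three steps can be traced without building any states. The sketch below is only an illustration: `common_prefix_len` and `build_steps` are hypothetical names, and the trailing empty word is assumed to be the way the last word gets flushed, as in the later slides where current word = “”.

```python
def common_prefix_len(prev, current):
    """Length of the longest common prefix of the two words."""
    n = 0
    for a, b in zip(prev, current):
        if a != b:
            break
        n += 1
    return n


def build_steps(sorted_words):
    """For each (prev, current) pair, report what the builder would do.

    Returns tuples (prev, current, prefix_len, frozen_suffix, new_suffix):
    the states reached by frozen_suffix of prev are frozen, and arcs to
    fresh temp states are made for new_suffix of current.
    """
    steps = []
    prev = ""
    # A final empty word (an assumption of this sketch) freezes what is left.
    for current in list(sorted_words) + [""]:
        k = common_prefix_len(prev, current)
        steps.append((prev, current, k, prev[k:], current[k:]))
        prev = current
    return steps


for step in build_steps(["cat", "cats", "catx"]):
    print(step)
```

For prev = “cats” and current = “catx” the shared prefix is “cat”, so only the state reached by “s” is frozen and a new arc is made for “x”.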
10. Add Arc to suffix
• prev word = cat, current word = cats
[Figure: regions labelled "Frozen states" and "Temp states"; the temp states are linked by the arcs c/0, a/0, t/0, s/1]
11. Freeze differing suffix
• prev word = cats, current word = catx
[Figure: arcs c/0, a/0, t/0, followed by the arcs s/1 and x/2; regions labelled "Frozen states" and "Temp states"]
12. Freeze differing suffix
• prev word = cats, current word = catx
[Figure: arcs c/0, a/0, t/0, followed by the arcs s/1 and x/2; regions labelled "Frozen states" and "Temp states"; annotation "HashCode 1"]
14. Merge Equivalent states
• prev word = catx, current word = “”
[Figure: arcs c/0, a/0, t/0, followed by the arcs s/1 and x/2; regions labelled "Frozen states" and "Temp states"]
15. Freezing states
• prev word = catx, current word = “”
[Figure: the arcs c/0, a/0, t/0, s/1 and x/2 all appear under "Frozen states"; "Temp states" is empty]
16. Equivalent state detection
• We want to merge equivalent states!
• Key-value store using HashMap
• Key: State.hashCode()
• Value: State Object
• Collisions are resolved by chaining
18. State Equivalence
• Their sets of outgoing arcs are equivalent
• Both states are of the same type
[Figure: two states, each with an outgoing arc c/0]
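The hash-based detection and the equivalence test can be sketched together as follows. This is not the code from the repository: `State`, `Registry`, `hash_code` and `equivalent` are hypothetical names, "type of state" is assumed to mean final versus non-final, and target states are compared by identity on the assumption that they were frozen earlier.

```python
class State:
    def __init__(self, is_final=False):
        self.is_final = is_final   # assumed meaning of "type of state"
        self.arcs = {}             # char -> (output, target State)

    def signature(self):
        # Targets are compared by identity: they are assumed to be frozen.
        arcs = sorted((ch, out, id(t)) for ch, (out, t) in self.arcs.items())
        return (self.is_final, tuple(arcs))

    def hash_code(self):
        return hash(self.signature())

    def equivalent(self, other):
        return self.signature() == other.signature()


class Registry:
    """Hash code -> list of frozen states; collisions resolved by chaining."""

    def __init__(self):
        self.buckets = {}

    def freeze(self, state):
        bucket = self.buckets.setdefault(state.hash_code(), [])
        for frozen in bucket:
            if frozen.equivalent(state):
                return frozen      # reuse the existing equivalent state
        bucket.append(state)
        return state


registry = Registry()
end = registry.freeze(State(is_final=True))
first, second = State(), State()
first.arcs["c"] = (0, end)
second.arcs["c"] = (0, end)
print(registry.freeze(first) is first, registry.freeze(second) is first)
```

The second state has the same type and the same single arc c/0 to the same target, so freezing it returns the state that was registered first.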
19. How Compiled FST works
• Generates a “Program”
• Running a Program = look up a word in a dictionary
• Program runs on a Virtual Machine which we implemented
Word (e.g. “cat”) -> Virtual Machine running the Compiled FST (= “Program”) -> the integer if the word exists in the dictionary, or -1 if not
20. Program
• List of Instructions, 11 bytes each
• Operation code (Op code)
• Match or Accept, Match, Fail
Instruction layout:
Op code: 1 byte
Transition char: 2 bytes
Output: 4 bytes
Target address: 4 bytes
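The 11-byte layout can be checked with a small packing sketch. The numeric op code values, the big-endian byte order, the signed 4-byte fields and the helper names are all assumptions made for illustration; the slides only give the field widths.

```python
import struct

# Hypothetical op code values; the slides do not give the actual numbers.
OP_FAIL, OP_MATCH, OP_MATCH_OR_ACCEPT = 0, 1, 2

# 1-byte op code, 2-byte transition char, 4-byte output, 4-byte target.
# ">" selects big-endian with no padding (byte order is an assumption).
INSTRUCTION = struct.Struct(">BHii")


def encode(op, char, output, target):
    """Pack one instruction into 11 bytes."""
    return INSTRUCTION.pack(op, ord(char), output, target)


def decode(data, address):
    """Unpack the instruction stored at the given instruction address."""
    op, code, output, target = INSTRUCTION.unpack_from(
        data, address * INSTRUCTION.size
    )
    return op, chr(code), output, target


program = encode(OP_FAIL, "\0", 0, 0) + encode(OP_MATCH_OR_ACCEPT, "x", 2, 0)
print(INSTRUCTION.size, decode(program, 1))
```

Because every instruction has the same size, a target address can be turned into a byte offset with a single multiplication.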
21. Match
• Transition to a given address
• Accumulator += output
Addr  Op code  Char  Output  Target
0     Fail     None  None    None
1     M / A    x     2       0
2     M / A    s     1       0
3     Fail
4     Match    t     0       2
….
22. Fail
• Stop running the Program and return -1
• e.g. “tss”
Addr  Op code  Char  Output  Target
0     Fail     None  None    None
1     M / A    x     2       0
2     M / A    s     1       0
3     Fail
4     Match    t     0       2
….
23. Match or Accept
• If the current character is the final char of the word, stop running the program and return the accumulator
• Otherwise, behave like Match
24. Instructions vs. Arcs
• What instructions represent
Addr  Op code  Char  Output  Target
0     Fail     None  None    None
1     M / A    x     2       0
2     M / A    s     1       0
3     Fail
4     Match    t     0       2
….
Instruction 1 = arc x/2, instruction 2 = arc s/1, instruction 4 = arc t/0
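25. Virtual Machine running backwards
Addr  Op code  Char  Output  Target
0     Fail     None  None    None
1     M / A    x     2       0
2     M / A    s     1       0
3     Fail
4     Match    t     0       2
5     Fail
6     Match    a     0       4
7     Fail
8     Match    c     0       6
• Because states are frozen starting from the suffixes, the arcs of the start state end up at the highest addresses, so the VM starts at the end of the program and works backwards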
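The listing above can be executed with a very small interpreter. The sketch below is one reading of the slides, not the repository's virtual machine: it assumes that execution starts at the last instruction, that a non-matching instruction falls through to the next lower address, and that the blank fields of the Fail rows can be treated as None.

```python
FAIL, MATCH, MATCH_OR_ACCEPT = "Fail", "Match", "M / A"

# (op code, transition char, output, target address), indexed by address.
PROGRAM = [
    (FAIL, None, None, None),        # 0
    (MATCH_OR_ACCEPT, "x", 2, 0),    # 1
    (MATCH_OR_ACCEPT, "s", 1, 0),    # 2
    (FAIL, None, None, None),        # 3
    (MATCH, "t", 0, 2),              # 4
    (FAIL, None, None, None),        # 5
    (MATCH, "a", 0, 4),              # 6
    (FAIL, None, None, None),        # 7
    (MATCH, "c", 0, 6),              # 8
]


def lookup(program, word):
    """Return the accumulated output for word, or -1 if it is not accepted."""
    pc = len(program) - 1            # assumed entry point: last instruction
    accumulator = 0
    for i, ch in enumerate(word):
        is_last = i == len(word) - 1
        while True:
            op, arc_char, output, target = program[pc]
            if op == FAIL:
                return -1
            if arc_char == ch:
                accumulator += output
                if op == MATCH_OR_ACCEPT and is_last:
                    return accumulator
                pc = target          # Match: jump to the target address
                break
            pc -= 1                  # try the next arc of the same state
    return -1                        # input ended without an accepting match


print(lookup(PROGRAM, "cats"), lookup(PROGRAM, "catx"), lookup(PROGRAM, "cat"))
# prints "1 2 -1"
```

In this listing “cat” is rejected because the instruction for t is a plain Match, so only “cats” and “catx” are accepted.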
26. Use of Cache
• The lookup for the next state is done by linear search
• The number of outgoing arcs from the start state is large
• Therefore, we cache the outgoing arcs of the start state
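One way to picture the cache is a dictionary built once from the start state's instructions, replacing the linear scan for the first character. The tuple layout, the two-word program and `build_start_cache` are assumptions of this sketch; the slides do not show how the actual cache is stored.

```python
# Hypothetical program whose start state has two arcs: a/3 and b/7.
PROGRAM = [
    ("Fail", None, None, None),   # 0
    ("M / A", "b", 7, 0),         # 1
    ("M / A", "a", 3, 0),         # 2  <- assumed entry point (last address)
]


def build_start_cache(program):
    """Map each transition char of the start state to its instruction data."""
    cache = {}
    pc = len(program) - 1
    # Walk the start state's arcs downwards until its terminating Fail.
    while program[pc][0] != "Fail":
        op, ch, output, target = program[pc]
        cache[ch] = (op, output, target)
        pc -= 1
    return cache


cache = build_start_cache(PROGRAM)
print(cache.get("b"))   # found with one dictionary lookup instead of a scan
print(cache.get("z"))   # None: no arc for this character from the start state
```

Only the first character of a lookup benefits; later states are assumed to have few enough arcs that the linear search stays cheap.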
27. Summary
• In theory, an FST is more compact than a trie
• Implemented an FST Builder which builds
• an Object-based FST
• a Compiled FST, a compact form
• Uses a Virtual Machine to run the compiled program (= look up a word)
28. References
• Direct Construction of Minimal Acyclic Subsequential Transducers, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.3698
• Smaller representation of finite-state automata, http://www.sciencedirect.com/science/article/pii/S0304397512003787
• Blog post by Ikawa-san, http://qiita.com/ikawaha/items/be95304a803020e1b2d1
• This code is available at https://github.com/atilika/fst