Presentation slides from ISMVL (International Symposium on Multiple-Valued Logic), May 22, 2017, Novi Sad, Serbia. The talk presents a machine learning technique that targets high performance and low power consumption.
A Random Forest using a Multi-valued Decision Diagram on an FPGA
1. A Random Forest using a Multi-valued Decision Diagram on an FPGA
Hiroki Nakahara (1), Akira Jinguji (1), Shimpei Sato (1), Tsutomu Sasao (2)
(1) Tokyo Institute of Technology, JP; (2) Meiji University, JP
May 22, 2017
@ISMVL2017
3. Machine Learning
Abundant computation power and big data
(Left): “Single-Threaded Integer Performance,” 2016
(Right): Nakahara, “Trend of Search Engine on modern Internet,” 2014
5. Introduction
• Random Forest (RF)
• Ensemble learning method
• Consists of multiple decision trees (DTs)
• Applications: segmentation, human pose detection
• It is based on binary DTs (BDTs)
• A node is evaluated by an if-then-else statement
• The same variable may appear several times
• Multiple-valued decision diagram (MDD)
• Each variable appears only once on a path
6. Introduction (Contʼd)
• Target platform
• CPU: Too slow
• GPU: Poorly suited to the RF → slow and power-hungry
• FPGA: Faster, low power, but long turn-around time (TAT)
• High-level synthesis (HLS) for the RF using MDDs on an FPGA
• Low power, high performance, short design time
8. Classification by a Binary Decision Tree (BDT)
• Partition of the feature map
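[Figure: (left) feature space with axes X1 (ticks 0.00, 0.09, 0.63, 0.71, 1.00) and X2 (ticks 0.00, 0.29, 0.53, 1.00), partitioned into regions labeled C1 and C2; (right) the corresponding BDT with root X2<0.53?, second-level nodes X2<0.29? and X1<0.09?, third-level nodes X1<0.63? and X1<0.71?, Y/N edges, and leaves C1 and C2]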
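The classification step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the thresholds are the ones on the slide, but the leaf labels and the tuple encoding of the tree are assumptions made for the example.

```python
# Internal node: (feature_index, threshold, yes_subtree, no_subtree); leaf: class label.
# Feature 0 is X1, feature 1 is X2. Thresholds follow the slide; the leaf labels
# are assumed for illustration only.
TREE = (1, 0.53,                      # X2 < 0.53 ?
        (1, 0.29, "C1",               # X2 < 0.29 ?
         (0, 0.63, "C2", "C1")),      # X1 < 0.63 ?
        (0, 0.09, "C1",               # X1 < 0.09 ?
         (0, 0.71, "C2", "C1")))      # X1 < 0.71 ?

def classify(tree, x):
    """Walk from the root to a leaf, one if-then-else test per node."""
    while isinstance(tree, tuple):
        feature, threshold, yes, no = tree
        tree = yes if x[feature] < threshold else no
    return tree

print(classify(TREE, (0.80, 0.10)))   # takes the X2 < 0.29 branch
print(classify(TREE, (0.50, 0.40)))   # takes the X1 < 0.63 branch
```

Each leaf corresponds to one rectangle of the partitioned feature space, so a root-to-leaf walk is the same as locating the rectangle that contains the input point.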
9. Training of a BDT
• It is built from randomized samples
• Recursively partition the dataset to maximize the information gain (entropy reduction) → the same variable may appear more than once
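A minimal sketch of entropy-based recursive partitioning, assuming the common information-gain criterion (entropy before the split minus the weighted entropy after it). The function names, the depth limit, and the toy data are hypothetical, and the random sampling of the training set is left out.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(samples, labels, feature, threshold):
    """Entropy reduction obtained by splitting on x[feature] < threshold."""
    left = [y for x, y in zip(samples, labels) if x[feature] < threshold]
    right = [y for x, y in zip(samples, labels) if x[feature] >= threshold]
    if not left or not right:
        return 0.0
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

def best_split(samples, labels):
    """Try every (feature, threshold) pair seen in the data; keep the best gain."""
    candidates = {(f, x[f]) for x in samples for f in range(len(x))}
    return max(sorted(candidates), key=lambda c: information_gain(samples, labels, *c))

def build_tree(samples, labels, max_depth=3):
    """Recursively split until a node is pure or max_depth is reached."""
    majority = Counter(labels).most_common(1)[0][0]
    if max_depth == 0 or len(set(labels)) == 1:
        return majority
    feature, threshold = best_split(samples, labels)
    yes = [(x, y) for x, y in zip(samples, labels) if x[feature] < threshold]
    no = [(x, y) for x, y in zip(samples, labels) if x[feature] >= threshold]
    if not yes or not no:              # no useful split exists
        return majority
    return (feature, threshold,
            build_tree(*zip(*yes), max_depth - 1),
            build_tree(*zip(*no), max_depth - 1))

samples = [(0.1, 0.2), (0.2, 0.8), (0.7, 0.3), (0.9, 0.9)]   # toy data (assumed)
labels = ["C1", "C1", "C2", "C2"]
print(build_tree(samples, labels))
```

Because `best_split` is free to pick any feature at every level, nothing prevents the same variable from being chosen again deeper in the tree.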
[Figure: same feature-space partition (axes X1, X2) and BDT as on slide 8]
10. Random Forest (RF)
• Ensemble learning
• Classification and regression
• Consists of multiple BDTs
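[Figure: (left) Binary Decision Tree (BDT), shown for Tree 1, with nodes X1<0.53?, X3<0.71?, X2<0.63?, X2<0.63?, X3<0.72?, Y/N edges, and leaves C1, C2, C3; (right) Random Forest: the input is fed to Tree 1, Tree 2, ..., Tree n, whose outputs (C1, C2, C1) go to a voter that returns the class C1]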
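The voter stage can be sketched as a simple majority vote. This is an illustrative assumption about the voting rule; the three single-test trees below are hypothetical stand-ins for trained BDTs.

```python
from collections import Counter

def vote(predictions):
    """Return the class predicted by the most trees (ties: first seen wins)."""
    return Counter(predictions).most_common(1)[0][0]

def forest_predict(trees, x):
    """Evaluate every tree on the same input and pass the results to the voter."""
    return vote(tree(x) for tree in trees)

# Three hypothetical one-node trees over the features (X1, X2, X3).
trees = [
    lambda x: "C1" if x[0] < 0.53 else "C2",
    lambda x: "C2" if x[2] < 0.71 else "C1",
    lambda x: "C1" if x[1] < 0.63 else "C3",
]

print(forest_predict(trees, (0.2, 0.4, 0.5)))   # votes are C1, C2, C1
```

The trees are independent of each other, which is why they can be evaluated in parallel before the vote.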
11. Applications
• Key point matching [Lepetit et al., 2006]
• Object detector [Shotton et al., 2008][Gall et al., 2011]
• Handwritten character recognition [Amit & Geman, 1997]
• Visual word clustering [Moosmann et al., 2006]
• Pose recognition [Yamashita et al., 2010]
• Human detector [Mitsui et al., 2011][Dahang et al., 2012]
• Human pose estimation [Shotton 2011]
12. Known Problem
• BDTs are built from randomized samples
• The same variable may appear more than once on a path
• Tends to be slow, even on GPUs
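[Figure: BDT with nodes X2<0.53?, X2<0.29?, X2<0.09?, X1<0.63?, X1<0.71?, Y/N edges, and leaves C1 and C2; X2 is tested more than once on a path. A single node is evaluated as follows:]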
if X2 < 0.09 then
    output C1;
else
    goto Child_node;
14. Binary Decision Diagram (BDD)
• Recursively apply Shannon expansion to a
given logic function
• Non-terminal node: If-then-else statement
• Terminal node: Sets the function value
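A minimal sketch of building a BDD by recursive Shannon expansion, f = x'·f(x=0) + x·f(x=1). The node sharing and the removal of redundant tests are the standard reduction rules for reduced ordered BDDs and are an addition here; the function names and the example function are hypothetical. The sketch enumerates all 2^n assignments, so it is only meant for small n.

```python
from itertools import product

def build_bdd(f, n):
    """Build a reduced ordered BDD of f(x1..xn) by recursive Shannon expansion.

    Returns (root, nodes). Terminals are the ints 0 and 1; every non-terminal
    id maps to (variable_index, low_child, high_child) in `nodes`.
    """
    nodes = {}     # node id -> (var, low, high)
    unique = {}    # (var, low, high) -> node id, so identical subgraphs are shared

    def expand(assignment):
        var = len(assignment)
        if var == n:
            return int(bool(f(*assignment)))   # terminal node: function value
        low = expand(assignment + (0,))        # cofactor with x_var = 0
        high = expand(assignment + (1,))       # cofactor with x_var = 1
        if low == high:                        # the test is redundant, skip it
            return low
        key = (var, low, high)
        if key not in unique:
            unique[key] = len(nodes) + 2       # ids 0 and 1 are the terminals
            nodes[unique[key]] = key
        return unique[key]

    return expand(()), nodes

def evaluate(root, nodes, x):
    """Non-terminal node: if-then-else on one variable; stop at a terminal."""
    node = root
    while node not in (0, 1):
        var, low, high = nodes[node]
        node = high if x[var] else low
    return node

def example(x1, x2, x3):                       # example function (assumed)
    return (x1 and x2) or x3

root, nodes = build_bdd(example, 3)
print(len(nodes), "non-terminal nodes")
print(all(evaluate(root, nodes, x) == int(bool(example(*x)))
          for x in product((0, 1), repeat=3)))
```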
[Figure: BDD over the variables x1, ..., x6 with terminal nodes 0 and 1; non-terminal and terminal nodes are labeled]
15. Measurement of BDD
Memory size: (# of nodes) × (size of a node)
Worst-case performance: LPL (Longest Path Length)
→ Dedicated fully pipelined hardware
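Both measures are easy to compute for a diagram stored as a node table. The sketch below assumes that LPL counts the non-terminal nodes on the longest root-to-terminal path; the diagram, the node size of 32 bits, and all names are hypothetical.

```python
from functools import lru_cache

# Hypothetical diagram: node id -> (variable, low_child, high_child);
# the ids 0 and 1 are the terminals and are not stored in the table.
NODES = {
    2: ("x3", 0, 1),
    3: ("x2", 2, 1),
    4: ("x1", 2, 3),
}
ROOT = 4
NODE_BITS = 32   # assumed size of one stored node, in bits

def memory_bits(nodes, node_bits):
    """Memory size = (# of nodes) x (size of a node)."""
    return len(nodes) * node_bits

def longest_path_length(nodes, root):
    """Number of non-terminal nodes on the longest root-to-terminal path."""
    @lru_cache(maxsize=None)
    def depth(node):
        if node not in nodes:          # terminal
            return 0
        _, low, high = nodes[node]
        return 1 + max(depth(low), depth(high))
    return depth(root)

print(memory_bits(NODES, NODE_BITS), "bits")
print(longest_path_length(NODES, ROOT), "nodes on the longest path")
```

The LPL bounds the number of sequential node evaluations for any input, which is what a fully pipelined implementation has to provision stages for.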
[Figure: BDD over the variables x1, ..., x6 with terminal nodes 0 and 1]
16. Multi-Valued Decision Diagram (MDD)
• MDD(k): 2^k outgoing edges per node
• Evaluates k variables at a time
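A minimal sketch of MDD(2) evaluation, assuming six binary inputs grouped into three 2-bit super variables as in the figure. The example function, the node names, and the table layout are assumptions made for illustration.

```python
from itertools import product

def pack(bits, k):
    """Group binary inputs into k-bit super variables with values 0 .. 2^k - 1."""
    assert len(bits) % k == 0
    return [int("".join(map(str, bits[i:i + k])), 2) for i in range(0, len(bits), k)]

# MDD(2) for the assumed example f = x1*x2 + x3*x4 + x5*x6.
# node name -> (super-variable index, list of 2^k successors); ints are terminals.
MDD = {
    "a": (0, ["b", "b", "b", 1]),   # X1 = {x1, x2}
    "b": (1, ["c", "c", "c", 1]),   # X2 = {x3, x4}
    "c": (2, [0, 0, 0, 1]),         # X3 = {x5, x6}
}

def evaluate_mdd(mdd, root, bits, k):
    """Follow one edge per node; each node consumes k input bits at once."""
    values = pack(bits, k)
    node, steps = root, 0
    while node in mdd:
        level, edges = mdd[node]
        node = edges[values[level]]
        steps += 1
    return node, steps

def f(b):
    return int((b[0] and b[1]) or (b[2] and b[3]) or (b[4] and b[5]))

print(evaluate_mdd(MDD, "a", (0, 1, 1, 0, 1, 1), 2))   # (function value, nodes visited)
print(all(evaluate_mdd(MDD, "a", b, 2)[0] == f(b) for b in product((0, 1), repeat=6)))
```

With k = 2 at most three nodes are visited for six inputs, at the cost of four successor entries per node instead of two.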
[Figure: (left) BDD over the variables x1, ..., x6; (right) MDD(2) with super variables X1 = {x1,x2}, X2 = {x3,x4}, X3 = {x5,x6}; both have terminal nodes 0 and 1]
17. Comparison of the BDT with the MDD
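[Figure: (left) BDT with nodes X2<0.53?, X2<0.29?, X1<0.09?, X1<0.63?, X1<0.71?, Y/N edges, and leaves C1 and C2; (right) MDD with one X2 node at the root, two X1 nodes below it, multi-way edges labeled with interval bounds (<0.29, <0.53, <0.63, <0.71, <1.00), and terminals C1 and C2]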
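The conversion can be sketched by quantizing each feature into intervals, so that every variable is tested exactly once on a path. The thresholds are the ones shown for the BDT; the leaf labels, the table layout, and the function names are assumptions for illustration, not the authors' construction.

```python
from bisect import bisect_right

X1_CUTS = [0.09, 0.63, 0.71]   # X1 intervals: <0.09, <0.63, <0.71, otherwise
X2_CUTS = [0.29, 0.53]         # X2 intervals: <0.29, <0.53, otherwise

# Root: one multi-way test on X2. Each edge is a class label or an X1 node,
# and an X1 node is a list with one entry per X1 interval.
MDD = [
    "C1",                          # X2 < 0.29
    ["C2", "C2", "C1", "C1"],      # 0.29 <= X2 < 0.53: C2 if X1 < 0.63, else C1
    ["C1", "C2", "C2", "C1"],      # X2 >= 0.53: C1 if X1 < 0.09, C2 if X1 < 0.71, else C1
]

def classify_mdd(x1, x2):
    """At most two table lookups: X2 once, then X1 once."""
    edge = MDD[bisect_right(X2_CUTS, x2)]
    if isinstance(edge, str):
        return edge
    return edge[bisect_right(X1_CUTS, x1)]

def classify_bdt(x1, x2):
    """Reference BDT with the same (assumed) leaf labels, up to three tests."""
    if x2 < 0.53:
        if x2 < 0.29:
            return "C1"
        return "C2" if x1 < 0.63 else "C1"
    if x1 < 0.09:
        return "C1"
    return "C2" if x1 < 0.71 else "C1"

grid = [i / 20 for i in range(21)]
print(all(classify_mdd(a, b) == classify_bdt(a, b) for a in grid for b in grid))
```

The path length drops from up to three binary tests to at most two multi-way lookups, while each node now stores one edge per interval.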
19. Complexities of the BDT and the MDD
        # Nodes        LPL
BDT     O(Σ|Xi|)       O(Σ|Xi|)
MDD     O(|Xi|^k)      O(n)
The RF prefers shallow decision trees to avoid overfitting
21. FPGA (Field Programmable Gate Array)
• Reconfigurable architecture
• Look-up Table (LUT)
• Configurable channel
• Advantages
• Faster than a CPU
• Lower power dissipation than a GPU
• Shorter design time than an ASIC
24. System Design Tool
1. Behavior design + pragmas
2. Profile analysis
3. IP core generation by HLS
4. Bitstream generation by FPGA CAD tool
5. Middleware generation
↓
Automatically done
28. Comparison of Platforms
• Implemented the RF on the following devices
• CPU: Intel Core i7 650
• GPU: NVIDIA GeForce GTX Titan
• FPGA: Terasic DE5-NET
• Measured dynamic power, including the host PC
• Test bench: 10,000 random vectors
• Execution time includes the communication time between the host PC and the devices
30. Conclusion
• Proposed the RF using MDDs
• Reduced the path length
• Increased the column multiplicity
• # of nodes: O(|X|^k)
• A shallow decision diagram is recommended to avoid overfitting
• Developed a high-level synthesis design flow for FPGA realization
• 10.7x faster than the GPU
• 14.0x faster than the CPU