IMAX3: Amazing Dataflow-Centric CGRA and its Applications
I present this slide to all hungry engineers who are tired of CPU, GPU, FPGA, tensor core, AI core, who want some challenging one with no black box inside, and who want to improve by themselves.
7. 7
Scalar, SIMD and CGRA
20210401
time
Scalar
(VL=32)
I1
L2
VST
L2
VLD VLD
VFMA
I1
L2
VST
L2
VLD VLD
VFMA
I1
L2
VST
L2
VLD VLD
VFMA
I1
L2
VST
L2
VLD VLD
VFMA
MM
Vector1
(VL=256)
Vector2
(VL=2048)
CGRA
(VL=16K)
LD LM MM
LD LM FMA LM
ST LD LM LD LM FMA LM
ST
LD LM MM
LD LM FMA LM
ST LD LM LD LM FMA LM
ST
LD LM MM
LD LM FMA LM
ST LD LM LD LM FMA LM
ST
LD LM MM
LD LM FMA LM
ST LD LM LD LM FMA LM
ST
LD LD FMA ST
D1 D1 D1
I1 I1 I1
L2
L2
LD LD FMA ST
D1 D1 D1
I1 I1 I1
L2
L2
LD LD FMA ST
D1 D1 D1
I1 I1 I1
L2
L2
LD LD FMA ST
D1 D1 D1
I1 I1 I1
L2
L2
LD LD FMA ST
D1 D1 D1
I1 I1 I1
L2
L2
LD LD FMA ST
D1 D1 D1
I1 I1 I1
L2
L2
LD LD FMA ST
D1 D1 D1
I1 I1 I1
L2
L2
LD LD FMA ST
D1 D1 D1
I1 I1 I1
L2
L2
MM
I1
I1
I1
I1
VST
VST
VST
VST
VFMA
VFMA
VFMA
VFMA
VLD
VLD
VLD
VLD
VLD
VLD
VLD
VLD
MM
8. 1988 VPP 4way VLIW+8elem. Vector
My CGRAs began from VLIW+Vector processor
F D W
C
D
M M M M
M M M M D D M M M M
Load pipe Store pipe
FPU pipe
F D
F D
Cache lines are shared among
heterogeneous multithreads.
M M M M M M M M
D
W
M M M M M M M M
M M M M M M M M
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
C
E
E
E
E
M M M M
B
B
B
B
B
B
B
B
B
B
Vector Length
≤ 2048 dwords
($miss)
F D D
2008 LAPP 8way VLIW+32stage Array
C2
E
E
E
LD
M M M M
C1
E
E
E
LD
C0
E
E
E
E
B
B
B
B
B
C3
E
E
E
E
HW Mapper (8stages) HW Mapper (binary compatible)
Vector Length
≤ 1024 words
E
E
E
E
E
E
E
ST
E
E
E
LD
E
E
E
LD
E
E
E
E
E
E
E
ST
2006 OROCHI 9way VLIW+5way SS
20210401
8
Data from all cache ways are passed through array.
Location free LD/ST can keep binary compatibility.
19. さらに縦に並べて全体のデータ流は上から下へ
20210401
19
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
20. さらにマルチチップ拡張として、横に増やす
20210401
20
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
21. ここで、論理4UNITを物理1UNITに重畳
20210401
21
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
22. 演算位置とローカルメモリ位置の同調制御のため、リング構造化
20210401
22
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
25. 20210401
25
外部メモリとも協調させ、途切れないデータフローを作る
C
O
N
F
R
E
G
s
A
D
D
R
Overlapping post-drain, burst-exec, pre-fetch
L
M
M
L
M
M
L
M
M
Burst exec.
R
E
G
s
A
D
D
R
L
M
M
L
M
M
L
M
M
Burst exec.
C
O
N
F
A
D
D
R
R
E
G
s
Sequential execution
L
M
M
L
M
M
L
M
M
Burst exec. A
D
D
R
R
E
G
s
L
M
M
time
L
M
M
L
M
M
Burst exec.
PIO/DMA External Memory
PIO/DMA External Memory
R
E
G
s
A
D
D
R
L
M
M
L
M
M
L
M
M
Burst exec.
61. 20210401
61
ハッシュ計算には、面倒な演算が必要
for (i=0; i<ctx->mbuflen; i+=BLKSIZE) { /* 1データ流内の並列実行は不可能. 多数データ流のパイプライン実行のみ */
for (th=0; th<thnum; th++) {
sregs[th*8+0] = state[th*8+0]; sregs[th*8+1] = state[th*8+1]; sregs[th*8+2] = state[th*8+2]; sregs[th*8+3] = state[th*8+3];
sregs[th*8+4] = state[th*8+4]; sregs[th*8+5] = state[th*8+5]; sregs[th*8+6] = state[th*8+6]; sregs[th*8+7] = state[th*8+7];
}
for (j=0; j<BLKSIZE; j+=BLKSIZE/DIV) {
for (th=0; th<thnum; th++) {
a = sregs[th*8+0]; b = sregs[th*8+1]; c = sregs[th*8+2]; d = sregs[th*8+3];
e = sregs[th*8+4]; f = sregs[th*8+5]; g = sregs[th*8+6]; h = sregs[th*8+7];
#if (DIV==4)
t1 = h+EP1(e)+CH(e,f,g)+k[j+ 0]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+ 0]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
t1 = h+EP1(e)+CH(e,f,g)+k[j+ 1]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+ 1]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
t1 = h+EP1(e)+CH(e,f,g)+k[j+ 2]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+ 2]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
t1 = h+EP1(e)+CH(e,f,g)+k[j+ 3]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+ 3]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
t1 = h+EP1(e)+CH(e,f,g)+k[j+ 4]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+ 4]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
t1 = h+EP1(e)+CH(e,f,g)+k[j+ 5]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+ 5]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
t1 = h+EP1(e)+CH(e,f,g)+k[j+ 6]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+ 6]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
t1 = h+EP1(e)+CH(e,f,g)+k[j+ 7]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+ 7]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
t1 = h+EP1(e)+CH(e,f,g)+k[j+ 8]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+ 8]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
t1 = h+EP1(e)+CH(e,f,g)+k[j+ 9]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+ 9]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
t1 = h+EP1(e)+CH(e,f,g)+k[j+10]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+10]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
t1 = h+EP1(e)+CH(e,f,g)+k[j+11]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+11]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
t1 = h+EP1(e)+CH(e,f,g)+k[j+12]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+12]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
t1 = h+EP1(e)+CH(e,f,g)+k[j+13]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+13]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
t1 = h+EP1(e)+CH(e,f,g)+k[j+14]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+14]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
t1 = h+EP1(e)+CH(e,f,g)+k[j+15]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+15]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
#endif
sregs[th*8+0] = a; sregs[th*8+1] = b; sregs[th*8+2] = c; sregs[th*8+3] = d;
sregs[th*8+4] = e; sregs[th*8+5] = f; sregs[th*8+6] = g; sregs[th*8+7] = h;
}
}
for (th=0; th<thnum; th++) {
state[th*8+0] += sregs[th*8+0]; state[th*8+1] += sregs[th*8+1]; state[th*8+2] += sregs[th*8+2]; state[th*8+3] += sregs[th*8+3];
state[th*8+4] += sregs[th*8+4]; state[th*8+5] += sregs[th*8+5]; state[th*8+6] += sregs[th*8+6]; state[th*8+7] += sregs[th*8+7];
}
}