Streaming App Synchronization

Synchronization Synthesis for Large Scale
Parallel Streaming Applications
Vivek Venugopal

Committee
Dr. Cameron Patterson
Dr. Peter Athanas
Dr. Paul Plassmann
Dr. Jeffrey Reed
Dr. Kevin Shinpaugh

1

Outline

• Research Overview
• Introduction
• Related Work
• Research Statement
• Methodology
• Target Applications & Evaluation
• Results
• Contributions

2

Research Overview

Streaming application → Set of transformations → Specialized hardware platform

Research scope
• How to partition the algorithm?
• How to map and where?
• What are the communication resources?
• How is synchronization guaranteed?

3

Streaming Architecture without Flow Control (SAFC)

[Figure: 6 × 6 array of processing elements, PE 1 to PE 36, driven by a single clock source]
Streaming architecture with a large number of PEs, requiring more than one board

[Figure: four ML310 boards (board 1 to board 4) connected by Aurora links; inside an ML310, PE1 to PE9 are connected by FSL links, ringed by Aurora switches, and driven by a clock source]
ML310 boards connected in a mesh, driven by the same clock value but from different sources

4

Clock domains: GALS scenario

PE1 PE2

Data Data Data

• GALS (Globally Asynchronous Locally Synchronous)

5

Synchronization

Data
Source IC Destination IC

clock

• System Synchronous
• Synchronization synthesis
6

Data type

Packet-based data                          | Streaming data
Easy to start and stop                     | Cannot be stopped
Easier synchronization due to flow control | Synchronization is difficult, leading to data loss if not done properly
Better dynamic scheduling of resources     | Better static scheduling of resources
Best-effort service                        | Guaranteed service

7

Communication framework
System-level communication framework
• Point-to-point interconnect: custom, uniform
• Bus-based architecture: shared, split
• Network-on-Chip

8

Point-to-point Interconnect

[Figure: three point-to-point topologies: a ring of nodes 1 to 4, a 3 × 3 2D torus of nodes 1 to 9, and a 3 × 3 2D mesh of nodes 1 to 9]
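The difference between the ring, 2D mesh, and 2D torus can be sketched as a neighbor calculation. This is only an illustration: the function names and the row-major 1 to 9 labeling are assumptions, not part of the original slides.

```python
def ring_neighbors(node, n):
    """Neighbors of `node` (0-based) in a ring of n nodes."""
    return sorted({(node - 1) % n, (node + 1) % n} - {node})


def grid_neighbors(row, col, rows, cols, wrap):
    """Neighbors of (row, col) in a rows x cols grid.

    wrap=False models a 2D mesh (links stop at the border);
    wrap=True models a 2D torus (links wrap around the border).
    """
    result = set()
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + dr, col + dc
        if wrap:
            r, c = r % rows, c % cols
        elif not (0 <= r < rows and 0 <= c < cols):
            continue
        if (r, c) != (row, col):
            result.add((r, c))
    return sorted(result)


def label(row, col, cols):
    """1-based node label in row-major order (assumed labeling)."""
    return row * cols + col + 1


# Corner node 1 of a 3 x 3 grid: two links in a mesh, four in a torus.
mesh_corner = [label(r, c, 3) for r, c in grid_neighbors(0, 0, 3, 3, wrap=False)]
torus_corner = [label(r, c, 3) for r, c in grid_neighbors(0, 0, 3, 3, wrap=True)]
print(mesh_corner)
print(torus_corner)
```

In the torus every node has the same number of links, while in the mesh the border nodes have fewer.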
9

Bus-based architecture

[Figure: PowerPC, block RAM, memory interface, I/O interface, high-speed peripheral, and low-speed peripheral, connected through a PLB arbiter, an OPB arbiter, and an OPB bridge]

• IBM CoreConnect architecture

10

Network-on-Chip (NoC)
Router

Link

core core core

core core core

core core core

11

Multi-core Streaming Architecture
Network Of FPGAs with Integrated Aurora Switches (NOFIS)
[Figure: ML310 board 1 and ML310 board 2; on each board PE1 to PE3 are connected by FSL links and ringed by Aurora switches, and the two boards are joined by an Aurora link]



12

NOFIS: On-board communication
Master            FIFO    Slave
FSL_M_Clk                 FSL_S_Clk
FSL_M_Data                FSL_S_Data
FSL_M_Control             FSL_S_Control
FSL_M_Write               FSL_S_Read
FSL_M_Full                FSL_S_Exists

• Fast Simplex Link (FSL): unidirectional FIFO interface
• Configurable FIFO depth and clocking modes
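The handshake implied by the Full and Exists signals can be sketched with a small software model. This is a toy model under stated assumptions, not the Xilinx FSL core: the class name `FslFifo` and its methods are hypothetical, and clocking is ignored.

```python
from collections import deque


class FslFifo:
    """Toy model of a unidirectional FIFO link with a configurable depth."""

    def __init__(self, depth):
        self.depth = depth
        self._words = deque()

    @property
    def full(self):
        """Plays the role of FSL_M_Full on the master side."""
        return len(self._words) >= self.depth

    @property
    def exists(self):
        """Plays the role of FSL_S_Exists on the slave side."""
        return len(self._words) > 0

    def write(self, data, control=False):
        """Non-blocking master write; returns False if the FIFO is full."""
        if self.full:
            return False
        self._words.append((data, control))
        return True

    def read(self):
        """Non-blocking slave read; returns None if no word exists."""
        if not self.exists:
            return None
        return self._words.popleft()


link = FslFifo(depth=2)
accepted = [link.write(word) for word in (10, 20, 30)]  # third write is refused
received = [link.read() for _ in range(3)]              # third read finds nothing
print(accepted, received)
```

A refused write is where a streaming source would lose data, which is why the buffer depth has to be chosen to match the data rates.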

13

NOFIS: Off-board communication

[Figure: two user applications (ML310), each with an Aurora interface, connected by an Aurora channel made up of Aurora lanes 1 to n]

• High-speed (3.125 Gb/s) and self-synchronous
14

Model of Computation (MoC): SDF
Synchronous Data Flow (SDF)

[Figure: SDF graph of actors A, B, and C connected through buffers, with fixed token rates of 1 or 2 on each edge and buffer delay elements]

• SDF exhibits ideal systolic dataflow behavior
• Varying data rates are not supported
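Because SDF rates are fixed, a periodic schedule can be checked ahead of time. The sketch below fires actors in a fixed order and tracks buffer occupancy; the graph, its rates, and the function name `run_schedule` are illustrative assumptions and are not taken from the slide's figure.

```python
def run_schedule(schedule, edges, iterations=3):
    """Fire actors in a fixed order and track the tokens on every edge.

    edges maps (producer, consumer) -> (tokens produced, tokens consumed).
    Returns the peak occupancy of each edge buffer. Raises ValueError if an
    actor fires without enough input tokens or if a buffer does not return
    to empty at the end of an iteration.
    """
    tokens = {edge: 0 for edge in edges}
    peak = dict(tokens)
    for _ in range(iterations):
        for actor in schedule:
            # Consume from every input edge of this actor.
            for (src, dst), (_, consumed) in edges.items():
                if dst == actor:
                    if tokens[(src, dst)] < consumed:
                        raise ValueError(f"{actor} fired without enough tokens")
                    tokens[(src, dst)] -= consumed
            # Produce on every output edge of this actor.
            for (src, dst), (produced, _) in edges.items():
                if src == actor:
                    tokens[(src, dst)] += produced
                    peak[(src, dst)] = max(peak[(src, dst)], tokens[(src, dst)])
        if any(tokens.values()):
            raise ValueError("schedule is not periodic: tokens left in buffers")
    return peak


# Hypothetical rates: A produces 1, B consumes 2; B produces 2, C consumes 1.
edges = {("A", "B"): (1, 2), ("B", "C"): (2, 1)}
peak = run_schedule(["A", "A", "B", "C", "C"], edges)
print(peak)
```

The peak values give the buffer depth each edge needs for this schedule, which is the kind of static sizing that fixed rates make possible.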

15

Model of Computation (MoC): PSDF
Parameterized Synchronous Data Flow (PSDF)

[Figure: PSDF graph of actors A, B, and C; port A1 feeds B1 through a buffer of size a1, port A2 feeds C1 through a buffer of size a2, and port C2 feeds B2 through a buffer of size a3]

firing of (A) × production rate of (A1) = firing of (B) × consumption rate of (B1)
firing of (A) × production rate of (A2) = firing of (C) × consumption rate of (C1)
firing of (C) × production rate of (C2) = firing of (B) × consumption rate of (B2)

• Supports reconfiguration and different data rates
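The three balance equations can be solved for the smallest whole number of firings per actor once concrete rates are chosen. The sketch below is one way to do that; the rate values and the function name `repetition_vector` are assumptions made for illustration. In a parameterized graph the same solve would be repeated whenever the rates are reconfigured.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetition_vector(edges, actors):
    """Solve firing(src) * produced == firing(dst) * consumed for all edges.

    edges is a list of (src, produced, dst, consumed) tuples.
    Returns the smallest positive integer firing counts, or None if the
    rates are inconsistent or the graph is not connected.
    """
    firing = {actors[0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for src, produced, dst, consumed in edges:
            if src in firing and dst not in firing:
                firing[dst] = firing[src] * produced / consumed
                changed = True
            elif dst in firing and src not in firing:
                firing[src] = firing[dst] * consumed / produced
                changed = True
    if len(firing) != len(actors):
        return None
    for src, produced, dst, consumed in edges:
        if firing[src] * produced != firing[dst] * consumed:
            return None
    # Scale the rational solution to the smallest integer solution.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (f.denominator for f in firing.values()), 1)
    counts = {actor: int(f * scale) for actor, f in firing.items()}
    common = reduce(gcd, counts.values())
    return {actor: count // common for actor, count in counts.items()}


# Hypothetical rates for the ports A1->B1, A2->C1, and C2->B2.
edges = [("A", 3, "B", 2), ("A", 1, "C", 2), ("C", 3, "B", 1)]
firings = repetition_vector(edges, ["A", "B", "C"])
print(firings)
```

If the function returns None, no finite buffer sizes can keep the three actors in step for those rates.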
16

Multi-core Multi-processor trend
[Figure: number of cores (log scale, 10 to 1000) versus year of production (2003 to 2009), comparing Moore's Law with multi-core growth]

• Need a single unified parallel programming tool for exploiting parallel processing at the core level
• Kill Rule by Agarwal: correlate to communication cores
17

Related work
Related work  | What                                  | MoC               | Shortcomings
Compaan-Laura | compilers                             | KPN based, Matlab | suboptimal bandwidth utilization due to infinite-length FIFOs
CERBERO       | custom architecture                   | MPI model         | synchronization scheme always fixed with the master PE
TMD           | custom architecture                   | MPI model         | PEs are connected using an MPI communication library, no automation
CORES/HASIS   | custom architecture + transformations | SDF/FSM model     | manual partitioning and scheduling of communication resources, run-time reconfiguration outside scope

18

Research Question

What transformations are required to map a streaming
application on a systolic-like architecture, with the low-level
communication interface details hidden from the end-user
and at the same time support automation for implementing
streaming applications on the platform?

19

Conventional design flow

1. Design capture
2. Partitioning and scheduling of processes
3. Select parameters for the customizable cores
4. Map the values on the hardware

The steps are repeated in a manual redesign loop.

20

Proposed PRO-PART Design Flow
Input specifications
• SFG representation of application algorithm + component specification
• Specification of implementation platform (NOFIS platform)

PRO-PART design flow (automated)
1. Streaming structure and dataflow capture using SFG specification components
2. Partitioning and communication resource specification
3. Configure and generate values for communication cores
4. Mapping to hardware

Objectives:
• Partition algorithm
• Identify communication resources
• Schedule and embed flow control
• Guarantee overall synchronization without re-design loops
21

SAFC Flow Graph (SFG)
[Figure: SFG with functions F1 to F9 over a 4 × 4 arrangement of FPGAs (FPGA 1 to FPGA 16); data arrives from the north, east, south, and west inputs, and synchronized data blocks are recorded onto disk]

• Provides abstraction to view a streaming system with a
single universal clock (I/O rate)
22

Platform specification

[Figure: ML310 board 1 and ML310 board 2; on each board PE1 to PE4 are connected by FSL links and ringed by Aurora switches, and the two boards are joined by an Aurora link]

23

Process Partitioning
ML310 board 1
  Read image1 from memory and create zone1; read image2 from memory and create zone2
  f1 = FFT(zone1); f2 = FFT(zone2)

ML310 board 2
  f3 = mult(f1, f2)
  f4 = IFFT(f3)
  sub-pixel(f4)

• Assign process id depending on order of execution and
partition between boards
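One way to sketch this step is a topological ordering of the process graph followed by a cut between stages. The process names, the `cut_level` rule, and the function name are assumptions for illustration; the slides do not state how PRO-PART chooses the cut.

```python
from collections import deque


def assign_ids_and_levels(deps):
    """Number processes in a valid execution order (Kahn's algorithm).

    deps maps each process to the list of processes it depends on.
    Returns (ids, levels); levels[p] is the longest dependency chain before p.
    """
    indegree = {proc: len(parents) for proc, parents in deps.items()}
    users = {proc: [] for proc in deps}
    for proc, parents in deps.items():
        for parent in parents:
            users[parent].append(proc)
    ready = deque(proc for proc in deps if indegree[proc] == 0)
    ids, levels = {}, {}
    while ready:
        proc = ready.popleft()
        ids[proc] = len(ids) + 1
        levels[proc] = max((levels[p] + 1 for p in deps[proc]), default=0)
        for child in users[proc]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(ids) != len(deps):
        raise ValueError("dependency cycle in process graph")
    return ids, levels


# PIV process graph from the slide, with hypothetical short names.
deps = {
    "read1": [], "read2": [],
    "fft1": ["read1"], "fft2": ["read2"],
    "mult": ["fft1", "fft2"],
    "ifft": ["mult"],
    "subpixel": ["ifft"],
}
ids, levels = assign_ids_and_levels(deps)

cut_level = 2  # assumed rule: stages before the multiply go on board 1
board = {proc: 1 if levels[proc] < cut_level else 2 for proc in deps}
# Edges that cross the cut need an off-board channel.
crossing = [(p, c) for c, parents in deps.items() for p in parents
            if board[p] != board[c]]
print(ids, board, crossing)
```

With this cut, the two FFT outputs are the only edges that cross between boards, so two channels have to be carried over the board-to-board link.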
24

Configuring comm. resources
ML310 board 1 (synchronous non-blocking mode for FSL)
  Read image1 from memory and create zone1; read image2 from memory and create zone2
  f1 = FFT(zone1); f2 = FFT(zone2)

Between boards: channel multiplexing over Aurora and flow control mode

ML310 board 2
  f3 = mult(f1, f2)
  f4 = IFFT(f3)
  sub-pixel(f4)

• Configure buffer depths
• Map channels to physical links
• Schedule data over channels
• Multiplex virtual channels
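Multiplexing virtual channels over one physical link can be sketched as round-robin interleaving with a channel tag on every word. This is a simplified illustration: the function names and the tagging scheme are assumptions and do not describe the Aurora protocol or the actual NOFIS switch.

```python
from collections import deque


def multiplex(channels):
    """Round-robin interleaving of virtual channels onto one link.

    channels maps a channel id to the list of words queued on it.
    Returns the link traffic as (channel id, word) pairs.
    """
    queues = {cid: deque(words) for cid, words in channels.items()}
    link = []
    while any(queues.values()):
        for cid, queue in queues.items():
            if queue:
                link.append((cid, queue.popleft()))
    return link


def demultiplex(link):
    """Rebuild the per-channel word lists from tagged link traffic."""
    channels = {}
    for cid, word in link:
        channels.setdefault(cid, []).append(word)
    return channels


# Two virtual channels (the FFT outputs f1 and f2) share one board-to-board link.
sent = {"f1": [1, 2, 3], "f2": [10, 20]}
link = multiplex(sent)
print(link)
print(demultiplex(link) == sent)
```

The order of words within each channel is preserved, which is what the receiving board needs to keep the two streams aligned.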

25

Mapping to Hardware
[Figure: inside an ML310, four PEs connected by FSL links, with I/O units on data_in1, data_in2, data_out1, and data_out2]

• I/O unit generates parameter values for: FIFO generator,
FSL block, sync counter, Aurora FSL switch
26

Particle Image Velocimetry (PIV)

• Cardiovascular disease (CVD) is the leading cause of death in the United States and accounted for more than 37.1% of all fatalities in 2005.
• AEThER Lab at Virginia Tech models cardiovascular fluid dynamics
27

PIV algorithm
[Figure: zone 1 of Image 1 (time t) and zone 2 of Image 2 (time t + dt) each pass through an FFT; the two results go through Multiplication, IFFT, and Reduction to yield the motion vector]

• Data-intensive: each case results in 1250 image pairs × 5 MB = 6.25 GB
• Custom FlowIQ program: 16 minutes for one image pair on a 2 GHz Xeon processor, resulting in 2.6 years for analysis
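The FFT, multiplication, IFFT, and reduction chain can be sketched in one dimension with the standard library only. Several things here are assumptions: real PIV zones are 2D images, the multiplication is taken to use the complex conjugate of the first spectrum (the usual form of FFT-based cross-correlation), the reduction is taken to be a peak search, and the sub-pixel step is omitted. The function names are hypothetical.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT (unscaled); len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def displacement(zone1, zone2):
    """Circular shift (in samples) that best aligns zone1 with zone2."""
    f1 = fft(zone1)
    f2 = fft(zone2)
    # Cross-correlation in the frequency domain: conj(F1) * F2.
    f3 = [a.conjugate() * b for a, b in zip(f1, f2)]
    # Inverse transform, scaled by 1/N, gives the correlation per shift.
    f4 = [value.real / len(f3) for value in fft(f3, inverse=True)]
    # Reduction: the correlation peak marks the most likely shift.
    return max(range(len(f4)), key=f4.__getitem__)


zone1 = [0, 0, 1, 3, 1, 0, 0, 0]
zone2 = [0, 0, 0, 0, 1, 3, 1, 0]  # zone1 moved two samples to the right
shift = displacement(zone1, zone2)
print(shift)
```

The two forward FFTs are independent of each other, which matches the slide's picture of two parallel FFT stages feeding one multiplication.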
28

PIV performance
[Figure: bar chart of execution time in seconds by execution device (CPU, GPU, FPGA); bar labels 4.50, 2.25, and 1.279]

• GPU is the fastest platform, but data transfer between device (GPU) and host (CPU) is expensive.
• PRO-PART + NOFIS: slower, but higher throughput due to efficient pipelining and customized communication cores (work in progress).
29

ETA Beamforming application
[Figure: antenna inputs feed S25 boards, which connect over LVDS connections to ML310 receiver nodes; the receiver nodes feed ML310 outer nodes, inner nodes, and recording nodes (each recording node attached to a PC disk) over a 2.5 Gbit/sec serial interconnect network (Aurora)]
30

ETA using PRO-PART + NOFIS
Current ETA implementation
• Time-consuming and extensive simulations
• Hardware efficient due to hand-coded RTL
Design flow (manual redesign loop):
1. Design capture
2. Partitioning and scheduling of processes
3. Select parameters for the customizable cores
4. Map the values on the hardware

Proposed implementation (work in progress)
• Shorter design cycle
• Potential increase in resource usage, but meets performance goals
Design flow (automated):
1. Systolic dataflow structure and capture using SFG specification components
2. Partitioning and communication resource specification
3. Configure and generate values for communication cores
4. Mapping to hardware

31

Contributions

• Map streaming applications to GALS architecture
• SAFC Flow Graph (SFG) representation
• PRO-PART design methodology
• Configurable communication cores
• Guarantee synchronization by meeting the I/O clock rate
• Increase designer productivity

32

Discussion
Questions

33

Supporting slides

34

Synchronization methods
Source synchronous

Data

Clock

Self synchronous

Data and
clock

35

Message Passing Interface (MPI)
[Figure: a user submits a job to the head node, which passes the application and dataset to compute nodes 1 to n and returns a formatted output file with combined results]

36

Models of Computation: CSP
[Figure: Processes 1 to 5 connected by channels; each channel has a sending end and a receiving end]

37

Models of Computation: KPN

[Figure: processes P1, P2, and P3 connected to one another by infinite sized FIFOs]

38

NVIDIA Tesla C1060
[Figure: GPU block diagram with Multiprocessors 1 to N; each multiprocessor has shared memory, per-processor registers, Processors 1 to M, an instruction unit, a constant cache, and a texture cache, backed by GPU memory]

39

PIV performance-cuda_profile

40

NOFIS Hardware

ML310 InfiniBand
adapter board

ML310

41
