This document discusses synchronization synthesis for large-scale parallel streaming applications. It introduces a proposed design flow called PRO-PART that uses a streaming application flow graph representation and configurable communication cores to map streaming applications to a multi-FPGA architecture in a way that guarantees overall synchronization. The contributions are mapping streaming applications to globally asynchronous locally synchronous architectures, the streaming application flow graph model, the PRO-PART design methodology, and ensuring synchronization by meeting input/output clock rates while hiding low-level communication details.
Streaming App Synchronization
1. Synchronization Synthesis for Large Scale Parallel Streaming Applications
Vivek Venugopal
Committee
Dr. Cameron Patterson
Dr. Peter Athanas
Dr. Paul Plassmann
Dr. Jeffrey Reed
Dr. Kevin Shinpaugh
2. Outline
• Research Overview
• Introduction
• Related Work
• Research Statement
• Methodology
• Target Applications & Evaluation
• Results
• Contributions
3. Research Overview
[Figure: a streaming application passes through a set of transformations onto a specialized hardware platform.]
Research scope
• How to partition algorithm?
• How to map and where?
• What are the communication resources?
• How is synchronization guaranteed?
4. Streaming Architecture without Flow Control (SAFC)
[Figure: a 6 × 6 grid of processing elements, PE 1 to PE 36, driven by a single clock source.]
Streaming architecture with large number of PE's requiring more than 1 board
6. Streaming Architecture without Flow Control (SAFC)
[Figure: the grid of PE 1 to PE 36 with its clock source, mapped onto four ML310 boards (board 1 to board 4) joined by Aurora links.]
Streaming architecture with large number of PE's requiring more than 1 board
ML310 boards connected in mesh driven by same clock value but different sources
[Figure: inside a ML310, PE1 to PE9 are connected by FSL links, with Aurora switches on each side of the board and a local clock source.]
7. Clock domains: GALS scenario
PE1 PE2
Data Data Data
• GALS (Globally Asynchronous Locally Synchronous)
8. Synchronization
Data
Source IC Destination IC
clock
• System Synchronous
• Synchronization synthesis
9. Data type
Packet based data | Streaming data
Start and stop easy for packet based data | Cannot stop streaming data
Easier synchronization due to flow control | Synchronization is difficult leading to data loss if not done properly
Better dynamic scheduling of resources | Better static scheduling of resources
Best-effort service | Guaranteed service
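The difference between stoppable packet traffic and unstoppable streaming traffic can be illustrated with a small FIFO simulation. This is a minimal sketch, not the deck's model: the function name, the per-cycle rates, and the buffer depth are all hypothetical. With flow control the producer stalls when the buffer is full; without it the excess words are lost.

```python
def simulate_fifo(producer_rate, consumer_rate, depth, cycles, flow_control):
    """Simulate a bounded FIFO and return (delivered, dropped, stalled).

    Rates are words per cycle (hypothetical units). With flow_control=True
    the producer holds back words that do not fit (packet-style traffic);
    with flow_control=False the source cannot stop, so the words are lost.
    """
    occupancy = delivered = dropped = stalled = 0
    for _ in range(cycles):
        # The consumer drains what it can this cycle.
        taken = min(occupancy, consumer_rate)
        occupancy -= taken
        delivered += taken
        # The producer offers new words; only the free space is accepted.
        accepted = min(producer_rate, depth - occupancy)
        refused = producer_rate - accepted
        if flow_control:
            stalled += refused
        else:
            dropped += refused
        occupancy += accepted
    return delivered, dropped, stalled


if __name__ == "__main__":
    print(simulate_fifo(2, 1, depth=4, cycles=10, flow_control=False))
    print(simulate_fifo(2, 1, depth=4, cycles=10, flow_control=True))
    print(simulate_fifo(1, 1, depth=4, cycles=10, flow_control=False))
```

When the producer and consumer rates match, nothing is dropped even without flow control, which is the situation a statically scheduled streaming design aims for.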
10. Communication framework
System-level communication framework
• Point-to-point interconnect: Custom, Uniform
• Bus-based architecture: Shared, Split
• Network-On-Chip
16. NOFIS: Off-board communication
[Figure: two user applications (each on an ML310) connected through their Aurora interfaces by an Aurora channel made up of Aurora lanes 1 to n.]
• High-speed (3.125 Gb/sec) and self-synchronous
17. Model of Computation (MoC): SDF
Synchronous Data Flow (SDF)
[Figure: actors A, B, and C connected through buffers with delay elements; each edge is annotated with fixed production and consumption rates.]
• SDF exhibits ideal systolic dataflow behavior
• Varying data rates are not supported
18. Model of Computation (MoC): PSDF
Parameterized Synchronous Data Flow (PSDF)
[Figure: actors A, B, and C with ports A1, A2, B1, B2, C1, and C2, connected through buffers of size a1, a2, and a3.]
firing of (A) ⨉ production rate of (A1) = firing of (B) ⨉ consumption rate of (B1)
firing of (A) ⨉ production rate of (A2) = firing of (C) ⨉ consumption rate of (C1)
firing of (C) ⨉ production rate of (C2) = firing of (B) ⨉ consumption rate of (B2)
• Supports reconfiguration and different data rates
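The three balance equations can be solved for the smallest whole number of firings per actor. The sketch below is illustrative only: the function name and the port rates (2, 3, 1, 2, 4, 3) are hypothetical values chosen to be consistent, not figures from the slides. In a parameterized graph the same solve would simply be repeated whenever a rate parameter changes.

```python
from fractions import Fraction
from math import lcm


def repetition_vector(actors, edges):
    """Solve firing(src) * production = firing(dst) * consumption.

    edges is a list of (src, production_rate, dst, consumption_rate).
    Assumes a connected graph. Returns the smallest positive integer
    firing count per actor, or raises ValueError if the rates conflict.
    """
    firing = {actors[0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for src, prod, dst, cons in edges:
            if src in firing and dst not in firing:
                firing[dst] = firing[src] * prod / cons
                changed = True
            elif dst in firing and src not in firing:
                firing[src] = firing[dst] * cons / prod
                changed = True
    # Every edge must balance, including those that closed a cycle.
    for src, prod, dst, cons in edges:
        if firing[src] * prod != firing[dst] * cons:
            raise ValueError(f"inconsistent rates on edge {src}->{dst}")
    scale = lcm(*(f.denominator for f in firing.values()))
    return {actor: int(f * scale) for actor, f in firing.items()}


if __name__ == "__main__":
    # Hypothetical rates: A1=2, B1=3, A2=1, C1=2, C2=4, B2=3.
    edges = [("A", 2, "B", 3), ("A", 1, "C", 2), ("C", 4, "B", 3)]
    print(repetition_vector(["A", "B", "C"], edges))
```

With these assumed rates the result is 6 firings of A, 4 of B, and 3 of C per schedule iteration, and each buffer returns to its starting occupancy after one iteration.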
19. Multi-core Multi-processor trend
[Figure: number of cores (log scale, 10 to 1000) versus year of production (2003 to 2009), comparing Moore's Law with multi-core growth.]
• Need single unified parallel programming tool for exploiting parallel processing at the core level
• Kill Rule by Agarwal: correlate to communication cores
20. Related work
Related work | What | MoC | Shortcomings
Compaan-Laura | Matlab compilers | KPN based | suboptimal bandwidth utilization due to infinite length FIFOs
CERBERO | custom architecture | MPI model | synchronization scheme always fixed with the master PE
TMD | custom architecture | MPI model | PEs are connected using a MPI communication library, no automation
CORES/HASIS | custom architecture + transformations | SDF/FSM model | manual partitioning and scheduling of communication resources, run-time reconfiguration outside scope
21. Research Question
What transformations are required to map a streaming
application on a systolic-like architecture, with the low-level
communication interface details hidden from the end-user
and at the same time support automation for implementing
streaming applications on the platform?
23. Conventional design flow
design capture → partitioning and scheduling of processes → select parameters for the customizable cores → map the values on the hardware
A manual redesign loop leads back from the later steps to the earlier ones.
24. Proposed PRO-PART Design Flow
Input specifications
• SFG representation of application algorithm
• Component specification of implementation platform (NOFIS platform)
PRO-PART design flow (automated)
streaming structure and dataflow capture using SFG specification components → partitioning and communication resource specification → configure and generate values for communication cores → mapping to hardware
Objectives:
• Partition algorithm
• Identify communication resources
• Schedule and embed flow control
• Guarantee overall synchronization without re-design loops
25. SAFC Flow Graph (SFG)
[Figure: sixteen FPGAs (FPGA 1 to FPGA 16) arranged in a 4 × 4 spiral, receiving data from the North, East, South, and West inputs; flows F1 to F9 carry the data through the array, and the synchronized data blocks are recorded onto disk.]
• Provides abstraction to view a streaming system with a single universal clock (I/O rate)
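One way to read the single universal I/O rate is as a per-flow throughput requirement that every link must meet. The sketch below is an assumption-laden illustration, not the PRO-PART check itself: the function names, the flow table layout, and the sample rates are hypothetical, and the 3.125 Gb/sec figure is the per-lane Aurora rate quoted elsewhere in this deck, used here without subtracting any encoding overhead.

```python
AURORA_LANE_BPS = 3.125e9  # per-lane rate quoted in the deck; overhead ignored


def link_margins(flows, lane_bps=AURORA_LANE_BPS):
    """Return the spare bandwidth in bits per second for each flow.

    flows maps a flow name to (samples_per_second, bits_per_sample, lanes).
    A negative margin means the link cannot sustain the I/O rate.
    """
    margins = {}
    for name, (sample_rate, bits, lanes) in flows.items():
        required = sample_rate * bits
        margins[name] = lanes * lane_bps - required
    return margins


def meets_io_rate(flows, lane_bps=AURORA_LANE_BPS):
    """True when every flow fits on the lanes assigned to it."""
    return all(m >= 0 for m in link_margins(flows, lane_bps).values())


if __name__ == "__main__":
    # Hypothetical flows: 16-bit samples at 100 MS/s and 250 MS/s.
    flows = {"F1": (100e6, 16, 1), "F2": (250e6, 16, 1)}
    print(link_margins(flows))
    print(meets_io_rate(flows))
    flows["F2"] = (250e6, 16, 2)  # give F2 a second lane
    print(meets_io_rate(flows))
```

A flow that fails the check would need more lanes, a lower sample width, or a different partition.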
27. Process Partitioning
[Figure: the PIV process chain split between ML310 board 1 and ML310 board 2.]
Read image1 from memory and create zone1; read image2 from memory and create zone2
f1 = FFT(zone1); f2 = FFT(zone2)
f3 = mult(f1, f2)
f4 = IFFT(f3)
sub-pixel(f4)
• Assign process id depending on order of execution and partition between boards
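Assigning process ids by order of execution amounts to a topological ordering of the process graph. The following is a minimal sketch under stated assumptions: the function names, the process labels, and the fixed number of processes per board are hypothetical, and the resulting board split is only an example, not necessarily the split shown on the slide.

```python
def assign_process_ids(deps):
    """Number processes 1..N so that every process follows its inputs.

    deps maps each process to the list of processes it waits on.
    Ties are broken by the order in which processes appear in deps.
    """
    remaining = {proc: set(waits_on) for proc, waits_on in deps.items()}
    order = []
    while remaining:
        ready = [proc for proc, waits_on in remaining.items() if not waits_on]
        if not ready:
            raise ValueError("dependency cycle: no valid execution order")
        for proc in ready:
            order.append(proc)
            del remaining[proc]
        for waits_on in remaining.values():
            waits_on.difference_update(ready)
    return {proc: pid for pid, proc in enumerate(order, start=1)}


def partition(ids, per_board):
    """Place consecutive process ids on the same board (hypothetical policy)."""
    boards = {}
    for proc, pid in ids.items():
        boards.setdefault((pid - 1) // per_board + 1, []).append(proc)
    return boards


PIV = {
    "read_zone1": [],
    "read_zone2": [],
    "f1_fft": ["read_zone1"],
    "f2_fft": ["read_zone2"],
    "f3_mult": ["f1_fft", "f2_fft"],
    "f4_ifft": ["f3_mult"],
    "sub_pixel": ["f4_ifft"],
}

if __name__ == "__main__":
    ids = assign_process_ids(PIV)
    print(ids)
    print(partition(ids, per_board=4))
```

Keeping consecutive ids on one board means only the edges that cross a board boundary need an off-board link.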
28. Configuring comm. resources
[Figure: the PIV process chain on ML310 board 1 and ML310 board 2. On-board links use synchronous non-blocking mode for FSL; the link between boards uses channel multiplexing over Aurora and flow control mode.]
Read image1 from memory and create zone1; read image2 from memory and create zone2
f1 = FFT(zone1); f2 = FFT(zone2)
f3 = mult(f1, f2)
f4 = IFFT(f3)
sub-pixel(f4)
• Configure buffer depths
• Map channels to physical links
• Schedule data over channels
• Multiplex virtual channels
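Multiplexing virtual channels over one physical link can be sketched as a repeating frame of slots. This is an illustrative round-robin scheme, not the deck's Aurora switch logic: the function names, channel names, and slot counts are hypothetical. Each word is tagged with its channel so the receiving board can separate the streams again.

```python
from collections import deque


def build_frame(words_per_frame):
    """Interleave channels round-robin into one frame of slots.

    words_per_frame maps a channel name to its number of slots per frame.
    """
    remaining = dict(words_per_frame)
    frame = []
    while any(remaining.values()):
        for channel in remaining:
            if remaining[channel] > 0:
                frame.append(channel)
                remaining[channel] -= 1
    return frame


def multiplex(streams, frame):
    """Send words from several streams over one link, following the frame."""
    missing = set(streams) - set(frame)
    if missing:
        raise ValueError(f"channels without a slot: {sorted(missing)}")
    queues = {channel: deque(words) for channel, words in streams.items()}
    link = []
    while any(queues.values()):
        for channel in frame:
            if queues[channel]:
                link.append((channel, queues[channel].popleft()))
    return link


def demultiplex(link):
    """Rebuild the per-channel streams from the tagged words on the link."""
    streams = {}
    for channel, word in link:
        streams.setdefault(channel, []).append(word)
    return streams


if __name__ == "__main__":
    frame = build_frame({"f1": 2, "f2": 2, "ctrl": 1})
    streams = {"f1": [1, 2, 3, 4], "f2": [5, 6, 7, 8], "ctrl": ["a", "b"]}
    link = multiplex(streams, frame)
    print(frame)
    print(link)
    print(demultiplex(link) == streams)
```

The number of slots a channel waits between turns in the frame gives a starting point for choosing its buffer depth.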
29. Mapping to Hardware
[Figure: inside an ML310, four PEs linked by FSL connections, with I/O units between the PEs and the external ports data_in1, data_in2, data_out1, and data_out2.]
• I/O unit generates parameter values for: FIFO generator, FSL block, sync counter, Aurora FSL switch
30. Particle Image Velocimetry (PIV)
• Cardiovascular disease (CVD) is the leading cause of death in the United States and accounted for more than 37.1% of all fatalities in 2005.
• AEThER Lab at Virginia Tech models cardiovascular fluid dynamics
31. PIV algorithm
[Figure: zone 1 of Image 1 (time t) and zone 2 of Image 2 (time t + dt) each pass through an FFT; the results go through Multiplication, IFFT, and Reduction to produce a motion vector.]
• Data-intensive, each case results in 1250 image pairs × 5 MB = 6.25 GB
• Custom FlowIQ program: 16 minutes for one image pair on a 2 GHz Xeon processor resulting in 2.6 years for analysis
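The FFT, multiplication, and IFFT chain is the standard way to compute a cross-correlation, whose peak gives the displacement between the two zones. The sketch below is a one-dimensional, pure-Python illustration under stated assumptions: the function names and test signals are hypothetical, the multiplication uses the complex conjugate of the first spectrum (the slide only says "Multiplication"), and the reduction and sub-pixel steps are omitted.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two.

    The inverse transform is left unscaled; the caller divides by len(x).
    """
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def displacement(zone1, zone2):
    """Estimate the circular shift of zone2 relative to zone1."""
    n = len(zone1)
    f1 = fft(zone1)
    f2 = fft(zone2)
    # Conjugate multiply: the spectrum of the circular cross-correlation.
    f3 = [a.conjugate() * b for a, b in zip(f1, f2)]
    f4 = [value.real / n for value in fft(f3, inverse=True)]
    peak = max(range(n), key=lambda lag: f4[lag])
    # Lags past the midpoint represent negative shifts.
    return peak if peak <= n // 2 else peak - n


if __name__ == "__main__":
    zone1 = [0, 0, 1, 3, 1, 0, 0, 0]
    zone2 = [0, 0, 0, 0, 1, 3, 1, 0]  # zone1 moved two samples to the right
    print(displacement(zone1, zone2))
    print(displacement(zone2, zone1))
```

A real PIV zone is two-dimensional, so the same idea would apply row-wise and column-wise with a 2D transform.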
32. PIV performance
[Figure: bar chart of time in seconds (axis 0 to 5.0) by execution device (CPU, GPU, FPGA); bar labels 4.50, 2.25, and 1.279.]
• GPU fastest platform, expensive data transfer between device (GPU) and host (CPU).
• PRO-PART + NOFIS: slower but higher throughput due to efficient pipelining and customized communication cores. (work in progress)
33. ETA Beamforming application
[Figure: antenna inputs feed S25 receiver nodes, which attach over LVDS connections to ML310 outer nodes. Outer nodes, inner nodes, and recording nodes (ML310 boards with PC disks) are linked by a 2.5 Gbit/sec Serial Interconnect Network (Aurora).]
34. ETA using PRO-PART + NOFIS
Current ETA implementation
• Time-consuming and extensive simulations
• Hardware efficient due to hand-coded RTL
• Flow (manual redesign loop): design capture → partitioning and scheduling of processes → select parameters for the customizable cores → map the values on the hardware
Proposed implementation (work in progress)
• Shorter design cycle
• Potential increase in resource usage, but meets performance goals
• Flow (automated): systolic dataflow structure and capture using SFG specification components → partitioning and communication resource specification → configure and generate values for communication cores → mapping to hardware
35. Contributions
• Map streaming applications to GALS architecture
• SAFC Flow Graph (SFG) representation
• PRO-PART design methodology
• Configurable communication cores
• Guarantee synchronization by meeting the I/O clock rate
• Increase designer productivity
38. Synchronization methods
• Source synchronous: the source IC sends data and a separate clock to the destination IC
• Self synchronous: the source IC sends data and clock together to the destination IC
39. Message Passing Interface (MPI)
[Figure: the user submits a job to the head node, which passes the application and dataset to compute nodes 1 to n; the user receives a formatted output file with combined results.]
40. Models of Computation: CSP
[Figure: Processes 1 to 5 communicating over channels, each channel having a sending end and a receiving end.]
41. Models of Computation: KPN
[Figure: processes P1, P2, and P3 connected to one another by infinite sized FIFOs.]