Synchronization Synthesis for Large Scale
     Parallel Streaming Applications
              Vivek Venugopal

                 Committee
           Dr. Cameron Patterson
              Dr. Peter Athanas
            Dr. Paul Plassmann
               Dr. Jeffrey Reed
            Dr. Kevin Shinpaugh



                                            1
Outline

  •   Research Overview
  •   Introduction
  •   Related Work
  •   Research Statement
  •   Methodology
  •   Target Applications & Evaluation
  •   Results
  •   Contributions


                                         2
Research Overview


     Streaming
     application
                          Set of
                     transformations

          Research scope
                                          Specialized
 •   How to partition algorithm?       hardware platform

 •   How to map and where?
 •   What are the communication resources?
 •   How is synchronization guaranteed?

                                                           3
Streaming Architecture without Flow Control (SAFC)

             PE       PE       PE
              1        2        6


             PE       PE       PE
              7        8       12



             PE       PE       PE
             31       32       36
 Clock
source
         Streaming architecture with
            large number of PE's
         requiring more than 1 board




                                                 4
Streaming Architecture without Flow Control (SAFC)

 [Figure, three parts]
   1. PE array: PE 1, 2, ... 6 / PE 7, 8, ... 12 / ... / PE 31, 32, ... 36,
      driven by one clock source. Streaming architecture with a large
      number of PEs requiring more than 1 board.
   2. Four ML310 boards (board 1 to board 4) connected by Aurora links in a
      mesh, driven by the same clock value but different sources.
   3. Inside an ML310: PE1 to PE9 in a 3 x 3 arrangement connected by FSL,
      with Aurora switches on the four sides of the board and a clock source.

                                                                                                                                         4
Clock domains: GALS scenario


                 PE1          PE2




          Data         Data         Data




  • GALS (Globally Asynchronous Locally Synchronous)

                                                       5
Synchronization

                           Data
            Source IC             Destination IC




    clock




 • System Synchronous
 • Synchronization synthesis
                                                   6
Data type

  Packet-based data                      Streaming data

  Start and stop easy for packet-based   Cannot stop streaming data
  data

  Easier synchronization due to flow     Synchronization is difficult, leading
  control                                to data loss if not done properly

  Better dynamic scheduling of           Better static scheduling of
  resources                              resources

  Best-effort service                    Guaranteed service


                                                                          7
Communication framework
                              System-level communication
                                      framework




   Point-to-point                    Bus-based
                                                            Network-On-Chip
   interconnect                      architecture




Custom              Uniform    Shared               Split




                                                                              8
Point-to-point Interconnect

               1       2          3       4




                           Ring

    1      2       3                  1          2      3




    4      5       6                  4          5      6




    7      8       9                  7          8      9




        2D Torus                              2D Mesh
                                                            9
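The difference between the 2D mesh and the 2D torus can be sketched in a few lines: a torus adds wrap-around links at the edges, so every node has four neighbors. This is only an illustration, assuming the 3 x 3 row-major numbering (1 to 9) shown in the figure; the function name `neighbors` and its parameters are hypothetical.

```python
def neighbors(node, rows=3, cols=3, wrap=False):
    """Return the sorted neighbor labels of `node` (1-based, row-major)
    in a rows x cols 2D mesh (wrap=False) or 2D torus (wrap=True)."""
    r, c = divmod(node - 1, cols)
    result = set()
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if wrap:
            # Torus: edges wrap around to the opposite side.
            nr, nc = nr % rows, nc % cols
        elif not (0 <= nr < rows and 0 <= nc < cols):
            # Mesh: links that would leave the grid do not exist.
            continue
        result.add(nr * cols + nc + 1)
    result.discard(node)  # guards the degenerate 1-wide torus case
    return sorted(result)


print(neighbors(1))             # corner node in the mesh
print(neighbors(1, wrap=True))  # same node in the torus
```

In the mesh, corner node 1 links only to nodes 2 and 4; in the torus it also links to nodes 3 and 7 through the wrap-around connections.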
Bus-based architecture

            Memory                   High-speed                Low-speed
           interface                 peripheral                peripheral


                       Block                                                      I/O
                       RAM                                                     interface




                                                  OPB Bridge
Power PC


                       PLB arbiter                               OPB arbiter




  • IBM CoreConnect architecture

                                                                                           10
Network-on-Chip (NoC)
                            Router


                            Link

       core   core   core




       core   core   core




       core   core   core




                                     11
Multi-core Streaming Architecture
Network Of FPGAs with Integrated Aurora Switches (NOFIS)

 [Figure: two ML310 boards (board 1 and board 2) joined by an Aurora link.
  Each board holds PE1 to PE9 in a 3 x 3 arrangement connected by FSL, with
  Aurora switches on the four sides of the board.]

                                                                                                                                                              12
NOFIS: On-board communication
          Master                            Slave

   FSL_M_Clk                                 FSL_S_Clk
  FSL_M_Data                                 FSL_S_Data
                            FIFO
FSL_M_Control                                FSL_S_Control
  FSL_M_Write                                FSL_S_Read
  FSL_M_Full                                 FSL_S_Exists




 • Fast Simplex Link (FSL): unidirectional FIFO interface
 • Configurable FIFO depth and clocking modes
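
A rough behavioral model can make the master/slave handshake concrete. The sketch below is not cycle-accurate and is not the Xilinx core: it only mirrors the signal names on this slide (FSL_M_Write, FSL_M_Full, FSL_S_Read, FSL_S_Exists) as methods and properties of a hypothetical `FslFifo` class, and it assumes a single clock domain.

```python
from collections import deque


class FslFifo:
    """Behavioral sketch of a unidirectional FIFO link (assumed semantics)."""

    def __init__(self, depth=16):
        self.depth = depth      # configurable FIFO depth
        self._words = deque()

    @property
    def full(self):
        # Corresponds to FSL_M_Full on the master side.
        return len(self._words) >= self.depth

    @property
    def exists(self):
        # Corresponds to FSL_S_Exists on the slave side.
        return len(self._words) > 0

    def write(self, data, control=False):
        """Non-blocking write (FSL_M_Write): returns False if the word is
        dropped because the FIFO is full."""
        if self.full:
            return False
        self._words.append((data, control))
        return True

    def read(self):
        """Non-blocking read (FSL_S_Read): returns (data, control), or None
        when no word is available."""
        if not self.exists:
            return None
        return self._words.popleft()


link = FslFifo(depth=2)
accepted = [link.write(word) for word in (10, 20, 30)]
print(accepted)  # the third write is refused because the FIFO is full
```

With a depth of 2 and no reader draining the link, the third word is lost. This is the data-loss risk for streaming data that the buffer depths have to be configured against.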

                                                             13
NOFIS: Off-board communication

                                     Aurora
                           Aurora   Channel
                           Lane 1



   User                                                      User
                Aurora                         Aurora
Application                                               Application
               interface                      interface
 (ML310)                                                   (ML310)



                           Aurora
                           Lane n




     • High-speed (3.125 Gb/sec) and self synchronous
                                                                        14
Model of Computation (MoC): SDF
            Synchronous Data Flow (SDF)

                    1                   2
                         Buffer                 2
                A                           B
                                        2




                                                    Buffer
                      delay    Buffer   1       1
                    elements                C




 • SDF exhibits ideal systolic dataflow behavior
 • Varying data rate not supported

                                                             15
Model of Computation (MoC): PSDF
         Parametrized Synchronous Data Flow (PSDF)
               A1                     Buffer                 B1
          A                          size a1                      B
               A2                                            B2
                      Buffer                      Buffer
                      size a2                     size a3
                                C1     C       C2



firing of (A) ⨉ production rate of (A1) = firing of (B) ⨉ consumption rate of (B1)
firing of (A) ⨉ production rate of (A2) = firing of (C) ⨉ consumption rate of (C1)
firing of (C) ⨉ production rate of (C2) = firing of (B) ⨉ consumption rate of (B2)


  • supports reconfiguration and different data rates
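
The three balance equations above can be solved for the smallest whole-number firing counts of A, B, and C. The sketch below is an illustration only, not the PRO-PART implementation; the function name and the parameter names (`prod_A1`, `cons_B1`, and so on, one rate per port in the figure) are hypothetical, and the example rates are made up. It needs Python 3.9 or later for `math.lcm` with several arguments.

```python
from fractions import Fraction
from math import lcm


def repetition_vector(prod_A1, cons_B1, prod_A2, cons_C1, prod_C2, cons_B2):
    """Return the minimal positive integer firings (qA, qB, qC) that satisfy
    the three balance equations, or None if the rates are inconsistent."""
    q_a = Fraction(1)
    q_b = q_a * prod_A1 / cons_B1   # from qA * prod(A1) = qB * cons(B1)
    q_c = q_a * prod_A2 / cons_C1   # from qA * prod(A2) = qC * cons(C1)
    # The third equation, qC * prod(C2) = qB * cons(B2), must hold as well.
    if q_c * prod_C2 != q_b * cons_B2:
        return None
    scale = lcm(q_a.denominator, q_b.denominator, q_c.denominator)
    return tuple(int(q * scale) for q in (q_a, q_b, q_c))


# One parameter setting, then a reconfigured one (both hypothetical).
print(repetition_vector(1, 2, 1, 1, 1, 2))
print(repetition_vector(3, 2, 1, 2, 3, 1))
```

Because the rates are parameters, the same check can be rerun after each reconfiguration. A `None` result means no periodic schedule with bounded buffers exists for that setting.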
                                                                               16
Multi-core Multi-processor trend

 [Chart: number of cores (logarithmic axis, 10 to 1000) versus year of
  production (2003 to 2009), comparing Moore's Law with multi-core growth.]



 • Need single unified parallel programming tool for
     exploiting parallel processing at the core level
 •   KILL Rule by Agarwal: correlate to communication cores
                                                                                          17
Related work
  Related work    What                  MoC         Shortcomings

  Compaan-Laura   compilers             KPN based   suboptimal bandwidth utilization
  Matlab                                            due to infinite-length FIFOs

  CERBERO         custom architecture   MPI model   synchronization scheme always
                                                    fixed with the master PE

  TMD             custom architecture   MPI model   PEs are connected using an MPI
                                                    communication library, no
                                                    automation

  CORES/HASIS     custom architecture   SDF/FSM     manual partitioning and
                  + transformations     model       scheduling of communication
                                                    resources, run-time
                                                    reconfiguration outside scope


                                                                            18
Research Question


What transformations are required to map a streaming
application on a systolic-like architecture, with the low-level
communication interface details hidden from the end-user
and at the same time support automation for implementing
streaming applications on the platform?




                                                                  19
Conventional design flow

                          design capture




                          partitioning and
                      scheduling of processes

                                           manual
      redesign loop




                       select parameters for
                      the customizable cores




                      map the values on the
                            hardware



                                                    20
Proposed PRO-PART Design Flow
  Input specifications:
    SFG representation of application algorithm
      + component specification of implementation platform
      -> PRO-PART design flow -> NOFIS platform

  PRO-PART design flow (automated):
    streaming dataflow capture using SFG
      + structure and components specification
      -> partitioning and communication resource specification
      -> configure and generate values for communication cores
      -> mapping to hardware

  Objectives:
     • Partition algorithm
     • Identify communication resources
     • Schedule and embed flow control
     • Guarantee overall synchronization
       without re-design loops
                                                                                            21
SAFC Flow Graph (SFG)
         Data from              Data from            Data from North inputs         Data from South inputs
        North inputs            East inputs



                                                           F1     F2      F3        F4     F5     F6
    FPGA 1        FPGA 2   FPGA 3           FPGA 4




   FPGA12         FPGA13   FPGA14           FPGA 5
                                                                  F7                       F8




   FPGA11         FPGA16   FPGA15           FPGA 6

                                                                               F9




   FPGA10         FPGA 9   FPGA 8           FPGA 7


                                                                       Synchronized data
         Data from            Data from                                 blocks recorded
        South inputs          West inputs                                  onto disk




 • Provides abstraction to view a streaming system with a
   single universal clock (I/O rate)
                                                                                                         22
Platform specification

 [Figure: two ML310 boards (board 1 and board 2) joined by an Aurora link.
  Each board holds PE1 to PE4 in a 2 x 2 arrangement connected by FSL, with
  Aurora switches on the four sides of the board.]

                                                                                                                                                    23
Process Partitioning
                    Read image1 from              Read image2 from
                      memory and                    memory and
                      create zone1                  create zone2

               ML310 board 1


                     f1 = FFT(zone1)                   f2 = FFT(zone2)




                                   f3 = mult(f1, f2)
               ML310 board 2




                                       f4 = IFFT(f3)




                                       sub-pixel(f4)




 • Assign process id depending on order of execution and
   partition between boards
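
One way to read this step: number the processes in a valid execution order, then split them between the boards and see which channels cross the board boundary. The sketch below is only an illustration. The process graph and the board split follow the figure as drawn (reads and FFTs on board 1, the rest on board 2), while the identifiers `DEPENDS_ON`, `BOARD_OF`, `assign_ids`, and `classify_channels` are hypothetical. It uses `graphlib` from the Python 3.9+ standard library.

```python
from graphlib import TopologicalSorter

# Each process maps to the set of processes whose output it consumes.
DEPENDS_ON = {
    "read_zone1": set(),
    "read_zone2": set(),
    "fft1": {"read_zone1"},
    "fft2": {"read_zone2"},
    "mult": {"fft1", "fft2"},
    "ifft": {"mult"},
    "sub_pixel": {"ifft"},
}

# Board assignment as the figure appears to draw it (an assumption).
BOARD_OF = {
    "read_zone1": 1, "read_zone2": 1, "fft1": 1, "fft2": 1,
    "mult": 2, "ifft": 2, "sub_pixel": 2,
}


def assign_ids(graph):
    """Give every process an id that follows a valid order of execution."""
    order = TopologicalSorter(graph).static_order()
    return {name: pid for pid, name in enumerate(order, start=1)}


def classify_channels(graph, board_of):
    """Split channels into on-board and off-board (board-crossing) lists."""
    on_board, off_board = [], []
    for dst, sources in graph.items():
        for src in sources:
            same = board_of[src] == board_of[dst]
            (on_board if same else off_board).append((src, dst))
    return sorted(on_board), sorted(off_board)


ids = assign_ids(DEPENDS_ON)
on_board, off_board = classify_channels(DEPENDS_ON, BOARD_OF)
print(ids)
print("off-board channels:", off_board)
```

Under this split, only the two FFT-to-multiplication channels cross boards, so those would use the off-board (Aurora) links while the rest stay on FSL.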
                                                                         24
Configuring comm. resources
 ML310 board 1
      Read image1 from memory and create zone1  ->  f1 = FFT(zone1)
      Read image2 from memory and create zone2  ->  f2 = FFT(zone2)

 ML310 board 2
      f3 = mult(f1, f2)  ->  f4 = IFFT(f3)  ->  sub-pixel(f4)

 Annotations in the figure:
      synchronous non-blocking mode for FSL
      channel multiplexing over Aurora and flow control mode

 • Configure buffer depths
 • Map channels to physical links
 • Schedule data over channels
 • Multiplex virtual channels


                                                                                                    25
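
Multiplexing virtual channels over one physical link can be pictured as interleaving tagged words. The sketch below uses a plain round-robin policy, which is an assumption made for illustration; the slides do not say which schedule PRO-PART generates. The function names `multiplex` and `demultiplex` and the channel names are hypothetical.

```python
from collections import deque


def multiplex(channels):
    """Interleave words from several virtual channels onto one link,
    round-robin, tagging each word with its channel id."""
    queues = {cid: deque(words) for cid, words in channels.items()}
    link = []
    while any(queues.values()):
        for cid, queue in queues.items():
            if queue:
                link.append((cid, queue.popleft()))
    return link


def demultiplex(link):
    """Rebuild the per-channel streams on the receiving side."""
    streams = {}
    for cid, word in link:
        streams.setdefault(cid, []).append(word)
    return streams


channels = {"f1": [1, 2, 3], "f2": [10, 20]}
link = multiplex(channels)
print(link)
print(demultiplex(link) == channels)
```

Each word keeps its channel tag, so the receiver can restore both streams in order even though they shared one link.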
Mapping to Hardware
                                                      Inside ML310




                   PE                                    PE


        data_in1        I/O         FSL         I/O                  data_out1


                              FSL         FSL

        data_in2                                                     data_out2
                        I/O         FSL         I/O


                   PE                                   PE




• I/O unit generates parameter values for: FIFO generator,
  FSL block, sync counter, Aurora FSL switch
                                                                                 26
Particle Image Velocimetry (PIV)




 • Cardiovascular Disease (CVD) is the leading cause of
     death in the United States and accounted for more than
     37.1% of all fatalities in 2005.
 •   AEThER Lab at Virginia Tech models cardiovascular fluid
     dynamics
                                                          27
PIV algorithm
t


                                                                 motion
                                                                 vector
    Image 1
                       FFT
t + dt
              zone 1
                                 Multiplication   IFFT   Reduction


    Image 2            FFT
              zone 2




     • Data-intensive: each case results in 1250 image pairs x
          5 MB = 6.25 GB
     •    Custom FlowIQ program: 16 minutes for one image pair
          on a 2 GHz Xeon processor, resulting in 2.6 years for
          analysis
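
The FFT, multiplication, IFFT, and reduction chain in the figure is the usual frequency-domain cross-correlation. The sketch below is a toy version, not the FlowIQ code. It assumes the multiplication uses the complex conjugate of the first spectrum (the slide only says "Multiplication"), takes the reduction to be a search for the correlation peak, and uses a naive pure-Python DFT that is only practical for very small zones. All function names are hypothetical.

```python
import cmath


def dft2(a, inverse=False):
    """Naive 2D DFT of a square list of lists (O(N^4), tiny zones only)."""
    n = len(a)
    sign = 1 if inverse else -1
    w = [cmath.exp(sign * 2j * cmath.pi * k / n) for k in range(n)]
    out = [[sum(a[y][x] * w[(u * y + v * x) % n]
                for y in range(n) for x in range(n))
            for v in range(n)] for u in range(n)]
    if inverse:
        out = [[value / (n * n) for value in row] for row in out]
    return out


def motion_vector(zone1, zone2):
    """Return the (dy, dx) shift that best maps zone1 onto zone2."""
    n = len(zone1)
    f1, f2 = dft2(zone1), dft2(zone2)
    # Assumed form of the multiplication step: conj(F1) * F2.
    f3 = [[f1[u][v].conjugate() * f2[u][v] for v in range(n)] for u in range(n)]
    f4 = dft2(f3, inverse=True)
    # Reduction: locate the correlation peak.
    _, dy, dx = max((f4[y][x].real, y, x) for y in range(n) for x in range(n))
    # Shifts beyond half the zone size are wrapped negative shifts.
    if dy > n // 2:
        dy -= n
    if dx > n // 2:
        dx -= n
    return dy, dx


def zone_with_particle(n, y, x):
    """Hypothetical test zone: one bright pixel on a dark background."""
    return [[1.0 if (r, c) == (y, x) else 0.0 for c in range(n)] for r in range(n)]


print(motion_vector(zone_with_particle(8, 2, 3), zone_with_particle(8, 3, 5)))
```

A real implementation would use an FFT library and a sub-pixel peak fit; the naive transform is used here only to keep the example free of dependencies.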
                                                                          28
PIV performance

 [Bar chart: time in seconds (axis from 0 to 5.0) per execution device.
  Legend: CPU, GPU, FPGA. Bar values, left to right: 4.50, 1.279, 2.25.]

 • GPU is the fastest platform, but data transfer between
   device (GPU) and host (CPU) is expensive.
 • PRO-PART + NOFIS: slower, but higher throughput due to
   efficient pipelining and customized communication
   cores (work in progress).

                                                                                  29
ETA Beamforming application

 [Figure: ETA data path, left to right]
   Antenna inputs -> receiver nodes (S25, 12 drawn)
     -> LVDS connections -> outer nodes (ML310, 12 drawn)
     -> 2.5 Gbit/sec Serial Interconnect Network (Aurora)
     -> inner nodes (ML310, 4 drawn)
     -> recording nodes (PC with disk, 4 drawn)

                                                                                                                                30
ETA using PRO-PART + NOFIS
Current ETA implementation
 • Time-consuming and extensive simulations
 • Hardware efficient due to hand-coded RTL

 Flow (manual, with redesign loop):
   design capture
     -> partitioning and scheduling of processes
     -> select parameters for the customizable cores
     -> map the values on the hardware

Proposed implementation (work in progress)
 • Shorter design cycle
 • Potential increase in resource usage, but meets performance goals

 Flow (automated):
   systolic dataflow capture using SFG
     + structure and components specification
     -> partitioning and communication resource specification
     -> configure and generate values for communication cores
     -> mapping to hardware

                                                                                                31
Contributions


•   Map streaming applications to GALS architecture
•   SAFC Flow Graph (SFG) representation
•   PRO-PART design methodology
•   Configurable communication cores
•   Guarantee synchronization by meeting the I/O clock rate
•   Increase designer productivity




                                                          32
Discussion
             Questions




                         33
Supporting slides




                    34
Synchronization methods
             Source synchronous

                       Data
      Source IC                  Destination IC

                       Clock

                  Self synchronous

                      Data and
                       clock
      Source IC                  Destination IC



                                                  35
Message Passing Interface (MPI)
                             Formatted output file
                             with combined results

                                                                   job
              Head node                                         submission   User


                                        application                      application
                                         dataset                          dataset




 Compute   Compute        Compute                     Compute
  node 1    node 2         node 3                      node n




                                                                                       36
Models of Computation: CSP
                           Process
                              3




                                     channel
                           channel
               sending                         sending
               channel                         channel
     Process               Process                         Process
        1                     2                               5
               channel                         channel
               receiving                       receiving
                           channel

                                     channel




                           Process
                              4



                                                                     37
Models of Computation: KPN

 [Figure: processes P1, P2, and P3; each pair of processes is connected
  by an infinite-sized FIFO.]


                                                   38
NVIDIA Tesla C1060
 [Figure: GPU block diagram. Multiprocessor 1, Multiprocessor 2, ...
  Multiprocessor N; each multiprocessor has Shared Memory, Processor 1,
  Processor 2, ... Processor M (each with its own Registers), an Instruction
  Unit, a Constant Cache, and a Texture Cache, all above GPU memory. Two
  figure captions are not recoverable from the extraction.]

  • PCI Express bus architecture that delivers up to [...] GBps in both
    upstream and downstream data transfers.
  • Standard industry form factors, for both desktop and rack-mounted
    configurations.
  • NVIDIA unified driver architecture (UDA).

  The NVIDIA Tesla C[...] GPU computing board is an add-in card based on the
  Tesla C[...] GPU. It has a PCI Express full-height form factor and is
  targeted as a high performance computing (HPC) solution for PCI Express
  host workstations. GPU Computing board-level products do not have display
  connectors and are specifically designed for computing. Processor clocks,
  memory [...]
                                                                        39
PIV performance-cuda_profile




                              40
NOFIS Hardware




                 ML310 Infiniband
                  adapter board

      ML310



                                   41

Streaming App Synchronization

  • 1. Synchronization Synthesis for Large Scale Parallel Streaming Applications Vivek Venugopal Committee Dr. Cameron Patterson Dr. Peter Athanas Dr. Paul Plassmann Dr. Jeffrey Reed Dr. Kevin Shinpaugh 1
  • 2. Outline • Research Overview • Introduction • Related Work • Research Statement • Methodology • Target Applications & Evaluation • Results • Contributions 2
  • 3. Research Overview Streaming application Set of transformations Research scope Specialized • How to partition algorithm? hardware platform • How to map and where? • What are the communication resources? • How is synchronization guaranteed? 3
  • 4. Streaming Architecture without Flow Control (SAFC) PE PE PE 1 2 6 PE PE PE 7 8 12 PE PE PE 31 32 36 Clock source Streaming architecture with large number of PE's requiring more than 1 board 4
  • 5. Streaming Architecture without Flow Control (SAFC) Aurora ML310 ML310 PE PE PE board 1 board 2 1 2 6 Aurora PE PE PE 7 8 12 ML310 ML310 board 3 board 4 PE PE PE ML310 boards connected in 31 32 36 mesh driven by same clock Clock value but different sources source Streaming architecture with large number of PE's requiring more than 1 board 4
  • 6. Streaming Architecture without Flow Control (SAFC) Aurora ML310 ML310 PE PE PE board 1 board 2 1 2 6 Aurora PE PE PE 7 8 12 ML310 ML310 board 3 board 4 PE PE PE ML310 boards connected in 31 32 36 mesh driven by same clock Clock value but different sources source Streaming architecture with Aurora switches large number of PE's FSL requiring more than 1 board PE1 PE2 PE3 Aurora switches Aurora switches FSL PE4 PE5 PE6 PE7 PE8 PE9 Clock source Aurora switches Inside a ML310 4
  • 7. Clock domains: GALS scenario PE1 PE2 Data Data Data • GALS (Globally Asynchronous Locally Synchronous) 5
  • 8. Synchronization Data Source IC Destination IC clock • System Synchronous • Synchronization synthesis 6
  • 9. Data type Packet-based data: start and stop are easy; easier synchronization due to flow control; better dynamic scheduling of resources; best-effort service. Streaming data: cannot be stopped; synchronization is difficult, leading to data loss if not done properly; better static scheduling of resources; guaranteed service. 7
  • 10. Communication framework System-level communication framework Point-to-point Bus-based Network-On-Chip interconnect architecture Custom Uniform Shared Split 8
  • 11. Point-to-point Interconnect 1 2 3 4 Ring 1 2 3 1 2 3 4 5 6 4 5 6 7 8 9 7 8 9 2D Torus 2D Mesh 9
  • 12. Bus-based architecture Memory High-speed Low-speed interface peripheral peripheral Block I/O RAM interface OPB Bridge Power PC PLB arbiter OPB arbiter • IBM CoreConnect architecture 10
  • 13. Network-on-Chip (NoC) Router Link core core core core core core core core core 11
  • 14. Multi-core Streaming Architecture Network Of FPGAs with Integrated Aurora Switches (NOFIS) ML310 ML310 board 1 board 2 Aurora switches Aurora switches FSL FSL PE1 PE2 PE3 PE1 PE2 PE3 FSL Aurora FSL Aurora switches Aurora switches Aurora switches Aurora switches PE4 PE5 PE6 PE4 PE5 PE6 PE7 PE8 PE9 PE7 PE8 PE9 Aurora switches Aurora switches 12
  • 15. NOFIS: On-board communication Master Slave FSL_M_Clk FSL_S_Clk FSL_M_Data FSL_S_Data FIFO FSL_M_Control FSL_S_Control FSL_M_Write FSL_S_Read FSL_M_Full FSL_S_Exists • Fast Simplex Link : uni-directional FIFO interface • Configure FIFO depth, clocking modes 13
  • 16. NOFIS: Off-board communication Aurora Aurora Channel Lane 1 User User Aurora Aurora Application Application interface interface (ML310) (ML310) Aurora Lane n • High-speed (3.125 Gb/sec) and self synchronous 14
  • 17. Model of Computation (MoC): SDF Synchronous Data Flow (SDF) 1 2 Buffer 2 A B 2 Buffer delay Buffer 1 1 elements C • SDF exhibits ideal systolic dataflow behavior • Varying data rate not supported 15
  • 18. Model of Computation (MoC): PSDF Parametrized Synchronous Data Flow (PSDF) A1 Buffer B1 A size a1 B A2 B2 Buf fer f Buf a3 size er a2 size C1 C C2 firing of (A) ⨉ production rate of (A1) = firing of (B) ⨉ consumption rate of (B1) firing of (A) ⨉ production rate of (A2) = firing of (C) ⨉ consumption rate of (C1) firing of (C) ⨉ production rate of (C2) = firing of (B) ⨉ consumption rate of (B2) • supports reconfiguration and different data rates 16
  • 19. Multi-core Multi-processor trend 1000 Number of cores 100 Moore's Law Multi-core growth 10 2003 2004 2005 2006 2007 2008 2009 Year of production • Need single unified parallel programming tool for exploiting parallel processing at the core level • Kill Rule by Agrawal: correlate to communication cores 17
  • 20. Related work Related What MoC Shortcomings work suboptimal bandwidth Compaan- compilers KPN based utilization due to infinite Laura Matlab length FIFOs custom synchronization scheme always CERBERO MPI model architecture fixed with the master PE PEs are connected using a MPI custom TMD MPI model communication library, no architecture automation manual partitioning and custom CORES/ SDF/FSM scheduling of communication architecture + HASIS model resources, run-time transformations reconfiguration outside scope 18
  • 21. Research Question What transformations are required to map a streaming application on a systolic-like architecture, with the low-level communication interface details hidden from the end-user and at the same time support automation for implementing streaming applications on the platform? 19
  • 23. Conventional design flow design capture partitioning and scheduling of processes manual redesign loop select parameters for the customizable cores map the values on the hardware 20
  • 24. Proposed PRO-PART Design Flow SFG representation of application streaming structure and algorithm PRO-PART dataflow capture components + Design flow using SFG specification component specification of implementation platform NOFIS platform partitioning and Input specifications communication resource specification automated Objectives: configure and generate • Partition algorithm values for communication cores • Identify communication resources • Schedule and embed flow control mapping to hardware • Guarantee overall synchronization without re-design loops 21
  • 25. SAFC Flow Graph (SFG) Data from Data from Data from North inputs Data from South inputs North inputs East inputs F1 F2 F3 F4 F5 F6 FPGA 1 FPGA 2 FPGA 3 FPGA 4 FPGA12 FPGA13 FPGA14 FPGA 5 F7 F8 FPGA11 FPGA16 FPGA15 FPGA 6 F9 FPGA10 FPGA 9 FPGA 8 FPGA 7 Synchronized data Data from Data from blocks recorded South inputs West inputs onto disk • Provides abstraction to view a streaming system with a single universal clock (I/O rate) 22
  • 26. Platform specification ML310 board 1 ML310 board 2 Aurora switches Aurora switches FSL FSL PE1 PE2 Aurora PE1 PE2 Aurora switches Aurora switches Aurora switches Aurora switches FSL FSL PE3 PE4 PE3 PE4 Aurora switches Aurora switches 23
  • 27. Process Partitioning Read image1 from Read image2 from memory and memory and create zone1 create zone2 ML310 board 1 f1 = FFT(zone1) f2 = FFT(zone2) f3 = mult(f1, f2) ML310 board 2 f4 = IFFT(f3) sub-pixel(f4) • Assign process id depending on order of execution and partition between boards 24
  • 28. Configuring comm. resources ML310 board 1 Read image1 from Read image2 from memory and memory and create zone1 create zone2 synchronous non- blocking mode for FSL • Configure buffer depths f1 = FFT(zone1) f2 = FFT(zone2) • map channels to physical links Channel multiplexing ML310 board 2 over Aurora and flow • schedule data over control mode f3 = mult(f1, f2) channels • multiplex virtual channels f4 = IFFT(f3) sub-pixel(f4) 25
  • 29. Mapping to Hardware Inside ML310 PE PE data_in1 I/O FSL I/O data_out1 FSL FSL data_in2 data_out2 I/O FSL I/O PE PE • I/O unit generates parameter values for: FIFO generator, FSL block, sync counter, Aurora FSL switch 26
  • 30. Particle Image Velocimetry (PIV) • Cardiovascular Disease (CVD) is the leading cause of death in the United States and accounts for more than 37.1 % of all fatalities for 2005. • AEThER Lab at Virginia Tech models cardiovascular fluid dynamics 27
  • 31. PIV algorithm t motion vector Image 1 FFT t + dt zone 1 Multiplication IFFT Reduction Image 2 FFT zone 2 • Data-intensive, each case results in 1250 image pairs x 5MB = 6.25 GB • Custom FlowIQ program: 16 minutes for one image pair on a 2GHz Xeon processor resulting in 2.6 years for analysis 28
  • 32. PIV performance 5.0 4.50 CPU 4.5 GPU • GPU fastest platform, FPGA 4.0 expensive data transfer 3.5 between device(GPU) and Time in seconds 3.0 host(CPU). 2.5 2.25 • PRO-PART+ NOFIS: slower 2.0 but higher throughput due to 1.5 1.279 efficient pipelining and customized communication 1.0 cores. (work in progress) 0.5 0 Execution device 29
  • 33. ETA Beamforming application Antenna LVDS inputs connections ML310 S25 S25 ML310 2.5 Gbit/sec Serial Interconnect Network (Aurora) S25 ML310 ML310 S25 S25 ML310 ML310 PC disk S25 ML310 ML310 PC disk ML310 S25 ML310 PC S25 ML310 disk S25 ML310 ML310 PC disk ML310 S25 Inner nodes Recording S25 ML310 nodes S25 ML310 Receiver nodes Outer nodes 30
  • 34. ETA using PRO-PART + NOFIS Current ETA implementation Proposed implementation (work in progress) • Time-consuming and • Shorter design cycle extensive simulations • Potential increase in resource • Hardware efficient due to but meets performance goals hand-coded RTL systolic dataflow structure and design capture capture using components SFG specification partitioning and partitioning and scheduling of processes communication resource specification manual redesign loop automated select parameters for configure and generate the customizable cores values for communication cores map the values on the hardware mapping to hardware 31
  • 35. Contributions • Map streaming applications to GALS architecture • SAFC Flow Graph (SFG) representation • PRO-PART design methodology • Configurable communication cores • Guarantee synchronization by meeting the I/O clock rate • Increase designer productivity 32
  • 36. Discussion Questions 33
  • 38. Synchronization methods Source synchronous Data Source IC Destination IC Clock Self synchronous Data and clock Source IC Destination IC 35
  • 39. Message Passing Interface (MPI) Formatted output file with combined results job Head node submission User application application dataset dataset Compute Compute Compute Compute node 1 node 2 node 3 node n 36
  • 40. Models of Computation: CSP Process 3 channel channel sending sending channel channel Process Process Process 1 2 5 channel channel receiving receiving channel channel Process 4 37
  • 41. Models of Computation: KPN Processes P1, P2, and P3, connected pairwise by infinite-sized FIFOs 38
  • 42. NVIDIA Tesla C1060 GPU: Multiprocessor 1, Multiprocessor 2, ..., Multiprocessor N; Shared Memory; Registers; Processor 1, Processor 2, ..., Processor M; Instruction Unit; Constant Cache; Texture Cache; GPU memory. PCI Express bus architecture that delivers up to [figure illegible] GBps in both upstream and downstream data transfers. Standard industry form factors, for both desktop and rack-mounted configurations. NVIDIA unified driver architecture (UDA). The NVIDIA Tesla C[model number illegible] GPU computing board is an add-in card based on the Tesla C[model number illegible] GPU. It has a PCI Express full-height form factor and is targeted as a high performance computing (HPC) solution for PCI Express host workstations. GPU Computing board-level products do not have display connectors and are specifically designed for computing. Processor clocks, memory ... [text cut off in the slide] 39
  • 44. NOFIS Hardware ML310 Infiniband adapter board ML310 41
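
The balance equations on the PSDF slide (slide 18: firings of a producer times its production rate equal firings of the consumer times its consumption rate) can be solved mechanically for the smallest integer firing counts. The sketch below is an illustration only and is not part of PRO-PART: the function name `repetition_vector`, the edge-tuple layout, and the example rates are all assumptions, and it handles one fixed set of parameter values at a time rather than PSDF reconfiguration.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetition_vector(edges):
    """Solve the balance equations q[src] * prod == q[dst] * cons.

    edges: list of (producer, production_rate, consumer, consumption_rate).
    Returns the smallest positive integer firing count per actor.
    """
    rates = {edges[0][0]: Fraction(1)}  # fix one actor, derive the rest
    pending = list(edges)
    while pending:
        progressed = False
        for edge in list(pending):
            src, prod, dst, cons = edge
            if src in rates and dst in rates:
                # Both ends known: the equation must already hold.
                if rates[src] * prod != rates[dst] * cons:
                    raise ValueError("inconsistent rates: no periodic schedule")
            elif src in rates:
                rates[dst] = rates[src] * prod / cons
            elif dst in rates:
                rates[src] = rates[dst] * cons / prod
            else:
                continue  # neither end reachable yet; retry later
            pending.remove(edge)
            progressed = True
        if not progressed:
            raise ValueError("graph is not connected")
    # Scale the rational solution to the smallest integer one.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in rates.values()), 1)
    counts = {actor: int(r * scale) for actor, r in rates.items()}
    common = reduce(gcd, counts.values())
    return {actor: n // common for actor, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical rates for the A -> B, A -> C, C -> B graph on the slide.
    edges = [("A", 2, "B", 1), ("A", 1, "C", 1), ("C", 2, "B", 1)]
    print(repetition_vector(edges))  # {'A': 1, 'B': 2, 'C': 1}
```

With these example rates, A and C each fire once for every two firings of B. If the third edge's production rate is changed to 1, the equations have no solution and the function raises `ValueError`, which corresponds to a graph whose buffers cannot stay bounded.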