High Performance Computing
       Infrastructure:
  Past, Present, and Future

                       By
          Clay Gloster, Jr., Ph.D., P.E.
              Associate Professor

Department of Electrical & Computer Engineering
              Howard University
             THE RARE PROJECT
           cgloster@howard.edu
                June 22, 2009




                                                  1
Presentation Outline
•   Introduction to Reconfigurable Computing
•   The Bison Configurable Digital Signal Processor
•   The BCDSP Design Flow
•   Current Function Cores and Modules
•   A Remote Reconfigurable Computer
•   A Parallel and Configurable Computer




                                                  2
Introduction to
Reconfigurable Computing




                           3
Problem Statement
•   Given: An application that is computationally intensive
    or requires considerable CPU execution time,
    e.g., weather modeling, remote sensing, target
    recognition, precision targeting, gene sequencing.



•   Find: A solution that significantly improves
    performance, requires acceptable development time,
    and comes at a reasonable cost.




                                                              4
Potential Solutions
•   Cluster-based computing: The use of several
    general-purpose computing systems, e.g., PCs. (Writing
    programs that execute on typical PCs/workstations.)

•   Application-Specific Integrated Circuit (ASIC) Design:
    The use of special-purpose ICs or chips. (Designing a
    chip (hardware) that is highly optimized for the
    particular application.)

•   Reconfigurable Computing: The merger of the two
    approaches. (Writing software to execute non-time-
    critical portions of the application on a PC while
    designing hardware to execute the time-critical
    portions of the application on an FPGA.)



                                                             5
A Reconfigurable Computer is:

[Figure: host PC]

A PC attached to one or more Field Programmable Gate Arrays (FPGAs).



                                                              6
An FPGA is:

[Figure: FPGA structure, with labels: Programmable Pin, Configurable Logic
Block, Programmable Interconnect]

A programmable integrated circuit.

At time t1, it can be programmed as X1 (personal digital assistant).
At time t2, it can be programmed as X2 (calculator).

                                                                    7
RC Systems Advantages
•   Several applications have been implemented on a
    reconfigurable computing system resulting in a system
    with execution times that were an order of magnitude
    faster than the same application implemented on a
    typical desktop computer.

•   The same reconfigurable computing system hardware
    can be reused for diverse applications.

•   With an RC system, a system can be deployed and
    subsequently reprogrammed with new hardware to
    perform functions that were not available at the time of
     deployment.



                                                           8
RC Systems Disadvantages
•   Developing an RC system requires a system designer
    who is knowledgeable in both hardware design and
    software design.

•   Designing and implementing an RC system that
    executes faster than a typical desktop computer
    can take several months.




                                                            9
Research Objectives
•   To obtain RC system implementations of several
    applications that achieve an order of magnitude
    speedup over executing the application on a typical
    desktop computer.

•   To develop tools that reduce RC system development
    time from months to weeks or days and allow users
    who are not knowledgeable in hardware design to
    implement RC systems and benefit from increased
    system performance and system reuse.

•   To develop a resource management system to
    efficiently utilize available reconfigurable computing
    resources located at remote sites.


                                                             10
The Bison Configurable
Digital Signal Processor




                           11
A Configurable Digital Signal Processor

[Figure: block diagram of the processor (BCDSP). A control unit, a data unit,
and a function core (FunCoreGen) are shown with data memories M0, M1, M2, M3,
..., Mn-2, Mn-1 and an instruction memory Mn.]

                                                 12
Functional Cores

[Figure: a FunCore block with inputs R0, R1, ..., R7 and control signals
ENABLE and DONE]

•   Have one or more 32-bit inputs.
•   Perform floating point vector operations.
•   Have simple control.
•   Can be built using other FunCores.
•   Can include conditional units.




                                                                   13
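2-D DCT Function Core

[Figure: datapath of the 2-D DCT function core. Sixteen inputs R0-R15 feed
eight multipliers (X). The eight products are summed by a three-level adder
tree (four adders, then two, then one) to produce the output Z0.]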

                                                                                         14
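The datapath in the figure (eight multipliers followed by an adder tree) can be modeled in a few lines of software. This is only a sketch under stated assumptions: the slide does not label which inputs feed which multiplier, so the pairing of adjacent registers (R0 with R1, R2 with R3, and so on) is assumed, and `dct_core` is a hypothetical name, not part of the BCDSP tools.

```python
def dct_core(regs):
    """Software model of the figure: 8 multiplies, then a 3-level adder tree.

    Assumption: multiplier k takes the adjacent pair (R[2k], R[2k+1]).
    """
    if len(regs) != 16:
        raise ValueError("the core in the figure has exactly 16 inputs")

    # Multiplier stage: all 8 products could be formed in parallel in hardware.
    level = [regs[2 * k] * regs[2 * k + 1] for k in range(8)]

    # Adder tree: 8 -> 4 -> 2 -> 1 values, one tree level per pass.
    while len(level) > 1:
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]

    return level[0]  # corresponds to Z0 in the figure


if __name__ == "__main__":
    print(dct_core(list(range(16))))
```

In hardware each stage would be pipelined, so a new set of 16 inputs could enter every cycle; the loop above only models the arithmetic, not the timing.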
Optimizing System Performance
            with the BCDSP
•   Memory is 64-bits wide allowing two single-precision
    floating point numbers to be fetched in a single memory
    access.

•   There are N=4 data memories, hence multiple data
    items can be read/written in a single cycle.
    Theoretically, the number of memory accesses can be
    reduced by a factor of N=4. (This number can be
    increased to an upper bound 2N=8 if we store two
    floating point values per location.)

•   Multiple function cores can be used. For example, a
    typical processor may have 1 multiplier. In this case, K
    multiplies require K time units or clock cycles. With K
    multipliers, K multiplies can be executed in a single
    time unit or clock cycle.

•   Pipelining and DMA accesses are used to increase
    system performance.
                                                           15
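The access and cycle counts in the bullets above follow from simple division. The sketch below is an idealized model, assuming perfectly balanced data placement and ignoring pipelining and DMA effects; the function names are hypothetical.

```python
import math


def memory_cycles(num_values, num_memories=4, values_per_word=2):
    """Ideal number of memory cycles needed to fetch num_values floats.

    Each cycle, every memory delivers one word, and a 64-bit word holds
    up to two single-precision values (assumes data is evenly spread).
    """
    return math.ceil(num_values / (num_memories * values_per_word))


def multiply_cycles(num_multiplies, num_multipliers=1):
    """Ideal number of cycles to perform num_multiplies independent multiplies."""
    return math.ceil(num_multiplies / num_multipliers)


if __name__ == "__main__":
    n = 64
    print("1 memory, 1 value/word :", memory_cycles(n, 1, 1))
    print("4 memories, 1 value/word:", memory_cycles(n, 4, 1))
    print("4 memories, 2 values/word:", memory_cycles(n, 4, 2))
    print("8 multiplies, 1 multiplier :", multiply_cycles(8, 1))
    print("8 multiplies, 8 multipliers:", multiply_cycles(8, 8))
```

With N=4 memories the fetch count drops by a factor of 4, and by a factor of 8 when two values are packed per location, matching the bounds stated on the slide.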
BCDSP Software, Cores,
   and Processors




                         16
Distinguishing Features of RCCT
[Figure: two design flows compared.
 Traditional approach: Original Source Code -> Special Compiler, which produces
   (a) HDL -> Logic Synthesis -> Placement & Routing -> Bit Stream, and
   (b) Modified Source Code -> High Level Compiler -> Executable Code.
 Our approach: Original Source Code and a Module Definition File -> RCCT
   Compiler, which produces Session Files and Modified Source Code ->
   High Level Compiler -> Executable Code.]



•   Placement and routing is performed off-line.
•   The Hardware Module Library evolves continuously.
•   Compiler can easily recognize new modules.
•   As new modules are added, the Compiler has a better chance
    to improve performance for each user application.


AIST-0016-0044                                                                                                               17
The Front-End Compiler
•   The purpose of the compiler is to map user applications
    to FPGA-based reconfigurable computers (RC), i.e., the
    BISON reconfigurable computer.
•   The compiler takes the original source code written in
    C/C++ and a module library and produces two outputs:
    the modified source code and a session file for each
    modified section.

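[Figure: front-end flow. The Original Source Code and the Module Library feed
the RCCT Compiler, which emits Session Files and Modified Source Code. A
Programming Language Compiler turns the Modified Source Code into the New
Application Executable (which calls the Loader).]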




                                                                     18
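The two outputs of the front-end compiler can be illustrated with a toy source-to-source pass. This is a heavily simplified sketch, not the RCCT implementation: it assumes the sections to be replaced are plain function calls whose names appear in the module library, and the loader call `rc_load` and the `.ses` file names are hypothetical.

```python
import re


def front_end(source, module_library):
    """Return (modified_source, session_files) for a C-like source string.

    Calls to functions found in module_library are rewritten into calls to a
    hypothetical loader, and one session file name is recorded per rewrite.
    """
    sessions = []

    def swap(match):
        name, args = match.group(1), match.group(2)
        if name not in module_library:
            return match.group(0)  # leave ordinary calls untouched
        session_file = f"session_{len(sessions)}_{name}.ses"
        sessions.append(session_file)
        sep = ", " if args.strip() else ""
        return f'rc_load("{session_file}"{sep}{args})'

    # Matches simple calls without nested parentheses, e.g. dct2d(img).
    modified = re.sub(r"\b(\w+)\(([^()]*)\)", swap, source)
    return modified, sessions


if __name__ == "__main__":
    src = "y = dct2d(img);\nprintf(msg);"
    new_src, files = front_end(src, {"dct2d"})
    print(new_src)
    print(files)
```

The modified source would then go through an ordinary C/C++ compiler, while the session files tell the loader which hardware module to configure for each rewritten section.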
The BCDSP Processor
         Back-End Compiler


[Figure: back-end flow. dct.c -> c2hl -> dct_hl.vhd -> hl2cudu ->
dct_cu.vhd and dct_du.vhd, shown alongside PECORE.vhd]



        hl2cudu consists of approximately 15 programs!!!


                                                              19
Execution Time for the 2D-DCT
  Image Size    Software (ms)    Hardware (ms)    Speedup
                2.97 GHz PC      24 MHz BCDSP

  8x8              0.0400           0.0112         3.56
  16x16            0.095            0.0272         3.48
  32x32            0.264            0.09150        2.88
  64x64            0.849            0.3484         2.43
  128x128          3.080            1.3746         2.24
  256x256         12.154            5.478          2.22
  512x512         60.556           21.8942         2.76
  1024x1024      185.754           87.5560         2.12

Reconfigurable hardware was 2.71 times faster on average!!!!
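The speedup column and the 2.71 average can be checked directly from the measured times. The short script below copies the table values; it is a sanity check, not part of the original measurements. Averaging the speedup column as printed gives 2.71, while recomputing each ratio from the times gives a slightly higher mean of about 2.72, which suggests the printed speedups were truncated rather than rounded.

```python
# (image size, software ms, hardware ms, speedup as printed on the slide)
rows = [
    ("8x8", 0.0400, 0.0112, 3.56),
    ("16x16", 0.095, 0.0272, 3.48),
    ("32x32", 0.264, 0.09150, 2.88),
    ("64x64", 0.849, 0.3484, 2.43),
    ("128x128", 3.080, 1.3746, 2.24),
    ("256x256", 12.154, 5.478, 2.22),
    ("512x512", 60.556, 21.8942, 2.76),
    ("1024x1024", 185.754, 87.5560, 2.12),
]

# Speedup = software time / hardware time.
recomputed = [sw / hw for _, sw, hw, _ in rows]
listed = [s for _, _, _, s in rows]

mean_listed = sum(listed) / len(listed)
mean_recomputed = sum(recomputed) / len(recomputed)

if __name__ == "__main__":
    for (size, _, _, printed), ratio in zip(rows, recomputed):
        print(f"{size:>10}: printed {printed:.2f}, recomputed {ratio:.3f}")
    print(f"mean of printed speedups:    {mean_listed:.3f}")
    print(f"mean of recomputed speedups: {mean_recomputed:.3f}")
```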
                                                           20
A Remote Configurable Computer




                                 21
A Remote And Reconfigurable
             Environment (RARE)
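[Figure: RARE architecture. Inputs are an application (C, Java, ...) and user
parameters (power, size, weight, ...). An automated tool set, drawing on a
processor library, produces a BCDSP for the Remote Environment Resource Bank:
a resource controller with FPGA0, FPGA1, ..., FPGAm, each with its own
memories (M00 ... M0n, M10 ... M1p, Mm0 ... Mmq).]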

                                                              22
The RARE Project Infrastructure
The RARE software is developed in Java. Java was selected because it
offers several advantages over other programming languages.

Java supports native methods, remote method invocation, and network
security. Native methods allow software routines written in other
programming languages, such as C/C++, to be called from Java
applications. Remote method invocation and the network security
features make it possible to execute Java programs from a remote site.




[Figure: Client.java (with RMI links) <-> Internet <-> Server.java (with RMI
links) -> NMI -> Function.c -> FPGA Board]




                                                                                  23
PNN Execution Times
  Implementation Type    Local (ms)    Remote (ms)

  Software (Java)          628.71        2887.74
  Software (C++)           861.04        3116.17
  Hardware                 104.07         371.01




  Remote hardware can be faster than local software!!!!
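The claim follows from comparing the remote hardware time with the local software times in the table. The sketch below copies the table values and computes the ratios; the dictionary layout and function name are only illustrative.

```python
# Times in ms copied from the table above: (local, remote).
times = {
    "java": (628.71, 2887.74),
    "cpp": (861.04, 3116.17),
    "hardware": (104.07, 371.01),
}


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


remote_hw = times["hardware"][1]

if __name__ == "__main__":
    for name in ("java", "cpp"):
        local_sw = times[name][0]
        ratio = speedup(local_sw, remote_hw)
        print(f"remote hardware vs local {name}: {ratio:.2f}x")
    local_ratio = speedup(times["java"][0], times["hardware"][0])
    print(f"local hardware vs local java: {local_ratio:.2f}x")
```

Even after paying the network overhead, the remote hardware run (371.01 ms) finishes ahead of both local software runs.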


                                                          24
A Parallel and Configurable Computer System




                                         25
A Parallel and Configurable Computer
•   NSF MRI Grant: A Parallel and Configurable Computer for
    Research in Engineering and the Computational Mathematical
    Sciences ($500K)
•   Projects related to RFID, an Electronic Nose, PET Image
    Reconstruction, Image Compression, and Computer Vision are
    using this equipment to solve real world problems.

[Figure: the Parallel CC consists of nodes CCN0 ... CCN7. Node CCNi contains
PC2i with FPGABrd2i and PC2i+1 with FPGABrd2i+1.]



                                                           26
Cluster Specifications
•   8 Compute Nodes
    –   1 x PCI-X dual port Infiniband 4X HCA card
    –   1 x 250GB SATA Hard Drive 7200RPM w/ 16MB Cache
    –   8 x 1GB PC3200 ECC Reg DDR (400MHz)
    –   1 x PNY nVidia Quadro FX 3000G w/ 8XAGP, 256MB DDR, Dual DVI/DVI
    –   2 x AMD Opteron Model 250 (2.4GHz)
    –   60-30-12921 1 x Dual Opteron S2885 EATX Motherboard w/ 8X AGP, gigE,
        SATA, audio, firewire, 4x 64-bit PCI

•   1 Head Node
    –   1 x PCI-X dual port Infiniband 4X HCA card
    –   8 x 1GB PC3200 ECC Reg DDR (400MHz)
    –   2 x AMD Opteron Model 250 (2.4GHz)
    –   1 x PNY nVidia Quadro FX 3000G w/ 8XAGP, 256MB DDR, Dual DVI/DVI
    –   1 x 10/100/1000 64bit PCI-X Gigabit Copper NIC

•   9 FPGA Coprocessors
    –   16 WS2P/XC2VP100-6P/48D/256 Wildstar II PRO PCI board with 2 ea
        P100-6 parts & 48 MB DDR SRAM and 256 MB DDR SDRAM


                                                                               27
RARE Project Past, Present, and Future




                                         28
AIST Program Space Based NRA Technologies
Hierarchical Algorithms and their Embedded
Computational Realization in Reconfigurable Hardware
ESTO (Earth Science Technology Office)

PI: Clay Gloster/Howard University
Proposal No: AIST 0016-0044

Description and Objectives
This project addresses problems associated with developing data products
for deployment in onboard RC systems. It involves the development of a
compiler that reads algorithm descriptions written in C. The compiler will
produce hardware and software components required for an RC implementation
of typical NASA data products. The main objectives of this project are:
efficient algorithm development and fast and reconfigurable hardware
implementations (10X-100X speedup).

[Figure: prototype RC testbed. A VLIW unit, memories Mem1-Mem5, processing
elements PE1-PE5, and FIFO1-FIFO5 connect to the PCI bus; links are labeled
with widths 61 and 34.]

Approach
Develop a compiler to translate nested loops into a sequence of floating
point vector instructions. These instructions correspond to modules in a
library that is to be developed as a part of this project. Hardware modules
will perform complex instructions, e.g., matmult, vec-vecmult, FFT.

Deliverables
- Prototype RC Testbed shown above
- Prototype Compiler
- Cloud Masking Data Product Demonstration
- Final Compiler

Co-I's/Partners
Hamid Krim, Tom Conte, NC State University

Application/Mission
Cloud Cover Assessment Data Product Development for EOS/AM-1 Satellite

                                                                                                                                                             29
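The approach of turning nested loops into a sequence of vector instructions can be sketched on a matrix-vector product. This is an illustration only: the `VDOT` opcode, the tuple-based instruction format, and the function names are hypothetical and are not taken from the project's compiler or module library.

```python
def matvec_nested(A, x):
    """Reference: the nested loop a user would write in the source program."""
    y = [0.0] * len(A)
    for i in range(len(A)):
        for j in range(len(x)):
            y[i] += A[i][j] * x[j]
    return y


def vdot(u, v):
    """Software stand-in for a hardware vector module (dot product)."""
    return sum(a * b for a, b in zip(u, v))


def lower_matvec(num_rows):
    """'Compile' the loop nest: one vector instruction per outer iteration."""
    return [("VDOT", i) for i in range(num_rows)]


def execute(program, A, x):
    """Run the instruction sequence, dispatching each opcode to its module."""
    y = [0.0] * len(A)
    for opcode, i in program:
        if opcode != "VDOT":
            raise ValueError(f"unknown instruction {opcode}")
        y[i] = vdot(A[i], x)  # the inner loop is now a single vector operation
    return y


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    x = [5, 6]
    program = lower_matvec(len(A))
    print(program)
    print(execute(program, A, x), matvec_nested(A, x))
```

The inner loop disappears from the program: each remaining instruction maps to one library module, which is where a hardware implementation could supply the speedup.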
High Performance Weather
        Forecast Modeling
 WRF is a next-generation HPC mesoscale forecast model and
 assimilation system developed as a collaborative effort by the
 atmospheric science community. It is a massively parallel
 computing environment for both forecasting and research purposes.

 [Figure: WRF Architecture. Figure courtesy of http://www.wrf-model.org]

 3-Level Hierarchical Structure
 Driver: processor management, etc.
 Mediation: interface between Model and Driver
 Model: plug-in algorithms that compute actual models

 Model layer includes
 Longwave radiation: RRTM
 Shortwave radiation: NASA/GSFC, MM5 (Dudhia)
 Cumulus: Kain-Fritsch, Betts-Miller-Janjic
 Explicit microphysics: Kessler, Lin et al., NCEP 3-class (Hong)
 PBL: MRF, MM5 (Slab)

 WRF acknowledges the HPC problem and is currently pursuing the standard solution.
 RARE solution: replace physics plug-ins with BCDSP FPGA equivalents.
                                                                       30
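The idea of swapping a physics plug-in without touching the layers above it can be sketched with a small registry. Everything here is hypothetical: the class, the slot name, and both scheme functions are placeholders, and the "FPGA" scheme only simulates in software what a BCDSP-backed module would compute. It is not WRF code.

```python
class ModelLayer:
    """Toy model layer: named slots, each filled by a plug-in scheme."""

    def __init__(self):
        self._schemes = {}

    def register(self, slot, scheme):
        """Install (or replace) the scheme used for a physics slot."""
        self._schemes[slot] = scheme

    def step(self, slot, column):
        """Called by the layers above; unaware of how the scheme is built."""
        return self._schemes[slot](column)


def software_scheme(column):
    """Placeholder software plug-in (stands in for a real physics routine)."""
    return [2.0 * v for v in column]


def fpga_scheme(column):
    """Placeholder for an FPGA-backed plug-in.

    Assumption: a real version would pass the column to a hardware module
    and read back the result; here the same arithmetic is simulated.
    """
    return [2.0 * v for v in column]


if __name__ == "__main__":
    model = ModelLayer()
    model.register("longwave", software_scheme)
    before = model.step("longwave", [1.0, 2.0])
    model.register("longwave", fpga_scheme)  # swap the plug-in only
    after = model.step("longwave", [1.0, 2.0])
    print(before, after)
```

Because the caller only uses `step`, the driver and mediation layers would not need to change when a plug-in is replaced by a hardware equivalent that returns the same results.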
A Reconfigurable and Open Architecture
                Module for Unmanned Systems
● Reconfigurable modules can be reused for various types of unmanned
  systems, each containing a diverse range of sensors, cameras, displays,
  GPS receivers, etc.
● Reconfigurable modules can provide capabilities during the mission that
  were unknown prior to the beginning of the mission.
● With these modules, a ground station can use the computing resources
  of remote unmanned systems while the modules are idle.
● With reconfigurable modules, a fixed amount of hardware can be
  changed to theoretically provide an infinite number of different
  capabilities.
● Because of the unpredictable nature of combat, reconfigurable systems
  provide the flexibility and performance needed to respond rapidly and
  effectively to unexpected threats.
● These systems can provide reconfigurable interfaces and
  interconnections. One system can accommodate any combination of
  interfaces: USB, Gigabit Ethernet, RS-422, IR, wireless, FireWire, etc.
                                                                     31
Current System Specification
FPGAs exploit parallelism to reach higher performance (sample rates, pixel
or frame rates) with limited SWaP (size, weight, and power).

FPGA processing power can be combined and redistributed in real time to
particular sensor(s).

FPGA-based payload interfaces combined with a hardware Open Architecture
approach can provide reconfigurable software interfaces and physical
interconnections.

One system can accommodate any combination of interfaces: USB, Gigabit
Ethernet, RS-422, IR, wireless, FireWire, etc.

The same Reconfigurable Context Neutral Payload Interface can be reused to
accommodate many different unmanned vehicles and ground stations, each
containing various sensors, cameras, radar systems, acoustics, LCD displays,
GPS systems, etc., using high-bandwidth connections to the interface.

System Specifications
•   Weight: 27 lbs
•   Size: 6 x 7 x 8.5 in
•   Power: 150 Watts
•   Interface: Gbit Ethernet, Camera Link, LVDS, 422, USB, FireWire
•   Image Formats: 4 Mb, 1080p, 720p, 480p, NTSC, RS-170, 1600 x 1200 IR,
    360 HD-Visible and 640 IR, Stereo Capable

Completed
•   Software Decoder (H.264), Hardware Encoder (H.264), IMU/GPS Interface,
    Imaging System, Targeting System Interface

Demonstrated
•   Imagery (meta-data format; multiple streams (801.16)), Trigger and Sync,
    Video-teleconferencing through the payload
33
Opening a Dialogue with Others
•   Graduate Student Support
     – One way for us to work together with others is via graduate
       students.
     – These students can bridge the gap between other disciplines and
       computer engineering.
•   Joint Proposals
     – One way for us to work together is to author joint proposals.
     – Alternatively, we can be supported under current funding.
       However, we would be willing to work with others even if there is
       no current support for our work, as long as there is potential for
       future support.
•   Implementation of a small portion of a model to demonstrate
    potential speedup.
•   There is a potential to publish results of this experiment in
    journals of other disciplines as well as in engineering journals.



                                                                         34

More Related Content

What's hot

iMinds The Conference: Jan Lemeire
iMinds The Conference: Jan LemeireiMinds The Conference: Jan Lemeire
iMinds The Conference: Jan Lemeireimec
 
Apache con 2013-hadoop
Apache con 2013-hadoopApache con 2013-hadoop
Apache con 2013-hadoopSteve Watt
 
ISSCC 2018: "Zeppelin": an SoC for Multi-chip Architectures
ISSCC 2018: "Zeppelin": an SoC for Multi-chip ArchitecturesISSCC 2018: "Zeppelin": an SoC for Multi-chip Architectures
ISSCC 2018: "Zeppelin": an SoC for Multi-chip ArchitecturesAMD
 
Optimizing Apple Lossless Audio Codec Algorithm using NVIDIA CUDA Architecture
Optimizing Apple Lossless Audio Codec Algorithm using NVIDIA CUDA Architecture Optimizing Apple Lossless Audio Codec Algorithm using NVIDIA CUDA Architecture
Optimizing Apple Lossless Audio Codec Algorithm using NVIDIA CUDA Architecture IJECEIAES
 
Reconfigurable Coprocessors Synthesis in the MPEG-RVC Domain
Reconfigurable Coprocessors Synthesis in the MPEG-RVC DomainReconfigurable Coprocessors Synthesis in the MPEG-RVC Domain
Reconfigurable Coprocessors Synthesis in the MPEG-RVC DomainMDC_UNICA
 
Iris an architecture for cognitive radio networking testbeds
Iris   an architecture for cognitive radio networking testbedsIris   an architecture for cognitive radio networking testbeds
Iris an architecture for cognitive radio networking testbedsPatricia Oniga
 
In-Network Acceleration with FPGA (MEMO)
In-Network Acceleration with FPGA (MEMO)In-Network Acceleration with FPGA (MEMO)
In-Network Acceleration with FPGA (MEMO)Naoto MATSUMOTO
 
Challenges in mixed signal
Challenges in mixed signal Challenges in mixed signal
Challenges in mixed signal chiportal
 
Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Shien-Chun Luo
 
Informix User Group France - 30/11/2010 - Nouveautés IDS 11.10 & 11.50
Informix User Group France - 30/11/2010 - Nouveautés IDS 11.10 & 11.50Informix User Group France - 30/11/2010 - Nouveautés IDS 11.10 & 11.50
Informix User Group France - 30/11/2010 - Nouveautés IDS 11.10 & 11.50Nicolas Desachy
 
AMD Hot Chips Bulldozer & Bobcat Presentation
AMD Hot Chips Bulldozer & Bobcat PresentationAMD Hot Chips Bulldozer & Bobcat Presentation
AMD Hot Chips Bulldozer & Bobcat PresentationAMD
 
Exploiting Linux Control Groups for Effective Run-time Resource Management
Exploiting Linux Control Groups for Effective Run-time Resource ManagementExploiting Linux Control Groups for Effective Run-time Resource Management
Exploiting Linux Control Groups for Effective Run-time Resource ManagementPatrick Bellasi
 
Advanced File Graphics Server
Advanced File Graphics ServerAdvanced File Graphics Server
Advanced File Graphics ServerRon Hutton
 
AMD Chiplet Architecture for High-Performance Server and Desktop Products
AMD Chiplet Architecture for High-Performance Server and Desktop ProductsAMD Chiplet Architecture for High-Performance Server and Desktop Products
AMD Chiplet Architecture for High-Performance Server and Desktop ProductsAMD
 
CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)
CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)
CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)byteLAKE
 
Case Study: Porting Qt for Embedded Linux on Embedded Processors
Case Study: Porting Qt for Embedded Linux on Embedded ProcessorsCase Study: Porting Qt for Embedded Linux on Embedded Processors
Case Study: Porting Qt for Embedded Linux on Embedded Processorsaccount inactive
 
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...Jeff Larkin
 
IP Address Lookup By Using GPU
IP Address Lookup By Using GPUIP Address Lookup By Using GPU
IP Address Lookup By Using GPUJino Antony
 

What's hot (20)

iMinds The Conference: Jan Lemeire
iMinds The Conference: Jan LemeireiMinds The Conference: Jan Lemeire
iMinds The Conference: Jan Lemeire
 
Apache con 2013-hadoop
Apache con 2013-hadoopApache con 2013-hadoop
Apache con 2013-hadoop
 
ISSCC 2018: "Zeppelin": an SoC for Multi-chip Architectures
ISSCC 2018: "Zeppelin": an SoC for Multi-chip ArchitecturesISSCC 2018: "Zeppelin": an SoC for Multi-chip Architectures
ISSCC 2018: "Zeppelin": an SoC for Multi-chip Architectures
 
Optimizing Apple Lossless Audio Codec Algorithm using NVIDIA CUDA Architecture
Optimizing Apple Lossless Audio Codec Algorithm using NVIDIA CUDA Architecture Optimizing Apple Lossless Audio Codec Algorithm using NVIDIA CUDA Architecture
Optimizing Apple Lossless Audio Codec Algorithm using NVIDIA CUDA Architecture
 
Reconfigurable Coprocessors Synthesis in the MPEG-RVC Domain
Reconfigurable Coprocessors Synthesis in the MPEG-RVC DomainReconfigurable Coprocessors Synthesis in the MPEG-RVC Domain
High Performance Computing Infrastructure: Past, Present, and Future

  • 1. High Performance Computing Infrastructure: Past, Present, and Future. By Clay Gloster, Jr., Ph.D., P.E., Associate Professor, Department of Electrical & Computer Engineering, Howard University. THE RARE PROJECT. cgloster@howard.edu. June 22, 2009.
  • 2. Presentation Outline
      • Introduction to Reconfigurable Computing
      • The Bison Configurable Digital Signal Processor
      • The BCDSP Design Flow
      • Current Function Cores and Modules
      • A Remote Reconfigurable Computer
      • A Parallel and Configurable Computer
  • 4. Problem Statement
      • Given: An application that is computationally intensive or requires considerable CPU execution time, e.g., weather modeling, remote sensing, target recognition, precision targeting, gene sequencing.
      • Find: A solution that significantly improves performance and requires acceptable development time, at a reasonable cost.
  • 5. Potential Solutions
      • Cluster-based computing: The use of several general-purpose computing systems, e.g., PCs. (Writing programs that execute on typical PCs/workstations.)
      • Application-Specific Integrated Circuit (ASIC) Design: The use of special-purpose ICs or chips. (Designing a chip (hardware) that is highly optimized for the particular application.)
      • Reconfigurable Computing: The merger of the two approaches. (Writing software to execute non-time-critical portions of the application on a PC while designing hardware to execute the time-critical portions of the application on an FPGA.)
  • 6. A Reconfigurable Computer is: a PC attached to one or more Field Programmable Gate Arrays (FPGAs). [Figure label: PC host.]
  • 7. An FPGA is: a programmable integrated circuit. At time t1, it can be programmed as X1 (personal digital assistant). At time t2, it can be programmed as X2 (calculator). [Figure labels: programmable pin, configurable logic block, programmable interconnect.]
  • 8. RC Systems Advantages
      • Several applications have been implemented on a reconfigurable computing system, with execution times an order of magnitude shorter than those of the same application implemented on a typical desktop computer.
      • The same reconfigurable computing system hardware can be reused for diverse applications.
      • An RC system can be deployed and subsequently reprogrammed with new hardware to perform functions that were not available at the time of deployment.
  • 9. RC Systems Disadvantages
      • Developing an RC system requires a system designer who is knowledgeable in both hardware design and software design.
      • Designing and implementing an RC system that executes faster than a typical desktop computer can take several months.
  • 10. Research Objectives
      • To obtain RC system implementations of several applications that achieve an order of magnitude speedup over executing the application on a typical desktop computer.
      • To develop tools that reduce RC system development time from months to weeks or days, and that allow users who are not knowledgeable in hardware design to implement RC systems and benefit from increased system performance and system reuse.
      • To develop a resource management system that efficiently utilizes available reconfigurable computing resources located at remote sites.
  • 11. The Bison Configurable Digital Signal Processor
  • 12. A Configurable Digital Signal Processor
      [Figure: block diagram of the BCDSP processor; labels: control unit, data unit, function core (FunCoreGen), and memories M0, M1, M2, M3, ..., Mn-2, Mn-1 (data) and Mn (instructions).]
  • 13. Functional Cores
      • Have one or more 32-bit inputs.
      • Perform floating point vector operations.
      • Have simple control.
      • Can be built using other FunCores.
      • Can include conditional units.
      [Figure: a FunCore block; labels: R0, R1, R7, ENABLE, DONE.]
  • 14. 2-D DCT Function Core
      [Figure: sixteen registers R0 through R15, eight multipliers (X), seven adders (+), and a single output Z0.]
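The figure on slide 14 shows eight multipliers and seven adders producing one output, which is the shape of an eight-term multiply-add tree. The following is a minimal Python sketch of that structure, not the author's hardware description. The function name `multiply_add_tree` and the split of the sixteen registers into two operand vectors of eight are assumptions made for illustration.

```python
from typing import List, Tuple


def multiply_add_tree(a: List[float], b: List[float]) -> Tuple[float, int]:
    """Model eight parallel multiplies followed by a pairwise adder tree.

    Hypothetical pairing: a[i] and b[i] stand for two of the sixteen
    registers; the slide does not say which registers are paired.
    Returns the single output (Z0) and the number of additions used.
    """
    if len(a) != 8 or len(b) != 8:
        raise ValueError("this sketch models exactly eight operand pairs")

    # Stage 1: eight independent multipliers.
    level = [x * y for x, y in zip(a, b)]

    # Stages 2-4: each adder level halves the number of partial sums.
    adds = 0
    while len(level) > 1:
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        adds += len(level)

    return level[0], adds


z0, adds_used = multiply_add_tree([1, 2, 3, 4, 5, 6, 7, 8], [1] * 8)
print(z0, adds_used)
```

The tree uses 4 + 2 + 1 = 7 additions arranged in three levels, which matches the seven adders counted in the figure.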
  • 15. Optimizing System Performance with the BCDSP
      • Memory is 64 bits wide, allowing two single-precision floating point numbers to be fetched in a single memory access.
      • There are N=4 data memories, hence multiple data items can be read/written in a single cycle. Theoretically, the number of memory accesses can be reduced by a factor of N=4. (This number can be increased to an upper bound of 2N=8 if we store two floating point values per location.)
      • Multiple function cores can be used. For example, a typical processor may have 1 multiplier. In this case, K multiplies require K time units or clock cycles. With K multipliers, K multiplies can be executed in a single time unit or clock cycle.
      • Pipelining and DMA accesses are used to increase system performance.
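The counting argument on slide 15 can be checked with a few lines of arithmetic. This is an idealized sketch that assumes perfectly balanced data placement and no contention. The helper names `memory_access_cycles` and `multiply_cycles` are hypothetical and do not come from the BCDSP tools.

```python
import math


def memory_access_cycles(num_values: int, num_memories: int = 1,
                         values_per_word: int = 1) -> int:
    """Ideal number of access cycles to move num_values operands.

    Assumes every memory delivers one word per cycle and the data is
    spread evenly across the memories (an idealization).
    """
    values_per_cycle = num_memories * values_per_word
    return math.ceil(num_values / values_per_cycle)


def multiply_cycles(num_multiplies: int, num_multipliers: int = 1) -> int:
    """Ideal number of cycles to issue num_multiplies independent multiplies."""
    return math.ceil(num_multiplies / num_multipliers)


values = 64
print(memory_access_cycles(values))        # one memory, one value per word
print(memory_access_cycles(values, 4))     # N = 4 memories
print(memory_access_cycles(values, 4, 2))  # N = 4, two floats per 64-bit word
print(multiply_cycles(8, 1), multiply_cycles(8, 8))
```

For 64 operands the three access counts are 64, 16, and 8, which correspond to the factors of N=4 and 2N=8 quoted on the slide.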
  • 16. BCDSP Software, Cores, and Processors
  • 17. Distinguishing Features of RCCT
      [Figure: flow diagrams comparing the traditional approach with our approach; labels: original source code, module definition file, special compiler, RCCT compiler, modified source code, HDL source code, session files, logic synthesis, high level compiler, placement & routing, executable code, bit stream.]
      • Placement and routing is performed off-line.
      • The Hardware Module Library evolves continuously.
      • Compiler can easily recognize new modules.
      • As new modules are added, the Compiler has a better chance to improve performance for each user application.
      AIST-0016-0044
  • 18. The Front-End Compiler
      • The purpose of the compiler is to map user applications to FPGA-based reconfigurable computers (RC), i.e., the BISON reconfigurable computer.
      • The compiler takes the original source code written in C/C++ and a module library and produces two outputs: the modified source code and a session file for each modified section.
      [Figure labels: original source code, module library, RCCT compiler, modified source code, session files, programming language compiler, new application executable (calls the loader).]
  • 19. The BCDSP Processor Back-End Compiler
      [Figure: dct.c → c2hl → dct_hl.vhd → hl2cudu → dct_cu.vhd, dct_du.vhd; PECORE.vhd also appears in the diagram.]
      hl2cudu consists of approximately 15 programs!
  • 20. Execution Time for the 2D-DCT

        Image Size    Software (ms)    Hardware (ms)    Speedup
                      2.97 GHz PC      24 MHz BCDSP
        8x8               0.0400           0.0112         3.56
        16x16             0.095            0.0272         3.48
        32x32             0.264            0.09150        2.88
        64x64             0.849            0.3484         2.43
        128x128           3.080            1.3746         2.24
        256x256          12.154            5.478          2.22
        512x512          60.556           21.8942         2.76
        1024x1024       185.754           87.5560         2.12

      Reconfigurable hardware was 2.71 times faster on average!
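The speedup column and the quoted average can be reproduced from the two timing columns. The sketch below recomputes them as software time divided by hardware time; the variable names are illustrative. Because the slide's timings are themselves rounded, the recomputed ratios may differ from the printed ones in the second decimal place.

```python
# (image size, software ms on the 2.97 GHz PC, hardware ms on the 24 MHz BCDSP)
timings_ms = [
    ("8x8", 0.0400, 0.0112),
    ("16x16", 0.095, 0.0272),
    ("32x32", 0.264, 0.09150),
    ("64x64", 0.849, 0.3484),
    ("128x128", 3.080, 1.3746),
    ("256x256", 12.154, 5.478),
    ("512x512", 60.556, 21.8942),
    ("1024x1024", 185.754, 87.5560),
]

# Speedup = software execution time / hardware execution time.
speedups = {size: sw / hw for size, sw, hw in timings_ms}
average_speedup = sum(speedups.values()) / len(speedups)

for size, s in speedups.items():
    print(f"{size:>10}  {s:.2f}")
print(f"average speedup: {average_speedup:.2f}")

# The hardware clock is far slower than the PC clock, so the gain
# comes from work done per cycle rather than from clock rate.
clock_ratio = 2970 / 24
print(f"PC clock is {clock_ratio:.1f}x the BCDSP clock")
```

Note that the hardware wins despite running at a clock rate more than 100 times lower than the PC's.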
  • 21. A Remote Configurable Computer
  • 22. A Remote And Reconfigurable Environment (RARE)
      [Figure: RARE block diagram; labels: processor library, remote environment, resource bank, resource controller, automated BCDSP tool set, application (C, Java, …), user parameters (power, size, weight, …), and FPGA0 through FPGAm, each with its own memories (M00 ... M0n, M10 ... M1p, Mm0 ... Mmq).]
  • 23. The RARE Project Infrastructure
      The RARE software is developed using Java. Java was selected because it offers a number of advantages over other programming languages: it supports native methods, remote method invocation, and network security. The native method feature allows software routines written in other programming languages, such as C/C++, to be called from Java applications. The remote method invocation and network security features make it possible to execute Java programs from a remote site.
      [Figure labels: Client.java with RMI, INTERNET, Server.java with RMI, NMI, Function.c, FPGA board, links.]
  • 24. PNN Execution Times

        Implementation Type    Local (ms)    Remote (ms)
        Software (Java)          628.71       2887.74
        Software (Cpp)           861.04       3116.17
        Hardware                 104.07        371.01

      Remote hardware can be faster than local software!
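The closing claim of slide 24 follows directly from the table. The short sketch below makes the comparison explicit; the dictionary layout and variable names are illustrative, and the "remote overhead" figure is simply the remote time minus the local time, not a quantity reported on the slide.

```python
pnn_ms = {
    "Software (Java)": {"local": 628.71, "remote": 2887.74},
    "Software (Cpp)": {"local": 861.04, "remote": 3116.17},
    "Hardware": {"local": 104.07, "remote": 371.01},
}

remote_hw = pnn_ms["Hardware"]["remote"]

# Compare remote hardware against each *local* software implementation.
for name, t in pnn_ms.items():
    if name == "Hardware":
        continue
    ratio = t["local"] / remote_hw
    print(f"{name} local vs. remote hardware: {ratio:.2f}x")

# Derived quantity: the extra time spent when the same implementation runs remotely.
remote_overhead_ms = {name: t["remote"] - t["local"] for name, t in pnn_ms.items()}
for name, extra in remote_overhead_ms.items():
    print(f"{name}: remote adds {extra:.2f} ms")
```

Both local software timings exceed the remote hardware timing, so the ratios printed by the first loop are greater than 1.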
  • 25. A Parallel and Configurable Computer System
  • 26. A Parallel and Configurable Computer
      • NSF MRI Grant: A Parallel and Configurable Computer for Research in Engineering and the Computational Mathematical Sciences ($500K)
      • Projects related to RFID, an Electronic Nose, PET Image Reconstruction, Image Compression, and Computer Vision are using this equipment to solve real world problems.
      [Figure: the Parallel CC; labels: CCN0, CCN1, CCN2, ..., CCNi, ..., CCN6, CCN7, PC2i, FPGABrd2i, PC2i+1, FPGABrd2i+1.]
  • 27. Cluster Specifications
      • 8 Compute Nodes
          – 1 x PCI-X dual port Infiniband 4X HCA card
          – 1 x 250GB SATA Hard Drive 7200RPM w/ 16MB Cache
          – 8 x 1GB PC3200 ECC Reg DDR (400MHz)
          – 1 x PNY nVidia Quadro FX 3000G w/ 8XAGP, 256MB DDR, Dual DVI/DVI
          – 2 x AMD Opteron Model 250 (2.4GHz)
          – 60-30-12921 1 x Dual Opteron S2885 EATX Motherboard w/ 8X AGP, gigE, SATA, audio, firewire, 4x 64-bit PCI
      • 1 Head Node
          – 1 x PCI-X dual port Infiniband 4X HCA card
          – 8 x 1GB PC3200 ECC Reg DDR (400MHz)
          – 2 x AMD Opteron Model 250 (2.4GHz)
          – 1 x PNY nVidia Quadro FX 3000G w/ 8XAGP, 256MB DDR, Dual DVI/DVI
          – 1 x 10/100/1000 64bit PCI-X Gigabit Copper NIC
      • 9 FPGA Coprocessors
          – 16 WS2P/XC2VP100-6P/48D/256 Wildstar II PRO PCI board with 2 ea P100-6 parts & 48 MB DDR SRAM and 256 MB DDR SDRAM
  • 28. RARE Project Past, Present, and Future
  • 29. Hierarchical Algorithms and their Embedded Computational Realization in Reconfigurable Hardware
      AIST Program NRA, Space Based Technologies. ESTO (Earth Science Technology Office).
      PI: Clay Gloster/Howard University. Proposal No: AIST 0016-0044.
      • Description and Objectives: This project addresses problems associated with developing data products for deployment in onboard RC systems. It involves the development of a compiler that reads algorithm descriptions written in C. The compiler will produce hardware and software components required for an RC implementation of typical NASA data products. The main objectives of this project are: efficient algorithm development and fast and reconfigurable hardware implementations (10X-100X speedup).
      • Approach: Develop a compiler to translate nested loops into a sequence of floating point vector instructions. These instructions correspond to modules in a library that is to be developed as a part of this project. Hardware modules will perform complex instructions, e.g., matmult, vec-vecmult, FFT, etc.
      • Deliverables:
          – Prototype RC Testbed shown above
          – Prototype Compiler
          – Cloud Masking Data Product Demonstration
          – Final Compiler
      • Application/Mission: Cloud Cover Assessment Data Product Development for EOS/AM-1 Satellite
      • Co-I’s/Partners: Hamid Krim, Tom Conte, NC State University
      [Figure: prototype RC testbed; labels: VLIW, Mem1 through Mem5, PE1 through PE5, FIFO 1 through FIFO 5, PCI Bus, with connections marked 61 and 34.]
  • 30. High Performance Weather Forecast Modeling: WRF Architecture • WRF is a next-generation HPC mesoscale forecast model and assimilation system developed as a collaborative effort by the atmospheric science community. It is a massively parallel computing environment for both forecasting and research purposes. • 3-Level Hierarchical Structure – Driver: processor management, etc. – Mediation: interface between Model and Driver – Model: plug-in algorithms that compute actual models • Model layer includes (figure courtesy of http://www.wrf-model.org) – Longwave radiation: RRTM – Shortwave radiation: NASA/GSFC, MM5 (Dudhia) – Cumulus: Kain-Fritsch, Betts-Miller-Janjic – Explicit microphysics: Kessler, Lin et al., NCEP 3-class (Hong) – PBL: MRF, MM5 (Slab) • WRF acknowledges the HPC problem and is currently pursuing the standard solution. • RARE solution: replace physics plug-ins with BCDSP FPGA equivalents
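The driver/mediation/model layering on slide 30 suggests why a physics plug-in can be swapped without touching the rest of the model. The following is a minimal sketch under stated assumptions: the `Mediation` class, the slot name, and both scheme functions are hypothetical, the arithmetic is a placeholder rather than real microphysics, and the "accelerated" version is an ordinary Python function standing in for an FPGA-backed call. None of this is WRF's actual interface.

```python
from typing import Callable, Dict, List

Column = List[float]
Scheme = Callable[[Column], Column]


def toy_scheme_software(column: Column) -> Column:
    """Placeholder plug-in (not real physics): subtract 0.5, clamp at zero."""
    return [max(q - 0.5, 0.0) for q in column]


def toy_scheme_accelerated(column: Column) -> Column:
    """Stand-in for a hardware equivalent. It must return the same values
    through the same interface; here it simply repeats the arithmetic."""
    return [q - 0.5 if q > 0.5 else 0.0 for q in column]


class Mediation:
    """Hypothetical mediation layer: maps a named slot to whichever
    plug-in is currently registered for it."""

    def __init__(self) -> None:
        self._plugins: Dict[str, Scheme] = {}

    def register(self, slot: str, scheme: Scheme) -> None:
        self._plugins[slot] = scheme

    def step(self, slot: str, column: Column) -> Column:
        return self._plugins[slot](column)


def driver(mediation: Mediation, columns: List[Column]) -> List[Column]:
    """Hypothetical driver: walks the columns and never names a scheme."""
    return [mediation.step("microphysics", c) for c in columns]


columns = [[1.0, 0.25], [2.0, 0.5]]
mediation = Mediation()
mediation.register("microphysics", toy_scheme_software)
software_out = driver(mediation, columns)

# Swap the plug-in; the driver code is unchanged.
mediation.register("microphysics", toy_scheme_accelerated)
accelerated_out = driver(mediation, columns)
```

The point of the sketch is only that the replacement happens at one registration call, which is the property the RARE approach relies on.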
  • 31. A Reconfigurable and Open Architecture Module for Unmanned Systems ● Reconfigurable modules can be reused across various types of unmanned systems, each containing a diverse range of sensors, cameras, displays, GPS receivers, etc. ● Reconfigurable modules can provide capabilities during the mission that were unknown before the mission began. ● When these modules are idle, a ground station can use their computing resources on remote unmanned systems. ● With reconfigurable modules, a fixed amount of hardware can be reconfigured to provide, in theory, an unlimited number of different capabilities. ● Because of the unpredictable nature of combat, reconfigurable systems provide the flexibility and performance needed to respond rapidly and effectively to unexpected threats. ● These systems can provide reconfigurable interfaces and interconnections. One system can accommodate any combination of interfaces: USB, Gigabit Ethernet, RS-422, IR, wireless, FireWire, etc.
  • 32. Current System Specification • FPGAs exploit parallelism to reach higher performance (sample rates, pixel or frame rates) with limited SWaP (size, weight, and power). • FPGA processing power can be combined and redistributed in real time to particular sensor(s). • FPGA-based payload interfaces combined with a hardware Open Architecture approach can provide reconfigurable software interfaces and physical interconnections. One system can accommodate any combination of interfaces: USB, Gigabit Ethernet, RS-422, IR, wireless, FireWire, etc. • The SAME Reconfigurable Context Neutral Payload Interface can be reused to accommodate many different unmanned vehicles and ground stations, each containing various sensors, cameras, radar systems, acoustics, LCD displays, GPS systems, etc., utilizing high-bandwidth connections to the interface. • System Specifications – Weight: 27 lbs – Size: 6 x 7 x 8.5 in – Power: 150 Watts – Interface: Gbit Ethernet, camera link, LVDS, 422, USB, FireWire – Image Formats: 4 Mb, 1080p, 720p, 480p, NTSC, RS-170, 1600 x 1200 IR, 360 HD-Visible and 640 IR, Stereo Capable • Completed: Software Decoder (H.264), Hardware Encoder (H.264), IMU/GPS Interface, Imaging System, Targeting System Interface • Demonstrated: Imagery (meta-data format; multiple streams (801.16)), Trigger and Sync, Video-teleconferencing through the payload
  • 34. Opening a Dialogue with Others • Graduate Student Support – One way for us to work together with others is via graduate students. – These students can bridge the gap between other disciplines and computer engineering. • Joint Proposals – One way for us to work together is to author joint proposals. – Alternatively, we can be supported under current funding. However, we would be willing to work with others even if there is no current support for our work, as long as there is potential for future support. • Implementation of a small portion of a model to demonstrate potential speedup. • There is potential to publish the results of this experiment in journals of other disciplines as well as in engineering journals.