High Performance Computing Infrastructure: Past, Present, and Future
1. High Performance Computing Infrastructure: Past, Present, and Future
By
Clay Gloster, Jr., Ph.D., P.E.
Associate Professor
Department of Electrical & Computer Engineering
Howard University
THE RARE PROJECT
cgloster@howard.edu
June 22, 2009
2. Presentation Outline
• Introduction to Reconfigurable Computing
• The Bison Configurable Digital Signal Processor
• The BCDSP Design Flow
• Current Function Cores and Modules
• A Remote Reconfigurable Computer
• A Parallel and Configurable Computer
4. Problem Statement
• Given: An application that is computationally intensive or requires considerable CPU execution time, e.g., weather modeling, remote sensing, target recognition, precision targeting, gene sequencing.
• Find: A solution that significantly improves performance, requires acceptable development time, and has a reasonable cost.
5. Potential Solutions
• Cluster-based computing: The use of several general purpose computing systems, e.g., PCs. (Writing programs that execute on typical PCs/workstations.)
• Application-Specific Integrated Circuit (ASIC) Design: The use of special-purpose ICs or chips. (Designing a chip (hardware) that is highly optimized for the particular application.)
• Reconfigurable Computing: The merger of the two approaches. (Writing software to execute non-time-critical portions of the application on a PC while designing hardware to execute the time-critical portions of the application on an FPGA.)
6. A Reconfigurable Computer is:
[Figure labels: PC, Host]
A PC attached to one or more Field Programmable Gate Arrays (FPGAs).
7. An FPGA is:
[Figure labels: Programmable Pin, Configurable Logic Block, Programmable Interconnect]
A programmable integrated circuit.
At time t1, it can be programmed as X1 (personal digital assistant).
At time t2, it can be programmed as X2 (calculator).
8. RC Systems Advantages
• Several applications implemented on a reconfigurable computing system have achieved execution times an order of magnitude faster than the same application on a typical desktop computer.
• The same reconfigurable computing system hardware can be reused for diverse applications.
• An RC system can be deployed and subsequently reprogrammed with new hardware to perform functions that were not available at the time of deployment.
9. RC Systems Disadvantages
• Developing an RC system requires a system designer who is knowledgeable in both hardware and software design.
• Designing and implementing an RC system that executes faster than a typical desktop computer can take several months.
10. Research Objectives
• To obtain RC system implementations of several applications that achieve an order of magnitude speedup over executing the application on a typical desktop computer.
• To develop tools that reduce RC system development time from months to weeks or days, and that allow users without hardware design expertise to implement RC systems and gain the potential benefits of increased system performance and system reuse.
• To develop a resource management system that efficiently utilizes available reconfigurable computing resources located at remote sites.
12. A Configurable Digital Signal Processor
[Figure: Processor (BCDSP) block diagram. Labels: CONTROL UNIT, DATA UNIT, Function Core (FunCoreGen), data memories M0 to Mn-1, instruction memory Mn.]
13. Functional Cores
• Have one or more 32-bit inputs.
• Perform floating point vector operations.
• Have simple control.
• Can be built using other FunCores.
• Can include conditional units.
[Figure: FunCore block with inputs R0, R1, ..., R7 and control signals ENABLE and DONE.]
14. 2-D DCT Function Core
[Figure: sixteen inputs R0 to R15 feed eight multipliers (X), followed by an adder tree of four, two, and one adders (+) producing output Z0.]
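The function core drawn on this slide (eight multipliers feeding a three-level adder tree) can be sketched in software as follows. This is only an illustration, not the BCDSP implementation: the name `multiply_add_tree` is hypothetical, and the pairing of adjacent registers (R0 with R1, R2 with R3, and so on) is an assumption, since the slide does not say which inputs feed each multiplier.

```python
def multiply_add_tree(regs):
    """Model eight multipliers followed by a binary adder tree.

    regs: sixteen values standing in for registers R0..R15.
    Assumption: adjacent registers are paired (R0*R1, R2*R3, ...).
    """
    if len(regs) != 16:
        raise ValueError("expected 16 register values (R0..R15)")

    # Stage 1: eight multipliers operating side by side.
    level = [regs[i] * regs[i + 1] for i in range(0, 16, 2)]

    # Stages 2-4: adder tree with four, then two, then one adder.
    while len(level) > 1:
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]

    return level[0]  # corresponds to output Z0


if __name__ == "__main__":
    z0 = multiply_add_tree([1.0, 2.0] * 8)
    print(z0)
```

In hardware each stage would operate on all of its operands at once, so the result takes four stages rather than fifteen sequential operations; the Python loop only mirrors the data flow.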
15. Optimizing System Performance with the BCDSP
• Memory is 64 bits wide, allowing two single-precision floating point numbers to be fetched in a single memory access.
• There are N=4 data memories, hence multiple data items can be read or written in a single cycle. Theoretically, the number of memory accesses can be reduced by a factor of N=4. (This factor can be increased to an upper bound of 2N=8 if we store two floating point values per location.)
• Multiple function cores can be used. For example, a typical processor may have one multiplier, in which case K multiplies require K time units or clock cycles. With K multipliers, K multiplies can be executed in a single time unit or clock cycle.
• Pipelining and DMA accesses are used to increase system performance.
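The counting argument on this slide can be written as a small idealized model. The function names and the ceiling-division model are assumptions made for illustration; the model ignores pipelining, DMA, and any contention between memories, so it gives the theoretical bounds only.

```python
import math


def memory_accesses(num_values, num_memories=1, values_per_location=1):
    """Idealized access count when values are spread evenly over the memories."""
    per_access = num_memories * values_per_location
    return math.ceil(num_values / per_access)


def multiply_cycles(num_multiplies, num_multipliers=1):
    """Idealized cycle count when multipliers work in parallel."""
    return math.ceil(num_multiplies / num_multipliers)


if __name__ == "__main__":
    values = 64
    print(memory_accesses(values))           # one memory, one value per location
    print(memory_accesses(values, 4))        # N=4 memories
    print(memory_accesses(values, 4, 2))     # N=4, two floats per 64-bit location
    print(multiply_cycles(8, 1), multiply_cycles(8, 8))
```

With 64 values, the three memory cases differ by the factors of 4 and 8 stated above, and K=8 multiplies drop from 8 cycles to 1 when 8 multipliers are available.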
17. Distinguishing Features of RCCT
[Figure: side-by-side flow diagrams. Traditional approach: Original Source Code, Special Compiler, HDL Code and Modified Source Code, Logic Synthesis, Placement & Routing, Bit Stream, High Level Compiler, Executable Code. Our approach: Original Source Code, Module Definition File, RCCT Compiler, Session Files, Modified Source Code, High Level Compiler, Executable Code.]
• Placement and routing are performed off-line.
• The Hardware Module Library evolves continuously.
• The compiler can easily recognize new modules.
• As new modules are added, the compiler has a better chance to improve performance for each user application.
AIST-0016-0044
18. The Front-End Compiler
• The purpose of the compiler is to map user applications to FPGA-based reconfigurable computers (RC), e.g., the BISON reconfigurable computer.
• The compiler takes the original source code written in C/C++ and a module library and produces two outputs: the modified source code and a session file for each modified section.
[Figure: compiler flow. Labels: Original Source Code, Module Library, RCCT Compiler, Modified Source Code, Session files, Programming Language Compiler, New Application Executable (Calls the Loader).]
19. The BCDSP Processor Back-End Compiler
[Figure: dct.c → c2hl → dct_hl.vhd → hl2cudu → dct_cu.vhd, dct_du.vhd; PECORE.vhd also appears in the diagram.]
hl2cudu consists of approximately 15 programs!
20. Execution Time for the 2D-DCT
Image Size    Software (ms)    Hardware (ms)    Speedup
              2.97 GHz PC      24 MHz BCDSP
8x8           0.0400           0.0112           3.56
16x16         0.095            0.0272           3.48
32x32         0.264            0.09150          2.88
64x64         0.849            0.3484           2.43
128x128       3.080            1.3746           2.24
256x256       12.154           5.478            2.22
512x512       60.556           21.8942          2.76
1024x1024     185.754          87.5560          2.12
Reconfigurable hardware was 2.71 times faster on average!
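The average quoted on this slide can be checked from the table. The sketch below copies the table values verbatim and computes both the mean of the Speedup column and the mean of the software/hardware ratios recomputed from the listed times. The variable names are illustrative; the two means can differ slightly because the listed times are rounded.

```python
# (image size, software ms on the PC, hardware ms on the BCDSP, reported speedup)
rows = [
    ("8x8", 0.0400, 0.0112, 3.56),
    ("16x16", 0.095, 0.0272, 3.48),
    ("32x32", 0.264, 0.09150, 2.88),
    ("64x64", 0.849, 0.3484, 2.43),
    ("128x128", 3.080, 1.3746, 2.24),
    ("256x256", 12.154, 5.478, 2.22),
    ("512x512", 60.556, 21.8942, 2.76),
    ("1024x1024", 185.754, 87.5560, 2.12),
]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Mean of the Speedup column as printed on the slide.
reported_mean = mean(speedup for _, _, _, speedup in rows)

# Mean of the ratios recomputed from the listed execution times.
recomputed_mean = mean(sw / hw for _, sw, hw, _ in rows)

if __name__ == "__main__":
    print(f"mean of reported speedups:   {reported_mean:.2f}")
    print(f"mean of recomputed speedups: {recomputed_mean:.2f}")
```

Note that this is an arithmetic mean over image sizes, so the small images count as much as the large ones even though they contribute very little total run time.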
22. A Remote And Reconfigurable Environment (RARE)
[Figure: RARE block diagram. Inputs: Processor Library, Application (C, Java, ...), User Parameters (power, size, weight, ...). Blocks: Automated BCDSP Tool Set, Remote Environment, Resource Controller. Resource Bank: FPGA0 with M00, M01, ..., M0n; FPGA1 with M10, M11, ..., M1p; FPGAm with Mm0, Mm1, ..., Mmq.]
23. The RARE Project Infrastructure
The RARE software is developed in Java. Java was selected because it supports native methods, remote method invocation (RMI), and network security. The native method feature allows software routines written in other programming languages, such as C/C++, to be called from Java applications. Remote method invocation and the network security features make it possible to execute Java programs from a remote site.
[Figure: Client.java (with RMI links), INTERNET, Server.java (with RMI links), NMI, Function.c, FPGA Board.]
24. PNN Execution Times
Implementation Type    Local (ms)    Remote (ms)
Software (Java)        628.71        2887.74
Software (Cpp)         861.04        3116.17
Hardware               104.07        371.01
Remote hardware can be faster than local software!
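The closing claim on this slide follows from comparing the Remote column for hardware with the Local column for software. A short sketch of that arithmetic, using the table values verbatim (the dictionary layout and the function name are illustrative):

```python
# Execution times in milliseconds, copied from the table.
times_ms = {
    "Software (Java)": {"local": 628.71, "remote": 2887.74},
    "Software (Cpp)": {"local": 861.04, "remote": 3116.17},
    "Hardware": {"local": 104.07, "remote": 371.01},
}


def local_software_vs_remote_hardware(implementation):
    """Ratio of local software time to remote hardware time (>1 means remote hardware wins)."""
    return times_ms[implementation]["local"] / times_ms["Hardware"]["remote"]


if __name__ == "__main__":
    for name in ("Software (Java)", "Software (Cpp)"):
        ratio = local_software_vs_remote_hardware(name)
        print(f"{name}: remote hardware is {ratio:.2f}x faster than local execution")
```

Both ratios are greater than one, which is the sense in which remote hardware outperforms local software for this PNN workload.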
26. A Parallel and Configurable Computer
• NSF MRI Grant: A Parallel and Configurable Computer for Research in Engineering and the Computational Mathematical Sciences ($500K)
• Projects related to RFID, an Electronic Nose, PET Image Reconstruction, Image Compression, and Computer Vision are using this equipment to solve real world problems.
[Figure: Parallel CC. Labels: CCN0, CCN1, CCN2, ..., CCNi, ..., CCN6, CCN7; PC2i, FPGABrd2i, PC2i+1, FPGABrd2i+1.]
27. Cluster Specifications
• 8 Compute Nodes
  - 1 x PCI-X dual port Infiniband 4X HCA card
  - 1 x 250GB SATA Hard Drive 7200RPM w/ 16MB Cache
  - 8 x 1GB PC3200 ECC Reg DDR (400MHz)
  - 1 x PNY nVidia Quadro FX 3000G w/ 8XAGP, 256MB DDR, Dual DVI/DVI
  - 2 x AMD Opteron Model 250 (2.4GHz)
  - 60-30-12921 1 x Dual Opteron S2885 EATX Motherboard w/ 8X AGP, gigE, SATA, audio, firewire, 4x 64-bit PCI
• 1 Head Node
  - 1 x PCI-X dual port Infiniband 4X HCA card
  - 8 x 1GB PC3200 ECC Reg DDR (400MHz)
  - 2 x AMD Opteron Model 250 (2.4GHz)
  - 1 x PNY nVidia Quadro FX 3000G w/ 8XAGP, 256MB DDR, Dual DVI/DVI
  - 1 x 10/100/1000 64bit PCI-X Gigabit Copper NIC
• 9 FPGA Coprocessors
  - 16 WS2P/XC2VP100-6P/48D/256 Wildstar II PRO PCI board with 2 ea P100-6 parts & 48 MB DDR SRAM and 256 MB DDR SDRAM
29. AIST Program Space Based NRA Technologies
Hierarchical Algorithms and their Embedded Computational Realization in Reconfigurable Hardware
ESTO: Earth Science Technology Office
PI: Clay Gloster/Howard University
Proposal No: AIST 0016-0044

Description and Objectives
This project addresses problems associated with developing data products for deployment in onboard RC systems. It involves the development of a compiler that reads algorithm descriptions written in C. The compiler will produce hardware and software components required for an RC implementation of typical NASA data products. The main objectives of this project are: efficient algorithm development and fast and reconfigurable hardware implementations (10X-100X speedup).
[Figure: prototype RC testbed. Labels: VLIW; Mem1 to Mem5; PE1 to PE5; FIFO 1 to FIFO 5; PCI Bus; numeric annotations 61 and 34.]

Approach
Develop a compiler to translate nested loops into a sequence of floating point vector instructions. These instructions correspond to modules in a library that is to be developed as a part of this project. Hardware modules will perform complex instructions, e.g., matmult, vec-vecmult, FFT.

Deliverables
- Prototype RC Testbed shown above
- Prototype Compiler
- Cloud Masking Data Product Demonstration
- Final Compiler

Application/Mission
Cloud Cover Assessment Data Product Development for EOS/AM-1 Satellite

Co-I's/Partners
Hamid Krim, Tom Conte, NC State University
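The idea in the Approach on this slide, replacing nested loops with calls to vector-level modules, can be illustrated with matrix multiplication. This is a hand-written sketch, not output of the project's compiler: `vec_dot` is a hypothetical software stand-in for a hardware vector module, and the rewrite shown is only one plausible mapping.

```python
def vec_dot(a, b):
    """Hypothetical stand-in for a hardware module: element-wise multiply, then sum."""
    return sum(x * y for x, y in zip(a, b))


def matmul_nested_loops(A, B):
    """Original form: three nested scalar loops."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                C[i][j] += A[i][k] * B[k][j]
    return C


def matmul_vector_ops(A, B):
    """Rewritten form: the innermost loop becomes one vector operation per output element."""
    columns = list(zip(*B))  # column vectors of B
    return [[vec_dot(row, col) for col in columns] for row in A]


if __name__ == "__main__":
    A = [[1.0, 2.0], [3.0, 4.0]]
    B = [[5.0, 6.0], [7.0, 8.0]]
    print(matmul_nested_loops(A, B))
    print(matmul_vector_ops(A, B))
```

Both functions compute the same product; the second form exposes whole-vector operations that a library module could execute in hardware, which is where a speedup would come from.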
30. High Performance Weather Forecast Modeling
WRF is an HPC next generation mesoscale forecast model and assimilation system developed as a collaborative effort by the atmospheric science community. It is a massively parallel computing environment for both forecasting and research purposes.
3 Level Hierarchical Structure
• Driver: Processor management, etc.
• Mediation: interface between Model and Driver
• Model: plug-in algorithms that compute actual models
Model layer includes
• Longwave radiation: RRTM
• Shortwave radiation: NASA/GSFC, MM5 (Dudhia)
• Cumulus: Kain-Fritsch, Betts-Miller-Janjic
• Explicit microphysics: Kessler, Lin et al., NCEP 3-class (Hong)
• PBL: MRF, MM5 (Slab)
[Figure: WRF Architecture. Figure courtesy of http://www.wrf-model.org]
WRF acknowledges the HPC problem, and is currently pursuing the standard solution.
RARE solution: replace physics plug-ins with BCDSP FPGA equivalents.
31. A Reconfigurable and Open Architecture Module for Unmanned Systems
- Reconfigurable modules can be reused for various types of unmanned systems, each containing a diverse range of sensors, cameras, displays, GPS receivers, etc.
- Reconfigurable modules can provide capabilities during the mission that were unknown prior to the beginning of the mission.
- With these modules, computing resources can be used on remote unmanned systems from a ground station when these modules are idle.
- With reconfigurable modules, a fixed amount of hardware can be changed to theoretically provide an infinite number of different capabilities.
- Because of the unpredictable nature of combat, reconfigurable systems provide the flexibility and performance needed to respond rapidly and effectively to unexpected threats.
- These systems can provide reconfigurable interfaces and interconnections. One system can accommodate any combination of interfaces: USB, Gigabit Ethernet, RS432, IR, wireless, FireWire, etc.
32. Current System Specification
• FPGAs exploit parallelism to reach higher performance (sample rates, pixel or frame rates) with limited SWAP.
• FPGA processing power can be combined and redistributed in real time to a particular sensor or sensors.
• FPGA-based payload interfaces combined with a hardware Open Architecture approach can provide reconfigurable software interfaces and physical interconnections.
One system can accommodate any combination of interfaces: USB, Gigabit Ethernet, RS432, IR, wireless, FireWire, etc.
The SAME Reconfigurable Context Neutral Payload Interface can be reused to accommodate many different unmanned vehicles and ground stations, each containing various sensors, cameras, radar systems, acoustics, LCD displays, GPS systems, etc., utilizing high-bandwidth connections to the interface.
System Specifications
• Weight - 27 lbs
• Size - 6 x 7 x 8.5 in
• Power - 150 Watts
• Interface - Gbit Ethernet, camera link, LVDS, 422, USB, FireWire
• Image Formats - 4 Mb, 1080p, 720p, 480p, NTSC, RS-170, 1600 x 1200 IR, 360 HD-Visible and 640 IR, Stereo Capable
Completed
• Software Decoder (H.264), Hardware Encoder (H.264), IMU/GPS Interface, Imaging System, Targeting System Interface
Demonstrated
Imagery (meta-data format; multiple streams (801.16)), Trigger and Sync, Video-teleconferencing through the payload
34. Opening a Dialogue with Others
• Graduate Student Support
  - One way for us to work together with others is via graduate students.
  - These students can bridge the gap between other disciplines and computer engineering.
• Joint Proposals
  - One way for us to work together is to author joint proposals.
  - Alternatively, we can be supported under current funding. However, we would be willing to work with others even if there is no current support for our work, as long as there is potential for future support.
• Implementation of a small portion of a model to demonstrate potential speedup.
• There is a potential to publish results of this experiment in journals of other disciplines as well as in engineering journals.