http://www.dur.ac.uk/cfai/adaptiveoptics/rtc2011/agenda/abstracts/#VV1
Vivek Venugopal (National Solar Observatory): Real-time control for the Advanced Technology Solar Telescope (20 minutes)
Real-time processing for Adaptive Optics (AO) systems is challenging because the motion vectors must be computed, and the mirrors actuated, before the wavefront information becomes obsolete. The four-meter Advanced Technology Solar Telescope (ATST) will provide unprecedented resolution for solar observation due to its larger aperture. The ATST AO system, with a 2 kHz frame-rate camera, 1750 sub-apertures and 1900 actuators, requires massively parallel processing, and this computational demand is far beyond what conventional processors can manage. Hardware accelerators such as Field Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs) are better equipped to meet the parallel processing requirements of such a system. We investigate the implementation of the data processing architecture for Shack-Hartmann correlation and wavefront reconstruction using FPGAs and GPUs. We benchmark the AO algorithm implemented on FPGAs and GPUs and compare it with the existing legacy FPGA-Digital Signal Processing (DSP) based hardware system used in the 76 cm Dunn Solar Telescope (DST).
Real-time processing for ATST
1. RTC Workshop, Durham, UK, April 2011
Real-time processing for the Advanced
Technology Solar Telescope
Vivek Venugopal (vivekv@nso.edu)
National Solar Observatory
Sunspot, New Mexico, USA
Wednesday, April 13, 2011
7. Dark and flat correction
• Dark pixel and flat pixel stored in RAM
• Flat corrected product is concatenated and written to FIFO
• Flat accumulated value can be used to update the reference image
[Diagram: one identical datapath per pixel (labeled pixel0, pixel 1, ..., pixel16). The 8-bit dark_pixel is subtracted from the 10-bit pixel, the 10-bit difference is multiplied by the 8-bit flat_pixel to give an 18-bit flat_product, and an accumulator produces the flat_acc value.]
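The per-pixel datapath on this slide can be sketched in software as follows. This is only an illustration under stated assumptions, not the FPGA design itself: the function name `dark_flat_correct` is hypothetical, negative differences are clamped to zero (the slide does not say how they are handled), and the accumulator is assumed to be a per-pixel running sum across calls.

```python
PIXEL_BITS = 10   # raw pixel width, from the slide
REF_BITS = 8      # dark_pixel and flat_pixel width, from the slide


def dark_flat_correct(pixels, dark, flat, acc):
    """Apply (pixel - dark) * flat to each pixel and update acc in place.

    pixels: raw 10-bit values; dark, flat: 8-bit reference values.
    acc: list of running sums, one per pixel (an assumed interpretation
    of the slide's accumulator).
    Returns the list of flat-corrected products.
    """
    products = []
    for i, (p, d, f) in enumerate(zip(pixels, dark, flat)):
        if not (0 <= p < 1 << PIXEL_BITS and 0 <= d < 1 << REF_BITS
                and 0 <= f < 1 << REF_BITS):
            raise ValueError("input outside the bit widths on the slide")
        diff = max(p - d, 0)      # assumption: clamp negative results
        product = diff * f        # 10-bit x 8-bit always fits in 18 bits
        products.append(product)
        acc[i] += product
    return products
```

The largest possible product is 1023 * 255 = 260865, which is below 2^18, consistent with the 18-bit flat_product width in the diagram.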
8. Pixel unpacking & Dark and flat correction
[Block diagram: 12 channels (1/2 camera), channel 1 to channel 12. In each channel a Receiver feeds a Data unpack block, a FIFO and a Dark-flat correction/accumulator block, and the channel outputs go to the PCIe system bus. Shared blocks: Synchronizer/counters and the dark and flat reference image value RAMs. Signal width labels in the diagram: 256, 128, 160, 16, 160, 288. Timing labels: 206.8 ns, 20 ns.]
Two clock domains are labeled:
clock period = 9.42 ns, clock rate = 106.15 MHz
clock period = 5 ns, clock rate = 200 MHz
9. Nvidia Tesla C2050 GPU
• Nvidia Tesla C2050: 14 streaming multi-processors with 32 cores each (SIMD) clocked at 1.15 GHz
• 3 GB on-board RAM
• Kernel-based execution
• 1.288 TFLOPS single precision
• 515.2 GFLOPS double precision
[Diagram: GPU with Multiprocessor 1 to Multiprocessor 14. Each multiprocessor contains an Instruction Cache, two Warp Schedulers with Dispatch Units, a Register File, two groups of cores (Core 1 to Core 16 each), Load/Store units 1 to 16, SFU 1 to SFU 4, an Interconnection Network, 64 KB Shared Memory/L1 cache and a Uniform Cache.]
Reference: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
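The peak figures can be sanity-checked from the core count and clock on this slide. The sketch below assumes one fused multiply-add (counted as 2 FLOPs) per core per cycle in single precision and half that rate in double precision; both are assumptions about the architecture, not statements from the slide, and the helper name `peak_gflops` is hypothetical. Under these assumptions the double-precision result matches the 515.2 GFLOPS quoted on the slide, while the single-precision result comes out near 1.03 TFLOPS, lower than the quoted 1.288 TFLOPS.

```python
def peak_gflops(multiprocessors, cores_per_mp, clock_mhz, flops_per_cycle):
    """Peak throughput in GFLOPS for the given core count and clock."""
    return multiprocessors * cores_per_mp * clock_mhz * flops_per_cycle / 1000.0


# 14 multiprocessors x 32 cores at 1.15 GHz (1150 MHz), from the slide.
single_precision = peak_gflops(14, 32, 1150, 2)  # assumed 2 FLOPs/cycle (FMA)
double_precision = peak_gflops(14, 32, 1150, 1)  # assumed half the SP rate
print(single_precision, double_precision)
```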
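10. Process mapping and partitioning
[Diagram: Raw pixels (20x20), Dark pixels (20x20), Flat pixels (20x20) and Reference pixels (20x20) feed a processing chain of dark correction, flat correction, 2D cross-correlation, find x and y maximum, and interpolation. The chain is partitioned between the FPGA and the GPU.]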
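The 2D cross-correlation stage in this chain can be illustrated with a direct (non-FFT) correlation of a sub-aperture image against the reference image. This is a minimal sketch under assumptions: the function name `cross_correlate` is hypothetical, pixels outside the image are treated as zero, and the default search range of ±3 pixels (a 7x7 output) is chosen to match the "7x7 correlation" label in the GPU results rather than taken from a stated specification.

```python
def cross_correlate(image, reference, max_shift=3):
    """Direct 2D cross-correlation over shifts in [-max_shift, max_shift].

    image and reference are square 2D lists of equal size.
    out[dy + max_shift][dx + max_shift] holds the correlation for shift
    (dy, dx); pixels falling outside the image contribute zero.
    """
    size = len(reference)
    span = 2 * max_shift + 1
    out = [[0] * span for _ in range(span)]
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            total = 0
            for y in range(size):
                for x in range(size):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < size and 0 <= xx < size:
                        total += image[yy][xx] * reference[y][x]
            out[dy + max_shift][dx + max_shift] = total
    return out


# Usage: a single bright pixel shifted by (dy, dx) = (1, 2).
reference = [[0] * 20 for _ in range(20)]
reference[10][10] = 1
image = [[0] * 20 for _ in range(20)]
image[11][12] = 1
corr = cross_correlate(image, reference)
```

In this example the only nonzero entry of `corr` is at row 4, column 5, which corresponds to the shift (1, 2) after adding `max_shift`.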
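12. find_max and interpolation routines
• Find the maximum value and its index
• Find x and y shifts using the interpolation equations
\[
\begin{aligned}
\text{num}_x &= \text{max\_value} - \text{out}(\text{shifted\_y\_index},\ \text{shifted\_x\_index} - 1) \\
\text{den}_x &= 2 \cdot \text{max\_value} - \text{out}(\text{shifted\_y\_index},\ \text{shifted\_x\_index} - 1) - \text{out}(\text{shifted\_y\_index},\ \text{shifted\_x\_index} + 1) \\
x &= (\text{shifted\_x\_index} - 0.5) + \frac{\text{num}_x}{\text{den}_x} \\
\text{num}_y &= \text{max\_value} - \text{out}(\text{shifted\_y\_index} - 1,\ \text{shifted\_x\_index}) \\
\text{den}_y &= 2 \cdot \text{max\_value} - \text{out}(\text{shifted\_y\_index} - 1,\ \text{shifted\_x\_index}) - \text{out}(\text{shifted\_y\_index} + 1,\ \text{shifted\_x\_index}) \\
y &= (\text{shifted\_y\_index} - 0.5) + \frac{\text{num}_y}{\text{den}_y}
\end{aligned}
\]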
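The two routines on this slide can be sketched directly from the equations. The function names `find_max` and `subpixel_shift` are illustrative, `out` is assumed to be a 2D list indexed as `out[y][x]`, and the sketch assumes the maximum does not lie on the border of `out` (the slide does not say how border peaks are handled).

```python
def find_max(out):
    """Return (max_value, y_index, x_index) of a 2D list."""
    best = (out[0][0], 0, 0)
    for y, row in enumerate(out):
        for x, value in enumerate(row):
            if value > best[0]:
                best = (value, y, x)
    return best


def subpixel_shift(out):
    """Sub-pixel peak position (x, y) using the slide's interpolation."""
    max_value, yi, xi = find_max(out)
    # Assumption: the peak is an interior point, so all neighbors exist.
    num_x = max_value - out[yi][xi - 1]
    den_x = 2 * max_value - out[yi][xi - 1] - out[yi][xi + 1]
    x = (xi - 0.5) + num_x / den_x
    num_y = max_value - out[yi - 1][xi]
    den_y = 2 * max_value - out[yi - 1][xi] - out[yi + 1][xi]
    y = (yi - 0.5) + num_y / den_y
    return x, y


# Usage: samples of a paraboloid peaking at x = 3.25, y = 2.75.
surface = [[-(x - 3.25) ** 2 - (y - 2.75) ** 2 for x in range(7)]
           for y in range(7)]
peak_x, peak_y = subpixel_shift(surface)
```

For a surface that is exactly parabolic around its peak, as in this usage example, the equations recover the true peak position; for real correlation outputs the result is an approximation.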
13. GPU results
[Two bar charts of time in us against No. of images, comparing Tesla C1060 and Tesla C2050.]
FFT correlation (1 and 50 images; axis 0 to 2200 us): bar labels 1889, 1619, 1510, 1188
7x7 correlation (1, 50 and 584 images; axis 0 to 400 us): bar labels 313, 307, 301, 278, 279, 281
Note: Least time indicates better performance
14. Reconstruction routine
[Diagram: the x and y shifts for the 1750 sub-aperture images (1750 x values and 1750 y values, 3500 in total) are multiplied by the 1900x3500 reconstruction matrix to give the accumulated values for the 1900 actuators.]
• 1750 sub-aperture x and y shifts
• 3500 x 1900 reconstruction matrix
[Bar chart: time in us (log scale, 1 to 100000) by device for Tesla C1060, Tesla C2050, DSP and CPU. Bar labels: 46769, 964, 956, 229.]
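The reconstruction step is a matrix-vector product: one row of the reconstruction matrix per actuator, one column per shift value. The sketch below illustrates it on plain Python lists. The function name `reconstruct` is hypothetical, and the ordering of the input vector (all x shifts followed by all y shifts) is an assumption, since the slide does not state how the 3500 values are arranged.

```python
def reconstruct(recon_matrix, x_shifts, y_shifts):
    """Multiply the reconstruction matrix by the stacked shift vector.

    recon_matrix: one row per actuator, one column per shift value.
    Assumption: the shift vector is all x shifts followed by all y shifts.
    Returns one accumulated value per actuator.
    """
    shifts = list(x_shifts) + list(y_shifts)
    for row in recon_matrix:
        if len(row) != len(shifts):
            raise ValueError("matrix row length must match number of shifts")
    return [sum(r * s for r, s in zip(row, shifts)) for row in recon_matrix]


# Toy example: 2 actuators, 2 sub-apertures (4 shift values).
toy_matrix = [[1, 0, 2, 0],
              [0, 1, 0, -1]]
actuators = reconstruct(toy_matrix, [1.0, 2.0], [3.0, 4.0])

# Full-size problem from the slide: multiply-accumulates per frame.
macs_per_frame = 1900 * 3500
```

At the sizes on the slide this is 6,650,000 multiply-accumulate operations per frame, which is why the stage is a natural candidate for parallel hardware.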
20. Synthesis estimates for Virtex-6 FPGA
• Implement dark, flat correction only: resources used 288 out of 687,360 (1%)
• Implement the correlation for single channel up to the sub-aperture accumulator within the channel (without the final 12 channel accumulation): resources used 2,578 out of 687,360 (1%)
Device utilization summary:
Slice Logic Utilization:
Number of Slice Registers: 992448 out of 687360 144% (*)
Number of Slice LUTs: 1126081 out of 343680 327% (*)
Number used as Logic: 1125853 out of 343680 327% (*)
Number used as Memory: 228 out of 99200
Number used as SRL: 37
21. FPGA timing
[Timing diagram. Signals, in order: Rxdata from transceiver; unpacked data written to FIFO; unpacked data read from FIFO; dark-flat output; input to xcorr_pixel module; output from xcorr_pixel; output from sub-aperture accumulator per channel. Interval labels, in order of appearance: 123.73 ns, 40 ns, 95 ns, 15 ns, 40 ns, 20 ns, 16 ns, 91 ns.]
• Each data packet is available from the FIFO after 95 ns
• 95 ns * 5 packets * 10 rows = 4.75 us to read the data from the FIFO
• Total latency for computing the 960 rows x 480 columns = 4.75 us * (960/20) = 228 us.
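The latency arithmetic in the bullets above can be reproduced step by step. The constant names are illustrative labels for the factors that appear in the bullets; the comparison with the frame period assumes the 2 kHz camera frame rate given in the abstract.

```python
PACKET_NS = 95          # time until a packet is available from the FIFO
PACKETS = 5             # "5 packets" factor from the slide
ROWS = 10               # "10 rows" factor from the slide
TOTAL_ROWS = 960        # image rows
ROW_DIVISOR = 20        # the "(960/20)" factor from the slide

fifo_read_us = PACKET_NS * PACKETS * ROWS / 1000.0        # 4.75 us
total_latency_us = fifo_read_us * (TOTAL_ROWS / ROW_DIVISOR)

frame_period_us = 1e6 / 2000   # 2 kHz frame rate, from the abstract
fits_in_frame = total_latency_us < frame_period_us
print(fifo_read_us, total_latency_us, fits_in_frame)
```

Under these numbers the 228 us total is less than half of the 500 us frame period.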
22. GPU vs FPGA vs DSP
[Timeline diagram with interval labels 100 us, 225 us and 300.93 us. Rows: Camera readout; Data transfer through PCIe x16; C2050 GPU 1, C2050 GPU 2, C2050 GPU 3; FPGA; DSP.]
C2050 GPU throughput = 525.93 us
FPGA throughput = 250 us
96 DSPs throughput = 495 us
23. Conclusions
• DSP: excellent performance but not cost-effective
• GPU: fast SIMD architectures - suitable for certain tasks
• FPGA: MIMD architectures, custom I/O, meets latency and throughput constraints
Slide idea: David Pellerin, Impulse Accelerated Technology
24. Future work
Resources                 Virtex-6 XC6VLX550T    Virtex-7 XC7V2000T
Slice logic resources     549,888                1,954,560
I/O pins                  840                    850
GTX transceivers          36                     36
• Investigate performance improvement after mapping the find_max, interpolation and reconstruction matrix calculation routines on the FPGA
• Promising because of increased logic density in Virtex-7 FPGAs
• Throughput sustained even if the processes are partitioned over multiple FPGAs
25. Discussion
Questions
26. Backup
Device utilization summary:
Selected Device : 6vlx550tff1759-2
Slice Logic Utilization:
Number of Slice Registers: 992448 out of 687360 144% (*)
Number of Slice LUTs: 1126081 out of 343680 327% (*)
Number used as Logic: 1125853 out of 343680 327% (*)
Number used as Memory: 228 out of 99200 0%
Number used as SRL: 228
Slice Logic Distribution:
Number of LUT Flip Flop pairs used: 1509605
Number with an unused Flip Flop: 517157 out of 1509605 34%
Number with an unused LUT: 383524 out of 1509605 25%
Number of fully used LUT-FF pairs: 608924 out of 1509605 40%
Number of unique control sets: 221
IO Utilization:
Number of IOs: 88
Number of bonded IOBs: 80 out of 840 9%
IOB Flip Flops/Latches: 25
Specific Feature Utilization:
Number of BUFG/BUFGCTRLs: 36 out of 32 112% (*)
WARNING:Xst:1336 - (*) More than 100% of Device resources are used