Real-time processing for ATST

RTC Workshop, Durham, UK, April 2011

Real-time processing for the Advanced
Technology Solar Telescope
Vivek Venugopal (vivekv@nso.edu)
National Solar Observatory
Sunspot, New Mexico, USA

Wednesday, April 13, 2011

Advanced Technology
Solar Telescope



Adaptive Optics system

[Figure: adaptive optics control loop. Labels: uncorrected light, tip/tilt mirror, deformable mirror (DM), beamsplitter, corrected light, Shack-Hartmann lenslet array, CCD camera, processors, tilt drive signal, DM drive signal.]


HOAO Real-time system

[Figure: block diagram of the HOAO real-time system, partitioned between an FPGA and a GPU.
Main path: WFS camera, dark and flat field correction, cross-correlation slope computation, offscale slope detection, matrix multiply, actuator servos, deformable mirror.
Tip/tilt path: average slope, tip/tilt servos, tip/tilt mirror.
Inputs: dark field, flat field, reference image, offscale slope tolerance, slope offsets, reconstruction matrix, actuator offsets, actuator gains, servo parameters.
Auxiliary blocks: data collection process, Zernike offload.]


Camera format

• 12 channels processed per FPGA
• 5 packets to receive a complete row

[Figure: the camera is drawn as two halves of 480 columns x 960 rows, each read out through 12 channels per FPGA.]

Column index (0 to 479) carried in each 16-pixel packet, in the order shown on the slide:

Channels 1, 2
Packet 0: 77 76 73 72 53 52 49 48 29 28 25 24 5 4 1 0
Packet 1: 173 172 169 168 149 148 145 144 125 124 121 120 101 100 97 96
Packet 2: 269 268 265 264 245 244 241 240 221 220 217 216 197 196 193 192
Packet 3: 365 364 361 360 341 340 337 336 317 316 313 312 293 292 289 288
Packet 4: 461 460 457 456 437 436 433 432 413 412 409 408 389 388 385 384

Channels 3, 4
Packet 0: 85 84 81 80 61 60 57 56 37 36 33 32 13 12 9 8
Packet 1: 181 180 177 176 157 156 153 152 133 132 129 128 109 108 105 104
Packet 2: 277 276 273 272 253 252 249 248 229 228 225 224 205 204 201 200
Packet 3: 373 372 369 368 349 348 345 344 325 324 321 320 301 300 297 296
Packet 4: 469 468 465 464 445 444 441 440 421 420 417 416 397 396 393 392

Channels 5, 6
Packet 0: 93 92 89 88 69 68 65 64 45 44 41 40 21 20 17 16
Packet 1: 189 188 185 184 165 164 161 160 141 140 137 136 117 116 113 112
Packet 2: 285 284 281 280 261 260 257 256 237 236 233 232 213 212 209 208
Packet 3: 381 380 377 376 357 356 353 352 333 332 329 328 309 308 305 304
Packet 4: 477 476 473 472 453 452 449 448 429 428 425 424 405 404 401 400

Channels 7, 8
Packet 0: 79 78 75 74 55 54 51 50 31 30 27 26 7 6 3 2
Packet 1: 175 174 171 170 151 150 147 146 127 126 123 122 103 102 99 98
Packet 2: 271 270 267 266 247 246 243 242 223 222 219 218 199 198 195 194
Packet 3: 367 366 363 362 343 342 339 338 319 318 315 314 295 294 291 290
Packet 4: 463 462 459 458 439 438 435 434 415 414 411 410 391 390 387 386

Channels 9, 10
Packet 0: 87 86 83 82 63 62 59 58 39 38 35 34 15 14 11 10
Packet 1: 183 182 179 178 159 158 155 154 135 134 131 130 111 110 107 106
Packet 2: 279 278 275 274 255 254 251 250 231 230 227 226 207 206 203 202
Packet 3: 375 374 371 370 351 350 347 346 327 326 323 322 303 302 299 298
Packet 4: 471 470 467 466 447 446 443 442 423 422 419 418 399 398 395 394

Channels 11, 12
Packet 0: 95 94 91 90 71 70 67 66 47 46 43 42 23 22 19 18
Packet 1: 191 190 187 186 167 166 163 162 143 142 139 138 119 118 115 114
Packet 2: 287 286 283 282 263 262 259 258 239 238 235 234 215 214 211 210
Packet 3: 383 382 379 378 359 358 355 354 335 334 331 330 311 310 307 306
Packet 4: 479 478 475 474 455 454 451 450 431 430 427 426 407 406 403 402


Pixel unpacking

• FPGA receives camera data using the fiber channel interface through 12 transceivers @ 9.42 ns
• Pixel unpacking implemented using FSM with 2 modes (10 states/mode)
• 16 pixels (10 bits/pixel) written to FIFO

[Figure: bit map of one 20-byte packet (bytes 0 to 19, bits 0 to 159) showing which packet bits carry the bits of pixels 0 to 15.]
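The unpacking step turns one 160-bit packet into 16 ten-bit pixels. The sketch below illustrates only that arithmetic. It assumes a simple contiguous little-endian layout (pixel i in bits 10*i to 10*i+9), which is not the interleaved bit map drawn on the slide, and the function names `unpack_pixels` and `pack_pixels` are hypothetical.

```python
from typing import List

PIXELS_PER_PACKET = 16  # from the slide
BITS_PER_PIXEL = 10     # from the slide
PACKET_BYTES = PIXELS_PER_PACKET * BITS_PER_PIXEL // 8  # 20 bytes


def unpack_pixels(packet: bytes) -> List[int]:
    """Split a 20-byte packet into 16 ten-bit pixel values.

    Assumption: pixel i occupies bits 10*i .. 10*i+9 of the packet,
    little-endian. The real camera uses an interleaved layout.
    """
    if len(packet) != PACKET_BYTES:
        raise ValueError("expected a 160-bit (20-byte) packet")
    word = int.from_bytes(packet, "little")
    mask = (1 << BITS_PER_PIXEL) - 1
    return [(word >> (BITS_PER_PIXEL * i)) & mask
            for i in range(PIXELS_PER_PACKET)]


def pack_pixels(pixels: List[int]) -> bytes:
    """Inverse of unpack_pixels, used here only to build example packets."""
    if len(pixels) != PIXELS_PER_PACKET:
        raise ValueError("expected 16 pixels")
    word = 0
    for i, value in enumerate(pixels):
        if not 0 <= value < (1 << BITS_PER_PIXEL):
            raise ValueError("pixel value does not fit in 10 bits")
        word |= value << (BITS_PER_PIXEL * i)
    return word.to_bytes(PACKET_BYTES, "little")
```

A round trip such as `unpack_pixels(pack_pixels(values))` returns `values` unchanged, which is a quick way to check any alternative bit layout against the same interface.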


Dark and flat correction

• Dark pixel and flat pixel stored in RAM
• Flat corrected product is concatenated and written to FIFO
• Flat accumulated value can be used to update the reference image

[Figure: 16 parallel correction lanes (pixel0 to pixel15). In each lane the 8-bit dark_pixel is subtracted from the 10-bit pixel, the 10-bit difference is multiplied by the 8-bit flat_pixel to give an 18-bit flat_product (flat_product0 to flat_product15), and an accumulator sums the products into flat_acc0 to flat_acc15.]
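As a rough software model of the correction lanes, the sketch below subtracts the dark value, multiplies by the flat value, accumulates per lane, and concatenates the 16 products into one 288-bit word (16 x 18 bits). The function names are hypothetical, and clamping negative differences to zero is an assumption, since the slide does not say how underflow is handled.

```python
from typing import List

PRODUCT_BITS = 18  # 10-bit difference times 8-bit flat value


def dark_flat_correct(pixels: List[int], dark: List[int],
                      flat: List[int], flat_acc: List[int]) -> List[int]:
    """Return one flat_product per lane and update flat_acc in place."""
    products = []
    for lane, (p, d, f) in enumerate(zip(pixels, dark, flat)):
        diff = max(p - d, 0)        # assumption: clamp underflow to zero
        product = diff * f          # at most 1023 * 255, fits in 18 bits
        flat_acc[lane] += product   # running sum per lane
        products.append(product)
    return products


def concatenate(products: List[int]) -> int:
    """Pack the lane products into one wide word, lane 0 in the low bits."""
    word = 0
    for lane, product in enumerate(products):
        word |= product << (PRODUCT_BITS * lane)
    return word
```

With 16 lanes the concatenated word is at most 288 bits wide, matching the bus width shown in the block diagrams.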


Pixel unpacking & Dark and flat correction

[Figure: block diagram of the 12 channels (1/2 camera) feeding the PCIe system bus. Each channel is a chain of receiver (16 bits), data unpack (160 bits), FIFO (160 bits) and dark-flat correction/accumulator (288 bits). Shared blocks: synchronizer/counters and the dark and flat reference image value RAM. Other labels on the diagram: 256, 128, 206.8 ns, 20 ns.]

• Receive side: clock period = 9.42 ns, clock rate = 106.15 MHz
• Processing side: clock period = 5 ns, clock rate = 200 MHz


Nvidia Tesla C2050 GPU

• Nvidia Tesla C2050: 14 streaming multi-processors with 32 cores each (SIMD), clocked at 1.15 GHz
• 3 GB on-board RAM
• Kernel-based execution
• 1.03 TFLOPS single precision
• 515.2 GFLOPS double precision

[Figure: GPU made of multiprocessors 1 to 14. Each multiprocessor contains an instruction cache, two warp schedulers with dispatch units, a register file, two groups of cores 1 to 16, load/store units 1 to 16, SFUs 1 to 4, an interconnection network, 64 KB shared memory/L1 cache and a uniform cache.]

Reference: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf


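Process mapping and partitioning

[Figure: processing chain split between FPGA and GPU.
FPGA: raw pixels (20x20), dark pixels (20x20) and flat pixels (20x20) feed the dark correction and flat correction.
GPU: reference pixels (20x20) feed the 2D cross-correlation, followed by find x and y maximum and interpolation.]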


Correlation routines

1. FFT correlation
[Figure: the flat corrected image and the original reference image each pass through an FFT, the results are combined by complex conjugate multiplication, and an IFFT gives the correlation.]

2. 7x7 correlation
[Figure: a reference image of 26x26 pixels is split into precomputed reference regions 1 to 49, each 20x20 pixels.]
• Precomputed reference pixels 20x20 (49 regions)
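The 7x7 correlation can be read as sliding a 20x20 image over a 26x26 reference, which gives 7 x 7 = 49 overlapping 20x20 regions. The sketch below is one plain-Python reading of that scheme, not the GPU kernel from the talk. The function name `correlate_7x7` and the unnormalized sum-of-products metric are assumptions.

```python
from typing import List

Matrix = List[List[int]]


def correlate_7x7(image: Matrix, reference: Matrix) -> Matrix:
    """Correlate an n x n image against every n x n region of the reference.

    For a 20x20 image and a 26x26 reference this yields a 7x7 output,
    one sum of products per reference region.
    """
    n = len(image)
    shifts = len(reference) - n + 1  # 26 - 20 + 1 = 7
    out = [[0] * shifts for _ in range(shifts)]
    for dy in range(shifts):
        for dx in range(shifts):
            acc = 0
            for r in range(n):
                for c in range(n):
                    # reference region (dy, dx) is the window starting there
                    acc += image[r][c] * reference[r + dy][c + dx]
            out[dy][dx] = acc
    return out
```

Precomputing the 49 regions, as the slide describes, trades memory for address arithmetic: each image pixel then has its 49 reference partners available in one fetch.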


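find_max and interpolation routines
• Find the maximum value and its index
• Find x and y shifts using the interpolation equations

$$
\begin{aligned}
\text{num}_x &= \text{max\_value} - \text{out}(\text{shifted\_y\_index},\ \text{shifted\_x\_index} - 1) \\
\text{den}_x &= 2 \cdot \text{max\_value} - \text{out}(\text{shifted\_y\_index},\ \text{shifted\_x\_index} - 1) - \text{out}(\text{shifted\_y\_index},\ \text{shifted\_x\_index} + 1) \\
x &= (\text{shifted\_x\_index} - 0.5) + \frac{\text{num}_x}{\text{den}_x} \\
\text{num}_y &= \text{max\_value} - \text{out}(\text{shifted\_y\_index} - 1,\ \text{shifted\_x\_index}) \\
\text{den}_y &= 2 \cdot \text{max\_value} - \text{out}(\text{shifted\_y\_index} - 1,\ \text{shifted\_x\_index}) - \text{out}(\text{shifted\_y\_index} + 1,\ \text{shifted\_x\_index}) \\
y &= (\text{shifted\_y\_index} - 0.5) + \frac{\text{num}_y}{\text{den}_y}
\end{aligned}
$$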
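The two steps above can be sketched as follows. The code applies the interpolation equations directly to a small correlation array `out`. The function names are hypothetical, and rejecting a peak on the array border or a zero denominator is an assumption, since the slide does not say how those cases are handled.

```python
from typing import List, Tuple


def find_max(out: List[List[float]]) -> Tuple[float, int, int]:
    """Return (max_value, y_index, x_index) of the largest element."""
    best = (out[0][0], 0, 0)
    for y, row in enumerate(out):
        for x, value in enumerate(row):
            if value > best[0]:
                best = (value, y, x)
    return best


def interpolate_shift(out: List[List[float]]) -> Tuple[float, float]:
    """Sub-pixel peak position (x, y) from the three-point equations."""
    max_value, yi, xi = find_max(out)
    if not (0 < yi < len(out) - 1 and 0 < xi < len(out[0]) - 1):
        raise ValueError("peak on the border: neighbours are missing")
    num_x = max_value - out[yi][xi - 1]
    den_x = 2 * max_value - out[yi][xi - 1] - out[yi][xi + 1]
    num_y = max_value - out[yi - 1][xi]
    den_y = 2 * max_value - out[yi - 1][xi] - out[yi + 1][xi]
    if den_x == 0 or den_y == 0:
        raise ValueError("flat peak: interpolation is undefined")
    x = (xi - 0.5) + num_x / den_x
    y = (yi - 0.5) + num_y / den_y
    return x, y
```

When the two neighbours of the peak are equal, the ratio is 0.5 and the result is the integer peak index. A larger right-hand (or lower) neighbour pulls the estimate toward it.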



GPU results

[Figure: two bar charts comparing Tesla C1060 and Tesla C2050. y axis: time in us. x axis: number of images.
Left panel, FFT correlation: y axis 0 to 2200 us, x-axis ticks 1 and 50, bar labels 1889, 1619, 1510, 1188.
Right panel, 7x7 correlation: y axis 0 to 400 us, x-axis ticks 1, 50 and 584, bar labels 313, 307, 301, 278, 279, 281.]

Note: Least time indicates better performance


Reconstruction routine

• 1750 sub-aperture x and y shifts
• 3500 x 1900 reconstruction matrix

[Figure: matrix-vector product. The x and y shifts for 1750 sub-aperture images (3500 values) are multiplied by the reconstruction matrix (labelled 1900x3500) to give accumulated values for 1900 actuators.]

[Figure: bar chart of reconstruction time by device. y axis: time in us, logarithmic from 1 to 100000. x axis: devices (Tesla C1060, Tesla C2050, DSP, CPU). Bar labels: 46769, 964, 956, 229.]
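The reconstruction step is a matrix-vector product: one row of the matrix per actuator, one column per slope value. The sketch below shows the operation in plain Python with a hypothetical function name `reconstruct`. At full size the matrix would have 1900 rows of 3500 values, but the same code works for any consistent shape.

```python
from typing import List, Sequence


def reconstruct(matrix: Sequence[Sequence[float]],
                shifts: Sequence[float]) -> List[float]:
    """Multiply the reconstruction matrix by the vector of x and y shifts.

    matrix has one row per actuator; each row has one entry per shift.
    Returns one accumulated value per actuator.
    """
    for row in matrix:
        if len(row) != len(shifts):
            raise ValueError("row length must match the number of shifts")
    return [sum(m * s for m, s in zip(row, shifts)) for row in matrix]
```

Each output element is an independent dot product, which is why the operation maps naturally onto parallel hardware.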


Xilinx design flow

[Figure: flow chart with a design track and a parallel design verification track.]
• Design track: design entry, design synthesis, design implementation (optimization, mapping, placement, routing), bitstream generation, download to Xilinx FPGA
• Design verification track: functional simulation, static timing analysis, back annotation, timing simulation, in-circuit verification


Cross-correlation

• Configure 400x392 (49x8 bits/pixel) RAM bank (RAM0-RAM19) with pre-computed reference pixels
• Multiply each pixel with corresponding reference pixel

[Figure: one multiplier array per flat corrected pixel (flat_product0 to flat_product15, 18 bits each). The 392-bit ref_pixel word supplies 49 reference pixels (ref_pixel0 to ref_pixel48, 8 bits each). Each 18-bit flat_product is multiplied by each reference pixel, giving xcorr_product0 to xcorr_product48 (26 bits each), which together form the 1274-bit xcorr_value per pixel.]
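The bus widths on this slide follow from the operand sizes: 49 x 8 = 392 bits of reference data, and 49 products of 18 + 8 = 26 bits give 1274 bits. The sketch below models one multiplier array in software. The function names and the choice to place product 0 in the low bits are assumptions.

```python
from typing import List

N_REF = 49          # reference pixels per image pixel (7 x 7 regions)
FLAT_BITS = 18      # width of flat_product
REF_BITS = 8        # width of each ref_pixel
PRODUCT_BITS = FLAT_BITS + REF_BITS  # 26 bits per xcorr_product


def xcorr_pixel(flat_product: int, ref_pixels: List[int]) -> int:
    """Multiply one flat corrected pixel by its 49 reference pixels.

    Returns the products packed into one wide word (at most 1274 bits),
    with xcorr_product0 in the low bits (an assumed ordering).
    """
    if len(ref_pixels) != N_REF:
        raise ValueError("expected 49 reference pixels")
    if not 0 <= flat_product < (1 << FLAT_BITS):
        raise ValueError("flat_product does not fit in 18 bits")
    word = 0
    for k, ref in enumerate(ref_pixels):
        if not 0 <= ref < (1 << REF_BITS):
            raise ValueError("ref_pixel does not fit in 8 bits")
        word |= (flat_product * ref) << (PRODUCT_BITS * k)
    return word


def product_at(word: int, k: int) -> int:
    """Extract xcorr_product k from the packed word."""
    return (word >> (PRODUCT_BITS * k)) & ((1 << PRODUCT_BITS) - 1)
```

Because an 18-bit value times an 8-bit value always fits in 26 bits, the packed products never overlap.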


Sub-aperture format

• Sub-aperture regions in 480 columns x 1 row per channel
• Accumulate pixels per sub-aperture in each channel

[Figure: map giving the sub-aperture number (0 to 23) of every pixel position (bits 15 to 0 of packets 0 to 4) for channels 1 to 12, next to three subap_accumulator blocks for channels #1,#2,#7,#8, channels #3,#4,#9,#10 and channels #5,#6,#11,#12. Each block takes xcorr_pixel0 to xcorr_pixel15 (1274 bits each) and produces subap0_acc to subap23_acc (1715 bits each).]


Top level design

[Figure: top-level block diagram with per-channel modules channel1_top to channel12_top.
Per-channel chain: FCFPGA, data unpack (160 bits), FIFO, dark_flat_acc_top (288 bits), Flatcorr_FIFO (288 bits), xcorr_pixel_channel with inputs refim_in (392 bits) x16 and outputs xcorr_pixel (1274 bits) x16.
Accumulation: ch1278_subap_accumulator and ch561112_subap_accumulator, each with subap_acc_out (1715 bits) x24, feeding 24subap_12ch_accumulator with subap_acc_out (1715 bits) x24.
Control: xcorr state machine (xcorr_sm), address decoder (refim_fetch_addr_decoder) and RAM bank (RAM0-RAM19).
Control signals: channel_cycle_count, subap_row_count, addr_decoder_ce, xcorr_pixel_ce, subap_acc_ce, subap_acc_12ch_ce, flat_fifo_rd.]


Synthesis estimates for
Virtex-6 FPGA
• Implement dark, flat correction only: resources used 288 out of 687,360 (<1%)
• Implement the correlation for single channel up to the sub-aperture accumulator within the channel (without the final 12 channel accumulation): resources used 2,578 out of 687,360 (<1%)

Device utilization summary:
Slice Logic Utilization:
Number of Slice Registers: 992448 out of 687360 144% (*)
Number of Slice LUTs: 1126081 out of 343680 327% (*)
Number used as Logic: 1125853 out of 343680 327% (*)
Number used as Memory: 228 out of 99200
Number used as SRL: 37


FPGA timing

[Figure: timing diagram for one channel. Signals, top to bottom: Rxdata from transceiver; unpacked data written to FIFO; unpacked data read from FIFO; dark-flat output; input to xcorr_pixel module; output from xcorr_pixel; output from sub-aperture accumulator per channel. Interval labels on the diagram: 123.73 ns, 40 ns, 95 ns, 15 ns, 40 ns, 20 ns, 16 ns, 91 ns.]

• Each data packet is available from the FIFO after 95 ns
• 95 ns * 5 packets * 10 rows = 4.75 us to read the data from the FIFO
• Total latency for computing the 960 rows x 480 columns = 4.75 us * (960/20) = 228 us
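The latency estimate is simple arithmetic on the figures quoted above, reproduced here as a short check. The constant and function names are hypothetical. The values (95 ns per packet, 5 packets, 10 rows, 960/20 blocks) come from the bullets.

```python
FIFO_PACKET_NS = 95      # time until a packet is available from the FIFO
PACKETS_PER_ROW = 5      # packets needed for a complete row
ROWS_PER_READ = 10       # rows counted in the slide's FIFO read estimate
CAMERA_ROWS = 960
ROWS_PER_BLOCK = 20      # divisor used in the slide's total (960/20)


def fifo_read_time_ns() -> int:
    """Time to read one group of rows from the FIFO, in nanoseconds."""
    return FIFO_PACKET_NS * PACKETS_PER_ROW * ROWS_PER_READ


def total_latency_ns() -> int:
    """Latency for the full 960 x 480 half-frame, in nanoseconds."""
    return fifo_read_time_ns() * (CAMERA_ROWS // ROWS_PER_BLOCK)


if __name__ == "__main__":
    print(fifo_read_time_ns() / 1000, "us per FIFO read")
    print(total_latency_ns() / 1000, "us total")
```

Keeping the intermediate values in integer nanoseconds avoids rounding noise when the numbers are compared against the slide.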


GPU vs FPGA vs DSP
[Figure: pipeline timeline with time markers at 100 us, 225 us and 300.93 us. Rows: camera readout; data transfer through PCIe x16; C2050 GPU 1, C2050 GPU 2, C2050 GPU 3; FPGA; DSP. The camera readout, PCIe transfer and GPU 1 rows repeat for the next frame.]

• C2050 GPU throughput = 525.93 us
• FPGA throughput = 250 us
• 96 DSPs throughput = 495 us


Conclusions


• DSP: excellent performance but not cost-effective
• GPU: fast SIMD architectures - suitable for certain tasks
• FPGA: MIMD architectures, custom I/O, meets latency and throughput constraints
Slide idea: David Pellerin, Impulse Accelerated Technology


Future work

Resources               Virtex-6 XC6VLX550T   Virtex-7 XC7V2000T
Slice logic resources   549,888               1,954,560
I/O pins                840                   850
GTX transceivers        36                    36

• Investigate performance improvement after mapping the find_max, interpolation and reconstruction matrix calculation routines on the FPGA
• Promising because of increased logic density in Virtex-7 FPGAs
• Throughput sustained even if the processes are partitioned over multiple FPGAs


Discussion

Questions



Backup
Device utilization summary:
Selected Device : 6vlx550tff1759-2

Slice Logic Utilization:
Number of Slice Registers: 992448 out of 687360 144% (*)
Number of Slice LUTs: 1126081 out of 343680 327% (*)
Number used as Logic: 1125853 out of 343680 327% (*)
Number used as Memory: 228 out of 99200 0%
Number used as SRL: 228

Slice Logic Distribution:
Number of LUT Flip Flop pairs used: 1509605
Number with an unused Flip Flop: 517157 out of 1509605 34%
Number with an unused LUT: 383524 out of 1509605 25%
Number of fully used LUT-FF pairs: 608924 out of 1509605 40%
Number of unique control sets: 221

IO Utilization:
Number of IOs: 88
Number of bonded IOBs: 80 out of 840 9%
IOB Flip Flops/Latches: 25

Specific Feature Utilization:
Number of BUFG/BUFGCTRLs: 36 out of 32 112% (*)

WARNING:Xst:1336 - (*) More than 100% of Device resources are used


Pre-computed reference
[Figure: 26 x 26 grid of reference pixel indices, numbered 0 to 675 in row-major order (row 0 holds 0 to 25, row 25 holds 650 to 675).]

