Presentation Hc-4020, Enhancing OpenCL performance in AfterShot Pro with HSA, by Michael Wootton at the AMD Developer Summit (APU13) November 11-13, 2013.
2. COREL AFTERSHOT™ PRO
What is Corel AfterShot™ Pro?
Corel AfterShot™ Pro is photo workflow software
Non-destructive photo editing of JPEG, TIFF, and Raw formats from hundreds of cameras
Photo Management
Batch Processing of modified files
2 | Enhancing OpenCL Performance in Corel AfterShot™ Pro with HSA | NOVEMBER 19, 2013
5. AFTERSHOT TASK MANAGEMENT
Work is broken down into Tasks. Tasks typically:
‒ Contain execution logic (code)
‒ May store resultant data
‒ Track whether they are complete
The Task Scheduler:
‒ Allocates a worker thread per CPU core
‒ Runs Tasks based on priority
‒ Allows Tasks to block on each other
[Figure: A Simple Task Dependency Graph. Nodes: Disk, File Reader, JPEG Decoder, Thumbnail, Photo. Legend: Task Dependency, Data.]
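The Task and Task Scheduler behavior described on this slide can be sketched roughly as follows. This is a minimal single-threaded illustration, not AfterShot's implementation: the `Task` class, its fields, and `run_all` are hypothetical names, the dependency direction between the example tasks is an assumption about the figure, and a real scheduler would dispatch ready Tasks to one worker thread per CPU core instead of looping.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass(eq=False)
class Task:
    """Hypothetical Task: execution logic, a result slot, and a completion flag."""
    name: str
    run: Callable[[List[Any]], Any]            # execution logic
    priority: int = 0                          # lower number runs first
    deps: List["Task"] = field(default_factory=list)
    result: Optional[Any] = None               # resultant data, if any
    done: bool = False                         # completion tracking


def run_all(tasks: List[Task]) -> List[str]:
    """Run tasks by priority, deferring any task whose dependencies are not done."""
    order = []
    pending = list(tasks)
    while pending:
        ready = [t for t in pending if all(d.done for d in t.deps)]
        if not ready:
            raise RuntimeError("dependency cycle or missing task")
        task = min(ready, key=lambda t: t.priority)
        task.result = task.run([d.result for d in task.deps])
        task.done = True
        order.append(task.name)
        pending.remove(task)
    return order


# Assumed chain: a file reader feeds a JPEG decoder, which feeds a thumbnail.
reader = Task("File Reader", lambda _: b"jpeg-bytes", priority=2)
decoder = Task("JPEG Decoder", lambda d: len(d[0]), priority=1, deps=[reader])
thumb = Task("Thumbnail", lambda d: d[0] // 2, priority=0, deps=[decoder])
order = run_all([thumb, decoder, reader])
```

Even though the thumbnail has the highest priority, it cannot run until the tasks it depends on have completed, so the dependency graph decides the final order.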
6. PROCESSING WITH TILES
The standard, simpler approach is to process large monolithic images
In AfterShot, images are broken down into tiles for processing
Tiling provides faster screen updates: only the visible parts of the image need to be computed
Tiling allows more effective memory management
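To illustrate why tiling speeds up screen updates, the sketch below computes which tiles intersect a viewport. The 512-pixel tile edge, the function names, and the example image size are assumptions made for illustration; the slides do not describe how AfterShot selects visible tiles.

```python
from typing import List, Tuple

TILE = 512  # assumed tile edge in pixels


def visible_tiles(viewport: Tuple[int, int, int, int],
                  image_size: Tuple[int, int],
                  tile: int = TILE) -> List[Tuple[int, int]]:
    """Return (col, row) indices of tiles that intersect the viewport.

    viewport is (x, y, width, height) in image pixels.
    """
    x, y, w, h = viewport
    img_w, img_h = image_size
    # Clamp the viewport to the image bounds.
    x0, y0 = max(x, 0), max(y, 0)
    x1, y1 = min(x + w, img_w), min(y + h, img_h)
    if x0 >= x1 or y0 >= y1:
        return []
    first_col, last_col = x0 // tile, (x1 - 1) // tile
    first_row, last_row = y0 // tile, (y1 - 1) // tile
    return [(c, r)
            for r in range(first_row, last_row + 1)
            for c in range(first_col, last_col + 1)]


def total_tiles(image_size: Tuple[int, int], tile: int = TILE) -> int:
    """Number of tiles needed to cover the whole image."""
    img_w, img_h = image_size
    return ((img_w + tile - 1) // tile) * ((img_h + tile - 1) // tile)


tiles = visible_tiles((600, 100, 800, 600), (6000, 4000))
all_tiles = total_tiles((6000, 4000))
```

In this example only 4 of the 96 tiles covering the image intersect the viewport, so only those need to be processed before the screen can be updated.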
7. PROCESSING WITH TILES CONTINUED
The Image Processing Pipeline is made up of several discrete steps [or filters]
To process a single tile:
‒ Load the input data (e.g. raw or jpeg data)
‒ Apply each Filter step in turn
Generally, we only need the output of the last step, the top Tile in the Stack
[Figure: a Tile Stack, from Raw Data to Final Image]
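The per-tile pipeline above can be sketched as a simple loop over filter functions. The `Tile` and `Filter` type aliases, `process_tile`, and the two stand-in filters are hypothetical and exist only to show the shape of the computation.

```python
from typing import Callable, List

Tile = List[float]                  # a flat list of pixel values, for illustration
Filter = Callable[[Tile], Tile]


def process_tile(raw: Tile, filters: List[Filter]) -> Tile:
    """Apply each filter step in turn and return the top tile of the stack."""
    current = raw
    for apply_filter in filters:
        # Each step consumes the previous step's output; intermediate tiles
        # are dropped here because only the last output is needed.
        current = apply_filter(current)
    return current


def exposure(tile: Tile) -> Tile:
    """Stand-in filter: double every pixel value."""
    return [p * 2.0 for p in tile]


def clamp(tile: Tile) -> Tile:
    """Stand-in filter: limit pixel values to 1.0."""
    return [min(p, 1.0) for p in tile]


final = process_tile([0.1, 0.4, 0.8], [exposure, clamp])
```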
8. ADVANCED TILE PROCESSING
Some Image Filters require a radius of pixels as input
Partially processed neighbor Tiles must complete before the main Tile can continue
Intermediate Tiles must be stored in memory so they do not rerun
Example Filters:
‒ Sharpening
‒ Lens Correction
‒ Noise Reduction
‒ Cropping
[Figure: a Filter that requires multiple source tiles]
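As a rough illustration of how a pixel radius turns into a tile dependency, the sketch below lists the source tiles whose pixels fall within a given radius of one output tile. The function name, the 512-pixel tile edge, and the grid dimensions are assumptions for the example.

```python
from typing import List, Tuple


def source_tiles(col: int, row: int, radius: int,
                 grid_cols: int, grid_rows: int,
                 tile: int = 512) -> List[Tuple[int, int]]:
    """Tiles containing any pixel within `radius` pixels of tile (col, row)."""
    # Pixel bounds of the output tile, grown by the filter radius.
    x0, x1 = col * tile - radius, (col + 1) * tile - 1 + radius
    y0, y1 = row * tile - radius, (row + 1) * tile - 1 + radius
    # Convert to tile indices and clamp to the grid.
    c0, c1 = max(x0 // tile, 0), min(x1 // tile, grid_cols - 1)
    r0, r1 = max(y0 // tile, 0), min(y1 // tile, grid_rows - 1)
    return [(c, r) for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]


no_radius = source_tiles(1, 1, radius=0, grid_cols=4, grid_rows=4)
small_radius = source_tiles(1, 1, radius=16, grid_cols=4, grid_rows=4)
large_radius = source_tiles(2, 2, radius=600, grid_cols=12, grid_rows=8)
```

A filter with no radius depends only on its own tile, a small radius pulls in the 8 surrounding neighbors, and a radius larger than one tile pulls in 25 tiles in this example, each of which must be partially processed first.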
10. ACCELERATING AFTERSHOT WITH OPENCL™
Goals for the AfterShot Pro OpenCL port
Offload image processing from Tiles
Work within the existing System
‒ Contain changes to a few critical modules
‒ Maintain full CPU utilization
‒ Integrate OpenCL Events into the Task System
11. GETTING WORK TO OPENCL
Identify the longest running image Filter functions and replace them with OpenCL kernels
Do not block CPU threads; use OpenCL event callbacks
Processing becomes asynchronous
Limit total work in flight to conserve memory
Marshal data automatically
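The combination of completion callbacks and a cap on work in flight can be sketched without OpenCL. In the analogy below, a `ThreadPoolExecutor` stands in for the OpenCL command queue and `Future.add_done_callback` stands in for an OpenCL event callback (registered with `clSetEventCallback` in the real API). The `Offloader` class and its members are hypothetical, and this is not how AfterShot is implemented.

```python
import threading
from collections import deque
from concurrent.futures import ThreadPoolExecutor


class Offloader:
    """Queues work and keeps at most `max_in_flight` items running.

    submit() never waits for a result; completion is reported by callback.
    """

    def __init__(self, max_in_flight):
        self._max = max_in_flight
        self._pool = ThreadPoolExecutor(max_workers=max_in_flight)
        self._cv = threading.Condition()
        self._waiting = deque()
        self._in_flight = 0
        self.peak = 0            # highest number of items in flight at once
        self.results = {}

    def submit(self, tile_id, kernel, data):
        with self._cv:
            self._waiting.append((tile_id, kernel, data))
        self._pump()

    def _pump(self):
        """Launch queued work while there is room under the in-flight cap."""
        while True:
            with self._cv:
                if not self._waiting or self._in_flight >= self._max:
                    return
                tile_id, kernel, data = self._waiting.popleft()
                self._in_flight += 1
                self.peak = max(self.peak, self._in_flight)
            future = self._pool.submit(kernel, data)
            future.add_done_callback(lambda f, t=tile_id: self._on_done(t, f))

    def _on_done(self, tile_id, future):
        """Completion callback: record the result and launch more work."""
        with self._cv:
            if future.exception() is None:
                self.results[tile_id] = future.result()
            self._in_flight -= 1
            self._cv.notify_all()
        self._pump()

    def wait(self):
        """Block until all queued work has finished (used only at shutdown)."""
        with self._cv:
            self._cv.wait_for(lambda: not self._waiting and self._in_flight == 0)
        self._pool.shutdown()


off = Offloader(max_in_flight=2)
for i in range(8):
    off.submit(i, lambda x: x * x, i)
off.wait()
```

The cap bounds how many input and output buffers exist at once, which is the memory-conserving effect the slide describes.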
12. CAVEATS OF ASYNCHRONOUS OPENCL PROCESSING
High Buffer Usage
‒ Each kernel that runs needs input, output, and possibly scratch buffers.
‒ Buffers must “stick around” until the kernels complete
‒ Multiple chains of kernels are needed to keep the GPU busy
[Figure: Kernels 1 to 5 chained together through Buffers]
Processing one 512 x 512 image requires multiple 3 MB buffers resident in device memory (VRAM)
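The 3 MB figure can be reproduced with a short calculation. The pixel layout is an assumption: the slides do not state it, but three channels of 32-bit floats gives exactly 3 MiB for a 512 x 512 buffer. The buffer and chain counts below are illustrative only.

```python
def buffer_bytes(width, height, channels=3, bytes_per_channel=4):
    """Size of one image buffer; the 3 x float32 layout is an assumption."""
    return width * height * channels * bytes_per_channel


def resident_bytes(width, height, buffers_per_chain, chains):
    """Device memory held while several kernel chains are in flight."""
    return buffer_bytes(width, height) * buffers_per_chain * chains


MIB = 1024 * 1024
one_buffer = buffer_bytes(512, 512)
# Illustrative numbers: 7 buffers per chain and 4 chains in flight.
total = resident_bytes(512, 512, buffers_per_chain=7, chains=4)
```

Under these assumptions one buffer is 3 MiB and four chains of seven buffers hold 84 MiB of device memory, which shows how quickly asynchronous chains add up.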
13. CAVEATS OF ASYNCHRONOUS OPENCL PROCESSING – CONTINUED
Dependencies Must Be Resolved in Advance
‒ For best performance all kernels in a chain should be enqueued together
‒ The state of all dependencies must be known before the first kernel is queued
‒ Difficult to track
‒ Compromise: only use OpenCL for Filters with simple linear dependencies
Kernel chaining and asynchronous execution provide excellent GPU utilization.
15. LARGE RADIUS IMAGE FILTERS
Several image processing operations require neighbor pixels. In AfterShot, image Filters fall into one of two categories:
‒ Normal: only requires the local Tile
‒ Large Radius: requires multiple Tiles
16. LARGE RADIUS IMAGE FILTERS ARE DIFFICULT
Large Radius AfterShot Filters are particularly difficult to implement in OpenCL
Large Radius filters will “break” kernel chaining
An extra layer of Intermediate Tiles must be resident, which will:
‒ Exhaust Device Memory, or
‒ Cause excessive bus transfers, hurting performance
And the solution is…
17. LARGE RADIUS FILTERS - NO
Don’t do it.
Large Radius filters are possible, but at great development cost
Performance would ultimately depend on tricky optimizations
Large radius filters were left to run on the CPU
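The resulting split can be sketched as a planning step that groups consecutive Normal filters into GPU kernel chains and lets each Large Radius filter break the chain and run on the CPU. The function name `plan_pipeline`, the device labels, and all filter names except sharpening are hypothetical.

```python
from itertools import groupby
from typing import List, Tuple


def plan_pipeline(filters: List[Tuple[str, bool]]) -> List[Tuple[str, List[str]]]:
    """Group filters into execution steps.

    filters is a list of (name, is_large_radius) pairs in pipeline order.
    Consecutive normal filters form one GPU chain; each large radius filter
    becomes its own CPU step, which breaks the chain around it.
    """
    steps = []
    for is_large, group in groupby(filters, key=lambda f: f[1]):
        names = [name for name, _ in group]
        if is_large:
            steps.extend(("cpu", [name]) for name in names)
        else:
            steps.append(("gpu", names))
    return steps


pipeline = [
    ("exposure", False),
    ("white balance", False),
    ("sharpening", True),      # large radius: needs neighbor tiles
    ("tone curve", False),
]
steps = plan_pipeline(pipeline)
```

Here one large radius filter in the middle of the pipeline turns a single four-kernel chain into two shorter GPU chains with a CPU step between them.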
18. AFTERSHOT OPENCL RESULTS
Approximately 70% of image processing work was moved off of the CPU cores*
Batch processing speed improved by 3.5x*
Maintains 100% utilization on 8 CPU cores*
Only a mid-level GPU is required
Supported on Windows, Linux, and OS X
AfterShot Pro with OpenCL was a success
*measured on developer’s system
20. OPENCL 2.0 SHARED VIRTUAL MEMORY
OpenCL 2.0 introduces Shared Virtual Memory (SVM)
Basic [Coarse Grain] SVM
‒ Host and kernels can share pointers
Advanced [Fine Grain] SVM is available on some hardware
‒ Host and kernels can operate concurrently on the same memory
Fine Grain System SVM
‒ Kernels can access the entire host process’ address space. Kernels can read or write malloc buffers
‒ System SVM can greatly simplify buffer management in an OpenCL application
22. RECONSIDERING LARGE RADIUS FILTERS
Large Radius OpenCL filters were dropped as an AfterShot feature. The reasons were both technical and resource related.
Can System SVM make Large Radius AfterShot filters feasible? Signs point to yes
‒ No Device Memory required for Intermediate buffers
‒ Input streams from SVM, no buffer transfers
‒ Behavior more in line with Software [non-OpenCL] filters
‒ Dependencies could be resolved just as they would for a Software filter
23. LOCAL CONTRAST – A LARGE RADIUS AFTERSHOT FILTER
The next version of AfterShot Pro will contain a new Local Contrast filter.
‒ GPU accelerated on systems with OpenCL and SVM.
‒ Increases image contrast in detailed areas while leaving large constant areas unchanged
‒ The effect is achieved through a large radius Unsharp Mask (10-20% of the overall image width)
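The behavior of a large radius Unsharp Mask can be shown on a single row of pixels. This sketch swaps in a simple box blur, since the slides do not say which blur AfterShot uses; the function names, the `amount` parameter, and the sample data are assumptions. The radius of 3 on a 20-pixel row corresponds to 15% of the width.

```python
def box_blur(row, radius):
    """Mean of the samples within `radius` of each position (edges clamped)."""
    n = len(row)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(sum(row[lo:hi]) / (hi - lo))
    return out


def unsharp_mask(row, radius, amount):
    """Add back a scaled difference between each pixel and its blurred value."""
    blurred = box_blur(row, radius)
    return [p + amount * (p - b) for p, b in zip(row, blurred)]


flat = [0.5] * 20                   # a large constant area
edge = [0.2] * 10 + [0.8] * 10      # a step between dark and light

flat_out = unsharp_mask(flat, radius=3, amount=0.5)
edge_out = unsharp_mask(edge, radius=3, amount=0.5)
```

The constant row comes back unchanged, while the dark side of the step gets darker and the light side gets lighter, which is the local contrast increase the slide describes. Because each output pixel reads neighbors up to `radius` away, a radius of 10-20% of the image width reaches across many tiles.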
24. SETTING UP A KERNEL TO USE SVM MEMORY
25. LOADING SVM MEMORY FROM INSIDE THE KERNEL
26. LOCAL CONTRAST RESULTS
System SVM simplified Local Contrast
‒ No complicated buffer management
‒ No clever optimizations were required to hide Device memory transfers
‒ Additional memory pressure is similar to a software filter
Performance is good. The OpenCL code runs in ¼ the time of the optimized software filter*
*measured on developer’s system
27. THANK YOU
Questions