11. ClearSpeed profiler for heterogeneous and multi-processor systems
[Diagram labels: ClearSpeed Accelerated System; Host CPU(s); Host Core(s); PCIe; Advance™ Accelerator Board; CSX 600; CSX Pipeline]
HOST/BOARD INTERACTION: View host/board interactions. Provides performance information for data transfer operations. Trace cluster node/board interaction. See overlap of host compute and board compute.
CSX PIPELINE: View detailed instruction issue information. Visualize overlap of executing instructions. Optimize code at the instruction level. View instruction level performance bottlenecks. Get accurate instruction timing.
CSX SYSTEM: View system level trace. Visually inspect the overlap of compute and I/O. Visualize cache utilization. View branch trace of code executing. Find and analyse performance bottlenecks. Get accurate event timing.
HOST CODE PROFILING: Visually inspect host code executing. Supports multiple threads and processes. Time specific code sections. See overlap of host threads executing. Platform and processor agnostic trace collection.
So, with the scene set for our presentation, I’m going to talk a bit about the current state of the art in programming heterogeneous systems (with a summary of what will be used at SARA), and take a look at what the development flow for a heterogeneous system really looks like.
At SARA the system is based on ClearSpeed Technology hardware and has the full range of development tools and libraries available.
The level of support offered by the ClearSpeed SDK for debugging, and especially profiling, is still well ahead of the best of the rest (for the moment). The host profiling API allows you to instrument even non-ClearSpeed-specific code and have it displayed in the profiler.
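To give a feel for what instrumenting host code means, here is a minimal sketch of timing named code sections and collecting the events for later display. This is not the ClearSpeed host profiling API; the names `trace_section` and `summarize` and the event tuple layout are hypothetical, chosen only for illustration.

```python
import time
from contextlib import contextmanager


@contextmanager
def trace_section(events, name):
    """Record (name, start, end) for the enclosed block into `events`.

    Hypothetical helper: a stand-in for a vendor profiling call,
    not an actual ClearSpeed SDK function.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        # Always record the event, even if the block raises.
        events.append((name, start, time.perf_counter()))


def summarize(events):
    """Return total elapsed seconds per section name."""
    totals = {}
    for name, start, end in events:
        totals[name] = totals.get(name, 0.0) + (end - start)
    return totals


if __name__ == "__main__":
    events = []
    with trace_section(events, "setup"):
        data = [float(i) for i in range(100_000)]
    with trace_section(events, "compute"):
        total = sum(x * x for x in data)
    for name, seconds in summarize(events).items():
        print(f"{name}: {seconds:.6f} s")
```

Because each event keeps its start and end time rather than just a duration, a viewer could place events from several threads on a common timeline and show where they overlap.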
So let’s take a look at what makes heterogeneous systems interesting to the user and also some of the issues involved in programming them.
If it’s single use, it’s much easier to justify the investment in time and money to get the benefits of acceleration. If it’s multi-use, then the cost-benefit analysis is more complicated, but it can still be swayed by an obvious imbalance in resource consumption. Are the codes yours, open source, or closed source ISV applications? If you have source-level access, do you have the development expertise and resources?
So let’s put closed source applications to one side for a moment. If you have answered yes to “Do you have source access?” and “Do you have the development capabilities?”, then today you will have to decide on one of a number of proprietary development environments.
I include OpenCL here because of its similarity to existing languages and its imminent availability.
As with MKL, ACML, etc., IHVs will usually (but not always) get the best out of their own hardware. The library approach is by far the easiest for the user because it carries with it the potential to accelerate ISV applications, but there are a number of caveats: the applications must use standard libraries (such as BLAS, LAPACK, FFTW, etc.) and dynamic linking (many do not, because avoiding it reduces the support burden). ClearSpeed has long provided a selection of L3 BLAS routines and drop-in replacements for many of the most popular LAPACK routines. As you will see, the applicability and effectiveness of this approach is limited by the amount of data that gets moved around versus the compute required (in the case of DGEMM that’s n^3 compute to n^2 data).
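The n^3 compute to n^2 data ratio for DGEMM can be made concrete with a small sketch. It assumes square n x n matrices, 8-byte doubles, about 2n^3 floating-point operations, and 4n^2 elements moved (read A, B and C, write C). The device rate and link bandwidth in the usage lines are hypothetical numbers, not measurements of any ClearSpeed board, and the model assumes transfer and compute do not overlap.

```python
def dgemm_costs(n, bytes_per_element=8):
    """Approximate work and data volume for C = alpha*A*B + beta*C."""
    flops = 2 * n**3                            # n^3 multiply-add pairs
    data_bytes = 4 * n**2 * bytes_per_element   # read A, B, C; write C
    return flops, data_bytes


def flops_per_byte(n):
    """Compute available per byte moved; grows linearly with n."""
    flops, data_bytes = dgemm_costs(n)
    return flops / data_bytes


def transfer_fraction(n, device_flops_per_s, link_bytes_per_s):
    """Fraction of total time spent moving data, with no overlap assumed."""
    flops, data_bytes = dgemm_costs(n)
    compute_s = flops / device_flops_per_s
    transfer_s = data_bytes / link_bytes_per_s
    return transfer_s / (transfer_s + compute_s)


if __name__ == "__main__":
    # Hypothetical accelerator: 50 GFLOP/s sustained, 1 GB/s host link.
    for n in (256, 1024, 4096):
        frac = transfer_fraction(n, 50e9, 1e9)
        print(f"n={n}: {flops_per_byte(n):.0f} flops/byte, "
              f"{frac:.0%} of time in transfer")
```

Under these assumptions small matrices spend most of their time in data transfer, while large ones amortise it, which is why a drop-in library only pays off above some problem size.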
OK, so we’ve established that proprietary solutions are not ideal for a number of reasons, but even so they have stimulated the interest of the research community, and in some cases they still provide compelling financial advantages to the user. Why do I say ‘inevitably’? Because the pull from both developers and customers is there. Developers want to innovate, but not all are willing to be locked into single-vendor deals, for obvious reasons. OpenCL has gained enviable support in a very short period of time, and Petapath are members of the Khronos Group and are actively participating in the OpenCL working group.
So what, for those of you who are not familiar with it, is OpenCL? It addresses a wide range of systems in a familiar way. It is very similar to the existing language and library support from a number of IHVs.
A very interesting point to note here is that OpenCL can also target multi-core systems. It does this by supporting the SIMD extensions to current x86 cores and exposing this parallelism to the developer in a single open API. It doesn’t provide anything that OpenMP doesn’t, apart from a single API and programming interface, but that is the huge benefit for developers.
Note that there can be multiple OpenCL compute devices in a single system. Initially this is likely to be the host multi-core backend and a single vendor’s accelerator but the potential is there for supporting multiple accelerators and incrementally accelerating your systems.
So this all sounds great, but when will I be able to use OpenCL? And it’s a 1.0 spec, so shouldn’t I watch to see what happens for a little bit?
Note that I said earlier that there could be multiple OpenCL supported devices in a system. Well, interoperability between different vendors’ implementations will be the key to this.
So, having mapped out what people use today and what standards we may have in the near future, what does development on a heterogeneous system look like today?
Well, if you’re here then there is probably a financial or scientific imperative to make the application run faster. IHVs also provide optimised libraries (BLAS, LAPACK, etc.), so use them where you can. Many compilers enable support for SSE2+ and auto-parallelisation. Does it run fast enough yet? (Where can I go next if it doesn’t?)
The list of general and vendor specific tips is too long to go into here.
Wake up, Tim, I’m expecting a heckle here!
So will all vendors’ hardware behave the same? How will the performance vary on different platforms?
(ClearSpeed provides gdb support and has done for about four years.)
There are many tools that developers rely on for host development, and I think that means there will be space for a thriving ecosystem of third-party tools for OpenCL.