3. Agenda
Windows 8 Apps
Free performance boost
Squeeze the CPU (PPL)
Smoke the GPU (C++ AMP)
4. Windows 8 Apps
New user experience
Touch-friendly
Trust
Battery-powered
Fast and fluid
5. Windows 8 C++ App Options
XAML-based applications
XAML user interface
C++ code
DirectX-based applications and games
DirectX user interface (D2D or D3D)
C++ code
Hybrid XAML and DirectX applications
XAML controls mixed with DirectX surfaces
C++ code
HTML5 + JavaScript applications
HTML5 user interface
JS code calling into C++ code
8. Recap of “free” performance
Compilation unit optimizations
• /O2 and friends
Whole program optimizations
• /GL and /LTCG
Profile guided optimization
• /LTCG:PGI and /LTCG:PGO
9. More “free” boosts
Automatic vectorization
• Always on in VS2012
• Uses “vector” (SIMD) instructions where possible in loops

for (i = 0; i < 1000; i++) {
    A[i] = B[i] + C[i];
}

Scalar (1 operation per instruction):  add r3, r1, r2
Vector (N operations per instruction): vadd v3, v1, v2
• With a vector length of 4, this loop runs in only 250 iterations, down from 1,000!
10. More “free” boosts
Automatic parallelization
• Uses multiple CPU cores
• /Qpar compiler switch
#pragma loop(hint_parallel(4))
for (i = 0; i < 1000; i++) {
    A[i] = B[i] + C[i];
}
• Can run this loop “vectorized” and on 4 CPU cores in parallel
12. Parallel Patterns Library (PPL)
Part of the C++ Runtime
No new libraries to link in
Task parallelism
Parallel algorithms
Concurrency-safe containers
Asynchronous agents
Abstracts away the notion of threads
Tasks are computations that may be run in parallel
Used to express your potential concurrency
Let the runtime map it to the available concurrency
Scale from 1 to 256 cores
14. parallel_for
parallel_for(0, 1000, [] (int i) {
    work(i);
});

• Order of iteration is indeterminate.
• Cores may come and go.
• Ranges may be stolen by newly idle cores.

Example split across four cores:
Core 1: work(0…249)    Core 2: work(250…499)
Core 3: work(500…749)  Core 4: work(750…999)
15. parallel_for
parallel_for considerations:
• Designed for unbalanced loop bodies
• An idle core can steal a portion of another core’s range of work
• Supports cancellation
• Early exit in search scenarios
For fixed-sized loop bodies that don’t need cancellation, use
parallel_for_fixed.
16. parallel_for_each
parallel_for_each iterates over an STL container in parallel
#include <ppl.h>
#include <vector>
using namespace concurrency;

std::vector<int> v = …;
parallel_for_each(v.begin(), v.end(), [] (int i) {
    work(i);
});
17. parallel_for_each
Works best with containers that support random-access iterators:
std::vector, std::array, std::deque, concurrency::concurrent_vector, …
Works okay, but with higher overhead on containers that support forward
(or bi-di) iterators:
std::list, std::map, …
18. parallel_invoke
• Executes function objects in parallel and waits for them to finish
#include <ppl.h>
#include <string>
#include <iostream>
using namespace concurrency; using namespace std;
template <typename T>
T twice(const T& t) {
return t + t;
}
int main() {
int n = 54; double d = 5.6; string s = "Hello";
parallel_invoke(
[&n] { n = twice(n); },
[&d] { d = twice(d); },
[&s] { s = twice(s); }
);
cout << n << ' ' << d << ' ' << s << endl;
return 0;
}
19. task<>
• Used to write asynchronous code
• task::then lets you create continuations that are executed when the task finishes
• You need to manage the lifetime of the variables going into a task
#include <ppltasks.h>
#include <iostream>
using namespace concurrency; using namespace std;
int main()
{
auto t = create_task([]() -> int
{
return 42;
});
t.then([](int result)
{
cout << result << endl;
}).wait();
}
20. Concurrent Containers
• Thread-safe, lock-free containers provided:
concurrent_vector<>
concurrent_queue<>
concurrent_unordered_map<>
concurrent_unordered_multimap<>
concurrent_unordered_set<>
concurrent_unordered_multiset<>
• Functionality resembles equivalent containers provided by the STL
• Behavior is more limited to allow concurrency. For example:
• concurrent_vector can push_back but not insert
• concurrent_vector can clear but not pop_back or erase
21. concurrent_vector<T>
#include <ppl.h>
#include <concurrent_vector.h>
using namespace concurrency;
concurrent_vector<int> carmVec;
parallel_for(2, 5000000, [&carmVec](int i) {
if (is_carmichael(i))
carmVec.push_back(i);
});
24. What is C++ AMP?
Performance & Productivity
C++ AMP -> C++ Accelerated Massive Parallelism
C++ AMP is
• A programming model for expressing data parallel algorithms
• Exploits heterogeneous systems using mainstream tools
• C++ language extensions and library
C++ AMP delivers performance without compromising productivity
25. What is C++ AMP?
C++ AMP gives you…
Productivity
• Simple programming model
Portability
• Run on hardware from NVIDIA, AMD, Intel and ARM*
• Open Specification
Performance
• The power of heterogeneous computing at your fingertips
Use it to speed up data parallel algorithms
27.
#include <iostream>
#include <amp.h>              // amp.h: header for C++ AMP library
using namespace concurrency;  // concurrency: namespace for library

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

    for (int idx = 0; idx < 11; idx++)
    {
        v[idx] += 1;
    }
    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(v[i]);
}
28.
#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
    array_view<int> av(11, v);  // array_view: wraps the data to operate on the
                                // accelerator; captured array_view variables and
                                // their associated data are copied to the
                                // accelerator (on demand)
    for (int idx = 0; idx < 11; idx++)
    {
        v[idx] += 1;
    }
    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(v[i]);
}
29.
#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
    array_view<int> av(11, v);  // array_view: wraps the data to operate on the
                                // accelerator; captured array_view variables and
                                // their associated data are copied to the
                                // accelerator (on demand)
    for (int idx = 0; idx < 11; idx++)
    {
        av[idx] += 1;
    }
    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(av[i]);
}
30.
#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
    array_view<int> av(11, v);
    // parallel_for_each: execute the lambda on the accelerator, once per thread
    // extent: the parallel loop bounds, or computation “shape”
    parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
    {
        av[idx] += 1;  // index: the thread ID that is running the lambda,
    });                // used to index into the data
    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(av[i]);
}
31.
#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
    array_view<int> av(11, v);
    parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
    {   // restrict(amp): tells the compiler to check that the code conforms
        // to the C++ subset, and to target the GPU
        av[idx] += 1;
    });
    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(av[i]);
}
32.
#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
    array_view<int> av(11, v);
    parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
    {
        av[idx] += 1;  // array_view: automatically copied to the
    });                // accelerator if required
    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(av[i]);  // array_view: automatically copied
}                                               // back to the host when and if required
33. C++ AMP
Parallel Debugger
Well known Visual Studio debugging features
Launch (incl. remote), Attach, Break, Stepping, Breakpoints, DataTips
Tool windows
Processes, Debug Output, Modules, Disassembly, Call Stack, Memory, Registers, Locals, Watch,
Quick Watch
New features (for both CPU and GPU)
Parallel Stacks window, Parallel Watch window
New GPU-specific features
Emulator, GPU Threads window, race detection
concurrency::direct3d_printf, _errorf, _abort
36. Summary
C++ is a great way to create fast and fluid apps for Windows 8
Get the most out of the compiler’s free optimizations
Use PPL for concurrent programming
Use C++ AMP for data parallel algorithms