1. High Performance Content-Based Matching Using GPUs
Alessandro Margara and Gianpaolo Cugola
margara@elet.polimi.it, cugola@elet.polimi.it
Dip. Elettronica e Informazione (DEI), Politecnico di Milano
2. The Problem: Content-Based Matching
High Performance Content-Based Matching Using GPUs - DEBS 2011
- Publishers publish events; the content-based matcher forwards each event to the subscribers whose predicates it satisfies
- A predicate is a set of filters, e.g. (Smoke=true and Room="Kitchen") or (Light>30 and Room="Bedroom"); each filter is a conjunction of attribute constraints
- An event is a set of attribute/value pairs, e.g. Light=50, Room="Bedroom", Sender="Sensor1"
3. Programming GPUs: CUDA
- Introduced by Nvidia in 2006
- General purpose parallel computing architecture
  - New instruction set
  - New programming model
- Programmable using high-level languages
  - CUDA C (a C dialect)
4. Programming Model: Basics
- The device (GPU) acts as a coprocessor for the host (CPU) and has its own separate memory space
- Input data must be copied from main memory to GPU memory before starting a computation…
- …and results must be copied back to main memory when the computation finishes
- These copies are often the most expensive operations
  - They send information through the PCI-Express bus: bandwidth but also latency matter
  - They also require serialization of data structures, which must therefore be kept simple
5. Typical Workflow
- Allocate memory on the device
- Serialize and copy data to the device
- Execute one or more kernels on the device
- Wait for the device to finish processing
- Copy results back
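The steps above can be sketched with the CUDA runtime API; the kernel here is a placeholder (an identity copy), and buffer sizes are illustrative:

```cuda
#include <cuda_runtime.h>

// Placeholder kernel: real code would do useful per-element work here.
__global__ void process(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

void run(const int *host_in, int *host_out, int n) {
    int *dev_in, *dev_out;
    // 1. Allocate memory on the device
    cudaMalloc(&dev_in,  n * sizeof(int));
    cudaMalloc(&dev_out, n * sizeof(int));
    // 2. Copy (already serialized) input data to the device
    cudaMemcpy(dev_in, host_in, n * sizeof(int), cudaMemcpyHostToDevice);
    // 3. Launch the kernel: enough 256-thread blocks to cover n elements
    process<<<(n + 255) / 256, 256>>>(dev_in, dev_out, n);
    // 4. Wait for the device to finish processing
    cudaDeviceSynchronize();
    // 5. Copy results back to main memory
    cudaMemcpy(host_out, dev_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev_in);
    cudaFree(dev_out);
}
```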
6. Programming Model: Fundamentals
- Single Program Multiple Threads implementation strategy
  - A single kernel (function) is executed by multiple threads in parallel
- Threads are organized in blocks
  - Threads in different blocks operate independently
  - Threads within the same block cooperate to solve a single sub-problem
- The runtime provides a blockId and a threadId variable to uniquely identify each running thread
  - Accessing these variables is the only way to differentiate the work done by different threads
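In CUDA C these identifiers are exposed as the built-in blockIdx and threadIdx variables; a minimal kernel showing how they give each thread its own element to work on:

```cuda
// Every thread runs the same code; only blockIdx/threadIdx tell it
// which element is "its own".
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique per thread
    if (i < n)            // guard: the last block may be partially full
        data[i] *= factor;
}
```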
7. Programming Model: Memory Management
- Hierarchical organization of memory
- All threads have access to the same common global memory
  - Large (512MB-6GB) but slow (DRAM)
  - Stores information received from the host
  - Persistent across different function calls
- Threads within a block coordinate themselves using a shared memory
  - Implemented on-chip: fast but limited (16-48KB)
- Each thread has its own local memory
- It is the only "cache" available
  - No hardware/system support: it must be explicitly controlled by the application code
8. More on Memory Management
- Without hardware-managed caches, accesses to global memory can easily become a bottleneck
- Issues to consider when designing algorithms and data structures:
  - Maximize usage of shared (block-local) memory, without exceeding its size
  - Threads with contiguous ids should access contiguous global memory regions, so the hardware can combine them into a few memory-wide accesses
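A hedged illustration of the second guideline (the kernels and the stride parameter are hypothetical, not from the paper):

```cuda
// Coalesced: consecutive thread ids touch consecutive addresses, so
// the hardware combines a warp's loads into a few wide transactions.
__global__ void coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Uncoalesced: consecutive thread ids touch addresses `stride`
// elements apart, forcing many separate memory transactions.
__global__ void strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}
```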
9. Hardware Implementation
- An array of Streaming Multiprocessors (SMs), each containing many (extremely simple) processing cores
- Each SM executes threads in groups of 32 called warps
  - Scheduling is performed in hardware with zero overhead
- Optimized for data-parallel problems
  - Maximum efficiency only if all threads in a warp agree on the execution path
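A small (illustrative) example of what "agreeing on the execution path" means for a warp:

```cuda
// Divergent: even and odd threads of the same warp take different
// branches, so the hardware serializes the two paths.
__global__ void divergent(int *a) {
    if (threadIdx.x % 2 == 0) a[threadIdx.x] += 1;
    else                      a[threadIdx.x] -= 1;
}

// Uniform: the branch depends only on the warp index (warps are 32
// threads wide), so all threads of a warp take the same path and no
// serialization occurs.
__global__ void uniform(int *a) {
    if ((threadIdx.x / 32) % 2 == 0) a[threadIdx.x] += 1;
    else                             a[threadIdx.x] -= 1;
}
```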
10. Some Numbers
NVIDIA GTX 460:
- 1GB RAM (global memory)
- 7 Streaming Multiprocessors
- Each SM contains 48 cores
- Each SM manages up to 48 warps (32 threads each)
- Up to 10752 threads managed concurrently!
- Up to 336 threads running concurrently!
- Today's cheap GPU: less than 160$
11. Existing Algorithms
- Two approaches: counting algorithms and tree-based algorithms
- Both rely on complex data structures to optimize sequential execution
  - Trees, maps, … lots of pointers!
  - They hardly fit the data-parallel programming model
12. Algorithm Description
Example (counting approach): filters F1: A>10 and B=20 (subscription S1), F2: B>15 and C<30 (S1), F3: D=20 (S2). For the event A=12, B=20, the per-filter counters of satisfied constraints are F1: 2, F2: 1, F3: 0; F1 reaches its total number of constraints, so the event matches S1.
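A minimal sequential sketch of the counting approach, with the slide's filters hard-coded (plain C; all names are illustrative, not the authors' code):

```c
#include <assert.h>
#include <string.h>

enum op { GT, EQ, LT };
struct constraint { char attr; enum op op; int value; int filter; };
struct attribute  { char name; int value; };

/* One entry per constraint; `filter` says which filter it belongs to. */
static const struct constraint constraints[] = {
    {'A', GT, 10, 0}, {'B', EQ, 20, 0},   /* F1: A>10 and B=20 */
    {'B', GT, 15, 1}, {'C', LT, 30, 1},   /* F2: B>15 and C<30 */
    {'D', EQ, 20, 2},                     /* F3: D=20          */
};
static const int filter_size[3] = {2, 2, 1}; /* constraints per filter */

/* Counting algorithm: bump a per-filter counter for every satisfied
 * constraint; a filter matches when its counter equals its size. */
static void match(const struct attribute *ev, int n_attr, int count[3]) {
    memset(count, 0, 3 * sizeof(int));
    for (int a = 0; a < n_attr; a++)
        for (size_t c = 0; c < sizeof constraints / sizeof *constraints; c++) {
            const struct constraint *k = &constraints[c];
            if (k->attr != ev[a].name) continue;
            int v = ev[a].value;
            int sat = (k->op == GT && v >  k->value) ||
                      (k->op == EQ && v == k->value) ||
                      (k->op == LT && v <  k->value);
            if (sat) count[k->filter]++;
        }
}
```

Running it on the slide's event (A=12, B=20) yields counters 2, 1, 0, and only F1 reaches its size, so only subscription S1 matches.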
13. Algorithm Description
- Constraints with the same attribute name are stored in an array on the GPU
  - Contiguous memory regions
- When processing an event E, the CPU selects all relevant constraint arrays, based on the names of the attributes in E
14. Algorithm Description
- Bi-dimensional organization of threads: one thread for each attribute/constraint pair
- Threads in the same block evaluate the same attribute, so it can be copied into shared memory
- Threads with contiguous ids access contiguous constraints
  - Accesses are combined into a few memory-wide operations
- Filter counts are updated with an atomic operation
- Example event attributes: A=7, B=32, C=21
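A sketch of such a kernel, assuming the selected constraints are laid out in one flat array with a per-attribute offset and length, block row y handles attribute y, and thread x of that row evaluates constraint x (all names and encodings are illustrative, not the authors' code):

```cuda
struct Constraint { int op; int value; int filter; };  // one constraint
struct Attribute  { int value; };                      // one event field

// count[f] is bumped atomically for each satisfied constraint of
// filter f; a filter matches when its counter reaches its number of
// constraints.
__global__ void evaluate(const Attribute *ev,
                         const Constraint *constraints,  // flat array
                         const int *first, const int *len, // per attribute
                         unsigned *count) {
    __shared__ Attribute a;               // attribute shared by the block
    if (threadIdx.x == 0) a = ev[blockIdx.y];
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;  // contiguous ids ->
    if (i >= len[blockIdx.y]) return;               // contiguous loads
    Constraint c = constraints[first[blockIdx.y] + i];

    int sat = (c.op == 0 && a.value >  c.value) ||  // 0 encodes '>'
              (c.op == 1 && a.value == c.value) ||  // 1 encodes '='
              (c.op == 2 && a.value <  c.value);    // 2 encodes '<'
    if (sat) atomicAdd(&count[c.filter], 1u);
}
```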
15. Improvement
- Problem: before processing each event we need to reset the filter counts and the interface selection vector
- Naïve version: use a memset
  - Communication with the GPU introduces additional delay
- Solution: keep two copies of the filter counts and interface vector
  - While processing an event, one copy is used and the other is reset for the next event, inside the same kernel
  - No communication overhead
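One way this double-buffering could look in the kernel (a sketch under the assumption that the host simply swaps the two counter pointers between events):

```cuda
// `use` accumulates matches for the current event; the same threads
// zero the `spare` copy for the next event, so no separate memset
// (and no extra host round trip) is needed. The host swaps the two
// pointers before each event.
__global__ void evaluate(unsigned *use, unsigned *spare, int n_filters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_filters)
        spare[i] = 0;              // reset the other copy in-kernel
    // ... evaluate constraints and atomicAdd into use[...] ...
}
```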
16. Results: Default Scenario
- Comparison against a state-of-the-art sequential implementation: SFF (Siena) 1.9.4, on an AMD CPU @ 2.8GHz
- Default scenario, relatively "simple": 10 interfaces, 25k filters, 1M constraints
- Analysis changing various parameters
- We measure latency: the processing time for a single event
- Speedup in the default scenario: ~7x
17. Results: Number of Constraints
Latency as the number of constraints varies; speedup up to ~10x
18. Results: Number of Filters
Latency as the number of filters varies; speedup up to ~13x
19. Results
- What is the time needed to install subscriptions?
  - Need to serialize data structures and copy them from CPU memory to GPU memory
  - But the data structures are simple!
- Memory requirements?
  - 35MB in the default scenario, up to 200MB across all our tests
  - Not a problem for a modern GPU
20. Results
- We measured the latency when processing a single event: 0.14ms processing time
- Does that mean 7000 events/s? What about the maximum throughput?
  - Measured maximum throughput: 9400 events/s
21. Conclusions
- GPUs bring benefits in a wide range of scenarios, in particular in the most challenging workloads
- Additional advantage: the CPU is left free to perform other tasks, e.g. communication-related tasks
- The implementation is available for download
  - Includes a translator from Siena subscriptions/messages
  - More info at http://home.dei.polimi.it/margara
22. Future Work
- We are currently porting the algorithm to multi-core CPUs using OpenMP
- We are currently testing the algorithm within a real system
  - Both GPUs and multi-core CPUs
  - Take communication overhead into account; measure latency and throughput
- We plan to explore the advantages of GPUs for probabilistic (as opposed to exact) matching
  - Encoded filters (Bloom filters)
  - Balance between performance and percentage of false positives