Kevin O’Leary, Intel Technical Consulting Engineer
Fast insights to optimized vectorization and memory using cache-aware roofline analysis
- The Roofline model
- Intel® Advisor Roofline analysis
- Intel® Advisor Demo
- Intel® Advisor Case Study
- Intel® Advisor “What’s New” and 2018 beta
- Summary
Acknowledgments/References
Roofline model proposed by Williams, Waterman, Patterson:
http://www.eecs.berkeley.edu/~waterman/papers/roofline.pdf
“Cache-aware Roofline model: Upgrading the loft” (Ilic, Pratas, Sousa, INESC-ID/IST, Technical University of Lisbon):
http://www.inesc-id.pt/ficheiros/publicacoes/9068.pdf
At Intel:
Roman Belenov, Zakhar Matveev, Julia Fedorova, SSG product teams, Hugh Caffey, in collaboration with Philippe Thierry
Roofline Model: A visually intuitive performance model
Combines
- memory utilization/demand
- CPU utilization
into the same performance analysis and modeling space
Arithmetic Intensity (AI) = # FLOPs / # BYTEs
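The arithmetic intensity defined above feeds directly into the classic roofline bound from the Williams, Waterman and Patterson paper: attainable performance is the smaller of the compute peak and AI times memory bandwidth. A minimal sketch follows; the function names and the idea of passing the platform peaks as plain arguments are illustrative assumptions, not part of Intel Advisor.

```cpp
#include <algorithm>

// Classic roofline bound: performance is capped either by the compute peak
// or by how fast memory can feed the arithmetic (AI * bandwidth).
// peak_gflops and bandwidth_gbs are platform numbers supplied by the caller.
double attainable_gflops(double ai_flop_per_byte,
                         double peak_gflops,
                         double bandwidth_gbs) {
    return std::min(peak_gflops, ai_flop_per_byte * bandwidth_gbs);
}

// True when the memory roof, not the compute roof, is the binding limit.
bool is_memory_bound(double ai_flop_per_byte,
                     double peak_gflops,
                     double bandwidth_gbs) {
    return ai_flop_per_byte * bandwidth_gbs < peak_gflops;
}
```

For example, with an assumed 16 GFLOP/s peak and 10 GB/s bandwidth, a loop at AI = 0.5 is limited to 5 GFLOP/s by memory, while a loop at AI = 4 is limited to 16 GFLOP/s by compute.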
Roofline model: Am I bound by VPU/CPU or by memory?
What makes loops A, B, C different?
Cache-Aware vs. Classic Roofline
AI = # FLOP / # BYTE
AI_DRAM = # FLOP / # BYTES (CPU & Cache <-> DRAM)
- “DRAM traffic” (or MCDRAM-traffic-based)
- Variable for the same code/platform (varies with dataset size/trip count)
- Can be measured relative to different memory hierarchy levels: cache level, HBM, DRAM
AI_CARM = # FLOP / # BYTES (CPU <-> Memory Sub-system)
- “Algorithmic”, “Cumulative (L1+L2+LLC+DRAM)” traffic-based
- Invariant for the given code / platform combination
- Typically AI_CARM < AI_DRAM
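The two definitions differ only in which bytes go into the denominator. A small sketch makes the inequality concrete; the struct and its field names are hypothetical and only stand in for whatever counts a profiler reports.

```cpp
// Hypothetical per-loop counts; the field names are illustrative only.
struct LoopTraffic {
    double flop;        // floating-point operations executed
    double bytes_cpu;   // bytes moved between the CPU and the whole memory subsystem
    double bytes_dram;  // bytes that actually crossed the cache <-> DRAM boundary
};

// Cache-aware (CARM) intensity: all traffic the core issues is counted.
double ai_carm(const LoopTraffic& t) { return t.flop / t.bytes_cpu; }

// Classic (DRAM) intensity: only traffic that reaches DRAM is counted.
double ai_dram(const LoopTraffic& t) { return t.flop / t.bytes_dram; }
```

When the caches capture some reuse, `bytes_dram` is smaller than `bytes_cpu`, so the same loop lands further right on a DRAM-based chart than on a cache-aware one. With 1000 FLOP, 4000 bytes issued and 500 bytes reaching DRAM, AI_CARM is 0.25 and AI_DRAM is 2.0.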
Find Effective Optimization Strategies
Intel Advisor: Cache-aware roofline analysis
Roofline Performance Insights
- Highlights poor performing loops
- Shows performance “headroom” for each loop
  - Which can be improved
  - Which are worth improving
- Shows likely causes of bottlenecks
- Suggests next optimization steps
Find Effective Optimization Strategies
Intel Advisor: Cache-aware roofline analysis
Roofs Show Platform Limits
- Memory, cache & compute limits
Dots Are Loops
- Bigger, red dots take more time so optimization has a bigger impact
- Dots farther from a roof have more room for improvement
Higher Dot = Higher GFLOPs/sec
- Optimization moves dots up
- Algorithmic changes move dots horizontally
Which loops should we optimize?
- A and G are the best candidates
- B has room to improve, but will have less impact
- E, C, D, and H are poor candidates
Roofline tutorial video
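The selection rule on this slide (large dots far below a roof first) can be approximated with a simple score: the time a loop would save if it ran at its roof. This is an illustrative heuristic under that assumption, not the ranking Intel Advisor itself uses; the struct and function names are hypothetical.

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct LoopDot {
    std::string name;
    double seconds;      // self time of the loop (dot size and color)
    double gflops;       // measured performance (dot height)
    double roof_gflops;  // the roof directly above the dot at its AI
};

// Seconds that would be saved if the loop ran at its roof.
double potential_saving(const LoopDot& d) {
    if (d.roof_gflops <= 0.0 || d.gflops >= d.roof_gflops) return 0.0;
    return d.seconds * (1.0 - d.gflops / d.roof_gflops);
}

// Sort loops so the most promising candidates come first.
std::vector<LoopDot> rank_candidates(std::vector<LoopDot> dots) {
    std::sort(dots.begin(), dots.end(),
              [](const LoopDot& a, const LoopDot& b) {
                  return potential_saving(a) > potential_saving(b);
              });
    return dots;
}
```

A long-running loop far below its roof scores highest, a short loop with the same headroom scores lower, and a loop already at its roof scores zero, which mirrors the A/B/E distinction on the slide.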
Roofline Automation in Intel Advisor 2017 update 2 and 2018 beta
- Interactive mapping to source and performance profile
- Synergy between Vector Advisor and Roofline: FMA example
- Customizable chart
Each Dot represents a loop or function in YOUR APPLICATION (profiled)
Each Roof (slope) gives peak CPU/Memory throughput of your PLATFORM (benchmarked)
Intel® Advisor Roofline: under the hood
Roofline application profile:
Axis Y: FLOP/S = #FLOP (mask aware) / #Seconds
Axis X: AI = #FLOP / #Bytes
- #FLOP: binary instrumentation; does not rely on CPU counters
- Seconds: user-mode sampling; root access not needed
- Bytes: binary instrumentation; counts operand size (not cache lines)
- Roofs: microbenchmarks; actual peak for the current configuration
AI = Flop/byte
Performance = Flops/seconds
Getting Roofline data in Intel® Advisor
FLOP/S = #FLOP / Seconds
Step 1: Survey
- Non-intrusive, representative
- Output: Seconds (+ much more)
Step 2: Trip counts + FLOPS
- Precise, instrumentation-based
- Physically counts the number of instructions
- Output: #FLOP, #Bytes, mask utilization
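The two passes above produce the three numbers that place a loop on the chart. The sketch below shows how they combine; the struct names are hypothetical stand-ins for the two result sets and do not correspond to any Advisor API.

```cpp
// Hypothetical containers for the two collection passes.
struct SurveyResult    { double seconds; };            // Step 1 output
struct TripCountResult { double flop; double bytes; }; // Step 2 output

struct RooflinePoint {
    double ai;      // X axis: FLOP per byte
    double gflops;  // Y axis: GFLOP per second
};

// Combine the timing pass and the counting pass into one chart coordinate.
RooflinePoint make_point(const SurveyResult& s, const TripCountResult& t) {
    return RooflinePoint{t.flop / t.bytes, t.flop / s.seconds / 1e9};
}
```

A loop that executes 8e9 FLOP over 16e9 bytes in 2 seconds lands at AI = 0.5 and 4 GFLOP/s. Keeping the timing and counting passes separate is what lets the timing run stay lightweight while the counts come from a slower instrumented run.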
Intel Confidential
Interpreting Roofline Data
Final limits (assuming perfect optimization): long-term ROI, optimization strategy
Current limits (what are my current bottlenecks): next step, optimization tactics
- Finally compute-bound: invest more into effective CPU/VPU (SIMD) optimization
- Finally memory-bound: invest more into effective cache utilization
- “Limited by everything”: check your Advisor Survey and MAP results
What are my memory and compute peaks?
How far away from peak system performance is my application?
[Roofline chart, logarithmic scale; X axis: AI (DRAM Flop/Byte) with AI == 1 marked; DP FMA roof at ~16 peak GFLOP/sec]
Can we do any better?
How far are we from roofs?
Perform the right optimization for your region
Roofline: characterization regions
[Roofline chart: GFLOPS on a logarithmic scale vs. AI (Flop/Byte) with AI == 1 marked; Scalar roof at ~2.3 peak GFLOP/sec]
- L1/L2/LLC/DRAM-bound: investing into compute peak could be useless. Optimize memory (cache blocking, etc.)
- L2/LLC/DRAM/Compute-bound: gray area (need more data to determine the right strategy)
- Compute-bound: investing into cache/DRAM could be useless. Optimize compute (threading, vectorization, etc.)
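One plausible way to draw the three regions is to use the two extreme ridge points: where the fastest memory roof meets the lowest compute roof, and where the slowest memory roof meets the highest compute roof. The sketch below assumes that construction; the slide does not state the exact boundaries, and all names and roof values are illustrative.

```cpp
enum class Region { MemoryBound, GrayArea, ComputeBound };

// Platform roofs; all values are assumptions supplied by the caller.
struct Roofs {
    double lowest_compute_gflops;   // e.g. scalar add peak
    double highest_compute_gflops;  // e.g. vector FMA peak
    double fastest_memory_gbs;      // e.g. L1 bandwidth
    double slowest_memory_gbs;      // e.g. DRAM bandwidth
};

Region classify(double ai, const Roofs& r) {
    // Left of this AI even the fastest memory roof lies below every compute roof.
    const double left = r.lowest_compute_gflops / r.fastest_memory_gbs;
    // Right of this AI even the slowest memory roof lies above every compute roof.
    const double right = r.highest_compute_gflops / r.slowest_memory_gbs;
    if (ai < left)  return Region::MemoryBound;
    if (ai > right) return Region::ComputeBound;
    return Region::GrayArea;
}
```

With assumed roofs of 2 and 16 GFLOP/s and 100 and 10 GB/s, the boundaries fall at AI = 0.02 and AI = 1.6, so a loop at AI = 0.5 sits in the gray area where both memory and compute roofs can apply.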
The 514.pomriq SPEC ACCEL Benchmark
- An MRI image reconstruction kernel described in Stone et al. (2008). MRI image reconstruction is a conversion from sampled radio responses to magnetic field gradients. The sample coordinates are in the space of magnetic field gradients, or K-space.
- The algorithm examines a large set of input, representing the intended MRI scanning trajectory and the points that will be sampled.
- The input to 514.pomriq consists of one file containing the number of K-space values, the number of X-space values, and then the list of K-space coordinates, X-space coordinates, and Phi-field complex values for the K-space samples.
Hot loop is vectorized
Intel Advisor summary view
- 1 vectorized loop that we spend 98.8% of our time in
- Need more information to see if we can get more performance
What is our performance relative to peak system performance?
- Our hot loop is below the MCDRAM roof
- Potential memory bottleneck
Get detailed advice from Intel® Advisor
- Intel® Advisor code analytics: possible inefficient memory access
- Gather stride access!
- Recommendations: need more information, confirm inefficient memory access
  - Run Memory Access Pattern Analysis (MAP)
Irregular access patterns decrease performance!
Gather profiling
Irregular access patterns: bad for vectorization performance
Hint: use the Intel Advisor details! Specific recommendation for your application
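To make the distinction between access patterns concrete, the toy classifier below labels a sequence of element indices as unit stride, constant stride, or irregular. It is only a sketch of the idea and not how the MAP analysis is implemented; the enum and function names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

enum class Access { UnitStride, ConstantStride, Irregular };

// element_indices: the element index touched on each successive iteration.
Access classify_access(const std::vector<std::ptrdiff_t>& element_indices) {
    // Too few samples to tell; treated as unit stride.
    if (element_indices.size() < 2) return Access::UnitStride;
    const std::ptrdiff_t stride = element_indices[1] - element_indices[0];
    for (std::size_t i = 2; i < element_indices.size(); ++i) {
        if (element_indices[i] - element_indices[i - 1] != stride)
            return Access::Irregular;
    }
    return stride == 1 ? Access::UnitStride : Access::ConstantStride;
}
```

Reading one float field from an array of four-float structs touches elements 0, 4, 8, 12, which is a constant stride of 4. A compiler may implement such a loop with gather instructions unless it recognizes the pattern or the data layout is changed.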
Remove gather instructions
Step #1: use a newer version of the Intel compiler that can recognize the access pattern
- Removed gathers
- Increased GFLOPS (from 266.42 to 342.67, roughly a 1.29x gain)
- Gathers replacement is performed by the “Gather to Shuffle/Permutes” compiler transformation
Remove gather instructions
Step #1: a newer version of the Intel compiler can recognize the access pattern
- Now above the MCDRAM roof
- Greater GFLOPS
Remove gather instructions
Step #2: use structure of arrays instead of array of structures

    #include <sdlt/sdlt.h>

    // Plain struct describing one K-space sample.
    struct kValues {
        float Kx;
        float Ky;
        float Kz;
        float PhiMag;
    };

    // Declare the struct's data members to SDLT so it can store them as SoA.
    SDLT_PRIMITIVE(kValues, Kx, Ky, Kz, PhiMag)

    // SoA container holding numK samples, plus an accessor for writing.
    sdlt::soa1d_container<kValues> inputKValues(numK);
    auto kValues = inputKValues.access();

    // Copy the input arrays into the container.
    for (k = 0; k < numK; k++) {
        kValues[k].Kx() = kx[k];
        kValues[k].Ky() = ky[k];
        kValues[k].Kz() = kz[k];
        kValues[k].PhiMag() = phiMag[k];
    }

    // Read-only accessor for the hot loop.
    auto kVals = inputKValues.const_access();
    #pragma omp simd private(expArg, cosArg, sinArg) reduction(+:QrSum, QiSum)
    for (indexK = 0; indexK < numK; indexK++) {
        expArg = PIx2 * (kVals[indexK].Kx() * x[indexX] +
                         kVals[indexK].Ky() * y[indexX] +
                         kVals[indexK].Kz() * z[indexX]);

        cosArg = cosf(expArg);
        sinArg = sinf(expArg);

        float phi = kVals[indexK].PhiMag();
        QrSum += phi * cosArg;
        QiSum += phi * sinArg;
    }
Intel® SIMD Data Layout Templates make this transformation easy and painless!
This is a classic vectorization efficiency strategy, but it can yield poorly designed code.
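For readers without SDLT at hand, the same layout change can be sketched by hand in standard C++. This is a simplified illustration of the AoS-to-SoA idea, not the code from the case study: the type and function names are hypothetical, and the loop omits the `cosf`/`sinf` calls of the real kernel.

```cpp
#include <cstddef>
#include <vector>

// Array of structures: the fields of one sample are adjacent, so reading Kx
// for consecutive samples skips over Ky, Kz and PhiMag (stride of 4 floats).
struct KSampleAoS { float Kx, Ky, Kz, PhiMag; };

// Structure of arrays: each field is contiguous, so consecutive samples are
// read with unit stride.
struct KSamplesSoA {
    std::vector<float> Kx, Ky, Kz, PhiMag;
};

// Copy every field of the AoS input into its own contiguous array.
KSamplesSoA to_soa(const std::vector<KSampleAoS>& aos) {
    KSamplesSoA soa;
    soa.Kx.reserve(aos.size());
    soa.Ky.reserve(aos.size());
    soa.Kz.reserve(aos.size());
    soa.PhiMag.reserve(aos.size());
    for (const KSampleAoS& s : aos) {
        soa.Kx.push_back(s.Kx);
        soa.Ky.push_back(s.Ky);
        soa.Kz.push_back(s.Kz);
        soa.PhiMag.push_back(s.PhiMag);
    }
    return soa;
}

// Sums the PhiMag-weighted dot product of each sample with one (x, y, z) point.
float weighted_sum(const KSamplesSoA& k, float x, float y, float z) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < k.Kx.size(); ++i) {
        sum += k.PhiMag[i] * (k.Kx[i] * x + k.Ky[i] * y + k.Kz[i] * z);
    }
    return sum;
}
```

The hand-written version shows the trade-off the slide mentions: the loop becomes unit stride, but the data is no longer addressed as one object per sample, which is the design cost that SDLT accessors are meant to hide.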
Remove gather instructions
Step #2: transform code using the Intel® SIMD Data Layout Templates
- The loop is no longer red. This means it takes less time now
- Has more GFLOPS, putting it close to the L2 roof
- The total performance improvement is almost 3x for the kernel and 50% for the entire application.
Transform code using the Intel® SIMD Data Layout Templates
Summary of improvements
- After applying this optimization, the dot is no longer red. This means it takes less time now
- Has more GFLOPS, putting it close to the L2 roof
- The loop now has unit stride access and, as a result, no special memory manipulations
- The total performance improvement is almost 3x for the kernel and 50% for the entire application.
Intel Advisor 2018 beta
Intel® Advisor: Vectorization Optimization
New!
- Roofline analysis helps you optimize effectively
  - Find high-impact but under-optimized loops
  - Does it need cache or vectorization optimization?
  - Is a more numerically intensive algorithm a better choice?
- Faster data collection
  - Filter by module: calculate only what is needed
  - Track refinement analysis: stop when every site has executed
- Make better decisions with more data, more recommendations
  - Intel MKL friendly: is the code optimized? Is the best variant used?
  - Function call counts in addition to trip counts
  - Top 5 recommendations added to summary
  - Dynamic instruction mix: expert feature shows exact count of each instruction
- Easier MPI launching
  - MPI support in the command line dialog
Call to Action
- Modernize your code
  - To get the most out of your hardware, you need to modernize your code with vectorization and threading.
  - Taking a methodical approach such as the one outlined in this presentation, and taking advantage of the powerful tools in Intel® Parallel Studio XE, can make the modernization task dramatically easier.
- Download the latest here: https://software.intel.com/en-us/intel-parallel-studio-xe
  - The Professional and Cluster Edition both include Advisor
  - Join the 2018 beta of Intel Parallel Studio XE to get the latest version
- Send e-mail to vector_advisor@intel.com to get the latest information on some exciting new capabilities that are currently under development.
Resources
Intel® Advisor Links
- Vectorization Guide: http://bit.ly/autovectorize-guide
- Explicit Vector Programming in Fortran: http://bit.ly/explicitvector-fortran
- Optimization Reports: http://bit.ly/optimizereports
- Beta Registration & Download: http://bit.ly/PSXE2017-Beta
Code Modernization Links
- Modern Code Developer Community: software.intel.com/modern-code
- Intel Code Modernization Enablement Program: software.intel.com/code-modernization-enablement
- Intel Parallel Computing Centers: software.intel.com/ipcc
- Technical Webinar Series Registration: http://bit.ly/spring16-tech-webinars
- Intel Parallel Universe Magazine: software.intel.com/intel-parallel-universe-magazine
Additional Resources
- For Intel® Xeon Phi™ coprocessors, but also applicable:
  - https://software.intel.com/en-us/articles/vectorization-essential
  - https://software.intel.com/en-us/articles/fortran-array-data-and-arguments-and-vectorization
- Intel® Parallel Studio XE Composer Edition User and Reference Guides:
  - https://software.intel.com/en-us/intel-cplusplus-compiler-16.0-user-and-reference-guide-pdf
  - https://software.intel.com/en-us/intel-fortran-compiler-16.0-user-and-reference-guide-pdf
- Compiler User Forums: http://software.intel.com/forums
Configurations for 2007-2016 Benchmarks
Platform Hardware and Software Configuration

| Platform | Unscaled Core Frequency | Cores/Socket | Num Sockets | L1 Data Cache | L2 Cache | L3 Cache | Memory | Memory Frequency | Memory Access | H/W Prefetchers Enabled | HT Enabled | Turbo Enabled | C States | O/S Name | Operating System | Compiler Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Intel® Xeon™ 5472 Processor | 3.0 GHz | 4 | 2 | 32K | 6 MB | None | 32 GB | 800 MHz | UMA | Y | N | N | Disabled | Fedora 20 | 3.11.10-301.fc20 | icc version 14.0.1 |
| Intel® Xeon™ X5570 Processor | 2.9 GHz | 4 | 2 | 32K | 256K | 8 MB | 48 GB | 1333 MHz | NUMA | Y | Y | Y | Disabled | Fedora 20 | 3.11.10-301.fc20 | icc version 14.0.1 |
| Intel® Xeon™ X5680 Processor | 3.33 GHz | 6 | 2 | 32K | 256K | 12 MB | 48 MB | 1333 MHz | NUMA | Y | Y | Y | Disabled | Fedora 20 | 3.11.10-301.fc20 | icc version 14.0.1 |
| Intel® Xeon™ E5 2690 Processor | 2.9 GHz | 8 | 2 | 32K | 256K | 20 MB | 64 GB | 1600 MHz | NUMA | Y | Y | Y | Disabled | Fedora 20 | 3.11.10-301.fc20 | icc version 14.0.1 |
| Intel® Xeon™ E5 2697v2 Processor | 2.7 GHz | 12 | 2 | 32K | 256K | 30 MB | 64 GB | 1867 MHz | NUMA | Y | Y | Y | Disabled | RHEL 7.1 | 3.10.0-229.el7.x86_64 | icc version 14.0.1 |
| Intel® Xeon™ E5 2600v3 Processor | 2.2 GHz | 18 | 2 | 32K | 256K | 46 MB | 128 GB | 2133 MHz | NUMA | Y | Y | Y | Disabled | Fedora 20 | 3.13.5-202.fc20 | icc version 14.0.1 |
| Intel® Xeon™ E5 2600v4 Processor | 2.3 GHz | 18 | 2 | 32K | 256K | 46 MB | 256 GB | 2400 MHz | NUMA | Y | Y | Y | Disabled | RHEL 7.0 | 3.10.0-123.el7.x86_64 | icc version 14.0.1 |
| Intel® Xeon™ E5 2600v4 Processor | 2.2 GHz | 22 | 2 | 32K | 256K | 56 MB | 128 GB | 2133 MHz | NUMA | Y | Y | Y | Disabled | CentOS 7.2 | 3.10.0-327.el7.x86_64 | icc version 14.0.1 |

Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations
include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not
manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are
reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice
revision #20110804 Performance measured in Intel Labs by Intel employees.
Legal Disclaimer & Optimization Notice
- INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
- Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
- Copyright © 2016, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors.
These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or
effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for
use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the
applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
3838
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Roofline Analysis

Mais conteúdo relacionado

Mais procurados

AIDC Summit LA- Hands-on Training
AIDC Summit LA- Hands-on Training AIDC Summit LA- Hands-on Training
AIDC Summit LA- Hands-on Training Intel® Software
 
AIDC NY: BODO AI Presentation - 09.19.2019
AIDC NY: BODO AI Presentation - 09.19.2019AIDC NY: BODO AI Presentation - 09.19.2019
AIDC NY: BODO AI Presentation - 09.19.2019Intel® Software
 
AWS & Intel Webinar Series - Accelerating AI Research
AWS & Intel Webinar Series - Accelerating AI ResearchAWS & Intel Webinar Series - Accelerating AI Research
AWS & Intel Webinar Series - Accelerating AI ResearchIntel® Software
 
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...Intel® Software
 
Passing The Joel Test In The PHP World
Passing The Joel Test In The PHP WorldPassing The Joel Test In The PHP World
Passing The Joel Test In The PHP WorldLorna Mitchell
 
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSciStreamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSciIntel® Software
 
Simple Single Instruction Multiple Data (SIMD) with the Intel® Implicit SPMD ...
Simple Single Instruction Multiple Data (SIMD) with the Intel® Implicit SPMD ...Simple Single Instruction Multiple Data (SIMD) with the Intel® Implicit SPMD ...
Simple Single Instruction Multiple Data (SIMD) with the Intel® Implicit SPMD ...Intel® Software
 
Python Data Science and Machine Learning at Scale with Intel and Anaconda
Python Data Science and Machine Learning at Scale with Intel and AnacondaPython Data Science and Machine Learning at Scale with Intel and Anaconda
Python Data Science and Machine Learning at Scale with Intel and AnacondaIntel® Software
 
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...Intel® Software
 
Intel® Graphics Performance Analyzers
Intel® Graphics Performance AnalyzersIntel® Graphics Performance Analyzers
Intel® Graphics Performance AnalyzersIntel® Software
 
AIDC Summit LA: Wipro Solutions Overview
AIDC Summit LA: Wipro Solutions Overview AIDC Summit LA: Wipro Solutions Overview
AIDC Summit LA: Wipro Solutions Overview Intel® Software
 
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...Intel® Software
 
Intel® Open Image Denoise: Optimized CPU Denoising | SIGGRAPH 2019 Technical ...
Intel® Open Image Denoise: Optimized CPU Denoising | SIGGRAPH 2019 Technical ...Intel® Open Image Denoise: Optimized CPU Denoising | SIGGRAPH 2019 Technical ...
Intel® Open Image Denoise: Optimized CPU Denoising | SIGGRAPH 2019 Technical ...Intel® Software
 
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Intel® Software
 
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019Intel® Software
 
TDC2019 Intel Software Day - Otimizacao grafica com o Intel GPA
TDC2019 Intel Software Day - Otimizacao grafica com o Intel GPATDC2019 Intel Software Day - Otimizacao grafica com o Intel GPA
TDC2019 Intel Software Day - Otimizacao grafica com o Intel GPAtdc-globalcode
 
TDC2019 Intel Software Day - Tecnicas de Programacao Paralela em Machine Lear...
TDC2019 Intel Software Day - Tecnicas de Programacao Paralela em Machine Lear...TDC2019 Intel Software Day - Tecnicas de Programacao Paralela em Machine Lear...
TDC2019 Intel Software Day - Tecnicas de Programacao Paralela em Machine Lear...tdc-globalcode
 
A Primer on FPGAs - Field Programmable Gate Arrays
A Primer on FPGAs - Field Programmable Gate ArraysA Primer on FPGAs - Field Programmable Gate Arrays
A Primer on FPGAs - Field Programmable Gate ArraysTaylor Riggan
 
The Architecture of 11th Generation Intel® Processor Graphics
The Architecture of 11th Generation Intel® Processor GraphicsThe Architecture of 11th Generation Intel® Processor Graphics
The Architecture of 11th Generation Intel® Processor GraphicsIntel® Software
 

Mais procurados (20)

AIDC Summit LA- Hands-on Training
AIDC Summit LA- Hands-on Training AIDC Summit LA- Hands-on Training
AIDC Summit LA- Hands-on Training
 
AIDC NY: BODO AI Presentation - 09.19.2019
AIDC NY: BODO AI Presentation - 09.19.2019AIDC NY: BODO AI Presentation - 09.19.2019
AIDC NY: BODO AI Presentation - 09.19.2019
 
AWS & Intel Webinar Series - Accelerating AI Research
AWS & Intel Webinar Series - Accelerating AI ResearchAWS & Intel Webinar Series - Accelerating AI Research
AWS & Intel Webinar Series - Accelerating AI Research
 
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...
 
Passing The Joel Test In The PHP World
Passing The Joel Test In The PHP WorldPassing The Joel Test In The PHP World
Passing The Joel Test In The PHP World
 
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSciStreamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
 
Simple Single Instruction Multiple Data (SIMD) with the Intel® Implicit SPMD ...
Simple Single Instruction Multiple Data (SIMD) with the Intel® Implicit SPMD ...Simple Single Instruction Multiple Data (SIMD) with the Intel® Implicit SPMD ...
Simple Single Instruction Multiple Data (SIMD) with the Intel® Implicit SPMD ...
 
Python Data Science and Machine Learning at Scale with Intel and Anaconda
Python Data Science and Machine Learning at Scale with Intel and AnacondaPython Data Science and Machine Learning at Scale with Intel and Anaconda
Python Data Science and Machine Learning at Scale with Intel and Anaconda
 
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
 
Intel® Graphics Performance Analyzers
Intel® Graphics Performance AnalyzersIntel® Graphics Performance Analyzers
Intel® Graphics Performance Analyzers
 
AIDC Summit LA: Wipro Solutions Overview
AIDC Summit LA: Wipro Solutions Overview AIDC Summit LA: Wipro Solutions Overview
AIDC Summit LA: Wipro Solutions Overview
 
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
 
Intel® Open Image Denoise: Optimized CPU Denoising | SIGGRAPH 2019 Technical ...
Intel® Open Image Denoise: Optimized CPU Denoising | SIGGRAPH 2019 Technical ...Intel® Open Image Denoise: Optimized CPU Denoising | SIGGRAPH 2019 Technical ...
Intel® Open Image Denoise: Optimized CPU Denoising | SIGGRAPH 2019 Technical ...
 
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
 
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019
 
TDC2019 Intel Software Day - Otimizacao grafica com o Intel GPA
TDC2019 Intel Software Day - Otimizacao grafica com o Intel GPATDC2019 Intel Software Day - Otimizacao grafica com o Intel GPA
TDC2019 Intel Software Day - Otimizacao grafica com o Intel GPA
 
TDC2019 Intel Software Day - Tecnicas de Programacao Paralela em Machine Lear...
TDC2019 Intel Software Day - Tecnicas de Programacao Paralela em Machine Lear...TDC2019 Intel Software Day - Tecnicas de Programacao Paralela em Machine Lear...
TDC2019 Intel Software Day - Tecnicas de Programacao Paralela em Machine Lear...
 
A Primer on FPGAs - Field Programmable Gate Arrays
A Primer on FPGAs - Field Programmable Gate ArraysA Primer on FPGAs - Field Programmable Gate Arrays
A Primer on FPGAs - Field Programmable Gate Arrays
 
The Architecture of 11th Generation Intel® Processor Graphics
The Architecture of 11th Generation Intel® Processor GraphicsThe Architecture of 11th Generation Intel® Processor Graphics
The Architecture of 11th Generation Intel® Processor Graphics
 
AIDC India - AI on IA
AIDC India  - AI on IAAIDC India  - AI on IA
AIDC India - AI on IA
 

Semelhante a Fast Insights to Optimized Vectorization and Memory Using Cache-aware Roofline Analysis

Track A-Compilation guiding and adjusting - IBM
Track A-Compilation guiding and adjusting - IBMTrack A-Compilation guiding and adjusting - IBM
Track A-Compilation guiding and adjusting - IBMchiportal
 
The CAOS framework: Democratize the acceleration of compute intensive applica...
The CAOS framework: Democratize the acceleration of compute intensive applica...The CAOS framework: Democratize the acceleration of compute intensive applica...
The CAOS framework: Democratize the acceleration of compute intensive applica...NECST Lab @ Politecnico di Milano
 
A High-Level Programming Approach for using FPGAs in HPC using Functional Des...
A High-Level Programming Approach for using FPGAs in HPC using Functional Des...A High-Level Programming Approach for using FPGAs in HPC using Functional Des...
A High-Level Programming Approach for using FPGAs in HPC using Functional Des...waqarnabi
 
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Intel® Software
 
TensorFlow meetup: Keras - Pytorch - TensorFlow.js
TensorFlow meetup: Keras - Pytorch - TensorFlow.jsTensorFlow meetup: Keras - Pytorch - TensorFlow.js
TensorFlow meetup: Keras - Pytorch - TensorFlow.jsStijn Decubber
 
byteLAKE's Alveo FPGA Solutions
byteLAKE's Alveo FPGA SolutionsbyteLAKE's Alveo FPGA Solutions
byteLAKE's Alveo FPGA SolutionsbyteLAKE
 
Large-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC WorkloadsLarge-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC Workloadsinside-BigData.com
 
LEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGATO project
 
Multi Processor Architecture for image processing
Multi Processor Architecture for image processingMulti Processor Architecture for image processing
Multi Processor Architecture for image processingideas2ignite
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with SparkRoger Rafanell Mas
 
hetshah_resume
hetshah_resumehetshah_resume
hetshah_resumehet shah
 
Summer training vhdl
Summer training vhdlSummer training vhdl
Summer training vhdlArshit Rai
 
Architectural Optimizations for High Performance and Energy Efficient Smith-W...
Architectural Optimizations for High Performance and Energy Efficient Smith-W...Architectural Optimizations for High Performance and Energy Efficient Smith-W...
Architectural Optimizations for High Performance and Energy Efficient Smith-W...NECST Lab @ Politecnico di Milano
 
Accelerate Machine Learning on Google Cloud
Accelerate Machine Learning on Google CloudAccelerate Machine Learning on Google Cloud
Accelerate Machine Learning on Google CloudSamantha Guerriero
 
Summer training vhdl
Summer training vhdlSummer training vhdl
Summer training vhdlArshit Rai
 

Semelhante a Fast Insights to Optimized Vectorization and Memory Using Cache-aware Roofline Analysis (20)

Track A-Compilation guiding and adjusting - IBM
Track A-Compilation guiding and adjusting - IBMTrack A-Compilation guiding and adjusting - IBM
Track A-Compilation guiding and adjusting - IBM
 
The CAOS framework: Democratize the acceleration of compute intensive applica...
The CAOS framework: Democratize the acceleration of compute intensive applica...The CAOS framework: Democratize the acceleration of compute intensive applica...
The CAOS framework: Democratize the acceleration of compute intensive applica...
 
A High-Level Programming Approach for using FPGAs in HPC using Functional Des...
A High-Level Programming Approach for using FPGAs in HPC using Functional Des...A High-Level Programming Approach for using FPGAs in HPC using Functional Des...
A High-Level Programming Approach for using FPGAs in HPC using Functional Des...
 
Ch1
Ch1Ch1
Ch1
 
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Roofline Analysis

  • 2. Fast insights to optimized vectorization and memory using cache-aware roofline analysis: The Roofline model; Intel® Advisor Roofline analysis; Intel® Advisor demo; Intel® Advisor case study; Intel® Advisor “What’s New” and 2018 beta; Summary.
  • 4. Acknowledgments/References. Roofline model proposed by Williams, Waterman, Patterson (http://www.eecs.berkeley.edu/~waterman/papers/roofline.pdf). “Cache-aware Roofline model: Upgrading the loft” by Ilic, Pratas, Sousa, INESC-ID/IST, Technical University of Lisbon (http://www.inesc-id.pt/ficheiros/publicacoes/9068.pdf). At Intel: Roman Belenov, Zakhar Matveev, Julia Fedorova, SSG product teams, Hugh Caffey, in collaboration with Philippe Thierry.
  • 5. Roofline Model: a visually intuitive performance model. Combines memory utilization/demand and CPU utilization into the same performance analysis and modeling space. Arithmetic Intensity (AI) = # FLOPs / # BYTEs.
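The arithmetic intensity formula on slide 5 can be made concrete with a small sketch. The kernel below (a "triad", `a[i] = b[i] + s * c[i]`) is a hypothetical example that does not appear in the deck, and the function names are illustrative only. It assumes every operand is a 4-byte float and counts two loads and one store per iteration.

```cpp
#include <cassert>
#include <cstddef>

// Arithmetic intensity = FLOPs / bytes, as defined on slide 5.
double arithmetic_intensity(double flops, double bytes) {
    assert(bytes > 0.0);
    return flops / bytes;
}

// Hypothetical kernel (not from the deck): a[i] = b[i] + s * c[i] over n floats.
// Per iteration: 1 multiply + 1 add = 2 FLOPs;
// 2 loads + 1 store of 4-byte floats = 12 bytes.
double triad_intensity(std::size_t n) {
    const double flops = 2.0 * static_cast<double>(n);
    const double bytes = 3.0 * sizeof(float) * static_cast<double>(n);
    return arithmetic_intensity(flops, bytes);
}
```

Under these assumptions the intensity is 2/12, roughly 0.17 FLOP/byte regardless of `n`, which places such a kernel far to the left on a roofline chart.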
  • 7. Cache-Aware vs. Classic Roofline. AI = # FLOP / # BYTE. AI_DRAM = # FLOP / # BYTES (CPU & Cache ↔ DRAM): “DRAM traffic”-based (or MCDRAM-traffic-based); variable for the same code/platform (varies with dataset size/trip count); can be measured relative to different memory hierarchy levels (cache level, HBM, DRAM). AI_CARM = # FLOP / # BYTES (CPU ↔ Memory Sub-system): “Algorithmic”, “Cumulative (L1+L2+LLC+DRAM)” traffic-based; invariant for the given code/platform combination; typically AI_CARM < AI_DRAM.
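A minimal sketch of the two definitions on slide 7, assuming we already have per-loop counters. The struct, its field names, and the functions are hypothetical and are not Intel Advisor output. The only difference between the two intensities is the byte count in the denominator.

```cpp
#include <cassert>

// Hypothetical per-loop counters; names are illustrative only.
struct LoopTraffic {
    double flops;        // floating-point operations executed
    double bytes_total;  // all bytes moved between CPU and memory subsystem (L1+L2+LLC+DRAM)
    double bytes_dram;   // bytes that actually travelled to or from DRAM
};

// Cache-aware (CARM) intensity: cumulative traffic in the denominator.
double ai_carm(const LoopTraffic& t) {
    assert(t.bytes_total > 0.0);
    return t.flops / t.bytes_total;
}

// Classic intensity: only DRAM traffic in the denominator.
double ai_dram(const LoopTraffic& t) {
    assert(t.bytes_dram > 0.0);
    return t.flops / t.bytes_dram;
}
```

When caches absorb part of the traffic, `bytes_dram` is smaller than `bytes_total`, which is one way to read the slide's remark that AI_CARM is typically lower than AI_DRAM.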
  • 9. Find Effective Optimization Strategies. Intel Advisor: cache-aware roofline analysis. Roofline performance insights: highlights poorly performing loops; shows performance “headroom” for each loop (which can be improved, which are worth improving); shows likely causes of bottlenecks; suggests next optimization steps.
  • 10. Find Effective Optimization Strategies. Intel Advisor: cache-aware roofline analysis. Roofs show platform limits: memory, cache, and compute limits. Dots are loops: bigger, red dots take more time, so optimization has a bigger impact; dots farther from a roof have more room for improvement. Higher dot = higher GFLOPs/sec: optimization moves dots up; algorithmic changes move dots horizontally. Which loops should we optimize? A and G are the best candidates; B has room to improve, but will have less impact; E, C, D, and H are poor candidates.
  • 11. Roofline Automation in Intel Advisor 2017 Update 2 and 2018 beta: interactive mapping to source and performance profile; synergy between Vector Advisor and Roofline (FMA example); customizable chart. Each dot represents a loop or function in YOUR APPLICATION (profiled). Each roof (slope) gives the peak CPU/memory throughput of your PLATFORM (benchmarked).
  • 12. Intel® Advisor Roofline: under the hood. #FLOP: binary instrumentation; does not rely on CPU counters. Seconds: user-mode sampling; root access not needed. Bytes: binary instrumentation; counts operand size (not cache lines). Roofs: microbenchmarks; actual peak for the current configuration. AI = FLOP/byte; Performance = FLOP/second. Roofline application profile: Axis Y: FLOP/S = #FLOP (mask aware) / #Seconds; Axis X: AI = #FLOP / #Bytes.
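The two axis formulas on slide 12 can be sketched as a small helper that turns three measured quantities into one dot on the chart. The type and function names are assumptions made for illustration; this is not how Intel Advisor exposes its data.

```cpp
#include <cassert>

// One dot on the roofline chart (hypothetical type, for illustration).
struct RooflinePoint {
    double ai;      // X axis: #FLOP / #Bytes
    double gflops;  // Y axis: #FLOP / #Seconds, scaled to GFLOP/s
};

// Combine the three measured quantities named on the slide.
RooflinePoint make_point(double flop, double bytes, double seconds) {
    assert(bytes > 0.0 && seconds > 0.0);
    return RooflinePoint{flop / bytes, flop / seconds / 1e9};
}
```

For example, a loop that executes 4e9 FLOP over 8e9 bytes in 2 seconds lands at AI = 0.5 FLOP/byte and 2 GFLOP/s.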
  • 13. Getting Roofline data in Intel® Advisor. FLOP/S = #FLOP / Seconds; inputs: Seconds, #FLOP (with mask utilization), #Bytes. Step 1: Survey. Non-intrusive, representative. Output: Seconds (and much more). Step 2: Trip counts + FLOPS. Precise, instrumentation-based; physically counts the number of instructions. Output: #FLOP, #Bytes.
  • 14. Interpreting Roofline Data. Final limits (assuming perfect optimization): long-term ROI, optimization strategy. Current limits (what are my current bottlenecks?): next step, optimization tactics. Finally compute-bound: invest more into effective CPU/VPU (SIMD) optimization. Finally memory-bound: invest more into effective cache utilization. “Limited by everything”: check your Advisor Survey and MAP results.
  • 15. What are my memory and compute peaks? How far away from peak system performance is my application? Can we do any better? How far are we from roofs? [Chart: GFLOP/sec (logarithmic scale) vs. AI (DRAM Flop/Byte); labels: DP FMA peak ~16, AI == 1]
  • 16. Perform the right optimization for your region. Roofline: characterization regions. L1/L2/LLC/DRAM-bound: investing into compute peak could be useless; optimize memory (cache blocking, etc.). L2/LLC/DRAM/compute-bound: gray area (need more data to determine the right strategy). Compute-bound: investing into cache/DRAM could be useless; optimize compute (threading, vectorization, etc.). [Chart: GFLOPS (logarithmic scale) vs. AI (Flop/Byte); labels: Scalar peak ~2.3, AI == 1]
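The three regions on slide 16 can be sketched with the standard roofline bound, attainable = min(compute peak, AI × bandwidth). The deck does not say where it draws the region boundaries, so the sketch below assumes they sit at the "ridge points" of the fastest memory roof and of the DRAM roof. The struct, the enum, and all numbers used with them are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

enum class Region { MemoryBound, GrayArea, ComputeBound };

// Hypothetical platform description: one compute roof and two memory roofs.
struct Platform {
    double peak_gflops;     // compute roof
    double bw_fastest_gbs;  // bandwidth of the fastest memory roof (e.g. L1), GB/s
    double bw_dram_gbs;     // bandwidth of the slowest memory roof (DRAM), GB/s
};

// Classic roofline bound for a single memory roof.
double attainable_gflops(double ai, double bandwidth_gbs, double peak_gflops) {
    return std::min(peak_gflops, ai * bandwidth_gbs);
}

// Assumed region boundaries: a ridge point is the AI at which a memory roof
// meets the compute roof (peak / bandwidth).
Region classify(double ai, const Platform& p) {
    assert(p.bw_fastest_gbs >= p.bw_dram_gbs && p.bw_dram_gbs > 0.0);
    const double ridge_fastest = p.peak_gflops / p.bw_fastest_gbs;
    const double ridge_dram = p.peak_gflops / p.bw_dram_gbs;
    if (ai < ridge_fastest) return Region::MemoryBound;   // every memory roof is below the compute roof
    if (ai > ridge_dram) return Region::ComputeBound;     // even the DRAM roof is above the compute roof
    return Region::GrayArea;                              // depends on which level actually serves the data
}
```

With a 100 GFLOP/s peak, 400 GB/s fastest roof, and 50 GB/s DRAM roof, loops below AI = 0.25 fall in the memory-bound region, loops above AI = 2 in the compute-bound region, and everything between is the gray area that needs more data.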
  • 20. The 514.pomriq SPEC ACCEL Benchmark. An MRI image reconstruction kernel described in Stone et al. (2008). MRI image reconstruction is a conversion from sampled radio responses to magnetic field gradients. The sample coordinates are in the space of magnetic field gradients, or K-space. The algorithm examines a large set of input, representing the intended MRI scanning trajectory and the points that will be sampled. The input to 514.pomriq consists of one file containing the number of K-space values, the number of X-space values, and then the list of K-space coordinates, X-space coordinates, and Phi-field complex values for the K-space samples.
  • 21. Hot loop is vectorized. Intel Advisor summary view: 1 vectorized loop that we spend 98.8% of our time in. Need more information to see if we can get more performance.
  • 22. What is our performance relative to peak system performance? Our hot loop is below the MCDRAM roof: potential memory bottleneck.
  • 23. Get detailed advice from Intel® Advisor. Possible inefficient memory access: gather stride access! Intel® Advisor code analytics. Recommendations: need more information, confirm inefficient memory access.
  • 24. Gather profiling. Irregular access patterns decrease performance! Run Memory Access Pattern Analysis (MAP).
  • 25. Irregular access patterns: bad for vectorization performance. Hint: use the Intel Advisor details! Specific recommendation for your application.
  • 26. Remove gather instructions, step #1: use a newer version of the Intel compiler that can recognize the access pattern. Removed gathers. Increased GFLOPS (from 266.42 to 342.67). Gather replacement is performed by the “Gather to Shuffle/Permutes” compiler transformation.
  • 28. Remove gather instructions, step #2: use structure of arrays instead of array of structures. This is a classic vectorization efficiency strategy, but it can yield poorly designed code. Intel® SIMD Data Layout Templates make this transformation easy and painless!
      struct kValues {
          float Kx;
          float Ky;
          float Kz;
          float PhiMag;
      };

      SDLT_PRIMITIVE(kValues, Kx, Ky, Kz, PhiMag)

      // Fill a structure-of-arrays container from the separate input arrays.
      sdlt::soa1d_container<kValues> inputKValues(numK);
      auto kValues = inputKValues.access();

      for (k = 0; k < numK; k++) {
          kValues[k].Kx() = kx[k];
          kValues[k].Ky() = ky[k];
          kValues[k].Kz() = kz[k];
          kValues[k].PhiMag() = phiMag[k];
      }

      // Hot loop: read-only access with unit stride over each member.
      auto kVals = inputKValues.const_access();
      #pragma omp simd private(expArg, cosArg, sinArg) reduction(+:QrSum, QiSum)
      for (indexK = 0; indexK < numK; indexK++) {
          expArg = PIx2 * (kVals[indexK].Kx() * x[indexX] +
                           kVals[indexK].Ky() * y[indexX] +
                           kVals[indexK].Kz() * z[indexX]);

          cosArg = cosf(expArg);
          sinArg = sinf(expArg);

          float phi = kVals[indexK].PhiMag();
          QrSum += phi * cosArg;
          QiSum += phi * sinArg;
      }
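For readers without the Intel® SIMD Data Layout Templates header, the structure-of-arrays idea from slide 28 can be sketched in plain C++. This is a hand-written stand-in and not the SDLT API: the `KValuesSoA` struct and the `compute_q` function are hypothetical names, and the `x`, `y`, `z` arguments stand for the slide's `x[indexX]`, `y[indexX]`, `z[indexX]`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Structure of arrays: each member lives in its own contiguous array, so the
// loop below reads every member with unit stride (no gathers needed).
struct KValuesSoA {
    std::vector<float> Kx, Ky, Kz, PhiMag;
};

// Accumulate the real and imaginary sums for one X-space point (x, y, z).
void compute_q(const KValuesSoA& k, float x, float y, float z,
               float& QrSum, float& QiSum) {
    const float PIx2 = 6.2831853071795864769f;
    const std::size_t numK = k.Kx.size();
    assert(k.Ky.size() == numK && k.Kz.size() == numK && k.PhiMag.size() == numK);

    QrSum = 0.0f;
    QiSum = 0.0f;
    for (std::size_t i = 0; i < numK; ++i) {
        const float expArg = PIx2 * (k.Kx[i] * x + k.Ky[i] * y + k.Kz[i] * z);
        QrSum += k.PhiMag[i] * std::cos(expArg);
        QiSum += k.PhiMag[i] * std::sin(expArg);
    }
}
```

The trade-off the slide hints at is visible here: the four parallel vectors are easy for a compiler to vectorize, but the code no longer has a single "K value" object to pass around, which is the design cost SDLT is meant to hide.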
  • 29. Remove gather instructions, step #2: transform code using the Intel® SIMD Data Layout Templates. The loop is no longer red, which means it now takes less time. It has more GFLOPS, putting it close to the L2 roof. The total performance improvement is almost 3x for the kernel and 50% for the entire application.
  • 30. Transform code using the Intel® SIMD Data Layout Templates: summary of improvements. After applying this optimization, the dot is no longer red, which means it now takes less time. It has more GFLOPS, putting it close to the L2 roof. The loop now has unit stride access and, as a result, no special memory manipulations. The total performance improvement is almost 3x for the kernel and 50% for the entire application.
  • 31. Intel Advisor 2018 beta. Intel® Advisor: Vectorization Optimization. New! Roofline analysis helps you optimize effectively: find high-impact but under-optimized loops; does it need cache or vectorization optimization? Is a more numerically intensive algorithm a better choice? Faster data collection: filter by module (calculate only what is needed); track refinement analysis (stop when every site has executed). Make better decisions with more data, more recommendations: Intel MKL friendly (is the code optimized? Is the best variant used?); function call counts in addition to trip counts; top 5 recommendations added to summary; dynamic instruction mix (expert feature shows exact count of each instruction). Easier MPI launching: MPI support in the command line dialog.
  • 33. Call to Action: Modernize your Code. To get the most out of your hardware, you need to modernize your code with vectorization and threading. Taking a methodical approach such as the one outlined in this presentation, and taking advantage of the powerful tools in Intel® Parallel Studio XE, can make the modernization task dramatically easier. Download the latest here: https://software.intel.com/en-us/intel-parallel-studio-xe (the Professional and Cluster Editions both include Advisor). Join the 2018 beta of Intel Parallel Studio XE to get the latest version. Send e-mail to vector_advisor@intel.com to get the latest information on some exciting new capabilities that are currently under development.
  • 35. Resources. Intel® Advisor links: Vectorization Guide (http://bit.ly/autovectorize-guide); Explicit Vector Programming in Fortran (http://bit.ly/explicitvector-fortran); Optimization Reports (http://bit.ly/optimizereports); Beta Registration & Download (http://bit.ly/PSXE2017-Beta). Code modernization links: Modern Code Developer Community (software.intel.com/modern-code); Intel Code Modernization Enablement Program (software.intel.com/code-modernization-enablement); Intel Parallel Computing Centers (software.intel.com/ipcc); Technical Webinar Series Registration (http://bit.ly/spring16-tech-webinars); Intel Parallel Universe Magazine (software.intel.com/intel-parallel-universe-magazine).
  • 36. Additional Resources. For Intel® Xeon Phi™ coprocessors, but also applicable: https://software.intel.com/en-us/articles/vectorization-essential and https://software.intel.com/en-us/articles/fortran-array-data-and-arguments-and-vectorization. Intel® Parallel Studio XE Composer Edition User and Reference Guides: https://software.intel.com/en-us/intel-cplusplus-compiler-16.0-user-and-reference-guide-pdf and https://software.intel.com/en-us/intel-fortran-compiler-16.0-user-and-reference-guide-pdf. Compiler User Forums: http://software.intel.com/forums
  • 37. Configurations for 2007-2016 Benchmarks. Platform Hardware and Software Configuration:
      | Platform | Unscaled Core Frequency | Cores/Socket | Num Sockets | L1 Data Cache | L2 Cache | L3 Cache | Memory | Memory Frequency | Memory Access | H/W Prefetchers Enabled | HT Enabled | Turbo Enabled | C States | O/S Name | Operating System | Compiler Version |
      | Intel® Xeon™ 5472 Processor | 3.0 GHz | 4 | 2 | 32K | 6 MB | None | 32 GB | 800 MHz | UMA | Y | N | N | Disabled | Fedora 20 | 3.11.10-301.fc20 | icc version 14.0.1 |
      | Intel® Xeon™ X5570 Processor | 2.9 GHz | 4 | 2 | 32K | 256K | 8 MB | 48 GB | 1333 MHz | NUMA | Y | Y | Y | Disabled | Fedora 20 | 3.11.10-301.fc20 | icc version 14.0.1 |
      | Intel® Xeon™ X5680 Processor | 3.33 GHz | 6 | 2 | 32K | 256K | 12 MB | 48 MB | 1333 MHz | NUMA | Y | Y | Y | Disabled | Fedora 20 | 3.11.10-301.fc20 | icc version 14.0.1 |
      | Intel® Xeon™ E5 2690 Processor | 2.9 GHz | 8 | 2 | 32K | 256K | 20 MB | 64 GB | 1600 MHz | NUMA | Y | Y | Y | Disabled | Fedora 20 | 3.11.10-301.fc20 | icc version 14.0.1 |
      | Intel® Xeon™ E5 2697v2 Processor | 2.7 GHz | 12 | 2 | 32K | 256K | 30 MB | 64 GB | 1867 MHz | NUMA | Y | Y | Y | Disabled | RHEL 7.1 | 3.10.0-229.el7.x86_64 | icc version 14.0.1 |
      | Intel® Xeon™ E5 2600v3 Processor | 2.2 GHz | 18 | 2 | 32K | 256K | 46 MB | 128 GB | 2133 MHz | NUMA | Y | Y | Y | Disabled | Fedora 20 | 3.13.5-202.fc20 | icc version 14.0.1 |
      | Intel® Xeon™ E5 2600v4 Processor | 2.3 GHz | 18 | 2 | 32K | 256K | 46 MB | 256 GB | 2400 MHz | NUMA | Y | Y | Y | Disabled | RHEL 7.0 | 3.10.0-123.el7.x86_64 | icc version 14.0.1 |
      | Intel® Xeon™ E5 2600v4 Processor | 2.2 GHz | 22 | 2 | 32K | 256K | 56 MB | 128 GB | 2133 MHz | NUMA | Y | Y | Y | Disabled | CentOS 7.2 | 3.10.0-327.el7.x86_64 | icc version 14.0.1 |
    Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.
Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804 Performance measured in Intel Labs by Intel employees.
  • 38. Legal Disclaimer & Optimization Notice  INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.  Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.  Copyright © 2016, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries. Optimization Notice Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. 
Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804