High Performance Computing: a key factor for the competitiveness of the country, science, and industry.
Video: https://www.youtube.com/watch?v=8cFqNwhQ7uE

A key factor for the competitiveness of the country, science, and industry.
Talk given during Intel Innovation Week 2015.

  • Key Message: The markets and applications where Intel Xeon Phi can be applied will continue to grow as HPC is applied to other areas such as search, parallel databases, mission-critical apps, and large-scale data mining for business applications. Shown here are the traditional HPC applications and examples of use in the enterprise segment.
    Traditional HPC applications:
    Energy
    Oil & gas exploration
    Climate modeling & weather simulation
    Medical imaging
    Image processing
    Molecular dynamics
    Computational fluid dynamics
    CAD/CAM/CAE
    Digital content creation
    Financial analysis (Monte Carlo/Black Scholes)
    Gene sequencing
    Crash simulations
    Bio-chemistry
    Emerging HPC applications in the enterprise market:
    Parallel databases
    Search
    Business Intelligence & data mining
  • They use different systems… Today’s HPC and Big Data ecosystems differ at every level, from the hardware components through the software stack, including the programming model.

    The key areas of debate between the HPC and Big Data camps are the choices of programming model, resource manager, file system, and hardware.

    Attribution – LEGAL
  • New workflows are emerging….Big Data and traditional HPC workloads will continue, but user demand for real time analysis & decision making requires applying HPC to “really” Big Data as part of a workflow or combined in new workloads. This isn’t a convergence of existing workloads, but new usage demands driving converging system requirements.

    Fast Data examples per Matsuoka’s presentation (Blue Waters Symposium, June 2015): convolutional neural nets, deep machine learning, genomics (“the new fast big kind…metagenome analysis”), and uncertainty quantification. Some other examples per Matsuoka:
    social-network-related large graph processing, social simulation, genomics with advanced sequence matching, and weather problems that require real-time large data assimilation. Notice the distinction between what people commonly call (and arguably over-position as) “big data” and the extremely big data being discussed here.

    Background: metagenomics is the study of genetic material recovered directly from environmental samples. The broad field may also be referred to as environmental genomics, ecogenomics or community genomics. While traditional microbiology and microbial genome sequencing and genomics rely upon cultivated clonal cultures, early environmental gene sequencing cloned specific genes (often the 16S rRNA gene) to produce a profile of diversity in a natural sample. Such work revealed that the vast majority of microbial biodiversity had been missed by cultivation-based methods.[1] Recent studies use either "shotgun" or PCR-directed sequencing to get largely unbiased samples of all genes from all the members of the sampled communities.[2] Because of its ability to reveal the previously hidden diversity of microscopic life, metagenomics offers a powerful lens for viewing the microbial world that has the potential to revolutionize understanding of the entire living world.[3] As the price of DNA sequencing continues to fall, metagenomics now allows microbial ecology to be investigated at a much greater scale and detail than before. The point is that traditional genomic sequencing focuses on single-clone cultures, while metagenomics involves sequencing much, much greater diversity.
  • What a converged architecture might look like

    Acknowledge that users have invested in different programming models, which are arguably better suited to their specific needs. Thus a converged stack needs to accommodate those differences.

    The resource manager looks at the incoming Big Data, HPC, or Fast Data workload and adapts/configures the system for best processing of that workload.

    The file system is built with remote storage but has an adapter to accommodate Hadoop workloads that presume local storage.

    Hardware is optimized for performance with use of fabric and SSDs/burst buffers to support HPC and HPC/Big Data (i.e., Fast Data).
  • Key enabler is a new software stack and a new memory/storage hierarchy to better support both Big Data (BD) and HPC.

    Memory-storage capabilities move storage closer to the compute. By moving the data closer to compute we are also effectively changing the profile of the traditional pyramid shape to one that is more top heavy. We are moving the “center of data” (analogous to the concept of a shape’s center of mass) closer to compute.

    Both HPC and BD use these capabilities, but their usage is weighted differently. For example, HPC emphasizes high-bandwidth configurable memory. Big Data uses in-package memory, but focuses on configurable memory and local application storage.

    For HPC (by tier, with main benefits):
    In-package memory benefits: High Bandwidth; Configurable (cache, memory, flat); Local App Storage
    NVM benefits: Local Storage; Temporal Storage
    Burst Buffer benefits: Faster Checkpointing; Quicker Recovery; Better App Performance

    For Big Data (by tier, with main benefits):
    In-package memory benefits: Configurable memory; Local App Storage; High Bandwidth
    NVM benefits: Local Storage; Temporal Storage
    Burst Buffer benefits: Better App Performance; Quicker Recovery; Faster Checkpointing
    Remote storage / other benefits: Run Hadoop on HPC infrastructure**
  • Key Message: Technical Computing is a key enabler of the latest evolution of scientific methodology

    A new methodology has been emerging from the scientific (nonmedical) community: the introduction of modeling and simulation as an integral part of the research and development process. This is possible because of technical computing and the ability to process massive amounts of detailed data in parallel – what we call heterogeneous computing.

    Because of the complex computing capabilities of technical computing, modeling and simulation have become essential elements of research and development.

    In the new model, after the hypothesis is proposed, modern scientists, researchers, and engineers perform numerous simulations and modeling of the hypothesis in order to design an effective experiment. This allows for an iterative optimization of the experiment design to be performed on the computer, which can take the form of virtual prototyping and virtual testing and evaluation. After this iterative step, when the best experiment design has been refined, the actual experiment is conducted in the laboratory. The value of this new approach is that early modeling and simulation saves time and money that can be better used for conducting the live experiment.

    We’ll show you how companies ranging from life sciences, to manufacturing, to oil & gas exploration are partnering with Intel to use this methodology to get products out faster, more feature rich, and with better quality --- all at lower cost.
  • OK, I think everyone knows Dyson: they are the cool vacuum cleaner company that also makes a fan-less fan. I have one of these, and it is amazingly powerful and amazingly quiet.

    What Dyson did with simulation-based design is very cool.

    They explored 200 design iterations in the same time they would have explored 10. Not bad.

    But look at what it did: they improved the airflow 2.5x over the original concept. They took a good idea and made it great.

    Very cool, very fast, and amazingly innovative again.
    So Dyson exemplified this idea. They broke the mold in several ways:
    They got rid of the fan to reduce the noise.
    They tested more ideas in less time and ended up with a very cool product.

    You can do the same thing too.

    With ANSYS, innovative companies like Dyson, manufacturer of the Dyson Air Multiplier™ fan as well as vacuums and hand dryers, are now able to employ an approach known as design of experiments (DOE) to create and test up to 10 geometric variations of things like the Dyson Air Multiplier dimensions. In this case the team investigated 200 different design iterations using simulation, which was 10 times the number that would have been possible had physical prototyping been the primary design tool.
  • DreamWorks Animation notes:
    DreamWorks Animation is developing their own proprietary animation and lighting software utilizing Intel Software Development tools
    New animation and lighting software will enable more iterations of scenes to get the perfect character performances and shot depth
    Enabling more iterations improves the movie production process by permitting artists to continue to be productive instead of waiting on scene renders before attempting new changes
    This improvement is similar to enabling additional prototypes of a product to get the right innovation
  • 28% faster BLAST workload performance compared to cluster configuration prior to upgrade
    61% compute capacity increase compared to cluster configuration prior to upgrade
    22% increase in rack space compared to cluster configuration prior to upgrade
  • PETROBRAS
    Our engagement with the Research Center for Oil & Gas, focused on exploration and production (the core activity of PETROBRAS), has been producing substantial results. One example is the 10.5x performance gain in their Reservoir Simulator software, optimized to run on Intel Xeon servers.


    LNCC – National Laboratory for Scientific Computing
    Home to the largest supercomputer in Latin America, with a capacity of 1 petaflops; the system uses Intel® Xeon® E5 processors and Intel® Xeon® Phi™ coprocessors.
    In May 2015, Intel signed a Technical Cooperation agreement to anchor the research in “New Computing Models for Enhanced Oil Recovery” on Intel architecture.

    Intel Modern Code with UNESP-NCC
    The São Paulo State University (UNESP), part of the state of São Paulo public higher education system, is one of the largest universities in Brazil, and its Center for Scientific Computing (CSC) operates two large Linux-based HPC clusters to support the university research community.
    It is a pleasure to announce that they have become our Intel Modern Code Partner in Latin America, focused on code modernization and on disseminating improvements and innovations in parallel processing to the broader HPC community.

    1. Key factor for the competitiveness of the country, science, and industry
       Igor Freitas, Application Engineer, 05/11/2015
    2. Agenda
       • What is High Performance Computing?
       • HPC and the competitiveness of industry, science, and the country
       • Intel's HPC initiatives in Brazil
    3. [No text on this slide]
    4. What is High Performance Computing?
       “High-performance computing (HPC) is the use of parallel processing for running advanced application programs efficiently, reliably and quickly. The term applies especially to systems that function above a teraflop or 10^12 floating-point operations per second.”
       Or, more simply: how to solve the hardest problems in the world, in every aspect of our lives, using a powerful and efficient supercomputer.
    5. Extending to New Dimensions
       HPC can be used in many areas of science and industry.
       HPC applications: medical image analysis; climate modeling & weather forecasting; financial markets; energy (seismic applications); digital content; molecular dynamics; fluid dynamics; manufacturing and CAD/CAM; DNA sequencing; automation in the electronics industry; defense & security
       Enterprise applications: search engines; parallel databases; Business Intelligence / data mining
    6. What is High Performance Computing?
       Democratization of supercomputer performance and operation
       IBM's “Automatic Sequence Controlled Calculator”, or “Mark I”
       Mission: “to develop a machine that could perform fast scientific calculations in order to understand matters of war, such as the trajectory of warheads”
       “This involved translating mathematical problems into a numerical language that the computer could understand.”
       [Photo: Grace Murray Hopper at the UNIVAC keyboard, c. 1960]
    7. The democratization of HPC clusters: the last 20 years
       [Chart: $/FLOP by year, from the Beowulf cluster (1994) to 2014: >15,000x improvement*]
       • Advances in science
       • High ROI in the industrial innovation process
       *Source: Intel per-socket estimate comparing Intel DX4™ processor (Beowulf) versus Intel® Xeon Phi™ (Knights Corner). Other brands and names are the property of their respective owners.
    8. What is High Performance Computing? HPC vs Big Data
       • Programming model. HPC: FORTRAN / C++ applications, MPI (high performance). Big Data: Java* applications, Hadoop* (simple to use).
       • Resource manager. HPC: SLURM (supports large-scale startup). Big Data: YARN* (more resilient to hardware failures).
       • File system. HPC: Lustre* (remote storage). Big Data: HDFS*, SPARK* (local storage).
       • Hardware. HPC: compute & memory focused, high-performance components; server storage on SSDs; switch fabric infrastructure. Big Data: storage focused, standard server components; server storage on HDDs; Ethernet switch infrastructure.
       Source: Daniel Reed and Jack Dongarra, “Exascale Computing and Big Data”, Communications of the ACM, July 2015 (Vol. 58, No. 7), and Intel analysis. *Other brands and names are the property of their respective owners.
    9. What is High Performance Computing? Big Data + HPC: heavy processing in real time
       [Diagram: workloads placed along Data and Compute axes]
       • Small Data + Small Compute, e.g. data analysis
       • Big Data + Small Compute, e.g. search, streaming, data preconditioning
       • Small Data + Big Compute, e.g. mechanical design, multi-physics
    10. Intel's vision for HPC
        Balanced compute, storage, and interconnects based on workload
        [Diagram labels: Compute, Storage, Networking, Software]
    11. A paradigm shift for massively parallel systems
        Processor + high-speed networks + memory = Knights Landing
        [Diagram labels: Coprocessor, Fabric, Memory; Knights Landing; Server Processor]
        • Memory bandwidth: ~500 GB/s STREAM
        • Memory capacity: over 25x* KNC
        • Resiliency: systems scalable to >100 PF
        • Power efficiency: over 25% better than card¹
        • I/O: up to 100 GB/s with integrated fabric
        • Cost: less costly than discrete parts¹
        • Flexibility: limitless configurations
        • Density: 3+ KNL with fabric in 1U³
        *Comparison to 1st Generation Intel® Xeon Phi™ 7120P Coprocessor (formerly codenamed Knights Corner).
        ¹Results based on internal Intel analysis using estimated power consumption and projected component pricing in the 2015 timeframe. This analysis is provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.
        ²Comparison to a discrete Knights Landing processor and discrete fabric component.
        ³Theoretical density for air-cooled system; other cooling solutions and configurations will enable lower or higher density.
    12. A single architecture for HPC & Big Data
        • Programming model. HPC: FORTRAN / C++ applications, MPI (high performance). Big Data: Java* applications, Hadoop* (simple to use).
        • Resource manager: HPC & Big Data-aware resource manager
        • File system: Lustre* with Hadoop* adapter (remote storage)
        • Hardware: compute & Big Data capable, scalable performance components; server storage (SSDs and burst buffers); Intel® Omni-Path Architecture infrastructure
        *Other names and brands may be claimed as the property of others
    13. Next steps for HPC & Big Data: adaptable memory & storage hierarchy
        [Diagram columns: Processor, Compute Node, I/O Node, Remote Storage; arrow toward the processor labeled “higher bandwidth, lower latency and capacity”]
        • Compute today: caches; local memory; SSD storage; parallel file system (hard drive storage)
        • Compute future: caches; in-package high bandwidth memory*; non-volatile memory; burst buffer storage; parallel file system (hard drive storage)
        • Some remote data moves onto the I/O node
        • I/O node storage moves to the compute node
        • Local memory is now faster and in the processor package
        *cache, memory or hybrid mode
    14. What is High Performance Computing? #HPC Matters
        [Video: HPC Transforms Parkinson's Disease - SC15]
    15. What is High Performance Computing? #HPC Matters
        [Video: SC15 - Climate Modeling]
    16. [No text on this slide]
    17. HPC enables a new scientific methodology: innovation in industry
        To compete, you must compute.
        [Diagram: two cycles, as far as the step order can be recovered from the slide text]
        • Without modeling and simulation: Hypothesis • Prediction • Physical Prototyping • Analysis • Conclusion • Refinement
        • With modeling and simulation (accelerates the method; iterate): Hypothesis • Prediction • Modeling & Simulation • Experiment Refinement • Physical Prototyping • Analysis • Conclusion • Refinement
        1. Satava, Richard M. “The Scientific Method Is Dead-Long Live the (New) Scientific Method.” Journal of Surgical Innovation (June 2005).
    18. HPC and the competitiveness of industry, science, and the country
        • Executive order by President Obama for a “national supercomputing program”
        • HPC as a “top priority” to boost US competitiveness
        “In order to maximize the benefits of HPC for economic competitiveness and scientific discovery, the United States Government must create a coordinated Federal strategy in HPC research, development, and deployment” (Executive Order, Barack Obama)
        Source: The White House, Office of the Press Secretary
    19. Dyson Creates a Revolutionary Fan, utilizing the new scientific method
        • Reduced the number of costly, time-consuming physical prototypes
        • 2.5x better fan performance while eliminating external moving parts
        • By investigating 10x the number of design possibilities using virtual prototyping
        [Images: Dyson Air Multiplier fan; virtual prototype]
        Source: Ansys Advantage Volume IV, Issue 2, 2010, pp. 5-7 © Ansys Corp.
    20. Audi workflow: real-time, photo-realistic predictive rendering
        Topline (innovation):
        • More compelling, accurate visualization of car design
        • Avoid physical prototyping spin by identifying body part fit issues
        • Reduce turn-around from identifying design changes
        Bottom line (costs):
        • Got the most for their Autodesk software investment with optimized performance on Intel platforms
        • Intel® Xeon® Processor E5-2600 product family based solution across workstations and clusters reduced deployment and maintenance costs
        [Images: virtual prototyped images, courtesy of The Audi Group, used by permission]
    21. DreamWorks Animation results: enables more iterations, improves the movie production process
        • Intel® Xeon® workstation: Intel® Xeon® Processor E5-2600 product family enabled artist workstations
        • Large cluster computation: large, shared rendering clusters configured with Intel® Xeon® Processor E5-2600 product family
        • DreamWorks Animation proprietary software
        “By combining Xeon E5-2600 class processors with a Xeon Phi coprocessor, we are now able to provide artists with extremely high-quality light transport simulation in large scenes at interactive speeds. This enables us to bring further technical innovation to bear on the ways breathtaking film imagery is created.” (Evan Smyth, Staff Architect)
    22. Monsanto result: getting seeds to farmers quicker with fewer resources
        • BLAST: genomics search algorithm
        • Desktop: Intel-based display device (work done on cluster)
        • Large cluster: expanded shared cluster capacity with a 100+ node Intel® Xeon® processor E5-2600 product family cluster
        • Compute capacity expanded 61%
        • Rack space increased by only 22%
        • 28% faster BLAST workload performance
        • Research team decreased time-to-results from 2 weeks to 6 days
        Source: Results courtesy Monsanto Corporation, 2012
    23. [No text on this slide]
    24. Intel's HPC initiatives in Brazil
        • Oil & Gas, Reservoir Simulator at PETROBRAS: up to 10.5x performance gains in their Reservoir Simulator software
        • LNCC, National Laboratory for Scientific Computing (largest HPC cluster in Latin America): up to 30x performance gain in Oil & Gas applications
        • NCC / UNESP, an Intel® Modern Code Partner: 5 HPC hands-on workshops; 340 developers trained; ongoing white papers together with other institutes
    25. Intel's HPC initiatives in Brazil
        • Modernizing applications to increase parallelism and scalability
        • Leverage cores, caches, threads, and vector capabilities of microprocessors and coprocessors
        • Current centers in Brazil
    26. Case study: PETROBRAS, reservoir simulation
        Code optimization using the Intel® VTune™ Amplifier and Intel® Compiler tools
        Performance gains¹: up to 3.8x speedup in matrix-vector multiplications (using only 1 CPU core)
        • Assembly of the Fortran code: 3 scalar instructions
        • Assembly of the C++ templated code: 1 vectorized, 2 scalar instructions
        • Speedup of the C++ template version vs the original Fortran code, using the Intel Compiler in a Linux environment
        • Part of the optimization: in this case VTune showed the vectorized code was inefficient, so #pragma novector was used
        ¹Author: Gilvan Vieira - gilvan.vieira.coppetec@petrobras.com.br – PETROBRAS/CENPES
    27. Case study: PETROBRAS, reservoir simulation
        Performance gains in a parallel environment using 16 CPU cores, obtained with the Intel® Trace Analyzer & Collector tool¹
        • Intel Trace Analyzer and Collector made the “serialized communication” effect of blocking MPI_Sendrecv calls visible, so non-blocking calls were used instead
        • [Figure: event timeline of MPI communication using 16 ranks]
        • Gains of 1.28x to 10.5x in matrix-vector multiplication kernels
    28. Case study: LNCC – Laboratório Nacional de Computação Científica (National Laboratory for Scientific Computing)
        Technical cooperation focused on oil & gas research projects
        Project 1: “Fine-Tuning Xeon architecture Vectorization and Parallelization of a Numerical Method for convection-diffusion equations”
        Awaiting publication in CCIS volume 565, Springer: "Second Latin American Conference, CARLA 2015, Petrópolis, Brazil, August 26-28, 2015, Proceedings/Revised Selected Papers".
        Performance gain on a dual-socket Xeon® server using 56 threads: 30x vs the original code
        ¹Authors: Frederico L. Cabral – fcabral@lncc.br, Marcio Murad – murad@lncc.br, Carla Osthoff – osthoff@lncc.br
    29. Case study: LNCC – Laboratório Nacional de Computação Científica
        Step 1: “don't guess, measure!”
        • Optimize applications for a single thread through vectorization
        • “X-ray” your application with Intel® VTune™ Amplifier
        • Wasted CPU cycles were identified
        • The CPU divider unit was overloaded
        • Latency problems were hindering vectorization
    30. Case study: LNCC – Laboratório Nacional de Computação Científica
        Step 3: give the compiler some “hints” so it can exploit the parallelism within each CPU core

            double alfa_aux = 1.0 - 2.0*alfa;
            #pragma simd vectorlengthfor(double), private(alfa)
            #pragma vector nontemporal(U_old) //improves cache usage
            #pragma prefetch *64:128
            for (i = head+1 ; i <= N-2 ; i+=2) {
                U_old[i] = alfa*(U_new[i-1] + U_new[i+1]) + alfa_aux * U_new[i];
                //U_old[i] = alfa*(U_new[i-1] + U_new[i+1]) + (1.0 - 2.0*alfa)*U_new[i];
            }

    31. Case study: LNCC – Laboratório Nacional de Computação Científica
        Technical cooperation focused on oil & gas research projects
        Project 2: “Fine Tuning Optimization applied in a Porous Media Flow Application using Intel Tools” (to be published)
        Phase 1: improve the performance of single-threaded applications on the Intel® Xeon® processor
        Up to 4.1x performance gain vs original code (partial results)
        ¹Authors: Frederico L. Cabral – fcabral@lncc.br, Marcio Rentes Borges – marcio.rentes.borges@gmail.com, Carla Osthoff – osthoff@lncc.br
    32. Case study: FATEC – Baixada Santista Rubens Lara, “Parallel Recommender System Based on the Intel® Xeon® and Xeon Phi™”
        Performance prediction with Intel® Advisor before investing effort in optimizing the code
        • Xeon: 16 threads would be the best scenario
        • Xeon Phi: 120 threads would be the best scenario
    33. Intel Compiler report
        Understand what optimizations were performed, and how to extract the maximum performance.

            LOOP BEGIN at regressao-xeon.c(116,18) inlined into regressao-xeon.c(55,6)
               remark #15389: vectorization support: reference beta_756 has unaligned access [ regressao-xeon.c(118,11) ]
               remark #15389: vectorization support: reference entrada_756 has unaligned access [ regressao-xeon.c(118,11) ]
               remark #15381: vectorization support: unaligned access used inside loop body
               remark #15427: loop was completely unrolled
               remark #15399: vectorization support: unroll factor set to 6
               remark #15301: SIMD LOOP WAS VECTORIZED
               remark #15450: unmasked unaligned unit stride loads: 2
               remark #15475: --- begin vector loop cost summary ---
               remark #15476: scalar loop cost: 12
               remark #15477: vector loop cost: 13.500
               remark #15478: estimated potential speedup: 3.640
               remark #15479: lightweight vector operations: 7
               remark #15488: --- end vector loop cost summary ---
            LOOP END

        Hint: declare data aligned to assist vectorization

            double *beta = (double*) _mm_malloc (TOTBETAS * sizeof(double), AVX_ALIGN);

        Case study: FATEC – Baixada Santista Rubens Lara, “Parallel Recommender System Based on the Intel® Xeon® and Xeon Phi™”
    34. Partial conclusions – First part
        • Intel Advisor performance predictions were very precise
        • Although “OpenMP + MKL Offload to Xeon Phi” showed a 1.2x speedup, there is room for higher speedups!
        • Possible path: investigate an MPI + OpenMP version to explore Xeon + Xeon Phi
        [Chart: speedup using only host processors as the number of threads increases]
        Threads: 1, 4, 8, 16, 24, 32
        Speedup: 1, 2.28, 3.03, 4.58, 4.71, 4.85
        [Chart: speedup achieved by enabling Automatic Offload in MKL]
        OpenMP + MKL: 1; OpenMP + MKL Offload: 1.23
        Case study: FATEC – Baixada Santista Rubens Lara, “Parallel Recommender System Based on the Intel® Xeon® and Xeon Phi™”
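
Slide 26 reports that VTune showed a vectorized matrix-vector kernel to be inefficient, so `#pragma novector` was used. The sketch below shows what that can look like on a loop with a very short trip count. It is an illustration only: the function name `matvec_small` and the block size `BLOCK` are hypothetical, since the deck does not show the PETROBRAS source. The pragma is Intel-specific, so it is guarded to keep the code buildable with other compilers.

```c
#include <stddef.h>

/* Hypothetical block size; the slides do not give the real dimensions. */
#define BLOCK 3

/* y = A * x for a dense row-major matrix with `rows` rows and BLOCK columns.
 * The inner loop is very short, a typical case where the overhead of
 * vectorization can outweigh its benefit. */
void matvec_small(const double *a, const double *x, double *y, size_t rows)
{
    for (size_t r = 0; r < rows; ++r) {
        double sum = 0.0;
#if defined(__INTEL_COMPILER) || defined(__INTEL_LLVM_COMPILER)
#pragma novector /* Intel-specific hint: keep this short loop scalar */
#endif
        for (size_t c = 0; c < BLOCK; ++c)
            sum += a[r * BLOCK + c] * x[c];
        y[r] = sum;
    }
}
```

Whether the scalar version wins depends on the compiler and the data layout, which is why the slide's advice is to measure with a profiler first.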
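
The loop on slide 30 relies on Intel-specific pragmas. As a rough, portable counterpart, the sketch below wraps the same update in a function and swaps in the standard OpenMP `#pragma omp simd` hint; this is a substitution for illustration, not the LNCC authors' code. The function name `stencil_sweep` and its parameters are assumptions; `first` plays the role of `head+1`, and `restrict` asserts that the two arrays do not overlap.

```c
/* Update every second point i = first, first+2, ... while i <= n-2,
 * mirroring the loop bounds on the slide. Requires first >= 1. */
void stencil_sweep(double *restrict u_old, const double *restrict u_new,
                   long n, long first, double alfa)
{
    /* Loop invariant computed once, outside the loop, as on the slide. */
    const double alfa_aux = 1.0 - 2.0 * alfa;
    long i;

    /* Standard OpenMP SIMD hint; it has no effect unless OpenMP (or an
     * OpenMP-SIMD-only mode) is enabled at compile time. */
#pragma omp simd
    for (i = first; i <= n - 2; i += 2) {
        u_old[i] = alfa * (u_new[i - 1] + u_new[i + 1]) + alfa_aux * u_new[i];
    }
}
```

Hoisting `1.0 - 2.0*alfa` into `alfa_aux` removes repeated arithmetic from the loop body, which is the same change the commented-out line on the slide documents.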
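
The compiler report on slide 33 flags unaligned accesses to `beta` and `entrada`, and the slide's hint is to allocate aligned data with `_mm_malloc`. The sketch below shows the same idea with the C11 standard function `aligned_alloc` instead, to avoid x86-specific headers. Several things here are assumptions: the value 32 for `AVX_ALIGN` (the slide does not define it), the helper names, and the dot product itself, which only stands in for whatever the loop in `regressao-xeon.c` computes.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed value: 32 bytes, the width of one 256-bit AVX register. */
#define AVX_ALIGN 32

/* Allocate n doubles aligned to AVX_ALIGN bytes; release with free(). */
double *alloc_aligned_doubles(size_t n)
{
    size_t bytes = n * sizeof(double);
    /* C11 aligned_alloc expects a size that is a multiple of the alignment. */
    size_t padded = (bytes + AVX_ALIGN - 1) / AVX_ALIGN * AVX_ALIGN;
    return aligned_alloc(AVX_ALIGN, padded);
}

/* Hypothetical kernel: a dot product that tells the compiler both inputs
 * are aligned, so it may use aligned vector loads. */
double dot_aligned(const double *beta, const double *entrada, size_t n)
{
    double sum = 0.0;
#pragma omp simd reduction(+:sum) aligned(beta, entrada : AVX_ALIGN)
    for (size_t i = 0; i < n; ++i)
        sum += beta[i] * entrada[i];
    return sum;
}
```

The `aligned` clause is a promise, not a check: passing a pointer that was not allocated with the stated alignment is undefined behavior once the pragma is active.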
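
The speedup figures on slide 34 flatten out beyond 16 threads. One way to quantify that, offered here as an aid to reading the numbers and not as part of the original study, is parallel efficiency (speedup divided by thread count) together with the Karp-Flatt metric, which estimates the serial fraction implied by a measured speedup. The function names below are hypothetical.

```c
/* Parallel efficiency: fraction of ideal linear speedup achieved. */
double parallel_efficiency(double speedup, int threads)
{
    return speedup / (double)threads;
}

/* Karp-Flatt metric: experimentally determined serial fraction,
 * e = (1/S - 1/p) / (1 - 1/p), defined for threads > 1. */
double karp_flatt(double speedup, int threads)
{
    double inv_p = 1.0 / (double)threads;
    return (1.0 / speedup - inv_p) / (1.0 - inv_p);
}
```

For the slide's 16-thread point (speedup 4.58), efficiency is about 0.29 and the Karp-Flatt estimate is about 0.17. That is consistent with the slide's conclusion that there is room for higher speedups, though the metric alone does not say whether the limit comes from serial code, memory bandwidth, or something else.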
