The High Performance Computing (HPC) community is facing a technology shift which will result in a performance boost of three orders of magnitude within the next 5 years. This rise in performance will mainly be achieved by increasing the level of concurrency, to the point that a user of those systems needs to adapt to billion-way parallelism. The main problems to solve are programmability, portability, energy efficiency, and resiliency. The author believes that leveraging modern C++ can lead to a solution to those problems from a software perspective.
This talk will discuss the use of C++ in such a massively parallel environment: using the HPX parallel runtime system, a futures-based API, to present a lightweight and efficient mechanism for supporting massive, multi-way parallelism.
Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidis, EUDAT)
Stefano will give an introduction to the most common and used programming models for performing parallel I/O on supercomputers. He will first give a broad overview of parallel APIs for programming I/O on supercomputers. He will then introduce MPI I/O, one of the most used programming interfaces for parallel I/O, presenting its basic concepts, providing programming examples and guidelines for achieving high performance I/O on supercomputers.
Visit: https://www.eudat.eu/eudat-summer-school
This document provides an introduction to MapReduce and Hadoop, including an overview of computing PageRank using MapReduce. It discusses how MapReduce addresses challenges of parallel programming by hiding details of distributed systems. It also demonstrates computing PageRank on Hadoop through parallel matrix multiplication and implementing custom file formats.
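As an illustrative sketch of the idea (not the Hadoop code from the document), one PageRank power iteration can be expressed as a map over edges followed by a per-page reduce; the graph, damping factor, and function names below are assumptions chosen for the example:

```python
from collections import defaultdict

def pagerank_iteration(links, ranks, damping=0.85):
    """One PageRank power iteration expressed as a map and a reduce.
    links: {page: [outgoing neighbors]}, ranks: {page: current rank}."""
    # Map: each page distributes its rank evenly across its out-links.
    contributions = []
    for page, neighbors in links.items():
        share = ranks[page] / len(neighbors)
        for n in neighbors:
            contributions.append((n, share))
    # Reduce: sum the contributions arriving at each page.
    summed = defaultdict(float)
    for page, share in contributions:
        summed[page] += share
    n_pages = len(links)
    return {p: (1 - damping) / n_pages + damping * summed[p] for p in links}

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {p: 1 / 3 for p in links}
for _ in range(20):
    ranks = pagerank_iteration(links, ranks)
```

In a MapReduce job the "map" loop and the "reduce" summation run on different machines, with the framework shuffling the (page, share) pairs between them; the arithmetic is the same.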
This document provides an overview of Map & Reduce, a programming model for processing large datasets in parallel. It describes how Map & Reduce works by applying mapping functions to each element to generate intermediate key-value pairs, shuffling and sorting the data, then applying reduction functions to aggregate the values associated with each key. As an example, it walks through how the "word count" problem can be solved using Map & Reduce. Finally, it briefly discusses Google's implementation of MapReduce and the Apache Hadoop framework.
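The word-count walkthrough above can be sketched in a few lines of Python; this is an illustrative single-machine analogue of the map/shuffle/reduce phases, not the document's own code:

```python
from itertools import groupby
from operator import itemgetter

def word_count(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    pairs = [(word, 1) for doc in documents for word in doc.split()]
    # Shuffle/sort: group the intermediate pairs by key.
    pairs.sort(key=itemgetter(0))
    # Reduce: sum the values associated with each key.
    return {word: sum(v for _, v in group)
            for word, group in groupby(pairs, key=itemgetter(0))}

counts = word_count(["to be or not to be", "be happy"])
# counts["be"] == 3, counts["to"] == 2
```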
Ufuk Celebi – Stream & Batch Processing in one System (Flink Forward)
The document describes the architecture and execution model of Apache Flink. Flink uses a distributed dataflow model where a job is represented as a directed acyclic graph of operators. The client submits this graph to the JobManager, which schedules tasks across TaskManagers. Tasks communicate asynchronously through data channels to process bounded and unbounded data in a pipelined fashion.
The document discusses Apache Flink's Gelly library for large-scale graph processing. Gelly provides a high-level API on top of Flink for graph analytics and iterative algorithms. The document covers how to use Gelly to create and transform graphs, perform graph mutations, run vertex-centric and gather-sum-apply iterations, and provides examples for algorithms like shortest paths, community detection, and analyzing music listening data as a graph.
The document summarizes Jimmy Lin's MapReduce tutorial for WWW 2013. It discusses the MapReduce algorithm design and implementation. Specifically, it covers key aspects of MapReduce like local aggregation to reduce network traffic, sequencing computations by manipulating sort order, and using appropriate data structures to accumulate results incrementally. It also provides an example of building a term co-occurrence matrix to measure semantic distance between words.
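The local-aggregation idea from the tutorial can be sketched as follows; this is an illustrative comparison of a naive mapper against an in-mapper-combining mapper, with the document text and function names as my own assumptions:

```python
from collections import defaultdict

def mapper_no_combining(doc):
    # Naive mapper: one (word, 1) pair per token crosses the network.
    return [(word, 1) for word in doc.split()]

def mapper_in_combining(doc):
    # In-mapper combining: aggregate inside the mapper and emit one
    # (word, partial_count) pair per distinct word, cutting shuffle traffic.
    counts = defaultdict(int)
    for word in doc.split():
        counts[word] += 1
    return list(counts.items())

doc = "the quick the lazy the dog"
naive = mapper_no_combining(doc)     # 6 pairs emitted
combined = mapper_in_combining(doc)  # 4 pairs emitted
```

The reducer logic is unchanged in both cases (sum the values per key); only the volume of intermediate data shrinks.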
The document discusses large-scale neural modeling using the MapReduce and Giraph frameworks. It presents different MapReduce implementations and improvements, such as in-mapper combining and the Schimmy pattern, to optimize graph algorithms. Giraph is also introduced as a vertex-centric framework for iterative graph processing. Performance comparisons show Giraph has superior performance over MapReduce-based approaches for neural simulations, completing iterations 6-91% faster. While Hadoop can model large neural networks, Giraph's vertex-centric approach with in-memory computation is better optimized for many iterations.
This document provides a summary of MapReduce algorithms. It begins with background on the author's experience blogging about MapReduce algorithms in academic papers. It then provides an overview of MapReduce concepts including the mapper and reducer functions. Several examples of recently published MapReduce algorithms are described for tasks like machine learning, finance, and software engineering. One algorithm is examined in depth for building a low-latency key-value store. Finally, recommendations are provided for designing MapReduce algorithms including patterns, performance, and cost/maintainability considerations. An appendix lists additional MapReduce algorithms from academic papers in areas such as AI, biology, machine learning, and mathematics.
Here is how you can solve this problem using MapReduce and Unix commands:
Map step:
grep -o 'Blue\|Green' input.txt | wc -l > output
This uses grep to search the input file for the strings "Blue" or "Green" and print only the matches, one per line. The matches are piped to wc, which counts the lines (matches) and writes the total to the file output.
Reduce step:
cat output
This isn't really needed, as there is only one mapper: cat simply prints the contents of the output file, which holds the combined count of Blue and Green matches.
So MapReduce has been simulated using grep for the map and cat for the reduce functionality. The key aspects are that grep extracts the relevant data (map), while wc and cat aggregate and report the counts (reduce).
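Note that the grep pipeline yields a single combined total. A closer analogue of MapReduce keeps counts per key, the usual word-count shape; a minimal Python sketch (the sample text is an assumption for illustration):

```python
import re
from collections import Counter

def map_phase(text):
    # Map: extract every occurrence of "Blue" or "Green" (like grep -o).
    return re.findall(r"Blue|Green", text)

def reduce_phase(matches):
    # Reduce: count occurrences per key rather than one combined total.
    return Counter(matches)

text = "Blue Green Blue Red Blue Green"
counts = reduce_phase(map_phase(text))
# counts["Blue"] == 3, counts["Green"] == 2
```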
The document presents an introduction to MapReduce. It discusses how MapReduce provides an easy framework for distributed computing by allowing programmers to write simple map and reduce functions without worrying about complex distributed systems issues. It outlines Google's implementation of MapReduce and how it uses the Google File System for fault tolerance. Alternative open-source implementations like Apache Hadoop are also covered. The document discusses how MapReduce has been widely adopted by companies to process massive amounts of data and analyzes some criticism of MapReduce from database experts. It concludes by noting trends in using MapReduce as a parallel database and for multi-core processing.
This document introduces MapReduce, including its architecture, advantages, frameworks for writing MapReduce programs, and an example WordCount MapReduce program. It also discusses how to compile, deploy, and run MapReduce programs using Hadoop and Eclipse.
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo (Accumulo Summit)
D4M is a software tool that connects scientists with big data technologies like Apache Accumulo. The D4M-Accumulo binding provides high performance connectivity to Accumulo for quick analytic prototyping. Current research looks to implement GraphBLAS server-side iterators and operators on Accumulo tables to support high performance graph analytics.
LWA 2015: The Apache Flink Platform for Parallel Batch and Stream Analysis (Jonas Traub)
This is our presentation for the German paper "Die Apache Flink Plattform zur parallelen Analyse von Datenströmen und Stapeldaten" which was published in Proceedings of the
LWA 2015 Workshops: KDML, FGWM, IR, and FGDB. Trier, Germany, 7.-9. October 2015. Link: http://ceur-ws.org/Vol-1458/H02_CRC79_Traub.pdf
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasting (Jen Aman)
Kyle Foreman presented on using Spark for large-scale global health simulations. The talk discussed (1) the motivation for simulations to model disease burden forecasts and alternative scenarios, (2) SimBuilder for constructing modular simulation workflows as directed acyclic graphs, and (3) benchmarks showing Spark backends can efficiently distribute simulations across a cluster. Future work aims to optimize Spark DataFrame joins and take better advantage of Numpy's vectorization for panel data simulations.
This document provides an overview of Apache Flink, an open-source stream processing framework. It discusses Flink's capabilities in supporting streaming, batch, and iterative processing natively through a streaming dataflow model. It also describes Flink's architecture including the client, job manager, task managers, and various execution setups like local, remote, YARN, and embedded. Finally, it compares Flink to other stream and batch processing systems in terms of their APIs, fault tolerance guarantees, and strengths.
A relatively short Introduction to R as presented at the Belgian Software Craftsmanship meetup group.
The goal of this presentation is to give you an introduction to:
• The style of the language
• Its ecosystem
• How common things like data manipulation and visualization work
• How to use it for machine learning
• Web development and report generation in R
• Integrating R in your system
License:
Introduction To R by Samuel Bosch
To the extent possible under law, the person who associated CC0 with Introduction To R has waived all copyright and related or neighboring rights
to Introduction To R.
http://creativecommons.org/publicdomain/zero/1.0/
Best Hadoop Institutes: Kelly Technologies is the best Hadoop training institute in Bangalore, providing Hadoop courses taught by real-time faculty in Bangalore.
All AI Roads lead to Distribution - Dot AI (Jim Dowling)
The document discusses how all roads of artificial intelligence lead to distributed systems and computing. It provides examples of how companies like Facebook and Google have improved the accuracy and training times of image recognition models by utilizing larger distributed training datasets and systems with thousands of GPUs and TPUs. The future of AI will rely on techniques like distributed deep learning, hyperparameter optimization, and elastic model serving that can scale computation across large computing clusters in the cloud or on-premise.
Apache Giraph is a large-scale graph processing system built on Hadoop. It provides an iterative processing model and vertex-centric programming model for graphs that can be too large for a single machine. Giraph scales to graphs with trillions of edges by distributing computation across a Hadoop cluster. It is faster than traditional MapReduce approaches for graph algorithms and allows graphs to be processed in memory across iterations while only writing intermediate data to disk.
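The vertex-centric, bulk-synchronous model Giraph exposes can be sketched in miniature; this is an illustrative single-machine toy (Giraph itself is a distributed Java system), using label propagation to find connected components, with all names and the sample graph assumed for the example:

```python
def vertex_centric_min_label(edges, vertices):
    """Toy bulk-synchronous vertex-centric loop. Each superstep, every
    vertex sends its label to its neighbors, then adopts the smallest
    label it has seen; connected components converge to one label."""
    labels = {v: v for v in vertices}
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    changed = True
    while changed:  # one loop pass == one superstep
        changed = False
        # Message passing: collect neighbors' labels before any update,
        # so all vertices see a consistent snapshot (BSP semantics).
        inbox = {v: [labels[u] for u in neighbors[v]] for v in vertices}
        # Compute: each vertex keeps the minimum label it has seen.
        for v in vertices:
            best = min(inbox[v] + [labels[v]])
            if best < labels[v]:
                labels[v] = best
                changed = True
    return labels

labels = vertex_centric_min_label([(1, 2), (2, 3), (4, 5)], [1, 2, 3, 4, 5])
# vertices 1-3 converge to label 1; vertices 4-5 to label 4
```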
Apache Flink: API, runtime, and project roadmap (Kostas Tzoumas)
The document provides an overview of Apache Flink, an open source stream processing framework. It discusses Flink's programming model using DataSets and transformations, real-time stream processing capabilities, windowing functions, iterative processing, and visualization tools. It also provides details on Flink's runtime architecture, including its use of pipelined and staged execution, optimizations for iterative algorithms, and how the Flink optimizer selects execution plans.
VENUS: Vertex-Centric Streamlined Graph Computation on a Single PC (Qin Liu)
VENUS is a single-machine graph computation system that uses a vertex-centric streamlined processing model to reduce data access and I/O compared to other disk-based systems like GraphChi and X-Stream. It separates immutable edge data from mutable vertex data and shards graphs across disk to load smaller vertex shards into memory for parallel processing. Evaluation on large real-world graphs like Twitter and Clueweb12 showed VENUS outperformed competitors by completing PageRank up to 2x faster while also supporting other algorithms on billion-scale graphs.
What's new in 1.9.0 blink planner - Kurt Young, Alibaba (Flink Forward)
Flink 1.9.0 added the ability to support multiple SQL planners under the same API. With this change, we successfully merged many features from Alibaba's internal Flink version, called Blink. In this talk, I will give an introduction to the architecture of the Blink planner, and also share the functionalities and performance enhancements we added.
This document provides an introduction to stream processing with Apache Flink. It discusses why streaming is important, the key parts of a streaming infrastructure, and gives an example of how Bouygues Telecom uses Flink for stream processing. It then provides an overview of what Apache Flink is, including its unified batch and stream processing capabilities. The rest of the document focuses on stream processing features in Flink, including its DataStream API, flexible windowing options, support for iterative processing, fault tolerance mechanisms, and exactly-once processing semantics.
High Performance Distributed Systems with CQRS (Jonathan Oliver)
This document discusses the architectural pattern of Command Query Responsibility Segregation (CQRS). It summarizes that CQRS separates read (query) and write (command) operations into different models to allow for more scalability and performance. Queries use a read-only data store optimized for reading, while commands express user intentions and are validated before being asynchronously processed to update data. The pattern allows for eventual consistency by keeping query data slightly stale, and improves scalability by allowing separate optimization of queries and commands.
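A minimal sketch of the command/query split described above; this is an illustrative toy (not Oliver's implementation), with the inventory domain, class, and method names all assumed for the example:

```python
class InventoryCQRS:
    """CQRS in miniature: commands mutate a write model and append
    events; a projector applies events to a separate, read-optimized
    model that may lag behind (eventual consistency)."""

    def __init__(self):
        self.write_model = {}   # authoritative state
        self.events = []        # pending events, not yet projected
        self.read_model = {}    # denormalized view served to queries

    def handle_add_stock(self, sku, qty):  # command side
        if qty <= 0:
            raise ValueError("quantity must be positive")  # validate intent
        self.write_model[sku] = self.write_model.get(sku, 0) + qty
        self.events.append(("stock_added", sku, qty))

    def project(self):  # would run asynchronously in a real system
        for _, sku, qty in self.events:
            self.read_model[sku] = self.read_model.get(sku, 0) + qty
        self.events.clear()

    def query_stock(self, sku):  # query side: cheap to read, possibly stale
        return self.read_model.get(sku, 0)

store = InventoryCQRS()
store.handle_add_stock("widget", 5)
stale = store.query_stock("widget")   # 0: projection has not run yet
store.project()
fresh = store.query_stock("widget")   # 5: read model caught up
```

The window between the command and the projection is exactly the "slightly stale" query data the abstract mentions; each side can then be scaled and optimized independently.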
Building Your First Apache Apex (Next Gen Big Data/Hadoop) Application (Apache Apex)
This document provides an overview of building a first Apache Apex application. It describes the main concepts of an Apex application including operators that implement interfaces to process streaming data within windows. The document outlines a "Sorted Word Count" application that uses various operators like LineReader, WordReader, WindowWordCount, and FileWordCount. It also demonstrates wiring these operators together in a directed acyclic graph and running the application to process streaming data.
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data) (Apache Apex)
Presenter:
Priyanka Gugale, Committer for Apache Apex and Software Engineer at DataTorrent.
In this session we will cover an introduction to YARN, understand the YARN architecture, and look into the YARN application lifecycle. We will also learn how Apache Apex runs as one of the YARN applications in Hadoop.
Towards True Elasticity of Spark - Michael Le and Min Li, IBM (Spark Summit)
Scaling Spark workloads on YARN and Mesos can provide significant performance improvements but the benefits vary across different workloads. Adding resources alone may not fully utilize the new nodes due to delay in scheduling tasks locally on the new nodes. Tuning the locality wait time parameter in Spark to quickly change task placement preference can help make better use of new resources. Dynamic executor allocation in Spark can also be enhanced to dynamically adjust configuration settings like locality wait time during auto-scaling.
Windowing in Apache Apex divides unbounded streaming data into finite time slices called windows to allow for computation. It uses time as a reference to break streams into windows, addressing issues like failure recovery and providing frames of reference. Operators can perform window-level processing by implementing callbacks for window start and end. Windows provide rolling statistics by accumulating results over multiple windows and emitting periodically. Windowing has lower latency than micro-batch systems as records are processed immediately rather than waiting for batch boundaries.
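The window-level callbacks described above can be sketched as follows; this is an illustrative toy in the style Apex describes (Apex operators are Java), with the counter operator and sample stream assumed for the example:

```python
class WindowedCounter:
    """Tumbling-window operator sketch: records are processed immediately
    as they arrive, and begin/end window callbacks bracket each time
    slice, emitting one result per window."""

    def __init__(self):
        self.results = []

    def begin_window(self, window_id):
        self.count = 0  # reset per-window state

    def process(self, record):
        self.count += 1  # per-record processing, no batching delay

    def end_window(self, window_id):
        self.results.append((window_id, self.count))  # emit window result

op = WindowedCounter()
stream = [(0, "a"), (0, "b"), (1, "c")]  # (window_id, record) pairs
current = None
for window_id, record in stream:
    if window_id != current:
        if current is not None:
            op.end_window(current)
        op.begin_window(window_id)
        current = window_id
    op.process(record)
op.end_window(current)
# op.results == [(0, 2), (1, 1)]
```

Because each record is handled in process() the moment it arrives, latency stays per-record; only the emitted aggregates are paced by window boundaries, which is the contrast with micro-batch systems the abstract draws.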
The 5 People in your Organization that grow Legacy Code (Roberto Cortez)
Have you ever looked at a random piece of code and wanted to rewrite it so badly? It’s natural to have legacy code in your application at some point. It’s something that you need to accept and learn to live with. So is this a lost cause? Should we just throw in the towel and give up? Hell no! Over the years, I learned to identify 5 main creators/enablers of legacy code on the engineering side, which I’m sharing here with you using real development stories (with a little humour in the mix). Learn to keep them in line and your code will live longer!
This document provides an overview of basic Hadoop commands for interacting with the Hadoop Distributed File System (HDFS). It lists commands for creating directories, listing files, copying data between local and HDFS, copying within HDFS, viewing file contents, deleting files, getting help for commands, and viewing HDFS through a web browser. Contact information is provided at the end for additional support.
Introduction to Apache Apex and writing a big data streaming application (Apache Apex)
Introduction to Apache Apex, the next generation native Hadoop platform, and writing a native Hadoop big data streaming application with Apache Apex.
This talk will cover details about how Apex can be used as a powerful and versatile platform for big data. Apache Apex is being used in production by customers for both streaming and batch use cases. Common usage of Apache Apex includes big data ingestion, streaming analytics, ETL, fast batch, alerts, real-time actions, threat detection, etc.
Presenter: Pramod Immaneni, Apache Apex PPMC member and senior architect at DataTorrent Inc, where he works on Apex and specializes in big data applications. Prior to DataTorrent he was a co-founder and CTO of Leaf Networks LLC, eventually acquired by Netgear Inc, where he built products in the core networking space and was granted patents in peer-to-peer VPNs. Before that he was a technical co-founder of a mobile startup, where he was an architect of a dynamic content rendering engine for mobile devices.
This is a video of the webcast of an Apache Apex meetup event organized by Guru Virtues at 267 Boston Rd no. 9, North Billerica, MA, on May 7th 2016, and broadcast from San Jose, CA. If you are interested in helping organize (i.e., hosting, presenting, community leadership) the Apache Apex community, please email apex-meetup@datatorrent.com
HDFS stores files as blocks that are by default 64 MB in size to minimize disk seek times. The namenode manages the file system namespace and metadata, tracking which datanodes store each block. When writing a file, HDFS breaks it into blocks and replicates each block across multiple datanodes. The secondary namenode periodically merges namespace and edit log changes to prevent the log from growing too large. Small files are inefficient in HDFS due to each file requiring namespace metadata regardless of size.
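A small worked example of the block arithmetic; the 200 MB file size is an assumption for illustration, and replication factor 3 is HDFS's usual default (the summary above doesn't state one):

```python
import math

def hdfs_block_count(file_size_mb, block_size_mb=64, replication=3):
    # A file is split into fixed-size blocks; the last block may be
    # partial (HDFS stores only the actual data for a partial block).
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is replicated, so raw storage is a multiple of the
    # logical file size.
    raw_storage_mb = file_size_mb * replication
    return blocks, raw_storage_mb

blocks, raw = hdfs_block_count(200)  # 200 MB file, default 64 MB blocks
# blocks == 4 (3 full blocks + one 8 MB block), raw == 600 MB
```

The same arithmetic shows why many small files are costly: a thousand 1 MB files need a thousand blocks' worth of namenode metadata, versus sixteen blocks for one 1000 MB file.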
Apache Hadoop: design and implementation. Lecture in the Big data computing course (http://twiki.di.uniroma1.it/twiki/view/BDC/WebHome), Department of Computer Science, Sapienza University of Rome.
Slides from the Introduction to UNIX Command-Lines class from the BTI Plant Bioinformatics course 2014. This is a course taught by the Sol Genomics Network researchers at the Boyce Thompson Institute.
Measuring the time spent on small individual fractions of program code is a common technique for analysing performance behavior and detecting performance bottlenecks. The benefits of the approach include a detailed individual attribution of performance and understandable feedback loops when experimenting with different code versions. There are however severe pitfalls when following this approach that can lead to vastly misleading results. Modern dynamic compilers use complex optimisation techniques that take a large part of the program into account. There can therefore be unexpected side-effects when combining different code snippets or even when running a presumably unrelated part of the code. This talk will present performance paradoxes with examples from the domain of dynamic compilation of Java programs. Furthermore, it will discuss an alternative approach to modelling code performance characteristics that takes the challenges of complex optimising compilers into account.
The Download: Tech Talks by the HPCC Systems Community, Episode 11HPCC Systems
Join us as we continue this series of webinars specifically designed for the community by the community with the goal to share knowledge, spark innovation and further build and link the relationships within our HPCC Systems community.
Episode 11 includes Tech Talks featuring speakers from our community on topics covering Big Data solutions, Spark Integration and other ECL Tips leveraging the HPCC Systems platform.
1) Raj Chandrasekaran, CTO & Co-Founder, ClearFunnel - Scaling Data Science capabilities: Leveraging a homogeneous Big Data ecosystem
2) James McMullan, Software Engineer III, LexisNexis Risk Solutions - HDFS Connector Preview
3) Bob Foreman, Senior Software Engineer, LexisNexis Risk Solutions - Building a RELATIONal Dataset - A Valentine’s Day Special!
Meetup: Big Data NLP with HPCC Systems® - A Development Ride from Spray to TH...HPCC Systems
HPCC (High Performance Computing Cluster) Systems from LexisNexis is an open source massive parallel-processing computing platform that solves Big Data problems. In this talk, attendees will be given an overview of HPCC Systems and see a demonstration of its use to parse data from free-form and semi-structured text. This represents a combined text extraction task with human intervention. The code elements and massively parallel processing principles involved in accomplishing these tasks will be thoroughly discussed.
This document provides an overview of using TensorFlow and Quarkus to build intelligent applications that serve machine learning models. It begins with an introduction and agenda. It then discusses TensorFlow and how it can be used to build and train machine learning models. It demonstrates how a TensorFlow model can be served using Quarkus and consumed via HTTP requests. The technical benefits of serving models with Quarkus are described. Finally, use cases, additional resources, and a Q&A section are outlined.
TensorFlow is an open source library for numerical computation using data flow graphs. It allows expressing machine learning algorithms as graphs with nodes representing operations and edges representing the flow of data between nodes. The graphs can then be executed across multiple CPUs and GPUs. Clipper is a system for low latency online prediction serving built using TensorFlow. It aims to handle high query volumes for complex machine learning models.
HP Labs: Titan DB on LDBC SNB interactive by Tomer Sagi (HP)Ioan Toma
HP has a long history of innovation dating back to its founding in a Palo Alto garage in 1939. Some of its notable innovations include the first programmable calculator in 1968, the first pocket scientific calculator in 1972, launching the first inkjet printer in 1984, and being first to commercialize RISC technology in 1986. More recently, HP Labs has developed technologies like ePrint in 2010, 3D Photon technology in 2011, and Project Moonshot in 2013. Going forward, HP Labs is focusing its research on systems, networking, security, analytics, and printing to deliver the fastest and most efficient route from data to value.
HDR Defence - Software Abstractions for Parallel ArchitecturesJoel Falcou
Performing large, intensive or non-trivial computations on array-like data structures is one of the most common tasks in scientific computing, video game development and other fields. This is reflected in the large number of tools, languages and libraries for performing such tasks. If we restrict ourselves to C++-based solutions, more than a dozen such libraries exist, from BLAS/LAPACK C++ bindings to the template metaprogramming based Blitz++ or Eigen. While all of these libraries provide good performance or good abstraction, none of them seems to fit the needs of so many different user types. Moreover, as parallel system complexity grows, maintaining all those components quickly becomes unwieldy. This thesis explores various software design techniques (Generative Programming, Metaprogramming and Generic Programming) and their application to the implementation of various parallel computing libraries in such a way that abstraction and expressiveness are maximized while efficiency overhead is minimized.
This workshop will provide a hands-on introduction to Apache Spark and Apache Zeppelin in the cloud.
Format: A short introductory lecture on Apache Spark covering core modules (SQL, Streaming, MLlib, GraphX) followed by a demo, lab exercises and a Q&A session. The lecture will be followed by lab time to work through the lab exercises and ask questions.
Objective: To provide a quick and short hands-on introduction to Apache Spark. This lab will use the following Spark and Apache Hadoop components: Spark, Spark SQL, Apache Hadoop HDFS, Apache Hadoop YARN, Apache ORC, Apache Ambari and Apache Zeppelin. You will learn how to move data into HDFS using Spark APIs, create Apache Hive tables, explore the data with Spark and Spark SQL, transform the data and then issue some SQL queries.
Lab pre-requisites: Registrants must bring a laptop with a Chrome or Firefox web browser installed (with proxies disabled). Alternatively, they may download and install an HDP Sandbox as long as they have at least 16GB of RAM available (Note that the sandbox is over 10GB in size so we recommend downloading it before the crash course).
Speakers: Robert Hryniewicz
For the full video of this presentation, please visit:
https://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/may-2018-embedded-vision-summit-trevett
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Neil Trevett, President of the Khronos Group and Vice President at NVIDIA, presents the "APIs for Accelerating Vision and Inferencing: Options and Trade-offs" tutorial at the May 2018 Embedded Vision Summit.
The landscape of SDKs, APIs and file formats for accelerating inferencing and vision applications continues to rapidly evolve. Low-level compute APIs, such as OpenCL, Vulkan and CUDA are being used to accelerate inferencing engines such as OpenVX, CoreML, NNAPI and TensorRT. Inferencing engines are being fed via neural network file formats such as NNEF and ONNX. Some of these APIs, like OpenCV, are vision-specific, while others, like OpenCL, are general-purpose. Some engines, like CoreML and TensorRT, are supplier-specific, while others, such as OpenVX, are open standards that any supplier can adopt. Which ones should you use for your project?
In this presentation, Trevett presents the current landscape of APIs, file formats and SDKs for inferencing and vision acceleration, explaining where each one fits in the development flow. Trevett also highlights where these APIs overlap and where they complement each other, and previews some of the latest developments in these APIs.
Chapel-on-X: Exploring Tasking Runtimes for PGAS LanguagesAkihiro Hayashi
With the shift to exascale computer systems, the importance of productive programming models for distributed systems is increasing. Partitioned Global Address Space (PGAS) programming models aim to reduce the complexity of writing distributed-memory parallel programs by introducing global operations on distributed arrays, distributed task parallelism, directed synchronization, and mutual exclusion. However, a key challenge in the application of PGAS programming models is the improvement of compilers and runtime systems. In particular, one open question is how runtime systems meet the requirement of exascale systems, where a large number of asynchronous tasks are executed.
While there are various tasking runtimes such as Qthreads, OCR, and HClib, there is no existing comparative study on PGAS tasking/threading runtime systems. To explore runtime systems for PGAS programming languages, we have implemented OCR-based and HClib-based Chapel runtimes and evaluated them with an initial focus on tasking and synchronization implementations. The results show that our OCR and HClib-based implementations can improve the performance of PGAS programs compared to the existing Qthreads backend of Chapel.
Deployment of an HPC Cloud based on Intel hardwareIntel IT Center
The document discusses a potential HPC cloud solution using Intel hardware for a technical university. It describes the customer's requirements for an IaaS solution that provides both physical and virtual deployment of computing resources and applications. The proposed solution involves using IBM software like PCM-AE, PAC and LSF on an Intel-based hardware cluster from transtec to provide self-service provisioning and management of HPC resources for the university's users. The implementation would include features like integration with the university's LDAP directory and deployment of Linux and Windows virtual clusters.
Hopsworks - ExtremeEarth Open WorkshopExtremeEarth
This document summarizes a presentation about the three-year ExtremeEarth project. It discusses the ExtremeEarth platform architecture, which brings together Earth observation data access from DIASes, end-user products from TEPs, and scalable AI capabilities from Hopsworks. The architecture provides infrastructure on Creodias and uses Hopsworks to develop end-to-end machine learning pipelines for processing petabytes of Earth observation data. Results have been exploited through additional research projects and a product offering on Hopsworks.ai. The project has also led to several publications and blog posts about applying AI to Earth observation data.
Slides from the talk:
Aleš Zamuda. EuroHPC AI in DAPHNE. Severo Ochoa Research Seminars. 12/Sep/2023, 1-3-2 Room, BSC Main Building and Zoom. Barcelona Supercomputing Center, Barcelona, Spain.
Architecting a Heterogeneous Data Platform Across Clusters, Regions, and CloudsAlluxio, Inc.
Alluxio foresaw the need for agility when accessing data across silos separated from compute engines like Spark, Presto, Tensorflow and PyTorch. Embracing the separation of storage from compute, the Alluxio data orchestration platform simplifies adoption of the data lake and data mesh paradigm for analytics and AI/ML. In this talk, Bin Fan will share observations to help identify ways to use the platform to meet the needs of your data environment and workloads.
More and more enterprise architectures are moving to hybrid-cloud and multi-cloud environments. While this shift brings greater flexibility and agility, it also means separating compute from storage, which poses new challenges for managing and orchestrating data across frameworks, clouds and storage systems. This talk gives the audience an in-depth look at how Alluxio's data orchestration approach decouples storage and compute in the data platform, and at the innovative architecture data orchestration proposes for compute/storage-separated scenarios, illustrated with typical application scenarios from the finance, telecom and internet industries, showing how Alluxio brings real acceleration to big data computation and how data orchestration technology can be used for AI model training.
At the technology meeting of the Association of Independent Research Centers (http://airi.org): An overview of recent Scientific Computing activities at Fred Hutch, Seattle
Esteban Hernandez is a PhD candidate researching heterogeneous parallel programming for weather forecasting. He has 12 years of experience in software architecture, including Linux clusters, distributed file systems, and high performance computing (HPC). HPC involves using the most efficient algorithms on high-performance computers to solve demanding problems. It is used for applications like weather prediction, fluid dynamics simulations, protein folding, and bioinformatics. Performance is often measured in floating point operations per second. Parallel computing using techniques like OpenMP, MPI, and GPUs is key to HPC. HPC systems are used across industries for applications like supply chain optimization, seismic data processing, and drug development.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2021/08/khronos-group-standards-powering-the-future-of-embedded-vision-a-presentation-from-the-khronos-group/
Neil Trevett, Vice President of Developer Ecosystems at NVIDIA and President of the Khronos Group, presents the “Khronos Group Standards: Powering the Future of Embedded Vision” tutorial at the May 2021 Embedded Vision Summit.
Open standards play an important role in enabling interoperability for faster, easier deployment of vision-based systems. With advances in machine learning, the number of accelerators, processors, libraries and compilers in the market is rapidly increasing. Proprietary APIs and formats create a complex industry landscape that can hinder overall market growth.
The Khronos Group’s open standards for accelerating parallel programming play a major role in deploying inferencing and embedded vision applications and include SYCL, OpenVX, NNEF, Vulkan, SPIR, and OpenCL. Trevett provides an up-to-the-minute overview and update on the Khronos embedded vision ecosystem, highlighting the capabilities and benefits of each API, giving viewers insight into which standards may be relevant to their own embedded vision projects, and discussing the future directions of these key industry initiatives.
C++ on its way to exascale and beyond -- The HPX Parallel Runtime System
1. C++ on its way to exascale and beyond – The HPX Parallel Runtime System
Thomas Heller (thomas.heller@cs.fau.de)
January 21, 2016
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 671603
2. What is Exascale anyway?
3. Exascale in numbers
• An exascale computer is supposed to execute 10^18 floating point operations per second
• Exa: 10^18 = 1,000,000,000,000,000,000
• People on Earth: 7.3 billion = 7.3 × 10^9
• Imagine each person is able to compute one operation per second. It takes:
⇒ 136,986,301 seconds
⇒ 2,283,105 minutes
⇒ 38,051 hours
⇒ 1,585 days
⇒ about 4 years
C++ on its way to exascale and beyond – The HPX Parallel Runtime System | 21.01.2016 | Thomas Heller | 3/51
4. Why do we need that many calculations?
12. Challenges
• How do we program those beasts?
⇒ Massively parallel processors
⇒ A massive number of compute nodes
⇒ Deep memory hierarchies
• How can we design the architecture to be affordable?
⇒ Biggest operational cost is energy
⇒ Power envelope of 20 MW
⇒ Currently fastest computer (Tianhe-2): 17 MW
13. Current Development
Current #1 system:
• Tianhe-2: 33.9 PFLOPS
• Roughly 3.4% of an exaflop
14. Hardware Trends
• ARM: Low-Power ARM64 cores (maybe adding embedded GPU
accelerators)
• IBM: POWER + NVIDIA Accelerators
• Intel: Knights Landing (Xeon Phi) Many Core processor
15. How will C++ deal with all that?!?
16. Challenges
• Programmability
• Expressing Parallelism
• Expressing Data Locality
17. The 4 Horsemen of the Apocalypse: SLOW
Starvation
Latency
Overhead
Waiting for contention
18. State of the Art
• Modern architectures impose massive challenges on programmability in
the context of performance portability
• Massive increase in on-node parallelism
• Deep memory hierarchies
• The only portable parallelization solutions for C++ programmers (today):
OpenMP and MPI
• Hugely successful for years
• Widely used and supported
• Simple use for simple use cases
• Very portable
• Highly optimized
19. State of the Art – Parallelism in C++
• C++11 introduced lower level abstractions
• std::thread, std::mutex, std::future, etc.
• Fairly limited, more is needed
• C++ needs stronger support for higher-level parallelism
• Several proposals to the Standardization Committee are accepted or
under consideration
• Technical Specification: Concurrency (P0159, note: misnomer)
• Technical Specification: Parallelism (P0024)
• Other smaller proposals: resumable functions, task regions, executors
• Currently there is no overarching vision related to higher-level parallelism
• Goal is to standardize a ‘big story’ by 2020
• No need for OpenMP, OpenACC, OpenCL, etc.
20. Stepping Aside – Introducing HPX
21. HPX – A general purpose parallel Runtime System
• Solidly based on a theoretical foundation – a well defined, new execution
model (ParalleX)
• Exposes a coherent and uniform, standards-oriented API for ease of
programming parallel and distributed applications.
• Enables writing fully asynchronous code using hundreds of millions of threads.
• Provides unified syntax and semantics for local and remote operations.
• Open Source: Published under the Boost Software License
22. HPX – A general purpose parallel Runtime System
HPX represents an innovative mixture of
• A global system-wide address space (AGAS - Active Global Address
Space)
• Fine grain parallelism and lightweight synchronization
• Combined with implicit, work queue based, message driven computation
• Full semantic equivalence of local and remote execution, and
• Explicit support for hardware accelerators (through percolation)
23. HPX 101 – The programming model
[Diagram: N localities (0, 1, …, i, …, N-1), each with its own local memory]
24. HPX 101 – The programming model
[Diagram: the localities joined into a Global Address Space; each locality runs a thread scheduler, connected via the parcelport and the Active Global Address Space (AGAS) service]
25. HPX 101 – The programming model
[Diagram: as before, with many lightweight threads running on each locality's thread scheduler]
26. HPX 101 – The programming model
[Diagram: as before; a component is created on a remote locality and an action is invoked on it]

future<id_type> id = new_<Component>(locality, ...);
future<R> result = async(id.get(), action, ...);
27. HPX 101 – The programming model
[Diagram: the runtime view without the memory detail: localities with thread schedulers, the parcelport and the AGAS service]
28. HPX 101 – Overview
[Diagram: HPX layered on top of the C++ Standard Library and C++]

For a function R f(p...) (actions are declared once with HPX_ACTION(f, a)):

                     | Synchronous (returns R)  | Asynchronous (returns future<R>) | Fire & Forget (returns void)
Functions (direct)   | f(p...)                  | async(f, p...)                   | apply(f, p...)
Functions (lazy)     | bind(f, p...)(...)       | async(bind(f, p...), ...)        | apply(bind(f, p...), ...)
Actions (direct)     | a()(id, p...)            | async(a(), id, p...)             | apply(a(), id, p...)
Actions (lazy)       | bind(a(), id, p...)(...) | async(bind(a(), id, p...), ...)  | apply(bind(a(), id, p...), ...)

In addition: dataflow(func, f1, f2);
29. The Future, an example
int universal_answer() { return 42; }

void deep_thought() {
    future<int> promised_answer = async(&universal_answer);
    // do other things for 7.5 million years
    cout << promised_answer.get() << endl;  // prints 42, eventually
}
30. Compositional facilities
• Sequential composition of futures
future<string> make_string() {
    future<int> f1 = async([]() -> int { return 123; });
    future<string> f2 = f1.then(
        [](future<int> f) -> string {
            // here .get() won't block
            return to_string(f.get());
        });
    return f2;
}
31. Compositional facilities
• Parallel composition of futures
future<int> test_when_all() {
    future<int> future1 = async([]() -> int { return 125; });
    future<string> future2 = async([]() -> string { return string("hi"); });
    auto all_f = when_all(future1, future2);
    future<int> result = all_f.then(
        [](auto f) -> int {
            return do_work(f.get());
        });
    return result;
}
32. Dataflow – The new ’async’ (HPX)
• What if one or more arguments to ’async’ are futures themselves?
• Normal behavior: pass futures through to function
• Extended behavior: wait for futures to become ready before invoking the
function:
template <typename F, typename... Arg>
future<result_of_t<F(Arg...)>>
// requires(is_callable<F(Arg...)>)
dataflow(F&& f, Arg&&... arg);
• If ArgN is a future, then the invocation of F will be delayed
• Non-future arguments are passed through
33. Parallel Algorithms
34. Concepts of Parallelism – Parallel Execution Properties
• The execution restrictions applicable for the work items
• In what sequence the work items have to be executed
• Where the work items should be executed
• The parameters of the execution environment
35. Concepts and Types of Parallelism
[Diagram, built up across slides 35–40: an Application is written in terms of high-level Concepts (futures, async, dataflow, parallel algorithms, fork-join, etc.). Concepts map onto Execution Policies, which express restrictions; policies are carried out by Executors, which decide the sequence and where work runs, and by Executor Parameters, which control grain size.]
41. Execution Policies (std)
• Specify execution guarantees (in terms of thread safety) for executed
parallel tasks:
• sequential_execution_policy: seq
• parallel_execution_policy: par
• parallel_vector_execution_policy: par_vec
• In the Parallelism TS, these are used for parallel algorithms only
42. Execution Policies (Extensions)
• Asynchronous Execution Policies:
• sequential_task_execution_policy: seq(task)
• parallel_task_execution_policy: par(task)
• In both cases the formerly synchronous functions return a future<>
• Instruct the parallel construct to be executed asynchronously
• Allows integration with asynchronous control flow
43. Executors
• Executors are objects responsible for
• Creating execution agents on which work is performed (P0058)
• In P0058 this is limited to parallel algorithms; here the use is much broader
• Abstraction of the (potentially platform-specific) mechanisms for launching
work
• Responsible for defining the Where and How of the execution of tasks
44. Execution Parameters
Allow control of the grain size of work
• i.e. the number of iterations of a parallel for_each run on the same thread
• Similar to OpenMP scheduling policies: static, guided, dynamic
• Much finer control
45. Putting it all together – SAXPY routine with data locality
• a[i] = b[i] ∗ x + c[i], for i from 0 to N − 1
• Using parallel algorithms
• Explicit control over data locality
• No raw loops
46. Putting it all together – SAXPY routine with data locality
Complete serial version:
std::vector<double> a = ...;
std::vector<double> b = ...;
std::vector<double> c = ...;
double x = ...;
std::transform(b.begin(), b.end(),
    c.begin(), a.begin(),
    [x](double bb, double cc)
    {
        return bb * x + cc;
    });
47. Putting it all together – SAXPY routine with data locality
Parallel version, no data locality:
std::vector<double> a = ...;
std::vector<double> b = ...;
std::vector<double> c = ...;
double x = ...;
parallel::transform(parallel::par,
    b.begin(), b.end(),
    c.begin(), c.end(), a.begin(),
    [x](double bb, double cc)
    {
        return bb * x + cc;
    });
48. Putting it all together – SAXPY routine with data locality
Parallel version, with data locality:
std::vector<double, numa_allocator> a = ...;
std::vector<double, numa_allocator> b = ...;
std::vector<double, numa_allocator> c = ...;
double x = ...;
for (auto& numa_executor : numa_executors) {
    parallel::transform(
        parallel::par.on(numa_executor),
        b.begin() + ..., b.begin() + ...,
        c.begin() + ..., c.begin() + ..., a.begin() + ...,
        [x](double bb, double cc)
        { return bb * x + cc; });
}
49. Case Studies
50. LibGeoDecomp
• C++ Auto-parallelizing framework
• Open Source
• High scalability
• Wide range of platform support
• http://www.libgeodecomp.org
51. LibGeoDecomp
Futurizing the Simulation Flow
Basic simulation flow:
for (Region r : innerRegion) {
    update(r, oldGrid, newGrid, step);
}
swap(oldGrid, newGrid);
++step;
for (Region r : outerGhostZoneRegion) {
    notifyPatchProviders(r, oldGrid);
}
for (Region r : outerGhostZoneRegion) {
    update(r, oldGrid, newGrid, step);
}
for (Region r : innerGhostZoneRegion) {
    notifyPatchAccepters(r, oldGrid);
}
52. LibGeoDecomp
Futurizing the Simulation Flow
Futurized simulation flow (the slide marks the later loops as
continuations of the earlier ones):
parallel for (Region r : innerRegion) {
    update(r, oldGrid, newGrid, step);
}
swap(oldGrid, newGrid); ++step;
parallel for (Region r : outerGhostZoneRegion) {  // continuation
    notifyPatchProviders(r, oldGrid);
}
parallel for (Region r : outerGhostZoneRegion) {  // continuation
    update(r, oldGrid, newGrid, step);
}
parallel for (Region r : innerGhostZoneRegion) {  // continuation
    notifyPatchAccepters(r, oldGrid);
}
53. HPXCL – Extending the Global Address Space
• All GPU devices are addressable globally
• GPU memory can be allocated and referenced remotely
• Events are extensions of the shared state
⇒ API embedded into the already existing future facilities
54. From async to GPUs
Spawning single tasks is not feasible
⇒ offload a work group (think of parallel::for_each)
auto devices = hpx::opencl::find_devices(
    hpx::find_here(), CL_DEVICE_TYPE_GPU).get();
// create buffers, programs and kernels...
hpx::opencl::buffer buf = devices[0].create_buffer(
    CL_MEM_READ_WRITE, 4711);
auto write_future = buf.enqueue_write(
    some_vec.begin(), some_vec.end());
auto kernel_future = kernel.enqueue(dim, write_future);
55. From async to GPUs
Spawning single tasks is not feasible
⇒ offload a work group (think of parallel::for_each)
• Proof of concept
• Future directions:
• Embed OpenCL devices behind execution policies and executors
• Hide OpenCL details behind parallel algorithms
• Hide OpenCL buffer management behind "distributed data structures"
57.–58. Mandelbrot example
(Images: Mandelbrot set renderings; acknowledgements to Martin Stumpf)
59. LibGeoDecomp – Performance Results
(Plot: Execution Times of HPX and MPI N-Body Codes, SMP, Weak Scaling.
X axis: Number of Cores on one Node, 1–16; Y axis: Time [s], 0–70.
Series: Sim HPX, Sim MPI, Comm HPX, Comm MPI.)
61. LibGeoDecomp – Performance Results
(Plot: Weak Scaling Results for HPX N-Body Code, Single Xeon Phi, Futurized.
X axis: Number of Cores, 0–60; Y axis: Performance in GFLOPS, 0–1600.
Series: 1, 2, 3, and 4 Threads/Core.)
62. LibGeoDecomp – Performance Results
(Plot: Weak Scaling Results for HPX N-Body Codes, Host Cores and Xeon Phi
Accelerator. X axis: Number of Nodes, 16 cores on host plus full Xeon Phi,
0–16; Y axis: Performance in TFLOPS, 0–30. Series: HPX, Peak.)
63. STREAM Benchmark
(Plot: TRIAD STREAM Results, 50 million data points. X axis: Number of cores
per NUMA domain, 1–12; Y axis: Bandwidth [GB/s], 10–80. Series: HPX and
OpenMP, each on 1 and 2 NUMA domains.)
64. Matrix Transpose
(Plot: Matrix Transpose, SMP, 24k x 24k matrices. X axis: Number of cores per
NUMA domain, 1–12; Y axis: Data transfer rate [GB/s], 0–60. Series: HPX and
OMP, each on 1 and 2 NUMA domains.)
65. Matrix Transpose
(Plot: Matrix Transpose, SMP, 24k x 24k matrices. X axis: Number of cores per
NUMA domain, 1–12; Y axis: Data transfer rate [GB/s], 0–60. Series: HPX
(2 NUMA domains), MPI (1 NUMA domain, 12 ranks), MPI (2 NUMA domains,
24 ranks), MPI+OMP (2 NUMA domains).)
66. Matrix Transpose
(Plot: Matrix Transpose, Xeon Phi, 24k x 24k matrices. X axis: Number of
cores, 0–60; Y axis: Data transfer rate [GB/s], 0–50. Series: HPX and OMP,
each at 1, 2, and 4 PUs per core.)
67. Matrix Transpose
(Plot: Matrix Transpose, Distributed, 18k x 18k elements per node. X axis:
Number of nodes, 16 cores each, 2–8; Y axis: Data transfer rate [GB/s],
0–35. Series: HPX, MPI.)
68. What’s beyond Exascale?
69. Conclusions
Higher-level parallelization abstractions in C++:
• Uniform, versatile, and generic
• All of this is enabled by the use of modern C++ facilities
• Runtime system (fine-grain, task-based schedulers)
• Performant, portable implementation
70. Parallelism is here to stay!
• Massively parallel hardware is already part of our daily lives!
• Parallelism is observable everywhere:
⇒ IoT: massive numbers of devices existing in parallel
⇒ Embedded: massively parallel, energy-aware systems (Epiphany, DSPs, FPGAs)
⇒ Automotive: massive amounts of parallel sensor data to process
• We all need solutions for how to deal with this, efficiently and pragmatically
71. More Information
• https://github.com/STEllAR-GROUP/hpx
• http://stellar-group.org
• hpx-users@stellar.cct.lsu.edu
• #STE||AR @ irc.freenode.org
Collaborations:
• FET-HPC (H2020): AllScale (https://allscale.eu)
• NSF: STORM (http://storm.stellar-group.org)
• DOE: Part of X-Stack