Workshop on Parallel, Cluster and Cloud Computing on Multi-core & GPU
(PCCCMG - 2015)
Workshop Conducted by
Computer Society of India
In Association with
Dept. of CSE, VNIT and Persistent Systems Ltd, Nagpur
Workshop Dates 4th to 6th September 2015
2. Big-Data Cluster Computing
Advanced tools & technologies
Jagadeesan A S
Software Engineer
Persistent Systems Limited
www.github.com/jagadeesanas2
www.linkedin.com/in/jagadeesanas2
3. Content
Overview of Big Data
• Data clustering concepts
• Clustering vs Classification
• Data Journey
Advanced tools and technologies
• Apache Hadoop
• Apache Spark
Future of analytics
• Demo - Spark RDD in Intellij IDEA
4. Big-Data is similar to Small-Data, but bigger in size and complexity.
What is Big-Data ?
Definition from Wikipedia:
Big data is the term for a collection of data sets so large and complex that
it becomes difficult to process using on-hand database management tools
or traditional data processing applications.
9. What is a Cluster ?
A group of the same or similar elements gathered or occurring closely together.
Clustering is the key to Big Data problem
• Not feasible to “label” a large collection of objects
• No prior knowledge of the number and nature of groups (clusters) in data
• Clusters may evolve over time
• Clustering provides efficient browsing, search, recommendation and organization of data
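To make the clustering idea concrete, here is a minimal sketch of 1-D k-means in plain Python (a hypothetical illustration, not part of the workshop code): each point is assigned to its nearest centroid, and each centroid then moves to the mean of its cluster.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Minimal 1-D k-means: assign each point to the nearest
    centroid, then move each centroid to the mean of its cluster."""
    for _ in range(iterations):
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        centroids = [sum(members) / len(members)
                     for members in clusters.values() if members]
    return sorted(centroids)

# Two natural groups, around 1..3 and 10..12
print(kmeans_1d([1, 2, 3, 10, 11, 12], [0.0, 5.0]))  # [2.0, 11.0]
```

Note that no labels were needed: the two groups emerge from the data itself, which is exactly why clustering scales to collections too large to label by hand.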
16. Large-Scale Data Analytics
MapReduce computing paradigm vs. traditional database systems
Many enterprises have turned to Hadoop, especially for applications generating big data: Web applications, social networks, scientific applications.
17. APACHE HADOOP (Disk Based Computing)
An open-source software framework written in Java for distributed storage and distributed processing.
Design Principles of Hadoop
• Need to process big data
• Need to parallelize computation across thousands of nodes
• Commodity hardware: a large number of cheap, low-end machines working in parallel to solve a computing problem, rather than a small number of expensive, high-end machines
18. Hadoop cluster architecture
A Hadoop cluster can be divided into two abstract entities:
MapReduce engine + distributed file system (HDFS) = Hadoop cluster
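The division of labour between the two entities can be illustrated with the canonical MapReduce example, word count, here as a tiny in-process plain-Python sketch (not Hadoop API code): the map phase emits (word, 1) pairs, the shuffle groups pairs by key, and the reduce phase sums each group.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every line
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["to be or not to be"]
print(reduce_phase(shuffle(map_phase(lines))))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In a real cluster the map and reduce phases run in parallel across nodes, and the distributed file system holds the input and output; the logic per phase is exactly this simple.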
19. What is SPARK
Why SPARK
How to configure SPARK
APACHE SPARK
Open-source cluster computing framework
20. APACHE SPARK (Memory Based Computing)
An open-source software framework, written primarily in Scala, for large-scale distributed data processing.
• Fast cluster computing system for large-scale data processing
compatible with Apache Hadoop
• Improves efficiency through:
• In-memory computing primitives
• General computation graphs
• Improves usability through:
• Rich APIs in Java, Scala, Python
• Interactive shell
Up to 100× faster
Often 2-10× less code
21. Spark Overview
Spark Shell
• Interactive shell for learning or data exploration
• Python or Scala
• Provides a preconfigured Spark context called sc
Spark applications
• For large-scale data processing
• Python, Java, Scala and R
• Every Spark application requires a SparkContext; it is the main entry point to the Spark API
(Screenshots: Scala interactive shell, Python interactive shell)
22. Spark Overview
Resilient distributed datasets (RDDs)
Immutable collections of objects spread across a cluster
Built through parallel transformations (map, filter, etc)
Automatically rebuilt on failure
Controllable persistence (e.g. caching in RAM) for reuse
Shared variables that can be used in parallel operations
Work with distributed collections as we would with local ones
23. Resilient Distributed Datasets (RDDs)
Two types of RDD operation
• Transformations – define a new RDD based on the current one.
Examples: map, filter, flatMap
• Actions – return a value to the driver.
Examples: count, take(n), reduce
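The distinction matters because transformations are lazy: nothing runs until an action asks for a result. That behaviour can be mimicked with plain Python generators (an analogy only, not the Spark API):

```python
def trace(x, log):
    # Record each element as it is actually processed
    log.append(x)
    return x

log = []
data = range(1, 6)

# "Transformations": lazily defined, nothing has executed yet
doubled = (trace(x, log) * 2 for x in data)
big = (x for x in doubled if x > 4)
assert log == []          # no element has been touched yet

# "Action": forces evaluation of the whole pipeline
result = list(big)
print(result)             # [6, 8, 10]
print(log)                # [1, 2, 3, 4, 5]
```

In Spark the same laziness lets the engine fuse a chain of transformations into one pass over the data, only materialising results when an action such as count or take(n) is called.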
24. Resilient Distributed Datasets (RDDs)
I have never seen the horror movies.
I never hope to see one;
But I can tell you, anyhow,
I had rather see than be one.
File: movie.txt RDD: mydata
I have never seen the horror movies.
I never hope to see one;
But I can tell you, anyhow,
I had rather see than be one.
25. Resilient Distributed Datasets (RDDs)
map and filter Transformation
I have never seen the horror movies.
I never hope to see one;
But I can tell you, anyhow,
I had rather see than be one.
I HAVE NEVER SEEN THE HORROR MOVIES.
I NEVER HOPE TO SEE ONE;
BUT I CAN TELL YOU, ANYHOW,
I HAD RATHER SEE THAN BE ONE.
I HAVE NEVER SEEN THE HORROR MOVIES.
I NEVER HOPE TO SEE ONE;
I HAD RATHER SEE THAN BE ONE.
Python: mydata.map(lambda line: line.upper())
mydata.filter(lambda line: line.startswith('I'))
Scala: mydata.map(line => line.toUpperCase())
mydata.filter(line => line.startsWith("I"))
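The same pipeline can be reproduced with plain Python list operations on the four lines of movie.txt, which is a handy way to check what each step does before running it on a cluster (an illustration, not PySpark itself):

```python
mydata = [
    "I have never seen the horror movies.",
    "I never hope to see one;",
    "But I can tell you, anyhow,",
    "I had rather see than be one.",
]

# map step: upper-case every line
upper = [line.upper() for line in mydata]

# filter step: keep only lines starting with "I"
result = [line for line in upper if line.startswith("I")]

for line in result:
    print(line)
# The "BUT I CAN ..." line is dropped: it does not start with "I"
```

Three of the four lines survive the filter, matching the slide's second box.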
26. Spark Stack
• Spark SQL:
--- For SQL and structured data processing
• Spark Streaming:
--- Stream processing of live data streams
• MLlib:
--- Machine learning algorithms
• GraphX:
--- Graph processing
27. Why Spark ?
Core engine with SQL, streaming, machine learning and graph processing modules.
Can run today's most advanced algorithms.
Alternative to MapReduce for certain applications.
APIs in Java, Scala and Python.
Interactive shells in Scala and Python.
Runs on YARN, Mesos and standalone.
28. Spark’s major use cases over Hadoop
• Iterative Algorithms in Machine Learning
• Interactive Data Mining and Data Processing
• Spark offers Apache Hive-compatible data warehousing (via Spark SQL) that
can run up to 100x faster than Hive.
• Stream processing: Log processing and Fraud detection in live streams
for alerts, aggregates and analysis
• Sensor data processing: where data is fetched and joined from
multiple sources, in-memory datasets are helpful because they are easy
and fast to process.
33. Example : Page Rank
A way of analyzing websites based on their link relationships
• Good example of a more complex algorithm
• Multiple stages of map & reduce
• Benefits from Spark’s in-memory caching
• Multiple iterations over the same data
Basic Idea
Give pages ranks (scores) based on links to them
• Links from many pages → high rank
• A link from a high-rank page → high rank
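A minimal Python sketch of the iteration (over a hypothetical four-page link graph, with the usual damping factor 0.85) shows why Spark's in-memory caching helps: the same link structure is reused on every pass.

```python
def pagerank(links, iterations=20, d=0.85):
    """links: page -> list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Each page shares its rank equally among its outgoing links
        contribs = {p: 0.0 for p in pages}
        for page, outs in links.items():
            for out in outs:
                contribs[out] += rank[page] / len(outs)
        # Damped update: baseline rank plus collected contributions
        rank = {p: (1 - d) / len(pages) + d * contribs[p] for p in pages}
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
# "c" collects links from three pages and ends with the highest rank
print(max(ranks, key=ranks.get))
```

Each iteration is a map (emit contributions along links) followed by a reduce (sum contributions per page), which is exactly the "multiple stages of map & reduce" the slide refers to.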
35. Other Iterative Algorithms
Time per iteration in seconds, Hadoop vs. Spark:
• Logistic Regression: Hadoop 110 s, Spark 0.96 s
• K-Means Clustering: Hadoop 155 s, Spark 4.1 s
NOTE: a lower iteration time denotes higher performance
36. Spark Installation
(For end-user side)
Download the Spark distribution from https://spark.apache.org/downloads.html,
choosing a package pre-built for Hadoop 2.4 or later.
38. Spark Installation (continued)
Alternatively, build the source code using Maven, targeting your Hadoop version:
<SPARK_HOME>#build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -DskipTests clean package
39. How to run Spark ?
(Standalone mode )
Once the build is complete, open a terminal in the bin directory inside the
Spark home directory and invoke the Spark shell:
<SPARK_HOME>/bin#./spark-shell
40. To start all Spark’s Master and slave nodes:
Run the following command from the sbin directory inside the Spark home directory:
<SPARK_HOME>/sbin#./start-all.sh
42. To stop all Spark’s Master and slave nodes:
Run the following command from the sbin directory inside the Spark home directory:
<SPARK_HOME>/sbin#./stop-all.sh
45. https://www.youtube.com/watch?v=JfqJTQnVZvA
• IBM is making Spark available as a cloud service on its
Bluemix cloud platform.
• 3,500 IBM researchers and developers are to work on Spark-related
projects at more than a dozen labs worldwide.