Spark Working
Environment in
Windows OS
Mohammed Zuhair Al-Taie
Big Data Centre - Universiti Teknologi Malaysia - 2016
Driver: the Spark Driver is the process that contains the SparkContext
Spark Context: Spark operates in a basic client-server model, and the SparkContext is responsible for issuing tasks to the cluster manager for execution. The Driver Program (which can be a Scala, IPython, or R shell) typically runs on a laptop or client machine, although this is not a requirement. Inside the driver program lives the SparkContext (usually referred to as sc). In my own experiments, the driver program always runs on my local computer, although it can run on any other machine, even one inside a cluster. One difference from the Hadoop MapReduce execution model is that the Driver Program is responsible for managing much of the metadata about which tasks to execute and the results that come back from them. In Hadoop, the master node (which lives inside the cluster) is responsible for the metadata of tasks and data and executes batch jobs, whereas in Spark, where we use an interactive REPL, the Driver Program and the SparkContext often live on the same machine, whether that is a laptop or another machine.
Cluster Manager: in addition to the Driver Program, which issues commands, there is a Cluster Manager, which in Spark can be the built-in Standalone manager, Hadoop YARN, or Apache Mesos. The Standalone manager is usually a good choice; Hadoop YARN and Apache Mesos are best if we want to connect to other frameworks such as Hadoop, HBase, or Hive. A cluster manager needs a cluster to manage: it can connect to one or more worker nodes, and each worker node has an Executor, a cache (RAM), and tasks to execute.
Executor: process that executes one or more Spark tasks
Master: process that manages applications across the cluster
Spark Worker: process that manages executors on a particular node
Some Terminology of Spark (1)
Some Terminology of Spark (2)
In Summary:
The Spark driver sends transformations to the cluster manager. The cluster manager sends the computation to the appropriate data.
Spark intelligently pipelines tasks into batched computations to avoid sending data over the network. Certain transformations force a wide dependency.
An RDD can be defined as:
1. A set of partitions of current RDD (data)
2. A list of dependencies
3. A function to compute partitions (functional paradigm)
4. A partitioner to optimize execution
5. A Potential preferred location for partitions
The first 3 points are required by any RDD but the last two points are optional (used for optimization).
Building RDDs can be done in several ways: sc.parallelize, from Hive, from an external file like a text file, from JDBC,
Cassandra, HBase, JSON, CSV, sequence files, object files, or various compressed formats. An RDD can also be created from
another RDD using any of the transformation methods.
To determine the parent RDD of a new child RDD: RDDname.toDebugString()
To find the exact number of partitions of any RDD, type: RDDname.getNumPartitions(). At localhost:4040, in the "Storage" section,
you can find the number of partitions of each cached RDD. In the "Jobs" section, you will see that the job runs on 2
partitions (2/2) rather than 1 (1/1).
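As a short illustration of these points (a sketch assuming the PySpark shell with an active SparkContext sc; the file name is hypothetical):

# Build an RDD from a Python collection and derive a child RDD from it
nums = sc.parallelize(range(100), 2)      # explicitly request 2 partitions
squares = nums.map(lambda x: x * x)       # a transformation creates a child RDD

print(squares.toDebugString())            # shows the lineage (parent RDDs) of the child
print(squares.getNumPartitions())         # 2, inherited from the parent RDD

lines = sc.textFile("data.txt")           # an RDD built from a hypothetical text file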
SPARK RDDs
1. Python installation from Anaconda (the Python version should be 2.7 only!)
2. Spark binary. Select the package pre-built for Hadoop 2.4 or earlier, and choose the latest Spark version.
Download the binary version, not the source version (to avoid having to compile it).
Download the version that is pre-built for Hadoop
Download website: http://spark.apache.org/downloads.html
The file is downloaded in tgz format, which is a common compression format in the Linux and Unix world (as opposed to zip in the Windows OS). WinRAR is able to extract such files.
After downloading the file, we should keep it in an easy-to-access location (e.g. the desktop). In addition, we should add the folder path to the
environment variables.
3. Java JDK 6/7.
To know which version of java is required for spark installation, visit spark.apache.org/docs/latest/
4. Install scientific Python. ipython (the old name for Jupyter) comes integrated with the Anaconda distribution.
5. Install py4j (to connect PySpark with Java) from the cmd command: pip install py4j
6. Optionally, install IRkernel (for Jupyter) to write R code in Jupyter.
NOTES:
1. Cloudera and Hortonworks provide Linux virtual machines with Spark pre-installed. This makes it possible to run Linux with Spark installed inside
VirtualBox or VMware.
2. Installing spark with Homebrew (on OSX) or Cygwin (on windows) is not recommended, although they are great at installing packages other
than spark.
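As a quick sanity check after these installations, the following short Python snippet (not part of the original slides; just a hedged sketch) confirms that the interpreter, py4j, and Java are all reachable:

import sys
import subprocess

print(sys.version)                              # should report Python 2.7.x

try:
    import py4j                                 # installed above with: pip install py4j
    print("py4j version: " + py4j.__version__)
except ImportError:
    print("py4j is missing - run: pip install py4j")

# 'java -version' returns 0 when Java is on the PATH (its output goes to stderr)
code = subprocess.call(["java", "-version"])
print("Java found" if code == 0 else "Java not found on PATH")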
Installation Requirements
We need to set some environment variables on our system. The goal is to let Windows know where to find Spark
and other components.
The following applies to setting environment variables for Windows OS. Linux and Mac OSs have their own ways for
setting environment variables
1. First, do all necessary installations (Python, Anaconda, PySpark, Java, py4j, IRKernal) as stated before.
2. Download Apache Spark from its official website and decompress it.
3. Set some environment variables in our systems. Go to environment variables: Control Panel --> System and Security -->
System --> Advanced system settings --> environment variables
There are two sections, the upper one is related to user variables and the lower one is related to system variables
4. Create a new user variable called 'SPARK_HOME' in the 'user variables' section that includes the full path to the unzipped spark
folder (for example: C:\Users\Zuhair\Desktop\spark-1.6.0-bin-hadoop2.6)
This is important because we want to tell Windows where we installed Spark.
Setting Environment Variables (1)
5. Add the new 'SPARK_HOME' variable to the end of the 'Path' variable (user variables section) like this:
;%SPARK_HOME%\bin. This allows running Spark from any directory without having to write the full path to it.
6. Add the following Python path to the end of "Path" in the "system variables" section: (variable: Path, value:
...;C:\Python27). Although this is not always necessary, it helps Python run if it is not responding.
7. Another important installation: create a new folder on the C: drive and name it winutils. Inside that folder create a folder
named bin, and inside bin paste winutils.exe, an executable that you can download from the internet.
This file is important for Spark to work in the Windows environment, as Spark expects Hadoop (or at least this part of it,
winutils.exe) to be installed on Windows. Installing Hadoop itself would also work. Next, we need to tell Spark where
that file is: in the environment variables, under user variables, add a new variable called
HADOOP_HOME whose value is the folder that contains bin\winutils.exe, in this case C:\winutils.
8. OPTIONAL: change the Spark configuration by going into the conf folder inside the Spark folder. The goal is to get rid of the many log
messages that appear during execution. In the conf folder, open the file log4j.properties.template with
WordPad and change log4j.rootCategory from INFO to WARN. After that, rename the file to
log4j.properties (dropping the .template extension).
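The same variables can also be set for the current Python session only, which is sometimes convenient for testing (a hedged sketch; the paths are examples and must match your own installation):

import os

os.environ["SPARK_HOME"] = r"C:\Users\Zuhair\Desktop\spark-1.6.0-bin-hadoop2.6"  # example path
os.environ["HADOOP_HOME"] = r"C:\winutils"      # folder containing bin\winutils.exe
os.environ["PATH"] += ";" + os.path.join(os.environ["SPARK_HOME"], "bin")
# Note: this affects only the current process, not the system-wide settings.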
Setting Environment Variables (2)
In Spark, there are 3 modes of programming: batch mode, interactive mode (using a
shell), and streaming mode.
Only Python and Scala have shells for Spark (i.e. Java code cannot be run interactively from the
command line)
The Spark Python shell is a normal Python REPL that is also connected to the Spark
cluster underneath
To run the Python shell in the command prompt environment (cmd), cd (change directory)
to where the Spark folder was unzipped, then cd into its "bin" folder and run
"pyspark". If the cmd environment is not responding, try starting it in administrator mode.
To run the Scala shell, type spark-shell from within the bin folder.
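A few first commands inside the PySpark shell (the shell creates the SparkContext sc automatically):

sc            # the SparkContext created by the shell
sc.version    # the Spark version in use, e.g. 1.6.0
sc.master     # the master the shell is connected to, e.g. local[*]
exit()        # leave the shell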
Spark Shell
Spark Shell Demonstration
1. C:\Users\Zuhair\Desktop\spark-1.6.0-bin-hadoop2.6\bin>pyspark --master local (for local mode). This is the mode used most of the time. Other valid local modes include:
1. pyspark --master local[k], where k = 2, 3, ... is the number of worker threads.
2. pyspark --master local[*], where * corresponds to the number of cores available.
3. local[*] is the default for the pyspark shell.
2. C:\Users\Zuhair\Desktop\spark-1.6.0-bin-hadoop2.6\bin>pyspark --master mesos://host:port (for Mesos mode)
3. C:\Users\Zuhair\Desktop\spark-1.6.0-bin-hadoop2.6\bin>pyspark --master yarn (for YARN mode)
4. C:\Users\Zuhair\Desktop\spark-1.6.0-bin-hadoop2.6\bin>pyspark --master spark://host:port (to connect to a standalone Spark cluster)
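The same choice of master can also be made programmatically when building a SparkContext in a script (a sketch, not from the slides; the application name is arbitrary):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[2]").setAppName("demo-app")  # two worker threads
sc = SparkContext(conf=conf)
print(sc.master)    # local[2]
sc.stop()           # release the context when done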
Running PySpark in Various Modes
The only mode that works out of the box is the local mode.
For small projects that don't require sharing clusters, the standalone-scheduler Spark cluster mode is the
best option. YARN or Mesos are preferred for advanced needs such as priorities, queues, and access limiting.
YARN or Mesos are needed to connect Spark to HDFS in Hadoop.
Spark can also run on any hardware using a virtual machine, e.g. on Amazon Web Services (EC2).
Spark has connectors to most of the popular data sources, such as Apache HBase, Apache Hive,
Cassandra, and Tachyon, which was developed by AMPLab particularly for Spark.
A Spark cluster consists of one master node and one (or more) slave nodes.
To start the master: spark-1.6.0-bin-hadoop2.6\sbin>start-master.sh
To stop the master: spark-1.6.0-bin-hadoop2.6\sbin>stop-master.sh
To start a slave: spark-1.6.0-bin-hadoop2.6\sbin>start-slave.sh
Notes on Spark Running Modes (1)
1. To stop a slave: spark-1.6.0-bin-hadoop2.6\sbin>stop-slave.sh
2. To start all slaves: spark-1.6.0-bin-hadoop2.6\sbin>start-slaves.sh
3. The Spark master node has a URL like spark://alex-laptop.local:7077 that we will need
4. Starting a slave node requires providing the URL of the master node
5. Losing a slave node is not a problem in standalone mode, as it is resilient against data loss
6. However, losing the master node can be problematic, as it will stop new jobs from being scheduled. To
work around this, you can use ZooKeeper or single-node recovery.
7. Type exit() to exit a Spark session
8. To know which mode we are using, in pyspark type sc.master, or in the browser visit
http://localhost:4040/environment/ and look for spark.master under Spark properties.
Notes on Spark Running Modes (2)
import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')

# Put PySpark and its bundled py4j library on the Python path
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.9-src.zip'))  # adjust the py4j version to match the Spark release you installed

# Start the PySpark shell inside the current IPython session
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
Notes:
1. Run this code after launching the IPython notebook.
2. The code above should start PySpark on IPython (Jupyter) successfully.
3. It should be run only once, at the beginning of the session.
4. It should be executed only after all the installations are done.
Ipython Using Spark
ipython is a REPL (Read-Eval-Print Loop) program
ipython comes integrated with the Anaconda installation
To run ipython inside the command prompt environment (instead of a web browser), just type: ipython.
To exit the shell, type: exit()
To run ipython in a web browser from the cmd command, type: ipython notebook. To stop it, press Ctrl + Break.
To install ipython from the cmd command: pip install ipython
One way to install ipython is inside a virtual environment (virtualenv ipython-env).
This type of installation has the advantage of not affecting the main working environment.
To update ipython from the command line: conda update ipython, or pip install ipython --upgrade
It is possible to update the Anaconda distribution (packages) from the command line. Type conda update
conda, and then specify the packages that you want to upgrade (update).
If you are unable to install some packages (like SciPy or sklearn.preprocessing), there is a problem
with the Anaconda installation or its path. A better way is to uninstall and reinstall Anaconda for the specific
Python version (i.e. Python 2.7), then install packages by typing conda install numpy scipy matplotlib.
The video tutorial "Install Python with NumPy SciPy Matplotlib on Windows" is useful here.
Jupyter is the new name of ipython; it supports other languages besides Python.
Ipython Settings (1)
Jupyter can be launched from the command prompt like this: jupyter notebook. To stop it, press Ctrl + Break.
Jupyter can also be launched as a program from the Windows program files.
In addition to running PySpark on IPython (in the web browser), we can run PySpark on IPython in the command prompt
environment itself.
This means importing the PySpark library into the IPython shell: after starting ipython in cmd, type: import pyspark as ps
R code can also be used in IPython (in the REPL) after installing IRkernel.
R can also be executed in the cmd environment by typing: sparkR
The Spyder IDE comes integrated with the Anaconda package.
To know the current working directory, type: pwd
To monitor Spark processes, we can use localhost:4040 or external tools such as Ganglia.
To know the default number of partitions on a PC: sc.defaultMinPartitions or sc.defaultParallelism (the result depends on
the number of cores in that PC). The default number of partitions in Spark on a laptop or PC = 1.
Spark defaults to using one partition for each block of the input file. The default block size is different when reading
from HDFS, where it is 64 MB, since the files stored there are big. For local files, and in most operating
systems, the block size is on the order of kilobytes. Use rdd.partitions.size (Scala) or rdd.getNumPartitions() (Python).
In Spark, code runs in two places: the driver (shell or sc) and the executors (worker threads).
In cluster mode, the number of executors can be specified (this is not available in standalone mode).
Spark runs on top of the JVM (Java Virtual Machine). However, it is platform independent in the sense that there are
versions for Windows, Linux, and other operating systems.
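A quick way to check these defaults from the PySpark shell (assuming an active SparkContext sc; the file name is hypothetical):

print(sc.master)                 # which master/mode is in use
print(sc.defaultParallelism)     # default parallelism for operations such as parallelize
print(sc.defaultMinPartitions)   # default minimum partitions for textFile and similar

rdd = sc.textFile("some_local_file.txt")   # hypothetical input file
print(rdd.getNumPartitions())    # partitions actually used for this RDD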
Ipython Settings(2)
To monitor the performance of our application on spark, we can use:
1. Web UI (application) of the spark cluster that defaults to port 4040
2. History Server (application)
3. Ganglia (infrastructure)
4. jstack, jmap, jstat (JVM)
To optimize the performance of an application on spark and to solve possible problems, try to:
1. Optimize Data in terms of serialization/deserialization and locality
2. Optimize the application itself by using more efficient data structures, caching, broadcasting, shuffle
3. Optimize the framework itself (parallelism, memory, or garbage collecting)
In the Spark Standalone cluster manager (or any other cluster manager), the master node launches
its own web UI.
The usual localhost port (web UI / history server) on Windows is 4040 (we use it to monitor Spark
processes). It contains information about schedulers, RDDs, the Spark environment, and executors. In the
"Executors" menu, you will find only one executor if you are using the Windows environment; if you are using
cluster mode, you will find many executors, depending on the size of the cluster. After calling the cache()
function, the "Storage" menu will be populated and will show that a MapPartitionsRDD is now
available in memory, along with the size of that memory.
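A short sketch of how caching shows up in the web UI (assuming the PySpark shell; the file name is hypothetical):

lines = sc.textFile("data.txt")   # hypothetical input file
lines.cache()                     # mark the RDD for in-memory storage
lines.count()                     # an action materializes the RDD and fills the cache
# http://localhost:4040 -> "Storage" now lists the cached RDD and its in-memory size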
Ipython Settings(3)
Typing ? alone shows extensive help documentation in the ipython terminal
Typing ?word gives information about that keyword
Typing ??word gives more detailed information about the keyword
Typing sc.(+Tab) lists all functions available on the Spark context
Typing sc? gives the built-in help
Type help() for general help inside Spark code
Type help(function_name) to see details about how to use that function
IPython provides debugging. Type %debug, then use "h" for help, "w" for location, "q" to quit, ...
%pdb drops you into the debugger exactly after an exception is thrown
Inserting print statements in several places within the code also works as a debugging
method
Help Within Ipython
Spark deployment is done in one of two modes: local mode (running on a laptop or a single machine) or
cluster mode (which provides better performance than the local mode).
Installing Spark on laptops or PCs uses the standalone version, while a cluster requires the inclusion of
Mesos or YARN.
In other words, there are two modes of Spark scalability: a single JVM or a managed cluster.
1. In single-JVM mode, Spark runs on a single box (Linux or Windows) and all components (driver,
executors) run within the same JVM. This is a simple setup for Spark (i.e. intended only for training,
not for production).
2. In a managed cluster, Spark can scale from 2 to thousands of nodes. We can use any cluster
manager (like Mesos or YARN) to manage the nodes. Data is distributed and processed across all nodes. This
setup is suitable for production environments.
Spark Deployment
There are three ways to run Spark in local mode:
1. Single-threaded: running the SparkContext with a single thread: SparkContext('local'). Execution is sequential, which
allows easier debugging of program logic, since tasks are executed one after another. When debugging logic in multi-
threaded mode, there is no guarantee of the sequence in which tasks are executed, or of which tasks are executed; in single-
threaded mode, on the other hand, all tasks are executed sequentially. After building a program in single-
threaded mode, we may move to the more advanced multi-threaded mode to test the application.
2. Multi-threaded: leverages multiple cores and multiple threads of the computer. For example, SparkContext('local[4]')
will use four cores for that application. In this mode, concurrent execution leverages parallelism and allows
debugging of coordination. It has the benefit of using the parallelism available in the computer to make programs
run faster, and it also allows debugging of the coordination and communication of code that is executed in parallel. A
program that passes this stage of testing and debugging correctly is very likely to work correctly in the fully
distributed mode (see the sketch after this list).
3. Pseudo-distributed cluster: to a large degree similar to the cluster mode. In this mode, distributed execution
allows debugging of communication and I/O. It is similar to the previous type (the multi-threaded mode),
but it goes one step further toward running a program in cluster mode on a number of physical or virtual
machines. It is possible in this mode to debug the communications and input/output of each task and the running jobs,
and to use the same interface available in cluster mode, including the ability to inspect individual
workers and executors and to make sure that network communication over IPs and ports works correctly for a
specific application.
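A minimal sketch of the first two local variants (assuming PySpark; the application names are arbitrary, and each context is stopped so it does not conflict with the next):

from pyspark import SparkContext

# Single-threaded local mode: tasks run one at a time, which makes program logic easy to debug
sc = SparkContext("local", "debug-logic")
print(sc.defaultParallelism)      # 1
sc.stop()

# Multi-threaded local mode: four worker threads exercise coordination and parallelism
sc = SparkContext("local[4]", "debug-coordination")
print(sc.defaultParallelism)      # typically 4
sc.stop()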
Local Mode
Cluster Mode:
In cluster mode, we must decide the mode and the machines (whether physical or virtual)
that Spark will run on. The cluster manager comes in 3 flavors: Standalone (which comes pre-
packaged with Spark), YARN (the default cluster manager for Hadoop), and Mesos (which
came from the same research group at UC Berkeley and is the one that Matei started his work on
in his early days at AMPLab).
The scheduler is responsible for building stages to execute, submitting stages to the Cluster
Manager, and resubmitting failed stages when output is lost. It sits between the worker nodes that
run tasks and threads. Results coming from the worker nodes go back to the cluster manager
and then back to the driver program.
A Spark cluster cannot be set up in a Windows environment. In this case, only one client node
can run on Windows to implement simple projects. Running a Spark cluster with many nodes requires a
Linux environment, or installing Spark inside a virtual machine on Windows.
Cluster Mode
In the standalone mode (i.e. running Spark locally), both the driver program (e.g. the ipython shell) and
the worker nodes (the processes inside the laptop) are located on the same physical infrastructure. In this
case, there are no managers like Mesos or YARN. If we are running Spark on Amazon Web Services, the
worker nodes in this case will be EC2 instances.
The standalone scheduler comes pre-packaged with Spark core. It is a great choice for a dedicated Spark
cluster (i.e. a cluster in which we run only Spark, with no Hadoop or HBase on the same cluster as our
Spark installation). If we want to run different applications that share resources and load in the same
cluster, we may use either the YARN or Mesos schedulers.
The Standalone cluster manager has a high-availability mode that can leverage Apache ZooKeeper to
enable standby master nodes. This is useful in case one of the master nodes fails, as we can promote one of the
other nodes to take its place and continue working with almost no downtime.
Standalone Scheduler
Mesos is the most general cluster manager that we can run Spark on. It can be thought of as a general-purpose cluster and
global resource manager when we want multiple applications such as Spark, Hadoop, MPI, and Cassandra to share the
same resources, such as memory and CPU cycles. Mesos, in this case, schedules the various resources of our
cluster.
Mesos is a popular open-source cluster manager. It allows sharing clusters between many users and apps, and it is easy to run
Spark on top of Mesos.
It can be thought of as an operating system for a cluster, where multiple applications co-locate on the same cluster: Spark,
Hadoop, Kafka, and others can all run in one Mesos cluster. So, if we are running Spark, Hadoop, and Cassandra on the same
cluster, Mesos will find the best way to efficiently distribute resources such as memory, CPU, and network bandwidth
between the various applications and the users of the cluster.
Mesos and Spark came out of the same research group at UC Berkeley, and Matei (the father of Apache Spark) built the
first version of Spark to work with Mesos. This is why they work well together across various applications.
It is a global resource manager that facilitates multi-tenant and heterogeneous workloads.
It is useful for saving resources or running multiple Spark instances at the same time.
It is easy to run Spark on a ready Mesos cluster; in this case, spark-submit can be used.
Compared to Mesos, YARN (discussed next) is more integrated with the Hadoop ecosystem.
Mesos
YARN stands for Yet Another Resource Negotiator, and it came out with the second version of Hadoop. It was abstracted out
of the Hadoop MapReduce framework to exist as a standalone cluster manager. This is why YARN is better suited to
stateless batch jobs with long runtimes.
YARN and Hadoop, in a similar way to Spark and Mesos, grew up together.
Compared to Mesos, YARN is a monolithic scheduler that manages cluster resources as well as schedules the jobs
executed on those resources.
YARN is not well suited to long-running processes that are always up (such as web servers), real-time workloads, or
stateful/interactive services (like the Spark REPL or database queries).
YARN integrates well with existing Hadoop clusters and applications.
YARN was developed to separate MapReduce from the cluster manager; hence, Spark can be run on YARN easily.
Mesos is more appropriate for Spark than YARN, as YARN requires a lot of configuration and maintenance. With all these
complications, Mesos or the 'standalone' version is preferable.
In any case, a Mesos or YARN cluster should be built before we run Spark on top of it.
If Hadoop is already installed, then YARN is already included and Spark can be installed next.
YARN
Amazon EC2 (Elastic Compute Cloud) is useful for running Spark on a cluster 24/7 and for fast
prototyping and testing. It is elastic enough to scale to however many machines Spark is running on.
EC2 is useful for deploying clusters or testing prototypes. Spark works very nicely with EC2, as Spark is bundled
with scripts that ease the process of setting up and installing an environment on each of the worker and
master machines.
Virtual machines (including EC2) are elastic and ephemeral, even if we have our own physical devices.
EC2 is great at providing the machines required to run scripts or test prototypes.
Although there are other cloud services, EC2 offers elastic scalability and ease of setup, even when Mesos
or YARN is installed. It can be leveraged to test the various aspects of Spark itself.
For many people, it is the only feasible way to scale up their analyses without making big capital
investments in building their own clusters.
Amazon EC2
There is a difference between the Client Mode and the Cluster Mode:
1. In the Client Mode (as in a laptop environment), the driver runs in the client, the master obtains
resources and communicates back to the client, and the client communicates directly with the executors.
2. In the Cluster Mode, the driver runs on the master inside the cluster, the master communicates with the executors
entirely within the cluster, and the client exits as soon as it passes its information to the master.
We can run Spark in either 'local' mode or 'cluster' mode, and each has its own benefits:
1. Local mode is useful when we want to debug an application, either on sample data or on small-scale
data.
2. Cluster mode is useful when we want to scale up our analysis to entire datasets or when we want
to run things in a parallel and distributed fashion. Moving from one mode to the other requires only
minimal changes to the application code.
Spark Deployment in Summary
The Hadoop MapReduce framework is similar to Spark in that it uses a master-slave paradigm. It has one
master node (which consists of a job tracker, name node, and RAM) and worker nodes (each worker node
consists of a task tracker, data node, and RAM). The task tracker in a worker node is analogous to an
executor in the Spark environment.
Tasks are assigned by the master node, which is also responsible for coordinating the work between worker
nodes. However, Spark adds abstractions, generalizations, and performance optimizations to
achieve much better efficiency, especially in iterative workloads. On the other hand, Spark does not concern itself with
being a distributed file system, whereas Hadoop has HDFS.
Spark can leverage existing distributed file systems (like HDFS), a distributed database (like HBase),
traditional databases through its JDBC or ODBC adaptors, and flat files in local file systems or in a file
store like S3 in the Amazon cloud.
In summary:
1. Spark only replaces MapReduce (the computational engine of a distributed system)
2. We still need a data store: HDFS, HBase, Hive, etc.
3. Spark has a more flexible and general programming model compared to Hadoop
4. Spark is an ecosystem of higher-level libraries built on top of the Spark core framework
5. Spark is often faster for iterative computations
Comparison between Spark and Hadoop
Spark Working Environment in Windows OS

More Related Content

What's hot

Spark Summit 2014: Spark Job Server Talk
Spark Summit 2014:  Spark Job Server TalkSpark Summit 2014:  Spark Job Server Talk
Spark Summit 2014: Spark Job Server Talk
Evan Chan
 
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Spark Summit
 

What's hot (20)

Apache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault ToleranceApache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault Tolerance
 
Apache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabApache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLab
 
Deploy data analysis pipeline with mesos and docker
Deploy data analysis pipeline with mesos and dockerDeploy data analysis pipeline with mesos and docker
Deploy data analysis pipeline with mesos and docker
 
PaaSTA: Autoscaling at Yelp
PaaSTA: Autoscaling at YelpPaaSTA: Autoscaling at Yelp
PaaSTA: Autoscaling at Yelp
 
Spark on Yarn
Spark on YarnSpark on Yarn
Spark on Yarn
 
Transactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangTransactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric Liang
 
Building Value Within the Heavy Vehicle Industry Using Big Data and Streaming...
Building Value Within the Heavy Vehicle Industry Using Big Data and Streaming...Building Value Within the Heavy Vehicle Industry Using Big Data and Streaming...
Building Value Within the Heavy Vehicle Industry Using Big Data and Streaming...
 
Spark Summit 2014: Spark Job Server Talk
Spark Summit 2014:  Spark Job Server TalkSpark Summit 2014:  Spark Job Server Talk
Spark Summit 2014: Spark Job Server Talk
 
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
 
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
Homologous Apache Spark Clusters Using Nomad with Alex Dadgar
Homologous Apache Spark Clusters Using Nomad with Alex DadgarHomologous Apache Spark Clusters Using Nomad with Alex Dadgar
Homologous Apache Spark Clusters Using Nomad with Alex Dadgar
 
Building a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and SparkBuilding a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and Spark
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
 
Stream Processing using Apache Spark and Apache Kafka
Stream Processing using Apache Spark and Apache KafkaStream Processing using Apache Spark and Apache Kafka
Stream Processing using Apache Spark and Apache Kafka
 
Getting Started Running Apache Spark on Apache Mesos
Getting Started Running Apache Spark on Apache MesosGetting Started Running Apache Spark on Apache Mesos
Getting Started Running Apache Spark on Apache Mesos
 
Supporting Over a Thousand Custom Hive User Defined Functions
Supporting Over a Thousand Custom Hive User Defined FunctionsSupporting Over a Thousand Custom Hive User Defined Functions
Supporting Over a Thousand Custom Hive User Defined Functions
 
Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache Spark
 
How to build your query engine in spark
How to build your query engine in sparkHow to build your query engine in spark
How to build your query engine in spark
 
Spark and S3 with Ryan Blue
Spark and S3 with Ryan BlueSpark and S3 with Ryan Blue
Spark and S3 with Ryan Blue
 

Similar to Spark Working Environment in Windows OS

Similar to Spark Working Environment in Windows OS (20)

Apache Cassandra and Apche Spark
Apache Cassandra and Apche SparkApache Cassandra and Apche Spark
Apache Cassandra and Apche Spark
 
R Data Access from hdfs,spark,hive
R Data Access  from hdfs,spark,hiveR Data Access  from hdfs,spark,hive
R Data Access from hdfs,spark,hive
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
 
Final Report - Spark
Final Report - SparkFinal Report - Spark
Final Report - Spark
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
 
Apache spark - Installation
Apache spark - InstallationApache spark - Installation
Apache spark - Installation
 
HDPCD Spark using Python (pyspark)
HDPCD Spark using Python (pyspark)HDPCD Spark using Python (pyspark)
HDPCD Spark using Python (pyspark)
 
Oracle EBS R12.1.3_Installation_linux(64bit)_Pan_Tian
Oracle EBS R12.1.3_Installation_linux(64bit)_Pan_TianOracle EBS R12.1.3_Installation_linux(64bit)_Pan_Tian
Oracle EBS R12.1.3_Installation_linux(64bit)_Pan_Tian
 
Learning spark ch07 - Running on a Cluster
Learning spark ch07 - Running on a ClusterLearning spark ch07 - Running on a Cluster
Learning spark ch07 - Running on a Cluster
 
Spark with HDInsight
Spark with HDInsightSpark with HDInsight
Spark with HDInsight
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Hadoop spark performance comparison
Hadoop spark performance comparisonHadoop spark performance comparison
Hadoop spark performance comparison
 
Performance all teh things
Performance all teh thingsPerformance all teh things
Performance all teh things
 
Oracle11g On Fedora14
Oracle11g On Fedora14Oracle11g On Fedora14
Oracle11g On Fedora14
 
Oracle11g on fedora14
Oracle11g on fedora14Oracle11g on fedora14
Oracle11g on fedora14
 
Spark core
Spark coreSpark core
Spark core
 
linux installation.pdf
linux installation.pdflinux installation.pdf
linux installation.pdf
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Oracle ebs-r12-1-3installationlinux64bit
Oracle ebs-r12-1-3installationlinux64bitOracle ebs-r12-1-3installationlinux64bit
Oracle ebs-r12-1-3installationlinux64bit
 

More from Universiti Technologi Malaysia (UTM)

Explanations in Recommender Systems: Overview and Research Approaches
Explanations in Recommender Systems: Overview and Research ApproachesExplanations in Recommender Systems: Overview and Research Approaches
Explanations in Recommender Systems: Overview and Research Approaches
Universiti Technologi Malaysia (UTM)
 
Factors disrupting a successful implementation of e-commerce in iraq
Factors disrupting a successful implementation of e-commerce in iraqFactors disrupting a successful implementation of e-commerce in iraq
Factors disrupting a successful implementation of e-commerce in iraq
Universiti Technologi Malaysia (UTM)
 

More from Universiti Technologi Malaysia (UTM) (11)

A self organizing communication model for disaster risk management
A self organizing communication model for disaster risk managementA self organizing communication model for disaster risk management
A self organizing communication model for disaster risk management
 
Python networkx library quick start guide
Python networkx library quick start guidePython networkx library quick start guide
Python networkx library quick start guide
 
Python 3.x quick syntax guide
Python 3.x quick syntax guidePython 3.x quick syntax guide
Python 3.x quick syntax guide
 
Social media with big data analytics
Social media with big data analyticsSocial media with big data analytics
Social media with big data analytics
 
Predicting the relevance of search results for e-commerce systems
Predicting the relevance of search results for e-commerce systemsPredicting the relevance of search results for e-commerce systems
Predicting the relevance of search results for e-commerce systems
 
Scientific theory of state and society parities and disparities between the p...
Scientific theory of state and society parities and disparities between the p...Scientific theory of state and society parities and disparities between the p...
Scientific theory of state and society parities and disparities between the p...
 
Nation building current trends of technology use in da’wah
Nation building current trends of technology use in da’wahNation building current trends of technology use in da’wah
Nation building current trends of technology use in da’wah
 
Flight MH370 community structure
Flight MH370 community structureFlight MH370 community structure
Flight MH370 community structure
 
Visualization of explanations in recommender systems
Visualization of explanations in recommender systemsVisualization of explanations in recommender systems
Visualization of explanations in recommender systems
 
Explanations in Recommender Systems: Overview and Research Approaches
Explanations in Recommender Systems: Overview and Research ApproachesExplanations in Recommender Systems: Overview and Research Approaches
Explanations in Recommender Systems: Overview and Research Approaches
 
Factors disrupting a successful implementation of e-commerce in iraq
Factors disrupting a successful implementation of e-commerce in iraqFactors disrupting a successful implementation of e-commerce in iraq
Factors disrupting a successful implementation of e-commerce in iraq
 

Recently uploaded

➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
gajnagarg
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
amitlee9823
 
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
amitlee9823
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
karishmasinghjnh
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
gajnagarg
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 

Recently uploaded (20)

➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 

Spark Working Environment in Windows OS

  • 1. Spark Working Environment in Windows OS Mohammed Zuhair Al-Taie Big Data Centre - Universiti Teknologi Malaysia - 2016
  • 2. Driver: Spark Driver is the process that contains the SparkContext Spark Context: Spark framework operating in the basic client-server model. SparkContext is responsible of issuing tasks to the cluster manager to execute. We have a Driver Program (which can be Scala, Ipython, or R shell) which operates in a laptop. Within the driver program there is what is called SparkContext. typically, a Driver Program runs on a laptop or a client machine (but this is not a condition). In this scenario, the client side (laptop or PC) stores the 'Driver' program and inside it there is the SparkContext (usually referred to as sc). In my own experiments, the driver program is always running on my local computer (although it can run from any other computer even when in a cluster). One difference with Hadoop MapReduce execution context is that the Driver Program is responsible for managing a lot of metadata of which tasks to execute and the results that come back from them. In Hadoop, the master node (which lives within a cluster) is responsible of the metadata of tasks and data. In Hadoop, the master node executes batch jobs, while in spark where we use an interactive REPL, the Driver Program and the SparkContext often live in the same machine whether in a laptop or another machine. Cluster Manager: In addition to the Driver Program which issues commands, there is a Cluster Manager which, in spark, can be the built-in Standalone manager, Hadoop Yarn, or Apache Mesos manager. The Standalone manager is usually a good choice. Hadoop Yarn and Apache Mesos are best if we want to connect to other frameworks like Hadoop, HBase, Hive, etc. A cluster manager cannot be effective without a cluster to manage. the cluster manager can connect to one or more worker nodes. Each worker node has an Executor and a cache (or RAM) and it has tasks to execute. Executor: process that executes one or more spark tasks Master: process that manages applications across the cluster Spark Worker: process that manages executors on a particular node Some Terminology of Spark (1)
  • 3. Some Terminology of Spark (2) In Summary: The Spark driver sends transformations to cluster manager. The Spark cluster manager sends computation to the appropriate data. Spark intelligently pipelines tasks to batch computations to avoid sending data over the network. Certain transformations force a wide dependency.
  • 4. An RDD can be defined as: 1. A set of partitions of current RDD (data) 2. A list of dependencies 3. A function to compute partitions (functional paradigm) 4. A partition to optimize execution 5. A Potential preferred location for partitions The first 3 points are required by any RDD but the last two points are optional (used for optimization). Building RDDs can be done in several ways: sc.parallelize, from hive, from an external file like a text file, from JDBC, Cassandra, HBase, JSON, csv, sequence files, object files, or various compressed formats. It can also be created from another RDD using any of the transformation methods. To determine the parent RDD of a new child RDD: (RDDname).toDebugString() To find the exact number of any RDD, type: RDDname.getNumPartitions(). In the localhost:4040, in the "Storage" section, you can find the number of partitions of each RDD: cached.. In the "Jobs" section, you will find that that job is done on 2 partitions: 2/2 rather than 1/1. SPARK RDDs
  • 5. 1. Python installation from anaconda (python version should be 2.7 only!) 2. Spark binary. Select pre-built for Hadoop 2.4 or earlier. Choose Spark package the latest version. Download the binary version but not the source version (to avoid compiling it compiling). Download the version that is pre-built for Hadoop Download website: http://spark.apache.org/downloads.html The file is downloaded in the tgz format which is a common file compression in Linux and Unix world (opposite to zip in Windows OS). WinRAR is able to unzip such files After downloading the file, we should keep it in an easy to use/access environment (e.g. desktop). In addition, we should provide the folder path to the environment variables. 3. Java JDK 6/7. To know which version of java is required for spark installation, visit spark.apache.org/docs/latest/ 4. Install scientific python. ipython (the old name for Jupyter) is integrated with anaconda distribution. 5. Install py4j (to connect PySpark with Java) from the cmd command: pip install py4j. 6. Install, optionally, IRKernal (for Jupyter) to write R code on Jupyter. NOTES: 1. Cloudera and Hortonworks provide Linux virtual machines with Spark installed. This allows at running Linux with spark installed if having a virtual box or VMware. 2. Installing spark with Homebrew (on OSX) or Cygwin (on windows) is not recommended, although they are great at installing packages other than spark. Installation Requirements
  • 6. We need to set some environment variables on our system. The goal is to let Windows know where to find Spark and other components. The following applies to setting environment variables for Windows OS. Linux and Mac OSs have their own ways for setting environment variables 1. First, do all necessary installations (Python, Anaconda, PySpark, Java, py4j, IRKernal) as stated before. 2. Download Apache Spark from its official website and decompress it. 3. Set some environment variables in our systems. Go to environment variables: Control Panel --> System and Security --> System --> Advanced system settings --> environment variables There are two sections, the upper one is related to user variables and the lower one is related to system variables 4. Create a new user variable called 'SPARK_HOME' in the 'user variables' section that includes the full path to the unzipped spark folder (for example: C:UsersZuhairDesktopspark-1.6.0-bin-hadoop2.6) This is important because we want to tell Windows where we installed Spark. Setting Environment Variables (1)
  • 7. 5. Add the name of that new user variable 'SPARK_HOME' to the end of 'path' variable (user variables section) like this: ;%SPARK_HOME%bin. This will allow to run Spark from any directory without having to write the full path to it. 6. Add the following python path to the end of "path" in the "system variables" section: (variable: path, value: ...;C:Python27). Although this is not always necessary, but it allows to enable python to run if it is not responding 7. Another important installation: open a new folder in the c: drive and name it winutils. inside that folder open a new folder and name it bin. inside bin folder, paste an executable file that you can download from the internet which is winutils.exe. This file is important for Spark to work inside Windows environment as it expects Hadoop (or some part of it which is winutils.exe) to be installed in windows. Installing Hadoop instead can work perfectly. Next, we need to tell spark where that file is: in the environment variables, in the user variables part, add a new environment variable that we call it HADOOP_HOME and the variable value is the path to the winutils.exe file which is in this case c:winutils. 8. OPTIONAL: change spark configuration by logging into conf folder in spark folder. The goal is to get rid of the many error messages that appear during execution. After logging into conf folder, open the file log4j.properties.template using WordPad application and make rootCategory=WARN rather than INFO. After that, we change the file extension to log4j.properties only. Setting Environment Variables (2)
• 8. Spark offers three modes of programming: batch mode, interactive mode (using a shell), and streaming mode. Only Python and Scala have Spark shells (i.e. Java cannot be used interactively from the command line). The Spark Python shell is a normal Python REPL that is also connected to the Spark cluster underneath. To run the Python shell from the command prompt (cmd), cd (change directory) into the unzipped Spark folder, then into its "bin" folder, and execute "pyspark". If cmd is not responding, try launching it in administrator mode. To run the Scala shell, type spark-shell from within the same bin folder. Spark Shell
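Once the pyspark shell is up, a SparkContext named sc is already created for you, so a quick smoke test can look like the hedged sketch below (the values shown are only what one would expect, not guaranteed output).

    # Typed inside the pyspark shell, where `sc` already exists:
    rdd = sc.parallelize(range(10))
    print(rdd.count())   # expected: 10
    print(sc.version)    # e.g. 1.6.0 for the build used in these slides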
• 10. 1. C:\Users\Zuhair\Desktop\spark-1.6.0-bin-hadoop2.6\bin>pyspark --master local (local mode). This is the mode used most of the time. Other valid local variants include:
1. pyspark --master local[k], where k = 2, 3, ... is the number of worker threads.
2. pyspark --master local[*], where * corresponds to the number of cores available.
3. local[*] is the default for the pyspark shell.
2. C:\Users\Zuhair\Desktop\spark-1.6.0-bin-hadoop2.6\bin>pyspark --master mesos://host:port (Mesos mode)
3. C:\Users\Zuhair\Desktop\spark-1.6.0-bin-hadoop2.6\bin>pyspark --master yarn (YARN mode)
4. C:\Users\Zuhair\Desktop\spark-1.6.0-bin-hadoop2.6\bin>pyspark --master spark://host:port (standalone cluster mode)
Running PySpark in Various Modes
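The same choice of master can be made programmatically instead of on the command line. The sketch below is illustrative only (the application name is made up); it assumes pyspark is importable as set up earlier.

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local[4]").setAppName("local-mode-demo")
    sc = SparkContext(conf=conf)
    print(sc.master)   # prints: local[4]
    sc.stop()          # only one SparkContext may be active at a time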
• 11. The only mode that works out of the box is the local mode. For small projects that don't require sharing a cluster, the standalone-scheduler Spark cluster mode is the best option. YARN or Mesos is preferred for advanced features such as priorities, queues, and access limiting, and YARN or Mesos is needed to connect Spark to HDFS in a Hadoop cluster. Spark can also live on any hardware inside a virtual machine, e.g. on Amazon Web Services (EC2). Spark has connectors to most popular data sources, such as Apache HBase, Apache Hive, Cassandra, and Tachyon, which was developed by AMPLab specifically for Spark. A Spark cluster consists of one master node and one or more slave nodes.
To start the master: spark-1.6.0-bin-hadoop2.6/sbin>start-master.sh
To stop the master: spark-1.6.0-bin-hadoop2.6/sbin>stop-master.sh
To start a slave: spark-1.6.0-bin-hadoop2.6/sbin>start-slave.sh
Notes on Spark Running Modes (1)
• 12. 1. To stop a slave: spark-1.6.0-bin-hadoop2.6/sbin>stop-slave.sh
2. To start all slaves: spark-1.6.0-bin-hadoop2.6/sbin>start-slaves.sh
3. The Spark master node has a URL such as spark://alex-laptop.local:7077, which we will need.
4. Starting a slave node requires providing the URL of the master node.
5. Losing a slave node is not a problem in standalone mode, as it is resilient against data loss.
6. Losing the master node, however, is problematic, because no new jobs can be scheduled. To work around this, you can use ZooKeeper or single-node recovery.
7. Type exit() to leave a Spark session.
8. To find out which mode you are using, type sc.master in pyspark, or visit http://localhost:4040/environment/ in the browser and look for spark.master under Spark properties.
Notes on Spark Running Modes (2)
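To connect a driver program to a standalone master from code rather than from the pyspark command line, a hedged sketch looks like the one below. The master URL is the example from point 3 above; replace it with whatever start-master.sh reports on your machine, and note that the application name is made up.

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("spark://alex-laptop.local:7077")  # URL printed by the master
            .setAppName("standalone-cluster-demo"))       # illustrative name
    sc = SparkContext(conf=conf)
    print(sc.master)   # confirms which mode we are using, as in point 8 above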
• 13.
import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')

# Make the PySpark sources and the bundled py4j library importable.
sys.path.insert(0, os.path.join(spark_home, 'python'))
# Adjust the py4j zip name to match your Spark version and install location,
# e.g. C:\spark-1.6.0-bin-hadoop2.6\python\lib\py4j-0.9-src.zip.
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.9-src.zip'))

# Runs the same startup script the pyspark shell uses, which creates `sc`.
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))

Notes:
1. Run this code after launching the IPython notebook.
2. It should start PySpark successfully inside IPython (Jupyter).
3. It needs to be run only once, at the beginning of the session.
4. Run it only after completing all the installations described earlier.
Ipython Using Spark
• 14. IPython is a REPL (Read-Eval-Print Loop) program and is included in the Anaconda installation.
To run IPython inside the command prompt (instead of a web browser), just type: ipython. To exit the Spark shell, type: exit().
To run IPython in a web browser from cmd, type: ipython notebook. To exit that shell, press Ctrl + Break.
To install IPython from cmd: pip install ipython. One way to install it is inside a virtual environment (virtualenv ipython-env), which has the advantage of not affecting the main working environment.
To update IPython from the command line: conda update ipython, or pip install ipython --upgrade.
It is also possible to update the Anaconda distribution (packages) from the command line: type conda update conda and then specify the packages you want to upgrade.
If some packages (like SciPy or sklearn.preprocessing) cannot be installed, there is a problem with the Anaconda installation or its path. A better approach is to uninstall and reinstall Anaconda for the specific Python version (i.e. Python 2.7) and then install the packages with conda install numpy scipy matplotlib. The video tutorial "Install Python with NumPy SciPy Matplotlib on Windows" can help.
Jupyter is the new name for IPython; it supports other languages besides Python.
Ipython Settings (1)
• 15. Jupyter can be launched from the command prompt like this: jupyter notebook. To exit the shell, press Ctrl + Break. Jupyter can also be launched as a program from Program Files in Windows.
In addition to running PySpark in IPython in the web browser, we can run PySpark in IPython inside the command prompt itself. This means importing the PySpark library into the IPython shell: after starting IPython in cmd, type: import pyspark as ps.
R code can also be used in the IPython REPL after installing IRkernel. R can also be run in the cmd environment by typing: SparkR.
The Spyder IDE comes bundled with the Anaconda package.
To see the current working directory, type: pwd.
To monitor Spark processes, we can use localhost:4040 or external tools such as Ganglia.
To see the default number of partitions on a PC, use sc.defaultMinPartitions or sc.defaultParallelism (the result depends on the number of cores in that PC). The default number of partitions in Spark on a laptop or PC is 1. Spark defaults to one partition per block of the input file, but the default block size differs when reading from HDFS, where it is 64 MB because the files stored there are large; for local files, and in most operating systems, the block size is on the order of kilobytes. Use: rdd.partitions.size.
In Spark, code runs in two places: the driver (the shell, or sc) and the executors (worker threads). In cluster mode the number of executors can be set (this is not available in the standalone mode).
Spark runs on top of the JVM (Java Virtual Machine) but is platform independent in the sense that there are versions for Windows, Linux, and so on.
Ipython Settings (2)
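As a small illustration of the partition defaults mentioned above, the lines below can be typed in a PySpark session (where sc already exists). The file name is hypothetical; getNumPartitions() is the PySpark counterpart of Scala's rdd.partitions.size.

    print(sc.defaultMinPartitions)
    print(sc.defaultParallelism)    # depends on the number of cores

    rdd = sc.textFile("some_local_file.txt")   # hypothetical local file
    print(rdd.getNumPartitions())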
• 16. To monitor the performance of an application on Spark, we can use:
1. The Web UI of the Spark application, which defaults to port 4040
2. The History Server (application level)
3. Ganglia (infrastructure level)
4. jstack, jmap, jstat (JVM level)
To optimize the performance of a Spark application and troubleshoot problems, try to:
1. Optimize the data in terms of serialization/deserialization and locality
2. Optimize the application itself by using more efficient data structures, caching, broadcasting, and better shuffles
3. Optimize the framework itself (parallelism, memory, or garbage collection)
In the Spark standalone cluster manager (or any other cluster manager), the master node launches its own web UI as well. The usual localhost port for the Web UI/history server on Windows is 4040, and we use it to monitor Spark processes; it contains information about schedulers, RDDs, the Spark environment, and executors. In the "Executors" tab you will find only one executor when running on Windows, whereas in cluster mode you will find many executors, depending on the size of the cluster. After calling the cache() function, the "Storage" tab becomes populated and shows that a MapPartitionsRDD is now held in memory, along with its size.
Ipython Settings (3)
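A quick, illustrative way to see the "Storage" tab come alive is to cache a small RDD and run an action on it, as in the sketch below (the data values are made up); the cached MapPartitionsRDD then appears at http://localhost:4040 under Storage.

    # Typed in a PySpark session where `sc` already exists:
    words = sc.parallelize(["spark", "hadoop", "mesos", "yarn"])
    pairs = words.map(lambda w: (w, len(w))).cache()
    pairs.count()   # the action materializes the RDD and caches its partitions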
• 17. Typing ? alone brings up extensive help documentation in the IPython terminal.
Typing ?word gives information about that keyword; typing ??word gives more detailed information.
Typing sc. followed by Tab lists all functions available on the Spark context, and typing sc? shows its built-in help.
Type help() for general help inside Spark code, or help(function_name) to see the details of a specific function.
IPython also provides debugging: type %debug and then use "h" for help, "w" for the current location, "q" to quit, and so on. %pdb drops you into the debugger immediately after an exception is thrown.
Placing print statements at several points in the code also serves as a simple debugging method.
Help Within Ipython
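For reference, these help and debugging commands are typed directly at the IPython prompt (they are IPython conveniences, not Spark APIs); the sketch below simply gathers the ones mentioned above.

    ?                   # general IPython help
    sc?                 # built-in help for the SparkContext object
    sc.parallelize??    # detailed help (docstring plus source) for one method
    help(sc.textFile)   # plain Python help also works
    %debug              # open the debugger after an error (h = help, w = where, q = quit)
    %pdb                # drop into the debugger automatically when an exception is thrown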
• 18. Spark is deployed in one of two modes: local mode (running on a laptop or a single machine) or cluster mode (which provides better performance than local mode). Installing Spark on laptops or PCs uses the standalone setup, while a cluster requires Mesos or YARN. In other words, Spark scales in two ways: a single JVM or a managed cluster.
1. In single-JVM mode, Spark runs on one box (Linux or Windows) and all components (driver, executors) run within the same JVM. This is a simple setup, intended for training rather than production.
2. In a managed cluster, Spark can scale from 2 to thousands of nodes, using any cluster manager (such as Mesos or YARN) to manage them. Data is distributed and processed across all nodes. This setup is suitable for production environments.
Spark Deployment
• 19. There are three ways to run Spark in local mode (a sketch of all three follows this slide):
1. Single-threaded: running SparkContext with a single thread, i.e. SparkContext('local'). Execution is sequential, which makes it easier to debug program logic because tasks run one after another; in multi-threaded mode there is no guarantee about the order in which tasks execute, or which tasks run at all at a given moment. After building a program in single-threaded mode, we can move on to the more advanced multi-threaded mode to test the application.
2. Multi-threaded: leverages multiple cores and threads of the computer. For example, SparkContext('local[4]') gives the application four worker threads. Concurrent execution exploits the parallelism available in the machine to make programs run faster, and it allows debugging of the coordination and communication of code that executes in parallel. A program that passes testing and debugging at this stage is very likely to work correctly in a fully distributed mode.
3. Pseudo-distributed cluster: largely similar to cluster mode. Distributed execution allows debugging of communication and I/O. It goes one step beyond multi-threaded mode by showing how a program runs in cluster mode on a number of physical or virtual machines: it is possible to debug the communication and input/output of each task and job, and the same interface as in cluster mode is available, including the ability to inspect individual workers and executors and to verify that network communication (IPs and ports) works correctly for the application.
Local Mode
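A minimal sketch of the three variants follows; the application names are made up, and the local-cluster master string shown for the pseudo-distributed case (2 workers, 1 core and 1024 MB each) is mostly used in Spark's own tests, so treat it as an assumption rather than a documented production setting. Only one SparkContext can be active at a time, so in practice you would uncomment a single line.

    from pyspark import SparkContext

    sc = SparkContext("local", "debug-logic")                  # 1. single-threaded
    # sc = SparkContext("local[4]", "debug-parallel")            # 2. multi-threaded, 4 threads
    # sc = SparkContext("local-cluster[2,1,1024]", "pseudo")     # 3. pseudo-distributed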
• 20. Cluster Mode: In cluster mode, we must decide which machines (physical or virtual) Spark will run on and how they are managed. The cluster manager comes in three flavors: Standalone (pre-packaged with Spark), YARN (the default cluster manager for Hadoop), and Mesos (which came out of the same research group at UC Berkeley and is the system Matei worked on in his early days at the AMPLab). The scheduler is responsible for building the stages to execute, submitting them to the cluster manager, and resubmitting failed stages when output is lost. It sits above the worker nodes that run tasks and threads; results from worker nodes go back through the cluster manager and then to the driver program. A Spark cluster cannot be set up on Windows; only a single client node can run on Windows for simple projects. Running a Spark cluster (potentially with thousands of nodes) requires a Linux environment, or Spark installed inside a virtual machine on Windows. Cluster Mode
• 21. In standalone mode (i.e. running Spark locally), both the driver program (e.g. the IPython shell) and the worker nodes (processes inside the laptop) sit on the same physical infrastructure, with no manager like Mesos or YARN involved. If Spark runs on Amazon Web Services, the worker nodes are EC2 instances. The standalone scheduler comes pre-packaged with Spark core. It is a great choice for a dedicated Spark cluster, i.e. one that runs only Spark and does not also host Hadoop or HBase. If different applications need to share resources and load on the same cluster, YARN or Mesos is the better scheduler. The standalone cluster manager also has a high-availability mode that can use Apache ZooKeeper to maintain standby nodes: if a master node fails, another node can be promoted to take its place and work continues with essentially no downtime. Standalone Scheduler
• 22. Mesos is the most general cluster manager that Spark can run on. It can be thought of as a general-purpose cluster and global resource manager for running multiple applications (Spark, Hadoop, MPI, Cassandra, etc.) that share the same resources, such as memory and CPU cycles; Mesos schedules the resources of the cluster among them. Mesos is a popular open-source cluster manager that makes it easy to share a cluster between users and applications, and it is easy to run Spark on top of it. It can be seen as an operating system for a cluster, where multiple applications co-locate: Spark, Hadoop, Kafka, and others can all run in one Mesos cluster. So if Spark, Hadoop, and Cassandra run on the same cluster, Mesos finds the best way to distribute resources such as memory, CPU, and network bandwidth among the applications and the users of the cluster. Mesos and Spark came out of the same research group at UC Berkeley, and Matei (the father of Apache Spark) built the first version of Spark to work with Mesos, which is why they work so well together across various applications. Mesos is a global resource manager that supports multi-tenant and heterogeneous workloads; it is useful for saving resources or for running multiple Spark instances at the same time. Running Spark on an existing Mesos cluster is straightforward, using spark-submit. Compared to Mesos, YARN (discussed next) is more tightly integrated with the Hadoop ecosystem. Mesos
• 23. YARN stands for Yet Another Resource Negotiator and arrived with the second version of Hadoop, when it was abstracted out of the Hadoop MapReduce framework to become a standalone cluster manager. This is why YARN is best suited to stateless batch jobs with long runtimes. YARN and Hadoop grew up together, much like Spark and Mesos. Compared to Mesos, YARN is a monolithic scheduler that both manages cluster resources and schedules the jobs executed on them. YARN is not well suited to long-running processes that must always be up (such as web servers) or to real-time or stateful/interactive services (such as the Spark REPL or database queries). YARN integrates well with existing Hadoop clusters and applications; it was developed to separate MapReduce from the cluster manager, so Spark can be run on YARN easily. Mesos is often more appropriate for Spark than YARN, since YARN requires a lot of configuration and maintenance; given these complications, Mesos or the standalone scheduler is preferable. In any case, a Mesos or YARN cluster must already be in place before we run Spark on top of it. If Hadoop is already installed, YARN comes with it and Spark can be installed next. YARN
• 24. Amazon EC2 (Elastic Compute Cloud) is useful for running Spark on a cluster 24/7 and for fast prototyping and testing. It is elastic in the sense that the number of machines Spark runs on can grow or shrink as needed. EC2 is handy for deploying clusters or testing prototypes, and Spark works very nicely with it: Spark comes bundled with scripts that ease the process of setting up and installing the environment on each of the worker and master machines. Virtual machines (including EC2 instances) are elastic and ephemeral, even when we own physical hardware. EC2 is great at providing the machines needed to run scripts or test prototypes. Although there are other cloud services, EC2 offers elastic scalability and ease of setup, even when Mesos or YARN is installed, and it can be used to exercise the various aspects of Spark itself. For many people it is the only feasible way to scale up their analyses without making large capital investments in building their own clusters. Amazon EC2
• 25. There is a difference between client mode and cluster mode:
1. In client mode (as in a laptop environment), the driver runs in the client, the master acquires resources and reports back to the client, and the client communicates directly with the executors.
2. In cluster mode, the driver runs on the master inside the cluster, the master communicates with the executors entirely within the cluster, and the client exits as soon as it has passed the job information to the master.
We can run Spark in either local mode or cluster mode, and each has its own benefits:
1. Local mode is useful for debugging an application on sample or small-scale data.
2. Cluster mode is useful for scaling the analysis up to entire datasets, or for running things in a parallel and distributed fashion.
Moving from one mode to the other requires only minimal changes to the application code.
Spark Deployment in Summary
• 26. The Hadoop MapReduce framework is similar to Spark in that it uses a master-slave paradigm. It has one master node (consisting of a job tracker, a name node, and RAM) and worker nodes (each with a task tracker, a data node, and RAM). The task tracker in a worker node is analogous to an executor in the Spark environment. Tasks are assigned by the master node, which is also responsible for coordinating work between worker nodes. Spark, however, adds abstractions, generalizations, and performance optimizations that achieve much better efficiency, especially for iterative workloads. On the other hand, Spark does not try to be a distributed file system, whereas Hadoop has HDFS. Spark can leverage existing distributed file systems (like HDFS), a distributed database (like HBase), traditional databases through its JDBC or ODBC adapters, and flat files in local file systems or in a file store such as S3 in the Amazon cloud (see the sketch after this slide). In summary:
1. Spark replaces only MapReduce (the computational engine of a distributed system).
2. We still need a data store: HDFS, HBase, Hive, etc.
3. Spark has a more flexible and general programming model than Hadoop.
4. Spark is an ecosystem of higher-level libraries built on top of the Spark core framework.
5. Spark is often faster for iterative computations.
Comparison between Spark and Hadoop
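As a closing illustration of point 2 (Spark still needs a data store), the same textFile API can read from a local file, HDFS, or S3; all three paths below are hypothetical placeholders rather than working locations.

    # Typed in a PySpark session where `sc` already exists:
    local_rdd = sc.textFile("file:///C:/data/sample.txt")             # local file
    hdfs_rdd = sc.textFile("hdfs://namenode:9000/data/sample.txt")    # HDFS (requires a Hadoop cluster)
    s3_rdd = sc.textFile("s3n://my-bucket/sample.txt")                # Amazon S3 (requires credentials)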