A Step to programming with Apache Spark
1. A Step to programming with Apache Spark
Rahul Kumar
Trainee - Software Consultant
Knoldus Software LLP
2. Building Spark :
1. Pre-built Spark
http://mirror.fibergrid.in/apache/spark/spark-1.6.2/spark-1.6.2-bin-hadoop2.6.tgz
2. Source Code
http://mirror.fibergrid.in/apache/spark/spark-1.6.2/spark-1.6.2.tgz
Go to the SPARK_HOME directory.
Execute: mvn -Dhadoop.version=1.2.1 -Phadoop-1 -DskipTests clean package
To start Spark,
go to SPARK_HOME/bin
Execute: ./spark-shell
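Once the shell is up, the SparkContext is already bound to the variable sc. As a quick sanity check (a minimal sketch; the numbers are arbitrary):

// Inside spark-shell: parallelize a small collection and sum it.
val nums = sc.parallelize(1 to 100)
println(nums.sum()) // expected: 5050.0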
3. ● The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.
● Spark is not a modified version of Hadoop, because it has its own cluster management.
● Spark uses Hadoop in two ways: one for storage and one for processing. Since Spark has its own cluster-management computation, it uses Hadoop for storage only.
Spark Features :
4. Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).
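A minimal sketch of such a driver program (the app name and the local[2] master URL are placeholders for illustration):

import org.apache.spark.{SparkConf, SparkContext}

object DriverApp {
  def main(args: Array[String]): Unit = {
    // The SparkConf describes the application; "local[2]" runs
    // two worker threads locally instead of on a real cluster.
    val conf = new SparkConf().setAppName("Demo").setMaster("local[2]")
    val sc = new SparkContext(conf) // the driver's handle on the cluster
    // ... define RDDs and run actions here ...
    sc.stop()
  }
}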
7. ● A Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark.
● It is an immutable distributed collection of objects.
● RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
● There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system (as textFile does below).
● e.g. val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
● val distFile = sc.textFile("data.txt")
distFile: RDD[String] = MappedRDD@1d4cee08
RDD :
9. ● RDDs support two types of operations:
✔ Transformations, which create a new dataset from an existing one, and
✔ Actions, which return a value to the driver program after running a computation on the dataset.
● For example,
✔ map is a transformation that passes each dataset element through a function and returns a new RDD representing the results.
✔ reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program.
● All transformations in Spark are lazy, in that they do not compute their results right away; they are computed only when an action requires a result to be returned to the driver program (see the sketch after this slide).
RDD :
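A minimal sketch of this, reusing the data.txt file from slide 7 and the sc variable from spark-shell:

// map is lazy: nothing is computed when this line is evaluated.
val lengths = sc.textFile("data.txt").map(line => line.length)
// reduce is an action: it triggers the computation and returns
// the total number of characters to the driver program.
val totalChars = lengths.reduce((a, b) => a + b)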
10. ● A DataFrame is equivalent to a relational table
in Spark SQL.
DataFrame :
● Steps to create DataFrame :
Create SparkContext object :
– val conf = new SparkConf().setAppName("Demo").setMaster("local[2]")
– val sc = new SparkContext(conf)
Create SqlContext object :
– val sqlContext = new SQLContext(sc)
Read Data From Files :
– val df = sqlContext.read.json("src/main/scala/emp.json")
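Put together as one runnable sketch (the JSON path is taken from the slide; the record layout of emp.json is an assumption for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("Demo").setMaster("local[2]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Each line of emp.json is expected to be one JSON object,
// e.g. {"name":"Ravi","age":28}
val df = sqlContext.read.json("src/main/scala/emp.json")
df.printSchema() // schema inferred from the JSON records
df.show()        // prints the first rows as a table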
11. ● A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable and each row contains one case.
● A DataFrame carries additional metadata due to its tabular format, which allows Spark to run certain optimizations on the finalized query.
● An RDD, on the other hand, is merely a Resilient Distributed Dataset: more of a black box of data that cannot be optimized, because the operations that can be performed against it are not as constrained.
● However, you can go from a DataFrame to an RDD via its rdd method, and you can go from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method (see the sketch after this slide).
DataFrame and RDD :
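A minimal sketch of both directions, assuming the sc and sqlContext from slide 10 (the Person case class and its sample rows are assumptions):

// DataFrame -> RDD: every DataFrame exposes its rows as an RDD[Row].
val rowRdd = df.rdd

// RDD -> DataFrame: an RDD of case-class instances is tabular,
// so toDF can derive the column names from the fields.
import sqlContext.implicits._
case class Person(name: String, age: Int)
val peopleDf = sc.parallelize(Seq(Person("Ravi", 28), Person("Asha", 31))).toDF()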
14. ● Hive is a data warehouse infrastructure tool for processing structured data in Hadoop.
● It resides on top of Hadoop to summarize Big Data, and it makes querying and analysis easy.
● It stores the schema in a database and the processed data in HDFS.
● It provides a SQL-like query language called HiveQL or HQL.
● It is designed for OLAP.
Hive :
15. ● Hive comes bundled with the Spark library as
HiveContext, which inherits from SQLContext.
● Using HiveContext, you can create and find tables in
the HiveMetaStore and write queries on it using
HiveQL.
● Users who do not have an existing Hive deployment
can still create a HiveContext.
● When not configured via hive-site.xml, the context automatically creates a metastore called metastore_db and a folder called warehouse in the current directory (see the sketch after this slide).
Spark-Hive :
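A minimal sketch of creating and querying a Hive table through HiveContext (the table name, column layout, and employee.txt data file are assumptions; with no hive-site.xml this writes metastore_db to the current directory):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// Create a table in the HiveMetaStore and load local data into it.
hiveContext.sql("CREATE TABLE IF NOT EXISTS employee (id INT, name STRING) " +
  "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','")
hiveContext.sql("LOAD DATA LOCAL INPATH 'employee.txt' INTO TABLE employee")
// Query it with HiveQL; the result comes back as a DataFrame.
hiveContext.sql("SELECT * FROM employee").show()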
16. ➢ Spark SQL supports queries written using HiveQL.
➢ HiveQL is a SQL-like language whose queries are converted into Spark jobs.
➢ HiveQL is more mature and supports more complex
queries than Spark SQL.
Spark-Hive :(continued)
17. To construct a SQL query,
1) first create a SQLContext instance,
val sqlContext = new SQLContext(sc)
2) then submit queries by calling the sql method on the SQLContext instance.
val res = sqlContext.sql("select * from employee")
To construct a HiveQL query,
1) first create a new HiveContext instance,
val conf = new SparkConf().setAppName("Demo").setMaster("local[2]")
val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)
2) then submit queries by calling the sql method on the HiveContext instance.
val res = hiveContext.sql("select * from employee")
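The HiveQL fragment above, assembled into one self-contained sketch (imports added; assumes the employee table from the previous slide already exists in the metastore):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveQuery {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)
    // sql returns a DataFrame; show() prints it,
    // collect() would bring the rows back to the driver.
    val res = hiveContext.sql("select * from employee")
    res.show()
    sc.stop()
  }
}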