SlideShare a Scribd company logo
1 of 20
Download to read offline
A Step to programming withA Step to programming with
Rahul Kumar
Trainee - Software Consultant
Knoldus Software LLP
Rahul Kumar
Trainee - Software Consultant
Knoldus Software LLP
Building Spark :
1. Pre Build Spark
http://mirror.fibergrid.in/apache/spark/spark-1.6.2/spark-1.6.2-bin-hadoop2.6.tgz
2. Source Code
http://mirror.fibergrid.in/apache/spark/spark-1.6.2/spark-1.6.2.tgz
Goto the SPARK_HOME directory.
Execute : mvn -Dhadoop.version=1.2.1 -Phadoop-1 -DskipTests clean package
To start spark
goto the SPARK_HOME/bin
Execute ./spark-shell
● The main feature of Spark is its in-memory cluster
computing that increases the processing speed of an
application.
● Spark is not a modified version of Hadoop because
it has its own cluster management.
● Spark uses Hadoop in two ways – one is storage
and second is processing. Since Spark has its own
cluster management computation, it uses Hadoop
for storage purpose only.
Spark Features :
Spark applications run as independent
sets of processes on a cluster,coordinated
by the SparkContext object in your main
program (called the driver program).
● Resilient Distributed Datasets (RDD) is a fundamental
data structure of Spark.
● It is an immutable distributed collection of objects.
● RDDs can contain any type of Python, Java, or Scala
objects, including user-defined classes.
● There are two ways to create RDDs: parallelizing an
existing collection in your driver program
● e.g. val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
● val distFile = sc.textFile("data.txt")
distFile: RDD[String] = MappedRDD@1d4cee08
RDD :
● RDD(SPARK)
● HDFS(HADOOP)
● RDDs support two types of operations:
✔ Transformations, which create a new dataset from an existing one, and
✔Actions, which return a value to the driver program after running a
computation on the dataset.
●For example,
✔ map is a transformation that passes each dataset element through a
function and returns a new RDD representing the results.
✔ reduce is an action that aggregates all the elements of the RDD using
some function and returns the final result to the driver program
●All transformations in Spark are lazy, in that they do not
compute their results right away
● RDDs support two types of operations:
✔ Transformations, which create a new dataset from an existing one, and
✔Actions, which return a value to the driver program after running a
computation on the dataset.
●For example,
✔ map is a transformation that passes each dataset element through a
function and returns a new RDD representing the results.
✔ reduce is an action that aggregates all the elements of the RDD using
some function and returns the final result to the driver program
●All transformations in Spark are lazy, in that they do not
compute their results right away
RDD :
● A DataFrame is equivalent to a relational table
in Spark SQL.
DataFrame :
● Steps to create DataFrame :
 Create SparkContext object :
– val conf = new
SparkConf().setAppName("Demo").setMaster("local[2]")
– val sc = new SparkContext(conf)
 Create SqlContext object :
– val sqlContext = new SQLContext(sc)
 Read Data From Files :
– val df = sqlContext.read.json("src/main/scala/emp.json")
● A data frame is a table, or two-dimensional array-like structure, in which each column
contains measurements on one variable, and each row contains one case.
● DataFrame has additional metadata due to its tabular format, which allows Spark to
run certain optimizations on the finalized query.
● An RDD, on the other hand, is merely a Resilient Distributed Dataset
that is more of a blackbox of data that cannot be optimized as the
operations that can be performed against it are not as constrained.
● However, you can go from a DataFrame to an RDD via its rdd
method, and you can go from an RDD to a DataFrame (if the RDD is
in a tabular format) via the toDF method
DataFrame and RDD :
DataFrame Transformations :
● Def orderBy(sortExprs: Column*): DataFrame
● Def select(cols: Column*): DataFrame
● Def show(): Unit
● Def filter(conditionExpr: String): DataFrame
● Def groupBy(cols: Column*): GroupedData
●Def collect(): Array[Row]
●Def collectAsList(): List[Row]
●Def count(): Long
●Def head(): Row
●Def head(n: Int): Array[Row]
●Def collect(): Array[Row]
●Def collectAsList(): List[Row]
●Def count(): Long
●Def head(): Row
●Def head(n: Int): Array[Row]
DataFrame Actions :
● Hive is a data warehouse infrastructure tool to process structured data
in Hadoop.
● It resides on top of Hadoop to summarize Big Data, and makes
querying and analyzing easy.
● It stores schema in a database and processed data into HDFS.
● It provides SQL type language for querying called HiveQL or HQL.
● It is designed for OLAP.
Hive :
● Hive comes bundled with the Spark library as
HiveContext, which inherits from SQLContext.
● Using HiveContext, you can create and find tables in
the HiveMetaStore and write queries on it using
HiveQL.
● Users who do not have an existing Hive deployment
can still create a HiveContext.
● When not configured by the hive-site.xml, the context
automatically creates a metastore called metastore_db
and a folder called warehouse in the current directory.
Spark-Hive :
➢ Spark SQL supports queries written using HiveQL.
➢ Its a SQL-like language that produces queries that are
converted to Spark jobs.
➢ HiveQL is more mature and supports more complex
queries than Spark SQL.
Spark-Hive :(continued)
1) first create a SqlContext instance,
val sqlContext = new SqlContext(sc)
2) submit the queries by calling the sql method on the HiveContext instance.
val res=sqlContext.sql("select * from employee")
To construct a HiveQL query,
1) first create a new HiveContext instance,
val conf = new SparkConf().setAppName("Demo").setMaster("local[2]")
val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)
2) submit the queries by calling the sql method on the HiveContext instance.
val res=hiveContext.sql("select * from employee")
ReferencesReferences
● http://spark.apache.org/docs/latest/sql-programming-guide.html
● http://www.tutorialspoint.com/spark_sql/spark_introduction.htm
● https://cwiki.apache.org
A Step to programming with Apache Spark

More Related Content

What's hot

JVM languages "flame wars"
JVM languages "flame wars"JVM languages "flame wars"
JVM languages "flame wars"Gal Marder
 
Introduction to concurrent programming with Akka actors
Introduction to concurrent programming with Akka actorsIntroduction to concurrent programming with Akka actors
Introduction to concurrent programming with Akka actorsShashank L
 
Making Scala Faster: 3 Expert Tips For Busy Development Teams
Making Scala Faster: 3 Expert Tips For Busy Development TeamsMaking Scala Faster: 3 Expert Tips For Busy Development Teams
Making Scala Faster: 3 Expert Tips For Busy Development TeamsLightbend
 
Event sourcing Live 2021: Streaming App Changes to Event Store
Event sourcing Live 2021: Streaming App Changes to Event StoreEvent sourcing Live 2021: Streaming App Changes to Event Store
Event sourcing Live 2021: Streaming App Changes to Event StoreShivji Kumar Jha
 
What’s expected in Spring 5
What’s expected in Spring 5What’s expected in Spring 5
What’s expected in Spring 5Gal Marder
 
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...Brandon O'Brien
 
Lightbend Lagom: Microservices Just Right
Lightbend Lagom: Microservices Just RightLightbend Lagom: Microservices Just Right
Lightbend Lagom: Microservices Just Rightmircodotta
 
Stream processing from single node to a cluster
Stream processing from single node to a clusterStream processing from single node to a cluster
Stream processing from single node to a clusterGal Marder
 
From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita...
From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita...From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita...
From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita...confluent
 
Standalone Spark Deployment for Stability and Performance
Standalone Spark Deployment for Stability and PerformanceStandalone Spark Deployment for Stability and Performance
Standalone Spark Deployment for Stability and PerformanceRomi Kuntsman
 
Building scalable rest service using Akka HTTP
Building scalable rest service using Akka HTTPBuilding scalable rest service using Akka HTTP
Building scalable rest service using Akka HTTPdatamantra
 
Extending Spark With Java Agent (handout)
Extending Spark With Java Agent (handout)Extending Spark With Java Agent (handout)
Extending Spark With Java Agent (handout)Jaroslav Bachorik
 
Introduction to Scala | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Scala | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Scala | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Scala | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Reactive Database Access With Slick 3
Reactive Database Access With Slick 3Reactive Database Access With Slick 3
Reactive Database Access With Slick 3Igor Mielientiev
 
Correctness and Performance of Apache Spark SQL with Bogdan Ghit and Nicolas ...
Correctness and Performance of Apache Spark SQL with Bogdan Ghit and Nicolas ...Correctness and Performance of Apache Spark SQL with Bogdan Ghit and Nicolas ...
Correctness and Performance of Apache Spark SQL with Bogdan Ghit and Nicolas ...Databricks
 
Building Stateful Microservices With Akka
Building Stateful Microservices With AkkaBuilding Stateful Microservices With Akka
Building Stateful Microservices With AkkaYaroslav Tkachenko
 
What's New in Spark 2?
What's New in Spark 2?What's New in Spark 2?
What's New in Spark 2?Eyal Ben Ivri
 

What's hot (20)

Scala Days NYC 2016
Scala Days NYC 2016Scala Days NYC 2016
Scala Days NYC 2016
 
JVM languages "flame wars"
JVM languages "flame wars"JVM languages "flame wars"
JVM languages "flame wars"
 
Introduction to concurrent programming with Akka actors
Introduction to concurrent programming with Akka actorsIntroduction to concurrent programming with Akka actors
Introduction to concurrent programming with Akka actors
 
Making Scala Faster: 3 Expert Tips For Busy Development Teams
Making Scala Faster: 3 Expert Tips For Busy Development TeamsMaking Scala Faster: 3 Expert Tips For Busy Development Teams
Making Scala Faster: 3 Expert Tips For Busy Development Teams
 
Event sourcing Live 2021: Streaming App Changes to Event Store
Event sourcing Live 2021: Streaming App Changes to Event StoreEvent sourcing Live 2021: Streaming App Changes to Event Store
Event sourcing Live 2021: Streaming App Changes to Event Store
 
What’s expected in Spring 5
What’s expected in Spring 5What’s expected in Spring 5
What’s expected in Spring 5
 
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
 
Lightbend Lagom: Microservices Just Right
Lightbend Lagom: Microservices Just RightLightbend Lagom: Microservices Just Right
Lightbend Lagom: Microservices Just Right
 
Stream processing from single node to a cluster
Stream processing from single node to a clusterStream processing from single node to a cluster
Stream processing from single node to a cluster
 
Javantura v3 - Going Reactive with RxJava – Hrvoje Crnjak
Javantura v3 - Going Reactive with RxJava – Hrvoje CrnjakJavantura v3 - Going Reactive with RxJava – Hrvoje Crnjak
Javantura v3 - Going Reactive with RxJava – Hrvoje Crnjak
 
From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita...
From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita...From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita...
From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita...
 
Standalone Spark Deployment for Stability and Performance
Standalone Spark Deployment for Stability and PerformanceStandalone Spark Deployment for Stability and Performance
Standalone Spark Deployment for Stability and Performance
 
Building scalable rest service using Akka HTTP
Building scalable rest service using Akka HTTPBuilding scalable rest service using Akka HTTP
Building scalable rest service using Akka HTTP
 
Extending Spark With Java Agent (handout)
Extending Spark With Java Agent (handout)Extending Spark With Java Agent (handout)
Extending Spark With Java Agent (handout)
 
Apache cassandra
Apache cassandraApache cassandra
Apache cassandra
 
Introduction to Scala | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Scala | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Scala | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Scala | Big Data Hadoop Spark Tutorial | CloudxLab
 
Reactive Database Access With Slick 3
Reactive Database Access With Slick 3Reactive Database Access With Slick 3
Reactive Database Access With Slick 3
 
Correctness and Performance of Apache Spark SQL with Bogdan Ghit and Nicolas ...
Correctness and Performance of Apache Spark SQL with Bogdan Ghit and Nicolas ...Correctness and Performance of Apache Spark SQL with Bogdan Ghit and Nicolas ...
Correctness and Performance of Apache Spark SQL with Bogdan Ghit and Nicolas ...
 
Building Stateful Microservices With Akka
Building Stateful Microservices With AkkaBuilding Stateful Microservices With Akka
Building Stateful Microservices With Akka
 
What's New in Spark 2?
What's New in Spark 2?What's New in Spark 2?
What's New in Spark 2?
 

Viewers also liked

Introduction to Apache Kafka- Part 2
Introduction to Apache Kafka- Part 2Introduction to Apache Kafka- Part 2
Introduction to Apache Kafka- Part 2Knoldus Inc.
 
Meet Up - Spark Stream Processing + Kafka
Meet Up - Spark Stream Processing + KafkaMeet Up - Spark Stream Processing + Kafka
Meet Up - Spark Stream Processing + KafkaKnoldus Inc.
 
Introduction to Kafka connect
Introduction to Kafka connectIntroduction to Kafka connect
Introduction to Kafka connectKnoldus Inc.
 
Introduction to Apache Kafka- Part 1
Introduction to Apache Kafka- Part 1Introduction to Apache Kafka- Part 1
Introduction to Apache Kafka- Part 1Knoldus Inc.
 
Effective way to code in Scala
Effective way to code in ScalaEffective way to code in Scala
Effective way to code in ScalaKnoldus Inc.
 
Introduction to Apache Spark 2.0
Introduction to Apache Spark 2.0Introduction to Apache Spark 2.0
Introduction to Apache Spark 2.0Knoldus Inc.
 
Akka Finite State Machine
Akka Finite State MachineAkka Finite State Machine
Akka Finite State MachineKnoldus Inc.
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
Hyperparameter Optimization - Sven Hafeneger
Hyperparameter Optimization - Sven HafenegerHyperparameter Optimization - Sven Hafeneger
Hyperparameter Optimization - Sven Hafenegersparktc
 
Apache Spark™ Applications the Easy Way - Pierre Borckmans
Apache Spark™ Applications the Easy Way - Pierre BorckmansApache Spark™ Applications the Easy Way - Pierre Borckmans
Apache Spark™ Applications the Easy Way - Pierre Borckmanssparktc
 
Getting Started With AureliaJs
Getting Started With AureliaJsGetting Started With AureliaJs
Getting Started With AureliaJsKnoldus Inc.
 
Drilling the Async Library
Drilling the Async LibraryDrilling the Async Library
Drilling the Async LibraryKnoldus Inc.
 
Introduction to Scala JS
Introduction to Scala JSIntroduction to Scala JS
Introduction to Scala JSKnoldus Inc.
 
String interpolation
String interpolationString interpolation
String interpolationKnoldus Inc.
 
Mailchimp and Mandrill - The ‘Hominidae’ kingdom
Mailchimp and Mandrill - The ‘Hominidae’ kingdomMailchimp and Mandrill - The ‘Hominidae’ kingdom
Mailchimp and Mandrill - The ‘Hominidae’ kingdomKnoldus Inc.
 
Realm Mobile Database - An Introduction
Realm Mobile Database - An IntroductionRealm Mobile Database - An Introduction
Realm Mobile Database - An IntroductionKnoldus Inc.
 
Shapeless- Generic programming for Scala
Shapeless- Generic programming for ScalaShapeless- Generic programming for Scala
Shapeless- Generic programming for ScalaKnoldus Inc.
 
An Introduction to Quill
An Introduction to QuillAn Introduction to Quill
An Introduction to QuillKnoldus Inc.
 

Viewers also liked (20)

Introduction to Apache Kafka- Part 2
Introduction to Apache Kafka- Part 2Introduction to Apache Kafka- Part 2
Introduction to Apache Kafka- Part 2
 
Meet Up - Spark Stream Processing + Kafka
Meet Up - Spark Stream Processing + KafkaMeet Up - Spark Stream Processing + Kafka
Meet Up - Spark Stream Processing + Kafka
 
Introduction to Kafka connect
Introduction to Kafka connectIntroduction to Kafka connect
Introduction to Kafka connect
 
Introduction to Apache Kafka- Part 1
Introduction to Apache Kafka- Part 1Introduction to Apache Kafka- Part 1
Introduction to Apache Kafka- Part 1
 
Effective way to code in Scala
Effective way to code in ScalaEffective way to code in Scala
Effective way to code in Scala
 
Introduction to Apache Spark 2.0
Introduction to Apache Spark 2.0Introduction to Apache Spark 2.0
Introduction to Apache Spark 2.0
 
Akka Finite State Machine
Akka Finite State MachineAkka Finite State Machine
Akka Finite State Machine
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Hyperparameter Optimization - Sven Hafeneger
Hyperparameter Optimization - Sven HafenegerHyperparameter Optimization - Sven Hafeneger
Hyperparameter Optimization - Sven Hafeneger
 
Apache Spark™ Applications the Easy Way - Pierre Borckmans
Apache Spark™ Applications the Easy Way - Pierre BorckmansApache Spark™ Applications the Easy Way - Pierre Borckmans
Apache Spark™ Applications the Easy Way - Pierre Borckmans
 
Getting Started With AureliaJs
Getting Started With AureliaJsGetting Started With AureliaJs
Getting Started With AureliaJs
 
Drilling the Async Library
Drilling the Async LibraryDrilling the Async Library
Drilling the Async Library
 
Introduction to Scala JS
Introduction to Scala JSIntroduction to Scala JS
Introduction to Scala JS
 
Akka streams
Akka streamsAkka streams
Akka streams
 
String interpolation
String interpolationString interpolation
String interpolation
 
Mailchimp and Mandrill - The ‘Hominidae’ kingdom
Mailchimp and Mandrill - The ‘Hominidae’ kingdomMailchimp and Mandrill - The ‘Hominidae’ kingdom
Mailchimp and Mandrill - The ‘Hominidae’ kingdom
 
Realm Mobile Database - An Introduction
Realm Mobile Database - An IntroductionRealm Mobile Database - An Introduction
Realm Mobile Database - An Introduction
 
Kanban
KanbanKanban
Kanban
 
Shapeless- Generic programming for Scala
Shapeless- Generic programming for ScalaShapeless- Generic programming for Scala
Shapeless- Generic programming for Scala
 
An Introduction to Quill
An Introduction to QuillAn Introduction to Quill
An Introduction to Quill
 

Similar to A Step to programming with Apache Spark

Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for BeginnersAnirudh
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analyticsinoshg
 
Spark and scala..................................... ppt.pptx
Spark and scala..................................... ppt.pptxSpark and scala..................................... ppt.pptx
Spark and scala..................................... ppt.pptxshivani22y
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML ConferenceDB Tsai
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the SurfaceJosi Aranda
 
Data processing with spark in r & python
Data processing with spark in r & pythonData processing with spark in r & python
Data processing with spark in r & pythonMaloy Manna, PMP®
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark TutorialAhmet Bulut
 
spark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark examplespark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark exampleShidrokhGoudarzi1
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteChris Baynes
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25thSneha Challa
 
Stream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and KafkaStream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and KafkaItai Yaffe
 
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...Omid Vahdaty
 
Introduction to spark 2.0
Introduction to spark 2.0Introduction to spark 2.0
Introduction to spark 2.0datamantra
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015Robbie Strickland
 

Similar to A Step to programming with Apache Spark (20)

Spark core
Spark coreSpark core
Spark core
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for Beginners
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Spark and scala..................................... ppt.pptx
Spark and scala..................................... ppt.pptxSpark and scala..................................... ppt.pptx
Spark and scala..................................... ppt.pptx
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
 
Data processing with spark in r & python
Data processing with spark in r & pythonData processing with spark in r & python
Data processing with spark in r & python
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
spark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark examplespark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark example
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache Calcite
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
Apache spark
Apache sparkApache spark
Apache spark
 
Stream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and KafkaStream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and Kafka
 
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
 
Introduction to spark 2.0
Introduction to spark 2.0Introduction to spark 2.0
Introduction to spark 2.0
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015
 

More from Knoldus Inc.

Robusta -Tool Presentation (DevOps).pptx
Robusta -Tool Presentation (DevOps).pptxRobusta -Tool Presentation (DevOps).pptx
Robusta -Tool Presentation (DevOps).pptxKnoldus Inc.
 
Optimizing Kubernetes using GOLDILOCKS.pptx
Optimizing Kubernetes using GOLDILOCKS.pptxOptimizing Kubernetes using GOLDILOCKS.pptx
Optimizing Kubernetes using GOLDILOCKS.pptxKnoldus Inc.
 
Azure Function App Exception Handling.pptx
Azure Function App Exception Handling.pptxAzure Function App Exception Handling.pptx
Azure Function App Exception Handling.pptxKnoldus Inc.
 
CQRS Design Pattern Presentation (Java).pptx
CQRS Design Pattern Presentation (Java).pptxCQRS Design Pattern Presentation (Java).pptx
CQRS Design Pattern Presentation (Java).pptxKnoldus Inc.
 
ETL Observability: Azure to Snowflake Presentation
ETL Observability: Azure to Snowflake PresentationETL Observability: Azure to Snowflake Presentation
ETL Observability: Azure to Snowflake PresentationKnoldus Inc.
 
Scripting with K6 - Beyond the Basics Presentation
Scripting with K6 - Beyond the Basics PresentationScripting with K6 - Beyond the Basics Presentation
Scripting with K6 - Beyond the Basics PresentationKnoldus Inc.
 
Getting started with dotnet core Web APIs
Getting started with dotnet core Web APIsGetting started with dotnet core Web APIs
Getting started with dotnet core Web APIsKnoldus Inc.
 
Introduction To Rust part II Presentation
Introduction To Rust part II PresentationIntroduction To Rust part II Presentation
Introduction To Rust part II PresentationKnoldus Inc.
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Configuring Workflows & Validators in JIRA
Configuring Workflows & Validators in JIRAConfiguring Workflows & Validators in JIRA
Configuring Workflows & Validators in JIRAKnoldus Inc.
 
Advanced Python (with dependency injection and hydra configuration packages)
Advanced Python (with dependency injection and hydra configuration packages)Advanced Python (with dependency injection and hydra configuration packages)
Advanced Python (with dependency injection and hydra configuration packages)Knoldus Inc.
 
Azure Databricks (For Data Analytics).pptx
Azure Databricks (For Data Analytics).pptxAzure Databricks (For Data Analytics).pptx
Azure Databricks (For Data Analytics).pptxKnoldus Inc.
 
The Power of Dependency Injection with Dagger 2 and Kotlin
The Power of Dependency Injection with Dagger 2 and KotlinThe Power of Dependency Injection with Dagger 2 and Kotlin
The Power of Dependency Injection with Dagger 2 and KotlinKnoldus Inc.
 
Data Engineering with Databricks Presentation
Data Engineering with Databricks PresentationData Engineering with Databricks Presentation
Data Engineering with Databricks PresentationKnoldus Inc.
 
Databricks for MLOps Presentation (AI/ML)
Databricks for MLOps Presentation (AI/ML)Databricks for MLOps Presentation (AI/ML)
Databricks for MLOps Presentation (AI/ML)Knoldus Inc.
 
NoOps - (Automate Ops) Presentation.pptx
NoOps - (Automate Ops) Presentation.pptxNoOps - (Automate Ops) Presentation.pptx
NoOps - (Automate Ops) Presentation.pptxKnoldus Inc.
 
Mastering Distributed Performance Testing
Mastering Distributed Performance TestingMastering Distributed Performance Testing
Mastering Distributed Performance TestingKnoldus Inc.
 
MLops on Vertex AI Presentation (AI/ML).pptx
MLops on Vertex AI Presentation (AI/ML).pptxMLops on Vertex AI Presentation (AI/ML).pptx
MLops on Vertex AI Presentation (AI/ML).pptxKnoldus Inc.
 
Introduction to Ansible Tower Presentation
Introduction to Ansible Tower PresentationIntroduction to Ansible Tower Presentation
Introduction to Ansible Tower PresentationKnoldus Inc.
 
CQRS with dot net services presentation.
CQRS with dot net services presentation.CQRS with dot net services presentation.
CQRS with dot net services presentation.Knoldus Inc.
 

More from Knoldus Inc. (20)

Robusta -Tool Presentation (DevOps).pptx
Robusta -Tool Presentation (DevOps).pptxRobusta -Tool Presentation (DevOps).pptx
Robusta -Tool Presentation (DevOps).pptx
 
Optimizing Kubernetes using GOLDILOCKS.pptx
Optimizing Kubernetes using GOLDILOCKS.pptxOptimizing Kubernetes using GOLDILOCKS.pptx
Optimizing Kubernetes using GOLDILOCKS.pptx
 
Azure Function App Exception Handling.pptx
Azure Function App Exception Handling.pptxAzure Function App Exception Handling.pptx
Azure Function App Exception Handling.pptx
 
CQRS Design Pattern Presentation (Java).pptx
CQRS Design Pattern Presentation (Java).pptxCQRS Design Pattern Presentation (Java).pptx
CQRS Design Pattern Presentation (Java).pptx
 
ETL Observability: Azure to Snowflake Presentation
ETL Observability: Azure to Snowflake PresentationETL Observability: Azure to Snowflake Presentation
ETL Observability: Azure to Snowflake Presentation
 
Scripting with K6 - Beyond the Basics Presentation
Scripting with K6 - Beyond the Basics PresentationScripting with K6 - Beyond the Basics Presentation
Scripting with K6 - Beyond the Basics Presentation
 
Getting started with dotnet core Web APIs
Getting started with dotnet core Web APIsGetting started with dotnet core Web APIs
Getting started with dotnet core Web APIs
 
Introduction To Rust part II Presentation
Introduction To Rust part II PresentationIntroduction To Rust part II Presentation
Introduction To Rust part II Presentation
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Configuring Workflows & Validators in JIRA
Configuring Workflows & Validators in JIRAConfiguring Workflows & Validators in JIRA
Configuring Workflows & Validators in JIRA
 
Advanced Python (with dependency injection and hydra configuration packages)
Advanced Python (with dependency injection and hydra configuration packages)Advanced Python (with dependency injection and hydra configuration packages)
Advanced Python (with dependency injection and hydra configuration packages)
 
Azure Databricks (For Data Analytics).pptx
Azure Databricks (For Data Analytics).pptxAzure Databricks (For Data Analytics).pptx
Azure Databricks (For Data Analytics).pptx
 
The Power of Dependency Injection with Dagger 2 and Kotlin
The Power of Dependency Injection with Dagger 2 and KotlinThe Power of Dependency Injection with Dagger 2 and Kotlin
The Power of Dependency Injection with Dagger 2 and Kotlin
 
Data Engineering with Databricks Presentation
Data Engineering with Databricks PresentationData Engineering with Databricks Presentation
Data Engineering with Databricks Presentation
 
Databricks for MLOps Presentation (AI/ML)
Databricks for MLOps Presentation (AI/ML)Databricks for MLOps Presentation (AI/ML)
Databricks for MLOps Presentation (AI/ML)
 
NoOps - (Automate Ops) Presentation.pptx
NoOps - (Automate Ops) Presentation.pptxNoOps - (Automate Ops) Presentation.pptx
NoOps - (Automate Ops) Presentation.pptx
 
Mastering Distributed Performance Testing
Mastering Distributed Performance TestingMastering Distributed Performance Testing
Mastering Distributed Performance Testing
 
MLops on Vertex AI Presentation (AI/ML).pptx
MLops on Vertex AI Presentation (AI/ML).pptxMLops on Vertex AI Presentation (AI/ML).pptx
MLops on Vertex AI Presentation (AI/ML).pptx
 
Introduction to Ansible Tower Presentation
Introduction to Ansible Tower PresentationIntroduction to Ansible Tower Presentation
Introduction to Ansible Tower Presentation
 
CQRS with dot net services presentation.
CQRS with dot net services presentation.CQRS with dot net services presentation.
CQRS with dot net services presentation.
 

Recently uploaded

Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...OnePlan Solutions
 
Key Steps in Agile Software Delivery Roadmap
Key Steps in Agile Software Delivery RoadmapKey Steps in Agile Software Delivery Roadmap
Key Steps in Agile Software Delivery RoadmapIshara Amarasekera
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogueitservices996
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITmanoharjgpsolutions
 
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxUnderstanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxSasikiranMarri
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesVictoriaMetrics
 
Advantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxAdvantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxRTS corp
 
Mastering Project Planning with Microsoft Project 2016.pptx
Mastering Project Planning with Microsoft Project 2016.pptxMastering Project Planning with Microsoft Project 2016.pptx
Mastering Project Planning with Microsoft Project 2016.pptxAS Design & AST.
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?Alexandre Beguel
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slidesvaideheekore1
 
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jGraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jNeo4j
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingShane Coughlan
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsJean Silva
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolsosttopstonverter
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shardsChristopher Curtin
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...Bert Jan Schrijver
 
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...kalichargn70th171
 
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdfAndrey Devyatkin
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics
 
Effort Estimation Techniques used in Software Projects
Effort Estimation Techniques used in Software ProjectsEffort Estimation Techniques used in Software Projects
Effort Estimation Techniques used in Software ProjectsDEEPRAJ PATHAK
 

Recently uploaded (20)

Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
 
Key Steps in Agile Software Delivery Roadmap
Key Steps in Agile Software Delivery RoadmapKey Steps in Agile Software Delivery Roadmap
Key Steps in Agile Software Delivery Roadmap
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogue
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh IT
 
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxUnderstanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 Updates
 
Advantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxAdvantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptx
 
Mastering Project Planning with Microsoft Project 2016.pptx
Mastering Project Planning with Microsoft Project 2016.pptxMastering Project Planning with Microsoft Project 2016.pptx
Mastering Project Planning with Microsoft Project 2016.pptx
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slides
 
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jGraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero results
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration tools
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
 
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
 
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
 
Effort Estimation Techniques used in Software Projects
Effort Estimation Techniques used in Software ProjectsEffort Estimation Techniques used in Software Projects
Effort Estimation Techniques used in Software Projects
 

A Step to programming with Apache Spark

  • 1. A Step to programming withA Step to programming with Rahul Kumar Trainee - Software Consultant Knoldus Software LLP Rahul Kumar Trainee - Software Consultant Knoldus Software LLP
  • 2. Building Spark : 1. Pre Build Spark http://mirror.fibergrid.in/apache/spark/spark-1.6.2/spark-1.6.2-bin-hadoop2.6.tgz 2. Source Code http://mirror.fibergrid.in/apache/spark/spark-1.6.2/spark-1.6.2.tgz Goto the SPARK_HOME directory. Execute : mvn -Dhadoop.version=1.2.1 -Phadoop-1 -DskipTests clean package To start spark goto the SPARK_HOME/bin Execute ./spark-shell
  • 3. ● The main feature of Spark is its in-memory cluster computing that increases the processing speed of an application. ● Spark is not a modified version of Hadoop because it has its own cluster management. ● Spark uses Hadoop in two ways – one is storage and second is processing. Since Spark has its own cluster management computation, it uses Hadoop for storage purpose only. Spark Features :
  • 4. Spark applications run as independent sets of processes on a cluster,coordinated by the SparkContext object in your main program (called the driver program).
  • 5.
  • 6.
  • 7. ● Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. ● It is an immutable distributed collection of objects. ● RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes. ● There are two ways to create RDDs: parallelizing an existing collection in your driver program ● e.g. val data = Array(1, 2, 3, 4, 5) val distData = sc.parallelize(data) ● val distFile = sc.textFile("data.txt") distFile: RDD[String] = MappedRDD@1d4cee08 RDD :
  • 9. ● RDDs support two types of operations: ✔ Transformations, which create a new dataset from an existing one, and ✔Actions, which return a value to the driver program after running a computation on the dataset. ●For example, ✔ map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. ✔ reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program ●All transformations in Spark are lazy, in that they do not compute their results right away ● RDDs support two types of operations: ✔ Transformations, which create a new dataset from an existing one, and ✔Actions, which return a value to the driver program after running a computation on the dataset. ●For example, ✔ map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. ✔ reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program ●All transformations in Spark are lazy, in that they do not compute their results right away RDD :
  • 10. ● A DataFrame is equivalent to a relational table in Spark SQL. DataFrame : ● Steps to create DataFrame :  Create SparkContext object : – val conf = new SparkConf().setAppName("Demo").setMaster("local[2]") – val sc = new SparkContext(conf)  Create SqlContext object : – val sqlContext = new SQLContext(sc)  Read Data From Files : – val df = sqlContext.read.json("src/main/scala/emp.json")
  • 11. ● A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case. ● DataFrame has additional metadata due to its tabular format, which allows Spark to run certain optimizations on the finalized query. ● An RDD, on the other hand, is merely a Resilient Distributed Dataset that is more of a blackbox of data that cannot be optimized as the operations that can be performed against it are not as constrained. ● However, you can go from a DataFrame to an RDD via its rdd method, and you can go from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method DataFrame and RDD :
  • 12. DataFrame Transformations : ● Def orderBy(sortExprs: Column*): DataFrame ● Def select(cols: Column*): DataFrame ● Def show(): Unit ● Def filter(conditionExpr: String): DataFrame ● Def groupBy(cols: Column*): GroupedData
  • 13. ●Def collect(): Array[Row] ●Def collectAsList(): List[Row] ●Def count(): Long ●Def head(): Row ●Def head(n: Int): Array[Row] ●Def collect(): Array[Row] ●Def collectAsList(): List[Row] ●Def count(): Long ●Def head(): Row ●Def head(n: Int): Array[Row] DataFrame Actions :
  • 14. ● Hive is a data warehouse infrastructure tool to process structured data in Hadoop. ● It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy. ● It stores schema in a database and processed data into HDFS. ● It provides SQL type language for querying called HiveQL or HQL. ● It is designed for OLAP. Hive :
  • 15. ● Hive comes bundled with the Spark library as HiveContext, which inherits from SQLContext. ● Using HiveContext, you can create and find tables in the HiveMetaStore and write queries on it using HiveQL. ● Users who do not have an existing Hive deployment can still create a HiveContext. ● When not configured by the hive-site.xml, the context automatically creates a metastore called metastore_db and a folder called warehouse in the current directory. Spark-Hive :
  • 16. ➢ Spark SQL supports queries written using HiveQL. ➢ Its a SQL-like language that produces queries that are converted to Spark jobs. ➢ HiveQL is more mature and supports more complex queries than Spark SQL. Spark-Hive :(continued)
  • 17. 1) first create a SqlContext instance, val sqlContext = new SqlContext(sc) 2) submit the queries by calling the sql method on the HiveContext instance. val res=sqlContext.sql("select * from employee") To construct a HiveQL query, 1) first create a new HiveContext instance, val conf = new SparkConf().setAppName("Demo").setMaster("local[2]") val sc = new SparkContext(conf) val hiveContext = new HiveContext(sc) 2) submit the queries by calling the sql method on the HiveContext instance. val res=hiveContext.sql("select * from employee")
  • 18.