SlideShare uma empresa Scribd logo
1 de 33
Baixar para ler offline
Its Place Within a Big Data Stack
Junjun Olympia
Image from http://mattturck.com/2016/02/01/big-data-landscape/
A Quick Review of What Spark Is
Spark is a fast, large-scale data processing engine
● Runs both in-memory and on-disk
● 10x-100x faster than Hadoop MapReduce
● Can be written in Java, Scala, Python, R, & SQL
● Supports both batch and streaming workflows
● Has several modules
○ Spark Core
○ Spark Streaming
○ Spark MLLib
○ GraphX
It is the most active open-source project in big data
Next three images from http://go.databricks.com/2015-spark-survey
Gaining adoption across industries and companies
Used to power several different systems
(Big) Data systems perform common functions
Capture and extract data
Data can come from several sources
● Existing databases and data warehouses
● Flat files from legacy systems
● Web, mobile, and application logs
● Data feeds from social media
● IoT devices
Extract from database: Sqoop vs Spark JDBC API
$ sqoop import --connect jdbc:postgresql:dbserver --table schema.tablename 
--fields-terminated-by 't' --lines-terminated-by 'n' 
--optionally-enclosed-by '"
val jdbcDF = sqlContext.read.format("jdbc").options(
Map("url" -> "jdbc:postgresql:dbserver",
"dbtable" -> "schema.tablename")).load()
Read JSON files
// JSON file as a dataframe
val df = sqlContext.read.json("people.json")
CREATE TEMPORARY TABLE people
USING org.apache.spark.sql.json
OPTIONS (path 'people.json')
Ingest streaming data from Kafka
import org.apache.spark.streaming.kafka._
val directKafkaStream = KafkaUtils.createDirectStream[
[key class], [value class], [key decoder class], [value decoder class] ](
streamingContext, [map of Kafka parameters], [set of topics to consume])
var offsetRanges = Array[OffsetRange]()
directKafkaStream.transform { rdd =>
offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
rdd
}.map {...}.foreachRDD { rdd => ... }
Transform Data
Data in an analytic pipeline usually need transformation
● Check and/or correct for data quality issues
● Handle missing values
● Cast values into specific data types or formats
● Compute derived fields
● Split or merge records to achieve desired granularity
● Join with another dataset (i.e. reference lookups)
● Restructure as required by downstream applications or target databases
There’s plenty of tools that do this
● Before big data
○ Informatica PowerCenter
○ Pentaho Kettle
○ Talend
○ SSIS
○ OWB
● Early Hadoop
○ Apache Pig
○ Hive via HQL
○ Plain ol’ MapReduce
● Spark core, Streaming, DataFrames
Store data
Data can then be stored several different ways
● As self-describing files like Parquet, Avro, JSON, XML
● Hive metastore-managed tables
● Other low-latency SQL-on-Hadoop engines (i.e. Impala, Drill, Kudu)
● Key-value and wide-table databases for fast random access (i.e. HBase,
Cassandra)
● Search databases (i.e. ElasticSearch, Solr)
● Conventional data warehouses and databases
Query, analyze, visualize
There’s plenty of tools here, too
● Databases offering JDBC/ODBC connectivity
○ Hive, Impala, Drill
○ MPP data warehouses
○ Spark SQL via JDBC Thrift Server
● BI Tools via SQL
○ Qlikview
○ Tableau
○ Pentaho BI
● For richer analyses beyond Spark SQL
○ Spark shell
○ Better with notebooks (i.e. Zeppelin, Jupyter)
Spark is an essential part of the modern big data
stack.
A unified framework such as Spark offers benefits, too
● Fewer moving pieces
● Smaller stack to administer and manage
● Common languages
● Familiar patterns
● Encourages team members to become cross-functional
Some common questions about Spark
Is Spark a database?
Fine; is it a data warehouse then?
Is Spark a Hadoop replacement?
Are there other technologies similar to Spark?
Questions?

Mais conteúdo relacionado

Mais procurados

Open source big data landscape and possible ITS applications
Open source big data landscape and possible ITS applicationsOpen source big data landscape and possible ITS applications
Open source big data landscape and possible ITS applicationsSoftwareMill
 
Managed Cluster Services
Managed Cluster ServicesManaged Cluster Services
Managed Cluster ServicesAdam Doyle
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Anton Nazaruk
 
Basic Hadoop Architecture V1 vs V2
Basic  Hadoop Architecture  V1 vs V2Basic  Hadoop Architecture  V1 vs V2
Basic Hadoop Architecture V1 vs V2VIVEKVANAVAN
 
Organising for Data Success
Organising for Data SuccessOrganising for Data Success
Organising for Data SuccessLars Albertsson
 
Data pipelines from zero
Data pipelines from zero Data pipelines from zero
Data pipelines from zero Lars Albertsson
 
Sparkler - Spark Crawler
Sparkler - Spark Crawler Sparkler - Spark Crawler
Sparkler - Spark Crawler Thamme Gowda
 
Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageJulien Le Dem
 
Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Julien Le Dem
 
An Intro to Elasticsearch and Kibana
An Intro to Elasticsearch and KibanaAn Intro to Elasticsearch and Kibana
An Intro to Elasticsearch and KibanaObjectRocket
 
Data pipelines observability: OpenLineage & Marquez
Data pipelines observability:  OpenLineage & MarquezData pipelines observability:  OpenLineage & Marquez
Data pipelines observability: OpenLineage & MarquezJulien Le Dem
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architecturesDaniel Marcous
 
Geek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and ScalaGeek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and ScalaAtif Akhtar
 
Observability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageObservability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageDatabricks
 
Oss as a competitive advantage
Oss as a competitive advantageOss as a competitive advantage
Oss as a competitive advantageRegunath B
 
Obfuscating LinkedIn Member Data
Obfuscating LinkedIn Member DataObfuscating LinkedIn Member Data
Obfuscating LinkedIn Member DataDataWorks Summit
 

Mais procurados (20)

Open source big data landscape and possible ITS applications
Open source big data landscape and possible ITS applicationsOpen source big data landscape and possible ITS applications
Open source big data landscape and possible ITS applications
 
Managed Cluster Services
Managed Cluster ServicesManaged Cluster Services
Managed Cluster Services
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?
 
Basic Hadoop Architecture V1 vs V2
Basic  Hadoop Architecture  V1 vs V2Basic  Hadoop Architecture  V1 vs V2
Basic Hadoop Architecture V1 vs V2
 
Big Data A La Carte Menu
Big Data A La Carte MenuBig Data A La Carte Menu
Big Data A La Carte Menu
 
CSB_community
CSB_communityCSB_community
CSB_community
 
Organising for Data Success
Organising for Data SuccessOrganising for Data Success
Organising for Data Success
 
Data pipelines from zero
Data pipelines from zero Data pipelines from zero
Data pipelines from zero
 
Sparkler - Spark Crawler
Sparkler - Spark Crawler Sparkler - Spark Crawler
Sparkler - Spark Crawler
 
Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineage
 
Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020
 
An Intro to Elasticsearch and Kibana
An Intro to Elasticsearch and KibanaAn Intro to Elasticsearch and Kibana
An Intro to Elasticsearch and Kibana
 
HBase introduction talk
HBase introduction talkHBase introduction talk
HBase introduction talk
 
Data pipelines observability: OpenLineage & Marquez
Data pipelines observability:  OpenLineage & MarquezData pipelines observability:  OpenLineage & Marquez
Data pipelines observability: OpenLineage & Marquez
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architectures
 
Geek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and ScalaGeek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and Scala
 
Observability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageObservability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineage
 
Oss as a competitive advantage
Oss as a competitive advantageOss as a competitive advantage
Oss as a competitive advantage
 
Hadoop
HadoopHadoop
Hadoop
 
Obfuscating LinkedIn Member Data
Obfuscating LinkedIn Member DataObfuscating LinkedIn Member Data
Obfuscating LinkedIn Member Data
 

Destaque

Uniting the touchpoints - the Asia Miles story
Uniting the touchpoints - the Asia Miles storyUniting the touchpoints - the Asia Miles story
Uniting the touchpoints - the Asia Miles storyTransform magazine
 
Big Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellBig Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellKhalid Imran
 
Netsci2010 social networks for marketers
Netsci2010 social networks for marketersNetsci2010 social networks for marketers
Netsci2010 social networks for marketersTelenet
 
Mobile analytics, How analytics bring mobile advertisers closer to their audi...
Mobile analytics,How analytics bring mobile advertisers closer to their audi...Mobile analytics,How analytics bring mobile advertisers closer to their audi...
Mobile analytics, How analytics bring mobile advertisers closer to their audi...Telenet
 
Oracle Big Data. Обзор технологий
Oracle Big Data. Обзор технологийOracle Big Data. Обзор технологий
Oracle Big Data. Обзор технологийAndrey Akulov
 
Leveraging open source for big data stack
Leveraging open source for big data stackLeveraging open source for big data stack
Leveraging open source for big data stackFlytxt
 
The Big Data Stack
The Big Data StackThe Big Data Stack
The Big Data StackZubair Nabi
 
The Top Skills That Can Get You Hired in 2017
The Top Skills That Can Get You Hired in 2017The Top Skills That Can Get You Hired in 2017
The Top Skills That Can Get You Hired in 2017LinkedIn
 

Destaque (10)

Uniting the touchpoints - the Asia Miles story
Uniting the touchpoints - the Asia Miles storyUniting the touchpoints - the Asia Miles story
Uniting the touchpoints - the Asia Miles story
 
Big Data Tech Stack
Big Data Tech StackBig Data Tech Stack
Big Data Tech Stack
 
Big Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellBig Data Technology Stack : Nutshell
Big Data Technology Stack : Nutshell
 
Netsci2010 social networks for marketers
Netsci2010 social networks for marketersNetsci2010 social networks for marketers
Netsci2010 social networks for marketers
 
Mobile analytics, How analytics bring mobile advertisers closer to their audi...
Mobile analytics,How analytics bring mobile advertisers closer to their audi...Mobile analytics,How analytics bring mobile advertisers closer to their audi...
Mobile analytics, How analytics bring mobile advertisers closer to their audi...
 
Topic 9: MR+
Topic 9: MR+Topic 9: MR+
Topic 9: MR+
 
Oracle Big Data. Обзор технологий
Oracle Big Data. Обзор технологийOracle Big Data. Обзор технологий
Oracle Big Data. Обзор технологий
 
Leveraging open source for big data stack
Leveraging open source for big data stackLeveraging open source for big data stack
Leveraging open source for big data stack
 
The Big Data Stack
The Big Data StackThe Big Data Stack
The Big Data Stack
 
The Top Skills That Can Get You Hired in 2017
The Top Skills That Can Get You Hired in 2017The Top Skills That Can Get You Hired in 2017
The Top Skills That Can Get You Hired in 2017
 

Semelhante a Apache spark its place within a big data stack

Data processing with spark in r & python
Data processing with spark in r & pythonData processing with spark in r & python
Data processing with spark in r & pythonMaloy Manna, PMP®
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analyticsinoshg
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionRUHULAMINHAZARIKA
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriDemi Ben-Ari
 
Cassandra Lunch #89: Semi-Structured Data in Cassandra
Cassandra Lunch #89: Semi-Structured Data in CassandraCassandra Lunch #89: Semi-Structured Data in Cassandra
Cassandra Lunch #89: Semi-Structured Data in CassandraAnant Corporation
 
9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to SchoolAdam Doyle
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoMapR Technologies
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to sparkHome
 
Apache Hive for modern DBAs
Apache Hive for modern DBAsApache Hive for modern DBAs
Apache Hive for modern DBAsLuis Marques
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overviewMartin Zapletal
 
Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data
Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data
Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data avanttic Consultoría Tecnológica
 
Introduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQLIntroduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQLdatamantra
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...Mark Rittman
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingPetr Zapletal
 
WhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
WhereHows: Taming Metadata for 150K Datasets Over 9 Data PlatformsWhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
WhereHows: Taming Metadata for 150K Datasets Over 9 Data PlatformsMars Lan
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 

Semelhante a Apache spark its place within a big data stack (20)

Data processing with spark in r & python
Data processing with spark in r & pythonData processing with spark in r & python
Data processing with spark in r & python
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_Session
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-Ari
 
Cassandra Lunch #89: Semi-Structured Data in Cassandra
Cassandra Lunch #89: Semi-Structured Data in CassandraCassandra Lunch #89: Semi-Structured Data in Cassandra
Cassandra Lunch #89: Semi-Structured Data in Cassandra
 
9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Apache Hive for modern DBAs
Apache Hive for modern DBAsApache Hive for modern DBAs
Apache Hive for modern DBAs
 
Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overview
 
Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data
Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data
Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data
 
Introduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQLIntroduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQL
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, Streaming
 
WhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
WhereHows: Taming Metadata for 150K Datasets Over 9 Data PlatformsWhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
WhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 

Último

Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxolyaivanovalion
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangaloreamitlee9823
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsJoseMangaJr1
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...amitlee9823
 

Último (20)

Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 

Apache spark its place within a big data stack

  • 1. Its Place Within a Big Data Stack Junjun Olympia
  • 3. A Quick Review of What Spark Is
  • 4. Spark is a fast, large-scale data processing engine ● Runs both in-memory and on-disk ● 10x-100x faster than Hadoop MapReduce ● Can be written in Java, Scala, Python, R, & SQL ● Supports both batch and streaming workflows ● Has several modules ○ Spark Core ○ Spark Streaming ○ Spark MLLib ○ GraphX
  • 5. It is the most active open-source project in big data Next three images from http://go.databricks.com/2015-spark-survey
  • 6. Gaining adoption across industries and companies
  • 7. Used to power several different systems
  • 8. (Big) Data systems perform common functions
  • 10. Data can come from several sources ● Existing databases and data warehouses ● Flat files from legacy systems ● Web, mobile, and application logs ● Data feeds from social media ● IoT devices
  • 11. Extract from database: Sqoop vs Spark JDBC API $ sqoop import --connect jdbc:postgresql:dbserver --table schema.tablename --fields-terminated-by 't' --lines-terminated-by 'n' --optionally-enclosed-by '" val jdbcDF = sqlContext.read.format("jdbc").options( Map("url" -> "jdbc:postgresql:dbserver", "dbtable" -> "schema.tablename")).load()
  • 12. Read JSON files // JSON file as a dataframe val df = sqlContext.read.json("people.json") CREATE TEMPORARY TABLE people USING org.apache.spark.sql.json OPTIONS (path 'people.json')
  • 13. Ingest streaming data from Kafka import org.apache.spark.streaming.kafka._ val directKafkaStream = KafkaUtils.createDirectStream[ [key class], [value class], [key decoder class], [value decoder class] ]( streamingContext, [map of Kafka parameters], [set of topics to consume]) var offsetRanges = Array[OffsetRange]() directKafkaStream.transform { rdd => offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges rdd }.map {...}.foreachRDD { rdd => ... }
  • 14.
  • 16. Data in an analytic pipeline usually need transformation ● Check and/or correct for data quality issues ● Handle missing values ● Cast values into specific data types or formats ● Compute derived fields ● Split or merge records to achieve desired granularity ● Join with another dataset (i.e. reference lookups) ● Restructure as required by downstream applications or target databases
  • 17. There’s plenty of tools that do this ● Before big data ○ Informatica PowerCenter ○ Pentaho Kettle ○ Talend ○ SSIS ○ OWB ● Early Hadoop ○ Apache Pig ○ Hive via HQL ○ Plain ol’ MapReduce ● Spark core, Streaming, DataFrames
  • 18.
  • 20. Data can then be stored several different ways ● As self-describing files like Parquet, Avro, JSON, XML ● Hive metastore-managed tables ● Other low-latency SQL-on-Hadoop engines (i.e. Impala, Drill, Kudu) ● Key-value and wide-table databases for fast random access (i.e. HBase, Cassandra) ● Search databases (i.e. ElasticSearch, Solr) ● Conventional data warehouses and databases
  • 21.
  • 23. There’s plenty of tools here, too ● Databases offering JDBC/ODBC connectivity ○ Hive, Impala, Drill ○ MPP data warehouses ○ Spark SQL via JDBC Thrift Server ● BI Tools via SQL ○ Qlikview ○ Tableau ○ Pentaho BI ● For richer analyses beyond Spark SQL ○ Spark shell ○ Better with notebooks (i.e. Zeppelin, Jupyter)
  • 24.
  • 25.
  • 26. Spark is an essential part of the modern big data stack.
  • 27. A unified framework such as Spark offers benefits, too ● Fewer moving pieces ● Smaller stack to administer and manage ● Common languages ● Familiar patterns ● Encourages team members to become cross-functional
  • 28. Some common questions about Spark
  • 29. Is Spark a database?
  • 30. Fine; is it a data warehouse then?
  • 31. Is Spark a Hadoop replacement?
  • 32. Are there other technologies similar to Spark?