4. Spark is a fast, large-scale data processing engine
● Runs both in-memory and on-disk
● 10x-100x faster than Hadoop MapReduce
● Applications can be written in Java, Scala, Python, R, & SQL
● Supports both batch and streaming workflows
● Has several modules
○ Spark Core
○ Spark Streaming
○ Spark MLlib
○ GraphX
5. It is the most active open-source project in big data
Next three images from http://go.databricks.com/2015-spark-survey
10. Data can come from several sources
● Existing databases and data warehouses
● Flat files from legacy systems
● Web, mobile, and application logs
● Data feeds from social media
● IoT devices
11. Extract from database: Sqoop vs Spark JDBC API
$ sqoop import --connect jdbc:postgresql:dbserver --table schema.tablename \
    --fields-terminated-by '\t' --lines-terminated-by '\n' \
    --optionally-enclosed-by '"'

val jdbcDF = sqlContext.read.format("jdbc").options(
  Map("url" -> "jdbc:postgresql:dbserver",
      "dbtable" -> "schema.tablename")).load()
12. Read JSON files
// JSON file as a dataframe
val df = sqlContext.read.json("people.json")
CREATE TEMPORARY TABLE people
USING org.apache.spark.sql.json
OPTIONS (path 'people.json')
13. Ingest streaming data from Kafka
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka._

// String keys/values with StringDecoder, per the Kafka direct-stream API;
// the broker and topic names below are placeholders
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val topics = Set("topicA", "topicB")
val directKafkaStream = KafkaUtils.createDirectStream[
  String, String, StringDecoder, StringDecoder](
  streamingContext, kafkaParams, topics)

// Capture offset ranges so consumption progress can be tracked per batch
var offsetRanges = Array[OffsetRange]()
directKafkaStream.transform { rdd =>
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd
}.map { ... }.foreachRDD { rdd => ... }
16. Data in an analytic pipeline usually needs transformation
● Check and/or correct for data quality issues
● Handle missing values
● Cast values into specific data types or formats
● Compute derived fields
● Split or merge records to achieve desired granularity
● Join with another dataset (e.g. reference lookups)
● Restructure as required by downstream applications or target databases
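A minimal sketch of a few of these steps with the DataFrame API (Spark 1.x `sqlContext`; file paths and column names are assumptions):

```scala
import org.apache.spark.sql.functions._

// Pure helper for the derived field; easy to unit-test outside Spark
def ageGroup(age: Int): String = if (age < 18) "minor" else "adult"
val ageGroupUdf = udf(ageGroup _)

// Hypothetical raw input
val raw = sqlContext.read.json("people.json")

val cleaned = raw
  .na.fill(Map("city" -> "unknown"))                 // handle missing values
  .withColumn("age", col("age").cast("int"))         // cast to a specific type
  .withColumn("age_group", ageGroupUdf(col("age")))  // compute a derived field

// Join with a reference dataset (reference lookup)
val cities = sqlContext.read.json("cities.json")
val enriched = cleaned.join(cities, Seq("city"), "left_outer")
```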
17. There are plenty of tools that do this
● Before big data
○ Informatica PowerCenter
○ Pentaho Kettle
○ Talend
○ SSIS
○ OWB
● Early Hadoop
○ Apache Pig
○ Hive via HQL
○ Plain ol’ MapReduce
● Spark core, Streaming, DataFrames
20. Data can then be stored several different ways
● As self-describing files like Parquet, Avro, JSON, XML
● Hive metastore-managed tables
● Other low-latency SQL-on-Hadoop engines and storage (e.g. Impala, Drill, Kudu)
● Key-value and wide-column databases for fast random access (e.g. HBase,
Cassandra)
● Search engines (e.g. Elasticsearch, Solr)
● Conventional data warehouses and databases
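For instance, the same DataFrame can be written out in several of these ways (Spark 1.x write API; the paths and table names below are hypothetical):

```scala
// Self-describing columnar file on HDFS
df.write.format("parquet").mode("overwrite").save("/data/people.parquet")

// Hive metastore-managed table (requires a HiveContext)
df.write.mode("overwrite").saveAsTable("analytics.people")

// Conventional relational database over JDBC
df.write.jdbc("jdbc:postgresql:dbserver", "schema.people",
  new java.util.Properties())
```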
23. There are plenty of tools here, too
● Databases offering JDBC/ODBC connectivity
○ Hive, Impala, Drill
○ MPP data warehouses
○ Spark SQL via JDBC Thrift Server
● BI Tools via SQL
○ Qlikview
○ Tableau
○ Pentaho BI
● For richer analyses beyond Spark SQL
○ Spark shell
○ Better with notebooks (e.g. Zeppelin, Jupyter)
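To serve SQL clients from Spark itself, a DataFrame can be registered as a table and queried directly; the JDBC Thrift Server then exposes metastore tables to BI tools (Spark 1.x API; table and column names here are assumptions):

```scala
// Make the dataframe queryable by name in this SQLContext
df.registerTempTable("people")
val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")

// External BI tools connect through the Thrift JDBC server:
// $ sbin/start-thriftserver.sh
// $ bin/beeline -u jdbc:hive2://localhost:10000
```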
26. Spark is an essential part of the modern big data stack.
27. A unified framework such as Spark offers benefits, too
● Fewer moving pieces
● Smaller stack to administer and manage
● Common languages
● Familiar patterns
● Encourages team members to become cross-functional