4. Spark is a fast, large-scale data processing engine
● Runs both in-memory and on-disk
● 10x-100x faster than Hadoop MapReduce
● Applications can be written in Java, Scala, Python, R, & SQL
● Supports both batch and streaming workflows
● Has several modules
○ Spark Core
○ Spark Streaming
○ Spark MLlib
○ GraphX
5. It is the most active open-source project in big data
Next three images from http://go.databricks.com/2015-spark-survey
10. Data can come from several sources
● Existing databases and data warehouses
● Flat files from legacy systems
● Web, mobile, and application logs
● Data feeds from social media
● IoT devices
11. Extract from database: Sqoop vs Spark JDBC API
$ sqoop import --connect jdbc:postgresql:dbserver --table schema.tablename \
    --fields-terminated-by '\t' --lines-terminated-by '\n' \
    --optionally-enclosed-by '"'

val jdbcDF = sqlContext.read.format("jdbc").options(
  Map("url" -> "jdbc:postgresql:dbserver",
      "dbtable" -> "schema.tablename")).load()
12. Read JSON files
// JSON file as a dataframe
val df = sqlContext.read.json("people.json")
CREATE TEMPORARY TABLE people
USING org.apache.spark.sql.json
OPTIONS (path 'people.json')
13. Ingest streaming data from Kafka
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka._

// String keys/values with StringDecoder, per the Kafka direct-stream API;
// the broker and topic names below are placeholders
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val topics = Set("topicA", "topicB")
val directKafkaStream = KafkaUtils.createDirectStream[
  String, String, StringDecoder, StringDecoder](
  streamingContext, kafkaParams, topics)

// Capture offset ranges so consumption progress can be tracked per batch
var offsetRanges = Array[OffsetRange]()
directKafkaStream.transform { rdd =>
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd
}.map { ... }.foreachRDD { rdd => ... }
16. Data in an analytic pipeline usually needs transformation
● Check and/or correct for data quality issues
● Handle missing values
● Cast values into specific data types or formats
● Compute derived fields
● Split or merge records to achieve desired granularity
● Join with another dataset (e.g. reference lookups)
● Restructure as required by downstream applications or target databases
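A minimal sketch of a few of these steps with the DataFrame API (Spark 1.x `sqlContext`; file paths and column names are assumptions):

```scala
import org.apache.spark.sql.functions._

// Pure helper for the derived field; easy to unit-test outside Spark
def ageGroup(age: Int): String = if (age < 18) "minor" else "adult"
val ageGroupUdf = udf(ageGroup _)

// Hypothetical raw input
val raw = sqlContext.read.json("people.json")

val cleaned = raw
  .na.fill(Map("city" -> "unknown"))                 // handle missing values
  .withColumn("age", col("age").cast("int"))         // cast to a specific type
  .withColumn("age_group", ageGroupUdf(col("age")))  // compute a derived field

// Join with a reference dataset (reference lookup)
val cities = sqlContext.read.json("cities.json")
val enriched = cleaned.join(cities, Seq("city"), "left_outer")
```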
17. There are plenty of tools that do this
● Before big data
○ Informatica PowerCenter
○ Pentaho Kettle
○ Talend
○ SSIS
○ OWB
● Early Hadoop
○ Apache Pig
○ Hive via HQL
○ Plain ol’ MapReduce
● Spark core, Streaming, DataFrames
20. Data can then be stored several different ways
● As self-describing files like Parquet, Avro, JSON, XML
● Hive metastore-managed tables
● Other low-latency SQL-on-Hadoop engines and storage (e.g. Impala, Drill, Kudu)
● Key-value and wide-column databases for fast random access (e.g. HBase,
Cassandra)
● Search engines (e.g. Elasticsearch, Solr)
● Conventional data warehouses and databases
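For instance, the same DataFrame can be written out in several of these ways (Spark 1.x write API; the paths and table names below are hypothetical):

```scala
// Self-describing columnar file on HDFS
df.write.format("parquet").mode("overwrite").save("/data/people.parquet")

// Hive metastore-managed table (requires a HiveContext)
df.write.mode("overwrite").saveAsTable("analytics.people")

// Conventional relational database over JDBC
df.write.jdbc("jdbc:postgresql:dbserver", "schema.people",
  new java.util.Properties())
```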
23. There are plenty of tools here, too
● Databases offering JDBC/ODBC connectivity
○ Hive, Impala, Drill
○ MPP data warehouses
○ Spark SQL via JDBC Thrift Server
● BI Tools via SQL
○ Qlikview
○ Tableau
○ Pentaho BI
● For richer analyses beyond Spark SQL
○ Spark shell
○ Better with notebooks (e.g. Zeppelin, Jupyter)
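To serve SQL clients from Spark itself, a DataFrame can be registered as a table and queried directly; the JDBC Thrift Server then exposes metastore tables to BI tools (Spark 1.x API; table and column names here are assumptions):

```scala
// Make the dataframe queryable by name in this SQLContext
df.registerTempTable("people")
val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")

// External BI tools connect through the Thrift JDBC server:
// $ sbin/start-thriftserver.sh
// $ bin/beeline -u jdbc:hive2://localhost:10000
```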
26. Spark is an essential part of the modern big data stack.
27. A unified framework such as Spark offers benefits, too
● Fewer moving pieces
● Smaller stack to administer and manage
● Common languages
● Familiar patterns
● Encourages team members to become cross-functional