3. About Me
• Joined Percona in January 2016
• Sr Technical Operations Architect for MongoDB
• Previous:
• EA DICE (MySQL DBA)
• EA SPORTS (Sys/NoSQL DBA Ops)
• Amazon/AbeBooks Inc (Sys/MySQL+NoSQL DBA Ops)
• Main technologies: MySQL, MongoDB, Cassandra, Solr, Redis, message queues, etc
• 10+ years tuning Linux for database workloads (off and on)
• NOT an Apache Spark expert
4. Apache Spark
• “…is a fast and general engine for large-scale data processing”
• Written in Scala; uses the Akka framework and runs on the JVM
• Supports jobs written in Java, Scala, Python and R
• Pluggable datasources for various file types, databases, SaaS services, etc
• Fast and efficient: jobs work on datasources quickly in parallel
• Optional clustering
• Master/Slave Spark cluster, Zookeeper for Master elections
• Slave workers connect to master
• Master distributes tasks evenly to available workers
• Streaming and Machine Learning (MLlib) capabilities
• Programmatic and SQL(!) querying capabilities
5. Apache Spark: Software Architecture
• Jobs go to the Cluster Master, or run directly in the client JVM
• The Cluster Master directs jobs, via messages, to nodes with available resources
• Cluster Master HA
• Slaves reconnect to Master
• Apache Zookeeper for true HA
6. Apache Spark: Hadoop Comparison
• Hadoop MapReduce
• Batch-based
• Uses data less efficiently
• Relatively hard to develop/maintain
• Spark
• Stream Processing
• Fast/Parallelism
• Keeps data in memory as much as possible during jobs
• Divides work into many lightweight sub-tasks in threads
• Datasources
• Uses datasource-awareness to scale (e.g. indices, shard-awareness, etc)
• Spark allows processing and storage to scale separately
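The "many lightweight sub-tasks in threads" idea above can be sketched in plain Python (no Spark required): split a dataset into partitions, run one small task per partition in a thread pool, then combine the partial results. All names and data here are illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    """One lightweight sub-task: sum a slice of the data."""
    return sum(chunk)

data = list(range(1, 101))                                   # 1..100
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]   # 4 partitions

# Run the sub-tasks in parallel threads and gather the partial results
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum, chunks))

total = sum(partials)
print(total)  # 5050
```

Spark does the same thing at scale: partitions map to tasks, tasks run on worker threads across the cluster, and results are reduced back together.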
7. Apache Spark: RDDs and DataFrames
• RDD: Resilient Distributed Dataset
• Original API to access data in Spark
• Lazy: does not access data until a real action is performed
• Spark’s optimiser cannot see inside
• RDDs are slow in Python
• DataFrames API
• Higher-level API, focused on what is being done rather than how
• Has schemas / table-like
• Interchangeable programmatic and SQL APIs
• Much easier to read and comprehend
• Optimises execution plan
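The RDD-vs-DataFrame contrast above can be shown with a plain-Python analogy (no Spark needed): the RDD style is a chain of opaque lambdas describing *how* to compute, which an optimiser cannot see inside; the DataFrame/SQL style declares *what* you want and lets the engine plan it. Python's built-in sqlite3 stands in for Spark's SQL engine here, and the data is made up.

```python
import sqlite3
from functools import reduce

temps = [("A", 11.0), ("A", 13.0), ("B", 20.0)]

# RDD-style: opaque functions; the "how" is spelled out step by step
a_temps = list(map(lambda t: t[1], filter(lambda t: t[0] == "A", temps)))
avg_a = reduce(lambda x, y: x + y, a_temps) / len(a_temps)

# DataFrame/SQL-style: declare the "what"; the engine builds the plan
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE temps (station TEXT, temp REAL)")
db.executemany("INSERT INTO temps VALUES (?, ?)", temps)
(avg_a_sql,) = db.execute(
    "SELECT AVG(temp) FROM temps WHERE station = 'A'").fetchone()

print(avg_a, avg_a_sql)  # 12.0 12.0
```

Because the second form exposes its structure, an optimiser (Spark's Catalyst, in the real thing) can reorder filters, prune columns, and push work down to the datasource.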
8. Apache Spark: Datasources
• Provides a pluggable mechanism for accessing structured data through Spark SQL
• At least these databases are supported in some way
• MySQL
• MongoDB
• Redis
• Cassandra
• Postgres
• HBase
• HDFS
• File
• S3
• In practice: search GitHub, find .jar file, deploy it!
9. Apache Spark: SQLContext
• ANSI SQL
• A 30+ year-old language…
• Easy to understand
• Nearly everyone already knows it
• Spark SQLContext
• A Spark module for structured data processing, wrapping RDD API
• Uses the same execution engine as the programmatic APIs
• Supports:
• JOINs/UNIONs
• EXPLAINs
• Subqueries
• ORDER BY / GROUP BY / SORT BY
• Most datatypes you’d expect
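The constructs listed above (JOINs, GROUP BY, ORDER BY, and so on) look in Spark SQL much like they do in any ANSI-ish engine. This sketch uses Python's built-in sqlite3 purely to show the query shape; the table and column names are made up for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE stations (id INTEGER, name TEXT);
    CREATE TABLE samples  (station_id INTEGER, temp REAL);
    INSERT INTO stations VALUES (1, 'Alpha'), (2, 'Bravo');
    INSERT INTO samples VALUES (1, 10.0), (1, 14.0), (2, 21.0);
""")

# JOIN + GROUP BY + ORDER BY: the same statement would run unchanged
# against a Spark SQLContext with registered temp tables
rows = db.execute("""
    SELECT st.name, AVG(sa.temp) AS avg_temp, COUNT(*) AS n
    FROM stations st
    JOIN samples sa ON sa.station_id = st.id
    GROUP BY st.name
    ORDER BY avg_temp DESC
""").fetchall()
print(rows)  # [('Bravo', 21.0, 1), ('Alpha', 12.0, 2)]
```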
10. Apache Spark: Use Cases
• Business Intelligence/Analytics
• Understand your data
• Tip: use dedicated replicas for expensive queries!
• Data Summaries and Batch Jobs
• Perform expensive summaries in the background, save the result
• Tip: use burstable/cloud hardware for infrequent batch jobs
• Real-time Stream Processing
• Process data as it enters your system
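A minimal sketch of the stream-processing idea: consume readings as they arrive and aggregate each micro-batch, which is how Spark Streaming's DStreams work under the hood. Plain Python generators stand in for the streaming framework; the readings are made up.

```python
def micro_batches(readings, batch_size):
    """Group an incoming stream of readings into fixed-size micro-batches."""
    batch = []
    for r in readings:
        batch.append(r)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:            # flush the final, possibly short, batch
        yield batch

stream = iter([18.0, 20.0, 19.0, 23.0, 25.0])   # readings arriving over time
batch_avgs = [sum(b) / len(b) for b in micro_batches(stream, 2)]
print(batch_avgs)  # [19.0, 21.0, 25.0]
```

Spark Streaming applies the same pattern with time-based windows instead of fixed counts, distributing each batch across the cluster.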
11. So why not Apache Drill?
• A schema-free SQL engine for Hadoop, NoSQL and Cloud Storage
• Drill does not support / work with
• Relational databases (MySQL) or Redis
• No programmatic-level querying
• No streaming/continuous query functionality
• I don’t know much about it
12. The Demo
• Scenario: You run a weather station data app that stores data in both an RDBMS and a document store
• Goal: summarise weather station data stored in the RDBMS and the document store
• Min Water Temperature
• Avg Water Temperature
• Max Water Temperature
• Total Sample Count
• Get Top-10 (based on avg water temp)
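The summary the demo's PySpark job produces can be sketched in plain Python, so the logic is visible without a Spark cluster: per-station min/avg/max water temperature plus sample count, then a Top-10 by average. The station names and readings below are invented stand-ins for the real datasets.

```python
stations = {1: "Buoy-1", 2: "Buoy-2", 3: "Buoy-3"}          # RDBMS metadata
samples = [(1, 10.0), (1, 12.0), (2, 15.0), (2, 17.0), (3, 9.0)]  # doc store

# Fold every sample into a running per-station summary
summary = {}
for sid, temp in samples:
    s = summary.setdefault(sid, {"min": temp, "max": temp, "sum": 0.0, "n": 0})
    s["min"] = min(s["min"], temp)
    s["max"] = max(s["max"], temp)
    s["sum"] += temp
    s["n"] += 1

# Join with the metadata and rank by average temperature (Top-10)
top10 = sorted(
    ((stations[sid], s["sum"] / s["n"], s["min"], s["max"], s["n"])
     for sid, s in summary.items()),
    key=lambda row: row[1], reverse=True)[:10]
print(top10)
```

In the actual demo, Spark performs this fold and join in parallel across the workers, and the resulting Top-10 is written out to Redis.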
13. The Demo
• RDBMS: Percona Server for MySQL 5.7
• Stores the weather station metadata (roughly 350 stations: ID, name, location, etc)
• Document-Store: Percona Server for MongoDB 3.2
• Stores the weather time-series sample data (roughly 80,000 samples: various weather readings from stations)
• In-Memory K/V Store: Redis 2.8
• Stores summarised Top-10 data for fast querying of min, avg, max temperature and total sample counts
14. The Demo
• Apache Spark 1.6.2 Cluster
• 1 x Master
• 2 x Worker/Slaves
• 1 x Pyspark Job
• 1 x Macbook Pro
• 3 x Virtualbox VMs
• Job submitted on Master