The world has changed: one huge server won't do the job anymore. When you're dealing with vast amounts of data that keep growing, the ability to scale out is your savior. Apache Spark is a fast, general engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing.
This lecture covers the basics of Apache Spark and distributed computing, and the development tools needed for a functional environment.
2. About Me
Demi Ben-Ari, Co-Founder & VP R&D @ Panorays
● B.Sc. in Computer Science – Academic College of Tel Aviv-Yaffo
● Co-Founder @
○ “Big Things” Big Data Community
○ Google Developers Group Cloud
In the Past:
● Sr. Data Engineer @ Windward
● Software Team Leader & Senior Java Software Engineer
● Missile defense and Alert System - “Ofek” – IAF
4. What are Big Data Systems?
● Applications involving the “3 Vs”
○ Volume - Gigabytes, Terabytes, Petabytes...
○ Velocity - Sensor data, logs, data streams, financial transactions, geolocations...
○ Variety - Different data format ingestion
● Some define it as the “7 Vs”
○ Variability (constantly changing)
○ Veracity (accuracy)
○ Visualization
○ Value
● Characteristics
○ Multi-region availability
○ Very fast and reliable response
○ No single point of failure
5. What is Big Data (IMHO)?
● Systems involving the “3 Vs”:
What are the right questions we want to ask?
○ Volume - How much?
○ Velocity - How fast?
○ Variety - What kind? (Difference)
6. Why Not Relational Data?
● Relational Model Provides
○ Normalized table schema
○ Cross table joins
○ ACID compliance (Atomicity, Consistency, Isolation, Durability)
● But at very high cost
○ Big Data table joins - billions of rows or more - require massive overhead
○ Sharding tables across systems is complex and fragile
● Modern applications have different priorities
○ The need for speed and availability outweighs consistency
○ Racks of commodity servers trump massive high-end systems
○ Real world need for transactional guarantees is limited
7. What strategies help manage Big Data?
● Distribute data across nodes
○ Replication
● Relax consistency requirements
● Relax schema requirements
● Optimize data to suit actual needs
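The first two strategies above — distribute the data and replicate it — can be sketched in a few lines of plain Python. This is a hypothetical placement helper, not any particular database's scheme: each key hashes to a primary node, and copies go to the next nodes in the ring.

```python
def place(key, nodes, replicas=2):
    """Pick the nodes that store `key`: hash to a primary node,
    then replicate on the following nodes in the ring."""
    start = hash(key) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]

nodes = ["node-a", "node-b", "node-c"]
for key in ["user:1", "user:2", "user:3"]:
    # Each key lands on `replicas` distinct nodes, so losing one
    # node does not lose the data.
    print(key, "->", place(key, nodes))
```

Real systems (Cassandra, DynamoDB, Riak) use consistent hashing so that adding a node moves only a small fraction of the keys, but the idea is the same.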
8. What is the NoSQL landscape?
● 4 broad classes of non-relational databases (http://db-engines.com/en/ranking)
○ Graph: data elements each relate to N others in graph / network
○ Key-Value: keys map to arbitrary values of any data type
○ Document: document sets (JSON) queryable in whole or part
○ Wide Column Store (Column Family): keys mapped to sets of any number of typed columns
● Three key factors to help understand the subject
○ Consistency: do you get identical results, regardless which node is queried?
○ Availability: can the cluster respond to very high read and write volumes?
○ Partition tolerance: is a cluster still available when part of it is down?
9. What is the CAP theorem?
● In distributed systems, consistency, availability, and partition tolerance
exist in a mutually dependent relationship: pick any two.
(Diagram: the CAP triangle, with databases placed by the two guarantees they favor)
○ CA (Consistency + Availability): RDBMS – MySQL, PostgreSQL, Greenplum, Vertica; Graph – Neo4J
○ CP (Consistency + Partition tolerance): HBase, MongoDB, Redis, BigTable, BerkeleyDB
○ AP (Availability + Partition tolerance): Cassandra, DynamoDB, Riak, CouchDB, Voldemort (Key-Value / Wide Column)
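The trade-off can be made concrete with a toy sketch (plain Python, invented names): during a network partition, a CP-leaning system refuses the write to stay consistent, while an AP-leaning system accepts it and lets the replicas diverge.

```python
def write(replicas, key, value, partitioned, mode):
    """mode='CP': refuse writes when any replica is unreachable.
       mode='AP': write to the reachable replicas and let them diverge."""
    reachable = [r for r in replicas if not (partitioned and r["remote"])]
    if mode == "CP" and len(reachable) < len(replicas):
        return False  # unavailable, but all replicas stay consistent
    for r in reachable:
        r["data"][key] = value
    return True

# Two replicas, the second one on the far side of a network partition.
replicas = [{"remote": False, "data": {}}, {"remote": True, "data": {}}]
ok_cp = write(replicas, "k", 1, partitioned=True, mode="CP")
ok_ap = write(replicas, "k", 1, partitioned=True, mode="AP")
print(ok_cp, ok_ap)                            # False True
print(replicas[0]["data"], replicas[1]["data"])  # diverged: {'k': 1} {}
```

AP systems typically repair the divergence later (read repair, hinted handoff), which is why they are described as "eventually consistent".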
10. Characteristics of Hadoop
● A system to process very large amounts of unstructured and complex
data at the required speed
● A system that runs on a large number of machines that don’t share any
memory or disk
● A system that runs on a cluster of machines that can be put together at
relatively low cost and maintained easily
11. Hadoop Principle
● “A system that moves the computation to where the data is”
● Key Concepts of Hadoop
○ Flexibility: a single repository for storing and analyzing any kind of data, not bounded by schema
○ Scalability: a scale-out architecture divides the workload across multiple nodes using a flexible distributed file system
○ Low cost: deployed on commodity hardware and an open-source platform
○ Fault tolerance: continues working even if node(s) go down
12. Hadoop Core Components
● HDFS - Hadoop Distributed File System
○ Provides a distributed data storage system that stores data in smaller blocks in a fail-safe
manner
● MapReduce - Programming framework
○ Can take a query over a dataset, divide it, and run it in parallel on multiple
nodes
● YARN - (Yet Another Resource Negotiator) MRv2
○ Splits the MapReduce JobTracker’s responsibilities into:
■ Resource Manager (global)
■ Application Master (per application)
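The MapReduce model can be sketched in plain Python (no Hadoop required) — a toy word count where map emits (word, 1) pairs, the pairs are shuffled by key, and reduce sums each group. The function names here are illustrative, not Hadoop's API:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input split
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key - this is the step Hadoop
    # performs across the network between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate the values for one key
    return key, sum(values)

lines = ["to be or not to be"]
pairs = chain.from_iterable(map_phase(l) for l in lines)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In real Hadoop the map tasks run where the data blocks live (the principle above), and only the shuffle moves data across the network.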
18. Summary - When to choose Hadoop?
● Large volumes of data to store and process
● Semi-Structured or Unstructured data
● Data is not well categorized
● Data contains a lot of redundancy
● Data arrives in streams or large batches
● Complex batch jobs arriving in parallel
● You don’t know how the data might be useful
20. What is Spark?
● Apache Spark is a general-purpose cluster computing framework
● Spark does computation in memory & on disk
● Apache Spark has low-level and high-level APIs
21. Spark Philosophy
● Make life easy and productive for data scientists
● Well-documented, expressive APIs
● Powerful domain-specific libraries
● Easy integration with storage systems… and caching to avoid data
movement
● Predictable releases, stable APIs
● A stable release every 3 months
24. About Spark Project
● Spark was started at UC Berkeley, and its main contributor is Databricks.
● Interactive Spark shells in Scala and Python
(spark-shell, pyspark)
● Currently stable at version 2.1
30. RDD - Resilient Distributed Dataset
● “… Collection of elements partitioned across the nodes of the cluster that
can be operated on in parallel…”
○ http://spark.apache.org/docs/latest/programming-guide.html#overview
● RDD - Resilient Distributed Dataset
○ A collection similar to a List / Array (an abstraction)
○ It’s actually an interface (behind the scenes it’s distributed over the cluster)
● DAG - Directed Acyclic Graph (the lineage Spark keeps, so lost partitions can be recomputed)
● RDDs are immutable!!!
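As a rough illustration (plain Python, no Spark — the class and names are invented), an RDD can be pictured as a list of partitions plus transformations that run on each partition independently and always return a new, immutable collection:

```python
class ToyRDD:
    """A toy 'RDD': data lives in several partitions, and map()
    runs on each partition independently (in Spark, on different nodes)."""
    def __init__(self, partitions):
        self.partitions = partitions  # list of lists

    def map(self, fn):
        # Transformations return a NEW ToyRDD - like real RDDs,
        # this one is never modified in place.
        return ToyRDD([[fn(x) for x in p] for p in self.partitions])

    def collect(self):
        # Action: gather all partitions back to the "driver"
        return [x for p in self.partitions for x in p]

rdd = ToyRDD([[1, 2], [3, 4], [5]])
doubled = rdd.map(lambda x: x * 2)
print(doubled.collect())  # [2, 4, 6, 8, 10]
print(rdd.collect())      # original unchanged: [1, 2, 3, 4, 5]
```

Real RDDs add laziness (transformations only build the DAG; an action like `collect` triggers the computation) and fault tolerance via that lineage.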
34. Wide and Narrow Transformations
● Narrow dependency: each partition of the parent RDD is used by at
most one partition of the child RDD. This means the task can be executed
locally and we don’t have to shuffle. (e.g. map, flatMap, filter, sample)
● Wide dependency: multiple child partitions may depend on one partition
of the parent RDD. This means we have to shuffle data unless the
parents are hash-partitioned. (e.g. sortByKey, reduceByKey, groupByKey,
cogroup, join, cartesian)
35. Basic Terms - Wide and Narrow Transformations
● Narrow dependencies: filter, map, union…
● Wide (shuffle) dependencies: groupByKey, join…
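A plain-Python sketch of why the two kinds differ (hypothetical helper, no Spark): a narrow `map` works inside each partition, while a wide `groupByKey` must re-route records between partitions by key — the shuffle.

```python
from collections import defaultdict

partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]

# Narrow: transform each partition independently - no data movement.
mapped = [[(k, v * 10) for k, v in part] for part in partitions]

def shuffle_by_key(parts, n_out=2):
    # Wide: groupByKey must bring equal keys together. Each record is
    # re-routed to a target partition by hash(key) - this is the shuffle,
    # and it crosses the network in a real cluster.
    out = [defaultdict(list) for _ in range(n_out)]
    for part in parts:
        for k, v in part:
            out[hash(k) % n_out][k].append(v)
    return [dict(d) for d in out]

grouped = shuffle_by_key(mapped)
print(grouped)  # e.g. [{'a': [10, 30]}, {'b': [20], 'c': [40]}]
```

Which output partition each key lands in depends on the hash, but every occurrence of a key always ends up together — that is the guarantee the shuffle pays for.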
41. DataFrame
● Main programming abstraction in SparkSQL
● Distributed collection of data organized into named columns
● Similar to a table in a relational database
● Has schema, rows and rich API
● http://spark.apache.org/docs/latest/sql-programming-guide.html
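To make "schema, rows and rich API" concrete, here is a toy single-machine stand-in (plain Python; the class and methods are invented for illustration — real SparkSQL DataFrames are distributed and query-optimized):

```python
class ToyDataFrame:
    """A tiny stand-in for a DataFrame: a schema (column names)
    plus rows, with a small query API."""
    def __init__(self, columns, rows):
        self.columns = columns
        self.rows = rows  # list of tuples matching the schema

    def select(self, *cols):
        # Keep only the named columns, preserving their order
        idx = [self.columns.index(c) for c in cols]
        return ToyDataFrame(list(cols),
                            [tuple(r[i] for i in idx) for r in self.rows])

    def filter(self, predicate):
        # predicate receives a dict of column -> value, like a Row
        keep = [r for r in self.rows
                if predicate(dict(zip(self.columns, r)))]
        return ToyDataFrame(self.columns, keep)

df = ToyDataFrame(["name", "age"], [("Ann", 31), ("Bob", 17)])
adults = df.filter(lambda row: row["age"] >= 18).select("name")
print(adults.rows)  # [('Ann',)]
```

Because the schema is known, a real DataFrame engine can plan such select/filter chains before touching the data — the advantage DataFrames have over raw RDDs.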
50. IPython Notebook Spark
● http://jupyter.org/
● http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/
● Can connect the IPython notebook to a Spark cluster and run interactive
queries in Python.
51. Conclusion
● If you’ve got a choice, keep it simple and not distributed.
● Spark is a great framework for distributed collections
◦ Fully functional API
◦ Can perform imperative actions
● With all of this compute power comes a lot of operational overhead.
● Control your work and data distribution via partitions.
◦ (No more threads :) )