MongoDB Europe 2016 - Big Data meets Big Compute
1. Big Data meets Big Compute
Connecting MongoDB and Spark for Fun and Profit
{ name: "Ross Lawley", role: "Senior Software Engineer", twitter: "@RossC0" }
2. #MDBE16
Agenda
01 Spark introduction
   What is Spark? How does it work?
   What problems can it solve? What's the future of Spark?
02 The new connector
   Introducing the new connector: how to install it, how to use it
   from various languages, and when to use it (and when not to)
03 Internals
   A deep dive into the connector: configuration options, partitioning
   challenges, and how to scale while keeping data local
04 Demo
   An impressive demonstration of MongoDB and Spark combined!
05 Conclusions
   A quick recap, and where to go for more information
06 Questions
   I'll try and help answer any questions you might have.
   I'll also answer questions at the Drivers booth!
4. #MDBE16
What is Spark?
Fast and distributed general computing engine
• Makes it easy and fast to process large datasets
• Libraries for SQL, streaming, machine learning, graphs
• APIs in Scala, Python, Java, R
• It’s fundamentally different to what’s come before
5. #MDBE16
So why not just use Hadoop?
Spark is FAST
• Faster to write
  • Friendly API in Scala, Python, Java and R
• Faster to run
  • Up to 100x faster than Hadoop in memory
  • 10x faster on disk
7. #MDBE16
Spark History
2009 – The Beginning
   Spark project started at UC Berkeley's AMPLab
2010 – Spark open sourced
2013 – Joined the Apache Foundation
2014 – Spark 1.0.0 – 1.2.0
   Scala, Java & Python APIs, Spark SQL, Streaming, MLlib, GraphX
2015 – Spark 1.3.0 – 1.5.0
   R support, Spark SQL out of alpha, DataFrames
2016 – Spark 1.6.0 and Spark 2.0
   Datasets, Structured Streaming
8. #MDBE16
Spark Programming Model
Resilient Distributed Datasets
• An RDD is a collection of elements that is immutable, distributed and fault-tolerant.
• Transformations can be applied to an RDD, resulting in a new RDD.
• Actions can be applied to an RDD to obtain a value.
• RDDs are lazy: nothing is computed until an action runs (see the sketch below).
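A minimal illustration of transformations versus actions in plain Spark (no MongoDB involved; the dataset and names are made up):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rdd-example").setMaster("local[*]")) // local mode, just for the sketch

val numbers = sc.parallelize(1 to 1000000)  // an RDD
val evens   = numbers.filter(_ % 2 == 0)    // transformation: lazy, returns a new RDD
val doubled = evens.map(_ * 2L)             // transformation: still nothing computed
val total   = doubled.count()               // action: triggers the distributed computation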
10. #MDBE16
Built in fault tolerance
RDDs maintain lineage information that can be used to reconstruct lost
partitions
val searches = sc.textFile("hdfs://...")    // sc: the SparkContext
  .filter(_.contains("Search"))
  .map(_.split("\t")(2)).cache()            // extract the third tab-separated field
  .filter(_.contains("MongoDB"))
  .count()
Lineage: HDFS RDD → Filtered RDD → Mapped RDD → Cached RDD → Filtered RDD → Count
18. #MDBE16
The MongoDB Spark Connector
• Spark 1.6.x and Spark 2.0.x
• Scala, Python, Java, and R
• Idiomatic Scala API
• Supports custom Aggregations
• Multiple partitioning strategies
• Automatic schema inference
• Automatic conversion to Datasets
> $SPARK_HOME/bin/spark-shell --packages org.mongodb.spark:mongo-spark-connector_2.10:2.0.0
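Once the shell is up, a minimal sketch of wiring the connector in code (the URI and names are placeholders; spark.mongodb.input.uri and spark.mongodb.output.uri are the connector's documented configuration keys):

import com.mongodb.spark._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")  // local mode, just for the sketch
  .appName("mongo-spark-example")
  .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/test.coll")
  .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.coll")
  .getOrCreate()

val rdd = MongoSpark.load(spark.sparkContext)  // a MongoRDD[Document]
println(rdd.count())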
19.
“Users are already combining Apache Spark and MongoDB to build sophisticated
analytics applications. The new native MongoDB Connector for Apache Spark
provides higher performance, greater ease of use, and access to more advanced
Apache Spark functionality than any MongoDB connector available today.”
Reynold Xin, Co-Founder and Chief Architect at Databricks
20. #MDBE16
Fare Calculation Engine
One of the world's largest airlines migrates from Oracle to MongoDB and
Apache Spark to support a 100x performance improvement

Problem
• China Eastern targeting 130,000 seats sold every day across its web and mobile channels
• New fare calculation engine needed to support 20,000 search queries per second, but the current Oracle platform supported only 200 per second

Solution
• Apache Spark used for fare calculations, using business rules stored in MongoDB
• Fare calculations written to MongoDB for access by the search application
• MongoDB Connector for Apache Spark allows seamless integration with data locality awareness across the cluster
• MongoDB Enterprise Advanced provided Ops Manager for operational automation and access to expert technical support

Results
• Cluster of fewer than 20 API, Spark & MongoDB nodes supports 180m fare calculations & 1.6 billion searches per day
• Each node delivers 15x higher performance and 10x lower latency than the existing Oracle servers
22. #MDBE16
What's needed to connect to Spark?
1. Create a connection
• This has some cost:
  the MongoDB Java driver maintains a connection pool,
  authenticates connections, performs replica set discovery, etc.
• Only two modes to support:
  Reads
  Writes
23. #MDBE16
What's needed to connect to Spark?
2. Partition the data
• Partitions provide parallelism by splitting the collection into parts
• Challenging for mutable data sources, as partitions are not a snapshot in time
24. #MDBE16
MongoSamplePartitioner
The default partitioner
• Oversamples the collection
• Calculates the number of partitions,
  using the average document size and the configured partition size
• Samples the collection, taking n sample documents per partition
• Sorts the sampled data by the partition key
• Takes every nth sample as a partition boundary
• Adds min and max key partitions at the start and end of the collection:
{$gte: {_id: minKey}, $lt: {_id: 1}}
{$gte: {_id: 1}, $lt: {_id: 100}}
{$gte: {_id: 100}, $lt: {_id: 200}}
…
{$gte: {_id: 4900}, $lt: {_id: 5000}}
{$gte: {_id: 5000}, $lt: {_id: maxKey}}
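A hedged sketch of choosing and tuning the partitioner through a ReadConfig (the URI and values are placeholders; the option keys follow the connector's documented names):

import com.mongodb.spark._
import com.mongodb.spark.config.ReadConfig

val readConfig = ReadConfig(Map(
  "uri"                                -> "mongodb://127.0.0.1/test.coll",
  "partitioner"                        -> "MongoSamplePartitioner",
  "partitionerOptions.partitionKey"    -> "_id",  // field to split the collection on
  "partitionerOptions.partitionSizeMB" -> "64"    // target size of each partition
))

val partitionedRdd = MongoSpark.load(spark.sparkContext, readConfig)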
25. #MDBE16
MongoShardedPartitioner
Sharded collections are already partitioned
• Examines the shard config database
• Creates partitions based on the shard chunk min and max ranges
• Stores the shard location data for each chunk, to help promote data locality
• Adds min and max key partitions at the start and end of the collection:
{$gte: {_id: minKey}, $lt: {_id: 1}}
{$gte: {_id: 194}, $lt: {_id: 232}}
…
{$gte: {_id: 1000}, $lt: {_id: maxKey}}
26. #MDBE16
Alternative Partitioners
• MongoSplitVectorPartitioner
  A partitioner for standalone deployments or replica sets.
  The underlying splitVector command requires special privileges.
• MongoPaginateByCountPartitioner
  Creates a maximum number of partitions.
  Costs a query to calculate each partition.
• MongoPaginateBySizePartitioner
  As above, but uses the average document size to determine the partitions.
• Create your own
  Just implement the MongoPartitioner trait and add its full class path to the config (see the sketch below).
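As a rough sketch of the extension point (the trait's signature below is assumed from the 2.0.x connector sources and may differ between versions; SinglePartitioner is a made-up example):

import com.mongodb.spark.MongoConnector
import com.mongodb.spark.config.ReadConfig
import com.mongodb.spark.rdd.partitioner.{MongoPartition, MongoPartitioner}
import org.bson.BsonDocument

// A deliberately trivial partitioner: the whole collection as a single partition.
class SinglePartitioner extends MongoPartitioner {
  override def partitions(connector: MongoConnector,
                          readConfig: ReadConfig,
                          pipeline: Array[BsonDocument]): Array[MongoPartition] =
    Array(MongoPartition(0, new BsonDocument(), Nil))
}

// Then point the connector at it in the config, e.g.
//   "spark.mongodb.input.partitioner" -> "com.example.SinglePartitioner"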
27. #MDBE16
What's needed to connect to Spark?
3. Support DataFrames & Datasets
• RDDs with a schema
• Supports simple types:
  BinaryType, BooleanType, ByteType, CalendarIntervalType, DateType, DoubleType, FloatType,
  IntegerType, LongType, NullType, ShortType, StringType, TimestampType
• Complex types:
  ArrayType (a typed array)
  StructType (a map)
• Unsupported BSON types use a StructType, similar to extended JSON
28. #MDBE16
DataFrames & Datasets
• Automatic Schema inference:
val dataFrame = MongoSpark.load(sparkSession)
• Supply the schema
case class Person(firstName: String, lastName: String)
val dataFrame = MongoSpark.load[Person](sparkSession)
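Writing back is symmetric; a minimal sketch, assuming spark.mongodb.output.uri is set on the session and "people" is a view name chosen for the example:

import com.mongodb.spark.MongoSpark

// Save the DataFrame's rows to the configured output collection.
MongoSpark.save(dataFrame)

// Or register it as a view and query it with Spark SQL:
dataFrame.createOrReplaceTempView("people")
val smiths = sparkSession.sql("SELECT * FROM people WHERE lastName = 'Smith'")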
30. #MDBE16
The Anatomy of a read
MongoSpark.load(sparkSession).count()
1. Creates a MongoRDD[Row]
2. Infers the schema (as none was provided)
3. Partitions the data
4. Calculates the partitions
5. Allocates the workers
6. For each partition, on each worker:
   i. Queries and returns the cursor
   ii. Iterates the cursor and sums up the data
7. Finally, the Spark application returns the sum of the sums
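Filters can also be pushed down to MongoDB as an aggregation pipeline, so each partition's cursor returns less data. A sketch (the collection and predicate are hypothetical):

import com.mongodb.spark._
import org.bson.Document

val rdd = MongoSpark.load(spark.sparkContext)

// The $match stage runs inside MongoDB; only matching documents cross the wire.
val filtered = rdd.withPipeline(Seq(Document.parse("{ $match: { status: 'active' } }")))
println(filtered.count())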
37. #MDBE16
Scenario: You've won the EuroMillions lottery!
• To celebrate, you want to travel to Europe's 50 largest cities!
• The nouveau riche only have one way to travel: in style, by personal helicopter!
• It's a logistical nightmare: the "Travelling Salesman Problem"
38. #MDBE16
The scale of the problem
• With 50 places to visit (fixing the starting city), there are 49 × 48 × 47 × … × 3 × 2 × 1 (i.e. 49!)
  possible ways to travel between them.
  This number is 63 digits long:
  608,281,864,034,267,560,872,252,163,321,295,376,887,552,831,379,210,240,000,000,000
• We don't need to calculate all possible routes; we just need a route that is good enough.
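A quick sanity check of that number in Scala:

// 49!: the orderings of the remaining 49 cities once the starting city is fixed.
val routes = (1 to 49).map(BigInt(_)).product

println(routes)                  // 608281864034267560872252163321295376887552831379210240000000000
println(routes.toString.length)  // 63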
39. #MDBE16
Choosing MongoDB and Spark
Good fit:
• Not possible directly via the aggregation framework
• CPU intensive task
• Needs code to solve the problem
Bad fit:
• Not an obviously parallel problem
• But we can fork, divide and join the work using Spark
40. #MDBE16
Finding a solution with a genetic algorithm
Slightly complex, but basically we're using evolution (sketched in code below):
• Randomly generate a number of routes
• Then "evolve" the routes over a number of generations:
  • Crossover two parent routes to create a child route
  • Randomly mutate a percentage of the child routes
  • Keep a percentage of the best routes
• After X generations we end up with an evolved route that is short
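A minimal single-machine sketch of that loop (all names, rates and the distance matrix are hypothetical; the actual demo distributes this work with Spark):

import scala.util.Random

object TspGa {
  type Route = Vector[Int] // a permutation of city indices
  val rnd = new Random(42)

  // Total length of the closed tour described by `route`.
  def routeLength(route: Route, dist: Array[Array[Double]]): Double =
    (route :+ route.head).sliding(2).map { case Seq(a, b) => dist(a)(b) }.sum

  // Ordered crossover: copy a slice of one parent, then fill the remaining
  // positions with the other parent's cities in their original order.
  def crossover(a: Route, b: Route): Route = {
    val i = rnd.nextInt(a.size); val j = rnd.nextInt(a.size)
    val (start, end) = (i min j, i max j)
    val slice = a.slice(start, end)
    val rest  = b.filterNot(slice.contains)
    rest.take(start) ++ slice ++ rest.drop(start)
  }

  // Mutation: swap two randomly chosen cities.
  def mutate(route: Route): Route = {
    val i = rnd.nextInt(route.size); val j = rnd.nextInt(route.size)
    route.updated(i, route(j)).updated(j, route(i))
  }

  // One generation: keep the `elite` best routes and breed the rest from
  // the better half of the population, mutating a percentage of children.
  def evolve(pop: Seq[Route], dist: Array[Array[Double]],
             elite: Int, mutationRate: Double): Seq[Route] = {
    val ranked  = pop.sortBy(routeLength(_, dist))
    val parents = ranked.take(pop.size / 2)
    val children = Seq.fill(pop.size - elite) {
      val child = crossover(parents(rnd.nextInt(parents.size)),
                            parents(rnd.nextInt(parents.size)))
      if (rnd.nextDouble() < mutationRate) mutate(child) else child
    }
    ranked.take(elite) ++ children
  }
}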
42. #MDBE16
An extremely powerful combination
• Many possible use cases
• Solve the right problems
  Some operations may be faster if performed with the Aggregation Framework
• Performance
  • Pick the correct partitioning strategy
  • Tune MongoDB
  • Tune Spark
• Spark is evolving all the time