2. ● Madhukara Phatak
● Big data consultant and trainer at datamantra.io
● Consults in Hadoop, Spark and Scala
● www.madhukaraphatak.com
3. Agenda
● Idea
● Motivation
● Architecture of existing big data systems
● Function abstraction
● Third party libraries
● Implementing third party libraries
● MySQL task
● Code example
5. Motivation
● The first version of Spark had only 1600 lines of Scala code
● Had all the basic pieces of RDD and the ability to run a distributed system using Mesos
● Recreating the same code step by step builds understanding
● Ample time in hand
6. Distributed systems from 30,000 ft
[Layer diagram, bottom to top:]
● Distributed Storage (HDFS/S3)
● Distributed Cluster Management (YARN/Mesos)
● Distributed Processing Systems (Spark/MapReduce)
● Data Applications
8. Function abstraction
● The whole Spark API can be summarized as a Scala function, which can be represented as follows:
() => T
● This Scala function can be parallelized and sent over the network to run on multiple machines using Mesos
● The function is represented as a task inside the framework
● FunctionTask.scala (a sketch follows)
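A minimal sketch of what FunctionTask.scala could look like; the body below is an assumption based on the description above, not Spark's actual source:

    // a task wraps a zero-argument function; it is Serializable so the
    // framework can ship it over the network to an executor
    class FunctionTask[T](body: () => T) extends Serializable {
      def run(): T = body()
    }

    // usage: any () => T becomes a shippable unit of work
    val task = new FunctionTask(() => 21 * 2)
    println(task.run()) // prints 42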
9. Spark API as distributed function
● The initial API of Spark revolved around the Scala function abstraction for processing, with RDD as the data abstraction
● Every API like map and flatMap is represented as a function task which takes one parameter and returns one value (see the sketch below)
● The distribution of these functions was initially done by Mesos, and later ported to other cluster managers
● This shows how Spark started from functional programming
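To make the map/flatMap point concrete, here is a local, non-distributed sketch; SimpleRDD is a hypothetical name, not the real RDD API:

    // each API call applies a one-in, one-out function; the framework's
    // real job is shipping these functions to where the data lives
    class SimpleRDD[T](data: Seq[T]) {
      def map[U](f: T => U): SimpleRDD[U] = new SimpleRDD(data.map(f))
      def flatMap[U](f: T => Seq[U]): SimpleRDD[U] = new SimpleRDD(data.flatMap(f))
      def collect(): Seq[T] = data
    }

    val lengths = new SimpleRDD(Seq("big data", "spark"))
      .flatMap(_.split(" ").toSeq)
      .map(_.length)
      .collect() // Seq(3, 4, 5)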
10. Till now
● Discussion about Mesos and its abstraction
● Hello world code on Mesos
● Defining the Function interface
● Implementing
○ a scheduler to run Scala code
○ a custom executor for Scala
○ serialization and deserialization of Scala functions
● https://www.youtube.com/watch?v=Oy9ToN4O63c
11. What can a local function do?
● Access local data. Even in Spark, functions normally access HDFS data local to the node
● Access classes provided by the framework
● Run any logic which can be serialized (see the sketch below)
What can it not do?
● Access classes from outside the framework
● Access the results of other functions (shuffle)
● Access lookup data (broadcast)
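Since shipping a function means shipping its bytes, serializability is the key constraint above. A minimal sketch of round-tripping a Scala closure through plain Java serialization; the helper names are illustrative:

    import java.io._

    def serialize(f: () => String): Array[Byte] = {
      val bytes = new ByteArrayOutputStream()
      val out = new ObjectOutputStream(bytes)
      out.writeObject(f)
      out.close()
      bytes.toByteArray
    }

    def deserialize(data: Array[Byte]): () => String = {
      val in = new ObjectInputStream(new ByteArrayInputStream(data))
      try in.readObject().asInstanceOf[() => String]
      finally in.close()
    }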
12. Need for third party libraries
● The ability to add third party libraries to a distributed processing framework is important
● Third party libraries allow us to
○ Connect to third party sources
○ Implement custom logic, like matrix manipulation, inside the function abstraction
○ Extend the base framework with a set of libraries, e.g. spark-sql
○ Optimize for specific hardware
13. Approaches to third party libraries
● There are two different approaches to distributing third party jars
● UberJar - build all the dependencies and your application code into a single jar (see the sbt sketch below)
● The second approach is to distribute the libraries separately and add them to the classpath of the executors
● UberJar suffers from issues of jar size and versioning
● So we are going to follow the second approach, which is similar to the one followed in Spark
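For contrast, a minimal sketch of the UberJar approach with the sbt-assembly plugin; the versions and jar names are illustrative. The size and versioning problems come from bundling every dependency:

    // project/plugins.sbt
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.2.0")

    // build.sbt: every dependency, e.g. the MySQL driver, gets bundled
    libraryDependencies += "mysql" % "mysql-connector-java" % "5.1.49"
    assembly / assemblyJarName := "app-with-dependencies.jar"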
14. Design for distributing jars
[Diagram: the Scheduler/Driver runs the scheduler code and a jar-serving HTTP server; Executor 1 and Executor 2 each download jars over HTTP from that server.]
15. Distributing jars
● Third party jars are distributed over HTTP across the cluster
● Whenever the scheduler/driver comes up, it starts an HTTP server to serve the jars passed to it by the user
● Whenever executors are created, the scheduler passes them the URI of the HTTP server to connect to
● Executors connect to the jar server and download the jars to their respective machines, then add them to their classpath
16. Code for implementing
● We need multiple changes to our existing code base to support third party jars
● The following are the different steps
○ Implementation of an embedded HTTP server
○ Change to the scheduler to start the HTTP server
○ Change to the executor to download jars and add them to the classpath
○ A function which uses a third party library
17. Http Server
● We implement an embedded HTTP server using Jetty
● Jetty is a popular HTTP server and J2EE servlet container from the Eclipse organization
● One of the strengths of Jetty is that it can be embedded inside another program to provide HTTP interfaces to certain functionality
● Initial versions of Spark used Jetty for jar distribution. Newer versions use Netty.
● https://eclipse.org/jetty/
● HttpServer.scala (a sketch follows)
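A minimal sketch of such an embedded server, assuming the Jetty 9 API; it serves one directory of jars over HTTP on a free port:

    import org.eclipse.jetty.server.{Server, ServerConnector}
    import org.eclipse.jetty.server.handler.ResourceHandler

    class HttpServer(resourceBase: String) {
      private val server = new Server(0) // port 0 picks any free port

      def start(): Unit = {
        val handler = new ResourceHandler
        handler.setResourceBase(resourceBase) // directory whose files are served
        server.setHandler(handler)
        server.start()
      }

      // the URI executors will use to download jars
      def uri: String = {
        val port = server.getConnectors.head
          .asInstanceOf[ServerConnector].getLocalPort
        s"http://${java.net.InetAddress.getLocalHost.getHostAddress}:$port"
      }

      def stop(): Unit = server.stop()
    }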
18. Scheduler change
● Once we have the HTTP server, we need to start it when our scheduler starts
● We will use the registered callback for creating our jar server
● As part of starting the jar server, we copy all the jars provided by the user to a location which becomes the base directory for the server
● Once we have the server running, we pass the server URI on to all the executors
● TaskScheduler.scala (a sketch follows)
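A sketch of the registered callback using the Mesos Java API; userJars, jarServer and jarServerUri are assumed fields of our scheduler, and HttpServer is the sketch from the previous slide:

    import java.nio.file.{Files, Paths, StandardCopyOption}
    import org.apache.mesos.SchedulerDriver
    import org.apache.mesos.Protos.{FrameworkID, MasterInfo}

    // inside our Scheduler implementation
    override def registered(driver: SchedulerDriver,
                            frameworkId: FrameworkID,
                            masterInfo: MasterInfo): Unit = {
      // copy user-supplied jars into the directory the server will serve
      val jarDir = Files.createTempDirectory("jars")
      userJars.foreach { jar =>
        val src = Paths.get(jar)
        Files.copy(src, jarDir.resolve(src.getFileName),
          StandardCopyOption.REPLACE_EXISTING)
      }
      jarServer = new HttpServer(jarDir.toString)
      jarServer.start()
      jarServerUri = jarServer.uri // later handed to every executor
    }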
19. Executor side
● In the executor, we download the jars with HTTP calls to the jar server running on the master
● Once we have downloaded the jars, we add them to the classpath using a URLClassLoader
● We use this classloader to run our functions so that they have access to all the jars
● We plug this code into the registered callback of the executor so that it runs only once
● TaskExecutor.scala (a sketch follows)
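A sketch of the executor side, assuming the jar file names are passed to the executor along with the server URI:

    import java.net.{URL, URLClassLoader}
    import java.nio.file.Files

    // download each jar from the scheduler's server, then build a
    // classloader that can see all of them
    def jarClassLoader(serverUri: String, jarNames: Seq[String]): URLClassLoader = {
      val localDir = Files.createTempDirectory("executor-jars")
      val urls = jarNames.map { name =>
        val target = localDir.resolve(name)
        val in = new URL(s"$serverUri/$name").openStream()
        try Files.copy(in, target) finally in.close()
        target.toUri.toURL
      }
      new URLClassLoader(urls.toArray, getClass.getClassLoader)
    }

    // functions then run with this loader as the context classloader:
    // Thread.currentThread().setContextClassLoader(loader)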
20. MySQL function
● This example is a function which uses the MySQL driver class to run JDBC queries against a MySQL instance
● We ship the MySQL jar using our jar distribution framework, so it is not part of our application jar
● There is no change to our function API, as it's a normal function like the other examples
● MySQLTask.scala (a sketch follows)
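A minimal sketch of such a task; the connection details are placeholders, and the driver class is resolved from the distributed jar rather than the application jar:

    import java.sql.DriverManager

    val mysqlTask: () => Int = () => {
      Class.forName("com.mysql.jdbc.Driver") // comes from the shipped jar
      val conn = DriverManager.getConnection(
        "jdbc:mysql://localhost:3306/test", "user", "password")
      try {
        val rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM users")
        rs.next()
        rs.getInt(1)
      } finally conn.close()
    }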