As a company starts dealing with large amounts of data, operations engineers are challenged with managing the influx of information while keeping that data resilient. Hadoop HDFS, Mesos and Spark help: Mesos provides a scheduler that lets cluster resources be shared, giving data scientists and engineers a common ground where they can develop high-performance data processing applications and deploy their own tools.
4. Mesos and Data Analysis
Yes, you don't need Hadoop to start using Mesos and Spark.
5. Now, If You...
- Need to store large files? By default each HDFS block is 128 MB.
- Is your data written mainly as new files, or by appending to existing ones?
6. Convinced You Want to Jump on the Hadoop Bandwagon?
Read:
Sammer, Eric. Hadoop Operations. Sebastopol, CA: O'Reilly, 2012. Print.
13. Hadoop MRv1 in Mesos
- Requires Hadoop MRv1.
- Officially works with CDH5 MRv1.
- Also works with Apache Hadoop 0.22, 0.23 and 1+.
- Apache Hadoop 2+ doesn't ship with MRv1!
14. Hadoop MRv1 in Mesos
- Requires a JobTracker.
- By default it uses org.apache.hadoop.mapred.JobQueueTaskScheduler.
- You can change it, e.g. to ...mapred.FairScheduler.
15. Hadoop MRv1 in Mesos
- Requires a TaskTracker.
- That is org.apache.hadoop.mapreduce.server.jobtracker.TaskTracker.
- And not org.apache.hadoop.mapred.TaskTracker.
17. How Does Hadoop MRv1 in Mesos Work?
1. The framework's Mesos scheduler creates the JobTracker as part of the driver.
2. The JobTracker uses org.apache.hadoop.mapred.MesosScheduler to launch tasks (see the configuration sketch below).
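As a rough illustration, the wiring above can be expressed as a Scala JobConf sketch. In practice these properties live in mapred-site.xml, and the mapred.mesos.taskScheduler key comes from the Hadoop-on-Mesos framework, so treat the exact names as assumptions to verify against your version:

import org.apache.hadoop.mapred.JobConf

// Sketch only: these settings normally go in mapred-site.xml.
val conf = new JobConf()

// Point the JobTracker at the Mesos-aware scheduler.
conf.set("mapred.jobtracker.taskScheduler",
  "org.apache.hadoop.mapred.MesosScheduler")

// Tell it which underlying scheduler to delegate to; this property name is
// taken from the Hadoop-on-Mesos framework, so verify it for your version.
conf.set("mapred.mesos.taskScheduler",
  "org.apache.hadoop.mapred.JobQueueTaskScheduler") // or the FairScheduler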
21. Personal Preference
- Use Hadoop 2.4.0 or above.
- NameNode HA through the Quorum Journal Manager.
- Move to Spark if possible.
22. Example of a Mesos Data Analysis Stack
1. HDFS stores files.
2. Use the Spark CLI to test ideas.
3. Use Spark Submit for jobs.
4. Use Chronos or Oozie to schedule workflows.
27. Spark Fine-Grained Scheduling
- Enabled by default.
- Each Spark task runs as a separate Mesos task.
- Incurs launch overhead for every task.
28. Spark Coarse-Grained Scheduling
- Uses only one long-running Spark task on each Mesos slave.
- Dynamically schedules its own “mini-tasks” within it, using Akka.
- Lower startup overhead.
- Reserves the cluster resources for the complete duration of the application (see the configuration sketch below).
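A minimal sketch of how the two modes are selected. The app name, master URL and numbers are placeholders; the same keys can also go in spark-defaults.conf or be passed with --conf on spark-submit:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("scheduling-demo")              // hypothetical app name
  .setMaster("mesos://zk://host:2181/mesos")  // placeholder Mesos master URL
  // false = fine-grained (one Mesos task per Spark task),
  // true  = coarse-grained (one long-running task per slave).
  .set("spark.mesos.coarse", "true")
  // In coarse-grained mode, cap the cores held for the whole application
  // so it does not reserve the entire cluster.
  .set("spark.cores.max", "16")

val sc = new SparkContext(conf)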
29. Beware of...
- Greedy scheduling (coarse-grained).
- Overcommitting and deadlocks (fine-grained).
31. Use Spark Submit
Avoid parametrizing the SparkContext in your code as much as possible.
Leverage the spark-submit arguments, properties files and environment variables to configure your application.
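For instance, a sketch of an application that keeps the SparkContext free of cluster-specific settings; the class, jar and paths are hypothetical, and everything else arrives via spark-submit:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // No setMaster, memory or core settings in code: they come from
    // spark-submit arguments, a properties file or environment variables.
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)

    sc.textFile(args(0))
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile(args(1))

    sc.stop()
  }
}

// Submitted, for example, as:
//   spark-submit --master mesos://zk://host:2181/mesos \
//     --conf spark.executor.memory=4g \
//     --properties-file wordcount.conf \
//     --class WordCount wordcount.jar hdfs:///in hdfs:///out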
33. Understand and Tune Your Applications
- Know your working set.
- Understand Spark partitioning and block management.
- Define your Spark workflow and decide where to cache/persist.
- If you cache/persist in serialized form, use Kryo (see the sketch below).
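A sketch of the Kryo plus serialized-persist combination; the Record class and the input path are made up for illustration:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

case class Record(id: Long, payload: String)  // hypothetical record type

val conf = new SparkConf()
  .setAppName("tuning-demo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes avoids writing full class names with every object.
  .registerKryoClasses(Array(classOf[Record]))

val sc = new SparkContext(conf)

val records = sc.textFile("hdfs:///data/records")       // placeholder path
  .map(line => Record(line.hashCode.toLong, line))

// Persist the working set in serialized form so Kryo actually pays off.
records.persist(StorageLevel.MEMORY_ONLY_SER)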
34. Example Spark API PairRDDFunctions
def combineByKey[C](
createCombiner: V => C,
mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C,
numPartitions: Int): RDD[(K, C)]
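A small usage sketch of that overload, computing a per-key average by combining values into (sum, count) pairs; sc is assumed to be an existing SparkContext:

// (user, score) pairs; toy data for illustration.
val scores = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))

val sumCounts = scores.combineByKey(
  (v: Double) => (v, 1L),                                   // createCombiner
  (c: (Double, Long), v: Double) => (c._1 + v, c._2 + 1L),  // mergeValue
  (c1: (Double, Long), c2: (Double, Long)) =>
    (c1._1 + c2._1, c1._2 + c2._2),                         // mergeCombiners
  4)                                                        // numPartitions

val averages = sumCounts.mapValues { case (sum, count) => sum / count }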
39. Tune Your Data
- Per data source, understand its optimal block size.
- Leverage Avro as the serialization format.
- Leverage Parquet as the storage format.
- Try to keep your Avro & Parquet schemas flat (see the sketch below).
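A sketch of keeping a schema flat and writing it as Parquet, assuming a Spark build that ships the DataFrame API (Spark SQL); the Event class and the output path are hypothetical:

import org.apache.spark.sql.SQLContext

// Flat schema: primitive fields only, no nested records or maps.
case class Event(userId: Long, eventType: String, ts: Long, value: Double)

val sqlContext = new SQLContext(sc)   // assumes an existing SparkContext `sc`
import sqlContext.implicits._

val events = sc.parallelize(Seq(
  Event(1L, "click", 1404000000L, 0.5),
  Event(2L, "view",  1404000100L, 1.0)))

// Columnar Parquet output; a flat schema keeps column pruning and
// predicate pushdown straightforward.
events.toDF().write.parquet("hdfs:///warehouse/events")   // placeholder path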
41. Each Application
- Instrument the code.
- Measure input size in number of records and bytes.
- Measure output size the same way (see the sketch below).
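One way to sketch that instrumentation with accumulators; the paths and metric names are arbitrary, and newer Spark versions expose sc.longAccumulator instead:

import org.apache.spark.SparkContext._   // accumulator implicits on older Spark

// Assumes an existing SparkContext `sc`.
val inputRecords = sc.accumulator(0L, "input records")
val inputBytes   = sc.accumulator(0L, "input bytes")

val cleaned = sc.textFile("hdfs:///data/raw")            // placeholder path
  .map { line =>
    inputRecords += 1L
    inputBytes   += line.getBytes("UTF-8").length.toLong
    line.trim
  }
  .filter(_.nonEmpty)

cleaned.saveAsTextFile("hdfs:///data/clean")             // placeholder path

// Accumulator values are only meaningful after an action has run the stage.
println(s"input: ${inputRecords.value} records, ${inputBytes.value} bytes")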
42. Standardize
- The JDK & JRE version across your cluster.
- The Spark version across your cluster.
- The libraries added to the JVM classpath by default.
- A packaging strategy for your application, e.g. an uber jar (see the build sketch below).
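A sketch of the packaging side with sbt-assembly; the name and version numbers are illustrative, and Spark itself is marked provided because the standardized cluster supplies it at runtime:

// build.sbt (sbt-assembly added in project/plugins.sbt via addSbtPlugin)
name := "my-spark-app"                       // hypothetical
scalaVersion := "2.11.12"                    // match the cluster's Spark build

libraryDependencies ++= Seq(
  // "provided": kept out of the uber jar, supplied by the cluster.
  "org.apache.spark" %% "spark-core" % "2.4.8" % "provided"
)

// `sbt assembly` bundles everything else into a single uber jar.
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", _*) => MergeStrategy.discard
  case _                        => MergeStrategy.first
}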
44. Some Differences with YARN
- Execution: cluster vs. client modes.
- Isolation: processes vs. cgroups.
- Docker support? LXC templates?
- Deployment complexity?