Hadoop's first use case was to map the whole internet: a graph with millions of nodes and trillions of edges. Still, we come back to the same problem of silos when we need to manipulate data for interactive or online applications. Long story short: no support for alternative processing models, and iterative tasks can take 10x longer because of I/O barriers.
In classic Hadoop, MapReduce was part of the JobTracker and TaskTracker, hence everything had to be built on MapReduce first. It has scalability limits around 4k nodes, iterative processes can take forever on MapReduce, and JT failures kill running jobs and everything in the queue.
Typical open source stack: as we can see, MapReduce sits on top of HDFS, and all the other applications like Pig, Hive, and HBase are in turn layered on top of MapReduce.
So while Hadoop 1.x had its uses, this is really about turning Hadoop into the next generation platform. What does that mean? A platform should be able to do multiple things, ergo more than just batch processing. We need batch, interactive, online, and streaming capabilities to really turn Hadoop into a next gen platform. And it SCALES! Yahoo plans to move to a 10k node cluster.
So what does this really do? It provides a distributed application framework. Hadoop now provides a platform where we can store all data in a reliable way, and then, on the same platform, process that data without having to move it: data locality.
New additions to the family: Falcon for data lifecycle management; Tez, a new way of processing that avoids some of the I/O barriers MapReduce experienced; Knox for security and other enterprise features. But most importantly YARN, which, as you can see, everything now sits on top of.
In other words, a container spawned by an AM can itself act as a client and ask for another application to start, which in turn can do the same thing.
Now we have the concept of deploying applications into the Hadoop cluster. These applications run in containers with a set amount of resources.
The RM takes the place of the JT and still has scheduling queues and such, like the fair, capacity, and hierarchical queues.
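Hierarchical queues are configured in the Capacity Scheduler's config file; a minimal sketch (the queue names "prod" and "dev" and the 70/30 split are illustrative, not defaults):

```xml
<!-- capacity-scheduler.xml: two child queues under root, splitting cluster capacity -->
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>prod,dev</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.prod.capacity</name>
    <value>70</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.dev.capacity</name>
    <value>30</value>
  </property>
</configuration>
```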
Data locality – attempts to find a local host; if that fails, moves to the nearest rack.
Fault tolerance – robust in terms of managing containers.
Recovery – the MapReduce application master writes a checkpoint to HDFS. This way we can recover from an AM that dies: the new AM reads the checkpoint and continues.
Inter-application priorities – maps have to be completed before reducers, right? So there is a complex process in the application master to balance mappers and reducers.
Complex feedback from the RM – the app master can now look ahead and find out roughly how many resources it can get in the next 20 minutes.
Migration – migrate directly to YARN without changing a single line of code; you just need to recompile.
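The recovery point above can be sketched in a few lines. This is a hypothetical stand-in, not the MapReduce AM itself: plain file I/O stands in for HDFS, and the "work" is just a loop counter.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustrative AM-style checkpoint recovery: progress is recorded after
 *  each completed task, so a restarted attempt resumes from the last
 *  checkpoint instead of starting over. */
public class CheckpointDemo {
    /** Reads the number of completed tasks, or 0 if no checkpoint exists. */
    static int loadCheckpoint(Path p) throws IOException {
        return Files.exists(p) ? Integer.parseInt(Files.readString(p).trim()) : 0;
    }

    static void saveCheckpoint(Path p, int completed) throws IOException {
        Files.writeString(p, Integer.toString(completed));
    }

    /** Runs tasks [resume..totalTasks), checkpointing after each one;
     *  returns the task index this attempt resumed from. */
    static int run(Path checkpoint, int totalTasks) throws IOException {
        int resume = loadCheckpoint(checkpoint);
        for (int task = resume; task < totalTasks; task++) {
            // ... do the actual work for `task` here ...
            saveCheckpoint(checkpoint, task + 1);
        }
        return resume;
    }

    public static void main(String[] args) throws IOException {
        Path cp = Files.createTempFile("am", ".ckpt");
        Files.writeString(cp, "6");   // pretend a prior attempt died after task 6
        int resumedAt = run(cp, 10);  // the new attempt picks up at task 6
        System.out.println("resumed at task " + resumedAt);
    }
}
```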
Haven't used Weave, but it's on the to-do list. Sadly my tasks keep growing and I can't do them in parallel.
Application attempt id (combination of the application id and the fail count). Application submission context – submitted by the client.
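The "id plus fail count" idea can be shown with a tiny value class. This is an illustrative sketch modeled on YARN's `ApplicationAttemptId`, not the real class; the string format mirrors the `appattempt_<timestamp>_<appId>_<attempt>` style YARN uses.

```java
/** Sketch of an application attempt id: the application id plus a counter
 *  that is bumped each time the AM fails and is restarted. */
public class AttemptId {
    final long clusterTimestamp; // identifies the RM instance the app was submitted to
    final int appId;             // sequence number of the application
    final int attempt;           // the fail count for this application

    AttemptId(long clusterTimestamp, int appId, int attempt) {
        this.clusterTimestamp = clusterTimestamp;
        this.appId = appId;
        this.attempt = attempt;
    }

    /** A new id for the next AM attempt after a failure. */
    AttemptId nextAttempt() {
        return new AttemptId(clusterTimestamp, appId, attempt + 1);
    }

    @Override public String toString() {
        return String.format("appattempt_%d_%04d_%06d", clusterTimestamp, appId, attempt);
    }
}
```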
getAllQueues – metrics related to the queue, such as max capacity, current capacity, and application count. getApplications – list of applications. getNodeReports – id, rack, host, number of containers. The ApplicationSubmissionContext needs a ContainerLaunchContext as well, plus resources, priority, queue, etc.
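Those calls string together as a sketch against the Hadoop 2.x `YarnClient` API. This needs the Hadoop client jars and a running cluster, so it won't run standalone; the application name "demo", the queue "default", and the 256 MB / 1 vcore ask are illustrative values.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.*;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.util.Records;

public class YarnClientSketch {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new Configuration());
        yarn.start();

        // getAllQueues: queue metrics such as capacity and application count
        for (QueueInfo q : yarn.getAllQueues())
            System.out.printf("%s cap=%.2f max=%.2f apps=%d%n",
                q.getQueueName(), q.getCapacity(), q.getMaximumCapacity(),
                q.getApplications().size());

        // getApplications: every application known to the RM
        System.out.println(yarn.getApplications().size() + " applications");

        // getNodeReports: id, rack, host, number of containers
        for (NodeReport n : yarn.getNodeReports(NodeState.RUNNING))
            System.out.printf("%s rack=%s containers=%d%n",
                n.getNodeId(), n.getRackName(), n.getNumContainers());

        // The submission context carries the ContainerLaunchContext
        // plus resources, priority, queue, etc.
        ApplicationSubmissionContext ctx =
            yarn.createApplication().getApplicationSubmissionContext();
        ctx.setApplicationName("demo");
        ctx.setQueue("default");
        ctx.setPriority(Priority.newInstance(0));
        ctx.setResource(Resource.newInstance(256, 1));
        ctx.setAMContainerSpec(Records.newRecord(ContainerLaunchContext.class));
        yarn.submitApplication(ctx);

        yarn.stop();
    }
}
```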