Apache Hadoop India Summit 2011 talk "Scheduling in MapReduce using Machine Learning Techniques" by Vasudeva Varma
1. Scheduling in MapReduce using Machine Learning Techniques Cloud Computing Group Search and Information Extraction Lab http://search.iiit.ac.in IIIT Hyderabad Vasudeva Varma vv@iiit.ac.in Radheshyam Nanduri radheshyam.nanduri@research.iiit.ac.in
2. Agenda: Cloud Computing Group @ IIIT Hyderabad, Admission Control, Task Assignment, Conclusion
10. Research Areas: resource management for MapReduce (scheduling, data placement), power-aware resource management, data management in the cloud, virtualization
11. Teaching: Cloud Computing course, Monsoon semester (2008 onwards), with special focus on Apache Hadoop (MapReduce and HDFS), Mahout, virtualization, and NoSQL databases. Guest lectures from industry experts.
57. Features of the Learning Scheduler: flexible task assignment based on the state of resources; considers the job profile while allocating; tries to avoid overloading task trackers; allows users to control assignment by specifying priority functions; incremental learning.
58. Using a Classifier: a pattern classifier labels candidate jobs into two classes, good and bad. Good tasks do not overload task trackers, where overload means exceeding a limit on system load average set by the admin.
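The good/bad notion above can be made concrete with a minimal sketch. This is illustrative only, not the actual learnsched code: it labels a candidate task "good" if assigning it is predicted to keep the task tracker's load average under the admin-set limit. The class name, the limit value, and the additive load model are all assumptions.

```java
// Illustrative sketch (not the learnsched implementation): a task is
// "good" if the node's predicted load after assignment stays under the
// admin-configured overload threshold.
public class OverloadLabel {
    static final double LOAD_LIMIT = 4.0; // assumed admin-set load-average limit

    // Assumed additive model: predicted load = current load + task's CPU demand.
    static boolean isGood(double currentLoad, double taskCpuDemand) {
        return currentLoad + taskCpuDemand < LOAD_LIMIT;
    }

    public static void main(String[] args) {
        System.out.println(isGood(2.5, 1.0)); // stays under the limit -> good
        System.out.println(isGood(3.8, 0.5)); // would overload -> bad
    }
}
```

In the actual scheduler these labels would come from a trained classifier over the feature vector described next, updated incrementally as tasks complete.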
59. Feature Vector. Job features: CPU, memory, network, and disk usage of a job. Node properties, static: number of processors, maximum physical and virtual memory, CPU frequency. Node properties, dynamic: state of resources, number of running map tasks, number of running reduce tasks.
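A hypothetical assembly of that feature vector might look as follows. Field names and ordering are illustrative, not taken from the learnsched source; the point is only that job resource usage, static node properties, and dynamic node state are concatenated into one vector for the classifier.

```java
// Sketch of feature-vector construction mirroring the slide's three groups.
// All parameter names are illustrative assumptions.
public class FeatureVector {
    static double[] build(double jobCpu, double jobMem, double jobNet, double jobDisk,
                          int numProcessors, double maxMemGb, double cpuFreqGhz,
                          double loadAvg, int runningMaps, int runningReduces) {
        return new double[] {
            jobCpu, jobMem, jobNet, jobDisk,         // job features
            numProcessors, maxMemGb, cpuFreqGhz,     // static node properties
            loadAvg, runningMaps, runningReduces     // dynamic node state
        };
    }

    public static void main(String[] args) {
        double[] v = build(0.4, 0.2, 0.1, 0.3, 8, 16.0, 2.4, 1.5, 3, 1);
        System.out.println(v.length); // 10 features in this sketch
    }
}
```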
60. Job Selection: from the candidates labelled as good, select the one with maximum priority, then create a task of the selected job.
61. Priority (Utility) Functions allow policy enforcement. FIFO: U(J) = J.age. Revenue-oriented policies are also possible. If the priority of all jobs is equal, the scheduler will always assign the task that has the maximum likelihood of being labelled good.
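The selection step combined with the FIFO utility U(J) = J.age can be sketched as follows. This is a minimal illustration, not the learnsched code; the Job class, its fields, and the assumption that the classifier's good/bad label is already attached to each candidate are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Sketch: among candidates the classifier labelled "good", pick the job
// with maximum utility. FIFO policy from the slide: U(J) = J.age.
public class JobSelection {
    public static class Job {
        public final String id;
        public final long age;      // time since submission (FIFO utility)
        public final boolean good;  // classifier's label for this candidate
        public Job(String id, long age, boolean good) {
            this.id = id; this.age = age; this.good = good;
        }
    }

    public static Job selectFifo(List<Job> candidates) {
        Job best = null;
        for (Job j : candidates) {
            if (!j.good) continue;                  // skip jobs predicted to overload
            if (best == null || j.age > best.age)   // FIFO: oldest good job wins
                best = j;
        }
        return best; // null if no candidate was labelled good
    }

    public static void main(String[] args) {
        List<Job> jobs = Arrays.asList(
            new Job("j1", 120, true),
            new Job("j2", 300, false),  // oldest, but labelled bad
            new Job("j3", 200, true));
        System.out.println(selectFifo(jobs).id); // prints j3
    }
}
```

A revenue-oriented policy would only swap the comparison on `age` for one on a revenue estimate; the good/bad filter stays the same.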
62. Job Profile: users submit 'hints' about job performance, estimating the job's resource consumption on a scale of 10, 10 being the highest. This data is passed at job submission time through job parameters, e.g. learnsched.jobstat.map = "1:2:3:4". The scheduler is open-sourced at http://code.google.com/p/learnsched/
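Parsing such a hint string is straightforward. Note an assumption here: the slide shows only the colon-separated format "1:2:3:4", so mapping the four fields to CPU, memory, network, and disk (matching the job features listed earlier) is a guess, not documented in the slide.

```java
// Sketch of parsing a learnsched.jobstat.map-style hint such as "1:2:3:4".
// The interpretation of the four fields as CPU:memory:network:disk
// estimates on a scale of 10 is an assumption.
public class JobHint {
    static int[] parse(String hint) {
        String[] parts = hint.split(":");
        int[] estimates = new int[parts.length];
        for (int i = 0; i < parts.length; i++)
            estimates[i] = Integer.parseInt(parts[i].trim());
        return estimates;
    }

    public static void main(String[] args) {
        int[] est = parse("1:2:3:4");
        System.out.println(est.length); // prints 4
    }
}
```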
74. Less runtime, happier users, and more revenue for the service provider.
75. Thank you. Cloud Computing Group, Search and Information Extraction Lab, http://search.iiit.ac.in, IIIT Hyderabad. Questions/Suggestions/Comments? Vasudeva Varma vv@iiit.ac.in, Radheshyam Nanduri radheshyam.nanduri@research.iiit.ac.in
Editor's Notes
The Search and Information Extraction Lab (SIEL) at LTRC, IIIT Hyderabad is actively involved in research in many areas relevant to Cloud Computing. The main motivation behind establishing a research team in cloud computing at SIEL was to enable researchers in the lab to experiment with very large datasets, which are becoming the norm in search and information extraction research. To facilitate handling of such large datasets, we began exploring several methods for operating on them using a cluster of machines. Eventually, we chose MapReduce as the preferred model, as it is very well suited to data-intensive applications. We began exploring MapReduce and its most popular implementation, Apache Hadoop. However, we soon realized that there was huge potential for research in improving the core MapReduce framework in areas such as fault tolerance, resource management, and user accessibility. As a result, we established a team that does dedicated research on Hadoop and MapReduce.