Apache Hadoop India Summit 2011 talk "Scheduling in MapReduce using Machine Learning Techniques" by Vasudeva Varma
1. Scheduling in MapReduce using Machine Learning Techniques Cloud Computing Group Search and Information Extraction Lab http://search.iiit.ac.in IIIT Hyderabad Vasudeva Varma vv@iiit.ac.in Radheshyam Nanduri radheshyam.nanduri@research.iiit.ac.in
2. Agenda: Cloud Computing Group @ IIIT Hyderabad, Admission Control, Task Assignment, Conclusion
10. Research Areas: resource management for MapReduce (scheduling, data placement), power-aware resource management, data management in the cloud, virtualization
11. Teaching: Cloud Computing course, Monsoon semester (2008 onwards), with special focus on Apache Hadoop (MapReduce and HDFS), Mahout, virtualization, and NoSQL databases. Guest lectures from industry experts.
57. Features of the Learning Scheduler: flexible task assignment based on the state of resources; considers the job profile while allocating; tries to avoid overloading task trackers; allows users to control assignment by specifying priority functions; incremental learning.
58. Using a Classifier: a pattern classifier labels candidate jobs into two classes, good and bad. Good tasks do not overload task trackers, where overload means exceeding a limit on system load average set by the admin.
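The good/bad notion above can be made concrete with a minimal sketch. This is illustrative only, not the actual learnsched code: it labels a candidate task "good" if assigning it is predicted to keep the task tracker's load average under the admin-set limit. The class name, the limit value, and the additive load model are all assumptions.

```java
// Illustrative sketch (not the learnsched implementation): a task is
// "good" if the node's predicted load after assignment stays under the
// admin-configured overload threshold.
public class OverloadLabel {
    static final double LOAD_LIMIT = 4.0; // assumed admin-set load-average limit

    // Assumed additive model: predicted load = current load + task's CPU demand.
    static boolean isGood(double currentLoad, double taskCpuDemand) {
        return currentLoad + taskCpuDemand < LOAD_LIMIT;
    }

    public static void main(String[] args) {
        System.out.println(isGood(2.5, 1.0)); // stays under the limit -> good
        System.out.println(isGood(3.8, 0.5)); // would overload -> bad
    }
}
```

In the actual scheduler these labels would come from a trained classifier over the feature vector described next, updated incrementally as tasks complete.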
59. Feature Vector. Job features: CPU, memory, network, and disk usage of a job. Node properties, static: number of processors, maximum physical and virtual memory, CPU frequency. Node properties, dynamic: state of resources, number of running map tasks, number of running reduce tasks.
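A hypothetical assembly of that feature vector might look as follows. Field names and ordering are illustrative, not taken from the learnsched source; the point is only that job resource usage, static node properties, and dynamic node state are concatenated into one vector for the classifier.

```java
// Sketch of feature-vector construction mirroring the slide's three groups.
// All parameter names are illustrative assumptions.
public class FeatureVector {
    static double[] build(double jobCpu, double jobMem, double jobNet, double jobDisk,
                          int numProcessors, double maxMemGb, double cpuFreqGhz,
                          double loadAvg, int runningMaps, int runningReduces) {
        return new double[] {
            jobCpu, jobMem, jobNet, jobDisk,         // job features
            numProcessors, maxMemGb, cpuFreqGhz,     // static node properties
            loadAvg, runningMaps, runningReduces     // dynamic node state
        };
    }

    public static void main(String[] args) {
        double[] v = build(0.4, 0.2, 0.1, 0.3, 8, 16.0, 2.4, 1.5, 3, 1);
        System.out.println(v.length); // 10 features in this sketch
    }
}
```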
60. Job Selection: from the candidates labelled as good, select the one with maximum priority, then create a task of the selected job.
61. Priority (Utility) Functions allow policy enforcement. FIFO: U(J) = J.age. Revenue-oriented policies are also possible. If the priority of all jobs is equal, the scheduler will always assign the task that has the maximum likelihood of being labelled good.
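The selection step combined with the FIFO utility U(J) = J.age can be sketched as follows. This is a minimal illustration, not the learnsched code; the Job class, its fields, and the assumption that the classifier's good/bad label is already attached to each candidate are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Sketch: among candidates the classifier labelled "good", pick the job
// with maximum utility. FIFO policy from the slide: U(J) = J.age.
public class JobSelection {
    public static class Job {
        public final String id;
        public final long age;      // time since submission (FIFO utility)
        public final boolean good;  // classifier's label for this candidate
        public Job(String id, long age, boolean good) {
            this.id = id; this.age = age; this.good = good;
        }
    }

    public static Job selectFifo(List<Job> candidates) {
        Job best = null;
        for (Job j : candidates) {
            if (!j.good) continue;                  // skip jobs predicted to overload
            if (best == null || j.age > best.age)   // FIFO: oldest good job wins
                best = j;
        }
        return best; // null if no candidate was labelled good
    }

    public static void main(String[] args) {
        List<Job> jobs = Arrays.asList(
            new Job("j1", 120, true),
            new Job("j2", 300, false),  // oldest, but labelled bad
            new Job("j3", 200, true));
        System.out.println(selectFifo(jobs).id); // prints j3
    }
}
```

A revenue-oriented policy would only swap the comparison on `age` for one on a revenue estimate; the good/bad filter stays the same.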
62. Job Profile: users submit 'hints' about job performance, estimating the job's resource consumption on a scale of 10, 10 being the highest. This data is passed at job submission time through job parameters, e.g. learnsched.jobstat.map = "1:2:3:4". The scheduler is open-sourced at http://code.google.com/p/learnsched/
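Parsing such a hint string is straightforward. Note an assumption here: the slide shows only the colon-separated format "1:2:3:4", so mapping the four fields to CPU, memory, network, and disk (matching the job features listed earlier) is a guess, not documented in the slide.

```java
// Sketch of parsing a learnsched.jobstat.map-style hint such as "1:2:3:4".
// The interpretation of the four fields as CPU:memory:network:disk
// estimates on a scale of 10 is an assumption.
public class JobHint {
    static int[] parse(String hint) {
        String[] parts = hint.split(":");
        int[] estimates = new int[parts.length];
        for (int i = 0; i < parts.length; i++)
            estimates[i] = Integer.parseInt(parts[i].trim());
        return estimates;
    }

    public static void main(String[] args) {
        int[] est = parse("1:2:3:4");
        System.out.println(est.length); // prints 4
    }
}
```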
74. Less runtime, happier users, and more revenue for the service provider.
75. Thank you. Cloud Computing Group, Search and Information Extraction Lab, http://search.iiit.ac.in, IIIT Hyderabad. Questions/Suggestions/Comments? Vasudeva Varma vv@iiit.ac.in, Radheshyam Nanduri radheshyam.nanduri@research.iiit.ac.in
Editor's Notes
The Search and Information Extraction Lab (SIEL) at LTRC, IIIT Hyderabad is actively involved in research in many areas relevant to Cloud Computing. The main motivation behind establishing a research team in cloud computing at SIEL was to enable researchers in the lab to experiment with very large datasets, which are becoming the norm in search and information extraction research. To facilitate handling of such large datasets, we began exploring several methods for operating on them using a cluster of machines. Eventually, we chose MapReduce as the preferred model, as it is very well suited to data-intensive applications. We began exploring MapReduce and its most popular implementation, Apache Hadoop. However, we soon realized that there was huge potential for research in improving the core MapReduce framework in areas such as fault tolerance, resource management, and user accessibility. As a result, we established a team that does dedicated research on Hadoop and MapReduce.