This document outlines the topics that will be covered in an online training course on Machine Learning Using Spark. The course will introduce machine learning concepts and Apache Spark tools. It will cover MLlib for scalable machine learning algorithms like classification, regression, clustering and collaborative filtering. It will also cover data preparation, model evaluation, and applying machine learning to tasks like recommendation engines and text processing. The course will use Scala, Python, R and visualization libraries and include lessons on statistics, regression, classification, clustering, dimensionality reduction and more.
2. The following topics will be covered in our
Machine Learning Using Spark
Online Training:
Copyright @ 2015 Learntek. All Rights Reserved.
3. What is Machine Learning?
▪ Machine learning is an application of artificial
intelligence (AI) that gives systems the ability to automatically learn
and improve from experience without being explicitly programmed.
Machine learning focuses on the development of computer programs
that can access data and use it to learn for themselves.
4. Intro to Machine Learning Using Spark
• MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning
scalable and easy. At a high level, it provides tools such as:
• ML Algorithms: common learning algorithms such as classification, regression, clustering,
and collaborative filtering
• Featurization: feature extraction, transformation, dimensionality reduction, and selection
• Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
• Persistence: saving and loading algorithms, models, and Pipelines
• Utilities: linear algebra, statistics, data handling, etc.
5. Tools
• This course will be delivered using the Scala and Python APIs. The R
language will also be used to explain statistical concepts. Visualization
will be covered using the Bokeh and ggplot libraries.
6. Introduction to Apache Spark
▪ Spark Programming model
▪ RDD and Data Frame
▪ Transformation and Action
▪ Broadcast and Accumulator
▪ Running HDP on local machine
▪ Launching Spark Cluster
7. Basic Statistics
• Mean, Mode, Median, Range, Variance,
Standard Deviation, Quartiles,
Percentiles
• Sampling
• Sampling Methods
• Sampling Errors
• Probability Distributions
• Normal distribution, t-distribution,
Chi-square, F
• Margin of Error, Confidence Interval,
Significance Level, Degrees of Freedom
• Hypothesis concept, Type I and Type II
error
• P-value, t-Test, Chi-square Test
• Correlation Coefficient
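The descriptive statistics listed at the top of this slide can be sketched in a few lines. This is a minimal illustration in plain Python (standard `statistics` module) rather than Spark, so it runs without a cluster; the sample values are hypothetical and chosen only to give round results.

```python
import statistics

# Hypothetical sample, chosen only for illustration.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(sample)            # arithmetic average
median = statistics.median(sample)        # middle value of the sorted data
mode = statistics.mode(sample)            # most frequent value
value_range = max(sample) - min(sample)   # spread between the extremes
variance = statistics.pvariance(sample)   # population variance
std_dev = statistics.pstdev(sample)       # population standard deviation
quartiles = statistics.quantiles(sample, n=4)  # three cut points: Q1, Q2, Q3

print("mean:", mean, "median:", median, "mode:", mode)
print("range:", value_range, "variance:", variance, "std dev:", std_dev)
print("quartiles:", quartiles)
```

Note that `pvariance` and `pstdev` treat the data as the whole population; `statistics.variance` and `statistics.stdev` give the sample versions, which divide by n - 1 instead of n.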
8. Machine Learning Using Spark
• Introduction to Spark MLlib
• Data types: Vector, Labeled Point
• Feature Extraction
• Feature Transformation, Normalization
• Feature Selectors
• Locality Sensitive Hashing (LSH)
9. Regression Analysis with Spark
• Types of Regression Models
• Gradient Descent
• Linear Regression, Generalized Linear Regression
• MSE, RMSE, MAE, R-squared Coefficient
• Transforming the target variable
• Tuning Model Parameters
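The evaluation metrics named above (MSE, RMSE, MAE, R-squared) can be written out directly from their definitions. The sketch below uses plain Python instead of Spark so that it is self-contained; the function name `regression_metrics` and the sample values are hypothetical.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R-squared for paired observations."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mse = sum(e * e for e in errors) / n        # mean squared error
    rmse = math.sqrt(mse)                       # root mean squared error
    mae = sum(abs(e) for e in errors) / n       # mean absolute error

    # R-squared: 1 - (residual sum of squares / total sum of squares)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    return {"mse": mse, "rmse": rmse, "mae": mae, "r2": r2}

# Hypothetical actual and predicted values.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

RMSE and MAE are in the same units as the target variable, which is why they are often easier to interpret than MSE.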
10. Classification Model with Spark
• Linear Models, Naive Bayes Model,
Decision Tree
• Logistic Regression
• Linear Support Vector Machine
• Random Forest
• Gradient-Boosted Trees
• Training Classification Models
• Accuracy and prediction error
• Precision and Recall
• ROC curve and AUC
• Cross validation
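Accuracy, precision and recall from the list above all derive from the four confusion-matrix counts of a binary classifier. The following is a minimal plain-Python sketch, not Spark code; the function name `binary_metrics` and the label vectors are hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision and recall for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when no positives are predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}

# Hypothetical ground-truth labels and model predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
scores = binary_metrics(y_true, y_pred)
print(scores)
```

Precision answers "of the items predicted positive, how many were right?", while recall answers "of the items that are actually positive, how many were found?".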
12. Dimensionality Reduction
• Principal Component Analysis
• Singular Value Decomposition
• Clustering as dimensionality reduction
• Training a dimensionality reduction model
• Evaluating dimensionality reduction models
13. Recommendation Engine
▪ Content based filtering
▪ Collaborative filtering
▪ Overview of MovieLens data
▪ Training a recommendation model
▪ Using the recommendation model
▪ Performance Evaluation
14. Text Processing
• Feature Hashing
• TF-IDF model
• Tokenization
• Stop words
• TF-IDF Weightings
• Training a TF-IDF model
• Usage of TF-IDF model
• Evaluating TF-IDF models
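Tokenization, stop-word removal and TF-IDF weighting from the list above fit together as sketched below. This is plain Python rather than Spark, with several assumptions: whitespace tokenization, a tiny hypothetical stop-word list, raw term counts as TF, and a smoothed IDF of log((N + 1) / (df + 1)). Check the MLlib documentation for the exact formula it applies.

```python
import math
from collections import Counter

# Tiny hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "is", "a", "on", "and"}

def tokenize(text):
    """Lowercase, split on whitespace and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: the number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    # Smoothed inverse document frequency (assumed form, see the note above).
    idf = {term: math.log((n_docs + 1) / (df + 1))
           for term, df in doc_freq.items()}

    # Weight = raw term count in the document * IDF of the term.
    return [{term: count * idf[term]
             for term, count in Counter(tokens).items()}
            for tokens in tokenized]

# Hypothetical three-document corpus.
docs = [
    "Spark is a fast engine",
    "Spark MLlib is a machine learning library",
    "the engine runs on a cluster",
]
weights = tf_idf(docs)
for w in weights:
    print(w)
```

Terms that occur in fewer documents receive a larger IDF, so "fast" (one document) outweighs "spark" (two documents) in the first document.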
15. Prerequisites:
▪ Prior understanding of exploratory data analysis and data visualization will
help immensely in learning machine learning concepts and applications.
This includes basic statistical techniques for data analysis. Some
knowledge of R programming or of Python packages such as scikit-learn and NumPy will
be useful. However, we will cover basic statistical techniques as part
of this course before going deep into machine learning. This will help
everyone gain the maximum from this course.