Technovision
TITLE OF THE PROJECT
Driverless Machine Learning API for full automation of Classification Algorithms & ANN
The Driverless ML API is an automated machine learning platform that uses "ML to do ML" to empower data science teams to scale and implement their AI strategy.
Team Members
Name: Sayantan Ghosh
School: Kalinga Institute Of
Industrial Technology
Mobile No: 9609078275
Email: gsayantan1999@gmail.com
Driverless ML API for automation of Machine Learning
Classification Algorithms
Sayantan Ghosh
Abstract:
Over the last several years, machine learning has become an integral part of many organizations' decision-making at various levels. With not enough data scientists to fill the increasing demand for data-driven business processes, I have developed a Flask API that automates several time-consuming aspects of a typical data science workflow. The Driverless API incorporates a number of contemporary approaches to increase the transparency and accountability of complex models, to enable users to tune hyperparameters for accuracy and fairness, and to train high-quality models specific to their business needs. My idea is to automate classification problems in a matter of minutes. My approach is to implement a software application that can intelligently analyze and report on a whole dataset, determine the feature importances of the independent features, and reduce the dimension of the dataset. To make predictions more accurate, deep neural networks are also converged into the model, incorporated with advanced visualization techniques based on Scalable Vector Graphics (SVG). The API introduces fundamental concepts in machine learning interpretability (MLI), puts forward a useful analysis motif and expectations of consistency across techniques, and finally discusses several use cases. In this paper, I present a comprehensive survey of the state-of-the-art efforts in tackling the CASH (Combined Algorithm Selection and Hyperparameter Tuning) problem. In addition, I highlight the research work on automating the other steps of the full, complex machine learning pipeline (Driverless ML API), from data understanding to model deployment. Furthermore, I provide comprehensive coverage of the various tools and frameworks that have been introduced in this domain. Finally, I discuss some of the research directions and open challenges that need to be addressed in order to achieve the vision and goals of the Driverless ML process.
Keywords: Driverless AI, supervised learning, model selection, Flask API, Scalable Vector Graphics, AI to do AI.
1 Introduction:
Due to the increasing success of machine learning techniques in several application domains, they have been attracting a lot of attention from the research and business communities. In general, the effectiveness of machine learning techniques mainly rests on the availability of massive datasets. Recently, we have been witnessing continuous exponential growth in the size of data produced by various kinds of systems, devices and data sources. It has been reported that 2.5 quintillion bytes of data are created every day, and that 90% of the world's stored data has been generated in the past two years alone. On the one hand, the more data that is available, the richer and more robust the insights and results that machine learning techniques can produce. Thus, in the Big Data era, we are witnessing many leaps achieved by machine and deep learning techniques in a wide range of fields. On the other hand, this situation is raising a potential data science crisis, similar to the software crisis, due to the crucial need for an increasing number of data scientists with strong knowledge and good experience who can keep up with harnessing the power of the massive amounts of data produced daily. In particular, it has been acknowledged that data scientists cannot scale, and it is almost impossible to balance the number of qualified data scientists against the effort required to manually analyze the ever-growing volumes of available data. Thus, we are witnessing a
growing focus and interest in supporting automation of the process of building machine learning pipelines, where the presence of a human in the loop can be dramatically reduced, or preferably eliminated.
In general, the process of building a high-quality machine learning model is an iterative, complex and time-consuming process that involves a number of steps (Figure 1). In particular, a data scientist is commonly challenged with a large number of choices where informed decisions need to be taken. For example, the data scientist needs to select among a wide range of possible algorithms, including classification or regression techniques (e.g., Support Vector Machines, Neural Networks, Bayesian Models, Decision Trees, etc.), in addition to tuning the numerous hyper-parameters of the selected algorithm. In addition, the performance of the model can be judged by various metrics (e.g., accuracy, sensitivity, specificity, F1-score). Naturally, the decisions the data scientist makes in each of these steps affect the performance and quality of the developed model. For instance, on the yeast dataset, different parameter configurations of a Random Forest classifier result in accuracy values that vary by around 5%, and using different classifier learning algorithms leads to widely different performance values, around 20%, for models fitted on the same dataset. Although making such decisions requires solid knowledge and expertise, in practice users of machine learning tools are increasingly non-experts who require off-the-shelf solutions. Therefore, there has been growing interest in automating and democratizing the steps of building machine learning pipelines.
Figure 1: Typical Supervised Machine Learning Pipeline
The Driverless ML API seeks to build the fastest artificial intelligence (AI) platform on graphics processing units (GPUs). It automates some of the most difficult data science and machine learning workflows, such as feature engineering, model validation, model tuning, model selection and model deployment. It is a high-performance computing platform for automatic development and rapid deployment of state-of-the-art predictive analytics models. Many machine learning classification algorithms can benefit from the efficient, fine-grained parallelism and high throughput of GPUs, which allow training and inference to complete much faster than on CPUs. This API also makes use of MLI. Oftentimes, especially in regulated industries, model transparency and explanation become just as important as predictive performance. Driverless AI uses MLI to create easy-to-follow visualizations, interpretations, and explanations of models. It can be deployed in business applications such as loss-given-default, probability of default, customer churn, campaign response, fraud detection, anti-money-laundering, demand forecasting, and predictive asset maintenance models (or, in machine learning practice: binomial and multinomial classification problems).
1.2 Machine Learning Interpretability Taxonomy: In the context of machine learning models and results, interpretability has been defined as the ability to explain, or to shed light on, a model in terms intelligible to a human. Of course, interpretability and explanations are subjective and complicated subjects. Following ideas on interpreting machine learning, the presented approaches will be described in technical terms, but also in terms of response function complexity, scope, application domain, understanding and trust.
1.2.1 Scope: The Driverless ML API employs the techniques of expert data scientists in an easy-to-use application that helps scale your data science efforts. It empowers data scientists to work on projects faster using automation and the state of the art. With Driverless ML, everyone, including expert and junior data scientists, domain scientists, and data engineers, can develop trusted machine learning models. This next-generation automatic machine learning platform delivers unique and advanced functionality for data visualization, feature engineering, model interpretability and low-latency deployment.
1.2.2 Delivering AI at Scale: Businesses are creating and driving an AI strategy to gain a competitive edge. There are three critical challenges to achieving AI at scale: addressing the talent gap, the time a model takes to train, tune and deploy, and being able to trust the AI. Driverless AI empowers data science teams to scale and deliver trusted, production-ready models that fulfill every business's AI strategy and initiatives.
1.2.3 Filling the Talent Gap: Data scientists are in short supply for all but the largest technology companies. With Driverless AI, expert and novice data scientists, and statisticians in all businesses, can develop highly accurate models that are ready to deploy. This enables data scientists to focus on evaluating results and exploring new use cases.
1.2.4 More Models in Less Time: Reducing the time it takes to develop accurate, production-ready models is critical to solving a large number of business challenges with AI. Driverless ML automates time-consuming data science tasks including advanced feature engineering, model selection, hyper-parameter tuning, model stacking, and model deployment. These processes can be accelerated with high-performance computing on GPU and CPU systems that allow thousands of combinations and iterations to be tested to find the best model in hours. Model deployment is also streamlined with comparative scoring pipelines that include everything needed to run the model in production.
2 Key Capabilities & Technologies of Driverless ML API:
2.1 Pygal: Exploratory Data Analysis: The Driverless API automatically selects data plots based on the most
relevant data statistics to help users understand their data prior to the model building process. This is useful for understanding
the composition of very large data sets and discovering trends and possible issues such as large numbers of missing values or
significant outliers that could impact modeling results. It also provides recommendations for transformations to address the
problems identified.
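As a minimal sketch of the data statistics this EDA stage inspects, assuming the input arrives as a pandas DataFrame; the function name and the 1.5×IQR outlier rule are my illustrative choices, not the API's documented internals:

```python
import numpy as np
import pandas as pd

def eda_report(df: pd.DataFrame) -> dict:
    """Summarize issues the EDA stage looks for: missing values and outliers."""
    report = {}
    # Fraction of missing values per column
    report["missing_ratio"] = df.isna().mean().to_dict()
    # Flag numeric outliers with the 1.5 * IQR rule
    outliers = {}
    for col in df.select_dtypes(include=np.number).columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
        outliers[col] = int(mask.sum())
    report["outlier_counts"] = outliers
    return report
```

A report like this can then drive the recommended transformations (e.g., imputation for high missing ratios, clipping for heavy-tailed columns).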
Figure 2: Graphical Visualization of Feature Importance and F1-score Distribution
2.2 Automatic Feature Engineering and Model Building: Feature engineering is the secret weapon that advanced data scientists use to extract the most accurate results from algorithms. The Driverless ML API employs a library of algorithms and feature transformations to automatically engineer new, high-value features for a given data set. Included in the interface is an easy-to-read variable importance chart that shows the significance of original and newly engineered features.
2.3 Automatic Algorithm Implementation of an Artificial Neural Network: This API incorporates a diverse range of features, including classifier algorithms integrated with an Artificial Neural Network. After the data has been preprocessed, the whole dataset goes through the ML classifier pipeline along with a neural network with 2 hidden layers. As the API is centered on binary classification, I have used binary cross-entropy (binary_crossentropy) as the loss function, and the weight vectors are initialized from a random normal distribution. The two-hidden-layer architecture of the ANN is shown below:
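Since the architecture diagram is not reproduced in this text, a comparable two-hidden-layer binary classifier can be sketched with scikit-learn's MLPClassifier, which minimizes log-loss (the binary cross-entropy objective). The layer sizes and the synthetic data are my assumptions, standing in for the user's dataset and the original Keras-style setup:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for the user's dataset
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Two hidden layers; MLPClassifier optimizes log-loss, i.e. the
# binary cross-entropy objective used by the API's ANN.
ann = MLPClassifier(hidden_layer_sizes=(8, 8), max_iter=1000, random_state=42)
ann.fit(X_train, y_train)
score = ann.score(X_test, y_test)  # held-out accuracy
```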
2.4 Automatic Web Documentation : To explain models to business users and regulators, data scientists must document
the data, algorithms, and processes used to create machine learning models. Driverless ML API automatic model web
documentation relieves the user of the time-consuming task of documenting and summarizing their workflow while building
machine learning models. The web documentation includes details about the data used, the validation schema selected, model and
feature tuning, MLI, and the final model created. Auto Doc saves data scientists time, which can then be used to train and deploy
more models.
2.5 Automatic Feature Importance Factor: I have used a tree-based model (a Decision Tree) to compute the feature importances of the given dataset, and to manipulate and train my model accordingly.
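A sketch of this tree-based importance computation, assuming scikit-learn's DecisionTreeClassifier with its impurity-based feature_importances_ (the dataset here is synthetic, for illustration only):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

# Synthetic dataset with 6 features, 3 of them informative
X, y = make_classification(n_samples=300, n_features=6,
                           n_informative=3, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Impurity-based importances: one score per input feature, summing to 1
importances = tree.feature_importances_
ranked = sorted(enumerate(importances), key=lambda p: p[1], reverse=True)
```

Ranking the features by these scores gives the variable-importance chart and the basis for dimensionality reduction mentioned above.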
2.6 Automatic Comparison of Original & Preprocessed Dataset with Auto-Recommendation: The Driverless API can automatically detect the features that need to be deleted, and provides a comparative web view of the original dataset and the preprocessed dataset, so that the user can see the differences and the techniques applied to the preprocessed dataset.
3 Workflow of the Driverless API
Given a set of machine learning algorithms A = {A(1), A(2), ..., A(n)}, and a dataset D divided into disjoint training (D_train) and validation (D_validation) sets, the goal is to find an algorithm A(i)*, where A(i) ∈ A and A(i)* is a tuned version of A(i), that achieves the highest generalization performance by training A(i) on D_train and evaluating it on D_validation. In particular, the goal of any CASH optimization technique is defined as:

A* = argmin_{A(i) ∈ A} L(A(i), D_train, D_validation)

where L(A(i), D_train, D_validation) is the loss function (e.g., error rate, false positives, etc.). In practice, one constraint for CASH optimization techniques is the time budget. In particular, the aim of the optimization algorithm is to select and tune a machine learning algorithm that can achieve (near-)optimal performance in terms of the user-defined evaluation metric (e.g., accuracy, sensitivity, specificity, F1-score) within the user-defined time budget for the search process.
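The CASH objective above can be sketched as a simple search over a candidate set within a time budget. This is a minimal illustration, not the API's actual search strategy; the candidate list, the error-rate loss, and the budget value are my assumptions:

```python
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

def cash_search(candidates, X_tr, y_tr, X_val, y_val, budget_s=10.0):
    """Pick the candidate minimizing validation error within a time budget."""
    best, best_err = None, float("inf")
    start = time.monotonic()
    for model in candidates:
        if time.monotonic() - start > budget_s:
            break  # respect the user-defined time budget
        # Loss L(A, D_train, D_validation) taken here as the error rate
        err = 1.0 - model.fit(X_tr, y_tr).score(X_val, y_val)
        if err < best_err:
            best, best_err = model, err
    return best, best_err

X, y = make_classification(n_samples=300, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=1)
best, err = cash_search(
    [LogisticRegression(max_iter=1000),
     DecisionTreeClassifier(random_state=1),
     KNeighborsClassifier()],
    X_tr, y_tr, X_val, y_val)
```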
In this paper, I have presented a comprehensive survey of the state-of-the-art efforts in tackling the CASH (Combined Algorithm Selection and Hyperparameter Tuning) problem. In addition, I highlight the research work on automating the other steps of the full end-to-end machine learning pipeline (Driverless ML), from data understanding (pre-modeling) to model deployment (post-modeling). The remainder of this paper is organized as follows.
Figure 3: Driverless ML: Framework Architecture
3.1 Pipeline structure Creation :
The above figure (Figure 3) represents the different stages of creating the pipeline.
3.1.1 Input Definition Stage: The user can insert the dataset into the Driverless ML API through a web interface. The input section of the API has several fields, including the dataset, the name of the feature to be predicted, and the train/test split ratio for the dataset. For the ANN integrated into the API, the user can also provide the number of epochs for training and validation, and the optimization algorithm for minimizing the loss function of the ANN. The web interface is developed using HTML5 and CSS3. The UI is shown in Figure 4.
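The input-definition stage can be sketched as a Flask endpoint collecting the fields described above. The route name and field names (`target_feature`, `test_split`, `epochs`, `optimizer`) are hypothetical stand-ins for the real API's parameters:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/train", methods=["POST"])
def train():
    """Collect the run configuration described in the input-definition stage."""
    cfg = request.get_json(silent=True) or {}
    # Field names assumed for illustration; the real API's names may differ.
    required = {"target_feature", "test_split", "epochs", "optimizer"}
    missing = required - cfg.keys()
    if missing:
        return jsonify(error=f"missing fields: {sorted(missing)}"), 400
    return jsonify(status="accepted", config=cfg), 200
```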
Figure 4: Web Interface for giving the Input Parameters
3.1.2 Data Preprocessing Stage:
3.1.2.1 MISSING FEATURE SUBSTITUTION: Incomplete data is an unavoidable problem when dealing with most real-world data sources, and the topic has been discussed and analyzed by several researchers in the field of ML. Generally, some important factors must be taken into account when processing unknown feature values. One of the most important is the source of 'unknownness': (i) a value is missing because it was forgotten or lost; (ii) a certain feature is not applicable for a given instance, e.g., it does not exist for that instance; (iii) for a given observation, the designer of the training set does not care about the value of a certain feature (a so-called don't-care value). Depending on the case, the expert has to choose from a number of methods for handling missing data:
• Ignoring Instances with Unknown Feature Values: The simplest method: just ignore any instance that has at least one unknown feature value.
• Most Common Feature Value: The value of the feature that occurs most often is selected as the value for all the unknown values of the feature.
• Concept Most Common Feature Value: The value of the feature that occurs most often within the same class is selected as the value for all the unknown values of the feature.
• Mean Substitution: Substitute a feature's mean value, computed from the available cases, to fill in missing data values in the remaining cases.
Algorithm for Missing Value Substitution:
Let the dataset be denoted by D, with N columns and M rows, and let Features(i, j) denote the value of feature i in row j, where i indexes the columns and j indexes the rows.
Step 1: for each feature i in {1, ..., N}:
Step 2: if Σ_{j=1..M} isna(Features(i, j)) != 0:
Step 3: replace each missing Features(i, j) with (1/M) Σ_{j=1..M} Features(i, j), i.e., the feature's mean.
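The three-step mean-substitution algorithm above can be sketched with pandas; the function name is mine, and the sketch applies the rule only to numeric columns, which is an assumption consistent with mean substitution:

```python
import numpy as np
import pandas as pd

def mean_substitute(df: pd.DataFrame) -> pd.DataFrame:
    """Replace missing values in each numeric column by that column's mean."""
    out = df.copy()
    for col in out.select_dtypes(include=np.number).columns:  # Step 1
        if out[col].isna().sum() != 0:                        # Step 2
            out[col] = out[col].fillna(out[col].mean())       # Step 3
    return out
```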
3.1.2.2 LABEL ENCODING: Label encoding refers to converting labels into numeric form so as to make them machine-readable. Machine learning algorithms can then decide in a better way how those labels should be operated on. It is an important pre-processing step for structured datasets in supervised learning. I am also applying a reduction technique: if the data type of a feature is a string-type object and the number of unique values in that column is greater than 50% of the row count, then we can assume the column is irrelevant for predicting the desired feature. After label encoding, in order to get rid of the dummy-variable trap, we create unique category columns and append them to the dataset.
Algorithm for Label Encoding:
Step 1: for each feature i in {1, ..., N}:
Step 2: if Features(i).dtype == 'O' and Features(i).nunique() ≤ 0.5 × M:
Step 3: Features(i) = labelencoder_object.fit_transform(Features(i).fillna('0'))
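The encoding rule above (drop near-unique string columns, encode the rest after filling missing values with '0') can be sketched with scikit-learn's LabelEncoder; the function name is mine:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_labels(df: pd.DataFrame) -> pd.DataFrame:
    """Drop string columns whose unique count exceeds 50% of the rows;
    label-encode the remaining string columns."""
    out = df.copy()
    for col in out.select_dtypes(include="object").columns:
        if out[col].nunique() > 0.5 * len(out):
            # Assumed irrelevant: near-unique values (e.g., IDs, free text)
            out = out.drop(columns=col)
        else:
            out[col] = LabelEncoder().fit_transform(out[col].fillna("0"))
    return out
```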
3.1.2.3 FEATURE SELECTION: Feature subset selection is the process of identifying and removing possibly irrelevant and redundant features [5]. This reduces the dimension of the data and enables learning algorithms to operate faster and more effectively. Generally, features are characterized as:
• Relevant: features that have an influence on the output and whose role cannot be assumed by the rest.
• Irrelevant: features that do not have any influence on the output, and whose values are generated at random for each example.
• Redundant: a redundancy exists whenever a feature can take the role of another.
3.1.3 ML CLASSIFIERS IMPLEMENTATION: In this stage I apply all the classifiers to the dataset and compute their results in order to choose the best classifier among them. To implement these techniques I have used K-fold cross-validation for each of the classifiers: in each iteration the dataset is split into 10 folds, and the mean accuracy of each classifier is computed. I have used the Logistic Regression algorithm for regression, and the ML classifiers included in the model are the K-Nearest Neighbors Classifier, Decision Tree Classifier, Random Forest Classifier, Gaussian Naive Bayes Classifier, and Support Vector Machine Classifier.
IMPLEMENTATION CODE :
Fitting the Classification Algorithms to the dataset:
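The fitting code appears as an image in the original; under the assumption that it follows the 10-fold cross-validation procedure described above, it can be sketched as follows (the dataset here is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic data standing in for the user's uploaded dataset
X, y = make_classification(n_samples=300, random_state=7)

classifiers = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNeighbors": KNeighborsClassifier(),
    "DecisionTree": DecisionTreeClassifier(random_state=7),
    "RandomForest": RandomForestClassifier(random_state=7),
    "GaussianNB": GaussianNB(),
    "SVC": SVC(),
}

# 10-fold cross-validated mean accuracy for each classifier
scores = {name: cross_val_score(clf, X, y, cv=10).mean()
          for name, clf in classifiers.items()}
best = max(scores, key=scores.get)  # classifier with highest mean accuracy
```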
Auto-generated Graphical Visualization of the accuracy comparison of the Classification Algorithms:
Exploratory graphical analysis of the classification algorithms is crucial for the selection of classifiers. In the Driverless ML API I have included advanced solid-gauge visualizations of the accuracy and of the feature importance distribution of the dataset, generated automatically. The distributions of the F1 score and the recall score are also rendered with the pygal graph library.
IMPLEMENTATION CODE AND OUTPUT:
Analyzing the graph, it is clearly visible that Fare, Sex and Pclass have a higher impact on predicting the Survived column.
Auto-Generated Accuracy Distribution of the Classification Algorithms:
Conclusion:
Machine learning has become one of the main engines of the current era. The production pipeline of a machine learning model passes through different phases and stages that require wide knowledge of several available tools and algorithms. However, as the scale of data produced daily continues to increase exponentially, it has become essential to automate this process. In this project, I have comprehensively covered the state-of-the-art research effort in the domain of Driverless ML frameworks. I have also highlighted research directions and open challenges that need to be addressed in order to achieve the vision and goals of the Driverless ML process. I have already built a working API, and I am currently aiming to integrate Convolutional Neural Networks in order to automate disease recognition using image processing.
References:
[1] Xin He, Kaiyong Zhao, and Xiaowen Chu, "AutoML: A Survey of the State-of-the-Art."
[2] H2O.ai, Driverless AI: https://www.h2o.ai/products/h2o-driverless-ai/
[3] Y. Quanming, W. Mengshuo, J. E. Hugo, G. Isabelle, and Y. Yang, "Taking human out of learning applications: A survey on automated machine learning," 2018.
[4] Google Cloud AutoML: https://cloud.google.com/automl/
[5] Alexander Mamaev, "Auto-ML Architecture," Medium.
[6] James S. Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl, "Algorithms for hyper-parameter optimization," in Advances in Neural Information Processing Systems, pages 2546-2554, 2011.
[7] Pavel Kordik, "AutoML for Predictive Modeling," Towards Data Science.