Mini datathon
Thanks Great Lakes for being a lovely host…
• Premier Management Institute
• PGPBA
• Brochures & pads provided
• WiFi Connectivity
About Analytics Vidhya
First things first:
• Meetup frequency – Once every month
• Next meetup – 24th May 2015
• Aim to provide best networking and learning platform in Delhi NCR
• Areas of Interest – Data Science, Big Data, Machine Learning,
Internet of Things
Meet Your Volunteers
Kunal
Data Science Evangelist,
(Growth) Hacker, Blogger,
Husband, Father
Sunil
Blogger, Problem solver,
data scientist, Fitness
enthusiast
Manish
Avid learner, explorer,
startup guy!
Agenda
• Introduction
• Model building – life cycle
• Data Exploration and Feature Engineering methods
• Talk about modelling Techniques like
• Logistic Regression
• Decision Tree
• Random Forest
• SVM
• Predict the survival on the Titanic
Introduction
• Name
• Experience in Data Science
• Current Company
• Are you proficient with (SAS/ R/ Python)?
Team creation
• Look for diversity in experience
• Hopefully common toolset, but complementary can also work
• Competing against each other
Team Formation
A few ground rules for today
• This is not a tutorial – you are expected to solve this problem yourself
• We are here to help you organize your thoughts and to make sure you
are going in the right direction.
• Good question to ask:
• While trying logistic regression in R, I am facing the following error?
• Bad question to ask:
• Help me understand what is Logistic Regression!
• Register on DataHack.io
• One login for each participant
• Password would be mailed upon registration
• Registration on Kaggle.com
Model building – life cycle
Problem for the day
Hypothesis generation
• In your groups, list down all possible variables, which might influence the
chances of survival of a passenger
• Download the dataset from Kaggle
• Next, look at the dataset and see which variables are available
Make sure you always do this in this order
Data Exploration & Feature Engineering
• Import data set
• Variable identification
• Univariate, Bivariate and Multivariate analysis
• Identify and Treat missing and outlier values
• Create new variables or transform existing variables
Dataset Detail
• PassengerId :- Unique ID to every passenger
• Survived :- Survival (0=No, 1=Yes)
• Pclass:- Passenger Class (1=1st, 2=2nd, 3=3rd )
• Name :- Name
• Sex :- Male/ Female
• Age :- Age
• SibSp :- Number of Siblings / Spouses Aboard
• Parch :- Number of Parents / Children Aboard
• Ticket :- Ticket Number
• Fare :- Passenger Fare
• Cabin :- Cabin
• Embarked :- Port of Embarkation (C=Cherbourg, Q=Queenstown, S=Southampton)
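The exploration steps listed earlier (import, variable identification, missing-value treatment, new variables) can be sketched on this column layout as follows. This is only a minimal illustration using the Python standard library: the four rows are invented and are not real passenger records, median imputation is just one possible treatment, and `FamilySize` is a hypothetical new variable name.

```python
import csv
import io
from statistics import median

# Invented sample rows in the Kaggle column layout (NOT real passenger data).
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,Passenger A,male,22,1,0,T1,7.25,,S
2,1,1,Passenger B,female,38,1,0,T2,71.28,C85,C
3,1,3,Passenger C,female,,0,0,T3,7.92,,S
4,0,2,Passenger D,male,35,0,2,T4,26.00,,Q
"""

def load_rows(text):
    """Import the data set: one dict per passenger, all values as strings."""
    return list(csv.DictReader(io.StringIO(text)))

def missing_counts(rows):
    """Count empty cells per column (an empty string means missing)."""
    return {col: sum(1 for r in rows if r[col] == "") for col in rows[0]}

def impute_age_with_median(rows):
    """Fill missing Age values with the median of the known ages."""
    known = [float(r["Age"]) for r in rows if r["Age"] != ""]
    fill = median(known)
    for r in rows:
        r["Age"] = float(r["Age"]) if r["Age"] != "" else fill
    return fill

def add_family_size(rows):
    """New variable: siblings/spouses + parents/children + the passenger."""
    for r in rows:
        r["FamilySize"] = int(r["SibSp"]) + int(r["Parch"]) + 1

rows = load_rows(SAMPLE_CSV)
missing = missing_counts(rows)      # variable-level view of missing values
age_fill = impute_age_with_median(rows)
add_family_size(rows)
print(missing["Age"], missing["Cabin"], age_fill)
print([r["FamilySize"] for r in rows])
```

On the real data, the same functions could be pointed at the downloaded training file instead of `SAMPLE_CSV`.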
If you are a newbie, refer to these guides:
• Import data set (SAS, Python, R)
• Variable identification (Methods, SAS, Python, R)
• Univariate, Bivariate and Multivariate analysis (Methods, SAS, Python, R)
• Identify and Treat missing and outlier values (Missing, Outlier, SAS, Python, R1, R2)
• Create new variables or transform existing variables (Methods, SAS, Python, R1)
Practice
Explore the Titanic data set and share your inferences with the group
Break
Modelling Techniques – Logistic Regression
• Logistic regression is a form of regression analysis in which the outcome variable is binary
(dichotomous)
• Used when the focus is on whether or not an event occurred, rather than when it occurred
• Here, instead of modelling the outcome, Y, directly, the method models the log odds of Y (the logit),
which the logistic function maps back to a probability
• Analysis of variance (ANOVA) and logistic regression are both special cases of the Generalized Linear
Model (GLM)
• The predicted probability of success falls between 0 and 1 for all possible values of X
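A small sketch of the last point: whatever the value of X, a linear model on the log-odds scale always converts back to a probability between 0 and 1. The intercept and slope below are hypothetical values chosen only for illustration, not estimates from any data.

```python
import math

def logistic(z):
    """Map a log-odds value z (any real number) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of logistic(): the log odds of a probability p."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients, chosen only for illustration.
INTERCEPT = -1.0
SLOPE = 0.05

def predict_probability(x):
    """Linear model on the log-odds scale, converted back to a probability."""
    return logistic(INTERCEPT + SLOPE * x)

# Even extreme predictor values stay strictly inside (0, 1).
probs = [predict_probability(x) for x in (-200, 0, 20, 200)]
print(probs)
```

A straight line fitted directly to a 0/1 outcome has no such guarantee, which is one reason to model the log odds instead.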
Linear & Logistic Regression
[Figure: linear regression, Length of Stay (days) vs. Age (yrs.), with fitted line Y = aX + b; logistic regression, CHD Probability (p) vs. Age]
Logit Transformation
Logit is Directly related to Odds
• The logistic model can be written as:
$$\ln\left(\frac{P}{1-P}\right) = \alpha + \beta X$$
• This implies that the odds for success can be expressed as:
$$\frac{P}{1-P} = e^{\alpha + \beta X}$$
• This relationship is the key to interpreting the coefficients in a logistic regression model!
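As a worked step on why the odds form helps with interpretation (a sketch, writing the intercept as α and the slope as β, which are assumed symbol names): raising X by one unit multiplies the odds by a constant factor.

```latex
\frac{\mathrm{odds}(X+1)}{\mathrm{odds}(X)}
  = \frac{e^{\alpha + \beta (X+1)}}{e^{\alpha + \beta X}}
  = e^{\beta}
```

So a coefficient of β = 0.7, for example, corresponds to roughly doubling the odds for each unit increase in X, since e^{0.7} ≈ 2.01.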
Modelling Techniques – Decision Tree
• Decision tree is a type of supervised learning algorithm
• It works for both categorical and continuous input and output variables
• It splits the population or sample into two or more homogeneous
sets (or sub-populations) based on the most significant splitter / differentiator among the input variables
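One way to make "most significant splitter" concrete is to score each candidate variable by how pure the resulting groups are. The sketch below uses Gini impurity, which is an assumption here (the slides do not name a criterion; other measures such as entropy or chi-square are also used). The six records are invented toy data.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_impurity(rows, feature, target):
    """Weighted Gini impurity of the groups formed by splitting on `feature`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    n = len(rows)
    return sum(len(g) / n * gini(g) for g in groups.values())

def best_splitter(rows, features, target):
    """Pick the feature whose split leaves the most homogeneous groups."""
    return min(features, key=lambda f: split_impurity(rows, f, target))

# Invented toy records (hypothetical students, not real data).
toy = [
    {"gender": "M", "grade": "IX", "plays": 1},
    {"gender": "M", "grade": "X", "plays": 1},
    {"gender": "M", "grade": "IX", "plays": 1},
    {"gender": "F", "grade": "X", "plays": 0},
    {"gender": "F", "grade": "IX", "plays": 0},
    {"gender": "F", "grade": "X", "plays": 1},
]

best = best_splitter(toy, ["gender", "grade"], "plays")
print(best, split_impurity(toy, "gender", "plays"), split_impurity(toy, "grade", "plays"))
```

A full tree would apply the same choice recursively inside each resulting group until a stopping rule is met.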
Decision Tree - Example
Types of Decision Tree
• Binary Variable Decision Tree: a decision tree with a binary target variable. Example: in the student
scenario above, the target variable was “Student will play cricket or not”, i.e. YES or NO.
• Continuous Variable Decision Tree: a decision tree with a continuous target variable.
Decision Tree - Terminology
Decision Tree – Advantages/ Disadvantages
Advantages:
• Easy to understand
• Useful in data exploration
• Less Data Cleaning required
• Data type is not a constraint
Disadvantages:
• Prone to overfitting
• Loses information when continuous variables are binned into categories
• Not Sensitive to Skewed distributions
Modelling Techniques – Random Forest
• Random forest is a computationally intensive ensemble algorithm.
• It combines bootstrapping (resampling the training data with replacement) with decision tree (CART) models.
• Random forest gives much more accurate predictions than a single CART/CHAID tree or
regression model in many scenarios.
• It captures the variance of several input variables at the same time and enables a high number of
observations to participate in the prediction.
• A different subset of the training data and a different subset of variables are selected for each tree.
• The remaining (out-of-bag) training data are used to estimate error and variable importance.
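The sampling described in the last two bullets can be sketched as follows. This only shows how the rows and variables for each tree might be drawn; it does not train any trees. Following the slide, the variable subset is drawn once per tree, whereas many implementations draw a fresh subset at every split. The row count, subset size and seed are arbitrary choices for the illustration.

```python
import random

def draw_tree_sample(n_rows, feature_names, n_features, rng):
    """Pick the rows and variables that one tree of the forest would see."""
    # Bootstrap: sample row indices with replacement, same size as the data.
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Rows never drawn are "out of bag" and can be used to estimate error.
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    # Random subset of the variables, drawn without replacement.
    features = rng.sample(feature_names, n_features)
    return in_bag, out_of_bag, features

rng = random.Random(0)  # fixed seed so the sketch is repeatable
FEATURES = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
samples = [draw_tree_sample(100, FEATURES, 3, rng) for _ in range(5)]
for in_bag, out_of_bag, features in samples:
    print(len(set(in_bag)), "distinct rows in bag,", len(out_of_bag), "out of bag,", features)
```

Because each bootstrap sample leaves out a share of the rows, every tree gets its own held-out set without a separate validation split.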
Random Forest – Advantages/ Disadvantages
Advantages:
• No need for pruning trees
• Accuracy and variable importance generated automatically
• Not very sensitive to outliers in training data
• Easy to set parameters
• Overfitting is much less of a problem than with a single decision tree
Disadvantages:
• It is a black box: the rules behind the model cannot be easily explained
Modelling Techniques – SVM
• It is a classification technique.
• Support vectors are the co-ordinates of the individual observations that lie closest to the frontier
• A Support Vector Machine is the frontier which best segregates one
class from the other
• Solving SVMs is a quadratic programming problem
• Seen by many as the most successful current text classification
method
Case Study
We have a population that is 50% male and 50% female.
Here, we want to create a set of rules
that will tell us the gender class for the rest of the
population.
The blue circles in the plot represent females and
the green squares represent males.
• Males in our population have a higher average
height.
• Females in our population have longer scalp
hair.
Case Study – How to find right SVM
Here we have three possible frontiers. Decide which
one is best.
Methods:
• Find the minimum distance of the frontier from
closest support vector (this can belong to any
class).
• Choose the frontier with the maximum distance
from the closest support vector. In this case, it is the
black frontier, at a distance of 15 units.
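The two-step method above (find each frontier's distance to its closest point, then keep the frontier where that distance is largest) can be sketched for straight-line frontiers in two dimensions. The points and the three candidate lines are invented for illustration and are not the ones plotted on the slide; each candidate is assumed to separate the two classes already.

```python
import math

def distance_to_line(point, line):
    """Perpendicular distance from (x, y) to the line a*x + b*y + c = 0."""
    a, b, c = line
    x, y = point
    return abs(a * x + b * y + c) / math.hypot(a, b)

def margin(points, line):
    """Distance from the frontier to its closest observation of any class."""
    return min(distance_to_line(p, line) for p in points)

def best_frontier(points, frontiers):
    """Choose the candidate frontier whose closest point is farthest away."""
    return max(frontiers, key=lambda name: margin(points, frontiers[name]))

# Invented toy observations (not the ones plotted on the slide).
points = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0),   # one class
          (4.0, 4.0), (5.0, 4.0), (4.0, 5.0)]   # the other class

# Three hypothetical candidate frontiers as (a, b, c) in a*x + b*y + c = 0.
frontiers = {
    "black": (1.0, 1.0, -6.0),   # x + y = 6
    "blue": (1.0, 1.0, -4.0),    # x + y = 4
    "red": (1.0, 0.0, -3.5),     # x = 3.5
}

chosen = best_frontier(points, frontiers)
for name, line in frontiers.items():
    print(name, round(margin(points, line), 3))
print("chosen:", chosen)
```

An actual SVM solver searches over all possible frontiers rather than a fixed shortlist, which is where the quadratic programming problem comes in.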
Predict survival on the Titanic
Perform prediction for survival on the Titanic
Python Resources:
• http://www.bigdataexaminer.com/dealing-with-unbalanced-classes-svm-random-forests-and-decision-trees-in-python/
• http://nbviewer.ipython.org/github/justmarkham/gadsdc1/blob/master/logistic_assignment/kevin_logistic_sklearn.ipynb
• http://scikit-learn.org/stable/modules/svm.html
• http://scikit-learn.org/stable/modules/tree.html
• http://blog.yhathq.com/posts/random-forests-in-python.html
R Resources:
• http://www.ats.ucla.edu/stat/r/dae/logit.htm
• http://www.cookbook-r.com/Statistical_analysis/Logistic_regression/
• http://www.rdatamining.com/examples/decision-tree
• http://www.statmethods.net/advstats/cart.html
• http://www.cair.org/conferences/cair2013/pres/58_Headstrom.pdf
• http://blog.yhathq.com/posts/comparing-random-forests-in-python-and-r.html
• http://www.louisaslett.com/Courses/Data_Mining/ST4003-Lab7-Introduction_to_Support_Vector_Machines.pdf
• http://thinktostart.com/build-a-spam-filter-with-r/
• http://cbio.ensmp.fr/~jvert/svn/tutorials/practical/svmbasic/svmbasic_notes.pdf
Thanks

Mais conteúdo relacionado

Mais procurados

Information Retrieval 08
Information Retrieval 08 Information Retrieval 08
Information Retrieval 08 Jeet Das
 
An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier ananth
 
Lecture 2 Basic Concepts in Machine Learning for Language Technology
Lecture 2 Basic Concepts in Machine Learning for Language TechnologyLecture 2 Basic Concepts in Machine Learning for Language Technology
Lecture 2 Basic Concepts in Machine Learning for Language TechnologyMarina Santini
 
Lecture 01: Machine Learning for Language Technology - Introduction
 Lecture 01: Machine Learning for Language Technology - Introduction Lecture 01: Machine Learning for Language Technology - Introduction
Lecture 01: Machine Learning for Language Technology - IntroductionMarina Santini
 
Decision Tree Learning
Decision Tree LearningDecision Tree Learning
Decision Tree LearningMilind Gokhale
 
Lecture 2: Preliminaries (Understanding and Preprocessing data)
Lecture 2: Preliminaries (Understanding and Preprocessing data)Lecture 2: Preliminaries (Understanding and Preprocessing data)
Lecture 2: Preliminaries (Understanding and Preprocessing data)Marina Santini
 
Machine Learning with Decision trees
Machine Learning with Decision treesMachine Learning with Decision trees
Machine Learning with Decision treesKnoldus Inc.
 
Lecture 8: Machine Learning in Practice (1)
Lecture 8: Machine Learning in Practice (1) Lecture 8: Machine Learning in Practice (1)
Lecture 8: Machine Learning in Practice (1) Marina Santini
 
Lecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Lecture 3: Basic Concepts of Machine Learning - Induction & EvaluationLecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Lecture 3: Basic Concepts of Machine Learning - Induction & EvaluationMarina Santini
 
Decision Trees
Decision TreesDecision Trees
Decision TreesCloudxLab
 
An Introduction to boosting
An Introduction to boostingAn Introduction to boosting
An Introduction to boostingbutest
 
MLEARN 210 B Autumn 2018: Lecture 1
MLEARN 210 B Autumn 2018: Lecture 1MLEARN 210 B Autumn 2018: Lecture 1
MLEARN 210 B Autumn 2018: Lecture 1heinestien
 
Lect9 Decision tree
Lect9 Decision treeLect9 Decision tree
Lect9 Decision treehktripathy
 
Basics of Machine Learning
Basics of Machine LearningBasics of Machine Learning
Basics of Machine Learningbutest
 
H2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark LandryH2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark LandrySri Ambati
 
Decision tree, softmax regression and ensemble methods in machine learning
Decision tree, softmax regression and ensemble methods in machine learningDecision tree, softmax regression and ensemble methods in machine learning
Decision tree, softmax regression and ensemble methods in machine learningAbhishek Vijayvargia
 
Machine learning (ML) and natural language processing (NLP)
Machine learning (ML) and natural language processing (NLP)Machine learning (ML) and natural language processing (NLP)
Machine learning (ML) and natural language processing (NLP)Nikola Milosevic
 

Mais procurados (20)

Information Retrieval 08
Information Retrieval 08 Information Retrieval 08
Information Retrieval 08
 
An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier
 
Lecture 2 Basic Concepts in Machine Learning for Language Technology
Lecture 2 Basic Concepts in Machine Learning for Language TechnologyLecture 2 Basic Concepts in Machine Learning for Language Technology
Lecture 2 Basic Concepts in Machine Learning for Language Technology
 
Lecture 01: Machine Learning for Language Technology - Introduction
 Lecture 01: Machine Learning for Language Technology - Introduction Lecture 01: Machine Learning for Language Technology - Introduction
Lecture 01: Machine Learning for Language Technology - Introduction
 
Decision Tree Learning
Decision Tree LearningDecision Tree Learning
Decision Tree Learning
 
Lecture 2: Preliminaries (Understanding and Preprocessing data)
Lecture 2: Preliminaries (Understanding and Preprocessing data)Lecture 2: Preliminaries (Understanding and Preprocessing data)
Lecture 2: Preliminaries (Understanding and Preprocessing data)
 
Machine Learning with Decision trees
Machine Learning with Decision treesMachine Learning with Decision trees
Machine Learning with Decision trees
 
Lecture 8: Machine Learning in Practice (1)
Lecture 8: Machine Learning in Practice (1) Lecture 8: Machine Learning in Practice (1)
Lecture 8: Machine Learning in Practice (1)
 
Lecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Lecture 3: Basic Concepts of Machine Learning - Induction & EvaluationLecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Lecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
 
Decision Trees
Decision TreesDecision Trees
Decision Trees
 
An Introduction to boosting
An Introduction to boostingAn Introduction to boosting
An Introduction to boosting
 
MLEARN 210 B Autumn 2018: Lecture 1
MLEARN 210 B Autumn 2018: Lecture 1MLEARN 210 B Autumn 2018: Lecture 1
MLEARN 210 B Autumn 2018: Lecture 1
 
Lect9 Decision tree
Lect9 Decision treeLect9 Decision tree
Lect9 Decision tree
 
Decision tree
Decision treeDecision tree
Decision tree
 
Decision tree
Decision treeDecision tree
Decision tree
 
LR2. Summary Day 2
LR2. Summary Day 2LR2. Summary Day 2
LR2. Summary Day 2
 
Basics of Machine Learning
Basics of Machine LearningBasics of Machine Learning
Basics of Machine Learning
 
H2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark LandryH2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark Landry
 
Decision tree, softmax regression and ensemble methods in machine learning
Decision tree, softmax regression and ensemble methods in machine learningDecision tree, softmax regression and ensemble methods in machine learning
Decision tree, softmax regression and ensemble methods in machine learning
 
Machine learning (ML) and natural language processing (NLP)
Machine learning (ML) and natural language processing (NLP)Machine learning (ML) and natural language processing (NLP)
Machine learning (ML) and natural language processing (NLP)
 

Semelhante a Mini datathon

Machine Learning in the Financial Industry
Machine Learning in the Financial IndustryMachine Learning in the Financial Industry
Machine Learning in the Financial IndustrySubrat Panda, PhD
 
Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017StampedeCon
 
An introduction to machine learning and statistics
An introduction to machine learning and statisticsAn introduction to machine learning and statistics
An introduction to machine learning and statisticsSpotle.ai
 
Creativity and Curiosity - The Trial and Error of Data Science
Creativity and Curiosity - The Trial and Error of Data ScienceCreativity and Curiosity - The Trial and Error of Data Science
Creativity and Curiosity - The Trial and Error of Data ScienceDamianMingle
 
Performance Issue? Machine Learning to the rescue!
Performance Issue? Machine Learning to the rescue!Performance Issue? Machine Learning to the rescue!
Performance Issue? Machine Learning to the rescue!Maarten Smeets
 
Week_1 Machine Learning introduction.pptx
Week_1 Machine Learning introduction.pptxWeek_1 Machine Learning introduction.pptx
Week_1 Machine Learning introduction.pptxmuhammadsamroz
 
credit card fraud detection
credit card fraud detectioncredit card fraud detection
credit card fraud detectionjagan477830
 
Choosing a Machine Learning technique to solve your need
Choosing a Machine Learning technique to solve your needChoosing a Machine Learning technique to solve your need
Choosing a Machine Learning technique to solve your needGibDevs
 
How Machine Learning Helps Organizations to Work More Efficiently?
How Machine Learning Helps Organizations to Work More Efficiently?How Machine Learning Helps Organizations to Work More Efficiently?
How Machine Learning Helps Organizations to Work More Efficiently?Tuan Yang
 
Statistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignoreStatistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignoreTuri, Inc.
 
Recommendation engine Using Genetic Algorithm
Recommendation engine Using Genetic AlgorithmRecommendation engine Using Genetic Algorithm
Recommendation engine Using Genetic AlgorithmVaibhav Varshney
 
ML SFCSE.pptx
ML SFCSE.pptxML SFCSE.pptx
ML SFCSE.pptxNIKHILGR3
 
Informs presentation new ppt
Informs presentation new pptInforms presentation new ppt
Informs presentation new pptSalford Systems
 

Semelhante a Mini datathon (20)

Machine Learning in the Financial Industry
Machine Learning in the Financial IndustryMachine Learning in the Financial Industry
Machine Learning in the Financial Industry
 
Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017
 
An introduction to machine learning and statistics
An introduction to machine learning and statisticsAn introduction to machine learning and statistics
An introduction to machine learning and statistics
 
Intro to ml_2021
Intro to ml_2021Intro to ml_2021
Intro to ml_2021
 
Creativity and Curiosity - The Trial and Error of Data Science
Creativity and Curiosity - The Trial and Error of Data ScienceCreativity and Curiosity - The Trial and Error of Data Science
Creativity and Curiosity - The Trial and Error of Data Science
 
Performance Issue? Machine Learning to the rescue!
Performance Issue? Machine Learning to the rescue!Performance Issue? Machine Learning to the rescue!
Performance Issue? Machine Learning to the rescue!
 
Week_1 Machine Learning introduction.pptx
Week_1 Machine Learning introduction.pptxWeek_1 Machine Learning introduction.pptx
Week_1 Machine Learning introduction.pptx
 
credit card fraud detection
credit card fraud detectioncredit card fraud detection
credit card fraud detection
 
Machine_Learning.pptx
Machine_Learning.pptxMachine_Learning.pptx
Machine_Learning.pptx
 
Choosing a Machine Learning technique to solve your need
Choosing a Machine Learning technique to solve your needChoosing a Machine Learning technique to solve your need
Choosing a Machine Learning technique to solve your need
 
Primer on major data mining algorithms
Primer on major data mining algorithmsPrimer on major data mining algorithms
Primer on major data mining algorithms
 
Turning Information chaos into reliable data
Turning Information chaos into reliable dataTurning Information chaos into reliable data
Turning Information chaos into reliable data
 
How Machine Learning Helps Organizations to Work More Efficiently?
How Machine Learning Helps Organizations to Work More Efficiently?How Machine Learning Helps Organizations to Work More Efficiently?
How Machine Learning Helps Organizations to Work More Efficiently?
 
Declarative data analysis
Declarative data analysisDeclarative data analysis
Declarative data analysis
 
Statistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignoreStatistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignore
 
Recommendation engine Using Genetic Algorithm
Recommendation engine Using Genetic AlgorithmRecommendation engine Using Genetic Algorithm
Recommendation engine Using Genetic Algorithm
 
ML SFCSE.pptx
ML SFCSE.pptxML SFCSE.pptx
ML SFCSE.pptx
 
Weka bike rental
Weka bike rentalWeka bike rental
Weka bike rental
 
Informs presentation new ppt
Informs presentation new pptInforms presentation new ppt
Informs presentation new ppt
 
Classification
ClassificationClassification
Classification
 

Último

The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxSimranPal17
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingsocarem879
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfSubhamKumar3239
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 

Último (20)

The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processing
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdf
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 

Mini datathon

  • 1.
  • 2. Thanks Great Lakes for being a lovely host… • Premier Management Institute • PGPBA • Brochures & pads provided • WiFi Connectivity
  • 4.
  • 5. First things first: • Meetup frequency – Once every month • Next meetup – 24th May 2015 • Aim to provide best networking and learning platform in Delhi NCR • Areas of Interest – Data Science, Big Data, Machine Learning, Internet of Things
  • 6. Meet Your Volunteers Kunal Data Science Evangelist, (Growth) Hacker, Blogger, Husband, Father Sunil Blogger, Problem solver, data scientist, Fitness enthu Manish Avid learner, explorer, startup guy!
  • 7. Agenda • Introduction • Model building – life cycle • Data Exploration and Feature Engineering methods • Talk about modelling Techniques like • Logistic Regression • Decision Tree • Random Forest • SVM • Predict the survival on the Titanic
  • 8. Introduction • Name • Experience in Data Science • Current Company • Are you proficient with (SAS/ R/ Python)?
  • 9. Team creation • Look for diversity in experience • Hopefully common toolset, but complementary can also work • Competing against each other
  • 11. A few ground rules for today • This is not a tutorial – you are expected to solve this problem yourself • We are here to help you, organize your thoughts and to make sure you are going in the right direction. • Good question to ask: • While trying Logistic regression in R, I am facing following error? • Bad question to ask: • Help me understand what is Logistic Regression!
  • 12.
  • 13. • Register on DataHack.io • One login for each participant • Password would be mailed upon registration • Registration on Kaggle.com
  • 14. Model building – life cycle
  • 16. Hypothesis generation • In your groups, list all the variables that might influence a passenger's chances of survival • Download the dataset from Kaggle • Next, look at the dataset and see which variables are available Make sure you always do this in this order
  • 17. Data Exploration & Feature Engineering • Import data set • Variable identification • Univariate, Bivariate and Multivariate analysis • Identify and Treat missing and outlier values • Create new variables or transform existing variables
  • 18. Dataset Detail • PassengerId :- Unique ID for every passenger • Survived :- Survival (0=No, 1=Yes) • Pclass :- Passenger Class (1=1st, 2=2nd, 3=3rd) • Name :- Name • Sex :- Male/ Female • Age :- Age • SibSp :- Number of Siblings / Spouses Aboard • Parch :- Number of Parents / Children Aboard • Ticket :- Ticket Number • Fare :- Passenger Fare • Cabin :- Cabin • Embarked :- Port of Embarkation (C=Cherbourg, Q=Queenstown, S=Southampton)
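The missing-value and new-variable steps can be sketched in plain Python. This is only an illustration: the three rows are made up (they are not rows from the Kaggle file), median imputation is one of several possible treatments, and `FamilySize` is a hypothetical derived variable. Only the column names `Age`, `SibSp` and `Parch` come from the dataset description.

```python
from statistics import median

# Made-up illustrative rows using the Titanic column names (not real data).
rows = [
    {"Age": 22.0, "SibSp": 1, "Parch": 0},
    {"Age": None, "SibSp": 0, "Parch": 0},  # Age is missing here
    {"Age": 38.0, "SibSp": 1, "Parch": 2},
]

# Treat missing values: fill a missing Age with the median of the known ages.
known_ages = [r["Age"] for r in rows if r["Age"] is not None]
age_median = median(known_ages)
for r in rows:
    if r["Age"] is None:
        r["Age"] = age_median

# Create a new variable (hypothetical): siblings/spouses + parents/children + self.
for r in rows:
    r["FamilySize"] = r["SibSp"] + r["Parch"] + 1
```

The median is used rather than the mean because it is less affected by outlying ages, which ties in with the outlier step in the list.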
  • 19. If you are a newbie, refer to these guides: • Import data set (SAS, Python, R) • Variable identification (Methods, SAS, Python, R) • Univariate, Bivariate and Multivariate analysis (Methods, SAS, Python, R) • Identify and Treat missing and outlier values (Missing, Outlier, SAS, Python, R1, R2) • Create new variables or transform existing variables (Methods, SAS, Python, R1)
  • 20. Practice Explore the Titanic data set and share your inferences with the group
  • 21. Break
  • 22. Modelling Techniques – Logistic Regression • Logistic regression is a form of regression analysis in which the outcome variable is binary (dichotomous) • Used when the focus is on whether or not an event occurred, rather than when it occurred • Instead of modelling the outcome Y directly, the method models the log odds of Y as a linear function of the predictors; the logistic function converts the log odds back to a probability • Analysis of variance (ANOVA) and logistic regression are both special cases of the Generalized Linear Model (GLM) • The probability of success falls between 0 and 1 for all possible values of X
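A minimal sketch of the last point, assuming a single predictor and two hypothetical coefficients `b0` and `b1` (they are not fitted to any data): whatever value the linear part takes, the logistic function returns a probability strictly between 0 and 1.

```python
import math

def logistic(z: float) -> float:
    """Map a log-odds value z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(x: float, b0: float, b1: float) -> float:
    """Probability of the event for predictor x, given hypothetical coefficients."""
    return logistic(b0 + b1 * x)

# Hypothetical coefficients, chosen only for illustration.
b0, b1 = -3.0, 0.05
probs = [predict_probability(x, b0, b1) for x in (0, 20, 60, 100)]
```

With a positive `b1` the probabilities rise with `x` but never reach 0 or 1, which is what makes the model suitable for a binary outcome such as survival.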
  • 23. Linear & Logistic Regression [Figure: two plots. Logistic regression: CHD probability (p) against age. Linear regression: length of stay (days) against age (yrs.), with the fitted line Y = aX + b.]
  • 25. Logit is Directly related to Odds • The logistic model can be written as: $\ln\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1 X$ • This implies that the odds for success can be expressed as: $\frac{P}{1-P} = e^{\beta_0 + \beta_1 X}$ • This relationship is the key to interpreting the coefficients in a logistic regression model!
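To see why this relationship drives the interpretation of coefficients, compare the odds at $X+1$ with the odds at $X$. The symbols $\beta_0$ and $\beta_1$ for the intercept and slope are an assumption about notation.

```latex
\frac{\mathrm{odds}(X+1)}{\mathrm{odds}(X)}
  = \frac{e^{\beta_0 + \beta_1 (X+1)}}{e^{\beta_0 + \beta_1 X}}
  = e^{\beta_1}
```

A one-unit increase in $X$ therefore multiplies the odds of success by $e^{\beta_1}$, whatever the starting value of $X$.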
  • 26. Modelling Techniques – Decision Tree • A decision tree is a type of supervised learning algorithm • It works for both categorical and continuous input and output variables • It splits the population or sample into two or more homogeneous sets (or sub-populations) based on the most significant splitter / differentiator among the input variables
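One way to make "most significant splitter" concrete is to score each candidate variable by how homogeneous the resulting groups are. The sketch below uses Gini impurity as that score, which is an assumption (the slides do not name a criterion). The rows and the names `windy`, `weekend` and `plays` are made up for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly homogeneous)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(rows, feature, target):
    """Weighted Gini impurity after grouping rows by each value of a categorical feature."""
    groups = {}
    for r in rows:
        groups.setdefault(r[feature], []).append(r[target])
    n = len(rows)
    return sum(len(g) / n * gini(g) for g in groups.values())

def best_splitter(rows, features, target):
    """Return the feature whose split yields the most homogeneous groups."""
    return min(features, key=lambda f: split_impurity(rows, f, target))

# Made-up toy rows: "weekend" separates the target perfectly, "windy" does not.
rows = [
    {"windy": "yes", "weekend": "yes", "plays": 1},
    {"windy": "no", "weekend": "yes", "plays": 1},
    {"windy": "yes", "weekend": "no", "plays": 0},
    {"windy": "no", "weekend": "no", "plays": 0},
]
best = best_splitter(rows, ["windy", "weekend"], "plays")
```

A full tree repeats this choice recursively inside each resulting group until the groups are pure or a stopping rule is met.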
  • 27. Decision Tree - Example
  • 28. Types of Decision Tree • Binary Variable Decision Tree: a decision tree with a binary target variable. Example: in the student scenario above, the target variable was “Student will play cricket or not”, i.e. YES or NO. • Continuous Variable Decision Tree: a decision tree with a continuous target variable.
  • 29. Decision Tree - Terminology
  • 30. Decision Tree – Advantages/ Disadvantages Advantages: • Easy to understand • Useful in data exploration • Less data cleaning required • Data type is not a constraint • Not sensitive to skewed distributions Disadvantages: • Prone to overfitting • Less suited to continuous variables
  • 31. Modelling Techniques – Random Forest • Random forest is a computationally intensive ensemble algorithm that builds many decision trees. • It applies bootstrapping to the decision tree (CART) model: each tree is trained on a resampled version of the data. • In many scenarios it gives much more accurate predictions than a single CART/CHAID tree or a regression model. • It captures the variance of several input variables at the same time and enables a high number of observations to participate in the prediction. • A different subset of the training data and a different subset of variables are selected for each tree. • The remaining (out-of-bag) training data are used to estimate error and variable importance.
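The sampling step can be sketched with the standard library alone. The function name and parameters below are hypothetical, and the sketch follows the slide in drawing one variable subset per tree; Breiman's original formulation draws a fresh variable subset at each split instead. No trees are actually fitted here.

```python
import random

def bootstrap_sample(n_rows, n_features, n_sub_features, rng):
    """Draw one tree's training rows (with replacement) and its variable subset."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Rows never drawn are "out of bag" and can be used to estimate error.
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    features = sorted(rng.sample(range(n_features), n_sub_features))
    return in_bag, out_of_bag, features

rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [
    bootstrap_sample(n_rows=10, n_features=6, n_sub_features=2, rng=rng)
    for _ in range(3)
]
```

Because each tree sees a different resample and a different set of variables, the trees make partly independent errors, and averaging their votes reduces variance.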
  • 32. Random Forest – Advantages/ Disadvantages Advantages: • No need for pruning trees • Accuracy and variable importance generated automatically • Overfitting is much less of a problem than with a single tree • Not very sensitive to outliers in training data • Easy to set parameters Disadvantages: • It is a black box; the rules behind the model cannot easily be explained
  • 33. Modelling Techniques – SVM • It is a classification technique. • Support vectors are the co-ordinates of the individual observations that lie closest to the frontier • A Support Vector Machine is the frontier that best segregates one class from the other • Solving SVMs is a quadratic programming problem • Seen by many as the most successful current text classification method
  • 34. Case Study We have a population that is 50% male and 50% female. We want to create a set of rules that will guide us to the gender class for the rest of the population. The blue circles in the plot represent females and the green squares represent males. • Males in our population have a higher average height. • Females in our population have longer scalp hair.
  • 35. Case Study – How to find the right SVM Here we have three possible frontiers. Decide which one is best. Method: • For each frontier, find its minimum distance from the closest support vector (which can belong to either class). • Choose the frontier with the maximum such distance. In this case, it is the black frontier, at a distance of 15 units.
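The selection rule above can be sketched for straight-line frontiers in two dimensions. The observations and the three candidate lines are made up for illustration (they are not the values from the slide's plot), and the sketch assumes every candidate already separates the two classes. A real SVM solver searches over all possible frontiers rather than a fixed shortlist.

```python
import math

def distance_to_line(point, line):
    """Perpendicular distance from point (x, y) to the line a*x + b*y + c = 0."""
    a, b, c = line
    x, y = point
    return abs(a * x + b * y + c) / math.hypot(a, b)

def margin(points, line):
    """Minimum distance from the frontier to any observation."""
    return min(distance_to_line(p, line) for p in points)

def best_frontier(points, candidates):
    """Pick the candidate frontier whose closest observation is farthest away."""
    return max(candidates, key=lambda name: margin(points, candidates[name]))

# Made-up observations: two near the origin (one class), two farther out (other class).
points = [(1.0, 1.0), (2.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
# Hypothetical candidate frontiers, each given as (a, b, c) for a*x + b*y + c = 0.
candidates = {
    "A": (1.0, 1.0, -4.0),   # x + y = 4, hugs the first class
    "B": (1.0, 1.0, -6.5),   # x + y = 6.5, midway between the classes
    "C": (1.0, 1.0, -9.0),   # x + y = 9, hugs the second class
}
chosen = best_frontier(points, candidates)
```

Frontier "B" is chosen because it sits midway between the two groups, so its closest observation is farther away than for either of the other candidates.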
  • 36. Predict survival on the Titanic Perform prediction for survival on the Titanic
  • 37. Python Resources: • http://www.bigdataexaminer.com/dealing-with-unbalanced-classes-svm-random-forests-and-decision-trees-in-python/ • http://nbviewer.ipython.org/github/justmarkham/gadsdc1/blob/master/logistic_assignment/kevin_logistic_sklearn.ipynb • http://scikit-learn.org/stable/modules/svm.html • http://scikit-learn.org/stable/modules/tree.html • http://blog.yhathq.com/posts/random-forests-in-python.html
  • 38. R Resources: • http://www.ats.ucla.edu/stat/r/dae/logit.htm • http://www.cookbook-r.com/Statistical_analysis/Logistic_regression/ • http://www.rdatamining.com/examples/decision-tree • http://www.statmethods.net/advstats/cart.html • http://www.cair.org/conferences/cair2013/pres/58_Headstrom.pdf • http://blog.yhathq.com/posts/comparing-random-forests-in-python-and-r.html • http://www.louisaslett.com/Courses/Data_Mining/ST4003-Lab7-Introduction_to_Support_Vector_Machines.pdf • http://thinktostart.com/build-a-spam-filter-with-r/ • http://cbio.ensmp.fr/~jvert/svn/tutorials/practical/svmbasic/svmbasic_notes.pdf