SlideShare uma empresa Scribd logo
1 de 64
Baixar para ler offline
Introduction to
Machine
Learning
This course content is being actively
developed by Delta Analytics, a 501(c)3
Bay Area nonprofit that aims to
empower communities to leverage their
data for good.
Please reach out with any questions or
feedback to inquiry@deltanalytics.org.
Find out more about our mission here.
Delta Analytics builds technical capacity
around the world.
Course overview:
Now let’s turn to the data we will be using...
✓ Module 1: Introduction to Machine Learning
❏ Module 2: Machine Learning Deep Dive
❏ Module 3: Model Selection and Evaluation
❏ Module 4: Linear Regression
❏ Module 5: Decision Trees
❏ Module 6: Ensemble Algorithms
❏ Module 7: Unsupervised Learning Algorithms
❏ Module 8: Natural Language Processing Part 1
❏ Module 9: Natural Language Processing Part 2
Module 1:
Introduction to
Machine Learning
✓ What is machine learning?
❏ How do you define a research question?
❏ What are observations?
❏ What are features?
❏ What are outcome variables?
❏ Introduction to KIVA data
Module Checklist
What is machine learning?
What is machine learning?
Artificial Intelligence
(AI)
Machine
Learning
Using data science methods and
sometimes big data
We call something machine learning
when instead of telling a computer to do
something, we allow a computer to
come up with its own solution based
upon the data it is given.
Machine learning is a subset of AI that allows
machines to learn from raw data.
Humans learn from experience. Traditional software programing involves giving machines
instructions which they perform. Machine learning involves allowing machines to learn from raw
data so that the computer program can change when exposed to new data (learning from
experience).
Source: https://www.youtube.com/watch?v=IpGxLWOIZy4
Machine learning
+
Machine learning is…
● Computer science + statistics + mathematics
● The use of data to answer questions
Critical thinking combined with technical toolkit
Machine learning is interdisciplinary
Data
Domain
Knowledge
Insight Action
There is a growing need for machine
learning
Sources:
[1] “What is Big Data,” IBM,
● There are huge amounts of data generated
every day.
● Previously impossible problems are now
solvable.
● Companies are increasingly demanding
quantitative solutions.
“Every day, we create 2.5 quintillion bytes of data
— so much that 90% of the data in the world today
has been created in the last two years alone.” [1]
Dr. Delta works to
diagnose patients with
malaria.
However, it takes a
long time for her to see
everyone.
Machine learning example: Predicting malaria
Luckily, Dr. Delta has
historical patient data
about what factors predict
malaria, such as body
temperature, travel history,
age, medical history.
Dr. Delta can use historical
data as an input in a
machine learning
algorithm to help her
predict whether a new
patient will have malaria.
The algorithm (the machine) learns from past data, like a human would, and
is thus able to make predictions about the future.
Machine learning is a powerful tool; it can…
Determine your credit
rating based upon cell
phone usage.
Determine the topic of a piece of
text.
Recognize your
face in a photo.
Recommend
movies you
will like.
Machine learning helps
us answer questions.
How do we define the
question?
Before we even get to the models/algorithms, we have
to learn about our data and define our research
question.
Research
Question
Exploratory
Analysis
Modeling
Phase
Performance
Data
Validation
+ Cleaning
Machine learning takes
place during the
modeling phase.
~80% of your time as
a data scientist is
spent here, preparing
your data for analysis
Examples of research questions:
● Does this patient have malaria?
● Can we monitor illegal deforestation by detecting
chainsaw noises in audio streamed from
rainforests?
A research question is the question we
want our model to answer.
Research
Question
We may have a question in mind before we look at the data, but
we will often use our exploration of the data to develop or refine
our research question.
Research
Question
Exploratory
Analysis
Data
Validation+
Cleaning
What comes first, the
chicken or the egg?
Data Validation and
Cleaning
Data
Cleaning
Source: Survey of 80 data scientists. Forbes article, March 23, 2016.
“Data preparation accounts for about 80%
of the work of data scientists.”
Data
Cleaning
Why do we need to validate and clean our
data?
Data often comes from multiple sources
- Do data align across different
sources?
Data is created by humans
- Does the data need to be transformed?
- Is it free from human bias and errors?
Go further with these readings: here and here
Data cleaning involves identifying any issues
with our data and confirming our
qualitative understanding of the data.
Data
Cleaning
Missing Data
Is there missing data? Is it
missing systematically?
Times Series Validation
Is the data for the correct
time range?
Are there unusual spikes in
the volume of loans over
time?
Data Type
Are all variables the right type?
Is a date treated like a date?
Data Range
Are all values in the
expected range? Are all
loan_amounts greater than
0?
Let’s step through some examples:
Data
Cleaning
Missing data
Time series
Data types
Data
Cleaning
Transforming
variables
After gaining an initial
understanding of your data, you
may need to transform it to be
used in analysis
Data
Cleaning
Is there missing data? Is data
missing at random or systematically?
Very few datasets
have no missing data;
most of the time you
will have to deal with
missing data.
The first question you
have to ask is what
type of missing data
you have.
Missing completely at
random: no pattern in the
missing data. This is the best type
of missing you can hope for.
Missing at random: there is a
pattern in your missing data but
not in your variables of interest.
Missing not at random: there
is a pattern in the missing data
that systematically affects your
primary variables.
Missing
data
Example: You have survey data from a random sample from high school students in
the U.S. Some students didn’t participate:
Data
Cleaning
Missing
data
If data is missing at random,
we can use the rest of the
nonmissing data without
worrying about bias!
If data is missing in a
non-random or systematic
way, your nonmissing data may
be biased
Some students
were sick the
day of the day
of the survey
Some students
declined to
participate,
since the
survey asks
about grades
Is there missing data? Is data
missing at random or systematically?
Example: You have survey data from a random sample from high school students in
the U.S. Some students didn’t participate:
Data
Cleaning
Missing
data
If data is missing at random,
we can use the rest of the
nonmissing data without
worrying about bias!
If data is missing in a
non-random or systematic
way, your nonmissing data may
be biased
Some students
were sick the
day of the day
of the survey
Some students
declined to
participate,
since the
survey asks
about grades
Is there missing data? Is data
missing at random or systematically?
Data
Cleaning
Sometimes, you can replace
missing data.
● Drop missing observations
● Populate missing values with average
of available data
● Impute data
Missing
data
What you should do depends heavily on what makes
sense for your research question, and your data.
Data
Cleaning
Common imputation techniques
Take the average of observations you do have to populate missing
observations - i.e., assume that this observation is also represented by
the population average
Missing
data
Use the average of
nonmissing values
Use an educated
guess
Use common point
imputation
It sounds arbitrary and often isn’t preferred, but you can infer a missing
value. For related questions, for example, like those often presented in a
matrix, if the participant responds with all “4s”, assume that the missing
value is a 4.
For a rating scale, using the middle point or most commonly chosen value.
For example, on a five-point scale, substitute a 3, the midpoint, or a 4, the
most common value (in many cases). This is a bit more structured than
guessing, but it’s still among the more risky options. Use caution unless
you have good reason and data to support using the substitute value.
Source: Handling missing data
Data
Cleaning
If we have observations over
time, we need to do time series
validation.
Ask yourself:
a. Is the data for the correct time range?
b. Are there unusual spikes in the data over
time?
What should we do if there are
unusual spikes in the data over time?
Time series
Data
Cleaning
How do we address unexpected
spikes in our data?
Data
anomaly
For certain datasets, (like sales data) systematic
seasonal spikes are expected. For example,
around Christmas we would see a spike in sales
venue. This is normal, and should not
necessarily be removed.
Time series
Systematic
spike
Random
spike
If the spike is isolated it is probably unexpected,
we may want to remove the corrupted data.
For example, if for one month sales are
recorded in Kenyan Shillings rather than US
dollars, it would inflate sales figures. We should
do some data cleaning by converting to $ or
perhaps remove this month.
Note, sometimes there are natural anomalies in data that should be investigated first
Data
Cleaning
Are all variables the right
type?
● Integer: A number with no decimal places
● Float: A number with decimal places
● String: Text field, or more formally, a sequence of unicode characters
● Boolean: Can only be True or False (also called indicator or dummy variable)
● Datetime: Values meant to hold time data.
Fun blog on this topic: Data Carpentry
Many functions in Python are type specific, which means we need to make sure all of
our fields are being treated as the correct type:
integer float string date
Data type
As you explore the data, some questions arise…
Data cleaning quiz!
Question Answer
There is an observation from the KIVA loan
dataset that says a loan was fully funded in
year 1804, but Kiva wasn’t even founded
then. What do I do?
Question #1
Question Answer
There is an observation that says this loan
was fully funded in year 1804, but Kiva
wasn’t even founded then. What do I do?
Consult the data documentation. If no
explanation exists, remove this observation.
Data
Cleaning
Time series
This question illustrates you
should always do validation of
the time range. Check what
the minimum and maximum
observations in your data set
are.
Question #1
Question Answer
There is an observation that states a
person’s birthday is 12/1/80 but the “age”
variable is missing. What do I do?
Question #2
Data
Cleaning
This question illustrates how
we may be able to leverage
other fields to make an
educated guess about the
missing age.
Missing
Data
Question Answer
There is an observation that states a
person’s birthday is 12/1/80 but the “age”
variable is missing. What do I do?
We have the input (year, month and day)
needed to calculate age. We can define a
function that will transform this input into
the age of each loan recipient.
Question #2
Question Answer
The variable “amount_funded” has values of
both “N/A” and “0”. What do I do?
Question #3
As you explore the data, some questions arise…
Question Answer
The variable “amount_funded” has values of
both “N/A” and “0”. What do I do?
Check documentation if there is a material
difference between NA and 0.
Question #3
Question Answer
I’m not sure what currency the variable
“amount_funded” is reported in. What do I
do?
Question #4
Question Answer
I’m not sure what currency the variable
“amount_funded” is reported in. What do I
do?
Check documentation and other variables,
convert to appropriate currency
Question #4
A final note…
Data
Cleaning
Note that our examples were all very specific - you may or may not
encounter these exact examples in the wild. This is because data
cleaning is very often idiosyncratic and cannot be adequately
completed by following a predetermined set of steps - you must use
common sense!
Next we turn to exploratory analysis, for which we often have to
transform our data.
Data
transformation
Exploratory Analysis
The goal of exploratory analysis is to
better understand your data.
Research
Question
Exploratory
Analysis
Data
Validation
+ Cleaning
Exploratory
Analysis
Exploratory analysis can reveal
data limitations, what features
are important, and inform what
methods you use in answering
your research question.
This is an indispensable first step
in any data analysis!
Let’s explore our data!
Exploratory
Analysis
Scatter plots
Correlation
Box plots
Summary
statistics
Once we have done some initial
validation, we explore the data to see
what models are suitable and what
patterns we can identify.
The process varies depending on the
data, your style, and time constraints,
but typically exploration includes:
● Histogram
● Scatter plots
● Correlation tables
● Box plots
● Summary statistics
○ Mean, median, frequency
Histogram
Exploratory
Analysis
Histogram
Histograms tell us about the
distribution of the feature.
A histogram shows the
frequency distribution of a
continuous feature.
Here, we have height data of a
group of people. We see that
most of the people in the
group are between 149 and
159 cm tall.
Exploratory
Analysis
Scatter Plots
Scatter plots provide insight
about the relationship between
two features.
Scatter plots visualize
relationships between any two
features as points on a graph.
They are a useful first step to
exploring a research question.
Here, we can already see a positive
relationship between amount
funded and amount requested.
What can we conclude?
Scatter plots are an indispensable
first step to exploring a research
question.
Here, we can already see a positive
relationship between amount
funded and amount requested for a
KIVA loan. What can we conclude?
It looks like there is a strong
relationship between what loan
amount is requested and what is
funded.
Exploratory
Analysis
Scatter Plots
Scatter plot provide important
data about the relationship
between two features.
Correlation is a useful measure of the strength of a
relationship between two variables. It ranges from -1.00 to
1.00
Go further with this fun game.
1.00 0.88 0.60 0.00 -0.55 -0.78 -1.00
Exploratory
Analysis
Correlation
Correlation does not equal
causation
Exploratory
Analysis
Let’s say you are an executive at a company. You’ve
gathered the following data:
X = $ spent on advertising
Y= Sales
Based on the graph and positive correlation, you’d
be tempted to say $ spent on advertising caused an
increase in sales. But hang on - it’s also possible
that an increase in sales (and thus, profit) would
lead to an increase in $ spent on advertising!
Correlation between x and y does not mean x causes
y; it could mean that y causes x!
Correlation
Kiva example: Correlation
does not equal causation
Correlation: 0.96
If you wanted to request a loan
through Kiva, and were
presented with this graph only,
you might conclude that it is a
good idea to request $1 million
dollars.
Exploratory
Analysis
Correlation
Kiva example: Correlation
does not equal causation
Conclusions can be invalid even when data is valid!
But common sense tells us that
this conclusion doesn’t make a lot
of sense. Just because you
request a lot doesn’t mean you
will be funded a lot!
Exploratory
Analysis
Correlation
Mean, median, frequency are
useful summary statistics that
let you know what is in your data.
Exploratory
Analysis
Summary
statistics
Image source: University of Florida, Quantitative introduction to the boxplot
Exploratory
Analysis
Boxplots
Boxplots are a useful visual depiction
of certain summary statistics.
Forming a
research question
Recall: We may have a question in mind before we look at the
data, but our exploration of the data often develops or refines
our research question.
Research
Question
Exploratory
Analysis
Data
Validation
+ Cleaning
What comes first, the
chicken or the egg?
● We ask a question we expect data to answer. What comes
first, the data or the question?
How do you define the research question?
START:
Research question
Do I have data that may
answer my question?
Gather or find more data Is the data labelled?
Supervised learning
methods
Unsupervised learning
methods
YN
YN
● We ask a question we expect data to answer. What comes
first, the data or the question?
How do you define the research question?
START:
Research question
Do I have data that may
answer my question?
Gather or find more data Is the data labelled?
Supervised learning
methods
Unsupervised learning
methods
Requires data processing and validation
YN
YN
Does it contain the outcome
feature Y?
We’ll cover both supervised and unsupervised
methods in this course!
Research
Question
Given the KIVA data below, we may
find a few questions interesting.
Loan amount requested by a
Kiva borrower in Kenya Town Kiva
borrower resides in.
One possible
research question
we might be
interested in
exploring is: Does
the loan amount
requested vary by
town?
Research
Question
How does loan amount requested
vary by town?
This is a reasonable research
question, because we would
expect the amount to vary
because the cost of materials and
services varies from region to
region.
For example, we would expect the
cost of living in a rural area to be
cheaper than an urban city.
Looking ahead:
Modeling
Now we have our research question, we are able to
start modeling!
Research
Question
Exploratory
Analysis
Modeling
Phase
Performance
Data
Cleaning
We are here!
Learning
Methodology
Task
Task
What is the problem we want our
model to solve?
Performance
Measure
Quantitative measure we use to
evaluate the model’s performance.
Learning
Methodology
ML algorithms can be supervised or
unsupervised. This determines the
learning methodology.
All models have 3 key components: a task, a
performance measure and a learning
methodology.
Source: Deep Learning Book - Chapter 5: Introduction to Machine Learning
Modeling
Phase
We will go over the machine
learning task and learning
methodology in the next lesson.
✓ What is machine learning?
✓ How do you define a research question?
✓ What are observations?
✓ What are features?
✓ What are outcome variables?
✓ Introduction to KIVA data
You covered this today:
You are on fire! Go straight to the
next module here.
Need to slow down and digest? Take a
minute to write us an email about
what you thought about the course. All
feedback small or large welcome!
Email: sara@deltanalytics.org
Find out more about
Delta’s machine
learning for good
mission here.

Mais conteúdo relacionado

Mais procurados

Linear Regression Analysis | Linear Regression in Python | Machine Learning A...
Linear Regression Analysis | Linear Regression in Python | Machine Learning A...Linear Regression Analysis | Linear Regression in Python | Machine Learning A...
Linear Regression Analysis | Linear Regression in Python | Machine Learning A...
Simplilearn
 
Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...
Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...
Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...
Simplilearn
 

Mais procurados (20)

Classification and Regression
Classification and RegressionClassification and Regression
Classification and Regression
 
Feature scaling
Feature scalingFeature scaling
Feature scaling
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
 
Ensemble learning
Ensemble learningEnsemble learning
Ensemble learning
 
Machine learning introduction
Machine learning introductionMachine learning introduction
Machine learning introduction
 
Overfitting & Underfitting
Overfitting & UnderfittingOverfitting & Underfitting
Overfitting & Underfitting
 
Linear Regression Analysis | Linear Regression in Python | Machine Learning A...
Linear Regression Analysis | Linear Regression in Python | Machine Learning A...Linear Regression Analysis | Linear Regression in Python | Machine Learning A...
Linear Regression Analysis | Linear Regression in Python | Machine Learning A...
 
Bias and variance trade off
Bias and variance trade offBias and variance trade off
Bias and variance trade off
 
eScience SHAP talk
eScience SHAP talkeScience SHAP talk
eScience SHAP talk
 
Transfer Learning
Transfer LearningTransfer Learning
Transfer Learning
 
Transfer learning-presentation
Transfer learning-presentationTransfer learning-presentation
Transfer learning-presentation
 
Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...
Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...
Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...
 
Machine Learning Unit 1 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 1 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 1 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 1 Semester 3 MSc IT Part 2 Mumbai University
 
Linear discriminant analysis
Linear discriminant analysisLinear discriminant analysis
Linear discriminant analysis
 
Machine learning
Machine learningMachine learning
Machine learning
 
Gradient descent method
Gradient descent methodGradient descent method
Gradient descent method
 
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
 
Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning
 
Classification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsClassification Based Machine Learning Algorithms
Classification Based Machine Learning Algorithms
 
Deep neural networks
Deep neural networksDeep neural networks
Deep neural networks
 

Semelhante a Module 1 introduction to machine learning

classIX_DS_Teacher_Presentation.pptx
classIX_DS_Teacher_Presentation.pptxclassIX_DS_Teacher_Presentation.pptx
classIX_DS_Teacher_Presentation.pptx
XICSStudents
 
Stat11t Chapter1
Stat11t Chapter1Stat11t Chapter1
Stat11t Chapter1
gueste87a4f
 
Guide for a Data Scientist
Guide for a Data ScientistGuide for a Data Scientist
Guide for a Data Scientist
Rohit Dubey
 

Semelhante a Module 1 introduction to machine learning (20)

Module 1.2 data preparation
Module 1.2  data preparationModule 1.2  data preparation
Module 1.2 data preparation
 
Data Science Workshop - day 2
Data Science Workshop - day 2Data Science Workshop - day 2
Data Science Workshop - day 2
 
Data Science & AI Road Map by Python & Computer science tutor in Malaysia
Data Science  & AI Road Map by Python & Computer science tutor in MalaysiaData Science  & AI Road Map by Python & Computer science tutor in Malaysia
Data Science & AI Road Map by Python & Computer science tutor in Malaysia
 
classIX_DS_Teacher_Presentation.pptx
classIX_DS_Teacher_Presentation.pptxclassIX_DS_Teacher_Presentation.pptx
classIX_DS_Teacher_Presentation.pptx
 
Lesson 1 - Overview of Machine Learning and Data Analysis.pptx
Lesson 1 - Overview of Machine Learning and Data Analysis.pptxLesson 1 - Overview of Machine Learning and Data Analysis.pptx
Lesson 1 - Overview of Machine Learning and Data Analysis.pptx
 
Intro scikitlearnstatsmodels
Intro scikitlearnstatsmodelsIntro scikitlearnstatsmodels
Intro scikitlearnstatsmodels
 
Data Science Folk Knowledge
Data Science Folk KnowledgeData Science Folk Knowledge
Data Science Folk Knowledge
 
Learning Analytics Primer: Getting Started with Learning and Performance Anal...
Learning Analytics Primer: Getting Started with Learning and Performance Anal...Learning Analytics Primer: Getting Started with Learning and Performance Anal...
Learning Analytics Primer: Getting Started with Learning and Performance Anal...
 
Metopen 6
Metopen 6Metopen 6
Metopen 6
 
EDM405 4.pptx
EDM405 4.pptxEDM405 4.pptx
EDM405 4.pptx
 
Data Science in Python.pptx
Data Science in Python.pptxData Science in Python.pptx
Data Science in Python.pptx
 
Stat11t chapter1
Stat11t chapter1Stat11t chapter1
Stat11t chapter1
 
Stat11t Chapter1
Stat11t Chapter1Stat11t Chapter1
Stat11t Chapter1
 
Research Data Management
Research  Data ManagementResearch  Data Management
Research Data Management
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Module 1.3 data exploratory
Module 1.3  data exploratoryModule 1.3  data exploratory
Module 1.3 data exploratory
 
Unit 3 Qualitative Data
Unit 3 Qualitative DataUnit 3 Qualitative Data
Unit 3 Qualitative Data
 
Analyzing and Interpreting Data statippt
Analyzing and Interpreting Data statipptAnalyzing and Interpreting Data statippt
Analyzing and Interpreting Data statippt
 
Guide for a Data Scientist
Guide for a Data ScientistGuide for a Data Scientist
Guide for a Data Scientist
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 

Mais de Sara Hooker

Mais de Sara Hooker (8)

Module 9: Natural Language Processing Part 2
Module 9:  Natural Language Processing Part 2Module 9:  Natural Language Processing Part 2
Module 9: Natural Language Processing Part 2
 
Module 8: Natural language processing Pt 1
Module 8:  Natural language processing Pt 1Module 8:  Natural language processing Pt 1
Module 8: Natural language processing Pt 1
 
Module 7: Unsupervised Learning
Module 7:  Unsupervised LearningModule 7:  Unsupervised Learning
Module 7: Unsupervised Learning
 
Module 6: Ensemble Algorithms
Module 6:  Ensemble AlgorithmsModule 6:  Ensemble Algorithms
Module 6: Ensemble Algorithms
 
Module 5: Decision Trees
Module 5: Decision TreesModule 5: Decision Trees
Module 5: Decision Trees
 
Module 3: Linear Regression
Module 3:  Linear RegressionModule 3:  Linear Regression
Module 3: Linear Regression
 
Storytelling with Data (Global Engagement Summit at Northwestern University 2...
Storytelling with Data (Global Engagement Summit at Northwestern University 2...Storytelling with Data (Global Engagement Summit at Northwestern University 2...
Storytelling with Data (Global Engagement Summit at Northwestern University 2...
 
Delta Analytics Open Data Science Conference Presentation 2016
Delta Analytics Open Data Science Conference Presentation 2016Delta Analytics Open Data Science Conference Presentation 2016
Delta Analytics Open Data Science Conference Presentation 2016
 

Último

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Último (20)

Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 

Module 1 introduction to machine learning

  • 2. This course content is being actively developed by Delta Analytics, a 501(c)3 Bay Area nonprofit that aims to empower communities to leverage their data for good. Please reach out with any questions or feedback to inquiry@deltanalytics.org. Find out more about our mission here. Delta Analytics builds technical capacity around the world.
  • 3. Course overview: Now let’s turn to the data we will be using... ✓ Module 1: Introduction to Machine Learning ❏ Module 2: Machine Learning Deep Dive ❏ Module 3: Model Selection and Evaluation ❏ Module 4: Linear Regression ❏ Module 5: Decision Trees ❏ Module 6: Ensemble Algorithms ❏ Module 7: Unsupervised Learning Algorithms ❏ Module 8: Natural Language Processing Part 1 ❏ Module 9: Natural Language Processing Part 2
  • 5. ✓ What is machine learning? ❏ How do you define a research question? ❏ What are observations? ❏ What are features? ❏ What are outcome variables? ❏ Introduction to KIVA data Module Checklist
  • 6. What is machine learning?
  • 7. What is machine learning? Artificial Intelligence (AI) Machine Learning Using data science methods and sometimes big data We call something machine learning when instead of telling a computer to do something, we allow a computer to come up with its own solution based upon the data it is given.
  • 8. Machine learning is a subset of AI that allows machines to learn from raw data. Humans learn from experience. Traditional software programing involves giving machines instructions which they perform. Machine learning involves allowing machines to learn from raw data so that the computer program can change when exposed to new data (learning from experience). Source: https://www.youtube.com/watch?v=IpGxLWOIZy4 Machine learning +
  • 9. Machine learning is… ● Computer science + statistics + mathematics ● The use of data to answer questions Critical thinking combined with technical toolkit Machine learning is interdisciplinary Data Domain Knowledge Insight Action
  • 10. There is a growing need for machine learning Sources: [1] “What is Big Data,” IBM, ● There are huge amounts of data generated every day. ● Previously impossible problems are now solvable. ● Companies are increasingly demanding quantitative solutions. “Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone.” [1]
  • 11. Dr. Delta works to diagnose patients with malaria. However, it takes a long time for her to see everyone. Machine learning example: Predicting malaria Luckily, Dr. Delta has historical patient data about what factors predict malaria, such as body temperature, travel history, age, medical history. Dr. Delta can use historical data as an input in a machine learning algorithm to help her predict whether a new patient will have malaria. The algorithm (the machine) learns from past data, like a human would, and is thus able to make predictions about the future.
  • 12. Machine learning is a powerful tool; it can… Determine your credit rating based upon cell phone usage. Determine the topic of a piece of text. Recognize your face in a photo. Recommend movies you will like.
  • 13. Machine learning helps us answer questions. How do we define the question?
  • 14. Before we even get to the models/algorithms, we have to learn about our data and define our research question. Research Question Exploratory Analysis Modeling Phase Performance Data Validation + Cleaning Machine learning takes place during the modeling phase. ~80% of your time as a data scientist is spent here, preparing your data for analysis
  • 15. Examples of research questions: ● Does this patient have malaria? ● Can we monitor illegal deforestation by detecting chainsaw noises in audio streamed from rainforests? A research question is the question we want our model to answer. Research Question
  • 16. We may have a question in mind before we look at the data, but we will often use our exploration of the data to develop or refine our research question. Research Question Exploratory Analysis Data Validation+ Cleaning What comes first, the chicken or the egg?
  • 18. Data Cleaning Source: Survey of 80 data scientists. Forbes article, March 23, 2016. “Data preparation accounts for about 80% of the work of data scientists.”
  • 19. Data Cleaning Why do we need to validate and clean our data? Data often comes from multiple sources - Do data align across different sources? Data is created by humans - Does the data need to be transformed? - Is it free from human bias and errors? Go further with these readings: here and here
  • 20. Data cleaning involves identifying any issues with our data and confirming our qualitative understanding of the data. Data Cleaning Missing Data Is there missing data? Is it missing systematically? Times Series Validation Is the data for the correct time range? Are there unusual spikes in the volume of loans over time? Data Type Are all variables the right type? Is a date treated like a date? Data Range Are all values in the expected range? Are all loan_amounts greater than 0?
  • 21. Let’s step through some examples: Data Cleaning Missing data Time series Data types Data Cleaning Transforming variables After gaining an initial understanding of your data, you may need to transform it to be used in analysis
  • 22. Data Cleaning Is there missing data? Is data missing at random or systematically? Very few datasets have no missing data; most of the time you will have to deal with missing data. The first question you have to ask is what type of missing data you have. Missing completely at random: no pattern in the missing data. This is the best type of missing you can hope for. Missing at random: there is a pattern in your missing data but not in your variables of interest. Missing not at random: there is a pattern in the missing data that systematically affects your primary variables. Missing data
  • 23. Example: You have survey data from a random sample from high school students in the U.S. Some students didn’t participate: Data Cleaning Missing data If data is missing at random, we can use the rest of the nonmissing data without worrying about bias! If data is missing in a non-random or systematic way, your nonmissing data may be biased Some students were sick the day of the day of the survey Some students declined to participate, since the survey asks about grades Is there missing data? Is data missing at random or systematically?
  • 24. Example: You have survey data from a random sample from high school students in the U.S. Some students didn’t participate: Data Cleaning Missing data If data is missing at random, we can use the rest of the nonmissing data without worrying about bias! If data is missing in a non-random or systematic way, your nonmissing data may be biased Some students were sick the day of the day of the survey Some students declined to participate, since the survey asks about grades Is there missing data? Is data missing at random or systematically?
  • 25. Data Cleaning Sometimes, you can replace missing data. ● Drop missing observations ● Populate missing values with average of available data ● Impute data Missing data What you should do depends heavily on what makes sense for your research question, and your data.
  • 26. Data Cleaning Common imputation techniques Take the average of observations you do have to populate missing observations - i.e., assume that this observation is also represented by the population average Missing data Use the average of nonmissing values Use an educated guess Use common point imputation It sounds arbitrary and often isn’t preferred, but you can infer a missing value. For related questions, for example, like those often presented in a matrix, if the participant responds with all “4s”, assume that the missing value is a 4. For a rating scale, using the middle point or most commonly chosen value. For example, on a five-point scale, substitute a 3, the midpoint, or a 4, the most common value (in many cases). This is a bit more structured than guessing, but it’s still among the more risky options. Use caution unless you have good reason and data to support using the substitute value. Source: Handling missing data
  • 27. Data Cleaning If we have observations over time, we need to do time series validation. Ask yourself: a. Is the data for the correct time range? b. Are there unusual spikes in the data over time? What should we do if there are unusual spikes in the data over time? Time series
  • 28. Data Cleaning How do we address unexpected spikes in our data? Data anomaly For certain datasets, (like sales data) systematic seasonal spikes are expected. For example, around Christmas we would see a spike in sales venue. This is normal, and should not necessarily be removed. Time series Systematic spike Random spike If the spike is isolated it is probably unexpected, we may want to remove the corrupted data. For example, if for one month sales are recorded in Kenyan Shillings rather than US dollars, it would inflate sales figures. We should do some data cleaning by converting to $ or perhaps remove this month. Note, sometimes there are natural anomalies in data that should be investigated first
  • 29. Data Cleaning Are all variables the right type? ● Integer: A number with no decimal places ● Float: A number with decimal places ● String: Text field, or more formally, a sequence of unicode characters ● Boolean: Can only be True or False (also called indicator or dummy variable) ● Datetime: Values meant to hold time data. Fun blog on this topic: Data Carpentry Many functions in Python are type specific, which means we need to make sure all of our fields are being treated as the correct type: integer float string date Data type
  • 30. As you explore the data, some questions arise… Data cleaning quiz!
  • 31. Question Answer There is an observation from the KIVA loan dataset that says a loan was fully funded in year 1804, but Kiva wasn’t even founded then. What do I do? Question #1
  • 32. Question Answer There is an observation that says this loan was fully funded in year 1804, but Kiva wasn’t even founded then. What do I do? Consult the data documentation. If no explanation exists, remove this observation. Data Cleaning Time series This question illustrates you should always do validation of the time range. Check what the minimum and maximum observations in your data set are. Question #1
  • 33. Question Answer There is an observation that states a person’s birthday is 12/1/80 but the “age” variable is missing. What do I do? Question #2
  • 34. Data Cleaning This question illustrates how we may be able to leverage other fields to make an educated guess about the missing age. Missing Data Question Answer There is an observation that states a person’s birthday is 12/1/80 but the “age” variable is missing. What do I do? We have the input (year, month and day) needed to calculate age. We can define a function that will transform this input into the age of each loan recipient. Question #2
  • 35. Question Answer The variable “amount_funded” has values of both “N/A” and “0”. What do I do? Question #3 As you explore the data, some questions arise…
  • 36. Question Answer The variable “amount_funded” has values of both “N/A” and “0”. What do I do? Check documentation if there is a material difference between NA and 0. Question #3
  • 37. Question Answer I’m not sure what currency the variable “amount_funded” is reported in. What do I do? Question #4
  • 38. Question Answer I’m not sure what currency the variable “amount_funded” is reported in. What do I do? Check documentation and other variables, convert to appropriate currency Question #4
  • 39. A final note… Data Cleaning Note that our examples were all very specific - you may or may not encounter these exact examples in the wild. This is because data cleaning is very often idiosyncratic and cannot be adequately completed by following a predetermined set of steps - you must use common sense! Next we turn to exploratory analysis, for which we often have to transform our data. Data transformation
  • 41. The goal of exploratory analysis is to better understand your data. Research Question Exploratory Analysis Data Validation + Cleaning Exploratory Analysis Exploratory analysis can reveal data limitations, what features are important, and inform what methods you use in answering your research question. This is an indispensable first step in any data analysis!
  • 42. Let’s explore our data! Exploratory Analysis Scatter plots Correlation Box plots Summary statistics Once we have done some initial validation, we explore the data to see what models are suitable and what patterns we can identify. The process varies depending on the data, your style, and time constraints, but typically exploration includes: ● Histogram ● Scatter plots ● Correlation tables ● Box plots ● Summary statistics ○ Mean, median, frequency Histogram
  • 43. Exploratory Analysis Histogram Histograms tell us about the distribution of the feature. A histogram shows the frequency distribution of a continuous feature. Here, we have height data of a group of people. We see that most of the people in the group are between 149 and 159 cm tall.
  • 44. Exploratory Analysis Scatter Plots Scatter plots provide insight about the relationship between two features. Scatter plots visualize relationships between any two features as points on a graph. They are a useful first step to exploring a research question. Here, we can already see a positive relationship between amount funded and amount requested. What can we conclude?
  • 45. Scatter plots are an indispensable first step to exploring a research question. Here, we can already see a positive relationship between amount funded and amount requested for a KIVA loan. What can we conclude? It looks like there is a strong relationship between what loan amount is requested and what is funded. Exploratory Analysis Scatter Plots Scatter plot provide important data about the relationship between two features.
  • 46. Correlation is a useful measure of the strength of a relationship between two variables. It ranges from -1.00 to 1.00 Go further with this fun game. 1.00 0.88 0.60 0.00 -0.55 -0.78 -1.00 Exploratory Analysis Correlation
  • 47. Correlation does not equal causation Exploratory Analysis Let’s say you are an executive at a company. You’ve gathered the following data: X = $ spent on advertising Y= Sales Based on the graph and positive correlation, you’d be tempted to say $ spent on advertising caused an increase in sales. But hang on - it’s also possible that an increase in sales (and thus, profit) would lead to an increase in $ spent on advertising! Correlation between x and y does not mean x causes y; it could mean that y causes x! Correlation
  • 48. Kiva example: Correlation does not equal causation Correlation: 0.96 If you wanted to request a loan through Kiva, and were presented with this graph only, you might conclude that it is a good idea to request $1 million dollars. Exploratory Analysis Correlation
  • 49. Kiva example: Correlation does not equal causation Conclusions can be invalid even when data is valid! But common sense tells us that this conclusion doesn’t make a lot of sense. Just because you request a lot doesn’t mean you will be funded a lot! Exploratory Analysis Correlation
  • 50. Mean, median, frequency are useful summary statistics that let you know what is in your data. Exploratory Analysis Summary statistics
  • 51. Image source: University of Florida, Quantitative introduction to the boxplot Exploratory Analysis Boxplots Boxplots are a useful visual depiction of certain summary statistics.
  • 53. Recall: We may have a question in mind before we look at the data, but our exploration of the data often develops or refines our research question. Research Question Exploratory Analysis Data Validation + Cleaning What comes first, the chicken or the egg?
  • 54. ● We ask a question we expect data to answer. What comes first, the data or the question? How do you define the research question? START: Research question Do I have data that may answer my question? Gather or find more data Is the data labelled? Supervised learning methods Unsupervised learning methods YN YN
  • 55. ● We ask a question we expect data to answer. What comes first, the data or the question? How do you define the research question? START: Research question Do I have data that may answer my question? Gather or find more data Is the data labelled? Supervised learning methods Unsupervised learning methods Requires data processing and validation YN YN Does it contain the outcome feature Y? We’ll cover both supervised and unsupervised methods in this course!
  • 56. Research Question Given the KIVA data below, we may find a few questions interesting. Loan amount requested by a Kiva borrower in Kenya Town Kiva borrower resides in. One possible research question we might be interested in exploring is: Does the loan amount requested vary by town?
  • 57. Research Question How does loan amount requested vary by town? This is a reasonable research question, because we would expect the amount to vary because the cost of materials and services varies from region to region. For example, we would expect the cost of living in a rural area to be cheaper than an urban city.
  • 59. Now we have our research question, we are able to start modeling! Research Question Exploratory Analysis Modeling Phase Performance Data Cleaning We are here! Learning Methodology Task
  • 60. Task What is the problem we want our model to solve? Performance Measure Quantitative measure we use to evaluate the model’s performance. Learning Methodology ML algorithms can be supervised or unsupervised. This determines the learning methodology. All models have 3 key components: a task, a performance measure and a learning methodology. Source: Deep Learning Book - Chapter 5: Introduction to Machine Learning Modeling Phase
  • 61. We will go over the machine learning task and learning methodology in the next lesson.
  • 62. ✓ What is machine learning? ✓ How do you define a research question? ✓ What are observations? ✓ What are features? ✓ What are outcome variables? ✓ Introduction to KIVA data You covered this today:
  • 63. You are on fire! Go straight to the next module here. Need to slow down and digest? Take a minute to write us an email about what you thought about the course. All feedback small or large welcome! Email: sara@deltanalytics.org
  • 64. Find out more about Delta’s machine learning for good mission here.