What is the basis for the Data Science course and Data Scientist to know?
1-Algorithm
2-Data
3-Ask The Right Question
4-Predict an answer
5- Copy other people's work to do data science
1. Data Science Unit 2
1-Algorithm
2-Data
3-Ask The Right Question
4-Predict an answer
5- Copy other people's work to
do data science
By : Professor Lili Saghafi
https://professorlilisaghafiquantumcomputing.wor
dpress.com
proflilisaghafi@gmail.com
https://sites.google.com/site/professorlilisaghafi/home
@Lili_PLS
2. Agenda
• 1-Algorithm
• 2-Data
• 3-Ask The Right Question
• 4-Predict an answer
• 5- Copy other people's work to do data
science
3. Data Science
1-Algorithm
5 machine learning methods,
called algorithms
By : Professor Lili Saghafi
https://professorlilisaghafiquantumcomputing.wor
dpress.com
proflilisaghafi@gmail.com
https://sites.google.com/site/professorlilisaghafi/home
@Lili_PLS
4. What Data Science can answer?
• Data Science uses numbers and names (also
known as categories or labels) to predict
answers to questions.
• It might surprise you, but there are only five
questions that data science answers:
– Is this A or B?
– Is this weird?
– How much – or – How many?
– How is this organized?
– What should I do next?
5. Machine Learning Methods,
called Algorithms.
• Each one of these questions is answered by a
separate family of machine learning methods,
called algorithms.
• It's helpful to think about an algorithm as a
recipe and your data as the ingredients.
• An algorithm tells how to combine and mix the
data in order to get an answer.
• Computers are like a blender. They do most of
the hard work of the algorithm for you and they
do it pretty fast.
7. two-class classification algorithms
• This family of algorithms is called two-class
classification.
• It's useful for any question that has just two possible
answers.
• For example:
– Will this tire fail in the next 1,000 miles: Yes or no?
– Which brings in more customers: a $5 coupon or a 25%
discount?
• This question can also be rephrased to include more
than two options: Is this A or B or C or D, etc.?
• This is called multiclass classification and it's useful
when you have several — or several thousand —
possible answers.
• Multiclass classification chooses the most likely one.
9. Detection Algorithms
• If you have a credit card, you’ve already benefited from
anomaly detection. Your credit card company analyzes
your purchase patterns, so that they can alert you to
possible fraud. Charges that are "weird" might be a
purchase at a store where you don't normally shop or
buying an unusually pricey item.
• This question can be useful in lots of ways. For instance:
– If you have a car with pressure gauges, you might want to know:
Is this pressure gauge reading normal?
– If you're monitoring the internet, you’d want to know: Is this
message from the internet typical?
• Anomaly detection flags unexpected or unusual events
or behaviours. It gives clues where to look for problems.
10. How much? or How many? uses
Regression Algorithms
11. Regression Algorithms
• Regression algorithms make numerical
predictions, such as:
– What will the temperature be next Tuesday?
– What will my fourth quarter sales be?
• They help answer any question that asks
for a number.
12. How is this organized? uses
Clustering Algorithms
13. Clustering Algorithms
• Sometimes you want to understand the structure
of a data set - How is this organized?
• For this question, you don’t have examples that
you already know outcomes for.
• There are a lot of ways to tease out the structure
of data. One approach is clustering.
• It separates data into natural "clumps," for easier
interpretation.
• With clustering, there is no one right answer.
14. Clustering Algorithms
• Common examples of clustering questions
are:
– Which viewers like the same types of movies?
– Which printer models fail the same way?
• By understanding how data is organized,
you can better understand - and predict -
behaviours and events.
15. What should I do now? uses
Reinforcement Learning Algorithms
• Reinforcement learning was inspired by
how the brains of rats and humans
respond to punishment and rewards.
• These algorithms learn from outcomes,
and decide on the next action.
• Reinforcement learning is a good fit for
automated systems that have to make lots
of small decisions without human
guidance.
17. Reinforcement Learning Algorithms
• Questions it answers are always about what
action should be taken - usually by a machine or
a robot. Examples are:
– If I'm a temperature control system for a house:
Adjust the temperature or leave it where it is?
– If I'm a self-driving car: At a yellow light, brake or
accelerate?
• For a robot vacuum: Keep vacuuming, or go
back to the charging station?
• Reinforcement learning algorithms gather data
as they go, learning from trial and error.
18. Data Science
2-DATA
By : Professor Lili Saghafi
https://professorlilisaghafiquantumcomputing.wor
dpress.com
proflilisaghafi@gmail.com
https://sites.google.com/site/professorlilisaghafi/home
@Lili_PLS
19. Data
Before data science can give you the answers you
want, you have to give it some high-quality raw
materials to work with. Just like making a pizza,
the better the ingredients you start with, the
better the final product.
Criteria for data
• In data science, there are certain ingredients
that must be pulled together including:
– Relevant
– Connected
– Accurate
– Enough to work with
21. Is your data relevant?
• On the left, the table
presents the blood
alcohol level of seven
people tested outside a
Boston bar,
• the Red Sox batting
average in their last
game,
• and the price of milk in
the nearest convenience
store.
• table on the right. This
time each person’s body
mass was measured
• as well as the number of
drinks they’ve had.
• The numbers in each row
are now relevant to each
other.
• If I gave you my body
mass and the number of
Margaritas I've had, you
could make a guess at
my blood alcohol
content.
22. Is your data relevant?
• This is all perfectly legitimate data. It’s only
fault is that it isn’t relevant.
• There's no obvious relationship between
these numbers.
• If someone gave you the current price of
milk and the Red Sox batting average,
there's no way you could guess their
blood alcohol content.
24. Do you have connected data?
• Here is some relevant data on the quality of hamburgers:
grill temperature, patty weight, and rating in the local
food magazine. But notice the gaps in the table on the
left.
• Most data sets are missing some values. It's common to
have holes like this and there are ways to work around
them. But if there's too much missing, your data begins
to look like Swiss cheese.
• If you look at the table on the left, there's so much
missing data, it's hard to come up with any kind of
relationship between grill temperature and patty weight.
This example shows disconnected data.
• The table on the right, though, is full and complete - an
example of connected data.
26. Is your data accurate?
• Look at the target in the upper right. There is a tight grouping right
around the bulls eye. That, of course, is accurate.
• Oddly, in the language of data science, performance on the target
right below it is also considered accurate.
• If you mapped out the center of these arrows, you'd see that it's very
close to the bulls eye.
• The arrows are spread out all around the target, so they're
considered imprecise, but they're centered around the bulls eye, so
they're considered accurate.
• Now look at the upper-left target. Here the arrows hit very close
together, a tight grouping.
• They're precise, but they're inaccurate because the center is way off
the bulls eye.
• The arrows in the bottom-left target are both inaccurate and
imprecise. This archer needs more practice.
28. Do you have enough data to
work with?
• Think of each data point in your table as being a
brush stroke in a painting. If you have only a few
of them, the painting can be fuzzy - it's hard to
tell what it is.
• If you add some more brush strokes, then your
painting starts to get a little sharper.
• When you have barely enough strokes, you only
see enough to make some broad decisions. Is it
somewhere I might want to visit? It looks bright,
that looks like clean water – yes, that’s where
I’m going on vacation.
29. Do you have enough data to
work with?
• As you add more data, the picture becomes
clearer and you can make more detailed
decisions. Now you can look at the three hotels
on the left bank. You can notice the architectural
features of the one in the foreground. You might
even choose to stay on the third floor because of
the view.
• With data that's relevant, connected,
accurate, and enough, you have all the
ingredients needed to do some high-quality
data science.
30. Data Science
3-Ask The Right Question
By : Professor Lili Saghafi
https://professorlilisaghafiquantumcomputing.wor
dpress.com
proflilisaghafi@gmail.com
https://sites.google.com/site/professorlilisaghafi/home
@Lili_PLS
31. Ask a sharp question
• We've talked about how data science is the process of
using names (also called categories or labels) and
numbers to predict an answer to a question.
• But it can't be just any question; it has to be a sharp
question.
• A vague question doesn't have to be answered with a
name or a number. A sharp question must.
• Imagine you found a magic lamp with a genie who will
truthfully answer any question you ask.
• But it's a mischievous genie, who will try to make their
answer as vague and confusing as they can get away
with.
• You want to pin them down with a question so airtight
that they can't help but tell you what you want to know.
32. Ask a sharp question
• If you were to ask a vague question, like "What's
going to happen with my stock?", the genie
might answer, "The price will change".
• That's a truthful answer, but it's not very helpful.
• But if you were to ask a sharp question, like
"What will my stock's sale price be next week?",
the genie can't help but give you a specific
answer and predict a sale price.
33. Examples of your answer: Target
data
• Once you formulate your question, check to see
whether you have examples of the answer in
your data.
• If our question is "What will my stock's sale price
be next week?" then we have to make sure our
data includes the stock price history.
• If our question is "Which car in my fleet is going
to fail first?" then we have to make sure our data
includes information about previous failures.
35. Target
• These examples of answers are called a
target.
• A target is what we are trying to predict
about future data points, whether it's a
category or a number.
• If you don't have any target data, you'll
need to get some. You won't be able to
answer your question without it.
36. Reformulate your question
• Sometimes you can reword your question
to get a more useful answer.
• The question "Is this data point A or B?"
predicts the category (or name or label) of
something. To answer it, we use
a classification algorithm.
• The question "How much?" or "How
many?" predicts an amount. To answer it
we use a regression algorithm.
37. Reformulate your question
• To see how we can transform these, let's
look at the question, "Which news story is
the most interesting to this reader?" It asks
for a prediction of a single choice from
many possibilities - in other words "Is this
A or B or C or D?" - and would use a
classification algorithm.
38. Reformulate your question
• But, this question may be easier to answer
if you reword it as "How interesting is each
story on this list to this reader?"
• Now you can give each article a numerical
score, and then it's easy to identify the
highest-scoring article.
• This is a rephrasing of the classification
question into a regression question or
How much?
40. basic principles for asking a
question
• How you ask a question is a clue to which algorithm can
give you an answer.
• You'll find that certain families of algorithms - like the
ones in our news story example - are closely related.
• You can reformulate your question to use the algorithm
that gives you the most useful answer.
• But, most important, ask that sharp question - the
question that you can answer with data.
• And be sure you have the right data to answer it.
• We've talked about some basic principles for asking
a question you can answer with data.
41. Data Science
4-Predict an answer
By : Professor Lili Saghafi
https://professorlilisaghafiquantumcomputing.wor
dpress.com
proflilisaghafi@gmail.com
https://sites.google.com/site/professorlilisaghafi/home
@Lili_PLS
42. Predict an answer with a simple
model
• A model is a simplified story about our
data.
43. Collect relevant, accurate,
connected, enough data
• Say I want to shop for a diamond.
• I have a ring that belonged to my grandmother
with a setting for a 1.35 carat diamond, and I
want to get an idea of how much it will cost.
• I take a notepad and pen into the jewellery store,
and I write down the price of all of the diamonds
in the case and how much they weigh in carats.
• Starting with the first diamond - it's 1.01 carats
and $7,366.
• Now I go through and do this for all the other
diamonds in the store.
45. Collect relevant, accurate,
connected, enough data
• Notice that our list has two columns.
• Each column has a different attribute - weight in carats
and price - and each row is a single data point that
represents a single diamond.
• We've actually created a small data set here - a table.
• Notice that it meets our criteria for quality:
– The data is relevant - weight is definitely related to price
– It's accurate - we double-checked the prices that we write down
– It's connected - there are no blank spaces in either of these
columns
– And, as we'll see, it's enough data to answer our question
46. Ask a sharp question
• Now we'll pose our question in a sharp
way: "How much will it cost to buy a 1.35
carat diamond?"
• Our list doesn't have a 1.35 carat diamond
in it, so we'll have to use the rest of our
data to get an answer to the question.
47. Plot the existing data
• The first thing we'll do is draw a horizontal
number line, called an axis, to chart the weights.
• The range of the weights is 0 to 2, so we'll draw
a line that covers that range and put ticks for
each half carat.
• Next we'll draw a vertical axis to record the price
and connect it to the horizontal weight axis.
• This will be in units of dollars.
• Now we have a set of coordinate axes.
•
49. Plot the existing data
• We're going to take this data now and turn it into
a scatter plot.
• This is a great way to visualize numerical data sets.
• For the first data point, we eyeball a vertical line at 1.01
carats.
• Then, we eyeball a horizontal line at $7,366. Where they
meet, we draw a dot.
• This represents our first diamond.
• Now we go through each diamond on this list and do the
same thing.
• When we're through, this is what we get: a bunch of
dots, one for each diamond.
51. Draw the model through the data
points
• Now if you look at the dots and squint, the
collection looks like a fat, fuzzy line.
• We can take our marker and draw a straight line
through it.
• By drawing a line, we created a model.
• Think of this as taking the real world and making
a simplistic cartoon version of it.
• Now the cartoon is wrong - the line doesn't go
through all the data points.
• But, it's a useful simplification.
53. Draw the model through the data
points
• The fact that all the dots don't go exactly through the line
is OK.
• Data scientists explain this by saying that there's the
model - that's the line - and then each dot has
some noise or variance associated with it.
• There's the underlying perfect relationship, and then
there's the gritty, real world that adds noise and
uncertainty.
• Because we're trying to answer the question How
much? this is called a regression.
• And because we're using a straight line, it's a linear
regression.
54. Use the model to find the answer
• Now we have a model and we ask it our
question: How much will a 1.35 carat diamond
cost?
• To answer our question, we eyeball 1.35 carats
and draw a vertical line.
• Where it crosses the model line, we eyeball a
horizontal line to the dollar axis.
• It hits right at 10,000. Boom! That's the answer:
A 1.35 carat diamond costs about $10,000.
56. Create a confidence interval
• It's natural to wonder how precise this
prediction is.
• It's useful to know whether the 1.35 carat
diamond will be very close to $10,000, or a
lot higher or lower.
• To figure this out, let's draw an envelope
around the regression line that
includes most of the dots.
57. Create a confidence interval
• This envelope is called our confidence
interval: We're pretty confident that prices
fall within this envelope, because in the
past most of them have.
• We can draw two more horizontal lines
from where the 1.35 carat line crosses the
top and the bottom of that envelope.
59. Create a confidence interval
• Now we can say something about our
confidence interval: We can say
confidently that the price of a 1.35 carat
diamond is about $10,000 - but it might be
as low as $8,000 and it might be as high
as $12,000.
60. We're done, with no math or
computers
• We did what data scientists get paid to do,
and we did it just by drawing:
– We asked a question that we could answer
with data
– We built a model using linear regression
– We made a prediction, complete with
a confidence interval
61. We're done, with no math or
computers
• And we didn't use math or computers to do
it.
• Now if we'd had more information, like...
– the cut of the diamond
– color variations (how close the diamond is to
being white)
– the number of inclusions in the diamond
62. We're done, with no math or
computers
• ...then we would have had more columns.
• In that case, math becomes helpful. If you have
more than two columns, it's hard to draw dots on
paper.
• The math lets you fit that line or that plane to
your data very nicely.
• Also, if instead of just a handful of diamonds, we
had two thousand or two million, then you can
do that work much faster with a computer.
63. What we did
• We've talked about how to do linear
regression, and we made a prediction
using data.
64. Data Science
5- Copy other people's work to
do data science
By : Professor Lili Saghafi
https://professorlilisaghafiquantumcomputing.wor
dpress.com
proflilisaghafi@gmail.com
https://sites.google.com/site/professorlilisaghafi/home
@Lili_PLS
65. Copy other people's work to do
data science
• Here you’ll discover a place to find
examples that you can borrow from as a
starting point for your own work
• One of the trade secrets of data science is
getting other people to do your work for
you.
66. Find examples in the Azure AI
Gallery
• Microsoft has a cloud-based service
called Azure Machine Learning Studio that
you're welcome to try for free.
• It provides you with a workspace where
you can experiment with different machine
learning algorithms, and, when you've got
your solution worked out, you can launch it
as a web service.
68. Find examples in the Azure AI
Gallery
• Part of this service is something called
the Azure AI Gallery.
• It contains resources, including a collection of
Azure Machine Learning Studio experiments, or
models, that people have built and contributed
for others to use.
• These experiments are a great way to leverage
the thought and hard work of others to get you
started on your own solutions.
• Everyone is welcome to browse through it.
70. Azure AI Gallery
• If you click Experiments at the top of the
site, you'll see a number of the most
recent and popular experiments in the
gallery.
• You can search through the rest of
experiments by clicking Browse All at the
top of the screen, and there you can enter
search terms and choose search filters.
71. Find and use a clustering
algorithm example
• So, for instance, let's say you want to see
an example of how clustering works, so
you search for "clustering
sweep" experiments.
72.
73. Find and use a clustering
algorithm example
• Click on that experiment and you get a
web page that describes the work that this
contributor did, along with some of their
results.
76. Azure Machine Learning Studio
• I can click on that and it takes me right
to Azure Machine Learning Studio.
• It creates a copy of the experiment and
puts it in my own workspace.
• This includes the contributor's dataset, all
the processing that they did, all of the
algorithms that they used, and how they
saved out the results.
78. Azure Machine Learning Studio
• And now I have a starting point.
• I can swap out their data for my own and
do my own tweaking of the model.
• This gives me a running start, and it lets
me build on the work of people who really
know what they’re doing.
79. Find experiments that demonstrate
machine learning techniques
• There are other experiments in the Azure AI
Gallery that were contributed specifically to
provide how-to examples for people new to data
science.
• For instance, there's an experiment in the gallery
that demonstrates how to handle missing values
(Methods for handling missing values).
• It walks you through 15 different ways of
substituting empty values, and talks about the
benefits of each method and when to use it.
80. Azure AI Gallery is a place to find working
experiments that you can use as a starting
point for your own solutions.
81. Data Science
Azure Machine Learning Studio
By : Professor Lili Saghafi
https://professorlilisaghafiquantumcomputing.wor
dpress.com
proflilisaghafi@gmail.com
https://sites.google.com/site/professorlilisaghafi/home
@Lili_PLS
82. Azure Machine Learning Studio
• You can use Azure Machine Learning Studio to
deploy machine learning workflows and models
as web services.
• These web services can then be used to call the
machine learning models from applications over
the Internet to do predictions in real time or in
batch mode.
• Because the web services are RESTful, you can
call them from various programming languages
and platforms, such as .NET and Java, and from
applications, such as Excel.