2. Agenda of today
1. Introduction to data science – where did it come from?
2. Why did I become a data scientist?
3. Definition of data science
4. Data science skillset map
5. Data science process – one-off vs. production pipeline
6. Data science process breakdown – a bit more detail
7. Various data science tools
8. Q&A
4. Google Trends – what people are searching
(Chart legend: 1. Cloud computing, 2. Virtualization, 3. Big data, 4. Data science)
Source : https://www.google.com/trends/explore?q=cloud%20computing,virtualization,big%20data,data%20science
7. What people are searching – top 5 keywords
(Trend lines: cloud computing, virtualization, data science, big data)
Source : https://www.google.com/trends/explore?q=cloud%20computing,virtualization,big%20data,data%20science
8. Examples of what makes the data so big
Source: http://cloud-dba-journey.blogspot.se/2013/10/demystifying-hadoop-for-data-architects.html
9. Data science can help to reveal these insights
Data value from the business's perspective
13. WHY?
As an analyst for many years…
I realised…
14. Insight to action – too slow!
(The loop on the slide: marketers request insights → the analysts analyse → answers delivered, usually in a dashboard/report format → monitor → act on the customer. Each turn of the loop takes from weeks up to six months.)
Issues discovered:
1. Data is not centralized/synchronized
2. Data quality is bad
3. The organization's hierarchy slows down the decision-making process
4. NO common KPIs (isolated measurement)
5. Marketing strategy strongly depends on gut feelings (historical reasons)
6. Knowledge gaps & misconceptions (focus on visualization, not necessarily facts)
7. Insufficient information (insufficient data sources to answer the given question)
15. How did it happen?
Fragmented data view:
1. Focus on the database as the only truth
2. Limited data sources (mostly DB + clickstreams)
3. No central data repository
4. No common definition of a customer
5. Customers' ever-changing behavior (historical vs. real-time behavioural data)
6. Marketers' beliefs vs. real evidence about the customers
18. Data science can at least answer SOME of those concerns!
But…
it heavily depends on how mature the organization is.
19. Organization maturity vs. data maturity

Organization maturity | Data maturity
Resistance to change | Fragmented data (ad-hoc reports focused)
Isolated acceptance | Central data lake (exploratory analysis)
Growing importance | 360° data view in real time (predictive analytics)
Embracing throughout business disciplines | Data governance (data quality control)
Data-driven product & organization | Data-driven enterprise strategy (recommender systems)

Source : https://datafloq.com/read/five-levels-big-data-maturity-organisation/259
21. Data science is a "concept to unify statistics, data analysis and their
related methods" in order to "understand and analyze actual phenomena"
with data. It employs techniques and theories drawn from many fields
within the broad areas of mathematics, statistics, information science,
and computer science, in particular from the subdomains of machine
learning, classification, cluster analysis, data mining, databases,
and visualization.
Short definition (Wikipedia)
22. Data science synonyms … what includes what

Machine learning (ML) – typical characteristics:
• Is question-specific
• Bias-variance tradeoff + over/underfitting
• Split data into training and testing (validation) sets
• Can be combined with other algorithms
• Can utilize parallelization
• Deals with all kinds of data (incl. unstructured)
• Data mining techniques (for big data) are applied

Predictive analytics (supervised learning) – typical characteristics:
• Focus on feature engineering (variable selection)
• Exploration vs. exploitation
• Prediction performance decays quickly with time
• Mostly ad-hoc | one-off based
• Deals with all kinds of data (when applying machine learning), otherwise mostly structured|semi-structured data

Inferential + exploratory + descriptive statistics – typical characteristics:
• Ad-hoc based
• Limited data blending
• Mostly structured data (from databases)
• Focus on historical statistical models
• Modelling focuses on finding correlations or describing existing datasets
26. Data scientist – the skillset map
Unicorn version vs. your own path!
27. Not on the map but equally important

Teamwork essentials:
• Story-telling
• Visualization
• Cooperation / team building
• Interpersonal skills / inspirational coaching
• Open mind
• Knowledge sharing

Personality traits:
• Extreme curiosity
• Detective spirit
• "Naive and stupid" (dare to ask basic questions)
• Strong ethics (data protection / privacy law)
28. My journey – my own version

Roots (your initial foundation):
• Math (university)
• Statistics (university)
• Computer science (master's)
• Analyst (work experience)

The ground: the data science threshold

Tree trunk (skillsets yet to be acquired):
• Programming: R & Python
• Machine learning algorithms
• Data mining techniques
• Cloud services (virtualization concepts)
• Big data ecosystems
• Bayesian statistics
• Graph theory (optional)
• Text mining techniques (optional)

Tree branches & leaves (specialized interests / further development):
• Leadership / team building
• Recommender systems
• Experimental design
• Game theory
• Story-telling / presentation skills
• New model development
• Deep learning / artificial intelligence

Motivation is the key!
31. What motivates you?
What would your path look like?
(15-minute break)
32. Refresh our memory from the previous section:
• Relationship between data science and big data
• What motivated me to become a data scientist
• The definition of data science and its closely related synonyms
• The skillset map for becoming a data scientist (unicorn version vs. your own)
• Motivation is the key!
41. Where did these two approaches come from?
Due to organization maturity…
42. From traditional BI to a data-driven organization & products

Phases (organization maturity):
• Phase 1 (Infancy): resistance to change
• Phase 2 (Technical adoption): isolated acceptance
• Phase 3 (Business adoption): growing importance
• Phase 4 (Data & Analytics as a Service): embraced throughout business disciplines

Platform maturity (data + technology):
• Data silos – fragmented data views
• Data lake acquisition
• Data quality and governance
• Automated data management & administration

Pipeline data processing & application flow:
• Map data sources vs. customer touch points
• Acquire a solution for the architecture
• Merge data sources and automate processing
• Control data quality
• Design experiments – extract preference data

Possible types of ML used in each phase:
• Data exploration / pattern detection
• Unsupervised learning
• Supervised learning / experimental design
• Recommender system(s) / deep learning

Visualization of deliveries:
• Real-time dashboard(s)
• Algorithm-embedded dashboard(s)
• Algorithm performance dashboard(s)
43. The same maturity matrix as the previous slide, overlaid with the two working modes:
One-off (Proof of Concept = POC)
Production pipeline
45. Data science process: who does what
Roles involved: business knowledge, data scientist, data engineer, IT support.

One-off (iterations):
business question understanding → data sources scoping → data acquisition → data preparation → descriptive statistics → feature engineering → model training → model validation → deliverables

Production pipelines:
business question understanding → data sources scoping → data acquisition → data preparation → descriptive statistics → feature engineering → model training → model validation → deployment → apply to application → performance optimization → enable automation
47. The same process, annotated with roles (business knowledge, data scientist, data engineer, IT support) and rough effort shares: the one-off iteration loop ≈ 70-80%, the production-pipeline extension ≈ 10-20%.
48. Comparison: one-off vs. production pipeline

Criterion | One-off | Production pipeline
Organization maturity | Phase 1 / phase 2 | Phase 2 and onward
What are they looking for | To understand how data science works (baby steps) | To participate in the data science process
Project scope | Small, 4-8 weeks | At least 3 months and above
Platform & technology | Do not change anything existing in-house | Considering or already migrating to a new platform/technology
Data source availability | Mainly DB + 1 or 2 additional data sources | Start to map out all available data sources
Data quality | Poor, needs lots of cleaning | Start to sort out data quality
Deliverables | Focus on interpretation (visualized) | Focus on code (hence limitations on programming language)
49. Data Science Process – box-in the activities overview
business question understanding → data sources scoping → data acquisition → data preparation → descriptive statistics → feature engineering → model training → model validation → deploy/deliverables
50. Activities per step

Define business question:
• Define the goal
• Decompose the question
• Verify understanding

Project scoping:
• Map data sources
• Establish performance measures
• Data scientist workspace
• Task force
• Business limitations
• Define project scope

Data acquisition & preparation:
• Environment set-up
• Languages: SQL, R, Python… etc.
• Data source merging
• Data pre-scan Q&A
• Data quality review

Descriptive statistics (data exploration):
• Explore data (plots)
• Data manipulation
• Outliers / NAs
• Summary statistics
• Data exploration review

Feature engineering:
• Establish performance threshold
• Feature engineering
• Algorithm selection
• Business sign-off

Model building & validation:
• Types of models
• Model selection criteria
• Build and validate the model
• Review results

Deploy/deliverables:
• To whom
• On what platform
• Update frequency
• Performance review
• Infographics (visualization)
• Deployment review
51. Step-wise Data Science Process: from business question scoping
52. Question scope (flowchart)
Define the scope:
1. Thresholds
2. Data scope
3. Resources
4. Taskforce
5. Limitations
6. Budget & timeline… etc.
Specify the questions and how to get the data (access). Is the data lake / environment set-up ready? If not ready: extract and resolve the issues. Once done → Next: about data.
53. Step-wise Data Science Process: data acquisition & data preparation
54. Acquire data – merge the data sources

The GOAL: one analysis table, built by joining the sources onto the main table:
• Main table (PK = TransactionID, FK = StoreID) – data source type: database
• Customer purchase information (PK = CID, FK = TransactionID) – database; 1. joined by TransactionID
• Website browsing: pages viewed, avg. time on site, products browsed… etc. (PK = CookieID, FK = TransactionID) – clickstream; 2. joined by TransactionID
• Promotions: campaign name, campaign duration, in which store, discount level… etc. (PK = CampaignID, FK = StoreID) – campaign tool; 3. joined by StoreID
• Store survey: questions, scale of satisfaction, product rating… etc. (PK = SurveyID, FK = StoreID) – survey tool; 4. joined by StoreID
• Store geo info: location, km to center, km to customer's address, km to competitor's store in the same postcode region… etc. (PK = StoreID) – API calls; 5. joined by StoreID
• Customer database (PK = CID, FK = email) – database; joined by CID
• Customer interests (PK = email address) – social; 6. joined by email
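The join sequence above can be sketched in SQL, here via Python's built-in sqlite3. Table and column names are hypothetical stand-ins for three of the sources on the slide (main table, purchases, promotions):

```python
import sqlite3

# Minimal merge sketch with in-memory SQLite; the schemas are invented
# stand-ins for the slide's sources, keyed the same way (TransactionID,
# StoreID).
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE main_tbl (transaction_id TEXT PRIMARY KEY, store_id TEXT)")
cur.execute("CREATE TABLE purchases (cid TEXT, transaction_id TEXT, amount REAL)")
cur.execute("CREATE TABLE promotions (campaign_id TEXT, store_id TEXT, discount REAL)")
cur.executemany("INSERT INTO main_tbl VALUES (?, ?)",
                [("t1", "s1"), ("t2", "s2")])
cur.executemany("INSERT INTO purchases VALUES (?, ?, ?)",
                [("c1", "t1", 12.5), ("c2", "t2", 30.0)])
cur.executemany("INSERT INTO promotions VALUES (?, ?, ?)",
                [("camp1", "s1", 0.1)])

# LEFT JOINs keep every transaction even when a source has no match,
# which is what you want when building one wide analysis table.
rows = cur.execute("""
    SELECT m.transaction_id, p.cid, p.amount, pr.discount
    FROM main_tbl m
    LEFT JOIN purchases  p  ON p.transaction_id = m.transaction_id
    LEFT JOIN promotions pr ON pr.store_id      = m.store_id
    ORDER BY m.transaction_id
""").fetchall()
for r in rows:
    print(r)
```

Transaction t2 has no promotion, so its discount column comes back as NULL; that is the price of the LEFT JOIN, and exactly the kind of gap the data-quality review step has to notice.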
55. Step-wise Data Science Process: descriptive statistics
56. A flower called iris
Three species: Setosa, Virginica, Versicolor
Source: https://en.wikipedia.org/wiki/Iris_flower_data_set
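A minimal descriptive-statistics pass, of the kind this step produces. The measurements below are a hand-picked handful of iris-like sepal lengths (cm), not the full data set:

```python
import statistics

# Hand-picked sepal lengths for illustration (not the real iris data).
sepal_length = [5.1, 4.9, 4.7, 7.0, 6.4, 6.3, 5.8, 7.1]

# The usual first-look summary: count, range, center, spread.
summary = {
    "n":      len(sepal_length),
    "min":    min(sepal_length),
    "max":    max(sepal_length),
    "mean":   statistics.mean(sepal_length),
    "median": statistics.median(sepal_length),
    "stdev":  statistics.stdev(sepal_length),
}
print(summary)
```

In practice you would run this per species and per column, and eyeball the plots alongside the numbers.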
62. Step-wise Data Science Process: feature engineering
63. Feature selection (things to consider)
- Observations from the descriptive statistics
- Remove highly correlated columns/parameters (example slides further down the presentation)
- Candidate models' requirements?
  - Some models require one-hot encoding (e.g. neural networks, PCA, k-means clustering)
  - Outlier-sensitive or not? (e.g. regression models are more sensitive to outliers than tree models)
- Forward stepwise / backward stepwise / shrinkage selection concepts vs. black-box models ranking feature importance?
- Computing time vs. response
- Business limitations (e.g. the business requires shrinking the features to <= 20)
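One-hot encoding, mentioned above as a requirement of some models, can be sketched in a few lines. The category values here are made up for illustration:

```python
# One-hot encoding sketch: each distinct category becomes its own 0/1
# indicator column, so models that need numeric inputs (neural networks,
# PCA, k-means) can consume a categorical feature.
def one_hot(values):
    """Encode a list of categories as rows of 0/1 indicators."""
    categories = sorted(set(values))  # fixed, reproducible column order
    return [[1 if v == c else 0 for c in categories] for v in values]

colors = ["red", "green", "blue", "green"]  # hypothetical feature values
encoded = one_hot(colors)
print(encoded)
```

Note the business-limitation angle: one-hot encoding multiplies the column count by the number of categories, which matters when you have to stay under a feature budget.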
64. Example (justifying selected features)
Background:
You've done an exploratory analysis of correlations; you have the result, and now you need to explain it in a way a five-year-old can understand, and use the exploratory results to do your feature selection!
66. Explaining correlation with a metaphor, continued
Observation interval of the distance from A to B; direction of travel to the right.
• Highly correlated (0.75-1): the Tesla and the Volvo move at almost the same speed and in the same direction
• Positively correlated (0.5-0.75): the Tesla moves a bit faster than the Volvo, but they are still both heading in the same direction
• Negatively correlated (<0): the Tesla and the Volvo move in different directions
67. Linear correlation
In the following slides, for intuitive convenience, we rescale and map the correlation coefficient into a % format.
Example: a strong positive correlation of 1 → 100%.
Pearson's correlation:
ρ(X, Y) = cov(X, Y) / (σ_X · σ_Y)
where:
cov(X, Y) is the covariance of variables X and Y
σ_X is the standard deviation of X
σ_Y is the standard deviation of Y
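The formula can be computed directly. The speed samples below are invented, echoing the two-cars metaphor:

```python
import math

# Pearson's correlation straight from the slide's formula:
# rho = cov(X, Y) / (sigma_X * sigma_Y)
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

# Hypothetical speed samples for the two cars driving from A to B.
tesla_speed = [60, 62, 64, 66, 68]
volvo_speed = [59, 61, 65, 65, 69]
r = pearson(tesla_speed, volvo_speed)
print(f"correlation: {r:.2f} ({r:.0%})")
```

The two series move almost in lockstep, so r lands close to 1; in the deck's rescaled "% format" that reads as a correlation in the high nineties.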
68. The result of the analysis
(Correlation heatmap of the sensor parameters: external sheet temp exhaust pipe, actual exhaust temperature exhaust pipe, process value regulator under pressure, process value regulator hood damper, negative pressure exhaust pipe, regulator value hood damper, regulator value exhaust damper, actual value damper exhaust pipe, regulator process value)
69. Before we leave this metaphor – one last thing:
"Correlation does not imply causation!"
70. Correlation does not imply causation!
Question: why did these two cars (the Tesla and the Volvo) move in the same direction in the first place?
Guess 1: husband and wife ("I drive the Tesla" / "I drive the Volvo")
Guess 2: a racing track
Guess 3: coincidence
71. Before diving into training your model(s)…
ask yourself: what type of model should I use?
72. What types of models are suitable?
Question: do you have the correct answer to the given business question?
• YES → supervised learning (regressions, classes)
• NO → unsupervised learning (clustering, association analysis), deep learning
73. Before diving into training your model(s)…
The models landscape: 1. supervised, 2. unsupervised, 3. deep learning
74. Supervised Learning
Regressions:
Linear Regression
Stepwise Regression
Piecewise Polynomials and splines
Smoothing Splines
Logistic Regression
Multivariate Adaptive Regression Splines
Least Absolute Shrinkage and Selection Operator (LASSO)
Ridge Regression
Linear Discriminant Analysis (LDA)
Trees :
Decision trees
Gradient Boosted Regression trees
Adaptive Boosting trees (AdaBoost)
Conditional Inference trees (CI trees)
Bootstrap Aggregation (Bagging) trees
Gradient Boosted Machines(GBM)
Random Forest (RF)
Support Vector Machines (SVM) :
Support vector classifier (two class)
Support vector classifier (multiclass)
Kernels and support vector machines
Unsupervised Learning
Dimensionality reduction:
Principal Component Analysis (PCA)
Singular Value Decomposition (SVD)
MinHash
Locality Sensitive Hashing (LSH)
t-Distributed Stochastic Neighbor Embedding (t-SNE)
Clustering:
K-means clustering
Hierarchical clustering
Bradley-Fayyad-Reina (BFR) clustering
Clustering Using REpresentatives (CURE)
Bayesian networks
Topic modelling
Market basket:
Apriori (association rules)
Park, Chen and Yu algorithm (PCY)
Savasere, Omiecinski and Navathe (SON)
Toivonen's algorithm
Stream analysis:
Bloom filters
Flajolet-Martin algorithm
Alon-Matias-Szegedy
Datar-Gionis-Indyk-Motwani algorithm

Deep Learning (neural network families):
Perceptrons
Simple neural networks (fully connected)
Deep Boltzmann machines
Convolutional neural networks
Recurrent neural networks
Hierarchical temporal memory

Recommender systems:
Content-based recommender
User-user recommender
Item-item recommender
Hybrid recommender
Latent Dirichlet Allocation recommender

Others:
Genetic algorithms (chromosome)
Multi-armed bandit
K-Nearest Neighbors (KNN)
75. Data Science Process: model training → model validation (example: supervised learning)
76. Model training → model validation flow
Split the pre-processed data into a training set, a validation set and a test set.
• Train the ML models on the training set
• Check them against the validation set
• From the models that pass the test set, select one winning model
• Monitor model performance and decide: re-train the models? (yes / no)
If we want to be REALLY picky: sample from live data streams and live-test the winning model.
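The split at the top of this flow might look like the sketch below. The 60/20/20 ratios are an assumption, not something the slide prescribes:

```python
import random

# Shuffle, then cut the pre-processed rows into training / validation /
# test sets. A fixed seed keeps the split reproducible between runs.
def three_way_split(rows, seed=42, train=0.6, val=0.2):
    rows = rows[:]                     # don't mutate the caller's list
    random.Random(seed).shuffle(rows)  # reproducible shuffle
    n_train = int(len(rows) * train)
    n_val = int(len(rows) * val)
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])

data = list(range(100))  # stand-in for 100 pre-processed rows
train_set, val_set, test_set = three_way_split(data)
print(len(train_set), len(val_set), len(test_set))
```

Shuffling before cutting matters: if the rows are sorted by date or by customer, an unshuffled split silently trains and tests on different populations.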
77. Data Science Process: model selection criteria
78. Example (justifying how you select the model)
Background:
You built a prediction model (say, to classify customer purchase = yes/no); now you need to explain why you picked THAT algorithm in the first place!
79. Construct criteria for model selection – input from both the business and the data characteristics (none of the numerical data is normally distributed):

Criteria | Logistic | Trees | RF | GBM | Weight
Performance = accuracy | 86.5% | 86.7% | 86.8% | 85.8% | 10%
Sensitivity | 4.6% | 12.5% | 8.4% | 21.4% | 20%
Interpretability | 1 | 0.8 | 0.4 | 0.2 | 30%
Time to compute | 1 | 0.8 | 0.2 | 0.2 | 20%
# of parameters | 2.4 | 2.4 | 1.89 | 2.38 | 10%
Conflict with regression use | yes | partial | minimum | minimum | 10%
Ranking | 1.016 | 1.063 | 0.625 | 0.894 | 100%

Performance = (true positives + true negatives) / test-set population: how often the model correctly predicts both whether you are a purchaser and a non-purchaser.
Sensitivity = true positives / all positives in the test set: how often the model correctly predicts that you are going to purchase.
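The weighted ranking above can be sketched as follows. The per-criterion scores and weights here are illustrative stand-ins, not the slide's exact figures:

```python
# Weighted model-selection score: each criterion gets a normalized score
# per model, is multiplied by its business weight, and summed into one
# ranking number. Scores and weights below are hypothetical.
weights = {"accuracy": 0.1, "sensitivity": 0.2, "interpretability": 0.3,
           "time_to_compute": 0.2, "n_parameters": 0.2}

scores = {
    "logistic": {"accuracy": 0.865, "sensitivity": 0.046,
                 "interpretability": 1.0, "time_to_compute": 1.0,
                 "n_parameters": 0.9},
    "random_forest": {"accuracy": 0.868, "sensitivity": 0.084,
                      "interpretability": 0.4, "time_to_compute": 0.2,
                      "n_parameters": 0.5},
}

def weighted_score(model_scores):
    return sum(weights[c] * s for c, s in model_scores.items())

ranking = {m: round(weighted_score(s), 3) for m, s in scores.items()}
best = max(ranking, key=ranking.get)
print(ranking, "->", best)
```

With interpretability weighted at 30%, the simpler logistic model can outrank a random forest that is marginally more accurate; that is the point of putting business input into the weights.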
80. Data Science Process: explain your model
81. Example (explaining the selected model)
Background:
Now I have selected a model called a recursive partitioning tree (rpart); the stakeholders asked me to explain how this model works…
82. Recursive Partitioning Tree (rpart) – how does it work?
Explained at two levels:
• High level – conceptually
• Medium level – a bit more detail
84. High level – how does rpart work?
Starting from the parent node, the tree decides whether to split on parameter Pi with value Xi using both criteria; each child node (2.1, 2.2, …) then repeats the same procedure.
For every parameter Pi, check:
Criterion 1: does splitting on Pi at value Xi give me more information?
Criterion 2: does splitting on Pi at value Xi give me better prediction accuracy?
Note: "information" is defined by information theory, with the options of the Gini index and information gain (link).
Hyper-parameters:
• minsplit – the minimum number of observations that must exist in a node for a split to be attempted
• minbucket – the minimum number of observations in a terminal node (= minsplit/3)
• cp – complexity parameter; punishes the model if many parameters are used without much gain in accuracy/information
85. Medium level – a bit more detail:
1) information gain, 2) accuracy improvement
86. 1) Information gain: check the impurity of the end nodes, calculated by entropy

Scenario 1:
If an end node gives a 50-50 chance of the class being Purchase or noPurchase, it is as good as a guess, so the node is said to reach maximum impurity (entropy = 1).
Calculation:
entropy = -P(Purchase)·log2(P(Purchase)) - P(noPurchase)·log2(P(noPurchase))
= -(1/2)·log2(1/2) - (1/2)·log2(1/2) = 1 → maximum impurity

Scenario 2:
If an end node has a 100 percent chance of the class being Purchase or noPurchase, it is a perfect classification, so the node is said to reach minimum impurity (entropy = 0).
Calculation:
entropy = -P(Purchase)·log2(P(Purchase)) - P(noPurchase)·log2(P(noPurchase))
= -(1)·log2(1) - 0 = 0 → minimum impurity

(Example trees from the slide: splitting 10 data points with 5 Purchase + 5 noPurchase on condition 1 into two pure end nodes, 0+5 and 5+0, gives scenario-2 end nodes; a node left at 5 Purchase + 5 noPurchase is a scenario-1 node.)
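The two scenarios can be checked with a small entropy function:

```python
import math

# Entropy of a node from its class counts, matching the slide's scenarios:
# a 50-50 node has entropy 1 (maximum impurity), a pure node entropy 0.
def entropy(class_counts):
    total = sum(class_counts)
    ent = 0.0
    for count in class_counts:
        if count:  # by convention, 0 * log2(0) contributes 0
            p = count / total
            ent -= p * math.log2(p)
    return ent

print(entropy([5, 5]))   # scenario 1: 5 Purchase + 5 noPurchase
print(entropy([10, 0]))  # scenario 2: a pure node
```

The information gain of a split is then the parent's entropy minus the weighted average entropy of the children.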
87. 2) How rpart calculates the misclassification rate on parameter Pi with value Xi

The rpart model asks, for each and every value Xi of a parameter Pi: was it a good idea (via the misclassification rate) to split on this value? It does so for all parameters Pi, over all possible values Xi associated with Pi (see the tree on the left as an example).

Example tree: 20 data points are first split by "Age < 45?" into two groups of 10.
• The "yes" branch is split by "cntTotal < 110?" into predict noPurchase (7 points, correctly classified rate 1/7) and predict Purchase (3 points, correctly classified rate 1/3).
• The "no" branch is split by "cntTotal < 75?" into predict noPurchase (5 points, rate 1/5) and predict Purchase (5 points, rate 1/5).

Overall correct classification rate = (true Purchase + true noPurchase) / total population = 4/20 = 20%
Misclassification rate = 1 - 20% = 80%
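The overall rate can be recomputed from the four leaves; the numbers mirror the slide's example:

```python
# Overall misclassification from per-leaf counts: each leaf is a tuple of
# (points in the leaf, points it classified correctly). The four leaves
# below correctly classify 1/7, 1/3, 1/5 and 1/5 of their points.
leaves = [(7, 1), (3, 1), (5, 1), (5, 1)]

total = sum(n for n, _ in leaves)
correct = sum(c for _, c in leaves)
accuracy = correct / total
misclassification = 1 - accuracy
print(f"accuracy={accuracy:.0%}, misclassification={misclassification:.0%}")
```

4 correct out of 20 gives 20% accuracy and 80% misclassified, matching the slide; rpart compares such numbers across every candidate split value Xi.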
88. Data Science Process: deployment
89. Deliverables: one-off (POC)
The data scientist delivers processed data for visualization, plus model performance metrics & output predictions, which must pass the business owner's vision.
Audience: board members / CTO, CEO, CFO… etc.; marketing directors, marketers.
Focus: interpretability, "wow-effect" visualization, lessons learned – final reports or prototype dashboards for internal sales.
90. Deployment: production pipeline
The data scientist delivers processed data for visualization, code for embedding into applications, and model performance metrics & output predictions, all of which must pass the integration tests.
Audience: IT + content creators + marketers.
Focus: reproducibility and process efficiency:
• Add to organization-wide dashboards & reporting pipelines (automated)
• Embed code directly into applications (content recommenders, product mix vs. customer segment matching… etc.)
• Use the output of the model predictions for further marketing purposes (such as segmentation, customer profiling… etc.)
92. Refresh our memory from the previous sections:
• Relationship between data science and big data
• What motivated me to become a data scientist
• The definition of data science and its closely related synonyms
• The skillset map for becoming a data scientist (unicorn version vs. your own)
  o Why a teamwork approach
  o Dream teammates
  o Data science process: two approaches (why, comparison, boxed-in activities)
  o Data science process breakdown in detail (step-wise)
Source : https://www.google.com/trends/explore?q=cloud%20computing,virtualization,big%20data,data%20science
Cloud computing : Cloud computing is a type of Internet-based computing that provides shared computer processing resources and data to computers and other devices on demand
Virtualization refers to the act of creating a virtual (rather than actual) version of something, including virtual computer hardware platforms, storage devices, and computer network resources.
Now I want you to spend some time reading this slide so I can drink some water, because I am thirsty… :P
Okay, so motivation is very personal… you need to find yours. Here are mine:
I am extremely attracted to knowledge. In fact, every time I find something interesting, I can't just let it pass; I need to stay with it until I know more, or enough to satisfy my hunger for knowledge. I don't know about you, but for me this need to know more drives me to go further.
Secondly, it sounds a bit cliché to say that the beautiful thing about learning is that no one can take it away from you…
Okay, so if you really think about it, it is true: in this world we are all alone. We can try to keep those we care about close, we can try to build the most secure locker in the world…
Eventually, things and people leave us. The only thing you are stuck with is yourself and the knowledge you have. In a way, it is both sad and nice.
So the third picture is quite curious… does anyone know who made this art?
https://en.wikipedia.org/wiki/Waterfall_(M._C._Escher)
So, anyone want to guess why I chose this picture?
Things are not always what they seem at first glance; when you look a bit longer, you will realize something is off… then you will ask yourself why that is.
This is exactly the point: it challenges you to think outside the box. We live in a world with conditions… everything comes with conditions that we are not even consciously aware of. For example, we restrict ourselves to thinking in at most three dimensions, so we get confused when the dimensionality grows higher than three. What if we were allowed to go to the 4th or 5th dimension; what would happen?
Another way to think about it: we assume gravity exists even in pictures. Okay, so who says it SHOULD exist at all costs? What if we try to be surreal…
This concept of challenging your fundamental "bias" extends to everything you do as a data scientist. Remember that I said in the data science skillset map that you need to be "naive and stupid"? Ask questions about why things are the way they are. Asking why it is done like this is actually important; it sometimes reveals hidden truths.
So, can anyone tell me what the difference between these two pictures is?
We have two cars here: a Tesla and a Volvo.
During the interval of this distance (from point A to point B), we know that both cars are moving to the right at almost the same speed.
When we observe these two cars from point A to point B, we can see that they arrive at approximately the same place, and they move along the path almost simultaneously, synchronized.
Now, this could be because a husband and wife (each owning a car) were driving home together; it could be that the two cars were on a racing track.
It could be completely coincidental: two strangers just happened to join this road in the same direction within the observed path from A to B.
Since we have not been given enough information, we have no idea which of these scenarios it is. The only valid conclusion we can draw is this:
When we observe the Tesla and the Volvo, we know that the two cars move together, almost synchronized in speed and time (which translates into the distances they cover being quite similar as well).
So if we know that we will eventually see the Tesla when standing at point B, we know we will also see the Volvo there when we see the Tesla.
Now we only need to know one of the cars (either the Tesla or the Volvo) at point B to determine how much distance the two cars covered, since they arrive at point B at almost the same time; we can just pick one.
This means the two cars are positively correlated, and their correlation is quite strong, approaching 1, since they move in the same direction almost simultaneously.
Now consider that we do not know whether the two cars happened to move in the same direction simultaneously by accident, or whether there is some scenario behind the scenes yet to be discovered. This means that correlation (either positive or negative) does not mean causation.
So why is it important for feature engineering to know this?
Okay, let's say we want to model fuel consumption efficiency for cars. We then should NOT take the Tesla into consideration, since the Tesla does not even use fuel.
It would just confuse the model I build; the model could not possibly know why the Tesla has only zeros as values, through and through, when it comes to fuel consumption.
Hence it is actually harmful not to select your features carefully.