VII Jornadas eMadrid "Education in exponential times". "Uso de IBM Analytics para aprender a tomar mejores decisiones". Ramiro Regó Álvarez. 05/07/2017.
1. Using IBM Analytics to learn to make better decisions
[Background word cloud: Spark, SPSS, Modeler, Statistics, scoring, IBM, MLlib, R, variance, decision tree, algorithm, regression, distribution, propensity, accuracy, binomial, variable, stratified sample, Analytic Server, Hadoop, Map/Reduce, Gini, Weibull, PCA, gamma, Monte Carlo, decision, management, neural network, type I error, cluster, K-means, SQL, learning, machine learning]
2. What is Data Science, and what is not
3. CRISP-DM
[Diagram: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment]
Cross Industry Standard Process for Data Mining, commonly known by its acronym CRISP-DM, is a data mining process model that describes the approaches data mining experts commonly use to tackle problems.
In other words, it is common practice (and common sense) put in a diagram.
But is it as simple as it seems?
Can we just walk in and start mining?
How can I be sure I have the right tool?
4. Simple case: one variable. Straightforward?
Source: https://clevertap.com/blog/the-fallacy-of-seeing-patterns/
From the shape of the histogram, the distribution seems to be left-skewed, but does that tell the whole story?
The data is represented in 5 intervals between 35 and 85.
A little over 45% of the observations are in the interval 65 to 75.
What if we change the number of intervals from the current 5 to something higher that could give a better distribution of data among the intervals?
[Figure: histogram of X with 5 intervals (x-axis: X, bins labelled 35, 45, 55, 65, 75, More; y-axis: frequency, 0 to 160)]
5. Not that much
Source: https://clevertap.com/blog/the-fallacy-of-seeing-patterns/
[Figure: histogram of X with narrower intervals (x-axis: X; y-axis: frequency, 0 to 60)]
In this histogram, each interval is of size 3 approximately.
There seems to be a change in the shape of the distribution now.
The original inference of a left-skewed distribution is now replaced with a shape that has 2 peaks.
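The effect of the interval width can be reproduced with a small sketch. The observations below are synthetic (the data behind the two histograms is not given in the slides); they were chosen so that 46% of them fall between 65 and 75. The helper names `histogram` and `count_peaks` are illustrative and do not come from any library.

```python
# Synthetic observations (an assumption, not the data behind the slides):
# value -> number of times it occurs, 100 observations in total.
SYNTHETIC_COUNTS = {
    36: 1, 39: 1, 42: 2, 45: 2, 48: 3, 51: 4, 54: 5, 57: 9, 60: 14,
    63: 7, 66: 10, 69: 20, 72: 16, 75: 2, 78: 2, 81: 1, 84: 1,
}
observations = [v for v, n in SYNTHETIC_COUNTS.items() for _ in range(n)]


def histogram(values, start, width, n_bins):
    """Count values in n_bins left-closed intervals of equal width."""
    counts = [0] * n_bins
    for v in values:
        idx = int((v - start) // width)
        if 0 <= idx < n_bins:  # values outside the range are ignored
            counts[idx] += 1
    return counts


def count_peaks(counts):
    """Number of interior bins strictly higher than both neighbours."""
    return sum(
        1
        for i in range(1, len(counts) - 1)
        if counts[i - 1] < counts[i] > counts[i + 1]
    )


# Same data, two interval widths.
coarse = histogram(observations, start=35, width=10, n_bins=5)
fine = histogram(observations, start=35, width=3, n_bins=17)

print("width 10:", coarse, "peaks:", count_peaks(coarse))
print("width  3:", fine, "peaks:", count_peaks(fine))
```

With the wide intervals the counts rise steadily to a single peak in the 65 to 75 interval; with intervals of width 3 the same observations show two separate peaks.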
6. These are correlated. Does that mean one causes the other?
Year  Divorce rate in Maine  Per capita consumption of margarine
2000  5                      8.2
2001  4.7                    7
2002  4.6                    6.5
2003  4.4                    5.3
2004  4.3                    5.2
2005  4.1                    4
2006  4.2                    4.6
2007  4.2                    4.5
2008  4.2                    4.2
2009  4.1                    3.7
Correlation: 0.992558

Year  Per capita consumption of chicken  Total US crude oil imports
2000  54.2                               3.311
2001  54                                 3.405
2002  56.8                               3.336
2003  57.5                               3.521
2004  59.3                               3.674
2005  60.5                               3.67
2006  60.9                               3.685
2007  59.9                               3.656
2008  58.7                               3.571
2009  56                                 3.307
Correlation: 0.899899

Year  Number of people who died by becoming tangled in their bedsheets  Total revenue generated by skiing facilities
2000  327                                                               1.551
2001  456                                                               1.635
2002  509                                                               1.801
2003  497                                                               1.827
2004  596                                                               1.956
2005  573                                                               1.989
2006  661                                                               2.178
2007  741                                                               2.257
2008  809                                                               2.476
2009  717                                                               2.438
Correlation: 0.969724
Source: http://tylervigen.com/spurious-correlations
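The first correlation on this slide can be checked by hand. The sketch below computes the Pearson correlation coefficient for the divorce rate and margarine figures exactly as listed; the function name `pearson` and the variable names are illustrative, and the slide does not say which tool produced its number, so treating it as a Pearson coefficient is an assumption.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Values for 2000 to 2009, copied from the slide.
divorce_rate_maine = [5.0, 4.7, 4.6, 4.4, 4.3, 4.1, 4.2, 4.2, 4.2, 4.1]
margarine_per_capita = [8.2, 7.0, 6.5, 5.3, 5.2, 4.0, 4.6, 4.5, 4.2, 3.7]

r = pearson(divorce_rate_maine, margarine_per_capita)
print(f"correlation: {r:.6f}")
```

The result should agree with the value quoted on the slide to about six decimal places. A coefficient this close to 1 only says the two series move together over these ten years; it says nothing about why.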
7. Correlation does not mean causation
The idea that "correlation proves causation" is a questionable-cause logical fallacy: two events occurring together are taken to have established a cause-and-effect relationship.
This fallacy is also known as cum hoc ergo propter hoc, Latin for "with this, therefore because of this," and as "false cause."
A similar fallacy, that an event that followed another was necessarily a consequence of the first event, is the post hoc ergo propter hoc fallacy, Latin for "after this, therefore because of this."
Source: http://tylervigen.com/spurious-correlations & Wikipedia
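One common way two unrelated series end up strongly correlated is a shared hidden driver, such as a trend over time. The sketch below is an illustration under stated assumptions and is not taken from the talk: the data is hypothetical, `x` and `y` are both built from a driver `z` plus unrelated noise, and the helpers `pearson` and `residuals` are illustrative names. Once the straight-line dependence on `z` is removed from both series (a simple partial-correlation check), the correlation disappears.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def residuals(ys, zs):
    """What is left of ys after removing its least-squares line on zs."""
    n = len(ys)
    mean_y = sum(ys) / n
    mean_z = sum(zs) / n
    slope = sum((z - mean_z) * (y - mean_y) for y, z in zip(ys, zs)) / sum(
        (z - mean_z) ** 2 for z in zs
    )
    return [(y - mean_y) - slope * (z - mean_z) for y, z in zip(ys, zs)]


# Hypothetical data: z is the hidden common driver (for example, time).
z = list(range(10))
noise_x = [1, -1, -1, 1, 0, 0, 1, -1, -1, 1]  # unrelated to z and to noise_y
noise_y = [1, 1, -1, -1, 0, 0, -1, -1, 1, 1]
x = [2 * zi + e for zi, e in zip(z, noise_x)]  # x depends only on z
y = [3 * zi + e for zi, e in zip(z, noise_y)]  # y depends only on z

r_raw = pearson(x, y)
r_partial = pearson(residuals(x, z), residuals(y, z))
print(f"raw correlation of x and y:         {r_raw:.3f}")
print(f"correlation after removing z trend: {r_partial:.3f}")
```

Neither `x` nor `y` influences the other here, yet their raw correlation is high because both follow `z`.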
8. Need training (who doesn't)? No problem.
9. Specifics on Data Science
16. … with a free and open environment available
17. I understand IBM supports open source, but we need support and guarantees.
Any commercial software?
18. Data Science & Machine Learning
A product for each technique / user profile:
Data Science Experience
SPSS
Decision Optimization
Machine Learning
Watson Analytics
IBM is making data science and machine learning simple and open.
20. IBM SPSS Product portfolio
IBM SPSS Modeler Gold
IBM SPSS Modeler Professional
IBM SPSS Analytic Server
IBM SPSS Statistics
IBM SPSS C&DS
IBM SPSS Decision Management
IBM SPSS Modeler Premium
21. Doubts, concerns, questions, suggestions?
Let me know:
ramiro.rego@es.ibm.com
Thanks!