This presentation is about a lecture I gave within the "Green Lab" course of the Computer Science master, Software Engineering and Green IT track of the Vrije Universiteit Amsterdam: http://masters.vu.nl/en/programmes/computer-science-software-engineering-green-it/index.aspx
http://www.procaccianti.me
1. 1 It starts with an idea
Data Analysis
Descriptive Statistics and EDA
Giuseppe Procaccianti
2. Vrije Universiteit Amsterdam
2 Giuseppe Procaccianti / S2 group / The Green Lab
Quick Recap
Idea → Experiment scoping → Experiment planning → Experiment operation → Analysis & interpretation → Presentation & package
3. Vrije Universiteit Amsterdam
Analysis and Interpretation
● Understanding the data
○ descriptive statistics
○ exploratory data analysis (EDA, e.g. boxplots, scatter plots)
● (Optional) data reduction
● Hypothesis testing
● Results interpretation
4. Vrije Universiteit Amsterdam
Descriptive Statistics
● Goal: get a feel for how the data is distributed
● Properties:
○ Central Tendency (e.g. Mean, Median)
○ Dispersion (e.g. Frequency, Standard Deviation)
○ Dependency (e.g. Correlation)
5. Vrije Universiteit Amsterdam
Parameter vs. statistic
● Parameter: feature of the population
○ μ: mean
○ σ: standard deviation
● Statistic: feature of the sample
○ x̄: mean
○ s: standard deviation
● Statistics are estimates of parameters
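The relation between parameter and statistic can be illustrated with a small simulation. This is only a sketch: the population is assumed to be normally distributed, and the values μ = 100 and σ = 15 are hypothetical, chosen for illustration rather than taken from the lecture.

```python
import random
import statistics

# Hypothetical population parameters (illustrative values only).
MU = 100.0
SIGMA = 15.0

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Draw a sample of 10,000 observations from the assumed population.
sample = [rng.gauss(MU, SIGMA) for _ in range(10_000)]

x_bar = statistics.mean(sample)  # statistic that estimates mu
s = statistics.stdev(sample)     # statistic that estimates sigma (n - 1 denominator)

print(f"x_bar = {x_bar:.2f} (mu = {MU})")
print(f"s     = {s:.2f} (sigma = {SIGMA})")
```

The sample statistics land close to the parameters without matching them exactly; the gap shrinks as the sample grows.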
6. Vrije Universiteit Amsterdam
Central Tendency
● Arithmetic mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
● Geometric mean: $\left( \prod_{i=1}^{n} x_i \right)^{1/n}$
7. Vrije Universiteit Amsterdam
Central Tendency: example
● Average of scores: 6, 7, 8, 9, 10
● Arithmetic mean: 8
● Geometric mean: ~7.87
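The two means for the scores above can be reproduced in a few lines. A minimal sketch, assuming Python 3.8 or later for `math.prod`:

```python
import math
import statistics

scores = [6, 7, 8, 9, 10]

# Arithmetic mean: sum of the values divided by their count.
arithmetic = statistics.mean(scores)

# Geometric mean: n-th root of the product of the values.
geometric = math.prod(scores) ** (1 / len(scores))

print(arithmetic)           # 8
print(round(geometric, 2))  # 7.87
```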
8. Vrije Universiteit Amsterdam
Central Tendency: example
● Average of returns on investments:
90% ; 10% ; 20% ; 30% ; -90%
● Arithmetic mean:
(90 + 10 + 20 + 30 - 90) / 5 = 12%
● Geometric mean:
[(1.9 × 1.1 × 1.2 × 1.3 × 0.1)^(1/5)] - 1 = -0.2008 = -20.08%
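The investment example can be checked with the sketch below. Each return is first converted into a growth factor (1 + r), which is the usual way to apply the geometric mean to rates; the variable names are illustrative.

```python
import math

returns = [0.90, 0.10, 0.20, 0.30, -0.90]

arithmetic = sum(returns) / len(returns)

# Turn each return into a growth factor, take the geometric mean of the
# factors, then convert the result back into a rate.
factors = [1 + r for r in returns]
total_growth = math.prod(factors)
geometric = total_growth ** (1 / len(factors)) - 1

print(f"{arithmetic:.2%}")    # 12.00%
print(f"{geometric:.2%}")     # -20.08%
print(f"{total_growth:.3f}")  # 0.326
```

After the five periods the investment is worth about a third of its starting value, so the negative geometric mean describes the outcome, while the arithmetic mean of +12% does not.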
9. Vrije Universiteit Amsterdam
Central Tendency
● Median (or 50th percentile): middle value separating the greater and lesser halves of a data set
X = [13, 18, 13, 14, 13, 16, 14, 21, 13]
X_sort = [13, 13, 13, 13, 14, 14, 16, 18, 21]
Median = 14
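The same steps (sort, then pick the middle element) can be sketched in Python. The by-hand version below only covers an odd number of values; `statistics.median` also handles the even case by averaging the two middle values.

```python
import statistics

X = [13, 18, 13, 14, 13, 16, 14, 21, 13]

X_sort = sorted(X)
middle = len(X_sort) // 2        # index 4 for n = 9
median_by_hand = X_sort[middle]  # valid as written only for odd n

print(X_sort)                # [13, 13, 13, 13, 14, 14, 16, 18, 21]
print(median_by_hand)        # 14
print(statistics.median(X))  # 14
```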
10. Vrije Universiteit Amsterdam
Central Tendency
● Mode: most frequent value in data set
X = [13, 18, 13, 14, 13, 16, 14, 21, 13]
Mo_X = 13
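Counting occurrences gives the mode directly. A short sketch using the standard library (`statistics.multimode` assumes Python 3.8 or later and returns every value that shares the highest frequency):

```python
from collections import Counter
import statistics

X = [13, 18, 13, 14, 13, 16, 14, 21, 13]

counts = Counter(X)
mode_value, mode_count = counts.most_common(1)[0]

print(mode_value, mode_count)   # 13 4
print(statistics.multimode(X))  # [13]
```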
12. Vrije Universiteit Amsterdam
Dispersion
● Sample variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
● Standard deviation: $s = \sqrt{s^2}$
● The standard deviation is expressed in the same unit as the data
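A minimal sketch of the sample variance, written out with the n - 1 denominator and checked against the standard library. The function name `sample_variance` is illustrative; the dataset [90, 100, 110] is one of the examples used later in this lecture.

```python
import math
import statistics

def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

data = [90, 100, 110]

s2 = sample_variance(data)
s = math.sqrt(s2)  # same unit as the data, unlike the variance

print(s2, s)                      # 100.0 10.0
print(statistics.variance(data))  # 100
```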
13. Vrije Universiteit Amsterdam
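Dispersion - three-sigma rule
Figure: for normally distributed data, about 68%, 95% and 99.7% of the values lie within one, two and three standard deviations of the mean.
"Empirical Rule" by Dan Kernler - Own work. Licensed under CC BY-SA 4.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Empirical_Rule.PNG#/media/File:Empirical_Rule.PNG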
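The percentages behind the three-sigma rule follow from the normal distribution: the probability of falling within k standard deviations of the mean is erf(k / √2). The sketch below computes them; the helper name `fraction_within` is illustrative, and the result holds only under the normality assumption.

```python
import math

def fraction_within(k):
    """P(|X - mu| < k * sigma) for a normally distributed X."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma: {fraction_within(k):.2%}")
# within 1 sigma: 68.27%
# within 2 sigma: 95.45%
# within 3 sigma: 99.73%
```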
14. Vrije Universiteit Amsterdam
Dispersion - three-sigma-rule
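● Range: $x_{max} - x_{min}$
● Coefficient of variation: $100 \cdot \frac{s}{\bar{x}}$ (in percentage of mean)
● The coefficient of variation is only meaningful when all values are positive, i.e. on a ratio scale; it is not meaningful on an interval scale (e.g. temperatures in degrees Celsius)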
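Why the scale matters can be seen by computing the coefficient of variation for the same temperatures in Celsius (interval scale, arbitrary zero) and in Kelvin (ratio scale, true zero). The readings below are hypothetical and the function name is illustrative.

```python
import statistics

def coefficient_of_variation(data):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(data) / statistics.mean(data)

celsius = [10.0, 20.0, 30.0]            # hypothetical readings
kelvin = [c + 273.15 for c in celsius]  # same temperatures, different zero point

cv_c = coefficient_of_variation(celsius)
cv_k = coefficient_of_variation(kelvin)

print(f"{cv_c:.1%}")  # 50.0%
print(f"{cv_k:.1%}")  # 3.4%
```

The spread of the data is identical in both cases (s = 10), yet the coefficient changes with the choice of zero point, which is why it is only interpretable on a ratio scale.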
15. Vrije Universiteit Amsterdam
Dispersion - example
● Dataset: [100, 100, 100]
● Mean: 100
● Variance: 0
● Standard Deviation: 0
● Coeff. Variation: 0
● Range: 0
16. Vrije Universiteit Amsterdam
Dispersion - example
● Dataset: [90, 100, 110]
● Mean: 100
● Sample Variance: 100
● Standard Deviation: 10
● Coeff. Variation: 10%
● Range: 20
17. Vrije Universiteit Amsterdam
Dispersion - example
● Dataset: [1, 5, 6, 8, 10, 40, 65, 88]
● Mean: 27.875
● Sample Variance: 1082.69
● Standard Deviation: 32.9
● Coeff. Variation: 118%
● Range: 87
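The three dispersion examples can be recomputed with one helper. This is a sketch; the function name `describe` and the dictionary layout are illustrative choices.

```python
import statistics

def describe(data):
    """Mean and dispersion measures for one dataset (sample formulas)."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    return {
        "mean": mean,
        "variance": statistics.variance(data),
        "stdev": s,
        "cv": s / mean,  # as a fraction of the mean; multiply by 100 for %
        "range": max(data) - min(data),
    }

datasets = ([100, 100, 100], [90, 100, 110], [1, 5, 6, 8, 10, 40, 65, 88])
for dataset in datasets:
    summary = describe(dataset)
    print(dataset, {name: round(value, 3) for name, value in summary.items()})
```

For the third dataset the standard deviation (about 32.9) exceeds the mean (27.875), so the coefficient of variation comes out at about 1.18, that is 118% of the mean.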
18. Vrije Universiteit Amsterdam
Basic visualizations
Box Plot (figure labels: 1st quartile, median, 3rd quartile)
20. Vrije Universiteit Amsterdam
Basic visualizations
Box Plot
By Gbdivers (Own work) [GFDL (http://www.gnu.org/copyleft/fdl.html) or CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
Figure labels: outliers, positive skewness
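The numbers a box plot draws can be computed without plotting. The sketch below uses one of the dispersion datasets from this lecture and assumes the common 1.5 × IQR convention (Tukey's fences) for flagging outliers, which the slides do not state explicitly. Quartile conventions also differ between tools; `statistics.quantiles` uses its default "exclusive" method here.

```python
import statistics

data = [1, 5, 6, 8, 10, 40, 65, 88]

# Quartiles: the box spans q1..q3, with a line at the median.
q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

mean = statistics.mean(data)

print(q1, median, q3)  # 5.25 9.0 58.75
print(outliers)        # []
print(mean > median)   # True
```

The median sits much closer to the first quartile than to the third, and the mean lies well above the median; both are typical signs of positive skewness.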
21. Vrije Universiteit Amsterdam
Dependency: correlation
● Sample correlation coefficient (Pearson): $r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y}$
● Meaningful when comparing paired values/datasets
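A minimal sketch of Pearson's r following its standard definition (covariance divided by the product of the standard deviations, with the n - 1 factors cancelling). The paired values are hypothetical and the function name is illustrative.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two paired samples."""
    if len(xs) != len(ys):
        raise ValueError("Pearson's r needs paired observations of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations.
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp_xy / math.sqrt(ss_x * ss_y)

# Hypothetical paired measurements (not from the lecture).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
print(round(r, 4))  # 0.7746
```

The length check reflects the point about paired values: each x must belong to exactly one y.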
22. Vrije Universiteit Amsterdam
Dependency: correlation
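● Spearman’s rank correlation coefficient: $\rho$, the Pearson coefficient computed on the ranks of the values
● Kendall’s rank correlation coefficient: $\tau = \frac{n_c - n_d}{n(n-1)/2}$, where $n_c$ and $n_d$ are the numbers of concordant and discordant pairs
○ usually yields smaller values than Spearman’s $\rho$
○ more accurate on small samples
● The Pearson correlation coefficient assumes normally distributed data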
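Both rank coefficients can be sketched in plain Python. The assumptions here: Spearman's coefficient is computed as Pearson's r on average ranks, and Kendall's coefficient is the simple tau-a variant, which counts tied pairs as neither concordant nor discordant. The data and all function names are hypothetical.

```python
import math
from itertools import combinations

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    return sp / math.sqrt(ss_x * ss_y)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman_rho(xs, ys):
    return pearson_r(ranks(xs), ranks(ys))

def kendall_tau_a(xs, ys):
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical paired data (not from the lecture).
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

print(round(spearman_rho(xs, ys), 4))   # 0.8
print(round(kendall_tau_a(xs, ys), 4))  # 0.6
```

On this data Kendall's coefficient is smaller than Spearman's, as the slide notes. For a monotonic but non-linear relation such as y = x³, both rank coefficients equal 1 while Pearson's r stays below 1.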
23. Vrije Universiteit Amsterdam
Dependency: example
Age vs. body fat %
● Pearson: r = 0.7921
● Spearman: ρ = 0.7539
● Kendall: τ = 0.5762
25. Vrije Universiteit Amsterdam
Basic Visualizations
Image source: http://www.cqeacademy.com/cqe-body-of-knowledge/continuous-improvement/quality-control-tools/the-scatter-plot-linear-regression/
Scatter plots for different values of r
26. Vrije Universiteit Amsterdam
Correlation does NOT imply causation!
● Spurious Correlations: http://tylervigen.com/
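A small sketch of how a spurious correlation can arise: two quantities with no causal link both grow over time, so time acts as a hidden common factor. Both series are synthetic, made up purely for illustration.

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    return sp / math.sqrt(ss_x * ss_y)

years = range(10)

# Two synthetic series that depend only on time, never on each other.
series_a = [10 + 3 * t + (t % 3) for t in years]
series_b = [200 + 7 * t - 2 * (t % 2) for t in years]

r = pearson_r(series_a, series_b)
print(r > 0.9)  # True
```

The coefficient is close to 1 although neither series influences the other; the shared upward trend alone produces it.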