Web Analytics with R
Alexandros Papageorgiou
About me
Started @ Google Ireland
Career break
Web analyst @
About the talk
Intro Analytics
Live Demo
Practical Applications x 3
Part I: Intro
Web analytics now and then…
Getting started overview
1. Get some web data for a start
2. Get the right / accurate / relevant data ***
3. Analyse the data
Google Analytics API + R
Why ?
Large queries ?
Freedom from the limits of the GA user interface
Automation, reproducibility, applications
Richer datasets up to 7 Dimensions and 10 Metrics
Handle queries of 10K - 1M records
Mitigate the effect of Query Sampling
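One common way to keep each query small, and so reduce the chance of sampling, is to split a long date range into short windows and request them one at a time. The following is a minimal base R sketch of that idea; `split_date_range()` is a hypothetical helper written for illustration (it is not part of RGA), and the API call itself is left out.

```r
# Hypothetical helper (not part of RGA): split a date range into
# consecutive windows of at most `days` days each.
split_date_range <- function(start, end, days = 7) {
  start <- as.Date(start)
  end   <- as.Date(end)
  stopifnot(start <= end, days >= 1)
  starts <- seq(start, end, by = paste(days, "days"))  # first day of each window
  ends   <- pmin(starts + days - 1, end)               # last day, capped at `end`
  data.frame(start.date = starts, end.date = ends)
}

windows <- split_date_range("2015-03-01", "2015-03-31", days = 7)
print(windows)
```

Each row could then be passed as the start and end date of one query, and the partial results combined with `rbind()`.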
The package: RGA
Author Artem Klevtsov
Access to multiple GA APIs
Shiny app to explore dimensions and metrics.
Actively developed + good documentation
Part II: Demo
Part III: Applications
Practical applications
Ecommerce website (simulated data)
Advertising campaign effectiveness (Key Ratios)
Adgroup performance (Clustering)
Key factors leading to conversion (Decision Tree)
1. Key Performance Ratios
Commonly used in Business and finance analysis
Good for data exploration in context
Key Ratios: Getting the data
by_medium <- get_ga(profile.id = 106368203,
                    start.date = "2014-11-01", 
                    end.date = "2015-08-21", 
                    metrics = "ga:transactions, ga:sessions",
                    dimensions = "ga:date, ga:medium",
                    sort = NULL, 
                    filters = NULL, 
                    segment = NULL, 
                    sampling.level = NULL,
                    start.index = NULL, 
                    max.results = NULL)
Sessions and Transactions by medium
##         date   medium transactions sessions
## 1 2014-11-01   (none)            0       57
## 2 2014-11-01   search            0       10
## 3 2014-11-01  display            3      422
## 4 2014-11-01  organic            0       30
## 5 2014-11-01 referral            1       40
## 6 2014-11-02   (none)            0       63
Calculating the ratios
ConversionQualityIndex = (% of transactions by medium) / (% of sessions by medium)
by_medium_ratios <- by_medium %>% 
    group_by(date) %>%  # sum sessions & transactions by date
    mutate(tot.sess = sum(sessions), tot.trans = sum(transactions)) %>% 
    mutate(pct.sessions = 100*sessions/tot.sess,   # calculate % sessions by medium
           pct.trans = 100*transactions/tot.trans, # calculate % transactions by medium
           conv.rate = 100*transactions/sessions) %>%     # conversion rate by medium
    mutate(ConvQualityIndex = pct.trans/pct.sessions) %>%  # conv quality index.
    filter(medium %in% c("search", "display", "referral"))    # the top 3 channels 
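As a worked check of the index, the sketch below recomputes the ratios in base R for a single day, using the 2014-11-01 rows from the sample output printed earlier. The data frame name `day` is only illustrative.

```r
# Sessions and transactions for 2014-11-01, copied from the sample output
day <- data.frame(
  medium       = c("(none)", "search", "display", "organic", "referral"),
  transactions = c(0, 0, 3, 0, 1),
  sessions     = c(57, 10, 422, 30, 40),
  stringsAsFactors = FALSE
)

# Share of the day's sessions and transactions contributed by each medium
day$pct.sessions <- 100 * day$sessions / sum(day$sessions)
day$pct.trans    <- 100 * day$transactions / sum(day$transactions)

# Conversion quality index: transaction share relative to session share
day$ConvQualityIndex <- day$pct.trans / day$pct.sessions

print(day[day$medium %in% c("search", "display", "referral"), ])
```

An index above 1 means the medium brings a larger share of transactions than of sessions. Here referral has about 7% of sessions but 25% of transactions, giving an index of about 3.49.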
Ratios table
columns <- c(1, 2, 7:10)
head(by_medium_ratios[columns])  # display selected columns
## Source: local data frame [6 x 6]
## Groups: date
##         date   medium pct.sessions pct.trans conv.rate ConvQualityIndex
## 1 2014-11-01   search    1.7889088   0.00000 0.0000000        0.0000000
## 2 2014-11-01  display   75.4919499  75.00000 0.7109005        0.9934834
## 3 2014-11-01 referral    7.1556351  25.00000 2.5000000        3.4937500
## 4 2014-11-02   search    0.5995204   0.00000 0.0000000        0.0000000
## 5 2014-11-02  display   79.1366906  66.66667 0.3030303        0.8424242
## 6 2014-11-02 referral    9.5923261  33.33333 1.2500000        3.4750000
Sessions % by medium
ggplot(by_medium_ratios, aes(date, pct.sessions, color = medium)) + 
    geom_point() + geom_jitter()+ geom_smooth() + ylim(0, 100)
Transactions % by medium
ggplot(by_medium_ratios, aes(date, pct.trans, color = medium)) + 
    geom_point() + geom_jitter() + geom_smooth()  
Conversion Quality Index by medium
ggplot(by_medium_ratios, aes(date, ConvQualityIndex , color = medium)) + 
    geom_point(aes(size=tot.trans)) + geom_jitter() + geom_smooth() + ylim(0,  5) +
    geom_hline(yintercept = 1, linetype="dashed", size = 1, color = "white") 
2. Clustering for Ad groups
Unsupervised learning
Discovers structure in data
Based on a similarity criterion
Ad Group Clustering: Getting the Data
profile.id = "12345678"
start.date = "2015-01-01"
end.date = "2015-03-31"
metrics = "ga:sessions, ga:transactions, 
           ga:adCost, ga:transactionRevenue, 
           ga:pageviewsPerSession"
dimensions = "ga:adGroup"
adgroup_data <- get_ga(profile.id = profile.id, 
                       start.date = start.date, 
                       end.date = end.date, 
                       metrics = metrics, 
                       dimensions = dimensions)
Hierarchical Clustering
top_adgroups <- adgroup_data %>% 
    filter(transactions > 10) %>%    # keeping only where transactions > 10 
    filter(ad.group != "(not set)") 
n <- nrow(top_adgroups)
rownames(top_adgroups) <- paste("adG", 1:n) # short codes for adgroups
top_adgroups <- select(top_adgroups, -ad.group) # remove long adgroup names 
scaled_adgroups <- scale(top_adgroups)  # scale the values
Matrix: Scaled adgroup values.
##         sessions transactions    ad.cost transaction.revenue
## adG 1  0.3790902   2.72602456 -0.7040545           3.7397620
## adG 2 -0.6137714   0.05134068 -0.7111664           0.2086295
## adG 3  0.3207199   0.30473179  0.3411098           0.5346303
## adG 4  0.9956617   0.78335943  0.9139105           1.1769897
## adG 5 -0.2261473  -0.65252350 -0.1688845          -0.6691330
## adG 6 -0.6863092  -0.59621436 -0.5979007          -0.4800614
##       pageviews.per.session
## adG 1            1.93262389
## adG 2            1.05619885
## adG 3           -0.74163568
## adG 4           -0.32186999
## adG 5           -1.01079490
## adG 6            0.01538598
hc <- hclust(dist(scaled_adgroups)) 
plot(hc, hang = -1)
rect.hclust(hc, k = 3, border = "red")   
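The rectangles only mark the clusters on the plot. To get the cluster label of each ad group as data, the tree can be cut with `cutree()`. The sketch below uses only the six scaled rows printed above as toy input, so the resulting groups are illustrative and not the result of the talk; the object names `scaled_sample` and `hc_sample` are made up for the example.

```r
# Toy input: the six scaled rows shown in the matrix printout above
scaled_sample <- matrix(
  c( 0.3790902,  2.72602456, -0.7040545,  3.7397620,  1.93262389,
    -0.6137714,  0.05134068, -0.7111664,  0.2086295,  1.05619885,
     0.3207199,  0.30473179,  0.3411098,  0.5346303, -0.74163568,
     0.9956617,  0.78335943,  0.9139105,  1.1769897, -0.32186999,
    -0.2261473, -0.65252350, -0.1688845, -0.6691330, -1.01079490,
    -0.6863092, -0.59621436, -0.5979007, -0.4800614,  0.01538598),
  ncol = 5, byrow = TRUE,
  dimnames = list(paste("adG", 1:6),
                  c("sessions", "transactions", "ad.cost",
                    "transaction.revenue", "pageviews.per.session"))
)

# Euclidean distance and complete linkage are the defaults of dist() and hclust()
hc_sample <- hclust(dist(scaled_sample))

# Cut the dendrogram into 3 groups: a named vector of cluster labels
clusters <- cutree(hc_sample, k = 3)
print(clusters)

# Mean of each scaled metric per cluster, to help describe the groups
profile <- aggregate(as.data.frame(scaled_sample),
                     by = list(cluster = clusters), FUN = mean)
print(profile)
```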
library(gplots); library(RColorBrewer)
my_palette <- colorRampPalette(c('white', 'yellow', 'green'))(256)
heatmap.2(scaled_adgroups,
          cexRow = 0.7, 
          cexCol = 0.7,          
          col = my_palette,     
          rowsep = c(1, 5, 10, 14),
          lwid = c(lcm(8),lcm(8)),
          srtCol = 45,
          adjCol = c(1, 1),
          colsep = c(1, 2, 3, 4),
          sepcolor = "white", 
          sepwidth = c(0.01, 0.01),  
          scale = "none",         
          dendrogram = "row",    
          offsetRow = 0,
          offsetCol = 0)
Clusters based on 5 key values
3. Decision Trees
Handle categorical + numerical variables
Mimic the human decision-making process
Greedy approach
3. Pushing the API
profile.id = "12345678"
start.date = "2015-03-01"
end.date = "2015-03-31"
dimensions = "ga:dateHour, ga:minute, ga:sourceMedium, ga:operatingSystem, 
              ga:subContinent, ga:pageDepth, ga:daysSinceLastSession"
metrics = "ga:sessions, ga:percentNewSessions,  ga:transactions, 
           ga:transactionRevenue, ga:bounceRate, ga:avgSessionDuration,
           ga:pageviewsPerSession, ga:bounces, ga:hits"
ga_data <- get_ga(profile.id = profile.id, 
                  start.date = start.date, 
                  end.date = end.date, 
                  metrics = metrics, 
                  dimensions = dimensions)
The Data
## Source: local data frame [6 x 16]
##     dateHour minute       sourceMedium operatingSystem    subContinent
## 1 2015030100     00 facebook / display         Windows Southern Europe
## 2 2015030100     01 facebook / display       Macintosh Southern Europe
## 3 2015030100     01       google / cpc         Windows Northern Europe
## 4 2015030100     01       google / cpc             iOS Southern Europe
## 5 2015030100     02 facebook / display       Macintosh Southern Europe
## 6 2015030100     02 facebook / display         Windows  Western Europe
## Variables not shown: pageDepth (chr), daysSinceLastSession (chr), sessions
##   (dbl), percentNewSessions (dbl), transactions (dbl), transactionRevenue
##   (dbl), bounceRate (dbl), avgSessionDuration (dbl), pageviewsPerSession
##   (dbl), hits (dbl), Visitor (chr)
Imbalanced class
Approach: page depth > 5 used as a proxy for conversion
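A minimal sketch of building that proxy target and checking how balanced it is. The page depths are simulated and the name `deep_visit` is hypothetical; the conversion from character reflects that `pageDepth` arrives as `chr` in the data shown above.

```r
# Simulated page depths, stored as character like the GA dimension
page_depth <- c("1", "2", "1", "7", "3", "12", "1", "6", "2", "1")

# Binary target: "yes" when the session went deeper than 5 pages
deep_visit <- factor(as.integer(page_depth) > 5,
                     levels = c(FALSE, TRUE), labels = c("no", "yes"))

# Share of each class; a heavily lopsided split signals class imbalance
balance <- prop.table(table(deep_visit))
print(balance)
```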
Data preparation
Session data made "almost" granular
Removed invalid sessions
Extra dimension added (user type)
Removed highly correlated vars
Data split into train and test
Day of the week extracted from date
Days since last session placed in buckets
Date converted to weekday or weekend
Datehour split in two component variables
Geography split between top sub-continents and Other
Hour converted to AM or PM
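Several of the preparation steps above can be sketched in base R. The input rows are made up, and the bucket boundaries, the list of "top" sub-continents and all column names are assumptions chosen for illustration, not the values used in the talk.

```r
# Made-up rows in the shape returned by the API (all dimensions are character)
raw <- data.frame(
  dateHour             = c("2015030100", "2015030414", "2015030709"),
  subContinent         = c("Southern Europe", "Northern Europe", "Eastern Asia"),
  daysSinceLastSession = c("0", "3", "45"),
  stringsAsFactors = FALSE
)

# Split dateHour (YYYYMMDDHH) into its two components
date <- as.Date(substr(raw$dateHour, 1, 8), format = "%Y%m%d")
hour <- as.integer(substr(raw$dateHour, 9, 10))
wday <- as.POSIXlt(date)$wday          # 0 = Sunday ... 6 = Saturday

# Assumed list of top regions; everything else is grouped as "Other"
top_regions <- c("Southern Europe", "Northern Europe", "Western Europe")

prepared <- data.frame(
  day.type = ifelse(wday %in% c(0, 6), "weekend", "weekday"),
  am.pm    = ifelse(hour < 12, "AM", "PM"),
  # Assumed bucket boundaries for days since last session
  recency  = cut(as.integer(raw$daysSinceLastSession),
                 breaks = c(-Inf, 0, 7, 30, Inf),
                 labels = c("same day", "1-7 days", "8-30 days", "30+ days")),
  region   = ifelse(raw$subContinent %in% top_regions,
                    raw$subContinent, "Other"),
  stringsAsFactors = FALSE
)
print(prepared)
```

Using the numeric `wday` field rather than `weekdays()` keeps the result independent of the system locale.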
Decision Tree with rpart
fit <- rpart(pageDepth ~ ., data = Train,       # pageDepth is a binary variable
             method = 'class',  
             control = rpart.control(minsplit = 10, cp = 0.001, xval = 10)) 
# printcp(fit)
fit <- prune(fit, cp = 1.7083e-03)   # prune the tree based on chosen param value
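The pruning value above was read off the `printcp()` table by hand. As a sketch of the same workflow end to end, the example below fits a tree on simulated data, picks the complexity parameter with the lowest cross-validated error from `fit$cptable`, and evaluates the pruned tree on a held-out test set. The data, the variable names and the minimum-`xerror` rule are assumptions for illustration, not the setup of the talk.

```r
library(rpart)

set.seed(42)
n <- 1000

# Simulated session-level data with a binary target
sessions <- data.frame(
  medium       = factor(sample(c("display", "search", "referral"), n, replace = TRUE)),
  am.pm        = factor(sample(c("AM", "PM"), n, replace = TRUE)),
  avg.duration = rexp(n, rate = 1 / 120)
)
p <- plogis(-2 + 1.5 * (sessions$medium == "referral") + sessions$avg.duration / 200)
sessions$deep.visit <- factor(ifelse(runif(n) < p, "yes", "no"), levels = c("no", "yes"))

# 70 / 30 split into training and test sets
idx   <- sample(n, round(0.7 * n))
train <- sessions[idx, ]
test  <- sessions[-idx, ]

# Grow a large tree with 10-fold cross-validation, as in the slide
fit <- rpart(deep.visit ~ ., data = train, method = "class",
             control = rpart.control(minsplit = 10, cp = 0.001, xval = 10))

# Pick the cp value with the lowest cross-validated error and prune
cp_table <- fit$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
pruned   <- prune(fit, cp = best_cp)

# Evaluate on the held-out test set
pred     <- predict(pruned, newdata = test, type = "class")
conf_mat <- table(predicted = pred, actual = test$deep.visit)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(conf_mat)
print(accuracy)
```

With an imbalanced target, accuracy alone can be misleading, so the confusion matrix is printed alongside it.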
The Tree
…Possible Actions ?
Web analytics not just for marketers!
But not a magic bullet either… (misses the wealth of atomic-level data)
Solutions ?
What's coming next ?
Thank you!
