13. Extra Packages
Packages provide extra functionality and algorithms. You can install them
from the interface, or add the install calls to your script:
install.packages("RJDBC", dependencies = TRUE)
install.packages("ggplot2", dependencies = TRUE)
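To avoid reinstalling on every run, a script can first check which packages are already available. The following is a minimal sketch, assuming only the two package names from this slide; the variable names `wanted` and `not_installed` are made up for illustration, and the install call is left commented out because it needs network access.

```r
# Packages named on this slide.
wanted <- c("RJDBC", "ggplot2")

# requireNamespace() returns FALSE instead of raising an error
# when a package cannot be loaded.
available <- vapply(wanted, requireNamespace, logical(1), quietly = TRUE)
not_installed <- wanted[!available]

if (length(not_installed) > 0) {
  message("Not installed: ", paste(not_installed, collapse = ", "))
  # Uncomment to install; requires network access and a CRAN mirror.
  # install.packages(not_installed, dependencies = TRUE)
}
```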
18. Combine Linear Model and ggplot2
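library(ggplot2)
# qsec on the x axis, wt on the y axis
p <- ggplot(mtcars, aes(qsec, wt))
p + stat_smooth()
p + stat_smooth() + geom_point()
# Adjust parameters
p + stat_smooth(se = FALSE) + geom_point()      # hide the confidence band
p + stat_smooth(span = 0.9) + geom_point()      # smoother loess curve
p + stat_smooth(level = 0.99) + geom_point()    # 99% confidence band
p + stat_smooth(method = "lm") + geom_point()   # straight-line linear model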
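The line drawn by stat_smooth(method = "lm") corresponds to an ordinary linear model of wt on qsec. As a rough sketch of what happens behind the plot, the same model can be fitted directly with base R; the names `fit_lm`, `grid`, and `band` are illustrative, and the 0.95 level is stat_smooth's default.

```r
# Fit the straight line: wt as a function of qsec.
fit_lm <- lm(wt ~ qsec, data = mtcars)

# Evaluate the fit and its confidence band on a grid of qsec values.
grid <- data.frame(qsec = seq(min(mtcars$qsec), max(mtcars$qsec),
                              length.out = 50))
band <- predict(fit_lm, newdata = grid,
                interval = "confidence", level = 0.95)

# Intercept and slope, then the first few rows of the band.
print(coef(fit_lm))
print(head(cbind(grid, band)))
```

Inspecting coef(fit_lm) or summary(fit_lm) gives the numbers that the plot only shows graphically.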
19. Reading data
# read the tab-separated data (the file has no header row)
data = read.csv('data.csv', header = FALSE, sep = '\t',
                col.names = c('title', 'count'))
# order the data
data = data[order(data$count, decreasing=T),]
data$title = factor(data$title, levels=unique(as.character(data$title)))
head(data)
library(ggplot2)  # provides qplot()
qplot(count, title, data=data)
# the other way around
qplot(title, count, data=data)
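The ordering step matters because plots follow the order of the factor levels, not the row order. The sketch below shows this on a tiny tab-separated file written to a temporary path; the titles and counts are invented purely for illustration, and `toy` and `path` are hypothetical names.

```r
# Write a small tab-separated file with made-up titles and counts.
path <- tempfile(fileext = ".tsv")
writeLines(c("alpha\t3", "beta\t10", "gamma\t7"), path)

# sep = "\t" marks tab-delimited input; the file has no header row.
toy <- read.csv(path, header = FALSE, sep = "\t",
                col.names = c("title", "count"))

# Sort by count, then freeze that order in the factor levels.
toy <- toy[order(toy$count, decreasing = TRUE), ]
toy$title <- factor(toy$title, levels = unique(as.character(toy$title)))

print(levels(toy$title))
```

After this, the levels run from the highest count to the lowest, so a plot of title against count keeps that order.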
23. Decision Tree
• Create feature vector with Hive
SELECT v.cookie,
       COUNT(DISTINCT v.day) dagen,
       browser_with_version(v.user_agent) bwv,
       device_type(v.user_agent) dt,
       v.screen,
       COUNT(c.day) clicks
FROM at_views v
LEFT OUTER JOIN at_clicks c ON v.cookie = c.cookie
WHERE v.day > '2013-01-12'
  AND v.site = 470027
  AND v.site_section = 16
  AND v.cookie LIKE "a%"
GROUP BY v.cookie,
         browser_with_version(v.user_agent),
         device_type(v.user_agent),
         v.screen
• Load the output CSV into R
library(rpart)  # provides rpart()
clicklog <- read.csv("~/Downloads/query_result-2.csv", header = TRUE, sep = ',')
clicklog$clickers <- (clicklog$clicks > 0)
fit <- rpart(clickers ~ screen + dt + bwv + dagen, data=clicklog)
plot(fit)
text(fit, use.n=TRUE)
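Since the query result is not included with the slides, here is a self-contained sketch of the same idea on invented data. Everything in `toy` (the values, the 0.8 and 0.1 click rates, the column contents) is hypothetical. It also makes one assumption the slide does not state: the response is converted to a factor and method = "class" is passed so that rpart fits a classification tree.

```r
library(rpart)  # recommended package, shipped with standard R installs

# Hypothetical stand-in for the Hive export; all values are invented.
set.seed(42)
n <- 200
toy <- data.frame(
  dt     = sample(c("desktop", "mobile", "tablet"), n, replace = TRUE),
  screen = sample(c("1366x768", "1920x1080", "320x480"), n, replace = TRUE),
  dagen  = sample(1:10, n, replace = TRUE),
  stringsAsFactors = TRUE
)

# Invented rule: mobile visitors click far more often than others.
p_click <- ifelse(toy$dt == "mobile", 0.8, 0.1)
toy$clickers <- factor(runif(n) < p_click, levels = c(FALSE, TRUE))

# A factor response with method = "class" gives a classification tree.
toy_fit <- rpart(clickers ~ screen + dt + dagen, data = toy, method = "class")

# Predicted class per row, compared with the actual class.
pred <- predict(toy_fit, type = "class")
confusion <- table(actual = toy$clickers, predicted = pred)
print(confusion)
```

The same plot(toy_fit) and text(toy_fit, use.n = TRUE) calls shown on the slide can be used to draw the resulting tree.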
25. Data mining algorithms
Examples of tasks and the algorithms to use:
Predicting a discrete attribute
• Flag the customers in a prospective buyers list as good or poor prospects.
• Calculate the probability that a server will fail within the next 6 months.
• Categorize patient outcomes and explore related factors.
Algorithms to use: Decision Trees, Naive Bayes, Clustering, Neural Network, Logistic Regression
Predicting a continuous attribute
• Forecast next year's sales.
• Predict site visitors given past historical and seasonal trends.
• Generate a risk score given demographics.
Algorithms to use: Decision Trees, Time Series, Linear Regression
Predicting a sequence
• Perform clickstream analysis of a company's Web site.
• Analyze the factors leading to server failure.
• Capture and analyze sequences of activities during outpatient visits, to formulate best practices around common activities.
Algorithms to use: Sequence Clustering
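As one concrete instance of predicting a discrete attribute, the following is a minimal sketch of logistic regression in base R. It uses the built-in mtcars data and predicts the binary am column from wt; that choice of columns, the 0.5 cutoff, and the variable names are assumptions made for illustration only.

```r
# Logistic regression: binary outcome am (0 = automatic, 1 = manual)
# modelled from car weight.
logit_fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Predicted probability that am == 1 for each car.
prob <- predict(logit_fit, type = "response")

# Turn probabilities into a discrete prediction with a 0.5 cutoff.
pred_class <- as.integer(prob > 0.5)
print(table(actual = mtcars$am, predicted = pred_class))
```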
26. Where to start
• R interpreter: http://www.r-project.org
• RStudio: http://www.rstudio.com/
• RForge: http://www.rforge.net/