AWS offers services that dramatically change the scale at which, and the cost at which, customers can extract information from large data sets, commonly called Big Data. This session analyzes Amazon CloudFront logs combined with additional structured data as a scenario for correlating log and transactional data. Successfully implementing this type of solution requires architects and developers to assemble a set of services with multiple decision points. The session provides a design and example of architecting and implementing the scenario using Amazon S3, AWS Data Pipeline, Amazon Elastic MapReduce, and Amazon Redshift, and explores loading, query performance, security, incremental updates, and design trade-offs.
7. The Challenge
“…Foursquare streams hundreds of millions of application logs each day. The company relies on analytics to report on its daily usage, evaluate new offerings, and perform long-term trend analysis—and with millions of new check-ins each day, the workload is only growing…”
8. “Real” Project Requirements Example
Cost
• Data transfer
  – By date/time
  – By edge location
  – By date/time within an edge location
  – By top X URLs
  – By HTTP vs. HTTPS
Analysis
• Top URLs
  – As-is count
  – By content type
  – By edge location
  – By edge location and content type
• Error rates
  – By top X URLs
  – By edge location
  – By edge location and content type
Marketing
• Top ads
  – That lead to a game purchase
• Top games
  – By age
  – By income
  – By gender
Operations
• Requests served
  – By edge location
Revenue
• Revenue
  – By edge location
• Top games
  – By revenue
  – By edge location and revenue
10. Available Data Sources
Metric | Sources
Data transfer by date/time | CloudFront logs
Data transfer by edge location | CloudFront logs
Data transfer by date/time within an edge location | CloudFront logs
Data transfer by top X URLs | CloudFront logs, web server logs
Data transfer by HTTP vs. HTTPS | CloudFront logs
Top URLs | CloudFront logs, web server logs
Top URLs by content type | CloudFront logs
Top URLs by edge location | CloudFront logs
Top URLs by edge location and content type | CloudFront logs
Error rates by top X URLs | CloudFront logs, web server logs
Error rate by edge location | CloudFront logs
Error rate by edge location and content type | CloudFront logs
Requests served by edge location | CloudFront logs
Revenue by edge location | CloudFront logs, OrdersDB, app server logs
Top games segmented by age | CloudFront logs, user profile
Top games segmented by income | CloudFront logs, user profile
Top games segmented by gender | CloudFront logs, user profile
Top games by revenue | CloudFront logs, OrdersDB
Top games by edge location and revenue | CloudFront logs, OrdersDB
Top game revenue segmented by age | CloudFront logs, OrdersDB, user profile
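Several of these metrics require correlating CloudFront logs with transactional data. As a minimal sketch, assuming the cf_logs table defined on the COPY slide later in this deck and a hypothetical orders(order_id, revenue) table standing in for OrdersDB, revenue by edge location could be computed as:

-- Sketch only: orders(order_id, revenue) is an assumed stand-in for OrdersDB.
-- The oid query-string parameter in the logs links a request to an order.
select l.edge, sum(o.revenue) as revenue
from (select edge, split_part(split_part(qs, 'oid=', 2), '&', 1) as oid
      from cf_logs
      where qs like '%oid=%') l
join orders o on o.order_id = l.oid::bigint
group by l.edge
order by revenue desc;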
11. CloudFront Access Log Format
#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query
2012-05-25 22:01:30 AMS1 4448 94.212.249.78 GET d1234567890213.cloudfront.net /YT0KthT/F5SOWdDPqNqQF07tiTOXqJMpfDdlb3LMwv3/jP3/CINm/yDSy0MsRcWJN/Simutrans.exe 200 http://AtRJw2kxg0EMW.com/kZetr/YCb6AM9N2xt2 Mozilla/5.0%20(compatible;%20MSIE%209.0;%20Windows%20NT%206.1;%20WOW64;%20Trident/5.0) uid=100&oid=108625181
2012-05-25 22:01:30 AMS1 4952 94.212.249.78 GET d1234567890213.cloudfront.net /66IG584/CPCxY0P44BGb5ZOd3qSUrauL050LOvFwaMj/eH/caw/Blob Wars-Blob And Conquer.exe 200 http://AtRJw2kxg0EMW.com/kZetr/YCb6AM9N2xt2 Mozilla/5.0%20(compatible;%20MSIE%209.0;%20Windows%20NT%206.1;%20WOW64;%20Trident/5.0) uid=100&oid=108625184
2012-05-25 22:01:30 AMS1 4556 78.8.5.135 GET d1234567890213.cloudfront.net /SwlufjC/xEjH3BRbXMXwmFWqzKt7od6tlWR3e13LhmH/V3eF/lo6g/AstroMenace.exe 200 http://AtRJw2kxg0EMW.com/AC1vg/1727EWfb7fPt Opera/9.80%20(Windows%20NT%205.1;%20U;%20pl)%20Presto/2.10.229%20Version/11.60 uid=100&oid=108625189
2012-05-25 22:01:30 AMS1 47172 78.8.5.135 GET d1234567890213.cloudfront.net /Di1cXoN/TskldkSHcgkvZXQEmv5vOVR25X5UTisFkRq/pQa/wCjUXZb/Z1HRuGlo/Kroz.exe 200 http://AtRJw2kxg0EMW.com/AC1vg/1727EWfb7fPt Opera/9.80%20(Windows%20NT%205.1;%20U;%20pl)%20Presto/2.10.229%20Version/11.60 uid=100&oid=108625206
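Each field in the #Fields line maps one-to-one onto a column of the cf_logs table used in the COPY example later in this deck (date/time to d/t, x-edge-location to edge, sc-bytes to bytes, and so on), which is what makes bulk loading these files into Amazon Redshift straightforward.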
12. Sample Your Data with R
> library(ggplot2)  # plotting
> sample_data <- read.delim("SampleFiles/E123ABCDEF.2012-05-25-22.NEfbhLN3", header = FALSE)
> sample_data <- sample_data[-1:-2, ]  # drop the #Version and #Fields header rows
> View(sample_data)
> # V9 is the sc-status field; plot status-code frequencies on a log scale
> m <- ggplot(sample_data, aes(x = factor(V9)))
> m + geom_histogram() + scale_y_log10() + xlab('Error Codes') + ylab('log(Frequency)')
32. Pig for Access Logs Analysis
-- Load and filter (cat / grep)
RAW_LOG = LOAD 's3://myoutputbucket/aggregate/' AS (ts:chararray, url:chararray…);
LOGS_BASE_F = FILTER RAW_LOG BY url MATCHES '^GET /__track.*$';

-- Parse (awk)
LOGS_BASE_F_W_PARAM = FOREACH LOGS_BASE_F GENERATE
    url,
    DATE_TIME(ts, 'dd/MMM/yyyy:HH:mm:ss Z') AS dt,
    SUBSTRING(DATE_TIME(ts, 'dd/MMM/yyyy:HH:mm:ss Z'), 0, 10) AS day,
    …
    status,
    REGEX_EXTRACT(url, '^GET /([^?]+)', 1) AS action:chararray,
    REGEX_EXTRACT(url, 'idt=([^&]+)', 1) AS idt:chararray,
    REGEX_EXTRACT(url, 'idc=([^&]+)', 1) AS idc:chararray;
I1 = FILTER LOGS_BASE_F_W_PARAM BY action == 'clic' OR action == 'display';
LOGS_SHORT = FOREACH I1 GENERATE uuid, action, dt, day, ida, idas, act, idp, idcmp, idc;

-- Store (>)
G1 = GROUP LOGS_SHORT BY (uuid, idc);
STORE G1 INTO 's3://mybucket/sessions/';
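A script like this can be developed and tested locally in Pig's local mode against a sample file, then submitted unchanged to an Amazon EMR cluster as a Pig step to run over the full data set in S3.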
33. Pig vs. Hive
• Pig is geared toward sequentially transforming data
  – ETL
  – Scales smoothly (from local mode to any cluster size)
• Hive is geared toward querying data
  – Data analysis / HQL
  – Some transformation, typically as a means to an end, e.g., temporary tables
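To make the contrast concrete, here is a minimal HiveQL sketch; the external table declaration and its column names are illustrative assumptions, not taken from the deck:

-- Illustrative only: declare an external table over the raw logs in S3 ...
CREATE EXTERNAL TABLE IF NOT EXISTS cf_logs_hive (
  d STRING, t STRING, edge STRING, bytes BIGINT, cip STRING,
  verb STRING, distro STRING, object STRING, status INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://big-data/logs/E123ABCDEF/';

-- ... and query the data in place: error rate by edge location
SELECT edge,
       AVG(CASE WHEN status >= 400 THEN 1.0 ELSE 0.0 END) AS error_rate
FROM cf_logs_hive
GROUP BY edge;

Unlike the Pig pipeline above, nothing is materialized until you ask a question: Hive projects the schema onto the files at query time.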
41. Customer Tools
Gathering information about EMR jobs from multiple sources and presenting it in textual and graphical views:
github.com/Hi-Media/EmrMonitoring
48. More Trends to Consider
Transactional Processing | Analytical Processing
Transactional context | Global context
Latency | Throughput
Indexed access | Full table scans
Random I/O | Sequential I/O
Disk seek times | Disk transfer rate
51. COPY into Amazon Redshift
create table cf_logs (
  d date, t char(8), edge char(4), bytes int, cip varchar(15),
  verb char(3), distro varchar(max), object varchar(max), status int,
  referer varchar(max), agent varchar(max), qs varchar(max)
);

copy cf_logs from 's3://big-data/logs/E123ABCDEF/'
credentials 'aws_access_key_id=<key_id>;aws_secret_access_key=<secret_key>'
IGNOREHEADER 2
GZIP
DELIMITER '\t'
DATEFORMAT 'YYYY-MM-DD';
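Once the logs are loaded, the metrics from the requirements slides reduce to plain SQL. For example, data transfer by edge location against the cf_logs table above:

-- Data transfer by edge location (one of the slide 10 metrics)
select edge, sum(bytes) as bytes_transferred
from cf_logs
group by edge
order by bytes_transferred desc;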
60. Unload Data from Amazon Redshift
unload ('select * from cf_logs where d between \'2013-11-03\' and \'2013-11-10\'')
to 's3://mybucket/unload_cf_logs_week_46'
credentials 'aws_access_key_id=<key_id>;aws_secret_access_key=<secret_key>'
delimiter as '\t'
GZIP;
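UNLOAD writes its output in parallel, producing one or more files per node slice under the given S3 prefix; add PARALLEL OFF if a single output file is required.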
68. Would You Like to Know More?
Further reading
http://aws.amazon.com/architecture
http://aws.amazon.com/articles
http://aws.typepad.com
re:Invent sessions
DAT205 - Amazon Redshift in Action: Enterprise, Big Data, and SaaS
DAT305 - Getting Maximum Performance from Amazon Redshift
BDT301 - Scaling your Analytics with Amazon Elastic MapReduce
69. Please Give Us Your Feedback on This Presentation (ARC306)
As a thank you, we will select prize winners daily for completed surveys!