AWS offers services that dramatically change the scale at which, and the cost at which, customers can extract information from large data sets, commonly called Big Data. This session analyzes Amazon CloudFront logs combined with additional structured data as a scenario for correlating log and transactional data. Successfully implementing this type of solution requires architects and developers to assemble a set of services with multiple decision points. The session provides a design and example of architecting and implementing the scenario using Amazon S3, AWS Data Pipeline, Amazon Elastic MapReduce, and Amazon Redshift, and explores loading, query performance, security, incremental updates, and design trade-offs.
7. The Challenge
“…Foursquare streams hundreds of millions of application logs each day. The company relies on analytics to report on its daily usage, evaluate new offerings, and perform long-term trend analysis—and with millions of new check-ins each day, the workload is only growing…”
8. “Real” Project Requirements Example
Cost
• Data transfer
  – By date/time
  – By edge location
  – By date/time within an edge location
  – By top X URLs
  – By HTTP vs. HTTPS
Analysis
• Top URLs
  – As-is count
  – By content type
  – By edge location
  – By edge location and content type
• Error rates
  – By top X URLs
  – By edge location
  – By edge location and content type
Marketing
• Top ads
  – That lead to a game purchase
• Top games
  – By age
  – By income
  – By gender
Operations
• Requests served
  – By edge location
Revenue
• Revenue
  – By edge location
• Top games
  – By revenue
  – By edge location and revenue
10. Available Data Sources
Metric | Sources
Data transfer by date/time | CloudFront logs
Data transfer by edge location | CloudFront logs
Data transfer by date/time within an edge location | CloudFront logs
Data transfer by top X URLs | CloudFront logs, web server logs
Data transfer by HTTP vs. HTTPS | CloudFront logs
Top URLs | CloudFront logs, web server logs
Top URLs by content type | CloudFront logs
Top URLs by edge location | CloudFront logs
Top URLs by edge location and content type | CloudFront logs
Error rates by top X URLs | CloudFront logs, web server logs
Error rate by edge location | CloudFront logs
Error rate by edge location and content type | CloudFront logs
Requests served by edge location | CloudFront logs
Revenue by edge location | CloudFront logs, OrdersDB, app server logs
Top games segmented by age | CloudFront logs, user profile
Top games segmented by income | CloudFront logs, user profile
Top games segmented by gender | CloudFront logs, user profile
Top games by revenue | CloudFront logs, OrdersDB
Top games by edge location and revenue | CloudFront logs, OrdersDB
Top game revenue segmented by age | CloudFront logs, OrdersDB, user profile
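Several of these metrics require correlating CloudFront logs with transactional data. As a minimal sketch, assuming the cf_logs table defined on the COPY slide later in this deck and a hypothetical orders(order_id, revenue) table standing in for OrdersDB, revenue by edge location could be computed as:

-- Sketch only: orders(order_id, revenue) is an assumed stand-in for OrdersDB.
-- The oid query-string parameter in the logs links a request to an order.
select l.edge, sum(o.revenue) as revenue
from (select edge, split_part(split_part(qs, 'oid=', 2), '&', 1) as oid
      from cf_logs
      where qs like '%oid=%') l
join orders o on o.order_id = l.oid::bigint
group by l.edge
order by revenue desc;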
11. CloudFront Access Log Format
#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query
2012-05-25 22:01:30 AMS1 4448 94.212.249.78 GET d1234567890213.cloudfront.net /YT0KthT/F5SOWdDPqNqQF07tiTOXqJMpfDdlb3LMwv3/jP3/CINm/yDSy0MsRcWJN/Simutrans.exe 200 http://AtRJw2kxg0EMW.com/kZetr/YCb6AM9N2xt2 Mozilla/5.0%20(compatible;%20MSIE%209.0;%20Windows%20NT%206.1;%20WOW64;%20Trident/5.0) uid=100&oid=108625181
2012-05-25 22:01:30 AMS1 4952 94.212.249.78 GET d1234567890213.cloudfront.net /66IG584/CPCxY0P44BGb5ZOd3qSUrauL050LOvFwaMj/eH/caw/Blob Wars-Blob And Conquer.exe 200 http://AtRJw2kxg0EMW.com/kZetr/YCb6AM9N2xt2 Mozilla/5.0%20(compatible;%20MSIE%209.0;%20Windows%20NT%206.1;%20WOW64;%20Trident/5.0) uid=100&oid=108625184
2012-05-25 22:01:30 AMS1 4556 78.8.5.135 GET d1234567890213.cloudfront.net /SwlufjC/xEjH3BRbXMXwmFWqzKt7od6tlWR3e13LhmH/V3eF/lo6g/AstroMenace.exe 200 http://AtRJw2kxg0EMW.com/AC1vg/1727EWfb7fPt Opera/9.80%20(Windows%20NT%205.1;%20U;%20pl)%20Presto/2.10.229%20Version/11.60 uid=100&oid=108625189
2012-05-25 22:01:30 AMS1 47172 78.8.5.135 GET d1234567890213.cloudfront.net /Di1cXoN/TskldkSHcgkvZXQEmv5vOVR25X5UTisFkRq/pQa/wCjUXZb/Z1HRuGlo/Kroz.exe 200 http://AtRJw2kxg0EMW.com/AC1vg/1727EWfb7fPt Opera/9.80%20(Windows%20NT%205.1;%20U;%20pl)%20Presto/2.10.229%20Version/11.60 uid=100&oid=108625206
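Each field in the #Fields line maps one-to-one onto a column of the cf_logs table used in the COPY example later in this deck (date/time to d/t, x-edge-location to edge, sc-bytes to bytes, and so on), which is what makes bulk loading these files into Amazon Redshift straightforward.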
12. Sample Your Data with R
> library(ggplot2)  # plotting
> sample_data <- read.delim("SampleFiles/E123ABCDEF.2012-05-25-22.NEfbhLN3", header = FALSE)
> sample_data <- sample_data[-1:-2, ]  # drop the #Version and #Fields header rows
> View(sample_data)
> # V9 is the sc-status field; plot status-code frequencies on a log scale
> m <- ggplot(sample_data, aes(x = factor(V9)))
> m + geom_histogram() + scale_y_log10() + xlab('Error Codes') + ylab('log(Frequency)')
32. Pig for Access Logs Analysis
-- Load and filter (cat / grep)
RAW_LOG = LOAD 's3://myoutputbucket/aggregate/' AS (ts:chararray, url:chararray…);
LOGS_BASE_F = FILTER RAW_LOG BY url MATCHES '^GET /__track.*$';

-- Parse (awk)
LOGS_BASE_F_W_PARAM = FOREACH LOGS_BASE_F GENERATE
    url,
    DATE_TIME(ts, 'dd/MMM/yyyy:HH:mm:ss Z') AS dt,
    SUBSTRING(DATE_TIME(ts, 'dd/MMM/yyyy:HH:mm:ss Z'), 0, 10) AS day,
    …
    status,
    REGEX_EXTRACT(url, '^GET /([^?]+)', 1) AS action:chararray,
    REGEX_EXTRACT(url, 'idt=([^&]+)', 1) AS idt:chararray,
    REGEX_EXTRACT(url, 'idc=([^&]+)', 1) AS idc:chararray;
I1 = FILTER LOGS_BASE_F_W_PARAM BY action == 'clic' OR action == 'display';
LOGS_SHORT = FOREACH I1 GENERATE uuid, action, dt, day, ida, idas, act, idp, idcmp, idc;

-- Store (>)
G1 = GROUP LOGS_SHORT BY (uuid, idc);
STORE G1 INTO 's3://mybucket/sessions/';
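A script like this can be developed and tested locally in Pig's local mode against a sample file, then submitted unchanged to an Amazon EMR cluster as a Pig step to run over the full data set in S3.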
33. Pig vs. Hive
• Pig is geared toward sequentially transforming data
  – ETL
  – Scales smoothly (from local mode to any cluster size)
• Hive is geared toward querying data
  – Data analysis / HQL
  – Some transformation, typically as a means to an end, e.g., temporary tables
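To make the contrast concrete, here is a minimal HiveQL sketch; the external table declaration and its column names are illustrative assumptions, not taken from the deck:

-- Illustrative only: declare an external table over the raw logs in S3 ...
CREATE EXTERNAL TABLE IF NOT EXISTS cf_logs_hive (
  d STRING, t STRING, edge STRING, bytes BIGINT, cip STRING,
  verb STRING, distro STRING, object STRING, status INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://big-data/logs/E123ABCDEF/';

-- ... and query the data in place: error rate by edge location
SELECT edge,
       AVG(CASE WHEN status >= 400 THEN 1.0 ELSE 0.0 END) AS error_rate
FROM cf_logs_hive
GROUP BY edge;

Unlike the Pig pipeline above, nothing is materialized until you ask a question: Hive projects the schema onto the files at query time.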
41. Customer Tools
Gathering information about EMR jobs from multiple sources and presenting it in textual and graphical views:
github.com/Hi-Media/EmrMonitoring
48. More Trends to Consider
Transactional Processing | Analytical Processing
Transactional context | Global context
Latency | Throughput
Indexed access | Full table scans
Random I/O | Sequential I/O
Disk seek times | Disk transfer rate
51. COPY into Amazon Redshift
create table cf_logs (
  d date, t char(8), edge char(4), bytes int, cip varchar(15),
  verb char(3), distro varchar(max), object varchar(max), status int,
  referer varchar(max), agent varchar(max), qs varchar(max)
);

copy cf_logs from 's3://big-data/logs/E123ABCDEF/'
credentials 'aws_access_key_id=<key_id>;aws_secret_access_key=<secret_key>'
IGNOREHEADER 2
GZIP
DELIMITER '\t'
DATEFORMAT 'YYYY-MM-DD';
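Once the logs are loaded, the metrics from the requirements slides reduce to plain SQL. For example, data transfer by edge location against the cf_logs table above:

-- Data transfer by edge location (one of the slide 10 metrics)
select edge, sum(bytes) as bytes_transferred
from cf_logs
group by edge
order by bytes_transferred desc;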
60. Unload Data from Amazon Redshift
unload ('select * from cf_logs where d between \'2013-11-03\' and \'2013-11-10\'')
to 's3://mybucket/unload_cf_logs_week_46'
credentials 'aws_access_key_id=<key_id>;aws_secret_access_key=<secret_key>'
delimiter as '\t'
GZIP;
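UNLOAD writes its output in parallel, producing one or more files per node slice under the given S3 prefix; add PARALLEL OFF if a single output file is required.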
68. Would You Like to Know More?
Further reading
http://aws.amazon.com/architecture
http://aws.amazon.com/articles
http://aws.typepad.com
re:Invent sessions
DAT205 - Amazon Redshift in Action: Enterprise, Big Data, and SaaS
DAT305 - Getting Maximum Performance from Amazon Redshift
BDT301 - Scaling your Analytics with Amazon Elastic MapReduce
69. Please Give Us Your Feedback on This Presentation (ARC306)
As a thank you, we will select prize winners daily for completed surveys!