7. We are sincerely eager to
hear your feedback on this
presentation and on re:Invent.
Please fill out an evaluation
form when you have a
chance.
We are constantly producing more data
8.
From all types of industries
15.
Until now, the questions you asked drove the data model
The new model is to collect as much data as possible
– “Data-First Philosophy”
16.
Data is the new raw material for any business, on par with capital, people, and labor
18. Generated data
Available for analysis
Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
22. select productId, count(*)
from page_hits
where hour in (12,13)
group by productId
order by count(*) desc
cat *-1[23] | cut -f3 | sort | uniq -c > out
Hit <enter>?
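The SQL query and the shell pipeline on this slide both count hits per product. A rough Python sketch of the same logic follows. It assumes, as `cut -f3` implies, tab-separated log lines with the product ID in the third field; the function name and the sample lines are hypothetical and not from the presentation.

```python
from collections import Counter


def count_product_hits(lines):
    """Count hits per product ID from tab-separated log lines.

    Assumption: the product ID is the third tab-separated field,
    mirroring `cut -f3`. Lines with fewer than three fields are skipped.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    # most_common() orders by count, descending, like ORDER BY count(*) DESC.
    return counts.most_common()


# Hypothetical sample lines standing in for the hour-12 and hour-13 logs.
sample = [
    "12:01\tuser1\tp42\n",
    "12:05\tuser2\tp7\n",
    "13:10\tuser3\tp42\n",
]

for product_id, hits in count_product_hits(sample):
    print(product_id, hits)
```

Like the one-line pipeline, this runs on a single machine, so it stops being practical once the logs outgrow one machine's disk and patience.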
23. 1PB = 10^15 (1,000,000,000,000,000) bytes
1 PB = 231 days at 50MB/s
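The 231-day figure can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 MB = 10^6 bytes) and a single reader sustaining 50 MB/s without interruption; the function name is illustrative.

```python
PETABYTE_BYTES = 10**15               # 1 PB, decimal definition from the slide
THROUGHPUT_BYTES_PER_S = 50 * 10**6   # 50 MB/s, assuming decimal megabytes
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_process(total_bytes, bytes_per_second):
    """Days needed to read total_bytes at a constant bytes_per_second."""
    return total_bytes / bytes_per_second / SECONDS_PER_DAY


days = days_to_process(PETABYTE_BYTES, THROUGHPUT_BYTES_PER_S)
print(f"{days:.1f} days")
```

Under the same assumptions, spreading the read evenly across N independent readers divides the time by N, which is the argument for processing the data in parallel.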
77. Source: IDC Whitepaper, sponsored by Amazon, “The Business Value of Amazon Web Services Accelerates Over Time.” July 2012
70% lower 5-year TCO per app
On-premises: $3.01M
AWS: $0.90M
50% reduction in analytics costs
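The 70% figure follows from the two dollar amounts on the chart. A quick check, assuming $3.01M is the on-premises 5-year TCO and $0.90M is the AWS figure (the variable and function names are illustrative):

```python
ON_PREMISES_TCO_MUSD = 3.01  # assumed: on-premises 5-year TCO, in $M
AWS_TCO_MUSD = 0.90          # assumed: AWS 5-year TCO, in $M


def percent_reduction(before, after):
    """Relative reduction from before to after, as a percentage."""
    return (before - after) / before * 100


reduction = percent_reduction(ON_PREMISES_TCO_MUSD, AWS_TCO_MUSD)
print(f"{reduction:.1f}% lower")
```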
87. More than 25 Million Streaming Members
50 Billion Events Per Day
30 Million plays every day
2 billion hours of video in 3 months
4 million ratings per day
3 million searches
Device location, time, day, week, etc.
Social data
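To put the daily totals in perspective, they can be converted to average per-second rates. This is only a sketch: it assumes events are spread evenly over the day, which real traffic is not, so peak rates would be higher. The function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

EVENTS_PER_DAY = 50_000_000_000   # 50 billion events per day (from the slide)
PLAYS_PER_DAY = 30_000_000        # 30 million plays per day (from the slide)


def average_per_second(per_day):
    """Average rate per second, assuming a uniform spread over 24 hours."""
    return per_day / SECONDS_PER_DAY


events_per_second = average_per_second(EVENTS_PER_DAY)
plays_per_second = average_per_second(PLAYS_PER_DAY)
print(f"{events_per_second:,.0f} events/s, {plays_per_second:,.0f} plays/s")
```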
98. Foursquare…
33 million users
1.3 million businesses
…generates a lot of data
3.5 billion check-ins
15M+ venues
Terabytes of log data
99. Uses EMR for
Evaluation of new features
Machine learning
Exploratory analysis
Daily customer usage reporting
Long-term trend analysis
100. Benefits of EMR
Ease-of-Use
“We have decreased the processing time for urgent data-analysis”
Flexibility
To deal with changing requirements & dynamically expand reporting clusters
Costs
“We have reduced our analytics costs by over 50%”
113. Common Crawl
1000 Genomes Project
Census Data
54 other datasets
http://aws.amazon.com/publicdatasets/
114. Challenge:
Large amounts of computing resources needed for short periods of time; significant data storage costs
Solution:
Clusters of 100s of nodes on EMR running 4-5 hours at a time
Leverages the 1000 Genomes Public Data Set on AWS: free access to ~200 TB of genomes for over 2,600 people from 26 populations around the world.
115. Challenge:
Volatile weather is deadly to crops like grapes
Solution:
Built a predictive model based on freely available data:
60 years of crop data,
14 TB of soil data, and
1M government Doppler radar points
50 EMR clusters process new data as it comes into S3 each day, continuously updating the model.