The document discusses how connected data and cloud computing are enabling new types of data-driven science and applications. It provides examples across several domains: (1) genomics research has progressed from sequencing individual genomes to large-scale comparisons using petabytes of data stored in the cloud, (2) consumer applications like home security cameras generate petabytes of data per month for computer vision analysis, (3) retail and travel companies use vast amounts of customer, product, and location data to personalize experiences, (4) industrial research leverages cloud-based materials data from global partners, (5) sports and location-based applications also generate large volumes of real-time data for analysis in the cloud. The cloud is enabling data to be
4. The amount of information generated during the first day of a baby’s life today is equivalent to 70 times the information contained in the Library of Congress
8. Human Genome Project
Collaborative project to sequence every single letter of the human genetic code.
13 years and billions of dollars to complete.
Gigabyte-scale datasets (transferred between sites on iPods!)
9. Beyond the Human Genome
45+ species sequenced: mouse, rat, gorilla, rabbit, platypus, nematode, zebrafish...
Compare genomes between species to identify biologically interesting areas of the genome.
100 GB-scale datasets. Increased computational requirements.
10. The Next Generation
New sequencing instruments lead to a dramatic drop in the cost and time required to sequence a genome.
Sequence and compare the genetic code of individuals to find areas of variation. Much more interesting.
Terabyte-scale datasets. Significant computational requirements.
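The idea of comparing individuals' genetic code to find areas of variation can be illustrated with a toy sketch. It assumes two short sequences that are already aligned and of equal length; the function name `find_variants` and the sample sequences are made up for illustration. Real pipelines rely on read alignment and dedicated variant callers, which this does not attempt to reproduce.

```python
from typing import List, Tuple


def find_variants(reference: str, sample: str) -> List[Tuple[int, str, str]]:
    """Return (position, reference_base, sample_base) for every mismatch.

    Assumes both sequences are pre-aligned and of equal length
    (a simplification; real genomes need an alignment step first).
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be pre-aligned to equal length")
    return [
        (i, ref_base, sample_base)
        for i, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


# Made-up sequences, for illustration only.
reference = "ACGTACGTAC"
sample = "ACGTTCGTAA"
variants = find_variants(reference, sample)
print(variants)  # → [(4, 'A', 'T'), (9, 'C', 'A')]
```

Even this naive comparison is linear in genome length per individual, which hints at why terabyte-scale datasets bring significant computational requirements.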
11. The 1000 Genomes Project
Public/private consortium to build the world’s largest collection of human genetic variation.
Hugely important dataset to drive new insight into known genetic traits, and the identification of new ones.
Vast, complex data and computational resources required, beyond the reach of most research groups and hospitals.
12. 1000 Genomes in the Cloud
The 1000 Genomes data made available to all on AWS.
Stored for free as part of the Public Datasets program.
Updated regularly.
200 TB. 1700 individual genomes. As much compute and storage as required, available to all.
23. Dropcam is the biggest inbound video service on the Web
• More data uploaded per minute than YouTube
• Petabytes of data processed every month
• Billions of motion events detected
35. Lenddo’s Journey
• Process about 3.5 TB of social data
• Social data growing with more users
• Started with a MongoDB cluster on CR1 instance types on AWS, spending 10K USD/month
• Re-architected to move all their data to S3 and keep caches in smaller MongoDB and DynamoDB clusters. Use EMR to process data
• Now spending 3K USD/month
39. Who is my customer really?
What do people really like?
What is happening socially with my products?
Where do people consume my product?
How do people really use your product?
41. 75% of users select movies based on recommendations
42. More than 27 million users
~ 30 million plays per day
More than 40 billion events per day
~ 4 million ratings per day
~ 3 million searches per day
Geo-location data
Device information
Time of day and week (it can now verify that users watch more TV shows during the week and more movies during the weekend)
Metadata from third parties such as Nielsen
Social media data from Facebook and Twitter
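The time-of-day-and-week observation (more TV shows during the week, more movies at the weekend) comes down to grouping play events by day type. Below is a minimal sketch under stated assumptions: the event tuples, the `tv_show`/`movie` labels, and the function name `plays_by_day_type` are all hypothetical, and the sample events are invented purely to show the grouping; they are not data from the service.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, Tuple


def plays_by_day_type(events: Iterable[Tuple[str, str]]) -> Counter:
    """Count plays per (day_type, content_kind).

    Each event is (ISO-8601 timestamp, content kind).
    Saturday and Sunday count as "weekend", everything else as "weekday".
    """
    counts: Counter = Counter()
    for timestamp, kind in events:
        # datetime.weekday(): Monday is 0, Sunday is 6.
        is_weekend = datetime.fromisoformat(timestamp).weekday() >= 5
        counts[("weekend" if is_weekend else "weekday", kind)] += 1
    return counts


# Invented sample events, for illustration only.
events = [
    ("2014-03-03T20:00:00", "tv_show"),  # Monday
    ("2014-03-04T21:00:00", "tv_show"),  # Tuesday
    ("2014-03-05T20:30:00", "movie"),    # Wednesday
    ("2014-03-08T20:00:00", "movie"),    # Saturday
    ("2014-03-09T15:00:00", "movie"),    # Sunday
    ("2014-03-09T19:00:00", "tv_show"),  # Sunday
]
counts = plays_by_day_type(events)
for (day_type, kind), n in sorted(counts.items()):
    print(day_type, kind, n)
```

At tens of billions of events per day the same grouping would run as a distributed job rather than an in-memory loop, but the aggregation logic is the same.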
46. Wego
• Search using flexible dates AND/OR locations and themes
– FROM Singapore TO Beach FOR A Weekend Trip (theme location + flexible date)
– FROM Singapore TO Paris FOR A Whole-week Vacation (specific destination + flexible date)
– FROM Singapore TO Sydney IN Next Two Months (specific destination + flexible date)
– FROM Singapore TO Family-friendly Destination ON 30-Apr to 05-May (theme location + fixed dates)
• Need for a robust caching mechanism, with millions of flight searches across 10 million+ different flight routes
• Use the AWS cloud to rapidly spin up machines to scale to the requirements
• AWS allows them to do this in a scalable and cost-effective manner
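The slide does not say how the caching mechanism is built. As one possible illustration, here is a minimal in-memory cache with a time-to-live, keyed by (origin, destination, date). The class `TTLCache`, the `fake_search` function, the 300-second TTL, and the returned price are all assumptions made for this sketch; a production system at this scale would more likely use a distributed cache.

```python
import time
from typing import Any, Callable, Dict, Hashable, List, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._store: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_compute(self, key: Hashable, compute: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: skip the expensive search
        value = compute()
        self._store[key] = (now + self._ttl, value)
        return value


# Hypothetical stand-in for an expensive flight search backend.
calls: List[Tuple[str, str, str]] = []


def fake_search(origin: str, destination: str, date: str) -> List[dict]:
    calls.append((origin, destination, date))
    return [{"route": f"{origin}-{destination}", "date": date, "price": 420}]


now = [0.0]  # manual clock, so the example does not need to sleep
cache = TTLCache(ttl_seconds=300, clock=lambda: now[0])
key = ("SIN", "SYD", "2014-04-30")

first = cache.get_or_compute(key, lambda: fake_search(*key))
second = cache.get_or_compute(key, lambda: fake_search(*key))
print(len(calls))  # → 1 (the second lookup was served from the cache)
```

A short TTL is one way to trade freshness against backend load, since flight prices change; flexible-date and theme searches would expand into many such keys, which is where the cache pays off.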
51. The only Asian company which made it to the CODE_n finalist list for CeBIT 2014
52. Platform Architecture
[Architecture diagram. Components: Data Partners; Data Acquisition; Integration Engine; Crawl Cluster (EC2); File Server (EC2); Processing Cluster (EC2); Choice Engine Cluster (EC2); Storage (S3); Archival (Glacier); End user interaction/Front End. Legend: On AWS / External to AWS.]
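87. What ... right now?
• What trades are executing right now?
• What is the exception rate right now?
• What is the ad click-through right now?
• What topics are trending right now?
• What inventory remains right now?
• What queries are slow right now?
• What are the high scores right now?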
90. Kinesis architecture
• Millions of sources producing 100s of terabytes per hour
• Front end: authentication and authorization
• Durable, highly consistent storage replicates data across three data centers (availability zones) within Amazon Web Services
• Ordered stream of events supports multiple readers:
– Aggregate and archive to S3
– Real-time dashboards and alarms
– Machine learning algorithms or sliding window analytics
– Aggregate analysis in Hadoop or a data warehouse
• Inexpensive: $0.028 per million puts
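Sliding window analytics over an ordered stream can be sketched in a few lines of plain Python. This is a conceptual illustration only and does not use the Kinesis API: the class `SlidingWindowCounter`, the 60-second window, and the timestamps are assumptions for the example, and it relies on events arriving in timestamp order, as in an ordered stream.

```python
from collections import deque
from typing import Deque, List


class SlidingWindowCounter:
    """Count events seen in the last window_seconds.

    Assumes timestamps arrive in non-decreasing order.
    """

    def __init__(self, window_seconds: float) -> None:
        self._window = window_seconds
        self._timestamps: Deque[float] = deque()

    def add(self, timestamp: float) -> int:
        """Record one event and return the count inside the current window."""
        self._timestamps.append(timestamp)
        # Drop events at or before (timestamp - window): they have aged out.
        cutoff = timestamp - self._window
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)


# Made-up event times in seconds, for illustration only.
counter = SlidingWindowCounter(window_seconds=60)
counts: List[int] = [counter.add(t) for t in [0, 1, 2, 10, 11, 61, 62]]
print(counts)  # → [1, 2, 3, 4, 5, 4, 4]
```

A reader like this could feed a real-time dashboard or alarm by comparing each returned count against a threshold, while other readers of the same stream archive or aggregate the events independently.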
91. AWS Internal Metering Service
[Diagram: Clients Submitting Data; Capture Submissions; Process in Realtime; Store in Redshift]
Workload
• Tens of millions of records/sec
• Multiple TB per hour
• 100,000s of sources
New features
• Scale with the business
• Provide real-time alerting
• Inexpensive
• Improved auditing
92. Workload
• Daily load of billions of records from millions of files from hundreds of sources
• 3 hour SLA to load and audit data
• Hundreds of customers
• Hundreds of queries per hour
New features
• Our data is fresh, we ingest every 6 hours
• Now processing triple the volume in less than 25% of the time
• “Hammerstone” ETL solution
– Built on AWS Data Pipeline
– Build business specific marts
– Build workload specific clusters
• Supports a variety of analytics tools: Tableau, R, Toad, SQL Developer, etc.
Internal AWS Data Warehouse
[Diagram: Over 200 internal data sources; Data staged in Amazon S3; "Hammerstone": custom ETL using AWS Data Pipeline; Data processing Redshift cluster; Batch reporting Redshift cluster; Ad hoc query Redshift cluster]