The volume, velocity and variety of data have changed drastically in the last decade. Everything generates data today, from your customers on social networks to the instances running your web applications. The tools for collecting, storing, organizing, analyzing and sharing data are all available in a couple of clicks with Amazon Web Services. Attend this session to learn how Big Data in the cloud can help you unlock business opportunities hidden in your data today.
6. An engineer’s definition
When your data sets become so large that you have to start innovating in how you collect, store, organize, analyze and share them
11. Generated data
[Chart: data volume, showing generated data and data available for analysis]
Sources:
Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
12. Big gap in turning data into actionable information
16. Big Data Verticals and Use Cases
Media/Advertising: Targeted Advertising, Image and Video Processing
Oil & Gas: Seismic Analysis
Retail: Recommendations, Transactions Analysis
Life Sciences: Genome Analysis
Financial Services: Monte Carlo Simulations, Risk Analysis
Security: Anti-virus, Fraud Detection, Image Recognition
Social Network/Gaming: User Demographics, Usage Analysis, In-game Metrics
23. More than 25 Million Streaming Members
50 Billion Events Per Day
30 Million plays every day
2 billion hours of video in 3 months
4 million ratings per day
3 million searches
Device location, time, day, week, etc.
Social data
42. COLLECT | STORE | ANALYSE | SHARE
Direct Connect
SQS
Glacier
S3
EC2
Redshift
DynamoDB
Elastic MapReduce
CloudFront
EC2
Basic building blocks for every workload
Data pipeline
Import Export
Compute
Fleet
45. Big Data tools on AWS
Categories: In-memory, Hadoop and Friends, Managed Services, MPP Data Warehouse, NoSQL, Scale-out processing
Tools shown:
Hive/Pig/Cascading, Shark/Spark
DynamoDB, HBase, Cassandra, MongoDB, ..
EC2, ..
SAP HANA One, ..
Treasure Data, Qubole, Splunk Storm, Sumo Logic, Karmasphere, ..
Redshift, Vertica, ..
47. How does EMR work?
[Diagram: EMR, EMR Cluster, S3]
Put the data into S3
Choose: Hadoop distribution, number of nodes, types of nodes, custom configs, Hive/Pig/etc.
Launch the cluster using the EMR console, CLI, SDK, or APIs
Get the output from S3
You can also store everything in HDFS
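The choices listed on this slide (distribution, number of nodes, node types, Hive/Pig) can be sketched as a cluster request. This is a minimal illustration, not the presenter's code: `build_emr_request` is a hypothetical helper, and the cluster name, bucket, release label and instance type are assumed values. The dictionary keys follow the parameter names of the `run_job_flow` call in the current boto3 SDK, which postdates these slides. Nothing is sent to AWS here.

```python
import json


def build_emr_request(name, release_label, node_count, node_type,
                      applications, log_uri, keep_alive=False):
    """Assemble keyword arguments shaped like boto3's EMR run_job_flow call.

    Hypothetical helper: it only builds a dictionary and never contacts AWS.
    """
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")

    # One master node; any remaining nodes become core nodes.
    instance_groups = [{
        "Name": "Master",
        "InstanceRole": "MASTER",
        "InstanceType": node_type,
        "InstanceCount": 1,
    }]
    if node_count > 1:
        instance_groups.append({
            "Name": "Core",
            "InstanceRole": "CORE",
            "InstanceType": node_type,
            "InstanceCount": node_count - 1,
        })

    return {
        "Name": name,
        "LogUri": log_uri,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": app} for app in applications],
        "Instances": {
            "InstanceGroups": instance_groups,
            # False means the cluster terminates once its steps finish.
            "KeepJobFlowAliveWhenNoSteps": keep_alive,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


# All values below are assumptions chosen for illustration.
request = build_emr_request(
    name="log-analysis",
    release_label="emr-6.15.0",
    node_count=5,
    node_type="m5.xlarge",
    applications=["Hive", "Pig"],
    log_uri="s3://example-bucket/emr-logs/",
)
print(json.dumps(request, indent=2))
```

With credentials configured, a dictionary like this could be passed as `boto3.client("emr").run_job_flow(**request)`; the console and CLI expose the same choices.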
59. Run the analysis
[Diagram: S3, EMR Cluster]
Run clusters with your data in S3
Data is “streamed” in and intermediate results are stored in HDFS
60. When done, shut down the cluster
[Diagram: EMR Cluster, S3]
When processing is complete, you can terminate the cluster (and stop paying)
61. EMR
[Diagram: EMR Cluster, S3]
You can also run 24/7
If you run your jobs 24x7, you can also run a persistent cluster and use Reserved Instance (RI) pricing to save costs
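The trade-off between a transient cluster (terminate when done) and a persistent one (run 24x7, optionally on Reserved Instances) comes down to simple arithmetic. The sketch below uses made-up numbers throughout: the node count, hourly rates and daily job length are assumptions for illustration, not AWS prices, and the reserved rate is treated as an effective hourly rate with any upfront fee already folded in.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8760 / 12)
DAYS_PER_MONTH = 30    # simplification for the transient case


def cluster_cost(nodes, hourly_rate, hours):
    """Cost of running `nodes` instances at `hourly_rate` for `hours`."""
    return nodes * hourly_rate * hours


# Hypothetical inputs, chosen only to make the comparison concrete.
NODES = 10
ON_DEMAND_RATE = 0.25   # assumed on-demand price per node-hour
RESERVED_RATE = 0.15    # assumed effective reserved price per node-hour
JOB_HOURS_PER_DAY = 3   # assumed daily batch window

# Transient: pay only while the daily job runs, then terminate.
transient = cluster_cost(NODES, ON_DEMAND_RATE,
                         JOB_HOURS_PER_DAY * DAYS_PER_MONTH)
# Persistent: the cluster stays up around the clock.
persistent_on_demand = cluster_cost(NODES, ON_DEMAND_RATE, HOURS_PER_MONTH)
persistent_reserved = cluster_cost(NODES, RESERVED_RATE, HOURS_PER_MONTH)

print(f"transient (on-demand):  {transient:.2f}")
print(f"persistent (on-demand): {persistent_on_demand:.2f}")
print(f"persistent (reserved):  {persistent_reserved:.2f}")
```

Under these assumptions the transient cluster is cheapest for a short daily job, while reserved pricing narrows the gap for workloads that really do run around the clock.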
62. Option to use S3 along with HDFS
S3
EMR
EMR Cluster
• S3 provides 99.999999999% durability
• Elastic
• Version control against failure
• Run multiple clusters with a single source of truth
• Quick recovery from failure
• Continuously resize clusters
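To make the durability figure concrete, here is a small worked calculation. It assumes the figure is an annual, per-object design target (which is how AWS states it) and uses a hypothetical count of 10 million stored objects. Exact fractions are used so the result is not blurred by floating-point rounding.

```python
from fractions import Fraction

# Annual per-object durability, taken as a design target (assumption).
durability = Fraction("99.999999999") / 100
annual_loss_probability = 1 - durability

# Hypothetical number of objects kept in S3.
objects_stored = 10_000_000

# Expected number of objects lost per year, and the average
# number of years between single-object losses.
expected_losses_per_year = objects_stored * annual_loss_probability
years_per_expected_loss = 1 / expected_losses_per_year

print(f"annual loss probability per object: {float(annual_loss_probability):.0e}")
print(f"expected objects lost per year: {float(expected_losses_per_year)}")
print(f"average years per lost object: {int(years_per_expected_loss)}")
```

This is an expectation under the stated design target, not a guarantee for any particular object.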
63. Which is the data warehouse here?
Courtesy http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
64. Need faster query response time
Separate MapReduce-like engine
In-memory data storage for fast query response time
Compatible with the Hadoop storage API
Shark: port of Apache Hive on Spark
Compatible with existing Hive metastores
Similar speed-ups of up to 40x