AWS Summit 2013 | India - Big Data Analytics, Abhishek Sinha

Abhishek Sinha
Business Development Manager
sinhaar@amazon.com
@abysinha
Big Data Analytics

Rajnikant

Customary Rajnikant Joke
• How would “Rajni Saar” process big data?
He could count it on his fingers!

An engineer’s definition
When your data sets become so large that you have to start innovating how to collect, store, organize, analyze and share them

What is the challenge with big data?

Generation
Collection & storage
Analytics & computation
Collaboration & sharing

Generation
Highly
constrained
Lower cost,
higher throughput

[Chart: data volume, generated data vs. data available for analysis]
Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares

Big Gap in turning data into actionable
information

Amazon Web Services helps remove
constraints

1 instance x 100 hours = 100 instances x 1 hour
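
The equality above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming a perfectly parallelizable job and per-instance-hour billing at a single price; the function names are illustrative and not part of any AWS API.

```python
def instance_hours(instances: int, hours: float) -> float:
    """Billable instance-hours for a cluster of a given size and run time."""
    return instances * hours


def wall_clock_hours(total_instance_hours: float, instances: int) -> float:
    """Elapsed time if the work spreads evenly across all instances (an assumption)."""
    return total_instance_hours / instances


serial = instance_hours(instances=1, hours=100)
parallel = instance_hours(instances=100, hours=1)

# Same billable total in both cases; only the waiting time differs.
print(serial, parallel)
print(wall_clock_hours(serial, 1), wall_clock_hours(serial, 100))
```

Under these assumptions the bill is identical, so the choice between the two shapes comes down to how soon the answer is needed.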

Big Data Verticals and Use cases
Media/Advertising: Targeted Advertising; Image and Video Processing
Oil & Gas: Seismic Analysis
Retail: Recommendations; Transactions Analysis
Life Sciences: Genome Analysis
Financial Services: Monte Carlo Simulations; Risk Analysis
Security: Anti-virus; Fraud Detection; Image Recognition
Social Network/Gaming: User Demographics; Usage analysis; In-game metrics

From data to
actionable information

“Who is using our
service?”

Identified early mobile usage
Invested heavily in mobile development
Finding signal in the noise of logs

9,432,061 unique mobile devices
used the Yelp mobile app.
4 million+ calls. 5 million+ directions.
In January 2013

Autocomplete Search
Recommendations
Automatic spelling
corrections

“What kind of movies do people like?”

More than 25 Million Streaming Members
50 Billion Events Per Day
30 Million plays every day
2 billion hours of video in 3
months
4 million ratings per day
3 million searches
Device location, time, day, week etc.
Social data

1.42 million images from
Hubble

K-means
clustering
Cascade correlation
neural networks

Novice
Analytics & machine learning
Expert

Stronger than the
sum of their parts

COLLECT | STORE | ANALYSE | SHARE
Direct Connect
SQS
Glacier
S3
EC2
Redshift
DynamoDB
Elastic MapReduce
CloudFront
EC2
Basic building blocks for every workload
Data pipeline
Import Export
Compute
Fleet

Big Data tools on AWS
In-memory
Hadoop and Friends
Managed Services
MPP Data warehouse
NoSQL
Scale out processing

In-memory: SAP HANA One, ..
Hadoop and Friends: Hive/Pig/Cascading, Shark/Spark
Managed Services: Treasure Data, Qubole, Splunk Storm, Sumo Logic, Karmasphere, ..
MPP Data warehouse: Redshift, Vertica, ..
NoSQL: DynamoDB, HBase, Cassandra, MongoDB, ..
Scale out processing: EC2, ..

How does EMR work?
[Diagram: S3 and an EMR cluster]
• Put the data into S3
• Choose: Hadoop distribution, # of nodes, types of nodes, custom configs, Hive/Pig/etc.
• Launch the cluster using the EMR console, CLI, SDK, or APIs
• Get the output from S3
You can also store everything in HDFS
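
The workflow on this slide can be sketched as a small data structure plus an ordered list of steps. This is only an illustration: `ClusterSpec`, `workflow_steps`, the cluster name and the `s3://example-bucket/...` locations are hypothetical, and nothing here calls AWS. The `m1.xlarge` instance type and node count of 3 are borrowed from the launch command later in the deck.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClusterSpec:
    """Hypothetical container for the choices made before launching a cluster."""
    name: str
    input_uri: str    # where the data was put in S3
    output_uri: str   # where the results are collected from S3
    node_count: int
    node_type: str
    applications: List[str] = field(default_factory=list)  # e.g. Hive, Pig


def workflow_steps(spec: ClusterSpec) -> List[str]:
    """Return the slide's steps, in order, filled in with the chosen settings."""
    apps = ", ".join(spec.applications) if spec.applications else "no extra tools"
    return [
        f"put the data into {spec.input_uri}",
        f"choose {spec.node_count} x {spec.node_type} nodes with {apps}",
        f"launch cluster '{spec.name}' via the console, CLI, SDK, or APIs",
        f"get the output from {spec.output_uri}",
    ]


spec = ClusterSpec(
    name="log-analysis",                       # hypothetical name
    input_uri="s3://example-bucket/input/",    # hypothetical location
    output_uri="s3://example-bucket/output/",  # hypothetical location
    node_count=3,
    node_type="m1.xlarge",
    applications=["Hive", "Pig"],
)

for step in workflow_steps(spec):
    print(step)
```

Keeping the choices in one object makes it easy to launch the same cluster shape repeatedly against different input locations.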

What can you run on EMR…
S3
EMR
EMR Cluster

EMR
EMR Cluster
Resize Clusters
S3
You can easily add and
remove nodes

On and Off
Fast Growth
Predictable peaks
Variable peaks
WASTE
CUSTOMER DISSATISFACTION

Resize Nodes with Spot Instances
Cost without Spot
10 node cluster running for 14 hours
Cost = 1.2 * 10 * 14 = $168

Cost without Spot
Cost = 1.2 * 10 * 14 = $168
Add 10 nodes on spot
On-demand cost = 1.2 * 10 * 7 = $84
Spot cost = 0.6 * 10 * 7 = $42
Total = $126
25% reduction in price
50% reduction in time
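
The Spot arithmetic can be reproduced with a short calculation. This is a sketch using the slide's figures, read here as $1.2 per node-hour on-demand and $0.6 per node-hour on Spot; it assumes that doubling the node count halves the run time and that the Spot nodes are available for the whole run. The names are illustrative.

```python
ON_DEMAND_RATE = 1.2  # $ per node-hour, from the slide
SPOT_RATE = 0.6       # $ per node-hour, from the slide


def cluster_cost(nodes: int, hours: float, rate: float) -> float:
    """Cost of running `nodes` machines for `hours` at an hourly `rate`."""
    return nodes * hours * rate


# Baseline: 10 on-demand nodes for 14 hours.
baseline_hours = 14
baseline = cluster_cost(10, baseline_hours, ON_DEMAND_RATE)

# With 10 extra Spot nodes the job is assumed to finish in half the time.
spot_hours = baseline_hours / 2
with_spot = (cluster_cost(10, spot_hours, ON_DEMAND_RATE)
             + cluster_cost(10, spot_hours, SPOT_RATE))

price_reduction = 1 - with_spot / baseline
time_reduction = 1 - spot_hours / baseline_hours

print(f"baseline ${baseline:.2f}, with Spot ${with_spot:.2f}")
print(f"price reduction {price_reduction:.0%}, time reduction {time_reduction:.0%}")
```

The saving comes from two effects together: the on-demand nodes run for half as long, and the added capacity is billed at the lower Spot rate.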

[Chart: capacity over time, traditional IT capacity vs. analytics needs]

[Chart: capacity over time, traditional IT capacity vs. Reserved Instances, On-demand and Spot]

Run the analysis
S3
Run clusters with your data in S3
Data is “streamed” in and
intermediate results stored in HDFS
EMR Cluster
1

When done, shut down the cluster
EMR Cluster
S3
When processing is complete, you
can terminate the cluster (and stop
paying)
1

EMR
EMR Cluster
You can also run 24/7
S3
If you run your jobs 24x7, you can also run a persistent cluster and use Reserved Instances (RI) to save costs
2

Option to use S3 along with HDFS
S3
EMR
EMR Cluster
• S3 is designed for 99.999999999% durability
• Elastic
• Version control against failure
• Run multiple clusters with a single
source of truth
• Quick recovery from failure
• Continuously resize clusters
3

Which is the data warehouse here?
 Courtesy http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html

Need faster query response time
3
Spark: separate MapReduce-like engine
In-memory data storage for fast query response time
Compatible with Hadoop storage API
Shark: port of Apache Hive on Spark
Compatible with existing Hive metastores
Similar speed-ups of up to 40x

elastic-mapreduce --create --alive --name "Spark/Shark Cluster" \
  --bootstrap-action s3://elasticmapreduce/samples/spark/0.7/install-spark-shark.sh \
  --bootstrap-name "Mesos/Spark/Shark" \
  --instance-type m1.xlarge --instance-count 3

Source: https://amplab.cs.berkeley.edu/2013/06/04/comparing-large-scale-query-engines/

Generation
Remove
Constraints

Thank You
sinhaar@amazon.com
aws.amazon.com/elasticmapreduce
aws.amazon.com/datapipeline
aws.amazon.com/big-data
@abysinha
