Building a Big Data & Analytics Platform using AWS

Chris Hampartsoumian
Technology Evangelist - ASEAN
End to End Data Flows on the Cloud
Structured, Unstructured & Streaming
July 2015

How is Cloud Computing important for Big Data
Applications?

How did Amazon…
…get into cloud computing?

11 Regions
30 Availability Zones
53 Edge locations
AWS Global Infrastructure

Why are customers adopting cloud computing?
Variable expense: Replace capital expenditure with variable expense
Elastic capacity: No need to guess capacity requirements and over-provision
Speed and agility: Infrastructure in minutes not weeks
Global Reach: Go global in minutes and reach a global audience

Your Applications
Mobile: Push Notifications, Mobile Analytics, Cognito, Cognito Sync
Network: VPC, Direct Connect, Route 53
Human Interaction: Support, Web Console
API Interaction: Command Line, Libraries, SDKs
Database: DynamoDB, RDS, ElastiCache
Deployment & Management: Elastic Beanstalk, OpsWorks, CloudFormation, CodeDeploy, CodePipeline, CodeCommit
Security & Administration: CloudWatch, Config, CloudTrail, IAM, Directory, KMS
Application: SQS, SWF, AppStream, Elastic Transcoder, SES, CloudSearch, SNS
Enterprise Applications: WorkSpaces, WorkMail, WorkDocs
Compute: EC2, ELB, Auto Scaling, Lambda, ECS
Analytics: Kinesis, Data Pipeline, Redshift, EMR, Machine Learning
Storage: EBS, Glacier, CloudFront, EFS, S3
AWS Global Infrastructure: 11 Regions, 30 Availability Zones, 53 Edge Locations

[Chart: data stores plotted by Structure (Low to High) and Size (Small to Large): Traditional Database, Hadoop, NoSQL, MPP Database]

Structured: MPP Databases (Amazon Redshift)
Unstructured: Hadoop (Amazon EMR)
Streaming: Real-time Analysis (Amazon Kinesis)

• Standard SQL
• Optimized for fast analysis
• Very scalable

Amazon Redshift
MPP SQL Database
Optimised for Analytics
Gigabytes to Petabytes
Fully relational
Fully managed

JDBC/ODBC
ID Name
1 John Smith
2 Jane Jones
3 Peter Black
4 Pat Partridge
5 Sarah Cyan
6 Brian Snail
1 John Smith
4 Pat Partridge
2 Jane Jones
5 Sarah Cyan
3 Peter Black
6 Brian Snail

Dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• With row storage you do unnecessary I/O
• To get average Amount by State, you have to read everything
ID Age State Amount
123 20 CA 500
345 25 WA 250
678 40 FL 125
957 37 WA 375

• With column storage, you only read the data you need
ID Age State Amount
123 20 CA 500
345 25 WA 250
678 40 FL 125
957 37 WA 375
• Column storage
• Zone maps
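
The difference between the two layouts can be illustrated with a small sketch. It uses the four rows from the slide's table; the function names and the "values read" counter are illustrative assumptions, not how Redshift is implemented.

```python
# Rows from the slide's example table: (ID, Age, State, Amount).
rows = [
    (123, 20, "CA", 500),
    (345, 25, "WA", 250),
    (678, 40, "FL", 125),
    (957, 37, "WA", 375),
]

# Column layout: each column is stored as its own list.
columns = {
    "ID": [r[0] for r in rows],
    "Age": [r[1] for r in rows],
    "State": [r[2] for r in rows],
    "Amount": [r[3] for r in rows],
}


def _averages(pairs):
    """Average the amounts per state from (state, amount) pairs."""
    totals = {}
    for state, amount in pairs:
        total, count = totals.get(state, (0, 0))
        totals[state] = (total + amount, count + 1)
    return {state: total / count for state, (total, count) in totals.items()}


def avg_amount_by_state_rows(rows):
    """Row storage: every value of every row is read, even ID and Age."""
    values_read = sum(len(row) for row in rows)
    return _averages((row[2], row[3]) for row in rows), values_read


def avg_amount_by_state_columns(columns):
    """Column storage: only the State and Amount columns are read."""
    states, amounts = columns["State"], columns["Amount"]
    values_read = len(states) + len(amounts)
    return _averages(zip(states, amounts)), values_read


row_result, row_reads = avg_amount_by_state_rows(rows)
col_result, col_reads = avg_amount_by_state_columns(columns)
print(row_result, row_reads)
print(col_result, col_reads)
```

Both functions return the same averages; the column version touches half as many values in this four-column example, and the gap widens as tables get wider.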

• Column storage
• Zone maps
Block 1: 10 | 13 | 14 | 26 | … | 100 | 245 | 324 (min 10, max 324)
Block 2: 375 | 393 | 417 | … | 512 | 549 | 623 (min 375, max 623)
Block 3: 637 | 712 | 809 | … | 834 | 921 | 959 (min 637, max 959)
• Track the minimum and maximum value for each block
• Skip over blocks that don’t contain relevant data
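
A minimal sketch of the zone map idea, assuming three sorted blocks that hold only the values visible on the slide (the elided values are left out). The `scan_range` helper is a hypothetical name used for illustration.

```python
# Three storage blocks, using only the values shown on the slide.
blocks = [
    [10, 13, 14, 26, 100, 245, 324],
    [375, 393, 417, 512, 549, 623],
    [637, 712, 809, 834, 921, 959],
]

# The zone map records the minimum and maximum value of each block.
zone_map = [(min(block), max(block)) for block in blocks]


def scan_range(lo, hi):
    """Return values in [lo, hi] and the number of blocks actually read."""
    matches = []
    blocks_read = 0
    for (block_min, block_max), block in zip(zone_map, blocks):
        # Skip blocks whose min/max range cannot overlap the query range.
        if block_max < lo or block_min > hi:
            continue
        blocks_read += 1
        matches.extend(value for value in block if lo <= value <= hi)
    return matches, blocks_read


matches, blocks_read = scan_range(400, 600)
print(matches, blocks_read)
```

For a query between 400 and 600, only the middle block overlaps the range, so the other two blocks are never read.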

Q3. What’s good about it?
Performance, Scalability, Ease of Use, Cost

Performance Evaluation on 2B Rows
Aggregate by month 02:08:35 00:35:46 00:00:12
Traditional
SQL Database
Amazon
Redshift

160 GB DW2.L
8XL 8XL 8XL 8XL 8XL 8XL 8XL 8XL 8XL 8XL
2 PB

Q4. How do I integrate with Redshift?

Works with your existing analysis tools
JDBC/ODBC
Amazon Redshift

Loading data
Sources: S3, DynamoDB, EMR, Linux
Destination: Redshift
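
As one illustration of the S3 path, Redshift loads files with its SQL `COPY` command. The sketch below only builds the statement text; the table name, bucket path, and role ARN are placeholder assumptions, and the helper does no escaping, so it is not safe for untrusted input.

```python
def build_copy_statement(table, s3_path, iam_role_arn, gzip=False):
    """Build a Redshift COPY statement that loads CSV files from S3."""
    parts = [
        f"COPY {table}",
        f"FROM '{s3_path}'",
        f"IAM_ROLE '{iam_role_arn}'",
        "FORMAT AS CSV",
    ]
    if gzip:
        # GZIP tells COPY that the input files are gzip-compressed.
        parts.append("GZIP")
    return "\n".join(parts) + ";"


# Placeholder values for illustration only.
statement = build_copy_statement(
    table="sales",
    s3_path="s3://examplebucket/sales/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
    gzip=True,
)
print(statement)
```

The resulting text would be run through any SQL client connected over JDBC/ODBC.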

Amazon
Redshift
Source
Systems
ETL

Input
File
Hadoop cluster
Functions Output
1. Very Flexible
2. Very Scalable
3. Often Transient

Amazon Elastic MapReduce (EMR)

Q1. What is it?
Managed Hadoop

Input
File
EMR cluster (EC2 instances)
Functions Output

EMR
EMR Cluster, S3
1. Put the data into S3
2. Choose: Hadoop distribution, # of nodes, types of nodes, Hadoop apps like Hive/Pig/HBase
3. Launch the cluster using the EMR console, CLI, SDK, or APIs
4. Get the output from S3
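
The launch step might look roughly like this through the SDK route, assuming the boto3 library and its EMR `run_job_flow` call. The cluster name, release label, instance type, bucket, and role names are placeholder assumptions; the sketch builds the request without sending it, and `launch_cluster` is only defined, not called.

```python
def build_cluster_request(name, release_label, node_count, instance_type, apps, log_uri):
    """Collect the choices from step 2 into a run_job_flow request."""
    return {
        "Name": name,
        "ReleaseLabel": release_label,          # Hadoop distribution / release
        "LogUri": log_uri,
        "Applications": [{"Name": app} for app in apps],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": node_count,        # number of nodes
            # Transient cluster: shut down once there are no steps left.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",   # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request):
    """Send the request to EMR (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the sketch runs without boto3 installed

    emr = boto3.client("emr")
    return emr.run_job_flow(**request)["JobFlowId"]


# Placeholder values for illustration only.
request = build_cluster_request(
    name="example-cluster",
    release_label="emr-5.36.0",
    node_count=5,
    instance_type="m5.xlarge",
    apps=["Hive", "Pig"],
    log_uri="s3://examplebucket/logs/",
)
print(request["Applications"])
```

Input and output stay in S3, so the same request can be reused to start a fresh cluster for each job.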

EMR
EMR Cluster
S3
You can easily resize
the cluster
And launch parallel
clusters using the same
data

EMR
EMR Cluster
S3
Use Spot
nodes to save
time and
money

EMR Cluster, S3
When processing is complete, you
can terminate the cluster (and stop
paying)

Q3. What’s good about it?
Scalability, Cost & Ease of Use

EMR with spot instances
Scenario #1: Cost without Spot
Duration: 14 Hours
4 instances * 14 hrs * $0.50 = $28
Scenario #2: Cost with Spot
Duration: 7 Hours
4 instances * 7 hrs * $0.50 = $14 +
5 instances * 7 hrs * $0.25 = $8.75
Total = $22.75
Time Savings: 50%
Cost Savings: ~19%
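
The arithmetic for the two scenarios can be worked through in a few lines. The instance counts, durations, and hourly rates are taken from the slide; the variable and function names are illustrative.

```python
ON_DEMAND_RATE = 0.50  # dollars per instance-hour, from the slide
SPOT_RATE = 0.25       # dollars per instance-hour, from the slide


def cost(instances, hours, rate):
    """Cost of running a group of instances for a number of hours."""
    return instances * hours * rate


# Scenario #1: 4 on-demand instances for 14 hours.
scenario_1 = cost(4, 14, ON_DEMAND_RATE)

# Scenario #2: the same 4 on-demand instances plus 5 Spot instances, 7 hours.
scenario_2 = cost(4, 7, ON_DEMAND_RATE) + cost(5, 7, SPOT_RATE)

time_saving = (14 - 7) / 14
cost_saving = (scenario_1 - scenario_2) / scenario_1

print(f"Scenario 1: ${scenario_1:.2f}")
print(f"Scenario 2: ${scenario_2:.2f}")
print(f"Time saving: {time_saving:.0%}")
print(f"Cost saving: {cost_saving:.2%}")
```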

Master instance group
EMR cluster
Task instance group
Core instance group
HDFS HDFS
Amazon S3
Great for
Spot Instances

Kinesis
A fully managed service for real-time processing
of high-volume, streaming data.

[Diagram: Data Sources send records to a Kinesis Stream spanning three Availability Zones; consumers perform Logging, Metrics, Analysis, and Machine Learning; destinations are S3, DynamoDB, Redshift, and EMR]

Putting data into Kinesis
• Each shard:
  • 1000 Tx Per Second
  • 1MB Per Second
  • 50KB Payload Per Tx
• Messages kept for 24 hours
• Simple PUT interface to store data in Kinesis
• A Partition Key is used to distribute the PUTs across Shards
• A unique Sequence # is created
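
A rough sketch of how a partition key spreads PUTs across shards. Kinesis hashes the partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains that value. The sketch assumes the shards split the hash space evenly, which is a simplification; `shard_for_key` is a hypothetical helper, not part of any SDK.

```python
import hashlib

HASH_SPACE = 2 ** 128  # size of the 128-bit MD5 hash key space


def shard_for_key(partition_key, shard_count):
    """Map a partition key to a shard index, assuming evenly split ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    # Scale the 128-bit hash key down to an index in [0, shard_count).
    return hash_key * shard_count // HASH_SPACE


# Hypothetical partition keys for illustration.
for key in ["user-1", "user-2", "user-3"]:
    print(key, "-> shard", shard_for_key(key, shard_count=4))
```

Records with the same partition key always land on the same shard, so a key with few distinct values can concentrate traffic on one shard.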

Getting data out of Kinesis
Kinesis Client Library (KCL):
• Abstracts code from individual shards
• Starts a Kinesis Worker for each shard
• Increases and decreases workers
• Tracks a Worker’s location in the stream
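
The worker-per-shard and position-tracking ideas can be sketched with a toy, in-memory model. This is not the KCL interface: `ShardWorker`, the `stream` dictionary, and the integer sequence numbers are assumptions made purely for illustration.

```python
class ShardWorker:
    """Toy stand-in for a per-shard worker; not the real KCL interface."""

    def __init__(self, shard_id):
        self.shard_id = shard_id
        self.checkpoint = None  # last sequence number fully processed

    def process(self, records, handler):
        """Handle records newer than the checkpoint, then advance it."""
        for sequence_number, data in records:
            if self.checkpoint is not None and sequence_number <= self.checkpoint:
                continue  # already handled on an earlier pass
            handler(self.shard_id, data)
            self.checkpoint = sequence_number


# Hypothetical stream contents: shard id -> ordered (sequence number, data).
stream = {
    "shard-0": [(1, "a"), (2, "b")],
    "shard-1": [(1, "c")],
}

# One worker per shard.
workers = {shard_id: ShardWorker(shard_id) for shard_id in stream}

seen = []


def record_handler(shard_id, data):
    """Application logic; it never deals with shard bookkeeping itself."""
    seen.append((shard_id, data))


for shard_id, worker in workers.items():
    worker.process(stream[shard_id], record_handler)

# A new record arrives; re-reading the shard only handles the new one.
stream["shard-0"].append((3, "d"))
workers["shard-0"].process(stream["shard-0"], record_handler)
print(seen)
```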

Easy Administration
Real-time Performance
High Throughput, Elastic
Integration: S3, Redshift, DynamoDB, Storm, ElasticSearch
Build Real-time Applications
Low Cost

A Legacy of Machine
Learning at Amazon
“Customers who bought this
also bought…”

Why Did We Build Amazon Machine Learning?

Three types of data-driven development
Retrospective analysis and reporting: Amazon Redshift, Amazon RDS, Amazon S3, Amazon EMR
Here-and-now real-time processing and dashboards: Amazon Kinesis, Amazon EC2, AWS Lambda
Predictions to enable smart applications

Machine learning and smart applications
• Machine learning is the technology that automatically
finds patterns in your data and uses them to make
predictions for new data points as they become
available

Your data + machine learning = smart applications

Smart applications by example
Based on what you know about the user:
Will they use your product?
Based on what you know about an order:
Is this order fraudulent?
Based on what you know about a news article:
What other articles are interesting?

Challenges to Building Smart Applications Today
Expertise: Limited supply of data scientists; expensive to hire or outsource
Technology: Many choices, few mainstays; difficult to use and scale
Operationalization: Complex and error-prone data workflows; custom platforms and APIs

What is Amazon Machine Learning?

Amazon Machine Learning
• Easy to use, managed machine learning service
built for developers
• Robust, powerful machine learning technology
based on Amazon’s internal systems
• Create models using your data already stored in
the AWS cloud
• Deploy models to production in seconds

Easy to use and developer-friendly
• Use the intuitive, powerful service console to build and
explore your initial models
• Data retrieval
• Model training, quality evaluation, fine-tuning
• Deployment and management
• Automate model lifecycle with fully featured APIs and
SDKs
• Java, Python, .NET, JavaScript, Ruby, PHP
• Easily create smart iOS and Android applications with AWS
Mobile SDK

Powerful machine learning technology
• Based on Amazon’s battle-hardened internal systems
• Not just the algorithms:
• Smart data transformations
• Input data and model quality alerts
• Built-in industry best practices
• Grows with your needs
• Train on up to 100 GB of data
• Generate billions of predictions
• Obtain predictions in batches or real-time

Integrated with AWS Data Ecosystem
• Access data that is stored in Amazon S3, Amazon
Redshift, or MySQL databases in RDS
• Output predictions to Amazon S3 for easy integration
with your data flows
• Use AWS Identity and Access Management (IAM) for
fine-grained data-access permission policies

Fully-managed model and prediction services
• End-to-end service, with no servers to provision and
manage
• One-click production model deployment
• Programmatically query model metadata to enable
automatic retraining workflows
• Monitor prediction usage patterns with Amazon
CloudWatch metrics

Pay-as-you-go and inexpensive
• Data analysis, model training, and evaluation:
$0.42/instance hour
• Batch predictions: $0.10/1000
• Real-time predictions: $0.10/1000
• + hourly capacity reservation charge
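
A small estimator built from the rates listed above. The slide does not give the hourly capacity reservation rate, so it is a caller-supplied parameter that defaults to zero; the function name and the example workload are assumptions for illustration.

```python
COMPUTE_RATE_PER_HOUR = 0.42     # analysis, training, evaluation (from the slide)
PREDICTION_RATE_PER_1000 = 0.10  # batch and real-time predictions (from the slide)


def estimate_cost(compute_hours, batch_predictions=0, realtime_predictions=0,
                  reserved_hours=0, reservation_rate_per_hour=0.0):
    """Estimate charges in dollars.

    reservation_rate_per_hour is not stated on the slide, so the caller
    must supply it; it defaults to 0.0 here.
    """
    compute = compute_hours * COMPUTE_RATE_PER_HOUR
    predictions = (batch_predictions + realtime_predictions) / 1000 * PREDICTION_RATE_PER_1000
    reservation = reserved_hours * reservation_rate_per_hour
    return compute + predictions + reservation


# Hypothetical workload: 10 compute hours and one million batch predictions.
total = estimate_cost(compute_hours=10, batch_predictions=1_000_000)
print(f"${total:.2f}")
```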

Three Supported Types of Predictions
• Binary Classification
• Predict the answer to a Yes/No question
• Multi-class classification
• Predict the correct category from a list
• Regression
• Predict the value of a numeric variable
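
The three prediction types correspond to the model type strings `BINARY`, `MULTICLASS`, and `REGRESSION` in the Amazon ML API. The sketch below picks one from a sample of target values; the heuristic itself is an assumption for illustration, not something the service does for you.

```python
def suggest_model_type(target_values):
    """Suggest a model type from sample target values (illustrative heuristic)."""
    distinct = set(target_values)
    if len(distinct) == 2:
        # Two possible answers: a Yes/No style question.
        return "BINARY"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct):
        # Numeric target with many values: predict a number.
        return "REGRESSION"
    # Otherwise, pick the correct category from a list.
    return "MULTICLASS"


print(suggest_model_type(["yes", "no", "yes"]))
print(suggest_model_type([13.2, 8.5, 21.0, 4.4]))
print(suggest_model_type(["sports", "politics", "tech"]))
```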

How Do I Get Started Using Amazon Machine Learning?

Get Started Quickly
• Create, access, and manage all Amazon
ML entities through the AWS
Management Console
• Easily learn to build a model with the
tutorial dataset provided
• Add prediction capabilities to your iOS
and Android applications with AWS
Mobile SDK
• Use Amazon ML APIs, CLIs, or SDKs

Building smart applications with Amazon ML
1. Build model
2. Evaluate and optimize
3. Retrieve predictions

1. Train model
2. Evaluate and optimize
3. Retrieve predictions
- Create a Datasource object pointing to your data
- Explore and understand your data
- Transform data and train your model

Explore and understand your data

Train your model
>>> import boto
>>> # Connect to Amazon Machine Learning (boto 2.x)
>>> ml = boto.connect_machinelearning()
>>> model = ml.create_ml_model(
...     ml_model_id='my_model',
...     ml_model_type='REGRESSION',
...     training_data_source_id='my_datasource')

1. Train model
2. Evaluate and optimize
3. Retrieve predictions
- Understand model quality
- Adjust model interpretation

Fine-tune model interpretation

1. Train model
2. Evaluate and optimize
3. Retrieve predictions
- Batch predictions
- Real-time predictions

Batch predictions
• Asynchronous, large-volume prediction generation
• Request through service console or API
• Best for applications that deal with batches of data records
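>>> import boto
>>> ml = boto.connect_machinelearning()
>>> # Results are written asynchronously to the S3 output location
>>> batch_prediction = ml.create_batch_prediction(
...     batch_prediction_id='my_batch_prediction',
...     batch_prediction_data_source_id='my_datasource',
...     ml_model_id='my_model',
...     output_uri='s3://examplebucket/output/')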

Real-time predictions
• Synchronous, low-latency, high-throughput prediction generation
• Request through service API or server or mobile SDKs
• Best for interactive applications that deal with individual data records
>>> import boto
>>> ml = boto.connect_machinelearning()
>>> # Synchronous call against the model's real-time endpoint
>>> ml.predict(
...     ml_model_id='my_model',
...     predict_endpoint='example_endpoint',
...     record={'key1': 'value1', 'key2': 'value2'})
{
    'Prediction': {
        'predictedValue': 13.284348,
        'details': {
            'Algorithm': 'SGD',
            'PredictiveModelType': 'REGRESSION'
        }
    }
}

Architecture Patterns for Smart
Applications

Batch predictions with Amazon EMR
1. Raw data in Amazon S3
2. Process data with Amazon EMR
3. Aggregated data in Amazon S3
4. Query for predictions with Amazon ML batch API
5. Predictions in Amazon S3
6. Your application

Batch predictions with Amazon Redshift
1. Structured data in Amazon Redshift
2. Amazon ML batch API
3. Predictions in Amazon S3
4. Load predictions into Amazon Redshift -or- Read prediction results directly from Amazon S3
5. Your application

Real-time predictions for interactive applications
Your application
Amazon ML real-time API

Thank you!
@AWSCloudSEAsia
Chris Hampartsoumian
Technology Evangelist ASEAN
