SlideShare uma empresa Scribd logo
1 de 52
Baixar para ler offline
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Deep Di ve: Mi grati ng Bi g Data Workl oad s to AWS
B r u n o F a r i a , S r . E M R S o l u t i o n A r c h i t e c t , A W S
R i t e s h S h a h , S r . P r o g r a m M a n a g e r , V a n g u a r d
A B D 3 1 2
N o v e m b e r 2 8 , 2 0 1 7
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Agenda
• Deconstructing current big data environments
• Identifying challenges with on-premises or unmanaged
architectures
• Migrating components to Amazon EMR and Amazon Web
Services (AWS) analytics services
– Choosing the right engine for the job
– Architecting for cost and scalability
• Customer migration story
– How Vanguard migrated their big data workload to AWS
• Q&A
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Deconstructing on-premises
big data environments
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
• A cluster of 1U machines
• Typically 12 cores, 32/64 GB
RAM, and 6-8 TB of HDD ($3-4K)
• Networking switches and racks
• Open-source distribution of
Hadoop or a fixed-licensing term
by commercial distributions
• Different node roles
• HDFS uses local disk and is sized
for 3x data replication
On-premises Hadoop clusters
Server rack 1
(20 nodes)
Server rack 2
(20 nodes)
Server rack N
(20 nodes)
Core
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
On premises: Workload types running on the
same cluster
• Large-scale ETL
• Interactive queries
• Machine learning and data science
• NoSQL
• Stream processing
• Search
• Data warehouses
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
On premises: Swim lane of jobs
Over-utilized Under-utilized
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
• Management of the cluster
(failures, hardware replacement,
restarting services, expanding
cluster)
• Configuration management
• Tuning of specific jobs or hardware
• Managing development and test
environments
• Backing up data and disaster
recovery
On premises: Role of a big data
administrator
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Identifying on-premises
challenges
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
On premises: Over-utilization and idle capacity
• Tightly coupled compute and storage
requires buying excess capacity
• Can be over-utilized during peak hours and
under-utilized at other times
• Results in high costs and low efficiency
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
On premises: System management
difficulties
• Managing distributed applications and availability
• Durable storage and disaster recovery
• Adding new frameworks and doing upgrades
• Multiple environments
• Need team to manage cluster and procure hardware
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Migrating workloads to AWS
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Key migration considerations
• Do not lift and shift
• Deconstruct workloads and use the right tool for the job
• Decouple storage and compute with Amazon Simple Storage
Storage Service (Amazon S3)
• Design for cost and scalability
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
De constru ct workloads —analytics type s & frame works
Batch Takes minutes to hours
Example: Daily/weekly/monthly reports
Amazon EMR (MapReduce, Hive, Pig, Spark), AWS Glue
Interactive Takes seconds
Example: Self-service dashboards
Amazon Redshift, Amazon Athena, Amazon EMR (Presto, Spark)
Stream Takes milliseconds to seconds
Example: Fraud alerts, 1-minute metrics
Amazon EMR (Spark Streaming, Flink), Amazon Kinesis Analytics, KCL, Storm
Artificial
intelligence
Takes milliseconds to minutes
Example: Fraud detection, forecast demand, text to speech
Amazon AI (Amazon Lex, Amazon Polly, Amazon ML, Amazon Rekognition), Amazon
EMR (Spark ML), AWS Deep Learning AMI
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Translate use cases to the right tools
Amazon EMR
Applications
Hive, Pig, Spark SQL/Streaming/ML, Flink, Mahout, Sqoop, Phoenix
HBase
Presto
Batch
MapReduce
Interactive
Tez
In memory
Spark
Streaming
Flink
YARN
Cluster resource management
Storage
Amazon S3 (EMRFS), HDFS
AWS Glue
Amazon Athena
Amazon
Redshift/Spectrum
• Low-latency SQL -> Amazon Athena, Presto, Amazon Redshift/Spectrum
• Data warehouse/reporting -> Spark, Hive, AWS Glue, Amazon Redshift
• Management and monitoring -> Amazon CloudWatch, AWS console, Ganglia
• HDFS -> Amazon S3
• Notebooks -> Zeppelin Notebook, Jupyter (via bootstrap action)
• Query console -> Amazon Athena, Hue
• Security -> Ranger (CF template), HiveServer2, AWS IAM roles
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Many storage layers to choose from
Amazon RDSAmazon ElastiCahe
Amazon
Redshift
Amazon
DynamoDB
Amazon Kinesis
Amazon S3
Amazon
EMR
Amazon ES
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon S3 as your persistent data store
• Natively supported by big data frameworks (Spark, Hive, Presto, and
others)
• Decouple storage and compute
• No need to run compute clusters for storage (unlike HDFS)
• Can run transient Amazon EMR clusters with Amazon EC2 Spot
Instances
• Multiple and heterogeneous analysis clusters and services can
use the same data
• Designed for 99.999999999% durability
• No need to pay for data replication
• Secure—SSL, client/server-side encryption at rest
• Low cost
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What about HDFS and data tiering?
• Use HDFS for very frequently accessed (hot) data
• Use Amazon S3 Standard for frequently accessed data
• Use Amazon S3 Standard – IA for less frequently accessed data
• Use Amazon Glacier for archiving cold data
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Decouple storage and compute
Compute
External metastore
Amazon
RDS
AWS
Glue Data
Catalog
Amazon EMR
Amazon Athena
Amazon Redshift
SpectrumAmazon S3
Storage
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
A m a z o n S 3 t i p s : P a r t i t i o n s , c o m p r e s s i o n , a n d f i l e f o r m a t s
• Partition your data for improved read performance
• Read-only files the query needs
• Reduce amount of data scanned
• Optimize file sizes
• Avoid files that are too small (generally, anything less than 128 MB)
• Fewer files, matching closely to block size
• Fewer calls to Amazon S3 (faster listing)
• Fewer network/HDFS requests
• Compress data set to minimize bandwidth from Amazon S3 to Amazon EC2
• Make sure you use splittable compression or have each file be the
optimal size for parallelization on your cluster
• Columnar file formats like Parquet can give increased performance on reads
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
ALTER TABLE app_logs ADD PARTITION (year='2015',month='01',day='01') LOCATION
's3://bucket_name/app/plaintext/year=2015/month=01/day=01/’;
ALTER TABLE elb_logs ADD PARTITION (year='2015',month='01',day='01') LOCATION
's3://bucket_name/elb/plaintext/2015/01/01/’;
ALTER TABLE orders DROP PARTITION (dt='2014-05-14’,country='IN'),
PARTITION (dt='2014-05-15’,country='IN');
ALTER TABLE customers PARTITION (zip='98040', state='WA') SET LOCATION
's3://bucket_name/new_customers/zip=98040/state=WA’;
MSCK REPAIR TABLE table_name;  Only works with Hive-compatible partitions
Data partitioning—examples
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
File formats
Columnar─Parquet and ORC
• Compressed
• Column-based read-optimized
• Integrated indexes and stats
Row─Avro
• Compressed
• Row-based read-optimized
• Integrated indexes and stats
Text─xSV, JSON
• May or may not be compressed
• Not optimized
• Generic and malleable
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
SELECT count(*) as count FROM examples_csv
Run time: 36 seconds, Data scanned: 15.9GB
SELECT count(*) as count FROM examples_parquet
Run time: 5 seconds, Data scanned: 4.93GB
File Format – Examples
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
File formats—considerations
Scanning
• xSV and JSON require scanning entire file
• Columnar ideal when selecting only a subset of columns
• Row ideal when selecting all columns of a subset of rows
Read performance
• Text – SLOW
• Avro – Optimal (specific to use case)
• Parquet and ORC – Optimal (specific to use case)
Write performance
• Text – SLOW
• Avro – Good
• Parquet and ORC – Good (has some overhead with large datasets)
* Highly dependent on the dataset
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
External hive metastore
Use an external metastore when you require a persistent metastore or a metastore
shared by different clusters, services, and applications. There are two options for an
external metastore:
• Amazon Relational Database Service (Amazon RDS)/Amazon Aurora
• AWS Glue Data Catalog (Amazon EMR version 5.8.0 or later only)
• Search metastore
• Utilize crawlers to detect new data, schema, partitions
• Schema and version management
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
External metastore—AWS Glue Data Catalog
You can choose to use the AWS Glue Data Catalog to store external table metadata for Hive
and Spark instead of utilizing an on-cluster or self-managed Hive metastore. This allows you to
more easily store metadata for your external tables on Amazon S3 outside of your cluster.
You can configure your Amazon EMR clusters to use the AWS Glue Data Catalog from the
Amazon EMR console, AWS Command Line Interface (CLI), or the AWS SDK with the Amazon
EMR API.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Instance fleets for advanced Spot provisioning
Master node Core instance fleet Task instance fleet
• Provision from a list of instance types with Spot and On-Demand
• Launch in the most optimal Availability Zone based on capacity/price
• Spot block support
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Transient or long-running workloads
Transient
Long-running
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon EMR: Lower costs with Auto Scaling
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Security—governance and auditing
• AWS Identity and Access Management (IAM)
• Amazon Cognito
• Amazon CloudWatch and AWS CloudTrail
• AWS Key Management Service (AWS KMS)
• AWS Directory Service
• Amazon S3 access logs for cluster Amazon S3 access
• Amazon Macie
• YARN and application logs
• Apache Ranger
Security &
Governance IAM Amazon
CloudWatch
AWS
CloudTrail
AWS
KMS
AWS
CloudHSM
AWS Directory
Service
Amazon
Cognito
Amazon
Macie
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Migrating workloads to AWS at Vanguard
Ritesh Shah, Sr. Program Manager
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Vanguard—background
Vanguard is one of the world's largest investment
companies, offering a large selection of low-cost
mutual funds, ETFs, advice, and related services
Core purpose – To take a stand for all investors, to
treat them fairly, and to give them the best chance
for investment success
Oldest fund – Wellington Fund (inception 1929)
Began operations – May 1, 1975 in Valley Forge, PA
Funds – Over 180 U.S. funds (including variable
annuity portfolios) and 190 additional funds in
markets outside the United States
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Deconstructing on-premises workloads
Oracle
DB2
MS
SQL
Files
Hadoop
cluster
Data analyst
Data wrangler
Data scientist
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Deconstructing on-premises workloads
Oracle
DB2
MS
SQL
Files
Hadoop
cluster
Data analyst
Data wrangler
Data scientist
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
• Tightly coupled compute and
storage
• Over-utilized during peak
hours and under-utilized at
other times
• Cross impact from different
types of workloads
• Lack of DR environment
• Dedicated team to maintain
cluster
Why migrate from on premises to AWS?
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
• Tightly coupled compute and
storage
• Over-utilized during peak
hours and under-utilized at
other times
• Cross impact from different
types of workloads
• Lack of DR environment
• Dedicated team to maintain
cluster
• Long procurement cycles for
hardware
• Long setup time of hardware
• Complex upgrade coordination
needs leading to infrequent
upgrades
• High total cost of ownership
(TCO)
• Difficult to charge back LOB
Why migrate from on premises to AWS?
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Demystifying EMR workloads
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Foundational requirements
Secure
Encryption in flight & at rest
Lower TCO
Pay as per usage
Flexible
Customize per workload
Full control
Infrastructure as code
Scalable
Compute & storage
Managed
Reduced administration
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Ingestion workloads
Oracle
DB2
MS
SQL
Files
Amazon EMR
Amazon S3
• Ephemeral EMR–based ingestion clusters
• Amazon S3–based data lake
• In-flight and at-rest data encryption using KMS
• S3 bucket access control policies using IAM
• Step API used to launch Oozie–based workflows
• Sqoop, Snowball, and CDC (Attunity)–based
data ingestion
• Hive/Tez and Spark–based data processing
• CloudWatch plus Splunk–based monitoring
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Ingestion workloads
Oracle
DB2
MS
SQL
Files
Amazon EMR
Amazon S3
• Ephemeral EMR–based ingestion clusters
• Amazon S3–based data lake
• In-flight and at-rest data encryption using KMS
• S3 bucket access control policies using IAM
• Step API used to launch Oozie–based workflows
• Sqoop, Snowball, and CDC (Attunity)–based
data ingestion
• Hive/Tez and Spark–based data processing
• CloudWatch plus Splunk–based monitoring
Lessons learned
• Segregation of infrastructure and ingestion code
• Instance type and cluster sizing
• Obfuscation of PII data
• Segregation of data based on LOB rules
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Conceptual diagram—ingestion workload
On-premises data center AWS cloud
Control-M
Ephemeral
EMR
clusters
EC2 job
nodeELB
Ephemeral
EC2 FTP node
Persistent
EC2 FTP node
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Conceptual diagram—ingestion workload
On-premises data center AWS cloud
Control-M
DBMS
Hive/Hue
metastore
Ephemeral
EMR clusters
S3 storage
buckets
EC2 job
nodeELB
Ephemeral
EC2 FTP node
Persistent
EC2 FTP node
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Conceptual diagram—ingestion workload
On-premise data center AWS cloud
Control-M
DBMS
Hive/Hue
metastore
Amazon
PostgreSQL
Ephemeral
EMR clusters
S3 Storage
buckets
EC2 job
nodeELB
Ephemeral
EC2 FTP node
Persistent
EC2 FTP node
Attunity EC2
agent
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Conceptual diagram—ingestion workload
On-premises data center AWS cloud
FTP server
Control-M
DBMS
Hive/Hue
metastore
Amazon
PostgreSQL
Ephemeral
EMR clusters
S3 storage
buckets
EC2 job
nodeELB
Ephemeral
EC2 FTP node
Persistent
EC2 FTP node
Attunity EC2
agent
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Conceptual diagram—ingestion workload
On-premises data center AWS cloud
DTS
FTP server
Control-M
DBMS
Hive/Hue
metastore
Amazon
PostgreSQL
Ephemeral
EMR clusters
S3 Storage
buckets
EC2 job
nodeELB
Ephemeral
EC2 FTP node
Persistent
EC2 FTP node
Attunity EC2
agent
AWS Snowball Amazon
CloudWatch
IAM AWS KMS
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Analytics workloads
• Persistent EMR-based analytics clusters
• Hive/Tez and Presto as query engines
• Spark and Python-based distributed processing
• Hue and Tableau as user interactive tools
• Zeppelin and Jupyter-based notebook for
interactive and collaborative data exploration
• Authorization using Hive SQL Auth and IAM
bucket policies for Amazon S3
Data analyst
Data wrangler
Data scientist
Amazon EMR
Amazon S3
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Analytics workloads
• Persistent EMR-based analytics clusters
• Hive/Tez and Presto as query engines
• Spark and Python-based distributed processing
• Hue and Tableau as user interactive tools
• Zeppelin and Jupyter-based notebook for
interactive and collaborative data exploration
• Authorization using Hive SQL Auth and IAM
bucket policies for Amazon S3
Lessons learned
• Data governance (metadata, lineage, and
others)
• Audit of access to data
• Instance type and cluster sizing
• Integration of user tools with query and
processing engines
Data analyst
Data wrangler
Data scientist
Amazon EMR
Amazon S3
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Conceptual diagram—analytics workload
On-
premises
data
center
AWS cloud
Control-M
Hive/Hue
metastore
S3 storage
buckets
EC2 job
node
ELB
Amazon
CloudWatch
IAM AWS KMS
Query EMR
clusters
EMR applications
EC2
Tableau
node
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Conceptual diagram—analytics workload
On-
premises
data
center
AWS cloud
Control-M
Hive/Hue
metastore
Data science
EMR
clusters
S3 storage
buckets
EC2 job
node
ELB
Amazon
CloudWatch
IAM AWS KMS
Query EMR
clusters
EMR applications
EC2
Tableau
node
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Conceptual diagram—analytics workload
On-
premises
data
center
AWS cloud
Control-M
Hive/Hue
metastore
Data science
EMR
clusters
S3 storage
buckets
EC2 job
node
ELB
Amazon
CloudWatch
IAM AWS KMS
Query EMR
clusters
EMR applications
EC2
Tableau
node
User zone
Data analyst
Data wrangler
Data scientist
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What’s next
AWS
Lambda
Amazon
Glacier
Amazon
Redshift
AWS DMS
Amazon
Athena
Amazon
CloudSearch
Amazon
Kinesis
Amazon
QuickSight
AWS Data
Pipeline/AWS
Glue
Amazon Lex Amazon Polly Amazon
Rekognition
Amazon ML
Enable additional AWS services for:
• Real-time data ingestion
• Enhanced data processing pipelines
• Visualization and analytics use case
• Experimentation with AI and ML
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Q&A
Thank you!
a w s . a m a z o n . c o m / e m r
b l o g s . a w s . a m a z o n . c o m / b i g d a t a
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Please remember to rate this
session in the mobile app

Mais conteúdo relacionado

Mais procurados

ABD324_Migrating Your Oracle Data Warehouse to Amazon Redshift Using AWS DMS ...
ABD324_Migrating Your Oracle Data Warehouse to Amazon Redshift Using AWS DMS ...ABD324_Migrating Your Oracle Data Warehouse to Amazon Redshift Using AWS DMS ...
ABD324_Migrating Your Oracle Data Warehouse to Amazon Redshift Using AWS DMS ...
Amazon Web Services
 

Mais procurados (20)

ABD206-Building Visualizations and Dashboards with Amazon QuickSight
ABD206-Building Visualizations and Dashboards with Amazon QuickSightABD206-Building Visualizations and Dashboards with Amazon QuickSight
ABD206-Building Visualizations and Dashboards with Amazon QuickSight
 
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
 
ABD316_American Heart Association Finding Cures to Heart Disease Through the ...
ABD316_American Heart Association Finding Cures to Heart Disease Through the ...ABD316_American Heart Association Finding Cures to Heart Disease Through the ...
ABD316_American Heart Association Finding Cures to Heart Disease Through the ...
 
ABD324_Migrating Your Oracle Data Warehouse to Amazon Redshift Using AWS DMS ...
ABD324_Migrating Your Oracle Data Warehouse to Amazon Redshift Using AWS DMS ...ABD324_Migrating Your Oracle Data Warehouse to Amazon Redshift Using AWS DMS ...
ABD324_Migrating Your Oracle Data Warehouse to Amazon Redshift Using AWS DMS ...
 
ABD215_Serverless Data Prep with AWS Glue
ABD215_Serverless Data Prep with AWS GlueABD215_Serverless Data Prep with AWS Glue
ABD215_Serverless Data Prep with AWS Glue
 
AWS & Database Analytics
AWS & Database AnalyticsAWS & Database Analytics
AWS & Database Analytics
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
How Amazon.com Uses AWS Analytics
How Amazon.com Uses AWS AnalyticsHow Amazon.com Uses AWS Analytics
How Amazon.com Uses AWS Analytics
 
ABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS GlueABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS Glue
 
Migrating your traditional Data Warehouse to a Modern Data Lake
Migrating your traditional Data Warehouse to a Modern Data LakeMigrating your traditional Data Warehouse to a Modern Data Lake
Migrating your traditional Data Warehouse to a Modern Data Lake
 
Preparing Data for the Lake
Preparing Data for the LakePreparing Data for the Lake
Preparing Data for the Lake
 
AWS 機器學習 I ─ 人工智慧 AI
AWS 機器學習 I ─ 人工智慧 AIAWS 機器學習 I ─ 人工智慧 AI
AWS 機器學習 I ─ 人工智慧 AI
 
Analyzing Streams
Analyzing StreamsAnalyzing Streams
Analyzing Streams
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
The Power of Big Data - AWS Summit Bahrain 2017
The Power of Big Data - AWS Summit Bahrain 2017The Power of Big Data - AWS Summit Bahrain 2017
The Power of Big Data - AWS Summit Bahrain 2017
 
Building Data Lakes with AWS
Building Data Lakes with AWSBuilding Data Lakes with AWS
Building Data Lakes with AWS
 
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
 
Building Data Lake on AWS | AWS Floor28
Building Data Lake on AWS | AWS Floor28Building Data Lake on AWS | AWS Floor28
Building Data Lake on AWS | AWS Floor28
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
How Amazon uses AWS Analytics
How Amazon uses AWS AnalyticsHow Amazon uses AWS Analytics
How Amazon uses AWS Analytics
 

Semelhante a ABD312_Deep Dive Migrating Big Data Workloads to AWS

Implementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdfImplementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdf
Amazon Web Services
 

Semelhante a ABD312_Deep Dive Migrating Big Data Workloads to AWS (20)

STG330_Case Study How Experian Leverages Amazon EC2, EBS, and S3 with Clouder...
STG330_Case Study How Experian Leverages Amazon EC2, EBS, and S3 with Clouder...STG330_Case Study How Experian Leverages Amazon EC2, EBS, and S3 with Clouder...
STG330_Case Study How Experian Leverages Amazon EC2, EBS, and S3 with Clouder...
 
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
 
Builders' Day - Building Data Lakes for Analytics On AWS LC
Builders' Day - Building Data Lakes for Analytics On AWS LCBuilders' Day - Building Data Lakes for Analytics On AWS LC
Builders' Day - Building Data Lakes for Analytics On AWS LC
 
21st Century Analytics with Zopa
21st Century Analytics with Zopa21st Century Analytics with Zopa
21st Century Analytics with Zopa
 
STG316_Optimizing Storage for Big Data Workloads
STG316_Optimizing Storage for Big Data WorkloadsSTG316_Optimizing Storage for Big Data Workloads
STG316_Optimizing Storage for Big Data Workloads
 
Optimizing Storage for Enterprise Workloads and Migrations (STG202) - AWS re:...
Optimizing Storage for Enterprise Workloads and Migrations (STG202) - AWS re:...Optimizing Storage for Enterprise Workloads and Migrations (STG202) - AWS re:...
Optimizing Storage for Enterprise Workloads and Migrations (STG202) - AWS re:...
 
STG329_ProtectWise optimizes performance of Cassandra and Kafka workloads wit...
STG329_ProtectWise optimizes performance of Cassandra and Kafka workloads wit...STG329_ProtectWise optimizes performance of Cassandra and Kafka workloads wit...
STG329_ProtectWise optimizes performance of Cassandra and Kafka workloads wit...
 
Data freedom: come migrare i carichi di lavoro Big Data su AWS
Data freedom: come migrare i carichi di lavoro Big Data su AWSData freedom: come migrare i carichi di lavoro Big Data su AWS
Data freedom: come migrare i carichi di lavoro Big Data su AWS
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
ABD201-Big Data Architectural Patterns and Best Practices on AWS
ABD201-Big Data Architectural Patterns and Best Practices on AWSABD201-Big Data Architectural Patterns and Best Practices on AWS
ABD201-Big Data Architectural Patterns and Best Practices on AWS
 
STG206_Big Data Data Lakes and Data Oceans
STG206_Big Data Data Lakes and Data OceansSTG206_Big Data Data Lakes and Data Oceans
STG206_Big Data Data Lakes and Data Oceans
 
A Practitioner’s Guide on Migrating to, and Running on Amazon Aurora - DAT315...
A Practitioner’s Guide on Migrating to, and Running on Amazon Aurora - DAT315...A Practitioner’s Guide on Migrating to, and Running on Amazon Aurora - DAT315...
A Practitioner’s Guide on Migrating to, and Running on Amazon Aurora - DAT315...
 
Module 2 - Datalake
Module 2 - DatalakeModule 2 - Datalake
Module 2 - Datalake
 
Migrating Your Data Warehouse to Amazon Redshift (DAT337) - AWS re:Invent 2018
Migrating Your Data Warehouse to Amazon Redshift (DAT337) - AWS re:Invent 2018Migrating Your Data Warehouse to Amazon Redshift (DAT337) - AWS re:Invent 2018
Migrating Your Data Warehouse to Amazon Redshift (DAT337) - AWS re:Invent 2018
 
Implementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdfImplementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdf
 
Databases - EBC on the road Brazil Edition [Portuguese]
Databases - EBC on the road Brazil Edition [Portuguese]Databases - EBC on the road Brazil Edition [Portuguese]
Databases - EBC on the road Brazil Edition [Portuguese]
 
Using Data Lakes
Using Data Lakes Using Data Lakes
Using Data Lakes
 
Success has Many Query Engines- Tel Aviv Summit 2018
Success has Many Query Engines- Tel Aviv Summit 2018Success has Many Query Engines- Tel Aviv Summit 2018
Success has Many Query Engines- Tel Aviv Summit 2018
 
Building Analytics Applications in the AWS Cloud
Building Analytics Applications in the AWS CloudBuilding Analytics Applications in the AWS Cloud
Building Analytics Applications in the AWS Cloud
 
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
 

Mais de Amazon Web Services

Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
Amazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
Amazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
Amazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
Amazon Web Services
 

Mais de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

ABD312_Deep Dive Migrating Big Data Workloads to AWS

  • 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Deep Di ve: Mi grati ng Bi g Data Workl oad s to AWS B r u n o F a r i a , S r . E M R S o l u t i o n A r c h i t e c t , A W S R i t e s h S h a h , S r . P r o g r a m M a n a g e r , V a n g u a r d A B D 3 1 2 N o v e m b e r 2 8 , 2 0 1 7
  • 2. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda • Deconstructing current big data environments • Identifying challenges with on-premises or unmanaged architectures • Migrating components to Amazon EMR and Amazon Web Services (AWS) analytics services – Choosing the right engine for the job – Architecting for cost and scalability • Customer migration story – How Vanguard migrated their big data workload to AWS • Q&A
  • 3. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Deconstructing on-premises big data environments
  • 4. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • A cluster of 1U machines • Typically 12 cores, 32/64 GB RAM, and 6-8 TB of HDD ($3-4K) • Networking switches and racks • Open-source distribution of Hadoop or a fixed-licensing term by commercial distributions • Different node roles • HDFS uses local disk and is sized for 3x data replication On-premises Hadoop clusters Server rack 1 (20 nodes) Server rack 2 (20 nodes) Server rack N (20 nodes) Core
  • 5. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. On premises: Workload types running on the same cluster • Large-scale ETL • Interactive queries • Machine learning and data science • NoSQL • Stream processing • Search • Data warehouses
  • 6. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. On premises: Swim lane of jobs Over-utilized Under-utilized
  • 7. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • Management of the cluster (failures, hardware replacement, restarting services, expanding cluster) • Configuration management • Tuning of specific jobs or hardware • Managing development and test environments • Backing up data and disaster recovery On premises: Role of a big data administrator
  • 8. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Identifying on-premises challenges
  • 9. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. On premises: Over-utilization and idle capacity • Tightly coupled compute and storage requires buying excess capacity • Can be over-utilized during peak hours and under-utilized at other times • Results in high costs and low efficiency
  • 10. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. On premises: System management difficulties • Managing distributed applications and availability • Durable storage and disaster recovery • Adding new frameworks and doing upgrades • Multiple environments • Need team to manage cluster and procure hardware
  • 11. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Migrating workloads to AWS
  • 12. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Key migration considerations • Do not lift and shift • Deconstruct workloads and use the right tool for the job • Decouple storage and compute with Amazon Simple Storage Storage Service (Amazon S3) • Design for cost and scalability
  • 13. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. De constru ct workloads —analytics type s & frame works Batch Takes minutes to hours Example: Daily/weekly/monthly reports Amazon EMR (MapReduce, Hive, Pig, Spark), AWS Glue Interactive Takes seconds Example: Self-service dashboards Amazon Redshift, Amazon Athena, Amazon EMR (Presto, Spark) Stream Takes milliseconds to seconds Example: Fraud alerts, 1-minute metrics Amazon EMR (Spark Streaming, Flink), Amazon Kinesis Analytics, KCL, Storm Artificial intelligence Takes milliseconds to minutes Example: Fraud detection, forecast demand, text to speech Amazon AI (Amazon Lex, Amazon Polly, Amazon ML, Amazon Rekognition), Amazon EMR (Spark ML), AWS Deep Learning AMI
  • 14. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Translate use cases to the right tools Amazon EMR Applications Hive, Pig, Spark SQL/Streaming/ML, Flink, Mahout, Sqoop, Phoenix HBase Presto Batch MapReduce Interactive Tez In memory Spark Streaming Flink YARN Cluster resource management Storage Amazon S3 (EMRFS), HDFS AWS Glue Amazon Athena Amazon Redshift/Spectrum • Low-latency SQL -> Amazon Athena, Presto, Amazon Redshift/Spectrum • Data warehouse/reporting -> Spark, Hive, AWS Glue, Amazon Redshift • Management and monitoring -> Amazon CloudWatch, AWS console, Ganglia • HDFS -> Amazon S3 • Notebooks -> Zeppelin Notebook, Jupyter (via bootstrap action) • Query console -> Amazon Athena, Hue • Security -> Ranger (CF template), HiveServer2, AWS IAM roles
  • 15. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Many storage layers to choose from Amazon RDSAmazon ElastiCahe Amazon Redshift Amazon DynamoDB Amazon Kinesis Amazon S3 Amazon EMR Amazon ES
  • 16. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon S3 as your persistent data store • Natively supported by big data frameworks (Spark, Hive, Presto, and others) • Decouple storage and compute • No need to run compute clusters for storage (unlike HDFS) • Can run transient Amazon EMR clusters with Amazon EC2 Spot Instances • Multiple and heterogeneous analysis clusters and services can use the same data • Designed for 99.999999999% durability • No need to pay for data replication • Secure—SSL, client/server-side encryption at rest • Low cost
  • 17. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What about HDFS and data tiering? • Use HDFS for very frequently accessed (hot) data • Use Amazon S3 Standard for frequently accessed data • Use Amazon S3 Standard – IA for less frequently accessed data • Use Amazon Glacier for archiving cold data
  • 18. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Decouple storage and compute Compute External metastore Amazon RDS AWS Glue Data Catalog Amazon EMR Amazon Athena Amazon Redshift SpectrumAmazon S3 Storage
  • 19. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. A m a z o n S 3 t i p s : P a r t i t i o n s , c o m p r e s s i o n , a n d f i l e f o r m a t s • Partition your data for improved read performance • Read-only files the query needs • Reduce amount of data scanned • Optimize file sizes • Avoid files that are too small (generally, anything less than 128 MB) • Fewer files, matching closely to block size • Fewer calls to Amazon S3 (faster listing) • Fewer network/HDFS requests • Compress data set to minimize bandwidth from Amazon S3 to Amazon EC2 • Make sure you use splittable compression or have each file be the optimal size for parallelization on your cluster • Columnar file formats like Parquet can give increased performance on reads
  • 20. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. ALTER TABLE app_logs ADD PARTITION (year='2015',month='01',day='01') LOCATION 's3://bucket_name/app/plaintext/year=2015/month=01/day=01/’; ALTER TABLE elb_logs ADD PARTITION (year='2015',month='01',day='01') LOCATION 's3://bucket_name/elb/plaintext/2015/01/01/’; ALTER TABLE orders DROP PARTITION (dt='2014-05-14’,country='IN'), PARTITION (dt='2014-05-15’,country='IN'); ALTER TABLE customers PARTITION (zip='98040', state='WA') SET LOCATION 's3://bucket_name/new_customers/zip=98040/state=WA’; MSCK REPAIR TABLE table_name;  Only works with Hive-compatible partitions Data partitioning—examples
  • 21. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. File formats Columnar─Parquet and ORC • Compressed • Column-based read-optimized • Integrated indexes and stats Row─Avro • Compressed • Row-based read-optimized • Integrated indexes and stats Text─xSV, JSON • May or may not be compressed • Not optimized • Generic and malleable
  • 22. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. SELECT count(*) as count FROM examples_csv Run time: 36 seconds, Data scanned: 15.9GB SELECT count(*) as count FROM examples_parquet Run time: 5 seconds, Data scanned: 4.93GB File Format – Examples
  • 23. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. File formats—considerations Scanning • xSV and JSON require scanning entire file • Columnar ideal when selecting only a subset of columns • Row ideal when selecting all columns of a subset of rows Read performance • Text – SLOW • Avro – Optimal (specific to use case) • Parquet and ORC – Optimal (specific to use case) Write performance • Text – SLOW • Avro – Good • Parquet and ORC – Good (has some overhead with large datasets) * Highly dependent on the dataset
  • 24. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. External hive metastore Use an external metastore when you require a persistent metastore or a metastore shared by different clusters, services, and applications. There are two options for an external metastore: • Amazon Relational Database Service (Amazon RDS)/Amazon Aurora • AWS Glue Data Catalog (Amazon EMR version 5.8.0 or later only) • Search metastore • Utilize crawlers to detect new data, schema, partitions • Schema and version management
  • 25. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. External metastore—AWS Glue Data Catalog You can choose to use the AWS Glue Data Catalog to store external table metadata for Hive and Spark instead of utilizing an on-cluster or self-managed Hive metastore. This allows you to more easily store metadata for your external tables on Amazon S3 outside of your cluster. You can configure your Amazon EMR clusters to use the AWS Glue Data Catalog from the Amazon EMR console, AWS Command Line Interface (CLI), or the AWS SDK with the Amazon EMR API.
  • 26. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Instance fleets for advanced Spot provisioning Master node Core instance fleet Task instance fleet • Provision from a list of instance types with Spot and On-Demand • Launch in the most optimal Availability Zone based on capacity/price • Spot block support
  • 27. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Transient or long-running workloads Transient Long-running
  • 28. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon EMR: Lower costs with Auto Scaling
  • 29. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Security—governance and auditing • AWS Identity and Access Management (IAM) • Amazon Cognito • Amazon CloudWatch and AWS CloudTrail • AWS Key Management Service (AWS KMS) • AWS Directory Service • Amazon S3 access logs for cluster Amazon S3 access • Amazon Macie • YARN and application logs • Apache Ranger Security & Governance IAM Amazon CloudWatch AWS CloudTrail AWS KMS AWS CloudHSM AWS Directory Service Amazon Cognito Amazon Macie
  • 30. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Migrating workloads to AWS at Vanguard Ritesh Shah, Sr. Program Manager
  • 31. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Vanguard—background Vanguard is one of the world's largest investment companies, offering a large selection of low-cost mutual funds, ETFs, advice, and related services Core purpose – To take a stand for all investors, to treat them fairly, and to give them the best chance for investment success Oldest fund – Wellington Fund (inception 1929) Began operations – May 1, 1975 in Valley Forge, PA Funds – Over 180 U.S. funds (including variable annuity portfolios) and 190 additional funds in markets outside the United States
  • 32. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Deconstructing on-premises workloads Oracle DB2 MS SQL Files Hadoop cluster Data analyst Data wrangler Data scientist
  • 33. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Deconstructing on-premises workloads Oracle DB2 MS SQL Files Hadoop cluster Data analyst Data wrangler Data scientist
  • 34. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • Tightly coupled compute and storage • Over-utilized during peak hours and under-utilized at other times • Cross impact from different types of workloads • Lack of DR environment • Dedicated team to maintain cluster Why migrate from on premises to AWS?
  • 35. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • Tightly coupled compute and storage • Over-utilized during peak hours and under-utilized at other times • Cross impact from different types of workloads • Lack of DR environment • Dedicated team to maintain cluster • Long procurement cycles for hardware • Long setup time of hardware • Complex upgrade coordination needs leading to infrequent upgrades • High total cost of ownership (TCO) • Difficult to charge back LOB Why migrate from on premises to AWS?
  • 36. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Demystifying EMR workloads
  • 37. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Foundational requirements Secure Encryption in flight & at rest Lower TCO Pay as per usage Flexible Customize per workload Full control Infrastructure as code Scalable Compute & storage Managed Reduced administration
  • 38. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Ingestion workloads Oracle DB2 MS SQL Files Amazon EMR Amazon S3 • Ephemeral EMR–based ingestion clusters • Amazon S3–based data lake • In-flight and at-rest data encryption using KMS • S3 bucket access control policies using IAM • Step API used to launch Oozie–based workflows • Sqoop, Snowball, and CDC (Attunity)–based data ingestion • Hive/Tez and Spark–based data processing • CloudWatch plus Splunk–based monitoring
  • 39. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Ingestion workloads Oracle DB2 MS SQL Files Amazon EMR Amazon S3 • Ephemeral EMR–based ingestion clusters • Amazon S3–based data lake • In-flight and at-rest data encryption using KMS • S3 bucket access control policies using IAM • Step API used to launch Oozie–based workflows • Sqoop, Snowball, and CDC (Attunity)–based data ingestion • Hive/Tez and Spark–based data processing • CloudWatch plus Splunk–based monitoring Lessons learned • Segregation of infrastructure and ingestion code • Instance type and cluster sizing • Obfuscation of PII data • Segregation of data based on LOB rules
  • 40. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Conceptual diagram—ingestion workload On-premises data center AWS cloud Control-M Ephemeral EMR clusters EC2 job nodeELB Ephemeral EC2 FTP node Persistent EC2 FTP node
  • 41. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Conceptual diagram—ingestion workload On-premises data center AWS cloud Control-M DBMS Hive/Hue metastore Ephemeral EMR clusters S3 storage buckets EC2 job nodeELB Ephemeral EC2 FTP node Persistent EC2 FTP node
  • 42. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Conceptual diagram—ingestion workload On-premise data center AWS cloud Control-M DBMS Hive/Hue metastore Amazon PostgreSQL Ephemeral EMR clusters S3 Storage buckets EC2 job nodeELB Ephemeral EC2 FTP node Persistent EC2 FTP node Attunity EC2 agent
  • 43. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Conceptual diagram—ingestion workload On-premises data center AWS cloud FTP server Control-M DBMS Hive/Hue metastore Amazon PostgreSQL Ephemeral EMR clusters S3 storage buckets EC2 job nodeELB Ephemeral EC2 FTP node Persistent EC2 FTP node Attunity EC2 agent
  • 44. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Conceptual diagram—ingestion workload On-premises data center AWS cloud DTS FTP server Control-M DBMS Hive/Hue metastore Amazon PostgreSQL Ephemeral EMR clusters S3 Storage buckets EC2 job nodeELB Ephemeral EC2 FTP node Persistent EC2 FTP node Attunity EC2 agent AWS Snowball Amazon CloudWatch IAM AWS KMS
  • 45. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Analytics workloads • Persistent EMR-based analytics clusters • Hive/Tez and Presto as query engines • Spark and Python-based distributed processing • Hue and Tableau as user interactive tools • Zeppelin and Jupyter-based notebook for interactive and collaborative data exploration • Authorization using Hive SQL Auth and IAM bucket policies for Amazon S3 Data analyst Data wrangler Data scientist Amazon EMR Amazon S3
  • 46. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Analytics workloads • Persistent EMR-based analytics clusters • Hive/Tez and Presto as query engines • Spark and Python-based distributed processing • Hue and Tableau as user interactive tools • Zeppelin and Jupyter-based notebook for interactive and collaborative data exploration • Authorization using Hive SQL Auth and IAM bucket policies for Amazon S3 Lessons learned • Data governance (metadata, lineage, and others) • Audit of access to data • Instance type and cluster sizing • Integration of user tools with query and processing engines Data analyst Data wrangler Data scientist Amazon EMR Amazon S3
  • 47. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Conceptual diagram—analytics workload On- premises data center AWS cloud Control-M Hive/Hue metastore S3 storage buckets EC2 job node ELB Amazon CloudWatch IAM AWS KMS Query EMR clusters EMR applications EC2 Tableau node
  • 48. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Conceptual diagram—analytics workload On- premises data center AWS cloud Control-M Hive/Hue metastore Data science EMR clusters S3 storage buckets EC2 job node ELB Amazon CloudWatch IAM AWS KMS Query EMR clusters EMR applications EC2 Tableau node
  • 49. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Conceptual diagram—analytics workload On- premises data center AWS cloud Control-M Hive/Hue metastore Data science EMR clusters S3 storage buckets EC2 job node ELB Amazon CloudWatch IAM AWS KMS Query EMR clusters EMR applications EC2 Tableau node User zone Data analyst Data wrangler Data scientist
  • 50. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What’s next AWS Lambda Amazon Glacier Amazon Redshift AWS DMS Amazon Athena Amazon CloudSearch Amazon Kinesis Amazon QuickSight AWS Data Pipeline/AWS Glue Amazon Lex Amazon Polly Amazon Rekognition Amazon ML Enable additional AWS services for: • Real-time data ingestion • Enhanced data processing pipelines • Visualization and analytics use case • Experimentation with AI and ML
  • 51. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Q&A Thank you! a w s . a m a z o n . c o m / e m r b l o g s . a w s . a m a z o n . c o m / b i g d a t a
  • 52. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Please remember to rate this session in the mobile app