SlideShare uma empresa Scribd logo
1 de 41
Baixar para ler offline
Apache Spark and the
Hadoop Ecosystem on AWS
Getting Started with
Amazon EMR
Jonathan Fritz, Sr. Product Manager
March 20, 2017
Agenda
• Quick introduction to Spark, Hive on Tez, and
Presto
• Building data lakes with Amazon EMR
and Amazon S3
• Running jobs and security options
• Customer use cases
• Demo
Quick introduction to Spark,
Hive on Tez, and Presto
Spark for fast processing
join
filter
groupBy
Stage 3
Stage 1
Stage 2
A: B:
C: D: E:
F:
= cached partition= RDD
map
• Massively parallel
• Uses DAGs instead of map-
reduce for execution
• Minimizes I/O by storing data
in DataFrames in memory
• Partitioning-aware to avoid
network-intensive shuffle
Spark components to match your use case
Hive and Tez for batch ETL and SQL
• Run Spark Driver in
Client or Cluster mode
• Spark and Tez
applications run as a
YARN application
• Spark Executors and
Tez Workers run in
YARN Containers on
NodeManagers in your
cluster
Amazon EMR runs Spark and Tez on YARN
Presto: interactive SQL for analytics
Important Presto Features
High Performance
• E.g. Netflix: runs 3500+ Presto queries / day on 25+ PB dataset in S3 with 350 active
platform users
Extensibility
• Pluggable backends: Hive, Cassandra, JMX, Kafka, MySQL, PostgreSQL, MySQL, and
more
• JDBC, ODBC for commercial BI tools or dashboards
• Client Protocol: HTTP+JSON, support various languages (Python, Ruby, PHP, Node.js,
Java(JDBC), C#,…)
ANSI SQL
• complex queries, joins, aggregations, various functions (Window functions)
On-cluster UIs
Manage applications
SQL editor, Workflow designer,
Metastore browser
Notebooks
Design and execute
queries and workloads
And more using
bootstrap actions!
Building data lakes with
Amazon EMR and Amazon S3
Why Amazon EMR?
Easy to Use
Launch a cluster in minutes
Low Cost
Pay an hourly rate
Open-Source Variety
Latest versions of software
Managed
Spend less time monitoring
Secure
Easy to manage options
Flexible
Customize the cluster
Create a fully configured cluster with the
latest versions of Presto and Spark in minutes
AWS Management
Console
AWS Command Line
Interface (CLI)
Or use a AWS SDK directly with the Amazon EMR API
Hue (SQL Interface/Metastore Management)
Zeppelin (Interactive Notebook)
Ganglia (Monitoring)
HiveServer2/Spark Thriftserver (JDBC/ODBC)
Amazon EMR service
Amazon EMR release
Storage
S3 (EMRFS), HDFS
YARN
Cluster Resource Management
Batch
MapReduce
Interactive
Tez
In Memory
Spark
Applications
Hive, Pig, Spark SQL/Streaming/ML, Flink, Mahout, Sqoop
HBase/Phoenix
Presto
Streaming
Flink
Decouple compute and storage by using S3
as your data layer
HDFS
S3 is designed for 11
9’s of durability and is
massively scalable
EC2 Instance
Memory
Amazon S3
Amazon EMR
Amazon EMR
Intermediates
stored on local
disk or HDFS
Local
HBase on S3 for scalable NoSQL
S3 tips: Partitions, compression, and file formats
• Avoid key names in lexicographical order
• Improve throughput and S3 list performance
• Use hashing/random prefixes or reverse the date-time
• Compress data set to minimize bandwidth from S3 to
EC2
• Make sure you use splittable compression or have each file
be the optimal size for parallelization on your cluster
• Columnar file formats like Parquet can give increased
performance on reads
Many storage layers to choose from
Amazon DynamoDB
Amazon RDS Amazon Kinesis
Amazon Redshift
Amazon S3
Amazon EMR
Use RDS/Aurora for an external Hive metastore
Amazon Aurora
Hive Metastore for
external tables on S3
Amazon S3Set metastore
location in hive-site
Spot for
task nodes
Up to 80%
off EC2
on-demand
pricing
On-demand for
core nodes
Standard
Amazon EC2
pricing for
on-demand
capacity
Use Spot and Reserved Instances to lower costs
Meet SLA at predictable cost Exceed SLA at lower cost
Amazon EMR supports most EC2 instance types
Instance fleets for advanced Spot provisioning
Master Node Core Instance Fleet Task Instance Fleet
• Provision from a list of instance types with Spot and On-Demand
• Launch in the most optimal Availability Zone based on capacity/price
• Spot Block support
Lower costs with Auto Scaling
Running Jobs and
Security Options
YARN schedulers - CapacityScheduler
• Default scheduler specified in Amazon EMR
• Queues
• Single queue is set by default
• Can create additional queues for workloads based on
multitenancy requirements
• Capacity Guarantees
• set minimal resources for each queue
• Programmatically assign free resources to queues
• Adjust these settings using the classification capacity-
scheduler in an EMR configuration object
Configuring Executors – Dynamic Allocation
• Optimal resource utilization
• YARN dynamically creates and shuts down executors
based on the resource needs of the Spark application
• Spark uses the executor memory and executor cores
settings in the configuration for each executor
• Amazon EMR uses dynamic allocation by default, and
calculates the default executor size to use based on the
instance family of your Core Group
Options to submit jobs – on cluster
Web UIs: Hue SQL editor,
Zeppelin notebooks,
R Studio, Airpal, and more!
Connect with ODBC / JDBC to
HiveServer2, Spark Thriftserver, or Presto
Use Hive and Spark Actions in your Apache
Oozie workflow to create DAGs of jobs.
(start using
start-thriftserver.sh)
Or, use the native APIs and CLIs for
each application
Options to submit jobs – off cluster
Amazon EMR
Step API
Submit a Hive or Spark
application
Amazon EMR
AWS Data Pipeline
Airflow, Luigi, or other
schedulers on EC2
Create a pipeline
to schedule job
submission or create
complex workflows
AWS Lambda
Use AWS Lambda to
submit applications to
EMR Step API or directly
to Hive or Spark on your cluster
Security - configuring VPC subnets
• Use Amazon S3 Endpoints in VPC for
connectivity to S3
• Use Managed NAT for connectivity to
other services or the Internet
• Control the traffic using Security Groups
• ElasticMapReduce-Master-Private
• ElasticMapReduce-Slave-Private
• ElasticMapReduce-ServiceAccess
IAM Roles – managed or custom policies
EMR Service Role EC2 Role
Encryption – use security configurations
Customer Use Cases
Learn
Models
ModelsImpressions
Clicks
Activities
Calibrate
Evaluate
Real
Time
Bidding
S3
ETL Attribution
Machine
Learning
S3Amazon
Kinesis
• 2 Petabytes Processed Daily
• 2 Million Bid Decisions Per Second
• Runs 24 X 7 on 5 Continents
• Thousands of ML Models
Trained per Day
ADHOC
Transient
ETL	Cluster	
Hive
Transient
Event	Cluster
Spark
RDS
Meta	store
Custom	Replication
Ideal	Path
RDS	Replicated
Meta	store
To	support	
Hive		Presto	
Data	types
Ideal	Path
Data Lake - Stats
• 50+ Ad hoc Users
• 1000+ Ad hoc Queries 
Day
• 4+ Data Science Users
• 20+ Sources of Data
• 100+ ETL Hive Jobs
• 25+ Spark Jobs
• 2+ PB of Data
Netflix uses Presto on Amazon EMR with
a 25 PB dataset in Amazon S3
Full Presentation: https://www.youtube.com/watch?v=A4OU6i4AQsI
SmartNews uses Presto as a reporting front-end
AWS Big Data Blog: https://blogs.aws.amazon.com/bigdata/post/Tx2V1BSKGITCMTU/How-SmartNews-
Built-a-Lambda-Architecture-on-AWS-to-Analyze-Customer-Behavior-an
FINRA saved 60% by moving to HBase on EMR
Demo
jonfritz@amazon.com
aws.amazon.com/emr
blogs.aws.amazon.com/bigdata
Your feedback
is important to us!

Mais conteúdo relacionado

Mais procurados

Running Active Directory in the AWS Cloud
Running Active Directory in the AWS Cloud Running Active Directory in the AWS Cloud
Running Active Directory in the AWS Cloud Amazon Web Services
 
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduceAmazon Web Services
 
Introduce AWS Lambda for newbie and Non-IT
Introduce AWS Lambda for newbie and Non-ITIntroduce AWS Lambda for newbie and Non-IT
Introduce AWS Lambda for newbie and Non-ITChitpong Wuttanan
 
Introducing AWS Elastic Beanstalk
Introducing AWS Elastic BeanstalkIntroducing AWS Elastic Beanstalk
Introducing AWS Elastic BeanstalkAmazon Web Services
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterPaolo Castagna
 
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQLAnnouncing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQLAmazon Web Services
 
[REPEAT] Microsoft Active Directory Deep Dive (WIN303-R) - AWS re:Invent 2018
[REPEAT] Microsoft Active Directory Deep Dive (WIN303-R) - AWS re:Invent 2018[REPEAT] Microsoft Active Directory Deep Dive (WIN303-R) - AWS re:Invent 2018
[REPEAT] Microsoft Active Directory Deep Dive (WIN303-R) - AWS re:Invent 2018Amazon Web Services
 
Using Active Directory in AWS
Using Active Directory in AWSUsing Active Directory in AWS
Using Active Directory in AWSTriNimbus
 
금융 분야 마이데이터 (My Data) 산업 도입 방안 및 AWS 활용법 – 고종원 AWS 어카운트 매니저, 양찬욱 KB국민카드 팀장:: ...
금융 분야 마이데이터 (My Data) 산업 도입 방안 및 AWS 활용법 – 고종원 AWS 어카운트 매니저, 양찬욱 KB국민카드 팀장:: ...금융 분야 마이데이터 (My Data) 산업 도입 방안 및 AWS 활용법 – 고종원 AWS 어카운트 매니저, 양찬욱 KB국민카드 팀장:: ...
금융 분야 마이데이터 (My Data) 산업 도입 방안 및 AWS 활용법 – 고종원 AWS 어카운트 매니저, 양찬욱 KB국민카드 팀장:: ...Amazon Web Services Korea
 
Getting Started with Serverless Architectures
Getting Started with Serverless ArchitecturesGetting Started with Serverless Architectures
Getting Started with Serverless ArchitecturesAmazon Web Services
 
관계형 데이터베이스의 새로운 패러다임 Amazon Aurora :: 김상필 :: AWS Summit Seoul 2016
관계형 데이터베이스의 새로운 패러다임 Amazon Aurora :: 김상필 :: AWS Summit Seoul 2016관계형 데이터베이스의 새로운 패러다임 Amazon Aurora :: 김상필 :: AWS Summit Seoul 2016
관계형 데이터베이스의 새로운 패러다임 Amazon Aurora :: 김상필 :: AWS Summit Seoul 2016Amazon Web Services Korea
 
Ad-Tech on AWS 세미나 | AWS와 데이터 분석
Ad-Tech on AWS 세미나 | AWS와 데이터 분석Ad-Tech on AWS 세미나 | AWS와 데이터 분석
Ad-Tech on AWS 세미나 | AWS와 데이터 분석Amazon Web Services Korea
 
Introduction to AWS Step Functions:
Introduction to AWS Step Functions: Introduction to AWS Step Functions:
Introduction to AWS Step Functions: Amazon Web Services
 

Mais procurados (20)

Running Active Directory in the AWS Cloud
Running Active Directory in the AWS Cloud Running Active Directory in the AWS Cloud
Running Active Directory in the AWS Cloud
 
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
 
BDA311 Introduction to AWS Glue
BDA311 Introduction to AWS GlueBDA311 Introduction to AWS Glue
BDA311 Introduction to AWS Glue
 
Introduce AWS Lambda for newbie and Non-IT
Introduce AWS Lambda for newbie and Non-ITIntroduce AWS Lambda for newbie and Non-IT
Introduce AWS Lambda for newbie and Non-IT
 
Introduction to AWS Glue
Introduction to AWS GlueIntroduction to AWS Glue
Introduction to AWS Glue
 
Introducing AWS Elastic Beanstalk
Introducing AWS Elastic BeanstalkIntroducing AWS Elastic Beanstalk
Introducing AWS Elastic Beanstalk
 
AWS Lambda
AWS LambdaAWS Lambda
AWS Lambda
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matter
 
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQLAnnouncing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
 
[REPEAT] Microsoft Active Directory Deep Dive (WIN303-R) - AWS re:Invent 2018
[REPEAT] Microsoft Active Directory Deep Dive (WIN303-R) - AWS re:Invent 2018[REPEAT] Microsoft Active Directory Deep Dive (WIN303-R) - AWS re:Invent 2018
[REPEAT] Microsoft Active Directory Deep Dive (WIN303-R) - AWS re:Invent 2018
 
Using Active Directory in AWS
Using Active Directory in AWSUsing Active Directory in AWS
Using Active Directory in AWS
 
금융 분야 마이데이터 (My Data) 산업 도입 방안 및 AWS 활용법 – 고종원 AWS 어카운트 매니저, 양찬욱 KB국민카드 팀장:: ...
금융 분야 마이데이터 (My Data) 산업 도입 방안 및 AWS 활용법 – 고종원 AWS 어카운트 매니저, 양찬욱 KB국민카드 팀장:: ...금융 분야 마이데이터 (My Data) 산업 도입 방안 및 AWS 활용법 – 고종원 AWS 어카운트 매니저, 양찬욱 KB국민카드 팀장:: ...
금융 분야 마이데이터 (My Data) 산업 도입 방안 및 AWS 활용법 – 고종원 AWS 어카운트 매니저, 양찬욱 KB국민카드 팀장:: ...
 
Getting Started with Serverless Architectures
Getting Started with Serverless ArchitecturesGetting Started with Serverless Architectures
Getting Started with Serverless Architectures
 
AWS Lambda
AWS LambdaAWS Lambda
AWS Lambda
 
Amazon ECS
Amazon ECSAmazon ECS
Amazon ECS
 
관계형 데이터베이스의 새로운 패러다임 Amazon Aurora :: 김상필 :: AWS Summit Seoul 2016
관계형 데이터베이스의 새로운 패러다임 Amazon Aurora :: 김상필 :: AWS Summit Seoul 2016관계형 데이터베이스의 새로운 패러다임 Amazon Aurora :: 김상필 :: AWS Summit Seoul 2016
관계형 데이터베이스의 새로운 패러다임 Amazon Aurora :: 김상필 :: AWS Summit Seoul 2016
 
Intro to AWS: Storage Services
Intro to AWS: Storage ServicesIntro to AWS: Storage Services
Intro to AWS: Storage Services
 
Ad-Tech on AWS 세미나 | AWS와 데이터 분석
Ad-Tech on AWS 세미나 | AWS와 데이터 분석Ad-Tech on AWS 세미나 | AWS와 데이터 분석
Ad-Tech on AWS 세미나 | AWS와 데이터 분석
 
AWS Technical Essentials Day
AWS Technical Essentials DayAWS Technical Essentials Day
AWS Technical Essentials Day
 
Introduction to AWS Step Functions:
Introduction to AWS Step Functions: Introduction to AWS Step Functions:
Introduction to AWS Step Functions:
 

Destaque

(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMRAmazon Web Services
 
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWSAWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWSAmazon Web Services
 
[Gaming on AWS] 모바일 게임 데이터의 특성과 100% 활용법 - 5Rocks
[Gaming on AWS] 모바일 게임 데이터의 특성과 100% 활용법 - 5Rocks[Gaming on AWS] 모바일 게임 데이터의 특성과 100% 활용법 - 5Rocks
[Gaming on AWS] 모바일 게임 데이터의 특성과 100% 활용법 - 5RocksAmazon Web Services Korea
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSAmazon Web Services
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSAmazon Web Services
 
[NDC 2011] 게임 개발자를 위한 데이터분석의 도입
[NDC 2011] 게임 개발자를 위한 데이터분석의 도입[NDC 2011] 게임 개발자를 위한 데이터분석의 도입
[NDC 2011] 게임 개발자를 위한 데이터분석의 도입Hoon Park
 
쿠키런 1년, 서버개발 분투기
쿠키런 1년, 서버개발 분투기쿠키런 1년, 서버개발 분투기
쿠키런 1년, 서버개발 분투기Brian Hong
 

Destaque (7)

(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
 
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWSAWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
 
[Gaming on AWS] 모바일 게임 데이터의 특성과 100% 활용법 - 5Rocks
[Gaming on AWS] 모바일 게임 데이터의 특성과 100% 활용법 - 5Rocks[Gaming on AWS] 모바일 게임 데이터의 특성과 100% 활용법 - 5Rocks
[Gaming on AWS] 모바일 게임 데이터의 특성과 100% 활용법 - 5Rocks
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
[NDC 2011] 게임 개발자를 위한 데이터분석의 도입
[NDC 2011] 게임 개발자를 위한 데이터분석의 도입[NDC 2011] 게임 개발자를 위한 데이터분석의 도입
[NDC 2011] 게임 개발자를 위한 데이터분석의 도입
 
쿠키런 1년, 서버개발 분투기
쿠키런 1년, 서버개발 분투기쿠키런 1년, 서버개발 분투기
쿠키런 1년, 서버개발 분투기
 

Semelhante a Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft TLV 2017

Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRAmazon Web Services
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSAmazon Web Services
 
Data Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRData Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRAmazon Web Services
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon Web Services
 
Lighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache SparkLighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache SparkAmazon Web Services
 
AWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMRAWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMRAmazon Web Services
 
Amazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the conceptsAmazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the concepts Julien SIMON
 
Scaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMRScaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMRIsrael AWS User Group
 
Data science with spark on amazon EMR - Pop-up Loft Tel Aviv
Data science with spark on amazon EMR - Pop-up Loft Tel AvivData science with spark on amazon EMR - Pop-up Loft Tel Aviv
Data science with spark on amazon EMR - Pop-up Loft Tel AvivAmazon Web Services
 
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar SeriesIntroducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar SeriesAmazon Web Services
 
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...Amazon Web Services
 
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMRAmazon Web Services
 
Architetture serverless e pattern avanzati per AWS Lambda
Architetture serverless e pattern avanzati per AWS LambdaArchitetture serverless e pattern avanzati per AWS Lambda
Architetture serverless e pattern avanzati per AWS LambdaAmazon Web Services
 
Design patterns and best practices for data analytics with amazon emr (ABD305)
Design patterns and best practices for data analytics with amazon emr (ABD305)Design patterns and best practices for data analytics with amazon emr (ABD305)
Design patterns and best practices for data analytics with amazon emr (ABD305)Amazon Web Services
 
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMRAmazon Web Services
 
Analytics at Scale with Apache Spark on AWS with Jonathan Fritz
Analytics at Scale with Apache Spark on AWS with Jonathan FritzAnalytics at Scale with Apache Spark on AWS with Jonathan Fritz
Analytics at Scale with Apache Spark on AWS with Jonathan FritzDatabricks
 
(CMP405) Containerizing Video: The Next Gen Video Transcoding Pipeline
(CMP405) Containerizing Video: The Next Gen Video Transcoding Pipeline(CMP405) Containerizing Video: The Next Gen Video Transcoding Pipeline
(CMP405) Containerizing Video: The Next Gen Video Transcoding PipelineAmazon Web Services
 
Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceAmazon Web Services
 

Semelhante a Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft TLV 2017 (20)

Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
 
Data Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRData Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMR
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
 
Lighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache SparkLighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache Spark
 
AWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMRAWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMR
 
Amazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the conceptsAmazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the concepts
 
Masterclass Live: Amazon EMR
Masterclass Live: Amazon EMRMasterclass Live: Amazon EMR
Masterclass Live: Amazon EMR
 
Scaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMRScaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMR
 
Data science with spark on amazon EMR - Pop-up Loft Tel Aviv
Data science with spark on amazon EMR - Pop-up Loft Tel AvivData science with spark on amazon EMR - Pop-up Loft Tel Aviv
Data science with spark on amazon EMR - Pop-up Loft Tel Aviv
 
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar SeriesIntroducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
 
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...
 
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
 
Architetture serverless e pattern avanzati per AWS Lambda
Architetture serverless e pattern avanzati per AWS LambdaArchitetture serverless e pattern avanzati per AWS Lambda
Architetture serverless e pattern avanzati per AWS Lambda
 
Design patterns and best practices for data analytics with amazon emr (ABD305)
Design patterns and best practices for data analytics with amazon emr (ABD305)Design patterns and best practices for data analytics with amazon emr (ABD305)
Design patterns and best practices for data analytics with amazon emr (ABD305)
 
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
 
Analytics at Scale with Apache Spark on AWS with Jonathan Fritz
Analytics at Scale with Apache Spark on AWS with Jonathan FritzAnalytics at Scale with Apache Spark on AWS with Jonathan Fritz
Analytics at Scale with Apache Spark on AWS with Jonathan Fritz
 
(CMP405) Containerizing Video: The Next Gen Video Transcoding Pipeline
(CMP405) Containerizing Video: The Next Gen Video Transcoding Pipeline(CMP405) Containerizing Video: The Next Gen Video Transcoding Pipeline
(CMP405) Containerizing Video: The Next Gen Video Transcoding Pipeline
 
Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduce
 
EMR Training
EMR TrainingEMR Training
EMR Training
 

Mais de Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Mais de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Último

Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 

Último (20)

Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 

Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft TLV 2017

  • 1. Apache Spark and the Hadoop Ecosystem on AWS Getting Started with Amazon EMR Jonathan Fritz, Sr. Product Manager March 20, 2017
  • 2. Agenda • Quick introduction to Spark, Hive on Tez, and Presto • Building data lakes with Amazon EMR and Amazon S3 • Running jobs and security options • Customer use cases • Demo
  • 3. Quick introduction to Spark, Hive on Tez, and Presto
  • 4. Spark for fast processing join filter groupBy Stage 3 Stage 1 Stage 2 A: B: C: D: E: F: = cached partition= RDD map • Massively parallel • Uses DAGs instead of map- reduce for execution • Minimizes I/O by storing data in DataFrames in memory • Partitioning-aware to avoid network-intensive shuffle
  • 5. Spark components to match your use case
  • 6. Hive and Tez for batch ETL and SQL
  • 7. • Run Spark Driver in Client or Cluster mode • Spark and Tez applications run as a YARN application • Spark Executors and Tez Workers run in YARN Containers on NodeManagers in your cluster Amazon EMR runs Spark and Tez on YARN
  • 8. Presto: interactive SQL for analytics
  • 9. Important Presto Features High Performance • E.g. Netflix: runs 3500+ Presto queries / day on 25+ PB dataset in S3 with 350 active platform users Extensibility • Pluggable backends: Hive, Cassandra, JMX, Kafka, MySQL, PostgreSQL, MySQL, and more • JDBC, ODBC for commercial BI tools or dashboards • Client Protocol: HTTP+JSON, support various languages (Python, Ruby, PHP, Node.js, Java(JDBC), C#,…) ANSI SQL • complex queries, joins, aggregations, various functions (Window functions)
  • 10. On-cluster UIs Manage applications SQL editor, Workflow designer, Metastore browser Notebooks Design and execute queries and workloads And more using bootstrap actions!
  • 11. Building data lakes with Amazon EMR and Amazon S3
  • 12. Why Amazon EMR? Easy to Use Launch a cluster in minutes Low Cost Pay an hourly rate Open-Source Variety Latest versions of software Managed Spend less time monitoring Secure Easy to manage options Flexible Customize the cluster
  • 13. Create a fully configured cluster with the latest versions of Presto and Spark in minutes AWS Management Console AWS Command Line Interface (CLI) Or use a AWS SDK directly with the Amazon EMR API
  • 14.
  • 15. Hue (SQL Interface/Metastore Management) Zeppelin (Interactive Notebook) Ganglia (Monitoring) HiveServer2/Spark Thriftserver (JDBC/ODBC) Amazon EMR service Amazon EMR release Storage S3 (EMRFS), HDFS YARN Cluster Resource Management Batch MapReduce Interactive Tez In Memory Spark Applications Hive, Pig, Spark SQL/Streaming/ML, Flink, Mahout, Sqoop HBase/Phoenix Presto Streaming Flink
  • 16. Decouple compute and storage by using S3 as your data layer HDFS S3 is designed for 11 9’s of durability and is massively scalable EC2 Instance Memory Amazon S3 Amazon EMR Amazon EMR Intermediates stored on local disk or HDFS Local
  • 17. HBase on S3 for scalable NoSQL
  • 18. S3 tips: Partitions, compression, and file formats • Avoid key names in lexicographical order • Improve throughput and S3 list performance • Use hashing/random prefixes or reverse the date-time • Compress data set to minimize bandwidth from S3 to EC2 • Make sure you use splittable compression or have each file be the optimal size for parallelization on your cluster • Columnar file formats like Parquet can give increased performance on reads
  • 19. Many storage layers to choose from Amazon DynamoDB Amazon RDS Amazon Kinesis Amazon Redshift Amazon S3 Amazon EMR
  • 20. Use RDS/Aurora for an external Hive metastore Amazon Aurora Hive Metastore for external tables on S3 Amazon S3Set metastore location in hive-site
  • 21. Spot for task nodes Up to 80% off EC2 on-demand pricing On-demand for core nodes Standard Amazon EC2 pricing for on-demand capacity Use Spot and Reserved Instances to lower costs Meet SLA at predictable cost Exceed SLA at lower cost Amazon EMR supports most EC2 instance types
  • 22. Instance fleets for advanced Spot provisioning Master Node Core Instance Fleet Task Instance Fleet • Provision from a list of instance types with Spot and On-Demand • Launch in the most optimal Availability Zone based on capacity/price • Spot Block support
  • 23. Lower costs with Auto Scaling
  • 25. YARN schedulers - CapacityScheduler • Default scheduler specified in Amazon EMR • Queues • Single queue is set by default • Can create additional queues for workloads based on multitenancy requirements • Capacity Guarantees • set minimal resources for each queue • Programmatically assign free resources to queues • Adjust these settings using the classification capacity- scheduler in an EMR configuration object
  • 26. Configuring Executors – Dynamic Allocation • Optimal resource utilization • YARN dynamically creates and shuts down executors based on the resource needs of the Spark application • Spark uses the executor memory and executor cores settings in the configuration for each executor • Amazon EMR uses dynamic allocation by default, and calculates the default executor size to use based on the instance family of your Core Group
  • 27. Options to submit jobs – on cluster Web UIs: Hue SQL editor, Zeppelin notebooks, R Studio, Airpal, and more! Connect with ODBC / JDBC to HiveServer2, Spark Thriftserver, or Presto Use Hive and Spark Actions in your Apache Oozie workflow to create DAGs of jobs. (start using start-thriftserver.sh) Or, use the native APIs and CLIs for each application
  • 28. Options to submit jobs – off cluster Amazon EMR Step API Submit a Hive or Spark application Amazon EMR AWS Data Pipeline Airflow, Luigi, or other schedulers on EC2 Create a pipeline to schedule job submission or create complex workflows AWS Lambda Use AWS Lambda to submit applications to EMR Step API or directly to Hive or Spark on your cluster
  • 29. Security - configuring VPC subnets • Use Amazon S3 Endpoints in VPC for connectivity to S3 • Use Managed NAT for connectivity to other services or the Internet • Control the traffic using Security Groups • ElasticMapReduce-Master-Private • ElasticMapReduce-Slave-Private • ElasticMapReduce-ServiceAccess
  • 30. IAM Roles – managed or custom policies EMR Service Role EC2 Role
  • 31. Encryption – use security configurations
  • 33. Learn Models ModelsImpressions Clicks Activities Calibrate Evaluate Real Time Bidding S3 ETL Attribution Machine Learning S3Amazon Kinesis • 2 Petabytes Processed Daily • 2 Million Bid Decisions Per Second • Runs 24 X 7 on 5 Continents • Thousands of ML Models Trained per Day
  • 34. ADHOC Transient ETL Cluster Hive Transient Event Cluster Spark RDS Meta store Custom Replication Ideal Path RDS Replicated Meta store To support Hive Presto Data types Ideal Path Data Lake - Stats • 50+ Ad hoc Users • 1000+ Ad hoc Queries Day • 4+ Data Science Users • 20+ Sources of Data • 100+ ETL Hive Jobs • 25+ Spark Jobs • 2+ PB of Data
  • 35.
  • 36. Netflix uses Presto on Amazon EMR with a 25 PB dataset in Amazon S3 Full Presentation: https://www.youtube.com/watch?v=A4OU6i4AQsI
  • 37.
  • 38. SmartNews uses Presto as a reporting front-end AWS Big Data Blog: https://blogs.aws.amazon.com/bigdata/post/Tx2V1BSKGITCMTU/How-SmartNews- Built-a-Lambda-Architecture-on-AWS-to-Analyze-Customer-Behavior-an
  • 39. FINRA saved 60% by moving to HBase on EMR
  • 40. Demo