Apache Spark and the Hadoop Ecosystem on AWS
1. Apache Spark and the
Hadoop Ecosystem on AWS
Getting Started with
Amazon EMR
Jonathan Fritz, Sr. Product Manager
March 20, 2017
2. Agenda
• Quick introduction to Spark, Hive on Tez, and
Presto
• Building data lakes with Amazon EMR
and Amazon S3
• Running jobs and security options
• Demo
• Customer use cases
7. • Run Spark Driver in
Client or Cluster mode
• Spark and Tez
applications run as a
YARN application
• Spark Executors and
Tez Workers run in
YARN Containers on
NodeManagers in your
cluster
Amazon EMR runs Spark and Tez on YARN
9. Important Presto Features
High Performance
• E.g., Netflix runs 3,500+ Presto queries / day on a 25+ PB dataset in S3 with 350 active
platform users
Extensibility
• Pluggable backends: Hive, Cassandra, JMX, Kafka, MySQL, PostgreSQL, and more
• JDBC, ODBC for commercial BI tools or dashboards
• Client Protocol: HTTP+JSON, support various languages (Python, Ruby, PHP, Node.js,
Java(JDBC), C#,…)
ANSI SQL
• Complex queries, joins, aggregations, various functions (window functions)
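To make the "HTTP+JSON" client protocol above concrete, here is a minimal sketch of how a client talks to the Presto coordinator without any driver: POST the SQL to `/v1/statement`, then follow `nextUri` links until the result pages run out. The coordinator address is a placeholder, and on EMR the port is 8889 rather than the vanilla 8080.

```python
import json
import urllib.request

# Placeholder coordinator address: on EMR, Presto listens on port 8889 of
# the master node (8080 on a vanilla install).
COORDINATOR = "http://master-node.example.internal:8889"


def build_statement_request(sql, user="analyst"):
    """Build the POST that starts a query: the SQL statement goes in the
    body, and the caller is identified by the X-Presto-User header."""
    return urllib.request.Request(
        COORDINATOR + "/v1/statement",
        data=sql.encode("utf-8"),
        headers={"X-Presto-User": user},
        method="POST",
    )


def run_query(sql, user="analyst"):
    """Submit a statement, then poll nextUri until the result set is
    exhausted, collecting any rows each page carries."""
    with urllib.request.urlopen(build_statement_request(sql, user)) as resp:
        state = json.load(resp)
    rows = list(state.get("data", []))
    while state.get("nextUri"):
        with urllib.request.urlopen(state["nextUri"]) as resp:
            state = json.load(resp)
        rows.extend(state.get("data", []))
    return rows
```

This same request/poll loop is what the Python, Ruby, PHP, and Node.js clients wrap under the hood.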
10. On-cluster UIs
Manage applications
SQL editor, Workflow designer,
Metastore browser
Notebooks
Design and execute
queries and workloads
And more using
bootstrap actions!
12. Why Amazon EMR?
Easy to Use
Launch a cluster in minutes
Low Cost
Pay an hourly rate
Open-Source Variety
Latest versions of software
Managed
Spend less time monitoring
Secure
Easy to manage options
Flexible
Customize the cluster
13. Create a fully configured cluster with the
latest versions of Presto and Spark in minutes
AWS Management
Console
AWS Command Line
Interface (CLI)
Or use an AWS SDK directly with the Amazon EMR API
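As a sketch of the SDK route, the request below launches a cluster with Spark and Presto through the EMR API using boto3. The key names match the `run_job_flow` call; the log bucket, instance types, and release label are illustrative placeholders (emr-5.4.0 was current around the time of this talk).

```python
# Illustrative cluster definition for boto3's emr.run_job_flow().
# Bucket names and instance types are placeholders.
cluster_config = {
    "Name": "spark-presto-demo",
    "ReleaseLabel": "emr-5.4.0",
    "Applications": [{"Name": "Spark"}, {"Name": "Presto"}],
    "Instances": {
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m4.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m4.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster up after steps finish
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",  # EC2 instance profile
    "ServiceRole": "EMR_DefaultRole",      # EMR service role
    "LogUri": "s3://my-bucket/emr-logs/",  # placeholder log bucket
}

# With AWS credentials configured, the call itself is one line:
# import boto3
# response = boto3.client("emr").run_job_flow(**cluster_config)
```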
16. Decouple compute and storage by using S3
as your data layer
HDFS
S3 is designed for 11
9’s of durability and is
massively scalable
EC2 Instance
Memory
Amazon S3
Amazon EMR
Amazon EMR
Intermediates
stored on local
disk or HDFS
Local
18. S3 tips: Partitions, compression, and file formats
• Avoid key names in lexicographical order
• Improve throughput and S3 list performance
• Use hashing/random prefixes or reverse the date-time
• Compress data set to minimize bandwidth from S3 to
EC2
• Make sure you use splittable compression or have each file
be the optimal size for parallelization on your cluster
• Columnar file formats like Parquet can give increased
performance on reads
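The prefix advice above can be sketched in a few lines. Both helpers below are hypothetical illustrations of the two techniques named on the slide (hashed prefixes and reversed date-times), and they reflect the 2017-era S3 guidance about spreading keys across index partitions.

```python
import hashlib

def hashed_key(natural_key, prefix_len=2):
    """Prepend a short hash of the natural key so uploads do not sort
    lexicographically (e.g. by date) under a single hot prefix."""
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{natural_key}"

def reversed_time_key(timestamp, filename):
    """Alternative trick: reverse the timestamp so consecutive writes
    land under different leading characters."""
    return f"{timestamp[::-1]}/{filename}"
```

So `2017-03-20/10/part-00000.gz` becomes something like `a7/2017-03-20/10/part-00000.gz`, and keys written in the same hour no longer share one leading prefix.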
19. Many storage layers to choose from
Amazon DynamoDB
Amazon RDS Amazon Kinesis
Amazon Redshift
Amazon S3
Amazon EMR
20. Use RDS/Aurora for an external Hive metastore
Amazon Aurora
Hive Metastore for
external tables on S3
Amazon S3
Set metastore location in hive-site
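Setting the metastore location "in hive-site" is done with the hive-site classification in an EMR configuration object. A sketch, with a placeholder Aurora endpoint and credentials:

```json
[
  {
    "Classification": "hive-site",
    "Properties": {
      "javax.jdo.option.ConnectionURL": "jdbc:mysql://aurora-endpoint.example.us-east-1.rds.amazonaws.com:3306/hive?createDatabaseIfNotExist=true",
      "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
      "javax.jdo.option.ConnectionUserName": "hive",
      "javax.jdo.option.ConnectionPassword": "hive-password-placeholder"
    }
  }
]
```

Every cluster that points at this URL shares one catalog of external tables over S3, so transient clusters can come and go without losing table definitions.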
21. Spot for
task nodes
Up to 80%
off EC2
on-demand
pricing
On-demand for
core nodes
Standard
Amazon EC2
pricing for
on-demand
capacity
Use Spot and Reserved Instances to lower costs
Meet SLA at predictable cost Exceed SLA at lower cost
Amazon EMR supports most EC2 instance types
22. Instance fleets for advanced Spot provisioning
Master Node Core Instance Fleet Task Instance Fleet
• Provision from a list of instance types with Spot and On-Demand
• Launch in the most optimal Availability Zone based on capacity/price
• Spot Block support
25. YARN schedulers - CapacityScheduler
• Default scheduler specified in Amazon EMR
• Queues
• Single queue is set by default
• Can create additional queues for workloads based on
multitenancy requirements
• Capacity Guarantees
• Set minimum resources for each queue
• Programmatically assign free resources to queues
• Adjust these settings using the capacity-scheduler
classification in an EMR configuration object
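A sketch of such a configuration object, with illustrative queue names and capacity splits (the property keys follow the standard YARN CapacityScheduler naming):

```json
[
  {
    "Classification": "capacity-scheduler",
    "Properties": {
      "yarn.scheduler.capacity.root.queues": "default,adhoc",
      "yarn.scheduler.capacity.root.default.capacity": "70",
      "yarn.scheduler.capacity.root.adhoc.capacity": "30"
    }
  }
]
```

Here scheduled ETL in `default` is guaranteed 70% of cluster resources while interactive users in `adhoc` get 30%, with free capacity flowing between them.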
26. Configuring Executors – Dynamic Allocation
• Optimal resource utilization
• YARN dynamically creates and shuts down executors
based on the resource needs of the Spark application
• Spark uses the executor memory and executor cores
settings in the configuration for each executor
• Amazon EMR uses dynamic allocation by default, and
calculates the default executor size to use based on the
instance family of your Core Group
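If you want to override the defaults described above, the executor sizing and dynamic-allocation bounds can be set through the spark-defaults classification; the values below are illustrative, not recommendations:

```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.dynamicAllocation.enabled": "true",
      "spark.dynamicAllocation.minExecutors": "2",
      "spark.dynamicAllocation.maxExecutors": "20",
      "spark.executor.memory": "4g",
      "spark.executor.cores": "2"
    }
  }
]
```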
27. Options to submit jobs – on cluster
Web UIs: Hue SQL editor,
Zeppelin notebooks,
R Studio, Airpal, and more!
Connect with ODBC / JDBC to
HiveServer2, Spark Thriftserver, or Presto
Use Hive and Spark Actions in your Apache
Oozie workflow to create DAGs of jobs.
(start it with
start-thriftserver.sh)
Or, use the native APIs and CLIs for
each application
28. Options to submit jobs – off cluster
Amazon EMR
Step API
Submit a Hive or Spark
application
Amazon EMR
AWS Data Pipeline
Airflow, Luigi, or other
schedulers on EC2
Create a pipeline
to schedule job
submission or create
complex workflows
AWS Lambda
Use AWS Lambda to
submit applications to
EMR Step API or directly
to Hive or Spark on your cluster
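The Lambda-to-Step-API path above can be sketched as a small handler that builds a Spark step and hands it to `add_job_flow_steps`. The cluster ID, script location, and step name are placeholders; `command-runner.jar` is EMR's helper jar for running CLI commands like spark-submit as a step.

```python
# Hypothetical Lambda handler submitting a Spark application via the
# EMR Step API. Cluster ID and S3 paths are placeholders.
def handler(event, context):
    step = {
        "Name": "nightly-spark-job",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # EMR helper jar that runs CLI commands
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "s3://my-bucket/jobs/etl.py",  # placeholder script location
            ],
        },
    }
    # With credentials in place, the submission is one call:
    # import boto3
    # boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])
    return step
```

Triggering this from an S3 event or a CloudWatch schedule gives you job submission with no scheduler host to manage.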
29. Security - configuring VPC subnets
• Use Amazon S3 Endpoints in VPC for
connectivity to S3
• Use Managed NAT for connectivity to
other services or the Internet
• Control the traffic using Security Groups
• ElasticMapReduce-Master-Private
• ElasticMapReduce-Slave-Private
• ElasticMapReduce-ServiceAccess
30. IAM Roles – managed or custom policies
EMR Service Role EC2 Role
35. Ad Hoc Data Lake
Transient ETL Cluster (Hive)
Transient Event Cluster (Spark)
RDS Metastore
Custom Replication (ideal path)
RDS Replicated Metastore to support Hive and Presto data types (ideal path)
Data Lake - Stats
• 50+ Ad hoc Users
• 1000+ Ad hoc Queries / Day
• 4+ Data Science Users
• 20+ Sources of Data
• 100+ ETL Hive Jobs
• 25+ Spark Jobs
• 2+ PB of Data
38. Netflix uses Presto on Amazon EMR with
a 25 PB dataset in Amazon S3
Full Presentation: https://www.youtube.com/watch?v=A4OU6i4AQsI
40. SmartNews uses Presto as a reporting front-end
AWS Big Data Blog: https://blogs.aws.amazon.com/bigdata/post/Tx2V1BSKGITCMTU/How-SmartNews-Built-a-Lambda-Architecture-on-AWS-to-Analyze-Customer-Behavior-an
Editor's Notes
DAGs offer benefits over MR due to lazy evaluation, which allows for optimized execution (e.g. predicate push-down)
ML pipeline persistence is now supported across all languages
Tez executes a DAG (directed acyclic graph) of tasks to execute a job
Vertices in the graph are application logic and edges are the movement of data
Can express this via a dataflow definition API, which makes it easy for higher level declarative applications to create query plans
Each vertex is modeled as an input, processor, and output module
Input and Output determine the data format and how it’s read or written.
Processor is the data transformation logic
Dynamic graph reconfiguration – Tez can adjust the execution at runtime when more information is known (e.g. adjusting the number of reducers)
Runs on YARN
Optimal resource management. Tez will try to reuse components so no operation is duplicated.
For example, Tez will run multiple tasks in a single YARN container to share objects across tasks
Tez Sessions
Can create a Tez Application Master and submit DAGs to this session – avoids the overhead of creating new Application Masters for each DAG
Containers can be reused across DAGs and data can be cached across sessions
Tez UI is located on the EMR Master Node at 8080/tez-ui
Presto’s pipelined execution model runs multiple stages at once and streams data from one stage to next as it becomes available, which reduces end-to-end latency
Presto dynamically compiles certain portions of the query plan into bytecode, which lets the JVM optimize and generate native machine code
In-memory distributed query engine
Support standard ANSI-SQL
Support rich analytical functions
Support wide range of data sources
Combine data from multiple sources in single query
Response time ranges from seconds to minutes
Components: a coordinator and multiple workers.
Queries are submitted by a client such as the Presto CLI to the coordinator.
The coordinator parses, analyzes and plans the query execution, then distributes the processing to the workers.
In memory parallel queries
Pipeline task execution
Data local computation with multi-threading
Cache hot queries and data
Just-in-time compile-to-bytecode operators
SQL optimization
Other optimizations (e.g. Predicate Pushdown)
But first, as a quick refresher: six main reasons why Amazon EMR
Some sample languages from which they can use the SDK: Python, Ruby, Java, Node.js, PHP
S3 connector for EMR (implements the Hadoop FileSystem interface); EMRFS and PrestoS3Filesystem
Improved performance and error handling options
Transparent to applications – just read/write to “s3://”
Consistent view feature for consistent listing
Support for Amazon S3 server-side and client-side encryption
Faster listing using EMRFS metadata
Important to note – Presto doesn’t support wire encryption yet, and uses the PrestoS3Filesystem (you can configure encryption manually there)
Web-based notebook that enables interactive data analytics
Data-driven, interactive charts and documents, that can be shared with colleagues
Spark integration (automatic SparkContext and SQLContext)
Support for multiple languages (shell, SparkSQL, Scala, PySpark)
Variety of Data Visualization options available
Explain each
Cross account roles
Access to DDB and KMS
Hearst Corporation is one of the nation’s largest diversified media and information companies.
Hearst has users on over 200 web properties, and they want to better understand which articles are performing well and what themes are trending. Their old process used Kinesis to ingest data and an EC2 instance to transfer that data to S3 in JSON format. Then, they would use Pig on EMR to process the data in S3 using UDFs written in Python, and write the output back to S3 in CSV format. That data would be picked up by SAS running on EC2 instances to do regression analysis and aggregation. Then, this data is loaded into ElasticSearch by another EMR cluster, and ElasticSearch is called by various production processes. However, they wanted to process more data and get better performance from this pipeline.
To collect weblog data, Hearst has a Node.js application running on Elastic Beanstalk taking click data and pumping it into a Kinesis stream. They like using Kinesis as a staging area to hold data before it gets processed; the stream is fault tolerant and can hold data for up to 24 hours (in case there is a delay downstream). Their new architecture uses Spark Streaming on an EMR cluster to consume the data from the Kinesis stream and process data in 5-minute windows (about 100 MB per segment). Also, by using EMR, it was easy for Hearst to spin up test clusters to iterate on the platform and quickly get a cluster into production. They are now using Redshift for this processing workload, and then using EMR to load the output of this to their ElasticSearch cluster.
25 PB datawarehouse on amazon S3 with Parquet file format
Access around 10% daily
550 billion events per day
350 active platform users
GumGum is an in-image and in-screen advertising platform, driving brand engagement for advertisers and publishers. GumGum reaches hundreds of millions of unique visitors per month across billions of images on premium websites.
Saves them time and money, and better perf. GumGum has several workflows with Spark on EMR.
First, for their forecasting server, they run a 24/7 Spark cluster and keep 365 days of data in memory and run queries for revenue forecasting from click impressions and ad impression data. It's a backend to a live system, and populates an interactive dashboard.
Second, have Spark on EMR clusters for batch processing of clickstream logs hourly and daily. The logs start in S3, and the output is also sent to S3. AWS Data Pipeline is used to schedule these jobs. They load the output into Redshift and use a BI tool with Redshift. They like Redshift's performance for regular reporting and persistence of storage.
Third, they can create on-demand Spark on EMR clusters for ad hoc analysis of unstructured data directly in S3, when they want to query data and don't have a predefined schema. It allows them to start investigating without even knowing the final questions yet.
SmartNews is a news content discovery app.
The news team at SmartNews uses data as input to their machine learning algorithm for delivering the very best stories on the Web. The product team relies on data to run various A/B tests, to learn about how their customers consume news articles, and to make product decisions. They built a Lambda architecture on AWS.
Use a shared RDS metastore
The serving layer indexes and exposes the views so that they can be queried, scanning billions of rows in double-digit seconds.