These are slides I presented at the BigData Conclave event in Bangalore, December 2013. The talk focused on sharing experiences building a managed BigData platform on top of the Amazon AWS infrastructure, and on how such a platform adds value to enterprises.
6. Reuse process
• Build common processing frameworks or libraries
• Ingest and Extract can be centralized services
• Frameworks can be developed for ETL processes, workflows, etc.
• Save time in building analytical solutions
(Diagram labels: BigData; Infrastructure, Data, Process)
7. Other Reasons
• Develop and leverage skill set of people
• Separate the concerns of running applications vs running infrastructure
• Evaluate and adopt new developments in the space
8. Flavors of managed BigData platforms
• Physical data centers
• Private or Public clouds
• Infrastructure Providers:
  • Amazon Web Services, Google Compute Engine, Microsoft Azure, IBM, OpenStack, Rackspace
• Platform Providers:
  • Qubole, Xurmo
  • In-House: Netflix, ...
9. Architectural Layers
Enterprise User Data / Workloads
User Data / Workloads
Enterprise Managed BigData Services (E.g. Netflix Genie)
Managed BigData Services (E.g. EMR, Savanna, Redshift)
Cloud Storage (E.g. S3, Swift)
Virtualized Compute (E.g. EC2, Nova)
10. Components in a managed platform
Presentation: Command Line Tools, API, Analytics Workbench
Data analytics: Data Catalog, Query, Aggregates, ETL
Platform: Ingest, FileSystem, Workflow, Provisioning, Scheduler, Job Management, Extract, Access Control, Eventing
Infrastructure: Redshift and S3 (Data), EMR (Compute), IAM (Identity), SNS
11. Elastic MapReduce - 101
• Provision a Hadoop cluster of given size, using given type of instances
• Support for most of the ecosystem: Hive, Pig, HBase, etc.
• Can scale up and down nodes for a cluster on demand
• User submits ‘jobflows’ - a sequence of Hadoop jobs
• Integrates with S3 as permanent store of data
• Integrates with other Amazon services
• Cost = Std. EC2 instance cost + extra + Std. S3 ops etc.
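The cost line above is simple arithmetic. A minimal sketch, assuming each node is billed per started hour (the one-hour minimum charge comes up on a later slide); the function name, the parameters, and every rate in the example are placeholders, not actual AWS prices.

```python
import math

def estimate_emr_cost(nodes, runtime_hours, ec2_rate, emr_surcharge, s3_cost=0.0):
    """Estimate the cost of one cluster run, in dollars.

    ec2_rate and emr_surcharge are hypothetical per-node hourly rates;
    s3_cost is a lump sum for storage operations.
    """
    # Assumption: every node is billed for each started hour, so round up.
    billed_hours = math.ceil(runtime_hours)
    compute = nodes * billed_hours * (ec2_rate + emr_surcharge)
    return compute + s3_cost

# Example with made-up rates: 10 nodes running a 90-minute jobflow.
cost = estimate_emr_cost(nodes=10, runtime_hours=1.5,
                         ec2_rate=0.24, emr_surcharge=0.06, s3_cost=1.0)
print(round(cost, 2))
```

Note that the 90-minute run is charged as two full hours per node, which is why short-lived clusters can be expensive.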
12. Reasons for having Enterprise Tier on EMR
• Improve usability by providing better abstractions, necessary automation
• Improve cost utilization by reusing infrastructure
• Improve performance by providing system level optimizations
13. Improving Usability
• EMR API expects some repetitive setup steps as part of job submission. E.g. Hive setup for all Hive jobs
• Provide a service API with a simpler interface that automates the setup.
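One way such a service API could hide the setup is to prepend the required steps based on the job type. This is only a sketch: `Step`, `SETUP_STEPS`, `build_jobflow`, and the step arguments are hypothetical names for illustration and do not reflect the actual EMR request format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    name: str
    args: List[str] = field(default_factory=list)

# Hypothetical setup steps per job type; real EMR setup arguments differ.
SETUP_STEPS = {
    "hive": [Step("setup-hive", ["--install-hive"])],
    "pig": [Step("setup-pig", ["--install-pig"])],
}

def build_jobflow(job_type: str, user_steps: List[Step]) -> List[Step]:
    """Prepend the setup steps a job type needs, so users never write them."""
    setup = SETUP_STEPS.get(job_type, [])
    return [*setup, *user_steps]

# The user only describes the work; the setup step is added automatically.
flow = build_jobflow("hive", [Step("daily-report", ["-f", "s3://example-bucket/report.q"])])
print([s.name for s in flow])
```

Keeping the setup in one table means a change to the Hive setup is made once, not in every job submission.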
15. Improving Usability
• EMR expects users to know the cluster sizes when launching jobs
• Users will either not know how to launch clusters, or will launch incorrectly sized ones.
• Separate cluster management from job management.
• Have the system (or administrators) launch clusters on behalf of users
• Have the system submit jobs to appropriate clusters
• Scale them according to the needs of the jobs automatically or administratively
16. Improve cost utilization
• Different cluster types in EMR: ephemeral (default) and static
• Ephemeral clusters can be a huge cost drain. Note: the minimum charge is for an hour
• Static clusters can also waste money (if unused)
• Go with a Hybrid model
  • Launch clusters on demand, but maximize utilization of what is already paid for: keep them alive for at least an hour
  • Reuse them for other jobs transparently
  • Shut down if not used anymore
• Saved $3000 in a month with this strategy
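The hybrid model can be sketched as two small decisions: which idle cluster to reuse, and when to shut an idle cluster down. The sketch assumes billing in whole hours counted from launch; `Cluster`, `grace_min`, and the function names are hypothetical, and the actual policy used in the talk may have differed.

```python
from dataclasses import dataclass

BILLING_PERIOD_MIN = 60  # assumption: billed per started hour from launch

@dataclass
class Cluster:
    cluster_id: str
    uptime_min: int    # minutes since launch
    running_jobs: int

def minutes_left_in_paid_hour(cluster: Cluster) -> int:
    """Minutes remaining in the hour that has already been billed."""
    return BILLING_PERIOD_MIN - (cluster.uptime_min % BILLING_PERIOD_MIN)

def should_shut_down(cluster: Cluster, grace_min: int = 5) -> bool:
    """Shut down only idle clusters that are about to enter a new billed hour."""
    if cluster.running_jobs > 0:
        return False
    return minutes_left_in_paid_hour(cluster) <= grace_min

def pick_reusable(clusters, expected_job_min):
    """Pick an idle cluster to reuse, or None if every cluster is busy."""
    idle = [c for c in clusters if c.running_jobs == 0]
    fits = [c for c in idle if minutes_left_in_paid_hour(c) >= expected_job_min]
    if fits:
        # Tightest fit: the job finishes inside time that is already paid for.
        return min(fits, key=minutes_left_in_paid_hour)
    # Otherwise reuse the idle cluster with the most paid time remaining.
    return max(idle, key=minutes_left_in_paid_hour, default=None)
```

An idle cluster 20 minutes into its hour is kept, since terminating it early saves nothing, while one at minute 57 is a shutdown candidate.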
17. Job Management System Design
(Diagram components: Job Management Service, Job Executor, Resource Estimator, Cluster Manager)
18. Job Management System Design
Manages provisioning, monitoring and termination of clusters. Matches job requests to suitable clusters based on policy.
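Matching job requests to clusters based on policy could look like the sketch below. The policy shown (enough nodes, required tags present, queue not full, least-loaded cluster wins) is an illustrative assumption, and `ClusterInfo`, `JobRequest`, and `match_cluster` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ClusterInfo:
    cluster_id: str
    nodes: int
    queued_jobs: int
    tags: frozenset = frozenset()

@dataclass
class JobRequest:
    job_id: str
    min_nodes: int
    required_tags: frozenset = frozenset()

def match_cluster(job: JobRequest, pool: List[ClusterInfo],
                  max_queue: int = 3) -> Optional[ClusterInfo]:
    """Return the least-loaded cluster that satisfies the job, or None.

    None signals the caller to provision a new cluster.
    """
    suitable = [
        c for c in pool
        if c.nodes >= job.min_nodes
        and job.required_tags <= c.tags      # cluster has every required tag
        and c.queued_jobs < max_queue
    ]
    # Prefer short queues; break ties with the smaller cluster.
    return min(suitable, key=lambda c: (c.queued_jobs, c.nodes), default=None)
```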
19. Job Management System Design
Pool of clusters brought up either on demand or predetermined, based on resource requirements, longevity, etc.
20. Job Management System Design
Knows how to convert a user jobflow to an EMR jobflow, and how to submit jobflows to the clusters identified by the cluster manager.
21. Job Management System Design
Monitors running jobs on clusters using CloudWatch (or a similar system), and determines whether to add nodes to or remove nodes from a cluster.
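The add-or-remove decision can be reduced to a small function over monitored numbers. In this sketch the metrics are passed in as plain integers rather than read from CloudWatch; the metric names, the slot-based sizing rule, and the `min_nodes`/`max_nodes` bounds are all assumptions for illustration.

```python
import math

def node_delta(pending_tasks: int, running_tasks: int, nodes: int,
               slots_per_node: int, min_nodes: int = 2, max_nodes: int = 50) -> int:
    """Return how many nodes to add (positive) or remove (negative)."""
    demand = pending_tasks + running_tasks
    # Nodes needed to run all current tasks at once, clamped to the allowed range.
    wanted = math.ceil(demand / slots_per_node)
    wanted = max(min_nodes, min(max_nodes, wanted))
    return wanted - nodes

# A 10-node cluster with 4 task slots per node and a backlog of 80 tasks.
print(node_delta(pending_tasks=80, running_tasks=40, nodes=10, slots_per_node=4))
```

A real estimator would likely smooth the metrics over time to avoid resizing on every spike.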
22. Job Management System Design
Front-end service API for users to submit their jobs.