SlideShare uma empresa Scribd logo
1 de 29
Janos Matyas / CTO / SequenceIQ Inc.
GOAL / MOTIVATION
TECHNOLOGY STACK
PROBLEM RESOLUTION / HOW IT WORKS
RESULTS / ACHIEVEMENTS
OVERVIEW
GOAL / MOTIVATION
 Ease Hadoop provisioning – everywhere
 Automate and unify the process
 Arbitrary cluster size
 Same process through a cluster lifecycle (Dev, QA, UAT, Prod)
 (Auto) scaling Hadoop
 QoS
OUR APPROACH
 Use Docker
 Build cloud-specific ‘Dockerized’ images
 Provision the cluster
 Use Ambari
DOCKER
 Lightweight, portable
 Build once, run anywhere
 VM – without the overhead of a VM
 Isolated containers
 Automated and scripted
DOCKER – CONTAINERS vs. VMs
 Containers are isolated, but share OS and,
where appropriate, bins/libraries
APACHE AMBARI – ARCHITECTURE
 Easy Hadoop cluster provisioning
 Management and monitoring
 Key features – blueprints
 REST API
APACHE AMBARI – CREATE CLUSTER
 Define a blueprint (POST /api/v1/blueprints)
 Create cluster (POST /api/v1/clusters/mycluster)
HADOOP PROVISIONG ISSUES
 Each cloud provider has a proprietary API
 Create images for each provider
 Network configuration
 Service discovery
 Resize, failover, member join support
OUR APPROACH – DETAILS
 Build your Docker image
 Install or pre-install Hadoop services with Ambari
 Install Serf and dnsmasq
 Build your cloud image
 Use Ansible to create an image
 Provision the cluster
BUILD DOCKER IMAGES
 Create the Dockerfile
 Have Docker.io to build the image
 Optionally pre-install services
 Use Ambari
 Push image to Docker.io
 Licensing questions
BUILD CLOUD IMAGES
 Use a Docker ready base image
 Use Ansible to provision the image template
 Pull the Docker images
 Apply custom infrastructure
 Use cloud provider specific playbooks
 AWS EC2
 Azure
ANSIBLE
 Configuration as data
 Simplest way to automate IT
 Secure and agentless
 Goal oriented
 One playbook – multiple modules
 We use it to “burn” cloud images/templates
PROVISIONING – ISSUES
 FQDN
 /etc/hosts is read-only in Docker
 Everybody needs to know everybody
 DNS
 Single point of failure
 Dynamic cluster – nodes joining, leaving, failing
 Routing
 Cloud – ability to inter-host container routing
 Collision free private IP range for Docker bridge
 We need predefined host names/IP addresses
 /etc/hosts is read-only in Docker
 Use Ansible to provision the image template
 Pull the Docker images
 Start a DNS server
 Use it as a reference docker run -dns <IP_OF_DNS>
 Nodes need to know each other
PROVISIONING – SOLUTION
 FQDN
 Use –h and –dns Docker params
 DNS
 dnsmasq is running on each Docker container
 Serf member-xxx events trigger dnsmasq reconfiguration
 Routing
 Docker bridge configuration – follows a convention
SERF
 Gossip based membership
 Service discovery
 Decentralized
 Lightweight, fault tolerant
 Highly available
 DevOps friendly
 Keep an eye on Consul, Open vSwitch, pipework
SERF – DECENTRALIZED SERVICE DISCOVERY
 Gossip instead of heartbeat
 LAN, WAN profiles
 Provides membership information
 Event handlers: member_join, member_leave, member_failed, member-
update, member-reap, user
 Query
SERF – GOSSIPING
SERF – MEMBERSHIP, EVENT HANDLERS
DNSMASQ
 Network infrastructure for small networks
 Lightweight DNS, DHCP server
 Comes with most Linux distributions
AWS EC2 – HADOOP CLUSTER
 Use EC2 REST API to provision instances (from Dockerized image)
 Start Docker containers
 One Ambari server
 N-1 Ambari agents connecting to server
 Connect ambari-shell to
 Define blueprint
 Provision the cluster
AWS EC2 – NETWORK SECURITY
 Create a VPC
 Configure subnets
 Routing tables
 Security gateway
 Set ACL
 Configure VPN
AWS EC2 - CLOUDFORMATION
 Manually set up VPC is too complicated
 Use CloudFormation
 Manage the stack together
 Template-based
 Environments under version control
 Customizable at runtime
 No extra charge
"VpcId" : {
"Type" : "String",
"Description" : "VpcId of your existing Virtual Private Cloud (VPC)"
},
"SubnetId" : {
"Type" : "String",
"Description" : "SubnetId of an existing subnet (for the primary
network) in your Virtual Private Cloud (VPC)"
},
"SecondaryIPAddressCount" : {
"Type" : "Number",
"Default" : "1",
"MinValue" : "1",
"MaxValue" : "5",
"Description" : "Number of secondary IP addresses to assign to the
network interface (1-5)",
"ConstraintDescription": "must be a number from 1 to 5."
},
"SSHLocation" : {
"Description" : "The IP address range that can be used to SSH to the
EC2 instances",
"Type": "String",
"MinLength": "9",
"MaxLength": "18",
"Default": "0.0.0.0/0",
"AllowedPattern": "(d{1,3}).(d{1,3}).(d{1,3}).(d{1,3})/
(d{1,2})",
"ConstraintDescription": "must be a valid IP CIDR range of the form
x.x.x.x/x."
}
},
CLOUDBREAK
Cloudbreak is a powerful left surf that
breaks over a coral reef, a mile off
southwest the island of Tavarua, Fiji.
Cloudbreak is a cloud-agnostic
Hadoop as a Service API. Abstracts
the provisioning and ease
management and monitoring of on-
demand clusters.
Provisioning Hadoop has never been easier
CLOUDBREAK
 Benefits
 Elastic
 Scalable
 Blueprints
 Flexible
 Main REST resources
 /template – specify a cluster infrastructure
 /stack – creates a cloud infrastructure built from a template
 /blueprint – describes a Hadoop cluster
 /cluster – creates a Hadoop cluster
RESULTS AND ACHIEVEMENTS
 Hadoop as a Service API
 Available for EC2 and Azure cloud
 OpenStack, bare metal is coming soon
 Open source under Apache 2 licence
 Same goals as Apache Ambari Launchpad project
 What's next?
HADOOP SERVICES - AS A SERVICE
 Leverage YARN
 Slider (Hoya) providers
 HBase, Accumulo
 SequenceIQ providers - Flume, Tomcat
 YARN -1964
 QoS for YARN – heuristic scheduler
 Platform as a Service API
BANZAI PIPELINE
Banzai Pipeline is a surf reef break located
in Hawaii, off Ehukai Beach Park in
Pupukea on O'ahu's North Shore.
Banzai Pipeline is a RESTful
application development
platform for building on-
demand data and job pipelines
running on Hadoop YARN.
Banzai Pipeline is a big data API for the REST
THANK YOU
 Get the code: https://github.com/sequenceiq
 Read about: http://blog.sequenceiq.com
 Facebook: http://facebook.com/sequenceiq
 Twitter: http://twitter.com/sequenceiq
 LinkedIn: http://linkedin.com/sequenceiq
 Contact: janos.matyas@sequenceiq.com
FEEL FREE TO CONTRIBUTE

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Introduction to Apache CloudStack by David Nalley
Introduction to Apache CloudStack by David NalleyIntroduction to Apache CloudStack by David Nalley
Introduction to Apache CloudStack by David Nalley
 
CloudStack-Developer-Day
CloudStack-Developer-DayCloudStack-Developer-Day
CloudStack-Developer-Day
 
Hadoop on Docker
Hadoop on DockerHadoop on Docker
Hadoop on Docker
 
Docker, Mesos, Spark
Docker, Mesos, Spark Docker, Mesos, Spark
Docker, Mesos, Spark
 
Cloud stack for_beginners
Cloud stack for_beginnersCloud stack for_beginners
Cloud stack for_beginners
 
AWS-compared-to-OpenStack
AWS-compared-to-OpenStackAWS-compared-to-OpenStack
AWS-compared-to-OpenStack
 
Apache CloudStack from API to UI
Apache CloudStack from API to UIApache CloudStack from API to UI
Apache CloudStack from API to UI
 
Guaranteeing Storage Performance by Mike Tutkowski
Guaranteeing Storage Performance by Mike TutkowskiGuaranteeing Storage Performance by Mike Tutkowski
Guaranteeing Storage Performance by Mike Tutkowski
 
Avishay Traeger & Shimshon Zimmerman, Stratoscale - Deploying OpenStack Cinde...
Avishay Traeger & Shimshon Zimmerman, Stratoscale - Deploying OpenStack Cinde...Avishay Traeger & Shimshon Zimmerman, Stratoscale - Deploying OpenStack Cinde...
Avishay Traeger & Shimshon Zimmerman, Stratoscale - Deploying OpenStack Cinde...
 
Introduction to CloudStack
Introduction to CloudStack Introduction to CloudStack
Introduction to CloudStack
 
CloudStack Architecture Future
CloudStack Architecture FutureCloudStack Architecture Future
CloudStack Architecture Future
 
Migrate Oracle database to Amazon RDS
Migrate Oracle database to Amazon RDSMigrate Oracle database to Amazon RDS
Migrate Oracle database to Amazon RDS
 
Openstack: An Open Source Cloud Framework
Openstack: An Open Source Cloud FrameworkOpenstack: An Open Source Cloud Framework
Openstack: An Open Source Cloud Framework
 
OpenStack 101 update
OpenStack 101 updateOpenStack 101 update
OpenStack 101 update
 
Micro services vs hadoop
Micro services vs hadoopMicro services vs hadoop
Micro services vs hadoop
 
EMC & OpenStack: A View From Within
EMC & OpenStack: A View From WithinEMC & OpenStack: A View From Within
EMC & OpenStack: A View From Within
 
OpenStack Cinder
OpenStack CinderOpenStack Cinder
OpenStack Cinder
 
VNG/IRD - Cloud computing & Openstack discussion 3/5/2014
VNG/IRD - Cloud computing & Openstack discussion 3/5/2014VNG/IRD - Cloud computing & Openstack discussion 3/5/2014
VNG/IRD - Cloud computing & Openstack discussion 3/5/2014
 
OpenStack 101 Presentation
OpenStack 101 PresentationOpenStack 101 Presentation
OpenStack 101 Presentation
 
Containers and workload security an overview
Containers and workload security an overview Containers and workload security an overview
Containers and workload security an overview
 

Destaque

MLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reducedMLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reduced
Chao Chen
 
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
Yahoo Developer Network
 
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
Yahoo Developer Network
 
Budget Cuts And Their Effects
Budget Cuts And Their EffectsBudget Cuts And Their Effects
Budget Cuts And Their Effects
endre1mr
 
I want to visit Austrialia
I want to visit AustrialiaI want to visit Austrialia
I want to visit Austrialia
mliadvisor
 

Destaque (20)

Elastic HBase on Mesos - HBaseCon 2015
Elastic HBase on Mesos - HBaseCon 2015Elastic HBase on Mesos - HBaseCon 2015
Elastic HBase on Mesos - HBaseCon 2015
 
Big data processing using Hadoop with Cloudera Quickstart
Big data processing using Hadoop with Cloudera QuickstartBig data processing using Hadoop with Cloudera Quickstart
Big data processing using Hadoop with Cloudera Quickstart
 
MLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reducedMLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reduced
 
How could I automate log gathering in the distributed system
How could I automate log gathering in the distributed systemHow could I automate log gathering in the distributed system
How could I automate log gathering in the distributed system
 
HBaseConEast2016: HBase on Docker with Clusterdock
HBaseConEast2016: HBase on Docker with ClusterdockHBaseConEast2016: HBase on Docker with Clusterdock
HBaseConEast2016: HBase on Docker with Clusterdock
 
How mentoring can help you start contributing to open source
How mentoring can help you start contributing to open sourceHow mentoring can help you start contributing to open source
How mentoring can help you start contributing to open source
 
Luciano Resende's keynote at Apache big data conference
Luciano Resende's keynote at Apache big data conferenceLuciano Resende's keynote at Apache big data conference
Luciano Resende's keynote at Apache big data conference
 
Writing Apache Spark and Apache Flink Applications Using Apache Bahir
Writing Apache Spark and Apache Flink Applications Using Apache BahirWriting Apache Spark and Apache Flink Applications Using Apache Bahir
Writing Apache Spark and Apache Flink Applications Using Apache Bahir
 
SystemML - Declarative Machine Learning
SystemML - Declarative Machine LearningSystemML - Declarative Machine Learning
SystemML - Declarative Machine Learning
 
8a. How To Setup HBase with Docker
8a. How To Setup HBase with Docker8a. How To Setup HBase with Docker
8a. How To Setup HBase with Docker
 
Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)
Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)
Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)
 
Apache hadoop and cdh(cloudera distribution) introduction 基本介紹
Apache hadoop and cdh(cloudera distribution) introduction 基本介紹Apache hadoop and cdh(cloudera distribution) introduction 基本介紹
Apache hadoop and cdh(cloudera distribution) introduction 基本介紹
 
Big Data/Hadoop Option Analysis
Big Data/Hadoop Option AnalysisBig Data/Hadoop Option Analysis
Big Data/Hadoop Option Analysis
 
Shipping Applications to Production in Containers with Docker
Shipping Applications to Production in Containers with DockerShipping Applications to Production in Containers with Docker
Shipping Applications to Production in Containers with Docker
 
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
 
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
 
August 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieAugust 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache Oozie
 
Cafe Con Leche
Cafe Con LecheCafe Con Leche
Cafe Con Leche
 
Budget Cuts And Their Effects
Budget Cuts And Their EffectsBudget Cuts And Their Effects
Budget Cuts And Their Effects
 
I want to visit Austrialia
I want to visit AustrialiaI want to visit Austrialia
I want to visit Austrialia
 

Semelhante a Docker Based Hadoop Provisioning

Application Deployment on Openstack
Application Deployment on OpenstackApplication Deployment on Openstack
Application Deployment on Openstack
Docker, Inc.
 
Docker - Portable Deployment
Docker - Portable DeploymentDocker - Portable Deployment
Docker - Portable Deployment
javaonfly
 
Chef and Apache CloudStack (ChefConf 2014)
Chef and Apache CloudStack (ChefConf 2014)Chef and Apache CloudStack (ChefConf 2014)
Chef and Apache CloudStack (ChefConf 2014)
Jeff Moody
 

Semelhante a Docker Based Hadoop Provisioning (20)

Automating Your CloudStack Cloud with Puppet
Automating Your CloudStack Cloud with PuppetAutomating Your CloudStack Cloud with Puppet
Automating Your CloudStack Cloud with Puppet
 
Automating CloudStack with Puppet - David Nalley
Automating CloudStack with Puppet - David NalleyAutomating CloudStack with Puppet - David Nalley
Automating CloudStack with Puppet - David Nalley
 
Azure: Docker Container orchestration, PaaS ( Service Farbic ) and High avail...
Azure: Docker Container orchestration, PaaS ( Service Farbic ) and High avail...Azure: Docker Container orchestration, PaaS ( Service Farbic ) and High avail...
Azure: Docker Container orchestration, PaaS ( Service Farbic ) and High avail...
 
Best Practices for Running Kafka on Docker Containers
Best Practices for Running Kafka on Docker ContainersBest Practices for Running Kafka on Docker Containers
Best Practices for Running Kafka on Docker Containers
 
Docker in practice
Docker in practiceDocker in practice
Docker in practice
 
Docker Presentation at the OpenStack Austin Meetup | 2013-09-12
Docker Presentation at the OpenStack Austin Meetup | 2013-09-12Docker Presentation at the OpenStack Austin Meetup | 2013-09-12
Docker Presentation at the OpenStack Austin Meetup | 2013-09-12
 
Application Deployment on Openstack
Application Deployment on OpenstackApplication Deployment on Openstack
Application Deployment on Openstack
 
Docker - Portable Deployment
Docker - Portable DeploymentDocker - Portable Deployment
Docker - Portable Deployment
 
Dockerization of Azure Platform
Dockerization of Azure PlatformDockerization of Azure Platform
Dockerization of Azure Platform
 
The Future of SDN in CloudStack by Chiradeep Vittal
The Future of SDN in CloudStack by Chiradeep VittalThe Future of SDN in CloudStack by Chiradeep Vittal
The Future of SDN in CloudStack by Chiradeep Vittal
 
Donabe-essex-conference-readout
Donabe-essex-conference-readoutDonabe-essex-conference-readout
Donabe-essex-conference-readout
 
Directions for CloudStack Networking
Directions for CloudStack  NetworkingDirections for CloudStack  Networking
Directions for CloudStack Networking
 
Cloud level scalability - Nuxeo Tour 2014
Cloud level scalability - Nuxeo Tour 2014Cloud level scalability - Nuxeo Tour 2014
Cloud level scalability - Nuxeo Tour 2014
 
Docker - Demo on PHP Application deployment
Docker - Demo on PHP Application deployment Docker - Demo on PHP Application deployment
Docker - Demo on PHP Application deployment
 
Higher order infrastructure: from Docker basics to cluster management - Nicol...
Higher order infrastructure: from Docker basics to cluster management - Nicol...Higher order infrastructure: from Docker basics to cluster management - Nicol...
Higher order infrastructure: from Docker basics to cluster management - Nicol...
 
DevOpsCon Cloud Workshop
DevOpsCon Cloud Workshop DevOpsCon Cloud Workshop
DevOpsCon Cloud Workshop
 
Virtualizing Apache Spark and Machine Learning with Justin Murray
Virtualizing Apache Spark and Machine Learning with Justin MurrayVirtualizing Apache Spark and Machine Learning with Justin Murray
Virtualizing Apache Spark and Machine Learning with Justin Murray
 
.NET Developer Days - So many Docker platforms, so little time...
.NET Developer Days - So many Docker platforms, so little time....NET Developer Days - So many Docker platforms, so little time...
.NET Developer Days - So many Docker platforms, so little time...
 
Chef and Apache CloudStack (ChefConf 2014)
Chef and Apache CloudStack (ChefConf 2014)Chef and Apache CloudStack (ChefConf 2014)
Chef and Apache CloudStack (ChefConf 2014)
 
AWS re:Invent re:Cap - 배포를 더욱 손쉽고 빠르게: Amazon EC2 Container Service - 김일호
AWS re:Invent re:Cap - 배포를 더욱 손쉽고 빠르게: Amazon EC2 Container Service - 김일호AWS re:Invent re:Cap - 배포를 더욱 손쉽고 빠르게: Amazon EC2 Container Service - 김일호
AWS re:Invent re:Cap - 배포를 더욱 손쉽고 빠르게: Amazon EC2 Container Service - 김일호
 

Mais de DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

Mais de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Último (20)

[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 

Docker Based Hadoop Provisioning

  • 1. Janos Matyas / CTO / SequenceIQ Inc.
  • 2. GOAL / MOTIVATION TECHNOLOGY STACK PROBLEM RESOLUTION / HOW IT WORKS RESULTS / ACHIEVEMENTS OVERVIEW
  • 3. GOAL / MOTIVATION  Ease Hadoop provisioning – everywhere  Automate and unify the process  Arbitrary cluster size  Same process through a cluster lifecycle (Dev, QA, UAT, Prod)  (Auto) scaling Hadoop  QoS
  • 4. OUR APPROACH  Use Docker  Build cloud-specific ‘Dockerized’ images  Provision the cluster  Use Ambari
  • 5. DOCKER  Lightweight, portable  Build once, run anywhere  VM – without the overhead of a VM  Isolated containers  Automated and scripted
  • 6. DOCKER – CONTAINERS vs. VMs  Containers are isolated, but share OS and, where appropriate, bins/libraries
  • 7. APACHE AMBARI – ARCHITECTURE  Easy Hadoop cluster provisioning  Management and monitoring  Key features – blueprints  REST API
  • 8. APACHE AMBARI – CREATE CLUSTER  Define a blueprint (POST /api/v1/blueprints)  Create cluster (POST /api/v1/clusters/mycluster)
  • 9. HADOOP PROVISIONG ISSUES  Each cloud provider has a proprietary API  Create images for each provider  Network configuration  Service discovery  Resize, failover, member join support
  • 10. OUR APPROACH – DETAILS  Build your Docker image  Install or pre-install Hadoop services with Ambari  Install Serf and dnsmasq  Build your cloud image  Use Ansible to create an image  Provision the cluster
  • 11. BUILD DOCKER IMAGES  Create the Dockerfile  Have Docker.io to build the image  Optionally pre-install services  Use Ambari  Push image to Docker.io  Licensing questions
  • 12. BUILD CLOUD IMAGES  Use a Docker ready base image  Use Ansible to provision the image template  Pull the Docker images  Apply custom infrastructure  Use cloud provider specific playbooks  AWS EC2  Azure
  • 13. ANSIBLE  Configuration as data  Simplest way to automate IT  Secure and agentless  Goal oriented  One playbook – multiple modules  We use it to “burn” cloud images/templates
  • 14. PROVISIONING – ISSUES  FQDN  /etc/hosts is read-only in Docker  Everybody needs to know everybody  DNS  Single point of failure  Dynamic cluster – nodes joining, leaving, failing  Routing  Cloud – ability to inter-host container routing  Collision free private IP range for Docker bridge  We need predefined host names/IP addresses  /etc/hosts is read-only in Docker  Use Ansible to provision the image template  Pull the Docker images  Start a DNS server  Use it as a reference docker run -dns <IP_OF_DNS>  Nodes need to know each other
  • 15. PROVISIONING – SOLUTION  FQDN  Use –h and –dns Docker params  DNS  dnsmasq is running on each Docker container  Serf member-xxx events trigger dnsmasq reconfiguration  Routing  Docker bridge configuration – follows a convention
  • 16. SERF  Gossip based membership  Service discovery  Decentralized  Lightweight, fault tolerant  Highly available  DevOps friendly  Keep an eye on Consul, Open vSwitch, pipework
  • 17. SERF – DECENTRALIZED SERVICE DISCOVERY  Gossip instead of heartbeat  LAN, WAN profiles  Provides membership information  Event handlers: member_join, member_leave, member_failed, member- update, member-reap, user  Query
  • 19. SERF – MEMBERSHIP, EVENT HANDLERS
  • 20. DNSMASQ  Network infrastructure for small networks  Lightweight DNS, DHCP server  Comes with most Linux distributions
  • 21. AWS EC2 – HADOOP CLUSTER  Use EC2 REST API to provision instances (from Dockerized image)  Start Docker containers  One Ambari server  N-1 Ambari agents connecting to server  Connect ambari-shell to  Define blueprint  Provision the cluster
  • 22. AWS EC2 – NETWORK SECURITY  Create a VPC  Configure subnets  Routing tables  Security gateway  Set ACL  Configure VPN
  • 23. AWS EC2 - CLOUDFORMATION  Manually set up VPC is too complicated  Use CloudFormation  Manage the stack together  Template-based  Environments under version control  Customizable at runtime  No extra charge "VpcId" : { "Type" : "String", "Description" : "VpcId of your existing Virtual Private Cloud (VPC)" }, "SubnetId" : { "Type" : "String", "Description" : "SubnetId of an existing subnet (for the primary network) in your Virtual Private Cloud (VPC)" }, "SecondaryIPAddressCount" : { "Type" : "Number", "Default" : "1", "MinValue" : "1", "MaxValue" : "5", "Description" : "Number of secondary IP addresses to assign to the network interface (1-5)", "ConstraintDescription": "must be a number from 1 to 5." }, "SSHLocation" : { "Description" : "The IP address range that can be used to SSH to the EC2 instances", "Type": "String", "MinLength": "9", "MaxLength": "18", "Default": "0.0.0.0/0", "AllowedPattern": "(d{1,3}).(d{1,3}).(d{1,3}).(d{1,3})/ (d{1,2})", "ConstraintDescription": "must be a valid IP CIDR range of the form x.x.x.x/x." } },
  • 24. CLOUDBREAK Cloudbreak is a powerful left surf that breaks over a coral reef, a mile off southwest the island of Tavarua, Fiji. Cloudbreak is a cloud-agnostic Hadoop as a Service API. Abstracts the provisioning and ease management and monitoring of on- demand clusters. Provisioning Hadoop has never been easier
  • 25. CLOUDBREAK  Benefits  Elastic  Scalable  Blueprints  Flexible  Main REST resources  /template – specify a cluster infrastructure  /stack – creates a cloud infrastructure built from a template  /blueprint – describes a Hadoop cluster  /cluster – creates a Hadoop cluster
  • 26. RESULTS AND ACHIEVEMENTS  Hadoop as a Service API  Available for EC2 and Azure cloud  OpenStack, bare metal is coming soon  Open source under Apache 2 licence  Same goals as Apache Ambari Launchpad project  What's next?
  • 27. HADOOP SERVICES - AS A SERVICE  Leverage YARN  Slider (Hoya) providers  HBase, Accumulo  SequenceIQ providers - Flume, Tomcat  YARN -1964  QoS for YARN – heuristic scheduler  Platform as a Service API
  • 28. BANZAI PIPELINE Banzai Pipeline is a surf reef break located in Hawaii, off Ehukai Beach Park in Pupukea on O'ahu's North Shore. Banzai Pipeline is a RESTful application development platform for building on- demand data and job pipelines running on Hadoop YARN. Banzai Pipeline is a big data API for the REST
  • 29. THANK YOU  Get the code: https://github.com/sequenceiq  Read about: http://blog.sequenceiq.com  Facebook: http://facebook.com/sequenceiq  Twitter: http://twitter.com/sequenceiq  LinkedIn: http://linkedin.com/sequenceiq  Contact: janos.matyas@sequenceiq.com FEEL FREE TO CONTRIBUTE

Notas do Editor

  1. YAML
  2. Dev – env : use default Docker bridge (easy)
  3. -h for hostname, --dns to specify the DNS service to use Convention: AMI launch index
  4. Fire and forget Waits for anwer – limited response collection