Gateway
Cluster Virtualization Framework




Konstantin V Shvachko
Po Cheung
Priyo Mustafi
Hadoop Platform Team, eBay




Hadoop World Conference
November 9, 2011
Hadoop Cluster Components


• HDFS – a distributed file system
      – NameNode – namespace and block management
      – DataNodes – block replica container
      – BackupNode – checkpointer

• MapReduce – a framework for distributed computations
      – JobTracker – job scheduling, resource management, lifecycle coordination
      – TaskTracker – task execution module


[Diagram: NameNode and JobTracker as master nodes above three worker nodes, each running a TaskTracker and a DataNode]



2   eBay Inc. confidential
Cluster Access via Portal Nodes


• Users access Hadoop clusters via dedicated portal nodes located behind
  corporate firewalls
      – Login (ssh) to the portal node: authentication and authorization
      – Access clusters: run HDFS commands, submit jobs



[Diagram: users reach the cluster (NameNode, JobTracker, and worker nodes each running a TaskTracker and a DataNode) through Portal Node(s) behind a Firewall]



Use Case #1: Development of New Applications


• Developers of new applications fall into a cycle of
  moving programs, input and output data
  between their dev boxes, the portal, and the Hadoop clusters.


    Develop an application;
    while( my manager is unsatisfied ) {
       build application on your desktop;
       scp myapp.jar or in.mydata to the portal node;
       Run application on the cluster (data in HDFS);
       Verify job results with the manager;
       Fix application bugs or develop more;
    }
    Offload output data from the cluster;



Use Case #2: Access to Public Datasets


• Scientific data:
      – Genomics datasets
      – Fundamental physics experiments (LHC in Nebraska)
      – Astronomical images

• Data is public, but not the servers used to store and process data
• Geographically separated datacenters
• Users should be able to access and analyze data via the internet
• Implies direct login to the clusters for everybody
      – Complex security issues




Problem: Portal Nodes as Shared Resources


• Developers hate transferring programs to portal nodes
• Input data must first be transferred to the portal, then to HDFS
• Developers tend to use portals as their dev nodes
      – Setup development environments
      – Connect to git repositories

• Portals are shared multi-tenant resources
      – Community property

• Portal nodes become yet another cluster component
      – Maintenance overhead for cluster administrators

• Public datasets: need access without direct login to cluster portals




Gateway Project: Main Objective


• Gateway is a cluster virtualization framework that
  provides unified and seamless access to Hadoop clusters
  from users’ workplace computers through corporate firewalls.
[Diagram: Gateway Server(s) in front of the cluster's NameNode, JobTracker, and worker nodes each running a TaskTracker and a DataNode]




Gateway Project: Principal Benefits


1. Unified access to multiple Hadoop clusters through the corporate firewalls
      – Multiple clusters within the same datacenter
        “HDFS Scalability: The limits to growth” USENIX ;login: 2010
        Connotations with Federation in implementation (ViewFS) and purpose
      – Clusters in different datacenters

2. Service availability:
   failover to active clusters when one has scheduled/unscheduled downtime
3. Flexible cluster upgrades:
   redirect traffic to other clusters when one is upgrading
4. Versioning:
   access to clusters running different versions of Hadoop
5. Load balancing:
   smart job submission based on cluster workloads


Network Requirements


• Gateway Servers are positioned on the boundary between the corporate
  and “public” networks
• Gateway Servers can
      – communicate with user desktops/laptops residing on the public network and with
        Hadoop clusters running in different data centers within the corporate network.

• Due to firewalls there is no direct connectivity from the public network to
  Hadoop clusters and vice versa other than via the Gateway Servers.
• Gateway plays the role of a proxy between users and Hadoop clusters
      – Users delegate execution of their jobs and HDFS commands to the Gateway
        servers.
      – The servers talk to the actual clusters and return the replies back to the users.




Functional Requirements


• The cluster virtualization framework needs to support
       – current Java and command-line user-facing Hadoop APIs
       – existing Hadoop applications and jobs, which should continue to run from user boxes the
         same way as they used to from portal nodes

• Transparent use of client side libraries:
       – Pig, Hive, Cascading, Hadoop shell commands

• Authorization and Authentication
       – As a replacement for existing portal nodes, Gateway should provide adequate
         user authentication and authorization

• Unified web UI combining UIs of the serviced clusters




Gateway Architecture


Gateway Virtualization Framework has two main components:
• Job Submission system, represented by
       – Gateway MapReduce Server (GWMRServer) on the server side, and
       – regular Hadoop job submission and status tracking tools
         contacting GWMRServer via the standard Hadoop JobClient.

• Virtualization of File System Access is represented by
       – GatewayFileSystem on the client side and
       – Gateway File System Server (GWFSServer) on the server side.

[Diagram: JobClient connects to GWMR Server; gwfs:// clients connect to GWFS Server]



Job Submission


• Hadoop uses JobClient to submit jobs
       – Job is defined by its configuration file and the job jar
       – JobClient loads these two files along with other user-specified files required for
         the job to HDFS and submits the job to the JobTracker
       – the job is then scheduled for execution

• GWMRServer is the only component needed to virtualize job submission.
  No specialized gateway client is required
• Regular Hadoop JobClients are configured to send submissions to
  GWMRServer instead of a JobTracker
• Job Submission Virtualization allows submitting jobs to multiple MR clusters
  via GWMRServer
• GWMRServer selects one of the clusters and further submits the job to the
  respective JobTracker
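
The point that no specialized client is needed can be illustrated with a small Python sketch. It is not the Gateway implementation: `FakeJobTracker`, `GatewayMRServer`, `run_client`, and the selection function are hypothetical stand-ins, and the real servers speak Hadoop RPC rather than plain method calls. The sketch only shows that the client code stays the same whether it is pointed at a tracker or at the gateway.

```python
class FakeJobTracker:
    """Stand-in for a real JobTracker; records the jobs it receives."""

    def __init__(self, name):
        self.name = name
        self.jobs = []

    def submit_job(self, job_conf, job_jar):
        self.jobs.append((job_conf, job_jar))
        return f"{self.name}-job-{len(self.jobs)}"


class GatewayMRServer:
    """Exposes the same submit_job call as a tracker and forwards to one of them."""

    def __init__(self, trackers, select):
        self.trackers = trackers
        self.select = select  # hypothetical policy: list of trackers -> one tracker

    def submit_job(self, job_conf, job_jar):
        tracker = self.select(self.trackers)
        return tracker.submit_job(job_conf, job_jar)


def run_client(endpoint):
    """Client code is identical whether endpoint is a tracker or the gateway."""
    return endpoint.submit_job({"mapred.job.name": "wordcount"}, "myapp.jar")


trackers = [FakeJobTracker("clusterA"), FakeJobTracker("clusterB")]
# Assumed policy for the example: pick the tracker that has received the fewest jobs.
gateway = GatewayMRServer(trackers, select=lambda ts: min(ts, key=lambda t: len(t.jobs)))

direct_id = run_client(trackers[0])  # talks to a tracker directly
gateway_id = run_client(gateway)     # talks to the gateway instead
```

Only the endpoint passed to `run_client` changes, which mirrors reconfiguring a regular JobClient to address GWMRServer.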


HDFS Access


• File System Access is virtualized via GatewayFileSystem and GWFSServer
• GatewayFileSystem is a new specialized client for accessing HDFS clusters
  via GWFSServer
       – The client is instantiated automatically based on configuration parameters set up
         to access the gateway server instead of HDFS
       – GatewayFileSystem passes the client request to GWFSServer
       – The gateway server instantiates a traditional HDFS client (DistributedFileSystem)
         pointing to the requested cluster
       – The server executes the request on the cluster and returns the result to the
         gateway client

• Unlike Job Submission, the virtualized File System Access is always cluster
  aware
       – A user accessing a file must explicitly specify which HDFS cluster the file
         belongs to
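
The slides do not say how the cluster is named in a path. As an illustration only, the Python sketch below assumes a hypothetical layout `gwfs://<gateway>/<cluster>/<path>` and shows how a server could split such a URI into the cluster to contact and the path on that cluster. The function name and the layout are assumptions, not part of the Gateway design.

```python
from urllib.parse import urlparse


def split_gateway_path(uri):
    """Split a gwfs URI into (gateway address, cluster name, path on that cluster).

    Assumes the hypothetical layout gwfs://<gateway>/<cluster>/<path>.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "gwfs":
        raise ValueError(f"not a gateway URI: {uri}")
    parts = parsed.path.lstrip("/").split("/", 1)
    if not parts[0]:
        # File system access is cluster aware: a missing cluster is an error.
        raise ValueError("the HDFS cluster must be named explicitly")
    cluster = parts[0]
    path = "/" + parts[1] if len(parts) > 1 else "/"
    return parsed.netloc, cluster, path


example = split_gateway_path("gwfs://gw.example.com:9000/clusterA/user/me/in.mydata")
```

A URI without a cluster component is rejected rather than routed to a default cluster, which matches the cluster-aware behavior described above.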



GWMR: Implementation



• GWMRServer implements mapred.JobSubmissionProtocol (H-0.20) or
  mapreduce.ClientProtocol (> H-0.20)
• GWMRServer can be accessed via the regular Hadoop command-line interface
  and Java interface
• MR clients communicate (submit jobs and obtain job information) directly
  with GWMRServer as if they were talking to a real JobTracker via hadoop.RPC
• GWMRServer redirects the job to one of the clusters, based on
       – Data location
       – Cluster workload
       – User group information
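
How the three criteria are combined is not stated in the slides. The Python sketch below shows one possible ordering, offered purely as an assumption: user group acts as a hard filter, data location is preferred next, and workload breaks ties. `ClusterInfo`, its fields, and `choose_cluster` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterInfo:
    """Hypothetical summary of one cluster as known to the gateway."""
    name: str
    load: float  # fraction of task slots in use, 0.0 to 1.0
    datasets: set = field(default_factory=set)
    allowed_groups: set = field(default_factory=set)


def choose_cluster(clusters, input_dataset, user_group):
    """Pick a cluster the group may use, preferring data locality, then low load."""
    eligible = [c for c in clusters if user_group in c.allowed_groups]
    if not eligible:
        raise PermissionError(f"group {user_group!r} has no cluster to run on")
    # False sorts before True, so clusters that hold the dataset come first.
    return min(eligible, key=lambda c: (input_dataset not in c.datasets, c.load))


clusters = [
    ClusterInfo("A", load=0.9, datasets={"clicks"}, allowed_groups={"search"}),
    ClusterInfo("B", load=0.2, datasets={"logs"}, allowed_groups={"search", "ads"}),
    ClusterInfo("C", load=0.1, datasets={"clicks"}, allowed_groups={"ads"}),
]
for_search = choose_cluster(clusters, "clicks", "search").name
for_ads = choose_cluster(clusters, "clicks", "ads").name
```

In this example the busy cluster A still wins for the search group because it holds the input data; a different weighting would be equally consistent with the slides.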




GWMR: Implementation Continued



• GWMRServer is stateless (or keeps a very lightweight state)
       – allows setting up pools of Gateway servers in order to avoid a single point of failure

• On startup GWMRServer reads configuration from “gateway-site.xml”,
  which determines the Hadoop MR clusters it must serve
• GWMRServer has a web UI, similar to the JobTracker UI, which aggregates
  data from available JobTrackers
• GWMRServer supports job sequencing, so that chained MR jobs initiated
  by a single Pig or Hive job are scheduled on the same cluster
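
Job sequencing can be pictured as a small sticky mapping, which would also be an example of the "very lightweight state" mentioned above. The Python sketch below is an assumption about how this might work, not the actual mechanism: `SequencingRouter` and the notion of a sequence id shared by all MR jobs of one Pig or Hive job are hypothetical.

```python
import itertools


class SequencingRouter:
    """Routes every job of a sequence to the cluster chosen for its first job."""

    def __init__(self, choose):
        self.choose = choose  # policy called only for the first job of a sequence
        self.assigned = {}    # sequence id -> cluster name (the lightweight state)

    def route(self, sequence_id):
        if sequence_id not in self.assigned:
            self.assigned[sequence_id] = self.choose()
        return self.assigned[sequence_id]

    def finish(self, sequence_id):
        """Forget a sequence once its last job has completed."""
        self.assigned.pop(sequence_id, None)


# Assumed policy for the example: alternate between two clusters.
rotation = itertools.cycle(["clusterA", "clusterB"])
router = SequencingRouter(choose=lambda: next(rotation))

first_pig_job = router.route("pig-1")
hive_job = router.route("hive-7")
second_pig_job = router.route("pig-1")  # same sequence, same cluster
```

Keeping later jobs on the cluster that holds the intermediate output of earlier ones avoids copying that output between clusters.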




GatewayFileSystem: Implementation


• GatewayFileSystem is a subclass of the FileSystem abstract class,
  similar to LocalFileSystem, HftpFileSystem, S3FileSystem
       – gwfs://              - GatewayFileSystem
       – file://              - LocalFileSystem
       – hdfs://              - DistributedFileSystem
       – har://               - HarFileSystem
       – hftp://              - HftpFileSystem
       – s3://                - S3FileSystem
       – kfs://               - KosmosFileSystem

• GatewayFileSystem is instantiated based on the URI scheme given in the
  fs.default.name (fs.defaultFS) property of core-site.xml
       – fs.default.name = gwfs://<GWFSServer-address>
       – fs.gwfs.impl = org.apache.hadoop.gateway.fs.GatewayFileSystem
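
The lookup described above can be mimicked in a few lines of Python. This is a sketch of the idea (scheme of the default URI, then the `fs.<scheme>.impl` property), not Hadoop's own code; the gateway address is a placeholder and `resolve_filesystem_class` is a hypothetical helper.

```python
from urllib.parse import urlparse

# Values as they would appear in core-site.xml; the gateway address is a placeholder.
conf = {
    "fs.default.name": "gwfs://gateway.example.com:9000",
    "fs.gwfs.impl": "org.apache.hadoop.gateway.fs.GatewayFileSystem",
    "fs.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem",
}


def resolve_filesystem_class(conf, path=None):
    """Return the implementation class name for a path.

    Paths without a scheme fall back to the scheme of fs.default.name.
    """
    scheme = urlparse(path).scheme if path else ""
    if not scheme:
        scheme = urlparse(conf["fs.default.name"]).scheme
    key = f"fs.{scheme}.impl"
    if key not in conf:
        raise KeyError(f"no file system configured for scheme {scheme!r}")
    return conf[key]


default_impl = resolve_filesystem_class(conf, "/user/me/in.mydata")
explicit_impl = resolve_filesystem_class(conf, "hdfs://namenode:8020/user/me")
```

With this configuration an unqualified path such as `/user/me/in.mydata` resolves to GatewayFileSystem, so existing commands go through the gateway without being rewritten.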




GWFSServer: Implementation


• GatewayFileSystem passes client requests to GWFSServer using
       – a new RPC protocol – GWFSProtocol, and
       – a new binary data transfer protocol – DataTProtocol

• The Gateway server processes GWFSProtocol requests
       – It instantiates a real DistributedFileSystem pointing to the required cluster,
       – executes the request and returns results back to the gateway client

• DataTProtocol transfers data between the Gateway clients and the server
       – The data transfer is a direct pipeline between a gateway client and HDFS
       – GWFSServer reads data from HDFS and pipelines it to gateway client via
         DataTProtocol, and vice versa for write

• GWFSServer is stateless. This will allow setting up pools of servers in order
  to avoid a single point of failure and to provide load balancing
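
The wire format of DataTProtocol is not described in the slides, so the Python sketch below illustrates only the pipelining idea: the server relays data chunk by chunk instead of staging the whole file. The in-memory streams stand in for an HDFS input stream and a client connection, and `relay` and the chunk size are assumptions.

```python
import io


def relay(source, sink, chunk_size=64 * 1024):
    """Copy source to sink one chunk at a time; return the number of bytes relayed."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        sink.write(chunk)
        total += len(chunk)
    return total


hdfs_side = io.BytesIO(b"x" * 200_000)  # stand-in for an HDFS input stream
client_side = io.BytesIO()              # stand-in for the connection to the gateway client
relayed = relay(hdfs_side, client_side)
```

For a write the roles are swapped: the client connection becomes the source and the HDFS output stream becomes the sink. Because nothing is kept between chunks, the relay fits a stateless server.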




Versioning


• GWMRServer can serve JobClients of a specific version only. Incompatible
  versions of Hadoop will require different implementations of GWMRServer
       – The service will run multiple versions of GWMRServer so that client requests
         can be redirected to a server serving the compatible version

• The same instance of GWMRServer can submit jobs to and query map-reduce
  clusters running different versions of Hadoop
       – GWMRServer discovers the Hadoop version of a particular cluster, and uses the
         respective Hadoop jars to instantiate an appropriate version of the JobClient

• GatewayFileSystem-to-GWFSServer communication is independent of
  HDFS
       – No need to implement a new GWFSServer for every new Hadoop release.

• The same instance of GWFSServer can access clusters running different
  versions of HDFS
       – GWFSServer discovers the HDFS cluster version and uses the respective jars to
         instantiate an appropriate version of the DistributedFileSystem
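
Version discovery followed by client selection can be sketched as a registry keyed by version. The Python below is illustrative only: `ClientRegistry` is a hypothetical name, the factories return strings where the real server would load version-specific jars, and matching on the major.minor pair is an assumption rather than a rule stated in the slides.

```python
def major_minor(version):
    """Reduce a version string such as '0.20.2' to its (major, minor) pair."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


class ClientRegistry:
    """Maps a discovered cluster version to a factory for a matching client."""

    def __init__(self):
        self.factories = {}  # (major, minor) -> callable building a client

    def register(self, version, factory):
        self.factories[major_minor(version)] = factory

    def client_for(self, cluster_version):
        key = major_minor(cluster_version)
        if key not in self.factories:
            raise LookupError(f"no client jars for Hadoop {cluster_version}")
        return self.factories[key]()


registry = ClientRegistry()
registry.register("0.20", lambda: "client built from 0.20 jars")
registry.register("0.22", lambda: "client built from 0.22 jars")

client = registry.client_for("0.20.2")  # version as reported by the cluster
```

A cluster reporting a version with no registered factory fails fast with `LookupError` instead of being served by an incompatible client.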

Project Status


• Support for Hadoop 0.20.xxx
  Plan for 0.22
• Authorization & Authentication
• Job Chaining
• Packaging
       – It is convenient for users to have
         the entire Hadoop client suite
         installed, configured,
         and packaged as a VM

• Plan to open-source soon
• Developers wanted




Thank You!





Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Último

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 

Último (20)

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 

Hadoop World 2011: Hadoop Gateway - Konstantin Shvachko, eBay

  • 1. Gateway: Cluster Virtualization Framework
      Konstantin V Shvachko, Po Cheung, Priyo Mustafi
      Hadoop Platform Team, eBay
      Hadoop World Conference, November 9, 2011
  • 2. Hadoop Cluster Components
      • HDFS – a distributed file system
          – NameNode – namespace and block management
          – DataNodes – block replica container
          – BackupNode – checkpointer
      • MapReduce – a framework for distributed computations
          – JobTracker – job scheduling, resource management, lifecycle coordination
          – TaskTracker – task execution module
      [Diagram: NameNode and JobTracker above three worker nodes, each running a TaskTracker and a DataNode]
  • 3. Cluster Access via Portal Nodes
      • Users access Hadoop clusters via dedicated portal nodes located behind corporate firewalls
          – Login (ssh) to the portal node: authentication and authorization
          – Access clusters: run HDFS commands, submit jobs
      [Diagram: Portal Node(s) behind the Firewall, in front of the NameNode, JobTracker, and TaskTracker/DataNode workers]
  • 4. Use Case #1: Development of New Applications
      • Developers of new applications fall into a cycle of moving programs, input and output data between their dev boxes, the portal, and the Hadoop clusters.

        Develop an application;
        while( my manager is unsatisfied ) {
           build application on your desktop;
           scp myapp.jar or in.mydata to the portal node;
           Run application on the cluster (data in HDFS);
           Verify job results with the manager;
           Fix application bugs or develop more;
        }
        Offload output data from the cluster;
  • 5. Use Case #2: Access to Public Datasets
      • Scientific data:
          – Genomics datasets
          – Fundamental physics experiments (LHC in Nebraska)
          – Astronomical images
      • Data is public, but not the servers used to store and process data
      • Geographically separated datacenters
      • Users should be able to access and analyze data via internet
      • Implies direct login to the clusters for everybody
          – Complex security issues
  • 6. Problem: Portal Nodes as Shared Resources
      • Developers hate transferring programs to portal nodes
      • Input data must first be transferred to the portal, then to HDFS
      • Developers tend to use portals as their dev nodes
          – Set up development environments
          – Connect to git repositories
      • Portals are shared multi-tenant resources
          – Community property
      • Portal nodes become yet another cluster component
          – Maintenance overhead for cluster administrators
      • Public datasets: need access without direct login to cluster portals
  • 7. Gateway Project: Main Objective
      • Gateway is a cluster virtualization framework that provides unified and seamless access to Hadoop clusters from users’ workplace computers through corporate firewalls.
      [Diagram: Gateway Server(s) in front of the NameNode, JobTracker, and TaskTracker/DataNode workers]
  • 8. Gateway Project: Principal Benefits
      1. Unified access to multiple Hadoop clusters through the corporate firewalls
          – Multiple clusters within the same datacenter
            “HDFS Scalability: The limits to growth” USENIX ;login: 2010
            Connotations with Federation in implementation (ViewFS) and purpose
          – Clusters in different datacenters
      2. Service availability: failover to active clusters when one has scheduled/unscheduled downtime
      3. Flexible cluster upgrades: redirect traffic to other clusters when one is upgrading
      4. Versioning: access to clusters running different versions of Hadoop
      5. Load balancing: smart job submission based on cluster workloads
  • 9. Network Requirements
      • Gateway Servers are positioned on the boundary between the corporate and “public” networks
      • Gateway Servers can communicate both with user desktops/laptops residing on the public network and with Hadoop clusters running in different data centers within the corporate network.
      • Due to firewalls, there is no direct connectivity between the public network and the Hadoop clusters, in either direction, other than via the Gateway Servers.
      • Gateway plays the role of a proxy between users and Hadoop clusters
          – Users delegate execution of their jobs and HDFS commands to the Gateway servers.
          – The servers talk to the actual clusters and return the replies to the users.
  • 10. Functional Requirements
      • The cluster virtualization framework needs to support
          – current Java and command-line user-facing Hadoop APIs
          – existing Hadoop applications and jobs, which should continue to run from user boxes the same way as they used to from portal nodes
      • Transparent use of client-side libraries:
          – Pig, Hive, Cascading, Hadoop shell commands
      • Authorization and Authentication
          – As a replacement for existing portal nodes, Gateway should provide adequate user authentication and authorization
      • Unified Web UI combining UIs of the serviced clusters
  • 11. Gateway Architecture
      Gateway Virtualization Framework has two main components:
      • Job Submission system, represented by
          – Gateway MapReduce Server (GWMRServer) on the server side, and
          – regular Hadoop job submission and status tracking tools contacting GWMRServer via the standard Hadoop JobClient.
      • Virtualization of File System Access, represented by
          – GatewayFileSystem on the client side and
          – Gateway File System Server (GWFSServer) on the server side.
      [Diagram: JobClient connecting to the GWMR Server; gwfs:// client connecting to the GWFS Server]
  • 12. Job Submission
      • Hadoop uses JobClient to submit jobs
          – Job is defined by its configuration file and the job jar
          – JobClient loads these two files along with other user-specified files required for the job to HDFS and submits the job to the JobTracker
          – the job is then scheduled for execution
      • GWMRServer is the only component needed to virtualize job submission. No specialized gateway client is required
      • Regular Hadoop JobClients are configured to send submissions to GWMRServer instead of a JobTracker
      • Job Submission Virtualization allows submitting jobs to multiple MR clusters via GWMRServer
      • GWMRServer selects one of the clusters and further submits the job to the respective JobTracker
  • 13. HDFS Access
      • File System Access is virtualized via GatewayFileSystem and GWFSServer
      • GatewayFileSystem is a new specialized client for accessing HDFS clusters via GWFSServer
          – The client is instantiated automatically based on configuration parameters set up to access the gateway server instead of HDFS
          – GatewayFileSystem passes the client request to GWFSServer
          – The gateway server instantiates a traditional HDFS client (DistributedFileSystem) pointing to the requested cluster
          – It executes the request on the cluster and returns the result to the gateway client
      • Unlike Job Submission, the virtualized File System Access is always cluster aware
          – A user accessing a file must explicitly specify which HDFS cluster the file belongs to
  • 14. GWMR: Implementation
      • GWMRServer is a subclass of
          – mapred.JobSubmissionProtocol (H-0.20)
          – mapreduce.ClientProtocol (> H-0.20)
      • GWMRServer can be accessed via regular Hadoop command-line-interface and Java interface
      • MR clients communicate (submit jobs and obtain job information) directly with GWMRServer as if they talk to a real JobTracker via hadoop.RPC
      • GWMRServer redirects the job to one of the clusters, based on
          – Data location
          – Cluster workload
          – User group information
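The three routing criteria listed on this slide can be combined in many ways; the slides do not say how GWMRServer weighs them. The following Python sketch is only an illustration under assumed rules: user group acts as a filter, data location as a preference, and workload as the tie-breaker. The `Cluster` class, its fields, and `select_cluster` are hypothetical names, not part of Gateway or Hadoop.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical description of one MR cluster known to the gateway."""
    name: str
    datasets: set = field(default_factory=set)        # input paths stored on this cluster
    allowed_groups: set = field(default_factory=set)  # user groups allowed to submit here
    running_jobs: int = 0
    capacity: int = 1

    @property
    def load(self) -> float:
        # Fraction of job capacity in use; lower means less busy.
        return self.running_jobs / self.capacity


def select_cluster(clusters, input_path, user_group):
    """Assumed policy: filter by group, prefer data locality, then pick the lowest load."""
    eligible = [c for c in clusters if user_group in c.allowed_groups]
    if not eligible:
        raise PermissionError(f"no cluster serves group {user_group!r}")
    local = [c for c in eligible if input_path in c.datasets]
    candidates = local or eligible  # fall back to any eligible cluster
    return min(candidates, key=lambda c: c.load)
```

With this ordering a busy cluster that already holds the input data still wins over an idle one that does not; a real policy might trade the two off differently.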
  • 15. GWMR: Implementation Continued
      • GWMRServer is stateless (or keeps a very lightweight state)
          – allows setting up pools of Gateway servers in order to avoid a single point of failure
      • On startup GWMRServer reads configuration from “gateway-site.xml”, which determines the Hadoop MR clusters it must serve
      • GWMRServer has a web UI, similar to the JobTracker UI, which aggregates data from available JobTrackers
      • GWMRServer supports job sequencing, so that chained MR jobs initiated by a single Pig or Hive job are scheduled to the same cluster
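The slides do not describe how job sequencing is reconciled with a stateless server pool. One possible approach, shown here purely as an assumption, is to derive the target cluster deterministically from an identifier shared by all jobs of a chain, so that every server in the pool reaches the same answer without shared state. `chain_id` and `cluster_for_chain` are hypothetical names.

```python
import hashlib


def cluster_for_chain(chain_id, clusters):
    """Map a chain identifier to one cluster name, identically on every server.

    Assumption: all MR jobs spawned by one Pig or Hive job carry the same chain_id.
    """
    ordered = sorted(clusters)  # make the result independent of configuration order
    digest = hashlib.sha256(chain_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(ordered)
    return ordered[index]
```

This sketch ignores data location and workload, and the mapping changes whenever the cluster list changes, so it only illustrates the stateless idea rather than a complete scheduling policy.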
  • 16. GatewayFileSystem: Implementation
      • GatewayFileSystem is a subclass of the FileSystem abstract class, similar to LocalFileSystem, HFTPFileSystem, S3FileSystem
          – gwfs:// - GatewayFileSystem
          – file:// - LocalFileSystem
          – hdfs:// - DistributedFileSystem
          – har:// - HarFileSystem
          – hftp:// - HFTPFileSystem
          – s3:// - S3FileSystem
          – kfs:// - KFSFileSystem
      • GatewayFileSystem is instantiated based on the URI scheme listed in the fs.default.name (fs.defaultFS) field of core-site.xml
          – fs.default.name = gwfs://<GWFSServer-address>
          – fs.gwfs.impl = org.apache.hadoop.gateway.fs.GatewayFileSystem
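The scheme-based lookup described on this slide can be sketched outside Hadoop. The Python below is a simplified model, not Hadoop code: it takes the scheme from a path (or from fs.default.name when the path has none) and reads the matching fs.<scheme>.impl property. The gateway host and port are placeholders, and `resolve_filesystem_class` is a hypothetical helper.

```python
from urllib.parse import urlparse

# Simplified stand-in for the relevant core-site.xml properties.
conf = {
    "fs.default.name": "gwfs://gateway.example.com:9000",  # placeholder address
    "fs.gwfs.impl": "org.apache.hadoop.gateway.fs.GatewayFileSystem",
    "fs.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem",
}


def resolve_filesystem_class(conf, path=None):
    """Return the class name configured for the scheme of `path`."""
    scheme = urlparse(path).scheme if path else ""
    if not scheme:
        # Paths without a scheme fall back to the default file system.
        scheme = urlparse(conf["fs.default.name"]).scheme
    key = f"fs.{scheme}.impl"
    if key not in conf:
        raise KeyError(f"no implementation configured for scheme {scheme!r}")
    return conf[key]
```

In this model an unqualified path such as /user/alice/in.mydata resolves to GatewayFileSystem, which matches the slide's point that existing applications pick up the gateway purely through configuration.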
  • 17. GWFSServer: Implementation
      • GatewayFileSystem passes client requests to GWFSServer using
          – a new RPC protocol – GWFSProtocol, and
          – a new binary data transfer protocol – DataTProtocol
      • The Gateway server processes GWFSProtocol requests
          – It instantiates a real DistributedFileSystem pointing to the required cluster,
          – executes the request and returns results to the gateway client
      • DataTProtocol transfers data between the Gateway clients and the server
          – The data transfer is a direct pipeline between a gateway client and HDFS
          – GWFSServer reads data from HDFS and pipelines it to the gateway client via DataTProtocol, and vice versa for write
      • GWFSServer is stateless. This will allow setting up pools of servers in order to avoid a single point of failure and to provide load balancing
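The pipelining idea on this slide, relaying data as it arrives rather than staging whole files on the gateway, can be illustrated with generic byte streams. This sketch does not reproduce DataTProtocol, whose wire format is not given in the slides; `pipeline` and the chunk size are assumptions.

```python
import io


def pipeline(source, sink, chunk_size=64 * 1024):
    """Relay bytes from source to sink in fixed-size chunks; return bytes copied.

    Only one chunk is held in memory at a time, so the relay never stores
    the complete file.
    """
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # empty read signals end of stream
            break
        sink.write(chunk)
        total += len(chunk)
    return total
```

For a read, `source` would stand for the stream opened on HDFS and `sink` for the connection to the gateway client; for a write the roles are swapped.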
  • 18. Versioning
      • GWMRServer can serve JobClients of a specific version only. Incompatible versions of Hadoop require different implementations of GWMRServer
          – The service will run multiple versions of GWMRServer so that client requests can be redirected to a server serving the compatible version
      • The same instance of GWMRServer can submit jobs to and query map-reduce clusters running different versions of Hadoop
          – GWMRServer discovers the Hadoop version of a particular cluster, and uses the respective Hadoop jars to instantiate an appropriate version of the JobClient
      • GatewayFileSystem-to-GWFSServer communication is independent of HDFS
          – No need to implement a new GWFSServer for every new Hadoop release.
      • The same instance of GWFSServer can access clusters running different versions of HDFS
          – GWFSServer discovers the HDFS cluster version and uses the respective jars to instantiate an appropriate version of the DistributedFileSystem
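The "discover the version, then use the respective jars" step can be pictured as a lookup from a reported cluster version to a directory of client jars. The Python sketch below assumes versions are dotted strings and that jar sets are keyed by release line; the directory paths and `client_jars_for` are hypothetical, and the actual class loading is out of scope.

```python
def client_jars_for(cluster_version, jar_dirs):
    """Return the jar directory whose release line matches the cluster version.

    A key matches if it equals the version or is a dotted prefix of it,
    so "0.20" matches "0.20.205" but not "0.205". The longest match wins.
    """
    matches = [
        line for line in jar_dirs
        if cluster_version == line or cluster_version.startswith(line + ".")
    ]
    if not matches:
        raise LookupError(f"no client jars for Hadoop {cluster_version}")
    return jar_dirs[max(matches, key=len)]


# Hypothetical layout of per-release client libraries on a gateway server.
jar_dirs = {
    "0.20": "/opt/gateway/lib/hadoop-0.20",
    "0.22": "/opt/gateway/lib/hadoop-0.22",
}
```

Failing loudly on an unknown version is a deliberate choice in this sketch: silently falling back to another release's client could produce RPC incompatibilities that are harder to diagnose.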
  • 19. Project Status
      • Support for Hadoop 0.20.xxx
          – Plan for 0.22
      • Authorization & Authentication
      • Job Chaining
      • Packaging
          – It is convenient for users to have the entire Hadoop client suite installed, configured, and packaged as a VM
      • Plan to open-source soon
      • Developers wanted
  • 20. Thank You!