
Introduction to the Hadoop EcoSystem

The following slides were presented at Couchbase Connect 2015.



  1. Hortonworks – The Hadoop Ecosystem
     Fall 2014
     Powering the Modern Data Architecture
     Shivaji Dutta – Sr. Partner Solutions Engineer
     © Hortonworks Inc. 2011 – 2014. All Rights Reserved
  2. Agenda
     • Apache Hadoop and Hortonworks Data Platform (HDP)
     • HDP and Couchbase
     • What’s new in HDP?
  3. What is Hadoop
     Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.
  4. Projects in Hadoop
     Hadoop Core
     – Hadoop Common
     – Hadoop Distributed File System
     – Hadoop YARN
     – Hadoop MapReduce
     Other Key Hadoop Projects
     • Hive
     • HBase
     • Spark
     • Pig
     • Tez
     • ZooKeeper
  5. HDP delivers a comprehensive data management platform
     • YARN is the architectural center of HDP
     • Enables batch, interactive and real-time workloads
     • Provides comprehensive enterprise capabilities
     • The widest range of deployment options
     • Delivered completely in the open
     [Diagram: Hortonworks Data Platform 2.2]
     • Data management: YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System)
     • Batch, interactive & real-time data access: Script (Pig), SQL (Hive), Java/Scala (Cascading), NoSQL (HBase, Accumulo), Stream (Storm), Search (Solr), In-Memory (Spark), Others (ISV engines); Tez and Slider are shown beneath the engines
     • Governance: Data Workflow, Lifecycle & Governance (Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS)
     • Operations: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)
     • Security: Authentication, Authorization, Accounting, Data Protection; Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox; Cluster: Ranger
     • Deployment choice: Linux, Windows, On-Premises, Cloud
  6. HDFS and YARN – The Core of Hadoop
     The core components of HDP are YARN and the Hadoop Distributed File System (HDFS). YARN is the architectural center of Hadoop that enables you to process data simultaneously in multiple ways. YARN provides the resource management and pluggable architecture for enabling a wide variety of data access methods. HDFS provides scalable, fault-tolerant, cost-efficient storage for big data.
  7. YARN extends Hadoop into data center leaders
     YARN: The Architectural Center of Hadoop
     • Common data platform, many applications
     • Supports multi-tenant access & processing
     • Batch, interactive & real-time use cases
     • Supports 3rd-party ISV tools (e.g. SAS, Syncsort, Actian)
     YARN Ready Applications: facilitates ongoing innovation and enterprise adoption via an ecosystem of new and existing “YARN Ready” solutions.
     [Diagram: data access engines (Pig, Hive, Cascading, HBase, Accumulo, Storm, Solr, Spark, other ISV engines) on YARN: Data Operating System (Cluster Resource Management), over HDFS]
  8. Data Access
     YARN provides the foundation for a versatile range of processing engines that empower you to interact with the same data in multiple ways, at the same time. This means applications can interact with the data in the best way: from batch to interactive SQL or low-latency access with NoSQL. Emerging use cases for data science, search and streaming are also supported with Apache Spark, Solr and Storm. Additionally, ecosystem partners provide even more specialized data access engines for YARN.
     [Diagram: data access engines (Pig, Hive, Cascading, HBase, Accumulo, Storm, Solr, Spark, other ISV engines) on YARN: Data Operating System (Cluster Resource Management), over HDFS]
  9. Data Governance and Integration
     • HDP extends data access and management with powerful tools for data governance and integration.
     • They provide a reliable, repeatable, and simple framework for managing the flow of data in and out of Hadoop. This control structure, along with a set of tooling to ease and automate the application of schema or metadata on sources, is critical for successful integration of Hadoop into your modern data architecture.
     • Apache Sqoop
     • Apache Oozie
     • Apache Falcon
     • Apache Flume
  10. Security
      • Authentication / Authorization and Encryption
      • Kerberos
      • SSL & SASL
      • Apache Knox
      • Apache Ranger
      • HDFS File/Directory Encryption
  11. Operations – Apache Ambari
      • Provision, manage and monitor Hadoop clusters
      • A complete set of operational capabilities that provide both visibility into the health of your cluster and tooling to manage configuration and optimize performance across all data access methods.
      • Apache Ambari provides APIs to integrate with existing management systems, for instance Microsoft System Center and Teradata ViewPoint
  12. Enterprise Hadoop: Central Set of Services
      Enables Apache Hadoop to be an Enterprise Data Platform with centralized services for:
      • Governance: load data and manage according to policy
      • Operations: deploy and effectively manage the platform (Provision, Manage & Monitor: Ambari, Zookeeper; Scheduling: Oozie)
      • Security: provide a layered approach to security through Authentication, Authorization, Accounting, and Data Protection
      Everything that plugs into Hadoop inherits these services.
      [Diagram: data access engines (Pig, Hive, Cascading, Storm, Solr, HBase, Accumulo, Spark, other ISV engines) with Tez and Slider, on YARN: Data Operating System (Cluster Resource Management), over HDFS]
  13. HDP IS Apache Hadoop
      There is ONE Enterprise Hadoop: everything else is a vendor derivation.
      Releases: HDP 2.0 (October 2013), HDP 2.1 (April 2014), HDP 2.2 (October 2014).
      Categories: Data Management, Data Access, Governance & Integration, Security, Operations.
      Components: Hadoop & YARN, Pig, Hive & HCatalog, HBase, Sqoop, Oozie, Zookeeper, Ambari, Storm, Flume, Knox, Phoenix, Accumulo, Falcon, Ranger, Spark, Kafka, Tez, Slider, Solr.
      [Version table: per-component versions for each release, values in the order they appear on the slide: 2.2.0 0.12.0 0.12.0 2.4.0 0.12.1 0.13.0 0.96.1 0.98.0 0.9.1 1.4.4 1.3.1 1.4.0 1.4.4 1.5.1 3.3.2 4.0.0 3.4.5 0.4.0 4.0.0 1.5.1 Falcon 0.5.0 0.14.0 0.14.0 0.98.4 1.6.1 4.2 0.9.3 1.2.0 0.6.0 0.8.1 1.4.5 1.5.0 1.7.0 4.1.0 0.5.0 0.4.0 2.6.0 3.4.5 Tez 0.4.0 Slider 0.60 Solr 4.7.2 4.10.0 0.5.1]
      * Version numbers are targets and subject to change at time of general availability in accordance with ASF release process.
  14. The Partner Ecosystem
      • Deep Partnerships: Hortonworks engages in deep engineered relationships with the leaders in the data center, such as Microsoft, Teradata, Red Hat, HP, SAS & SAP.
      • Broad Partnerships: Over 900 partners work with us to certify their applications to work with Hadoop so they can extend big data to their users.
      [Diagram labels: Operational Tools; Dev & Data Tools; Infrastructure; Sources (existing systems, clickstream, web & social, geolocation, sensor & machine, server logs, unstructured); Data Systems (RDBMS, EDW, MPP, HANA); Applications (BusinessObjects BI); HDP 2.1 (Governance & Integration, Security, Operations, Data Access, Data Management, YARN)]
  15. Couchbase and HDP
      • Couchbase is primarily an online operational NoSQL datastore: low latency and scalable
      • It can act as a source of data and also as a sink
      • Example source: pulling user profiles into Hadoop for deep analytics
      • Example sink: training machine learning models that are then cached / served from Couchbase
  16. Couchbase and HDP
      • HDP Certified Sqoop connector for batch mode export / import
      • Couchbase Kafka connector enables both Producer and Consumer scenarios
      • Community supported Storm spout to persist data by writing to Couchbase Server
      • Developer Preview Spark Connector
      New!
  17. What’s New in HDP 2.2
      New and Improved YARN Ready Engines
      • Enterprise SQL at Hadoop Scale with Stinger.next
      • Enterprise Ready Spark on YARN
      • Deep YARN integration for real-time engines: HBase, Accumulo, Storm
      • Enabling ISVs with a general SDK and API for direct YARN integration
      • Only solution to provide real-time to micro batch for analyzing the internet of things
      • Other engines/tools: Solr, Cascading
      Continued Innovation of Central Enterprise Services
      • Centralized security administration and policy enforcement
      • Ease of use and operations agility features to speed cluster deployment
      • 100% uptime target with cluster rolling upgrades
      Expanded Deployment Options
      • Enhanced business continuity with replication/archival across on-premises and cloud storage tiers (Azure Blob, S3)
      • Simultaneous ship of Windows and Linux installs
      • Expand Azure support beyond HDInsight Azure to include HDP for Windows or Linux in Azure VMs
      HDP 2.2: Delivering Apache Hadoop for the Enterprise
  18. Stinger.next: Enterprise SQL at Hadoop Scale
      A continuation of the momentum built in the Apache Hive community to deliver Enterprise SQL at Hadoop scale.
      HDP Stinger/Hive Goals:
      • Speed: deliver sub-second query response times
      • Scale: the only SQL interface to Hadoop designed for queries that scale from gigabytes to terabytes and petabytes
      • SQL: enable transactions and SQL:2011 Analytics
      Familiar three-phase delivery. Stinger delivered 390,000 lines of code to Apache Hive in 13 months from 44 companies and 145 developers.
      HDP 2.2 – Beyond Read Only
      • Transactions with ACID, allowing insert, update & delete
      • Temporary tables
      • Cost Based Optimizer for star & bushy join queries
      Phase 2 – Sub Second
      • Sub-second queries with LLAP
      • Hive-Spark Machine Learning integration
      • Operational reporting with Hive streaming ingest & transactions
      Phase 3 – Rich Analytics
      • SQL:2011 Analytics
      • Materialized views
      • Cross-geo queries
      • Workload management via YARN and LLAP integration
  19. Apache Spark
      • Apache Spark is an open source project for fast and large-scale data processing.
        – Simple and expressive programming model
        – Machine learning, graph computation and streaming
        – In-memory compute for iterative workloads
      • It does most of the processing in memory
      • It supports the programming languages Java, Scala and Python
      • It provides high-level modules:
        – MLlib
        – GraphX
        – Spark Streaming
        – Spark SQL
      • Cluster managers:
        – YARN (recommended)
        – Mesos
        – Spark’s own (standalone)
  20. Spark Stack
      [Diagram only]
  21. Enterprise Ready Spark for HDP 2.2 & beyond
      Deliver a reliable, managed, enterprise-grade Apache Spark that will run alongside other workloads in Hadoop via YARN.
      HDP Spark Goals:
      • Integrated: enterprise-grade workload management & optimized multi-tenancy on YARN
      • Secure: extend comprehensive Hadoop security policy to Spark
      • Managed: provision, manage and monitor Spark along with other engines in Hadoop
      HDP 2.2 – Spark on YARN
      • Integrated: Hive 0.13 support
      • Integrated: Basic ORCfile support
      Phase 2 – Spark for HDP 2.2
      • Managed: Deployment best practices with YARN Node Labels
      • Managed: Ambari Stack Definition: Install/Start/Stop/Config/Quick links to Spark UI
      • Security: Spark certification on Kerberized Cluster
      • Security: Authentication in Spark UI against LDAP
      Phase 3 – Beyond
      • Managed: Enhanced workload management & improved debuggability
      • Managed: Spark logs published to YARN Application Timeline
      • Security: Wire Encryption and Authorization with XA/Argus
      • Enhanced ORC support
  22. Bringing more applications and services to YARN and making ISV adoption easier
      • Complete work for Pig with Tez
      • Cascading with Tez for Java and Scala apps
      • Integration of Spark on YARN
      • Kafka for inbound messaging to Storm & Spark – widest range from real-time to micro batch for internet of things
      Slider introduces native YARN integration for applications with long-running services
      • HBase, Accumulo, Storm
      • SDK for 3rd-party ISVs
      HDP 2.2 Delivers more YARN Ready Engines
      [Diagram: engines (Pig, Hive, Cascading, Storm, Solr, HBase, Accumulo, Spark, Kafka, other engines, other ISV engines) with Tez and Slider, on YARN: Data Operating System (Cluster Resource Management), over HDFS. A marker indicates “new to HDP” in 2.2. All engines have been updated.]
  23. Security in HDP 2.2
      Continued investment in central security policy for authentication, authorization, audit, and data protection.
      HDP Security Goals:
      • Comprehensive Security: meet all security requirements across authentication, authorization, audit & data protection for all HDP components.
      • Central Administration: provide central administration of security policy and for viewing and managing audit across the platform.
      • Consistent Integration: integrate with other security and identity management systems, for compliance with IT policies.
      HDP 2.2 New Features
      • Extend Authorization with Apache Ranger
        – Breadth: Knox and Storm integrations
        – Policy enforcement at depth: Hive, HDFS and HBase integrations
        – Documentation to support community development and partner ecosystem
      • Apache Hadoop Advances
        – TP: HDFS Transparent Encryption – HDFS-6134
        – Key Management Server – HADOOP-10433
        – Key Provider API – HADOOP-10141
  24. Streamlining Operations in HDP 2.2
      Apache Ambari is advancing at light speed to enable the IT operator to more easily manage clusters.
      HDP Operations Goals:
      • Open: deliver a complete set of features for Hadoop operations, in public and with the community.
      • Integrated: ensure Hadoop operations integrate with existing IT tools, behind a single pane of glass.
      • Intuitive: make Hadoop’s most complex operational challenges easy to manage.
      Apache Ambari 1.7.0 delivers
      • Views: a common, secure, and extensible approach for the user interface for Operators, System Administrators, Application Developers, Data Workers and ISVs
      • Blueprints: create and manage cluster templates for easy deployment
      Ambari 2.0.0 delivers
      • Ambari on Windows
      • Native metrics and alerts
      • Rolling upgrade automation
  25. Rolling Upgrades
      Allow continuous operation and up-time for applications and services on the cluster while upgrading
      • Single most critical feature for streamlining operations
      • HDFS provides the ability to do this today; the remaining components need to follow
      • Leverages native operating system tools and scripting
      • Allows jobs in-flight to complete
      • Provides support for rapid rollback
  26. Vision: Maximize Hadoop Deployment Choice
      Deployment Choice
      • Linux, Windows
      • On-Premises, Cloud, Hybrid
      “Tethered” Clusters
      • Compatible services
      • An explicit “connection”
      Synchronized Datasets
      • Efficient sharing & access
      • Governance & lineage
      [Diagram labels: Development & POC Cluster; Production Cluster; BI or ML Cluster; Backup & Archive Cluster; Learn]
  27. Cloudbreak with HDP
      Cloudbreak:
      1. Pick a Blueprint
      2. Choose a Cloud
      3. Launch HDP!
      Example Ambari Blueprints: IoT Apps (Storm, HBase, Hive), BI / Analytics (Hive), Data Science (Spark), Dev / Test (all HDP services)
  28. Periscope with HDP
      Periscope applies an autoscaling policy:
      • Policies based on any Ambari metrics
      • Coordinates with YARN to achieve elasticity based on the policies
      [Diagram labels: BI / Analytics (Hive); IoT Apps (Storm, HBase, Hive); Data Science (Spark); Dev / Test (all HDP services)]
  29. Thank You
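
Slides 3 and 4 describe Hadoop as distributed processing and list Hadoop MapReduce as a core project. As a rough illustration of that programming model only, the sketch below runs the map, shuffle and reduce steps of a word count in a single Python process. The function names are made up for this example and none of this is Hadoop API; on a real cluster each phase would be spread across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine the values collected for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["Hadoop stores data", "Hadoop processes data"])
```

The map and reduce functions are independent per line and per key, which is what lets a framework run them in parallel on separate nodes.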
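
Slide 6 calls HDFS scalable and fault-tolerant storage. One way to picture why is the sketch below, which splits a file into fixed-size blocks and keeps several copies of each block on different nodes. It assumes a 128 MB block size and a replication factor of 3 (common HDFS defaults, both configurable) and uses a simple round-robin placement chosen for this example; real HDFS placement is rack-aware and more involved. The node names are hypothetical.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes
REPLICATION = 3                 # assumed number of copies per block


def plan_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return, for each block of the file, the list of nodes holding a replica."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = math.ceil(file_size / block_size)
    plan = []
    for block in range(n_blocks):
        # Round-robin placement (illustrative only): consecutive nodes, wrapping around.
        replicas = [nodes[(block + r) % len(nodes)] for r in range(replication)]
        plan.append(replicas)
    return plan


def readable_after_failure(plan, failed_node):
    """A file stays readable if every block has a replica on a surviving node."""
    return all(any(node != failed_node for node in replicas) for replicas in plan)


nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical data node names
plan = plan_blocks(300 * 1024 * 1024, nodes)
```

With three replicas per block, losing any single node still leaves two copies of every block, which is the fault tolerance the slide refers to.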
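
Slide 25 describes rolling upgrades: nodes are upgraded while the cluster keeps running, with support for rapid rollback. The following is a minimal simulation of that control flow, not the HDP or Ambari mechanism. The `Node` class and its `healthy_after_upgrade` flag are hypothetical stand-ins for a real host and a real health check.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    version: str
    healthy_after_upgrade: bool = True  # simulation knob, not a real health check


def rolling_upgrade(nodes, new_version):
    """Upgrade nodes one at a time; roll everything back if a node fails.

    Returns True if every node ends up on new_version, False after a rollback.
    """
    upgraded = []  # (node, previous_version) pairs, in upgrade order
    for node in nodes:
        previous = node.version
        # A real system would first let in-flight jobs on this node complete.
        node.version = new_version
        upgraded.append((node, previous))
        if not node.healthy_after_upgrade:
            # Rapid rollback: restore earlier versions in reverse order.
            for done, old_version in reversed(upgraded):
                done.version = old_version
            return False
    return True


cluster = [Node("n1", "2.1"), Node("n2", "2.1"), Node("n3", "2.1")]
ok = rolling_upgrade(cluster, "2.2")

flaky = [Node("n1", "2.1"), Node("n2", "2.1", healthy_after_upgrade=False)]
rolled_back = not rolling_upgrade(flaky, "2.2")
```

Because only one node changes at a time, the remaining nodes can keep serving requests throughout the upgrade.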
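
Slides 24 and 27 mention Ambari Blueprints as cluster templates. The sketch below builds a small blueprint-shaped document in Python, assuming the general layout of an Ambari blueprint (a `Blueprints` stack descriptor plus a list of `host_groups`, each naming its components). The helper `make_blueprint` is made up for this example, and the component selection is illustrative rather than a complete, deployable HDP layout.

```python
import json


def make_blueprint(stack_version, host_groups):
    """Build a blueprint-shaped dict.

    host_groups maps a group name to (cardinality, [component names]).
    """
    return {
        "Blueprints": {"stack_name": "HDP", "stack_version": stack_version},
        "host_groups": [
            {
                "name": name,
                "cardinality": str(cardinality),
                "components": [{"name": component} for component in components],
            }
            for name, (cardinality, components) in host_groups.items()
        ],
    }


blueprint = make_blueprint(
    "2.2",
    {
        "master": (1, ["NAMENODE", "RESOURCEMANAGER", "ZOOKEEPER_SERVER"]),
        "worker": (3, ["DATANODE", "NODEMANAGER"]),
    },
)
blueprint_json = json.dumps(blueprint, indent=2)
```

Describing the cluster as data is what makes the "pick a blueprint, choose a cloud, launch" flow on slide 27 repeatable.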
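
Slide 28 says Periscope scales clusters using policies based on Ambari metrics. A threshold policy of that kind might look like the sketch below. This is not Periscope's actual policy model: the `ScalingPolicy` fields and the metric name `pending_containers` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    metric: str              # name of the metric to watch (hypothetical)
    scale_up_above: float    # add nodes when the metric exceeds this value
    scale_down_below: float  # remove nodes when the metric drops below this value
    step: int                # how many nodes to add or remove at once
    min_nodes: int
    max_nodes: int


def desired_nodes(policy, metrics, current_nodes):
    """Return the node count the policy asks for, clamped to its bounds."""
    value = metrics[policy.metric]
    if value > policy.scale_up_above:
        target = current_nodes + policy.step
    elif value < policy.scale_down_below:
        target = current_nodes - policy.step
    else:
        target = current_nodes
    return max(policy.min_nodes, min(policy.max_nodes, target))


policy = ScalingPolicy("pending_containers", 100, 10, step=2, min_nodes=3, max_nodes=10)
busy = desired_nodes(policy, {"pending_containers": 150}, current_nodes=4)
idle = desired_nodes(policy, {"pending_containers": 5}, current_nodes=4)
steady = desired_nodes(policy, {"pending_containers": 50}, current_nodes=4)
```

The gap between the two thresholds keeps the cluster from oscillating when the metric hovers near a single cutoff.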
