HDP-1
Steve Loughran – Hortonworks
stevel at hortonworks.com
@steveloughran

Paris, June 2012




© Hortonworks Inc. 2012
Hortonworks Data Platform

• Operate
   – Management & Monitoring (Ambari, Zookeeper)
   – Workflow & Scheduling (Oozie)
• Develop / Interact
   – Scripting (Pig)
   – Query (Hive)
   – Non-Relational Database (HBase)
   – Metadata Management (HCatalog)
• Integrate
   – Data Extraction & Loading (HCatalog APIs, WebHDFS, Talend Open Studio for Big Data, Sqoop)
• Distributed Processing (MapReduce)
• Distributed Storage (HDFS)
Hortonworks Data Platform (HDP)
  Fully Integrated, Extensively Tested, Enterprise Supported

Challenge:
• Integrate, manage, and support changes across a wide range of open
  source projects that power the Hadoop platform, each with its own
  release schedule, versions, & dependencies
• Time-intensive, complex, expensive

Solution: Hortonworks Data Platform
• Integrated, certified platform distributions
• Extensive Q/A process: many apps across small, medium, & large clusters
• Industry-leading support with clear service levels for updates and patches

[Figure: component versions for Hadoop Core, Pig, HCatalog, Hive, Ambari, Zookeeper; legend: = New Version]
HDP 1.0 Components
  Component                                 Version
  Apache Hadoop (HDFS & MapReduce)          1.0.3+
  Apache HCatalog                           0.4.0+
  Apache Pig                                0.9.2
  Apache Hive                               0.9.0+
  Apache HBase                              0.92.1+
  Talend Open Studio for Big Data           5.1.0
  Apache Sqoop                              1.4.1+
  Apache Oozie                              3.1.3+
  Apache Zookeeper                          3.3.4
  Apache Ambari                             0.1 (Technology Preview)
Management & Monitoring: Ambari
• 100% Open Source

• Wizard-based install, provisioning & configuration
  management

• Monitoring and alerting dashboards

• Goals: ease of installation, scale to large clusters,
  effective monitoring of all services




Cluster Provisioning through Web UI




Download and try from http://hortonworks.com

Monitoring and alerting dashboards




Installation and Provisioning
HMC Installer – GUI, Puppet-driven
  – Installs everything from Java up
  – Configures entire cluster
  – Sets up HMC for cluster monitoring
  – Web UI + text files listing nodes
gsInstall
  – Command-line installer, file-driven
RPM/YUM for custom installation processes
  – Configuration left as an exercise
  – Use if you have other cluster management tooling


Qualified at scale on RHEL 5.8 & Java 6u26

Enterprise Data Integration -> Talend
• Talend Open Studio for Big Data
   – Feature-rich Job Designer
   – Rich palette of pre-built templates
   – Supports HDFS, Pig, Hive, HBase, HCatalog
   – Apache-licensed, bundled with HDP

• Key benefits
   – Graphical development
   – Robust and scalable execution
   – Broadest connectivity to support all systems: 450+ components
   – Real-time debugging




Metadata Management -> HCatalog
• Simplifies data sharing between Hadoop and other data systems
   – Enables Hadoop data to be described in a schema & accessed as tables
• Provides consistent data access for MapReduce, Hive and Pig
   – Minimizes hard coding of data structure, storage format, and location
• Manages metadata for table storage
   – Based on Hive’s metadata server
   – Uses Hive language for metadata manipulation operations
• Tables may be stored in RCFile, Text files, or SequenceFiles




RESTful API Front-door for Hadoop
• Opens the door to languages other than Java
• Thin clients via web services vs. fat-clients in gateway
• Insulation from interface changes release to release


[Diagram, top to bottom: HCatalog web interfaces, with WebHDFS alongside; MapReduce, Pig, Hive; HCatalog; HDFS, HBase, External Store]
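
As a rough illustration of a thin, non-Java client, the sketch below parses the JSON that a WebHDFS directory listing (op=LISTSTATUS) returns, using only the Python standard library. The sample reply is hypothetical data written for this example, and the helper name list_files is an assumption, not part of any Hadoop API.

```python
import json

# Hypothetical sample of a LISTSTATUS reply; the values are made up for
# illustration and trimmed to the fields this sketch reads.
SAMPLE_REPLY = """
{"FileStatuses": {"FileStatus": [
  {"pathSuffix": "part-r-00000.csv", "type": "FILE", "length": 1024},
  {"pathSuffix": "_logs", "type": "DIRECTORY", "length": 0}
]}}
"""


def list_files(reply_text):
    """Return (name, length) pairs for the plain files in a LISTSTATUS reply."""
    statuses = json.loads(reply_text)["FileStatuses"]["FileStatus"]
    return [(s["pathSuffix"], s["length"]) for s in statuses if s["type"] == "FILE"]


if __name__ == "__main__":
    for name, length in list_files(SAMPLE_REPLY):
        print(name, length)
```

The client needs no Hadoop JARs or native libraries, only HTTP and JSON, which is the point of the thin-client bullet above.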


WebHDFS: HDFS over HTTP
~:$ GET http://nnode:50070/webhdfs/v1/results/part-r-00000.csv?op=open

GATE4,eb8bd736445f415e18886ba037f84829,55000,2007-01-14,14:01:54,
GATE4,ec58edcce1049fa665446dc1fa690638,8030803000,2007-01-14,13:52:31,
GATE4,b6f07ce00f09035a6683c5e93e3c04b8,30000,2007-01-28,12:41:11,
GATE4,a1bc345b756090854e9dd0011087c6c0,30000,2007-01-28,12:59:33,
...



 Potential Uses:
   Out of cluster access to HDFS
   Cross-cluster, cross version HDFS access
   Native filesystem clients


                         dfs.webhdfs.enabled=true
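
The same request can be sketched from an out-of-cluster Python client. This is a minimal sketch, assuming the namenode host nnode and port 50070 from the command above and a cluster with dfs.webhdfs.enabled=true; the function names webhdfs_url and read_file are invented for this example.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def webhdfs_url(host, path, op, port=50070, **params):
    """Build a WebHDFS v1 URL for an absolute HDFS path (hypothetical helper)."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


def read_file(host, path, **kwargs):
    """Fetch a file's bytes; urlopen follows the namenode's redirect to a datanode."""
    with urlopen(webhdfs_url(host, path, "OPEN", **kwargs)) as response:
        return response.read()


if __name__ == "__main__":
    # Only builds and prints the URL; read_file() needs a reachable cluster.
    print(webhdfs_url("nnode", "/results/part-r-00000.csv", "OPEN"))
```

Keeping URL construction separate from the network call lets the request be checked without a live cluster.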
The WebHDFS & service APIs
isolate Hadoop internals behind
stable public interfaces


Long-haul, cross-language, stable, secure




My project: HA on vSphere




Release Schedule
HDP 1.x : quarterly releases
  – Large-scale QA process
  – Validate performance as well as functionality



Technology Preview Program
  – Early access; help with testing
  – Access to new features such as HA and Windows integration



            Predictable timetable of stable releases
Ready and free to use today:

http://hortonworks.com/download/




Thank You!
Any questions?





More Related Content

What's hot

Hortonworks.Cluster Config Guide
Hortonworks.Cluster Config GuideHortonworks.Cluster Config Guide
Hortonworks.Cluster Config GuideDouglas Bernardini
 
Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5Chris Nauroth
 
Evolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage SubsystemEvolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage SubsystemDataWorks Summit/Hadoop Summit
 
Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014alanfgates
 
Taming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop ManagementTaming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop ManagementDataWorks Summit/Hadoop Summit
 
How the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside DownHow the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside DownDataWorks Summit
 
The state of Hive and Spark in the Cloud (July 2017)
The state of Hive and Spark in the Cloud (July 2017)The state of Hive and Spark in the Cloud (July 2017)
The state of Hive and Spark in the Cloud (July 2017)Nicolas Poggi
 
TriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparkTriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparktrihug
 
Batch is Back: Critical for Agile Application Adoption
Batch is Back: Critical for Agile Application AdoptionBatch is Back: Critical for Agile Application Adoption
Batch is Back: Critical for Agile Application AdoptionDataWorks Summit/Hadoop Summit
 
2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive Tuning2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive TuningAdam Muise
 
Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...DataWorks Summit
 
Disaster Recovery and Cloud Migration for your Apache Hive Warehouse
Disaster Recovery and Cloud Migration for your Apache Hive WarehouseDisaster Recovery and Cloud Migration for your Apache Hive Warehouse
Disaster Recovery and Cloud Migration for your Apache Hive WarehouseDataWorks Summit
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep divet3rmin4t0r
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High PerformanceInderaj (Raj) Bains
 

What's hot (20)

Hortonworks.Cluster Config Guide
Hortonworks.Cluster Config GuideHortonworks.Cluster Config Guide
Hortonworks.Cluster Config Guide
 
Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5
 
Evolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage SubsystemEvolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage Subsystem
 
Spark vstez
Spark vstezSpark vstez
Spark vstez
 
Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014
 
Taming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop ManagementTaming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop Management
 
February 2014 HUG : Hive On Tez
February 2014 HUG : Hive On TezFebruary 2014 HUG : Hive On Tez
February 2014 HUG : Hive On Tez
 
Yahoo's Experience Running Pig on Tez at Scale
Yahoo's Experience Running Pig on Tez at ScaleYahoo's Experience Running Pig on Tez at Scale
Yahoo's Experience Running Pig on Tez at Scale
 
How the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside DownHow the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside Down
 
February 2014 HUG : Pig On Tez
February 2014 HUG : Pig On TezFebruary 2014 HUG : Pig On Tez
February 2014 HUG : Pig On Tez
 
The state of Hive and Spark in the Cloud (July 2017)
The state of Hive and Spark in the Cloud (July 2017)The state of Hive and Spark in the Cloud (July 2017)
The state of Hive and Spark in the Cloud (July 2017)
 
TriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparkTriHUG Feb: Hive on spark
TriHUG Feb: Hive on spark
 
Polyalgebra
PolyalgebraPolyalgebra
Polyalgebra
 
October 2014 HUG : Hive On Spark
October 2014 HUG : Hive On SparkOctober 2014 HUG : Hive On Spark
October 2014 HUG : Hive On Spark
 
Batch is Back: Critical for Agile Application Adoption
Batch is Back: Critical for Agile Application AdoptionBatch is Back: Critical for Agile Application Adoption
Batch is Back: Critical for Agile Application Adoption
 
2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive Tuning2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive Tuning
 
Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...
 
Disaster Recovery and Cloud Migration for your Apache Hive Warehouse
Disaster Recovery and Cloud Migration for your Apache Hive WarehouseDisaster Recovery and Cloud Migration for your Apache Hive Warehouse
Disaster Recovery and Cloud Migration for your Apache Hive Warehouse
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High Performance
 

Viewers also liked

Battle At Goliad
Battle At GoliadBattle At Goliad
Battle At Goliadcompd
 
My other computer_is_a_datacentre
My other computer_is_a_datacentreMy other computer_is_a_datacentre
My other computer_is_a_datacentreSteve Loughran
 
Taming Deployment With Smart Frog
Taming Deployment With Smart FrogTaming Deployment With Smart Frog
Taming Deployment With Smart FrogSteve Loughran
 
Farms, Fabrics and Clouds
Farms, Fabrics and CloudsFarms, Fabrics and Clouds
Farms, Fabrics and CloudsSteve Loughran
 
Economic Scheduling of Hadoop Jobs
Economic Scheduling of Hadoop JobsEconomic Scheduling of Hadoop Jobs
Economic Scheduling of Hadoop JobsSteve Loughran
 
2013 11-19-hoya-status
2013 11-19-hoya-status2013 11-19-hoya-status
2013 11-19-hoya-statusSteve Loughran
 
Overview of slider project
Overview of slider projectOverview of slider project
Overview of slider projectSteve Loughran
 
A New Approach To Organization
A New Approach To OrganizationA New Approach To Organization
A New Approach To Organizationcompd
 
Did you really want that data?
Did you really want that data?Did you really want that data?
Did you really want that data?Steve Loughran
 

Viewers also liked (14)

Battle At Goliad
Battle At GoliadBattle At Goliad
Battle At Goliad
 
My other computer_is_a_datacentre
My other computer_is_a_datacentreMy other computer_is_a_datacentre
My other computer_is_a_datacentre
 
Taming Deployment With Smart Frog
Taming Deployment With Smart FrogTaming Deployment With Smart Frog
Taming Deployment With Smart Frog
 
Farms, Fabrics and Clouds
Farms, Fabrics and CloudsFarms, Fabrics and Clouds
Farms, Fabrics and Clouds
 
Extended essay overview
Extended essay overviewExtended essay overview
Extended essay overview
 
Economic Scheduling of Hadoop Jobs
Economic Scheduling of Hadoop JobsEconomic Scheduling of Hadoop Jobs
Economic Scheduling of Hadoop Jobs
 
2013 11-19-hoya-status
2013 11-19-hoya-status2013 11-19-hoya-status
2013 11-19-hoya-status
 
H is for_hadoop
H is for_hadoopH is for_hadoop
H is for_hadoop
 
Overview of slider project
Overview of slider projectOverview of slider project
Overview of slider project
 
Graphs
GraphsGraphs
Graphs
 
A New Approach To Organization
A New Approach To OrganizationA New Approach To Organization
A New Approach To Organization
 
Scholarly articles
Scholarly articlesScholarly articles
Scholarly articles
 
Echolocation
EcholocationEcholocation
Echolocation
 
Did you really want that data?
Did you really want that data?Did you really want that data?
Did you really want that data?
 

Similar to HDP-1 introduction for HUG France

NYC-Meetup- Introduction to Hadoop Echosystem
NYC-Meetup- Introduction to Hadoop EchosystemNYC-Meetup- Introduction to Hadoop Echosystem
NYC-Meetup- Introduction to Hadoop EchosystemAL500745425
 
Apache Hadoop Now Next and Beyond
Apache Hadoop Now Next and BeyondApache Hadoop Now Next and Beyond
Apache Hadoop Now Next and BeyondDataWorks Summit
 
Hadoop - Now, Next and Beyond
Hadoop - Now, Next and BeyondHadoop - Now, Next and Beyond
Hadoop - Now, Next and BeyondTeradata Aster
 
Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011Hortonworks
 
Why hadoop for data science?
Why hadoop for data science?Why hadoop for data science?
Why hadoop for data science?Hortonworks
 
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...Big Data Spain
 
Big Data Analytics - Is Your Elephant Enterprise Ready?
Big Data Analytics - Is Your Elephant Enterprise Ready?Big Data Analytics - Is Your Elephant Enterprise Ready?
Big Data Analytics - Is Your Elephant Enterprise Ready?Hortonworks
 
Introduction to Microsoft HDInsight and BI Tools
Introduction to Microsoft HDInsight and BI ToolsIntroduction to Microsoft HDInsight and BI Tools
Introduction to Microsoft HDInsight and BI ToolsDataWorks Summit
 
Hadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun MurthyHadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun Murthyhuguk
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwielerlucenerevolution
 
Strata feb2013
Strata feb2013Strata feb2013
Strata feb2013alanfgates
 
Cloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a championCloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a championAmeet Paranjape
 
Overview of big data & hadoop version 1 - Tony Nguyen
Overview of big data & hadoop   version 1 - Tony NguyenOverview of big data & hadoop   version 1 - Tony Nguyen
Overview of big data & hadoop version 1 - Tony NguyenThanh Nguyen
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Thanh Nguyen
 

Similar to HDP-1 introduction for HUG France (20)

NYC-Meetup- Introduction to Hadoop Echosystem
NYC-Meetup- Introduction to Hadoop EchosystemNYC-Meetup- Introduction to Hadoop Echosystem
NYC-Meetup- Introduction to Hadoop Echosystem
 
Zh tw cloud computing era
Zh tw cloud computing eraZh tw cloud computing era
Zh tw cloud computing era
 
Cloud computing era
Cloud computing eraCloud computing era
Cloud computing era
 
Apache Hadoop Now Next and Beyond
Apache Hadoop Now Next and BeyondApache Hadoop Now Next and Beyond
Apache Hadoop Now Next and Beyond
 
Hadoop - Now, Next and Beyond
Hadoop - Now, Next and BeyondHadoop - Now, Next and Beyond
Hadoop - Now, Next and Beyond
 
Hadoop Family and Ecosystem
Hadoop Family and EcosystemHadoop Family and Ecosystem
Hadoop Family and Ecosystem
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011
 
Hadoop Trends
Hadoop TrendsHadoop Trends
Hadoop Trends
 
Why hadoop for data science?
Why hadoop for data science?Why hadoop for data science?
Why hadoop for data science?
 
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
 
Big Data Analytics - Is Your Elephant Enterprise Ready?
Big Data Analytics - Is Your Elephant Enterprise Ready?Big Data Analytics - Is Your Elephant Enterprise Ready?
Big Data Analytics - Is Your Elephant Enterprise Ready?
 
Introduction to Microsoft HDInsight and BI Tools
Introduction to Microsoft HDInsight and BI ToolsIntroduction to Microsoft HDInsight and BI Tools
Introduction to Microsoft HDInsight and BI Tools
 
Hadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun MurthyHadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun Murthy
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwieler
 
Big data
Big dataBig data
Big data
 
Strata feb2013
Strata feb2013Strata feb2013
Strata feb2013
 
Cloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a championCloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a champion
 
Overview of big data & hadoop version 1 - Tony Nguyen
Overview of big data & hadoop   version 1 - Tony NguyenOverview of big data & hadoop   version 1 - Tony Nguyen
Overview of big data & hadoop version 1 - Tony Nguyen
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
 

More from Steve Loughran

The age of rename() is over
The age of rename() is overThe age of rename() is over
The age of rename() is overSteve Loughran
 
What does Rename Do: (detailed version)
What does Rename Do: (detailed version)What does Rename Do: (detailed version)
What does Rename Do: (detailed version)Steve Loughran
 
Put is the new rename: San Jose Summit Edition
Put is the new rename: San Jose Summit EditionPut is the new rename: San Jose Summit Edition
Put is the new rename: San Jose Summit EditionSteve Loughran
 
@Dissidentbot: dissent will be automated!
@Dissidentbot: dissent will be automated!@Dissidentbot: dissent will be automated!
@Dissidentbot: dissent will be automated!Steve Loughran
 
PUT is the new rename()
PUT is the new rename()PUT is the new rename()
PUT is the new rename()Steve Loughran
 
Extreme Programming Deployed
Extreme Programming DeployedExtreme Programming Deployed
Extreme Programming DeployedSteve Loughran
 
What does rename() do?
What does rename() do?What does rename() do?
What does rename() do?Steve Loughran
 
Dancing Elephants: Working with Object Storage in Apache Spark and Hive
Dancing Elephants: Working with Object Storage in Apache Spark and HiveDancing Elephants: Working with Object Storage in Apache Spark and Hive
Dancing Elephants: Working with Object Storage in Apache Spark and HiveSteve Loughran
 
Apache Spark and Object Stores —for London Spark User Group
Apache Spark and Object Stores —for London Spark User GroupApache Spark and Object Stores —for London Spark User Group
Apache Spark and Object Stores —for London Spark User GroupSteve Loughran
 
Spark Summit East 2017: Apache spark and object stores
Spark Summit East 2017: Apache spark and object storesSpark Summit East 2017: Apache spark and object stores
Spark Summit East 2017: Apache spark and object storesSteve Loughran
 
Hadoop, Hive, Spark and Object Stores
Hadoop, Hive, Spark and Object StoresHadoop, Hive, Spark and Object Stores
Hadoop, Hive, Spark and Object StoresSteve Loughran
 
Apache Spark and Object Stores
Apache Spark and Object StoresApache Spark and Object Stores
Apache Spark and Object StoresSteve Loughran
 
Household INFOSEC in a Post-Sony Era
Household INFOSEC in a Post-Sony EraHousehold INFOSEC in a Post-Sony Era
Household INFOSEC in a Post-Sony EraSteve Loughran
 
Hadoop and Kerberos: the Madness Beyond the Gate: January 2016 edition
Hadoop and Kerberos: the Madness Beyond the Gate: January 2016 editionHadoop and Kerberos: the Madness Beyond the Gate: January 2016 edition
Hadoop and Kerberos: the Madness Beyond the Gate: January 2016 editionSteve Loughran
 
Hadoop and Kerberos: the Madness Beyond the Gate
Hadoop and Kerberos: the Madness Beyond the GateHadoop and Kerberos: the Madness Beyond the Gate
Hadoop and Kerberos: the Madness Beyond the GateSteve Loughran
 
Slider: Applications on YARN
Slider: Applications on YARNSlider: Applications on YARN
Slider: Applications on YARNSteve Loughran
 

More from Steve Loughran (20)

Hadoop Vectored IO
Hadoop Vectored IOHadoop Vectored IO
Hadoop Vectored IO
 
The age of rename() is over
The age of rename() is overThe age of rename() is over
The age of rename() is over
 
What does Rename Do: (detailed version)
What does Rename Do: (detailed version)What does Rename Do: (detailed version)
What does Rename Do: (detailed version)
 
Put is the new rename: San Jose Summit Edition
Put is the new rename: San Jose Summit EditionPut is the new rename: San Jose Summit Edition
Put is the new rename: San Jose Summit Edition
 
@Dissidentbot: dissent will be automated!
@Dissidentbot: dissent will be automated!@Dissidentbot: dissent will be automated!
@Dissidentbot: dissent will be automated!
 
PUT is the new rename()
PUT is the new rename()PUT is the new rename()
PUT is the new rename()
 
Extreme Programming Deployed
Extreme Programming DeployedExtreme Programming Deployed
Extreme Programming Deployed
 
Testing
TestingTesting
Testing
 
I hate mocking
I hate mockingI hate mocking
I hate mocking
 
What does rename() do?
What does rename() do?What does rename() do?
What does rename() do?
 
Dancing Elephants: Working with Object Storage in Apache Spark and Hive
Dancing Elephants: Working with Object Storage in Apache Spark and HiveDancing Elephants: Working with Object Storage in Apache Spark and Hive
Dancing Elephants: Working with Object Storage in Apache Spark and Hive
 
Apache Spark and Object Stores —for London Spark User Group
Apache Spark and Object Stores —for London Spark User GroupApache Spark and Object Stores —for London Spark User Group
Apache Spark and Object Stores —for London Spark User Group
 
Spark Summit East 2017: Apache spark and object stores
Spark Summit East 2017: Apache spark and object storesSpark Summit East 2017: Apache spark and object stores
Spark Summit East 2017: Apache spark and object stores
 
Hadoop, Hive, Spark and Object Stores
Hadoop, Hive, Spark and Object StoresHadoop, Hive, Spark and Object Stores
Hadoop, Hive, Spark and Object Stores
 
Apache Spark and Object Stores
Apache Spark and Object StoresApache Spark and Object Stores
Apache Spark and Object Stores
 
Household INFOSEC in a Post-Sony Era
Household INFOSEC in a Post-Sony EraHousehold INFOSEC in a Post-Sony Era
Household INFOSEC in a Post-Sony Era
 
Hadoop and Kerberos: the Madness Beyond the Gate: January 2016 edition
Hadoop and Kerberos: the Madness Beyond the Gate: January 2016 editionHadoop and Kerberos: the Madness Beyond the Gate: January 2016 edition
Hadoop and Kerberos: the Madness Beyond the Gate: January 2016 edition
 
Hadoop and Kerberos: the Madness Beyond the Gate
Hadoop and Kerberos: the Madness Beyond the GateHadoop and Kerberos: the Madness Beyond the Gate
Hadoop and Kerberos: the Madness Beyond the Gate
 
Slider: Applications on YARN
Slider: Applications on YARNSlider: Applications on YARN
Slider: Applications on YARN
 
YARN Services
YARN ServicesYARN Services
YARN Services
 

HDP-1 introduction for HUG France

  • 1. HDP-1. Steve Loughran, Hortonworks (stevel at hortonworks.com, @steveloughran). Paris, June 2012. © Hortonworks Inc. 2012
  • 2. Hortonworks Data Platform (architecture diagram)
      Diagram groupings: Operate, Develop, Interact, Integrate
      – Management & Monitoring (Ambari, Zookeeper)
      – Workflow & Scheduling (Oozie)
      – Non-Relational Database (HBase)
      – Scripting (Pig)
      – Query (Hive)
      – Metadata Management (HCatalog)
      – Data Extraction & Loading (HCatalog APIs, WebHDFS, Talend Open Studio for Big Data, Sqoop)
      – Distributed Processing (MapReduce)
      – Distributed Storage (HDFS)
  • 3. Hortonworks Data Platform (HDP): Fully Integrated, Extensively Tested, Enterprise Supported
      Challenge:
      – Integrate, manage, and support changes across a wide range of open source projects that power the Hadoop platform, each with its own release schedules, versions, & dependencies
      – Time-intensive, complex, expensive
      Solution: Hortonworks Data Platform
      – Integrated, certified platform distributions
      – Extensive QA process: many apps across small, medium, & large clusters
      – Industry-leading support with clear service levels for updates and patches
      Diagram: Hadoop Core, Pig, HCatalog, Hive, Ambari, Zookeeper (legend marks new versions)
  • 4. HDP 1.0 Components
      Component                               Version
      Apache Hadoop (HDFS & MapReduce)        1.0.3+
      Apache HCatalog                         0.4.0+
      Apache Pig                              0.9.2
      Apache Hive                             0.9.0+
      Apache HBase                            0.92.1+
      Talend Open Studio for Big Data         5.1.0
      Apache Sqoop                            1.4.1+
      Apache Oozie                            3.1.3+
      Apache Zookeeper                        3.3.4
      Apache Ambari (Technology Preview)      0.1
  • 5. Management & Monitoring: Ambari
      – 100% open source
      – Wizard-based install, provisioning & configuration management
      – Monitoring and alerting dashboards
      – Goals: ease of installation, scale to large clusters, effective monitoring of all services
  • 6. Cluster Provisioning through Web UI. Download and try from http://hortonworks.com
  • 7. Monitoring and alerting dashboards
  • 8. Installation and Provisioning
      HMC Installer: GUI, Puppet-driven
      – Installs Java and up
      – Configures entire cluster
      – Sets up HMC for cluster monitoring
      – Web UI + text files listing nodes
      gsInstall
      – Command-line installer, file-driven
      RPM/YUM for custom installation processes
      – Configuration left as an exercise
      – Use if you have other cluster management tooling
      Qualified at scale on RHEL 5.8 & Java 6u26
  • 9. Enterprise Data Integration -> Talend
      Talend Open Studio for Big Data
      – Feature-rich Job Designer
      – Rich palette of pre-built templates
      – Supports HDFS, Pig, Hive, HBase, HCatalog
      – Apache-licensed, bundled with HDP
      Key benefits
      – Graphical development
      – Robust and scalable execution
      – Broadest connectivity to support all systems: 450+ components
      – Real-time debugging
  • 10. Metadata Management -> HCatalog
      Simplifies data sharing between Hadoop and other data systems
      – Enables Hadoop data to be described in a schema & accessed as tables
      Provides consistent data access for MapReduce, Hive and Pig
      – Minimizes hard coding of data structure, storage format, and location
      Manages metadata for table storage
      – Based on Hive's metadata server
      – Uses Hive language for metadata manipulation operations
      Tables may be stored in RCFile, text files, or SequenceFiles
  • 11. RESTful API Front-door for Hadoop
      – Opens the door to languages other than Java
      – Thin clients via web services vs. fat clients in gateway
      – Insulation from interface changes release to release
      Diagram: WebHDFS and HCatalog web interfaces in front of MapReduce, Pig, Hive and HCatalog, over HDFS, HBase and an external store
  • 12. WebHDFS: HDFS over HTTP
      ~:$ GET http://nnode:50070/webhdfs/v1/results/part-r-00000.csv?op=open
      GATE4,eb8bd736445f415e18886ba037f84829,55000,2007-01-14,14:01:54,
      GATE4,ec58edcce1049fa665446dc1fa690638,8030803000,2007-01-14,13:52:31,
      GATE4,b6f07ce00f09035a6683c5e93e3c04b8,30000,2007-01-28,12:41:11,
      GATE4,a1bc345b756090854e9dd0011087c6c0,30000,2007-01-28,12:59:33,
      ...
      Potential uses:
      – Out-of-cluster access to HDFS
      – Cross-cluster, cross-version HDFS access
      – Native filesystem clients
      dfs.webhdfs.enabled=true
  • 13. The WebHDFS & service APIs isolate Hadoop internals from stable public interfaces. Long-haul, cross-language, stable, secure.
  • 14. My project: HA on vSphere
  • 15. Release Schedule
      HDP 1.x: quarterly releases
      – Large-scale QA process
      – Validate performance as well as functionality
      Technology Preview Program
      – Early access; help w/ testing
      – Access to new features such as HA and Windows integration
      Predictable timetable of stable releases
  • 16. Ready and free to use today: http://hortonworks.com/download/
  • 17. Thank You! Des questions?
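
The WebHDFS request on slide 12 can be sketched from Python's standard library. This is a minimal illustration, not part of the original deck: the helper names `webhdfs_url` and `read_file` are hypothetical, the host `nnode` and port 50070 are taken from the slide, and the operation is spelled `OPEN` here although the slide writes it in lowercase. It assumes a cluster with dfs.webhdfs.enabled=true and no authentication.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def webhdfs_url(namenode, path, op, port=50070, **params):
    """Build a WebHDFS URL of the form http://host:port/webhdfs/v1<path>?op=..."""
    query = urlencode({"op": op, **params})
    # quote() leaves "/" untouched, so the HDFS path keeps its shape.
    return f"http://{namenode}:{port}/webhdfs/v1{quote(path)}?{query}"


def read_file(namenode, path, port=50070):
    """Fetch a file's bytes; urlopen follows the HTTP redirect on its own."""
    with urlopen(webhdfs_url(namenode, path, "OPEN", port=port)) as response:
        return response.read()


if __name__ == "__main__":
    # Prints the same request the slide issues from the command line.
    print(webhdfs_url("nnode", "/results/part-r-00000.csv", "OPEN"))
```

Calling `read_file("nnode", "/results/part-r-00000.csv")` against a live cluster would return the CSV rows shown on the slide; the sketch only prints the URL so that it runs without one.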

Editor's Notes

  1. The newest way that JBoss is delivering Enterprise-class stability and performance is with JBoss Enterprise Platforms. Having to integrate, and maintain the integrations between, the multiple community projects to meet your enterprise middleware platform needs can add complexity and cost to your IT operations. Red Hat solves this problem with JBoss Enterprise Platforms. JBoss Enterprise Platforms integrate the most popular JBoss.org projects into stable, secure, certified distributions with a single patch and update stream. JBoss Enterprise Platforms are available via subscriptions that include certified software, industry-leading support, updates and patches, documentation and multi-year maintenance policies. Now, customers can leverage all the innovation, flexibility and value of open source without the additional time and expense of maintaining their own application platform. Everybody wins.
  2. How do you set up a cluster? Three ways. 1. The HMC installer uses Puppet to set up a set of machines, driven by files listing the hostnames of the machines you want in specific roles. It doesn't assume you have Java; it installs the tested Java versions (64-bit for masters, 32-bit for workers). It brings up the entire cluster, runs smoke tests, and leaves you with a web management console driven by Ganglia and Nagios. This is the easy way to set up an entire cluster. 2. There is the option of just installing the RPMs using Yum, directly from the HWX repository, using "yum upgrade" to upgrade, or even going to Kickstart and creating your own OS images on demand. One thing to consider is that the platforms tested on look "dated": why not RHEL 6.3 + Java 7? We are using experience with stability problems on the Y! cluster to stick to a JVM version that is trusted to be stable, and a mature OS.
  3. HCatalog is a table abstraction and a storage abstraction system that makes it easy for multiple tools to interact with the same underlying data. It is similar to a schema in the RDBMS world, except that it's more than just the SQL layer. A common buzzword in the NoSQL world today is polyglot persistence. Basically, what that comes down to is that you pick the right tool for the job. In the Hadoop ecosystem, you have many tools that might be used for data processing: you might use Pig or Hive, or your own custom MapReduce program, or that shiny new GUI-based tool that's just come out. Which one to use might depend on the user, on the type of query you're interested in, or on the type of job you want to run. From another perspective, you might want to store your data in columnar storage for efficient storage and retrieval for particular query types, or in text so that users can write data producers in scripting languages like Perl or Python, or you may want to hook up that HBase table as a data source. As an end user, I want to use whatever data processing tool is available to me. As a data designer, I want to optimize how data is stored. As a cluster manager/data architect, I want the ability to share pieces of information across the board, and move data back and forth fluidly. HCatalog's hopes and promises are the realization of all of the above.
  4. Picking on a new feature in Hadoop 1.0.3: WebHDFS is something interesting. Set one config option and the DNs and NNs become web servers (using the chosen auth mechanism), offering read and write access to the data. This is integral to the cluster: you ask the NN for data, which triggers a 307 redirect to a DN with the data, which serves it up. The redirect is handled transparently by all HTTP clients set up to handle redirects.
  5. Up until now, a change in the internal Hadoop versions caused protocol version mismatch problems with all remote clients. Those clients also needed the entire Hadoop JAR set on their classpath, and were Java-only. Now: stable APIs, cross-language,
  6. This is something still coming together: HA clustering using VMware vSphere as the HA clustering system underneath the classic failure points: the NameNode of HDFS and the JobTracker of MapReduce. Monitoring agents report failure to vSphere and trigger failover on process crash or hang, VM crash/hang, and physical hardware failure. It lets you host a set of independent VMs, one per master server, with isolated lifecycle and management. Very good for ops tasks: snapshotting, updating software in an offline VM, etc. It does not require that the workers are virtual; they can be physical, virtual or even a mix of both.
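
The redirect described in note 4 can be made explicit with a client that does not follow redirects by itself. The sketch below is an illustration under assumptions, not code from the talk: `redirect_target` and `read_via_namenode` are hypothetical names, port 50070 is the NameNode port shown on slide 12, and the cluster is assumed to need no authentication. Python's `http.client` is used because it leaves redirects to the caller, so both hops are visible.

```python
import http.client
from urllib.parse import urlsplit


def redirect_target(status, location):
    """Return the redirect URL when the NameNode answers 307, else None."""
    if status == 307 and location:
        return location
    return None


def read_via_namenode(namenode, path, port=50070):
    """Ask the NameNode for a file, then fetch the bytes from the DataNode."""
    # Hop 1: the NameNode does not serve data; it replies with a redirect.
    nn = http.client.HTTPConnection(namenode, port)
    nn.request("GET", f"/webhdfs/v1{path}?op=OPEN")
    response = nn.getresponse()
    response.read()  # drain the (empty) body before closing
    target = redirect_target(response.status, response.getheader("Location"))
    nn.close()
    if target is None:
        raise RuntimeError(f"expected a 307 redirect, got {response.status}")

    # Hop 2: the Location header names the DataNode that holds the data.
    parts = urlsplit(target)
    dn = http.client.HTTPConnection(parts.hostname, parts.port)
    dn.request("GET", f"{parts.path}?{parts.query}")
    data = dn.getresponse().read()
    dn.close()
    return data
```

A client library that follows redirects automatically collapses the two hops into one call, which is why the note describes the redirect as transparent.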