Future of HCatalog


The initial work in HCatalog has allowed users to share their data in Hadoop regardless of the tools they use and relieved them of needing to know where and how their data is stored. But there is much more to be done to deliver on the full promise of providing metadata and table management for Hadoop clusters. It should be easy to store and process semi-structured and unstructured data via HCatalog. We need interfaces and simple implementations of data life cycle management tools. We need to deepen the integration with NoSQL and MPP data stores. And we need to be able to store larger metadata such as partition-level statistics and user-generated metadata. This talk will cover these areas of growth and give an overview of how they might be approached.

Future of HCatalog
Alan F. Gates
@alanfgates

Page 1

Who Am I?
• HCatalog committer and mentor
• Co-founder of Hortonworks
• Lead for Pig, Hive, and HCatalog at Hortonworks
• Pig committer and PMC Member
• Member of Apache Software Foundation and Incubator PMC
• Author of Programming Pig from O’Reilly

© Hortonworks Inc. 2012
Page 2

Hadoop Ecosystem

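[Diagram: each tool's own interfaces, layered over HDFS and the Metastore]
• MapReduce: InputFormat/OutputFormat
• Hive: SerDe, InputFormat/OutputFormat, Metastore Client
• Pig: Load/Store
• Underneath: HDFS, Metastore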


Opening up Metadata to MR & Pig

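[Diagram: MapReduce and Pig reach the shared layer through HCatalog interfaces]
• MapReduce: HCatInputFormat/HCatOutputFormat
• Pig: HCatLoader/HCatStorer
• Shared layer (with Hive): SerDe, InputFormat/OutputFormat, Metastore Client
• Underneath: HDFS, Metastore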


Templeton - REST API
•  REST endpoints: databases, tables, partitions, columns, table properties
•  PUT to create/update, GET to list or describe, DELETE to drop

Get a list of all tables in the default database:

GET http://…/v1/ddl/database/default/table

Response from Hadoop/HCatalog:
{
"tables": ["counted","processed"],
"database": "default"
}
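The list-tables call above can be sketched from the client side as follows. This is a minimal illustration, not Templeton client code: the host `http://templeton.example` is a hypothetical placeholder (the slide elides the real one), the helper names are invented here, and the response body is the one shown on the slide rather than a live reply.

```python
import json
from typing import List
from urllib.parse import quote


def table_list_url(base_url: str, database: str) -> str:
    """Build the DDL URL that lists the tables of one database."""
    return f"{base_url.rstrip('/')}/v1/ddl/database/{quote(database, safe='')}/table"


def parse_table_list(body: str) -> List[str]:
    """Pull the table names out of the JSON response body."""
    return list(json.loads(body)["tables"])


# Hypothetical host: the slide does not give the real one.
BASE_URL = "http://templeton.example"

url = table_list_url(BASE_URL, "default")

# Response body as shown on the slide (no network call is made here).
response_body = '{"tables": ["counted", "processed"], "database": "default"}'
tables = parse_table_list(response_body)

print("GET", url)
print(tables)
```

Because the interface is plain HTTP and JSON, the same two steps (build a URL, decode a JSON body) carry over to any language with an HTTP client.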


Reading and Writing Data in Parallel
•  Use Case: Users want
   –  to read and write records in parallel between Hadoop and their parallel system
   –  driven by their system
   –  in a language independent way
   –  without needing to understand Hadoop’s file formats
•  Example: an MPP data store wants to read data out of Hadoop as HCatRecords for its parallel jobs
•  What exists today
   –  webhdfs
      –  Language independent
      –  Can move data in parallel
      –  Driven from the user side
      –  Moves only bytes, no understanding of file format
   –  Sqoop
      –  Can move data in parallel
      –  Understands data format
      –  Driven from Hadoop side
      –  Requires connector or JDBC


HCatReader and HCatWriter

[Diagram: parallel read through HCatReader]
• Master calls getHCatReader on HCatalog and gets back an HCatReader
• Master distributes the input splits to the slaves
• Each slave calls read and receives an Iterator<HCatRecord> over data stored in HDFS

Right now all in Java, needs to be REST
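The master/slave read pattern on this slide can be sketched in a few lines. This is only an illustration of the control flow, under loud assumptions: it does not use the HCatalog Java API, the function names are hypothetical, the "splits" are in-memory lists standing in for HDFS data, and threads stand in for the slave processes of a parallel system.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterator, List

Record = Dict[str, str]

# Stand-in for a table stored in HDFS, already divided into input splits.
SPLITS: List[List[Record]] = [
    [{"name": "Alice", "zip": "93201"}],
    [{"name": "Bob", "zip": "76331"}],
]


def get_input_splits() -> List[int]:
    """Master side: ask the catalog which splits make up the table."""
    return list(range(len(SPLITS)))


def read(split_id: int) -> Iterator[Record]:
    """Slave side: iterate over the records of a single split."""
    return iter(SPLITS[split_id])


def slave(split_id: int) -> List[Record]:
    """One slave drains the iterator for the split it was handed."""
    return list(read(split_id))


def parallel_read() -> List[Record]:
    """Master hands one split to each slave and gathers the records."""
    split_ids = get_input_splits()
    with ThreadPoolExecutor(max_workers=len(split_ids)) as pool:
        per_split = pool.map(slave, split_ids)  # results come back in split order
        return [record for records in per_split for record in records]


records = parallel_read()
print(records)
```

The point of the design is that the master only handles metadata (the split list), while record data flows directly to each slave.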

Storing Semi-/Unstructured Data

Table Users:
Name   Zip
Alice  93201
Bob    76331

select name, zip
from users;

File Users:
{"name":"alice","zip":"93201"}
{"name":"bob","zip":"76331"}
{"name":"cindy"}
{"zip":"87890"}

A = load 'Users' as (name:chararray, zip:chararray);
B = foreach A generate name, zip;
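To make the semi-structured case concrete, the sketch below projects name and zip out of the four JSON records from the File Users example. It is an illustration only: the slide does not say how a missing field should be treated, so filling it with a null (`None`) is an assumption, and the `project` helper is a hypothetical name.

```python
import json
from typing import List, Optional, Sequence, Tuple

# The four records from the "File Users" example on the slide.
LINES = [
    '{"name":"alice","zip":"93201"}',
    '{"name":"bob","zip":"76331"}',
    '{"name":"cindy"}',
    '{"zip":"87890"}',
]


def project(lines: Sequence[str],
            fields: Sequence[str] = ("name", "zip")) -> List[Tuple[Optional[str], ...]]:
    """Return one tuple per record; absent fields become None (an assumption)."""
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append(tuple(record.get(field) for field in fields))
    return rows


rows = project(LINES)
for row in rows:
    print(row)
```

The last two records show why a fixed table schema is awkward here: each record carries its own set of fields, so the reader has to decide what a missing column means.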


Hive ODBC/JDBC Today

[Diagram: JDBC Client and ODBC Client connect to the Hive Server, which sits in front of Hadoop]
•  JDBC Client issue: Have to have Hive code on the client
•  Hive Server issues:
   –  Not concurrent
   –  Not secure
   –  Not scalable
•  ODBC Client issue: Open source version not easy to use


ODBC/JDBC Proposal

[Diagram: JDBC Client and ODBC Client connect to a REST Server, which sits in front of Hadoop]
•  JDBC and ODBC clients: Provide robust open source implementations
•  REST Server:
   –  Spawns job inside cluster
   –  Runs job as submitting user
   –  Works with security
   –  Scaling web services well understood


Future of HCatalog


6. Templeton - REST API

Create new table "rawevents":

PUT http://…/v1/ddl/database/default/table/rawevents
{"columns": [{ "name": "url", "type": "string" }, { "name": "user", "type": "string" }],
"partitionedBy": [{ "name": "ds", "type": "string" }]}

Response from Hadoop/HCatalog:
{ "table": "rawevents", "database": "default" }
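A client could assemble the create-table request for "rawevents" roughly as follows. This is a minimal sketch that only builds the method, path, and JSON body shown on the slide; the helper names are hypothetical, the host is left out because the slide elides it, and no request is actually sent.

```python
import json
from typing import Dict, List, Tuple


def column(name: str, type_: str) -> Dict[str, str]:
    """One column description in the shape used by the slide's JSON body."""
    return {"name": name, "type": type_}


def create_table_request(database: str,
                         table: str,
                         columns: List[Dict[str, str]],
                         partitioned_by: List[Dict[str, str]]) -> Tuple[str, str, str]:
    """Return (HTTP method, URL path, JSON body) for creating a table."""
    path = f"/v1/ddl/database/{database}/table/{table}"
    body = json.dumps({"columns": columns, "partitionedBy": partitioned_by})
    return "PUT", path, body


method, path, body = create_table_request(
    "default",
    "rawevents",
    columns=[column("url", "string"), column("user", "string")],
    partitioned_by=[column("ds", "string")],
)

print(method, path)
print(body)
```

Keeping request construction separate from transport makes the body easy to inspect before it is sent with whatever HTTP client the caller prefers.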

7. Templeton - REST API

Describe table "rawevents":

GET http://…/v1/ddl/database/default/table/rawevents

Response from Hadoop/HCatalog:
{
"columns": [{"name": "url","type": "string"}, {"name": "user","type": "string"}],
"database": "default",
"table": "rawevents"
}

•  Included in HDP
•  Not yet checked in, but you can find the code on Apache’s JIRA HCATALOG-182


11. Storing Semi-/Unstructured Data

Table Users:
Name   Zip
Alice  93201
Bob    76331

File Users:
{"name":"alice","zip":"93201"}
{"name":"bob","zip":"76331"}
{"name":"cindy"}
{"zip":"87890"}

A = load 'Users' as (name:chararray, zip:chararray);
B = foreach A generate name, zip;

select name, zip
from users;

A = load 'Users';
B = foreach A generate name, zip;


