Chicago Data Summit: Apache HBase: An Introduction


Apache HBase is an open source distributed data-store capable of managing billions of rows of semi-structured data across large clusters of commodity hardware. HBase provides real-time random read-write access as well as integration with Hadoop MapReduce, Hive, and Pig for batch analysis. In this talk, Todd will provide an introduction to the capabilities and characteristics of HBase, comparing and contrasting it with traditional database systems. He will also introduce its architecture and data model, and present some example use cases.


Apache HBase: an introduction

Introductions

Outline

Apache HBase
HBase is an open source, distributed, sorted map datastore modeled after Google’s BigTable

Open Source

Distributed

Sorted Map Datastore

Sorted Map Datastore (logical view as “records”)

Row key | Data
cutting | info: { ‘height’: ‘9ft’, ‘state’: ‘CA’ }; roles: { ‘ASF’: ‘Director’, ‘Hadoop’: ‘Founder’ }
tlipcon | info: { ‘height’: ‘5ft7’, ‘state’: ‘CA’ }; roles: { ‘Hadoop’: ‘Committer’@ts=2010, ‘Hadoop’: ‘PMC’@ts=2011, ‘Hive’: ‘Contributor’ }

Row key: implicit PRIMARY KEY in RDBMS terms
Data is all byte[] in HBase
Different types of data are separated into different “column families”
Different rows may have different sets of columns (the table is sparse)
A single cell might have different values at different timestamps
Useful for *-to-many mappings

Sorted Map Datastore (physical view as “cells”)

Sorted on disk by row key, column key, descending timestamp. Timestamps are milliseconds since the Unix epoch.

info Column Family

Row key | Column key | Timestamp | Cell value
cutting | info:height | 1273516197868 | 9ft
cutting | info:state | 1043871824184 | CA
tlipcon | info:height | 1273878447049 | 5ft7
tlipcon | info:state | 1273616297446 | CA

roles Column Family

Row key | Column key | Timestamp | Cell value
cutting | roles:ASF | 1273871823022 | Director
cutting | roles:Hadoop | 1183746289103 | Founder
tlipcon | roles:Hadoop | 1300062064923 | PMC
tlipcon | roles:Hadoop | 1293388212294 | Committer
tlipcon | roles:Hive | 1273616297446 | Contributor
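The relationship between the logical and physical views can be sketched in a few lines of Python. This is a toy model built from plain dicts and tuples, not the HBase client API; the function name `physical_cells` and the nested-dict layout are assumptions made for illustration. It flattens one column family of the example records into cells and sorts them by row key, column key, and descending timestamp.

```python
# Toy model only: plain Python data structures, not the HBase client API.
# Logical view: row key -> column family -> qualifier -> {timestamp_ms: value}
records = {
    "cutting": {
        "info": {"height": {1273516197868: "9ft"}, "state": {1043871824184: "CA"}},
        "roles": {"ASF": {1273871823022: "Director"}, "Hadoop": {1183746289103: "Founder"}},
    },
    "tlipcon": {
        "info": {"height": {1273878447049: "5ft7"}, "state": {1273616297446: "CA"}},
        "roles": {
            "Hadoop": {1293388212294: "Committer", 1300062064923: "PMC"},
            "Hive": {1273616297446: "Contributor"},
        },
    },
}


def physical_cells(records, family):
    """Flatten one column family into (row, column, timestamp, value) cells."""
    cells = [
        (row, f"{family}:{qualifier}", ts, value)
        for row, families in records.items()
        for qualifier, versions in families.get(family, {}).items()
        for ts, value in versions.items()
    ]
    # Row key ascending, column key ascending, timestamp descending.
    # HBase compares raw bytes; for these ASCII keys, Python's string
    # ordering gives the same result.
    return sorted(cells, key=lambda c: (c[0], c[1], -c[2]))


for cell in physical_cells(records, "roles"):
    print(cell)
```

Each column family is flattened separately, mirroring the two per-family tables on the slide. Within `tlipcon`/`roles:Hadoop`, the newer `PMC` value sorts ahead of `Committer` because of the descending timestamp.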

Column Families

Accessing HBase

HBase API

High Level Architecture
[Diagram components: HBase, HDFS, ZooKeeper, Java Client, MapReduce, Hive/Pig, Thrift/REST Gateway, Your Java Application]

Terms and Daemons

Cluster Architecture
[Diagram components: Client, ZK Quorum (three ZK Peers), two HMasters, RegionServers, HDFS]
Client finds RegionServer addresses in ZooKeeper
Client reads and writes rows by directly accessing the RegionServers
Master assigns regions and achieves load balancing

Cluster Deployment (big cluster)
[Diagram components: HDFS NameNode, Secondary NameNode, MapReduce JobTracker, three ZooKeeper nodes, two HMasters, and slaves each running RegionServer, DataNode, and TaskTracker]
ZooKeeper on 3 or 5 nodes
HMaster with one standby
40+ slaves with HBase, HDFS, and MR slave processes

Cluster Deployment (small cluster / POC)
[Diagram components: NameNode, SecondaryNameNode, HMaster, JobTracker, and ZooKeeper (“the proverbial basket full of eggs”), plus slaves each running RegionServer, DataNode, and TaskTracker]
5+ slaves with HBase, HDFS, and MR slave processes

HBase vs just HDFS

 | Plain HDFS/MR | HBase
Write pattern | Append-only | Random write, bulk incremental
Read pattern | Full table scan, partition table scan | Random read, small range scan, or table scan
Hive (SQL) performance | Very good | 4-5x slower
Structured storage | Do-it-yourself / TSV / SequenceFile / Avro / ? | Sparse column-family data model
Max data size | 30+ PB | ~1PB

If you have neither random write nor random read, stick to HDFS!

HBase vs RDBMS

 | RDBMS | HBase
Data layout | Row-oriented | Column-family-oriented
Transactions | Multi-row ACID | Single row only
Query language | SQL | get/put/scan/etc. *
Security | Authentication/Authorization | Work in progress
Indexes | On arbitrary columns | Row-key only
Max data size | TBs | ~1PB
Read/write throughput limits | 1000s queries/second | Millions of queries/second

HBase vs other “NoSQL”

HBase in Numbers

SaaS Audit Logging

Facebook Analytics
http://tiny.cloudera.com/hbase-fb-analytics

OpenTSDB
http://opentsdb.net

Use HBase if…

Don’t use HBase if…

Resources

Questions?

Section title slides:

18. HBase vs other systems

23. Use cases

27. Powered By HBase … and others

Editor's Notes

HBase is a project that solves this problem. In a sentence, HBase is an open source, distributed, sorted map modeled after Google’s BigTable. Open source: Apache HBase is an open source project with an Apache 2.0 license. Distributed: HBase is designed to use multiple machines to store and serve data. Sorted map: HBase stores data as a map, and guarantees that adjacent keys will be stored next to each other on disk. HBase is modeled after BigTable, a system that is used for hundreds of applications at Google. Copyright 2010 Cloudera - Do not distribute
Earlier, I said that HBase is a big sorted map. Here is an example of a table. The map key is (row key + column + timestamp). The value is the cell contents. The rows in the map are sorted by key. In this example, Row1 has 3 columns in the "info" column family. Row2 only has a single column. A column can also be empty. Each cell has a timestamp. By default, the timestamp is set to the current time (in milliseconds since the Unix epoch, January 1st 1970) when the cell is written. A client can specify a timestamp when inserting or retrieving data, and specify how many versions of each cell should be maintained. Data in HBase is non-typed; everything is an array of bytes. Rows are sorted lexicographically. This order is maintained on disk, so Row1 and Row2 can be read together in just one disk seek.
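The versioning behaviour described in the note above can be illustrated with a small Python sketch. It is a toy model, not HBase code: the class name `VersionedCell`, its parameters, and the inclusive `max_ts` bound are assumptions chosen for illustration. It keeps a bounded number of timestamped values per cell, stamps writes with the current time in milliseconds unless the caller supplies a timestamp, and returns the newest versions first.

```python
import time


class VersionedCell:
    """Toy cell that keeps at most max_versions timestamped values."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.versions = {}  # timestamp in ms since the Unix epoch -> bytes

    def put(self, value, ts=None):
        # Default to "now" in milliseconds, as the note describes.
        if ts is None:
            ts = int(time.time() * 1000)
        self.versions[ts] = value
        # Drop the oldest versions beyond the configured limit.
        for old in sorted(self.versions, reverse=True)[self.max_versions:]:
            del self.versions[old]

    def get(self, max_ts=None, count=1):
        """Return up to count (timestamp, value) pairs, newest first.

        Only versions written at or before max_ts are considered
        (inclusive bound; an assumption of this sketch).
        """
        stamps = sorted(
            (t for t in self.versions if max_ts is None or t <= max_ts),
            reverse=True,
        )
        return [(t, self.versions[t]) for t in stamps[:count]]


# Values taken from the tlipcon / roles:Hadoop example on the slides.
cell = VersionedCell(max_versions=2)
cell.put(b"Committer", ts=1293388212294)
cell.put(b"PMC", ts=1300062064923)
print(cell.get())                      # newest version
print(cell.get(max_ts=1299999999999))  # version visible before the PMC write
```

Values are stored as `bytes` to echo the point that everything in HBase is an array of bytes; interpreting them is left to the client.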
Given that HBase stores a large sorted map, the API looks similar to a map. You can get or put individual rows, or scan a range of rows. There is also a very efficient way of incrementing a particular cell – this can be useful for maintaining high-performance counters or statistics. Lastly, it’s possible to write MapReduce jobs that analyze the data in HBase.
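To make the map-like shape of the API concrete, here is a minimal in-memory sketch of the four operations mentioned in the note: get, put, scan, and increment. The class `ToyTable` and its method signatures are hypothetical and are not the HBase client API. The scan bounds (start inclusive, stop exclusive) and the 8-byte big-endian counter encoding are assumptions chosen to resemble how HBase is commonly described.

```python
import bisect


class ToyTable:
    """In-memory stand-in for a sorted table; not the HBase client API."""

    def __init__(self):
        self._keys = []  # row keys, kept sorted
        self._rows = {}  # row key -> {column: value}

    def put(self, row, column, value):
        if row not in self._rows:
            bisect.insort(self._keys, row)
            self._rows[row] = {}
        self._rows[row][column] = value

    def get(self, row):
        """Return a copy of the row's columns, or {} if the row is absent."""
        return dict(self._rows.get(row, {}))

    def scan(self, start=b"", stop=None):
        """Yield (row, columns) in key order for start <= row < stop."""
        i = bisect.bisect_left(self._keys, start)
        while i < len(self._keys) and (stop is None or self._keys[i] < stop):
            key = self._keys[i]
            yield key, dict(self._rows[key])
            i += 1

    def increment(self, row, column, amount=1):
        """Add amount to a counter cell stored as an 8-byte big-endian integer."""
        raw = self.get(row).get(column, b"\x00" * 8)
        new = int.from_bytes(raw, "big", signed=True) + amount
        self.put(row, column, new.to_bytes(8, "big", signed=True))
        return new


table = ToyTable()
table.put(b"user2#a", b"d:x", b"1")
table.put(b"user1#b", b"d:x", b"2")
table.put(b"user1#a", b"d:x", b"3")

# Range scan over every row whose key starts with "user1#"
# ("$" is the byte immediately after "#").
for key, columns in table.scan(b"user1#", b"user1$"):
    print(key, columns)

print(table.increment(b"stats", b"d:hits"))
```

Because row keys are kept sorted, the prefix scan touches only the contiguous run of matching keys rather than the whole table, which is the property the sorted-map design is meant to provide.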
One of the interesting things about NoSQL is that the different systems don’t usually compete directly. We all have picked different tradeoffs. HBase is a strongly consistent system, so it does not have as good availability as an eventually consistent system like Cassandra. But, we find that availability is good in practice! Since HBase is built on top of Hadoop, it has very good integration. For example, we have a very efficient bulk load feature, and the ability to run MapReduce into or out of HBase tables. HBase’s partitioning is range-based, and data is sorted by key on disk. This is different from other systems, which use a hash function to distribute keys. This can be useful for guaranteeing that for a given user account, all of that user’s data can be read with just one disk seek. HBase automatically reshards when necessary, and regions automatically reassign if servers die. Adding more servers is simple – just turn them on. There is no “reshard” step. HBase is not just a key-value store – it is similar to Cassandra in that each row has a sparse set of columns which are efficiently stored.
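The difference between range-based and hash-based partitioning can be seen in a short sketch. Everything here is illustrative: the region boundaries in `REGION_STARTS`, the function names, and the choice of SHA-256 for the hash are assumptions, not details of HBase or of any other system. The point is only that keys sharing a prefix stay together under range partitioning, while a hash function is free to scatter them.

```python
import bisect
import hashlib

# Hypothetical region boundaries: region i holds keys in
# [REGION_STARTS[i], REGION_STARTS[i + 1]); the last region is unbounded.
REGION_STARTS = [b"", b"g", b"n", b"t"]


def range_region(row_key):
    """Index of the region whose key range contains row_key."""
    return bisect.bisect_right(REGION_STARTS, row_key) - 1


def hash_bucket(row_key, buckets=4):
    """Bucket chosen by hashing the key, as a hash-partitioned store might."""
    return hashlib.sha256(row_key).digest()[0] % buckets


keys = [b"alice#001", b"alice#002", b"alice#003", b"zed#001"]
print([range_region(k) for k in keys])  # all alice# keys share one region
print([hash_bucket(k) for k in keys])   # placement depends only on the hash
```

With range partitioning, a scan over one user's rows is served by a single region (or a few adjacent ones), which is what makes the "one disk seek" access pattern in the note possible. The trade-off, not shown here, is that sequential keys can concentrate load on one region.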
Data layout: A traditional RDBMS uses a fixed schema and row-oriented storage model. This has drawbacks if the number of columns per row could vary drastically. A semi-structured column-oriented store handles this case very well.
Transactions: A benefit that an RDBMS offers is strict ACID compliance with full transaction support. HBase currently offers transactions on a per-row basis. There is work being done to expand HBase's transactional support.
Query language: RDBMSs support SQL, a full-featured language for doing filtering, joining, aggregating, sorting, etc. HBase does not support SQL*. There are two ways to find rows in HBase: get a row by key or scan a table.
Security: In version 0.20.4, authentication and authorization are not yet available for HBase.
Indexes: In a typical RDBMS, indexes can be created on arbitrary columns. HBase does not have any traditional indexes**. The rows are stored sorted, with a sparse index of row offsets. This means it is very fast to find a row by its row key.
Max data size: Most RDBMS architectures are designed to store GBs or TBs of data. HBase can scale to much larger data sizes.
Read/write throughput limits: Typical RDBMS deployments can scale to thousands of queries/second. There is virtually no upper bound to the number of reads and writes HBase can handle.
* Hive/HBase integration is being worked on
** There are contrib packages for building indexes on HBase tables
People often want to know “the numbers” about a storage system. I would recommend that you test it yourself – benchmarks always lie. But, here are some general numbers about HBase. The largest cluster I’ve seen is 600 nodes, storing around 600TB. Most clusters are much smaller, only 5-20 nodes, hosting a few hundred gigabytes. Generally, writes take a few ms, and throughput is on the order of thousands of writes per node per second, but of course it depends on the size of the writes. Reads are a few milliseconds if the data is in cache, or 10-30ms if disk seeks are required. Generally we don’t recommend that you store very large values in HBase. It is not efficient if the values stored are more than a few MB.
HBase is currently used in production at a number of companies. Here are a few examples. Facebook is using HBase for a new user-facing product which is going to launch very soon. They also are using HBase for analytics. StumbleUpon hosts large parts of its website from HBase, and also built an advertising platform based on HBase. Mozilla’s crash reporting infrastructure is based on HBase. If your browser crashes and you submit the crash to Mozilla, it is stored in HBase for later analysis by the Firefox developers.
So, if you are interested in Hadoop and HBase, here are some resources. The easiest way to install Hadoop is to use Cloudera’s Distribution for Hadoop from cloudera.com. You can also download the Apache source directly from hadoop.apache.org. You can get started on your laptop, in a VM, or running on EC2. I also recommend our free training videos from our website. The book Hadoop: The Definitive Guide is also really great – it’s also available in Japanese translation.
Thanks very much for having me! If you have any questions, please feel free to ask now or send me an email. Also, we’re hiring both in the USA and in Japan, so if you’re interested in working on Hadoop or HBase, please get in touch.
