Start of a New era: Apache YARN 3.1 and Apache HBase 2.0

© Hortonworks Inc. 2011–2018. All rights reserved
Apache YARN 3.1 / Apache HBase 2.0
Introduction to the Latest Features
Zhen Zeng/Toshihiro Suzuki
Hortonworks, Inc.
2018/10/10

Agenda
• Latest features in YARN 3.0/3.1
• Latest features in HBase 2.0

Latest Features in YARN 3.0/3.1

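Self-Introduction
Zhen Zeng (曾臻)
Solutions Engineer at Hortonworks.
Previously worked as an engineer at Yahoo, an IT consulting firm, and a systems integrator.
Experienced in architecture, design, and implementation for big data, data governance, PaaS, web applications, and more.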

Overview

Hadoop 3 Blog Series

Community Update: Source Code Changes

Building Block – New YARN UI
• The YARN RM UI is still available; so is the YARN Capacity Scheduler

HDP 3.0 - YARN Services UI
User Interface
REST (json)

New Feature: Containerization

Containerization is becoming mainstream
• Industry adoption is growing
  • Docker is becoming widely known
  • “Number of containerized applications will rise by 80% in the next two years” [1]
• Usage patterns for containers have emerged
  • Multi-cloud / hybrid deployments
  • Microservices
• A rapidly growing ecosystem
  • Dozens of container orchestrators
  • Thousands of plugins
  • Market moves
[1] http://i.dell.com/sites/doccontent/business/solutions/whitepapers/en/Documents/Containers_Real_Adoption_2017_Dell_EMC_Forrester_Paper.pdf

Why are containers popular?
• Higher density improves hardware utilization
  • No per-VM guest OS overhead
  • Reusing image layers avoids duplicating data on disk
• Resource isolation
  • Namespaces and cgroups
• Software packaging becomes more advanced
  • Package applications and dependencies together
  • Distribution mechanism
• Better self-service for developers
  • More control over the execution environment
Big data workloads can, of course, benefit from these features too!

Introducing: Containerization in Apache Hadoop YARN
• YARN has supported the concept of a “container” from the beginning
• What is a YARN container?
  • Process
  • Local Resources (scripts, jars, security tokens)
  • Resource constraints (CPU, Memory, I/O)
• It maps well onto emerging container technologies such as Docker

Adding Docker to YARN
• Why Docker?
  • Provides a lightweight mechanism for packaging, distributing, and isolating processes
  • Allows YARN developers to have more control over their execution environment
  • Best fit for the YARN container model
  • A popular containerization framework

Building Blocks for Containers on YARN
• YARN Container Runtimes – Enables support for Docker containers to make
it easier to onboard new applications and services on YARN.
• YARN Services Framework – Provides AM implementation and various
improvements to enable long running services on YARN.

YARN Container Runtimes


Goal:
• Run Hadoop jobs and Docker containers side by side on the same cluster
• Choose the runtime at execution time

New Abstraction: YARN Container Runtimes
• Challenge: Run existing process container in the same cluster as Docker containers
• Solution: Container Runtimes. The container runtime is specified when the application is run.
  • DefaultLinuxContainerRuntime: existing Linux process-based execution.
  • DockerLinuxContainerRuntime: uses Docker to run and monitor a container.
Architecture since Apache Hadoop 2.8

Distributed Shell and MR on Docker Examples
Environment variables are currently used to set the Container Runtime options.
> yarn jar $YARN_EXAMPLES_JAR pi \
    -Dmapreduce.map.env="YARN_CONTAINER_RUNTIME_TYPE=docker,YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=centos:7" \
    -Dmapreduce.reduce.env="YARN_CONTAINER_RUNTIME_TYPE=docker,YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=centos:7" \
    1 40000
> yarn jar $DSHELL_JAR \
    -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker \
    -shell_env YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=centos:7 \
    -shell_command "sleep 120" \
    -jar $DSHELL_JAR \
    -num_containers 1
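
The two commands pass the same pair of environment variables in two different shapes: a comma-separated list for MapReduce and repeated -shell_env flags for Distributed Shell. A minimal Python sketch of that formatting follows. The helper names are hypothetical and are not part of YARN; only the variable names and the centos:7 image come from the slide.

```python
from typing import Dict, List


def docker_runtime_env(image: str) -> Dict[str, str]:
    """Environment variables that select the Docker runtime for a container."""
    return {
        "YARN_CONTAINER_RUNTIME_TYPE": "docker",
        "YARN_CONTAINER_RUNTIME_DOCKER_IMAGE": image,
    }


def as_mapreduce_env(env: Dict[str, str]) -> str:
    """Comma-separated KEY=VALUE pairs, the form given to
    -Dmapreduce.map.env and -Dmapreduce.reduce.env."""
    return ",".join(f"{key}={value}" for key, value in env.items())


def as_shell_env_args(env: Dict[str, str]) -> List[str]:
    """One -shell_env KEY=VALUE pair per variable, the form given to
    the Distributed Shell client."""
    args: List[str] = []
    for key, value in env.items():
        args += ["-shell_env", f"{key}={value}"]
    return args


env = docker_runtime_env("centos:7")
print(as_mapreduce_env(env))
print(" ".join(as_shell_env_args(env)))
```

Keeping the variables in one dictionary means the image name is changed in a single place for both job types.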

To Summarize
• Docker's value is in packaging
  • Users can bundle the execution environment with their application
• YARN containerization can run Docker containers alongside process-based containers
  • The application requesting the resources can specify which mode it should run in via environment variables.
• The YARN NodeManager integrates with the Docker CLI
  • To start and stop containers.
• The core YARN Docker Container Runtime is most appropriate for existing Hadoop workloads
  • Spark, MapReduce, Distributed Shell, etc.
• The YARN Containerization features are of the greatest value when combined with YARN Native Services
  • which we will cover next.

YARN Services Framework

• YARN Services Framework – Provides AM implementation and NM
improvements that enable long running services on YARN.

YARN Services Goals
• Long Running – Simplify the deployment and management of long running
applications on YARN.
• Easily Bring New Applications – Remove tedious process of bringing new
applications to YARN.
• Easy to Manage Applications – REST API and Command Line tools.
• Declarative Configuration – Provide configuration to the applications,
declare resource needs, specify placement policies.

YARN Services Overview
• Apache Slider – incubating at Apache since 2014, designed to make it easier
to run long-running applications on YARN.
• Kicked off an effort to bring improved support for long-running services into YARN
• Integrates Slider core into YARN.
• REST API for managing services on YARN
• Simplified discovery of services via DNS mechanisms
• Released in Apache Hadoop 3.1.0!

YARN Services Architecture
[Diagram components: CLI, HTTP, JSON service spec ("simple-httpd-service"), Resource Manager, Node Manager, Services AM]

Define Services Through a JSON Spec
{
  "name": "simple-httpd-service",
  "version": "1.0.0",
  "lifetime": "3600",
  "components": [
    {
      "name": "httpd",
      "number_of_containers": 2,
      "launch_command": "/usr/bin/run-httpd",
      "artifact": {
        "id": "centos/httpd-24-centos7:latest",
        "type": "DOCKER"
      },
      "resource": {
        "cpus": 1,
        "memory": "1024"
      },
      ...
> yarn app -launch simple-httpd-service \
    simple-httpd-service.json
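
A spec like this can also be generated rather than written by hand. The Python sketch below rebuilds only the fields visible on the slide and prints them as JSON. The function names and the pre-flight check are hypothetical: the check mirrors the fields shown here and is not YARN's own validation, and the part of the spec the slide elides is left out.

```python
import json
from typing import List


def httpd_service_spec(containers: int = 2) -> dict:
    """Service spec containing only the fields shown on the slide."""
    return {
        "name": "simple-httpd-service",
        "version": "1.0.0",
        "lifetime": "3600",
        "components": [
            {
                "name": "httpd",
                "number_of_containers": containers,
                "launch_command": "/usr/bin/run-httpd",
                "artifact": {
                    "id": "centos/httpd-24-centos7:latest",
                    "type": "DOCKER",
                },
                "resource": {"cpus": 1, "memory": "1024"},
            }
        ],
    }


def check_spec(spec: dict) -> List[str]:
    """Hypothetical pre-flight check: report slide fields that are absent."""
    problems: List[str] = []
    for key in ("name", "version", "components"):
        if key not in spec:
            problems.append(f"missing top-level field: {key}")
    for component in spec.get("components", []):
        for key in ("name", "number_of_containers", "artifact", "resource"):
            if key not in component:
                problems.append(f"component missing field: {key}")
    return problems


spec = httpd_service_spec()
print(check_spec(spec))
print(json.dumps(spec, indent=2))
```

The printed JSON could be saved to a file and passed to the yarn app -launch command shown above.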

YARN Services Docker Httpd Example continued
      "readiness_check": {
        "type": "HTTP",
        "properties": {
          "url": "http://${THIS_HOST}:8080"
        }
      },
      "configuration": {
        "files": [
          {
            "type": "TEMPLATE",
            "dest_file": "/var/www/html/index.html",
            "properties": {
              "content": "<html><body>Hello from ${COMPONENT_INSTANCE_NAME}!</body></html>"
            }
          }
        ]
      }

Service assembly
{
  "name": "httpd-proxy-service",
  "version": "1.0.0",
  "components": [
    {
      "artifact": {
        "id": "simple-httpd-service",
        "type": "SERVICE"
      }
    },
    {
      "name": "httpd-proxy",
      "number_of_containers": 1,
      "dependencies": [ "httpd" ],
      "artifact": {
        "id": "centos/httpd-24-centos7",
        "type": "DOCKER"
      }, ...
> yarn app -save simple-httpd-service \
    simple-httpd-service.json
> yarn app -launch httpd-proxy-service \
    httpd-proxy-service.json
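
The "dependencies" field implies a start order: httpd-proxy should come up only after httpd. The Python sketch below shows that ordering with a topological sort. It illustrates the idea only and is not how the YARN Services AM is implemented; the component list is a simplified stand-in for the assembled service, and start_order is a hypothetical helper.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List


def start_order(components: List[Dict]) -> List[str]:
    """Order component names so each one follows its dependencies."""
    sorter = TopologicalSorter()
    for component in components:
        sorter.add(component["name"], *component.get("dependencies", []))
    return list(sorter.static_order())


# Simplified stand-in for the assembled service: "httpd" comes from
# simple-httpd-service, "httpd-proxy" from the assembly spec.
components = [
    {"name": "httpd-proxy", "dependencies": ["httpd"]},
    {"name": "httpd"},
]
print(start_order(components))
```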

New Feature: GPU/FPGA Support on YARN

Basics: Why GPU?
• GPU: many cores that handle massive (but simple) computation tasks simultaneously
[Figure: GPU vs. CPU comparison; labels: GPU, Computation Intensive, Other]
Without GPU support, researchers and engineers would wait an impractically long time for jobs to finish.

GPU support on YARN: Overview
• Prerequisites for users
  • Only NVIDIA GPUs are supported.
  • Purchase GPUs and install the GPU driver
  • Using Docker?
    • Yes: provide a suitable Docker image.
    • No: install the required libraries such as cuDNN and CUDA.
• For end users:
  • With Ambari: open the Ambari page and enable GPU in the YARN config
  • Without Ambari:
    • https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/UsingGpus.html

GPU support on YARN
• Why?
  • No need to set up separate clusters
  • Leverage shared compute
• Why is isolation needed?
  • Multiple processes sharing a single GPU will:
    • Be serialized.
    • Easily cause OOM.
• GPU isolation on YARN:
  • Granularity is per GPU device.
  • Uses cgroups / Docker to enforce the isolation.
[Figure: Docker containers (TensorFlow 1.2; Nginx app on Ubuntu 14.04) and an Nginx app on the host OS, with GPU Base Lib v1 supplied via volume mount and CUDA Library 5.0]
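
Per-device granularity means a container receives whole GPUs and never shares one with another container. A minimal Python sketch of that bookkeeping follows. GpuAllocator is a hypothetical class, not a YARN API, and the actual enforcement through cgroups or Docker is outside what this sketch models.

```python
from typing import Dict, Iterable, List


class GpuAllocator:
    """Hands out whole GPU devices so no two containers share one."""

    def __init__(self, device_ids: Iterable[int]) -> None:
        self._free: List[int] = list(device_ids)
        self._owned: Dict[str, List[int]] = {}  # container id -> devices

    def allocate(self, container_id: str, count: int) -> List[int]:
        if count > len(self._free):
            raise RuntimeError("not enough free GPU devices")
        devices = [self._free.pop(0) for _ in range(count)]
        self._owned.setdefault(container_id, []).extend(devices)
        return devices

    def release(self, container_id: str) -> None:
        # Devices return to the free list when the container finishes.
        self._free.extend(self._owned.pop(container_id, []))


allocator = GpuAllocator([0, 1])
first = allocator.allocate("container_1", 1)
second = allocator.allocate("container_2", 1)
print(first, second)
```

A third request would fail until one of the two containers releases its device.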

FPGA support on YARN
• FPGA isolation on YARN:
  • Resources are guaranteed per FPGA device
  • Isolation via cgroups
  • Currently only the Intel OpenCL SDK for FPGA is supported; the design allows extension to other FPGA SDKs

Spark on Docker in YARN
• Apache Spark applications have complex dependencies
• Docker on YARN solves this package isolation issue
  • PySpark – Python version, packages
  • R – packages

All running on the same YARN platform
[Figure: nodes labeled 128 G, several running LLAP, alongside GPU nodes]

Latest Features in Apache HBase 2.0

Self-Introduction
Toshihiro Suzuki (鈴木俊裕)
• Hortonworks
• Sr. Software Engineer, Breakfix
• Works on the support team
• Troubleshooting escalated support tickets
• Product bug fixes (mainly HBase/Phoenix)
• HBase committer
• Author of the book 「HBase徹底入門」
• Part-time lecturer at Chuo University
• Twitter: @brfrn169

About HBase 2.0
• Released in April 2018
• Starting with HDP 3.0, HBase receives a major version upgrade to 2.0

HBase 2.0 new features covered in this talk
• Procedure v2
• Accordion
• Offheap read/write path

Procedure v2
• HBASE-12439 Procedure V2
• Failure handling during Master operations (create/drop table, region assign, split, and so on) was incomplete
• Procedure v2 is a framework for solving these problems

Procedure v2
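• Incomplete failure handling
Create Table Handler:
  • Create regions on fs
  • Add regions to META
  • Assign
  • cpHost.postCreateHandler() -> (ACLs)
If a crash occurs between steps, the table is left in a half-finished state.
Example: the data exists on HDFS but not in META. In this case, hbck must be used to recover.
If a crash occurs in the middle of a step, hbck may be unable to recover, depending on the case.
Example: a crash while the data is being created on HDFS. In this case hbck cannot recover.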

Procedure v2
• Solution: rewrite every operation to use a state machine
Create Table Handler:
  • Create regions on fs
  • Add regions to META
  • Assign
  • cpHost.postCreateHandler() -> (ACLs)
• The state of every executed step is saved in a WAL-based state store
• If the machine crashes, the state can be recovered from there
• Rollback handling can be defined
• A step can also invoke sub-procedures

Procedure v2
• Example: CreateTableProcedure
Start → PRE_OPERATION → WRITE_FS_LAYOUT → ADD_TO_META → ASSIGN_REGIONS → UPDATE_DESC_CACHE → POST_OPERATION → End
• State is saved after each step, so on failure the procedure can retry from that state.
• Rollback handling for failures can be defined per step.
• A step can invoke sub-procedures. The ASSIGN_REGIONS step calls AssignProcedure.
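
The resume-after-crash behaviour can be sketched in a few lines of Python. This is an illustration under stated assumptions, not HBase code: ProcedureStore is a hypothetical in-memory stand-in for the WAL-based state store, run_procedure is a hypothetical driver, and rollback and sub-procedures are left out. Only the state names come from the slide.

```python
import json
from typing import Callable, Dict, List, Optional

STATES = ["PRE_OPERATION", "WRITE_FS_LAYOUT", "ADD_TO_META",
          "ASSIGN_REGIONS", "UPDATE_DESC_CACHE", "POST_OPERATION"]


class ProcedureStore:
    """Append-only log of completed steps (stand-in for the WAL store)."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def append(self, proc_id: str, state: str) -> None:
        self.log.append(json.dumps({"proc": proc_id, "done": state}))

    def last_done(self, proc_id: str) -> Optional[str]:
        done = None
        for line in self.log:
            record = json.loads(line)
            if record["proc"] == proc_id:
                done = record["done"]
        return done


def run_procedure(proc_id: str, store: ProcedureStore,
                  steps: Dict[str, Callable[[], None]]) -> List[str]:
    """Run the remaining steps, persisting progress after each one."""
    last = store.last_done(proc_id)
    start = STATES.index(last) + 1 if last else 0
    executed = []
    for state in STATES[start:]:
        steps[state]()                 # may raise, simulating a crash
        store.append(proc_id, state)   # recorded only once the step is done
        executed.append(state)
    return executed


calls: List[str] = []
crash = {"armed": True}


def make_step(name: str) -> Callable[[], None]:
    def step() -> None:
        if name == "ASSIGN_REGIONS" and crash["armed"]:
            crash["armed"] = False
            raise RuntimeError("simulated crash")
        calls.append(name)
    return step


steps = {name: make_step(name) for name in STATES}
store = ProcedureStore()
try:
    run_procedure("create-table-t1", store, steps)
except RuntimeError:
    pass  # the first attempt dies in ASSIGN_REGIONS
resumed = run_procedure("create-table-t1", store, steps)
print(resumed)
```

The second call skips the three steps already recorded in the store and continues from ASSIGN_REGIONS, so each step runs exactly once.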

Procedure v2
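• Summary
• Procedure v2 is a mechanism that improves the reliability of Master operations
• Until now, failures such as a RegionServer or Master going down sometimes required running hbck or manual recovery; Procedure v2 is expected to improve this considerably
• The source code has also become considerably simpler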

Accordion
• HBASE-14918 In-Memory MemStore Flush and Compaction
• Introduction of the Compacting MemStore
• In-memory flush and in-memory compaction
• Reduces flushes to disk
• Improves memory efficiency
• More efficient MemStore data structures
• ConcurrentSkipListMap is partly replaced with simpler data structures (CellArrayMap, CellChunkMap)
• See the Qiita article for further details
• https://qiita.com/brfrn169/items/1fc596f0c5070f9be091

Accordion
• Introduction of CompactingMemStore

Accordion
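• Summary
• Previously, flush and compaction applied only to disk; with CompactingMemStore they can now also happen in memory
• Fewer flushes to disk
• Better write performance
• Fewer compactions as well, which removes a factor that hinders read and write performance
• Lower disk usage
• BlockCache space usage is reduced and its efficiency improves, so the cache hit rate rises and read performance improves
• In short, the whole system can be made faster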
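
The flow of in-memory flush followed by in-memory compaction can be sketched as below. This is a deliberately simplified Python model, not the HBase implementation: the class and method names are hypothetical, only the newest value per key is kept, and versions, timestamps, and deletes are ignored.

```python
from typing import Dict, List, Tuple

Cell = Tuple[str, int, str]  # (row key, sequence number, value)


class CompactingMemStoreSketch:
    def __init__(self, active_limit: int) -> None:
        self.active_limit = active_limit
        self.active: Dict[str, Tuple[int, str]] = {}  # mutable segment
        self.pipeline: List[List[Cell]] = []          # immutable segments
        self.disk_flushes = 0
        self._seq = 0

    def put(self, key: str, value: str) -> None:
        self._seq += 1
        self.active[key] = (self._seq, value)
        if len(self.active) >= self.active_limit:
            self.in_memory_flush()

    def in_memory_flush(self) -> None:
        """Move the active segment into the pipeline as a sorted array."""
        if self.active:
            segment = sorted((k, s, v) for k, (s, v) in self.active.items())
            self.pipeline.append(segment)
            self.active = {}

    def in_memory_compaction(self) -> None:
        """Merge pipeline segments, keeping the newest cell per key."""
        latest: Dict[str, Tuple[int, str]] = {}
        for segment in self.pipeline:
            for key, seq, value in segment:
                if key not in latest or seq > latest[key][0]:
                    latest[key] = (seq, value)
        if latest:
            self.pipeline = [sorted((k, s, v) for k, (s, v) in latest.items())]

    def flush_to_disk(self) -> List[Tuple[str, str]]:
        """Write everything out once; returns the cells that hit disk."""
        self.in_memory_flush()
        self.in_memory_compaction()
        cells = self.pipeline[0] if self.pipeline else []
        self.pipeline = []
        self.disk_flushes += 1
        return [(key, value) for key, _, value in cells]


memstore = CompactingMemStoreSketch(active_limit=2)
for key, value in [("a", "1"), ("b", "1"), ("a", "2"), ("c", "1")]:
    memstore.put(key, value)
segments_before = len(memstore.pipeline)
memstore.in_memory_compaction()
segments_after = len(memstore.pipeline)
flushed = memstore.flush_to_disk()
print(segments_before, segments_after, flushed, memstore.disk_flushes)
```

Four puts produce two in-memory segments and no disk I/O. Compaction merges them and drops the overwritten value of "a", so the single disk flush writes three cells instead of four.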

Offheap read/write path
• HBASE-11425 Cell/DBB end-to-end on the read-path
• HBASE-15179 Cell/DBB end-to-end on the write-path
• An improvement that uses off-heap memory to shrink the JVM heap
• A heap that is too large affects GC
• Reduces memory copies to improve performance

• Offheap read path
• HBase has a BlockCache to improve read performance
• HBase cache (when BucketCache is used)
  • L1
    • on-heap
    • Stores index and Bloom filter blocks
    • Smaller than L2
  • L2
    • off-heap / file
    • Stores data blocks

• Focusing on L2
• In the previous implementation, the cached data itself lived off-heap, but it was copied onto the heap when actually read
• Extra GC
• Latency impact

• Changed to use the off-heap data directly, without copying
• Lower heap usage
• Higher throughput
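
The difference between copying a cached block and reading it in place can be shown with an analogy in Python. This is only an analogy: Python has no JVM heap or off-heap distinction, and the bytearray below merely stands in for a cached block. The old path corresponds to making a copy, the new path to a view over the same memory.

```python
# Stand-in for a data block held in the cache.
cache_block = bytearray(b"row1:value1|row2:value2")

# Old read path: copy the cell out of the block before using it.
copied = bytes(cache_block[5:11])

# New read path: a view over the same memory, with no copy made.
view = memoryview(cache_block)[5:11]

# Changing the underlying block shows which of the two shares memory.
cache_block[5] = ord("V")

print(copied)       # the copy still holds the old bytes
print(bytes(view))  # the view reflects the block's current contents
```

Every copy is a new allocation that the garbage collector must later reclaim, which is the cost the off-heap read path avoids.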

• Offheap write path
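• Previously, when a write request arrived, its data (including the actual data to be written) was copied onto the heap
• This was changed to use off-heap memory so that no unnecessary copy occurs
• (Combined with Async WAL) the data stays off-heap with no copies from the RPC request until it is written to HDFS as WAL
• GC can be reduced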

• MSLAB Chunk Pool
• A pool of 2 MB chunks is reserved on-heap
• Data in the MemStore is written into these chunks
• Chunks are reused
• This prevents memory fragmentation
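
A chunk pool of this kind can be sketched as follows. The Python below is a simplified model with hypothetical class names, not HBase's MSLAB code, and the demo uses 16-byte chunks instead of 2 MB to stay small. Cells are copied into large fixed-size chunks, and chunks go back to the pool for reuse rather than being freed one by one.

```python
from typing import List, Tuple


class ChunkPool:
    """Pre-allocates fixed-size chunks and hands them out for reuse."""

    def __init__(self, chunk_size: int, initial_chunks: int) -> None:
        self.chunk_size = chunk_size
        self._free = [bytearray(chunk_size) for _ in range(initial_chunks)]
        self.allocated_new = 0  # chunks created because the pool was empty

    def get(self) -> bytearray:
        if self._free:
            return self._free.pop()
        self.allocated_new += 1
        return bytearray(self.chunk_size)

    def put_back(self, chunk: bytearray) -> None:
        self._free.append(chunk)


class Mslab:
    """Copies cell data into pooled chunks, one after another."""

    def __init__(self, pool: ChunkPool) -> None:
        self.pool = pool
        self.chunks: List[bytearray] = [pool.get()]
        self.offset = 0

    def copy_cell(self, data: bytes) -> Tuple[int, int, int]:
        """Returns (chunk index, start offset, length) of the stored cell."""
        if len(data) > self.pool.chunk_size:
            raise ValueError("cell larger than a chunk")
        if self.offset + len(data) > self.pool.chunk_size:
            self.chunks.append(self.pool.get())  # current chunk is full
            self.offset = 0
        start = self.offset
        self.chunks[-1][start:start + len(data)] = data
        self.offset += len(data)
        return (len(self.chunks) - 1, start, len(data))

    def close(self) -> None:
        """Return every chunk to the pool, e.g. after a flush."""
        for chunk in self.chunks:
            self.pool.put_back(chunk)
        self.chunks = []


pool = ChunkPool(chunk_size=16, initial_chunks=1)
lab = Mslab(pool)
loc_a = lab.copy_cell(b"aaaaaaaaaa")
loc_b = lab.copy_cell(b"bbbbbbbbbb")  # does not fit, so a new chunk is taken
old_chunks = list(lab.chunks)
lab.close()
lab2 = Mslab(pool)                    # reuses a returned chunk
print(loc_a, loc_b, pool.allocated_new)
```

Because cells live inside a small number of large, recycled buffers, the memory manager deals with a few big objects instead of many small ones of varying size.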

• Placing MSLAB off-heap
• 2 MB off-heap chunks
• Data goes off-heap, metadata stays on-heap (about 100 bytes of overhead each)
• Placing it off-heap makes it possible to give the MemStore more memory than before

• Summary
• Shrink the JVM heap to reduce the impact of GC
• Using off-heap memory makes it possible to use large amounts of memory
• Remove unnecessary copies to lower GC frequency
• Better performance

Other New Features and Changes
• JDK 8 only
• Hadoop-2.7+ と Hadoop-3 対応
• Client API Changes
• Filter and Coprocessor Changes
• Better dependency management
• Async Client
• Netty Server/Client
• Assignment Manager v2

Other New Features and Changes (already backported to HDP 2.x)
• MOB (Medium Object Blobs)
• Region Server Groups
• Spark Module
• WAL and HFiles in different FileSystem
• FileSystem Quotas
• Backup/Restore (only in HDP)

Summary
• HBase 2.0
• Better performance and scalability
• Better reliability
• Many bug fixes
• HBase 2.0 is available starting with HDP 3.0, so please give it a try, and let's share what we learn!

Thank you
