SlideShare uma empresa Scribd logo
1 de 30
Baixar para ler offline
Hive/Spark/HBase on S3 & NFS
– No More HDFS Operation
( / Yifeng Jiang)
@uprush
March 14th, 2019
Yifeng Jiang
Yifeng Jiang / / ( )
• APJ Solution Architect, Data Science @ Pure Storage
• Big data, machine learning, cloud, PaaS, web systems
Prior to Pure
• Hadooper since 2009
• HBook author
• Software engineer, PaaS, Cloud
Agenda
• Separate compute and storage
• Hive & Spark on S3 with S3A Committers
• HBase on NFS
• Demo, benchmark and case study
Hadoop Common Pain Points
• Hardware complexity: racks of server, cables, power
• Availability SLA’s / HDFS operation
• Unbalanced CPU and storage demands
• Software complexity: 20+ components
• Performance Tuning
Separate Compute & Storage
• Hardware complexity: racks of server, cables, power
• Availability SLA’s / HDFS operation
• Unbalanced CPU and storage demands
• Software complexity: 20+ components
• Performance Tuning
Virtual machine
S3/NFS
Scale independently
Modernizing Data Analytics
DATA INGEST
Search
NFS
ETL
S3
QueryStore
Deep Learning
Storage Backend
Cluster Topology
node1
NAS/Object Storage
NFSS3
SAN Block Storage
iSCSI/FC
node2 nodeN…
Hadoop/Spark Cluster
On Virtual Machine
• OS
• Hadoop binary, HDFS
• SAN volume
• HTTP only
• Data lake, file
• Spark, Hive
• Same mount point
on nodes
• HBase
S3A NFSXFS/Ext4
Hive/Spark on S3
Hadoop S3a Library
Hadoop DFS protocol
• The communication protocol between NN, DN and client
• Default implementation: HDFS
• Other implementations: S3a, Azure, local FS, etc.
Hadoop S3A library
• Hadoop DFS protocol implementation for S3 compatible
storage like Amazon S3, Pure Storage FlashBlade.
• Enable the Hadoop ecosystem (Spark, Hive, MR, etc.) to
store and process data in S3 object storage.
• Several years in production, heavily used on cloud.
Hadoop S3A Demo
THE ALL-FLASH DATA HUB FOR MODERN ANALYTICS
10+ PBs
/ RACK
DENSITY
FILE & OBJECT
CONVERGED
75 GB/S BW
7.5 M+ NFS OPS
BIG + FAST
JUST ADD
A BLADE!
SIMPLE ELASTIC SCALE
Hadoop Ecosystem and S3
How it works?
• Hadoop ecosystem (Spark, Hive, MR, etc.) uses HDFS client internally.
• Spark executor -> HDFS client -> storage
• Hive on Tez container -> HDFS client -> storage
• HDFS client speaks Hadoop DFS protocol.
• Client automatically choose proper implementation to use base on schema.
• /user/joe/data -> HDFS
• file:///user/joe/data -> local FS (including NFS)
• s3a://user/joe/data -> S3A
• Exception: HBase (details covered later)
Spark on S3
Spark submit
• Changes: use s3a:// as input/output.
• Temporary data I/O: YARN -> HDFS
• User data I/O: Spark executors -> S3A -> S3
YARN RM
spark-
submit
YARN
Container
HDFS
S3
temp data
val flatJSON =
sc.textFile("s3a://deephub/tweets/")
Hadoop on S3 Challenges
Consistency model
• Amazon S3: eventual consistency, use S3Guard
• S3 compatible storage (e.g. FlashBlade S3) supports strong consistency
Slow “rename”
• “rename” is critical in HDFS to support atomic commits, like Linux “mv”
• S3 does not support “rename” natively.
• S3A simulates “rename” as LIST – COPY - DELETE
Slow S3 “Rename”
Image source: https://stackoverflow.com/questions/42822483/extremely-slow-s3-write-times-from-emr-spark/42835927
Hadoop on S3 Updates
Make S3A cloud native
• Hundreds of JIRAs
• Robustness, scale and
performance
• S3Guard
• Zero-rename S3A committers.
Use Hadoop 3.1~
S3A Committers
Originally S3A use FileOutputCommitter, which relays on “rename”
Zero-rename, cloud-native S3A Committers
• Staging committer: directory & partitioned
• Magic committer
S3A Committers
Staging Committer
• Does not require strong consistency.
• Proven to work at Netflix.
• Requires large local FS space.
Magic committer
• Require strong consistency. S3Guard on
AWS, FlashBlade S3, etc.
• Faster. Use less local local FS space.
• Less stable/tested than staging
committer.
Common Key Points
• Fast, no “rename”
• Both leverage S3 transactional multi-parts upload
S3A Committer Demo
S3A Committer Benchmark
• 1TB MR random text writes
• FlashBlade S3
• 15 blades
• Plenty compute nodes
0
200
400
600
800
1000
1200
file directory magic
time(s)
Output Committer
1TB Genwords Bench
Teragen Benchmark
Magic committer
faster
File committer
slow
Staging committer
fast
S3 ReadLocal FS ReadLess local FS
Read
HBase on NFS
The peace of volcano
HBase & HDFS
What HBase want from HDFS?
• HFile & WAL durability
• Scale & performance
• Mostly latency, but also
throughput
What HBase does NOT want from HDFS?
• Noisy neighbors
• Co-locating compute: YARN, Spark
• Co-locating data: Hive data
warehouse
• Complexity: operation & upgrade
HBase on NFS
How it works?
• Use NFS as HBase root and staging directory.
• Same mount NFS point on all RegionServer nodes
• Point HBase to store data in that mount point
• Leverage HDFS local FS implementation (file:///mnt/hbase)
No change on application.
• Clients only see HBase tables.
HMasterclient
Region
Server
NFS
Table API
HFile Durability
• HDFS uses 3x replication to protect HFile.
• HFile replication is not necessary in enterprise NFS
• Erasure coding or RAID like data protection within/across storage
array
• Amazon EFS stores data within and across multiple AZs
• FlashBlade supports N+2 data durability and high availability
HBase WAL Durability
• HDFS uses “hflush/hsync” API to ensure WAL is safely flushed to multiple
data nodes before acknowledging clients.
• Not necessary in enterprise NFS
• FlashBlade acknowledges writes after data is persisted in NVRAM on 3
blades.
NFS Performance for HBase
• Depend on NFS implementation.
• NFS is generally good for random access.
• Also check throughput.
• Flash storage is ideal for HBase.
• All-flash scale-out NFS such as Pure Storage FlashBlade
HBase PE on FlashBlade NFS
random writes
1RS, 1M rows/client,
10 clients
7 blades
random reads
1RS, 100K rows/client,
20 clients
7 blades
Memstore flush
storm?
Block cache affects
result
Latency seen by
storage, stable
HBase PE on Amazon EFS
random writes
1RS, 1M rows/client,
10 clients
1024MB/s provisioned
throughput EFS
random reads
1RS, 100K rows/client,
20 clients
1024MB/s provisioned
throughput EFS
Region too busy,
memstore flush is slow
Key Takeaways
• Storage options for cloud-era Hadoop/Spark
• Hive & Spark on S3 with cloud-native S3A Committers
• HBase on enterprise NFS
• Available in cloud and on premise (Pure Storage FlashBlade)
• Additional benefits: always-on compression, encryption, etc.
• Proven to work
• Simple, reliable, performant
• No more HDFS operation
• Virtualize your Hadoop/Spark cluster

Mais conteúdo relacionado

Mais procurados

Grafana optimization for Prometheus
Grafana optimization for PrometheusGrafana optimization for Prometheus
Grafana optimization for PrometheusMitsuhiro Tanda
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Hadoop -ResourceManager HAの仕組み-
Hadoop -ResourceManager HAの仕組み-Hadoop -ResourceManager HAの仕組み-
Hadoop -ResourceManager HAの仕組み-Yuki Gonda
 
Improving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of ServiceImproving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of ServiceDataWorks Summit
 
Getting Started: Intro to Telegraf - July 2021
Getting Started: Intro to Telegraf - July 2021Getting Started: Intro to Telegraf - July 2021
Getting Started: Intro to Telegraf - July 2021InfluxData
 
Imply at Apache Druid Meetup in London 1-15-20
Imply at Apache Druid Meetup in London 1-15-20Imply at Apache Druid Meetup in London 1-15-20
Imply at Apache Druid Meetup in London 1-15-20Jelena Zanko
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotXiang Fu
 
Hadoopの概念と基本的知識
Hadoopの概念と基本的知識Hadoopの概念と基本的知識
Hadoopの概念と基本的知識Ken SASAKI
 
Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)Brian Brazil
 
Fail-Safe Cluster for FirebirdSQL and something more
Fail-Safe Cluster for FirebirdSQL and something moreFail-Safe Cluster for FirebirdSQL and something more
Fail-Safe Cluster for FirebirdSQL and something moreAlexey Kovyazin
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
gVisor, Kata Containers, Firecracker, Docker: Who is Who in the Container Space?
gVisor, Kata Containers, Firecracker, Docker: Who is Who in the Container Space?gVisor, Kata Containers, Firecracker, Docker: Who is Who in the Container Space?
gVisor, Kata Containers, Firecracker, Docker: Who is Who in the Container Space?ArangoDB Database
 
Presto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoringPresto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoringTaro L. Saito
 
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -Yoshiyasu SAEKI
 

Mais procurados (20)

File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 
Grafana optimization for Prometheus
Grafana optimization for PrometheusGrafana optimization for Prometheus
Grafana optimization for Prometheus
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Hadoop -ResourceManager HAの仕組み-
Hadoop -ResourceManager HAの仕組み-Hadoop -ResourceManager HAの仕組み-
Hadoop -ResourceManager HAの仕組み-
 
Improving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of ServiceImproving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of Service
 
Getting Started: Intro to Telegraf - July 2021
Getting Started: Intro to Telegraf - July 2021Getting Started: Intro to Telegraf - July 2021
Getting Started: Intro to Telegraf - July 2021
 
Druid deep dive
Druid deep diveDruid deep dive
Druid deep dive
 
Imply at Apache Druid Meetup in London 1-15-20
Imply at Apache Druid Meetup in London 1-15-20Imply at Apache Druid Meetup in London 1-15-20
Imply at Apache Druid Meetup in London 1-15-20
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache Pinot
 
Hadoopの概念と基本的知識
Hadoopの概念と基本的知識Hadoopの概念と基本的知識
Hadoopの概念と基本的知識
 
Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)
 
Fail-Safe Cluster for FirebirdSQL and something more
Fail-Safe Cluster for FirebirdSQL and something moreFail-Safe Cluster for FirebirdSQL and something more
Fail-Safe Cluster for FirebirdSQL and something more
 
nginxの紹介
nginxの紹介nginxの紹介
nginxの紹介
 
Hiveを高速化するLLAP
Hiveを高速化するLLAPHiveを高速化するLLAP
Hiveを高速化するLLAP
 
Apache Hadoopの新機能Ozoneの現状
Apache Hadoopの新機能Ozoneの現状Apache Hadoopの新機能Ozoneの現状
Apache Hadoopの新機能Ozoneの現状
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
gVisor, Kata Containers, Firecracker, Docker: Who is Who in the Container Space?
gVisor, Kata Containers, Firecracker, Docker: Who is Who in the Container Space?gVisor, Kata Containers, Firecracker, Docker: Who is Who in the Container Space?
gVisor, Kata Containers, Firecracker, Docker: Who is Who in the Container Space?
 
Presto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoringPresto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoring
 
Apache Kafka Security
Apache Kafka Security Apache Kafka Security
Apache Kafka Security
 
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
 

Semelhante a Hive/Spark/HBase on S3 & NFS - No More HDFS Operation

HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...Michael Stack
 
The Pendulum Swings Back: Converged and Hyperconverged Environments
The Pendulum Swings Back: Converged and Hyperconverged EnvironmentsThe Pendulum Swings Back: Converged and Hyperconverged Environments
The Pendulum Swings Back: Converged and Hyperconverged EnvironmentsTony Pearson
 
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMRAmazon Web Services
 
Large-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestLarge-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestHBaseCon
 
AWS Cloud experience concepts tips and tricks
AWS Cloud experience concepts tips and tricksAWS Cloud experience concepts tips and tricks
AWS Cloud experience concepts tips and tricksDirk Harms-Merbitz
 
Introduction and HDInsight best practices
Introduction and HDInsight best practicesIntroduction and HDInsight best practices
Introduction and HDInsight best practicesAshish Thapliyal
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Uwe Printz
 
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsBig Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsEsther Kundin
 
Spectrum Scale Unified File and Object with WAN Caching
Spectrum Scale Unified File and Object with WAN CachingSpectrum Scale Unified File and Object with WAN Caching
Spectrum Scale Unified File and Object with WAN CachingSandeep Patil
 
Software Defined Analytics with File and Object Access Plus Geographically Di...
Software Defined Analytics with File and Object Access Plus Geographically Di...Software Defined Analytics with File and Object Access Plus Geographically Di...
Software Defined Analytics with File and Object Access Plus Geographically Di...Trishali Nayar
 
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMF
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMFGestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMF
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMFSUSE Italy
 
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsBig Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsEsther Kundin
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudQubole
 
Hadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanHadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanNarayana B
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)VMware Tanzu
 
HBaseConAsia2018 Track3-2: HBase at China Telecom
HBaseConAsia2018 Track3-2:  HBase at China TelecomHBaseConAsia2018 Track3-2:  HBase at China Telecom
HBaseConAsia2018 Track3-2: HBase at China TelecomMichael Stack
 

Semelhante a Hive/Spark/HBase on S3 & NFS - No More HDFS Operation (20)

HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
 
HDFCloud Workshop: HDF5 in the Cloud
HDFCloud Workshop: HDF5 in the CloudHDFCloud Workshop: HDF5 in the Cloud
HDFCloud Workshop: HDF5 in the Cloud
 
The Pendulum Swings Back: Converged and Hyperconverged Environments
The Pendulum Swings Back: Converged and Hyperconverged EnvironmentsThe Pendulum Swings Back: Converged and Hyperconverged Environments
The Pendulum Swings Back: Converged and Hyperconverged Environments
 
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
 
Large-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestLarge-scale Web Apps @ Pinterest
Large-scale Web Apps @ Pinterest
 
AWS Cloud experience concepts tips and tricks
AWS Cloud experience concepts tips and tricksAWS Cloud experience concepts tips and tricks
AWS Cloud experience concepts tips and tricks
 
Introduction and HDInsight best practices
Introduction and HDInsight best practicesIntroduction and HDInsight best practices
Introduction and HDInsight best practices
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
Masterclass Live: Amazon EMR
Masterclass Live: Amazon EMRMasterclass Live: Amazon EMR
Masterclass Live: Amazon EMR
 
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsBig Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
 
Spectrum Scale Unified File and Object with WAN Caching
Spectrum Scale Unified File and Object with WAN CachingSpectrum Scale Unified File and Object with WAN Caching
Spectrum Scale Unified File and Object with WAN Caching
 
Software Defined Analytics with File and Object Access Plus Geographically Di...
Software Defined Analytics with File and Object Access Plus Geographically Di...Software Defined Analytics with File and Object Access Plus Geographically Di...
Software Defined Analytics with File and Object Access Plus Geographically Di...
 
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMF
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMFGestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMF
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMF
 
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsBig Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
 
Azure DBA with IaaS
Azure DBA with IaaSAzure DBA with IaaS
Azure DBA with IaaS
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public Cloud
 
Hadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanHadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_Plan
 
Azure Databases with IaaS
Azure Databases with IaaSAzure Databases with IaaS
Azure Databases with IaaS
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
 
HBaseConAsia2018 Track3-2: HBase at China Telecom
HBaseConAsia2018 Track3-2:  HBase at China TelecomHBaseConAsia2018 Track3-2:  HBase at China Telecom
HBaseConAsia2018 Track3-2: HBase at China Telecom
 

Mais de Yifeng Jiang

introduction-to-apache-kafka
introduction-to-apache-kafkaintroduction-to-apache-kafka
introduction-to-apache-kafkaYifeng Jiang
 
Hive2 Introduction -- Interactive SQL for Big Data
Hive2 Introduction -- Interactive SQL for Big DataHive2 Introduction -- Interactive SQL for Big Data
Hive2 Introduction -- Interactive SQL for Big DataYifeng Jiang
 
Introduction to Streaming Analytics Manager
Introduction to Streaming Analytics ManagerIntroduction to Streaming Analytics Manager
Introduction to Streaming Analytics ManagerYifeng Jiang
 
HDF 3.0 IoT Platform for Everyone
HDF 3.0 IoT Platform for EveryoneHDF 3.0 IoT Platform for Everyone
HDF 3.0 IoT Platform for EveryoneYifeng Jiang
 
Hortonworks Data Cloud for AWS 1.11 Updates
Hortonworks Data Cloud for AWS 1.11 UpdatesHortonworks Data Cloud for AWS 1.11 Updates
Hortonworks Data Cloud for AWS 1.11 UpdatesYifeng Jiang
 
Introduction to Hortonworks Data Cloud for AWS
Introduction to Hortonworks Data Cloud for AWSIntroduction to Hortonworks Data Cloud for AWS
Introduction to Hortonworks Data Cloud for AWSYifeng Jiang
 
Real-time Analytics in Financial
Real-time Analytics in FinancialReal-time Analytics in Financial
Real-time Analytics in FinancialYifeng Jiang
 
sparksql-hive-bench-by-nec-hwx-at-hcj16
sparksql-hive-bench-by-nec-hwx-at-hcj16sparksql-hive-bench-by-nec-hwx-at-hcj16
sparksql-hive-bench-by-nec-hwx-at-hcj16Yifeng Jiang
 
Sub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scaleSub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scaleYifeng Jiang
 
Yifeng hadoop-present-public
Yifeng hadoop-present-publicYifeng hadoop-present-public
Yifeng hadoop-present-publicYifeng Jiang
 
Hive-sub-second-sql-on-hadoop-public
Hive-sub-second-sql-on-hadoop-publicHive-sub-second-sql-on-hadoop-public
Hive-sub-second-sql-on-hadoop-publicYifeng Jiang
 
Yifeng spark-final-public
Yifeng spark-final-publicYifeng spark-final-public
Yifeng spark-final-publicYifeng Jiang
 
Kinesis vs-kafka-and-kafka-deep-dive
Kinesis vs-kafka-and-kafka-deep-diveKinesis vs-kafka-and-kafka-deep-dive
Kinesis vs-kafka-and-kafka-deep-diveYifeng Jiang
 
Hive present-and-feature-shanghai
Hive present-and-feature-shanghaiHive present-and-feature-shanghai
Hive present-and-feature-shanghaiYifeng Jiang
 
Hadoop Present - Open Enterprise Hadoop
Hadoop Present - Open Enterprise HadoopHadoop Present - Open Enterprise Hadoop
Hadoop Present - Open Enterprise HadoopYifeng Jiang
 
Apache Hiveの今とこれから
Apache Hiveの今とこれからApache Hiveの今とこれから
Apache Hiveの今とこれからYifeng Jiang
 
Hadoop Trends & Hadoop on EC2
Hadoop Trends & Hadoop on EC2Hadoop Trends & Hadoop on EC2
Hadoop Trends & Hadoop on EC2Yifeng Jiang
 

Mais de Yifeng Jiang (20)

introduction-to-apache-kafka
introduction-to-apache-kafkaintroduction-to-apache-kafka
introduction-to-apache-kafka
 
Hive2 Introduction -- Interactive SQL for Big Data
Hive2 Introduction -- Interactive SQL for Big DataHive2 Introduction -- Interactive SQL for Big Data
Hive2 Introduction -- Interactive SQL for Big Data
 
Introduction to Streaming Analytics Manager
Introduction to Streaming Analytics ManagerIntroduction to Streaming Analytics Manager
Introduction to Streaming Analytics Manager
 
HDF 3.0 IoT Platform for Everyone
HDF 3.0 IoT Platform for EveryoneHDF 3.0 IoT Platform for Everyone
HDF 3.0 IoT Platform for Everyone
 
Hortonworks Data Cloud for AWS 1.11 Updates
Hortonworks Data Cloud for AWS 1.11 UpdatesHortonworks Data Cloud for AWS 1.11 Updates
Hortonworks Data Cloud for AWS 1.11 Updates
 
Spark Security
Spark SecuritySpark Security
Spark Security
 
Introduction to Hortonworks Data Cloud for AWS
Introduction to Hortonworks Data Cloud for AWSIntroduction to Hortonworks Data Cloud for AWS
Introduction to Hortonworks Data Cloud for AWS
 
Real-time Analytics in Financial
Real-time Analytics in FinancialReal-time Analytics in Financial
Real-time Analytics in Financial
 
sparksql-hive-bench-by-nec-hwx-at-hcj16
sparksql-hive-bench-by-nec-hwx-at-hcj16sparksql-hive-bench-by-nec-hwx-at-hcj16
sparksql-hive-bench-by-nec-hwx-at-hcj16
 
Nifi workshop
Nifi workshopNifi workshop
Nifi workshop
 
Sub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scaleSub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scale
 
Yifeng hadoop-present-public
Yifeng hadoop-present-publicYifeng hadoop-present-public
Yifeng hadoop-present-public
 
Hive-sub-second-sql-on-hadoop-public
Hive-sub-second-sql-on-hadoop-publicHive-sub-second-sql-on-hadoop-public
Hive-sub-second-sql-on-hadoop-public
 
Yifeng spark-final-public
Yifeng spark-final-publicYifeng spark-final-public
Yifeng spark-final-public
 
Kinesis vs-kafka-and-kafka-deep-dive
Kinesis vs-kafka-and-kafka-deep-diveKinesis vs-kafka-and-kafka-deep-dive
Kinesis vs-kafka-and-kafka-deep-dive
 
Hive present-and-feature-shanghai
Hive present-and-feature-shanghaiHive present-and-feature-shanghai
Hive present-and-feature-shanghai
 
Hadoop Present - Open Enterprise Hadoop
Hadoop Present - Open Enterprise HadoopHadoop Present - Open Enterprise Hadoop
Hadoop Present - Open Enterprise Hadoop
 
Apache Hiveの今とこれから
Apache Hiveの今とこれからApache Hiveの今とこれから
Apache Hiveの今とこれから
 
HDFS Deep Dive
HDFS Deep DiveHDFS Deep Dive
HDFS Deep Dive
 
Hadoop Trends & Hadoop on EC2
Hadoop Trends & Hadoop on EC2Hadoop Trends & Hadoop on EC2
Hadoop Trends & Hadoop on EC2
 

Último

Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 

Último (20)

Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 

Hive/Spark/HBase on S3 & NFS - No More HDFS Operation

  • 1. Hive/Spark/HBase on S3 & NFS – No More HDFS Operation ( / Yifeng Jiang) @uprush March 14th, 2019
  • 2. Yifeng Jiang Yifeng Jiang / / ( ) • APJ Solution Architect, Data Science @ Pure Storage • Big data, machine learning, cloud, PaaS, web systems Prior to Pure • Hadooper since 2009 • HBook author • Software engineer, PaaS, Cloud
  • 3. Agenda • Separate compute and storage • Hive & Spark on S3 with S3A Committers • HBase on NFS • Demo, benchmark and case study
  • 4. Hadoop Common Pain Points • Hardware complexity: racks of server, cables, power • Availability SLA’s / HDFS operation • Unbalanced CPU and storage demands • Software complexity: 20+ components • Performance Tuning
  • 5. Separate Compute & Storage • Hardware complexity: racks of server, cables, power • Availability SLA’s / HDFS operation • Unbalanced CPU and storage demands • Software complexity: 20+ components • Performance Tuning Virtual machine S3/NFS Scale independently
  • 6. Modernizing Data Analytics DATA INGEST Search NFS ETL S3 QueryStore Deep Learning Storage Backend
  • 7. Cluster Topology node1 NAS/Object Storage NFSS3 SAN Block Storage iSCSI/FC node2 nodeN… Hadoop/Spark Cluster On Virtual Machine • OS • Hadoop binary, HDFS • SAN volume • HTTP only • Data lake, file • Spark, Hive • Same mount point on nodes • HBase S3A NFSXFS/Ext4
  • 9. Hadoop S3a Library Hadoop DFS protocol • The communication protocol between NN, DN and client • Default implementation: HDFS • Other implementations: S3a, Azure, local FS, etc. Hadoop S3A library • Hadoop DFS protocol implementation for S3 compatible storage like Amazon S3, Pure Storage FlashBlade. • Enable the Hadoop ecosystem (Spark, Hive, MR, etc.) to store and process data in S3 object storage. • Several years in production, heavily used on cloud.
  • 11. THE ALL-FLASH DATA HUB FOR MODERN ANALYTICS 10+ PBs / RACK DENSITY FILE & OBJECT CONVERGED 75 GB/S BW 7.5 M+ NFS OPS BIG + FAST JUST ADD A BLADE! SIMPLE ELASTIC SCALE
  • 12. Hadoop Ecosystem and S3 How it works? • Hadoop ecosystem (Spark, Hive, MR, etc.) uses HDFS client internally. • Spark executor -> HDFS client -> storage • Hive on Tez container -> HDFS client -> storage • HDFS client speaks Hadoop DFS protocol. • Client automatically choose proper implementation to use base on schema. • /user/joe/data -> HDFS • file:///user/joe/data -> local FS (including NFS) • s3a://user/joe/data -> S3A • Exception: HBase (details covered later)
  • 13. Spark on S3 Spark submit • Changes: use s3a:// as input/output. • Temporary data I/O: YARN -> HDFS • User data I/O: Spark executors -> S3A -> S3 YARN RM spark- submit YARN Container HDFS S3 temp data val flatJSON = sc.textFile("s3a://deephub/tweets/")
  • 14. Hadoop on S3 Challenges Consistency model • Amazon S3: eventual consistency, use S3Guard • S3 compatible storage (e.g. FlashBlade S3) supports strong consistency Slow “rename” • “rename” is critical in HDFS to support atomic commits, like Linux “mv” • S3 does not support “rename” natively. • S3A simulates “rename” as LIST – COPY - DELETE
  • 15. Slow S3 “Rename” Image source: https://stackoverflow.com/questions/42822483/extremely-slow-s3-write-times-from-emr-spark/42835927
  • 16. Hadoop on S3 Updates Make S3A cloud native • Hundreds of JIRAs • Robustness, scale and performance • S3Guard • Zero-rename S3A committers. Use Hadoop 3.1~
  • 17. S3A Committers Originally S3A use FileOutputCommitter, which relays on “rename” Zero-rename, cloud-native S3A Committers • Staging committer: directory & partitioned • Magic committer
  • 18. S3A Committers Staging Committer • Does not require strong consistency. • Proven to work at Netflix. • Requires large local FS space. Magic committer • Require strong consistency. S3Guard on AWS, FlashBlade S3, etc. • Faster. Use less local local FS space. • Less stable/tested than staging committer. Common Key Points • Fast, no “rename” • Both leverage S3 transactional multi-parts upload
  • 20. S3A Committer Benchmark • 1TB MR random text writes • FlashBlade S3 • 15 blades • Plenty compute nodes 0 200 400 600 800 1000 1200 file directory magic time(s) Output Committer 1TB Genwords Bench
  • 21. Teragen Benchmark Magic committer faster File committer slow Staging committer fast S3 ReadLocal FS ReadLess local FS Read
  • 22. HBase on NFS The peace of volcano
  • 23. HBase & HDFS What HBase want from HDFS? • HFile & WAL durability • Scale & performance • Mostly latency, but also throughput What HBase does NOT want from HDFS? • Noisy neighbors • Co-locating compute: YARN, Spark • Co-locating data: Hive data warehouse • Complexity: operation & upgrade
  • 24. HBase on NFS How it works? • Use NFS as HBase root and staging directory. • Same mount NFS point on all RegionServer nodes • Point HBase to store data in that mount point • Leverage HDFS local FS implementation (file:///mnt/hbase) No change on application. • Clients only see HBase tables. HMasterclient Region Server NFS Table API
  • 25. HFile Durability • HDFS uses 3x replication to protect HFile. • HFile replication is not necessary in enterprise NFS • Erasure coding or RAID like data protection within/across storage array • Amazon EFS stores data within and across multiple AZs • FlashBlade supports N+2 data durability and high availability
  • 26. HBase WAL Durability • HDFS uses “hflush/hsync” API to ensure WAL is safely flushed to multiple data nodes before acknowledging clients. • Not necessary in enterprise NFS • FlashBlade acknowledges writes after data is persisted in NVRAM on 3 blades.
  • 27. NFS Performance for HBase • Depend on NFS implementation. • NFS is generally good for random access. • Also check throughput. • Flash storage is ideal for HBase. • All-flash scale-out NFS such as Pure Storage FlashBlade
  • 28. HBase PE on FlashBlade NFS random writes 1RS, 1M rows/client, 10 clients 7 blades random reads 1RS, 100K rows/client, 20 clients 7 blades Memstore flush storm? Block cache affects result Latency seen by storage, stable
  • 29. HBase PE on Amazon EFS random writes 1RS, 1M rows/client, 10 clients 1024MB/s provisioned throughput EFS random reads 1RS, 100K rows/client, 20 clients 1024MB/s provisioned throughput EFS Region too busy, memstore flush is slow
  • 30. Key Takeaways • Storage options for cloud-era Hadoop/Spark • Hive & Spark on S3 with cloud-native S3A Committers • HBase on enterprise NFS • Available in cloud and on premise (Pure Storage FlashBlade) • Additional benefits: always-on compression, encryption, etc. • Proven to work • Simple, reliable, performant • No more HDFS operation • Virtualize your Hadoop/Spark cluster