Storage Requirements and Options for Running Spark on Kubernetes

In a world of serverless computing, users tend to be frugal about spending on compute, storage, and other resources, and paying for resources that are not in use becomes a significant factor. Offering Spark as a service on the cloud presents unique challenges, and running Spark on Kubernetes adds more, especially around storage and persistence. Spark workloads have distinct storage requirements for intermediate data, long-term persistence, and a shared file system, and these requirements become much tighter when Spark is offered as a service to enterprises that must manage GDPR and other compliance obligations such as ISO 27001 and HIPAA certifications. This talk covers the challenges involved in providing serverless Spark clusters and shares the specific issues one can encounter when running large Kubernetes clusters in production, especially in scenarios related to persistence. It will help people who use Kubernetes or the Docker runtime in production understand the storage options available, which of them are more suitable for running Spark workloads on Kubernetes, and what more can be done.

Storage requirements for running Spark workloads on Kubernetes
Rachit Arora
rachitar@in.ibm.com
IBM, India Software Labs

About Me
• Advisory Software Engineer @ IBM India Software Labs
• General Purpose Developer
• Love Containers & Kubernetes
• Conference traveler
• Upcoming book on Hadoop and Its Ecosystem
• Cricket fan, Foodie

Spark
Unified, open source, parallel data processing framework for Big Data analytics
• Spark Core Engine
• Schedulers: YARN, Mesos, Standalone, Kubernetes
• Spark SQL – interactive queries
• Spark Streaming – stream processing
• Spark MLlib – machine learning
• GraphX – graph computation

Typical Big Data Application
• Stages: Ingest & Store, Prepare, Analyze, Visualize
• Across all stages: Secure, Catalog and Search
• Roles: Data Engineer, Data Scientist, Application Developer

Evolution of Spark Analytics
On-Prem Install
• Acquire hardware
• Prepare machine
• Install Spark
• Retry
• Apply patches
• Security
• Upgrades
• Scale
• High availability
Virtualization
• Prepare VM imaging solution
• Network management
• High availability
• Patches
• Scale
Managed
• Configure cluster
• Customize
• Scale
• Pay even if idle
Serverless
• Run analytics

What Does Kubernetes Bring In?
• Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.
• It manages containers for me
• It manages high availability
• It gives me the flexibility to choose the resources I want and the persistence I want
• Kubernetes – lots of add-on services: third-party logging, monitoring, and security tools
• Reduced operational costs
• Improved infrastructure utilization

Storage Requirements
• Distributed File System
• Local Scratch Space
• Fast disk writes – DO NOT write to containers!
• User Library
• Logs
• History Server Events
• Configs
• Secrets
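
Two items on this list, History Server events and Secrets, map onto Spark configuration properties. The sketch below is a minimal illustration and not the speaker's setup: it builds the properties `spark.eventLog.enabled`, `spark.eventLog.dir`, and `spark.kubernetes.{driver,executor}.secrets.<name>` as a plain dictionary. The bucket URL, secret name, and mount path are hypothetical placeholders.

```python
def history_and_secret_conf(event_log_dir, secrets):
    """Build Spark conf entries for event logging and mounted Kubernetes Secrets.

    event_log_dir: location readable by the History Server (hypothetical value below).
    secrets: mapping of Kubernetes Secret name -> mount path inside the pods.
    """
    conf = {
        # Event logs should live outside the container so they survive pod deletion.
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": event_log_dir,
    }
    for secret_name, mount_path in secrets.items():
        for role in ("driver", "executor"):
            # Spark mounts the named Secret at the given path in each pod.
            conf[f"spark.kubernetes.{role}.secrets.{secret_name}"] = mount_path
    return conf


if __name__ == "__main__":
    # All names below are assumptions made for illustration.
    example = history_and_secret_conf(
        "s3a://example-bucket/spark-events",
        {"object-store-credentials": "/etc/secrets/object-store"},
    )
    for key in sorted(example):
        print(f"--conf {key}={example[key]}")
```

The History Server would then need to read from the same location, which is why the directory is passed in rather than hard-coded.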

What can we leverage
• Distributed
  • NFS
    • PV to PVC (1-to-1 mapping with most cloud providers)
    • Big NFS – multiple PVs – quota
  • HDFS – no direct support; it can be configured to work, but without data locality
  • DBFS – the S3-based Databricks File System (DBFS) is a distributed file system
  • S3/Object Storage – performance concerns
  • Portworx – under exploration
  • GlusterFS
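
As a rough illustration of the PV-to-PVC option, the sketch below builds the Spark properties that mount an existing PersistentVolumeClaim into the driver and executor pods. It assumes a Spark release that supports the `spark.kubernetes.*.volumes.*` properties (2.4 or later); the volume name, claim name, and mount path are hypothetical, and the claim itself (for example one backed by NFS) is assumed to exist already.

```python
def pvc_volume_conf(volume_name, claim_name, mount_path, read_only=False):
    """Return Spark conf entries that mount an existing PVC in driver and executors."""
    conf = {}
    for role in ("driver", "executor"):
        prefix = (
            f"spark.kubernetes.{role}.volumes.persistentVolumeClaim.{volume_name}"
        )
        conf[f"{prefix}.mount.path"] = mount_path
        conf[f"{prefix}.mount.readOnly"] = str(read_only).lower()
        conf[f"{prefix}.options.claimName"] = claim_name
    return conf


def to_submit_args(conf):
    """Flatten a conf dictionary into spark-submit style '--conf key=value' arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    # Hypothetical names: a claim called "spark-shared-claim" mounted at /mnt/shared.
    conf = pvc_volume_conf("shared", "spark-shared-claim", "/mnt/shared")
    print(" ".join(to_submit_args(conf)))
```

With a 1-to-1 PV-to-PVC mapping, each tenant would need its own claim name here, which is the limitation the slide points at.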

What can we leverage
• Local temp dir scratch space
  • emptyDir
    • Clean delete? Need to return machines
  • hostPath
    • You manage deletion
• Logs
  • emptyDir vs NFS
  • Push to object store using fluentd (sidecar containers)
  • Roll over
• Do not write to containers
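
The emptyDir and hostPath choices for scratch space can be sketched as Spark properties too. This is an illustrative sketch under stated assumptions: it relies on the behaviour in Spark 3.0 and later, where executor volumes whose names start with `spark-local-dir-` are used as local scratch directories, and the mount path, host path, and size limit shown are hypothetical values.

```python
def scratch_volume_conf(kind, index=1, mount_path="/var/data/spark-scratch",
                        host_path=None, size_limit=None):
    """Return Spark conf entries for an executor scratch volume.

    kind: "emptyDir" (removed with the pod) or "hostPath" (cleanup is up to you).
    """
    if kind not in ("emptyDir", "hostPath"):
        raise ValueError("kind must be 'emptyDir' or 'hostPath'")
    if kind == "hostPath" and not host_path:
        raise ValueError("hostPath volumes need a host_path")

    # The 'spark-local-dir-' name prefix marks the volume as Spark local storage.
    name = f"spark-local-dir-{index}"
    prefix = f"spark.kubernetes.executor.volumes.{kind}.{name}"
    conf = {
        f"{prefix}.mount.path": mount_path,
        f"{prefix}.mount.readOnly": "false",
    }
    if kind == "hostPath":
        conf[f"{prefix}.options.path"] = host_path
    elif size_limit:
        conf[f"{prefix}.options.sizeLimit"] = size_limit
    return conf


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    for key, value in sorted(scratch_volume_conf("emptyDir", size_limit="20Gi").items()):
        print(f"--conf {key}={value}")
```

Either way the shuffle and spill data lands on a mounted volume rather than in the container's writable layer, which is the point of the last bullet.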

What are we looking for?
• Image as Volume
  • https://github.com/kubernetes/kubernetes/issues/831
• Flex Volume plugin
• CSI
• Encrypted PVC options – Portworx
• PV to PVC 1-to-many mapping with isolation
• ConfigMap: better support for updates
• Local
  • Clean delete for HIPAA
• Distributed
  • Clean delete for HIPAA
• PVC transfer across namespaces

References
• IBM Watson Studio: https://datascience.ibm.com
• IBM Watson: https://www.ibm.com/analytics/us/en/watson-data-platform/tutorial/
• Analytics Engine: https://www.ibm.com/cloud/analytics-engine
• Apache Spark
• Kubernetes Scheduler Design & Discussion
• Kubernetes Clusters on IBM Cloud
Rachit Arora
rachitar@in.ibm.com
@rachit1arora

Thank you
Rachit Arora
rachitar@in.ibm.com
Twitter @rachit1arora

Typical Spark deployment (slide 7; no text beyond the title was captured)

Editor's Notes

Spark is an open source, scalable, massively parallel, in-memory execution engine for analytics applications. Think of it as an in-memory layer that sits above multiple data stores, where data can be loaded into memory and analyzed in parallel across a cluster. Spark Core is the foundation of Spark and provides the libraries for scheduling and basic I/O. Spark offers over 100 high-level operators that make it easy to build parallel apps. Spark also includes prebuilt machine-learning algorithms and graph analysis algorithms that are written specifically to execute in parallel and in memory. It also supports interactive SQL processing of queries and real-time streaming analytics. You can write analytics applications in programming languages such as Java, Python, R, and Scala. You can run Spark using its standalone cluster mode, on the cloud, on Hadoop YARN, on Apache Mesos, or on Kubernetes. It can access data in HDFS, Cassandra, HBase, Hive, Object Store, and any Hadoop data source.
Prepare: Even though you have the right data, it may not be in the right format or structure for analysis. That’s where data preparation comes in. Data engineers need to bring raw data into one interface from wherever it lives (on premises, in the cloud, or on your desktop), where it can then be shaped, transformed, explored, and prepared for analysis. Data scientist: Primarily responsible for building predictive analytic models and building insights. The data scientist analyzes data that has been cataloged and prepared by the data engineer, using machine learning tools like Watson Machine Learning, and builds applications using Jupyter Notebooks and RStudio. After the data scientist shares the analytical outputs, the application developer can build apps such as a cognitive chatbot. As the chatbot engages with customers, it will continuously improve its knowledge and help uncover new insights.
What I was required to do as a data scientist. On-prem to virtualization: as demand for the service increased in my organization, I decided to move to virtualized VMs to handle many requests on demand, but the pain was still considerable. Then I decided to try services offered on the cloud, such as EMR, IBM Analytics Engine, or Microsoft Insights, but there I need to order clusters, configure them to suit my workloads, and keep them running even when I do not want to use them. Cover what it takes to install a Hadoop/Spark cluster.
IBM Watson brings together data management, data policies, data preparation, and analysis capabilities into a common framework. You can index, discover, control, and share data with Watson Knowledge Catalog, refine and prepare the data with Data Refinery, then organize resources to analyze the same data with Watson Studio. The IBM Watson apps are fully integrated to use the same user interface and framework. You can pick whichever apps and tools you need for your organization. Watson Studio provides you with the environment and tools to solve your business problems by collaboratively analyzing data. What is Analytics Engine? You can use Analytics Engine (AE) to build and deploy clusters within minutes with a simplified user experience, scalability, and reliability. You can custom-configure the environment and scale on demand.
