Spark Structured APIs
Using Databricks
Presented By:
Raviyanshu Singh
Software Consultant
Knoldus Inc
Lack of etiquette and manners is a huge turn off.
KnolX Etiquettes
Punctuality
Join the session 5 minutes prior to the session start time. We start on time and conclude on time!
Feedback
Make sure to submit constructive feedback for all sessions, as it is very helpful for the presenter.
Silent Mode
Keep your mobile devices in silent mode. Feel free to step out of the session if you need to attend an urgent call.
Avoid Disturbance
Avoid unwanted chit-chat during the session.
Our Agenda
01 What is Spark
02 What’s an RDD
03 Dataframes
04 Datasets
05 Databricks
06 Demo
What is Spark?
Unified Analytics Engine
Apache Spark is a unified engine designed for large-scale distributed data processing, on premises in data centers or in the cloud.
Spark’s design philosophy is based on these principles:
● Speed
● Ease of Use
● Modularity
● Extensibility
Spark APIs Trio
RDD, Dataframe & Datasets
The Timeline of Three
RDD (2011)
● Distributed collections of JVM objects
● Functional operators (map, filter, etc.)
Dataframe (2013)
● Distributed collections of Row objects
● Expression-based operations and UDFs
● Fast, efficient internal representations
Datasets (2015)
● Internally rows, externally JVM objects
● “Best of both worlds”: type safe + fast
Whatʼs RDD?
[Resilient Distributed Datasets]
● An RDD represents an immutable, partitioned collection of records that can be operated on in parallel.
● RDDs give you complete control because every record in an RDD is just a Java or Python object.
Characteristics of an RDD
● Dependencies
● Partitions
● Compute Function: Partition => Iterator[T]
RDD Characteristics
1. Dependencies
➢ A list of dependencies that tells Spark how an RDD is constructed from its inputs.
➢ Spark can recreate an RDD from these dependencies and replicate operations on it.
(This characteristic gives RDDs resiliency)
2. Partitions
➢ Partitions give Spark the ability to split the work and parallelize computation across executors.
➢ Spark also uses locality information to send work to executors close to the data.
(This characteristic gives RDDs distribution)
3. Compute Function
➢ A method that computes one partition (the input split) within a TaskContext and returns its records as an Iterator[T].
compute(split: Partition, context: TaskContext): Iterator[T]
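The three characteristics can be inspected on a live RDD. The following is a minimal sketch, written as a spark-shell or notebook cell and assuming a local SparkSession; the names base and doubled are hypothetical and chosen only for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session. In a Databricks notebook getOrCreate() returns the existing one.
val spark = SparkSession.builder().appName("rdd-characteristics").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// A parent RDD split into 4 partitions, and a child RDD derived from it.
val base = sc.parallelize(1 to 100, numSlices = 4)
val doubled = base.map(_ * 2) // the function passed to map is what gets computed per partition

// Partitions: the units of work that Spark distributes across executors.
println(s"partitions: ${doubled.partitions.length}")

// Dependencies: the child RDD records which parent it was derived from.
println(s"parent is base: ${doubled.dependencies.head.rdd eq base}")

// The recorded lineage, which Spark can replay to rebuild lost partitions.
println(doubled.toDebugString)
```

A map keeps the partitioning of its parent, so doubled has the same four partitions as base.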
Visualizing RDD
Simple & Elegant
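Since the picture on this slide is not available in text form, here is a small sketch of what working with an RDD typically looks like. It assumes a local SparkSession, and the data and variable names are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session. In a Databricks notebook getOrCreate() returns the existing one.
val spark = SparkSession.builder().appName("rdd-sketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Distribute a local collection across 4 partitions.
val numbers = sc.parallelize(1 to 10, numSlices = 4)

// Functional operators build new RDDs; the source RDD is never modified.
val evenSquares = numbers.map(n => n * n).filter(_ % 2 == 0)

// collect() runs the job and brings the records back to the driver.
val result = evenSquares.collect()
println(result.mkString(", "))
```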
Whatʼs the Problem?
RDDs express the how-to, not the what-to
● The compute function (or computation) is opaque to Spark
● Slow for non-JVM languages like Python
● No optimization by Spark
● No data compression techniques
● All of this leads to inadvertent inefficiencies
Dataframe
Solution is in structuring
What do we mean by structuring?
● Ordering and structuring your data so that it can be arranged in a tabular format.
● Expressing computation using patterns like filtering, selecting, counting, etc.
The DataFrame API
Distributed in-memory tables with named columns and schemas, where each column has a specific data type (String, Int, Timestamp, etc.).
To the human eye, a DataFrame looks like a table.
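To make the difference between "how-to" and "what-to" concrete, the sketch below computes the same average twice: once with RDD lambdas that Spark cannot look inside, and once with DataFrame expressions that Spark can analyze and optimize. The (name, age) sample data and all variable names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

// Assumption: a local session. In a Databricks notebook getOrCreate() returns the existing one.
val spark = SparkSession.builder().appName("how-vs-what").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical sample data: (name, age) pairs.
val people = Seq(("Brooke", 20), ("Denny", 31), ("Brooke", 30), ("Denny", 35))

// RDD version: spells out HOW to compute the average, step by step.
val avgByNameRdd = spark.sparkContext.parallelize(people)
  .map { case (name, age) => (name, (age, 1)) }
  .reduceByKey { case ((a1, c1), (a2, c2)) => (a1 + a2, c1 + c2) }
  .mapValues { case (total, count) => total.toDouble / count }
  .collect()
  .toMap

// DataFrame version: states WHAT is wanted and leaves the execution plan to Spark.
val avgByNameDf = people.toDF("name", "age")
  .groupBy("name")
  .agg(avg("age").as("avg_age"))

avgByNameDf.show()
```

Both versions produce the same averages; the DataFrame version is shorter and gives Spark named columns and expressions to work with.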
Visualizing Dataframes
With Custom Data
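The slide image is not available here, so the following is a sketch of one way to build a DataFrame from custom data with an explicit schema. The column names Id, First, and Hits are borrowed from the later slides; the rows themselves are invented.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Assumption: a local session. In a Databricks notebook getOrCreate() returns the existing one.
val spark = SparkSession.builder().appName("custom-data").master("local[*]").getOrCreate()

// Named columns, each with a specific data type.
val schema = StructType(Seq(
  StructField("Id", IntegerType, nullable = false),
  StructField("First", StringType, nullable = true),
  StructField("Hits", IntegerType, nullable = true)
))

// Hypothetical rows, invented for illustration.
val rows = Seq(Row(1, "Jules", 4535), Row(2, "Brooke", 8908), Row(3, "Denny", 25568))

val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

df.printSchema() // prints the column names and types
df.show()        // prints the rows as a table
```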
Spark Operations on Data
Manoeuvring Data
Spark operations fall into two groups: Transformations and Actions.
Transformations
● Transform a Spark DataFrame into a new DataFrame without altering the original data.
● This gives DataFrames their immutability property.
● Examples: orderBy(), groupBy(), filter(), select()
Actions
● Actions are operations that return a raw value.
● An action triggers the lazy evaluation of all the recorded transformations.
● Examples: show(), take(), count(), collect()
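The split between transformations and actions can be sketched as follows, assuming a local SparkSession and a small made-up DataFrame. Nothing is executed until the first action is called.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session. In a Databricks notebook getOrCreate() returns the existing one.
val spark = SparkSession.builder().appName("lazy-evaluation").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical data, invented for illustration.
val df = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("key", "value")

// Transformations: each call returns a new DataFrame and only records what to do.
val filtered = df.filter($"value" > 1).select("key")

// explain() prints the plan Spark has built from the recorded transformations.
filtered.explain()

// Actions: these trigger execution of the recorded transformations.
val matching = filtered.count()
val keys = filtered.collect().map(_.getString(0))

// The original DataFrame is untouched by the transformations above.
val originalCount = df.count()
```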
Common Dataframe Ops
Projections & Filter
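➢ A projection returns only the selected columns; a filter returns only the rows matching a certain relational condition.
➢ Projections are done with the select() method, while filters can be expressed using the filter() or where() method.
import spark.implicits._ // enables the $"column" syntax
val topHits = df.where($"Hits" > 10000) // filter first, since "Hits" is not among the projected columns
  .select("Id", "First", "Url")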
Renaming, Adding, and Dropping Columns
➢ withColumnRenamed() renames a column, withColumn() adds a new column, and drop() removes the column named in its argument.
import org.apache.spark.sql.functions.{col, to_timestamp}
val newDf = df.withColumnRenamed("First", "First_Name")
  .withColumnRenamed("Last", "Last_Name")
// Parse the "Published" string into a timestamp column, then drop the original.
val dfWithTS = newDf.withColumn("Issued_Date", to_timestamp(col("Published"), "dd/MM/yyyy"))
  .drop("Published")
Common Dataframe Ops
Aggregation
➢ Transformations and actions on DataFrames, such as groupBy(), orderBy(), and count(), let you group rows by column names and then aggregate counts across the groups.
import org.apache.spark.sql.functions.{col, desc}
// Count rows per campaign, ignoring nulls, most frequent first.
val mostShare = dfWithTS.select("Campaigns", "First_Name")
  .where(col("Campaigns").isNotNull)
  .groupBy("Campaigns")
  .count()
  .orderBy(desc("count"))
The Datasets API
A Type-Safe one
According to the Dataset Documentation:
➢ A strongly typed collection of domain-specific objects that can be
transformed in parallel using functional or relational operations. Each
Dataset [in Scala] also has an untyped view called a DataFrame, which
is a Dataset of Row.
Structured APIs
● DataFrame (untyped API): Dataframe = Dataset[Row], an alias in Scala
● Dataset (typed API): Dataset[T], available in Scala & Java
Visualizing Datasets
Case Class (Type-Safe Hero)
Datasets Ops
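The images for these slides are not available, so here is a sketch of how a case class gives a Dataset its type safety. It is written as a spark-shell or notebook cell; the Blogger case class, its fields, and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical domain type: the field names are illustrative only.
case class Blogger(id: Int, first: String, hits: Long)

// Assumption: a local session. In a Databricks notebook getOrCreate() returns the existing one.
val spark = SparkSession.builder().appName("dataset-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val bloggers: Dataset[Blogger] = Seq(
  Blogger(1, "Jules", 4535L),
  Blogger(2, "Brooke", 8908L),
  Blogger(3, "Denny", 25568L)
).toDS()

// Typed, functional operations: the lambda receives a Blogger, so a misspelled
// field name is caught by the compiler instead of failing at runtime.
val popular: Dataset[String] = bloggers.filter(b => b.hits > 5000).map(b => b.first)

// The untyped view of the same data is a DataFrame, that is, a Dataset[Row].
val asDf = bloggers.toDF()

val names = popular.collect()
println(names.mkString(", "))
```

In a compiled application the case class should be declared at the top level, outside the method that uses it, so that Spark can derive an encoder for it.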
Databricks?
A LakeHouse Company
● The Databricks Lakehouse Platform provides a unified set of tools for building, deploying, sharing, and
maintaining enterprise-grade data solutions at scale.
● Databricks integrates with cloud storage and security in your cloud account, and manages and deploys cloud
infrastructure on your behalf.
Common Tools In Databricks
Core Data Tasks
● REST API
● Interactive Notebooks
● ML Model Serving
● Workflows Scheduler
● Source Control (Git)
● SQL Editor & Dashboard
● Compute Management
● Data Ingestion
DEMO
Thank You !