2. Lack of etiquette and manners is a huge turn-off.
KnolX Etiquettes
Punctuality
Join the session 5 minutes prior to the session start time. We start on time and conclude on time!
Feedback
Make sure to submit constructive feedback for all sessions, as it is very helpful for the presenter.
Silent Mode
Keep your mobile devices in silent mode; feel free to step out of the session if you need to attend an urgent call.
Avoid Disturbance
Avoid unwanted chit-chat during the session.
3. Our Agenda
01 What is Spark
02 What's an RDD
03 Dataframes
04 Datasets
05 Databricks
06 Demo
4. What is Spark?
Unified Analytics Engine
Apache Spark is a unified engine designed for large-scale distributed data processing, on premises in data centers or in the cloud.
Spark's design philosophy is based on these principles:
● Speed
● Ease of Use
● Modularity
● Extensibility
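To ground the "Ease of Use" principle, here is a minimal sketch (not from the deck; the app name is illustrative) of starting Spark's single entry point, a SparkSession, in local mode and running a first computation:

// assumes the spark-sql dependency is on the classpath
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("KnolxSparkDemo")   // hypothetical app name
  .master("local[*]")          // all local cores; a cluster URL in production
  .getOrCreate()

spark.range(1, 6).show()       // a tiny one-column DataFrame of 1..5

spark.stop()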
5. Spark APIs Trio
RDD, Dataframe & Datasets
The Timeline of Three:
● RDD (2011): Distributed collections of JVM objects; functional operators (map, filter, etc.).
● Dataframe (2013): Distributed collections of Row objects; expression-based operations and UDFs; fast/efficient internal representations.
● Datasets (2015): Internally rows, externally JVM objects; "best of both worlds": type-safe + fast.
6. What's RDD?
[Resilient Distributed Datasets]
● An RDD represents an immutable, partitioned collection of records that can be operated on in parallel.
● RDDs give you complete control, because every record in an RDD is just a Java or Python object.
Characteristics of an RDD: Dependencies, Partitions, and a Compute Function (Partition => Iterator[T]).
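A minimal sketch of this definition, runnable in spark-shell (where sc, the SparkContext, is predefined); the values are illustrative:

// An immutable, partitioned collection of records, operated on in parallel
val records = sc.parallelize(1 to 10, numSlices = 4)

// Each record is a plain JVM object; transformations return new RDDs,
// leaving the original untouched
val evens = records.filter(_ % 2 == 0).map(_ * 10)

evens.collect()   // Array(20, 40, 60, 80, 100)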
7. RDD Characteristics
1. Dependencies
➢ The list of dependencies that instructs Spark how an RDD is constructed.
➢ Spark can recreate an RDD from these dependencies and replicate operations on them. (This characteristic gives RDDs resiliency.)
2. Partitions
➢ Partitions provide Spark the ability to distribute the work and parallelize computation across executors.
➢ Spark also uses locality information to send work to executors close to the data. (This characteristic gives RDDs distribution.)
3. Compute Function
➢ An abstract method that computes the input split partition in the TaskContext to produce a collection of values (of type T):
compute(split: Partition, context: TaskContext): Iterator[T]
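As a sketch of how these three characteristics surface in the RDD API (spark-shell again; values illustrative) — the compute function itself is the closure handed to map:

val base = sc.parallelize(1 to 100, 4)
val doubled = base.map(_ * 2)

doubled.getNumPartitions   // 4 partitions -> distribution
doubled.dependencies       // lineage back to `base` -> resiliency
doubled.toDebugString      // the full recipe Spark replays after a failure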
9. What's the Problem?
RDDs Express How-to, Not What-to
● The compute function (or computation) is opaque to Spark.
● Slow for non-JVM languages like Python.
● No optimization by Spark.
● No data compression techniques.
● All of this leads to inadvertent inefficiencies.
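A sketch of the opacity point, with illustrative data, previewing the next slide: Spark sees RDD lambdas only as black-box JVM closures, while the equivalent DataFrame expressions state what to compute and are visible to the optimizer:

// How-to: Spark cannot look inside these closures
val howTo = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  .filter(pair => pair._2 > 1)
  .map(pair => (pair._1, pair._2 * 2))

// What-to: the same logic as inspectable, optimizable expressions
import spark.implicits._
val whatTo = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")
  .where($"value" > 1)
  .withColumn("value", $"value" * 2)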
10. Dataframe
The Solution Is in Structuring
What do we mean by structuring?
● Ordering and structuring your data so that it can be arranged in a tabular format.
● Expressing computation using patterns like filtering, selecting, counting, etc.
The DataFrame API
Distributed in-memory tables with named columns and schemas, where each column has a specific data type (String, Int, Timestamp, etc.).
To the human eye, a DataFrame is like a table.
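A minimal sketch of that structure (column names and values are illustrative; spark and sc as in spark-shell): named columns, each bound to a specific data type, and a result that prints like a table:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("Id", IntegerType, nullable = false),
  StructField("First", StringType, nullable = true),
  StructField("Hits", IntegerType, nullable = true)
))

val rows = Seq(Row(1, "Jules", 4535), Row(2, "Brooke", 8908))
val df = spark.createDataFrame(sc.parallelize(rows), schema)

df.printSchema()   // each column == a specific data type
df.show()          // to the human eye, a table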
12. Spark Operations on Data
Manoeuvring Data
Spark operations on data fall into two groups: Transformations and Actions.
Transformations
● Transform a Spark DataFrame into a new DataFrame without altering the original data, giving DataFrames their immutability property.
● Examples: orderBy(), groupBy(), filter(), select()
Actions
● Actions are operations that return a raw value.
● An action triggers the lazy evaluation of all the recorded transformations, as the sketch below shows.
● Examples: show(), take(), count(), collect()
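A short sketch of the split (illustrative data): the transformations below are only recorded; nothing runs until the actions at the end trigger the plan:

import spark.implicits._
val df = Seq(("IT", 120), ("Finance", 80), ("Marketing", 95)).toDF("dept", "hits")

// Transformations: recorded, not executed
val ranked = df.filter($"hits" > 90).orderBy($"hits".desc)

// Actions: trigger the lazy evaluation and return raw values
ranked.show()
ranked.count()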
13. Common Dataframe Ops
Projections & Filters
➢ A way to return only the rows matching a certain relational condition by using filters.
➢ Projections are done with the select() method, while filters can be expressed using the filter() or where() method.

// assumes: import spark.implicits._ (for the $ column syntax)
val topHits = df.select("Id", "First", "Url")
  .where($"Hits" > 10000)

Renaming, Adding, and Dropping Columns
➢ withColumnRenamed() renames a column, withColumn() adds a new column, and drop() removes the specified column.

// assumes: import org.apache.spark.sql.functions.{col, to_timestamp}
val newDf = df.withColumnRenamed("First", "First_Name")
  .withColumnRenamed("Last", "Last_Name")
val dfWithTS = newDf.withColumn("Issued_Date", to_timestamp(col("Published"), "dd/MM/yyyy"))
  .drop("Published")
14. Common Dataframe Ops
Aggregation
➢ Transformations and actions on DataFrames, such as groupBy(), orderBy(), and count(), offer the ability to aggregate by column names and then aggregate counts across them.

// assumes: import org.apache.spark.sql.functions.{col, desc}
val mostShare = dfWithTS.select("Campaigns", "First_Name")
  .where(col("Campaigns").isNotNull)
  .groupBy("Campaigns")
  .count()
  .orderBy(desc("count"))
15. The Datasets API
A Type-Safe One
According to the Dataset documentation:
➢ A strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset [in Scala] also has an untyped view called a DataFrame, which is a Dataset of Row.
Structured APIs:
● DataFrame, the untyped API: Dataframe = Dataset[Row], an alias in Scala.
● Datasets, the typed API: Dataset[T], in Scala & Java.
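A sketch of moving between the two views, with an illustrative case class matching the earlier columns; .as[T] is the standard conversion from the untyped to the typed API:

import spark.implicits._

case class Author(Id: Int, First: String, Hits: Int)

// Untyped view: DataFrame = Dataset[Row]
val df = Seq((1, "Jules", 4535), (2, "Brooke", 8908)).toDF("Id", "First", "Hits")

// Typed view: Dataset[Author]; column mistakes now fail at compile time
val authors = df.as[Author]
authors.filter(_.Hits > 5000).map(_.First).show()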
18. Databricks?
A Lakehouse Company
● The Databricks Lakehouse Platform provides a unified set of tools for building, deploying, sharing, and maintaining enterprise-grade data solutions at scale.
● Databricks integrates with cloud storage and security in your cloud account, and manages and deploys cloud infrastructure on your behalf.
19. Common Tools In Databricks
Core Data Tasks
● Interactive Notebooks
● REST API
● ML Model Serving
● Workflows Scheduler
● Source Control (Git)
● SQL Editor & Dashboards
● Compute Management
● Data Ingestion