Why would you care? Because PySpark is a cloud-agnostic analytics tool for Big Data processing, "hidden" in:
* AWS Glue - Managed ETL Service
* Amazon EMR - Big Data Platform
* Google Cloud Dataproc - Cloud-native Spark and Hadoop
* Azure HDInsight - Microsoft's implementation of Apache Spark in the cloud
In this #ServerlessTO talk, Jonathan Rioux - Head of Data Science at EPAM Canada and author of the book PySpark in Action (https://www.manning.com/books/pyspark-in-action) - will get you acquainted with PySpark, the Python API for Spark.
Event details: https://www.meetup.com/Serverless-Toronto/events/269124392/
Event recording: https://youtu.be/QGxytMbrjGY
As always, BIG thanks to our knowledge sponsor Manning Publications, who generously offered to raffle not 1 but 3 of Jonathan's books!
RSVP for more exciting (online) events at https://www.meetup.com/Serverless-Toronto/events/
Intro to PySpark: Python Data Analysis at scale in the Cloud
1. Welcome to ServerlessToronto.org
“Home of Less IT Mess”
Introduce Yourself ☺
- Why are you here?
- Looking for work?
- Offering work?
Our feature presentation “Intro to PySpark” starts at 6:20pm…
2. Serverless is not just about the Tech:
Serverless is New Agile & Mindset
- Serverless Dev (gluing other people’s APIs and managed services)
- We're obsessed with creating business value (meaningful MVPs, products), by helping Startups & empowering Business users!
- We build bridges between the Serverless Community (“Dev leg”) and Front-end & Voice-First folks (“UX leg”), and empower UX developers
- Achieve agility NOT by “sprinting” faster (like in Scrum), but by working smarter (by using bigger building blocks and less Ops)
3. Upcoming #ServerlessTO Online Meetups
1. Accelerating with a Cloud Contact Center – Patrick Kolencherry, Sr. Product Marketing Manager, and Karla Nussbaumer, Head of Technical Marketing at Twilio ** JULY 9 @ 6pm **
2. Your Presentation ☺ ** WHY NOT SHARE THE KNOWLEDGE? **
15. Goals of this presentation
Share my love of (Py)Spark
16. Goals of this presentation
Share my love of (Py)Spark
Explain where PySpark shines
17. Goals of this presentation
Share my love of (Py)Spark
Explain where PySpark shines
Introduce the Python + Spark interop
18. Goals of this presentation
Share my love of (Py)Spark
Explain where PySpark shines
Introduce the Python + Spark interop
Get you excited about using PySpark
19. Goals of this presentation
Share my love of (Py)Spark
Explain where PySpark shines
Introduce the Python + Spark interop
Get you excited about using PySpark
36,000 ft overview: Managed Spark in the Cloud
29. Data manipulation uses the same vocabulary as SQL
(
    my_table
    .select("id", "first_name", "last_name", "age")
    .where(col("age") > 21)
    .groupby("age")
    .count()
)
30. Data manipulation uses the same vocabulary as SQL
(
    my_table
    .select("id", "first_name", "last_name", "age")
    .where(col("age") > 21)
    .groupby("age")
    .count()
)
select
31. Data manipulation uses the same vocabulary as SQL
(
    my_table
    .select("id", "first_name", "last_name", "age")
    .where(col("age") > 21)
    .groupby("age")
    .count()
)
where
32. Data manipulation uses the same vocabulary as SQL
(
    my_table
    .select("id", "first_name", "last_name", "age")
    .where(col("age") > 21)
    .groupby("age")
    .count()
)
group by
33. Data manipulation uses the same vocabulary as SQL
(
    my_table
    .select("id", "first_name", "last_name", "age")
    .where(col("age") > 21)
    .groupby("age")
    .count()
)
count
34. I mean, you can legitimately use SQL
spark.sql("""
    select count(*) from (
        select id, first_name, last_name, age
        from my_table
        where age > 21
    )
    group by age""")
35. Data manipulation and machine learning with a fluent API
# Assumes: import pyspark.sql.functions as F
results = (
    spark.read.text("./data/Ch02/1342-0.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word"))
    .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word"))
    .where(F.col("word") != "")
    .groupby(F.col("word"))
    .count()
)
36. Data manipulation and machine learning with a fluent API
results = (
    spark.read.text("./data/Ch02/1342-0.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word"))
    .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word"))
    .where(F.col("word") != "")
    .groupby(F.col("word"))
    .count()
)
Read a text file.
37. Data manipulation and machine learning with a fluent API
results = (
    spark.read.text("./data/Ch02/1342-0.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word"))
    .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word"))
    .where(F.col("word") != "")
    .groupby(F.col("word"))
    .count()
)
Select the column value, where each element is split (space as a separator). Alias to line.
38. Data manipulation and machine learning with a fluent API
results = (
    spark.read.text("./data/Ch02/1342-0.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word"))
    .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word"))
    .where(F.col("word") != "")
    .groupby(F.col("word"))
    .count()
)
Explode each element of line into its own record. Alias to word.
39. Data manipulation and machine learning with a fluent API
results = (
    spark.read.text("./data/Ch02/1342-0.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word"))
    .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word"))
    .where(F.col("word") != "")
    .groupby(F.col("word"))
    .count()
)
Lower-case each word.
40. Data manipulation and machine learning with a fluent API
results = (
    spark.read.text("./data/Ch02/1342-0.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word"))
    .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word"))
    .where(F.col("word") != "")
    .groupby(F.col("word"))
    .count()
)
Extract only the first group of lower-case letters from each word.
41. Data manipulation and machine learning with a fluent API
results = (
    spark.read.text("./data/Ch02/1342-0.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word"))
    .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word"))
    .where(F.col("word") != "")
    .groupby(F.col("word"))
    .count()
)
Keep only the records where the word is not the empty string.
42. Data manipulation and machine learning with a fluent API
results = (
    spark.read.text("./data/Ch02/1342-0.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word"))
    .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word"))
    .where(F.col("word") != "")
    .groupby(F.col("word"))
    .count()
)
Group by word.
43. Data manipulation and machine learning with a fluent API
results = (
    spark.read.text("./data/Ch02/1342-0.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word"))
    .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word"))
    .where(F.col("word") != "")
    .groupby(F.col("word"))
    .count()
)
Count the number of records in each group.
44. Scala is not the only player in town
48. Summoning PySpark
from pyspark.sql import SparkSession

spark = SparkSession.builder.config(
    "spark.jars.packages",
    ("com.google.cloud.spark:"
     "spark-bigquery-with-dependencies_2.12:0.16.1"),
).getOrCreate()
A SparkSession is your entry point to distributed data manipulation.
49. Summoning PySpark
from pyspark.sql import SparkSession

spark = SparkSession.builder.config(
    "spark.jars.packages",
    ("com.google.cloud.spark:"
     "spark-bigquery-with-dependencies_2.12:0.16.1"),
).getOrCreate()
We create our SparkSession with an optional library to access BigQuery as a data source.
50. Reading data
from functools import reduce
from pyspark.sql import DataFrame

def read_df_from_bq(year):
    return (
        spark.read.format("bigquery")
        .option("table", f"bigquery-public-data.noaa_gsod.gsod{year}")
        .option("credentialsFile", "bq-key.json")
        .load()
    )

gsod = (
    reduce(
        DataFrame.union,
        [read_df_from_bq(year) for year in range(2010, 2020)],
    )
)
51. Reading data
from functools import reduce
from pyspark.sql import DataFrame

def read_df_from_bq(year):
    return (
        spark.read.format("bigquery")
        .option("table", f"bigquery-public-data.noaa_gsod.gsod{year}")
        .option("credentialsFile", "bq-key.json")
        .load()
    )

gsod = (
    reduce(
        DataFrame.union,
        [read_df_from_bq(year) for year in range(2010, 2020)],
    )
)
We create a helper function to read our data from BigQuery.
52. Reading data
def read_df_from_bq(year):
    return (
        spark.read.format("bigquery")
        .option("table", f"bigquery-public-data.noaa_gsod.gsod{year}")
        .option("credentialsFile", "bq-key.json")
        .load()
    )

gsod = (
    reduce(
        DataFrame.union,
        [read_df_from_bq(year) for year in range(2010, 2020)],
    )
)
A DataFrame is a regular Python object.
57. Any data frame transformation will be stored until we need the data. Then, when we trigger an action, (Py)Spark optimizes the query plan, selects the best physical plan, and applies the transformations to the data.
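As an illustration (a minimal sketch, not from the deck): transformations only build up the query plan, and nothing is computed until an action such as show() or count() is called.
# Hedged sketch of lazy evaluation; assumes an existing SparkSession `spark`.
import pyspark.sql.functions as F

df = spark.read.text("./data/Ch02/1342-0.txt")  # transformation: no data is read yet
words = (
    df.select(F.explode(F.split(F.col("value"), " ")).alias("word"))
    .where(F.col("word") != "")  # still lazy: only the plan grows
)
words.show(5)  # action: the plan is optimized and executed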
67. Something a little more complex
import pyspark.sql.functions as F

stations = (
    spark.read.format("bigquery")
    .option("table", "bigquery-public-data.noaa_gsod.stations")
    .option("credentialsFile", "bq-key.json")
    .load()
)

# We want to get the "hottest Countries" that have at least 60 measures
answer = (
    gsod.join(stations, gsod["stn"] == stations["usaf"])
    .where(F.col("country").isNotNull())
    .groupBy("country")
    .agg(F.avg("temp").alias("avg_temp"), F.count("*").alias("count"))
).where(F.col("count") > 12 * 5)
read, join, where, groupby, avg/count, where, orderby, show
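The caption lists orderby and show, which are not visible in the captured snippet; a hedged sketch of how the chain might be finished:
# Hedged completion (not visible in the captured slide): sort and display.
answer.orderBy(F.col("avg_temp").desc()).show(5)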
72. Python or SQL?
gsod.createTempView("gsod")
stations.createTempView("stations")

spark.sql("""
    select country, avg(temp) avg_temp, count(*) count
    from gsod
    inner join stations
    on gsod.stn = stations.usaf
    where country is not null
    group by country
    having count > (12 * 5)
    order by avg_temp desc
""").show(5)
We can then query using SQL without leaving Python!
73. Python and SQL!
(
    spark.sql(
        """
        select country, avg(temp) avg_temp, count(*) count
        from gsod
        inner join stations
        on gsod.stn = stations.usaf
        group by country"""
    )
    .where("country is not null")
    .where("count > (12 * 5)")
    .orderBy("avg_temp", ascending=False)
    .show(5)
)
83. Scalar UDF
gsod = gsod.withColumn("temp_c", f_to_c(F.col("temp")))
gsod.select("temp", "temp_c").distinct().show(5)
# +-----+-------------------+
# | temp| temp_c|
# +-----+-------------------+
# | 37.2| 2.8888888888888906|
# | 85.9| 29.944444444444443|
# | 53.5| 11.944444444444445|
# | 71.6| 21.999999999999996|
# |-27.6|-33.111111111111114|
# +-----+-------------------+
# only showing top 5 rows
A UDF can be used like any PySpark function.
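The definition of f_to_c comes from an earlier slide not captured in this excerpt; a minimal sketch of what such a scalar (Series-to-Series) pandas UDF could look like, with the decorator and body assumed:
# Hedged sketch of the f_to_c scalar UDF used above (Fahrenheit to Celsius).
import pandas as pd
import pyspark.sql.functions as F
import pyspark.sql.types as T

@F.pandas_udf(T.DoubleType())
def f_to_c(degrees: pd.Series) -> pd.Series:
    """Convert degrees Fahrenheit to Celsius, one pandas Series batch at a time."""
    return (degrees - 32) * 5 / 9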
84. Grouped Map UDF
import pandas as pd

def scale_temperature(temp_by_day: pd.DataFrame) -> pd.DataFrame:
    """Returns a simple normalization of the temperature for a site.

    If the temperature is constant for the whole window,
    defaults to 0.5.
    """
    temp = temp_by_day.temp
    answer = temp_by_day[["stn", "year", "mo", "da", "temp"]]
    if temp.min() == temp.max():
        return answer.assign(temp_norm=0.5)
    return answer.assign(
        temp_norm=(temp - temp.min()) / (temp.max() - temp.min())
    )
85. Grouped Map UDF
def scale_temperature(temp_by_day: pd.DataFrame) -> pd.DataFrame:
    """Returns a simple normalization of the temperature for a site.

    If the temperature is constant for the whole window,
    defaults to 0.5.
    """
    temp = temp_by_day.temp
    answer = temp_by_day[["stn", "year", "mo", "da", "temp"]]
    if temp.min() == temp.max():
        return answer.assign(temp_norm=0.5)
    return answer.assign(
        temp_norm=(temp - temp.min()) / (temp.max() - temp.min())
    )
A regular, fun, harmless function on (pandas) DataFrames.
86. Grouped Map UDF
def scale_temperature(temp_by_day: pd.DataFrame) -> pd.DataFrame:
    """Returns a simple normalization of the temperature for a site.

    If the temperature is constant for the whole window,
    defaults to 0.5.
    """
    temp = temp_by_day.temp
    answer = temp_by_day[["stn", "year", "mo", "da", "temp"]]
    if temp.min() == temp.max():
        return answer.assign(temp_norm=0.5)
    return answer.assign(
        temp_norm=(temp - temp.min()) / (temp.max() - temp.min())
    )
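The wiring of this grouped map function is not shown in this excerpt; a hedged sketch of how it is typically applied, where the grouping keys and the output schema string are assumptions rather than taken from the deck:
# Hedged sketch: apply the grouped map function per station and month.
gsod_norm = gsod.groupby("stn", "year", "mo").applyInPandas(
    scale_temperature,
    schema="stn string, year string, mo string, da string, temp double, temp_norm double",
)
gsod_norm.show(5)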
92. You are not limited library-wise
from sklearn.linear_model import LinearRegression

@F.pandas_udf(T.DoubleType())
def rate_of_change_temperature(
    day: pd.Series,
    temp: pd.Series,
) -> float:
    """Returns the slope of the daily temperature
    for a given period of time."""
    return (
        LinearRegression()
        .fit(X=day.astype("int").values.reshape(-1, 1), y=temp)
        .coef_[0]
    )
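How this aggregate UDF gets used is not shown in this excerpt; a hedged usage sketch, where the grouping keys and the alias are assumptions:
# Hedged sketch: a Series-to-scalar pandas UDF works inside groupBy().agg()
# like any built-in aggregation.
sliding = gsod.groupBy("stn", "year", "mo").agg(
    rate_of_change_temperature(gsod["da"], gsod["temp"]).alias("rt_chg_temp")
)
sliding.show(5)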
105. "Serverless" Spark?
- Cost-effective for sporadic runs
- Scales easily
- Simplified maintenance
- Easy to become expensive
- Sometimes confusing pricing model
106. "Serverless" Spark?
- Cost-effective for sporadic runs
- Scales easily
- Simplified maintenance
- Easy to become expensive
- Sometimes confusing pricing model
- Uneven documentation