Welcome to ServerlessToronto.org
“Home of Less IT Mess”
1
Introduce Yourself ☺
- Why are you here?
- Looking for work?
- Offering work?
Our feature presentation “Intro to PySpark” starts at 6:20pm…
Serverless is not just about the Tech:
2
Serverless is New Agile & Mindset
Serverless Dev (gluing other people’s APIs and managed services)
We're obsessed with creating business value (meaningful MVPs, products) by helping Startups & empowering Business users!
We build bridges between the Serverless Community (“Dev leg”) and the Front-end & Voice-First folks (“UX leg”), and empower UX developers
Achieve agility NOT by “sprinting” faster (like in Scrum), but by working smarter (using bigger building blocks and less Ops)
Upcoming #ServerlessTO Online Meetups
3
1. Accelerating with a Cloud Contact Center – Patrick Kolencherry, Sr. Product Marketing Manager, and Karla Nussbaumer, Head of Technical Marketing at Twilio ** JULY 9 @ 6pm **
2. Your Presentation ☺ ** WHY NOT SHARE THE KNOWLEDGE?
Feature Talk
Jonathan Rioux, Head Data Scientist at
EPAM Systems & author of Manning book
PySpark in Action
4
Getting acquainted
with PySpark
1/49
If you have not filled out the Meetup survey, now is the time to do it!
(Also copied in the chat)
https://forms.gle/6cyWGVY4L4GJvsXh7
2/49
Hi! I'm Jonathan
3/49
Hi! I'm Jonathan
Data Scientist, Engineer, Enthusiast
3/49
Hi! I'm Jonathan
Data Scientist, Engineer, Enthusiast
Head of DS @ EPAM Canada
3/49
Hi! I'm Jonathan
Data Scientist, Engineer, Enthusiast
Head of DS @ EPAM Canada
Author of PySpark in Action →
3/49
Hi! I'm Jonathan
Data Scientist, Engineer, Enthusiast
Head of DS @ EPAM Canada
Author of PySpark in Action →
<3 Spark, <3 <3 Python
3/49
4/49
5/49
Goals of this presentation
6/49
Goals of this presentation
Share my love of (Py)Spark
6/49
Goals of this presentation
Share my love of (Py)Spark
Explain where PySpark shines
6/49
Goals of this presentation
Share my love of (Py)Spark
Explain where PySpark shines
Introduce the Python + Spark interop
6/49
Goals of this presentation
Share my love of (Py)Spark
Explain where PySpark shines
Introduce the Python + Spark interop
Get you excited about using PySpark
6/49
Goals of this presentation
Share my love of (Py)Spark
Explain where PySpark shines
Introduce the Python + Spark interop
Get you excited about using PySpark
36,000 ft overview: Managed Spark in the Cloud
6/49
What I expect from you
7/49
What I expect from you
You know a little bit of Python
7/49
What I expect from you
You know a little bit of Python
You know what SQL is
7/49
What I expect from you
You know a little bit of Python
You know what SQL is
You won't hesitate to ask questions :-)
7/49
What is Spark
Spark is a unified analytics engine for large-scale
data processing
8/49
What is Spark (bis)
Spark can be thought of as a data factory that you (mostly) program like a cohesive computer.
9/49
Spark under the hood
10/49
Spark as an analytics factory
11/49
Why is PySpark cool?
12/49
Data manipulation uses the same vocabulary as SQL
(
my_table
.select("id", "first_name", "last_name", "age")
.where(col("age") > 21)
.groupby("age")
.count()
)
13/49
Data manipulation uses the same vocabulary as SQL
(
my_table
.select("id", "first_name", "last_name", "age")
.where(col("age") > 21)
.groupby("age")
.count()
)
select
13/49
Data manipulation uses the same vocabulary as SQL
(
my_table
.select("id", "first_name", "last_name", "age")
.where(col("age") > 21)
.groupby("age")
.count()
)
where
13/49
Data manipulation uses the same vocabulary as SQL
(
my_table
.select("id", "first_name", "last_name", "age")
.where(col("age") > 21)
.groupby("age")
.count()
)
group by
13/49
Data manipulation uses the same vocabulary as SQL
(
my_table
.select("id", "first_name", "last_name", "age")
.where(col("age") > 21)
.groupby("age")
.count()
)
count
13/49
I mean, you can legitimately use
SQL
spark.sql("""
select count(*) from (
select id, first_name, last_name, age
from my_table
where age > 21
)
group by age""")
14/49
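One thing the slide leaves implicit: spark.sql() only sees tables or views registered with the session. A minimal sketch of that setup, assuming a hypothetical DataFrame named my_table_df (not from the deck) with the same columns:

# Register the DataFrame under the name the SQL query expects.
my_table_df.createOrReplaceTempView("my_table")
# After this, the spark.sql(...) call above can resolve my_table and run.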
Data manipulation and machine learning with a fluent API
results = (
spark.read.text("./data/Ch02/1342-0.txt")
.select(F.split(F.col("value"), " ").alias("line"))
.select(F.explode(F.col("line")).alias("word"))
.select(F.lower(F.col("word")).alias("word"))
.select(F.regexp_extract(F.col("word"),
"[a-z']*", 0).alias("word"))
.where(F.col("word") != "")
.groupby(F.col("word"))
.count()
)
15/49
Data manipulation and machine learning with a fluent API
results = (
spark.read.text("./data/Ch02/1342-0.txt")
.select(F.split(F.col("value"), " ").alias("line"))
.select(F.explode(F.col("line")).alias("word"))
.select(F.lower(F.col("word")).alias("word"))
.select(F.regexp_extract(F.col("word"),
"[a-z']*", 0).alias("word"))
.where(F.col("word") != "")
.groupby(F.col("word"))
.count()
)
Read a text file
15/49
Data manipulation and machine learning with a fluent API
results = (
spark.read.text("./data/Ch02/1342-0.txt")
.select(F.split(F.col("value"), " ").alias("line"))
.select(F.explode(F.col("line")).alias("word"))
.select(F.lower(F.col("word")).alias("word"))
.select(F.regexp_extract(F.col("word"),
"[a-z']*", 0).alias("word"))
.where(F.col("word") != "")
.groupby(F.col("word"))
.count()
)
Select the column value, where each element is split on spaces. Alias to line.
15/49
Data manipulation and machine learning with a fluent API
results = (
spark.read.text("./data/Ch02/1342-0.txt")
.select(F.split(F.col("value"), " ").alias("line"))
.select(F.explode(F.col("line")).alias("word"))
.select(F.lower(F.col("word")).alias("word"))
.select(F.regexp_extract(F.col("word"),
"[a-z']*", 0).alias("word"))
.where(F.col("word") != "")
.groupby(F.col("word"))
.count()
)
Explode each element of line into its own record. Alias to word.
15/49
Data manipulation and machine learning with a fluent API
results = (
spark.read.text("./data/Ch02/1342-0.txt")
.select(F.split(F.col("value"), " ").alias("line"))
.select(F.explode(F.col("line")).alias("word"))
.select(F.lower(F.col("word")).alias("word"))
.select(F.regexp_extract(F.col("word"),
"[a-z']*", 0).alias("word"))
.where(F.col("word") != "")
.groupby(F.col("word"))
.count()
)
Lower-case each word
15/49
Data manipulation and machine learning with a fluent API
results = (
spark.read.text("./data/Ch02/1342-0.txt")
.select(F.split(F.col("value"), " ").alias("line"))
.select(F.explode(F.col("line")).alias("word"))
.select(F.lower(F.col("word")).alias("word"))
.select(F.regexp_extract(F.col("word"),
"[a-z']*", 0).alias("word"))
.where(F.col("word") != "")
.groupby(F.col("word"))
.count()
)
Extract only the first group of lower-case letters from each word.
15/49
Data manipulation and machine learning with a fluent API
results = (
spark.read.text("./data/Ch02/1342-0.txt")
.select(F.split(F.col("value"), " ").alias("line"))
.select(F.explode(F.col("line")).alias("word"))
.select(F.lower(F.col("word")).alias("word"))
.select(F.regexp_extract(F.col("word"),
"[a-z']*", 0).alias("word"))
.where(F.col("word") != "")
.groupby(F.col("word"))
.count()
)
Keep only the records where the word is not the empty string.
15/49
Data manipulation and machine learning with a fluent API
results = (
spark.read.text("./data/Ch02/1342-0.txt")
.select(F.split(F.col("value"), " ").alias("line"))
.select(F.explode(F.col("line")).alias("word"))
.select(F.lower(F.col("word")).alias("word"))
.select(F.regexp_extract(F.col("word"),
"[a-z']*", 0).alias("word"))
.where(F.col("word") != "")
.groupby(F.col("word"))
.count()
)
Group by word
15/49
Data manipulation and machine learning with a fluent API
results = (
spark.read.text("./data/Ch02/1342-0.txt")
.select(F.split(F.col("value"), " ").alias("line"))
.select(F.explode(F.col("line")).alias("word"))
.select(F.lower(F.col("word")).alias("word"))
.select(F.regexp_extract(F.col("word"),
"[a-z']*", 0).alias("word"))
.where(F.col("word") != "")
.groupby(F.col("word"))
.count()
)
Count the number of records in each group
15/49
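To actually see the word counts, an action has to close the chain; a small usage sketch, continuing from the results DataFrame above:

# Trigger the whole pipeline and display the 10 most frequent words.
results.orderBy("count", ascending=False).show(10)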
Scala is not the only player in
town
16/49
Let's code!
17/49
18/49
Summoning PySpark
from pyspark.sql import SparkSession
 
spark = SparkSession.builder.config(
"spark.jars.packages",
("com.google.cloud.spark:"
"spark-bigquery-with-dependencies_2.12:0.16.1")
).getOrCreate()
19/49
Summoning PySpark
from pyspark.sql import SparkSession
 
spark = SparkSession.builder.config(
"spark.jars.packages",
("com.google.cloud.spark:"
"spark-bigquery-with-dependencies_2.12:0.16.1")
).getOrCreate()
A SparkSession is your entry point to distributed data manipulation
19/49
Summoning PySpark
from pyspark.sql import SparkSession

spark = SparkSession.builder.config(
"spark.jars.packages",
("com.google.cloud.spark:"
"spark-bigquery-with-dependencies_2.12:0.16.1")
).getOrCreate()
We create our SparkSession with an optional library to access BigQuery as a data source.
19/49
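If you only want to experiment locally and don't need the BigQuery connector, a bare session is enough; a minimal sketch (the app name is arbitrary):

from pyspark.sql import SparkSession

# Plain local SparkSession: no extra packages, runs on the local machine.
spark = SparkSession.builder.appName("pyspark-playground").getOrCreate()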
Reading data
from functools import reduce
from pyspark.sql import DataFrame
 
 
def read_df_from_bq(year):
return (
spark.read.format("bigquery")
.option("table", f"bigquery-public-data.noaa_gsod.gsod{year}")
.option("credentialsFile", "bq-key.json")
.load()
)
 
 
gsod = (
reduce(
DataFrame.union, [read_df_from_bq(year)
for year in range(2010, 2020)]
)
)
20/49
Reading data
from functools import reduce
from pyspark.sql import DataFrame

def read_df_from_bq(year):
return (
spark.read.format("bigquery")
.option("table", f"bigquery-public-data.noaa_gsod.gsod{year}")
.option("credentialsFile", "bq-key.json")
.load()
)

gsod = (
reduce(
DataFrame.union, [read_df_from_bq(year)
for year in range(2010, 2020)]
)
)
We create a helper function to read our data from BigQuery.
20/49
Reading data
def read_df_from_bq(year):
return (
spark.read.format("bigquery")
.option("table", f"bigquery-public-data.noaa_gsod.gsod{year}")
.option("credentialsFile", "bq-key.json")
.load()
)

gsod = (
reduce(
DataFrame.union, [read_df_from_bq(year)
for year in range(2010, 2020)]
)
)
A DataFrame is a regular Python object.
20/49
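The reduce/union pattern isn't tied to BigQuery; the same shape works for any source that yields one DataFrame per year. A sketch assuming hypothetical local Parquet files, one per year, with identical columns:

from functools import reduce
from pyspark.sql import DataFrame

def read_df_from_parquet(year):
    # Hypothetical files such as ./data/gsod2010.parquet, ./data/gsod2011.parquet, ...
    return spark.read.parquet(f"./data/gsod{year}.parquet")

gsod_local = reduce(
    DataFrame.union,
    [read_df_from_parquet(year) for year in range(2010, 2020)],
)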
Using the power of the schema
gsod.printSchema()
# root
# |-- stn: string (nullable = true)
# |-- wban: string (nullable = true)
# |-- year: string (nullable = true)
# |-- mo: string (nullable = true)
# |-- da: string (nullable = true)
# |-- temp: double (nullable = true)
# |-- count_temp: long (nullable = true)
# |-- dewp: double (nullable = true)
# |-- count_dewp: long (nullable = true)
# |-- slp: double (nullable = true)
# |-- count_slp: long (nullable = true)
# |-- stp: double (nullable = true)
# |-- count_stp: long (nullable = true)
# |-- visib: double (nullable = true)
# [...]
21/49
Using the power of the schema
gsod.printSchema()
# root
# |-- stn: string (nullable = true)
# |-- wban: string (nullable = true)
# |-- year: string (nullable = true)
# |-- mo: string (nullable = true)
# |-- da: string (nullable = true)
# |-- temp: double (nullable = true)
# |-- count_temp: long (nullable = true)
# |-- dewp: double (nullable = true)
# |-- count_dewp: long (nullable = true)
# |-- slp: double (nullable = true)
# |-- count_slp: long (nullable = true)
# |-- stp: double (nullable = true)
# |-- count_stp: long (nullable = true)
# |-- visib: double (nullable = true)
# [...]
The schema will give us the column names and their types.
21/49
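The same information is available programmatically, which is handy for validating inputs; for example:

gsod.columns                   # ['stn', 'wban', 'year', 'mo', 'da', 'temp', ...]
gsod.dtypes                    # [('stn', 'string'), ('temp', 'double'), ...]
gsod.schema["temp"].dataType   # DoubleType()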
And showing data
gsod = gsod.select("stn", "year", "mo", "da", "temp")
 
gsod.show(5)
 
# Approximately 5 seconds waiting
# +------+----+---+---+----+
# | stn|year| mo| da|temp|
# +------+----+---+---+----+
# |359250|2010| 02| 25|25.2|
# |359250|2010| 05| 25|65.0|
# |386130|2010| 02| 19|35.4|
# |386130|2010| 03| 15|52.2|
# |386130|2010| 01| 21|37.9|
# +------+----+---+---+----+
# only showing top 5 rows
22/49
What happens behind the scenes?
23/49
Any data frame transformation will be stored until we need the data.
Then, when we trigger an action, (Py)Spark will optimize the query plan, select the best physical plan, and apply the transformations to the data.
24/49
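A tiny illustration of that laziness, using the gsod data frame from the previous slide: the transformation returns instantly, explain() shows the plan Spark intends to run, and only the action at the end touches the data.

import pyspark.sql.functions as F

hot_days = gsod.where(F.col("temp") > 100.0)  # transformation: nothing is read yet
hot_days.explain()                            # prints the optimized query plan
hot_days.count()                              # action: Spark now reads and processes the data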
Transformations Actions
25/49
Transformations
select
Actions
25/49
Transformations
select
filter
Actions
25/49
Transformations
select
filter
group by
Actions
25/49
Transformations
select
filter
group by
partition
Actions
25/49
Transformations
select
filter
group by
partition
Actions
write
25/49
Transformations
select
filter
group by
partition
Actions
write
show
25/49
Transformations
select
filter
group by
partition
Actions
write
show
count
25/49
Transformations
select
filter
group by
partition
Actions
write
show
count
toPandas
25/49
Something a little more complex
import pyspark.sql.functions as F
 
stations = (
spark.read.format("bigquery")
.option("table", f"bigquery-public-data.noaa_gsod.stations")
.option("credentialsFile", "bq-key.json")
.load()
)
 
# We want to get the "hottest Countries" that have at least 60 measures
answer = (
gsod.join(stations, gsod["stn"] == stations["usaf"])
.where(F.col("country").isNotNull())
.groupBy("country")
.agg(F.avg("temp").alias("avg_temp"), F.count("*").alias("count"))
).where(F.col("count") > 12 * 5)
read, join, where, groupby, avg/count, where, orderby, show
26/49
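The annotation ends with orderby and show, which the snippet stops just short of; the last two steps, continuing from the answer DataFrame above, would look like this:

# Sort the surviving countries by average temperature and trigger the computation.
answer.orderBy("avg_temp", ascending=False).show(5)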
read, join, where, groupby, avg/count, where, orderby, show
27/49
read, join, where, groupby, avg/count, where, orderby, show
28/49
Python or SQL?
gsod.createTempView("gsod")
stations.createTempView("stations")
 
spark.sql("""
select country, avg(temp) avg_temp, count(*) count
from gsod
inner join stations
on gsod.stn = stations.usaf
where country is not null
group by country
having count > (12 * 5)
order by avg_temp desc
""").show(5)
29/49
Python or SQL?
gsod.createTempView("gsod")
stations.createTempView("stations")
 
spark.sql("""
select country, avg(temp) avg_temp, count(*) count
from gsod
inner join stations
on gsod.stn = stations.usaf
where country is not null
group by country
having count > (12 * 5)
order by avg_temp desc
""").show(5)
We register the data frames as Spark SQL tables.
29/49
Python or SQL?
gsod.createTempView("gsod")
stations.createTempView("stations")

spark.sql("""
select country, avg(temp) avg_temp, count(*) count
from gsod
inner join stations
on gsod.stn = stations.usaf
where country is not null
group by country
having count > (12 * 5)
order by avg_temp desc
""").show(5)
We can then query using SQL without leaving Python!
29/49
Python and SQL!
(
spark.sql(
"""
select country, avg(temp) avg_temp, count(*) count
from gsod
inner join stations
on gsod.stn = stations.usaf
group by country"""
)
.where("country is not null")
.where("count > (12 * 5)")
.orderBy("avg_temp", ascending=False)
.show(5)
)
30/49
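The strings passed to where() above go through the same SQL expression parser, so the Column API and SQL expressions are interchangeable here; a small sketch of the same filter written both ways (reusing the gsod and stations views registered earlier):

import pyspark.sql.functions as F

agg = spark.sql(
    "select country, avg(temp) avg_temp, count(*) count "
    "from gsod inner join stations on gsod.stn = stations.usaf "
    "group by country"
)

# The next two lines express the same filter: once with the Column API,
# once as a SQL expression string.
agg.where(F.col("count") > 12 * 5)
agg.where(F.expr("count > (12 * 5)"))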
Python ⇄ Spark
31/49
32/49
33/49
Scalar UDF
import pandas as pd
import pyspark.sql.types as T
 
@F.pandas_udf(T.DoubleType())
def f_to_c(degrees: pd.Series) -> pd.Series:
"""Transforms Farhenheit to Celcius."""
return (degrees - 32) * 5 / 9
 
f_to_c.func(pd.Series(range(32, 213)))
# 0 0.000000
# 1 0.555556
# 2 1.111111
# 3 1.666667
# 4 2.222222
# ...
# 176 97.777778
# 177 98.333333
34/49
Scalar UDF
import pandas as pd
import pyspark.sql.types as T
 
@F.pandas_udf(T.DoubleType())
def f_to_c(degrees: pd.Series) -> pd.Series:
"""Transforms Farhenheit to Celcius."""
return (degrees - 32) * 5 / 9
 
f_to_c.func(pd.Series(range(32, 213)))
# 0 0.000000
# 1 0.555556
# 2 1.111111
# 3 1.666667
# 4 2.222222
# ...
# 176 97.777778
# 177 98.333333
PySpark types are objects in the pyspark.sql.types module.
34/49
Scalar UDF
import pandas as pd
import pyspark.sql.types as T

@F.pandas_udf(T.DoubleType())
def f_to_c(degrees: pd.Series) -> pd.Series:
"""Transforms Fahrenheit to Celsius."""
return (degrees - 32) * 5 / 9

f_to_c.func(pd.Series(range(32, 213)))
# 0 0.000000
# 1 0.555556
# 2 1.111111
# 3 1.666667
# 4 2.222222
# ...
# 176 97.777778
# 177 98.333333
We promote a regular Python function to a User Defined Function via a decorator.
34/49
Scalar UDF
import pandas as pd
import pyspark.sql.types as T

@F.pandas_udf(T.DoubleType())
def f_to_c(degrees: pd.Series) -> pd.Series:
"""Transforms Fahrenheit to Celsius."""
return (degrees - 32) * 5 / 9

f_to_c.func(pd.Series(range(32, 213)))
# 0 0.000000
# 1 0.555556
# 2 1.111111
# 3 1.666667
# 4 2.222222
# ...
# 176 97.777778
# 177 98.333333
A simple function on pandas Series
34/49
Scalar UDF
import pandas as pd
import pyspark.sql.types as T

@F.pandas_udf(T.DoubleType())
def f_to_c(degrees: pd.Series) -> pd.Series:
"""Transforms Fahrenheit to Celsius."""
return (degrees - 32) * 5 / 9

f_to_c.func(pd.Series(range(32, 213)))
# 0 0.000000
# 1 0.555556
# 2 1.111111
# 3 1.666667
# 4 2.222222
# ...
# 176 97.777778
# 177 98.333333
Still unit-testable :-)
34/49
Scalar UDF
gsod = gsod.withColumn("temp_c", f_to_c(F.col("temp")))
gsod.select("temp", "temp_c").distinct().show(5)
 
# +-----+-------------------+
# | temp| temp_c|
# +-----+-------------------+
# | 37.2| 2.8888888888888906|
# | 85.9| 29.944444444444443|
# | 53.5| 11.944444444444445|
# | 71.6| 21.999999999999996|
# |-27.6|-33.111111111111114|
# +-----+-------------------+
# only showing top 5 rows
35/49
Scalar UDF
gsod = gsod.withColumn("temp_c", f_to_c(F.col("temp")))
gsod.select("temp", "temp_c").distinct().show(5)
 
# +-----+-------------------+
# | temp| temp_c|
# +-----+-------------------+
# | 37.2| 2.8888888888888906|
# | 85.9| 29.944444444444443|
# | 53.5| 11.944444444444445|
# | 71.6| 21.999999999999996|
# |-27.6|-33.111111111111114|
# +-----+-------------------+
# only showing top 5 rows
A UDF can be used like any PySpark function.
35/49
Grouped Map UDF
def scale_temperature(temp_by_day: pd.DataFrame) -> pd.DataFrame:
"""Returns a simple normalization of the temperature for a site.
 
If the temperature is constant for the whole window,
defaults to 0.5.
"""
temp = temp_by_day.temp
answer = temp_by_day[["stn", "year", "mo", "da", "temp"]]
if temp.min() == temp.max():
return answer.assign(temp_norm=0.5)
return answer.assign(
temp_norm=(temp - temp.min()) / (temp.max() - temp.min())
)
36/49
Grouped Map UDF
def scale_temperature(temp_by_day: pd.DataFrame) -> pd.DataFrame:
"""Returns a simple normalization of the temperature for a site.
 
If the temperature is constant for the whole window,
defaults to 0.5.
"""
temp = temp_by_day.temp
answer = temp_by_day[["stn", "year", "mo", "da", "temp"]]
if temp.min() == temp.max():
return answer.assign(temp_norm=0.5)
return answer.assign(
temp_norm=(temp - temp.min()) / (temp.max() - temp.min())
)
A regular, fun, harmless function on (pandas) DataFrames
36/49
Grouped Map UDF
def scale_temperature(temp_by_day: pd.DataFrame) -> pd.DataFrame:
"""Returns a simple normalization of the temperature for a site.
 
If the temperature is constant for the whole window,
defaults to 0.5.
"""
temp = temp_by_day.temp
answer = temp_by_day[["stn", "year", "mo", "da", "temp"]]
if temp.min() == temp.max():
return answer.assign(temp_norm=0.5)
return answer.assign(
temp_norm=(temp - temp.min()) / (temp.max() - temp.min())
)
36/49
Grouped Map UDF
scale_temp_schema = (
"stn string, year string, mo string, "
"da string, temp double, temp_norm double"
)
 
gsod = gsod.groupby("stn", "year", "mo").applyInPandas(
scale_temperature, schema=scale_temp_schema
)
 
gsod.show(5, False)
 
# +------+----+---+---+----+------------------+
# |stn |year|mo |da |temp|temp_norm |
# +------+----+---+---+----+------------------+
# |008268|2010|07 |22 |87.4|0.0 |
# |008268|2010|07 |21 |89.6|1.0 |
# |008401|2011|11 |01 |68.2|0.7960000000000003|
37/49
Grouped Map UDF
scale_temp_schema = (
"stn string, year string, mo string, "
"da string, temp double, temp_norm double"
)
 
gsod = gsod.groupby("stn", "year", "mo").applyInPandas(
scale_temperature, schema=scale_temp_schema
)
 
gsod.show(5, False)
 
# +------+----+---+---+----+------------------+
# |stn |year|mo |da |temp|temp_norm |
# +------+----+---+---+----+------------------+
# |008268|2010|07 |22 |87.4|0.0 |
# |008268|2010|07 |21 |89.6|1.0 |
# |008401|2011|11 |01 |68.2|0.7960000000000003|
We provide PySpark the schema we expect our function to return
37/49
Grouped Map UDF
scale_temp_schema = (
"stn string, year string, mo string, "
"da string, temp double, temp_norm double"
)

gsod = gsod.groupby("stn", "year", "mo").applyInPandas(
scale_temperature, schema=scale_temp_schema
)

gsod.show(5, False)

# +------+----+---+---+----+------------------+
# |stn |year|mo |da |temp|temp_norm |
# +------+----+---+---+----+------------------+
# |008268|2010|07 |22 |87.4|0.0 |
# |008268|2010|07 |21 |89.6|1.0 |
# |008401|2011|11 |01 |68.2|0.7960000000000003|
We just have to partition (using groupby), and then applyInPandas!
37/49
Grouped Map UDF
scale_temp_schema = (
"stn string, year string, mo string, "
"da string, temp double, temp_norm double"
)
 
gsod = gsod.groupby("stn", "year", "mo").applyInPandas(
scale_temperature, schema=scale_temp_schema
)
 
gsod.show(5, False)
 
# +------+----+---+---+----+------------------+
# |stn |year|mo |da |temp|temp_norm |
# +------+----+---+---+----+------------------+
# |008268|2010|07 |22 |87.4|0.0 |
# |008268|2010|07 |21 |89.6|1.0 |
# |008401|2011|11 |01 |68.2|0.7960000000000003|
37/49
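Like the scalar UDF, the grouped function stays unit-testable as plain pandas code; a quick sketch with a hand-made group:

import pandas as pd

sample = pd.DataFrame(
    {
        "stn": ["999999"] * 3,
        "year": ["2010"] * 3,
        "mo": ["01"] * 3,
        "da": ["01", "02", "03"],
        "temp": [10.0, 20.0, 30.0],
    }
)
scale_temperature(sample)  # temp_norm comes back as 0.0, 0.5 and 1.0 for this group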
38/49
You are not limited library-wise
from sklearn.linear_model import LinearRegression
 
 
@F.pandas_udf(T.DoubleType())
def rate_of_change_temperature(
day: pd.Series,
temp: pd.Series
) -> float:
"""Returns the slope of the daily temperature
for a given period of time."""
return (
LinearRegression()
.fit(X=day.astype("int").values.reshape(-1, 1), y=temp)
.coef_[0]
)
39/49
result = gsod.groupby("stn", "year", "mo").agg(
rate_of_change_temperature(gsod["da"], gsod["temp_norm"]).alias(
"rt_chg_temp"
)
)
 
result.show(5, False)
# +------+----+---+---------------------+
# |stn |year|mo |rt_chg_temp |
# +------+----+---+---------------------+
# |010250|2018|12 |-0.01014397905759162 |
# |011120|2018|11 |-0.01704736746691528 |
# |011150|2018|10 |-0.013510329829648423|
# |011510|2018|03 |0.020159116598556657 |
# |011800|2018|06 |0.012645501680677372 |
# +------+----+---+---------------------+
# only showing top 5 rows
40/49
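Writing the result out is itself an action (see the Transformations/Actions slide), so this is the point where the whole pipeline actually runs end to end; a sketch, with a hypothetical output path:

# Persist the monthly slopes; write triggers the full computation.
result.write.mode("overwrite").parquet("./output/rate_of_change_temperature.parquet")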
41/49
Not a fan of the syntax?
42/49
43/49
From the README.md
import databricks.koalas as ks
import pandas as pd
 
pdf = pd.DataFrame(
{
'x':range(3),
'y':['a','b','b'],
'z':['a','b','b'],
}
)
 
# Create a Koalas DataFrame from pandas DataFrame
df = ks.from_pandas(pdf)
 
# Rename the columns
df.columns = ['x', 'y', 'z1']
44/49
(Py)Spark in the
cloud
45/49
46/49
"Serverless" Spark?
47/49
"Serverless" Spark?
Cost-effective for sporadic
runs
47/49
"Serverless" Spark?
Cost-effective for sporadic
runs
Scales easily
47/49
"Serverless" Spark?
Cost-effective for sporadic
runs
Scales easily
Simplified maintenance
47/49
"Serverless" Spark?
Cost-effective for sporadic
runs
Scales easily
Simplified maintenance
Easy to become expensive
47/49
"Serverless" Spark?
Cost-effective for sporadic
runs
Scales easily
Simplified maintenance
Easy to become expensive
Sometimes confusing
pricing model
47/49
"Serverless" Spark?
Cost-effective for sporadic
runs
Scales easily
Simplified maintenance
Easy to become expensive
Sometimes confusing
pricing model
Uneven documentation
47/49
Thank you!
48/49
Join www.ServerlessToronto.org
Home of “Less IT Mess”
Mais conteúdo relacionado

Mais procurados

Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Databricks
 

Mais procurados (20)

PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best Practices
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
 
Spark tutorial
Spark tutorialSpark tutorial
Spark tutorial
 
Project Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalProject Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare Metal
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard Maas
 
Introduction to Spark ML
Introduction to Spark MLIntroduction to Spark ML
Introduction to Spark ML
 
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
 
Making Nested Columns as First Citizen in Apache Spark SQL
Making Nested Columns as First Citizen in Apache Spark SQLMaking Nested Columns as First Citizen in Apache Spark SQL
Making Nested Columns as First Citizen in Apache Spark SQL
 
Analyzing Log Data With Apache Spark
Analyzing Log Data With Apache SparkAnalyzing Log Data With Apache Spark
Analyzing Log Data With Apache Spark
 
Up and running with pyspark
Up and running with pysparkUp and running with pyspark
Up and running with pyspark
 
Predictive Analytics with Airflow and PySpark
Predictive Analytics with Airflow and PySparkPredictive Analytics with Airflow and PySpark
Predictive Analytics with Airflow and PySpark
 
Scaling Self Service Analytics with Databricks and Apache Spark with Amelia C...
Scaling Self Service Analytics with Databricks and Apache Spark with Amelia C...Scaling Self Service Analytics with Databricks and Apache Spark with Amelia C...
Scaling Self Service Analytics with Databricks and Apache Spark with Amelia C...
 
Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache Spark
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache Spark
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
 
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
 
PySpark Training | PySpark Tutorial for Beginners | Apache Spark with Python ...
PySpark Training | PySpark Tutorial for Beginners | Apache Spark with Python ...PySpark Training | PySpark Tutorial for Beginners | Apache Spark with Python ...
PySpark Training | PySpark Tutorial for Beginners | Apache Spark with Python ...
 
SystemML - Declarative Machine Learning
SystemML - Declarative Machine LearningSystemML - Declarative Machine Learning
SystemML - Declarative Machine Learning
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark community
 

Semelhante a Intro to PySpark: Python Data Analysis at scale in the Cloud

Berlin buzzwords 2020-feature-store-dowling
Berlin buzzwords 2020-feature-store-dowlingBerlin buzzwords 2020-feature-store-dowling
Berlin buzzwords 2020-feature-store-dowling
Jim Dowling
 
DataScienceLab2017_Cервинг моделей, построенных на больших данных с помощью A...
DataScienceLab2017_Cервинг моделей, построенных на больших данных с помощью A...DataScienceLab2017_Cервинг моделей, построенных на больших данных с помощью A...
DataScienceLab2017_Cервинг моделей, построенных на больших данных с помощью A...
GeeksLab Odessa
 
USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016
USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016
USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016
Chris Fregly
 

Semelhante a Intro to PySpark: Python Data Analysis at scale in the Cloud (20)

Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Mist - Serverless proxy to Apache Spark
Mist - Serverless proxy to Apache SparkMist - Serverless proxy to Apache Spark
Mist - Serverless proxy to Apache Spark
 
Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...
Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...
Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...
 
Agile Data Science
Agile Data ScienceAgile Data Science
Agile Data Science
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
 
Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...
Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...
Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...
 
Spark Summit East NYC Meetup 02-16-2016
Spark Summit East NYC Meetup 02-16-2016  Spark Summit East NYC Meetup 02-16-2016
Spark Summit East NYC Meetup 02-16-2016
 
Dallas DFW Data Science Meetup Jan 21 2016
Dallas DFW Data Science Meetup Jan 21 2016Dallas DFW Data Science Meetup Jan 21 2016
Dallas DFW Data Science Meetup Jan 21 2016
 
Polyalgebra
PolyalgebraPolyalgebra
Polyalgebra
 
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
 
Enabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and REnabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and R
 
Berlin buzzwords 2020-feature-store-dowling
Berlin buzzwords 2020-feature-store-dowlingBerlin buzzwords 2020-feature-store-dowling
Berlin buzzwords 2020-feature-store-dowling
 
Big Data Web applications for Interactive Hadoop by ENRICO BERTI at Big Data...
 Big Data Web applications for Interactive Hadoop by ENRICO BERTI at Big Data... Big Data Web applications for Interactive Hadoop by ENRICO BERTI at Big Data...
Big Data Web applications for Interactive Hadoop by ENRICO BERTI at Big Data...
 
DataScienceLab2017_Cервинг моделей, построенных на больших данных с помощью A...
DataScienceLab2017_Cервинг моделей, построенных на больших данных с помощью A...DataScienceLab2017_Cервинг моделей, построенных на больших данных с помощью A...
DataScienceLab2017_Cервинг моделей, построенных на больших данных с помощью A...
 
USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016
USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016
USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要
 
Big data apache spark + scala
Big data   apache spark + scalaBig data   apache spark + scala
Big data apache spark + scala
 
Istanbul Spark Meetup Nov 28 2015
Istanbul Spark Meetup Nov 28 2015Istanbul Spark Meetup Nov 28 2015
Istanbul Spark Meetup Nov 28 2015
 

Mais de Daniel Zivkovic

Opinionated re:Invent recap with AWS Heroes & Builders
Opinionated re:Invent recap with AWS Heroes & BuildersOpinionated re:Invent recap with AWS Heroes & Builders
Opinionated re:Invent recap with AWS Heroes & Builders
Daniel Zivkovic
 
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML Engineers
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML EngineersIntro to Vertex AI, unified MLOps platform for Data Scientists & ML Engineers
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML Engineers
Daniel Zivkovic
 
This is my Architecture to prevent Cloud Bill Shock
This is my Architecture to prevent Cloud Bill ShockThis is my Architecture to prevent Cloud Bill Shock
This is my Architecture to prevent Cloud Bill Shock
Daniel Zivkovic
 
Azure for AWS & GCP Pros: Which Azure services to use?
Azure for AWS & GCP Pros: Which Azure services to use?Azure for AWS & GCP Pros: Which Azure services to use?
Azure for AWS & GCP Pros: Which Azure services to use?
Daniel Zivkovic
 
Serverless Evolution during 3 years of Serverless Toronto
Serverless Evolution during 3 years of Serverless TorontoServerless Evolution during 3 years of Serverless Toronto
Serverless Evolution during 3 years of Serverless Toronto
Daniel Zivkovic
 

Mais de Daniel Zivkovic (20)

All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...
 
Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...
Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...
Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...
 
Opinionated re:Invent recap with AWS Heroes & Builders
Opinionated re:Invent recap with AWS Heroes & BuildersOpinionated re:Invent recap with AWS Heroes & Builders
Opinionated re:Invent recap with AWS Heroes & Builders
 
Google Cloud Next '22 Recap: Serverless & Data edition
Google Cloud Next '22 Recap: Serverless & Data editionGoogle Cloud Next '22 Recap: Serverless & Data edition
Google Cloud Next '22 Recap: Serverless & Data edition
 
Conversational Document Processing AI with Rui Costa
Conversational Document Processing AI with Rui CostaConversational Document Processing AI with Rui Costa
Conversational Document Processing AI with Rui Costa
 
How to build unified Batch & Streaming Pipelines with Apache Beam and Dataflow
How to build unified Batch & Streaming Pipelines with Apache Beam and DataflowHow to build unified Batch & Streaming Pipelines with Apache Beam and Dataflow
How to build unified Batch & Streaming Pipelines with Apache Beam and Dataflow
 
Gojko's 5 rules for super responsive Serverless applications
Gojko's 5 rules for super responsive Serverless applicationsGojko's 5 rules for super responsive Serverless applications
Gojko's 5 rules for super responsive Serverless applications
 
Retail Analytics and BI with Looker, BigQuery, GCP & Leigha Jarett
Retail Analytics and BI with Looker, BigQuery, GCP & Leigha JarettRetail Analytics and BI with Looker, BigQuery, GCP & Leigha Jarett
Retail Analytics and BI with Looker, BigQuery, GCP & Leigha Jarett
 
What's new in Serverless at AWS?
What's new in Serverless at AWS?What's new in Serverless at AWS?
What's new in Serverless at AWS?
 
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML Engineers
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML EngineersIntro to Vertex AI, unified MLOps platform for Data Scientists & ML Engineers
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML Engineers
 
Empowering Developers to be Healthcare Heroes
Empowering Developers to be Healthcare HeroesEmpowering Developers to be Healthcare Heroes
Empowering Developers to be Healthcare Heroes
 
Get started with Dialogflow & Contact Center AI on Google Cloud
Get started with Dialogflow & Contact Center AI on Google CloudGet started with Dialogflow & Contact Center AI on Google Cloud
Get started with Dialogflow & Contact Center AI on Google Cloud
 
Building a Data Cloud to enable Analytics & AI-Driven Innovation - Lak Lakshm...
Building a Data Cloud to enable Analytics & AI-Driven Innovation - Lak Lakshm...Building a Data Cloud to enable Analytics & AI-Driven Innovation - Lak Lakshm...
Building a Data Cloud to enable Analytics & AI-Driven Innovation - Lak Lakshm...
 
Smart Cities of Italy: Integrating the Cyber World with the IoT
Smart Cities of Italy: Integrating the Cyber World with the IoTSmart Cities of Italy: Integrating the Cyber World with the IoT
Smart Cities of Italy: Integrating the Cyber World with the IoT
 
Running Business Analytics for a Serverless Insurance Company - Joe Emison & ...
Running Business Analytics for a Serverless Insurance Company - Joe Emison & ...Running Business Analytics for a Serverless Insurance Company - Joe Emison & ...
Running Business Analytics for a Serverless Insurance Company - Joe Emison & ...
 
This is my Architecture to prevent Cloud Bill Shock
This is my Architecture to prevent Cloud Bill ShockThis is my Architecture to prevent Cloud Bill Shock
This is my Architecture to prevent Cloud Bill Shock
 
Lunch & Learn BigQuery & Firebase from other Google Cloud customers
Lunch & Learn BigQuery & Firebase from other Google Cloud customersLunch & Learn BigQuery & Firebase from other Google Cloud customers
Lunch & Learn BigQuery & Firebase from other Google Cloud customers
 
Azure for AWS & GCP Pros: Which Azure services to use?
Azure for AWS & GCP Pros: Which Azure services to use?Azure for AWS & GCP Pros: Which Azure services to use?
Azure for AWS & GCP Pros: Which Azure services to use?
 
Serverless Evolution during 3 years of Serverless Toronto
Serverless Evolution during 3 years of Serverless TorontoServerless Evolution during 3 years of Serverless Toronto
Serverless Evolution during 3 years of Serverless Toronto
 
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCPSimpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
 

Último

Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
VictoriaMetrics
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
masabamasaba
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
VictorSzoltysek
 

Último (20)

Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
tonesoftg
tonesoftgtonesoftg
tonesoftg
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 

Intro to PySpark: Python Data Analysis at scale in the Cloud

  • 1. Welcome to ServerlessToronto.org “Home of Less IT Mess” 1 Introduce Yourself ☺ - Why are you here? - Looking for work? - Offering work? Our feature presentation “Intro to PySpark” starts at 6:20pm…
  • 2. Serverless is not just about the Tech: 2 Serverless is New Agile & Mindset Serverless Dev (gluing other people’s APIs and managed services) We're obsessed to creating business value (meaningful MVPs, products), by helping Startups & empowering Business users! We build bridges between Serverless Community (“Dev leg”), and Front-end & Voice- First folks (“UX leg”), and empower UX developers Achieve agility NOT by “sprinting” faster (like in Scrum), but by working smarter (by using bigger building blocks and less Ops)
  • 3. Upcoming #ServerlessTO Online Meetups 3 1. Accelerating with a Cloud Contact Center – Patrick Kolencherry Sr. Product Marketing Manager, and Karla Nussbaumer, Head of Technical Marketing at Twilio **JULY 9 @ 6pm ** 2. Your Presentation ☺ ** WHY NOT SHARE THE KNOWLEDGE?
  • 4. Feature Talk Jonathan Rioux, Head Data Scientist at EPAM Systems & author of Manning book PySpark in Action 4
  • 6. If you have not filled the Meetup survey, now is the time to do it! (Also copied in the chat) https://forms.gle/6cyWGVY4L4GJvsXh7 2/49
  • 8. Hi! I'm Jonathan Data Scientist, Engineer, Enthusiast 3/49
  • 9. Hi! I'm Jonathan Data Scientist, Engineer, Enthusiast Head of DS @ EPAM Canada 3/49
  • 10. Hi! I'm Jonathan Data Scientist, Engineer, Enthusiast Head of DS @ EPAM Canada Author of PySpark in Action → 3/49
  • 11. Hi! I'm Jonathan Data Scientist, Engineer, Enthusiast Head of DS @ EPAM Canada Author of PySpark in Action → <3 Spark, <3 <3 Python 3/49
  • 12. 4/49
  • 13. 5/49
  • 14. Goals of this presentation 6/49
  • 15. Goals of this presentation Share my love of (Py)Spark 6/49
  • 16. Goals of this presentation Share my love of (Py)Spark Explain where PySpark shines 6/49
  • 17. Goals of this presentation Share my love of (Py)Spark Explain where PySpark shines Introduce the Python + Spark interop 6/49
  • 18. Goals of this presentation Share my love of (Py)Spark Explain where PySpark shines Introduce the Python + Spark interop Get you excited about using PySpark 6/49
  • 19. Goals of this presentation Share my love of (Py)Spark Explain where PySpark shines Introduce the Python + Spark interop Get you excited about using PySpark 36,000 ft overview: Managed Spark in the Cloud 6/49
  • 20. What I expect from you 7/49
  • 21. What I expect from you You know a little bit of Python 7/49
  • 22. What I expect from you You know a little bit of Python You know what SQL is 7/49
  • 23. What I expect from you You know a little bit of Python You know what SQL is You won't hesitate to ask questions :-) 7/49
  • 24. What is Spark Spark is a unified analytics engine for large-scale data processing 8/49
  • 25. What is Spark (bis) Spark can be thought of a data factory that you (mostly) program like a cohesive computer. 9/49
  • 26. Spark under the hood 10/49
  • 27. Spark as an analytics factory 11/49
  • 29. Data manipulation uses the same vocabulary as SQL ( my_table .select("id", "first_name", "last_name", "age") .where(col("age") > 21) .groupby("age") .count("*") ) 13/49
  • 30. Data manipulation uses the same vocabulary as SQL .select("id", "first_name", "last_name", "age") ( my_table .where(col("age") > 21) .groupby("age") .count("*") ) select 13/49
  • 31. Data manipulation uses the same vocabulary as SQL .where(col("age") > 21) ( my_table .select("id", "first_name", "last_name", "age") .groupby("age") .count("*") ) where 13/49
  • 32. Data manipulation uses the same vocabulary as SQL .groupby("age") ( my_table .select("id", "first_name", "last_name", "age") .where(col("age") > 21) .count("*") ) group by 13/49
  • 33. Data manipulation uses the same vocabulary as SQL .count("*") ( my_table .select("id", "first_name", "last_name", "age") .where(col("age") > 21) .groupby("age") ) count 13/49
  • 34. I mean, you can legitimately use SQL spark.sql(""" select count(*) from ( select id, first_name, last_name, age from my_table where age > 21 ) group by age""") 14/49
  • 35. Data manipulation and machine learning with a uent API results = ( spark.read.text("./data/Ch02/1342-0.txt") .select(F.split(F.col("value"), " ").alias("line")) .select(F.explode(F.col("line")).alias("word")) .select(F.lower(F.col("word")).alias("word")) .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word")) .where(F.col("word") != "") .groupby(F.col("word")) .count() ) 15/49
  • 36. Data manipulation and machine learning with a uent API spark.read.text("./data/Ch02/1342-0.txt") results = ( .select(F.split(F.col("value"), " ").alias("line")) .select(F.explode(F.col("line")).alias("word")) .select(F.lower(F.col("word")).alias("word")) .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word")) .where(F.col("word") != "") .groupby(F.col("word")) .count() ) Read a text file 15/49
  • 37. Data manipulation and machine learning with a uent API .select(F.split(F.col("value"), " ").alias("line")) results = ( spark.read.text("./data/Ch02/1342-0.txt") .select(F.explode(F.col("line")).alias("word")) .select(F.lower(F.col("word")).alias("word")) .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word")) .where(F.col("word") != "") .groupby(F.col("word")) .count() ) Select the column value, where each element is splitted (space as a separator). Alias to line. 15/49
  • 38. Data manipulation and machine learning with a uent API .select(F.explode(F.col("line")).alias("word")) results = ( spark.read.text("./data/Ch02/1342-0.txt") .select(F.split(F.col("value"), " ").alias("line")) .select(F.lower(F.col("word")).alias("word")) .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word")) .where(F.col("word") != "") .groupby(F.col("word")) .count() ) Explode each element of line into its own record. Alias to word. 15/49
  • 39. Data manipulation and machine learning with a uent API .select(F.lower(F.col("word")).alias("word")) results = ( spark.read.text("./data/Ch02/1342-0.txt") .select(F.split(F.col("value"), " ").alias("line")) .select(F.explode(F.col("line")).alias("word")) .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word")) .where(F.col("word") != "") .groupby(F.col("word")) .count() ) Lower-case each word 15/49
  • 40. Data manipulation and machine learning with a uent API .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word")) results = ( spark.read.text("./data/Ch02/1342-0.txt") .select(F.split(F.col("value"), " ").alias("line")) .select(F.explode(F.col("line")).alias("word")) .select(F.lower(F.col("word")).alias("word")) .where(F.col("word") != "") .groupby(F.col("word")) .count() ) Extract only the first group of lower-case letters from each word. 15/49
  • 41. Data manipulation and machine learning with a uent API .where(F.col("word") != "") results = ( spark.read.text("./data/Ch02/1342-0.txt") .select(F.split(F.col("value"), " ").alias("line")) .select(F.explode(F.col("line")).alias("word")) .select(F.lower(F.col("word")).alias("word")) .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word")) .groupby(F.col("word")) .count() ) Keep only the records where the word is not the empty string. 15/49
  • 42. Data manipulation and machine learning with a uent API .groupby(F.col("word")) results = ( spark.read.text("./data/Ch02/1342-0.txt") .select(F.split(F.col("value"), " ").alias("line")) .select(F.explode(F.col("line")).alias("word")) .select(F.lower(F.col("word")).alias("word")) .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word")) .where(F.col("word") != "") .count() ) Group by word 15/49
  • 43. Data manipulation and machine learning with a uent API .count() results = ( spark.read.text("./data/Ch02/1342-0.txt") .select(F.split(F.col("value"), " ").alias("line")) .select(F.explode(F.col("line")).alias("word")) .select(F.lower(F.col("word")).alias("word")) .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word")) .where(F.col("word") != "") .groupby(F.col("word")) ) Count the number of records in each group 15/49
  • 44.           Scala is not the only player in town 16/49
  • 46. 18/49
  • 47. Summoning PySpark from pyspark.sql import SparkSession   spark = SparkSession.builder.config( "spark.jars.packages", ("com.google.cloud.spark:" "spark-bigquery-with-dependencies_2.12:0.16.1") ).getOrCreate() 19/49
  • 48. Summoning PySpark from pyspark.sql import SparkSession   spark = SparkSession.builder.config( "spark.jars.packages", ("com.google.cloud.spark:" "spark-bigquery-with-dependencies_2.12:0.16.1") ).getOrCreate() A SparkSession is your entry point to distributed data manipulation 19/49
  • 49. Summoning PySpark spark = SparkSession.builder.config( "spark.jars.packages", ("com.google.cloud.spark:" "spark-bigquery-with-dependencies_2.12:0.16.1") ).getOrCreate() from pyspark.sql import SparkSession   We create our SparkSession with an optional library to access BigQuery as a data source. 19/49
  • 50. Reading data from functools import reduce from pyspark.sql import DataFrame     def read_df_from_bq(year): return ( spark.read.format("bigquery") .option("table", f"bigquery-public-data.noaa_gsod.gsod{year}") .option("credentialsFile", "bq-key.json") .load() )     gsod = ( reduce( DataFrame.union, [read_df_from_bq(year) for year in range(2010, 2020)] 20/49
  • 51. Reading data def read_df_from_bq(year): return ( spark.read.format("bigquery") .option("table", f"bigquery-public-data.noaa_gsod.gsod{year}") .option("credentialsFile", "bq-key.json") .load() ) from functools import reduce from pyspark.sql import DataFrame         gsod = ( reduce( DataFrame.union, [read_df_from_bq(year) for year in range(2010, 2020)] We create a helper function to read our code from BigQuery. 20/49
  • 52. Reading data gsod = ( reduce( DataFrame.union, [read_df_from_bq(year) for year in range(2010, 2020)] ) )     def read_df_from_bq(year): return ( spark.read.format("bigquery") .option("table", f"bigquery-public-data.noaa_gsod.gsod{year}") .option("credentialsFile", "bq-key.json") .load() )     A DataFrame is a regular Python object. 20/49
  • 53. Using the power of the schema gsod.printSchema() # root # |-- stn: string (nullable = true) # |-- wban: string (nullable = true) # |-- year: string (nullable = true) # |-- mo: string (nullable = true) # |-- da: string (nullable = true) # |-- temp: double (nullable = true) # |-- count_temp: long (nullable = true) # |-- dewp: double (nullable = true) # |-- count_dewp: long (nullable = true) # |-- slp: double (nullable = true) # |-- count_slp: long (nullable = true) # |-- stp: double (nullable = true) # |-- count_stp: long (nullable = true) # |-- visib: double (nullable = true) # [...] 21/49
  • 54. Using the power of the schema # root # |-- stn: string (nullable = true) # |-- wban: string (nullable = true) # |-- year: string (nullable = true) # |-- mo: string (nullable = true) # |-- da: string (nullable = true) # |-- temp: double (nullable = true) # |-- count_temp: long (nullable = true) # |-- dewp: double (nullable = true) # |-- count_dewp: long (nullable = true) # |-- slp: double (nullable = true) # |-- count_slp: long (nullable = true) # |-- stp: double (nullable = true) # |-- count_stp: long (nullable = true) # |-- visib: double (nullable = true) # [...] gsod.printSchema() The schema will give us the column names and their types. 21/49
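  The schema is not only for printing; it is also available programmatically, which helps for validation or for reusing it on other reads. A quick sketch on the gsod frame:

  gsod.columns   # list of column names, e.g. ['stn', 'wban', 'year', ...]
  gsod.dtypes    # list of (name, type) pairs, e.g. [('stn', 'string'), ('temp', 'double'), ...]
  gsod.schema    # a StructType you can inspect or pass to spark.read.schema(...)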
  • 55. And showing data gsod = gsod.select("stn", "year", "mo", "da", "temp")   gsod.show(5)   # Approximately 5 seconds waiting # +------+----+---+---+----+ # | stn|year| mo| da|temp| # +------+----+---+---+----+ # |359250|2010| 02| 25|25.2| # |359250|2010| 05| 25|65.0| # |386130|2010| 02| 19|35.4| # |386130|2010| 03| 15|52.2| # |386130|2010| 01| 21|37.9| # +------+----+---+---+----+ # only showing top 5 rows 22/49
  • 56. What happens behind the scenes? 23/49
  • 57. Data frame transformations are only recorded until the data is actually needed. When we trigger an action, (Py)Spark optimizes the query plan, selects the best physical plan, and applies the transformations to the data. 24/49
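  A tiny sketch of that laziness on the gsod frame (the temperature threshold is arbitrary):

  import pyspark.sql.functions as F

  only_warm = gsod.where(F.col("temp") > 60.0).select("stn", "temp")  # transformation: only recorded
  only_warm.explain()   # prints the plan Spark intends to run, without running it
  only_warm.show(5)     # action: the plan actually executes here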
  • 67. Something a little more complex import pyspark.sql.functions as F   stations = ( spark.read.format("bigquery") .option("table", "bigquery-public-data.noaa_gsod.stations") .option("credentialsFile", "bq-key.json") .load() )   # We want the "hottest countries" with more than 60 measurements answer = ( gsod.join(stations, gsod["stn"] == stations["usaf"]) .where(F.col("country").isNotNull()) .groupBy("country") .agg(F.avg("temp").alias("avg_temp"), F.count("*").alias("count")) ).where(F.col("count") > 12 * 5) read, join, where, groupby, avg/count, where, orderby, show 26/49
  • 68. read, join, where, groupby, avg/count, where, orderby, show 27/49
  • 69. read, join, where, groupby, avg/count, where, orderby, show 28/49
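  The recap above ends with orderby and show; continuing the answer frame from the previous slides, those last two steps would look like this (a sketch):

  answer.orderBy("avg_temp", ascending=False).show(5)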
  • 70. Python or SQL? gsod.createTempView("gsod") stations.createTempView("stations")   spark.sql(""" select country, avg(temp) avg_temp, count(*) count from gsod inner join stations on gsod.stn = stations.usaf where country is not null group by country having count > (12 * 5) order by avg_temp desc """).show(5) 29/49
  • 71. Python or SQL? gsod.createTempView("gsod") stations.createTempView("stations")   spark.sql(""" select country, avg(temp) avg_temp, count(*) count from gsod inner join stations on gsod.stn = stations.usaf where country is not null group by country having count > (12 * 5) order by avg_temp desc """).show(5) We register the data frames as Spark SQL tables. 29/49
  • 72. Python or SQL?   spark.sql(""" select country, avg(temp) avg_temp, count(*) count from gsod inner join stations on gsod.stn = stations.usaf where country is not null group by country having count > (12 * 5) order by avg_temp desc """).show(5) gsod.createTempView("gsod") stations.createTempView("stations") We can then query using SQL without leaving Python! 29/49
  • 73. Python and SQL! ( spark.sql( """ select country, avg(temp) avg_temp, count(*) count from gsod inner join stations on gsod.stn = stations.usaf group by country""" ) .where("country is not null") .where("count > (12 * 5)") .orderBy("avg_temp", ascending=False) .show(5) ) 30/49
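  SQL snippets can also be dropped straight into the DataFrame API with F.expr; a small sketch on the gsod frame (the 20 °C threshold is arbitrary):

  import pyspark.sql.functions as F

  (
      gsod.select("stn", F.expr("(temp - 32) * 5 / 9").alias("temp_c"))
      .where(F.expr("temp_c > 20"))
      .show(5)
  )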
  • 77. Scalar UDF import pandas as pd import pyspark.sql.types as T   @F.pandas_udf(T.DoubleType()) def f_to_c(degrees: pd.Series) -> pd.Series: """Transforms Fahrenheit to Celsius.""" return (degrees - 32) * 5 / 9   f_to_c.func(pd.Series(range(32, 213))) # 0 0.000000 # 1 0.555556 # 2 1.111111 # 3 1.666667 # 4 2.222222 # ... # 176 97.777778 # 177 98.333333 34/49
  • 78. Scalar UDF import pandas as pd import pyspark.sql.types as T   @F.pandas_udf(T.DoubleType()) def f_to_c(degrees: pd.Series) -> pd.Series: """Transforms Fahrenheit to Celsius.""" return (degrees - 32) * 5 / 9   f_to_c.func(pd.Series(range(32, 213))) # 0 0.000000 # 1 0.555556 # 2 1.111111 # 3 1.666667 # 4 2.222222 # ... # 176 97.777778 # 177 98.333333 PySpark types are objects in the pyspark.sql.types module. 34/49
  • 79. Scalar UDF @F.pandas_udf(T.DoubleType()) import pandas as pd import pyspark.sql.types as T   def f_to_c(degrees: pd.Series) -> pd.Series: """Transforms Fahrenheit to Celsius.""" return (degrees - 32) * 5 / 9   f_to_c.func(pd.Series(range(32, 213))) # 0 0.000000 # 1 0.555556 # 2 1.111111 # 3 1.666667 # 4 2.222222 # ... # 176 97.777778 # 177 98.333333 We promote a regular Python function to a User Defined Function via a decorator. 34/49
  • 80. Scalar UDF def f_to_c(degrees: pd.Series) -> pd.Series: """Transforms Fahrenheit to Celsius.""" return (degrees - 32) * 5 / 9 import pandas as pd import pyspark.sql.types as T   @F.pandas_udf(T.DoubleType())   f_to_c.func(pd.Series(range(32, 213))) # 0 0.000000 # 1 0.555556 # 2 1.111111 # 3 1.666667 # 4 2.222222 # ... # 176 97.777778 # 177 98.333333 A simple function on pandas Series 34/49
  • 81. Scalar UDF f_to_c.func(pd.Series(range(32, 213))) import pandas as pd import pyspark.sql.types as T   @F.pandas_udf(T.DoubleType()) def f_to_c(degrees: pd.Series) -> pd.Series: """Transforms Fahrenheit to Celsius.""" return (degrees - 32) * 5 / 9   # 0 0.000000 # 1 0.555556 # 2 1.111111 # 3 1.666667 # 4 2.222222 # ... # 176 97.777778 # 177 98.333333 Still unit-testable :-) 34/49
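  Because the decorator keeps the original function reachable under .func, it slots into a plain pytest suite; a minimal sketch (the test name is illustrative):

  import pandas as pd
  import pandas.testing as pdt

  def test_f_to_c_freezing_and_boiling():
      degrees_f = pd.Series([32.0, 212.0])
      expected = pd.Series([0.0, 100.0])
      pdt.assert_series_equal(f_to_c.func(degrees_f), expected)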
  • 82. Scalar UDF gsod = gsod.withColumn("temp_c", f_to_c(F.col("temp"))) gsod.select("temp", "temp_c").distinct().show(5)   # +-----+-------------------+ # | temp| temp_c| # +-----+-------------------+ # | 37.2| 2.8888888888888906| # | 85.9| 29.944444444444443| # | 53.5| 11.944444444444445| # | 71.6| 21.999999999999996| # |-27.6|-33.111111111111114| # +-----+-------------------+ # only showing top 5 rows 35/49
  • 83. Scalar UDF gsod = gsod.withColumn("temp_c", f_to_c(F.col("temp"))) gsod.select("temp", "temp_c").distinct().show(5)   # +-----+-------------------+ # | temp| temp_c| # +-----+-------------------+ # | 37.2| 2.8888888888888906| # | 85.9| 29.944444444444443| # | 53.5| 11.944444444444445| # | 71.6| 21.999999999999996| # |-27.6|-33.111111111111114| # +-----+-------------------+ # only showing top 5 rows A UDF can be used like any PySpark function. 35/49
  • 84. Grouped Map UDF def scale_temperature(temp_by_day: pd.DataFrame) -> pd.DataFrame: """Returns a simple normalization of the temperature for a site.   If the temperature is constant for the whole window, defaults to 0.5. """ temp = temp_by_day.temp answer = temp_by_day[["stn", "year", "mo", "da", "temp"]] if temp.min() == temp.max(): return answer.assign(temp_norm=0.5) return answer.assign( temp_norm=(temp - temp.min()) / (temp.max() - temp.min()) ) 36/49
  • 85. Grouped Map UDF def scale_temperature(temp_by_day: pd.DataFrame) -> pd.DataFrame: """Returns a simple normalization of the temperature for a site.   If the temperature is constant for the whole window, defaults to 0.5. """ temp = temp_by_day.temp answer = temp_by_day[["stn", "year", "mo", "da", "temp"]] if temp.min() == temp.max(): return answer.assign(temp_norm=0.5) return answer.assign( temp_norm=(temp - temp.min()) / (temp.max() - temp.min()) ) A regular, fun, harmless function on (pandas) DataFrames 36/49
  • 86. Grouped Map UDF def scale_temperature(temp_by_day: pd.DataFrame) -> pd.DataFrame: """Returns a simple normalization of the temperature for a site.   If the temperature is constant for the whole window, defaults to 0.5. """ temp = temp_by_day.temp answer = temp_by_day[["stn", "year", "mo", "da", "temp"]] if temp.min() == temp.max(): return answer.assign(temp_norm=0.5) return answer.assign( temp_norm=(temp - temp.min()) / (temp.max() - temp.min()) ) 36/49
  • 87. Grouped Map UDF scale_temp_schema = ( "stn string, year string, mo string, " "da string, temp double, temp_norm double" )   gsod = gsod.groupby("stn", "year", "mo").applyInPandas( scale_temperature, schema=scale_temp_schema )   gsod.show(5, False)   # +------+----+---+---+----+------------------+ # |stn |year|mo |da |temp|temp_norm | # +------+----+---+---+----+------------------+ # |008268|2010|07 |22 |87.4|0.0 | # |008268|2010|07 |21 |89.6|1.0 | # |008401|2011|11 |01 |68.2|0.7960000000000003| 37/49
  • 88. Grouped Map UDF scale_temp_schema = ( "stn string, year string, mo string, " "da string, temp double, temp_norm double" )   gsod = gsod.groupby("stn", "year", "mo").applyInPandas( scale_temperature, schema=scale_temp_schema )   gsod.show(5, False)   # +------+----+---+---+----+------------------+ # |stn |year|mo |da |temp|temp_norm | # +------+----+---+---+----+------------------+ # |008268|2010|07 |22 |87.4|0.0 | # |008268|2010|07 |21 |89.6|1.0 | # |008401|2011|11 |01 |68.2|0.7960000000000003| We provide PySpark the schema we expect our function to return 37/49
  • 89. Grouped Map UDF gsod = gsod.groupby("stn", "year", "mo").applyInPandas( scale_temperature, schema=scale_temp_schema ) scale_temp_schema = ( "stn string, year string, mo string, " "da string, temp double, temp_norm double" )     gsod.show(5, False)   # +------+----+---+---+----+------------------+ # |stn |year|mo |da |temp|temp_norm | # +------+----+---+---+----+------------------+ # |008268|2010|07 |22 |87.4|0.0 | # |008268|2010|07 |21 |89.6|1.0 | # |008401|2011|11 |01 |68.2|0.7960000000000003| We just have to partition (using group), and then applyInPandas! 37/49
  • 90. Grouped Map UDF scale_temp_schema = ( "stn string, year string, mo string, " "da string, temp double, temp_norm double" )   gsod = gsod.groupby("stn", "year", "mo").applyInPandas( scale_temperature, schema=scale_temp_schema )   gsod.show(5, False)   # +------+----+---+---+----+------------------+ # |stn |year|mo |da |temp|temp_norm | # +------+----+---+---+----+------------------+ # |008268|2010|07 |22 |87.4|0.0 | # |008268|2010|07 |21 |89.6|1.0 | # |008401|2011|11 |01 |68.2|0.7960000000000003| 37/49
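  For comparison only (not what the slides do), the same normalization can be sketched with Spark's built-in window functions, which avoids the pandas round trip; column names are the original gsod ones:

  import pyspark.sql.functions as F
  from pyspark.sql.window import Window

  w = Window.partitionBy("stn", "year", "mo")
  lo, hi = F.min("temp").over(w), F.max("temp").over(w)
  gsod_native = gsod.withColumn(
      "temp_norm",
      F.when(hi == lo, F.lit(0.5)).otherwise((F.col("temp") - lo) / (hi - lo)),
  )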
  • 92. You are not limited library-wise from sklearn.linear_model import LinearRegression     @F.pandas_udf(T.DoubleType()) def rate_of_change_temperature( day: pd.Series, temp: pd.Series ) -> float: """Returns the slope of the daily temperature for a given period of time.""" return ( LinearRegression() .fit(X=day.astype("int").values.reshape(-1, 1), y=temp) .coef_[0] ) 39/49
  • 93. result = gsod.groupby("stn", "year", "mo").agg( rate_of_change_temperature(gsod["da"], gsod["temp_norm"]).alias( "rt_chg_temp" ) )   result.show(5, False) # +------+----+---+---------------------+ # |stn |year|mo |rt_chg_temp | # +------+----+---+---------------------+ # |010250|2018|12 |-0.01014397905759162 | # |011120|2018|11 |-0.01704736746691528 | # |011150|2018|10 |-0.013510329829648423| # |011510|2018|03 |0.020159116598556657 | # |011800|2018|06 |0.012645501680677372 | # +------+----+---+---------------------+ # only showing top 5 rows 40/49
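  To surface the fastest-warming station-months first, the result can be sorted like any other data frame (a sketch):

  result.orderBy("rt_chg_temp", ascending=False).show(5)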
  • 95. Not a fan of the syntax? 42/49
  • 97. From the README.md import databricks.koalas as ks import pandas as pd   pdf = pd.DataFrame( { 'x':range(3), 'y':['a','b','b'], 'z':['a','b','b'], } )   # Create a Koalas DataFrame from pandas DataFrame df = ks.from_pandas(pdf)   # Rename the columns df.columns = ['x', 'y', 'z1'] 44/49
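  Koalas frames and Spark data frames convert back and forth, so you can drop to the native API when needed; a small sketch continuing the README example (variable names are illustrative):

  sdf = df.to_spark()     # Koalas DataFrame -> Spark DataFrame
  kdf = sdf.to_koalas()   # back to Koalas, available once databricks.koalas is imported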
  • 102. "Serverless" Spark? Cost-effective for sporadic runs Scales easily 47/49
  • 103. "Serverless" Spark? Cost-effective for sporadic runs Scales easily Simplified maintenance 47/49
  • 104. "Serverless" Spark? Cost-effective for sporadic runs Scales easily Simplified maintenance Easy to become expensive 47/49
  • 105. "Serverless" Spark? Cost-effective for sporadic runs Scales easily Simplified maintenance Easy to become expensive Sometimes confusing pricing model 47/49
  • 106. "Serverless" Spark? Cost-effective for sporadic runs Scales easily Simplified maintenance Easy to become expensive Sometimes confusing pricing model Uneven documentation 47/49