Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing in Apache Spark

Michael Rys & Rahul Potharaju
Microsoft Corp. Big Data Team
@MikeDoesBigData,@RahulPotharaju
#DotNETForSpark #vslive

• Introducing .NET for Apache® Spark™ for building data pipelines
– Why do we need .NET for Apache Spark?
– What is .NET for Apache Spark?
– Can I use .NET for Apache Spark with Azure HDInsight Spark, Azure Databricks, etc.?
– Show me some examples!

Azure modern data warehouse architecture
• Sources: logs, files and media (unstructured); business/custom apps (structured)
• Ingest: Azure Data Factory (also used for orchestration and data flow ETL)
• Store: Azure Data Lake Storage
• Prep & train: Azure Databricks, Azure HDInsight Spark (Python, Scala, Spark SQL, .NET for Apache Spark)
• Model & serve: Azure SQL Data Warehouse (loaded via PolyBase), Azure Analysis Services, Power BI
Azure also supports other Big Data services like Azure Data Lake to allow customers to tailor the above architecture to meet their unique needs.

• Apache Spark is an OSS fast analytics engine for big data and machine learning
• Improves efficiency through:
– General computation graphs beyond map/reduce
– In-memory computing primitives
• Allows developers to scale out their user code & write in their language of choice
– Rich APIs in Java, Scala, Python, R, SparkSQL etc.
– Batch processing, streaming and interactive shell
• Available on Azure via
– Azure Databricks
– Azure HDInsight
– IaaS/Kubernetes

.NET Developers 💖 Apache Spark…
• A lot of big data-usable business logic (millions of lines of code) is written in .NET!
• Expensive and difficult to translate into Python/Scala/Java!
• Locked out from big data processing due to lack of .NET support in OSS big data solutions
• In a recently conducted .NET developer survey (> 1000 developers), more than 70% expressed interest in Apache Spark!
• Would like to tap into the OSS ecosystem for code libraries, support, and hiring

Goal: .NET for Apache Spark is aimed at providing .NET developers a first-class experience when working with Apache Spark.
Non-Goal: Converting existing Scala/Python/Java Spark developers.

Microsoft is committed…
• Interop layer for .NET (Scala-side)
• Potentially optimizing Python and R interop layers
• Technical documentation, blogs and articles
• End-to-end scenarios
• Performance benchmarking (cluster)
• Production workloads
• Out of box with Azure HDInsight, easy to use with Azure Databricks
• C# (and F#) language extensions using .NET
• Performance benchmarking (interop)
• Portability aspects (e.g., cross-platform .NET Standard)
• Tooling (e.g., Jupyter, Visual Studio, Visual Studio Code)

… and developing in the open!
Contributions to foundational OSS projects:
• Apache Arrow: ARROW-4997, ARROW-5019, ARROW-4839, ARROW-4502, ARROW-4737, ARROW-4543, ARROW-4435
• Pyrolite (pickling library): improve pickling/unpickling performance, add a strong name to Pyrolite
.NET for Apache Spark was open sourced at Spark + AI Summit 2019
• Website: https://dot.net/spark
• GitHub: https://github.com/dotnet/spark
• Version 0.4 released at the end of July 2019
Spark project improvement proposals:
• Interop support for Spark language extensions: SPARK-26257
• .NET bindings for Apache Spark: SPARK-27006

Journey since //Build 2019 (~3 months)
• ~1k GitHub unique visitors/week
• ~7k GitHub page views/week
• 63 GitHub issues closed
• 86 GitHub PRs merged
• ~1.9k NuGet downloads

.NET provides full-spectrum Spark support
• Spark DataFrames with SparkSQL: works with Spark v2.3.x/v2.4.[0/1] and includes ~300 SparkSQL functions, Grouped Map (Reducer, v0.4), and .NET Spark UDFs
• Batch & streaming: including Spark Structured Streaming and all Spark-supported data sources
• .NET Standard 2.0: works with .NET Framework v4.6.1+ and .NET Core v2.1+ and includes C#/F# support
• Machine learning: including access to ML.NET
• Speed & productivity: performance-optimized interop, as fast or faster than pySpark; support for HW vectorization (v0.4)
Examples: https://github.com/dotnet/spark/examples

Introduction to Spark Programming: DataFrame
UserId    State   Salary
Terry     WA      XX
Rahul     WA      XX
Dan       WA      YY
Tyson     CA      ZZ
Ankit     WA      YY
Michael   WA      YY

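.NET for Apache Spark programmability
// Requires: using Microsoft.Spark.Sql; using static Microsoft.Spark.Sql.Functions;
var spark = SparkSession.Builder().GetOrCreate();
var dataframe = spark.Read().Json("input.json");
// Define the UDF before it is used in Select().
var concat =
    Udf<int?, string, string>((age, name) => name + age);
dataframe.Filter(dataframe["age"] > 21)
    .Select(concat(dataframe["age"], dataframe["name"]))
    .Show();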
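The UDF body in the snippet above is an ordinary C# lambda, so its logic can be exercised without a Spark cluster. The following is a minimal sketch under that assumption: the class name ConcatUdfSketch is hypothetical, and no Microsoft.Spark types are involved.

```csharp
using System;

// Hypothetical helper for illustration only; not part of .NET for Apache Spark.
public static class ConcatUdfSketch
{
    // Same lambda shape as the UDF body on the slide: (age, name) => name + age.
    // A null age concatenates as an empty string in plain C#.
    public static readonly Func<int?, string, string> Concat =
        (age, name) => name + age;
}
```

Calling ConcatUdfSketch.Concat(42, "Ada") returns "Ada42", and a null age returns just the name. This only reflects plain C# string concatenation; how null values reach a UDF inside Spark may differ.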

Language comparison: TPC-H Query 2
Scala:
val europe = region.filter($"r_name" === "EUROPE")
  .join(nation, $"r_regionkey" === nation("n_regionkey"))
  .join(supplier, $"n_nationkey" === supplier("s_nationkey"))
  .join(partsupp,
    supplier("s_suppkey") === partsupp("ps_suppkey"))
val brass = part.filter(part("p_size") === 15
    && part("p_type").endsWith("BRASS"))
  .join(europe, europe("ps_partkey") === $"p_partkey")
val minCost = brass.groupBy(brass("ps_partkey"))
  .agg(min("ps_supplycost").as("min"))
brass.join(minCost, brass("ps_partkey") === minCost("ps_partkey"))
  .filter(brass("ps_supplycost") === minCost("min"))
  .select("s_acctbal", "s_name", "n_name",
    "p_partkey", "p_mfgr", "s_address",
    "s_phone", "s_comment")
  .sort($"s_acctbal".desc,
    $"n_name", $"s_name", $"p_partkey")
  .limit(100)
  .show()
C#:
var europe = region.Filter(Col("r_name") == "EUROPE")
    .Join(nation, Col("r_regionkey") == nation["n_regionkey"])
    .Join(supplier, Col("n_nationkey") == supplier["s_nationkey"])
    .Join(partsupp,
        supplier["s_suppkey"] == partsupp["ps_suppkey"]);
var brass = part.Filter(part["p_size"] == 15
        & part["p_type"].EndsWith("BRASS"))
    .Join(europe, europe["ps_partkey"] == Col("p_partkey"));
var minCost = brass.GroupBy(brass["ps_partkey"])
    .Agg(Min("ps_supplycost").As("min"));
brass.Join(minCost, brass["ps_partkey"] == minCost["ps_partkey"])
    .Filter(brass["ps_supplycost"] == minCost["min"])
    .Select("s_acctbal", "s_name", "n_name",
        "p_partkey", "p_mfgr", "s_address",
        "s_phone", "s_comment")
    .Sort(Col("s_acctbal").Desc(),
        Col("n_name"), Col("s_name"), Col("p_partkey"))
    .Limit(100)
    .Show();
Similar syntax: dangerously copy/paste friendly! Differences between Scala and C#:
• Column references: $"col_name" (Scala) vs. Col("col_name") (C#)
• Capitalization of method names (e.g., filter vs. Filter)
• Operators (e.g., === in Scala vs. == in C#)

Submitting a Spark Application
spark-submit (Scala):
spark-submit `
--class <user-app-main-class> `
--master local `
<path-to-user-jar> `
<argument(s)-to-your-app>
spark-submit (.NET):
spark-submit `
--class org.apache.spark.deploy.DotnetRunner `
--master local `
<path-to-microsoft-spark-jar> `
<path-to-your-app-exe> <argument(s)-to-your-app>
In the .NET case, <path-to-microsoft-spark-jar> is provided by the .NET for Apache Spark library, while <path-to-your-app-exe> is provided by the user and contains the business logic.

Demo 2: Locally debugging a .NET for Spark App

Demo 3: GitHub analysis on the Cloud

Revisiting the question…
What does the OSS developer commit pattern look like over a week: do people work more on weekdays or on weekends?
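The core of this question is a group-by over the day of the week. The following is a minimal plain C# sketch of that aggregation using LINQ on in-memory timestamps rather than a Spark DataFrame. The class name CommitPatternSketch and the method ShareByDay are hypothetical, and the sketch assumes that the share of commit counts per day is an acceptable stand-in for the demo's actual metric.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only; the demo itself runs on Spark.
public static class CommitPatternSketch
{
    // For each day of the week that has commits, returns its share
    // of all commits in percent. An empty input yields an empty result.
    public static Dictionary<DayOfWeek, double> ShareByDay(IEnumerable<DateTime> commitTimes)
    {
        var times = commitTimes.ToList();
        if (times.Count == 0)
        {
            return new Dictionary<DayOfWeek, double>();
        }

        return times
            .GroupBy(t => t.DayOfWeek)
            .ToDictionary(g => g.Key, g => 100.0 * g.Count() / times.Count);
    }
}
```

Comparing the summed shares for Saturday and Sunday against the remaining five days then answers the weekday-versus-weekend question for a given project.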

Microsoft, as a workplace, has a great work-life balance… that, or this is proof that I am not a data scientist!
[Chart. X-axis: top-10 GitHub projects. Y-axis: % of total time spent on commits that day.]

What is happening when you write .NET Spark code?
• A .NET program uses the DataFrame and SparkSQL APIs of .NET for Apache Spark to build a Spark operation tree
• Did you define a .NET UDF?
– No: regular execution path (no .NET runtime during execution)
– Yes: interop between Spark and .NET

Performance: warm cluster runs for Pickling serialization (Arrow will be tested in the future)
Takeaway 1: Where UDF performance does not matter, .NET is on par with Python
Takeaway 2: Where UDF performance is critical, .NET is ~2x faster than Python!

Works everywhere!
• Cross platform: Windows, Ubuntu, macOS
• Cross cloud: Azure & AWS Databricks, AWS EMR Spark, Azure HDI Spark

VSCode extension for Spark .NET
• Author: Spark .NET project creation, dependency packaging, language service, sample code
• Run: reference management, Spark local run, Spark cluster run (e.g. HDInsight)
• Fix: debug
Extension to VSCode
• Tap into VSCode for C# programming
• Automate Maven and Spark dependency for environment setup
• Facilitate first project success through project template and sample code
• Support Spark local run and cluster run
• Integrate with Azure for HDInsight clusters navigation
• Azure Databricks integration planned

What's next?
• More programming experiences in .NET (UDAF, UDT support, multi-language UDFs)
• Spark data connectors in .NET (e.g., Apache Kafka, Azure Blob Store, Azure Data Lake)
• Tooling experiences (e.g., Jupyter, VS Code, Visual Studio, others?)
• Idiomatic experiences for C# and F# (LINQ, Type Provider)
• Out-of-box experiences (Azure HDInsight, Azure Databricks, Cosmos DB Spark, SQL 2019 BDC, …)
Go to https://github.com/dotnet/spark and let us know what is important to you!

Call to action: Engage, use & guide us!
Useful links:
• http://github.com/dotnet/spark
• https://aka.ms/GoDotNetForSpark
Website:
• https://dot.net/spark
Available out-of-box on Azure HDInsight Spark
Running .NET for Spark anywhere: https://aka.ms/InstallDotNetForSpark
You & .NET

.NET for Apache Spark GitHub repo: https://github.com/dotnet/spark
Microsoft resources and blog posts:
• https://dot.net/spark
• https://docs.microsoft.com/dotnet/spark
• https://devblogs.microsoft.com/dotnet/introducing-net-for-apache-spark/
• Build BRK3011 Demo video:
https://www.youtube.com/watch?v=ZlO1utbB2GQ&t=356s
• https://www.slideshare.net/MichaelRys
Apache Spark project proposals:
• Spark Language Interop Spark Proposal (Jira SPARK-26257)
• “.NET for Spark” Spark Project Proposal (Jira SPARK-27006)
