SlideShare a Scribd company logo
1 of 37
Download to read offline
Apache Toree
A Jupyter Kernel for Scala and Apache Spark
THIS IS NOT A CONTRIBUTION
Luciano Resende
Apache Torree committer and PPMC member | Apple
ApacheCon North America 2022
About me - Luciano Resende
lresende@apache.org https://www.linkedin.com/in/lresende
@lresende1975 https://github.com/lresende
AI/ML Data Platform Architect — Apple
•Over 20 years industry experience with over 15 year contributing to open source
•Creator of Elyra and Jupyter Enterprise Gateway
•Open source areas of contribution: Elyra, Jupyter Notebook ecosystem, Apache
Bahir, Apache Toree, Apache Spark among other projects related to AI/ML
platforms
Jupyter Notebooks
Jupyter Notebooks
Notebooks are interactive
computational environments, in
which you can combine code
execution, rich text, mathematics,
plots and rich media
•
Jupyter Notebooks
Jupyter Ecosystem is
the de-facto standard tool in data
science and AI community
Popular IDE usage
JupyterLab
Visual Studio Code
PyCharm
Rstudio
Spyder
Notepad++
Sublime Text
Vim, Emacs, or similar
Visual Studio
MATLAB
Other
None
0% 20% 40% 60% 80%
0.7%
5.6%
5.8%
10.1%
11.0%
15.2%
19.4%
21.8%
31.5%
31.9%
33.2%
74.1%
Jupyter Notebooks
Simple, but Powerful
As simple as opening a web page, with the capabilities of a powerful, multilingual, development environment.
Interactive widgets
Code can produce rich outputs such as images, videos, markdown, LaTeX and JavaScript. Interactive widgets can be
used to manipulate and visualize data in real-time.
Language of choice
Jupyter Notebooks have support for over 50 programming languages, including those popular in Data Science, Data Engineer,
and AI such as Python, R, Julia and Scala.
Big Data Integration
Leverage Big Data platforms such as Apache Spark from Python, R and Scala. Explore the same data with pandas,
scikit-learn, ggplot2, dplyr, etc.
Share Notebooks
Notebooks can be shared with others using e-mail, Dropbox, Google Drive, GitHub, etc
􀙚􀆅􀫐􀂒􀡓
⤵
Jupyter Notebooks
Single page web interface
• File Browser
• Code Console (QT Console)
• Text Editor
Current Release
• Jupyter Notebook 6.4.12
• Available in Anaconda
• pip install --upgrade notebook
Jupyter Notebooks
The Classic Notebook is starting to move
towards maintenance mode
• Community e
ff
orts being concentrated in
the new JupyterLab UI.
• Community continue to deliver bug-
fi
xes
and security updates frequently
Jupyter 7.0 (based on JupyterLab) discussion:
• https://jupyter.org/enhancement-
proposals/79-notebook-v7/notebook-
v7.html
JupyterLab
JupyterLab is the next generation UI
for the Jupyter Ecosystem.
Brings all the previous improvements
into a single uni
fi
ed platform plus
more!
Provides a modular, extensible
architecture
Retains backward compatibility with
the old notebook we know and love
JupyterLab
File Browser
Widgets/
Rich Output
Tabbed
Workspaces
Text Editor
Console/
Terminal
Jupyter Notebooks
Notebook UI runs on the browser
The Notebook Server serves the
‘Notebooks’
Kernels interpret/execute cell contents
• Are responsible for code execution
• Abstracts di
ff
erent languages
• 1:1 relationship with Notebook
• Runs and consume resources as long as
notebook is running
Https/
websocket
ZMQ
􀡓􀈿
Jupyter Notebooks
􀇳
Front-end
Kernel Proxy
DEAL SUB
IPython Kernel
ROUTER
􀇳􀇳
Front-end
Kernel Proxy
DEAL SUB
Front-end
Kernel Proxy
DEAL SUB DEAL
ROUTER PUB
Kernel raw_iput
Requests to kernal
Kernel output broadcast
Request / Reply direction
Available Sockets:
• Shell (requests, history, info)
• IOPub (status, display, results)
• Stdin (input requests from kernel)
• Control (shutdown, interrupt)
• Heartbeat (poll)
Jupyter Notebooks
Two types of responses
Results
• Computations that return a result
• 1+1
• val a = 2 + 5
Stream Content
• Values that are written to output stream
• df.show(10)
Client Program Kernel
Evaluate (msgid=1) ‘1+1’
Busy (msgid=1)
Status (msgid=1) ok/error
Result (msgid=1)
Stream Content (msgid=1)
Idle (msgid=1)
Apache Toree
Apache Toree
A Scala based Jupyter Kernel that enables Jupyter
Notebooks to execute Scala code and connect to Apache
Spark to build interactive applications.
Apache Toree History
• December 2014 - Open Sourced Spark Kernel to GitHub
• July 2015 - Joined developerWorks Open
• https://developer.ibm.com/open/spark-kernel/
• December 2015 - Accepted as an Apache Incubator Project
• https://toree.apache.org
Apache Toree Releases
Release Scala Version Spark Version
Toree 0.1.x Scala 2.11 Spark 1.6
Toree 0.2.x - 0.4.x Scala 2.11 Spark 2.x
Toree 0.5.x Scala 2.12 Spark 3.x
Apache Toree
Installing the Toree Kernel
• pip install –upgrade toree
Con
fi
guring the Toree Kernel
• jupyter toree install --spark_home=/usr/local/bin/apache-spark/
Apache Toree
{
"argv": [
"/usr/local/share/jupyter/kernels/apache_toree_scala/bin/run.sh",
"--profile",
"{connection_file}"
],
"env": {
"DEFAULT_INTERPRETER": "Scala",
"__TOREE_SPARK_OPTS__": "",
"__TOREE_OPTS__": "",
"SPARK_HOME": "/Users/lresende/opt/spark-3.2.2-bin-hadoop2.7/",
"PYTHONPATH": “/Users/lresende/opt/spark-3.2.2-bin-hadoop2.7//python:/Users/lresende/opt/spark-3.2.2-bin-
hadoop2.7/python/lib/py4j-0.10.9.5-src.zip”,
"PYTHON_EXEC": "python"
},
"display_name": "Apache Toree - Scala",
"language": "scala",
"interrupt_mode": "signal",
"metadata": {}
}
Apache Toree
{
"argv": [
“/usr/local/share/jupyter/kernels/apache_toree_scala/bin/run.sh",
"--spark-context-initialization-mode",
"none",
"--profile",
"{connection_file}"
],
"env": {
"DEFAULT_INTERPRETER": "Scala",
"__TOREE_SPARK_OPTS__": "",
"__TOREE_OPTS__": "",
"SPARK_HOME": "/Users/lresende/opt/spark-3.2.2-bin-hadoop2.7/",
"PYTHONPATH": “/Users/lresende/opt/spark-3.2.2-bin-hadoop2.7//python:/Users/lresende/opt/spark-3.2.2-bin-hadoop2.7/
python/lib/py4j-0.10.9.5-src.zip”,
"PYTHON_EXEC": "python"
},
"display_name": "Apache Toree - Scala",
"language": "scala",
"interrupt_mode": "signal",
"metadata": {}
}
Apache Toree
{
"language": "scala",
"display_name": "Spark - Scala (YARN Cluster Mode)",
"metadata": { "process_proxy": { "class_name":
"enterprise_gateway.services.processproxies.yarn.YarnClusterProcessProxy” } },
"env": {
"SPARK_HOME": "/usr/hdp/current/spark2-client",
"__TOREE_SPARK_OPTS__": "--master yarn --deploy-mode cluster --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID} --conf
spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.am.waitTime=1d ${KERNEL_EXTRA_SPARK_OPTS}",
"__TOREE_OPTS__": "--alternate-sigint USR2",
"DEFAULT_INTERPRETER": "Scala"
},
"argv": [
"/usr/local/share/jupyter/kernels/spark_scala_yarn_cluster/bin/run.sh",
"--RemoteProcessProxy.kernel-id",
"{kernel_id}",
"--RemoteProcessProxy.response-address",
"{response_address}",
"--RemoteProcessProxy.spark-context-initialization-mode",
"lazy"
]
}
Apache Toree
PROG_HOME="$(cd "`dirname "$0"`"/..; pwd)"
if [ -z "$SPARK_HOME" ]; then
echo "SPARK_HOME must be set to the location of a Spark distribution!"
exit 1
fi
echo "Starting Spark Kernel with SPARK_HOME=$SPARK_HOME"
KERNEL_ASSEMBLY=`(cd ${PROG_HOME}/lib; ls -1 toree-assembly-*.jar;)`
# disable randomized hash for string in Python 3.3+
TOREE_ASSEMBLY=${PROG_HOME}/lib/${KERNEL_ASSEMBLY}
eval exec 
"${SPARK_HOME}/bin/spark-submit" 
--name "'Apache Toree'" 
"${SPARK_OPTS}" 
--class org.apache.toree.Main 
"${TOREE_ASSEMBLY}" 
"${TOREE_OPTS}" 
"$@"
Apache Toree Architectural Diagram
Apache Toree running as an Apache Spark application in client mode
Bob
Alice
􀉩
JupyterLab
jupyter client
Application
Apache
Toree
Driver
Spark Context
Apache Spark Cluster
Worker
Executor
Worker
Executor
Worker
Executor
Cluster
Manager
0MQ
IPython Kernel
protocol
Jupyter Notebooks
Stack Limitation
Scalability
• Jupyter Kernels running as local process
• Resources are limited by what is available on
the one single node that runs all Kernels and
associated Spark drivers
Security
• Single user sharing the same privileges
• Users can see and control each other
process using Jupyter administrative utilities
Maximum Number of Simultaneous Kernels
Max
Kernels
(4GB
Heap)
0
10
20
Cluster Size (32GB Nodes)
4 Nodes 8 Nodes 12 Nodes 16 Nodes
􀪬􀪬􀪬􀪬􀪬􀪬
Cloud Cluster
􀪬
k Kernel
k
k
YARN
Workers
Apache Toree Architectural Diagram
Apache Toree running as an Apache Spark application in cluster mode
YARN Cluster
Gateway Edge Node
Bob
Alice
Jupyter Enterprise Gateway
• Multitenancy
• Remote kernel lifecycle management
Spark Executors
Spark Executors
Spark Executors
Yarn Container
Toree Kernel
Spark Driver
Spark Executors
Spark Executors
Spark Executors
Yarn Container
Toree Kernel
Spark Driver
Impersonation: Alice’s
kernel runs under
Alice’s user ID.
Spark Executors
Spark Executors
Spark Executors
Yarn Container
Toree Kernel
Spark Driver
Security
Layer
􀉩􀝊
JupyterLab
Jupyter Notebook
(local)
Apache Toree Add-ons
Apache Toree, via extensions like
Brunel for Apache Toree or Plotly for
Scala, supports rich visualizations that
integrates directly with Spark Data
Frame APIs
Notes:
Brunel seems to be broken with Spark
3.2.2 (from previous 2.x/3.0.x)
Plotly seems to have a dependency
issue (https://github.com/
alexarchambault/plotly-scala/issues/14)
Apache Toree Visualizations
Apache Toree provides a set of
magics that enhances the user
experience manipulating data
coming from Spark tables or data
Apache Toree Visualizations
Apache Toree APIs
Apache Toree
Accessing Toree programmatically
using Scala
Torre Client APIs
https://github.com/lresende/toree-
gateway/blob/v1.0/src/main/scala/
org/apache/toree/gateway/
ToreeGateway.scala
object ToreeGatewayClient extends App {
// Parse our configuration and create a client connecting
to our kernel
val configFileContent =
scala.io.Source.fromFile(“config.json”).mkString
val config: Config =
ConfigFactory.parseString(configFileContent)
val client = (new ClientBootstrap(config)
with StandardSystemInitialization
with StandardHandlerInitialization).createClient()
…
val promise = Promise[String]
try {
val exRes: DeferredExecution = client.execute(“1+1”)
.onResult(executeResult => {
handleResult(promise, executeResult)
}).onError(executeReplyError =>{
handleError(promise, executeReplyError)
}).onSuccess(executeReplyOk => {
handleSuccess(promise, executeReplyOk)
}).onStream(streamResult => {
handleStream(promise, streamResult)
})
} catch {
case t : Throwable => {
log.info("Error submitting request: " + t.getMessage,
t)
promise.success("Error submitting request: " +
t.getMessage)
}
}
Await.result(promise.future, Duration.Inf)
}
Apache Toree
Accessing Toree programmatically
using Python
Jupyter Client package
https://github.com/lresende/toree-
gateway/blob/master/python/
toree_client.py
self.client =
BlockingKernelClient(connection_file=connectionFileLocation)
self.client.load_connection_file(connection_file=connectionFil
eLocation)
…
msg_id = self.client.execute(code=‘1+1’, allow_stdin=False)
reply = self.client.get_shell_msg(block=True, timeout=timeout)
results = []
while True:
try:
msg = self.client.get_iopub_msg(timeout=timeout)
except:
raise Exception("Error: Timeout executing
request")
# Stream results are being returned, accumulate them to
results
if msg['msg_type'] == 'stream':
type = 'stream'
results.append(msg['content']['text'])
continue
elif msg['msg_type'] == 'execute_result’:
if 'text/plain' in msg['content']['data']:
type = 'text'
results.append(msg['content']['data']['text/
plain'])
elif 'text/html' in msg['content']['data']:
# When idle, responses have all been processed/returned
elif msg['msg_type'] == 'status’:
if msg['content']['execution_state'] == 'idle':
break
if reply['content']['status'] == 'ok’:
return ''.join(results)
Apache Toree
Accessing Toree programmatically
using Jupyter Enterprise Gateway
GatewayClient
https://github.com/jupyter-server/
enterprise_gateway/tree/master/
enterprise_gateway/client
# initialize environment
gatewayClient = GatewayClient()
kernel = gatewayClient.start_kernel(‘toree-
kernelspec’)
…
result = self.kernel.execute(”1+1")
…
# shutdown environment
gatewayClient.shutdown_kernel(cls.kernel)
Demo
Apache Toree: Join the Community
Contribute to Apache Toree
Roadmap suggestions for contributors
• Decouple plain Scala kernel from Spark
• Enhance startup performance
• Evaluate/Implement better async/parallelism framework
• Progress bar for Spark Jobs
• Spark 3.x and Scala 2.13
• Help with documentation and website enhancements
Apache Toree
https://toree.apache.org
Apache Toree Mailing List
dev@toree.incubator.apache.org
Apache Toree source code at GitHub
https://github.com/apache/incubator-toree
􀋂
Star and fork the project on Github
Apache Toree Resources
Apache Toree
0.5.0 - incubating
pip install -upgrade toree
Questions
lresende@apache.org
@lresende1975
https://www.linkedin.com/in/lresende
https://github.com/lresende

More Related Content

What's hot

Some Iceberg Basics for Beginners (CDP).pdf
Some Iceberg Basics for Beginners (CDP).pdfSome Iceberg Basics for Beginners (CDP).pdf
Some Iceberg Basics for Beginners (CDP).pdfMichael Kogan
 
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...Databricks
 
Performance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsPerformance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsDatabricks
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...DataWorks Summit/Hadoop Summit
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IODatabricks
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Cloudera, Inc.
 
Data Distribution and Ordering for Efficient Data Source V2
Data Distribution and Ordering for Efficient Data Source V2Data Distribution and Ordering for Efficient Data Source V2
Data Distribution and Ordering for Efficient Data Source V2Databricks
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopDatabricks
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
How to be Successful with Scylla
How to be Successful with ScyllaHow to be Successful with Scylla
How to be Successful with ScyllaScyllaDB
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
 
Sizing Your Scylla Cluster
Sizing Your Scylla ClusterSizing Your Scylla Cluster
Sizing Your Scylla ClusterScyllaDB
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeDatabricks
 
Delta Lake, un vernis pour parquet
Delta Lake, un vernis pour parquetDelta Lake, un vernis pour parquet
Delta Lake, un vernis pour parquetAlban Phélip
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroDatabricks
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 

What's hot (20)

Some Iceberg Basics for Beginners (CDP).pdf
Some Iceberg Basics for Beginners (CDP).pdfSome Iceberg Basics for Beginners (CDP).pdf
Some Iceberg Basics for Beginners (CDP).pdf
 
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
 
Performance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsPerformance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark Metrics
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IO
 
The delta architecture
The delta architectureThe delta architecture
The delta architecture
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive


 
Data Distribution and Ordering for Efficient Data Source V2
Data Distribution and Ordering for Efficient Data Source V2Data Distribution and Ordering for Efficient Data Source V2
Data Distribution and Ordering for Efficient Data Source V2
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
How to be Successful with Scylla
How to be Successful with ScyllaHow to be Successful with Scylla
How to be Successful with Scylla
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
Sizing Your Scylla Cluster
Sizing Your Scylla ClusterSizing Your Scylla Cluster
Sizing Your Scylla Cluster
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
Delta Lake, un vernis pour parquet
Delta Lake, un vernis pour parquetDelta Lake, un vernis pour parquet
Delta Lake, un vernis pour parquet
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 

Similar to A Jupyter kernel for Scala and Apache Spark.pdf

APACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van Niekerk
APACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van NiekerkAPACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van Niekerk
APACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van NiekerkSpark Summit
 
Jupyter con meetup extended jupyter kernel gateway
Jupyter con meetup   extended jupyter kernel gatewayJupyter con meetup   extended jupyter kernel gateway
Jupyter con meetup extended jupyter kernel gatewayLuciano Resende
 
Big analytics meetup - Extended Jupyter Kernel Gateway
Big analytics meetup - Extended Jupyter Kernel GatewayBig analytics meetup - Extended Jupyter Kernel Gateway
Big analytics meetup - Extended Jupyter Kernel GatewayLuciano Resende
 
The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017
The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017
The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017Luciano Resende
 
An Enterprise Analytics Platform with Jupyter Notebooks and Apache Spark
An Enterprise Analytics Platform with Jupyter Notebooks and Apache SparkAn Enterprise Analytics Platform with Jupyter Notebooks and Apache Spark
An Enterprise Analytics Platform with Jupyter Notebooks and Apache SparkLuciano Resende
 
Building analytical microservices powered by jupyter kernels
Building analytical microservices powered by jupyter kernelsBuilding analytical microservices powered by jupyter kernels
Building analytical microservices powered by jupyter kernelsLuciano Resende
 
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
 The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a... The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...Big Data Spain
 
Using Elyra for COVID-19 Analytics
Using Elyra for COVID-19 AnalyticsUsing Elyra for COVID-19 Analytics
Using Elyra for COVID-19 AnalyticsLuciano Resende
 
Jupyter notebooks on steroids
Jupyter notebooks on steroidsJupyter notebooks on steroids
Jupyter notebooks on steroidsJose Enrique Ruiz
 
Openstack - An introduction/Installation - Presented at Dr Dobb's conference...
 Openstack - An introduction/Installation - Presented at Dr Dobb's conference... Openstack - An introduction/Installation - Presented at Dr Dobb's conference...
Openstack - An introduction/Installation - Presented at Dr Dobb's conference...Rahul Krishna Upadhyaya
 
2018 02 20-jeg_index
2018 02 20-jeg_index2018 02 20-jeg_index
2018 02 20-jeg_indexChester Chen
 
Habitat Overview
Habitat OverviewHabitat Overview
Habitat OverviewMandi Walls
 
Introduction to python history and platforms
Introduction to python history and platformsIntroduction to python history and platforms
Introduction to python history and platformsKirti Verma
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlySarah Guido
 
Undine: Turnkey Drupal Development Environments
Undine: Turnkey Drupal Development EnvironmentsUndine: Turnkey Drupal Development Environments
Undine: Turnkey Drupal Development EnvironmentsDavid Watson
 
Neev Open Source Contributions
Neev Open Source ContributionsNeev Open Source Contributions
Neev Open Source ContributionsNeev Technologies
 
Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015
Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015
Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015Mike Broberg
 

Similar to A Jupyter kernel for Scala and Apache Spark.pdf (20)

APACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van Niekerk
APACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van NiekerkAPACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van Niekerk
APACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van Niekerk
 
Jupyter con meetup extended jupyter kernel gateway
Jupyter con meetup   extended jupyter kernel gatewayJupyter con meetup   extended jupyter kernel gateway
Jupyter con meetup extended jupyter kernel gateway
 
Big analytics meetup - Extended Jupyter Kernel Gateway
Big analytics meetup - Extended Jupyter Kernel GatewayBig analytics meetup - Extended Jupyter Kernel Gateway
Big analytics meetup - Extended Jupyter Kernel Gateway
 
The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017
The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017
The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017
 
An Enterprise Analytics Platform with Jupyter Notebooks and Apache Spark
An Enterprise Analytics Platform with Jupyter Notebooks and Apache SparkAn Enterprise Analytics Platform with Jupyter Notebooks and Apache Spark
An Enterprise Analytics Platform with Jupyter Notebooks and Apache Spark
 
Building analytical microservices powered by jupyter kernels
Building analytical microservices powered by jupyter kernelsBuilding analytical microservices powered by jupyter kernels
Building analytical microservices powered by jupyter kernels
 
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
 The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a... The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
 
Using Elyra for COVID-19 Analytics
Using Elyra for COVID-19 AnalyticsUsing Elyra for COVID-19 Analytics
Using Elyra for COVID-19 Analytics
 
Jupyter notebooks on steroids
Jupyter notebooks on steroidsJupyter notebooks on steroids
Jupyter notebooks on steroids
 
Openstack - An introduction/Installation - Presented at Dr Dobb's conference...
 Openstack - An introduction/Installation - Presented at Dr Dobb's conference... Openstack - An introduction/Installation - Presented at Dr Dobb's conference...
Openstack - An introduction/Installation - Presented at Dr Dobb's conference...
 
2018 02 20-jeg_index
2018 02 20-jeg_index2018 02 20-jeg_index
2018 02 20-jeg_index
 
Habitat Overview
Habitat OverviewHabitat Overview
Habitat Overview
 
Introduction to python history and platforms
Introduction to python history and platformsIntroduction to python history and platforms
Introduction to python history and platforms
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
 
Undine: Turnkey Drupal Development Environments
Undine: Turnkey Drupal Development EnvironmentsUndine: Turnkey Drupal Development Environments
Undine: Turnkey Drupal Development Environments
 
Neev Open Source Contributions
Neev Open Source ContributionsNeev Open Source Contributions
Neev Open Source Contributions
 
Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015
Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015
Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015
 
Apache deep learning 101
Apache deep learning 101Apache deep learning 101
Apache deep learning 101
 
.NET RDF APIs
.NET RDF APIs.NET RDF APIs
.NET RDF APIs
 
Stackato
StackatoStackato
Stackato
 

More from Luciano Resende

Elyra - a set of AI-centric extensions to JupyterLab Notebooks.
Elyra - a set of AI-centric extensions to JupyterLab Notebooks.Elyra - a set of AI-centric extensions to JupyterLab Notebooks.
Elyra - a set of AI-centric extensions to JupyterLab Notebooks.Luciano Resende
 
From Data to AI - Silicon Valley Open Source projects come to you - Madrid me...
From Data to AI - Silicon Valley Open Source projects come to you - Madrid me...From Data to AI - Silicon Valley Open Source projects come to you - Madrid me...
From Data to AI - Silicon Valley Open Source projects come to you - Madrid me...Luciano Resende
 
Ai pipelines powered by jupyter notebooks
Ai pipelines powered by jupyter notebooksAi pipelines powered by jupyter notebooks
Ai pipelines powered by jupyter notebooksLuciano Resende
 
Strata - Scaling Jupyter with Jupyter Enterprise Gateway
Strata - Scaling Jupyter with Jupyter Enterprise GatewayStrata - Scaling Jupyter with Jupyter Enterprise Gateway
Strata - Scaling Jupyter with Jupyter Enterprise GatewayLuciano Resende
 
Scaling notebooks for Deep Learning workloads
Scaling notebooks for Deep Learning workloadsScaling notebooks for Deep Learning workloads
Scaling notebooks for Deep Learning workloadsLuciano Resende
 
Jupyter Enterprise Gateway Overview
Jupyter Enterprise Gateway OverviewJupyter Enterprise Gateway Overview
Jupyter Enterprise Gateway OverviewLuciano Resende
 
Inteligencia artificial, open source e IBM Call for Code
Inteligencia artificial, open source e IBM Call for CodeInteligencia artificial, open source e IBM Call for Code
Inteligencia artificial, open source e IBM Call for CodeLuciano Resende
 
IoT Applications and Patterns using Apache Spark & Apache Bahir
IoT Applications and Patterns using Apache Spark & Apache BahirIoT Applications and Patterns using Apache Spark & Apache Bahir
IoT Applications and Patterns using Apache Spark & Apache BahirLuciano Resende
 
Getting insights from IoT data with Apache Spark and Apache Bahir
Getting insights from IoT data with Apache Spark and Apache BahirGetting insights from IoT data with Apache Spark and Apache Bahir
Getting insights from IoT data with Apache Spark and Apache BahirLuciano Resende
 
Open Source AI - News and examples
Open Source AI - News and examplesOpen Source AI - News and examples
Open Source AI - News and examplesLuciano Resende
 
Building iot applications with Apache Spark and Apache Bahir
Building iot applications with Apache Spark and Apache BahirBuilding iot applications with Apache Spark and Apache Bahir
Building iot applications with Apache Spark and Apache BahirLuciano Resende
 
What's new in Apache SystemML - Declarative Machine Learning
What's new in Apache SystemML  - Declarative Machine LearningWhat's new in Apache SystemML  - Declarative Machine Learning
What's new in Apache SystemML - Declarative Machine LearningLuciano Resende
 
Writing Apache Spark and Apache Flink Applications Using Apache Bahir
Writing Apache Spark and Apache Flink Applications Using Apache BahirWriting Apache Spark and Apache Flink Applications Using Apache Bahir
Writing Apache Spark and Apache Flink Applications Using Apache BahirLuciano Resende
 
How mentoring can help you start contributing to open source
How mentoring can help you start contributing to open sourceHow mentoring can help you start contributing to open source
How mentoring can help you start contributing to open sourceLuciano Resende
 
SystemML - Declarative Machine Learning
SystemML - Declarative Machine LearningSystemML - Declarative Machine Learning
SystemML - Declarative Machine LearningLuciano Resende
 
Luciano Resende's keynote at Apache big data conference
Luciano Resende's keynote at Apache big data conferenceLuciano Resende's keynote at Apache big data conference
Luciano Resende's keynote at Apache big data conferenceLuciano Resende
 
Open Source tools overview
Open Source tools overviewOpen Source tools overview
Open Source tools overviewLuciano Resende
 
Data access layer and schema definitions
Data access layer and schema definitionsData access layer and schema definitions
Data access layer and schema definitionsLuciano Resende
 
How mentoring programs can help newcomers get started with open source
How mentoring programs can help newcomers get started with open sourceHow mentoring programs can help newcomers get started with open source
How mentoring programs can help newcomers get started with open sourceLuciano Resende
 

More from Luciano Resende (20)

Elyra - a set of AI-centric extensions to JupyterLab Notebooks.
Elyra - a set of AI-centric extensions to JupyterLab Notebooks.Elyra - a set of AI-centric extensions to JupyterLab Notebooks.
Elyra - a set of AI-centric extensions to JupyterLab Notebooks.
 
From Data to AI - Silicon Valley Open Source projects come to you - Madrid me...
From Data to AI - Silicon Valley Open Source projects come to you - Madrid me...From Data to AI - Silicon Valley Open Source projects come to you - Madrid me...
From Data to AI - Silicon Valley Open Source projects come to you - Madrid me...
 
Ai pipelines powered by jupyter notebooks
Ai pipelines powered by jupyter notebooksAi pipelines powered by jupyter notebooks
Ai pipelines powered by jupyter notebooks
 
Strata - Scaling Jupyter with Jupyter Enterprise Gateway
Strata - Scaling Jupyter with Jupyter Enterprise GatewayStrata - Scaling Jupyter with Jupyter Enterprise Gateway
Strata - Scaling Jupyter with Jupyter Enterprise Gateway
 
Scaling notebooks for Deep Learning workloads
Scaling notebooks for Deep Learning workloadsScaling notebooks for Deep Learning workloads
Scaling notebooks for Deep Learning workloads
 
Jupyter Enterprise Gateway Overview
Jupyter Enterprise Gateway OverviewJupyter Enterprise Gateway Overview
Jupyter Enterprise Gateway Overview
 
Inteligencia artificial, open source e IBM Call for Code
Inteligencia artificial, open source e IBM Call for CodeInteligencia artificial, open source e IBM Call for Code
Inteligencia artificial, open source e IBM Call for Code
 
IoT Applications and Patterns using Apache Spark & Apache Bahir
IoT Applications and Patterns using Apache Spark & Apache BahirIoT Applications and Patterns using Apache Spark & Apache Bahir
IoT Applications and Patterns using Apache Spark & Apache Bahir
 
Getting insights from IoT data with Apache Spark and Apache Bahir
Getting insights from IoT data with Apache Spark and Apache BahirGetting insights from IoT data with Apache Spark and Apache Bahir
Getting insights from IoT data with Apache Spark and Apache Bahir
 
Open Source AI - News and examples
Open Source AI - News and examplesOpen Source AI - News and examples
Open Source AI - News and examples
 
Building iot applications with Apache Spark and Apache Bahir
Building iot applications with Apache Spark and Apache BahirBuilding iot applications with Apache Spark and Apache Bahir
Building iot applications with Apache Spark and Apache Bahir
 
What's new in Apache SystemML - Declarative Machine Learning
What's new in Apache SystemML  - Declarative Machine LearningWhat's new in Apache SystemML  - Declarative Machine Learning
What's new in Apache SystemML - Declarative Machine Learning
 
Writing Apache Spark and Apache Flink Applications Using Apache Bahir
Writing Apache Spark and Apache Flink Applications Using Apache BahirWriting Apache Spark and Apache Flink Applications Using Apache Bahir
Writing Apache Spark and Apache Flink Applications Using Apache Bahir
 
How mentoring can help you start contributing to open source
How mentoring can help you start contributing to open sourceHow mentoring can help you start contributing to open source
How mentoring can help you start contributing to open source
 
SystemML - Declarative Machine Learning
SystemML - Declarative Machine LearningSystemML - Declarative Machine Learning
SystemML - Declarative Machine Learning
 
Luciano Resende's keynote at Apache big data conference
Luciano Resende's keynote at Apache big data conferenceLuciano Resende's keynote at Apache big data conference
Luciano Resende's keynote at Apache big data conference
 
Asf icfoss-mentoring
Asf icfoss-mentoringAsf icfoss-mentoring
Asf icfoss-mentoring
 
Open Source tools overview
Open Source tools overviewOpen Source tools overview
Open Source tools overview
 
Data access layer and schema definitions
Data access layer and schema definitionsData access layer and schema definitions
Data access layer and schema definitions
 
How mentoring programs can help newcomers get started with open source
How mentoring programs can help newcomers get started with open sourceHow mentoring programs can help newcomers get started with open source
How mentoring programs can help newcomers get started with open source
 

Recently uploaded

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024The Digital Insurer
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 

Recently uploaded (20)

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 

A Jupyter kernel for Scala and Apache Spark.pdf

  • 1. Apache Toree A Jupyter Kernel for Scala and Apache Spark THIS IS NOT A CONTRIBUTION Luciano Resende Apache Torree committer and PPMC member | Apple ApacheCon North America 2022
  • 2. About me - Luciano Resende lresende@apache.org https://www.linkedin.com/in/lresende @lresende1975 https://github.com/lresende AI/ML Data Platform Architect — Apple •Over 20 years industry experience with over 15 year contributing to open source •Creator of Elyra and Jupyter Enterprise Gateway •Open source areas of contribution: Elyra, Jupyter Notebook ecosystem, Apache Bahir, Apache Toree, Apache Spark among other projects related to AI/ML platforms
  • 4. Jupyter Notebooks Notebooks are interactive computational environments, in which you can combine code execution, rich text, mathematics, plots and rich media •
  • 5. Jupyter Notebooks Jupyter Ecosystem is the de-facto standard tool in data science and AI community Popular IDE usage JupyterLab Visual Studio Code PyCharm Rstudio Spyder Notepad++ Sublime Text Vim, Emacs, or similar Visual Studio MATLAB Other None 0% 20% 40% 60% 80% 0.7% 5.6% 5.8% 10.1% 11.0% 15.2% 19.4% 21.8% 31.5% 31.9% 33.2% 74.1%
  • 6. Jupyter Notebooks Simple, but Powerful As simple as opening a web page, with the capabilities of a powerful, multilingual, development environment. Interactive widgets Code can produce rich outputs such as images, videos, markdown, LaTeX and JavaScript. Interactive widgets can be used to manipulate and visualize data in real-time. Language of choice Jupyter Notebooks have support for over 50 programming languages, including those popular in Data Science, Data Engineer, and AI such as Python, R, Julia and Scala. Big Data Integration Leverage Big Data platforms such as Apache Spark from Python, R and Scala. Explore the same data with pandas, scikit-learn, ggplot2, dplyr, etc. Share Notebooks Notebooks can be shared with others using e-mail, Dropbox, Google Drive, GitHub, etc 􀙚􀆅􀫐􀂒􀡓 ⤵
  • 7. Jupyter Notebooks Single page web interface • File Browser • Code Console (QT Console) • Text Editor Current Release • Jupyter Notebook 6.4.12 • Available in Anaconda • pip install --upgrade notebook
  • 8. Jupyter Notebooks The Classic Notebook is starting to move towards maintenance mode • Community e ff orts being concentrated in the new JupyterLab UI. • Community continue to deliver bug- fi xes and security updates frequently Jupyter 7.0 (based on JupyterLab) discussion: • https://jupyter.org/enhancement- proposals/79-notebook-v7/notebook- v7.html
  • 9. JupyterLab JupyterLab is the next generation UI for the Jupyter Ecosystem. Brings all the previous improvements into a single uni fi ed platform plus more! Provides a modular, extensible architecture Retains backward compatibility with the old notebook we know and love
  • 11. Jupyter Notebooks Notebook UI runs on the browser The Notebook Server serves the ‘Notebooks’ Kernels interpret/execute cell contents • Are responsible for code execution • Abstracts di ff erent languages • 1:1 relationship with Notebook • Runs and consume resources as long as notebook is running Https/ websocket ZMQ 􀡓􀈿
  • 12. Jupyter Notebooks 􀇳 Front-end Kernel Proxy DEAL SUB IPython Kernel ROUTER 􀇳􀇳 Front-end Kernel Proxy DEAL SUB Front-end Kernel Proxy DEAL SUB DEAL ROUTER PUB Kernel raw_iput Requests to kernal Kernel output broadcast Request / Reply direction Available Sockets: • Shell (requests, history, info) • IOPub (status, display, results) • Stdin (input requests from kernel) • Control (shutdown, interrupt) • Heartbeat (poll)
  • 13. Jupyter Notebooks Two types of responses Results • Computations that return a result • 1+1 • val a = 2 + 5 Stream Content • Values that are written to output stream • df.show(10) Client Program Kernel Evaluate (msgid=1) ‘1+1’ Busy (msgid=1) Status (msgid=1) ok/error Result (msgid=1) Stream Content (msgid=1) Idle (msgid=1)
  • 15. Apache Toree A Scala based Jupyter Kernel that enables Jupyter Notebooks to execute Scala code and connect to Apache Spark to build interactive applications.
  • 16. Apache Toree History • December 2014 - Open Sourced Spark Kernel to GitHub • July 2015 - Joined developerWorks Open • https://developer.ibm.com/open/spark-kernel/ • December 2015 - Accepted as an Apache Incubator Project • https://toree.apache.org
  • 17. Apache Toree Releases Release Scala Version Spark Version Toree 0.1.x Scala 2.11 Spark 1.6 Toree 0.2.x - 0.4.x Scala 2.11 Spark 2.x Toree 0.5.x Scala 2.12 Spark 3.x
  • 18. Apache Toree Installing the Toree Kernel • pip install –upgrade toree Con fi guring the Toree Kernel • jupyter toree install --spark_home=/usr/local/bin/apache-spark/
  • 19. Apache Toree { "argv": [ "/usr/local/share/jupyter/kernels/apache_toree_scala/bin/run.sh", "--profile", "{connection_file}" ], "env": { "DEFAULT_INTERPRETER": "Scala", "__TOREE_SPARK_OPTS__": "", "__TOREE_OPTS__": "", "SPARK_HOME": "/Users/lresende/opt/spark-3.2.2-bin-hadoop2.7/", "PYTHONPATH": “/Users/lresende/opt/spark-3.2.2-bin-hadoop2.7//python:/Users/lresende/opt/spark-3.2.2-bin- hadoop2.7/python/lib/py4j-0.10.9.5-src.zip”, "PYTHON_EXEC": "python" }, "display_name": "Apache Toree - Scala", "language": "scala", "interrupt_mode": "signal", "metadata": {} }
  • 20. Apache Toree { "argv": [ “/usr/local/share/jupyter/kernels/apache_toree_scala/bin/run.sh", "--spark-context-initialization-mode", "none", "--profile", "{connection_file}" ], "env": { "DEFAULT_INTERPRETER": "Scala", "__TOREE_SPARK_OPTS__": "", "__TOREE_OPTS__": "", "SPARK_HOME": "/Users/lresende/opt/spark-3.2.2-bin-hadoop2.7/", "PYTHONPATH": “/Users/lresende/opt/spark-3.2.2-bin-hadoop2.7//python:/Users/lresende/opt/spark-3.2.2-bin-hadoop2.7/ python/lib/py4j-0.10.9.5-src.zip”, "PYTHON_EXEC": "python" }, "display_name": "Apache Toree - Scala", "language": "scala", "interrupt_mode": "signal", "metadata": {} }
  • 21. Apache Toree { "language": "scala", "display_name": "Spark - Scala (YARN Cluster Mode)", "metadata": { "process_proxy": { "class_name": "enterprise_gateway.services.processproxies.yarn.YarnClusterProcessProxy” } }, "env": { "SPARK_HOME": "/usr/hdp/current/spark2-client", "__TOREE_SPARK_OPTS__": "--master yarn --deploy-mode cluster --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID} --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.am.waitTime=1d ${KERNEL_EXTRA_SPARK_OPTS}", "__TOREE_OPTS__": "--alternate-sigint USR2", "DEFAULT_INTERPRETER": "Scala" }, "argv": [ "/usr/local/share/jupyter/kernels/spark_scala_yarn_cluster/bin/run.sh", "--RemoteProcessProxy.kernel-id", "{kernel_id}", "--RemoteProcessProxy.response-address", "{response_address}", "--RemoteProcessProxy.spark-context-initialization-mode", "lazy" ] }
  • 22. Apache Toree PROG_HOME="$(cd "`dirname "$0"`"/..; pwd)" if [ -z "$SPARK_HOME" ]; then echo "SPARK_HOME must be set to the location of a Spark distribution!" exit 1 fi echo "Starting Spark Kernel with SPARK_HOME=$SPARK_HOME" KERNEL_ASSEMBLY=`(cd ${PROG_HOME}/lib; ls -1 toree-assembly-*.jar;)` # disable randomized hash for string in Python 3.3+ TOREE_ASSEMBLY=${PROG_HOME}/lib/${KERNEL_ASSEMBLY} eval exec "${SPARK_HOME}/bin/spark-submit" --name "'Apache Toree'" "${SPARK_OPTS}" --class org.apache.toree.Main "${TOREE_ASSEMBLY}" "${TOREE_OPTS}" "$@"
  • 23. Apache Toree Architectural Diagram Apache Toree running as an Apache Spark application in client mode Bob Alice 􀉩 JupyterLab jupyter client Application Apache Toree Driver Spark Context Apache Spark Cluster Worker Executor Worker Executor Worker Executor Cluster Manager 0MQ IPython Kernel protocol
  • 24. Jupyter Notebooks Stack Limitation Scalability • Jupyter Kernels running as local process • Resources are limited by what is available on the one single node that runs all Kernels and associated Spark drivers Security • Single user sharing the same privileges • Users can see and control each other process using Jupyter administrative utilities Maximum Number of Simultaneous Kernels Max Kernels (4GB Heap) 0 10 20 Cluster Size (32GB Nodes) 4 Nodes 8 Nodes 12 Nodes 16 Nodes 􀪬􀪬􀪬􀪬􀪬􀪬 Cloud Cluster 􀪬 k Kernel k k
  • 25. YARN Workers Apache Toree Architectural Diagram Apache Toree running as an Apache Spark application in cluster mode YARN Cluster Gateway Edge Node Bob Alice Jupyter Enterprise Gateway • Multitenancy • Remote kernel lifecycle management Spark Executors Spark Executors Spark Executors Yarn Container Toree Kernel Spark Driver Spark Executors Spark Executors Spark Executors Yarn Container Toree Kernel Spark Driver Impersonation: Alice’s kernel runs under Alice’s user ID. Spark Executors Spark Executors Spark Executors Yarn Container Toree Kernel Spark Driver Security Layer 􀉩􀝊 JupyterLab Jupyter Notebook (local)
  • 27. Apache Toree, via extensions like Brunel for Apache Toree or Plotly for Scala, supports rich visualizations that integrates directly with Spark Data Frame APIs Notes: Brunel seems to be broken with Spark 3.2.2 (from previous 2.x/3.0.x) Plotly seems to have a dependency issue (https://github.com/ alexarchambault/plotly-scala/issues/14) Apache Toree Visualizations
  • 28. Apache Toree provides a set of magics that enhances the user experience manipulating data coming from Spark tables or data Apache Toree Visualizations
  • 30. Apache Toree Accessing Toree programmatically using Scala Torre Client APIs https://github.com/lresende/toree- gateway/blob/v1.0/src/main/scala/ org/apache/toree/gateway/ ToreeGateway.scala object ToreeGatewayClient extends App { // Parse our configuration and create a client connecting to our kernel val configFileContent = scala.io.Source.fromFile(“config.json”).mkString val config: Config = ConfigFactory.parseString(configFileContent) val client = (new ClientBootstrap(config) with StandardSystemInitialization with StandardHandlerInitialization).createClient() … val promise = Promise[String] try { val exRes: DeferredExecution = client.execute(“1+1”) .onResult(executeResult => { handleResult(promise, executeResult) }).onError(executeReplyError =>{ handleError(promise, executeReplyError) }).onSuccess(executeReplyOk => { handleSuccess(promise, executeReplyOk) }).onStream(streamResult => { handleStream(promise, streamResult) }) } catch { case t : Throwable => { log.info("Error submitting request: " + t.getMessage, t) promise.success("Error submitting request: " + t.getMessage) } } Await.result(promise.future, Duration.Inf) }
  • 31. Apache Toree Accessing Toree programmatically using Python Jupyter Client package https://github.com/lresende/toree- gateway/blob/master/python/ toree_client.py self.client = BlockingKernelClient(connection_file=connectionFileLocation) self.client.load_connection_file(connection_file=connectionFil eLocation) … msg_id = self.client.execute(code=‘1+1’, allow_stdin=False) reply = self.client.get_shell_msg(block=True, timeout=timeout) results = [] while True: try: msg = self.client.get_iopub_msg(timeout=timeout) except: raise Exception("Error: Timeout executing request") # Stream results are being returned, accumulate them to results if msg['msg_type'] == 'stream': type = 'stream' results.append(msg['content']['text']) continue elif msg['msg_type'] == 'execute_result’: if 'text/plain' in msg['content']['data']: type = 'text' results.append(msg['content']['data']['text/ plain']) elif 'text/html' in msg['content']['data']: # When idle, responses have all been processed/returned elif msg['msg_type'] == 'status’: if msg['content']['execution_state'] == 'idle': break if reply['content']['status'] == 'ok’: return ''.join(results)
  • 32. Apache Toree Accessing Toree programmatically using Jupyter Enterprise Gateway GatewayClient https://github.com/jupyter-server/ enterprise_gateway/tree/master/ enterprise_gateway/client # initialize environment gatewayClient = GatewayClient() kernel = gatewayClient.start_kernel(‘toree- kernelspec’) … result = self.kernel.execute(”1+1") … # shutdown environment gatewayClient.shutdown_kernel(cls.kernel)
  • 33. Demo
  • 34. Apache Toree: Join the Community
  • 35. Contribute to Apache Toree Roadmap suggestions for contributors • Decouple plain Scala kernel from Spark • Enhance startup performance • Evaluate/Implement better async/parallelism framework • Progress bar for Spark Jobs • Spark 3.x and Scala 2.13 • Help with documentation and website enhancements
  • 36. Apache Toree https://toree.apache.org Apache Toree Mailing List dev@toree.incubator.apache.org Apache Toree source code at GitHub https://github.com/apache/incubator-toree 􀋂 Star and fork the project on Github Apache Toree Resources Apache Toree 0.5.0 - incubating pip install -upgrade toree