Collaborative environment with data science notebook
1. What makes data-driven
environments more efficient, and how to
build a data science toolchain around
notebook technologies
Creator of Apache Zeppelin
Co-Founder, CTO
Moon soo Lee
moon@zepl.com
2. #GDSC 2018
Who am I
A true believer that data science notebooks change how
people collaborate
Creator of Apache Zeppelin
Co-founder
https://github.com/Leemoonsoo
3. #GDSC 2018
It was 2013, and I really wanted to have an
interactive analytics interface for .
4. #GDSC 2018
Started an open source project:
Zeppelin (http://zeppelin-project.org/), a data science notebook.
Became an Apache top-level project in 2016.
http://zeppelin.apache.org
15. #GDSC 2018
GitHub
● Store notebooks in GitHub
● Versioning
● GitHub provides an .ipynb viewer
● Fork / pull request / merge
● Private / Public / Team / Org
● Hard to apply notebook-level ACL
● Not easy for non-engineers
16. #GDSC 2018
nbviewer
● Publishing notebooks
● Share a notebook by sharing a link
● Easy to use
● No access control
nbconvert (rendering .ipynb to static HTML) as a web service
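The idea of rendering an .ipynb file to static HTML can be sketched with the standard library alone, since an .ipynb file is JSON. This is a toy illustration of the concept, not nbconvert's implementation or API; the function name `render_notebook_html` and the sample notebook are assumptions made for the example.

```python
import html
import json


def render_notebook_html(ipynb_text: str) -> str:
    """Render an nbformat-4 style notebook (as JSON text) to a static HTML page."""
    notebook = json.loads(ipynb_text)
    parts = ["<html><body>"]
    for cell in notebook.get("cells", []):
        # In .ipynb files, "source" may be a single string or a list of lines.
        source = cell.get("source", "")
        if isinstance(source, list):
            source = "".join(source)
        escaped = html.escape(source)
        if cell.get("cell_type") == "code":
            parts.append(f"<pre><code>{escaped}</code></pre>")
        else:
            # Markdown is shown as escaped text; a real renderer would convert it.
            parts.append(f"<div>{escaped}</div>")
    parts.append("</body></html>")
    return "\n".join(parts)


# Hypothetical two-cell notebook used only to exercise the function.
example = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Daily report"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["print(1 < 2)"]},
    ],
})
page = render_notebook_html(example)
```

Because the output is a plain HTML string, it can be served from any static host, which is also why this style of sharing has no access control of its own.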
17. #GDSC 2018
Apache Zeppelin
● Share notebooks with ACL: Read / Write / Execute
● Jupyter notebooks must first be converted from .ipynb to
Zeppelin format on the command line.
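A notebook-level ACL with Read / Write / Execute permissions could look roughly like the sketch below. It is a hypothetical data structure for illustration, not Apache Zeppelin's actual implementation; the names `Permission`, `NotebookAcl`, and the users are all assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum


class Permission(Enum):
    READ = "read"
    WRITE = "write"
    EXECUTE = "execute"


@dataclass
class NotebookAcl:
    """Hypothetical per-notebook ACL: maps each permission to a set of user names."""
    grants: dict = field(default_factory=dict)

    def grant(self, user: str, permission: Permission) -> None:
        self.grants.setdefault(permission, set()).add(user)

    def allows(self, user: str, permission: Permission) -> bool:
        return user in self.grants.get(permission, set())


# One ACL object per notebook: alice can read and write, bob can only read.
acl = NotebookAcl()
acl.grant("alice", Permission.READ)
acl.grant("alice", Permission.WRITE)
acl.grant("bob", Permission.READ)
```

Keeping one such object per notebook is what distinguishes notebook-level ACL from repository-level or server-level access control.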
21. #GDSC 2018
DON’Ts
● Email attachments
● Sending files directly
● Sharing through USB drives
● ...
[Figure: email attachment, local copy on laptop, USB drive]
22. #GDSC 2018
DO’s
● Provide access to the same
dataset
● Access control capability
● Horizontal scalability
23. #GDSC 2018
Data catalog
● Provides the location of data, what it means, and how to load it
○ e.g. see the example entries at the end of this slide
● Catalog needs to be accessible / searchable / annotatable
● Many different ways to build one, depending on team / infra
○ Hive Metastore as a data catalog
○ Cloud infrastructure service (e.g. AWS Glue Data Catalog, Azure Data Catalog)
○ Data catalog / publishing software (e.g. CKAN, DKAN)
○ Custom built on top of RDBMS, NoSQL, indexing engine
○ Build data catalog using Notebook
Dataset  | Location              | Schema                                       | Note
Activity | s3://service/activity | Date (DateTime), type (INT), action (String) | Type is either RUN or STOP. ….
Images   | s3://service/images   | 512x256 pixel images                         | Images are collected from profile photo...
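To make "accessible / searchable / annotatable" concrete, here is a minimal in-memory sketch that holds the two example entries from this slide. The `CatalogEntry` class, the `search` helper, and the annotation text are hypothetical; a real catalog would sit on one of the backends listed above.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    dataset: str
    location: str
    schema: str
    note: str
    annotations: list = field(default_factory=list)


# Entries copied from the example table; notes are kept as written on the slide.
catalog = [
    CatalogEntry("Activity", "s3://service/activity",
                 "Date (DateTime), type (INT), action (String)",
                 "Type is either RUN or STOP."),
    CatalogEntry("Images", "s3://service/images",
                 "512x256 pixel images",
                 "Images are collected from profile photo..."),
]


def search(entries, keyword):
    """Return entries whose fields or annotations contain the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [
        entry for entry in entries
        if any(needle in text.lower()
               for text in (entry.dataset, entry.location, entry.schema,
                            entry.note, *entry.annotations))
    ]


# Annotating an entry makes the annotation searchable too (hypothetical text).
catalog[0].annotations.append("owner: analytics team")
```

For example, `search(catalog, "pixel")` finds the Images entry through its schema, and `search(catalog, "analytics")` finds the Activity entry through its annotation.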
24. #GDSC 2018
Build data catalog using Notebook
● Flexible enough to describe data
● Searchable, shareable, annotatable
● Programmatic generation
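Programmatic generation can be as simple as writing the catalog into a notebook file from a script. The sketch below builds a minimal nbformat 4 notebook as a plain dictionary with one Markdown table cell; the `ROWS` list and the `catalog_notebook` function are assumptions for illustration, and the two rows reuse the dataset names and locations from the example catalog in this deck.

```python
import json

# Hypothetical catalog rows: (dataset name, location).
ROWS = [
    ("Activity", "s3://service/activity"),
    ("Images", "s3://service/images"),
]


def catalog_notebook(rows):
    """Build a minimal nbformat 4 notebook whose second cell is a catalog table."""
    lines = ["| Dataset | Location |\n", "| --- | --- |\n"]
    lines += [f"| {name} | {location} |\n" for name, location in rows]
    return {
        "nbformat": 4,
        "nbformat_minor": 4,
        "metadata": {},
        "cells": [
            {"cell_type": "markdown", "metadata": {}, "source": ["# Data catalog\n"]},
            {"cell_type": "markdown", "metadata": {}, "source": lines},
        ],
    }


# The JSON text can be saved with an .ipynb extension and opened as a notebook.
notebook_json = json.dumps(catalog_notebook(ROWS), indent=1)
```

Regenerating the notebook on a schedule keeps the catalog in step with the underlying storage, while people can still add their own annotated cells in a copy.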
27. #GDSC 2018
Sign in and Run
Install libraries and
Install notebook and
Configure driver, environments and
Request access to data and
Set up access to notebook repo and
….
Run
29. #GDSC 2018
● Easier to implement / manage
● Notebook sharing is decoupled from the
execution environment
● Notebook sharing is usually basic or
restricted (no notebook-level ACL)
● e.g.
○ JupyterHub
○ AWS SageMaker
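In this design, a reverse proxy forwards each request to the notebook server that belongs to one user. The sketch below shows only that routing decision; the `/user/<name>/` prefix scheme, the backend addresses, and the `route` function are assumptions for illustration, not the configuration of any particular product.

```python
# Hypothetical routing table: URL prefix -> address of that user's notebook server.
ROUTES = {
    "/user/alice/": "http://127.0.0.1:9001",
    "/user/bob/": "http://127.0.0.1:9002",
}


def route(path, routes=ROUTES):
    """Return the backend URL for a request path, or None if no server matches."""
    for prefix, backend in routes.items():
        if path.startswith(prefix):
            # The path is forwarded unchanged to the matching single-user server.
            return backend + path
    return None
```

Each user's server and kernel are isolated, so the proxy never needs to know about individual notebooks. That is what keeps this design simple, and also why sharing has to happen outside the execution environment.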
[Diagram, single-user servers: Browser, Reverse Proxy, two Single-user Notebook servers (each with its own Kernel), Notebook Storage]
[Diagram, multi-user server: Browser, Multi-user Notebook server with three Kernels, Notebook Storage]
● More complex to implement / manage
● Notebook sharing is coupled with the execution
environment
● Notebook sharing is usually more advanced
and fine-grained
● e.g.
○ Apache Zeppelin
○ ZEPL
○ Google Colab