How to create a multi tenancy for an interactive data analysis with jupyter hub and ldap

How-to create a multi tenancy for
an interactive data analysis with
JupyterHub & LDAP
Spark Cluster + Jupyter + LDAP

Introduction
With this presentation you should be able to create an architecture for a framework of an
interactive data analysis by using a Cloudera Spark Cluster with Kerberos, a Jupyter
machine with JupyterHub and authentication via LDAP.

Architecture
This architecture enables the following:
● Transparent data-science development
● User Impersonation
● Authentication via LDAP
● Upgrades on Cluster won’t affect the developments.
● Controlled access to the data and resources by Kerberos/Sentry.
● Several coding API’s (Scala, R, Python, PySpark, etc…).
● Two layers of security with Kerberos & LDAP

Pre-Assumptions
1. Cluster hostname: cm1.localdomain Jupyter hostname: cm3.localdomain
2. Cluster Python version: 3.7.1
3. Cluster Manager: Cloudera Manager 5.12.2
4. Service Yarn & PIP Installed
5. Cluster Authentication Pre-Installed: Kerberos
a. Kerberos Realm DOMAIN.COM
6. Chosen IDE: Jupyter
7. JupyterHub Machine Authentication Not-Installed: Kerberos
8. AD Machine Installed with hostname: ad.localdomain
9. Java 1.8 installed in Both Machines
10. Cluster Spark version 2.2.0

Anaconda
Download and installation
su - root
wget https://repo.continuum.io/archive/Anaconda3-2018.12-Linux-x86_64.sh
chmod +x Anaconda3-2018.12-Linux-x86_64.sh
./Anaconda3-2018.12-Linux-x86_64.sh
Note 1: Change with your hostname and domain in the highlighted field.
Note 2: Due to the package SudoSpawner - that requires Anaconda be installed with the root user!
Note 3: JupyterHub requires Python 3.X, therefore it will be installed Anaconda 3

Anaconda
Path environment variables
export PATH=/opt/anaconda3/bin:$PATH
Java environment variables
export JAVA_HOME=/usr/java/jdk1.8.0_181-amd64/;
Spark environment variables
export SPARK_HOME=/opt/spark;
export SPARK_MASTER_IP=10.191.38.83;
Yarn environment variables
export YARN_CONF_DIR=/etc/hadoop/conf
Yarn environment variables
export PYTHONPATH=/opt/spark-2.2.0/python:/opt/spark-2.2.0/python/lib/py4j-0.10.4-src.zip;
export PYTHONSTARTUP=/opt/spark-2.2.0/python/pyspark/shell.py;
export PYSPARK_PYTHON=/usr/src/Python-3.7.1/python;
Note: Change with your values in the highlighted field.
Hadoop environment variables
export HADOOP_HOME=/etc/hadoop/conf;
export HADOOP_CONF_DIR=/etc/hadoop/conf;
Hive environment variables
export HIVE_HOME=/etc/hadoop/conf;

Anaconda
Validate installation
anaconda-navigator
Update Conda (Only if needed)
conda update -n base -c defaults conda
Start Jupyter Notebook (If non root)
jupyter-notebook --ip='10.111.22.333' --port 9001 --debug > /opt/anaconda3/log.txt 2>&1
Start Jupyter Notebook (if root)
jupyter-notebook --ip='10.111.22.333' --port 9001 --debug --allow-root > /opt/anaconda3/log.txt 2>&1
Note: it’s only necessary to change the highlighted, ex: for your ip.

Jupyter or JupyterHub?
JupyterHub it’s a multi-purpose notebook that:
● Manages authentication.
● Spawns single-user notebook on-demand.
● Gives each user a complete notebook
server.
How to choose?

JupyterHub
Install JupyterHub Package (with Http-Proxy)
conda install -c conda-forge jupyterhub
Validate Installation
jupyterhub -h
Start JupyterHub Server
jupyterhub --ip='10.111.22.333' --port 9001 --debug > /opt/anaconda3/log.txt 2>&1

JupyterHub With LDAP
Install Simple LDAP Authenticator Plugin for JupyterHub
conda install -c conda-forge jupyterhub-ldapauthenticator
Install SudoSpawner
conda install -c conda-forge sudospawner
Install Package LDAP to be able to Create Users Locally
pip install jupyterhub-ldapcreateusers
Generate JupyterHub Config File
jupyterhub --generate-config
Note 1: it’s only necessary to change the highlighted, ex: for your ip.
Note 2: Sudospawner enables JupyterHub to spawn single-user servers without being root

Configure JupyterHub Config File
nano /opt/anaconda3/jupyterhub_config.py
import os
import pwd
import subprocess
# Function to Create User Home
def create_dir_hook(spawner):
if not os.path.exists(os.path.join('/home/', spawner.user.name)):
subprocess.call(["sudo", "/sbin/mkhomedir_helper", spawner.user.name])
c.Spawner.pre_spawn_hook = create_dir_hook
c.JupyterHub.authenticator_class = 'ldapcreateusers.LocalLDAPCreateUsers'
c.LocalLDAPCreateUsers.server_address = 'ad.localdomain'
c.LocalLDAPCreateUsers.server_port = 3268
c.LocalLDAPCreateUsers.use_ssl = False
c.LocalLDAPCreateUsers.lookup_dn = True
# Instructions to Define LDAP Search - Doesn't have in consideration possible group users
c.LocalLDAPCreateUsers.bind_dn_template = ['CN={username},DC=ad,DC=localdomain']
c.LocalLDAPCreateUsers.user_search_base = 'DC=ad,DC=localdomain'

c.LocalLDAPCreateUsers.lookup_dn_search_user = 'admin'
c.LocalLDAPCreateUsers.lookup_dn_search_password = 'passWord'
c.LocalLDAPCreateUsers.lookup_dn_user_dn_attribute = 'CN'
c.LocalLDAPCreateUsers.user_attribute = 'sAMAccountName'
c.LocalLDAPCreateUsers.escape_userdn = False
c.JupyterHub.hub_ip = '10.111.22.333’
c.JupyterHub.port = 9001
# Instructions Required to Add User Home
c.LocalAuthenticator.add_user_cmd = ['useradd', '-m']
c.LocalLDAPCreateUsers.create_system_users = True
c.Spawner.debug = True
c.Spawner.default_url = 'tree/home/{username}'
c.Spawner.notebook_dir = '/'
c.PAMAuthenticator.open_sessions = True
Start JupyterHub Server With Config File
jupyterhub -f /opt/anaconda3/jupyterhub_config.py --debug

JupyterHub with LDAP + ProxyUser
Has a reminder, to have ProxyUser working, you will require on both Machines (Cluster and JupyterHub): Java 1.8 and
same Spark version, for this example it will be used the 2.2.0.
[Cluster] Confirm Cluster Spark & Hadoop Version
spark-shell
hadoop version
[JupyterHub] Download Spark & Create Symbolic link
cd /tmp/
wget https://archive.apache.org/dist/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.6.tgz
tar zxvf spark-2.2.0-bin-hadoop2.6.tgz
mv spark-2.2.0-bin-hadoop2.6 /opt/spark-2.2.0
ln -s /opt/spark-2.2.0 /opt/spark
Note: change with your Spark and Hadoop version in the highlighted field.

Jupyter Hub with LDAP + ProxyUser
[Cluster] Copy Hadoop/Hive/Spark Config files
cd /etc/spark2/conf.cloudera.spark2_on_yarn/
scp * root@10.111.22.333:/etc/hadoop/conf/
[Cluster] HDFS ProxyUser
Note: change with your IP and directory’s in the highlighted field.
[JupyterHub] Create hadoop config files directory
mkdir -p /etc/hadoop/conf/
ln -s /etc/hadoop/conf/ conf.cloudera.yarn
[JupyterHub] Create spark-events directory
mkdir /tmp/spark-events
chown spark:spark spark-events
chmod 777 /tmp/spark-events
[JupyterHub] Test Spark 2
spark-submit --class org.apache.spark.examples.SparkPi
--master yarn
--num-executors 1 --driver-memory 512m --executor-memory 512m
--executor-cores 1 --deploy-mode cluster
--proxy-user tpsimoes --keytab /root/jupyter.keytab
--conf spark.eventLog.enabled=true
/opt/spark-2.2.0/examples/jars/spark-examples_2.11-2.2.0.jar 10;

Check available kernel specs
jupyter kernelspec list
Install PySpark Kernel
conda install -c conda-forge pyspark
Confirm kernel installation
Edit PySpark kernel
nano /opt/anaconda3/share/jupyter/kernels/pyspark/kernel.json
{"argv":
["/opt/anaconda3/share/jupyter/kernels/pyspark/python.sh", "-f", "{connection_file}"],
"display_name": "PySpark (Spark 2.2.0)", "language":"python" }
Create PySpark Script
cd /opt/anaconda3/share/jupyter/kernels/pyspark;
touch python.sh;
chmod a+x python.sh;

The python.sh script was elaborated due to the limitations on JupyterHub Kernel configurations that isn't able to get the
Kerberos Credentials and also due to LDAP package that doesn't allow the proxyUser has is possible with Zeppelin. Therefore
with this architecture solution you are able to:
● Add a new step of security, that requires the IDE keytab
● Enable the usage of proxyUser by using the flag from spark --proxy-user ${KERNEL_USERNAME}
Edit PySpark Script
touch /opt/anaconda3/share/jupyter/kernels/pyspark/python.sh;
nano /opt/anaconda3/share/jupyter/kernels/pyspark/python.sh;
# !/usr/bin/env bash
# setup environment variable, etc.
PROXY_USER="$(whoami)"
export JAVA_HOME=/usr/java/jdk1.8.0_181-amd64
export SPARK_HOME=/opt/spark
export SPARK_MASTER_IP=10.111.22.333
export HADOOP_HOME=/etc/hadoop/conf

Edit PySpark Script
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HIVE_HOME=/etc/hadoop/conf
export PYTHONPATH=/opt/spark-2.2.0/python:/opt/spark-2.2.0/python/lib/py4j-0.10.4-src.zip
export PYTHONSTARTUP=/opt/spark-2.2.0/python/pyspark/shell.py
export PYSPARK_PYTHON=/usr/src/Python-3.7.1/python
export PYSPARK_SUBMIT_ARGS="-v --master yarn --deploy-mode client --conf
spark.serializer=org.apache.spark.serializer.KryoSerializer --num-executors 2 --driver-memory 1024m --executor-memory 1024m
--executor-cores 2 --proxy-user "${PROXY_USER}" --keytab /tmp/jupyter.keytab pyspark-shell"
# Kinit User/Keytab defined por the ProxyUser on the Cluster/HDFS
kinit -kt /tmp/jupyter.keytab jupyter/cm1.localdomain@DOMAIN.COM
# run the ipykernel
exec /opt/anaconda3/bin/python -m ipykernel $@
Note: change with your IP and directories in the highlighted field.

Interact with JupyterHub
Login
http://10.111.22.333:9001/hub/login
Notebook Kernel

To use JupyterLab without it being the default interface, you just have to
swap on your browser url the “tree” with Lab!
http://10.111.22.333:9001/user/tpsimoes/lab
JupyterLab
JupyterLab it’s the next-generation web-based
interface for Jupyter.
Install JupyterLab
conda install -c conda-forge jupyterlab
Install JupyterLab Launcher
conda install -c conda-forge jupyterlab_launcher

JupyterLab
To be able to use the JupyterLab interface as default on Jupyter it requires additional changes.
● Change the JupyterHub Config File
● Additional extensions (for the Hub Menu)
● Create config file for JupyterLab
Edit PySpark Script
nano /opt/anaconda3/jupyterhub_config.py
...
# Change the values on this Flags
c.Spawner.default_url = '/lab'
c.Spawner.notebook_dir = '/home/{username}'
# Add this Flag
c.Spawner.cmd = ['jupyter-labhub']

JupyterLab
Install jupyterlab-hub extension
jupyter labextension install @jupyterlab/hub-extension
Create JupyterLab Config File
cd /opt/anaconda3/share/jupyter/lab/settings/
nano page_config.json
{
"hub_prefix": "/jupyter"
}

JupyterLab
The final architecture:

R, Hive and Impala on JupyterHub
On this section the focus will reside on R, Hive, Impala and Kerberized Kernel.
With R Kernel, it requires libs on both Machines (Cluster and Jupyter)
[Cluster & Jupyter] Install R Libs
yum install -y openssl-devel openssl libcurl-devel libssh2-devel
[Jupyter] Create SymLinks for R libs
ln -s /opt/anaconda3/lib/libssl.so.1.0.0 /usr/lib64/libssl.so.1.0.0;
ln -s /opt/anaconda3/lib/libcrypto.so.1.0.0 /usr/lib64/libcrypto.so.1.0.0;
[Cluster & Jupyter] To use SparkR
devtools::install_github('apache/spark@v2.2.0', subdir='R/pkg')
[Cluster & Jupyter] Start R & Install Packages
R
install.packages('git2r')
install.packages('devtools')
install.packages('repr')
install.packages('IRdisplay')
install.packages('crayon')
install.packages('pbdZMQ')

To interact with Hive metadata and the direct use of the sintax, the my recommendation is the HiveQL.
Install Developer Toolset Libs
yum install cyrus-sasl-devel.x86_64 cyrus-sasl-gssapi.x86_64 cyrus-sasl-sql.x86_64 cyrus-sasl-plain.x86_64 gcc-c++
Install Python + Hive interface (SQLAlchemy interface for Hive)
pip install pyhive
Install HiveQL Kernel
pip install --upgrade hiveqlKernel
jupyter hiveql install
Confirm HiveQL Kernel installation

Edit HiveQL Kernel
cd /usr/local/share/jupyter/kernels/hiveql
nano kernel.json
{"argv":
["/usr/local/share/jupyter/kernels/hiveql/hiveql.sh", "-f", "{connection_file}"],
"display_name": "HiveQL", "language": "hiveql", "name": "hiveql"}
Create and Edit HiveQL script
touch /opt/anaconda3/share/jupyter/kernels/hiveql/hiveql.sh;
nano /opt/anaconda3/share/jupyter/kernels/hiveql/hiveql.sh;
# !/usr/bin/env bash
# setup environment variable, etc.

Edit HiveQL script
export HIVE_AUX_JARS_PATH=/etc/hadoop/postgresql-9.0-801.jdbc4.jar
export HADOOP_CLIENT_OPTS="-Xmx2147483648 -XX:MaxPermSize=512M -Djava.net.preferIPv4Stack=true"
kinit -kt /tmp/jupyter.keytab jupyter/cm1.localdomain@DOMAIN.COM
# run the ipykernel
exec /opt/anaconda3/bin/python -m ipykernel $@
Note 1: change with your IP. directories and versions in the highlighted field.
Note 2: add your users keytab to a chosen directory so that is possible to run with proxyuser

To interact with Impala metadata, my recommendation is the Impyla, but there’s a catch, because due to a specific version of a
lib (thrift_sasl), the HiveQL will stop working, because hiveqlkernel 1.0.13 has the requirement thrift-sasl==0.3.*.
Install additional Libs for Impyla
pip install thrift_sasl==0.2.1: pip install sasl;
Install ipython-sql
conda install -c conda-forge ipython-sql
Install impyla
pip install impyla==0.15a1
Note: it was installed a alfa version for impyla due to an incompatibility with python versions superior to 3.7.

If you require to have access to Hive & Impala metadata, you can use Python + Hive with a kerberized custom kernel.
Install Jaydebeapi package
conda install -c conda-forge jaydebeapi
Create Python Kerberized Kernel
mkdir -p /usr/share/jupyter/kernels/pythonKerb
cd /usr/share/jupyter/kernels/pythonKerb
touch kernel.json
touch pythonKerb.sh
chmod a+x /usr/share/jupyter/kernels/pythonKerb/pythonKerb.sh
Edit Kerberized Kernel
nano /usr/share/jupyter/kernels/kernel.json
{"argv":
["/usr/local/share/jupyter/kernels/pythonKerb/pythonKerb.sh
", "-f", "{connection_file}"],
"display_name": "PythonKerberized", "language": "python",
"name": "pythonKerb"}
Edit Kerberized Kernel script
nano /usr/share/jupyter/kernels/pythonKerb/pythonKerb.sh

export CLASSPATH=$CLASSPATH:`hadoop classpath`:/etc/hadoop/*:/tmp/*
export PYTHONPATH=$PYTHONPATH:/opt/anaconda3/lib/python3.7/site-packages/jaydebeapi
kinit -kt /tmp/${PROXY_USER}.keytab ${PROXY_USER}@DOMAIN.COM
# run the ipykernel
exec /opt/anaconda3/bin/python -m ipykernel_launcher $@

Assuming that you don't have Impyla installed, or if so, you have created an environment for it!
HiveQL it’s the best Kernel to access to hive metadata and it has support.
Install Hive interface & HiveQL Kernel
pip install pyhive; pip install --upgrade hiveqlKernel;
Jupyter Install Kernel
jupyter hiveql install
Check kernel installation

To access to a kerberized Cluster you will require a Kerberos Ticket in cache, therefore the solution will be the following:
Edit Kerberized Kernel
nano /usr/local/share/jupyter/kernels/hiveql/kernel.json
{"argv":
["/usr/local/share/jupyter/kernels/hiveql/hiveql.sh", "-f", "{connection_file}"],
"display_name": "HiveQL", "language": "hiveql", "name": "hiveql"}
touch /usr/local/share/jupyter/kernels/hiveql/hiveql.sh
nano /usr/local/share/jupyter/kernels/hiveql/hiveql.sh

kinit -kt /tmp/${PROXY_USER}.keytab ${PROXY_USER}@DOMAIN.COM
# run the ipykernel
exec /opt/anaconda3/bin/python -m hiveql $@

Interact with JupyterHub Kernels
The following information will serve as base of knowledge, how to interact with previous configured kernels with a
kerberized Cluster.
[HiveQL] Create Connection
$$ url=hive://hive@cm1.localdomain:10000/
$$ connect_args={"auth": "KERBEROS","kerberos_service_name": "hive"}
$$ pool_size=5
$$ max_overflow=10
[Impyla] Create Connection
from impala.dbapi import connect
conn = connect(host='cm1.localdomain', port=21050, kerberos_service_name='impala', auth_mechanism='GSSAPI')

Interact with JupyterHub Kernels
[Impyla] Create Connection via SQLMagic
%load_ext sql
%config SqlMagic.autocommit=False
%sql impala://tpsimoes:welcome1@cm1.localdomain:21050/db?kerberos_service_name=impala&auth_mechanism=GSSAPI
[Python] Create Connection
import jaydebeapi
import pandas as pd
conn_hive =
jaydebeapi.connect("org.apache.hive.jdbc.HiveDriver","jdbc:hive2://cm1.localdomain:10000/db;AuthMech=1;KrbRealm=DOMAIN.
COM;KrbHostFQDN=cm1.localdomain;KrbServiceName=hive;KrbAuthType=2")
[Python] Kinit Keytab
import subprocess
result = subprocess.run(['kinit', '-kt','/tmp/tpsimoes.keytab',tpsimoes/cm1.localdomain@DOMAIN.COM'],
stdout=subprocess.PIPE)
result.stdout

Thanks
Big Data Engineer
Tiago Simões

How to create a multi tenancy for an interactive data analysis with jupyter hub and ldap

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Semelhante a How to create a multi tenancy for an interactive data analysis with jupyter hub and ldap

Semelhante a How to create a multi tenancy for an interactive data analysis with jupyter hub and ldap (20)

Mais de Tiago Simões

Mais de Tiago Simões (7)

Último

Último (20)

How to create a multi tenancy for an interactive data analysis with jupyter hub and ldap