JupyterHub, Jupyter, and PySpark Tutorial (PythonSudeste)

Data analysis with Python and JupyterHub

Schedule
- Installing and configuring PySpark + Jupyter
- Analyzing government data
- Integrating PySpark with pandas
- Installing and configuring JupyterHub
- Customizing Jupyter and JupyterHub (bonus)
Course material:
https://github.com/dmvieira/tutorial-jupyter-pyspark

What is Jupyter?
Jupyter = Julia + Python + R
But… there are kernels for many other languages:
https://github.com/jupyter/jupyter/wiki/Jupyter-kernels
PYTHON IS THE DEFAULT!

What is Jupyter?
pip3 install jupyter==1.0.0 -i https://pypi.python.org/simple
or
conda install -c anaconda jupyter=1.0.0
jupyter notebook

How do I add a kernel?
Let's add a PySpark kernel!
http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz
But what is PySpark?

● The Python API for Apache Spark, a tool for large-scale data processing
● Up to 100x faster than Hadoop MapReduce (for in-memory workloads)
● Distributes tasks in parallel
● Supports
– Java (Spark)
– Scala (Spark)
– Python (PySpark)
– R (SparkR)

Let's unpack it and get to work!
● echo "palavra um tres dois tres dois tres" > dataset/teste_de_palavras.txt
● tar -xvzf spark-2.1.0-bin-hadoop2.7.tgz
● cd spark-2.1.0-bin-hadoop2.7/bin
● ./pyspark
# sc is the SparkContext that the pyspark shell creates on startup
text_file = sc.textFile("../../dataset/teste_de_palavras.txt")
# split each line into words, pair each word with 1, then sum the 1s per word
counts = (text_file.flatMap(lambda line: line.split(" "))
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b))
counts.collect()
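To see what the three stages of the word count above compute, here is a minimal plain-Python sketch that mirrors them without Spark. It is only an illustration of the logic, not how Spark executes it: the list `lines` stands in for the RDD returned by `sc.textFile`, and everything runs in a single process.

```python
from collections import defaultdict
from itertools import chain

# Stand-in for the RDD of lines read from teste_de_palavras.txt
lines = ["palavra um tres dois tres dois tres"]

# flatMap: split every line and flatten the results into one stream of words
words = chain.from_iterable(line.split(" ") for line in lines)

# map: pair each word with the number 1
pairs = ((word, 1) for word in words)

# reduceByKey(lambda a, b: a + b): add up the values that share a key
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(sorted(counts.items()))
# [('dois', 2), ('palavra', 1), ('tres', 3), ('um', 1)]
```

In Spark the same pairs come back from `counts.collect()`, but their order is not guaranteed, which is why the sketch sorts them before printing.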

Now then… back to the kernel
export SPARK_HOME=/Users/diogo.munaro/learn/tutorial-jupyter/spark-2.1.0-bin-hadoop2.7
export PATH="$PATH:$SPARK_HOME/bin"
pip3 install -i https://pypi.anaconda.org/hyoon/simple toree==0.2.0.dev1
or
conda install -c anaconda toree=0.2.0.dev1
pip3 install "jupyter_client<5.0" -i https://pypi.python.org/simple
or
conda install -c anaconda "jupyter_client<5.0"
jupyter toree install --spark_home=$SPARK_HOME
jupyter toree install --interpreters=PySpark
jupyter notebook
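What an installer such as `jupyter toree install` registers is a kernel spec: a directory holding a `kernel.json` file that tells Jupyter how to launch the kernel. The sketch below writes a generic spec to a temporary directory to show the file's shape. It is not the spec that Toree generates; the directory name `example_kernel` and the display name are hypothetical, and the `argv` shown assumes the `ipykernel` package is installed.

```python
import json
import sys
import tempfile
from pathlib import Path


def write_kernel_spec(target_dir, display_name="Python (example)"):
    """Write a minimal kernel.json into target_dir and return its path."""
    spec = {
        # Command Jupyter runs to start the kernel; {connection_file} is
        # replaced by Jupyter with the path of the connection file.
        "argv": [sys.executable, "-m", "ipykernel_launcher",
                 "-f", "{connection_file}"],
        # Name shown in the notebook's "New" menu (hypothetical value)
        "display_name": display_name,
        "language": "python",
    }
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    path = target / "kernel.json"
    path.write_text(json.dumps(spec, indent=2))
    return path


# Hypothetical location, used only for illustration
spec_path = write_kernel_spec(Path(tempfile.mkdtemp()) / "example_kernel")
print(spec_path.read_text())
```

A directory like this can be registered with `jupyter kernelspec install --user <directory>`, and `jupyter kernelspec list` shows the kernels Jupyter currently knows about.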

Now then… the way that actually works
export SPARK_HOME=/Users/diogo.munaro/learn/tutorial-jupyter/spark-2.1.0-bin-hadoop2.7
export PATH="$PATH:$SPARK_HOME/bin"
pip3 install findspark==1.1.0 -i https://pypi.python.org/simple
jupyter notebook
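With findspark installed, a regular Python notebook can reach Spark without a dedicated kernel. A minimal sketch of what a first notebook cell might look like, assuming `SPARK_HOME` is exported as above; the helper name `start_spark` and the application name `tutorial` are hypothetical.

```python
def start_spark(app_name="tutorial"):
    """Locate Spark through SPARK_HOME and return a new SparkContext."""
    # Imported inside the function so this cell can be defined even on a
    # machine where findspark and Spark are not installed yet.
    import findspark

    # With no argument, findspark looks for Spark via the SPARK_HOME
    # environment variable and adds Spark's Python libraries to sys.path.
    findspark.init()

    # pyspark becomes importable only after findspark.init() has run
    import pyspark

    return pyspark.SparkContext(appName=app_name)


# In a notebook cell, once SPARK_HOME is set:
# sc = start_spark()
# sc.textFile("dataset/teste_de_palavras.txt").count()
```

Only one SparkContext can be active per Python process, so call `sc.stop()` before creating another one in the same notebook.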

Let's move on to data analysis!
pip3 install pandas==0.19.2 matplotlib==2.0.1 -i https://pypi.python.org/simple
or
conda install -c anaconda pandas=0.19.2 matplotlib=2.0.1
jupyter notebook
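For the pandas integration step, one common pattern is to aggregate in Spark, bring the small result to the driver, and wrap it in a pandas DataFrame for inspection or plotting. The sketch below assumes `pairs` has the shape returned by `counts.collect()` in the word count example, a list of `(word, count)` tuples; the helper name `counts_to_frame` and the column names are hypothetical.

```python
def counts_to_frame(pairs):
    """Turn a list of (word, count) tuples into a sorted pandas DataFrame."""
    # Imported inside the function so the definition loads without pandas
    import pandas as pd

    frame = pd.DataFrame(pairs, columns=["word", "count"])
    # Most frequent words first, with a clean 0..n-1 index
    return frame.sort_values("count", ascending=False).reset_index(drop=True)


# In a notebook, after running the word count:
# frame = counts_to_frame(counts.collect())
# frame.plot.bar(x="word", y="count")  # drawn with matplotlib
```

When the data is already a Spark DataFrame rather than an RDD of tuples, `DataFrame.toPandas()` performs the same hand-off. Either way the whole result is loaded into the driver's memory, so this suits aggregated results, not raw datasets.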

OK, and JupyterHub?
npm install -g configurable-http-proxy --registry https://registry.npmjs.org/
pip3 install jupyterhub==0.7.2 -i https://pypi.python.org/simple
or
conda install -c conda-forge jupyterhub=0.7.2
jupyterhub
http://localhost:8000

But how do I administer JupyterHub?
jupyterhub --generate-config
Edit jupyterhub_config.py
c.Authenticator.admin_users = set() → c.Authenticator.admin_users = {"diogo.munaro"}

Let's customize!
jupyter --paths
Change the login file!

Thank you
http://github.com/dmvieira/
https://www.linkedin.com/in/dmvieira/
diogo.mvieira@gmail.com
