
data science toolkit 101: set up Python, Spark, & Jupyter


Data science toolkit 101: set up Python, Spark, & Jupyter on a Mac laptop



  1. IBM Cloud Data Services
     data science toolkit 101: set up Python, Spark, & Jupyter
     Raj Singh, PhD, Developer Advocate: Geo | Open Data
     rrsingh@us.ibm.com | http://ibm.biz/rajrsingh | twitter: @rajrsingh
  2. Agenda
     • Installation
     • Python
     • Spark
     • Pixiedust
     • Examples
  3. IBM Analytics Data Science Experience (DSX)
  4. What is Spark?
     • In-memory Hadoop
       • Hadoop was massively scalable but slow
       • "Up to 100x faster" (10x faster if memory is exhausted)
     • What is Hadoop?
       • HDFS: fault-tolerant storage using horizontally scalable commodity hardware
       • MapReduce: programming style for distributed processing
     • Presents data as an object independent of the underlying storage (see the RDD sketch after the slides)
  5. Spark abstracted storage
     • Scala
     • PySpark = Spark + Python
     • Drivers (see the storage sketch after the slides)
       • File storage
       • Cloudant
       • dashDB
       • Cassandra
       • …
  6. Python installation with miniconda
     1. Download from https://www.continuum.io/downloads (choose version 2.7)
     2. Install Miniconda2 into this location: /Users/<username>/miniconda2
     3. bash$ conda install pandas jupyter matplotlib
     4. bash$ which python
        /Users/<username>/miniconda2/bin/python
     Reference: https://dzone.com/refcardz/apache-spark
     (An environment-check sketch appears after the slides.)
  7. Spark installation
     • http://spark.apache.org/downloads.html
       • Spark release: 1.6.2
       • package type: Pre-built for Hadoop 2.6
     • mkdir dev
     • cd dev
     • tar xzf ~/Downloads/spark-1.6.2-bin-hadoop2.6.tgz
     • ln -s spark-1.6.2-bin-hadoop2.6 spark
     • mkdir ~/dev/notebooks
     (A smoke-test sketch appears after the slides.)
  8. PySpark configuration
     • create directory ~/.ipython/kernels/pyspark1.6/
     • create file kernel.json in that directory with the contents shown below
     • cd ~/dev/spark/conf
     • cp spark-defaults.conf.template spark-defaults.conf
     • add to end of spark-defaults.conf: spark.driver.extraClassPath=<HOME DIRECTORY>/data/libs/*
     kernel.json:
     {
       "display_name": "pySpark (Spark 1.6.2) Python 2",
       "language": "python",
       "argv": [
         "/Users/sparktest/miniconda2/bin/python",
         "-m",
         "ipykernel",
         "-f",
         "{connection_file}"
       ],
       "env": {
         "SPARK_HOME": "/Users/sparktest/dev/spark",
         "PYTHONPATH": "/Users/sparktest/dev/spark/python/:/Users/sparktest/dev/spark/python/lib/py4j-0.9-src.zip",
         "PYTHONSTARTUP": "/Users/sparktest/dev/spark/python/pyspark/shell.py",
         "PYSPARK_SUBMIT_ARGS": "--master local[10] pyspark-shell",
         "SPARK_DRIVER_MEMORY": "10G",
         "SPARK_LOCAL_IP": "127.0.0.1"
       }
     }
  9. PySpark test
     • bash$ cd ~/dev
     • bash$ jupyter notebook
     • In the upper right of the Jupyter screen, click New and choose pySpark (Spark 1.6.2) Python 2 (or whatever name is specified in your kernel.json file)
     • In the notebook's first cell, enter sc.version and click the >| button to run it (or hit CTRL + Enter)
     (A first-cell sketch appears after the slides.)
  10. Pixiedust installation
      • cd ~/dev
      • git clone https://github.com/ibm-cds-labs/pixiedust.git
      • pip install --user --upgrade --no-deps -e /Users/sparktest/dev/pixiedust
      • pip install maven-artifact
      • pip install mpld3
      (A usage sketch appears after the slides.)
  11. Examples
      • Pixiedust: https://github.com/ibm-cds-labs/pixiedust
      • Demographic analyses: http://ibm-cds-labs.github.io/open-data/samples/
        or https://github.com/ibm-cds-labs/open-data/tree/master/samples
  12. IBM Cloud Data Services
      Raj Singh, Developer Advocate: Geo | Open Data
      rrsingh@us.ibm.com | http://ibm.biz/rajrsingh
      Twitter: @rajrsingh | LinkedIn: rajrsingh
      Thanks
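
A minimal PySpark sketch of the "data as an object independent of storage" idea from slide 4, assuming the local pySpark kernel configured in slides 8-9 (so sc already exists); the file name logs.txt is a made-up placeholder, not from the slides.

    # Build an RDD backed by a text file; nothing is read until an action runs.
    lines = sc.textFile("logs.txt")                          # placeholder file
    # Keep the filtered result in memory so later queries skip the re-read;
    # this is the "in-memory Hadoop" speedup the slide refers to.
    errors = lines.filter(lambda l: "ERROR" in l).cache()
    print(errors.count())                                    # first action reads the file
    print(errors.filter(lambda l: "timeout" in l).count())   # served from the cached RDD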
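
For the drivers bullet on slide 5, a hedged sketch of how one read API fronts different storage. The local JSON read uses standard Spark 1.6 calls; the Cloudant part is left commented out because the format string and option names here are assumptions to verify against the Cloudant connector docs, and people.json and the credentials are placeholders.

    # Local file storage through the DataFrame reader (sqlContext is predefined
    # by the pySpark kernel).
    local_df = sqlContext.read.json("people.json")   # placeholder file
    local_df.printSchema()

    # External drivers plug into the same reader via format()/option().
    # Assumed names; check the Cloudant connector documentation.
    # cloudant_df = (sqlContext.read
    #     .format("com.cloudant.spark")
    #     .option("cloudant.host", "ACCOUNT.cloudant.com")
    #     .option("cloudant.username", "USERNAME")
    #     .option("cloudant.password", "PASSWORD")
    #     .load("databasename"))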
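
A quick environment check for the Miniconda install in slide 6: run it with /Users/<username>/miniconda2/bin/python to confirm the interpreter and the packages from step 3 resolve. The script itself is not from the slides.

    import sys

    import IPython      # pulled in by the jupyter install
    import matplotlib
    import pandas

    print(sys.executable)          # should end in .../miniconda2/bin/python
    print(pandas.__version__)
    print(matplotlib.__version__)
    print(IPython.__version__)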
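
A smoke test for the Spark layout from slide 7, run with the Miniconda python outside Jupyter. It wires up SPARK_HOME and the py4j zip by hand, mirroring the paths the kernel.json in slide 8 uses; treat it as a sketch, not part of the original instructions.

    import os
    import sys

    spark_home = os.path.expanduser("~/dev/spark")
    os.environ["SPARK_HOME"] = spark_home
    sys.path.insert(0, os.path.join(spark_home, "python"))
    sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.9-src.zip"))

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "install-smoke-test")
    print(sc.version)                          # expect 1.6.2
    print(sc.parallelize(range(100)).sum())    # expect 4950
    sc.stop()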
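
A first-cell sketch for the notebook test in slide 9; beyond sc.version from the slide, the parallelize lines are an extra sanity check of my own.

    # sc and sqlContext are predefined by the kernel's PYTHONSTARTUP (pyspark/shell.py).
    print(sc.version)                              # expect 1.6.2
    squares = sc.parallelize(range(10)).map(lambda x: x * x)
    print(squares.collect())                       # [0, 1, 4, 9, ..., 81]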
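
A usage sketch for Pixiedust after the installs in slide 10, based on its documented import pixiedust / display(df) entry point; the tiny DataFrame is invented for illustration, and exact behavior may vary by Pixiedust version.

    import pixiedust   # also makes display() available in the notebook

    df = sqlContext.createDataFrame(
        [("MA", 6.8), ("NY", 19.7), ("CA", 39.1)],
        ["state", "population_millions"])
    display(df)        # interactive table/chart widget from Pixiedust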
