data science toolkit 101: set up Python, Spark, & Jupyter

Raj Singh

Data science toolkit 101: set up Python, Spark, & Jupyter on a Mac laptop

IBM Cloud Data Services
data science toolkit 101
set up Python, Spark, & Jupyter
Raj Singh, PhD
Developer Advocate: Geo | Open Data
rrsingh@us.ibm.com
http://ibm.biz/rajrsingh
twitter: @rajrsingh

Agenda
• Installation
• Python
• Spark
• Pixiedust
• Examples

IBM Analytics
Data Science Experience (DSX)

What is Spark?
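• Often described as “in-memory Hadoop”: a distributed processing engine that keeps working data in memory
• Hadoop was massively scalable but slow, because MapReduce writes intermediate results to disk
• “Up to 100x faster” than Hadoop MapReduce in memory (about 10x faster when the data does not fit in memory)
• What is Hadoop?
• HDFS: fault-tolerant storage using horizontally scalable commodity hardware
• MapReduce: programming style for distributed processing
• Presents data as an object independent of the underlying storage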
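
The MapReduce programming style mentioned above can be sketched in plain Python. This is only an illustration of the map, shuffle, and reduce phases on a single machine, not how Hadoop or Spark implement them; the function names and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def shuffle_phase(pairs):
    """Shuffle: group values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: fold each key's list of values into a single count."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


def word_count(lines):
    """Run the three phases in order on an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    sample = ["spark is fast", "hadoop is scalable", "spark is in memory"]
    print(word_count(sample))
```

In a real cluster the map and reduce functions run on many machines at once and the shuffle moves data across the network; the shape of the program stays the same.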

Spark abstracted storage
• Scala
• PySpark = (Spark + Python)
• Drivers
• File storage
• Cloudant
• dashDB
• Cassandra
• …

Python installation with miniconda
1. https://www.continuum.io/downloads (choose version 2.7)
2. Miniconda2 install into this location: /Users/<username>/miniconda2
3. bash$ conda install pandas jupyter matplotlib
4. bash$ which python
/Users/<username>/miniconda2/bin/python
https://dzone.com/refcardz/apache-spark
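
A quick way to confirm the installation from inside Python, as a complement to `which python`: the sketch below reports which interpreter is running and whether the three packages from step 3 can be found. The helper name `check_environment` is hypothetical, and looking for a top-level module called `jupyter` is an assumption about how that package is laid out.

```python
import importlib.util
import sys

# Packages installed in step 3 of the slide.
REQUIRED = ("pandas", "jupyter", "matplotlib")


def check_environment(packages=REQUIRED):
    """Return the interpreter path and, per package, whether it is importable."""
    found = {name: importlib.util.find_spec(name) is not None for name in packages}
    return {"python": sys.executable, "packages": found}


if __name__ == "__main__":
    report = check_environment()
    print("interpreter:", report["python"])
    for name, ok in sorted(report["packages"].items()):
        print(name, "ok" if ok else "MISSING")
```

If the interpreter path does not point into the miniconda2 directory, the shell is probably picking up a different Python first on its PATH.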

Spark installation
• http://spark.apache.org/downloads.html
• Spark release: 1.6.2
• package type: Pre-built for Hadoop 2.6
• mkdir ~/dev
• cd ~/dev
• tar xzf ~/Downloads/spark-1.6.2-bin-hadoop2.6.tgz
• ln -s spark-1.6.2-bin-hadoop2.6 spark
• mkdir ~/dev/notebooks
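
For anyone who prefers to script the unpack-and-link steps, here is a rough Python equivalent using only the standard library. The function name `install_spark` is hypothetical, and the sketch assumes the tarball's first entry sits under a single release directory, which should be checked against the actual download.

```python
import os
import tarfile


def install_spark(tarball, dev_dir, link_name="spark"):
    """Unpack a Spark tarball into dev_dir and point dev_dir/<link_name> at it."""
    os.makedirs(dev_dir, exist_ok=True)
    with tarfile.open(tarball, "r:gz") as archive:
        # Assumption: the first path component of the first entry is the release directory.
        top_level = archive.getnames()[0].split("/")[0]
        archive.extractall(dev_dir)
    link_path = os.path.join(dev_dir, link_name)
    if os.path.islink(link_path):
        os.remove(link_path)  # replace a link left over from an earlier release
    os.symlink(top_level, link_path)  # relative link, like `ln -s <release> spark`
    os.makedirs(os.path.join(dev_dir, "notebooks"), exist_ok=True)
    return link_path


if __name__ == "__main__":
    tgz = os.path.expanduser("~/Downloads/spark-1.6.2-bin-hadoop2.6.tgz")
    if os.path.exists(tgz):
        print(install_spark(tgz, os.path.expanduser("~/dev")))
```

Keeping a version-neutral `spark` link means later configuration can refer to one path and survive an upgrade to a newer release.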

PySpark configuration
• create directory ~/.ipython/kernels/pyspark1.6/
• create file kernel.json
• cd ~/dev/spark/conf
• cp spark-defaults.conf.template spark-defaults.conf
• add to end of spark-defaults.conf:
spark.driver.extraClassPath=<HOME DIRECTORY>/data/libs/*

kernel.json:
{
  "display_name": "pySpark (Spark 1.6.2) Python 2",
  "language": "python",
  "argv": [
    "/Users/sparktest/miniconda2/bin/python",
    "-m",
    "ipykernel",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "/Users/sparktest/dev/spark",
    "PYTHONPATH": "/Users/sparktest/dev/spark/python/:/Users/sparktest/dev/spark/python/lib/py4j-0.9-src.zip",
    "PYTHONSTARTUP": "/Users/sparktest/dev/spark/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "--master local[10] pyspark-shell",
    "SPARK_DRIVER_MEMORY": "10G",
    "SPARK_LOCAL_IP": "127.0.0.1"
  }
}
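
The kernel.json on this slide hard-codes /Users/sparktest, so the paths need adjusting on any other account. A small sketch that builds the same structure for a given home directory is shown below. The helper name `build_kernel_spec` and its `cores` and `driver_memory` parameters are hypothetical; the keys and values mirror the slide, and the paths are assembled in macOS (POSIX) style.

```python
import json
import os
import posixpath
import sys


def build_kernel_spec(home, python_exe, cores=10, driver_memory="10G"):
    """Build the kernel.json structure from the slide for a given home directory."""
    spark_home = posixpath.join(home, "dev", "spark")
    spark_python = posixpath.join(spark_home, "python")
    # py4j version taken from the slide; it differs between Spark releases.
    py4j = posixpath.join(spark_python, "lib", "py4j-0.9-src.zip")
    return {
        "display_name": "pySpark (Spark 1.6.2) Python 2",
        "language": "python",
        "argv": [python_exe, "-m", "ipykernel", "-f", "{connection_file}"],
        "env": {
            "SPARK_HOME": spark_home,
            "PYTHONPATH": ":".join([spark_python, py4j]),
            "PYTHONSTARTUP": posixpath.join(spark_python, "pyspark", "shell.py"),
            "PYSPARK_SUBMIT_ARGS": "--master local[%d] pyspark-shell" % cores,
            "SPARK_DRIVER_MEMORY": driver_memory,
            "SPARK_LOCAL_IP": "127.0.0.1",
        },
    }


if __name__ == "__main__":
    spec = build_kernel_spec(os.path.expanduser("~"), sys.executable)
    # Paste the output into ~/.ipython/kernels/pyspark1.6/kernel.json
    print(json.dumps(spec, indent=2))
```

The number in `local[10]` is the count of worker threads Spark runs locally, and the driver memory is 10G on the slide; both may need lowering on a smaller laptop.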

PySpark test
• bash$ cd ~/dev
• bash$ jupyter notebook
• in the upper right of the Jupyter screen, click New and choose
pySpark (Spark 1.6.2) Python 2
(or whatever name you specified in your kernel.json file)
• in the notebook's first cell, enter sc.version
and click the >| button to run it (or press CTRL + Enter)
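
Printing `sc.version` only shows that the SparkContext exists. A slightly stronger check, sketched below, also pushes a tiny job through it. The function name `spark_smoke_test` is hypothetical; `sc` is the SparkContext that the PYTHONSTARTUP script (pyspark/shell.py) is expected to create inside the pySpark kernel.

```python
def spark_smoke_test(sc):
    """Return the Spark version and the sum of 1..100 computed through an RDD."""
    total = sc.parallelize(range(1, 101)).sum()
    return {"version": sc.version, "sum_1_to_100": total}


# `sc` exists only inside the pySpark kernel; do nothing anywhere else.
if "sc" in globals():
    print(spark_smoke_test(sc))
```

A sum of 5050 means a job was actually distributed and collected, not just that the context object was constructed.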

Pixiedust installation
• cd ~/dev
• git clone https://github.com/ibm-cds-labs/pixiedust.git
• pip install --user --upgrade --no-deps -e /Users/sparktest/dev/pixiedust
• pip install maven-artifact
• pip install mpld3

Examples
• Pixiedust
• https://github.com/ibm-cds-labs/pixiedust
• Demographic analyses
• http://ibm-cds-labs.github.io/open-data/samples/
• or https://github.com/ibm-cds-labs/open-data/tree/master/samples

IBM Cloud Data Services
Raj Singh
Developer Advocate: Geo | Open Data
rrsingh@us.ibm.com
http://ibm.biz/rajrsingh
Twitter: @rajrsingh
LinkedIn: rajrsingh
Thanks
