Building machine learning applications locally with Spark — Joel Pinho Lucas (Tailtarget) @PAPIs Connect — São Paulo 2017

At a time when huge amounts of heterogeneous data are available, processing them and extracting knowledge demand ever more effort in building complex software architectures. In this context, Apache Spark provides a powerful and efficient approach to large-scale data processing. This talk will briefly introduce Spark's machine learning library (MLlib) along with a general overview of the Spark framework, describing how to launch applications in a cluster. A demo will then show how to simulate a Spark cluster on a local machine using images available in a public Docker Hub repository. Finally, another demo will show how to save time by validating jobs with unit tests before running them in a cluster.

Published in: Technology

  1. Building Machine Learning applications locally with Spark, 21/06/2017, Joel Pinho Lucas
  2. Agenda • Problems and Motivation • Spark and MLlib overview • Launching applications in a Spark cluster • Simulating a Spark cluster using Docker • Demo: deploying a Spark cluster in a local machine • Unit tests for Spark jobs
  3. • How to set up a Spark cluster (infra + configuration)? • How to test and/or debug a Spark job? • The whole team should have the same environment
  4. Run Spark Locally with Docker • Lightweight cluster • One machine • Same environment for the whole team • Easily deployed on any platform
  5. • Easy to develop (APIs in Java, Scala, Python, R) • High-quality algorithms (http://spark.apache.org/mllib/) • Fast to run • Lazy evaluation • In-memory storage
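
To make the last two points concrete, here is a minimal sketch (not from the talk) of lazy evaluation and in-memory storage with the RDD API; the sample lines and the word-count logic are invented for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvalSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lazy-eval-sketch").setMaster("local[*]"))

    // Transformations are lazy: nothing is computed yet, Spark only records the lineage.
    val lines  = sc.parallelize(Seq("spark mllib spark", "docker spark", "mllib"))
    val counts = lines.flatMap(_.split("\\s+")).map(w => (w, 1)).reduceByKey(_ + _)

    // cache() marks the result for in-memory storage after the first action,
    // so later actions reuse it instead of recomputing the whole lineage.
    counts.cache()

    println(counts.count())            // first action: triggers the computation and fills the cache
    counts.collect().foreach(println)  // second action: served from the in-memory copy
    sc.stop()
  }
}
```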
  6. Spark Execution Model (http://spark.apache.org/docs/2.1.0/cluster-overview.html)
  7. Cluster Types • Standalone • Apache Mesos • Hadoop YARN
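
As a hedged sketch of how the cluster type shows up in application code: it is selected purely by the master URL, which in practice is usually supplied with spark-submit rather than hard-coded. The hostnames and ports below are placeholders, not values from the talk.

```scala
import org.apache.spark.sql.SparkSession

object ClusterTypeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cluster-type-sketch")
      .master("local[*]")                    // local mode, handy for development
      // .master("spark://master-host:7077") // Standalone cluster manager
      // .master("mesos://mesos-host:5050")  // Apache Mesos
      // .master("yarn")                     // Hadoop YARN (reads HADOOP_CONF_DIR / YARN_CONF_DIR)
      .getOrCreate()

    println(s"Running on Spark ${spark.version}")
    spark.stop()
  }
}
```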
  8. Starting a Cluster Manually / Manually Submitting an Application
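
As a rough sketch of that flow (the commands and code are not taken from the slides): a standalone master and worker can be started with sbin/start-master.sh and sbin/start-slave.sh followed by the master URL, and an application skeleton like the one below, with no hard-coded master, can then be packaged as a jar and submitted with bin/spark-submit --class SubmitMeApp --master spark://<master-host>:7077 app.jar. The class and application names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Minimal application skeleton meant to be packaged as a jar and launched with
// spark-submit; the master URL is intentionally not set here so it can be
// supplied at submit time with --master.
object SubmitMeApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("submit-me-app").getOrCreate()
    val sc = spark.sparkContext

    // Trivial job, just enough to verify the submission in the master's web UI.
    val sum = sc.parallelize(1 to 1000).sum()
    println(s"sum = $sum")

    spark.stop()
  }
}
```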
  9. Choose your Docker Image (or build your own and share)
  10. Some available Spark Docker Images • https://github.com/big-data-europe/docker-spark • https://hub.docker.com/r/internavenue/centos-spark/ • https://github.com/sequenceiq/docker-spark • https://github.com/epahomov/docker-spark • https://www.anchormen.nl/spark-docker/ • https://github.com/gettyimages/docker-spark • https://hub.docker.com/r/bigdatauniversity/spark/
  11. http://github.com/joelplucas/docker-spark
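
From the application's point of view, a cluster simulated with one of these images is just a standalone master running in containers. The smoke test below is hypothetical and assumes the master container publishes the default standalone port 7077 on localhost; the actual host and port depend on how the chosen image is run.

```scala
import org.apache.spark.sql.SparkSession

object DockerClusterSmokeTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("docker-cluster-smoke-test")
      .master("spark://localhost:7077")  // assumed mapping of the dockerized master
      .getOrCreate()

    // If this prints a count, the driver registered with the containerized master
    // and managed to run tasks on its workers.
    println(spark.sparkContext.parallelize(1 to 100).count())
    spark.stop()
  }
}
```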
  12. Example to Run • MLlib's FP-Growth algorithm • Data from the digital publishing domain • Problem: find frequent patterns in navigation profiles • Write results to MongoDB • http://github.com/joelplucas/fpgrowth-spark-example
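
The sketch below runs MLlib's FP-Growth on a few hard-coded toy transactions to show the shape of the API. It is only in the spirit of the linked repo, which reads real navigation profiles and writes the frequent patterns to MongoDB; the category names and the minimum support threshold here are invented.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

object FPGrowthSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("fpgrowth-sketch").setMaster("local[*]"))

    // Each transaction is the set of content categories visited in one session (made up).
    val transactions: RDD[Array[String]] = sc.parallelize(Seq(
      Array("sports", "politics", "tech"),
      Array("sports", "tech"),
      Array("politics", "economy"),
      Array("sports", "politics", "tech", "economy")
    ))

    val model = new FPGrowth()
      .setMinSupport(0.5)   // keep itemsets appearing in at least half of the sessions
      .setNumPartitions(2)
      .run(transactions)

    model.freqItemsets.collect().foreach { itemset =>
      println(s"${itemset.items.mkString("[", ",", "]")} -> ${itemset.freq}")
    }
    sc.stop()
  }
}
```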
  13. The Dataset
  14. Unit Testing using Spark Testing Base • Launched at Strata NYC 2015 by Holden Karau (and maintained by the community) • Supports unit tests in Java, Scala and Python
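
A hypothetical ScalaTest suite using spark-testing-base might look like the following: the SharedSparkContext trait supplies a local SparkContext named sc that is reused across the suite, so job logic can be validated without submitting anything to a cluster. The word-counting logic is made up for illustration and is not the talk's FP-Growth job.

```scala
import com.holdenkarau.spark.testing.SharedSparkContext
import org.scalatest.FunSuite

class WordCountLogicSuite extends FunSuite with SharedSparkContext {

  test("words are counted per key") {
    val input = sc.parallelize(Seq("spark docker spark", "docker"))

    val counts = input
      .flatMap(_.split("\\s+"))
      .map(w => (w, 1))
      .reduceByKey(_ + _)
      .collectAsMap()

    assert(counts("spark") === 2)
    assert(counts("docker") === 2)
  }
}
```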
  15. Q&A - Contact ‣ LinkedIn: http://br.linkedin.com/in/joelplucas/ ‣ Email: joelpl@gmail.com
