With huge amounts of heterogeneous data available, processing it and extracting knowledge require ever more complex software architectures. In this context, Apache Spark provides a powerful and efficient approach to large-scale data processing. This talk briefly introduces Spark's machine learning library (MLlib) along with a general overview of the Spark framework, describing how to launch applications on a cluster. A demo then shows how to simulate a Spark cluster on a local machine using images available in a public Docker Hub repository. Finally, another demo shows how to save time by unit testing jobs before running them on a cluster.
2. Agenda
• Problems and Motivation
• Spark and MLlib overview
• Launching applications in a Spark cluster
• Simulating a Spark cluster using Docker
• Demo: deploying a Spark cluster in a local machine
• Unit tests for Spark jobs
3.
• How to set up a Spark cluster (infrastructure + configuration)?
• How to test and/or debug a Spark job?
• The whole team should share the same environment
4. Run Spark Locally with Docker
• Lightweight cluster
• One machine
• Same environment for the whole team
• Easily deployed on any platform
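A local cluster like the one demoed here can be described in a small Docker Compose file. The sketch below is illustrative only: the image name, `SPARK_MODE` variable, and ports are assumptions about the particular Spark image used, not part of the talk's repository.

```yaml
# Minimal sketch of a one-machine Spark cluster (assumed image/env names).
version: "2"
services:
  master:
    image: bitnami/spark          # any Spark image with master/worker modes works
    environment:
      - SPARK_MODE=master         # image-specific setting (assumption)
    ports:
      - "8080:8080"               # master web UI
      - "7077:7077"               # master RPC endpoint workers connect to
  worker:
    image: bitnami/spark
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://master:7077
```

Running `docker-compose up --scale worker=2` would then give every team member the same two-worker cluster on their own machine.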
5.
• Easy to develop (APIs in Java, Scala, Python, R)
• High-quality algorithms
• Fast to run
• Lazy evaluation
• In-memory storage
http://spark.apache.org/mllib/
12. Example to Run
• MLlib's FP-Growth algorithm
• Data from the digital publishing domain
• Problem: find frequent patterns in navigation profiles
• Write the results to MongoDB
http://github.com/joelplucas/fpgrowth-spark-example
14. Unit Testing using Spark Testing Base
• Launched at Strata NYC 2015 by Holden Karau (and maintained by the community)
• Supports unit tests in Java, Scala and Python