Join us for an introductory walk-through of using the Spark-HPCC Systems ecosystem to analyze your HPCC Systems data using a collaborative Apache Zeppelin notebook environment.
Leveraging the Spark-HPCC Ecosystem
1. 2019 HPCC Systems® Community Day
Challenge Yourself – Challenge the Status Quo
James McMullan
Sr Software Engineer
LexisNexis Risk Solutions
Leveraging the Spark-HPCC Systems Ecosystem
2. Overview
• Spark-HPCC Plugin & Connector
• Basics of reading from and writing to HPCC Systems
• Brief introduction to Apache Zeppelin
• Create a random forest model in Spark
• Compare to Kaggle competition leaderboard
• Future of Spark-HPCC Systems Ecosystem
• Closing thoughts
4. Spark-HPCC Systems – Overview
• Spark-HPCC Systems Connector
• Spark library
• Allows reading and writing to HPCC Systems
• Can be installed on any Spark cluster
• Spark Plugin – Managed Spark Cluster
• Requires HPCC Systems 7.0+
• Spark cluster that mirrors the Thor cluster
• Configured through Config Manager
• Installs the Spark-HPCC Systems connector
5. Spark-HPCC Systems Connector - Progress
• Added support for remote writing
• HPCC Systems 7.2+
• Improved performance
• Scala, Python and R
• Increased reliability
• Lots of testing and bug fixes
• Added support for DataSource API v1
• Unified Read / Write interface
6. Spark-HPCC Systems Connector – Reading
clusterURL = "http://192.168.56.101:8010"
fileName = "example::dataset"
# Read dataset from HPCC Systems
df = spark.read.load(format="hpcc",
host=clusterURL,
password="",
username="",
limitPerFilePart=100,
projectList="field1, field2",
fileAccessTimeout=240,
path=fileName)
clusterURL <- "http://192.168.56.101:8010"
fileName <- "example::dataset"
# Read dataset from HPCC Systems
df <- read.df(source = "hpcc",
host = clusterURL,
password = "",
username = "",
limitPerFilePart = 100,
projectList = "field1, field2",
fileAccessTimeout = 240,
path = fileName)
PySpark Read Example SparkR Read Example
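The connector returns a standard Spark DataFrame, so the usual inspection calls apply. A quick follow-up in PySpark (plain Spark API, nothing connector-specific):

# Inspect the DataFrame returned by the read above
df.printSchema()
df.show(5)
print(df.count())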
7. Spark-HPCC Systems Connector – Writing
clusterURL = "http://192.168.56.101:8010"
fileName = "example::dataset"
# Write dataset to HPCC Systems
df.write.save(format="hpcc",
mode="overwrite",
host=clusterURL,
password="",
username="",
cluster="mythor",
path=fileName)
clusterURL <- "http://192.168.56.101:8010"
fileName <- "example::dataset"
# Write dataset to HPCC Systems
write.df(df, source = "hpcc",
host = clusterURL,
cluster = "mythor",
path = fileName,
mode = "overwrite",
password = "",
username = "",
fileAccessTimeout = 240)
PySpark Write Example SparkR Write Example
9. Apache Zeppelin - Overview
• Multi-user Notebook Environment
• Front end for Spark
• Collaborative
• Easy to use
• Handles resource management, job queuing, and allocation
• We do not support or package Zeppelin
10. Apache Zeppelin – Features
• Multi-user environment by default
• Version Control
• Interpreters are bound at the paragraph level
• Allows multiple languages in a single notebook
• Built-in visualization tools
• Ability to move data between languages (illustrated in the sketch after this list)
• Credential management
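To make paragraph-level interpreter binding and cross-language data sharing concrete, below is a minimal sketch of two Zeppelin paragraphs that share data through a Spark temp view. The interpreter names (%spark.pyspark, %spark.sql) and the view name are assumptions that depend on how the Spark interpreter group is configured; the connector options mirror the read example earlier.

%spark.pyspark
# Paragraph 1 (Python): read a logical file through the Spark-HPCC Systems
# connector and register it as a temp view visible to other interpreters
df = spark.read.load(format="hpcc",
                     host="http://192.168.56.101:8010",
                     username="",
                     password="",
                     fileAccessTimeout=240,
                     path="example::dataset")
df.createOrReplaceTempView("example_dataset")

%spark.sql
-- Paragraph 2 (SQL): query the shared view; the result feeds Zeppelin's
-- built-in charting
SELECT field1, COUNT(*) AS record_count
FROM example_dataset
GROUP BY field1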
12. Spark ML Model – Brief Intro to Random Forests
• Random Forests: an ensemble of decision trees
• Averaging the output of multiple decision trees gives a better prediction
• Random Forests require the data to be numeric (see the sketch after this list)
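As a minimal illustration of that requirement, here is a sketch in Spark ML (PySpark) that indexes a string column, assembles a numeric feature vector, and fits a random forest. The DataFrames (train_df, test_df) and column names (state, year_made, machine_hours, sale_price) are hypothetical placeholders, not the actual dataset schema.

# Encode a categorical column, build a feature vector, and fit a
# RandomForestRegressor; all names below are hypothetical placeholders
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.regression import RandomForestRegressor

indexer = StringIndexer(inputCol="state", outputCol="state_idx",
                        handleInvalid="keep")
assembler = VectorAssembler(
    inputCols=["state_idx", "year_made", "machine_hours"],
    outputCol="features")
rf = RandomForestRegressor(featuresCol="features", labelCol="sale_price",
                           numTrees=100)

pipeline = Pipeline(stages=[indexer, assembler, rf])
model = pipeline.fit(train_df)
predictions = model.transform(test_df)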
13. Spark ML Model – Bulldozers R US
• Open source bulldozer auction dataset from Kaggle
• www.kaggle.com/c/bluebook-for-bulldozers
• Create a Random Forest Model to predict auction price
• Compare our model against the Kaggle leaderboard
• Score is calculated by RMSLE (Root Mean Squared Logarithmic Error)
• RMSLE provides a percentage-based error (a sketch of the computation follows this list)
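RMSLE is not a built-in Spark ML metric, so here is a minimal sketch of computing it directly on a predictions DataFrame; the prediction and sale_price columns follow the hypothetical random forest sketch above.

# RMSLE = sqrt(mean((log(1 + prediction) - log(1 + actual))^2))
from pyspark.sql import functions as F

rmsle = (predictions
         .select(F.pow(F.log1p("prediction") - F.log1p("sale_price"), 2)
                 .alias("sq_log_err"))
         .agg(F.sqrt(F.avg("sq_log_err")).alias("rmsle"))
         .first()["rmsle"])
print("RMSLE:", rmsle)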
15. Spark ML Model – Results
• Our RMSLE ~ 0.26
• Around 50th out of 450 participants
• Not bad for little to no feature engineering
• An RMSLE of ~0.22 is possible with Random Forests
• Hyperparameter tuning (a cross-validation sketch follows this list)
• Feature engineering
• Deep Learning can do better than ~0.22
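As a sketch of the hyperparameter tuning step, below is cross-validation over the hypothetical pipeline from the earlier random forest sketch. Spark's RegressionEvaluator has no RMSLE metric, so RMSE is used as a stand-in here; RMSLE could instead be computed on the best model's predictions as shown above.

# Grid search over tree count and depth with 3-fold cross-validation;
# indexer, assembler, rf, and train_df come from the earlier sketch
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

param_grid = (ParamGridBuilder()
              .addGrid(rf.numTrees, [50, 100, 200])
              .addGrid(rf.maxDepth, [5, 10, 15])
              .build())

evaluator = RegressionEvaluator(labelCol="sale_price",
                                predictionCol="prediction",
                                metricName="rmse")

cv = CrossValidator(estimator=Pipeline(stages=[indexer, assembler, rf]),
                    estimatorParamMaps=param_grid,
                    evaluator=evaluator,
                    numFolds=3)

best_model = cv.fit(train_df).bestModel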
16. Spark-HPCC Systems – Future & Future Use Cases
• Continued support and improvement
• Leveraging libraries in Spark, Python and R
• Optimus – Data cleaning for Spark
• Matplotlib
• Spark Streaming
• IoT Events
• Telematics
• Deep Learning with Spark
• Possible now through external libraries
• Spark 3.0 is expected to support TensorFlow natively
17. Closing Thoughts
• Spark-HPCC Systems ecosystem provides new opportunities
• Access to an entire ecosystem of libraries and tools
• Apache Zeppelin is great
• Machine Learning and Deep Learning are accessible
• FastAI MOOC is a great way to learn
• Everyone should learn ML & Deep Learning
19. View this presentation on YouTube:
https://www.youtube.com/watch?v=AQF9XP-Hd74&list=PL-8MJMUpp8IKH5-d56az56t52YccleX5h&index=4&t=0s
(4:55:00)
Editor's Notes
We had a problem: we needed a front-end interface for Spark.
Using the command line to submit jobs to a cluster is not a good workflow for data science.
This is a solved problem; notebook environments like Jupyter Notebooks and Apache Zeppelin were created to solve it.
Internally we evaluated both Jupyter Notebooks and Apache Zeppelin, and found that Apache Zeppelin met our needs better.
We have been testing Apache Zeppelin with Spark since February.
We have also contributed some code to mainline Zeppelin to meet our needs
We aren’t packaging Zeppelin alongside the Spark-HPCC environment
The reason I am discussing Zeppelin is that I will be using it during the demo portion of the talk, and I wanted to give some background.