SlideShare uma empresa Scribd logo
1 de 46
The key to
machine
learning is
prepping the
right data
Jean Georges Perrin
@jgperrin
ATO, Raleigh, NC
October 23rd 2017
JGP • Jean Georges Perrin
๏ @jgperrin
๏ Chapel Hill, NC
๏ I 🏗 SW • Since 1983
๏ #Knowledge = 

𝑓 ( ∑ (#SmallData, #BigData), #DataScience)

& #Software 
๏ #IBMChampion x9 • #KeepLearning
๏ @ http://jgp.net
Who are thou?
๏ Experience with Spark?
๏ Who is familiar with Data Quality?
๏ Who has already implemented Data Quality in Spark?
๏ Who is expecting to be a ML guru after this session?
๏ Who is expecting to provide better insights faster after
this session?
Agenda
๏ What is ?
๏ What can I do with ?
๏ What is a app, anyway?
๏ Why is a cool environment for ML?
๏ Meet Cactar, the Mongolian warlord of Data Quality
๏ Why Data Quality matters?
๏ Your first ML app with
Title TextAnalytics Operating System
Apps
Analytics
Distrib.
An Analytics Operating System?
Hardware
OS
Apps
HardwareHardware
OS OS
Distributed OS
Analytics OS
Apps
HardwareHardware
OS OS
Apps
An Analytics Operating System?
HardwareHardware
OS OS
Distributed OS
Analytics OS
Apps
{
An Analytics Operating System?
HardwareHardware
OS OS
Distributed OS
Analytics OS
Apps
{
An Analytics Operating System?
HardwareHardware
OS OS
Distributed OS
Analytics OS
Apps
{
Use Cases
๏ NCEatery.com
๏Restaurant analytics
๏1.57×10^21 datapoints analyzed
๏ (they are hiring!)
๏General compute
๏Distributed data transfer
๏ IBM
๏DSX (Data Science Experiment)
๏Event Store - http://jgp.net/2017/06/22/spark-boosts-ibm-event-store/
๏ Z
๏Data wrangling solution
Spark
SQL
Spark
Streaming
MLlib
(machine
learning)
GraphX
(graph)
Apache Spark
Node 1 - OS Node 2 - OS Node 3 - OS Node 4 - OS
Node 1 -
Hardware
Node 2 -
Hardware
Node 3 -
Hardware
Node 4 -
Hardware
Unified API
Spark SQL Spark Streaming
Machine Learning
(& Deep Learning)
GraphX
Node 5 - OS
Node 5 -
Hardware
Your Application
…
…
Node 1 Node 2 Node 3 Node 4
Unified API
Spark SQL
Spark Streaming
Machine Learning
(& Deep Learning)
GraphX
Node 5
Your Application
…
DataFrame
Title Text Spark SQL
Spark Streaming
Machine Learning
(& Deep Learning)
GraphX
DataFrame
http://bit.ly/spark-clego
All over again
As goes the old adage:
Garbage In,
Garbage Out
xkcd
If Everything Was As Simple…
Dinner
revenue per
number of
guests
…as a Visual Representation
Anomaly #1
Anomaly #2
I Love It When a Plan Comes Together
Data is like a 

box of chocolates, 

you never know what
you're gonna get.
Jean ”Gump” Perrin
June 2017
Data From Everywhere
Databases
RDBMS, NoSQL
Files
CSV, XML, Json, Excel, Photos, Video
Machines
Services, REST, Streaming, IoT…
What Is Data Quality?
Skills (CACTAR):
๏ Consistency,
๏ Accuracy,
๏ Completeness,
๏ Timeliness,
๏ Accessibility,
๏ Reliability.
To allow:
๏ Operations,
๏ Decision making,
๏ Planning,
๏ Machine Learning.
Ouch, Bad Data Story Hubble was blind
because of a 2.2nm
error
Legend says it’s a
metric to imperial/US
conversion
Now it Hurts
๏ Challenger exploded in 1986 

because of a defective O-ring
๏ Root causes were:
๏ Invalid data for the 

O-ring parts
๏ Lack of data-lineage

& reporting



#1 Scripts and the Likes
๏ Source data are cleaned
by scripts (shell, Java
app…)
๏ I/O intensive
๏ Storage space intensive
๏ No parallelization
#2 Use Spark SQL
๏ All in memory!
๏ But limited to SQL and
built-in function
#3 Use UDFs
๏ User Defined Function
๏ SQL can be extended with
UDFs
๏ UDFs benefit from the
cluster architecture and
distributed processing
SPRU Your UDF
1. Service: build your code, it might be already existing!
2. Plumbing: connect your existing business logic to
Spark via an UDF
3. Register: the UDF in Spark
4. Use: the UDF is available in Spark SQL and via
callUDF()
Code Sample #1.2 - Plumbing
package net.jgp.labs.sparkdq4ml.dq.udf;


import org.apache.spark.sql.api.java.UDF1;
import net.jgp.labs.sparkdq4ml.dq.service.*;


public class MinimumPriceDataQualityUdf
implements UDF1< Double, Double > {
public Double call(Double price) throws Exception {
return MinimumPriceDataQualityService.checkMinimumPrice(price);
}
}
/jgperrin/net.jgp.labs.sparkdq4ml
Code Sample #1.3 - Register
SparkSession spark = SparkSession
.builder().appName("DQ4ML").master("local").getOrCreate();
spark.udf().register(
"minimumPriceRule",
new MinimumPriceDataQualityUdf(),
DataTypes.DoubleType);
/jgperrin/net.jgp.labs.sparkdq4ml
Code Sample #1.4 - Use
String filename = "data/dataset.csv";
Dataset<Row> df = spark.read().format("csv")
.option("inferSchema", "true").option("header", "false")
.load(filename);
df = df.withColumn("guest", df.col("_c0")).drop("_c0");
df = df.withColumn("price", df.col("_c1")).drop("_c1");
df = df.withColumn(
"price_no_min",
callUDF("minimumPriceRule", df.col("price")));
df.createOrReplaceTempView("price");
df = spark.sql("SELECT guest, price_no_min AS price FROM price WHERE
price_no_min > 0");
Using CSV,
but could be
Hive, JDBC,
name it…
/jgperrin/net.jgp.labs.sparkdq4ml
+-----+-----+
|guest|price|
+-----+-----+
|   1|23.24|
|    2|30.89|
|    2|33.74|
|    3|34.89|
|    3|29.91|
|    3| 38.0|
|    4| 40.0|
|    5|120.0|
|    6| 50.0|
|    6|112.0|
|    8| 60.0|
|    8|127.0|
|    8|120.0|
|    9|130.0|
+-----+-----+
Code Sample #1.4 - Use
String filename = "data/dataset.csv";
Dataset<Row> df = spark.read().format("csv")
.option("inferSchema", "true").option("header", "false")
.load(filename);
df = df.withColumn("guest", df.col("_c0")).drop("_c0");
df = df.withColumn("price", df.col("_c1")).drop("_c1");
df = df.withColumn(
"price_no_min",
callUDF("minimumPriceRule", df.col("price")));
df.createOrReplaceTempView("price");
df = spark.sql("SELECT guest, price_no_min AS price FROM price WHERE
price_no_min > 0");
/jgperrin/net.jgp.labs.sparkdq4ml
+-----+-----+------------+
|guest|price|price_no_min|
+-----+-----+------------+
|    1| 23.1|        23.1|
|    2| 30.0|        30.0|
|    2| 33.0|        33.0|
|    3| 34.0|        34.0|
|   24|142.0|       142.0|
|   24|138.0|       138.0|
|   25|  3.0|        -1.0|
|   26| 10.0|        -1.0|
|   25| 15.0|        -1.0|
|   26|  4.0|        -1.0|
|   28| 10.0|        -1.0|
|   28|158.0|       158.0|
|   30|170.0|       170.0|
|   31|180.0|       180.0|
+-----+-----+------------+
Code Sample #1.4 - Use
String filename = "data/dataset.csv";
Dataset<Row> df = spark.read().format("csv")
.option("inferSchema", "true").option("header", "false")
.load(filename);
df = df.withColumn("guest", df.col("_c0")).drop("_c0");
df = df.withColumn("price", df.col("_c1")).drop("_c1");
df = df.withColumn(
"price_no_min",
callUDF("minimumPriceRule", df.col("price")));
df.createOrReplaceTempView("price");
df = spark.sql("SELECT guest, price_no_min AS price FROM price WHERE
price_no_min > 0");
/jgperrin/net.jgp.labs.sparkdq4ml
+-----+-----+
|guest|price|
+-----+-----+
|    1| 23.1|
|    2| 30.0|
|    2| 33.0|
|    3| 34.0|
|    3| 30.0|
|    4| 40.0|
|   19|110.0|
|   20|120.0|
|   22|131.0|
|   24|142.0|
|   24|138.0|
|   28|158.0|
|   30|170.0|
|   31|180.0|
+-----+-----+
Store First, Ask Questions Later
๏ Build an in-memory data lake 🤔
๏ Store the data in
๏ Trust its data type detection capabilities
๏ Build Business Rules with the Business People
๏ Your data is now trusted and can be used for Machine
Learning (and other activities)
Data Can Now Be Used for ML
๏ Convert/Adapt dataset to Features and Label
๏ Required for Linear Regression in MLlib
๏Needs a column called label of type double
๏Needs a column called features of type VectorUDT
Code Sample #2 - Register & Use
spark.udf().register(
"vectorBuilder",
new VectorBuilder(),
new VectorUDT());
df = df.withColumn("label", df.col("price"));
df = df.withColumn("features", callUDF("vectorBuilder", df.col("guest")));


// ... Lots of complex ML code goes here ...
double p = model.predict(features);
System.out.println("Prediction for " + feature + " guests is " + p);
/jgperrin/net.jgp.labs.sparkdq4ml
+-----+-----+-----+--------+------------------+
|guest|price|label|features|        prediction|
+-----+-----+-----+--------+------------------+
|    1| 23.1| 23.1|   [1.0]|24.563807596513133|
|    2| 30.0| 30.0|   [2.0]|29.595283312577884|
|    2| 33.0| 33.0|   [2.0]|29.595283312577884|
|    3| 34.0| 34.0|   [3.0]| 34.62675902864264|
|    3| 30.0| 30.0|   [3.0]| 34.62675902864264|
|    3| 38.0| 38.0|   [3.0]| 34.62675902864264|
|    4| 40.0| 40.0|   [4.0]| 39.65823474470739|
|   14| 89.0| 89.0|  [14.0]| 89.97299190535493|
|   16|102.0|102.0|  [16.0]|100.03594333748444|
|   20|120.0|120.0|  [20.0]|120.16184620174346|
|   22|131.0|131.0|  [22.0]|130.22479763387295|
|   24|142.0|142.0|  [24.0]|140.28774906600245|
+-----+-----+-----+--------+------------------+
Prediction for 40.0 guests is 220.79136052303852
Conclusion
Key Takeaways
๏ Build your data lake in memory (and disk).
๏ Store first, ask questions later.
๏ Sample your data (to spot anomalies and more).
๏ Rely on Spark SQL and UDFs to build a consistent &
coherent dataset.
๏ Rely on UDFs for prepping ML formats.
๏ Use Java.
Going Further
๏ Contact me @jgperrin
๏ Join the Spark User mailing list
๏ Get help from Stack Overflow
๏ fb.com/TriangleSpark
๏ Meetup on 12/12 in RTP in the exec briefing !
๏ Watch for my book on Spark with Java to come!
Thanks
@jgperrin
Backup
No more slides
You’re on your own!

Mais conteúdo relacionado

Mais de Jean-Georges Perrin

Spark Summit 2017 - A feedback for TASM
Spark Summit 2017 - A feedback for TASMSpark Summit 2017 - A feedback for TASM
Spark Summit 2017 - A feedback for TASMJean-Georges Perrin
 
HTML (or how the web got started)
HTML (or how the web got started)HTML (or how the web got started)
HTML (or how the web got started)Jean-Georges Perrin
 
2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...
2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...
2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...Jean-Georges Perrin
 
Vision stratégique de l'utilisation de l'(Open)Data dans l'entreprise
Vision stratégique de l'utilisation de l'(Open)Data dans l'entrepriseVision stratégique de l'utilisation de l'(Open)Data dans l'entreprise
Vision stratégique de l'utilisation de l'(Open)Data dans l'entrepriseJean-Georges Perrin
 
Informix is not for legacy applications
Informix is not for legacy applicationsInformix is not for legacy applications
Informix is not for legacy applicationsJean-Georges Perrin
 
GreenIvory : products and services
GreenIvory : products and servicesGreenIvory : products and services
GreenIvory : products and servicesJean-Georges Perrin
 
GreenIvory : produits & services
GreenIvory : produits & servicesGreenIvory : produits & services
GreenIvory : produits & servicesJean-Georges Perrin
 
A la découverte des nouvelles tendances du web (Mulhouse Edition)
A la découverte des nouvelles tendances du web (Mulhouse Edition)A la découverte des nouvelles tendances du web (Mulhouse Edition)
A la découverte des nouvelles tendances du web (Mulhouse Edition)Jean-Georges Perrin
 
MashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvory
MashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvoryMashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvory
MashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvoryJean-Georges Perrin
 
MashupXFeed et le référencement - Workshop Activis - Greenivory
MashupXFeed et le référencement - Workshop Activis - GreenivoryMashupXFeed et le référencement - Workshop Activis - Greenivory
MashupXFeed et le référencement - Workshop Activis - GreenivoryJean-Georges Perrin
 
Présentation e-réputation lors des Nord IT Days
Présentation e-réputation lors des Nord IT DaysPrésentation e-réputation lors des Nord IT Days
Présentation e-réputation lors des Nord IT DaysJean-Georges Perrin
 
Tendances Web 2011 San Francicsco
Tendances Web 2011 San FrancicscoTendances Web 2011 San Francicsco
Tendances Web 2011 San FrancicscoJean-Georges Perrin
 
Le contenu est un réel levier de croissance. On en a la preuve !
Le contenu est un réel levier de croissance. On en a la preuve !Le contenu est un réel levier de croissance. On en a la preuve !
Le contenu est un réel levier de croissance. On en a la preuve !Jean-Georges Perrin
 
Retour de la conférence O'Reilly Web 2.0 2009
Retour de la conférence O'Reilly Web 2.0 2009Retour de la conférence O'Reilly Web 2.0 2009
Retour de la conférence O'Reilly Web 2.0 2009Jean-Georges Perrin
 
IT en Alsace : un secteur économique insoupçonné
IT en Alsace : un secteur économique insoupçonnéIT en Alsace : un secteur économique insoupçonné
IT en Alsace : un secteur économique insoupçonnéJean-Georges Perrin
 

Mais de Jean-Georges Perrin (20)

Spark Summit 2017 - A feedback for TASM
Spark Summit 2017 - A feedback for TASMSpark Summit 2017 - A feedback for TASM
Spark Summit 2017 - A feedback for TASM
 
HTML (or how the web got started)
HTML (or how the web got started)HTML (or how the web got started)
HTML (or how the web got started)
 
2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...
2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...
2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...
 
Vision stratégique de l'utilisation de l'(Open)Data dans l'entreprise
Vision stratégique de l'utilisation de l'(Open)Data dans l'entrepriseVision stratégique de l'utilisation de l'(Open)Data dans l'entreprise
Vision stratégique de l'utilisation de l'(Open)Data dans l'entreprise
 
Informix is not for legacy applications
Informix is not for legacy applicationsInformix is not for legacy applications
Informix is not for legacy applications
 
Vendre des produits techniques
Vendre des produits techniquesVendre des produits techniques
Vendre des produits techniques
 
Vendre plus sur le web
Vendre plus sur le webVendre plus sur le web
Vendre plus sur le web
 
Vendre plus sur le Web
Vendre plus sur le WebVendre plus sur le Web
Vendre plus sur le Web
 
GreenIvory : products and services
GreenIvory : products and servicesGreenIvory : products and services
GreenIvory : products and services
 
GreenIvory : produits & services
GreenIvory : produits & servicesGreenIvory : produits & services
GreenIvory : produits & services
 
A la découverte des nouvelles tendances du web (Mulhouse Edition)
A la découverte des nouvelles tendances du web (Mulhouse Edition)A la découverte des nouvelles tendances du web (Mulhouse Edition)
A la découverte des nouvelles tendances du web (Mulhouse Edition)
 
MashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvory
MashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvoryMashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvory
MashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvory
 
MashupXFeed et le référencement - Workshop Activis - Greenivory
MashupXFeed et le référencement - Workshop Activis - GreenivoryMashupXFeed et le référencement - Workshop Activis - Greenivory
MashupXFeed et le référencement - Workshop Activis - Greenivory
 
Présentation e-réputation lors des Nord IT Days
Présentation e-réputation lors des Nord IT DaysPrésentation e-réputation lors des Nord IT Days
Présentation e-réputation lors des Nord IT Days
 
Tendances Web 2011 San Francicsco
Tendances Web 2011 San FrancicscoTendances Web 2011 San Francicsco
Tendances Web 2011 San Francicsco
 
Le contenu est un réel levier de croissance. On en a la preuve !
Le contenu est un réel levier de croissance. On en a la preuve !Le contenu est un réel levier de croissance. On en a la preuve !
Le contenu est un réel levier de croissance. On en a la preuve !
 
Retour de la conférence O'Reilly Web 2.0 2009
Retour de la conférence O'Reilly Web 2.0 2009Retour de la conférence O'Reilly Web 2.0 2009
Retour de la conférence O'Reilly Web 2.0 2009
 
Comprendre Web 2.0
Comprendre Web 2.0Comprendre Web 2.0
Comprendre Web 2.0
 
Web Intelligent
Web IntelligentWeb Intelligent
Web Intelligent
 
IT en Alsace : un secteur économique insoupçonné
IT en Alsace : un secteur économique insoupçonnéIT en Alsace : un secteur économique insoupçonné
IT en Alsace : un secteur économique insoupçonné
 

Último

A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...Health
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceanilsa9823
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 

Último (20)

A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 

The Key to Machine Learning is Prepping the Right Data - All Things Open 2017 (Rev. 002)

  • 1. The key to machine learning is prepping the right data Jean Georges Perrin @jgperrin ATO, Raleigh, NC October 23rd 2017
  • 2. JGP • Jean Georges Perrin ๏ @jgperrin ๏ Chapel Hill, NC ๏ I 🏗 SW • Since 1983 ๏ #Knowledge = 
 𝑓 ( ∑ (#SmallData, #BigData), #DataScience)
 & #Software  ๏ #IBMChampion x9 • #KeepLearning ๏ @ http://jgp.net
  • 3. Who are thou? ๏ Experience with Spark? ๏ Who is familiar with Data Quality? ๏ Who has already implemented Data Quality in Spark? ๏ Who is expecting to be a ML guru after this session? ๏ Who is expecting to provide better insights faster after this session?
  • 4. Agenda ๏ What is ? ๏ What can I do with ? ๏ What is a app, anyway? ๏ Why is a cool environment for ML? ๏ Meet Cactar, the Mongolian warlord of Data Quality ๏ Why Data Quality matters? ๏ Your first ML app with
  • 6. Apps Analytics Distrib. An Analytics Operating System? Hardware OS Apps HardwareHardware OS OS Distributed OS Analytics OS Apps HardwareHardware OS OS Apps
  • 7. An Analytics Operating System? HardwareHardware OS OS Distributed OS Analytics OS Apps {
  • 8. An Analytics Operating System? HardwareHardware OS OS Distributed OS Analytics OS Apps {
  • 9. An Analytics Operating System? HardwareHardware OS OS Distributed OS Analytics OS Apps {
  • 10. Use Cases ๏ NCEatery.com ๏Restaurant analytics ๏1.57×10^21 datapoints analyzed ๏ (they are hiring!) ๏General compute ๏Distributed data transfer ๏ IBM ๏DSX (Data Science Experiment) ๏Event Store - http://jgp.net/2017/06/22/spark-boosts-ibm-event-store/ ๏ Z ๏Data wrangling solution
  • 12. Node 1 - OS Node 2 - OS Node 3 - OS Node 4 - OS Node 1 - Hardware Node 2 - Hardware Node 3 - Hardware Node 4 - Hardware Unified API Spark SQL Spark Streaming Machine Learning (& Deep Learning) GraphX Node 5 - OS Node 5 - Hardware Your Application … …
  • 13. Node 1 Node 2 Node 3 Node 4 Unified API Spark SQL Spark Streaming Machine Learning (& Deep Learning) GraphX Node 5 Your Application … DataFrame
  • 14. Title Text Spark SQL Spark Streaming Machine Learning (& Deep Learning) GraphX DataFrame
  • 16. All over again As goes the old adage: Garbage In, Garbage Out xkcd
  • 17. If Everything Was As Simple… Dinner revenue per number of guests
  • 18. …as a Visual Representation Anomaly #1 Anomaly #2
  • 19. I Love It When a Plan Comes Together
  • 20. Data is like a 
 box of chocolates, 
 you never know what you're gonna get. Jean ”Gump” Perrin June 2017
  • 21. Data From Everywhere Databases RDBMS, NoSQL Files CSV, XML, Json, Excel, Photos, Video Machines Services, REST, Streaming, IoT…
  • 22. What Is Data Quality? Skills (CACTAR): ๏ Consistency, ๏ Accuracy, ๏ Completeness, ๏ Timeliness, ๏ Accessibility, ๏ Reliability. To allow: ๏ Operations, ๏ Decision making, ๏ Planning, ๏ Machine Learning.
  • 23. Ouch, Bad Data Story Hubble was blind because of a 2.2nm error Legend says it’s a metric to imperial/US conversion
  • 24. Now it Hurts ๏ Challenger exploded in 1986 
 because of a defective O-ring ๏ Root causes were: ๏ Invalid data for the 
 O-ring parts ๏ Lack of data-lineage
 & reporting
 

  • 25. #1 Scripts and the Likes ๏ Source data are cleaned by scripts (shell, Java app…) ๏ I/O intensive ๏ Storage space intensive ๏ No parallelization
  • 26. #2 Use Spark SQL ๏ All in memory! ๏ But limited to SQL and built-in function
  • 27. #3 Use UDFs ๏ User Defined Function ๏ SQL can be extended with UDFs ๏ UDFs benefit from the cluster architecture and distributed processing
  • 28. SPRU Your UDF 1. Service: build your code, it might be already existing! 2. Plumbing: connect your existing business logic to Spark via an UDF 3. Register: the UDF in Spark 4. Use: the UDF is available in Spark SQL and via callUDF()
  • 29. Code Sample #1.2 - Plumbing package net.jgp.labs.sparkdq4ml.dq.udf; 
 import org.apache.spark.sql.api.java.UDF1; import net.jgp.labs.sparkdq4ml.dq.service.*; 
 public class MinimumPriceDataQualityUdf implements UDF1< Double, Double > { public Double call(Double price) throws Exception { return MinimumPriceDataQualityService.checkMinimumPrice(price); } } /jgperrin/net.jgp.labs.sparkdq4ml
  • 30. Code Sample #1.3 - Register SparkSession spark = SparkSession .builder().appName("DQ4ML").master("local").getOrCreate(); spark.udf().register( "minimumPriceRule", new MinimumPriceDataQualityUdf(), DataTypes.DoubleType); /jgperrin/net.jgp.labs.sparkdq4ml
  • 31. Code Sample #1.4 - Use String filename = "data/dataset.csv"; Dataset<Row> df = spark.read().format("csv") .option("inferSchema", "true").option("header", "false") .load(filename); df = df.withColumn("guest", df.col("_c0")).drop("_c0"); df = df.withColumn("price", df.col("_c1")).drop("_c1"); df = df.withColumn( "price_no_min", callUDF("minimumPriceRule", df.col("price"))); df.createOrReplaceTempView("price"); df = spark.sql("SELECT guest, price_no_min AS price FROM price WHERE price_no_min > 0"); Using CSV, but could be Hive, JDBC, name it… /jgperrin/net.jgp.labs.sparkdq4ml
  • 32. +-----+-----+ |guest|price| +-----+-----+ |   1|23.24| |    2|30.89| |    2|33.74| |    3|34.89| |    3|29.91| |    3| 38.0| |    4| 40.0| |    5|120.0| |    6| 50.0| |    6|112.0| |    8| 60.0| |    8|127.0| |    8|120.0| |    9|130.0| +-----+-----+
  • 33. Code Sample #1.4 - Use String filename = "data/dataset.csv"; Dataset<Row> df = spark.read().format("csv") .option("inferSchema", "true").option("header", "false") .load(filename); df = df.withColumn("guest", df.col("_c0")).drop("_c0"); df = df.withColumn("price", df.col("_c1")).drop("_c1"); df = df.withColumn( "price_no_min", callUDF("minimumPriceRule", df.col("price"))); df.createOrReplaceTempView("price"); df = spark.sql("SELECT guest, price_no_min AS price FROM price WHERE price_no_min > 0"); /jgperrin/net.jgp.labs.sparkdq4ml
  • 34. +-----+-----+------------+ |guest|price|price_no_min| +-----+-----+------------+ |    1| 23.1|        23.1| |    2| 30.0|        30.0| |    2| 33.0|        33.0| |    3| 34.0|        34.0| |   24|142.0|       142.0| |   24|138.0|       138.0| |   25|  3.0|        -1.0| |   26| 10.0|        -1.0| |   25| 15.0|        -1.0| |   26|  4.0|        -1.0| |   28| 10.0|        -1.0| |   28|158.0|       158.0| |   30|170.0|       170.0| |   31|180.0|       180.0| +-----+-----+------------+
  • 35. Code Sample #1.4 - Use String filename = "data/dataset.csv"; Dataset<Row> df = spark.read().format("csv") .option("inferSchema", "true").option("header", "false") .load(filename); df = df.withColumn("guest", df.col("_c0")).drop("_c0"); df = df.withColumn("price", df.col("_c1")).drop("_c1"); df = df.withColumn( "price_no_min", callUDF("minimumPriceRule", df.col("price"))); df.createOrReplaceTempView("price"); df = spark.sql("SELECT guest, price_no_min AS price FROM price WHERE price_no_min > 0"); /jgperrin/net.jgp.labs.sparkdq4ml
  • 36. +-----+-----+ |guest|price| +-----+-----+ |    1| 23.1| |    2| 30.0| |    2| 33.0| |    3| 34.0| |    3| 30.0| |    4| 40.0| |   19|110.0| |   20|120.0| |   22|131.0| |   24|142.0| |   24|138.0| |   28|158.0| |   30|170.0| |   31|180.0| +-----+-----+
  • 37. Store First, Ask Questions Later ๏ Build an in-memory data lake 🤔 ๏ Store the data in ๏ Trust its data type detection capabilities ๏ Build Business Rules with the Business People ๏ Your data is now trusted and can be used for Machine Learning (and other activities)
  • 38. Data Can Now Be Used for ML ๏ Convert/Adapt dataset to Features and Label ๏ Required for Linear Regression in MLlib ๏Needs a column called label of type double ๏Needs a column called features of type VectorUDT
  • 39. Code Sample #2 - Register & Use spark.udf().register( "vectorBuilder", new VectorBuilder(), new VectorUDT()); df = df.withColumn("label", df.col("price")); df = df.withColumn("features", callUDF("vectorBuilder", df.col("guest"))); 
 // ... Lots of complex ML code goes here ... double p = model.predict(features); System.out.println("Prediction for " + feature + " guests is " + p); /jgperrin/net.jgp.labs.sparkdq4ml
  • 40. +-----+-----+-----+--------+------------------+ |guest|price|label|features|        prediction| +-----+-----+-----+--------+------------------+ |    1| 23.1| 23.1|   [1.0]|24.563807596513133| |    2| 30.0| 30.0|   [2.0]|29.595283312577884| |    2| 33.0| 33.0|   [2.0]|29.595283312577884| |    3| 34.0| 34.0|   [3.0]| 34.62675902864264| |    3| 30.0| 30.0|   [3.0]| 34.62675902864264| |    3| 38.0| 38.0|   [3.0]| 34.62675902864264| |    4| 40.0| 40.0|   [4.0]| 39.65823474470739| |   14| 89.0| 89.0|  [14.0]| 89.97299190535493| |   16|102.0|102.0|  [16.0]|100.03594333748444| |   20|120.0|120.0|  [20.0]|120.16184620174346| |   22|131.0|131.0|  [22.0]|130.22479763387295| |   24|142.0|142.0|  [24.0]|140.28774906600245| +-----+-----+-----+--------+------------------+ Prediction for 40.0 guests is 220.79136052303852
  • 42. Key Takeaways ๏ Build your data lake in memory (and disk). ๏ Store first, ask questions later. ๏ Sample your data (to spot anomalies and more). ๏ Rely on Spark SQL and UDFs to build a consistent & coherent dataset. ๏ Rely on UDFs for prepping ML formats. ๏ Use Java.
  • 43. Going Further ๏ Contact me @jgperrin ๏ Join the Spark User mailing list ๏ Get help from Stack Overflow ๏ fb.com/TriangleSpark ๏ Meetup on 12/12 in RTP in the exec briefing ! ๏ Watch for my book on Spark with Java to come!
  • 46. No more slides You’re on your own!