SlideShare a Scribd company logo
1 of 58
Download to read offline
Slides @
www.jakequist.com/go/dataengconf
http://www.umiacs.umd.edu/~getoor/Tutorials/ER_VLDB2012.pdf
Entity Resolution
Talk Structure
Layer 1: Naive ER
Layer 2: Graphical ER
Layer 3: Big Data ER
Layer 4: Temporal ER
Layer 5: Learned ER
Naive ER
Entity Resolution
ID Name Website Geo
A Facebook facebook.com Menlo	Park,	CA
B FB facebook.com CA
C Joe's	Cookies joescookies.com San	Francisco,	CA
Suppose we have the following data:
Entity Resolution
Suppose we have the following data:
ID Name Website Geo
A Facebook facebook.com Menlo	Park,	CA
B FB facebook.com CA
C Joe's	Cookies joescookies.com San	Francisco,	CA
D Joes	Cookies facebook.com San	Francisco,	CA
Entity Resolution
Suppose we have the following data:
ID Name Website Geo
A Facebook facebook.com Menlo	Park,	CA
B FB facebook.com CA
C Joe's	Cookies joescookies.com San	Francisco,	CA
D Joes	Cookies facebook.com San	Francisco,	CA
E Joes	Cookies NULL New	York,	NY
Fundamental Concept
Match entities on the similarity of
their properties
Example: Company
Similarity
Example: Company
Similarity
Problems
• What about when match arity != 2
• Entities can’t duplicate across matches
• O(N^2) isn’t great either
Graphical ER
Think Like a Graph
A B
EC
D
ID Name Website Geo
A Facebook facebook.com
Menlo	Park,	
CA
B FB facebook.com CA
C Joe's	Cookies joescookies.com
San	Francisco,	
CA
D Joes	Cookies facebook.com
San	Francisco,	
CA
E Joes	Cookies NULL New	York,	NY
Think Like a Graph
A B
EC
D
150
50
-100 -100
50 50
50 50
-150-150
Think Like a Graph
A B
EC
D
150
50
-100 -100
50 50
50 50
-150-150
Key Concept: Cliques
Think Like a Clique
A B
EC
D
150
50
-100 -100
50 50
50 50
-150-150
{A}
{B}
{C}
{D}
{E}
{E, A}
{E, B}
{E, C}
{E, D}
{A, B}
{A, C}
{A, D}
{B, C}
{B, D}
{C, D}
{E, A, B}
{E, A, C}
{E, A, D}
{E, B, C}
{E, B, D}
{E, C, D}
{A, B, C}
{A, B, D}
{A, C, D}
{B, C, D}
{E, A, B, C}
{E, A, B, D}
{E, A, C, D}
{E, B, C, D}
{A, B, C, D}
{E, A, B, C, D}
possible cliques =>
Recurring Theme:
Powerset
Scoring Cliques
from above
Overlapping Cliques
A B
EC
D
A B
EC
D
A = 0.75 B = 0.55
Overlapping Cliques
An entity can’t belong to more
than one clique.
When we choose a clique, we
must ensure no other cliques
use any of those entities
Clique Choosing
Clique Choosing
Recap
• Given a dataset of entities…
• Take the powerset of those entities => every
possible clique
• Score all the cliques
• In sorted order, choose the best cliques when no
elements have been touched
ER on Bigger Data
• Get potential matches on the same machine
• Avoid using powerset(n) for large n
Challenges
Locality-Sensitive Hashing
(LSH)
Basic Idea: Use Map Reduce to get likely matches onto the
same machines
“Johnathon”
“Sequoia Capital, LLC”
[37.773972, -122.431297]
“John”
“Sequoia”
[37.73, -122.43]
“app.example.com” “example.com”
Locality-Sensitive Hashing
Locality-Sensitive Hashing
Problems
• What if our entities have missing properties?
Locality-Sensitive Hashing
Joe’s CookiesJoe’s Cookie’s
joescookies.com joescookies.com
A B C
“Joe Cookie” “Joe Cookie” “”
LSH on “name”
Multilevel LSH
• Basic Idea: Use LSH multiple times on converging
cliques
Joe’s CookiesJoe’s Cookie’s
joescookies.com joescookies.com
A B C
“Joe Cookie” “Joe Cookie” “”
LSN on “name”
Joe’s Cookie’s
joescookies.com joescookies.com
Clique #3
Clique #2
“joescookies.com” “joescookies.com”
LSN on “website”
Clique #1
Clique Choosing
• We now have all potential cliques, spread across
the cluster
• We now need to choose the best cliques?
• Remember: But choosing one clique invalidates
others
• Fundamentally a Serial Algorithm!
Clique Choosing
RDD[T].toLocalIterator() : Iterator[T]
• Produces an iterator on the Driver that seamlessly
iterates every partition
Clique Choosing
Clique Choosing
uh oh
Challenge
• We need to keep track of which entities we’ve
“touched”
• But using a HashSet means we will start eating a lot
memory
Primer: Bloom Filters
BloomFilter {
def mightContain(T obj)
def put(T obj)
}
example: 1 MB @ 0.5% error => 130 KB
Clique Choosing w/ Bloom
Filters
Clique Choosing w/ Bloom
Filters
Recap
• Challenge: Get data to the right machine.
Solution: Use Locality-Sensitive-Hashing
• Challenge: Choose the best cliques.
Solution: Use serial iterator and bloom-filters to
keep memory low
Temporal ER
Temporal Entity
Resolution
T1 T2
Ms Sally Smith Mrs Sally Doe
thefacebook.com facebook.com
Zen Payroll Gusto
Temporal Entity
Resolution
A B
Zen Payroll
zenpayroll.com
Gusto
gusto.com
-1000
Temporal Entity
Resolution
A B
Zen Payroll
zenpayroll.com
+100
C
Zen Payroll <=> Gusto
zenpayroll.com <=> gusto.com
Gusto
gusto.com
+100
-1000
Iterative Poison Pills
• Basic Idea: Use ER techniques we’ve already
established
• Introduce “poison pills” that can break up cliques if
temporal properties don’t match
• Iteratively use the poison pills to match on
increasingly temporally-aware entities
gusto.com
(Payroll)
2016
Perform Regular ER
gusto.com
(Travel)
2010
gusto.com
< 2015
gusto.com
zenpayroll.com
> 2015
zenpayroll.com
(Payroll)
2014
A B C D E
A, C, D, E B, E
Kick Out Entities That
Don’t Match Temporal
Requirements
A, D
gusto.com < 2015
B, E
gusto.com > 2015
zenpayroll < 2014
C, E
gusto,2016
Perform Regular ER
(now with more temporal
fields available)
A, C, D B, C, E
Temporal Poison Pills
Temporal Entity
Resolution
• Very Computational Expensive
• Requires Significant Tuning & Tweaking to Keep
Tractable
• Considered one of the Holy Grails of ER
Learned ER
Recap
• Gorilla in the room: All of our scoring has been
manual
Supervised Learning ER
• Basic Idea: Use a training set to learn the weights
in our scoring functions
• Disclaimer: Only proceed with this if you have very
complex scoring properties
Supervised Learning ER
Supervised Learning ER
More Learning Opts
• Gradient Descent: What if we viewed the system
as having overall “error”? We can then use
Gradient Descent to find optimal solution.
• Very very computationally intense
Questions?
Thanks!
jakequist@gmail.com
DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

More Related Content

Viewers also liked

DataEngConf SF16 - Beginning with Ourselves
DataEngConf SF16 - Beginning with OurselvesDataEngConf SF16 - Beginning with Ourselves
DataEngConf SF16 - Beginning with OurselvesHakka Labs
 
DataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
DataEngConf SF16 - Routing Billions of Analytics Events with High DeliverabilityDataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
DataEngConf SF16 - Routing Billions of Analytics Events with High DeliverabilityHakka Labs
 
DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale Hakka Labs
 
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...Hakka Labs
 
DataEngConf SF16 - Recommendations at Instacart
DataEngConf SF16 - Recommendations at InstacartDataEngConf SF16 - Recommendations at Instacart
DataEngConf SF16 - Recommendations at InstacartHakka Labs
 
Always Valid Inference (Ramesh Johari, Stanford)
Always Valid Inference (Ramesh Johari, Stanford)Always Valid Inference (Ramesh Johari, Stanford)
Always Valid Inference (Ramesh Johari, Stanford)Hakka Labs
 
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQDataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQHakka Labs
 
DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at PinterestDataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at PinterestHakka Labs
 
DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...
DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...
DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...Hakka Labs
 
Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)
Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)
Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)Spark Summit
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataHakka Labs
 
DataEngConf SF16 - Multi-temporal Data Structures
DataEngConf SF16 - Multi-temporal Data StructuresDataEngConf SF16 - Multi-temporal Data Structures
DataEngConf SF16 - Multi-temporal Data StructuresHakka Labs
 
DataEngConf: Apache Spark in Financial Modeling at BlackRock
DataEngConf: Apache Spark in Financial Modeling at BlackRock DataEngConf: Apache Spark in Financial Modeling at BlackRock
DataEngConf: Apache Spark in Financial Modeling at BlackRock Hakka Labs
 
Knowledge Collaboration: Working with Data and Web Specialists
Knowledge Collaboration: Working with Data and Web SpecialistsKnowledge Collaboration: Working with Data and Web Specialists
Knowledge Collaboration: Working with Data and Web SpecialistsOlivier Serrat
 
Large scale social recommender systems at LinkedIn
Large scale social recommender systems at LinkedInLarge scale social recommender systems at LinkedIn
Large scale social recommender systems at LinkedInMitul Tiwari
 
Machine Learning Pipelines - Joseph Bradley - Databricks
Machine Learning Pipelines - Joseph Bradley - DatabricksMachine Learning Pipelines - Joseph Bradley - Databricks
Machine Learning Pipelines - Joseph Bradley - DatabricksSpark Summit
 
AI and Big Data For National Intelligence
AI and Big Data For National IntelligenceAI and Big Data For National Intelligence
AI and Big Data For National IntelligenceSonal Goyal
 

Viewers also liked (19)

DataEngConf SF16 - Beginning with Ourselves
DataEngConf SF16 - Beginning with OurselvesDataEngConf SF16 - Beginning with Ourselves
DataEngConf SF16 - Beginning with Ourselves
 
DataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
DataEngConf SF16 - Routing Billions of Analytics Events with High DeliverabilityDataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
DataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
 
DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale
 
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
 
DataEngConf SF16 - Recommendations at Instacart
DataEngConf SF16 - Recommendations at InstacartDataEngConf SF16 - Recommendations at Instacart
DataEngConf SF16 - Recommendations at Instacart
 
Always Valid Inference (Ramesh Johari, Stanford)
Always Valid Inference (Ramesh Johari, Stanford)Always Valid Inference (Ramesh Johari, Stanford)
Always Valid Inference (Ramesh Johari, Stanford)
 
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQDataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
 
DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at PinterestDataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
 
Lect21 09-11
Lect21 09-11Lect21 09-11
Lect21 09-11
 
DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...
DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...
DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...
 
Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)
Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)
Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
 
DataEngConf SF16 - Multi-temporal Data Structures
DataEngConf SF16 - Multi-temporal Data StructuresDataEngConf SF16 - Multi-temporal Data Structures
DataEngConf SF16 - Multi-temporal Data Structures
 
DataEngConf: Apache Spark in Financial Modeling at BlackRock
DataEngConf: Apache Spark in Financial Modeling at BlackRock DataEngConf: Apache Spark in Financial Modeling at BlackRock
DataEngConf: Apache Spark in Financial Modeling at BlackRock
 
Knowledge Collaboration: Working with Data and Web Specialists
Knowledge Collaboration: Working with Data and Web SpecialistsKnowledge Collaboration: Working with Data and Web Specialists
Knowledge Collaboration: Working with Data and Web Specialists
 
Large scale social recommender systems at LinkedIn
Large scale social recommender systems at LinkedInLarge scale social recommender systems at LinkedIn
Large scale social recommender systems at LinkedIn
 
Machine Learning Pipelines - Joseph Bradley - Databricks
Machine Learning Pipelines - Joseph Bradley - DatabricksMachine Learning Pipelines - Joseph Bradley - Databricks
Machine Learning Pipelines - Joseph Bradley - Databricks
 
High-Scale Entity Resolution in Hadoop
High-Scale Entity Resolution in HadoopHigh-Scale Entity Resolution in Hadoop
High-Scale Entity Resolution in Hadoop
 
AI and Big Data For National Intelligence
AI and Big Data For National IntelligenceAI and Big Data For National Intelligence
AI and Big Data For National Intelligence
 

Similar to DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

[243] turning data into value
[243] turning data into value[243] turning data into value
[243] turning data into valueNAVER D2
 
Lecture 6: Watson and the Social Web (2014), Chris Welty
Lecture 6: Watson and the Social Web (2014), Chris WeltyLecture 6: Watson and the Social Web (2014), Chris Welty
Lecture 6: Watson and the Social Web (2014), Chris WeltyLora Aroyo
 
Measuring Relevance in the Negative Space
Measuring Relevance in the Negative SpaceMeasuring Relevance in the Negative Space
Measuring Relevance in the Negative SpaceTrey Grainger
 
How can algorithms be biased?
How can algorithms be biased?How can algorithms be biased?
How can algorithms be biased?Software Guru
 
Machine Learning ICS 273A
Machine Learning ICS 273AMachine Learning ICS 273A
Machine Learning ICS 273Abutest
 
Machine Learning ICS 273A
Machine Learning ICS 273AMachine Learning ICS 273A
Machine Learning ICS 273Abutest
 
Luna Dong, Principal Scientist, Amazon at MLconf Seattle 2017
Luna Dong, Principal Scientist, Amazon at MLconf Seattle 2017Luna Dong, Principal Scientist, Amazon at MLconf Seattle 2017
Luna Dong, Principal Scientist, Amazon at MLconf Seattle 2017MLconf
 
Tokens, Complex Systems, and Nature
Tokens, Complex Systems, and NatureTokens, Complex Systems, and Nature
Tokens, Complex Systems, and NatureTrent McConaghy
 
Sippin: A Mobile Application Case Study presented at Techfest Louisville
Sippin: A Mobile Application Case Study presented at Techfest LouisvilleSippin: A Mobile Application Case Study presented at Techfest Louisville
Sippin: A Mobile Application Case Study presented at Techfest LouisvilleDawn Yankeelov
 
Stop Worrying about Prodweb001 and Start Loving i-98fb9856 (ARC201) | AWS re:...
Stop Worrying about Prodweb001 and Start Loving i-98fb9856 (ARC201) | AWS re:...Stop Worrying about Prodweb001 and Start Loving i-98fb9856 (ARC201) | AWS re:...
Stop Worrying about Prodweb001 and Start Loving i-98fb9856 (ARC201) | AWS re:...Amazon Web Services
 
Enabling Opinion Driven Decision Making - Kavita Ganesan, GitHub
Enabling Opinion Driven Decision Making - Kavita Ganesan, GitHubEnabling Opinion Driven Decision Making - Kavita Ganesan, GitHub
Enabling Opinion Driven Decision Making - Kavita Ganesan, GitHubLucidworks
 
Introducing Core Role Designer - Michael Marks Product Manager - Identity, Co...
Introducing Core Role Designer - Michael Marks Product Manager - Identity, Co...Introducing Core Role Designer - Michael Marks Product Manager - Identity, Co...
Introducing Core Role Designer - Michael Marks Product Manager - Identity, Co...Core Security
 
Bootstrapping Recommendations with Neo4j
Bootstrapping Recommendations with Neo4jBootstrapping Recommendations with Neo4j
Bootstrapping Recommendations with Neo4jMax De Marzi
 
What have fruits to do with technology? The case of Orange, Blackberry and Apple
What have fruits to do with technology? The case of Orange, Blackberry and AppleWhat have fruits to do with technology? The case of Orange, Blackberry and Apple
What have fruits to do with technology? The case of Orange, Blackberry and ApplePlanetData Network of Excellence
 
Introduction to ML and Decision Tree
Introduction to ML and Decision TreeIntroduction to ML and Decision Tree
Introduction to ML and Decision TreeSuman Debnath
 
Hacking Culture at VelocityConf
Hacking Culture at VelocityConfHacking Culture at VelocityConf
Hacking Culture at VelocityConfJesse Robbins
 
Artificial Neural Network Seminar - Google Brain
Artificial Neural Network Seminar - Google BrainArtificial Neural Network Seminar - Google Brain
Artificial Neural Network Seminar - Google BrainRawan Al-Omari
 
IBM IOD Conference 2011 Opening Keynote Deck
IBM IOD Conference 2011 Opening Keynote DeckIBM IOD Conference 2011 Opening Keynote Deck
IBM IOD Conference 2011 Opening Keynote DeckJeff Jonas
 
Barga Data Science lecture 9
Barga Data Science lecture 9Barga Data Science lecture 9
Barga Data Science lecture 9Roger Barga
 
Tim Mackinnon Agile And Beyond
Tim Mackinnon Agile And BeyondTim Mackinnon Agile And Beyond
Tim Mackinnon Agile And Beyonddeimos
 

Similar to DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark (20)

[243] turning data into value
[243] turning data into value[243] turning data into value
[243] turning data into value
 
Lecture 6: Watson and the Social Web (2014), Chris Welty
Lecture 6: Watson and the Social Web (2014), Chris WeltyLecture 6: Watson and the Social Web (2014), Chris Welty
Lecture 6: Watson and the Social Web (2014), Chris Welty
 
Measuring Relevance in the Negative Space
Measuring Relevance in the Negative SpaceMeasuring Relevance in the Negative Space
Measuring Relevance in the Negative Space
 
How can algorithms be biased?
How can algorithms be biased?How can algorithms be biased?
How can algorithms be biased?
 
Machine Learning ICS 273A
Machine Learning ICS 273AMachine Learning ICS 273A
Machine Learning ICS 273A
 
Machine Learning ICS 273A
Machine Learning ICS 273AMachine Learning ICS 273A
Machine Learning ICS 273A
 
Luna Dong, Principal Scientist, Amazon at MLconf Seattle 2017
Luna Dong, Principal Scientist, Amazon at MLconf Seattle 2017Luna Dong, Principal Scientist, Amazon at MLconf Seattle 2017
Luna Dong, Principal Scientist, Amazon at MLconf Seattle 2017
 
Tokens, Complex Systems, and Nature
Tokens, Complex Systems, and NatureTokens, Complex Systems, and Nature
Tokens, Complex Systems, and Nature
 
Sippin: A Mobile Application Case Study presented at Techfest Louisville
Sippin: A Mobile Application Case Study presented at Techfest LouisvilleSippin: A Mobile Application Case Study presented at Techfest Louisville
Sippin: A Mobile Application Case Study presented at Techfest Louisville
 
Stop Worrying about Prodweb001 and Start Loving i-98fb9856 (ARC201) | AWS re:...
Stop Worrying about Prodweb001 and Start Loving i-98fb9856 (ARC201) | AWS re:...Stop Worrying about Prodweb001 and Start Loving i-98fb9856 (ARC201) | AWS re:...
Stop Worrying about Prodweb001 and Start Loving i-98fb9856 (ARC201) | AWS re:...
 
Enabling Opinion Driven Decision Making - Kavita Ganesan, GitHub
Enabling Opinion Driven Decision Making - Kavita Ganesan, GitHubEnabling Opinion Driven Decision Making - Kavita Ganesan, GitHub
Enabling Opinion Driven Decision Making - Kavita Ganesan, GitHub
 
Introducing Core Role Designer - Michael Marks Product Manager - Identity, Co...
Introducing Core Role Designer - Michael Marks Product Manager - Identity, Co...Introducing Core Role Designer - Michael Marks Product Manager - Identity, Co...
Introducing Core Role Designer - Michael Marks Product Manager - Identity, Co...
 
Bootstrapping Recommendations with Neo4j
Bootstrapping Recommendations with Neo4jBootstrapping Recommendations with Neo4j
Bootstrapping Recommendations with Neo4j
 
What have fruits to do with technology? The case of Orange, Blackberry and Apple
What have fruits to do with technology? The case of Orange, Blackberry and AppleWhat have fruits to do with technology? The case of Orange, Blackberry and Apple
What have fruits to do with technology? The case of Orange, Blackberry and Apple
 
Introduction to ML and Decision Tree
Introduction to ML and Decision TreeIntroduction to ML and Decision Tree
Introduction to ML and Decision Tree
 
Hacking Culture at VelocityConf
Hacking Culture at VelocityConfHacking Culture at VelocityConf
Hacking Culture at VelocityConf
 
Artificial Neural Network Seminar - Google Brain
Artificial Neural Network Seminar - Google BrainArtificial Neural Network Seminar - Google Brain
Artificial Neural Network Seminar - Google Brain
 
IBM IOD Conference 2011 Opening Keynote Deck
IBM IOD Conference 2011 Opening Keynote DeckIBM IOD Conference 2011 Opening Keynote Deck
IBM IOD Conference 2011 Opening Keynote Deck
 
Barga Data Science lecture 9
Barga Data Science lecture 9Barga Data Science lecture 9
Barga Data Science lecture 9
 
Tim Mackinnon Agile And Beyond
Tim Mackinnon Agile And BeyondTim Mackinnon Agile And Beyond
Tim Mackinnon Agile And Beyond
 

More from Hakka Labs

DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopHakka Labs
 
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...Hakka Labs
 
DataEngConf: Data Science at the New York Times by Chris Wiggins
DataEngConf: Data Science at the New York Times by Chris WigginsDataEngConf: Data Science at the New York Times by Chris Wiggins
DataEngConf: Data Science at the New York Times by Chris WigginsHakka Labs
 
DataEngConf: Building the Next New York Times Recommendation Engine
DataEngConf: Building the Next New York Times Recommendation EngineDataEngConf: Building the Next New York Times Recommendation Engine
DataEngConf: Building the Next New York Times Recommendation EngineHakka Labs
 
DataEngConf: Measuring Impact with Data in a Distributed World at Conde Nast
DataEngConf: Measuring Impact with Data in a Distributed World at Conde NastDataEngConf: Measuring Impact with Data in a Distributed World at Conde Nast
DataEngConf: Measuring Impact with Data in a Distributed World at Conde NastHakka Labs
 
DataEngConf: Feature Extraction: Modern Questions and Challenges at Google
DataEngConf: Feature Extraction: Modern Questions and Challenges at GoogleDataEngConf: Feature Extraction: Modern Questions and Challenges at Google
DataEngConf: Feature Extraction: Modern Questions and Challenges at GoogleHakka Labs
 
DataEngConf: Talkographics: Using What Viewers Say Online to Measure TV and B...
DataEngConf: Talkographics: Using What Viewers Say Online to Measure TV and B...DataEngConf: Talkographics: Using What Viewers Say Online to Measure TV and B...
DataEngConf: Talkographics: Using What Viewers Say Online to Measure TV and B...Hakka Labs
 
DataEngConf: The Science of Virality at BuzzFeed
DataEngConf: The Science of Virality at BuzzFeedDataEngConf: The Science of Virality at BuzzFeed
DataEngConf: The Science of Virality at BuzzFeedHakka Labs
 
DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...
DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...
DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...Hakka Labs
 
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big DataDataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big DataHakka Labs
 
DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine ...
DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine ...DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine ...
DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine ...Hakka Labs
 
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedInDataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedInHakka Labs
 

More from Hakka Labs (12)

DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL Workshop
 
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
 
DataEngConf: Data Science at the New York Times by Chris Wiggins
DataEngConf: Data Science at the New York Times by Chris WigginsDataEngConf: Data Science at the New York Times by Chris Wiggins
DataEngConf: Data Science at the New York Times by Chris Wiggins
 
DataEngConf: Building the Next New York Times Recommendation Engine
DataEngConf: Building the Next New York Times Recommendation EngineDataEngConf: Building the Next New York Times Recommendation Engine
DataEngConf: Building the Next New York Times Recommendation Engine
 
DataEngConf: Measuring Impact with Data in a Distributed World at Conde Nast
DataEngConf: Measuring Impact with Data in a Distributed World at Conde NastDataEngConf: Measuring Impact with Data in a Distributed World at Conde Nast
DataEngConf: Measuring Impact with Data in a Distributed World at Conde Nast
 
DataEngConf: Feature Extraction: Modern Questions and Challenges at Google
DataEngConf: Feature Extraction: Modern Questions and Challenges at GoogleDataEngConf: Feature Extraction: Modern Questions and Challenges at Google
DataEngConf: Feature Extraction: Modern Questions and Challenges at Google
 
DataEngConf: Talkographics: Using What Viewers Say Online to Measure TV and B...
DataEngConf: Talkographics: Using What Viewers Say Online to Measure TV and B...DataEngConf: Talkographics: Using What Viewers Say Online to Measure TV and B...
DataEngConf: Talkographics: Using What Viewers Say Online to Measure TV and B...
 
DataEngConf: The Science of Virality at BuzzFeed
DataEngConf: The Science of Virality at BuzzFeedDataEngConf: The Science of Virality at BuzzFeed
DataEngConf: The Science of Virality at BuzzFeed
 
DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...
DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...
DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...
 
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big DataDataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
 
DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine ...
DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine ...DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine ...
DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine ...
 
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedInDataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
 

Recently uploaded

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 

Recently uploaded (20)

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 

DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark