SlideShare uma empresa Scribd logo
1 de 7
Baixar para ler offline
PyData London 2014
Jonathan Sedar, Applied AI
@jonsedar
Text mining to correct
missing CRM information
A practical data science project
LIGHTNING EDITION
In a nutshell
● A CRM dataset (100k business accounts) belonging to
a national energy supplier
● A knotty problem: multiple accounts per company,
without any grouping ids
● How can we to find groups of accounts (larger
company structures), using just the CRM data?
● Machine Learning (ML) and Natural Language
Processing (NLP) tools and techniques in Python.
● Import: Scikit Learn and TextBlob (NLTK & Pattern)
Show me the data!
Company
ID
Account name Contact
name
Premises address
lines 1 - 4
Billing address
lines 1 - 4
1 Bob’s Pizza Big Bob 5 High St, Wexford 5 High St, Wexford
1 Bob’s Pizza Big Bob Temple Bar, D2 5 High St, Wexford
1 Mike’s Kebabs Mad Mike 3 Upper St, Dublin 5 High St, Wexford
2 Mark’s Kebabs Mild Mark 8 Upper St, Dublin Main St, Waterford
3 Fred’s Falafel Fat Fred 9 Henry St, Cork 9 Henry st, cork
3 Fred Fallafell Freddie Bridges St, Galway Henrys St, Cork
… x100,000
This crucial bit of info groups the separately-
recorded accounts into companies…
and was missing from the dataset
Transforming text into useful structures
Account holder
Business name
Premises address
Billing address
...
x100,000
TF-IDF
2-D MATRIX
Cleaned, parsed,
tokenised text
strings
SVD N-D
MATRIX
NTLK.PunktTokeniser
sklearn.TfidfVectorizer
sklearn.TruncatedSVD
Now we can identify similar accounts:
1. Suggest similar
accounts to be grouped
sklearn.MiniBatchKMeans
sklear.AffinityPropagation
sklearn.RadiusNeighborsClassifier
2. Human
validation &
verification 3. Incorporate & propagate
valid groupings
A pragmatic process using OOTB Python
machine learning and human expertise
● A very quick turnaround from raw data to tagged
companies to 93% accuracy
● ~40% of accounts found to belong to a company, ~3.5
accounts per company
● NLP toolkits and scikit-learn allowed rapid
development and testing of solution
● Incorporated human identification at critical stages:
no ML problem is an island
PyData London 2014
Jonathan Sedar, Applied AI
@jonsedar
Thank you
Any questions?

Mais conteúdo relacionado

Destaque

Requirements for Processing Datasets for Recommender Systems
Requirements for Processing Datasets for Recommender SystemsRequirements for Processing Datasets for Recommender Systems
Requirements for Processing Datasets for Recommender SystemsStoitsis Giannis
 
Customer Relationship Management in Ireland Managing your Customers for Busin...
Customer Relationship Management in Ireland Managing your Customers for Busin...Customer Relationship Management in Ireland Managing your Customers for Busin...
Customer Relationship Management in Ireland Managing your Customers for Busin...Krishna De
 
Recommendation Engine Demystified
Recommendation Engine DemystifiedRecommendation Engine Demystified
Recommendation Engine DemystifiedDKALab
 
Recommendation techniques
Recommendation techniques Recommendation techniques
Recommendation techniques sun9413
 
Data Mining Techniques for CRM
Data Mining Techniques for CRMData Mining Techniques for CRM
Data Mining Techniques for CRMShyaamini Balu
 
Association rule mining
Association rule miningAssociation rule mining
Association rule miningAcad
 
The comparative study of apriori and FP-growth algorithm
The comparative study of apriori and FP-growth algorithmThe comparative study of apriori and FP-growth algorithm
The comparative study of apriori and FP-growth algorithmdeepti92pawar
 
Role-Based Contextual Recommendation
Role-Based Contextual RecommendationRole-Based Contextual Recommendation
Role-Based Contextual Recommendationphonecom
 
Recommender system algorithm and architecture
Recommender system algorithm and architectureRecommender system algorithm and architecture
Recommender system algorithm and architectureLiang Xiang
 

Destaque (12)

Requirements for Processing Datasets for Recommender Systems
Requirements for Processing Datasets for Recommender SystemsRequirements for Processing Datasets for Recommender Systems
Requirements for Processing Datasets for Recommender Systems
 
Customer Relationship Management in Ireland Managing your Customers for Busin...
Customer Relationship Management in Ireland Managing your Customers for Busin...Customer Relationship Management in Ireland Managing your Customers for Busin...
Customer Relationship Management in Ireland Managing your Customers for Busin...
 
Recommendation Engine Demystified
Recommendation Engine DemystifiedRecommendation Engine Demystified
Recommendation Engine Demystified
 
Recommendation techniques
Recommendation techniques Recommendation techniques
Recommendation techniques
 
Data mining
Data miningData mining
Data mining
 
Data Mining Techniques for CRM
Data Mining Techniques for CRMData Mining Techniques for CRM
Data Mining Techniques for CRM
 
Association rule mining
Association rule miningAssociation rule mining
Association rule mining
 
The comparative study of apriori and FP-growth algorithm
The comparative study of apriori and FP-growth algorithmThe comparative study of apriori and FP-growth algorithm
The comparative study of apriori and FP-growth algorithm
 
Role-Based Contextual Recommendation
Role-Based Contextual RecommendationRole-Based Contextual Recommendation
Role-Based Contextual Recommendation
 
Lecture13 - Association Rules
Lecture13 - Association RulesLecture13 - Association Rules
Lecture13 - Association Rules
 
Apriori Algorithm
Apriori AlgorithmApriori Algorithm
Apriori Algorithm
 
Recommender system algorithm and architecture
Recommender system algorithm and architectureRecommender system algorithm and architecture
Recommender system algorithm and architecture
 

Semelhante a Text Mining to Correct Missing CRM Information by Jonathan Sedar

SystemT: Declarative Information Extraction (invited talk at MIT CSAIL)
SystemT: Declarative Information Extraction (invited talk at MIT CSAIL)SystemT: Declarative Information Extraction (invited talk at MIT CSAIL)
SystemT: Declarative Information Extraction (invited talk at MIT CSAIL)Laura Chiticariu
 
Into dq ed wrazen
Into dq ed wrazenInto dq ed wrazen
Into dq ed wrazenBigDataExpo
 
Data Operations for CRM and Marketing - OpenRefine Demo - Webinar
Data Operations for CRM and Marketing - OpenRefine Demo - WebinarData Operations for CRM and Marketing - OpenRefine Demo - Webinar
Data Operations for CRM and Marketing - OpenRefine Demo - WebinarMartin Magdinier
 
Time Difference: How Tomorrow's Companies Will Outpace Today's
Time Difference: How Tomorrow's Companies Will Outpace Today'sTime Difference: How Tomorrow's Companies Will Outpace Today's
Time Difference: How Tomorrow's Companies Will Outpace Today'sInside Analysis
 
DownUnder Dreaming - 5 steps to dreamy data
DownUnder Dreaming - 5 steps to dreamy dataDownUnder Dreaming - 5 steps to dreamy data
DownUnder Dreaming - 5 steps to dreamy dataClive Astbury
 
Customer Data Operations: Your hidden growth engine (CXL Live 2018)
Customer Data Operations: Your hidden growth engine (CXL Live 2018)Customer Data Operations: Your hidden growth engine (CXL Live 2018)
Customer Data Operations: Your hidden growth engine (CXL Live 2018)Hull
 
Communauté UiPath Suisse romande - Séance de janvier 2024
Communauté UiPath Suisse romande - Séance de janvier 2024Communauté UiPath Suisse romande - Séance de janvier 2024
Communauté UiPath Suisse romande - Séance de janvier 2024Cristina Vidu
 
Communauté UiPath Suisse romande - Séance de janvier 2024
Communauté UiPath Suisse romande - Séance de janvier 2024Communauté UiPath Suisse romande - Séance de janvier 2024
Communauté UiPath Suisse romande - Séance de janvier 2024UiPathCommunity
 
How Retail Banks Use MongoDB
How Retail Banks Use MongoDBHow Retail Banks Use MongoDB
How Retail Banks Use MongoDBMongoDB
 
SAP Forum Ankara 2017 - "Verinin Merkezine Seyahat"
SAP Forum Ankara 2017 - "Verinin Merkezine Seyahat"SAP Forum Ankara 2017 - "Verinin Merkezine Seyahat"
SAP Forum Ankara 2017 - "Verinin Merkezine Seyahat"MDS ap
 
[Conference] Cognitive Graph Analytics on Company Data and News
[Conference] Cognitive Graph Analytics on Company Data and News[Conference] Cognitive Graph Analytics on Company Data and News
[Conference] Cognitive Graph Analytics on Company Data and NewsOntotext
 
All india data in our hands
All india data in our handsAll india data in our hands
All india data in our handsquickdatasale
 
Frakture demo slides 4.1
Frakture demo slides 4.1Frakture demo slides 4.1
Frakture demo slides 4.1Chris Lundberg
 
ML Business Data Guide
ML Business Data GuideML Business Data Guide
ML Business Data GuideBiglad71
 
E Marketplace Strategy
E Marketplace StrategyE Marketplace Strategy
E Marketplace Strategyrlynes
 
IT Vertical Overview
IT Vertical OverviewIT Vertical Overview
IT Vertical OverviewTom Burns
 
Using graph technologies to fight fraud
Using graph technologies to fight fraudUsing graph technologies to fight fraud
Using graph technologies to fight fraudLinkurious
 

Semelhante a Text Mining to Correct Missing CRM Information by Jonathan Sedar (20)

SystemT: Declarative Information Extraction (invited talk at MIT CSAIL)
SystemT: Declarative Information Extraction (invited talk at MIT CSAIL)SystemT: Declarative Information Extraction (invited talk at MIT CSAIL)
SystemT: Declarative Information Extraction (invited talk at MIT CSAIL)
 
Into dq ed wrazen
Into dq ed wrazenInto dq ed wrazen
Into dq ed wrazen
 
Data Operations for CRM and Marketing - OpenRefine Demo - Webinar
Data Operations for CRM and Marketing - OpenRefine Demo - WebinarData Operations for CRM and Marketing - OpenRefine Demo - Webinar
Data Operations for CRM and Marketing - OpenRefine Demo - Webinar
 
Time Difference: How Tomorrow's Companies Will Outpace Today's
Time Difference: How Tomorrow's Companies Will Outpace Today'sTime Difference: How Tomorrow's Companies Will Outpace Today's
Time Difference: How Tomorrow's Companies Will Outpace Today's
 
DownUnder Dreaming - 5 steps to dreamy data
DownUnder Dreaming - 5 steps to dreamy dataDownUnder Dreaming - 5 steps to dreamy data
DownUnder Dreaming - 5 steps to dreamy data
 
Customer Data Operations: Your hidden growth engine (CXL Live 2018)
Customer Data Operations: Your hidden growth engine (CXL Live 2018)Customer Data Operations: Your hidden growth engine (CXL Live 2018)
Customer Data Operations: Your hidden growth engine (CXL Live 2018)
 
Communauté UiPath Suisse romande - Séance de janvier 2024
Communauté UiPath Suisse romande - Séance de janvier 2024Communauté UiPath Suisse romande - Séance de janvier 2024
Communauté UiPath Suisse romande - Séance de janvier 2024
 
Communauté UiPath Suisse romande - Séance de janvier 2024
Communauté UiPath Suisse romande - Séance de janvier 2024Communauté UiPath Suisse romande - Séance de janvier 2024
Communauté UiPath Suisse romande - Séance de janvier 2024
 
How Retail Banks Use MongoDB
How Retail Banks Use MongoDBHow Retail Banks Use MongoDB
How Retail Banks Use MongoDB
 
What it takes to predict the future
What it takes to predict the futureWhat it takes to predict the future
What it takes to predict the future
 
SAP Forum Ankara 2017 - "Verinin Merkezine Seyahat"
SAP Forum Ankara 2017 - "Verinin Merkezine Seyahat"SAP Forum Ankara 2017 - "Verinin Merkezine Seyahat"
SAP Forum Ankara 2017 - "Verinin Merkezine Seyahat"
 
[Conference] Cognitive Graph Analytics on Company Data and News
[Conference] Cognitive Graph Analytics on Company Data and News[Conference] Cognitive Graph Analytics on Company Data and News
[Conference] Cognitive Graph Analytics on Company Data and News
 
All india data in our hands
All india data in our handsAll india data in our hands
All india data in our hands
 
Frakture demo slides 4.1
Frakture demo slides 4.1Frakture demo slides 4.1
Frakture demo slides 4.1
 
ML Business Data Guide
ML Business Data GuideML Business Data Guide
ML Business Data Guide
 
E Marketplace Strategy
E Marketplace StrategyE Marketplace Strategy
E Marketplace Strategy
 
IT Vertical Overview
IT Vertical OverviewIT Vertical Overview
IT Vertical Overview
 
D3 IDP Slides.pdf
D3 IDP Slides.pdfD3 IDP Slides.pdf
D3 IDP Slides.pdf
 
Apics12
Apics12Apics12
Apics12
 
Using graph technologies to fight fraud
Using graph technologies to fight fraudUsing graph technologies to fight fraud
Using graph technologies to fight fraud
 

Mais de PyData

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...PyData
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshPyData
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiPyData
 
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...PyData
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerPyData
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaPyData
 
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...PyData
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroPyData
 
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...PyData
 
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottAvoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottPyData
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroPyData
 
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...PyData
 
Pydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPyData
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...PyData
 
Extending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydExtending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydPyData
 
Measuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverMeasuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverPyData
 
What's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldWhat's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldPyData
 
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...PyData
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardPyData
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...PyData
 

Mais de PyData (20)

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
 
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne Bauer
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
 
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
 
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
 
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottAvoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca Bilbro
 
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
 
Pydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica Puerto
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
 
Extending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydExtending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will Ayd
 
Measuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverMeasuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen Hoover
 
What's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldWhat's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper Seabold
 
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
 

Último

Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 

Último (20)

Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 

Text Mining to Correct Missing CRM Information by Jonathan Sedar

  • 1. PyData London 2014 Jonathan Sedar, Applied AI @jonsedar Text mining to correct missing CRM information A practical data science project LIGHTNING EDITION
  • 2. In a nutshell ● A CRM dataset (100k business accounts) belonging to a national energy supplier ● A knotty problem: multiple accounts per company, without any grouping ids ● How can we to find groups of accounts (larger company structures), using just the CRM data? ● Machine Learning (ML) and Natural Language Processing (NLP) tools and techniques in Python. ● Import: Scikit Learn and TextBlob (NLTK & Pattern)
  • 3. Show me the data! Company ID Account name Contact name Premises address lines 1 - 4 Billing address lines 1 - 4 1 Bob’s Pizza Big Bob 5 High St, Wexford 5 High St, Wexford 1 Bob’s Pizza Big Bob Temple Bar, D2 5 High St, Wexford 1 Mike’s Kebabs Mad Mike 3 Upper St, Dublin 5 High St, Wexford 2 Mark’s Kebabs Mild Mark 8 Upper St, Dublin Main St, Waterford 3 Fred’s Falafel Fat Fred 9 Henry St, Cork 9 Henry st, cork 3 Fred Fallafell Freddie Bridges St, Galway Henrys St, Cork … x100,000 This crucial bit of info groups the separately- recorded accounts into companies… and was missing from the dataset
  • 4. Transforming text into useful structures Account holder Business name Premises address Billing address ... x100,000 TF-IDF 2-D MATRIX Cleaned, parsed, tokenised text strings SVD N-D MATRIX NTLK.PunktTokeniser sklearn.TfidfVectorizer sklearn.TruncatedSVD
  • 5. Now we can identify similar accounts: 1. Suggest similar accounts to be grouped sklearn.MiniBatchKMeans sklear.AffinityPropagation sklearn.RadiusNeighborsClassifier 2. Human validation & verification 3. Incorporate & propagate valid groupings
  • 6. A pragmatic process using OOTB Python machine learning and human expertise ● A very quick turnaround from raw data to tagged companies to 93% accuracy ● ~40% of accounts found to belong to a company, ~3.5 accounts per company ● NLP toolkits and scikit-learn allowed rapid development and testing of solution ● Incorporated human identification at critical stages: no ML problem is an island
  • 7. PyData London 2014 Jonathan Sedar, Applied AI @jonsedar Thank you Any questions?