PRACTICAL TEXT MINING WITH SQL USING RELATIONAL DATABASES
Ralph Winters
Data Architect,
Actuarial Business Intelligence
EmblemHealth
June 5th, 2013
11th Annual Text and Social Analytics Summit
Cambridge, MA
RDBMS TODAY
Gartner: clients tell us that combining scored, processed ‘outside data’ with data inside our relational databases is where all the added value is.
IDC: relational database management systems (RDBMS) are expected to nearly double in market growth by 2016, driven by business intelligence demands and expanded adoption to tackle big data and unstructured information streams.
The relational database management systems (RDBMS) market continues to confound the skeptics by maintaining strong growth characteristics despite the belief by some that the market has become ‘saturated’ or that it will be weakened by newer Big Data technologies.
Inmon: Listen carefully to the “big data” vendors and this is what you hear: “Let’s get rid of relational.” It is like courtiers in the castle whispering, “The king must die.” What’s going on here?
Why a relational DB?
• Marry structured + unstructured data
• More suitable for statistical analytics (matrices)
• Leverage existing, familiar, widespread technology
• Improves predictive models
• Referential integrity
• Integrated text/data mining
• Feasibility
What do I need to know?
• Costs
• Benefits/Risks
• Industries
• Adding value?
RDBMS Internal/External Connections
• File interfaces (XML, CSV)
• ODBC/JDBC/DBI
• Text vendor supplied connector
• Hadoop connectors (SAS, Oracle)
• Open source text mining tools (R, Java, Perl, LingPipe)
• In-database text mining algorithms (Oracle Text, SAS Text Miner, SQL Server Text Mining)
ANGRY Customer Comments
• Short-tailed sampling
• Not for long-tailed data
Comment
- KardCo Premier Credit Card Promo Scam . I recently received
an KardCo promo promising 25,000 bonus points if you sign up
for the KardCo Premier Card and spend $2000 in the 1st three
months. and so i call in and apply ...got APPROVED...two
weeks later ..
Posting on your site DEFINITELY HELPED (it was pointed out by
retailer), and sped up response after 6 weeks of mulling
around BEFORE we posted our complaint.
$100 restaurant certificates
15 days ago I opened a cc w/ KardCo. I thought I did my
research on which company is the best, boy was I wrong. I go
to use my card for the 1st time lastnight & its declined. Ok.... I
call KardCo from the store and I'm placed on hold for 20 mins.
Finally I speak to an awful women who tells me my debt to
income ratio is too high and I have too many inquires. I pull my
credit report once I get home I pull the one from when I
opened the card and the most recent one. My revolving debt
$100, my credit score increased from 738 to 740 and 96% of
my credit is currently available....
1-800 Customer Service NOT LOCATED IN US!
2 years in a row they don't send me my rewards check
Full Text Search
• Built in to many RDBMS
• Needs indexing
• Can be slow
• Necessary in some applications
• Complements categorization
Oracle:
SELECT SCORE(1), comment, issue_date
FROM custdb
WHERE CONTAINS(text, 'APR', 1) > 0
  AND issue_date >= TO_DATE('01-OCT-1997', 'DD-MON-YYYY')
ORDER BY SCORE(1) DESC;
Operators: LIKE, CONTAINS, regex, sounds-like, distance measures
Map Unstructured-to-Structured
• Parse terms from each row
• Remove stopwords
• Cross-reference document ID & term numbers
• Output new “structured” table

One row per document (“wasted space”):
Doc Term1 Term2 Term3 Term4
1 The Best Customer Service
2 Is Highly Recommended

One row per term:
Term Doc
Best 1
Customer 1
Service 1
Highly 2
Recommended 2
Many Methods to Pivot Data
• Extended SQL
• User defined functions
• Stored process
select
  regexp_split_to_table(lower(line), '\s+') as word
from
  customer_comments
“Words” Table
One row for each term in doc. Columns:
• “Document ID”
• Term index number
• Verbatim
• Term index + 1
• Term index - 1
Must handle negation!
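The words table described above can be sketched roughly as follows, using Python's built-in sqlite3 module as a stand-in for the RDBMS. The table and column names (customer_comments, words, id, no, no_prev, word) and the two-word stopword list are assumptions for illustration only. Tokenizing happens in Python here because SQLite has no regexp_split_to_table.

```python
import re
import sqlite3

# Hypothetical sample data: (document id, verbatim comment)
COMMENTS = [
    (1, "The Best Customer Service"),
    (2, "Is Highly Recommended"),
]
# Illustrative stopword list (an assumption, not taken from the slides)
STOPWORDS = {"the", "is"}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_comments (id INTEGER PRIMARY KEY, line TEXT)")
conn.executemany("INSERT INTO customer_comments VALUES (?, ?)", COMMENTS)
conn.execute("CREATE TABLE words (id INTEGER, no INTEGER, no_prev INTEGER, word TEXT)")

for doc_id, line in conn.execute("SELECT id, line FROM customer_comments").fetchall():
    # Lowercase, split on whitespace, drop stopwords, then number the surviving terms
    terms = [t for t in re.split(r"\s+", line.lower().strip()) if t and t not in STOPWORDS]
    rows = [(doc_id, i, i - 1, term) for i, term in enumerate(terms, start=1)]
    conn.executemany("INSERT INTO words VALUES (?, ?, ?, ?)", rows)

words = conn.execute("SELECT id, no, word FROM words ORDER BY id, no").fetchall()
print(words)
```

Because stopwords are removed before the terms are numbered, a word such as "not" should stay off the stopword list if negation is to be handled later.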
Term document matrix
• Harder to do analysis in SQL
• Wasted space
• Weight terms
• Discard terms
Term Weighting in SQL
Calculate IDF
• Log(Number of docs / Number of docs which contain term)
Calculate Term Freq
• Number of times term occurs in document
Calculate tfidf
• Multiply IDF * TF
• Sort by high values
• Select top N features
/* Total number of documents */
create table num_docs as
select count(distinct id) as value
from WORDS;

/* Number of documents containing each word */
create table doc_freq as
select word, count(distinct id) as value
from WORDS
group by word
order by value;

/* Inverse document frequency per word; needs the two tables above */
create table idf as
select word,
  num_docs.value as numdocs,
  doc_freq.value as docfreq,
  log10(num_docs.value / doc_freq.value) as idf
from doc_freq, num_docs
order by idf;
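As a rough end-to-end illustration of the weighting steps, the sketch below runs the same three tables plus a final TF-IDF table in SQLite through Python's sqlite3 module. The toy corpus, the tfidf table and the aliases are assumptions. log10 is registered from Python because not every SQLite build ships math functions.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# Register log10 ourselves; some SQLite builds lack built-in math functions
conn.create_function("log10", 1, math.log10)

conn.execute("CREATE TABLE words (id INTEGER, word TEXT)")
# Hypothetical toy corpus: three documents, one row per term occurrence
conn.executemany(
    "INSERT INTO words VALUES (?, ?)",
    [(1, "gift"), (1, "card"), (1, "gift"),
     (2, "card"), (2, "service"),
     (3, "card"), (3, "late"), (3, "fee")],
)

conn.executescript("""
CREATE TABLE num_docs AS
    SELECT COUNT(DISTINCT id) AS value FROM words;
CREATE TABLE doc_freq AS
    SELECT word, COUNT(DISTINCT id) AS value FROM words GROUP BY word;
-- CAST avoids integer division in SQLite
CREATE TABLE idf AS
    SELECT d.word, log10(CAST(n.value AS REAL) / d.value) AS idf
    FROM doc_freq d CROSS JOIN num_docs n;
-- Term frequency per (document, word) multiplied by that word's IDF
CREATE TABLE tfidf AS
    SELECT w.id, w.word, COUNT(*) AS tf, COUNT(*) * MAX(i.idf) AS tfidf
    FROM words w JOIN idf i ON i.word = w.word
    GROUP BY w.id, w.word;
""")

# Select the top N features (N = 3 here) by weight
top = conn.execute(
    "SELECT id, word, tf, tfidf FROM tfidf ORDER BY tfidf DESC, word LIMIT 3"
).fetchall()
print(top)
```

In this toy corpus "card" appears in every document, so its IDF and TF-IDF are zero and it drops out of the top features. That is the "discard terms" effect the weighting is meant to produce.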
• Words table
• Top N words
• Pivot on rows
Top N Weighted Words Matrix – Ranked by Highest TF-IDF
Generating Bigrams
select a.ID,
  (compress(a.word) || ' ' || compress(b.word)) as pair
from words a, words b
where a.ID = b.ID and (a.no = b.no_prev)
order by pair;
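The self-join can be tried out in SQLite roughly as below. This is a sketch under assumptions: the two sample documents are invented, and the words table layout (id, no, no_prev, word) follows the column names used in the query above, with no_prev holding the index of the preceding term.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (id INTEGER, no INTEGER, no_prev INTEGER, word TEXT)")

# Hypothetical documents, already lowercased
docs = {1: "my gift card was declined", 2: "gift card never arrived"}
rows = []
for doc_id, text in docs.items():
    for i, term in enumerate(text.split(), start=1):
        rows.append((doc_id, i, i - 1, term))
conn.executemany("INSERT INTO words VALUES (?, ?, ?, ?)", rows)

# Pair each term with the term that follows it in the same document,
# then count how often each pair occurs across all documents
bigrams = conn.execute("""
    SELECT a.word || ' ' || b.word AS pair, COUNT(*) AS n
    FROM words a
    JOIN words b ON a.id = b.id AND a.no = b.no_prev
    GROUP BY pair
    ORDER BY n DESC, pair
""").fetchall()
print(bigrams)
```

Adding the count directly to the join gives the frequency table that the next step compares against expected counts.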
Bigrams Output
• Run frequencies on terms
• Gift Card occurs more frequently than expected
• Consider incorporating into taxonomy
SAMPLE BIGRAM COUNT EXPECTED COUNT
Have Been 2326
Gift Card 2910
Called Kardo 2119
Kardco Card 3125
Customer Service 3630
Credit Card 2429
Member Since 1013
Credit Limit 1310
Starlight Card 115
Kardco Customer 86
Big Ram
Do repeat callers signal Churn?
• Research shows improved predictive model performance
• Correlate with satisfaction scores
• Relevant keywords
• First call responders
pair Status Count satisfaction
CUSTOMER SERVICE A 27 8.47
GIFT CARD A 25 8.34
KARDCO CARD A 24 8.79
CREDIT CARD A 15 8.62
WITH KARDCO A 13 8.28
TRANSFERRED AGAIN I 12 8.30
CREDIT LIMIT A 11 8.35
FROM KARDCO A 10 8.50
PREMIER CARD A 9 8.42
WITH KARDCO I 9 8.48
THREE MONTHS A 9 8.37
CUSTOMER SERVICE I 9 8.36
select distinct comm1 from Customer_Comments
where prxmatch("m/2nd|3rd|again|resolve/oi", comm1) > 0;
Customer comment Sat
Hotel cant resolve my dispute. I'm going to cancel 4
Never resolved. Still waiting for a call back 3
So Completely Unhappy with KardCo. It took 3 calls to the service center to finally resolve my billing problem 5
They gave me a 2nd chance to pay my bill 9
This complaint was never resolved to begin with 5
This is the 2nd year in a row that KardCo said they mailed my rewards refund that I have yet to recieve. Same Pattern every year, I stop getting paper statements in December even though I am signed up for them and I never get my Check. Then I mysteriously start getting paper statements again after the period they say they will cut the checks and tell me i am no longer eligable. 6
This is the 3rd time I have complained about this and I may have to take my business elsewhere! 4
Transferred again for the 2nd time. I can't believe it. What happened to Cindy? 1
When ever I compare customer service between companies KardCo is the PREMIER standard. They are on call 24 hours a day. Their operators are friendly and easy to speak with. They are always on the customers side and they always work at a situation until they resolve the issue. 10
Looking for the Repeat Callers
• Some false positives: the terms “resolve” and “2nd” can also appear in positive comments
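A small sketch of the same keyword screen outside SAS: SQLite has no prxmatch, so a Python regular expression is registered as a SQL function. The function name repeat_hint, the table layout, the fourth sample comment and the satisfaction cutoff of 5 are all assumptions made for illustration.

```python
import re
import sqlite3

# Same alternation as the prxmatch pattern, case-insensitive
PATTERN = re.compile(r"2nd|3rd|again|resolve", re.IGNORECASE)

def repeat_hint(text):
    """Return 1 if the repeat-contact pattern occurs anywhere in the text, else 0."""
    return 1 if text is not None and PATTERN.search(text) else 0

conn = sqlite3.connect(":memory:")
conn.create_function("repeat_hint", 1, repeat_hint)
conn.execute("CREATE TABLE customer_comments (comm1 TEXT, sat INTEGER)")
conn.executemany("INSERT INTO customer_comments VALUES (?, ?)", [
    ("Transferred again for the 2nd time. I can't believe it.", 1),
    ("They gave me a 2nd chance to pay my bill", 9),
    ("Never resolved. Still waiting for a call back", 3),
    ("Great rewards program", 8),  # invented comment with no keyword hit
])

flagged = conn.execute("""
    SELECT comm1, sat FROM customer_comments
    WHERE repeat_hint(comm1) = 1
    ORDER BY sat
""").fetchall()

# Hypothetical refinement: keep only flagged comments with low satisfaction
likely_repeat = [row for row in flagged if row[1] <= 5]
print(flagged)
print(likely_repeat)
```

The comment scored 9 is matched by the pattern but is a happy customer, which is the false-positive case noted above. Combining the keyword flag with the satisfaction score is one simple way to filter it out.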
How Text Analytics Can Improve a Predictive Model
Predict churn from:
• Satisfaction score
• Outstanding balance
• Number of times called
Add a text-derived feature:
• Select all comments with “Gift Card”
• Insert keys into model table
• Join new model with existing model tables
Outcome:
• Churn improves
• Implement new scripts for call center
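The three steps (select the comments, insert the keys, join to the model tables) might look like this in SQLite via Python. All table and column names (customer_comments, churn_model, gift_card_keys, cust_id and so on) and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_comments (cust_id INTEGER, comm1 TEXT);
CREATE TABLE churn_model (cust_id INTEGER PRIMARY KEY, sat_score REAL,
                          balance REAL, times_called INTEGER);
CREATE TABLE gift_card_keys (cust_id INTEGER PRIMARY KEY);
""")
conn.executemany("INSERT INTO customer_comments VALUES (?, ?)", [
    (101, "My Gift Card never arrived"),
    (101, "Still no gift card"),
    (102, "Late fee was unfair"),
])
conn.executemany("INSERT INTO churn_model VALUES (?, ?, ?, ?)", [
    (101, 4.0, 1200.0, 3),
    (102, 8.0, 300.0, 1),
    (103, 9.0, 0.0, 0),
])

# Steps 1 and 2: select comments mentioning "gift card" and store the customer keys
conn.execute("""
    INSERT INTO gift_card_keys
    SELECT DISTINCT cust_id FROM customer_comments
    WHERE LOWER(comm1) LIKE '%gift card%'
""")

# Step 3: join the keys to the existing model table as a 0/1 feature
features = conn.execute("""
    SELECT m.cust_id, m.sat_score, m.balance, m.times_called,
           CASE WHEN g.cust_id IS NULL THEN 0 ELSE 1 END AS gift_card_flag
    FROM churn_model m
    LEFT JOIN gift_card_keys g ON g.cust_id = m.cust_id
    ORDER BY m.cust_id
""").fetchall()
print(features)
```

A LEFT JOIN keeps customers who never mentioned a gift card, so the model table retains every row and simply gains one extra column.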
STANDARD CLASSIFICATIONS
• Advertising and marketing
• Application processing delay
• APR or interest rate
• Arbitration
• Balance transfer
• Balance transfer fee
• Bankruptcy
• Billing disputes
• Billing statement
• Cash advance
• Cash advance fee
• Closing/Cancelling account
• Collection debt dispute
• Collection practices
• Convenience checks
• Credit card protection / Debt protection
• Credit determination
• Credit line increase/decrease
• Credit reporting
• Customer service / Customer relations
• Delinquent account
• Forbearance / Workout plans
• Identity theft / Fraud / Embezzlement
• Late fee
• Other
• Other fee
• Overlimit fee
• Payoff process
• Privacy
• Rewards
• Sale of account
• Transaction issue
• Unsolicited issuance of credit card
Validation
• Add “Gift Card” as a classification
• “Tweak” taxonomy
• Apply auto classification
• Evaluate according to gold standard
• Apply CRISP or SEMMA methodology and repeat
CAT Count Customer Service Baseline Average Spend
ADV 15 15 15,483
APR 12 12 13,308
BANKRUPT 1 1 13,108
BILLDISP 6 6 12,682
BILLSTAT 6 6 10,617
COLL 1 1 17,720
CUSTSERV 25 25 14,725
DELAY 1 1 13,334
FRAUD 13 13 15,162
GIFTCARD 18 18 16,107
LATEFEE 3 3 18,989
LINEADJ 4 4 13,762
OTHER 125 125 18,482
OTHERFEE 15 15 10,153
PROT 1 1 17,808
REFUND 2 2 16,473
REWARDS 10 10 10,918
TRANS 1 1 14,224
TRAVEL 8 8 10,355
“There is no globally best method for
(automated) text analysis”
Other Types of Classification
select id, comm1,
  case
    when compged('High Interest Rate APR', comm1) < 300 then 'APR'
    when compged('Best Customer Service', comm1) < 300 then 'DELIGHT'
    else 'OTHER'
  end as CAT
from CUSTOMER_COMMENTS;
Classify by:
• Keyword pairs
• Regular expressions
• Boolean
• Distance functions
• Fuzzy matching
• Bayesian algorithms
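For comparison with the distance-based query, here is a minimal sketch of the keyword/Boolean option from the list, run in SQLite through Python. It swaps COMPGED (a SAS function) for plain LIKE tests, so it is a different and cruder technique. The category codes, keywords and sample comments are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_comments (id INTEGER PRIMARY KEY, comm1 TEXT)")
# Hypothetical sample comments
conn.executemany("INSERT INTO customer_comments VALUES (?, ?)", [
    (1, "The APR on this card is far too high"),
    (2, "Best customer service I have ever had"),
    (3, "My gift card never arrived"),
    (4, "Statement was mailed late"),
])

# First matching WHEN branch wins, so rule order acts as rule priority
classified = conn.execute("""
    SELECT id,
           CASE
             WHEN LOWER(comm1) LIKE '%apr%'
               OR LOWER(comm1) LIKE '%interest rate%' THEN 'APR'
             WHEN LOWER(comm1) LIKE '%gift%'
              AND LOWER(comm1) LIKE '%card%' THEN 'GIFTCARD'
             WHEN LOWER(comm1) LIKE '%customer service%' THEN 'CUSTSERV'
             ELSE 'OTHER'
           END AS cat
    FROM customer_comments
    ORDER BY id
""").fetchall()
print(classified)
```

Substring rules are easy to write but blunt: '%apr%' would also match "April", which is one reason to move on to regular expressions with word boundaries or to the fuzzy and Bayesian methods listed above.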
Sentiment – Can be easy, can be hard!
Basic approach:
• Words table
• Join to polarity dictionary
• Assign +1 to positive / -1 to negative
• Sentiment score
Refinements and considerations:
• Use top N weighted terms
• Use first and last sentences
• Vector size, CPU?, complexity, normalized
• Use in-memory lookups
• Customized dictionary
• Bayesian classifier in SQL
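The "easy" version of the pipeline (words table, join to a polarity dictionary, sum +1/-1 per document) can be sketched in SQLite via Python as follows. The polarity table, its four entries and the sample terms are invented for illustration, and negation is deliberately not handled here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE words (id INTEGER, no INTEGER, word TEXT);
CREATE TABLE polarity (word TEXT PRIMARY KEY, score INTEGER);
""")
# Hypothetical tokenized comments: (document id, term index, term)
conn.executemany("INSERT INTO words VALUES (?, ?, ?)", [
    (1, 1, "friendly"), (1, 2, "operators"), (1, 3, "great"), (1, 4, "service"),
    (2, 1, "awful"), (2, 2, "service"), (2, 3, "card"), (2, 4, "declined"),
    (3, 1, "card"), (3, 2, "arrived"),
])
# Hypothetical polarity dictionary: +1 positive, -1 negative
conn.executemany("INSERT INTO polarity VALUES (?, ?)", [
    ("friendly", 1), ("great", 1), ("awful", -1), ("declined", -1),
])

# LEFT JOIN keeps documents with no dictionary hits; they score 0
scores = conn.execute("""
    SELECT w.id, COALESCE(SUM(p.score), 0) AS sentiment
    FROM words w
    LEFT JOIN polarity p ON p.word = w.word
    GROUP BY w.id
    ORDER BY w.id
""").fetchall()
print(scores)
```

The "hard" part starts where this sketch stops: a comment such as "not great" would still score +1, so the term-index columns of the words table would be needed to look at the preceding word.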
CAT Count Average Spend Satisfaction Neg Pct Not Neg Pct
ADV 15 15,483 7.5 49 51
APR 12 13,308 7.2 72 28
FRAUD 13 15,162 5.2 61 39
GIFTCARD 18 16,107 8.9 24 76
LATEFEE 3 18,989 7.0 12 88
Sentiment – Correlation
Correlating sentiment scores with other database metrics can support hypotheses.
THANK YOU!
Contact:
R_winters@emblemhealth.com
www.linkedin.com/in/ralphwinters

Mais conteúdo relacionado

Mais procurados

Inserindo em Ordem Crescente na Lista Encadeada
Inserindo em Ordem Crescente na Lista EncadeadaInserindo em Ordem Crescente na Lista Encadeada
Inserindo em Ordem Crescente na Lista EncadeadaElaine Cecília Gatto
 
ITE v5.0 - Chapter 8
ITE v5.0 - Chapter 8ITE v5.0 - Chapter 8
ITE v5.0 - Chapter 8Irsandi Hasan
 
How to shutdown the Netapp SAN 8.3 and 9.2 version
How to shutdown the Netapp SAN 8.3 and 9.2 versionHow to shutdown the Netapp SAN 8.3 and 9.2 version
How to shutdown the Netapp SAN 8.3 and 9.2 versionSaroj Sahu
 
SA08302002E Control Panel Design Guide
SA08302002E Control Panel Design GuideSA08302002E Control Panel Design Guide
SA08302002E Control Panel Design GuideErik Barnes
 
Sparc t4 4 system technical overview
Sparc t4 4 system technical overviewSparc t4 4 system technical overview
Sparc t4 4 system technical overviewsolarisyougood
 
CCNA Discovery 1 - Chapter 1
CCNA Discovery 1 - Chapter 1CCNA Discovery 1 - Chapter 1
CCNA Discovery 1 - Chapter 1Irsandi Hasan
 
Internet Procedure vesion 6 - IPV6 V4 - Computerland
Internet Procedure vesion 6 - IPV6 V4 - ComputerlandInternet Procedure vesion 6 - IPV6 V4 - Computerland
Internet Procedure vesion 6 - IPV6 V4 - ComputerlandPatricia NENZI
 
Interesting and Useful Features of the DeltaV PID Controller
Interesting and Useful Features of the DeltaV PID ControllerInteresting and Useful Features of the DeltaV PID Controller
Interesting and Useful Features of the DeltaV PID ControllerJim Cahill
 
How to connect to cisco asa
How to connect to cisco asaHow to connect to cisco asa
How to connect to cisco asaIT Tech
 
CCNA 2 Routing and Switching v5.0 Chapter 1
CCNA 2 Routing and Switching v5.0 Chapter 1CCNA 2 Routing and Switching v5.0 Chapter 1
CCNA 2 Routing and Switching v5.0 Chapter 1Nil Menon
 
αναλυτικοί πίνακες αδειών ηλεκτρολόγου α΄ ειδικότητας
 αναλυτικοί πίνακες αδειών ηλεκτρολόγου α΄ ειδικότητας αναλυτικοί πίνακες αδειών ηλεκτρολόγου α΄ ειδικότητας
αναλυτικοί πίνακες αδειών ηλεκτρολόγου α΄ ειδικότηταςιωαννης αληφραγκης
 
Hobi Elektronik Devre Projeleri
Hobi Elektronik Devre ProjeleriHobi Elektronik Devre Projeleri
Hobi Elektronik Devre ProjeleriEmre ARSLAN
 
1997 sea doo bombardier personal watercraft service repair manual
1997 sea doo  bombardier personal watercraft service repair manual1997 sea doo  bombardier personal watercraft service repair manual
1997 sea doo bombardier personal watercraft service repair manualfjkseksmefmm
 

Mais procurados (20)

Inserindo em Ordem Crescente na Lista Encadeada
Inserindo em Ordem Crescente na Lista EncadeadaInserindo em Ordem Crescente na Lista Encadeada
Inserindo em Ordem Crescente na Lista Encadeada
 
ITE v5.0 - Chapter 8
ITE v5.0 - Chapter 8ITE v5.0 - Chapter 8
ITE v5.0 - Chapter 8
 
Ipv4 address
Ipv4 addressIpv4 address
Ipv4 address
 
How to shutdown the Netapp SAN 8.3 and 9.2 version
How to shutdown the Netapp SAN 8.3 and 9.2 versionHow to shutdown the Netapp SAN 8.3 and 9.2 version
How to shutdown the Netapp SAN 8.3 and 9.2 version
 
SA08302002E Control Panel Design Guide
SA08302002E Control Panel Design GuideSA08302002E Control Panel Design Guide
SA08302002E Control Panel Design Guide
 
Delta sijas
Delta sijasDelta sijas
Delta sijas
 
Sparc t4 4 system technical overview
Sparc t4 4 system technical overviewSparc t4 4 system technical overview
Sparc t4 4 system technical overview
 
CCNA Discovery 1 - Chapter 1
CCNA Discovery 1 - Chapter 1CCNA Discovery 1 - Chapter 1
CCNA Discovery 1 - Chapter 1
 
Internet Procedure vesion 6 - IPV6 V4 - Computerland
Internet Procedure vesion 6 - IPV6 V4 - ComputerlandInternet Procedure vesion 6 - IPV6 V4 - Computerland
Internet Procedure vesion 6 - IPV6 V4 - Computerland
 
Interesting and Useful Features of the DeltaV PID Controller
Interesting and Useful Features of the DeltaV PID ControllerInteresting and Useful Features of the DeltaV PID Controller
Interesting and Useful Features of the DeltaV PID Controller
 
Westermo solutions for onboard rail networks
Westermo solutions for onboard rail networksWestermo solutions for onboard rail networks
Westermo solutions for onboard rail networks
 
Sockets
SocketsSockets
Sockets
 
Basic of IPv6
Basic of IPv6Basic of IPv6
Basic of IPv6
 
How to connect to cisco asa
How to connect to cisco asaHow to connect to cisco asa
How to connect to cisco asa
 
CCNA 2 Routing and Switching v5.0 Chapter 1
CCNA 2 Routing and Switching v5.0 Chapter 1CCNA 2 Routing and Switching v5.0 Chapter 1
CCNA 2 Routing and Switching v5.0 Chapter 1
 
αναλυτικοί πίνακες αδειών ηλεκτρολόγου α΄ ειδικότητας
 αναλυτικοί πίνακες αδειών ηλεκτρολόγου α΄ ειδικότητας αναλυτικοί πίνακες αδειών ηλεκτρολόγου α΄ ειδικότητας
αναλυτικοί πίνακες αδειών ηλεκτρολόγου α΄ ειδικότητας
 
Hobi Elektronik Devre Projeleri
Hobi Elektronik Devre ProjeleriHobi Elektronik Devre Projeleri
Hobi Elektronik Devre Projeleri
 
Ch8 v70 os_en
Ch8 v70 os_enCh8 v70 os_en
Ch8 v70 os_en
 
1997 sea doo bombardier personal watercraft service repair manual
1997 sea doo  bombardier personal watercraft service repair manual1997 sea doo  bombardier personal watercraft service repair manual
1997 sea doo bombardier personal watercraft service repair manual
 
Ipv4 ppt
Ipv4 pptIpv4 ppt
Ipv4 ppt
 

Destaque

Prefixes dis
Prefixes disPrefixes dis
Prefixes dissharyndJ
 
An Approach to Automated Learning of Conceptual Graphs from Text
An Approach to Automated Learning of Conceptual Graphs from TextAn Approach to Automated Learning of Conceptual Graphs from Text
An Approach to Automated Learning of Conceptual Graphs from TextFulvio Rotella
 
SECTION 11- GENERALIZATION AND INTERPRETATION OF RESULTS
SECTION 11- GENERALIZATION AND INTERPRETATION OF RESULTSSECTION 11- GENERALIZATION AND INTERPRETATION OF RESULTS
SECTION 11- GENERALIZATION AND INTERPRETATION OF RESULTSHezel Nee Gupit
 
Content analysis
Content analysisContent analysis
Content analysisHans Mallen
 
Critical Discourse Analysis adel thamery
 Critical Discourse Analysis adel thamery Critical Discourse Analysis adel thamery
Critical Discourse Analysis adel thameryAdel Thamery
 
Modal verbs Role-Play Activity
Modal verbs Role-Play ActivityModal verbs Role-Play Activity
Modal verbs Role-Play Activityemptylahh
 
Introduce prefixes suffixes roots affixes power point
Introduce prefixes suffixes roots affixes power pointIntroduce prefixes suffixes roots affixes power point
Introduce prefixes suffixes roots affixes power pointDaphna Doron
 

Destaque (11)

Prefixes dis
Prefixes disPrefixes dis
Prefixes dis
 
Discourse analysis
Discourse analysisDiscourse analysis
Discourse analysis
 
An Approach to Automated Learning of Conceptual Graphs from Text
An Approach to Automated Learning of Conceptual Graphs from TextAn Approach to Automated Learning of Conceptual Graphs from Text
An Approach to Automated Learning of Conceptual Graphs from Text
 
Affixes
AffixesAffixes
Affixes
 
A Sample of CDA
A Sample of CDAA Sample of CDA
A Sample of CDA
 
SECTION 11- GENERALIZATION AND INTERPRETATION OF RESULTS
SECTION 11- GENERALIZATION AND INTERPRETATION OF RESULTSSECTION 11- GENERALIZATION AND INTERPRETATION OF RESULTS
SECTION 11- GENERALIZATION AND INTERPRETATION OF RESULTS
 
Affixes
AffixesAffixes
Affixes
 
Content analysis
Content analysisContent analysis
Content analysis
 
Critical Discourse Analysis adel thamery
 Critical Discourse Analysis adel thamery Critical Discourse Analysis adel thamery
Critical Discourse Analysis adel thamery
 
Modal verbs Role-Play Activity
Modal verbs Role-Play ActivityModal verbs Role-Play Activity
Modal verbs Role-Play Activity
 
Introduce prefixes suffixes roots affixes power point
Introduce prefixes suffixes roots affixes power pointIntroduce prefixes suffixes roots affixes power point
Introduce prefixes suffixes roots affixes power point
 

Semelhante a Practical Text Mining with SQL using Relational Databases

CSI-globalVCard-Whitepaper-Whats-holding-your-business-back
CSI-globalVCard-Whitepaper-Whats-holding-your-business-backCSI-globalVCard-Whitepaper-Whats-holding-your-business-back
CSI-globalVCard-Whitepaper-Whats-holding-your-business-backDavid Disque
 
9,000 Ways to Optimize Outcomes in Financial Services
9,000 Ways to Optimize Outcomes in Financial Services9,000 Ways to Optimize Outcomes in Financial Services
9,000 Ways to Optimize Outcomes in Financial ServicesPrecisely
 
Data MiningData MiningData MiningData Mining
Data MiningData MiningData MiningData MiningData MiningData MiningData MiningData Mining
Data MiningData MiningData MiningData Miningabdulraqeebalareqi1
 
Heropay_Ultimate-guide-booklet_Final
Heropay_Ultimate-guide-booklet_FinalHeropay_Ultimate-guide-booklet_Final
Heropay_Ultimate-guide-booklet_FinalAdam J. Rebolloso
 
Everything You Need to Know About Taking Plastic
Everything You Need to Know About Taking PlasticEverything You Need to Know About Taking Plastic
Everything You Need to Know About Taking PlasticBusiness.com
 
Everything You Need to Know About Virtual Credit Cards
Everything You Need to Know About Virtual Credit CardsEverything You Need to Know About Virtual Credit Cards
Everything You Need to Know About Virtual Credit CardsRon Griswold
 
Comdata Overview
Comdata OverviewComdata Overview
Comdata Overviewappointlink
 
Ceridian E-Payables Solution
Ceridian E-Payables SolutionCeridian E-Payables Solution
Ceridian E-Payables Solutionscottymiller
 
Overwhelmed with data from different sources and systems?
Overwhelmed with data from different sources and systems?Overwhelmed with data from different sources and systems?
Overwhelmed with data from different sources and systems?Acquia
 
Subscribed 2013 Sydney Keynote
Subscribed 2013 Sydney Keynote Subscribed 2013 Sydney Keynote
Subscribed 2013 Sydney Keynote Zuora, Inc.
 
4 Reasons Why CFOs Should Rethink B2B Accounts Receivable Payments
4 Reasons Why CFOs Should Rethink B2B Accounts Receivable Payments4 Reasons Why CFOs Should Rethink B2B Accounts Receivable Payments
4 Reasons Why CFOs Should Rethink B2B Accounts Receivable PaymentsWilliamJames346254
 
Abn amro altares Marijne le Comte
Abn amro altares Marijne le ComteAbn amro altares Marijne le Comte
Abn amro altares Marijne le ComteBigDataExpo
 
Marketing Network presentation: Why marketers need to be concerned with data ...
Marketing Network presentation: Why marketers need to be concerned with data ...Marketing Network presentation: Why marketers need to be concerned with data ...
Marketing Network presentation: Why marketers need to be concerned with data ...KETL Limited
 
Lidma this is now! 2006
Lidma this is now! 2006Lidma this is now! 2006
Lidma this is now! 2006Todd Ewing
 
Lidma this is Now! 2006
Lidma this is Now! 2006Lidma this is Now! 2006
Lidma this is Now! 2006Todd Ewing
 
Profisee_Ebook_MasterDataWhatWhyHow_11x8.5.pdf
Profisee_Ebook_MasterDataWhatWhyHow_11x8.5.pdfProfisee_Ebook_MasterDataWhatWhyHow_11x8.5.pdf
Profisee_Ebook_MasterDataWhatWhyHow_11x8.5.pdfssuser2ae7ea2
 
Referral Partner Program
Referral Partner ProgramReferral Partner Program
Referral Partner ProgramVincent_Mills
 

Semelhante a Practical Text Mining with SQL using Relational Databases (20)

CSI-globalVCard-Whitepaper-Whats-holding-your-business-back
CSI-globalVCard-Whitepaper-Whats-holding-your-business-backCSI-globalVCard-Whitepaper-Whats-holding-your-business-back
CSI-globalVCard-Whitepaper-Whats-holding-your-business-back
 
9,000 Ways to Optimize Outcomes in Financial Services
9,000 Ways to Optimize Outcomes in Financial Services9,000 Ways to Optimize Outcomes in Financial Services
9,000 Ways to Optimize Outcomes in Financial Services
 
Data MiningData MiningData MiningData Mining
Data MiningData MiningData MiningData MiningData MiningData MiningData MiningData Mining
Data MiningData MiningData MiningData Mining
 
Heropay_Ultimate-guide-booklet_Final
Heropay_Ultimate-guide-booklet_FinalHeropay_Ultimate-guide-booklet_Final
Heropay_Ultimate-guide-booklet_Final
 
Everything You Need to Know About Taking Plastic
Everything You Need to Know About Taking PlasticEverything You Need to Know About Taking Plastic
Everything You Need to Know About Taking Plastic
 
Everything You Need to Know About Virtual Credit Cards
Everything You Need to Know About Virtual Credit CardsEverything You Need to Know About Virtual Credit Cards
Everything You Need to Know About Virtual Credit Cards
 
Comdata overview
Comdata overviewComdata overview
Comdata overview
 
Comdata Overview
Comdata OverviewComdata Overview
Comdata Overview
 
Ceridian E-Payables Solution
Ceridian E-Payables SolutionCeridian E-Payables Solution
Ceridian E-Payables Solution
 
Lighthouse Rabrochure1
Lighthouse Rabrochure1Lighthouse Rabrochure1
Lighthouse Rabrochure1
 
Overwhelmed with data from different sources and systems?
Overwhelmed with data from different sources and systems?Overwhelmed with data from different sources and systems?
Overwhelmed with data from different sources and systems?
 
Subscribed 2013 Sydney Keynote
Subscribed 2013 Sydney Keynote Subscribed 2013 Sydney Keynote
Subscribed 2013 Sydney Keynote
 
4 Reasons Why CFOs Should Rethink B2B Accounts Receivable Payments
4 Reasons Why CFOs Should Rethink B2B Accounts Receivable Payments4 Reasons Why CFOs Should Rethink B2B Accounts Receivable Payments
4 Reasons Why CFOs Should Rethink B2B Accounts Receivable Payments
 
Abn amro altares Marijne le Comte
Abn amro altares Marijne le ComteAbn amro altares Marijne le Comte
Abn amro altares Marijne le Comte
 
Marketing Network presentation: Why marketers need to be concerned with data ...
Marketing Network presentation: Why marketers need to be concerned with data ...Marketing Network presentation: Why marketers need to be concerned with data ...
Marketing Network presentation: Why marketers need to be concerned with data ...
 
Lidma this is now! 2006
Lidma this is now! 2006Lidma this is now! 2006
Lidma this is now! 2006
 
Lidma this is Now! 2006
Lidma this is Now! 2006Lidma this is Now! 2006
Lidma this is Now! 2006
 
B2B credit card processing mistakes
B2B credit card processing mistakesB2B credit card processing mistakes
B2B credit card processing mistakes
 
Profisee_Ebook_MasterDataWhatWhyHow_11x8.5.pdf
Profisee_Ebook_MasterDataWhatWhyHow_11x8.5.pdfProfisee_Ebook_MasterDataWhatWhyHow_11x8.5.pdf
Profisee_Ebook_MasterDataWhatWhyHow_11x8.5.pdf
 
Referral Partner Program
Referral Partner ProgramReferral Partner Program
Referral Partner Program
 

Último

Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 

Último (20)

Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
A Domino Admins Adventures (Engage 2024)

Practical Text Mining with SQL using Relational Databases

  • 1. PRACTICAL TEXT MINING WITH SQL USING RELATIONAL DATABASES Ralph Winters Data Architect, Actuarial Business Intelligence EmblemHealth June 5th, 2013 11th Annual Text and Social Analytics Summit Cambridge, MA
  • 2. RDMS TODAY Gartner - clients tell us that combining scored, processed ‘outside data’ with data inside our relational databases is where all the added value is. IDC - relational database management systems are expected to nearly double in market growth by 2016, driven by business intelligence demands and expanded adoption to tackle big data and unstructured information streams. The relational database management systems (RDBMS) market continues to confound the skeptics by maintaining strong growth characteristics despite the belief by some that the market has become 'saturated' or that it will be weakened by newer Big Data technologies. Inmon: listen carefully to the “big data” vendors and this is what you hear: “Let’s get rid of relational.” It is like courtiers in the castle whispering, “The king must die.” What’s going on here?
  • 3. Why a relational DB? Why a relational Database? Marry Structured + Unstructured Data More suitable for statistical analytics (matrices) Leverage existing familiar widespread technology Improving of predictive Models Referential Integrity Integrated Text/Data Mining
  • 4. Feasibility What do I need to know? Costs Benefits/Risks Industries Adding Value?
  • 5. RDMS File Interfaces (XML, CSV) ODBC/JDBC/DBI Text Vendor supplied Connector Hadoop Connectors (SAS, Oracle) Open Source Text Mining Tools (R, Java, Perl, LingPipe) In-Database Text Mining Algorithms (Oracle*Text, SAS Text Miner, SQL Server Text Mining) RDMS Internal/External Connections
  • 6. ANGRY Customer Comments Short Tailed Sampling Not for Long Tailed Data Comment - KardCo Premier Credit Card Promo Scam . I recently received an KardCo promo promising 25,000 bonus points if you sign up for the KardCo Premier Card and spend $2000 in the 1st three months. and so i call in and apply ...got APPROVED...two weeks later .. Posting on your site DEFINITELY HELPED (it was pointed out by retailer), and sped up response after 6 weeks of mulling around BEFORE we posted our complaint. $100 restaurant certificates 15 days ago I opened a cc w/ KardCo. I thought I did my research on which company is the best, boy was I wrong. I go to use my card for the 1st time lastnight & its declined. Ok.... I call KardCo from the store and I'm placed on hold for 20 mins. Finally I speak to an awful women who tells me my debt to income ratio is too high and I have too many inquires. I pull my credit report once I get home I pull the one from when I opened the card and the most recent one. My revolving debt $100, my credit score increased from 738 to 740 and 96% of my credit is currently available.... 1-800 Customer Service NOT LOCATED IN US! 2 years in a row they don't send me my rewards check
  • 7. Full Text Search Built in to many RDMS Needs Indexing Can be Slow Necessary in some Applications Complements Categorization Oracle: SELECT SCORE(1), comment, issue_date from custdb WHERE CONTAINS(text, 'APR', 1) > 0 AND issue_date >= ('01-OCT-97') ORDER BY SCORE(1) DESC; Operators: Like, Contains, Regex, Sounds Like, Distance Measures
  • 8. Term Doc Best 1 Customer 1 Service 1 Highly 2 Recommended 2 Parse Terms from Each Row Remove StopWords Cross Reference Document ID & Term Numbers Output New “Structured” Table Map Unstructured-to-Structured Doc Term1 Term2 Term3 Term4 1 The Best Customer Service 2 Is Highly Recommended “Wasted Space”
  • 9. Extended SQL User Defined Functions Stored Process Many Methods to Pivot Data select regexp_split_to_table(lower(line), '\s+') as word from customer_comments
  • 10. “Words” Table One Row for each term in Doc. Term Index Number “Document ID” Verbatim Term Index +1 Term Index - 1 Must handle Negation!
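A minimal sketch of the “Words” table described on slides 8 to 10, using Python's built-in sqlite3 module so it can run without a database server. SQLite has no regexp_split_to_table, so the split happens in Python here; the stop-word list, the helper name build_words_table, and the column names pos and pos_prev (standing in for the term index and the term index - 1) are assumptions for illustration, not names taken from the slides.

```python
import re
import sqlite3

# Illustrative stop-word list (an assumption; a real one would be much longer).
STOPWORDS = {"the", "is", "a", "an", "and", "to", "of"}

def build_words_table(conn, comments):
    """Create words(id, pos, pos_prev, word) with one row per retained term.

    comments is an iterable of (doc_id, text) pairs.
    """
    conn.execute(
        "CREATE TABLE words (id INTEGER, pos INTEGER, pos_prev INTEGER, word TEXT)"
    )
    for doc_id, text in comments:
        # Lower-case, split into terms, and drop stop words.
        tokens = [t for t in re.findall(r"[a-z0-9']+", text.lower())
                  if t not in STOPWORDS]
        # The term index starts at 1; pos_prev points at the preceding term.
        rows = [(doc_id, i, i - 1, tok) for i, tok in enumerate(tokens, start=1)]
        conn.executemany("INSERT INTO words VALUES (?, ?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
build_words_table(conn, [(1, "The Best Customer Service"),
                         (2, "Is Highly Recommended")])
rows = conn.execute("SELECT id, pos, word FROM words ORDER BY id, pos").fetchall()
print(rows)
```

Because stop words are dropped before indexing, terms that were separated by a stop word become neighbours. A negation word such as "not" would need to stay out of the stop list if negation is to be handled later.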
  • 11. Term document matrix Harder to do analysis in SQL Wasted Space Weight Terms Discard Terms
  • 12. Term Weighting in SQL • Calculate IDF: Log(Number of Docs / Number of Docs which contain term) • Calculate Term Freq: Number of times Term occurs in document • Calculate tfidf: Multiply IDF * TF, Sort by High values, Select Top N features create table num_docs as select count(distinct id) as value from WORDS; create table doc_freq as select word,count(distinct id) as value from WORDS group by word order by value; create table idf as select word,num_docs.value as numdocs,doc_freq.value as docfreq, log10(num_docs.value/doc_freq.value) as idf from doc_freq,WORK.num_docs order by idf; Words Table Top N Words Pivot on Rows
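The weighting steps on this slide can be sketched end to end against a small words table. This is an illustrative version in SQLite via Python, not the slide's SAS code: the sample rows are made up, log10 is registered from Python because not every SQLite build ships math functions, and the ratio is multiplied by 1.0 because SQLite (unlike SAS) truncates integer division.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# Register log10 explicitly; some SQLite builds lack built-in math functions.
conn.create_function("log10", 1, math.log10)

conn.executescript("""
CREATE TABLE words (id INTEGER, word TEXT);
INSERT INTO words VALUES
  (1,'gift'),(1,'card'),(1,'gift'),
  (2,'credit'),(2,'card'),
  (3,'customer'),(3,'service'),(3,'card');
""")

tfidf_sql = """
WITH num_docs AS (SELECT COUNT(DISTINCT id) AS value FROM words),
     doc_freq AS (SELECT word, COUNT(DISTINCT id) AS value
                  FROM words GROUP BY word),
     tf       AS (SELECT id, word, COUNT(*) AS value
                  FROM words GROUP BY id, word)
SELECT tf.id, tf.word,
       -- TF * IDF; the 1.0 forces floating-point division
       tf.value * log10(1.0 * num_docs.value / doc_freq.value) AS tfidf
FROM tf
JOIN doc_freq ON doc_freq.word = tf.word
CROSS JOIN num_docs
ORDER BY tfidf DESC, tf.id, tf.word
"""
scores = conn.execute(tfidf_sql).fetchall()
for doc_id, word, weight in scores:
    print(doc_id, word, round(weight, 3))
```

In this toy data "card" appears in every document, so its IDF is log10(3/3) = 0 and it drops to the bottom of the ranking, which is the behaviour the "Discard Terms" step relies on.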
  • 13. Top N Weighted Words Matrix – Ranked by Highest TF-IDF
  • 14. Generating Bigrams select a.ID, (compress(a.word) || ' ' || compress(b.word)) as pair from words a, words b where a.ID=b.ID and (a.no=b.no_prev) order by pair;
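A runnable sketch of the same self-join, written for SQLite through Python rather than SAS. The columns pos and pos_prev stand in for the slide's no and no_prev, the sample rows are invented, and a frequency count is added so the output resembles the bigram table on the next slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE words (id INTEGER, pos INTEGER, pos_prev INTEGER, word TEXT);
INSERT INTO words VALUES
  (1,1,0,'gift'),(1,2,1,'card'),(1,3,2,'declined'),
  (2,1,0,'gift'),(2,2,1,'card');
""")

bigram_sql = """
SELECT a.word || ' ' || b.word AS pair, COUNT(*) AS n
FROM words a
JOIN words b
  ON a.id = b.id            -- same document
 AND a.pos = b.pos_prev     -- b is the term immediately after a
GROUP BY pair
ORDER BY n DESC, pair
"""
bigrams = conn.execute(bigram_sql).fetchall()
print(bigrams)
```

Joining on the document ID keeps pairs from spanning two comments; the position condition keeps only adjacent terms.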
  • 15. Bigrams Output • Run Frequencies on Terms • Gift Card occurs more frequently than expected • Consider incorporating into Taxonomy SAMPLE BIGRAM COUNT EXPECTED COUNT Have Been 2326 Gift Card 2910 Called Kardo 2119 Kardco Card 3125 Customer Service 3630 Credit Card 2429 Member Since 1013 Credit Limit 1310 Starlight Card 115 Kardco Customer 86 Big Ram
  • 16. Do repeat callers signal Churn? • .. Research shows improved predictive Models performance Correlate with Satisfaction Scores Relevant Keywords First Call Responders pair Status Count satisfaction CUSTOMER SERVICE A 27 8.47 GIFT CARD A 25 8.34 KARDCO CARD A 24 8.79 CREDIT CARD A 15 8.62 WITH KARDCO A 13 8.28 TRANSFERRED AGAIN I 12 8.30 CREDIT LIMIT A 11 8.35 FROM KARDCO A 10 8.50 PREMIER CARD A 9 8.42 WITH KARDCO I 9 8.48 THREE MONTHS A 9 8.37 CUSTOMER SERVICE I 9 8.36
  • 17. Looking for the Repeat Callers select distinct comm1 from Customer_Comments where prxmatch("m/2nd|3rd|again|resolve/oi",comm1) > 0 Customer comment Sat Hotel cant resolve my dispute. I'm going to cancel 4 Never resolved. Still waiting for a call back 3 So Completely Unhappy with KardCo. It took 3 calls to the service center to finally resolve my billing problem 5 They gave me a 2nd chance to pay my bill 9 This complaint was never resolved to begin with 5 This is the 2nd year in a row that KardCo said they mailed my rewards refund that I have yet to recieve. Same Pattern every year, I stop getting paper statements in December even though I am signed up for them and I never get my Check. Then I mysteriously start getting paper statements again after the period they say they will cut the checks and tell me i am no longer eligable. 6 This is the 3rd time I have complained about this and I may have to take my business elsewhere! 4 Transferred again for the 2nd time. I can't believe it. What happened to Cindy? 1 When ever I compare customer service between companies KardCo is the PREMIER standard. They are on call 24 hours a day. Their operators are friendly and easy to speak with. They are always on the customers side and they always work at a situation until they resolve the issue. 10 Some false positives: terms “resolve” and “2nd” can be positive
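The repeat-caller search can be tried outside SAS with a small sketch like the one below. SQLite only supports the REGEXP operator when the application supplies a regexp function, so one is registered from Python here; the table contents are a shortened, made-up sample with a satisfaction column named sat.

```python
import re
import sqlite3

def regexp(pattern, text):
    """Backs SQLite's `text REGEXP pattern` operator (case-insensitive)."""
    if text is None:
        return 0
    return int(re.search(pattern, text, re.IGNORECASE) is not None)

conn = sqlite3.connect(":memory:")
conn.create_function("regexp", 2, regexp)

conn.execute("CREATE TABLE customer_comments (id INTEGER, comm1 TEXT, sat INTEGER)")
conn.executemany(
    "INSERT INTO customer_comments VALUES (?, ?, ?)",
    [
        (1, "Transferred again for the 2nd time.", 1),
        (2, "They gave me a 2nd chance to pay my bill", 9),
        (3, "Great rewards program", 8),
    ],
)

flagged = [row[0] for row in conn.execute(
    "SELECT id FROM customer_comments "
    "WHERE comm1 REGEXP '2nd|3rd|again|resolve' ORDER BY id"
)]
print(flagged)
```

Comment 2 is flagged even though its satisfaction score is 9, which is the kind of false positive the slide warns about; joining the matches back to the score is one way to screen them.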
  • 18. Satisfaction Score Outstanding Balance Predict Churn Churn Improves Implement New Scripts for call center Number of Times Called Select all comments with “Gift Card” Insert Keys into Model Table Join new Model with existing model tables How Text Analytics can improve Predictive Model
  • 19. STANDARD CLASSIFICATIONS Advertising and marketing Credit determination Application processing delay Credit line Increase/decrease APR or interest rate Credit reporting Arbitration Customer service / Customer relations Balance transfer Delinquent account Balance transfer fee Forbearance / Workout plans Bankruptcy Identity theft / Fraud / Embezzlement Billing disputes Late fee Billing statement Other Cash advance Other fee Cash advance fee Overlimit fee Closing/Cancelling account Payoff process Collection debt dispute Privacy Collection practices Rewards Convenience checks Sale of account Credit card protection / Debt protection Transaction issue Unsolicited issuance of credit card Add “Gift Card” as a Classification
  • 20. “Tweak” Taxonomy Apply Auto Classification Evaluate according to GOLD Standard Apply CRISP or SEMMA Methodology and Repeat Validation CAT Count Customer Service Baseline Average Spend ADV 15 15 15,483 APR 12 12 13,308 BANKRUPT 1 1 13,108 BILLDISP 6 6 12,682 BILLSTAT 6 6 10,617 COLL 1 1 17,720 CUSTSERV 25 25 14,725 DELAY 1 1 13,334 FRAUD 13 13 15,162 GIFTCARD 18 18 16,107 LATEFEE 3 3 18,989 LINEADJ 4 4 13,762 OTHER 125 125 18,482 OTHERFEE 15 15 10,153 PROT 1 1 17,808 REFUND 2 2 16,473 REWARDS 10 10 10,918 TRANS 1 1 14,224 TRAVEL 8 8 10,355 “There is no globally best method for (automated) text analysis”
  • 21. Other Types of Classification select id, comm1, case when compged('High Interest Rate APR', comm1) < 300 then 'APR' when compged('Best Customer Service', comm1) < 300 then 'DELIGHT' else 'OTHER' end as CAT from CUSTOMER_COMMENTS Classify by Keyword Pairs Regular Expressions Boolean Distance Functions Fuzzy Matching Regex Bayesian algorithms
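Where a fuzzy distance function such as SAS's compged is not available, the simpler keyword option from this slide can be sketched with a CASE expression and LIKE. This swaps keyword matching in for the edit-distance test shown on the slide; the category codes follow the taxonomy table on slide 20, and the sample comments are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_comments (id INTEGER, comm1 TEXT);
INSERT INTO customer_comments VALUES
  (1, 'The APR on this card is far too high'),
  (2, 'Best customer service I have had'),
  (3, 'My gift card never arrived'),
  (4, 'Card was declined at the store');
""")

classify_sql = """
SELECT id,
       CASE
         WHEN lower(comm1) LIKE '%apr%'
           OR lower(comm1) LIKE '%interest rate%'   THEN 'APR'
         WHEN lower(comm1) LIKE '%customer service%' THEN 'CUSTSERV'
         WHEN lower(comm1) LIKE '%gift card%'        THEN 'GIFTCARD'
         ELSE 'OTHER'
       END AS cat
FROM customer_comments
ORDER BY id
"""
categories = conn.execute(classify_sql).fetchall()
print(categories)
```

The first matching branch wins, so rule order matters. Substring patterns are crude: '%apr%' would also match "April", so results still need checking against a hand-labelled sample.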
  • 22. Sentiment – Can be easy, can be hard! Words Table Join to Polarity Dictionary Assign +1 to Positive /-1 to Negative Sentiment Score Use Top N Weighted Terms Use First and Last Sentences Vector Size CPU? Complexity Normalized Use In-Memory Lookups Customized Dictionary Bayesian Classifier in SQL
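The "easy" end of this slide, joining the Words table to a polarity dictionary and summing +1/-1 per document, might look like the sketch below. The polarity table, its contents, and the sample words are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE words (id INTEGER, word TEXT);
CREATE TABLE polarity (word TEXT PRIMARY KEY, score INTEGER);
INSERT INTO words VALUES
  (1,'friendly'),(1,'easy'),(1,'operators'),
  (2,'awful'),(2,'declined'),(2,'hold'),
  (3,'statement');
INSERT INTO polarity VALUES
  ('friendly', 1), ('easy', 1), ('awful', -1), ('declined', -1);
""")

sentiment_sql = """
SELECT w.id,
       COALESCE(SUM(p.score), 0) AS sentiment  -- words not in the dictionary add 0
FROM words w
LEFT JOIN polarity p ON p.word = w.word
GROUP BY w.id
ORDER BY w.id
"""
sentiment = conn.execute(sentiment_sql).fetchall()
print(sentiment)
```

The LEFT JOIN keeps documents with no dictionary hits (they score 0). This version ignores negation and comment length, so "not friendly" would still count as positive; normalizing by term count and using the neighbouring-term index are the natural next steps.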
  • 23. Sentiment – Correlation CAT Count Average Spend Satisfaction Neg Pct Not Neg Pct ADV 15 15,483 7.5 49 51 APR 12 13,308 7.2 72 28 FRAUD 13 15,162 5.2 61 39 GIFTCARD 18 16,107 8.9 24 76 LATEFEE 3 18,989 7.0 12 88 Correlating Sentiment Scores with other database metrics can support a hypothesis