Text Analytics at Scale
Listening to 45 Million Customers
Heather Wasserlein, Intuit
STRATA Hadoop World, Oct 30, 2013
We’ve all been here..

On the phone with customer support

Can anyone hear me?

It’s extremely frustrating

Employees are eager to help

So, why the gap?

Many touch points

User Intent

User Feedback
Overwhelming data volumes
You can read a few 1000 customer comments, but not millions.
And, new themes come up every day..

You can pull a “top 1000” list, but..
Is it telling you anything new? Actionable?
Long tail of verbatims:
Top: hello, help, call, login
Mid: password, cant find pwd, account, multiple accounts, print, import error 5514, phone, printing blank page, phone number, call customer sevice, change password, charged twice cancel
Tail: print function not working new version of IE, error msg 87956; please call back at 555-555-5555
Insights often in the tail
Needle-in-the-haystack problem – valuable details hidden in descriptive, tail verbatims
Related topics dispersed
The “top 1000” can be misleading – the most common verbatims may not represent the most common themes
What is text analytics?
With numeric data, you can run summary stats; summarizing textual data is more complex.

Statistics + Linguistics

You can mix and match various statistical and linguistic tools, depending on the problem.
Example – forensic linguistics

Same author?
Case Studies
Applying text analytics
to simple and complex problems
at Travelocity, Yahoo! and Intuit

Travelocity search

Where is Albekerke?
San Jose
San Jose, CA
San Jose, Costa Rica
San Jose Intl Airport

NY
NYC
JFK
New York, NY, USA
NY, New York

Grand Canyon
Disneyland
Home
Travelocity search solution
Finite set of airports, but many variations in search:
San Jose
San Jose, CA
San Jose International
Mineta San Jose Airport
San Josee Airport
Silicon Valley
SJC
→ SJC

Simple, but manually intensive solution –
Mapping of all known search variations to relevant airport codes, plus Soundex phonetic matching to catch unforeseen misspellings.
“Rules-based” approach: no statistics, minimal linguistics (sounds)
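The rules-based approach above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Travelocity's implementation: the lookup table, the `resolve_airport` and `phonetic_key` helpers, and the Albuquerque entry are hypothetical, and the phonetic step uses a plain American Soundex applied word by word.

```python
# Hypothetical lookup table: known search variations mapped to airport codes.
KNOWN_VARIATIONS = {
    "san jose": "SJC",
    "san jose, ca": "SJC",
    "san jose international": "SJC",
    "mineta san jose airport": "SJC",
    "silicon valley": "SJC",
    "sjc": "SJC",
    "albuquerque": "ABQ",
}

# Soundex digit for each consonant; vowels, h, w and y have no digit.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """American Soundex: first letter followed by three digits."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # h and w do not separate repeated codes
            prev = code
    return (result + "000")[:4]


def phonetic_key(query):
    """One Soundex code per word, so 'San Josee' and 'San Jose' collide."""
    return tuple(soundex(w) for w in query.replace(",", " ").split())


PHONETIC_INDEX = {phonetic_key(k): v for k, v in KNOWN_VARIATIONS.items()}


def resolve_airport(query):
    """Exact match on the known variations first, phonetic fallback second."""
    normalized = " ".join(query.lower().split())
    if normalized in KNOWN_VARIATIONS:
        return KNOWN_VARIATIONS[normalized]
    return PHONETIC_INDEX.get(phonetic_key(normalized))
```

With this table, `resolve_airport("San Josee")` and `resolve_airport("Albekerke")` fall through to the phonetic index, while a query such as "Grand Canyon" returns `None` and would need a manual mapping.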
Yahoo! web site classification

Is this site clean?
Does it contain any illegal
or sensitive content?
alcohol
tobacco
drug
online gambling
violence or weapons
adult content
Does the web site meet
advertiser standards?

Yahoo! web site classification solution
Verbose, rapidly-changing data, but finite set of topics.
100,000’s of web sites in Y! and partner Ad Networks.
Training data (human-labeled)
5K positive examples

30K negative examples

Multiple approaches –
Classifiers, keyword matching, image
matching, and human-review process.


Supervised machine learning
Pattern detection, phrases and contexts
associated with finite set of “risk categories.”
Emphasis on recall, catching true positives.
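To make the emphasis on recall concrete, here is a minimal sketch of a supervised text classifier with an adjustable decision threshold. The talk does not say which classifier Yahoo! used; a small multinomial Naive Bayes is swapped in here purely for illustration, and the function names, the toy training data, and the thresholds are all assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def train(labeled_docs):
    """labeled_docs: list of (text, is_risky). Needs at least one doc per class."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab


def risk_probability(model, text):
    """Posterior probability that the text belongs to the risky class."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_score = {}
    for label in (True, False):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            if word in vocab:  # unseen words are ignored
                # Laplace (add-one) smoothing
                score += math.log((word_counts[label][word] + 1)
                                  / (total_words + len(vocab)))
        log_score[label] = score
    top = max(log_score.values())
    exp_true = math.exp(log_score[True] - top)
    exp_false = math.exp(log_score[False] - top)
    return exp_true / (exp_true + exp_false)


def recall(model, labeled_docs, threshold):
    """Share of truly risky docs that score at or above the threshold."""
    positives = [text for text, label in labeled_docs if label]
    caught = sum(1 for text in positives
                 if risk_probability(model, text) >= threshold)
    return caught / len(positives)


# Toy, made-up training data for illustration only.
TRAINING = [
    ("online casino poker bonus", True),
    ("cheap cigarettes and vodka", True),
    ("family recipes and gardening tips", False),
    ("local news and weather", False),
]
MODEL = train(TRAINING)
```

Lowering the threshold can only keep or raise recall, at the cost of more false positives, which is one reason a recall-first classifier is usually paired with a human-review step like the one mentioned above.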
Intuit tax support

Adjusted cost basis?

Intuit tax support solution
Millions of questions daily, of all types.
Google-like search, but often in natural language.
PIN number
Where can I find my PIN?
Newly married, file jointly
File married or separately?
Home mortgage deduction
Can I deduct my dog?
Why is 1099-int import slow?
Where’s my refund??
Solution –
Clustering of site searches, topic “discovery”.
Example topics: PIN, file married, deduct, 1099int import, refund

Unsupervised machine learning
Statistics and linguistics. Part-of-speech tagging. Detection of words that “go together more often than not”.
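One common way to detect words that "go together" is a log-likelihood ratio test on bigram counts. The sketch below is an assumption about how such a test could look, not Intuit's code: it compares an independence hypothesis against a dependence hypothesis for each adjacent word pair, and the function names and the 3.84 cutoff (the chi-square critical value at p = 0.05 with one degree of freedom) are illustrative choices.

```python
import math
from collections import Counter


def log_l(k, n, x):
    """Log binomial likelihood of k hits in n trials (constant term dropped)."""
    x = min(max(x, 1e-12), 1 - 1e-12)  # keep log() finite
    return k * math.log(x) + (n - k) * math.log(1 - x)


def llr(c12, c1, c2, n):
    """-2 log(lambda) for the bigram (w1, w2).

    c12 = count of 'w1 w2', c1 = count of w1, c2 = count of w2, n = tokens.
    """
    p = c2 / n                    # independent: w2 equally likely anywhere
    p1 = c12 / c1                 # dependent: rate of w2 after w1
    p2 = (c2 - c12) / (n - c1)    # dependent: rate of w2 elsewhere
    log_independent = log_l(c12, c1, p) + log_l(c2 - c12, n - c1, p)
    log_dependent = log_l(c12, c1, p1) + log_l(c2 - c12, n - c1, p2)
    return 2 * (log_dependent - log_independent)


def significant_bigrams(searches, threshold=3.84):
    """Return {(w1, w2): score} for pairs that co-occur more than chance."""
    unigrams, bigrams = Counter(), Counter()
    for search in searches:
        tokens = search.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n = sum(unigrams.values())
    scored = {}
    for (w1, w2), c12 in bigrams.items():
        c1, c2 = unigrams[w1], unigrams[w2]
        if w1 == w2 or c1 >= n:   # degenerate cases, skipped in this sketch
            continue
        score = llr(c12, c1, c2, n)
        # Keep only positive association (w2 more likely after w1).
        if score > threshold and c12 / c1 > c2 / n:
            scored[(w1, w2)] = score
    return scored


# Toy, made-up search log for illustration only.
SEARCHES = (["social security"] * 10 + ["file extension"] * 10
            + ["refund check"] * 10 + ["file taxes"] * 10
            + ["state refund"] * 10)
PHRASES = significant_bigrams(SEARCHES)
```

Because the dependent model always fits at least as well as the independent one, the score is compared against a cutoff rather than against zero; the cutoff and the minimum counts would need tuning on real data.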
Results for 3 algorithms
LDA (bag of words)
– File, free, taxes
– File, extension, get
– File, security, social
– Income, state, business
– Payment, state, filed
– State, refund, check

Lingo (hierarchical clustering)
– File
  – File 2012
  – File an extension
  – File state
– Deduction
  – Deduction car
  – Deduction sales tax
  – Deduction standard

Custom (n-gram clustering, in-house solution)
– File extension
– Social security
– Business income
– Sales tax deduction
– Refund check
– Payment
Words + numbers = insights
– Emerging topics: Late legislation, File extension, Error 576, etc.
– Funnel analysis: Enter w2, Import error.., Taxes done!
– Trending & (pre) segmentation: Refund, deduct
– Sentiment
Use Cases
Product Managers
1. User needs
– Identify product enhancements
– Rapidly diagnose product defects
– Tune site search
– Personalize content

Customer Care
1. Common questions
– Train agents & staff appropriately
2. Emerging issues
– Early insight to new issues
3. Call routing

Marketing
1. Segment by VOC
– Address common questions to retain users
– Segment by sentiment and empower promoters
2. Customer dialogue
– Listen to feedback & respond 1:1 or 1:many
Our journey

Y1: Proof of concept
– Science project
– Clustering 2M searches, 2-day lag
– Vocal early adopters
– Site search & FAQ tuning

Y2: Productize
– Transfer from science to eng
– Data volume grew, system crawled
– Scaled to 15M searches, 1-day lag
– Campaign to grow adoption
– Report email
– Emerging issues detection

Y3: Scale..!
– Scaled to 30M searches, next day 9am SLA
– Viral adoption, 50+ users
– X-functional “VOC team” meets weekly
– 100’s items actioned, $10M’s value
– 2 new products enabled
Scaling
Reduce problem size
1. Pre-process
– de-dup
– remove PII, system-generated info, etc.
– remove stop words
– map synonyms
– stemming
2. Reduce data size
– sample
– segment
– narrow time period
– remove tail terms (cautiously)

Add hardware
1. Add memory
– text clustering is memory constrained
– verbose text is harder
2. Distribute processes
– rule-based categorization scales linearly
– clustering of segments can be run in parallel
– data sourcing
– pre-processing

Optimize algorithm
1. Tradeoffs & tuning
– Choose approach to balance accuracy vs. performance
– Tune algorithm parameters
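The pre-processing steps listed under "Reduce problem size" can be sketched as a small pipeline. Everything specific here is an assumption for illustration: the stop-word and synonym tables are tiny hypothetical samples, the phone-number pattern covers only one US-style format, and `crude_stem` is a naive suffix stripper standing in for a real stemmer.

```python
import re

# Hypothetical, deliberately tiny resources for illustration.
STOP_WORDS = {"a", "an", "the", "my", "is", "at", "to", "please", "i"}
SYNONYMS = {"pwd": "password", "msg": "message"}
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def crude_stem(word):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(verbatims):
    """Normalize each verbatim, then drop empty results and duplicates."""
    seen, cleaned = set(), []
    for text in verbatims:
        text = re.sub(r"['’]", "", text.lower())   # "can't" -> "cant"
        text = PHONE_RE.sub(" ", text)              # remove PII (phone numbers)
        tokens = re.findall(r"[a-z0-9]+", text)
        tokens = [SYNONYMS.get(t, t) for t in tokens if t not in STOP_WORDS]
        tokens = [crude_stem(t) for t in tokens]
        normalized = " ".join(tokens)
        if normalized and normalized not in seen:   # de-dup after normalizing
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned


SAMPLE = [
    "cant find pwd",
    "Can't find my password",
    "please call back at 555-555-5555",
    "printing blank page",
]
CLEANED = preprocess(SAMPLE)
```

De-duplicating after normalization rather than before is a design choice: it lets "cant find pwd" and "Can't find my password" collapse into one entry, which shrinks the clustering input further.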
Results
1. Faster time to insights
– Customer issues detected up to 1 week earlier
– Search is a leading indicator for call drivers – a canary in the coal mine

2. Better customer experience
– Using text insights to tune search results improved relevancy
– Identifying users with common questions made it possible to personalize the experience
– VOC data + user behavior led to a whole new understanding of product use

3. $10’s of millions in revenue
– Detecting and resolving customer pain points generated $10’s of millions
Getting started?
1. Read a sample of verbatims + scope the problem
– Topic discovery or known topics?
– Sources of text and verbosity (few words, sentences, pages)?
– Estimate data volumes and define SLA’s

2. Build vs. buy
– Compare tools, build proofs of concept
– Compare results relative to a “golden set”

3. Start small
– One data source, non-verbose text, small volumes
– 1000’s of documents for statistically valid results
– Beta test reporting, QA topic-verbatim fit

4. Establish business processes
– X-functional process to action insights, let reports go viral
Scale and incorporate domain knowledge later (“phase 2”)
Long story short

Listen.
To everyone!

Words
+ Numbers
=
Insights

Apply the
right tools for
the job
Thank You!
@heatherwater
@IntuitInc

Appendix

“Home grown” Algorithm
Unsupervised machine learning / clustering
1. Identify candidate phrases
– Sparse: Identify all combinations of bi-grams, tri-grams, four-grams
– Verbose: Use linguistic approaches to identify phrases
• Split text into sentences + identify part-of-speech for each word (noun, adj, etc.)
• Apply linguistic filters to parse candidate phrases (adj noun, verb adv, etc.)

2. Determine which phrases are “significant”
– Count word frequencies and calculate likelihood ratios
• L1 = words are independent, L2 = words are dependent
• If L2 > L1, the words appear together more often than not

3. Cluster related topics
– Represent n-grams and searches as vectors, calculate cosine similarity, and cluster related topics when similarity > pre-defined threshold

4. Identify topic “title”
– Construct “title” representative of the cluster (e.g. most common search)
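Steps 3 and 4 can be sketched as follows. This is a simplified illustration, not the in-house algorithm itself: each search becomes a bag of unigrams and bigrams, a greedy single pass assigns it to the most similar existing cluster, and the 0.3 threshold, the centroid update, and all function names are assumptions.

```python
import math
from collections import Counter


def vectorize(search):
    """Bag of unigrams plus bigrams for one search string."""
    tokens = search.lower().split()
    bigrams = Counter(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return Counter(tokens) + bigrams


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def cluster(searches, threshold=0.3):
    """Greedy clustering: most frequent searches seed the clusters."""
    clusters = []
    for search, count in Counter(searches).most_common():
        vec = vectorize(search)
        best, best_sim = None, threshold
        for candidate in clusters:
            sim = cosine(vec, candidate["centroid"])
            if sim > best_sim:
                best, best_sim = candidate, sim
        if best is None:  # nothing similar enough: start a new cluster
            best = {"centroid": Counter(), "members": Counter()}
            clusters.append(best)
        for key, value in vec.items():
            best["centroid"][key] += value * count
        best["members"][search] += count
    return clusters


def title(one_cluster):
    """Step 4: use the most common search as the cluster title."""
    return one_cluster["members"].most_common(1)[0][0]


# Toy, made-up search log for illustration only.
LOG = (["file extension"] * 3 + ["how to file extension"]
       + ["refund check"] * 2 + ["where is my refund check"]
       + ["social security"])
TITLES = sorted(title(c) for c in cluster(LOG))
```

On this toy log the longer natural-language searches join the cluster seeded by their short form, giving three topics titled "file extension", "refund check" and "social security". A single greedy pass is order-dependent, so a production system would likely need a more robust clustering step.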
What’s next for text at Intuit?
1. Finalize evaluation of new algorithms (e.g. Lingo3G, LDA, etc.)
2. Scale through distributed processing (i.e. move to Hadoop)
3. Support more types of text (e.g. verbose)
4. Continue to integrate topics & usage data for complete picture of end-to-end user experience
5. Provide text analytics as a service
6. Semantic search
7. Internationalization (future)


Strata 2013: Text Analytics at Scale

  • 1. Text Analytics at Scale Listening to 45 Million Customers Heather Wasserlein, Intuit STRATA Hadoop World, Oct 30, 2013
  • 2. We’ve all been here.. 2
  • 3. On the phone with customer support 3
  • 6. Employees are eager to help So, why the gap? 6
  • 7. Many touch points User Intent 7 User Feedback
  • 8. Overwhelming data volumes You can read a few 1000 customer comments, but not millions. And, new themes come up every day.. 8
  • 9. You can pull a “top 1000” list, but.. Is it telling you anything new? Actionable? Top hello help call login Mid password cant find pwd account multiple accounts print import error 5514 phone printing blank page phone number call customer sevice change password charged twice cancel Long tail 10 Tail print function not working new version of IE, error msg 87956 please call back at 555-555-5555
  • 10. Insights often in the tail Top Needle-in-the-haystack problem – valuable details hidden in descriptive, tail verbatims hello help call login Mid password cant find pwd account multiple accounts print import error 5514 phone printing blank page phone number call customer sevice print function not working version of IE, error msg 87956 change password charged twice cancel please call back at 555-555-5555 Long tail 11 Tail
  • 11. Related topics dispersed Top The “top 1000” can be misleading – the most common verbatims may not represent the most common themes hello help call login Mid password cant find pwd account multiple accounts print import error 5514 phone printing blank page phone number call customer sevice print function not working new version of IE, error msg 87956 change password charged twice cancel please call back at 555-555-5555 Long tail 12 Tail
  • 12. What is text analytics? With numeric data, you can run summary stats summarizing textual data is more complex Statistics + Linguistics 13 You can mix and match various statistical and linguistic tools, depending on the problem
  • 13. Example – forensic linguistics Same author? 14
  • 14. Case Studies Applying text analytics to simple and complex problems at Travelocity, Yahoo! and Intuit 15
  • 15. Travelocity search Where is Albekerke? San San San San Jose Jose, CA Jose, Costa Rica Jose Intl Airport NY NYC JFK New York, NY, USA NY, New York Grand Canyon Disneyland 16 Home
  • 16. Travelocity search solution Finite set of airports, but many variations in search San Jose San Jose, CA San Jose International Mineta San Jose Airport San Josee Airport Silicon Valley SJC SJC Simple, but manually intensive solution – Mapping of all known search variations to relevant airport codes. Plus, sound-ex phonetic matching to catch unforeseen misspellings. “Rules-based” approach no statistics, minimal linguistics (sounds) 17
  • 17. Yahoo! web site classification Is this site clean? Does it contain any illegal or sensitive content? alcohol tobacco drug online gambling violence or weapons adult content Does the web site meet advertiser standards? 18
  • 18. Yahoo! web site classification solution Verbose, rapidly-changing data, but finite set of topics. 100,000’s of web sites in Y! and partner Ad Networks. Training data (human-labeled) 5K positive examples 30K negative examples Multiple approaches – Classifiers, keyword matching, image matching, and human-review process. 19 Supervised machine learning Pattern detection, phrases and contexts associated with finite set of “risk categories.” Emphasis on recall, catching true positives.
  • 19. Intuit tax support. Adjusted cost basis?
  • 20. Intuit tax support solution. Millions of questions daily, of all types. Google-like search, but often in natural language. Example searches: PIN number; Where can I find my PIN?; Newly married, file jointly; File married or separately?; Home mortgage deduction; Can I deduct my dog?; Why is 1099-int import slow?; Where’s my refund?? Solution: clustering of site searches, topic “discovery”. Topic labels shown: PIN; file married; deduct; 1099int; import; refund. Unsupervised machine learning: statistics and linguistics, part-of-speech tagging, and detection of words that “go together more often than not”.
  • 21. Results for 3 algorithms. LDA (bag of words): “File, free, taxes”; “File, extension, get”; “File, security, social”; “Income, state, business”; “Payment, state, filed”; “State, refund, check”. Lingo (hierarchical clustering): File (File 2012; File an extension; File state); Deduction (Deduction car; Deduction sales tax; Deduction standard). Custom (n-gram clustering, in-house solution): File extension; Social security; Business income; Sales tax deduction; Refund check; Payment.
  • 22. Words + numbers = insights. [Diagram with four panels: Emerging Topics; Funnel Analysis; Trending & (pre) Segmentation; Sentiment. Example labels shown: Refund deduct; Late legislation; File extension; Error 576 etc.; Enter w2; Import error; Taxes done!]
  • 23. Use Cases. Product Managers: 1. User needs: identify product enhancements; rapidly diagnose product defects; tune site search; personalize content. Customer Care: 1. Common questions: train agents & staff appropriately. 2. Emerging issues: early insight to new issues. 3. Call routing. Marketing: 1. Segment by VOC: address common questions to retain users; segment by sentiment and empower promoters. 2. Customer dialogue: listen to feedback & respond 1:1 or 1:many.
  • 24. Our journey. Y1, Proof of concept: science project; clustering 2M searches, 2-day lag; vocal early adopters. Y2, Productize: transfer from science to eng; scaled to 15M searches, 1-day lag; campaign to grow adoption; report email. Y3, Scale..!: scaled to 30M searches, next day 9am SLA; viral adoption, 50+ users. Other milestones on the timeline: data volume grew, system crawled; emerging issues detection; site search & FAQ tuning; X-functional “VOC team” meets weekly; 2 new products enabled; 100’s items actioned, $10M’s value.
  • 25. Scaling. Reduce problem size: (1) Pre-process: de-dup; remove PII, system-generated info, etc.; remove stop words; map synonyms; stemming. (2) Reduce data size: sample; segment; narrow time period; remove tail terms (cautiously). Add hardware: (1) Add memory: text clustering is memory constrained; verbose text is harder. (2) Distribute processes: rule-based categorization scales linearly; clustering of segments can be run in parallel; data sourcing; pre-processing. Optimize algorithm: (1) Tradeoffs & tuning: choose an approach to balance accuracy vs. performance; tune algorithm parameters.
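The pre-processing steps listed on this slide (de-dup, remove PII, remove stop words, map synonyms, stemming) could look roughly like the sketch below. Everything in it is an assumption for illustration: the stop-word and synonym lists are tiny hypothetical samples, the phone-number pattern only covers one US format, and `naive_stem` is a crude stand-in for a real stemmer such as Porter's.

```python
import re

# Hypothetical, deliberately tiny resources; a real system would use fuller lists.
STOP_WORDS = {"a", "an", "the", "my", "i", "is", "to", "can", "where", "at", "please"}
SYNONYMS = {"pwd": "password", "passwd": "password"}
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")  # one simple PII pattern


def naive_stem(token):
    """Crude suffix stripping, standing in for a proper stemming algorithm."""
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def preprocess(text):
    """Lowercase, drop phone numbers and stop words, map synonyms, stem."""
    text = PHONE.sub(" ", text.lower())
    tokens = re.findall(r"[a-z0-9]+", text)
    tokens = [SYNONYMS.get(t, t) for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]


def preprocess_corpus(docs):
    """Pre-process every document and de-duplicate on the normalized tokens."""
    seen, result = set(), []
    for doc in docs:
        tokens = tuple(preprocess(doc))
        if tokens and tokens not in seen:
            seen.add(tokens)
            result.append(list(tokens))
    return result
```

De-duplicating after normalization, rather than on the raw strings, lets "Change password" and "change my password" collapse into one entry, which shrinks the input that the clustering step has to hold in memory.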
  • 26. Results. 1. Faster time to insights: customer issues detected up to 1 week earlier; search is a leading indicator for call drivers, a canary in the coal mine. 2. Better customer experience: using text insights to tune search results improved relevancy; identifying users with common questions made it possible to personalize the experience. 3. $10’s of millions in revenue: VOC data + user behavior led to a whole new understanding of product use; detecting and resolving customer pain points generated $10’s of millions.
  • 27. Getting started? 1. Read a sample of verbatims + scope the problem: Topic discovery or known topics? Sources of text and verbosity (few words, sentences, pages)? Estimate data volumes and define SLAs. 2. Build vs. buy: compare tools, build proofs of concept; compare results relative to a “golden set”. 3. Start small: one data source, non-verbose text, small volumes; 1000’s of documents for statistically valid results; beta test reporting, QA topic-verbatim fit. 4. Establish business processes: X-functional process to action insights, let reports go viral. Scale and incorporate domain knowledge later (“phase 2”).
  • 28. Long story short: Listen. To everyone! Words + Numbers = Insights. Apply the right tools for the job.
  • 31. “Home grown” algorithm: unsupervised machine learning / clustering. 1. Identify candidate phrases. Sparse text: identify all combinations of bi-grams, tri-grams, four-grams. Verbose text: use linguistic approaches to identify phrases: split text into sentences and identify the part of speech for each word (noun, adj, etc.), then apply linguistic filters to parse candidate phrases (adj noun, verb adv, etc.). 2. Determine which phrases are “significant”: count word frequencies and calculate likelihood ratios, where L1 = likelihood that the words are independent and L2 = likelihood that the words are dependent; if L2 > L1, the words appear together more often than not. 3. Cluster related topics: represent n-grams and searches as vectors, calculate similarity (cosine similarity), and cluster related topics when similarity > a pre-defined threshold. 4. Identify topic “title”: construct a “title” representative of the cluster (e.g. the most common search).
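Steps 2 to 4 of this slide can be sketched as follows. The talk does not give formulas, so this is one plausible reading rather than Intuit's code: the likelihood ratio is computed as Dunning's log-likelihood ratio for bigrams (binomial likelihoods under independence, L1, versus dependence, L2), the 10.83 cut-off is an assumed chi-square threshold, and the greedy single-pass clustering and all function names are hypothetical.

```python
import math
from collections import Counter


def _log_binom(k, n, x):
    """log of x^k * (1-x)^(n-k); x is clamped to avoid log(0)."""
    x = min(max(x, 1e-12), 1 - 1e-12)
    out = 0.0
    if k > 0:
        out += k * math.log(x)
    if n - k > 0:
        out += (n - k) * math.log(1 - x)
    return out


def llr(c1, c2, c12, n):
    """2 * (log L2 - log L1) for a bigram (w1, w2).

    c1, c2: counts of w1 and w2; c12: count of the bigram; n: total tokens.
    """
    p = c2 / n                    # L1: w2 is equally likely after any word
    p1 = c12 / c1                 # L2: P(w2 | w1)
    p2 = (c2 - c12) / (n - c1)    # L2: P(w2 | not w1)
    log_l1 = _log_binom(c12, c1, p) + _log_binom(c2 - c12, n - c1, p)
    log_l2 = _log_binom(c12, c1, p1) + _log_binom(c2 - c12, n - c1, p2)
    return 2.0 * (log_l2 - log_l1)


def significant_bigrams(docs, threshold=10.83):
    """Score bigrams from tokenized docs; keep those that co-occur unusually often."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n = sum(unigrams.values())
    kept = {}
    for (w1, w2), c12 in bigrams.items():
        c1, c2 = unigrams[w1], unigrams[w2]
        if n == c1:
            continue
        if c12 / c1 > (c2 - c12) / (n - c1):  # attraction, not repulsion
            score = llr(c1, c2, c12, n)
            if score > threshold:
                kept[(w1, w2)] = score
    return kept


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(searches, threshold=0.5):
    """Greedy pass: join the most similar cluster if above threshold, else start one."""
    clusters = []
    for text in searches:
        vec = Counter(text.lower().split())
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) > threshold:
            best["members"].append(text)
            best["centroid"].update(vec)
        else:
            clusters.append({"centroid": Counter(vec), "members": [text]})
    return clusters


def title(c):
    """Use the most common search in the cluster as its title."""
    return Counter(c["members"]).most_common(1)[0][0]
```

Because L2 is fitted to the observed counts, it is never smaller than L1, so in practice the size of the ratio (here compared against a threshold) is what separates real phrases such as "social security" from chance pairings.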
  • 32. What’s next for text at Intuit? 1. Finalize evaluation of new algorithms (e.g. Lingo3G, LDA, etc.). 2. Scale through distributed processing (i.e. move to Hadoop). 3. Support more types of text (e.g. verbose). 4. Continue to integrate topics & usage data for a complete picture of the end-to-end user experience. 5. Provide text analytics as a service. 6. Semantic search. 7. Internationalization (future).

Editor's Notes

  1. In a digital world, businesses give customers many channels to communicate throughout the end-to-end customer experience of shop, browse, buy, use, etc. Ideally, we’d “listen” equally well across all of these touch points. Yet much of the analytics focus is either upstream (e.g. search engines) or downstream (e.g. social media). This provides insight into user intent and feedback, but misses very important insights into customer experience with your product and services. For example, site search, customer support channels (call centers, chat) and communities are valuable sources of insights. Rather than wait for feedback on Yelp or Twitter, there’s an opportunity to be proactive and address customer questions during product use. Also, with many channels, there are many formats for data. A tweet doesn’t look like a blog post. Voice data often gets converted to text (by a machine or an agent summarizing a call, for example).
  2. It is not uncommon to see people trying to read through 1000’s of customer surveys, suggestions, etc. One of my first text analytics requirements sessions was with a sharp User Experience Designer who would spend her Friday afternoons reading as many feedback reports as possible. CEOs often personally read a subset of emails from customers or listen in on support calls; our CEO does. This is commendable, but doesn’t scale when you receive millions of communications every day. Nor is it possible to keep up with ever-changing topics: today’s customer questions could be completely different from yesterday’s.
  3. While language has some structure, there is ambiguity. Words have multiple meanings and different forms, and can be used in metaphor (e.g. can and can: tin can vs. we can; colorful fish vs. let’s go fish vs. a fish out of water). In addition, we are human. We have our own unique way of saying things. Some of us are polite and punctuate. Others misspell and abbreviate. Sometimes we share TMI, including our PII. With text analytics, all of our data gets thrown in the mix. The goal is to make sense of it all.
  4. In order to accurately “summarize” text data, the trick is to count all related topics across the corpus
  5. At the most basic level, we’re trying to understand the meaning of words, with uncertainty due to context, morphology, and accuracy (e.g. misspellings). More generally, we’re trying to understand user intent, sentiment, etc. Note: as documents become more verbose (e.g. a blog is verbose, a tweet is sparse), the more linguistics can help. Linguistics levels: sounds; words (literal meaning); bi-grams, etc. (words that go together, like “new york”); phrases (“who let the dogs out?”); sentences and part of speech / POS (subject, object, noun, adj, verb, etc.); context within a large block of text. Terminology: corpus; documents (text data, could be a tweet, search query, blog entry, etc.), called a “verbatim” if in the user’s words; words vs. tokens; topics / themes.
  6. Everyone has a particular writing (and speaking) style. Some people use some vocabulary more than others. I bet you could distinguish a paragraph from the NYT vs. Cosmopolitan. Statistics can be used here to find distributions for every word (e.g. how many times “the” is used in general publications) and compare them to your writing (e.g. do you use “the” more than the average person?). Note: women use adjectives more than men.
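The word-distribution comparison described in this note can be illustrated with a very small sketch. It assumes a hypothetical short list of function words and a simple sum of absolute differences as the distance; real authorship analysis uses far richer features, and nothing here comes from the talk itself.

```python
import re
from collections import Counter

# Hypothetical short list of function words to profile.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "is")


def profile(text):
    """Relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1
    return {w: counts[w] / total for w in FUNCTION_WORDS}


def profile_distance(text_a, text_b):
    """Sum of absolute frequency differences; smaller means more similar style."""
    pa, pb = profile(text_a), profile(text_b)
    return sum(abs(pa[w] - pb[w]) for w in FUNCTION_WORDS)
```

Comparing `profile_distance` between a disputed text and samples from each candidate author gives a rough, directional answer to the "same author?" question.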
  7. Taxes are complex; people have tons of questions from start to finish.
  8. Intuit also uses Clarabridge (a rules-based solution) for categorization of support call logs and Radian6 for monitoring and sentiment analysis of social media. The primary driver for unsupervised clustering of in-product search queries was to capture “emerging issues”: things we couldn’t foresee ahead of time when building rules (e.g. a bug introduced in a product launch, late legislation issues with the IRS, etc.). Another benefit of unsupervised approaches is that they don’t require human input or maintenance (low effort).
  9. Numbers tell us what is happening, but not why. This is where text completes the story. For example, you may see conversion going up or down. But what’s driving this change? By looking at emerging issues (what people are talking about today), you can see if a bug was introduced in your recent launch, etc. Trending is also valuable, to determine if a particular topic is gaining strength or has gone away (e.g. after making a product enhancement). Segmentation enables you to see the types of questions new vs. returning users have. Better yet, questions from non-converters. But unlike numeric data, where you can slice and dice results after aggregating, with text you get more accurate results if you segment before clustering. Are tax filers procrastinators? ;-) File extension is a perennial top theme the night before tax day. Integrating text into “funnel analysis” was extremely valuable. Clickstream data tells us where users drop off, but not why. Verbatims helped pinpoint user pain points / road blocks. Resolving just one of these pain points was worth $5M. Analysis of adjectives provides a directional gauge for sentiment. Perhaps a more accurate way to gauge sentiment is to segment promoters from detractors and see what each group has to say.
  10. When I began working at Intuit 3 years ago, there were text analytics efforts centered around call logs and social. We used a rules-based categorization tool called Clarabridge to classify logs from call center agents. We obtained a data feed from Facebook, Twitter, blogs, etc. and evaluated results with Radian6, a Salesforce tool. Both of these tools work well for their respective use cases, but we noticed a gap: we didn’t have a good way to detect emerging issues. Thus began a 3-year journey in unguided machine learning for automated topic discovery (i.e. no human input required).
  11. Pre-processing is 90% of the solution: you can greatly reduce complexity by removing stop words, stemming, mapping synonyms, etc., which reduces the term-document matrix. With a 30% sampling rate, we saw an equivalent set of “top themes” as with the complete, 100% data set. Rules-based categorization scales linearly, but clustering is memory constrained, because everything is compared with everything else. Segmentation helps, because segments can be processed in parallel. With 64GB of memory, clustering of 5 million searches took < 2 hrs, enabling next-day reporting on yesterday’s clickstream by 9AM. Optimizing upstream processes helps too. Note: as text becomes more verbose, computation time slows, a lot. Using part-of-speech parsing to focus on nouns can help identify what a document is about, although you miss sentiment (adjectives). Rules-based approaches (categorization based on keywords) are also easier. It depends on what type of problem you are solving.
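One way to check the sampling claim in this note on your own data is to compare the top themes of a sample against those of the full set. The sketch below is an assumption-laden illustration: it treats each distinct query string as a "theme" (the talk's themes come from clustering), and `sample_overlap`, its parameters, and the fixed seed are hypothetical.

```python
import random
from collections import Counter


def top_themes(queries, k):
    """The k most frequent query strings, used here as a stand-in for themes."""
    return [q for q, _ in Counter(queries).most_common(k)]


def sample_overlap(queries, rate=0.3, k=10, seed=0):
    """Fraction of the full data set's top-k themes that also top the sample."""
    rng = random.Random(seed)
    sample = [q for q in queries if rng.random() < rate]
    full = set(top_themes(queries, k))
    part = set(top_themes(sample, k))
    return len(full & part) / max(len(full), 1)
```

An overlap close to 1.0 at a given rate suggests the sample preserves the head of the distribution; tail themes, where the talk says insights often hide, will be the first to drop out as the rate shrinks.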