Strata 2013: Text Analytics at Scale

Millions of users visit Intuit product portals every day. With web analytics, we know what user behavior looks like, but not why. By tapping into in-product search and social data, we began to understand the types of questions, pain points, and suggestions users have. This was made possible with text analytics, via unguided machine learning at scale.

Topic discovery was just the beginning though. Trending, segmentation, integration with clickstream data and association with business goals made voice of customer insights actionable. In this presentation, learn about:

Text analytics at Intuit (case study)
Building decision support around text analytics
Technical approach & scaling
Protecting data privacy
Open source & commercial solutions

Heather Wasserlein is a Senior Product Manager at Intuit, where she partners with Data Science to create data-driven New Business Initiatives. Prior to Intuit, Heather worked on advertising marketplaces and web content classification at Yahoo! Heather holds a Master’s degree in Mechanical Engineering from MIT.

Published in: Technology, Education

  1. Text Analytics at Scale: Listening to 45 Million Customers. Heather Wasserlein, Intuit. STRATA Hadoop World, Oct 30, 2013
  2. We’ve all been here...
  3. On the phone with customer support
  4. Can anyone hear me?
  5. It’s extremely frustrating
  6. Employees are eager to help. So, why the gap?
  7. Many touch points: user intent, user feedback
  8. Overwhelming data volumes: you can read a few thousand customer comments, but not millions. And new themes come up every day.
  9. You can pull a “top 1000” list, but is it telling you anything new? Anything actionable? Chart of verbatims by frequency (long tail). Top: hello, help, call, login. Mid: password, cant find pwd, account, multiple accounts, print, import, error 5514, phone, printing blank page, phone number, call customer sevice, change password, charged twice, cancel. Tail: “print function not working new version of IE, error msg 87956”, “please call back at 555-555-5555”.
  10. Insights are often in the tail: a needle-in-the-haystack problem, with valuable details hidden in descriptive tail verbatims. (Same top / mid / tail verbatim chart as slide 9.)
  11. Related topics are dispersed: the “top 1000” can be misleading, because the most common verbatims may not represent the most common themes. (Same top / mid / tail verbatim chart as slide 9.)
  12. What is text analytics? With numeric data, you can run summary stats; summarizing textual data is more complex. Statistics + linguistics: you can mix and match various statistical and linguistic tools, depending on the problem.
  13. Example – forensic linguistics: same author?
  14. Case studies: applying text analytics to simple and complex problems at Travelocity, Yahoo! and Intuit
  15. Travelocity search: Where is Albekerke? Example searches: San Jose; San Jose, CA; San Jose, Costa Rica; San Jose Intl Airport; NY; NYC; JFK; New York, NY, USA; NY, New York; Grand Canyon; Disneyland; Home
  16. Travelocity search solution: a finite set of airports, but many variations in search (San Jose; San Jose, CA; San Jose International; Mineta San Jose Airport; San Josee Airport; Silicon Valley; SJC), all resolving to SJC. A simple but manually intensive solution: a mapping of all known search variations to the relevant airport codes, plus Soundex phonetic matching to catch unforeseen misspellings. A “rules-based” approach: no statistics, minimal linguistics (sounds).
  17. Yahoo! web site classification: Is this site clean? Does it contain any illegal or sensitive content (alcohol, tobacco, drug, online gambling, violence or weapons, adult content)? Does the web site meet advertiser standards?
  18. Yahoo! web site classification solution: verbose, rapidly changing data, but a finite set of topics. 100,000s of web sites in Y! and partner ad networks. Training data (human-labeled): 5K positive examples, 30K negative examples. Multiple approaches: classifiers, keyword matching, image matching, and a human-review process. Supervised machine learning: detection of patterns, phrases and contexts associated with a finite set of “risk categories.” Emphasis on recall, catching true positives.
  19. Intuit tax support: Adjusted cost basis?
  20. Intuit tax support solution: millions of questions daily, of all types. Google-like search, but often in natural language. Example searches: “PIN number”; “Where can I find my PIN?”; “Newly married, file jointly”; “File married or separately?”; “Home mortgage deduction”; “Can I deduct my dog?”; “Why is 1099-int import slow?”; “Where’s my refund??” Solution: clustering of site searches, topic “discovery” (topics: PIN, file married, deduct, 1099int, import, refund). Unsupervised machine learning: statistics and linguistics, part-of-speech tagging, and detection of words that “go together more often than not.”
  21. Results for 3 algorithms
      – LDA (bag of words): File, free, taxes; File, extension, get; File, security, social; Income, state, business; Payment, state, filed; State, refund, check
      – Lingo (hierarchical clustering): File (File 2012, File an extension, File state); Deduction (Deduction car, Deduction sales tax, Deduction standard)
      – Custom (n-gram clustering, in-house solution): File extension; Social security; Business income; Sales tax deduction; Refund check; Payment
  22. Words + numbers = insights: emerging topics (Refund, deduct, Late legislation, File extension, Error 576, etc.); funnel analysis (Enter w2, Import error.., Taxes done!); trending & (pre) segmentation; sentiment
  23. Use cases
      – Product Managers: 1. User needs (identify product enhancements; rapidly diagnose product defects; tune site search; personalize content)
      – Customer Care: 1. Common questions (train agents & staff appropriately). 2. Emerging issues (early insight to new issues). 3. Call routing
      – Marketing: 1. Segment by VOC (address common questions to retain users; segment by sentiment and empower promoters). 2. Customer dialogue (listen to feedback & respond 1:1 or 1:many)
  24. Our journey
      – Y1, proof of concept: science project; clustering 2M searches, 2-day lag; vocal early adopters
      – Y2, productize: transfer from science to eng; campaign to grow adoption; scaled to 15M searches, 1-day lag; report email
      – Y3, scale: scaled to 30M searches, next day 9am SLA; viral adoption, 50+ users
      – Along the way: site search & FAQ tuning; emerging issues detection; data volume grew, system crawled; X-functional “VOC team” meets weekly; 100s of items actioned, $10Ms in value; 2 new products enabled
  25. Scaling
      Reduce problem size
      1. Pre-process
         – de-dup
         – remove PII, system-generated info, etc.
         – remove stop words
         – map synonyms
         – stemming
      2. Reduce data size
         – sample
         – segment
         – narrow time period
         – remove tail terms (cautiously)
      Add hardware
      1. Add memory
         – text clustering is memory constrained
         – verbose text is harder
      2. Distribute processes
         – rule-based categorization scales linearly
         – clustering of segments can be run in parallel
         – data sourcing
         – pre-processing
      Optimize algorithm
      1. Tradeoffs & tuning
         – choose approach to balance accuracy vs. performance
         – tune algorithm parameters
  26. Results
      1. Faster time to insights
         – Customer issues detected up to 1 week earlier
         – Search is a leading indicator for call drivers – a canary in the coal mine
      2. Better customer experience
         – Using text insights to tune search results improved relevancy
         – Identifying users with common questions made it possible to personalize the experience
      3. $10s of millions in revenue
         – VOC data + user behavior led to a whole new understanding of product use
         – Detecting and resolving customer pain points generated $10s of millions
  27. Getting started?
      1. Read a sample of verbatims + scope the problem
         – Topic discovery or known topics?
         – Sources of text and verbosity (few words, sentences, pages)?
         – Estimate data volumes and define SLAs
      2. Build vs. buy
         – Compare tools, build proofs of concept
         – Compare results relative to a “golden set”
      3. Start small
         – One data source, non-verbose text, small volumes
         – 1000s of documents for statistically valid results
         – Beta test reporting, QA topic-verbatim fit
      4. Establish business processes
         – X-functional process to action insights, let reports go viral
      Scale and incorporate domain knowledge later (“phase 2”)
  28. Long story short: Listen. To everyone! Words + Numbers = Insights. Apply the right tools for the job.
  29. Thank You! @heatherwater @IntuitInc
  30. Appendix
  31. “Home grown” algorithm: unsupervised machine learning / clustering
      1. Identify candidate phrases
         – Sparse text: identify all combinations of bi-grams, tri-grams, four-grams
         – Verbose text: use linguistic approaches to identify phrases
           • Split text into sentences + identify part of speech for each word (noun, adj, etc.)
           • Apply linguistic filters to parse candidate phrases (adj noun, verb adv, etc.)
      2. Determine which phrases are “significant”
         – Count word frequencies and calculate likelihood ratios
           • L1 = words are independent, L2 = words are dependent
           • If L2 > L1, the words appear together more often than not
      3. Cluster related topics
         – Represent n-grams and searches as vectors, calculate cosine similarity, and cluster related topics when similarity > pre-defined threshold
      4. Identify topic “title”
         – Construct a “title” representative of the cluster (e.g. most common search)
  32. What’s next for text at Intuit?
      1. Finalize evaluation of new algorithms (e.g. Lingo3G, LDA, etc.)
      2. Scale through distributed processing (i.e. move to Hadoop)
      3. Support more types of text (e.g. verbose)
      4. Continue to integrate topics & usage data for a complete picture of the end-to-end user experience
      5. Provide text analytics as a service
      6. Semantic search
      7. Internationalization (future)
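
The Travelocity slide (slide 16) describes a rules-based lookup: known search variations map to airport codes, with Soundex phonetic matching as a fallback for misspellings. The following is a minimal Python sketch of that idea, not the production system. The table `KNOWN_VARIATIONS`, the helper names, and the ABQ entry for Albuquerque are illustrative assumptions; only the San Jose variations and SJC come from the slide.

```python
# American Soundex digit groups.
_GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
_CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def soundex(word: str) -> str:
    """Return the 4-character American Soundex code of a word."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters that share a code
        code = _CODES.get(c, "")
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (encoded + "000")[:4]


# Hypothetical mapping of known search variations to airport codes.
KNOWN_VARIATIONS = {
    "san jose": "SJC",
    "san jose, ca": "SJC",
    "mineta san jose airport": "SJC",
    "sjc": "SJC",
    "albuquerque": "ABQ",
}


def phonetic_key(query: str) -> str:
    """Soundex code of every token, joined into one key."""
    return " ".join(soundex(token) for token in query.split())


PHONETIC_INDEX = {phonetic_key(k): v for k, v in KNOWN_VARIATIONS.items()}


def resolve_airport(query: str):
    """Exact rule first, phonetic fallback second; None if nothing matches."""
    q = query.strip().lower()
    if q in KNOWN_VARIATIONS:
        return KNOWN_VARIATIONS[q]
    return PHONETIC_INDEX.get(phonetic_key(q))


print(resolve_airport("San Josee"), resolve_airport("Albekerke"))
```

The exact-match table carries most of the load, which is why the slide calls the approach manually intensive. Soundex only catches misspellings that keep the first letter and the consonant pattern of each word.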
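
The pre-processing steps on the Scaling slide (slide 25: de-dup, remove PII, remove stop words, map synonyms, stemming) can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the phone-number pattern stands in for real PII removal, the stop-word and synonym tables are tiny samples, and `crude_stem` is a deliberately naive stand-in for a real stemmer such as Porter's.

```python
import re
from collections import Counter

# Assumed PII pattern: US-style phone numbers such as 555-555-5555.
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")
# Tiny illustrative tables; real ones would be far larger.
STOP_WORDS = {"a", "an", "the", "at", "to", "of", "my", "i", "is", "please"}
SYNONYMS = {"pwd": "password"}


def crude_stem(token: str) -> str:
    """Strip a trailing 'ing' or 's' when at least 4 characters remain."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def preprocess(verbatim: str) -> str:
    """Normalize one verbatim: drop PII, map synonyms, drop stop words, stem."""
    text = PHONE_RE.sub(" ", verbatim.lower())
    tokens = re.findall(r"[a-z0-9]+", text)
    tokens = [SYNONYMS.get(t, t) for t in tokens]
    return " ".join(crude_stem(t) for t in tokens if t not in STOP_WORDS)


def preprocess_all(verbatims) -> Counter:
    """De-duplicate after normalization, keeping a count per distinct text."""
    counts = Counter()
    for verbatim in verbatims:
        cleaned = preprocess(verbatim)
        if cleaned:
            counts[cleaned] += 1
    return counts


print(preprocess_all(["change password", "Change my pwd", "printing blank page"]))
```

De-duplicating after normalization, rather than before, is a design choice in this sketch: it lets variants such as “change password” and “Change my pwd” collapse into one row with a count, which shrinks the input to any later clustering step.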
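
Step 2 of the appendix algorithm (slide 31) scores phrases with likelihood ratios: L1 assumes the words are independent, L2 assumes they are dependent. The deck does not give the formula, so the sketch below assumes the common binomial formulation of the log-likelihood ratio for bigrams; the function names are hypothetical. Under this formulation L2 is never smaller than L1, so the sketch also checks the direction of the association and ranks bigrams by the size of the gap.

```python
import math
from collections import Counter


def _log_binom(k: int, n: int, x: float) -> float:
    """log of x**k * (1 - x)**(n - k); x is clamped away from 0 and 1."""
    x = min(max(x, 1e-12), 1 - 1e-12)
    return k * math.log(x) + (n - k) * math.log(1 - x)


def bigram_scores(searches):
    """Score each adjacent word pair by 2 * (log L2 - log L1)."""
    unigrams, bigrams = Counter(), Counter()
    for search in searches:
        tokens = search.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n = sum(unigrams.values())
    scores = {}
    for (w1, w2), c12 in bigrams.items():
        c1, c2 = unigrams[w1], unigrams[w2]
        if n == c1:
            continue  # degenerate corpus: every token is w1
        p = c2 / n                     # L1: P(w2) is the same after w1 or not
        p1 = c12 / c1                  # L2: P(w2 | previous word is w1)
        p2 = (c2 - c12) / (n - c1)     # L2: P(w2 | previous word is not w1)
        log_l1 = _log_binom(c12, c1, p) + _log_binom(c2 - c12, n - c1, p)
        log_l2 = _log_binom(c12, c1, p1) + _log_binom(c2 - c12, n - c1, p2)
        if p1 > p2:  # keep pairs that co-occur MORE often than chance
            scores[(w1, w2)] = 2 * (log_l2 - log_l1)
    return scores


sample = [
    "social security number", "social security card", "social security",
    "file taxes", "file extension", "file state taxes",
]
ranked = sorted(bigram_scores(sample).items(), key=lambda kv: kv[1], reverse=True)
print(ranked[0])
```

A bigram would then count as “significant” when its score clears a cutoff chosen for the data set; the deck does not state one.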
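
Steps 3 and 4 of the appendix algorithm (slide 31) represent searches as vectors, group them when cosine similarity exceeds a pre-defined threshold, and title each cluster with its most common search. The deck does not say which clustering procedure was used, so this Python sketch assumes a greedy single-pass scheme over bag-of-words vectors. The function names and the default threshold of 0.3 are illustrative assumptions.

```python
import math
from collections import Counter


def vectorize(text: str) -> Counter:
    """Bag-of-words term counts for one search."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_searches(searches, threshold: float = 0.3):
    """Greedy one-pass clustering; returns a list of (title, member counts)."""
    clusters = []  # each cluster: summed term vector + counts of member searches
    for search in searches:
        vec = vectorize(search)
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine_similarity(vec, cluster["centroid"])
            if sim > best_sim:  # join only when similarity > threshold
                best, best_sim = cluster, sim
        if best is None:
            best = {"centroid": Counter(), "members": Counter()}
            clusters.append(best)
        best["centroid"].update(vec)
        best["members"][search] += 1
    # Title = most common search in the cluster, as on the slide.
    return [(c["members"].most_common(1)[0][0], c["members"]) for c in clusters]


searches = [
    "file extension", "file extension", "how to file extension",
    "where is my refund", "refund check", "refund status",
    "file extension", "where is my refund",
]
for title, members in cluster_searches(searches):
    print(title, sum(members.values()))
```

A single pass keeps memory proportional to the number of clusters rather than the number of searches, but the result depends on input order. That is one of the accuracy-versus-performance tradeoffs the Scaling slide alludes to.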