Automated Content Labeling using Context in Email
Aravindan Raghuveer
Yahoo! Inc, Bangalore.
At least one email with an attachment in the last 2 weeks?
Image source: http://www.crcna.org/site_uploads/uploads/osjha/mdgs/mdghands.jpg
Introduction: The “What?”

• Email attachments are a very popular mechanism to exchange content.
  - Both in our personal / office worlds.
• The one-liner:
  - The email usually contains a crisp description of the attachment.
  - “Can we auto-generate tags for the attachment from the email?”
Introduction: The “Why?”

• Tags can be stored as extended attributes of files.
• Applications like desktop search can use these tags for building indexes.
• Tags can be generated even without parsing the attachment content:
  - Useful for images.
  - In some cases, the tags have more context than the attachment itself.
Outline

• Problem Statement
• Challenges
• Overview of solution
• The Dataset: quirks and observations
• Feature Design
• Experiments and overview of results
• Conclusion
Problem Statement

“Given an email E that has a set of attachments A, find the set of words K_EA that appear in E and are relevant to the attachments A.”

Subproblems:
• Sentence Selection
• Keyword Selection
Overview of Solution

• Solve a binary classification problem:
  - Is a given sentence relevant to the attachments or not? (A minimal baseline sketch follows below.)
• Two-part solution:
  - What features to use?
    • The most insightful and interesting aspect of this work.
    • Focus of this talk.
  - What classification algorithm to use?
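A minimal sketch of this framing, assuming hypothetical labeled sentences and using only a bag-of-words baseline with a linear SVM from scikit-learn; the talk's actual features and classifiers come later.

    # Baseline sketch only: the sentences and labels here are hypothetical (1 = relevant, 0 = not).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    sentences = ["Please find the Q3 report attached.", "Hope you had a good weekend."]
    labels = [1, 0]

    clf = make_pipeline(CountVectorizer(), LinearSVC())   # bag-of-words -> linear SVM
    clf.fit(sentences, labels)
    print(clf.predict(["The attached spreadsheet lists all regions."]))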
Challenges: Why is the classification hard?

• Only a small part of the email is relevant to the attachment. Which part?
• While writing emails, users tend to rely on context that is easily understood by humans:
  - usage of pronouns
  - nicknames
  - cross-referencing information from a different conversation in the same email thread.
The Enron Email Dataset

• Made public by the Federal Energy Regulatory Commission during its investigation [1].
• The curated version [2] consists of 157,510 emails belonging to 150 users.
• A total of 30,968 emails have at least one attachment.

[1] http://www.cs.cmu.edu/~enron/
[2] Thanks to Mark Dredze for providing the curated dataset.
Observation-1: User behavior for attachments

[Chart: per-user counts of emails with attachments]

• Barring a few outliers, almost all users sent/received emails with attachments.
• A rich, well-suited corpus for studying attachment behavior.
Observation-2: Email length

[Chart: distribution of email length in sentences]

• Roughly 80% of the emails have fewer than 8 sentences.
• In another analysis: even in emails with fewer than 3 sentences, not every sentence is related to the attachment!
Feature Design: Levels of Granularity

[Diagram: an email's sentences S1-S6, with features computed at three levels of granularity: Email Level, Conversation Level, and Sentence Level]
Feature Taxonomy

• Email-level features:
  - Length (short email: fewer than 4 sentences?)
  - Lexicon (high-dimensional, sparse, bag-of-words feature)
• Conversation-level features:
  - Level (the older a message is in the email thread, the higher its conversation level)
• Sentence-level features:
  - Anchor: Strong Phrase, Extension, Attachment Name, Weak Phrase
  - Noisy: Noun, Verb
  - Anaphora
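A minimal sketch of the email- and conversation-level features; the 4-sentence threshold is from the slide, while the bag-of-words representation and the quoting-depth heuristic for conversation level are my own assumptions, not the paper's exact definitions.

    from collections import Counter

    def email_level_features(sentences):
        """Email-level features: is the email short, plus a bag-of-words lexicon."""
        return {
            "short_email": int(len(sentences) < 4),
            "lexicon": Counter(w.lower() for s in sentences for w in s.split()),
        }

    def conversation_level(raw_line):
        """Approximate conversation level by quoting depth: older (deeper-quoted) text gets a higher level."""
        return len(raw_line) - len(raw_line.lstrip(">"))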
Feature Design: Sentence level

[Diagram: an email's sentences S1-S6, grouped into the three sentence types below]

• Anchor Sentences: most likely positive matches.
• Noisy Sentences: most likely negative matches.
• Anaphora Sentences: have linguistic relationships to anchor sentences.
Feature Design: Sentence Level → Anchor

• Strong Phrase Anchor: feature value set to 1 if the sentence has any of the words/phrases (sketch below):
  - attach
  - here is
  - enclosed
• Of the 30,968 emails that have an attachment, 52% had a strong anchor phrase.
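A minimal sketch of this feature, assuming case-insensitive substring matching (the exact matching rules are not spelled out on the slide).

    STRONG_PHRASES = ("attach", "here is", "enclosed")

    def strong_phrase_anchor(sentence):
        """1 if the sentence contains any strong anchor word/phrase, else 0."""
        s = sentence.lower()
        return int(any(phrase in s for phrase in STRONG_PHRASES))

    # strong_phrase_anchor("Enclosed is the revised contract.") -> 1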
Feature Design: Sentence Level → Anchor

• Behavioral observation: users tend to refer to an attachment by its file type.
• Extension Anchor: feature value set to 1 if the sentence has any of the extension keywords (sketch below):
  - xls → spreadsheet, report, excel file
  - jpg → image, photo, picture
• Example: “Please refer to the attached spreadsheet for a list of Associates and Analysts who will be ranked in these meetings and their PRC Reps.”
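A sketch of the Extension Anchor; the keyword map below contains only the two examples from the slide, and the full map is presumably larger.

    # Keywords per attachment extension; only the slide's two examples are listed here.
    EXTENSION_KEYWORDS = {
        "xls": ["spreadsheet", "report", "excel file"],
        "jpg": ["image", "photo", "picture"],
    }

    def extension_anchor(sentence, attachment_extensions):
        """1 if the sentence mentions a keyword tied to any attached file's extension."""
        s = sentence.lower()
        return int(any(keyword in s
                       for ext in attachment_extensions
                       for keyword in EXTENSION_KEYWORDS.get(ext, [])))

    # extension_anchor("Please refer to the attached spreadsheet ...", ["xls"]) -> 1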
Feature Design: Sentence Level → Anchor

• Behavioral observation: users tend to use file-name tokens of the attachment to refer to the attachment.
• Attachment Name Anchor: feature value set to 1 if the sentence has any of the file-name tokens (sketch below).
  - Tokenization is done on case and type transitions.
• Example:
  - Attachment name: “Book Request Form East.xls”
  - “These are book requests for the Netco books for all regions.”
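A sketch of the tokenization and matching, assuming "type transitions" means letter/digit boundaries; stemming (so that "requests" would also match "request") is left out for brevity.

    import re

    def filename_tokens(filename):
        """Split a file name on case and letter/digit transitions, dropping the extension."""
        stem = filename.rsplit(".", 1)[0]
        return [t.lower() for t in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", stem)]

    def attachment_name_anchor(sentence, filename):
        """1 if the sentence contains any token from the attachment's file name."""
        words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        return int(any(token in words for token in filename_tokens(filename)))

    # filename_tokens("Book Request Form East.xls") -> ['book', 'request', 'form', 'east']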
Feature Design: Sentence level (recap)

• Anchor Sentences: most likely positive matches.
• Noisy Sentences: most likely negative matches.
• Anaphora Sentences: have linguistic relationships to anchor sentences.
Feature Design: Sentence level → Noisy

• Noisy sentences are usually salutations, signature sections, and email headers of conversations.
• Two features capture noisy sentences (sketch below):
  - Noisy Noun: marked true if more than 85% of the words in the sentence are nouns.
  - Noisy Verb: marked true if there are no verbs in the sentence.
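A sketch of both features using NLTK's default English part-of-speech tagger (resource names may differ slightly across NLTK versions); the 85% threshold is from the slide, while restricting the count to alphabetic tokens is my own assumption.

    import nltk
    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    def noisy_features(sentence):
        """Noisy Noun: more than 85% of the words are nouns. Noisy Verb: no verbs at all."""
        words = [w for w in nltk.word_tokenize(sentence) if w.isalpha()]
        tags = [tag for _, tag in nltk.pos_tag(words)]
        nouns = sum(tag.startswith("NN") for tag in tags)
        verbs = sum(tag.startswith("VB") for tag in tags)
        return {
            "noisy_noun": int(bool(tags) and nouns / len(tags) > 0.85),
            "noisy_verb": int(verbs == 0),
        }

    # A signature line such as "John Smith, Vice President, Enron Corp" would typically trigger both.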
Feature Design: Sentence level (recap)

• Anchor Sentences: most likely positive matches.
• Noisy Sentences: most likely negative matches.
• Anaphora Sentences: have linguistic relationships to anchor sentences.
Feature Design: Sentence level → Anaphora

• Once anchors have been identified:
  - An NLP technique called anaphora detection can be employed.
  - It detects other sentences that are linguistically dependent on an anchor sentence.
  - It tracks the hidden context in the email.
• Example (a rough sketch of the idea follows below):
  - “Thought you might be interested in the report. It gives a nice snapshot of our activity with our major counterparties.”
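The slide refers to proper anaphora detection; as a very rough stand-in only, here is a pronoun-based heuristic that flags a sentence when it follows an anchor sentence and contains a third-person or demonstrative pronoun. This is not the technique used in the paper.

    # Crude substitute for anaphora detection, for illustration only.
    PRONOUNS = {"it", "its", "they", "them", "their", "this", "that", "these", "those"}

    def anaphora_flags(sentences, anchor_flags):
        """One 0/1 flag per sentence; anchor_flags[i] is 1 if sentence i is an anchor."""
        flags = []
        for i, sentence in enumerate(sentences):
            words = set(sentence.lower().split())
            follows_anchor = i > 0 and anchor_flags[i - 1] == 1
            flags.append(int(follows_anchor and bool(words & PRONOUNS)))
        return flags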
Correlation Analysis

[Chart: correlation of each feature with the sentence-relevance label]

• Best positive correlation: the strong phrase anchor and the anaphora feature.
• The short email feature has a low correlation coefficient.
• The noisy verb feature shows a good negative correlation.
• The conversation level <= 2 feature has a lower negative correlation than the conversation level > 2 feature.
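The slide does not name the correlation measure; a small sketch using the Pearson coefficient (via numpy) between one binary feature column and the 0/1 relevance labels.

    import numpy as np

    def feature_label_correlation(feature_values, labels):
        """Pearson correlation between a single 0/1 feature and the 0/1 relevance labels."""
        return np.corrcoef(feature_values, labels)[0, 1]

    # feature_label_correlation([1, 0, 1, 0, 1], [1, 0, 1, 1, 0])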
Experiments

• Ground-truth data:
  - Randomly sampled 1,800 sent emails.
  - Two independent editors produced class labels for every sentence in the sample.
  - Reconciliation: discarded emails that had at least one sentence with conflicting labels.
• ML algorithms studied: Naïve Bayes, SVM, CRF (a CRF sketch follows below).
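The talk does not name an implementation for the CRF; a sketch using the sklearn-crfsuite package (my choice of library), where each email is a sequence of per-sentence feature dicts like the ones built above and each sentence is labeled relevant ("rel") or irrelevant ("irr").

    import sklearn_crfsuite

    # Tiny hypothetical training set: one feature dict and one label per sentence, one sequence per email.
    X_train = [
        [{"strong_phrase_anchor": 1, "noisy_verb": 0}, {"anaphora": 1, "noisy_verb": 0}],
        [{"noisy_noun": 1, "noisy_verb": 1}, {"strong_phrase_anchor": 0, "noisy_verb": 1}],
    ]
    y_train = [["rel", "rel"], ["irr", "irr"]]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
    crf.fit(X_train, y_train)
    print(crf.predict(X_train))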
Summary of Results: F1 measure

• With all features used, the F1 scores read:
  - CRF: 0.87
  - SVM: 0.79
  - Naïve Bayes: 0.74
• CRF consistently beats the other two methods across all feature subsets.
  - The sequential nature of the data works in CRF's favor.
• The Phrase Anchor provides the best increase in precision.
• The Anaphora feature provides the best increase in recall.
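For reference, F1 is the harmonic mean of precision P and recall R:

    F1 = 2 * P * R / (P + R)

so the precision gains from the Phrase Anchor and the recall gains from the Anaphora feature both push F1 up.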
Summary of Results: User-Specific Performance

[Chart: per-user comparison of the classifiers]

• In the majority of cases, CRF outperforms the other methods.
• With the same set of features, the CRF can learn a more generic model applicable to a variety of users.
In the paper . . .

• A specific use case for attachment-based tagging:
  - Images sent over email
  - A user study based on a survey
• Heuristics for going from sentences to keywords.
• More detailed evaluation studies / comparisons for Naïve Bayes, SVM, and CRF.
• TagSeeker: a prototype of the proposed algorithms, implemented as a Thunderbird plugin.
Closing Remarks

• Measuring the improvement in retrieval effectiveness due to the mined keywords:
  - Could not be performed because the attachment content is not available.
  - Working on an experiment to do this on a different dataset.
• Thanks to:
  - the reviewers for the great feedback!
  - the organizers for the effort of putting this conference together!
Conclusion

• Presented a technique to extract information from noisy data.
• The F1 measure of the proposed methodology:
  - In the high eighties. Good!
  - Generalized well across different users.
• For more information on this work / Information Extraction @ Yahoo!:
  - aravindr@yahoo-inc.com
Editor's Notes

1. Left with 1,150 emails whose sentences were marked as either relevant to the attachment or not. This data corresponds to 112 users from the Enron corpus and 6,472 sentences.