Master DMKM Presentation

Entity Aspect Analysis
By: Ahmed Kamel
Supervision: Ingmar Weber, Yahoo! Labs Barcelona
             Marta Arias, Universitat Politècnica de Catalunya
Location: Yahoo! Labs Barcelona
The Web
Opinion Summarization
Entity             Freq     +Freq    -Freq   +Score    -Score     Score
Lionel_Messi       378,076  283,450  94,626  89,386.5  -29,449.3  59,937.2
Cristiano_Ronaldo  312,338  228,480  83,858  72,342.7  -27,883.2  44,459.5

Entity  EFreq       Aspect   EAFreq  +EAFreq  -EAFreq  +Score  -Score  Score
France  11,697,238  economy  2,633   1,452    1,181    469.2   -390.6  78.5
Spain   6,602,450   economy  1,561   620      941      211.7   -312.2  -100.3
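
A minimal sketch of how per-mention sentiment scores could be rolled up into a summary row like the ones above. The `summarize` helper and its input (a list of signed scores, one per sentimental mention) are assumptions for illustration, not the pipeline's actual code. The only relations taken from the tables are that Freq = +Freq + -Freq and that Score is, up to rounding, +Score + -Score.

```python
def summarize(mention_scores):
    """Roll signed per-mention scores up into one opinion-summary row."""
    # Mentions with a zero score are ignored here (an assumption).
    positives = [s for s in mention_scores if s > 0]
    negatives = [s for s in mention_scores if s < 0]
    pos_score = sum(positives)
    neg_score = sum(negatives)
    return {
        "Freq": len(positives) + len(negatives),
        "+Freq": len(positives),
        "-Freq": len(negatives),
        "+Score": pos_score,
        "-Score": neg_score,
        "Score": pos_score + neg_score,
    }


# Placeholder scores, chosen only to exercise the function.
row = summarize([1.0, 0.5, -0.25, 0.25, -0.5])
print(row)
```

Because each field is a plain count or sum, rows for the same entity computed on different batches of pages can be merged by adding the fields together.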
Architecture
Text Extraction

Boilerpipe (text extraction from HTML) -> Stanford CoreNLP (sentence splitting)

Barack Hussein Obama was sworn in as the 44th president of the United States on Jan. 20, 2009.

The son of a black man from Kenya and a white woman from Kansas, he is the first African-American to ascend to the highest office in the land.

He defeated Hillary Rodham Clinton in a lengthy and bitter primary battle before defeating Senator John McCain, the Arizona Republican, in November 2008.
…
Entity Recognition

Barack Hussein Obama was sworn in as the 44th president of the United States on Jan. 20, 2009.
…

Entity Recognition (Wikification)

Barack Hussein Obama was sworn in as the 44th president of the United States on Jan. 20, 2009.
Barack_Obama||0.9727||Barack||0.9868||Barack Hussein Obama||0.9907
President_of_the_United_States||0.9707||president of the United States||0.9918
…
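
A small sketch of how one annotation line in the `||`-separated format above could be parsed. The field interpretation (a Wikipedia entity and its score, followed by pairs of surface form and score) is a reading of the example lines, not documented behaviour of the in-house tool, and `parse_annotation` is a hypothetical helper name.

```python
def parse_annotation(line):
    """Split 'Entity||score||mention||score||...' into structured fields."""
    fields = line.strip().split("||")
    entity, entity_score = fields[0], float(fields[1])
    rest = fields[2:]
    # Remaining fields are read as (surface form, score) pairs.
    mentions = [(rest[i], float(rest[i + 1])) for i in range(0, len(rest) - 1, 2)]
    return entity, entity_score, mentions


entity, entity_score, mentions = parse_annotation(
    "Barack_Obama||0.9727||Barack||0.9868||Barack Hussein Obama||0.9907"
)
print(entity, entity_score, mentions)
```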
Aspect Extraction

Barack Hussein Obama was sworn in as the 44th president of the United States on Jan. 20, 2009.
Barack_Obama||0.9727||Barack||0.9868||Barack Hussein Obama||0.9907
President_of_the_United_States||0.9707||president of the United States||0.9918
…

PoS tagging aspect extraction

Barack/NNP Hussein/NNP Obama/NNP was/VBD sworn/VBN in/IN as/IN the/DT 44th/JJ
president/NN of/IN the/DT United/NNP States/NNPS on/IN Jan./NNP 20/CD ,/, 2009/CD ./.
…

Barack_Obama          President_of_the_United_States
Barack Hussein Obama  Barack Hussein Obama
president             president
United States         United States
Jan                   Jan
44th president        44th president
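
The aspect lists above are consistent with two PoS-tag patterns: runs of nouns, and adjectives followed by nouns. The sketch below applies those two patterns to a `word/TAG` string. `extract_aspects` is a hypothetical helper, and the details (matching Penn Treebank tags by the prefixes `NN` and `JJ`, stripping trailing periods so that `Jan.` becomes `Jan`) are assumptions made to reproduce the example, not the pipeline's actual rules.

```python
def extract_aspects(tagged_sentence):
    """Return noun runs and adjective+noun runs from 'word/TAG' tokens."""
    tokens = [tok.rsplit("/", 1) for tok in tagged_sentence.split()]
    aspects = []
    i = 0
    while i < len(tokens):
        start = i
        # Optional run of adjectives (JJ, JJR, JJS).
        while i < len(tokens) and tokens[i][1].startswith("JJ"):
            i += 1
        noun_start = i
        # Run of nouns (NN, NNS, NNP, NNPS) that follows.
        while i < len(tokens) and tokens[i][1].startswith("NN"):
            i += 1
        if i > noun_start:
            nouns = [word.rstrip(".") for word, _ in tokens[noun_start:i]]
            aspects.append(" ".join(nouns))
            if noun_start > start:
                adjectives = [word for word, _ in tokens[start:noun_start]]
                aspects.append(" ".join(adjectives + nouns))
        elif i == start:
            i += 1  # Neither adjective nor noun: skip this token.
    return aspects


tagged = (
    "Barack/NNP Hussein/NNP Obama/NNP was/VBD sworn/VBN in/IN as/IN the/DT "
    "44th/JJ president/NN of/IN the/DT United/NNP States/NNPS on/IN "
    "Jan./NNP 20/CD ,/, 2009/CD ./."
)
aspects = extract_aspects(tagged)
print(aspects)
```

On the example sentence this yields the same five aspects as the slide, although in a different order.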
Sentiment Analysis

The iPhone is in general very good, however, its battery life is very bad
…

SentiStrength (distance=3 for "very good", distance=10 for "very bad")

The iPhone is in general very good[2][+1 booster word], however, its battery life is very bad[-2][-1 booster word][sentence: 3,-3] [result: max + and - of any sentence]
very good||3||3
very bad||-3||10
Score = 3/3 + (-3)/10 = 0.7
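
The score on this slide divides each sentiment strength by its distance and sums the results. A minimal sketch of that arithmetic follows. The record layout `phrase||strength||distance` is inferred from the two example records, and the function names are hypothetical.

```python
def parse_record(line):
    """Parse 'phrase||strength||distance' into a typed tuple."""
    phrase, strength, distance = line.split("||")
    return phrase, int(strength), int(distance)


def distance_weighted_score(records):
    """Sum strength / distance over (phrase, strength, distance) records."""
    return sum(strength / distance for _, strength, distance in records)


records = [parse_record("very good||3||3"), parse_record("very bad||-3||10")]
score = distance_weighted_score(records)
print(round(score, 2))
```

Dividing by distance means that a sentiment phrase far from the entity contributes less than one close to it, which is why the sentence as a whole comes out positive for the iPhone.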
Our work is
• Applying the previous steps to
  – Over 2 billion English pages
  – Wikipedia entities (over 3.5 million entities)
• Mostly using
  – Hadoop
  – Pig
Experiments
• Lack of ground truth
• Correlations to real-world factors
• Three experiments
  – Countries
  – Countries’ economy
  – Grammy award winners
Countries
• Travel
  [Figures: Top 10 positively mentioned; Costa Rica positive aspects]
• Axis of Evil
• BBC Poll
  [Figures: Top 10 negatively mentioned; Iran negative aspects; Israel negative aspects]
Countries’ economy
• Correlation between sentiment scores and
  countries’ nominal GDP
• Normalized scores vs. non-normalized scores
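
The slides do not say which correlation coefficient was used or how scores were normalized. As a sketch under those caveats, the code below computes a Spearman rank correlation between two lists and shows one possible normalization (score divided by mention frequency). Both choices are assumptions, the function names are hypothetical, and the input values are placeholders rather than results from the experiments.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for equal-length lists without ties."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


def normalize(score, frequency):
    """Per-mention score: one possible way to damp the volume effect."""
    return score / frequency


# Placeholder values only, not data from the slides.
same_order = spearman([0.2, -0.1, 0.4, 0.3], [2.0, 1.0, 4.0, 3.0])
reverse_order = spearman([1, 2, 3, 4], [4, 3, 2, 1])
print(same_order, reverse_order)
```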
Grammy Award Winners
[Figures: Correlations with Grammy; Inequality of scores]
Conclusion
• Analysis
   – Methodology for correlating sentiments with other real-world factors
   – Experiments
• Pipeline
   – Big data
   – Can be an online in-production system
• Future work
   – Restricting the analysis to a subset of the Web, e.g., blogs
   – Sentiment scoring scheme (taking the volume problem
     into account)
Thanks
     Merci
Gràcies – Gracias
     Danke
  Teşekkürler

Editor's Notes

  1. Explosive growth
     – User Generated Content (UGC)
     – Question-answering databases
     – Digital video
     – Blogging
     – Social networks
     – Wikis
     Self-expression and opinionated content
     Web of Concepts, or entities
     Google, 2008: over one trillion unique URLs
     Indexed web: at least 8.47 billion pages
  2. Opinion summaries allow for discovering all kinds of fun facts
     – Messi vs. Ronaldo
     – France’s economy vs. Spain’s economy
     They also allow for something more interesting: further studies of the relationship between sentiments as discovered on the Web and other real-world factors.
  3. We build a system that
     – Is a simple yet effective approach capable of handling sentiments from all over the Web
     – Generates opinion summaries for entities
     – Generates opinion summaries for entities’ aspects
     The system we are building here “allows for interesting types of analysis”.
  4. The Web is mostly in HTML, so we need to be able to get the text out of it.
     – Boilerpipe is a machine-learned classifier that uses shallow text features (word counts) to extract text from HTML.
     – Stanford CoreNLP allows for sentence splitting on common sentence ends such as full stops, question marks and exclamation marks.
  5. An in-house proprietary tool that uses machine learning to learn a model able to infer the topics of a given text. Wikipedia entities allow for rich information about entities.
  6. An aspect is a predefined sequence of PoS tags. We use two main patterns: nouns, and adjectives followed by nouns.
  7. Ranking countries by sentiments
     Most frequent sentimental aspects
     Normalized vs. non-normalized scores
     RANKING
  8. RANKINGS AND CORRELATIONS FOR RANKINGS
  9. Are sentiments associated with Grammy Award winners different from those associated with other musicians?
     Statistical tests:
     1. Correlations with Grammy
     2. Inequality of scores
     3. Positive score to predict a Grammy winner. Receiver Operating Characteristic (ROC) not shown.
  10. Analysis
     Experiments
     – Countries: really different, in the sense that we picked up a good signal whether we normalize or not.
     – GDP: unfortunately we didn’t get the expected results; frequency tended to dominate the sentiments. Maybe it’s not the right criterion to compare against. Maybe unemployment rate would be, or maybe the volume problem is just inherently there.
     – Grammy: it worked, though not with a strong correlation, when restricting frequencies and normalizing.
     Sentiments vs. volume
     Big Data
     – If something can go wrong, it will definitely go wrong.
     – We had to choose simple, effective approaches that can scale easily.
     Online in-production system
     – I imagine it running in parallel with the web crawlers, doing its analysis and updating the summaries.
     – The methods chosen also allow for continuous updates; generating the summaries doesn’t require the presence of the whole set of webpages at once.
     INTERNSHIP STILL GOING ON