Knowledge Integration in Practice
Peter Mika, Director of Semantic Search, Yahoo Labs ⎪ January 13, 2015
Agenda
 Intro
 Yahoo’s Knowledge Graph
› Why a Knowledge Graph for Yahoo?
› Building the Knowledge Graph
› Challenges
 Future work
 Q&A
Disclaimers:
• Yahoo’s Knowledge Graph is the work of many at Yahoo, so I can’t speak to all of it with authority
• I’ll be rather loose with terminology…
About Yahoo
 Yahoo makes the world's daily habits inspiring and entertaining
› An online media and technology company
• 1 billion+ monthly users
• 600 million+ monthly mobile users
• #3 US internet destination*
• 81% of the US internet audience*
› Founded in 1994 by Jerry Yang and David Filo
› Headquartered in Sunnyvale, California
› Led by Marissa Mayer, CEO (since July, 2012)
› 10,700 employees (as of Sept 30, 2015)
*ComScore Media Metrix, Aug 2015
Yahoo Labs
 Yahoo’s global research organization
› Impact on Yahoo’s products AND academic excellence
› Established in 2005
› ~200 scientists and research engineers
› Wide range of disciplines
› Locations in Sunnyvale, New York, Haifa
› Led by Ron Brachman, Chief Scientist and Head of Labs
› Academic programs
› Visit
• labs.yahoo.com
• Tumblr/Flickr/LinkedIn/Facebook/Twitter
Semantic Search at Yahoo Labs London
Extraction: Information extraction from text and the Web
Integration: Knowledge representation and data fusion
Indexing: Efficient indexing of text annotations and entity graphs
Ranking: Entity-retrieval and recommendations
Evaluation: Evaluation of semantic search
Why a Knowledge Graph?
The world of Yahoo
 Search
› Web Search
› Yahoo Answers
 Communications
› Mail, Messenger, Groups
 Media
› Homepage
› News, Sports, Finance, Style…
 Video
 Flickr and Tumblr
 Advertising products
See everything.yahoo.com for all Yahoo products
In a perfect world, the Semantic Web is the end-game for IR
Search: entity-based results
 Enhanced results for entity-pages
› Based on metadata embedded in the page or semi-automated IE
› Yahoo Searchmonkey (2008)
• Kevin Haas, Peter Mika, Paul Tarjan, Roi Blanco: Enhanced results for web search. SIGIR 2011: 725-734
 Adopted industry-wide
› Google, Bing, Facebook, Twitter…
› Led to the launch of the schema.org effort
Search
 Understand entity-based queries
› ~70% of queries contain a named entity* (entity mention queries)
• brad pitt height
› ~50% of queries have an entity focus* (entity seeking queries)
• brad pitt attacked by fans
› ~10% of queries are looking for a class of entities*
• brad pitt movies
 Even more prominent on mobile
› Limited input/output
› Different types of queries
• Less research, more immediate needs
• Need answers or actions related to an entity, not pages to read
brad pitt height
how tall is
tall
…
* Statistics from [Pound et al. WWW2010]. Similar results in [Lin et al. WWW2012].
Search: entity-based experiences
 Entity display
› Information about the entity
› Combined information with provenance
 Related entity recommendation
› Where should I go next?
 Question-Answering
 Direct actions
› e.g. movie show times and tickets
Communications
 Extraction of information from email
› Notifications
• Package delivery updates, upcoming flights etc.
• Show up in Yahoo Search/Mail
› Better targeting for ads
• e.g. understanding past product purchases
 Personal knowledge combined with the Web
› e.g. contact information is completed from FB/LinkedIn/Twitter
Media
 Personalization
› Articles are classified by broad topics
› Named entities are extracted and linked to the KG
› Recommend other articles based on the extracted entities/topics
Show me less stories about this entity or topic
Requirements
 Entity-centric representation of the world
› Use cases in search, email, media, ads
 Integration of disparate information sources
› User/advertiser content and data
› Information from the Web
• Aggregate view of different domains relating to different facets of an entity
› Third-party licensed data
 Large scale
› Batch processing OK but at least daily updates
 High quality
 Multiple languages and markets
Building the Yahoo Knowledge Graph
Yahoo Knowledge Graph
Knowledge integration
Knowledge integration process
 Standard data fusion process
› Schema matching
• Map data to a common schema
› Entity reconciliation
• Determine which source entities refer to the same real-world entity
› Blending
• Aggregate information and resolve conflicts
 Result: unified knowledge base built from dozens of sources
› ~100 million unique entities and billions of facts
› Note: internal representations may be 10x larger due to modeling, metadata etc.
 Related work
› Bleiholder and Naumann: Data Fusion. ACM Computing Surveys, 2008.
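As a rough illustration of the schema-matching step above, the sketch below renames fields from two sources onto a common schema using a hand-written mapping table. The source names, field names, and target properties are all hypothetical and are not taken from Yahoo's ontology.

```python
# Hand-written mappings from source-specific fields to common-schema properties.
# Every source and property name here is a made-up placeholder.
SCHEMA_MAPPINGS = {
    "source_a": {"title": "name", "year": "releaseYear", "rated": "mpaaRating"},
    "source_b": {"film_name": "name", "release_year": "releaseYear"},
}


def map_to_common_schema(source, record):
    """Rename a source record's fields to common-schema properties.

    Fields without a mapping are dropped here; a real pipeline would
    more likely log them for editorial review.
    """
    mapping = SCHEMA_MAPPINGS[source]
    return {mapping[field]: value for field, value in record.items() if field in mapping}


mapped = map_to_common_schema(
    "source_b", {"film_name": "Superman", "release_year": 1978, "id": 42}
)
print(mapped)
```

A table like this has to be written once per source, which is manageable for a handful of sources but not for thousands of Web tables.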
Ontology matching
 Common ontology
› Covers the domains of interest of Yahoo
• Celebrities, movies, music, sports, finance, etc.
› Editorially maintained
› OWL ontology
• ~300 classes, ~800 datatype-props, ~500 object-props
› Protégé and custom tooling (e.g. documentation)
• Git for versioning (similar to schema.org)
› More detailed and expressive than schema.org
• Class disjunction, cardinality constraints, inverse properties, datatypes and units
• But limited use of complex class/property expressions
– e.g. MusicArtist = Musician OR Band
– Difficult for data consumers
 Manual schema mapping
› Works for ~10 sources
› Not scalable
• Web tables
• Language editions of Wikipedia
Entity Reconciliation
 Determine which source entities refer to the same real world object
[Figure: pairs of source entities marked == (same real-world object) or != (different objects)]
Entity reconciliation
1. Blocking
› Compute hashes for each entity
› Based on type+property value combinations, e.g. type:Movie+releaseYear=1978
› Multiple hashes per entity
› Optimize for high recall
2. Pair-wise matching within blocks
› Manual as well as machine-learned classifiers
3. Clustering
› Transitive closure of matching pairs
› Assign unique identifier
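The three steps above can be sketched roughly as follows. This is a toy version under stated assumptions: the blocking properties, the exact-match rule standing in for the classifier, and the `kg:` identifier format are all invented for illustration.

```python
from collections import defaultdict
from itertools import combinations


def blocking_keys(entity):
    """Emit several type+property hashes per entity (tuned for recall, not precision)."""
    keys = []
    for prop in ("releaseYear", "name"):  # hypothetical choice of blocking properties
        if prop in entity:
            value = str(entity[prop]).lower()
            keys.append(f"type:{entity['type']}+{prop}={value}")
    return keys


def is_match(a, b):
    """Stand-in for the manual or machine-learned pairwise classifier."""
    return a["name"].lower() == b["name"].lower() and a.get("releaseYear") == b.get("releaseYear")


def reconcile(entities):
    # 1. Blocking: group entity indices by shared hash.
    blocks = defaultdict(set)
    for idx, entity in enumerate(entities):
        for key in blocking_keys(entity):
            blocks[key].add(idx)

    # Union-find structure used for the clustering step.
    parent = list(range(len(entities)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # 2. Pair-wise matching, only between entities that share a block.
    for members in blocks.values():
        for i, j in combinations(sorted(members), 2):
            if is_match(entities[i], entities[j]):
                parent[find(j)] = find(i)  # 3. Clustering: merge the two clusters

    # One identifier per cluster, i.e. per transitive closure of matching pairs.
    return [f"kg:{find(i)}" for i in range(len(entities))]


movies = [
    {"type": "Movie", "name": "Superman", "releaseYear": 1978},
    {"type": "Movie", "name": "superman", "releaseYear": 1978},
    {"type": "Movie", "name": "Superman II", "releaseYear": 1980},
]
print(reconcile(movies))
```

Blocking keeps the pairwise step from comparing every entity with every other one; only entities that share at least one hash are ever passed to the classifier.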
Blending
cast: . | mpaaRating: R | releaseDate: 2001-01-21 | userRating: 8.5/10 | budget: $9.1m
cast: . | mpaaRating: R | releaseDate: 2001-03-16 | budget: $9.2m | criticRating: 92/100
 Source facts can be:
Conflicting
Complementary
Corroborating
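To make the three categories concrete, here is a small sketch that labels each property of the two example records from this slide. It assumes exact string equality as the comparison, which is a simplification, and it leaves out `cast` because its values are not legible in the source.

```python
def classify_facts(source_a, source_b):
    """Label each property as corroborating, conflicting, or complementary."""
    labels = {}
    for prop in sorted(set(source_a) | set(source_b)):
        if prop in source_a and prop in source_b:
            same = source_a[prop] == source_b[prop]
            labels[prop] = "corroborating" if same else "conflicting"
        else:
            # Only one source has a value, so it adds information.
            labels[prop] = "complementary"
    return labels


first = {"mpaaRating": "R", "releaseDate": "2001-01-21", "userRating": "8.5/10", "budget": "$9.1m"}
second = {"mpaaRating": "R", "releaseDate": "2001-03-16", "budget": "$9.2m", "criticRating": "92/100"}

for prop, label in classify_facts(first, second).items():
    print(f"{prop}: {label}")
```

Whether $9.1m and $9.2m should really count as conflicting depends on how values are normalized and compared before blending.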
Blending
 Rule-based system initially, moving to machine learning
 Features
› Source trustworthiness
› Value prior probabilities
› Data freshness
› Logical constraints
• Derived from ontology
• Programmatic, e.g. children must be born after parents
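A rule-based blender using some of these features might look like the sketch below. The trust scores, the freshness decay, and the source names are invented for illustration, and value priors are left out; this is not the production rule set.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Candidate:
    value: str
    source: str
    fetched: date


# Hypothetical trust scores; a learned model would replace these hand-set weights.
SOURCE_TRUST = {"licensed_feed": 0.9, "web_extraction": 0.6}


def score(candidate, today):
    """Combine source trustworthiness with a simple freshness decay."""
    age_days = (today - candidate.fetched).days
    freshness = 1.0 / (1.0 + age_days / 365.0)
    return SOURCE_TRUST.get(candidate.source, 0.1) * freshness


def blend(candidates, today, is_valid=lambda value: True):
    """Pick the best-scoring value that satisfies the logical constraint."""
    valid = [c for c in candidates if is_valid(c.value)]
    if not valid:
        return None
    return max(valid, key=lambda c: score(c, today)).value


# Programmatic constraint from the slide: a child must be born after the parent.
parent_birth = date(1950, 1, 1)
birth_dates = [
    Candidate("1945-03-02", "licensed_feed", date(2015, 1, 1)),
    Candidate("1975-03-02", "web_extraction", date(2015, 1, 1)),
]
chosen = blend(
    birth_dates,
    today=date(2015, 1, 13),
    is_valid=lambda value: date.fromisoformat(value) > parent_birth,
)
print(chosen)
```

In this toy case the more trusted source loses because its value violates the constraint, which is the kind of interaction that makes hand-written rules hard to maintain as sources multiply.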
Challenges
Challenge: scalable infrastructure
 Property graph/RDF databases are a poor fit for ETL and data fusion
› Large batch writes
› Require transaction support
› Navigation over the graph, no need for more complex joins
• Required information is at most two hops away
 Hadoop-based solutions
› Yahoo already hosts ~10k machines in Hadoop clusters
› HBase initially
› Moved to Spark/GraphX
• Support row/column as well as graph view of the data
› Separate inverted index for storing hashes
– Welch et al.: Fast and accurate incremental entity resolution relative to an entity knowledge base. CIKM 2012
 JSON-LD is used as input/output format
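For readers unfamiliar with JSON-LD, the sketch below shows what an entity record in that format could look like. The `example.org` vocabulary and identifiers and the property names are placeholders; the slides do not show the actual documents.

```python
import json

# A single entity as a JSON-LD document. The @context maps short property
# names onto a (placeholder) ontology namespace.
entity = {
    "@context": {"@vocab": "http://example.org/kg/ontology#"},
    "@id": "http://example.org/kg/entity/123",
    "@type": "Movie",
    "name": "The Terminator",
    "releaseYear": 1984,
    "cast": [
        {
            "@id": "http://example.org/kg/entity/456",
            "@type": "Actor",
            "name": "Arnold Schwarzenegger",
        }
    ],
}

serialized = json.dumps(entity, indent=2)
restored = json.loads(serialized)
print(serialized)
```

Because JSON-LD is plain JSON, it passes through batch jobs with ordinary JSON tooling while still carrying identifiers and types that tie each record back to the ontology.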
Challenge: quality
 Not enough to get it right… it has to be perfect
• Key difference between applied science and academic research
 Many sources of errors
› Poor quality or outdated source data
› Errors in extraction
› Errors in schema mapping and normalization
› Errors in merging (reconciliation)
• Blocking
• Disambiguation
• Blending
› Errors in display
• Image issues, poor title or description etc.
 Human intervention should be possible at every stage of the pipeline
Error in source (Wikipedia)
Reconciliation issue
Challenge: type classification and ranking
 Type classification
› Determine all the types of an entity
› Mostly a system issue, e.g. types are used in blocking
› Features
• NLP extraction
– e.g. Wikipedia first paragraph
• Taxonomy mapping
– e.g. Wikipedia category hierarchy
• Relationships
– e.g. acted in a Movie -> Actor
• Trust in source
– e.g. IMDB vs. Wikipedia for Actors
 What types are the most relevant?
› Arnold Schwarzenegger: Actor > Athlete > Officeholder > Politician (perhaps)
› Pope Francis is a Musician per MusicBrainz
› Barack Obama is an Actor per IMDB
 Display issue
› Right template and display label
 Moving from manual to machine-learned ranking
Challenge: type ranking
[Figure: facts about Arnold Schwarzenegger drawn as a graph. Edge labels: credit.actingPerformanceIn (many), historicJobPosition (several), partyAffiliation, description. Node labels include The Terminator, Television Director, Republican Party (United States), and the description "Arnold Alois Schwarzenegger is an Austrian-American actor, model, producer, director, activist, businessman, investor, philanthropist, former professional bodybuilder, ...". Candidate types: Athlete, Officeholder, Politician, Actor. Callout: much better known as an Actor.]
Type ranking features
 Implemented two novel unsupervised methods
› Entity likelihood
› Nearest-neighbor
 Ensemble learning on (features extracted from) entity attributes
› Cosine, KL-div, Dice, sumAF, minAF, meanAF, etc.
› Entity features, textual features, etc.
• E.g. order of type mentions in Wikipedia first paragraph
 Variants
› Combinations of the above
› Stacked ML, FMs
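The slides do not say exactly which vectors these measures compare. As one plausible reading, the sketch below compares an entity's property counts with a typical property profile per type, using cosine similarity, KL divergence, and the Dice coefficient. The profiles and counts are invented, and sumAF, minAF, and meanAF are not implemented because the slides do not define them.

```python
import math


def cosine(p, q):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    keys = set(p) | set(q)
    dot = sum(p.get(k, 0.0) * q.get(k, 0.0) for k in keys)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def kl_divergence(p, q, eps=1e-9):
    """KL(P || Q) over normalized counts, with additive smoothing to avoid log(0)."""
    keys = set(p) | set(q)
    p_total = sum(p.values()) + eps * len(keys)
    q_total = sum(q.values()) + eps * len(keys)
    total = 0.0
    for k in keys:
        pk = (p.get(k, 0.0) + eps) / p_total
        qk = (q.get(k, 0.0) + eps) / q_total
        total += pk * math.log(pk / qk)
    return total


def dice(p, q):
    """Dice coefficient on the sets of properties present in each vector."""
    a, b = set(p), set(q)
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0


# Invented counts: how often each property occurs for the entity and for each type.
entity_props = {"credit.actingPerformanceIn": 6, "historicJobPosition": 4, "partyAffiliation": 1}
type_profiles = {
    "Actor": {"credit.actingPerformanceIn": 10, "award": 2},
    "Politician": {"partyAffiliation": 3, "historicJobPosition": 5},
}

features = {
    name: {
        "cosine": cosine(entity_props, profile),
        "kl": kl_divergence(entity_props, profile),
        "dice": dice(entity_props, profile),
    }
    for name, profile in type_profiles.items()
}
print(features)
```

Each measure can disagree with the others on the same data (here Dice favors Politician while cosine favors Actor), which is one reason to feed them all to an ensemble rather than rank on any single one.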
Challenge: mining aliases and entity pages
 An extensive set of alternate names/labels is required by applications
› Named Entity Linking on short/long forms of text
 Some of this comes free from Wikipedia
› Anchor text, redirects
› e.g. all redirects to Brad Pitt
 Query logs are also a useful source of aliases
› e.g. incoming queries to Brad Pitt’s page on Wikipedia
 Can be extended to other sites if we find entity webpages
› A type of foreign key, but specifically on the Web
› e.g. Brad Pitt’s page on IMDB, RottenTomatoes
 Machine learned model to filter out poor aliases
› Ambiguous or not representative
Challenge: data normalization
 Issue at both scoring and blending time
 Multiple aspects
› Datatype match
• “113 minutes” vs. “PT1H53M”
› Text variants
• Spelling, punctuation, casing, abbreviations etc.
› Precision
• sim(weight=53 kg, weight=53.5kg)?
• sim(birthplace=California, birthplace=Los Angeles, California)
› Temporality
• e.g. Frank Sinatra married to {Barbara Blakeley, Barbara Marx, Barbara Marx Sinatra, Barbara Sinatra}
• Side issue: we don’t capture historical values
– e.g. Men’s Decathlon at 1976 Olympics was won by Bruce Jenner, not Caitlyn Jenner
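Two of these aspects, datatype match and precision, can be illustrated with a short sketch. The duration parser handles only the hours-and-minutes subset of ISO 8601 plus a plain "N minutes" form, and the similarity function is just one simple choice among many; both are assumptions made for illustration.

```python
import re

# Hours-and-minutes subset of ISO 8601 durations, e.g. "PT1H53M".
_ISO_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?$")
# Plain forms such as "113 minutes" or "113 min".
_PLAIN_MINUTES = re.compile(r"^(\d+)\s*(?:min|mins|minutes)$", re.IGNORECASE)


def runtime_minutes(raw):
    """Return a runtime in whole minutes, or None if the format is not recognized."""
    text = raw.strip()
    iso = _ISO_DURATION.match(text)
    if iso and (iso.group(1) or iso.group(2)):
        return int(iso.group(1) or 0) * 60 + int(iso.group(2) or 0)
    plain = _PLAIN_MINUTES.match(text)
    if plain:
        return int(plain.group(1))
    return None


def numeric_sim(a, b):
    """Relative closeness in [0, 1]; 1.0 means identical values."""
    if a == b:
        return 1.0
    return 1.0 - abs(a - b) / max(abs(a), abs(b))


print(runtime_minutes("113 minutes") == runtime_minutes("PT1H53M"))
print(numeric_sim(53, 53.5))
```

Once both runtimes are reduced to 113 minutes they corroborate each other instead of looking like a conflict, and a graded similarity lets 53 kg and 53.5 kg count as near-agreement rather than a mismatch.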
Challenge: relevance
 All information in the graph is true, but not equally relevant
 Relevance of entities to queries
› Query understanding
› Entity retrieval
 Relevance of relationships
› Required for entity recommendations (“people also search for”)
• Who is more relevant to Brad Pitt? Angelina Jolie or Jennifer Aniston?
Relationship ranking
 Machine-learned ranking based on a diverse set of features
› Relationship type
› Co-occurrence in usage data and text sources
• How often do people query for them together?
• How often is one entity mentioned in the context of the other?
› Popularity of each entity
• e.g. search views/clicks
› Graph-based metrics
• e.g. number of common related entities
 See
› Roi Blanco, Berkant Barla Cambazoglu, Peter Mika, Nicolas Torzec:
Entity Recommendations in Web Search. ISWC 2013
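A few of the features listed above can be computed as in the sketch below. The entities, counts, and hand-set weights are toy values standing in for real usage data and a learned ranker; the cited paper describes the actual system.

```python
def relationship_features(a, b, related, query_pairs, popularity):
    """Build a small feature vector for the candidate relationship a -> b."""
    common = related.get(a, set()) & related.get(b, set())
    return {
        "common_related": len(common),                        # graph-based metric
        "co_queries": query_pairs.get(frozenset((a, b)), 0),  # co-occurrence in usage data
        "popularity": popularity.get(b, 0),                   # popularity of the candidate
    }


def rank_related(a, candidates, related, query_pairs, popularity, weights):
    """Order candidates by a weighted sum of features (stand-in for a learned model)."""
    def score(b):
        feats = relationship_features(a, b, related, query_pairs, popularity)
        return sum(weights[name] * value for name, value in feats.items())

    return sorted(candidates, key=score, reverse=True)


# Toy data: A is the query entity, B and C are candidate related entities.
related = {"A": {"X", "Y", "Z"}, "B": {"X", "Y"}, "C": {"Z"}}
query_pairs = {frozenset(("A", "B")): 50, frozenset(("A", "C")): 20}
popularity = {"B": 10, "C": 30}
weights = {"common_related": 1.0, "co_queries": 0.1, "popularity": 0.05}

ranking = rank_related("A", ["B", "C"], related, query_pairs, popularity, weights)
print(ranking)
```

With these toy numbers B outranks C even though C is more popular, because B shares more neighbors with A and is queried together with A more often.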
Conclusions
 Yahoo benefits from a unified view of domain knowledge
› Focusing on domains of interest to Yahoo
› Complementary information from an array of sources
› Use cases in Search, Ads, Media
 Data integration challenge
› Triple stores/graph databases are a poor fit
• Reasoning for data validation (not materialization)
› But there is benefit to Semantic Web technology
• OWL ontology language
• JSON-LD
• Data on the Web (schema.org, DBpedia…)
Future work
 Scope, size and complexity of Yahoo Knowledge will expand
› Combination of world knowledge and personal knowledge
› Advanced extraction from the Web
› Additional domains
› Tasks/actions
 All of the challenges mentioned will need better answers…
› Can you help us?
Q&A
 Credits
› Yahoo Knowledge engineering team in Sunnyvale and Taipei
› Yahoo Labs scientists and engineers in Sunnyvale and London
 Contact me
› pmika@yahoo-inc.com
› @pmika
› http://www.slideshare.net/pmika/

Mais conteúdo relacionado

Mais procurados

Semantic Search at Yahoo
Semantic Search at YahooSemantic Search at Yahoo
Semantic Search at YahooPeter Mika
 
Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Peter Mika
 
Semantic Search keynote at CORIA 2015
Semantic Search keynote at CORIA 2015Semantic Search keynote at CORIA 2015
Semantic Search keynote at CORIA 2015Peter Mika
 
Making the Web Searchable - Keynote ICWE 2015
Making the Web Searchable - Keynote ICWE 2015Making the Web Searchable - Keynote ICWE 2015
Making the Web Searchable - Keynote ICWE 2015Peter Mika
 
Peter Mika's Presentation at SSSW 2011
Peter Mika's Presentation at SSSW 2011Peter Mika's Presentation at SSSW 2011
Peter Mika's Presentation at SSSW 2011sssw2011
 
Jim Hendler's Presentation at SSSW 2011
Jim Hendler's Presentation at SSSW 2011Jim Hendler's Presentation at SSSW 2011
Jim Hendler's Presentation at SSSW 2011sssw2011
 
Semtech bizsemanticsearchtutorial
Semtech bizsemanticsearchtutorialSemtech bizsemanticsearchtutorial
Semtech bizsemanticsearchtutorialBarbara Starr
 
Social Networks and the Semantic Web: a retrospective of the past 10 years
Social Networks and the Semantic Web: a retrospective of the past 10 yearsSocial Networks and the Semantic Web: a retrospective of the past 10 years
Social Networks and the Semantic Web: a retrospective of the past 10 yearsPeter Mika
 
Harith Alani's presentation at SSSW 2011
Harith Alani's presentation at SSSW 2011Harith Alani's presentation at SSSW 2011
Harith Alani's presentation at SSSW 2011sssw2011
 
Beyond document retrieval using semantic annotations
Beyond document retrieval using semantic annotations Beyond document retrieval using semantic annotations
Beyond document retrieval using semantic annotations Roi Blanco
 
Wimmics Overview 2021
Wimmics Overview 2021Wimmics Overview 2021
Wimmics Overview 2021Fabien Gandon
 
The Semantic Web
The Semantic WebThe Semantic Web
The Semantic Webostephens
 
LD4 Wikidata Affinity Group - Shorthouse
LD4 Wikidata Affinity Group - ShorthouseLD4 Wikidata Affinity Group - Shorthouse
LD4 Wikidata Affinity Group - ShorthouseDavid Shorthouse
 
Haystack 2019 - Search-based recommendations at Politico - Ryan Kohl
Haystack 2019 - Search-based recommendations at Politico - Ryan KohlHaystack 2019 - Search-based recommendations at Politico - Ryan Kohl
Haystack 2019 - Search-based recommendations at Politico - Ryan KohlOpenSource Connections
 
Search Analytics: Conversations with Your Customers
Search Analytics: Conversations with Your CustomersSearch Analytics: Conversations with Your Customers
Search Analytics: Conversations with Your Customersrichwig
 
Lecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic WebLecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic WebMarina Santini
 
The Social Semantic Web: An Introduction
The Social Semantic Web: An IntroductionThe Social Semantic Web: An Introduction
The Social Semantic Web: An IntroductionJohn Breslin
 

Mais procurados (20)

Semantic Search at Yahoo
Semantic Search at YahooSemantic Search at Yahoo
Semantic Search at Yahoo
 
Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012
 
Semantic Search keynote at CORIA 2015
Semantic Search keynote at CORIA 2015Semantic Search keynote at CORIA 2015
Semantic Search keynote at CORIA 2015
 
Making the Web Searchable - Keynote ICWE 2015
Making the Web Searchable - Keynote ICWE 2015Making the Web Searchable - Keynote ICWE 2015
Making the Web Searchable - Keynote ICWE 2015
 
Peter Mika's Presentation at SSSW 2011
Peter Mika's Presentation at SSSW 2011Peter Mika's Presentation at SSSW 2011
Peter Mika's Presentation at SSSW 2011
 
Jim Hendler's Presentation at SSSW 2011
Jim Hendler's Presentation at SSSW 2011Jim Hendler's Presentation at SSSW 2011
Jim Hendler's Presentation at SSSW 2011
 
Semtech bizsemanticsearchtutorial
Semtech bizsemanticsearchtutorialSemtech bizsemanticsearchtutorial
Semtech bizsemanticsearchtutorial
 
Social Networks and the Semantic Web: a retrospective of the past 10 years
Social Networks and the Semantic Web: a retrospective of the past 10 yearsSocial Networks and the Semantic Web: a retrospective of the past 10 years
Social Networks and the Semantic Web: a retrospective of the past 10 years
 
Harith Alani's presentation at SSSW 2011
Harith Alani's presentation at SSSW 2011Harith Alani's presentation at SSSW 2011
Harith Alani's presentation at SSSW 2011
 
Beyond document retrieval using semantic annotations
Beyond document retrieval using semantic annotations Beyond document retrieval using semantic annotations
Beyond document retrieval using semantic annotations
 
Wimmics Overview 2021
Wimmics Overview 2021Wimmics Overview 2021
Wimmics Overview 2021
 
Semantic Web, e-commerce
Semantic Web, e-commerceSemantic Web, e-commerce
Semantic Web, e-commerce
 
The Semantic Web
The Semantic WebThe Semantic Web
The Semantic Web
 
Searching Online
Searching OnlineSearching Online
Searching Online
 
Tactical Information Gathering
Tactical Information GatheringTactical Information Gathering
Tactical Information Gathering
 
LD4 Wikidata Affinity Group - Shorthouse
LD4 Wikidata Affinity Group - ShorthouseLD4 Wikidata Affinity Group - Shorthouse
LD4 Wikidata Affinity Group - Shorthouse
 
Haystack 2019 - Search-based recommendations at Politico - Ryan Kohl
Haystack 2019 - Search-based recommendations at Politico - Ryan KohlHaystack 2019 - Search-based recommendations at Politico - Ryan Kohl
Haystack 2019 - Search-based recommendations at Politico - Ryan Kohl
 
Search Analytics: Conversations with Your Customers
Search Analytics: Conversations with Your CustomersSearch Analytics: Conversations with Your Customers
Search Analytics: Conversations with Your Customers
 
Lecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic WebLecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic Web
 
The Social Semantic Web: An Introduction
The Social Semantic Web: An IntroductionThe Social Semantic Web: An Introduction
The Social Semantic Web: An Introduction
 

Destaque

Brands, packaging, and other product feature
Brands, packaging, and other product featureBrands, packaging, and other product feature
Brands, packaging, and other product featureViqar Ahmad Usmani
 
EnergyUse - A Collective Semantic Platform for Monitoring and Discussing Ener...
EnergyUse - A Collective Semantic Platform for Monitoring and Discussing Ener...EnergyUse - A Collective Semantic Platform for Monitoring and Discussing Ener...
EnergyUse - A Collective Semantic Platform for Monitoring and Discussing Ener...Gregoire Burel
 
Hackathon s pb
Hackathon s pbHackathon s pb
Hackathon s pbPeter Mika
 
Investigating the Semantic Gap through Query Log Analysis
Investigating the Semantic Gap through Query Log AnalysisInvestigating the Semantic Gap through Query Log Analysis
Investigating the Semantic Gap through Query Log AnalysisPeter Mika
 
Semantics and linked data at astra zeneca
Semantics and linked data at astra zenecaSemantics and linked data at astra zeneca
Semantics and linked data at astra zenecaKerstin Forsberg
 
Future of Search | Yury Lifshits, Yahoo! Research
Future of Search | Yury Lifshits, Yahoo! ResearchFuture of Search | Yury Lifshits, Yahoo! Research
Future of Search | Yury Lifshits, Yahoo! ResearchYury Lifshits
 
Process Oriented Knowledge Management
Process Oriented Knowledge ManagementProcess Oriented Knowledge Management
Process Oriented Knowledge ManagementMichael Wyrsch
 
Linked data presentation for who umc 21 jan 2015
Linked data presentation for who umc 21 jan 2015Linked data presentation for who umc 21 jan 2015
Linked data presentation for who umc 21 jan 2015Kerstin Forsberg
 
Linked Data efforts for data standards in biopharma and healthcare
Linked Data efforts for data standards in biopharma and healthcareLinked Data efforts for data standards in biopharma and healthcare
Linked Data efforts for data standards in biopharma and healthcareKerstin Forsberg
 
Kdd 2014 tutorial bringing structure to text - chi
Kdd 2014 tutorial   bringing structure to text - chiKdd 2014 tutorial   bringing structure to text - chi
Kdd 2014 tutorial bringing structure to text - chiBarbara Starr
 
Semantic Web and Schema.org
Semantic Web and Schema.orgSemantic Web and Schema.org
Semantic Web and Schema.orgrvguha
 
Nestle Good Food, Good Life - SPACE Matrix, BCG Matrix and Product Positionin...
Nestle Good Food, Good Life - SPACE Matrix, BCG Matrix and Product Positionin...Nestle Good Food, Good Life - SPACE Matrix, BCG Matrix and Product Positionin...
Nestle Good Food, Good Life - SPACE Matrix, BCG Matrix and Product Positionin...Mita Angela M. Dimalanta
 
Semantic Blockchains in the Supply Chain
Semantic Blockchains in the Supply ChainSemantic Blockchains in the Supply Chain
Semantic Blockchains in the Supply ChainChristopher Brewster
 
Trends in knowledge management
Trends in knowledge managementTrends in knowledge management
Trends in knowledge managementSIKM
 
Self Efficacy Presentation
Self Efficacy PresentationSelf Efficacy Presentation
Self Efficacy Presentationkkervin
 
15 Hot Knowledge Management Trends
15 Hot Knowledge Management Trends15 Hot Knowledge Management Trends
15 Hot Knowledge Management TrendsAxero Solutions
 
Knowledge Management In The Real World
Knowledge  Management In The  Real  WorldKnowledge  Management In The  Real  World
Knowledge Management In The Real WorldStan Garfield
 
Knowledge management in theory and practice
Knowledge management in theory and practiceKnowledge management in theory and practice
Knowledge management in theory and practicethewi025
 
Knowledge Management Presentation
Knowledge Management PresentationKnowledge Management Presentation
Knowledge Management Presentationkreaume
 

Destaque (20)

Brands, packaging, and other product feature
Brands, packaging, and other product featureBrands, packaging, and other product feature
Brands, packaging, and other product feature
 
EnergyUse - A Collective Semantic Platform for Monitoring and Discussing Ener...
EnergyUse - A Collective Semantic Platform for Monitoring and Discussing Ener...EnergyUse - A Collective Semantic Platform for Monitoring and Discussing Ener...
EnergyUse - A Collective Semantic Platform for Monitoring and Discussing Ener...
 
Hackathon s pb
Hackathon s pbHackathon s pb
Hackathon s pb
 
Investigating the Semantic Gap through Query Log Analysis
Investigating the Semantic Gap through Query Log AnalysisInvestigating the Semantic Gap through Query Log Analysis
Investigating the Semantic Gap through Query Log Analysis
 
Semantics and linked data at astra zeneca
Semantics and linked data at astra zenecaSemantics and linked data at astra zeneca
Semantics and linked data at astra zeneca
 
Future of Search | Yury Lifshits, Yahoo! Research
Future of Search | Yury Lifshits, Yahoo! ResearchFuture of Search | Yury Lifshits, Yahoo! Research
Future of Search | Yury Lifshits, Yahoo! Research
 
Process Oriented Knowledge Management
Process Oriented Knowledge ManagementProcess Oriented Knowledge Management
Process Oriented Knowledge Management
 
Linked data presentation for who umc 21 jan 2015
Linked data presentation for who umc 21 jan 2015Linked data presentation for who umc 21 jan 2015
Linked data presentation for who umc 21 jan 2015
 
Linked Data efforts for data standards in biopharma and healthcare
Linked Data efforts for data standards in biopharma and healthcareLinked Data efforts for data standards in biopharma and healthcare
Linked Data efforts for data standards in biopharma and healthcare
 
Kdd 2014 tutorial bringing structure to text - chi
Kdd 2014 tutorial   bringing structure to text - chiKdd 2014 tutorial   bringing structure to text - chi
Kdd 2014 tutorial bringing structure to text - chi
 
Semantic Web and Schema.org
Semantic Web and Schema.orgSemantic Web and Schema.org
Semantic Web and Schema.org
 
Nestle Good Food, Good Life - SPACE Matrix, BCG Matrix and Product Positionin...
Nestle Good Food, Good Life - SPACE Matrix, BCG Matrix and Product Positionin...Nestle Good Food, Good Life - SPACE Matrix, BCG Matrix and Product Positionin...
Nestle Good Food, Good Life - SPACE Matrix, BCG Matrix and Product Positionin...
 
Semantic Blockchains in the Supply Chain
Semantic Blockchains in the Supply ChainSemantic Blockchains in the Supply Chain
Semantic Blockchains in the Supply Chain
 
Trends in knowledge management
Trends in knowledge managementTrends in knowledge management
Trends in knowledge management
 
Self Efficacy Presentation
Self Efficacy PresentationSelf Efficacy Presentation
Self Efficacy Presentation
 
Smart Enterprises
Smart EnterprisesSmart Enterprises
Smart Enterprises
 
15 Hot Knowledge Management Trends
15 Hot Knowledge Management Trends15 Hot Knowledge Management Trends
15 Hot Knowledge Management Trends
 
Knowledge Management In The Real World
Knowledge  Management In The  Real  WorldKnowledge  Management In The  Real  World
Knowledge Management In The Real World
 
Knowledge management in theory and practice
Knowledge management in theory and practiceKnowledge management in theory and practice
Knowledge management in theory and practice
 
Knowledge Management Presentation
Knowledge Management PresentationKnowledge Management Presentation
Knowledge Management Presentation
 

Semelhante a Knowledge Integration in Practice

(Keynote) Peter Mika - “Making the Web Searchable”
(Keynote) Peter Mika - “Making the Web Searchable”(Keynote) Peter Mika - “Making the Web Searchable”
(Keynote) Peter Mika - “Making the Web Searchable”icwe2015
 
Tech M&A Forecast 2011
Tech M&A Forecast 2011Tech M&A Forecast 2011
Tech M&A Forecast 2011Alina Soltys
 
Enterprise Open Source Intelligence Gathering
Enterprise Open Source Intelligence GatheringEnterprise Open Source Intelligence Gathering
Enterprise Open Source Intelligence GatheringTom Eston
 
Informationliteracy
InformationliteracyInformationliteracy
InformationliteracyYvonne M
 
Semantic mark-up with schema.org: helping search engines understand the Web
Semantic mark-up with schema.org: helping search engines understand the WebSemantic mark-up with schema.org: helping search engines understand the Web
Semantic mark-up with schema.org: helping search engines understand the WebPeter Mika
 
Mining Web content for Enhanced Search
Mining Web content for Enhanced Search Mining Web content for Enhanced Search
Mining Web content for Enhanced Search Roi Blanco
 
Tech M&A Monthly: 10 Keys to a Valuable Valuation
Tech M&A Monthly: 10 Keys to a Valuable ValuationTech M&A Monthly: 10 Keys to a Valuable Valuation
Tech M&A Monthly: 10 Keys to a Valuable ValuationCorum Group
 
2014 Tech M&A Monthly - Quarterly Report
2014 Tech M&A Monthly - Quarterly Report2014 Tech M&A Monthly - Quarterly Report
2014 Tech M&A Monthly - Quarterly ReportCorum Group
 
Enterprise SEO and AI - Houston IMA Interactive Strategies 17
Enterprise SEO and AI - Houston IMA Interactive Strategies 17Enterprise SEO and AI - Houston IMA Interactive Strategies 17
Enterprise SEO and AI - Houston IMA Interactive Strategies 17Keith Goode
 
The Art of Connecting: Recruit Like an FBI Agent, the Original Social Enginee...
The Art of Connecting: Recruit Like an FBI Agent, the Original Social Enginee...The Art of Connecting: Recruit Like an FBI Agent, the Original Social Enginee...
The Art of Connecting: Recruit Like an FBI Agent, the Original Social Enginee...RecruitDC
 
GWU Ethics in Publishing 2015 - Is is ethical for publishers to make a profit?
GWU Ethics in Publishing 2015 - Is is ethical for publishers to make a profit?GWU Ethics in Publishing 2015 - Is is ethical for publishers to make a profit?
GWU Ethics in Publishing 2015 - Is is ethical for publishers to make a profit?Stephen Rhind-Tutt
 
2013-08 10 evil things - Northeast PHP Conference Keynote
2013-08 10 evil things - Northeast PHP Conference Keynote2013-08 10 evil things - Northeast PHP Conference Keynote
2013-08 10 evil things - Northeast PHP Conference Keynoteterry chay
 
Social won’t work without search….and today search will be improved by social...
Social won’t work without search….and today search will be improved by social...Social won’t work without search….and today search will be improved by social...
Social won’t work without search….and today search will be improved by social...Michael Pranikoff
 
SEOktoberfest 2022 - Blending SEO, Discover, & Entity Extraction to Analyze D...
SEOktoberfest 2022 - Blending SEO, Discover, & Entity Extraction to Analyze D...SEOktoberfest 2022 - Blending SEO, Discover, & Entity Extraction to Analyze D...
SEOktoberfest 2022 - Blending SEO, Discover, & Entity Extraction to Analyze D...Amsive
 

Semelhante a Knowledge Integration in Practice (20)

(Keynote) Peter Mika - “Making the Web Searchable”
(Keynote) Peter Mika - “Making the Web Searchable”(Keynote) Peter Mika - “Making the Web Searchable”
(Keynote) Peter Mika - “Making the Web Searchable”
 
Tech M&A Forecast 2011
Tech M&A Forecast 2011Tech M&A Forecast 2011
Tech M&A Forecast 2011
 
Enterprise Open Source Intelligence Gathering
Enterprise Open Source Intelligence GatheringEnterprise Open Source Intelligence Gathering
Enterprise Open Source Intelligence Gathering
 
Stuart
StuartStuart
Stuart
 
Knowledge Integration in Practice

  • 1. Knowledge Integration in Practice Peter Mika, Director of Semantic Search, Yahoo Labs | January 13, 2015
  • 2. Agenda 2  Intro  Yahoo’s Knowledge Graph › Why a Knowledge Graph for Yahoo? › Building the Knowledge Graph › Challenges  Future work  Q&A Disclaimers: • Yahoo’s Knowledge Graph is the work of many at Yahoo, so I can’t speak to all of it with authority • I’ll be rather loose with terminology…
  • 3. About Yahoo 3  Yahoo makes the world's daily habits inspiring and entertaining › An online media and technology company • 1 billion+ monthly users • 600 million+ monthly mobile users • #3 US internet destination* • 81% of the US internet audience* › Founded in 1994 by Jerry Yang and David Filo › Headquartered in Sunnyvale, California › Led by Marissa Mayer, CEO (since July, 2012) › 10,700 employees (as of Sept 30, 2015) *ComScore Media Metrix, Aug 2015
  • 4. Yahoo Labs  Yahoo’s global research organization › Impact on Yahoo’s products AND academic excellence › Established in 2005 › ~200 scientists and research engineers › Wide range of disciplines › Locations in Sunnyvale, New York, Haifa › Led by Ron Brachman, Chief Scientist and Head of Labs › Academic programs › Visit • labs.yahoo.com • Tumblr/Flickr/LinkedIn/Facebook/Twitter
  • 5. Semantic Search at Yahoo Labs London Extraction Integration Indexing Ranking Evaluation Information extraction from text and the Web Knowledge representation and data fusion Efficient indexing of text annotations and entity graphs Entity-retrieval and recommendations Evaluation of semantic search
  • 6. Why a Knowledge Graph? 6
  • 7. The world of Yahoo  Search › Web Search › Yahoo Answers  Communications › Mail, Messenger, Groups  Media › Homepage › News, Sports, Finance, Style…  Video  Flickr and Tumblr  Advertising products. See everything.yahoo.com for all Yahoo products
  • 8. In a perfect world, the Semantic Web is the end-game for IR #ROI_BLANCO #ROI_BLANCO #ROI_BLANCO
  • 9. Search: entity-based results 9  Enhanced results for entity-pages › Based on metadata embedded in the page or semi-automated IE › Yahoo Searchmonkey (2008) • Kevin Haas, Peter Mika, Paul Tarjan, Roi Blanco: Enhanced results for web search. SIGIR 2011: 725-734  Adopted industry-wide › Google, Bing, Facebook, Twitter… › Leads to the launch of schema.org effort
  • 10. Search 10  Understand entity-based queries › ~70% of queries contain a named entity* (entity mention queries) • brad pitt height › ~50% of queries have an entity focus* (entity seeking queries) • brad pitt attacked by fans › ~10% of queries are looking for a class of entities* • brad pitt movies  Even more prominent on mobile › Limited input/output › Different types of queries • Less research, more immediate needs • Need answers or actions related to an entity, not pages to read brad pitt height how tall is tall … * Statistics from [Pound et al. WWW2010]. Similar results in [Lin et al. WWW2012].
  • 11. Search: entity-based experiences  Entity display › Information about the entity › Combined information with provenance  Related entity recommendation › Where should I go next?  Question-Answering  Direct actions › e.g. movie show times and tickets
  • 12. Communications  Extraction of information from email › Notifications • Package delivery updates, upcoming flights etc. • Show up in Yahoo Search/Mail › Better targeting for ads • e.g. understanding past product purchases  Personal knowledge combined with the Web › e.g. contact information is completed from FB/LinkedIn/Twitter
  • 13. Media  Personalization › Articles are classified by broad topics › Named entities are extracted and linked to the KG › Recommend other articles based on the extracted entities/topics › User feedback: “Show me fewer stories about this entity or topic”
  • 14. Requirements  Entity-centric representation of the world › Use cases in search, email, media, ads  Integration of disparate information sources › User/advertiser content and data › Information from the Web • Aggregate view of different domains relating to different facets of an entity › Third-party licensed data  Large scale › Batch processing OK but at least daily updates  High quality  Multiple languages and markets
  • 15. Building the Yahoo Knowledge Graph 15
  • 19. Knowledge integration process  Standard data fusion process › Schema matching • Map data to a common schema › Entity reconciliation • Determine which source entities refer to the same real-world entity › Blending • Aggregate information and resolve conflicts  Result: unified knowledge base built from dozens of sources › ~100 million unique entities and billions of facts › Note: internal representations may be 10x larger due to modeling, metadata etc.  Related work › Bleiholder and Naumann: Data Fusion. ACM Computing Surveys, 2008.
  • 20. Ontology matching  Common ontology › Covers the domains of interest of Yahoo • Celebrities, movies, music, sports, finance, etc. › Editorially maintained › OWL ontology • ~300 classes, ~800 datatype-props, ~500 object-props › Protégé and custom tooling (e.g. documentation) • Git for versioning (similar to schema.org) › More detailed and expressive than schema.org • Class disjunction, cardinality constraints, inverse properties, datatypes and units • But limited use of complex class/property expressions – e.g. MusicArtist = Musician OR Band – Difficult for data consumers  Manual schema mapping › Works for ~10 sources › Not scalable • Web tables • Language editions of Wikipedia
  • 21. Entity Reconciliation  Determine which source entities refer to the same real-world object [Figure: pairs of source entities labeled as matching (==) or non-matching (!=)]
  • 22. Entity reconciliation 22 1. Blocking › Compute hashes for each entity › Based on type+property value combinations, e.g. type:Movie+releaseYear=1978 › Multiple hashes per entity › Optimize for high recall 2. Pair-wise matching within blocks › Manual as well as machine-learned classifiers 3. Clustering › Transitive closure of matching pairs › Assign unique identifier
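The three steps on this slide (blocking, pair-wise matching within blocks, clustering by transitive closure) can be sketched in a few lines. This is only an illustration under stated assumptions, not Yahoo's implementation: the records, the second blocking key, and the token-overlap matcher are all made up for the example, and the matcher merely stands in for the manual or machine-learned classifiers the slide mentions.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical source records; ids and field names are illustrative only.
ENTITIES = [
    {"id": "a:1", "type": "Movie", "title": "Superman", "releaseYear": 1978},
    {"id": "b:7", "type": "Movie", "title": "Superman: The Movie", "releaseYear": 1978},
    {"id": "c:3", "type": "Movie", "title": "Superman", "releaseYear": 1978},
    {"id": "a:2", "type": "Movie", "title": "Grease", "releaseYear": 1978},
    {"id": "b:9", "type": "Movie", "title": "Alien", "releaseYear": 1979},
]


def blocking_keys(entity):
    """Step 1: several type+property hashes per entity, favoring recall."""
    yield f"type:{entity['type']}+releaseYear={entity['releaseYear']}"
    first_token = entity["title"].lower().split()[0].strip(":")
    yield f"type:{entity['type']}+titleToken={first_token}"


def is_match(a, b):
    """Step 2: toy stand-in for a pair-wise classifier (assumption)."""
    ta = set(a["title"].lower().replace(":", "").split())
    tb = set(b["title"].lower().replace(":", "").split())
    jaccard = len(ta & tb) / len(ta | tb)
    return a["releaseYear"] == b["releaseYear"] and jaccard >= 0.3


def reconcile(entities):
    # Group entities into blocks by every key they produce.
    blocks = defaultdict(list)
    for entity in entities:
        for key in blocking_keys(entity):
            blocks[key].append(entity)

    # Union-find gives the transitive closure of matching pairs (step 3).
    parent = {e["id"]: e["id"] for e in entities}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Compare only pairs that share a block.
    for members in blocks.values():
        for a, b in combinations(members, 2):
            if is_match(a, b):
                parent[find(a["id"])] = find(b["id"])

    clusters = defaultdict(set)
    for e in entities:
        clusters[find(e["id"])].add(e["id"])
    return sorted(sorted(c) for c in clusters.values())


if __name__ == "__main__":
    for cluster in reconcile(ENTITIES):
        print(cluster)
```

Each resulting cluster would then receive a unique identifier. Union-find is one common way to compute the transitive closure; the slide does not say which method was used.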
  • 23. Blending  Source facts can be: Conflicting, Complementary, Corroborating. Example with two sources describing the same movie › Source 1: cast: . mpaaRating: R releaseDate: 2001-01-21 userRating: 8.5/10 budget: $9.1m › Source 2: cast: . mpaaRating: R releaseDate: 2001-03-16 budget: $9.2m criticRating: 92/100
  • 24. Blending 24  Rule-based system initially, moving to machine learning  Features › Source trustworthiness › Value prior probabilities › Data freshness › Logical constraints • Derived from ontology • Programmatic, e.g. children must be born after parents
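A rule-based blender of the kind this slide describes as the starting point might look like the sketch below. It is an assumption-laden illustration: the trust scores, the freshness half-life, and the voting rule are invented, and only the property values are taken from the two-source movie example on the previous slide. Corroborating values add up their votes, conflicting values compete, and complementary values pass through.

```python
from datetime import date

# Hypothetical per-source trust scores (assumption, not from the talk).
SOURCE_TRUST = {"source1": 0.9, "source2": 0.6}

OBSERVED = date(2015, 1, 1)  # hypothetical crawl date for both sources

# (source, property, value) claims from the movie example.
RAW = [
    ("source1", "mpaaRating", "R"),
    ("source1", "releaseDate", "2001-01-21"),
    ("source1", "userRating", "8.5/10"),
    ("source2", "mpaaRating", "R"),
    ("source2", "releaseDate", "2001-03-16"),
    ("source2", "criticRating", "92/100"),
]
CLAIMS = [
    {"source": s, "property": p, "value": v, "observed": OBSERVED}
    for s, p, v in RAW
]


def blend(claims, today, half_life_days=365.0):
    """Pick one value per property by summing trust * freshness votes."""
    scores = {}
    for claim in claims:
        age_days = (today - claim["observed"]).days
        freshness = 0.5 ** (age_days / half_life_days)
        weight = SOURCE_TRUST.get(claim["source"], 0.1) * freshness
        by_value = scores.setdefault(claim["property"], {})
        by_value[claim["value"]] = by_value.get(claim["value"], 0.0) + weight
    return {prop: max(vals, key=vals.get) for prop, vals in scores.items()}


def violates_birth_order(parent_birth, child_birth):
    """Programmatic constraint from the slide: children are born after parents."""
    return child_birth <= parent_birth


if __name__ == "__main__":
    print(blend(CLAIMS, today=date(2015, 1, 13)))
```

In a machine-learned blender, the same signals (source trust, value priors, freshness, constraint violations) would become features rather than hand-set weights.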
  • 26. Challenge: scalable infrastructure 26  Property graph/RDF databases are a poor fit for ETL and data fusion › Large batch writes › Require transaction support › Navigation over the graph, no need for more complex joins • Required information is at most two hops away  Hadoop-based solutions › Yahoo already hosts ~10k machines in Hadoop clusters › HBase initially › Moved to Spark/GraphX • Support row/column as well as graph view of the data › Separate inverted index for storing hashes – Welch et al.: Fast and accurate incremental entity resolution relative to an entity knowledge base. CIKM 2012  JSON-LD is used as input/output format
  • 27. Challenge: quality 27  Not enough to get it right… it has to be perfect • Key difference between applied science and academic research  Many sources of errors › Poor quality or outdated source data › Errors in extraction › Errors in schema mapping and normalization › Errors in merging (reconciliation) • Blocking • Disambiguation • Blending › Errors in display • Image issue, poor title or description etc.  Human intervention should be possible at every stage of the pipeline
  • 31. Challenge: type classification and ranking 31  Type classification › Determine all the types of an entity › Mostly system issue, e.g. types are used in blocking › Features • NLP extraction – e.g. Wikipedia first paragraph • Taxonomy mapping – e.g. Wikipedia category hierarchy • Relationships – e.g. acted in a Movie -> Actor • Trust in source – e.g. IMDB vs. Wikipedia for Actors
  • 32. Challenge: type ranking  What types are the most relevant? › Arnold Schwarzenegger: Actor > Athlete > Officeholder > Politician (perhaps) › Pope Francis is a Musician per MusicBrainz › Barack Obama is an Actor per IMDB  Display issue › Right template and display label  Moving from manual to machine-learned ranking (Callout: Much better known as an Actor)
  • 33. [Figure: entity graph for Arnold Schwarzenegger. Repeated credit.actingPerformanceIn edges point to The Terminator; repeated historicJobPosition edges point to Television Director, ...; partyAffiliation points to Republican Party (United States); description: “Arnold Alois Schwarzenegger is an Austrian-American actor, model, producer, director, activist, businessman, investor, philanthropist, former professional bodybuilder, ...”. Candidate types: Athlete, Officeholder, Politician, Actor]
  • 34. Type ranking features  Implemented two novel unsupervised methods › Entity likelihood › Nearest-neighbor  Ensemble learning on (features extracted from) entity attributes › Cosine, KL-div, Dice, sumAF, minAF, meanAF, etc. › Entity features, textual features, etc. • E.g. order of type mentions in Wikipedia first paragraph  Variants › Combinations of the above › Stacked ML, FMs
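The slide names cosine, KL divergence and Dice among the measures computed over entity attributes. The sketch below shows these three measures and one plausible way to use them: comparing an entity's attribute counts with a per-type attribute profile. That usage, the profiles, and all counts are assumptions for illustration; the slide does not describe how the features were actually constructed.

```python
import math


def cosine(p, q):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(p.get(k, 0.0) * q.get(k, 0.0) for k in set(p) | set(q))
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(
        sum(v * v for v in q.values())
    )
    return dot / norm if norm else 0.0


def dice(a, b):
    """Dice coefficient between two sets of attribute names."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) after additive smoothing and normalization."""
    keys = set(p) | set(q)
    ps = {k: p.get(k, 0.0) + eps for k in keys}
    qs = {k: q.get(k, 0.0) + eps for k in keys}
    zp, zq = sum(ps.values()), sum(qs.values())
    return sum(
        (ps[k] / zp) * math.log((ps[k] / zp) / (qs[k] / zq)) for k in keys
    )


# Hypothetical attribute counts for one entity and two type profiles.
ENTITY = {"credit.actingPerformanceIn": 6, "historicJobPosition": 4, "partyAffiliation": 1}
TYPE_PROFILES = {
    "Actor": {"credit.actingPerformanceIn": 10, "award": 2},
    "Politician": {"partyAffiliation": 5, "historicJobPosition": 5},
}


def rank_types(entity, profiles):
    """Order candidate types by cosine similarity to the entity."""
    return sorted(profiles, key=lambda t: cosine(entity, profiles[t]), reverse=True)


if __name__ == "__main__":
    print(rank_types(ENTITY, TYPE_PROFILES))
```

In an ensemble, each measure would typically contribute one feature per candidate type, with a learned model combining them.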
  • 35. Challenge: mining aliases and entity pages  An extensive set of alternate names/labels is required by applications › Named Entity Linking on short/long forms of text  Some of this comes free from Wikipedia › Anchor text, redirects › e.g. all redirects to Brad Pitt  Query logs are also a useful source of aliases › e.g. incoming queries to Brad Pitt’s page on Wikipedia  Can be extended to other sites if we find entity webpages › A type of foreign key, but specifically on the Web › e.g. Brad Pitt’s page on IMDB, RottenTomatoes  Machine-learned model to filter out poor aliases › Ambiguous or not representative
  • 36. Challenge: data normalization 36  Issue at both scoring and blending time  Multiple aspects › Datatype match • “113 minutes” vs. “PT1H53M” › Text variants • Spelling, punctuation, casing, abbreviations etc. › Precision • sim(weight=53 kg, weight=53.5kg)? • sim(birthplace=California, birthplace=Los Angeles, California) › Temporality • e.g. Frank Sinatra married to {Barbara Blakeley, Barbara Marx, Barbara Marx Sinatra, Barbara Sinatra} • Side issue: we don’t capture historical values – e.g. Men’s Decathlon at 1976 Olympics was won by Bruce Jenner, not Caitlyn Jenner
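Three of the normalization cases on this slide can be made concrete with a short sketch. The helpers below are hypothetical: they handle only the ISO 8601 time-of-day duration subset (hours, minutes, seconds), a plain "N minutes" text form, a linear tolerance for weights, and a suffix test for place names. The tolerance value is an assumption, not something stated in the talk.

```python
import re

# ISO 8601 duration, time part only (e.g. PT1H53M); no days or months.
_ISO_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")
_TEXT_MINUTES = re.compile(r"^(\d+)\s*(?:min|mins|minute|minutes)$", re.IGNORECASE)


def duration_minutes(value):
    """Normalize a duration string to minutes, or None if unrecognized."""
    value = value.strip()
    m = _ISO_DURATION.match(value)
    if m and any(m.groups()):
        hours, minutes, seconds = (int(g) if g else 0 for g in m.groups())
        return hours * 60 + minutes + seconds / 60
    m = _TEXT_MINUTES.match(value)
    if m:
        return float(m.group(1))
    return None


def weight_similarity(a_kg, b_kg, tolerance_kg=1.0):
    """1.0 for equal weights, falling linearly to 0.0 at the tolerance."""
    return max(0.0, 1.0 - abs(a_kg - b_kg) / tolerance_kg)


def place_compatible(a, b):
    """True if the less specific place is a suffix of the more specific one."""
    pa = [part.strip().lower() for part in a.split(",")]
    pb = [part.strip().lower() for part in b.split(",")]
    short, long_ = sorted((pa, pb), key=len)
    return long_[-len(short):] == short


if __name__ == "__main__":
    print(duration_minutes("113 minutes") == duration_minutes("PT1H53M"))
    print(weight_similarity(53, 53.5))
    print(place_compatible("California", "Los Angeles, California"))
```

Treating "California" and "Los Angeles, California" as compatible rather than equal keeps the more specific value available at blending time.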
  • 37. Challenge: relevance 37  All information in the graph is true, but not equally relevant  Relevance of entities to queries › Query understanding › Entity retrieval  Relevance of relationships › Required for entity recommendations (“people also search for”) • Who is more relevant to Brad Pitt? Angelina Jolie or Jennifer Aniston?
  • 38. Relationship ranking 38  Machine-learned ranking based on a diverse set of features › Relationship type › Co-occurrence in usage data and text sources • How often people query for them together? • How often one entity is mentioned in the context of the other? › Popularity of each entity • e.g. search views/clicks › Graph-based metrics • e.g. number of common related entities  See › Roi Blanco, Berkant Barla Cambazoglu, Peter Mika, Nicolas Torzec: Entity Recommendations in Web Search. ISWC 2013
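The feature families listed on this slide (co-occurrence, popularity, graph-based metrics) can be illustrated with a toy scorer. Everything concrete here is an assumption: the entity ids, counts, and graph are invented, and the hand-set linear weights only stand in for the machine-learned ranker described in the cited ISWC 2013 paper.

```python
import math

# Hypothetical toy data; ids and counts are invented for illustration.
GRAPH = {
    "person:A": {"movie:1", "movie:2", "person:B", "person:C"},
    "person:B": {"movie:1", "movie:2", "person:A"},
    "person:C": {"movie:3", "person:A"},
}
CO_QUERIES = {("person:A", "person:B"): 120, ("person:A", "person:C"): 45}
POPULARITY = {"person:A": 1000, "person:B": 800, "person:C": 900}

# Hand-set weights standing in for a learned model (assumption).
WEIGHTS = {"co_query": 1.0, "common_neighbors": 0.5, "popularity": 0.2}


def features(source, candidate):
    """Compute a few ranking features for one (source, candidate) pair."""
    pair = tuple(sorted((source, candidate)))
    common = GRAPH[source] & GRAPH[candidate]
    return {
        "co_query": math.log1p(CO_QUERIES.get(pair, 0)),
        "common_neighbors": len(common),
        "popularity": math.log1p(POPULARITY.get(candidate, 0)),
    }


def rank_related(source, candidates):
    """Order candidates by a weighted sum of their features."""
    def score(candidate):
        f = features(source, candidate)
        return sum(WEIGHTS[name] * f[name] for name in WEIGHTS)

    return sorted(candidates, key=score, reverse=True)


if __name__ == "__main__":
    print(rank_related("person:A", ["person:C", "person:B"]))
```

Log-scaling the counts keeps very popular entities from dominating every recommendation list, which is one common design choice rather than a detail taken from the talk.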
  • 40. Conclusions  Yahoo benefits from a unified view of domain knowledge › Focusing on domains of interest to Yahoo › Complementary information from an array of sources › Use cases in Search, Ads, Media  Data integration challenge › Triple stores/graph databases are a poor fit • Reasoning for data validation (not materialization) › But there is benefit to Semantic Web technology • OWL ontology language • JSON-LD • Data on the Web (schema.org, DBpedia…)
  • 41. Future work 41  Scope, size and complexity of Yahoo Knowledge will expand › Combination of world knowledge and personal knowledge › Advanced extraction from the Web › Additional domains › Tasks/actions  All of the challenges mentioned will need better answers… › Can you help us?
  • 42. Q&A  Credits › Yahoo Knowledge engineering team in Sunnyvale and Taipei › Yahoo Labs scientists and engineers in Sunnyvale and London  Contact me › pmika@yahoo-inc.com › @pmika › http://www.slideshare.net/pmika/

Editor's Notes

  1. More info at http://info.yahoo.com/ and http://investor.yahoo.net/faq.cfm ; Marissa’s CES 2013 keynote: http://screen.yahoo.com/marissa-mayer-ces-keynote-live-210000558.html ; ComScore traffic: http://www.bloomberg.com/news/2013-08-22/yahoo-tops-google-in-u-s-for-web-traffic-in-july-comscore-says.html and http://www.comscore.com/Insights/Press_Releases/2013/8/comScore_Media_Metrix_Ranks_Top_50_US_Web_Properties_for_July_2013
  2. This is how a machine sees the world… Machines are not ‘intelligent’ and cannot ‘read’… they just see a string of symbols and try to match the user’s input to that stream.
  3. We also show “People also searched the height of…”
  4. Efficiency in processing, though not real-time. Developed by a large, distributed team of engineers and scientists in Sunnyvale, London and Taiwan. As of Dec, 2015: 600M source entities and 10B source triples; 75M reconciled entities and 5B triples.
  5. The KG understands facts about real-world entities (people, places, movies, organizations and more) and how they relate to each other.
  6. In practice, due to modeling (reification): 75M unique entities -> 1.2B vertices in Spark/GraphX
  7. If we had 600M source entities and 10k cores with 1 ms per comparison, comparing every pair would take about 400 years (3.6*10^17 comparisons). Blocking reduces this to 3.6 * 10^8 comparisons, about 30 minutes of runtime.
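The scale argument in note 7 can be checked with a back-of-the-envelope calculation. The sketch below takes the parameters stated in the note at face value (600M entities, 1 ms per comparison, 10k cores, perfect parallelism); the function names are made up for the example.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def wall_clock_seconds(comparisons, seconds_per_comparison=1e-3, cores=10_000):
    """Elapsed time assuming the comparisons spread evenly across all cores."""
    return comparisons * seconds_per_comparison / cores


def all_pairs(n):
    """Every source entity compared with every other (ordered pairs, n^2)."""
    return n * n


N_SOURCE_ENTITIES = 600_000_000
COMPARISONS_AFTER_BLOCKING = 3.6e8  # figure quoted in the note

if __name__ == "__main__":
    naive = wall_clock_seconds(all_pairs(N_SOURCE_ENTITIES))
    blocked = wall_clock_seconds(COMPARISONS_AFTER_BLOCKING)
    print(f"all pairs:      {naive / SECONDS_PER_YEAR:,.0f} years")
    print(f"after blocking: {blocked:,.0f} seconds")
    print(f"reduction:      {all_pairs(N_SOURCE_ENTITIES) / COMPARISONS_AFTER_BLOCKING:.1e}x")
```

Taken literally, these inputs give on the order of a thousand years for the all-pairs case and well under a minute after blocking, so the note's "400 years" and "30 minutes" presumably rest on assumptions it does not spell out (for example a different cost per comparison or less than perfect parallelism). The conclusion is the same either way: blocking cuts the number of comparisons by about nine orders of magnitude.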