Talk at the 2nd Summer Workshop of the Center for Semantic Web Research (January 16, 2016, Santiago, Chile) about the construction of Yahoo's Knowledge Graph and associated research challenges.
Knowledge Integration in Practice
1. Knowledge Integration in Practice
Peter Mika, Director of Semantic Search, Yahoo Labs ⎪ January 13, 2015
2. Agenda
Intro
Yahoo’s Knowledge Graph
› Why a Knowledge Graph for Yahoo?
› Building the Knowledge Graph
› Challenges
Future work
Q&A
Disclaimers:
• Yahoo’s Knowledge Graph is the work of many at Yahoo, so I can’t speak to all of it with authority
• I’ll be rather loose with terminology…
3. About Yahoo
Yahoo makes the world's daily habits inspiring and entertaining
› An online media and technology company
• 1 billion+ monthly users
• 600 million+ monthly mobile users
• #3 US internet destination*
• 81% of the US internet audience*
› Founded in 1994 by Jerry Yang and David Filo
› Headquartered in Sunnyvale, California
› Led by Marissa Mayer, CEO (since July, 2012)
› 10,700 employees (as of Sept 30, 2015)
*ComScore Media Metrix, Aug 2015
4. Yahoo’s global research organization
› Impact on Yahoo’s products AND academic excellence
› Established in 2005
› ~200 scientists and research engineers
› Wide range of disciplines
› Locations in Sunnyvale, New York, Haifa
› Led by Ron Brachman, Chief Scientist and Head of Labs
› Academic programs
› Visit
• labs.yahoo.com
• Tumblr/Flickr/LinkedIn/Facebook/Twitter
Yahoo Labs
5. Semantic Search at Yahoo Labs London
Extraction: Information extraction from text and the Web
Integration: Knowledge representation and data fusion
Indexing: Efficient indexing of text annotations and entity graphs
Ranking: Entity-retrieval and recommendations
Evaluation: Evaluation of semantic search
7. The world of Yahoo
Search
› Web Search
› Yahoo Answers
Communications
› Mail, Messenger, Groups
Media
› Homepage
› News, Sports, Finance, Style…
Video
Flickr and Tumblr
Advertising products
See everything.yahoo.com for all Yahoo products
8. In a perfect world, the Semantic Web is the end-game for IR
9. Search: entity-based results
Enhanced results for entity-pages
› Based on metadata embedded in the page or semi-automated IE
› Yahoo Searchmonkey (2008)
• Kevin Haas, Peter Mika, Paul Tarjan, Roi Blanco: Enhanced results for web search. SIGIR 2011: 725-734
Adopted industry-wide
› Google, Bing, Facebook, Twitter…
› Led to the launch of the schema.org effort
10. Search
Understand entity-based queries
› ~70% of queries contain a named entity* (entity mention queries)
• brad pitt height
› ~50% of queries have an entity focus* (entity seeking queries)
• brad pitt attacked by fans
› ~10% of queries are looking for a class of entities*
• brad pitt movies
Even more prominent on mobile
› Limited input/output
› Different types of queries
• Less research, more immediate needs
• Need answers or actions related to an entity, not pages to read
(Slide illustration: the query “brad pitt height” with variants “how tall is”, “tall”, …)
* Statistics from [Pound et al. WWW2010]. Similar results in [Lin et al. WWW2012].
11. Search: entity-based experiences
Entity display
› Information about the entity
› Combined information with provenance
Related entity recommendation
› Where should I go next?
Question-Answering
Direct actions
› e.g. movie show times and tickets
12. Communications
Extraction of information from email
› Notifications
• Package delivery updates, upcoming flights etc.
• Show up in Yahoo Search/Mail
› Better targeting for ads
• e.g. understanding past product purchases
Personal knowledge combined with the Web
› e.g. contact information is completed from FB/LinkedIn/Twitter
13. Media
Personalization
› Articles are classified by broad topics
› Named entities are extracted and linked to the KG
› Recommend other articles based on the extracted entities/topics
(UI callout: “Show me less stories about this entity or topic”)
14. Requirements
Entity-centric representation of the world
› Use cases in search, email, media, ads
Integration of disparate information sources
› User/advertiser content and data
› Information from the Web
• Aggregate view of different domains relating to different facets of an entity
› Third-party licensed data
Large scale
› Batch processing OK but at least daily updates
High quality
Multiple languages and markets
19. Knowledge integration process
Standard data fusion process
› Schema matching
• Map data to a common schema
› Entity reconciliation
• Determine which source entities refer to the same real-world entity
› Blending
• Aggregate information and resolve conflicts
Result: unified knowledge base built from dozens of sources
› ~100 million unique entities and billions of facts
› Note: internal representations may be 10x larger due to modeling, metadata, etc.
Related work
› Bleiholder and Naumann: Data Fusion. ACM Computing Surveys, 2008.
20. Common ontology
› Covers the domains of interest of Yahoo
• Celebrities, movies, music, sports, finance, etc.
› Editorially maintained
› OWL ontology
• ~300 classes, ~800 datatype-props, ~500 object-props
› Protégé and custom tooling (e.g. documentation)
• Git for versioning (similar to schema.org)
› More detailed and expressive than schema.org
• Class disjunction, cardinality constraints, inverse properties, datatypes and units
• But limited use of complex class/property expressions
– e.g. MusicArtist = Musician OR Band
– Difficult for data consumers
Manual schema mapping
› Works for ~10 sources
› Not scalable
• Web tables
• Language editions of Wikipedia
Ontology matching
22. Entity reconciliation
1. Blocking
› Compute hashes for each entity
› Based on type+property value combinations, e.g. type:Movie+releaseYear=1978
› Multiple hashes per entity
› Optimize for high recall
2. Pair-wise matching within blocks
› Manual as well as machine-learned classifiers
3. Clustering
› Transitive closure of matching pairs
› Assign unique identifier
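The three steps above can be sketched in Python. This is a toy illustration, not Yahoo's implementation: `match` stands in for the manual/machine-learned classifiers, and clustering is a plain union-find transitive closure.

```python
from collections import defaultdict

def blocking_keys(entity):
    """Hash each entity into multiple blocks from type+property value
    combinations, e.g. type:Movie+releaseYear=1978 (tuned for high recall)."""
    return [f"type:{t}+{prop}={value}"
            for t in entity.get("types", [])
            for prop, value in entity.get("props", {}).items()]

def reconcile(entities, match):
    """Blocking, pair-wise matching within blocks, then clustering."""
    # 1. Blocking: group entity ids that share at least one hash key
    blocks = defaultdict(list)
    for eid, entity in entities.items():
        for key in blocking_keys(entity):
            blocks[key].append(eid)

    # Union-find structure for the transitive closure
    parent = {eid: eid for eid in entities}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # 2. Pair-wise matching only inside each block (a pair may be re-checked
    #    if it shares several blocks; harmless for this sketch)
    for ids in blocks.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                if match(entities[ids[i]], entities[ids[j]]):
                    parent[find(ids[i])] = find(ids[j])

    # 3. Clustering: each connected component gets one root, which can
    #    serve as the unique identifier
    clusters = defaultdict(set)
    for eid in entities:
        clusters[find(eid)].add(eid)
    return list(clusters.values())
```

Blocking is what makes the quadratic pair-wise step tractable: as the editor's notes point out, comparing 600M source entities all-pairs would take centuries, while comparing only within blocks brings the count down by nine orders of magnitude.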
24. Blending
Rule-based system initially, moving to machine learning
Features
› Source trustworthiness
› Value prior probabilities
› Data freshness
› Logical constraints
• Derived from ontology
• Programmatic, e.g. children must be born after parents
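A minimal sketch of the rule-based flavor described above; the trust scores and the constraint check are illustrative, not Yahoo's actual rules:

```python
def blend(candidates, source_trust):
    """Resolve conflicting values for one property.
    candidates: list of (value, source, year) triples; prefer the most
    trusted source, breaking ties by data freshness."""
    value, _, _ = max(candidates,
                      key=lambda c: (source_trust.get(c[1], 0.0), c[2]))
    return value

def born_after_parents(child_year, parent_years):
    """Programmatic logical constraint from the slide: children must be
    born after their parents (years as ints, for simplicity)."""
    return all(p < child_year for p in parent_years)
```

Usage: with `source_trust = {"editorial": 1.0, "wikipedia": 0.8, "web": 0.3}`, a value extracted from the open web loses to a conflicting value from Wikipedia regardless of freshness.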
26. Challenge: scalable infrastructure
Property graph/RDF databases are a poor fit for ETL and data fusion
› Large batch writes
› Require transaction support
› Navigation over the graph, no need for more complex joins
• Required information is at most two hops away
Hadoop-based solutions
› Yahoo already hosts ~10k machines in Hadoop clusters
› HBase initially
› Moved to Spark/GraphX
• Support row/column as well as graph view of the data
› Separate inverted index for storing hashes
– Welch et al.: Fast and accurate incremental entity resolution relative to an entity knowledge base. CIKM 2012
JSON-LD is used as input/output format
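For illustration, a JSON-LD record flowing through such a pipeline might look like the following; the vocabulary URL and property names are invented, not Yahoo's schema:

```python
import json

# Hypothetical JSON-LD entity record; @context maps bare property names
# to an (invented) ontology namespace via @vocab.
record = {
    "@context": {"@vocab": "http://example.org/ontology/"},
    "@id": "http://example.org/entity/brad_pitt",
    "@type": "Actor",
    "name": "Brad Pitt",
    "birthDate": "1963-12-18",
}

# One appeal of JSON-LD as an I/O format: it round-trips through
# ordinary JSON tooling with no RDF-specific infrastructure.
assert json.loads(json.dumps(record))["@type"] == "Actor"
```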
27. Challenge: quality
Not enough to get it right… it has to be perfect
• Key difference between applied science and academic research
Many sources of errors
› Poor quality or outdated source data
› Errors in extraction
› Errors in schema mapping and normalization
› Errors in merging (reconciliation)
• Blocking
• Disambiguation
• Blending
› Errors in display
• Image issues, poor title or description, etc.
Human intervention should be possible at every stage of the pipeline
31. Challenge: type classification and ranking
Type classification
› Determine all the types of an entity
› Mostly a system issue, e.g. types are used in blocking
› Features
• NLP extraction
– e.g. Wikipedia first paragraph
• Taxonomy mapping
– e.g. Wikipedia category hierarchy
• Relationships
– e.g. acted in a Movie -> Actor
• Trust in source
– e.g. IMDB vs. Wikipedia for Actors
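Relationship-based typing like “acted in a Movie -> Actor” can be expressed as simple rules over graph edges. The predicate and rule names below are illustrative:

```python
# Rules of the form (predicate, object type, inferred subject type),
# e.g. "acted in a Movie -> Actor". Rule contents are invented examples.
RULES = [
    ("actedIn", "Movie", "Actor"),
    ("directed", "Movie", "Director"),
]

def infer_types(edges, entity_types):
    """edges: (subject, predicate, object) triples;
    entity_types: known types per entity id."""
    inferred = {}
    for subj, pred, obj in edges:
        for rule_pred, obj_type, new_type in RULES:
            if pred == rule_pred and obj_type in entity_types.get(obj, set()):
                inferred.setdefault(subj, set()).add(new_type)
    return inferred
```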
32. Challenge: type ranking
What types are the most relevant?
› Arnold Schwarzenegger: Actor > Athlete > Officeholder > Politician (perhaps)
› Pope Francis is a Musician per MusicBrainz
› Barack Obama is an Actor per IMDB (much better known as an Actor)
Display issue
› Right template and display label
Moving from manual to machine-learned ranking
33. (Slide diagram: the Arnold Schwarzenegger entity in the graph, with repeated credit.actingPerformanceIn edges to The Terminator, a partyAffiliation edge to Republican Party (United States), repeated historicJobPosition edges to Television Director, a description attribute “Arnold Alois Schwarzenegger is an Austrian-American actor, model, producer, director, activist, businessman, investor, philanthropist, former professional bodybuilder, ...”, and candidate types Athlete, Officeholder, Politician, Actor.)
34. Type ranking features
Implemented two novel unsupervised methods
› Entity likelihood
› Nearest-neighbor
Ensemble learning on (features extracted from) entity attributes
› Cosine, KL-div, Dice, sumAF, minAF, meanAF, etc.
› Entity features, textual features, etc.
• E.g. order of type mentions in Wikipedia first paragraph
Variants
› Combinations of the above
› Stacked ML, FMs
35. Challenge: mining aliases and entity pages
Extensive set of alternate names/labels are required by applications
› Named Entity Linking on short/long forms of text
Some of this comes free from Wikipedia
› Anchor text, redirects
› e.g. all redirects to Brad Pitt
Query logs are also a useful source of aliases
› e.g. incoming queries to Brad Pitt’s page on Wikipedia
Can be extended to other sites if we find entity webpages
› A type of foreign key, but specifically on the Web
› e.g. Brad Pitt’s page on IMDB, RottenTomatoes
Machine learned model to filter out poor aliases
› Ambiguous or not representative
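The slide mentions a machine-learned filter; as a stand-in, here is a simple precision-threshold sketch over hypothetical query-log click counts:

```python
def filter_aliases(candidates, click_counts, entity_id,
                   min_precision=0.8, min_support=10):
    """Keep an alias only if, in the logs, it mostly leads to this entity.
    click_counts maps (alias, entity) -> count, e.g. from query logs.
    Thresholds are illustrative; the real system uses a learned model."""
    kept = []
    for alias in candidates:
        total = sum(c for (a, _), c in click_counts.items() if a == alias)
        hits = click_counts.get((alias, entity_id), 0)
        # Drop aliases that are ambiguous (low precision) or rare (low support)
        if total >= min_support and hits / total >= min_precision:
            kept.append(alias)
    return kept
```

For example, “brad pitt” survives as an alias for the actor, while a bare “brad” that splits across many entities is dropped as ambiguous.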
36. Challenge: data normalization
Issue at both scoring and blending time
Multiple aspects
› Datatype match
• “113 minutes” vs. “PT1H53M”
› Text variants
• Spelling, punctuation, casing, abbreviations etc.
› Precision
• sim(weight=53 kg, weight=53.5kg)?
• sim(birthplace=California, birthplace=Los Angeles, California)
› Temporality
• e.g. Frank Sinatra married to {Barbara Blakeley, Barbara Marx, Barbara Marx Sinatra, Barbara Sinatra}
• Side issue: we don’t capture historical values
– e.g. Men’s Decathlon at 1976 Olympics was won by Bruce Jenner, not Caitlyn Jenner
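For the datatype-match example above, both duration encodings can be normalized to a common unit before comparison. A sketch handling only these two shapes (free text minutes and hour/minute ISO 8601 durations):

```python
import re

def to_minutes(value):
    """Normalize a duration string to integer minutes.
    Handles '113 minutes'-style text and ISO 8601 strings like 'PT1H53M'."""
    text = value.strip()
    m = re.fullmatch(r"(\d+)\s*min(?:ute)?s?", text, re.IGNORECASE)
    if m:
        return int(m.group(1))
    m = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?", text)
    if m and (m.group(1) or m.group(2)):
        return int(m.group(1) or 0) * 60 + int(m.group(2) or 0)
    raise ValueError(f"unrecognized duration: {value!r}")

# The slide's two variants normalize to the same value: 1h53m = 113 min
assert to_minutes("113 minutes") == to_minutes("PT1H53M") == 113
```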
37. Challenge: relevance
All information in the graph is true, but not equally relevant
Relevance of entities to queries
› Query understanding
› Entity retrieval
Relevance of relationships
› Required for entity recommendations (“people also search for”)
• Who is more relevant to Brad Pitt? Angelina Jolie or Jennifer Aniston?
38. Relationship ranking
Machine-learned ranking based on a diverse set of features
› Relationship type
› Co-occurrence in usage data and text sources
• How often do people query for them together?
• How often is one entity mentioned in the context of the other?
› Popularity of each entity
• e.g. search views/clicks
› Graph-based metrics
• e.g. number of common related entities
See
› Roi Blanco, Berkant Barla Cambazoglu, Peter Mika, Nicolas Torzec:
Entity Recommendations in Web Search. ISWC 2013
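The feature families above can be sketched as follows; the session and graph structures are simplified stand-ins for the usage data and text sources the slide refers to:

```python
def relationship_features(e1, e2, query_sessions, graph):
    """Features for ranking candidate related entity e2 for source e1.
    query_sessions: iterable of sets of entities seen together in usage data;
    graph: adjacency map from entity to its related entities."""
    co = sum(1 for s in query_sessions if e1 in s and e2 in s)
    pop1 = sum(1 for s in query_sessions if e1 in s)
    pop2 = sum(1 for s in query_sessions if e2 in s)
    n1, n2 = set(graph.get(e1, [])), set(graph.get(e2, []))
    return {
        "cooccurrence": co,                       # queried together
        "popularity": pop2,                       # candidate's own traffic
        "pmi_like": co / (pop1 * pop2) if pop1 and pop2 else 0.0,
        # graph-based metric: Jaccard overlap of common related entities
        "common_neighbors": len(n1 & n2) / len(n1 | n2) if (n1 | n2) else 0.0,
    }
```

Feeding such feature vectors into a learned ranker is what decides, e.g., whether Angelina Jolie or Jennifer Aniston is shown first next to Brad Pitt.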
40. Conclusions
Yahoo benefits from a unified view of domain knowledge
› Focusing on domains of interest to Yahoo
› Complementary information from an array of sources
› Use cases in Search, Ads, Media
Data integration challenge
› Triple stores/graph databases are a poor fit
• Reasoning for data validation (not materialization)
› But there is benefit to Semantic Web technology
• OWL ontology language
• JSON-LD
• Data on the Web (schema.org, DBpedia…)
41. Future work
Scope, size and complexity of Yahoo Knowledge will expand
› Combination of world knowledge and personal knowledge
› Advanced extraction from the Web
› Additional domains
› Tasks/actions
All of the challenges mentioned will need better answers…
› Can you help us?
42. Q&A
Credits
› Yahoo Knowledge engineering team in Sunnyvale and Taipei
› Yahoo Labs scientists and engineers in Sunnyvale and London
Contact me
› pmika@yahoo-inc.com
› @pmika
› http://www.slideshare.net/pmika/
Editor's Notes
More info at
http://info.yahoo.com/
http://investor.yahoo.net/faq.cfm
Marissa’s CES 2013 keynote:
http://screen.yahoo.com/marissa-mayer-ces-keynote-live-210000558.html
ComScore traffic:
http://www.bloomberg.com/news/2013-08-22/yahoo-tops-google-in-u-s-for-web-traffic-in-july-comscore-says.html
http://www.comscore.com/Insights/Press_Releases/2013/8/comScore_Media_Metrix_Ranks_Top_50_US_Web_Properties_for_July_2013
This is how a machine sees the world… Machines are not ‘intelligent’ and cannot ‘read’… they just see a string of symbols and try to match the user’s input to that stream.
We also show “People also searched the height of…”
Efficiency in processing, though not real-time
Developed by a large, distributed team of engineers and scientists, in Sunnyvale, London and Taiwan
As of Dec, 2015:
600M source entities and 10B source triples
75M reconciled entities and 5B triples
The KG understands facts about real world entities
People, places, movies, organizations and more
and how they relate to each other.
In practice, due to modeling (reification) 75M unique entities -> 1.2B vertices in Spark/GraphX
If we had 600m source entities and 10k cores with 1ms per comparison, about 400 years (3.6*10^17 comparisons)
Blocking reduces this to 3.6 * 10^8 comparisons, about 30 minutes of runtime