Big Data With Knowledge Graph
A Real Experience
Agenda
● My Case Study
● Big Data Fact
● The Challenges
● The Solution
● Knowledge Graph
● RDF triple
● Freebase
● Ranker
● Our Custom Knowledge Graph: How Did We Build It?
● Conclusion
My Case Study
This case study covers:
● Our graph processing engine, which uses one of the largest
knowledge graphs available as a source to create multiple
knowledge graphs specific to the application.
● This engine traverses more than 700 million triples.
Big Data Fact
In software engineering and computer science, the term big data
describes data sets that grow so large that they become awkward
to work with using on-hand database management tools. - Wikipedia
Read on for a tour of big data and knowledge graphs, the
challenges we faced, and how we came up with a solution.
The Challenges
● RDF is not a mature data structure compared with other
data structures, which have mature ecosystems built
around them.
● Freebase has more than 760 million triples in its
knowledge graph. What would be the data store for such a
huge knowledge graph?
● Finding an optimal way to store this knowledge graph locally
in a data store.
● Transforming this huge knowledge graph into the Ranker
knowledge graph.
The Solution
Highlights
● Our platform has proven to scale to the biggest knowledge
graph available.
● Our graph processing engine deals with 760 million triples
from Freebase.
● We did this even before Google put Freebase to use in its
own Knowledge Graph.
● The next big thing in big data really is large-scale
processing of a knowledge graph from your application's
perspective!
Knowledge Graph
● Freebase data is organised and stored as a graph instead of
tables and keys, as in an RDBMS.
● The dataset is organised into nodes. Each node connects to
several other nodes via predicates, representing related
data in a simple and realistic way.
● The nodes are grouped together using topics and types. The
data is interconnected, so it is very easy to traverse
if we know the right predicates.
Knowledge Graph & Conventional Data: How
Different Are They?
In an RDBMS:
● The data is organized into tables.
● Tables are connected via foreign keys.
● Once the tables are designed, the relationships are fixed.
The number of tables needed depends on the predicates.
● We cannot define new predicates at runtime: we first have to
create the table definition and only then save the data.
RDF triple
An RDF triple consists of three parts:
● A subject
● A predicate
● An object
A subject is related to an object via a predicate. Each triple is a
complete, self-contained assertion.
Examples of RDF triple:
Francis Ford Coppola | Directed | The Godfather
Al Pacino | Acted in | The Godfather
The Godfather | Written by | Mario Puzo
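To make the triple shape concrete, here is a minimal sketch in Java using Apache Jena (the library we describe later in this deck). The example.org URIs are placeholders, not real Freebase identifiers.

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;

public class TripleExample {
    public static void main(String[] args) {
        // Build the three example triples in an in-memory Jena model.
        // The example.org URIs are illustrative, not real Freebase IDs.
        Model model = ModelFactory.createDefaultModel();
        String ns = "http://example.org/";

        Resource coppola   = model.createResource(ns + "Francis_Ford_Coppola");
        Resource pacino    = model.createResource(ns + "Al_Pacino");
        Resource puzo      = model.createResource(ns + "Mario_Puzo");
        Resource godfather = model.createResource(ns + "The_Godfather");

        Property directed  = model.createProperty(ns, "directed");
        Property actedIn   = model.createProperty(ns, "actedIn");
        Property writtenBy = model.createProperty(ns, "writtenBy");

        model.add(coppola,   directed,  godfather);  // subject | predicate | object
        model.add(pacino,    actedIn,   godfather);
        model.add(godfather, writtenBy, puzo);

        model.write(System.out, "N-TRIPLES");        // one complete assertion per line
    }
}
```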
I recommend the video below for a brief introduction to knowledge graphs:
Google's Knowledge Graph
Freebase
Facts
● It is an online knowledge database.
● Its data comes mainly from its community members, along with
Wikipedia, ChefMoz, NNDB, and MusicBrainz.
● It was made public in 2007 by Metaweb, which was acquired by
Google in 2010.
"Freebase is an open shared database of the world's
knowledge." - this is how Metaweb described Freebase.
Ranker
Facts
● Ranker is a social web platform designed for collaborative
and individual list making and voting.
● Ranker launched in August 2009 and has since grown to
over 4 million monthly unique visitors and over 14 million
monthly page views, per Quantcast. As of January 2012,
Ranker's traffic was ranked 949th on Quantcast.
● One of Ranker's prominent data partners is Freebase,
now owned by Google.
Our custom knowledge graph
- How did we build it?
Freebase data access, option 1
MQL
The Metaweb Query Language (MQL) API is a powerful read API provided by Freebase.
The data is communicated over HTTP as JSON. This method is very effective for
browsing the data or downloading limited amounts of it.
For very large data consumption, I do not recommend MQL, for the following
reasons:
● The Freebase API is intermittently down.
● Freebase throttles both the number of API calls and the size of result
sets returned per day. We have faced issues in the past where the API
responded with "allowance exceeded" timeout errors. The maximum number of
results returned for any query is 100.
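For illustration, a minimal Java sketch of an mqlread call is below. The endpoint shown is the Google-hosted one from the Freebase era; the service has since been retired, so treat this purely as the shape of the request, not working infrastructure.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class MqlReadSketch {
    public static void main(String[] args) throws Exception {
        // MQL is query-by-example JSON: empty values ([] or null) mark the
        // fields we want Freebase to fill in.
        String mql = "[{\"type\":\"/film/film\",\"name\":\"The Godfather\",\"directed_by\":[]}]";

        // Retired Freebase-era endpoint, kept here only to show the shape.
        String url = "https://www.googleapis.com/freebase/v1/mqlread?query="
                + URLEncoder.encode(mql, StandardCharsets.UTF_8);

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println(response.body()); // JSON envelope with a "result" field
    }
}
```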
Freebase data access, option 2
Data Dumps
● Freebase provides weekly quad dumps for download via its download
site.
● Each dump contains all the assertions in Freebase, in UTF-8.
It is distributed as a compressed file over 4 GB in size. It has to be
downloaded and unzipped, after which it is approximately 30 GB.
● The quad dump then has to be converted into RDF statements. For this we use
the open-source freebase-quad-rdfize program, which is freely distributed. At
the end of this process you will have a .nt file approximately 90-100 GB in
size, so disk space is a vital requirement.
Datastore
● A triple store is a data store optimized for the storage and retrieval of
RDF triples. Our knowledge graph datastore is OpenLink Virtuoso. It can
handle more than a billion triples, which suited our requirement well.
● Since the .nt file is very large, ingesting the data into the triple store
hit various issues: after about a million triples, the server froze. We
therefore broke the .nt file into smaller chunks (see the sketch below),
after which the ingestion ran fine and completed successfully.
● The system we use for ingestion is an Ubuntu 10.04 machine with 48 GB of
RAM. It takes approximately 36 hours to ingest the complete quad dump into
our triple store.
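The deck does not show how we split the file, but the idea is simple because N-Triples is strictly one triple per line, so any split on line boundaries yields valid chunk files. A minimal sketch follows; the file name and chunk size are placeholders to tune for your loader.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class NtSplitter {
    public static void main(String[] args) throws IOException {
        Path input = Paths.get("freebase.nt");  // hypothetical file name
        long linesPerChunk = 10_000_000L;       // tune to what the loader tolerates

        long lineNo = 0;
        int chunkNo = 0;
        BufferedWriter out = null;
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                // Open a fresh chunk file every linesPerChunk lines.
                if (lineNo % linesPerChunk == 0) {
                    if (out != null) out.close();
                    out = Files.newBufferedWriter(
                            Paths.get(String.format("chunk-%04d.nt", chunkNo++)),
                            StandardCharsets.UTF_8);
                }
                out.write(line);
                out.newLine();
                lineNo++;
            }
        } finally {
            if (out != null) out.close();
        }
    }
}
```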
Data consumption for the App
Our platform is a highly scalable graph processing engine that operates on the
largest knowledge graph available (Freebase) and uses OpenLink Virtuoso as its
graph datastore. The platform itself is built on the standard protocol for
graph navigation, processing and traversal: SPARQL.
● Every node on Freebase has a unique alphanumeric ID made of two parts, a
namespace and a key. Together they are called the 'mid'.
● Every predicate in Freebase has a source ID, or source namespace. For
example, the predicate "Nationality" has the source URL
"http://rdf.freebase.com/ns/people/person/nationality".
In our app we have predefined entities and their properties, using predicate
URLs as source IDs. For example, a Person entity in our system has a
Nationality property with a source URL and source "freebase". This way we can
add more sources in the future, and a single entity can carry properties from
one or more sources.
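A hypothetical sketch of what such a predefinition can look like; the class and field names here are illustrative rather than Ranker's actual code, though the predicate URLs follow the real Freebase namespace pattern.

```java
import java.util.List;

// Illustrative only: an app-level entity maps each property to the
// predicate URL it is sourced from, plus a source label, so new sources
// can be added later without changing the entity model.
record SourceProperty(String name, String predicateUrl, String source) {}

record EntityDefinition(String name, List<SourceProperty> properties) {}

class EntityCatalog {
    static final EntityDefinition PERSON = new EntityDefinition("Person", List.of(
            new SourceProperty("nationality",
                    "http://rdf.freebase.com/ns/people/person/nationality", "freebase"),
            new SourceProperty("dateOfBirth",
                    "http://rdf.freebase.com/ns/people/person/date_of_birth", "freebase")
    ));
}
```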
SPARQL
● SPARQL is a query language for RDF data.
● In our usage, the results of these queries always come back as triples.
We therefore build these queries dynamically, depending on what data we need.
In our experience, avoiding joins in SPARQL queries improves performance; a
sketch of the resulting query shape follows.
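A minimal sketch of the "avoid joins" idea: instead of one query that chains several predicates, issue a single-pattern query per predicate and assemble the results in application code. The builder method and the placeholder mid below are illustrative, not our production code.

```java
public class QueryBuilder {
    // Build a single-pattern SPARQL query: one subject, one predicate,
    // no joins. Multi-predicate views are assembled in application code.
    static String buildQuery(String subjectUri, String predicateUri) {
        return "SELECT ?o WHERE { <" + subjectUri + "> <" + predicateUri + "> ?o }";
    }

    public static void main(String[] args) {
        System.out.println(buildQuery(
                "http://rdf.freebase.com/ns/m/0placeholder",  // hypothetical mid
                "http://rdf.freebase.com/ns/people/person/nationality"));
    }
}
```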
API
● We chose the Java-based Jena API for Virtuoso.
● It establishes a connection to the triple store over JDBC.
The API supports SPARQL, and the results are packaged as RDF objects, so we
can easily read them and use adapters to transform them into app objects.
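A minimal sketch of that path, using the Virtuoso Jena provider. Package and class names vary by Jena/provider version, and the connection details are placeholders (1111 is Virtuoso's default JDBC port).

```java
import org.apache.jena.query.Query;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

import virtuoso.jena.driver.VirtGraph;
import virtuoso.jena.driver.VirtuosoQueryExecution;
import virtuoso.jena.driver.VirtuosoQueryExecutionFactory;

public class VirtuosoQuerySketch {
    public static void main(String[] args) {
        // Placeholder credentials; the connection runs over JDBC.
        VirtGraph graph = new VirtGraph("jdbc:virtuoso://localhost:1111", "dba", "dba");

        Query query = QueryFactory.create(
                "SELECT ?s ?o WHERE { ?s "
              + "<http://rdf.freebase.com/ns/people/person/nationality> ?o } LIMIT 10");

        VirtuosoQueryExecution exec = VirtuosoQueryExecutionFactory.create(query, graph);
        ResultSet results = exec.execSelect();
        while (results.hasNext()) {
            QuerySolution row = results.next();  // each row binds RDF nodes
            System.out.println(row.get("s") + " -> " + row.get("o"));
        }
        exec.close();
        graph.close();
    }
}
```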
Data Aggregation
This is what makes our platform truly powerful. Not only do we store the
knowledge graph locally, we can also create our own custom graph from this
data. The Ranker system has approximately 20 million nodes and powers half a
million lists, and counting.
Not all entities in our system are simple; some are complex, meaning their
properties belong to one or more types on Freebase.
For example, a 'Person' node in our system will not only have date of birth,
place of birth, age, etc., but also properties like 'dated' and 'breakups'. We
achieved this by predefining aggregation rules for each and every entity in
our system, based on feedback from our SEO and business teams.
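A hypothetical sketch of what an aggregation rule can look like. The rule structure and names are ours for this sketch, and the second type URL assumes Freebase's "popstra" base, which carried celebrity dating/breakup relationships.

```java
import java.util.List;
import java.util.Map;

// Illustrative only: one app entity aggregates properties from several
// Freebase types. The rule shape is our invention for this sketch.
class AggregationRules {
    static final Map<String, List<String>> TYPES_PER_ENTITY = Map.of(
            "Person", List.of(
                    "http://rdf.freebase.com/ns/people/person",          // date/place of birth, etc.
                    "http://rdf.freebase.com/ns/base/popstra/celebrity"  // dated, breakup relationships
            ));
}
```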
Conclusion