Big Data With Knowledge Graph
A Real Experience
Agenda
● My Case Study
● Big Data Fact
● The Challenges
● The Solution
● Knowledge Graph
● RDF triple
● Freebase
● Ranker
● Our Custom Knowledge Graph: How Did We Build It?
● Conclusion
My Case Study
This case study covers:
● Our graph processing engine, which uses one of the largest
knowledge graphs available as a source to create multiple
knowledge graphs specific to the application.
● This engine traverses more than 700 million triples.
Big Data Fact
In software engineering and computer science, the term big data
describes data sets that grow so large that they become awkward
to work with using on-hand database management tools. - Wikipedia
Read on for a tour of big data and knowledge graphs, the
challenges we faced, and how we came up with a solution.
The Challenges
● RDF is not a mature data structure compared with other
data structures, which have mature ecosystems built
around them.
● Freebase has more than 760 million triples in its
knowledge graph. What would be the data store for such a
huge knowledge graph?
● Finding an optimal way to store this knowledge graph locally
in a data store.
● Transforming this huge knowledge graph into the Ranker
knowledge graph.
The Solution
Highlights
● Our platform has proven to scale to the biggest knowledge
graph available.
● Our graph processing engine deals with 760 million triples
from Freebase.
● We did this even before Google put Freebase to use in its
own Knowledge Graph.
● The next big thing in big data really is large-scale
processing of a knowledge graph from your application's
perspective!
Knowledge Graph
● Freebase data is organised and stored as a graph instead of
tables and keys, as in an RDBMS.
● The dataset is organised into nodes. Each node connects to
several other nodes via predicates, representing related
data in a simple and realistic way.
● The nodes are grouped together using topics and types. The
data is interconnected, so it is very easy to traverse
if we know the right predicates.
Knowledge Graph & Conventional Data: How
Different Are They?
In an RDBMS:
● The data is organized into tables.
● Tables are connected via foreign keys.
● Once the tables are designed, the relationships are fixed.
The number of tables needed depends on the predicates.
● We cannot define new predicates at runtime: we first have to
create the table definition and only then save the data.
RDF triple
An RDF triple consists of three parts:
● A subject
● A predicate
● An object
A subject is related to an object via a predicate. Each triple is a
complete, self-contained assertion.
Examples of RDF triple:
Francis Ford Coppola | Directed | The Godfather
Al Pacino | Acted in | The Godfather
The Godfather | Written by | Mario Puzo
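To make the triple shape concrete, here is a minimal sketch in Java using Apache Jena (the library we describe later in this deck). The example.org URIs are placeholders, not real Freebase identifiers.

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;

public class TripleExample {
    public static void main(String[] args) {
        // Build the three example triples in an in-memory Jena model.
        // The example.org URIs are illustrative, not real Freebase IDs.
        Model model = ModelFactory.createDefaultModel();
        String ns = "http://example.org/";

        Resource coppola   = model.createResource(ns + "Francis_Ford_Coppola");
        Resource pacino    = model.createResource(ns + "Al_Pacino");
        Resource puzo      = model.createResource(ns + "Mario_Puzo");
        Resource godfather = model.createResource(ns + "The_Godfather");

        Property directed  = model.createProperty(ns, "directed");
        Property actedIn   = model.createProperty(ns, "actedIn");
        Property writtenBy = model.createProperty(ns, "writtenBy");

        model.add(coppola,   directed,  godfather);  // subject | predicate | object
        model.add(pacino,    actedIn,   godfather);
        model.add(godfather, writtenBy, puzo);

        model.write(System.out, "N-TRIPLES");        // one complete assertion per line
    }
}
```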
I recommend the video below for a brief introduction to knowledge graphs:
Google's Knowledge Graph
Freebase
Facts
● It is an online knowledge database.
● Its data comes mainly from its community members, along with
Wikipedia, ChefMoz, NNDB, and MusicBrainz.
● It was made public in 2007 by Metaweb, which was acquired by
Google in 2010.
"Freebase is an open shared database of the world's
knowledge." - this is how Metaweb described Freebase.
Ranker
Facts
● Ranker is a social web platform designed for collaborative
and individual list making and voting.
● Ranker launched in August 2009 and has since grown to
over 4 million monthly unique visitors and over 14 million
monthly page views, per Quantcast. As of January 2012,
Ranker's traffic was ranked 949th on Quantcast.
● One of Ranker's prominent data partners is Freebase,
now owned by Google.
Our custom knowledge graph
- How did we build it?
Freebase data access, option 1
MQL
The Metaweb Query Language (MQL) API is a powerful read API provided by Freebase.
The data is communicated over HTTP as JSON. This method is very effective for
browsing the data or downloading limited amounts of it.
For very large data consumption, I do not recommend MQL, for the following
reasons:
● The Freebase API is intermittently down.
● Freebase throttles both the number of API calls and the size of result
sets returned per day. We have faced issues in the past where the API
responded with "allowance exceeded" timeout errors. The maximum number of
results returned for any query is 100.
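For illustration, a minimal Java sketch of an mqlread call is below. The endpoint shown is the Google-hosted one from the Freebase era; the service has since been retired, so treat this purely as the shape of the request, not working infrastructure.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class MqlReadSketch {
    public static void main(String[] args) throws Exception {
        // MQL is query-by-example JSON: empty values ([] or null) mark the
        // fields we want Freebase to fill in.
        String mql = "[{\"type\":\"/film/film\",\"name\":\"The Godfather\",\"directed_by\":[]}]";

        // Retired Freebase-era endpoint, kept here only to show the shape.
        String url = "https://www.googleapis.com/freebase/v1/mqlread?query="
                + URLEncoder.encode(mql, StandardCharsets.UTF_8);

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println(response.body()); // JSON envelope with a "result" field
    }
}
```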
Freebase data access, option 2
Data Dumps
● Freebase provides weekly quad dumps for download via its download
site.
● Each dump contains all the assertions in Freebase, in UTF-8.
It is distributed as a compressed file over 4 GB in size. It has to be
downloaded and unzipped, after which it is approximately 30 GB.
● The quad dump then has to be converted into RDF statements. For this we use
the open-source freebase-quad-rdfize program, which is freely distributed. At
the end of this process you will have a .nt file approximately 90-100 GB in
size, so disk space is a vital requirement.
Datastore
● A triple store is a data store optimized for the storage and retrieval of
RDF triples. Our knowledge graph datastore is OpenLink Virtuoso. It can
handle more than a billion triples, which suited our requirement well.
● Since the .nt file is very large, ingesting the data into the triple store
hit various issues: after about a million triples, the server froze. We
therefore broke the .nt file into smaller chunks (see the sketch below),
after which the ingestion ran fine and completed successfully.
● The system we use for ingestion is an Ubuntu 10.04 machine with 48 GB of
RAM. It takes approximately 36 hours to ingest the complete quad dump into
our triple store.
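The deck does not show how we split the file, but the idea is simple because N-Triples is strictly one triple per line, so any split on line boundaries yields valid chunk files. A minimal sketch follows; the file name and chunk size are placeholders to tune for your loader.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class NtSplitter {
    public static void main(String[] args) throws IOException {
        Path input = Paths.get("freebase.nt");  // hypothetical file name
        long linesPerChunk = 10_000_000L;       // tune to what the loader tolerates

        long lineNo = 0;
        int chunkNo = 0;
        BufferedWriter out = null;
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                // Open a fresh chunk file every linesPerChunk lines.
                if (lineNo % linesPerChunk == 0) {
                    if (out != null) out.close();
                    out = Files.newBufferedWriter(
                            Paths.get(String.format("chunk-%04d.nt", chunkNo++)),
                            StandardCharsets.UTF_8);
                }
                out.write(line);
                out.newLine();
                lineNo++;
            }
        } finally {
            if (out != null) out.close();
        }
    }
}
```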
Data consumption for the App
Our platform is a highly scalable graph processing engine that operates on the
largest knowledge graph available (Freebase) and uses OpenLink Virtuoso as its
graph datastore. The platform itself is built on the standard protocol for
graph navigation, processing and traversal: SPARQL.
● Every node on Freebase has a unique alphanumeric ID made of two parts, a
namespace and a key. Together they are called the 'mid'.
● Every predicate in Freebase has a source ID, or source namespace. For
example, the predicate "Nationality" has the source URL
"http://rdf.freebase.com/ns/people/person/nationality".
In our app we have predefined entities and their properties, using predicate
URLs as source IDs. For example, a Person entity in our system has a
Nationality property with a source URL and source "freebase". This way we can
add more sources in the future, and a single entity can carry properties from
one or more sources.
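A hypothetical sketch of what such a predefinition can look like; the class and field names here are illustrative rather than Ranker's actual code, though the predicate URLs follow the real Freebase namespace pattern.

```java
import java.util.List;

// Illustrative only: an app-level entity maps each property to the
// predicate URL it is sourced from, plus a source label, so new sources
// can be added later without changing the entity model.
record SourceProperty(String name, String predicateUrl, String source) {}

record EntityDefinition(String name, List<SourceProperty> properties) {}

class EntityCatalog {
    static final EntityDefinition PERSON = new EntityDefinition("Person", List.of(
            new SourceProperty("nationality",
                    "http://rdf.freebase.com/ns/people/person/nationality", "freebase"),
            new SourceProperty("dateOfBirth",
                    "http://rdf.freebase.com/ns/people/person/date_of_birth", "freebase")
    ));
}
```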
SPARQL
● SPARQL is a query language for RDF data.
● In our usage, the results of these queries always come back as triples.
We therefore build these queries dynamically, depending on what data we need.
In our experience, avoiding joins in SPARQL queries improves performance; a
sketch of the resulting query shape follows.
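A minimal sketch of the "avoid joins" idea: instead of one query that chains several predicates, issue a single-pattern query per predicate and assemble the results in application code. The builder method and the placeholder mid below are illustrative, not our production code.

```java
public class QueryBuilder {
    // Build a single-pattern SPARQL query: one subject, one predicate,
    // no joins. Multi-predicate views are assembled in application code.
    static String buildQuery(String subjectUri, String predicateUri) {
        return "SELECT ?o WHERE { <" + subjectUri + "> <" + predicateUri + "> ?o }";
    }

    public static void main(String[] args) {
        System.out.println(buildQuery(
                "http://rdf.freebase.com/ns/m/0placeholder",  // hypothetical mid
                "http://rdf.freebase.com/ns/people/person/nationality"));
    }
}
```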
API
● We chose the Java-based Jena API for Virtuoso.
● It establishes a connection to the triple store over JDBC.
The API supports SPARQL, and the results are packaged as RDF objects, so we
can easily read them and use adapters to transform them into app objects.
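A minimal sketch of that path, using the Virtuoso Jena provider. Package and class names vary by Jena/provider version, and the connection details are placeholders (1111 is Virtuoso's default JDBC port).

```java
import org.apache.jena.query.Query;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

import virtuoso.jena.driver.VirtGraph;
import virtuoso.jena.driver.VirtuosoQueryExecution;
import virtuoso.jena.driver.VirtuosoQueryExecutionFactory;

public class VirtuosoQuerySketch {
    public static void main(String[] args) {
        // Placeholder credentials; the connection runs over JDBC.
        VirtGraph graph = new VirtGraph("jdbc:virtuoso://localhost:1111", "dba", "dba");

        Query query = QueryFactory.create(
                "SELECT ?s ?o WHERE { ?s "
              + "<http://rdf.freebase.com/ns/people/person/nationality> ?o } LIMIT 10");

        VirtuosoQueryExecution exec = VirtuosoQueryExecutionFactory.create(query, graph);
        ResultSet results = exec.execSelect();
        while (results.hasNext()) {
            QuerySolution row = results.next();  // each row binds RDF nodes
            System.out.println(row.get("s") + " -> " + row.get("o"));
        }
        exec.close();
        graph.close();
    }
}
```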
Data Aggregation
This is what makes our platform truly powerful. Not only do we store the
knowledge graph locally, we can also create our own custom graph from this
data. The Ranker system has approximately 20 million nodes and powers half a
million lists, and counting.
Not all entities in our system are simple; some are complex, meaning their
properties belong to one or more types on Freebase.
For example, a 'Person' node in our system will not only have date of birth,
place of birth, age, etc., but also properties like 'dated' and 'breakups'. We
achieved this by predefining aggregation rules for each and every entity in
our system, based on feedback from our SEO and business teams.
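A hypothetical sketch of what an aggregation rule can look like. The rule structure and names are ours for this sketch, and the second type URL assumes Freebase's "popstra" base, which carried celebrity dating/breakup relationships.

```java
import java.util.List;
import java.util.Map;

// Illustrative only: one app entity aggregates properties from several
// Freebase types. The rule shape is our invention for this sketch.
class AggregationRules {
    static final Map<String, List<String>> TYPES_PER_ENTITY = Map.of(
            "Person", List.of(
                    "http://rdf.freebase.com/ns/people/person",          // date/place of birth, etc.
                    "http://rdf.freebase.com/ns/base/popstra/celebrity"  // dated, breakup relationships
            ));
}
```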
Conclusion