Open Source Tools for Creating Mashups with Government Datasets MOSC2010
1. Open Source Tools for Creating Mashups with
Government Datasets
Mohammed Firdaus, Muhd Sharuzzamal Bakri
June 29, 2010
Mohammed Firdaus, Muhd Sharuzzamal Bakri Open Source Tools for Creating Mashups with Government Datas
2. Introduction About the Speakers
About the Speakers
Mohammed Firdaus bin Mohammed Ab Halim
(@firdaus halim) and Muhd Sharuzzamal Bakri (@amai)
Founders of Persada Terbilang Sdn Bhd - We have no
relationship whatsoever to any fertilizer supplier
3. Introduction What are Mashups?
Mashups
A mashup is a web page or application that uses and
combines data, presentation or functionality from two or
more sources to create new services.
(Source: Wikipedia)
Data mashups combine similar types of media and
information from multiple sources into a single
representation.
(Source: Wikipedia)
4. Challenges Data Sets are Not Available in Machine Readable Form
Data Sets are Not Available in Machine Readable Form
Searches like these turn up nothing useful:
filetype:csv site:.gov.my
filetype:xml site:.gov.my
filetype:rdf site:.gov.my
We have to resort to web scraping.
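A minimal sketch of what such scraping can look like, using only the Python standard library. The HTML snippet and the tender title in it are made up for illustration, and the real page layout is an assumption; this is not the scraper used for the talk.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page fragment; the real site's markup may differ.
SAMPLE = """
<table>
  <tr><th>Tajuk Tender</th><th>Nombor Tender</th></tr>
  <tr><td>Example tender title</td><td>8/2009</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE)
rows = parser.rows
print(rows)
```

The resulting rows can then be written out with the csv module so that later steps work from a machine readable dump rather than from the site itself.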
5. Challenges No Data Dictionaries
No Data Dictionaries
Since the data sets that are available were meant for humans
to consume rather than machines, they are usually published
without any type of data dictionary.
This means that an application developer has to make
assumptions about the structure of each field, e.g. whether it’s
unique, whether it’s a multi-value field, and which fields are
mandatory or optional.
These assumptions may or may not turn out to be correct as you
see more and more data in the data set.
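One way to test such assumptions early is to profile each field of the dump. The sketch below counts missing, distinct and repeated values per column; the column names and sample rows are hypothetical and only illustrate the idea.

```python
import csv
import io
from collections import Counter


def profile(rows, fields):
    """Return per-field counts of missing, distinct and repeated values."""
    report = {}
    for field in fields:
        values = [(row.get(field) or "").strip() for row in rows]
        present = [v for v in values if v]
        counts = Counter(present)
        report[field] = {
            "missing": len(values) - len(present),                 # mandatory?
            "distinct": len(counts),
            "repeated": sum(1 for n in counts.values() if n > 1),  # unique?
        }
    return report


# Hypothetical sample; column names are assumptions, not the real CSV header.
SAMPLE_CSV = """tender_no,ministry
8/2009,KEMENTERIAN KESIHATAN
8/2009,KEMENTERIAN KESIHATAN
,KEMENTERIAN KESIHATAN
128/2009,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = profile(rows, ["tender_no", "ministry"])
print(report)
```

A field with a non-zero "missing" count cannot be treated as mandatory, and one with a non-zero "repeated" count cannot be used as a unique key.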
6. Challenges New Data Sets Constantly Become Available
New Data Sets Constantly Become Available
This is not a bad thing.
However, our code, database and schema must be flexible
enough to deal with future data sets that we might want to
use in our applications.
7. Challenges Lack of Standards Across Agencies
Lack of Standards Across Agencies
Different identifiers for referring to the same entity.
The lack of common identifiers makes it tedious to combine
data sets that may be describing the same entity.
MyCoID and MyID are steps in the right direction.
8. Challenges Summary
In Summary
Because of these challenges, we need an agile method for
modeling, storing and processing these government datasets in
our application.
The purpose of this presentation is to show how representing
your data as a graph helps you both deal with these challenges
and make compelling data mashups.
9. Graphs Introduction to Graphs
What is a Graph?
A data structure that consists of a collection of vertices and
the connections between those vertices, called edges.
Vertices are sometimes called nodes or dots.
Edges are sometimes called relationships, links or arcs.
The terminology differs between software packages.
10. Graphs Types of Graphs
Types of Graphs
A directed graph (or digraph) is one where the edges have a
direction (i.e. there’s an outgoing and incoming vertex).
A multigraph is one where multiple edges can exist between
two vertices.
An edge-labeled graph is a graph where edges have labels.
Similarly, a vertex-labeled graph is one in which the vertices
have labels.
An attributed graph is one in which the vertices and edges can
have attributes (key-value pairs).
A graph can have more than one of these properties, e.g. a
multi-digraph is one in which multiple directed edges can exist
between two vertices.
11. Graphs Types of Graphs
Types of Graphs - Simple/Undirected Graphs
12. Graphs Types of Graphs
Types of Graphs - Directed Graph
13. Graphs Types of Graphs
Types of Graphs - Edge and Node Labeled Graph
14. Graphs Types of Graphs
Types of Graphs - Multigraph
15. Graphs Types of Graphs
Types of Graphs - Attributed Multigraph
16. Graphs Types of Graphs
Examples - Social Graphs
Source: http://www.flickr.com/photos/greenem/11696663/
Undirected Graph - Vertices represent people and edges
represent friendships.
17. Graphs Types of Graphs
Examples - Web Graph
http://en.wikipedia.org/wiki/File:WorldWideWebAroundWikipedia.png
Multi-digraph - Vertices represent web pages and directed
edges represent links between pages.
18. Graphs Property Graphs
Property Graphs
A "property graph" is another term for an attributed, labeled
multi-digraph.
Property graphs are flexible enough to support most types of
graph data. Other types of graphs (with the exception of
hypergraphs) can be built on top of property graphs by
removing features or using features of the property graph in
certain ways.
The tools that we are covering in this presentation deal
primarily with property graphs.
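To make the definition concrete, here is a minimal in-memory sketch of an attributed, edge-labeled multi-digraph in plain Python. The class and method names are made up for this illustration and do not correspond to any of the tools covered in this presentation.

```python
class PropertyGraph:
    """Attributed, edge-labeled multi-digraph held in plain Python containers."""

    def __init__(self):
        self.vertices = {}   # vertex id -> dict of attributes
        self.edges = []      # (tail id, label, head id, dict of attributes)

    def add_vertex(self, vid, **attrs):
        self.vertices[vid] = dict(attrs)
        return vid

    def add_edge(self, tail, label, head, **attrs):
        # A list (rather than a set or dict) lets parallel edges coexist,
        # which is what makes this a multigraph.
        self.edges.append((tail, label, head, dict(attrs)))

    def out_edges(self, vid, label=None):
        return [e for e in self.edges
                if e[0] == vid and (label is None or e[1] == label)]

    def in_edges(self, vid, label=None):
        return [e for e in self.edges
                if e[2] == vid and (label is None or e[1] == label)]


# A tiny web graph: two pages, two parallel links one way and one link back.
g = PropertyGraph()
g.add_vertex("a", name="page A")
g.add_vertex("b", name="page B")
g.add_edge("a", "links_to", "b", anchor="first link")
g.add_edge("a", "links_to", "b", anchor="second link")
g.add_edge("b", "links_to", "a")

print(len(g.out_edges("a", "links_to")), len(g.in_edges("a")))
```

Ignoring edge direction, labels or attributes in such a structure gives the simpler graph types described earlier.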
19. Graphs Property Graphs
Property Graphs
Source: http://wiki.github.com/tinkerpop/gremlin/defining-a-property-graph
20. Data Sets Treasury Procurement Data
Treasury - Tenders Awarded
Source: http://myprocurement.treasury.gov.my/index.php/en/list-keputusan-tender
21. Data Sets Treasury Procurement Data
Fields
Tajuk Tender (Title of Tender)
Nombor Tender (Tender Number)
Kategori Perolehan (Procurement Category)
Kementerian (Ministry)
Petender Berjaya (Winner of Tender)
No Pendaftaran Dengan ROB/ROS/ROC (Registration
Number with ROB/ROS/ROC)
No Pendaftaran Dengan MOF/PKK (Registration Number
with MOF/PKK)
Harga Setuju Terima (Agreed Upon Value)
22. Data Sets Treasury Procurement Data
Code and Data in Machine Readable Form
For this presentation we are using data that we scraped from
this site on 2010-04-26.
The source code for our scraper and the CSV dump from
2010-04-26 are available at
http://mfirdaus.com/mosc-paper/
The dump contains 2615 records.
23. Data Sets Treasury Procurement Data
The Dump
24. Data Sets Issues with this Data Set
Missing Fields
Out of the 2615 records in the dump
510 records were missing a tender number
472 records were missing a category
1836 records were missing a ROB/ROS/ROC number
510 records were missing a MOF/PKK registration number
25. Data Sets Issues with this Data Set
Tender Numbers are Not Unique
32 records have the same tender number and title as another
record
23 records have the same tender number as another record
In some cases these appear to be duplicate records since the
fields all match up.
In other cases, one or two fields are slightly different,
indicating that there was probably a typo and the erroneous
record was not deleted.
In yet other cases, the other fields are completely different, which
leads us to think that it’s possible for there to be multiple
winners of a tender (we need some government officials to verify
this for us).
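A rough sketch of how records sharing a tender number can be separated into exact duplicates and conflicting records. The field names and sample rows are hypothetical; telling a typo apart from a genuine multi-winner tender would still need manual review.

```python
from collections import defaultdict


def classify_duplicates(rows, key="tender_no"):
    """Split repeated key values into exact duplicates and conflicting records."""
    groups = defaultdict(list)
    for row in rows:
        if row.get(key):                 # skip records with no tender number
            groups[row[key]].append(row)

    exact, conflicting = [], []
    for value, members in groups.items():
        if len(members) < 2:
            continue
        # Records are identical if all of their fields match up.
        distinct = {tuple(sorted(m.items())) for m in members}
        (exact if len(distinct) == 1 else conflicting).append(value)
    return sorted(exact), sorted(conflicting)


# Hypothetical records, not taken from the real dump.
rows = [
    {"tender_no": "T-1", "title": "x", "winner": "ALPHA SDN BHD"},
    {"tender_no": "T-1", "title": "x", "winner": "ALPHA SDN BHD"},
    {"tender_no": "T-2", "title": "y", "winner": "ALPHA SDN BHD"},
    {"tender_no": "T-2", "title": "y", "winner": "BETA SDN BHD"},
    {"tender_no": "T-3", "title": "z", "winner": "BETA SDN BHD"},
    {"tender_no": "", "title": "w", "winner": "BETA SDN BHD"},
]

exact, conflicting = classify_duplicates(rows)
print(exact, conflicting)
```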
26. Data Sets Issues with this Data Set
Format of Tender Numbers
Examples of tender numbers:
8/2009
PL.(T).08.2009(JKP)
X0141110101090021
128/2009
KBS.S.4-14/69 (T.26/2009)
Probably not a good idea to write code that attempts to parse the
tender number.
27. Data Sets Issues with this Data Set
Format of the "Petender Berjaya" Field
SYARIKAT PROSPECTRUM SDN BHD
TELEKOM SMART SCHOOL SDN BHD NO.45-8, LEVEL 3,
BLOCK C, PLAZA DAMANSARA, JALAN MEDAN SETIA
1, BUKIT DAMANSARA 50490 KUALA LUMPUR
1. GLOBAL AEROSPACE SDN BHD (A002) 2. SYSTEM
ALLIANCE TECHNOLOGY SDN. BHD.(A003) 3. KARISMA
WIRA SDN. BHD. (A004) 4. KESUMA TECHNOLOGY
SDN. BHD (A005)
A QUALITY REPUTATION SDN BHD B PRIMABUMI SDN
BHD
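The first three shapes above can be handled with a couple of heuristics, sketched below: split on "1. ", "2. " style numbering, then cut each part after the company suffix to drop trailing addresses and codes. The regular expressions are assumptions based only on these examples, and the lettered form in the last example ("A ... B ...") is deliberately left unhandled.

```python
import re

# "1. ", "2. " ... at the start of the field or after whitespace.
NUMBERED = re.compile(r"(?:^|\s)\d+\.\s+")
# Everything up to and including the first company suffix.
SUFFIX = re.compile(r"^(.*?\b(?:SDN\.?\s*BHD\.?|BERHAD))", re.IGNORECASE)


def split_winners(field):
    """Heuristically extract company names from a 'Petender Berjaya' value."""
    text = " ".join(field.split())          # undo line wrapping
    names = []
    for part in NUMBERED.split(text):
        if not part.strip():
            continue
        match = SUFFIX.match(part)
        name = match.group(1) if match else part
        # Normalise "SDN. BHD." to "SDN BHD".
        names.append(" ".join(name.replace(".", " ").split()))
    return names


single = split_winners("SYARIKAT PROSPECTRUM SDN BHD")
with_address = split_winners(
    "TELEKOM SMART SCHOOL SDN BHD NO.45-8, LEVEL 3, BLOCK C, PLAZA DAMANSARA, "
    "JALAN MEDAN SETIA 1, BUKIT DAMANSARA 50490 KUALA LUMPUR")
numbered = split_winners(
    "1. GLOBAL AEROSPACE SDN BHD (A002) 2. SYSTEM ALLIANCE TECHNOLOGY "
    "SDN. BHD.(A003) 3. KARISMA WIRA SDN. BHD. (A004) 4. KESUMA TECHNOLOGY "
    "SDN. BHD (A005)")
print(single, with_address, numbered)
```

Heuristics like these will misfire on names the examples do not cover, so the output is best treated as a first pass to be checked by hand.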
28. Data Sets Modeling
Modeling this Data Set as a Property Graph
One way to model this data as a graph is to use:
Vertices to represent tenders, ministries and
companies/businesses.
An "awarded_by" labeled edge to associate a tender with a
ministry.
An "awarded_to" labeled edge to associate a tender with the
winner of the tender (the company/business).
Attributes on tender vertices for the tender title, number,
value and category.
Attributes on company/business vertices for the
company/business name, ROB/ROC/ROS registration
number and MOF registration number.
Attributes on ministry vertices for the name of the ministry.
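The model above can be sketched in plain Python before any database is involved. Ministries and companies are looked up by name so that each gets a single vertex, while every record gets its own tender vertex because tender numbers are not unique. All record values and field names below are hypothetical.

```python
vertices = []   # each vertex is a dict of attributes, including "type"
edges = []      # (tail index, label, head index) triples
lookup = {}     # (type, name) -> vertex index, for get-or-create


def get_or_create(vtype, name, **attrs):
    """Return the index of the vertex with this type and name, creating it once."""
    key = (vtype, name)
    if key not in lookup:
        vertices.append(dict(attrs, type=vtype, name=name))
        lookup[key] = len(vertices) - 1
    return lookup[key]


def add_record(rec):
    """Add one tender record as a tender vertex plus its two labeled edges."""
    ministry = get_or_create("ministry", rec["ministry"])
    entity = get_or_create("business entity", rec["winner"],
                           no=rec["reg_no"], mof_no=rec["mof_no"])
    vertices.append({"type": "tender", "no": rec["tender_no"],
                     "title": rec["title"], "category": rec["category"],
                     "value": rec["value"]})
    tender = len(vertices) - 1
    edges.append((tender, "awarded_by", ministry))
    edges.append((tender, "awarded_to", entity))
    return tender


# Hypothetical records; none of these values come from the real dump.
add_record({"ministry": "MINISTRY A", "winner": "ALPHA SDN BHD",
            "reg_no": "", "mof_no": "", "tender_no": "T-1",
            "title": "Example tender one", "category": "", "value": "1000"})
add_record({"ministry": "MINISTRY A", "winner": "BETA SDN BHD",
            "reg_no": "", "mof_no": "", "tender_no": "T-2",
            "title": "Example tender two", "category": "", "value": "2000"})

print(len(vertices), len(edges))
```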
29. Data Sets Modeling
Example
30. Graph Databases and Neo4j Neo4j - Introduction
Neo4j
Neo4j is a graph database: it persists data in graph form.
It uses the property graph data model, with the exception of vertex labels.
In Neo4j terms, vertices are nodes, edges are relationships and
attributes are properties.
Property values can be a String or any Java primitive (arrays
of these types are supported as well).
Licensed under the AGPLv3, which basically means that you
don’t need a commercial license if your application is released under a
compatible free software license.
For other uses, you need a commercial license from the vendor.
31. Graph Databases and Neo4j Neo4j - Introduction
Neo4j
Written in Java.
Bindings available for Python, Ruby, Clojure, Erlang, Groovy,
Scala and PHP.
We will be using the Python bindings in this talk.
An embedded database, meaning that it runs in the same
process space as the application.
There’s a standalone REST server for those who prefer it.
32. Graph Databases and Neo4j Inserting into Neo4j
Initializing the Database
import neo4j
db = neo4j.GraphDatabase("db")
33. Graph Databases and Neo4j Inserting into Neo4j
Creating the Nodes
ministry_node = db.node(name=ministry, type="ministry")
entity_node = db.node(name=entity_name, no=entity_no,
                      mof_no=entity_mof_no, type="business entity")
tender_node = db.node(no=tender_no, title=tender_title,
                      category=tender_category, value=tender_value,
                      type="tender")
34. Graph Databases and Neo4j Inserting into Neo4j
Creating the Relationships
tender_node.awarded_by(ministry_node)
tender_node.awarded_to(entity_node)
35. Graph Databases and Neo4j Inserting into Neo4j
Indexing Nodes
ministries = db.index("ministries", create=True)
business_entities = db.index("business entities", create=True)
tenders_by_no = db.index("tenders by no", create=True)
tenders_by_title = db.index("tenders by title", create=True)
tenders_by_no[tender_no] = tender_node
tenders_by_title[tender_title] = tender_node
36. Graph Databases and Neo4j Inserting into Neo4j
The Result
37. Graph Traversals
Traversing the Graph
Traversing is the process of walking the graph from vertex to vertex along its edges.
38. Graph Traversals
Graph Traversal Options
Graph Traversal Framework
Gremlin
SPARQL
Manual traversal
39. Graph Traversals
Problem
Let's use graph traversal to find all the companies that have been
awarded contracts by Kementerian Kesihatan.
40. Graph Traversals
Graph Around Kementerian Kesihatan
41. Graph Traversals Traversal Framework
Defining the Traversal
# Companies that have won contracts from a particular ministry.
# The start node is a ministry.
class Contractors(neo4j.Traversal):
    # Follow awarded_by edges backwards (ministry -> tender), then
    # awarded_to edges forwards (tender -> business entity).
    types = [neo4j.Incoming.awarded_by,
             neo4j.Outgoing.awarded_to]
    order = neo4j.DEPTH_FIRST
    stop = neo4j.STOP_AT_END_OF_GRAPH

    def isReturnable(self, position):
        # Only business entities are returned; tenders are just passed through.
        return position["type"] == "business entity"
42. Graph Traversals Traversal Framework
Using the Traversal
with db.transaction:
    moh = ministries["KEMENTERIAN KESIHATAN"]
    contractors = Contractors(moh)
    for c in contractors:
        print(c["name"])
43. Graph Traversals Traversal Framework
Output
RAF SYNERGY SDN BHD
PRIMABUMI SDN BHD
AVERROES PHARMACEUTICALS SDN BHD
QUALITY REPUTATION SDN BHD
UNISENDO SDN BHD
PRESTIGE PHARMA SDN BHD
PHARMANIAGA LOGISTICS SDN BHD
IDAMAN PHARMA SDN BHD
PHARMASERV ALLIANCES SDN BHD
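For readers without Neo4j at hand, the behaviour of such a traversal can be approximated in plain Python. The sketch below is not the Neo4j traversal framework: it is a depth-first walk over a list of (tail, label, head) triples that follows only the listed edge types, visits each node once and yields the nodes accepted by a filter. All node ids and names are hypothetical.

```python
def traverse(edges, start, types, returnable):
    """Depth-first walk from start over (tail, label, head) edge triples.

    types is a set of (direction, label) pairs such as ("in", "awarded_by").
    Each node is visited at most once.
    """
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        if returnable(node):
            yield node
        for tail, label, head in edges:
            if tail == node and ("out", label) in types:
                nxt = head
            elif head == node and ("in", label) in types:
                nxt = tail
            else:
                continue
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)


# Hypothetical graph: ministry m1 awarded t1-t3, ministry m2 awarded t4.
nodes = {
    "m1": {"type": "ministry", "name": "MINISTRY A"},
    "m2": {"type": "ministry", "name": "MINISTRY B"},
    "t1": {"type": "tender"}, "t2": {"type": "tender"},
    "t3": {"type": "tender"}, "t4": {"type": "tender"},
    "e1": {"type": "business entity", "name": "ALPHA SDN BHD"},
    "e2": {"type": "business entity", "name": "BETA SDN BHD"},
    "e3": {"type": "business entity", "name": "GAMMA SDN BHD"},
}
edges = [
    ("t1", "awarded_by", "m1"), ("t1", "awarded_to", "e1"),
    ("t2", "awarded_by", "m1"), ("t2", "awarded_to", "e1"),
    ("t3", "awarded_by", "m1"), ("t3", "awarded_to", "e2"),
    ("t4", "awarded_by", "m2"), ("t4", "awarded_to", "e3"),
]

contractors = sorted(
    nodes[n]["name"]
    for n in traverse(edges, "m1",
                      {("in", "awarded_by"), ("out", "awarded_to")},
                      lambda n: nodes[n]["type"] == "business entity"))
print(contractors)
```

Because every node is visited only once, a company that won several tenders from the same ministry appears a single time in the result.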
44. Graph Traversals Traversing Graphs with Gremlin
Gremlin
Gremlin is a graph-based programming language.
Can express complex graph traversals concisely.
Available at
http://wiki.github.com/tinkerpop/gremlin/
45. Graph Traversals Traversing Graphs with Gremlin
Traversing the Graph with Gremlin
$ ./gremlin.sh
         \,,,/
         (o o)
-----oOOo-(_)-oOOo-----
gremlin> $_ := g:key("ministries", "KEMENTERIAN KESIHATAN")
==>v[66]
gremlin> ./inE[@label="awarded_by"]/outV/
         outE[@label="awarded_to"]/inV/@name
==>PHARMASERV ALLIANCES SDN BHD
==>IDAMAN PHARMA SDN BHD
==>PHARMANIAGA LOGISTICS SDN BHD
==>PRIMABUMI SDN BHD
==>PRESTIGE PHARMA SDN BHD
==>UNISENDO SDN BHD
==>PRIMABUMI SDN BHD
==>QUALITY REPUTATION SDN BHD
==>AVERROES PHARMACEUTICALS SDN BHD
==>PRIMABUMI SDN BHD
.....
46. Graph Traversals Traversing Graphs with Gremlin
Explanation
./inE[@label="awarded_by"]/outV/outE[@label="awarded_to"]/inV/@name
inE - the incoming edges of a vertex
outV - the vertex an edge goes out of (its tail)
outE - the outgoing edges of a vertex
inV - the vertex an edge comes into (its head)
47. Graph Traversals Traversing Graphs with Gremlin
Explanation
./inE[@label="awarded_by"]/outV/outE[@label="awarded_to"]/inV/@name
48. Graph Traversals Traversing Graphs with Gremlin
Explanation
./inE[@label="awarded_by"]/outV/outE[@label="awarded_to"]/inV/@name
Get the current object (.), i.e. the "KEMENTERIAN KESIHATAN"
node.
Get its incoming edges labeled "awarded_by"
(inE[@label="awarded_by"]).
Get the vertices those edges go out of (outV), i.e. the tender
nodes.
Get the outgoing "awarded_to" edges of the tender nodes
(outE[@label="awarded_to"]).
Get the vertices those edges come into (inV), i.e. the business
entity vertices.
Get the name attributes of those vertices (@name).
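The same steps can be mirrored one by one in plain Python over a list of (tail, label, head) triples. This is only an illustration of the path semantics, not how Gremlin is implemented, and the data is hypothetical.

```python
# Hypothetical edges: three tenders awarded by ministry "m1".
edges = [
    ("t1", "awarded_by", "m1"), ("t1", "awarded_to", "e1"),
    ("t2", "awarded_by", "m1"), ("t2", "awarded_to", "e1"),
    ("t3", "awarded_by", "m1"), ("t3", "awarded_to", "e2"),
]
names = {"e1": "ALPHA SDN BHD", "e2": "BETA SDN BHD"}


def path_names(start):
    """Evaluate ./inE[awarded_by]/outV/outE[awarded_to]/inV/@name step by step."""
    in_e = [e for e in edges
            if e[2] == start and e[1] == "awarded_by"]        # inE[@label=...]
    tenders = [e[0] for e in in_e]                            # outV
    out_e = [e for t in tenders for e in edges
             if e[0] == t and e[1] == "awarded_to"]           # outE[@label=...]
    winners = [e[2] for e in out_e]                           # inV
    return [names[w] for w in winners]                        # @name


result = path_names("m1")
print(result)
```

Each step maps a list of objects to a new list, so a company reached through several tenders is listed once per path. This is presumably why PRIMABUMI SDN BHD appears more than once in the Gremlin output while the traversal framework output lists it only once.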
49. Graph Visualizations Gephi
Gephi
Photoshop for graphs.
Supports various graph layout algorithms.
Graph metrics supported - clustering coefficient, PageRank,
diameter, betweenness centrality, closeness centrality
File formats supported - CSV, GraphML, GEXF, etc.
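To get a graph into one of these formats, a small exporter is usually enough. The sketch below writes a directed graph as GraphML with the standard library; the function name, the attribute key names ("name", "label") and the sample data are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

NS = "http://graphml.graphdrawing.org/xmlns"


def to_graphml(nodes, edges):
    """Serialise {id: name} nodes and (tail, label, head) edges as GraphML text."""
    root = ET.Element("graphml", xmlns=NS)
    # Declare one string attribute for nodes and one for edges.
    ET.SubElement(root, "key", {"id": "name", "for": "node",
                                "attr.name": "name", "attr.type": "string"})
    ET.SubElement(root, "key", {"id": "label", "for": "edge",
                                "attr.name": "label", "attr.type": "string"})
    graph = ET.SubElement(root, "graph", edgedefault="directed")
    for node_id, name in nodes.items():
        node = ET.SubElement(graph, "node", id=node_id)
        ET.SubElement(node, "data", key="name").text = name
    for i, (tail, label, head) in enumerate(edges):
        # Distinct edge ids keep parallel edges apart.
        edge = ET.SubElement(graph, "edge", id="e%d" % i,
                             source=tail, target=head)
        ET.SubElement(edge, "data", key="label").text = label
    return ET.tostring(root, encoding="unicode")


# Hypothetical two-node graph.
xml_text = to_graphml({"t1": "Example tender", "e1": "ALPHA SDN BHD"},
                      [("t1", "awarded_to", "e1")])
print(xml_text)
```

The resulting text can be saved to a .graphml file and opened in a graph visualization tool.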
http://www.gephi.org
50. Graph Visualizations Gephi
51. Graph Visualizations Gephi
52. Mashing Up Adding External Data Sources
Mashing Up
Let's add shareholding data from Suruhanjaya Syarikat Malaysia
(SSM) to the graph so that we can show the tenders that have
been awarded to Telekom Malaysia Berhad and any of its
subsidiaries/associate companies.
53. Mashing Up Adding External Data Sources
Connecting Telekom Malaysia Berhad and Telekom Smart
School Sdn Bhd
telekom = business_entities["TELEKOM MALAYSIA BERHAD"]
telekom_smart_school = business_entities["TELEKOM SMART SCHOOL SDN BHD"]
telekom_multi_media = db.node(
    name="TELEKOM MULTI-MEDIA SDN BHD",
    no="345420-H", text="TELEKOM MULTI-MEDIA SDN BHD",
    type="business entity")
telekom.shareholder_in(telekom_multi_media, units=1650000)
telekom_multi_media.shareholder_in(telekom_smart_school,
                                   units=7650000)
54. Mashing Up Adding External Data Sources
Graph Centered at Telekom Malaysia Berhad
55. Mashing Up Adding External Data Sources
Graph Centered at Telekom Smart School Sdn Bhd
56. Mashing Up Traversing to Find Direct/Indirect Awards
The Traverser
class AllTendersDirectIndirect(neo4j.Traversal):
    # Follow shareholder_in edges down to subsidiaries and
    # awarded_to edges backwards to the tenders they won.
    types = [neo4j.Incoming.awarded_to,
             neo4j.Outgoing.shareholder_in]
    order = neo4j.DEPTH_FIRST
    stop = neo4j.STOP_AT_END_OF_GRAPH

    def isReturnable(self, position):
        # Only tender nodes are returned.
        return position["type"] == "tender"
57. Mashing Up Traversing to Find Direct/Indirect Awards
Executing the Traverser and the Output
Executing the Traversal Definition
telekom = business_entities["TELEKOM MALAYSIA BERHAD"]
tenders = AllTendersDirectIndirect(telekom)
for tender in tenders:
    print(tender["no"])
Output
30/2009
35/2009
8/2009
162/2009
JASA/OP/1/2009
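The output above does not say whether a tender was won directly or through a subsidiary. As a possible refinement, sketched here in plain Python rather than with the Neo4j traversal framework, the walk can carry along the chain of companies it followed. All company names and tender numbers below are hypothetical.

```python
def tenders_with_paths(edges, start):
    """Map each reachable tender to the chain of companies leading to its winner.

    edges are (tail, label, head) triples; the walk follows outgoing
    shareholder_in edges and incoming awarded_to edges, depth first.
    """
    found = {}
    seen = {start}
    stack = [(start, [start])]
    while stack:
        company, chain = stack.pop()
        for tail, label, head in edges:
            if label == "awarded_to" and head == company:
                found.setdefault(tail, chain)          # tail is the tender
            elif (label == "shareholder_in" and tail == company
                  and head not in seen):
                seen.add(head)
                stack.append((head, chain + [head]))   # descend to subsidiary
    return found


# Hypothetical shareholding chain and awards.
edges = [
    ("PARENT BERHAD", "shareholder_in", "MIDDLE SDN BHD"),
    ("MIDDLE SDN BHD", "shareholder_in", "CHILD SDN BHD"),
    ("T-1", "awarded_to", "PARENT BERHAD"),
    ("T-2", "awarded_to", "CHILD SDN BHD"),
    ("T-3", "awarded_to", "OTHER SDN BHD"),
]

paths = tenders_with_paths(edges, "PARENT BERHAD")
print(paths)
```

A chain of length one marks a direct award; longer chains show which subsidiaries connect the parent company to the winner.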
58. Wrapup Making this Easier
59. Wrapup Making this Easier
60. Wrapup Making this Easier