1. sitemap4rdf
generate Sitemap files from a SPARQL
endpoint
http://www.deri.ie/
Boris Villazón-Terrazas and Richard Cyganiak (DERI)
Facultad de Informática, Universidad Politécnica de Madrid
Campus de Montegancedo, sn, 28660 Boadilla del Monte, Madrid
http://www.oeg-upm.net
Phone: 34.91.3366605, Fax: 34.91.3524819
2. ToC
• Publishing Linked Data from a triple store
• Search engines
• The Sitemap protocol
• sitemap4rdf
• Summary
• Future work
3. Linked Data frontends for triple stores
Source: Pubby website, http://www4.wiwiss.fu-berlin.de/pubby/
6. Sindice: the best RDF search engine
• 120M+ documents
• Continuously updating since 2006
• Search API
• RDF/XML, Turtle, RDFa, microformats
8. Sitemap Protocol
• Used by web crawlers
• Efficiently find all your content & discover
what has been updated
http://sitemaps.org/
A sitemap file contains information regarding one or more URLs on
your Web site. The information that is stored there helps search
engines better spider your website.
10. Sitemap Protocol: Optional parts
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://yoursite/</loc>
    <lastmod>2010-06-24</lastmod>
    <changefreq>daily</changefreq>
  </url>
</urlset>
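A sitemap like the one above can also be produced programmatically. The following is a minimal Python sketch, not the code used by sitemap4rdf: the function name `build_urlset` and the second URL are assumptions made for illustration. It writes `<loc>` for every entry and adds `<lastmod>` and `<changefreq>` only when they are supplied, since both are optional.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_urlset(entries):
    """Build a <urlset> document from dicts with a required 'loc' key
    and optional 'lastmod' and 'changefreq' keys (hypothetical helper)."""
    # Serialize the sitemap namespace as the default xmlns, without a prefix.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for entry in entries:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        for tag in ("loc", "lastmod", "changefreq"):
            if tag in entry:  # optional parts are simply skipped when absent
                ET.SubElement(url, f"{{{SITEMAP_NS}}}{tag}").text = entry[tag]
    return ET.tostring(urlset, encoding="unicode")


xml_text = build_urlset([
    {"loc": "http://yoursite/", "lastmod": "2010-06-24", "changefreq": "daily"},
    {"loc": "http://yoursite/resource/1"},  # made-up URL with no optional parts
])
print(xml_text)
```

The string returned here has no XML declaration; a real file would normally start with one, as in the slide.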
11. Sitemap Protocol: Huge sitemaps
• Gzip-compress your sitemap
• Limit per sitemap file: 50k URLs or 10MB
• Beyond the limit, split into multiple sitemap files
• Add a sitemap index file
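The splitting step can be sketched as follows, under stated assumptions: the helper names `chunk` and `build_index`, the generated resource URLs, and the `sitemapN.xml.gz` file names are all hypothetical. The sketch enforces only the 50k URL limit; it does not check the size limit, and it does not write or compress any files.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file URL limit quoted on the slide


def chunk(urls, size=MAX_URLS):
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def build_index(sitemap_locations):
    """Build a <sitemapindex> document pointing at each sitemap file."""
    ET.register_namespace("", SITEMAP_NS)
    index = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for location in sitemap_locations:
        sitemap = ET.SubElement(index, f"{{{SITEMAP_NS}}}sitemap")
        ET.SubElement(sitemap, f"{{{SITEMAP_NS}}}loc").text = location
    return ET.tostring(index, encoding="unicode")


# 120,000 made-up URLs need three sitemap files under the 50k limit.
urls = [f"http://yoursite/resource/{i}" for i in range(120_000)]
parts = list(chunk(urls))
# Hypothetical file names, one per chunk.
locations = [f"http://yoursite/sitemap{n}.xml.gz" for n in range(1, len(parts) + 1)]
index_xml = build_index(locations)
```

Each chunk would then be written as its own sitemap file, and the index is the document whose location gets published.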
12. Sitemap Protocol: Discovery
• Publish the sitemap file
• Add a line to http://yoursite/robots.txt
• Web site owners use the /robots.txt file to give instructions about their site
to web robots; this is called The Robots Exclusion Protocol.
Sitemap: http://yoursite/sitemap.xml
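To illustrate how a crawler might pick up that line, here is a small Python sketch using the standard library's `urllib.robotparser` (its `site_maps()` method needs Python 3.8 or later). The robots.txt content is a made-up example, and the text is parsed from a string so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; only the Sitemap line comes from the slide.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: http://yoursite/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns the declared sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)
```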
14. sitemap4rdf
• Simple command line tool
• Sends a SPARQL query to list all URIs
• Generates sitemap
sitemap4rdf http://yoursite/sparql http://yoursite/resource/
Example:
sitemap4rdf http://geo.linkeddata.es/sparql http://geo.linkeddata.es/
• Run sitemap4rdf, specifying the SPARQL endpoint
and the prefix of the URLs to include in the Sitemap
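The idea of listing URIs under a prefix can be sketched in Python as below. This is not the query that sitemap4rdf actually sends, which the slides do not show: the query text, the helper names `build_query` and `uris_from_results`, and the sample result are assumptions. The query uses `STRSTARTS`, a SPARQL 1.1 function that an older endpoint may not support, and the sketch assumes the prefix contains no double quotes. The result is parsed from a hand-written document in the standard SPARQL JSON results format, so nothing is fetched over the network.

```python
import json


def build_query(prefix):
    """Return a SPARQL query listing distinct subject URIs under `prefix`
    (illustrative only; assumes `prefix` contains no double quotes)."""
    return (
        "SELECT DISTINCT ?s WHERE { "
        "?s ?p ?o . "
        f'FILTER(isIRI(?s) && STRSTARTS(STR(?s), "{prefix}")) '
        "}"
    )


def uris_from_results(results_json, prefix):
    """Extract the URI values bound to ?s from a SPARQL JSON results
    document, keeping only those that start with `prefix`."""
    data = json.loads(results_json)
    return [
        binding["s"]["value"]
        for binding in data["results"]["bindings"]
        if binding["s"]["type"] == "uri"
        and binding["s"]["value"].startswith(prefix)
    ]


# Hand-written sample response; the URIs are made up.
SAMPLE = json.dumps({
    "head": {"vars": ["s"]},
    "results": {"bindings": [
        {"s": {"type": "uri", "value": "http://yoursite/resource/a"}},
        {"s": {"type": "uri", "value": "http://example.org/other"}},
        {"s": {"type": "uri", "value": "http://yoursite/resource/b"}},
    ]},
})

prefix = "http://yoursite/resource/"
query = build_query(prefix)
uris = uris_from_results(SAMPLE, prefix)
print(query)
print(uris)
```

The resulting list of URIs is what would be written into the sitemap as `<loc>` entries.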
15. Submit the sitemap location - Sindice
• http://sindice.com/main/submit
16. Submit the sitemap location - Google
• https://www.google.com/webmasters/tools/
18. Summary
• Sitemap protocol informs search engines about
available pages
• Supported by Sindice!
• sitemap4rdf generates Sitemap files by listing URIs
in a SPARQL endpoint
• Open source, Java
• http://lab.linkeddata.deri.ie/2010/sitemap4rdf/
• http://mccarthy.dia.fi.upm.es/sitemap4rdf/
• http://www.oeg-upm.net/index.php/en/downloads/122-sitemap4rdf
20. Future Work
• Integrate sitemap4rdf with Pubby
• Generate voiD file automatically from a SPARQL
endpoint
• Generate an entry in CKAN (registry of open
knowledge packages) automatically through the CKAN API
• http://ckan.net/package/geolinkeddata
• Interact with prefix.cc (service for remembering and
looking up URI prefixes) through its API
• geoes: <http://geo.linkeddata.es/ontology>
21. Future Work
• Support the Semantic Sitemap extension (once it is
compatible with Google)
• http://sw.deri.org/2007/07/sitemapextension/