ckan 2.0: Harvesting from other sources
1. ckan 2.0:
Harvesting from other sources
Internship @ Academia Sinica
Report #3
Presenter: Cheng-Jen Lee (Sol)
Email: cjlee AT iis.sinica.edu.tw
This work is licensed under a
Creative Commons Attribution-ShareAlike 3.0 Taiwan License.
2. Agenda
● Harvesters
– Usage: manually and automatically
– Custom harvester
– some issues
● Linked Data and RDF
Oct 13, 2014 2
7. Harvesters
● Custom harvester
– We can implement the harvester interface to
perform harvesting operations
– The process takes place in three stages:
● gather: collect the identifiers of the objects to harvest
● fetch: fetch the contents of each object
● import: create a CKAN package (dataset)
– Implementation
● https://github.com/u10313335/ckanext-harvest/blob/master/ckanext/harvest/harvesters/srdaharvester.py
8. Harvesters
● Harvesting Interface
from ckan.plugins.core import SingletonPlugin, implements
from ckanext.harvest.interfaces import IHarvester


class MyHarvester(SingletonPlugin):
    implements(IHarvester)

    def get_original_url(self, harvest_object_id):
        '''
        :param harvest_object_id: HarvestObject id
        :returns: A string with the URL to the original document
        '''

    def gather_stage(self, harvest_job):
        '''
        :param harvest_job: HarvestJob object
        :returns: A list of HarvestObject ids
        '''

    def fetch_stage(self, harvest_object):
        '''
        :param harvest_object: HarvestObject object
        :returns: True if everything went right, False if errors were found
        '''

    def import_stage(self, harvest_object):
        '''
        :param harvest_object: HarvestObject object
        :returns: True if everything went right, False if errors were found
        '''
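To make the contract above more concrete, here is a minimal sketch of how the three stages might hand data to one another. It deliberately avoids importing CKAN so that it can run on its own: `ToyHarvester`, `REMOTE_RECORDS`, `run_job`, and the dict-based harvest objects are hypothetical stand-ins, not part of ckanext-harvest, and the real extension drives the stages through its own job queue.

```python
# Hypothetical in-memory "remote source": identifier -> raw record.
REMOTE_RECORDS = {
    "rec-1": {"title": "Rainfall 2013"},
    "rec-2": {"title": "Census 2010"},
}


class ToyHarvester(object):
    """Plain-Python stand-in that mirrors the IHarvester stage methods."""

    def __init__(self):
        self.objects = {}   # harvest object id -> harvest object (a dict here)
        self.packages = {}  # harvest object id -> created dataset dict

    def gather_stage(self, harvest_job):
        # Gather: only collect identifiers; no content is downloaded yet.
        ids = []
        for guid in sorted(REMOTE_RECORDS):
            self.objects[guid] = {"id": guid, "content": None}
            ids.append(guid)
        return ids

    def fetch_stage(self, harvest_object):
        # Fetch: retrieve the content for one identifier.
        record = REMOTE_RECORDS.get(harvest_object["id"])
        if record is None:
            return False
        harvest_object["content"] = record
        return True

    def import_stage(self, harvest_object):
        # Import: turn the fetched content into a dataset.
        content = harvest_object["content"]
        if not content:
            return False
        self.packages[harvest_object["id"]] = {"title": content["title"]}
        return True


def run_job(harvester, harvest_job=None):
    """Run gather once, then fetch and import for each gathered id."""
    imported = []
    for object_id in harvester.gather_stage(harvest_job):
        obj = harvester.objects[object_id]
        if harvester.fetch_stage(obj) and harvester.import_stage(obj):
            imported.append(object_id)
    return imported
```

Calling `run_job(ToyHarvester())` walks both toy records through all three stages; a stage that returns False simply leaves that record out of the result, which matches the True/False convention in the docstrings.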
9. Harvesters
● Some issues
– Title with non-ASCII characters
– Useless update check
– TGOS CSW: failed in gather stage
● Caused by OWSLib
– Harvest sources vary
● We should modify the extension to harvest each source properly
● Modified version available
– On GitHub:
https://github.com/u10313335/ckanext-harvest
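The slide does not spell out the cause of the first issue. One plausible reading, stated here as an assumption, is that a harvester derives the dataset name (the URL slug) from the title, while CKAN accepts only lowercase ASCII letters, digits, "-" and "_" in names. A minimal sketch of a workaround, using the hypothetical helper `name_from_title`, falls back to a hash of the title when nothing ASCII survives:

```python
import hashlib
import re
import unicodedata


def name_from_title(title, max_length=100):
    """Derive an ASCII-only dataset name from a possibly non-ASCII title.

    Hypothetical helper for illustration; not part of ckanext-harvest.
    """
    # Decompose accented characters, then drop whatever is still non-ASCII.
    ascii_text = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Keep lowercase letters, digits, '-' and '_'; collapse the rest to '-'.
    slug = re.sub(r"[^a-z0-9_-]+", "-", ascii_text.lower()).strip("-")
    if len(slug) < 2:
        # Nothing usable survived (e.g. a fully Chinese title): use a hash.
        digest = hashlib.sha1(title.encode("utf-8")).hexdigest()
        slug = "dataset-" + digest[:12]
    return slug[:max_length]
```

Hashing keeps the name stable across harvest runs, so the same remote record maps to the same dataset each time; a transliteration library would give more readable names but adds a dependency.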
10. Linked Data and RDF
● Resource Description Framework
– a family of W3C specifications
– a metadata data model
– identifies resources with URIs; commonly serialized as XML
Source: http://techserviceslibrary.blogspot.tw/2011/04/rdf-resource-description.html
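As a rough illustration of the data model: RDF describes resources with subject-predicate-object statements (triples), and subjects and predicates are identified by URIs. The sketch below uses plain Python tuples rather than an RDF library. The dataset URI and the literal values are made up for the example, and `to_ntriples` is a hypothetical helper that prints the statements in the line-based N-Triples syntax.

```python
DCT = "http://purl.org/dc/terms/"  # Dublin Core terms namespace

# Hypothetical dataset URI, used only for illustration.
DATASET = "http://example.org/dataset/gold-prices"

# Each statement is (subject URI, predicate URI, literal object).
triples = [
    (DATASET, DCT + "title", "Gold Prices"),
    (DATASET, DCT + "description", "Monthly gold prices"),
]


def to_ntriples(statements):
    """Serialize (subject, predicate, literal) tuples as N-Triples lines."""
    lines = []
    for subject, predicate, literal in statements:
        # Escape backslashes and double quotes inside the literal.
        escaped = literal.replace("\\", "\\\\").replace('"', '\\"')
        lines.append('<%s> <%s> "%s" .' % (subject, predicate, escaped))
    return "\n".join(lines)


print(to_ntriples(triples))
```

The same statements could equally be written as RDF/XML; the triples stay the same and only the serialization changes.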
11. Linked Data and RDF
● Vocabularies
– DCAT and Dublin Core
● Two ways to get RDF metadata
– curl -L -H "Accept: application/rdf+xml" http://thedatahub.org/dataset/gold-prices
– curl -L http://thedatahub.org/dataset/gold-prices.rdf
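The two curl calls above can be sketched in Python with the standard library. This is only an illustration: the helper names `rdf_request_by_header`, `rdf_request_by_suffix`, and `fetch` are hypothetical, and the URL is the one from the slide, which may no longer resolve.

```python
from urllib.request import Request, urlopen

BASE_URL = "http://thedatahub.org/dataset/gold-prices"  # URL from the slide


def rdf_request_by_header(url=BASE_URL):
    """Content negotiation: same URL, ask for RDF/XML via the Accept header."""
    return Request(url, headers={"Accept": "application/rdf+xml"})


def rdf_request_by_suffix(url=BASE_URL):
    """File extension: append '.rdf' to the dataset URL."""
    return Request(url + ".rdf")


def fetch(request):
    """Send the request; urlopen follows redirects, like `curl -L`."""
    with urlopen(request) as response:
        return response.read()
```

Building the request separately from sending it makes the difference between the two approaches easy to inspect: the first changes only a header, the second changes only the URL.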
12. Documents
● Read the Docs:
– https://readthedocs.org/projects/ckan-docs-tw/