1. IBM Watson Data Platform and Open Data
27 February 2017
Margriet Groenendijk | Developer Advocate | IBM Watson Data Platform
@MargrietGr
https://medium.com/ibm-watson-data-lab
21. @MargrietGr
Cloudant is a database
id  firstname  lastname  dob
1   John       Smith     1970-01-01
2   Kate       Jones     1971-12-25
{
"_id": "1",
"firstname": "John",
"lastname": "Smith",
"dob": "1970-01-01"
}
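The mapping from a table row to a JSON document can be sketched in a few lines of Python. This is only an illustration of the idea on this slide, not how Cloudant stores data internally; the helper name `row_to_doc` is made up for the example.

```python
import json

# Column names and rows as shown in the table on this slide.
columns = ["id", "firstname", "lastname", "dob"]
rows = [
    (1, "John", "Smith", "1970-01-01"),
    (2, "Kate", "Jones", "1971-12-25"),
]


def row_to_doc(row):
    """Turn one table row into a JSON-style document (hypothetical helper)."""
    # The row id becomes the string field "_id", as in the document on the slide.
    doc = {"_id": str(row[0])}
    # The remaining columns become ordinary fields of the document.
    doc.update(zip(columns[1:], row[1:]))
    return doc


docs = [row_to_doc(row) for row in rows]
print(json.dumps(docs[0], indent=2))
```

Each row becomes its own self-contained document, so documents in the same database do not all need the same fields.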
38. @MargrietGr
OpenStreetMap Data
Daily updates: a VM runs a Python script from a daily cron job
IBM Cloudant: always up to date, use from anywhere!
Currently 12,467,460 POIs
40. @MargrietGr
Extract the POIs with osmosis
osmosis --read-pbf netherlands-latest.osm.pbf \
  --tf accept-nodes \
    aerialway=station \
    aeroway=aerodrome,helipad,heliport \
    'amenity=*' 'craft=*' 'emergency=*' \
    highway=bus_stop,rest_area,services \
    'historic=*' 'leisure=*' 'office=*' \
    public_transport=stop_position,stop_area \
    'shop=*' 'tourism=*' \
  --tf reject-ways --tf reject-relations \
  --write-xml netherlands.nodes.osm
(easy to install with brew on Mac)
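To make the tag filter easier to read, here is a rough Python sketch of the selection rule: a node is kept when at least one of its tags matches the accept list. It is not how osmosis is implemented, and the sample nodes and the helper name `is_poi` are invented for illustration.

```python
# Accept list from the slide: key -> set of accepted values, or "*" for any value.
ACCEPT = {
    "aerialway": {"station"},
    "aeroway": {"aerodrome", "helipad", "heliport"},
    "amenity": "*", "craft": "*", "emergency": "*",
    "highway": {"bus_stop", "rest_area", "services"},
    "historic": "*", "leisure": "*", "office": "*",
    "public_transport": {"stop_position", "stop_area"},
    "shop": "*", "tourism": "*",
}


def is_poi(tags):
    """Return True if at least one tag matches the accept list (hypothetical helper)."""
    for key, value in tags.items():
        accepted = ACCEPT.get(key)
        if accepted == "*" or (accepted is not None and value in accepted):
            return True
    return False


# Made-up nodes for illustration; real nodes come from the .osm.pbf extract.
nodes = [
    {"id": 1, "tags": {"amenity": "cafe", "name": "Example Cafe"}},
    {"id": 2, "tags": {"highway": "traffic_signals"}},
    {"id": 3, "tags": {"highway": "bus_stop"}},
    {"id": 4, "tags": {}},
]

pois = [node for node in nodes if is_poi(node["tags"])]
print([node["id"] for node in pois])
```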
41. @MargrietGr
Some cleaning up with osmconvert
osmconvert netherlands.nodes.osm \
  --drop-ways --drop-author --drop-relations \
  --drop-versions > netherlands.poi.osm
Convert from OSM to GeoJSON format with ogr2ogr
ogr2ogr -f GeoJSON netherlands.poi.json \
  netherlands.poi.osm points
42. @MargrietGr
Upload to Cloudant with couchimport
export COUCH_URL="https://username:password@username.cloudant.com"
cat netherlands.poi.json | couchimport \
  --db poi-netherlands --type json --jsonpath "features.*"
https://github.com/glynnbird/couchimport
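The `--jsonpath "features.*"` option selects every entry of the GeoJSON `features` array, so each POI ends up as its own document. A small Python sketch of that selection follows; the sample GeoJSON is invented and much smaller than real ogr2ogr output, and nothing is uploaded here.

```python
import json

# Made-up, trimmed-down GeoJSON; real ogr2ogr output carries more properties.
geojson_text = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "properties": {"osm_id": "1", "name": "Example Cafe"},
     "geometry": {"type": "Point", "coordinates": [4.9, 52.37]}},
    {"type": "Feature",
     "properties": {"osm_id": "2", "name": "Example Stop"},
     "geometry": {"type": "Point", "coordinates": [5.12, 52.09]}}
  ]
}
"""

collection = json.loads(geojson_text)

# Same idea as the "features.*" path: one document per feature in the array.
docs = collection["features"]

for doc in docs:
    print(doc["properties"]["name"], doc["geometry"]["coordinates"])
```

Keeping the geometry inside each document means every POI carries its own coordinates once it is in the database.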
IBM Cloudant
51. @MargrietGr
posted:2016-08-01,2016-10-01
followers_count:3000 friends_count:3000
(weather OR sun OR sunny OR rain OR hail OR storm OR rainy OR drought OR flood OR hurricane OR tornado OR cold OR snow OR drizzle OR cloudy OR thunder OR lightning OR wind OR windy OR heatwave)
REST API docs:
https://new-console.ng.bluemix.net/docs/services/Twitter/twitter_rest_apis.html#rest_apis
Search for tweets
Select table
Use an existing service
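The search query on this slide is just a string, so it can be assembled from its parts. The sketch below only builds and URL-encodes that string; it does not call the Twitter service, and the function name `build_query` and its parameter names are assumptions made for the example.

```python
from urllib.parse import quote

# Weather-related keywords from the slide.
keywords = [
    "weather", "sun", "sunny", "rain", "hail", "storm", "rainy", "drought",
    "flood", "hurricane", "tornado", "cold", "snow", "drizzle", "cloudy",
    "thunder", "lightning", "wind", "windy", "heatwave",
]


def build_query(start, end, followers, friends, terms):
    """Join the date range, count filters and keywords into one query string."""
    return " ".join([
        "posted:{},{}".format(start, end),
        "followers_count:{}".format(followers),
        "friends_count:{}".format(friends),
        "(" + " OR ".join(terms) + ")",
    ])


query = build_query("2016-08-01", "2016-10-01", 3000, 3000, keywords)
print(query)

# URL-encoded form, in case the query is passed as a request parameter.
encoded = quote(query)
print(encoded)
```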
56. @MargrietGr
RDDs : Resilient Distributed Datasets
Data does not have to fit on a single machine
Data is separated into partitions
Creation of RDDs
Load an external dataset
Distribute a collection of objects
Transformations construct a new RDD from a previous one (lazy!)
Actions compute a result based on an RDD
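The difference between lazy transformations and actions can be shown with a toy model in plain Python. This is not Spark and not its API; `ToyRDD` is a made-up class that only mimics the behaviour described on this slide. In PySpark the corresponding calls would be `sc.parallelize`, `rdd.map`, `rdd.filter` and `rdd.collect`.

```python
class ToyRDD:
    """Toy stand-in for an RDD: partitioned data plus a chain of lazy steps."""

    def __init__(self, partitions, steps=()):
        self.partitions = partitions  # list of lists, one per partition
        self.steps = tuple(steps)     # recorded transformations, not yet run

    @classmethod
    def parallelize(cls, items, num_partitions):
        """Distribute a collection of objects over contiguous partitions."""
        items = list(items)
        size = max(1, -(-len(items) // num_partitions))  # ceiling division
        return cls([items[i:i + size] for i in range(0, len(items), size)])

    def map(self, fn):
        """Transformation: returns a new ToyRDD and computes nothing yet."""
        return ToyRDD(self.partitions, self.steps + (("map", fn),))

    def filter(self, fn):
        """Transformation: returns a new ToyRDD and computes nothing yet."""
        return ToyRDD(self.partitions, self.steps + (("filter", fn),))

    def _run(self, part):
        # Apply every recorded step, in order, to a single partition.
        for kind, fn in self.steps:
            if kind == "map":
                part = [fn(x) for x in part]
            else:
                part = [x for x in part if fn(x)]
        return part

    def collect(self):
        """Action: run the recorded steps on each partition and gather results."""
        return [x for part in self.partitions for x in self._run(part)]


calls = []  # records when the mapped function actually runs


def square(x):
    calls.append(x)
    return x * x


rdd = ToyRDD.parallelize(range(1, 7), 3)
evens = rdd.map(square).filter(lambda x: x % 2 == 0)

calls_before = len(calls)  # still zero: transformations are lazy
result = evens.collect()   # the action triggers the computation
print(calls_before, len(calls), result)
```

Each partition is processed independently in `collect`, which is the property that lets a real cluster spread the work over several machines.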
62. Getting started
▪ Go to datascience.ibm.com and sign in with your Bluemix account if you have one; otherwise, sign up at the top right of the screen
63. Create a project
▪ Click the Create New Project link at the top of the screen
▪ Or go to My Projects in the menu on the left of the screen and click Create New Project there
64. Create a project
▪ Name the Project
▪ Choose a Spark Service
▪ Choose an Object Storage
▪ Click Create
67. Add a notebook
▪ Click add notebooks
▪ Pick your favourite:
▪ Python 2
▪ Scala
▪ R
▪ Choose Spark 1.6 or 2.0
▪ Click Create Notebook
68. Let’s write some code
▪ Click the pen icon to start adding code (edit mode)
▪ When collaborating, only one person can edit; others can add comments to the notebook while in view mode