6. Opening Up Data
✴ Gather data from disparate sources
✴ Data dumps (SQL, fixed-width columns)
✴ Web scraping
✴ Text/PDF parsing
✴ Serving RESTful JSON APIs
Question? @LuigiMontanez
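One way the "fixed-width dump in, JSON out" path might look, as a minimal sketch. The column layout, field names, and sample record below are hypothetical and are not taken from any real data dump.

```python
import json

# Hypothetical layout: (field name, start offset, end offset) for each column.
LAYOUT = [("name", 0, 20), ("state", 20, 22), ("amount", 22, 30)]

def parse_fixed_width(line, layout=LAYOUT):
    """Slice one fixed-width record into a dict, trimming the padding."""
    return {field: line[start:end].strip() for field, start, end in layout}

# Build a sample record padded to the hypothetical column widths.
sample = "Jane Doe".ljust(20) + "MD" + "250".rjust(8)
record = parse_fixed_width(sample)

# The resulting dict serializes directly to the JSON an API could serve.
print(json.dumps(record))
```

Every value comes out as a string; converting `amount` to a number would be a separate cleaning step.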
7. JSON
✴ Tree structure, not tabular
✴ Still relational
✴ JSON for data, XML for documents
✴ Closely resembles native data structures
✴ No manual parsing needed
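The last two points can be illustrated with a short sketch in Python, where the standard `json` module turns a JSON document into ordinary dicts and lists. The sample document is made up for illustration.

```python
import json

# Made-up sample document, shaped like a small API response.
raw = '{"title": "Example story", "entities": [{"name": "Barack Obama", "type": "politician"}]}'

# One call yields native structures: objects become dicts, arrays become lists.
doc = json.loads(raw)
first = doc["entities"][0]

print(type(doc).__name__, type(doc["entities"]).__name__)
print(first["name"])
```

No hand-written parser is involved: nested values are reached with normal indexing.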
8. Three Projects
✴ Poligraft
✴ Real Time Congress API
✴ Open State Project
15.
{
  "title": "President Obama's climate 'Plan B' in hot water - Darren Samuelsohn - POLITICO.com"
}
16.
17.
{
  "title": "President Obama's climate 'Plan B' in hot water - Darren Samuelsohn - POLITICO.com",
  "slug": "EOsc",
  "source_url": "http://www.politico.com/news/stories/0810/40534.html",
  "content": "................."
}
18.
19.
20.
{
  "title": "President Obama's climate 'Plan B' in hot water - Darren Samuelsohn - POLITICO.com",
  "slug": "EOsc",
  "source_url": "http://www.politico.com/news/stories/0810/40534.html",
  "content": ".................",
  "entities": [...]
}
21.
{
  "title": "President Obama's climate 'Plan B' in hot water - Darren Samuelsohn - POLITICO.com",
  "slug": "EOsc",
  "source_url": "http://www.politico.com/news/stories/0810/40534.html",
  "content": ".................",
  "entities": [
    {
      "name": "Barack Obama",
      "type": "politician"
    },
    ...
  ]
}
22.
23.
{
  "title": "President Obama's climate 'Plan B' in hot water - Darren Samuelsohn - POLITICO.com",
  "slug": "EOsc",
  "source_url": "http://www.politico.com/news/stories/0810/40534.html",
  "content": ".................",
  "entities": [
    {
      "name": "Barack Obama",
      "type": "politician",
      "breakdown": {"indiv": "33", "pac": "67"},
      "top_industries": ["Lawyers/Lobbyists", "Finance/Insurance/Real Estate", "Misc. Business"]
    },
    ...
  ]
}
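A client might consume a response shaped like the one above roughly as follows. This is a sketch only: the `summarize` helper is hypothetical, the elided `content` and `source_url` fields are left out, and the sample keeps just the one entity shown on the slide.

```python
import json

# Trimmed copy of the slide's document: elided fields and entities are omitted.
response_text = """
{
  "title": "President Obama's climate 'Plan B' in hot water - Darren Samuelsohn - POLITICO.com",
  "slug": "EOsc",
  "entities": [
    {
      "name": "Barack Obama",
      "type": "politician",
      "breakdown": {"indiv": "33", "pac": "67"},
      "top_industries": ["Lawyers/Lobbyists", "Finance/Insurance/Real Estate", "Misc. Business"]
    }
  ]
}
"""

def summarize(doc):
    """Hypothetical helper: one summary line per entity in the document."""
    lines = []
    for entity in doc.get("entities", []):
        breakdown = entity.get("breakdown", {})
        # The slide encodes these numbers as strings, so convert before use.
        indiv = int(breakdown.get("indiv", 0))
        pac = int(breakdown.get("pac", 0))
        top = entity.get("top_industries", [])
        first_industry = top[0] if top else "n/a"
        lines.append(
            f"{entity['name']} ({entity['type']}): "
            f"indiv={indiv}, pac={pac}, top industry: {first_industry}"
        )
    return lines

summary = summarize(json.loads(response_text))
print("\n".join(summary))
```

Using `.get` with defaults lets the same loop handle entities that carry only `name` and `type`, as in the earlier, smaller version of the document.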
40. Custom Fields
✴ Traditional RDBMS
  ✴ Update the schema for new fields, run a migration, feel icky
  ✴ Create a custom_fields table
✴ MongoDB
  ✴ Just store it
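The contrast can be sketched in Python. The relational half uses the standard `sqlite3` module with a `custom_fields` table as named on the slide; the `contacts` table and all sample values are hypothetical. The document half uses a plain list of dicts as a stand-in for a collection. It is not MongoDB's API, only an illustration of the "just store it" idea.

```python
import sqlite3

# Relational option: a separate custom_fields table, one row per extra field.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE custom_fields ("
    "contact_id INTEGER REFERENCES contacts(id), key TEXT, value TEXT)"
)
conn.execute("INSERT INTO contacts (id, name) VALUES (1, 'Jane Doe')")
conn.execute(
    "INSERT INTO custom_fields (contact_id, key, value) "
    "VALUES (1, 'twitter', '@janedoe')"
)
# Reading the custom field back requires a join.
row = conn.execute(
    "SELECT c.name, f.value FROM contacts c "
    "JOIN custom_fields f ON f.contact_id = c.id WHERE f.key = 'twitter'"
).fetchone()

# Document option: the extra field simply lives on the record itself.
collection = []  # stand-in for a document collection
collection.append({"name": "Jane Doe", "twitter": "@janedoe"})
collection.append({"name": "John Roe"})  # no custom field, no schema change
with_twitter = [doc["name"] for doc in collection if "twitter" in doc]

print(row, with_twitter)
```

The relational version needs an extra table and a join for every custom field lookup, while the document version stores records of different shapes side by side.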